
8/22/2011 10:45 AM Chapter 1 Introduction

1.1 A model for problem solving

Definition: Statistics is the science of (or a collection of techniques for)
  o Collecting (sampling, census)
  o Classifying (descriptive statistics)
  o Analyzing (e.g., regression analysis)
  o Generalizing (statistical inference)
a set of data for a special purpose.

Each of these activities is based on probability. They are carried out to solve problems observed by decision-makers. The following are important steps in the decision-making process:
1. Specify your goal(s) by clearly stating the problem or question.
2. Collect and analyze data.
3. Interpret the findings of your analyses and make a decision.
4. Implement the decision and verify that it is the right approach to solving the problem stated in step 1.
5. Plan the next action.

When we have census data relevant to the problem at hand, the decision-making process is relatively easy. However, in many real-life problems we do not have (recent) census data. In such a case we collect data from a random sample of population units. Then we make our decision about one or more characteristics of the population, based on the information in the sample data. This process is called statistical inference and it is the main subject of this course (Chapters 8 to 12). The necessary tools for statistical inference are developed in the first seven chapters of your text.

We will emphasize making inferences about one or more population parameters based on data from random samples. This process is a systematic approach to decision making.

Some new terms need to be defined:

Population: A set of well-defined units (objects or outcomes) about which information is sought.
Sample: A subset of the population, containing objects or outcomes that are actually observed.
Random sample: A sample selected according to some rules of probability.
  o Simple Random Sample (SRS): A sample of size n, selected in such a way that every sample of size n (from the population of size N) has an equal chance of being the selected sample. As a result of this property every element in the population has an equal chance (n/N) of being in the random sample.

STA3032 Chapter 1, Page 1 of 10

  o SRS selected with replacement: Some population units may appear more than once. This is used in theoretical studies.
  o SRS selected without replacement: Any population unit may appear in the sample at most once. This method is used in real-life problems.
  o Although the two selection methods are different, the difference becomes negligible when the population size (N) is extremely large relative to the sample size (n).
  o In this course, whenever we talk about a sample we mean an SRS selected with replacement.
  o An SRS selected with replacement gives independent observations, i.e., knowing the value of any one element in the sample does not help in predicting the values of the other elements.

More concepts will be defined as we move along.

Other sampling methods (such as stratified random sampling, cluster sampling, multi-stage sampling, etc.) are used in real-life problems because they are usually more efficient. When they are used, the formulas given in this course need to be modified. Such sampling methods will not be covered in this course. [A detailed discussion of different sampling procedures is given in Chapter 3. You may take STA4222 Sampling and Survey Design if you are interested in more details.]
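The two selection methods can be sketched in Python as follows. This is only an illustration: the population of 100 unit labels, the sample size and the seed are assumptions made for the example, not values from the course.

```python
import random

def srs_without_replacement(population, n, rng):
    """Each unit can appear at most once in the sample."""
    return rng.sample(population, n)

def srs_with_replacement(population, n, rng):
    """Units are drawn independently, so repeats are possible."""
    return rng.choices(population, k=n)

rng = random.Random(3032)            # fixed seed (arbitrary) for reproducibility
population = list(range(1, 101))     # hypothetical population labels, N = 100
n = 10

without = srs_without_replacement(population, n, rng)
with_repl = srs_with_replacement(population, n, rng)

# Chance that any given unit is in an SRS drawn without replacement: n/N
inclusion_chance = n / len(population)

print("without replacement:", sorted(without))
print("with replacement:   ", sorted(with_repl))
print("inclusion chance n/N:", inclusion_chance)
```

When N is very large relative to n, repeats in the with-replacement sample become rare, which is why the two methods behave almost identically in that case.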

1.2 Types of Experiments and Tabulation of Data

Identification of the type of data in a study is extremely important because all summary techniques we will see in this chapter, and the inference techniques we will see later, depend on the type of data. The techniques suitable for one type of data cannot be used for the other type.

Types of Data
  o Quantitative (Numerical): Data obtained as a result of some measurement or counting process applied to population (or sample) elements. The results can be used in arithmetic operations.
  o Qualitative (Categorical): Data obtained as a result of observations on some characteristic of the population (or sample) elements. An element either belongs to a category or does not belong to that category. Such data cannot be used in arithmetic operations.

Tabulation of Data

A table is a summary of the data which tells (almost) everything about the data. Every table must have a title and a source of data (if you did not collect them yourself). We also specify the characteristic (random variable) being summarized, with its unit of measurement when appropriate, and list the values of the random variable in the first column. In the second column we report the number of observations (f = frequency) that have the specified value(s). Alternatively, we may put the relative frequency (rf = f/n) or the percentage (rf × 100) in each row of the table. Such tables are called frequency tables.


For quantitative data, if the number of observations is large, we use (usually equal) intervals. Here are some examples (1).

Table 1. Distribution of STA6125 Students by Age

Age (in years)       Number of Students
17.5 up to 22.5        3
22.5 up to 27.5       32
27.5 up to 32.5       14
32.5 up to 37.5        4
37.5 up to 42.5        3
42.5 up to 47.5        1
47.5 up to 52.5        2
52.5 up to 57.5        0
57.5 up to 62.5        0
62.5 up to 67.5        0
67.5 up to 72.5        1
Total                 60

In some cases accumulating the frequencies or relative frequencies may be useful.
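As a small sketch of what accumulating the frequencies looks like, the following Python snippet builds cumulative frequencies and cumulative relative frequencies from the Table 1 counts. The variable names are illustrative only.

```python
from itertools import accumulate

# Class intervals (lower limit, upper limit) and frequencies from Table 1.
intervals = [(17.5 + 5 * i, 22.5 + 5 * i) for i in range(11)]
frequencies = [3, 32, 14, 4, 3, 1, 2, 0, 0, 0, 1]

cum_freq = list(accumulate(frequencies))   # running totals of f
n = cum_freq[-1]                           # total number of observations
cum_rel_freq = [c / n for c in cum_freq]   # running totals of rf = f/n

for (lo, hi), f, cf, crf in zip(intervals, frequencies, cum_freq, cum_rel_freq):
    print(f"{lo:4.1f} up to {hi:4.1f}  f={f:2d}  cum f={cf:2d}  cum rf={crf:.3f}")
```

Each cumulative value answers a question of the form "how many (or what share of) students are at most this old", which a plain frequency column does not show directly.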

When we have categorical data, the values of the characteristic are simply the names of the categories, and we list them as shown below.

Table 2. Distribution of STA3032 Students by Their Year of Study at UF

Year at UF     Number of Students     Percent of Students
Freshman         24                     10.9
Sophomore        98                     44.5
Junior           76                     34.5
Senior           22                     10.0
Total           220                    100.0
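The percent column of Table 2 can be reproduced from the counts with rf = f/n. A minimal sketch (the dictionary layout is just one convenient choice):

```python
# Frequencies (f) by category, taken from Table 2.
counts = {"Freshman": 24, "Sophomore": 98, "Junior": 76, "Senior": 22}

n = sum(counts.values())                                   # total number of students
rel_freq = {year: f / n for year, f in counts.items()}     # rf = f / n
percent = {year: round(100 * rf, 1) for year, rf in rel_freq.items()}

for year in counts:
    print(f"{year:<10} f={counts[year]:3d}  rf={rel_freq[year]:.3f}  %={percent[year]}")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100.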

(1) Your book defines these intervals as 17.5 < age ≤ 22.5, 22.5 < age ≤ 27.5, etc. We prefer to use "up to" for the same purpose. If both ends of the interval are included we use "to" or a simple dash "-".


1.3 Graphical Summaries of Data

1.3.1 Graphs for Categorical Data

We can summarize categorical data graphically by using a pie chart or a bar graph:

[Figure: pie chart, "Year at UF of STA3032 Students, Spring 2011". Slices: Freshman 10.9%, Sophomore 44.5%, Junior 34.5%, Senior 10.0%.]

Slices are proportional to the frequency or percentage in each category.

[Figure: bar graph, "A Bar Graph of the Number of Malfunctions in Distillation Towers". Vertical axis: Percent of Towers. Horizontal axis: Cause of Malfunctioning (Coking, Scale, Precipitation, Solids, Polymer). Source: Scheaffer et al., 2011.]

Note that the bars must have equal bases and should not touch each other. The height of each bar is proportional to the frequency or percentage of the number of observations in each category. A special case of the bar graph is called a Pareto Chart, where the categories are ordered by the frequency of each category. It is a very useful graph when one wants to identify or emphasize the order of importance of each category.


[Figure: Pareto chart, "A Pareto Chart of the Number of Malfunctions in Distillation Towers". Vertical axis: Percent of Towers. Horizontal axis: Cause of Malfunctioning (Scale, Solids, Coking, Precipitation, Polymer). Source: Scheaffer et al., 2011, page 21, Problem 1.10.]
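The ordering step behind a Pareto chart can be sketched as below. The tower percentages are not listed in these notes, so the sketch reuses the counts from Table 2 purely as stand-in data; the cumulative-percent column is a common companion to Pareto charts and is an addition here, not something the notes require.

```python
from itertools import accumulate

# Stand-in data: counts from Table 2 (the tower data are not tabulated in the notes).
counts = {"Freshman": 24, "Sophomore": 98, "Junior": 76, "Senior": 22}

# Pareto ordering: categories sorted from most to least frequent.
ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
categories = [name for name, _ in ordered]
freqs = [f for _, f in ordered]

n = sum(freqs)
cum_percent = [round(100 * c / n, 1) for c in accumulate(freqs)]

for name, f, cp in zip(categories, freqs, cum_percent):
    print(f"{name:<10} f={f:3d}  cumulative %={cp}")
```

Drawing the bars in this order is all that distinguishes a Pareto chart from an ordinary bar graph.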

1.3.2 Graphical Summaries of Quantitative Data

The most frequently used graphs to summarize quantitative data are the dot plot (used when the sample size is small) and the histogram. Here is an example of each.
[Figure: dot plot, "AQI Exceedances in 15 Metropolitan Areas". Horizontal axis: Number of AQI Exceedances.]


[Figure 3. Distribution of STA6125 Students by Age (histogram). Vertical axis: Number of Students. Horizontal axis: Ages of Students.]

Source: See Table 1.

Note that the bars have equal bases when the intervals are equal. The bars must touch each other (except those with zero frequency). The heights of the bars are proportional to the frequencies.

1.4 Numerical Summaries of Quantitative Sample Data

Measures of location (center)
  o Sample Mean: Used as a measure of location or the center of the data; denoted by $\bar{X}$.
  o Sample Median: A number that divides the ordered sample data into two equal parts. Hence at most 50% of the observations have values below the median and at most 50% have values above the median.
  o Sample Quartiles: Three numbers, denoted by Q1, Q2 and Q3, that divide the ordered sample data into 4 equal parts. Note that Q2 is the sample median, Q1 (also called the lower quartile) is the median of the observations that have values less than the median, and Q3 (also called the upper quartile) is the median of all observations that have values above the median.
  o Percentiles: The 3 quartiles Q1, Q2 and Q3 are also called the 25th, 50th and 75th percentiles, respectively, because 25% of all observations are below the 25th percentile, 50% are below the 50th percentile and 75% are below the 75th percentile. In general, for some $\alpha$ between zero and 100, $0 < \alpha < 100$, the $\alpha$th percentile of sample data is a number below which $\alpha$% of all sample data are observed.
  o Sample Mode: The most frequently observed value in the sample. This gives a quick and easy way of locating the center but is not much used.
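The measures of location above can be sketched in Python. The eight data values are hypothetical, and the quartile function follows the "median of each half" description given in these notes; statistical packages such as Minitab use interpolation rules that can give slightly different quartiles.

```python
import statistics

def quartiles(data):
    """Q1, Q2, Q3 using the 'median of the lower/upper half' convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                                 # observations below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]     # skip the middle value when n is odd
    return statistics.median(lower), statistics.median(xs), statistics.median(upper)

sample = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical data, for illustration only

mean = statistics.mean(sample)
q1, q2, q3 = quartiles(sample)
mode = statistics.mode(sample)

print("mean:", mean)
print("Q1, median, Q3:", q1, q2, q3)
print("mode:", mode)
```

For this sample the mean (6.625) is slightly above the median (6.0), which is what one expects when the larger values are more spread out than the smaller ones.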


Measures of Variation (Dispersion, Scatter)
  o Sample Standard Deviation: The most frequently used measure of dispersion (scatter, variability) of the data; denoted by $S$. It is the (positive) square root of the sample variance.
  o Sample Variance: The square of the sample standard deviation; denoted by $S^2$.
  o Sample Range: The difference between the largest and the smallest observed values. It gives a quick way of getting some information about the dispersion (scatter) of the data.
  o Inter-quartile Range: IQR = Q3 − Q1, a measure of the dispersion in the middle 50% of the sample data.
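A minimal sketch of these measures on a small hypothetical sample follows. It assumes the usual definition of the sample variance with divisor n − 1, which these notes do not spell out, and the "median of each half" convention for the quartiles.

```python
import statistics

sample = [2, 4, 4, 5, 7, 9, 10, 12]   # hypothetical data, for illustration only
xs = sorted(sample)
n = len(xs)
xbar = statistics.mean(xs)

# Sample variance with the n - 1 divisor, and its positive square root.
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
s = s2 ** 0.5

# Range: largest minus smallest observation.
sample_range = xs[-1] - xs[0]

# IQR = Q3 - Q1, with quartiles taken as medians of the lower and upper halves.
q1 = statistics.median(xs[: n // 2])
q3 = statistics.median(xs[(n + 1) // 2 :])
iqr = q3 - q1

print(f"S^2 = {s2:.3f}, S = {s:.3f}, range = {sample_range}, IQR = {iqr}")
```

The hand-computed values agree with `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor.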

Numerical Summaries of Categorical Sample Data

Sample data will be denoted by $X_1, X_2, \ldots, X_n$, where $X_i = 1$ if the $i$th sample element has the characteristic of interest (belongs to a certain category of interest) and $X_i = 0$ otherwise, when there are only two categories.

Sample frequency: $Y = \sum_{i=1}^{n} X_i$ = number of sample elements that belong to the category of interest.

Sample proportion: $\hat{p} = \sum_{i=1}^{n} X_i / n = Y/n$ = proportion of sample elements that belong to the category of interest = relative frequency.
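The 0/1 coding can be illustrated with the Table 2 counts, taking "Sophomore" as the category of interest. The choice of category and the way the individual records are rebuilt from the counts are assumptions made only for this sketch.

```python
# Rebuild one record per student from the Table 2 counts.
counts = {"Freshman": 24, "Sophomore": 98, "Junior": 76, "Senior": 22}
years = [year for year, f in counts.items() for _ in range(f)]

# X_i = 1 if the i-th student is in the category of interest, 0 otherwise.
category = "Sophomore"
x = [1 if year == category else 0 for year in years]

n = len(x)
y = sum(x)          # sample frequency Y
p_hat = y / n       # sample proportion = relative frequency

print(f"n = {n}, Y = {y}, sample proportion = {p_hat:.3f}")
```

The sample proportion is simply the sample mean of the 0/1 values, which is why results about means carry over to proportions.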

1.4.4 Reading Computer Output

Here is a computer output from Minitab:

Descriptive Statistics: AQI

Variable    N    Mean   StDev   Minimum     Q1   Median      Q3   Maximum
AQI        15   13.47   10.58      1.00   5.00    11.00   18.00     36.00

Make sure that you understand what each number in the above output tells you.

1.4.5 Effects of Shifting and Scaling: Two important rules
  o If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and, for some constants $a$ and $b$, $Y_i = a + bX_i$, then $\bar{Y} = a + b\bar{X}$.
  o If $X_1, X_2, \ldots, X_n$ is a sample of measurements on a quantitative variable and, for some constants $a$ and $b$, $Y_i = a + bX_i$, then $S_Y^2 = b^2 S_X^2$ and $S_Y = |b| S_X$.
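Both rules can be checked numerically. The sketch below uses hypothetical measurements and two arbitrary choices of (a, b), one of them with a negative b to show why the absolute value is needed in the rule for the standard deviation.

```python
import statistics

def linear_transform(xs, a, b):
    """Return Y_i = a + b * X_i for every observation."""
    return [a + b * x for x in xs]

x = [10.0, 15.0, 20.0, 25.0, 30.0]   # hypothetical measurements
x_mean, x_sd = statistics.mean(x), statistics.stdev(x)

for a, b in [(32.0, 1.8), (100.0, -2.0)]:   # arbitrary shift/scale constants
    y = linear_transform(x, a, b)
    y_mean, y_sd = statistics.mean(y), statistics.stdev(y)
    print(f"a={a}, b={b}: mean(Y)={y_mean:.3f} vs a+b*mean(X)={a + b * x_mean:.3f}; "
          f"sd(Y)={y_sd:.3f} vs |b|*sd(X)={abs(b) * x_sd:.3f}")
```

Note that the shift a moves the mean but has no effect on the standard deviation.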


1.5 Summary Measures and Decision

The graphs and summary statistics of sample data give us an idea about the location (center), variation (dispersion or scatter) and shape of the sample data, and hence of the population data. We will be talking about symmetric, left-skewed or right-skewed distributions. [What can you say about the location, dispersion and shape of the ages of students in STA6125, by looking at the above histogram?]

1.5.1 The Empirical Rule

There is a very powerful rule that makes use of the sample mean and sample standard deviation to give information on location, dispersion and the shape of the distribution:

Empirical rule: IF the distribution of a population is mound-shaped, THEN
a) Approximately 68% of all population measurements are within one standard deviation of the population mean, i.e., between $\mu - \sigma$ and $\mu + \sigma$.
b) Approximately 95% of all population measurements are within two standard deviations of the population mean, i.e., between $\mu - 2\sigma$ and $\mu + 2\sigma$.
c) Almost all (99.7%) of the population measurements are within three standard deviations of the population mean, i.e., between $\mu - 3\sigma$ and $\mu + 3\sigma$.

1.5.2 Standardized Values (Z Scores)

A z-score of an observed value of a quantitative variable gives the distance between that observation and the mean of the data in standard deviation units. The z-score is defined as

$$Z = \frac{\text{Measurement} - \text{Mean}}{\text{Standard Deviation}}.$$
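A short sketch of the z-score definition and its link to the empirical rule: an observation is within k standard deviations of the mean exactly when its z-score is between −k and k. The sketch assumes a test with mean 70 and standard deviation 10, and it takes "mound-shaped" to mean a normal distribution, which is an assumption of this example rather than a statement from the notes.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Distance of x from the mean, in standard deviation units."""
    return (x - mean) / sd

def from_z(z, mean, sd):
    """Invert the z-score: recover the measurement from z."""
    return mean + z * sd

z = z_score(80, mean=70, sd=10)         # a grade of 80 on the assumed test
grade = from_z(-1.5, mean=70, sd=10)    # grade that sits 1.5 SDs below the mean

# Share of a normal population with |z| <= k, for k = 1, 2, 3.
std_normal = NormalDist()               # mean 0, standard deviation 1
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print("z-score of 80:", z)
print("grade with z = -1.5:", grade)
for k, share in coverage.items():
    print(f"within {k} SD: {100 * share:.1f}%")
```

Under the normal assumption the three shares come out close to the 68%, 95% and 99.7% quoted in the rule.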

The z-score of an observation tells you how many standard deviations that observation is above (z > 0) or below (z < 0) the mean of the random variable.

Example: Suppose a test you took has a mean of 70 and a standard deviation of 10. If your grade on that test is 80, then the z-score of your grade is z = (80 − 70)/10 = 1. Hence your grade is one standard deviation above the mean of the class. If your friend's z-score is 0.8, you can say that you did better than your friend. [Can you work back and find another friend's grade when its z-score is 0.5?]

1.5.3 Box-and-Whisker Plots

This is another graphical summary of data that shows the location, dispersion and shape of the distribution of the data. It uses what is known as the 5-number summary, where

5-Number Summary = (Min, Q1, Q2, Q3, Max)

These five numbers will summarize your data whether you have a small sample of, say, 20 observations, a large sample of 2000 observations, or a population of infinitely many observations.


Once you have these 5 numbers you may show them graphically in a box-and-whisker plot (or simply a box plot).

For the data set of the ages of graduate students we get the following output using Minitab:

Variable    N    Mean   StDev   Minimum      Q1   Median      Q3   Maximum
age        60   29.17    8.48     22.00   24.00    26.50   31.00     71.00

Using the 5-Number Summary = (22, 24, 26.5, 31, 71), we can now draw (or ask Minitab to draw) a box-and-whisker plot of the data set (Figure 1).
[Figure 1. Boxplot of Ages. Vertical axis: Ages (in years).]

[Figure 2. Boxplot of Ages. Vertical axis: Age (in years).]

1.5.4 Detecting outliers

An observation is said to be an outlier if its value is less than Q1 − 1.5 × IQR or larger than Q3 + 1.5 × IQR.

It is obvious that 71 is an outlier (71 > Q3 + 1.5 × IQR). There are other potential outliers and these are shown by * in the above graphs. After checking that observation and seeing no error, we may delete it. Notice how the results change:

Variable    N     Mean   StDev   Minimum       Q1   Median       Q3   Maximum
age        59   28.458   6.511    22.000   24.000   26.000   31.000    50.000
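The outlier rule can be applied directly to the quartiles reported in the Minitab output for the ages (Q1 = 24, Q3 = 31). The function names below are illustrative.

```python
def outlier_fences(q1, q3, k=1.5):
    """Lower and upper cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3, k=1.5):
    """True if x falls below the lower fence or above the upper fence."""
    lower, upper = outlier_fences(q1, q3, k)
    return x < lower or x > upper

q1, q3 = 24.0, 31.0                     # quartiles from the Minitab output for age
lower, upper = outlier_fences(q1, q3)   # IQR = 7, so the fences are 13.5 and 41.5

print("fences:", lower, upper)
for age in (22, 50, 71):
    print(age, "outlier" if is_outlier(age, q1, q3) else "not an outlier")
```

With these fences any age above 41.5 is flagged, so 71 is flagged and so is an age of 50, while the minimum age of 22 is not.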

The corresponding box-and-whisker plot is given in Figure 2.

Interpretation of the box-and-whisker plot: We observe that the data are located around 29 years but are highly scattered. In both graphs we observe that the distribution of ages is skewed (more so in Figure 1). We also note in both graphs that the median is not in the middle of the box and the whiskers are not equal, both indicating a skewed distribution. In Figure 2 we see that there are at least two more outliers (two students aged 50).

Read the Summary of the Chapter in 1.6 and solve at least every third question in the supplementary exercises.

Skip Chapter 2 for now. We will come back to it when we cover Chapter 11.

Chapter 3 is a very detailed summary of different sampling techniques. A course in sampling and survey design can teach you more on both selection and estimation. We will skip it. You may want to read it later when you need to carry out experiments or collect survey data.

Start reading Chapter 4.

