
Chapter One – Introduction to Statistics

1.1 What is Statistics?

Statistics – The science of designing studies and analyzing the data that those studies produce. Statistics is the science of learning from data.

Example – Predicting an Election Using an Exit Poll

1.2 We Learn about Populations using Samples

Population – The total set of subjects in which we are interested.
Ex. = the entire voting public

Sample – A subset of the population for whom we have data.
Ex. = 200 randomly selected voters

Subject – entities that we measure in a study.
Ex. = each voter in the sample

Parameter – A numerical value summarizing the population data.
Ex. = proportion of voters voting for candidate A in the entire population

Statistic – A numerical value summarizing the sample data.
Ex. = proportion of voters voting for candidate A in our sample (the 200 randomly selected voters)
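The difference between a parameter and a statistic can be illustrated with a short simulation. This is only a sketch with made-up data: the population of 10,000 voters, the 52% support for candidate A, and the fixed seed are assumptions chosen for illustration and do not come from the exit-poll example.

```python
import random

random.seed(1)  # fixed seed (an assumption) so the run is repeatable

# Hypothetical population: 10,000 voters, 5,200 of whom vote for candidate A.
population = ["A"] * 5200 + ["other"] * 4800

# Parameter: the proportion voting for A in the entire population.
p = population.count("A") / len(population)

# Sample: 200 randomly selected voters (the subjects we actually measure).
sample = random.sample(population, 200)

# Statistic: the proportion voting for A in the sample.
sample_proportion = sample.count("A") / len(sample)

print(f"parameter (population proportion): {p:.3f}")
print(f"statistic (sample proportion):     {sample_proportion:.3f}")
```

In real studies the parameter is usually unknown; it is computable here only because the population was invented for the sketch.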

Example: A college dean is interested in learning about the average age of faculty at the college. The dean takes a random sample of 30 faculty members and averages their 30 ages.

Match the following:
A. population
B. sample
C. subject
D. parameter
E. statistic
__________________________________________

____ the average age of all faculty members at the college

____ 30 randomly selected faculty members at the college

____ a single faculty member from the sample

____ all faculty members at the college

____ the average age of the 30 randomly selected faculty members at the college

Symbols we use in Statistics

In the previous example, we were interested in the average age of all faculty members at the University of Georgia.
Whenever we are interested in an average for a full population, there is a lower-case Greek symbol we use to denote this value, μ. It is pronounced “mu”.

So in this previous example, we would say
μ = average age for all faculty members at UGA

Now, most of the time in real life, we will not be able to actually calculate this value, so we try for the next best thing. Like in the previous example, instead of trying to find every single faculty member, we are okay with just getting an average for 30 randomly selected faculty members and use that as our estimate.

Whenever we have an average calculated from a sample, like in this case, from 30 randomly selected faculty members, there is a symbol we use for that average from the sample, x̄.

So in this example, x̄ = the average age for 30 randomly selected faculty members.

μ represents an average calculated from a full population. x̄ represents an average calculated from a sample.
More Symbols

In the election example earlier, we were interested in studying proportions, not averages. So when we are studying proportions, there are other symbols we use.

If we have the proportion for an entire population, like the proportion of voters voting for candidate A in the entire population, we use the letter “p” to denote this population proportion.

However, if we have the proportion for just a sample, like the proportion of voters voting for candidate A in our sample (the 200 randomly selected voters), we use the symbol p̂ (pronounced “p-hat”) to denote this sample proportion.

p represents a proportion calculated from a full population. p̂ represents a proportion calculated from a sample.

Aspects of Statistics

1.) Design – How to obtain the data to answer questions of interest.
Ex. use a survey, set up an experiment

2.) Description – Summarizing the obtained data. Describe the sample data.
Ex. bar graph

3.) Inference – Making decisions and predictions based on the sample data. Predict using the sample data.
Ex. Based on our sample, the majority of people in our population want the Saints to win the Super Bowl.

Sampling Methods (from Chapter 4)

Sampling – obtaining subjects that are representative of the population to participate in a certain study so that accurate information about the population can be obtained.

Simple Random Sampling – every subject has an equally likely chance of being selected for the sample.

When performing simple random sampling, samples are usually chosen using a random number table.

Example of Simple Random Sampling:

The following table lists the 44 presidents of the United States.

Obtain a simple random sample of size five using the table on page 157, which is a random number table. We will use row 1.
HW 4.2-4.3 Simple Random Sampling Problem

A randomized experiment investigates whether an herbal treatment is better than a placebo in treating subjects suffering from depression. Unknown to the researchers, the herbal treatment has no effect: Subjects have the same score on a rating scale for depression (for which higher scores represent worse depression) no matter which treatment they take.

(a) The study will use eight subjects numbered 1 to 8. Using random numbers, pick the four subjects who will take the herbal treatment. (Use the first row and first column in the table.)

Line/Col   (1)     (2)     (3)
1          10480   15011   01536
2          22368   46573   25595
3          24130   48360   22527
4          42167   93093   06243

Identify the four who will take the herbal treatment. (List in numerical order.)
____ , ____ , ____ , ____

What are different ways to sample?

Types of Sampling:

1. Simple Random Sampling – every subject has an equally likely chance of being selected for the sample. Usually, samples are chosen using a random number table.

2. Stratified Sampling – the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group.

3. Cluster Sampling – the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.

4. Convenience Sampling – sampling where the individuals are easily obtained. Internet surveys are convenience samples. Studies that use convenience sampling generally have results that are suspect.

5. Systematic Sampling – selecting every kth subject from the population.

The difference between stratified and cluster sampling is that stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.
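For the herbal-treatment homework problem, reading a random number table can be sketched in a few lines of Python. The reading convention is an assumption, since tables can be read in several ways: here we read row 1 one digit at a time from the first column, skip digits that are not subject numbers (0 and 9), and skip repeats. The function name `pick_subjects` is hypothetical.

```python
# Row 1 of the random number table, columns (1) to (3).
row_1 = ["10480", "15011", "01536"]

def pick_subjects(groups, how_many, lowest=1, highest=8):
    """Read single digits left to right, keeping each new digit in range."""
    chosen = []
    for digit_char in "".join(groups):
        digit = int(digit_char)
        # Skip digits outside 1..8 and digits already selected.
        if lowest <= digit <= highest and digit not in chosen:
            chosen.append(digit)
        if len(chosen) == how_many:
            break
    return sorted(chosen)  # list in numerical order

herbal_group = pick_subjects(row_1, 4)
print(herbal_group)
```

Under this convention the digits 1, 0, 4, 8, 0, 1, 5 are read in turn, so the subjects kept are 1, 4, 8 and 5.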

Example: Identify the type of sampling used below.

In order to determine the average IQ of ninth-grade students, a school psychologist obtains a list of all high schools in the local public school system. She randomly selects five of these schools and administers an IQ test to all ninth-grade students at the selected schools.
________________________________________________
A member of Congress wishes to determine her county’s opinion regarding estate taxes. She divides her county into three income classes: low-income households, middle-income households, and upper-income households. She then takes a random sample of households from each income class.
________________________________________________
A radio station asks its listeners to call in their opinion regarding the use of American forces in peacekeeping missions.
________________________________________________
In an effort to identify whether an advertising campaign has been effective, a marketing firm conducts a nationwide poll by randomly selecting individuals from a list of known users of the product.
________________________________________________
A lobby has a list of the 100 senators of the U.S. In order to determine the Senate’s position regarding farm subsidies, they decide to talk with every seventh senator on the list starting with the third.
________________________________________________

Chapter Two – Exploring and Summarizing Data

Variable – Characteristic that we are studying

2.1 What are the Types of Data?

Two Kinds of Variables:

1.) Categorical – Classifies subjects based on some attribute or characteristic. Each observation belongs to a set of categories.

Ex – A person could live in a ‘house’, ‘condo’, ‘apartment’, ‘dormitory’, etc.

2.) Quantitative – Provides numerical measures of subjects. The variable takes on numerical values.

Ex – A person can be 56 inches in height, weigh 132 pounds or get a 92 on a test.
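The senator scenario above (every seventh senator on the list, starting with the third) is an instance of selecting every kth subject. A minimal Python sketch, assuming the senators are simply numbered 1 to 100 by their position on the list:

```python
senators = list(range(1, 101))   # positions 1..100 on the list (an assumption)

start, k = 3, 7                  # start with the third, then take every seventh

# Slicing from index start - 1 with step k selects positions 3, 10, 17, ...
selected = senators[start - 1::k]

print(selected)
print(len(selected), "senators selected")
```

The starting point and the step are the only choices made; every later selection follows from them.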
Two Kinds of Quantitative Variables:

1.) Discrete – a countable number of values.
EX – The number of people in this class, the number of words on this page.

2.) Continuous – an uncountable number of values. Continuous variables are usually variables that can take on all values on an interval.
EX – Height, Weight, Temperature

            categorical
           /
variable
           \               discrete
            \             /
             quantitative
                          \
                           continuous

Example: Identify each of the following as categorical or quantitative variables. If quantitative, identify further as discrete or continuous.

1. The length of time until a pain reliever begins to work.

2. The colors used in a statistics textbook.

3. The number of files on a hard drive.

4. The number of staples in a stapler.

Frequency Tables

Frequency – number of occurrences.

Frequency table – Lists the number of observations for each category of data.

Ex – Sample taken of 30 consumers’ favorite type of cookie

Oreo             Sugar            Peanut Butter    Chocolate Chip   Chocolate Chip
Chocolate Chip   Oreo             Chocolate Chip   Oatmeal Raisin   Chocolate Chip
Oatmeal Raisin   Chocolate Chip   Chocolate Chip   Sugar            Brownie
Sugar            Oatmeal Raisin   Oreo             Brownie          Peanut Butter
Peanut Butter    Oatmeal Raisin   Chocolate Chip   Oreo             Oatmeal Raisin
Brownie          Peanut Butter    Brownie          Chocolate Chip   Oreo

Favorite Cookie    Frequency
Oreo
Chocolate Chip
Oatmeal Raisin
Sugar
Peanut Butter
Brownie

Instead of frequency, sometimes we are interested in the proportion or percentage of observations within a certain category.

Ex. Calculate the proportion of people that picked Chocolate Chip cookies.

Ex. Calculate the percentage of people that picked Peanut Butter cookies.

2.2 How can we describe data using graphs?

Bar Graphs – graphs constructed by putting the categories on the horizontal axis and the frequency or proportion on the vertical axis. The height of the rectangle for each category is equal to the category’s frequency or proportion.

[Figure: bar graph “Favorite Cookie”; horizontal axis: Cookie Type; vertical axis: Frequency. Bar heights: Oreo 5, Chocolate Chip 9, Oatmeal Raisin 5, Sugar 3, Peanut Butter 4, Brownie 4.]

[Figure: bar graph “Favorite Cookie”; horizontal axis: Cookie Type. Bar heights (proportions): Oreo 0.17, Chocolate Chip 0.30, Oatmeal Raisin 0.17, Sugar 0.10, Peanut Butter 0.13, Brownie 0.13.]
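The cookie frequency table, together with the proportions and percentages, can be tallied with Python's standard library. This is a sketch rather than part of the course materials; the variable names are illustrative.

```python
from collections import Counter

# The 30 responses from the cookie example, row by row.
responses = [
    "Oreo", "Sugar", "Peanut Butter", "Chocolate Chip", "Chocolate Chip",
    "Chocolate Chip", "Oreo", "Chocolate Chip", "Oatmeal Raisin", "Chocolate Chip",
    "Oatmeal Raisin", "Chocolate Chip", "Chocolate Chip", "Sugar", "Brownie",
    "Sugar", "Oatmeal Raisin", "Oreo", "Brownie", "Peanut Butter",
    "Peanut Butter", "Oatmeal Raisin", "Chocolate Chip", "Oreo", "Oatmeal Raisin",
    "Brownie", "Peanut Butter", "Brownie", "Chocolate Chip", "Oreo",
]

frequency = Counter(responses)   # category -> number of occurrences
n = len(responses)

# most_common() lists categories in decreasing order of frequency.
for cookie, count in frequency.most_common():
    proportion = count / n
    print(f"{cookie:15s} {count:2d}  {proportion:.2f}  {100 * proportion:.2f}%")
```

A proportion is the frequency divided by the number of observations, and the percentage is the proportion multiplied by 100.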
Pareto chart – a bar graph whose bars are drawn in decreasing order of frequency or proportion.

[Figure: Pareto chart “Favorite Cookie”; horizontal axis: Cookie Type. Bars in decreasing order: Chocolate Chip 0.30, Oreo 0.17, Oatmeal Raisin 0.17, Peanut Butter 0.13, Brownie 0.13, Sugar 0.10.]

Pie Charts – A pie chart is a circle divided into sectors. Each sector represents a category of data.

[Figure: pie chart “Favorite Cookie”. Sectors: Chocolate Chip 30.00%, Oreo 16.67%, Oatmeal Raisin 16.67%, Peanut Butter 13.33%, Brownie 13.33%, Sugar 10.00%.]

Graphs for Quantitative Variables

Histogram – a bar graph for quantitative data.

Ex. The table below shows the number of points scored by the UGA football team in the 2002-2003 season.

Opponent           # of Points
Clemson            31
South Carolina     13
Northwestern St    45
New Mexico St.     41
Alabama            27
Tennessee          18
Vanderbilt         48
Kentucky           52
Florida            13
Ole Miss           31
Auburn             24
Georgia Tech       51
Arkansas           30
Florida St.        26

Construct a bar graph by tens. Have the groups be 10-19, 20-29, 30-39, 40-49 and 50-59.

Stem and Leaf Plot – A stem and leaf plot is just a bar graph on its side. The stem consists of all digits except for the final one, which is the leaf.

Ex. The table below shows the number of points scored by the UGA football team in the 2002-2003 season.

Opponent           # of Points
Clemson            31
South Carolina     13
Northwestern St    45
New Mexico St.     41
Alabama            27
Tennessee          18
Vanderbilt         48
Kentucky           52
Florida            13
Ole Miss           31
Auburn             24
Georgia Tech       51
Arkansas           30
Florida St.        26

First, place the numbers in ascending order:

Then put the numbers into a Stem-and-Leaf Diagram

1|
2|
3|
4|
5|

Example

The following data represent the length of eruption in seconds for a random sample of eruptions of “Old Faithful”, a geyser at Yellowstone National Park. Draw a stem and leaf plot.

108   113   102    97   106
110    99   109   108   112
 97    76   107   104   114
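The stem-and-leaf construction for the UGA points (sort the values, then split each into a stem and a final-digit leaf) can be sketched in Python. The function name `stem_and_leaf` is hypothetical, and the sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

points = [31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves."""
    plot = defaultdict(list)
    for value in sorted(values):         # ascending order first
        stem, leaf = divmod(value, 10)   # e.g. 31 -> stem 3, leaf 1
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(points)
for stem in sorted(plot):
    print(f"{stem}| " + " ".join(str(leaf) for leaf in plot[stem]))
```

The same function works for three-digit data such as the Old Faithful eruption lengths, where 108 has stem 10 and leaf 8. Note that this sketch prints only stems that have at least one leaf, whereas a hand-drawn plot normally keeps empty stems as well.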
Shapes of Histograms

Symmetrical/Normal – the side of the distribution below the middle is a mirror image of the side above the middle.

[Figure: histogram with a symmetric shape]

Skewed left – left tail is stretched out longer than the right tail.

[Figure: histogram skewed left]

Skewed right – right tail is stretched out longer than the left tail.

[Figure: histogram skewed right]

NOTE: Many times we will use smooth curves to show the data rather than histograms.

Example

[Figure: histogram “IQ's of 7th Graders”; horizontal axis: IQ Scores, in classes 60-69, 70-79, 80-89, 90-99, 100-109, 110-119, 120-129, 130-139, 140-149; vertical axis: Frequency.]

How many students were sampled?

Which class has the highest frequency? What is its frequency?

Which class has the lowest frequency? What is its frequency?

What proportion of the students have an IQ between 120 and 129?

Describe the shape of the distribution – is it skewed right, skewed left or approximately normal?

2.3 How can we describe the center of quantitative data?

Mean (Average) – adding up all the values of the variable x and dividing by the number of these values, n.

$\text{Mean} = \dfrac{\sum x}{n}$

Example: What is the mean of 1, 3, 6, 7, 8?

Population Mean: μ (known as mu)

Sample Mean: x̄ (known as x-bar)

Median – The value of the data that occupies the middle position when the data are ranked in ascending order. It separates the bottom 50% of the data from the top 50% of the data.

Steps in Computing the Median of a Data Set:
1. Arrange the data from low to high.

2a. If n (the number of values) is odd, there is a unique middle data value. The median is the observation that lies in the (n + 1)/2 position.
Example: What is the median of this dataset: 1, 3, 6, 7, 8?

2b. If n is even, the median is the average of the two middle observations in the data set. These two middle observations lie in the n/2 and n/2 + 1 positions.
Example: What is the median of this dataset: 1, 3, 4, 6, 7, 8?

Mode – The data value that occurs most frequently (has the highest frequency). It is important to point out that the mode is NOT equal to the frequency; the mode is the data value that corresponds with the highest frequency.

Example: 10 bags of M&M’s were opened and the number of M&M’s in each of the 10 bags is:

32 34 31 35 32 36 29 38 34 32

What is the mode?

Example: What is the mode in the bar graph below?

[Figure: bar graph “Favorite Cookie”; horizontal axis: Cookie Type; vertical axis: Frequency. Bar heights: Oreo 5, Chocolate Chip 9, Oatmeal Raisin 5, Sugar 3, Peanut Butter 4, Brownie 4.]

Example: The following is a list of animals found on a farm:

8 Horses, 32 Chickens, 2 Dogs, 15 Cows, 2 Cats

What is the mode of this dataset?

Can we get the mean or median for this dataset? Why not?
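The mean, median and mode examples above can be checked with Python's built-in `statistics` module. This is a sketch for verifying hand calculations, not a replacement for working through the steps.

```python
import statistics

odd_data = [1, 3, 6, 7, 8]
even_data = [1, 3, 4, 6, 7, 8]

mean_value = statistics.mean(odd_data)       # sum 25 divided by n = 5
median_odd = statistics.median(odd_data)     # odd n: the (n + 1)/2 = 3rd value
median_even = statistics.median(even_data)   # even n: average of 3rd and 4th values

# Number of M&M's in each of the 10 bags.
candies = [32, 34, 31, 35, 32, 36, 29, 38, 34, 32]
modes = statistics.multimode(candies)        # list of the most frequent value(s)

print(mean_value, median_odd, median_even, modes)
```

`multimode` returns a list because a dataset can have more than one mode; it reports data values, not their frequencies.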
Example: Using the previous UGA football example:
Number of Points = 31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26

What is the sum of all points scored by the Bulldogs that year?

What is the mean number of points scored by the Bulldogs that year?

What is the mode of this data?
(This is an example of a bimodal dataset: has two modes)

What is the median of the data?

NOW, let’s see how StatCrunch can do these calculations for us.

Put in the data and go to Stat → Summary Stats → Columns

The following frequency table shows the number of children in a daycare separated out by their ages:

Age of Children    Frequency
2                  3
3                  7
4                  6
5                  1

Write out all the ages for the children at the daycare.

How many total children attend the daycare?

What is the mean age for children at this daycare?

What is the median age for children at this daycare?
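Both exercises above can be sketched in Python. For the daycare table, the key step is writing out each age as many times as its frequency before summarizing. The variable names are illustrative assumptions.

```python
import statistics

# UGA points from the football example.
points = [31, 13, 45, 41, 27, 18, 48, 52, 13, 31, 24, 51, 30, 26]
total_points = sum(points)
mean_points = total_points / len(points)
median_points = statistics.median(points)
modes = sorted(statistics.multimode(points))   # two values tie, so bimodal

# Daycare frequency table: age -> number of children.
age_frequency = {2: 3, 3: 7, 4: 6, 5: 1}

# Write out every child's age, repeating each age by its frequency.
ages = [age for age, count in age_frequency.items() for _ in range(count)]

total_children = len(ages)
mean_age = sum(ages) / total_children
median_age = statistics.median(ages)

print(total_points, round(mean_points, 2), median_points, modes)
print(total_children, round(mean_age, 2), median_age)
```

Expanding the table first keeps the mean and median definitions unchanged; only the bookkeeping differs from a raw list of values.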

It is important to note that the mean is sensitive to extreme values in the dataset, either very large or very small numbers. The median, however, is not. The median is resistant to extreme values.

Example:
Data set:

13 12 16 10 18 17 15 10 600

Find the mean

n =

Σx =                    mean =

Find the median

Put values in order:

Median =

Find the mode

Mode =

If I asked for the number which best describes the “middle” of the data, what is the best answer? Why?

Mean = Median: The graph is approximately normal/symmetrical.

Mean < Median: The graph is skewed left.

This is true because a skewed left graph has more low data values on the left. These low data values make the mean lower & less than the median.

Mean > Median: The graph is skewed right.

This is true because a skewed right graph has more high data values on the right. These high data values make the mean higher & greater than the median.

Example: Match the histograms to these summary statistics.

     Mean    Median    Graph
1    42      42
2    31      36
3    31      26
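The effect of one extreme value on the mean and the median can be seen by summarizing the data set above with and without the value 600. This is a small sketch using Python's `statistics` module.

```python
import statistics

data = [13, 12, 16, 10, 18, 17, 15, 10, 600]
without_extreme = [x for x in data if x != 600]

mean_all = statistics.mean(data)                 # pulled far upward by 600
median_all = statistics.median(data)             # middle of the ordered values
mean_trimmed = statistics.mean(without_extreme)
median_trimmed = statistics.median(without_extreme)

print("with 600:    mean =", mean_all, " median =", median_all)
print("without 600: mean =", mean_trimmed, " median =", median_trimmed)
```

Removing the single extreme value changes the mean a great deal but moves the median only slightly, which is what "resistant" means here.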
2.4 How can we describe the spread of quantitative data?

Range – The difference between the largest and the smallest pieces of data.

range = Largest value - Smallest value

Deviation from the Mean – A deviation from the mean, x – x̄, is the difference between the value of x and the mean.

$x - \bar{x}$

Ex. Data Set: 2, 6, 7, 9, 11

$\bar{x} = \dfrac{2 + 6 + 7 + 9 + 11}{5} = \dfrac{35}{5} = 7$

x     x – x̄     Deviation
2     2 – 7
6     6 – 7
7     7 – 7
9     9 – 7
11    11 – 7

Sample Variance – the mean of the squared deviations, calculated using n – 1 as the divisor. What you are doing when you are calculating sample variance is, in a way, averaging all the squared deviations, except you are dividing by n – 1 instead of dividing by n.

$\text{Variance} = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

From the example above,

x     x – x̄     Deviation    (x – x̄)²
2     2 – 7
6     6 – 7
7     7 – 7
9     9 – 7
11    11 – 7
                 SUM:

Variance =

Standard Deviation – the positive square root of the variance

$s = \sqrt{\text{Variance}}$

From the example above,

s =

In lab, you will learn how to use StatCrunch to calculate this sample standard deviation value without having to go through all these steps.
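The deviation table, the sample variance and the sample standard deviation for the data set 2, 6, 7, 9, 11 can be worked through step by step in Python. The sketch mirrors the hand calculation and then compares the result with the standard library's `statistics.stdev`, which also divides by n – 1.

```python
import math
import statistics

data = [2, 6, 7, 9, 11]
n = len(data)

mean = sum(data) / n                            # 35 / 5 = 7
deviations = [x - mean for x in data]           # x minus the mean
squared_deviations = [d ** 2 for d in deviations]

variance = sum(squared_deviations) / (n - 1)    # divide by n - 1, not n
s = math.sqrt(variance)                         # positive square root

for x, d, sq in zip(data, deviations, squared_deviations):
    print(f"{x:3d} {d:6.1f} {sq:6.1f}")
print("variance =", variance)
print("s =", round(s, 3), " library check:", round(statistics.stdev(data), 3))
```

The deviations always sum to zero, which is why they are squared before being averaged.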

Variance and standard deviation measure how spread apart your data values are. The higher the variance and standard deviation, the more spread apart the data values will be.

Example: If we administered Test A and Test B to five students, and their scores were the following:

Test A: 20 58 79 92 98

Test B: 75 78 80 82 83

Which test scores would have a larger standard deviation? Why?

Just like we can either be looking at a population mean or a sample mean depending upon if we are looking at the entire population or just a sample from the population, we also have symbols to represent population standard deviation and sample standard deviation:

Population Standard Deviation – σ (known as sigma)

Sample Standard Deviation – s

Whenever we calculate standard deviation using StatCrunch, we are calculating sample standard deviation, s.

Example: Consider the following three data sets:

A: 50, 50, 50     B: 40, 50, 60     C: 30, 50, 70

Use these data sets to practice finding the sample standard deviation.

A     Deviation    Deviation²
50
50
50

B     Deviation    Deviation²
40
50
60

C     Deviation    Deviation²
30
50
70

Which distribution has the smallest standard deviation?

Which distribution has the largest standard deviation?
As you can see in the distributions below, the distribution with a larger standard deviation is going to be wider, because its data values are more spread apart:

A. Standard Deviation Equal to 1.0

B. Standard Deviation Equal to 3.0

Empirical Rule – If a distribution is bell-shaped, we can approximate the percentage of data that lie within one, two, and three standard deviations of the mean.

μ ± 1σ (-1 to +1) ~ 68% of the data values
μ ± 2σ (-2 to +2) ~ 95% of the data values
μ ± 3σ (-3 to +3) ~ nearly all (about 99.7%) of the data values

Example: If we have a bell-shaped population of test scores with μ = 80 and σ = 6, label the test scores that correspond to 1, 2, and 3 standard deviations away on the above curve and interpret those values.

Example: The weight, in grams, of both kidneys based upon a sample of 30 forty-five-year-old men resulted in a sample mean of 325 grams, with a sample standard deviation of 30 grams.

a. A histogram of the data indicates that the data follow a bell-shaped distribution. Draw a curve of these kidney weights.

b. According to the Empirical Rule, approximately 68% of the kidney weights will be within one standard deviation of the mean. Find the value that is one standard deviation below the mean and the value that is one standard deviation above the mean.

c. Determine the percentage of kidneys that weigh between 265 grams and 385 grams, according to the Empirical Rule.

2.5 How can we describe the position of values in quantitative data?

1. Percentiles

The pth percentile is a value such that p% of the observations in the data fall below or at that value.

This also means that the other (100 – p)% of the observations in the data are larger than that value.

A data value’s percentile tells you approximately what % of the data are less than that value.

If a value lies at the 30th percentile, then approximately 30% of the data values are less than that value and approximately 70% of the data values are higher than that value.

Example: If John graduated at the 78th percentile in a class of 876, approximately how many students ranked below John?
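The Empirical Rule ranges used in the test-score and kidney-weight examples can be computed with a small helper. This is a sketch; the function name `empirical_rule_intervals` is hypothetical, and the percentages apply only when the distribution is bell-shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Return the (low, high) ranges 1, 2 and 3 standard deviations from the mean."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}

# Approximate share of a bell-shaped distribution inside each range.
coverage = {1: "about 68%", 2: "about 95%", 3: "nearly all (about 99.7%)"}

test_scores = empirical_rule_intervals(80, 6)    # mu = 80, sigma = 6
kidneys = empirical_rule_intervals(325, 30)      # sample mean 325 g, s = 30 g

for k, (low, high) in test_scores.items():
    print(f"test scores within {k} sd: {low} to {high} ({coverage[k]})")
for k, (low, high) in kidneys.items():
    print(f"kidney weights within {k} sd: {low} g to {high} g ({coverage[k]})")
```

For the kidney data, 265 grams and 385 grams are the endpoints of the two-standard-deviation range, so part (c) is a direct reading of the rule.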
Quartiles – specific percentiles that are useful. Each set of data has three quartiles.

First Quartile (Q1) – the value such that 25% of the data values are smaller than Q1, and 75% are larger. This is also known as the 25th percentile.

Second Quartile (Q2) – the value such that 50% of the data values are smaller than Q2, and 50% are larger. This is also known as the median and the 50th percentile.

Third Quartile (Q3) – the value such that 75% of the data values are smaller than Q3, and 25% are larger. This is also known as the 75th percentile.

           25%     25%     25%     25%
Minimum         Q1      Q2      Q3      Maximum

NOTE: Q1 and 25th percentile are the same; Q3 and 75th percentile are the same; Q2 and the 50th percentile are the same.

Finding Quartiles

1. Arrange the data in order.
2. Find the median. This is the second quartile, Q2.
3. Consider the lower half of the observations. The median of these observations is the first quartile, Q1.
4. Consider the upper half of the observations. The median of these observations is the third quartile, Q3.

The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

 5.7    7.7    7.8    8.7    8.9
 9.4    9.5    9.6    9.6    9.9
10.0   10.3   10.6   10.7   11.0
11.2   11.7   12.9   13.0   13.4

Determine the quartiles.

Outliers – extreme observations. They can occur because of error in the measurement of the variable, during data entry, or from errors in sampling.

Steps for Checking for Outliers:
1.) Determine the first and third quartiles of the dataset.
2.) Compute the interquartile range. The interquartile range or IQR is the difference between the third and first quartile.
        IQR = Q3 – Q1
3.) If a data value is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR), it is considered an outlier.

Example (continued): Hemoglobin in Cats
The following data represent the hemoglobin (in g/dL) for 20 randomly selected cats.

 5.7    7.7    7.8    8.7    8.9
 9.4    9.5    9.6    9.6    9.9
10.0   10.3   10.6   10.7   11.0
11.2   11.7   12.9   13.0   13.4

Compute the IQR.

Are there any outliers?

The 5-Number Summary and Boxplots

           25%     25%     25%     25%
Minimum         Q1      Q2      Q3      Maximum

This is the 5-number summary. It includes the minimum, Q1, Q2 (the median), Q3, and the maximum.

Boxplot – a graph of the five number summary.

Steps in Drawing a Boxplot:
1.) Determine Q1, Q2, and Q3.
2.) Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.
3.) Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.
4.) Any data values that are outliers are marked with an asterisk (*).
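The quartile steps, the IQR and the 1.5(IQR) outlier check can be sketched in Python for the cat hemoglobin data. The quartiles here follow the median-of-halves method described in these notes; library routines such as `statistics.quantiles` use other conventions and may give slightly different values. Dropping the middle observation from both halves when n is odd is an assumption, and the function name `quartiles` is hypothetical.

```python
import statistics

hemoglobin = [5.7, 7.7, 7.8, 8.7, 8.9, 9.4, 9.5, 9.6, 9.6, 9.9,
              10.0, 10.3, 10.6, 10.7, 11.0, 11.2, 11.7, 12.9, 13.0, 13.4]

def quartiles(values):
    """Q1, Q2, Q3 by taking the median of each half (needs at least 2 values)."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]      # lower half of the observations
    upper = data[-half:]     # upper half (middle value dropped when n is odd)
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

q1, q2, q3 = quartiles(hemoglobin)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in hemoglobin if x < low_fence or x > high_fence]

five_number_summary = (min(hemoglobin), q1, q2, q3, max(hemoglobin))

print("Q1, Q2, Q3:", round(q1, 2), round(q2, 2), round(q3, 2))
print("IQR:", round(iqr, 2))
print("fences:", round(low_fence, 3), round(high_fence, 3))
print("outliers:", outliers)
```

The values are rounded when printed because decimal data such as 9.15 cannot always be stored exactly in floating point.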
Example: Draw a boxplot for the cat data:

 5.7    7.7    7.8    8.7    8.9
 9.4    9.5    9.6    9.6    9.9
10.0   10.3   10.6   10.7   11.0
11.2   11.7   12.9   13.0   13.4

Step 1: Determine Q1, Q2, and Q3

Step 2: Draw vertical lines at Q1, the median (Q2), and Q3. Enclose these vertical lines in a box.

Step 3: Draw a line from Q1 to the smallest data value that is not an outlier. Draw a line from Q3 to the largest data value that is not an outlier.

Step 4: Any data values that are outliers are marked with an asterisk (*).

Distribution Shape Based upon Boxplot:
1.) If the median is near the center of the box and each horizontal line is approximately equal length, the distribution is approximately symmetric.
2.) If the median is to the left of the center of the box or the right line is much longer than the left line, the distribution is skewed right.
3.) If the median is to the right of the center of the box or the left line is much longer than the right line, the distribution is skewed left.

2. Z-score

Z-score – The position a value has relative to the mean, measured in standard deviations.

$z\text{-score} = \dfrac{\text{value} - \text{mean}}{\text{standard deviation}}$

The Z-score is the number of standard deviations a data value is from the mean.

NOTICE: If the value is equal to the mean then the z-score = 0.

Example:
From samples taken, the average 20-29 year-old man is 70.0 inches tall, with a standard deviation of 2.8 inches, while the average 20-29 year-old woman is 64.6 inches tall, with a standard deviation of 2.6 inches.

Find the z-score for a 75-inch tall man.

Find the z-score for a 70-inch tall woman.

Who is relatively taller, a 75-inch man or a 70-inch woman?

If the heights for males are normally distributed, draw a curve representing these heights. Label where the 75-inch tall man is under this curve, and see that it corresponds to his Z-score.

What height is exactly two standard deviations below the mean? Calculate the Z-score for this height to make sure it does equal -2.

Using Z-Scores to check for Outliers

Outliers for a bell-shaped curve:
A data value in a bell-shaped distribution is regarded as a potential outlier if it falls more than three standard deviations from the mean. Or, in other words, if a value has a Z-Score less than -3 or a Z-Score greater than +3, then it is a potential outlier.

Assume that male heights are normally distributed. In the previous example, would a male with a height of 58 inches be considered a potential outlier?

What about a male with a height of 62 inches?
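The z-score formula and the three-standard-deviation outlier check can be sketched in Python using the height figures from the example. The function names `z_score` and `is_potential_outlier` are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

def is_potential_outlier(z):
    """Bell-shaped data: flag values more than 3 standard deviations away."""
    return abs(z) > 3

MEN_MEAN, MEN_SD = 70.0, 2.8        # inches, from the example
WOMEN_MEAN, WOMEN_SD = 64.6, 2.6

z_man = z_score(75, MEN_MEAN, MEN_SD)
z_woman = z_score(70, WOMEN_MEAN, WOMEN_SD)
print("75-inch man:  ", round(z_man, 2))
print("70-inch woman:", round(z_woman, 2))

# Height exactly two standard deviations below the men's mean.
two_sd_below = MEN_MEAN - 2 * MEN_SD
print("two sd below the mean:", round(two_sd_below, 1), "inches")

for height in (58, 62):
    z = z_score(height, MEN_MEAN, MEN_SD)
    print(height, "inches: z =", round(z, 2), "potential outlier:", is_potential_outlier(z))
```

Comparing z-scores rather than raw heights is what lets the two people be compared even though they come from distributions with different means and standard deviations.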
Chapter Three – Association: Contingency, Correlation, and Regression

In Chapter 3, we explore the relationships between two variables.

Response variable – a variable that can be explained by, or is determined by, another variable. This is our y-variable, the variable that goes on the vertical axis when we are graphing data.

Explanatory variable – explains, or affects, the response variable. This is our x-variable, the variable that goes on the horizontal axis when we are graphing data.

Ex. The amount you eat affects how much weight you gain. The amount you eat is the explanatory variable which determines weight gain, the response variable.

Association – an association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

Ex. If the amount we eat is small, then we probably won’t notice much gain in weight. However, if the amount we eat is large, then we probably will notice some gain in weight. So there is an association between the amount eaten and weight gain.

Lurking variable – related to the response or explanatory variable, or both, but is not the variable being studied.

Ex. A lurking variable could be frequency of exercise. The amount of exercising can also affect weight gain, the response variable.

3.1 How can we explore the association between two categorical variables?

To do this, we use contingency tables.

Contingency or 2-way table – a table that relates 2 categorical variables. Each box inside the table is referred to as a cell.

Suppose we have the following data:

          Left-handed    Right-handed
Male      160            600
Female    140            560

Are these categorical variables?

What is the response variable?

What is the explanatory variable?

In the examples we use, the explanatory variable will always be on the side, and the response variable will always be on top.

First thing we do is fill in the totals for each row and column:

          Left-handed    Right-handed    Total
Male      160            600
Female    140            560
Total

How many males are there in this data?

How many right-handed people are there in this data?

We can also calculate the proportion for each group. Total up the columns and rows again, and answer the questions:

          Left-handed    Right-handed    Total
Male      160            600             760
Female    140            560             700
Total     300            1160            1460

Ex. What proportion of the people in the data is female?

Ex. What proportion of the people in the data is left-handed?

Conditional Proportion – the proportion for a value of a variable, given a specific value of the other variable.

Total up the columns and rows again, and answer the questions:

          Left-handed    Right-handed    Total
Male      160            600             760
Female    140            560             700
Total     300            1160            1460

Ex. What proportion of the males is right-handed?

Ex. What proportion of the females is left-handed?

Relative Risk / Odds Ratio

We can use these conditional proportions to determine the comparative odds for each group.

Let’s create a table with these conditional proportions for categories of the response variable.

          Left-handed    Right-handed
Male
Female

$\text{relative risk} = \dfrac{\text{conditional proportion for one group}}{\text{conditional proportion for another group}}$

When we calculate relative risk, the higher conditional proportion goes in the numerator. We can use relative risk to see how many times more likely the outcome for one group is than the other group.

Example: Fill in the blank. A male is _____ times more likely to be left-handed than a female.

You can see this value is close to one. A relative risk close to 1 means it is about the same likelihood for both groups. Now look at this example:
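The row totals, conditional proportions and relative risk for the handedness table can be sketched in Python. The dictionary layout and the function names are assumptions made for illustration; rows hold the explanatory variable and columns the response variable, as in the notes.

```python
counts = {
    "Male":   {"Left-handed": 160, "Right-handed": 600},
    "Female": {"Left-handed": 140, "Right-handed": 560},
}

# Totals for each row, and the overall total.
row_totals = {group: sum(row.values()) for group, row in counts.items()}
grand_total = sum(row_totals.values())

def conditional_proportion(group, outcome):
    """Proportion with `outcome`, given membership in `group`."""
    return counts[group][outcome] / row_totals[group]

def relative_risk(group_a, group_b, outcome):
    """Ratio of conditional proportions, with the larger one in the numerator."""
    a = conditional_proportion(group_a, outcome)
    b = conditional_proportion(group_b, outcome)
    return max(a, b) / min(a, b)

print("proportion female:", round(row_totals["Female"] / grand_total, 3))
print("males, right-handed:", round(conditional_proportion("Male", "Right-handed"), 3))
print("females, left-handed:", round(conditional_proportion("Female", "Left-handed"), 3))
print("relative risk, left-handed:", round(relative_risk("Male", "Female", "Left-handed"), 3))
```

A conditional proportion divides by the row total for that group, not by the grand total; that is the only difference from an ordinary proportion.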
We asked 1795 people their political affiliation and whether they think marijuana should be legalized. Here is the data we received:

                         Legalize Marijuana?
Political Affiliation    Yes     No      Total
Democrat                 240     326     566
Independent              292     446     738
Republican               121     370     491
Total                    653     1142    1795

Fill in the blank. A Democrat is _______ times more likely to favor legalization of marijuana than a Republican.

We can see that when we get larger differences in the conditional proportions, like here, we get a relative risk value much higher than 1.

3.2 How can we explore the association between two quantitative variables?

When we have two quantitative variables, the first thing we do is make a scatterplot of the data.

Scatterplot – a graphical display for two quantitative variables. (Explanatory variable is on the horizontal axis, response variable is on the vertical axis, and the points are not connected.)

Example:

We want to see if the two variables (# spaces from GO and Cost) are associated, and later we will see if we can use the # spaces from GO (explanatory variable) to predict the cost of a property (response variable).

Here is a scatterplot of this data:

[Figure: scatterplot of Cost (vertical axis, 0 to 350) against Spaces from GO (horizontal axis, 0 to 35)]

positive association – as x increases, y increases.

negative association – as x increases, y decreases.

no association – as x increases, there is no definite shift in the
values of y.

Are the variables related? (Does it look like there is an association
between these two variables?)

What type of association do we see in the scatterplot above?

Example: Determine the type of association for the following
pairs of variables.

a) weight of a car and miles per gallon it gets

b) speed of a car and distance required to come to a complete stop

c) weight on a bar and number of repetitions a weightlifter can achieve

d) the temperature outside and my grade on a test

So we can state the association between two variables, but what if
we want to take it one step further and determine if there is a
linear relationship between the variables?
Then we calculate what we call correlation.

linear correlation – when the data tend to follow a straight-line
path (if x increases and y increases, it is positive correlation; if x
increases and y decreases, it is negative correlation)

no correlation – as x increases there is no definite shift in the
values of y (no linear relationship between x & y)

Correlation can be:

1. Positive, negative or no correlation
2. Strong or weak correlation

Scatter Diagrams and Correlation:

Coefficient of linear correlation (r) – the numerical measure of
the strength of the linear relation between x and y.


Properties of Linear Correlation Coefficient:
1.) r must always be between -1 and 1.
-1 ≤ r ≤ 1
2.) r > 0 indicates a positive linear relationship.
If r = +1, there is perfect positive correlation.
3.) r < 0 indicates a negative linear relationship.
If r = -1, there is perfect negative correlation.
4.) If r = 0, there is no linear relation between the 2 variables.
5.) A value of r close to 1 or -1 indicates a strong linear
relationship, while a value of r close to zero indicates a
weak linear relationship.

[Figure: number line for r, marked at -1, -.5, 0, .5 and 1]

Which of the following is the strongest correlation?

.8, .67, -.34, 0, -.92
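
As a small illustration of property 5, the sketch below ranks the listed values by their distance from zero, since the sign of r only gives the direction of the relationship. The variable names are made up for this example.

```python
# Candidate r values from the question above.
r_values = [0.8, 0.67, -0.34, 0.0, -0.92]

# Strength depends on how far r is from zero, so compare absolute values.
strongest = max(r_values, key=abs)
weakest = min(r_values, key=abs)

print("strongest correlation:", strongest)
print("weakest correlation:", weakest)
```

The comparison uses abs because a negative r can describe a stronger linear relationship than a positive one.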


Calculate r for the Monopoly data:

x: mean = 18.75, standard deviation = 11.758

y: mean = 225, standard deviation = 59.722

To recap, the correlation coefficient r tells us if there is a positive,
negative or no linear relationship. r also tells us if it is a strong or
weak linear relationship.
StatCrunch can also calculate this r value by putting in the data and
going to Stat → Summary Stats → Correlation

Example

Weight (x)           40   80   100   120   150
Number of Reps (y)   25   20    18    15    10

1) Draw a scatterplot and comment on the type of relation that
appears to exist between x and y. Is it a negative or positive
relation? Does it seem strong or weak?

2) Calculate r for this data using StatCrunch.
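
These notes compute r with StatCrunch. As a rough cross-check, the sketch below computes r for the weight and repetitions example by hand. It assumes one common textbook definition, r = (1 / (n - 1)) × the sum of (z-score of x) × (z-score of y), with sample standard deviations. The function names are made up for this illustration.

```python
from math import sqrt

weight = [40, 80, 100, 120, 150]   # x
reps = [25, 20, 18, 15, 10]        # y


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    # Sample standard deviation (divides by n - 1).
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(xs, ys):
    # r = (1 / (n - 1)) * sum of z_x * z_y
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sx, sy = sample_sd(xs), sample_sd(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)) / (n - 1)


r = correlation(weight, reps)
print(round(r, 4))
```

The result should come out close to -1, which agrees with a strong negative linear relationship between weight and number of repetitions.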

3.3-4 How to predict the outcome of a variable?

To predict the response variable using the explanatory variable, we
create what is called a regression line.

regression line – predicts the value for the response variable y as a
straight-line function of the value x of the explanatory variable.

ŷ : predicted value of y using the regression line

the equation for the regression line: ŷ = a + b x

In this formula, a is called the y-intercept and b is called the slope.

So for every x value, we have an actual y value from our data, and a
predicted y value using this regression line. The best regression line
is going to be the one that has the predicted y values closest to the
actual y values.

We use the actual data values to create the regression line. We won't
need to do this by hand; StatCrunch can do it for us.
Let's take a look back at our Monopoly data and look at the regression
line for that data:

Here is our data:

And here is the graph of the regression line for this data:

[Figure: scatterplot of Cost (vertical axis, 0 to 350) against Spaces from GO (horizontal axis, 0 to 35) with the regression line drawn]

Residual – the difference between the actual value and the
predicted value of y.

residual = actual y – predicted y = y – ŷ

The line that "best" describes the relation between 2 variables is
the one that makes the residuals as small as possible.
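
To make "as small as possible" concrete, the sketch below compares the sum of squared residuals for two candidate lines. It uses the four Monopoly properties and the regression line ŷ = 147.016 + 4.159 x that these notes report for them. The second line, ŷ = 100 + 6 x, is a hypothetical alternative chosen only for comparison, and the function name is made up. Squaring the residuals is the least squares criterion; it keeps positive and negative residuals from cancelling.

```python
# Spaces from GO (x) and cost in dollars (y) for the four Monopoly properties.
spaces = [5, 14, 24, 32]
cost = [200, 160, 240, 300]


def sum_squared_residuals(a, b, xs, ys):
    # Sum of (actual y - predicted y)^2 for the line y-hat = a + b * x.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Regression line reported in these notes for the Monopoly data.
sse_fitted = sum_squared_residuals(147.016, 4.159, spaces, cost)

# Hypothetical alternative line, used only as a comparison.
sse_other = sum_squared_residuals(100.0, 6.0, spaces, cost)

print("fitted line:", round(sse_fitted, 1))
print("other line: ", round(sse_other, 1))
```

The fitted line should give the smaller total, and under the least squares criterion no other straight line does better on these four points.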


The formula for the regression line using the least squares method:

ŷ = a + b x

where a = y-intercept and b = slope

So for our Monopoly example, here is the regression line formula:

ŷ = 147.016 + 4.159 x

We can get this in StatCrunch by going to
Stat → Regression → Simple Linear.

Interpretations of y-intercept and slope:
y-intercept = the predicted value of y when x = 0.
Interpret the y-intercept in the above scenario:


Slope = the amount that the predicted value of y changes
when x increases by one unit.
Interpret the slope in the above scenario:


Calculate the predicted cost and residual for each of the
four properties in our data using the regression line formula:

Property            # spaces   Actual   Predicted   Residual
                    from GO    Cost     Cost
Reading Railroad        5      $200
Virginia Ave.          14      $160
Illinois Ave.          24      $240
N. Carolina Ave.       32      $300

We can also use our regression line to predict the costs for
other properties. When the x value lies outside the range of x
values in our data, this is called extrapolation.

Predict the cost for Tennessee Ave., which is 18 spaces from GO.

The actual cost for Tennessee Ave. is $180.

Compute the residual for Tennessee Ave.
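
These notes get the regression line from StatCrunch. As a hedged sketch of the same calculation, the code below fits the line to the four Monopoly properties with the standard least-squares formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b x̄, which are not spelled out in these notes. It then fills in predicted costs and residuals, including Tennessee Ave. All variable and function names are made up for this illustration.

```python
# name: (spaces from GO, actual cost in dollars)
properties = {
    "Reading Railroad": (5, 200),
    "Virginia Ave.": (14, 160),
    "Illinois Ave.": (24, 240),
    "N. Carolina Ave.": (32, 300),
}

xs = [x for x, _ in properties.values()]
ys = [y for _, y in properties.values()]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Standard least-squares slope and intercept (assumed formulas, see lead-in).
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar
print(f"y-hat = {a:.3f} + {b:.3f} x")


def predict(x):
    # Predicted cost for a property x spaces from GO.
    return a + b * x


residuals = []
for name, (x, y) in properties.items():
    y_hat = predict(x)
    residuals.append(y - y_hat)
    print(f"{name:18s} predicted = {y_hat:7.2f}  residual = {y - y_hat:7.2f}")

# Tennessee Ave.: 18 spaces from GO, actual cost $180.
tennessee_pred = predict(18)
tennessee_resid = 180 - tennessee_pred
print(f"Tennessee Ave.     predicted = {tennessee_pred:7.2f}  residual = {tennessee_resid:7.2f}")
```

Because the code keeps full precision for a and b, its predictions can differ in the last digit from hand calculations that use the rounded coefficients 147.016 and 4.159.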

Example

Weight (x)           40   80   100   120   150
Number of Reps (y)   25   20    18    15    10

Given the following least-squares regression line for this data:

ŷ = 30.7616 − 0.1343 x

2.) What is the slope? Interpret that value.

3.) What is the y-intercept? Interpret that value.

4.) Predict the y value when x is 80.

5.) Compute the residual for x = 80.
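
One way to check the arithmetic for the last two questions is the short sketch below. It plugs x = 80 into the given line and subtracts the prediction from the actual number of reps in the table. The names intercept, slope and predict_reps are made up for this illustration.

```python
# Least-squares line given above: y-hat = 30.7616 - 0.1343 x
intercept = 30.7616
slope = -0.1343

weights = [40, 80, 100, 120, 150]   # x
reps = [25, 20, 18, 15, 10]         # y


def predict_reps(weight):
    # Predicted number of reps for a given weight on the bar.
    return intercept + slope * weight


x = 80
actual = reps[weights.index(x)]   # actual y observed at x = 80
predicted = predict_reps(x)
residual = actual - predicted     # residual = actual y - predicted y

print(f"predicted reps at x = {x}: {predicted:.4f}")
print(f"residual at x = {x}: {residual:.4f}")
```

A small negative residual here means the line predicts slightly more reps than the lifter actually completed at that weight.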
