Variance:
s² = Σ(xᵢ - x̄)² / (n - 1)
Standard deviation: the square root of the variance
s = √[ Σ(xᵢ - x̄)² / (n - 1) ]
Normal (Gaussian) distribution
Many kinds of data follow this symmetrical, bell-shaped curve, often
called a Normal Distribution.
Normal distributions have statistical properties that allow us to predict
the probability of getting a certain observation by chance.
Statistics
Normal (Gaussian) distribution
When sampling a variable, you are most likely to obtain values close to the mean:
68% within 1 SD
95% within 2 SD
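The 68%/95% coverage rule above can be checked with a quick simulation (a sketch; the data are randomly generated, not from the slides):

```python
import random

# Draw many values from a standard normal distribution and check the
# empirical 68% / 95% coverage rule.
random.seed(42)
n = 200_000
values = [random.gauss(0.0, 1.0) for _ in range(n)]

within_1sd = sum(abs(v) <= 1.0 for v in values) / n
within_2sd = sum(abs(v) <= 2.0 for v in values) / n

print(f"within 1 SD: {within_1sd:.3f}")  # close to 0.683
print(f"within 2 SD: {within_2sd:.3f}")  # close to 0.954
```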
Normal (Gaussian) distribution
Note that a couple of values are outside the 95% (2 SD) interval.
These are improbable.
The essence of hypothesis testing:
If an observation appears in one of the tails of a distribution, there is a probability that it is not part of that population.
Significant Differences
A difference is considered significant if the probability of getting that difference by random chance is very small.
P value:
The probability of obtaining the observed difference by chance alone
Historically we use p < 0.05
The probability of detecting a significant difference is influenced by:
The magnitude of the effect
A big difference is more likely to be significant than a small one
The probability of detecting a significant difference is influenced by:
The spread of the data
If the standard deviation is low, it will be easier to detect a significant difference
Hypothesis testing
- Hypothesis:
- A statement which can be proven false
- Null hypothesis (H0):
- There is no difference
- H0: μ1 = μ2
- Alternative hypothesis (HA):
- There is a difference
- HA: μ1 ≠ μ2
- In statistical testing, we try to reject the null hypothesis
- If the null hypothesis is false, it is likely that our alternative hypothesis is true
- "False" means there is only a small probability that the results we observed could have occurred by chance
Common probability levels
Alpha level   Probability   Result               Reject null hypothesis?
P > 0.05      n/a           Not significant      No
P < 0.05      1 in 20       Significant          Yes
P < 0.01      1 in 100      Significant          Yes
P < 0.001     1 in 1000     Highly significant   Yes
Common statistical tests
Question                                                        Test
Does a single observation belong to a population of values?     Z-test
Are two (or more) populations of numbers different?             t-test; F-test (ANOVA)
Is there a relationship between x and y?                        Regression
Is there a trend in the data (a special case of the above)?     Regression
The z distribution: Standard normal distribution
The Z-distribution is a Normal Distribution with special properties:
Mean = 0, Variance = 1
Z = (observed value - mean) / standard error
Standard error = standard deviation / sqrt(n)
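As a sketch of the formula above (the numbers here are hypothetical, chosen so SE and Z come out round):

```python
import math

# Z = (observed value - mean) / standard error,
# with standard error = standard deviation / sqrt(n).
population_mean = 100.0
population_sd = 15.0
n = 25
sample_mean = 106.0  # hypothetical observed sample mean

standard_error = population_sd / math.sqrt(n)      # 15 / 5 = 3.0
z = (sample_mean - population_mean) / standard_error
print(f"SE = {standard_error:.2f}, Z = {z:.2f}")   # SE = 3.00, Z = 2.00
```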
The Z distribution
Mean and Standard Deviation of the mean estimates from a population
Just as we calculated the mean of a sample, we can also calculate the mean of means (sum them up and divide by the number of estimates).
The standard deviation of the mean estimates is called the standard error, and is given by:
SE = s / √N
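The claim that the standard deviation of the mean estimates equals s / √N can be illustrated by simulation (a sketch; the population mean and SD below are made up):

```python
import math
import random
import statistics

# Simulate many samples of size N, take the mean of each, and compare
# the SD of those means with the predicted standard error s / sqrt(N).
random.seed(1)
sigma, N = 10.0, 25   # hypothetical population SD and sample size
sample_means = [
    statistics.mean(random.gauss(50.0, sigma) for _ in range(N))
    for _ in range(5_000)
]

observed_se = statistics.stdev(sample_means)
predicted_se = sigma / math.sqrt(N)   # SE = s / sqrt(N) = 2.0
print(f"observed SE = {observed_se:.2f}, predicted SE = {predicted_se:.2f}")
```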
Are two populations different: The t-test
- Also called Student's t-test. "Student" was a pseudonym for a statistician who worked for the Guinness brewery
- Useful for small samples (<30)
- One of the most basic statistical tests; can be performed in Excel or any common statistical package
- Same principle as the Z-test: calculate a t value, and assess the probability of getting that value
[Figure: three pairs of group distributions; the difference between the means is the same in all three]
But the three situations don't look the same: the two groups that appear most different or distinct are in the bottom, low-variability case.
This leads us to a very important conclusion: differences between the scores for two groups are evaluated based on the difference between their means relative to the spread or variability of their scores.
The t-test does just this.
Statistical Analysis of the t-test
The formula for the t-test is a ratio
The top part of the ratio is just the difference between the two means or averages
The bottom part is a measure of the variability or dispersion of the scores.
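Written out, this ratio takes the pooled two-sample form (this specific formula is an assumption on my part, but it is consistent with the pooled estimator sp² used in the problem-solving slide later):

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}
         {\sqrt{s_p^2 \left( \tfrac{1}{n_1} + \tfrac{1}{n_2} \right)}},
\qquad
s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}
```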
STUDENT'S T-TEST
General Procedure
First calculate the t-test statistic,
then compare that value to the critical t-value located in the t-table for the relevant degrees of freedom.
The t-distribution is tabled with several different probability levels as columns and degrees of freedom as rows.
STUDENT'S T-TEST
To calculate your t-value, you need to first calculate the mean (x bar) and the standard error of EACH of your samples.
Remember:
The standard error of a sample mean is the sample standard deviation divided by the square root of the sample size
Compare your t-value with the critical t-value
For a given df, if your t-value is larger than the value found in the table,
the null hypothesis of no difference between the means should be rejected.
Types of Student's t tests
for Quality Control
One-Sample
Two Independent (Unpaired) Samples
Two Dependent (Paired) Samples
The calculation of the t-value is different for each of these tests
One-sample Student's t test
Used to compare a population mean inferred from a sample with a hypothetical population mean (a standard or specification)
t-value for a Single Sample (sample vs. standard):
Sample mean
minus
Standard (population) mean
Divided by the standard error of the mean
t=(Sample Mean - Hypothetical Mean)/SEM
Many are confused about the difference between the standard deviation (SD) and the standard error of the mean (SEM).
The SD quantifies scatter: how much the values vary from one another. On average, the SD will stay the same as sample size gets larger.
The SEM quantifies how accurately you know the true population mean. The SEM gets smaller as your samples get larger, simply because the mean of a large sample is likely to be closer to the true mean than is the mean of a small sample.
Two-tailed and One-tailed versions of these tests
Two-tailed test: evaluates whether a difference exists between 2 samples, not the direction of the difference
One-tailed test: evaluates whether a difference exists between 2 samples, and specifically evaluates the direction of the difference (whether one sample is larger or smaller than the other)
One-tailed: use if you know a priori that the data can only trend in one direction (all of α = 0.05 in one tail)
Two-tailed: use if you do not know a priori which direction the data will trend (α/2 in each tail)
[Figure: rejection regions for the one-tailed and two-tailed tests]
Example
t-test for a Single Sample
You are responsible for the operation of all equipment in a prepress area. A film processor in this area is designed to develop film at a standard temperature of 65 degrees.
A sample of twenty measurements is made over the course of a day, with a mean of 70.5 and a variance of 121.
Is your processor temperature significantly different from the standard? Use α = 0.05.
This t-test is done as a two-tailed test at α = 0.05.
H0: mean = 65    H1: mean ≠ 65
First, compute the standard deviation from the variance:
s = SQRT(s²) = SQRT(121) = 11
Next, compute the standard error of the mean:
sem = s/SQRT(n) = 11/SQRT(20) = 11/4.47 = 2.46
Compute the t-test:
t = (Xbar - μ)/sem = (70.5 - 65)/2.46 = +2.24
The degrees of freedom, df = n - 1 = 20 - 1 = 19
The critical value for t at α = 0.05 is ±2.09
Since +2.24 > +2.09, it is concluded that the temperature of your processor is significantly higher than the standard. You reject H0.
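The arithmetic in this example can be reproduced in a few lines (the critical value 2.09 is taken from the slide's t-table, not computed):

```python
import math

# One-sample t-test from the film-processor example above.
standard_temp = 65.0   # the hypothetical population mean (the specification)
sample_mean = 70.5
variance = 121.0
n = 20

s = math.sqrt(variance)                   # s = 11
sem = s / math.sqrt(n)                    # about 2.46
t = (sample_mean - standard_temp) / sem   # about +2.24

critical_t = 2.09  # from the t-table at alpha = 0.05, df = 19, two-tailed
print(f"t = {t:.2f}; reject H0: {abs(t) > critical_t}")
```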
Table for t-Statistic
v = degrees of freedom
v = (n1 - 1) + (n2 - 1)
Patients were given one of two drug treatments and blood-clotting times (minutes) were measured

Experiment #   Placebo Drug (mean values)   New Drug (mean values)
1              8.8                          9.9
2              8.4                          9.0
3              7.9                          11.1
4              8.7                          9.6
5              9.1                          8.7
6              9.6                          10.4
Problem Solving: t-Test
Calculate the mean and S.E.M. for the control and treatment group.
Calculate the pooled sample estimator, sp²
Calculate the t-Statistic
Look up the t-Statistic for α = 0.05
Compare with your calculated t-Statistic.
Would you conclude that there is a significant difference between the two groups?
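A minimal sketch of these steps applied to the blood-clotting table above (the critical value 2.228 for df = 10 at α = 0.05, two-tailed, is taken from a standard t-table):

```python
import math
import statistics

# Pooled two-sample t-test on the blood-clotting data (placebo vs. new drug).
placebo = [8.8, 8.4, 7.9, 8.7, 9.1, 9.6]
drug = [9.9, 9.0, 11.1, 9.6, 8.7, 10.4]
n1, n2 = len(placebo), len(drug)

mean1, mean2 = statistics.mean(placebo), statistics.mean(drug)
var1, var2 = statistics.variance(placebo), statistics.variance(drug)

# Pooled sample variance sp^2
sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
se_diff = math.sqrt(sp2 * (1 / n1 + 1 / n2))
t = (mean2 - mean1) / se_diff

df = (n1 - 1) + (n2 - 1)   # 10
critical_t = 2.228         # t-table, alpha = 0.05 two-tailed, df = 10
print(f"t = {t:.2f}, df = {df}; significant: {abs(t) > critical_t}")
```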
Regression defined
A statistical technique to define the relationship between a response variable and one or more predictor variables
Here, fish length is a predictor variable (also called an independent variable).
Fish weight is the response variable.
[Figure: scatter plot of Fish Weight (oz) vs. Fish Length (in)]
Regression and correlation
Regression:
Identify the relationship between predictor and response variables
Correlation:
Estimate the degree to which two variables vary together
Does not express one variable as a function of the other
No distinction between dependent and independent variables
Do not assume that one is the cause of the other
Do typically assume that the two variables are both effects of a common cause
Basic linear regression
Assumes there is a straight-line relationship between a predictor (or independent) variable X and a response (or dependent) variable Y
Equation for a line:
Y = mX + b
m: the slope coefficient (increase in Y per unit increase in X)
b: the constant or Y intercept (value of Y when X = 0)
[Figure: fitted line on the Fish Weight (oz) vs. Fish Length (in) scatter plot]
Basic linear regression
Regression analysis finds the best-fit line that describes the dependence of Y on X
Outputs of regression:
Regression model: Y = mX + b
Weight = 4.48*Length - 28.722
Coefficient of Determination: R² = 0.89
[Figure: best-fit line on the Fish Weight (oz) vs. Fish Length (in) scatter plot]
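A least-squares fit like the one above can be computed directly. Since the raw fish data are not in the slides, the points below are hypothetical and only shaped to resemble the plot:

```python
import statistics

# Least-squares fit of weight on length (hypothetical data; lengths in
# inches, weights in oz).
lengths = [5, 7, 9, 11, 13, 15]
weights = [2.0, 5.0, 11.0, 18.0, 30.0, 40.0]

mean_x, mean_y = statistics.mean(lengths), statistics.mean(weights)

sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, weights))
sxx = sum((x - mean_x) ** 2 for x in lengths)

m = sxy / sxx             # slope: increase in weight per inch of length
b = mean_y - m * mean_x   # intercept: value of Y when X = 0

# Coefficient of determination R^2: explained variation / total variation
ss_tot = sum((y - mean_y) ** 2 for y in weights)
ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(lengths, weights))
r2 = 1 - ss_res / ss_tot

print(f"Weight = {m:.2f}*Length + {b:.2f}, R^2 = {r2:.2f}")
```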
How good is the fit? The Coefficient of Determination
R²: the proportion of the total variation that is explained by the regression
R² = 0.89
Ranges from 0.00 to 1.00:
0.00 = no correlation
1.00 = perfect correlation (no scatter around the line)
[Figure: the fish length/weight regression, R² = 0.89]
Example coefficients of determination
[Figure: two scatter plots, left with R² = 0.08 and right with R² = 0.54]