Conclusion of Session 1
Q&A
Session 2 Agenda
CPD
The graph of the cpd is a smooth curve, typically bell
shaped if normally distributed
This curve, a function of x, is denoted by the symbol f(x)
and is interchangeably called a probability density
function, frequency function or probability distribution
The areas under the probability distribution (area under
the curve, or AUC) correspond to probabilities for x (see next
slide)
AUC
The area shaded is the probability that x is between 1 and
2.
Computing AUC
Computing AUC is somewhat complex for the everyday
user
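As a rough illustration of what a table lookup replaces, the sketch below approximates the shaded area between 1 and 2 by summing thin trapezoids under f(x), then compares the result with the built-in CDF. It assumes x is standard normal (mean 0, standard deviation 1), which the slide does not state; `area_under_curve` is a hypothetical helper name.

```python
from statistics import NormalDist

# Assumption: x is standard normal; the slide does not name the distribution.
std_normal = NormalDist(mu=0, sigma=1)

def area_under_curve(dist, a, b, steps=10_000):
    """Approximate the area under dist.pdf between a and b (trapezoid rule)."""
    width = (b - a) / steps
    total = 0.5 * (dist.pdf(a) + dist.pdf(b))
    for i in range(1, steps):
        total += dist.pdf(a + i * width)
    return total * width

approx = area_under_curve(std_normal, 1, 2)
exact = std_normal.cdf(2) - std_normal.cdf(1)
print(f"trapezoid AUC: {approx:.4f}, CDF difference: {exact:.4f}")
```

Both numbers should agree to several decimal places, which is why in practice we read the area from a table or a CDF function rather than integrating by hand.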
Z scores
There is another type of z table that calculates the probability that a
variable is to the right or left of a value.
They are also based on the same AUC calcs, but represented a bit
differently (See Z score table in appendix)
Note that the values for 1/-1 are complements of each other
Quick exercise: calculate the probability that a value is either to the
left of negative one or to the right of positive one.
To the left of negative one is .1587 and to the right of positive one is
(1 - .8413), or .1587. Add these up and they total .3174. The complement of
that is .6826 (that 68% again).
So we can say that there is a 68% probability that our value is between -1
and 1, or a 32% probability that it is in the tails
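The quick exercise can be checked without a table. This is a minimal sketch using the standard normal CDF from Python's standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

left_tail = z.cdf(-1)        # P(Z < -1)
right_tail = 1 - z.cdf(1)    # P(Z > 1)
both_tails = left_tail + right_tail
between = 1 - both_tails     # P(-1 < Z < 1)

print(f"left {left_tail:.4f}, right {right_tail:.4f}")
print(f"tails {both_tails:.4f}, between {between:.4f}")
```

The last digit can differ slightly from the table arithmetic because the table values are already rounded to four places.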
Exercise two
What is the probability that a standard normal variable
is greater than 1.64 (Do this using both tables)
Exercise Three
Find the probability that a standard normal variable
exceeds 1.96 IN ABSOLUTE VALUE (Draw the problem
out first)
Exercise Four
Assume that the length of time x between charges of a cell
phone is normally distributed with a mean of 10 hours
and a standard deviation of 1.5 hours. Find the probability that
the cell phone will last between 8 and 12 hours between
charges.
Ex Four solution
First, we need to calculate the z score for 8 and for 12
$z(8) = (8 - 10)/1.5 = -1.33$
$z(12) = (12 - 10)/1.5 = 1.33$
Looking up our values, we get a total p of .8164 (.4082
+ .4082)
Therefore, the probability that it lasts from 8 to 12 hours is
about 81.64% (Refer to slide 11)
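The same solution can be sketched in code. The mean and standard deviation come from the exercise; everything else is illustrative.

```python
from statistics import NormalDist

battery_life = NormalDist(mu=10, sigma=1.5)  # hours between charges

z_low = (8 - 10) / 1.5     # z score for 8 hours
z_high = (12 - 10) / 1.5   # z score for 12 hours
p_between = battery_life.cdf(12) - battery_life.cdf(8)

print(f"z(8) = {z_low:.2f}, z(12) = {z_high:.2f}")
print(f"P(8 < x < 12) = {p_between:.4f}")
```

The computed probability comes out slightly above .8164, because the table lookup rounds z to 1.33 while the code uses the unrounded value.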
Exercise Five
Suppose a car manufacturer claims that on average, a
car gets 27 miles per gallon. Although no variability is
mentioned, you find out that the standard deviation is 3
miles per gallon.
1. If you were to buy this car, what is the probability that you
would purchase one that gets less than 20 miles per
gallon?
2. Suppose you bought one and it did in fact get less
than 20 mpg. Should you conclude that your model is
incorrect?
Sampling Distributions
Remember yesterday that we discussed samples as a
method of inferring the statistics (or parameters) of a
population.
                      Population Parameter    Sample Statistic
Mean                  $\mu$                   $\bar{x}$
Variance              $\sigma^2$              $s^2$
Standard Deviation    $\sigma$                $s$
A sample statistic is a numerical descriptive measure of a sample.
We will often use this information to infer the parameters of a
population
Sampling Distributions
However, two samples can yield VERY different results,
thus making VERY different inferences about the
population.
This illustrates an important point
Sample statistics are themselves random variables
Therefore, they must be judged and compared based on their
respective probability distributions
This assumes that the sampling experiments are repeated a VERY large
number of times
Sampling Distributions
An example:
Suppose that in Canada, the daily high temp recorded for ALL past months
of January has a mean of 10 degrees F and a standard deviation of 5. (Note
that all recorded measures would constitute the population in this case, but
that is not often the case)
Now suppose we randomly sample 25 observations, but we do this random
sample several times (with replacement). Sample one shows a mean of 9.8,
sample two a mean of 11.4, sample three a mean of 10.5, etc. If we plotted
these sample means, and if the sample mean is a good estimator of $\mu$, we would
expect the probability distribution of the sample means to cluster around 10.
This probability distribution is called a sampling distribution
In actual practice, we typically would use computer simulations to generate
the sampling distribution.
The standard deviation of the sampling distribution of $\bar{x}$ equals
the standard deviation of the sampled population divided by the square
root of the sample size
Or $\sigma_{\bar{x}} = \sigma/\sqrt{n}$
Note that this is the deviation of the sampling distribution values and is not
the same as the standard deviation of the sample itself
This standard deviation is often referred to as the standard error of the
mean
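A computer simulation of the January temperature example might look like the sketch below. It assumes the population is normally distributed with the stated mean of 10 and standard deviation of 5 (the slide does not give the shape), and the number of repeated samples is an arbitrary choice.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA = 10, 5   # population mean and standard deviation from the example
N = 25              # observations per sample
REPEATS = 5_000     # number of repeated samples (arbitrary choice)

# Assumption: daily highs are drawn from a normal population.
sample_means = [
    mean([random.gauss(MU, SIGMA) for _ in range(N)])
    for _ in range(REPEATS)
]

standard_error = SIGMA / N ** 0.5  # sigma / sqrt(n)
print(f"mean of sample means: {mean(sample_means):.2f}")
print(f"std dev of sample means: {stdev(sample_means):.2f}")
print(f"sigma / sqrt(n): {standard_error:.2f}")
```

The sample means should cluster around 10, and their standard deviation should land close to the standard error of the mean rather than the population standard deviation of 5.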
Exercise Six
A random sample of n=100 observations is selected
from a population with a mean of 30 and standard
deviation of 16
1.
2.
3.
Find P( 28)
4.
Find P( 28.2)
5.
Break
Estimators
A single number that is calculated from a sample and
that estimates the target population parameter is called
a point estimate.
For example, we can use the sample mean to estimate the
population mean
According to our z table, a z score of +/- 1.96 gives us an AUC of
.95, or a 95% probability that our value falls within this interval
Using our formula then, $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ gives us our interval estimate
(or range), for which we are 95% confident that the true population mean
lies within it
The true population mean could lie outside of this range, but that is
CI calculation
Let's assume our sample of 100 patients has a mean stay ($\bar{x}$)
of 4.5 days. Let's also assume that it is known that the standard
deviation of the length of stay is 4 days. Calculate the CI for $\mu$
$\bar{x} \pm 1.96\,\sigma/\sqrt{n} = 4.5 \pm (1.96)(4/\sqrt{100}) = 4.5 \pm .78 = (3.72, 5.28)$
Therefore, we are 95% confident that the average length of stay is
between 3.72 and 5.28 days.
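The length of stay calculation can be sketched as a small function. `z_confidence_interval` is a hypothetical helper name, and the default z of 1.96 corresponds to the 95% level used on this slide.

```python
from math import sqrt

def z_confidence_interval(x_bar, sigma, n, z=1.96):
    """Return (lower, upper) for x_bar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Values from the slide: 100 patients, mean stay 4.5 days, sigma 4 days.
low, high = z_confidence_interval(x_bar=4.5, sigma=4, n=100)
print(f"95% CI: ({low:.2f}, {high:.2f})")
```

Passing a different z (for example 1.645 for 90%) gives the interval at another confidence level.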
A note about
In the last example, can we be certain that the
population mean is actually within the CI? We can't be
CERTAIN, but we can be reasonably confident. What the
CI really means is that if we repeatedly sampled and
formed the CIs each time, 95% of the
intervals would contain $\mu$.
Also, CI actually refers to the interval numbers, whereas
the confidence level, i.e. 95%, is the confidence
coefficient, or probability, expressed as a percentage.
However, in practice the term CI is typically used in
reference to the percentage, or AUC of the interval
estimates.
T-statistic
Remember that our previous CIs, which used the z scores, had
an a priori condition that the sample size usually be large (The
larger the better, but usually at least 30). This was
fundamental to the CLT.
What if the sample is small? This could result in a sample
standard deviation that is a poor approximation of the
population standard deviation.
Whereas the z score utilizes the population standard deviation
(or its estimate), the t-statistic uses the sample standard
deviation (without regard to sample size)
You may have heard the term Student's t-statistic. This term has nothing to do with students. "Student" was the pen name of
William Sealy Gosset, who discovered the t-distribution.
T-statistic
This is noted as such: $t = (\bar{x} - \mu)/(s/\sqrt{n})$
T-statistic
If the sampling is done from a normal distribution, the t statistic has
a sampling distribution very similar to the z statistic
However, the t statistic introduces more variability courtesy of the
sample standard deviation (Which may not accurately approximate
the population standard deviation)
The actual amount of variability depends on the sample size n
This is expressed in degrees of freedom which is (n-1)
Recall from the divisor of s^2 that we used n-1 as opposed to n
It is helpful to think of degrees of freedom as the amount of information
available for estimating the target parameter.
In general, the smaller the number of df, the more variable will be the
sampling distribution
T-statistic tables
As with Z scores, the T statistic is available in a lookup
table (See appendix)
If we wanted to look up the t score $t_{.025}$ with 4
degrees of freedom, we would get 2.776
Note that .025 is the area in the right tail, where our z score would
normally be 1.96
T-statistic example
A pharma company wants to estimate the mean
increase in bp of patients that are taking a new drug.
However, they only have a sample of 6 patients, whose
increases respectively, were (1.7, 3.0, .8, 3.4, 2.7, 2.1).
Use this information to construct a 95% CI for $\mu$, the
average increase in bp.
1st, note that our sample is too small to assume via the CLT that x bar
is normally distributed. Instead, we must assume that our
OBSERVED variables are normally distributed in order for the
distribution of x bar to be normal
2nd, assume we don't know $\sigma$ and can't approximate it
because of the small sample size. Therefore, we must use the t-distribution with (n-1) df
T-statistic example
In this case, n-1 is 5 (6-1) and the t-value for a 95% CI is $t_{.025}$ (.025 in
each tail). Looking up the table, we get a t value of 2.571. To get the
interval, we can use our previous CI formula, but substitute our t for z
Remember our z based formula at 95%: $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$
X bar for the sample is 2.283 with s = .95
Thus we get $\bar{x} \pm 2.571\,s/\sqrt{n} = 2.283 \pm (2.571)(.95/\sqrt{6}) = 2.283 \pm .997$
= (1.286, 3.280)
This is our interval and we can be 95% confident that the mean bp
increase is between these two numbers (THIS STILL ASSUMES
NORMALITY)
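The bp example can be reproduced from the raw data. This is a minimal sketch: the t value of 2.571 is copied from the t-table (the Python standard library has no t-distribution lookup), and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

increases = [1.7, 3.0, 0.8, 3.4, 2.7, 2.1]  # bp increases for the 6 patients
T_CRITICAL = 2.571  # t.025 with 5 df, taken from the t-table

n = len(increases)
x_bar = mean(increases)
s = stdev(increases)               # sample standard deviation (divisor n - 1)
margin = T_CRITICAL * s / sqrt(n)
low, high = x_bar - margin, x_bar + margin

print(f"x bar = {x_bar:.3f}, s = {s:.2f}")
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The endpoints may differ from the slide in the third decimal place, since the slide rounds x bar and s before computing the margin.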
Exercise seven
We are interested in the average number of characters
that can be printed before a printer head fails. Because
the failures are rare, or take a long time, we have little
data: only a sample of 15. The # of characters (in
millions) before failure were (1.13, 1.55, 1.43, .92, 1.25,
1.36, 1.32, .85, 1.07, 1.48, 1.20, 1.33, 1.18, 1.22, 1.29)
Form a 99% CI for the mean # of characters printed
before failure.
What assumptions are required for this interval to be
valid
Sampling distribution of p-hat
The mean of the sampling distribution of p-hat is p;
stated another way, p-hat is an unbiased estimator of p
The standard deviation of the sampling distribution of p-hat is $\sqrt{pq/n}$, where q = 1-p.
For large samples, the sampling distribution of p-hat is
approximately normal
CI for p
$\hat{p} \pm z\sqrt{\hat{p}\hat{q}/n}$, where $\hat{p}$ is p-hat and $\hat{q} = 1 - \hat{p}$
So to calculate the 95% CI for worker happiness, based
on our sample of 637 happy workers out of 1000
$.637 \pm (1.96)\sqrt{(.637)(.363)/1000} = .637 \pm .030 =$
(.607, .667)
Thus we can say we are 95% confident that the interval
from 60.7% to 66.7% contains the true percentage
of ALL FTEs that are happy with their job.
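The worker happiness interval can be sketched as follows. `proportion_confidence_interval` is a hypothetical helper name, and the large-sample normal approximation is assumed, as on the slide.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z=1.96):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p_hat = successes / n
    q_hat = 1 - p_hat
    margin = z * sqrt(p_hat * q_hat / n)
    return p_hat - margin, p_hat + margin

# Values from the slide: 637 happy workers out of 1000, 95% level.
low, high = proportion_confidence_interval(637, 1000)
print(f"95% CI for p: ({low:.3f}, {high:.3f})")
```

The same function works at other confidence levels by passing the matching z, for example z=1.645 for 90%.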
Exercise Eight
A random poll shows that 157 out of 484 consumers
are optimistic about the state of the economy. Using
this information, and a 90% CI, what is the estimated
proportion of optimistic consumers in Florida?
Exercise Eight solution
P-hat = 157/484 = .324
The z score for a 90% CI is 1.645 (we need the tail area (1-.9)/2,
which is .05)
Our formula is:
$.324 \pm (1.645)\sqrt{(.324)(.676)/484} = .324 \pm .035 = (.289, .359)$
This means we are 90% confident that the proportion of people
Hypothesis Testing
As seen with our various z/t tests, we are often
determining if the value of a parameter is greater than,
less than or equal to some specified number. (See
exercise six)
This type of inference is called a test of a hypothesis
We utilize the rare-event concept to reach a decision
To do this, we form two hypotheses called the null
hypothesis and the alternative (or research) hypothesis
Hypothesis Testing
Null Hypothesis, denoted Ho, represents the hypothesis that will
be accepted unless the data provide convincing evidence that
it is false. This usually represents the status quo or some claim
about the population parameter that the researcher wants to
test
Alternative Hypothesis, denoted Ha or H1, represents the
hypothesis that will be accepted only if the data provide
convincing evidence of its truth. This usually represents the
values of a population parameter for which the researcher
wants to gather evidence to support.
A statistical hypothesis is a statement about the numerical value of a
population parameter
Hypothesis example
Suppose a city wants to buy some sewer pipes, but wants to make
sure they meet specifications of having an average breaking
strength of 2,400 pounds per foot.
In this example, we are less interested in the approximate mean of the
population, $\mu$, than we are in testing a hypothesis about its value.
Specifically, we want to decide whether the mean breaking strength of the
pipe exceeds 2,400 pounds per foot.
Our Ho is that the manufacturer's pipe does NOT meet specs, or that:
$\mu \le 2400$
Consider what this might look like within a normal distribution chart
Hypothesis example
This is where it gets tricky
We know from our z table that a z score of 1.65
approximates the point where 95% of the AUC is to the left and
the remaining 5% is in the tail. This 5% is our rejection region
If, in fact, the true mean $\mu$ is 2400, there is only a 5% chance that
the sample mean would be more than 1.65 standard deviations above
2400.
If our null hypothesis Ho were true, a sample mean that far above 2400
would be a rare event. Since we do not expect to observe rare events, we
would instead reject the null hypothesis.
By doing so, we conclude that the alternative hypothesis is likely true.
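The cutoff for the rejection region can be checked in code. This sketch assumes a 5% upper-tail region as described above; `in_rejection_region` is a hypothetical helper, not part of any library.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal
ALPHA = 0.05      # size of the upper-tail rejection region

critical_z = z.inv_cdf(1 - ALPHA)  # z with 95% of the AUC to its left
tail_area = 1 - z.cdf(1.65)        # area to the right of the rounded 1.65

def in_rejection_region(z_stat, cutoff=critical_z):
    """Upper-tail test: reject Ho when the z statistic exceeds the cutoff."""
    return z_stat > cutoff

print(f"critical z = {critical_z:.3f}, area beyond 1.65 = {tail_area:.4f}")
```

The unrounded cutoff is close to 1.645, which is why tables and slides quote either 1.645 or 1.65 for a 5% one-tail test.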
Tails
The previous example was a one-tail (or one-sided) test.
This is because we used an inequality.
If we are looking at equality, we would use a two tailed
test.
Which tail to use
Ho $\le$ some value, use upper tail (one to the right) as rejection
region
Ho $\ge$ some value, use lower tail (one to the left) as rejection
region
Ho = some value, use both tails as rejection region
[Figure: rejection regions for upper tail and two tailed tests; labels shown: Z < -1.645, Z > 1.645]
Exercise nine
Use the setup from the previous rats case
Sample of 100 rats gave us x bar of 1.05 and s of .5.
Determine if we reject or fail to reject Ho
Exercise nine
Assume that we sample 100,000 rats and get a mean of
1.1995 with s of .05. Calculate our z score
$z = (1.1995 - 1.2)/(.05/\sqrt{100000}) = -3.16$
Therefore, we would reject our Ho. However, in this
case, the researcher may decide that even though it is
statistically significant, it may not be practically
significant as 1.1995 and 1.2 are virtually identical.
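The z score for the large rat sample can be computed as below. The inputs are from the slide; treating this as a lower-tail test is an assumption, since the original rats setup is not shown here.

```python
from math import sqrt
from statistics import NormalDist

x_bar, mu_0, s, n = 1.1995, 1.2, 0.05, 100_000

standard_error = s / sqrt(n)
z_stat = (x_bar - mu_0) / standard_error

# Assumption: lower-tail test, so the p value is the area to the left of z.
p_lower = NormalDist().cdf(z_stat)

print(f"z = {z_stat:.2f}, lower-tail p = {p_lower:.5f}")
print(f"observed difference = {x_bar - mu_0:.4f}")
```

With n this large the standard error is tiny, so even a difference of .0005 produces a z score far into the tail. That is the gap between statistical and practical significance.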
P-values
Recall that our z value was 2.12, which was in the
rejection region. If we look up the AUC for 2.12 it is
.9830. Therefore, our p value would be .017.
What this means is that if the value of $\mu$ were actually
2400, as stated in our null hypothesis, there is only a
probability of 1.7% that the observed value of z would
be 2.12 or greater.
Most experiments require a p value of .05 or
lower; the lower, the better.
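The p value lookup can be sketched in a few lines. The z of 2.12 is from the slide; the .05 threshold is the conventional level mentioned above.

```python
from statistics import NormalDist

z_observed = 2.12
auc_left = NormalDist().cdf(z_observed)  # area to the left of z
p_value = 1 - auc_left                   # upper-tail area

print(f"AUC left of z: {auc_left:.4f}, p value: {p_value:.4f}")
print("reject Ho at .05" if p_value <= 0.05 else "fail to reject Ho at .05")
```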
                                       Key words/phrases            Data type   # of samples
$\mu_1 - \mu_2$                        Difference between means     Quant       2 independent samples
                                       Mean of paired differences   Quant       1 paired sample
$\mu_1, \mu_2, \mu_3, \ldots, \mu_k$                                Quant       k independent samples
*Alternatively, t could be z in large sample
Diet example
Appendix
Greek Symbols
Website with statistical symbols
http://www.rapidtables.com/math/symbols/Statistical_Symbols.htm
Greek letters
Z score table
T-table