
1

INSE 6220 -- Week 4


Advanced Statistical Approaches to Quality

Inferences about Process Control


Sampling and Estimation
Confidence intervals
Control Charts and hypothesis testing
Statistical basis for Control Charts

Dr. A. Ben Hamza Concordia University


2

Using the normal cdf and pdf


We often want to talk about percentage points of the distribution, i.e. the area in the tails.

$P(Z > z_{\alpha/2}) = \frac{\alpha}{2} = 1 - P(Z \le z_{\alpha/2}) = 1 - \Phi(z_{\alpha/2})$

$\Phi(z_{\alpha/2}) = 1 - \frac{\alpha}{2} \quad\Longrightarrow\quad z_{\alpha/2} = \Phi^{-1}\!\left(1 - \frac{\alpha}{2}\right)$

>> icdf('normal',1-α/2,0,1)

Also, we have: $P(Z \le -z_{\alpha/2}) = \Phi(-z_{\alpha/2}) = \frac{\alpha}{2}$

Example: $z_{0.20/2} = z_{0.10} = 1.2816$

$z_{0.05/2} = z_{0.025} = 1.96$
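
These percentage points are easy to check numerically in MATLAB (icdf and the shortcut norminv are both in the Statistics and Machine Learning Toolbox):

alpha = 0.20;  icdf('normal', 1 - alpha/2, 0, 1)   % z_0.10  = 1.2816
alpha = 0.05;  icdf('normal', 1 - alpha/2, 0, 1)   % z_0.025 = 1.96
norminv(0.975)                                      % same value via the shortcut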
3
Moments of the population vs. sample statistics
Population vs. Sample:

Mean
  Population: $\mu_X = E(X)$
  Sample: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$

Variance
  Population: $\sigma_X^2 = \mathrm{Var}(X) = E[(X - \mu_X)^2] = E(X^2) - [E(X)]^2$
  Sample: $S_X^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$

Standard Deviation
  Population: $\sigma_X = \sqrt{\sigma_X^2}$
  Sample: $S = \sqrt{S^2}$

Covariance
  Population: $\sigma_{XY} = \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y)$
  Sample: $S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$

Correlation Coefficient
  Population: $\rho_{XY} = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} = \dfrac{\sigma_{XY}}{\sigma_X \sigma_Y}$
  Sample: $r_{XY} = \dfrac{S_{XY}}{S_X S_Y}$
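
As a small illustration with hypothetical data, the sample-side quantities above map directly onto built-in MATLAB functions:

x = [2.1 1.9 2.4 2.0 2.2];      % hypothetical sample of X
y = [5.0 4.7 5.6 4.9 5.3];      % hypothetical sample of Y
xbar = mean(x);                 % sample mean
s2   = var(x);                  % sample variance (divides by n-1)
s    = std(x);                  % sample standard deviation
C    = cov(x, y);               % 2x2 matrix; C(1,2) is the sample covariance S_XY
R    = corrcoef(x, y);          % 2x2 matrix; R(1,2) is the sample correlation r_XY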
4
Statistical Inference
The purpose of statistical inference is to obtain information about a population from
information contained in a sample.
A population is the set of all the elements of interest.
A sample is a subset of the population.
The sample results provide only estimates of the values of the population
characteristics.
A parameter is a numerical characteristic of a population.
With proper sampling methods, the sample results will provide good estimates of
the population characteristics.
In point estimation we use the data from the sample to compute a value of a
sample statistic that serves as an estimate of a population parameter.
We refer to X̄ as the point estimator of the population mean μ.
s is the point estimator of the population standard deviation σ.
When the expected value of a point estimator is equal to the population parameter,
the point estimator is said to be unbiased.
5
Sampling and Estimation
Sampling: act of making observations from populations
Random sampling: when each observation is identically and
independently distributed (i.i.d.)
Statistic: a function of sample data; a value that can be computed from
data (contains no unknowns)
average, median, standard deviation
A statistic is a random variable, which itself has a sampling distribution
i.e., if we take multiple random samples, the value for the statistic will be different
for each set of samples, but will be governed by the same sampling distribution
If we know the appropriate sampling distribution, we can reason about the
population based on the observed value of a statistic
E.g. we calculate a sample mean from a random sample; in what range do we
think the actual (population) mean really sits?
6
Point and Interval Estimators
A point estimator draws inferences about a population by estimating the
value of an unknown parameter using a single value or point.

An interval estimator draws inferences about a population by estimating the
value of an unknown parameter using an interval.

That is, we say (with some ___% certainty) that the population parameter of
interest is between some lower and upper bounds.
7
Point & Interval Estimation
For example, suppose we want to estimate the mean summer income of a
class of Quality Systems Engineering students. For n = 25 students, the
sample mean X̄ is calculated to be 400 $/week. This is a point estimate.

An alternative statement, an interval estimate, is:

The mean income is between 380 and 420 $/week.
8
Population vs. Sampling Distribution
9
Sampling Distributions

The probability distribution of a statistic is called a sampling distribution.

Sampling distribution for the sample mean when the sample is large: Normal distribution
Sampling distribution for the sample mean when the sample is small: Student-t distribution
Sampling distribution for the sample variance: Chi-squared distribution
Sampling distribution of the ratio of two sample variances: F distribution
10

Estimation Process

[Figure: a population whose mean μ is unknown; a random sample gives X̄ = 50, leading to the statement "I am 95% confident that μ is between 40 and 60."]

11
General Formula

The general formula for all confidence intervals is:

Point Estimate ± (Critical Value)(Standard Error)


Where:
Point Estimate is the sample statistic estimating the population
parameter of interest

Critical Value is a table value based on the sampling distribution


of the point estimate and the desired confidence level

Standard Error is the standard deviation of the point estimate


12
Confidence Level, (1 − α)

Confidence Level
The confidence that the interval will contain the unknown population parameter
A percentage (less than 100%)

Suppose confidence level = 95%


Also written (1 − α) = 0.95, (so α = 0.05)
A relative frequency interpretation:
95% of all the confidence intervals that can be constructed will contain
the unknown true parameter
A specific interval either will contain or will not contain the true
parameter
No probability involved in a specific interval
13
Confidence interval on the mean: variance known

We know σ, e.g. from historical data

Estimate the mean μ in some interval to (1 − α)100% confidence:

$\bar{X} - z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

The width of this interval is $2\,z_{\alpha/2}\,\sigma/\sqrt{n}$.
14
Finding the Critical Value, z_{α/2}
Consider a 95% confidence interval:

1 − α = 0.95, so α = 0.05 and α/2 = 0.025 in each tail
z_{α/2} = 1.96

Z units: −z_{α/2} = −1.96, 0, z_{α/2} = 1.96
X units: lower confidence limit, point estimate, upper confidence limit
15
Example

A sample of 11 circuits from a large normal population has a mean
resistance of 2.20 ohms. We know from past testing that the population
standard deviation is 0.35 ohms. Determine a 95% confidence interval
for the true mean resistance of the population.

Solution:
$\bar{X} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}} = 2.20 \pm 1.96\,(0.35/\sqrt{11}) = 2.20 \pm 0.2068$
$1.9932 \le \mu \le 2.4068$

We are 95% confident that the true mean resistance is between 1.9932 and
2.4068 ohms.
Although the true mean may or may not be in this interval, 95% of intervals
formed in this manner will contain the true mean
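
The same interval can be reproduced in MATLAB (numbers taken from the example above):

n = 11; xbar = 2.20; sigma = 0.35; alpha = 0.05;
z  = icdf('normal', 1 - alpha/2, 0, 1);        % 1.96
ci = xbar + [-1 1] * z * sigma / sqrt(n)       % [1.9932  2.4068]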
16
Do You Ever Truly Know σ?
Probably not!
In virtually all real world situations, σ is not known.
If there is a situation where σ is known then μ is also known (since to calculate σ
you need to know μ.)

If you truly know μ there would be no need to gather a sample to estimate it.

Confidence Interval for μ (σ Unknown)


If the population standard deviation is unknown, we can substitute the
sample standard deviation, S
This introduces extra uncertainty, since S is variable from sample to
sample
So we use the t distribution instead of the normal distribution
17

Confidence Interval on the Mean of a Normal Distribution, Variance Unknown

Want to estimate the population mean, μ, of data that is Normally
distributed with unknown variance (i.e. small sample)
Take a sample of size n from Normally distributed data (Note: n < 40)
Find the sample mean, X̄

$T = \dfrac{\bar{X} - \mu}{S/\sqrt{n}}$

will be Student-t distributed with n − 1 degrees of freedom
How do I know this? Remember that the Student-t distribution is the
sampling distribution of the sample mean when the sample size, n, is small
and the underlying distribution is Normal (or close to Normal)
Calculate the upper and lower CI limits using the Student-t distribution and 1 − α such
that P(LCL ≤ μ ≤ UCL) = 1 − α
18
Sampling: the Chi-Square distribution

$(n-1)\,\dfrac{S^2}{\sigma^2} \;\sim\; \chi^2_{n-1}$
19

Sampling: the Student-t distribution

If Z ~ N(0, 1) and Y ~ χ²_k are independent, then

$T = \dfrac{Z}{\sqrt{Y/k}} \;\sim\; t_k$

is distributed as a Student-t distribution with k degrees of freedom.

Typical use: find the distribution of the average when σ is NOT known

For k → ∞, t_k → N(0, 1)

Consider X_i ~ N(μ, σ²). Then

$\dfrac{\bar{X} - \mu}{s/\sqrt{n}} \;=\; \dfrac{(\bar{X} - \mu)\big/(\sigma/\sqrt{n})}{\sqrt{\chi^2_{n-1}/(n-1)}} \;\sim\; t_{n-1}$

where the numerator is N(0, 1).
This is just the normalized distance from the mean (normalized using our
estimate of the variance).
20
Confidence intervals: variance unknown
Case where we don't know the variance a priori
Now we have to estimate not only the mean based on our data, but also
estimate the variance
Our estimate of the mean to some interval with (1 − α)100% confidence becomes

$\bar{X} - t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}} \;\le\; \mu \;\le\; \bar{X} + t_{\alpha/2,\,n-1}\,\dfrac{s}{\sqrt{n}}$

Note that the t distribution is slightly wider than the normal distribution, so that
our confidence interval on the true mean is not as tight as when we know the
variance.
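
A minimal MATLAB sketch of this interval, using a small hypothetical sample; tinv(1-alpha/2, n-1) returns the upper α/2 point of the t distribution:

x = [10.2 9.8 10.5 10.1 9.9 10.4 10.0 10.3];   % hypothetical data, n = 8
n = numel(x); alpha = 0.05;
xbar = mean(x); s = std(x);
t  = tinv(1 - alpha/2, n - 1);                 % t_{alpha/2, n-1}
ci = xbar + [-1 1] * t * s / sqrt(n)           % (1-alpha)100% CI on mu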
Confidence intervals: Estimate of variance

$\dfrac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}} \;\le\; \sigma^2 \;\le\; \dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}$

The appropriate sampling distribution is the Chi-square.

Because the chi-square distribution is asymmetric, the confidence interval bounds are not symmetric.
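
A corresponding sketch for the variance interval, reusing the same hypothetical sample; chi2inv works with lower-tail probabilities, so the larger quantile goes into the lower limit:

x = [10.2 9.8 10.5 10.1 9.9 10.4 10.0 10.3];   % hypothetical data, n = 8
n = numel(x); alpha = 0.05; s2 = var(x);
lo = (n-1)*s2 / chi2inv(1 - alpha/2, n-1);     % lower limit on sigma^2
hi = (n-1)*s2 / chi2inv(alpha/2, n-1);         % upper limit on sigma^2
[lo hi]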
21

Confidence Interval on the Mean of a Normal Distribution, Small Sample Size


Summary of Procedure
Data Distribution: Normal, i.e. X ~ N(μ, σ²)
Population parameters:
  Mean, μ: unknown, to be estimated
  Variance, σ²: unknown, estimate using S²
Sample Size: small (i.e. n < 40)
Sample Statistic: Sample Mean
Sampling Distribution: Student-t Distribution
Confidence Intervals (CIs):
  2-sided CI: $\bar{X} \pm t_{\alpha/2,\,n-1}\,\dfrac{S}{\sqrt{n}}$
  Upper 1-sided CI: $\mu \le \bar{X} + t_{\alpha,\,n-1}\,\dfrac{S}{\sqrt{n}}$
  Lower 1-sided CI: $\mu \ge \bar{X} - t_{\alpha,\,n-1}\,\dfrac{S}{\sqrt{n}}$
22
t-distribution table

The shaded area is equal to α for t ≥ t_{α,ν}
ν = degrees of freedom

Example:

n = 16, α = 0.05
t_{α/2, n−1} = t_{0.025, 15} = 2.131
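
The table value can be checked directly in MATLAB:

>> tinv(1 - 0.025, 15)        % = 2.1314
>> icdf('t', 1 - 0.025, 15)   % same value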
23
Confidence Interval on the Variance of a Normal Distribution
Summary of Procedure
Data Distribution: Normal, i.e. X ~ N(μ, σ²)
Population parameters:
  Variance, σ²: unknown, to be estimated
Sample Size: any
Sample Statistic: Sample variance
Sampling Distribution: Chi-squared Distribution
Because the chi-square distribution is asymmetric, the confidence interval bounds are not symmetric.
Confidence Intervals (CIs):
  2-sided CI: $\dfrac{(n-1)S^2}{\chi^2_{\alpha/2,\,n-1}} \le \sigma^2 \le \dfrac{(n-1)S^2}{\chi^2_{1-\alpha/2,\,n-1}}$
  Upper 1-sided CI: $\sigma^2 \le \dfrac{(n-1)S^2}{\chi^2_{1-\alpha,\,n-1}}$
  Lower 1-sided CI: $\sigma^2 \ge \dfrac{(n-1)S^2}{\chi^2_{\alpha,\,n-1}}$
24

Sampling: the F distribution

MATLAB
>> icdf('F',1-0.05/2,19,19) = 2.5265
>> icdf('F',0.05/2,19,19) = 0.3958
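
These two quantiles plug directly into the standard two-sided confidence interval for a ratio of two normal variances; a minimal sketch, assuming two samples of size 20 with hypothetical sample variances:

n1 = 20; n2 = 20; alpha = 0.05;
s1sq = 0.040; s2sq = 0.025;                 % assumed sample variances
fU = finv(1 - alpha/2, n1-1, n2-1);         % 2.5265, as above
fL = finv(alpha/2,     n1-1, n2-1);         % 0.3958, as above
ci = [(s1sq/s2sq)/fU, (s1sq/s2sq)/fL]       % CI on sigma1^2 / sigma2^2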
25
Hypothesis Testing

Null Hypothesis (pronounced "H nought"): H0: μ = 1.10
Alternative Hypothesis: H1: μ ≠ 1.10

A hypothesis test is a procedure for determining if an assertion about a characteristic of a
population is reasonable.
Example 1: The mean monthly cell phone bill in this city is μ = $42
Example 2: The proportion of adults in this city with cell phones is p = 0.68

Example 3: Suppose that someone says that the average price of a liter of regular unleaded
gas in Montreal is $1.10. How would you decide whether this statement is true? You could try
to find out what every gas station in the city was charging and how many liters they were
selling at that price. That approach might be definitive, but it could end up costing more than
the information is worth. A simpler approach is to find out the price of gas at a small number of
randomly chosen stations around the city and compare the average price to $1.10.
Of course, the average price you get will probably not be exactly $1.10 due to variability in
price from one station to the next. Suppose your average price was $1.18. Is this eight-cent
difference a result of chance variability, or is the original assertion incorrect? A hypothesis test
can provide an answer.
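
As a sketch of how such a test could be run in MATLAB on hypothetical station prices (ttest treats σ as unknown and uses the t distribution):

prices = [1.18 1.22 1.15 1.17 1.24 1.19 1.13 1.20 1.16 1.21];  % hypothetical sample
[h, p, ci] = ttest(prices, 1.10, 'Alpha', 0.05);
% h  = 1  -> reject H0: mu = 1.10 at the 5% level
% p       -> p-value of the test
% ci      -> 95% confidence interval for the mean price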
26
Hypothesis Test Terminology
The significance level is related to the degree of certainty you require in order to reject the
null hypothesis in favor of the alternative. By taking a small sample you cannot be certain
about your conclusion. So you decide in advance to reject the null hypothesis if the
probability of observing your sampled result is less than the significance level. For a
typical significance level of 5%, the notation is α = 0.05. For this significance level, the
probability of incorrectly rejecting the null hypothesis when it is actually true is 5%. If you
need more protection from this error, then choose a lower value of α.

The p-value is the probability of observing the given sample result under the assumption
that the null hypothesis is true. If the p-value is less than α, then you reject the null
hypothesis. For example, if α = 0.05 and the p-value is 0.03, then you reject the null
hypothesis. The converse is not true. If the p-value is greater than α, you have insufficient
evidence to reject the null hypothesis.

The null hypothesis is always about a population parameter, not about a sample
statistic

Correct: H0: μ = 3        Incorrect: H0: X̄ = 3
27
Type I and Type II Errors
Since hypothesis tests are based on sample data, we must allow
for the possibility of errors.
A Type I error is rejecting H0 when it is true.
The person conducting the hypothesis test specifies the
maximum allowable probability of making a
Type I error, denoted by and called the level of significance.
A Type II error is accepting H0 when it is false.
Generally, we cannot control for the probability of making a Type
II error, denoted by .
Statisticians avoid the risk of making a Type II error by saying "do
not reject H0" rather than "accept H0".

P(Type I error) = α
P(Type II error) = β
28
Inference on the mean of a population, variance known
$H_0: \mu = \mu_0$
$H_1: \mu \ne \mu_0$     (3-22)

$Z_0 = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$     (3-23)

H1 in equation (3-22) is a two-sided alternative hypothesis

The procedure for testing this hypothesis is to:
take a random sample of n observations on the random variable x,
compute the test statistic, and
reject H0 if |Z0| > z_{α/2}, where z_{α/2} is the upper α/2 percentage point of the
standard normal distribution.
In some situations we may wish to reject H0 only if the true mean is larger
than μ0
Thus, the one-sided alternative hypothesis is H1: μ > μ0, and we would reject
H0: μ = μ0 only if Z0 > z_α
If rejection is desired only when μ < μ0
Then the alternative hypothesis is H1: μ < μ0, and we reject H0 only if Z0 < −z_α
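
A minimal sketch of the two-sided version of this test in MATLAB, using the gas-price scenario as a hypothetical illustration (the values of sigma, n, and x-bar are assumed):

mu0 = 1.10; sigma = 0.10; n = 30; xbar = 1.18; alpha = 0.05;  % assumed values
Z0 = (xbar - mu0) / (sigma / sqrt(n));        % test statistic (3-23)
zcrit = icdf('normal', 1 - alpha/2, 0, 1);    % z_{alpha/2} = 1.96
reject = abs(Z0) > zcrit                      % true -> reject H0: mu = mu0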
29
Confidence interval on the mean, variance known

Furthermore, a 100(1 )% upper confidence bound on is

whereas a 100(1 )% lower confidence bound on is


30
A Summary of Forms for Null and Alternative Hypotheses about a Population Mean

The equality part of the hypotheses always appears in the null hypothesis.
In general, a hypothesis test about the value of a population mean μ
must take one of the following three forms (where μ0 is the
hypothesized value of the population mean).

H0: μ ≥ μ0        H0: μ ≤ μ0        H0: μ = μ0
H1: μ < μ0        H1: μ > μ0        H1: μ ≠ μ0

One-tailed        One-tailed        Two-tailed
31
Example
32
Example
33

F-test statistic
34
Introduction to control charts
Principal purpose: early detection of an out-of-control process
A process is out of control if it is producing items which are
off target or
too variable
An out-of-control process is likely to produce many nonconforming items
If an assignable cause can be found, the process can be corrected and
brought back into control.
A capable, in-control process will produce fewer nonconforming items.

Basic principle: Samples of measurements are periodically taken at one or
more stages of a production process to provide data for the monitoring
of the process.
Based on each sample, a statistic is computed and plotted against time.
The result is a time series of the observed statistic values.
Control Charts 35

A Control Chart is a graphical method to spot assignable
cause variation quickly.
Two important lines on a control chart are the upper
control limit (UCL) and lower control limit (LCL).
These lines are chosen so that when the process is in
control, there will be a high probability that the sample
finding will be between the two lines.
Values outside of the control limits provide strong
evidence that the process is out of control.
A range is specified within which the statistic is likely to
have come from the same distribution as the preceding
data.
A control chart is like a hypothesis test.
Control Limits are used to determine if the process
is in a state of statistical control (i.e., is producing
consistent output).
Specification Limits are used to determine if the
product will function in the intended fashion.
36

Control charts and hypothesis testing


Null hypothesis H0: process is in-control
Alternative hypothesis H1: process is out-of-control


When a point plots within the control limits, the null hypothesis is not
rejected
When a point plots outside the control limits, the null hypothesis is
rejected
Type I error:
1. Rejecting the null hypothesis when it is true
2. Concluding the process is out of control when it isn't
3. False Alarm: an in-control point plots outside the control limits
Type II error:
1. Not rejecting the null hypothesis when it is false
2. Failing to detect an out of control condition: an out-of-control point plots inside the
control limits
Types of Control Charts 37

An x̄ chart is used if the quality of the output is measured in terms of a variable such as
length, weight, temperature, and so on.
x̄ represents the mean value found in a sample of the output.
An R chart is used to monitor the range of the measurements in the sample.
A p chart is used to monitor the proportion defective in the sample.
An np chart is used to monitor the number of defective items in the sample.

>> controlchart(data,'chart','xbar','sigma','range','rules','we6');
38
Shewhart Control Charts
Suppose we have a general statistic W
We plot W over time
We specify control limits of the form
$UCL = \mu_W + 3\,\sigma_W$
$CL = \mu_W$
$LCL = \mu_W - 3\,\sigma_W$
where μ_W = mean of W and σ_W = standard deviation of W
A control chart based on a number of standard deviations of the statistic
from the mean of the statistic is called a Shewhart Control Chart
Some commonly used Ws
X-bar: Average
R: Range
s: Standard deviation
We can also specify control charts using probability limits
39

X-bar Control Charts


We don't know μ and σ, so we must estimate them
If we have m subgroups, with averages $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_m$,
then the best estimate for μ is

$\bar{\bar{X}} = \dfrac{\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_m}{m}$

Suppose we have subgroup ranges ($X_{\max} - X_{\min}$):

$R_1, R_2, \ldots, R_m, \qquad \bar{R} = \dfrac{R_1 + R_2 + \cdots + R_m}{m}$

It turns out that $\bar{R}$ is a biased estimator of σ, with biasing term $d_2$

So an unbiased estimator is given by: $\hat{\sigma} = \dfrac{\bar{R}}{d_2}$
40
X-bar Control Charts

Control Limits

$UCL = \bar{\bar{X}} + 3\,\sigma_{\bar{X}}$
$CL = \bar{\bar{X}}$
$LCL = \bar{\bar{X}} - 3\,\sigma_{\bar{X}}$

Therefore, with $A_2 = \dfrac{3}{d_2\sqrt{n}}$:

$UCL = \bar{\bar{X}} + \dfrac{3}{d_2\sqrt{n}}\,\bar{R} = \bar{\bar{X}} + A_2\,\bar{R}$
$CL = \bar{\bar{X}}$
$LCL = \bar{\bar{X}} - \dfrac{3}{d_2\sqrt{n}}\,\bar{R} = \bar{\bar{X}} - A_2\,\bar{R}$
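
A sketch of these limits computed from m subgroups in MATLAB; the subgroup data are hypothetical, and A2 = 0.577 is the standard table constant for subgroups of size n = 5 (d2 = 2.326):

data = [1.3 1.6 1.4 1.5 1.7;
        1.5 1.4 1.6 1.5 1.3;
        1.6 1.5 1.4 1.7 1.5];                    % m = 3 subgroups of size n = 5, one per row
xbar    = mean(data, 2);                         % subgroup averages
R       = max(data, [], 2) - min(data, [], 2);   % subgroup ranges
xbarbar = mean(xbar);  Rbar = mean(R);
A2  = 0.577;                                     % = 3/(d2*sqrt(n)) for n = 5
UCL = xbarbar + A2 * Rbar;
CL  = xbarbar;
LCL = xbarbar - A2 * Rbar;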
41
R-Control Charts
We are looking to make control charts of the form

$LCL = \mu_R - 3\,\sigma_R$
$UCL = \mu_R + 3\,\sigma_R$

The best estimate for μ_R is $\bar{R}$

What about σ_R?
It turns out that: $\sigma_R = d_3\,\sigma$

So

$\hat{\sigma}_R = d_3\,\dfrac{\bar{R}}{d_2}$

Thus

$LCL = \bar{R} - 3\,d_3\,\dfrac{\bar{R}}{d_2} = \bar{R}\left(1 - \dfrac{3\,d_3}{d_2}\right) = D_3\,\bar{R}$

$UCL = \bar{R} + 3\,d_3\,\dfrac{\bar{R}}{d_2} = \bar{R}\left(1 + \dfrac{3\,d_3}{d_2}\right) = D_4\,\bar{R}$
42
43

Example

x̄ Chart:
$UCL = \bar{\bar{x}} + A_2\,\bar{R} = 1.5056 + (0.577)(0.32521) = 1.69325$
Central line: $\bar{\bar{x}} = 1.5056$
$LCL = \bar{\bar{x}} - A_2\,\bar{R} = 1.5056 - (0.577)(0.32521) = 1.31795$

R Chart:
$UCL = D_4\,\bar{R} = (2.114)(0.32521) = 0.68749$
Central line: $\bar{R} = 0.32521$
$LCL = D_3\,\bar{R} = (0)(0.32521) = 0$

$\bar{\bar{x}} = 1.50345$
$\bar{R} = 0.3360$
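
The chart-limit arithmetic above is easy to verify in MATLAB with the same constants (A2 = 0.577, D3 = 0, D4 = 2.114, the standard values for subgroups of size 5):

xbarbar = 1.5056;  Rbar = 0.32521;
A2 = 0.577;  D3 = 0;  D4 = 2.114;
xbar_limits = xbarbar + [-1 0 1] * A2 * Rbar   % [LCL CL UCL] = [1.31795 1.5056 1.69325]
R_limits    = [D3 1 D4] * Rbar                 % [LCL CL UCL] = [0 0.32521 0.68749]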
44
X-bar chart using MATLAB
>> load parts
>> controlchart(runout,'chart','xbar','sigma','range','rules','we6');

Interpreting Charts:
Observations outside control limits indicate the process
is probably out-of-control
Significant patterns in the observations indicate the
process is probably out-of-control
Random causes will on rare occasions indicate the
process is probably out-of-control when it actually is
not
