
Module-4

Basic Statistics
Basic Statistics - Agenda

1. Introduction
2. Variable Data
3. Measures of Location and Dispersion
4. The Normal Distribution
5. Non-Normal Data
6. Data Transformation
7. Testing for Normality
8. Attribute Data
9. Defective Items – Binomial Distribution
10. Defects – Poisson Distribution
11. Z-Tables
Six Sigma - Statistics

A good definition of Statistics:

“The art of making decisions in the light of uncertainty”
Why do we use Statistics?

Decisions are best made with the aid of facts and data

We collect DATA, and statistics translates it into USEFUL INFORMATION.

Raw data:
2 6 2 7
2 2 6 3
4 1 3 2
3 5 1 1
4 5 1 4
1 1 3 2
The Use of Statistics

Statistics can be used to :


• Describe our processes
• Track improvement
• Draw conclusions and predict results
• Control our processes
[Examples: Histogram, Control Chart, Scatter Diagram]
Types of Data

There are two primary types of data:

• Variable Data (also called continuous data)
  Data comes from taking measurements on a continuous scale.
  Examples: Time, Height, Weight

• Attribute Data
  Data comes from counting.
  Examples: Number of defective items, Number of errors, Pass/Fail data
Variable Data
Variable Process Outputs

Process                        Output (y)
Preparing monthly accounts     Time to Complete (hours)
Order Fulfilment               Fulfilment Time (hours)
Expense Claim                  Payment Time
Recruitment Process            Recruitment Time

The process output of the majority of transactional processes is time related. Often we have two outputs, one related to cost (time), the other related to quality (errors).
Sampling

• A population is the complete collection of items being studied
• A sample is a subset of items drawn from the population
• Many statistical procedures require that a random sample is drawn
• A random sample means that each member of the population has an equal chance of being drawn
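The random-sampling idea above can be sketched with Python's standard library; the invoice population below is hypothetical, purely for illustration:

```python
import random

# Hypothetical population: 1000 invoice IDs (illustrative only)
population = list(range(1, 1001))

# Simple random sample of 50: every member of the population
# has an equal chance of being drawn, without replacement.
random.seed(42)  # fixed seed so the sketch is reproducible
sample = random.sample(population, 50)

print(len(sample))  # 50 items drawn
```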
Descriptive v Inferential Statistics

By examining a sample of 50 items we can:

• Describe and summarise the set of data (descriptive statistics)
• Make predictions about the whole population (inferential statistics)
Characteristics of Variable Data

There are many ways to examine a data set in order to improve our understanding:

• Frequency Distribution
• Tally Chart
• Histogram
• Measures of Location - Mean, Median, Mode
• Measures of Variation – Range, Standard Deviation
• etc……
Random Sample - 100 SAT Verbal Scores
546 592 591 602 691
689 644 546 602 695
490 536 618 669 599
531 586 622 689 560
603 555 464 599 618
549 612 641 597 622
663 546 534 740 644
515 496 503 599 618
557 631 502 605 547
673 708 624 528 645
650 656 599 586 536
546 515 644 599 734
502 541 530 663 599
547 579 666 578 635
496 541 605 560 695
426 555 483 641 546
515 609 534 645 572
637 457 631 721 578
541 592 666 619 663
547 624 567 489 528
Random Sample - 100 SAT Verbal Scores
P P P P P
P P P P P
F P P P P
P P P P P
P P F P P
P P P P P
P P P P P
P F P P P
P P P P P
P P P P P
P P P P P
P P P P P
P P P P P
P P P P P
F P P P P
F P F P P
P P P P P
P F P P P
P P P P P
P P P F P

0-499 Fail 8
500+ Pass 92
Random Sample - 100 SAT Verbal Scores
546 592 591 602 691
689 644 546 602 695
490 536 618 669 599
531 586 622 689 560
603 555 464 599 618
549 612 641 597 622
663 546 534 740 644
515 496 503 599 618
557 631 502 605 547
673 708 624 528 645
650 656 599 586 536
546 515 644 599 734
502 541 530 663 599
547 579 666 578 635
496 541 605 560 695
426 555 483 641 546
515 609 534 645 572
637 457 631 721 578
541 592 666 619 663
547 624 567 489 528

400-499 Red 8
500-599 Yellow 48
600+ Green 44
SAT Verbal Data - Arranged in Order
426 536 572 605 645
457 536 578 609 645
464 541 578 612 650
483 541 579 618 656
489 541 586 618 663
490 546 586 618 663
496 546 591 619 663
496 546 592 622 666
502 546 592 622 666
502 546 597 624 669
503 547 599 624 673
515 547 599 631 689
515 547 599 631 689
515 549 599 635 691
528 555 599 637 695
528 555 599 641 695
530 557 602 641 708
531 560 602 644 721
534 560 603 644 734
534 567 605 644 740
Tally Chart
Class   Class limits   Tallies   Frequency
1 425-449 Ι 1
2 450-474 ΙΙ 2
3 475-499 ΙΙΙΙ 5
4 500-524 ΙΙΙΙ Ι 6
5 525-549 ΙΙΙΙ ΙΙΙΙ ΙΙΙΙ ΙΙΙΙ 20
6 550-574 ΙΙΙΙ ΙΙ 7
7 575-599 ΙΙΙΙ ΙΙΙΙ ΙΙΙΙ 15
8 600-624 ΙΙΙΙ ΙΙΙΙ ΙΙΙΙ Ι 16
9 625-649 ΙΙΙΙ ΙΙΙΙ I 11
10 650-674 ΙΙΙΙ ΙΙΙΙ 9
11 675-699 IIII 4
12 700-724 II 2
13 725-749 II 2
Histogram
[Histogram of SAT scores: frequency (0–20) by class, class boundaries 424.5 to 749.5]

A “rule of thumb” for selecting the number of classes is to take the square root of the number of data points.
Histogram - Minitab

[Minitab histogram of SAT scores, 380 to 780]
Dotplot

[Minitab dotplot of SAT scores, 450 to 750]
Measures of Location

The Mode
• The mode is defined as the value in the sample which occurs most frequently.
• If each value occurs the same number of times there is no mode.
• If two or more values occur the same number of times, then there is more than one mode and the sample is said to be multimodal.
Measures of Location

The Median

If the sample values are arranged in order from smallest to largest, the median is defined as:

• The middle value if the sample size is odd
• The number half-way between the two middle values if the sample size is even
Measures of Location

The Arithmetic Mean


This is the most commonly used measure of location, called simply the mean.

Mean = Sum of the values / Number of values

ȳ = (Σ yᵢ, for i = 1 to n) / n, usually abbreviated to Σy / n
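As a quick sketch, Python's standard library computes all three measures of location; the six-value sample below is illustrative only:

```python
from statistics import mean, median, mode

data = [2, 4, 4, 5, 6, 9]  # small illustrative sample

print(mean(data))    # 5   (sum of values / number of values)
print(median(data))  # 4.5 (even sample size: halfway between 4 and 5)
print(mode(data))    # 4   (the value occurring most frequently)
```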
Measures of Dispersion

The Range
The range is defined as the largest sample observation
minus the smallest sample observation

The Standard Deviation (S.D.)


The standard deviation is the square root of the average
squared deviations of all data values from the mean
Why use Standard Deviation?

• The range is insensitive – it does not take into account information about all data values, just the minimum and maximum values.
• The standard deviation takes account of every piece of
data. It is the most sensitive measure of variation and
can be used to predict the proportion of the population
above or below any critical point (such as the
specification).
Why use Standard Deviation?

The standard deviation, s, is the square root of the variance, s², which is often used in statistics because of its additive nature:

s² = Σ(y – ȳ)² / (n – 1)        s = √[ Σ(y – ȳ)² / (n – 1) ]
Standard Deviation - Example

y     y – ȳ     (y – ȳ)²
2
4
4
5
6
9

s = √[ Σ(y – ȳ)² / (n – 1) ]

Where: Σ = Summation, ȳ = Mean, y = Individual Data, n = Number of Data

ȳ = Σy / n = ____        s = ____
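A minimal Python sketch of the worksheet above, filling in the blanks for the sample 2, 4, 4, 5, 6, 9:

```python
import math

data = [2, 4, 4, 5, 6, 9]
n = len(data)

ybar = sum(data) / n                     # mean = 5.0
ss = sum((y - ybar) ** 2 for y in data)  # sum of squared deviations = 28.0

s2 = ss / (n - 1)    # sample variance = 5.6
s = math.sqrt(s2)    # sample standard deviation
print(round(s, 3))   # 2.366
```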
Standard Deviation & Variance

Alternative Formulae:

s² = [ Σy² – (Σy)² / n ] / (n – 1)          s² = σ²ₙ₋₁ = sample variance

s = √{ [ Σy² – (Σy)² / n ] / (n – 1) }      s = σₙ₋₁ = sample standard deviation
Descriptive Statistics

Summary for SAT (Minitab Graphical Summary)

Anderson-Darling Normality Test: A-Squared 0.33, P-Value 0.512
Mean 590.24, StDev 65.10, Variance 4238.08
Skewness 0.026342, Kurtosis -0.397313, N 100
Minimum 426.00, 1st Quartile 542.25, Median 598.00, 3rd Quartile 640.00, Maximum 740.00
95% Confidence Interval for Mean: 577.32 to 603.16
95% Confidence Interval for Median: 570.71 to 605.00
95% Confidence Interval for StDev: 57.16 to 75.63
Sample v Population

Sample Standard Deviation:       s = σₙ₋₁ = √[ Σ(y – ȳ)² / (n – 1) ]
Population Standard Deviation:   σ = √[ Σ(y – µ)² / n ]

Where:
Σ = Summation
ȳ = Mean of Sample          µ = Mean of Population
y = Individual Data
n = Number of Data in the Sample or Population respectively

We normally deal with sample data, and therefore generally use the formula for the sample standard deviation.
Sample Statistics & Population Parameters

Sample: ȳ, s, s²                Population: µ, σ, σ²

Sample Statistics are descriptive measures of the sample:
ȳ = Sample Mean, s = Sample Standard Deviation, s² = Sample Variance

Population Parameters are descriptive measures of the population:
µ = Population Mean, σ = Population Standard Deviation, σ² = Population Variance
Statistical Inference

Sample: ȳ, s, s²                Population: µ, σ, σ²

Statistical inference deals with making inferences about the population parameters (unknown) from information obtained in the sample (known).
The Normal Distribution
The Normal Distribution

• The normal distribution is the most common form of continuous distribution. The measurement of natural phenomena tends to follow the normal distribution.
• Examples of this include the distribution of heights or weights of a randomly selected sample of people.
The Normal Distribution

In industry, many continuous variables tend to follow the normal distribution, such as:

• Dimension of Parts
• Fill Volume

Many continuous variables can also have distributions other than the normal distribution.
Normal Distribution

The normal distribution can be specified by two parameters – the mean and the standard deviation (or variance) of the population. The calculation of these two parameters has been addressed earlier.
Normal Distribution

If a random variable y has a normal distribution with mean µ and variance σ², then the equation of the distribution is:

f(y) = [1 / (σ√(2π))] × e^(−½ × ((y − µ) / σ)²)

Fortunately, we do not need to work through this formula in practice!
The Standard Normal Distribution

[Standard normal curve, horizontal axis: −3σ, −2σ, −σ, ȳ, σ, 2σ, 3σ]

Standardization can be effected by subtracting µ, the mean of y, from y and dividing the difference by σ, the standard deviation of y:

z = (y − µ) / σ

The result of this standardization, a normal distribution with mean 0 and standard deviation 1, is known as the standard normal distribution.
The Standard Normal Distribution

[Standard normal curve, horizontal axis: −3σ, −2σ, −σ, ȳ, σ, 2σ, 3σ]

z = (y − µ) / σ

This equation is extremely useful in determining areas under the standard normal curve. The variable z, the standard normal variable, is used for this purpose and values of z are tabulated in statistical tables.
Standard Normal Distribution

[Standard normal curve, horizontal axis: −3σ, −2σ, −σ, ȳ, σ, 2σ, 3σ]

The mean, ȳ, will always be positioned in the centre of the distribution.

Within the range ȳ ± σ we can expect approximately 68% of the values of the distribution to be found. Within the range ȳ ± 2σ we can expect about 95% of values to be found, and within ȳ ± 3σ we can expect about 99.7% of values to be found.
Calculating Areas Under the Curve

Calculate the value of z:

z = (y − µ) / σ

Find z in the normal table provided. The tabulated result is the proportion of the population expected to be less than the value of y.

To calculate the area between two points, apply the same principles as above and subtract the lower result from the higher result. This will give the proportion of the population expected between the two values of y.
Example

A normal distribution has a mean of 35 and a standard deviation of 5.

[Normal curve, horizontal axis: 20 25 30 35 40 45 50]
Example

We wish to find the proportion of the distribution less than a value of 43, the shaded area under the curve.

[Normal curve, horizontal axis: 20 25 30 35 40 45 50, shaded below 43]

z = (y − µ) / σ = (43 − 35) / 5 = 1.60

Looking up a value of z = 1.60, we see that a value of 0.9452 is tabulated. This means that 94.52% of our population would be expected to have a value less than 43.
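Instead of looking z up in printed tables, the same area can be obtained from SciPy's normal CDF; this is a cross-check of the worked example, not part of the original material:

```python
from scipy.stats import norm

mu, sigma = 35, 5
y = 43

z = (y - mu) / sigma  # (43 - 35) / 5 = 1.6
p = norm.cdf(z)       # area under the standard normal curve below z

print(round(p, 4))    # 0.9452, i.e. 94.52% of the population below 43
```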
Workshop

A normal distribution is found to have a mean of 35 and a standard deviation of 5.

Find the percentage of the distribution expected to have values greater than 30.

[Normal curve, horizontal axis: 20 25 30 35 40 45 50]
Workshop

A normal distribution is found to have a mean of 35 and a standard deviation of 5.

Find the percentage of the distribution expected to have values between 28 and 35.

[Normal curve, horizontal axis: 20 25 30 35 40 45 50]
The Normal Distribution - Things to Remember

• The standardised normal distribution can be used to estimate the proportion of values expected to be less or more than any particular value, often a specification.
• We must not assume that our data set is normal – we need to carry out a test of normality first (more later……)
• Many data sets are non-normal for perfectly understandable reasons.
Confidence Intervals for the Mean
and Standard Deviation
Confidence Intervals

• When we calculate statistics such as the mean for a data set, we are making an estimate of the true value, since we are dealing with a sample of the population
• Based on our estimate from the sample we draw inferences about the population
• Making decisions based on point estimates can be very risky:
  • The true value might vary considerably from the point estimate
  • We should ask: what is the accuracy of the estimate?
• Decisions should always be based on confidence intervals, not point estimates
Confidence Intervals in words

All confidence intervals are constructed around:

point estimate ± (confidence scaling factor × measure of variation) / sample size

• As sample size increases, the C.I. narrows
• As the measure of variation increases, the C.I. widens
• As confidence increases, the C.I. widens


Descriptive Statistics

Summary for mpg (Minitab Graphical Summary) – the confidence intervals are inferential statistics

Anderson-Darling Normality Test: A-Squared 0.63, P-Value 0.092
Mean 33.417, StDev 1.604, Variance 2.572
Skewness -0.21121, Kurtosis -1.16145, N 30
Minimum 30.450, 1st Quartile 31.861, Median 33.844, 3rd Quartile 34.890, Maximum 36.162
95% Confidence Interval for Mean: 32.818 to 34.016
95% Confidence Interval for Median: 32.378 to 34.380
95% Confidence Interval for StDev: 1.277 to 2.156
Confidence Levels
Confidence Level   α (Risk)
90%     0.1     9 times out of 10 the true value will lie within the confidence interval; only 1 in 10 times will it lie outside
95%     0.05    19 times out of 20 the true value will lie within the confidence interval; only 1 in 20 times will it lie outside
99%     0.01    99 times out of 100 the true value will lie within the confidence interval; only 1 in 100 times will it lie outside
99.9%   0.001   999 times out of 1000 the true value will lie within the confidence interval; only 1 in 1000 times will it lie outside
Confidence Interval for the Mean
The confidence interval for the mean can be calculated as follows:

ȳ ± t(α/2) × σₙ₋₁ / √n

Where: ȳ = sample mean
       t(α/2) = t distribution critical value, with n − 1 df
       σₙ₋₁ = sample standard deviation
       n = sample size

Assumes that the underlying distribution of y is normal, but the calculation is fairly robust to violations of this assumption.
Confidence Interval for Mean
(Normal Distribution)
Summary for mpg: Mean 33.417, StDev 1.604, N 30
(Minitab: 95% Confidence Interval for Mean: 32.818 to 34.016)

ȳ ± t(α/2) × σₙ₋₁ / √n

= 33.417 ± 2.045 × 1.604 / √30
= 33.417 ± 2.045 × 0.293
= 33.417 ± 0.599

32.818 < µ < 34.016

We can say with 95% confidence that the true mean lies between 32.818 and 34.016.
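The hand calculation above can be reproduced with SciPy, assuming only the summary statistics shown (not the raw mpg data):

```python
import math
from scipy import stats

n, ybar, s = 30, 33.417, 1.604  # summary statistics from the mpg example

t_crit = stats.t.ppf(0.975, n - 1)      # two-sided 95%: alpha/2 = 0.025
half_width = t_crit * s / math.sqrt(n)  # ≈ 2.045 × 0.293 ≈ 0.599

print(round(ybar - half_width, 3), round(ybar + half_width, 3))
```

This agrees with Minitab's interval of 32.818 to 34.016.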
Confidence Interval Standard Deviation

The confidence interval for the standard deviation can be calculated as follows:

√[ (n − 1)σ²ₙ₋₁ / χ²(n−1, α/2) ]   to   √[ (n − 1)σ²ₙ₋₁ / χ²(n−1, 1−α/2) ]

Where: χ² = critical value of the Chi-Squared distribution
       σₙ₋₁ = sample standard deviation
       n = sample size

This formula assumes normality. Large errors are likely if the underlying distribution is non-normal.
Confidence Interval for Standard Deviation
(Normal Distribution)

Summary for mpg: StDev 1.604, Variance 2.572, N 30
(Minitab: 95% Confidence Interval for StDev: 1.277 to 2.156)

The confidence interval is:

√[ (n − 1)σ²ₙ₋₁ / χ²(n−1, α/2) ]   to   √[ (n − 1)σ²ₙ₋₁ / χ²(n−1, 1−α/2) ]

= √[ (30 − 1) × 2.572 / 45.72 ]   to   √[ (30 − 1) × 2.572 / 16.05 ]

= 1.28 to 2.16

We can say with 95% confidence that the true standard deviation lies between 1.28 and 2.16.
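Similarly, the chi-squared critical values can be taken from SciPy rather than tables, using only the summary statistics above:

```python
import math
from scipy import stats

n, s2 = 30, 2.572  # sample size and variance from the mpg example
df = n - 1

chi2_upper = stats.chi2.ppf(0.975, df)  # ≈ 45.72 (alpha/2 = 0.025)
chi2_lower = stats.chi2.ppf(0.025, df)  # ≈ 16.05

lower = math.sqrt(df * s2 / chi2_upper)  # ≈ 1.28
upper = math.sqrt(df * s2 / chi2_lower)  # ≈ 2.16
print(round(lower, 2), round(upper, 2))
```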
Non-Normal Data
Data Set - Weight
73  78  119 105 73
67  98  102 138 100
92  86  78  63  89
92  66  77  80  124
101 73  85  114 87
77  84  55  80  87
94  80  90  53  67
71  79  84  61  74
83  87  110 83  78
63  63  84  69  106
93  79  90  87  112
95  96  73  78  89
81  77  94  66  79
66  91  78  82  91
83  73  74  103 93
79  53  84  71  68
75  77  81  70  74
59  71  92  89  79
73  67  90  76  86
68  77  77  75  98

The data above represents the weights of a random sample of 100 employees at the Perfect Pizza Factory.
Draw a Picture
[Histogram of Weight: frequency (0–35) by class, 60 to 140]

• It is always a good idea to construct a histogram.
• Does this data look normally distributed?
Testing for Normality
MINITAB has several tests for normality (Anderson-Darling,
Ryan-Joiner and Kolmogorov-Smirnov). The default test is the
Anderson-Darling. The Anderson-Darling test is the most robust
normality test.

The Null Hypothesis H0: The data is normal


The Alternate Hypothesis H1: The data is non-normal

A P-value is returned.

The P-value is the probability of getting the sample data if the null
hypothesis is true. We generally accept that the data is from a
normal distribution if the P-value is greater than 0.05 (alpha risk).
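Outside Minitab, a similar check can be sketched with SciPy. Note that SciPy's `anderson` reports the test statistic and critical values rather than a P-value, so the decision rule compares the statistic against the 5% critical value. The simulated data below is illustrative, not the course data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.normal(loc=590, scale=65, size=100)  # simulated SAT-like scores

result = stats.anderson(data, dist='norm')

# Critical values are given at the 15%, 10%, 5%, 2.5% and 1% levels;
# if the statistic exceeds the 5% critical value, reject normality.
crit_5pct = result.critical_values[2]
print(result.statistic, crit_5pct)
```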
Normality Test

Probability Plot of Weight (Normal)
Mean 82.54, StDev 14.91, N 100, AD 0.954, P-Value 0.015

The P-Value of 0.015 means that we reject the null hypothesis (p < 0.05) and accept the alternate hypothesis that the data is non-normal.
Reasons for Failing a Normality Test
1. A shift occurred in the middle of the data
2. Mixed populations
3. Truncated data
4. Rounding to a small number of values
5. Outliers
6. Too much data
7. The underlying distribution is not normal
With this data set, the most likely reason for failing is that
the underlying distribution is not normal. We generally
would need to investigate the other reasons before
reaching this conclusion!
Data Transformation
[Histogram of Weight: frequency (0–35) by class, 60 to 140]

When the underlying distribution is non-normal, we can attempt to normalise using a data transformation.
Why Transform Non-Normal Data?

We transform non-normal data into normal data in order to use the properties of the normal distribution for predictive purposes.
Data Transformations - Right Skewed Data

[Before: right-skewed distribution with LSL and USL → After: transformed normal distribution with transformed limits]

Taking the logarithm of right-skewed distributions may transform the data into a normal distribution. Note that both the data and the specification limits are transformed.
Data Transformations Right Skewed Data

Data is right skewed and has a lower bound of zero:
  log y,  ln y,  √y,  ∛y,  1/√y,  1/y

Data is right skewed with a non-zero lower bound:
  log(y + c),  √(y + c),  ∛(y + c)

Note: c must be large enough to convert all of the data into positive numbers.
The Box-Cox Transformation
The Box-Cox Transformation in Minitab will automatically try a number of transformations and calculates a “lambda” value which often indicates the transformation most likely to work.

Box-Cox λ (Lambda) values:

λ:          -2.0   -1.0   -0.5    0      0.5    1.0                2.0   3.0
Transform:  1/y²   1/y    1/√y   ln y    √y     y (no transform)   y²    y³

Weight - Normality Test

Probability Plot of Weight (Normal)
Mean 82.54, StDev 14.91, N 100, AD 0.954, P-Value 0.015

The P-Value of 0.015 means that we reject the null hypothesis (p < 0.05) and accept the alternate hypothesis that the data is non-normal.
Histogram – Transformed Data

[Histogram of Ln Weight: frequency (0–25) by class, 4.0 to 4.8]
Normality Test – Transformed Data

Probability Plot of Ln Weight (Normal)
Mean 4.398, StDev 0.1758, N 100, AD 0.325, P-Value 0.518

The P-Value of 0.518 means that we accept the null hypothesis that the transformed data is normal.
Ln(Weight) – Graphical Summary
Summary for Ln Weight (Minitab Graphical Summary)

Anderson-Darling Normality Test: A-Squared 0.33, P-Value 0.518
Mean 4.3978, StDev 0.1758, Variance 0.0309
Skewness 0.184802, Kurtosis 0.561530, N 100
Minimum 3.9703, 1st Quartile 4.2905, Median 4.3820, 3rd Quartile 4.5081, Maximum 4.9273
95% Confidence Interval for Mean: 4.3629 to 4.4327
95% Confidence Interval for Median: 4.3567 to 4.4308
95% Confidence Interval for StDev: 0.1543 to 0.2042
Using our Transformed Data

Once we have successfully transformed our data to a normal distribution, we can use the properties of the normal distribution to:

1. Estimate proportions of the distribution less than, greater than or between two points
2. Calculate Capability Indices (Cp, Cpk, Pp, Ppk)
3. Construct Confidence Intervals on Sample Statistics
4. Carry out Hypothesis Tests on Sample Statistics

We will cover these issues during this course.
Non-Normal Continuous Distributions

• With many data sets, the underlying distribution may be continuous, but not normally distributed.
• If a data set fails a normality test, we should check all the “reasons for failing a normality test”
• In many cases, it is impossible to transform the data set into a normal distribution – this can be due to many reasons
Normality Testing
Reasons for Failing a Normality Test

1. A shift occurred in the middle of the data
2. Mixed populations
3. Truncated data
4. Rounding to a small number of values
5. Outliers
6. Too much data
7. The underlying distribution is not normal
1. Shift in the Middle of the Data

Shift

• Data should be examined using a control chart to check for a shift over time
• A shift could indicate that the process is unstable
(special cause present)
• We should select a period of time without a shift before
carrying out a normality test
2. Mixed Populations

• A “twin peak” or “multiple peak” type histogram could


be an indication that multiple populations exist
• Reasons for this could be different process operators,
alternative process routes, process bottlenecks etc.
• We should separate the distributions before carrying
out a normality test.
3. Truncated Data

• A truncated distribution such as the one above could be


caused by:
• The existence of a physical block (such as zero)
• An inadequate measurement system
• Manipulation (if there is “stacking” just within spec!)
4. Rounding

“Comb Type” “Few Classes”

• Rounding errors, such as favouring even numbers, could


cause a “comb type” distribution. We should endeavour
to use the maximum discrimination of our measurement
system.
• Some data has too few classes. An example of this would
be counting in days when hours would be more
appropriate.
5. Outliers

• Outliers can cause a normality test to fail


• The outlying data should be investigated, and excluded,
prior to carrying out the normality test
• The outlying data should only be excluded if there is a
logical explanation for the extreme values
• The presence of outliers may indicate that we have a
mistake proofing issue!
6. Too Much Data

• Rather ironically, too many data points can, in themselves, lead to the failure of a normality test
• The Anderson-Darling test is sensitive to this
• No distribution is exactly normal, and with enough data,
any distribution can be proven non-normal
• When performing a normality test, the question we want
answered is “Does our data approximate the normal
distribution?”
• Taking a random sample of 50 data points should
overcome this problem
7. The Underlying Distribution is not Normal

• Many types of data are not expected to be normally


distributed
• It is necessary to recognise the difference between non-
normal and unnatural data
• If the data is “lumpy” rather than continuous, then
there is likely to be a special cause
• If the data is continuous, we should then ask the
question “should the data be normal?”
Which Type of Distribution?

• Some types of continuous data tend to be normally


distributed, others are not
• For example:
• Additive characteristics tend to be normally
distributed e.g. height, dimensions etc.
• Multiplicative characteristics tend to be lognormal
e.g. weight, waiting time etc.
• Many other continuous types of data can be
characterised using the Weibull Distribution – we
will cover this in later modules
Why is Normality Important?

• Many statistical procedures rely on the assumption of


normality
• Some tests are more robust to this assumption than
others
• Two procedures which are particularly sensitive to
departures from normality are:
• The construction of normal tolerance intervals
• Determination of sample sizes for variables
sampling plans
Attribute Data
Examples of Attribute Data

Number of defects

Number of changes

Number of errors on an invoice

Number of accidents

Number of reworked items

Number of failures
Attribute Data

There are two primary types of attribute data:

1. The number of defective items (pass/fail data)


This type of data follows the Binomial Distribution.
Examples:
The number of months with lost-time accidents.
The number of expense forms with errors.

2. The number of defects (“counting” data)


This type of data follows the Poisson Distribution.
Examples:
The number of accidents per month.
The number of errors on each expense form.
Defects or Defective Items?

9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Defective Items

• When only two outcomes are possible (pass or fail, good or bad), then our data is in terms of the number of defective items.
• When we are counting defective items, the data will follow the Binomial Distribution.
• The Binomial Distribution was discovered through the study of games of chance.
Binomial Distribution
The binomial distribution can be applied if the
following conditions are satisfied:

1. There are a fixed number of trials (or items).


2. Each trial has only two possible outcomes, usually
success or failure.
3. The outcome of any trial is independent of the
outcome of any other trial.
4. The probability of success or failure is constant
from trial to trial.
The outcomes from rolling 5 poker dice would satisfy these conditions, since each die can be considered an independent trial with the same chance of success on every trial.
Binomial Distribution

Parameters:

n = number of trials (or items)
y = number of successes
p = probability of success in one trial
q = 1 − p = probability of failure in one trial
P(y) = probability of obtaining exactly y successes
µ = population mean = n × p
σ = population standard deviation = √(n × p × q)
Binomial Distribution - Example

Example: Five poker dice are thrown. What is the probability that five aces will be thrown?

The probability that the number of aces is five is the product of the probabilities that each die shows an ace:

(1/6 × 1/6 × 1/6 × 1/6 × 1/6) = (1/6)⁵ = 1/7776 = 0.00013
Binomial Distribution - Example

What is the probability of throwing exactly four aces in a roll of 5 poker dice?

The probability that four specified dice show an ace and the remaining one a “not-ace” is:

(1/6 × 1/6 × 1/6 × 1/6 × 5/6) = (1/6)⁴ × (5/6) = 5/7776 = 0.00064

The die which does not show the ace may, however, be any one of the five dice, so the probability that the number of aces is four is:

5 × (1/6)⁴ × (5/6) = 25/7776 = 0.00322
Binomial Probability Formula

The general equation for calculating the probability of events for the binomial distribution is:

P(y) = [ n! / (y! (n − y)!) ] × pʸ × qⁿ⁻ʸ

Where:
y = number of successes
n = number of independent trials (or items)
p = probability of success
q = probability of failure
Note: p + q = 1
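The formula translates directly into a few lines of Python, reproducing the poker-dice results above:

```python
from math import comb

def binom_pmf(y, n, p):
    """P(exactly y successes in n independent trials)."""
    q = 1 - p
    # n!/(y!(n-y)!) * p^y * q^(n-y)
    return comb(n, y) * p**y * q**(n - y)

# Exactly four aces in a roll of 5 poker dice
print(round(binom_pmf(4, 5, 1/6), 5))  # 0.00322

# All outcomes (0..5 aces) must sum to 1
print(round(sum(binom_pmf(y, 5, 1/6) for y in range(6)), 6))  # 1.0
```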
Binomial Data – Workshop 1

Calculate the probability of obtaining 0, 1, 2, 3, 4 or 5 Aces in a roll of 5 poker dice. (Your answer can be checked by adding the probabilities of each occurrence and checking that the sum equals 1.00.)

P(y) = [ n! / (y! (n − y)!) ] × pʸ × qⁿ⁻ʸ

Where:
y = number of successes (0, 1, 2, 3, 4 or 5)
n = number of independent trials (or items) = 5
p = probability of success (1/6)
q = probability of failure (5/6)
Binomial Data – Workshop 2

An accounts office produces complex invoices which have to be inspected before being posted. Past data has shown that 90% of invoices pass inspection. In order to achieve efficiency targets, each accounts clerk has to produce 25 invoices a day and is expected to produce 23 invoices which pass the inspection. Each accounts clerk is given a bonus every time they achieve their daily target of 23, 24 or 25 invoices which pass inspection.

How often can the clerks expect to receive their bonus?
Poisson Distribution

Whenever we are counting the number of defects (or other occurrences) within a specific interval, the Poisson Distribution can be applied.

The probability of the event occurring y times is given by the Poisson probability distribution.
Requirements - Poisson Distribution

1. y is the number of occurrences (or defects) over some interval


2. The occurrences (or defects) occur randomly
3. The occurrences (or defects) are independent of each other
4. The occurrences (or defects) occur uniformly over the interval
(some “count data” such as complaints and late shipments may not
satisfy these requirements)
Poisson Distribution - Parameters

Mean = µ (equivalent to DPU when counting defects)
Standard Deviation = σ = √µ

The Poisson formula is:

P(y) = (e^-µ × µʸ) / y!

Where: e = 2.71828 (base of natural logarithm)
       y = number of occurrences (or defects)
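A short Python sketch of the Poisson formula, using µ = 0.1575 (the defects-per-invoice rate from the invoice example in this module):

```python
import math

def poisson_pmf(y, mu):
    """P(exactly y occurrences when the mean count is mu)."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# Probability of zero defects when DPU = 63/400 = 0.1575
print(round(poisson_pmf(0, 0.1575), 3))  # 0.854

# Unlike the binomial, y has no upper limit; summing over many
# counts shows the probabilities approach 1
print(round(sum(poisson_pmf(y, 0.1575) for y in range(20)), 6))  # 1.0
```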
Poisson Distribution - Workshop
The following data shows some statistics regarding
accidents in a large manufacturing plant over a period of
200 working days. The average number of accidents per
working day is 1.605 (321/200).
Number of    Number of    Total Number    Expected Number
Accidents    Occasions    of Accidents    of Occasions
0            42           0
1            62           62
2            48           96
3            33           99
4            11           44
5            4            20
>5           0            0
Total        200          321
Poisson Distribution - Workshop

Calculate the expected number of occasions for each number of accidents tabulated. Discuss whether the data is a good fit to the Poisson Distribution.

What is the probability of there being 3 or fewer accidents on any one day?
Poisson Distribution and Defects (Errors)

The following errors were found in 400 invoices:

Defect               Number of Occurrences
Wrong recipient      26
Incorrect Price      12
Wrong address        19
Incorrect Quantity   6
Total Errors         63

What is the probability of a randomly selected invoice having no errors?
Poisson Distribution & Defects (Errors)

The mean number of errors per invoice = 63/400 = 0.1575

The probability of a randomly selected invoice containing no errors = P(y) = P(0):

P(0) = (e^-µ × µʸ) / y! = (e^-0.1575 × 0.1575⁰) / 0! = e^-0.1575 = 0.8542

There is an 85.42% chance that an invoice drawn at random will exhibit no defects.
Poisson Distribution & Six Sigma

In the previous example:

µ = 63 defects in 400 invoices = 63/400 = 0.1575
µ = Defects per Unit = DPU = 0.1575
P(0 defects) = e^-0.1575 = e^-DPU = 0.8542

So e^-DPU gives the probability that any randomly selected invoice is defect free, therefore:

100 × (e^-DPU) = % Yield or First Time Pass
Binomial & Poisson Distributions
• We generally apply the Binomial Distribution when
only two outcomes are possible (pass/fail or good/bad).
In Six Sigma activities, this usually applies to the
examination of the number of defective items.
• We apply the Poisson Distribution when counting the
number of occurrences of an event. In Six Sigma
activities this usually applies to the examination of the
number of defects.

9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Binomial v Poisson

Binomial Distribution                     Poisson Distribution
Affected by sample size, n,               Affected only by the mean, µ.
and probability, p.
Possible values of y have                 Possible values of y have
an upper limit, n.                        no upper limit.

9 Items (9 invoices)
3 Defective Items (3 invoices with errors)
5 Defects (5 errors)
Basic Statistics - Summary
• Variable data provides a fuller description of our processes than
attribute data
• Many continuous distributions follow the Normal Distribution
• Normality must not be assumed
• There is a difference between non-Normal and unnatural data
• Some data sets are naturally non-Normal
• Defective Items can often be characterised using the Binomial
Distribution
• Defects (errors) can often be characterised using the Poisson
Distribution
• Understanding the underlying distribution of a data set allows us to
employ the correct statistical procedures