
CHAPTER 1: DESCRIPTIVE STATISTICS

1.1 Introduction

Example 1: Making Steel Rods

Consider a machine that makes steel rods for use in optical storage
devices. The specification for the diameter of the rods is 0.45 ± 0.02
cm. The machine makes 1000 rods per hour (continuous flow
production). The engineer wants to be fairly certain that the
percentage of good rods is at least 90%; otherwise he will shut
down the process for recalibration.

Example 2: Comparison of Breaking Strength of Two Alloys

In order to compare the strength qualities of two alloys, five
specimens from each group were selected randomly and their
breaking strength (force required to rupture a specimen in a tension
test) in megapascals was measured.

The following data were obtained:

Alloy A Alloy B
404 365
406 452
396 378
392 461
402 344

[Figure: plots of the breaking strengths of Alloy A and Alloy B]

Which alloy seems to be the better in terms of its strength?
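
One way to start on this question is to compare the center and the spread of the two samples (the sample mean and standard deviation are defined in Section 1.6). The sketch below uses Python's standard `statistics` module on the data from the table; the variable names are illustrative only.

```python
import statistics

# Breaking strengths (MPa) from the table above
alloy_a = [404, 406, 396, 392, 402]
alloy_b = [365, 452, 378, 461, 344]

for name, data in (("Alloy A", alloy_a), ("Alloy B", alloy_b)):
    mean = statistics.mean(data)   # sample mean
    sd = statistics.stdev(data)    # sample standard deviation (divisor n - 1)
    print(f"{name}: mean = {mean}, standard deviation = {sd:.2f}")
```

Both samples have a mean of 400 MPa, but the Alloy B measurements are much more spread out (standard deviation about 53 versus about 5.8), so in this small sample Alloy A looks more consistent.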

Study components: obtaining a random sample, collecting data
(obtaining trustworthy measurements), data analysis, and
conclusions (generalization from the sample to the whole
population).

1.2 Applications of Statistics in Engineering

Quality control (in manufacturing operations, randomly sampling
and testing a fraction of the output; the process can be corrected
before a large number of defective items is produced)

Example 3: Filling process

A filling machine fills plastic bottles with a drink; random
sampling is used to control the amount of drink in the bottles; the
filling process can be corrected before it creates a large number of
underfilled or overfilled bottles.

Reliability (ability of a device or system to perform a required
function under stated conditions for a specified period of time;
how long a component or a system will survive)

Example 4: Fatigue Tests for Aircraft Wheels

Suppose the time until failure of wheels used on commercial aircraft
needs to be estimated. The wheels are made of very strong alloys
able to support aircraft on the ground.

A machine is used to roll the wheel under the desired design load.
The times until failure for wheels randomly selected from the
production run (in thousands of km) are obtained. The data are
used to estimate time until failure for all wheels.

Example 5: Data Mining in Oil and Gas Extraction

Digitization of oil fields: surface rock and soil type, seismic data
(creating shock waves that pass through hidden rock layers and
interpreting the waves that are reflected back to the surface),
satellite images, small core samples obtained by shallow drilling.
Statistical models used to find economically viable fields.

1.3 Random Sampling

Population - the entire collection of individuals or objects about
which information is desired

Sample - the collection of individuals or objects we will actually
measure

[Diagram: Sample → Population, with the arrow labeled "Generalization"]

Inferences - statements about the population based on the sample
data

Valid inferences about the population can be reached if the sample is
representative of the population.

Random sample - a sample in which the elements are chosen at
random (a random sample tends to be representative of the
population). Larger random samples tend to give more accurate
results than smaller samples.

1.4 Variables

Variable - any characteristic of a person or thing that can be
expressed as a number or a label.

Variables
    Categorical: values are labels
    Quantitative (numerical): values are numbers

Categorical variables: gender, hair color, marital status


Quantitative variables: weight, height, age, income

1.5 Displaying Categorical Variables

Consider a class of 30 with 18 males and 12 females.

(a) Bar Graph

A vertical bar erected over each category; the height of the bar is
the frequency or the percentage of observations in the category.

[Figure: bar graph of Percent by category; Females 40, Males 60]

(b) Pie Chart

Slices represent categories; the size of each slice corresponds
to the percentage for the category.

[Figure: pie chart with slices for Females and Males]
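
The frequencies and percentages behind both displays can be computed directly from the class data. The following is a minimal sketch using only the standard library; the list `genders` and the text-based bars are illustrative stand-ins for a real plotting tool.

```python
from collections import Counter

# Class of 30 students: 18 males and 12 females
genders = ["Male"] * 18 + ["Female"] * 12

counts = Counter(genders)            # frequency of each category
n = sum(counts.values())             # total number of observations
percents = {cat: 100 * k / n for cat, k in counts.items()}

for cat in sorted(percents):
    # Crude text bar graph: one '#' per 5 percentage points
    bar = "#" * round(percents[cat] / 5)
    print(f"{cat:<7} {percents[cat]:5.1f}%  {bar}")
```

The same percentages (40 for females, 60 for males) give the bar heights in the bar graph and the slice sizes in the pie chart.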

1.6 Describing Quantitative Variables

(a) Measures of Center

Sample Mean

Suppose a sample consists of n observations $x_1, x_2, \ldots, x_n$. The
sample mean $\bar{x}$ is defined as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}.$$

In compact notation: $\bar{x} = \dfrac{\sum x_i}{n}$.

The sample mean $\bar{x}$ is an estimate of the population mean $\mu$ (the
mean of all observations in the whole population).

Example 6: 30 30 40 50 50 60 40, $\bar{x} = 300/7 = 42.857$

30 30 40 50 50 60 340, $\bar{x} = 600/7 = 85.714$

Conclusion: The mean is not a resistant measure of center (it is very
sensitive to outliers, observations that fall well below or above
the overall bulk of the data).
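
The two means in Example 6 can be checked with a short sketch using Python's standard `statistics` module (the variable names are illustrative):

```python
import statistics

data = [30, 30, 40, 50, 50, 60, 40]
with_outlier = [30, 30, 40, 50, 50, 60, 340]   # last value 40 replaced by 340

mean_original = statistics.mean(data)          # 300 / 7
mean_outlier = statistics.mean(with_outlier)   # 600 / 7

print(round(mean_original, 3), round(mean_outlier, 3))
```

Changing a single observation doubles the mean, which is what "not resistant" means here.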

Sample Trimmed Mean

Delete some of the smallest and some of the largest observations
(usually the bottom 10% and top 10% are removed) and take the mean
of the remaining observations.
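
A minimal sketch of this procedure follows. The function name `trimmed_mean` and the choice to round the number of dropped observations down are assumptions made for illustration; software packages may handle fractional counts differently.

```python
def trimmed_mean(data, proportion=0.10):
    """Mean after dropping a fraction `proportion` from each end of the sorted data."""
    ordered = sorted(data)
    k = int(len(ordered) * proportion)   # observations dropped at each end (rounded down: an assumption)
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("proportion is too large: no observations left")
    return sum(kept) / len(kept)

# Second data set of Example 6. With only 7 observations, 10% rounds down to
# zero dropped values, so 15% is used here to make the trimming visible.
sample = [30, 30, 40, 50, 50, 60, 340]
print(trimmed_mean(sample, proportion=0.15))   # drops one 30 and the 340
```

With one observation removed from each end, the remaining values 30, 40, 50, 50, 60 have mean 46, much closer to the bulk of the data than the ordinary mean of 85.714.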
The Sample Median

To compute the median:

(i) Arrange all observations in order, from smallest to largest.

(ii) Median = the single middle value if n is odd, or
the average of the two middle values if n is even.

Example 7:

Data set 1: 30 60 40 30 50 40 50
Ordered list: 30 30 40 40 50 50 60
Median = 40

Data set 2: 30 60 40 30 50 48 50 40
Ordered list: 30 30 40 40 48 50 50 60
Median = (40+48)/2 = 44

The median remains the same if 60 is replaced by 600.

Conclusion: The median is a resistant measure of center.
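
Example 7 can be reproduced with `statistics.median` from Python's standard library, which sorts the data and applies the odd/even rule described above. The variable names are illustrative.

```python
import statistics

set1 = [30, 60, 40, 30, 50, 40, 50]        # n = 7 (odd)
set2 = [30, 60, 40, 30, 50, 48, 50, 40]    # n = 8 (even)

median1 = statistics.median(set1)          # single middle value
median2 = statistics.median(set2)          # average of the two middle values

# Replace 60 by 600 in data set 1 and recompute
set1_outlier = [600 if x == 60 else x for x in set1]
median1_outlier = statistics.median(set1_outlier)

print(median1, median2, median1_outlier)
```

The median of data set 1 stays at 40 after the replacement, while its mean would change substantially.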

Sample Mode

The sample mode is the most frequently occurring observation in
the sample (no mode if the observations occur with the same
frequency).
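
The definition can be sketched as a small function. The name `sample_mode` is hypothetical, and returning a list (empty when there is no mode, several values when there are ties) is a design choice made for this illustration; the first sample below is made up for the example.

```python
from collections import Counter

def sample_mode(data):
    """Return the list of most frequent values, or [] if all values are equally frequent."""
    counts = Counter(data)
    top = max(counts.values())
    modes = sorted(v for v, c in counts.items() if c == top)
    # Convention from the text: no mode when every distinct value occurs equally often
    if len(counts) > 1 and len(modes) == len(counts):
        return []
    return modes

print(sample_mode([30, 30, 30, 40, 50, 50, 60]))   # made-up sample with a single mode
print(sample_mode([10, 20, 30]))                   # all values occur once: no mode
print(sample_mode([30, 30, 40, 50, 50, 60, 40]))   # first data set of Example 6: three tied modes
```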

(b) Measures of Spread

Sample Range

Range = Largest - Smallest.

Range ignores all of the information between the largest and the
smallest values.

Variance and Standard Deviation

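[Figure: deviation $x_1 - \bar{x}$ of an observation $x_1$ from the sample mean $\bar{x}$]

Observations: $x_1, x_2, \ldots, x_n$

Variance $s^2$ is defined as

$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}.$$

Compact notation: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$.

Standard deviation $s$: $s = \sqrt{s^2}$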

Properties of s:

1. s measures the spread of observations about the mean.
2. s is not resistant to outliers.
3. s = 0 only when there is no spread (all observations equal).
4. s is an estimate of the population standard deviation $\sigma$
(standard deviation of all observations in the population).

Example 8: Compute the variance and standard deviation of the
observations: 20, 40, 50, 30, 60, 70

Solution:
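
One way to check a hand calculation is the sketch below, which applies the definition of $s^2$ step by step in plain Python (the variable names are illustrative):

```python
import math

x = [20, 40, 50, 30, 60, 70]
n = len(x)

x_bar = sum(x) / n                                # sample mean: 270 / 6 = 45
squared_devs = [(xi - x_bar) ** 2 for xi in x]    # (x_i - x_bar)^2 for each observation
variance = sum(squared_devs) / (n - 1)            # divide by n - 1, not n
sd = math.sqrt(variance)

print(x_bar, variance, round(sd, 3))
```

The squared deviations are 625, 25, 25, 225, 225, 625, which sum to 1750, so $s^2 = 1750/5 = 350$ and $s \approx 18.708$.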

Equivalent formula for the variance:

$$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right].$$

Example 9: Use the above formula to recalculate the standard
deviation for the above data.
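
The equivalent formula needs only the sum of the observations and the sum of their squares. A minimal sketch for the data of Example 8 (variable names are illustrative):

```python
import math

x = [20, 40, 50, 30, 60, 70]
n = len(x)

sum_x = sum(x)                          # sum of x_i
sum_x2 = sum(xi ** 2 for xi in x)       # sum of x_i squared

# s^2 = [ sum x_i^2 - (sum x_i)^2 / n ] / (n - 1)
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(sum_x, sum_x2, variance, round(sd, 3))
```

Here the sum is 270 and the sum of squares is 13900, so $s^2 = (13900 - 72900/6)/5 = 1750/5 = 350$, in agreement with the definition-based formula.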

Sample Quantiles

The pth sample quantile is a value such that p percent of the
observations fall at or below that value.

Three useful quantiles are the quartiles. The lower (or first) quartile
has p=25, the median (or second quartile) has p=50, and
the upper (or third) quartile has p=75.

They are denoted by Q1, Q2, and Q3, respectively.

M = median = Q2

[Figure: the median splits the ordered data into a LOWER HALF and an UPPER HALF, shown for n even and for n odd]

Q1 = Median of the Lower Half
Q2 = Overall Median,
Q3 = Median of the Upper Half.


[Figure: Q1, Q2, and Q3 marked on the ordered data]

Interquartile range IQR: IQR = Q3 - Q1.

IQR is a measure of spread in the data.

Example 10: Obtain the quartiles and IQR for the sample:

30 30 40 40 48 50 50 60 66 86 94 112

Solution:
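
A hand solution can be checked with the sketch below, which applies the median-of-halves rule stated above. The function name `quartiles` is hypothetical, and leaving the middle observation out of both halves when n is odd is an assumption (it does not matter here, since n = 12 is even). Note that statistical software often uses other quantile rules and may give slightly different values.

```python
import statistics

def quartiles(data):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]         # lower half
    upper = ordered[n - half:]     # upper half (middle value excluded when n is odd: an assumption)
    return (statistics.median(lower),
            statistics.median(ordered),
            statistics.median(upper))

sample = [30, 30, 40, 40, 48, 50, 50, 60, 66, 86, 94, 112]
q1, q2, q3 = quartiles(sample)
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

The lower half is 30 30 40 40 48 50, so Q1 = 40; the upper half is 50 60 66 86 94 112, so Q3 = (66+86)/2 = 76; Q2 = (50+50)/2 = 50 and IQR = 76 - 40 = 36.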

1.7 Outliers

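Outliers - observations separated from the main body of data

Outlier - an observation more than 1.5*IQR below Q1 or more than
1.5*IQR above Q3

[Figure: number line marking Q1, Q2, and Q3, with a distance of 1.5 IQR below Q1 and above Q3, and an outlier beyond that distance]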

Example 11: Are there any outliers in Example 10?

Solution:
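
The 1.5*IQR rule can be applied to the sample of Example 10 with the following sketch. The helper name `outlier_limits` is hypothetical, and the quartiles are computed as medians of the lower and upper halves of the ordered data.

```python
import statistics

def outlier_limits(data):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    ordered = sorted(data)
    half = len(ordered) // 2
    q1 = statistics.median(ordered[:half])                   # median of the lower half
    q3 = statistics.median(ordered[len(ordered) - half:])    # median of the upper half
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

sample = [30, 30, 40, 40, 48, 50, 50, 60, 66, 86, 94, 112]
low, high = outlier_limits(sample)
outliers = [x for x in sample if x < low or x > high]

print(low, high, outliers)
```

With Q1 = 40, Q3 = 76 and IQR = 36, the limits are 40 - 54 = -14 and 76 + 54 = 130. Every observation lies between them, so this sample has no outliers under the rule.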

1.8 Displaying Quantitative Variables

Example 12: 30 examination scores:

75 79 58 73 82 94
61 77 54 77 65 67
62 61 64 45 58 86
66 83 70 91 48 78
86 66 52 80 59 55

(a) Histograms

1. Divide the range of the data into non-overlapping classes of
equal width.
Convention: the right-hand limit of each class is included, the
left-hand limit is excluded (Excel).
2. Count the number of observations (frequency) in each class.
3. Erect over each class a rectangle whose height equals the
frequency of that class.

Frequency Table:

Class Intervals   Frequency   Relative Frequency
40-50             2           2/30
50-60             6           6/30
60-70             9           9/30
70-80             7           7/30
80-90             4           4/30
90-100            2           2/30
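
The counts in the table can be reproduced by tallying the 30 scores of Example 12 into classes that include their right-hand limit and exclude their left-hand limit. This is a minimal sketch in plain Python; the names `edges` and `freq` are illustrative.

```python
scores = [75, 79, 58, 73, 82, 94,
          61, 77, 54, 77, 65, 67,
          62, 61, 64, 45, 58, 86,
          66, 83, 70, 91, 48, 78,
          86, 66, 52, 80, 59, 55]

edges = [40, 50, 60, 70, 80, 90, 100]    # class limits of equal width 10
n = len(scores)

freq = []
for left, right in zip(edges, edges[1:]):
    # left limit excluded, right limit included: left < score <= right
    freq.append(sum(1 for s in scores if left < s <= right))

for (left, right), f in zip(zip(edges, edges[1:]), freq):
    print(f"{left}-{right}: frequency {f}, relative frequency {f}/{n}")
```

Under this convention a score of 70 is counted in the class 60-70 and a score of 80 in the class 70-80.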

[Figure: frequency histogram for the 30 scores, classes from 40 to 100]

[Figure: relative frequency histogram of the 30 scores, classes from 40 to 100]

Shapes of histograms

Unimodal (one peak), bimodal (two peaks)

Symmetric, skewed (skewed right or skewed left)

[Figure: three frequency histograms showing Symmetric, Skewed Left, and Skewed Right shapes]

(b) Boxplots

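[Figure: annotated boxplot, labeled from top to bottom as follows]

Outlier (more than 1.5 IQR above Q3)
The largest observation within 1.5 IQR from Q3
Q3
Q2
Q1
The smallest observation within 1.5 IQR from Q1
Outlier (more than 1.5 IQR below Q1)

The box extends from Q1 to Q3, so its length is the IQR.

[Figure: boxplots for skewed right, skewed left, and symmetric data]
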
For a symmetric (unimodal) distribution: Mean = Median = Mode.

Example 13: Obtain the boxplot for the 30 exam scores. Repeat the
exercise with the score 94 replaced by 120.
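
The numbers needed to draw the boxplot can be computed with the sketch below. The function name `boxplot_summary` is hypothetical; quartiles are taken as medians of the lower and upper halves of the ordered data, and plotting software may use a different quartile rule.

```python
import statistics

def boxplot_summary(data):
    """Whisker ends, quartiles, and outliers under the 1.5*IQR rule."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    q1 = statistics.median(ordered[:half])        # median of the lower half
    q2 = statistics.median(ordered)               # overall median
    q3 = statistics.median(ordered[n - half:])    # median of the upper half
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in ordered if low <= x <= high]
    return {
        "whisker_low": inside[0],     # smallest observation within 1.5 IQR of Q1
        "q1": q1, "q2": q2, "q3": q3,
        "whisker_high": inside[-1],   # largest observation within 1.5 IQR of Q3
        "outliers": [x for x in ordered if x < low or x > high],
    }

scores = [75, 79, 58, 73, 82, 94, 61, 77, 54, 77, 65, 67, 62, 61, 64,
          45, 58, 86, 66, 83, 70, 91, 48, 78, 86, 66, 52, 80, 59, 55]
original = boxplot_summary(scores)
modified = boxplot_summary([120 if s == 94 else s for s in scores])

print(original)
print(modified)
```

For the original scores Q1 = 59, Q2 = 66.5, Q3 = 79, the whiskers run from 45 to 94, and there are no outliers. With 94 replaced by 120 the quartiles are unchanged, but 120 exceeds 79 + 1.5(20) = 109, so it is plotted as an outlier and the upper whisker ends at 91.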

(c) Scatterplots

Scatterplots are used to display a relationship between two
numerical variables.

[Figure: scatterplot of Sales (vertical axis) against Price (horizontal axis)]

(d) Time Series Plots (line charts)

[Figure: time series plot of a variable (vertical axis) against equally spaced time intervals (horizontal axis)]

Example 14: Lumber Cutting

An operator uses a table saw to cut 2-by-4 lumber to a target length
of 96 inches. However, few pieces will be exactly 96 inches long.
Sources of variation in the cut lengths: most saw blades wobble,
lumber is at least slightly warped, and cuts become less precise as the
saw blade becomes duller. The lengths (in inches) of 20 cuts are
given below:
Order   Length   Order   Length
1       95.99    11      96.01
2       96.00    12      95.96
3       95.99    13      96.01
4       96.00    14      96.02
5       95.98    15      95.95
6       96.00    16      96.04
7       95.98    17      96.02
8       96.00    18      96.07
9       95.97    19      96.03
10      96.03    20      96.05
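
A rough text version of the time series plot, together with a simple check of whether the cuts become less precise over time, can be sketched as follows. Splitting the 20 cuts into the first 10 and the last 10 is an arbitrary choice made for illustration, and `sample_range` is a hypothetical helper name.

```python
lengths = [95.99, 96.00, 95.99, 96.00, 95.98, 96.00, 95.98, 96.00, 95.97, 96.03,
           96.01, 95.96, 96.01, 96.02, 95.95, 96.04, 96.02, 96.07, 96.03, 96.05]

def sample_range(values):
    """Largest minus smallest, rounded to hundredths of an inch."""
    return round(max(values) - min(values), 2)

first_half = sample_range(lengths[:10])    # cuts 1 to 10
second_half = sample_range(lengths[10:])   # cuts 11 to 20

# Text time series plot: cut order runs down the page, length runs left to right
for order, length in enumerate(lengths, start=1):
    offset = round((length - 95.95) * 100)   # hundredths of an inch above 95.95
    print(f"{order:2d} {length:6.2f} " + " " * offset + "*")

print("range of cuts 1-10:", first_half)
print("range of cuts 11-20:", second_half)
```

The range doubles from 0.06 inches in the first ten cuts to 0.12 inches in the last ten, which is consistent with the cuts becoming less precise as the blade dulls.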
