
Review of Top 10 Concepts

in Statistics

NOTE: This PowerPoint file is not an introduction,
but rather a checklist of topics to review

Top Ten #1

Descriptive Statistics

Measures of Central Location

Mean
Median
Mode

Mean

Population mean = $\mu = \sum x / N$ = (5+1+6)/3 = 12/3 = 4
Algebra: $\sum x = N\mu$ = 3*4 = 12
Sample mean = x-bar = $\bar{x} = \sum x / n$
Example: the number of hours spent on the
Internet: 4, 8, and 9
x-bar = (4+8+9)/3 = 7 hours
Do NOT rely on the mean if the number of observations is
small or the data contain extreme values
Ex: Do NOT use the mean if 3 houses were sold this week,
and one was a mansion

Median

Median = middle value


Example: 5,1,6

Step 1: Sort data: 1,5,6


Step 2: Middle value = 5

When there is an even number of observations,
the median is computed by averaging the two
observations in the middle.
OK even if there are extreme values
Home sales: 100K,200K,900K, so
mean =400K, but median = 200K
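
The home-sales comparison above can be checked with a short sketch. It uses only Python's standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
from statistics import mean, median

# Home sale prices in thousands of dollars, taken from the slide above.
home_sales = [100, 200, 900]

sales_mean = mean(home_sales)      # pulled upward by the single 900K sale
sales_median = median(home_sales)  # middle value after sorting

print(f"mean = {sales_mean}K, median = {sales_median}K")
```

The single expensive sale moves the mean far from the typical price, while the median stays at the middle sale.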

Mode

Mode: most frequent value


Ex: female, male, female: Mode = female

Ex: 1,1,2,3,5,8: Mode = 1

It may not be a very good measure, see the


following example

Measures of Central Location Example


Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23

Sample Mean = x-bar = $\sum x / n$ = 100/10 = 10


Median = (8+9)/2 = 8.5
Mode = 0
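
As a quick check of the three measures for this sample, here is a minimal sketch using Python's standard `statistics` module (`multimode` is used so that ties, if any, would all be reported). Variable names are illustrative.

```python
from statistics import mean, median, multimode

sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

sample_mean = mean(sample)        # sum of 100 divided by n = 10
sample_median = median(sample)    # average of the two middle values, 8 and 9
sample_modes = multimode(sample)  # list of the most frequent value(s)

print(sample_mean, sample_median, sample_modes)
```

Here the mode (0) sits at the edge of the data, which is why it can be a poor measure of central location.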

Relationship

Case 1: if probability distribution symmetric


(ex. bell-shaped, normal distribution),

Mean = Median = Mode

Case 2: if distribution positively skewed (to the
right) (ex. incomes of employees in a large firm: a
large number of relatively low-paid workers
and a small number of high-paid executives),

Mode < Median < Mean

Relationship contd

Case 3: if distribution negatively skewed (to the left)
(ex. the time taken by students to write
exams: few students hand in their exams early
and the majority of students turn in their exams at
the end of the exam),

Mean < Median < Mode

Dispersion: Measures of
Variability

How much spread of data


How much uncertainty
Measures

Range
Variance
Standard deviation

Range

Range = Max - Min $\ge$ 0


But range affected by unusual values
Ex: Santa Monica has a high of 105 degrees
and a low of 30 once a century, but range
would be 105-30 = 75

Standard Deviation (SD)

Better than range because all data used


Population SD = Square root of variance
= sigma = $\sigma$
SD $\ge$ 0

Empirical Rule

Applies to mound or bell-shaped curves


Ex: normal distribution
68% of data within $\pm$ one SD of mean
95% of data within $\pm$ two SD of mean
99.7% of data within $\pm$ three SD of mean
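
The percentages in the empirical rule can be reproduced for a normal distribution with the sketch below. It relies on `statistics.NormalDist` from the Python standard library; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Probability that a normal value lies within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```

The exact value for two SDs is slightly above 95%, which is why confidence intervals later in this review use 1.96 rather than 2.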

Standard Deviation =
Square Root of Variance

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

Sample Standard Deviation

$x$        $x - \bar{x}$     $(x - \bar{x})^2$
6        6-8 = -2      (-2)(-2) = 4
6        6-8 = -2      (-2)(-2) = 4
7        7-8 = -1      (-1)(-1) = 1
8        8-8 = 0       (0)(0) = 0
13       13-8 = 5      (5)(5) = 25
Sum=40   Sum=0         Sum=34

Mean = 40/5 = 8

Standard Deviation
Total variation = 34
Sample variance = 34/4 = 8.5
Sample standard deviation =
square root of 8.5 = 2.9
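
The same calculation can be followed step by step in code. This is a minimal sketch in plain Python; the data values 6, 6, 7, 8, 13 are the ones implied by the deviations and the sum of 40 in the worked table, and the variable names are illustrative.

```python
from math import sqrt

data = [6, 6, 7, 8, 13]
n = len(data)

x_bar = sum(data) / n                             # mean = 40 / 5
deviations = [x - x_bar for x in data]            # these always sum to zero
total_variation = sum(d * d for d in deviations)  # sum of squared deviations
sample_variance = total_variation / (n - 1)       # divide by n - 1 for a sample
sample_sd = sqrt(sample_variance)

print(x_bar, total_variation, sample_variance, round(sample_sd, 1))
```

Dividing by n - 1 rather than n is what distinguishes the sample variance from the population variance.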

Measures of Variability - Example


The hourly wages earned by a sample of five students
are:
$7, $5, $11, $8, and $6
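Range: 11 - 5 = 6
Variance:

$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} = \frac{(7 - 7.4)^2 + \dots + (6 - 7.4)^2}{5 - 1} = \frac{21.2}{5 - 1} = 5.30$

Standard deviation:

$s = \sqrt{s^2} = \sqrt{5.30} = 2.30$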
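
For the hourly-wage sample, the library functions give the same figures directly. A short sketch with Python's standard `statistics` module, where `variance` and `stdev` are the sample (n - 1) versions:

```python
from statistics import mean, stdev, variance

wages = [7, 5, 11, 8, 6]  # hourly wages in dollars

wage_range = max(wages) - min(wages)
wage_mean = mean(wages)
wage_variance = variance(wages)  # sample variance, divides by n - 1
wage_sd = stdev(wages)           # square root of the sample variance

print(wage_range, wage_mean, round(wage_variance, 2), round(wage_sd, 2))
```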

Graphical Tools

Line chart: trend over time


Scatter diagram: relationship between two
variables
Bar chart: frequency for each category
Histogram: frequency for each class of
measured data (graph of frequency distr.)
Box plot: graphical display based on
quartiles, which divide data into 4 parts

Top Ten #2

Hypothesis Testing

H0: Null Hypothesis

Population mean = $\mu$
Population proportion = $\pi$
A statement about the value of a population
parameter
Never include a sample statistic (such as x-bar) in a hypothesis

HA or H1: Alternative Hypothesis

ONE TAIL ALTERNATIVE


Right tail: $\mu$ > number (smog check)
$\pi$ > fraction (% defectives)
Left tail: $\mu$ < number (weight in box of crackers)
$\pi$ < fraction (unpopular President's %
approval low)

One-Tailed Tests
A test is one-tailed when the alternate
hypothesis, H1 or HA, states a direction, such as:
H1: The mean yearly salary earned by full-time
employees is more than $45,000. ($\mu$ > $45,000)

H1: The average speed of cars traveling on the
freeway is less than 75 miles per hour. ($\mu$ < 75)
H1: Less than 20 percent of the customers pay
cash for their gasoline purchase. ($\pi$ < 0.2)

Two-Tail Alternative

Population mean not equal to number (too


hot or too cold)
Population proportion not equal to fraction (%
alcohol too weak or too strong)

Two-Tailed Tests
A test is two-tailed when no direction is
specified in the alternate hypothesis
H1: The mean amount of time spent on the
Internet is not equal to 5 hours. ($\mu \ne 5$)

H1: The mean price for a gallon of gasoline
is not equal to $2.54. ($\mu \ne 2.54$ dollars)

Reject Null Hypothesis (H0) If

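Absolute value of test statistic* > critical value*
(two-tail test; for a one-tail test, compare the signed test statistic with the critical value in the direction of HA)

Reject H0 if |Z Value| > critical Z
Reject H0 if | t Value| > critical t

Reject H0 if p-value < significance level (alpha)
Note that the direction of the inequality is reversed for the p-value rule!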

Reject H0 if very large difference between sample


statistic and population parameter in H0

* Test statistic: A value, determined from sample information, used to determine


whether or not to reject the null hypothesis.
* Critical value: The dividing point between the region where the null hypothesis is
rejected and the region where it is not rejected.

Example: Smog Check

H0 : $\mu$ = 80
HA: $\mu$ > 80
If test statistic =2.2 and critical value = 1.96,
reject H0, and conclude that the population
mean is likely > 80
If test statistic = 1.6 and critical value = 1.96,
do not reject H0, and reserve judgment about
H0

Type I vs Type II Error

Alpha = $\alpha$ = P(type I error) = Significance level =
probability that you reject a true null hypothesis

Beta = $\beta$ = P(type II error) = probability you do not
reject the null hypothesis, given H0 false

Ex: H0 : Defendant innocent

$\alpha$ = P(jury convicts innocent person)
$\beta$ = P(jury acquits guilty person)

Type I vs Type II Error

                    H0 true                               H0 false
Reject H0           Alpha = $\alpha$ = P(type I error)    1 - $\beta$ (Correct Decision)
Do not reject H0    1 - $\alpha$ (Correct Decision)       Beta = $\beta$ = P(type II error)

Example: Smog Check

H0 : $\mu$ = 80
HA: $\mu$ > 80
If p-value = 0.01 and alpha = 0.05, reject H0,
and conclude that the population mean is
likely > 80
If p-value = 0.07 and alpha = 0.05, do not
reject H0, and reserve judgment about H0

Test Statistic

When testing for the population mean from a


large sample and the population standard
deviation is known, the test statistic is given
by:

$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

Example
The processors of Best Mayo indicate on the
label that the bottle contains 16 ounces of
mayo. The standard deviation of the process
is 0.5 ounces. A sample of 36 bottles from the last
hour's production showed a mean weight of
16.12 ounces per bottle. At the .05
significance level, can we conclude that the
mean amount per bottle is greater than 16
ounces?

Example contd
1. State the null and the alternative hypotheses:
H0: $\mu$ = 16,
H1: $\mu$ > 16
2. Select the level of significance. In this case,
we selected the .05 significance level.
3. Identify the test statistic. Because we know the
population standard deviation, the test statistic is z.
4. State the decision rule.
Reject H0 if z > 1.645 (= $z_{0.05}$), since this is a right-tail test

Example contd
5. Compute the value of the test statistic:

$z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{16.12 - 16.00}{0.5/\sqrt{36}} = 1.44$
6. Conclusion: Do not reject the null hypothesis.
We cannot conclude the mean is greater than 16
ounces.
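
The Best Mayo test can be retraced with the sketch below. It assumes a right-tail z test with known population standard deviation, as in the example; the function name `one_sample_z` is illustrative, and the critical value and p-value come from `statistics.NormalDist` rather than a printed Z table.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """z statistic for a mean when the population SD (sigma) is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

alpha = 0.05
z = one_sample_z(sample_mean=16.12, mu0=16.00, sigma=0.5, n=36)

critical_z = NormalDist().inv_cdf(1 - alpha)  # right-tail critical value
p_value = 1 - NormalDist().cdf(z)             # right-tail p-value
reject_h0 = z > critical_z                    # same decision as p_value < alpha

print(f"z = {z:.2f}, critical z = {critical_z:.3f}, "
      f"p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

Both decision rules agree: the test statistic is below the critical value and the p-value is above alpha, so H0 is not rejected.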

Top Ten #3

Confidence Intervals: Mean and Proportion

Confidence Interval
A confidence interval is a range of values within
which the population parameter is expected
to occur.

Factors for Confidence Interval


The factors that determine the width of a
confidence interval are:
1. The sample size, n
2. The variability in the population, usually
estimated by standard deviation.
3. The desired level of confidence.

Confidence Interval: Mean

Use normal distribution (Z table) if:

population standard deviation (sigma)
known and either (1) or (2):
(1) Normal population
(2) Sample size > 30

Confidence Interval: Mean

If normal table, then

$\mu = \frac{\sum x}{n} \pm z\frac{\sigma}{\sqrt{n}}$

Normal Table

Tail = .5(1 - confidence level)


NOTE! Different statistics texts have different
normal tables
This review uses the tail of the bell curve
Ex: 95% confidence: tail = .5(1-.95)= .025
Z.025 = 1.96

Example

n = 49, $\sum x$ = 490, $\sigma$ = 2, 95% confidence

$\mu = \frac{490}{49} \pm 1.96\frac{2}{\sqrt{49}} = 10 \pm 0.56$

9.44 < $\mu$ < 10.56

Another Example
One of the SOM professors wants to
estimate the mean number of hours
worked per week by students. A sample
of 49 students showed a mean of 24
hours. It is assumed that the population
standard deviation is 4 hours. What is
the interval estimate of the population mean?

Another Example contd


95 percent confidence interval for the
population mean.

$\bar{X} \pm 1.96\frac{\sigma}{\sqrt{n}} = 24.00 \pm 1.96\frac{4}{\sqrt{49}} = 24.00 \pm 1.12$

The confidence limits range from 22.88 to


25.12. We estimate with 95 percent
confidence that the average number of hours
worked per week by students lies between
these two values.
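
A sketch of the same interval in code, assuming the population standard deviation is known so that the normal distribution applies. The function name `z_interval` is illustrative; the z value is computed with `statistics.NormalDist` from the tail area .5(1 - confidence level) instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(sample_mean: float, sigma: float, n: int, confidence: float):
    """Confidence interval for a mean when sigma is known."""
    tail = 0.5 * (1 - confidence)       # area in one tail
    z = NormalDist().inv_cdf(1 - tail)  # about 1.96 for 95% confidence
    margin = z * sigma / sqrt(n)
    return sample_mean - margin, sample_mean + margin

low, high = z_interval(sample_mean=24.0, sigma=4.0, n=49, confidence=0.95)
print(f"{low:.2f} to {high:.2f}")
```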

Confidence Interval: Mean


t distribution

Use if normal population but population
standard deviation ($\sigma$) not known
If you are given the sample standard
deviation (s), use t table, assuming normal
population
If one population, n-1 degrees of freedom

Confidence Interval: Mean


t distribution

$\mu = \frac{\sum x}{n} \pm t_{n-1}\frac{s}{\sqrt{n}}$

Confidence Interval:
Proportion
Use if success or failure
(ex: defective or not-defective,
satisfactory or unsatisfactory)
Normal approximation to binomial ok if
(n)($\pi$) > 5 and (n)(1-$\pi$) > 5, where
n = sample size
$\pi$ = population proportion
NOTE: NEVER use the t table if proportion!!

Confidence Interval:
Proportion

$p \pm z\sqrt{\frac{p(1 - p)}{n}}$
Ex: 8 defectives out of 100, so p = .08 and
n = 100, 95% confidence

$.08 \pm 1.96\sqrt{\frac{(.08)(.92)}{100}} = .08 \pm .05$

Confidence Interval:
Proportion
A sample of 500 people who own their house
revealed that 175 planned to sell their homes
within five years. Develop a 98% confidence
interval for the proportion of people who plan to
sell their house within five years.
$p = \frac{175}{500} = 0.35$

$.35 \pm 2.33\sqrt{\frac{(.35)(.65)}{500}} = .35 \pm .0497$
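
The home-seller interval can be sketched as follows. The function name `proportion_interval` is illustrative. The z value is computed with `statistics.NormalDist`, so it is about 2.326 rather than the table value 2.33 used above, and the margin differs slightly in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes: int, n: int, confidence: float):
    """Sample proportion and margin of error using the normal approximation."""
    p = successes / n
    tail = 0.5 * (1 - confidence)
    z = NormalDist().inv_cdf(1 - tail)
    margin = z * sqrt(p * (1 - p) / n)
    return p, margin

p, margin = proportion_interval(successes=175, n=500, confidence=0.98)
print(f"{p:.2f} +/- {margin:.4f}")
```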

Interpretation

If 95% confidence, then 95% of all confidence
intervals constructed this way will include the true population parameter
NOTE! Never use the term "probability" when
estimating a parameter!! (ex: Do NOT say
"Probability that population mean is between 23 and
32 is .95" because a parameter is not a random
variable. In fact, the population mean is a fixed but
unknown quantity.)

Point vs Interval Estimate

Point estimate: statistic (single number)


Ex: sample mean, sample proportion
Each sample gives different point estimate
Interval estimate: range of values
Ex: Population mean = sample mean $\pm$ error
Parameter = statistic $\pm$ error

Width of Interval

Ex: sample mean =23, error = 3


Point estimate = 23
Interval estimate = 23 $\pm$ 3, or (20,26)
Width of interval = 26-20 = 6
Wide interval: Point estimate unreliable

Wide Confidence Interval If


(1) small sample size(n)
(2) large standard deviation
(3) high confidence level (ex: 99% confidence
interval wider than 95% confidence interval)
If you want narrow interval, you need a large
sample size or small standard deviation or low
confidence level.

Top Ten #4

Linear Regression

Linear Regression
Regression equation: $\hat{y} = b_0 + b_1 x$
$\hat{y}$ = dependent variable = predicted value
x = independent variable
b0 = y-intercept = predicted value of y if x = 0
b1 = slope = regression coefficient
= change in y per unit change in x

Slope vs Correlation

Positive slope (b1>0): positive correlation
between x and y (y increases if x increases)
Negative slope (b1<0): negative correlation (y
decreases if x increases)
Zero slope (b1=0): no correlation (predicted
value for y is mean of y), no linear
relationship between x and y

Simple Linear Regression

Simple: one independent variable, one


dependent variable
Linear: graph of regression equation is
straight line

Example

y = salary (female manager, in thousands of


dollars)
x = number of children
n = number of observations

Given Data and Totals

x          y
2          48
1          52
4          33
Sum=7      Sum=133

n=3

Slope (b1) = -6.5

Method of Least Squares formulas not on
BUS 302 exam
b1 = -6.5 given
Interpretation: If one female manager has 1
more child than another, her expected salary is $6,500
lower; that is, the salary of female managers
is expected to decrease by 6.5 (in
thousands of dollars) per child

Intercept (b0)

$b_0 = \bar{y} - b_1\bar{x}$

$\bar{x} = \frac{\sum x}{n} = \frac{7}{3} = 2.33$

$\bar{y} = \frac{\sum y}{n} = \frac{133}{3} = 44.33$

b0 = 44.33 - (-6.5)(2.33) = 59.5

If number of children is zero,


expected salary is $59,500

Regression Equation

$\hat{y} = 59.5 - 6.5x$

Forecast Salary If 3 Children

59.5 - 6.5(3) = 40
$40,000 = expected salary
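
For readers who want to see where b1 and b0 come from, here is a minimal least-squares sketch. One assumption is labeled loudly: the x values 2, 1 and 4 are inferred from the total of 7 children and from the predicted salaries 46.5, 53 and 33.5 shown in the standard error table. The function name `least_squares` is illustrative, and these formulas are the standard textbook ones rather than anything specific to this course.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

children = [2, 1, 4]     # x: ASSUMED values, inferred as described above
salaries = [48, 52, 33]  # y: salary in thousands of dollars

b0, b1 = least_squares(children, salaries)
forecast_3_children = b0 + b1 * 3

print(f"b0 = {b0:.1f}, b1 = {b1:.1f}, forecast = {forecast_3_children:.1f}")
```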

Standard Error of Estimate

$\hat{y}$ = forecast = $b_0 + b_1 x$

error = $y - \hat{y}$

$S = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (y - \hat{y})^2}{n - 2}}$

Standard Error of Estimate

(1) = x    (2) = y    (3) = $\hat{y}$ = 59.5 - 6.5x    (4) = (2) - (3)    $(y - \hat{y})^2$
2          48         46.5                     1.5                2.25
1          52         53                       -1                 1
4          33         33.5                     -.5                .25

SSE=3.5

Standard Error of Estimate

$S = \sqrt{\frac{3.5}{3 - 2}} = \sqrt{3.5} = 1.9$
Actual salary typically $1,900
away from expected salary
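
The standard error of estimate can be recomputed with the sketch below. It takes b0 = 59.5 and b1 = -6.5 as given, and it assumes x values of 2, 1 and 4, which are the values consistent with the predicted salaries 46.5, 53 and 33.5. Variable names are illustrative.

```python
from math import sqrt

b0, b1 = 59.5, -6.5
xs = [2, 1, 4]     # ASSUMED numbers of children (see note above)
ys = [48, 52, 33]  # actual salaries in thousands of dollars

predicted = [b0 + b1 * x for x in xs]               # y-hat for each x
errors = [y - y_hat for y, y_hat in zip(ys, predicted)]
sse = sum(e * e for e in errors)                    # sum of squared errors
standard_error = sqrt(sse / (len(xs) - 2))          # n - 2 degrees of freedom

print(predicted, errors, sse, round(standard_error, 1))
```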

Coefficient of Determination

R2 = % of total variation in y that can be


explained by variation in x
Measure of how close the linear regression
line fits the points in a scatter diagram
R2 = 1: max. possible value: perfect linear
relationship between y and x (straight line)
R2 = 0: min. value: no linear relationship

Sources of Variation (V)

Total V = Explained V + Unexplained V


SS = Sum of Squares = V
Total SS = Regression SS + Error SS
SST = SSR + SSE
SSR = Explained V, SSE = Unexplained V

Coefficient of Determination

R2 = SSR/SST
R2 = 197/200.5 = .98
Interpretation: 98% of total variation in salary
can be explained by variation in number of
children
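
The decomposition SST = SSR + SSE can be checked numerically with the sketch below, using the actual salaries and the predicted salaries from the standard error table. Computed without intermediate rounding, SST and SSR come out a little above the 200.5 and 197 quoted above, which appear to be rounded; R2 still rounds to .98.

```python
ys = [48, 52, 33]               # actual salaries (thousands of dollars)
predicted = [46.5, 53.0, 33.5]  # y-hat = 59.5 - 6.5x from the table

y_bar = sum(ys) / len(ys)
sst = sum((y - y_bar) ** 2 for y in ys)                       # total variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predicted))  # unexplained
ssr = sst - sse                                               # explained
r_squared = ssr / sst

print(f"SST = {sst:.2f}, SSR = {ssr:.2f}, SSE = {sse:.2f}, R2 = {r_squared:.2f}")
```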

$0 \le R^2 \le 1$

0: No linear relationship since SSR=0


(explained variation =0)
1: Perfect relationship since SSR = SST
(unexplained variation = SSE = 0), but does
not prove cause and effect

R=Correlation Coefficient

Case 1: slope (b1) < 0


R<0
R is negative square root of coefficient of
determination

$R = -\sqrt{R^2}$

Our Example

Slope = b1 = -6.5
R2 = .98
R = -.99

Case 2: Slope > 0

R is positive square root of coefficient of


determination
Ex: R2 = .49
R = .70
R has no interpretation
R overstates relationship

Caution

Nonlinear relationship (parabola, hyperbola,


etc) can NOT be measured by R2
In fact, you could get R2=0 with a nonlinear
graph on a scatter diagram

Summary: Correlation Coefficient

Case 1: If b1 > 0, R is the positive square root


of the coefficient of determination

Case 2: If b1 < 0, R is the negative square


root of the coefficient of determination

Ex#1: y = 4+3x, R2=.36: R = +.60

Ex#2: y = 80-10x, R2=.49: R = -.70

NOTE! Ex#2 has stronger relationship, as


measured by coefficient of determination

Extreme Values

R=+1: perfect positive correlation

R= -1: perfect negative correlation

R=0: zero correlation

MS Excel Output
Correlation Coefficient (-0.9912): Note
that you need to change the sign of the value in the Excel output because
the sign of slope (b1) is negative (-6.5)
Other values labeled in the output:
Coefficient of Determination
Standard Error of Estimate
Regression Coefficient

Top Ten #5

Expected Value

Expected Value

Expected Value = $E(x) = \sum x P(x)$
= $x_1 P(x_1) + x_2 P(x_2) + \dots$

Expected value is a weighted average, also a


long-run average

Example

Find the expected age at high school


graduation if 11 were 17 years old, 80 were
18 years old, and 5 were 19 years old

Step 1: 11+80+5=96

Step 2

x      P(x)            x P(x)
17     11/96=.115      17(.115)=1.955
18     80/96=.833      18(.833)=14.994
19     5/96=.052       19(.052)=.988
                       E(x)= 17.937
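
The two steps above can be sketched in a few lines. The dictionary layout and names are illustrative. Because the code keeps the exact fractions instead of three-decimal probabilities, it gives 17.9375 rather than 17.937.

```python
# Number of graduates at each age, from the example above.
graduates_by_age = {17: 11, 18: 80, 19: 5}

total = sum(graduates_by_age.values())  # Step 1: 96 graduates

# Step 2: weight each age by its relative frequency P(x) and add up.
expected_age = sum(age * count / total for age, count in graduates_by_age.items())

print(total, expected_age)
```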

Top Ten #6

What Distribution to Use?

Use Binomial Distribution If:

Random variable (x) is number of successes in n
trials
Each trial is success or failure
Independent trials
Constant probability of success ($\pi$) on each trial
Sampling with replacement (in practice, people
may use binomial w/o replacement, but theory is
with replacement)

Success vs. Failure

The binomial experiment can result in only


one of two possible outcomes:
Male vs. Female
Defective vs. Non-defective
Yes or No
Pass (8 or more right answers) vs. Fail (fewer
than 8)
Buy drink (21 or over) vs. Cannot buy drink

Binomial Is Discrete

Integer values
0, 1, 2, ..., n
Binomial is often skewed, but may be symmetric
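
To make the binomial setting concrete, the sketch below computes the chance of passing a quiz by getting 8 or more answers right. The quiz length of 10 questions and the success probability of 0.5 per question are hypothetical values chosen for illustration; they do not come from the slides. The function name `binomial_pmf` is also illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_questions = 10  # HYPOTHETICAL number of trials
p_correct = 0.5   # HYPOTHETICAL probability of success on each trial

# Pass = 8 or more right answers, so add the probabilities for k = 8, 9, 10.
p_pass = sum(binomial_pmf(k, n_questions, p_correct)
             for k in range(8, n_questions + 1))

print(f"P(pass) = {p_pass:.4f}")
```

With a success probability of 0.5 the distribution is symmetric; other values of the probability make it skewed.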

Normal Distribution

Continuous, bell-shaped, symmetric


Mean=median=mode
Measurement (dollars, inches, years)
Cumulative probability under normal curve : use
Z table if you know population mean and
population standard deviation
Sample mean: use Z table if you know
population standard deviation and either normal
population or n > 30

t Distribution

Continuous, mound-shaped, symmetric


Applications similar to normal
More spread out than normal
Use t if normal population but population
standard deviation not known
Degrees of freedom = df = n-1 if estimating the
mean of one population
t approaches z as df increases

Normal or t Distribution?

Use t table if normal population but population
standard deviation ($\sigma$) is not known
If you are given the sample standard deviation
(s), use t table, assuming normal population

Top Ten #7

P-value

P-value

P-value = probability of getting a sample statistic
as extreme as (or more extreme than) the sample
statistic you got from your sample, given that the
null hypothesis is true

P-value Example: one tail test

H0: $\mu$ = 40
HA: $\mu$ > 40
Sample mean = 43
P-value = P(sample mean $\ge$ 43, given H0 true)
Meaning: probability of observing a sample
mean as large as 43 (or larger) when the population mean
is 40
How to use it: Reject H0 if p-value < $\alpha$
(significance level)

Two Cases

Suppose $\alpha$ = .05
Case 1: suppose p-value = .02, then reject H0
(unlikely H0 is true; you believe population mean
> 40)
Case 2: suppose p-value = .08, then do not
reject H0 (H0 may be true; the sample does not give
enough evidence that the population mean is greater than 40)

P-value Example: two tail test


H0 : $\mu$ = 70
HA: $\mu \ne$ 70
Sample mean = 72
If two-tails, then P-value =
2 P(sample mean > 72) = 2(.04) = .08
If $\alpha$ = .05, p-value > $\alpha$, so do not reject H0
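
The p-value decision rule is simple enough to write down as a sketch. The one-tail probability of .04 is taken as given from the example; the function names are illustrative.

```python
def two_tailed_p_value(one_tail_probability: float) -> float:
    """Double the one-tail area to get the two-tail p-value."""
    return 2 * one_tail_probability

def decide(p_value: float, alpha: float) -> str:
    """Reject H0 only when the p-value is below the significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"

p_value = two_tailed_p_value(0.04)  # P(sample mean > 72) = .04 from the example
decision = decide(p_value, alpha=0.05)

print(round(p_value, 2), decision)
```

The same sample would lead to rejection in a one-tail test (.04 < .05), which shows why the tails must be chosen before looking at the data.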

Top Ten #8

Variation Creates Uncertainty

No Variation

Certainty, exact prediction


Standard deviation = 0
Variance = 0
All data exactly same
Example: all workers in minimum wage job

High Variation

Uncertainty, unpredictable
High standard deviation
Ex #1: Workers in downtown L.A. have variation
between CEOs and garment workers
Ex #2: New York temperatures in spring range
from below freezing to very hot

Comparing Standard
Deviations

Temperature Example
Beach city: small standard deviation (single
temperature reading close to mean)
High Desert city: High standard deviation (hot
days, cool nights in spring)

Standard Error of the Mean


Standard deviation of sample mean =
standard deviation/square root of n
Ex: standard deviation = 10, n =4, so standard
error of the mean = 10/2= 5

Note that 5<10, so standard error < standard


deviation.
As n increases, standard error decreases.
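
A short sketch of the standard error formula, showing how it shrinks as the sample size grows. The sample sizes other than n = 4 are illustrative.

```python
from math import sqrt

def standard_error(standard_deviation: float, n: int) -> float:
    """Standard deviation of the sample mean for samples of size n."""
    return standard_deviation / sqrt(n)

se_n4 = standard_error(10, 4)  # the example above: 10 / 2

for n in (4, 25, 100):
    print(n, standard_error(10, n))
```

Quadrupling the sample size only halves the standard error, because n enters through its square root.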

Sampling Distribution

Expected value of sample mean = population


mean, but an individual sample mean could be
smaller or larger than the population mean
Population mean is a constant parameter, but
sample mean is a random variable
Sampling distribution is distribution of sample
means

Example

Mean age of all students in the building is


population mean
Each classroom has a sample mean
Distribution of sample means from all
classrooms is sampling distribution

Central Limit Theorem (CLT)

If population standard deviation is known,
sampling distribution of sample means is approximately normal
if n > 30
CLT applies even if original population is
skewed
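
The CLT can be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the skewed population is exponential with mean 1 and standard deviation 1, each sample has n = 36 observations, 2000 samples are drawn, and the seed is fixed so the run is repeatable.

```python
import random
from math import sqrt
from statistics import fmean, stdev

rng = random.Random(42)  # fixed seed so the run is repeatable
n = 36                   # sample size (greater than 30)
num_samples = 2000       # number of repeated samples

# Skewed population: exponential with mean 1 and standard deviation 1.
sample_means = [
    fmean(rng.expovariate(1.0) for _ in range(n))
    for _ in range(num_samples)
]

mean_of_means = fmean(sample_means)  # should be near the population mean, 1
sd_of_means = stdev(sample_means)    # should be near sigma / sqrt(n) = 1/6

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means   = {sd_of_means:.3f} (theory: {1 / sqrt(n):.3f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual observations are strongly right-skewed.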

Top Ten #9

Population vs. Sample

Population

Collection of all items (all light bulbs made at


factory)
Parameter: measure of population
(1) population mean (average number of
hours in life of all bulbs)
(2) population proportion (% of all bulbs that
are defective)

Sample

Part of population (bulbs tested by inspector)


Statistic: measure of sample = estimate of
parameter
(1) sample mean (average number of hours
in life of bulbs tested by inspector)
(2) sample proportion (% of bulbs in sample
that are defective)

Top Ten #10

Qualitative vs. Quantitative

Qualitative

Categorical data:
success vs. failure
ethnicity
marital status
color
zip code
4 star hotel in tour guide

Qualitative

If you need an average, do not calculate the


mean
However, you can compute the mode
(average person is married, buys a blue car
made in America)

Quantitative

Two cases

Case 1: discrete
Case 2: continuous

Discrete
(1) integer values (0, 1, 2, ...)
(2) example: binomial
(3) finite number of possible values
(4) counting
(5) number of brothers
(6) number of cars arriving at gas station

Continuous

Real numbers, such as decimal values


($22.22)
Examples: Z, t
Infinite number of possible values
Measurement
Miles per gallon, distance, duration of time

Graphical Tools

Pie chart or bar chart: qualitative


Joint frequency table: qualitative (relate
marital status vs zip code)
Scatter diagram: quantitative (distance from
CSUN vs duration of time to reach CSUN)

Hypothesis Testing
Confidence Intervals

Quantitative: Mean
Qualitative: Proportion
