Univariate Statistics
Chi Square
ANOVA
Descriptive Statistics
Summarization of a collection of data in a clear
and understandable way
the most basic form of statistics
lays the foundation for all statistical knowledge
Inferential Statistics
Two main methods:
1. estimation
the sample statistic is used to estimate a
population parameter
a confidence interval about the estimate is
constructed.
2. hypothesis testing
a null hypothesis is put forward
Analysis of the data is then used to
determine whether to reject it.
Inferential statistics generally require that
sampling be random
TYPES OF DATA
• Nominal: gender, type of customer
(loyalty), flavor/color liked, etc.
• Ordinal/Ranking: type of user, preferred
brand, brand awareness, etc.
• Interval: Attitudinal or satisfaction scales.
Are you satisfied with your education at U of L?
Dissatisfied 1 2 3 4 5 Satisfied
• Ratio: Income, price willing to pay, age, etc.
Type of measurement and type of descriptive analysis:
Two categories: frequency table, proportion (percentage)
Ratio: means
Frequency Tables
The arrangement of statistical data in a row-and-
column format that exhibits the count of
responses or observations for each category
assigned to a variable
• How many of certain brand users can be called
loyal?
• What percentage of the market are heavy users
and light users?
• How many consumers are aware of a new product?
• What brand is the “Top of Mind” of the market?
WebSurveyor Bar Chart
How did you find your last job?
Networking: 643
Print ad: 213 (18.3 %)
Online recruitment site: 179
Placement firm: 112 (9.6 %)
Temporary agency: 18 (1.5 %)
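As a rough illustration, the percentages in a frequency table follow directly from the counts. The sketch below uses the job-search counts from the chart above; the variable names are illustrative and not part of the survey tool.

```python
# Counts from the "How did you find your last job?" chart.
counts = {
    "Networking": 643,
    "Print ad": 213,
    "Online recruitment site": 179,
    "Placement firm": 112,
    "Temporary agency": 18,
}

total = sum(counts.values())

# Share of all responses in each category, rounded to one decimal place.
percentages = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:<25}{n:>5}{percentages[name]:>7.1f} %")
```

The three percentages printed on the chart (18.3 %, 9.6 %, 1.5 %) can be checked against this output.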
Measures of Central Location or
Tendency
• Mean: average value
• Mode: the most frequent category
• Median: the middle observation of the data
The Mean (average value)
sum of all the scores divided by the number of scores.
a good measure of central tendency for roughly
symmetric distributions
can be misleading in skewed distributions, since it can be
greatly influenced by extreme scores; in that case other
statistics, such as the median, may be more informative
formula: $\mu = \sum X / N$ (population)
$\bar{X} = \sum x_i / n$ (sample)
where $\mu$ / $\bar{X}$ is the population/sample mean
and N/n is the number of scores.
Mode
the most frequent category
users 25%
non-users 75%
Advantages:
• meaning is obvious
• the only measure of central tendency that can be used
with nominal data.
Disadvantages
• many distributions have more than one mode, i.e. are
"multimodal"
• greatly subject to sample fluctuations
• therefore not recommended to be used as the only
measure of central tendency.
Median
the middle observation of the data
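number of times per week consumers use mouthwash (27 respondents, sorted):
1 1 2 2 2 3 3 3 3 3 4 4 4 4 4 4 4 5 5 5 5 5 6 6 6 7 7
median = 4 (the 14th of the 27 ordered values)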
[Figure: frequency distribution of mouthwash use per week]
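The three measures of central tendency can be checked on the mouthwash data with a short sketch using Python's standard `statistics` module. The variable name `usage` is an illustrative choice.

```python
from statistics import mean, median, mode

# Weekly mouthwash use for the 27 respondents, as listed above.
usage = [int(ch) for ch in "112223333344444445555566677"]

print("n      =", len(usage))
print("mean   =", mean(usage))    # sum of the scores divided by n
print("median =", median(usage))  # middle observation of the sorted data
print("mode   =", mode(usage))    # most frequent value
```

Because this distribution is symmetric, all three measures coincide; in a skewed distribution they would drift apart.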
Skewed Distributions
Occur when one tail of the distribution is longer than the other.
Positive Skew Distributions
have a long tail in the positive direction.
sometimes called "skewed to the right"
more common than distributions with negative skews
E.g. distribution of income. Most people make under $40,000 a
year, but some make quite a bit more with a small number making
many millions of dollars per year
The positive tail therefore extends out quite a long way
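Measures of Dispersion
$s^2 = \sum (x_i - \bar{x})^2 / n$
$S = \sqrt{\sum (x_i - \bar{x})^2 / n}$
Scale: Dislike 1 2 3 4 5 Like
Respondent   Rating
1.           5
2.           4
3.           5
4.           5
5.           5
6.           4
$\bar{X} = 4.67$, $s^2 = 0.22$, $S = 0.47$ (dividing by n as in the formula; dividing by n − 1 instead gives $s^2 = 0.27$, $S = 0.52$)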
Measures of Dispersion
Scale: Dislike 1 2 3 4 5 Like
Respondent   Rating
1.           1
2.           5
3.           1
4.           5
5.           1
6.           5
$\bar{X} = 3$, $s^2 = 4$, $S = 2$
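The two rating examples can be compared with a minimal sketch of the variance and standard deviation formulas, dividing by n as the slides do. The function name `dispersion` and the labels are illustrative.

```python
from math import sqrt

def dispersion(scores):
    """Return (mean, variance, standard deviation), dividing by n."""
    n = len(scores)
    mean = sum(scores) / n
    sum_sq = sum((x - mean) ** 2 for x in scores)
    variance = sum_sq / n
    return mean, variance, sqrt(variance)

agreeing = [5, 4, 5, 5, 5, 4]   # ratings clustered near "Like"
polarized = [1, 5, 1, 5, 1, 5]  # ratings split between the two extremes

for label, scores in (("agreeing", agreeing), ("polarized", polarized)):
    mean, variance, sd = dispersion(scores)
    print(f"{label:<10} mean={mean:.2f} s2={variance:.2f} S={sd:.2f}")
```

With the n divisor the first example gives a variance of about 0.22; the n − 1 divisor would give about 0.27. The second example gives 4 and 2 exactly.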
Normal Distributions
with different SD
[Figure: three normal curves centered on $\mu$ with standard deviations $\sigma_1$, $\sigma_2$ and $\sigma_3$]
How does the Normal Distribution
help to make decisions?
Suppose you are about to introduce new
“Guacamole Doritos” to the market.
• Need to determine:
– Desired flavor intensity (How hot it should be)
– Package size offered
– Introduction price
• What do you do in order to answer your
questions?
ASK THE CONSUMER
• How?
TAKE A SAMPLE
Scale: 1 (Too Mild) ... Perfect Flavor at the midpoint ... 7 (Too Hot)
Results show: $\bar{x}_1 = 2.3$ and $S_1 = 1.5$
• Can you conclude that on average the target
population thought the flavor was mild?
• Suppose you take a series of random
samples of n=100 subjects:
$\bar{x}_2 = 3.7$ and $S_2 = 2$
$\bar{x}_3 = 4.3$ and $S_3 = 0.5$
$\bar{x}_4 = 2.8$ and $S_4 = 0.97$
...
$\bar{x}_{50} = 3.7$ and $S_{50} = 2$
The Sampling Distribution
The means of all the samples will have their own
distribution called the sampling distribution of the
means
It is approximately a normal distribution when the samples are large
The sampling distribution of a proportion is a
binomial that approximates a normal distribution in
large samples (30+)
The mean of the sampling distribution of the mean, estimated by
$\bar{X} = (\sum X_i)/n$,
equals the population parameter $\mu$
Sampling Distribution
The standard deviation of the sampling distribution is
called the standard error of the mean (or proportion).
$\sigma_{\bar{X}} = \sigma / \sqrt{n}$
$\sigma_p = \sqrt{\pi(1-\pi)/n}$
Often the population standard deviation $\sigma$ is
unknown and has to be estimated from the sample
$S = \hat{\sigma} = \sqrt{\sum (X_i - \bar{X})^2 / (n-1)}$
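The two standard-error formulas can be sketched as small Python functions. The Doritos figures (S = 1.5) come from the first sample above, with n = 100 assumed as in the later series of samples; the proportion example (0.5, n = 100) is purely hypothetical.

```python
from math import sqrt

def standard_error_mean(sd, n):
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / sqrt(n)

def standard_error_proportion(pi, n):
    """Standard error of a proportion: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)

# First Doritos sample: S = 1.5, assuming n = 100 respondents.
se_mean = standard_error_mean(1.5, 100)

# Hypothetical proportion: half of 100 respondents choose an option.
se_prop = standard_error_proportion(0.5, 100)

print(f"standard error of the mean:       {se_mean:.3f}")
print(f"standard error of the proportion: {se_prop:.3f}")
```

Both errors shrink with the square root of n, so quadrupling the sample size only halves them.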
[Figure: population distribution of the Doritos' flavor rating (X), centered on $\mu$]
[Figure: sample distribution of the Doritos' flavor rating (x) on the 1 to 7 scale]
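• What relationship does the Population
Distribution have to the Sample Distribution?
$\bar{X} \sim N(\mu_0, \sigma^2/n)$
• If we standardized $\bar{X}$, we would get
$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1)$
• When $\sigma$ is estimated by the sample standard deviation s:
$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim T(n-1)$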
• X = perceived difference between the pizzas
• $\mu$ = real population mean, which equals zero if H0 is
true.
• $\bar{x}$ = 3.5, observed sample mean
• SD = 2.1, observed sample standard deviation
• n = 40
• $\alpha$ = .01
$t = \frac{3.5 - 0}{2.1/\sqrt{40}} \sim T(39)$
t = 10.54, critical value $T_{.005}(39) \approx 2.71$
Since 10.54 > 2.71, H0 is rejected.
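The pizza calculation can be reproduced with a minimal sketch of the one-sample t statistic. The function name is illustrative, and the critical value is hard-coded as an assumption read from a t table (two-tailed, alpha = .01, 39 degrees of freedom, roughly 2.71) rather than computed.

```python
from math import sqrt

def one_sample_t(sample_mean, mu0, sd, n):
    """t = (sample mean - hypothesized mean) / (sd / sqrt(n))."""
    return (sample_mean - mu0) / (sd / sqrt(n))

t_stat = one_sample_t(sample_mean=3.5, mu0=0.0, sd=2.1, n=40)

# Assumed table value for alpha/2 = .005 with 39 degrees of freedom.
T_CRITICAL = 2.71

reject_h0 = abs(t_stat) > T_CRITICAL
print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The statistic lands far beyond the critical value, so the conclusion does not depend on the exact table entry used.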
5. Determine the Probability-
value (Critical Value)
The p-value is the probability of seeing a
random sample at least as extreme as the
sample observed given that the null
hypothesis is true.
• For example:
– In reference to the null hypothesis, if H0
hypothesized that there would be no difference
between the pizzas, a sample mean value of 2.5
would be high, but even more extreme would
be a value of 3.5.
– If the p-value is 0.03, it would mean that if H0 were true and we
took 100 random samples, we would expect only about three
of them to show a mean as extreme as 3.5 or more.
– At the .05 level of significance, it would be concluded that we
have enough evidence to reject H0.
6. Compare with the level of significance, and determine if
the test statistic falls in the rejection region
[Figure: two-tailed test. Do not reject H0 in the central region of area $1-\alpha$; reject H0 in either tail of area $\alpha/2$]
$t = \frac{(\bar{X}_H - \bar{X}_L) - (\mu_H - \mu_L)}{S_{\bar{X}_H - \bar{X}_L}} \sim T(n_H + n_L - 2)$
SPSS Output
Independent Samples Test
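Tylenol vs Advil
$z = \frac{(.65 - .63) - 0}{\sqrt{(.65)(.35)/400 + (.63)(.37)/400}} = 0.59$
$\alpha = 0.10$, critical value of N(0,1): $z_{\alpha/2} = 1.64$
[Figure: standard normal curve with central area $1-\alpha$ and rejection regions of area $\alpha/2$ in each tail]
Since 0.59 < 1.64, H0 is not rejected.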
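The Tylenol vs Advil comparison can be checked with a short sketch of the two-proportion z statistic using the unpooled standard error written in the formula above. With the figures as printed (.65 and .63, 400 respondents each) the arithmetic comes to about 0.59. The function name and the 1.645 critical value (two-tailed, alpha = 0.10, from a standard normal table) are assumptions of this sketch.

```python
from math import sqrt

def two_proportion_z(p1, n1, p2, n2, diff0=0.0):
    """z for the difference of two independent sample proportions."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return ((p1 - p2) - diff0) / se

z = two_proportion_z(0.65, 400, 0.63, 400)

# Assumed table value: two-tailed test at alpha = 0.10.
Z_CRITICAL = 1.645

reject_h0 = abs(z) > Z_CRITICAL
print(f"z = {z:.2f}, reject H0: {reject_h0}")
```

A two-point difference between samples of 400 is well inside the range expected from sampling variation alone.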
$d_i = x_{1i} - x_{2i}$, $i = 1, 2, \ldots, n$
                    Mean     N     Std. Deviation   Std. Error Mean
Pair 1   RATING1    2.10     40    4.717            .746
         RATING2    -1.33    40    3.083            .488
Paired Differences (Pair 1: RATING1 - RATING2)
Mean                                        3.43
Std. Deviation                              5.857
Std. Error Mean                             .926
95% Confidence Interval of the Difference   Lower 1.55, Upper 5.30
t                                           3.699
df                                          39
Sig. (2-tailed)                             .001
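The t value in the paired-samples output can be approximately recovered from the summary figures alone. The sketch below assumes only the rounded mean difference (3.43), its standard deviation (5.857) and n = 40 from the table; the function name is illustrative.

```python
from math import sqrt

def paired_t(mean_diff, sd_diff, n):
    """Return (t, standard error, degrees of freedom) for paired differences."""
    se = sd_diff / sqrt(n)
    return mean_diff / se, se, n - 1

t_stat, se, df = paired_t(mean_diff=3.43, sd_diff=5.857, n=40)
print(f"standard error = {se:.3f}, t = {t_stat:.2f}, df = {df}")
```

The result is about 3.70 rather than exactly 3.699 because the table reports the mean difference rounded to two decimals.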
Cross Tabulation
and Chi Square Test
for Independence
Cross-tabulation
• Helps answer questions about whether two
or more variables of interest are linked:
– Is the type of mouthwash user (heavy or light)
related to gender?
– Is the preference for a certain flavor (cherry or
lemon) related to the geographic region (north,
south, east, west)?
– Is income level associated with gender?
• Cross-tabulation determines association, not
causality.
Dependent and Independent Variables
• The variable being studied is called the
dependent variable or response variable.
• A variable that influences the dependent
variable is called the independent variable.
Cross-tabulation
• Cross-tabulation of two or more variables is
possible if the variables are discrete:
– The frequency of one variable is subdivided by the
other variable categories.
• Generally a cross-tabulation table has:
– Row percentages
– Column percentages
– Total percentages
• Which one is better?
DEPENDS on which variable is considered as
independent.
Cross tabulation
GROUPINC * Gender Crosstabulation
                                          Gender
GROUPINC            Statistic             Female    Male      Total
income <= 5         Count                 10        9         19
                    % within GROUPINC     52.6%     47.4%     100.0%
                    % within Gender       55.6%     18.8%     28.8%
                    % of Total            15.2%     13.6%     28.8%
5 < income <= 10    Count                 5         25        30
                    % within GROUPINC     16.7%     83.3%     100.0%
                    % within Gender       27.8%     52.1%     45.5%
                    % of Total            7.6%      37.9%     45.5%
income > 10         Count                 3         14        17
                    % within GROUPINC     17.6%     82.4%     100.0%
                    % within Gender       16.7%     29.2%     25.8%
                    % of Total            4.5%      21.2%     25.8%
Total               Count                 18        48        66
                    % within GROUPINC     27.3%     72.7%     100.0%
                    % within Gender       100.0%    100.0%    100.0%
                    % of Total            27.3%     72.7%     100.0%
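The row, column and total percentages in the crosstabulation all derive from the cell counts. A minimal sketch, using the counts from the table above and illustrative variable names:

```python
# Cell counts from the GROUPINC * Gender crosstabulation.
counts = {
    "income <= 5": {"Female": 10, "Male": 9},
    "5 < income <= 10": {"Female": 5, "Male": 25},
    "income > 10": {"Female": 3, "Male": 14},
}

row_totals = {row: sum(cells.values()) for row, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for col, n in cells.items():
        col_totals[col] = col_totals.get(col, 0) + n
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

# "% within GROUPINC", "% within Gender" and "% of Total" for every cell.
row_pct = {r: {c: pct(n, row_totals[r]) for c, n in cells.items()}
           for r, cells in counts.items()}
col_pct = {r: {c: pct(n, col_totals[c]) for c, n in cells.items()}
           for r, cells in counts.items()}
total_pct = {r: {c: pct(n, grand_total) for c, n in cells.items()}
             for r, cells in counts.items()}

for r in counts:
    print(r, row_pct[r], col_pct[r], total_pct[r])
```

Which set of percentages to read depends on which variable is treated as independent: percentages are normally computed within the categories of the independent variable.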
Contingency Table
• A contingency table shows the conjoint
distribution of two discrete variables
• This distribution represents the probability
of observing a case in each cell
– Probability is calculated as:
P = observed cases in the cell / total cases
Chi-square Test for Independence
• The Chi-square test for independence
determines whether two variables are
associated or not.
H0: Two variables are independent
H1: Two variables are not independent
d.f. = (R − 1)(C − 1)
d.f. = (2 − 1)(2 − 1) = 1
Chi-square
Critical value (d.f. = 1, $\alpha$ = .05): 3.84
Observed $\chi^2$ = 22.16 > 3.84: Reject H0
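The 2 x 2 table behind the 22.16 figure is not reproduced in these notes, so the sketch below applies the test to the income by gender crosstabulation shown earlier instead. It assumes the standard Pearson statistic, the sum of (observed − expected)² / expected over all cells, which the notes do not spell out. The critical value 5.99 (alpha = .05, 2 degrees of freedom) is an assumption taken from a chi-square table.

```python
# Observed counts: rows are income groups, columns are Female and Male.
observed = [
    [10, 9],
    [5, 25],
    [3, 14],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if income group and gender were independent.
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumed table value for alpha = .05 with 2 degrees of freedom.
CHI2_CRITICAL = 5.99

reject_h0 = chi2 > CHI2_CRITICAL
print(f"chi-square = {chi2:.2f}, df = {df}, reject H0: {reject_h0}")
```

One expected count here falls slightly below 5, so with real data the result would deserve a cautious reading.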
Analysis of Variance
(ANOVA)
What is an ANOVA?
• ANOVA stands for Analysis of
Variance
• Purpose:
– Extends the test for mean difference between
two independent samples to multiple samples.
– Employed to analyze the effects of
manipulations (independent variables) on a
random variable (dependent).
Definitions
• Dependent variable: the variable we are
trying to explain, also known as response
variable (Y).
• Independent variable: also known as
explanatory variables (X).
Therefore, we would like to study whether
the independent variable has an effect on
the variability of the dependent variable
[Figure: means of a continuous dependent variable plotted by type of ad]
Category means: $\bar{Y}_1$, $\bar{Y}_2$, $\bar{Y}_3$, ..., $\bar{Y}_c$; grand mean: $\bar{Y}$
Between-category variation: $SS_{between}$
Decomposition of the Total
Variation
• Total Variation:
$SS_y = \sum (Y_i - \bar{Y})^2$
$SS_y = SS_{between} + SS_{within}$
$SS_y = SS_x + SS_{error}$
• Between variation:
$SS_x = \sum_{j=1}^{c} n (\bar{Y}_j - \bar{Y})^2$
• Within variation:
$SS_{error} = \sum_{j=1}^{c} \sum_{i=1}^{n} (Y_{ij} - \bar{Y}_j)^2$
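The decomposition can be verified numerically. The sketch below uses made-up scores for three hypothetical ad types (the notes give no data), computes the three sums of squares, and confirms that total variation equals between plus within variation.

```python
# Hypothetical scores for three ad types, four respondents each.
groups = {
    "Ad A": [4, 5, 6, 5],
    "Ad B": [7, 8, 9, 8],
    "Ad C": [5, 6, 7, 6],
}

all_scores = [y for scores in groups.values() for y in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Total variation: every score against the grand mean.
ss_y = sum((y - grand_mean) ** 2 for y in all_scores)

# Between variation: each category mean against the grand mean, weighted by n.
ss_x = sum(
    len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
    for scores in groups.values()
)

# Within variation: every score against its own category mean.
ss_error = sum(
    (y - sum(scores) / len(scores)) ** 2
    for scores in groups.values()
    for y in scores
)

print(f"SSy = {ss_y:.3f}, SSx = {ss_x:.3f}, SSerror = {ss_error:.3f}")
```

The sketch weights each category by its own size, so it also works when the categories have unequal numbers of respondents.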
Measurement of the Effects
• We would like to know how strong the
effects of the independent variable (X) are on
the dependent variable (Y).
$SS_y = SS_x + SS_{error}$
$SS_x = SS_y - SS_{error}$
$\eta^2 = \frac{SS_y - SS_{error}}{SS_y} = \frac{SS_x}{SS_y}$
ANOVA Test
• Under H0: $\mu_1 = \mu_2 = \mu_3 = \ldots = \mu_c$, $SS_x$ and $SS_{error}$
have the same source of variability since the
means are equal between categories.
• Therefore the estimate of the population
variance of Y can be based on either sum of
squares:
$S_y^2 = \frac{SS_x}{c-1} = MS_x$ or $S_y^2 = \frac{SS_{error}}{N-c} = MS_{error}$
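ANOVA Test
• The null hypothesis would be tested with
the F distribution:
$F = \frac{MS_x}{MS_{error}} \sim F(c-1, N-c)$
[Figure: F distribution with the Reject H0 region in the upper tail, beyond the critical value of F(c−1, N−c)]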
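The full one-way ANOVA test can be sketched in a few lines. The scores below are hypothetical (three ad types, four respondents each), the function name is illustrative, and the critical value 4.26 for F(2, 9) at alpha = .05 is an assumption taken from an F table. The ratio of between variation to total variation is returned alongside F as a measure of effect strength.

```python
def one_way_anova(groups):
    """Return (F, SSx / SSy, (df between, df within)) for a list of samples."""
    all_scores = [y for scores in groups for y in scores]
    n_total, c = len(all_scores), len(groups)
    grand_mean = sum(all_scores) / n_total
    means = [sum(scores) / len(scores) for scores in groups]

    ss_x = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_error = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    ms_x = ss_x / (c - 1)                # between-category mean square
    ms_error = ss_error / (n_total - c)  # within-category mean square
    return ms_x / ms_error, ss_x / (ss_x + ss_error), (c - 1, n_total - c)

# Hypothetical scores for three ad types.
ads = [[4, 5, 6, 5], [7, 8, 9, 8], [5, 6, 7, 6]]
f_stat, effect_size, dfs = one_way_anova(ads)

# Assumed table value for F(2, 9) at alpha = .05.
F_CRITICAL = 4.26

reject_h0 = f_stat > F_CRITICAL
print(f"F{dfs} = {f_stat:.2f}, SSx/SSy = {effect_size:.3f}, reject H0: {reject_h0}")
```

A large F means the category means differ by more than the within-category scatter would explain, which is the evidence against equal means.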