The word ‘statistics’ is derived from the Italian words ‘stato’ (‘the state/government’) and
‘statista’ (‘statesman’), reflecting the state’s collection of useful facts/information. It is a branch of mathematics and deals with
Collection of facts/information/data
Arrangement, classification and tabulation of data
Analysis and detailed study of the collected data
Interpretation, conclusions and understanding of the data
It supports important business decision making.
Types of Statistics
a) Descriptive statistics
b) Inferential Statistics
a) Descriptive Statistics – as the name suggests, describes the data collected or collated
for a specific business purpose.
While doing so, various factors determine the amount and scale of information to be collected:
The purpose for which the information/data is collected.
Target audience or the particular section of people from whom information needs to be
collected.
The number of people from whom data needs to be collected (population /sample size)
Age group, geographic location, areas of interest etc
Income groups and various sections of the economy to be targeted
The kind of analysis that needs to be done with the collected information
The level of accuracy to be maintained at every stage of data collection/analysis
Business decisions to be arrived at, based on statistical analysis and interpretation of the
information.
Population – the collection of objects/people that forms the larger group under analysis.
Sample – also a collection of objects/people, usually a subset of the larger
population, or sometimes the only set available for analysis.
Types of Data
a) Qualitative data – this data indicates only the quality of the variable/attribute. Also
called categorical data, this attribute is useful for classifying data into groups or classes. It is
most applicable when only a limited number of values are possible for that data. Qualitative data
cannot be measured.
Data
├── Qualitative Data
│   ├── Nominal Data
│   └── Ordinal Data
└── Quantitative Data
    ├── Interval Scale
    └── Ratio Scale
When a research study or a survey is done, it is advisable to have a combination of all data types, as
this gives an in-depth view of the purpose of the market research. It is always better to
stay focused on the purpose of the market research, but also to collect a little more information
than required, so that important data does not get omitted from the analysis.
1) NOMINAL data – assigns a name as a value. The value is just a representation of the
category it belongs to. This is usually character data, not numeric. No further
analysis is possible beyond classification/categorization. Nominal values carry only some
kind of an identity.
Ex: Company one works in, region, name of a person, etc.
2) ORDINAL data – is also similar to nominal data, but ordinal data can be ranked in relation to
other values in the same variable. Ordinal data values have some level indications within
themselves.
Ex: Small, Medium, Large
Bad, Average, Good, Very Good, Excellent
scale of 1 to 5 (where the numbers 1-5 represent only ranking)
3) INTERVAL SCALE data – quantitative data, which can be used for numerical analysis. The values
might have equal intervals between them, or the data can be continuous. Usually it is
possible to take the difference between any 2 values of this kind of data and get
a meaningful result, but a division of values would not fetch a meaningful result.
Ex: Age groups, salary ranges, dates (where the difference can be taken, but a division
cannot)
4) RATIO SCALE data – this is measurable data, and can be used for all kinds of econometrics.
Here both difference and division give meaningful results.
Descriptive statistics describe a lot about your data. They are very important as they decide the
quality and type of data and levels of accuracy for all inferential statistics and ultimate business
decision making. All inferential statistics are based on descriptive statistics.
When you receive data, you expect the values to follow a normal distribution, meaning
the data should revolve around some central value, or the average/mean of the
values should represent the entire data range. Sometimes there may be values which
lie well outside the expected values, either on the lower side or the higher side. Such
values lying outside the normal values are called OUTLIERS. These outlier values can affect
your analysis, and hence the treatment of outlier values depends on the company’s
decision to handle or ignore them.
Descriptive statistics give insight not only into the central values, but also into how
much the values vary or deviate from the central values, ways to handle negative values,
categorization of subsets from populations, simple comparisons, representation of these values
in graphs, shapes of graphs/plots, etc.
They explain how the values of a particular variable/attribute are centred around the central
value. When a central value for a data range is calculated, it is expected to be representative of
all the values of that variable, and not too different from them. The objective of calculating it is to find
an approximate central value for the entire range of data. For normally distributed data, whichever measure of
central tendency you choose should divide the data roughly in half, with 50% of the data values
lying on either side of it. This kind of data is called normal data.
There are 3 ways of calculating Measures of Central Tendency:- (Mean, Median & Mode)
a) MEAN – the most common measure of central tendency. It is calculated as the total of
all the values divided by the number of values (the frequency). The mean is extremely sensitive to
outliers, as even one or a few extreme values can influence the centre value. Mean does
not take missing values into the calculation, but zero is still considered a value in the data range.
Mean uses all the data, and each value influences the mean.
Average is similar to Mean, but one major difference is that Average takes
missing values into account as well, which will not be as accurate as required.
Mean for a population is represented by Mu, and a Sample mean is represented by X bar.
The choice of how to treat the outlier values (unusually low or high values) depends on the individual
company and the level of accuracy expected for its analysis. If there are too many outliers,
the data first has to be normalized and then analysed. Or, if there are only a few outlier values,
companies may choose to ignore them or treat them separately in their analysis.
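The mean's sensitivity to outliers can be seen in a minimal Python sketch, reusing the 20/30/35/140 figures from the median example below:

```python
import statistics

values = [20, 30, 35]
clean_mean = statistics.mean(values)          # about 28.33

# A single outlier pulls the mean up sharply
with_outlier = values + [140]
outlier_mean = statistics.mean(with_outlier)  # 56.25

print(clean_mean, outlier_mean)
```

One extreme value roughly doubles the mean here, which is why outlier treatment matters before using the mean.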
b) MEDIAN - Is the exact central value of the data range, when arranged either in
ascending or descending order of values. It is the central part of the data values. When there
are even no. of values in the data range, Median is the average of the 2 middle values in the
arranged series. Hence Median also is called Positional Average.
Ex: 4, 5, 3, 2, 7 (arranged in asc 2,3,4,5,7) the MEDIAN is 4.
4, 5, 3, 2, 7, 5 (arranged in desc 7,5,5,4,3,2) the MEDIAN is (5+4)/2 = 4.5
Main Advantage of using a median over mean - Median is less affected by outliers.
Ex: 20, 30, 35 (arranged 20,30,35) – Median is 30
20, 30, 35, 140 - Median is (30+35)/2 = 32.5 (not affected much by the outlier)
The disadvantage of the Median is that it does not take all the values into account. It is based only on
the middle value, by virtue of its position in the centre.
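The worked median examples above can be checked with Python's standard statistics module:

```python
import statistics

m1 = statistics.median([4, 5, 3, 2, 7])        # odd count -> middle value: 4
m2 = statistics.median([4, 5, 3, 2, 7, 5])     # even count -> (4 + 5) / 2 = 4.5
m3 = statistics.median([20, 30, 35, 140])      # outlier barely moves it: 32.5
print(m1, m2, m3)
```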
c) MODE – the most repeated value in the data series. Usually values repeat in a data
series only if limited values are possible in it, i.e., when only limited/pre-defined
values are entered into the data series; MODE then represents the most popular or preferred value.
It is used mostly for categorical/nominal data and rarely for quantitative data. When data is highly
repetitive, you go for the mode.
Classic example of Mode usage is categorical data – where there are limited possible
values.
Ex: Bad, Poor, Average, Good, Excellent - the number of times each value is repeated,
represents the preferability of the product/service and hence the Mode becomes important.
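A small sketch of the mode on categorical data (the rating list below is made up for illustration):

```python
import statistics

# Hypothetical customer ratings on the scale described above
ratings = ["Bad", "Poor", "Average", "Good", "Good", "Excellent", "Good"]
most_common = statistics.mode(ratings)   # the most repeated rating
print(most_common)
```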
While the Measures of Central Tendency give insight into the central values of your data,
Measures of Dispersion tell you how much the values deviate from the central values. In reality,
data is mostly not very consistent; it always varies. There is a need to calculate these
differences to bring more relevance to the analysis, rather than talking only about the middle/central values.
Measures of dispersion describe the spread of the data set.
Ex: the scores of students in Univ1 and Univ2 below have the same mean (57/7 ≈ 8.14), but their data are
distributed differently. In such cases, a study of the central value alone is insufficient and we need
to look at how the data are dispersed around the Mean, because the data is not distributed
tightly around it. The measures of variability help us find how scattered the data values are.
From the data below, we can infer that Univ 1 is better because… 1) dispersion of values is less,
2) spread is less, 3) range of scores is less, 4) variations are less.
Univ 1   Univ 2
8        14
9        2
10       8
8        6
7        13
7        4
8        10
Total 57 Total 57
1) RANGE – the difference between the highest and the lowest value in the data.
Range does not take any other values, especially the central values, into account, and
hence is not so popular for statistical analysis. Range is affected by outliers.
It is mostly used to find out what band the values of the data fall in.
Ex: salary range, age group – mostly used as intervals too.
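The range of the two university score sets above can be computed in one line each, confirming the claim that Univ 1 has the narrower band:

```python
univ1 = [8, 9, 10, 8, 7, 7, 8]
univ2 = [14, 2, 8, 6, 13, 4, 10]

range1 = max(univ1) - min(univ1)   # 10 - 7 = 3
range2 = max(univ2) - min(univ2)   # 14 - 2 = 12
print(range1, range2)
```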
Value            Deviation from mean (x − x̄)
2                −2.83
3                −1.83
4                −0.83
5                0.17
6                1.17
4                −0.83
5                0.17
3                −1.83
5                0.17
6                1.17
7                2.17
8                3.17
Mean = 4.833333  Sum of deviations = 0.00
1) MEAN ABSOLUTE DEVIATION (MAD) – deviations from the MEAN value are first calculated,
the negative signs are ignored, and the absolute values are taken; these are then summed
up. That is, we find how each value in the data set varies from the Mean, and all absolute
deviations are summed up to get the total deviation. MAD is the average of the
absolute values of the deviations around the mean, for a set of values. Only absolute
values are taken (signs are ignored) so that the deviations do not cancel out and a meaningful analysis is possible.
MAD is still useful in the field of forecasting, where it is used as a measure of error.
The MAD of 1.36 below states that the average deviation, or difference, of each of the values
in the data set from the mean is 1.36.
Value            |x − x̄|
2                2.83
3                1.83
4                0.83
5                0.17
6                1.17
4                0.83
5                0.17
3                1.83
5                0.17
6                1.17
7                2.17
8                3.17
Mean = 4.833333  Sum of abs deviations = 16.33
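The MAD table above can be reproduced in a few lines of Python:

```python
values = [2, 3, 4, 5, 6, 4, 5, 3, 5, 6, 7, 8]
mean = sum(values) / len(values)                 # 4.8333...
abs_devs = [abs(x - mean) for x in values]       # drop the signs
total_dev = sum(abs_devs)                        # about 16.33
mad = total_dev / len(values)                    # about 1.36
print(round(total_dev, 2), round(mad, 2))
```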
2) VARIANCE – since MAD drops the signs and treats all deviations as positive, Variance uses
a different calculation. Variance is the sum of the squared deviations (both negative and positive)
from the mean, divided by the total number of observations.
Population variance: σ² = ∑(xᵢ − μ)² / N
Sample variance: s² = ∑(xᵢ − x̄)² / (n − 1)
Population variance is denoted by σ²
Sample variance is denoted by s²
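Both variance formulas are available in Python's statistics module; run on the same 12-value data set as the MAD table:

```python
import statistics

values = [2, 3, 4, 5, 6, 4, 5, 3, 5, 6, 7, 8]
pop_var = statistics.pvariance(values)    # divides by N
samp_var = statistics.variance(values)    # divides by (n - 1)
print(round(pop_var, 2), round(samp_var, 2))
```

The sample variance is always slightly larger because the (n − 1) divisor corrects for estimating the mean from the same sample.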
3) MEASURES OF SHAPE
Now we have the data on hand, we have calculated the central tendency measure, and we also
know how each value in the data series varies from the central value.
We now need a graphical representation of the data, to analyze the shape, which
describes the rough distribution of the data around its central value. The two Measures of Shape are:
1) SKEWNESS
2) KURTOSIS
Skewness and Kurtosis help you find out the pattern of your data distribution, help find
extreme or outlier values and also tell you about the symmetry of your data distribution
When we find the central tendency value for a data series, we expect the
MEAN/MEDIAN/MODE to divide the values into halves, with the right half and the
left half of the graph almost similar. A distribution of data in which the right half is a mirror
image of the left half is said to be SYMMETRICAL.
The skewed portion is the long thin tail of the curve. If the bulk of the curve is to the right and
the skew tail is to the left, the distribution is NEGATIVELY SKEWED. If the skewed portion is to
the right and the bulk of the curve is to the left, it is POSITIVELY SKEWED. Skewed distributions
denote that the data are sparse at one end of the distribution and piled up at the other
end.
Skewness shows that more data values lie on a particular side of the central
value.
Ex: skew shows that more students have scores in a particular (lower or
higher) range of the data.
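One common way to quantify this is the moment-based (Fisher-Pearson) skewness coefficient; the sketch below uses made-up data chosen to show a positive (right) skew:

```python
def skewness(data):
    # Fisher-Pearson moment coefficient: m3 / m2**1.5
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n   # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 10]   # long tail on the high side
print(skewness(right_skewed))     # positive value -> positively skewed
```

A mirrored data set (long tail on the low side) gives a negative coefficient, matching the definitions above.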
Introduction
Business researchers often need to provide insight and information to decision makers to assist
them in answering questions like,
What container shape is most economical and reliable for shipping a product?
Which management approach best motivates employees in the retail industry?
What is the most effective means of advertising in a business setting?
How can the company’s retirement investment financial portfolio be diversified for
optimum performance?
For these purposes, researchers develop “hypotheses” to be studied and explored.
A HYPOTHESIS is
Something that needs to be proven or disproved
An educated guess
A claimed fact
A tentative explanation of a principle operating in nature
An assumption about a population or an unknown value/parameter
We explore all types of hypotheses, how to test them, how to interpret the results of such
tests to help decision making. Research hypothesis is a statement of what the researcher
believes will be the outcome of an experiment or a study. Business researchers have some
idea or theory based on previous experience and data as to how the study will turn out.
These are typically concerning relationships, approaches and techniques in business.
A statistical hypothesis is required in order to scientifically test the research hypothesis. Every
statistical hypothesis consists of 2 parts: the NULL HYPOTHESIS and the ALTERNATIVE HYPOTHESIS.
NULL HYPOTHESIS – usually states that the “null” condition exists: there is nothing new
happening, the old theory is still true, the old standards/quality are correct, and the
system is under control. It is represented by H0.
ALTERNATIVE HYPOTHESIS – on the other hand, usually states that the new theory is
true, there are new standards, the system is out of control, or something different is
happening. It is represented by H1.
Ex 1 : Suppose a baking flour manufacturer has a package size of 40 oz and wants to test
whether their packaging process is correct, the NULL hypothesis for this test would be that the
average weight of the pack is 40 ounces (no problem). The ALTERNATIVE hypothesis is that the
average is not 40 ounces (process has differences).
Ex 2: Suppose a company held an 18% market share earlier, and because of increased marketing
effort, company officials believe the market share is now more than 18%; the market researchers
would like to prove it.
H0: p = 0.18
H1: p > 0.18
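For Ex 2, the usual test statistic would be a one-sample z-test for a proportion. The survey figures below (n = 400 customers, 84 of whom pick the brand) are purely hypothetical, since the example gives no data:

```python
import math

p0 = 0.18                 # market share claimed under H0
n, successes = 400, 84    # hypothetical survey results
p_hat = successes / n     # observed share: 0.21

# z statistic for a one-sample proportion test
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
# one-sided upper-tail p value, via the normal CDF
p_value = 0.5 * (1 - math.erf(z / math.sqrt(2)))
print(round(z, 2), round(p_value, 3))   # at alpha = 0.05 this would not quite reject H0
```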
Note: Though the statistical calculations give results as t values or F values for the above
tests, in SPSS we look mainly at the significance value (p value), and in SAS the p value is likewise
taken. If this p value is less than 0.05, we reject the Null Hypothesis; otherwise we accept it. This is because
the t tables and F tables are not handy.
STEPS for HYPOTHESIS TESTING – Most of the hypothesis testing is based on Mean
comparisons.
1) State the hypotheses (both Null and Alternative) clearly: H0 and H1. The purpose of
the test should be clearly understood as per the application. The researcher needs to
be doubly sure of the requirement and purpose, in order to negate the null hypothesis.
2) Set the level of significance (alpha), typically 0.05.
3) Use the appropriate statistical test, based on the requirement and on the hypotheses in Step
1.
[Flowchart: Hypothesis Testing – the test is chosen by the number of variables involved: One Variable, Two Variables, or > 2 Variables, the last being handled by ANOVA (F Test).]
4) Decision rule – the researcher should be (1 − α) confident to prove his hypothesis. The general
rule is:
If p ≤ 0.05 (the alpha level), you reject H0 and prove your theory.
If p > 0.05 (more than the alpha level), you accept H0, upholding the older theory.
OR
If the t value or F value (from Step 3) is more than the table value, reject H0; else accept
H0.
Ex: if alpha is 0.05 and the p value is 0.03, then the researcher is 97% confident of his
theory and can reject the Null Hypothesis. If he is less confident, or the p value is more than the alpha level,
his confidence level goes down and he is forced to accept the null hypothesis; his
test has failed.
5) Conclusion – TYPE 1 and TYPE 2 errors

Decision     H0 is true                H0 is false
Accept H0    Correct decision          Type 2 error (β)
Reject H0    Type 1 error (α = 0.05)   Correct decision (power of test, 1 − β)
DEGREES OF FREEDOM (df) – the number of observations minus the number of parameters
estimated. Ex: if you calculate the Mean of 15 observations, the df is 15 − 1 (usually N − 1), which
means that one degree of freedom is used up in estimating the parameter.
Type 2 Error – committed when the researcher fails to reject a false null hypothesis.
The probability of committing a Type 2 error is β (beta); the power of the test is 1 − β.
Accepting a false null hypothesis can be the more dangerous mistake.
Ex: the package sizes don’t weigh 40 ounces (here the
null hypothesis is false), but the researcher concludes they weigh 40 ounces; he is committing a bigger
mistake by accepting a false hypothesis, resulting in huge production losses. It is like freeing a
criminal who has committed a crime: a decision is made to free him of the charges.
Let us see each of these Hypothesis Tests with suitable examples both in SAS and SPSS with
codes and results.
Variable: lifetime
DF = 14, t = −2.27, Pr > |t| = 0.0392
1) The only variable being checked for Mean value here is LIFETIME.
2) N is the number of observations
3) Mean of LIFETIME variable is 9.3000 (here it says average life of the bulb is 9.30 yrs)
4) STD Deviation of all the 15 values from the mean is 1.1922
5) Minimum and Maximum are the values in the data range in the LIFETIME variable.
6) 95% CL Mean – is the Mean at 95% Confidence Limit of data being picked up for
analysis.
7) The Confidence Interval for the Mean is 8.6398 to 9.9602
8) 95% Confidence Interval for SD is 0.8729 to 1.8803
9) DF – Degrees of Freedom (always N-1)
10) T.Value is actually not considered here as we cannot compare it with the table values.
11) Probability to T.Value is 0.0392.
Here the p value is the one to be considered. Since it is much less than the alpha of 0.05, we REJECT
the Null Hypothesis and say that the average lifetime of the Philips bulb is NOT 10 yrs. If the p
value were higher than 0.05, we would accept H0 and say the lifetime is 10 yrs.
SPSS Results/Output
One-Sample Statistics
One-Sample Test
Test Value = 10
Essentially, all the descriptive statistics are the same both in SPSS and SAS.
One additional parameter given by SPSS is the Mean Difference, which is the difference
between the Actual Mean and the Null Hypothesis claimed Mean ( 9.3 – 10.0) in the above
example.
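The t value reported in the output can be reproduced from the summary statistics shown (mean 9.3, SD 1.1922, n = 15, test value 10):

```python
import math

mean, sd, n = 9.3, 1.1922, 15
test_value = 10                      # H0: average bulb lifetime is 10 yrs

# one-sample t statistic: (x-bar - mu0) / (s / sqrt(n))
t = (mean - test_value) / (sd / math.sqrt(n))
print(round(t, 2))                   # matches the reported t of -2.27
```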
Example 2: 15 customers each in Mumbai and Delhi were asked to rate Brand X on a 7-point
scale. The responses of all 30 customers are presented. Test whether the responses to Brand X
are the same in both cities.
SAS OUTPUT
The TTEST Procedure
Variable: rating
city Method Mean 95% CL Mean Std Dev 95% CL Std Dev
Equality of Variances
Method Num DF Den DF F Value Pr > F
Table 1 – gives descriptive statistics for the class variable city, with values 1 for Mumbai and 2
for Delhi:
no. of obs, minimum/maximum value, Mean and Std Dev for each city separately, and the
difference between the two cities.
Table 2 – gives confidence-limit descriptive statistics for the class variable; the differences
are given by two methods (Pooled and Satterthwaite).
Table 4 – gives the equality of variances test. For this sub-test, H0: the
variances are equal; H1: the variances are unequal. We check the p value in the
Equality of Variances table, which is 0.2867 here, higher than 0.05. So we accept this H0 (the
variances are equal) and conclude that the Pooled (Equal) method should be used. The
degrees of freedom and t value are not considered here.
Table 3 – gives the comparison of means under the two methods. Since the p value of the
Pooled (Equal) method is 0.0309, well below 0.05, for our original H0 (Mean of Mumbai = Mean
of Delhi) we reject H0 and say that the Mumbai and Delhi means are not equal. The cities differ in
their ratings.
SPSS OUTPUT
Analyse -> Compare Means -> Independent T.Tests -> Choose City as Group Variable and Rating
as Factor variable - > Click on Group variable and give the groups 1 and 2 as values -> OK.
Group Statistics
Table 1 above shows the descriptive statistics for both groups in the City variable 1=Mumbai and
2=Delhi.
Table 2 – below shows details of Levene’s test for Equality of Variances based on 2 assumptions
– Variances being equal and unequal. The CL limits are taken for a 95% data coverage.
The Sig level or p value for Levene’s test is 0.128, which is much more than
0.05, hence we take the ‘equal variances assumed’ row. Now check the sig value of the t-test under
equal variances: it is 0.031, which is less than 0.05, hence we reject the null hypothesis and
conclude that the Mumbai and Delhi ratings are NOT the same.
Rating, equal variances assumed: Levene F = 2.457, Sig. = .128; t = −2.273, df = 28, Sig. (2-tailed) = .031, mean difference = −.867, std. error = .381, 95% CI [−1.648, −.086]
Conclusion: Whether it is SAS or SPSS, the following comparison of the means and variances of
the two groups needs to be made. Since the p value is less than 0.05, we reject the null
hypothesis and conclude that the Mumbai and Delhi ratings do differ.

City     Mean   SD     t value   p (Pooled/Equal)
Mumbai   3.73   0.88
Delhi    4.60   1.18   −2.27     0.031
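The pooled t value in the conclusion can be approximated from the rounded summary statistics above (so the result is close to, but not exactly, the reported −2.27):

```python
import math

n1, mean1, sd1 = 15, 3.73, 0.88   # Mumbai
n2, mean2, sd2 = 15, 4.60, 1.18   # Delhi

# Pooled (equal-variances) two-sample t statistic
sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
t = (mean1 - mean2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 2))   # close to the SAS/SPSS value of -2.27
```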
Example 3:- A plant breeder wants to find out if light is important for seed sprouting. He puts
10 seeds each in 10 pots and places them in Dark and another similar set of 10 pots with 10
seeds each in the light. He monitors the number of seeds sprouting in a particular duration of a
week.
Step 1) H0: there is no difference in the no. of sprouts when seeds are exposed to light or dark
        H1: there is some difference in the no. of sprouts when seeds are exposed to light
Step 2) Alpha = 0.10
Step 3) Independent t-test for the 2 groups
Step 4) For this, the data needs to be transposed to get a class variable (as in sheet 4)
Step 5) Check the test for equality of variances

Lightsprouts: 9 6 8 5 5 6 5 6 9 9
Darksprouts:  6 4 7 2 4 5 3 7 8 6
SAS Output
Variable: sprouts
exposure Method Mean 90% CL Mean Std Dev 90% CL Std Dev
Equality of Variances
SAS Interpretation:
Table 1 – gives the descriptive statistics for both groups 1 and 2 (dark and light exposure) and
the differences in the descriptive statistics.
Table 2 – The Means of both groups are compared under 2 methods with 90% CL.(alpha is 0.1
here)
Table 3 – gives details of the mean comparison under the 2 assumptions (variances equal and unequal).
Here, since the p value (0.068) is less than our alpha of 0.1, we REJECT the null
hypothesis and conclude that there is a difference in mean sprouting between light and dark
exposure.
Table 4 – gives the equality of variances test (for this, H0: the variances are equal,
H1: the variances are unequal). Here the p value is very high, 0.774, so we accept this sub null
hypothesis and go with the equal-variances method.
SPSS Output
Table 1 as usual gives descriptive statistics for both groups (light and dark) under variable
Exposure.
Group Statistics
Levene’s Test for Equality of Variances and t-test for Equality of Means (95% confidence interval of the difference):
Sprouts, equal variances assumed: Levene F = .013, Sig. = .911; t = −1.940, df = 18, Sig. (2-tailed) = .068, mean difference = −1.600, std. error = .825, 95% CI [−3.332, .132]
Conclusion:

Group/Exposure   Mean   SD      t value   p value
Dark             5.2    1.932
Light            6.8    1.751   −1.94     0.068

Since the sig level of Levene’s ‘equal variances’ test is high, we take that method and look at
the sig level of the t-test for equality of means, which is 0.068, less than the alpha of 0.1. We
conclude that the mean number of sprouts differs with and without light exposure, and reject the
null hypothesis.
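With the raw sprout counts available, the pooled t statistic can be verified directly:

```python
import math

dark = [6, 4, 7, 2, 4, 5, 3, 7, 8, 6]
light = [9, 6, 8, 5, 5, 6, 5, 6, 9, 9]

def mean(xs):
    return sum(xs) / len(xs)

def sample_var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

n1, n2 = len(dark), len(light)
# Pooled variance, then the equal-variances t statistic
sp2 = ((n1 - 1) * sample_var(dark) + (n2 - 1) * sample_var(light)) / (n1 + n2 - 2)
t = (mean(dark) - mean(light)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(t, 2))   # -1.94, matching the SAS/SPSS output
```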
Example 4: - Complete Analytics wants to assess if students who took up statistics course have
enhanced their knowledge after course completion. The scores of the same set of students
before and after course completion, is listed.
SAS Output
proc ttest data=pairedt alpha=.05;
paired before*after;
run;
The TTEST Procedure
DF = 9, t = −2.05, Pr > |t| = 0.0703
Gives the basic descriptive statistics, with differences in Mean, SD and with 95% CL.
Here, the difference in Mean is −1.3, which means the comparison is before minus after; if you take the
difference as after − before, or state the variables in the reverse order, the magnitude will be the same but
the sign flips to +1.3.
Here the p value is 0.07, which is higher than the alpha of 0.05, so we ACCEPT the null hypothesis and say there is
no significant difference between “before and after” the course completion, though there is a difference in mean
of 1.3.
SPSS Output
Analyse-> Compare Means -> Paired Sample T-Test-> choose the paired variables under the first
pair->OK
Paired Samples Statistics
Paired Samples Correlations: Pair 1 (Before & After): N = 10, Correlation = .931, Sig. = .000
Paired Differences
Table 1 – Gives basic descriptive statistics for the paired variables (before and after)
Table 2 – gives correlation between the 2 variables (before and after), which is not considered
in this paired T.test for analysis.
Table 3 – Gives the differences in Mean and SD before-After. Gives 95% CL limits, T.Value and
Sig(Pvalue).
Conclusion:

         Mean    SD      % improvement                  Paired t value   p value
Before   15.5    5.104
After    16.8    5.493   8.39                           −2.053           0.07
Diff     −1.3    0.389   (mean diff / old mean × 100)
Though there is an improvement in Mean by 1.3 and the improvement % is 8.39%, the p value
is 0.07, more than the alpha value. Hence we accept Null hypothesis and conclude that there is
no difference in the mean scores of students before and after taking up the statistics course.
This however, can change if you set the significance level to 0.1.
Example 5:- We have recorded the ratings of Tamarind brand garments from 18 respondents
before and after an advertisement campaign was released for this brand. The ratings are on a
10 point scale. Test whether the campaign had an effect on sales of Tamarind brand garments.
Step 1) H0: no effect of the advt campaign (mean ratings before and after are the same)
        H1: the advt campaign had an effect on sales (mean ratings before/after differ)
Step 2) Alpha = 0.05
Step 3) Go for a paired t-test, since the same sample of respondents is tested twice

SAS code:
proc ttest data=pairedt2;
paired before*after;
run;

SPSS: Analyse -> Compare Means -> Paired Sample T.Test -> choose the pair variables -> OK

Before: 3 4 2 5 3 4 5 3 4 2 2 4 1 3 6 3 2 3
After:  4 5 3 4 5 5 6 4 5 4 4 5 3 5 8 4 4 5
SAS OUTPUT
DF = 17, t = −7.38, Pr > |t| < .0001
The SAS output above gives the descriptive statistics N, mean difference (before − after), SD and
min/max values, all in terms of differences. Even the 95% CL for the Mean and SD are for the differences. The Mean
has improved by 1.33 and the p value is < 0.0001, which is less than the alpha of 0.05; hence we reject H0
and state that the advt campaign does have an effect on sales.
Note: SAS does not give the descriptive statistics for the two variables separately; it gives
only the difference statistics. So if you want the descriptive statistics of both variables, before and
after, you need to run PROC MEANS before running PROC TTEST.
SPSS OUTPUT
Paired Samples Statistics
Table 1- Gives Mean, N obs, STD Dev before and after the advt. campaign.
N Correlation Sig.
Table 2 – though we are not considering the correlation between the variables BEFORE &
AFTER in this test, SPSS produces this table and shows the correlation, with a sig of .000.
Paired Differences
Table 3 – shows the pair differences of Mean, SD, 95% CI of the difference estimates, The
T.Value and the sig (p value)
Conclusion:

           Mean    SD      % incr   t value   p value
Before     3.28    1.274
After      4.61    1.144   40.5%    −7.376    <0.0001
Diff B−A   −1.33   0.767
There is a difference in mean of 1.33, a 40.5% increase in the mean rating before vs. after
the advt. campaign, and the p value is much less than 0.05. Hence we confidently REJECT H0
and conclude that the advt. campaign has had a significant effect on sales of the Tamarind brand.
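The paired t value can be verified from the raw before/after ratings: compute the per-respondent differences, then t = d-bar / (s_d / sqrt(n)):

```python
import math
import statistics

before = [3, 4, 2, 5, 3, 4, 5, 3, 4, 2, 2, 4, 1, 3, 6, 3, 2, 3]
after  = [4, 5, 3, 4, 5, 5, 6, 4, 5, 4, 4, 5, 3, 5, 8, 4, 4, 5]

diffs = [b - a for b, a in zip(before, after)]   # before - after, as in SAS
d_bar = statistics.mean(diffs)                   # about -1.33
s_d = statistics.stdev(diffs)                    # about 0.767
t = d_bar / (s_d / math.sqrt(len(diffs)))        # about -7.38
print(round(d_bar, 2), round(s_d, 3), round(t, 2))
```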
ANOVA (ANALYSIS OF VARIANCE)
When we say “more than 2 variables”, we mean the influence of 1 or more factors on a variable.
Ex: sales affected by the location where an item is displayed (window, near the counter, or on the shelf).
Though there are 3 options, they are categorised as a single factor influencing sales. Here
ANOVA is mainly concerned with analysing which of the 3 options
within the same variable (storage area) is better.
2 factors can also influence a variable. Ex: sales affected by both storage location and price. In such
cases, ANOVA gives the combination of factor levels that will fetch the maximum result.
ANOVA mainly deals with a detailed analysis of variances in the variables as follows:-
Between the variables, (ex. Variance of Variable 1 and variance of variable 2 )
Within the variable (variances in variable 1 from its mean)
in totality.
POST-HOC ANOVA:
This is a very important test conducted after the ANOVA is completed. ANOVA calculates means
and gives a detailed analysis of the variances of the variables from the means, whereas POST-HOC
ANOVA gives suggestions as to
which group is best suited, or
which group differs, and to what extent, from the others, or
which combination of factors would be the best.
POST-HOC ANOVA produces a table which assigns letters (in SAS) or numbers (in SPSS) to the
different combinations of groups.
ANOVA Steps:-
1) Transpose your data so that the > 2 variables form a factor variable (in SPSS) or class
variable (in SAS), and the dependent variable is the continuous one.
2) Run PROC Means in SAS or check the DESCRIPTIVE stats box in SPSS to get the
descriptive statistics for all the variables, as PROC ANOVA (in SAS) and ONE-Way
ANOVA (in SPSS) do not give descriptive statistics.
3) Then run PROC Anova or One-way ANOVA as required.
Analyse if the sales of KitKat are affected by the area of display (storage location) in a store. Sales
for a week are observed when KitKat is placed at the window side, at the counter side, or on the shelf.

Shelf   Window   Counter
450     500      550
490     530      570
500     510      560
470     500      530
480     530      590
500     540      600
460     520      580

Step 1) H0: mean sales are equal across all storage areas
        H1: at least one storage location’s mean sales is different
Step 2) Alpha = 0.05
Step 3) Test to be performed: one-way ANOVA with a 3-level class variable

SAS:
proc means data=dsn;
class position;
var sales;
run;

proc anova data=dsn;
class position;
model sales=position;
means position / tukey;
run;

SPSS: Analyse -> Compare Means -> One-Way ANOVA -> select dependent variable sales and factor variable position -> OK
SPSS Output
Descriptives
Sales
N Mean Std. Deviation Std. Error Lower Bound Upper Bound Minimum Maximum
Table 1 – gives all descriptive statistics for the 3 options in the class variable 1,2,3.
ANOVA
Sales
Sum of Squares df Mean Square F Sig.
Table 2 – gives the analysis of variance between and within the groups. The sig value (p
value) is less than 0.05, so we reject H0 and say there is a significant difference in sales when the
product is placed in different positions.
Note: We now know, by running the one-way ANOVA test, that there is a difference in the mean
sales of KitKat when it is placed at different positions. To find out which position is the best,
or to analyse further, we run post-hoc ANOVA.
Multiple Comparisons
Dependent Variable:sales
(I) Position (J) Position Difference (I-J) Std. Error Sig. Lower Bound Upper Bound
Table 3 above – gives the comparison of each position with each other position. You will notice that the p
value for the position 3 comparisons is the best, as it is closest to zero (much less than the 0.05 alpha), under
all the post-hoc methods (Scheffe, Duncan and LSD). Hence we conclude that position 3
(counter) is the best position to display KitKat.
SAS OUTPUT
a) First run PROC MEANS to get the descriptive statistics for all groups.

position   N Obs   N   Mean          Std Dev      Minimum       Maximum
1          7       7   478.5714286   19.5180015   450.0000000   500.0000000

Class Level Information: class = position, levels = 3, values = 1 2 3
Table 2 – is the actual ANOVA procedure, with 2 degrees of freedom for the model. It clearly shows a p value
< 0.0001, which means we reject H0 and state that there is a difference in sales if the product is placed at different
positions.
Table 3 below – the post-hoc ANOVA results come in Tukey’s method, as we opted for Tukey. SAS
automatically assigns letters and groups the variables to suggest the group with the highest Mean. Here
position 3 is grouped as A, with the highest mean. SAS suggests that position 3 has the highest mean.
The ANOVA Procedure
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II
error rate than REGWQ.
Alpha 0.05
Error Degrees of Freedom 18
Error Mean Square 403.1746
Critical Value of Studentized Range 3.60930
Minimum Significant Difference 27.392
Tukey Grouping   Mean     N   position
A                568.57   7   3
B                518.57   7   2
C                478.57   7   1
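The one-way ANOVA F statistic for the KitKat data can be computed by hand as a check against the SAS output (note the within-group mean square matches the Error Mean Square of 403.1746):

```python
groups = {
    1: [450, 490, 500, 470, 480, 500, 460],   # shelf
    2: [500, 530, 510, 500, 530, 540, 520],   # window
    3: [550, 570, 560, 530, 590, 600, 580],   # counter
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g)

df_between = len(groups) - 1                       # 2
df_within = len(all_values) - len(groups)          # 18
ms_within = ss_within / df_within                  # about 403.17, the Error Mean Square
f_stat = (ss_between / df_between) / ms_within
print(round(ms_within, 4), round(f_stat, 1))
```

A large F with a tiny p value (< 0.0001 per the SAS output) is what drives the reject-H0 conclusion above.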