
Statistics – an Introduction

The word Statistics is commonly traced to the Latin 'status' and the Italian 'statista', meaning
'the state/government' and 'statesman' respectively; the subject grew out of collecting facts
useful to the state. It is a branch of mathematics and deals with
 Collection of facts/information/data
 Arrangement, classification and tabulation of data
 Analysis and detailed study of the collected data
 Interpretation, conclusions and understanding of the data
 Support for important business decision making.

Types of Statistics

a) Descriptive statistics
b) Inferential Statistics

a) Descriptive Statistics – as the name suggests, describes the data collated or collected
for a specific business purpose.

b) Inferential Statistics – involves detailed study, analysis and interpretation of data to
draw conclusions, building on descriptive statistics.

Collection of Data/useful facts


All relevant information required for a particular business purpose is collected from the
public/population by research companies through various methods.
 Personal enquiry or interviews
 Through observation and study of a particular pattern or behavior
 In the form of questionnaires

While doing so, various factors determine the amount and scale of information to be collected.
 The purpose for which the information/data is collected.
 Target audience or the particular section of people from whom information needs to be
collected.
 The number of people from whom data needs to be collected (population /sample size)
 Age group, geographic location, areas of interest etc
 Income groups and various sections of the economy to be targeted
 The kind of analysis that needs to be done with the collected information
 The level of accuracy to be maintained at every stage of data collection/analysis
 Business decisions to be arrived at based on statistical analysis and interpretation of
information.

Population – the complete collection of objects/people that forms the larger group under analysis.
Sample – a collection of objects/people that is usually a subset of the larger population, or
sometimes the only data available for analysis.

Types of Data

Data is nothing but a collection of relevant facts/information in a particular form as required by
the business. We classify data based on
 value of the attribute
 characteristics of the object
 level of analysis that can be done with that value collected

A broader or simpler classification is

a) Qualitative data – this data indicates only the quality of the variable/attribute. Also
called Categorical data, this attribute is useful for classifying data into groups or classes. It is
more applicable when there are limited or fewer possible values for that data. Qualitative data
cannot be measured.

Ex: Region – South, North, East, West
Gender – Male, Female
Nationality – Indian, American, African

b) Quantitative Data – is all kinds of information/data which can be quantified by
assigning numeric values to them. These can be any number, from negative values through zero
up to infinity. In simple terms, data that can be measured is called quantitative data. The range
of possible values for this attribute is unlimited, unlike Qualitative data.

Ex: Height, Weight, Salary, temperature etc.

Here, the famous NOIR classification should be recalled.

Data
 Qualitative Data
   – Nominal Data
   – Ordinal Data
 Quantitative Data
   – Interval Scale Data
   – Ratio Scale Data
When a research study or survey is done, it is advisable to have a combination of all data types, as
this would give an in-depth view of the purpose of the market research. It is always better to
stay focused on the purpose of the market research, but also to collect a little more information
than required, so that important data does not get omitted in the analysis.
1) NOMINAL Data – assigns a name as a value. This value is just a representation of which
category the value belongs to. It is usually character data, not numeric. No further
analysis is possible other than classification/categorization. Nominal values serve only as a
kind of identity.
Ex: Company working in, Region, Name of a person etc…

2) ORDINAL data – is similar to nominal data, but ordinal data can be ranked in relation to
other values of the same variable. Ordinal data values carry some indication of level within
themselves.
Ex: Small, Medium, Large
Bad, Average, Good, Very Good, Excellent
scale of 1 to 5 (where the numbers 1-5 represent only ranking)

3) INTERVAL SCALE DATA – are quantitative data, and can be used for numerical analysis. They
might have equal intervals between the values, or can also be continuous data. Usually it
is possible to take the difference between any 2 values of this kind of data and get
a meaningful result, but a division (ratio) of values would not fetch a meaningful result.
Ex: Dates, temperature in Celsius (where the difference can be taken, but a division
cannot)

4) RATIO SCALE DATA – this is measurable data with a true zero point, and can be used for all
kinds of econometrics. Here both differences and divisions give meaningful results.

Ex: Height and Weight (which can even be combined into derived ratios such as
BMI = weight / height²)
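
To make the four NOIR levels concrete, here is a minimal SAS sketch (the dataset name and all
values are invented for illustration) that stores one variable of each type:

data noir_demo;                          /* hypothetical example data */
  length region $5 size $6;
  input region $ size $ join_date :date9. salary;
  format join_date date9.;
  datalines;
South Small 01JAN2010 25000
North Medium 15MAR2011 40000
East Large 20JUL2012 60000
;
run;
/* region    - Nominal: names a category, no order                 */
/* size      - Ordinal: Small < Medium < Large, ranked levels      */
/* join_date - Interval: differences are meaningful, ratios not    */
/* salary    - Ratio: true zero, differences and ratios both work  */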


DESCRIPTIVE STATISTICS

Descriptive statistics describe a lot about your data. They are very important as they decide the
quality and type of data and levels of accuracy for all inferential statistics and ultimate business
decision making. All inferential statistics are based on descriptive statistics.

When you receive data, you generally expect the values to follow a normal distribution, which
means the data should revolve around some central value, and the average/mean of the
values should represent the entire data value range. Sometimes, there may be values which
are noticeably off the normal values expected, either on the lower side or the higher side. Such
values which lie outside the normal range are called OUTLIERS. These outlier values can affect
your analysis, and hence the treatment of these outlier values depends on the company's
decision to handle or ignore them.

Descriptive statistics give a lot more insight into not only the central values, but also how
much the values vary or deviate from those central values, ways to handle negative values,
categorization of subsets from populations, simple comparisons, representation of these values
in graphs, shapes of graphs/plots, etc.

Descriptive statistics are broadly classified into:

1) Measures of Central Tendency
2) Measures of Dispersion/Variability
3) Measures of Shape

1) MEASURES OF CENTRAL TENDENCY

They explain how the values of a particular variable/attribute are centred around a central
value. When a central value for a data range is calculated, it is expected to be representative of
all the values in that variable, not too different from them. The objective of calculating it is to
find an approximate central value for the entire range of data. Ideally, whichever measure of
central tendency you choose should divide your data into halves, with 50% of the data values
lying on either side of it. Data that behaves this way is called normal data.

There are 3 ways of calculating Measures of Central Tendency:- (Mean, Median & Mode)

a) MEAN – Mean is the most common measure of central tendency, but it is extremely
sensitive to outliers, as even one or a few extreme values can pull the centre value.
It is calculated as the total of all the values divided by the number of values (the frequency).
Mean does not take missing values into the calculation, but zero is still considered a value in
the data range. Mean uses all the data, and each value influences the mean.

Average is similar to the Mean; one distinction sometimes drawn is that an Average takes
missing values into account as well, which will not be as accurate as required.

MEAN = Sum of all values / total no. of values

The mean of a population is represented by µ (mu), and a sample mean is represented by x̄ (x bar).

Ex: 2, 3, 5, 4, 5, 6, 8, 20 – the mean is 53/8 = 6.625, but the outlier 20 inflates it.
If 20 is removed, then the mean is 33/7 ≈ 4.71, which is a more representative central value.

The choice of how to treat the outlier values (lower or higher values) depends on individual
companies and the level of accuracy expected for their analysis. If there are too many outliers,
then the data first has to be normalized and then analysed. Or, if there are only a few outlier
values, companies may choose to ignore them or treat them separately in their analysis.
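
A quick SAS sketch of this outlier effect, using PROC MEANS on the same 8 values (the dataset
name is arbitrary):

data mean_demo;
  input x @@;
  datalines;
2 3 5 4 5 6 8 20
;
run;

proc means data=mean_demo n mean;   /* mean of all 8 values: 53/8 = 6.625 */
  var x;
run;

proc means data=mean_demo n mean;   /* dropping the outlier: 33/7, about 4.71 */
  where x ne 20;
  var x;
run;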

b) MEDIAN – is the exact central value of the data range when the values are arranged in
ascending or descending order. When there is an even no. of values in the data range, the
Median is the average of the 2 middle values in the arranged series. Hence the Median is
also called the Positional Average.
Ex: 4, 5, 3, 2, 7 (arranged asc: 2, 3, 4, 5, 7) – the MEDIAN is 4.
4, 5, 3, 2, 7, 5 (arranged desc: 7, 5, 5, 4, 3, 2) – the MEDIAN is (5+4)/2 = 4.5

Main Advantage of using a median over mean - Median is less affected by outliers.
Ex: 20, 30, 35 (arranged 20,30,35) – Median is 30
20, 30, 35, 140 - Median is (30+35)/2 = 32.5 (not affected much by the outlier)

The disadvantage of the Median is that it does not take all the values into account; it is based
only on the middle value, by virtue of its position in the centre.

c) MODE – the most repeated value in the data series. Usually repeat values occur in the data
series only if there are limited possible values in that data; i.e., when only limited/pre-defined
values are entered into the data series, the MODE represents the most popular or preferred value.
It is used mostly for categorical/nominal data and not so much for quantitative data. When data
is highly repetitive, you go for the mode.

Ex: 2, 3, 6, 3, 3, 3, 2, 5 – the Mode is 3, because 3 is the most repeated value.
2, 3, 4, 2, 3, 3, 2, 5 – both 2 and 3 are repeated an equal no. of times; this kind of
data is called bi-modal.
2, 3, 4, 5, 6 – there are no repeat values, so the data is non-modal.

Classic example of Mode usage is categorical data – where there are limited possible
values.
Ex: Bad, Poor, Average, Good, Excellent – the number of times each value is repeated
represents the preference for the product/service, and hence the Mode becomes important.
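
Because the mode is simply the most frequent value, a frequency table is the usual way to find
it. A small SAS sketch (the ratings are invented) using PROC FREQ:

data ratings;
  length rating $ 9;
  input rating $ @@;
  datalines;
Bad Poor Average Good Good Excellent Good Average
;
run;

/* order=freq sorts levels by count, so the mode (here: Good) comes first */
proc freq data=ratings order=freq;
  tables rating / nocum;
run;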

2) MEASURES OF DISPERSION / VARIABILITY

While the Measures of Central Tendency give insight into the central values of your data,
Measures of Dispersion tell you how much the values deviate from those central values. In
reality, data is rarely perfectly consistent; there is always variation. We need to quantify these
differences to get more relevance from the data, and not just talk about middle/central values
in our analysis. These measures describe the spread or dispersion of the data set.

Ex: the scores of students in Univ 1 and Univ 2 below both total 57, giving the same mean
(57/7 ≈ 8.14), yet their data is distributed differently. In such cases, a study of the central value
alone is insufficient, and we need to look at how the data are dispersed from the Mean, because
the data is not identically distributed. The measures of variability help us find how scattered
the data values are. From the data below, we can infer that Univ 1 is better because: 1) the
dispersion of values is less, 2) the spread is less, 3) the range of scores is less, 4) variations are less.

Univ 1    Univ 2
8         14
9         2
10        8
8         6
7         13
7         4
8         10
Total 57  57

The 3 main measures of Dispersion are RANGE, Standard Deviation and Variance.
Other measures are the Interquartile Range, Mean Absolute Deviation (MAD), z-Scores and the
Co-efficient of Variation.

1) RANGE – Range is the difference between the highest and the lowest value in the data.
Range does not take any other values, especially the central values, into account, and
hence is not so popularly used for statistical analysis. Range is affected by outliers.

It is mostly used to find out what band the values of the data fall under.
Ex: salary range, age group – mostly used as intervals too.

SUM OF DEVIATIONS FROM THE ARITHMETIC MEAN IS ALWAYS ZERO.

If you have a data set of 12 values whose mean is 4.83, the deviation of each value from the
mean is first calculated as (x − µ), where x represents each data value and µ (mu) is the mean.
The sum ∑(x − µ) is always zero, as the table below shows.

x             x − µ
2             −2.83
3             −1.83
4             −0.83
5             0.17
6             1.17
4             −0.83
5             0.17
3             −1.83
5             0.17
6             1.17
7             2.17
8             3.17
Mean 4.8333   Sum 0.00
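
The zero in the last row is not a coincidence; it follows directly from the definition of the mean:

∑(x − µ) = ∑x − Nµ = Nµ − Nµ = 0, since µ = ∑x / N.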

2) MEAN ABSOLUTE DEVIATION (MAD) – Deviations from the MEAN value are first calculated,
but the negative signs are ignored and absolute values are taken; then they are summed
up. I.e., we find out how much each value in the data set varies from the Mean, and all absolute
deviations are summed up to get the Total Deviation. MAD is the average of the
absolute values of the deviations around the mean for a set of values. Absolute
values are taken (negative signs ignored) so that the deviations do not simply cancel out.
MAD is especially useful in the field of forecasting, where it is used as a measure of error.

MAD = ∑|x − µ| / N; here the MAD is 16.33/12 = 1.36

A MAD of 1.36 states that, on average, each value in the data set differs from
the mean by 1.36.

x             |x − µ|
2             2.83
3             1.83
4             0.83
5             0.17
6             1.17
4             0.83
5             0.17
3             1.83
5             0.17
6             1.17
7             2.17
8             3.17
Mean 4.8333   Sum of absolute deviations 16.33
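
PROC MEANS has no direct MAD-around-the-mean statistic, so one hedged way to compute it in
SAS for the 12 values above (the dataset and variable names are assumptions) is:

data vals;
  input x @@;
  datalines;
2 3 4 5 6 4 5 3 5 6 7 8
;
run;

/* attach each value's absolute deviation from the overall mean (4.8333) */
proc sql;
  create table devs as
  select x, abs(x - (select avg(x) from vals)) as absdev
  from vals;
quit;

proc means data=devs mean;   /* the mean of absdev is the MAD, about 1.36 */
  var absdev;
run;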

3) VARIANCE – Since MAD neglects the negative signs and treats all deviations as positive,
Variance uses a different calculation: it squares every deviation (negative or positive) from the
mean and divides the sum of these squared deviations by the total number of obs.

Population Variance: σ² = ∑(xᵢ − µ)² / N
Sample Variance: S² = ∑(xᵢ − x̄)² / (n − 1)

In the above example, the (population) Variance is 33.668/12 = 2.8056.


x             x − µ    (x − µ)²
2             −2.83    8.0089
3             −1.83    3.3489
4             −0.83    0.6889
5             0.17     0.0289
6             1.17     1.3689
4             −0.83    0.6889
5             0.17     0.0289
3             −1.83    3.3489
5             0.17     0.0289
6             1.17     1.3689
7             2.17     4.7089
8             3.17     10.0489
Mean 4.8333   Sum 0    Sum of squares 33.668

4) STANDARD DEVIATION – SD is the most popular measure of variability, used mostly in
computing confidence intervals and in Hypothesis Testing. Since Variance is built from
squared deviations, it is expressed in the square of the data's units, which can make it hard to
interpret. SD is the square root of the Variance and is denoted by SIGMA (σ). SD is expressed in
the same units as the data, while Variance is expressed in the squared units of the data.
In the above example, SD is the square root of 2.8056 = 1.675.
Where MAD simply drops the negative signs, SD takes both negative and positive deviations
into consideration through squaring, and the square root then removes the squaring effect of
the variance. Hence SD is the most effective measure.
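
Both measures come straight out of PROC MEANS; a short sketch on the same vals dataset used
in the MAD sketch above:

/* vardef=n forces the population formulas (divide by N rather than n-1), */
/* matching the hand calculation: variance 2.8056, SD 1.675               */
proc means data=vals n mean var std vardef=n;
  var x;
run;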

3) MEASURES OF SHAPE
Now we have the data on hand and have calculated a measure of central tendency. We also
know how much each value in the data series varies from the central value.

We now need to see a graphical representation of the data and analyze the shape which
describes the rough distribution of data from its central value. Two Measures of Shape are:
1) SKEWNESS
2) KURTOSIS

Skewness and Kurtosis help you find the pattern of your data distribution, help locate
extreme or outlier values, and also tell you about the symmetry of your data distribution.

When we find out the central tendency value for a data series, we expect that the
MEAN/MEDIAN/MODE would divide the values into halves and expect to see the right half and
left half of the graph almost similar. This distribution of data in which the right half is a mirror
image of the left half is said to be SYMMETRICAL.

1) SKEWNESS – When a distribution is asymmetrical, or lacks symmetry, the
data is said to be skewed, either to the right or to the left. When you draw a curve graph of
the data, if the bell-shaped bulk of the curve sits off to the left or right, the distribution is skewed.

The skewed portion is the long thin tail of the curve. If the bulk of the curve is to the right and
the skew tail is to the left, the distribution is NEGATIVELY SKEWED. If the tail is to
the right and the bulk of the curve is to the left, it is POSITIVELY SKEWED. Skewed distributions
indicate that the data are sparse at one end of the distribution and piled up at the other
end.

Skewness shows that more of the data values lie on one particular side of the central
value.
Ex: a skew shows that more students have scores in a particular (lower or
higher) range of the data.

2) KURTOSIS – Explains the amount of peakedness of the distribution curve. Distributions
that are high and thin are referred to as Leptokurtic, distributions that are flat and spread
out are called Platykurtic, and distributions which are close to the normal bell shape are
called Mesokurtic.
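
In SAS, both shape measures are reported by PROC UNIVARIATE (SPSS shows them under
Analyze -> Descriptive Statistics); a minimal sketch, assuming a dataset dsn with a numeric
variable x:

/* Skewness near 0 suggests symmetry; positive means a right tail,   */
/* negative a left tail. SAS reports excess kurtosis: about 0 is     */
/* mesokurtic, above 0 leptokurtic, below 0 platykurtic.             */
proc univariate data=dsn;
  var x;
run;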
HYPOTHESIS TESTING

Introduction

Business researchers often need to provide insight and information to decision makers to assist
them in answering questions like,
 What container shape is most economical and reliable for shipping a product?
 Which management approach best motivates employees in the retail industry?
 What is the most effective means of advertising in a business setting?
 How can the company’s retirement investment financial portfolio be diversified for
optimum performance?
For these purposes, researchers develop "Hypotheses" to be studied and explored.

THESIS is something already proven and established.

HYPOTHESIS is
 Something that needs to be proven or disproved.
 It is an educated guess
 A claimed fact
 Is a tentative explanation of a principle operating in nature.
 Is an assumption about population or unknown value/parameter

We explore all types of hypotheses, how to test them, how to interpret the results of such
tests to help decision making. Research hypothesis is a statement of what the researcher
believes will be the outcome of an experiment or a study. Business researchers have some
idea or theory based on previous experience and data as to how the study will turn out.
These are typically concerning relationships, approaches and techniques in business.

A statistical hypothesis is required in order to scientifically test the research hypothesis. Every
statistical hypothesis consists of 2 parts: the NULL HYPOTHESIS and the ALTERNATIVE HYPOTHESIS.

 NULL HYPOTHESIS – usually states that a "null" condition exists: there is nothing new
happening, the old theory is still true, the old standards/quality are correct, and the
system is under control. It is represented by H0.

 ALTERNATIVE HYPOTHESIS – on the other hand, usually states that a new theory is
true, there are new standards, the system is out of control, or something different is
happening. Represented as H1.

Ex 1 : Suppose a baking flour manufacturer has a package size of 40 oz and wants to test
whether their packaging process is correct, the NULL hypothesis for this test would be that the
average weight of the pack is 40 ounces (no problem). The ALTERNATIVE hypothesis is that the
average is not 40 ounces (process has differences).

The hypotheses are represented as follows:

NULL HYPOTHESIS, H0: µ = 40 ounces
ALTERNATIVE HYPOTHESIS, H1: µ ≠ 40 ounces

Ex 2: Suppose a company held an 18% market share earlier and, because of increased marketing
effort, company officials believe the market share is now more than 18%; the market researchers
would like to prove it.
H0: p = 0.18
H1: p > 0.18

APPLICATIONS OF HYPOTHESIS TESTING (T Tests, Z Tests and related tests)

1) To check whether the Sample Mean = Population Mean.
Ex: the avg salary of Company A, Dept 4 employees is 10,000.
Test: One-Sample T.Test or Z.Test (T or Z value).

2) To compare the Mean of one population vs the Mean of another population.
Ex: Nokia vs. Samsung mobile phone sales in a region.
Test: Independent T.Test or Z.Test for both companies separately, and compare (T or Z value).

3) To compare the effect before and after a particular event on the same sample.
Ex: effect of a BP drug on patients, before and after consumption of the drug.
Test: Paired T.Test (T value).

4) To compare more than two independent variables or more than 2 independent populations.
Ex: sales of Nokia, Samsung and Motorola mobiles.
Test: ANOVA or F.Test (F value).

5) To find the association between 2 attributes (attributes are usually categorical/qualitative).
Ex: a person's educational qualification vs his earning capacity/salary/profession.
Test: Chi-Square Test (χ² value).

6) To find the Goodness of Fit (Observed = Expected or not).
Ex: estimated vs actual sales.
Test: Chi-Square Test (χ² value).

Note: Though the statistical calculations give their results as T values or F values for the above
tests, in SPSS we look mainly at the significance value (p value), and in SAS too the p value is
taken. If this p value is less than 0.05, we reject the Null Hypothesis; otherwise we accept it. This
is because the T tables and F tables are not handy.

STEPS for HYPOTHESIS TESTING – Most of the hypothesis testing is based on Mean
comparisons.
1) State the Hypotheses (both NULL and Alternative) clearly: H0 and H1. The purpose of
the test should be clearly understood as per the applications above. The researcher needs to
be doubly sure of the requirement and purpose before trying to negate the null hypothesis.

2) Specify the Level of Significance (alpha) – this can also be described as the allowable
non-confidence limit. It is also the probability of committing a Type 1 Error. Common values of
alpha are 0.05, 0.03, 0.01 etc., depending on how critical errors are to the business. Ex: a retail
industry test may accept a level of 5%, but the aeronautics industry would want only a 0.01 level.
The drug industry would want even more precise levels of testing, with confidence as high as 99.2%.

3) Use the appropriate statistical test, based on the requirement and on the hypothesis from
Step 1. The chart below summarizes the choice.

Choosing the test:

 One variable (one sample):
   N ≤ 30 → One-Sample T.Test
   N > 30 → Z.Test
 Two variables:
   Independent samples (one dependent & one independent variable):
     N ≤ 30 → Independent T.Test; N > 30 → Z.Test
   Dependent samples (the same group measured twice) → Paired T.Test
 More than 2 variables/groups → ANOVA (F Test):
   One independent variable → One-Way ANOVA
   Two independent variables → Two-Way ANOVA

4) Decision Rule – The researcher should be (1 − α) confident to prove his hypothesis. The
general rule is:
If P < 0.05 (the alpha level), reject H0 and prove your new theory.
If P ≥ 0.05 (the alpha level), accept H0, upholding the older theory.
OR
If the T.Value or F.Value (from step 3) is more than the table value, reject H0; else accept
H0.

Ex: if alpha is 0.05 and the P value is 0.03, the researcher is 97% confident of his
theory and can reject the Null Hypothesis. If the p value is more than the alpha level, his
confidence level drops, and he is forced to accept the null hypothesis: his test has failed.
5) Conclusion – TYPE 1 and TYPE 2 Errors

If you ↓ \ Actual H0    If True                           If False
Accept                  Correct Decision (1 − α = 0.95)   TYPE 2 Error (β)
Reject                  TYPE 1 Error (α = 0.05)           Correct Decision (1 − β, the power of the test)

 Probability of a Type 1 Error = the level of significance (α = 0.05)
 Power of the test = 1 − β, the probability of correctly rejecting a false H0.

DEGREES OF FREEDOM (df) – the number of observations minus the number of parameters
estimated. Ex: if you calculate the Mean of 15 observations, the df is 15 − 1; usually N − 1, which
means that one degree of freedom is used up in calculating that parameter.

Type 1 Error – is committed by rejecting a true null hypothesis, i.e., the researcher
decides it is not true when it actually is. The probability of committing a Type 1 error is called
ALPHA, the level of significance. Ex: if the average package weight really is 40 ounces of baking
flour, but the researcher decides it is not, that is a Type 1 error. If a manager fires an employee
for a fraud the employee has not committed, he commits a Type 1 error.

Type 2 Error – is committed when the researcher fails to reject a false null hypothesis.
The probability of committing a Type 2 Error is β (beta); the power of the test, the probability
of correctly rejecting a false null hypothesis, is 1 − β. A Type 2 error can be the more dangerous
mistake. Ex: the packages do not weigh 40 ounces (here the null hypothesis is false), but the
researcher concludes that they do weigh 40 ounces; by accepting a false hypothesis he commits
a bigger mistake, resulting in huge production losses. It is like freeing a criminal who has
committed a crime because a decision is made to clear him of the charges.

Let us see each of these Hypothesis Tests with suitable examples both in SAS and SPSS with
codes and results.

HYPOTHESIS TESTING – PRACTICAL PROBLEMS


Example 1) Philips Bulb Co. states that the average lifetime of Ecostar Bulb is 10
years. Now WIPRO doesn’t accept this claim and tests the average life of 15 Philips Bulbs.

Data (Lifetime_Yrs): 9, 11, 10, 8, 9, 9, 8, 7, 10, 10, 11, 11, 9, 9, 8.5

1) H0: µ = 10; H1: µ ≠ 10
2) Alpha level = 0.05
3) A simple one-sample T.Test should check the average lifetime:
there is only one variable to check, and the obs are fewer than 30.

SAS:
proc ttest data=dsn h0=10 alpha=0.05;
  var lifetime_yrs;
run;

SPSS: Import the data from Excel into SPSS
(File -> Read Text Data -> choose the Excel file/range).
In the new SPSS table, go to Analyze -> Compare Means and
select One-Sample T Test (because there is only one variable).
Move the Lifetime_Yrs variable to the right-hand box,
specify the test Mean = 10 (this is H0; the default is zero),
and click OK.

Excel: Data -> Data Analysis -> t-Test (via the Analysis ToolPak) -> select the range.
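
For completeness, a hedged sketch of how the dataset referred to as dsn above could be created
in SAS (the name is just a placeholder):

data dsn;
  input lifetime_yrs @@;   /* the 15 observed bulb lifetimes in years */
  datalines;
9 11 10 8 9 9 8 7 10 10 11 11 9 9 8.5
;
run;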

Now, let us see the output in SAS


The TTEST Procedure

Variable: lifetime

N Mean Std Dev Std Err Minimum Maximum

15 9.3000 1.1922 0.3078 7.0000 11.0000

Mean 95% CL Mean Std Dev 95% CL Std Dev

9.3000 8.6398 9.9602 1.1922 0.8729 1.8803

DF t Value Pr > |t|

14 -2.27 0.0392

1) The only variable being checked for Mean value here is LIFETIME.
2) N is the number of observations
3) Mean of LIFETIME variable is 9.3000 (here it says average life of the bulb is 9.30 yrs)
4) STD Deviation of all the 15 values from the mean is 1.1922
5) Minimum and Maximum are the values in the data range in the LIFETIME variable.
6) 95% CL Mean – is the Mean at 95% Confidence Limit of data being picked up for
analysis.
7) The Confidence Interval for the Mean is 8.6398 to 9.9602
8) 95% Confidence Interval for SD is 0.8729 to 1.8803
9) DF – Degrees of Freedom (always N-1)
10) T.Value is actually not considered here as we cannot compare it with the table values.
11) Probability to T.Value is 0.0392.

Here the P value is the one to be considered. Since it is much less than alpha 0.05, we REJECT
the Null Hypothesis and say that the average lifetime of the Philips bulb is NOT 10 years. If the
P value were higher than 0.05, we would accept H0 and say the lifetime is 10 years.
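
As a check, the reported t value can be reproduced by hand from the one-sample t formula:

t = (x̄ − µ0) / (s / √n) = (9.3 − 10) / (1.1922 / √15) = −0.7 / 0.3078 ≈ −2.27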

SPSS Results/Output

One-Sample Statistics

N Mean Std. Deviation Std. Error Mean

Lifetime_Yrs 15 9.300 1.1922 .3078

One-Sample Test

Test Value = 10

95% Confidence Interval of the


Difference

t df Sig. (2-tailed) Mean Difference Lower Upper

Lifetime_Yrs -2.274 14 .039 -.7000 -1.360 -.040

Essentially, all the descriptive statistics are the same both in SPSS and SAS.
One additional parameter given by SPSS is the Mean Difference, which is the difference
between the Actual Mean and the Null Hypothesis claimed Mean ( 9.3 – 10.0) in the above
example.

Example 2: 15 customers each in Mumbai and Delhi were asked to rate Brand X on a 7-point
scale. The responses of all 30 customers are presented. Test whether the responses to Brand X
are the same in both cities.

Rating of Brand X on a scale of 0-7:
Mumbai: 2, 3, 3, 4, 5, 4, 4, 5, 3, 4, 5, 4, 3, 3, 4
Delhi: 3, 4, 5, 6, 5, 5, 5, 4, 3, 3, 3, 5, 6, 6, 6

Step 1) H0: Mean of Mumbai = Mean of Delhi; H1: Mean of Mumbai ≠ Mean of Delhi
Step 2) Alpha = 0.05
Step 3) Independent T.Test for 2 samples
Step 4) Since we have only one factor here (responses from customers), but in 2 cities, we first
get the data ready by coding the variable City as 1 (Mumbai) and 2 (Delhi), with the
rating in a second variable.

SAS:
proc ttest data=ttest alpha=0.05;
  class city;
  var rating;
run;

SPSS: Analyze -> Compare Means -> Independent-Samples T Test.
Select the CITY variable into the Grouping Variable box,
select the RATING variable into the Test Variable box, and click OK.
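
One hedged way to do that reshaping in SAS (assuming a dataset wide with the variables
mumbai and delhi) is a DATA step that writes one row per rating with a city code:

data ttest;               /* long form: one row per customer response */
  set wide;
  city = 1; rating = mumbai; output;
  city = 2; rating = delhi;  output;
  keep city rating;
run;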

SAS OUTPUT
The TTEST Procedure

Variable: rating

city N Mean Std Dev Std Err Minimum Maximum

1 15 3.7333 0.8837 0.2282 2.0000 5.0000


2 15 4.6000 1.1832 0.3055 3.0000 6.0000
Diff (1-2) -0.8667 1.0443 0.3813

city Method Mean 95% CL Mean Std Dev 95% CL Std Dev

1 3.7333 3.2439 4.2227 0.8837 0.6470 1.3937


2 4.6000 3.9448 5.2552 1.1832 0.8663 1.8660
Diff (1-2) Pooled -0.8667 -1.6477 -0.0856 1.0443 0.8287 1.4123
Diff (1-2) Satterthwaite -0.8667 -1.6506 -0.0827

Method Variances DF t Value Pr > |t|

Pooled Equal 28 -2.27 0.0309


Satterthwaite Unequal 25.912 -2.27 0.0316

Equality of Variances
Method Num DF Den DF F Value Pr > F

Folded F 14 14 1.79 0.2867

Table 1 – gives descriptive statistics for the class variable City, with values 1 for Mumbai and 2
for Delhi: no. of obs, minimum/maximum value, Mean and Std Dev for each city separately, and
the difference between the two cities.

Table 2 – gives confidence-limit descriptive statistics for the class variable; the differences
are given under two methods (Pooled and Satterthwaite).

Table 4 – gives the equality of variances for the two groups. For this sub-test, H0: the
variances are equal; H1: the variances are unequal. We need to check the p value in the
Equality of Variances table, which is 0.2867 here, higher than 0.05. So we accept this H0 (the
variances are equal) and conclude that the Pooled (Equal) method should be used. The degrees
of freedom and t value are not considered here.

Table 3 – gives the comparison of means under the two methods. Since the p value of the
Equal (Pooled) method is 0.0309, which is less than 0.05, for our original H0 (Mean of Mumbai =
Mean of Delhi) we reject H0 and say that the Mumbai and Delhi means are not equal. The cities
differ in their ratings.
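
The pooled t value can be verified by hand from the output figures (pooled SD 1.0443, n = 15 per
city):

t = (x̄1 − x̄2) / (sp × √(1/n1 + 1/n2)) = (3.7333 − 4.6000) / (1.0443 × √(2/15)) = −0.8667 / 0.3813 ≈ −2.27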

SPSS OUTPUT

Analyze -> Compare Means -> Independent-Samples T Test -> choose City as the Grouping
Variable and Rating as the Test Variable -> click Define Groups and give the groups 1 and 2 as
values -> OK.

Group Statistics

City N Mean Std. Deviation Std. Error Mean


Rating 1 15 3.73 .884 .228

2 15 4.60 1.183 .306

Table 1 above shows the descriptive statistics for both groups in the City variable 1=Mumbai and
2=Delhi.

Table 2 below shows details of Levene's test for Equality of Variances and the t-test under 2
assumptions, variances being equal and unequal. The CL limits are taken for 95% data coverage.
The Sig level (p value) for Levene's Equal Variances test is 0.128, which is much more than
0.05; hence we take the Equal Variances assumption as correct. Now check the t-test Sig value
on the Equal Variances row: it is 0.031, which is less than 0.05, hence we reject the Null
Hypothesis and conclude that the Mumbai and Delhi ratings are NOT the same.

Independent Samples Test

Levene's Test for


Equality of
Variances t-test for Equality of Means

95% Confidence Interval

Mean Std. Error of the Difference

F Sig. T Df Sig. (2-tailed) Difference Difference Lower Upper

Rating Equal 2.457 .128 -2.273 28 .031 -.867 .381 -1.648 -.086
variances
assumed

Equal -2.273 25.912 .032 -.867 .381 -1.651 -.083


variances
not
assumed

Conclusion: Whether in SAS or SPSS, the following comparison of the means and variances of
the two groups needs to be made. Since the p value is less than 0.05, we reject the null
hypothesis and conclude that the Mumbai and Delhi ratings do differ.

Group (City)   Mean   SD     T.Value   P (Pooled/Equal)
Mumbai         3.73   0.88
Delhi          4.60   1.18   -2.27     0.031

Example 3: A plant breeder wants to find out if light is important for seed sprouting. He puts
10 seeds each in 10 pots and places them in the dark, and another similar set of 10 pots with 10
seeds each in the light. He monitors the number of seeds sprouting in each pot over the
duration of a week.

Lightsprouts: 9, 6, 8, 5, 5, 6, 5, 6, 9, 9
Darksprouts: 6, 4, 7, 2, 4, 5, 3, 7, 8, 6

1) H0: there is no diff in the no. of sprouts when exposed to light or dark.
H1: there is some diff in the no. of sprouts when seeds are exposed to light.
2) Alpha = 0.10
3) Independent T.Test for the 2 groups.
4) For this, the data needs to be transposed to get a class variable (exposure).
5) Check the test for equality of variances.

SAS Output

proc ttest data=ttestsprouts alpha=0.1;
  class exposure;
  var sprouts;
run;

The TTEST Procedure

Variable: sprouts

exposure N Mean Std Dev Std Err Minimum Maximum

1 10 5.2000 1.9322 0.6110 2.0000 8.0000


2 10 6.8000 1.7512 0.5538 5.0000 9.0000
Diff (1-2) -1.6000 1.8439 0.8246

exposure Method Mean 90% CL Mean Std Dev 90% CL Std Dev

1 5.2000 4.0799 6.3201 1.9322 1.4092 3.1788


2 6.8000 5.7849 7.8151 1.7512 1.2772 2.8811
Diff (1-2) Pooled -1.6000 -3.0299 -0.1701 1.8439 1.4560 2.5529
Diff (1-2) Satterthwaite -1.6000 -3.0307 -0.1693

Method Variances DF t Value Pr > |t|

Pooled Equal 18 -1.94 0.0682


Satterthwaite Unequal 17.829 -1.94 0.0683

Equality of Variances

Method Num DF Den DF F Value Pr > F

Folded F 9 9 1.22 0.7743

SAS Interpretation:
Table 1 – gives the descriptive statistics for both groups 1 and 2 (dark and light exposure) and
the differences in those statistics.

Table 2 – the means of both groups are compared under the 2 methods, with 90% CL (alpha is
0.1 here).

Table 3 – gives the t-test details under the 2 assumptions (variances equal and unequal). Using
the Pooled (Equal) method selected via Table 4 below, the p value is 0.0682, which is less than
0.1, so we REJECT the null hypothesis and conclude that there is a difference in mean seed
sprouting between light and dark exposure.

Table 4 – gives the equality of variances (for this sub-test, H0: the variances are equal; H1: the
variances are unequal). Here the p value is very high (0.774), so we accept this sub null
hypothesis and go with the equal-variances (Pooled) method.

SPSS Output

Table 1 as usual gives descriptive statistics for both groups (light and dark) under variable
Exposure.
Group Statistics

Exposure N Mean Std. Deviation Std. Error Mean

Sprouts 1 10 5.20 1.932 .611

2 10 6.80 1.751 .554

Independent Samples Test

Levene's Test
for Equality of
Variances t-test for Equality of Means

95% Confidence
Interval of the

Sig. (2- Mean Std. Error Difference

F Sig. T df tailed) Difference Difference Lower Upper

Sprouts Equal variances .013 .911 - 18 .068 -1.600 .825 -3.332 .132
assumed 1.940

Equal variances - 17.829 .068 -1.600 .825 -3.334 .134


not assumed 1.940

Conclusion:
Groups/Exposure   Mean   SD      T.Value   P Value
Dark              5.2    1.932
Light             6.8    1.751   -1.94     0.068

Since the Sig level of Levene's "Equal Variances" test is high, we take that method and compare
the Sig level of the t-test for Equality of Means, which is 0.068 in both rows here, less than
alpha 0.1. We therefore reject the null hypothesis and conclude that the mean sprout counts do
differ with and without light exposure.

Example 4: Complete Analytics wants to assess whether students who took up the statistics
course have enhanced their knowledge after course completion. The scores of the same set of
students, before and after course completion, are listed.

Before: 18, 17, 21, 14, 18, 9, 19, 14, 5, 20 (mean 15.5)
After: 19, 22, 24, 12, 20, 10, 18, 16, 7, 20 (mean 16.8)
Difference in means: 1.3

H0: there is no diff in mean scores before & after.
H1: there is some diff in mean scores before & after.
Alpha: 0.05
Test: Paired T.Test

SAS Output
proc ttest data=pairedt alpha=.05;
paired before*after;
run;
The TTEST Procedure

Difference: before - after

N Mean Std Dev Std Err Minimum Maximum

10 -1.3000 2.0028 0.6333 -5.0000 2.0000


Mean 95% CL Mean Std Dev 95% CL Std Dev

-1.3000 -2.7327 0.1327 2.0028 1.3776 3.6563

DF t Value Pr > |t|

9 -2.05 0.0703

The output gives the basic descriptive statistics, with differences in Mean and SD, at 95% CL.

Here, the diff in Mean is −1.3, which means the comparison is before minus after; if you take
the diff as after − before, or state the variables in reverse order, the magnitude will be the
same but the sign flips to +1.3.
The P value here is 0.07, which is higher than alpha 0.05, so we ACCEPT the null hypothesis and
say there is no difference between before and after the course completion, even though there
is a diff in means of 1.3.
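
The paired t statistic is just a one-sample t computed on the differences, which reproduces the
output value:

t = d̄ / (sd / √n) = −1.3 / (2.0028 / √10) = −1.3 / 0.6333 ≈ −2.05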

SPSS Output
Analyse-> Compare Means -> Paired Sample T-Test-> choose the paired variables under the first
pair->OK
Paired Samples Statistics

Mean N Std. Deviation Std. Error Mean

Pair 1 Before 15.50 10 5.104 1.614

After 16.80 10 5.493 1.737

Paired Samples Correlations

N Correlation Sig.
Pair 1 Before & After 10 .931 .000

Paired Samples Test

Paired Differences

95% Confidence Interval

Std. Std. Error of the Difference Sig. (2-


Mean Deviation Mean Lower Upper t df tailed)

Pair Before - -1.300 2.003 .633 -2.733 .133 -2.053 9 .070


1 After

Table 1 – Gives basic descriptive statistics for the paired variables (before and after)
Table 2 – gives correlation between the 2 variables (before and after), which is not considered
in this paired T.test for analysis.
Table 3 – Gives the differences in Mean and SD before-After. Gives 95% CL limits, T.Value and
Sig(Pvalue).

Conclusion:
          Mean   SD      % improvement   Paired T Value   P Value
Before    15.5   5.104
After     16.8   5.493   8.39            -2.053           0.07
Diff      -1.3   0.389   (% improvement = mean diff / old mean × 100)

Though there is an improvement in the Mean of 1.3, an improvement of 8.39%, the p value
is 0.07, more than the alpha value. Hence we accept the null hypothesis and conclude that
there is no difference in the mean scores of students before and after taking up the statistics
course.

This however, can change if you set the significance level to 0.1.

Example 5:- We have recorded the ratings of Tamarind brand garments from 18 respondents
before and after an advertisement campaign was released for this brand. The ratings are on a
10 point scale. Test whether the campaign had an effect on sales of Tamarind brand garments.
Before: 3, 4, 2, 5, 3, 4, 5, 3, 4, 2, 2, 4, 1, 3, 6, 3, 2, 3
After: 4, 5, 3, 4, 5, 5, 6, 4, 5, 4, 4, 5, 3, 5, 8, 4, 4, 5

Step 1) H0: no effect of the advt campaign (mean ratings before and after are the same).
H1: the advt campaign had an effect on sales (mean ratings before/after are different).
Step 2) Alpha = 0.05
Step 3) Go for a Paired T.Test, since the same sample obs are tested twice.

SAS code:
proc ttest data=pairedt2;
  paired before*after;
run;

SPSS: Analyze -> Compare Means -> Paired-Samples T Test ->
choose the pair of variables -> OK
SAS OUTPUT

The TTEST Procedure

Difference: before - after

N Mean Std Dev Std Err Minimum Maximum

18 -1.3333 0.7670 0.1808 -2.0000 1.0000

Mean 95% CL Mean Std Dev 95% CL Std Dev

-1.3333 -1.7147 -0.9519 0.7670 0.5755 1.1498

DF t Value Pr > |t|

17 -7.38 <.0001

The above SAS output gives the descriptive statistics N, Mean difference (before − after), SD
and min/max values, all in terms of differences. Even the 95% CL Mean and SD are in
differences. The Mean has improved by 1.33, and the p value is < 0.0001, which is less than
alpha 0.05; hence we reject H0 and state that the advt campaign does have some effect on sales.

Note: SAS does not give the descriptive statistics for the two variables separately; it gives only
the difference statistics. If you want the descriptive statistics of both variables, before and
after, you need to run PROC MEANS before running PROC TTEST:

proc means data=pairedt2;
  var before after;
run;
The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum


------------------------------------------------------------------------------
before 18 3.2777778 1.2744344 1.0000000 6.0000000
after 18 4.6111111 1.1447522 3.0000000 8.0000000
------------------------------------------------------------------------------

SPSS OUTPUT
Paired Samples Statistics

Mean N Std. Deviation Std. Error Mean

Pair 1 Before 3.28 18 1.274 .300

After 4.61 18 1.145 .270

Table 1- Gives Mean, N obs, STD Dev before and after the advt. campaign.

Paired Samples Correlations

N Correlation Sig.

Pair 1 Before & After 18 .804 .000

Table 2 – though we are not considering any correlation between the variables BEFORE &
AFTER in this test, SPSS produces this table anyway; the correlation is 0.804 with a Sig of .000.

Paired Samples Test

Paired Differences

95% Confidence Interval

Std. Std. Error of the Difference Sig. (2-


Mean Deviation Mean Lower Upper t df tailed)

Pair 1 Before - -1.333 .767 .181 -1.715 -.952 -7.376 17 .000


After

Table 3 – shows the pair differences of Mean, SD, 95% CI of the difference estimates, The
T.Value and the sig (p value)

Conclusion:
           Mean    SD      % Incr   T.Value   P Value
Before     3.28    1.274
After      4.61    1.144   40.5%    -7.376    <0.0001
Diff B-A   -1.33   0.767

There is a difference in means of 1.33, a 40.5% increase in the mean ratings before/after
the advt. campaign, and the p value is much less than 0.05. Hence we confidently REJECT H0
and conclude that the advt. campaign has had a significant effect on sales of the Tamarind brand.
ANOVA (ANALYSIS OF VARIANCE)

Hypothesis testing applies to testing the parameters of a sample or a population. While the
tests above involve one or two samples whose means are compared, ANOVA is a type of
hypothesis testing that can involve more than two groups or factors, something the simple
t-tests and z-tests cannot handle.

When we say more than 2 variables, we mean the influence of 1 or more factors on a variable.
Ex: sales affected by the location where an item is displayed (window, near the counter, or on
the shelf). Though there are 3 options, they are categorised as a single factor influencing sales.
Here ANOVA is mainly concerned with analysing which one is better among the 3 options
within the same variable, called storage area.

2 Factors can also influence a variable. Ex: Sales because of storage location and price. In such
cases, ANOVA gives the combination of the factors which will fetch max results.

ANOVA mainly deals with a detailed analysis of variances in the variables as follows:-
 Between the variables, (ex. Variance of Variable 1 and variance of variable 2 )
 Within the variable (variances in variable 1 from its mean)
 in totality.

This is explained with the help of a table:

Dep var = sales    Ind var = QF     Ind var = Avail    Var between QF & Avail
155                1                2                  xxx
500                2                3                  xxx
535                3                1                  xxx
420                4                4                  xxx
Total Sales 1610   Var within QF    Var within Avail   TOT Variance
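
In formula terms, ANOVA splits the total variation into a between-group part and a within-group
part, and the F statistic compares their mean squares (k = number of groups, N = total
observations):

SS_total = SS_between + SS_within
F = [SS_between / (k − 1)] / [SS_within / (N − k)] = MS_between / MS_within

A large F (equivalently, a small p value) says the group means differ by more than the
within-group scatter can explain.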

POST-HOC ANOVA:-
This is a very important test conducted after the ANOVA is completed. ANOVA calculates means
and gives a detailed analysis of the variances of the variables from the means, whereas
POST-HOC ANOVA gives suggestions as to
 which variable is best suited or
 which variable differs to what extent from the others or
 which combination of factors would be the best.

Post-hoc ANOVA has different methods of calculation:

a) TUKEY's test
b) SCHEFFE's test
c) LSD (Least Significant Difference method) – use when the data is more continuous
d) DUNCAN's test – when you have more variables to be compared.

POST-HOC ANOVA produces a table which assigns letters (in SAS) or numbers (in SPSS) to the
different combinations of groups of variables.
ANOVA Steps:-

1) Transpose your data so that the multiple groups form a factor variable (in SPSS) or a class
variable (in SAS), and the dependent variable is the continuous one.
2) Run PROC MEANS in SAS, or check the Descriptives box in SPSS, to get the
descriptive statistics for all the variables, as PROC ANOVA (in SAS) and One-Way
ANOVA (in SPSS) do not give descriptive statistics.
3) Then run PROC ANOVA or One-Way ANOVA as required.

Example 1 - Simple One-Way ANOVA

Analyse whether the sales of Kitkat are affected by the area of display (storage location) in a
store. Sales for a week are observed when Kitkat is placed at the window side, at the counter
side, or on the shelf.

Shelf: 450, 490, 500, 470, 480, 500, 460
Window: 500, 530, 510, 500, 530, 540, 520
Counter: 550, 570, 560, 530, 590, 600, 580

Step 1) H0: storage-area mean sales are all equal.
H1: at least one storage location's mean sales is different.
Step 2) Alpha = 0.05
Step 3) Test to be performed – one-way ANOVA with a 3-level class variable.

SAS:
proc means data=dsn;
  class position;
  var sales;
run;
proc anova data=dsn;
  class position;
  model sales=position;
  means position / tukey;
run;

SPSS: Analyze -> Compare Means -> One-Way ANOVA;
select the dependent variable (sales) and the factor variable (position); OK.
SPSS Output

Descriptives
Sales

95% Confidence Interval for Mean

N Mean Std. Deviation Std. Error Lower Bound Upper Bound Minimum Maximum

1 7 478.57 19.518 7.377 460.52 496.62 450 500


2 7 518.57 15.736 5.948 504.02 533.12 500 540
3 7 568.57 24.103 9.110 546.28 590.86 530 600
Total 21 521.90 42.263 9.223 502.67 541.14 450 600

Table 1 – gives all descriptive statistics for the 3 options in the class variable 1,2,3.

ANOVA
Sales
Sum of Squares df Mean Square F Sig.

Between Groups 28466.667 2 14233.333 35.303 .000


Within Groups 7257.143 18 403.175
Total 35723.810 20

Table 2 – gives the analysis of variance between and within the groups. The Sig value (p value)
is less than 0.05, so we reject H0 and say there is a significant difference in sales when the
product is placed in different positions.

Note:- Now we know by running a one-way anova test, that there is a difference in mean
sales of kitkat when placed at different sale positions. To find out which position is the best,
or to analyse further, we run post-hoc ANOVA.

Multiple Comparisons
Dependent Variable:sales

Mean 95% Confidence Interval

(I) Position (J) Position Difference (I-J) Std. Error Sig. Lower Bound Upper Bound

Tukey HSD 1 2 -40.000* 10.733 .004 -67.39 -12.61

3 -90.000* 10.733 .000 -117.39 -62.61

2 1 40.000* 10.733 .004 12.61 67.39

3 -50.000* 10.733 .001 -77.39 -22.61

3 1 90.000* 10.733 .000 62.61 117.39

2 50.000* 10.733 .001 22.61 77.39


Scheffe 1 2 -40.000* 10.733 .006 -68.62 -11.38
3 -90.000* 10.733 .000 -118.62 -61.38
2 1 40.000* 10.733 .006 11.38 68.62
3 -50.000* 10.733 .001 -78.62 -21.38
3 1 90.000* 10.733 .000 61.38 118.62
2 50.000* 10.733 .001 21.38 78.62
LSD 1 2 -40.000* 10.733 .002 -62.55 -17.45

3 -90.000* 10.733 .000 -112.55 -67.45

2 1 40.000* 10.733 .002 17.45 62.55

3 -50.000* 10.733 .000 -72.55 -27.45

3 1 90.000* 10.733 .000 67.45 112.55

2 50.000* 10.733 .000 27.45 72.55

*. The mean difference is significant at the 0.05 level.

Table 3 above gives the comparison of each position with the others. You will notice that the p
values involving position 3 are the best, as they are closest to zero (much less than the 0.05
alpha), under all the post-hoc methods shown (Tukey HSD, Scheffe and LSD). Hence we conclude
that position 3 (the counter) is the best position to display Kitkat.

SAS OUTPUT
a) First run PROC means to get desc stats for all variables.

proc means data=anovaposition;
  class position;
  var sales;
run;
The MEANS Procedure

Analysis Variable : sales

position  N Obs  N  Mean         Std Dev     Minimum      Maximum
1         7      7  478.5714286  19.5180015  450.0000000  500.0000000
2         7      7  518.5714286  15.7359158  500.0000000  540.0000000
3         7      7  568.5714286  24.1029538  530.0000000  600.0000000

b) Run ANOVA with the post-hoc Tukey option.

Class Level Information

Class     Levels  Values
position  3       1 2 3

Number of Observations Read  21
Number of Observations Used  21

Table 1 – gives details of the class variable.

The ANOVA Procedure

Dependent Variable: sales


Sum of
Source DF Squares Mean Square F Value Pr > F

Model 2 28466.66667 14233.33333 35.30 <.0001

Error 18 7257.14286 403.17460

Corrected Total 20 35723.80952

R-Square Coeff Var Root MSE sales Mean

0.796854 3.847294 20.07921 521.9048

Source DF Anova SS Mean Square F Value Pr > F

position 2 28466.66667 14233.33333 35.30 <.0001

Table 2 – is the actual ANOVA output, with 2 degrees of freedom for the model. It clearly shows
that the p value is <0.0001, which means we reject H0 and state that there is a difference in
sales when the product is placed at different positions.

Table 3 below – the post-hoc ANOVA results come out under Tukey's method, as we opted for
Tukey. SAS automatically assigns letters and groups the positions, so that positions sharing a
letter do not differ significantly. Here position 3 is grouped as A, with the highest mean.
The ANOVA Procedure

Tukey's Studentized Range (HSD) Test for sales

NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II
error rate than REGWQ.

Alpha 0.05
Error Degrees of Freedom 18
Error Mean Square 403.1746
Critical Value of Studentized Range 3.60930
Minimum Significant Difference 27.392

Means with the same letter are not significantly different.

Tukey Grouping Mean N position


A 568.57 7 3

B 518.57 7 2

C 478.57 7 1

CONCLUSION: From both SAS and SPSS we conclude as below:

Position     Mean    SD      F Value   P Value
1 Shelf      478.57  19.518
2 Window     518.57  15.736  35.303    <0.0001
3 Counter    568.57  24.103

a) The counter (position 3) has the highest mean sales.
b) Since the P value is <0.0001, our model is good; we reject H0 and state firmly that different
positions yield different sales.
