
Understanding Data Analysis in Social Sciences
or: why data analysis?...
required knowledge of basic
statistics
what are the various tests?...
the assumptions behind each test
how to use the tests?...
an introduction to the fundamentals
of data analysis for MBA students
Partially based on Pandya K, et al., SPSS in Simple Steps,
Dreamtech Press, ISBN-13: 978-93-5004-251-9.
2012, Marie Anne Rosario, partially based on, and adapted from Pandya K, et al., SPSS in Simple
Steps, Dreamtech Press.

Purpose of Statistics
1) To describe a phenomenon,
2) To organize and summarize our results more
conveniently and meaningfully,
3) To make inference or make predictions,
4) To explain, and
5) To make a conclusion.

Exploratory Data Analysis (EDA)



Types of Statistics

Descriptive Statistics: The event or outcome of events is described
without drawing conclusions. It is concerned only with the collection,
organisation, summarising, analysis and presentation of an array of
numerical, qualitative or quantitative data. Descriptive statistics include
the mean, median, mode, standard deviation, range, percentile, kurtosis,
correlation coefficient, proportions, etc.
Inferential Statistics: This builds on descriptive statistics by going a
step further and interpreting the sample with a view to the population on
which a decision would be based. Valid and reliable decisions, generalisations,
predictions and conclusions can be drawn using statistical tools such
as stochastic processes, queuing theory, game theory, quality control,
chi-square, t-test, F-test, etc.
Experimental Statistics: Relates to the design of experiments to
establish causes and effects, using designs such as experimental and
quasi-experimental designs.


Measurement scales
Nominal: Numbers assigned to runners (e.g. 7, 11)
Ordinal: Rank order of winners (third place, second place, first place)
Interval: Performance rating on a 0-to-10 scale (8.2, 9.1, 9.6)
Ratio: Time to finish in seconds (15.2, 14.1, 13.4)


Nominal variables allow for only qualitative classification. That is, they
can be measured only in terms of whether the individual items belong to
some distinctively different categories, but we cannot quantify or even
rank order those categories. Typical examples of nominal variables are
gender, race, color, city, players' jersey numbers, etc.
Permissible Statistics:
Descriptive: percentages, mode
Inferential: Chi-square, binomial test
Ordinal variables allow us to rank order the items we measure in terms
of which has less and which has more of the quality represented by the
variable, but still they do not allow us to say "how much more". A typical
example of an ordinal variable is the socioeconomic status of families.
Permissible Statistics:
Descriptive: percentile, median
Inferential: Rank-order correlation, Friedman ANOVA


Interval variables allow us not only to rank order the items that are
measured, but also to quantify and compare the sizes of differences
between them. For example, temperature, as measured in degrees
Fahrenheit or Celsius, constitutes an interval scale.
Permissible Statistics:
Descriptive: range, mean, S.D.
Inferential: Pearson correlation, t-tests, ANOVA, regression, factor analysis
Ratio variables are very similar to interval variables; in addition to all the
properties of interval variables, they feature an identifiable absolute zero
point, thus they allow for statements such as "x is twice as much as y".
Typical examples of ratio scales are measures of time or space.
Permissible Statistics:
Descriptive: geometric mean, harmonic mean
Inferential: coefficient of variation


Some Important Measures


Measures of Central Tendency/Averages

- Mean
- Median
- Mode
Measures of Spread / Dispersion

- Range
- Variance
- S.D
- IQR
Measure of Asymmetry
Measure of Peakedness


Measures of Central Tendency


One of the main objectives of statistical analysis is to get a single value
that describes the characteristic of the entire data.
Such a value is called the central value, and the most commonly used
measures of central tendency are the Arithmetic Mean, Median and Mode.
They give us an idea of the concentration of the observations (data)
about the central part of the distribution.
Arithmetic Mean of a series is defined as the sum of the observations
divided by the number of observations.
Median is defined as the middle item of the given observations
arranged in order (or the mean of the two middle items when the number
of observations is even).
The value of the variable which occurs most frequently in a series is
called the mode.
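The three definitions above can be checked with a minimal sketch using Python's standard `statistics` module. The `scores` list is hypothetical illustration data, not taken from the slides.

```python
from statistics import mean, median, multimode

# Hypothetical observations (not from the slides).
scores = [2, 3, 3, 5, 7, 10]

average = mean(scores)     # sum of observations / number of observations = 30 / 6
middle = median(scores)    # even count, so the mean of the two middle items (3 and 5)
modes = multimode(scores)  # most frequent value(s); a list, because ties are possible

print(average, middle, modes)
```

`multimode` is used rather than `mode` so that a series with several equally frequent values reports all of them.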


Skewness
Indicates asymmetry in the distribution
Positive value: right skewed
Long tail to the right
Typically Median < Mean
Negative value: left skewed
Long tail to the left
Typically Median > Mean
For normally distributed data skewness
is zero

Kurtosis
Indicates peakedness of the distribution
Positive values indicate heavy tails
Negative values indicate light tails
Normal distribution has an excess kurtosis of zero
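As a rough illustration of these two measures, the sketch below computes moment-based skewness and excess kurtosis in plain Python. It assumes the simple population (uncorrected) formulas, so the values will differ slightly from the bias-corrected figures that packages such as SPSS report. The function names and sample data are hypothetical.

```python
from statistics import fmean, median

def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    m = fmean(values)
    return sum((x - m) ** k for x in values) / len(values)

def skewness(values):
    """Third standardized moment (population formula, no bias correction)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution gives 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]        # hypothetical, evenly spread data
right_tail = [1, 1, 2, 2, 3, 10]   # hypothetical data with one large value

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tail), median(right_tail) < fmean(right_tail))
```

The evenly spread series has zero skewness and negative excess kurtosis (light tails), while the series with one large value has positive skewness and its median lies below its mean.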


Kurtosis
Curve types shown: leptokurtic, normal, platykurtic
Many statistical tests assume values are normally distributed.
Not always the case!
Examine data prior to processing.


Measures of Dispersion
The knowledge of measures of central tendency cannot give a complete
idea about the distribution.
Measures of dispersion give an idea about the scatter of the data
in the distribution.
Range, Variance, Standard Deviation and Quartile Deviation are the
measures of dispersion.
Range is the difference between the greatest and the least of the
observations.
A better way to measure dispersion is to square the differences between
each observation and the mean before averaging them. This is called the Variance.
The positive square root of the Variance is called the Standard Deviation.
The coefficient of variation (the standard deviation expressed as a
percentage of the mean) helps us to compare the consistency of two or
more collections of data.
The higher the coefficient of variation, the less consistent the data, and
vice versa.
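A minimal sketch of these measures follows, assuming the population form of the variance (a plain average of squared differences, as described above) rather than the sample form with n - 1 in the denominator. The two series and the helper names are hypothetical.

```python
from statistics import fmean, pstdev, pvariance

def value_range(values):
    """Difference between the greatest and the least observation."""
    return max(values) - min(values)

def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean."""
    return pstdev(values) / fmean(values) * 100

# Two hypothetical series with the same mean (50) but different scatter.
series_a = [48, 50, 52, 50, 50]
series_b = [10, 90, 50, 20, 80]

for name, series in (("A", series_a), ("B", series_b)):
    print(name, value_range(series), pvariance(series),
          round(pstdev(series), 2), round(coefficient_of_variation(series), 2))
```

Series A has the smaller coefficient of variation (about 2.5% against about 63%), so it is the more consistent of the two even though both share the same mean.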


Quartile Deviation
Median is the value below which 50% of the cases in
the distribution lie. Median divides the distribution into
two equal halves.
There are other points that divide the distribution into
various ratios.
The point below which there are 25% of the cases in
the distribution is known as the first quartile (Q1).
The point below which there are 75% of the cases is
known as the third quartile (Q3).
Quartile Deviation is half the difference between the
values of these quartiles.
Min | 25% | Q1 | 25% | Mdn | 25% | Q3 | 25% | Max

Percentile gives the relative position in a distribution.
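The quartiles and the quartile deviation can be sketched with `statistics.quantiles`. The data are hypothetical, and the "inclusive" interpolation method is an assumption: other quartile conventions exist and can give slightly different cut points.

```python
from statistics import quantiles

# Hypothetical ordered observations.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 splits the distribution into four equal parts and returns Q1, median, Q3.
q1, mdn, q3 = quantiles(data, n=4, method="inclusive")

# Quartile deviation: half the difference between Q3 and Q1.
quartile_deviation = (q3 - q1) / 2

print(q1, mdn, q3)          # 3.0 5.0 7.0
print(quartile_deviation)   # 2.0
```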



Box Plot
Labels on the box plot, from top to bottom: outlier, maximum,
75th percentile, 50th percentile (median), 25th percentile, minimum, outlier.
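The slide does not say how outliers are separated from the minimum and maximum. A common convention, assumed here, is to flag any point lying more than 1.5 times the interquartile range beyond the quartiles, and to draw the whiskers to the most extreme points that are not flagged. The function name, the `whisker` parameter and the data below are hypothetical.

```python
from statistics import quantiles

def box_plot_summary(values, whisker=1.5):
    """Five-number summary plus outliers, using the assumed 1.5 x IQR rule."""
    q1, mdn, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    return {
        "minimum": min(inside),   # lowest point that is not an outlier
        "q1": q1,
        "median": mdn,
        "q3": q3,
        "maximum": max(inside),   # highest point that is not an outlier
        "outliers": [x for x in values if x < low_fence or x > high_fence],
    }

# Hypothetical data with one extreme value.
summary = box_plot_summary([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(summary)
```

With these data the value 30 is reported as an outlier and the upper whisker stops at 9.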


Normal Probability Plots


1. Normal distribution
2. Right-skewed distribution
3. Left-skewed distribution
4. Light-tailed distribution
5. Heavy-tailed distribution


Mean is used
- when the distribution is not badly skewed.
- when the measure of the greatest stability is required.
- when other statistics (such as S.D., t) are to be computed.
Median is used
- when the exact midpoint is required.
- when there are extreme cases.
- when the distribution is truncated.
Mode is used
- when we need a quick approximate value.
- when we need the most typical value.
Central tendency of a nominal (categorical) scale is given by its mode.
Central tendency of an ordinal scale is given by the median.
Central tendency of an interval/ratio scale is given by the mean (symmetrical data)
or the median (skewed data).
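The scale-based rule at the end of this slide can be written as a small helper. This is only a sketch: the function name, the `scale` strings and the `skewed` flag are hypothetical, and ordinal values are assumed to be coded as numbers that preserve their rank order.

```python
from statistics import mean, median, median_low, multimode

def central_tendency(values, scale, skewed=False):
    """Pick a measure of central tendency according to the measurement scale."""
    if scale == "nominal":
        return multimode(values)[0]   # most frequent category (first one if tied)
    if scale == "ordinal":
        return median_low(values)     # always returns an actual observed rank
    if scale in ("interval", "ratio"):
        return median(values) if skewed else mean(values)
    raise ValueError(f"unknown scale: {scale!r}")

print(central_tendency(["red", "blue", "red"], "nominal"))
print(central_tendency([1, 2, 2, 3, 5], "ordinal"))
print(central_tendency([1, 2, 3, 4, 5], "interval"))
print(central_tendency([1, 2, 3, 4, 100], "ratio", skewed=True))
```

Whether a distribution counts as skewed is left to the caller, for example after inspecting a histogram or the skewness value.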


Parametric & Non-parametric Tests


A parametric test requires the sample to come from a normally distributed
population.
For a large sample (>100), the sampling distribution is assumed to be normal.
Variables are measured on an interval or ratio scale.
When should we use non-parametric tests?
- When the sample size is too small.
- When the sample is not normally distributed.
- When variables are measured on a nominal or ordinal scale.
The second situation can arise if the data have outliers. In
this case, statistical methods which are based on the normality
assumption break down and we have to use non-parametric
tests.


Commonly used Parametric & Non-Parametric Tests
PARAMETRIC
- One-sample t-Test
- Independent sample t-Test
- Paired sample t-Test
- One-Way ANOVA
- One-Way repeated measures ANOVA
- Two-way ANOVA
- Pearson Correlation Coefficient
NON-PARAMETRIC
- Runs Test
- Chi-Square Test
- Mann-Whitney U Test
- Wilcoxon Signed Rank Test
- Kruskal-Wallis Test
- Spearman & Kendall Tau coefficients


UNI-VARIATE ANALYSIS
Flow chart, reconstructed from its labels:
Qualitative data?
- Yes: cross tab with two categorical variables; cross tab with more than
two variables; compare observed frequencies with expected frequencies
- No: difference or relationship?
Difference
- Comparison with a standard value: one-sample t
- Data not from the same respondents (independent): 2 samples, independent t;
3 samples, one-way ANOVA
- Data from the same respondents (dependent): 2 samples, paired t;
3 samples, repeated measurement analysis
Relationship
- No dependent variable: correlation (r)
- Dependent variable: regression analysis
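One possible reading of the flow chart above can be expressed as a small function. The branch structure is an interpretation of the chart's labels, and the function name, parameter names and returned strings are all hypothetical. The "3 samples" branch is assumed to cover three or more samples.

```python
def choose_test(qualitative, goal=None, standard_value=False,
                same_respondents=False, samples=2,
                has_dependent_variable=False):
    """Return the analysis the flow chart appears to lead to (interpretation)."""
    if qualitative:
        return "Cross tabulation / observed vs expected frequencies"
    if goal == "relationship":
        return "Regression analysis" if has_dependent_variable else "Correlation (r)"
    if goal == "difference":
        if standard_value:
            return "One-sample t-test"
        if same_respondents:   # dependent (related) samples
            return "Paired t-test" if samples == 2 else "Repeated measurement analysis"
        # independent (unrelated) samples
        return "Independent t-test" if samples == 2 else "One-way ANOVA"
    raise ValueError("goal must be 'difference' or 'relationship'")

print(choose_test(False, goal="difference", same_respondents=True, samples=2))
print(choose_test(False, goal="difference", samples=3))
print(choose_test(False, goal="relationship", has_dependent_variable=True))
```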


Tests of Normality
A common application for distribution fitting procedures is
when the assumption of normality has to be verified
before using some parametric test.
- Kolmogorov-Smirnov test for normality (one sample / two sample)
- Shapiro-Wilk W test
- Lilliefors test (used instead of KS when the mean and SD are
not known)
- Histogram with normal curve
- Skewness & Kurtosis
- Significance Level
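To make the Kolmogorov-Smirnov and Lilliefors idea concrete, the sketch below computes only the test statistic: the largest gap between the empirical distribution of the data and a normal distribution whose mean and SD are estimated from the same data. It does not compute a p-value or significance level, which requires the Lilliefors tables or a statistics package. The function name and both data sets are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def lilliefors_statistic(values):
    """Largest distance between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))   # mean and SD estimated from the data
    d = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

# Hypothetical data: evenly spaced normal quantiles versus a strongly skewed series.
normal_like = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [i ** 3 for i in range(1, 51)]

print(round(lilliefors_statistic(normal_like), 3))
print(round(lilliefors_statistic(skewed), 3))
```

The near-normal series gives a much smaller statistic than the skewed one, which is the pattern a formal test would then compare against a critical value.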


How to Analyse Your Data


Descriptive Statistics


Inferential Statistics
Check for related samples or unrelated
samples
