Data Analysis in Social Sciences

Understanding Data Analysis in Social Sciences
or: why data analysis?
- required knowledge of basic statistics
- what are the various tests?
- the assumptions behind each test
- how to use the tests?
An introduction to the fundamentals of data analysis for MBA students
Partially based on Pandya K. et al., SPSS in Simple Steps,
Dreamtech Press, ISBN-13: 978-93-5004-251-9.
2012, Marie Anne Rosario, partially based on, and adapted from Pandya K. et al., SPSS in Simple
Steps, Dreamtech Press.
Purpose of Statistics
1) To describe a phenomenon,
2) To organize and summarize our results more
conveniently and meaningfully,
3) To make inferences or predictions,
4) To explain, and
5) To draw conclusions.
Exploratory Data Analysis (EDA)

Types of Statistics
Descriptive Statistics: The event or outcome of events is described
without drawing conclusions. It is concerned only with the collection,
organisation, summarising, analysis and presentation of an array of
numerical, qualitative or quantitative data. Descriptive statistics include
the mean, median, mode, standard deviation, range, percentiles, kurtosis,
correlation coefficient, proportions, etc.
Inferential Statistics: This builds on descriptive statistics by going a
step further and interpreting sample results with a view to the population
on which a decision would be based. Valid and reliable decisions,
generalisations, predictions and conclusions can be drawn using statistical
tools such as stochastic processes, queuing theory, game theory, quality
control, the chi-square test, t-test, F-test, etc.
Experimental Statistics: This relates to the design of experiments to
establish cause and effect, using designs such as true experiments,
quasi-experiments, etc.
Measurement scales
Example: runners finishing a race (values listed for third, second and first place).
- Nominal: numbers assigned to runners (e.g. 7, 11)
- Ordinal: rank order of winners (third place, second place, first place)
- Interval: performance rating on a 0-to-10 scale (8.2, 9.1, 9.6)
- Ratio: time to finish in seconds (15.2, 14.1, 13.4)
Nominal variables allow for only qualitative classification. That is, they
can be measured only in terms of whether the individual items belong to
distinct categories; we cannot quantify or even rank order those
categories. Typical examples of nominal variables are gender, race,
color, city, and the numbers assigned to players.
Permissible statistics:
- Descriptive: percentages, mode
- Inferential: chi-square, binomial test
Ordinal variables allow us to rank order the items we measure in terms
of which has less and which has more of the quality represented by the
variable, but they still do not allow us to say "how much more". A typical
example of an ordinal variable is the socioeconomic status of families.
Permissible statistics:
- Descriptive: percentile, median
- Inferential: rank-order correlation, Friedman ANOVA
Interval variables allow us not only to rank order the items that are
measured, but also to quantify and compare the sizes of the differences
between them. For example, temperature, as measured in degrees
Fahrenheit or Celsius, constitutes an interval scale.
Permissible statistics:
- Descriptive: range, mean, standard deviation (S.D.)
- Inferential: Pearson correlation, t-tests, ANOVA, regression, factor analysis
Ratio variables are very similar to interval variables; in addition to all the
properties of interval variables, they feature an identifiable absolute zero
point, so they allow for statements such as "x is twice as large as y".
Typical examples of ratio scales are measures of time or space.
Permissible statistics:
- Descriptive: geometric mean, harmonic mean
- Inferential: coefficient of variation
Some Important Measures

Measures of Central Tendency/Averages
- Mean
- Median
- Mode
Measures of Spread / Dispersion
- Range
- Variance
- Standard deviation (S.D.)
- Interquartile range (IQR)
Measure of Asymmetry
- Skewness
Measure of Peakedness
- Kurtosis
Measures of Central Tendency

One of the main objectives of statistical analysis is to get a single value
that describes the characteristic of the entire data set.
Such a value is called the central value, and the most commonly used
measures of central tendency are the arithmetic mean, median and mode.
They give us an idea of the concentration of the observations (data)
about the central part of the distribution.
The arithmetic mean of a series is defined as the sum of the observations
divided by the number of observations.
The median is defined as the middle item of the given observations
arranged in order.
The mode is the value of the variable that occurs most frequently in a
series.
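The three definitions above can be tried out with a minimal sketch that uses Python's standard `statistics` module. The scores are hypothetical values chosen only for illustration; the single large score shows how an extreme case pulls the mean away from the median and mode.

```python
from statistics import mean, median, mode

# Hypothetical scores of nine respondents (illustrative values only).
scores = [2, 3, 3, 4, 5, 5, 5, 6, 21]

mean_value = mean(scores)      # sum of the observations / number of observations
median_value = median(scores)  # middle item of the observations arranged in order
mode_value = mode(scores)      # value that occurs most frequently

print(mean_value, median_value, mode_value)
```

Here the mean is larger than the median because of the one extreme score, which is the usual reason for preferring the median with skewed data.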
Skewness
Indicates asymmetry in the distribution.
- Positive value: right skewed, long tail to the right, typically median < mean.
- Negative value: left skewed, long tail to the left, typically median > mean.
- For normally distributed data, skewness is zero.
Kurtosis
Indicates peakedness of the distribution.
- Positive values indicate heavy tails.
- Negative values indicate light tails.
- The normal distribution has a kurtosis of zero (when reported as excess
  kurtosis, i.e. kurtosis minus 3).
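As a rough illustration of the sign conventions above, the sketch below computes skewness and excess kurtosis from central moments. It assumes the simple moment-based (population) formulas; statistical packages often apply small-sample corrections, so their output can differ slightly. The function names and data are hypothetical.

```python
from statistics import fmean, median

def central_moment(data, k):
    """k-th central moment: the average of (x - mean) ** k."""
    m = fmean(data)
    return fmean([(x - m) ** k for x in data])

def skewness(data):
    """Moment-based skewness: third central moment scaled by variance ** 1.5."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

# Hypothetical samples (illustrative values only).
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))         # zero for a symmetric sample
print(skewness(right_skewed))      # positive: long tail to the right
print(median(right_skewed) < fmean(right_skewed))
print(excess_kurtosis(symmetric))  # negative: lighter tails than the normal
```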
Kurtosis (figure): leptokurtic, normal and platykurtic curves.
Many statistical tests assume that values are normally distributed.
This is not always the case!
Examine the data prior to processing.
Measures of Dispersion
Knowledge of the measures of central tendency alone cannot give a complete
idea about the distribution.
Measures of dispersion give an idea about the scatter of the data
in the distribution.
Range, Variance, Standard Deviation and Quartile Deviation are the
measures of dispersion.
Range is the difference between the greatest and the least of the
observations.
A better way to measure dispersion is to square the differences between
each data value and the mean before averaging them. This is called the
Variance.
The positive square root of the Variance is called the Standard Deviation.
The coefficient of variation (the Standard Deviation expressed as a
percentage of the mean) helps us to compare the consistency of two or
more collections of data.
The higher the coefficient of variation, the less consistent the data, and
vice versa.
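The measures above can be sketched with the standard `statistics` module. This example assumes the population formulas (`pvariance`, `pstdev`), which match the "average of squared differences" wording; the sample versions (`variance`, `stdev`) divide by n - 1 instead. The helper name and the sales figures are hypothetical.

```python
from statistics import fmean, pstdev, pvariance

def coefficient_of_variation(data):
    """Standard deviation as a percentage of the mean (mean must be non-zero)."""
    return 100.0 * pstdev(data) / fmean(data)

# Hypothetical monthly sales of two salespeople (illustrative values only).
seller_a = [48, 50, 52, 50, 50]
seller_b = [30, 70, 45, 55, 50]

data_range = max(seller_a) - min(seller_a)  # greatest minus least observation
variance = pvariance(seller_a)              # average of squared deviations from the mean
std_dev = pstdev(seller_a)                  # positive square root of the variance

cv_a = coefficient_of_variation(seller_a)
cv_b = coefficient_of_variation(seller_b)
print(data_range, variance, std_dev)
print(cv_a, cv_b)  # both sellers average 50, but seller_b is far less consistent
```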
Quartile Deviation
The median is the value below which 50% of the cases in
the distribution lie. The median divides the distribution into
two equal halves.
There are other points that divide the distribution in
other proportions.
The point below which 25% of the cases in the distribution lie
is known as the first quartile (Q1).
The point below which 75% of the cases lie is
known as the third quartile (Q3).
Quartile Deviation is half the difference between the
values of these quartiles, i.e. (Q3 - Q1) / 2.
Min | 25% | Q1 | 25% | Mdn | 25% | Q3 | 25% | Max
Percentile gives the relative position in a distribution.
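A small sketch of the quartiles, the quartile deviation and a percentile position, using `statistics.quantiles` from the standard library. Packages use slightly different interpolation rules for quartiles, so values from other software may differ a little. The data and the `percent_below` helper are hypothetical.

```python
from statistics import quantiles

# Hypothetical scores (illustrative values only).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

q1, mdn, q3 = quantiles(scores, n=4)  # the three cut points between the quarters
iqr = q3 - q1                         # interquartile range
quartile_deviation = iqr / 2          # half the difference between Q3 and Q1

def percent_below(data, value):
    """Percentage of cases that fall below the given value."""
    return 100.0 * sum(1 for x in data if x < value) / len(data)

print(q1, mdn, q3, quartile_deviation)
print(percent_below(scores, 9))
```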

Box Plot
Elements of the plot, from top to bottom: outlier, maximum, 75th percentile,
50th percentile (median), 25th percentile, minimum, outlier.
Normal Probability Plots

1. Normal distribution
2. Right-skewed distribution
3. Left-skewed distribution
4. Light-tailed distribution
5. Heavy-tailed distribution
Mean is used
- when the distribution is not badly skewed.
- when the measure of greatest stability is required.
- when other statistics (such as S.D., t) are to be computed.
Median is used
- when the exact midpoint of the distribution is required.
- when there are extreme cases.
- when the distribution is truncated.
Mode is used
- when we need a quick approximate value.
- when we need the most typical value.
Central tendency (C.T.) of a nominal (categorical) scale is given by its mode.
C.T. of an ordinal scale is given by the median.
C.T. of an interval/ratio scale is given by the mean (symmetrical data) or
the median (skewed data).
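The scale-by-scale rule above could be encoded roughly as follows. This is only a sketch: the function name, the `scale` labels and the `skewed` flag are hypothetical, and `median_low` is used for ordinal data as an assumption so that the result is always one of the observed ranks.

```python
from statistics import mean, median, median_low, mode

def central_tendency(values, scale, skewed=False):
    """Return the measure of central tendency suggested for the given scale."""
    if scale == "nominal":
        return mode(values)        # most frequent category
    if scale == "ordinal":
        return median_low(values)  # middle rank, always an observed value
    if scale in ("interval", "ratio"):
        # Mean for symmetrical data, median when the data are skewed.
        return median(values) if skewed else mean(values)
    raise ValueError(f"unknown scale: {scale!r}")

print(central_tendency(["red", "blue", "red"], "nominal"))
print(central_tendency([1, 2, 2, 3, 4], "ordinal"))
print(central_tendency([10, 12, 14], "interval"))
print(central_tendency([10, 12, 14, 100], "ratio", skewed=True))
```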
Parametric & Non-parametric Tests

A parametric test requires:
- a normally distributed sample; for a large sample (>100) the sampling
  distribution is assumed to be normal
- data measured on an interval or ratio scale
When should we use non-parametric tests?
- When the sample size is too small.
- When the sample is not normally distributed. This can happen if the
  data have outliers; statistical methods based on the normality
  assumption then break down.
- When the variables are measured on a nominal or ordinal scale.
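The conditions above can be read as a simple rule of thumb. The sketch below is one possible encoding, not a definitive procedure: the function name is hypothetical, `small_n=30` is an assumed cutoff for "too small" (the slides give no number), and `large_n=100` reflects the large-sample remark above.

```python
def suggest_test_family(n, scale, looks_normal, small_n=30, large_n=100):
    """Suggest 'parametric' or 'non-parametric' from sample size, scale and normality."""
    if scale in ("nominal", "ordinal"):
        return "non-parametric"  # parametric tests need interval or ratio data
    if n < small_n:
        return "non-parametric"  # sample too small (assumed cutoff)
    if looks_normal or n > large_n:
        return "parametric"      # normal sample, or large-sample assumption
    return "non-parametric"      # moderate sample that is not normal

print(suggest_test_family(150, "interval", looks_normal=False))
print(suggest_test_family(20, "ratio", looks_normal=True))
print(suggest_test_family(50, "interval", looks_normal=False))
```

In practice the `looks_normal` flag would come from a normality check on the data rather than being set by hand.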
Commonly used Parametric & Non-Parametric Tests
PARAMETRIC
- One-sample t-test
- Independent-samples t-test
- Paired-samples t-test
- One-way ANOVA
- One-way repeated measures ANOVA
- Two-way ANOVA
- Pearson correlation coefficient
NON-PARAMETRIC
- Runs test
- Chi-square test
- Mann-Whitney U test
- Wilcoxon signed-rank test
- Kruskal-Wallis test
- Spearman and Kendall tau coefficients
UNI-VARIATE ANALYSIS: choosing a test
Qualitative data?
- Yes:
  - cross-tab with two categorical variables
  - cross-tab with more than two variables
  - compare observed frequencies with expected frequencies
- No: difference or relationship?
  - Difference:
    - comparison with a standard value: one-sample t
    - are the data from the same respondents?
      - No (independent): 2 samples: independent t; 3 samples: one-way ANOVA
      - Yes (dependent): 2 samples: paired t; 3 samples: repeated measurement analysis
  - Relationship: is there a dependent variable?
    - No: correlation (r)
    - Yes: regression analysis
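One way to read the test-selection flow chart is as a chain of yes/no questions. The sketch below encodes that reading as a function; the function name, parameter names and returned labels are all hypothetical, and the branch order is an interpretation of the chart rather than a fixed procedure.

```python
def choose_test(qualitative, goal="difference", compare_with_standard=False,
                same_respondents=False, n_samples=2, has_dependent_variable=False):
    """Walk the decision questions and return the name of a suggested analysis."""
    if qualitative:
        return "cross-tabulation / observed vs expected frequencies"
    if goal == "relationship":
        return "regression analysis" if has_dependent_variable else "correlation (r)"
    if goal != "difference":
        raise ValueError("goal must be 'difference' or 'relationship'")
    if compare_with_standard:
        return "one-sample t-test"
    if n_samples < 2:
        raise ValueError("need at least two samples to compare")
    if same_respondents:  # dependent (related) samples
        return "paired t-test" if n_samples == 2 else "repeated measurement analysis"
    # independent (unrelated) samples
    return "independent t-test" if n_samples == 2 else "one-way ANOVA"

print(choose_test(qualitative=False, same_respondents=True, n_samples=2))
print(choose_test(qualitative=False, goal="relationship", has_dependent_variable=True))
```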
Tests of Normality
A common application for distribution-fitting procedures is
verifying the assumption of normality
before using a parametric test.
- Kolmogorov-Smirnov test for normality (one sample / two sample)
- Shapiro-Wilk W test
- Lilliefors test (used instead of the Kolmogorov-Smirnov test when the
  population mean and S.D. are not known)
- Histogram with normal curve
- Skewness and kurtosis
- Significance level
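To make the Kolmogorov-Smirnov / Lilliefors idea concrete, the sketch below computes only the test statistic: the largest gap between the empirical distribution of the sample and a normal curve whose mean and S.D. are estimated from that sample. It does not produce a p-value; judging significance requires Lilliefors tables or a statistics package. The function name and both data sets are hypothetical.

```python
from math import log
from statistics import NormalDist, fmean, stdev

def ks_normal_statistic(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(fmean(xs), stdev(xs))  # mean and S.D. estimated from the sample
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d

# Hypothetical samples built from evenly spaced probabilities (illustrative only).
probs = [(i - 0.5) / 50 for i in range(1, 51)]
near_normal = [NormalDist().inv_cdf(p) for p in probs]  # bell-shaped values
right_skewed = [-log(1 - p) for p in probs]             # long right tail

d_normal = ks_normal_statistic(near_normal)
d_skewed = ks_normal_statistic(right_skewed)
print(d_normal, d_skewed)  # the skewed sample shows the larger gap
```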
How to Analyse Your Data

Descriptive Statistics
Inferential Statistics
Check for related samples or unrelated
samples
