Data Analysis

INTRODUCTION
Data Preparation
Types of Variables Analysis
Univariate Analysis
Bivariate Analysis
Bivariate Analysis (cont)
Multivariate Analysis
Measures of Centrality and Dispersion
Types of Measures of Dispersion
Range
Mean Deviation
Examples
Variance
Standard Deviation
Example
Interpretation
Coefficient of Variation
Why Coefficient of Variation
Frequency Distribution
Distributions
Frequency distribution: a description of the
number of times the various attributes of a
variable are observed in a sample.
A frequency distribution counts the number of
responses to a question or occurrences of a
phenomenon of interest.
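As a rough illustration of the definition above, the sketch below counts how often each attribute occurs and converts the counts to percentages. The `responses` list and the helper name `frequency_distribution` are made up for this example and do not come from the slides.

```python
from collections import Counter

# Hypothetical survey responses (illustrative only, not from the slides).
responses = ["yes", "no", "yes", "unsure", "yes", "no"]


def frequency_distribution(values):
    """Return {attribute: (count, percent)} for the observed values."""
    counts = Counter(values)
    total = len(values)
    return {attr: (n, round(100 * n / total, 1)) for attr, n in counts.items()}


table = frequency_distribution(responses)
for attr, (n, pct) in table.items():
    # One row per attribute: label, count, percentage.
    print(f"{attr:<8}{n:>3}{pct:>7}")
```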
Distribution (cont)
Distribution (cont)
Central Tendency
Central Tendency
Average: an ambiguous term generally
suggesting typical or normal; a measure of central
tendency (examples: mean, median, mode).
1. Mean = sum of values / total number of
cases
2. Mode = most frequently occurring
attribute
3. Median = middle attribute in the ranked
distribution of observed attributes
Central Tendency (Cont)

Practice: The following list represents the scores
on a mid-term exam.
100, 94, 88, 91, 75, 61, 93, 82, 70, 88, 71, 88
Determine the mean.

Determine the mode.
Determine the median.
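One way to check the practice answers is with Python's standard `statistics` module. This is only a sketch; the variable names are chosen for this example.

```python
import statistics

# Mid-term exam scores from the practice question.
scores = [100, 94, 88, 91, 75, 61, 93, 82, 70, 88, 71, 88]

mean_score = statistics.mean(scores)      # sum of values / number of cases
mode_score = statistics.mode(scores)      # most frequently occurring value
median_score = statistics.median(scores)  # middle of the ranked scores

print(round(mean_score, 2), mode_score, median_score)
```

With twelve scores there is no single middle value, so the median is the average of the sixth and seventh ranked scores.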
Measurement of Dispersion
Range
Percentile range
Quartile deviation
Mean deviation
Variance and standard deviation
Relative measure of dispersion
Coefficient of variation
Coefficient of mean deviation
Coefficient of range
Coefficient of quartile deviation
Measures of Dispersion: The Range

Simplest measure of dispersion
Difference between the largest and the smallest values:
Range = $X_{\text{largest}} - X_{\text{smallest}}$

Example (values plotted on an axis from 0 to 14; smallest value 1, largest value 13):
Range = 13 - 1 = 12
Interquartile Range
Measures the range of the middle 50% of the values only
Is defined as the difference between the upper and lower
quartiles
Interquartile range = upper quartile - lower quartile
= $Q_3 - Q_1$
The mean deviation

Measures the average distance of each observation away
from the mean of the data
Gives an equal weight to each observation
Generally more sensitive than the range or interquartile
range, since a change in any value will affect it
Actual and absolute deviations from the mean
A set of x values has a mean of $\bar{x}$
The residual of a particular x-value is:
Residual or deviation = $x - \bar{x}$
The absolute deviation is: $|x - \bar{x}|$
Mean deviation
The mean of the absolute deviations:
Mean deviation = $\frac{\sum |x - \bar{x}|}{n}$
To calculate mean deviation

1. Calculate the mean of the data: find $\bar{x}$
2. Subtract the mean from each observation and record the differences: for each x, find $x - \bar{x}$
3. Record the absolute value of each residual: find $|x - \bar{x}|$ for each x
4. Calculate the mean of the absolute values: add up the absolute values and divide by n
Mean deviation = $\frac{\sum |x - \bar{x}|}{n}$
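The four steps for the mean deviation can be sketched in a few lines of Python. The data set and the function name `mean_deviation` are hypothetical and used only for illustration.

```python
def mean_deviation(values):
    """Mean of the absolute deviations from the mean."""
    n = len(values)
    mean = sum(values) / n                    # step 1: mean of the data
    residuals = [x - mean for x in values]    # step 2: subtract the mean
    absolute = [abs(r) for r in residuals]    # step 3: absolute values
    return sum(absolute) / n                  # step 4: mean of absolute values


# Hypothetical data (not from the slides): mean is 6,
# absolute deviations are 4, 2, 0, 2, 4.
data = [2, 4, 6, 8, 10]
print(mean_deviation(data))
```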
The standard deviation

Measures the variation of observations from the
mean
The most common measure of dispersion
Takes into account every observation
Measures the average deviation of observations
from mean
Works with squares of residuals, not absolute
values, so it is easier to use in further calculations
Standard deviation of a population

Every observation in the population is used.
The square of the population standard
deviation is called the variance.
Variance = $\sigma^2$
Standard deviation = $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$
Standard deviation of a sample

In practice, most populations are very large
and it is more common to calculate the
sample standard deviation.
Sample standard deviation = $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Where: n is the number of observations in the sample
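The only difference between the population and sample formulas is the divisor: n for a population, n - 1 for a sample. The sketch below makes that explicit; the function name `std_dev`, its `sample` flag and the data are assumptions made for this example.

```python
import math


def std_dev(values, sample=True):
    """Standard deviation; divides by n - 1 for a sample, n for a population."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_sq / divisor)


# Hypothetical data (not from the slides).
data = [2, 4, 6, 8, 10]
population_sd = std_dev(data, sample=False)
sample_sd = std_dev(data, sample=True)
print(round(population_sd, 4), round(sample_sd, 4))
```

The sample value is always slightly larger than the population value for the same data, because the sum of squares is divided by a smaller number.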
To calculate standard deviation

1. Calculate the mean: $\bar{x}$
2. Calculate the residual for each x: $x - \bar{x}$
3. Square the residuals: $(x - \bar{x})^2$
4. Calculate the sum of the squares: $\sum (x - \bar{x})^2$
5. Divide the sum in Step 4 by (n-1): $\frac{\sum (x - \bar{x})^2}{n - 1}$
6. Take the square root of the quantity in Step 5: $\sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
Standard deviations for frequency distributions
If data is in a frequency distribution
No. Units
n
1
Frequency
f
85
192
123
Total
400
Calculate standard deviation using:
$s = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f - 1}}$
where $\sum f$ is the total frequency
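For data held as a frequency distribution, each squared deviation is weighted by its frequency and the sum is divided by the total frequency minus one. The sketch below assumes that reading of the formula; the values and frequencies are hypothetical and are not the ones from the slide's table.

```python
import math


def grouped_sample_std(values, freqs):
    """Sample standard deviation for values x with frequencies f."""
    total = sum(freqs)                                    # total frequency
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    sum_sq = sum(f * (x - mean) ** 2 for x, f in zip(values, freqs))
    return math.sqrt(sum_sq / (total - 1))


# Hypothetical frequency distribution (not from the slides).
units = [1, 2, 3]
frequency = [2, 3, 5]
s_grouped = grouped_sample_std(units, frequency)
print(round(s_grouped, 4))
```

The result matches what you would get by writing out every observation individually (two 1s, three 2s, five 3s) and applying the ordinary sample formula.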
Coefficient of variation
Is a measure of relative variability used to:
measure changes that have occurred in a
population over time
compare variability of two populations that are
expressed in different units of measurement
It is expressed as a percentage rather than in terms of
the units of the particular data
Formula for coefficient of variation

Denoted by V
$V = \frac{s}{\bar{x}} \times 100\%$
where
$\bar{x}$ = the mean of the sample
s = the standard deviation of the sample
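Because the coefficient of variation is a percentage, it lets you compare spread across different units. A minimal sketch, using made-up height and weight samples and a helper name chosen for this example:

```python
import statistics


def coefficient_of_variation(values):
    """V = s / mean * 100, using the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical samples in different units (not from the slides).
heights_cm = [160, 170, 180]
weights_kg = [60, 70, 80]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 2), round(cv_weight, 2))
```

Both samples have the same standard deviation (10), yet the weights vary more relative to their mean, which is what V is designed to show.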
Summary
Measures of dispersion
no ideal measure of dispersion exists

standard deviation is the most important measure
of dispersion
it is the most frequently used
the value is affected by the value of every observation
in the data
extreme values in the population may distort the data
The range
Simply the difference between the largest and
smallest values in a set of data
Useful for: daily temperature fluctuations or share
price movement
Is considered primitive as it considers only the
extreme values which may not be useful indicators of
the bulk of the population.
The formula is:
Range = largest observation - smallest observation
2002 McGraw-Hill Australia, PPTs t/a Introductory Mathematics & Statistics for Business 4e by John S. Croucher
Measures of Dispersion: Why The Range Can Be Misleading
Ignores the way in which data are distributed
[Figure: two dot plots on an axis from 7 to 12 with differently distributed values; in both cases Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
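The outlier example can be reproduced directly. In this sketch the helper name `value_range` is an assumption; the two data sets are the ones listed above and differ only in their last value.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


# The 24 values shared by both data sets in the example.
shared = [1] * 11 + [2] * 8 + [3] * 4 + [4]

without_outlier = shared + [5]
with_outlier = shared + [120]

print(value_range(without_outlier))
print(value_range(with_outlier))
```

Changing one observation out of 25 moves the range from 4 to 119, even though the bulk of the data is unchanged.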
Quartile Measures
Quartiles split the ranked data into 4 segments with an
equal number of values per segment
[Diagram: ranked data split as 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% of the observations are
smaller and 50% are larger)
Only 25% of the observations are greater than the third
quartile
The Variance
Average (approximately) of squared deviations
of values from the mean
Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
Where
$\bar{X}$ = arithmetic mean
n = sample size
$X_i$ = ith value of the variable X
The Standard Deviation s
Most commonly used measure of variation

Shows variation about the mean
Is the square root of the variance
Has the same units as the original data
Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Sample Standard Deviation: Calculation Example
Sample data ($X_i$): 10 12 14 15 17 18 18 24
n = 8
Mean = $\bar{X}$ = 16
$S = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$
A measure of the average scatter around the mean
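The calculation example can be traced step by step in Python. This is a sketch with variable names chosen for illustration; the data are the eight sample values from the example.

```python
import math

data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)                              # sample size, 8
mean = sum(data) / n                       # arithmetic mean, 16.0
residuals = [x - mean for x in data]       # deviation of each value from the mean
squares = [r ** 2 for r in residuals]      # squared deviations
sum_sq = sum(squares)                      # sum of squares, 130.0
variance = sum_sq / (n - 1)                # sample variance, 130 / 7
s = math.sqrt(variance)                    # sample standard deviation

print(round(s, 4))  # 4.3095
```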
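Summary of Measures
Range: $X_{\text{largest}} - X_{\text{smallest}}$ (total spread)
Standard deviation (sample): $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$ (dispersion about the sample mean)
Standard deviation (population): $\sqrt{\frac{\sum (X_i - \mu)^2}{N}}$ (dispersion about the population mean)
Variance (sample): $\frac{\sum (X_i - \bar{X})^2}{n - 1}$ (squared dispersion about the sample mean)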
The Semi-Interquartile Range

The semi-interquartile range (or SIR) is
defined as the difference between the third and
first quartiles, divided by two
The first quartile is the 25th percentile
The third quartile is the 75th percentile
SIR = (Q3 - Q1) / 2
The Semi-Interquartile Range

Example
What is the SIR for the following data?
2 4 6 8 10 12 14 20 30 60
25% of the scores are below 5, so 5 is the first quartile (25th percentile)
25% of the scores are above 25, so 25 is the third quartile (75th percentile)
SIR = (Q3 - Q1) / 2 = (25 - 5) / 2 = 10
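Once the quartiles are known, the SIR is a one-line calculation. The sketch below takes the quartiles from the example as given rather than computing them, because statistical packages use several different quartile conventions and may not return exactly 5 and 25 for these ten scores. The function name is an assumption for this example.

```python
def semi_interquartile_range(q1, q3):
    """SIR = (Q3 - Q1) / 2, half the interquartile range."""
    return (q3 - q1) / 2


# Quartiles as stated in the example: Q1 = 5, Q3 = 25.
sir = semi_interquartile_range(q1=5, q3=25)
print(sir)
```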
Variance
Variance is defined as the average of the
squared deviations:
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
What Does the Variance Formula Mean?
First, it says to subtract the mean from each of
the scores
This difference is called a deviate or a deviation
score
The deviate tells us how far a given score is from
the typical, or average, score
Thus, the deviate is a measure of dispersion for a
given score
Continuous & Discrete Variables
Subgroup Comparison
Collapsing Response Categories
Handling "Don't Knows"
Numerical Description in Qualitative Research
Bivariate Analysis
Bivariate Analysis (cont)
Bivariate Analysis
Percentaging a Table
Percentaging a Table (cont)
Constructing & Reading Bivariate Tables
Multivariate Analysis
Conclusion
DESCRIPTIVE STATISTICS
Summarise and organise data.
Mean: the average (sum of scores / number of scores).
Mode: the most common value (the typical value).
Median: the middle value.
Findings can be presented in a number of ways:

Frequency tables
How often do you go training in a week?
Response    n     %
1           17    25.0
2           29    42.7
3           22    32.3
(n = number of responses)
Always provide a table of results.
USING GRAPHS AND CHARTS

Only present a graph/chart if it illustrates something.
These describe data; they do not explain anything.
INFERENTIAL STATISTICS
Allow you to make inferences from data.
Use at least 2 variables.
What effect does the independent variable have on the
dependent variable? Causality: is A caused by B?
TYPES OF TEST
1. Parametric tests. These tests use interval or ratio data (see
Chapter 6 for a reminder). Parametric tests assume that the
data are drawn from a normally distributed population (i.e. the
data are not skewed) and that the groups have the same variance
(or spread) on the variables being measured.
2. Non-parametric tests. These are used with ordinal or
nominal data, and do not make any assumptions about the
characteristics of the sample in terms of its distribution.
TESTS OF ASSOCIATION
CORRELATION
Correlations investigate the relationship between two variables
consisting of interval or ratio data.
A correlation can indicate:
Whether there is a relationship between the two variables.
The direction of the relationship, i.e. whether it is positive or
negative.
The strength, or magnitude of the relationship.
Correlation scores (r) range from +1 to -1:
r = +1: perfect (strongest possible) positive correlation
r = -1: perfect (strongest possible) negative correlation
r = 0: no correlation
A strong correlation does not necessarily mean a causal relationship!

e.g. lectures attended positively correlates with final grade.
May be:
more lectures attended = more interest
more interest = higher grade
Spurious relationship.
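To make the +1 to -1 scale concrete, the sketch below computes Pearson's correlation coefficient by hand. The function name `pearson_r` and the lecture and grade figures are hypothetical; they only illustrate direction and strength and say nothing about causation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data (not from the slides).
lectures_attended = [2, 4, 6, 8, 10]
final_grade = [50, 55, 65, 70, 80]

r = pearson_r(lectures_attended, final_grade)
print(round(r, 3))
```

A value close to +1 here shows only that the two variables rise together; a third factor such as interest in the subject could account for both.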
TESTING DIFFERENCES
Tests of difference generally assess whether differences between
two samples are likely to have occurred by chance, or whether
they are the result of the effect of a particular variable.
THE INDEPENDENT SAMPLES T-TEST

This examines whether the mean scores of two different
groups can be considered as being significantly different.
It can be used when:
The data is interval or ratio in nature.
The groups are randomly assigned (hence, you should use an
ANOVA rather than a t-test to compare differences between
males and females, as gender is not randomly determined
when you come to assign your groups).
The two groups are independent of each other.
The variance, or spread, in the two groups is equal.
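As a sketch of what the test computes, the function below returns the t statistic and degrees of freedom for two independent groups under the equal-variance assumption listed above (the pooled-variance form). The function name and the group scores are hypothetical. Judging significance still requires comparing t against a t distribution, which is not shown here.

```python
import math
import statistics


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two groups."""
    n1, n2 = len(group_a), len(group_b)
    mean1, mean2 = statistics.mean(group_a), statistics.mean(group_b)
    var1, var2 = statistics.variance(group_a), statistics.variance(group_b)
    df = n1 + n2 - 2
    # Pooled estimate of the common variance.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error, df


# Hypothetical scores for two independent groups (not from the slides).
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_stat, df = independent_t(group_a, group_b)
print(round(t_stat, 3), df)
```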
PAIRED SAMPLES T-TEST

The paired t-test measures whether the mean of a single group
is different when measured at different times.
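For paired data the test works on the difference between each participant's two scores. A minimal sketch, with a hypothetical function name and made-up before and after scores:

```python
import math
import statistics


def paired_t(first, second):
    """t statistic and degrees of freedom for paired measurements."""
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    mean_diff = statistics.mean(differences)
    # Standard error of the mean difference.
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Hypothetical scores for the same four participants at two times.
before = [10, 12, 14, 16]
after = [11, 14, 15, 18]
t_paired, df_paired = paired_t(before, after)
print(round(t_paired, 3), df_paired)
```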
ANALYSIS OF VARIANCE (ANOVA)

ANOVA is similar in nature to the independent t-test, however
it allows you to ascertain differences between more than two
groups.
If you are looking to explore gender differences, then this is a
more appropriate test to use than an independent t-test as it
does not assume that participants have been randomly
assigned to each group.
THE MANN-WHITNEY TEST

An alternative to the independent t-test.
Used when data is ordinal and non-parametric.
This test works on ranking the data rather than testing the
actual score, and scoring each rank (so the lowest score would
be ranked 1, the next lowest 2 and so on) ignoring the group
to which each participant belonged.
The principle of the test is that if the groups were equal, then
the sum of the ranks should also be the same.
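The ranking step can be sketched as follows: pool both groups, rank from lowest to highest, then add up the ranks for each group. Giving tied scores the average of the ranks they occupy is a common convention and is assumed here. The function name and data are hypothetical, and the sketch stops at the rank sums rather than computing the full test statistic.

```python
def rank_sums(group_a, group_b):
    """Sum of pooled ranks for each group; ties receive the average rank."""
    pooled = sorted(group_a + group_b)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        # Find the run of equal values starting at position i.
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 hold ranks i+1..j; use their average.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    sum_a = sum(rank_of[x] for x in group_a)
    sum_b = sum(rank_of[x] for x in group_b)
    return sum_a, sum_b


# Hypothetical ordinal scores for two independent groups.
print(rank_sums([1, 2, 3], [4, 5, 6]))
print(rank_sums([1, 2], [2, 3]))
```

In the first call every score in one group is lower than every score in the other, so the rank sums are as far apart as they can be for groups of this size.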
THE WILCOXON SIGNED RANK TEST

Similar to the Mann-Whitney test, however it examines
differences where the two sets of scores are from the same
participants (effectively it is a non-parametric alternative to a
paired samples t-test).
THE KRUSKAL-WALLIS TEST
This is a non-parametric alternative to the ANOVA test, and can
be used to identify differences between three or more
independent groups.
WHICH TEST SHOULD I USE?

The type of data that you collect will be important in your final
choice of test:
Nominal
Consider a chi-squared test if you are interested in differences
in frequency counts using nominal data, for example comparing
whether month of birth affects the sport that someone
participates in.
WHICH TEST SHOULD I USE?

Ordinal
If you are interested in the relationship between groups, then
use Spearman's correlation.
If you are looking for differences between independent groups,
then a Mann-Whitney test may be appropriate.
If the groups are paired, however, then a Wilcoxon signed rank
test is appropriate.
If there are three or more groups then consider a Kruskal-Wallis test.
Interval or ratio
Are you looking to identify relationships between two variables?
If so, consider the use of a Pearson's correlation.
If there are three or more variables, then consider multiple
regression.
If you are concerned with differences between scores, then t-tests or ANOVA may be appropriate.
If you want to identify differences within one group, then a
paired samples t-test should be used.
If you are comparing two randomly assigned groups, then use
an independent samples t-test.
If you are looking to compare two non-randomly assigned, or
three or more groups, then use ANOVA.
