Ric Coe
ICRAF, Nairobi, Kenya
Contents
Introduction
Preliminaries
Descriptive Statistics
    1. Summarizing Single Variables
    2. Two variables
Descriptive statistics - common problems
Confirmatory analysis: estimation and hypothesis testing
    The problem
    Estimates, standard errors and confidence intervals
    Hypothesis tests: The logic
    Examples of calculations
    Limitations
    What should you do
Confirmatory Analysis - Regression
    Starting Regression
    Fitting the regression line
    Check the fit
    Interpretation
    Adding more variables - Multiple regression
Interpretation
References
Introduction
This guide summarises the use of simple statistical analyses in the interpretation of
survey data. It is aimed at the typical small surveys (up to a few hundred respondents)
carried out by researchers looking at the role and uptake of new agricultural technologies.
Modified from input to a course Formal data analysis for bean researchers organised by CIAT at CMRT,
Egerton University, February 1996. Thanks to Soniia David for permission to quote the example.
There are several common problems in the approaches to survey analysis used by many
researchers, probably a result of the research methods courses followed during training.
One is to concentrate attention on a few well-known statistical techniques, such as
chi-squared tests in 2-way tables and regression analysis, and to place a naively simplistic
reliance on the results. This is the topic of this guide. A second problem is to treat
statistical analysis as a recipe that can be followed to a successful conclusion without
much thought or understanding along the way. This is the topic of a companion guide
Steps in survey analysis (Coe 2002). A third problem is to ignore the context in which
the survey was carried out, so ignoring many of the possibilities and limitations of the
statistical analysis. This is the topic of the guide Approaches to analysis of survey data
(SSC, 2001).
Example
The example used in this guide was a survey of farmers in two districts of Uganda. It
aimed to characterize the pattern of bean growing and understand the role of new bean
varieties in the household economy of new farmers. A few of the stated objectives were:
Overall: Provide a baseline against which to measure adoption and impact of improved
bean varieties.
Hypotheses:
1. Adoption.
a. There is no relationship between adoption of new varieties and wealth.
b. The rate of adoption for MCM5001 will be higher in Mbale than Mukono, due to
strong non-appreciation of small seeded varieties in Mukono.
2. Impact.
a. Adoption of new varieties will result in an increase in absolute quantities and
proportion of beans sold, hence increasing household income from beans.
b. Adoption of new varieties will not result in increased sales of fresh beans.
c. Adoption of new varieties will not change the amount of income from beans controlled
by women.
d. ...
The examples are based on a subset of just 50 households from the whole survey of 179.
The variables used in the example have been labeled, so they should be self-explanatory.
In this guide SPSS has been used for the statistical analysis. General points appear in
normal text. Computer output and other items relating specifically to the example are
boxed.
Preliminaries
Before starting analysis:
1. Make sure you are familiar with the data source and collection methods.
For example:
2. Clarify objectives
These should have been listed in detail when the survey was planned. If they were not, or
have changed, they must be listed now. It is impossible to analyze a survey if you do
not know what you are trying to find out.
3. Coding and Data entry.
4. Make sure you understand the data. You must understand the exact meaning of every
number and code.
Data that needs clarifying.
Variable WIVES (Question 3): Does 1 mean 1 wife or 2 wives? (conflict between
questionnaire and code book).
Variable ARRANGE (Question 4). Does NA mean there are no bean plots or no
husband/wife?
Variables OCCUPHD1 and OCCUPHD2 (Question 8): Why are two occupations given
when the question asks for the main occupation?
Variable KAW94A (Question 21). What is the difference between na and No?
Variable AMKW94A (Question 21). What are the units?
Descriptive Statistics
1. Summarizing Single Variables
MATOKE
                                                     Valid      Cum
Value Label             Value  Frequency  Percent  Percent  Percent
Yes                         1         42     84.0     84.0     84.0
No                          2          8     16.0     16.0    100.0
                                 -------  -------  -------
                        Total         50    100.0    100.0
Valid cases     50      Missing cases     0

HHTYPE    Household type
                                                     Valid      Cum
Value Label             Value  Frequency  Percent  Percent  Percent
Male headed one wife        1         27     54.0     54.0     54.0
Male headed more tha        2          4      8.0      8.0     62.0
Female headed absent        3          3      6.0      6.0     68.0
Female headed, no hu        4         13     26.0     26.0     94.0
Single man                  5          2      4.0      4.0     98.0
Other                       7          1      2.0      2.0    100.0
                                 -------  -------  -------
                        Total         50    100.0    100.0
Valid cases     50      Missing cases     0
Crop           % growing
Cassava              100
Beans                 98
Matoke                84
Maize                 78
Yams                  20
Sample size           50
Look carefully at the data and identify rare cases. Such data points may be errors, or may
need special treatment.
What is the 1 other household type in question 2?
One farmer does not grow beans. Should this case be deleted from all
analyses?
Bar charts are most appropriate when the categories can be ordered in some useful
way.
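The frequency tables above are simple enough to rebuild by hand, which is a useful check that the percentages and cumulative percentages are understood. The following is a minimal sketch in Python rather than SPSS; the function name `frequency_table` is hypothetical, and the counts are those shown for HHTYPE (labels truncated as in the output).

```python
from itertools import accumulate


def frequency_table(counts):
    """Return (label, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    running = accumulate(counts.values())
    return [
        (label, freq, round(100 * freq / total, 1), round(100 * cum / total, 1))
        for (label, freq), cum in zip(counts.items(), running)
    ]


# Household type counts as shown in the HHTYPE table.
hhtype = {
    "Male headed one wife": 27,
    "Male headed more tha": 4,
    "Female headed absent": 3,
    "Female headed, no hu": 13,
    "Single man": 2,
    "Other": 1,
}

rows = frequency_table(hhtype)
for label, freq, pct, cum in rows:
    print(f"{label:<22}{freq:>4}{pct:>8.1f}{cum:>8.1f}")
```

Categories with only one or two cases (here "Single man" and "Other") stand out immediately in such a listing.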
Quantitative Variables
[Summary statistics table: values 15.9, 34.2, 4.0, 0, 14.0, 10.1]
[Figure: histogram, horizontal axis 0 to 200; Std. Dev = 34.21, Mean = 16.0, N = 47]
2. Two variables
                              Household type
Crop earning        Male      Female     Single
highest income      Headed    Headed     Male       Total
Coffee                  19         7          1        27
Groundnut                2         4          0         6
Bogoya                   1         3          0         4
Cassava                  1         0          1         2
Matoke                   2         0          0         2
Beans                    1         0          0         1
Other                    5         0          0         5
No sales                 0         2          0         2
Total                   31        16          2        49
[Table: mean, median, 25% point and number by group]
[Figure: box plots by group (male, female); N = 31, 15]
[Figure: scatter plot]
Artificial Example

              Region 1          Region 2          Overall
              Adoption          Adoption          Adoption
Income         -      +          -      +          -      +
L             10     20         40     20         50     40
H             20     40         20     10         40     50

Exactly the same thing occurs with continuous variables, where spurious
correlation (or lack of it) can be due to a third variable which has not been
allowed for. More advanced graphical (e.g. small multiple pictures) and numerical
(regression and log-linear modeling, multivariate methods such as principal
components) methods exist to help there.
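The artificial example can be checked with a few lines of arithmetic. The sketch below is in Python and assumes the cell counts read as: Region 1, low income 10 non-adopters and 20 adopters, high income 20 and 40; Region 2, low income 40 and 20, high income 20 and 10. The names `counts`, `adoption_rate` and `pooled` are hypothetical.

```python
# region -> income group -> (non-adopters, adopters), as read from the example table
counts = {
    "Region 1": {"L": (10, 20), "H": (20, 40)},
    "Region 2": {"L": (40, 20), "H": (20, 10)},
}


def adoption_rate(non_adopters, adopters):
    """Proportion of households adopting."""
    return adopters / (non_adopters + adopters)


def pooled(counts, income):
    """Add the two regions together for one income group."""
    non = sum(groups[income][0] for groups in counts.values())
    yes = sum(groups[income][1] for groups in counts.values())
    return non, yes


# Adoption rate by income group within each region.
within = {
    region: {income: adoption_rate(*cell) for income, cell in groups.items()}
    for region, groups in counts.items()
}
# Adoption rate by income group when the region is ignored.
overall = {income: adoption_rate(*pooled(counts, income)) for income in ("L", "H")}

for region, rates in within.items():
    print(region, {k: round(v, 2) for k, v in rates.items()})
print("Overall", {k: round(v, 2) for k, v in overall.items()})
```

Within each region the adoption rate is identical for low and high income households, yet the pooled table suggests that high income households adopt more. The apparent relationship is produced entirely by the region.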
[Figure legend: planted 94a, planted 94b, harvested 94a, harvested 94b]
[Figure: four displays of AMPLT94A: histogram, cumulative histogram, box plot, and normal quantile-quantile plot]
Cases which deviate from the mean, contributing to variability, are probably just as
important as the average values.
Make sure you understand whether variation is important, and if so, describe it.
It is unlikely that each substantive question can be answered from columns of raw data
alone. Calculation of new variables is certain to be important.
Calculate new variables that are needed to answer the questions.
Many datasets contain data collected at more than one level (e.g. plot, person, household,
community). Analyses must use the relevant level. Mixed levels are almost always wrong.
Even in surveys with data collected at one level there is room for confusion
regarding, for example, calculations of percentages.
                    Number of           Average of those
Variety             farmers planting    farmers who planted
                    in 94A
Kawanda                   11                  2.45
Manyigamulimi             21                 10.53
Kanyebwa                   0
White haricot              0
All others                14                  2.04
No beans planted          18                  -
Make sure all relevant data, but no irrelevant data, is being used.
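The choice of denominator changes the answer. As a rough sketch in Python, assume the variety table reads as 11 farmers planting Kawanda with an average of 2.45 among those who planted it, 18 farmers planting no beans in 94A, and 50 households in the subset. The variable names are hypothetical.

```python
n_farmers = 50          # assumption: the 50-household subset is the full sample
n_no_beans = 18         # farmers who planted no beans in 94A
kawanda_planters = 11   # farmers who planted Kawanda in 94A
kawanda_mean = 2.45     # average among those who planted Kawanda

# Percentage planting Kawanda, with two different denominators.
share_of_all_farmers = kawanda_planters / n_farmers
share_of_bean_planters = kawanda_planters / (n_farmers - n_no_beans)

# Average amount per farmer in the whole sample, counting non-planters as zero.
mean_over_all_farmers = kawanda_planters * kawanda_mean / n_farmers

print(f"{share_of_all_farmers:.0%} of all farmers planted Kawanda")
print(f"{share_of_bean_planters:.0%} of farmers who planted beans planted Kawanda")
print(f"average among planters {kawanda_mean}, among all farmers {mean_over_all_farmers:.2f}")
```

None of these figures is wrong; each answers a different question, so the report must state which group of farmers a percentage or average refers to.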
The problem
A.
                          Household Type
Labour                    Male      Female     Total
Never hire or exchange      23          13        36
Hire or exchange            10           3        13
Total                       33          16        49
B. Amount of beans planted (kg)
            Mean    s.d.     n
Male         6.5     9.5    24
Female       2.9     1.3     6
Proportions
The proportion of female headed households in the population is P. P is unknown.
The sample value is p = 0.33 ( = 16/49). The uncertainty due to sampling errors in
p is measured by its standard error:

$$\mathrm{se}(p) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.33(1-0.33)}{49}} = 0.07$$

The standard error of the mean amount of beans planted is

$$\mathrm{se}(\text{mean}) = \frac{s}{\sqrt{n}}$$

where $s^2$ is the variance in amount of beans and n the sample size.

$$\mathrm{se}(\text{mean}) = \frac{8.62}{\sqrt{30}} = 1.6$$

The 95% confidence interval is
mean ± 2 x se(mean)
= 5.8 ± 2 x 1.6
= (2.6, 9.0)
The mean amount of beans planted is between 2.6 and 9.0 kg.
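The two standard errors above can be reproduced directly. This is a minimal Python sketch, not SPSS output; the function names are hypothetical, and the sample mean of 5.8 kg is an assumption inferred from the midpoint of the quoted interval (2.6, 9.0).

```python
import math


def se_proportion(p, n):
    """Standard error of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)


def se_mean(sd, n):
    """Standard error of a sample mean."""
    return sd / math.sqrt(n)


def approx_ci95(estimate, se):
    """Approximate 95% interval: estimate plus or minus 2 standard errors."""
    return estimate - 2 * se, estimate + 2 * se


se_p = se_proportion(16 / 49, 49)      # female headed households
se_m = se_mean(8.62, 30)               # amount of beans planted
ci_mean = approx_ci95(5.8, round(se_m, 1))  # 5.8 kg is an inferred value

print(round(se_p, 2), round(se_m, 1))
print(tuple(round(x, 1) for x in ci_mean))
```

The multiplier 2 is the usual rough approximation; for small samples a value from t tables would give a slightly wider interval.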
Differences
If interested in differences between subgroups we can similarly estimate the
difference and find a standard error of the estimate.
Difference in mean amount of beans planted by
males and females = 6.5 - 2.9
= 3.6 kg.

$$\mathrm{se}(\text{difference}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = \sqrt{\frac{9.5^2}{24} + \frac{1.3^2}{6}} = 2.0$$

95% confidence interval for difference is
3.6 ± 2 x 2.0
= (-0.4, 7.6)
The mean difference between amounts planted by males and females could be
anything between -0.4 kg and 7.6 kg.
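The same calculation for the difference between two groups, as a short Python sketch using the means, standard deviations and sample sizes quoted above (the function name `se_difference` is hypothetical):

```python
import math


def se_difference(sd1, n1, sd2, n2):
    """Standard error of the difference between two independent means."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)


diff = 6.5 - 2.9                       # males minus females, kg
se_diff = se_difference(9.5, 24, 1.3, 6)
low, high = diff - 2 * se_diff, diff + 2 * se_diff

print(f"difference {diff:.1f} kg, se {se_diff:.1f}, interval ({low:.1f}, {high:.1f})")
```

Because the interval includes zero, the data are consistent with no difference at all, but also with a difference of several kilograms.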
The level of agreement is measured by the 'significance level', explained in the examples
below.
Examples of calculations
              Male                      Female
Never hire    33 x 36 / 49 = 24.2       16 x 36 / 49 = 11.8
Hire          33 x 13 / 49 = 8.8        16 x 13 / 49 = 4.2

3.
$$\chi^2 = \frac{(24.2 - 23)^2}{24.2} + \frac{(11.8 - 13)^2}{11.8} + \frac{(8.8 - 10)^2}{8.8} + \frac{(4.2 - 3)^2}{4.2} = 0.74$$

4. If (1) is valid then the value of $\chi^2$ should be an observation from a $\chi^2_1$
distribution (chi-squared with 1 degree of freedom). Comparison with tables shows that
0.74 is not an extreme observation. A number at least as big as this would occur 39% of
the time. The significance level is p = 0.39. Hence there is no strong reason not to
believe the null hypothesis.
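The chi-squared calculation for the 2 x 2 labour table can be sketched as follows in Python. The observed counts are those used in the calculation above; the function name is hypothetical. For 1 degree of freedom the tail probability can be obtained from the normal distribution, which is why `math.erfc` is enough here; this shortcut does not apply to larger tables.

```python
import math

# rows: never hire, hire; columns: male, female
observed = [[23, 13], [10, 3]]


def chi_squared_2x2(observed):
    """Expected counts, chi-squared statistic and p-value (1 df) for a 2 x 2 table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # With 1 df, chi-squared is a squared standard normal: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return expected, stat, p_value


expected, stat, p_value = chi_squared_2x2(observed)
print([[round(e, 1) for e in row] for row in expected])
print(round(stat, 2), round(p_value, 2))
```

Using unrounded expected values gives 0.74; summing the four terms with the rounded values printed in the table gives a slightly smaller figure.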
If (1) is true, then the difference in means of 3.6 kg, scaled by its standard
error (= 2.0),

$$t = \frac{3.6}{2.0} = 1.8,$$

should be an observation from a t distribution.
4. Comparison with tables shows that 1.8 is not an extreme observation. A
difference as big as this would occur 8% of the time if (1) is true. The significance
level is p = 0.08. Hence there is not much reason not to believe the null hypothesis.
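The significance level for the t statistic can be approximated without tables. The Python sketch below integrates the t density numerically with Simpson's rule. The degrees of freedom are not stated in the guide; df = 28 (that is, 24 + 6 - 2) is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus twice the area under the density on [0, |t|]."""
    t = abs(t)
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 1 - 2 * area


t_stat = 3.6 / 2.0
p = two_sided_p(t_stat, df=28)  # df = 28 is an assumption, not from the guide
print(f"t = {t_stat:.1f}, two-sided p is about {p:.2f}")
```

The result is close to the 8% quoted above; a different choice of degrees of freedom changes the value only slightly.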
Limitations
Assumptions.
The calculations in both 4.1 and 4.2 are based on a series of assumptions. The key
ones are:
Independence. In both examples A and B we assume observations are independent.
Lack of independence is caused by:
(i)
(ii) interference between observations. This would be the case if several individuals
within the same household responded, or if data were collected at a group meeting.
Lack of bias due to non-response, interviewer effects, attempts to 'please' the
researcher etc.
Equality of variance and normal distribution (t-test). These assumptions can be
checked. In example B the data is clearly not normally distributed.
Limits to interpretation.
(1) If the result is significant we can reject the null hypothesis, and conclude
that there is a real difference in the population. If the result is not significant we have not
proved there is no difference. It is never possible to prove the null hypothesis is true (it
almost never will be!). All we can say is this study has not produced evidence to make us
disbelieve the null hypothesis.
(2) At what level of significance should the null hypothesis be rejected? 5% is
commonly used but there is absolutely no reason why it should be treated as a rigid cut off.
6% and 4% significance levels are, for all real purposes, equivalent.
(3) Whether the null hypothesis is rejected depends as much on the sample size
and precision of the study as on the 'truth' of the null hypothesis. A small, imprecise survey
will not detect a difference that could be picked up by a larger study. Maybe we just did not
collect enough data!
(4) The whole logic of significance testing and the p-value rests on what would
happen in repeated surveys of the same design, using new randomisations. Does this make
sense when we know the survey would not and cannot ever be repeated?
(5) In most analysis exercises, differences which 'look interesting' at the
exploratory stage are investigated further in the confirmatory analysis. If the tests to
perform have been selected because differences look large, all significance levels are
invalid.
(6) If a large number of tests are performed, as is often the case in analysis of a
study with many variables, then we would expect 5% of the tests to give "significant" results
at the p = 0.05 level even if all null hypotheses were true. Hence it can be difficult to
interpret the results of multiple tests.
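The multiple testing point in (6) is easy to quantify. The Python sketch below uses hypothetical function names, and the second calculation assumes the tests are independent, which tests on variables from the same survey rarely are; treat it as an order-of-magnitude guide only.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests, alpha=0.05):
    """Chance of at least one 'significant' result, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 10, 20, 50):
    print(
        n_tests,
        round(expected_false_positives(n_tests), 2),
        round(prob_at_least_one(n_tests), 2),
    )
```

With 20 tests, one spurious "significant" result is expected, and the chance of seeing at least one is roughly two in three.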
What should you do
(1) Treat the significance level p as an indication of 'strength of evidence'
against the null hypothesis, not as a Yes/No decision maker.
(2) Concentrate on estimating the size of differences, rather than just testing
whether they exist. Confidence intervals for differences will be much more useful than
hypothesis tests.
At the end of every significance test apply the SO WHAT? test. Ask yourself 'So
what?'. Has the significance test really improved your understanding of the situation
and helped you take a rational decision for future action? If not, forget it, and get on
with something more useful.
The example used here is rather artificial. It examines the proposition that the amount of
beans harvested in 94a depends only on land area.
[Figure: scatter plot of HVTOT94A (total beans harvested 94a) against LANDAREA]
* * * *   M U L T I P L E   R E G R E S S I O N   * * * *

Listwise Deletion of Missing Data
Equation Number 1    Dependent Variable..   HVTOT94A   total beans harvested 94a
Block Number  1.   Method:  Enter      LANDAREA

Multiple R             .54425
R Square               .29621
Adjusted R Square      .28057
Standard Error       29.01659

Analysis of Variance
                DF      Sum of Squares      Mean Square
Regression       1         15946.10384      15946.10384
Residual        45         37888.31105        841.96247

F =   18.93921       Signif F =  .0001

Variable               B          SE B        Beta         T
LANDAREA        8.200238      1.884280     .544249     4.352
(Constant)     -2.863844      6.051297                 -.473
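Most of the numbers in the regression output are connected by simple arithmetic, and checking those connections is a good way to learn what each one means. The Python sketch below starts from the sums of squares, degrees of freedom, slope and standard error printed in the output; the variable names are hypothetical, and the interval for the slope uses the rough "plus or minus 2 standard errors" rule rather than anything SPSS prints.

```python
import math

# Values copied from the SPSS output.
ss_regression, ss_residual = 15946.10384, 37888.31105
df_regression, df_residual = 1, 45
slope, se_slope = 8.200238, 1.884280

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total                  # proportion of variation explained
adj_r_squared = 1 - (1 - r_squared) * (df_regression + df_residual) / df_residual
ms_residual = ss_residual / df_residual
residual_se = math.sqrt(ms_residual)                  # 'Standard Error' in the output
f_stat = (ss_regression / df_regression) / ms_residual
t_slope = slope / se_slope                            # with one x-variable, t squared equals F

# Rough 95% interval for the slope (estimate plus or minus 2 standard errors).
slope_interval = (slope - 2 * se_slope, slope + 2 * se_slope)

print(round(r_squared, 3), round(adj_r_squared, 3), round(residual_se, 1))
print(round(f_stat, 2), round(t_slope, 2), round(t_slope ** 2, 2))
print(tuple(round(x, 1) for x in slope_interval))
```

Although the slope is clearly different from zero, R Square shows that land area accounts for only about 30% of the variation in beans harvested, and the residual standard error of about 29 is large compared with typical harvests.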
Interpretation
Significance does not tell you whether the fitted model is logically sound or if it fits
the data well.
Significance does not tell you whether the model is useful in explaining or
describing a relationship, or if the relationship has much predictive power.
A regression model derived from survey data cannot tell you what would happen
when an x-variable is changed. For example, we cannot use it to predict the bean
harvest of a farmer whose land holding changes.
Existence of a regression relationship between two variables does not mean there is a
causal relationship.
Regression relationships become useful when similar relationships are found in a number
of different conditions. Look for significant sameness between regions, crops, farm
types, etc.
and influential points. Multiple regression analysis will not be successful if these are
not understood.
Stepwise and similar variable selection techniques, so loved by social scientists,
have little theoretical basis and can produce answers which are very poor. Regression
modeling will be most successful if understanding of the underlying processes is
used to choose possible models, rather than relying on computer algorithms.
The sample size required for multiple regression analysis depends on the
configuration of the data (in particular the range of the x-variables and correlations
among them). The required sample size quickly becomes large as the number of
x-variables increases. If regression analysis is part of the principal objectives of the
survey, it might be possible to select the sample in a way that makes the analysis
more efficient.
[Figure: raw residuals plotted against HHTYPE2]
Interpretation
Interpret results. This does not mean understand which effects are significant but
understand and communicate what you now know about the problem. You should be
able to:
References
Coe R (2002) Steps in Survey Analysis. Nairobi: ICRAF. 15 pp.
SSC (2001) Approaches to analysis of survey data. Reading: Statistical Services Centre. 28 pp.