
Introduction to Statistical Research Methods (829N1)
Lecture 2: Samples & Data

Maria Savona
SPRU (Science Policy Research Unit)

Basics on sampling

From last week's lecture

Mr. Smith owns a company and wants to know the tastes of his customers. Would he interview all of them?

I need to test the quality of my products. Should I test all of them? How many?

A pharmaceutical company has developed a revolutionary drug that is supposed to reduce patients' fever. How do they test whether it works?

Sampling

What is sampling?
In statistics and survey methodology, sampling is concerned with the
selection of a subset of individuals from within a statistical population to
estimate characteristics of the whole population.

Diagram: a sample is drawn from the population; inference runs from the sample back to the population.

Why do we need sampling?

It has some advantages: it's faster, cheaper and at times the only feasible choice.
It allows us to estimate general features of a population accurately using data collected from a tiny fraction of the total.

How do we sample from a population?

Sampling process
1. Define the population
2. Specify a sampling frame
3. Specify a sampling method
4. Determine the sample size
5. Collect the data

1) Defining the population

The population is the set of all possible cases of interest (e.g. firms, people, students, countries, patents, etc.).

A sample requires a clearly defined population from which to draw the sample.

This requires conceptual clarity.

e.g. consider the question: How many hours of study do students have each week? But who is the student population?
Anyone who sees themselves as a student?
Or restricted to those registered in higher education?
Anywhere in the UK? Restricted to Sussex?
Full time only? Home and overseas? Etc.

2) Specifying a sampling frame

Once the population of interest is defined, we have to specify how we access that population.

A sampling frame is the source from which a sample is drawn. It is a list of all those within a population who can be sampled.

Population: Sussex University students
Sampling frame: List of registered students at Sussex University

Population: All the firms based in London working in the pharmaceutical sector
Sampling frame: List of firms located in London from Companies House, selecting those companies with specific SIC (Standard Industrial Classification) codes

Population: All the patents owned by a company which are about semiconductors
Sampling frame: Patent database from WIPO, selecting the patents which have specific IPC (International Patent Classification) codes

Figure: target population, sampling frame, and sample (drawn from the sampling frame).

2) Specifying a sampling frame

The definition of the sampling frame is very important: errors in the sampling frame affect the representativeness of the sample.

Sometimes the sampling frame does not completely match the population;
e.g. if we conduct a household survey on poverty in the Brighton area, our sampling frame misses homeless people. This introduces bias, because homeless people are disproportionately poor and are not included in the sampling frame.

Other potential issues with sampling frames:


Missing elements: some members of the population are not included in the
sampling frame
Foreign elements: the sampling frame includes some non-members of the
population
Duplicate elements: some elements could be included more than once
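The three issues above can be illustrated with set operations. This is a toy sketch only: it assumes we know the true population list, which in practice we usually do not, and the identifiers and the function name `compare_frame_to_population` are hypothetical.

```python
def compare_frame_to_population(population_ids, frame_ids):
    """Report missing, foreign and duplicate elements of a sampling frame."""
    population = set(population_ids)
    frame = set(frame_ids)

    # A unit is a duplicate if it appears more than once in the frame list.
    seen = set()
    duplicates = set()
    for unit in frame_ids:
        if unit in seen:
            duplicates.add(unit)
        seen.add(unit)

    return {
        "missing": population - frame,   # in the population, not in the frame
        "foreign": frame - population,   # in the frame, not in the population
        "duplicates": duplicates,        # listed more than once in the frame
    }


# Hypothetical identifiers, for illustration only.
population_ids = ["s01", "s02", "s03", "s04", "s05"]
frame_ids = ["s01", "s02", "s02", "s04", "x99"]

issues = compare_frame_to_population(population_ids, frame_ids)
print(issues)
```

Here s03 and s05 are missing elements, x99 is a foreign element and s02 is a duplicate element.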

3) Specifying a sampling method

The sampling method is the way in which the sample units are to be selected.

Probability sampling: the probability of extraction of a population unit is known. It allows inference from the sample to the population.
a) Simple random sampling
b) Systematic sampling
c) Stratified sampling

Non-probability sampling: these methods do not allow statistical inference, but they are still informative and allow some kind of generalization.
d) Quota sampling
e) Convenience sampling
f) Snowball sampling

a) Simple random sampling

A subset of individuals (a sample) is chosen from a larger set (a population). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process.

Imagine a school with 1000 students from which you want to select 100 for further study. You could select them randomly by pulling their names out of a hat. This random process means that each student has an equal chance (or probability) of being selected.
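The hat example can be sketched in a few lines of Python. The student identifiers and the fixed seed are assumptions made only so the sketch is reproducible; `random.sample` draws without replacement, so no student is picked twice.

```python
import random

# Hypothetical sampling frame: 1000 student identifiers.
students = [f"student_{i:04d}" for i in range(1, 1001)]

# A seeded generator makes the draw reproducible (an assumption for the demo).
rng = random.Random(42)

# Draw 100 students without replacement; every student is equally likely.
sample = rng.sample(students, k=100)

print(len(sample), sample[:5])
```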

b) Systematic sampling

The population (sampling frame) units are sorted according to some characteristics.
The first sample unit is extracted randomly.
The other units are extracted at a fixed step: every nth unit after the first.

Figure: list of all postgraduates in social sciences (A, B, C, D, E, F, G, ...). You start at a random point and pick every nth case from the list to form the sample.
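A minimal sketch of systematic sampling, assuming the frame is already sorted and that the step is the frame size divided by the sample size (rounded down). The function name, identifiers and seed are hypothetical.

```python
import random


def systematic_sample(frame, sample_size, rng):
    """Pick a random start, then take every step-th unit from the frame."""
    if not 0 < sample_size <= len(frame):
        raise ValueError("sample_size must be between 1 and len(frame)")
    step = len(frame) // sample_size      # the "n" in "every nth case"
    start = rng.randrange(step)           # random starting point within the first step
    return frame[start::step][:sample_size]


# Hypothetical sorted list of 200 postgraduates.
frame = [f"postgrad_{i:03d}" for i in range(200)]
sample = systematic_sample(frame, 20, random.Random(1))

print(sample[:3])
```

With 200 units and a sample of 20 the step is 10, so the sample contains one unit from each consecutive block of 10.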

c) Stratified sampling

The sampling frame is divided into sub-groups (strata) with respect to some relevant population characteristics (e.g. gender, age, etc.), so that units are relatively similar within each stratum and different across strata.

Stratified sampling consists of performing simple random sampling within each stratum, so that all the sub-groups are adequately represented.

For example, suppose we have a population of 1000 people, 500 males and 500 females, and we would like to extract a sample of 100 people stratified by gender. Our sample will include 50 randomly selected males and 50 randomly selected females.
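The gender example can be sketched as a simple random sample drawn separately within each stratum. The identifiers, the seed and the function name `stratified_sample` are hypothetical.

```python
import random


def stratified_sample(strata, sizes, rng):
    """Draw a simple random sample of the requested size within each stratum."""
    return {name: rng.sample(units, sizes[name]) for name, units in strata.items()}


# Hypothetical population of 1000 people: 500 males and 500 females.
strata = {
    "male": [f"m{i}" for i in range(500)],
    "female": [f"f{i}" for i in range(500)],
}

sample = stratified_sample(strata, {"male": 50, "female": 50}, random.Random(7))

print({name: len(units) for name, units in sample.items()})
```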

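c) Stratified sampling

Figure: list of all social science postgraduates sorted by department (SPRU, Sociology, Geography, History, etc.). A separate random sample is drawn from each department (stratum) to form the sample.

How many from each stratum?

$$n_h = n \cdot \frac{N_h}{\sum_h N_h}$$

where $N_h$ is the number of students in stratum $h$, $n$ is the total sample size and $n_h$ is the number sampled from stratum $h$.

Total number of students: N = 1000
Sample size: n = 50
Number of students at SPRU: 80

$$n_{SPRU} = n \cdot \frac{N_{SPRU}}{N_{Total}} = 50 \cdot \frac{80}{1000} = 4$$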
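The proportional allocation rule can be checked with a short sketch. Only the SPRU figure (80 out of 1000, sample of 50) comes from the slide; the sizes of the other departments are hypothetical values chosen so that the total is 1000.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate the sample to strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    # Rounding can make the parts sum to slightly more or less than
    # sample_size in general; the figures below divide exactly.
    return {name: round(sample_size * size / total)
            for name, size in stratum_sizes.items()}


stratum_sizes = {
    "SPRU": 80,          # from the slide
    "Sociology": 300,    # hypothetical
    "Geography": 220,    # hypothetical
    "History": 400,      # hypothetical
}

allocation = proportional_allocation(stratum_sizes, sample_size=50)
print(allocation)
```

SPRU receives 4 of the 50 sample units, matching the calculation above.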


d) Quota sampling

Probability sampling allows a precise and accurate estimation of population parameters. However, in some cases probability sampling is not possible. In these cases, non-probability sampling is common practice.

Quota sampling is similar to stratified sampling.
The population is segmented into sub-groups (strata).
Quota sampling does not need a sampling frame for each stratum, since extraction does not follow a probabilistic rule.
Quota sampling only requires that the same proportions apply to the sample. For example, an interviewer is told to sample 50 females and 50 males between the ages of 18 and 30.
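A minimal sketch of how an interviewer might fill quotas, assuming respondents are approached in whatever order they happen to turn up (which is exactly why no probabilistic rule is involved). The respondents, the small quotas and the function name are hypothetical.

```python
def quota_sample(respondents, quotas, min_age=18, max_age=30):
    """Accept respondents in arrival order until every quota is filled."""
    remaining = dict(quotas)
    selected = []
    for person in respondents:
        group = person["gender"]
        eligible = min_age <= person["age"] <= max_age
        if eligible and remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # all quotas are full, stop interviewing
    return selected


# Hypothetical passers-by, in the order the interviewer meets them.
respondents = [
    {"name": "A", "gender": "female", "age": 22},
    {"name": "B", "gender": "female", "age": 45},  # outside the age range
    {"name": "C", "gender": "male", "age": 19},
    {"name": "D", "gender": "female", "age": 30},
    {"name": "E", "gender": "female", "age": 25},  # female quota already full
    {"name": "F", "gender": "male", "age": 28},
    {"name": "G", "gender": "male", "age": 21},    # never reached
]

selected = quota_sample(respondents, {"female": 2, "male": 2})
print([p["name"] for p in selected])
```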


e) Availability (convenience) sampling

As the name suggests, the sample is simply made up of those who are easy to find, for example:

a local factory to provide a sample of workers
this class to provide a sample of students
a bus station near my house to provide a sample of users of public transport
university students used to provide a sample of consumers

Some studies do not need a representative sample because the aim is to show a methodology or to test a theory.

f) Snowball sampling

Uses initial respondents to contact new respondents.

A useful option where relevant contacts are hard to identify, e.g. where the roles and responsibilities of different personnel in a company are unclear, or where you're interested in surveying a minority population but do not know how to locate more than one or two members of that population.
It depends on whether members of these populations know other members.
Again, it does not guarantee a representative sample.
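Snowball sampling can be sketched as following referrals outward from the initial respondents. The referral network, the names and the function `snowball_sample` are hypothetical; a breadth-first traversal is one reasonable way to model it, not the only one.

```python
from collections import deque


def snowball_sample(seeds, referrals, max_size):
    """Follow referrals outward from the initial respondents (breadth-first)."""
    sampled = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue and len(sampled) < max_size:
        person = queue.popleft()
        sampled.append(person)
        # Each respondent names the contacts they know in the population.
        for contact in referrals.get(person, []):
            if contact not in seen:
                seen.add(contact)
                queue.append(contact)
    return sampled


# Hypothetical referral network: who names whom.
referrals = {
    "Ana": ["Ben", "Caz"],
    "Ben": ["Ana", "Dev"],
    "Caz": ["Eli"],
    "Dev": [],
    "Eli": ["Ben"],
    "Fay": ["Ana"],  # nobody names Fay, so she can never be reached
}

sample = snowball_sample(["Ana"], referrals, max_size=10)
print(sample)
```

Fay belongs to the population but is never sampled because no respondent knows her, which illustrates why the method does not guarantee a representative sample.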

4) Determining the sample size

Size matters: more is better!
Suppose this class is my population (54 students) and I want to find out the mean amount of money in your pockets.
Let's say that on average you have £5 per person; obviously some will have more, some less.
If I ask 2, I may well pick 2 with nothing, or 2 with £20.
If I ask 10, the values are more likely to balance out and give a mean closer to the true value.
If I ask 20, even more likely.

However, working out the appropriate sample size depends on other factors, such as the precision level required and budget constraints.
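The pocket-money argument can be checked with a small simulation. The amounts below are made up so that the class of 54 has a mean of exactly 5; the number of trials and the seed are also assumptions.

```python
import random
import statistics


def mean_abs_error(population, sample_size, n_trials, rng):
    """Average distance between the sample mean and the true mean."""
    true_mean = statistics.mean(population)
    errors = []
    for _ in range(n_trials):
        sample = rng.sample(population, sample_size)
        errors.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(errors)


# Hypothetical class of 54 students: amounts in pockets, mean = 5.
pockets = [0] * 18 + [5] * 26 + [10] * 6 + [20] * 4

rng = random.Random(0)
errors_by_size = {n: mean_abs_error(pockets, n, 2000, rng) for n in (2, 10, 20)}

for n, err in errors_by_size.items():
    print(f"sample size {n:2d}: typical error of the sample mean = {err:.2f}")
```

The typical error shrinks as the sample size grows from 2 to 10 to 20.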


4) Determining the sample size

With a sample of 30 units and a population of 100 units we commit an 11.75% error.
With the same sample size (30) and a population of 1000, the error increases to 16.28%.
With a sample of 500 units and a population of 1000, the error is only 1.98%.
With a sample of 1000 units, the error is below 3% even with a population of 100 million units.

Household expenditure on pizza (with true population mean = $20, standard deviation = 9, significance level 5%, i.e. 95% confidence). Source: adapted from Mazzocchi 2008, Box 5.4, p. 116.
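The pattern in these figures (sample size matters far more than population size) can be explored with a rough sketch. It uses a textbook normal-approximation margin of error with a finite population correction, which is an assumption here: it is not necessarily the calculation behind the percentages quoted above, so its values will not match them exactly.

```python
import math
from statistics import NormalDist


def relative_margin_of_error(n, N, mean, sd, confidence=0.95):
    """Approximate margin of error of a sample mean, as a share of the mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    fpc = math.sqrt((N - n) / (N - 1))                  # finite population correction
    return z * sd / math.sqrt(n) * fpc / mean


# Pizza example parameters from the slide: mean = 20, standard deviation = 9.
for n, N in [(30, 100), (30, 1000), (500, 1000), (1000, 100_000_000)]:
    err = relative_margin_of_error(n, N, mean=20, sd=9)
    print(f"n = {n:5d}, N = {N:11,d}: relative error = {err:.2%}")
```

Under this approximation a sample of 1000 still gives an error below 3% for a population of 100 million, while a sample of 30 gives a much larger error whatever the population size.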

Sampling and non-sampling error


An estimate based on a sample can differ from the true population figure because of:

Sampling error: the random chance involved in sampling. Two main factors affect the amount of sampling error:
a) The size of the sample: increasing the sample size reduces sampling error
b) The amount of variation in the population in the characteristic being measured (age, income, etc.): the more variation, the greater the sampling error (for any given sample size)

Non-sampling error: errors arising from all other aspects of the procedure
a) Poorly designed sampling frame
b) Measurement errors during fieldwork
c) Systematic non-response
d) Systematic attrition
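Both factors behind sampling error appear in the usual standard error of a sample mean under simple random sampling. This is a standard result added here as a supplement; it ignores the finite population correction.

$$SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}$$

Here $\sigma$ is the standard deviation of the characteristic in the population and $n$ is the sample size. A larger $n$ reduces the standard error, and a larger $\sigma$ increases it. For example, with $\sigma = 9$, a sample of 100 gives $9 / \sqrt{100} = 0.9$, while a sample of 400 gives $9 / \sqrt{400} = 0.45$: quadrupling the sample size halves the standard error.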


Data...!


What type of data?

Cross-sectional data

Longitudinal data
Time series
Panel data


Variables
A variable is a condition or a quality that can differ from one case to another.
Conceptual definition: the literal or general definition of the variable
Operational definition: specifies the criteria for taking a measurement of that variable
For example:
We want to measure firms' innovativeness
We define innovativeness as the capacity to produce new inventions
We measure innovativeness by taking the number of patents of the firm

Scales of measurement

The scale of measurement specifies a range of values that the variable can take.
Discrete versus continuous
Discrete data means that there is a finite number of values within a specified range (e.g. number of children per household: 1, 2, 3, 4, 5).
Continuous data means that there are infinitely many values within a specified range (e.g. age: between 3 and 4 there is 3 years and 6 months; 3 years, 6 months and 2 days; etc.).

4 levels of measurement (in order of increasing precision and meaning)

Nominal measures use numbers simply as labels for different values (e.g. Female=1, Male=2; or Bus=1, Train=2, Car=3).

Ordinal measures are like nominal ones in that they too use numbers simply as labels, but in this case a higher number does indicate more and a lower number less (e.g. How often do you smoke? Never=1, Sometimes=2, Frequently=3, Very often=4).

Interval/ratio scales are those that allow us to say by how much a case is better or stronger than another.
Interval scales measure the order of data points and the size of the intervals between data points.
Ratio scales are interval scales with a true zero point.
Interval scales can have an arbitrary zero reference point. For example, 0 age and 0 income mean no age and no income (i.e. the zero reference point is non-arbitrary), while 0°C only indicates the point at which water freezes; it does not mean no heat at all! However, this distinction is not relevant for the kind of analysis we will carry out.

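Some more examples

Variable name: Marital status
Type: Nominal
Values and value labels: 1 = Single, never married; 2 = Married; 3 = Separated; 4 = Divorced; 5 = Widowed

Variable name: Duration of unemployment
Type: Ordinal
Values and value labels: 1 = < 3 months; 2 = 3-5 months; 3 = 6-11 months; 4 = 1 year but less than 2; 5 = 2 years but less than 3

Variable name: Age
Type: Ratio scale
Values: 0 to 95+
Value labels: None

Variable name: Firms' sales
Type: Ratio scale
Value labels: None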

*** The level of measurement of a variable is important because it determines the techniques that you can use to analyse it ***
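One common illustration of this point is the choice of an average: the mode for nominal data, the median for ordinal data and the mean for interval/ratio data. The sketch below applies that convention; the data values and the names `SUMMARY_BY_LEVEL` and `summarise` are hypothetical.

```python
import statistics

# Conventional measure of central tendency for each level of measurement.
SUMMARY_BY_LEVEL = {
    "nominal": statistics.mode,           # most frequent category
    "ordinal": statistics.median,         # middle value of the ordered codes
    "interval/ratio": statistics.mean,    # arithmetic average
}


def summarise(values, level):
    """Summarise a variable with the statistic suited to its level."""
    return SUMMARY_BY_LEVEL[level](values)


# Hypothetical data, coded as in the examples above.
marital_status = [1, 2, 2, 4, 2, 5, 1]    # nominal codes
smoking_frequency = [1, 1, 2, 3, 4]       # ordinal codes (Never=1 ... Very often=4)
age = [23, 31, 27, 45, 39]                # ratio scale, in years

print(summarise(marital_status, "nominal"))
print(summarise(smoking_frequency, "ordinal"))
print(summarise(age, "interval/ratio"))
```

Taking the mean of the marital status codes would run without error, but the result would be meaningless because the codes are only labels.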

How does SPSS classify data?


Before next week

Read Introduction to SPSS, available on Study Direct

Download and install SPSS by following the instructions available on this page:
http://www.sussex.ac.uk/its/services/software/owncomputer
