

SAMPLING

Sampling

Population and Sample

Aim: We want to learn about the average national expenditure on food in India.
What should we ideally do?

Problem: We have to do it in a month, and we don't have many people at hand. What should we do?

Glossary of Important Terms


Subject:

Population:
Sample:

Parameter:
Statistic:

Sampling unit:

Sampling design:


When do we use sampling?

Complete population is not tractable: We want to know how long the batteries of a particular brand last until they need to be replaced.

Considering the whole population is not practical: We want to conduct an opinion poll about the preference of the nation for the next prime minister.

Considerations in Sampling
How to sample if the survey is to seek opinion on:
Should smoking be banned in public places?
Is IIM Ahmedabad the top B-School in India?

Need to decide what is
1) the right population
2) the right sample
3) the right sampling scheme
4) the right sample size ...

Statistical Studies

Controlled experiments: samples are generated
Observational studies: samples are collected
In this class, we are mostly interested in observational studies.

Simple Random Sample

A sample selected in such a way that every unit in the population has an equal chance of being included in the sample.

Can be chosen with or without replacement.
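
The two variants can be sketched in a few lines of Python. This is only an illustration: the population of 100 labelled units, the function name `simple_random_sample` and the seed are assumptions, not part of the slides.

```python
import random

def simple_random_sample(population, n, replace=False, seed=None):
    """Draw n units from population, each unit equally likely on every draw."""
    rng = random.Random(seed)
    if replace:
        # With replacement (WR): the same unit may be drawn more than once.
        return rng.choices(population, k=n)
    # Without replacement (WOR): all drawn units are distinct.
    return rng.sample(population, n)

units = list(range(1, 101))  # hypothetical population of 100 labelled units
wor = simple_random_sample(units, 10, replace=False, seed=1)
wr = simple_random_sample(units, 10, replace=True, seed=1)
print(wor)
print(wr)
```

A WOR sample never repeats a unit, whereas a WR sample may.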

Finite population: Probability of Inclusion in a Sample

Example: A club is selecting two of its officers to attend the club's annual conference.

Five officers: President (P), Vice-President (V), Secretary (S), Treasurer (T) and Activity Coordinator (A).
To select a simple random sample of size 2:

The 5 names are written on identical slips of paper, placed in a hat, mixed, and then a neutral person blindly selects two slips from the hat. This is the sampling design.

a) How many samples are possible? List all possible samples.
b) Does each officer have an equal chance?
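
One way to check both answers is to enumerate the samples. The sketch below assumes that the hat design makes every unordered pair of officers equally likely.

```python
from itertools import combinations

officers = ["P", "V", "S", "T", "A"]

# a) Every unordered pair of distinct officers is one possible sample.
samples = list(combinations(officers, 2))
print(len(samples))   # 10
print(samples)

# b) Assumption: all samples are equally likely, so an officer's inclusion
#    probability is the share of samples that contain that officer.
inclusion = {o: sum(o in s for s in samples) / len(samples) for o in officers}
print(inclusion)      # 0.4 for every officer
```

Each officer appears in 4 of the 10 samples, so every inclusion probability is 4/10 = 2/5 = n/N.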

Random number generation using statistical software

In Excel, use RAND or RANDBETWEEN.
Use statistical software: Minitab, R, SAS, SPSS.
FREE Software!

Random number generation using websites

Very Large Population

Think of sampling from a very large population (examples: an entire nation, visitors to a public place, etc.).

Sampling with replacement (WR) and without replacement (WOR) are practically equivalent: being selected twice is almost impossible. The population can be considered infinite for all practical purposes.

The simple random sampling scheme for such an infinite population is defined as follows:
Each element of the sample should come from the population.
The elements are selected independently of each other.

Other Sampling Schemes: Clustered Sampling

Example: What could be a quick (but valid) way to estimate the average household income of this class just before joining?

To ease the burden of sampling while choosing samples for monitoring student income for different types of students, one may just choose the syndicates as clusters and sample them as a whole.

Cluster Sampling

Useful when a sampling frame consisting of ALL the population units is hard to come by.

The population is divided into a large number of clusters, and a simple random sample of a pre-specified number of clusters is selected.
The cluster sample consists of all the units in those clusters.
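
The two steps can be sketched as follows. The syndicate labels, student names, seed and the helper `cluster_sample` are hypothetical and only illustrate the mechanics.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters: dict mapping a cluster label to the list of its units."""
    rng = random.Random(seed)
    # Step 1: simple random sample of cluster labels (without replacement).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Step 2: keep every unit of each chosen cluster.
    units = [u for label in chosen for u in clusters[label]]
    return chosen, units

# Hypothetical data: 6 syndicates of 5 students each.
syndicates = {f"S{k}": [f"S{k}-student{j}" for j in range(1, 6)] for k in range(1, 7)}
chosen, units = cluster_sample(syndicates, 2, seed=7)
print(chosen)
print(len(units))  # 10: all 5 students from each of the 2 chosen syndicates
```

Only the list of clusters is needed up front. A frame of individual units is required only for the clusters that were drawn.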

Assumptions in Cluster Sampling

1. The spread of data inside each cluster is heterogeneous and similar to that of the population.
2. The clusters are similar entities.

Examples:
1. When sampling for age distribution, households can be
considered as clusters. (But not for income distribution.)
2. Each PGPX syndicate is a cluster as far as last salary,
age etc. are concerned.

Other Sampling Schemes: Stratified Sampling

Examples:

You want to estimate the mean number of hours a week that regular IIMA students spend in the library, and also how it compares among PGP I, PGP II, PGPX and FPM.

You want to estimate the average income in a city.
To get a proper idea of the income level of a city, one should sample over various income strata.

You want to study the prevalence of pneumonia in Ahmedabad this year.
Disease prevalence/mortality rate is more relevant with age-specific information.

Stratified Sampling

We divide the population into separate groups, or strata, and select a simple random sample from each stratum.

Useful to ensure minority representation.

Examples:

To get a proper idea of the income level of a city, sampling over various income strata ensures high-income groups are represented in the sample.
Disease prevalence/mortality rate is more relevant with age-specific information: we ensure all ages are covered by using age-based strata.
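
A minimal sketch of the scheme, assuming a hypothetical city of 1000 households already grouped into three income strata. The stratum sizes, the 2% sampling fraction and the helper `stratified_sample` are illustrative assumptions.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """strata: dict label -> list of units; n_per_stratum: dict label -> sample size."""
    rng = random.Random(seed)
    # An independent simple random sample (without replacement) inside every stratum.
    return {label: rng.sample(units, n_per_stratum[label]) for label, units in strata.items()}

# Hypothetical city: household IDs grouped by income stratum.
strata = {
    "low": list(range(0, 700)),
    "middle": list(range(700, 950)),
    "high": list(range(950, 1000)),
}
sizes = {"low": 14, "middle": 5, "high": 1}  # assumed allocation: 2% of each stratum
sample = stratified_sample(strata, sizes, seed=3)
print({label: len(chosen) for label, chosen in sample.items()})
```

Because a draw is made inside every stratum, the small high-income group cannot be missed, which a plain simple random sample of 20 households would not guarantee.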

Differences between cluster and stratified sampling

In cluster sampling the sampling units are clusters; in stratified sampling the sampling units are individuals inside each stratum.
In a cluster sample we have all members of some clusters. In a stratified sample we have some members from each stratum.
Clusters have more variability within than between.
Strata have less variability within than between.

Volunteer/Convenience Samples

Carried out using easily obtainable samples, mostly volunteers.
Come with selection bias, either on the part of the person conducting the survey or because of self-selection of volunteers.
Examples: Any internet survey, telephone survey etc.
Whenever we conduct an internet survey on whether people support child marriage, the responses are overwhelmingly negative. However, the real picture is completely different.
Why?
Bias! What are the biases here?

Necessary Evil?


Reliable study or ethical study?


Recent newspaper reports suggest that a number of women
from the lowest income strata of the Mumbai slums died of
cervical cancer because they were placed in the control arm of a
study about cervical cancer screening funded by some US drug
company without informed consent:
Report 1 (http://timesofindia.indiatimes.com/india/Clinical-trials-Supreme-Court-asks-details-on-deaths-from-Centre/articleshow/31814283.cms)
Report 2 (http://timesofindia.indiatimes.com/india/Row-over-clinical-trial-as-254-Indian-women-die/articleshow/34016785.cms)
As a statistical study, such a study will produce much more
reliable results
There are obvious humanitarian and ethical issues
A (somewhat) convenience sample is the only choice here

SAMPLING DISTRIBUTIONS

Idea

The sample characteristics are summarized using the sample statistics.
The characteristics of a population are summarized by the
population parameters.
Statistical inference boils down to estimating a population
parameter using analogous sample statistic.
Examples:
Sample mean estimates population mean
Sample variance (SD) estimates population variance (SD)
Sample proportion estimates population proportion

Basic Idea: Point Estimation

We resort to sampling when we cannot get hold of the entire population, and hence don't know its characteristics (parameters): mean, SD, proportion etc.
The parameters are estimated through sample statistics:
mean, SD, proportion etc.
You can think of this technique as extrapolation of
sample properties to the population.
This technique is known as (point) estimation.
The sample statistic is called the (point) estimator of the
parameter.

Sampling Distribution

Why Bother About the Sample Mean?

Point Estimation: finding a good proxy of the population parameters through sample statistics.

Mean/average is perhaps the most important example.
Note:
1. We restrict ourselves to simple random samples (SRS).
2. Most of our inferences will be for SRS with replacement (SRSWR). In such a situation the samples are all independent, and they have identical distributions. Hence we refer to such samples as independent and identically distributed (IID) samples. So SRSWR implies IID.

Example

There are 5 people in an office. The ages of the 5 people are: 25, 32, 31, 36 and 54.

Two of them are chosen at random with replacement.
What are the possible values of the sample mean?
What is the distribution of the sample mean?

Note: Here the sample size is n = 2. As we select more and more times, n increases. (As we are selecting WR, we can select as many as we want.)

Repeat if the sampling is done without replacement. (Exercise)
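
The with-replacement case can be worked out by brute-force enumeration. The sketch assumes that each ordered pair of draws is equally likely, and uses exact fractions to avoid rounding.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

ages = [25, 32, 31, 36, 54]
n = 2

# With replacement, every ordered pair (first draw, second draw) is equally likely.
samples = list(product(ages, repeat=n))
counts = Counter(Fraction(sum(s), n) for s in samples)
dist = {mean: Fraction(c, len(samples)) for mean, c in sorted(counts.items())}

for mean, prob in dist.items():
    print(float(mean), prob)

# The sampling distribution is centred at the population mean.
print(float(sum(m * p for m, p in dist.items())))  # 35.6
```

For the without-replacement exercise, replacing `product(ages, repeat=n)` with `itertools.permutations(ages, n)` enumerates the 20 equally likely ordered samples instead of 25.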

Why Does the Sample Mean Have a Distribution?

Samples (in most cases, hopefully) are random variables.
The sample mean is therefore a mean of random variables, and hence itself a random variable.

Caution: samples are not random variables any more after we have observed them!

Sample Mean (and SD)

For a simple random sample $X_1, \ldots, X_n$ from some population, the sample mean is given by

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i .$$

The sample standard deviation is given by

$$S_X = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} .$$

Dividing by $n-1$ rather than $n$ ensures that the expected value of the sample variance $S_X^2$ is equal to the population variance.
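
As a quick numerical illustration, the two formulas can be evaluated directly. The data are the five ages from the office example, treated here (as an assumption) as an observed sample rather than as the population.

```python
import math
import statistics

sample = [25, 32, 31, 36, 54]  # assumption: the office ages, viewed as an observed sample
n = len(sample)

x_bar = sum(sample) / n
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

print(x_bar)  # 35.6
print(s_x)
# The standard library's statistics.stdev uses the same n - 1 divisor.
print(math.isclose(s_x, statistics.stdev(sample)))  # True
```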

Law of large numbers for IID Samples

As the sample size n gets larger, the sample mean converges to the population mean with probability 1.

This provides the logic behind the simulation exercises we perform: the larger the number of simulations, the more accurate the values we get.
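
A small simulation makes the statement concrete. The fair six-sided die (population mean 3.5), the seed and the sample sizes are illustrative assumptions.

```python
import random

rng = random.Random(0)  # fixed seed only so that the run is reproducible

def mean_of_die_rolls(n):
    """Sample mean of n IID rolls of a fair six-sided die (population mean 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    print(n, mean_of_die_rolls(n))
```

The printed means should settle near 3.5 as n grows, although any single run with small n can be noticeably off.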

Simulation Example

In a bidding process, there are two bidders. Suppose the first bidder is expected to bid Rs. 10,000 on average with an SD of Rs. 1500, whereas the second bidder is expected to bid around Rs. 9000 with an SD of Rs. 800.
If the bids are assumed to be normally distributed and independent, what is the probability that the first bidder wins the bid?
How does simulation help? What is the rationale?
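
One possible simulation is sketched below. It assumes that the higher bid wins, which the slide does not state. The number of simulations and the seed are also arbitrary choices. By the law of large numbers, the fraction of simulated wins approaches the true probability.

```python
import random
from statistics import NormalDist

def prob_first_bidder_wins(n_sims=200_000, seed=0):
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        bid1 = rng.gauss(10_000, 1_500)
        bid2 = rng.gauss(9_000, 800)
        wins += bid1 > bid2   # assumption: the higher bid wins
    return wins / n_sims

estimate = prob_first_bidder_wins()
# Cross-check: bid1 - bid2 is normal with mean 1000 and SD sqrt(1500^2 + 800^2) = 1700.
exact = 1 - NormalDist(1_000, 1_700).cdf(0)
print(estimate, exact)
```

Under these assumptions both numbers should be close to 0.72. If the lower bid wins instead, the answer is the complement.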

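Expectation and Variance of Sample Mean

Notice that the sample mean changes with every sample, and hence it is a random variable: it has an expectation and a variance.

For both SRSWR and SRSWOR, $E(\bar{X}) = \mu$, where $\mu$ is the population mean.

Variance of the sample mean (where the population variance is $\sigma_X^2$):

For WR:
$$\mathrm{Var}(\bar{X}) = \frac{\sigma_X^2}{n}, \qquad \mathrm{SD}(\bar{X}) = \frac{\sigma_X}{\sqrt{n}} .$$

For WOR:
$$\mathrm{Var}(\bar{X}) = \frac{\sigma_X^2}{n} \cdot \frac{N-n}{N-1}, \qquad \mathrm{SD}(\bar{X}) = \frac{\sigma_X}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} ,$$
where N is the population size.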
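
Both variance formulas can be checked exactly on a small population by listing every possible sample. The sketch reuses the five office ages as the whole population (an illustrative choice) with n = 2.

```python
from itertools import permutations, product
from statistics import mean, pvariance

ages = [25, 32, 31, 36, 54]  # assumption: the office example is the whole population
N, n = len(ages), 2
sigma2 = pvariance(ages)     # population variance (divisor N)

# Enumerate every equally likely ordered sample and collect its mean.
means_wr = [mean(s) for s in product(ages, repeat=n)]
means_wor = [mean(s) for s in permutations(ages, n)]

var_wr, var_wor = pvariance(means_wr), pvariance(means_wor)
print(var_wr, sigma2 / n)                          # WR
print(var_wor, sigma2 / n * (N - n) / (N - 1))     # WOR
```

Each printed pair should agree up to floating-point rounding, and the WOR variance is smaller by the factor (N - n)/(N - 1) = 3/4.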

Finite Population Correction Factor

The number $\frac{N-n}{N-1}$ in the expression of the variance is called the finite population correction factor, which is close to 1 if $N \gg n$.
Hence, we typically ignore it if $\frac{n}{N} \le 0.05$.

Sampling Distribution of Sample Mean

The sampling distribution of the sample mean is given by the probability distribution of the values the sample mean can take.

Example

There are 3 eateries in a locality: A, B and C.
A student goes to these eateries with probabilities 60%, 20% and 20% respectively.
He chooses one eatery every day independently of his previous decisions.
His favourite dish costs 100 in A, 140 in B and 150 in C. He only eats his favourite dish.
a) Obtain the distribution of the student's daily expenditure on lunch.
b) Obtain the distribution of the student's average expenditure on lunch over two days.
c) Obtain the distribution of the student's average expenditure on lunch over 30 days.
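
Parts b) and c) can be computed exactly by repeatedly combining the daily distribution with itself. The helper names below are hypothetical, and exact fractions are used so that the probabilities add up to exactly 1.

```python
from collections import defaultdict
from fractions import Fraction

daily = {100: Fraction(6, 10), 140: Fraction(2, 10), 150: Fraction(2, 10)}

def total_distribution(daily, days):
    """Exact distribution of the total spend over `days` independent days."""
    dist = {0: Fraction(1)}
    for _ in range(days):
        nxt = defaultdict(Fraction)
        for total, p in dist.items():
            for cost, q in daily.items():
                nxt[total + cost] += p * q   # independence: probabilities multiply
        dist = dict(nxt)
    return dist

def average_distribution(daily, days):
    """Distribution of the average spend: totals divided by the number of days."""
    return {Fraction(t, days): p for t, p in sorted(total_distribution(daily, days).items())}

two_day = average_distribution(daily, 2)
for avg, p in two_day.items():
    print(float(avg), float(p))
```

For two days this gives the averages 100, 120, 125, 140, 145 and 150 with probabilities 0.36, 0.24, 0.24, 0.04, 0.08 and 0.04. Calling `average_distribution(daily, 30)` answers part c) exactly, though the table is much longer.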


Form of the Sampling Distribution of the Sample Mean for SRSWR

Any population: not necessarily normal
Small sample size: depends on the actual distribution
Large sample size (n ≥ 30): can use a normal approximation
Central Limit Theorem (CLT)

The Central Limit Theorem (CLT)

$X_1, \ldots, X_n$: IID sample with mean $\mu$ and variance $\sigma^2$.

For large sample size n,
$$\frac{\sqrt{n}\,(\bar{X} - \mu)}{\sigma}$$
is approximately N(0,1).

Or, equivalently:
For large IID sample size n, $\bar{X}$ is approximately $N(\mu, \sigma^2/n)$.

Or, in words:
When an IID sample of large size n is collected, the sample mean is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$.
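
The statement can be probed by simulation. The sketch below assumes IID draws from an Exponential(1) population (mean 1, SD 1, clearly skewed), chosen only as an example. It checks how often the standardized sample mean falls within ±1.96, which happens with probability about 0.95 for N(0,1).

```python
import math
import random

def standardized_means(n, reps, seed=0):
    """Simulate sqrt(n) * (X_bar - mu) / sigma for IID Exponential(1) samples (mu = sigma = 1)."""
    rng = random.Random(seed)
    out = []
    for _ in range(reps):
        x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        out.append(math.sqrt(n) * (x_bar - 1.0) / 1.0)
    return out

z = standardized_means(n=50, reps=10_000)
inside = sum(abs(v) <= 1.96 for v in z) / len(z)
print(inside)  # should be close to 0.95 if the N(0,1) approximation is good
```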

An alternative form of the CLT

Sum of samples:
For large n, $\sum_{i=1}^{n} X_i$ is approximately normal with mean $n\mu$ and variance $n\sigma^2$.

Issues With the CLT

When can we use it:
If samples are from IID distributions
If there is moderate or no skew

When we may not use it:
Don't use it if the distributions are not IID.
Errors may be large for small samples from skewed distributions.

Example: Binomial(n,p) with p = 0.5, and various n.

Example: Binomial(n,p) with p = 0.1, and various n.

How good are the approximations?

Depends!
Larger skew: need larger sample size
No skew: 15 is a reasonable sample size to use the normal approximation
Moderate skew: need 30 or more
High skew: need 50 or more
Severe skew (example: binomial with large n and small p, so that the Poisson approximation holds): might need a very high sample size, in the range of several hundred or even higher

Example: Binomial(n,p) with p = 0.01, and various n.
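
The effect of skew can be quantified for the binomial examples. As one possible error measure (an assumption, not the slides' own), the sketch takes the largest gap between the exact Binomial(n, p) CDF and the CDF of the normal with the same mean and variance, with no continuity correction.

```python
from math import comb, sqrt
from statistics import NormalDist

def max_cdf_error(n, p):
    """Largest gap between the Binomial(n, p) CDF and the matching normal CDF."""
    normal = NormalDist(n * p, sqrt(n * p * (1 - p)))
    worst, cdf = 0.0, 0.0
    for k in range(n + 1):
        cdf += comb(n, k) * p**k * (1 - p) ** (n - k)   # exact binomial probability of k
        worst = max(worst, abs(cdf - normal.cdf(k)))
    return worst

for p in (0.5, 0.1, 0.01):
    print(p, [round(max_cdf_error(n, p), 3) for n in (10, 30, 100, 500)])
```

For every p the error shrinks as n grows, and for the same n it is larger when p is small, that is, when the distribution is more skewed.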

Issues With the CLT (cont.)

What the CLT does say:
The mean of a large IID sample becomes approximately normal
The sum of a large IID sample becomes approximately normal

What the CLT does not say:
The CLT does not say that large samples become normal; it only works for the sample mean
The CLT usually does not work for SRSWOR

Special case: Normal Population

If I have a random sample (WR) from a normal distribution, then regardless of the value of the sample size n, $\bar{X}$ is exactly normal with mean $\mu$ and variance $\sigma^2/n$.
If I have a random sample (WR) from an approximately normal distribution, then regardless of the value of the sample size n, $\bar{X}$ is approximately normal with mean $\mu$ and variance $\sigma^2/n$.
No CLT is needed in the above cases.

Back to our example

Work out $\mu$ and $\sigma$. Hence work out the approximate distribution of the sample mean for n = 30 using the CLT.

What is the probability that over 30 days,
a) the average spend is at least 120?
b) the average spend is between 110 and 130?
c) the total spend is not more than 4000?

$\mu = 100 \times 0.6 + 140 \times 0.2 + 150 \times 0.2 = 118$.
$\sigma^2 = (100-118)^2 \times 0.6 + (140-118)^2 \times 0.2 + (150-118)^2 \times 0.2 = 496$.
(Note: In class a wrong $\sigma^2$ was used.)
So, approximately, $\bar{X} \sim N(118, 496/30) \approx N(118, 16.53)$.

a) $P(\bar{X} \ge 120) = 1 - P(\bar{X} < 120)$
=1-NORM.DIST(120,118,SQRT(16.53),TRUE) = 0.3114.
b) $P(110 \le \bar{X} \le 130) = P(\bar{X} \le 130) - P(\bar{X} < 110)$
=NORM.DIST(130,118,SQRT(16.53),TRUE)-NORM.DIST(110,118,SQRT(16.53),TRUE)
= 0.9739.
c) $P\left(\sum_{i=1}^{30} X_i \le 4000\right) = P(\bar{X} \le 4000/30)$
= NORM.DIST(4000/30,118,SQRT(16.53),TRUE) = 0.9999.
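
The same numbers can be reproduced outside Excel. The sketch below uses Python's standard-library `statistics.NormalDist` in place of NORM.DIST. The variable names are illustrative, and it keeps the unrounded variance 496/30, so the last digits may differ very slightly from the values above.

```python
from math import sqrt
from statistics import NormalDist

mu = 100 * 0.6 + 140 * 0.2 + 150 * 0.2
sigma2 = (100 - mu) ** 2 * 0.6 + (140 - mu) ** 2 * 0.2 + (150 - mu) ** 2 * 0.2
n = 30

x_bar = NormalDist(mu, sqrt(sigma2 / n))   # CLT approximation for the 30-day average

p_a = 1 - x_bar.cdf(120)
p_b = x_bar.cdf(130) - x_bar.cdf(110)
p_c = x_bar.cdf(4000 / n)                  # total <= 4000 is the same event as average <= 4000/30

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))
```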
