
Sampling Techniques
WILLIAM G. COCHRAN
Professor of Statistics
Harvard University


ASIA PUBLISHING HOUSE


BOMBAY • CALCUTTA • NEW DELHI • MADRAS
LONDON • NEW YORK
COPYRIGHT, 1953
BY
JOHN WILEY & SONS, INC.

All Rights Reserved


This book or any part thereof must not
be reproduced in any form without
the written permission of the publisher.

First Indian Edition: 1960

First Indian Printing: 1962


PRINTED IN INDIA
BY C. L. BHARGAVA AT C. W. LAWRIE AND CO., LUCKNOW AND
PUBLISHED BY P. S. JAYASINGHE, ASIA PUBLISHING HOUSE, BOMBAY.
To Betty
Preface

THIS BOOK WAS DEVELOPED FROM A COURSE OF LECTURES ON
sample survey techniques which I gave for a few years at North Carolina
State College to students who intended to make their careers in
the field of statistics. The purpose of the book is to present a reasonably
comprehensive account of sampling theory as it has been developed
for use in sample surveys, with sufficient illustrations to show
how the theory is applied in practice, and with a supply of exercises to
be worked by the student. My hope is that the book will be useful
both as the basis of a course on sample survey techniques in which the
major emphasis is on theory, and for individual reading by the student
who does not have access to formal instruction.
As an indication of the level at which the book is directed, the
minimum mathematical equipment necessary for an easy understand-
ing of the proofs is a knowledge of calculus as far as the determination
of maxima and minima (using Lagrange's multipliers where required),
plus a familiarity with elementary algebra, and especially with the use
of summation signs. On the statistical side, the book presupposes an
introductory course which includes such topics as combinatorial prob-
abilities, expected values and their properties, means and standard
deviations, the normal, binomial, and multinomial distributions, con-
fidence limits, Student's t-test, linear regression, and the simpler types
of analysis of variance. Occasionally, reference is made to more advanced
statistical results, since I have tried to point out the relation
between sample survey theory and the main stream of statistical
theory. In the early parts of the book, each step in a proof should be
readily apparent from the previous steps; towards the end, where the
proofs are more condensed, most readers will find that some work with
pencil and paper is necessary to follow the steps in detail.
Readers with advanced training in probability may find the argu-
ments by which theorems are established rather pedestrian. In a sense,
sample survey theory is easy, because thus far it has dealt mainly with
means and variances. By the use of powerful operational methods, the
bulk of the existing theory can now, I believe, be developed in a very
compact space as particular cases of a few general results. Such a
development would be illuminating in clarifying the interrelationships
between the different parts of the subject, and might prove a stimulus
to further research and discovery. But my experience in teaching has
been that most students who wish to learn something about sampling
theory find this kind of presentation heavy going, and prefer a more
leisurely progress.
Sampling theory and practice have both grown so much in the past
ten years that an adequate coverage of the two aspects of sampling
now requires a lengthy volume. Although this book is not intended to
contain a thorough discussion of sampling practice, it does endeavor
to show how the various topics that comprise sampling theory arise
from problems in sampling practice. This link is essential to an under-
standing of sample survey theory, whose primary aim is to make
sampling practice more efficient and economical. In the same way,
the book presents some of the recommendations about sampling prac-
tice that follow from the results in theory. I have deliberately re-
frained, however, from making these recommendations too specific or
too strong. The tendency in sampling practice, where decisions must
often be made quickly on inadequate knowledge, is to develop a series
of working rules, each of which has some basis in theory. There is
danger, however, that working rules which have been successful in one
type of sampling become entrenched, so that they are relied upon in
quite different kinds of sampling for which they are not appropriate.
Re-examination from time to time of the theoretical basis for any pro-
posed working rule helps to avoid this danger.
The choice of a system of notation is a perplexing one to the writer.
The chief problem is how to prevent an epidemic of subscripts, which
make the results look formidable and unattractive. With multistage
stratified sampling, several symbols are needed to remind the reader
of the structure of the population, and, ideally, the notation adopted
for an estimate computed from sample data should remind him not
only of the way in which the estimate is made, but also of the way in
which the sample is drawn. My approach has been to use capital
letters for characteristics of the population and small letters for those
of the sample, and to employ a consistent set of subscripts to denote
the structure of the population. For the rest, subscripts with a mne-
monic content have been favored, and I have not hesitated to repeat
the definition of some notation in places where my guess is that the
reader will have begun to forget it. Lapses from consistency occur:
the alphabet soon becomes used up, and the letter m, for instance, is
worked overtime. Although I hope that any inconsistencies will not
be troublesome, the reader who is puzzled by them has my apologies
and sympathy; the struggle to understand a theorem without knowing
clearly what the symbols mean is highly exasperating.
My best thanks are due to Dr. A. L. Finkner and Dr. Emil H. Jebe,
who prepared a large part of the mimeographed lecture notes from
which this book was developed. Dr. Jebe made a painstaking reading
of the present book in manuscript, and Dr. Helen Abbey also read
parts of the manuscript. The secretarial staff and graduate students
of the Department of Biostatistics, Johns Hopkins University, gave
invaluable help in the preparation of manuscript and in proofreading.
For permission to use data from surveys I am indebted to Dr. F. C.
Cornell and Dr. Finkner. Some theoretical investigations were facili-
tated by a research contract with the Office of Naval Research. While
the manuscript was nearing completion, I had the advantage of read-
ing a substantial part of the book, Sample Survey Methods and Theory,
by M. H. Hansen, W. N. Hurwitz, and W. G. Madow, and of noting
how these authors had handled the inevitable points at which a lucid
exposition is hard to find. Numerous references to this fine book would
have been made if it had appeared in print in sufficient time.
The present book contains more material than can be covered in the
time usually devoted to a course on sample surveys. However, the
sections have been prepared so that many of them can be omitted, or
condensed to a brief statement of the results, without detriment to
later parts of the book. There are, for instance, numerous discus-
sions of special topics, which attempt to answer questions that have
been raised by alert sampling practitioners but which are not essential
to a firm understanding of the fundamentals of the subject. Although
the selection of topics for discussion must depend on the field
of application, the following suggestions are made of sections which
may be omitted or condensed in an introductory course: 3.5, 3.9; 4.7;
5.8, 5.10, 5.12, 5.16, 5.17, 5.21; 6.8, 6.10; 7.4, 7.6, 7.7, 7.8, 7.9; 8.5,
8.6, 8.7, 8.9, 8.14; 9.5, 9.6, 9.12, 9.13; 10.4, 10.5, 10.8, 10.9; 11.9, 11.10,
11.11, 11.12, 11.13; 12.11; 13.4.
WILLIAM G. COCHRAN
The Johns Hopkins University
March, 1955
Contents

1. INTRODUCTION
1.1 Advantages of the sampling method
1.2 The principal steps in a sample survey
1.3 The role of sampling theory
1.4 Probability sampling
1.5 Bias and its effects
1.6 References
2. SIMPLE RANDOM SAMPLING
2.1 Simple random sampling
2.2 Definitions and notation
2.3 Properties of the estimates
2.4 Variances of the estimates
2.5 The finite population correction
2.6 Estimation of the standard error from a sample
2.7 Confidence limits
2.8 Validity of the normal approximation
2.9 Effect of non-normality on the estimated variance
2.10 Exercises
2.11 References
3. SAMPLING FOR PROPORTIONS AND PERCENTAGES
3.1 Qualitative characteristics
3.2 Variances of the sample estimates
3.3 The effect of P on the standard errors
3.4 The binomial distribution
3.5 The general distribution of p
3.6 Confidence limits
3.7 Classification into more than two classes
3.8 Confidence limits when there are more than two classes
3.9 The conditional distribution of p
3.10 Exercises
3.11 References
4. THE ESTIMATION OF SAMPLE SIZE
4.1 A hypothetical example
4.2 Analysis of the problem
4.3 The specification of precision
4.4 The formula for n in sampling for proportions
4.5 The formula for n with continuous data
4.6 Sample size with more than one item
4.7 Stein's method of two-stage sampling
4.8 An attempt at a general solution of the sample size problem
4.9 Exercises
4.10 References
5. STRATIFIED RANDOM SAMPLING
5.1 Description
5.2 Notation
5.3 Properties of the estimates
5.4 The estimated variance and confidence limits
5.5 Optimum allocation
5.6 Optimum allocation with varying costs
5.7 Relative precision of stratified random and simple random sampling
5.8 Effects of deviations from the optimum
5.9 Determination of the allocation from previous data
5.10 Effects of errors in the estimated $S_h$
5.11 Allocation with several items
5.12 Allocation requiring more than 100 per cent sampling
5.13 Estimation of sample size
5.14 Stratified sampling for proportions
5.15 The construction of strata
5.16 Proximity as a basis for stratification
5.17 Estimation from a sample of the gain due to stratification
5.18 Effects of errors in the stratum sizes
5.19 Case where the strata cannot be identified in advance
5.20 Quota sampling
5.21 Stratification with one unit per stratum
5.22 Stratification in analytical studies
5.23 Exercises
5.24 References
6. RATIO ESTIMATES
6.1 Methods of estimation
6.2 The ratio estimate
6.3 Approximate variance and bias of the ratio estimate
6.4 Estimated variance
6.5 Sample size
6.6 Confidence limits
6.7 Comparison of the ratio estimate with the mean per unit
6.8 Conditions under which the ratio estimate is optimum
6.9 The ratio estimate in sampling for proportions
6.10 The approach to normality of the distribution of the ratio
6.11 Ratio estimates in stratified random sampling
6.12 The combined ratio estimate
6.13 Optimum allocation with a ratio estimate
6.14 Exercises
6.15 References
7. REGRESSION ESTIMATES
7.1 The linear regression estimate
7.2 Large-sample theory
7.3 Elementary theory
7.4 Bias of the regression estimate
7.5 Comparison with the ratio estimate and the mean per unit
7.6 The regression estimate in stratified sampling
7.7 The combined regression estimate
7.8 Estimated variances
7.9 Comparison of the two types of regression estimate
7.10 Exercises
7.11 References
8. SYSTEMATIC SAMPLING
8.1 Description
8.2 An alternative view
8.3 Variance of the estimated mean
8.4 Comparison of systematic with stratified random sampling
8.5 Populations in "random" order
8.6 Populations with linear trend
8.7 End corrections
8.8 Populations with periodic variation
8.9 Autocorrelated populations
8.10 Natural populations
8.11 Quasi-periodic effects
8.12 Estimation of the variance from a single sample
8.13 Stratified systematic sampling
8.14 Systematic sampling in two dimensions
8.15 Summary
8.16 Exercises
8.17 References
9. TYPE OF SAMPLING UNIT
9.1 The optimum unit
9.2 A simple example
9.3 General procedure for comparing units
9.4 Comparisons made from survey data
9.5 Variance functions
9.6 A cost function
9.7 Variance in terms of intra-cluster correlation
9.8 Cluster sampling for proportions
9.9 Measures of the size of a unit
9.10 Sampling with probability proportional to size
9.11 Theory for selection with arbitrary probabilities
9.12 Comparison with the ratio estimate
9.13 Extension to stratified sampling
9.14 Exercises
9.15 References
10. SUBSAMPLING WITH UNITS OF EQUAL SIZE
10.1 Introduction
10.2 Elementary theory
10.3 Prediction of the variance for other subsampling ratios
10.4 General theory
10.5 Estimation of the variance in the general case
10.6 Optimum sampling and subsampling fractions
10.7 Subsampling for proportions
10.8 Three-stage sampling
10.9 Stratified sampling of the units
10.10 Exercises
10.11 References
11. SUBSAMPLING WITH UNITS OF UNEQUAL SIZE
11.1 Introduction
11.2 Sampling methods when n = 1
11.3 Sampling with probability proportional to estimated size
11.4 Summary of methods for n = 1
11.5 Sampling methods when n > 1
11.6 Summary of methods for n > 1
11.7 The estimation of proportions
11.8 Interim comments
11.9 The principal variance formulas
11.10 Optimum probabilities of selection
11.11 Biased versus unbiased estimates
11.12 Estimated variances
11.13 Extension to stratified sampling
11.14 Summary comments
11.15 Exercises
11.16 References
12. DOUBLE SAMPLING
12.1 Description of the technique
12.2 Double sampling for stratification
12.3 Optimum allocation
12.4 Estimated variance in double sampling for stratification
12.5 Regression estimates
12.6 Double sampling with regression versus single sampling
12.7 Estimated variance in double sampling for regression
12.8 Ratio estimates
12.9 Repeated sampling of the same population
12.10 Sampling on two occasions
12.11 Sampling on more than two occasions
12.12 Exercises
12.13 References
13. SOURCES OF ERROR IN SURVEYS
13.1 Introduction
13.2 Effects of non-response
13.3 Optimum sampling fraction among the non-respondents
13.4 Other techniques for non-response
13.5 Errors of measurement
13.6 A mathematical model for errors of measurement
13.7 Effects of constant bias
13.8 Effects of components that are independent from unit to unit
13.9 Effects of correlation between errors on different units
13.10 Interpenetrating subsamples
13.11 Summary
13.12 References
ANSWERS TO EXERCISES
AUTHOR INDEX
SUBJECT INDEX
CHAPTER 1

INTRODUCTION

1.1 Advantages of the sampling method. Our knowledge, our attitudes,
and our actions are based to a very large extent upon samples.
This is equally true in everyday life and in scientific research. A
person's opinion of an institution that conducts thousands of trans-
actions every day is often determined by the one or two encounters
which he has had with the institution in the course of several years.
The traveller who spends 10 days in a foreign country and then pro-
ceeds to write a book telling the inhabitants how to revive their in-
dustries, reform their political system, balance their budget, and im-
prove the food in their hotels is a familiar figure of fun. But in a real
sense he differs from the political scientist who devotes 20 years to
living and studying in the country only in that he bases his conclusions
on a much smaller sample of experience and is less likely to be aware
of the extent of his ignorance. In every branch of science we lack the
resources to study more than a fragment of the phenomena that might
advance our knowledge.
Until recent years, relatively little attention was given to the prob-
lem of how to draw a good sample. This does not matter so long as
the material from which we are sampling is uniform, so that any kind
of sample gives almost the same results. Laboratory diagnoses about
the state of our health are made from a few drops of blood. This
procedure is based on the assumption that the circulating blood is
always well mixed and that one drop tells the same story as another,
an assumption which we as laymen fervently hope is correct. But
when the material is far from uniform, as is often the case, the method
by which the sample is obtained is critical, and the study of techniques
that ensure a trustworthy sample becomes important.
This book contains an account of the body of theory that has been
built up to provide a background for good sampling methods. In
most of the applications for which this theory was constructed, the
aggregate about which information is desired is finite and delimited:
the inhabitants of a town, the machines in a factory, the fish in a lake.
In some cases it may seem feasible to obtain accurate information by
taking a complete enumeration or census of the aggregate. Administrators
who have been accustomed to dealing with censuses have sometimes
been suspicious of samples and reluctant to use them in place
of censuses. Although this attitude is losing ground, it may be well
to list the principal advantages of sampling as compared with complete
enumeration.
i. Reduced cost. If data are secured from only a small fraction of
the aggregate, expenditures may be expected to be smaller than if a
complete census is attempted.
ii. Greater speed. For the same reason, the data can be collected
and summarized more quickly with a sample than with a complete
count. This may be a vital consideration when the information is
urgently needed.
iii. Greater scope. In certain types of inquiry, highly trained per-
sonnel or specialized equipment, limited in availability, must be used
to obtain the data. A complete census may then be impracticable:
the choice lies between obtaining the information by sampling or not
at all. Thus surveys which rely on sampling have more scope and
flexibility as to the types of information that can be obtained. On
the other hand, if information is wanted for many subdivisions or
segments of the population, it may be found that a complete enumera-
tion offers the best solution.
iv. Greater accuracy. Because personnel of higher quality can be
employed and can be given intensive training, a sample may actually
produce more accurate results than the kind of complete enumeration
that it is feasible to take.

1.2 The principal steps in a sample survey. As a preliminary to a
discussion of the role which theory plays in a sample survey, it is
convenient to describe briefly the steps that are usually involved in
the planning and execution of a survey. Surveys vary greatly in
their complexity. To take a sample from 5000 cards, neatly arranged
and numbered in a file, is an easy task. It is another matter to sam-
ple the inhabitants of a region where transport is by water through the
forests, where there are no maps, where fifteen different dialects are
spoken, and where the inhabitants are suspicious of a stranger, and
very suspicious of an inquisitive stranger. Problems which are baf-
fling in one survey may be trivial or non-existent in another.
The principal steps in a survey are grouped somewhat arbitrarily
under nine headings.
i. Statement of the objectives of the survey. A lucid statement of
the objectives is most helpful. Without this, it is easy in a complex
survey to forget the objectives when engrossed in the details of planning,
and to make decisions that are at variance with the objectives.
ii. Definition of the population to be sampled. The word population
will be used to denote the aggregate from which the sample is chosen.
The definition of the population may present no problem, as when
sampling a batch of electric light bulbs in order to estimate the aver-
age length of life of a bulb. In sampling a population of farms, on the
other hand, rules must be set up to define a farm, and borderline cases
will arise. These rules must be usable in practice: the enumerator
must be able to decide in the field, without much hesitation, whether
a doubtful case belongs to the population or not.
Whenever possible, the population to be sampled should obviously
coincide with the population about which information is wanted.
Sometimes this requirement is judged, rightly or wrongly, to be too
difficult. In a new area of research, where the collection of data pre-
sents perplexing problems of measurement, it may be decided to con-
centrate the resources on this aspect of the survey, choosing a popula-
tion that is compact and easy to sample, although this is not the broader
population about which information is really wanted. In this event
one should also collect any comparative information about the two
populations that helps to show whether inferences to the broader pop-
ulation can be attempted.
iii. Determination of the data to be collected. It is well to verify
that all the data are relevant to the purpose of the survey, and that
no essential data are omitted. There is frequently a tendency to col-
lect too many data, some of which are never subsequently examined.
iv. Methods of measurement. When the kinds of data that are
needed have been decided, there may be a choice as to the methods of
measurement to be employed. For instance, data about a person's
state of health may be obtained from statements which he makes or
from a more or less thorough medical examination. With human
populations, the manner and the order in which questions are asked
may produce substantial differences in the results; see e.g. Payne
(1951).
v. Choice of sampling unit. As a preliminary to the selection of
a sample, the population must be subdivided in some way into parts
which will be called sampling units, or units. The sampling units
must together comprise the whole of the population, and they must
be non-overlapping, in the sense that every element in the population
belongs to one and only one unit. Sometimes the appropriate unit is
obvious, as with a population of light bulbs, where the unit is the
single bulb. Sometimes there is a considerable choice of unit. In
sampling the people in a town, the unit might be an individual person,
the members of a household, or all persons dwelling in the same
city block. In sampling an agricultural crop, the unit is likely to be
an area of land whose shape and dimensions are at our disposal.
The construction of a complete list of sampling units, sometimes
called a frame, may be one of the major practical problems. Sometimes
the frame is impossible to construct, as with the population of fish in
a lake.
vi. Selection of the sample. There is now a variety of procedures
by which the sample may be selected. The selection involves also a
decision about the size of the sample, which in turn requires a provisional
estimate of the cost of the survey, to ensure that the sample
will fall within the allowable budget.
vii. Organization of the field work. In extensive surveys, many
problems of business administration are involved. The personnel
must receive training in the purpose of the survey and in the methods
of measurement to be employed and must be adequately supervised
in their work. A procedure for early checking of the quality of the
returns may be invaluable. Plans must be made for handling non-
response, that is, the failure of the enumerator to obtain information
from certain of the units in the sample.
viii. Summary and analysis of the data. The first step is to edit the
completed questionnaires, in the hope of amending recording errors,
or at least of deleting data that are obviously erroneous. Decisions
about tabulating procedure are needed in the case where answers to
certain questions were omitted by some respondents or had to be de-
leted in the editing process. Thereafter, the tabulations which lead
to the estimates are performed. Different methods of estimation may
be available for the same data.
ix. Information gained for future surveys. The more information
we have initially about a population, the easier it is to devise a sam-
ple which will give accurate estimates. Any completed sample is po-
tentially a guide to improved future sampling, through the data which
it supplies about the means, standard deviations, and nature of the
variability of the principal measurements, and about the costs in-
volved in getting the data. Sampling practice advances more rapidly
when provisions are made to assemble and record information of this
type.
There is another important respect in which any completed sample
facilitates future samples. Things never go exactly as planned in a
complex survey. The alert sampler learns to recognize mistakes in
execution and to see that they do not occur in future surveys.

1.3 The role of sampling theory. This list of the steps in a sample
survey has been given in order to emphasize that sampling is a prac-
tical business, which calls for several different types of skill. In some
of the steps (the definition of the population, the determination of
the data to be collected and of the methods of measurement, and the
organization of the field work) sampling theory plays at most a minor
role. Although these topics will not be discussed further in this book,
their importance should be realized. Sampling demands attention to
all phases of the activity: poor work in one phase may ruin a survey
in which everything else is done well.
The purpose of sampling theory is to make sampling more efficient.
It attempts to develop methods of sample selection and of estimation
that provide, at the lowest possible cost, estimates that are precise
enough for our purpose. This principle of specified precision at mini-
mum cost recurs repeatedly in the presentation of theory.
In order to apply this principle, we must be able to predict, for any
sampling procedure that is under consideration, the precision and the
cost to be expected. So far as precision is concerned, we cannot foretell
exactly how large an error will be present in an estimate in any
specific situation, for this would require a knowledge of the true value
for the population. Instead, the precision of a sampling procedure is
judged by examining the frequency distribution which is generated
for the estimate, if the procedure is applied again and again to the
same population. This is, of course, the standard technique by which
precision is judged in statistical theory.
A further simplification is introduced. With samples of the sizes
that are common in practice, there is often good reason to suppose
that the sample estimates are approximately normally distributed.
Consequently the sampling variance of the estimate is used to provide,
in inverse terms, a measure of its precision. A considerable part of
the theory deals with the calculation of formulas for the sampling
variances of estimates obtained by various procedures.
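
The idea of judging precision by repeated application of a procedure to the same population can be imitated numerically. The following is a minimal sketch, not a method given in the text: the population values, the sample size of 20, the 5000 repetitions, and the function name `repeated_sampling_means` are all hypothetical choices made for illustration, and the procedure assumed is simple random sampling without replacement with the sample mean as the estimate.

```python
import random
import statistics


def repeated_sampling_means(population, n, repetitions, seed=0):
    """Apply the same procedure again and again: draw a simple random
    sample of size n without replacement and record its mean."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n))
            for _ in range(repetitions)]


# Hypothetical finite population of 200 measurements (illustrative only).
population = [float(10 + (7 * k) % 23) for k in range(200)]

# The list of estimates approximates the frequency distribution of the estimate.
means = repeated_sampling_means(population, n=20, repetitions=5000)

true_mean = statistics.fmean(population)
# Spread of the estimates about their own average: the sampling variance.
sampling_variance = statistics.pvariance(means)

print(f"population mean          : {true_mean:.3f}")
print(f"average of the estimates : {statistics.fmean(means):.3f}")
print(f"sampling variance        : {sampling_variance:.4f}")
```

A smaller sampling variance corresponds to a more precise procedure, which is the inverse relation described above.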
The study of sampling from an infinite population is a relatively old
and well-established discipline. The development of theory specifically
for application to sample surveys is quite recent. Nearly all the ref-
erences in this book are less than 20 years old and the majority are
less than 10 years old. The primary stimulus to sample survey theory
was the increasing use of sample surveys as a means of obtaining infor-
mation. Most of the work in sample survey theory has been done by
persons who are also actively engaged in the conduct of surveys. In
their turn, the advances in theory increased the scope and utility of
the sampling method and contributed to a further growth in the practical
use of surveys.*
One difference between sample survey theory and the older theory
of sampling is that the populations with which we have to deal in
survey work contain a finite number of units. The methods used to
prove theorems are different, and the results are slightly more complicated,
when sampling is from a finite instead of an infinite population.
For practical purposes these differences in results for finite and
infinite populations are seldom important. Whenever the size of the
sample is small relative to the size of the population, as happens in the
great majority of applications, results derived from an infinite popula-
tion are fully adequate. In general, results for finite populations will
be presented in this book. In some of the more difficult problems, the
theory for infinite populations will be used to simplify the presentation.
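
A small numerical check may make the comparison between finite and infinite populations concrete. This sketch rests on an assumption not stated in this section: the standard result for simple random sampling without replacement that the variance of the sample mean is $(S^2/n)(1 - n/N)$, with $S^2$ computed using the divisor $N - 1$, against $S^2/n$ when the finite population factor is dropped. The population values and function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean, pvariance, variance


def exact_variance_of_mean(population, n):
    """Variance of the sample mean over all equally likely samples of
    size n drawn without replacement (full enumeration, so only
    practical for very small populations)."""
    means = [fmean(s) for s in combinations(population, n)]
    return pvariance(means)


def finite_formula(population, n):
    """(S^2 / n) * (1 - n/N), with S^2 computed using the divisor N - 1."""
    N = len(population)
    return variance(population) / n * (1 - n / N)


def infinite_formula(population, n):
    """S^2 / n: the same expression without the finite population factor."""
    return variance(population) / n


# Hypothetical population of N = 8 units (illustrative values only).
population = [3.0, 7.0, 8.0, 12.0, 15.0, 21.0, 22.0, 30.0]

for n in (1, 2, 4, 6):
    print(n,
          round(exact_variance_of_mean(population, n), 4),
          round(finite_formula(population, n), 4),
          round(infinite_formula(population, n), 4))
```

The enumerated variance agrees with the finite formula, and the two formulas differ only by the factor $1 - n/N$, which is close to 1 whenever the sample is small relative to the population.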

1.4 Probability sampling. All sampling procedures for which a theory
has been developed have the following mathematical properties in
common:
i. We are able to define the set of distinct samples, $S_1, S_2, \ldots, S_\nu$,
which the procedure is capable of selecting if applied to a specific population.
This means that we can say precisely what sampling units
belong to $S_1$, to $S_2$, and so on. For example, suppose that the population
contains six units, numbered from 1 to 6. A common procedure
for choosing a sample of size 2 gives three possible candidates:
$S_1 \sim (1, 4)$; $S_2 \sim (2, 5)$; $S_3 \sim (3, 6)$. Note that not all possible
samples of size 2 need be represented.
ii. Each possible sample $S_i$ has assigned to it a known probability
of selection $\pi_i$.
iii. We select one of the $S_i$ by a process in which each $S_i$ receives its
appropriate probability $\pi_i$ of being selected. In the example we might
assign equal probabilities to the three samples. Then the draw itself
can be made by choosing a random number between 1 and 3. If this
number is $j$, $S_j$ is the sample that is taken.
iv. The method for computing the estimate from the sample must
be stated and must lead to a unique estimate for any specific sample.
We may declare, for example, that the estimate is to be the average
of the measurements on the individual units in the sample.
For any sampling procedure which satisfies these properties, we are
in a position to calculate the frequency distribution of the estimates
which it generates if repeatedly applied to the same population, for
we know how frequently any particular sample $S_i$ will be selected, and
we know exactly how to calculate the estimate from the data in $S_i$.
It is clear, therefore, that we are able to develop a sampling theory
for any procedure of this type, although the details of the development
may be intricate.
* Stephan (1948) gives a good historical account of the development of the uses
of modern sampling techniques.
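
The calculation described above can be sketched for the six-unit example. This is only an illustration: the measurements in `y` are hypothetical values, while the three candidate samples, their equal probabilities, and the choice of the sample average as the estimate come from the text.

```python
from fractions import Fraction

# Hypothetical measurements for units 1..6 (assumed for illustration only).
y = {1: 10, 2: 12, 3: 15, 4: 20, 5: 22, 6: 29}

# The three possible samples from the example, each with probability 1/3.
samples = [(1, 4), (2, 5), (3, 6)]
probs = [Fraction(1, 3)] * 3

def estimate(sample):
    """Estimate = average of the measurements on the units in the sample."""
    return Fraction(sum(y[u] for u in sample), len(sample))

# Frequency distribution of the estimate: (estimate, probability) pairs.
distribution = [(estimate(s), p) for s, p in zip(samples, probs)]

# Mean of the frequency distribution of the estimates.
expected = sum(z * p for z, p in distribution)
population_mean = Fraction(sum(y.values()), len(y))

for s, (z, p) in zip(samples, distribution):
    print(s, float(z), float(p))
print("mean of estimates:", float(expected))
print("population mean:  ", float(population_mean))
```

Because every sample and its probability can be listed, the whole distribution of the estimate is available by direct enumeration, which is all that a sampling theory requires.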
The term probability sampling refers to a procedure of this type.
This term has not yet acquired a standard definition, and some writers
use it in a more restrictive sense. The main purpose of the term is to
distinguish this kind of sampling from purposive selection, in which
the sample is restricted to units thought by someone to be especially
typical of the population or convenient for sampling. Purposive se-
lection may produce good results when the sample is small, but it is
not amenable to the development of a theory, because it contains no
element of random selection.
In practice we seldom draw a sample by writing down the $S_i$ and
$\pi_i$ as outlined above. This is intolerably laborious with a large
population, where a sampling procedure may produce billions of possible
samples. The draw is most commonly made by specifying probabilities
of inclusion for the individual units, and drawing units, one by
one or in groups, until the sample of desired size and type is constructed.
For the purposes of a theory it is sufficient to know that we
could write down the $S_i$ and $\pi_i$ if we wanted to and had unlimited time.

1.5 Bias and its effects. For simplicity, it is assumed in the presentation
of theory that any measurement $y_i$ on the ith unit is the correct
value for that unit. Errors of measurement are ignored. This
assumption is of course unrealistic, and in chapter 13 the effects of
errors of measurement on the standard results are examined. For
some types of error, the standard results remain valid with only minor
changes. For other types of error, more drastic changes are needed.
The effects of bias will, however, be discussed in this section, because
the deliberate use of biased estimates is often found to be profitable
in sample surveys.
A sampling procedure is said to be unbiased if the mean of the frequency
distribution of the estimates which it produces is exactly equal
to the population characteristic which is being estimated. In the notation
of the previous section, let $z_i$ be the estimate provided by the
sample $S_i$ ($i = 1, 2, \ldots, \nu$), with probability of selection $\pi_i$, and let
$\mu$ be the population value which is being estimated. The procedure
is unbiased if
$$\sum_{i=1}^{\nu} \pi_i z_i = \mu$$
If the two quantities are not equal, their difference is called the bias
in the sampling procedure:
$$\text{Bias} = \sum_{i=1}^{\nu} \pi_i z_i - \mu$$

To examine the effect of bias, suppose that the estimate $z$ is normally
distributed about a mean $m$ which is a distance $B$ from the true population
value $\mu$, as shown in figure 1.1. The amount of bias is $B = m - \mu$.
Suppose that we do not know that any bias is present. We
compute the standard deviation $\sigma$ of the frequency distribution of the
estimate; this will, of course, be the standard deviation about the
mean $m$ of the distribution, not about the true mean $\mu$. As a statement
about the accuracy of the estimate, we declare that the probability
is 0.05 that the estimate $z$ is in error by more than $1.96\sigma$.

FIGURE 1.1 Effect of bias on errors of estimation.

We will consider how the presence of bias distorts this probability.
To do this, we calculate the true probability that the estimate is in
error by more than $1.96\sigma$, where error is measured from the true mean
$\mu$. The two tails of the distribution must be examined separately.
For the upper tail, the probability of an error of more than $+1.96\sigma$ is
the shaded area above Q in figure 1.1. This area is given by
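$$\frac{1}{\sigma\sqrt{2\pi}} \int_{\mu + 1.96\sigma}^{\infty} e^{-(z - m)^2 / 2\sigma^2} \, dz$$
Put $z - m = \sigma t$. The lower limit of the range of integration for
$t$ is
$$\frac{\mu - m}{\sigma} + 1.96 = 1.96 - \frac{B}{\sigma}$$
Thus the area is
$$\frac{1}{\sqrt{2\pi}} \int_{1.96 - (B/\sigma)}^{\infty} e^{-t^2/2} \, dt$$
Similarly, the lower tail, i.e. the shaded area below P, has an area
$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-1.96 - (B/\sigma)} e^{-t^2/2} \, dt$$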

From the form of the integrals it is clear that the amount of dis-
turbance depends solely on the ratio of the bias to the standard devia-
tion. The results are shown in table 1.1.

TABLE 1.1 EFFECT OF A BIAS B ON THE PROBABILITY OF AN ERROR
GREATER THAN $1.96\sigma$

                 Probability of error
$B/\sigma$   $< -1.96\sigma$   $> 1.96\sigma$   Total
0.02         0.0238            0.0262           0.0500
0.04         0.0228            0.0274           0.0502
0.06         0.0217            0.0287           0.0504
0.08         0.0207            0.0301           0.0508
0.10         0.0197            0.0314           0.0511
0.20         0.0154            0.0392           0.0546
0.40         0.0091            0.0594           0.0685
0.60         0.0052            0.0869           0.0921
0.80         0.0029            0.1230           0.1259
1.00         0.0015            0.1685           0.1700
1.50         0.0003            0.3228           0.3231
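
The entries of table 1.1 follow directly from the two tail integrals. A minimal sketch of the calculation, using only the standard normal tail area (the function names are illustrative, not from the text):

```python
import math

def normal_upper_tail(x):
    """P(T > x) for a standard normal deviate T."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def error_probabilities(ratio, t=1.96):
    """Tail probabilities when the bias equals `ratio` standard deviations.

    Returns (P(error < -t*sigma), P(error > +t*sigma), total).
    """
    lower = normal_upper_tail(t + ratio)  # area below -t - B/sigma
    upper = normal_upper_tail(t - ratio)  # area above  t - B/sigma
    return lower, upper, lower + upper

for ratio in (0.02, 0.04, 0.06, 0.08, 0.10, 0.20, 0.40, 0.60, 0.80, 1.00, 1.50):
    lower, upper, total = error_probabilities(ratio)
    print(f"{ratio:4.2f}  {lower:.4f}  {upper:.4f}  {total:.4f}")
```

Only the ratio of the bias to the standard deviation enters the function, which is the point made in the text.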

For the total probability of an error of more than $1.96\sigma$, the bias
has little effect provided that it is less than one-tenth of the standard
deviation. At this point the total probability is 0.0511 instead of the
0.05 which we think it is. As the bias increases further, the disturbance
becomes more serious. At $B = \sigma$, the total probability of error
is 0.17, more than three times the presumed value.
The two tails are affected differently. With a positive bias, as in
this example, the probability of an underestimate by more than $1.96\sigma$
shrinks rapidly from the presumed 0.025 to become negligible when
$B = \sigma$. The probability of the corresponding overestimate mounts
steadily. In most applications the total error is the primary interest,
but occasionally we are particularly interested in errors in one direction.
As a working rule, the effect of bias on the accuracy of an estimate
is negligible if the bias is less than one-tenth of the standard deviation
of the estimate. If we have a biased method of estimation for which
we can show that $B/\sigma < 0.1$, it can be claimed that the bias is not an
appreciable disadvantage of the method.
Any biased method must, however, be used with caution. Suppose
that samples are drawn from a population every month throughout a
year, and that monthly estimates are made by some biased method of
estimation. The arithmetic mean of the twelve estimates is computed
subsequently in order to obtain an average annual figure. If the pop-
ulation is changing only slowly, it is not unlikely that the biases in
the twelve estimates will have the same sign and be of about the same
magnitude. The bias in the annual average is therefore about the
same as the bias in a single monthly figure. If the monthly samples
are drawn independently, the standard error of the annual average
estimate will be about $1/\sqrt{12}$ times the standard error of a monthly
estimate. Hence the ratio of the bias to the standard error in the
annual average is roughly $\sqrt{12}$ times that in a monthly figure, and
this inflated ratio may not be negligible. Since it is always difficult
to foretell all the ways in which sample estimates may be averaged for
later purposes, the use of biased estimates is to be avoided unless there
is evidence that the ratio of the bias to the standard error is extremely
small.
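
The arithmetic of the monthly example can be sketched as follows. The monthly ratio of 0.1 is a hypothetical value, chosen because it sits at the working-rule threshold; the function name is illustrative.

```python
import math

def ratio_after_averaging(monthly_ratio, k=12):
    """Bias / standard error for the mean of k independent estimates
    that share the same bias and the same standard error."""
    # The bias is unchanged by averaging; the standard error
    # shrinks by the factor 1/sqrt(k).
    return monthly_ratio * math.sqrt(k)

# A monthly ratio at the working-rule threshold (assumed value).
print(round(ratio_after_averaging(0.1), 3))
```

Under these assumptions a ratio that is negligible for one month becomes about three and a half times larger for the annual average.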
Unsuspected bias may be present in an estimate, even when great
pains have been taken to exclude bias. Since the standard deviation
of an estimate, as obtained from a sample, does not include the contribution
of the bias, it is preferable to speak of this standard deviation
as measuring the precision of the estimate, rather than its accuracy.
Accuracy usually refers to the size of deviations from the
true mean $\mu$, whereas precision refers to the size of deviations from the
mean $m$ obtained by repeated application of the sampling procedure.

1.6 References.
PAYNE, S. L. (1951). The art of asking questions. Princeton University Press.
STEPHAN, F. F. (1948). History of the uses of modern sampling procedures.
Jour. Amer. Stat. Assoc., 43, 12-39.
CHAPTER 2

SIMPLE RANDOM SAMPLING

2.1 Simple random sampling. Sample surveys deal with samples
drawn from populations which contain a finite number N of units.
If these units can all be distinguished from one another, the number
of distinct samples of size n that can be drawn from the N units is
given by the combinatorial formula
$$\binom{N}{n} = {}_N C_n = \frac{N!}{n!(N - n)!}$$
For example, if the population contains five units denoted by A, B,
C, D, and E, there are ten different samples of size 3, as follows:
ABC ABD ABE ACD ACE
ADE BCD BCE BDE CDE
Note that the same letter is not allowed to occur twice in the sample.
No attention is paid to the order in which the letters occur in the
sample, the six samples ABC, ACB, BAC, BCA, CAB, and CBA being
considered identical.
Simple random sampling is a method of selecting n units out of the
N such that every one of the ${}_N C_n$ samples has an equal chance of
being chosen. This type of sampling is sometimes called random
sampling. Since the word random is used in the literature in many
different senses, an extra qualifying adjective is advisable. Some
writers prefer the phrase unrestricted random sampling.
In practice a simple random sample is drawn unit by unit. The
units in the population are numbered from 1 to N. A series of random
numbers between 1 and N is then drawn, either by means of a table
of random numbers or by placing the numbers 1 to N in a bowl and
mixing thoroughly. If the bowl is used, n numbers are drawn out in
succession. The units which bear these numbers constitute the sample.
At any stage in the draw, this process gives an equal chance of selection
to all numbers not previously drawn. It is easy to verify that
all ${}_N C_n$ possible samples have an equal chance.
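
As a small sketch of these ideas, the ten samples of size 3 from the units A to E can be listed, and one simple random sample drawn without replacement. The seed is arbitrary and is fixed only so that the draw can be repeated.

```python
import itertools
import random

population = ["A", "B", "C", "D", "E"]
n = 3

# All distinct samples of size n, ignoring the order of the units.
all_samples = list(itertools.combinations(population, n))
print(len(all_samples), ["".join(s) for s in all_samples])

# Draw one simple random sample unit by unit, without replacement.
rng = random.Random(1953)  # arbitrary fixed seed for reproducibility
sample = rng.sample(population, n)
print(sorted(sample))
```

`random.Random.sample` never returns the same unit twice, which corresponds to ignoring a random number that has already been drawn.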

When a number has been drawn from the bowl, it is not replaced,
since this might allow the same unit to enter the sample more than
once. For this reason the sampling is described as without replacement.
Similarly, if a table of random numbers is employed, a number that
has been drawn previously is ignored. Sampling with replacement is
entirely feasible, but except in special circumstances is seldom used,
since there seems little point in having the same unit twice in the
sample.
Other methods of sampling are often preferable to simple random
sampling on the grounds of convenience or of increased precision.
Simple random sampling serves best to introduce sampling theory.

2.2 Definitions and notation. In a sample survey we decide upon
certain properties which we attempt to measure and record for every
unit that comes into the sample. These properties of the units will
be referred to as characteristics or more simply as items.
The values obtained for any specific item in the N units which comprise
the population are denoted by $y_1, y_2, \ldots, y_N$. The corresponding
values for the units in the sample are denoted by $y_1, y_2, \ldots, y_n$,
or, if we wish to refer to a typical sample member, by $y_i$ ($i = 1, 2,
\ldots, n$). Note that the sample will not consist of the first n units in
the population, except in the instance, usually rare, in which these
units happen to be drawn. If this point is kept in mind, my experience
has been that no confusion need result.
Capital letters will refer to characteristics of the population, and
lower case letters to those of the sample. For totals and means we
have the following definitions:

        Population value                                          Sample value
Total:  $Y = y_1 + y_2 + \cdots + y_N$                            $y = y_1 + y_2 + \cdots + y_n$                            (2.1)
Mean:   $\bar{Y} = \dfrac{y_1 + y_2 + \cdots + y_N}{N} = \dfrac{Y}{N}$   $\bar{y} = \dfrac{y_1 + y_2 + \cdots + y_n}{n} = \dfrac{y}{n}$   (2.2)

One unusual feature of this notation is the use of the symbol $y$ to
denote the sample total of the values $y_i$. In statistical literature, $y$
serves as a general symbol for the variate itself, as in the phrase the
frequency distribution of $y$. Instead, we shall refer to the frequency
distribution of $y_i$, reserving the symbol $y$ for the sample total.
Although sampling is undertaken for many different purposes, interest
centers most frequently on three characteristics of the population.
The first is the total $Y$ of the values for some item over all units
in the population (e.g. the total number of acres of wheat in a region).
The second is the average value $\bar{Y}$ per unit (e.g. the average number
of acres of wheat per farm). The third is the proportion or percentage
of units which fall into some defined class (e.g. the percentage of
farms growing no wheat). Estimation of the population total and
mean will be considered in this chapter.
The symbol ^ is used to denote an estimate of a population characteristic
made from a sample. In this chapter only the simplest types
of estimate are considered, as follows:

                    Estimate
Population mean:    $\hat{\bar{Y}} = \bar{y}$ = sample mean
Population total:   $\hat{Y} = N\bar{y} = Ny/n$

The factor $N/n$ by which the sample total is multiplied is called variously
the expansion or raising or inflation factor. Its inverse, $n/N$,
is of course the ratio of the size of the sample to that of the population,
and is called the sampling ratio or the sampling fraction.

2.3 Properties of the estimates. The precision of any estimate made
from a sample depends both on the method by which the estimate is
calculated from the sample data, and on the plan of sampling. To
save space we sometimes write of "the precision of the sample mean"
or "the precision of simple random sampling," without specifically
mentioning the other fundamental factor. This has been done, we
hope, only in instances in which it is clear from the context what the
missing factor is. When studying any formula that is presented, the
reader should make sure that he knows the specific method of sampling
and method of estimation for which the formula has been established.
In this book, a method of estimation is called consistent if the estimate
becomes exactly equal to the population value when $n = N$,
that is, when the sample consists of the whole population. For simple
random sampling it is obvious that $\bar{y}$ and $N\bar{y}$ are consistent estimates
of the population mean and total, respectively. Consistency is
a desirable property of estimates. On the other hand, an inconsistent
estimate is not necessarily useless, since it may give satisfactory precision
when n is small compared to N. Its utility is likely to be confined
to this situation.
In statistical theory the notion of consistency has been discussed
mainly for infinite populations. The usual definition is that $\bar{y}$ is a
consistent estimate of $\bar{Y}$ if for any $\epsilon > 0$,
$$\lim_{n \to \infty} \Pr\{|\bar{y} - \bar{Y}| > \epsilon\} = 0$$
There is more than one way in which this definition can be adapted so
as to apply to a finite population, and the definition which we have
given may not be the most useful one for studying the properties of
estimates in large samples. However, the idea of consistency does not
play an important part in the subsequent exposition.
As we have seen, a method of estimation is unbiased if the average
value of the estimate, taken over all possible samples of given size n,
is exactly equal to the true population value. If the method is to be
unbiased without qualification, this result must hold for any population
of finite values $y_i$ and for any n. To investigate whether $\bar{y}$ is unbiased
with simple random sampling, we calculate the value of $\bar{y}$ for
all ${}_N C_n$ samples and find the average of the estimates. The symbol
E denotes this average over all possible samples.

Theorem 2.1 The sample mean $\bar{y}$ is an unbiased estimate of $\bar{Y}$.

Proof: By its definition
$$E\bar{y} = \frac{\sum \bar{y}}{{}_N C_n} = \frac{\sum (y_1 + y_2 + \cdots + y_n)}{n \, \dfrac{N!}{n!(N - n)!}} \qquad (2.3)$$
where the sum extends over all ${}_N C_n$ samples. To evaluate this sum,
we find out in how many samples any specific value $y_i$ appears. Since
there are $(N - 1)$ other units available for the rest of the sample and
$(n - 1)$ other places to fill in the sample, the number of samples containing
$y_i$ is
$${}_{(N-1)} C_{(n-1)} = \frac{(N - 1)!}{(n - 1)!(N - n)!}$$
Hence
$$\sum (y_1 + y_2 + \cdots + y_n) = \frac{(N - 1)!}{(n - 1)!(N - n)!} (y_1 + y_2 + \cdots + y_N)$$
From (2.3) this gives
$$E\bar{y} = \frac{(N - 1)!}{(n - 1)!(N - n)!} \cdot \frac{n!(N - n)!}{n \, N!} (y_1 + y_2 + \cdots + y_N) = \frac{y_1 + y_2 + \cdots + y_N}{N} = \bar{Y} \qquad (2.4)$$

Corollary. $\hat{Y} = N\bar{y}$ is an unbiased estimate of the population total $Y$.

A less cumbersome proof of theorem 2.1 is obtained as follows.
Since every unit appears in the same number of samples, it is clear
that
$$E(y_1 + y_2 + \cdots + y_n) \text{ must be some multiple of } (y_1 + y_2 + \cdots + y_N) \qquad (2.5)$$
The multiplier must be $n/N$, since the expression on the left has n
terms and that on the right has N terms. This leads to the result.
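
Both steps of the argument can be checked by enumeration on a small population. The values below are hypothetical; exact fractions are used so that the equality is not blurred by rounding.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

population = [3, 7, 8, 12, 20]  # hypothetical values y_1..y_N
N, n = len(population), 3

# Enumerate samples by unit index so that equal values are not confused.
samples = list(combinations(range(N), n))

# Each unit should appear in (N-1)C(n-1) of the NCn samples.
appearances = [sum(1 for s in samples if i in s) for i in range(N)]

# E(ybar): average of the sample mean over all equally likely samples.
sample_means = [Fraction(sum(population[i] for i in s), n) for s in samples]
expected_mean = sum(sample_means) / len(samples)
population_mean = Fraction(sum(population), N)

print(appearances, comb(N - 1, n - 1))
print(expected_mean, population_mean)
```

The count of appearances is the same for every unit, which is the symmetry used in the shorter proof.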

2.4 Variances of the estimates. The variance of the $y_i$ in a finite
population is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N} \qquad (2.6)$$
As a matter of notation, results will be presented in terms of a slightly
different expression, in which the divisor $(N - 1)$ is used instead of
$N$. We take
$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N - 1} \qquad (2.7)$$
This convention has been used by those who approach sampling theory
by means of the analysis of variance. Its advantage is that most results
take a slightly simpler form. Provided that the same notation
is maintained consistently, all results are equivalent in either notation.
We now consider the variance of $\bar{y}$. By this we mean $E(\bar{y} - \bar{Y})^2$
taken over all ${}_N C_n$ samples.
Theorem 2.2 The variance of the mean $\bar{y}$ from a simple random
sample is
$$V(\bar{y}) = E(\bar{y} - \bar{Y})^2 = \frac{S^2}{n} \frac{(N - n)}{N} \qquad (2.8)$$
Proof:
$$n(\bar{y} - \bar{Y}) = (y_1 - \bar{Y}) + (y_2 - \bar{Y}) + \cdots + (y_n - \bar{Y}) \qquad (2.9)$$
By the same argument of symmetry as used in relation (2.5), it
follows that
$$E[(y_1 - \bar{Y})^2 + \cdots + (y_n - \bar{Y})^2] = \frac{n}{N} [(y_1 - \bar{Y})^2 + \cdots + (y_N - \bar{Y})^2] \qquad (2.10)$$
and also that
$$E[(y_1 - \bar{Y})(y_2 - \bar{Y}) + (y_1 - \bar{Y})(y_3 - \bar{Y}) + \cdots + (y_{n-1} - \bar{Y})(y_n - \bar{Y})]$$
$$= \frac{n(n - 1)}{N(N - 1)} [(y_1 - \bar{Y})(y_2 - \bar{Y}) + (y_1 - \bar{Y})(y_3 - \bar{Y}) + \cdots + (y_{N-1} - \bar{Y})(y_N - \bar{Y})] \qquad (2.11)$$
In equation (2.11) the sums of products extend over all pairs of units
in the sample and population, respectively. The sum on the left contains
$n(n - 1)/2$ terms, and that on the right $N(N - 1)/2$ terms.
Now square (2.9) and average over all simple random samples.
Using (2.10) and (2.11), we obtain
$$n^2 E(\bar{y} - \bar{Y})^2 = \frac{n}{N} \left[ (y_1 - \bar{Y})^2 + \cdots + (y_N - \bar{Y})^2 + \frac{2(n - 1)}{(N - 1)} \{ (y_1 - \bar{Y})(y_2 - \bar{Y}) + \cdots + (y_{N-1} - \bar{Y})(y_N - \bar{Y}) \} \right]$$
Completing the square on the cross-product term, we have
$$n^2 E(\bar{y} - \bar{Y})^2 = \frac{n}{N} \left[ \left( 1 - \frac{n - 1}{N - 1} \right) \{ (y_1 - \bar{Y})^2 + \cdots + (y_N - \bar{Y})^2 \} + \frac{(n - 1)}{(N - 1)} \{ (y_1 - \bar{Y}) + \cdots + (y_N - \bar{Y}) \}^2 \right]$$
The second term inside the square bracket vanishes, since the sum of
the $y_i$ equals $N\bar{Y}$. Division by $n^2$ gives
$$V(\bar{y}) = E(\bar{y} - \bar{Y})^2 = \frac{(N - n)}{nN(N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})^2 = \frac{S^2}{n} \frac{(N - n)}{N}$$

Corollary 1 The standard error of $\bar{y}$ is
$$\sigma_{\bar{y}} = \frac{S}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.12)$$

Corollary 2 The variance of $\hat{Y} = N\bar{y}$, as an estimate of the population
total $Y$, is
$$V(\hat{Y}) = E(\hat{Y} - Y)^2 = \frac{N^2 S^2}{n} \frac{(N - n)}{N} \qquad (2.13)$$

Corollary 3 The standard error of $\hat{Y}$ is
$$\sigma_{\hat{Y}} = \frac{NS}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.14)$$
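
Formula (2.8) can be compared with a direct calculation over all possible samples. The sketch below uses a hypothetical population of five values and exact fractions.

```python
from fractions import Fraction
from itertools import combinations

population = [3, 7, 8, 12, 20]  # hypothetical values
N, n = len(population), 3
Ybar = Fraction(sum(population), N)

# S^2 uses the divisor N - 1, as in equation (2.7).
S2 = sum((y - Ybar) ** 2 for y in population) / (N - 1)

# Direct calculation: average of (ybar - Ybar)^2 over all samples.
samples = list(combinations(range(N), n))
direct = sum(
    (Fraction(sum(population[i] for i in s), n) - Ybar) ** 2 for s in samples
) / len(samples)

# Formula (2.8): V(ybar) = (S^2 / n) (N - n) / N.
formula = S2 / n * Fraction(N - n, N)

print(direct, formula)
```

Multiplying either quantity by $N^2$ gives the variance of the estimated total, as in (2.13).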

2.5 The finite population correction. For a random sample of size n
from an infinite population, it is well known that the variance of the
mean is $\sigma^2/n$. The only change in this result when the population is
finite is the introduction of the extra factor, $(N - n)/N$. The factors
$(N - n)/N$ for the variance and $\sqrt{(N - n)/N}$ for the standard error
are called the finite population corrections (fpc). They are given with
a divisor $(N - 1)$ in place of $N$ by writers who present results in
terms of $\sigma$. Provided that the sampling fraction $n/N$ remains low,
these factors are close to unity, and the size of the population as such
has no direct effect on the standard error of the sample mean. For
instance, if S is the same in the two populations, a sample of 500 from
a population of 200,000 gives almost as precise an estimate of the popu-
lation mean as a sample of 500 from a population of 10,000. Persons
unfamiliar with sampling often find this result very difficult to be-
lieve, and indeed it is remarkable. To them it seems intuitively ob-
vious that, if information has been obtained about only a very small
fraction of the population, the sample mean just cannot be accurate.
It is instructive for the reader to consider why this point of view is
erroneous.
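
The comparison of the two populations can be made concrete with equation (2.12). The value of S below is hypothetical and is taken to be the same in both populations, as the text assumes.

```python
import math

def se_mean(S, n, N):
    """Standard error of the sample mean, equation (2.12)."""
    return S / math.sqrt(n) * math.sqrt((N - n) / N)

S = 10.0  # hypothetical population standard deviation
for N in (10_000, 200_000):
    print(N, round(se_mean(S, 500, N), 4))

# Ratio of the two standard errors for n = 500.
print(round(se_mean(S, 500, 10_000) / se_mean(S, 500, 200_000), 3))
```

The two standard errors differ by only a few per cent, although one population is twenty times the size of the other.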
In practice the fpc can be ignored whenever the sampling fraction
does not exceed 5 per cent, and for many purposes even if it is as high
as 10 per cent. The effect of ignoring the correction is to overestimate
the standard error of the estimate $\bar{y}$.
Ingenious methods for developing sampling theory for a finite popu-
lation have been given by Cornfield (1944), Tukey (1950), and Wishart
(1952) .
The following theorem, which is an extension of theorem 2.2, is
not required for the discussion in this chapter, but is proved here for
later reference.
Theorem 2.3 If $y_i$, $x_i$ are a pair of variates defined on every unit
in the population, and $\bar{y}$, $\bar{x}$ are the corresponding means from a simple
random sample of size n, then their covariance
$$E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{(N - n)}{nN} \frac{1}{(N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X}) \qquad (2.15)$$
This theorem reduces to theorem 2.2 if the variates $y_i$, $x_i$ are equal on
every unit.
Proof: Apply theorem 2.2 to the variate $u_i = y_i + x_i$. The population
mean of $u_i$ is $\bar{U} = \bar{Y} + \bar{X}$, and theorem 2.2 gives
$$E(\bar{u} - \bar{U})^2 = \frac{(N - n)}{nN} \frac{1}{(N - 1)} \sum_{i=1}^{N} (u_i - \bar{U})^2$$
i.e.
$$E[(\bar{y} - \bar{Y}) + (\bar{x} - \bar{X})]^2 = \frac{(N - n)}{nN} \frac{1}{(N - 1)} \sum_{i=1}^{N} [(y_i - \bar{Y}) + (x_i - \bar{X})]^2 \qquad (2.16)$$
Expand the quadratic terms on both sides. By theorem 2.2,
$$E(\bar{y} - \bar{Y})^2 = \frac{(N - n)}{nN} \frac{1}{(N - 1)} \sum_{i=1}^{N} (y_i - \bar{Y})^2$$
with a similar relation for $E(\bar{x} - \bar{X})^2$. Hence these two terms cancel
on the left and right sides of equation (2.16). The result of the
theorem, equation (2.15), follows from the cross-product terms.

2.6 Estimation of the standard error from a sample. The formulas
for the standard errors of the estimated population mean and total
are used primarily for three purposes: (i) to compare the precision
obtained by simple random sampling with that given by other methods
of sampling, (ii) to estimate the size of sample needed in a survey
which is being planned, (iii) to estimate the precision actually attained
in a survey that has been completed. The formulas involve
$S^2$, the population variance. In practice this will not be known, but
it can be estimated from the sample data. The relevant result is
stated in theorem 2.4.

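Theorem 2.4 For a simple random sample
$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$
is an unbiased estimate of
$$S^2 = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N - 1}$$

Proof: We may write
$$s^2 = \frac{1}{(n - 1)} \sum_{i=1}^{n} \{ (y_i - \bar{Y}) - (\bar{y} - \bar{Y}) \}^2 = \frac{1}{(n - 1)} \left[ \sum_{i=1}^{n} (y_i - \bar{Y})^2 - n(\bar{y} - \bar{Y})^2 \right]$$
Now average over all simple random samples of size n. By the argument
of symmetry used in theorem 2.2,
$$E \left\{ \sum_{i=1}^{n} (y_i - \bar{Y})^2 \right\} = \frac{n}{N} \sum_{i=1}^{N} (y_i - \bar{Y})^2 = \frac{n(N - 1)}{N} S^2$$
by the definition of $S^2$. Further, by theorem 2.2,
$$E \{ n(\bar{y} - \bar{Y})^2 \} = \frac{(N - n)}{N} S^2$$
Hence
$$E(s^2) = \frac{S^2}{(n - 1)N} \{ n(N - 1) - (N - n) \} = S^2 \qquad (2.17)$$

Corollary. Unbiased estimates of the variances of $\bar{y}$ and $\hat{Y}$ are
$$v(\bar{y}) = s_{\bar{y}}^2 = \frac{s^2}{n} \frac{(N - n)}{N} \qquad (2.18)$$
$$v(\hat{Y}) = s_{\hat{Y}}^2 = \frac{N^2 s^2}{n} \frac{(N - n)}{N} \qquad (2.19)$$
For the standard errors we take
$$s_{\bar{y}} = \frac{s}{\sqrt{n}} \sqrt{\frac{N - n}{N}}, \qquad s_{\hat{Y}} = \frac{Ns}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.20)$$
These estimates are slightly biased: for most applications the bias is
unimportant.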
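
Theorem 2.4 can also be checked by averaging $s^2$ over every possible sample from a small population. The population values are hypothetical, and the helper `variance` is an illustrative name.

```python
from fractions import Fraction
from itertools import combinations

population = [3, 7, 8, 12, 20]  # hypothetical values
N, n = len(population), 3

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((v - mean) ** 2 for v in values) / divisor

S2 = variance(population, N - 1)

# Average of s^2 (divisor n - 1) over all equally likely samples.
samples = [[population[i] for i in idx] for idx in combinations(range(N), n)]
expected_s2 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(expected_s2, S2)
```

The agreement is exact for $s^2$. It does not carry over to $s$ itself, which is why the estimated standard errors are described as slightly biased.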
The reader should note the symbols employed for true and estimated
variances of the estimates. Thus for $\bar{y}$ we write
True variance: $V(\bar{y}) = \sigma_{\bar{y}}^2$
Estimated variance: $v(\bar{y}) = s_{\bar{y}}^2$
The notation is a little redundant, but it is convenient to have
separate symbols $V$ and $v$ for variance, and $\sigma$ and $s$ for standard error.

2.7 Confidence limits. It is usually assumed that the estimates $\bar{y}$
and $\hat{Y}$ are normally distributed about the corresponding population
values. The reasons for this assumption and its limitations are considered
in section 2.8. If the assumption holds, lower and upper confidence
limits for the population mean and total are as follows:
Mean:
$$\hat{\bar{Y}}_L = \bar{y} - \frac{ts}{\sqrt{n}} \sqrt{\frac{N - n}{N}}, \qquad \hat{\bar{Y}}_U = \bar{y} + \frac{ts}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.21)$$
Total:
$$\hat{Y}_L = N\bar{y} - \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}}, \qquad \hat{Y}_U = N\bar{y} + \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}} \qquad (2.22)$$
The symbol $t$ is the value of the normal deviate corresponding to the
desired confidence probability. The most common values are:

Confidence probability (%)   50     80     90     95     99
$t$                          0.67   1.28   1.64   1.96   2.58

If the sample size is less than 30, the percentage points may be taken
from Student's t-table with $(n - 1)$ degrees of freedom, these being
the degrees of freedom in the estimated variance $s^2$. (The t-distribution
holds exactly only if the observations $y_i$ are themselves normally
distributed and N is infinite.) Moderate departures from normality do
not affect it greatly. For small samples with very skew distributions,
special methods are needed.
Example. Signatures to a petition were collected on 676 sheets.
Each sheet had enough space for 42 signatures, but on many sheets a
smaller number of signatures had been collected. The numbers of
signatures per sheet were counted on a random sample of 50 sheets
(about a 7 per cent sample), with the results shown in table 2.1.
Estimate the total number of signatures to the petition and the 80
per cent confidence limits.
The sampling unit is a sheet, and the observations $y_i$ are the numbers
of signatures per sheet. Since about half the sheets had the maximum
number of signatures, 42, the data are presented as a frequency
distribution. Notice that the original distribution appears to be far
from normal, the greatest frequency being at the upper end. Nevertheless
there is reason to believe from experience that the means of
samples of 50 are approximately normally distributed.
We find
$$n = \sum f_i = 50; \qquad y = \sum f_i y_i = 1471; \qquad \sum f_i y_i^2 = 54{,}497$$
Hence the estimated total number of signatures is
$$\hat{Y} = N\bar{y} = \frac{(676)(1471)}{50} = 19{,}888$$
For the sample variance $s^2$ we have
$$s^2 = \frac{1}{(n - 1)} \sum f_i (y_i - \bar{y})^2 = \frac{1}{(n - 1)} \left\{ \sum f_i y_i^2 - \frac{(\sum f_i y_i)^2}{\sum f_i} \right\} = \frac{1}{49} \left\{ 54{,}497 - \frac{(1471)^2}{50} \right\} = 229.0$$
From equation (2.22) the 80 per cent confidence limits are
$$19{,}888 \pm \frac{tNs}{\sqrt{n}} \sqrt{\frac{N - n}{N}} = 19{,}888 \pm \frac{(1.28)(676)(15.13)\sqrt{1 - 0.0740}}{\sqrt{50}} = 19{,}888 \pm 1781$$
This gives 18,107 and 21,669 for the 80 per cent limits. A complete
count which was made showed 21,045 signatures.

TABLE 2.1 RESULTS FOR A SAMPLE OF 50 PETITION SHEETS

Number of signatures   Frequency
$y_i$                  $f_i$
42                     23
41                     4
36                     1
32                     1
29                     1
27                     2
23                     1
19                     1
16                     2
15                     2
14                     1
11                     1
10                     1
9                      1
7                      1
6                      3
5                      2
4                      1
3                      1

Total                  50
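
The petition calculation can be reproduced from the frequency distribution. In this sketch the frequencies for 14, 11, 10, and 9 signatures are each taken as 1, the values that reproduce the stated totals $n = 50$ and $\sum f_i y_i = 1471$; the variable names are illustrative.

```python
import math

# Frequency distribution: signatures per sheet -> number of sheets.
freq = {42: 23, 41: 4, 36: 1, 32: 1, 29: 1, 27: 2, 23: 1, 19: 1, 16: 2,
        15: 2, 14: 1, 11: 1, 10: 1, 9: 1, 7: 1, 6: 3, 5: 2, 4: 1, 3: 1}
N = 676   # sheets in the petition
t = 1.28  # normal deviate for 80 per cent confidence

n = sum(freq.values())
total = sum(f * y for y, f in freq.items())
sum_sq = sum(f * y * y for y, f in freq.items())

s2 = (sum_sq - total ** 2 / n) / (n - 1)  # sample variance
Y_hat = N * total / n                     # estimated total, N * ybar
fpc = math.sqrt((N - n) / N)              # finite population correction
half_width = t * N * math.sqrt(s2) / math.sqrt(n) * fpc

print(n, total, sum_sq)
print(round(Y_hat), round(s2, 1))
print(Y_hat - half_width, Y_hat + half_width)
```

Carried at full precision, the half-width and limits agree with the figures in the text to within about one signature; the small difference comes from the rounding of $s$ and of the sampling fraction in the hand calculation.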

2.8 Validity of the normal approximation. Confidence that the normal
approximation is adequate in most practical situations comes
from a variety of sources. In the theory of probability, much study
has been made of the distribution of means of random samples. It
has been proved that for any population which has a finite standard
deviation the distribution of the sample mean tends to normality as
n increases (see e.g. Feller, 1950). This work relates to infinite populations.
Madow (1948) proved that for a large class of finite populations
the distribution of the sample mean tends to normality even if
the sampling ratio $n/N$ is not negligible and sampling is without replacement.
Madow stipulates that n and N both tend to infinity in
such a way that the ratio $n/N$ remains less than some number $r < 1$.
His results would apply, for example, even if the sampling ratio were
95 per cent.

FIGURE 2.1 Frequency distribution of sizes of 196 United States cities in 1920.
(Horizontal axis: city size in thousands, 100 to 1100.)

This imposing body of knowledge leaves something to be desired.
It is not easy to answer the direct question: "For this population, how
large must n be so that the normal approximation is accurate enough?"
Non-normal distributions vary greatly both in the nature and in the
degree of their departure from normality. In sampling practice, it
cannot be assumed that the frequency distributions which are encountered
will all be reasonably close to normality. The distributions
of many types of economic enterprise (stores, chicken farms, towns)
exhibit a marked positive skewness, with a few large units and many
small units. The same kind of skewness is displayed by some biological
populations (e.g. the number of rats or flies per city block).
FIGURE 2.2 Frequency distribution of totals of 200 simple random samples with
$n = 49$. (Horizontal axis: millions.)

As an illustration of a positively skewed distribution, figure 2.1
shows the frequency distribution of the numbers of inhabitants in
196 large United States cities in 1920. (The 4 largest cities, New
York, Chicago, Philadelphia, and Detroit, were omitted. Their inclusion
would necessitate extending the horizontal scale to over 5
times the length shown, and would, of course, greatly accentuate the
skewness.) Figure 2.2 shows the frequency distribution of the total
number of inhabitants in each of 200 simple random samples, with
$n = 49$, drawn from this population. The distribution of the sample
totals, and likewise of the means, is much more similar to a normal
curve, but still displays some positive skewness.
In any discussion of the validity of the normal approximation, we
must define what it means to say that the normal approximation is
"accurate enough." In sample surveys, the normal approximation is
used primarily to calculate confidence limits. When 95 per cent confidence
limits are computed for the population mean $\bar{Y}$ by the normal
approximation, we make the following statement:
$$\bar{y} - 1.96 s_{\bar{y}} < \bar{Y} < \bar{y} + 1.96 s_{\bar{y}} \qquad (2.23)$$
With repeated sampling, we claim that statements of this kind will
be wrong only 5 per cent of the time. Consequently, we might say
that the normal approximation is accurate enough if such statements
are in fact wrong between 4 and 6 per cent of the time. The choice
of the numbers 4 and 6 is arbitrary: some workers may be satisfied
with wider limits.
From the study of theoretical distributions that are skewed and
from the results of sampling experiments on actual skewed populations,
some statements can be made about what usually happens to
confidence probabilities when we sample from positively skew populations.
The sample size is assumed large enough so that the distribution
of $\bar{y}$ shows some approach to normality, as in figure 2.2. The
statements are as follows:
i. The frequency with which the assertion
$$\bar{y} - 1.96 s_{\bar{y}} < \bar{Y} < \bar{y} + 1.96 s_{\bar{y}}$$
is wrong, is usually slightly higher than 5 per cent.
ii. The frequency with which
$$\bar{Y} > \bar{y} + 1.96 s_{\bar{y}}$$
is greater than 2.5 per cent.
iii. The frequency with which
$$\bar{Y} < \bar{y} - 1.96 s_{\bar{y}}$$
is less than 2.5 per cent.
As an illustration, consider the Poisson distribution. The variate
IIi takes the values 0, 1, 2, . .. , the probability that IIi has the value
u being
e-mm"
Pr(Yi = u) = - -
u!
For the Poisson distribution, the distribution of the total II of a simple
random sample of size n is known to be a Poisson distribution with
parameter m' - nm. From tables of this distribution (Molina, 1949)
we can therefore find out how well the nonnal approximation to the
confidence limits works for different values of n and m.
Let us take m '"' 0.25, n ""' 400. For m = 0.25, the probabilities that
1/. - 0, 1, 2,3, and 4 are, respectively, 0.7788, 0.1947, 0.0243, 0.0020,
and 0.0001. The original distribution is obviously extremely skew.
The sample total II follows a Poisson distribution with parameter
m' = (400)(0.25) ""' 100. For this distribution, there are theoreti-
cal results to the effect that
E(1/) - m: tT,/ - E(lI - m')2 - m'
2.8 VALIDITY OF THE NORMAL APPROXIMATION 25

Consequently y is an unbiased estimate of m'. As a sample estimate


of the standard error of y, we take
8v = vy
Hence, 95 per cent confidence limits for m' are constructed by the
normal approximation as
y - 1.96Vy < m' < y + 1.9GVy
Consider the probability that this statement is wrong. If $y$ is 82, the
upper limit $(y + 1.96\sqrt{y})$ turns out to be 99.7; and if $y$ is 83, the
upper limit is 100.9. Thus the stated upper limit is too low (since $m'$
actually is 100) whenever $y$ is 82 or less in sampling from a Poisson
distribution with $m' = 100$. From Molina's tables the probability
that $y$ is 82 or less is found to be 0.0369. Similarly, the lower limit in
the statement is found to be too high whenever $y$ is 122 or greater.
The corresponding probability is 0.0181. To summarize,
Pr(stated upper limit too low)  = 0.0369
Pr(stated lower limit too high) = 0.0181
Pr(confidence statement wrong)  = 0.0550
The total probability of being wrong is satisfactorily close to 0.05,
but in about 70 per cent of the statements that are wrong, the true
$m'$ is higher than the stated upper limit.
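
The probabilities quoted from Molina's tables can also be checked by
direct summation of Poisson terms. The following is a minimal sketch
using only the Python standard library; the function and variable names
are illustrative assumptions and do not come from the text.

```python
import math

def poisson_pmf(k, mean):
    """Probability that a Poisson variate with the given mean equals k."""
    # Work on the log scale so that large factorials do not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))

def normal_limits(y, t=1.96):
    """Normal-approximation limits y -/+ t*sqrt(y) for the Poisson mean."""
    half_width = t * math.sqrt(y)
    return y - half_width, y + half_width

TRUE_MEAN = 100.0        # m' = nm = (400)(0.25)
TOTALS = range(0, 401)   # the probability beyond 400 is negligible here

# Sample totals y for which each stated limit misses the true mean.
upper_too_low = [y for y in TOTALS if normal_limits(y)[1] < TRUE_MEAN]
lower_too_high = [y for y in TOTALS if normal_limits(y)[0] > TRUE_MEAN]

p_upper_too_low = sum(poisson_pmf(y, TRUE_MEAN) for y in upper_too_low)
p_lower_too_high = sum(poisson_pmf(y, TRUE_MEAN) for y in lower_too_high)
p_wrong = p_upper_too_low + p_lower_too_high

print("upper limit too low when y <=", max(upper_too_low))
print("lower limit too high when y >=", min(lower_too_high))
print(p_upper_too_low, p_lower_too_high, p_wrong)
```

The two cut-off values should come out as 82 and 122, and the two tail
probabilities should agree closely with the tabulated 0.0369 and 0.0181.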
The result appears to be typical. If we are interested only in the
absolute value of the error of estimate, a fair amount of positive
skewness in the distribution of $\bar{y}$ can be tolerated, but if the frequency
with which $\bar{Y}$ exceeds the upper confidence limit is to be close to 2.5
per cent, the normal approximation is not trustworthy unless very
little skewness remains in the distribution of $\bar{y}$.
There is no safe general rule as to how large $n$ must be for use of
the normal approximation in computing confidence limits. For populations
in which the principal deviation from normality consists of
marked positive skewness, a crude rule which I have occasionally
found useful is
$$n > 25 G_1^2$$
where $G_1$ is Fisher's measure of skewness (Fisher, 1932).
$$G_1 = \frac{E(y_i - \bar{Y})^3}{\sigma^3} = \frac{1}{N\sigma^3} \sum_{i=1}^{N} (y_i - \bar{Y})^3$$
This rule is designed so that a 95 per cent confidence probability
statement will be wrong not more than 6 per cent of the time. It is
derived mathematically by assuming that any disturbance due to moments
of the distribution of $\bar{y}$ higher than the third is negligible. The
rule attempts to control only the total frequency of wrong statements,
ignoring the direction of the error of estimate.
By calculating $G_1$, or an estimate, for a specific population, we can
obtain a rough idea of the sample size needed for application of the
TABLE 2.2 FREQUENCY DISTRIBUTION OF ACRES IN CROPS ON 556 FARMS

Class intervals   Coded scale   Frequency
    (acres)          y_i           f_i      f_i y_i   f_i y_i^2   f_i y_i^3

    0-29            -0.9            47       -42.3       38.1       -34.3
   30-63             0             143         0          0           0
   64-97             1             154       154        154         154
   98-131            2              82       164        328         656
  132-165            3              62       186        558       1,674
  166-199            4              33       132        528       2,112
  200-233            5              13        65        325       1,625
  234-267            6               6        36        216       1,296
  268-301            7               4        28        196       1,372
  302-335            8               6        48        384       3,072
  336-369            9               2        18        162       1,458
  370-403           10               0         0          0           0
  404-437           11               2        22        242       2,662
  438-471           12               0         0          0           0
  472-505           13               2        26        338       4,394

  Totals                           556       836.7    3,469.1    20,440.7

$$E(y_i) = \bar{Y} = \frac{836.7}{556} = 1.50486$$
$$E(y_i^2) = \frac{3469.1}{556} = 6.23939$$
$$E(y_i^3) = \frac{20{,}440.7}{556} = 36.76385$$
$$\sigma^2 = E(y_i^2) - \bar{Y}^2 = 3.97479$$
$$\kappa_3 = E(y_i - \bar{Y})^3 = E(y_i^3) - 3E(y_i^2)\bar{Y} + 2\bar{Y}^3 = 15.411$$
$$G_1 = \frac{\kappa_3}{\sigma^3} = \frac{15.411}{7.925} = 1.9$$
normal approximation to compute confidence limits. The result
should be checked by sampling experiments whenever possible.
Example. The data in table 2.2 show the numbers of acres devoted
to crops in 556 farms in Seneca County, New York. The data come
from a series of studies by West (1951), who drew repeated samples of
size 100 from this population and examined the frequency distributions
of $\bar{y}$, $s$, and Student's $t$ for several items of interest in farm
management surveys.
The computation of $G_1$ is shown under the table. The computations
are made on a coded scale, and since $G_1$ is a pure number, there is no
need to return to the original scale. Note that the first class-interval
was slightly different from the others.
Since $G_1 = 1.9$, we take as a suggested minimum $n$
$$n = (25)(1.9)^2 = 90$$
For samples of size 100, West found with this item (acres in crops)
that neither the distribution of $\bar{y}$ nor that of Student's $t$ differed
significantly from the corresponding theoretical normal distributions.
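
The arithmetic under table 2.2 can be reproduced in a few lines. The
sketch below is an illustration only, using the coded class marks and
frequencies from the table; the helper name `raw_moment` is an
assumption, not something defined in the text.

```python
# Coded class marks y_i and frequencies f_i from table 2.2.
coded = [-0.9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
freq = [47, 143, 154, 82, 62, 33, 13, 6, 4, 6, 2, 0, 2, 0, 2]

def raw_moment(values, weights, power):
    """Frequency-weighted mean of values**power."""
    total = sum(weights)
    return sum(w * v ** power for v, w in zip(values, weights)) / total

mean = raw_moment(coded, freq, 1)        # E(y_i)
m2 = raw_moment(coded, freq, 2)          # E(y_i^2)
m3 = raw_moment(coded, freq, 3)          # E(y_i^3)

variance = m2 - mean ** 2                # sigma^2
kappa3 = m3 - 3 * m2 * mean + 2 * mean ** 3   # third central moment
g1 = kappa3 / variance ** 1.5            # Fisher's skewness measure G1
suggested_n = 25 * g1 ** 2               # crude rule n > 25 * G1^2

print(mean, variance, kappa3, g1, suggested_n)
```

Carrying unrounded figures gives $G_1$ a little above 1.9, so the
suggested minimum comes out somewhat above the 90 obtained in the text
from the rounded value; the difference is immaterial for a crude rule.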
Good sampling practice tends to make the normal approximation
more valid. Failure of the normal approximation occurs mostly when
the population contains some extreme individuals which dominate the
sample average when they are present. However, these extremes also
have a much more serious effect of increasing the variance of the sam-
ple and decreasing the precision. Consequently, it is wise to segregate
them and make separate plans for coping with them, perhaps by tak-
ing a complete enumeration of them if they are not numerous. This
removal of the extremes from the main body of the population reduces
the skewness and improves the normal approximation. This
technique is an example of stratified sampling, which is discussed in
chapter 5.

2.9 Effect of non-normality on the estimated variance. One effect of
non-normality is that the estimated variance $s^2$ may be more highly
variable from sample to sample than we expect if we assume that we
are sampling from a normal distribution. For any infinite population,
the variance of $s^2$ in random samples of size $n$ is (Fisher, 1932)
$$V(s^2) = \frac{2\sigma^4}{n - 1} + \frac{\kappa_4}{n} \tag{2.24}$$
The first term after the equality sign is the value which the variance
of $s^2$ has when the parent distribution is normal. The second term
represents the effect of non-normality. The quantity $\kappa_4$ is Fisher's
fourth cumulant (Fisher, 1932) and is given by
$$\kappa_4 = E(y_i - \bar{Y})^4 - 3\sigma^4$$
Note that skewness in the original distribution, as measured by $G_1$,
does not affect the stability of $s^2$: the important factor is the fourth
moment in the parent population.
The cumulant $\kappa_4$ is zero for a normal distribution. It may take
either positive or negative values in other distributions, but in those
encountered in sampling practice, $\kappa_4$ appears to be positive much
more often than negative, and may have a high value for some parent
distributions.
We may write (2.24) as
$$V(s^2) = \frac{2\sigma^4}{n - 1}\left(1 + \frac{n - 1}{2n}\,\frac{\kappa_4}{\sigma^4}\right) = \frac{2\sigma^4}{n - 1}\left(1 + \frac{n - 1}{2n}\,G_2\right)$$
where $G_2 = \kappa_4/\sigma^4$ is Fisher's measure of kurtosis (loc. cit.). The
quantity inside the parentheses shows the factor by which the variance
of $s^2$ is inflated owing to non-normality. Note that the factor is
almost independent of $n$, so that the inflation remains even with large
samples.
For West's data on farm acres in crops (table 2.2), the value of $G_2$
will be found to be about 6. Thus $V(s^2)$ is close to 4 times as large as
would be assumed if we regarded the original distribution of acres in
crops as normal. In his sampling studies, West found a similar inflation
in the variance of the standard deviation $s$, in 3 items which
he tested. The ratio of $V(s)$ to the theoretical variance of $s$ from a
normal population was 3.7 for acres in crops, 2.1 for total acres operated,
and 13.7 for productive-man-work units. (By theory this ratio
should be roughly the same for $s$ as for $s^2$.)
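
The statement that $G_2$ is about 6 for table 2.2 can be verified
directly. The sketch below is illustrative; it recomputes the fourth
cumulant from the coded data and evaluates the inflation factor
$1 + (n-1)G_2/2n$ for an assumed sample size of 100 (the size West used).
The helper names are assumptions.

```python
# Coded class marks y_i and frequencies f_i from table 2.2.
coded = [-0.9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
freq = [47, 143, 154, 82, 62, 33, 13, 6, 4, 6, 2, 0, 2, 0, 2]

def central_moment(values, weights, power):
    """Frequency-weighted central moment of the given order."""
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    return sum(w * (v - mean) ** power for v, w in zip(values, weights)) / total

variance = central_moment(coded, freq, 2)      # sigma^2
mu4 = central_moment(coded, freq, 4)           # E(y_i - Ybar)^4
kappa4 = mu4 - 3 * variance ** 2               # fourth cumulant
g2 = kappa4 / variance ** 2                    # Fisher's kurtosis measure G2

def inflation_factor(g2_value, n):
    """Factor by which V(s^2) exceeds its value under normality."""
    return 1 + (n - 1) / (2 * n) * g2_value

factor_100 = inflation_factor(g2, 100)
print(g2, factor_100)
```

The factor barely changes when `n` is increased, which illustrates the
remark that the inflation persists in large samples.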
The relevance of these results in practical sampling is that we sometimes
use values of $s^2$ to compare the precision of one method of
sampling with that of another, or to estimate the sample size needed
to attain a specified degree of precision in $\bar{y}$ (see chapter 4). For these
purposes it is well to have some idea of the precision of the estimate
$s^2$, particularly if it has been calculated from rather scanty data. As
the previous results indicate, use of the "normal" formula for appraising
the variance of $s^2$ may give a very misleading impression of the
stability of $s^2$.

2.10 Exercises.
2.1 In a population with $N = 6$, the values of $y_i$ are 8, 3, 1, 11, 4, and 7.
Calculate the sample mean $\bar{y}$ for all possible simple random samples of size 2.
Verify that $\bar{y}$ is an unbiased estimate of $\bar{Y}$ and that its variance is as given in
theorem 2.2.
2.2 For the same population, calculate $s^2$ for all simple random samples of
size 3, and verify that $E(s^2) = S^2$.
2.3 If random samples of size 2 are drawn with replacement from this
population, show by finding all possible samples that $V(\bar{y})$ satisfies the equation
$$V(\bar{y}) = \frac{\sigma^2}{n} = \frac{S^2}{n}\,\frac{(N - 1)}{N}$$
Give a general proof of this result.
2.4 A simple random sample of 30 households was drawn from a city area
containing 14,848 households. The numbers of persons per household in the
sample were as follows:
5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3, 5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4
Estimate the total number of people in the area and compute the probability
that this estimate is within ±10 per cent of the true value.
2.5 The table below shows the numbers of inhabitants in each of the 197
United States cities which had populations over 50,000 in 1940. Calculate
the standard error of the estimated total number of inhabitants in all 197
cities for the following methods of sampling: (i) a simple random sample of
size 50, (ii) a sample which includes the 5 largest cities and is a simple random
sample of size 45 from the remaining 192 cities, (iii) a sample which includes
the 9 largest cities and is a simple random sample of size 41 from the remaining
cities.
FREQUENCY DISTRIBUTION OF CITY SIZES

Size class         Size class         Size class
(1000's)      f    (1000's)      f    (1000's)      f
 50-100     105    550-600       2    ....
100-150      36    600-650       1    1500-1550     1
150-200      13    650-700       2    ....
200-250       6    700-750       0    1600-1650     1
250-300       7    750-800       1    ....
300-350       8    800-850       1    1900-1950     1
350-400       4    850-900       2    ....
400-450       1    900-950       0    3350-3400     1
450-500       3    950-1000      0    ....
500-550       0    1000-1050     0    7450-7500     1
Gaps in the intervals are indicated by ....
2.6 Calculate the coefficient of skewness $G_1$ for the original population
and for the population remaining after removing (i) the 5 largest cities, (ii)
the 9 largest cities.
2.7 With certain populations it is known that the observations $y_i$ are all
zero on a portion $qN$ of the $N$ units $(0 < q < 1)$. Sometimes, with varying
expenditures of effort, these units can be found and listed, so that they need
not be sampled. If $\sigma^2$ is the variance of $y_i$ in the original population, and $\sigma_0^2$
is the variance when all zeros are excluded, show that
$$\sigma_0^2 = \frac{\sigma^2}{p} - \frac{q}{p^2}\,\bar{Y}^2$$
where $p = 1 - q$.
If the population total is estimated from a simple random sample of size
$n$, show that with the exclusion of the "zero" units the fractional reduction
in the variance of the estimate is
$$\frac{q(V^2 + 1)}{V^2}$$
where $V^2 = \sigma^2/\bar{Y}^2$ is the square of the coefficient of variation in the original
population. (For further discussion of this technique, see Jessen and Houseman,
1944.) The fpc may be omitted.

2.11 References.
CORNFIELD, J. (1944). On samples from finite populations. Jour. Amer. Stat. Assoc.,
39, 236-239.
FELLER, W. (1950). An introduction to probability theory and its applications. John
Wiley & Sons, New York.
FISHER, R. A. (1932). Statistical methods for research workers. Oliver and Boyd,
Edinburgh, 4th ed.
JESSEN, R. J., and HOUSEMAN, E. E. (1944). Statistical investigations of farm
sample surveys taken in Iowa, Florida and California. Iowa Agr. Exp. Sta. Res.
Bull. 329.
MADOW, W. G. (1948). On the limiting distributions of estimates based on samples
from finite universes. Ann. Math. Stat., 19, 535-545.
MOLINA, E. C. (1949). Poisson's exponential binomial limit. D. Van Nostrand
Co., New York.
TUKEY, J. W. (1950). Some sampling simplified. Jour. Amer. Stat. Assoc., 45,
501-519.
WEST, Q. M. (1951). The results of applying a simple random sampling process to
farm management data. Agricultural Experiment Station, Cornell University.
WISHART, J. (1952). Moment-coefficients of the k-statistics in samples from a
finite population. Biometrika, 39, 1-13.
CHAPTER 3

SAMPLING FOR PROPORTIONS AND PERCENTAGES

3.1 Qualitative characteristics. Sometimes we wish to estimate the
total number, or the proportion, or the percentage of units in the
population which possess some characteristic or attribute, or fall into
some defined class. Many of the results regularly published from
censuses or surveys are of this form, e.g. numbers of unemployed
persons, percentage of the population that is native-born. The classification
may be introduced directly into the questionnaire, as with
questions that are answered by a simple "yes" or "no." In other
cases the original measurements are more or less continuous, and the
classification is introduced in the tabulation of results. Thus we may
record the respondents' ages to the nearest year, but publish the percentage
of the population aged 60 and over.
Notation. We suppose that every unit in the population falls into
one of the two classes C and C'. The notation is as follows:

Number of units in C in        Proportion of units in C in
Population      Sample         Population      Sample
    A             a             P = A/N        p = a/n

The sample estimate of $P$ is $p$, and the sample estimate of $A$ is $Np$
or $Na/n$.

3.2 Variances of the sample estimates. By means of a simple device
it is possible to apply the theorems established in chapter 2 to this
situation. For any unit in the sample or population, define $y_i$ as 1
if the unit is in C and as 0 if it is in C'. For this population of values
$y_i$, it is clear that
$$Y = \sum_{1}^{N} y_i = A \tag{3.1}$$
$$\bar{Y} = \frac{\sum_{1}^{N} y_i}{N} = \frac{A}{N} = P \tag{3.2}$$
Also, for the sample,
$$\bar{y} = \frac{\sum_{1}^{n} y_i}{n} = \frac{a}{n} = p \tag{3.3}$$
Consequently the problem of estimating $A$ and $P$ can be regarded
as that of estimating the total and mean of a population in which
every $y_i$ is either 1 or 0. In order to use the theorems in chapter 2,
we first express $S^2$ and $s^2$ in terms of $P$ and $p$. Note that
$$\sum_{1}^{N} y_i^2 = A = NP; \qquad \sum_{1}^{n} y_i^2 = a = np$$
Hence,
$$S^2 = \frac{\sum_{1}^{N} (y_i - \bar{Y})^2}{N - 1} = \frac{\sum_{1}^{N} y_i^2 - N\bar{Y}^2}{N - 1} = \frac{1}{(N - 1)}(NP - NP^2) = \frac{N}{(N - 1)}\,PQ \tag{3.4}$$
where $Q = 1 - P$. Similarly
$$s^2 = \frac{\sum_{1}^{n} (y_i - \bar{y})^2}{n - 1} = \frac{n}{(n - 1)}\,pq \tag{3.5}$$
Application of theorems 2.1, 2.2, and 2.4 to this population gives
the following results for simple random sampling of the units that are
being classified.

Theorem 3.1 The sample proportion $p = a/n$ is an unbiased estimate
of the population proportion $P = A/N$.
Theorem 3.2 The variance of $p$ is
$$V(p) = E(p - P)^2 = \frac{S^2}{n}\,\frac{(N - n)}{N} = \frac{PQ}{n}\left(\frac{N - n}{N - 1}\right) \tag{3.6}$$
using equation (3.4).
Corollary 1 If $p$ and $P$ are the sample and population percentages,
respectively, falling into class C, formula (3.6) continues to hold for
the variance of $p$.
Corollary 2 The variance of $\hat{A} = Np$, the estimated total number
of units in class C, is
$$V(\hat{A}) = \frac{N^2 PQ}{n}\left(\frac{N - n}{N - 1}\right) \tag{3.7}$$

Theorem 3.3 An unbiased estimate of the variance of $p$, derived
from the sample, is
$$v(p) = s_p^2 = \frac{(N - n)}{(n - 1)N}\,pq \tag{3.8}$$
Proof: In theorem 2.4, corollary, it was shown that for a continuous
variate $y_i$ an unbiased estimate of the variance of the sample mean $\bar{y}$ is
$$v(\bar{y}) = \frac{s^2}{n}\,\frac{(N - n)}{N} \tag{3.9}$$
For proportions, $p$ takes the place of $\bar{y}$, and in equation (3.5) we showed
that
$$s^2 = \frac{n}{(n - 1)}\,pq \tag{3.10}$$
Hence,
$$v(p) = s_p^2 = \frac{(N - n)}{(n - 1)N}\,pq$$
It follows that if $N$ is very large relative to $n$, so that the fpc is
negligible, an unbiased estimate of the variance of $p$ is
$$\frac{pq}{n - 1}$$
This result may appear puzzling to some readers, since the expression
$pq/n$ is almost invariably used in practice for the estimated variance.
The fact is that $pq/n$ is not unbiased even with an infinite population.
Corollary. An unbiased estimate of the variance of $\hat{A} = Np$, the
estimated total number of units in class C in the population, is
$$v(\hat{A}) = s_{Np}^2 = \frac{N(N - n)}{n - 1}\,pq \tag{3.11}$$
Example. From a list of 3042 names and addresses, a simple random
sample of 200 names showed on investigation 38 wrong addresses.
Estimate the total number of addresses needing correction in the list
and find the standard error of this estimate. We have
$$N = 3042; \quad n = 200; \quad a = 38; \quad p = 0.19$$
The estimated total number of wrong addresses is
$$\hat{A} = Np = (3042)(0.19) = 578$$
with estimated standard error
$$s_{Np} = \sqrt{\frac{(3042)(2842)(0.19)(0.81)}{199}} = \sqrt{6685} = 81.8$$
Since the sampling ratio is under 7 per cent, the fpc makes little difference.
To remove it, replace the term $(N - n)$ by $N$. If, in addition,
we replace $(n - 1)$ by $n$, we have the simpler formula
$$s_{Np} = N\sqrt{\frac{pq}{n}} = (3042)\sqrt{\frac{(0.19)(0.81)}{200}} = 84.4$$
This is in fairly close agreement with the previous result, 81.8.
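
The two standard errors in this example follow from formula (3.11) and
from its simplified form without the fpc. A minimal sketch, with an
illustrative function name that is not part of the text:

```python
import math

def estimated_total_and_se(N, n, a, use_fpc=True):
    """Estimated class total Np and its estimated standard error.

    With use_fpc=True the unbiased formula (3.11) is used:
        v(A_hat) = N (N - n) p q / (n - 1).
    With use_fpc=False the fpc is dropped and (n - 1) is replaced by n:
        v(A_hat) = N^2 p q / n.
    """
    p = a / n
    q = 1 - p
    total = N * p
    if use_fpc:
        variance = N * (N - n) * p * q / (n - 1)
    else:
        variance = N * N * p * q / n
    return total, math.sqrt(variance)

total, se_exact = estimated_total_and_se(3042, 200, 38)
_, se_simple = estimated_total_and_se(3042, 200, 38, use_fpc=False)
print(round(total), round(se_exact, 1), round(se_simple, 1))
```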


The preceding formulas for the variance and the estimated variance
of p hold only if the units are classified into C or C' so that p is the
ratio of the number of units in C in the sample to the total number
of units in the sample. There is a common situation in which each
unit is composed of a group of elements, and it is the elements that
are classified. A few examples are as follows:
Sampling unit Elements
Family Members of the family
Restaurant Employees
Crate of eggs Individual eggs
Peach tree Individual peaches

If a simple random sample of units is drawn in order to estimate the
proportion $P$ of elements in the population which belong to class C,
the preceding formulas do not apply, except sometimes as a fair
approximation.
If each sampling unit contains the same number $M$ of elements, let
$p_i = a_i/M$ be the proportion of elements in C in the $i$th unit and let
$p = \sum p_i/n$ be the sample estimate of $P$. The correct procedure is
to apply the formulas of chapter 2 to the quantities $p_i$. With simple
random sampling,
$$V(p) = \frac{(N - n)}{Nn}\,\frac{1}{(N - 1)} \sum_{i=1}^{N} (p_i - P)^2$$
and an unbiased estimate of this variance is
$$v(p) = \frac{(N - n)}{Nn}\,\frac{1}{(n - 1)} \sum_{i=1}^{n} (p_i - p)^2$$
If $M$ varies from unit to unit, the problem is more complicated:
appropriate methods are presented in section 6.9.
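
As an illustration of the equal-$M$ case, the estimate $v(p)$ can be
computed from the unit proportions $p_i$ alone. The sketch below uses
made-up numbers (three sampled units out of ten, with proportions 0.2,
0.4, and 0.6); both the data and the function name are hypothetical.

```python
def cluster_proportion_estimate(unit_proportions, N):
    """Mean of the unit proportions p_i and its estimated variance v(p).

    Assumes a simple random sample of units, each containing the same
    number M of elements, drawn from a population of N units.
    """
    n = len(unit_proportions)
    p = sum(unit_proportions) / n
    sum_sq = sum((p_i - p) ** 2 for p_i in unit_proportions)
    v_p = (N - n) / (N * n) * sum_sq / (n - 1)
    return p, v_p

# Hypothetical data: proportions of elements in class C in 3 sampled units.
p_hat, v_hat = cluster_proportion_estimate([0.2, 0.4, 0.6], N=10)
print(p_hat, v_hat)
```

Note that the binomial expression $pq/(n - 1)$ plays no part here; the
variation among the $p_i$ is what determines the precision.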

3.3 The effect of P on the standard errors. Equation (3.6) shows
how the variance of the estimated percentage changes with $P$, for
fixed $n$ and $N$. If the fpc is ignored, we have
$$V(p) = \frac{PQ}{n}$$
The function $PQ$ and its square root are shown in table 3.1. These
functions may be regarded as the variance and standard deviation,
respectively, for a sample of size 1.

TABLE 3.1 VALUES OF PQ AND √PQ

P = Population percentage in class C.

P       0    10    20    30    40    50    60    70    80    90   100

PQ      0   900  1600  2100  2400  2500  2400  2100  1600   900     0

√PQ     0    30    40    46    49    50    49    46    40    30     0

The functions have their greatest values when the population is
equally divided between the two classes, and are symmetrical about
this point. The standard error of $p$ changes relatively little when $P$
lies anywhere between 30 and 70 per cent. At the maximum value of
$\sqrt{PQ}$, 50, a sample size of 100 is needed to reduce the standard error
of the estimate to 5 per cent. To attain a 1 per cent standard error
requires a sample size of 2500.
This approach is not appropriate when interest lies in the total
number of units in the population which are in class C. In this event
it is more natural to ask: Is the estimate likely to be correct to within,
say, 7 per cent of the true total? Thus we tend to think of the standard
error expressed as a fraction or percentage of the true value, $NP$.
The fraction is
$$\frac{\sigma_{Np}}{NP} = \frac{N\sqrt{PQ}}{\sqrt{n}\,NP}\sqrt{\frac{N - n}{N - 1}} = \frac{1}{\sqrt{n}}\sqrt{\frac{Q}{P}}\sqrt{\frac{N - n}{N - 1}} \tag{3.12}$$
This quantity is usually called the coefficient of variation of the estimate.
If the fpc is ignored, the coefficient is $\sqrt{Q/nP}$. The ratio
$\sqrt{Q/P}$, which might be considered the coefficient of variation for a
sample of size 1, is shown in table 3.2.

TABLE 3.2 VALUES OF √(Q/P) FOR DIFFERENT VALUES OF P

P = Population percentage in class C.

P          0     0.1    0.5     1      5     10     20
√(Q/P)     ∞    31.6   14.1    9.9    4.4    3.0    2.0
P         30     40     50     60     70     80     90
√(Q/P)   1.5    1.2    1.0    0.8    0.7    0.5    0.3

For a fixed sample size, the coefficient of variation of the estimated
total in class C decreases steadily as the true percentage in C increases.
The coefficient is high when $P$ is less than 5 per cent. Very large
samples are needed for precise estimates of the total number possessing
any attribute that is rare in the population. For $P$ = 1 per cent,
we must have $\sqrt{n} = 99$ in order to reduce the coefficient of variation
of the estimate to 0.1 or 10 per cent. This gives a sample size of 9801.
Simple random sampling, or any method of sampling that is adapted
for general purposes, tends to be an expensive method of estimating
the total number of units of a scarce type. The problem is analogous
to that of finding the total number of needles in a haystack.
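
The sample sizes quoted in this section follow from solving
$\sqrt{PQ/n}$ or $\sqrt{Q/nP}$ for $n$ with the fpc ignored. The sketch
below is an illustration with assumed function names; exact rational
arithmetic is used so that the rounding up is not disturbed by
floating-point error.

```python
import math
from fractions import Fraction

def n_for_standard_error(P_percent, se_percent):
    """Smallest n for which sqrt(PQ/n) <= se_percent (P, Q in per cent)."""
    P = Fraction(P_percent)
    Q = 100 - P
    return math.ceil(P * Q / Fraction(se_percent) ** 2)

def n_for_cv(P_percent, cv):
    """Smallest n for which sqrt(Q/(nP)) <= cv, the coefficient of variation."""
    P = Fraction(P_percent)
    Q = 100 - P
    return math.ceil(Q / (P * Fraction(cv) ** 2))

print(n_for_standard_error(50, 5))       # standard error of 5 per cent
print(n_for_standard_error(50, 1))       # standard error of 1 per cent
print(n_for_cv(1, Fraction(1, 10)))      # P = 1 per cent, cv = 0.1
```

For $P$ = 1 per cent the unrounded calculation gives $n = 99/0.01 = 9900$;
the figure 9801 in the text comes from using the tabulated value 9.9 for
$\sqrt{Q/P}$, so the two agree to the accuracy of table 3.2.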

3.4 The binomial distribution. Since the population is of a particularly
simple type, in which the $y_i$ are either 1 or 0, we can find the
actual frequency distribution of the estimate $p$ and not merely its
mean and variance.
The population contains $A$ units that are in class C and $(N - A)$
units in C', where $P = A/N$. If the first unit that is drawn happens
to be in C, there will remain in the population $(A - 1)$ units in C
and $(N - A)$ in C'. Thus the proportion of units in C, after the first
draw, changes slightly to $(A - 1)/(N - 1)$. Alternatively, if the first
unit drawn is in C', the proportion in C changes to $A/(N - 1)$. In
sampling without replacement, the proportion keeps changing in this
way throughout the draw. In the present section these variations
are ignored, i.e. $P$ is assumed constant. This amounts to assuming
that $A$ and $(N - A)$ are both large relative to the sample size $n$.
With this assumption, the process of drawing the sample consists
of a series of $n$ trials, in each of which the probability that the unit
drawn is in C is $P$. This situation gives rise to the familiar binomial
frequency distribution for the number of units in C in the sample.
The probability that the sample contains $a$ units in C is
$$\Pr(a) = \frac{n!}{a!\,(n - a)!}\,P^{a} Q^{n - a} \tag{3.13}$$
From this expression we may tabulate the frequency distribution of
$a$, or of $p = a/n$, or of the estimated total $Np$. The most comprehensive
tables are those published by the U. S. National Bureau of Standards
(1950). They give both individual terms and the cumulative
sums of the terms for sample sizes up to 49 and for $P$ varying by intervals
of 0.01. For tables with $n$ between 50 and 100, see Romig (1952).

3.5 The general distribution of p. The distribution of $p$ can be found
without the assumption that the population is large relative to the
sample. The numbers of units in the two classes C and C' in the population
are $A$ and $A'$, respectively. We will calculate the probability
that the corresponding numbers in the sample are $a$ and $a'$, respectively,
where
$$a + a' = n; \qquad A + A' = N$$
We may assume that $a \le A$, and $a' \le A'$, because it is impossible to
draw a sample in which these inequalities do not hold, the sampling
being without replacement.
Consider a sample in which the first $a$ units drawn all fall in class
C. At the first draw, the probability that a C is obtained is $A/N$.
After the first draw, there remain $(A - 1)$ units in C in the population,
so that the probability of obtaining a C at the second draw is
$(A - 1)/(N - 1)$. Since it is supposed that a C turns up at the second
draw also, the probability of a C at the third draw is $(A - 2)/(N - 2)$,
and so on. The probability that the first $a$ units drawn are
all in C is the product
$$\frac{A(A - 1)(A - 2) \cdots (A - a + 1)}{N(N - 1)(N - 2) \cdots (N - a + 1)} \tag{3.14}$$
To obtain the type of sample in which we are interested, all the
remaining units drawn must fall in C'. At the $(a + 1)$th draw, the
population has $A'$ units in class C' out of $(N - a)$ units. The probabilities
of a C' at successive draws are therefore
$$\frac{A'}{(N - a)}, \quad \frac{(A' - 1)}{(N - a - 1)}, \quad \frac{(A' - 2)}{(N - a - 2)}, \quad \text{etc.}$$
Hence, the probability that all remaining units fall in C' is the product
$$\frac{A'(A' - 1)(A' - 2) \cdots (A' - a' + 1)}{(N - a)(N - a - 1) \cdots (N - n + 1)} \tag{3.15}$$
Multiplication of (3.14) and (3.15) gives the probability of obtaining
a sample in which the first $a$ units all fall in C and the remaining $a'$
units all fall in C':
$$\Pr = \frac{A(A - 1) \cdots (A - a + 1)\,(A')(A' - 1) \cdots (A' - a' + 1)}{N(N - 1) \cdots (N - n + 1)} \tag{3.16}$$
The reader may verify that this expression also holds for any specified
order in which the units appear, provided that $a$ of them are in C and
$a'$ in C'. To find the total probability, we count how many different
orders can be specified, that is, how many distinct ways $a$ objects of
one kind, and $a'$ objects of another kind, can be arranged in order along
a line. This number is given by the familiar quantity ${}_{n}C_{a}$, or
$$\frac{n!}{a!\,(a')!}$$
Finally, the probability that the sample contains $a$ units in C and
$a'$ in C' is
$$\Pr(a, a' \mid A, A') = \frac{n!}{a!\,a'!}\;\frac{A(A - 1) \cdots (A - a + 1)\,(A')(A' - 1) \cdots (A' - a' + 1)}{N(N - 1) \cdots (N - n + 1)} \tag{3.17}$$
This result may be written in more compact form as
$$\Pr(a, a' \mid A, A') = \frac{{}_{A}C_{a} \cdot {}_{A'}C_{a'}}{{}_{N}C_{n}} \tag{3.18}$$
This is the frequency distribution of $a$ or $np$, from which that of $p$
is immediately derivable. The distribution is called the hypergeometric
distribution.
Example. A family of 8 contains 3 males and 5 females. Find the
frequency distribution of the number of males in a simple random
sample of size 4. In this case
$$A = 3; \quad A' = 5; \quad N = 8; \quad n = 4$$
From formula (3.17) the distribution of the number of males, $a$, is
as follows:

a     Probability
0     (4!/0!4!) . (5.4.3.2)/(8.7.6.5) = 1/14
1     (4!/1!3!) . (3.5.4.3)/(8.7.6.5) = 6/14
2     (4!/2!2!) . (3.2.5.4)/(8.7.6.5) = 6/14
3     (4!/3!1!) . (3.2.1.5)/(8.7.6.5) = 1/14
4     Impossible                      = 0

The reader may verify that the mean number of males is 3/2 and the
variance is 15/28. These results agree with the formulas previously
established, section 3.2, which give
$$E(np) = nP = \frac{nA}{N} = \frac{(4)(3)}{8} = \frac{3}{2}$$
$$V(np) = nPQ\,\frac{N - n}{N - 1} = 4 \cdot \frac{3}{8} \cdot \frac{5}{8} \cdot \frac{4}{7} = \frac{15}{28}$$
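
The family example can be reproduced exactly from the compact form
(3.18). The following sketch uses Python's `math.comb` and exact
fractions; the function name is an illustrative assumption.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(a, n, A, N):
    """Pr(a, n - a | A, N - A) from the compact form (3.18)."""
    # comb returns 0 when more units are requested than are available,
    # so impossible samples automatically receive probability 0.
    return Fraction(comb(A, a) * comb(N - A, n - a), comb(N, n))

A, N, n = 3, 8, 4   # 3 males in a family of 8; sample of size 4
dist = {a: hypergeometric_pmf(a, n, A, N) for a in range(n + 1)}

mean = sum(a * pr for a, pr in dist.items())
variance = sum(a * a * pr for a, pr in dist.items()) - mean ** 2

for a, pr in dist.items():
    print(a, pr)
print("mean", mean, "variance", variance)
```

Because fractions are used throughout, the mean and variance come out
as exact rationals and can be compared directly with $nP$ and
$nPQ(N - n)/(N - 1)$.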

3.6 Confidence limits. We first discuss the meaning of confidence
limits in the case of qualitative characteristics. In the sample, $a$ out
of $n$ fall in class C. Inferences are to be made about the
number $A$ in the population which fall in class C. For an upper confidence
limit to $A$, we compute a value $A_U$ such that for this value the
probability of getting $a$ or less falling in C in the sample is some small
quantity $\alpha_U$, e.g. 0.025. Formally, $A_U$ satisfies the equation
$$\sum_{j=0}^{a} \Pr(j, n - j \mid A_U, N - A_U) = \alpha_U \tag{3.19}$$
where Pr is the probability term for the hypergeometric distribution,
as defined in equation (3.18).
When $\alpha_U$ is chosen in advance, equation (3.19) requires in general
a non-integral value of $A_U$ to satisfy it, whereas conceptually $A_U$
should be a whole number. In practice we choose $A_U$ as the smallest
integral value of $A$ such that the left side of (3.19) is less than or equal
to $\alpha_U$. Similarly, the lower confidence limit $A_L$ is the largest integral
value such that
$$\sum_{j=a}^{n} \Pr(j, n - j \mid A_L, N - A_L) \le \alpha_L \tag{3.20}$$
Confidence limits for $P$ are then found by taking $P_U = A_U/N$,
$P_L = A_L/N$.
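
The definitions of $A_U$ and $A_L$ translate directly into a search
over integral values of $A$. The sketch below is one possible reading of
those definitions, not a procedure given in the text; the function names
and the choice of $N = 500$, $n = 100$, $a = 37$ as test values are
assumptions made for illustration.

```python
from math import comb

def hyper_pmf(j, n, A, N):
    """Hypergeometric probability of j units in C in a sample of n."""
    return comb(A, j) * comb(N - A, n - j) / comb(N, n)

def lower_tail(a, n, A, N):
    """Probability of a or fewer units in C when the population has A."""
    return sum(hyper_pmf(j, n, A, N) for j in range(a + 1))

def upper_tail(a, n, A, N):
    """Probability of a or more units in C when the population has A."""
    return sum(hyper_pmf(j, n, A, N) for j in range(a, n + 1))

def upper_limit(a, n, N, alpha=0.025):
    """Smallest integral A whose lower-tail probability is <= alpha."""
    for A in range(a, N + 1):
        if lower_tail(a, n, A, N) <= alpha:
            return A
    return N   # reached only when a == n, where no such A exists

def lower_limit(a, n, N, alpha=0.025):
    """Largest integral A whose upper-tail probability is <= alpha."""
    for A in range(N, -1, -1):
        if upper_tail(a, n, A, N) <= alpha:
            return A
    return 0   # reached only when a == 0, where no such A exists

N, n, a = 500, 100, 37
A_L, A_U = lower_limit(a, n, N), upper_limit(a, n, N)
print(A_L, A_U, A_L / N, A_U / N)
```

The search is exhaustive and therefore slow for very large $N$, but it
makes the role of the two tail sums explicit.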
The normal approximation is often serviceable. In theorem 3.2,
it was found that the standard error of $p$ is
$$\sigma_p = \sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{PQ}{n}}$$
If it is assumed that $p$ is normally distributed, with estimated standard
error
$$s_p' = \sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{pq}{n}}$$
we obtain, as a normal approximation to the confidence limits,
$$P_L = p - t\sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{pq}{n}}; \qquad P_U = p + t\sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{pq}{n}}$$
where $t$ is the normal deviate corresponding to the confidence probability.*
It is worth while to amend these formulas by inserting a correction
for continuity, whenever this correction has an appreciable effect.
The rationale of the correction may be explained as follows. Suppose
that 30 units out of 70 are observed to fall in class C, and we wish to
approximate $P_U$. Using the exact distribution, we would find $P_U$
such that the sum of the probabilities that 0, 1, ..., 30 units fall in
C is $\alpha_U$. If the exact distribution is to be approximated by a continuous
normal distribution, it is natural to regard the ordinate at 30
as corresponding to the area of the normal curve between 29½ and 30½.
Thus the sum of terms from 0 to 30 corresponds to the area of the
normal curve below the point 30½. The effect of the correction is to
compute $P_U$ by the equation $P_U = p + 1/2n + t s_p'$. This increases $P_U$
by $1/2n$. Similarly, $P_L$ is decreased by $1/2n$. The amended limits
may be written thus:
$$p \pm \left\{ t\sqrt{\frac{N - n}{N - 1}}\sqrt{\frac{pq}{n}} + \frac{1}{2n} \right\} \tag{3.21}$$
* In theorem 3.3 it was shown that an unbiased sample estimate of $\sigma_p^2$ is
$$s_p^2 = \frac{(N - n)\,pq}{N(n - 1)}$$
In estimating $P_L$ and $P_U$, $s_p$ might have been used for the standard error of $p$
instead of $s_p'$. However, $s_p'$ was preferred because it is more familiar, and both
estimates appear to give about equally good approximations.
The correction for continuity produces a slight rather than a substantial
improvement in the approximation. However, without the
correction, the confidence interval as found by the normal approximation
is usually too narrow, and the correction helps to remedy this
defect.
The error in the normal approximation depends on all the quantities
$n$, $p$, $N$, $\alpha_U$, and $\alpha_L$. The quantity to which the error is most sensitive
is $np$, or more specifically the number observed in the smaller
class. Table 3.3 gives working rules for deciding when the normal
approximation (3.21) may be used.

TABLE 3.3 SMALLEST VALUES OF np FOR USE OF THE NORMAL APPROXIMATION

           np = number observed
  p        in the smaller class      n = sample size
 0.5              15                       30
 0.4              20                       50
 0.3              24                       80
 0.2              40                      200
 0.1              60                      600
 0.05             70                     1400
 ~0*              80                       ∞
* This means that p is extremely small, so that np follows the Poisson distribution.

The rules in table 3.3 are constructed so that with 95 per cent confidence
limits the true frequency with which the limits fail to enclose
$P$ is not greater than 5.5 per cent. Further, the probability that the
upper limit is below $P$ is between 2.5 and 3.5 per cent, and the probability
that the lower limit exceeds $P$ is between 2.5 and 1.5 per cent.
These restrictions on the one-tailed frequencies of error seemed advisable
because the binomial distribution is in general skew (see section
2.8). The rules are not guaranteed to satisfy these probability
statements in all cases, since exhaustive examination is lengthy, but
I believe that the statements are generally true. The choice of 5.5,
3.5, and 1.5 per cent for the probabilities is of course arbitrary. The
reader who is content with greater error in the normal approximation
can allow lower values of $n$.
When the situation lies outside the range of validity of the normal
approximation, or when greater accuracy is desired, reference may be
made to charts of the confidence limits of the hypergeometric function
by Chung and DeLury (1950). These give the 90, 95, and 99 per
cent limits for $P$ for population sizes of 500, 2500, and 10,000. Values
for intermediate population sizes may be obtained by interpolation.
An alternative when the normal approximation does not apply is
to use the limits for the binomial distribution, adjusted if necessary
so as to take account of the finite population correction. For $n$ less
than 50, the binomial limits are quickly found from Tables of the binomial
frequency distribution (U. S. National Bureau of Standards).
A convenient table of the limits themselves, constructed by W. L.
Stevens, is given in Fisher and Yates's Statistical tables, 3rd ed., table
VIII, 1. The limits presented by Stevens are those for $nP$, since this
quantity is more amenable to compact tabulation than $P$ itself. The
method of amending these limits so as to allow for the finite population
correction is illustrated in example 2 below.
Example 1. In a simple random sample of size 100, from a population
of size 500, there are 37 units in class C. Find the 95 per cent
confidence limits for the proportion and for the total number in class
C in the population. In this example,

$$n = 100; \quad N = 500; \quad p = 0.37$$

The example lies in the range in which the normal approximation is
recommended. The estimated standard error of p is

$$\sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}} = \sqrt{\frac{400}{499}} \sqrt{\frac{(0.37)(0.63)}{100}} = 0.0432$$

The correction for continuity, 1/2n, equals 0.005. Hence the 95 per
cent limits for P are estimated as

$$p \pm \left( t \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{pq}{n}} + \frac{1}{2n} \right) = 0.37 \pm (1.96 \times 0.0432 + 0.005) = 0.37 \pm 0.090$$

$$P_L = 0.280; \quad P_U = 0.460$$

The limits as read from the charts by Chung and DeLury are 0.285
and 0.462, respectively.
To find limits for the total number in class C in the population, we
multiply by N, obtaining 140 and 230, respectively.
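The arithmetic of example 1 can be retraced with a short script. This is only a sketch of the normal approximation with the fpc and the continuity correction as described above; the function name `normal_limits` and its arguments are illustrative choices, not part of the text.

```python
import math

def normal_limits(n, N, a, t=1.96):
    """Normal-approximation limits for P, with fpc and continuity correction.

    n: sample size, N: population size, a: number in class C in the sample.
    (Names are illustrative assumptions, not notation fixed by the text.)
    """
    p = a / n
    q = 1.0 - p
    # Estimated standard error of p, including the finite population correction
    se = math.sqrt((N - n) / (N - 1)) * math.sqrt(p * q / n)
    # Half-width: t standard errors plus the continuity correction 1/(2n)
    half_width = t * se + 1.0 / (2 * n)
    return p - half_width, p + half_width, se

lower, upper, se = normal_limits(n=100, N=500, a=37)
print(round(se, 4), round(lower, 3), round(upper, 3))
# Limits for the total number in class C: multiply by N
print(round(500 * lower), round(500 * upper))
```

The rounded values agree with those worked out by hand in the example.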
Example 2. This example shows how binomial limits may be used
as an approximation. Suppose that for another item in the previous
sample 9 units out of the 100 fall in class C. This is outside the range
for the normal approximation.
The 95 per cent binomial limits for the expected number nP in
class C are read from Fisher and Yates's Statistical tables, table VIII,
1, as 4.20 and 16.40. Dividing by n, we obtain approximate limits of
0.042 and 0.164 for P. If the sampling ratio is less than 5 per cent,
limits found in this way are close enough for most purposes. In this
sample, the sampling ratio is 20 per cent, so that the fpc should be
applied. The fpc factor is

$$\sqrt{\frac{400}{499}} = 0.895$$

To apply the correction, we shorten the interval between p and each
limit by this factor. The adjusted limits are as follows:

$$P_L = 0.09 - (0.895)(0.09 - 0.042) = 0.047$$

$$P_U = 0.09 + (0.895)(0.164 - 0.09) = 0.156$$

The limits obtained from the tables by Chung and DeLury are 0.045
and 0.157, respectively.
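The adjustment in example 2 amounts to pulling each binomial limit toward p by the fpc factor. The sketch below assumes the tabulated limits 0.042 and 0.164 have already been read from the binomial table; the function name `shrink_limits` is a hypothetical label.

```python
import math

def shrink_limits(p, lower, upper, n, N):
    """Shorten the distance from p to each binomial limit by the fpc factor."""
    fpc = math.sqrt((N - n) / (N - 1))
    return p - fpc * (p - lower), p + fpc * (upper - p), fpc

# Binomial limits for P taken from the table (4.20/100 and 16.40/100)
adj_lower, adj_upper, fpc = shrink_limits(p=0.09, lower=0.042, upper=0.164,
                                          n=100, N=500)
print(round(fpc, 3), round(adj_lower, 3), round(adj_upper, 3))
```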

3.7 Classification into more than two classes. Frequently, in the
presentation of results, the units are classified into more than two
classes. Thus a sample from a human population may be arranged
in fifteen 5-year age groups. Even when a question is supposed to
be answered by a simple "yes" or "no," the results actually obtained
may fall into four classes: "yes," "no," "don't know," and "no
answer." The extension of the theory to such cases will be illustrated
by the situation in which there are three classes.
We suppose that the number falling in the ith class is $A_i$ in the
population and $a_i$ in the sample, where

$$N = \sum A_i; \quad n = \sum a_i; \quad P_i = \frac{A_i}{N}; \quad p_i = \frac{a_i}{n}$$

When the sample size n is small relative to all the $A_i$, the
probabilities $P_i$ may be considered effectively constant throughout the
drawing of the sample. The probability of drawing the observed sample is
given by the multinomial expression

$$\Pr(a_i) = \frac{n!}{a_1! \, a_2! \, a_3!} P_1^{a_1} P_2^{a_2} P_3^{a_3} \tag{3.22}$$
This is the appropriate extension of the binomial distribution, and is
a good approximation when the sampling fraction is small.
The correct expression for the probability of drawing the observed
sample is

$$\Pr(a_i \mid A_i) = \frac{\binom{A_1}{a_1} \binom{A_2}{a_2} \binom{A_3}{a_3}}{\binom{N}{n}} \tag{3.23}$$

This expression is a natural extension of equation (3.18), section 3.5,
for the hypergeometric distribution. A proof will not be given: the
result can be established by the method used in proving the
hypergeometric distribution.
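To see how close the multinomial approximation (3.22) is to the exact expression, the two probabilities can be compared numerically. The sketch below takes the exact probability to be the product of binomial coefficients divided by the number of possible samples, which is an assumption consistent with equation (3.27); the function names and the class sizes in the usage lines are made up for illustration.

```python
from math import comb, factorial, prod

def hypergeometric_prob(A, a):
    """Exact probability of the sample counts a, given class sizes A."""
    N, n = sum(A), sum(a)
    return prod(comb(Ai, ai) for Ai, ai in zip(A, a)) / comb(N, n)

def multinomial_prob(A, a):
    """Multinomial approximation, treating P_i = A_i / N as constant."""
    N, n = sum(A), sum(a)
    coef = factorial(n) // prod(factorial(ai) for ai in a)
    return coef * prod((Ai / N) ** ai for Ai, ai in zip(A, a))

# Hypothetical population of 1000 in three classes, sample of 10 (1 per cent)
A = (500, 300, 200)
a = (5, 3, 2)
exact = hypergeometric_prob(A, a)
approx = multinomial_prob(A, a)
print(exact, approx, abs(exact - approx) / exact)
```

With a sampling fraction of 1 per cent the two values differ only slightly; the gap widens as n/N grows.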

3.8 Confidence limits when there are more than two classes. Two
different cases must be distinguished.
Case I. We calculate

$$p = \frac{\text{Number in any one class in sample}}{n} = \frac{a_1}{n}$$

or

$$p = \frac{\text{Total number in a group of classes}}{n} = \frac{a_1 + a_2 + a_3}{n}, \text{ say}$$
For example, if the answers are classified into "yes," "no," "don't
know," and "no answer," we might take p as the proportion in the
sample answering "yes," or alternatively as the proportion in the
sample giving a definite answer, either "yes" or "no." In either of
these situations, although the original classification contains more
than two classes, p itself is obtained from a subdivision of the n units
into only two classes. The theory already presented applies to this
case. Confidence limits are calculated as described in section 3.6.
Case II. Sometimes certain classes are omitted, p being computed
from a breakdown of the remaining classes into two parts. For ex-
ample, we might omit persons who did not know or gave no answer,
and consider the ratio of number of "yes" answers to "yes" plus "no"
answers. Ratios which are structurally of this type are often of
interest in sample surveys. The denominator of such a ratio is not n,
but some smaller number n'.
The frequency distribution of p is more complicated than in Case I,
because both the numerator and denominator of p vary from one
sample to another, even though all samples have the same total
size n. This presents an obstacle to the calculation of confidence
limits. Most of the complications can be avoided by the device,
common in statistical theory, of calculating confidence limits from the
conditional distribution of p, given n and n'. In this method we
construct confidence statements which will be true, with the assigned
confidence probability, over all samples which have the same n and
n' as the observed sample.
The reason why this device helps is that the conditional
distribution of p is obtained from an ordinary hypergeometric distribution,
as will be shown in the next section. To give the result more
precisely, suppose that

$$p = \frac{a_1}{a_1 + a_2}; \quad n' = a_1 + a_2; \quad n = a_1 + a_2 + a_3$$

so that $a_3$ is the number in the sample falling in classes in which we
are not at the moment interested. Then the conditional distribution
of $a_1$ and $a_2$ is the hypergeometric distribution when the sample is
of size n' and the population of size $N' = A_1 + A_2$. In particular,
the conditional standard error of p is

$$\sigma_p = \sqrt{\frac{N' - n'}{N' - 1}} \sqrt{\frac{PQ}{n'}}$$

The normal approximations to conditional confidence limits for
$P = A_1/(A_1 + A_2)$ are

$$p \pm \left( t \sqrt{\frac{N' - n'}{N' - 1}} \sqrt{\frac{pq}{n'}} + \frac{1}{2n'} \right) \tag{3.24}$$

One difficulty remains. Although n' is known, N' is not known
from the sample results, so that equations (3.24) are not usable as
they stand. Failing any additional information about N', one
procedure is to assume that n'/n = N'/N, i.e. to estimate N' as Nn'/n.
Substitution of this value in (3.24) gives the approximate limits

$$P_L = p - t \sqrt{\frac{N - n}{N - (n/n')}} \sqrt{\frac{pq}{n'}} - \frac{1}{2n'}$$

$$P_U = p + t \sqrt{\frac{N - n}{N - (n/n')}} \sqrt{\frac{pq}{n'}} + \frac{1}{2n'} \tag{3.25}$$

The fpc is the only term affected by the approximation used for N'.
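A small sketch may help in applying these limits to a ratio of the Case II type. It assumes, as in the text, that N' is estimated as Nn'/n; the function name `conditional_limits` and the counts in the usage lines are hypothetical and serve only to show the calculation.

```python
import math

def conditional_limits(a1, a2, n, N, t=1.96):
    """Approximate conditional limits for P = A1/(A1 + A2).

    a1, a2: sample counts in the two classes of interest,
    n: total sample size, N: population size.
    N' is not known, so the fpc uses the estimate N' = N * n' / n.
    """
    n_prime = a1 + a2
    p = a1 / n_prime
    q = 1.0 - p
    fpc = math.sqrt((N - n) / (N - n / n_prime))
    half_width = t * fpc * math.sqrt(p * q / n_prime) + 1.0 / (2 * n_prime)
    return p - half_width, p + half_width

# Hypothetical survey: 150 sampled from 3000; 60 "yes", 40 "no", 50 other
low, high = conditional_limits(a1=60, a2=40, n=150, N=3000)
print(round(low, 3), round(high, 3))
```

The fpc written in terms of N and n/n' is algebraically the same as (N' - n')/(N' - 1) with N' replaced by Nn'/n, which is why only that factor depends on the estimate of N'.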

3.9 The conditional distribution of p. In this section we indicate how
the conditional distribution of $p = a_1/(a_1 + a_2)$ is obtained, and
present an illustration. With three classes, the probability of drawing
the observed sample has been given [equation (3.23)] as

$$\Pr(a_i \mid A_i) = \frac{\binom{A_1}{a_1} \binom{A_2}{a_2} \binom{A_3}{a_3}}{\binom{N}{n}} \tag{3.26}$$

In the conditional distribution, $n' = a_1 + a_2$ is fixed. All samples of
size n which do not have this value of n' are ignored. Samples which
have this value retain the same relative probabilities as in equation
(3.26). To find the conditional distribution, we divide (3.26) by the
total probability for all permissible samples.
This total probability is simply the probability that n' out of n
will fall in class 1 or 2. It is given by the hypergeometric distribution:

$$\frac{\binom{A_1 + A_2}{a_1 + a_2} \binom{A_3}{a_3}}{\binom{N}{n}} \tag{3.27}$$

On division of (3.26) by (3.27), the conditional distribution of the
sample is obtained as

$$\Pr(a_i \mid A_i, n') = \frac{\binom{A_1}{a_1} \binom{A_2}{a_2}}{\binom{A_1 + A_2}{a_1 + a_2}} = \frac{\binom{A_1}{a_1} \binom{A_2}{a_2}}{\binom{N'}{n'}} \tag{3.28}$$

where $N' = A_1 + A_2$, $n' = a_1 + a_2$. This is an ordinary
hypergeometric distribution for a sample of size n' from a population of size N'.
As an illustration, consider a population which consists of the five
units A, B, C, D, E, which fall in three classes.

Class   A_i   Units denoted by
  1      1    A
  2      2    B, C
  3      2    D, E

With unrestricted random samples of size 3, we wish to estimate
$P = A_1/(A_1 + A_2)$, or in this case 1/3. Thus N = 5, and N' = 3.
There are 10 possible samples of size 3, all with equal initial
probabilities. These will be grouped according to the value of n'.

n' = 1
                                 Conditional
Sample         a_1   a_2    p    probability   (p - P)
ADE             1     0     1       1/3          2/3
BDE or CDE      0     1     0       2/3         -1/3

If samples are specified by the values of $a_1$, $a_2$, only two types are
obtainable: $a_1 = 1$, $a_2 = 0$; $a_1 = 0$, $a_2 = 1$. Their conditional
probabilities, 1/3 and 2/3, respectively, agree with the general expression (3.28).
Further,

$$E(p) = \frac{1}{3}$$

$$\sigma_p^2 = \left(\frac{1}{3}\right)\left(\frac{4}{9}\right) + \left(\frac{2}{3}\right)\left(\frac{1}{9}\right) = \frac{6}{27} = \frac{2}{9}$$

The estimate p is unbiased, and its variance agrees with the general
formula

$$\sigma_p^2 = \frac{N' - n'}{N' - 1} \frac{PQ}{n'} = \left(\frac{3 - 1}{3 - 1}\right) \left(\frac{1}{3}\right) \left(\frac{2}{3}\right) = \frac{2}{9}$$
For n' = 2 there are six possible samples, which give only two sets
of values of $a_1$, $a_2$.

n' = 2
                                     Conditional
Sample                     a_1  a_2   p   probability   (p - P)
ABD, ABE, ACD, or ACE       1    1   1/2     2/3          1/6
BCD or BCE                  0    2    0      1/3         -1/3

The estimate is again unbiased, and its variance is

$$\sigma_p^2 = \left(\frac{2}{3}\right)\left(\frac{1}{36}\right) + \left(\frac{1}{3}\right)\left(\frac{1}{9}\right) = \frac{1}{18}$$

which may be verified from the general formula. Note that the
variance is only one-fourth of that obtained when n' = 1. In a
conditional approach, the variance changes with the configuration of the
sample that was drawn.
For n' = 3, there is only one possible sample, ABC. This gives the
correct population fraction, 1/3. The conditional variance of p is zero,
as is indicated by the general formula, which reduces to zero when
N' = n'.
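The whole illustration can be checked by brute force, since there are only 10 samples. The sketch below enumerates them, groups them by n', and computes the conditional mean and variance of p with exact fractions; the dictionary `classes` encodes the class of each unit as given in the table, and the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# Class membership of the five units, as in the illustration
classes = {"A": 1, "B": 2, "C": 2, "D": 3, "E": 3}
P = Fraction(1, 3)  # A1 / (A1 + A2)

# Group the values of p = a1 / (a1 + a2) by n' = a1 + a2
groups = defaultdict(list)
for sample in combinations(classes, 3):
    a1 = sum(classes[u] == 1 for u in sample)
    a2 = sum(classes[u] == 2 for u in sample)
    groups[a1 + a2].append(Fraction(a1, a1 + a2))

results = {}
for n_prime, ps in sorted(groups.items()):
    mean = sum(ps) / len(ps)                       # conditional E(p)
    var = sum((p - P) ** 2 for p in ps) / len(ps)  # conditional variance
    results[n_prime] = (len(ps), mean, var)
    print(n_prime, len(ps), mean, var)
```

All samples within a group are equally likely, so simple averages over each group give the conditional moments.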

3.10 Exercises.
3.1 For a population with N = 6, A = 4, A' = 2, work out the value of
a for all possible simple random samples of size 3. Verify the theorems given
for the mean and variance of p = a/n. Verify that

$$\frac{N - n}{(n - 1)N} pq$$

is an unbiased estimate of the variance of p.


3.2 In a simple random sample of 200 from a population of 2000 colleges,
120 colleges were in favor of a proposal, 57 were opposed, and 23 had no
opinion. Estimate 95 per cent confidence limits for the number of colleges
in the population which favored the proposal.
3.3 Do the results of the previous sample furnish conclusive evidence that
the majority of the colleges in the population favored this proposal?
3.4 A population with N = 7 consists of the elements B1, C1, C2, C3, D1,
D2, and D3. A simple random sample of size 4 is taken in order to estimate
the proportion of C's to C's + D's. Work out the conditional distributions
of this proportion, p, and verify the formula for its conditional variance.
3.5 In the previous exercise, what is the probability that a sample of size
4 contains B1? Hence find the average variance of p over all simple random
samples of size 4, and verify your answer by the general formula.
3.6 A simple random sample of 290 households was chosen from a city
area containing 14,828 households. Each family was asked whether it owned
or rented the house and also whether it had the exclusive use of an indoor
toilet. Results were as follows:

                              Owned         Rented
Exclusive use of toilet     Yes    No     Yes    No    Total
                            141     6     109    34     290

(i) For families which rent, estimate the percentage in the area who have
exclusive use of an indoor toilet and give the standard error of your estimate;
(ii) estimate the total number of renting families in the area who do not have
exclusive indoor toilet facilities and give the standard error of this estimate.
3.7 In a simple random sample of size 5 from a population of size 30, no
units in the sample were in class C. By the hypergeometric distribution, find
the upper limit to the number A of units in class C in the population,
corresponding to a one-tailed confidence probability of 95 per cent.
3.8 In sampling for an attribute that is rare, one method is to continue
drawing a simple random sample until m units which possess the rare attribute
have been found (Haldane, 1945), where m is decided in advance. If the fpc
can be ignored, show that the probability that the total sample required is
of size n is

$$\frac{(n - 1)!}{(m - 1)! \, (n - m)!} P^m Q^{n - m} \qquad (n \geq m)$$

where P is the frequency of the rare attribute. Find the average size of the
total sample and show that p = (m - 1)/(n - 1) is an unbiased estimate of
P. (For further discussion see Finney, 1949.)

3.11 References.
CHUNG, J. H., and DELURY, D. B. (1950). Confidence limits for the hypergeometric
distribution. University of Toronto Press.
FINNEY, D. J. (1949). On a method of estimating frequencies. Biometrika, 36,
233-234.
FISHER, R. A., and YATES, F. (1948). Statistical tables for biological, agricultural
and medical research. Oliver and Boyd, Edinburgh, 3rd ed.
HALDANE, J. B. S. (1945). On a method of estimating frequencies. Biometrika, 33,
222-225.
NATIONAL BUREAU OF STANDARDS (1950). Tables of the binomial probability
distribution. U. S. Government Printing Office, Washington, D. C.
ROMIG, H. G. (1952). 50-100 Binomial tables. John Wiley & Sons, New York.

Not cited in text

MCCARTHY, P. J. (1951). Sampling, elementary principles. New York State
School of Industrial and Labor Relations, Bull. 15. (A good technical discussion
of sample design and estimation as applied to the estimation of proportions.)
CHAPTER 4

THE ESTIMATION OF SAMPLE SIZE

4.1 A hypothetical example. In the planning of a sample survey, a
stage is always reached at which some decision must be made about
the size of the sample. The decision is an important one. Too large
a sample implies a waste of resources, and too small a sample
diminishes the utility of the results. The decision cannot always be made
satisfactorily, for often we do not possess enough information to be
sure that our choice of sample size is the best one. Sampling theory
provides a framework within which to think intelligently about the
problem.
A hypothetical example may bring out the steps involved in reaching
a solution. An anthropologist is preparing to study the inhabitants of
some island. Among other things, he wishes to estimate what
percentage of the inhabitants belongs to blood group O. Cooperation
has been secured so that it is feasible to take a simple random sample.
How large should the sample be?
This question cannot be discussed without first receiving an answer
to another question: How accurately does the anthropologist wish to
know the percentage of people with blood group O? In reply, he states
that he will be content if the percentage is correct to within ±5 per
cent, in the sense that, if the sample shows 43 per cent to have blood
group O, the percentage for the whole island is sure to lie between 38
and 48.
To avoid misunderstanding, it may be advisable to point out to the
anthropologist that we cannot absolutely guarantee accuracy to within
5 per cent except by measuring everyone. However large n is taken,
there is a chance of a very unlucky sample which is in error by more
than the desired 5 per cent. The anthropologist replies coldly that he
is aware of this, that he is willing to take a 1 in 20 chance of getting an
unlucky sample, and that all he asks for is the value of n instead of a
lecture on statistics.
We are now in a position to make a rough estimate of n. To simplify
matters, the fpc is ignored, and the sample percentage p is assumed
normally distributed. Whether these assumptions are reasonable can
be verified when the initial n is known.

In technical terms, p is to lie in the range (P ± 5), except for a
1 in 20 chance. Since p is assumed normally distributed about P, it
will lie in the range $(P \pm 2\sigma_p)$, apart from a 1 in 20 chance.* Further,

$$\sigma_p = \sqrt{\frac{PQ}{n}}$$

Hence, we may put

$$2 \sqrt{\frac{PQ}{n}} = 5, \quad \text{or} \quad n = \frac{4PQ}{25}$$
At this point a difficulty appears which is common to all problems
in the estimation of sample size. A formula for n has been obtained,
but n depends on some property of the population which is to be sam-
pled. In this instance the property is the quantity P which we would
like to measure. We must therefore ask the anthropologist if he can
give us some idea of the likely value of P. He replies that from pre-
vious data on other ethnic groups, and from his speculations about
the racial history of this island, he will be surprised if P lies outside
the range 30 to 60 per cent.
This information is sufficient to provide a usable answer. For any
value of P between 30 and 60, the product PQ lies between 2100 and
a maximum of 2500 at P = 50. The corresponding n lies between 336
and 400. To be on the safe side, 400 is taken as the initial estimate
of n.
The assumptions made in this analysis can now be re-examined.
With n = 400 and a P between 30 and 60, the distribution of p should
be close to normal. Whether the fpc is required depends on the
number of people on the island. If the population exceeds 8000, the
sampling fraction is less than 5 per cent and no adjustment for fpc is called
for. The method of applying the readjustment, if it is needed, is
discussed in section 4.4.
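The rough calculation for the anthropologist can be reproduced as follows. This sketch uses the formula n = 4PQ/25 derived above, with P and Q in per cent, and scans the range 30 to 60 per cent that he suggested; the function name `n_for_percentage` and the variable names are illustrative.

```python
def n_for_percentage(P, d=5.0, t=2.0):
    """Sample size from t * sqrt(PQ/n) = d, with P, Q and d in per cent."""
    Q = 100.0 - P
    return t ** 2 * P * Q / d ** 2

# n at the ends of the suggested range and at the worst case P = 50
candidates = {P: n_for_percentage(P) for P in range(30, 61)}
n_initial = max(candidates.values())
print(candidates[30], candidates[50], candidates[60], n_initial)

# Smallest population for which the sampling fraction does not exceed 5 per cent
min_population = n_initial / 0.05
print(min_population)
```

Taking the maximum over the range is the "safe side" choice described in the text.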

* The factor 2 instead of the more correct factor 1.96 gives a small margin of
safety.

4.2 Analysis of the problem. The principal steps involved in the
choice of a sample size are as follows:
i. There must be some statement as to what is expected of the
sample. This statement may be in terms of desired limits of error, as
in the previous example, or in terms of some decision that is to be
made or action that is to be taken when the sample results are known.
The responsibility for framing the statement rests primarily with the
persons who wish to use the results of the survey, though they
frequently need guidance in putting their wishes into numerical terms.
ii. Some equation must be found which connects n with the desired
precision of the sample. The equation will vary with the content of
the statement of precision and with the kind of sampling that is
envisaged. One of the advantages of probability sampling is that it
enables this equation to be constructed.
iii. This equation will contain, as parameters, certain unknown
properties of the population. These must be estimated in order to
give specific results.
iv. It often happens that data are to be published for certain major
subdivisions of the population, and that desired limits of error are set
up for each subdivision. A separate calculation is made for the n in
each subdivision, and the total n is found by addition.
v. More than one item or characteristic is usually measured in a
sample survey : sometimes the number of items is large. If a desired
degree of precision is prescribed for each item, the calculations lead to
a series of conflicting values of n, one for each item. Some method
must be found for reconciling these values.
vi. Finally, the chosen value of n must be appraised to see whether
it is consistent with the resources available to take the sample. This
demands an estimation of the cost, labor, time, and materials required
to obtain the proposed size of sample. It sometimes becomes apparent
that n will have to be drastically reduced. A hard decision must then
be faced- whether to proceed with a much smaller sample size, thus
reducing precision, or to abandon efforts until more resources can be
found.
In succeeding sections, some of these questions are examined in
more detail.

4.3 The specification of precision. The statement of precision
desired may be made by stating the amount of error which we are willing
to tolerate in the sample estimates. This amount is determined, as
best we can, in the light of the uses to which the sample results are to
be put. Sometimes it is difficult to decide how much error should be
tolerated, particularly when the results have several different uses.
Suppose that we asked the anthropologist why he wished the
percentage with blood group O to be correct to 5 per cent, rather than,
say, 4 or 6 per cent. He might reply that the blood group data are to
be used primarily for racial classification. He strongly suspects that
the islanders belong either to a racial type with a P of about 35 per
cent or to one with a P of about 50 per cent. An error limit of 5 per
cent in the estimate seemed to him small enough to permit
classification into one of these types. He would, however, have no violent
objection to 4 or 6 per cent limits of error.
Thus the choice of a 5 per cent limit of error by the anthropologist
was to some extent arbitrary. In this respect the example is typical
of the way in which a limit of error is often decided upon. In fact, the
anthropologist was more certain of what he wanted than many other
scientists and many administrators will be found to be. When the
question of desired degree of precision is first raised, such persons may
confess that they have never thought about the question and have no
ideas as to the answer. My experience has been, however, that after
a little discussion they can frequently indicate at least roughly the
size of a limit of error which appears reasonable to them.
Further than this we may not be able to go in many practical
situations. Part of the difficulty is that not enough is known about the
consequences of errors of different sizes as they affect the wisdom of
practical decisions that are made from survey results. This subject
deserves more study than it is currently receiving. As knowledge
accumulates, the choice of a desired degree of precision will become
easier. Even when the consequences of errors are known, however,
there are many important surveys whose results are used by different
people for different purposes, and some of the purposes are not
foreseen at the time when the survey is planned. Consequently, an
element of guesswork is likely to be prominent in the specification of
precision for some time to come.
If the sample is taken for a very specific purpose, e.g. for making a
single "yes" or "no" decision, or for deciding how much money to
spend on a certain venture, the precision needed can usually be stated
in a more definite manner, in terms of the consequences of errors in
the decision. A general approach to problems of this type is given in
section 4.8, which, although in need of amplification, offers a logical
start on a solution.

4.4 The formula for n in sampling for proportions. The units are
classified into two classes, C and C'. Some margin of error d in the
estimated proportion p of units in class C has been agreed upon, as
well as a small risk α which we are willing to incur that the actual
error is larger than d. That is, we want

$$\Pr\{ |p - P| \geq d \} = \alpha$$

Simple random sampling is assumed, and p is taken as normally
distributed. From theorem 3.2, section 3.2,

$$\sigma_p = \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{PQ}{n}}$$

Hence the formula which connects n with the desired degree of
precision is

$$d = t \sqrt{\frac{N - n}{N - 1}} \sqrt{\frac{PQ}{n}}$$

where t is the abscissa of the normal curve which cuts off an area α at
the tails. Solving for n, we find

$$n = \frac{\dfrac{t^2 PQ}{d^2}}{1 + \dfrac{1}{N} \left( \dfrac{t^2 PQ}{d^2} - 1 \right)} \tag{4.1}$$

For practical use, an advance estimate p of P is substituted in this
formula. If N is large, a first approximation is

$$n_0 = \frac{t^2 pq}{d^2} \tag{4.2}$$

Note that d and t enter the formulas only in their ratio. If

$$V = \frac{d^2}{t^2} = \text{Desired variance of the sample proportion}$$

we have

$$n_0 = \frac{pq}{V}$$

In practice we first calculate $n_0$. If $n_0/N$ is negligible, $n_0$ is a
satisfactory approximation to the n of equation (4.1). If not, it is apparent
on comparison of (4.1) and (4.2) that n is obtained as

$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} \approx \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{4.3}$$
Example. In the hypothetical blood groups example, we had

$$d = 0.05; \quad p = 0.5; \quad \alpha = 0.05; \quad t = 2$$

Thus

$$n_0 = \frac{(4)(0.5)(0.5)}{(0.0025)} = 400$$

Let us assume that there are only 3200 people on the island. The
fpc is needed, and we find

$$n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}} = \frac{400}{1 + \dfrac{399}{3200}} = 356$$
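The two-step calculation, first approximation and then correction for a finite population, is easy to wrap in a helper. The sketch below follows the first form of the fpc adjustment, n0 / (1 + (n0 - 1)/N); the function name `sample_size_proportion` and its signature are illustrative assumptions.

```python
def sample_size_proportion(p, d, t, N=None):
    """First approximation n0 = t^2 p q / d^2, adjusted for the fpc if N is given."""
    n0 = t ** 2 * p * (1.0 - p) / d ** 2
    if N is None:
        return n0
    return n0 / (1.0 + (n0 - 1.0) / N)

# Blood group example: d = 0.05, p = 0.5, t = 2, island of 3200 people
n0 = sample_size_proportion(p=0.5, d=0.05, t=2)
n = sample_size_proportion(p=0.5, d=0.05, t=2, N=3200)
print(round(n0), round(n))
```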
The formula for $n_0$ holds also if d, p, and q are all expressed as
percentages instead of proportions. Since the product pq increases as p
moves toward 1/2 or 50 per cent, a conservative estimate of n is obtained
by choosing for p the value nearest to 1/2 in the range in which p is
thought likely to lie. If p seems likely to lie between 5 and 9 per cent,
for instance, we assume 9 per cent for the estimation of n.
A good discussion of sample size for proportions, with a specific
application, is given by Cornfield (1951).

4.5 The formula for n with continuous data. If $\bar{y}$ is the average of
the observations from a simple random sample, we wish to have

$$\Pr\{ |\bar{y} - \bar{Y}| \geq d \} = \alpha$$

where d is the chosen margin of error, and α a small probability. We
assume that $\bar{y}$ is normally distributed: from theorem 2.2, corollary 1,
its standard error is

$$\sigma_{\bar{y}} = \sqrt{\frac{N - n}{N}} \frac{S}{\sqrt{n}}$$

Hence

$$d = t \sqrt{\frac{N - n}{N}} \frac{S}{\sqrt{n}} \tag{4.4}$$

This gives

$$n = \frac{\left( \dfrac{tS}{d} \right)^2}{1 + \dfrac{1}{N} \left( \dfrac{tS}{d} \right)^2}$$

As in the previous section, we take as a first approximation

$$n_0 = \left( \frac{tS}{d} \right)^2 \tag{4.5}$$

This is adequate unless $n_0/N$ is appreciable, in which event we
compute n as

$$n = \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{4.6}$$
Example. In nurseries which produce young trees for sale, it is
advisable to estimate, in late winter or early spring, how many healthy
young trees are likely to be on hand, since this determines policy
towards the solicitation and acceptance of orders. A study of sampling
methods for the estimation of the total numbers of seedlings was
undertaken by Johnson (1943). The data which follow were obtained from
a bed of silver maple seedlings, 1 ft wide and 430 ft long. The sampling
unit was 1 ft of the length of the bed, so that N = 430. By complete
enumeration of the bed, it was found that $\bar{Y} = 19$, $S^2 = 85.6$, these
being the true population values.
With simple random sampling, how many units must be taken to
estimate $\bar{Y}$ within 10 per cent, apart from a chance of 1 in 20? From
equation (4.5) we obtain

$$n_0 = \frac{t^2 S^2}{d^2} = \frac{(4)(85.6)}{(1.9)^2} = 95$$

Since $n_0/N$ is not negligible, we take

$$n = \frac{95}{1 + \dfrac{95}{430}} = 78$$

Almost 20 per cent of the whole bed has to be counted in order to
attain the precision desired.
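The seedling example can be retraced in the same way. This sketch applies equations (4.5) and (4.6) with the margin of error set to 10 per cent of the population mean; the function name `sample_size_mean` is a hypothetical label, and the script keeps n0 unrounded, which does not change the final answer here.

```python
def sample_size_mean(S2, d, t, N=None):
    """n0 = t^2 S^2 / d^2, reduced to n0 / (1 + n0/N) when N is supplied."""
    n0 = t ** 2 * S2 / d ** 2
    if N is None:
        return n0
    return n0 / (1.0 + n0 / N)

Y_bar, S2, N = 19.0, 85.6, 430
d = 0.10 * Y_bar  # margin of error: 10 per cent of the mean
n0 = sample_size_mean(S2, d, t=2)
n = sample_size_mean(S2, d, t=2, N=N)
print(round(n0), round(n), round(100 * n / N, 1))
```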
The previous example is atypical in that the population value $S^2$ is
known. In practice, $S^2$ is estimated from previous sampling of a
similar or related population, or by intelligent guesswork. This points to
the value of publishing, or at least keeping accessible, records of
standard deviations obtained in sample surveys, as a guide for future
samples.
Even with little guidance from previous work, a serviceable estimate
of $S^2$ can often be made. For example, in early studies in the
estimation of wireworm populations in the soil, a tool was used which took a
small sample (e.g. 9 x 9 x 5 in. deep) of the topsoil. For estimation of
sample size, the sampler needed to know the standard deviation of the
number of wireworms which a sample would contain. If wireworms
were distributed at random over the topsoil, the number found in a
small volume would follow a Poisson distribution, for which $S^2 = \bar{Y}$.
Since there might be some tendency for wireworms to congregate, it
was decided to assume $S^2 = 1.2\bar{Y}$, the factor 1.2 being an arbitrary
safety factor. Although $\bar{Y}$ itself was not known, the values of $\bar{Y}$ that
are of economic importance could be delineated from studies of the
wireworm densities that are critical with respect to damage to crop
growth. These two pieces of information made it possible to
determine sample sizes that proved satisfactory. Deming (1950) gives
useful hints for estimating S from some knowledge of the range and shape
of the distribution.
The formulas for n given here apply only to simple random sampling
in which the sample mean is used as the estimate of $\bar{Y}$. For other
methods of sampling and estimation, the appropriate formulas will be
presented with the discussion of these techniques.
In the discussion thus far, we have specified that n must be large
enough so that

$$\Pr\{ |\bar{y} - \bar{Y}| \geq d \} = \alpha$$

where d has been called the margin of error. Alternatively, the sample
size is sometimes specified as large enough to provide a confidence
interval of half-width d, with confidence probability (1 - α). Except
in small samples, these two methods of specification lead to the same
estimated value for n, as will now be shown.
From section 2.7, the half-width of the confidence interval is

$$d = t \sqrt{\frac{N - n}{N}} \frac{s}{\sqrt{n}} \tag{4.7}$$

This equation is the same as equation (4.4), with s in place of S,
except that if n is found to be less than 30, the t-value in equation (4.7)
for d is obtained from Student's t-distribution, with (n - 1) degrees
of freedom, instead of from the normal distribution. Both methods
are approximations (see section 4.7).

4.6 Sample size with more than one item. In most surveys,
information is collected on more than one item. One method of determining
sample size is to specify margins of error for those items that are
regarded as most vital to the survey. An estimation of the sample size
needed is first made separately for each of these important items.
When the single-item estimations of n have been completed, it is
time to take stock of the situation. It may happen that the n's
required are all reasonably close. If the largest of the n's falls within
the limits of the budget, this n is selected. More commonly, there is
a sufficient variation among the n's required so that we are reluctant
to choose the largest, either from budgetary considerations or because
of the fact that this will give an overall standard of precision
substantially higher than was originally contemplated. In this event the
desired standard of precision may perhaps be relaxed for certain of the
items, in order to permit the use of a smaller value of n.
In some cases the n's required for different items are so discordant
that certain items must be dropped from the inquiry, because with the
resources available the precision expected for these items is totally
inadequate. The difficulty may be not merely one of sample size.
Some items call for a different type of sampling from others. With
populations that are sampled repeatedly, it is useful to amass
information about those items which can be combined economically in a
general survey, and those which necessitate special methods. As an
example, a classification of items into four types, suggested by
experience in regional agricultural surveys, is shown in table 4.1. In this
classification, a general survey means one in which the units are fairly
evenly distributed over some region, as for example by a simple
random sample.

TABLE 4.1 AN EXAMPLE OF DIFFERENT TYPES OF ITEM IN REGIONAL SURVEYS

Type  Characteristics of item                 Type of sampling needed

i     Widespread throughout the region,       A general survey with low
      occurring with reasonable frequency     sampling ratio.
      in all parts of the region.
ii    Widespread throughout the region,       A general survey, but with
      but with low frequency.                 a higher sampling ratio.
iii   Occurring with reasonable frequency     For best results, a stratified
      in most parts of the region, but        sample with different
      with more sporadic distribution,        intensities in different
      being absent in some parts and          parts of the region
      highly concentrated in others.          (chapter 5). Can sometimes
                                              be included in a general
                                              survey with some
                                              supplementary sampling.
iv    Distribution very sporadic, or          Not suitable for a general
      concentrated in a small part of         survey. Requires a sample
      the region.                             geared to its distribution.
4.7 Stein's method of two-stage sampling. If S or P is estimated
from previous data, or by guesswork, the calculations given for n do
not provide any assurance that the margin of error or the confidence
limits will be of the desired size, because the provisional estimates of
S and P may be wrong. For continuous data, Stein (1945) developed
a method which guarantees that the confidence interval will be no
larger than some stated amount. In this method, the information
about S is obtained from the population that is being sampled. Since
Stein's technique assumes that the parent population is normal, its
practical application is restricted to sampling in which this condition
is approximately satisfied. The method is, however, of great interest,
because it gives a chosen degree of precision which is not dependent
on the correctness of initial guesses.
The sample is taken in two parts. The first part, of size $n_1$, supplies
an estimate s of S, calculated in the usual way, and also a preliminary
estimate of the mean $\bar{Y}$. When the first part has been taken, Stein
shows how to calculate the number of additional observations needed
in order to have a specified confidence interval. Both parts must be
samples from the population that is being surveyed. If the population
changes with time, the time interval between the first and second parts
must be sufficiently small so that no appreciable change will have
occurred.
Since the method was developed for infinite populations, the case
where n/N is negligible will be considered first. When the first sample
has been obtained, a confidence interval for $\bar{Y}$ can be calculated. The
half-width of this interval is

$$t_1 s / \sqrt{n_1}$$

where $t_1$ denotes the t-value for ($n_1$ - 1) degrees of freedom and
confidence probability (1 - α). If this quantity is less than or equal to d,
the desired half-width, the sample is already sufficiently large. If the
quantity exceeds d, additional observations are taken so that the total
size of sample n is at least as great as

$$t_1^2 s^2 / d^2$$

Then, if $\bar{y}$ is the mean of the whole sample,

$$\Pr\{ |\bar{y} - \bar{Y}| \geq d \} \leq \alpha \tag{4.8}$$

Sketch of proof. The proof assumes that the observations, $y_1, y_2, \ldots, y_n$,
are normally distributed about $\bar{Y}$. Throughout the proof, d, $\alpha$, and
$n_1$ are fixed quantities. The total sample size n is not fixed, but is a
random variate, since its value depends on the value of s that turns up
in the first sample. Nevertheless, for fixed s, n is fixed, and the quantity

$$\sqrt{n}\,(\bar{y} - \bar{Y})$$

is normally distributed with mean zero and variance $\sigma^2$. Hence, this
quantity follows the normal distribution whether s is fixed or not.
Moreover, by a well-known property of the normal distribution, the
distribution is independent of that of s. Consequently,

$$\sqrt{n}\,(\bar{y} - \bar{Y})/s$$

follows the t-distribution with $(n_1 - 1)$ degrees of freedom. By defini-
tion of $t_1$ it follows that

$$\Pr\left\{\frac{|\sqrt{n}\,(\bar{y} - \bar{Y})|}{s} \ge t_1\right\} = \alpha \tag{4.9}$$

This is the key result in the proof. Further, by the way in which the
value of n was calculated, we always have

$$\frac{\sqrt{n}}{s} \ge \frac{t_1}{d} \tag{4.10}$$

Hence, from (4.9)

$$\Pr\left\{\frac{t_1 |\bar{y} - \bar{Y}|}{d} \ge t_1\right\} \le \alpha$$

i.e.

$$\Pr\{|\bar{y} - \bar{Y}| \ge d\} \le \alpha$$

Example. Suppose that d = 10, $\alpha$ = 0.05. From previous infor-
mation, S is guessed as about 60, although this guess may be seriously
in error. The first step in applying Stein's method is to select a value
of $n_1$, the size of the initial sample. For this, it is helpful to know how
large the final sample would have to be if the assumed value of S hap-
pened to be correct. This value is

$$n = \left(\frac{tS}{d}\right)^2 = \frac{(2 \times 60)^2}{(10)^2} = 144$$

where t is taken from the normal distribution. Suppose that $n_1$ is
taken as 50. This value gives a reasonably large number of degrees of
freedom for estimating S and does not commit us to too large an initial
sample in case S should turn out to be less than we feared.
When the first sample is taken, $s^2$ is found to be 1938. Since $t_1$, for
49 df, is 2.01, we have

$$\frac{t_1 s}{\sqrt{n_1}} = \frac{(2.01)(44.02)}{7.0711} = 12.51$$

so that the sample of 50 gives a confidence interval of half-width 12.51
instead of 10. Finally, n is chosen so that

$$n = \frac{t_1^2 s^2}{d^2} = \frac{(4.040)(1938)}{100} = 78.3$$

That is, 29 additional observations are taken to make the total n = 79.
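
The arithmetic of this example can be checked with a short sketch. The function name `stein_total_sample_size` and its argument names are illustrative and do not come from the text; the t-value 2.01 for 49 df is supplied from the example rather than computed, and only the case in which n/N is negligible is covered.

```python
import math

def stein_total_sample_size(s2, n1, t1, d):
    """Half-width after the first sample and the total n required by Stein's rule.

    s2: variance estimate from the first sample
    n1: size of the first sample
    t1: t-value for (n1 - 1) degrees of freedom (read from tables)
    d:  desired half-width of the confidence interval
    """
    half_width = t1 * math.sqrt(s2 / n1)
    if half_width <= d:
        # The first sample is already large enough.
        return half_width, n1
    # Otherwise the total n must be at least t1^2 s^2 / d^2.
    n_required = math.ceil(t1 ** 2 * s2 / d ** 2)
    return half_width, max(n1, n_required)

half_width, n = stein_total_sample_size(s2=1938, n1=50, t1=2.01, d=10)
print(round(half_width, 2), n, n - 50)   # prints: 12.51 79 29
```

Rounding n up to the next integer is what turns the computed 78.3 into a total of 79 observations.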

4.8 An attempt at a general solution of the sample size problem.
This approach is most easily explained in the situation where certain
practical decisions are to be made from the sample results. Such deci-
sions will presumably be more fruitful if the sample estimate has a low
error than if it has a high error. We may be able to calculate, in mone-
tary terms, the loss l(z) that will be incurred in a decision through an
error of amount z in the estimate. Although the actual value of z is not
predictable in advance, sampling theory enables us to find the fre-
quency distribution f(z, n) of z, which for a specified sampling method
will depend on the sample size n. Hence the expected loss for a given
size of sample is
$$L(n) = \int l(z)\, f(z, n)\, dz$$

The purpose in taking the sample is to diminish this loss. If C(n)
is the cost of a sample of size n, clearly n should be chosen so as to
minimize
$$C(n) + L(n)$$
since this is the total cost involved in taking the sample and in making
decisions from its results. The choice of n determines both the opti-
mum size of sample and the most advantageous degree of precision.
Alternatively, the same approach can be presented in terms of the
monetary gain which accrues from having the sample information,
rather than in terms of the loss which arises from errors in the sample
information. If monetary gain is used, we construct an expected gain
G(n) from a sample of size n, where G(n) is zero if no sample is taken.
We maximize
$$G(n) - C(n)$$
In this form the principle is equivalent to the rule in classical eco-
nomics that profit is to be maximized.
The simplest application occurs when the loss function, l(z), is $\lambda z^2$,
where $\lambda$ is a constant. It follows that

$$L(n) = \lambda E(z^2)$$

For instance, if $\bar{y}$ is the sample estimate of $\bar{Y}$, and $z = \bar{y} - \bar{Y}$,

$$L(n) = \lambda V(\bar{y}) = \frac{\lambda S^2}{n}$$

if simple random sampling is used and the fpc is ignored.
The simplest type of cost function for the sample is

$$C(n) = c_0 + c_1 n$$

where $c_0$ is the overhead cost. By differentiation, the value of n which
minimizes cost plus loss is

$$n = \sqrt{\frac{\lambda S^2}{c_1}}$$

A more general form of this result is given by Yates (1949). The same
analysis applies to any method of sampling and estimation in which
the variance of the estimate is inversely proportional to n and the cost
is a linear function of n.
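
A minimal numerical sketch of this rule follows, assuming the quadratic loss $\lambda z^2$, the linear cost $c_0 + c_1 n$, and simple random sampling with the fpc ignored. The numerical values of $\lambda$, $S^2$, $c_0$, and $c_1$ are hypothetical, chosen only so that the answer is easy to verify by hand, and the function names are illustrative.

```python
import math

def optimum_n(lam, S2, c1):
    """Value of n minimizing c0 + c1*n + lam*S2/n; the overhead c0 does not affect it."""
    return math.sqrt(lam * S2 / c1)

def cost_plus_loss(n, lam, S2, c0, c1):
    """Total of sampling cost C(n) and expected loss L(n)."""
    return c0 + c1 * n + lam * S2 / n

# Hypothetical values, for illustration only.
lam, S2, c0, c1 = 4.0, 2500.0, 100.0, 1.0

n_star = optimum_n(lam, S2, c1)
# Brute-force check over integer sample sizes.
best_integer = min(range(1, 1001),
                   key=lambda n: cost_plus_loss(n, lam, S2, c0, c1))
print(n_star, best_integer)   # prints: 100.0 100
```

The brute-force search is included only as a check on the calculus; in practice the closed form is rounded to a nearby integer.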
Blythe (1945) describes the application of this principle to the
estimation of the volume of timber in a lot for selling purposes (see
exercise 4.6). Nordin (1944) discusses the optimum size of sample for
estimating potential sales in a market which a manufacturer intends
to enter. If the sales can be forecast accurately, the amount of fixed
equipment and the production per unit period can be allocated so as
to maximize the manufacturer's expected profit.
Although the application of this technique is likely to be restricted
to situations in which the sample is taken for a specific purpose, it
seems probable that this approach to the problem of sample size has
a number of fruitful applications which have not yet been realized.
A related problem is the sampling of lots of articles in a mass-pro-
duction process, in order to determine whether the lot is to be accepted
or rejected on the basis of its estimated quality. Since the purpose of
the sample is usually to lead to a single "yes" or "no" decision, the
best sample size can be decided by examining the consequences of
errors in the decision. Good introductory accounts of the techniques
are given by Tippett (1950) and Deming (1950).

4.9 Exercises.
4.1 A survey is to be made of the prevalence of the common diseases in a
large population. For any disease that affects at least 1 per cent of the in-
dividuals in the population, it is desired to estimate the total number of cases
with a coefficient of variation of not more than 20 per cent. (i) What size of
simple random sample is needed, assuming that the presence of the disease
can be recognized without mistakes? (ii) What size is needed if total cases
are wanted separately for males and females, with the same precision?
4.2 In a wireworm survey, the number of wireworms per acre is to be es-
timated with a limit of error of 30 per cent, at the 95 per cent probability level,
in any field in which wireworm density exceeds 200,000 per acre in the top 5
in. of soil. The sampling tool measures 9 X 9 X 5 in. deep. Assuming that
the number of wireworms in a single sample follows a distribution slightly
more variable than the Poisson, we take $S^2 = 1.2\bar{Y}$. What size of simple
random sample is needed? (1 acre = 43,560 sq ft.)
4.3 The following coefficients of variation per unit were obtained in a
farm survey in Iowa, the unit being an area 1 mile square (data of R. J. Jessen):

                                Estimated cv
Item                                (%)
Acres in farms                       38
Acres in corn                        39
Acres in oats                        44
Number of family workers            100
Number of hired workers             110
Number of unemployed                317

A survey is planned to estimate acreage items with a cv of 2½ per cent and
numbers of workers (excluding unemployed) with a cv of 5 per cent. With
simple random sampling, how many units are needed? How well would this
sample be expected to estimate the number of unemployed?
4.4 By experimental sampling, the mean value of a random variate is to
be obtained correct to 0.001, with confidence probability 95 per cent. The
values of the random variate for the first 20 samples drawn are shown below.
How many more samples are needed?

Sample    Value of           Sample    Value of
 no.      random variate      no.      random variate
  1          .0725             11         .0712
  2          .0755             12         .0748
  3          .0759             13         .0878
  4          .0739             14         .0710
  5          .0732             15         .0754
  6          .0843             16         .0712
  7          .0721             17         .0757
  8          .0769             18         .0737
  9          .0730             19         .0704
 10          .0727             20         .0723
4.5 If the loss function due to an error in $\bar{y}$ is $\lambda|\bar{y} - \bar{Y}|$ and if the cost
$C = c_0 + c_1 n$, show that with simple random sampling, ignoring the fpc, the
most economical value of n is
$$\left(\frac{\lambda S}{c_1 \sqrt{2\pi}}\right)^{2/3}$$

4.6 (Adapted from Blythe, 1945.) The selling price of a lot of standing
timber is UW, where U is the price per unit volume and W is the volume of
timber on the lot. The number N of logs on the lot is counted, and the aver-
age volume per log is estimated from a simple random sample of n logs. The
estimate is made and paid for by the seller and is provisionally accepted by
the buyer. Later, the buyer finds out the exact volume purchased, and the
seller reimburses him if he has paid for more than was delivered. If he has
paid for less than was delivered, the buyer does not mention the fact.
Construct the seller's loss function. Assuming that the cost of measuring
n logs is cn, find the optimum value of n. The standard deviation of the vol-
ume per log may be denoted by S, and the fpc ignored.

4.10 References.
BLYTHE, R. H. (1945). The economics of sample size applied to the scaling of saw-
logs. Biom. Bull., 1, 67-70.
CORNFIELD, J. (1951). The determination of sample size. Amer. Jour. Pub.
Health, 41, 654-661.
DEMING, W. E. (1950). Some theory of sampling. John Wiley & Sons, New York.
JOHNSON, F. A. (1943). A statistical study of sampling methods for tree nursery
inventories. Jour. Forestry, 41, 674-679.
NORDIN, J. A. (1944). Determining sample size. Jour. Amer. Stat. Assoc., 39,
497-506.
STEIN, C. (1945). A two-sample test for a linear hypothesis whose power is inde-
pendent of the variance. Ann. Math. Stat., 16, 243-258.
TIPPETT, L. H. C. (1950). Technological applications of statistics. John Wiley &
Sons, New York.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin
and Co., London.
CHAPTER 5

STRATIFIED RANDOM SAMPLING

5.1 Description. In stratified sampling, the population of N units is
first divided into subpopulations of $N_1, N_2, \ldots, N_L$ units, respectively.
These subpopulations are non-overlapping and together they comprise
the whole of the population, so that
$$N_1 + N_2 + \cdots + N_L = N$$
The subpopulations are called strata. To obtain the full benefit from
stratification, the values of the $N_h$ must be known. When the strata
have been determined, a sample is drawn from each stratum, the draw-
ings being made independently in different strata. The sample sizes
within the strata are denoted by $n_1, n_2, \ldots, n_L$, respectively.
If a simple random sample is taken in each stratum, the whole
procedure is described as stratified random sampling.
Stratification is a very common technique. There are many reasons
for this; the principal ones are the following:
i. If data of known precision are wanted for certain subdivisions of
the population, it is advisable to treat each subdivision as a "population"
in its own right.
ii. Administrative convenience may dictate the use of stratification;
e.g. the agency conducting the survey may have field offices, each of
which can supervise the survey for a part of the population.
iii. Sampling problems may differ markedly in different parts of the
population. With human populations, people living in institutions
(e.g. hotels, hospitals, prisons) are often placed in a different stratum
from people living in ordinary homes, because a different approach to
the sampling is appropriate for the two situations.
iv. Stratification may bring about a gain in precision in the esti-
mates of characteristics of the whole population. The basic idea is
that it may be possible to divide a heterogeneous population into sub-
populations, each of which is internally homogeneous. This is sug-
gested by the name strata, with its implication of a division into layers.
If each stratum is homogeneous, in that the measurements vary little
from one unit to another, a precise estimate of any stratum mean can
be obtained from a small sample in that stratum. As will be shown,
these estimates can then be combined into a precise estimate for the
whole population.
The theory of stratified sampling deals with the properties of the
estimates from a stratified sample and with the best choice of the sam-
ple sizes $n_h$ so as to obtain maximum precision. In this development
it is taken for granted that the strata have already been constructed.
The prior problems of how to construct strata and of how many strata
there should be will be postponed to a later stage (section 5.15).

5.2 Notation. The suffix h denotes the stratum and i the unit within
the stratum. The notation is a natural extension of that previously
used. The following symbols all refer to stratum h:

$N_h$                                                           Total number of units
$n_h$                                                           Number of units in sample
$y_{hi}$                                                        Value obtained for the ith unit
$\bar{Y}_h = \sum_{i=1}^{N_h} y_{hi} / N_h$                     True mean
$\bar{y}_h = \sum_{i=1}^{n_h} y_{hi} / n_h$                     Sample mean
$S_h^2 = \sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)^2 / (N_h - 1)$   True variance

Note that the divisor for the variance is $(N_h - 1)$.

5.3 Properties of the estimates. For the population mean per unit,
the simplest type of estimate appropriate to stratified sampling is $\bar{y}_{st}$
(st for stratified), where

$$\bar{y}_{st} = \frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N} \tag{5.1}$$

The estimate $\bar{y}_{st}$ is not in general the same as the sample mean, for
the sample mean, $\bar{y}$, can be written as

$$\bar{y} = \frac{\sum_{h=1}^{L} n_h \bar{y}_h}{n} \tag{5.2}$$

The difference is that in $\bar{y}_{st}$ the estimates from the individual strata
receive their correct weights $N_h/N$. It is evident that $\bar{y}$ coincides
with $\bar{y}_{st}$ provided that in every stratum

$$\frac{n_h}{n} = \frac{N_h}{N}, \quad \text{or} \quad \frac{n_h}{N_h} = \frac{n}{N} = \text{Constant}$$

This means that the sampling fraction is the same in all strata. This
stratification is described as stratification with proportional allocation
of the $n_h$. It gives a self-weighting sample. If numerous estimates
have to be made, a self-weighting sample is time-saving.
The principal properties of the estimate $\bar{y}_{st}$ are outlined in the fol-
lowing theorems. The first two theorems apply to stratified sampling
in general and are not restricted to stratified random sampling: i.e. the
sample from any stratum need not be a simple random sample.

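Theorem 5.1 If in every stratum the sample estimate $\bar{y}_h$ is unbiased,
then $\bar{y}_{st}$ is an unbiased estimate of the population mean $\bar{Y}$.
Proof:

$$E(\bar{y}_{st}) = E\left(\frac{\sum_{h=1}^{L} N_h \bar{y}_h}{N}\right) = \frac{\sum_{h=1}^{L} N_h \bar{Y}_h}{N}$$

since the estimates are unbiased in the individual strata. But the
population mean $\bar{Y}$ may be written

$$\bar{Y} = \frac{\sum_{h=1}^{L} \sum_{i=1}^{N_h} y_{hi}}{N} = \frac{\sum_{h=1}^{L} N_h \bar{Y}_h}{N}$$

This completes the proof.
Corollary. Since $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$ for simple random
sampling within strata, $\bar{y}_{st}$ is an unbiased estimate of $\bar{Y}$ for stratified
random sampling.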

Theorem 5.2 For stratified sampling, the variance of $\bar{y}_{st}$, as an esti-
mate of the population mean $\bar{Y}$, is

$$V(\bar{y}_{st}) = \frac{\sum_{h=1}^{L} N_h^2 V(\bar{y}_h)}{N^2} \tag{5.3}$$

where
$$V(\bar{y}_h) = E(\bar{y}_h - \bar{Y}_h)^2$$
There are two restrictions on the theorem: (i) $\bar{y}_h$ must be an unbiased
estimate of $\bar{Y}_h$, and (ii) the samples must be drawn independently in
different strata.
Proof:

$$\bar{y}_{st} - \bar{Y} = \frac{\sum N_h \bar{y}_h}{N} - \frac{\sum N_h \bar{Y}_h}{N} = \frac{\sum N_h (\bar{y}_h - \bar{Y}_h)}{N}$$

where the sum extends over all strata. Note that the error $(\bar{y}_{st} - \bar{Y})$
in the estimate is now expressed as a weighted mean of the errors of
estimation which have been made within the individual strata. Hence

$$(\bar{y}_{st} - \bar{Y})^2 = \frac{\sum N_h^2 (\bar{y}_h - \bar{Y}_h)^2}{N^2} + \frac{2 \sum\sum N_h N_j (\bar{y}_h - \bar{Y}_h)(\bar{y}_j - \bar{Y}_j)}{N^2}$$

where the right-hand term extends over all pairs of strata.
We now average over all possible samples. For any cross-product
term, we begin by keeping the sample in stratum h fixed, and average
over all samples in stratum j. Since sampling is independent in the
two strata, the possible samples in stratum j will be the same and have
the same probabilities, whatever sample has been drawn in stratum h.
But since $\bar{y}_j$ is assumed unbiased, the average of $(\bar{y}_j - \bar{Y}_j)$ is zero.
Hence, all cross-product terms vanish.
The squared terms give

$$V(\bar{y}_{st}) = \frac{\sum N_h^2 E(\bar{y}_h - \bar{Y}_h)^2}{N^2} = \frac{\sum N_h^2 V(\bar{y}_h)}{N^2}$$

The important point about this result is that the variance of $\bar{y}_{st}$ de-
pends only on the variances of the estimates of the individual stratum
means $\bar{Y}_h$. If it were possible to divide a highly variable population
into strata such that all items had the same value within a stratum, we
could estimate $\bar{Y}$ without any error. Examination of the proof shows
that it is the use of the correct stratum weights $N_h$ which leads to this
result.
Theorem 5.3 For stratified random sampling, the variance of the
estimate $\bar{y}_{st}$ is

$$V(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h (N_h - n_h) \frac{S_h^2}{n_h} \tag{5.4}$$

Proof: Since $\bar{y}_h$ is an unbiased estimate of $\bar{Y}_h$, theorem 5.2 can be
applied. Further, by theorem 2.2, applied to an individual stratum,

$$V(\bar{y}_h) = \frac{S_h^2}{n_h} \frac{(N_h - n_h)}{N_h} \tag{5.5}$$

By substitution into the result of theorem 5.2, we obtain

$$V(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h^2 V(\bar{y}_h) = \frac{1}{N^2} \sum_{h=1}^{L} N_h (N_h - n_h) \frac{S_h^2}{n_h}$$

Some particular cases of this formula are given in the following
corollaries.
Corollary 1 If the sampling fractions $n_h/N_h$ are negligible in all
strata,

$$V(\bar{y}_{st}) = \frac{1}{N^2} \sum \frac{N_h^2 S_h^2}{n_h} = \sum \frac{W_h^2 S_h^2}{n_h} \tag{5.6}$$

where $W_h = N_h/N$ is the stratum weight. This is the appropriate
formula when finite population corrections can be ignored. With
stratified sampling, there is in general no single finite population cor-
rection factor, the factors entering individually into each stratum.
Corollary 2 With proportional allocation, we substitute

$$n_h = \frac{n N_h}{N}$$

in (5.4). The variance reduces to

$$V(\bar{y}_{st}) = \sum \frac{N_h}{N} \frac{S_h^2}{n} \frac{(N - n)}{N} = \frac{(N - n)}{nN} \sum W_h S_h^2 \tag{5.7}$$

In this case there is a single fpc, (N - n)/N.
Corollary 3 If sampling is proportional and the variances in all
strata have the same value, $S_w^2$, we obtain the simple result

$$V(\bar{y}_{st}) = \frac{S_w^2}{n} \frac{(N - n)}{N} \tag{5.8}$$

Theorem 5.4 If $\hat{Y}_{st} = N\bar{y}_{st}$ is the estimate of the population total
Y, then

$$V(\hat{Y}_{st}) = \sum N_h (N_h - n_h) \frac{S_h^2}{n_h} \tag{5.9}$$

This follows at once from theorem 5.3.


Example. Table 5.1 shows the 1920 and 1930 numbers of inhabi-
tants, in thousands, of 64 large cities in the United States. The data
were obtained by taking the cities which ranked from 5th to 68th in
the United States in total number of inhabitants in 1920. The cities
are arranged in 2 strata, the first containing the 16 largest cities and
the second the remaining 48 cities.

TABLE 5.1  SIZES OF 64 CITIES (IN 1000's) IN 1920 AND 1930

      1920 size ($x_{hi}$)            1930 size ($y_{hi}$)
           Stratum                         Stratum
 h = 1          2                 1             2

  797    314   172   121         900     364   209   113
  773    298   172   120         822     317   183   115
  748    296   163   119         781     328   163   123
  734    258   162   118         805     302   253   154
  588    256   161   118         670     288   232   140
  577    243   159   116        1238     291   260   119
  507    238   153   116         573     253   201   130
  507    237   144   113         634     291   147   127
  457    235   138   113         578     308   292   100
  438    235   138   110         487     272   164   107
  415    216   138   110         442     284   143   114
  401    208   138   108         451     255   169   111
  387    201   136   106         459     270   139   163
  381    192   132   104         464     214   170   116
  324    180   130   101         400     195   150   122
  315    179   126   100         366     260   143   134

NOTE. Cities are arranged in the same order in both years.

Totals and sums of squares

                  1920                          1930
Stratum   $\sum x_{hi}$   $\sum x_{hi}^2$   $\sum y_{hi}$   $\sum y_{hi}^2$
   1          8,349        4,756,619         10,070        7,145,450
   2          7,941        1,474,871          9,498        2,141,720
The total 1930 number of inhabitants in all 64 cities is to be esti-
mated from a sample of size 24. Find the standard error of the esti-
mated total for (i) a simple random sample, (ii) a stratified random
sample with proportional allocation, (iii) a stratified random sample
with 12 units drawn from each stratum.
This population resembles the populations of many types of business
enterprise in that some units, the large cities, contribute very sub-
stantially to the total and display much greater variability than the
remainder.
The stratum totals and sums of squares are given under table 5.1.
For the complete population in 1930, we find

$$Y = 19{,}568; \quad S^2 = 52{,}448$$

The three estimates of Y are denoted by $\hat{Y}_{ran}$, $\hat{Y}_{prop}$, and $\hat{Y}_{equal}$.

i. For simple random sampling,

$$V(\hat{Y}_{ran}) = \frac{N^2 S^2}{n} \frac{(N - n)}{N} = \frac{(64)^2 (52{,}448)}{24} \cdot \frac{40}{64} = 5{,}594{,}453$$

from theorem 2.2, corollary 2. The standard error is

$$\sigma(\hat{Y}_{ran}) = 2365$$

ii. For the individual strata the variances are

$$S_1^2 = 53{,}843; \quad S_2^2 = 5581$$

Note that the stratum with the largest cities has a variance nearly
10 times that of the other stratum.
In proportional allocation, we have $n_1 = 6$, $n_2 = 18$. From formula
(5.7), multiplying by $N^2$, we have

$$V(\hat{Y}_{prop}) = \frac{(N - n)}{n} \sum N_h S_h^2 = \frac{40}{24} \left[(16)(53{,}843) + (48)(5581)\right] = 1{,}882{,}293$$

$$\sigma(\hat{Y}_{prop}) = 1372$$

iii. For $n_1 = n_2 = 12$ we use the general formula (5.9):

$$V(\hat{Y}_{equal}) = \sum N_h (N_h - n_h) \frac{S_h^2}{n_h} = \frac{(16)(4)(53{,}843)}{12} + \frac{(48)(36)(5581)}{12} = 1{,}090{,}827$$

$$\sigma(\hat{Y}_{equal}) = 1044$$

In this example, equal sample sizes in the two strata are more pre-
cise than proportional allocation. Both are greatly superior to simple
random sampling. The 1920 data, not utilized here, will appear in
later examples.
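
The three standard errors in this example can be reproduced from the stratum totals and sums of squares alone. The following is a sketch rather than part of the original treatment; the variable and function names are illustrative, and the stratum variances are recomputed from the 1930 totals instead of being typed in.

```python
import math

N_h = [16, 48]                    # stratum sizes
sum_y = [10_070, 9_498]           # 1930 stratum totals
sum_y2 = [7_145_450, 2_141_720]   # 1930 stratum sums of squares
N, n = sum(N_h), 24

def variance(total, total_sq, size):
    """Variance with divisor (size - 1), from a total and a sum of squares."""
    return (total_sq - total ** 2 / size) / (size - 1)

S2 = variance(sum(sum_y), sum(sum_y2), N)                    # whole population
S2_h = [variance(t, q, m) for t, q, m in zip(sum_y, sum_y2, N_h)]

def var_stratified_total(n_h):
    """Formula (5.9): sum over strata of N_h (N_h - n_h) S_h^2 / n_h."""
    return sum(Nh * (Nh - nh) * s2 / nh
               for Nh, nh, s2 in zip(N_h, n_h, S2_h))

v_ran = N ** 2 * S2 / n * (N - n) / N                        # simple random sample
v_prop = var_stratified_total([n * Nh / N for Nh in N_h])    # n_h = 6, 18
v_equal = var_stratified_total([12, 12])                     # n_h = 12, 12

for name, v in [("simple random", v_ran),
                ("proportional", v_prop),
                ("equal", v_equal)]:
    print(name, round(math.sqrt(v)))
# prints standard errors 2365, 1372 and 1044
```

The variances differ slightly from the printed figures in their last digits because the text rounds $S^2$ and $S_h^2$ before multiplying; the standard errors agree to the nearest unit.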

5.4 The estimated variance and confidence limits. If a simple ran-
dom sample is taken within each stratum, an unbiased estimate of $S_h^2$
is (from theorem 2.4)

$$s_h^2 = \frac{1}{n_h - 1} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2 \tag{5.10}$$

Hence we obtain
Theorem 5.5 With stratified random sampling, an unbiased esti-
mate of the variance of $\bar{y}_{st}$ is

$$v(\bar{y}_{st}) = s^2(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h (N_h - n_h) \frac{s_h^2}{n_h} \tag{5.11}$$

An alternative form for computing purposes is

$$s^2(\bar{y}_{st}) = \sum_{h=1}^{L} \frac{W_h^2 s_h^2}{n_h} - \sum_{h=1}^{L} \frac{W_h s_h^2}{N} \tag{5.12}$$

The second term on the right represents the reduction due to the fpc.
In order that this estimate can be computed, there must be at least
two units drawn from every stratum. Estimation of the variance
when stratification is carried to the point where only one unit is chosen
per stratum is discussed in section 5.21.
Corollary. In certain applications it is reasonable to suppose that
$S_h^2$ has the same value in all strata. From the analysis of variance of
the sample, a pooled estimate of this common variance is

$$s_w^2 = \frac{\sum_{h=1}^{L} \sum_{i=1}^{n_h} (y_{hi} - \bar{y}_h)^2}{(n - L)}$$

Since sampling is usually proportional in this situation, the estimated
variance of $\bar{y}_{st}$ takes the simple form (from theorem 5.3, corollary 3)

$$v(\bar{y}_{st}) = \frac{s_w^2}{n} \frac{(N - n)}{N}$$

with (n - L) degrees of freedom.
The formulas for confidence limits are as follows:

Population mean: $\bar{y}_{st} \pm t\, s(\bar{y}_{st})$    (5.13)

Population total: $N\bar{y}_{st} \pm t N s(\bar{y}_{st})$    (5.14)

These formulas assume that $\bar{y}_{st}$ is normally distributed, and that
$s(\bar{y}_{st})$ is well determined, so that the multiplier t can be read from
tables of the normal distribution.
If only a few degrees of freedom are provided by each stratum, the
usual procedure for taking account of the sampling error attached to
a quantity like $s(\bar{y}_{st})$ is to read the t-value from the tables of Student's
t instead of from the normal table. The distribution of $s(\bar{y}_{st})$ is in
general too complex to allow a strict application of this method. An
approximate method of assigning an effective number of degrees of
freedom to $s(\bar{y}_{st})$ is as follows (Satterthwaite, 1946):
We may write

$$s^2(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} f_h s_h^2, \quad \text{where } f_h = \frac{N_h (N_h - n_h)}{n_h}$$

The effective number of degrees of freedom $n_e$ is

$$n_e = \frac{\left(\sum f_h s_h^2\right)^2}{\sum \dfrac{f_h^2 s_h^4}{n_h - 1}} \tag{5.15}$$

The value of $n_e$ always lies between the smallest of the values
$(n_h - 1)$ and their sum. The approximation takes account of the
fact that $S_h^2$ may vary from stratum to stratum, but it assumes that
the values $y_{hi}$ are normally distributed, and is worth while only if this
condition appears to hold.
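
As an illustration of formulas (5.11) and (5.15), the sketch below computes the estimated variance of the stratified mean and the effective number of degrees of freedom. The function name is illustrative, and the sample variances passed in are hypothetical figures chosen for the example, not data from the text.

```python
def stratified_variance_and_df(N_h, n_h, s2_h):
    """Return (v, n_e): the estimated variance (5.11) and effective df (5.15).

    N_h: stratum sizes; n_h: sample sizes (each at least 2);
    s2_h: sample variances within strata.
    """
    N = sum(N_h)
    # f_h = N_h (N_h - n_h) / n_h, so that s^2(y_st) = sum(f_h s_h^2) / N^2.
    f_h = [Nh * (Nh - nh) / nh for Nh, nh in zip(N_h, n_h)]
    terms = [f * s2 for f, s2 in zip(f_h, s2_h)]
    v = sum(terms) / N ** 2
    n_e = sum(terms) ** 2 / sum(t ** 2 / (nh - 1) for t, nh in zip(terms, n_h))
    return v, n_e

# Hypothetical sample variances, for illustration only.
v, n_e = stratified_variance_and_df(N_h=[16, 48], n_h=[12, 12],
                                    s2_h=[50_000.0, 6_000.0])
print(v, n_e)
```

With two strata of 12 units each, $n_e$ must fall between 11 and 22; the t-value would then be read from tables at roughly $n_e$ degrees of freedom.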

5.5 Optimum allocation. The problem of allocation concerns the
choice of the sample sizes $n_h$ in the respective strata. Theorem 5.6
gives an important result established by Tschuprow (1923) and Ney-
man (1934).
Theorem 5.6 In stratified random sampling, the variance of the
estimated mean $\bar{y}_{st}$ is smallest, for a fixed total size of sample, if the
sample is allocated with $n_h$ proportional to $N_h S_h$.
Proof: The problem is to minimize

$$V(\bar{y}_{st}) = \frac{1}{N^2} \sum_{h=1}^{L} N_h (N_h - n_h) \frac{S_h^2}{n_h}$$

subject to the restriction
$$n_1 + n_2 + \cdots + n_L = n$$
We select the $n_h$ and the Lagrange multiplier $\lambda$ so as to minimize

$$V(\bar{y}_{st}) + \lambda (n_1 + n_2 + \cdots + n_L - n) = \frac{1}{N^2} \sum \left(\frac{N_h^2}{n_h} - N_h\right) S_h^2 + \lambda (n_1 + n_2 + \cdots + n_L - n)$$

Differentiating with respect to $n_h$, we obtain the equation

$$-\frac{N_h^2 S_h^2}{N^2 n_h^2} + \lambda = 0 \tag{5.16}$$

This gives

$$n_h = \frac{N_h S_h}{N \sqrt{\lambda}}, \quad \text{or} \quad n_h \propto N_h S_h \tag{5.17}$$

To find the actual value of $n_h$, add (5.17) over the strata. Thus

$$\sum n_h = n = \frac{\sum N_h S_h}{N \sqrt{\lambda}} \tag{5.18}$$

Substitution for $\sqrt{\lambda}$ into (5.17) gives

$$n_h = n \frac{N_h S_h}{\sum N_h S_h} \tag{5.19}$$

This result states that the sample size in a stratum should be propor-
tional to the product of the size of the stratum and the standard devia-
tion of the stratum, or in other words that the sampling fraction
$n_h/N_h$ should be proportional to the standard deviation. Other things
being equal, a larger sample is needed in a variable stratum.
An expression for the minimum variance itself is obtained by sub-
stituting the values of $n_h$ given by (5.19) into the general formula for
$V(\bar{y}_{st})$. This gives

$$V_{min}(\bar{y}_{st}) = \frac{1}{N^2} \frac{\left(\sum N_h S_h\right)^2}{n} - \frac{1}{N^2} \sum N_h S_h^2 \tag{5.20}$$
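
Formulas (5.19) and (5.20) can be sketched in a few lines. The function names are illustrative; the inputs are the stratum sizes and the rounded stratum variances of the 64-city example of section 5.3 ($S_1^2$ = 53,843, $S_2^2$ = 5581), used here on the assumption that the true $S_h$ are known.

```python
import math

def neyman_allocation(n, N_h, S_h):
    """Formula (5.19): n_h = n * N_h S_h / sum(N_h S_h), left unrounded."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

def min_variance_mean(n, N_h, S_h):
    """Formula (5.20): minimum variance of the stratified mean for fixed n."""
    N = sum(N_h)
    first = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2 / n
    second = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    return (first - second) / N ** 2

N_h = [16, 48]
S_h = [math.sqrt(53_843), math.sqrt(5_581)]
alloc = neyman_allocation(24, N_h, S_h)
print([round(x, 1) for x in alloc])        # roughly 12.2 and 11.8
print(min_variance_mean(24, N_h, S_h))
```

The optimum allocation is close to 12 units per stratum, which is consistent with the finding in that example that equal sample sizes were more precise than proportional allocation.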

5.6 Optimum allocation with varying costs. A more general approach
is to consider optimum allocation for a specified total cost. For this
we need a cost function, which expresses the cost of taking the sample
in terms of the sample sizes $n_h$. We shall consider only the simple
function
$$\text{Cost} = C = a + \sum c_h n_h \tag{5.21}$$
Within any stratum, the cost mounts directly with the size of sample,
but the cost per unit, $c_h$, may vary from stratum to stratum. The
symbol a represents an overhead cost.
Theorem 5.7 With a cost function of the form (5.21), the variance
of the estimated mean $\bar{y}_{st}$ is a minimum when $n_h$ is proportional to
$N_h S_h / \sqrt{c_h}$.
Proof: This is similar to that of theorem 5.6. We minimize

$$V(\bar{y}_{st}) + \lambda \left(a + \sum c_h n_h\right) = \frac{1}{N^2} \sum \left(\frac{N_h^2}{n_h} - N_h\right) S_h^2 + \lambda \left(a + \sum c_h n_h\right) \tag{5.22}$$

Differentiation with respect to $n_h$ gives the equation

$$-\frac{N_h^2 S_h^2}{N^2 n_h^2} + \lambda c_h = 0$$

i.e.

$$n_h \sqrt{\lambda} = \frac{N_h S_h}{N \sqrt{c_h}} \tag{5.23}$$

Summing over all strata, we obtain

$$n \sqrt{\lambda} = \sum \frac{N_h S_h}{N \sqrt{c_h}} \tag{5.24}$$

Finally, the ratio of (5.23) to (5.24) gives

$$\frac{n_h}{n} = \frac{N_h S_h / \sqrt{c_h}}{\sum \left(N_h S_h / \sqrt{c_h}\right)} \tag{5.25}$$
This theorem leads to the following rules of conduct. In a given
stratum, take a larger sample if
(i) the stratum is larger,
(ii) the stratum is more variable,
(iii) sampling is cheaper in the stratum.

One further step is needed to complete the allocation. Equation
(5.25) gives the $n_h$ in terms of n, but we do not yet know what value
n has. The solution depends on whether the sample is chosen so as to
meet a specified total cost C or to give a specified variance V for $\bar{y}_{st}$.
If cost is fixed, we substitute the optimum values of $n_h$ in the cost
function (5.21) and solve for n. This gives

$$n = \frac{(C - a) \sum \left(N_h S_h / \sqrt{c_h}\right)}{\sum N_h S_h \sqrt{c_h}}$$

If V is fixed, we substitute the optimum $n_h$ in the formula for $V(\bar{y}_{st})$.
We find

$$n = \frac{\left(\sum N_h S_h \sqrt{c_h}\right) \sum \left(N_h S_h / \sqrt{c_h}\right)}{N^2 V + \sum N_h S_h^2}$$
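
The fixed-cost case can be sketched as follows. The function name and all numerical values (stratum sizes, standard deviations, unit costs, budget) are hypothetical and serve only to show that the allocation exhausts the budget and follows the proportionality in (5.25).

```python
import math

def cost_optimal_allocation(C, a, N_h, S_h, c_h):
    """Allocate n_h proportional to N_h S_h / sqrt(c_h), scaled so that
    a + sum(c_h n_h) equals the total budget C. Values are left unrounded."""
    up = [Nh * Sh / math.sqrt(c) for Nh, Sh, c in zip(N_h, S_h, c_h)]
    down = [Nh * Sh * math.sqrt(c) for Nh, Sh, c in zip(N_h, S_h, c_h)]
    n = (C - a) * sum(up) / sum(down)        # total sample size for fixed cost
    return [n * u / sum(up) for u in up]     # formula (5.25)

# Hypothetical strata: the first is more variable but four times as costly.
c_h = [4, 1]
n_h = cost_optimal_allocation(C=1000, a=100, N_h=[400, 600],
                              S_h=[10, 5], c_h=c_h)
spent = 100 + sum(c * nh for c, nh in zip(c_h, n_h))
print([round(x, 1) for x in n_h], round(spent, 6))
```

In practice the $n_h$ would be rounded to integers, so the cost actually incurred differs slightly from C.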

5.7 Relative precision of stratified random and simple random sam-
pling. If intelligently used, stratification will nearly always result in
a smaller variance for the estimated mean or total than is given by a
comparable simple random sample. It is not true, however, that any
stratified random sample gives a smaller variance than a simple ran-
dom sample. If the values of the $n_h$ are far from optimum, stratified
sampling may have a higher variance. In fact, as will be shown, even
stratification with optimum allocation for fixed total sample size may
give a higher variance, though this result appears to be an academic
curiosity rather than something that is likely to happen in practice.
In this section a comparison is made between simple random sam-
pling and stratified random sampling with proportional and optimum
allocation.* This comparison helps to show in what way the gain due
to stratification is achieved. At first, the fpc is ignored, since this
provides a clean-cut result. More general results are given later.
The variances of the estimated means are denoted by $V_{ran}$, $V_{prop}$,
and $V_{opt}$, respectively.
Theorem 5.8 If terms in $n_h/N_h$ are ignored,
$$V_{opt} \le V_{prop} \le V_{ran} \tag{5.26}$$
where the optimum allocation is for fixed n, i.e. with $n_h \propto N_h S_h$.
* Interesting discussions of this question are given by Armitage (1947) and
Evans (1951).

Proof: If the fpc is ignored,

$$V_{ran} = \frac{S^2}{n}$$

$$V_{prop} = \frac{\sum N_h S_h^2}{nN} \quad \text{[from equation (5.7), section 5.3]}$$

$$V_{opt} = \frac{\left(\sum N_h S_h\right)^2}{nN^2} \quad \text{[from equation (5.20), section 5.5]}$$

From the standard algebraic identity for the analysis of variance of
the stratified population, we have

$$(N - 1)S^2 = \sum_h \sum_i (y_{hi} - \bar{Y})^2 = \sum_h \sum_i (y_{hi} - \bar{Y}_h)^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2 = \sum_h (N_h - 1) S_h^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2 \tag{5.27}$$

Since terms in $1/N_h$ are negligible, this may be written

$$N S^2 = \sum_h N_h S_h^2 + \sum_h N_h (\bar{Y}_h - \bar{Y})^2$$

Hence

$$V_{ran} = \frac{S^2}{n} = \frac{\sum N_h S_h^2}{nN} + \frac{\sum N_h (\bar{Y}_h - \bar{Y})^2}{nN} = V_{prop} + \frac{\sum N_h (\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.28}$$

By the definition of $V_{opt}$, we must have $V_{prop} \ge V_{opt}$. Their dif-
ference is

$$V_{prop} - V_{opt} = \frac{\sum N_h S_h^2}{nN} - \frac{\left(\sum N_h S_h\right)^2}{nN^2} = \frac{1}{nN} \sum N_h (S_h - \bar{S})^2 \tag{5.29}$$

where $\bar{S} = \sum N_h S_h / N$. From (5.29) and (5.28)

$$V_{ran} = V_{opt} + \frac{\sum N_h (S_h - \bar{S})^2}{nN} + \frac{\sum N_h (\bar{Y}_h - \bar{Y})^2}{nN} \tag{5.30}$$

This result shows that there are two components to the decrease in
variance as we change from simple random sampling to optimum allo-
cation. The first component (term on the extreme right) comes from
the elimination of differences among the stratum means; the second
(middle term on the right) from elimination of the effect of differences
among the stratum standard deviations. The second component repre-
sents the difference in variance between optimum and proportional
allocation.
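
The decomposition in (5.28) to (5.30) can be checked numerically. The sketch below ignores the fpc and treats terms in $1/N_h$ as negligible, as the proof does, so $V_{ran}$ is formed from (5.28) rather than from an independently computed $S^2$. The stratum values are hypothetical and the function name is illustrative.

```python
def variance_components(n, N_h, S_h, Ybar_h):
    """Return V_opt, V_prop, V_ran and the two components of (5.30), fpc ignored."""
    N = sum(N_h)
    Ybar = sum(Nh * Yh for Nh, Yh in zip(N_h, Ybar_h)) / N
    S_bar = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) / N
    v_opt = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2 / (n * N ** 2)
    v_prop = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h)) / (n * N)
    # Component due to differences among stratum standard deviations (5.29).
    from_sds = sum(Nh * (Sh - S_bar) ** 2 for Nh, Sh in zip(N_h, S_h)) / (n * N)
    # Component due to differences among stratum means (5.28).
    from_means = sum(Nh * (Yh - Ybar) ** 2 for Nh, Yh in zip(N_h, Ybar_h)) / (n * N)
    v_ran = v_prop + from_means
    return v_opt, v_prop, v_ran, from_sds, from_means

# Hypothetical population of two strata.
v_opt, v_prop, v_ran, from_sds, from_means = variance_components(
    n=40, N_h=[100, 300], S_h=[20.0, 10.0], Ybar_h=[50.0, 30.0])
print(v_opt, v_prop, v_ran)   # prints: 3.90625 4.375 6.25
```

Here most of the gain over simple random sampling (1.875 of the 2.34 reduction) comes from removing differences among the stratum means, and the remainder from allocating in proportion to $N_h S_h$.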
More general result. If the fpc cannot be neglected, the same type
of analysis leads to the result

$$V_{ran} = V_{prop} + \frac{(N - n)}{nN(N - 1)} \left\{\sum N_h (\bar{Y}_h - \bar{Y})^2 - \frac{1}{N} \sum (N - N_h) S_h^2\right\} \tag{5.31}$$

It follows that simple random sampling gives a higher variance than
proportional stratification unless

$$\sum N_h (\bar{Y}_h - \bar{Y})^2 \le \frac{1}{N} \sum (N - N_h) S_h^2 \tag{5.32}$$

This case could happen, since the $\bar{Y}_h$ could all be identically equal. If
any differences among the $\bar{Y}_h$ exist, the inequality is unlikely to be
satisfied except with small strata, since the left side is of order $N_h$
while the right is of order unity.
The results reported here for optimum allocation are applicable to
sampling practice only if optimum allocation can be achieved. The
attempt to do so raises a number of problems that are discussed in
succeeding sections.

5.8 Effects of deviations from the optimum. For practical applica-
tions the formula for the optimum allocation is not enough. We need
to know also when the gains in precision over proportional allocation
are likely to be substantial, and when they are likely to be small.
Further, since in practice the $S_h$ will not be known, we can attain only
an approximation to the optimum, so that we must know how much
precision is lost if the allocation departs to some extent from the
optimum. Some insight into these questions is obtained by consider-
ing two strata, with the fpc ignored.
In this case the general formula for the variance of $\bar{y}_{st}$ is

$$V = \frac{1}{N^2} \left(\frac{N_1^2 S_1^2}{n_1} + \frac{N_2^2 S_2^2}{n_2}\right)$$

Optimum allocation, for fixed sample size, is obtained when

$$n_{1,opt} = \frac{n N_1 S_1}{N_1 S_1 + N_2 S_2}; \quad n_{2,opt} = \frac{n N_2 S_2}{N_1 S_1 + N_2 S_2}$$

When these values are substituted into the general formula, we obtain

$$V_{opt} = \frac{(N_1 S_1 + N_2 S_2)^2}{n N^2}$$

The relative precision (RP) of a general allocation as compared with
optimum allocation is given by

$$RP = \frac{V_{opt}}{V} = \frac{(N_1 U + N_2)^2}{n \left(\dfrac{N_1^2 U^2}{n_1} + \dfrac{N_2^2}{n_2}\right)} \tag{5.33}$$

putting $U = S_1/S_2$.
We want to examine this quantity for a series of values of $n_1$ and
$n_2$ in the neighborhood of the optimum. Departures of $n_1$ and $n_2$
from the optimum values can be expressed in terms of the ratio

$$\phi = \frac{n_1/n_2}{n_{1,opt}/n_{2,opt}}$$

Hence

$$\frac{n_1}{n_2} = \phi \frac{N_1 S_1}{N_2 S_2} = \phi \frac{N_1 U}{N_2} \tag{5.34}$$

Since $n = n_1 + n_2$, we can rewrite equation (5.33) in terms of
$n_1/n_2$ as

$$RP = \frac{(N_1 U + N_2)^2}{\left(1 + \dfrac{n_2}{n_1}\right) N_1^2 U^2 + \left(1 + \dfrac{n_1}{n_2}\right) N_2^2}$$

Substitution for $n_1/n_2$ from (5.34) gives, after simplification,

$$RP = \frac{\phi (N_1 U + N_2)^2}{(\phi N_1 U + N_2)(N_1 U + \phi N_2)}$$

For given values of $\phi$ and U, computations show that the relative
precision is not sensitive to moderate departures in the ratio of the
stratum sizes $N_1/N_2$ from equality. Consequently, results will be
given for the situation in which $N_1 = N_2$. Figure 5.1 shows the RP
(in per cent) plotted against $\phi$ for three different values of $S_1/S_2 =
U = 2, 4, 8$. The scale for $\phi$ is logarithmic. On each curve the value
of $\phi$ which represents proportional allocation is shown.
[Figure 5.1: relative precision (in per cent) plotted against $\phi$ = ratio of $n_1/n_2$ to optimum value of $n_1'/n_2'$, on a logarithmic scale, with one curve for each of $U = S_1/S_2 = 2, 4, 8$. Points denoted by P show the relative precisions with proportional allocation.]
FIGURE 5.1 Loss of precision through departures from optimum allocation.

Careful study of figure 5.1 suggests the following conclusions:

i. The optimum is "flat," in the sense that values of $n_1/n_2$ anywhere
between one-half and twice the optimum ratio lead to a loss of precision
which does not exceed 10 per cent for any of the three curves.
ii. If $S_1/S_2$ lies between 1 and 2, there seems little point in attempting
optimum allocation instead of proportional allocation. With $S_1/S_2 = 2$,
the loss in precision by using proportional allocation is at most
10 per cent, and in practice is less, because we have only an approximate
optimum. (This conclusion requires modification if the sampling
fraction is high and the fpc is no longer negligible.)
iii. The relative precision of proportional allocation falls to 74 per
cent when $S_1/S_2 = 4$ and to 62 per cent when $S_1/S_2 = 8$. Thus, when
$S_1/S_2$ exceeds 2, we can have values of $n_1$ and $n_2$ that are quite far
removed from the optimum, and still give a worth-while increase in
precision over proportional allocation.
iv. At first sight it may appear surprising that the curve for $U = 2$
is the lowest and that for $U = 8$ the highest. This result is a consequence
of the abscissal variable which was chosen.

As a working rule, proportional allocation is usually to be recommended
unless the expected gain in precision from optimum allocation,
as estimated in advance of taking the sample, exceeds 20 per cent.
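
The figures quoted in conclusions i to iii can be checked from the final expression for RP in this section. The following is a minimal sketch in Python; the function names are ours, and $N_1 = N_2$ is assumed by default, as in figure 5.1. Under proportional allocation $n_1/n_2 = N_1/N_2$, so by (5.34) the corresponding value of $\phi$ is $1/U$.

```python
def relative_precision(phi, u, size1=1.0, size2=1.0):
    """RP of an allocation whose n1/n2 is phi times the optimum ratio.

    u is S1/S2; size1 and size2 are the stratum sizes N1 and N2.
    """
    numerator = phi * (size1 * u + size2) ** 2
    denominator = (phi * size1 * u + size2) * (size1 * u + phi * size2)
    return numerator / denominator


def phi_proportional(u):
    # Proportional allocation: n1/n2 = N1/N2, optimum ratio = N1*U/N2.
    return 1.0 / u


for u in (2, 4, 8):
    rp_prop = relative_precision(phi_proportional(u), u)
    rp_half = relative_precision(0.5, u)
    rp_twice = relative_precision(2.0, u)
    print(f"U={u}: proportional {100 * rp_prop:.0f}%, "
          f"phi=1/2 {100 * rp_half:.0f}%, phi=2 {100 * rp_twice:.0f}%")
```

Changing `size1` and `size2` gives a quick way to see how little the result depends on $N_1/N_2$.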

5.9 Determination of the allocation from previous data. Optimum
allocation requires advance estimates of the stratum standard deviations
$S_h$. These may be obtainable from a previous survey or census.
Example. The example in section 5.3 illustrates the situation where
good previous data are at hand. In table 5.1 (p. 70), the 64 large cities
are divided into 2 strata on the basis of their 1920 numbers of inhabitants.
A stratified random sample of 24 cities is to be taken to estimate
the 1930 total population. Optimum allocation for fixed sample size
is desired. Since 1930 data would not be available when the 1930 sample
was being planned, we make the allocation from the 1920 values of
$S_h$. The calculations appear in table 5.2, with 1930 data included for
comparison.
TABLE 5.2 CALCULATION OF THE OPTIMUM ALLOCATION

                         1920 Data                       1930 Data
Stratum   $N_h$    $S_h$    $N_hS_h$   $n_h$       $S_h$    $N_hS_h$   $n_h$

   1        16    163.30    2612.80    11.56      232.04    3712.64    12.21
   2        48     58.55    2810.40    12.44       74.71    3586.08    11.79

Totals      64              5423.20    24.00                7298.72    24.00

The 1920 data indicate an $n_1$ of 11.56, as against a "true" optimum
of 12.21 for the 1930 data. When rounded to integers, both sets of
data give the same allocation: a sample size of 12 from each stratum.
Note, incidentally, that the $S_h$ are smaller in 1920 than in 1930. The
1920 data, although excellent for planning the allocation, give an
optimistic impression of the precision to be obtained in 1930. In practice
this factor should always be taken into consideration. If it were
known that city sizes had increased between 1920 and 1930, some
allowance should be made, in choosing the size of sample, for an
accompanying increase in the standard deviation.
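
The $n_h$ columns of table 5.2 follow from $n_h = nN_hS_h/\sum N_hS_h$. A short Python sketch of the calculation is given below; the function name is ours, and the inputs are the $N_h$ and $S_h$ printed in the table.

```python
def optimum_allocation(n, sizes, sds):
    """Allocate n units to strata in proportion to N_h * S_h."""
    products = [size * sd for size, sd in zip(sizes, sds)]
    total = sum(products)
    return [n * p / total for p in products]


sizes = [16, 48]                      # N_h for the two strata
alloc_1920 = optimum_allocation(24, sizes, [163.30, 58.55])
alloc_1930 = optimum_allocation(24, sizes, [232.04, 74.71])

print([round(x, 2) for x in alloc_1920])
print([round(x, 2) for x in alloc_1930])
print([round(x) for x in alloc_1920], [round(x) for x in alloc_1930])
```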
There is always some loss in precision from the theoretical optimum
because the $n_h$ have to be rounded to integers. This loss appears to
be unimportant even with $n_h$ as low as 6. The effect of this rounding
in the example is seen by substituting $n_1 = 12.21$, $n_2 = 11.79$ into the
general formula

$$V(\hat{Y}_{st}) = \sum \frac{N_h(N_h - n_h)S_h^2}{n_h}$$

for the variance of the estimated total. This gives a theoretical minimum
variance of 1,090,157. The variance actually attained by a 12,
12 allocation was worked out in the example in section 5.3 and was
found to be 1,090,827. The difference is trivial.
When an allocation is being planned, it is advisable to estimate the
apparent gain in precision relative to proportional allocation. In
section 5.8 we suggested that proportional allocation is to be preferred
in view of its self-weighting properties, unless the apparent gain due to
optimum allocation is at least 20 per cent. For the present data, the
comparison with proportional allocation was made in a previous example
(section 5.3). The relative precision turned out to be 1883/1091,
or 173 per cent. Note that a calculation of this kind inevitably
overestimates the gain in precision relative to proportional allocation,
because it assumes that the $S_h$ in the new survey will be in the same
proportions, from stratum to stratum, as in the previous data which
we used to compute the allocation. In the present example this
overestimation is negligible, because the 1920 allocation happened to be
the same as the 1930 allocation.
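
The variances quoted in this section can be approximately reproduced from the 1930 values of $S_h$ in table 5.2. The sketch below, in Python, uses the formula for the variance of the estimated total with the fpc included; the function name is ours. Because the printed $S_h$ are rounded to two decimals, the results agree with the figures in the text only to about four significant figures.

```python
def variance_of_total(sizes, sds, alloc):
    """V of the estimated total: sum of N_h (N_h - n_h) S_h^2 / n_h."""
    return sum(size * (size - n_h) * sd ** 2 / n_h
               for size, sd, n_h in zip(sizes, sds, alloc))


sizes = [16, 48]
sds_1930 = [232.04, 74.71]

v_min = variance_of_total(sizes, sds_1930, [12.21, 11.79])   # unrounded optimum
v_rounded = variance_of_total(sizes, sds_1930, [12, 12])     # allocation used
v_prop = variance_of_total(sizes, sds_1930, [6, 18])         # proportional

print(round(v_min), round(v_rounded), round(v_prop))
print(f"relative precision: {100 * v_prop / v_rounded:.0f} per cent")
```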

5.10 Effects of errors in the estimated $S_h$. In the previous example,
advance estimation of the $S_h$ proved completely successful for allocating
the sample. Where previous data are scanty or absent, there may
be doubt whether the forecasts of the $S_h$ are good enough to use as the
basis of an allocation.
An examination of the effect of errors in the estimated $S_h$ has been
made by Evans (1951). The analysis leads to a rule showing whether
an "estimated optimum" allocation is likely to be profitable relative
to proportional allocation.
Suppose that allocation is based on estimates $s_h$ of the $S_h$. That is,
we use

$$n_h = \frac{nN_hs_h}{\sum N_hs_h} \tag{5.35}$$

If the symbol (cv) denotes the coefficient of variation of the $s_h$,
assumed the same in all strata, Evans shows that the average increase
in $V(\bar{y}_{st})$ due to errors in the $s_h$ is approximately

$$V(\bar{y}_{st}) - V_{opt} \doteq \frac{(cv)^2}{nN^2}\left[\left(\sum N_hS_h\right)^2 - \sum N_h^2S_h^2\right]$$

But, from equation (5.29),

$$V_{prop} - V_{opt} = \frac{1}{nN}\left\{\sum N_hS_h^2 - \frac{\left(\sum N_hS_h\right)^2}{N}\right\}$$

Thus, on the average, the "optimum" stratification gives a smaller
variance than proportional allocation if

$$(cv)^2 < \frac{N\sum N_hS_h^2 - \left(\sum N_hS_h\right)^2}{\left(\sum N_hS_h\right)^2 - \sum (N_hS_h)^2}$$

For computation this result is more conveniently expressed in terms
of the relative sizes of the strata, $W_h = N_h/N$.

$$(cv)^2 < \frac{\sum W_hS_h^2 - \left(\sum W_hS_h\right)^2}{\left(\sum W_hS_h\right)^2 - \sum (W_hS_h)^2} \tag{5.36}$$

In practice, we should want $(cv)^2$ to be definitely less than this value,
because, with $(cv)^2$ equal to this value, the chance that the supposed
"optimum" is superior is only of the order of 50-50.
Example. For the 1920 city size data (table 5.2, section 5.9) the
relevant figures are shown in table 5.3.

TABLE 5.3 ESTIMATION OF ALLOWABLE ERROR IN THE $S_h$

Stratum   $W_h$    $S_h$    $S_h^2$    $W_hS_h$   $W_hS_h^2$
   1      0.25    163.3    26,667     40.82       6,667
   2      0.75     58.6     3,428     43.95       2,571

The calculation proceeds as follows:

$$\sum W_hS_h^2 = 9238: \qquad \left(\sum W_hS_h\right)^2 = 7186: \qquad \sum (W_hS_h)^2 = 3598$$

$$(cv)^2 < \frac{9238 - 7186}{7186 - 3598} = \frac{2052}{3588} = 0.57, \quad \text{i.e. } (cv) < 0.75$$

The calculation indicates that a coefficient of variation of 0.75 could
be allowed in the estimates of the $S_h$. This value is probably outside
the limits of the approximation used. It does suggest that very poor
estimates would suffice to make "optimum" allocation better than
proportional allocation, as is to be expected when the $S_h$ differ markedly.
Other examples are given by Evans.
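
The bound (5.36) is easily computed for any set of strata. A minimal Python sketch follows; the function name is ours, and the 1920 values of $W_h$ and $S_h$ are taken from tables 5.2 and 5.3. Small differences from the hand calculation above arise from rounding.

```python
from math import sqrt


def allowable_cv_squared(weights, sds):
    """Right-hand side of (5.36): upper limit for (cv)^2 of the s_h."""
    mean_of_squares = sum(w * s ** 2 for w, s in zip(weights, sds))
    square_of_mean = sum(w * s for w, s in zip(weights, sds)) ** 2
    sum_of_squares = sum((w * s) ** 2 for w, s in zip(weights, sds))
    return (mean_of_squares - square_of_mean) / (square_of_mean - sum_of_squares)


bound = allowable_cv_squared([0.25, 0.75], [163.30, 58.55])
print(f"(cv)^2 < {bound:.2f}, i.e. (cv) < {sqrt(bound):.2f}")
```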
Sukhatme (1935) has discussed the size of preliminary sample
needed for advance estimates of the $S_h^2$, when the choice is between
stratified and simple random sampling. He shows that a very small
initial sample will usually give a high probability that the "optimum"
allocation will have the smaller variance. Evans (1951) shows how to
compute the size of preliminary sample needed to make "optimum"
allocation better, on the average, than proportional allocation. Note
that these studies assume a normal population: when the population
is not normal, the sizes of the preliminary samples will have to be
increased because the variance of $s_h^2$ is sensitive to departures from
normality (see section 2.9).

5.11 Allocation with several items. Since the allocation that is best
for one item will not in general be best for another, some compromise
must be reached in a survey with numerous items. The first step is to
reduce the items that will be considered in the allocation to a relatively
small number that are thought to be most important. If good previous
data are available, we can then compute the optimum allocation
for each item separately, and see to what extent there is disagreement.
In a survey of a specialized type, the correlations among the items
may be high, and the allocations may differ relatively little.
Example. Data given by Jessen (1942) illustrate a farm survey of
this kind. The state of Iowa was divided into five geographic regions,
each denoted by its major agricultural enterprise. Suppose that these
regions are to be used as strata in a survey on dairy farming. The
three items of most interest are the number of cows milked per day,
the number of gallons of milk per day, and the total annual cash receipts
from dairy products. From a survey made in 1938, the estimated
standard deviations $s_h$ within strata are shown in table 5.4. We
shall assume, for the present, that the $s_h$ are the true standard deviations.
The $s_h$ apply to a single farm. In table 5.5 the optimum allocations
are given for the individual items in a sample of 1000 farms.

TABLE 5.4 STANDARD DEVIATIONS WITHIN STRATA

                                  $s_h$       $s_h$        $s_h$
Stratum              $W_h$       Cows       Gallons     Receipts for dairy
                                 milked     of milk     products (dollars)

Northeast dairy      0.197        4.6        11.7           332
Cash grain           0.191        3.4         9.8           357
Western livestock    0.219        3.3         7.0           246
Southern pasture     0.184        2.8         6.5           173
Eastern livestock    0.208        3.7         9.8           279

TABLE 5.5 SAMPLE SIZES WITHIN STRATA

                                            Allocation
                                          Optimum for
Stratum             Proportional    Cows    Gallons    Receipts    Average $m_h$

Northeast dairy         197         254       258        236          250
Cash grain              191         182       209        246          212
Western livestock       219         203       171        194          189
Southern pasture        184         145       134        115          131
Eastern livestock       208         216       228        209          218

Since the state contains over 200,000 farms, the fpc can be ignored.
Thus for any item

$$n_h = \frac{nW_hs_h}{\sum W_hs_h}$$

The individual optimum allocations differ only moderately from
each other. With one exception, all three deviate in the same direction
from a proportional allocation. Thus, in the first stratum, proportional
allocation suggests 197 farms, while the individual allocations
lead to numbers between 236 and 258. The average of the optimum
sample sizes for the three items, shown in the right-hand column,
provides a satisfactory compromise allocation.
Table 5.6 shows the expected sampling variances of $\bar{y}_{st}$ as given by
the individual optima, the compromise, and the proportional allocations.

TABLE 5.6 EXPECTED VARIANCES OF THE ESTIMATED MEAN

Type of allocation     Cows      Gallons    Receipts
Optimum               0.0127     0.0800       76.9
Compromise            0.0128     0.0802       77.6
Proportional          0.0131     0.0837       80.9

The formulas are as follows:

$$V_{opt} = \frac{\left(\sum W_hs_h\right)^2}{n}: \qquad V_{comp} = \sum \frac{(W_hs_h)^2}{m_h}: \qquad V_{prop} = \frac{\sum W_hs_h^2}{n}$$

The formula for $V_{comp}$ comes from the general formula for $V(\bar{y}_{st})$ in
theorem 5.3, corollary 1; the denominator $m_h$ is the sample size given in
the hth stratum by the compromise allocation.
The compromise allocation gives almost as precise results as if it
were possible to use separate optimum allocations for each item. What
is more noteworthy is that proportional allocation, though consistently
poorer than the compromise, is only slightly less precise than the
compromise or the individual optima. Further, table 5.6 overestimates
the precision of the optima and of the compromise, since these allocations
were made from estimated rather than from true variances. This
result illustrates the important principle (previously discussed) that up
to a point the variance of the estimate does not increase much even if
allocation departs quite substantially from the optimum. From the
comparison in table 5.6, proportional allocation would be the recommended
choice. Had the $S_h$ differed more markedly among strata, the
compromise allocation might have been very satisfactory.
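
The allocations and variances in this example can be recomputed from the $W_h$ and $s_h$ of table 5.4. The Python sketch below is illustrative only: the function and variable names are ours, and the compromise is taken as the unrounded average of the three optimum allocations, so that its variances may differ in the last digit from those obtained with the rounded $m_h$.

```python
W = [0.197, 0.191, 0.219, 0.184, 0.208]
s_items = {
    "cows": [4.6, 3.4, 3.3, 2.8, 3.7],
    "gallons": [11.7, 9.8, 7.0, 6.5, 9.8],
    "receipts": [332, 357, 246, 173, 279],
}
n = 1000


def optimum_sizes(s):
    """n_h = n W_h s_h / sum(W_h s_h), fpc ignored."""
    products = [w * x for w, x in zip(W, s)]
    return [n * p / sum(products) for p in products]


def var_opt(s):
    return sum(w * x for w, x in zip(W, s)) ** 2 / n


def var_alloc(s, sizes):
    """Variance of the mean for arbitrary stratum sample sizes."""
    return sum((w * x) ** 2 / m for w, x, m in zip(W, s, sizes))


def var_prop(s):
    return sum(w * x ** 2 for w, x in zip(W, s)) / n


optima = {item: optimum_sizes(s) for item, s in s_items.items()}
compromise = [sum(optima[item][h] for item in optima) / len(optima)
              for h in range(len(W))]

for item, s in s_items.items():
    print(item, [round(x) for x in optima[item]],
          var_opt(s), var_alloc(s, compromise), var_prop(s))
```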
In surveys which cover a wider range of items, the allocations may
differ more violently. The best compromise is then not so obvious.
Although proportional allocation is often used in this situation, it may
be possible to find a compromise which is superior. As a preliminary
we need some criterion for a "best" compromise. One possibility is to
minimize the sum of the variances for the items, each divided by its
optimum variance. If the fpc is ignored, this amounts to finding $n_h$
which minimize

$$\sum{}' \left[\frac{\sum (W_hS_h)^2/n_h}{\left(\sum W_hS_h\right)^2/n}\right]$$

where $\sum'$ denotes summation over the items. It remains to be seen,
however, whether situations are common, in surveys of wide scope, in
which the gain over proportional allocation is sufficient to offset the
advantage of the "self-weighting" feature in the latter.
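
The criterion above has the form $\sum A_h/n_h$ with $\sum n_h = n$ fixed, where $A_h$ collects the terms for stratum h over all items. By the same Lagrange-multiplier argument that gives the optimum allocation for a single item, such a sum is minimized by $n_h \propto \sqrt{A_h}$. This closed form is our own worked step, not a result quoted in the text. The Python sketch below applies it to the data of table 5.4; all names are ours.

```python
from math import sqrt

W = [0.197, 0.191, 0.219, 0.184, 0.208]
s_items = [
    [4.6, 3.4, 3.3, 2.8, 3.7],        # cows milked
    [11.7, 9.8, 7.0, 6.5, 9.8],       # gallons of milk
    [332, 357, 246, 173, 279],        # receipts
]
n = 1000


def criterion(sizes):
    """Sum over items of V(allocation) / V(optimum), fpc ignored."""
    total = 0.0
    for s in s_items:
        v_alloc = sum((w * x) ** 2 / m for w, x, m in zip(W, s, sizes))
        v_opt = sum(w * x for w, x in zip(W, s)) ** 2 / n
        total += v_alloc / v_opt
    return total


def best_compromise():
    """n_h proportional to sqrt(A_h), A_h = sum over items of
    (W_h s_h / sum(W_h s_h))^2."""
    a = []
    for h in range(len(W)):
        a.append(sum((W[h] * s[h] / sum(w * x for w, x in zip(W, s))) ** 2
                     for s in s_items))
    roots = [sqrt(v) for v in a]
    return [n * r / sum(roots) for r in roots]


best = best_compromise()
proportional = [n * w for w in W]
print([round(x, 1) for x in best])
print(criterion(best), criterion(proportional))
```

For these data the minimizing allocation lies within a few farms per stratum of the simple average used in table 5.5, which supports the use of the average as a compromise here.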

5.12 Allocation requiring more than 100 per cent sampling. In the
planning of an allocation it may happen that the formula for the optimum
produces an $n_h$ in some stratum that is larger than the corresponding
$N_h$. Consider the example on city sizes in section 5.9. A sample
of 24 cities, distributed between 2 strata, called for 12 cities out of 16
in the first stratum and 12 out of 48 in the second. Had the sample
size been 48, the allocation would demand 24 cities out of 16 in the
first stratum. The best that can be done is to take all cities in the
stratum, leaving 32 cities for the second stratum instead of the 24
postulated by the formula. This problem arises only when the overall
sampling fraction is substantial, and one stratum is much more variable
than the others. It has occurred in practice on several occasions.
Care must be taken to use the correct formula in predicting the
expected variance from this allocation, or in comparing the allocation
with others. The general formulas in theorem 5.3, section 5.3, are
appropriate if the $n_h$ given by the revised optimum allocation are
substituted. Formula (5.20) for the minimum variance for fixed n,

$$V_{min}(\bar{y}_{st}) = \frac{1}{N^2}\frac{\left(\sum N_hS_h\right)^2}{n} - \frac{1}{N^2}\sum N_hS_h^2$$

is no longer correct. If stratum 1 is the only stratum in which
oversampling is indicated, the correct formula for $V_{min}$ becomes

$$V_{min}(\bar{y}_{st}) = \frac{1}{N^2}\frac{\left(\sum' N_hS_h\right)^2}{n - N_1} - \frac{1}{N^2}\sum{}' N_hS_h^2 \tag{5.37}$$

where $\sum'$ denotes summation over all strata except stratum 1.
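
The revised allocation can be found by taking every unit in any stratum where the formula asks for more than $N_h$, and re-allocating the remaining sample among the other strata. The Python sketch below is one way to do this, under our own naming; the repeated re-allocation is a natural extension to several oversampled strata and is not spelled out in the text. It uses the 1930 city data with $n = 48$ and checks (5.37) against the general formula.

```python
def capped_optimum_allocation(n, sizes, sds):
    """Optimum allocation for fixed n with n_h never exceeding N_h."""
    alloc = [0.0] * len(sizes)
    free = set(range(len(sizes)))
    remaining = n
    while True:
        total = sum(sizes[h] * sds[h] for h in free)
        over = [h for h in free
                if remaining * sizes[h] * sds[h] / total > sizes[h]]
        if not over:
            for h in free:
                alloc[h] = remaining * sizes[h] * sds[h] / total
            return alloc
        for h in over:                 # take the whole stratum
            alloc[h] = float(sizes[h])
            free.remove(h)
            remaining -= sizes[h]


def variance_of_mean(sizes, sds, alloc):
    """General formula of theorem 5.3 (fpc included)."""
    big_n = sum(sizes)
    return sum(N_h * (N_h - n_h) * S_h ** 2 / n_h
               for N_h, S_h, n_h in zip(sizes, sds, alloc)) / big_n ** 2


sizes, sds = [16, 48], [232.04, 74.71]
alloc = capped_optimum_allocation(48, sizes, sds)

# Formula (5.37): stratum 1 is completely sampled.
big_n = sum(sizes)
v_537 = ((sizes[1] * sds[1]) ** 2 / (48 - sizes[0])
         - sizes[1] * sds[1] ** 2) / big_n ** 2

print(alloc, variance_of_mean(sizes, sds, alloc), v_537)
```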


5.13 Estimation of sample size. This question might have been discussed
earlier in this chapter. On the other hand, since the formulas
for the variances of the estimated mean and total contain both the $n_h$
and the $S_h$, sample size may not be decided in practice until advance
estimates of the $S_h$ are available and some decision about allocation
has been made. In the following discussion we assume that the precision
desired is stated in terms either of the margin of error, or of the
half-width of confidence interval, d. For estimation of the population
mean, the basic equation to be satisfied is

$$d = t\sqrt{V(\bar{y}_{st})} \tag{5.38}$$

where t is the normal deviate corresponding to the desired confidence
probability. Since d and t enter the solution only in the form $d^2/t^2$,
we shall write $V = d^2/t^2$, where V is the desired variance in the sample
estimate.
Let $s_h$ be the estimate of $S_h$, and let $n_h = w_hn$, where the $w_h$ have
been chosen. In these terms the expected $V(\bar{y}_{st})$ is (from theorem 5.3,
section 5.3)

$$\text{Expected } V(\bar{y}_{st}) = \frac{1}{N^2}\sum \frac{N_h^2s_h^2}{n_h} - \frac{1}{N^2}\sum N_hs_h^2$$

i.e.,

$$V = \frac{1}{n}\sum \frac{W_h^2s_h^2}{w_h} - \frac{1}{N}\sum W_hs_h^2 \tag{5.39}$$

with $W_h = N_h/N$. This gives, as a general formula for n,

$$n = \frac{\sum \dfrac{W_h^2s_h^2}{w_h}}{V + \dfrac{1}{N}\sum W_hs_h^2} \tag{5.40}$$

If the fpc is ignored we have, as a first approximation,

$$n_0 = \frac{1}{V}\sum \frac{W_h^2s_h^2}{w_h} \tag{5.41}$$

If $n_0/N$ is not negligible, we may calculate n as

$$n = \frac{n_0}{1 + \dfrac{1}{NV}\sum W_hs_h^2} \tag{5.42}$$

In particular cases the formulas take various forms which may be
more convenient for computation. A few are given.
Presumed optimum allocation (for fixed n): $w_h \propto W_hs_h$.

$$n = \frac{\left(\sum W_hs_h\right)^2}{V + \dfrac{1}{N}\sum W_hs_h^2} \tag{5.43}$$

Proportional allocation: $w_h = W_h$.

$$n_0 = \frac{\sum W_hs_h^2}{V}: \qquad n = \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{5.44}$$

If V is the desired variance for the estimated population total, the
principal formulas become as follows:
General:

$$n = \frac{\sum \dfrac{N_h^2s_h^2}{w_h}}{V + \sum N_hs_h^2} \tag{5.45}$$

Presumed optimum (for fixed n):

$$n = \frac{\left(\sum N_hs_h\right)^2}{V + \sum N_hs_h^2} \tag{5.46}$$

Proportional:

$$n_0 = \frac{N}{V}\sum N_hs_h^2: \qquad n = \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{5.47}$$
Example. This example is taken from a paper by Cornell (1947),
which describes a sample of United States colleges and universities
drawn in 1946 by the U. S. Office of Education, in order to estimate
enrollments for the 1946-47 academic year. The illustration given
here is for the population of 196 teachers' colleges and normal schools.
These were arranged in seven strata, of which one small stratum will
be ignored. The first five strata were constructed by size of institution;
the sixth contained colleges for women only. Estimates $s_h$ of the
$S_h$ were computed from results for the 1943-44 academic year. An
"optimum" stratification based on these $s_h$ was employed.
The objective was a coefficient of variation of 5 per cent in the
estimated total enrollment. In 1943 the total enrollment for this group
of colleges was 56,472. Thus the desired standard error is

(0.05)(56,472) = 2824

so that the desired variance is

V = (2824)^2 = 7,974,976

It may be objected that enrollments will presumably be greater in
1946 than in 1943, and that allowance should be made for this increase.
Actually, the calculation assumes only that the cv per college remains
the same in 1943 and 1946, an assumption which may not be unreasonable.
Table 5.7 shows the values of $N_h$, $s_h$, and $N_hs_h$, which were known
before determining n.
TABLE 5.7 DATA FOR ESTIMATING SAMPLE SIZE

Stratum    $N_h$    $s_h$    $N_hs_h$    $n_h$

   1         13      325      4,225        9
   2         18      190      3,420        7
   3         26      189      4,914       10
   4         42       82      3,444        7
   5         73       86      6,278       13
   6         24      190      4,560       10

Totals      196              26,841       56

The appropriate formula for n is (5.46), which applies to an "optimum"
allocation for estimating a total. With only 196 units in this
population, it is improbable that the fpc will be negligible. However,
for purposes of illustration, a first approximation ignoring the fpc will
be sought. This is

$$\frac{(26{,}841)^2}{7{,}974{,}976} = 90.34$$

Adjustment is obviously needed. For the correct n in (5.46), we have

$$\frac{90.34}{1 + \dfrac{4{,}640{,}387}{7{,}974{,}976}} = 57.1$$

A sample size of 56 was chosen.* The $n_h$ for individual strata appear
in the right-hand column of table 5.7.
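
The steps of this example can be sketched in Python as follows. The variable names are ours; the data are the $N_h$ and $s_h$ of table 5.7, and the formulas are (5.46) for n and the presumed optimum allocation $n_h \propto N_hs_h$.

```python
sizes = [13, 18, 26, 42, 73, 24]          # N_h
sds = [325, 190, 189, 82, 86, 190]        # s_h from 1943-44
total_1943 = 56472

standard_error = round(0.05 * total_1943)   # desired cv of 5 per cent
V = standard_error ** 2                     # desired variance of the total

sum_ns = sum(N_h * s_h for N_h, s_h in zip(sizes, sds))
sum_ns2 = sum(N_h * s_h ** 2 for N_h, s_h in zip(sizes, sds))

n0 = sum_ns ** 2 / V                 # first approximation, fpc ignored
n = sum_ns ** 2 / (V + sum_ns2)      # formula (5.46)

# Allocation of the chosen sample of 56 in proportion to N_h * s_h.
n_h = [round(56 * N_h * s_h / sum_ns) for N_h, s_h in zip(sizes, sds)]

print(V, sum_ns, sum_ns2)
print(round(n0, 2), round(n, 1), n_h)
```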

5.14 Stratified sampling for proportions. If we wish to estimate the
proportion of units in the population which fall into some defined class
C, the ideal stratification is attained if we can place in the first stratum
every unit which falls in C, and in the second every unit which does
not. Failing this, we try to construct strata such that the proportion
in class C varies as much as possible from stratum to stratum.
Let

$$P_h = \frac{A_h}{N_h}: \qquad p_h = \frac{a_h}{n_h}$$

be the proportions of units in C in the hth stratum and in the sample
from that stratum, respectively. For the proportion in the whole
population, the estimate appropriate to stratified random sampling is

$$p_{st} = \frac{\sum N_hp_h}{N} \tag{5.48}$$

Theorem 5.9 With stratified random sampling, the variance of $p_{st}$ is

$$V(p_{st}) = \frac{1}{N^2}\sum \frac{N_h^2(N_h - n_h)}{(N_h - 1)}\frac{P_hQ_h}{n_h} \tag{5.49}$$

Proof: This is a particular case of the general theorem for the variance
of the estimated mean. From theorem 5.3

$$V(\bar{y}_{st}) = \frac{1}{N^2}\sum N_h(N_h - n_h)\frac{S_h^2}{n_h}$$

Let $y_{hi}$ be a variate which has the value 1 when the unit is in C, and
zero otherwise. In section 3.2 it was shown that for this variate

$$S_h^2 = \frac{N_h}{(N_h - 1)}P_hQ_h$$

This gives the result.

* The arithmetical results differ slightly from those given by Cornell (1947).
Note. In practically all applications, even if the fpc is not negligible,
terms in $1/N_h$ will be negligible, and the slightly simpler formula

$$V(p_{st}) = \frac{1}{N^2}\sum N_h(N_h - n_h)\frac{P_hQ_h}{n_h} \tag{5.50}$$

can be used.
Corollary 1 When the fpc can be ignored,

$$V(p_{st}) = \frac{1}{N^2}\sum N_h^2\frac{P_hQ_h}{n_h} = \sum W_h^2\frac{P_hQ_h}{n_h} \tag{5.51}$$

Corollary 2 With proportional allocation,

$$V(p_{st}) = \frac{(N - n)}{N}\frac{1}{nN}\sum \frac{N_h^2P_hQ_h}{(N_h - 1)} \tag{5.52}$$

$$\doteq \frac{(N - n)}{N}\frac{1}{n}\sum W_hP_hQ_h \tag{5.53}$$

For a sample estimate of the variance, we may substitute $p_h$, $q_h$ for the
unknown $P_h$, $Q_h$ in any of the formulas above. This does not give an
unbiased estimate, but is adequate for the calculation of confidence limits.
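
The closeness of (5.50) to the exact formula (5.49), and the agreement of (5.50) with (5.53) under proportional allocation, can be seen numerically. The Python sketch below uses a small hypothetical population of two strata; the stratum sizes, sample sizes, and proportions are invented for illustration and do not come from the text.

```python
from math import isclose

# Hypothetical strata (illustrative values only).
sizes = [4000, 6000]       # N_h
samples = [200, 300]       # n_h, proportional to N_h
props = [0.2, 0.5]         # P_h

N = sum(sizes)
n = sum(samples)


def var_exact():
    """Formula (5.49)."""
    return sum(N_h ** 2 * (N_h - n_h) / (N_h - 1) * P * (1 - P) / n_h
               for N_h, n_h, P in zip(sizes, samples, props)) / N ** 2


def var_simplified():
    """Formula (5.50): terms in 1/N_h neglected."""
    return sum(N_h * (N_h - n_h) * P * (1 - P) / n_h
               for N_h, n_h, P in zip(sizes, samples, props)) / N ** 2


def var_proportional():
    """Formula (5.53), valid for proportional allocation."""
    return (N - n) / N / n * sum(N_h / N * P * (1 - P)
                                 for N_h, P in zip(sizes, props))


print(var_exact(), var_simplified(), var_proportional())
```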
The best choice of the $n_h$ in order to minimize $V(p_{st})$ follows from
the general theory in sections 5.5 and 5.6.
Minimum variance for fixed total sample size.

$$n_h \propto N_h\sqrt{\frac{N_h}{N_h - 1}}\sqrt{P_hQ_h} \doteq N_h\sqrt{P_hQ_h}$$

Thus

$$n_h \doteq n\,\frac{N_h\sqrt{P_hQ_h}}{\sum N_h\sqrt{P_hQ_h}} \tag{5.54}$$

Minimum variance for fixed cost, where cost $= a + \sum c_hn_h$.

$$n_h \doteq n\,\frac{N_h\sqrt{P_hQ_h/c_h}}{\sum N_h\sqrt{P_hQ_h/c_h}} \tag{5.55}$$

The value of n is found as in section 5.6.

Optimum versus proportional allocation. If the cost of sampling per
unit is the same in all strata, and if all $P_h$ lie between 0.1 and 0.9, there
appears little to be gained by optimum allocation over proportional
allocation.
Suppose that the fpc is ignored and that optimum allocation is made
for fixed n, i.e. with $n_h \propto N_h\sqrt{P_hQ_h}$. It will be found that

$$V_{opt} = \frac{\left(\sum W_h\sqrt{P_hQ_h}\right)^2}{n}: \qquad V_{prop} = \frac{\sum W_hP_hQ_h}{n}$$

so that the relative precision of proportional to optimum allocation is

$$\frac{V_{opt}}{V_{prop}} = \frac{\left(\sum W_h\sqrt{P_hQ_h}\right)^2}{\sum W_hP_hQ_h}$$

If all $P_h$ lie between the two values $P_0$ and $(1 - P_0)$, we are interested
in the smallest value which the relative precision takes. For
simplicity, we consider two strata of equal size ($W_1 = W_2$). The
minimum relative precision is attained when $P_1 = \tfrac{1}{2}$ and $P_2 = P_0$.
The relative precision then becomes

$$\frac{V_{opt}}{V_{prop}} = \frac{\left(0.5 + \sqrt{P_0Q_0}\right)^2}{2(0.25 + P_0Q_0)}$$

Some values of this function are given in table 5.8. Even with $P_0$ equal
to 0.1, or as high as 0.9, the relative precision is 94 per cent. In most
cases the simplicity and the self-weighting feature of proportional
stratification more than compensate for this slight loss in precision.

TABLE 5.8 RELATIVE PRECISION OF PROPORTIONAL TO OPTIMUM ALLOCATION

$P_0$      0.4 or 0.6   0.3 or 0.7   0.2 or 0.8   0.1 or 0.9   0.05 or 0.95

RP (%)       100.0         99.8         98.8         94.1          86.6
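
The entries of table 5.8 follow directly from the expression for the minimum relative precision with two strata of equal size. A minimal Python sketch, with a function name of our own choosing:

```python
from math import sqrt


def min_relative_precision(p0):
    """V_opt / V_prop for W1 = W2, P1 = 1/2, P2 = p0 (fpc ignored)."""
    pq = p0 * (1 - p0)
    return (0.5 + sqrt(pq)) ** 2 / (2 * (0.25 + pq))


table = [round(100 * min_relative_precision(p0), 1)
         for p0 in (0.4, 0.3, 0.2, 0.1, 0.05)]
print(table)
```

Because the expression depends on $P_0$ only through $P_0Q_0$, the values for $P_0$ and $1 - P_0$ coincide, which is why each column of the table carries two values of $P_0$.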

The limitations of the example should be noted. It does not take
account of differential costs of sampling in different strata. There are
surveys where the $P_h$ are very small, but range from, say, 0.001 to 0.05
in different strata. Here there would be a more substantial gain from
optimum stratification.

Formulas for the determination of sample size can be deduced from
the more general formulas in section 5.13. Let V be the desired variance
in the estimate of the proportion P for the whole population.
The formulas for the two principal types of allocation are as follows:
Proportional:

$$n_0 = \frac{\sum W_hp_hq_h}{V}: \qquad n = \frac{n_0}{1 + \dfrac{n_0}{N}} \tag{5.56}$$

Presumed optimum:

$$n_0 = \frac{\left(\sum W_h\sqrt{p_hq_h}\right)^2}{V}: \qquad n = \frac{n_0}{1 + \dfrac{1}{NV}\sum W_hp_hq_h} \tag{5.57}$$

where $n_0$ is the first approximation, which ignores the fpc, and n is the
corrected value taking account of the fpc. In the development of
these formulas, the factor $N_h/(N_h - 1)$ has been taken as unity.
All results in this section apply to the estimate of a proportion. If it
is preferable to think in terms of percentages, the same formulas apply
if $P_h$, $Q_h$, V, etc., are expressed as percentages. For the estimation of
the total number in the population which lie in class C, i.e. of NP, all
variances are multiplied by $N^2$.

5.15 The construction of strata. This topic raises a number of questions.
What is the best characteristic for the construction of strata?
How many strata should there be? How should the boundaries between
the strata be determined? Much investigation remains to be
done on these questions, although some answers in general terms can
be given.
For a single item or variable, it seems obvious that the best characteristic
for stratification, if we could possess it, is the frequency distribution
of the variable itself. The next best thing is the frequency
distribution of some other quantity that is highly correlated with the
variable, such as the values of the variable itself in a previous census.
So far as the number of strata is concerned, the answer is at first
sight the more the merrier. This follows from the general result about
the relative precision of stratified random to simple random sampling.
In section 5.7 it was shown that if the fpc can be ignored,

$$V_{opt} \le V_{prop} \le V_{ran}$$

It was suggested that these results remain valid, in nearly all cases,
even if the fpc is not negligible. Suppose that a stratum already in
existence is divided into two new strata. For this stratum the effect
is to replace simple random sampling by stratified sampling, which
should result in a lower variance for the estimated stratum mean, and
hence for the mean of the whole population. Thus the process of
constructing new strata by the subdivision of old strata should result
in a series of decreasing variances in the estimates.
Multiplication of strata is, however, attended by some disadvantages.
Unless allocation is proportional, the number of weighting factors
$W_h$ which appear in the tabulations increases, as do the number
of within-stratum variances to be estimated, both for allocation and
for finding the standard error of $\bar{y}_{st}$. The number of degrees of freedom
available for the estimated standard error decreases until with
one unit per stratum the formula for the standard error cannot be used
at all. If an increase in the number of strata is under consideration,
we want to be assured that the ensuing gain in precision is sufficient
to repay us for these complications.
Consider a process in which each stratum is divided into halves to
form new strata. The number of strata becomes successively 2, 4, 8,
16, etc. If we can use the frequency distribution of $y_i$ itself for the
construction of strata, and if the distribution of $y_i$ is continuous, every
stage in the process produces a substantial reduction in the variance
of the estimate. When the number of strata becomes large, each
doubling reduces the variance to approximately one-quarter of the
previous value.
To show this, suppose that before subdivision a typical stratum
consists of all values of $y_i$ between a and b. If there are already many
strata, the distribution of $y_i$ between a and b will be approximately
rectangular. The variance of this rectangular distribution is known
by theory to be $(b - a)^2/12$. When we halve the stratum, we produce
two approximately rectangular distributions, each with range $(b - a)/2$
and variance $(b - a)^2/48$. With proportional allocation, ignoring the
fpc,

$$V(\bar{y}_{st}) = \frac{1}{n}\sum W_hS_h^2$$

Hence the contribution of this stratum to $V(\bar{y}_{st})$ is reduced by the
subdivision from

$$\frac{W_h(b - a)^2}{12n}$$

to

$$\frac{W_h}{2}\frac{(b - a)^2}{48n} + \frac{W_h}{2}\frac{(b - a)^2}{48n} = \frac{W_h(b - a)^2}{48n}$$

As an illustration, the values of $V(\bar{y}_{st})$ were calculated for 2, 4, 8,
and 16 strata for a variate with the frequency function $f(y_i) = e^{-y_i}$
$(y_i \ge 0)$. All strata in a given subdivision were of the same size,
although this is not the best method of division for maximum precision.
Both proportional and optimum allocation (for fixed sample
size) were included. The values of $nV(\bar{y}_{st})$ and the ratios of successive
values appear in table 5.9.

TABLE 5.9 VALUES OF $nV(\bar{y}_{st})$ FOR STRATA FORMED FROM $e^{-y_i}$

                     Optimum                    Proportional
Number                     Ratio to                    Ratio to
of strata       nV         preceding        nV         preceding

    1         1                           1
    2         0.27722       0.277         0.40747       0.407
    4         0.06690       0.241         0.11822       0.290
    8         0.01618       0.242         0.03079       0.260
   16         0.00397       0.245         0.00778       0.253

For proportional allocation, the variance does not at first decrease
to one-quarter with a doubling of the number of strata, because the
stratum which extends to infinity has a variance substantially larger
than the others. However, the ratio quickly settles close to one-quarter
for both types of stratification.
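
The ratio columns of table 5.9 are simply the quotients of successive values of nV. The short Python sketch below recomputes them from the printed values; it does not attempt to rederive the nV themselves, and the variable names are ours.

```python
strata = [1, 2, 4, 8, 16]
nv_optimum = [1, 0.27722, 0.06690, 0.01618, 0.00397]
nv_proportional = [1, 0.40747, 0.11822, 0.03079, 0.00778]


def successive_ratios(values):
    """Ratio of each value to the preceding one."""
    return [round(after / before, 3)
            for before, after in zip(values, values[1:])]


ratios_optimum = successive_ratios(nv_optimum)
ratios_proportional = successive_ratios(nv_proportional)
print(ratios_optimum)
print(ratios_proportional)
```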
In practice, the stratification is not based on the frequency distribution
of $y_i$. Suppose that it is based on the frequency distribution of a
variate $z_i$ which is correlated with $y_i$. Let $y_i = z_i + e_i$, where the
variates $z_i$ and $e_i$ are independently distributed. Strata formed by
cutting up the frequency distribution of $z_i$ will not affect the variance
of $e_i$. When there are many strata, the division of a stratum into
halves will reduce the contribution of this stratum from

$$\frac{W_h(b - a)^2}{12n} + \frac{W_h\sigma_{eh}^2}{n}$$

to

$$\frac{W_h(b - a)^2}{48n} + \frac{W_h\sigma_{eh}^2}{n}$$

As soon as $(b - a)$ is small enough so that the term in $\sigma_{eh}^2$ dominates,
further increase in the number of strata produces only a trivial reduction
in variance.
To summarize, subdivision of strata in practice sooner or later results
in practically no further increase in precision. If $z_i$ has only a
moderate correlation with $y_i$, the point may be reached with a small
number of strata.
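
A small numerical illustration of this levelling-off is given below in Python. It rests on simplifying assumptions of our own: $z_i$ is taken as rectangular over a unit range, the L strata are of equal width, allocation is proportional with the fpc ignored, and $\sigma_e^2$ is the same in every stratum. Under these assumptions $nV(\bar{y}_{st}) = 1/(12L^2) + \sigma_e^2$.

```python
def n_times_variance(num_strata, sigma_e2, z_range=1.0):
    """n V(y_st) for equal-width strata cut on a rectangular z."""
    return z_range ** 2 / (12 * num_strata ** 2) + sigma_e2


sigma_e2 = 0.01    # hypothetical variance of e within strata
values = [n_times_variance(L, sigma_e2) for L in (1, 2, 4, 8, 16)]
ratios = [after / before for before, after in zip(values, values[1:])]

print([round(v, 5) for v in values])
print([round(r, 3) for r in ratios])
```

With $\sigma_e^2 = 0$ every ratio would be one-quarter; with $\sigma_e^2 > 0$ the ratios rise toward unity as the number of strata grows.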
If strata are formed by cutting up a frequency distribution, there
remains the question: What are the best points of division?
Rules have been developed by Dalenius (1950) and Dalenius and
Gurney (1951) for division under proportional and optimum allocation.
One result will be quoted. If the variate $z_{hi}$ on which the division is
based is linearly related to the variate $y_{hi}$ which is to be measured in
the survey, the division point $z_h'$ between stratum h and stratum
(h + 1) should satisfy the equation

$$2z_h' = \bar{Z}_h + \bar{Z}_{h+1}$$

where $\bar{Z}_h$, $\bar{Z}_{h+1}$ are the mean values of $z_{hi}$ in the two strata. This is
the optimum rule for proportional allocation. The division points $z_1'$,
$z_2'$, ..., $z_{L-1}'$ are found by trial and error, the number of strata L
being assumed chosen in advance.
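
One way to organize the trial and error is a fixed-point iteration: start from any set of division points, compute the stratum means, move each division point to the midpoint of the adjacent means, and repeat until nothing changes. This iteration is our own choice of method (it is the one-dimensional form of Lloyd's clustering step) and is not claimed to be the procedure used by Dalenius. The Python sketch below applies it to a hypothetical skew set of values.

```python
def division_points(values, num_strata, max_iter=1000):
    """Iterate z'_h = (mean_h + mean_{h+1}) / 2 until it stabilizes.

    Returns (division points, stratum means, converged flag).
    """
    z = sorted(values)
    # Start from strata containing equal numbers of units.
    bounds = [z[len(z) * k // num_strata] for k in range(1, num_strata)]
    means = []
    for _ in range(max_iter):
        groups = [[] for _ in range(num_strata)]
        for v in z:
            groups[sum(v > b for b in bounds)].append(v)
        if any(not g for g in groups):
            raise ValueError("a stratum became empty; use fewer strata")
        means = [sum(g) / len(g) for g in groups]
        new_bounds = [(means[h] + means[h + 1]) / 2
                      for h in range(num_strata - 1)]
        if new_bounds == bounds:
            return bounds, means, True
        bounds = new_bounds
    return bounds, means, False


# Hypothetical skew population: squares of 200 equally spaced values.
z_values = [(i / 199) ** 2 for i in range(200)]
bounds, means, converged = division_points(z_values, 3)
print(bounds, means, converged)
```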
A stratification that is effective for one variable may not be so for
another. In surveys which cover a range of items, some compromise
criterion for the construction of strata must be found. Bases of
stratification for economic items have been discussed by Stephan
(1941) and Hagood and Bernert (1945), and for farm items by King
and McCarty (1941). What is wanted is some variable that has high
correlations with all the principal items in the survey.

5.16 Proximity as a basis for stratification. In surveys which cover
a geographic region, adjacent units are often more alike than units
that are far apart. Farmers living near one another tend to have a
similar soil type, a similar kind of farming, and a similar access to
markets. People in the same part of a town tend to be of similar
economic level and to have certain things in common in their attitudes
and tastes. These remarks do not imply that the similarities are
strong, but merely that they exist.
This phenomenon can be utilized by constructing strata which
consist of compact geographic regions. For a number of typical farm
economic items, data on the effectiveness of geographic stratification
within states in the United States are shown in table 5.10. The data
were presented by Jessen (1942) and Jessen and Houseman (1944).

Four sizes of stratum are represented: the township, the county, the
"type of farming" area, and the state. Some idea of the relative sizes
of the strata is obtained from the fact that in Iowa there are about 1600
townships, 100 counties, and 5 areas.
TABLE 5.10 RELATIVE PRECISION OF DIFFERENT KINDS OF GEOGRAPHIC
STRATIFICATION (IN PER CENT)

                                              Stratum
                        No. of                         Type of
State                   items    Township    County    farming area    State

Iowa, 1938               18        115        100          96            91
Iowa, 1939               19        121        100          97            91
Florida, 1942
  Citrus fruit area      14        144        100
  Truck farming area     16        111        100
California, 1942         17        113        100          97

The data shown are the average relative precisions of the estimates
of the means, averaged over the numbers of items shown in the table.
The county is taken as a standard in each case.
As appears to be typical of geographic stratification, the increases
in precision from increased numbers of strata are moderate rather than
large. This indicates that the similarities referred to above are weak.

5.17 Estimation from a sample of the gain due to stratification.
When a stratified random sample has been taken, it may be of interest,
as a guide to the conduct of future surveys, to appraise the gain in
precision relative to simple random sampling.
The data available from the sample are the values of $N_h$, $n_h$, $\bar{y}_h$,
and $s_h^2$. From section 5.4, the estimated variance of the weighted
mean from the stratified sample is

$$v(\bar{y}_{st}) = \sum \frac{W_h^2 s_h^2}{n_h} - \sum \frac{W_h s_h^2}{N}$$

where $W_h = N_h/N$.
The problem is to compare this variance with an estimate of the
variance of the mean that would have been obtained from a simple
random sample. A procedure sometimes used is to calculate the
familiar mean square deviation from the sample mean,

$$s^2 = \frac{\sum (y_{hi} - \bar{y})^2}{n - 1}$$

where the strata are ignored. This is taken as an estimate of the vari-
ance per unit for a simple random sample. This method works well
enough if the allocation is proportional, since a simple random sample
distributes itself approximately proportionally among strata. But if
an allocation far from proportional has been adopted, the sample
actually taken does not resemble a simple random sample, and this
$s^2$ may be a poor estimate. A general procedure will now be given.
The true variance of the mean of a simple random sample is

$$V_{ran} = \frac{(N - n)}{nN} S^2 = \frac{(N - n)}{nN(N - 1)} \left[ \sum (N_h - 1) S_h^2 + \sum N_h (\bar{Y}_h - \bar{Y})^2 \right] \tag{5.58}$$

by an algebraic identity for $S^2$.
There is no difficulty in obtaining an estimate of the first term inside
the bracket. The second term requires some detailed investigation.
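For estimating $\sum N_h (\bar{Y}_h - \bar{Y})^2$ it is natural to try
$\sum N_h (\bar{y}_h - \bar{y}_{st})^2$.
This quantity turns out to be an overestimate which needs adjustment.
We may write

$$\sum N_h (\bar{y}_h - \bar{y}_{st})^2 = \sum N_h \{ (\bar{Y}_h - \bar{Y}) + (\bar{y}_h - \bar{Y}_h) - (\bar{y}_{st} - \bar{Y}) \}^2$$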
We now expand and take the average over all possible samples. It
may be verified that the average of each of the two cross-product
terms involving $(\bar{Y}_h - \bar{Y})$ vanishes. This gives

$$E \sum N_h (\bar{y}_h - \bar{y}_{st})^2 = \sum N_h (\bar{Y}_h - \bar{Y})^2 + E \sum N_h (\bar{y}_h - \bar{Y}_h)^2 + E \sum N_h (\bar{y}_{st} - \bar{Y})^2 - 2E \sum N_h (\bar{y}_h - \bar{Y}_h)(\bar{y}_{st} - \bar{Y}) \tag{5.59}$$

But

$$\sum N_h (\bar{y}_h - \bar{Y}_h)(\bar{y}_{st} - \bar{Y}) = N (\bar{y}_{st} - \bar{Y})^2$$

by the definitions of $\bar{y}_{st}$ and $\bar{Y}$. Thus the last two terms in (5.59) coa-
lesce to give

$$-E\,N (\bar{y}_{st} - \bar{Y})^2 = -\sum \frac{N_h (N_h - n_h)}{N} \frac{S_h^2}{n_h}$$

since this expression is $N$ times the variance of $\bar{y}_{st}$. For the second
term on the right in (5.59),

$$E \sum N_h (\bar{y}_h - \bar{Y}_h)^2 = \sum N_h \frac{(N_h - n_h)}{N_h} \frac{S_h^2}{n_h} = \sum (N_h - n_h) \frac{S_h^2}{n_h}$$

because within each stratum $\bar{y}_h$ is the mean of a simple random sample.
Hence,

$$E \sum N_h (\bar{y}_h - \bar{y}_{st})^2 = \sum N_h (\bar{Y}_h - \bar{Y})^2 + \sum (N_h - n_h) \frac{S_h^2}{n_h} - \sum \frac{N_h (N_h - n_h)}{N} \frac{S_h^2}{n_h}$$

The sum of the last two terms on the right is always positive, so that
$\sum N_h (\bar{y}_h - \bar{y}_{st})^2$ gives an overestimate. It follows that an unbiased
estimate of $\sum N_h (\bar{Y}_h - \bar{Y})^2$ is

$$Q = \sum N_h (\bar{y}_h - \bar{y}_{st})^2 - \sum (N_h - n_h) \frac{s_h^2}{n_h} + \sum \frac{N_h (N_h - n_h)}{N} \frac{s_h^2}{n_h}$$

Finally, by substituting this estimate in equation (5.58), we obtain as
an unbiased estimate of $V_{ran}$,

$$v_{ran} = \frac{(N - n)}{nN} \left[ \frac{\sum (N_h - 1) s_h^2 + Q}{N - 1} \right] \tag{5.60}$$
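
The unbiasedness claimed for (5.60) can be checked by brute force on a small population. The sketch below is only an illustration: the two strata, their values, and the sample sizes are hypothetical, chosen so that every possible stratified sample can be enumerated. Since all such samples are equally likely, the plain average of the estimate over them is its expected value, which is compared with the true $V_{ran}$ from (5.58).

```python
from itertools import combinations, product

# Hypothetical population (not from the text): two strata of unequal size.
strata = [[1.0, 3.0, 4.0, 8.0], [10.0, 11.0, 15.0, 16.0, 20.0]]
n_h = [2, 3]  # hypothetical stratum sample sizes (not proportional)


def mean(values):
    return sum(values) / len(values)


def var(values):
    """Mean square with divisor len(values) - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


N_h = [len(s) for s in strata]
N, n = sum(N_h), sum(n_h)


def v_ran(sample):
    """Estimate (5.60) computed from one stratified sample."""
    ybar = [mean(s) for s in sample]
    s2 = [var(s) for s in sample]
    ybar_st = sum(Nh * yb for Nh, yb in zip(N_h, ybar)) / N
    # Q: the natural between-strata estimate, adjusted for its bias.
    q = sum(Nh * (yb - ybar_st) ** 2 for Nh, yb in zip(N_h, ybar))
    q -= sum((Nh - nh) * s / nh for Nh, nh, s in zip(N_h, n_h, s2))
    q += sum(Nh * (Nh - nh) * s / (N * nh) for Nh, nh, s in zip(N_h, n_h, s2))
    within = sum((Nh - 1) * s for Nh, s in zip(N_h, s2))
    return (N - n) / (n * N * (N - 1)) * (within + q)


# Enumerate every possible stratified sample and average the estimate.
all_samples = list(product(*(combinations(s, k) for s, k in zip(strata, n_h))))
expected_v_ran = mean([v_ran(sample) for sample in all_samples])

# True variance of the mean of a simple random sample of size n, from (5.58).
population = [y for s in strata for y in s]
true_v_ran = (N - n) / (n * N) * var(population)

print(expected_v_ran, true_v_ran)
```

The two printed numbers should agree up to rounding error; any population small enough to enumerate could be substituted.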
For computing purposes it is convenient to express this in terms of the
relative stratum sizes $W_h = N_h/N$. When the value of $Q$ is inserted
in (5.60), we find, after a little cancellation,

$$v_{ran} = \frac{(N - n)}{n(N - 1)} \left[ \sum W_h s_h^2 - \sum \frac{W_h s_h^2}{n_h} + \sum \frac{W_h^2 s_h^2}{n_h} - \sum \frac{W_h s_h^2}{N} + \sum W_h \bar{y}_h^2 - \left( \sum W_h \bar{y}_h \right)^2 \right]$$

This expression is unattractive to compute. In nearly all applications,
some simplifications can be utilized. Two will be given.
N > 50. This will hold for almost all populations. The fourth term
inside the bracket may be omitted, since it equals the first term, divided
by N. We obtain

$$v_{ran} = \frac{(N - n)}{nN} \left[ \sum W_h s_h^2 - \sum \frac{W_h s_h^2}{n_h} + \sum \frac{W_h^2 s_h^2}{n_h} + \sum W_h \bar{y}_h^2 - \left( \sum W_h \bar{y}_h \right)^2 \right] \tag{5.61}$$

All $n_h > 50$. The second and third terms inside the bracket may
be dropped, to give

$$v_{ran} = \frac{(N - n)}{nN} \left[ \sum W_h s_h^2 + \sum W_h \bar{y}_h^2 - \left( \sum W_h \bar{y}_h \right)^2 \right] \tag{5.62}$$

Example. The calculations will be illustrated from the first three
strata in the sample of teachers' colleges (section 5.13). Data from
the 1946 sample appear in table 5.11. The means represent enroll-
ment per college, in thousands.

TABLE 5.11 BASIC DATA FROM A STRATIFIED SAMPLE

Stratum    $N_h$    $n_h$    $\bar{y}_h$    $s_h^2$

   1        13        9        2.200        1.615
   2        18        7        1.638        0.063
   3        26       10        0.992        0.077

Totals      57       26

The sample is so small that expression (5.61) for Vran will be used.
The supplementary calculations are given in table 5.12.

TABLE 5.12 ARRANGEMENT OF CALCULATIONS

Stratum    $W_h$    $W_h s_h^2$    $W_h s_h^2/n_h$    $W_h^2 s_h^2/n_h$    $W_h \bar{y}_h$

   1       0.228      0.36822         0.04091             0.00933            0.50160
   2       0.316      0.01991         0.00284             0.00090            0.51761
   3       0.456      0.03511         0.00351             0.00160            0.45235

Totals     1.000      0.42324         0.04726             0.01183            1.47156

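The formulas work out as follows:

$$v(\bar{y}_{st}) = \sum \frac{W_h^2 s_h^2}{n_h} - \sum \frac{W_h s_h^2}{N} = 0.0118 - \frac{0.4232}{57} = 0.0044$$

$$v_{ran} = \frac{31}{(57)(26)} [0.4232 - 0.0473 + 0.0118 + 2.4000 - 2.1655] = 0.0130$$

Stratification appears to have reduced the variance to about one-
third of the value for a simple random sample.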
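
The arithmetic of this example can be reproduced in a few lines. The following is a minimal sketch rather than a prescribed procedure: it takes the four columns of table 5.11 as input, uses unrounded weights $W_h = N_h/N$ (so the later decimals may differ slightly from table 5.12), and evaluates the estimated variance of the stratified mean together with expression (5.61). The helper name `total` is an illustrative choice.

```python
# Data from table 5.11 (enrollment per college, in thousands).
N_h = [13, 18, 26]
n_h = [9, 7, 10]
ybar_h = [2.200, 1.638, 0.992]
s2_h = [1.615, 0.063, 0.077]

N, n = sum(N_h), sum(n_h)
W_h = [Nh / N for Nh in N_h]  # unrounded relative stratum sizes


def total(terms):
    """Sum over strata (illustrative helper)."""
    return sum(terms)


# Estimated variance of the weighted mean from the stratified sample.
v_st = (total(W ** 2 * s / m for W, s, m in zip(W_h, s2_h, n_h))
        - total(W * s for W, s in zip(W_h, s2_h)) / N)

# Estimated variance for a simple random sample of the same size, by (5.61).
bracket = (total(W * s for W, s in zip(W_h, s2_h))
           - total(W * s / m for W, s, m in zip(W_h, s2_h, n_h))
           + total(W ** 2 * s / m for W, s, m in zip(W_h, s2_h, n_h))
           + total(W * y ** 2 for W, y in zip(W_h, ybar_h))
           - total(W * y for W, y in zip(W_h, ybar_h)) ** 2)
v_ran = (N - n) / (n * N) * bracket

print(round(v_st, 4), round(v_ran, 4), round(v_st / v_ran, 2))
```

The ratio `v_st / v_ran` is the estimated fraction of the simple-random-sampling variance that remains after stratification.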
Proportional allocation. An estimate $v_{ran}$ that is usually adequate is
obtained from the sum of squares of deviations of the sample values
from their mean, for

$$s^2 = \frac{\sum (y_{hi} - \bar{y})^2}{n - 1} = \frac{1}{(n - 1)} \left[ \sum (n_h - 1) s_h^2 + \sum n_h \bar{y}_h^2 - \frac{\left( \sum n_h \bar{y}_h \right)^2}{n} \right]$$

by the usual identity in the analysis of variance. If terms in $1/n_h$ are
negligible, this is equivalent to

$$\sum W_h s_h^2 + \sum W_h \bar{y}_h^2 - \left( \sum W_h \bar{y}_h \right)^2$$

since $W_h = n_h/n$. This in turn equals the quantity inside the bracket
in equation (5.62). Thus the expression

$$v_{ran} = \frac{(N - n)}{N} \frac{s^2}{n}$$

is satisfactory if allocation is proportional and terms in $1/n_h$ are
negligible.
Proportional allocation with $S_h^2$ constant. This case is assumed to
hold in numerous applications: e.g. in sampling an agricultural field,
where the strata are subdivisions of the field. The constant value of
$S_h^2$, say $S_w^2$, is estimated by the pooled mean square within strata,
$s_w^2$. For $v(\bar{y}_{st}) = v_{st}$, we have from equation (5.8)

$$v_{st} = \frac{(N - n)}{N} \frac{s_w^2}{n}$$

For $v_{ran}$, we assume that terms in $1/N$ are negligible, but not terms
in $1/n_h$, since sometimes there are only a few units per stratum in the
sample. Thus we start from equation (5.61), inserting, however, the
pooled estimate $s_w^2$ in place of the individual $s_h^2$. This gives

$$v_{ran} = \frac{(N - n)}{Nn} \left[ \sum \left( W_h - \frac{W_h}{n_h} + \frac{W_h^2}{n_h} \right) s_w^2 + \sum W_h \bar{y}_h^2 - \left( \sum W_h \bar{y}_h \right)^2 \right]$$

Since $n_h = nW_h$, the coefficient of $s_w^2$ may be written

$$\sum \left( W_h - \frac{1}{n} + \frac{W_h}{n} \right) = \frac{n - L + 1}{n}$$

where $L$ is the number of strata. Hence

$$v_{ran} = \frac{(N - n)}{N n^2} \left[ (n - L + 1) s_w^2 + \sum n_h (\bar{y}_h - \bar{y}_{st})^2 \right]$$

These quantities may be computed from an analysis of variance of the
sample data into two components, as shown in table 5.13.

TABLE 5.13 ANALYSIS OF VARIANCE OF THE SAMPLE DATA

Component          df          ms
Between strata     $(L - 1)$   $B = \sum n_h (\bar{y}_h - \bar{y}_{st})^2/(L - 1)$
Within strata      $(n - L)$   $W = s_w^2$

It follows that

$$v_{ran} = \frac{(N - n)}{N n^2} \left[ (L - 1)B + (n - L + 1)W \right]$$
Example. In sampling an agricultural field experiment for estimat-
ing the number of wireworms per plot, the 25 plots were divided into
halves. From each half 3 random samples of soil were taken with a
small boring tool. The sample was a volume of soil 9 in. square to a
depth of 5 in. The combined analysis of variance of numbers of wire-
worms for the 25 plots was as follows:

                               df     ms
Between strata (half-plots)    25     B = 90.76
Within strata                 100     W = 38.44

The conditions are slightly different from those in the theory pre-
sented above. Each plot represents a separate population, divided
into 2 strata. Thus $n = 6$, $n_h = 3$, $L = 2$ for each plot. The analysis
of variance gives the combined results for 25 stratified samples of this
type. The fpc need not be included.

$$v_{st} = \frac{38.44}{6} = 6.41$$

$$v_{ran} = \tfrac{1}{36}[B + 5W] = \tfrac{1}{36}[90.76 + 5(38.44)] = 7.86$$

$$\text{Relative precision} = \frac{7.86}{6.41} = 1.23$$

Stratification into halves increased the precision by slightly under
one-fourth.
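
The same calculation can be written as two small functions of the mean squares $B$ and $W$. This is a sketch under the assumptions of this case (proportional allocation, constant $S_h^2$, terms in $1/N$ negligible); the function names and the optional `fpc` argument, standing for $(N - n)/N$, are illustrative and not from the text.

```python
def v_st_from_anova(W, n, fpc=1.0):
    """v_st = fpc * W / n, with W the pooled within-stratum mean square."""
    return fpc * W / n


def v_ran_from_anova(B, W, n, L, fpc=1.0):
    """v_ran = fpc * [(L - 1) B + (n - L + 1) W] / n**2."""
    return fpc * ((L - 1) * B + (n - L + 1) * W) / n ** 2


# Mean squares from the wireworm example; the fpc is left at 1 (ignored).
B, W = 90.76, 38.44
n, L = 6, 2

v_st = v_st_from_anova(W, n)
v_ran = v_ran_from_anova(B, W, n, L)
relative_precision = v_ran / v_st

print(round(v_st, 2), round(v_ran, 2), round(relative_precision, 2))
```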

5.18 Effects of errors in the stratum sizes. For a desirable type of
stratification, the stratum totals $N_h$ may not be known exactly, being
derived from a census that is out of date or from a large sample.
Definite statements about the consequences of basing a stratification
upon erroneous weights cannot be made without considering particular
cases. A few conclusions of a general nature can be drawn. The fpc
will be ignored.
Instead of the true stratum proportions $W_h$, we have estimates $w_h$.
The sample estimated mean is $\sum w_h \bar{y}_h$. This estimate is biased. Its
mean value in repeated sampling is $\sum w_h \bar{Y}_h$. The bias amounts to
$\sum (w_h - W_h) \bar{Y}_h$. Consequently, the error variance of this estimate
contains two components: the variance about the mean of the estimate,
and the square of the bias. If "optimum" allocation is used (with the
$N_h$ replaced by their estimates), the first component is $(\sum w_h S_h)^2/n$.
The total variance is

$$V(\bar{y}_{st}) = \frac{\left( \sum w_h S_h \right)^2}{n} + \left[ \sum (w_h - W_h) \bar{Y}_h \right]^2 \tag{5.63}$$

A more general form of this expression was given by Stephan (1941).
As he points out, the first term in (5.63) will usually not differ greatly
from the variance with the correct weights; the two are exactly the
same if the variance is the same in all strata. The loss of precision
from incorrect weights thus depends mainly on the size of the bias,
which in individual cases may be small or large. For any given set of
erroneous weights, the loss varies with the size of sample, because the
"bias" component of the total variance is independent of the size of
sample. With increasing sample size, a stage is reached where the
"bias" term predominates, and where the stratification is less accurate
than simple random sampling.
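
The way the "bias" term comes to dominate can be seen numerically. The sketch below evaluates the two components of (5.63) for several sample sizes and compares the total with the variance of the mean of a simple random sample (fpc ignored). All of the stratum proportions, means, and standard deviations are hypothetical values chosen for illustration, and the function names are likewise assumptions.

```python
def mse_wrong_weights(w, W, S, Ybar, n):
    """Total variance (5.63): variance term plus squared bias, fpc ignored."""
    variance_term = sum(wh * Sh for wh, Sh in zip(w, S)) ** 2 / n
    bias = sum((wh - Wh) * Yh for wh, Wh, Yh in zip(w, W, Ybar))
    return variance_term + bias ** 2


def srs_variance(W, S, Ybar, n):
    """S^2 / n for a simple random sample, fpc ignored.

    S^2 is approximated by the within-strata plus between-strata
    components, which is adequate when the strata are large.
    """
    Y = sum(Wh * Yh for Wh, Yh in zip(W, Ybar))
    within = sum(Wh * Sh ** 2 for Wh, Sh in zip(W, S))
    between = sum(Wh * (Yh - Y) ** 2 for Wh, Yh in zip(W, Ybar))
    return (within + between) / n


# Hypothetical two-stratum population with moderately wrong weights.
W_true = [0.5, 0.5]     # true stratum proportions
w_used = [0.55, 0.45]   # erroneous proportions used as weights
S = [2.0, 2.0]          # stratum standard deviations
Ybar = [10.0, 20.0]     # stratum means

for n in (25, 50, 200, 800):
    print(n,
          round(mse_wrong_weights(w_used, W_true, S, Ybar, n), 4),
          round(srs_variance(W_true, S, Ybar, n), 4))
```

With these particular values the squared bias is 0.25 whatever the sample size, so the stratified estimate is the more accurate for small $n$ and the less accurate once $n$ is large.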
This discussion does not help in considering whether to stratify in a
survey where the weights are known to be in error, because the size of
the bias cannot be predicted. Occasionally a standard error can be
attached to the estimate of each $N_h$, from knowledge of the process by
which they were estimated. If the estimates of the $N_h$ are independ-
ent, and independent of the $\bar{y}_h$, the average value of the bias component
of the total variance of $\bar{y}_{st}$ is roughly (Cochran, 1939)

$$\frac{\sum (\bar{Y}_h - \bar{Y})^2 V(N_h)}{N^2}$$

where $V(N_h)$ is the variance of the estimate of $N_h$. This quantity
measures the expected increase in variance of $\bar{y}_{st}$ due to errors in the $N_h$.
King, McCarty, and McPeek (1942) applied this formula in research
directed towards the estimation of yield per acre, protein, and test
weight in the U. S. wheat belt. They discussed the utility of stratifica-
tion by districts within each state. The total acreages $N_h$ for each
district were themselves estimated by a sample survey, so that some
knowledge of the $V(N_h)$ was available.
This problem arises in the technique known as double sampling
(section 12.2). It will be shown that if the $W_h$ are estimated by taking
a large random sample of size $n'$ ($n' > n$), and noting how many fall in
each stratum, then the increase in $V(\bar{y}_{st})$ is approximately

$$\frac{\sum W_h (\bar{Y}_h - \bar{Y})^2}{n'}$$

5.19 Case where the strata cannot be identified in advance. Some-
times the stratum to which a unit belongs is not known until the data
have been obtained from the unit. In a survey on political attitudes
it may be useful to stratify according to the individual's voting be-
havior at the last election, which is not learned until the individual
has been interviewed. A similar situation may occur when stratifica-
tion is based on factors such as income, occupation, and religious
affiliation. Of course, in such cases the stratum sizes $N_h$ may not be
known exactly: we will, however, assume that they are known.
One procedure is to take a simple random sample of size $n$. When
the sample data have been collected, the units are assigned to the
strata by means of the information obtained about them. If $\bar{y}_h$ is the
mean of the units that fall in the $h$th stratum, the estimate is

$$\bar{y}_w = \frac{\sum N_h \bar{y}_h}{N}$$

In other words we use the true stratum sizes as weights to obtain a
weighted mean, instead of taking the unweighted mean of the sample.
If the sample is reasonably large, this technique is almost as precise
as proportional stratified sampling. Let $m_h$ be the number in the sam-
ple that fall in the $h$th stratum, where $m_h$ will vary from sample to
sample. For samples in which the $m_h$ are fixed,

$$V(\bar{y}_w) = \frac{1}{N^2} \sum N_h^2 \frac{S_h^2}{m_h}$$

where the fpc is ignored.
The average value of this quantity in repeated sampling must now
be calculated. This procedure requires a little care, since one or more
of the $m_h$ could be zero. If this occurred, we would have to combine
two or more strata before making the estimate. The result would be
a less accurate stratification and a higher variance for $\bar{y}_w$. With in-
creasing $n$ the probability that any $m_h$ is zero is so small that the con-
tribution to the variance from this source is negligible. (This state-
ment is given without proof.)
If the case where $m_h$ is zero is ignored, Stephan (1945) has shown
that to terms of order $n^{-2}$

$$E\left( \frac{1}{m_h} \right) = \frac{1}{nW_h} - \frac{1}{n^2 W_h} + \frac{1}{n^2 W_h^2}$$

where $W_h = N_h/N$. Hence

$$E[V(\bar{y}_w)] = \frac{1}{n} \sum W_h S_h^2 + \text{terms in } (n^{-2})$$

The leading term is the variance obtained with proportional stratified
sampling.
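
The quality of the approximation for $E(1/m_h)$ can be examined directly. The sketch below assumes, in line with ignoring the fpc, that $m_h$ follows a binomial distribution with index $n$ and probability $W_h$, and that the case $m_h = 0$ is excluded by conditioning on $m_h > 0$. It compares the exact conditional expectation with the three-term approximation; the values of $n$ and $W_h$ are hypothetical.

```python
from math import comb


def exact_mean_reciprocal(n, W):
    """E(1/m | m > 0) for m distributed as Binomial(n, W)."""
    ms = range(1, n + 1)
    probs = [comb(n, m) * W ** m * (1 - W) ** (n - m) for m in ms]
    return sum(p / m for m, p in zip(ms, probs)) / sum(probs)


def stephan_approx(n, W):
    """Approximation to terms of order n**-2, as quoted in the text."""
    return 1 / (n * W) - 1 / (n ** 2 * W) + 1 / (n ** 2 * W ** 2)


n, W = 100, 0.3  # hypothetical sample size and stratum proportion
exact = exact_mean_reciprocal(n, W)
approx = stephan_approx(n, W)
leading = 1 / (n * W)

print(exact, approx, leading)
```

The leading term $1/(nW_h)$ alone corresponds to proportional stratified sampling; the two correction terms account for most of the remaining difference.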

5.20 Quota sampling. Another method that is used in this situation
is to decide in advance the $n_h$ that are wanted from each stratum and
to instruct the enumerator to continue sampling until the necessary
"quota" has been obtained in each stratum. If the enumerator ini-
tially chooses units at random, rejecting those that are not needed,
this method is equivalent to stratified random sampling. In the later
stages of sampling, this may require considerable work on the part of
the enumerator, since most of the units that are contacted may fall in
strata where the quota has already been met.
As this method is used in practice by a number of agencies, the
enumerator does not select units at random. Instead, he takes ad-
vantage of any information that enables the quota to be filled quickly
(such as that rich people seldom live in slums). The object is to gain
the benefits of stratification without the high field costs that might be
incurred in an attempt to select units at random. Varying amounts
of latitude are permitted to the enumerators.
Sampling theory cannot be applied to quota methods which contain
no element of probability sampling. Information about the precision
of such methods is obtained only when a comparison is possible with a
census or with another sample for which confidence limits can be
computed.

5.21 Stratification with one unit per stratum. If the population is
highly variable and many effective criteria for stratification are known,
stratification may be carried to the point where the sample contains
only one unit in each stratum. In this event the formula previously
given for estimating $V(\bar{y}_{st})$ cannot be used. An estimate may be
attempted by grouping the strata in pairs. In the two strata which
form a pair, we shall assume that the sizes $N_h$ are equal. Slight devia-
tions from equality do not vitiate the method. The population means
$\bar{Y}_h$ for the two members of a pair should not differ greatly, but the
allocation into pairs should be made before seeing the sample results,
for reasons which will become evident. The number of strata should
be at least 20, to allow a minimum of 10 df in the estimated variance.
Let the observations in a typical pair be $y_{j1}$, $y_{j2}$, where $j$ goes from
1 to $L/2$. Then, averaging over all samples from this pair,

$$E (y_{j1} - y_{j2})^2 = (\bar{Y}_{j1} - \bar{Y}_{j2})^2 + \frac{(N_j - 1)}{N_j} (S_{j1}^2 + S_{j2}^2) \tag{5.64}$$
where $N_j = N_h$ is the size of each stratum in the pair. Consider the
estimate

$$v(\bar{y}_{st}) = \frac{1}{N^2} \sum_{j=1}^{L/2} N_j^2 (y_{j1} - y_{j2})^2 \tag{5.65}$$

By (5.64) the expected value of this quantity is

$$E v(\bar{y}_{st}) = \frac{1}{N^2} \left[ \sum_{h=1}^{L} N_h (N_h - 1) S_h^2 + \sum_{j=1}^{L/2} N_j^2 (\bar{Y}_{j1} - \bar{Y}_{j2})^2 \right] \tag{5.66}$$

The first term on the right is the correct variance (by theorem 5.3 with
$n_h = 1$): the second represents a positive bias. The size of the bias
depends on the success attained in the formation of pairs of strata
whose true means differ little. The form of estimate (5.65) warns us
that we should not construct pairs by making the sample values differ
as little as possible, since this might give a serious underestimate. The
technique is sometimes referred to colloquially as the method of "col-
lapsed strata."
An alternative method of sampling is to use each pair as a single
stratum, with $L/2$ strata and two units chosen at random per stratum.
An unbiased estimate of $V(\bar{y}_{st})$ for this kind of sampling is obtainable
from the usual formula. The reader may verify that

$$V(\bar{y}_{st}) = \frac{1}{N^2} \left[ \sum_{h=1}^{L} N_h (N_h - 1) \frac{(2N_h - 2)}{(2N_h - 1)} S_h^2 + \sum_{j=1}^{L/2} N_j^2 \frac{(N_j - 1)}{(2N_j - 1)} (\bar{Y}_{j1} - \bar{Y}_{j2})^2 \right] \tag{5.67}$$

By comparison with (5.66) it appears that estimate (5.65) is an
overestimate not only of the true variance with one unit per stratum,
but also of the variance which would apply if strata twice as large
were used.
Whether the smaller strata are preferable, in the light of this result,
is debatable. Unfortunately, if there is a large gain in precision from
one unit per stratum as compared with two units per pair, there is
also a large overestimation of the variance.
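
The behaviour of the collapsed-strata estimate can be verified by enumeration. The sketch below uses a hypothetical population of four equal-sized strata, paired as (1, 2) and (3, 4); the values and the pairing are assumptions made for illustration. It averages estimate (5.65) over every possible sample of one unit per stratum and compares the result with the two terms of (5.66).

```python
from itertools import product

# Hypothetical population: L = 4 strata of equal size.
strata = [
    [2.0, 3.0, 7.0],
    [4.0, 6.0, 8.0],
    [15.0, 18.0, 21.0],
    [17.0, 20.0, 29.0],
]
pairs = [(0, 1), (2, 3)]  # pairing fixed before any sample is seen
N_h = [len(s) for s in strata]
N = sum(N_h)


def mean(values):
    return sum(values) / len(values)


def S2(values):
    """Stratum variance with divisor N_h - 1."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def v_collapsed(sample):
    """Estimate (5.65) from a sample of one unit per stratum."""
    return sum(N_h[a] ** 2 * (sample[a] - sample[b]) ** 2
               for a, b in pairs) / N ** 2


# All samples of one unit per stratum are equally likely.
all_samples = list(product(*strata))
average_v = mean([v_collapsed(s) for s in all_samples])

# The two terms on the right of (5.66).
true_variance = sum(Nh * (Nh - 1) * S2(s) for Nh, s in zip(N_h, strata)) / N ** 2
bias = sum(N_h[a] ** 2 * (mean(strata[a]) - mean(strata[b])) ** 2
           for a, b in pairs) / N ** 2

print(average_v, true_variance, bias)
```

The first printed value should equal the sum of the other two, and the bias shrinks as the paired strata are given closer true means.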

5.22 Stratification in analytical studies. In many studies which in-
volve sampling, we wish to compare the average values of certain
variables in one part of the population with those in another part, in
order to find out, for example, whether differences exist between an
urban and a rural area, or between fee-paying and non-fee-paying
schools, or between people of different educational levels. Such in-
vestigations may be called analytical, because they involve the study
of relationships within the population rather than a mere description
of the characteristics of the population. In analytical studies, we
usually try to go further and speculate about the underlying causal
factors that have led to the observed differences. In the absence of
controlled experimentation, such speculations tend to be hazardous,
but in many areas of research the worker has no recourse but to rely
on a combination of observation and analysis for much of his knowledge.
The term domains of study has been given by the U.N. Subcommis-
sion on Sampling to those parts of the population which we wish to
compare.
The techniques of planning and analysis appropriate to analytical
surveys are different from those appropriate to the descriptive sur-
veys which have been considered thus far in this book. Suppose that
each domain of study is a separate stratum, this being one of the sim-
plest ways of designing an analytical survey. If two such strata are
being compared, it is usually of no scientific interest to ask whether
$\bar{Y}_1 = \bar{Y}_2$, because these means would not be exactly equal, except by
a rare chance, even though both strata were drawn at random from
the same infinite population. Instead, we ask whether the two strata
can be regarded as drawn from the same infinite population. Conse-
quently, in analytical comparisons we omit the fpc when computing
the variances of estimated stratum means.
The rules for allocating the sample sizes in the strata are also dif-
ferent. If there are only two strata, the variance of the difference
between the sample means $(\bar{y}_1 - \bar{y}_2)$ is

$$V = \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \tag{5.68}$$

For a cost function of the form

$$C = c_1 n_1 + c_2 n_2 \tag{5.69}$$

$V$ is minimized when

$$\frac{n_1}{n_2} = \frac{S_1/\sqrt{c_1}}{S_2/\sqrt{c_2}} \tag{5.70}$$

With $L$ strata ($L > 2$), the optimum allocation depends on circum-
stances. One method is to minimize the average variance of the dif-
ference between all pairs of strata. From (5.68) it follows that this
average variance is

$$\bar{V} = \frac{2}{L} \left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} + \cdots + \frac{S_L^2}{n_L} \right)$$

$\bar{V}$ is minimized, for fixed $C$, by the same rule as in (5.70), i.e.

$$n_h \propto \frac{S_h}{\sqrt{c_h}}$$

Alternatively, if the $c_h$ vary little from stratum to stratum but there
are marked variations in the $S_h$, it is sometimes preferable to allocate
so that every mean has the same variance. For this, we choose the
$n_h$ so that

$$\frac{S_1^2}{n_1} = \frac{S_2^2}{n_2} = \cdots = \frac{S_L^2}{n_L}$$
The subject has many ramifications. The standard techniques of
analysis of variance and of multiple regression can be applied if each
domain is a separate stratum, or if the sample is a simple random
sample. Some complications occur when each domain covers parts of
several strata which have been set up.
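
The two allocation rules for analytical comparisons are easy to compute. The following sketch assumes a linear cost function, total cost $C = \sum c_h n_h$, which is an assumption of this illustration; the function names and all numerical inputs are hypothetical. The results are left unrounded, so in practice they would be rounded to whole units.

```python
from math import sqrt


def analytical_allocation(S, c, C):
    """n_h proportional to S_h / sqrt(c_h), scaled so that sum(c_h * n_h) = C."""
    weights = [Sh / sqrt(ch) for Sh, ch in zip(S, c)]
    scale = C / sum(ch * w for ch, w in zip(c, weights))
    return [scale * w for w in weights]


def equal_variance_allocation(S, n):
    """n_h proportional to S_h**2, so that every S_h**2 / n_h is the same."""
    total = sum(Sh ** 2 for Sh in S)
    return [n * Sh ** 2 / total for Sh in S]


# Hypothetical inputs: stratum standard deviations, unit costs, total cost.
S = [4.0, 9.0]
c = [1.0, 4.0]
n_cost = analytical_allocation(S, c, C=1000.0)

# Hypothetical inputs for the equal-variance rule with fixed total n.
n_equal = equal_variance_allocation([2.0, 3.0], n=130)

print(n_cost, n_equal)
```

Under the first rule the stratum with the larger $S_h/\sqrt{c_h}$ receives the larger sample; under the second, the variances $S_h^2/n_h$ of the stratum means come out identical.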

5.23 Exercises.
5.1 The households in a town are to be sampled in order to estimate the
average amount of assets per household that are readily convertible into cash.
The households are stratified into a high-rent and a low-rent stratum. A
house in the high-rent stratum is thought to have about 9 times as much
assets as one in the low-rent stratum, and $S_h$ is expected to be proportional
to the square root of the stratum mean.
There are 4000 households in the high-rent stratum and 20,000 in the low-
rent stratum. (i) How would you distribute a sample of 1000 households
between the two strata? (ii) If the object is to estimate the difference be-
tween assets per household in the two strata, how should the sample be dis-
tributed?
5.2 The following data show the stratification of all the farms in a county
by farm size, and the average acres of corn (maize) per farm in each stratum.

Farm size      Number of      Average corn         Standard
(acres)        farms $N_h$    acres $\bar{Y}_h$    deviation $S_h$

0-40              394             5.4                  8.3
41-80             461            16.3                 18.3
81-120            391            24.3                 16.1
121-160           334            34.6                 19.8
161-200           169            42.1                 24.6
201-240           113            50.1                 26.0
241-              148            63.8                 36.2

Total or mean    2010            26.3
For a sample of 100 farms, compute the sample sizes in each stratum under
(i) proportional allocation, (ii) optimum allocation. Compare the precisions
of these methods with that of simple random sampling.
5.3 If the 7 strata are to be combined into 2 strata, what is the best point
of division for proportional allocation? What is the relative precision of 2
strata to 7 strata under proportional allocation?
5.4 Prove the result stated in formula (5.31), section 5.7:

5.5 If there are 2 strata and if $\phi$ is the ratio of the actual $n_1/n_2$ to the opti-
mum $n_1/n_2$ for fixed sample size (as in section 5. ), show that, whatever the
values of $N_1$, $N_2$, $S_1$, and $S_2$, the relative precision of the actual allocation to
the optimum allocation is never less than

5.6 The variate $y_i$ follows the frequency distribution $e^{-y_i}\,dy_i$ ($0 \le y_i$).
The population is divided into 2 strata at the point $y_0$, and a stratified random
sample of size $n$ is taken with proportional allocation. Find $V(\bar{y}_{st})$ as a func-
tion of $y_0$. Verify that the value of $y_0$ which minimizes $V$ satisfies the rule
given by Dalenius (section 5.15).
5.7 The following data are derived from a stratified sample of tire dealers
taken in March 1945 (Deming and Simmons, 1946). The dealers were as-
signed to strata according to the number of new tires held at a previous cen-
sus. The sample means $\bar{y}_h$ are the mean numbers of new tires per dealer.
(i) Estimate the gain in precision due to the stratification. (ii) Compare this
result with the gain that would have been attained from proportional alloca-
tion.

Stratum
boundaries     $N_h$      $W_h$     $\bar{y}_h$    $s_h^2$    $n_h$

1-9           19,850     0.8032        4.1          34.8      3,000
10-19          3,250     0.1315       13.0          92.2        600
20-29          1,007     0.0407       25.0         174.2        340
30-39            606     0.0245       38.2         320.4        230

Totals        24,713     0.9999                               4,170

5.8 For a population with $N = 6$, $L = 2$, the values of $y_{hi}$ are 0, 1, 3 in
the first stratum and 5, 6, I} in the second stratum. Compute (i) $V(\bar{y})$ for a
simple random sample with $n = 2$, (ii) $V(\bar{y}_{st})$ for a stratified random sample
with 1 unit per stratum, (iii) the average value of $v(\bar{y}_{st})$ as estimated by the
method of collapsed strata. Verify that $E\,v(\bar{y}_{st}) > V(\bar{y})$.

5.24 References.
ARMITAGE, P. (1947). A comparison of stratified with unrestricted random sam-
pling from a finite population. Biometrika, 34, 273-280.
COCHRAN, W. G. (1939). The use of analysis of variance in enumeration by sam-
pling. Jour. Amer. Stat. Assoc., 34, 492-510.
CORNELL, F. G. (1947). A stratified random sample of a small finite population.
Jour. Amer. Stat. Assoc., 42, 523-532.
DALENIUS, T. (1950). The problem of optimum stratification. Skand. Akt., 33,
203-213.
DALENIUS, T., and GURNEY, M. (1951). The problem of optimum stratification.
II. Skand. Akt., 34, 133-148.
DEMING, W. E., and SIMMONS, W. R. (1946). On the design of a sample for dealer
inventories. Jour. Amer. Stat. Assoc., 41, 16-33.
EVANS, W. D. (1951). On stratification and optimum allocations. Jour. Amer.
Stat. Assoc., 46, 95-104.
HAGOOD, M. J., and BERNERT, E. H. (1945). Component indexes as a basis for
stratification. Jour. Amer. Stat. Assoc., 40, 330-341.
JESSEN, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
JESSEN, R. J., and HOUSEMAN, E. E. (1944). Statistical investigations of farm
sample surveys taken in Iowa, Florida and California. Iowa Agr. Exp. Sta. Res.
Bull. 329.
KING, A. J., and McCARTY, D. E. (1941). Application of sampling to agricultural
statistics with emphasis on stratified sampling. Jour. Marketing, 6, 462-474.
KING, A. J., McCARTY, D. E., and McPEEK, M. (1942). An objective method of
sampling wheat fields to estimate production and quality of wheat. U. S. Dept.
Agr. Tech. Bull. 814.
MAHALANOBIS, P. C. (1940). A sample survey of the acreage under jute in Bengal.
Sankhya, 4, 511-530.
NEYMAN, J. (1934). On the two different aspects of the representative method:
the method of stratified sampling and the method of purposive selection. Jour.
Roy. Stat. Soc., 97, 558-606.
SATTERTHWAITE, F. E. (1946). An approximate distribution of estimates of vari-
ance components. Biometrics, 2, 110-114.
STEPHAN, F. F. (1941). Stratification in representative sampling. Jour. Market-
ing, 6, 38-46.
STEPHAN, F. F. (1945). The expected value and variance of the reciprocal and
other negative powers of a positive Bernoullian variate. Ann. Math. Stat., 16,
50-61.
SUKHATME, P. V. (1935). Contribution to the theory of the representative method.
Supp. Jour. Roy. Stat. Soc., 2, 253-268.
TSCHUPROW, A. A. (1923). On the mathematical expectation of the moments of
frequency distributions in the case of correlated observations. Metron, 2, 461-
493 and 646-683.

Not cited in text
DALENIUS, T. (1952). The problem of optimum stratification in a special type
of design. Skand. Akt., 35, 61-70. (Gives the best boundary for dividing a
skew population into two strata, given that the upper stratum is to be sampled
100 per cent.)
CHAPTER 6

RATIO ESTIMATES

6.1 Methods of estimation. One feature of the growth of theoretical
statistics during the past 30 years is the emergence of a substantial
body of theory which discusses how to make good estimates from data.
In the development of theory specifically for sample surveys, rela-
tively little use has been made of this body of knowledge. For this I
think there are two principal reasons. First, with routine surveys
which contain a large number of items, there is a great advantage in
an estimation procedure that requires little more than simple addi-
tion, whereas the superior methods of estimation in statistical theory,
such as maximum likelihood, may necessitate a series of successive
approximations before the estimate can be found. Second, there has
been a difference in attitude in the two lines of research. Most of the
estimation methods in theoretical statistics take it for granted that
we know the functional form of the frequency distribution followed
by the data in the sample, and the method of estimation is carefully
geared to this type of distribution. The preference in sample survey
theory has been to make only limited assumptions about this fre-
quency distribution, e.g. that it is very skew or rather symmetrical,
and to leave its specific functional form out of the discussion. This
attitude is a reasonable one for handling surveys with many items,
where the type of distribution may change from one item to another,
and where we do not wish to stop and examine all these distributions
before deciding how to make each estimate.
Consequently, estimation techniques for sample survey work are at
present rather restricted in scope. Two techniques will be considered
- the ratio method in this chapter and the linear regression method in
chapter 7. It is possible that the use of more complex methods will
increase, because the gain in precision from a superior method of esti-
mation can often be secured fairly cheaply, since only the final com-
putations are affected.

6.2 The ratio estimate. In the ratio method, an auxiliary variate
$x_i$, correlated with $y_i$, is obtained for each unit in the sample. The
population total $X$ of the $x_i$ must be known. In practice, $x_i$ is often
the value of $y_i$ at some previous time when a complete census was
taken. The aim in this method is to obtain increased precision by
taking advantage of the correlation between $y_i$ and $x_i$. At present
we assume simple random sampling.
The ratio estimate of $Y$, the population total of the $y_i$, is

$$\hat{Y}_R = \frac{y}{x} X = \frac{\bar{y}}{\bar{x}} X \tag{6.1}$$

where $y$, $x$ are the sample totals of the $y_i$ and $x_i$, respectively.
If $x_i$ is the value of $y_i$ at some previous time, the ratio method uses
the sample to estimate the relative change $Y/X$ that has occurred
since the previous time. The estimated relative change, $y/x$, is mul-
tiplied by the known population total $X$ on the previous occasion, to
provide an estimate of the current population total. If the ratio
$y_i/x_i$ is nearly the same on all sampling units, the values of $y/x$ vary
little from one sample to another, and the ratio estimate is of high
precision. In another application, $x_i$ may be the total acreage of a
farm and $y_i$ the number of acres sown to some crop. The ratio esti-
mate will be successful in this case if all farmers devote about the same
percentage of their total acreage to this crop.
If the quantity to be estimated is $\bar{Y}$, the population mean value of
$y_i$, the ratio estimate is

$$\hat{\bar{Y}}_R = \frac{y}{x} \bar{X}$$

where $\bar{X} = X/N$ is the population mean of the $x_i$.
Frequently we wish to estimate a ratio rather than a total or mean,
e.g. the ratio of corn acres to wheat acres, the ratio of expenditures
on labor to total expenditures, or the ratio of liquid assets to total
assets. The sample estimate is $\hat{R} = y/x$. In this case $X$ need not be
known.
Example. Table 6.1 shows the number of inhabitants (in 1000's)
in each of a simple random sample of 49 cities drawn from the popu-
lation of 196 large cities previously discussed in section 2.8. The
problem is to estimate the total number of inhabitants in the 196
cities in 1930. The true 1920 total, $X$, is assumed to be known. Its
value is 22,919.
The example is a suitable one for the ratio estimate. The majority
of the cities in the sample show an increase in size from 1920 to 1930
of the order of 20 per cent. From the sample data we have

$$y = \sum y_i = 6262, \qquad x = \sum x_i = 5054$$

TABLE 6.1 SIZES OF 49 LARGE UNITED STATES CITIES (in 1000's) IN 1920
($x_i$) AND 1930 ($y_i$)

$x_i$   $y_i$      $x_i$   $y_i$      $x_i$   $y_i$

  76      80          2      50        243     291
 138     143        507     634         87     105
  67      67        179     260         30     111
  29      50        121     113         71      79
 381     464         50      64        256     288
  23      48         44      58         43      61
  37      63         77      89         25      57
 120     115         64      63         94      85
  61      69         64      77         43      50
 387     459         56     142        298     317
  93     104         40      60         36      46
 172     183         40      64        161     232
  78     106         38      52         74      93
  66      86        136     139         45      53
  60      57        116     130         36      54
  46      65         46      53         50      58
                                        48      75

FIGURE 6.1 Experimental comparison of the ratio estimate with the estimate based on the sample mean. [Frequency distributions of 200 ratio estimates and of 200 estimates based on the sample mean; horizontal axis: total population (millions); × denotes the true population total.]

Consequently the ratio estimate of the 1930 total for all 196 cities is
$$\hat{Y}_R = \frac{y}{x}\,X = \frac{6262}{5054}\,(22{,}919) = 28{,}397$$
The corresponding estimate based on the sample mean per city is
$$\hat{Y} = N\bar{y} = \frac{(196)(6262)}{49} = 25{,}048$$
The correct total in 1930 is 29,351.
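The arithmetic of this example can be checked with a short Python sketch. The data are the 49 pairs of table 6.1, with a few doubtful entries read so that the totals agree with the quoted y = 6262 and x = 5054; the function names are illustrative and not part of the text.

```python
# Table 6.1: 1920 sizes (x) and 1930 sizes (y), in 1000's, for the 49 sampled cities.
# A few doubtful entries are read so that the totals match those quoted in the text.
x = [76, 138, 67, 29, 381, 23, 37, 120, 61, 387, 93, 172, 78, 66, 60, 46,
     2, 507, 179, 121, 50, 44, 77, 64, 64, 56, 40, 40, 38, 136, 116, 46,
     243, 87, 30, 71, 256, 43, 25, 94, 43, 298, 36, 161, 74, 45, 36, 50, 48]
y = [80, 143, 67, 50, 464, 48, 63, 115, 69, 459, 104, 183, 106, 86, 57, 65,
     50, 634, 260, 113, 64, 58, 89, 63, 77, 142, 60, 64, 52, 139, 130, 53,
     291, 105, 111, 79, 288, 61, 57, 85, 50, 317, 46, 232, 93, 53, 54, 58, 75]


def ratio_estimate_total(y, x, X_total):
    """Ratio estimate of the population total: (y / x) * X."""
    return sum(y) / sum(x) * X_total


def expansion_estimate_total(y, N):
    """Estimate based on the sample mean per unit: N * ybar."""
    return N * sum(y) / len(y)


ratio_total = ratio_estimate_total(y, x, X_total=22919)   # known 1920 total
expansion_total = expansion_estimate_total(y, N=196)      # 196 cities in all

print(round(ratio_total), round(expansion_total))
```

With the correct 1930 total of 29,351, the ratio estimate is in error by roughly 950 and the simple expansion by roughly 4300.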


Figure 6.1 shows the ratio estimate and the estimate based on the sample mean per city for each of 200 simple random samples of size 49 drawn from this population. A very substantial improvement in precision from the ratio method is apparent.

6.3 Approximate variance and bias of the ratio estimate. The distribution of the ratio estimate has proved annoyingly intractable, because both $y$ and $x$ vary from sample to sample. The known theoretical results fall short of what we would like to know for practical applications. The principal results will be stated first without proof.
The ratio estimate is consistent (this is obvious). It is biased, except for some special types of population, although the bias is negligible in large samples. The limiting distribution of the ratio estimate, as $n$ becomes very large, is normal, subject to some mild restrictions on the type of population from which we are sampling. In samples of moderate size, the distribution shows a tendency to positive skewness, at least for the kinds of population for which the method is most often used. We do not possess exact formulas for the bias and the sampling variance of the estimate, but only approximations that are valid in large samples.
These results amount to saying that there is no difficulty if the sample is large enough so that (i) the ratio is nearly normally distributed, and (ii) the large-sample formula for its variance is valid. Deficiencies in the theory are (i) the lack of a well-substantiated rule to answer the question: When is the sample large enough?, and (ii) the lack of a serviceable method for estimating confidence limits for small samples. As a working rule, the large-sample results may be used if the sample size exceeds 30 and is also large enough so that the coefficients of variation of $\bar{x}$ and $\bar{y}$ are both less than 10 per cent. This rule is rather poorly documented as yet.

Theorem 6.1 With a simple random sample of size $n$, the variance of $\hat{Y}_R$, the ratio estimate of the population total $Y$, is approximately
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} (y_i - Rx_i)^2 \tag{6.2}$$
where $R = Y/X$ is the population ratio. The approximation assumes that $n$ is large.
Sketch of proof. The following discussion is not rigorous in that it does not justify the approximation made in the analysis. The error in the estimated population total is
$$\hat{Y}_R - Y = \frac{\bar{y}}{\bar{x}}\,X - Y = \frac{N\bar{X}}{\bar{x}}\,(\bar{y} - R\bar{x})$$
since $R = Y/X$.
If the sample is large, $\bar{x}$ should be close to $\bar{X}$. The approximation consists in replacing the factor $\bar{X}/\bar{x}$ by 1. This gives
$$\hat{Y}_R - Y \doteq N(\bar{y} - R\bar{x}) \tag{6.4}$$
Apart from the factor $N$, the approximate error of estimate in (6.4) is the mean of the sample values of the variate $u_i = y_i - Rx_i$. We now apply to the variate $u_i$ theorem 2.2 for the variance of the mean of a simple random sample. This gives
$$V(\hat{Y}_R) \doteq N^2 V(\bar{u}) = \frac{N(N-n)}{n}\,\frac{\sum_{i=1}^{N} (u_i - \bar{U})^2}{N-1}$$
where $\bar{U}$ is the population mean of the variate $u_i$. But
$$\bar{U} = \bar{Y} - R\bar{X} = 0$$
by the definition of $R$. Hence
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} (y_i - Rx_i)^2$$
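Corollary. There are various alternative forms of the result. Since $\bar{Y} = R\bar{X}$, we may write
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} \{(y_i - \bar{Y}) - R(x_i - \bar{X})\}^2$$
$$= \frac{N(N-n)}{n(N-1)} \Big\{ \sum (y_i - \bar{Y})^2 + R^2 \sum (x_i - \bar{X})^2 - 2R \sum (y_i - \bar{Y})(x_i - \bar{X}) \Big\}$$
Let us define the correlation coefficient $\rho$ between $y_i$ and $x_i$ in the finite population by the equation
$$\rho = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X})}{(N-1)\,S_y S_x}$$
This leads to the result
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n}\,(S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x) \tag{6.5}$$
An equivalent form is
$$V(\hat{Y}_R) \doteq \frac{(N-n)}{Nn}\,Y^2 \Big\{ \frac{S_y^2}{\bar{Y}^2} + \frac{S_x^2}{\bar{X}^2} - \frac{2S_{yx}}{\bar{Y}\bar{X}} \Big\} \tag{6.6}$$
where $S_{yx} = \rho S_y S_x$ is the covariance between $y_i$ and $x_i$. This relation may also be written as
$$V(\hat{Y}_R) \doteq \frac{(N-n)}{Nn}\,Y^2 \{ C_{yy} + C_{xx} - 2C_{yx} \} \tag{6.7}$$
where $C_{yy}$, $C_{xx}$ are the squares of the coefficients of variation (cv) of $y_i$ and $x_i$, respectively, and $C_{yx}$ is the analogous relative covariance.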
Note 1. As an estimate of the population ratio $R$, the ratio method uses the ratio of $\bar{y}$ to $\bar{x}$. Some readers may wonder whether the mean $\bar{r}$ of the ratios $r_i = y_i/x_i$ on the individual units would be a better estimate of $R$. Without going into details, this does not appear to be the case with simple random sampling, except for some special types of population (cf. section 6.7). In fact, with a finite population, the estimate $\bar{r}$ is not consistent, since $\bar{r}$ taken over all sampling units does not equal $R$. Moreover, $\bar{r}$ is more tedious to compute than $y/x$.

Note 2. The approximate formula for the variance of the ratio estimate can be expressed in terms of the amount of variation in $r_i$ from unit to unit. From equation (6.2) we have
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} x_i^2 \Big( \frac{y_i}{x_i} - R \Big)^2 = \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} x_i^2 (r_i - R)^2$$
The sum is a weighted sum of squares of the deviations of the $r_i$ from the population ratio $R$. If all $r_i$ are equal, the approximate variance vanishes, as it should, since the ratio estimate is then without error.
Note 3. For the approximate variance of the ratio $\hat{R} = y/x$, we divide the preceding formulas by $X^2$. Two forms of the result are as follows:
$$V(\hat{R}) \doteq \frac{N(N-n)}{n(N-1)X^2} \sum_{i=1}^{N} (y_i - Rx_i)^2 = \frac{(N-n)}{nN}\,R^2 \{ C_{yy} + C_{xx} - 2C_{yx} \} \tag{6.8}$$

Note 4. Bias. In finding the variance of the ratio estimate of $Y$ in theorem 6.1, the essential step was to introduce the approximation
$$\hat{Y}_R - Y = \frac{N\bar{X}}{\bar{x}}\,(\bar{y} - R\bar{x}) \doteq N(\bar{y} - R\bar{x}) \tag{6.9}$$
Since $E(\bar{y} - R\bar{x}) = 0$, it follows that, to the order of approximation used in the variance formula, the ratio estimate is unbiased. In order to find the leading term in the bias of $\hat{Y}_R$, we must take the approximation one stage further. This is done by writing
$$\frac{\bar{X}}{\bar{x}} = \frac{\bar{X}}{\bar{X}\Big(1 + \dfrac{\bar{x} - \bar{X}}{\bar{X}}\Big)} \doteq 1 - \frac{\bar{x} - \bar{X}}{\bar{X}}$$
retaining the first term in the Taylor series expansion. Substitution in (6.9) gives
$$\hat{Y}_R - Y \doteq N(\bar{y} - R\bar{x}) \Big( 1 - \frac{\bar{x} - \bar{X}}{\bar{X}} \Big)$$
Now
$$E\,\bar{x}(\bar{x} - \bar{X}) = E(\bar{x} - \bar{X})^2 = \frac{(N-n)}{Nn}\,S_x^2$$
Similarly,
$$E\,\bar{y}(\bar{x} - \bar{X}) = E(\bar{y} - \bar{Y})(\bar{x} - \bar{X}) = \frac{(N-n)}{Nn}\,\rho S_y S_x \tag{6.10}$$
by theorem 2.3 (p. 17) and the definition of $\rho$. Hence the leading term in the bias is
$$E(\hat{Y}_R - Y) \doteq \frac{(N-n)}{n\bar{X}}\,(RS_x^2 - \rho S_y S_x) \tag{6.11}$$
The bias may be either positive or negative. With increasing sample size, the bias diminishes as $1/n$, whereas the standard error of $\hat{Y}_R$ diminishes as $1/\sqrt{n}$. For any specific population a sample size exists beyond which the bias is negligible relative to the standard error.
Now, from (6.5),
$$\sigma(\hat{Y}_R) \doteq \sqrt{\frac{N(N-n)}{n}}\,\sqrt{S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x}$$
Hence the ratio of the bias to the standard error is approximately
$$\frac{\text{Bias}}{\sigma(\hat{Y}_R)} \doteq \Big\{ \sqrt{\frac{N-n}{N}}\,\frac{S_x}{\sqrt{n}\,\bar{X}} \Big\} \Big\{ \frac{RS_x - \rho S_y}{\sqrt{S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x}} \Big\}$$
The term inside the first bracket is the coefficient of variation of $\bar{x}$. The absolute value of the term inside the second bracket is at most unity (this is obvious on squaring). Hence
$$\frac{|\text{Bias}|}{\sigma(\hat{Y}_R)} \le \text{cv of } \bar{x} \tag{6.12}$$
If the sample is large enough so that the cv of $\bar{x}$ is less than 0.1, the bias is negligible relative to the standard error (section 1.5).

6.4 Estimated variance. From equation (6.2),
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n(N-1)} \sum_{i=1}^{N} (y_i - Rx_i)^2$$
As a sample estimate of
$$\frac{\sum_{i=1}^{N} (y_i - Rx_i)^2}{N-1}$$
we take
$$\frac{\sum_{i=1}^{n} (y_i - \hat{R}x_i)^2}{n-1}$$
This estimate can be shown to have a bias of order $1/n$: no method is available for obtaining an unbiased sample estimate.
For the estimated variance, $v(\hat{Y}_R)$, this gives
$$v(\hat{Y}_R) = \frac{N(N-n)}{n(n-1)} \sum_{i=1}^{n} (y_i - \hat{R}x_i)^2 = \frac{N(N-n)}{n(n-1)} \Big( \sum y_i^2 + \hat{R}^2 \sum x_i^2 - 2\hat{R} \sum y_i x_i \Big) \tag{6.13}$$
this being the form which is speediest to compute. Further algebraic development of this expression leads to sample analogues of expressions (6.6) and (6.7).
For the estimated variance of the ratio $\hat{R}$, we divide (6.13) by $X^2$. This gives
$$v(\hat{R}) = \frac{(N-n)}{Nn(n-1)\bar{X}^2} \Big( \sum y_i^2 + \hat{R}^2 \sum x_i^2 - 2\hat{R} \sum y_i x_i \Big) \tag{6.14}$$
Note that the sums of squares and products in (6.13) and (6.14) involve no correction for the sample mean. If $\bar{X}$ is not known, the sample estimate $\bar{x}$ is substituted in the denominator.
Example. This illustrates the calculation of the standard error of a ratio estimate of a population total. The data given previously in table 6.1 will be used. First calculate
$$y = \sum y_i = 6262; \qquad x = \sum x_i = 5054; \qquad \hat{R} = \frac{y}{x} = 1.239019$$
Formula (6.13) will be used:
$$v(\hat{Y}_R) = \frac{N(N-n)}{n(n-1)} \Big( \sum y_i^2 + \hat{R}^2 \sum x_i^2 - 2\hat{R} \sum y_i x_i \Big)$$
To compute the term inside the bracket, the sums of squares and products are placed on the same row as their multipliers, as follows:

                              Multiplier
  Σ y_i^2   = 1,527,882       1
  Σ x_i^2   = 1,044,504       1.535168 = R̂^2
  Σ y_i x_i = 1,251,630       2.478038 = 2R̂

Hence
$$v(\hat{Y}_R) = \frac{(196)(147)}{(49)(48)}\,(29{,}784) = 364{,}854$$
$$s(\hat{Y}_R) = 604$$
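The same computation can be sketched in Python from the sums quoted in the example. The function name and argument names are illustrative only. Carrying full precision in the ratio gives a variance a few units away from the hand-computed 364,854, which was obtained from the rounded bracket 29,784.

```python
import math


def ratio_total_variance(sum_y2, sum_x2, sum_yx, y_total, x_total, n, N):
    """Estimated variance of the ratio estimate of a total, formula (6.13)."""
    r_hat = y_total / x_total
    # Uncorrected sums of squares and products, as in the text.
    bracket = sum_y2 + r_hat**2 * sum_x2 - 2 * r_hat * sum_yx
    return N * (N - n) / (n * (n - 1)) * bracket


# Sums for the 49 cities of table 6.1, as quoted in the example.
v = ratio_total_variance(
    sum_y2=1_527_882, sum_x2=1_044_504, sum_yx=1_251_630,
    y_total=6262, x_total=5054, n=49, N=196,
)
standard_error = math.sqrt(v)
print(round(v), round(standard_error))
```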

6.5 Sample size. An estimate of the sample size required to attain a specified degree of precision is made as follows. The most convenient starting point is equation (6.8), which gives the variance of the estimated ratio $\hat{R}$. This equation may be rearranged as follows:
$$C_{\hat{R}\hat{R}} = \frac{V(\hat{R})}{R^2} = \frac{(N-n)}{Nn}\,[C_{yy} + C_{xx} - 2C_{yx}] \tag{6.8'}$$
where $C_{\hat{R}\hat{R}}$ denotes the square of the cv of $\hat{R}$. If we specify the desired value of this cv, and hence the value of $C_{\hat{R}\hat{R}}$, equation (6.8') may be solved for $n$. The first step is to ignore the fpc, $(N-n)/N$, giving as an approximation to $n$
$$n_0 = \frac{C_{yy} + C_{xx} - 2C_{yx}}{C_{\hat{R}\hat{R}}}$$
If at this stage the fpc is found to be necessary, the correct solution of (6.8') is obtained by putting
$$n = \frac{n_0}{1 + \dfrac{n_0}{N}}$$
With the ratio method, the cv's of $\hat{R}$, of the estimated population total $\hat{Y}_R$, and of the estimated population mean per unit are all equal. Hence the equations above apply to all three types of estimation. In order to use these results, we must estimate in advance the cv's of $y_i$ and $x_i$ and the correlation coefficient.
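A minimal Python sketch of this two-step calculation follows. The function name and the numerical inputs are hypothetical and serve only to show the steps: first n_0 without the fpc, then the correction when N is supplied.

```python
import math


def ratio_sample_size(c_yy, c_xx, c_yx, target_cv, N=None):
    """Sample size for a ratio estimate with a specified cv.

    c_yy, c_xx : squared cv's of y_i and x_i (advance estimates)
    c_yx       : relative covariance, rho * cv_y * cv_x
    target_cv  : desired cv of the estimate
    N          : population size; if None the fpc is ignored
    """
    n0 = (c_yy + c_xx - 2 * c_yx) / target_cv**2
    if N is None:
        return n0
    return n0 / (1 + n0 / N)


# Hypothetical advance estimates: cv's of 1.0 for both variates, rho = 0.9.
n0 = ratio_sample_size(1.0, 1.0, 0.9, target_cv=0.05)
n = ratio_sample_size(1.0, 1.0, 0.9, target_cv=0.05, N=400)
print(math.ceil(n0), math.ceil(n))
```

In practice the result would be rounded up to the next whole unit.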

6.6 Confidence limits. If the sample size is large enough so that the normal approximation applies, confidence limits for $Y$ and $R$ may be obtained as follows:
$$Y:\ \hat{Y}_R \pm t\sqrt{v(\hat{Y}_R)} \tag{6.15}$$
$$R:\ \hat{R} \pm t\sqrt{v(\hat{R})} \tag{6.16}$$
where $t$ is the normal deviate corresponding to the chosen confidence probability.
In section 6.3 it was suggested that the normal approximation applies reasonably well if the sample size is at least 30 and is large enough so that the cv's of $\bar{y}$ and $\bar{x}$ are both less than 0.1. When these conditions do not apply, the formula for $v(\hat{R})$ tends to give values that are too low and the positive skewness in the distribution of $\hat{R}$ may become noticeable.
An alternative method of computing confidence limits has been used in biological assay (Fieller, 1932; Paulson, 1942). This approach requires fewer assumptions than the normal approximation and takes some account of the skewness of the distribution of $\hat{R}$.
The method requires that $\bar{y}$ and $\bar{x}$ follow a bivariate normal distribution, so that $(\bar{y} - R\bar{x})$ is normally distributed. It follows that the quantity
$$\frac{\bar{y} - R\bar{x}}{\sqrt{\dfrac{N-n}{Nn}}\,\sqrt{s_y^2 + R^2 s_x^2 - 2R s_{yx}}} \tag{6.17}$$
is approximately normally distributed with mean zero and unit standard deviation. (We have substituted sample estimates $s_y^2$, etc., for the corresponding population variances and covariance, and are assuming the sample size large enough so that this introduces negligible error. In biological assay, where samples may be quite small, the quantity above would be regarded as following Student's t-distribution.)
The value of $R$ is unknown, but any contemplated value of $R$ which makes this normal deviate large enough may be regarded as rejected by the sample data. Consequently, confidence limits for $R$ are found by setting (6.17) equal to $\pm t$, and solving the resulting quadratic equation for $R$. The confidence limits are approximate, because if we try to check them by sampling repeatedly from a fixed population with known $R$, some values of $\bar{y}$ and $\bar{x}$ turn up for which the two roots of the quadratic are imaginary. Such cases become rare if the cv's of $\bar{y}$ and $\bar{x}$ are less than 0.3.
After some manipulation, the two roots may be expressed as
$$R = \hat{R}\,\frac{(1 - t^2 c_{\bar{y}\bar{x}}) \pm t\sqrt{(c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}) - t^2 (c_{\bar{y}\bar{y}} c_{\bar{x}\bar{x}} - c_{\bar{y}\bar{x}}^2)}}{1 - t^2 c_{\bar{x}\bar{x}}} \tag{6.18}$$
where
$$c_{\bar{y}\bar{y}} = \frac{(N-n)}{Nn}\,\frac{s_y^2}{\bar{y}^2}$$
is the square of the estimated cv of $\bar{y}$, with analogous definitions of $c_{\bar{x}\bar{x}}$ and $c_{\bar{y}\bar{x}}$. If $t^2 c_{\bar{y}\bar{y}}$, $t^2 c_{\bar{x}\bar{x}}$, and $t^2 c_{\bar{y}\bar{x}}$ are all small relative to 1, the limits reduce to
$$R = \hat{R} \pm t\hat{R}\sqrt{c_{\bar{y}\bar{y}} + c_{\bar{x}\bar{x}} - 2c_{\bar{y}\bar{x}}}$$
This expression is the same as the normal approximation (6.16).
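The quadratic limits (6.18) and their normal-approximation counterpart can be compared with a small Python sketch. The function names and the input values are hypothetical. Returning `None` when the roots are imaginary, or when the denominator is not positive, is a choice made for this sketch and not a rule from the text.

```python
import math


def quadratic_limits(r_hat, c_yy, c_xx, c_yx, t=1.96):
    """Roots of the quadratic in R from setting (6.17) equal to +/- t, as in (6.18).

    c_yy, c_xx are squared estimated cv's of ybar and xbar; c_yx is the
    analogous relative covariance.
    """
    disc = (c_yy + c_xx - 2 * c_yx) - t**2 * (c_yy * c_xx - c_yx**2)
    denom = 1 - t**2 * c_xx
    if disc < 0 or denom <= 0:
        return None  # imaginary roots, or no finite interval (sketch's convention)
    half = t * math.sqrt(disc)
    centre = 1 - t**2 * c_yx
    return (r_hat * (centre - half) / denom, r_hat * (centre + half) / denom)


def normal_limits(r_hat, c_yy, c_xx, c_yx, t=1.96):
    """Limiting form of (6.18) when the t^2 * c terms are negligible."""
    half = t * r_hat * math.sqrt(c_yy + c_xx - 2 * c_yx)
    return (r_hat - half, r_hat + half)


# Hypothetical sample quantities.
q_low, q_high = quadratic_limits(1.2, c_yy=0.004, c_xx=0.003, c_yx=0.0025)
n_low, n_high = normal_limits(1.2, c_yy=0.004, c_xx=0.003, c_yx=0.0025)
print(q_low, q_high, n_low, n_high)
```

With cv's of this size the two pairs of limits differ only in the third significant figure, and the quadratic limits sit slightly higher, reflecting the positive skewness.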
Quadratic limits for $Y$ are found by replacing $\hat{R}$ in equation (6.18) by $\hat{Y}_R$.
The quadratic limits should be always at least as good approximations as the normal limits, since they depend on fewer assumptions. They are slower to compute, however, and are not a complete solution to the problem because in sampling from skew populations the distributions of $\bar{y}$ and $\bar{x}$ may themselves be skew.

6.7 Comparison of the ratio estimate with the mean per unit. The type of estimate of $Y$ which was studied in previous chapters is $N\bar{y}$, where $\bar{y}$ is the mean per unit for the sample (in simple random sampling) or a weighted mean per unit (in stratified random sampling). Estimates of this kind will be called estimates based on the mean per unit or estimates obtained by simple expansion.
Theorem 6.2 In large samples, with simple random sampling, the ratio estimate $\hat{Y}_R$ has a smaller variance than the estimate $\hat{Y} = N\bar{y}$ obtained by simple expansion, if
$$\rho > \frac{1}{2}\,\frac{(S_x/\bar{X})}{(S_y/\bar{Y})} = \frac{\text{Coefficient of variation of } x_i}{2\,(\text{Coefficient of variation of } y_i)}$$
Proof: For $\hat{Y}$ we have
$$V(\hat{Y}) = \frac{N(N-n)}{n}\,S_y^2$$
For the ratio estimate we have from (6.5)
$$V(\hat{Y}_R) \doteq \frac{N(N-n)}{n}\,[S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x]$$
Hence the ratio estimate has the smaller variance if
$$S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x < S_y^2$$
i.e. if
$$\rho > \frac{RS_x}{2S_y} = \frac{1}{2}\,\frac{(S_x/\bar{X})}{(S_y/\bar{Y})}$$
This theorem shows that the ratio estimate may be either more or less precise than a simple expansion. The issue depends on the size of the correlation coefficient between $y_i$ and $x_i$, and on the cv's of the two variates. The variability of the auxiliary variate $x_i$ is an important factor: if its cv is more than twice that of $y_i$, the ratio estimate is always less precise, since $\rho$ cannot exceed 1. When $x_i$ is the value of $y_i$ at some previous time, the two cv's may be about equal. In this event the ratio estimate is superior if $\rho$ exceeds 0.5.
Theorem 6.2 applies only for samples large enough so that the approximate formula for $V(\hat{Y}_R)$ is valid. In smaller samples the ratio method probably does not compare as favorably as the theorem suggests, since the approximate formula is usually an underestimate.
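The condition of theorem 6.2 can be expressed as a short Python sketch. Dividing the bracket in (6.5) by S_y^2 shows that the large-sample ratio of the two variances depends only on rho and on k = (cv of x_i)/(cv of y_i), namely 1 + k^2 - 2*rho*k. The function names and the numerical inputs below are hypothetical.

```python
def variance_ratio(rho, cv_x, cv_y):
    """Large-sample V(ratio estimate) / V(simple expansion), from (6.5)."""
    k = cv_x / cv_y
    return 1 + k * k - 2 * rho * k


def ratio_is_more_precise(rho, cv_x, cv_y):
    """Condition of theorem 6.2: rho > cv_x / (2 * cv_y)."""
    return rho > cv_x / (2 * cv_y)


# Equal cv's: the break-even correlation is 0.5.
print(variance_ratio(0.5, 1.0, 1.0), ratio_is_more_precise(0.5, 1.0, 1.0))
# Equal cv's, high correlation: the ratio estimate gains.
print(variance_ratio(0.8, 1.0, 1.0), ratio_is_more_precise(0.8, 1.0, 1.0))
# cv of x more than twice that of y: the ratio estimate loses even if rho = 1.
print(variance_ratio(1.0, 2.5, 1.0), ratio_is_more_precise(1.0, 2.5, 1.0))
```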

6.8 Conditions under which the ratio estimate is optimum. A well-known result in the theory of regression indicates the type of population for which the ratio estimate is the best among a wide class of estimates. The theorem applies only to infinite populations.
Theorem 6.3 With simple random sampling from an infinite population, the ratio estimate of $\bar{Y}$ is a "best linear unbiased estimate" if two conditions are satisfied:
(i) the relation between $y_i$ and $x_i$ is a straight line through the origin, and
(ii) the variance of $y_i$ about this line is proportional to $x_i$.
A "best linear unbiased estimate" is defined as follows. Consider all estimates that are linear functions of the sample values $y_i$, i.e. that are of the form
$$l_1 y_1 + l_2 y_2 + \cdots + l_n y_n$$
where the $l$'s do not depend on the $y_i$, although they may be functions of the $x_i$. The choice of the $l$'s is restricted to those that give unbiased estimates of $\bar{Y}$. The estimate that has the smallest variance is called the "best linear unbiased estimate."
Proof: The mathematical model is
$$y_i = Bx_i + e_i$$
where the $e_i$ are independent of the $x_i$. In arrays in which $x_i$ is fixed, $e_i$ has mean zero and variance $\lambda x_i$. Hence
$$\bar{Y} = B\bar{X}$$
It was shown by Gauss that the best linear unbiased estimate of $B\bar{X}$ is $b\bar{X}$, where $b$ is the least squares estimate of $B$ (see e.g. David and Neyman, 1938). The least squares estimate is
$$b = \frac{\sum w_i y_i x_i}{\sum w_i x_i^2}$$
where
$$w_i = \frac{1}{\sigma_{e_i}^2} = \frac{1}{\lambda x_i}$$
This gives
$$b = \frac{\sum y_i}{\sum x_i} = \frac{\bar{y}}{\bar{x}}$$
Consequently, the optimum estimate of $\bar{Y}$ is the ratio estimate $(\bar{y}/\bar{x})\bar{X}$.


The practical relevance of this result is that it suggests the conditions under which the ratio estimate is not only superior to the mean per unit, but is the best of a whole class of estimates. When we are trying to decide what kind of estimate to use, a graph in which $y_i$ is plotted against $x_i$ is helpful. If this graph shows that the relation is a straight line through the origin, and if the variance of the points $y_i$ about the line seems to increase proportionally to $x_i$, the ratio estimate will be hard to beat.
Sometimes the relation between $y_i$ and $x_i$ is a straight line through the origin, but the variance of $y_i$ in arrays in which $x_i$ is fixed is not proportional to $x_i$. In a population sample of Greece, Jessen et al. (1947) found that the variance increased roughly as $x_i^2$. This suggests a weighted regression in which $w_i \propto 1/x_i^2$. For the least squares estimate $b$, this gives
$$b = \frac{\sum w_i y_i x_i}{\sum w_i x_i^2} = \frac{1}{n} \sum \Big( \frac{y_i}{x_i} \Big)$$
In this situation the best estimate of $Y$ is $bX$, where $b$ is the mean of the ratios $y_i/x_i$ on the individual sampling units.
Under the conditions of theorem 6.3, the ratio estimate is unbiased. This result does not contradict an earlier statement (section 6.3) that the ratio estimate is in general biased. The ratio estimate is unbiased, for any size of simple random sample, if the population is infinite and the relation between $y_i$ and $x_i$ is a straight line through the origin. The proof is left as an exercise to the reader.
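The effect of the two weighting schemes discussed in this section can be seen in a brief Python sketch. It computes the weighted least squares slope of a line through the origin and checks that weights proportional to 1/x_i give the ratio of sums, while weights proportional to 1/x_i^2 give the mean of the individual ratios. The data and the function name are hypothetical.

```python
def wls_slope_through_origin(y, x, weights):
    """Weighted least squares estimate b = sum(w*y*x) / sum(w*x^2)."""
    numerator = sum(w * yi * xi for w, yi, xi in zip(weights, y, x))
    denominator = sum(w * xi * xi for w, xi in zip(weights, x))
    return numerator / denominator


# Hypothetical sample values.
x = [2.0, 4.0, 5.0, 10.0]
y = [1.0, 3.0, 2.0, 6.0]

# Variance proportional to x_i: weights 1/x_i, giving b = sum(y)/sum(x).
b_ratio = wls_slope_through_origin(y, x, [1 / xi for xi in x])

# Variance proportional to x_i^2: weights 1/x_i^2, giving b = mean of y_i/x_i.
b_mean_of_ratios = wls_slope_through_origin(y, x, [1 / xi**2 for xi in x])

print(b_ratio, b_mean_of_ratios)
```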

6.9 The ratio estimate in sampling for proportions. The ratio method plays an important role in the estimation of proportions. With simple random sampling, the usual formula for the variance of an estimated proportion $p$ is
$$V(p) = \frac{PQ}{n}$$
where $P$ is the population proportion. [The factor $(N-n)/(N-1)$ is inserted if the fpc is needed.]
As was pointed out in section 3.2, this formula is valid only if the sample is a simple random sample of units, each of which is classified into the two classes from which the proportion is derived. For instance, if the proportion of diseased plants in a wheat field is estimated by sampling, this formula applies if a simple random sample of individual plants has been taken, each plant being classified as diseased or healthy. It is unlikely that this method of sampling would be used. A more typical unit would be a compact area, say 1 ft long by 3 rows wide, all plants in each sampled area being classified.
In such situations the sampling unit consists of a group or cluster of smaller units, which we may call elements. Let the $i$th unit contain $x_i$ elements. Each element is assigned to one of two classes, $C$ or $C'$. Let $y_i$ be the number of elements in the $i$th unit which lie in $C$. The proportion of units in $C$ in the population is
$$P = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} x_i} = \frac{Y}{X}$$
The sample estimate of this proportion is
$$p = \frac{\sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i} = \frac{y}{x}$$
Structurally, this is a typical ratio estimate. Hence, the variance of $p$ is obtained by the formula appropriate to the ratio method.
Two equivalent forms for the approximate variance are
$$V(p) \doteq \frac{N(N-n)}{n(N-1)X^2} \sum_{i=1}^{N} (y_i - Px_i)^2 = \frac{(N-n)}{nN(N-1)\bar{X}^2} \sum_{i=1}^{N} x_i^2 (p_i - P)^2$$
where $p_i = y_i/x_i$ and $\bar{X} = X/N$ is the average number of elements in the cluster unit. For the estimated variance, we have
$$v(p) = \frac{(N-n)}{Nn(n-1)\bar{x}^2} \Big[ \sum y_i^2 + p^2 \sum x_i^2 - 2p \sum y_i x_i \Big] \tag{6.19}$$
Example. A simple random sample of 30 households was drawn from a census taken in 1947 in wards 6 and 7 of the Eastern Health District of Baltimore. The population contains about 15,000 households. In table 6.2 the persons in each household are classified (i) as to whether they had consulted a doctor in the past 12 months, (ii) as to sex.

TABLE 6.2 DATA FOR A SIMPLE RANDOM SAMPLE OF 30 HOUSEHOLDS

                                          Doctor seen in last year
  Household   No. of    No. of   No. of
     No.      persons   males    females      Yes        No
                x_i      y_i                  y_i

      1          5        1        4           5          0
      2          6        3        3           0          6
      3          3        1        2           2          1
      4          3        1        2           3          0
      5          2        1        1           0          2
      6          3        1        2           0          3
      7          3        1        2           0          3
      8          3        1        2           0          3
      9          4        2        2           0          4
     10          4        3        1           0          4
     11          3        2        1           0          3
     12          2        1        1           0          2
     13          7        3        4           0          7
     14          4        3        1           4          0
     15          3        2        1           1          2
     16          5        3        2           2          3
     17          4        3        1           0          4
     18          4        3        1           0          4
     19          3        2        1           1          2
     20          3        1        2           3          0
     21          4        1        3           2          2
     22          3        2        1           0          3
     23          3        2        1           0          3
     24          1        0        1           0          1
     25          2        1        1           2          0
     26          4        3        1           2          2
     27          3        1        2           0          3
     28          4        2        2           2          2
     29          2        1        1           0          2
     30          4        2        2           1          3

  Totals       104       53       51          30         74

Our purpose is to contrast the ratio formula with the inappropriate binomial formula. Consider first the proportion of people who had consulted a doctor. For the binomial formula, we would take
$$n = 104; \qquad p = \tfrac{30}{104} = 0.2885$$
Hence
$$v_{bin}(p) = \frac{pq}{n} = \frac{(0.2885)(0.7115)}{104} = 0.00197$$
For the ratio formula, we note that there are 30 groups and take:
$n = 30$
$x_i$ = Total number in $i$th household
$y_i$ = Number in $i$th household who had seen a doctor
$p$ = 0.2885, as before
$\bar{x} = \tfrac{104}{30} = 3.4667$
$\sum y_i^2 = 86$; $\sum x_i^2 = 404$; $\sum y_i x_i = 113$
The fpc may be ignored. Hence, from (6.19),
$$v(p) = \frac{(86) + (0.2885)^2 (404) - 2(0.2885)(113)}{(30)(29)(3.4667)^2} = 0.00520$$
The variance given by the ratio method, 0.00520, is much larger than that given by the binomial formula, 0.00197. This happens because, for various reasons, families differ in the frequency with which their members consult a doctor. For the sample as a whole, the proportion who consult a doctor is only a little over 1 in 4, but there are several families in which every member has seen a doctor. Similar results would be obtained for any characteristic in which the members of the same family tend to act in the same way.
In estimating the proportion of males in the population, the results are different. By the same type of calculation as above, we find:
Binomial formula: $v(p) = 0.00240$
Ratio formula: $v(p) = 0.00114$
Here the binomial formula overestimates the variance. The reason is interesting. Most households are set up as a result of a marriage, and hence contain at least 1 male and 1 female. Consequently the proportion of males per family varies less from ½ than would be expected from the binomial formula. None of the 30 families, except one with only 1 member, is composed entirely of males, or entirely of females. If the binomial distribution were applicable, with a true $P$ of approximately ½, households with all members of the same sex would constitute one-quarter of the households of size 3 and one-eighth of the households of size 4. This property of the sex ratio has been discussed by Hansen and Hurwitz (1942).
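Both comparisons can be reproduced with the following Python sketch, which applies the binomial formula and formula (6.19), with the fpc ignored, to the household counts of table 6.2. Entries that are illegible in the table are filled so that the column totals (104, 53, 30) and the quoted sums of squares and products are matched. The function names are illustrative.

```python
# Table 6.2: persons per household, and numbers of males and of persons
# who had seen a doctor, for the 30 sampled households.
persons = [5, 6, 3, 3, 2, 3, 3, 3, 4, 4, 3, 2, 7, 4, 3,
           5, 4, 4, 3, 3, 4, 3, 3, 1, 2, 4, 3, 4, 2, 4]
males = [1, 3, 1, 1, 1, 1, 1, 1, 2, 3, 2, 1, 3, 3, 2,
         3, 3, 3, 2, 1, 1, 2, 2, 0, 1, 3, 1, 2, 1, 2]
doctor = [5, 0, 2, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 1,
          2, 0, 0, 1, 3, 2, 0, 0, 0, 2, 2, 0, 2, 0, 1]


def binomial_variance(y, x):
    """pq/n, treating every person as an independent draw (inappropriate here)."""
    total = sum(x)
    p = sum(y) / total
    return p * (1 - p) / total


def cluster_ratio_variance(y, x):
    """Formula (6.19) with the fpc ignored; households are the sampling units."""
    n = len(x)
    p = sum(y) / sum(x)
    x_bar = sum(x) / n
    sum_sq = sum((yi - p * xi) ** 2 for yi, xi in zip(y, x))
    return sum_sq / (n * (n - 1) * x_bar**2)


v_doctor_bin = binomial_variance(doctor, persons)
v_doctor_ratio = cluster_ratio_variance(doctor, persons)
v_males_bin = binomial_variance(males, persons)
v_males_ratio = cluster_ratio_variance(males, persons)
print(v_doctor_bin, v_doctor_ratio, v_males_bin, v_males_ratio)
```

The binomial formula understates the variance for doctor visits and overstates it for the sex ratio, as described in the example.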

6.10 The approach to normality of the distribution of the ratio. The result that the limiting distribution of the ratio $\bar{y}/\bar{x}$ in large samples is normal comes from standard theorems in probability. We shall quote, as a lemma, a result of wide generality given by Cramér (1946). This result, which assumes an infinite population, provides the limiting distribution of any known function of any two central moments $m_1$, $m_2$, say, calculated from a sample of pairs of values $y_i$, $x_i$. By a central moment of the sample, we mean a quantity of the form
$$m = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^u (x_i - \bar{x})^w$$
where $u$ and $w$ are positive integers. The corresponding moment for the whole population is
$$M = E(y_i - \bar{Y})^u (x_i - \bar{X})^w$$
averaged over all units in the population.
Lemma (Cramér). If in some neighborhood of the point $m_1 = M_1$, $m_2 = M_2$, the function $H(m_1, m_2)$ is continuous and has continuous derivatives of the first and second orders with respect to $m_1$ and $m_2$, then the limiting distribution of the random variable $H(m_1, m_2)$, as $n$ becomes large, is normal with
$$\text{Mean} = H(M_1, M_2)$$
$$\text{Variance} = \sigma_{m_1}^2 \Big( \frac{\partial H}{\partial m_1} \Big)^2 + 2\sigma_{m_1 m_2} \Big( \frac{\partial H}{\partial m_1} \Big) \Big( \frac{\partial H}{\partial m_2} \Big) + \sigma_{m_2}^2 \Big( \frac{\partial H}{\partial m_2} \Big)^2$$
where the partial derivatives are computed at the point $m_1 = M_1$, $m_2 = M_2$.
Theorem 6.4 If the population variances $S_y^2$, $S_x^2$ are finite, and if the population mean $\bar{X} \ne 0$, the limiting distribution of $\bar{y}/\bar{x}$, in random samples of size $n$ from an infinite population, is normal with
$$\text{Mean} = \frac{\bar{Y}}{\bar{X}} = R; \qquad \text{Variance} = \frac{R^2}{n} \{ C_{yy} + C_{xx} - 2C_{yx} \}$$
Proof: In order to apply the lemma we take
$$m_1 = \bar{y}, \qquad m_2 = \bar{x}, \qquad H = \frac{m_1}{m_2} = \frac{\bar{y}}{\bar{x}}$$
Consequently $M_1 = \bar{Y}$, $M_2 = \bar{X}$.
The function $H$ is continuous and has continuous partial derivatives of the first and second orders in some neighborhood of the point $\bar{y} = \bar{Y}$, $\bar{x} = \bar{X}$, provided that $\bar{X} \ne 0$. Further, at this point,
$$\frac{\partial H}{\partial m_1} = \frac{1}{M_2} = \frac{1}{\bar{X}}; \qquad \frac{\partial H}{\partial m_2} = -\frac{M_1}{M_2^2} = -\frac{\bar{Y}}{\bar{X}^2}$$
Hence, from the lemma, the limiting distribution of $\bar{y}/\bar{x}$ is normal, with
$$\text{Mean} = \frac{\bar{Y}}{\bar{X}} = R; \qquad \text{Variance} = \frac{R^2}{n} \{ C_{yy} + C_{xx} - 2C_{yx} \}$$
The result is the same as that stated previously in equation (6.8), apart from the fpc. So far as practical applications are concerned, the generality of the result is pleasing, the only restrictions being that $y_i$ and $x_i$ have finite variances and that $\bar{X} \ne 0$. Theorems given by Madow (1948) enable the result to be extended to finite populations, subject to further mild restrictions on the nature of the population. This extension will not be discussed here.

6.11 Ratio estimates in stratified random sampling. There are several ways in which a ratio estimate of the population total $Y$ can be made. One is to make a separate ratio estimate of the total of each stratum and add these totals. If $y_h$, $x_h$ are the sample totals in the $h$th stratum and $X_h$ is the stratum total of the $x_{hi}$, this estimate $\hat{Y}_{Rs}$ ($s$ for separate) is
$$\hat{Y}_{Rs} = \sum_h \frac{y_h}{x_h}\,X_h = \sum_h \frac{\bar{y}_h}{\bar{x}_h}\,X_h \tag{6.20}$$
No assumption is made that the true ratio remains constant from stratum to stratum: the estimate requires, however, a knowledge of the separate totals $X_h$.
Theorem 6.5 If the sample sizes $n_h$ are large in all strata,
$$V(\hat{Y}_{Rs}) \doteq \sum_h \frac{N_h(N_h - n_h)}{n_h}\,[S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h \rho_h S_{yh} S_{xh}] \tag{6.21}$$
where $R_h = Y_h/X_h$ is the true ratio in stratum $h$, and $\rho_h$ is defined as before in each stratum.
Proof: Write
$$\hat{Y}_{Rh} = \frac{y_h}{x_h}\,X_h$$
Then
$$(\hat{Y}_{Rs} - Y) = \sum_h (\hat{Y}_{Rh} - Y_h)$$
Hence
$$V(\hat{Y}_{Rs}) \doteq E(\hat{Y}_{Rs} - Y)^2 = \sum_h E(\hat{Y}_{Rh} - Y_h)^2 + 2 \sum_h \sum_{j>h} E(\hat{Y}_{Rh} - Y_h)(\hat{Y}_{Rj} - Y_j)$$
Since $\hat{Y}_{Rh}$ is the ratio estimate made from a simple random sample within stratum $h$, we may use formula (6.5) for the approximate variance of $\hat{Y}_{Rh}$, i.e.
$$V(\hat{Y}_{Rh}) \doteq \frac{N_h(N_h - n_h)}{n_h}\,[S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h \rho_h S_{yh} S_{xh}]$$
The cross-product terms vanish, because the sampling is independent in the different strata and, to the order of approximation used in the variance formula, $\hat{Y}_{Rh}$ is an unbiased estimate of $Y_h$. Result (6.21) follows.
This formula is valid only if the sample in each stratum is large enough so that the approximate variance formula applies to each stratum. This limitation should be noted in practical applications. We do not possess a trustworthy variance formula for $\hat{Y}_{Rs}$ when the $n_h$ are small.
Moreover, when the $n_h$ are small and $L$ is large, the bias in $\hat{Y}_{Rs}$ may not be negligible relative to its standard error, as the following crude argument suggests.
In a single stratum we have seen (section 6.3) that
$$\frac{|\text{Bias in } \hat{Y}_{Rh}|}{\sigma(\hat{Y}_{Rh})} \le \text{cv of } \bar{x}_h$$
If the bias has the same sign in all strata, as may happen, the bias in $\hat{Y}_{Rs}$ will be roughly $L$ times that in $\hat{Y}_{Rh}$. But the standard error of $\hat{Y}_{Rs}$ is only of the order of $\sqrt{L}$ times that of $\hat{Y}_{Rh}$. Hence the ratio
$$\frac{|\text{Bias in } \hat{Y}_{Rs}|}{\sigma(\hat{Y}_{Rs})}$$
might be as large as $\sqrt{L}$ (cv of $\bar{x}_h$).
For example, with 50 strata and the cv of $\bar{x}_h$ about 0.1 in each stratum, the bias in $\hat{Y}_{Rs}$ might be as large as 0.7 times its standard error.
In the present state of our knowledge, $\hat{Y}_{Rs}$ is to be avoided unless $\sqrt{L}$ (cv of $\bar{x}_h$) appears to be less than 0.2. This rule is probably too conservative, because in practice the bias may be much smaller than its upper bound, particularly if within each stratum the relation between $y_{hi}$ and $x_{hi}$ is approximately a straight line through the origin.

6.12 The combined ratio estimate. An alternative estimate is derived from a single combined ratio (Hansen, Hurwitz, and Gurney, 1946). From the sample data we compute
$$\hat{Y}_{st} = \sum_h N_h \bar{y}_h; \qquad \hat{X}_{st} = \sum_h N_h \bar{x}_h$$
These are the standard estimates of the population totals $Y$ and $X$, respectively, made from a stratified sample. The combined ratio estimate, $\hat{Y}_{Rc}$ ($c$ for combined) is
$$\hat{Y}_{Rc} = \frac{\hat{Y}_{st}}{\hat{X}_{st}}\,X = \frac{\bar{y}_{st}}{\bar{x}_{st}}\,X$$
where $\bar{y}_{st} = \hat{Y}_{st}/N$, $\bar{x}_{st} = \hat{X}_{st}/N$ are the estimated population means from a stratified sample.
The estimate $\hat{Y}_{Rc}$ does not require a knowledge of the $X_h$, but only of $X$.
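The difference between the separate and the combined estimate can be made concrete with a small Python sketch. The stratum data below are hypothetical, and the dictionary keys and function names are illustrative only. Note that the separate estimate uses each stratum total X_h, whereas the combined estimate needs only their sum X.

```python
def separate_ratio_total(strata):
    """Sum over strata of (y_h / x_h) * X_h, as in (6.20)."""
    return sum(sum(s["y"]) / sum(s["x"]) * s["X_h"] for s in strata)


def combined_ratio_total(strata, X_total):
    """(Y_st / X_st) * X, with Y_st and X_st the stratified estimates of the totals."""
    y_st = sum(s["N_h"] * sum(s["y"]) / len(s["y"]) for s in strata)
    x_st = sum(s["N_h"] * sum(s["x"]) / len(s["x"]) for s in strata)
    return y_st / x_st * X_total


# Hypothetical two-stratum sample; the stratum ratios differ (0.5 and 1.0).
strata = [
    {"N_h": 100, "X_h": 1100.0, "x": [8.0, 10.0, 12.0], "y": [4.0, 5.0, 6.0]},
    {"N_h": 50, "X_h": 1900.0, "x": [30.0, 40.0, 50.0], "y": [30.0, 40.0, 50.0]},
]
X_total = sum(s["X_h"] for s in strata)

y_rs = separate_ratio_total(strata)
y_rc = combined_ratio_total(strata, X_total)
print(y_rs, y_rc)
```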

Theorem 6.6 If the total sample size $n$ is large,
$$V(\hat{Y}_{Rc}) \doteq \sum_h \frac{N_h(N_h - n_h)}{n_h}\,[S_{yh}^2 + R^2 S_{xh}^2 - 2R\rho_h S_{yh} S_{xh}] \tag{6.22}$$
Proof: This follows the same argument as theorem 6.1. In the present case the key equation (6.4) appears as
$$(\hat{Y}_{Rc} - Y) \doteq N(\bar{y}_{st} - R\bar{x}_{st}) \tag{6.23}$$
Now consider the variate $u_{hi} = y_{hi} - Rx_{hi}$. The right side of equation (6.23) is $N\bar{u}_{st}$, where $\bar{u}_{st}$ is the weighted mean of the variate $u_{hi}$ in a stratified sample. Further, the population mean $\bar{U}$ of $u_{hi}$ is zero, since $R = \bar{Y}/\bar{X}$.
Hence we may apply to $\bar{u}_{st}$ theorem 5.3 for the variance of the estimated mean from a stratified random sample. This gives
$$V(\hat{Y}_{Rc}) \doteq N^2 V(\bar{u}_{st}) = \sum_h \frac{N_h(N_h - n_h)}{n_h}\,S_{uh}^2$$
where
$$S_{uh}^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} (u_{hi} - \bar{U}_h)^2 = \frac{1}{N_h - 1} \sum_{i=1}^{N_h} [(y_{hi} - \bar{Y}_h) - R(x_{hi} - \bar{X}_h)]^2$$
When the quadratic is expanded, result (6.22) is obtained.
From equations (6.21) and (6.22) it is interesting to note that the approximate variances of $\hat{Y}_{Rs}$ and $\hat{Y}_{Rc}$ assume the same general form, the only difference being that the population ratios $R_h$ in the individual strata in (6.21) are all replaced by $R$ in (6.22).
Comparison of the two estimates. We may write
$$V(\hat{Y}_{Rc}) - V(\hat{Y}_{Rs}) = \sum_h \frac{N_h(N_h - n_h)}{n_h}\,[(R - R_h)^2 S_{xh}^2 + 2(R - R_h)(R_h S_{xh}^2 - \rho_h S_{yh} S_{xh})]$$
In situations in which the ratio estimate is appropriate, the last term on the right is usually small. (It vanishes if within each stratum the relation between $y_{hi}$ and $x_{hi}$ is a straight line through the origin.) Thus, unless $R_h$ is constant from stratum to stratum, the use of a separate ratio estimate in each stratum is likely to be more precise. This discussion assumes, however, that the sample in each stratum is large enough so that the approximate formula for $V(\hat{Y}_{Rs})$ is valid. With only a small sample in each stratum, the combined estimate is to be recommended unless there is good empirical evidence to the contrary.
For sample estimates of these variances we substitute sample estimates of $R_h$ and $R$ in the appropriate places. The sample mean squares $s_{yh}^2$ and $s_{xh}^2$ are substituted for the corresponding variances, and the sample covariance for the term $\rho_h S_{yh} S_{xh}$. The sample mean square and covariance must be calculated separately for each stratum.
Example. The data come from a census of all farms in Jefferson County, Iowa. In this example $y_{hi}$ represents acres in corn, and $x_{hi}$ acres in the farm. The population is divided into two strata, the first stratum containing farms of size up to 160 acres. We assume a sample of 100 farms. When stratified sampling is used, we shall suppose that 70 farms are taken from stratum 1 and 30 from stratum 2, this being roughly the optimum allocation. The data are given in table 6.3.

TABLE 6.3 DATA FROM JEFFERSON COUNTY, IOWA

  Strata               Size           N_h    S_yh^2   S_yxh   S_xh^2    R_h
                       (farm acres)

  1                    0-160          1580     312      494     2055   0.2350
  2                    Over 160        430     922      858     7357   0.2109
  For complete pop.                   2010     620     1453     7619   0.2242

  Strata               Ybar_h   Xbar_h   n_h   Q_h = W_h^2/n_h   V_h'   V_h''

  1                     19.40    82.56    70      0.008828        193    194
  2                     51.63   244.85    30      0.001525        887    907
  For complete pop.     26.30   117.28   100

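The last three quantities, $Q_h$, $V_h'$, and $V_h''$, are auxiliary quantities to be used in the computations, the last two being defined later.
We consider five methods of estimating the population mean corn acres per farm. The fpc will be ignored.
i. Simple random sample: mean per farm estimate.
$$V_1 = \frac{S_y^2}{n} = \frac{620}{100} = 6.20$$
ii. Simple random sample: ratio estimate.
$$V_2 = \frac{1}{n}\,[S_y^2 + R^2 S_x^2 - 2RS_{yx}] = \frac{1}{100}\,[620 + (0.2242)^2 (7619) - 2(0.2242)(1453)] = 3.51$$
iii. Stratified random sample: mean per farm estimate.
$$V_3 = \sum_h \frac{W_h^2 S_{yh}^2}{n_h} = \sum_h Q_h S_{yh}^2 = (0.008828)(312) + (0.001525)(922) = 4.16$$
iv. Stratified random sample: ratio estimate using a separate ratio in each stratum.
$$V_4 = \sum_h Q_h V_h' = (0.008828)(193) + (0.001525)(887) = 3.06, \qquad V_h' = S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h S_{yxh}$$
v. Stratified random sampling: ratio estimate using a combined ratio.
$$V_5 = \sum_h Q_h V_h'' = (0.008828)(194) + (0.001525)(907) = 3.10, \qquad V_h'' = S_{yh}^2 + R^2 S_{xh}^2 - 2R S_{yxh}$$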
The relative precisions of the various methods can be summarized
as follows:

                             Method of           Relative
Sampling method              estimation          precision
(i)   Simple random          Mean per farm       100
(ii)  Simple random          Ratio               177
(iii) Stratified random      Mean per farm       149
(iv)  Stratified random      Separate ratio      203
(v)   Stratified random      Combined ratio      200
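
The five variances behind this summary can be recomputed from the figures in table 6.3. The following is a minimal Python sketch and not part of the original text: the variable names are illustrative, the inputs are the table values as printed (including $R_2 = 0.2109$), and the fpc is ignored as in the example. Because the sketch carries unrounded intermediate values, its results may differ from the rounded figures in the text by a unit in the last place.

```python
# Recompute the variances of the five estimates of mean corn acres per farm
# from the summary figures in table 6.3 (fpc ignored, as in the text).

n = 100                                              # total sample size
S2y, Syx, S2x, R = 620.0, 1453.0, 7619.0, 0.2242     # whole population

# Per-stratum figures: (N_h, n_h, S_yh^2, S_yxh, S_xh^2, R_h)
strata = [
    (1580, 70, 312.0, 494.0, 2055.0, 0.2350),
    (430, 30, 922.0, 858.0, 7357.0, 0.2109),
]
N = sum(s[0] for s in strata)


def ratio_ms(s2y, syx, s2x, r):
    """S_y^2 + r^2 S_x^2 - 2 r S_yx, the mean square of y - r x."""
    return s2y + r * r * s2x - 2.0 * r * syx


variances = {
    "i": S2y / n,
    "ii": ratio_ms(S2y, Syx, S2x, R) / n,
    "iii": 0.0,
    "iv": 0.0,
    "v": 0.0,
}
for N_h, n_h, s2y, syx, s2x, r_h in strata:
    Q_h = (N_h / N) ** 2 / n_h                              # Q_h = W_h^2 / n_h
    variances["iii"] += Q_h * s2y
    variances["iv"] += Q_h * ratio_ms(s2y, syx, s2x, r_h)   # Q_h * V_h'
    variances["v"] += Q_h * ratio_ms(s2y, syx, s2x, R)      # Q_h * V_h''

for method, v in variances.items():
    # variance and precision relative to method (i)
    print(method, round(v, 2), round(100 * variances["i"] / v))
```

Working from the raw mean squares rather than the rounded $V_h'$ and $V_h''$ keeps the sketch independent of the auxiliary columns of the table.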

The results bring out an interesting point of wide application.
Stratification by size of farm accomplishes the same general purpose as
a ratio estimate in which the denominator is farm size. Both devices
diminish the effect of variations in farm size on the sampling error of
the estimated mean corn acres per farm. For instance, the gain in
precision from a ratio estimate is 77 per cent when simple random
sampling is used, but is only 36 per cent (203 against 149) when
stratified sampling is used.
In the design of samples there may be a choice whether to introduce
some factor into the stratification, or to utilize it in the method of
estimation. The best decision depends on the circumstances. Relevant
points are: (i) some factors, e.g. geographical location, are more
easily introduced into the stratification than into the method of
estimation; (ii) the issue depends on the nature of the relation between
$y_i$ and $x_i$. All simple methods of estimation work most effectively
with a linear relation. With a complex or discontinuous relation,
stratification may be more effective, since, if there are enough strata,
stratification will eliminate the effects of almost any kind of relation
between $y_i$ and $x_i$.
6.13 Optimum allocation with a ratio estimate. The optimum
allocation of the $n_h$ may be different when a ratio estimate is used from
that when a mean per unit is used. Consider first the variate $\hat{Y}_{Rs}$.
From theorem 6.5 its variance is
\[ V(\hat{Y}_{Rs}) = \sum_h \frac{N_h(N_h - n_h)}{n_h} S_{dh}^2 \tag{6.24} \]
where $d_{hi} = y_{hi} - R_h x_{hi}$ is the deviation of $y_{hi}$ from $R_h x_{hi}$, and
$S_{dh}^2$ is the mean square of the $d_{hi}$ in stratum $h$. By the
methods given in chapter 5 for finding optimum allocation, it follows
that (6.24) is minimized subject to a total cost of the form $\sum c_h n_h$,
when
\[ n_h \propto \frac{N_h S_{dh}}{\sqrt{c_h}} \]
With a mean per unit it will be recalled that for minimum variance
$n_h$ is chosen proportional to $N_h S_{yh}/\sqrt{c_h}$.
In the planning of a sample, the allocation with a ratio estimate
may appear a little perplexing, because it seems difficult to speculate
about the likely values of $S_{dh}$. Two rules are helpful. With a
population in which the ratio estimate is a best linear unbiased estimate, $S_{dh}$
will be roughly proportional to $\sqrt{\bar{X}_h}$ (by theorem 6.3). In this case
the $n_h$ should be proportional to $N_h\sqrt{\bar{X}_h}/\sqrt{c_h}$. Sometimes the
variance of $d_{hi}$ may be more nearly proportional to $\bar{X}_h^2$. This leads to
the allocation of $n_h$ proportional to $N_h\bar{X}_h/\sqrt{c_h}$, i.e. to the stratum
total of $x_{hi}$, divided by the square root of the cost per unit. An
example of this type is discussed by Hansen, Hurwitz, and Gurney (1946),
for a sample designed to estimate sales of retail stores.
If the estimate $\hat{Y}_{Rc}$ is to be used, the same general argument applies.
Example. The different methods of allocation can be compared
from data collected in a complete enumeration of 257 commercial
peach orchards in North Carolina in June 1946 (Finkner, 1950). The
purpose of this survey was to determine the most efficient sampling
procedure for estimating commercial peach production in this area.
Information was obtained on the number of peach trees and the
estimated total peach production in each orchard. The high correlation
between these two variables suggested the use of a ratio estimate.
One very large orchard was omitted.
For this illustration, the area is divided geographically into three
strata. The number of peach trees in an orchard is denoted by $x_{hi}$,
and the estimated production in bushels of peaches by $y_{hi}$. Only the
first ratio estimate $\hat{Y}_{Rs}$ (based on a separate ratio in each stratum)
will be considered, since the principle is the same for both types of
stratified ratio estimate.
Four different methods of allocation will be compared: (i) $n_h$
proportional to $N_h$, (ii) $n_h$ proportional to $N_h S_{yh}$, (iii) $n_h$ proportional to
$N_h\sqrt{\bar{X}_h}$, and (iv) $n_h$ proportional to $N_h\bar{X}_h$. A sample size of
100 will be considered. The data needed for these comparisons are
summarized in table 6.4.

TABLE 6.4 DATA FROM THE NORTH CAROLINA PEACH SURVEY

Strata   $S_{xh}^2$   $S_{yxh}$   $S_{yh}^2$   $S_{xh}$   $S_{yh}$   $\bar{X}_h$   $\bar{Y}_h$   $R_h$     $S_{dh}^2$

1        5,186        6,462       8,699        72.01      93.27      53.80         69.48         1.29133     658
2        2,367        3,100       4,614        48.65      67.93      31.07         43.64         1.40475     573
3        4,877        4,817       7,311        69.83      85.51      56.97         66.39         1.16547   2,706
Pop.     3,898        4,434       6,409        62.43      80.06      44.46         56.47         1.27063   1,433

Strata   $N_h$   (i)   $N_h S_{yh}$   (ii)   $\sqrt{\bar{X}_h}$   $N_h\sqrt{\bar{X}_h}$   (iii)   $N_h\bar{X}_h$   (iv)

1         47      18    4,384          22     7.33                  344.5                   20      2,529            22
2        118      46    8,016          40     5.57                  657.3                   39      3,666            32
3         91      36    7,781          38     7.56                  687.1                   41      5,184            46
Pop.     256     100   20,181         100    20.46                1,688.9                  100     11,379           100

The upper part of the table shows the basic data. The method
employed to calculate the four variances was first to find the $n_h$ for
each type of allocation. These values are shown in the columns headed
(i)-(iv) in the lower part of the table. Thus, with allocation i,
$n_h = nN_h/N$, so that in the first stratum
\[ n_1 = \frac{(100)(47)}{256} = 18 \]
When the $n_h$ have been obtained, the corresponding $V(\hat{Y}_{Rs})$ is
found by substituting in the formula
\[ V(\hat{Y}_{Rs}) = \sum_h \frac{N_h(N_h - n_h)}{n_h} S_{dh}^2 \]
where
\[ S_{dh}^2 = S_{yh}^2 + R_h^2 S_{xh}^2 - 2R_h S_{yxh} \]
The quantities $S_{dh}^2$ are the same for all four allocations and are given
on the extreme right of the top half of table 6.4.
The variances and relative precisions are shown in table 6.5.

TABLE 6.5 COMPARISON OF FOUR METHODS OF ALLOCATION

Method of allocation:                       Variances
$n_h$ proportional to          Stratum 1   Stratum 2   Stratum 3   Total     Relative precision

(i)   $N_h$                    49,824      105,833     376,215     531,872   100
(ii)  $N_h S_{yh}$             35,144      131,847     343,446     510,437   104
(iii) $N_h\sqrt{\bar{X}_h}$    41,750      136,964     300,312     479,026   111
(iv)  $N_h\bar{X}_h$           35,144      181,710     240,888     457,742   116

There is not much to choose among the different allocations, as
would be expected since the $n_h$ do not differ greatly in the four methods.
Method iv, in which allocation is proportional to the total number of
peach trees in the stratum, appears a trifle superior to the others.
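
The allocations and variances of this example can be reproduced in a few lines. The sketch below is an illustration only and not part of the original text. Its inputs are the stratum figures of table 6.4, with $S_{d2}^2$ taken as 573, the value implied by the stratum 2 variances in table 6.5. The largest-remainder rounding used to make the $n_h$ sum to 100 is an assumption, since the text does not say how the $n_h$ were rounded.

```python
# Compare four allocations of n = 100 orchards among three strata,
# using the summary figures of table 6.4.
import math

n = 100
N_h = [47, 118, 91]                 # orchards per stratum
S_yh = [93.27, 67.93, 85.51]        # within-stratum s.d. of production
Xbar_h = [53.80, 31.07, 56.97]      # mean number of trees per orchard
S2_dh = [658.0, 573.0, 2706.0]      # S_dh^2; 573 is the value implied by table 6.5


def allocate(weights, n):
    """Split n in proportion to weights, rounding by largest remainders.

    The rounding rule is an assumption made for this sketch.
    """
    total = sum(weights)
    shares = [n * w / total for w in weights]
    alloc = [math.floor(s) for s in shares]
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in by_remainder[: n - sum(alloc)]:
        alloc[i] += 1
    return alloc


def variance_of_total(alloc):
    """V(Y_Rs) = sum over strata of N_h (N_h - n_h) S_dh^2 / n_h."""
    return sum(N * (N - m) * s2 / m for N, m, s2 in zip(N_h, alloc, S2_dh))


weights = {
    "i": [float(N) for N in N_h],
    "ii": [N * s for N, s in zip(N_h, S_yh)],
    "iii": [N * math.sqrt(x) for N, x in zip(N_h, Xbar_h)],
    "iv": [N * x for N, x in zip(N_h, Xbar_h)],
}
allocations = {k: allocate(w, n) for k, w in weights.items()}
totals = {k: variance_of_total(a) for k, a in allocations.items()}
precision = {k: round(100 * totals["i"] / v) for k, v in totals.items()}

for k in weights:
    print(k, allocations[k], round(totals[k]), precision[k])
```

The variance totals differ slightly from those printed in table 6.5 because the table was evidently computed with rounded intermediate values; the relative precisions agree.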

6.14 Exercises.
6.1 In a field of barley the grain, $y_i$, and the grain plus straw, $x_i$, were
weighed for each of a large number of sampling units located at random over
the field. The total produce (grain plus straw) of the whole field was also
weighed. The following data were obtained:

$c_{yy} = 1.13$, $c_{yx} = 0.78$, $c_{xx} = 1.11$

Compute the gain in precision obtained by estimating the grain yield of the
field from the ratio of grain to total produce instead of from the mean yield
of grain per unit.

It requires 20 min to cut, thresh, and weigh the grain on each unit, 2 min
to weigh the straw on each unit, and 2 hr to collect and weigh the total produce
of the field. How many units must be taken per field in order that the ratio
estimate be more economical than the mean per unit?
6.2 For the data in table 6.1, $\hat{Y}_R = 28,367$ and

$c_{yy} = 0.0142068$, $c_{yx} = 0.0146541$, $c_{xx} = 0.0156830$

Compute the 95 per cent quadratic confidence limits for $Y$ and compare them
with the limits found by the normal approximation.
6.3 For the sample of 30 households in table 6.2, the following data refer
to visits to the dentist in the last year:

                Dentist seen                    Dentist seen
No. of                            No. of
persons         Yes     No        persons         Yes     No
5 1 . 5

6 0 . 0
8
8
I
2
6
2
1
""
3
1
1
a
2
2 0 2 a 0 3
a 0 3 4 I a
3 I 2 3 0 3
3 2 3 1 2
1 0 1
"" 0 ..32 1
2
. 0
0
2
3 1
2
7
0
2
2
5
3
.. 1
1
"3
2

.. 3 2 0
..2
1
3 0 3 0
"
Estimate the variance of the proportion of persons who saw a dentist, and
compare this with the binomial estimate of the variance.
6.4 The following data are for a small artificial population with $N = 8$
and two strata of equal size:

   Stratum 1              Stratum 2
$x_{1i}$   $y_{1i}$    $x_{2i}$   $y_{2i}$
   2          0          10          7
   5          3          18         15
   9          7          21         10
  15         10          25         16

For a stratified random sample with $n_1 = n_2 = 2$, compare the variances of
$\hat{Y}_{Rs}$ and $\hat{Y}_{Rc}$, by working out the results for all possible samples. To what
extent is the difference in variances due to biases in the estimates?
6.15 References.
CRAMÉR, H. (1946). Mathematical methods of statistics. Princeton University
Press, p. 366.
DAVID, F. N., and NEYMAN, J. (1938). Extension of the Markoff theorem of least
squares. Stat. Res. Mem., 2, 105.
FIELLER, E. C. (1932). The distribution of the index in a normal bivariate
population. Biometrika, 24, 428-440.
FINKNER, A. L. (1950). Methods of sampling for estimating commercial peach
production in North Carolina. North Carolina Agr. Exp. Sta. Tech. Bull. 91.
HANSEN, M. H., and HURWITZ, W. N. (1942). Relative efficiencies of various
sampling units in population inquiries. Jour. Amer. Stat. Assoc., 37, 89-94.
HANSEN, M. H., HURWITZ, W. N., and GURNEY, M. (1946). Problems and methods
of a sample survey of business. Jour. Amer. Stat. Assoc., 41, 173-189.
JESSEN, R. J., et al. (1947). On a population sample for Greece. Jour. Amer. Stat.
Assoc., 42, 357-384.
MADOW, W. G. (1948). On the limiting distributions of estimates based on samples
from finite universes. Ann. Math. Stat., 19, 535-545.
PAULSON, E. (1942). A note on the estimation of some mean values for a bivariate
distribution. Ann. Math. Stat., 13, 440-444.
CHAPTER 7

REGRESSION ESTIMATES

7.1 The linear regression estimate. Like the ratio estimate, the linear
regression estimate is designed to increase precision by the use of an
auxiliary variate $x_i$ which is correlated with $y_i$. When the relation
between $y_i$ and $x_i$ is examined, it may be found that, although the
relation is approximately linear, the line does not go through the
origin. This suggests an estimate based on the linear regression of $y_i$
on $x_i$ rather than on the ratio of the two variables.
We suppose that $y_i$ and $x_i$ are each obtained for every unit in the
sample, and that the population mean $\bar{X}$ of the $x_i$ is known. The
sample regression of $y_i$ on $x_i$ is computed. For the present we assume
that the least squares regression coefficient $b$ is used, where
\[ b = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \]
The linear regression estimate of $\bar{Y}$, the population mean of the $y_i$, is
\[ \bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) \tag{7.1} \]
where the subscript lr denotes linear regression. The rationale of this
estimate is that, if $\bar{x}$ is below average, we should expect $\bar{y}$ also to be
below average by an amount $b(\bar{X} - \bar{x})$, because of the regression of
$y_i$ on $x_i$. For an estimate of the population total $Y$, we take
$\hat{Y}_{lr} = N\bar{y}_{lr}$.
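
As an illustration of equation (7.1), the following minimal Python sketch computes the least squares $b$ and the regression estimate of the mean. It is not part of the original text; the function name and the sample values are hypothetical, chosen so that the points lie exactly on the line $y = 1 + 2x$.

```python
# Linear regression estimate of a population mean, equation (7.1).

def regression_estimate(y, x, X_bar):
    """Return (y_lr, b) for sample values y, x and known population mean X_bar."""
    n = len(y)
    y_bar = sum(y) / n
    x_bar = sum(x) / n
    sxy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx                       # least squares slope
    return y_bar + b * (X_bar - x_bar), b


# Hypothetical sample lying exactly on the line y = 1 + 2x.
y_lr, b = regression_estimate(y=[3.0, 5.0, 7.0, 9.0],
                              x=[1.0, 2.0, 3.0, 4.0],
                              X_bar=3.0)
print(y_lr, b)
```

Here the sample mean of $x$ (2.5) falls below the population mean (3.0), so the estimate is adjusted upward from the sample mean of $y$ by $b(\bar{X} - \bar{x})$.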
Watson (1937) used a regression of leaf area on leaf weight to
estimate the average area of the leaves on a plant. The procedure was
to weigh all the leaves on the plant. For a small sample of leaves, the
area and the weight of each leaf were determined. The sample mean
leaf area was then adjusted by means of the regression on leaf weight.
The point of the application is, of course, that the weight of a leaf
can be found quickly, but determination of its area is more
time-consuming.
This example illustrates a general situation in which regression
estimates are potentially helpful. Suppose that we can make a rapid
estimate $x_i$ of some characteristic for every unit, and can also, by some
more costly method, determine the correct value $y_i$ of the
characteristic for a simple random sample of the units. For instance, a rat
expert might make a quick eye estimate of the number of rats in each
block in a city area, and then determine, by trapping, the actual
number of rats in each of a simple random sample of the blocks. In another
application described by Yates (1949), an eye estimate of the volume
of timber was made on each of a number of ~acre plots, and the
actual timber volume was measured for a sample of the plots. The
regression estimate
\[ \bar{y} + b(\bar{X} - \bar{x}) \]
adjusts the sample mean of the actual measurements by the
regression of the actual measurements on the rapid estimates. The rapid
estimates need not be free from bias. If $x_i - y_i = D$, so that the
rapid estimate is perfect except for a constant bias $D$, it may be
verified that $b = 1$ and the regression estimate becomes
\[ \bar{y} + (\bar{X} - \bar{x}) = \bar{X} + (\bar{y} - \bar{x}) \]
= (Pop. mean of rapid estimate) + (Adjustment for bias)
Our knowledge of the properties of the regression estimate is of
about the same scope as our knowledge for the ratio estimate. The
regression estimate is consistent, although this is in the trivial sense
that, when the sample comprises the whole population, $\bar{x} = \bar{X}$, and
the regression estimate makes no adjustment. As will be shown, the
estimate is in general biased, but the ratio of the bias to the standard
error becomes small when the sample is large. We possess a
large-sample formula for the variance of the estimate, but more information
is needed about the distribution of the estimate in small samples and
about the value of $n$ required for the practical use of large-sample
results.

7.2 Large-sample theory. The theory of linear regression plays a
prominent part in most courses in elementary statistics. The
standard results of this theory are not entirely suitable for sample surveys,
because they require the restrictive assumptions that the population
regression of $y_i$ on $x_i$ is linear, and that the residual variance of $y_i$
about the regression line is constant. If these two assumptions are
violently wrong, a linear regression estimate will probably not be
precise, and an estimate based on a curvilinear regression or a weighted
linear regression is preferable. There are situations, however, in
which we doubt whether the gain in precision from these more
elaborate methods would be worth the labor, and there are others in which,
although we have reason to believe that the regression is linear, we
do not have good evidence that it actually is.
Consequently we first present a theory which does not assume that
the regression is linear in the population, and which gives results that
hold only in large samples. This theory is analogous to the
large-sample theory for the ratio estimate.
The finite population linear regression coefficient, $B$, is defined by
the relation
\[ B = \frac{\sum_{i=1}^{N} (y_i - \bar{Y})(x_i - \bar{X})}{\sum_{i=1}^{N} (x_i - \bar{X})^2} \tag{7.2} \]
The residual variate, $e_i$, is defined by the relation
\[ y_i = \bar{Y} + B(x_i - \bar{X}) + e_i \tag{7.3} \]
Adding (7.3) over all units in the population, we find
\[ \sum_{i=1}^{N} e_i = 0 \tag{7.4} \]
Note that no linear relation between $y_i$ and $x_i$ is assumed. The
population consists of a set of $N$ pairs of values $(y_i, x_i)$, and the
apparent linear regression in equation (7.3) has been constructed by our
definitions of $B$ and the $e_i$.

Theorem 7.1 For a simple random sample, with $n$ large enough so
that sampling errors in the sample regression coefficient $b$ can be
ignored,
\[ V(\bar{y}_{lr}) = \frac{(N - n)}{N}\,\frac{S_e^2}{n} \tag{7.5} \]
where
\[ S_e^2 = \frac{\sum_{i=1}^{N} e_i^2}{N - 1} \]
Proof: By its definition (7.1)
\[ \bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x}) \tag{7.1} \]
But from (7.3), averaged over the units in the sample,
\[ \bar{y} = \bar{Y} + B(\bar{x} - \bar{X}) + \bar{e} \]
Substitute into (7.1). This gives
\[ \bar{y}_{lr} = \bar{Y} + (b - B)(\bar{X} - \bar{x}) + \bar{e} \tag{7.6} \]
Hence, if the sampling error $(b - B)$ can be ignored,
\[ \bar{y}_{lr} - \bar{Y} = \bar{e} \]
Thus
\[ V(\bar{y}_{lr}) = E(\bar{y}_{lr} - \bar{Y})^2 = E(\bar{e}^2) \]
But, since the population mean of the $e_i$ is zero by equation (7.4), $E(\bar{e}^2)$ is the variance of
the mean of the quantities $e_i$ in a simple random sample. Hence, by
theorem 2.3,
\[ V(\bar{y}_{lr}) = \frac{(N - n)}{N}\,\frac{S_e^2}{n} \]
Corollary. If the correlation coefficient $\rho$ between $y_i$ and $x_i$ in the
finite population is defined by the relation
\[ \rho^2 = B^2\,\frac{\sum (x_i - \bar{X})^2}{\sum (y_i - \bar{Y})^2} \tag{7.7} \]
where the sums extend over all units in the population, then
\[ V(\bar{y}_{lr}) = \frac{(N - n)}{Nn}\,S_y^2(1 - \rho^2) \tag{7.8} \]
Proof: From equation (7.3), summing over the population,
\[ \sum e_i^2 = \sum \{(y_i - \bar{Y}) - B(x_i - \bar{X})\}^2 \]
\[ = \sum (y_i - \bar{Y})^2 - 2B\sum (y_i - \bar{Y})(x_i - \bar{X}) + B^2\sum (x_i - \bar{X})^2 \]
\[ = \sum (y_i - \bar{Y})^2 - B^2\sum (x_i - \bar{X})^2 \]
by the definition of $B$, equation (7.2);
\[ = \sum (y_i - \bar{Y})^2 (1 - \rho^2) \]
from the definition of $\rho$, equation (7.7).
Hence
\[ S_e^2 = S_y^2(1 - \rho^2) \]
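
The identities used in this corollary are easy to check numerically. The sketch below is illustrative only and not part of the original text; the population of $N = 8$ pairs is made up for the purpose. It computes $B$ and the residuals $e_i$ from their definitions and confirms that the residuals sum to zero and that $\sum e_i^2 = \sum (y_i - \bar{Y})^2(1 - \rho^2)$.

```python
# Check the finite-population identities of section 7.2 on a small
# made-up population of N = 8 pairs (y_i, x_i); the data are hypothetical.

y = [0.0, 3.0, 7.0, 10.0, 7.0, 15.0, 10.0, 16.0]
x = [2.0, 5.0, 9.0, 15.0, 10.0, 18.0, 21.0, 25.0]
N = len(y)
Y_bar, X_bar = sum(y) / N, sum(x) / N

syy = sum((yi - Y_bar) ** 2 for yi in y)
sxx = sum((xi - X_bar) ** 2 for xi in x)
sxy = sum((yi - Y_bar) * (xi - X_bar) for yi, xi in zip(y, x))

B = sxy / sxx                                    # equation (7.2)
e = [yi - Y_bar - B * (xi - X_bar) for yi, xi in zip(y, x)]   # equation (7.3)
rho2 = sxy ** 2 / (sxx * syy)                    # squared correlation, as in (7.7)

sum_e = sum(e)                                   # vanishes, equation (7.4)
sum_e2 = sum(ei ** 2 for ei in e)                # equals syy * (1 - rho2)
print(sum_e, sum_e2, syy * (1.0 - rho2))
```

No linearity of the relation between $y_i$ and $x_i$ is needed for these identities; they follow from the definitions alone, which is the point made in the text.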
Theorem 7.2 For a simple random sample, with $n$ large enough so
that sampling errors in the sample regression coefficient are negligible,
an unbiased estimate of $V(\bar{y}_{lr})$ from the sample is
\[ v(\bar{y}_{lr}) = \frac{(N - n)}{Nn(n - 1)} \sum_{i=1}^{n} \{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2 \tag{7.9} \]
Proof: From theorem 2.4, an unbiased estimate of $S_e^2$ from the
sample is
\[ s_e^2 = \frac{1}{(n - 1)} \sum_{i=1}^{n} (e_i - \bar{e})^2 \]
Now, from equation (7.3),
\[ e_i - \bar{e} = (y_i - \bar{y}) - B(x_i - \bar{x}) \]
\[ = \{(y_i - \bar{y}) - b(x_i - \bar{x})\} + (b - B)(x_i - \bar{x}) \]
If sampling errors in $b$ are negligible, the last term vanishes, and
\[ \sum_{i=1}^{n} (e_i - \bar{e})^2 = \sum_{i=1}^{n} \{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2 \]
Hence, for $v(\bar{y}_{lr})$ defined as in equation (7.9),
\[ E\,v(\bar{y}_{lr}) = \frac{(N - n)}{Nn}\,S_e^2 = V(\bar{y}_{lr}) \]
by equation (7.5).
Theorems 7.1 and 7.2 do not specify how the sample regression
coefficient $b$ is to be computed. If $b$ is the least squares regression
coefficient, the sum of squares in $v(\bar{y}_{lr})$ is most quickly calculated in the
form
\[ \sum (y_i - \bar{y})^2 - \frac{\left[\sum (y_i - \bar{y})(x_i - \bar{x})\right]^2}{\sum (x_i - \bar{x})^2} \]
If $b$ is not the least squares regression coefficient, the preceding
formula does not hold. The sum of squares can be computed as
\[ \sum (y_i - \bar{y})^2 - 2b\sum (y_i - \bar{y})(x_i - \bar{x}) + b^2\sum (x_i - \bar{x})^2 \]
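
A minimal Python sketch of the variance estimate (7.9) follows, covering both ways of computing the sum of squares. It is not part of the original text; the function name, the sample values, and the population size $N = 50$ are hypothetical.

```python
# Large-sample variance estimate of the regression estimate, equation (7.9).

def regression_variance_estimate(y, x, N, b=None):
    """Return v(y_lr) for a simple random sample of size n from N units.

    If b is None the least squares slope is used; otherwise the supplied
    slope is used and the residual sum of squares is expanded term by term.
    """
    n = len(y)
    y_bar, x_bar = sum(y) / n, sum(x) / n
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((yi - y_bar) * (xi - x_bar) for yi, xi in zip(y, x))
    if b is None:
        rss = syy - sxy ** 2 / sxx               # shortcut, least squares b only
    else:
        rss = syy - 2.0 * b * sxy + b * b * sxx  # valid for any slope b
    return (N - n) * rss / (N * n * (n - 1))


# Hypothetical sample of n = 5 units from a population of N = 50.
y = [12.0, 15.0, 11.0, 20.0, 17.0]
x = [5.0, 7.0, 4.0, 10.0, 8.0]
v_ls = regression_variance_estimate(y, x, N=50)
print(v_ls)
```

Passing any other slope through the `b` argument gives a larger residual sum of squares, since the least squares $b$ minimizes it.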
7.3 Elementary theory. The preceding theory leaves unanswered the
question: How large must the sample be? A complete answer, valid
for any population, is not yet known, but some information is
obtained by examining the results of standard regression theory. In
this the population is assumed infinite, and the relation between $y_i$
and $x_i$ is
\[ y_{ij} = \bar{Y} + B(x_i - \bar{X}) + e_{ij} \tag{7.3$'$} \]
Formally, this is the same as equation (7.3), except that an extra
subscript $j$ has been added as a reminder that in standard regression theory
there is a whole frequency distribution or array of values of $y_{ij}$ and
$e_{ij}$ for each value of $x_i$. The theory assumes simple random sampling,
and further that, in every array in which $x_i$ is fixed,
\[ E(e_{ij}) = 0: \quad E(e_{ij}^2) = S_e^2 = \text{Constant} \]
From this model, by the same analysis as in theorem 7.1, we obtain
\[ \bar{y}_{lr} - \bar{Y} = \bar{e} + (b - B)(\bar{X} - \bar{x}) \tag{7.10} \]
Now, if $b$ is the least squares regression coefficient,
\[ b = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \]
where the extra subscript $j$ has been dropped. Substitution for $y_i$
and $\bar{y}$ from equation (7.3$'$) gives
\[ b = B + \frac{\sum e_i(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \tag{7.11} \]
Hence
\[ b - B = \frac{\sum e_i(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \tag{7.12} \]
In repeated samples in which the $x_i$ remain fixed from sample to
sample, it is easy to verify that the covariance of $\bar{e}$ and $(b - B)$ is zero,
and that
\[ V(b) = E(b - B)^2 = \frac{S_e^2}{\sum (x_i - \bar{x})^2} \]
Hence, from (7.10),
\[ V(\bar{y}_{lr}) = E(\bar{y}_{lr} - \bar{Y})^2 = \frac{S_e^2}{n} + (\bar{X} - \bar{x})^2 V(b) \]
\[ = S_e^2 \left\{ \frac{1}{n} + \frac{(\bar{x} - \bar{X})^2}{\sum (x_i - \bar{x})^2} \right\} \tag{7.13} \]

This is a standard result in elementary regression theory. It is exact
for any size of sample, subject to the assumptions stated previously.*
* Since the variance of $\bar{y}_{lr}$ applies to repeated samples in which the values of
the x's remain fixed, this result is another instance of the use of conditional
distributions, which helps to simplify the mathematical analysis.
Under these assumptions, $\bar{y}_{lr}$ is an unbiased estimate of $\bar{Y}$. This
can be shown by considering repeated samples in which the $x_i$ remain
fixed from sample to sample. Since, by hypothesis, $E(e_{ij}) = 0$ in such
samples, it follows that $E(\bar{e}) = 0$ and from equation (7.12) that
$E(b - B) = 0$. Referring now to equation (7.10), we see that $\bar{y}_{lr}$ is
unbiased in repeated samples in which the $x_i$ remain fixed, and hence
is unbiased in repeated simple random sampling.
These results lead to an alternative form of theorem 7.1.
Theorem 7.1a Under the assumptions stated at the beginning of
this section, $\bar{y}_{lr}$ is an unbiased estimate of $\bar{Y}$, with variance as given
by equation (7.13).
The term involving the $x_i$ in equation (7.13) is the contribution
from the sampling error of $b$. The average value of this term in
simple random samples of size $n$ depends on the shape of the frequency
distribution of the $x_i$. If this distribution is normal, it may be shown
that
\[ E\left[ \frac{(\bar{x} - \bar{X})^2}{\sum (x_i - \bar{x})^2} \right] = \frac{1}{n(n - 3)} \]
When the $x_i$ are not normally distributed, the average of the term in
the $x_i$ may be expanded in a series of inverse powers of $n$ by Fisher's
method of cumulants (1928). The leading term in the series is found
by replacing $\sum (x_i - \bar{x})^2$ by $(n - 1)S_x^2$ as an approximation. This
gives
\[ E\left[ \frac{(\bar{x} - \bar{X})^2}{(n - 1)S_x^2} \right] = \frac{S_x^2}{n(n - 1)S_x^2} = \frac{1}{n(n - 1)} \doteq \frac{1}{n^2} \]
to this order of approximation.
Hence, to terms of order $1/n^2$,
\[ E[V(\bar{y}_{lr})] = \frac{S_e^2}{n}\left(1 + \frac{1}{n}\right) \tag{7.14} \]
This result indicates that if $n$ exceeds 50, the contribution of sampling
errors in the least squares $b$ is negligible.
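
To see how quickly this contribution dies away, the proportional increase in the variance can be tabulated for a few sample sizes. The sketch below is illustrative and not part of the original text; the function names are hypothetical. It uses the leading-term factor $1 + 1/n$ of (7.14) alongside the factor $1 + 1/(n - 3)$ obtained by inserting the normal-theory expectation $1/n(n - 3)$ into (7.13).

```python
# Proportional increase in V(y_lr) due to sampling error in the least
# squares b: 1/n to leading order (7.14), 1/(n - 3) when the x_i are
# normally distributed.

def inflation_leading(n):
    """Factor (1 + 1/n) multiplying S_e^2 / n in equation (7.14)."""
    return 1.0 + 1.0 / n


def inflation_normal(n):
    """Factor (1 + 1/(n - 3)) from the normal-theory expectation."""
    if n <= 3:
        raise ValueError("the normal-theory result needs n > 3")
    return 1.0 + 1.0 / (n - 3)


for n in (10, 25, 50, 100):
    print(n, round(inflation_leading(n), 4), round(inflation_normal(n), 4))
```

At $n = 50$ the increase is about 2 per cent on either formula, which is the sense in which the contribution is described as negligible.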
This result is subject to the assumption that the population
regression is linear. When this regression is non-linear, the contribution
of the sampling error of $b$ to $V(\bar{y}_{lr})$ can be expanded in a series of
inverse powers of $n$. The leading term is of order $1/n^2$, as in equation
(7.14), but the numerator is a function of certain moments of the
joint distribution of $e_i$ and $x_i$ (Cochran, 1942).
To complete the elementary theory, another standard result which
is valid for any size of sample is that
\[ s_{y.x}^2 = \frac{1}{(n - 2)} \sum_{i=1}^{n} \{(y_i - \bar{y}) - b(x_i - \bar{x})\}^2 \tag{7.15} \]
is an unbiased estimate of $S_e^2$. This differs from the large-sample
result given in theorem 7.2 only in that the divisor is $(n - 2)$ here, as
against $(n - 1)$ in theorem 7.2; and that no fpc is included.

7.4 Bias of the regression estimate. If the relation between $y_i$ and
$x_i$ is non-linear, $\bar{y}_{lr}$ is subject to a bias of order $1/n$. We shall resume
the finite population model of section 7.2. From equation (7.6), the
error of estimate is
\[ \bar{y}_{lr} - \bar{Y} = \bar{e} + (b - B)(\bar{X} - \bar{x}) \]
If $b$ is the least squares estimate of $B$, then by equation (7.12)
\[ b - B = \frac{\sum e_i(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \]
Hence
\[ \bar{y}_{lr} - \bar{Y} = \bar{e} - \frac{(\bar{x} - \bar{X}) \sum e_i(x_i - \bar{x})}{\sum (x_i - \bar{x})^2} \]
In repeated simple random samples $E(\bar{e}) = 0$, since the population
mean of the $e_i$ is zero. The average value of the second term on the
right can be expanded in a series of inverse powers of $n$. The leading
term is obtained by the following non-rigorous argument.
We may replace the denominator, $\sum (x_i - \bar{x})^2$, by the
approximation $(n - 1)S_x^2$. The numerator may be written
\[ -(\bar{x} - \bar{X}) \left\{ \sum e_i(x_i - \bar{X}) - \sum e_i(\bar{x} - \bar{X}) \right\} \tag{7.16} \]
Let $u_i$ be the variate $e_i(x_i - \bar{X})$. Then
\[ \sum_{i=1}^{N} u_i = \sum_{i=1}^{N} e_i(x_i - \bar{X}) = \sum_{i=1}^{N} \{(y_i - \bar{Y}) - B(x_i - \bar{X})\}(x_i - \bar{X}) = 0 \]
by the definition of $B$ for the finite population. Hence the population
mean $\bar{U} = 0$. Consequently the average value of the first term in
(7.16) may be written
\[ -nE(\bar{x} - \bar{X})(\bar{u} - \bar{U}) = \frac{-n(N - n)}{Nn(N - 1)} \sum_{i=1}^{N} (x_i - \bar{X})(u_i - \bar{U}) \]
by theorem 2.3 (p. 17);
\[ = -\frac{(N - n)}{N(N - 1)} \sum_{i=1}^{N} e_i(x_i - \bar{X})^2 \]
The average of the second term in (7.16) turns out to be of order $1/n$
and will not be considered.
Hence, dividing by $(n - 1)S_x^2$, the leading term in the bias of $\bar{y}_{lr}$
is, to terms of order $1/n$,
\[ \frac{-(N - n)}{(n - 1)N S_x^2} \left\{ \frac{\sum e_i(x_i - \bar{X})^2}{(N - 1)} \right\} \tag{7.17} \]
The expression inside the brackets is the population covariance
between $e_i$ and $(x_i - \bar{X})^2$; it represents a contribution from the
quadratic regression of $y_i$ on $x_i$, and vanishes if the relation between $y_i$
and $x_i$ is linear. Since the bias of $\bar{y}_{lr}$ is of order $1/n$, while its standard
error is of order $1/\sqrt{n}$, the bias becomes negligible in large samples.

7.5 Comparison with the ratio estimate and the mean per unit. For
these comparisons the sample size $n$ must be sufficiently large so that
the approximate formulas for the variances of the ratio and
regression estimates are valid. The three comparable variances for the
estimated population mean $\bar{Y}$ are as follows:
\[ V(\bar{y}_{lr}) = \frac{(N - n)}{Nn}\,S_y^2(1 - \rho^2) \qquad \text{(Regression)} \]
\[ V(\bar{y}_R) = \frac{(N - n)}{Nn}\,(S_y^2 + R^2 S_x^2 - 2R\rho S_y S_x) \qquad \text{(Ratio)} \]
\[ V(\bar{y}) = \frac{(N - n)}{Nn}\,S_y^2 \qquad \text{(Mean per unit)} \]
It is apparent that the variance of the regression estimate is smaller
than that of the mean per unit unless $\rho = 0$, in which case the two
variances are equal.
The variance of the regression estimate is less than that of the ratio
estimate if
\[ -\rho^2 S_y^2 < R^2 S_x^2 - 2R\rho S_y S_x \tag{7.18} \]
This is equivalent to the inequality
\[ (\rho S_y - R S_x)^2 > 0 \]
Therefore the regression estimate is more precise than the ratio
estimate unless
\[ \rho = \frac{R S_x}{S_y} = \frac{\text{Coefficient of variation of } x_i}{\text{Coefficient of variation of } y_i} \tag{7.19} \]
since $R = \bar{Y}/\bar{X}$.
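
The step from (7.18) to the squared form is a completion of the square. Written out as a worked line, in the notation of this section and added here only as an aid to the reader:

\[ R^2 S_x^2 - 2R\rho S_y S_x + \rho^2 S_y^2 = (\rho S_y - R S_x)^2 \ge 0 \]

so that moving $-\rho^2 S_y^2$ to the right-hand side of (7.18) leaves a perfect square, which is positive except when $\rho S_y = R S_x$.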
Equation (7.19) holds whenever the relation between $y_i$ and $x_i$ is a
straight line through the origin, and in this event the regression and
ratio estimates are equally precise. It is interesting that the
regression estimate is as precise as the ratio estimate, even when the latter
is a best unbiased estimate.
Actually, the ratio estimate is a particular case of the linear
regression estimate. If we take the regression coefficient $b' = \bar{y}/\bar{x}$, a value
that might be considered appropriate if the line was thought to pass
through the origin, we have
\[ \bar{y}_{lr} = \bar{y} + \frac{\bar{y}}{\bar{x}}(\bar{X} - \bar{x}) = \frac{\bar{y}}{\bar{x}}\,\bar{X} = \bar{y}_R \]
The regression estimate is more laborious to compute than the ratio
estimate, principally owing to the labor of computing $b$. With a large
sample, an inefficient estimate $b'$ can be used if this produces a
saving in time. In section 7.3 it was pointed out that the contribution to
$V(\bar{y}_{lr})$ from the sampling error of the least squares $b$ amounts to about
$1/n$th of the principal component of the variance. Consequently, an
estimate $b'$ which effectively uses half the data, i.e. is of 50 per cent
efficiency, increases $V(\bar{y}_{lr})$ from
\[ \frac{S_y^2(1 - \rho^2)}{n}\left(1 + \frac{1}{n}\right) \]
to
\[ \frac{S_y^2(1 - \rho^2)}{n}\left(1 + \frac{2}{n}\right) \]
If $n$ is large this increase is trivial. Thus we may estimate $B$ and
$S_{y.x}^2$ from a subsample of the data. If there is good evidence that the
true regression is straight, the subsample may consist of, say, every
fifth or every ninth unit in the sample. If there is doubt whether the
regression is straight, the subsample should be a random one, or
essentially equivalent to this.
Sometimes it may be possible to guess a value of $b$ from previous
experience. For any constant value of $b$, say $b_0$, which does not
depend on the results of the sample, $\bar{y}_{lr}$ is an unbiased estimate, since in
repeated simple random samples
\[ E\,\bar{y}_{lr} = \bar{Y} + b_0 E(\bar{X} - \bar{x}) = \bar{Y} \]
Example. The precision of the regression, ratio, and mean per unit
estimates from a simple random sample can be compared using data
collected in the complete enumeration of peach orchards described on
p. 135. In this example, $y_i$ is the estimated peach production in an
orchard and $x_i$ the number of peach trees in the orchard. We will
compare the estimates of the total production of the 256 orchards,
as made from a sample of 100 orchards. It is doubtful whether the
sample is large enough to make the variance formulas fully valid,
since the cv's of $\bar{y}$ and $\bar{x}$ are both somewhat higher than 10 per cent,
but the example will serve to illustrate the computations. The basic
data are as follows:
\[ S_y^2 = 6409: \quad S_{yx} = 4434: \quad S_x^2 = 3898 \]
\[ R = 1.270: \quad \rho = 0.887: \quad n = 100: \quad N = 256 \]
\[ V(\hat{Y}_{lr}) = \frac{N(N - n)}{n}\,S_y^2(1 - \rho^2) = \frac{(256)(156)}{100}(6409)(1 - 0.787) = 545,000 \]
\[ V(\hat{Y}_R) = \frac{N(N - n)}{n}\,(S_y^2 + R^2 S_x^2 - 2R S_{yx}) \]
\[ = \frac{(256)(156)}{100}\left[6409 + (1.613)(3898) - 2(1.270)(4434)\right] = 573,000 \]
\[ V(\hat{Y}) = \frac{N(N - n)}{n}\,S_y^2 = 2,559,000 \]
There is little to choose between the regression and the ratio
estimates, as might be expected from the nature of the variables. Both
techniques are greatly superior to the mean per unit.
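
The three variances in this example can be recomputed from the basic data. The following Python sketch is an illustration, not part of the original text; the variable names are hypothetical. It derives $\rho$ from $S_{yx}$, $S_y^2$, and $S_x^2$ rather than using the rounded value 0.887, so its results agree with the figures above only to about three significant figures.

```python
# Variances of three estimates of total peach production (N = 256, n = 100),
# from the population figures quoted in the example.
import math

N, n = 256, 100
S2y, Syx, S2x = 6409.0, 4434.0, 3898.0
R = 1.270

rho = Syx / math.sqrt(S2y * S2x)          # population correlation
factor = N * (N - n) / n                  # N(N - n)/n

V_regression = factor * S2y * (1.0 - rho ** 2)
V_ratio = factor * (S2y + R ** 2 * S2x - 2.0 * R * Syx)
V_mean = factor * S2y

print(round(rho, 3), round(V_regression), round(V_ratio), round(V_mean))
```

The ordering of the three results matches the conclusion of the example: the regression and ratio variances are close, and both are far below that of the mean per unit.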

7.6 The regression estimate in stratified sampling. As with the ratio
estimate, there are two types of regression estimate that can be made
in stratified random sampling. For the first estimate, $\bar{y}_{lrs}$ (s for
separate), we compute a separate regression coefficient in each
stratum. This estimate is appropriate to a mathematical model of the
form
\[ y_{hi} = \bar{Y}_h + B_h(x_{hi} - \bar{X}_h) + e_{hi} \tag{7.3$''$} \]
where as usual $h$ denotes the stratum and $i$ the observation within
the stratum, and where we believe that $B_h$ varies from one stratum to
another.
We first compute the regression estimate of each stratum mean, i.e.
\[ \bar{y}_{lrh} = \bar{y}_h + b_h(\bar{X}_h - \bar{x}_h) \]
where $b_h$ is the least squares estimate of $B_h$. Then
\[ \bar{y}_{lrs} = \sum_{h=1}^{L} \frac{N_h \bar{y}_{lrh}}{N} \tag{7.21} \]
There are two types of approach to the sampling theory for $\bar{y}_{lrs}$,
corresponding to the two approaches made with simple random
sampling. On the one hand we may assume that the population size in
each stratum is infinite and that the regression really is linear within
each stratum, so that the results of standard least squares theory may
be applied. These assumptions are not too unrealistic for some
applications (e.g. in agricultural sampling). On the other hand there is
the large-sample theory (as in section 7.2) which does not assume an
infinite population or a linear regression. Since both results may be
useful on occasion, two versions of the theorem for $V(\bar{y}_{lrs})$ will be
given. The elementary theory will be presented first.

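Theorem 7.3a Suppose that each stratum may be regarded as
infinite and that
\[ y_{hij} = \bar{Y}_h + B_h(x_{hi} - \bar{X}_h) + e_{hij} \]
where for any fixed $x_{hi}$
\[ E(e_{hij}) = 0: \quad E(e_{hij}^2) = S_{eh}^2 \]
Then, with stratified random sampling, $\bar{y}_{lrs}$ is an unbiased estimate of
$\bar{Y}$, with variance
\[ V(\bar{y}_{lrs}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 S_{yh}^2(1 - \rho_h^2) \left\{ \frac{1}{n_h} + \frac{(\bar{x}_h - \bar{X}_h)^2}{\sum_i (x_{hi} - \bar{x}_h)^2} \right\} \tag{7.22} \]
where, as in the corollary to theorem 7.1, $S_{eh}^2$ is written as $S_{yh}^2(1 - \rho_h^2)$.
Proof: Applying theorem 7.1a to stratum $h$, we deduce that $\bar{y}_{lrh}$ is
an unbiased estimate of $\bar{Y}_h$, with variance
\[ V(\bar{y}_{lrh}) = S_{yh}^2(1 - \rho_h^2) \left\{ \frac{1}{n_h} + \frac{(\bar{x}_h - \bar{X}_h)^2}{\sum_i (x_{hi} - \bar{x}_h)^2} \right\} \]
Since
\[ \bar{y}_{lrs} = \sum_{h=1}^{L} \frac{N_h}{N}\,\bar{y}_{lrh} \]
it follows that $\bar{y}_{lrs}$ is an unbiased estimate of $\bar{Y}$ and that its variance
is as given in (7.22).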
Theorem 7.3 If the sample is large in every stratum,
\[ V(\bar{y}_{lrs}) = \sum_{h=1}^{L} \frac{N_h(N_h - n_h)}{N^2 n_h}\,S_{yh}^2(1 - \rho_h^2) \]
Proof: In this version we do not assume the existence of a linear
regression. As in section 7.2, we define
\[ B_h = \frac{\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)(x_{hi} - \bar{X}_h)}{\sum_{i=1}^{N_h} (x_{hi} - \bar{X}_h)^2} \]
Similarly the residual variate, $e_{hi}$, is defined by the equation
\[ y_{hi} = \bar{Y}_h + B_h(x_{hi} - \bar{X}_h) + e_{hi} \]
The results of section 7.2 may now be applied to $\bar{y}_{lrh}$. The bias in
$\bar{y}_{lrh}$ is of order $1/n_h$, and its variance is, approximately,
\[ V(\bar{y}_{lrh}) \doteq \frac{(N_h - n_h)}{N_h n_h}\,S_{yh}^2(1 - \rho_h^2) \]
Consequently, by the definition of $\bar{y}_{lrs}$, its bias is at most of order
$1/n_h'$, where $n_h'$ is the smallest of the $n_h$. Since the sampling in
different strata is independent,
\[ V(\bar{y}_{lrs}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 V(\bar{y}_{lrh}) \doteq \sum_{h=1}^{L} \frac{N_h(N_h - n_h)}{N^2 n_h}\,S_{yh}^2(1 - \rho_h^2) \]
Corollary. If the samples are large in every stratum, an estimate
of $V(\bar{y}_{lrs})$ which is practically unbiased is
\[ v(\bar{y}_{lrs}) = \sum_{h=1}^{L} \frac{N_h(N_h - n_h)}{N^2 n_h}\,s_{y.xh}^2 \tag{7.23} \]
This result also follows from the argument in section 7.2.


The estimate $\bar{y}_{lrs}$ suffers from the same difficulty as the
corresponding ratio estimate, in that the bias may have the same sign in every
stratum. If the strata are numerous, the ratio of the bias of $\bar{y}_{lrs}$ to
its standard error may become appreciable. Since, as shown in
section 7.4, the leading term in the bias comes from the quadratic
regression of $y_{hi}$ on $x_{hi}$, this danger is most acute when the relation between
the variates approximates the quadratic rather than the linear type.

7.7 The combined regression estimate. The second estimate, $\bar{y}_{lrc}$
(c for combined), is appropriate when $B_h$ is presumed to be the same
in all strata. The model then becomes
\[ y_{hi} = \bar{Y}_h + B(x_{hi} - \bar{X}_h) + e_{hi} \]
To compute $\bar{y}_{lrc}$, we first find
\[ \bar{y}_{st} = \frac{\sum N_h \bar{y}_h}{N}: \quad \bar{x}_{st} = \frac{\sum N_h \bar{x}_h}{N} \]
These are the usual estimates appropriate to stratified sampling. Then
\[ \bar{y}_{lrc} = \bar{y}_{st} + b(\bar{X} - \bar{x}_{st}) \tag{7.24} \]
For b it is often satisfactory to take the customary combined estimate
L "A

E .E (y", - D")(x,,, - fll)


b_ "_1 i_I
(7.25)
L n"
.E .E (X", - f,,)2
"_1 i-I

This is not in general the most precise estimate of B. The variance of
$b_h$, the estimate in stratum h, is

$$V(b_h) = \frac{S_{y\cdot xh}^2}{\sum_i (x_{hi} - \bar{x}_h)^2}$$

where $S_{y\cdot xh}^2 = S_{yh}^2(1 - \rho_h^2)$. The most precise estimate
of B is, theoretically, obtained by weighting each $b_h$ inversely as its
variance. This will be found to give

$$b_{opt} = \frac{\sum_h \sum_i g_h (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)}{\sum_h \sum_i g_h (x_{hi} - \bar{x}_h)^2}$$

where $g_h = 1/S_{y\cdot xh}^2$. This estimate reduces to (7.25) only if the
residual variances are the same in all strata. In practice, $b_{opt}$
cannot be used, because we have to insert sample estimates of
$S_{y\cdot xh}^2$, with a resultant loss of precision from errors in these
estimates. These errors are small when the samples within strata are
large, but in that event the sampling error of b makes only a negligible
contribution to $V(\bar{y}_{lrc})$. Consequently any improvement on the
customary combined estimate of B will probably be small unless the total
sample size is, say, less than 50, and there are large differences
between the residual variances in different strata.
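The computation in (7.24) and (7.25) can be sketched as follows. This is a minimal illustration, not part of the original text: the function name, its arguments, and the two-stratum data are all hypothetical, and the population mean of x is assumed known.

```python
def combined_regression_estimate(strata, stratum_sizes, pop_mean_x):
    """Combined regression estimate of the population mean of y.

    strata        : list of (x_values, y_values) pairs, one pair per stratum
    stratum_sizes : list of N_h, the number of population units per stratum
    pop_mean_x    : known population mean of x (X-bar)
    Returns (estimate, b), where b is the pooled slope of formula (7.25).
    """
    N = sum(stratum_sizes)
    y_st = 0.0      # stratified mean of y
    x_st = 0.0      # stratified mean of x
    sum_xy = 0.0    # pooled within-stratum sum of products
    sum_xx = 0.0    # pooled within-stratum sum of squares of x
    for (xs, ys), N_h in zip(strata, stratum_sizes):
        n_h = len(xs)
        x_bar = sum(xs) / n_h
        y_bar = sum(ys) / n_h
        y_st += N_h * y_bar / N
        x_st += N_h * x_bar / N
        sum_xy += sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        sum_xx += sum((x - x_bar) ** 2 for x in xs)
    b = sum_xy / sum_xx
    return y_st + b * (pop_mean_x - x_st), b


# Hypothetical sample: both strata have slope 2 but different levels.
sample = [([1, 2, 3], [3, 5, 7]), ([4, 6], [13, 17])]
estimate, b = combined_regression_estimate(sample, [10, 30], pop_mean_x=4.0)
```

Because the made-up strata share a common slope, the pooled b recovers it exactly, which is the situation in which the combined estimate is appropriate.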
In presenting the elementary form of the result for $V(\bar{y}_{lrc})$, we
shall suppose that the sample regression coefficient, b' say, is some
weighted mean of the $b_h$, where the weights depend only on the
$x_{hi}$. Such a function includes, as particular cases, both the
customary combined b and $b_{opt}$, and enables $V(\bar{y}_{lrc})$ to be
stated slightly more generally.
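Theorem 7.4. Suppose that each stratum may be regarded as infinite
and that

$$y_{hij} = \bar{Y}_h + B(x_{hj} - \bar{X}_h) + e_{hij} \qquad (7.26)$$

where, for any fixed $x_{hj}$,

$$E(e_{hij}) = 0 : \qquad E(e_{hij}^2) = S_{yh}^2(1 - \rho_h^2)$$

Then, with stratified random sampling, the estimate

$$\bar{y}_{lrc} = \bar{y}_{st} + b'(\bar{X} - \bar{x}_{st})$$

is an unbiased estimate of $\bar{Y}$ with variance

$$V(\bar{y}_{lrc}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \frac{1}{n_h} S_{yh}^2(1 - \rho_h^2) + (\bar{x}_{st} - \bar{X})^2 V(b') \qquad (7.27)$$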

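Proof: From (7.26) it follows that

$$\bar{y}_{st} = \bar{Y} + B(\bar{x}_{st} - \bar{X}) + \bar{e}_{st}$$

Hence the error of estimate

$$\bar{y}_{lrc} - \bar{Y} = \bar{y}_{st} + b'(\bar{X} - \bar{x}_{st}) - \bar{Y} = \bar{e}_{st} + (b' - B)(\bar{X} - \bar{x}_{st})$$

Under the conditions stated, it follows in the usual way that b' is an
unbiased estimate of B, so that $\bar{y}_{lrc}$ is unbiased. Further, the
covariance of $\bar{e}_{st}$ and b' may be shown to be zero. Hence

$$V(\bar{y}_{lrc}) = V(\bar{e}_{st}) + (\bar{x}_{st} - \bar{X})^2 V(b') = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \frac{1}{n_h} S_{yh}^2(1 - \rho_h^2) + (\bar{x}_{st} - \bar{X})^2 V(b')$$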
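Corollary 1. If

$$b' = b = \frac{\sum_h \sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h)}{\sum_h \sum_i (x_{hi} - \bar{x}_h)^2}$$

then

$$V(\bar{y}_{lrc}) = \sum_h \left(\frac{N_h}{N}\right)^2 \frac{S_{yh}^2(1 - \rho_h^2)}{n_h} + (\bar{x}_{st} - \bar{X})^2\, \frac{\sum_h S_{yh}^2(1 - \rho_h^2) \sum_i (x_{hi} - \bar{x}_h)^2}{\left[\sum_h \sum_i (x_{hi} - \bar{x}_h)^2\right]^2} \qquad (7.28)$$

To prove this, we have

$$b = \frac{\sum_h b_h \sum_i (x_{hi} - \bar{x}_h)^2}{\sum_h \sum_i (x_{hi} - \bar{x}_h)^2}$$

Also

$$V(b_h) = \frac{S_{yh}^2(1 - \rho_h^2)}{\sum_i (x_{hi} - \bar{x}_h)^2}$$

The result follows by applying the usual formula for the variance of a
linear function.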
Corollary 2. There are various particular forms of this result,
according to the type of allocation adopted. For instance, if
$S_{yh}^2(1 - \rho_h^2)$ is constant in all strata and proportional
allocation is used, formula (7.28) becomes

$$V(\bar{y}_{lrc}) = S_{yh}^2(1 - \rho_h^2) \left\{ \frac{1}{n} + \frac{(\bar{x}_{st} - \bar{X})^2}{\sum_h \sum_i (x_{hi} - \bar{x}_h)^2} \right\} \qquad (7.29)$$
" i
With simple random sampling, the contribution of the sampling
error of b to $V(\bar{y}_{lr})$ was found to be approximately 1/n times the
total variance. Unfortunately this result does not always hold for
$V(\bar{y}_{lrc})$. For equation (7.29) the result is valid, but in the
more general expression (7.28) it sometimes happens that the major
contribution to the variance comes from a single stratum, say stratum h.
An examination of formula (7.28), which will not be presented, shows
that in this situation the contribution from V(b) may be as large as
$1/n_h$ times the total variance. In samples of moderate size it is
therefore advisable to check that the contribution of V(b) is negligible
before discarding it.
The more general theory for $\bar{y}_{lrc}$, in which the assumption of a
linear regression is relaxed, becomes quite complicated. We shall carry
it only far enough to exhibit, in a general way, what happens. The
within-stratum regression coefficients are defined as follows:

$$B_h = \frac{\sum_{i=1}^{N_h} (y_{hi} - \bar{Y}_h)(x_{hi} - \bar{X}_h)}{\sum_{i=1}^{N_h} (x_{hi} - \bar{X}_h)^2}$$

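The residual variates $e_{hi}$ are defined by the equations

$$y_{hi} = \bar{Y}_h + B_h(x_{hi} - \bar{X}_h) + e_{hi}$$

Hence, as deductions from this equation,

$$\bar{y}_h = \bar{Y}_h + B_h(\bar{x}_h - \bar{X}_h) + \bar{e}_h$$

$$\bar{y}_{st} = \bar{Y} + \sum W_h B_h(\bar{x}_h - \bar{X}_h) + \bar{e}_{st} \qquad (7.30)$$

where $W_h = N_h/N$. Now

$$\bar{y}_{lrc} = \bar{y}_{st} + b'(\bar{X} - \bar{x}_{st})$$

Substitute for $\bar{y}_{st}$ from equation (7.30). Hence the error of
estimate is

$$\bar{y}_{lrc} - \bar{Y} = \sum W_h B_h(\bar{x}_h - \bar{X}_h) + b'(\bar{X} - \bar{x}_{st}) + \bar{e}_{st}$$

At this point it is convenient to introduce the symbol B' = E(b').
The previous expression may be written as

$$\bar{y}_{lrc} - \bar{Y} = \sum W_h B_h(\bar{x}_h - \bar{X}_h) - B'(\bar{x}_{st} - \bar{X}) - (b' - B')(\bar{x}_{st} - \bar{X}) + \bar{e}_{st}$$

$$= \bar{e}_{st} + \sum W_h (B_h - B')(\bar{x}_h - \bar{X}_h) - (b' - B')(\bar{x}_{st} - \bar{X})$$

since $\sum W_h \bar{x}_h = \bar{x}_{st}$, and $\sum W_h \bar{X}_h = \bar{X}$.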
This analysis divides the error of estimate into three components.
The first is the familiar term $\bar{e}_{st}$. The second arises from any
variation, from one stratum to another, in the true within-stratum
regression coefficients $B_h$. If the $B_h$ are all equal to B, this term
vanishes provided that b' is an unbiased estimate of this B. Since
$E(\bar{x}_h) = \bar{X}_h$, this term does not introduce any bias into
$\bar{y}_{lrc}$, but it does contribute to the variance of
$\bar{y}_{lrc}$.
The third term represents the contribution of the sampling error of
b'. As in simple random sampling, the mean value of this term is not
zero unless the regression is actually linear: the leading term in the
bias comes from the quadratic regression of $y_{hi}$ on $x_{hi}$. If the
variability is approximately the same in all strata and proportional
sampling is used, the bias is of order 1/n, but it may be larger if the
contribution from one stratum is dominating. The same remarks apply
to the contribution from V(b').
If the bias and variability arising from the term in b' can be ignored,
the leading part of $V(\bar{y}_{lrc})$ is

$$V(\bar{y}_{lrc}) \doteq \sum_h \frac{N_h(N_h - n_h)}{N^2}\, \frac{S_{yh}^2(1 - \rho_h^2) + S_{xh}^2(B_h - B')^2}{n_h}$$
With the combined estimate, $s_{y\cdot xh}^2$ may be taken as

$$s_{y\cdot xh}^2 = \frac{\sum_i \left[(y_{hi} - \bar{y}_h) - b(x_{hi} - \bar{x}_h)\right]^2}{n_h - 1}$$

The divisor $(n_h - 1)$ is suggested, instead of $(n_h - 2)$, because a
common b has been employed in all strata. [As an "intuitive"
approximation, the divisor $(n_h - 1 - 1/L)$ might seem better.] To avoid
computing the deviations, the numerator of $s_{y\cdot xh}^2$ may be
calculated as

$$\sum_i (y_{hi} - \bar{y}_h)^2 - 2b \sum_i (y_{hi} - \bar{y}_h)(x_{hi} - \bar{x}_h) + b^2 \sum_i (x_{hi} - \bar{x}_h)^2$$

It is advisable to insert the individual values of $s_{y\cdot xh}^2$ in
their respective strata, rather than attempt any pooling, unless there is
good evidence that $S_{y\cdot xh}^2$ does not vary from stratum to stratum.

7.9 Comparison of the two types of regression estimate. Hard and
fast rules cannot be given to decide whether the separate or the
combined estimate is better in any specific situation: some exercise of
judgment is required in making a choice. The defects of the separate
estimate are that it is more liable to bias when samples are small
within the individual strata, and that its variance has a larger
contribution from sampling errors in the regression coefficients. The
defect of the combined estimate is that its variance is inflated if the
population regression coefficients differ from stratum to stratum.
If we are confident that the regressions are linear and if $B_h$ appears
to be the same in all strata, so far as can be judged, the combined
estimate is to be preferred. If the customary combined regression b has
been used for B, the sample estimate of $V(\bar{y}_{lrc})$ is obtained by
substituting the quantities $s_{y\cdot xh}^2$ into formula (7.28).
If the regressions appear linear (so that the danger of bias seems
small) but $B_h$ seems to vary from stratum to stratum, the separate
estimate is advisable. A sample estimate of its variance is obtained
by substituting the values $s_{y\cdot xh}^2$ into formula (7.22).
If there is some curvilinearity in the regressions when a linear
regression estimate is used, the combined estimate is probably safer
unless the samples are large in all strata.

7.10 Exercises.
7.1 A population contains 6 units, with the following values of $y_i$ and $x_i$:
Unit
1 2 3 4 5 6
2 4 I) 8 10 12
o 3 4 I) 6

By working out all possible cases, compare the precisions of the ratio and
linear regression estimates for simple random samples of size 2. Compute
the contributions of the bias to the variances.
7.2 From the sample data in table 6.1 (p. 113) compute the regression
estimate of the 1930 total number of inhabitants in the 196 large cities. Find
the standard error of this estimate, and compare its precision with that of
the ratio estimate.
7.3 In the previous exercise, find the estimated total number of inhabi-
tants, and its standard error, if b is arbitrarily taken as 1.
7.4 By working out all possible cases, compare the precisions of the sep-
arate and combined regression estimates of the total Y of the following popu-
lation, when simple random samples of size 2 are drawn from each stratum:

     Stratum 1           Stratum 2
  x_1i     y_1i       x_2i     y_2i
    4        0          5        7
    6        3          6       12
    7        5          8       13

Use the ordinary least squares estimates of the B's, and formula (7.25) for b.
7.11 References.
COCHRAN, W. G. (1942). Sampling theory when the sampling units are of unequal
sizes. Jour. Amer. Stat. Assoc., 37, 199-212.
FISHER, R. A. (1928). Moments and product moments of sampling distributions.
Proc. London Math. Soc., 2, 30, 199-238.
WATSON, D. J. (1937). The estimation of leaf areas. Jour. Agr. Sci., 27, 474.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 8

SYSTEMATIC SAMPLING

8.1 Description. This method of sampling is at first sight quite
different from simple random sampling. Suppose that the N units in
the population are numbered from 1 to N in some order. To select
a sample of n units, we take a unit at random from the first k units,
and every kth subsequent unit. For instance, if k is 15 and if the first
unit drawn is number 13, the subsequent units are numbers 28, 43,
58, and so on. The selection of the first unit determines the whole
sample. This type of sample will be called an every kth systematic
sample.
The apparent advantages of this method over simple random
sampling are as follows:
i. It is easier to draw a sample and often easier to execute without
mistakes. This is of particular advantage when the drawing is done
in the field. Even when drawing is done in an office there may be a
substantial saving in time. For instance, if the units are described
on cards which are all of the same size and lie in a file drawer, a card
can be drawn out every inch along the file as measured by a ruler.
This operation is speedy, whereas simple random sampling would be
slow. Of course, this method departs slightly from the strict "every
kth" rule.
ii. Intuitively, systematic sampling seems likely to be more precise
than simple random sampling. In effect, it stratifies the population
into n strata, which consist of the first k units, the second k units,
and so on. We might therefore expect the systematic sample to be
about as precise as the corresponding stratified random sample with
one unit per stratum. The difference is that with the systematic
sample the units all occur at the same relative position in the stratum,
whereas with the stratified random sample the position in the stratum
is determined separately by randomization within each stratum (see
figure 8.1). The systematic sample is spread more evenly over the
population, and this fact has sometimes made systematic sampling
considerably more precise than stratified random sampling.
One variant of the systematic sample is to choose each unit at or
near the center of the stratum; that is, instead of starting the sequence
by a random number chosen between 1 and k, we take the starting
number as (k + 1)/2 if k is odd, and either k/2 or (k + 2)/2 if k is
even. This procedure carries the idea of systematic sampling to its
logical conclusion. If $y_i$ can be considered a continuous function of a
continuous variable i, there are grounds for expecting that this
centrally located sample will be more precise than a randomly located
one. Little investigation of the efficacy of centrally located samples
has been made for the types of population usually encountered in
sample surveys, and attention will be confined to randomly located
samples.

[Figure 8.1 Systematic and stratified random sampling. Horizontal axis: unit number. Legend: systematic sample; stratified random sample.]
Since N is not in general an integral multiple of k, different
systematic samples from the same finite population may vary by one unit
in size. Thus with N = 23, k = 5, the numbers of the units in the
five systematic samples are as shown in table 8.1.

TABLE 8.1 THE POSSIBLE SYSTEMATIC SAMPLES FOR N = 23, k = 5

        Systematic sample number
    I     II    III    IV     V
    1      2      3     4     5
    6      7      8     9    10
   11     12     13    14    15
   16     17     18    19    20
   21     22     23

The first three samples have n = 5, while the last two have n = 4. This
fact introduces a disturbance into the theory of systematic sampling.
The disturbance is probably negligible if n exceeds 50, and will be
ignored, for simplicity, in the presentation of theory. The disturbance
is unlikely to be large even when n is small.
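The selection rule of section 8.1 and the unequal sample sizes of table 8.1 can be illustrated with a short sketch. The function names are hypothetical and units are assumed to be numbered from 1 to N.

```python
import random


def all_systematic_samples(N, k):
    """List the k possible every-kth systematic samples of units 1..N."""
    return [list(range(start, N + 1, k)) for start in range(1, k + 1)]


def draw_systematic_sample(N, k, rng=random):
    """Pick a random start between 1 and k, then take every kth unit."""
    start = rng.randint(1, k)
    return list(range(start, N + 1, k))


# Reproduce table 8.1: N = 23, k = 5.
samples = all_systematic_samples(23, 5)
sizes = [len(s) for s in samples]

# One randomly located sample; it is always one of the k columns above.
one_sample = draw_systematic_sample(23, 5)
```

With N = 23 and k = 5 the list `sizes` shows three samples of 5 units and two of 4, which is the disturbance described in the text.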

8.2 An alternative view. There is another way of looking at
systematic sampling. With N = nk, the k possible systematic samples are
shown in the columns of table 8.2. It is evident from this table that
the population has been divided into k large sampling units, each of
which contains n of the original units. The operation of choosing a
randomly located systematic sample is just the operation of choosing
one of these large sampling units at random. Thus systematic sampling
amounts essentially to the selection of a single complex sampling
unit which constitutes the whole sample. In other words, systematic
sampling is actually simple random sampling, applied to a set of k
large units, with the restriction that in terms of these large units the
sample size is 1.

TABLE 8.2 COMPOSITION OF THE k SYSTEMATIC SAMPLES

                              Sample number
               1               2         ...        i          ...      k
              y_1             y_2        ...       y_i         ...     y_k
            y_{k+1}         y_{k+2}      ...     y_{k+i}       ...     y_{2k}
              ...             ...                  ...                 ...
          y_{(n-1)k+1}    y_{(n-1)k+2}   ...   y_{(n-1)k+i}    ...     y_{nk}

Means     \bar{y}_1       \bar{y}_2      ...    \bar{y}_i      ...    \bar{y}_k

Use of a sampling unit which consists of a group or cluster of the
elements in the population is a common device. The next three chapters
are devoted to this type of sampling, often referred to as cluster
sampling. Thus systematic sampling is a particular case of cluster
sampling, in which the sample is a single cluster.

8.3 Variance of the estimated mean. Several formulas have been
developed for the variance of $\bar{y}_{sy}$, the mean of a systematic
sample. The first three given below apply to any kind of cluster
sampling in which the clusters all contain n elements and the sample
consists of one cluster.
If N = nk, it is easy to verify that $\bar{y}_{sy}$ is an unbiased estimate
of $\bar{Y}$ for a randomly located systematic sample. If N ≠ nk, this
result does not hold, although the bias is unlikely to be important. The
bias can be avoided by allotting a higher probability of selection to
certain samples. Consider the example in table 8.1. If a probability of
5/23 is given to each of the first three samples, and a probability
4/23 to each of the last two, the sample mean is unbiased.
In the following analysis, the symbol $y_{ij}$ denotes the jth member of
the ith systematic sample, so that j = 1, 2, ..., n, i = 1, 2, ..., k.
The mean of the ith sample is denoted by $\bar{y}_{i\cdot}$.
Theorem 8.1. The variance of the mean of a systematic sample is

$$V(\bar{y}_{sy}) = \frac{(N - 1)}{N} S^2 - \frac{k(n - 1)}{N} S_{wsy}^2 \qquad (8.1)$$

where

$$S_{wsy}^2 = \frac{1}{k(n - 1)} \sum_{i=1}^{k} \sum_{j=1}^{n} (y_{ij} - \bar{y}_{i\cdot})^2$$

is the variance among units which lie within the same systematic
sample. The denominator of this variance, k(n - 1), is constructed
by the usual rules in the analysis of variance: each of the k samples
contributes (n - 1) degrees of freedom to the sum of squares in the
numerator.
Proof: By the usual identity of the analysis of variance

$$(N - 1)S^2 = \sum_i \sum_j (y_{ij} - \bar{Y})^2 = n \sum_i (\bar{y}_{i\cdot} - \bar{Y})^2 + \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2$$

But the variance of $\bar{y}_{sy}$ is by definition

$$V(\bar{y}_{sy}) = \frac{1}{k} \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{Y})^2$$

Hence,

$$(N - 1)S^2 = nk V(\bar{y}_{sy}) + k(n - 1) S_{wsy}^2$$

The result follows.
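As a numerical check on theorem 8.1, the following sketch evaluates both sides of (8.1) with exact rational arithmetic. The population values are made up for illustration, the function name is hypothetical, and N is assumed to be an exact multiple of k.

```python
from fractions import Fraction


def check_theorem_8_1(y, k):
    """Return (direct V_sy, right-hand side of (8.1)) for population y."""
    N = len(y)
    n = N // k
    assert N == n * k, "this sketch assumes N = nk"
    Y_bar = Fraction(sum(y), N)
    S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    # The ith systematic sample takes units i, i + k, i + 2k, ...
    samples = [y[i::k] for i in range(k)]
    means = [Fraction(sum(s), n) for s in samples]
    V_sy = sum((m - Y_bar) ** 2 for m in means) / k
    S2_wsy = sum((v - m) ** 2
                 for s, m in zip(samples, means) for v in s) / (k * (n - 1))
    rhs = Fraction(N - 1, N) * S2 - Fraction(k * (n - 1), N) * S2_wsy
    return V_sy, rhs


# Hypothetical population of N = 12 units, sampled with k = 3, n = 4.
direct, from_formula = check_theorem_8_1([3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8], 3)
```

Since (8.1) is an algebraic identity, the two returned values agree exactly for any population with N = nk.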
Corollary. The mean of a systematic sample is more precise than
the mean of a simple random sample if and only if

$$S_{wsy}^2 > S^2 \qquad (8.2)$$

Proof: If $\bar{y}$ is the mean of a simple random sample of size n,

$$V(\bar{y}) = \frac{(N - n)}{N} \frac{S^2}{n}$$

From equation (8.1), $V(\bar{y}_{sy}) < V(\bar{y})$ if and only if

$$\frac{(N - 1)}{N} S^2 - \frac{k(n - 1)}{N} S_{wsy}^2 < \frac{(N - n)}{N} \frac{S^2}{n}$$

i.e. if

$$k(n - 1) S_{wsy}^2 > \left\{ N - 1 - \frac{N - n}{n} \right\} S^2 = k(n - 1) S^2$$
This important result, which applies to cluster sampling in general,
states that systematic sampling is more precise than simple random
sampling if the variance within the systematic samples is larger than
the population variance as a whole. Systematic sampling is precise
when units within the same sample are heterogeneous, and is imprecise
when they are homogeneous. The result is obvious intuitively.
If there is little variation within a systematic sample relative to that
in the population, the successive units in the sample are repeating
more or less the same information.
Another form for the variance is given in theorem 8.2.
Theorem 8.2.

$$V(\bar{y}_{sy}) = \frac{S^2}{n} \left[ \frac{(N - 1)}{N} + (n - 1)\rho_w \right] \qquad (8.3)$$

where

$$\rho_w = \frac{2}{kn(n - 1)S^2} \sum_{i=1}^{k} \sum_{j<u} (y_{ij} - \bar{Y})(y_{iu} - \bar{Y})$$

This quantity may be described as the correlation coefficient between
pairs of units that are in the same systematic sample. The divisor
factor kn(n - 1)/2 is the number of distinct terms in the sum of
products.
Proof:

$$n^2 k V(\bar{y}_{sy}) = n^2 \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{Y})^2 = \sum_{i=1}^{k} \left[ (y_{i1} - \bar{Y}) + (y_{i2} - \bar{Y}) + \cdots + (y_{in} - \bar{Y}) \right]^2$$

The squared terms amount to the total sum of squares of deviations
from $\bar{Y}$, i.e. to $(N - 1)S^2$. This gives

$$n^2 k V(\bar{y}_{sy}) = (N - 1)S^2 + 2 \sum_i \sum_{j<u} (y_{ij} - \bar{Y})(y_{iu} - \bar{Y}) = (N - 1)S^2 + kn(n - 1)S^2 \rho_w$$

Hence,

$$V(\bar{y}_{sy}) = \frac{S^2}{n} \left[ \frac{(N - 1)}{N} + (n - 1)\rho_w \right]$$

This result shows that positive correlation between units in the
same sample inflates the variance of the sample mean. Even a small
positive correlation may have a large effect, because of the multiplier
(n - 1).
The two previous theorems express $V(\bar{y}_{sy})$ in terms of $S^2$, and
hence relative to the variance for a simple random sample. There is an
analogue of theorem 8.2 which expresses $V(\bar{y}_{sy})$ in terms of the
variance for a stratified random sample in which the strata are composed
of the first k units, the second k units, and so on. In our notation the
subscript j in $y_{ij}$ denotes the stratum. The stratum mean will be
written $\bar{y}_{\cdot j}$.
Theorem 8.3.

$$V(\bar{y}_{sy}) = \frac{S_{wst}^2}{n} \left[ \frac{(N - n)}{N} + (n - 1)\rho_{wst} \right] \qquad (8.4)$$

where

$$S_{wst}^2 = \frac{1}{n(k - 1)} \sum_{j=1}^{n} \sum_{i=1}^{k} (y_{ij} - \bar{y}_{\cdot j})^2$$

This is the variance among units that lie in the same stratum. The
divisor n(k - 1) is used because each of the n strata contributes
(k - 1) degrees of freedom. Further

$$\rho_{wst} = \frac{2}{kn(n - 1)} \sum_{i=1}^{k} \sum_{j<u} (y_{ij} - \bar{y}_{\cdot j})(y_{iu} - \bar{y}_{\cdot u}) / S_{wst}^2 \qquad (8.5)$$

This rather complex quantity is the correlation coefficient between the
deviations from the stratum means of pairs of items that are in the
same systematic sample.
The proof is similar to that of theorem 8.2.
Corollary. A systematic sample has the same precision as the
corresponding stratified random sample, with one unit per stratum, if
$\rho_{wst} = 0$. This follows because for this type of stratified random
sample $V(\bar{y}_{st})$ is (theorem 5.3, corollary 2)

$$V(\bar{y}_{st}) = \left( \frac{N - n}{N} \right) \frac{S_{wst}^2}{n}$$
Other formulas for $V(\bar{y}_{sy})$, appropriate to an autocorrelated
population, have been given by W. G. and L. H. Madow (1944), who made
the first theoretical study of the precision of systematic sampling.
Example. The data in table 8.3 are for a small artificial population
which exhibits a fairly steady rising trend. We have N = 40,
k = 10, n = 4. Each column represents a systematic sample, and
the rows are the strata. The example illustrates the situation in
which the "within-stratum" correlation is positive. For instance, in
the first sample each of the four numbers 0, 6, 18, and 26 lies below
the mean of the stratum to which it belongs. This is consistently
true, with a few exceptions, in the first five systematic samples. In
the last five samples, deviations from the strata means are in most
cases positive. Thus the cross-product terms in $\rho_{wst}$ are
predominantly positive. From theorem 8.3 we should expect systematic
sampling to be less precise than stratified random sampling with one
unit per stratum.

TABLE 8.3 DATA FOR 10 SYSTEMATIC SAMPLES WITH n = 4, N = kn = 40

                  Systematic sample numbers             Strata
Strata     1   2   3   4   5   6   7   8   9  10        means

I          0   1   1   2   5   4   7   7   8   6          4.1
II         6   8   9  10  13  12  15  16  16  17         12.2
III       18  19  20  20  24  23  25  28  29  27         23.3
IV        26  30  31  31  33  32  35  37  38  38         33.1

Totals    50  58  61  63  75  71  82  88  91  88         72.7

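The variance $V(\bar{y}_{sy})$ is found directly from the systematic sample
totals as

$$V(\bar{y}_{sy}) = V_{sy} = \frac{1}{k} \sum_{i=1}^{k} (\bar{y}_{i\cdot} - \bar{Y})^2 = \frac{1}{n^2 k} \sum_{i=1}^{k} (n\bar{y}_{i\cdot} - n\bar{Y})^2$$

$$= \frac{1}{160} \left[ (50)^2 + (58)^2 + \cdots + (88)^2 - \frac{(727)^2}{10} \right] = 11.63$$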
For random and stratified random sampling, we need an analysis
of variance of the population into "between rows" and "within rows."
This is presented in table 8.4.

TABLE 8.4 ANALYSIS OF VARIANCE

                            df       ss        ms
Between rows (strata)        3     4828.3
Within strata               36      485.5     13.49 = S_wst^2
Totals                      39     5313.8    136.25 = S^2

Hence the variances of the estimated means from simple random and
stratified random samples are as follows:

$$V_{ran} = \left( \frac{N - n}{N} \right) \frac{S^2}{n} = \frac{9}{10} \cdot \frac{136.25}{4} = 30.66$$

$$V_{st} = \left( \frac{N - n}{N} \right) \frac{S_{wst}^2}{n} = \frac{9}{10} \cdot \frac{13.49}{4} = 3.04$$
Both stratified random sampling and systematic sampling are much
more effective than simple random sampling, but, as anticipated,
systematic sampling is less precise than stratified random sampling.
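The three variances in this example can be recomputed from the raw data. The sketch below assumes the values of table 8.3 as tabulated (rows are strata, columns are systematic samples); the function and variable names are illustrative only.

```python
# Rows are strata I-IV, columns are the 10 systematic samples (table 8.3).
TABLE_8_3 = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [6, 8, 9, 10, 13, 12, 15, 16, 16, 17],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [26, 30, 31, 31, 33, 32, 35, 37, 38, 38],
]


def three_variances(rows):
    """Return (V_sy, V_st, V_ran) for a population laid out as in table 8.3."""
    n = len(rows)          # number of strata = systematic sample size
    k = len(rows[0])       # number of systematic samples
    N = n * k
    values = [v for r in rows for v in r]
    Y_bar = sum(values) / N
    S2 = sum((v - Y_bar) ** 2 for v in values) / (N - 1)
    S2_wst = sum((v - sum(r) / k) ** 2 for r in rows for v in r) / (n * (k - 1))
    sample_means = [sum(r[i] for r in rows) / n for i in range(k)]
    V_sy = sum((m - Y_bar) ** 2 for m in sample_means) / k
    V_st = (N - n) / N * S2_wst / n
    V_ran = (N - n) / N * S2 / n
    return V_sy, V_st, V_ran


V_sy, V_st, V_ran = three_variances(TABLE_8_3)
```

The values agree with 11.63, 3.04, and 30.66 to within the rounding of the mean squares used in the hand computation.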
Table 8.5 shows the same data, with the order of the observations
reversed in the second and fourth strata. This has the effect of making
$\rho_{wst}$ negative, because it makes the majority of the cross-products
between deviations from the strata means negative for pairs of
observations that lie in the same systematic sample. In the first
systematic sample, for instance, the deviations from the strata means
are now -4.1, +4.8, -5.3, +4.9. Of the six products of pairs of
deviations, four are negative. Roughly the same situation applies in
every systematic sample.

TABLE 8.5 DATA IN TABLE 8.3, WITH THE ORDER REVERSED IN STRATA II AND IV

                  Systematic sample numbers             Strata
Strata     1   2   3   4   5   6   7   8   9  10        means

I          0   1   1   2   5   4   7   7   8   6          4.1
II        17  16  16  15  12  13  10   9   8   6         12.2
III       18  19  20  20  24  23  25  28  29  27         23.3
IV        38  38  37  35  32  33  31  31  30  26         33.1

Totals    73  74  74  72  73  73  73  75  75  65         72.7
This change does not affect $V_{ran}$ and $V_{st}$. With systematic
sampling, it brings about a dramatic increase in precision, as is seen
when the systematic sample totals in table 8.5 are compared with those
in table 8.3. We now have

$$V_{sy} = \frac{1}{160} \left[ (73)^2 + (74)^2 + \cdots + (65)^2 - \frac{(727)^2}{10} \right] = 0.46$$

It is sometimes possible to exploit this result by numbering the units
so as to create negative correlations within strata. Accurate knowledge
of the trends within the population is required. However, as will be
seen later, the situation in table 8.5 is one in which it is very
difficult to obtain from the sample a good estimate of the standard
error of $\bar{y}_{sy}$.
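The sign change in the within-stratum correlation can be checked numerically. The sketch below computes $\rho_{wst}$ from (8.5) and then $V(\bar{y}_{sy})$ from (8.4), for the data of table 8.3 and for the same data with strata II and IV reversed; the names are illustrative, and the data layout (rows as strata) is an assumption of the sketch.

```python
TABLE_8_3 = [
    [0, 1, 1, 2, 5, 4, 7, 7, 8, 6],
    [6, 8, 9, 10, 13, 12, 15, 16, 16, 17],
    [18, 19, 20, 20, 24, 23, 25, 28, 29, 27],
    [26, 30, 31, 31, 33, 32, 35, 37, 38, 38],
]
# Table 8.5: order reversed in the second and fourth strata.
TABLE_8_5 = [row[::-1] if j in (1, 3) else row for j, row in enumerate(TABLE_8_3)]


def rho_wst_and_variance(rows):
    """Return (rho_wst from (8.5), V_sy from (8.4))."""
    n = len(rows)
    k = len(rows[0])
    N = n * k
    # Deviations of every unit from its own stratum (row) mean.
    dev = [[v - sum(r) / k for v in r] for r in rows]
    S2_wst = sum(d * d for r in dev for d in r) / (n * (k - 1))
    # Cross-products of pairs of units in the same systematic sample (column).
    cross = sum(dev[j][i] * dev[u][i]
                for i in range(k)
                for j in range(n)
                for u in range(j + 1, n))
    rho = 2 * cross / (k * n * (n - 1) * S2_wst)
    V_sy = S2_wst / n * ((N - n) / N + (n - 1) * rho)
    return rho, V_sy


rho_3, V_sy_3 = rho_wst_and_variance(TABLE_8_3)
rho_5, V_sy_5 = rho_wst_and_variance(TABLE_8_5)
```

The correlation is positive for table 8.3 and negative for table 8.5, and the variances obtained through (8.4) reproduce the directly computed values 11.63 and 0.46.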
8.4 Comparison of systematic with stratified random sampling. The
performance of systematic sampling relative to that of stratified or
simple random sampling is greatly dependent on the properties of the
population. There are populations for which systematic sampling is
extremely precise and others for which it is less precise than simple
random sampling. For some populations and some values of n,
$V(\bar{y}_{sy})$ may even increase when a larger sample is taken, a
startling departure from good behavior. Thus it is difficult to give
general advice about the situations in which systematic sampling is to
be recommended. A knowledge of the structure of the population is
necessary for its most effective use.
Two lines of research on this problem have been followed. One is
to compare the different types of sampling on artificial populations in
which $y_i$ is some simple function of i. The other is to make the
comparisons for natural populations. Both types of investigation are
laborious and are not yet as extensive as an advisor on sampling would
wish, assuming that he did not have to do the work. Some of the
principal results are presented in the succeeding sections.

8.5 Populations in "random" order. Systematic sampling is sometimes
used, for its convenience, in populations where the numbering
of the units is effectively random. This is so in sampling from a file
arranged alphabetically by surnames, if the item that is being measured
has no relation to the surname of the individual. There will then
be no trend or stratification in $y_i$ as we proceed along the file, and
no correlation between neighboring values.
In this situation we would expect systematic sampling to be
essentially equivalent to simple random sampling and to have the same
variance. For any single finite population, with given values of n
and k, this is not exactly true, because $V_{sy}$, which is based on only k
degrees of freedom, is rather erratic when k is small, and may turn out
to be either greater or smaller than $V_{ran}$. There are two results which
show that on the average the two variances are equal. Both results
will be reported, since they illustrate different approaches to the study
of systematic sampling.
Theorem 8.4. Consider all N! finite populations which are formed
by the N! permutations of any set of numbers $y_1, y_2, \ldots, y_N$. Then,
on the average over these finite populations,

$$E(V_{sy}) = V_{ran}$$

Note that $V_{ran}$ is the same for all permutations.
This result, which was proved by W. G. and L. H. Madow (1944),
shows that if the order of the items in a specific finite population can
be regarded as drawn at random from the N! permutations, systematic
sampling is on the average equivalent to simple random sampling.
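For a very small population, theorem 8.4 can be verified by brute force. The sketch below enumerates all N! orderings of a made-up set of six values, with k = 3 and n = 2, and averages $V_{sy}$ exactly; the values and function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def v_sy(y, k):
    """Exact variance of the systematic sample mean for ordering y."""
    N = len(y)
    n = N // k
    Y_bar = Fraction(sum(y), N)
    return sum((Fraction(sum(y[i::k]), n) - Y_bar) ** 2 for i in range(k)) / k


def v_ran(y, n):
    """Exact variance of the mean of a simple random sample of size n."""
    N = len(y)
    Y_bar = Fraction(sum(y), N)
    S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    return Fraction(N - n, N) * S2 / n


values = (2, 3, 5, 7, 11, 13)   # hypothetical population, N = 6
k, n = 3, 2
orderings = list(permutations(values))
average_v_sy = sum(v_sy(p, k) for p in orderings) / len(orderings)
```

Individual orderings give values of $V_{sy}$ above and below $V_{ran}$, but the average over the 720 orderings equals $V_{ran}$ exactly.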
The second approach is to regard the finite population as drawn at
random from an infinite super-population which has certain properties.
The result that is proved does not apply to any single finite
population (i.e. to any specific set of values $y_1, y_2, \ldots, y_N$) but to
the average of all finite populations that can be drawn from the infinite
population. This approach may appear at first sight to have little
relation to practical applications, but this impression is erroneous.
Any sampling method is used in practice on a whole series of finite
populations. One way of describing the class of finite populations for
which a given sampling method is efficient is to describe the infinite
super-population from which such finite populations might have been
drawn at random.
The symbol $\mathcal{E}$ denotes averages over all finite populations which
can be drawn from this super-population.
Theorem 8.5. If the variates $y_i$ (i = 1, 2, ..., N) are drawn at
random from a super-population in which

$$\mathcal{E} y_i = \mu : \qquad \mathcal{E}(y_i - \mu)(y_j - \mu) = 0 \ (i \neq j) : \qquad \mathcal{E}(y_i - \mu)^2 = \sigma_i^2$$

then

$$\mathcal{E} V_{sy} = \mathcal{E} V_{ran}$$

The crucial conditions are that all $y_i$ have the same mean $\mu$, i.e. there
is no trend, and that no linear correlation exists between the values
$y_i$ and $y_j$ at two different points. The variance $\sigma_i^2$ may change
from point to point in the series.
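Proof: For any specific finite population,

$$V_{ran} = \frac{(N - n)}{Nn} \cdot \frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{(N - 1)}$$

Now

$$\sum_{i=1}^{N} (y_i - \bar{Y})^2 = \sum_{i=1}^{N} \left[ (y_i - \mu) - (\bar{Y} - \mu) \right]^2 = \sum_{i=1}^{N} (y_i - \mu)^2 - N(\bar{Y} - \mu)^2$$

Since $y_i$ and $y_j$ are uncorrelated (i ≠ j),

$$\mathcal{E}(\bar{Y} - \mu)^2 = \frac{1}{N^2} \sum_{i=1}^{N} \sigma_i^2$$

Hence

$$\mathcal{E} V_{ran} = \frac{(N - n)}{Nn(N - 1)} \left\{ \sum_{i=1}^{N} \sigma_i^2 - \frac{N}{N^2} \sum_{i=1}^{N} \sigma_i^2 \right\}$$

This gives

$$\mathcal{E} V_{ran} = \frac{(N - n)}{N^2 n} \sum_{i=1}^{N} \sigma_i^2$$

Turning to $V_{sy}$, let $\bar{y}_u$ denote the mean of the uth systematic
sample. For any specific finite population,

$$V_{sy} = \frac{1}{k} \sum_{u=1}^{k} (\bar{y}_u - \bar{Y})^2 = \frac{1}{k} \left\{ \sum_{u=1}^{k} (\bar{y}_u - \mu)^2 - k(\bar{Y} - \mu)^2 \right\}$$

By the theorem for the variance of the mean of an uncorrelated
sample from an infinite population,

$$\mathcal{E} V_{sy} = \frac{1}{k} \left\{ \frac{1}{n^2} \sum_{i=1}^{N} \sigma_i^2 - \frac{k}{N^2} \sum_{i=1}^{N} \sigma_i^2 \right\} = \frac{(N - n)}{N^2 n} \sum_{i=1}^{N} \sigma_i^2 = \mathcal{E} V_{ran}$$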

8.6 Populations with linear trend. If the population consists solely
of a linear trend, as illustrated in figure 8.2, it is fairly easy to guess
the nature of the results. From figure 8.2, it looks as if $V_{sy}$ and
$V_{st}$ (with one unit per stratum) will both be smaller than $V_{ran}$.
Further, $V_{sy}$ will be larger than $V_{st}$ because, if the systematic
sample is too low in one stratum, it is too low in all strata, whereas
stratified random sampling gives an opportunity for within-stratum
errors to cancel.

[Figure 8.2. Legend: systematic sample; stratified random sample.]

To examine the effects mathematically, we may assume that $y_i = i$.
We have

$$\sum_{i=1}^{N} i = \frac{N(N + 1)}{2} : \qquad \sum_{i=1}^{N} i^2 = \frac{N(N + 1)(2N + 1)}{6}$$

The population variance $S^2$ is given by

$$S^2 = \frac{1}{(N - 1)} \left[ \sum y_i^2 - N\bar{Y}^2 \right] = \frac{1}{(N - 1)} \left[ \frac{N(N + 1)(2N + 1)}{6} - \frac{N(N + 1)^2}{4} \right] = \frac{N(N + 1)}{12} \qquad (8.6)$$

Hence the variance of the mean of a simple random sample is

$$V_{ran} = \frac{(N - n)}{N} \cdot \frac{S^2}{n} = \frac{n(k - 1)}{nk} \cdot \frac{nk(N + 1)}{12n} = \frac{(k - 1)(N + 1)}{12} \qquad (8.7)$$

To find the variance within strata, $S_w^2$, we need only replace N by
k in (8.6). This gives

$$V_{st} = \frac{(N - n)}{N} \cdot \frac{S_w^2}{n} = \frac{n(k - 1)}{nk} \cdot \frac{k(k + 1)}{12n} = \frac{(k^2 - 1)}{12n} \qquad (8.8)$$

For systematic sampling, the mean of the second sample exceeds
that of the first by 1, while the mean of the third exceeds that of the
second by 1, and so on. Thus the means $\bar{y}_u$ may be replaced by the
numbers 1, 2, ..., k. Hence, by a further application of (8.6),

$$\sum_{u=1}^{k} (\bar{y}_u - \bar{Y})^2 = \frac{k(k^2 - 1)}{12}$$

This gives

$$V_{sy} = \frac{1}{k} \sum (\bar{y}_u - \bar{Y})^2 = \frac{(k^2 - 1)}{12} \qquad (8.9)$$

From the formulas (8.7), (8.8), and (8.9) we deduce, as anticipated,

$$V_{st} = \frac{k^2 - 1}{12n} \le V_{sy} = \frac{k^2 - 1}{12} \le V_{ran} = \frac{(k - 1)(N + 1)}{12}$$
Equality occurs only when n = 1. Thus, for removing the effect
of a linear trend, suspected or unsuspected, the systematic sample is
much more effective than the simple random sample, but less effective
than the stratified random sample.
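Formulas (8.7), (8.8), and (8.9) can be confirmed by direct computation on the population $y_i = i$. The following sketch uses exact fractions; the function name and the choice n = 4, k = 5 are illustrative.

```python
from fractions import Fraction


def linear_trend_variances(n, k):
    """Return (V_st, V_sy, V_ran) computed directly for y_i = i, N = nk."""
    N = n * k
    y = list(range(1, N + 1))
    Y_bar = Fraction(N + 1, 2)
    S2 = sum((v - Y_bar) ** 2 for v in y) / (N - 1)
    V_ran = Fraction(N - n, N) * S2 / n
    # Strata are the first k units, the second k units, and so on.
    strata = [y[j * k:(j + 1) * k] for j in range(n)]
    S2_w = sum((v - Fraction(sum(s), k)) ** 2
               for s in strata for v in s) / (n * (k - 1))
    V_st = Fraction(N - n, N) * S2_w / n
    # The uth systematic sample takes units u, u + k, u + 2k, ...
    V_sy = sum((Fraction(sum(y[u::k]), n) - Y_bar) ** 2 for u in range(k)) / k
    return V_st, V_sy, V_ran


n, k = 4, 5
V_st, V_sy, V_ran = linear_trend_variances(n, k)
```

For n = 4, k = 5 the three values are 1/2, 2, and 7, in agreement with $(k^2 - 1)/12n$, $(k^2 - 1)/12$, and $(k - 1)(N + 1)/12$.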

8.7 End corrections. The poor performance of systematic sampling
in the presence of a linear trend can be improved in several ways.
One is to use a centrally located sample. Another is to change the
estimate from an unweighted to a weighted mean in which all internal
members of the sample receive the usual weight 1/n, but the
first and the last members receive weights that are in general different
from 1/n. Such weights are called end corrections.
As before, we select the systematic sample by choosing a random
number i between 1 and k. The weights assigned to the two end members
depend on the value of i which was chosen. In computing the
weighted sample total, before division by n to obtain the weighted
sample mean, the weights are as follows:

$$\text{First member:} \quad 1 + \frac{n(2i - k - 1)}{2(n - 1)k}$$

$$\text{Last member:} \quad 1 - \frac{n(2i - k - 1)}{2(n - 1)k}$$

For any value of i, the two weights always add to 2.


Example. Suppose that n = 4, k = 3, N = 12. The weighted
means for the three possible systematic samples are as follows:

$$i = 1. \quad \bar{y}'_{sy} = \tfrac{1}{4}\left(\tfrac{5}{9} y_1 + y_4 + y_7 + \tfrac{13}{9} y_{10}\right)$$

$$i = 2. \quad \bar{y}'_{sy} = \tfrac{1}{4}\left(y_2 + y_5 + y_8 + y_{11}\right)$$

$$i = 3. \quad \bar{y}'_{sy} = \tfrac{1}{4}\left(\tfrac{13}{9} y_3 + y_6 + y_9 + \tfrac{5}{9} y_{12}\right)$$

A crude rationalization of the system of weighting is that, when
i = 1, the first member of the sample, $y_1$, receives a reduced weight
because it is at one end of the finite population; $y_{10}$, on the other
hand, receives an increased weight because it "represents" the
observations $y_9$, $y_{11}$, and $y_{12}$ which are nearer to it than to any
other member of this sample. End corrections are analogous to the
coefficients 1/2 which are assigned to the two end terms in the
Euler-Maclaurin formula for numerical integration.
It is easily shown that, in a population which consists solely of a
linear trend, $\bar{y}'_{sy}$ always gives the correct population mean. Thus
$V(\bar{y}'_{sy}) = 0$. Let the linear population be represented as before by
$y_j = j$, and consider the systematic sample which starts at i.

$$n\bar{y}'_{sy} = i \left\{ 1 + \frac{n(2i - k - 1)}{2(n - 1)k} \right\} + [i + k] + \cdots + [i + (n - 2)k] + [i + (n - 1)k] \left\{ 1 - \frac{n(2i - k - 1)}{2(n - 1)k} \right\}$$

$$= ni + \frac{kn(n - 1)}{2} - \frac{n(2i - k - 1)}{2} = \frac{n(kn + 1)}{2}$$

Hence $\bar{y}'_{sy} = (kn + 1)/2 = \bar{Y}$, irrespective of the value of i.
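The end-correction weights and the exactness of the corrected mean under a pure linear trend can be checked with a small sketch. The function names are hypothetical; the weights follow the first-member and last-member formulas given above, and the sample is assumed to have n ≥ 2 members.

```python
from fractions import Fraction


def end_weights(n, i, k):
    """Weights (before division by n) for the first and last sample members."""
    c = Fraction(n * (2 * i - k - 1), 2 * (n - 1) * k)
    return 1 + c, 1 - c


def end_corrected_mean(sample, i, k):
    """Weighted sample mean with end corrections; i is the random start."""
    n = len(sample)
    first, last = end_weights(n, i, k)
    weights = [first] + [Fraction(1)] * (n - 2) + [last]
    return sum(w * v for w, v in zip(weights, sample)) / n


# Linear population y_j = j with n = 4, k = 3, N = 12, as in the example.
n, k = 4, 3
N = n * k
corrected = [end_corrected_mean(list(range(i, N + 1, k)), i, k)
             for i in range(1, k + 1)]
```

Every start i yields the same corrected mean, (kn + 1)/2 = 13/2, whereas the unweighted means of the three samples are 11/2, 13/2, and 15/2.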
These end corrections completely remove the effect of any linear
trend in a population. In actual populations more complex types of
trend may be present, as well as "random" variations that are independent
from one member of the series to another. So far as the independent
variations are concerned, end corrections result in a slight
loss of precision, for, if the $y_j$ vary independently with the same variance
$S^2$, we have

$$V\left(\sum w_j y_j\right) = S^2 \sum w_j^2$$

where $w_j$ is the weight attached to any $y_j$. With $\bar{y}_{sy}$, the unweighted
mean, $\sum w_j^2 = 1/n$. With $\bar{y}_{sy,w}$, $\sum w_j^2$ depends on the starting member
i of the sample. The average value of $\sum w_j^2$, taken over the range
i = 1, 2, ..., k, is found to be

$$\frac{1}{n}\left[1 + \frac{n(k^2 - 1)}{6(n - 1)^2 k^2}\right]$$

The inflation of the variance is negligible except for small n. For
n = 10, the inflation factor is at most about 2 per cent.
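
The average inflation can be checked by direct enumeration. The sketch below assumes the weights 1 ± n(2i - k - 1)/[2(n - 1)k] on the first and last members and weight 1 elsewhere; the function names are illustrative.

```python
from fractions import Fraction

def sum_squared_weights(n, k, i):
    """Sum of squared weights in the end-corrected mean for start i."""
    c = Fraction(n * (2 * i - k - 1), 2 * (n - 1) * k)
    return ((n - 2) + (1 + c) ** 2 + (1 - c) ** 2) / Fraction(n * n)

def average_inflation(n, k):
    """Average of sum(w^2) over i = 1..k, expressed as a multiple of 1/n."""
    avg = sum(sum_squared_weights(n, k, i) for i in range(1, k + 1)) / k
    return avg * n

def closed_form(n, k):
    """The bracketed factor 1 + n(k^2 - 1) / [6 (n - 1)^2 k^2]."""
    return 1 + Fraction(n * (k * k - 1), 6 * (n - 1) ** 2 * k * k)

print(float(average_inflation(10, 50)))   # slightly above 1.02
```

For n = 10 the factor approaches 1 + 10/486 as k grows, which is the "about 2 per cent" quoted above.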
End corrections were first proposed by Yates (1948), who assigned
slightly different weights

$$1 \pm \frac{(2i - k - 1)}{2k}$$

to the first and last members. These differ from the weights given
previously only by a factor (n - 1)/n. In tests of the efficacy of his
end corrections in five natural populations (described in table 8.6)
Yates found a worth-while increase in precision in four of the five cases.

8.8 Populations with periodic variation. If the population consists of
a periodic trend, e.g. a simple sine curve, the effectiveness of the
systematic sample depends on the value of k. This may be seen pictorially
in figure 8.3. In this representation, the height of the curve is the
observation $y_i$. The sample points A represent the case least favorable
to the systematic sample. This case holds whenever k is equal
to the period of the sine curve or is an integral multiple of the period.
Every observation within the systematic sample is exactly the same,
so that the sample is no more precise than a single observation taken
at random from the population.

FIGURE 8.3 Periodic variation.

The most favorable case (sample B) occurs when k is an odd multiple
of the half-period. Every systematic sample has a mean exactly
equal to the true population mean, since successive deviations above
and below the middle line cancel. The sampling variance of the mean
is therefore zero. Between these two cases the sample has various
degrees of effectiveness, depending on the relation between k and the
wavelength.
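
The two extreme cases can be reproduced on an artificial sine-curve population. This is a minimal sketch: the period of 8 units and the population size N = 96 are arbitrary choices made here so that N is a multiple of every k tried.

```python
import math

def systematic_means(y, k):
    """Means of the k possible systematic samples with interval k."""
    return [sum(y[i::k]) / len(y[i::k]) for i in range(k)]

def variance_of_means(means):
    """Variance of the sample mean over the equally likely starts."""
    centre = sum(means) / len(means)
    return sum((m - centre) ** 2 for m in means) / len(means)

period, N = 8, 96
y = [math.sin(2 * math.pi * j / period) for j in range(1, N + 1)]

worst = variance_of_means(systematic_means(y, 8))       # k equal to the period
best = variance_of_means(systematic_means(y, 4))        # k equal to the half-period
also_best = variance_of_means(systematic_means(y, 12))  # k equal to 3 half-periods
print(worst, best, also_best)
```

With k equal to the period the variance of the mean equals the population variance (here 1/2), as for a single observation; with k an odd multiple of the half-period it vanishes apart from rounding error.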

8.9 Autocorrelated populations. With many natural populations,
there is reason to expect that two observations $y_i$, $y_j$ will be more
nearly alike when i and j are close together in the series than when
they are distant. This happens whenever natural forces induce a
slow change as we proceed along the series. In a mathematical model
for this effect, we may suppose that $y_i$ and $y_j$ are positively correlated,
the correlation between them being a function solely of their distance
apart, i - j, and diminishing as this distance increases. Although this
model is oversimplified, it may represent one of the salient features of
many natural populations.
In order to investigate whether this model does apply to a population,
we can calculate the set of correlations $\rho_u$ for items that are distant
u units apart, and plot this correlation against u. This curve,
or the function which it represents, is called a correlogram. Even if
the model is valid, the correlogram will not be a smooth function for
any finite population, because irregularities are introduced by the
finite nature of the population. In a comparison of systematic with
stratified random sampling for this model, the irregularities make it
difficult to derive results for any single finite population. The comparison
can be made over the average of a whole series of finite populations,
which are drawn at random from an infinite super-population to which
the model applies. This technique has already been applied in theorem
8.5.
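
A correlogram for a finite series can be computed directly. The sketch below uses one common convention (products of deviations from the overall mean, averaged over the N - u available pairs, divided by the overall variance); the text does not specify the exact estimator, so this choice is an assumption.

```python
def correlogram(y, max_lag):
    """Serial correlations rho_u for u = 1..max_lag (element 0 is lag 1)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y) / n
    rho = []
    for u in range(1, max_lag + 1):
        cross = sum((y[t] - mean) * (y[t + u] - mean) for t in range(n - u))
        rho.append(cross / ((n - u) * var))
    return rho

slow = correlogram(list(range(100)), 5)       # slowly changing series
alternating = correlogram([1, -1] * 50, 2)    # strictly periodic series
print(slow)
print(alternating)
```

For the slowly changing series the correlations are positive and fall off as u increases, which is the behaviour the model supposes; the alternating series shows how periodicity appears as correlations that swing between -1 and +1.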
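Thus we assume that the observations $y_i$ (i = 1, 2, ..., N) are
drawn from a super-population in which

$$\mathcal{E}(y_i) = \mu, \quad \mathcal{E}(y_i - \mu)^2 = \sigma^2, \quad \mathcal{E}(y_i - \mu)(y_{i+u} - \mu) = \rho_u \sigma^2$$

where

$$\rho_u \ge \rho_v \ge 0, \quad \text{whenever } u < v \qquad (8.10)$$

The drawing of one set of $y_i$ from this super-population creates a single
finite population of size N.
The average variance for systematic sampling is denoted by

$$\mathcal{E}V_{sy} = \mathcal{E}E(\bar{y}_{sy} - \bar{Y})^2$$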

For this class of populations it is easy to show that stratified random
sampling is superior to simple random sampling, but no general
result can be established about systematic sampling. Within the class
there are super-populations in which systematic sampling is superior
to stratified random sampling, but there are also super-populations
in which systematic sampling is inferior to simple random sampling
for certain values of k.
A general theorem can be obtained if it is further assumed that the
correlogram is concave upwards.
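Theorem 8.6 If, in addition to conditions (8.10), we have

$$\delta_i^2 = \rho_{i+1} + \rho_{i-1} - 2\rho_i \ge 0 \qquad [i = 2, 3, \ldots, (kn - 2)]$$

then

$$\mathcal{E}V_{sy} \le \mathcal{E}V_{st} \le \mathcal{E}V_{ran}$$

for any size of sample. Further, unless $\delta_i^2 = 0$, i = 2, 3, ..., (kn - 2),

$$\mathcal{E}V_{sy} < \mathcal{E}V_{st}$$

A proof has been given by Cochran (1946) and will not be repeated
here.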
Quenouille (1949) has shown that the inequalities in theorem 8.6
remain valid when two of the conditions are relaxed so that

$$\mathcal{E}(y_i) = \mu_i; \quad \mathcal{E}(y_i - \mu_i)^2 = \sigma_i^2$$

In this event each of the three average variances is increased by the
same amount.

So far as practical applications are concerned, correlograms which
are concave upward have been proposed by several writers as models
for specific natural populations. The function $\rho_u = \tanh(u^{-3/5})$ was
suggested by Fisher and Mackenzie (1922) for the correlation between
the weekly rainfall at two weather stations which are distant
u apart, the function $\rho_u = e^{-\lambda u}$ by Osborne (1942) for forestry and
land use surveys, and the function $\rho_u = (l - u)/l$ by Wold (1938) for
certain types of economic time series.
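
Whether a proposed correlogram is concave upward can be checked from its second differences, which must not be negative. The sketch below does this for the exponential and linear forms; the parameter values (0.3 for the decay rate and 50 for l) are arbitrary illustrations, not values from the cited papers.

```python
import math

def second_differences(rho, upto):
    """rho(u + 1) + rho(u - 1) - 2 rho(u) for u = 2..upto."""
    return [rho(u + 1) + rho(u - 1) - 2 * rho(u) for u in range(2, upto + 1)]

def exponential(u, lam=0.3):
    return math.exp(-lam * u)

def linear(u, l=50):
    return (l - u) / l

exp_diffs = second_differences(exponential, 30)
lin_diffs = second_differences(linear, 30)
print(min(exp_diffs), max(abs(d) for d in lin_diffs))
```

The exponential form gives strictly positive second differences, while the linear form gives second differences of zero, the boundary case of concavity.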

8.10 Natural populations. Investigations have been made on a variety
of natural populations. The data are described in table 8.6.

TABLE 8.6 NATURAL POPULATIONS USED IN STUDIES OF SYSTEMATIC SAMPLING

Reference                 N      Type of data
Yates (1948), table 18    288    Altitudes read at intervals of 0.1 mile from ordnance survey map.
Osborne (1942)            *      Per cent of area in (i) cultivated land, (ii) shrub, (iii) grass, (iv) woodland on parallel lines drawn on a cover-type map.
Osborne (1942)            *      Per cent of area in Douglas fir on parallel lines drawn on a cover-type map.
Yates (1948)              192    Soil temperature (12 in. under grass) for 192 consecutive days.
Yates (1948)              192    Soil temperature (4 in. under bare soil) for 192 days.
Yates (1948)              192    Air temperature for 192 days.
Yates (1948)              96     Yields of 96 rows of potatoes.
Finney (1948)             160    Volume of usable timber per strip, 3 chains wide and of varying length (Mt. Stuart forest).
Finney (1948)             288    Volume of virgin timber per strip, 2.6 chains wide, 80 chains long (Black's Mountain forest).
Finney (1950)                    Volume of timber per strip, 2 chains wide and of varying length (Dehra Dun forest).
Johnson (1943)                   Number of seedlings per 1-ft-bed-width in 4 beds of hardwood seedbed stock.
Johnson (1943)                   Number of seedlings per 1-ft-bed-width in 3 beds of coniferous seedbed stock.
Johnson (1943)            400 †  Number of seedlings per 1-ft-bed-width in 6 beds of coniferous transplant stock.

* Theoretically, N is infinite, if lines that are infinitely thin can be envisaged.
† Approximately. The number varied from bed to bed.

The first three studies were made from maps. In the first study, the finite
population consists of 288 altitudes at successive distances of 0.1 mile
in undulating country. In the next two, the data are the fractions of
the lengths of lines drawn on a cover-type map that lie in a certain
type of cover (e.g. grass). These examples might be considered the
closest to continuous variation in the mathematical sense.
The next three studies are based on temperatures for 192 consecutive
days: (i) 12 in. under the soil, (ii) 4 in. under the soil, (iii) in air.
This trio represents a gradation in the direction of greater influence of
erratic day-to-day changes in the weather as compared with slow
seasonal influences.
The remaining studies deal with plant or tree yields in some sequence
that lies along a line. In the study on potatoes, which is typical
of the group, the finite population consists of the total yields of 96
rows in a field. Since no exhaustive search of the literature has been
made, further data may be available.
In some of the studies, $V_{sy}$ is compared with the variance $V_{st2}$ for a
stratified random sample with strata of size 2k and 2 units per stratum.
This comparison is of interest because an unbiased estimate of $V_{st2}$
can be obtained from the sample data. This cannot be done for $V_{st1}$
(with strata of size k and 1 unit per stratum) or for $V_{sy}$. Other writers
report comparisons of $V_{sy}$ with both $V_{st1}$ and $V_{st2}$. The majority of
the sources do not present comparisons with $V_{ran}$ in readily usable
form, but it appears that in general $V_{st2}$ gave gains in precision over
$V_{ran}$.
In the papers by Yates and Finney, comparisons are given for a
range of values of n and k within each finite population. In these
cases the data in table 8.7 are the geometric means of the variance
ratios for the individual values of k. The other writers make computations
for only one value of k per population, but may give data for
different items or for several populations of the same natural type.
Here, again, geometric means of the variance ratios were taken.

TABLE 8.7 RELATIVE PRECISION OF SYSTEMATIC AND STRATIFIED RANDOM SAMPLING

                                               Relative precision of
                                               systematic to stratified
Data                              Range of k   V_st1/V_sy   V_st2/V_sy
Altitudes                         2-20         2.9[?]       6.68
Per cent area (4 cover types)                               4.42
Per cent area (Douglas fir)                                 1.83
Soil temperature (12 in.)         2-24         2.42         4.23
Soil temperature (4 in.)          4-24         1.46         2.07
Air temperature                   4-24         1.26         1.66
Potatoes                          [?]-16       1.87         1.90
Timber volume (Mt. Stuart)        2-32         1.07         1.35
Timber volume (Black's Mt.)       2-24         1.19         1.[?]
Timber volume (Dehra Dun)         2-32         1.[?]        1.89
Hardwood seedlings                14                        1.89
Coniferous seedlings              14-24                     2.22
Coniferous transplants            12-22                     0.93

Although the data are limited in extent, the results are impressive.
In the studies which permit comparison with $V_{st1}$, systematic sampling
shows a consistent gain in precision which, although modest, is
worth having. The gains in comparison with $V_{st2}$ are substantial.
The internal trend of the results agrees with expectations, although
not too much should be made of this in view of the small number of
studies. The gains are largest for the types of data in which we would
guess that variation would be nearest to continuous. The decline in
$V_{st}/V_{sy}$ from soil to air temperatures would also be anticipated from
this viewpoint. In the last three items (forest nursery data), the only
one showing no gain is coniferous transplant stock, which is older and
more uniform than seedling stock.

8.11 Quasi-periodic effects. In most of these studies, the variance
ratios $V_{st}/V_{sy}$ were reasonably stable for the different values of k
that were examined. Exceptions are the Dehra Dun forest studied by
Finney and one bed of hardwood seedling stock which has been studied
by L. H. Madow (1946). (There may be more exceptions, because
some studies included only one value of k.)
In the exceptions, $V_{sy}$ changed with varying sample size in a manner
that suggests the presence of something approaching a hidden
periodicity. Table 8.8 shows data presented by L. H. Madow. The
bed was 420 ft long, and the unit was 1 ft of the bed width.

TABLE 8.8 DATA SUGGESTING "QUASI-PERIODIC" EFFECT

k     n     V_sy    V_st1   V_st2   V_st1/V_sy
42    10    4.21    7.21    10.29   1.71
20    21    3.06    3.00    4.77    0.98
15    28    2.42    2.09    3.62    0.86
14    30    0.69    1.90    3.20    2.76
10    42    1.74    1.29    2.26    0.74
7     60    0.26    0.82    1.61    3.15
5     84    1.22    0.50    1.00    0.41

When k is a multiple of 7, the precision of the systematic sample is
high relative to that of stratified sampling. For intermediate values,
the precision is about the same or lower. The erratic behavior of
$V_{sy}$ with increasing sample size is another reflection of this phenomenon.
Finney's data (1950) exhibit a similar effect.
How frequently this effect occurs in natural populations is not
known: it is presumably less likely in a long series (i.e. N large) than
in a short series. It should be observed, however, that over the range
of values of k from 5 to 42 in table 8.8, systematic sampling has done,
on the whole, at least as well as stratified random sampling. The
criticism is therefore that the performance of systematic sampling is
unpredictable, not that it is uniformly poor.

8.12 Estimation of the variance from a single sample. From the results
of a simple random sample with n > 1, we can calculate an unbiased
estimate of the variance of the sample mean, the estimate being
unbiased whatever the form of the population. Since a systematic
sample can be regarded as a simple random sample with n = 1, this
useful property does not hold for the systematic sample. As an illustration,
consider the "sine curve" example. Let

$$y_i = m + a \sin(\pi i/2)$$

where k = 4 and i = 1, 2, ..., 4n. The successive observations in
the population are

(m + a), m, (m - a), m, (m + a), m, (m - a), m, ...

If i = 1 is chosen as the first member, all members of the systematic
sample have the value (m + a). For the other three possible choices
of i, all members have the values m, (m - a), or m, respectively.
Thus from a single sample we have no means of finding out or estimating
the value of a. But the true sampling variance of the mean of the
systematic sample is $a^2/2$. The illustration shows that it is impossible
to construct an estimated variance that is unbiased if periodic variation
is present.
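
The sine-curve illustration can be reproduced numerically. In this sketch the values m = 10, a = 3, and n = 5 are arbitrary, and `round` is applied to the sine only to remove floating-point residue, since sin(πi/2) takes only the values 1, 0, and -1.

```python
import math

m, a, n, k = 10, 3, 5, 4
population = [m + a * round(math.sin(math.pi * i / 2)) for i in range(1, k * n + 1)]

samples = [population[start::k] for start in range(k)]   # the 4 possible samples
means = [sum(s) / len(s) for s in samples]
within_range = [max(s) - min(s) for s in samples]        # spread inside each sample
true_variance = sum((mu - m) ** 2 for mu in means) / k   # variance of the sample mean
print(means, within_range, true_variance)
```

Every sample is internally constant, so no function of a single sample can reveal a, yet the variance of the mean over the four starts is $a^2/2$.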
These results do not mean that nothing can be done. Excluding
the case of periodic variation, we might know enough about the structure
of the population to be able to develop a mathematical model
which adequately represents the type of variation that is present.
We might then be able to manufacture a formula for the estimated
variance that is approximately unbiased for this model, although it
may be badly biased for other models. The decision to use one of
these models must rest on the judgment of the sampler. Unfortunately,
we frequently lack data, as distinct from opinions, about the
structure of the population and are not confident that a given model
will be satisfactory.
Some simple models with their corresponding estimated variances
are illustrated below. No proofs will be given.

The simplest models apply to populations in which $y_i$ is composed
of a trend plus a "random" component. Thus

$$y_i = \mu_i + e_i$$

where $\mu_i$ is some function of i. For the random component, we assume
that there is a super-population in which

$$\mathcal{E}(e_i) = 0, \quad \mathcal{E}(e_i^2) = \sigma_i^2, \quad \mathcal{E}(e_i e_j) = 0 \quad (i \ne j)$$

A proposed formula $s_{sy}^2$ for the estimated variance is called unbiased if

$$\mathcal{E}E(s_{sy}^2) = \mathcal{E}V_{sy}$$

i.e. if it is unbiased over all finite populations that can be drawn from
the super-population.
I. Population in "random" order.

$$\mu_i = \text{Constant} \quad (i = 1, 2, \ldots, N)$$

$$s_{sy1}^2 = \frac{(N - n)}{Nn} \frac{\sum (y_i - \bar{y}_{sy})^2}{(n - 1)}$$

This case applies when we are confident that the order is essentially
random with respect to the items being measured. The variance formula
is the same as for a simple random sample and is unbiased if the
model is correct.
II. Stratification effects only.

$$\mu_i = \text{Constant} \quad (rk + 1 \le i \le rk + k)$$

$$s_{sy2}^2 = \frac{(N - n)}{Nn} \frac{\sum (y_i - y_{i+k})^2}{2(n - 1)}$$

In this case the mean is constant within each stratum of k units. The
estimate $s_{sy2}^2$, which is based on the mean square successive difference,
is not unbiased. It contains an unwanted contribution from the
difference between $\mu$'s in neighboring strata, and the first and last
strata carry too little weight in estimating the random component of
the variance. With a reasonably large sample, this estimate would in
general be too high, assuming that the model is correct.
III. Linear trend.

$$\mu_i = \mu + \beta i$$

$$s_{sy3}^2 = \frac{(N - n)}{N} \frac{n'}{n^2} \frac{\sum (y_i - 2y_{i+k} + y_{i+2k})^2}{6(n - 2)}$$

The estimate is based on successive quadratic terms in the sequence
$y_i$. The sum of squares contains (n - 2) terms. With a linear trend
we have seen (section 8.7) that the trend can be eliminated by the
use of end corrections. The term $n'/n^2$ is the sum of squares of the
weights in $\bar{y}_{sy,w}$. Unless n is small, $n'/n^2$ can be replaced by the usual
factor 1/n. Because the strata at the ends receive too little weight,
the estimate is not unbiased unless $\sigma_i^2$ is constant, but it should be
satisfactory if n is large and the model is correct.
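
The three formulas can be sketched as functions of the ordered systematic sample. The function names are illustrative, and in the third formula the factor $n'/n^2$ is replaced by 1/n, the simplification the text allows when n is not small.

```python
def s1_squared(sample, N):
    """Model I: simple random sample formula."""
    n = len(sample)
    mean = sum(sample) / n
    return (N - n) / (N * n) * sum((y - mean) ** 2 for y in sample) / (n - 1)

def s2_squared(sample, N):
    """Model II: mean square successive difference."""
    n = len(sample)
    diffs = [(a - b) ** 2 for a, b in zip(sample, sample[1:])]
    return (N - n) / (N * n) * sum(diffs) / (2 * (n - 1))

def s3_squared(sample, N):
    """Model III: successive quadratic terms, with n'/n^2 taken as 1/n."""
    n = len(sample)
    quads = [(a - 2 * b + c) ** 2
             for a, b, c in zip(sample, sample[1:], sample[2:])]
    return (N - n) / (N * n) * sum(quads) / (6 * (n - 2))

# Systematic sample with k = 5 from the purely linear population y_j = j, N = 50.
N, k = 50, 5
sample = [2 + k * j for j in range(10)]
print(s1_squared(sample, N), s2_squared(sample, N), s3_squared(sample, N))
```

On a purely linear trend the first formula is dominated by the trend, the second is much smaller, and the third is exactly zero, which mirrors the ordering of the three models.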
If continuous variation of a more complex type is present, the preceding
formulas may all give poor results. In table 8.9 the second
and third formulas are applied to six forest nursery beds (Johnson,
1943). The quadratic formula is slightly better than that based on
successive differences, but both give serious overestimates.

TABLE 8.9 VARIANCES OF SAMPLE MEAN NUMBERS OF SEEDLINGS (JOHNSON'S DATA)

                 Bed   Actual V_sy   s_sy2^2   s_sy3^2
Silver maple     1     0.91          2.8       2.5
                 2     0.74          3.6       2.9
American elm     1     [?].8         28.[?]    12.6
                 2     15.5          22.6      18.6
White spruce     1     5.5           17.2      11.2
                 2     2.0           11.6      6.[?]
White pine       1     8.2           21.0      21.9

Various other formulas can be devised. Residuals from a fitted
polynomial of higher degree may be effective if $\mu_i$ varies continuously
and not too rapidly: tables have been provided by DeLury (1950) for
this method.
Formulas developed from simple assumptions about the nature of
the correlogram have been discussed by Osborne (1942), Cochran
(1946), and Matérn (1947). Yates (1949) has investigated an estimate
based on a quantity of the form

$$(y_u + y_{u+2k} + y_{u+4k} + \cdots) - (y_{u+k} + y_{u+3k} + \cdots)$$

The successive items in the sample are given alternately + and -
signs. If this expression is taken over the whole sample, only 1 df
is available. In order to provide more degrees of freedom, the sample
data can be broken into parts, which Yates suggests might contain 9
observations each. If we denote the successive observations in the
systematic sample by $y_1'$, $y_2'$, etc., and give weight 1/2 to the first and
last terms, we may write

$$d_1 = (\tfrac{1}{2}y_1' + y_3' + y_5' + y_7' + \tfrac{1}{2}y_9') - (y_2' + y_4' + y_6' + y_8')$$


The next difference, $d_2$, may start with $y_9'$, and so on. Then, for the
estimated variance of $\bar{y}_{sy}$, we take

$$s_{sy4}^2 = \frac{(N - n)}{Nn} \sum_{u=1}^{g} \frac{d_u^2}{7.5g}$$

The factor 7.5 is the sum of squares of the coefficients in any $d_u$, and
g is the number of differences which the sample provides (g is approximately
n/9). In the natural populations which Yates examined, a
formula of this type was superior to the formula $s_{sy2}^2$ based on successive
differences, but it still overestimated the actual variance of $\bar{y}_{sy}$.
In conclusion, there is no dearth of formulas for the estimated variance,
but all appear to have a limited range of applicability.
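
A minimal sketch of an estimate of this type follows. It assumes that each difference uses 9 successive observations with coefficients 1/2, -1, 1, -1, 1, -1, 1, -1, 1/2, and that each new difference starts at the last observation of the previous one; the function names are illustrative.

```python
COEFFS = [0.5, -1, 1, -1, 1, -1, 1, -1, 0.5]   # sum of squares is 7.5

def partial_differences(sample):
    """Differences d_1, d_2, ..., each starting 8 observations after the last."""
    g = (len(sample) - 1) // 8
    return [sum(c * sample[8 * u + j] for j, c in enumerate(COEFFS))
            for u in range(g)]

def partial_difference_variance(sample, N):
    """(N - n)/(N n) times the sum of d_u^2 / (7.5 g); needs n >= 9."""
    n = len(sample)
    d = partial_differences(sample)
    g = len(d)
    return (N - n) / (N * n) * sum(x * x for x in d) / (7.5 * g)

trend = [3 + 2 * j for j in range(17)]          # purely linear sample, n = 17
bump = [0, 0, 0, 0, 1, 0, 0, 0, 0]              # one aberrant observation, n = 9
print(partial_differences(trend), partial_difference_variance(bump, 90))
```

Because the coefficients sum to zero and are symmetric, a linear trend contributes nothing to any $d_u$, so only departures from the trend enter the estimate.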

8.13 Stratified systematic sampling. We have seen that systematic
sampling is itself a kind of stratified sampling. In some applications
in which stratification seems desirable, it can be introduced by the
use of a systematic sample. Consider a sample of the blocks in a
large city. Usually it is desirable to ensure that this sample has good
geographic coverage in which the different types of residential area
are all represented. If the blocks are numbered serially, proceeding
from one side of the city to the other in a serpentine fashion, a systematic
sample will often give adequate geographic stratification.
The situation is different when the strata are distinct, so that one
stratum does not merge into another, and when separate estimates
are required for each stratum. Here we may take a separate systematic
sample in each stratum, with a new starting point and perhaps
a different value of k. This method will be more precise than stratified
random sampling if systematic sampling within strata is more precise
than simple random sampling within strata.
If $\bar{y}_{syh}$ is the mean of the systematic sample in stratum h, the estimate
of the population mean $\bar{Y}$ is, as usual,

$$\bar{y}_{stsy} = \frac{\sum N_h \bar{y}_{syh}}{N}$$

From theorem 5.2, assuming that the estimate is unbiased in each
stratum,

$$V(\bar{y}_{stsy}) = \frac{\sum N_h^2 V(\bar{y}_{syh})}{N^2}$$

With only a few strata, the problem of finding a sample estimate of
this quantity amounts to that of finding a satisfactory sample estimate
of $V(\bar{y}_{syh})$ in each stratum, by one of the methods discussed in
the previous section.
When the number of strata exceeds 20, an estimate based on the
differences between pairs of strata may be preferable. The stratum
sizes and the sample sizes should be approximately the same for the
members of a pair. If the first two strata form a pair, then

$$E(\bar{y}_{sy1} - \bar{y}_{sy2})^2 = V(\bar{y}_{sy1}) + V(\bar{y}_{sy2}) + (\bar{Y}_1 - \bar{Y}_2)^2$$

Consequently, the estimate

$$\frac{1}{N^2} \sum N_h^2 (\bar{y}_{syh} - \bar{y}_{syj})^2$$

where the sum extends over the pairs of strata (h, j), is on the average an
overestimate, even if periodic effects are present within strata. The
amount of overestimation depends on the terms in $(\bar{Y}_h - \bar{Y}_j)^2$. So
far as can be predicted, strata in the same pair should therefore have
about the same population means. This device is an application of
the method of "collapsed strata," previously described in section 5.21.
8.14 Systematic sampling in two dimensions. Some sampling problems
that are two-dimensional are handled by numbering the units so
that a one-dimensional systematic sample can be taken. The sample
of city blocks mentioned in the previous section is an instance. Another
is a method commonly employed in forestry surveys. A series
of equidistant parallel strips is mapped out, extending the whole
width of the forest. The volume of timber in each strip is estimated,
sometimes by measuring all trees in the strip but more frequently by
measuring a sample of the trees. If $y_i$ is the total volume of timber in
the ith strip, one-dimensional theory may be applied. When the
forest varies in width, a natural modification is to regard the area $x_i$
of the ith strip as an auxiliary variate, and employ a ratio or regression
estimate.
Two forms of systematic sample for a square area are shown in
figure 8.4. The sample on the left, which resembles a square grid, is
completely determined by the choice of a pair of random numbers to
fix the coordinates of the upper left unit. The sample on the right is
also systematic because the distance, horizontal or vertical, between
units in successive strata is always the same. However, unlike figure
8.4a, the units do not lie on the same line. To select an unaligned
sample of this kind, we fix the coordinates of the upper left unit by a
pair of random numbers. Two additional random numbers determine
the horizontal coordinates of the remaining units in the first column of
strata. Another two are needed to fix the vertical coordinates of the
remaining units in the first row of strata.
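
The selection of an unaligned sample can be sketched as follows, for a square area divided into strata of side k. The function name and the coordinate convention (x horizontal, y vertical, both counted from the upper left corner) are assumptions made for this illustration.

```python
import random

def unaligned_sample(strata_per_side, k, rng):
    """One unit per stratum; returns (x, y) coordinates of the selected units.

    The horizontal offset within a stratum depends only on the row of strata,
    and the vertical offset depends only on the column of strata.
    """
    x_offsets = [rng.randrange(k) for _ in range(strata_per_side)]  # one per row
    y_offsets = [rng.randrange(k) for _ in range(strata_per_side)]  # one per column
    return [(c * k + x_offsets[r], r * k + y_offsets[c])
            for r in range(strata_per_side)
            for c in range(strata_per_side)]

rng = random.Random(1)
points = unaligned_sample(3, 10, rng)
print(points)
```

For a 3 x 3 arrangement of strata this uses six random numbers in all: the pair for the upper left unit, two more horizontal offsets for the rest of the first column, and two more vertical offsets for the rest of the first row. Along any row of strata the horizontal spacing of units is exactly k, and along any column the vertical spacing is exactly k.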

FIGURE 8.4 Two types of two-dimensional systematic sample: (a) aligned or "square grid" sample; (b) unaligned sample.

Quenouille (1949) and Das (1950) have compared two-dimensional
systematic samples with various types of stratified sampling in theoretical
studies for some simple two-dimensional correlograms. The results
indicate that the unaligned systematic sample will often be superior
to stratified random sampling.
Quenouille's analysis suggests that the square grid is not so precise
as the unaligned sample. This suggestion is supported by a study on
natural populations by Haynes (1948). In fourteen agricultural uniformity
trials, he found that the grid had only about the same precision
as simple random sampling in two dimensions. The relatively
poor performance of the square grid is not unexpected when we consider
the effect of linear gradients. If there is a pronounced linear
gradient parallel to the horizontal side of the area, the square grid in
figure 8.4a samples this gradient at only three points; it seems intuitive
that a method which samples the gradient at nine different points,
as does the unaligned sample, will be superior.
Further evidence for the superiority of an unaligned sample is obtained
from experience in experimental design, where the latin square
has been found a precise method for arranging treatments in a rectangular
field. The 5 x 5 latin square in figure 8.5a may be regarded
as a division of the field into five systematic samples, one for each
letter. There is some evidence that this particular square, which is
called the "knight's move" latin square, is slightly more precise than
a randomly chosen 5 x 5 square, probably because alignment is absent
in the diagonals as well as in rows and columns.
The principle of the latin square has been used by Homeyer and
Black (1946) in sampling rectangular fields of oats. Each field contained
21 plots. The three possible systematic samples are denoted
by the letters A, B, and C, respectively, in figure 8.5b. This arrangement,
with one of the letters chosen at random in each field, gave an
increase in precision of around 25 per cent over stratified random sampling
with rows as strata. The arrangement does not quite satisfy
the latin square property, because each letter appears 3 times in one
column and twice in the other columns, but it approaches this property
as nearly as possible.

    A B C D E          A B C
    D E A B C          B C A
    B C D E A          C A B
    E A B C D          A B C
    C D E A B          B C A
                       C A B
                       A B C

(a) "Knight's move" latin square    (b) Systematic design for a 3 x 7 rectangular field

FIGURE 8.5 Two systematic designs based on the latin square.

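
Both arrangements follow simple cyclic rules, which the sketch below uses to rebuild them and check the properties mentioned. The index formulas (shift of 3 per row for the 5 x 5 square, shift of 1 per row for the 3-column design) are inferred here from the printed layouts and are not stated in the text.

```python
LETTERS = "ABCDE"

# 5 x 5 "knight's move" square: each row is the previous row shifted by 3 letters.
square = [[LETTERS[(c + 3 * r) % 5] for c in range(5)] for r in range(5)]

# 7 rows x 3 columns design: each row is the previous row shifted by 1 letter.
design = [["ABC"[(r + c) % 3] for c in range(3)] for r in range(7)]

def all_distinct(cells):
    return len(set(cells)) == len(cells)

rows_ok = all(all_distinct(row) for row in square)
cols_ok = all(all_distinct([square[r][c] for r in range(5)]) for c in range(5))
diagonals_ok = (all_distinct([square[r][r] for r in range(5)])
                and all_distinct([square[r][4 - r] for r in range(5)]))

# Number of times the letter A falls in each column of the 3-column design.
counts_a = [sum(design[r][c] == "A" for r in range(7)) for c in range(3)]

for row in square:
    print(" ".join(row))
print(rows_ok, cols_ok, diagonals_ok, counts_a)
```

In the 5 x 5 square no letter repeats in any row, column, or main diagonal, while in the 3-column design each letter necessarily appears 3 times in one column and twice in the other two.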
8.15 Summary. Systematic samples are convenient to draw and to
execute. In most of the studies reported in this chapter, both on artificial
and on natural populations, they compared favorably in precision
with stratified random samples. Their disadvantages are that
they may give poor precision when unsuspected periodicity is present
and that no trustworthy method for estimating $V(\bar{y}_{sy})$ from the sample
data is known.
In the light of these results, writers on sampling are not in agreement
in their views on the advisability of systematic sampling. It
appears, however, that systematic sampling can safely be recommended
in the following situations:
i. Where the ordering of the population is essentially random, or
contains at most a mild stratification. Here systematic sampling is
used for convenience, with little expectation of a gain in precision.
Sample estimates of error which are reasonably unbiased are available
(section 8.12).
ii. Where a stratification with numerous strata is employed, and
an independent systematic sample is drawn from each stratum. The
effects of any hidden periodicities tend to cancel out in this situation,
and an estimate of error which is known to be an overestimate can be
obtained (section 8.13). Alternatively we can use half the number of
strata and draw two systematic samples, with independent random
starts, from each stratum. This method gives an unbiased estimate
of error.
iii. For subsampling the units (chapter 10). In this case it turns
out that an unbiased estimate of the sampling error can be obtained
in most practical situations.
iv. For sampling populations with variation of a continuous type,
provided that an estimate of the sampling error is not regularly required.
If a series of surveys of this type is being made, an occasional
check on the sampling errors may be sufficient. Yates (1948) has
shown how this may be done by taking supplementary observations.
In conclusion, further research may extend our knowledge of the
validity and range of application of formulas which purport to estimate
$V(\bar{y}_{sy})$ and may lead to improved formulas.
8.16 Exercises.
8.1 The data below are the numbers of seedlings for each foot of bed in a
bed 200 ft long.

Feet:     1-20  21-40  41-60  61-80  81-100  101-120  121-140  141-160  161-180  181-200   Systematic
Stratum:  1     2      3      4      5       6        7        8        9        10        sample totals

8 20 26 84. 81 24 18 16 36 10 223
6 19 26 21 23 19 13 12 8 35 182
6 2/i 10 27 41 28 7 8 29 7 188
28 11 41 2/i 1 1 9 10 33 9 197
2/i 81 30 32 Iii 29 11 III a 12 211
16 26 IiIi 43 21 24 20 20 13 7 24Ii
28 29 34 33 8 33 16 17 18 6 222
21 19 56 4/i 22 37 9 12 20 14 2/iIi
22 17 89 23 11 32 14 7 13 12 100
18 28 41 27 3 26 15 17 24 15 214
26 16 27 87 . 36 20 21 29 18 234
28 II 20 14 5 .20 21 26 18 . 165
11 22 2/i 14 11 .a 16 16 16 4 177
16 26 89 24 9 27 14 18 20 9 202
7 17 24 18 2/i 20 13 11 6 8 149
2i 2/i 17 16 21 9 19 15 8 191
21 18 14 18
'"
26
31
14
40
44
IiIi 36
13
22
18
19
24
2/i
17
7
27
29
31
"8
8
II
10
Ii
1
227
2/iIi
26 80 89 29 II 30 80 29 10 3 235

ta
t.o1.ala 410 U9 674 Ml 325 528 803 as8 342 206 (1M
Find the variance of the mean of a systematic sample consisting of every
twentieth foot. Compare this with the variances for (i) a simple random sample,
(ii) a stratified random sample with 2 units per stratum, (iii) a stratified random
sample with 1 unit per stratum. All samples have n = 10. ($\sum (y_{ij} - \bar{Y})^2 = 23{,}601$.)
8.2 For the population in exercise 8.1, is the precision of systematic sampling
improved by end corrections?
8.3 A two-dimensional population with a linear trend may be represented
by the relation

$$y_{ij} = i + j \qquad (i, j = 1, 2, \ldots, nk)$$

where $y_{ij}$ is the item value in the ith row and jth column. The population
contains $N^2 = n^2k^2$ units.
A systematic square grid sample is selected by drawing at random two independent
starting coordinates $i_0$, $j_0$, each between 1 and k. The sample, of
size $n^2$, contains all units whose coordinates are of the form

$$i_0 + \gamma k, \quad j_0 + \delta k$$

where $\gamma$, $\delta$ are any two integers between 0 and (n - 1), inclusive.
Show that the mean of this sample has the same precision as the mean of a
simple random sample of size $n^2$.
8.4 If the comparison in exercise 8.3 were made for a three-dimensional
population with linear trend, what result would you expect?
8.5 A population of 360 households (numbered from 1 to 360) in Baltimore
is arranged alphabetically in a file by the surname of the head of the household.
Households in which the head is non-white occur at the following
numbers: 28, 31-33, 36-41, 44, 45, 47, 55, 56, 68, 68, 69, 82, 83, 86, 86, 89-94,
98, 99, 101, 107-110, 114, 154, 156, 178, 223, 224, 200, 298-300, 302-3{)(,
306-323, 32&-331, 333, 335-339, 341, 342. (The non-white households show
some "clumping" because of an association between surname and color.)
Compare the precision of a 1 in 8 systematic sample with a simple random
sample of the same size, for estimating the proportion of households in which
the head is non-white.

8.17 References.
Cochran, W. G. (1946). Relative accuracy of systematic and stratified random
samples for a certain class of populations. Ann. Math. Stat., 17, 164-177.
Das, A. C. (1950). Two-dimensional systematic sampling and the associated
stratified and random sampling. Sankhya, 10, 95-108.
DeLury, D. B. (1950). Values and integrals of the orthogonal polynomials up to
n = 26. University of Toronto Press.
Finney, D. J. (1948). Random and systematic sampling in timber surveys.
Forestry, 22, 1-38.
Finney, D. J. (1950). An example of periodic variation in forest sampling.
Forestry, 23, 96-111.
Fisher, R. A., and Mackenzie, W. A. (1922). The correlation of weekly rainfall.
Quart. Jour. Roy. Met. Soc., 48, 234-245.
Haynes, J. D. (1948). An empirical investigation of sampling methods for an
area. M.S. thesis. University of North Carolina.
Homeyer, P. G., and Black, C. A. (1946). Sampling replicated field experiments
on oats for yield determinations. Proc. Soil Sci. Soc. America, 11, 341-344.
Johnson, F. A. (1943). A statistical study of sampling methods for tree nursery
inventories. Jour. Forestry, 41.
Madow, L. H. (1946). Systematic sampling and its relation to other sampling
designs. Jour. Amer. Stat. Assoc., 41, 207-214.
Madow, W. G., and L. H. (1944). On the theory of systematic sampling. Ann.
Math. Stat., 15, 1-24.
Matérn, B. (1947). Methods of estimating the accuracy of line and sample plot
surveys. Medd. fr. Statens Skogsforskningsinstitut, 36, 1-138.
Osborne, J. G. (1942). Sampling errors of systematic and random surveys of
cover-type areas. Jour. Amer. Stat. Assoc., 37, 256-264.
Quenouille, M. H. (1949). Problems in plane sampling. Ann. Math. Stat., 20,
355-375.
Wold, H. (1938). A study in the analysis of stationary time series. Uppsala.
Yates, F. (1948). Systematic sampling. Phil. Trans. Roy. Soc. Lond., A241,
345-377.
Yates, F. (1949). Sampling methods for censuses and surveys. Charles Griffin
and Co., London.

Not cited in text

Buckland, W. R. (1951). A review of the literature of systematic sampling.
Jour. Roy. Stat. Soc., B13, 208-215.
CHAPTER 9

TYPE OF SAMPLING UNIT

9.1 The optimum unit. Sometimes a population can be divided into
units in various ways. A city may be regarded as composed of a number
of city blocks, or of a number of households, or of a number of
persons. In soil sampling, the tool with which the sample is extracted
can be constructed of various sizes and shapes, each of which creates
a different subdivision of a field into sampling units. A change in the
type of unit usually affects both the cost of taking the sample and the
precision obtained from it. The determination of the optimum type
of unit may therefore be important in the economics of sampling.
The optimum unit is that which gives the desired precision for the
sample estimates at the smallest cost, or the greatest precision for
fixed cost. For a given size of sample, a large unit is nearly always less
expensive than a small unit, but it is often less precise. The choice of
unit involves striking a balance between relative precision and relative
cost. As in most practical decisions, there may be imponderable
factors: one type of unit may have some special convenience or
disadvantage that is difficult to include in a calculation of costs. In
sampling a growing crop, some experiences suggest that a small unit gives
biased estimates because of uncertainty about the exact boundaries of
the unit. For example, Homeyer and Black (1946) found that units
2 x 2 ft gave yields of oats about 8 per cent higher than units 3 x 3 ft,
possibly because samplers tend to place boundary plants inside the
unit when there is doubt. Sukhatme (1947) cites similar results for
wheat and rice.

9.2 A simple example. Johnson's data (1941) for a bed of white pine
seedlings provide a simple example of the procedure for comparing
different units. The bed contained 6 rows, each 434 ft long. There
are many ways in which the bed can be divided into sampling units.
Data for four types of unit are shown in table 9.1. Since the bed was
completely counted, the data are correct population values.

TABLE 9.1  DATA FOR FOUR TYPES OF SAMPLING UNIT

                                             Type of unit
Preliminary data                      1-ft     2-ft     1-ft     2-ft
                                      row      row      bed      bed

Relative size of unit                    1        2        6       12
N_u = number of units in pop.         2604     1302      434      217
S_u^2 = pop. variance per unit       2.537    6.746   23.094   68.558
Number of ft of row that can
  be counted in 15 min                  44       62       78      108

The units were:

One foot of a single row.
Two feet of a single row.
One foot of the width of the bed.
Two feet of the width of the bed.
With the first two units, it was assumed that sampling would be
stratified by rows, so that the $S_u^2$ represent variances within rows.
Simple random sampling was assumed for the last two units.
Since the principal cost is that of locating and counting the units,
costs were estimated by a time study (last row of table 9.1). With
the larger units, a greater bulk of sample can be counted in 15 min,
less time being spent in moving from one unit to another.
The item to be estimated is the population total number of seedlings.
In studies of this type, a population total is more convenient to discuss
than a population mean, since the mean per unit for a 2-ft bed unit is
quite a different quantity from the mean per unit for a 1-ft row unit,
whereas the population total is the same quantity for all units. If the
fpc is ignored, the variance of the estimated population total is
$$\frac{N_u^2 S_u^2}{n_u}$$
where $u = 1, 2, 3, 4$ stands for the type of unit. This variance is to be
the same for all units. If the smallest unit is chosen as a standard, the
values of the other $n_u$ that give the same precision as the smallest unit
are obtained from the equation
$$\frac{N_1^2 S_1^2}{n_1} = \frac{N_u^2 S_u^2}{n_u}, \quad \text{i.e.} \quad n_u = n_1 \left(\frac{N_u}{N_1}\right)^2 \frac{S_u^2}{S_1^2}$$

For example, the value of $n_2$ comparable to $n_1$ is
$$n_2 = n_1 \left(\frac{1}{2}\right)^2 \frac{6.746}{2.537} = 0.665 n_1$$

These data appear in table 9.2, first line.

TABLE 9.2  COMPARABLE SAMPLE SIZES AND COSTS

                                             Type of unit
Successive steps in
the calculation                  1-ft      2-ft        1-ft        2-ft
                                 row       row         bed         bed

Comparable values of n_u         n_1     0.665n_1    0.253n_1    0.188n_1
Comparable sample sizes
  (in 1-ft row units)            n_1     1.330n_1    1.518n_1    2.256n_1
Comparable costs                 c_1     0.944c_1    0.856c_1    0.919c_1
Relative net precision           100       106         117         109

The next step is to find the comparable sample sizes in terms of
single feet of row, since costs are expressed in these terms. For $n_2$ we
multiply the previous line by 2, because the unit contains 2 ft of row
(second line of table 9.2). As the size of the unit increases, the size of
sample required to obtain equal precision also increases: in fact, with
the 2-ft bed unit, the sample must be 2¼ times as large as with the
1-ft row unit.
The cost of taking $n_1$ of the smallest units may be expressed as
$$c_1 = \frac{n_1}{44}$$
since this is time required in 15-min intervals. Similarly the cost of
the sample with the second unit is
$$\frac{1.330 n_1}{62} = 1.330 \, \frac{(44) c_1}{62} = 0.944 c_1$$
as shown in table 9.2. All the larger units cost less than the smallest
unit, although the differences are not great. The 1-ft bed unit appears
the best. The last line of table 9.2 shows the reciprocals of the costs,
with the smallest unit taken as 100. In the table these figures have
been called relative net precision, because, if the same comparisons are
made with cost kept constant instead of variance, it will be found that
these figures are inversely proportional to the relative variances of the
estimated population total, and hence measure relative precision.
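
The arithmetic of this example can be retraced in a few lines. The following is a minimal Python sketch and not part of the original text: the function and variable names are illustrative assumptions, and the inputs are the population values given for the four units (2604, 1302, 434, 217 units; variances 2.537, 6.746, 23.094, 68.558; 44, 62, 78, 108 ft of row counted in 15 min). Because it carries unrounded intermediate values, its output may differ from the printed figures in the third decimal place.

```python
# Inputs for the four units: 1-ft row, 2-ft row, 1-ft bed, 2-ft bed.
units = ["1-ft row", "2-ft row", "1-ft bed", "2-ft bed"]
rel_size = [1, 2, 6, 12]              # feet of row contained in one unit
N = [2604, 1302, 434, 217]            # number of units in the population
S2 = [2.537, 6.746, 23.094, 68.558]   # population variance per unit
feet_per_15min = [44, 62, 78, 108]    # feet of row counted in 15 min


def comparable_table(rel_size, N, S2, feet_per_15min):
    """Rows (n_u/n_1, bulk, cost, precision) relative to the first unit."""
    rows = []
    for m, n_units, s2, rate in zip(rel_size, N, S2, feet_per_15min):
        n_ratio = (n_units / N[0]) ** 2 * s2 / S2[0]  # n_u in units of n_1
        bulk = n_ratio * m                            # feet of row, in units of n_1
        cost = bulk * feet_per_15min[0] / rate        # cost in units of c_1
        rows.append((n_ratio, bulk, cost, 100.0 / cost))
    return rows


table_9_2 = comparable_table(rel_size, N, S2, feet_per_15min)
for name, (n_ratio, bulk, cost, prec) in zip(units, table_9_2):
    print(f"{name:9s} n_u/n_1={n_ratio:.3f} bulk={bulk:.3f} "
          f"cost={cost:.3f} precision={prec:.0f}")
```

The relative net precision in the last column is simply 100 divided by the comparable cost, so any change in the time-study figures feeds straight through to the ranking of the units.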

9.3 General procedure for comparing units. The analysis in this
example can be expressed in more general terms as follows.
Theorem 9.1 For the uth type of unit, let
Relative size of unit = $M_u$
Variance among the item totals on the unit = $S_u^2$
Relative cost per unit = $C_u$
Simple random sampling is assumed, with the fpc ignored, and the
population total is estimated by simple expansion. Then the relative
cost for equal precision is proportional to
$$\frac{C_u S_u^2}{M_u^2} \qquad (9.1)$$
Proof: This follows the argument used in the numerical example.
For the uth unit
Number of units in the population $\propto 1/M_u$
Variance of estimated population total $\propto S_u^2 / n_u M_u^2$
Sample size ($n_u$) for equal precision $\propto S_u^2 / M_u^2$
Relative cost for equal precision $\propto C_u S_u^2 / M_u^2$
The definitions of $S_u^2$ and $C_u$ should be noted, because in the
compilation of data it is often convenient to express these quantities
originally in some other form. Thus in the numerical example the
cost data were given in terms of the bulk of sample that could be
counted in a given time.
Corollary 1 Under the conditions of the theorem, the variance of
the estimated population total, for a fixed cost, is also proportional to
$$\frac{C_u S_u^2}{M_u^2} \qquad (9.2)$$

This follows by the same argument.


Corollary 2 In the analysis of variance, variances for units of
different sizes are often computed on what is called a common basis.
The variance among totals of units of size $M_u$ is divided by $M_u$.
Suppose, for example, that we wished to present all the variances in
table 9.1 in terms of the variance per 1 ft of row. Since the 2-ft row
unit is the total of 2 single feet, the variance 6.746 is divided by 2,
giving 3.373. Similarly the variances for the third and fourth units
are divided by 6 and 12, respectively. The results are as follows:

COMPARABLE VARIANCES FOR A SINGLE FOOT OF ROW

              Type of unit
1-ft      2-ft      1-ft      2-ft
row       row       bed       bed
2.537     3.373     3.849     5.713

When placed on a common basis, the variances still increase steadily
with the increasing size of unit.
Let $S_u'^2 = S_u^2 / M_u$, so that the quantities $S_u'^2$ are on a common
basis. Also let $C_u'$ be the cost of taking a given bulk of sample, so
that $C_u' \propto C_u / M_u$. Then theorem 9.1 may be stated as follows:
Relative net cost for equal precision $\propto C_u' S_u'^2$
Relative net precision $\propto 1 / C_u' S_u'^2$

This result shows that, if we are ignoring differences in the costs of
taking the sample (i.e. assuming $C_u'$ constant), relative net precision
is inversely proportional to $S_u'^2$. In other words, in order to compare
different units for the same total bulk of sample, the relevant
quantities are the variances among units, reduced to a common basis.
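
The restated form of theorem 9.1 lends itself to a short computation. The sketch below is illustrative only and not from the original text: the function name and argument names are assumptions, and the cost of a given bulk of sample is taken to be the reciprocal of the number of feet of row counted in 15 min in the seedling example.

```python
def relative_net_precision(rel_size, S2, bulk_per_unit_time):
    """Relative net precision 1 / (C'_u * S'_u^2), first unit scaled to 100.

    S'_u^2 = S_u^2 / M_u is the variance on a common basis; C'_u, the cost
    of a given bulk of sample, is taken as 1 / (bulk handled per unit time).
    """
    common_basis = [s2 / m for s2, m in zip(S2, rel_size)]
    cost_per_bulk = [1.0 / b for b in bulk_per_unit_time]
    raw = [1.0 / (c * v) for c, v in zip(cost_per_bulk, common_basis)]
    return common_basis, [100.0 * r / raw[0] for r in raw]


# Seedling example: 1-ft row, 2-ft row, 1-ft bed, 2-ft bed.
common_basis, precision = relative_net_precision(
    rel_size=[1, 2, 6, 12],
    S2=[2.537, 6.746, 23.094, 68.558],
    bulk_per_unit_time=[44, 62, 78, 108],
)
```

Working from the common-basis variances and the cost per unit of bulk gives the same ranking of the four units as the step-by-step calculation of section 9.2, which is the point of the restatement.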
The results of theorem 9.1 and its corollaries remain valid for
stratified sampling with proportional allocation, if all strata are of the
same size and if $S_u^2$, $S_u'^2$ represent average variances within strata.
This is so because, under the conditions stated, the variance of the
estimated population total, ignoring the fpc, is $N^2 S_u^2 / n$, and
therefore assumes the same form as with simple random sampling.
Theorem 9.1 does not hold for more complex types of sampling.
The preceding results are intended merely as an illustration of the
general procedure. Comparisons among units should always be made
for the kind of sampling that is to be used in practice, or if this has not
been decided, for the kinds that are under consideration. Changes in the
method of sampling or of estimation will change the relative net
precisions of the different units. Even with a fixed method of sampling
and estimation, relative net precisions will vary with size of sample
if the cost is not a linear function of size or if the size is large enough
so that the fpc must be taken into account.
There is usually more than one item to consider. One approach is
to fix the total cost, and work out the relative net precisions for each
type of unit and each item. Unless one type of unit is uniformly
superior, some compromise decision is made, giving principal weight to
the most important items.
In view of the numerous factors which influence the results, a study
of optimum size of unit in an extensive survey is a large task. A good
example for farm sampling is described by Jessen (1942). An excerpt
from his results is given in table 9.3. This compares 4 sizes of unit: a
quarter-section, a half-section, a section, and a block consisting of 2
contiguous sections. The section is an area 1 mile square, containing
on the average slightly under 4 farms. In this comparison the total
field cost ($1000), the length of questionnaire (60 min to complete),
and the travel cost (5 cents per mile) are all specified, because relative
net precisions change if any of these variables is altered. Costs are
at a 1939 level.

TABLE 9.3  ESTIMATED STANDARD ERRORS (IN PER CENT) FOR FOUR SIZES OF
UNIT, WITH SIMPLE RANDOM SAMPLING

                                                                    Best
Item                                  S/4     S/2      S      2S    unit
Number of swine                       6.0     4.9     6.3     6.2   S/2
Number of horses                      8.4     3.3     3.6     4.2   S/2
Number of sheep                      17.'    16.7    14.9    14.3   2S
Number of chickens                    3.0     3.0     3.3     3.8   S/4, S/2
Number of eggs yesterday              6.7     6.2     4.9     4.7   2S
Number of cattle                      4.7     4.6     4.8     6.6   S/2
Number of cows milked                 3.7     3.6     3.8     4.4   S/2
Number of gallons of milk             4.4     4.2     4.4     4.9   S/2
Dairy products receipts               6.6     6.2     6.4     6.0   S/2
Number of farm acres                  :U      :u      3.0     3.6   S/2
Number of corn acres                  8.7     3.6     3.8     4.4   S/2
Number of oat acres                   '.6     4.8     6.6     7.0   S/4
Corn yield                            1.6     1.7     2.0     2.6   S/4
Oat yield                             1.6     1.6     1.6     1.8   S/2
Commercial feed expenditures        111.0    18.6    16.7    21.8   S/4
Total expenditures, operator          7.8     8.1     9.6    1:'.0  S/4
Total receipts, operator              6.2     6.6     7.7     9.8   S/4
Net cash income, operator             6.8     6.9     7.8     9.6   S/4

The data in the table are the relative standard errors (in per cent)
of the estimated means per farm for 18 items. No unit is best for all
items. The half-section and the quarter-section are, however, superior
to the larger units for all except 2 items, with little to choose between
the half- and quarter-sections. The half-section would probably be
preferred, because the problem of identifying the boundaries
accurately is easier.
In order to make any comparison of this kind, we must know the
variability among units for each type of unit that is included. How
are such data obtained? One source, as in the nursery bed example,
is a complete count of the population for each type of unit. Another
is the drawing of separate samples for each type of unit. Such methods
may be employed when the population is compact and it is not too
costly to obtain the data. With extensive populations, however, it is
seldom feasible to make a survey solely for the purpose of comparing
different types of unit. Information about optimum type of unit is
more usually procured as an ingenious by-product of a survey whose
main purpose is to make estimates. Some techniques for doing this
are outlined in succeeding sections.

9.4 Comparisons made from survey data. Suppose that in a survey
each unit can be divided into M smaller units. Instead of recording
only the totals for each "large" unit in the sample, we record data
separately for each of the M small units. A comparison can then be
made of the precision of the large and small units. A simple random
sample of size n will be assumed at first.
The analysis of variance in table 9.4 can be computed from the
sample.

TABLE 9.4  ANALYSIS OF VARIANCE OF THE SAMPLE DATA
(ON A SMALL-UNIT BASIS)

                                         df          ms
Between large units                    (n - 1)      s_b^2
Between small units within large
  units                                n(M - 1)     s_w^2
Between small units in sample          (nM - 1)     s^2 = [(n - 1)s_b^2 + n(M - 1)s_w^2] / (nM - 1)

The estimated variance of a large unit (on a small-unit basis) is $s_b^2$.
It might be thought that an appropriate estimate of the variance of a
small unit would be the mean square between all small units in the
sample, i.e.
$$s^2 = \frac{(n-1)s_b^2 + n(M-1)s_w^2}{nM-1} \qquad (9.3)$$
This estimate, although in many cases satisfactory, is biased, because
the sample is not a simple random sample of small units, since these
are sampled in contiguous groups of M units.
An unbiased estimate is obtained from the sample by constructing
an analysis of variance, as in table 9.5, for the whole population, which
contains N large units and NM small units.

TABLE 9.5  ANALYSIS OF VARIANCE FOR THE WHOLE POPULATION
(ON A SMALL-UNIT BASIS)

                                         df          ms
Between large units                    (N - 1)      S_b^2
Between small units within large
  units                                N(M - 1)     S_w^2
Between small units in the
  population                           (NM - 1)     S^2 = [(N - 1)S_b^2 + N(M - 1)S_w^2] / (NM - 1)

By its definition, the population variance among small units is given
by the last line of the table, i.e.
$$S^2 = \frac{(N-1)S_b^2 + N(M-1)S_w^2}{NM-1}$$
With simple random sampling, $s_b^2$ in table 9.4 is an unbiased estimate
of $S_b^2$ (this follows from section 2.3). It may be shown easily that
$s_w^2$ is an unbiased estimate of $S_w^2$. Hence an unbiased estimate of the
variance $S^2$ among all small units in the population is
$$\hat{S}^2 = \frac{(N-1)s_b^2 + N(M-1)s_w^2}{NM-1} \qquad (9.4)$$
If n exceeds 50, estimates (9.3) and (9.4) are practically identical,
since both reduce approximately to
$$\hat{S}^2 \doteq \frac{s_b^2 + (M-1)s_w^2}{M} \qquad (9.5)$$
The two estimates, $s_b^2$ (for the original unit) and $\hat{S}^2$ (for the small
unit), are then inserted into the appropriate variance formulas for the
estimated population total.
If the sample is large, the small units may be measured for a random
subsample of the large units (say 100 out of 600). Alternatively, two
small units, chosen at random from each large unit, might be measured.
More than one size of small unit may be investigated simultaneously,
provided that we take data which give an unbiased estimate of $S_w^2$
for each small unit.
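
As an illustration of how the three estimates (9.3), (9.4), and (9.5) relate, here is a minimal Python sketch. It is not part of the original text: the function names, the tiny data set, and the population size N = 200 are all hypothetical, chosen only so that the arithmetic can be followed by hand.

```python
from statistics import mean, variance


def anova_small_unit_basis(data):
    """Mean squares s_b^2 and s_w^2 for n large units of M small units each."""
    n, M = len(data), len(data[0])
    grand = mean(v for row in data for v in row)
    row_means = [mean(row) for row in data]
    # Between-large-unit mean square, expressed on a small-unit basis.
    s_b2 = M * sum((rm - grand) ** 2 for rm in row_means) / (n - 1)
    s_w2 = sum((v - rm) ** 2
               for row, rm in zip(data, row_means) for v in row) / (n * (M - 1))
    return s_b2, s_w2


def small_unit_variance_estimates(s_b2, s_w2, n, M, N):
    """Estimates (9.3), (9.4), (9.5) of the variance among small units."""
    naive = ((n - 1) * s_b2 + n * (M - 1) * s_w2) / (n * M - 1)     # (9.3), biased
    unbiased = ((N - 1) * s_b2 + N * (M - 1) * s_w2) / (N * M - 1)  # (9.4)
    approx = (s_b2 + (M - 1) * s_w2) / M                            # (9.5)
    return naive, unbiased, approx


# Hypothetical sample: n = 4 large units, each split into M = 3 small units.
sample = [[4, 6, 5], [9, 7, 8], [2, 3, 1], [6, 6, 9]]
s_b2, s_w2 = anova_small_unit_basis(sample)
naive, unbiased, approx = small_unit_variance_estimates(s_b2, s_w2, n=4, M=3, N=200)

# The naive estimate (9.3) is just the ordinary variance of all nM values.
pooled = variance([v for row in sample for v in row])
```

With so small an n the naive and unbiased estimates differ noticeably, which is the situation the text warns about; as n grows, both approach the value given by (9.5).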
With stratified sampling, the variances for the large and small units
can be estimated by these methods separately in each stratum, and
then substituted in the appropriate formula for the variance of the
estimate from a stratified sample.
Example. The data come from a farm sample taken in North Carolina
in 1942 in order to estimate farm employment (Finkner, Morgan,
and Monroe, 1943). The method of drawing the sample was to locate
points at random on the map and choose as sampling units the three
farms that were nearest to each point. This method is not
recommended, because a large farm has a greater chance of inclusion in the
sample than a small farm and an isolated farm has a greater chance
than a farm in a densely farmed area. Any effects of this bias will be
ignored in the present illustration.
The sample was stratified, the stratum being a group of townships
that were similar in density of farm population and in ratio of
cropland to farmland. Some data for the sample taken in May are shown
in table 9.6.
TABLE 9.6  SIZE OF POPULATION AND SAMPLE

                           Population     Sample
No. of strata                     587        572
No. of sampling units          72,849       1397
No. of farms                  217,976       4165

It will be noted that a few strata were not sampled and that the
number of farms per unit was slightly under 3. These discrepancies
will be ignored. The sampling ratio was 1.9 per cent.
From the sample data we can compare the group of 3 farms with the
individual farm as sampling unit. We shall use a slightly simpler
analysis than is strictly required. The fpc can be omitted. Since the
sampling was stratified, the variance of the estimated population total is
$$V(\hat{Y}_{st}) = \sum_h \frac{N_h^2 S_h^2}{n_h}$$
The standard procedure is to compute, within each stratum, an
estimate of $S_h^2$ for the two types of unit, and substitute in this formula.
The strata contained in general between 300 and 450 farms, and either
two or three 3-farm units were taken in each stratum so as to make the
sampling approximately proportional. Assuming proportionality, i.e.
$n_h/N_h = n/N$, we may write
$$V(\hat{Y}_{st}) = \frac{N}{n} \sum_h N_h S_h^2 \doteq \frac{N^2}{n} \bar{S}_h^2$$
if we assume further that the $S_h^2$ do not vary greatly among strata, so
that they may be replaced by their average, $\bar{S}_h^2$.
Estimates of $\bar{S}_h^2$ are obtained from the analysis of variance in table
9.7, which is on a single-farm basis.

TABLE 9.7  SAMPLE ANALYSIS OF VARIANCE (NUMBERS OF PAID WORKERS)
(SINGLE-FARM BASIS)

                                       df        ms
Between units within strata           825      6.218
Between farms within units           2768      2.918
Between farms within strata          3593      3.676

For the group of 3 farms, the mean square $s_b^2 = 6.218$ serves as
the estimate of $\bar{S}_h^2$. To obtain an estimate of the variance within
strata for the individual farm, we construct an analysis of variance
for the whole population (table 9.8). The degrees of freedom come
from table 9.6, and the first two mean squares from table 9.7.

TABLE 9.8  CONSTRUCTED ANALYSIS OF VARIANCE FOR THE WHOLE POPULATION

                                       df        ms
Between units within strata        72,262      6.218
Between farms within units        145,127      2.918
Between farms within strata       217,389

The estimated variance between farms within strata is then
computed as
$$\hat{S}^2 = \frac{(72{,}262)(6.218) + (145{,}127)(2.918)}{217{,}389} = 4.015$$

Since the estimated variances, 6.218 for the group of 3 farms and
4.015 for the individual farm, are on a common basis, these two figures
indicate the relative precisions of the two units for a fixed total size of
sample (by theorem 9.1, corollary 2). The group of farms gives only
about two-thirds the precision of the single farm. Consideration of
costs would presumably make the result more favorable to the 3-farm
unit.
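
The constructed analysis of variance in this example amounts to a weighted average of the two sample mean squares, with population degrees of freedom as weights. A short Python sketch of the arithmetic follows; the variable names are illustrative assumptions and not part of the original text, while the counts (587 strata, 72,849 units, 217,976 farms) and mean squares (6.218 and 2.918) are those quoted in the example.

```python
# Population counts and sample mean squares quoted in the example.
strata, units, farms = 587, 72_849, 217_976
ms_between_units = 6.218   # between 3-farm units within strata
ms_within_units = 2.918    # between farms within units

df_between = units - strata   # units within strata
df_within = farms - units     # farms within units
df_total = farms - strata     # farms within strata

# Estimated variance between single farms within strata.
var_single_farm = (df_between * ms_between_units
                   + df_within * ms_within_units) / df_total

# Precision of the 3-farm unit relative to the single farm, same bulk of sample.
relative_precision = var_single_farm / ms_between_units

print(round(var_single_farm, 3), round(relative_precision, 2))
```

The ratio comes out a little under two-thirds, which is the comparison the text draws between the two units.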

9.5 Variance functions. Suppose that in the preceding example we
wished to compare the precision of the 3-farm unit with that of a unit
consisting of 2 or 6 or 10 farms. We would require some method of
predicting the variance $S_b^2$ between units in the population as a
function of M, the size of the unit. By the analysis of variance, $S_b^2$ can
be found if we know (i) the variance $S^2$ between all elements in the
population and (ii) the variance $S_w^2$ between elements that lie in the
same unit. In the method to be presented here, the approach is to
predict $S_w^2$ and $S^2$ and to find $S_b^2$ by the analysis of variance.
The sample data produce estimates of $S^2$ and $S_w^2$ for the size of unit
actually used. Since $S^2$ is the variance among elements, it is not
affected by the size of the unit. However, $S_w^2$ will be affected. It
might be expected to increase as the size of the large unit increases. If
the large units which are to be examined differ little in size from the
unit actually used, a first approximation is to regard $S_w^2$ as constant,
using the estimate given by the sample data. An investigation by
McVay (1947) suggests that this approximation may often be
satisfactory.
As a better approximation, attempts have been made (Jessen, 1942;
Mahalanobis, 1944; Hendricks, 1944) to develop a general law which
predicts how $S_w^2$ changes with the size of unit. In several agricultural
surveys, $S_w^2$ appeared to be related to M by the empirical formula
$$S_w^2 = AM^g \quad (g > 0) \qquad (9.6)$$
where A and g are constants that do not depend on M. In this formula
$S_w^2$ increases steadily as M increases. Usually g is small. A curve of
this type might be expected when there are forces that exert a similar
influence on elements that are close together. Climate, soil type,
topography, and access to markets tend to make neighboring farms
have similar features.
Theoretically, the formula is open to objection, since it makes $S_w^2$
increase without bound as M increases. If we assume, as seems
reasonable, that there is no correlation between elements that are very far
apart, a formula in which $S_w^2$ approaches an upper bound with large
M would be more appropriate. However, any formula will suffice if
it gives a good fit over the range of M that is under investigation.
If this formula fits, log $S_w^2$ should plot as a straight line against
log M. Values of $S_w^2$ for at least two values of M are needed in order
to estimate the constants log A and g. At least three values of M are
necessary for any appraisal of the linearity of the fit.
From the analysis of variance in table 9.5 we find
$$S_b^2 = \frac{(NM-1)S^2 - N(M-1)S_w^2}{N-1} = \frac{(NM-1)S^2 - N(M-1)AM^g}{N-1} \qquad (9.7)$$
If N, the number of large units,* is large, this takes the simpler form
$$S_b^2 \doteq MS^2 - (M-1)AM^g \qquad (9.8)$$
Hendricks (1944) has pointed out that the complete population
might be regarded as a single large sampling unit containing NM
elements. If formula (9.6) holds, then $S^2 = A(NM)^g$. The advantage
of this device is that the values of A and g can now be estimated from
the data for a survey in which only one value of M was used. The
two equations which lead to the estimates are
$$\log S_w^2 = \log A + g \log M$$
$$\log S^2 = \log A + g \log (NM)$$
The formula for $S_b^2$ becomes, from (9.7),
$$S_b^2 = \frac{AM^g[(NM-1)N^g - N(M-1)]}{N-1}$$
This method furnishes no check on the correctness of formula (9.6).
It might happen that the formula held well enough for small values of
M, but failed for a value as large as NM. In this event the more
general formulas (9.7) and (9.8) should be employed.
Formula (9.6) is presented as an example of the methodology rather
than as a general law. The reader who faces a similar problem should
construct and test whatever type of formula seems most appropriate
to his material. In some cases log $S_b^2$ might be a simple function of M.
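
Hendricks' device reduces to solving two linear equations in log A and g. The following Python sketch shows one way this might be done; it is illustrative rather than part of the original text, the function names are assumptions, and the numerical inputs (within-unit variance 3.0 at M = 4, overall variance 4.0, N = 500) are hypothetical.

```python
import math


def fit_variance_function(S_w2, S2, M, N):
    """Solve log S_w^2 = log A + g log M and log S^2 = log A + g log(NM)."""
    g = (math.log(S2) - math.log(S_w2)) / math.log(N)  # log(NM) - log M = log N
    A = S_w2 / M ** g
    return A, g


def between_unit_variance(A, g, S2, M, N=None):
    """S_b^2 from (9.7) when N is supplied, otherwise the large-N form (9.8)."""
    S_w2 = A * M ** g
    if N is None:
        return M * S2 - (M - 1) * S_w2
    return ((N * M - 1) * S2 - N * (M - 1) * S_w2) / (N - 1)


# Hypothetical survey: one unit size M = 4 was used, with N = 500 large units.
A, g = fit_variance_function(S_w2=3.0, S2=4.0, M=4, N=500)

# Predicted between-unit variance for other unit sizes, using (9.8).
predicted = {M: between_unit_variance(A, g, S2=4.0, M=M) for M in (2, 4, 8)}
```

As the text cautions, this fit cannot test formula (9.6) itself: two equations determine two constants exactly, so a good fit at M and NM says nothing about intermediate sizes.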

9.6 A cost function. In an extensive survey the nature of the field
costs plays a large part in determining the optimum unit. As an
illustration of the role of cost factors, we shall describe a cost function
which has been developed by Jessen (1942) for farm surveys in which
the large units are clusters of neighboring farms.
Two components of field cost are distinguished. The component
$c_1 M n$ comprises costs that vary directly with the total number of
elements (farms): thus $c_1$ contains the cost of the interview and the cost
of travel from farm to farm within the cluster.
The second component, $c_2 \sqrt{n}$, measures the cost of travel between
the clusters. Tests on a map showed that this cost, for a fixed
population, varies approximately as the square root of the number of clusters.
Total field cost is therefore
$$C = c_1 M n + c_2 \sqrt{n} \qquad (9.9)$$
* In the references cited, N is usually defined as the number of elements in the
population, so that formulas (9.7) and (9.8) have a different appearance.
Assuming simple random sampling and ignoring the fpc, the variance
of the mean per element $\bar{\bar{y}}$ is $S_b^2 / nM$. From (9.8), this equals
$$V(\bar{\bar{y}}) = \frac{S^2 - (M-1)AM^{g-1}}{n} \qquad (9.10)$$

To determine the optimum size of unit, we find M, and incidentally
n, so as to minimize V for fixed C. The general solution is complicated,
although its application in a numerical problem presents no great
difficulty.
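
A numerical problem of this kind can be handled by direct search: for each candidate M, find the n that exhausts the budget under (9.9), then evaluate (9.10). The Python sketch below is a hedged illustration and not part of the original text. The function names are assumptions, n is treated as continuous, and every cost and variance constant is invented for the purpose.

```python
import math


def clusters_affordable(C, c1, c2, M):
    """Solve C = c1*M*n + c2*sqrt(n) for n, treating n as continuous."""
    root_n = (-c2 + math.sqrt(c2 ** 2 + 4.0 * c1 * M * C)) / (2.0 * c1 * M)
    return root_n ** 2


def variance_of_mean(S2, A, g, M, n):
    """Variance (9.10) of the mean per element, fpc ignored."""
    return (S2 - (M - 1) * A * M ** (g - 1)) / n


def optimum_unit_size(C, c1, c2, S2, A, g, sizes):
    """Return (M, n, V) giving the smallest V among the candidate sizes."""
    best = None
    for M in sizes:
        n = clusters_affordable(C, c1, c2, M)
        V = variance_of_mean(S2, A, g, M, n)
        if best is None or V < best[2]:
            best = (M, n, V)
    return best


# Hypothetical constants, chosen only for illustration.
params = dict(c1=1.0, c2=200.0, S2=4.0, A=2.8, g=0.05)
M_low, n_low, V_low = optimum_unit_size(C=1000.0, sizes=range(1, 41), **params)
M_high, n_high, V_high = optimum_unit_size(C=4000.0, sizes=range(1, 41), **params)
```

With these made-up constants the search settles on a smaller unit when the budget C is raised, which agrees with the qualitative conclusions drawn at the end of this section.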
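By some manipulation we can obtain the equation which gives the
optimum M. First solve the cost equation (9.9) as a quadratic in $\sqrt{n}$.
This gives
$$\frac{2c_1 M \sqrt{n}}{c_2} = \left(1 + \frac{4Cc_1M}{c_2^2}\right)^{1/2} - 1 \qquad (9.11)$$

The equation to be minimized is
$$C + \lambda V = c_1 M n + c_2 \sqrt{n} + \lambda V$$

Differentiating, and noting that $\partial V / \partial n = -V/n$, we obtain the
equations
$$n: \quad c_1 M + \tfrac{1}{2} c_2 n^{-1/2} = \frac{\lambda V}{n} \qquad (9.12)$$
$$M: \quad c_1 n = -\lambda \frac{\partial V}{\partial M} \qquad (9.13)$$
Divide (9.13) by (9.12) so as to eliminate $\lambda$. This leads to
$$-\frac{n}{V} \frac{\partial V}{\partial M} = \frac{c_1 n}{c_1 M + \tfrac{1}{2} c_2 n^{-1/2}}$$
or
$$-\frac{M}{V} \frac{\partial V}{\partial M} = \frac{1}{1 + \dfrac{c_2}{2c_1 M \sqrt{n}}} \qquad (9.14)$$
If we substitute for $\sqrt{n}$ from (9.11), we obtain, after some
simplification,
$$\frac{M}{V} \frac{\partial V}{\partial M} = \left(1 + \frac{4Cc_1M}{c_2^2}\right)^{-1/2} - 1 \qquad (9.15)$$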
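By writing out the left side of this equation in full and changing
signs on both sides, we find
$$\frac{AM^{g-1}[gM - (g-1)]}{S^2 - (M-1)AM^{g-1}} = 1 - \left(1 + \frac{4Cc_1M}{c_2^2}\right)^{-1/2}$$
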
This equation gives the optimum M. The left side does not involve
any of the cost factors, being dependent only on the shape of the
variance function. It follows that for a given population, which will have
some fixed variance function, the optimum M reacts to changes in the
cost factors in such a way that the quantity
$$\frac{Cc_1M}{c_2^2}$$
remains fixed.
Now $c_1$ increases if the length of interview increases, while $c_2$
decreases if travel becomes cheaper, or if the farms in a given area
become more dense. These facts lead to the conclusion that the optimum
size of unit becomes smaller when:
Length of interview increases.
Travel becomes cheaper.
The elements (farms) become more dense.
Total amount of money used (C) increases.
The conclusions are a consequence of the type of cost function and
would require re-examination with a different function. They
illustrate the fact that the optimum unit is not a fixed characteristic of
the population, but depends also on the type of survey and on the
levels of prices and wages.

9.7 Variance in terms of intra-cluster correlation. When the
sampling unit is a cluster of M elements, variance formulas are sometimes
expressed in terms of the correlation coefficient $\rho$ between elements in
the same cluster. An example of this approach has already been given
for systematic sampling (section 8.3).
Let $y_{ij}$ be the observed value for the jth element within the ith unit,
and let $y_i$ be the unit total. In cluster sampling we need to distinguish
between two kinds of average: the mean per unit $\bar{Y} = \sum y_i / N$, and
the mean per element $\bar{\bar{Y}} = \sum y_i / NM = \bar{Y} / M$. The variance among
elements is
$$S^2 = \frac{\sum_i \sum_j (y_{ij} - \bar{\bar{Y}})^2}{NM - 1}$$
The intra-cluster correlation coefficient, $\rho$, may be defined as
$$\rho = \frac{\sum_i \sum_{j<k} (y_{ij} - \bar{\bar{Y}})(y_{ik} - \bar{\bar{Y}})}{NM(M-1)S^2/2}$$
The numerical term in the divisor is the number of cross-products
NM(M - 1)/2 in the sum in the numerator.
Consider the variance of the estimated mean per element $\bar{\bar{y}}$. With a
simple random sample of n complete clusters,
$$V(\bar{\bar{y}}) = \frac{1}{M^2} V(\bar{y}) = \frac{(N-n)}{NnM^2} \frac{\sum (y_i - \bar{Y})^2}{(N-1)} \qquad (9.16)$$
since $\bar{\bar{y}} = \bar{y}/M$. But
$$(y_i - \bar{Y}) = (y_{i1} - \bar{\bar{Y}}) + (y_{i2} - \bar{\bar{Y}}) + \cdots + (y_{iM} - \bar{\bar{Y}})$$
Square and sum over all N clusters,
$$\sum_i (y_i - \bar{Y})^2 = \sum_i \sum_j (y_{ij} - \bar{\bar{Y}})^2 + 2 \sum_i \sum_{j<k} (y_{ij} - \bar{\bar{Y}})(y_{ik} - \bar{\bar{Y}})$$
$$= (NM-1)S^2 + NM(M-1)\rho S^2$$
Substitute in equation (9.16) for $V(\bar{\bar{y}})$:
$$V(\bar{\bar{y}}) = \frac{(N-n)}{N} \cdot \frac{1}{nM} S^2 \left\{ \frac{(NM-1)}{(N-1)M} + \frac{N(M-1)}{(N-1)} \rho \right\}$$
For N large, this reduces to
$$V(\bar{\bar{y}}) \doteq \frac{(N-n)}{N} \frac{1}{nM} S^2 [1 + (M-1)\rho] \qquad (9.17)$$

If a simple random sample of nM elements is taken, the formula for
$V(\bar{\bar{y}})$ is the same as (9.17) except for the term in brackets. The factor
$$1 + (M-1)\rho$$
shows by how much the variance is changed by the use of a cluster
instead of an element as sampling unit. If $\rho > 0$, the cluster is less
precise for a given bulk of sample. If $\rho < 0$, as sometimes happens,
the cluster is more precise.
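
The definition of the intra-cluster correlation and the identity used in the derivation can be checked numerically on a small population. The Python sketch below is illustrative only and not part of the original text; the function name and the four-cluster population are hypothetical.

```python
from itertools import combinations


def intra_cluster_correlation(clusters):
    """rho as defined in the text, for N clusters of M elements each."""
    N, M = len(clusters), len(clusters[0])
    values = [v for c in clusters for v in c]
    mean_el = sum(values) / (N * M)                       # mean per element
    S2 = sum((v - mean_el) ** 2 for v in values) / (N * M - 1)
    # Sum of cross-products over all pairs j < k within each cluster.
    cross = sum((a - mean_el) * (b - mean_el)
                for c in clusters for a, b in combinations(c, 2))
    rho = cross / (N * M * (M - 1) * S2 / 2)
    return rho, S2


# Hypothetical population: N = 4 clusters of M = 3 elements.
population = [[4, 6, 5], [9, 7, 8], [2, 3, 1], [6, 6, 9]]
N, M = len(population), len(population[0])
rho, S2 = intra_cluster_correlation(population)

# Identity from the derivation:
# sum (y_i - Ybar)^2 = (NM - 1) S^2 + NM(M - 1) rho S^2
totals = [sum(c) for c in population]
mean_unit = sum(totals) / N
lhs = sum((t - mean_unit) ** 2 for t in totals)
rhs = (N * M - 1) * S2 + N * M * (M - 1) * rho * S2

factor = 1 + (M - 1) * rho   # variance multiplier in (9.17), N large
```

In this made-up population the elements within a cluster resemble one another, so the correlation is positive and the factor exceeds 1: the cluster is less precise than the same number of elements drawn individually.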

9.8 Cluster sampling for proportions. The same techniques apply to
cluster sampling for proportions. Suppose that the M elements in any
cluster can be classified into two classes, and that $p_i = a_i / M$ is the
proportion in class C in the ith cluster. A simple random sample of n
clusters is taken, and the average p of the observed $p_i$ in the sample is
used as the estimate of the population proportion P.
It will be recalled (section 3.2) that we cannot use binomial theory
to find V(p), but must apply the formula for continuous variates to
the $p_i$. This gives
$$V(p) = \frac{(N-n)}{Nn} \frac{\sum_{i=1}^{N} (p_i - P)^2}{N-1} \doteq \frac{(N-n)}{N^2 n} \sum (p_i - P)^2$$

Alternatively, if we take a simple random sample of nM elements, the
variance of p is obtained by binomial theory (theorem 3.2) as
$$V_{bin}(p) = \frac{(NM - nM)}{NM - 1} \frac{PQ}{nM} \doteq \frac{(N-n)}{N} \frac{PQ}{nM}$$
if N is large. Consequently, the factor
$$\frac{V(p)}{V_{bin}(p)} \doteq \frac{M \sum (p_i - P)^2}{NPQ} \qquad (N \text{ large}) \qquad (9.18)$$

shows the relative change in the variance due to the use of clusters.
Numerical values of this factor are helpful in making preliminary
estimates of sample size with cluster sampling. The required sample
size is first estimated by the binomial formula, and then multiplied
by the factor to indicate the size that will be necessary with cluster
sampling. For an illustration, see Cornfield (1951).
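
The two-step sample size calculation described above can be sketched as follows. This Python fragment is an illustration under stated assumptions and not part of the original text: the function name, the five-cluster population, and the binomial sample size of 400 elements are all hypothetical.

```python
def cluster_variance_factor(cluster_props, M):
    """Factor (9.18): M * sum (p_i - P)^2 / (N * P * Q), equal cluster size M."""
    N = len(cluster_props)
    P = sum(cluster_props) / N   # population proportion (equal cluster sizes)
    Q = 1.0 - P
    return M * sum((p - P) ** 2 for p in cluster_props) / (N * P * Q)


# Hypothetical population of N = 5 clusters, each with M = 10 elements.
a = [1, 3, 5, 7, 9]              # number of elements in class C, per cluster
M = 10
factor = cluster_variance_factor([ai / M for ai in a], M)

binomial_n = 400                 # elements required by the binomial formula
cluster_n = factor * binomial_n  # elements required with cluster sampling

print(round(factor, 2), round(cluster_n))
```

Here the cluster proportions vary widely, so the factor is well above 1 and the cluster sample must contain several times as many elements as the binomial calculation alone would suggest.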
If the cluster sizes $M_i$ are variable, the estimate $p = \sum a_i / \sum M_i$
is a ratio estimate. Its variance is given approximately by the formula
(section 6.9)
$$V(p) \doteq \frac{(N-n)}{Nn\bar{M}^2} \frac{\sum_{i=1}^{N} M_i^2 (p_i - P)^2}{N-1}$$

where $\bar{M} = \sum M_i / N$ is the average size of cluster.
If this sample is compared with a simple random sample of $n\bar{M}$
elements, we find, as a generalization of (9.18),
$$\frac{V(p)}{V_{bin}(p)} \doteq \frac{\sum M_i^2 (p_i - P)^2}{N\bar{M}PQ} \qquad (9.19)$$
As with continuous variates, the relationship of size of cluster to
between-cluster variance can be investigated, either by expressing the
factor in equations (9.18) and (9.19) as a function of M or by seeking
a relation between the within-cluster variance and M. If we assign
the value 1 to any unit which falls in class C, and 0 to any other unit,
the fundamental analysis of variance equation for fixed M is
$$NMP(1-P) = M \sum (p_i - P)^2 + M \sum p_i (1 - p_i)$$
Total ss = ss between clusters + ss within clusters
From this relation the mean square within clusters can be computed
and plotted as a function of M. McVay (1947) describes how this
analysis can be used to investigate optimum cluster size.

9.9 Measures of the size of a unit. In many surveys the units vary
in size. A house or a dwelling place, suitably defined, is often a
convenient unit in surveys of human populations, but it may contain
anywhere between 0 and 25 or more persons. In examples of this
kind, we can define the size of the cluster as the number $M_i$ of
elements which it contains.
There are other populations in which obvious differences in size
among units exist, although it is less clear how size is to be measured.
Farms, banks, and restaurants are examples. There are large and
small farms. As a measure of the size of a farm, however, we might
propose the total acreage, or the total acreage available for crops, or
the total value of the farm's production, or still other quantities.
What kind of measure of size is useful to the sampler? Suppose that
some item $y_i$ is to be measured on each of a simple random sample of
farms. The sampler fears that $y_i$ will have a high variance, because
there are some farms which year after year give large values of $y_i$,
whereas others consistently yield small values. What is needed is an
auxiliary variable $x_i$ that is obtainable before the survey is taken and
that predicts whether the value of $y_i$ will be large or small. Thus the
problem of finding a measure of size reduces to that of finding an
auxiliary variate which is highly correlated with $y_i$ in some sense of
this term. The choice among total acreage, total tillable acreage, and
total value is made by examining which of the three has the highest
correlation, on the average, with the items that are included in the
survey. We shall not discuss at present how this average correlation
would be calculated, since our interest is in the general concept of a
good measure of size rather than in a specific definition.
For the same survey, the best measure of size may depend on the
item. If the item has been enumerated in a recent census, it is often
found that the best auxiliary variate $x_i$ is the value of this item at the
previous census. In such cases any general measure of size is inferior
to a separate measure for each item. However, available previous
census data may not include the items that are to be in the new
survey, but may provide several general measures of size.
Given a general measure of size, we may utilize it by one of the two
methods discussed in previous chapters. We may stratify by size.
Since the variance of $y_i$ usually increases with $x_i$, the sampling fraction
should ordinarily be changed from stratum to stratum. Complete
enumeration of the strata with the largest units may be advisable.
The second method is to use a ratio or a regression estimate with $x_i$ as
the auxiliary variate. This allows the stratification to be employed to
control some other factor. A combination of the two techniques is
sometimes fruitful.

9.10 Sampling with probability proportional to size. A third
technique, suggested by Hansen and Hurwitz (1943), is to assign a higher
probability to the large units when the sample is being drawn. This
technique has found its principal use in surveys which employ
subsampling (chapter 11), but it is also applicable to the present problem.
Sampling with probability proportional to size is illustrated in the
following example of a small population of 7 units:

        Measure    Sum of      Assigned
Unit    of size    measures    range
 1         3           3        1-3
 2         8          11        4-11
 3         4          15       12-15
 4         6          21       16-21
 5         4          25       22-25
 6         2          27       26-27
 7         3          30       28-30

The cumulative sum of the measures of size is formed. To select a
unit, we draw a random number between 1 and 30: suppose that this
is 19. In the sum, number 19 falls in unit 4, which covers numbers 16
to 21 inclusive. With this method of drawing, the probability that
any unit is selected is proportional to the measure of size assigned to
the unit.
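The drawing procedure can be sketched in a few lines of Python. This is an illustration only: the function names `unit_for_number` and `pps_draw` are ours, and the list `SIZES` is an assumed set of measures of size chosen to agree with the example (total 30, unit 4 covering numbers 16 to 21).

```python
import bisect
import itertools
import random

# Assumed measures of size for the 7 units (they add to 30).
SIZES = [3, 8, 4, 6, 4, 2, 3]


def unit_for_number(sizes, r):
    """Return the 1-based unit whose assigned range contains the number r."""
    cumulative = list(itertools.accumulate(sizes))  # 3, 11, 15, 21, ...
    # The first unit whose cumulative sum is >= r covers the number r.
    return bisect.bisect_left(cumulative, r) + 1


def pps_draw(sizes, n, rng):
    """Draw n units WITH replacement, probability proportional to size."""
    total = sum(sizes)
    # rng.randint is inclusive at both ends: a random number in 1..total.
    return [unit_for_number(sizes, rng.randint(1, total)) for _ in range(n)]


if __name__ == "__main__":
    print(unit_for_number(SIZES, 19))
    print(pps_draw(SIZES, 3, random.Random(1953)))
```

Because the same unit may be returned more than once, the sketch corresponds to selection with replacement.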
If a second unit is to be selected, the process is repeated with a new
random number between 1 and 30. However, contrary to our previous
practice, we do not forbid the selection of unit 4 a second time.
Selection with replacement is necessary, when $n$ exceeds 1, in order to keep
the probabilities of selection proportional to the sizes. This may be
seen by the extreme case $n = 7$. If selection were made without
replacement, all units would automatically be chosen, even though we
had gone through the procedure of selection with probability
proportional to size. For values of $n$ between 1 and 7, selection without
replacement leads to probabilities that are intermediate between equal
probabilities and probabilities proportional to size.
As a rule, sampling with replacement is less precise than sampling
without replacement. However, when $n/N$ is small, the chance that
the same unit appears twice in the sample is small, and sampling with
replacement is practically equivalent to sampling without
replacement.*

9.11 Theory for selection with arbitrary probabilities. We shall first
establish a few general formulas under selection with any system of
probabilities. Let $z_i$ be the probability of selection of the $i$th unit,
where the $z_i$ are any set of positive numbers which add to unity. A
sample of size $n$ is selected with replacement as described in the
previous section.
Let $t_i$ be the number of times that the $i$th unit appears in a specific
sample of size $n$, where $t_i$ may have any of the values 0, 1, 2, ..., $n$.
Consider the joint frequency distribution of the $t_i$ for all $N$ units in the
population.
The method of drawing the sample is equivalent to the standard
probability problem in which $n$ balls are thrown into $N$ boxes, the
probability that a ball goes into the $i$th box being $z_i$ at every throw.
Consequently, the joint distribution of the $t_i$ is the multinomial
expression:

$$\frac{n!}{t_1!\, t_2! \cdots t_N!}\; z_1^{t_1} z_2^{t_2} \cdots z_N^{t_N}$$

For the multinomial, the following properties of the distribution of
the $t_i$ are well known:

$$E(t_i) = n z_i; \qquad V(t_i) = n z_i (1 - z_i); \qquad \text{cov}(t_i t_j) = -n z_i z_j$$

The sample mean under this system of selection is denoted by $\bar{y}_z$.
The mean is computed in the ordinary way by adding the item values
for the $n$ units in the sample and dividing by $n$. This implies that, if
a unit has been drawn twice in the sample, its item value receives a
weight 2 in the computation of $\bar{y}_z$.

* Sen (1952) has developed variance formulas for a system of selection in which
the first unit is chosen with probability proportional to size and subsequent units
are chosen with equal probabilities and without replacement. General methods
for sampling without replacement and with unequal probabilities have also been
developed by Thompson (1952), Midzuno (1950), and Horvitz and Thompson
(1952).
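Theorem 9.2 If $\bar{y}_z$ is the mean of a sample drawn with probability
proportional to $z_i$,

$$E(\bar{y}_z) = \sum_{i=1}^{N} z_i y_i = \bar{Y}_z \quad \text{(say)} \qquad (9.20)$$

$$V(\bar{y}_z) = E(\bar{y}_z - \bar{Y}_z)^2 = \frac{1}{n} \sum_{i=1}^{N} z_i (y_i - \bar{Y}_z)^2 \qquad (9.21)$$

Proof: Since any unit in the sample is weighted by the number of
times $t_i$ that it has been drawn, we may write

$$\bar{y}_z = \frac{1}{n} (t_1 y_1 + t_2 y_2 + \cdots + t_N y_N) = \frac{1}{n} \sum_{i=1}^{N} t_i y_i$$

Note that all the $y_i$ in the population appear in this expression. In
repeated sampling, the $t$'s are the random variables, whereas the $y_i$ are
a set of fixed numbers. Hence

$$E(\bar{y}_z) = \frac{1}{n} \sum_{i=1}^{N} y_i E(t_i) = \sum_{i=1}^{N} z_i y_i = \bar{Y}_z$$

Further,

$$V(\bar{y}_z) = \frac{1}{n^2} \left[ \sum_{i=1}^{N} y_i^2 V(t_i) + 2 \sum_{i<j} y_i y_j \,\text{cov}(t_i t_j) \right]$$

Substituting the multinomial variances and covariances, and using the
fact that the $z_i$ add to unity,

$$V(\bar{y}_z) = \frac{1}{n} \left[ \sum z_i (1 - z_i) y_i^2 - 2 \sum_{i<j} z_i z_j y_i y_j \right] = \frac{1}{n} \left[ \sum z_i y_i^2 - \bar{Y}_z^2 \right] = \frac{1}{n} \sum z_i (y_i - \bar{Y}_z)^2$$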

Theorem 9.3 An unbiased estimate of $V(\bar{y}_z)$ from the sample is

$$v(\bar{y}_z) = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_z)^2}{n(n - 1)}$$

Proof: By the usual algebraic identity, we may write

$$\sum^{n} (y_i - \bar{y}_z)^2 = \sum^{n} (y_i - \bar{Y}_z)^2 - n(\bar{y}_z - \bar{Y}_z)^2$$

Hence

$$E \sum^{n} (y_i - \bar{y}_z)^2 = E \sum^{n} (y_i - \bar{Y}_z)^2 - nV(\bar{y}_z)$$

since, by the definition of $V(\bar{y}_z)$, the mean value of the second term on
the right is $-nV(\bar{y}_z)$. Introducing the variables $t_i$, we have

$$E \sum^{n} (y_i - \bar{y}_z)^2 = E \sum^{N} t_i (y_i - \bar{Y}_z)^2 - nV(\bar{y}_z)$$

We may now regard the $y_i$ as fixed quantities and the $t_i$ as the random
variables. Since $E(t_i) = n z_i$,

$$E \sum^{n} (y_i - \bar{y}_z)^2 = n \sum^{N} z_i (y_i - \bar{Y}_z)^2 - nV(\bar{y}_z) = n^2 V(\bar{y}_z) - nV(\bar{y}_z) = n(n - 1)V(\bar{y}_z)$$

by theorem 9.2. This completes the proof.
Note that, in estimating the variance from the sample, we do not
weight the $y_i$ by the $z_i$, because this weighting has already been
introduced in the selection of the sample.
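Theorems 9.2 and 9.3 can be checked exactly on a small population by listing every ordered sample that selection with replacement can produce. The sketch below is ours, not part of the original development: the three-unit population and its probabilities are invented for illustration, and exact fractions are used so that the comparisons involve no rounding.

```python
from fractions import Fraction
from itertools import product


def check_theorems(y, z, n):
    """Compare formulas (9.20), (9.21) and theorem 9.3 with exact enumeration.

    y: item values for the N units; z: selection probabilities (Fractions
    adding to 1); n: sample size (n >= 2), drawn with replacement.
    """
    Y_z = sum(zi * yi for zi, yi in zip(z, y))                # (9.20)
    V_formula = sum(zi * (yi - Y_z) ** 2 for zi, yi in zip(z, y)) / n  # (9.21)

    mean_of_means = Fraction(0)
    var_of_means = Fraction(0)
    mean_of_v = Fraction(0)
    for idx in product(range(len(y)), repeat=n):              # every ordered sample
        prob = Fraction(1)
        for i in idx:
            prob *= z[i]
        values = [y[i] for i in idx]
        ybar = Fraction(sum(values), n)
        v = sum((yi - ybar) ** 2 for yi in values) / (n * (n - 1))
        mean_of_means += prob * ybar
        var_of_means += prob * (ybar - Y_z) ** 2
        mean_of_v += prob * v
    return Y_z, V_formula, mean_of_means, var_of_means, mean_of_v


# Hypothetical population, for illustration only.
y_values = [2, 5, 9]
z_probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
results = check_theorems(y_values, z_probs, 3)
```

In the returned tuple the first and third entries should agree (unbiasedness of the mean), and the second, fourth, and fifth should agree (the variance formula and the unbiasedness of its estimate).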
These results may now be applied to the estimation of a population
total when the units are of unequal size.
Theorem 9.4 A sample of size $n$ is drawn with probability
proportional to measures of size $z_i = M_i / \sum M_i$. The item totals for the
units in the sample are $y_1, y_2, \ldots, y_n$, where the same unit may appear
more than once, since sampling is with replacement. As an estimate
of the population total $Y$ we take $\hat{Y}_{pps}$ (probability proportional to
size), where

$$\hat{Y}_{pps} = \frac{1}{n} \left( \frac{y_1}{z_1} + \frac{y_2}{z_2} + \cdots + \frac{y_n}{z_n} \right) \qquad (9.22)$$

Then $\hat{Y}_{pps}$ is an unbiased estimate of $Y$, with variance

$$V(\hat{Y}_{pps}) = E(\hat{Y}_{pps} - Y)^2 = \frac{1}{n} \sum_{i=1}^{N} z_i \left( \frac{y_i}{z_i} - Y \right)^2 \qquad (9.23)$$

Proof: Apply theorem 9.2 to the variate $y_i/z_i$. Then

$$E(\hat{Y}_{pps}) = \sum_{i=1}^{N} z_i \left( \frac{y_i}{z_i} \right) = \sum_{i=1}^{N} y_i = Y$$

$$V(\hat{Y}_{pps}) = \frac{1}{n} \sum_{i=1}^{N} z_i \left( \frac{y_i}{z_i} - Y \right)^2$$
The result is exact for any size of sample. The variance contains no
fpc, because the sampling is with replacement. The estimate $\hat{Y}_{pps}$ is
not suitable for quick computation, since it involves finding the ratio
$y_i/z_i$ on every unit in the sample, and is unlikely to be widely used.
With this estimate, the optimum measures of size are the set of
numbers $M_i$ for which the $z_i$ minimize

$$V(\hat{Y}_{pps}) = \frac{1}{n} \sum z_i \left( \frac{y_i}{z_i} - Y \right)^2 = \frac{1}{n} \left\{ \sum \frac{y_i^2}{z_i} - Y^2 \right\}$$

If the $y_i$ are all positive, it is easy to see that $V(\hat{Y}_{pps})$ is zero when
$M_i \propto y_i$. Consequently the best measures of size are the item totals $y_i$
on the units. This result is not of practical interest, because, if the $y_i$
were known in advance, the sample would be unnecessary. The result
suggests, however, that if the items are relatively stable through time,
the most recently available previous values of the $y_i$ may be the best
measure of size to adopt.
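As a rough illustration of the estimate (9.22) and its variance (9.23), the following sketch computes both for a small population. The function names and all the data are ours and purely hypothetical; the measures of size are assumed values adding to 30. It also shows the variance falling to zero when the item totals are exactly proportional to the measures of size.

```python
from fractions import Fraction


def pps_total_estimate(sample_y, sample_z):
    """Estimate (9.22): the average of y_i / z_i over the n draws."""
    n = len(sample_y)
    return sum(y / z for y, z in zip(sample_y, sample_z)) / n


def pps_total_variance(y, z, n):
    """Variance (9.23), computed from the complete population."""
    Y = sum(y)
    return sum(zi * (yi / zi - Y) ** 2 for yi, zi in zip(y, z)) / n


# Hypothetical data, for illustration only.
sizes = [3, 8, 4, 6, 4, 2, 3]
z = [Fraction(M, sum(sizes)) for M in sizes]
y_proportional = [2 * M for M in sizes]      # item totals exactly proportional to size
y_other = [5, 20, 7, 15, 12, 3, 10]          # item totals only roughly related to size

var_proportional = pps_total_variance(y_proportional, z, 2)
var_other = pps_total_variance(y_other, z, 2)

# A sample in which unit 4 happens to be drawn twice (indices are 0-based).
drawn = [3, 3]
estimate = pps_total_estimate([y_proportional[i] for i in drawn],
                              [z[i] for i in drawn])
```

With proportional item totals every ratio $y_i/z_i$ equals $Y$, so any sample returns the population total exactly.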

9.12 Comparison with the ratio estimate. A comparison between
$\hat{Y}_{pps}$ and $\hat{Y}_R$, the ratio estimate derived from sampling with equal
probabilities, is of some practical interest. Since $V(\hat{Y}_R)$ is known only
for large samples, the comparison must be restricted to this case. For
the ratio estimate,

$$\hat{Y}_R = \frac{\sum^{n} y_i}{\sum^{n} M_i} \sum^{N} M_i$$

By theorem 6.1, the approximate variance of a ratio estimate is

$$V(\hat{Y}_R) = \frac{N(N - n)}{n(N - 1)} \sum_{i=1}^{N} (y_i - R x_i)^2$$

Since $x_i = M_i$, and $R = Y/(\sum M_i)$, this may be written

$$V(\hat{Y}_R) = \frac{N(N - n)}{n(N - 1)} \sum_{i=1}^{N} \left( y_i - \frac{M_i Y}{\sum M_i} \right)^2 = \frac{N(N - n)}{n(N - 1)} \sum_{i=1}^{N} (y_i - Y z_i)^2 \qquad (9.24)$$

For the unbiased estimate with pps sampling, we have from (9.23)

$$V(\hat{Y}_{pps}) = \frac{1}{n} \sum_{i=1}^{N} z_i \left( \frac{y_i}{z_i} - Y \right)^2 = \frac{1}{n} \sum_{i=1}^{N} \frac{1}{z_i} (y_i - Y z_i)^2 \qquad (9.25)$$
Assuming $n/N$ small, the two comparable quantities are

$$nV(\hat{Y}_R) = N \sum (y_i - Y z_i)^2 \qquad (9.26)$$

$$nV(\hat{Y}_{pps}) = \sum \frac{1}{z_i} (y_i - Y z_i)^2 \qquad (9.27)$$

For some populations the ratio estimate is superior, and for others
the pps estimate. I do not know any simple rule for predicting the
better estimate: formulas (9.26) and (9.27) can be used to make the
comparison when population data are available.
One result will be presented. Suppose that

$$y_i = Y z_i + e_i$$

where $e_i$ is independent of $z_i$ in the probability sense. In arrays in
which $z_i$ is fixed, we assume that

$$E(e_i) = 0; \qquad E(e_i^2) = a z_i^g \quad (g > 0)$$

If $g = 1$, this model satisfies the conditions in which the ratio estimate
is a best linear unbiased estimate (section 6.8).
From (9.26),

$$nV(\hat{Y}_R) = N \sum^{N} e_i^2 = N^2 E(e_i^2) = aN^2 E(z_i^g)$$

where the average is now taken over all values of $z_i$.
Similarly, from (9.27),

$$nV(\hat{Y}_{pps}) = \sum^{N} \frac{e_i^2}{z_i} = aN E(z_i^{g-1})$$

Hence

$$V(\hat{Y}_R) > V(\hat{Y}_{pps}) \quad \text{if} \quad E(z_i^g) - \frac{1}{N} E(z_i^{g-1}) > 0$$

Since $E(z_i) = 1/N$, because the $z_i$ add to unity, the inequality may be
written

$$E(z_i \cdot z_i^{g-1}) - E(z_i) E(z_i^{g-1}) > 0$$

This expression is the covariance of the variates $z_i$ and $z_i^{g-1}$. The
covariance vanishes if $g = 1$. If $g > 1$, the covariance is positive,
since the variate $z_i$ lies between 0 and 1. If $0 < g < 1$, the covariance
is negative.
To summarize, we have assumed that the relation between $y_i$ and
$z_i$ is a straight line through the origin. If the variance of $y_i$ about this
line increases faster than $z_i$, pps sampling is more precise. In the few
studies that have reported data exhibiting the relation between $V(e_i)$
and $z_i$, the variance has increased at a rate somewhere between $a z_i$
and $a z_i^2$.
This comparison was made for equal sample sizes in the two methods.
If it costs more to obtain data from a large unit than from a small one,
the comparison is biased in favor of pps sampling, which tends to
concentrate on the larger units. Further, the ratio estimate is simpler
to compute.
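When population data are available, the two comparable quantities $N \sum (y_i - Y z_i)^2$ for the ratio estimate and $\sum (y_i - Y z_i)^2 / z_i$ for the pps estimate can be computed directly. The sketch below is a minimal illustration under that assumption: the function names and both small populations are invented, and $n/N$ is taken to be small.

```python
from fractions import Fraction


def n_var_ratio(y, z):
    """n V(Y_R) as in (9.26): N times the sum of squared residuals."""
    N, Y = len(y), sum(y)
    return N * sum((yi - Y * zi) ** 2 for yi, zi in zip(y, z))


def n_var_pps(y, z):
    """n V(Y_pps) as in (9.27): squared residuals weighted by 1 / z_i."""
    Y = sum(y)
    return sum((yi - Y * zi) ** 2 / zi for yi, zi in zip(y, z))


# Hypothetical population 1: all units the same size, so z_i = 1/N.
y_equal = [3, 7, 8, 10]
z_equal = [Fraction(1, 4)] * 4

# Hypothetical population 2: unequal sizes, largest residual on the largest unit.
y_unequal = [2, 5, 9, 20]
z_unequal = [Fraction(k, 10) for k in (1, 2, 3, 4)]

equal_pair = (n_var_ratio(y_equal, z_equal), n_var_pps(y_equal, z_equal))
unequal_pair = (n_var_ratio(y_unequal, z_unequal), n_var_pps(y_unequal, z_unequal))
```

With equal sizes the two quantities coincide, as they must; in the second population the large residual falls on the largest unit, which favors pps sampling.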

9.13 Extension to stratified sampling. Sampling with probability
proportional to size is likely to be useful when a stratification has been
made by some variable other than size. If the samples within each
stratum are small and the total sample is not very large, we have
seen (section 6.11) that the available variance formulas for ratio
estimates are somewhat suspect and that one of the ratio estimates may
be seriously biased.
With pps sampling, the estimated total is the sum of the estimates
from the separate strata:

$$\hat{Y}_{pps} = \sum_{h} \frac{1}{n_h} \sum_{i=1}^{n_h} \frac{y_{hi}}{z_{hi}}$$

From the previous theorems we obtain

$$V(\hat{Y}_{pps}) = \sum_{h} \frac{1}{n_h} \sum_{i=1}^{N_h} z_{hi} \left( \frac{y_{hi}}{z_{hi}} - Y_h \right)^2 \qquad (9.28)$$

9.14 Exercises.
9.1 For the data in table 9.1, compare the relative net precisions of the
four types of unit when the object is to estimate the total number of seedlings
in the bed with a standard error of 200 seedlings. (Note that the fpc is
involved.)
9.2 For the data in table 6.2 (p. l~) estimate the relative precision of the
household to the individual for estimating the sex ratio and the proportion of
people who had seen a doctor in the past 12 months, assuming simple random
sampling.
9.3 A population consisting of 2500 elements is divided into 10 strata,
each containing 50 large units composed of 5 elements. The analysis of
variance of the population for an item is as follows, on an element basis:

                                          df      ms
Between strata                             9    30.8
Between large units within strata        490     3.0
Between elements within large units     2000     1.8

Ignoring the fpc, is the relative precision of the large to the small unit greater
with simple random sampling than with stratified random sampling
(proportional allocation)?
9.4 A population containing $LNM$ elements is divided into $L$ strata, each
having $N$ large units, each of which contains $M$ small units. The following
quantities come from the analysis of variance of the population, on an element
basis:
$S_1^2$ = Mean square between strata
$S_2^2$ = Mean square between large units within strata
$S_3^2$ = Mean square between elements within strata
If $N$ is large and the fpc is ignored, show that the relative precision of the
large to the small unit (element) is improved by stratification if

$$\frac{(M - 1)}{S_1^2} < \frac{M}{S_2^2} - \frac{1}{S_3^2}$$

9.5 The large units in a population arrange themselves into a finite
number of size classes: all units in class $h$ contain $M_h$ small units. (i) Under what
conditions does sampling with pps give, on the average, the same distribution
of the size classes in the sample as stratification by size of unit, with optimum
allocation for fixed sample size? (ii) If the variance among large units in
class $h$ is $kM_h$, where $k$ is a constant for all classes, what system of
probabilities of selection of the units gives a sample in which the sizes have
approximately the same distribution as a stratified random sample with optimum
allocation for fixed sample size?

9.15 References.
Cornfield, J. (1951). The determination of sample size. Amer. Jour. Pub.
Health, 41, 654-661.
Finkner, A. L., Morgan, J. J., and Monroe, R. J. (1943). Methods of estimating
farm employment from sample data in North Carolina. N. C. Agr. Exp. Sta.
Tech. Bull. 75.
Hansen, M. H., and Hurwitz, W. N. (1943). On the theory of sampling from finite
populations. Ann. Math. Stat., 14, 333-362.
Hendricks, W. A. (1944). The relative efficiencies of groups of farms as sampling
units. Jour. Amer. Stat. Assoc., 39, 367-376.
Homeyer, P. G., and Black, C. A. (1946). Sampling replicated field experiments
on oats for yield determinations. Proc. Soil Sci. Soc. America, 11, 341-344.
Horvitz, D. G., and Thompson, D. J. (1952). A generalization of sampling
without replacement from a finite universe. Jour. Amer. Stat. Assoc., 47, 663-685.
Jessen, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
Johnson, F. A. (1941). A statistical study of sampling methods for tree nursery
inventories. M.S. thesis, Iowa State College.
Mahalanobis, P. C. (1944). On large-scale sample surveys. Phil. Trans. Roy. Soc.
London, B231, 329-451.
McVay, F. E. (1947). Sampling methods applied to estimating numbers of
commercial orchards in a commercial peach area. Jour. Amer. Stat. Assoc., 42,
533-540.
Midzuno, H. (1950). An outline of the theory of sampling systems. Ann. Inst.
Stat. Math. (Japan), 1, 149-156.
Sen, A. R. (1952). Further developments of the theory and application of the
selection of primary sampling units with special reference to the North Carolina
agricultural population. Ph.D. thesis, University of North Carolina.
Sukhatme, P. V. (1947). The problem of plot size in large-scale yield surveys.
Jour. Amer. Stat. Assoc., 42, 297-310.
Thompson, D. J. (1952). A theory of sampling finite universes with unequal
probabilities. Ph.D. thesis, Iowa State College.
CHAPTER 10

SUBSAMPLING WITH UNITS OF EQUAL SIZE

10.1 Introduction. Suppose that each unit in the population can be
divided into a number of smaller units, or elements. A sample of $n$
units has been selected. If elements within a selected unit give similar
results, it seems uneconomical to measure them all. A common
practice is to select and measure a sample of the elements in any chosen
unit. This technique is called subsampling, since the unit is not
measured completely, but is itself sampled. Another name, due to
Mahalanobis, is two-stage sampling, because the sample is taken in two
steps. The first is to select a sample of units, often called the primary
units, and the second is to select a sample of elements from each chosen
primary unit.
Subsampling has a great variety of applications, which go far beyond
the immediate scope of sample surveys. Whenever any process
involves chemical, physical, or biological tests that can be performed on
a small amount of material, this is likely to be drawn as a subsample
from a larger amount which is itself a sample.
In this chapter we consider the simplest case, in which every unit
contains the same number $M$ of elements, of which $m$ are chosen when
any unit is subsampled. A schematic representation of a two-stage
sample, with $M = 9$ and $m = 2$, is shown in figure 10.1.
The principal advantage of two-stage sampling is that it is more
flexible than one-stage sampling. It reduces to one-stage sampling
when $m = M$, but unless this is the best choice for $m$, we have the
opportunity of taking some smaller value that appears more efficient.
As usual, the issue reduces to a balance between statistical precision
and cost. When elements in the same unit agree very closely,
considerations of precision suggest a small value of $m$. On the other hand,
it is sometimes almost as cheap to measure the whole of a unit as to
subsample it, e.g. when the unit is a household and a single
respondent can give accurate data about all members of the household.
Notation. With multistage sampling, the notation is apt to become
troublesome, because it is necessary to distinguish among several kinds
of mean: the mean per primary unit, mean per subunit, and so on.
Figure 10.1 Schematic representation of two-stage sampling ($N = 81$, $n = 3$,
$M = 9$, $m = 2$). In the diagram a crossed square denotes an element in the
sample.

The basic scheme is as follows for two-stage sampling. The symbol
$y_{ij}$ denotes the observation obtained for the $j$th element (subunit) in
the $i$th unit. As before, $y$ and $Y$ denote the sample and population
totals, respectively.

Sample mean per element in the $i$th unit $= \bar{y}_i = \sum_{j=1}^{m} y_{ij}/m$

Overall sample mean per element $= \bar{\bar{y}} = \sum_{i=1}^{n} \bar{y}_i/n = y/nm$

Sample mean per primary unit $= \bar{y} = \sum_{i=1}^{n} y_i/n = y/n = m\bar{\bar{y}}$

Analogous definitions hold for the population means $\bar{Y}_i$, $\bar{\bar{Y}}$, and $\bar{Y}$.
The basis of the notation is that a single bar denotes an average
over any single stage, a double bar an average over two stages. The
subscripts (if any) indicate what is being held constant. Thus the
average of the $y_{ij}$ for fixed $i$ is $\bar{y}_i$: the average of the $\bar{y}_i$ over the units
is $\bar{\bar{y}}$.
Note also that $N$, $n$ are used for the number of primary units, and
$M$, $m$ for the number of subunits per primary unit. Since some authors
reverse these roles, a careful check of the notation is advised when
reading references cited here.
10.2 Elementary theory. The theory of subsampling was developed
in connection with the sampling of the plots in agricultural field
experiments. In these applications both the sampling fraction $n/N$ and the
subsampling fraction $m/M$ are usually small and fpc's can be ignored.
Since the resulting theory is elegant and is adequate for many
applications, we shall describe it first. Actually, the elementary theory
requires only that $n/N$ be negligible, say less than 0.05, as will be seen
when the exact theory is presented.
The observation $y_{ij}$ in the $j$th element of the $i$th unit is assumed to
be of the form

$$y_{ij} = \bar{\bar{Y}} + u_i + w_{ij} \qquad (10.1)$$

where the term $u_i$ represents a component associated with the unit
and constant for all elements in the unit. The term $w_{ij}$ represents a
component of variation from element to element within the unit. The
variates $u_i$ and $w_{ij}$ are all independent in the probability sense and
have zero means. The variates $u_i$ have variance $S_u^2$ ($u$ for unit), and
the $w_{ij}$ have variance $S_w^2$ ($w$ for within).
The values of $N$ and $M$ are assumed infinite. The units are chosen
at random from the population, and the elements at random from the
units.
It is easy to show, as a consequence of the model, that the sample
mean per element $\bar{\bar{y}}$ is an unbiased estimate of $\bar{\bar{Y}}$.
Theorem 10.1 With this model, the variance of the sample mean
per element $\bar{\bar{y}}$ is

$$V(\bar{\bar{y}}) = \frac{S_u^2}{n} + \frac{S_w^2}{nm} \qquad (10.2)$$

Proof: From (10.1) we have

$$\bar{\bar{y}} = \bar{\bar{Y}} + \frac{u_1 + u_2 + \cdots + u_n}{n} + \frac{w_{11} + w_{12} + \cdots + w_{nm}}{nm}$$

Hence, by the formula for the variance of a mean from an infinite
population,

$$V(\bar{\bar{y}}) = E(\bar{\bar{y}} - \bar{\bar{Y}})^2 = \frac{S_u^2}{n} + \frac{S_w^2}{nm}$$
Note that an increase in $m$ diminishes only the contribution from
the variance $S_w^2$ within units: an increase in $n$ diminishes both
components of the variance.
Theorem 10.2 An unbiased estimate of $V(\bar{\bar{y}})$ is obtained from the
sample as

$$v(\bar{\bar{y}}) = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{m^2 n(n - 1)} = \frac{\sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n(n - 1)}$$

The first form on the right is the most convenient for computing.
Proof: By an algebraic identity we have

$$\sum^{n} (\bar{y}_i - \bar{\bar{y}})^2 = \sum^{n} (\bar{y}_i - \bar{\bar{Y}})^2 - n(\bar{\bar{y}} - \bar{\bar{Y}})^2$$

From equation (10.1), averaging over the $m$ elements in the $i$th unit in
the sample,

$$\bar{y}_i - \bar{\bar{Y}} = u_i + \frac{w_{i1} + w_{i2} + \cdots + w_{im}}{m}$$

Hence,

$$E(\bar{y}_i - \bar{\bar{Y}})^2 = S_u^2 + \frac{S_w^2}{m}$$

$$E \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{Y}})^2 = nS_u^2 + \frac{nS_w^2}{m}$$

By theorem 10.1,

$$nE(\bar{\bar{y}} - \bar{\bar{Y}})^2 = nV(\bar{\bar{y}}) = S_u^2 + \frac{S_w^2}{m}$$

Hence, by subtraction,

$$E \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2 = (n - 1) \left( S_u^2 + \frac{S_w^2}{m} \right) = n(n - 1)V(\bar{\bar{y}})$$

This completes the proof.
Note that the estimated variance is computed solely from variation
between units, just as in a design with no subsampling. The variation
within units, although it does not appear explicitly in the estimated
variance, is, however, taken into account, as is evident from the term
$S_w^2/nm$ in $V(\bar{\bar{y}})$ in theorem 10.1.

10.3 Prediction of the variance for other subsampling ratios. From
the mathematical model we can predict the variance of $\bar{\bar{y}}$ for sampling
and subsampling ratios that are different from those used in a survey
for which we have data. This information is helpful in the planning of
future samples on the same type of population.
Suppose that the initial sample has $n$ units and $m$ elements per unit.
If these numbers are changed to $n'$ and $m'$, respectively, then by
theorem 10.1 the variance of the sample mean becomes

$$V(\bar{\bar{y}}') = \frac{S_u^2}{n'} + \frac{S_w^2}{n'm'} \qquad (10.3)$$

In order to utilize this formula, sample estimates of $S_u^2$ and $S_w^2$ are
required. These are obtained from the analysis of variance of the
sample data, shown in table 10.1. The right-hand column indicates

TABLE 10.1 ANALYSIS OF VARIANCE OF THE SAMPLE (ON AN ELEMENT BASIS)

                                 df         ms                                                            Estimate of
Between units                    (n - 1)    $s_b^2 = m \sum (\bar{y}_i - \bar{\bar{y}})^2 / (n - 1)$      $S_w^2 + mS_u^2$
Within units between elements    n(m - 1)   $s_w^2 = \sum\sum (y_{ij} - \bar{y}_i)^2 / [n(m - 1)]$        $S_w^2$

the quantities of which the mean squares $s_b^2$ and $s_w^2$ are unbiased
estimates. The result for $s_b^2$ follows at once from theorem 10.2, since
$s_b^2 = nm\,v(\bar{\bar{y}})$. The result for $s_w^2$ is easily verified. Consequently, an
unbiased estimate of $S_u^2$ is

$$s_u^2 = \frac{s_b^2 - s_w^2}{m}$$

Hence an unbiased estimate of $V(\bar{\bar{y}}')$ is

$$v(\bar{\bar{y}}') = \frac{s_u^2}{n'} + \frac{s_w^2}{n'm'} = \frac{1}{n'} \left[ \frac{s_b^2}{m} + s_w^2 \left( \frac{1}{m'} - \frac{1}{m} \right) \right] \qquad (10.4)$$
Example. King and Jebe (1940) report the following analysis of
variance in sampling wheat fields in North Dakota, 1938. Two small
samples were taken from each field, and the fields were stratified by
districts.

TABLE 10.2 ANALYSIS OF VARIANCE OF WHEAT YIELDS (BUSHELS PER ACRE)*

Between fields within districts                $s_b^2 = 180$
Within fields between subunits (elements)      $s_w^2 = 38$

* Since the analysis presented by King and Jebe refers to a field mean, the mean
squares have been multiplied by 2 to place it on a subunit basis.

Since the fields were not chosen at random, but by following routes
designed so as to give good coverage of the area, the mean square
between fields may be an overestimate of the variance that would be
obtained from a random sample of fields. This disturbance and the
effects of variation in field size will be ignored.
We will examine how the variance of the sample mean is affected by:
i. Doubling the number of fields, with two subsamples per field.
ii. Keeping the number of fields unchanged, but taking four
subsamples per field.
iii. Keeping the number of fields unchanged, but completely
harvesting the fields.
Let $n$ denote the number of fields in the original sample. From
formula (10.4), the following estimated variances for the sample mean
are obtained (note that $m = 2$):

Original sample ($n' = n$, $m' = 2$): $v = \frac{1}{n} \left( \frac{180}{2} \right) = \frac{90}{n}$

Case i ($n' = 2n$, $m' = 2$): $v_i = \frac{1}{2n} \left( \frac{180}{2} \right) = \frac{45}{n}$

Case ii ($n' = n$, $m' = 4$): $v_{ii} = \frac{1}{n} \left( \frac{180}{2} - \frac{38}{4} \right) = \frac{80.5}{n}$

Case iii ($n' = n$, $m' = \infty$): $v_{iii} = \frac{1}{n} \left( \frac{180}{2} - \frac{38}{2} \right) = \frac{71}{n}$

In case iii complete harvesting is assumed to be equivalent to taking
all possible subunits from every field in the sample. Since the size of
the subunit was very small compared to the size of a field, this implies
that $m' = \infty$.
Cases ii and iii show that increases in the subsampling ratio, keeping
the number of fields constant, produce only modest reductions in the
variance. If a marked increase in precision is wanted, the number of
fields must be increased.
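The arithmetic of the example can be reproduced with a short function for formula (10.4). This is a sketch under stated assumptions: the mean squares are taken as $s_b^2 = 180$ and $s_w^2 = 38$ with $m = 2$, the function name is ours, and the result is expressed as $n$ times the estimated variance so that the unknown number of fields $n$ need not be supplied.

```python
import math


def n_times_predicted_variance(s_b2, s_w2, m, field_ratio, m_new):
    """Return n * v for a design with n' = field_ratio * n units and m' = m_new.

    Uses v = (1/n') [ s_b2/m + s_w2 (1/m' - 1/m) ], formula (10.4).
    m_new may be math.inf to represent complete enumeration of a unit
    whose subunits are very small.
    """
    return (s_b2 / m + s_w2 * (1.0 / m_new - 1.0 / m)) / field_ratio


S_B2, S_W2, M_ORIGINAL = 180.0, 38.0, 2   # assumed values for the wheat example

original = n_times_predicted_variance(S_B2, S_W2, M_ORIGINAL, 1, 2)
case_i = n_times_predicted_variance(S_B2, S_W2, M_ORIGINAL, 2, 2)
case_ii = n_times_predicted_variance(S_B2, S_W2, M_ORIGINAL, 1, 4)
case_iii = n_times_predicted_variance(S_B2, S_W2, M_ORIGINAL, 1, math.inf)
```

Dividing each returned value by $n$ gives the estimated variance of the sample mean for the corresponding design.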

10.4 General theory. We now drop the assumptions that $n/N$ and
$m/M$ are small and that the mathematical model holds. It is still
convenient to express the variances in terms of the quantities $S_u^2$ and
$S_w^2$, but these must first be defined in terms of the observations $y_{ij}$.
The definitions are constructed from the analysis of variance for the
complete population, shown in table 10.3. Thus, $S_u^2$ and $S_w^2$ are
defined so that the two equations stated in the two lines of the analysis
of variance are valid. These definitions were chosen because they
enable the general theory to be expressed as a natural extension of
the elementary theory.

TABLE 10.3 ANALYSIS OF VARIANCE FOR THE COMPLETE POPULATION
(ON AN ELEMENT BASIS)

                                 df         Defined as equal to
Between units                    (N - 1)    $M \sum_{i=1}^{N} (\bar{Y}_i - \bar{\bar{Y}})^2 / (N - 1) = S_w^2 + MS_u^2$
Within units between elements    N(M - 1)   $\sum_{i=1}^{N} \sum_{j=1}^{M} (y_{ij} - \bar{Y}_i)^2 / [N(M - 1)] = S_w^2$

With some populations the quantity $S_u^2$ may turn out to be
negative; this happens when elements in the same unit are negatively
correlated. Any feeling of discomfort created by the appearance of a
negative variance can be avoided by expressing the results in terms of
$S^2$ and $\rho$ (the intra-unit correlation coefficient) instead of $S_w^2$ and
$S_u^2$. However, all formulas remain correct when $S_u^2$ is negative.
In two-stage sampling, expected values must be found not only over
all possible samples of $n$ units that can be drawn, but also over all
possible subsamples that can be drawn from the selected set of units. It
is often helpful to perform the averaging in successive stages. For this
purpose we introduce two symbols:

$E_i$ = Average over all subsamples from the $i$th unit.

$E_n$ = Average over all subsamples from a fixed set of $n$ units.

If the $m$ elements in any chosen unit are selected by simple random
sampling,

$$E_i(\bar{y}_i) = \bar{Y}_i$$

Hence

$$E_n(\bar{\bar{y}}) = \bar{\bar{Y}}_n$$

where $\bar{\bar{Y}}_n$ denotes the mean that would be obtained if the $n$ units in
the sample were enumerated completely. If, further, the $n$ units are
also chosen by simple random sampling,

$$E(\bar{\bar{Y}}_n) = \bar{\bar{Y}}$$

This shows that $\bar{\bar{y}}$ is an unbiased estimate of $\bar{\bar{Y}}$.
Theorem 10.3 If the $n$ units and $m$ elements from each chosen unit
are selected by simple random sampling,

$$V(\bar{\bar{y}}) = E(\bar{\bar{y}} - \bar{\bar{Y}})^2 = \frac{(N - n)}{N} \frac{S_u^2}{n} + \frac{(MN - mn)}{MN} \frac{S_w^2}{mn} \qquad (10.5)$$

Proof: Write

$$\bar{\bar{y}} - \bar{\bar{Y}} = (\bar{\bar{y}} - \bar{\bar{Y}}_n) + (\bar{\bar{Y}}_n - \bar{\bar{Y}}) \qquad (10.5')$$

When we square both sides and take the average over any fixed set of
$n$ units, there is no contribution from the cross-product term on the
right, since

$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n) = 0$$

Consider the first term on the right. Each of the $n$ units may be
regarded as a stratum composed of $M$ elements. The sample from
these units is a proportionally stratified sample, since $m$ elements are
taken from every stratum. Consequently, the formula in theorem 5.3,
p. 69, for the variance of the mean of a stratified random sample may
be applied. This gives

$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{1}{(nM)^2} \sum_{i=1}^{n} \frac{M(M - m)}{m} S_{wi}^2$$

where $S_{wi}^2$ is the variance within the $i$th unit. This may be rewritten

$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M} \cdot \frac{1}{mn} S_{wn}^2 \qquad (10.6)$$

where $S_{wn}^2$ is the average variance within these $n$ units. If we further
average over all possible sets of $n$, it is clear that the average of $S_{wn}^2$
is $S_w^2$, as defined from table 10.3. Hence

$$E(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M} \cdot \frac{1}{mn} S_w^2 \qquad (10.7)$$

The contribution from the second term on the right of equation (10.5')
presents no difficulty, since $\bar{\bar{Y}}_n$ is the mean of the values $\bar{Y}_i$ for a simple
random sample of $n$ units. Consequently, by theorem 2.2, p. 15,

$$E(\bar{\bar{Y}}_n - \bar{\bar{Y}})^2 = \frac{(N - n)}{Nn} \cdot \frac{\sum_{i=1}^{N} (\bar{Y}_i - \bar{\bar{Y}})^2}{(N - 1)} = \frac{(N - n)}{Nn} \left( S_u^2 + \frac{S_w^2}{M} \right) \qquad (10.8)$$

by the definition of $S_u^2$ in table 10.3.
From (10.7) and (10.8) we obtain finally,

$$V(\bar{\bar{y}}) = E(\bar{\bar{y}} - \bar{\bar{Y}})^2 = \frac{(M - m)}{M} \frac{S_w^2}{mn} + \frac{(N - n)}{Nn} \left( S_u^2 + \frac{S_w^2}{M} \right) = \frac{(N - n)}{N} \frac{S_u^2}{n} + \frac{(MN - mn)}{MN} \frac{S_w^2}{mn}$$

If $n/N$ is negligible, $V(\bar{\bar{y}})$ reduces to

$$\frac{S_u^2}{n} + \frac{S_w^2}{mn}$$

in agreement with the elementary theory.
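The variance formula for two-stage simple random sampling, $V = \frac{(N-n)}{N}\frac{S_u^2}{n} + \frac{(MN-mn)}{MN}\frac{S_w^2}{mn}$, can be verified exactly on a small finite population by listing every possible two-stage sample. The sketch below does this with exact fractions. The population of 4 units with 3 elements each is invented for illustration, and the function names are ours; $S_u^2$ and $S_w^2$ are computed from the population analysis of variance (between-unit mean square $= S_w^2 + MS_u^2$).

```python
from fractions import Fraction
from itertools import combinations, product


def population_components(pop):
    """Return (mean per element, S_u^2, S_w^2) from the population ANOVA."""
    N, M = len(pop), len(pop[0])
    unit_means = [Fraction(sum(unit), M) for unit in pop]
    grand = sum(unit_means) / N
    between_ms = M * sum((ybar - grand) ** 2 for ybar in unit_means) / (N - 1)
    S_w2 = sum((y - ybar) ** 2
               for unit, ybar in zip(pop, unit_means)
               for y in unit) / Fraction(N * (M - 1))
    S_u2 = (between_ms - S_w2) / M        # may be negative for some populations
    return grand, S_u2, S_w2


def variance_by_formula(pop, n, m):
    N, M = len(pop), len(pop[0])
    _, S_u2, S_w2 = population_components(pop)
    return (Fraction(N - n, N) * S_u2 / n
            + Fraction(M * N - m * n, M * N) * S_w2 / (m * n))


def variance_by_enumeration(pop, n, m):
    """Average squared error of the sample mean over all two-stage samples."""
    grand = population_components(pop)[0]
    total, count = Fraction(0), 0
    for units in combinations(range(len(pop)), n):
        subsamples = [list(combinations(pop[i], m)) for i in units]
        for choice in product(*subsamples):   # one subsample from each chosen unit
            mean = Fraction(sum(sum(c) for c in choice), n * m)
            total += (mean - grand) ** 2
            count += 1
    return total / count                      # all samples are equally likely


# Hypothetical population: N = 4 units, M = 3 elements per unit.
population = [[1, 2, 4], [3, 3, 9], [0, 5, 7], [2, 8, 8]]
by_formula = variance_by_formula(population, 2, 2)
by_enumeration = variance_by_enumeration(population, 2, 2)
```

The simple average over samples is legitimate here because every unit contains the same number of elements, so each two-stage sample has the same probability.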

10.5 Estimation of the variance in the general case. The first step
is to find sample estimates of $S_u^2$ and $S_w^2$. If the analysis of variance
of the sample is performed as in table 10.1, it turns out that the mean
squares $s_b^2$ and $s_w^2$ still have the expectations given in the table: i.e.

$$E(s_b^2) = S_w^2 + mS_u^2 \qquad (10.9)$$

$$E(s_w^2) = S_w^2 \qquad (10.10)$$

The result for $s_w^2$ is easily verified, but that for $s_b^2$ is less obvious. It
may be proved by straightforward methods as follows.

$$s_b^2 = \frac{m \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n - 1}$$

Now

$$\sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2 = \sum_{i=1}^{n} \bar{y}_i^2 - n\bar{\bar{y}}^2 \qquad (10.11)$$

Write

$$\bar{y}_i = \bar{Y}_i + (\bar{y}_i - \bar{Y}_i)$$

Since $\bar{y}_i$ is the mean of a random subsample of size $m$,

$$E_i(\bar{y}_i^2) = \bar{Y}_i^2 + \frac{(M - m)}{Mm} S_{wi}^2$$

where the average is taken over all subsamples from this unit. Hence,
for a fixed selection of $n$ units,

$$E_n \sum_{i=1}^{n} \bar{y}_i^2 = \sum_{i=1}^{n} \bar{Y}_i^2 + \frac{(M - m)}{Mm} \sum_{i=1}^{n} S_{wi}^2 \qquad (10.12)$$

Further, write

$$\bar{\bar{y}} = \bar{\bar{Y}}_n + (\bar{\bar{y}} - \bar{\bar{Y}}_n)$$

so that

$$E_n \bar{\bar{y}}^2 = \bar{\bar{Y}}_n^2 + E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2$$

where $\bar{\bar{Y}}_n$ is the true mean of all $n$ units in the sample. In equation
(10.6) it was shown that

$$E_n(\bar{\bar{y}} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M} \cdot \frac{1}{mn} S_{wn}^2$$

Hence,

$$E_n\, n\bar{\bar{y}}^2 = n\bar{\bar{Y}}_n^2 + \frac{(M - m)}{Mm} S_{wn}^2 \qquad (10.13)$$

From (10.12) and (10.13),

$$E_n \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2 = \sum_{i=1}^{n} (\bar{Y}_i - \bar{\bar{Y}}_n)^2 + \frac{(n - 1)(M - m)}{Mm} S_{wn}^2$$

i.e.

$$\frac{m}{(n - 1)} E_n \sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2 = m \frac{\sum_{i=1}^{n} (\bar{Y}_i - \bar{\bar{Y}}_n)^2}{n - 1} + \frac{(M - m)}{M} S_{wn}^2$$

Now average over all possible selections of the $n$ units. By theorem
2.4, the first term on the right is an unbiased estimate of $m$ times the
population variance of $\bar{Y}_i$. This gives

$$E(s_b^2) = m \frac{\sum_{i=1}^{N} (\bar{Y}_i - \bar{\bar{Y}})^2}{N - 1} + \frac{(M - m)}{M} S_w^2 = \frac{m}{M} (S_w^2 + MS_u^2) + \left( 1 - \frac{m}{M} \right) S_w^2 = S_w^2 + mS_u^2$$

using the equation in table 10.3 which defines $S_u^2$. This completes
the proof.
These results lead to an unbiased estimate of $V(\bar{\bar{y}})$.

ThtJomn 10.-' An unbiased estima.te of V(fJ) from the sample is

1 {(N - n) .. (M - m) n ,}
tI(fJ) - - I." + I. (10.14)
"'" N M N
10.8 OPTIMUM SAMPLING AND SUB8AMPLING FRACTIONS 226
Proof: Substitute the expected values of .&2, .",2 in 11(71).

EIJ(p) _ ~n {(N; n) (S.2 + mS,,2) + (M ~ m) ;S..2}

(N - n) S"I (MN - mn) S.2


- N
-n + - -
MN
- - mn
by collecting the terms in Slit
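
As an illustration of theorem 10.4, the following is a minimal sketch in Python that computes the two mean squares of table 10.1 and then formula (10.14). The function name `two_stage_variance`, the argument layout, and the small data set are assumptions made for this example only; they do not come from the text.

```python
from statistics import mean


def two_stage_variance(sample, N, M):
    """Estimate V(ybar) by formula (10.14).

    sample: list of n sub-samples, each a list of m observations (m >= 2)
    N: number of units in the population; M: number of elements per unit
    """
    n = len(sample)
    m = len(sample[0])
    unit_means = [mean(unit) for unit in sample]
    grand_mean = mean(unit_means)
    # Between-unit and within-unit mean squares, as in table 10.1.
    s_b2 = m * sum((yb - grand_mean) ** 2 for yb in unit_means) / (n - 1)
    s_w2 = sum(
        (y - yb) ** 2 for unit, yb in zip(sample, unit_means) for y in unit
    ) / (n * (m - 1))
    # Formula (10.14).
    v = ((N - n) / N * s_b2 + (M - m) / M * (n / N) * s_w2) / (m * n)
    return grand_mean, s_b2, s_w2, v


# Hypothetical data: n = 3 units, m = 2 elements each, N = 30, M = 10.
grand_mean, s_b2, s_w2, v = two_stage_variance(
    [[1, 3], [2, 6], [5, 7]], N=30, M=10
)
```

With $m = M$ the second term vanishes, which matches the remark that the formula then becomes that for simple random sampling of the units.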
If $m = M$, formula (10.14) becomes that appropriate to simple
random sampling of the units.
If $n = N$, the formula is equivalent to that for proportional stratified
random sampling, since units may then be regarded as strata, all
of which are being sampled. (Incidentally, two-stage sampling is a
kind of incomplete stratification, with the units as strata.)
In the common situation in which $n/N$ is negligible,
$$v(\bar{\bar{y}}) = \frac{s_b^2}{mn} = \frac{\sum_{i=1}^{n} (\bar{y}_i - \bar{\bar{y}})^2}{n(n - 1)} \tag{10.15}$$
This agrees with the result from the elementary theory, theorem 10.2.
When $m = 1$, the sample provides no estimate $s_w^2$. This does not
matter provided that $n/N$ is negligible, since in that event $s_w^2$ does
not appear in $v(\bar{\bar{y}})$. One application of this result occurs when the
subsampling is systematic. Since a systematic subsample is equivalent
to a simple random subsample with a more complex type of element
and $m = 1$, formula (10.15) remains valid with systematic subsampling
unless $n/N$ is substantial. If the first-stage sampling is systematic,
however, the formula holds only if the systematic sample of the
units is equivalent to a simple random sample.

10.6 Optimum sampling and subsampling fractions. These depend
on the type of cost function. One form that has proved useful is
$$C = c_u n + c_w nm \tag{10.16}$$
The first component of cost, $c_u n$, is proportional to the number of units
in the sample; the second, $c_w nm$, to the total number of elements.
We choose $n$ and $m$ so as to minimize
$$V(\bar{\bar{y}}) + \lambda(C - c_u n - c_w nm) \tag{10.17}$$
Since $m$ enters into $V$ and $C$ only in the combination $nm$, put $k = nm$,
and write (10.17) as
$$\left(\frac{1}{n} - \frac{1}{N}\right)S_u^2 + \left(\frac{1}{k} - \frac{1}{MN}\right)S_w^2 + \lambda(C - c_u n - c_w k)$$
Differentiation with respect to $n$ and $k$ gives, respectively,
$$-\frac{S_u^2}{n^2} = \lambda c_u, \qquad -\frac{S_w^2}{k^2} = \lambda c_w$$
Hence
$$m_{opt} = \frac{k}{n} = \frac{S_w}{S_u}\sqrt{\frac{c_u}{c_w}} \tag{10.18}$$

The structure of this formula is as would be expected. The greater the
variability within units relative to that among units, the higher the
optimum $m$. Similarly, the greater the cost $c_u$ of access to the unit
relative to the cost $c_w$ of obtaining data from any element in the unit,
the higher the optimum $m$.
The value of $n$ is found by solving either the cost equation or the
variance equation, depending on whether cost or variance has been
fixed.
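
The two steps just described (find the optimum $m$, then solve the cost equation for $n$) can be sketched in Python as follows. The function names and the total cost of 140 units are hypothetical choices for illustration; the formula for the optimum is the one implied by the cost function (10.16) and used in the worked example later in this section.

```python
import math


def optimum_m(S_u, S_w, c_u, c_w):
    """Optimum subsample size: (S_w / S_u) * sqrt(c_u / c_w)."""
    return (S_w / S_u) * math.sqrt(c_u / c_w)


def units_for_cost(C, c_u, c_w, m):
    """Solve the cost equation C = c_u*n + c_w*n*m for n."""
    return C / (c_u + c_w * m)


# Values of the worked example: c_u = 10 c_w and S_w = 1.3 S_u.
m_opt = optimum_m(S_u=1.0, S_w=1.3, c_u=10.0, c_w=1.0)
# Hypothetical budget of 140 cost units with m rounded to 4.
n = units_for_cost(C=140.0, c_u=10.0, c_w=1.0, m=4)
```

Here `m_opt` is about 4.1, and the hypothetical budget allows 10 units with 4 elements each.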
For practical use, estimates of $S_w$ and $S_u$ are usually obtained from
the analysis of variance (table 10.1) of a sample with specific values of
$n$ and $m$. The sample estimate of $m_{opt}$ is
$$\hat{m}_{opt} = \frac{\sqrt{mc_u/c_w}}{\sqrt{s_b^2/s_w^2 - 1}} \tag{10.19}$$

The value found will be non-integral. It is sufficient to take the
nearest integer, although Cameron (1951) has pointed out that this is
not quite the best rule. If $\hat{m}_{opt}$ lies between the two integers $m$,
$(m + 1)$, we should choose $(m + 1)$ if $\hat{m}_{opt}^2 > m(m + 1)$: otherwise
we choose $m$ (see exercise 10.3). Thus, if $\hat{m}_{opt}$ is between $1.414 = \sqrt{2}$
and 2, we round upwards to 2. If $\hat{m}_{opt}$ is greater than $M$, or if $s_b^2$ is
less than $s_w^2$, we take $m = M$ and employ one-stage sampling.
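
The estimation and rounding rules can be sketched in Python as below. This is an illustrative reading of the rules in the text, not a published routine: the names `estimate_m_opt` and `round_m` are assumptions, and the estimate is written in the form used later in this section, $\sqrt{mc_u/c_w}/\sqrt{s_b^2/s_w^2 - 1}$.

```python
import math


def estimate_m_opt(s_b2, s_w2, m, c_u, c_w):
    """Sample estimate of the optimum m from the mean squares of table 10.1."""
    if s_b2 <= s_w2:
        # No positive estimate of S_u^2: the text prescribes m = M.
        return math.inf
    return math.sqrt(m * c_u / c_w) / math.sqrt(s_b2 / s_w2 - 1)


def round_m(m_hat, M):
    """Round m_hat to an integer by the rule attributed to Cameron (1951)."""
    if m_hat >= M:
        return M  # one-stage sampling
    lower = math.floor(m_hat)
    # Choose lower + 1 when m_hat^2 > lower * (lower + 1), otherwise lower.
    chosen = lower + 1 if m_hat ** 2 > lower * (lower + 1) else lower
    return max(1, chosen)


m_hat = estimate_m_opt(s_b2=3.367, s_w2=1.0, m=4, c_u=10.0, c_w=1.0)
m_used = round_m(m_hat, M=50)
```

For instance, `round_m(1.5, 50)` gives 2 because $1.5^2 = 2.25 > 2$, whereas `round_m(1.4, 50)` gives 1.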
The estimate $\hat{m}_{opt}$ is itself subject to a sampling error, whose size
depends on several factors, but mainly on the number $n$ of units in the
sample from which $s_b^2$ is computed. Confidence limits for $\hat{m}_{opt}$ tend
to be wide when $n$ is less than 10. However, the optimum is broad,
and an error of a few units in $m$ may produce only a small loss of precision.
The following example presents a method for investigating
this issue.
Example. Let
$$c_u = 10c_w: \quad S_w = 1.3S_u$$
then
$$m_{opt} = 1.3\sqrt{10} = 4.1$$
We will regard total cost as fixed, and see how the variance of $\bar{\bar{y}}$ changes
with $m$. Both $N$ and $M$ are assumed large.
$$V(\bar{\bar{y}}) = \frac{S_u^2}{n} + \frac{S_w^2}{nm}$$
eliminating $n$ by means of the cost equation. In our example, $V$ may
be written
$$V(\bar{\bar{y}}) = \frac{S_u^2 c_w}{C}\left(1 + \frac{S_w^2}{mS_u^2}\right)\left(\frac{c_u}{c_w} + m\right) = \frac{S_u^2 c_w}{C}\left(1 + \frac{1.69}{m}\right)(10 + m)$$
Omitting the constant factor, the relative variance can be calculated
for different values of $m$. Table 10.4 shows these variances and the
relative precisions (with the maximum precision for $m = 4$ taken as
the standard).
TABLE 10.4 RELATIVE VARIANCES AND PRECISIONS FOR DIFFERENT VALUES OF m

m =              1      2      3      4      5      6      7      8      9      10

Rel. variance    29.59  22.14  20.32  19.92  20.07  20.51  21.10  21.80  22.56  23.38
Rel. precision   0.67   0.90   0.98   1.00   0.99   0.97   0.94   0.91   0.88   0.86

For values of $m$ between 2 and 9, inclusive, the loss of precision
relative to the optimum is less than 12 per cent. An interesting application
of this type of analysis to the British monthly surveys of sickness
is given by Gray and Corlett (1950).
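
The entries of table 10.4 can be recomputed with a few lines of Python, as a check on the arithmetic. The sketch below assumes only the two ratios of the example, $S_w^2/S_u^2 = 1.69$ and $c_u/c_w = 10$; the function and variable names are illustrative.

```python
def relative_variance(m, var_ratio=1.69, cost_ratio=10.0):
    """Variance of the mean for fixed cost, omitting the factor S_u^2 c_w / C."""
    return (1 + var_ratio / m) * (cost_ratio + m)


rel_var = {m: relative_variance(m) for m in range(1, 11)}
best = min(rel_var.values())
# Precision relative to the best integer value of m.
rel_prec = {m: best / v for m, v in rel_var.items()}

for m in range(1, 11):
    print(f"m = {m:2d}  rel. variance = {rel_var[m]:6.2f}  "
          f"rel. precision = {rel_prec[m]:.2f}")
```

The minimum over integer $m$ falls at $m = 4$, and the flatness of the curve between $m = 2$ and $m = 9$ is what the text means by a broad optimum.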
We now consider how well $m_{opt}$ can be estimated from an initial
sample with $m = 4$ and various values of $n$. From (10.19)
$$\hat{m}_{opt} = \frac{\sqrt{mc_u/c_w}}{\sqrt{(s_b^2/s_w^2) - 1}} = \frac{6.324}{\sqrt{(s_b^2/s_w^2) - 1}}$$
If the original variates $y_{ij}$ are normally distributed, it is known from
analysis of variance theory that $s_b^2/s_w^2$ is distributed as
$$F\left(1 + \frac{mS_u^2}{S_w^2}\right) = 3.367F$$
where $F$ follows the F-distribution with $(n - 1)$ and $n(m - 1)$ degrees
of freedom (Eisenhart, 1947). This gives
$$\hat{m}_{opt} = \frac{6.324}{\sqrt{3.367F - 1}}$$

This result provides confidence limits for $\hat{m}_{opt}$. Take $n = 10$, and
consider 80 per cent limits. The degrees of freedom are 9 and 30.
From the 10 per cent significance levels of $F$ (Merrington and Thompson,
1943), we find
$$F_{.10}(9, 30) = 1.8490$$
$$F_{.90}(9, 30) = 1/F_{.10}(30, 9) = 1/2.2547 = 0.4435$$
Substitution of these values of $F$ gives
Lower limit: $\hat{m}_{opt} = 2.8$
Upper limit: $\hat{m}_{opt} = 9.0$
As we have seen from table 10.4, any $m$ in this range gives a degree of
precision that is fairly close to the optimum. Thus, with $n = 10$, the
chances are 8 in 10 that the loss in precision is small.
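
The substitution of the two tabulated $F$ values can be reproduced as follows. This Python sketch takes the percentage points 1.8490 and 2.2547 from the text rather than computing them, and the function name `m_opt_from_F` is an assumption for illustration.

```python
import math


def m_opt_from_F(F, m=4, cost_ratio=10.0, var_ratio=1.69):
    """Value of the estimated optimum m when s_b^2/s_w^2 = F * (1 + m S_u^2/S_w^2).

    var_ratio is S_w^2 / S_u^2 and cost_ratio is c_u / c_w, as in the example.
    """
    scale = 1 + m / var_ratio          # 3.367 in the example
    x = scale * F - 1
    if x <= 0:
        return math.inf                # corresponds to single-stage sampling
    return math.sqrt(m * cost_ratio) / math.sqrt(x)


# 80 per cent limits for n = 10 (degrees of freedom 9 and 30).
lower = m_opt_from_F(1.8490)           # upper 10 per cent point of F(9, 30)
upper = m_opt_from_F(1 / 2.2547)       # lower 10 per cent point of F(9, 30)
```

A large $F$ gives the lower limit for $m$ and a small $F$ the upper limit, since the estimate decreases as the ratio of mean squares increases.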
The 80 per cent and 95 per cent confidence limits for $n$ = 5, 10, 20
appear in table 10.5. The upper limits $m = \infty$ which occur in three
cases imply single-stage sampling.

TABLE 10.5 CONFIDENCE LIMITS FOR $\hat{m}_{opt}$

n        80 per cent      95 per cent

5        2.5, ∞           1.8, ∞
10       2.8, 9.0         2.3, ∞
20       3.1, 6.•         2.7, 9.1

To summarise, with $n = 20$ we are almost certain to estimate $m_{opt}$
well enough so that the precision actually attained is near the optimum.
This is not true with $n = 5$. Results may of course be different with
other values of the costs and variances.

10.7 Subsampling for proportions. If the elements are classified into
two classes and we wish to estimate the proportion that falls in the
first class, the preceding formulas can be applied by the usual device
of defining $y_{ij}$ as 1 if the corresponding element falls in this class, and
as zero otherwise. Let $p_i = a_i/m$ be the proportion falling in the first
class in the subsample from the $i$th unit. The two estimated variances,
$s_b^2$ and $s_w^2$, work out as follows:
$$s_b^2 = \frac{m \sum_{i=1}^{n} (p_i - \bar{p})^2}{n - 1}$$
$$s_w^2 = \frac{m}{n(m - 1)} \sum_{i=1}^{n} p_i q_i$$
where $\bar{p} = \sum p_i/n$ and $q_i = 1 - p_i$. Consequently, the formula for the estimated
variance in two-stage sampling is (by theorem 10.4)
$$v(\bar{p}) = \frac{(N - n)}{N} \frac{1}{n(n - 1)} \sum_{i=1}^{n} (p_i - \bar{p})^2 + \frac{(M - m)}{M} \frac{1}{Nn(m - 1)} \sum_{i=1}^{n} p_i q_i$$
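
A short Python sketch of the variance estimate for a proportion follows. It implements the two-term formula obtained from theorem 10.4 with 0/1 observations; the function name and the counts used in the call are hypothetical.

```python
def two_stage_proportion_variance(counts, m, N, M):
    """Estimated variance of the mean proportion in two-stage sampling.

    counts: number of elements in the first class in each of the n subsamples
    m: subsample size per unit (m >= 2); N, M: population numbers
    """
    n = len(counts)
    p = [a / m for a in counts]
    p_bar = sum(p) / n
    between = sum((pi - p_bar) ** 2 for pi in p) / (n * (n - 1))
    within = sum(pi * (1 - pi) for pi in p) / (N * n * (m - 1))
    v = (N - n) / N * between + (M - m) / M * within
    return p_bar, v


# Hypothetical sample: n = 3 units, m = 10 elements each, N = 30, M = 100.
p_bar, v = two_stage_proportion_variance([2, 4, 6], m=10, N=30, M=100)
```

When $n/N$ is negligible the second term drops out, as in formula (10.15).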

10.8 Three-stage sampling. The process of subsampling is sometimes
carried to a third stage by sampling the subunits (elements) instead of
enumerating them completely. For instance, in surveys to estimate
crop production in India (Sukhatme, 1947), the village is a convenient
sampling unit. Within a village, only some of the fields growing the
crop in question are selected, so that the field is a subunit. When a
field is selected, only certain parts of it are cut for the determination
of yield per acre: thus the subunit itself is sampled. If physical or
chemical analyses of the crop are involved, an additional subsampling
may be used, since these determinations are often made on a part of
the sample cut from a field.
Results for the elementary theory will be given briefly. The population
contains $N$ units, each with $M$ subunits, each of which has $K$
subsubunits. The corresponding numbers for the sample are $n$, $m$,
and $k$, respectively. The model is
$$y_{iju} = \bar{\bar{Y}} + u_i + w_{ij} + e_{iju}$$
where the components $u_i$, $w_{ij}$, $e_{iju}$ are all independently distributed
with means zero and variances $S_u^2$, $S_w^2$, and $S_{ww}^2$, respectively. It
follows that the variance of the sample mean per subsubunit is
$$V(\bar{\bar{y}}) = \frac{S_u^2}{n} + \frac{S_w^2}{nm} + \frac{S_{ww}^2}{nmk} \tag{10.20}$$
The sample analysis of variance, on a subsubunit basis, is shown in
table 10.6.

TABLE 10.6 ANALYSIS OF VARIANCE FOR THREE-STAGE SAMPLING

                                   df          ms                                                                            Estimates of

Between units                      n - 1       $s_b^2 = mk \sum_i (\bar{y}_i - \bar{\bar{y}})^2/(n - 1)$                      $S_{ww}^2 + kS_w^2 + mkS_u^2$
Between subunits within units      n(m - 1)    $s_w^2 = k \sum_i \sum_j (\bar{y}_{ij} - \bar{y}_i)^2/n(m - 1)$                $S_{ww}^2 + kS_w^2$
Between subsubunits
  within subunits                  nm(k - 1)   $s_{ww}^2 = \sum_i \sum_j \sum_u (y_{iju} - \bar{y}_{ij})^2/nm(k - 1)$         $S_{ww}^2$

The expectations in the right-hand column are easily verified from
the model. From equation (10.20), an unbiased estimate of $V(\bar{\bar{y}})$ is
$s_b^2/nmk$. As in two-stage sampling, an unbiased estimate can also be
obtained of $V(\bar{\bar{y}})$ for values of $n$, $m$, and $k$ different from those used.
In the general theory the first step is to define the variance components
$S_u^2$, $S_w^2$, and $S_{ww}^2$ by an analysis of variance for the complete
population (table 10.7), analogous to table 10.3 for two-stage sampling.

TABLE 10.7 ANALYSIS OF VARIANCE FOR THE POPULATION

                                   df          ms                                                                     Defined as equal to

Between units                      N - 1       $MK \sum_i (\bar{Y}_i - \bar{\bar{Y}})^2/(N - 1)$                       $S_{ww}^2 + KS_w^2 + MKS_u^2$
Between subunits within units      N(M - 1)    $K \sum_i \sum_j (\bar{Y}_{ij} - \bar{Y}_i)^2/N(M - 1)$                 $S_{ww}^2 + KS_w^2$
Between subsubunits
  within subunits                  NM(K - 1)   $\sum_i \sum_j \sum_u (y_{iju} - \bar{Y}_{ij})^2/NM(K - 1)$             $S_{ww}^2$

Theorem 10.5 If simple random sampling is used at all three stages,
$$V(\bar{\bar{y}}) = \left(\frac{1}{n} - \frac{1}{N}\right)S_u^2 + \left(\frac{1}{nm} - \frac{1}{NM}\right)S_w^2 + \left(\frac{1}{nmk} - \frac{1}{NMK}\right)S_{ww}^2$$
Proof: Since the proof is a natural extension of that for two-stage
sampling, only the principal steps will be indicated. Write
$$\bar{\bar{y}} - \bar{\bar{Y}} = (\bar{\bar{y}} - \bar{\bar{Y}}_{nm}) + (\bar{\bar{Y}}_{nm} - \bar{\bar{Y}}_n) + (\bar{\bar{Y}}_n - \bar{\bar{Y}})$$
where $\bar{\bar{Y}}_{nm}$ is the population mean for the $nm$ subunits that were
selected, and $\bar{\bar{Y}}_n$ the population mean for the $n$ units that were selected.
When we square and take the average, the cross-product terms vanish.
The contributions of the squared terms turn out to be as follows:
$$E(\bar{\bar{y}} - \bar{\bar{Y}}_{nm})^2 = \frac{(K - k)}{K} \frac{S_{ww}^2}{nmk}$$
$$E(\bar{\bar{Y}}_{nm} - \bar{\bar{Y}}_n)^2 = \frac{(M - m)}{M} \frac{1}{nm}\left(\frac{1}{K}S_{ww}^2 + S_w^2\right)$$
$$E(\bar{\bar{Y}}_n - \bar{\bar{Y}})^2 = \frac{(N - n)}{N} \frac{1}{n}\left(\frac{1}{MK}S_{ww}^2 + \frac{1}{M}S_w^2 + S_u^2\right)$$
When these three terms are added, the theorem is obtained.


Theorem 10.6 An unbiased estimate of $V(\bar{\bar{y}})$ from the sample is
$$v(\bar{\bar{y}}) = \frac{1}{nmk}\left[\frac{(N - n)}{N} s_b^2 + \frac{(M - m)}{M}\frac{n}{N} s_w^2 + \frac{(K - k)}{K}\frac{nm}{NM} s_{ww}^2\right]$$
The proof reduces to showing that the results stated in table 10.6
for the expected values of $s_b^2$, $s_w^2$, and $s_{ww}^2$ are valid. The details are
straightforward although tedious.
The extension of these results to further stages of sampling should
be clear from the structure of the formulas. Optimum sampling fractions
at the second and third stages can be investigated as for two-stage
sampling.
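
The agreement between the two three-stage theorems can be checked numerically. The Python sketch below (function names and numerical values are assumptions chosen for illustration) evaluates the estimator at the expected values of the three mean squares and compares the result with the true variance.

```python
def three_stage_true_variance(S_u2, S_w2, S_ww2, n, m, k, N, M, K):
    """True variance of the sample mean with simple random sampling at each stage."""
    return ((1 / n - 1 / N) * S_u2
            + (1 / (n * m) - 1 / (N * M)) * S_w2
            + (1 / (n * m * k) - 1 / (N * M * K)) * S_ww2)


def three_stage_estimate(s_b2, s_w2, s_ww2, n, m, k, N, M, K):
    """Sample estimate of the variance from the three mean squares."""
    return ((N - n) / N * s_b2
            + (M - m) / M * (n / N) * s_w2
            + (K - k) / K * (n * m) / (N * M) * s_ww2) / (n * m * k)


# Hypothetical variance components and sample sizes.
S_u2, S_w2, S_ww2 = 2.0, 3.0, 5.0
sizes = dict(n=4, m=3, k=2, N=20, M=10, K=6)

# Expected mean squares from the sample analysis of variance.
E_s_b2 = S_ww2 + sizes["k"] * S_w2 + sizes["m"] * sizes["k"] * S_u2
E_s_w2 = S_ww2 + sizes["k"] * S_w2
E_s_ww2 = S_ww2

true_v = three_stage_true_variance(S_u2, S_w2, S_ww2, **sizes)
expected_v = three_stage_estimate(E_s_b2, E_s_w2, E_s_ww2, **sizes)
```

Because the estimator is linear in the mean squares, `expected_v` equals `true_v`, which is the unbiasedness claimed by the theorem.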

10.9 Stratified sampling of the units. Subsampling may be combined
with any type of sampling of the units. The subsampling itself may
employ stratification or systematic sampling. Variance formulas for
these modifications can be built up from the formulas for the simpler
methods. Results will be given for the combination of subsampling
with stratified sampling of the units. We assume that unit sizes are
constant for a given stratum, but may vary from stratum to stratum.
The subscript $h$ refers to the stratum. The population variances
$S_{uh}^2$ and $S_{wh}^2$ are in general defined separately for each stratum by an
analysis of variance similar to that in table 10.3. The $h$th stratum
contains $N_h$ units, each with $M_h$ elements: the corresponding sample
numbers are $n_h$ and $m_h$. The estimated population mean per element is
$$\bar{\bar{y}}_{st} = \frac{\sum_h M_h N_h \bar{\bar{y}}_h}{\sum_h M_h N_h}$$
By applying theorem 10.3 within each stratum, we find
$$V(\bar{\bar{y}}_{st}) = \frac{\sum_h (M_h N_h)^2 \left\{\dfrac{(N_h - n_h)}{N_h}\dfrac{S_{uh}^2}{n_h} + \dfrac{(M_h N_h - m_h n_h)}{M_h N_h}\dfrac{S_{wh}^2}{m_h n_h}\right\}}{(\sum_h M_h N_h)^2}$$
From theorem 10.4, a sample estimate that is unbiased is
$$v(\bar{\bar{y}}_{st}) = \frac{\sum_h \dfrac{(M_h N_h)^2}{m_h n_h} \left\{\dfrac{(N_h - n_h)}{N_h} s_{bh}^2 + \dfrac{(M_h - m_h)}{M_h}\dfrac{n_h}{N_h} s_{wh}^2\right\}}{(\sum_h M_h N_h)^2}$$
If the sampling fractions $n_h/N_h$ for the units are all small, these
formulas simplify to
$$V(\bar{\bar{y}}_{st}) \doteq \sum_h (M_h N_h)^2 \left(\frac{S_{uh}^2}{n_h} + \frac{S_{wh}^2}{m_h n_h}\right) \Big/ \Big(\sum_h M_h N_h\Big)^2$$

10.10 Exercises.
10.1 From a simple random sample of fields of corn, 2 subunits (each
consisting of 10 hills) were chosen in each field. The following mean squares
come from an analysis of variance of the number of ears per hill (on a single-hill
basis):
ms
Between fields '; .89
Between subunits within fields 1 .41

If it takes 1 hr to locate each field, and 10 min to locate and count 1 subunit
(after the field is reached), what is the optimum number of subunits per
field? (The fpc's may be ignored.)
10.2 In the same survey, the mean square between hills within the same
subunit was 0.92. Assuming that this mean square would not change appreciably
if the subunit contained 20 hills, estimate the change in precision if one
20-hill unit were taken per field instead of two 10-hill units.
10.3 Verify the rule (section 10.6) that, if $\hat{m}_{opt}$ lies between the two integers
$m$, $(m + 1)$, we should choose $(m + 1)$ if
$$\hat{m}_{opt}^2 > m(m + 1)$$
10.4 Show that, if $S_u^2 > 0$, in the notation of section 10.4, a simple random
sample of $n$ primary units, with 1 element chosen per unit, is more precise
than a simple random sample of $n$ elements ($n > 1$, $M > 1$). Show that
the precision of the two methods is equal if $n/N$ is negligible. Would you expect
this intuitively?
10.11 References.
CAMERON, J. M. (1951). Use of variance components in preparing schedules for
the sampling of baled wool. Biometrics, 7, 83-96.
EISENHART, C. (1947). The assumptions underlying the analysis of variance.
Biometrics, 3, 1-21.
GRAY, P. G., and CORLETT, T. (1950). Sampling for the social survey. Jour. Roy.
Stat. Soc., A113, 150-206.
KING, A. J., and JEBE, E. H. (1940). An experiment in the pre-harvest sampling
of wheat fields. Iowa Agr. Exp. Sta. Res. Bull. 273.
MERRINGTON, M., and THOMPSON, C. M. (1943). Tables of percentage points of
the inverted beta (F) distribution. Biometrika, 33, 73-88.
SUKHATME, P. V. (1947). The problem of plot size in large-scale yield surveys.
Jour. Amer. Stat. Assoc., 42, 297-310.

Not cited in text

MARCUSE, S. (1949). Optimum allocation and variance components in nested
sampling with an application to chemical analysis. Biometrics, 5, 189-206.
CHAPTER 11

SUBSAMPLING WITH UNITS OF UNEQUAL SIZE

11.1 Introduction. If two-stage sampling is to be used when the primary
units vary in size, one method is to stratify by size of unit, so
that the units within a stratum become equal, or nearly so, in size.
The formulas in section 10.9 may then be an adequate approximation.
Sometimes, however, units vary so much in size that substantial differences
remain within some of the strata, and sometimes it is advisable
to base the stratification on variables other than size. In a review
of the British Social Surveys, which are mostly nationwide samples
with districts as primary units, Gray and Corlett (1950) point out that
size was at first included as one of the variables for stratification, but
another factor was found more desirable when the characteristics of
the population became better known.
Some concentrated effort is required in order to obtain a good working
knowledge of multi-stage sampling when the units vary in size,
because the technique is very flexible. The units may be chosen either
with equal probabilities or with probabilities proportional to size or to
some estimate of size. Various rules can be devised to determine the
subsampling fractions, and various methods of estimation are available.
The advantages of the different methods depend on the nature
of the population, on the field costs, and on the supplementary data
that are at our disposal.
The first part of this chapter is devoted to a description of the principal
methods that are in use. We shall begin with a population that
consists of a single stratum. The extension to stratified sampling can
be made, as in previous chapters, by summing the appropriate variance
formulas over the strata. For simplicity, we assume at first that only
a single primary unit is chosen: i.e. that $n = 1$. This case is not as
impractical as might appear at first sight because when there is a large
number of strata we may be able to achieve satisfactory precision in
estimation even although $n_h = 1$. In the series of monthly surveys
taken by the U. S. Census Bureau in order to estimate numbers of
employed people, the county is the primary unit. This is a large unit,
but it has administrative advantages that decrease costs. Since counties
are far from uniform in their characteristics, stratification of
counties is extended in these surveys to the point where only one
county is selected from each stratum. Consequently, the theory to be
discussed here is applicable to a single stratum in this kind of sampling
plan.
As in previous chapters, the quantities to be estimated may be the
population total $Y$, the population mean (usually the mean per element
$\bar{\bar{Y}}$), or a ratio of two variates. At present the discussion is confined
to the population mean per element.
Notation. The observation for the $j$th element within the $i$th unit is
denoted by $y_{ij}$. The following symbols refer to the $i$th unit:

                          Population                  Sample
Number of elements        $M_i$                       $m_i$
Mean per element          $\bar{Y}_i$                 $\bar{y}_i$
Total                     $Y_i = M_i\bar{Y}_i$        $y_i = m_i\bar{y}_i$

The following symbols refer to the whole population or sample:

                          Population                          Sample
Number of elements        $M = \sum^N M_i$                    $m = \sum^n m_i$
Total                     $Y = \sum^N Y_i$                    $y = \sum^n y_i$
Mean per element          $\bar{\bar{Y}} = Y/M$               $\bar{\bar{y}} = y/m$
Mean per primary unit     $\bar{Y} = Y/N$                     $\bar{y} = y/n$

The notation departs from that of chapter 10 in one respect. We
define $M$ and $m$ as the total number of elements in the population and
sample, respectively, whereas in chapter 10 these symbols were the
corresponding totals in any primary unit (all units being of equal size).
To keep the notation consistent, symbols $\bar{M}$ and $\bar{m}$ should have been
used in chapter 10.

11.2 Sampling methods when n = 1. Suppose that the $i$th unit is
selected and that it contains $M_i$ elements, of which $m_i$ are sampled at
random. We consider three methods of estimating $\bar{\bar{Y}}$.
I. Units chosen with equal probability.
Estimate = $\bar{y}_{\mathrm{I}} = \bar{y}_i$.
The estimate is the sample mean per element. It is biased. For, in
repeated sampling from the same unit, the average of $\bar{y}_i$ is $\bar{Y}_i$, and
since every unit has an equal chance of being selected, the average of
$\bar{Y}_i$ is
$$\frac{1}{N}\sum_{i=1}^{N} \bar{Y}_i = \bar{Y}_a \quad \text{(say)}$$
But the population mean is
$$\bar{\bar{Y}} = \frac{\sum_{i=1}^{N} M_i\bar{Y}_i}{M}, \quad \text{where } M = \sum_{i=1}^{N} M_i$$
Hence the bias equals $(\bar{Y}_a - \bar{\bar{Y}})$.
To find $V(\bar{y}_{\mathrm{I}})$, write
$$\bar{y}_i - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}_a) + (\bar{Y}_a - \bar{\bar{Y}})$$
Square and take the expectation over all possible samples. All contributions
from cross-product terms vanish. The expectations of the
squared terms follow easily by the methods given in chapter 10. We
find
$$V(\bar{y}_{\mathrm{I}}) = \underbrace{\frac{1}{N}\sum_{i=1}^{N} \frac{(M_i - m_i)}{M_i}\frac{S_i^2}{m_i}}_{\text{Within units}} + \underbrace{\frac{1}{N}\sum_{i=1}^{N} (\bar{Y}_i - \bar{Y}_a)^2}_{\text{Between units}} + \underbrace{(\bar{Y}_a - \bar{\bar{Y}})^2}_{\text{Bias}} \tag{11.1}$$
where
$$S_i^2 = \frac{1}{(M_i - 1)}\sum_{j=1}^{M_i} (y_{ij} - \bar{Y}_i)^2$$
is the variance among elements in the $i$th unit.
The variance of $\bar{y}_{\mathrm{I}}$ contains three components: one arising from variation
within units, one from variation between the true means of the
units, and one from the bias.
The values of the $m_i$ have not been specified. The most common
choice is either to take all $m_i$ equal, or to take $m_i$ proportional to $M_i$,
i.e. to subsample a fixed proportion of whatever unit is selected. The
choice of the $m_i$ affects only the first of the three components of the
variance, namely the component that arises from variation within units.
II. Units chosen with equal probability.
Estimate = $\bar{y}_{\mathrm{II}} = NM_i\bar{y}_i/M$.
This estimate is unbiased. Since $\bar{y}_i$ is an unbiased estimate of $\bar{Y}_i$,
the product $M_i\bar{y}_i$ is an unbiased estimate of the unit total $Y_i$. Hence
$NM_i\bar{y}_i$ is an unbiased estimate of the population total $Y$. Dividing by
$M$, the total number of elements in the population, we obtain an unbiased
estimate of $\bar{\bar{Y}}$.
To find $V(\bar{y}_{\mathrm{II}})$, we have
$$\bar{y}_{\mathrm{II}} - \bar{\bar{Y}} = \frac{NM_i\bar{y}_i}{M} - \bar{\bar{Y}} = \frac{NM_i}{M}(\bar{y}_i - \bar{Y}_i) + \left(\frac{NM_i}{M}\bar{Y}_i - \bar{\bar{Y}}\right)$$
Now $M_i\bar{Y}_i = Y_i$, the total for the unit, and $\bar{\bar{Y}} = N\bar{Y}/M$, where $\bar{Y}$ is
the population mean per unit. This gives
$$\bar{y}_{\mathrm{II}} - \bar{\bar{Y}} = \frac{NM_i}{M}(\bar{y}_i - \bar{Y}_i) + \frac{N}{M}(Y_i - \bar{Y})$$
Hence,
$$V(\bar{y}_{\mathrm{II}}) = \frac{N}{M^2}\sum_{i=1}^{N} M_i(M_i - m_i)\frac{S_i^2}{m_i} + \frac{N}{M^2}\sum_{i=1}^{N} (Y_i - \bar{Y})^2 \tag{11.2}$$
The "between-unit" component of this variance (second term on the
right) represents the variation among the unit totals $Y_i$. This component
is affected both by variations in the $M_i$ from unit to unit and
by variations in the means $\bar{Y}_i$ per element. If the units vary considerably
in size, this component is large even though the means per
element $\bar{Y}_i$ are practically constant from unit to unit. Frequently this
component is so large that $\bar{y}_{\mathrm{II}}$ has a much higher variance than the
biased estimate $\bar{y}_{\mathrm{I}}$. Thus neither method I nor method II is fully
satisfactory.
III. Units chosen with probability proportional to size.
Estimate = $\bar{y}_{\mathrm{III}} = \bar{y}_i$ = sample mean.
This technique is due to Hansen and Hurwitz (1943). It gives a
sample mean that is unbiased and is not subject to the inflation of the
variance in method II.
In repeated sampling, the $i$th unit appears with relative frequency
$M_i/M$. Hence,
$$E(\bar{y}_{\mathrm{III}}) = \sum_{i=1}^{N} \frac{M_i}{M}\bar{Y}_i = \bar{\bar{Y}}$$
Further,
$$\bar{y}_{\mathrm{III}} - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{\bar{Y}})$$
Average first over samples in which the $i$th unit is selected.
$$E(\bar{y}_{\mathrm{III}} - \bar{\bar{Y}})^2 = \frac{(M_i - m_i)}{M_i}\frac{S_i^2}{m_i} + (\bar{Y}_i - \bar{\bar{Y}})^2$$
Now average over all possible selections of the unit. Since the $i$th unit
is selected with relative frequency $M_i/M$,
$$V(\bar{y}_{\mathrm{III}}) = \frac{1}{M}\left\{\sum_{i=1}^{N} (M_i - m_i)\frac{S_i^2}{m_i} + \sum_{i=1}^{N} M_i(\bar{Y}_i - \bar{\bar{Y}})^2\right\} \tag{11.3}$$
Note that, as in method I, the "between-units" component arises
from differences among the means per element $\bar{Y}_i$ in the successive
units. If these means per element are all nearly equal, this component
is small.
Example. Let us apply these results to a small population, artificially
constructed. The data are presented in table 11.1. There are
three units, with 2, 4, and 6 elements, respectively. The reader may
verify the figures given for $Y_i$, $S_i^2$, and $\bar{Y}_i$. The population mean $\bar{\bar{Y}}$
is 33/12, or 2.75. The unweighted mean of the $\bar{Y}_i$ is $2.167 = \bar{Y}_a$, so that
the bias in method I is -0.583. Its square, the contribution to the
variance, is 0.340.

TABLE 11.1 ARTIFICIAL POPULATION WITH UNITS OF UNEQUAL SIZES

Unit     $y_{ij}$             $M_i$    $Y_i$    $S_i^2$    $\bar{Y}_i$    $\bar{Y}_i - \bar{\bar{Y}}$

1        0, 1                 2        1        0.500      0.5            -2.25
2        1, 2, 2, 3           4        8        0.667      2.0            -0.75
3        3, 3, 4, 4, 5, 5     6        24       0.800      4.0            +1.25

Totals                        12       33

One unit is to be selected and two elements sampled from it. We
consider four methods, two of which are variants of method I.
Method Ia.
Selection: unit with equal probability, $m_i = 2$.
Estimate: $\bar{y}_i$ (biased).
Method Ib.
Selection: unit with equal probability, $m_i = \frac{1}{2}M_i$.
Estimate: $\bar{y}_i$ (biased).
Method II.
Selection: unit with equal probability, $m_i = 2$.
Estimate: $NM_i\bar{y}_i/M$ (unbiased).
Method III.
Selection: unit with probability $M_i/M$, $m_i = 2$.
Estimate: $\bar{y}_i$ (unbiased).
Method Ib (proportional subsampling) does not guarantee a sample
size of 2 (it may be 1, 2, or 3), but the average sample size is 2.
By application of the sampling error formulas (11.1), (11.2), and
(11.3), we obtain the results in table 11.2.

TABLE 11.2 VARIANCES OF SAMPLE ESTIMATES OF $\bar{\bar{Y}}$

             Contribution to variance from              Total
Method       Within units    Between units    Bias      variance

Ia           0.145           2.056            0.340     2.541
Ib           0.183           2.056            0.340     2.579
II           0.256           5.792            0.000     6.048
III          0.189           1.813            0.000     2.002

Although the example is artificial, the results are typical of those
found in comparisons made on many populations. Method III gives
the smallest variance because it has the smallest contribution from
variation between units. Method II, although unbiased, is very inferior.
Method Ia (equal size of subsample) is slightly better than
method Ib (proportional subsampling).
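
The figures in table 11.2 can be reproduced directly from formulas (11.1), (11.2), and (11.3). The Python sketch below does this for the artificial population of table 11.1; the function and variable names are illustrative assumptions, and small differences in the third decimal place arise from the rounding of the components in the printed table.

```python
from statistics import mean, variance

units = [[0, 1], [1, 2, 2, 3], [3, 3, 4, 4, 5, 5]]   # table 11.1
N = len(units)
M_i = [len(u) for u in units]
M = sum(M_i)
Y_i = [sum(u) for u in units]
Ybar_i = [mean(u) for u in units]
S2_i = [variance(u) for u in units]      # divisor M_i - 1
Ybar = sum(Y_i) / M                      # population mean per element
Ybar_a = mean(Ybar_i)                    # unweighted mean of the unit means
Y_per_unit = sum(Y_i) / N                # population mean per unit


def method_I(m):
    """Components of (11.1): within units, between units, squared bias."""
    within = sum((Mi - mi) / Mi * s2 / mi
                 for Mi, mi, s2 in zip(M_i, m, S2_i)) / N
    between = sum((yb - Ybar_a) ** 2 for yb in Ybar_i) / N
    return within, between, (Ybar_a - Ybar) ** 2


def method_II(m):
    """Components of (11.2)."""
    within = N / M ** 2 * sum(Mi * (Mi - mi) * s2 / mi
                              for Mi, mi, s2 in zip(M_i, m, S2_i))
    between = N / M ** 2 * sum((Yi - Y_per_unit) ** 2 for Yi in Y_i)
    return within, between, 0.0


def method_III(m):
    """Components of (11.3)."""
    within = sum((Mi - mi) * s2 / mi
                 for Mi, mi, s2 in zip(M_i, m, S2_i)) / M
    between = sum(Mi * (yb - Ybar) ** 2 for Mi, yb in zip(M_i, Ybar_i)) / M
    return within, between, 0.0


results = {
    "Ia": method_I([2, 2, 2]),
    "Ib": method_I([Mi / 2 for Mi in M_i]),
    "II": method_II([2, 2, 2]),
    "III": method_III([2, 2, 2]),
}
totals = {name: sum(parts) for name, parts in results.items()}
```

Printing `results` shows that almost all of the difference between the methods lies in the between-units component.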
Some comparisons of these methods have also been made on actual
populations. For six items (total workers, total agricultural workers,
total non-agricultural workers, estimated separately for males and
females) Hansen and Hurwitz (1943) found that method III produced
large reductions in the contribution from variation between units as
compared with the unbiased method II, and reductions which averaged
30 per cent as compared with method I. (They assumed the contribution
from variation within units to be negligible.) In estimating
typical farm items for the state of North Carolina, Jebe (1952) reported
reductions in the total variance of the order of 15 per cent as compared
with methods of type I. In both studies the primary unit was a county.

11.3 Sampling with probability proportional to estimated size. It
may happen that the sizes $M_i$ of the units are known only roughly. In
the sampling of towns, where the unit is a block and the element a
household, the number of households per block is usually obtained
from city maps or from previous census data, both of which are more
or less out of date. Some of the advantages of pps sampling can be
retained by sampling with probability proportional to the best estimate
of size. Let $z_i$ be the probability assigned to the $i$th unit, where the $z_i$
are any set of positive numbers that add to unity. We still assume
$n = 1$.
Method IV. An unbiased estimate of $\bar{\bar{Y}}$ is
$$\bar{y}_{\mathrm{IV}} = \frac{M_i\bar{y}_i}{z_iM} \tag{11.4}$$
This follows because, in repeated sampling, the $i$th unit appears with
relative frequency $z_i$, so that
$$E(\bar{y}_{\mathrm{IV}}) = \sum_{i=1}^{N} z_i\left(\frac{M_i\bar{Y}_i}{z_iM}\right) = \sum_{i=1}^{N} \frac{M_i\bar{Y}_i}{M} = \bar{\bar{Y}}$$
With this method it is customary to select $m_i$ so that
$$m_i = \frac{kM_i}{z_i} \tag{11.5}$$
where $k$ is a constant. The estimate may then be written
$$\bar{y}_{\mathrm{IV}} = \frac{m_i\bar{y}_i}{kM} = \frac{y_i}{kM} \tag{11.6}$$
where $y_i$ is the sample total.
The quantity $k$ may be described as the expected overall sampling
fraction. For
$$E(m_i) = \sum_{i=1}^{N} z_im_i = k\sum_{i=1}^{N} M_i = kM$$
Hence
$$k = \frac{\text{Expected number of elements in sample}}{\text{Number of elements in population}}$$
The advantage in choosing $m_i = kM_i/z_i$ does not become apparent
in the present simple case. With $n > 1$, and with stratified sampling,
it will be seen later that this choice makes the estimate self-weighting.
Example. This illustrates how $m_i$ is determined. The stratum
contains 6 blocks, with estimated numbers of households as shown in
table 11.3. Suppose that we intend to have an overall sampling
fraction of 5 per cent. We take $k = \frac{1}{20}$. As with ordinary pps
sampling, a unit is first selected by drawing a random number between
1 and 121. Let this be 96. The block selected is no. 5, which covers
the range from 83 to 105 in the cumulation.

TABLE 11.3 ILLUSTRATION OF THE CALCULATION OF $m_i$

           Estimated no.    Cumulative    Assigned
Block      households       sum           range
1          10               10            1-10
2          30               40            11-40
3          17               57            41-57
4          25               82            58-82
5          23               105           83-105
6          16               121           106-121

The interviewer visits this block and prelists or counts all the households
on it. We shall suppose that he finds 31 households. Applying
equation (11.5), he takes
$$m_i = \frac{kM_i}{z_i} = \frac{(31)}{(20)}\cdot\frac{121}{23} = 8, \text{ to the nearest integer}$$
The desired subsampling ratio
$$\frac{m_i}{M_i} = \frac{k}{z_i} = \frac{121}{(20)(23)} \doteq \frac{1}{4}$$
is known before the block is listed. Thus the interviewer can be told
in advance to take 1 household in 4 from this block. This rule is convenient
when the subsample is to be systematic, as is often the case in
practice.
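
The selection by cumulated sizes and the calculation of $m_i$ in this example can be sketched in Python as follows. The helper names are assumptions for illustration. The random number 96 is taken from the text; in practice it could be drawn with `random.randint(1, total)`.

```python
from bisect import bisect_left
from itertools import accumulate

est_sizes = [10, 30, 17, 25, 23, 16]      # estimated households, table 11.3
cum = list(accumulate(est_sizes))         # cumulative sums: 10, 40, ..., 121
total = cum[-1]


def select_block(r):
    """Return the 1-based block whose assigned range contains r (1 <= r <= total)."""
    return bisect_left(cum, r) + 1


def subsample_size(k, listed_size, z):
    """Equation (11.5): m_i = k * M_i / z_i, before rounding."""
    return k * listed_size / z


block = select_block(96)                  # random number used in the text
z = est_sizes[block - 1] / total          # selection probability of that block
m_i = subsample_size(1 / 20, 31, z)       # 31 households found on listing
ratio = (1 / 20) / z                      # subsampling ratio, known in advance
```

The ratio `k / z` depends only on the estimated size, which is why the interviewer can be given the instruction before the block is listed.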
The variance of $\bar{y}_{\mathrm{IV}}$ is obtained in the usual way. Write
$$\bar{y}_{\mathrm{IV}} - \bar{\bar{Y}} = \frac{M_i\bar{y}_i}{z_iM} - \bar{\bar{Y}} \quad \text{[by (11.4)]}$$
$$= \frac{1}{M}\left\{\frac{M_i}{z_i}(\bar{y}_i - \bar{Y}_i) + \left(\frac{M_i\bar{Y}_i}{z_i} - M\bar{\bar{Y}}\right)\right\}$$
In the variance, each square receives a weight $z_i$. Hence,
$$V(\bar{y}_{\mathrm{IV}}) = \frac{1}{M^2}\left\{\sum_{i=1}^{N} \frac{M_i(M_i - m_i)}{z_i}\frac{S_i^2}{m_i} + \sum_{i=1}^{N} z_i\left(\frac{M_i\bar{Y}_i}{z_i} - M\bar{\bar{Y}}\right)^2\right\} \tag{11.7}$$
If $k = m_iz_i/M_i$, this may be written in the slightly simpler form
$$V(\bar{y}_{\mathrm{IV}}) = \frac{1}{M^2}\left\{\sum_{i=1}^{N} \frac{(M_i - m_i)S_i^2}{k} + \sum_{i=1}^{N} z_i\left(\frac{Y_i}{z_i} - Y\right)^2\right\} \tag{11.7'}$$
If $z_i = M_i/M$, formula (11.7) reduces to formula (11.3) for $V(\bar{y}_{\mathrm{III}})$.
If $z_i = 1/N$ (initial probabilities equal), formula (11.7) reduces to
formula (11.2) for the variance of the unbiased estimate when probabilities
are equal.
Unless $z_i = M_i/M$, the "between-units" component in (11.7) is
affected to some extent by variations in the sizes $M_i$ as well as by variations
in the means per element $\bar{Y}_i$.
Example. Table 11.4 shows the basic computations for finding
$V(\bar{y}_{\mathrm{IV}})$ in the artificial population in table 11.1. Since $M = 12$ and
the desired sample size is 2, we put $k = \frac{1}{6}$. The $z_i$ have been taken as
0.2, 0.4, and 0.4.

TABLE 11.4 COMPUTATION OF $V(\bar{y}_{\mathrm{IV}})$

Unit   $M_i$   $M_i/M$   $z_i$   $m_i = M_i/6z_i$   $6(M_i - m_i)$   $S_i^2$   $Y_i$   $Y_i/z_i$   $Y_i/z_i - Y$

1      2       0.17      0.2     5/3                2                0.500     1       5           -28
2      4       0.33      0.4     5/3                14               0.667     8       20          -13
3      6       0.50      0.4     5/2                21               0.800     24      60          +27

In practice, the $m_i$ are rounded to integers. This has not been done
in the present illustration. From formula (11.7'), the variance comes
out as follows:
Contribution from "within-units" = $6\sum (M_i - m_i)S_i^2/M^2$ = 0.188
Contribution from "between-units" = $\sum z_i(Y_i/z_i - Y)^2/M^2$ = 3.583
Total = 3.771
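
The two contributions can be recomputed from the within-units and between-units terms of the variance formula for method IV, written with $m_i = kM_i/z_i$. In the Python sketch below the function name `variance_IV` is an assumption; the data are those of table 11.1.

```python
M_i = [2, 4, 6]                 # unit sizes, table 11.1
S2_i = [0.5, 2 / 3, 0.8]        # within-unit variances
Y_i = [1, 8, 24]                # unit totals
M = sum(M_i)
Y = sum(Y_i)


def variance_IV(z, k):
    """Within-units and between-units contributions when m_i = k * M_i / z_i."""
    m = [k * Mi / zi for Mi, zi in zip(M_i, z)]
    within = sum((Mi - mi) * s2 / k
                 for Mi, mi, s2 in zip(M_i, m, S2_i)) / M ** 2
    between = sum(zi * (Yi / zi - Y) ** 2 for zi, Yi in zip(z, Y_i)) / M ** 2
    return within, between


# Probabilities used in the example.
within, between = variance_IV(z=[0.2, 0.4, 0.4], k=1 / 6)

# Exact sizes as probabilities (z_i = M_i / M) should reproduce method III.
within_pps, between_pps = variance_IV(z=[Mi / M for Mi in M_i], k=1 / 6)
```

Replacing the rough probabilities by the exact $M_i/M$ cuts the between-units contribution roughly in half, which illustrates the remark that follows about closer estimates of size.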
Comparison with table 11.2 reveals that method IV has a lower
variance than the unbiased method II in which the unit is chosen with
equal probabilities, but method IV is decidedly inferior to method I
or III. If the sizes were not known, method III (pps) could not be
used, but the biased estimates obtained in method I from sampling
with equal probabilities could be used. Apparently, in this example,
method IV pays too high a price in order to obtain an unbiased estimate.
The disappointing performance of method IV is not necessarily typical.
With closer estimates of size the method shows to more advantage.
It is natural to consider, however, whether the sample mean (as in
method I) would be better than the estimate adopted in method IV.
This leads to the fifth method to be discussed.
Method V. Units chosen with probability proportional to estimated size.
Estimate = $\bar{y}_{\mathrm{V}} = \bar{y}_i$ = sample mean.
The estimate is biased, since
$$E(\bar{y}_i) = \sum_{i=1}^{N} z_i\bar{Y}_i = \bar{Y}_z \quad \text{(say)}$$
If the $z_i$ are good estimates, $\bar{Y}_z$ is close to the correct mean $\bar{\bar{Y}} =
\sum M_i\bar{Y}_i/M$, and the bias is small.
If we write
$$\bar{y}_{\mathrm{V}} - \bar{\bar{Y}} = (\bar{y}_i - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}_z) + (\bar{Y}_z - \bar{\bar{Y}})$$
the three components of the variance work out as follows:
$$V(\bar{y}_{\mathrm{V}}) = \sum_{i=1}^{N} \frac{z_i(M_i - m_i)}{M_i}\frac{S_i^2}{m_i} + \sum_{i=1}^{N} z_i(\bar{Y}_i - \bar{Y}_z)^2 + (\bar{Y}_z - \bar{\bar{Y}})^2$$
Example. If the values of $z_i$ and $m_i$ are chosen as in table 11.4, the
reader may verify that the components of the variance of $\bar{y}_{\mathrm{V}}$ are as
shown in table 11.5.

TABLE 11.5 CONTRIBUTIONS TO THE VARIANCE IN METHOD V

Within units    Between units    Bias     Total variance
0.178           1.800            0.062    2.040

This is superior to all methods except method III (pps) and is
almost as good as method III.
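
The verification suggested to the reader can be carried out as in the Python sketch below, which evaluates the three components of the variance for method V with the $z_i$ and $m_i$ of table 11.4. The variable names are illustrative assumptions.

```python
M_i = [2, 4, 6]                       # unit sizes, table 11.1
S2_i = [0.5, 2 / 3, 0.8]              # within-unit variances
Ybar_i = [0.5, 2.0, 4.0]              # unit means per element
z = [0.2, 0.4, 0.4]                   # selection probabilities, table 11.4
k = 1 / 6
M = sum(M_i)

m_i = [k * Mi / zi for Mi, zi in zip(M_i, z)]            # 5/3, 5/3, 5/2
Ybar = sum(Mi * yb for Mi, yb in zip(M_i, Ybar_i)) / M    # true mean, 2.75
Ybar_z = sum(zi * yb for zi, yb in zip(z, Ybar_i))        # expected value of the estimate

within = sum(zi * (Mi - mi) / Mi * s2 / mi
             for zi, Mi, mi, s2 in zip(z, M_i, m_i, S2_i))
between = sum(zi * (yb - Ybar_z) ** 2 for zi, yb in zip(z, Ybar_i))
bias_sq = (Ybar_z - Ybar) ** 2
total = within + between + bias_sq
```

The squared bias is small here because $\bar{Y}_z = 2.5$ is close to the true mean 2.75.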

11.4 Summary of methods for n = 1. The five methods of estimating
the mean per element $\bar{\bar{Y}}$ and their variances in the numerical example
are summarized in table 11.6.

TABLE 11.6 TWO-STAGE SAMPLING METHODS (n = 1)

          Probabilities in                Estimate                 Bias         Variance
Method    selecting units                 of $\bar{\bar{Y}}$       status       in example
I         Equal                           $\bar{y}_i$              Biased       Ia: 2.541
                                                                                Ib: 2.579
II        Equal                           $NM_i\bar{y}_i/M$        Unbiased     6.048
III       $M_i/M \propto$ size            $\bar{y}_i$              Unbiased     2.002
IV        $z_i \propto$ estimated size    $m_i\bar{y}_i/kM$        Unbiased     3.771
V         $z_i \propto$ estimated size    $\bar{y}_i$              Biased       2.040

11.5 Sampling methods when n > 1. In this section the principal sampling methods and estimates are described for the usual situation in which more than one unit is selected. The discussion is still restricted to a single stratum.
Consider first the sampling of units with equal probability. As an illustration, 20 pages were selected at random from the volume American men of science. The number of biographies $M_i$ per page varies in general from about 14 to 21. On each selected page, 2 biographies were chosen at random and the age of the scientist was recorded. The data appear in table 11.7. The purpose is to estimate the average age of the biographees in the complete volume.

TABLE 11.7  AGES OF 40 SCIENTISTS IN American men of science ($n = 20$, $m = 2$)

Unit                    Ages             Total
no.      $M_i$    $y_{i1}$   $y_{i2}$    $y_i$     $M_i \bar{y}_i$
 1        15        47         30          77         577.5
 2        19        38         51          89         845.5
 3        19        43         45          88         836.0
 4        16        55         41          96         768.0
 5        16        59         45         104         832.0
 6        19        39         38          77         731.5
 7        18        43         43          86         774.0
 8        18        49         51         100         900.0
 9        18        45         35          80         720.0
10        18        46         59         105         945.0
11        20        71         64         135       1,350.0
12        18        35         46          81         729.0
13        19        61         54         115       1,092.5
14        19        45         87         132       1,254.0
15        18        31         38          69         621.0
16        16        64         39         103         824.0
17        16        63         47         110         880.0
18        19        36         33          69         655.5
19        19        61         39         100         950.0
20        19        54         34          88         836.0

Totals   359                            1,904      17,121.5

Method I, in which the sample mean provides the estimate, has two analogues. First, we may take the ordinary sample mean,
$$\bar{y}_I = \frac{\sum y_i}{\sum m_i} = \frac{1904}{40} = 47.6 \text{ years}$$
This estimate is biased. This is easily seen when $m$ is constant, since in that event
$$E(\bar{y}_I) = \frac{1}{N} \sum_{i=1}^{N} \bar{Y}_i$$
whereas the population mean per element is $\bar{\bar{Y}} = \sum M_i \bar{Y}_i / M$. The biased estimate gives too much weight to the smaller units. If there is no correlation between $M_i$ and $\bar{Y}_i$, the distorted weighting does not matter greatly, and the bias is unimportant in large samples. But, if $M_i$ and $\bar{Y}_i$ are correlated, as often happens, the bias may not be negligible in large samples. In the present example we might anticipate a small negative correlation between $M_i$ and $\bar{Y}_i$, because the longer biographies, which cut down the number per page, tend to be those of the older scientists.
This discussion suggests, as an alternative estimate in method I, the weighted mean,
$$\bar{y}_{I'} = \frac{\sum M_i \bar{y}_i}{\sum M_i} = \frac{17,121.5}{359} = 47.7 \text{ years}$$
This is a typical ratio estimate because both the numerator and the denominator vary from sample to sample. As is characteristic of ratio estimates, the bias is negligible in large samples. If the subsampling is proportional, this estimate reduces to the ordinary sample mean and coincides with that in method I.
When the $m$'s are all equal, the unweighted sample mean sometimes has a smaller variance than the weighted mean. In view of its greater liability to bias, however, the unweighted mean is more hazardous.
The unbiased estimate (method II) is given by
$$\bar{y}_{II} = \frac{N}{nM} \sum_{i=1}^{n} M_i \bar{y}_i$$
In this estimate the quantity $\sum M_i \bar{y}_i$, which is an unbiased estimate of the total of the $n$ units in the sample, is raised by the inverse $N/n$ of the sampling fraction, and then divided by the total number $M$ of elements. In the present example the number of pages in the book is 2823, and $M$ (total number of biographies in the book) is given in the preface as "about 50,000." Accepting this figure for illustration, we have
$$\bar{y}_{II} = \frac{2823}{(20)(50,000)} (17,121.5) = 48.3 \text{ years}$$
As in the case $n = 1$, this estimate often has poor precision if there is much variation among the $M_i$.
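The three estimates obtained so far can be reproduced from table 11.7 with a few lines of arithmetic. The sketch below is only an illustration: the variable names are invented here, the ages are entered so that they agree with the printed unit totals, and the figure of 50,000 biographies is the approximate value accepted in the text.

```python
# Data from table 11.7: (M_i, first age, second age) for the n = 20 sampled pages.
pages = [
    (15, 47, 30), (19, 38, 51), (19, 43, 45), (16, 55, 41), (16, 59, 45),
    (19, 39, 38), (18, 43, 43), (18, 49, 51), (18, 45, 35), (18, 46, 59),
    (20, 71, 64), (18, 35, 46), (19, 61, 54), (19, 45, 87), (18, 31, 38),
    (16, 64, 39), (16, 63, 47), (19, 36, 33), (19, 61, 39), (19, 54, 34),
]
N_PAGES = 2823     # number of pages (primary units) in the volume
M_TOTAL = 50_000   # "about 50,000" biographies, accepted for illustration

n = len(pages)
total_age = sum(a + b for _, a, b in pages)            # sum of the unit totals y_i
n_elements = 2 * n                                     # sum of the m_i (m = 2)
weighted = sum(M * (a + b) / 2 for M, a, b in pages)   # sum of M_i * ybar_i
sum_M = sum(M for M, _, _ in pages)                    # sum of the M_i

y_I = total_age / n_elements                   # unweighted sample mean (method I)
y_I_prime = weighted / sum_M                   # weighted mean (method I')
y_II = N_PAGES * weighted / (n * M_TOTAL)      # unbiased estimate (method II)

print(round(y_I, 1), round(y_I_prime, 1), round(y_II, 1))
```

The method II figure depends directly on the assumed value of $M$, which is one reason it can be imprecise in practice.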
Sampling with probability proportional to size, or to estimated size, is unlikely to be adopted in this illustration, because of the work involved in counting or estimating the numbers of entries on all 2823 pages. The estimates for these methods will be given in algebraic form.
As pointed out in section 9.10, we must sample with replacement in order to keep the probabilities proportional to size or to estimated size. If the same unit is drawn twice, the subsample is also taken with replacement. In examining the bias of an estimate we shall use the same mathematical device as in section 9.11. The random variate $t_i$ denotes the number of times that the $i$th unit appears in a specific sample. In repeated sampling, $E(t_i) = nz_i$, where $z_i$ is the probability of selection.
In method III (pps sampling and an equal $m$ for all units in the sample) the sample mean $\bar{y}$ is unbiased. For
$$E(\bar{y}_{III}) = \frac{1}{n} E(t_1 \bar{y}_1 + t_2 \bar{y}_2 + \cdots + t_N \bar{y}_N) = \frac{1}{n} \sum_{i=1}^{N} n \frac{M_i}{M} \bar{Y}_i = \bar{\bar{Y}}$$

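In method IV (sampling with probability $\propto z_i$) an estimate that is always unbiased, irrespective of the values assigned to the $m_i$, is
$$\bar{y}_{IV} = \frac{1}{nM} \sum_{i=1}^{n} \frac{M_i}{z_i} \bar{y}_i$$
For
$$E(\bar{y}_{IV}) = \frac{1}{nM} E\left( \sum_{i=1}^{N} \frac{t_i M_i \bar{y}_i}{z_i} \right) = \frac{1}{nM} \sum_{i=1}^{N} n M_i \bar{Y}_i = \bar{\bar{Y}}$$
This estimate becomes self-weighting if we take
$$m_i = \frac{kM_i}{z_i} \tag{11.5}$$
since it reduces to
$$\bar{y}_{IV} = \frac{1}{nkM} \sum_{i=1}^{n} m_i \bar{y}_i = \frac{y}{nkM}$$
where $y$ is the sample total.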
In the case $n = 1$, the quantity $k$ was the expected overall sampling fraction. For $n > 1$, the expected overall sampling fraction is $nk$, for from (11.5)
$$k = \frac{m_i z_i}{M_i} = \frac{\sum_{i=1}^{N} m_i z_i}{M} \tag{11.8}$$
by summing the numerators and denominators of the series of equal fractions. But the average number of elements in the sample is
$$E\left( \sum_{i=1}^{n} m_i \right) = E\left( \sum_{i=1}^{N} t_i m_i \right) = n \sum_{i=1}^{N} z_i m_i$$
Hence, from (11.8),
$$nk = (\text{Expected number of elements in sample})/M$$
The overall sampling fraction (ratio of number of elements in the sample to that in the population) should be distinguished from the primary unit sampling fraction $n/N$ and from the subsampling fractions $m_i/M_i$.
As with $n = 1$, this estimate suffers to some extent the same kind of inflation of the variance as the unbiased method II.
There are several possible extensions of method V when $n > 1$. The estimate that seems least liable to serious bias is the weighted sample mean
$$\bar{y}_V = \frac{\sum_{i=1}^{n} (M_i/z_i)\, \bar{y}_i}{\sum_{i=1}^{n} (M_i/z_i)} \tag{11.9}$$
The numerator is an unbiased estimate of $nY$, and the denominator an unbiased estimate of $nM$. If $m_i = kM_i/z_i$, as in method IV, this estimate becomes the unweighted sample mean $\bar{y}$.
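The contrast between the unbiased method IV estimate and the ratio-type method V estimate can be checked by brute force on a very small population. The sketch below is an illustration only: the population, the probabilities $z_i$ and the subsample sizes $m_i$ are all invented, and exact fractions are used so that the expectations carry no rounding error. It enumerates every sample of $n = 2$ draws made with replacement, together with every possible subsample.

```python
from fractions import Fraction as F
from itertools import combinations, product
from math import prod

# Hypothetical population: element values y_ij in N = 3 primary units.
units = [[1, 3], [2, 4, 9], [5, 6, 7, 10]]
z = [F(1, 4), F(1, 4), F(1, 2)]   # selection probabilities z_i (sum to 1)
m = [1, 2, 2]                     # subsample sizes m_i
n = 2                             # number of draws, made with replacement

M = sum(len(u) for u in units)
Y_mean = F(sum(map(sum, units)), M)   # population mean per element

def single_draw():
    """Every outcome of one draw: (probability, M_i*ybar_i/z_i, M_i/z_i)."""
    outcomes = []
    for u, zi, mi in zip(units, z, m):
        subsamples = list(combinations(u, mi))
        for s in subsamples:
            ybar = F(sum(s), mi)
            outcomes.append((zi / len(subsamples), len(u) * ybar / zi, len(u) / zi))
    return outcomes

E_IV = F(0)
E_V = F(0)
for draws in product(single_draw(), repeat=n):
    p = prod(d[0] for d in draws)     # probability of this complete sample
    num = sum(d[1] for d in draws)    # sum of M_i * ybar_i / z_i
    den = sum(d[2] for d in draws)    # sum of M_i / z_i
    E_IV += p * num / (n * M)         # method IV estimate
    E_V += p * num / den              # method V estimate (weighted sample mean)

print(Y_mean, E_IV, E_V)
```

In this invented population the method IV expectation equals the population mean exactly, while the method V expectation differs from it slightly, which is the behaviour described in the text.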

11.6 Summary of methods for n > 1. The methods are shown in table 11.8. In view of the advantages of a self-weighting sample, the values of the $m_i$ which make the estimates self-weighting are also given.

TABLE 11.8  TWO-STAGE SAMPLING METHODS ($n > 1$)

         Probabilities                             Estimate of $\bar{\bar{Y}}$
         in selecting                                                                                   Bias
Method   units          $m_i$           General                                       Self-weighting   status
I        Equal          $m$             $\bar{y}$                                                      Biased
I'       Equal          $kNM_i$         $\sum M_i \bar{y}_i / \sum M_i$               $\bar{y}$        Biased
II       Equal          $kNM_i$         $\frac{N}{nM} \sum M_i \bar{y}_i$             $y/nkM$          Unbiased
III      $M_i/M$        $m$             $\bar{y}$                                                      Unbiased
IV       $z_i$          $kM_i/z_i$      $\frac{1}{nM} \sum \frac{M_i}{z_i} \bar{y}_i$ $y/nkM$          Unbiased
V        $z_i$          $kM_i/z_i$      $\sum \frac{M_i}{z_i} \bar{y}_i \Big/ \sum \frac{M_i}{z_i}$   $\bar{y}$   Biased

In methods IV and V, these values are $m_i = kM_i/z_i$. In methods I' and II, proportional subsampling produces a self-weighting estimate: this proportionality is denoted by writing $m_i = kNM_i$. Since $z_i = 1/N$ when sampling is with equal probabilities, the symbol $k$ has the same meaning in table 11.8 for methods I', II, IV, and V.
Where two algebraic expressions for an estimate are shown (as in methods I', II, IV, and V), the first is the general form of the estimate, which applies for any choice of the $m_i$: the second is the self-weighting form which holds only when the $m_i$ are chosen as in the preceding column.

11.7 The estimation of proportions. The estimates in table 11.8 can also be applied when the object is to estimate the proportion of elements in the population which fall in some defined class C (e.g. proportion of people aged 21 and over). We need only adopt the usual device of defining $y_{ij}$ as 1 if the element is in C and as 0 otherwise.
If the denominator of the proportion does not include all the elements in the population, the situation is different. In a general survey covering both sexes, an example of this kind of proportion is
$$\frac{\text{Number of employed males over 14 years}}{\text{Total number of males over 14 years}}$$
If we let $y_{ij} = 1$ when the element is an employed male over 14 and 0 otherwise, and $x_{ij} = 1$ when the element is a male over 14 and 0 otherwise, the population proportion is the ratio $Y/X$, so that ratio estimates are involved.
We shall consider a ratio estimate which is a generalization of method V in table 11.8. Let
$$\hat{R} = \frac{\sum_{i=1}^{n} (M_i/z_i)\, \bar{y}_i}{\sum_{i=1}^{n} (M_i/z_i)\, \bar{x}_i} \tag{11.10}$$
The numerator and denominator are unbiased estimates of $nY$ and $nX$, respectively (by method IV in table 11.8). If $m_i = kM_i/z_i$, this estimate reduces to the simple form $\hat{R} = y/x$.
As will be seen, this type of estimate is very useful in presenting the theory both for proportions and for continuous variates. Its variance is given in section 11.9.
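A minimal sketch of the ratio estimate of a proportion, with the 0/1 coding described above, is given below. The function name, the data layout and the sample values are all hypothetical. The second sample is constructed so that the $m_i$ are proportional to $M_i/z_i$, in which case the estimate should collapse to the ratio of the simple sample totals.

```python
def ratio_estimate(sample):
    """Weighted ratio of sample means.

    `sample` is a list of (M_i, z_i, y_values, x_values), one entry per
    selected primary unit, where the value lists hold the 0/1 scores of
    the subsampled elements.
    """
    num = sum(M / z * sum(y) / len(y) for M, z, y, x in sample)
    den = sum(M / z * sum(x) / len(x) for M, z, y, x in sample)
    return num / den

# Hypothetical sample, not self-weighting (M_i/z_i = 5000 and 6000, m_i = 5 and 4).
general = [
    (100, 0.02, [1, 0, 1, 0, 0], [1, 1, 1, 0, 0]),
    (60, 0.01, [0, 1, 0, 0], [1, 1, 0, 1]),
]

# Hypothetical self-weighting sample: m_i = 5 and 6 are proportional to M_i/z_i.
self_weighting = [
    (100, 0.02, [1, 0, 1, 0, 0], [1, 1, 1, 0, 0]),
    (60, 0.01, [0, 1, 0, 0, 1, 0], [1, 1, 0, 1, 1, 0]),
]

r_general = ratio_estimate(general)
r_self = ratio_estimate(self_weighting)
y_total = sum(sum(y) for _, _, y, _ in self_weighting)
x_total = sum(sum(x) for _, _, _, x in self_weighting)
print(r_general, r_self, y_total / x_total)
```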

11.8 Interim comments. In describing the various methods of sampling and estimation, we have already indicated two of the principal results. The first is that, for a given number of elements $\sum m_i$ and a given number of primary units $n$ in the sample, sampling with pps often gives the most precise estimates. The second is that, whether the primary units are chosen with equal or unequal probabilities, the biased methods of estimation based on the weighted or unweighted sample means per element are often more precise than the unbiased estimates that can be constructed from the same data. Neither of these results is a mathematical certainty, but experience as well as some theoretical investigations suggest that they hold in practice for many populations and many types of item.
The contents of the remainder of this chapter are as follows:
In section 11.9 some general variance formulas are developed which cover all the methods discussed here.
Section 11.10 presents a method, due to Hansen and Hurwitz (1949), for determining the optimum probabilities of selection of primary units when field costs are taken into account. This method also determines the optimum sampling and subsampling fractions.
Section 11.11 shows the type of relation which must hold between the primary unit total $Y_i$ and the primary unit size $M_i$ in order that the methods of estimation based on the sample mean be more precise than the corresponding unbiased estimates.
In section 11.12 general formulas are obtained for finding sample estimates of variance. Section 11.13 indicates how the methods are extended to stratified sampling.

11.9 The principal variance formulas. In order to avoid working out individual formulas for each method, we may note that method V reduces to method III if $z_i = M_i/M$ and to method I' if $z_i = 1/N$ (except that with equal probabilities we would sample without replacement). Similarly, method IV reduces to method II when $z_i = 1/N$. Consequently, variance formulas for methods V and IV provide most of the required information. Method I, although occasionally useful, will be omitted in view of its liability to serious bias.
The work can be reduced further by considering the ratio estimate presented in formula (11.10). Units are selected with probability proportional to some estimate of size $z_i$. The estimate $\hat{R}$ is
$$\hat{R} = \frac{\sum_{i=1}^{n} (M_i/z_i)\, \bar{y}_i}{\sum_{i=1}^{n} (M_i/z_i)\, \bar{x}_i} \tag{11.10}$$
If $x_{ij}$ is a "dummy" variate that has the value 1 for every element in the population, so that $\bar{x}_i \equiv 1$, $\hat{R}$ reduces to $\bar{y}_V$ as given in table 11.8, and therefore includes methods I', III, and V as particular cases.
The ratio estimate is also useful in its own right. If $x_{ij}$ is the value of $y_{ij}$ at a previous census, we can form ratio estimates of $\bar{\bar{Y}}$ and $Y$ as
$$\bar{y}_R = \hat{R} \bar{\bar{X}} : \qquad \hat{Y}_R = \hat{R} X$$
This type of estimate was found to be more precise than any of the preceding methods for farm items in North Carolina (L. H. Madow, 1950; Jebe, 1952).
In finding $V(\hat{R})$ we shall use one of the variance formulas already established for the case $n = 1$. As a preliminary, we require a well-known result for the variance of a mean in sampling with replacement. The result is deliberately stated in rather general terms.
Lemma. For a specified method of sampling and estimation, the sample estimate $u$ is an unbiased estimate of some population characteristic $U$, with variance $S_u^2$. Suppose that such a sample is drawn $n$ times, with replacement after each draw, yielding the estimates $u_1, u_2, \ldots, u_n$. Then, if $\bar{u}$ is the arithmetic mean of the $u_i$,
$$V(\bar{u}) = \frac{S_u^2}{n} \tag{11.11}$$
Proof:
$$(\bar{u} - U) = \frac{1}{n} [(u_1 - U) + (u_2 - U) + \cdots + (u_n - U)]$$
Square both sides and take the expectation. In sampling with replacement, the value obtained for $u_i$ is not influenced by the value obtained for $u_j$ ($i \neq j$). If we keep $u_j$ constant in any cross-product, and average over $u_i$, the average vanishes since $E(u_i - U) = 0$. Also each squared term contributes $S_u^2$. Hence
$$V(\bar{u}) = \frac{1}{n^2}\, n S_u^2 = \frac{S_u^2}{n}$$
Note that the lemma does not specify how the sample is drawn: the drawing may be with either equal or arbitrary probabilities.
Theorem 11.1 A sample of $n$ units is drawn with replacement, the probability of selection assigned to the $i$th unit at any draw being $z_i$. From each selected unit a subsample of size $m_i$ is drawn by simple random sampling. The estimate is
$$\hat{R} = \frac{\sum_{i=1}^{n} (M_i/z_i)\, \bar{y}_i}{\sum_{i=1}^{n} (M_i/z_i)\, \bar{x}_i}$$
Then, in large samples,*
$$V(\hat{R}) \doteq \frac{1}{nX^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - RX_i)^2 + \frac{M_i (M_i - m_i)}{z_i} \frac{S_{di}^2}{m_i} \right] \tag{11.12}$$
where
$$S_{di}^2 = \frac{1}{(M_i - 1)} \sum_{j=1}^{M_i} \{ (y_{ij} - Rx_{ij}) - (\bar{Y}_i - R\bar{X}_i) \}^2$$
Proof: In the discussion of method IV for the case $n = 1$ (section 11.3), we saw that $M_i \bar{y}_i / z_i M$ gave an unbiased estimate of $\bar{\bar{Y}}$. We now have $n$ such drawings, made with replacement. Let the subscript $i$ denote the $i$th member of the sample, where the same unit may appear more than once. Then the $n$ quantities
$$\frac{M_i}{z_i} \bar{y}_i = \hat{Y}_i \quad \text{(say)}$$
are each unbiased estimates of the population total $Y$. Let
$$\bar{\hat{Y}}_n = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i$$
with a similar definition for $\bar{\hat{X}}_n$. Then, if $R = Y/X$ is the population ratio,
$$\hat{R} - R = \frac{\bar{\hat{Y}}_n}{\bar{\hat{X}}_n} - R = \frac{\bar{\hat{Y}}_n - R \bar{\hat{X}}_n}{\bar{\hat{X}}_n}$$
As with the ratio estimate in simple random sampling (chapter 6), we assume, as an approximation, that the sample estimate $\bar{\hat{X}}_n$ in the denominator can be replaced by $X$. This gives
$$\hat{R} - R \doteq \frac{1}{X} (\bar{\hat{Y}}_n - R \bar{\hat{X}}_n) = \frac{1}{nX} \sum_{i=1}^{n} (\hat{Y}_i - R \hat{X}_i)$$
To this order of approximation, $\hat{R}$ is an unbiased estimate of $R$. By the lemma, applied to the variate $u_i = \hat{Y}_i - R \hat{X}_i$,
$$V(\hat{R}) \doteq \frac{1}{nX^2} V(\hat{Y}_i - R \hat{X}_i) \tag{11.13}$$
But in the discussion of method IV for $n = 1$, we find from formulas (11.7) and (11.7') that since $\hat{Y}_i = M \bar{y}_{IV}$,
$$V(\hat{Y}_i) = \sum_{i=1}^{N} z_i \left( \frac{Y_i}{z_i} - Y \right)^2 + \sum_{i=1}^{N} \frac{M_i (M_i - m_i)}{z_i} \frac{S_i^2}{m_i}$$
Apply this result to the variate $\hat{Y}_i - R \hat{X}_i$. This variate equals $M_i \bar{d}_i / z_i$, where $d_{ij} = y_{ij} - Rx_{ij}$. Substitution in (11.13) gives
$$V(\hat{R}) \doteq \frac{1}{nX^2} \left[ \sum_{i=1}^{N} z_i \left\{ \frac{(Y_i - RX_i)}{z_i} - (Y - RX) \right\}^2 + \sum_{i=1}^{N} \frac{M_i (M_i - m_i)}{z_i} \frac{S_{di}^2}{m_i} \right]$$
Since $Y = RX$, this may be written,
$$V(\hat{R}) \doteq \frac{1}{nX^2} \left[ \sum_{i=1}^{N} \frac{1}{z_i} (Y_i - RX_i)^2 + \sum_{i=1}^{N} \frac{M_i (M_i - m_i)}{z_i} \frac{S_{di}^2}{m_i} \right]$$
where
$$S_{di}^2 = \frac{1}{(M_i - 1)} \sum_{j=1}^{M_i} \{ (y_{ij} - Rx_{ij}) - (\bar{Y}_i - R\bar{X}_i) \}^2$$

* The theorem does not reveal how large the sample must be. As with the ordinary ratio estimate (chapter 6) the approximation is probably adequate if the coefficients of variation of the numerator and denominator of $\hat{R}$ are both less than 0.1, though further research on this point is needed.

Corollary. If $m_i = kM_i/z_i$, so that $\hat{R}$ becomes simply $y/x$, then
$$V(\hat{R}) \doteq \frac{1}{nX^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - RX_i)^2 + \frac{(M_i - m_i)}{k} S_{di}^2 \right] \tag{11.12'}$$
As noted in section 11.9, this result provides variance formulas for methods V, III, and I' as particular cases.
Theorem 11.2 applies to the unbiased methods IV and II.
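Theorem 11.2 With the same method of selection as in theorem 11.1, the estimate of $\bar{\bar{Y}}$ is
$$\bar{y}_{IV} = \frac{1}{nM} \sum_{i=1}^{n} \frac{M_i \bar{y}_i}{z_i}$$
Then
$$V(\bar{y}_{IV}) = \frac{1}{nM^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - z_i Y)^2 + \frac{M_i (M_i - m_i)}{z_i} \frac{S_i^2}{m_i} \right] \tag{11.14}$$
Proof: By the lemma,
$$V(\bar{y}_{IV}) = \frac{1}{nM^2} V\left( \frac{M_i \bar{y}_i}{z_i} \right) = \frac{1}{nM^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - z_i Y)^2 + \frac{M_i (M_i - m_i)}{z_i} \frac{S_i^2}{m_i} \right]$$
by (11.7).
Corollary. If $m_i = kM_i/z_i$, so that $\bar{y}_{IV} = y/nkM$, then
$$V(\bar{y}_{IV}) = \frac{1}{nM^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - z_i Y)^2 + \frac{(M_i - m_i)}{k} S_i^2 \right] \tag{11.14'}$$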

Variance formulas (11.12') and (11.14') are structurally rather similar. Apart from multiplying factors, the principal difference is that in the ratio estimate the variate $(y_{ij} - Rx_{ij})$ replaces the variate $y_{ij}$ which appears in the unbiased estimate. The formula for the ratio estimate is approximate; that for the unbiased estimate is exact, provided that sampling is with replacement.
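Because the variance of the unbiased estimate is stated to be exact under sampling with replacement, formula (11.14) can be checked against a complete enumeration. The sketch below is only a check on an invented population of three units (the values, the $z_i$ and the $m_i$ are hypothetical); it compares the formula with the variance computed directly over every possible sample of $n = 2$ draws.

```python
from fractions import Fraction as F
from itertools import combinations, product
from math import prod

units = [[1, 3], [2, 4, 9], [5, 6, 7, 10]]   # hypothetical element values y_ij
z = [F(1, 4), F(1, 4), F(1, 2)]              # selection probabilities z_i
m = [1, 2, 2]                                # subsample sizes m_i
n = 2                                        # draws, with replacement
M = sum(len(u) for u in units)
Y = sum(map(sum, units))

def S2(u):
    """Variance among the elements of one unit, divisor M_i - 1."""
    mean = F(sum(u), len(u))
    return sum((v - mean) ** 2 for v in u) / (len(u) - 1)

def formula_11_14():
    """Between-unit plus within-unit terms of (11.14)."""
    total = F(0)
    for u, zi, mi in zip(units, z, m):
        Mi, Yi = len(u), sum(u)
        total += (Yi - zi * Y) ** 2 / zi + Mi * (Mi - mi) * S2(u) / (zi * mi)
    return total / (n * M ** 2)

def exact_variance():
    """Variance of the method IV estimate over all samples and subsamples."""
    single = []
    for u, zi, mi in zip(units, z, m):
        subsamples = list(combinations(u, mi))
        for s in subsamples:
            single.append((zi / len(subsamples), len(u) * F(sum(s), mi) / zi))
    mean = F(Y, M)
    var = F(0)
    for draws in product(single, repeat=n):
        p = prod(q for q, _ in draws)
        estimate = sum(v for _, v in draws) / (n * M)
        var += p * (estimate - mean) ** 2
    return var

print(formula_11_14(), exact_variance())
```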

11.10 Optimum probabilities of selection. This analysis will be carried through for the ratio estimates. The analysis for the unbiased estimates is similar on account of the similar structure of the two variance formulas.
Given that the sizes, or good estimates of them, are available, what probabilities of selection should be allocated to the units? This question has been examined by Hansen and Hurwitz (1949). Their analysis is important both for its results and as an example of the technique. The cost function which they consider has three components:
$c_u$ = Fixed cost per primary unit.
$c_2$ = Cost per element or subunit.
$c_l$ = Cost of listing one element in a selected unit and other costs that vary with the number of elements to be listed.
The third cost factor is included because the sampler must usually list the elements in any selected unit, and verify their number, in order to draw a subsample.
Hence
$$\text{Cost} = c_u n + c_2 \sum_{i=1}^{n} m_i + c_l \sum_{i=1}^{n} M_i$$
This formula is not adapted for our purpose, because the quantities $\sum m_i$ and $\sum M_i$ are random variables which depend on the units that happen to be selected. Instead, we consider the average cost.
Units are selected with arbitrary probabilities $z_i$ and subsamples are chosen with $m_i = kM_i/z_i$. Now
$$E\left( \sum_{i=1}^{n} m_i \right) = n \sum_{i=1}^{N} z_i m_i = nk \sum_{i=1}^{N} M_i = nkM$$
Similarly,
$$E\left( \sum_{i=1}^{n} M_i \right) = E\left( \sum_{i=1}^{N} t_i M_i \right) = n \sum_{i=1}^{N} z_i M_i$$
Hence the average cost is
$$C = c_u n + c_2 nkM + c_l n \sum_{i=1}^{N} z_i M_i \tag{11.15}$$
In attempting to minimize the variance for fixed average cost, the variables at our disposal are $n$, $k$, and the probabilities $z_i$.
By theorem 11.1, corollary, the variance to be minimized is
$$V(\hat{R}) = \frac{1}{nX^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - RX_i)^2 + \frac{(M_i - m_i)}{k} S_{di}^2 \right]$$
Some changes will be made in this expression in order to simplify the differentiations. We also substitute $Y_i = M_i \bar{Y}_i$, $X_i = M_i \bar{X}_i$. Further, since the $m_i$ are chosen to satisfy the equations
$$m_i = \frac{kM_i}{z_i}$$
the quantity $m_i/k$ on the right may be replaced by $M_i/z_i$. These changes give
$$V = X^2 V(\hat{R}) = \sum_{i=1}^{N} \frac{1}{n} \left[ \frac{M_i^2}{z_i} (\bar{Y}_i - R\bar{X}_i)^2 - \frac{M_i}{z_i} S_{di}^2 + \frac{M_i}{k} S_{di}^2 \right]$$
If we write $d_{ij} = (y_{ij} - Rx_{ij})$, the quantity $(\bar{Y}_i - R\bar{X}_i)$ is the mean per element of $d_{ij}$ in the $i$th unit, and may be denoted by $\bar{D}_i$. Combining the first two terms inside the square bracket, this gives
$$V = \sum_{i=1}^{N} \frac{1}{n} \left[ \frac{M_i^2}{z_i} \left( \bar{D}_i^2 - \frac{S_{di}^2}{M_i} \right) + \frac{M_i S_{di}^2}{k} \right]$$
Finally, it will be noted that $n$ appears only in the combinations $nz_i$ and $nk$. We introduce the variables $z_i' = nz_i$ and $k' = nk$, so that $V$ is no longer an explicit function of $n$. Thus
$$V = \sum_{i=1}^{N} \left[ \frac{M_i^2}{z_i'} \left( \bar{D}_i^2 - \frac{S_{di}^2}{M_i} \right) + \frac{M_i}{k'} S_{di}^2 \right] \tag{11.16}$$
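We are to minimize $V$ with respect to variations in $n$, $k'$, and the $z_i'$, subject to the restrictions that cost is fixed and that
$$\sum_{i=1}^{N} z_i = 1 : \quad \text{i.e.} \quad \sum_{i=1}^{N} z_i' = n$$
Taking $\lambda$ and $\mu$ as undetermined multipliers, we minimize
$$V + \lambda \left( c_u n + c_2 k' M + c_l \sum_{i=1}^{N} z_i' M_i \right) + \mu \left( n - \sum_{i=1}^{N} z_i' \right)$$
Differentiation gives
$$n : \quad \lambda c_u + \mu = 0 \tag{11.17}$$
$$z_i' : \quad -\frac{M_i^2}{z_i'^2} \left( \bar{D}_i^2 - \frac{S_{di}^2}{M_i} \right) + \lambda c_l M_i - \mu = 0 \tag{11.18}$$
i.e.
$$z_i'^2 = \frac{M_i^2 \left( \bar{D}_i^2 - \dfrac{S_{di}^2}{M_i} \right)}{\lambda (c_u + c_l M_i)} \tag{11.19}$$
Since $\lambda$ is the same for all $i$, these equations provide explicit solutions for the $z_i'$ and hence the $z_i = z_i'/n$.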
The numerators of the $z_i'^2$ in (11.19) will be assumed positive. (If they are negative it is found that subsampling is inefficient, single-stage sampling of the primary units being superior.) For the optimum $z_i$ we have
$$z_i \propto \frac{M_i D_{iu}}{\sqrt{c_u + c_l M_i}} \tag{11.20}$$
where
$$D_{iu}^2 = \bar{D}_i^2 - \frac{S_{di}^2}{M_i}$$

The quantity $D_{iu}^2$ must now be examined, since it may depend on the size of unit $M_i$. A quantity resembling $D_{iu}^2$ has been encountered previously (section 9.4) under a different notation. Consider a population in which all units are of the same size $\bar{M}$. Perform an analysis of variance similar to table 9.5 (p. 196), on the variates $d_{ij} = y_{ij} - Rx_{ij}$.

TABLE 11.9  ANALYSIS OF VARIANCE FOR THE POPULATION

                                           df                 ms
Between primary units                      $(N - 1)$          $S_b^2 = \bar{M} \sum_{i=1}^{N} \bar{D}_i^2 / (N - 1)$
Between elements within primary units      $N(\bar{M} - 1)$   $S_w^2$

From the definition of $D_{iu}^2$, its average value over all units (assuming $M_i = \bar{M}$) is
$$\frac{1}{N} \sum_{i=1}^{N} D_{iu}^2 = \frac{1}{N} \sum_{i=1}^{N} \bar{D}_i^2 - \frac{1}{N\bar{M}} \sum_{i=1}^{N} S_{di}^2$$
Hence, if $N$ is large, the average value of $D_{iu}^2$ may be written, from table 11.9,
$$\bar{D}_u^2 \doteq \frac{S_b^2 - S_w^2}{\bar{M}} \tag{11.21}$$
In section 9.5 we studied how the functions $S_b^2$ and $S_w^2$ depend upon $\bar{M}$. We found, as empirical formulas,
$$S_w^2 = A\bar{M}^g : \qquad S_b^2 = \bar{M} S^2 - (\bar{M} - 1) A\bar{M}^g \qquad (g > 0)$$
where $S^2$ is the variance among all elements in the population. By equation (11.21), this gives
$$\bar{D}_u^2 \doteq S^2 - A\bar{M}^g \tag{11.22}$$
If $\bar{M}$ does not vary greatly, the assumption that $S_w^2$ is constant, i.e. $g = 0$ and hence $\bar{D}_u^2$ = constant, is often satisfactory, as was suggested in section 9.5. Otherwise, equation (11.22) shows that $\bar{D}_u^2$ may be expected to decrease as $\bar{M}$ increases. From their experience, Hansen and Hurwitz (1943) suggest that $\bar{D}_u^2$ will seldom, in practice, decrease as fast as $1/\bar{M}$.
We are now in a position to discuss the optimum choice of the $z_i$. From (11.20)
$$z_i \propto \frac{M_i D_{iu}}{\sqrt{c_u + c_l M_i}}$$
The following deductions may be drawn from this result:
i. Suppose that $c_l M_i$, the cost of listing and related operations per primary unit, is small relative to $c_u$, the fixed cost per primary unit. If $D_{iu}$ is constant, then selection with pps is optimum. If $D_{iu}$ decreases with increasing $M_i$, optimum probabilities lie between $z_i \propto M_i$ and $z_i \propto \sqrt{M_i}$.
ii. If the cost of listing predominates, optimum probabilities lie between $z_i \propto \sqrt{M_i}$ and $z_i$ = constant (equal probabilities).
iii. If costs of listing and fixed costs are of the same order of magnitude, $z_i \propto \sqrt{M_i}$ is a good compromise.
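The two limiting cases in deductions i and ii can be illustrated numerically. The sketch below assumes the proportionality $z_i \propto M_i D_{iu} / \sqrt{c_u + c_l M_i}$ that follows from (11.19); the function name, the unit sizes and the cost figures are invented for illustration, and $D_{iu}$ is taken as constant.

```python
from math import sqrt

def optimum_probabilities(sizes, d_iu, c_u, c_l):
    """Normalised z_i proportional to M_i * D_iu / sqrt(c_u + c_l * M_i)."""
    raw = [M * d / sqrt(c_u + c_l * M) for M, d in zip(sizes, d_iu)]
    total = sum(raw)
    return [r / total for r in raw]

sizes = [10, 40, 90, 160]       # hypothetical unit sizes M_i
const_d = [1.0] * len(sizes)    # D_iu assumed constant over units

# Listing cost negligible: the optimum should coincide with pps.
no_listing = optimum_probabilities(sizes, const_d, c_u=5.0, c_l=0.0)
pps = [M / sum(sizes) for M in sizes]

# Listing cost dominant: the optimum should be proportional to sqrt(M_i).
listing_only = optimum_probabilities(sizes, const_d, c_u=0.0, c_l=1.0)
root = [sqrt(M) / sum(sqrt(s) for s in sizes) for M in sizes]

# Costs of the same order: an intermediate allocation.
mixed = optimum_probabilities(sizes, const_d, c_u=40.0, c_l=1.0)

print(no_listing, listing_only, mixed)
```

With both cost components present the probabilities fall between the two limiting allocations, which is the basis for the compromise suggested in deduction iii.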
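The optimum $k$, found by differentiation, is left as an exercise to the reader. Its value is
$$k = \sqrt{\frac{\sum M_i S_{di}^2}{c_2 M}} \Bigg/ \sum_{i=1}^{N} \frac{M_i D_{iu}}{\sqrt{c_u + c_l M_i}}$$
The optimum $n$ is found by solving the cost equation (11.15) for $n$.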
As has been mentioned, the discussion in this section assumes that
the sizes, or good estimates of them, are known. No part of the budget
is allotted to obtaining information about the sizes of the N units,
except for any listing needed in the units that comprise the sample.

11.11 Biased versus unbiased estimates. In this section we examine the conditions under which the biased estimates based on the sample mean have a smaller variance than the corresponding unbiased methods. Assuming an arbitrary probability of selection $z_i$ for the primary units, the comparison is between method V and method IV in table 11.8. The estimates are:
$$\bar{y}_V = \frac{\sum (M_i/z_i)\, \bar{y}_i}{\sum (M_i/z_i)} = \bar{y} \quad (\text{if } m_i = kM_i/z_i)$$
$$\bar{y}_{IV} = \frac{1}{nM} \sum \frac{M_i}{z_i} \bar{y}_i = \frac{y}{nkM}$$
Structurally, $\bar{y}_V$ is analogous to a ratio estimate $\bar{y}'/\bar{x}'$, where $y_i' = M_i \bar{y}_i / z_i$ and $x_i' = M_i/z_i$, while $\bar{y}_{IV} = \bar{y}'/M$ is analogous to a "mean per unit" estimate. As with an ordinary ratio estimate, we might therefore anticipate that the size of the correlation between $y_i'$ and $x_i'$ will decide whether $\bar{y}_V$ is the more precise.
For simplicity, we assume $m_i = kM_i/z_i$ with both methods. From theorem 11.1, corollary, we have
$$V(\hat{R}) \doteq \frac{1}{nX^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - RX_i)^2 + \frac{(M_i - m_i)}{k} S_{di}^2 \right]$$
The variance $V(\bar{y}_V)$ is found as a particular case of this result by putting $x_{ij} = 1$. This gives
$$R = \frac{Y}{X} = \frac{Y}{M} = \bar{\bar{Y}}$$
$$S_{di}^2 = S_i^2 = \frac{1}{M_i - 1} \sum_{j=1}^{M_i} (y_{ij} - \bar{Y}_i)^2$$
Hence
$$V(\bar{y}_V) \doteq \frac{1}{nM^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - M_i \bar{\bar{Y}})^2 + \frac{(M_i - m_i)}{k} S_i^2 \right]$$
For $V(\bar{y}_{IV})$ we have, from theorem 11.2, corollary,
$$V(\bar{y}_{IV}) = \frac{1}{nM^2} \sum_{i=1}^{N} \left[ \frac{1}{z_i} (Y_i - z_i Y)^2 + \frac{(M_i - m_i)}{k} S_i^2 \right]$$

The within-unit contributions are the same in the two estimates. Hence $\bar{y}_V$ has the smaller variance if
$$\sum \frac{1}{z_i} (Y_i - M_i \bar{\bar{Y}})^2 < \sum \frac{1}{z_i} (Y_i - z_i Y)^2$$
This may be written
$$\sum z_i \left( \frac{Y_i}{z_i} - \bar{\bar{Y}} \frac{M_i}{z_i} \right)^2 < \sum z_i \left( \frac{Y_i}{z_i} - Y \right)^2 \tag{11.23}$$
Now let $Y_i' = Y_i/z_i$, $X_i' = M_i/z_i$ be variates defined for every primary unit. Since units are selected with probabilities $z_i$, we have
$$E(Y_i') = \sum z_i \frac{Y_i}{z_i} = Y$$
$$E(X_i') = \sum z_i \frac{M_i}{z_i} = M$$
Hence $\bar{\bar{Y}} = Y/M$ is the population ratio $R'$ of $Y_i'$ to $X_i'$. It follows that the right side of (11.23) is the population variance of $Y_i'$, whereas the left side is the population variance of the variate $(Y_i' - R'X_i')$.
By expanding the variance of $(Y_i' - R'X_i')$ we find, as with the ordinary ratio estimate (section 6.7), that $V(\bar{y}_V)$ is smaller than $V(\bar{y}_{IV})$ if
$$\rho > \frac{1}{2} \, \frac{\text{cv of } M_i/z_i}{\text{cv of } Y_i/z_i}$$
where $\rho$ is the correlation coefficient between $Y_i/z_i$ and $M_i/z_i$.
When primary units are chosen with equal probability, this result reduces to the simpler condition
$$\rho_{Y_i M_i} > \frac{1}{2} \, \frac{\text{cv of } M_i}{\text{cv of } Y_i} \tag{11.24}$$
This condition shows that the relative precisions of the biased and unbiased estimates depend on the size of the correlation between the primary unit totals $Y_i$ and the primary unit sizes $M_i$. The comparison is, however, a large-sample one, in which the bias in the method V estimates has been ignored.
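For equal selection probabilities the inequality between the two between-unit sums of squares and the correlation condition are algebraically equivalent (for positive totals and sizes), and this can be seen on small examples. The sketch below uses two invented populations of four primary units; the function name and all values are hypothetical.

```python
from statistics import mean, pstdev

def compare_equal_probability(Y, M):
    """Return (between-unit term smaller for method V?, correlation condition met?).

    Y[i] and M[i] are the primary-unit totals and sizes; units are assumed to
    be selected with equal probabilities z_i = 1/N.
    """
    N = len(Y)
    R = sum(Y) / sum(M)                                  # mean per element
    lhs = sum((y - R * s) ** 2 for y, s in zip(Y, M))    # between-unit term, method V
    rhs = sum((y - sum(Y) / N) ** 2 for y in Y)          # between-unit term, method IV
    y_bar, m_bar = mean(Y), mean(M)
    cov = sum((y - y_bar) * (s - m_bar) for y, s in zip(Y, M)) / N
    rho = cov / (pstdev(Y) * pstdev(M))
    cv_M = pstdev(M) / m_bar
    cv_Y = pstdev(Y) / y_bar
    return lhs < rhs, rho > 0.5 * cv_M / cv_Y

sizes = [10, 20, 30, 40]
proportional = compare_equal_probability([105, 190, 310, 395], sizes)   # totals track sizes
unrelated = compare_equal_probability([300, 310, 290, 305], sizes)      # totals ignore sizes
print(proportional, unrelated)
```

When the unit totals are roughly proportional to the sizes, both checks favour the sample-mean estimate; when the totals are nearly constant, both favour the unbiased estimate.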

11.12 Estimated variances. As we have seen, most of the estimates are particular cases either of the ratio estimate in section 11.9 or of the unbiased estimate in method IV, for which the true variances were presented in theorems 11.1 and 11.2. Sample estimates of these variances will now be given.
Theorem 11.3 A sample of $n$ primary units is drawn (with replacement) with probabilities $z_i$. From each selected unit a subsample is drawn by simple random sampling. The estimate is
$$\hat{R} = \frac{\sum_{i=1}^{n} (M_i/z_i)\, \bar{y}_i}{\sum_{i=1}^{n} (M_i/z_i)\, \bar{x}_i}$$
An unbiased estimate of the approximate variance of $\hat{R}$ is
$$v(\hat{R}) = \frac{1}{n^2 X^2} \sum_{i=1}^{n} \left\{ \frac{M_i}{z_i m_i} (y_i - Rx_i) \right\}^2 \tag{11.25}$$
where $R$ is the population ratio $Y/X$.
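Proof: The primary unit sample totals for the $n$ successive drawings are denoted by $y_1, y_2, \ldots, y_n$. Note that, although the same unit may appear more than once, we give it a separate subscript each time it appears.
Let
$$d_i = y_i - Rx_i$$
Then
$$\frac{M_i}{z_i m_i} (y_i - Rx_i) = \frac{M_i}{z_i m_i} d_i = \frac{M_i}{z_i} \bar{d}_i = \frac{M_i}{z_i} \bar{D}_i + \frac{M_i}{z_i} (\bar{d}_i - \bar{D}_i) = \frac{D_i}{z_i} + \frac{M_i}{z_i} (\bar{d}_i - \bar{D}_i)$$
where
$$\bar{D}_i = \bar{Y}_i - R\bar{X}_i \quad \text{and} \quad D_i = M_i \bar{D}_i = Y_i - RX_i$$
Now square and average over all subsamples from the $i$th unit.
$$E\left( \frac{M_i d_i}{z_i m_i} \right)^2 = \left( \frac{D_i}{z_i} \right)^2 + \frac{M_i (M_i - m_i)}{z_i^2} \frac{S_{di}^2}{m_i}$$
Hence, for a fixed selection of $n$ units,
$$E \sum_{i=1}^{n} \left( \frac{M_i d_i}{z_i m_i} \right)^2 = \sum_{i=1}^{N} t_i \left[ \left( \frac{D_i}{z_i} \right)^2 + \frac{M_i (M_i - m_i)}{z_i^2} \frac{S_{di}^2}{m_i} \right]$$
where $t_i$ is the number of times that the $i$th unit has been drawn in any specific sample. When we average over all possible sets of $n$ units, $E(t_i) = nz_i$. Hence
$$E \sum_{i=1}^{n} \left( \frac{M_i d_i}{z_i m_i} \right)^2 = n \sum_{i=1}^{N} \left[ \frac{D_i^2}{z_i} + \frac{M_i (M_i - m_i)}{z_i} \frac{S_{di}^2}{m_i} \right]$$
But by theorem 11.1, formula (11.12), since $D_i = Y_i - RX_i$,
$$V(\hat{R}) \doteq \frac{1}{nX^2} \sum_{i=1}^{N} \left[ \frac{D_i^2}{z_i} + \frac{M_i (M_i - m_i)}{z_i} \frac{S_{di}^2}{m_i} \right]$$
The result follows [note the divisor $n^2 X^2$ in $v(\hat{R})$].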
Corollary. Theorem 11.3 is not usable as it stands, since $R$, and in some applications $X$ also, are unknown. In place of $R$ we put the sample estimate $\hat{R}$ and replace an $n$ in the denominator by $(n - 1)$, as was done with the ordinary ratio estimate in chapter 6. An unbiased estimate of $X$ is
$$\bar{\hat{X}}_n = \frac{1}{n} \sum_{i=1}^{n} \frac{M_i \bar{x}_i}{z_i}$$
Hence we take
$$v(\hat{R}) = \frac{1}{n(n - 1) \bar{\hat{X}}_n^2} \sum_{i=1}^{n} \left\{ \frac{M_i}{z_i m_i} (y_i - \hat{R} x_i) \right\}^2 \tag{11.26}$$
Example. Compute the standard error of the estimated mean age by method I' in table 11.7 (p. 244). In this sample, 20 pages were selected with equal probabilities, with $m_i = 2$.
The estimate is
$$\bar{y}_{I'} = \frac{\sum M_i \bar{y}_i}{\sum M_i} = \frac{17,121.5}{359} = 47.6922 \text{ years}$$
Some extra decimal places are retained to ensure accuracy in later computations.
To apply formula (11.26), we put
$$z_i = \frac{1}{N} : \quad m_i = 2 : \quad \hat{R} = \bar{y}_{I'} : \quad x_i = m_i = 2 : \quad \bar{\hat{X}}_n = \frac{N}{n} \sum M_i$$
Substitution into formula (11.26) gives
$$v(\hat{R}) = \frac{n}{(n - 1)(\sum M_i)^2} \sum_{i=1}^{n} \{ M_i (\bar{y}_i - \bar{y}_{I'}) \}^2 \tag{11.27}$$
The sum of squares is easily computed from table 11.7, p. 244, in the form
$$\sum (M_i \bar{y}_i)^2 - 2\bar{y}_{I'} \sum (M_i \bar{y}_i) M_i + \bar{y}_{I'}^2 \sum M_i^2$$
$$= 15,375,020 - (95.3844)(309,747.5) + (2274.55)(6481) = 571,300 \tag{11.28}$$
$$v(\hat{R}) = \frac{(20)(571,300)}{(19)(359)^2} = 4.67$$
$$s(\hat{R}) = 2.16 \text{ years}$$
Formula (11.27) can be seen to be identical [apart from the fpc $(N - n)/N$] with formula (6.14), p. 119, which was used to compute the variance of a ratio estimate in single-stage sampling. Hence we could have used formula (6.14) here by putting $y_i' = M_i \bar{y}_i$, $x_i' = M_i$, and calculating the variance for the ratio $\sum y_i' / \sum x_i'$. This is a particular case of a general result which was also noted in chapter 10. If $n/N$ is negligible in two-stage sampling, estimated variances can be found by the appropriate formulas for single-stage sampling. This is fortunate, for, despite the relative complexity of two-stage sampling, the formulas for estimated variances remain simple.
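The arithmetic of this example can be repeated directly from the unit sizes and unit totals of table 11.7. The sketch below applies formula (11.27) without the intermediate rounding used in the text, so the sum of squares differs slightly from the rounded figure of 571,300; the variable names are invented for illustration.

```python
from math import sqrt

# (M_i, unit total y_i) for the 20 pages of table 11.7; m_i = 2 on every page.
data = [(15, 77), (19, 89), (19, 88), (16, 96), (16, 104), (19, 77), (18, 86),
        (18, 100), (18, 80), (18, 105), (20, 135), (18, 81), (19, 115),
        (19, 132), (18, 69), (16, 103), (16, 110), (19, 69), (19, 100), (19, 88)]

n = len(data)
a = [M * y / 2 for M, y in data]    # M_i * ybar_i, with ybar_i = y_i / 2
b = [M for M, _ in data]            # M_i

ratio = sum(a) / sum(b)                                        # method I' estimate
ss = sum((ai - ratio * bi) ** 2 for ai, bi in zip(a, b))       # sum of squares in (11.27)
v = n * ss / ((n - 1) * sum(b) ** 2)                           # estimated variance
se = sqrt(v)                                                   # standard error, years

print(round(ratio, 4), round(ss), round(v, 2), round(se, 2))
```

As the text notes, the same figures would come from treating $M_i \bar{y}_i$ and $M_i$ as the two variates of an ordinary single-stage ratio estimate.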
The following result gives the sample estimate of variance for the unbiased estimate in method IV.
Theorem 11.4 With the same method of sampling as in theorem 11.3, the estimate of $\bar{\bar{Y}}$ is
$$\bar{y}_{IV} = \frac{1}{nM} \sum_{i=1}^{n} \frac{M_i \bar{y}_i}{z_i}$$
Then an unbiased sample estimate of $V(\bar{y}_{IV})$ is
$$v(\bar{y}_{IV}) = \frac{1}{n(n - 1)M^2} \sum_{i=1}^{n} (\hat{Y}_i - \bar{\hat{Y}}_n)^2 \tag{11.29}$$
where
$$\hat{Y}_i = \frac{M_i \bar{y}_i}{z_i} : \qquad \bar{\hat{Y}}_n = \frac{1}{n} \sum_{i=1}^{n} \hat{Y}_i$$
The proof is obtained by the same approach as in theorem 11.3.


Estimates of variance for all the methods discussed in this chapter are deducible from these theorems. They are shown in table 11.10 for the self-weighting forms of the estimate. If the self-weighting conditions do not apply, the reader should obtain his sample estimates of variance directly from theorems 11.3 and 11.4.

TABLE 11.10  SAMPLE VARIANCES FOR ESTIMATES OF THE POPULATION MEAN

         Unit                         Estimate                Sample estimate of variance
Method   probability   $m_i$          of $\bar{\bar{Y}}$      (for a self-weighting sample)
I'       Eq.           $kNM_i$        $\bar{y}$               $\frac{n}{(n-1)m^2} \sum \{ m_i (\bar{y}_i - \bar{y}) \}^2$
II       Eq.           $kNM_i$        $y/nkM$                 $\frac{1}{n(n-1)(kM)^2} \sum (y_i - y/n)^2$
III      $M_i/M$       $m$            $\bar{y}$               $\frac{1}{n(n-1)m^2} \sum (y_i - y/n)^2$
IV       $z_i$         $kM_i/z_i$     $y/nkM$                 $\frac{1}{n(n-1)(kM)^2} \sum (y_i - y/n)^2$
V        $z_i$         $kM_i/z_i$     $\bar{y}$               $\frac{n}{(n-1)m^2} \sum \{ m_i (\bar{y}_i - \bar{y}) \}^2$

Here $y_i$ is the sample total in the $i$th unit and $y = \sum y_i$. In methods I' and V, $m = \sum m_i$ is the total number of elements in the sample; in method III, $m$ is the constant subsample size per unit.
The formulas in table 11.10 assume sampling with replacement, and ignore the fpc which applies in methods I' and II when units are chosen without replacement. The formulas are adequate provided that $n/N$ is small.
11.13 Extension to stratified sampling. For the unbiased methods II, III, and IV, the extension to stratified sampling is straightforward. The subscript $h$ denotes the stratum: $M_h$ is the total number of elements in stratum $h$, and $M$ is the total number of elements in the population. The estimated population mean is
$$\bar{y}_{st} = \sum_{h=1}^{L} W_h \bar{y}_h, \qquad W_h = M_h/M \tag{11.30}$$
where $\bar{y}_h$ denotes the estimate of the stratum mean per element $\bar{\bar{Y}}_h$. Further,
$$V(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 V(\bar{y}_h) \tag{11.31}$$
and the estimated variance is
$$v(\bar{y}_{st}) = \sum_{h=1}^{L} W_h^2 v(\bar{y}_h) \tag{11.32}$$
Table 11.10, which gives the value of $v(\bar{y}_h)$ for a single stratum, is useful in constructing these estimates of variance.
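Formulas (11.30) and (11.32) amount to a weighted combination of the separate stratum results. A minimal sketch follows; the function name and the two-stratum figures are invented, and the stratum estimates and variances are assumed to have been obtained already, for example from table 11.10.

```python
def stratified_estimate(stratum_means, stratum_variances, stratum_sizes):
    """Combine stratum estimates with weights W_h = M_h / M.

    Returns the estimated population mean and its estimated variance.
    """
    M = sum(stratum_sizes)
    weights = [M_h / M for M_h in stratum_sizes]
    mean = sum(w * e for w, e in zip(weights, stratum_means))
    variance = sum(w ** 2 * v for w, v in zip(weights, stratum_variances))
    return mean, variance

# Hypothetical two-stratum illustration.
mean, variance = stratified_estimate(
    stratum_means=[47.7, 52.0],
    stratum_variances=[4.67, 3.00],
    stratum_sizes=[30_000, 20_000],
)
print(round(mean, 2), round(variance, 4))
```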
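Table 11.11, which is an extension of part of table 11.8 to stratified sampling, presents the algebraic forms of the three unbiased estimates. The column headed $m_{hi}$ shows the subsample sizes which make the estimates self-weighting within strata. The self-weighting forms of the estimates are given in the right-hand column: the general forms apply if the $m_{hi}$ have not been chosen so as to make the estimate self-weighting within strata. In the summations within the table, $h$ goes from 1 to $L$ and $i$ from 1 to $n_h$.

TABLE 11.11  UNBIASED ESTIMATES IN STRATIFIED TWO-STAGE SAMPLING

         Unit                                         Sample estimate of $\bar{\bar{Y}}$
         probability
Method   within strata   $m_{hi}$               General form                                                        Self-weighting form
II       Eq.             $k_h N_h M_{hi}$       $\frac{1}{M} \sum_h \frac{N_h}{n_h} \sum_i M_{hi} \bar{y}_{hi}$     $\frac{1}{M} \sum_h \frac{y_h}{n_h k_h}$ ($= y/n_h k_h M$ if $n_h k_h$ is constant)
III      $M_{hi}/M_h$    $m_h$                  $\frac{1}{M} \sum_h \frac{M_h}{n_h} \sum_i \bar{y}_{hi}$            $\frac{1}{M} \sum_h \frac{M_h y_h}{n_h m_h}$ ($= M_h y/n_h m_h M$ if $n_h m_h/M_h$ is constant)
IV       $z_{hi}$        $k_h M_{hi}/z_{hi}$    $\frac{1}{M} \sum_h \frac{1}{n_h} \sum_i \frac{M_{hi} \bar{y}_{hi}}{z_{hi}}$   $\frac{1}{M} \sum_h \frac{y_h}{n_h k_h}$ ($= y/n_h k_h M$ if $n_h k_h$ is constant)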

The right-hand column of table 11.11 also shows the conditions under which the estimates become completely self-weighting. It will be recalled that $n_h k_h$ is the overall sampling fraction within stratum $h$. Hence, if the overall sampling fraction is kept constant in all strata, estimates II and IV are completely self-weighting. The same result holds for sampling with pps in method III, for with this method the number of elements $m_h$ chosen per primary unit is constant within a stratum, so that $n_h m_h / M_h$ is again the overall sampling fraction in stratum $h$.
With the biased estimates I' and V, and the ratio estimate, we could use a weighted mean of the estimates for the individual strata, similar to formula (11.30). But since all three estimates are essentially ratio estimates, the biases may have the same sign in all strata, and if there are numerous strata, with small samples from each stratum, the bias in the weighted mean may be substantial. This kind of estimate is to be recommended only if there are few strata, with large samples in each stratum, or if we have evidence that the overall bias will still be negligible.
The alternative is to make a combined ratio estimate, which will be
illustrated for the most general case. Let
$$\hat{y}_h = \frac{1}{n_h}\sum_{i=1}^{n_h}\frac{M_{hi}\bar{y}_{hi}}{z_{hi}}$$
with a corresponding definition for $\hat{x}_h$. The quantities $\hat{y}_h$, $\hat{x}_h$ are
unbiased sample estimates of the stratum totals $Y_h$, $X_h$, respectively.
The combined ratio estimate is defined as
$$R_c = \frac{\sum_h \hat{y}_h}{\sum_h \hat{x}_h}$$
The approximate variance of $R_c$ is found in the usual way by writing
$$R_c - R \doteq \frac{1}{X}\sum_{h=1}^{L}(\hat{y}_h - R\hat{x}_h) \tag{11.33}$$
The quantity $(\hat{y}_h - R\hat{x}_h)$ is an unbiased method IV estimate of the
stratum total $(Y_h - RX_h)$ of the variate $d_{hij} = y_{hij} - Rx_{hij}$. Hence
$V(\hat{y}_h - R\hat{x}_h)$ is found as a particular case of theorem 11.2. This
leads to the result:
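$$V(R_c) \doteq \frac{1}{X^2}\sum_{h=1}^{L}\frac{1}{n_h}\sum_{i=1}^{N_h}\left[\frac{(D_{hi} - z_{hi}D_h)^2}{z_{hi}} + \frac{M_{hi}(M_{hi} - m_{hi})}{z_{hi}m_{hi}}S_{dhi}^2\right] \tag{11.34}$$
where $D_{hi}$ and $D_h$ are the totals of $d_{hij}$ over the primary unit and
over the stratum, respectively, and $S_{dhi}^2$ is the variance of $d_{hij}$ among
the elements of the unit.
In order to deduce the variances for methods V and I', we substitute
$x_{hij} \equiv 1$ in formula (11.34). This gives
$$X = M, \quad R = \bar{\bar{Y}}$$
$$D_{hi} = Y_{hi} - \bar{\bar{Y}}M_{hi}$$
$$S_{dhi}^2 = S_{hi}^2 = \frac{1}{M_{hi} - 1}\sum_{j=1}^{M_{hi}}(y_{hij} - \bar{Y}_{hi})^2$$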
The resulting variance formula, like (11.34), is a ratio-type approximation
and assumes that $n_h/N_h$ is small. Another particular case is:
Method III.
Probability proportional to size within strata. $z_{hi} = M_{hi}/M_h$.
Estimate: same as in table 11.11 if $m_{hi} = m_h$.
The variance formula can be deduced from either (11.31) or (11.34).
$$V(\bar{\bar{y}}_{III}) = \frac{1}{M^2}\sum_{h=1}^{L}\frac{M_h}{n_h}\sum_{i=1}^{N_h}\left[M_{hi}(\bar{Y}_{hi} - \bar{\bar{Y}}_h)^2 + \frac{(M_{hi} - m_h)}{m_h}S_{hi}^2\right]$$
The choice of the optimum sampling ratios $n_h/N_h$ for units within
strata can be attacked by the techniques given in chapter 5, but will
not be discussed here.
For the estimated variance of $R_c$, we have, from equation (11.33),
$$v(R_c) \doteq \frac{1}{X^2}\sum_{h=1}^{L} v(\hat{y}_h - R\hat{x}_h)$$
By an application of theorem 11.4 to the variate $d_{hij}$ in the individual
strata, we find
$$v(R_c) = \frac{1}{X^2}\sum_{h=1}^{L}\frac{1}{n_h(n_h - 1)}\sum_{i=1}^{n_h}(d_{hi}' - \bar{d}_h')^2 \tag{11.35}$$
where
$$d_{hi}' = \frac{M_{hi}\bar{d}_{hi}}{z_{hi}}, \quad \bar{d}_h' = \frac{1}{n_h}\sum_{i=1}^{n_h} d_{hi}'$$
$$\bar{d}_{hi} = \bar{y}_{hi} - R_c\bar{x}_{hi}$$
If $X$ is not known, the sample estimate $\sum \hat{x}_h$ is substituted for it.
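The calculation in (11.35) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data layout (a list of strata, each a list of sampled primary units with hypothetical keys `M`, `z`, `ybar`, `xbar`) and the function name are not from the text, and units are taken to be drawn with replacement as in method IV.

```python
def combined_ratio(strata, X=None):
    """Combined ratio estimate R_c and its estimated variance, formula (11.35).

    strata: list of strata; each stratum is a list of sampled primary units,
    each unit a dict with keys (names are illustrative assumptions):
      M    - number of elements in the unit
      z    - probability with which the unit was drawn
      ybar - subsample mean of y in the unit
      xbar - subsample mean of x in the unit
    X: known population total of x; if None, the sample estimate is used.
    """
    y_hat, x_hat = [], []
    for units in strata:
        n_h = len(units)
        # method IV estimates of the stratum totals Y_h and X_h
        y_hat.append(sum(u["M"] * u["ybar"] / u["z"] for u in units) / n_h)
        x_hat.append(sum(u["M"] * u["xbar"] / u["z"] for u in units) / n_h)
    r_c = sum(y_hat) / sum(x_hat)
    if X is None:
        X = sum(x_hat)
    v = 0.0
    for units in strata:
        n_h = len(units)
        # d'_hi = M_hi * (ybar_hi - R_c * xbar_hi) / z_hi
        d = [u["M"] * (u["ybar"] - r_c * u["xbar"]) / u["z"] for u in units]
        d_mean = sum(d) / n_h
        v += sum((di - d_mean) ** 2 for di in d) / (n_h * (n_h - 1))
    return r_c, v / X ** 2
```

Each stratum needs at least two sampled primary units, since the divisor $n_h(n_h - 1)$ vanishes otherwise.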
11.14 Summary comments. As the discussion in this chapter indicates,
the efficient design of a two-stage sample with units of unequal
size requires a good deal of preliminary work. This section recapitulates
briefly the principal issues.
1. Find out whether the sizes are known, known approximately, or
unknown. In the last case, consider whether some information about
sizes can be obtained relatively easily. For example, Jessen et al.
(1947) conducted two-stage samples of blocks in some Greek towns in
which no usable estimates of the numbers of households per block were
available. They considered three approaches: (i) Drawing the blocks
with equal probabilities. (ii) Making a rapid tour of the town by jeep
in order to tie together small blocks so as to build artificial blocks that
appeared to have roughly the same numbers of households. Blocks
which obviously had no households were eliminated in this process.
The sample blocks were then chosen with equal probability. (iii)
Cruising the town slowly enough to permit estimates to be made of
the number of households in each block. Blocks were then chosen
with probability proportional to estimated sizes.
2. Consider whether to use size of unit as one of the variables for
stratification: this is advisable unless it prevents the use of some other
variable that might give a worth-while increase in precision.
3. Decide how the units are to be selected within strata. If sizes
are known at least approximately, selection with pps, or its square
root, will often be the best procedure, although this depends on the
nature of the field costs.
4. Select a method of estimation. For estimating the population
mean or total, a ratio estimate using the value of the same item at a
recent census is sometimes very successful, if available. Estimates
based on the sample mean or weighted sample mean are often more
precise than the unbiased estimates.
5. Decide on the sampling and subsampling fractions within strata.
We have recommended that subsampling fractions be chosen so that
the estimates are self-weighting within strata. Further control so that
the sample is completely self-weighting is advisable unless it appears
to be accompanied by a substantial loss of precision.
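As an illustration of step 3, the following Python sketch draws primary units with replacement with probability proportional to size, or to a power of size such as the square root. The function name and the `power` argument are assumptions made for the example; the routine relies on the standard library's weighted sampling rather than on the cumulative-total procedure, and it returns the selection probabilities $z_i$ that the unbiased estimates require.

```python
import random

def select_pps(sizes, n, power=1.0, rng=None):
    """Draw n primary units with replacement, with probability
    proportional to size**power (power=0.5 gives the square root of size).

    Returns the list of chosen unit indices and the selection
    probabilities z_i, which are needed later in the estimates.
    """
    rng = rng or random.Random()
    weights = [m ** power for m in sizes]
    total = sum(weights)
    z = [w / total for w in weights]
    chosen = rng.choices(list(range(len(sizes))), weights=z, k=n)
    return chosen, z

# Hypothetical block sizes (estimated numbers of households per block).
sizes = [10, 30, 60]
chosen, z = select_pps(sizes, n=5, rng=random.Random(0))
```

If the sizes are only rough estimates, as in approach (iii) of step 1, the same routine applies with the estimated sizes in place of the true ones.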
Accounts of the planning and conduct of actual surveys involving
two-stage sampling with primary units of unequal size are contained
in the monograph A chapter in population sampling (1950) and in the
publications by Gray and Corlett (1950) and Yates (1949).

11.15 Exercises.
11.1 By working out the estimates for all possible samples which can be
drawn from the artificial population in table 11.1, by methods Ia, Ib, II, and
III, verify the total variances given in table 11.2.
11.2 A population contains 2 primary units, with 6 and 4 elements,
respectively. The values of $y_{ij}$ are 0, 0, 1, 3, 4, 4 in the first unit, and 0, 0, 2, 2
in the second. One primary unit is chosen, and the expected sample size is
to be 3 elements. Work out the contributions to the total variance of the
estimated mean per element in methods Ia, II, III, IV, and V. In the last
two cases, the $z_i$ are 0.55 and 0.45, respectively, and $m_i = kM_i/z_i$.
11.3 The elements in a population with 3 primary units are classified into
2 classes. The unit sizes $M_i$ and the proportions $P_i$ of elements which belong
to the first class are as follows:

$M_1 = 100$, $M_2 = 200$, $M_3 = 300$; $P_1 = 0.40$, $P_2 = 0.45$, $P_3 = 0.85$

For a sample consisting of 50 elements from 1 primary unit, compare the
variances of methods Ia, II, and III for estimating the proportion of elements
in the first class in the population. (In the variance formulas in section 11.2,
$S_i^2$ is approximately $P_iQ_i$.)
11.4 A sample of $n$ primary units is chosen with pps, and $m$ elements are
sampled from each unit in the sample. Deduce the formula for $V(\bar{\bar{y}}_{III})$ from
both theorems 11.1 and 11.2. Is the variance formula exact?
11.5 A sample of $n$ primary units is chosen with equal probabilities and
without replacement. The unbiased estimate $\bar{\bar{y}}_{II}$ of $\bar{\bar{Y}}$ is used. Show that
$$M(\bar{\bar{y}}_{II} - \bar{\bar{Y}}) = \frac{N}{n}\sum_{i=1}^{n} M_i(\bar{y}_i - \bar{Y}_i) + N(\bar{Y}_n - \bar{Y})$$
where $\bar{Y}_n$ is the true mean per primary unit for the $n$ units in the sample.
Hence find the exact variance of $\bar{\bar{y}}_{II}$ and compare it with the variance deduced
from theorem 11.2, which assumes sampling with replacement.
11.6 For the data in table 11.7, estimate from the sample the standard
error of the unbiased estimate which was made of the mean age of entries in
American Men of Science. ($M$ may be taken as 50,000.)

11.16 References.
A chapter in population sampling (1950). U. S. Government Printing Office.
GRAY, P. G., and CORLETT, T. (1950). Sampling for the social survey. Jour. Roy.
Stat. Soc., A113, 150-206.
HANSEN, M. H., and HURWITZ, W. N. (1943). On the theory of sampling from
finite populations. Ann. Math. Stat., 14, 333-362.
HANSEN, M. H., and HURWITZ, W. N. (1949). On the determination of the
optimum probabilities in sampling. Ann. Math. Stat., 20, 426-432.
JEBE, E. H. (1952). Estimation for sub-sampling designs employing the county
as a primary sampling unit. Jour. Amer. Stat. Assoc., 47, 49-70.
JESSEN, R. J., et al. (1947). On a population sample for Greece. Jour. Amer. Stat.
Assoc., 42, 357-384.
MADOW, L. H. (1950). On the use of the county as a primary sampling unit for
state estimates. Jour. Amer. Stat. Assoc., 45, 30-47.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 12

DOUBLE SAMPLING

12.1 Description of the technique. As we have seen, a number of
sampling techniques depend upon the possession of advance information
about an auxiliary variate $x_i$. Ratio and regression estimates
require a knowledge of the population mean $\bar{X}$. If it is desired to
stratify the population according to the values of the $x_i$, their
frequency distribution must be known.
When such information is lacking, it is sometimes relatively cheap
to take a large preliminary sample in which $x_i$ alone is measured. The
purpose of this sample is to furnish a good estimate of $\bar{X}$ or of the
frequency distribution of $x_i$. In a survey whose function is to make
estimates for some other variate $y_i$, it may pay to devote part of the
resources to this preliminary sample, although this means that the size
of the sample in the main survey on $y_i$ must be decreased. This
technique is known as double sampling or two-phase sampling. As the
discussion implies, the technique is profitable only if the gain in
precision from ratio or regression estimates or stratification more than
offsets the loss in precision due to the reduction in the size of the main
sample.
Double sampling may be very appropriate when the information
about $x_i$ is on file cards that have not been tabulated. For instance, in
surveys of the German civilian population in 1945, the sample from
any town was usually drawn from rationing registration lists. In
addition to geographic stratification within the town, for which data
were usually already available, stratification by sex and age was
proposed. Since the sample had to be drawn in a hurry, and since the
lists were in constant use, tabulation of the complete age and sex
distribution was not feasible. A moderately large systematic sample
could, however, be drawn quickly. Each person drawn was classified
into the appropriate age-sex class. From these data the much smaller
list of persons to be interviewed was selected.

12.2 Double sampling for stratification. The theory was first given
by Neyman (1938).
The population is to be stratified into a number of classes according
to the values of $x_i$. The first sample is a simple random sample of
size $n'$. Let:
$W_h = N_h/N$ = proportion of population falling into stratum $h$.
$w_h = n_h'/n'$ = proportion of first sample falling into stratum $h$.
Then $w_h$ is an estimate of $W_h$.
The second sample is a stratified random sample of size $n$ in which
$y_i$ is measured: $n_h$ units are drawn from stratum $h$. The second sample
is often a subsample from the first sample, but it may be drawn
independently if this is more convenient.
The cost of the two samples is assumed to be of the form
$$C = nc_n + n'c_{n'} \tag{12.1}$$
where $c_n$ is usually large relative to $c_{n'}$.
The problem is to choose $n'$ and the $n_h$ (and consequently $n$) so as to
minimize the variance of the estimate for a given cost. We must then
verify whether the minimum variance is smaller than can be attained
by a single simple random sample in which $y_i$ alone is measured.
The first step is to set up the estimate and determine its variance.
The population mean is
$$\bar{Y} = \sum_{h=1}^{L} W_h\bar{Y}_h$$
As an estimate we use
$$\bar{y}_{st} = \sum_{h=1}^{L} w_h\bar{y}_h$$
Whenever a new sample is drawn, this implies a fresh drawing of
both the first and the second samples. Thus the $w_h$ and the sample
means $\bar{y}_h$ are both random variables, subject to error. The problem is
therefore one of stratification in which the strata totals are not known
exactly. The strata boundaries are assumed fixed in repeated sampling.
Theorem 12.1 The estimate $\bar{y}_{st}$ is unbiased.
Proof: Write
$$w_h = W_h + u_h, \quad \bar{y}_h = \bar{Y}_h + e_h$$
Then the error of estimate may be expressed as
$$\bar{y}_{st} - \bar{Y} = \sum_h (w_h\bar{y}_h - W_h\bar{Y}_h) = \sum_h (W_he_h + \bar{Y}_hu_h + u_he_h) \tag{12.2}$$
By the properties of simple random sampling, the quantities $u_h$ and
$e_h$ all have means zero. Further, by the method of drawing, $u_h$ and $e_h$
are independently distributed. Hence
$$E(\bar{y}_{st} - \bar{Y}) = 0$$
In the theorem for $V(\bar{y}_{st})$, the sampling ratios $n'/N$, $n_h/N_h$ are
assumed negligible, since these assumptions seem valid in the great
majority of applications. The variance of $y_{hi}$ in stratum $h$ is denoted as
usual by $S_h^2$.
Theorem 12.2 If $n'/N$ and $n_h/N_h$ are negligible,
$$V(\bar{y}_{st}) \doteq \sum_{h=1}^{L}\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} + \frac{W_h(\bar{Y}_h - \bar{Y})^2}{n'}\right] \tag{12.3}$$
Proof: From (12.2),
$$V(\bar{y}_{st}) = E(\bar{y}_{st} - \bar{Y})^2 = E\left[\sum_h (W_he_h + \bar{Y}_hu_h + u_he_h)\right]^2 \tag{12.3'}$$
Consider first the squared terms. These are:
$$E\left\{\sum_h (W_he_h + \bar{Y}_hu_h + u_he_h)^2\right\} = \sum_h \left[W_h^2E(e_h^2) + \bar{Y}_h^2E(u_h^2) + E(u_h^2)E(e_h^2)\right] \tag{12.4}$$
since all other terms in (12.4) vanish when the expectation is taken.
Since $n'/N$ is assumed negligible, the variates $w_h$ follow a multinomial
distribution: hence $E(u_h^2) = W_h(1 - W_h)/n'$. Thus the squared
terms contribute
$$\sum_h \left[\frac{W_h^2S_h^2}{n_h} + \frac{\bar{Y}_h^2W_h(1 - W_h)}{n'} + \frac{W_h(1 - W_h)}{n'}\cdot\frac{S_h^2}{n_h}\right] \tag{12.5}$$
Now, consider the cross-product terms between different strata in
equation (12.3'). If $h$ and $j$ refer to two strata, there is no contribution
from terms of the form $e_he_j$, since sampling is independent in
different strata. The only non-zero contribution is that from terms in
$\bar{Y}_h\bar{Y}_ju_hu_j$. For the multinomial distribution
$$E(u_hu_j) = -\frac{W_hW_j}{n'}$$
so that cross-product terms contribute
$$-2\sum_{h<j}\frac{\bar{Y}_h\bar{Y}_jW_hW_j}{n'} \tag{12.6}$$
If the middle term in (12.5) is combined with (12.6), the reader may
verify that these together amount to
$$\frac{1}{n'}\sum_h W_h(\bar{Y}_h - \bar{Y})^2$$
This completes the proof. The term free from $n'$ is the familiar expression
for the variance when the strata sizes are known exactly. The effects of
errors in the first sample are therefore to increase slightly the
within-stratum contribution to the variance, and to introduce a
between-stratum component.
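Formula (12.3) can be checked numerically. The Python sketch below, offered only as an illustration, evaluates (12.3) and compares it with the variance of $\bar{y}_{st}$ over repeated drawings. The simulation makes assumptions that are not in the text: the strata are treated as infinite normal populations, the second sample is drawn independently of the first, and the $n_h$ are held fixed. The function names are hypothetical.

```python
import random

def var_double_strat(W, S, Ybar, n_h, n_prime):
    """Formula (12.3): variance of y_st when n'/N and n_h/N_h are negligible."""
    ybar_pop = sum(w * y for w, y in zip(W, Ybar))
    total = 0.0
    for w, s, y, n in zip(W, S, Ybar, n_h):
        within = (w ** 2 + w * (1 - w) / n_prime) * s ** 2 / n
        between = w * (y - ybar_pop) ** 2 / n_prime
        total += within + between
    return total

def simulate(W, S, Ybar, n_h, n_prime, reps, seed=1):
    """Empirical mean and variance of y_st over repeated double samples."""
    rng = random.Random(seed)
    strata = list(range(len(W)))
    estimates = []
    for _ in range(reps):
        # First sample: classify n' units into strata, giving the w_h.
        draws = rng.choices(strata, weights=W, k=n_prime)
        w = [draws.count(h) / n_prime for h in strata]
        # Second sample: n_h independent measurements of y in each stratum.
        ybar = [sum(rng.gauss(Ybar[h], S[h]) for _ in range(n_h[h])) / n_h[h]
                for h in strata]
        estimates.append(sum(wh * yh for wh, yh in zip(w, ybar)))
    mean = sum(estimates) / reps
    var = sum((e - mean) ** 2 for e in estimates) / (reps - 1)
    return mean, var
```

With two strata and moderate sample sizes, the simulated variance should agree with (12.3) to within sampling error, and the simulated mean with $\bar{Y}$, as theorem 12.1 requires.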
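Corollary. If we are estimating a proportion in the second sample,
then
$$S_h^2 \doteq P_hQ_h$$
and theorem 12.2 gives
$$V(p_{st}) \doteq \sum_{h=1}^{L}\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{P_hQ_h}{n_h} + \frac{W_h(P_h - P)^2}{n'}\right] \tag{12.7}$$
where $P_h$ is the proportion in stratum $h$.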

12.3 Optimum allocation. The values of the $n_h$ and $n'$ that lead to
the minimum variance are rather complicated. It is clear from formula
(12.3) that $n_h$ should be proportional to
$$S_h\sqrt{W_h^2 + \frac{W_h(1 - W_h)}{n'}}$$
Since the second term inside the root is usually small compared with
the first, Neyman (1938) suggests taking $n_h$ proportional to $W_hS_h$.
Thus
$$n_h = \frac{nW_hS_h}{\sum W_hS_h}$$

When these values are substituted into the variance (12.3), with the
term in $W_h(1 - W_h)$ ignored, we obtain
$$V_{opt} \doteq \frac{(\sum W_hS_h)^2}{n} + \frac{\sum W_h(\bar{Y}_h - \bar{Y})^2}{n'} \tag{12.8}$$
$$= \frac{V_n}{n} + \frac{V_{n'}}{n'} \quad \text{(say)} \tag{12.8'}$$
This approximate expression for the variance is now minimized by
choice of $n$ and $n'$ for a given cost of the form stated previously
$$C = nc_n + n'c_{n'} \tag{12.1}$$
It is easily found that
$$\frac{n}{\sqrt{V_nc_{n'}}} = \frac{n'}{\sqrt{V_{n'}c_n}} \tag{12.9}$$
This equation and (12.1) determine $n$ and $n'$.
An expression for the minimum variance will be needed for later
applications of double sampling. From (12.9),
$$\frac{n}{\sqrt{V_nc_{n'}}} = \frac{n'}{\sqrt{V_{n'}c_n}} = \frac{nc_n + n'c_{n'}}{\sqrt{c_nc_{n'}}\,(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}})} = \frac{C}{\sqrt{c_nc_{n'}}\,(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}})} \tag{12.9'}$$
Substitute these solutions in equation (12.8') for $V_{opt}$. This gives
$$V_{opt} = \frac{(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}})^2}{C} \tag{12.10}$$
Example. This example is artificial, but illustrates the calculations
involved. We use the Jefferson county data previously considered
(p. 133). The $x_i$ variate, farm size, is employed to divide the population
into 2 strata: farms up to 160 acres and farms over 160 acres.
Assume that it costs 10 times as much to sample for corn acres ($y_i$)
as for farm size ($x_i$), and let the cost be of the form
$$C = 100 = n + 0.1n' \tag{12.11}$$
This means that, if double sampling is not used ($n' = 0$), we can
afford to take a sample of 100 farms to estimate corn acres.
The relevant data for the population are:

Strata        $W_h$     $S_h$            $\bar{Y}_h$
1             0.786     17.7             19.404
2             0.214     30.4             51.626

Population              ($S^2 = 620$)    26.297

By formula (12.10) we could proceed at once to compute $V_{opt}$.
However, the intermediate steps will be given. We find
$$V_n = (\sum W_hS_h)^2 = 417$$
$$V_{n'} = \sum W_h(\bar{Y}_h - \bar{Y})^2 = 175$$
so that, by formula (12.9),
$$\frac{n}{n'} = \sqrt{\frac{417}{175}\cdot\frac{1}{10}} = 0.488$$
From the cost equation (12.11) we obtain
$$n' = \frac{100}{0.588} = 170, \quad n = 170 \times 0.488 = 83$$
At this point the reader may verify from the data in this example
that the neglected term in $W_h(1 - W_h)$ in the variance formula (12.3)
is in fact negligible. From formula (12.8) we then have
$$V_{opt} = \frac{417}{83} + \frac{175}{170} = 5.02 + 1.03 = 6.05$$
For a random sample of size 100, with no double sampling, we would
have
$$V = \frac{620}{100} = 6.20$$
It appears that there would be only a trifling gain from double sampling.
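The arithmetic of this example can be reproduced with a short Python sketch of formulas (12.8) to (12.10). The function and variable names are assumptions made for the illustration; the input figures are the stratum values quoted in the example.

```python
from math import sqrt

def optimum_double_sampling(V_n, V_np, c_n, c_np, C):
    """Optimum n and n' from (12.9) with the cost equation (12.1),
    and the minimum variance from (12.10)."""
    ratio = sqrt(V_n * c_np / (V_np * c_n))        # n / n', from (12.9)
    n_prime = C / (ratio * c_n + c_np)             # from C = n c_n + n' c_n'
    n = ratio * n_prime
    v_opt = (sqrt(V_n * c_n) + sqrt(V_np * c_np)) ** 2 / C
    return n, n_prime, v_opt

# Stratum data from the example: weights, standard deviations, means.
W = [0.786, 0.214]
S = [17.7, 30.4]
Ybar = [19.404, 51.626]

ybar_pop = sum(w * y for w, y in zip(W, Ybar))
V_n = sum(w * s for w, s in zip(W, S)) ** 2                      # about 417
V_np = sum(w * (y - ybar_pop) ** 2 for w, y in zip(W, Ybar))     # about 175

n, n_prime, v_opt = optimum_double_sampling(V_n, V_np, c_n=1.0, c_np=0.1, C=100.0)
print(round(n), round(n_prime), round(v_opt, 2))
```

Because the optimum is flat near its minimum, moderate departures from these values of $n$ and $n'$ change the variance very little.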

12.4 Estimated variance in double sampling for stratification.
Assuming $n'/N$, $n_h/N_h$ negligible, the true variance of $\bar{y}_{st}$ is, from (12.3),
$$V(\bar{y}_{st}) \doteq \sum_{h=1}^{L}\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} + \frac{W_h(\bar{Y}_h - \bar{Y})^2}{n'}\right]$$
If estimates from the sample are substituted in this quantity, the
resulting expression turns out to be an overestimate of $V(\bar{y}_{st})$. An
unbiased estimate can be constructed without difficulty.
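Theorem 12.3 An unbiased estimate of $V(\bar{y}_{st})$ is
$$v(\bar{y}_{st}) = \frac{n'}{n' - 1}\sum_h\left[\left(w_h^2 - \frac{w_h}{n'}\right)\frac{s_h^2}{n_h} + \frac{w_h(\bar{y}_h - \bar{y}_{st})^2}{n'}\right] \tag{12.12}$$
Proof: This is obtained by substituting $w_h = W_h + u_h$, $\bar{y}_h = \bar{Y}_h + e_h$,
in (12.12). The expectations of the successive terms inside the
main bracket work out as follows:
$$E\sum_h \frac{w_h^2s_h^2}{n_h} = \sum_h\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} \tag{12.13}$$
$$E\sum_h\left(-\frac{w_hs_h^2}{n'n_h}\right) = -\sum_h\frac{W_hS_h^2}{n'n_h} \tag{12.14}$$
$$E\sum_h\frac{w_h(\bar{y}_h - \bar{y}_{st})^2}{n'} = \frac{1}{n'}\left[\sum_h W_h(\bar{Y}_h - \bar{Y})^2 + \sum_h\frac{W_hS_h^2}{n_h} - V(\bar{y}_{st})\right] \tag{12.15}$$
Adding (12.13), (12.14), and (12.15), we obtain for $(n' - 1)Ev(\bar{y}_{st})/n'$
$$\sum_h\left[\left\{W_h^2 + \frac{W_h(1 - W_h)}{n'}\right\}\frac{S_h^2}{n_h} + \frac{W_h(\bar{Y}_h - \bar{Y})^2}{n'}\right] - \frac{V(\bar{y}_{st})}{n'} = \frac{(n' - 1)V(\bar{y}_{st})}{n'}$$
The theorem follows.
If $n'$ is large relative to the $n_h$, $v(\bar{y}_{st})$ reduces to
$$v(\bar{y}_{st}) \doteq \sum_h\frac{w_h^2s_h^2}{n_h} \tag{12.16}$$
This expression is equivalent to assuming that errors in the strata
weights $w_h$ can be ignored.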
Corollary. If $p_h$ is the observed proportion of units in stratum $h$
which fall into some defined class, and $p_{st} = \sum w_hp_h$ is the
estimate of the population proportion, then an estimate of $V(p_{st})$ is
$$v(p_{st}) = \frac{n'}{n' - 1}\sum_h\left[\left(w_h^2 - \frac{w_h}{n'}\right)\frac{p_hq_h}{n_h - 1} + \frac{w_h(p_h - p_{st})^2}{n'}\right]$$
In almost all cases, this can be simplified to
$$v(p_{st}) \doteq \sum_h\left[\frac{w_h^2p_hq_h}{n_h - 1} + \frac{w_h(p_h - p_{st})^2}{n'}\right]$$
Frequently the term in $1/n'$ can also be dropped.
Example. In a simple random sample of 374 households 292 were
occupied by white families and 82 by non-white families. A subsample
of about 1 in 4 households gave the following data as to ownership:

              Owned    Rented    Total
White:        31       43        74
Non-white:    4        14        18

Estimate the proportion of rented households in the area from which
the sample was drawn, and find the standard error of the estimate.
If the first stratum consists of the white-occupied households,
$$w_1 = \tfrac{292}{374} = 0.78, \quad w_2 = \tfrac{82}{374} = 0.22$$
$$p_1 = 0.60, \quad p_2 = 0.78$$
$$p_{st} = w_1p_1 + w_2p_2 = 0.64$$
$$n' = 374, \quad n_1 = 74, \quad n_2 = 18$$
It is readily found that only the leading term in $v(p_{st})$ is of importance.
Hence
$$v(p_{st}) = \sum_h\frac{w_h^2p_hq_h}{n_h - 1} = \frac{(0.78)^2(0.60)(0.40)}{73} + \frac{(0.22)^2(0.78)(0.22)}{17} = 0.00248$$
$$s(p_{st}) = 0.049$$
The estimated proportion of rented households is 0.64 ± 0.049. The
reader may verify that there is only a trifling gain in precision over a
single-stage simple random sample of size 92. In view of the relatively
small size of the non-white stratum, a greater difference between
the proportions of rented households for whites and non-whites would
be necessary to make double sampling profitable.
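The claim that only the leading term matters can be examined with a small Python sketch of the corollary to theorem 12.3. The function names are assumptions, and the inputs are the rounded weights and proportions quoted in the example rather than values recomputed from the table.

```python
from math import sqrt

def v_pst_full(w, p, n_h, n_prime):
    """Estimate of V(p_st) from the corollary to theorem 12.3."""
    p_st = sum(wh * ph for wh, ph in zip(w, p))
    total = 0.0
    for wh, ph, nh in zip(w, p, n_h):
        total += (wh ** 2 - wh / n_prime) * ph * (1 - ph) / (nh - 1)
        total += wh * (ph - p_st) ** 2 / n_prime
    return p_st, n_prime / (n_prime - 1) * total

def v_pst_leading(w, p, n_h):
    """Leading term only: errors in the strata weights are ignored."""
    return sum(wh ** 2 * ph * (1 - ph) / (nh - 1)
               for wh, ph, nh in zip(w, p, n_h))

# Rounded values as quoted in the example.
w = [0.78, 0.22]
p = [0.60, 0.78]
n_h = [74, 18]

p_st, v_full = v_pst_full(w, p, n_h, n_prime=374)
v_lead = v_pst_leading(w, p, n_h)
print(round(p_st, 2), sqrt(v_lead), sqrt(v_full))
```

The two variance estimates differ only in the sixth decimal place here, which supports dropping the terms in $1/n'$ when $n'$ is several times the size of the second sample.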

12.5 Regression estimates. In a number of the applications of
double sampling, the auxiliary variate $x_i$ has been used to make a
regression estimate of $\bar{Y}$. We shall assume that the population is
infinite and that the relation between $y_i$ and $x_i$ is linear. Write as a
model
$$y_{i\alpha} = \bar{Y} + B(x_i - \bar{X}) + e_{i\alpha} \tag{12.17}$$
where the second subscript $\alpha$ is introduced as a reminder that for
fixed $x_i$ the random variate $e_{i\alpha}$ follows a frequency distribution with
mean 0 and variance $S_e^2 = S_y^2(1 - \rho^2)$.
In the first (large) sample, of size $n'$, we measure only $x_i$: in the
second, of size $n$, we measure both $x_i$ and $y_{i\alpha}$. The estimate of $\bar{Y}$ is
$$\bar{y}_{lr} = \bar{y} + b(\bar{x}' - \bar{x})$$
where $\bar{x}'$, $\bar{x}$ are the means of $x_i$ in the first and second samples,
respectively, and $b$ is the least squares regression coefficient of $y_{i\alpha}$ on $x_i$,
computed from the second sample.
We now examine the error of estimate $(\bar{y}_{lr} - \bar{Y})$. From (12.17)
we find
$$\bar{y} = \bar{Y} + B(\bar{x} - \bar{X}) + \bar{e} \tag{12.18}$$
$$b = \frac{\sum_{i=1}^{n}(y_{i\alpha} - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = B + \frac{\sum_{i=1}^{n}e_{i\alpha}(x_i - \bar{x})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{12.19}$$
From (12.18) and (12.19), substitute for $\bar{y}$ and $b$ in the error of
estimate. This gives
$$\bar{y}_{lr} - \bar{Y} = (\bar{y} - \bar{Y}) + b(\bar{x}' - \bar{x})$$
$$= B(\bar{x} - \bar{X}) + \bar{e} + B(\bar{x}' - \bar{x}) + (\bar{x}' - \bar{x})\frac{\sum e_{i\alpha}(x_i - \bar{x})}{\sum(x_i - \bar{x})^2}$$
$$= \bar{e} + (\bar{x}' - \bar{x})\frac{\sum e_{i\alpha}(x_i - \bar{x})}{\sum(x_i - \bar{x})^2} + B(\bar{x}' - \bar{X}) \tag{12.20}$$
In ordinary regression theory, in which $\bar{x}' = \bar{X}$, the standard
practice is to discuss the conditional frequency distribution of the error of
estimate $(\bar{y}_{lr} - \bar{Y})$ in repeated samples in which the $x_i$ values are
fixed. If this approach is adopted in the present problem, keeping
the $x_i$ values fixed in both the first and the second samples, we see
that the estimate is biased in the conditional distribution, since
$$E(\bar{y}_{lr} - \bar{Y}) = B(\bar{x}' - \bar{X})$$
If the bias is not too large, we have seen (section 1.5) that it may be
taken into account by adding its square to the variance of $\bar{y}_{lr}$. Hence
we may regard the conditional variance $V_c$ of $\bar{y}_{lr}$ as
$$V_c(\bar{y}_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \frac{(\bar{x}' - \bar{x})^2}{\sum(x_i - \bar{x})^2}\right] + B^2(\bar{x}' - \bar{X})^2 \tag{12.21}$$
This expression is not suitable for comparison with other methods
of sampling, since the variance depends on the set of $x_i$ which appears
in the two samples. Instead, we need the average variance over all
possible drawings of the first and second samples.
A simple result for the average variance is obtained under the
assumption that (i) the first sample is drawn at random, (ii) the second
sample is a random subsample drawn from the first, and (iii) the $x_i$ are
normally distributed. In this event the average variance is found to be:
$$V(\bar{y}_{lr}) = S_y^2(1 - \rho^2)\left[\frac{1}{n} + \left(\frac{1}{n} - \frac{1}{n'}\right)\frac{1}{(n - 3)}\right] + \frac{B^2S_x^2}{n'} \tag{12.22}$$
$$= \frac{S_y^2(1 - \rho^2)}{n}\left[1 + \frac{(n' - n)}{n'}\cdot\frac{1}{(n - 3)}\right] + \frac{\rho^2S_y^2}{n'} \tag{12.23}$$
since $B^2S_x^2 = \rho^2S_y^2$.
If the $x_i$ are not normally distributed, the only term whose value is
changed is that in $1/(n - 3)$, as discussed in section 7.3. As regards
assumption ii, the small sample might not be drawn at random from
the large sample: it is preferable to select the small sample so as to
obtain a wide spread in the values of $x_i$ and hence reduce the sampling
error of $b$. The effect is to reduce, perhaps considerably, the term in
$1/(n - 3)$.
In some applications the second sample is drawn independently of
the first. In this event the argument given in this section remains
unchanged down to equation (12.21). In equation (12.22) the term
$$\left(\frac{1}{n} - \frac{1}{n'}\right)$$
is replaced by
$$\left(\frac{1}{n} + \frac{1}{n'}\right)$$
This case of two independent samples was first considered by Chameli
Bose (1943).
To summarize, there is some doubt about the exact value of the
term in $1/(n - 3)$ in the average variance. However, if $1/n$ is negligible,
this term is also negligible. This gives the following theorem.
Theorem 12.4 If the first sample is of size $n'$ and the second is of
size $n$, and if $1/n$ is negligible, the variance of $\bar{y}_{lr}$, the regression
estimate in double sampling, is given approximately by
$$V(\bar{y}_{lr}) \doteq \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2S_y^2}{n'} \tag{12.24}$$
12.6 Double sampling with regression versus single sampling. From
the variance formula (12.24), double sampling with a regression estimate
can be compared with a single simple random sample, under the
assumption that (i) the first sample is a simple random sample, (ii)
$1/n$ is negligible, and (iii) the second sample is also a simple random
sample. Results for this case should provide a rough guide to other
cases.
Write
$$V(\bar{y}_{lr}) = \frac{V_n}{n} + \frac{V_{n'}}{n'}$$
where
$$V_n = S_y^2(1 - \rho^2), \quad V_{n'} = \rho^2S_y^2$$
$$\text{Cost} = C = nc_n + n'c_{n'}$$
The problem of finding the optimum $n$ and $n'$ and the minimum
variance is exactly the same as in double sampling for stratification
(section 12.3). Equation (12.10) gives
$$V_{opt} = \frac{(\sqrt{V_nc_n} + \sqrt{V_{n'}c_{n'}})^2}{C} = \frac{S_y^2\left[\sqrt{(1 - \rho^2)c_n} + \rho\sqrt{c_{n'}}\right]^2}{C} \tag{12.25}$$
where $\rho$ is taken as positive.
If all resources are devoted to a single sample, with no adjustment
for regression, this sample is of size $n_s = C/c_n$, and the variance of its
mean is
$$V(\bar{y}) = \frac{S_y^2}{n_s} = \frac{c_nS_y^2}{C} \tag{12.26}$$
Hence, double sampling gives a smaller variance if
$$c_n > \left[\sqrt{(1 - \rho^2)c_n} + \rho\sqrt{c_{n'}}\right]^2$$
This inequality may be expressed in two ways:
$$\frac{c_n}{c_{n'}} > \frac{(1 + \sqrt{1 - \rho^2})^2}{\rho^2} = \frac{\rho^2}{(1 - \sqrt{1 - \rho^2})^2} \tag{12.27}$$
or
$$\rho^2 > \frac{4c_nc_{n'}}{(c_n + c_{n'})^2} \tag{12.28}$$
Equation (12.27) shows that, for a given value of $\rho$, the ratio of the
cost per unit in the second sample to the cost per unit in the first
sample must exceed a critical value before double sampling brings an
increase in precision. Given $c_n$ and $c_{n'}$, equation (12.28) shows the
critical value that must be exceeded by $\rho^2$ to make double sampling
profitable.

[Figure 12.1: the ratio $c_n/c_{n'}$ (vertical axis, log scale) plotted against $\rho$, the correlation between $y_i$ and $x_i$ (horizontal axis), with three curves I, II, and III.]
FIGURE 12.1 Relation between $c_n/c_{n'}$ and $\rho$ for three fixed values of the relative
precision of double and single sampling.
Curve I: double and single sampling equally precise.
Curve II: double sampling gives 25 per cent increase in precision.
Curve III: double sampling gives 50 per cent increase in precision.
Figure 12.1 plots the values of the ratio $c_n/c_{n'}$ (on a log scale) against
$\rho$. Curve I is the relationship when double and single sampling are
equally precise; curve II holds when $V_{opt} = 0.8V(\bar{y})$, i.e. when double
sampling gives a 25 per cent increase in precision; and curve III refers
to a 50 per cent increase in precision. For example, when $\rho = 0.8$,
double sampling equals single sampling in precision if $c_n/c_{n'}$ is 4, gives
a 25 per cent increase in precision if $c_n/c_{n'}$ is about 7½, and gives a
50 per cent increase if $c_n/c_{n'}$ is about 13.
For practical use, the curves overestimate the gains to be achieved
from double sampling, because the best values of $n$ and $n'$ must either
be estimated from previous data or be guessed. Some allowance for
errors in these estimations should be made before deciding to adopt
double sampling.
For any $\rho$, there is an upper limit to the gain in precision from double
sampling. This occurs when information on $\bar{x}'$ is obtained free
($c_{n'} = 0$). The upper limit to the relative precision is $1/(1 - \rho^2)$.
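The curves of the figure follow from dividing (12.25) by (12.26), which gives $V_{opt}/V(\bar{y}) = [\sqrt{1 - \rho^2} + \rho/\sqrt{c_n/c_{n'}}]^2$. The Python sketch below is an illustration of that ratio and of the cost ratio needed for a stated gain in precision; the function names and the `gain` parameter are assumptions introduced for the example.

```python
from math import sqrt, inf

def relative_variance(rho, cost_ratio):
    """V_opt / V(ybar) from (12.25) and (12.26); cost_ratio is c_n / c_n'."""
    return (sqrt(1 - rho ** 2) + rho / sqrt(cost_ratio)) ** 2

def critical_cost_ratio(rho, gain=0.0):
    """Smallest c_n / c_n' at which double sampling is (1 + gain) times
    as precise as single sampling; gain=0 reproduces (12.27)."""
    target = sqrt(1.0 / (1.0 + gain))      # required sqrt(V_opt / V)
    slack = target - sqrt(1 - rho ** 2)
    if slack <= 0:
        return inf                         # beyond the limit 1 / (1 - rho^2)
    return (rho / slack) ** 2

for gain in (0.0, 0.25, 0.50):
    print(gain, critical_cost_ratio(0.8, gain))
```

A requested gain at or above the upper limit $1/(1 - \rho^2)$ returns infinity, since no finite cost ratio attains it.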
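12.7 Estimated variance in double sampling for regression. If terms
in $1/n$ are negligible, $V(\bar{y}_{lr})$ is given by equation (12.24):
$$V(\bar{y}_{lr}) \doteq \frac{S_y^2(1 - \rho^2)}{n} + \frac{\rho^2S_y^2}{n'}$$
By section 7.3, the quantity
$$s_{y.x}^2 = \frac{1}{n - 2}\left[\sum(y_i - \bar{y})^2 - b^2\sum(x_i - \bar{x})^2\right]$$
is an unbiased estimate of $S_y^2(1 - \rho^2)$, where the subscript $\alpha$ has now
been dropped. Since
$$s_y^2 = \frac{\sum(y_i - \bar{y})^2}{n - 1}$$
is an unbiased estimate of $S_y^2$, it follows that
$$(s_y^2 - s_{y.x}^2)$$
is an unbiased estimate of $\rho^2S_y^2$.
Thus a sample estimate of $V(\bar{y}_{lr})$ is
$$v(\bar{y}_{lr}) = \frac{s_{y.x}^2}{n} + \frac{s_y^2 - s_{y.x}^2}{n'} \tag{12.29}$$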
If the second sample is very small and terms in $1/n$ are not negligible,
a suggested estimate of variance is
$$v(\bar{y}_{lr}) = s_{y.x}^2\left\{\frac{1}{n} + \frac{(\bar{x}' - \bar{x})^2}{\sum(x_i - \bar{x})^2}\right\} + \frac{(s_y^2 - s_{y.x}^2)}{n'}$$
This is a kind of hybrid of the conditional variance and the average
variance.
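The regression estimate and the variance estimate (12.29) can be computed from the two samples as in the following Python sketch. It is an illustration only: the function name and argument layout are assumptions, and the residual mean square is computed in the usual least squares form with divisor $n - 2$.

```python
def double_sampling_regression(x_first, x_second, y_second):
    """Regression estimate y_lr and its estimated variance, formula (12.29).

    x_first:  x values in the first (large) sample, size n'
    x_second: x values in the second sample, size n
    y_second: y values in the second sample, paired with x_second
    """
    n_prime = len(x_first)
    n = len(x_second)
    x_bar_prime = sum(x_first) / n_prime
    x_bar = sum(x_second) / n
    y_bar = sum(y_second) / n
    sxx = sum((x - x_bar) ** 2 for x in x_second)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_second, y_second))
    syy = sum((y - y_bar) ** 2 for y in y_second)
    b = sxy / sxx                                  # least squares slope
    y_lr = y_bar + b * (x_bar_prime - x_bar)
    s2_y = syy / (n - 1)                           # estimate of S_y^2
    s2_yx = (syy - b * sxy) / (n - 2)              # estimate of S_y^2 (1 - rho^2)
    v = s2_yx / n + (s2_y - s2_yx) / n_prime       # formula (12.29)
    return y_lr, v

# Hypothetical data: the second sample lies exactly on y = 3 + 2x.
y_lr, v = double_sampling_regression(
    x_first=[1, 2, 3, 4, 5, 6, 7, 8],
    x_second=[1, 2, 3, 4],
    y_second=[5, 7, 9, 11],
)
```

With an exact linear relation the residual mean square is zero, so the whole of the estimated variance comes from the term in $1/n'$.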

12.8 Ratio estimates. If the first sample is used to obtain $\bar{x}'$ for a
ratio estimate of $\bar{Y}$, the estimate is
$$\bar{y}_R = \frac{\bar{y}}{\bar{x}}\bar{x}' \tag{12.30}$$
To find the approximate variance, write
$$\bar{y}_R - \bar{Y} = \left(\frac{\bar{y}}{\bar{x}}\bar{X} - \bar{Y}\right) + \frac{\bar{y}}{\bar{x}}(\bar{x}' - \bar{X}) = \frac{\bar{X}}{\bar{x}}(\bar{y} - R\bar{x}) + \frac{\bar{y}}{\bar{x}}(\bar{x}' - \bar{X})$$
The first component is the error of the ordinary ratio estimate
(section 6.3). In obtaining the approximate error variance in section 6.3,
we replaced the factor $\bar{X}/\bar{x}$ by unity in this term. To the same order
of approximation, we replace the factor $\bar{y}/\bar{x}$ in the second component
by the population ratio $R = \bar{Y}/\bar{X}$. Thus
$$\bar{y}_R - \bar{Y} \doteq (\bar{y} - R\bar{x}) + R(\bar{x}' - \bar{X}) \tag{12.31}$$
If the first and second samples are drawn independently, we obtain
$$V(\bar{y}_R) = \frac{S_y^2 - 2RS_{yx} + R^2S_x^2}{n} + \frac{R^2S_x^2}{n'} \tag{12.32}$$
where the fpc terms are assumed negligible.

If the second sample is a random subsample of the first, rearrange
equation (12.31) in the form
$$\bar{y}_R - \bar{Y} \doteq (\bar{y} - R\bar{X}) + R(\bar{x}' - \bar{x}) = (\bar{y} - \bar{Y}) + R(\bar{x}' - \bar{x})$$
It may be verified that, with the fpc ignored,
$$V(\bar{y} - \bar{Y}) = \frac{S_y^2}{n}$$
$$\text{cov}\,\{(\bar{y} - \bar{Y}),\, R(\bar{x}' - \bar{x})\} = -RS_{yx}\left(\frac{1}{n} - \frac{1}{n'}\right)$$
$$V\{R(\bar{x}' - \bar{x})\} = R^2S_x^2\left(\frac{1}{n} - \frac{1}{n'}\right)$$
Hence $V(\bar{y}_R)$ takes the form
$$V(\bar{y}_R) = \frac{S_y^2 - 2RS_{yx} + R^2S_x^2}{n} + \frac{2RS_{yx} - R^2S_x^2}{n'} \tag{12.33}$$
Note that formulas (12.32) and (12.33) are both of the form
$$V(\bar{y}_R) = \frac{V_n}{n} + \frac{V_{n'}}{n'}$$
Hence the optimum choices of $n$ and $n'$, and the minimum variance for
comparison with single sampling, are found by the same procedure as
for stratification and regression estimates. Details will not be given.
For sample estimates of variance, the quantities $s_y^2$, $s_{yx}$, $s_x^2$, and
$\hat{R}$ may be substituted in (12.32) and (12.33). The resulting estimates
$v(\bar{y}_R)$ are not unbiased, but appear to be adequate to the order of
approximation presented in the analysis.
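Formulas (12.32) and (12.33) differ only in the coefficient of $1/n'$, as the short Python sketch below makes explicit. The function name and the `subsample` flag are assumptions for the illustration, and the numerical inputs are hypothetical.

```python
def var_ratio_double(S2_y, S_yx, S2_x, R, n, n_prime, subsample=True):
    """Approximate variance of the ratio estimate in double sampling.

    subsample=True  uses (12.33): second sample is a random subsample of the first.
    subsample=False uses (12.32): the two samples are drawn independently.
    """
    v_n = S2_y - 2 * R * S_yx + R ** 2 * S2_x
    if subsample:
        v_np = 2 * R * S_yx - R ** 2 * S2_x
    else:
        v_np = R ** 2 * S2_x
    return v_n / n + v_np / n_prime

# Hypothetical population values.
print(var_ratio_double(100.0, 40.0, 25.0, 2.0, n=50, n_prime=500))
print(var_ratio_double(100.0, 40.0, 25.0, 2.0, n=50, n_prime=500, subsample=False))
```

One check on (12.33): when $n' = n$ the subsample is the whole first sample, $\bar{x}' = \bar{x}$, and the formula reduces to $S_y^2/n$, the variance of the simple mean.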

12.9 Repeated sampling of the same population. As confidence in
sampling has increased, the practice of relying on samples for the
collection of important series of data that are published at regular
intervals is becoming more common. In part, this is due to a realization
that with a dynamic population a census at infrequent intervals is of
limited use. Highly precise information about the characteristics of
a population in July 1945 and July 1950 may not help much in planning
that demands a knowledge of the population in 1952. A series
of relatively small samples at annual or even shorter intervals may be
more serviceable.
When the same population (apart from the changes which the passage
of time introduces) is sampled repeatedly, we are in an ideal position
to make realistic estimates both of costs and of variances and
to apply the techniques that lead to optimum efficiency of sampling.
One important question is how frequently and in what manner the
sample should be changed as time progresses. Many considerations
affect the decision. People may be unwilling to give the same type of
information time after time. The respondents may be influenced by
information which they receive at the interviews, and this may make
them progressively less representative as time proceeds. Sometimes,
however, cooperation is better in a second interview than in the first,
and when the information is technical or confidential the second visit
may produce more accurate data than the first.
In the remainder of this chapter we shall consider the question of
replacement of the sample and the related question of making esti-
mates from the series of repeated samples. The topic is appropriate
to the present chapter because double sampling techniques can be
utilized.
Given the data from a series of samples, there are three kinds of
quantity for which we may wish estimates:
i. The change in Y from one occasion to the next.
ii. The average value of Y over all occasions.
iii. The average value of Y for the most recent occasion.
In most surveys, interest centers on the current average (iii), par-
ticularly if the characteristics of the population are likely to change
rapidly with time. With a population in which time changes are
slow, on the other hand, an annual average (ii) taken over twelve
monthly samples or four quarterly samples may be adequate for the
major uses. This would be the situation in a study of the prevalence
of chronic diseases of long duration. With a disease whose prevalence
shows marked seasonal variation, the current data would be of major
interest, but annual averages would also be useful for comparisons
between different regions and different years. Estimates of change
(i) are wanted mainly in attempts to study the effects of forces that
are known to have acted on the population. For instance, if a bill is
passed which is supposed to stimulate the building of houses, it is
interesting to know whether the building rate of new houses has in-
creased in the succeeding year (with a realization that an increase
may not be entirely due to the bill).
Suppose that we are free to alter or retain the composition of the
sample, and that the total size of sample is to be the same on all 0c-
casions. If we wish to maximize precision, the following statements
can be made about replacement policy:
i. For estimating change, it is best to retain the same sample
throughout all occasions.
ii. For estimating the average over all occasions, it is best to draw
a new sample on each occasion.
iii. For current estimates, equal precision is obtained either by keep-
ing the same sample or by changing it on every occasion. Replacement
of part of the sample on each occasion may be better than these
alternatives.
Statements i and ii hold because there is nearly always a positive
correlation between the measurements on the same unit on two
successive occasions. The variance of the estimated change on a unit is
$(S_1^2 + S_2^2 - 2\rho S_1 S_2)$, where the subscripts refer to the occasions. If
change is estimated from two different units, the variance is $(S_1^2 + S_2^2)$.
In estimating the overall mean for the two occasions, the variance is
$(S_1^2 + S_2^2 + 2\rho S_1 S_2)/4$ if the same unit is retained, and $(S_1^2 + S_2^2)/4$
if a new unit is chosen.
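As a numerical illustration of these four variances, the following minimal Python sketch may be helpful. The function names and the trial values ($S_1^2 = S_2^2 = 1$, $\rho = 0.8$) are illustrative assumptions and are not taken from the text.

```python
import math


def variance_of_change(s1_sq, s2_sq, rho, same_unit):
    """Per-unit variance of the estimated change between two occasions.

    With the same unit on both occasions the covariance term
    rho * S1 * S2 is subtracted twice; with two different units it is zero.
    """
    cov = rho * math.sqrt(s1_sq * s2_sq) if same_unit else 0.0
    return s1_sq + s2_sq - 2.0 * cov


def variance_of_overall_mean(s1_sq, s2_sq, rho, same_unit):
    """Per-unit variance of the mean over the two occasions."""
    cov = rho * math.sqrt(s1_sq * s2_sq) if same_unit else 0.0
    return (s1_sq + s2_sq + 2.0 * cov) / 4.0


# Illustrative (assumed) values: equal variances and rho = 0.8.
for same in (True, False):
    label = "same unit" if same else "new unit "
    print(label,
          "change:", round(variance_of_change(1.0, 1.0, 0.8, same), 3),
          "mean:", round(variance_of_overall_mean(1.0, 1.0, 0.8, same), 3))
```

With a positive correlation, retaining the unit gives the smaller variance for the change and the larger variance for the overall mean, which is the content of statements i and ii.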
Statement iii, which is less obvious, is investigated in succeeding
sections.

12.10 Sampling on two occasions. Suppose that the samples are of
the same size n on both occasions, and that the current estimates are
of primary interest. Replacement policy has been examined by Jessen
(1942). For simplicity, we assume that simple random sampling is
used and that the population variance $S^2$ of $y_i$ is the same on both
occasions.
The mean of the first sample has variance $S^2/n$, there being no
previous information to utilize. In selecting the second sample, m
of the units in the first sample are retained (m for matched). The
remaining u units (u for unmatched) are discarded and replaced by a
new selection.
Notation.
$\bar{y}_{hu}$ = Mean of unmatched portion on occasion h.
$\bar{y}_{hm}$ = Mean of matched portion on occasion h.
$\bar{y}_h$ = Mean of whole sample on occasion h.
The unmatched and matched portions of the second sample provide
independent estimates $\bar{y}'_{2u}$, $\bar{y}'_{2m}$ of $\bar{Y}_2$, as shown in table 12.1. In the
matched portion, we use a double sampling regression estimate, where
the "large" sample is the first sample, and the auxiliary variate $x_i$ is
the value of $y_i$ on the first occasion. The variance of $\bar{y}'_{2m}$ comes from
formula (12.24), p. 278: note that our m and n correspond to n and
n', respectively, in formula (12.24).

TABLE 12.1 ESTIMATES FROM THE UNMATCHED AND MATCHED PORTIONS

              Estimate                                                        Variance
Unmatched:    $\bar{y}'_{2u} = \bar{y}_{2u}$                                  $S^2/u = 1/W_{2u}$
Matched:      $\bar{y}'_{2m} = \bar{y}_{2m} + b(\bar{y}_1 - \bar{y}_{1m})$    $S^2(1 - \rho^2)/m + \rho^2 S^2/n = 1/W_{2m}$

The best combined estimate of $\bar{Y}_2$ is found by weighting the two
independent estimates inversely as their variances. If $W_{2u}$, $W_{2m}$ are
the inverse variances, this estimate is
$$\bar{y}_2' = \varphi_2 \bar{y}'_{2u} + (1 - \varphi_2)\bar{y}'_{2m}$$ (12.34)
where
$$\varphi_2 = \frac{W_{2u}}{W_{2u} + W_{2m}}$$
By least squares theory, the variance of $\bar{y}_2'$ is
$$V(\bar{y}_2') = \frac{1}{W_{2u} + W_{2m}}$$
From table 12.1, this works out after simplification as
$$V(\bar{y}_2') = \frac{S^2(n - u\rho^2)}{n^2 - u^2\rho^2}$$ (12.35)
Note that, if u = 0 (complete matching) or if u = n (no matching),
this variance has the same value, $S^2/n$.
The optimum value of u is found by minimizing (12.35) with respect
to variation in u. This gives
$$\frac{u}{n} = \frac{1}{1 + \sqrt{1 - \rho^2}}, \qquad \frac{m}{n} = \frac{\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}}$$ (12.36)

When the optimum u is substituted in (12.35), the minimum vari-
ance works out as
$$V_{opt}(\bar{y}_2') = \frac{S^2}{2n}\left(1 + \sqrt{1 - \rho^2}\right)$$ (12.37)
Table 12.2 shows for a series of values of $\rho$ the optimum per cent
which should be matched and the relative gain in precision as compared
with no matching. The best percentage to match never exceeds 50
per cent and decreases steadily as $\rho$ increases. When $\rho = 1$, the for-
mula suggests m = 0, which lies outside the range of our assumptions,
since m has been assumed reasonably large. The correct procedure in
this case is to take m = 2. The two matched units are sufficient to
determine the regression line exactly.
The greatest attainable gain in precision is 100 per cent, when $\rho = 1$.
Unless $\rho$ is high, the gains are modest.
Although the optimum percentage to match varies with $\rho$, only a
single percentage can be used in practice for all items in a survey.
The right-hand columns of table 12.2 show the per cent gains in pre-
cision when one-third and one-fourth of the units are matched. Both
are good compromises, except for items in which $\rho$ exceeds 0.95.

TABLE 12.2 OPTIMUM PER CENT MATCHED

                                           % gain with
  ρ      Optimum       % gain in      m/n = 1/3    m/n = 1/4
         % matched     precision

 0.5        46             7              7            6
 0.6        44            11             11            9
 0.7        42            17             17           15
 0.8        38            25             25           23
 0.9        30            39             39           39
 0.95       24            52             50           52
 1.0         0           100             67           75
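
The entries of table 12.2 can be checked numerically. The sketch below is only an illustration: it assumes the variance $S^2(n - u\rho^2)/(n^2 - u^2\rho^2)$ of (12.35) and the optimum $m/n = \sqrt{1 - \rho^2}/(1 + \sqrt{1 - \rho^2})$ of (12.36), and the function names are hypothetical. The case $\rho = 1$ is left out because the optimum m = 0 makes the variance ratio indeterminate.

```python
import math


def variance_ratio(rho, u_frac):
    """n * V(y2') / S^2 from (12.35), with u written as u_frac * n."""
    return (1.0 - u_frac * rho ** 2) / (1.0 - (u_frac * rho) ** 2)


def optimum_matched_fraction(rho):
    """Optimum m/n on the second occasion, from (12.36)."""
    root = math.sqrt(1.0 - rho ** 2)
    return root / (1.0 + root)


def percent_gain(rho, m_frac):
    """Per cent gain in precision over no matching when m/n = m_frac."""
    return 100.0 * (1.0 / variance_ratio(rho, 1.0 - m_frac) - 1.0)


for rho in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95):
    m_opt = optimum_matched_fraction(rho)
    print(f"rho={rho:<5} optimum % matched={100 * m_opt:5.1f}  "
          f"gain at optimum={percent_gain(rho, m_opt):5.1f}  "
          f"gain at 1/3={percent_gain(rho, 1 / 3):5.1f}  "
          f"gain at 1/4={percent_gain(rho, 1 / 4):5.1f}")
```

The printed values should agree with the tabulated ones to within about one percentage point; small differences may reflect rounding in the published table.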

12.11 Sampling on more than two occasions. The general problem
of replacement has been studied by Yates (1949) and Patterson (1950),
with respect to both current estimates and estimates of change. When
there are more than two occasions, the opportunities for a flexible use
of the data are increased. On occasion h, we may have parts of the
sample that are matched with occasion (h - 1), parts that are matched
with both occasions (h - 1) and (h - 2), and so on. In attempting
to improve the current estimate, we might try a multiple regression
involving all matchings to previous occasions. It is also possible to
revise the current estimate for occasion (h - 1) after the data for occa-
sion h are known. In the revised estimate, the regression of occasion
(h - 1) on both occasion (h - 2) and occasion h could be utilized,
assuming that suitably matched portions of the sample were available.
The present section contains an introduction to the subject. At-
tention will be restricted to current estimates in which only the re-
gression on the sample immediately preceding is used. This results
in some loss of precision, but since the correlation $\rho$ usually decreases
as the time interval between the occasions is increased, the loss of
precision will seldom be great. The variance $S^2$ and the correlation
coefficient $\rho$ between the item values on the same unit on two suc-
cessive occasions are assumed constant throughout.
On the third occasion, let m and u be the numbers of units that are
matched and unmatched, respectively, with the second occasion. The
two estimates of $\bar{Y}_3$ that can be made are given in table 12.3. The
only change in procedure from the second occasion (table 12.1) is
that, in the regression adjustment of the estimate from the matched
portion, we use the improved estimate $\bar{y}_2'$ instead of the sample mean
$\bar{y}_2$.

TABLE 12.3 ESTIMATES OF $\bar{Y}_3$ ON THE THIRD OCCASION

              Estimate                                                         Variance
Unmatched:    $\bar{y}'_{3u} = \bar{y}_{3u}$                                   $S^2/u = 1/W_{3u}$
Matched:      $\bar{y}'_{3m} = \bar{y}_{3m} + b(\bar{y}_2' - \bar{y}_{2m})$    $S^2(1 - \rho^2)/m + \rho^2 V(\bar{y}_2') = 1/W_{3m}$

The variance of the matched estimate $\bar{y}'_{3m}$ in table 12.3 is derived
from equation (12.22) of section 12.5, after some translation of nota-
tion. Equation (12.22), omitting the terms in $1/n^2$, reads as follows:
$$V(\bar{y}_{lr}) = \frac{S_y^2(1 - \rho^2)}{n} + \frac{B^2 S_x^2}{n'}$$
The translations needed to the present notation are:
n = Size of "small" sample = m.
$S_y^2 = S^2$.
$B = \rho$, because $S^2$ is assumed the same on both occasions.
$S_x^2/n'$ = Variance of estimated mean from "large" sample = $V(\bar{y}_2')$.
When these substitutions are made, the formula shown in table 12.3
is obtained.
On the hth occasion, the two estimates remain as in table 12.3 if
the subscript 3 is replaced by h and the subscript 2 by (h - 1).
We now find the optimum values of m and u on any occasion. It
will be shown that the optimum m increases steadily on the successive
occasions, and rather rapidly approaches a limiting value of $n/2$.
Weighting as before inversely as the variance, the best estimate of
$\bar{Y}_h$ is
$$\bar{y}_h' = \varphi_h \bar{y}'_{hu} + (1 - \varphi_h)\bar{y}'_{hm}$$ (12.38)
where
$$\varphi_h = \frac{W_{hu}}{W_{hu} + W_{hm}}$$
Since
$$V(\bar{y}_h') = \frac{1}{W_{hu} + W_{hm}}$$ (12.39)
we can find the optimum m on occasion h by maximizing
$(W_{hu} + W_{hm})$. It is helpful to write
$$V(\bar{y}_h') = g_h \frac{S^2}{n}$$ (12.40)
Since $V(\bar{y}_1') = S^2/n$, the quantity $g_h$ is the ratio of the variance on
occasion h to that on the first occasion. If the successive estimates
become steadily more precise, $g_h$ will be a decreasing function of h.
Now, from table 12.3, with h in place of 3,
$$\frac{1}{W_{hu}} = \frac{S^2}{u}, \qquad \frac{1}{W_{hm}} = \frac{S^2(1 - \rho^2)}{m} + \frac{S^2\rho^2 g_{h-1}}{n}$$
This gives
$$W_{hu} + W_{hm} = \frac{1}{S^2}\left[u + \frac{1}{\dfrac{1 - \rho^2}{m} + \dfrac{\rho^2 g_{h-1}}{n}}\right]$$ (12.41)
Hence
$$S^2(W_{hu} + W_{hm}) = (n - m) + \frac{1}{\dfrac{1 - \rho^2}{m} + \dfrac{\rho^2 g_{h-1}}{n}}$$
By differentiation with respect to m, the optimum $m_h$ is found to satisfy
the equation
$$\frac{\sqrt{1 - \rho^2}}{m_h} = \frac{1 - \rho^2}{m_h} + \frac{\rho^2 g_{h-1}}{n}$$
This gives
$$\frac{m_h}{n} = \frac{\sqrt{1 - \rho^2}}{g_{h-1}\left(1 + \sqrt{1 - \rho^2}\right)}$$ (12.42)
This equation is a generalization of equation (12.36), which gave
the optimum m on the second occasion. Equation (12.42) suggests
that the optimum percentage to match will increase steadily with
time, because we would expect $V(\bar{y}'_{h-1})$ and hence $g_{h-1}$ to decrease
with time.

In order to complete the solution, it is necessary to find the value of
$g_h$. By its definition, (12.40),
$$\frac{1}{g_h} = \frac{S^2}{n V(\bar{y}_h')} = \frac{S^2}{n}(W_{hu} + W_{hm}) = \frac{1}{n}\left[u_h + \frac{1}{\dfrac{1 - \rho^2}{m_h} + \dfrac{\rho^2 g_{h-1}}{n}}\right]$$
from (12.41).
By substituting the optimum m and u from equation (12.42) into
this equation, we obtain a recurrence relation which connects $g_h$ with
$g_{h-1}$. After some algebraic manipulation, the relation is expressible as
$$\frac{1}{g_h} = 1 + \frac{1 - \sqrt{1 - \rho^2}}{g_{h-1}\left(1 + \sqrt{1 - \rho^2}\right)}$$ (12.43)
Since $g_1 = 1$, the successive values of $g_h$, and hence the minimum
variance and the optimum m, can be worked out for any given value
of $\rho$. As expected, it is found that the successive values of $g_h$ steadily
decrease, whereas those of $m_h$ steadily increase.
It is easy to show that, as h increases, the quantities $g_h$ tend to a
limit, whose value is obtained by putting $g_h = g_{h-1} = g_\infty$ in equation
(12.43) and solving for $g_\infty$. This gives
$$g_\infty = \frac{2\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}}$$
Hence the variance of $\bar{y}_h'$ tends to
$$V(\bar{y}_\infty') = \frac{S^2}{n}\left(\frac{2\sqrt{1 - \rho^2}}{1 + \sqrt{1 - \rho^2}}\right)$$ (12.44)
Finally, the limiting value of $m_h$ is obtained from equation (12.42) as
$$\frac{m_\infty}{n} = \frac{\sqrt{1 - \rho^2}}{g_\infty\left(1 + \sqrt{1 - \rho^2}\right)} = \frac{1}{2}$$
irrespective of the value of $\rho$.
Table 12.4 shows the optimum percentage matched, $100 m_h/n$, as
found from equation (12.42), and the percentage gains in precision,
as computed from equation (12.43), for $\rho$ = 0.7, 0.8, 0.9, and 0.95,
and for a series of values of h.

TABLE 12.4 OPTIMUM PER CENT MATCHED AND GAINS IN PRECISION

           % matched, 100 m_h/n             % gain in precision
  h      ρ = 0.7    0.8    0.9    0.95     ρ = 0.7    0.8    0.9    0.95

  2         42      38     30     24          17      25     39      52
  3         49      47     42     36          19      31     55      80
  4         50      49     47     43          20      33     61      94
  5         50      50     49     46          20      33     63     102
  6         50      50     50     48          20      33     64     106
  ∞         50      50     50     50          20      33     65     110
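
The successive values in table 12.4 can be generated by iterating the recurrence. The sketch below assumes the relations $m_h/n = \sqrt{1 - \rho^2}/\{g_{h-1}(1 + \sqrt{1 - \rho^2})\}$ and $1/g_h = 1 + (1 - \sqrt{1 - \rho^2})/\{g_{h-1}(1 + \sqrt{1 - \rho^2})\}$, starting from $g_1 = 1$; these follow from the derivation in this section. The function names are hypothetical.

```python
import math


def successive_occasions(rho, last_h):
    """Return (h, % matched, % gain in precision) for h = 2, ..., last_h.

    Starts from g_1 = 1 and applies the recurrence for g_h, using the
    optimum matched fraction on every occasion.
    """
    root = math.sqrt(1.0 - rho ** 2)
    g_prev = 1.0  # g_1: first-occasion variance is S^2 / n
    rows = []
    for h in range(2, last_h + 1):
        m_frac = root / (g_prev * (1.0 + root))
        g = 1.0 / (1.0 + (1.0 - root) / (g_prev * (1.0 + root)))
        rows.append((h, 100.0 * m_frac, 100.0 * (1.0 / g - 1.0)))
        g_prev = g
    return rows


def limiting_gain(rho):
    """Per cent gain in precision as h tends to infinity."""
    root = math.sqrt(1.0 - rho ** 2)
    g_inf = 2.0 * root / (1.0 + root)
    return 100.0 * (1.0 / g_inf - 1.0)


for rho in (0.7, 0.8, 0.9, 0.95):
    for h, pct_matched, gain in successive_occasions(rho, 6):
        print(f"rho={rho:<5} h={h}  % matched={pct_matched:5.1f}  % gain={gain:6.1f}")
    print(f"rho={rho:<5} limiting % gain={limiting_gain(rho):6.1f}")
```

The gain is measured relative to the variance $S^2/n$ of a single unmatched sample, so that the per cent gain on occasion h is $100(1/g_h - 1)$.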

These results suggest that with repeated sampling a good working
rule is to retain a fraction one-third or one-quarter of the first sample
on the second occasion. Thereafter, one-half of the sample should be
retained on each occasion, and one-half drawn anew.
These recommendations assume that all replacement policies cost
the same and are equally feasible, given that the total sample size
remains fixed. In practice, questions of cost and feasibility should
not be overlooked. Since extra costs are involved in drawing and con-
tacting a new sample, cost considerations will point to a slower rate
of turnover. Moreover, table 12.4 makes it clear that, unless $\rho$ ex-
ceeds 0.8, the per cent gains in precision over complete matching are
modest, and large departures from the optimum matching will not in
general result in a serious loss of precision.

12.12 Exercises.
12.1 A population contains L strata of equal size. If $V_{ran}$ denotes the
variance of the mean of a simple random sample, and $V_{st}$, $V_{ds}$ are the corre-
sponding variances for stratified random sampling with proportional alloca-
tion and for double sampling with stratification, show that, approximately,
$$n V_{ran} = S_w^2 + \frac{\sum (\bar{Y}_h - \bar{Y})^2}{L}$$
$$n V_{st} = S_w^2$$
$$n V_{ds} = S_w^2 + \frac{n}{n'}\,\frac{\sum (\bar{Y}_h - \bar{Y})^2}{L}$$
where $S_w^2$ is the average variance within strata. (N and n' may both be as-
sumed large relative to L, and the $n_h$ in double sampling may be assumed
equal to n/L.)
Hence, if $(RP)_{st}$ denotes the relative precision of the stratified sample to
the simple random sample, with a corresponding definition for $(RP)_{ds}$, show
that
$$(RP)_{ds} = \frac{(RP)_{st}}{1 + \dfrac{n}{n'}\{(RP)_{st} - 1\}}$$
For $(RP)_{st} = 2$, plot $(RP)_{ds}$ against n/n'. How small must this ratio be in
order that $(RP)_{ds} = 1.9$?
12.2 If $\rho = 0.8$ in double sampling for regression, how large must n' be
relative to n, if the loss in precision due to sampling errors in the mean of
the large sample is to be less than 10 per cent?
12.3 In an application of double sampling for regression, the small sample
was of size 87 and the large sample of size 300. The following computations
apply to the small sample:
$\sum (y_i - \bar{y})^2 = 17{,}283$; $\sum (y_i - \bar{y})(x_i - \bar{x}) = 5114$; $\sum (x_i - \bar{x})^2 = 3248$
Compute the standard error of the regression estimate of $\bar{Y}$.
12.4 For $\rho = 0.95$, verify the data given in table 12.4 for the optimum
percentage which should be matched and for the gain in precision relative to
no matching. Compute the corresponding per cent gains in precision if one-
third of the units are retained from the first to the second occasion, and one-
half of the units are retained on each subsequent occasion.

12.13 References.
BOSE, CHAMELI (1943). Note on the sampling error in the method of double
sampling. Sankhya, 6, 330.
JESSEN, R. J. (1942). Statistical investigation of a sample survey for obtaining
farm facts. Iowa Agr. Exp. Sta. Res. Bull. 304.
NEYMAN, J. (1938). Contribution to the theory of sampling human populations.
Jour. Amer. Stat. Assoc., 33, 101-116.
PATTERSON, H. D. (1950). Sampling on successive occasions with partial replace-
ment of units. Jour. Roy. Stat. Soc., B12, 241-255.
YATES, F. (1949). Sampling methods for censuses and surveys. Charles Griffin and
Co., London.
CHAPTER 13

SOURCES OF ERROR IN SURVEYS

13.1 Introduction. The theory presented in previous chapters as-
sumes throughout that some kind of probability sampling is used and
that the observation $y_i$ on the ith unit is the correct value for that
unit. The error of estimate arises solely from the random sampling
variation that is present when n of the units are measured instead of
the complete population of N units.
These assumptions hold reasonably well in the simpler types of sur-
veys in which the measuring devices are accurate and the quality of
work is high. In complex surveys, particularly where difficult prob-
lems of measurement are involved, the assumptions may be far from
true. Three additional sources of error that may be present are as
follows:
i. Failure to measure some of the units in the chosen sample.
This may occur by oversight, or, with human populations, because
of failure to locate some individuals or their refusal to answer the
questions when located.
ii. Errors of measurement on a unit. The measuring device may
be biased or imprecise. With human populations the respondents
may not possess accurate information or they may give biased answers.
iii. Errors introduced in editing and tabulation of the results.
These sources of error necessitate a modification of the standard
theory of sampling. The principal aims of such a modification are to
provide guidance about the allocation of resources as between the re-
duction of random sampling errors and the reduction of the other errors,
and to develop methods for computing standard errors and confidence
limits that remain valid when the other errors are present. Until re-
cently, most of the work in sampling theory was concerned with the
reduction of random sampling errors, and the necessary modifications
in the theory are at present incomplete. However, although many
difficulties remain, a good beginning has been made.

13.2 Effects of non-response. We shall use the term non-response to
refer to the failure to measure some of the units in the selected sample.
In the study of non-response it is convenient to think of the popula-
tion as divided into two "strata": the first consisting of all units for
which measurements would be obtained if the units happened to fall
in the sample, the second of the units for which no measurements
would be obtained. The compositions of the two strata depend inti-
mately on the methods used to find the units and obtain the data. A
survey in which at least three calls are made, if necessary, on every
house and in which a supervisor with exceptional powers of persuasion
calls on all persons who refuse to give data will have a much smaller
"non-response" stratum than one in which only a single attempt is
made for every house.
This division of the population into two distinct strata is, of course,
an oversimplification. Chance plays a part in determining whether a
unit is found and measured in a given number of attempts. In a more
complete specification of the problem, we would attach to each unit
a probability representing the chance that it would be measured by a
given field method if it fell in the sample. However, the division into
two strata is adequate for the analysis to be presented here.
The sample provides no information about the non-response stratum
2. This would not matter if it could be assumed that the characteris-
tics of stratum 2 are the same as those of stratum 1. Where checks
have been made, however, it has often been found that units in the
"non-response" stratum differ from units that are measurable. An
illustration appears in table 13.1. The data come from an experi-
mental sampling of fruit orchards in North Carolina in 1946. Three
successive mailings of the same questionnaire were sent to growers.
For one of the questions, number of fruit trees, complete data were
available for the population (Finkner, 1950).

TABLE 13.1 RESPONSES TO THREE REQUESTS IN A MAILED INQUIRY

                                      No. of     % of popu-    Average no. of fruit
                                      growers    lation        trees per grower
Response to first mailing               300         10              456
Response to second mailing              543         17              382
Response to third mailing               434         14              340
Non-respondents after 3 mailings       1839         59              290
Total population                       3116        100              329

The steady decline in the number of fruit trees per grower in the
successive responses is evident, these numbers being 456 for respond-
ents to the first mailing, 382 in the second mailing, 340 in the third,
and 290 for the refusals to all 3 letters. The total response was poor,
over half the population failing to give data even after 3 attempts.

We now consider the effects of non-response on the sample estimate.
Let $N_1$, $N_2$ be the numbers of units in the two strata and let $W_1 =
N_1/N$, $W_2 = N_2/N$, so that $W_2$ is the proportion of non-response in
the population. Assume that a simple random sample is drawn from
the population. When the field work is completed, we have data for
a simple random sample from stratum 1 but no data from stratum 2.
Hence the amount of bias in the sample mean is
$$E(\bar{y}_1) - \bar{Y} = \bar{Y}_1 - \bar{Y} = \bar{Y}_1 - (W_1\bar{Y}_1 + W_2\bar{Y}_2) = W_2(\bar{Y}_1 - \bar{Y}_2)$$ (13.1)
The amount of bias is the product of the proportion of non-response
and the difference between the means in the two strata. Since the
sample provides no information about $\bar{Y}_2$, the size of the bias is un-
known unless bounds can be placed on $\bar{Y}_2$ from some source other than
the sample data. With a continuous variate, the only bounds that
can be assigned with certainty are often so wide as to be useless.
Consequently, with continuous data, any sizable proportion of non-
response usually makes it impossible to assign useful confidence limits
to $\bar{Y}$ from the sample results. We are left in the position of relying
on some guess about the size of the bias, without data to substantiate
the guess.*
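To give an idea of the size of this bias, the following sketch applies the relation bias $= W_2(\bar{Y}_1 - \bar{Y}_2)$ to the fruit-orchard data of table 13.1, treating everyone who answered any of the three mailings as stratum 1. The variable names are illustrative, and because the tabulated averages are rounded the results are approximate.

```python
# (number of growers, average number of fruit trees), copied from table 13.1
mailings = [(300, 456), (543, 382), (434, 340)]
non_respondents = (1839, 290)

# Stratum 1: growers who responded to any of the three mailings.
n1 = sum(count for count, _ in mailings)
y1_bar = sum(count * mean for count, mean in mailings) / n1

# Stratum 2: growers who never responded.
n2, y2_bar = non_respondents

total = n1 + n2
w1, w2 = n1 / total, n2 / total

y_bar = w1 * y1_bar + w2 * y2_bar   # population mean
bias = w2 * (y1_bar - y2_bar)       # bias of the respondents' mean

print(f"W2 = {w2:.3f}")
print(f"respondent mean = {y1_bar:.1f}, population mean = {y_bar:.1f}")
print(f"bias of respondent mean = {bias:.1f} trees per grower")
```

On these figures the mean for respondents overstates the population mean by roughly 56 trees per grower, or about 17 per cent.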
In sampling for proportions the situation is a little easier, since the
unknown proportion $P_2$ in stratum 2 must lie between 0 and 1. If $W_2$
is known, these bounds for $P_2$ enable us to construct confidence limits
for the population proportion P. Suppose that a simple random sam-
ple of n units is drawn and that measurements are obtained for $n_1$ of
the units in the sample. Assuming $n_1$ large enough, 95 per cent con-
fidence limits for $P_1$ are given by
$$p_1 \pm 2\sqrt{\frac{p_1 q_1}{n_1}},$$
where $p_1$ is the sample proportion and the fpc is ignored.
* Occasionally it pays to make no attempt to sample in one stratum. An exam-
ple occurs when $\bar{Y}_2$ is known to be very small. Without any sampling of stratum 2,
we adopt $\bar{y}_2 = 0$ as the sample estimate in this stratum. Hence the sample estimate
of $\bar{Y}$ is
$$W_1\bar{y}_1$$
The bias of this estimate is
$$W_1\bar{Y}_1 - \bar{Y} = -W_2\bar{Y}_2$$
If $W_2$ is known and if an upper bound for $\bar{Y}_2$ is known, it may be found profitable
to accept the bias and devote the whole of the sample to reducing the sampling
error of $\bar{y}_1$.

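When we try to derive a confidence statement about P, we are on
safe ground if we assume $P_2 = 0$ when finding $\hat{P}_L$ and $P_2 = 1$ when
finding $\hat{P}_U$. Thus we might take, for 95 per cent limits,
$$\hat{P}_L = W_1\left(p_1 - 2\sqrt{\frac{p_1 q_1}{n_1}}\right) + W_2(0)$$ (13.2)
$$\hat{P}_U = W_1\left(p_1 + 2\sqrt{\frac{p_1 q_1}{n_1}}\right) + W_2(1)$$ (13.3)
It is easy to verify that these limits are conservative, for the state-
ment
$$\hat{P}_L \le P \le \hat{P}_U$$
i.e.
$$W_1\left(p_1 - 2\sqrt{\frac{p_1 q_1}{n_1}}\right) \le W_1 P_1 + W_2 P_2 \le W_1\left(p_1 + 2\sqrt{\frac{p_1 q_1}{n_1}}\right) + W_2$$
is equivalent to the statement
$$p_1 - 2\sqrt{\frac{p_1 q_1}{n_1}} - \frac{W_2 P_2}{W_1} \le P_1 \le p_1 + 2\sqrt{\frac{p_1 q_1}{n_1}} + \frac{W_2(1 - P_2)}{W_1}$$ (13.4)
Whatever the value of $P_2$, the interval (13.4) always includes the
interval
$$p_1 - 2\sqrt{\frac{p_1 q_1}{n_1}} \le P_1 \le p_1 + 2\sqrt{\frac{p_1 q_1}{n_1}}$$
Hence
$$\Pr\{\hat{P}_L \le P \le \hat{P}_U\} \ge 0.95$$
Although limits can be found in this way if the percentage $W_2$ of
non-response in the population is known, the limits are distressingly
wide unless $W_2$ is very small. Table 13.2 shows the average limits for
a sample size n = 1000 and a series of values of $W_2$ and $p_1$. Since the
limits in equations (13.2) and (13.3) depend on the value of $n_1$ (num-
ber of respondents in the sample), we have taken $n_1 = nW_1$, its aver-
age value, in computing table 13.2.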

TABLE 13.2 95 PER CENT CONFIDENCE LIMITS FOR P (IN PER CENT)
WHEN n = 1000

% of non-                     Sample percentage, 100 p_1
response,
100 W_2           5               10               20               50

    0        (3.6, 6.4)      (8.1, 11.9)     (17.5, 22.5)     (46.7, 53.2)
    5        (3.4, 11.1)     (7.6, 16.3)     (16.5, 26.5)     (44.4, 55.6)
   10        (3.2, 15.8)     (7.2, 20.8)     (15.6, 30.4)     (42.0, 58.0)
   15        (3.0, 20.5)     (6.8, 25.2)     (14.7, 34.3)     (39.6, 60.4)
   20        (2.8, 25.2)     (6.3, 29.7)     (13.7, 38.3)     (37.2, 62.8)

The rapid increase in the width of the confidence interval with in-
creasing $W_2$ is evident. It is of interest to examine what values of n
would be needed to give the same widths of confidence interval if $W_2$
were zero. This is easily done when $p_1$ is 50 per cent. For $W_2$ = 5
per cent, table 13.2 shows that the half-width of the confidence inter-
val is 5.6. The equivalent sample size $n_e$, assuming no non-response,
is found from the equation
$$5.6 = 2\sqrt{\frac{(50)(50)}{n_e}}, \qquad n_e = 320$$
For $W_2$ = 10 per cent, 15 per cent, and 20 per cent, the values of $n_e$
are 155, 90, and 60, respectively. It is evidently worth while to devote
a substantial proportion of the resources to the reduction of non-
response.
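The conservative limits and the equivalent sample sizes can be computed as in the sketch below. It assumes limits of the form $W_1(p_1 \mp 2\sqrt{p_1 q_1/n_1})$, with $W_2$ added to the upper limit, and takes $n_1 = nW_1$ as in table 13.2. The function names are hypothetical.

```python
import math


def conservative_limits(p1, w2, n):
    """Conservative 95 per cent limits for P, as proportions.

    p1 is the sample proportion among respondents, w2 the known
    non-response proportion, and n the initial sample size; the number
    of respondents is set to its average value n * (1 - w2).
    """
    w1 = 1.0 - w2
    n1 = n * w1
    half = 2.0 * math.sqrt(p1 * (1.0 - p1) / n1)
    lower = w1 * (p1 - half)          # assumes P2 = 0
    upper = w1 * (p1 + half) + w2     # assumes P2 = 1
    return lower, upper


def equivalent_sample_size(p1, w2, n):
    """Sample size giving the same half-width if there were no non-response."""
    lower, upper = conservative_limits(p1, w2, n)
    half_width = (upper - lower) / 2.0
    return 4.0 * p1 * (1.0 - p1) / half_width ** 2


for w2 in (0.0, 0.05, 0.10, 0.15, 0.20):
    lo, hi = conservative_limits(0.5, w2, 1000)
    print(f"W2={w2:.2f}  limits=({100 * lo:.1f}, {100 * hi:.1f})  "
          f"equivalent n={equivalent_sample_size(0.5, w2, 1000):.0f}")
```

The equivalent sizes printed here are computed from unrounded half-widths, so they differ slightly from the rounded figures 320, 155, 90, and 60 quoted above.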
It may be objected that the limits in table 13.2 are much too con-
servative, since we have supposed that the worst possible cases have
actually happened. Since, moreover, the limits are frequently too
wide to be useful, it is always tempting to make some guesses about
the bounds within which $P_2$ lies and construct much narrower "confi-
dence" limits for P based on these bounds. There is nothing wrong
with this procedure if the bounds are correct, but we should recognize
that the procedure represents the substitution of guesswork for objec-
tive evidence.
An interesting method of finding sample size when non-response
is present has been given by Birnbaum and Sirken (1950a, 1950b).
The proportion $W_2$ of non-response is assumed known from previous
experience in the particular type of survey. No advance knowledge
of $P_1$, $P_2$, or P is assumed. Thus, if there were no non-response and
if we wished the absolute error in the sample proportion to be less
than d, we would take (by section 4.4)
$$n = \frac{t_\alpha^2 PQ}{d^2}$$
where $t_\alpha$ is the normal deviate corresponding to the risk $\alpha$ that the
error exceeds d. With no advance information about P, we would
take P = 0.5 as the least favorable case, giving
$$n = \frac{t_\alpha^2}{4d^2}$$ (13.5)
By taking the least favorable combination of the bias $W_2(P_1 - P_2)$
and the value of $P_1$, Birnbaum and Sirken show that a value of n
which still guarantees an error less than d, with risk $\alpha$, is
$$n' = \frac{t_\alpha^2}{4d(d - W_2)W_1} - 1$$ (13.6)
Note that no value of n suffices if $W_2 > d$. If $W_2 = 0$, this equation
reduces to (13.5) apart from the term -1, which comes from an ap-
proximation in the analysis. Some values of n given by Birnbaum
and Sirken's method are shown in table 13.3.

TABLE 13.3 SMALLEST VALUE OF n FOR GIVEN LIMIT OF ERROR d, WITH
RISK α = 0.05

% non-                  d (in per cent)
response,
100 W_2         20       15       10        5

    0           24       43       96      384
    2           27       50      122      653
    4           31       60      166     2000
    6           36       75      255
    8           43       99      521
   10           63      142
   15          112
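
A short sketch of the sample-size rule $n' = t_\alpha^2/\{4d(d - W_2)W_1\} - 1$ follows. The function name is hypothetical, $t_\alpha = 1.96$ is taken for risk 0.05, and rounding up to the next whole number is an assumption made here rather than a rule stated in the text.

```python
import math


def birnbaum_sirken_n(d, w2, t_alpha=1.96):
    """Sample size from the rule n' = t^2 / (4 d (d - W2) W1) - 1.

    d is the permitted absolute error and w2 the non-response
    proportion, both as fractions. The result is rounded up
    (an assumption of this sketch).
    """
    if w2 >= d:
        raise ValueError("no sample size suffices unless W2 is less than d")
    w1 = 1.0 - w2
    return math.ceil(t_alpha ** 2 / (4.0 * d * (d - w2) * w1) - 1.0)


for d, w2 in ((0.20, 0.0), (0.05, 0.02), (0.05, 0.04), (0.10, 0.08)):
    print(f"d={d:.2f}  W2={w2:.2f}  n={birnbaum_sirken_n(d, w2)}")
```

For example, the rule gives 653 for d = 5 per cent with 2 per cent non-response and 2000 with 4 per cent non-response. The guard clause reflects the remark that no value of n suffices when $W_2$ exceeds d; the sketch also rejects $W_2 = d$, where the formula is undefined.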

The table tells the same sad story as table 13.2. If we are content
with a crude estimate (d = 20), amounts of non-response up to 10
per cent can be handled by doubling the sample size. However, any
sizable percentage of non-response makes it impossible or very costly
to attain a highly guaranteed precision by increasing the sample size
among the respondents.

13.3 Optimum sampling fraction among the non-respondents. The
non-respondents in a survey may be divided into two classes. One,
the hard core, consists of units for which it is impossible to obtain
data. With human populations, this class contains people who cannot
be found, who adamantly refuse or who are for some reason incapable
of giving the data, and for whom data cannot be supplied by another
person. The second class consists of units for which data can be ob-
tained by more intensive and costly field methods than those originally
contemplated for the survey. This class contains people not at home
on the first call and people who are reluctant, but can be persuaded, to
give data.
Hansen and Hurwitz (1946), to whom the results in this section are
due, propose that this class be sampled with a smaller sampling fraction
than the units in stratum 1.
The first step is to take a simple random sample of n units, using
the ordinary field methods. Let $n_1$ be the number of units in the sam-
ple that provide the data sought, and $n_2$ the number in the non-re-
sponse group. By more intensive efforts, the data are later obtained
from a random sample of $r_2$ out of the $n_2$. Let
$$n_2 = k r_2 \qquad (k > 1)$$ (13.7)
Then the average sampling fraction in the first stratum is k times that
in the second. This follows because if k is fixed in advance
$$\frac{E(n_1)}{N_1} = \frac{n}{N}, \qquad \frac{E(r_2)}{N_2} = \frac{E(n_2)}{kN_2} = \frac{n}{kN}$$
The values of n (initial size of sample) and k are chosen so as to give
a specified precision for the lowest cost.
The cost of taking the sample is
$$C = c_0 n + c_1 n_1 + c_2 r_2$$
where the c's are costs per unit: $c_0$ is the cost of making the first at-
tempt, while $c_1$ and $c_2$ are the costs of getting and processing the data
in the two strata, respectively. Since the values of $n_1$ and $r_2$ are not
known until the first attempt is made, the expected cost is used in
planning the sample. The expected values of $n_1$ and $r_2$ are, respec-
tively, $W_1 n$ and $W_2 n/k$. Thus expected cost is
$$C = c_0 n + c_1 W_1 n + \frac{c_2 W_2 n}{k}$$ (13.8)
Let $\bar{y}_1$, $\bar{y}_{2r}$ be the sample means in the two strata. The subscript r
is introduced as a reminder that the sample in the second stratum is of
size $r_2$. As an estimate of the population mean, we take
$$\bar{y}' = \frac{1}{n}(n_1\bar{y}_1 + n_2\bar{y}_{2r})$$ (13.9)
Note that the second stratum receives a weight $n_2$, although the sam-
ple is only of size $r_2$. This is done in order to obtain an unbiased
estimate.
This procedure is an application of double sampling with stratifica-
tion. The first or "large" sample, of size n, gives an estimate $n_1/n_2$
of the relative size of the strata. The second or "small" sample is of
size $n_1$ in the first stratum and $r_2$ in the second stratum. Unfortu-
nately, the variance of $\bar{y}'$ cannot be derived from the variance formulas
which were given in section 12.2 for double sampling with stratifica-
tion. In section 12.2, the sizes $n_h$ in the second sample were assumed
fixed, whereas in the present problem $n_1$ and $r_2$ are random variables.
To find $V(\bar{y}')$, write
$$\bar{y}' = \frac{1}{n}(n_1\bar{y}_1 + n_2\bar{y}_{2n}) + \frac{n_2}{n}(\bar{y}_{2r} - \bar{y}_{2n})$$ (13.10)
where $\bar{y}_{2n}$ is the mean of the whole sample of size $n_2$ from stratum 2.
The first term on the right is the mean of a random sample of size n
from the whole population. Its variance is therefore
$$\frac{(N - n)}{N}\,\frac{S^2}{n}$$
where $S^2$ is the variance of the whole population. Further, when we
find the variance of $\bar{y}'$, there is no contribution from cross-products
between the first and second terms. For
$$E\{\bar{y}_{2n}(\bar{y}_{2r} - \bar{y}_{2n})\} = 0$$
over all random samples of size $r_2$ that can be drawn from a fixed
sample of size $n_2$.
Consider the second term on the right of (13.10). If $\bar{Y}_2$ is the popu-
lation mean of the "non-response" stratum, we have
$$(\bar{y}_{2r} - \bar{Y}_2) = (\bar{y}_{2r} - \bar{y}_{2n}) + (\bar{y}_{2n} - \bar{Y}_2)$$
so that
$$E(\bar{y}_{2r} - \bar{Y}_2)^2 = E(\bar{y}_{2r} - \bar{y}_{2n})^2 + E(\bar{y}_{2n} - \bar{Y}_2)^2$$
there being no contribution from cross-product terms for the same
reason as before. Now $\bar{y}_{2r}$ is the mean of a simple random sample of
size $r_2$ from the second stratum, and $\bar{y}_{2n}$ is the mean of a simple random
sample of size $n_2$ from the same stratum. Hence, for fixed $r_2$ and $n_2$,
$$\frac{(N_2 - r_2)}{N_2}\,\frac{S_2^2}{r_2} = E(\bar{y}_{2r} - \bar{y}_{2n})^2 + \frac{(N_2 - n_2)}{N_2}\,\frac{S_2^2}{n_2}$$
where $S_2^2$ is the variance within the "non-response" stratum. This
gives
$$E(\bar{y}_{2r} - \bar{y}_{2n})^2 = S_2^2\left(\frac{1}{r_2} - \frac{1}{n_2}\right) = S_2^2\,\frac{(n_2 - r_2)}{n_2 r_2} = S_2^2\,\frac{(k - 1)}{n_2}$$
since $n_2 = k r_2$.
Hence, adding the variances of the two terms in (13.10), we find, for
fixed $n_2$,
$$V(\bar{y}') = \frac{(N - n)}{N}\,\frac{S^2}{n} + \left(\frac{n_2}{n}\right)^2 \frac{(k - 1)}{n_2}\,S_2^2 = \frac{(N - n)}{N}\,\frac{S^2}{n} + \frac{(k - 1)}{n^2}\,n_2 S_2^2$$ (13.11)
Since $E(n_2) = nW_2$, this gives for the expected variance
$$V(\bar{y}') = \frac{(N - n)}{N}\,\frac{S^2}{n} + \frac{(k - 1)W_2}{n}\,S_2^2$$ (13.12)
The first term is the variance that would be obtained if all $n_2$ in the
non-response group were sampled. The second term is the increase in
variance from sampling only $r_2$ of the $n_2$.
The quantities n and k are then chosen to minimize average cost
(13.8) for a preassigned value of the expected variance (13.12).
The solutions are:
$$k_{opt} = \sqrt{\frac{c_2(S^2 - W_2 S_2^2)}{S_2^2(c_0 + c_1 W_1)}}$$ (13.13)
$$n_{opt} = \frac{N\{S^2 + (k - 1)W_2 S_2^2\}}{NV + S^2}$$ (13.14)
where V is the value specified for the variance of the estimated popula-
tion mean.
The solutions require a knowledge of $W_2$: often this can be estimated
from previous experience. In addition to $S^2$, whose value must be
estimated in advance in any "sample size" problem, the solutions also
involve $S_2^2$, the variance in the non-response stratum. The value of
$S_2^2$ is naturally harder to predict; it will probably not be the same as
$S^2$. For instance, in surveys made by mail of most kinds of economic
enterprise, the respondents tend to be larger operators, with larger
between-unit variances than the non-respondents.
This technique was first presented for a survey made by mail, fol-
lowed by visits to a subsample of those who do not answer the letters.
With a mail survey, $W_2$ may be large and its value may be difficult to
predict for determining $n_{opt}$. In this event a satisfactory approxima-
tion is to work out the value of $n_{opt}$ for a range of assumed values of
$W_2$ between 0 and a safe upper limit. The maximum $n_{opt}$ in this series
is adopted as the initial sample size n. When the replies to the mail
survey have been received, the value of $n_2$ is known. The variance
formula (13.11) is then solved to find the value of k that gives the de-
sired variance V. The cost for this method is usually only slightly
higher than the optimum cost which would have applied if $W_2$ were
known.
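The second step of this procedure, solving the fixed-$n_2$ variance formula for k, can be sketched as follows. The sketch assumes $V = \{(N - n)/N\}S^2/n + (k - 1)n_2 S_2^2/n^2$, which rearranges to $k = 1 + n^2[V - \{(N - n)/N\}S^2/n]/(n_2 S_2^2)$. The function name and the trial figures are illustrative.

```python
def k_for_target_variance(n, n2, s_sq, s2_sq, v, big_n=None):
    """Solve the fixed-n2 variance formula for k.

    n is the initial sample size, n2 the observed number of
    non-respondents, s_sq and s2_sq the population and non-response
    stratum variances, v the desired variance. big_n is the population
    size; None means it is treated as very large (fpc ignored).
    """
    fpc = 1.0 if big_n is None else (big_n - n) / big_n
    first_term = fpc * s_sq / n
    if v <= first_term:
        raise ValueError("target variance cannot be met even with full follow-up")
    return 1.0 + n ** 2 * (v - first_term) / (n2 * s2_sq)


# Illustrative (assumed) figures: 1870 mailed, 935 not returned,
# equal variances, target variance S^2 / 1000.
k = k_for_target_variance(n=1870, n2=935, s_sq=1.0, s2_sq=1.0, v=1.0 / 1000)
print(f"k = {k:.2f}, interviews needed = {935 / k:.0f}")
```

If fewer questionnaires are returned than anticipated, $n_2$ is larger, the computed k is smaller, and a larger fraction of the non-respondents must be interviewed to reach the same variance.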
E:t4mple. This example is ~ondensed from the paper by Hansen
and Hurwitz (1946). The first sample is taken by mail and the re-
sponse rate W 1 is expected to be 50 per cent. The precision desired is
that which would be given by a simple random sample of size 1000 if
there were no non-response. The cost of mailing a questionnaire is
10 cents, and the cost of processing the completed questionnaire is
40 cents. To carry out a personal interview costs e4.1O.
How many questionnaires should be sent out, and what percentage
of the non-respondents should be interviewed?
In terms of the cost function (13.8) the unit costs in dollars are as
follows;

Co - Cost of first attempt - 0.1


Cl - Cost of processing data. for a respondent = 0.4
C, - Cost of obtaining and processing data
for a Jlon-respondent - 4.5
The optimum $n$ and $k$ can be found from equations (13.13) and
(13.14). If the variances $S^2$ and $S_2^2$ are assumed equal, and $N$ is
assumed very large, then

$$k_{opt} = \sqrt{\frac{(4.5)(0.5)}{0.1 + (0.4)(0.5)}} = \sqrt{7.5} = 2.739$$

$$n_{opt} = \frac{S^2\{1 + (k - 1)W_2\}}{V} = 1000\{1 + (1.739)(0.5)\} = 1870$$
Note that we have put $S^2/V = 1000$, or $V = S^2/1000$, since this is the
variance that the sample mean would have if a sample of 1000 were
taken and complete response were obtained.
Consequently, 1870 questionnaires should be mailed. Of the 935
that are not returned, we interview a random subsample of 935/2.739
or 341. The cost will be found to be $2095.
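The arithmetic of this example can be checked with a short script. This is only a sketch of the special case worked above ($S^2 = S_2^2$, $N$ very large); the function and variable names are illustrative and not part of the original presentation, and the mailed sample size is rounded up as in the text.

```python
import math

def optimum_plan(c0, c1, c2, w1, n_equiv):
    """Return (k_opt, n_opt) for the special case S^2 = S_2^2, N very large.

    n_equiv is S^2 / V: the size of the simple random sample that would
    give the desired variance V if there were no non-response.
    """
    w2 = 1.0 - w1  # expected non-response rate
    k_opt = math.sqrt(c2 * (1.0 - w2) / (c0 + c1 * w1))
    n_opt = n_equiv * (1.0 + (k_opt - 1.0) * w2)
    return k_opt, n_opt

k_opt, n_opt = optimum_plan(c0=0.1, c1=0.4, c2=4.5, w1=0.5, n_equiv=1000)

n_mailed = math.ceil(n_opt)               # questionnaires sent out, rounded up
n_nonresp = n_mailed * 0.5                # expected number not returned
n_interviewed = round(n_nonresp / k_opt)  # subsample of non-respondents

# Expected cost: mailing, processing of returns, interviews of the subsample.
expected_cost = (0.1 * n_mailed
                 + 0.4 * (n_mailed - n_nonresp)
                 + 4.5 * n_interviewed)

print(round(k_opt, 3), n_mailed, n_interviewed, round(expected_cost, 2))
```

The fraction of non-respondents interviewed is $1/k_{opt}$, a little over one-third.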
With stratified sampling, the optimum values of the $n_h$ and the $k_h$
in the individual strata are rather complex. A good approximation is
to estimate first, by the methods in sections 5.5 and 5.6, the sample
sizes $n_{0h}$ that would be required in the strata if there were no
non-response. Now from equation (13.14), if $W_2 = 0$, we have

$$n_0 = \frac{N S^2}{N V + S^2}$$

Hence equation (13.14) can be rewritten as

$$n_{opt} = n_0\{1 + (k - 1)W_2 S_2^2 / S^2\} \qquad (13.15)$$

This equation, applied separately to each stratum, gives an
approximation to the optimum $n_h$. The values of $k_h$ are found by applying
equation (13.13) in each stratum.
These techniques can be used with ratio or regression estimates.
With the ratio estimate, the quantities $S^2$ and $S_2^2$ are replaced by $S_d^2$
and $S_{2d}^2$, where $d_i = y_i - R x_i$. With a regression estimate, $S^2$
becomes $S^2(1 - \rho^2)$ and $S_2^2$ becomes $S_2^2(1 - \rho^2)$.

13.4 Other techniques for non-response. One method which is sometimes
thought to solve the non-response problem is to collect data for
some neighboring unit whenever a non-response is encountered. For
instance, if a house belonging to the sample is found to have no one at
home, the next house in the street is visited. In this way the size of
sample for which data are obtained remains equal to the size originally
planned.
All that this method accomplishes is to increase the sample size in
the first stratum. It does not obtain any data for the non-response
stratum. The method is not completely ineffective, because there is
some increase in precision from an addition to the sample size in the
first stratum. But, as has been seen in section 13.2, this increase is
likely to be trivial if $W_2$ is at all substantial. The "substitution"
method does positive harm if the samplers are deluded into thinking
that the non-response problem has been adequately dealt with.
When non-response is due primarily to absence of people from their
homes, an ingenious approach has been made by Politz and Simmons
(1949, 1950). Suppose that all calls are made between the hours of
6 P.M. and 9 P.M. during the period of field work (say a month) and
that only one call is made per respondent. If a person is always at
home during these hours in this month, he is certain to be found if he
falls in the sample. A person who is home half the time during these
hours has a 50 per cent chance of being at home when the interviewer
calls, if we can assume that the call takes place at a random instant of
time during the period.
The persons in the population can be classified into strata according
to the proportion of the time that they are at home. Let this proportion
be $\pi_h$ in stratum $h$. If a member of stratum $h$ falls in the sample,
the probability that he is found is $\pi_h$. This approach is a generalization
of the mathematical model which we adopted in sections 13.2 and
13.3. In the earlier model there were only two strata, with $\pi_h = 1$
and 0, respectively.
In the new approach, the sample for which data are obtained is seen
to be overweighted with persons who are at home most of the time.
If these persons differ in their characteristics from those who are less
frequently at home, a non-response bias is produced. Given the
values of the $\pi_h$, most of this bias can be removed. The artificial data
in table 13.4 illustrate the process.

TABLE 13.4  DATA FOR ILLUSTRATING RESULTS WITH VARYING PROPORTIONS OF
"NOT-AT-HOMES"

  $\pi_h$   $n_h$   $n_h'$   $\bar{y}_h = \bar{Y}_h$   $y_h$   $y_h/\pi_h$

  1         100     100              1                 100       100
  0.5       100      50              2                 100       200
  0.25      100      25              3                  75       300

  Totals    300     175                                275       600

There are 3 strata of equal size, with $\pi_h$ = 1, 0.5, and 0.25. An
initial sample size of $n$ = 300 is planned (100 of which would fall in
each stratum, on the average). Owing to the absences from home, the
actual sample sizes $n_h'$ average 100, 50, and 25, respectively.
With the assumed values of $\bar{Y}_h$, the true population mean $\bar{Y}$ is 2.
We have ignored the within-stratum sampling variances, taking
$\bar{y}_h = \bar{Y}_h$. The observed sample total $y$ comes out as 275, and the
sample mean is 275/175 = 1.57. This is negatively biased because the
"not-at-homes" have higher values of $\bar{Y}_h$ than the "at-homes."
An estimate free from bias is obtained by weighting the total from
any stratum inversely as the proportion of responses. Thus

$$\bar{y}' = \frac{1}{n} \sum_h \frac{y_h}{\pi_h} = \frac{600}{300} = 2$$

where $n$ is the original size of sample.
In practical applications some bias remains in the estimate, because
persons with $\pi_h = 0$ are not represented at all in the sample.
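The figures in table 13.4 can be reproduced in a few lines. The sketch below simply re-does the table's arithmetic; the variable names are illustrative, and within-stratum sampling variation is ignored exactly as in the text.

```python
# Artificial data of table 13.4: three strata of equal size.
pi_h = [1.0, 0.5, 0.25]         # proportion of time at home
n_planned_h = [100, 100, 100]   # planned sample per stratum (n = 300)
ybar_h = [1.0, 2.0, 3.0]        # stratum means

# Expected numbers actually found at home, and the totals they contribute.
n_found_h = [n * p for n, p in zip(n_planned_h, pi_h)]
y_h = [nf * yb for nf, yb in zip(n_found_h, ybar_h)]

n = sum(n_planned_h)

# Unweighted mean of the units actually found (biased downward here).
unweighted_mean = sum(y_h) / sum(n_found_h)

# Each stratum total weighted inversely as its response proportion.
weighted_mean = sum(y / p for y, p in zip(y_h, pi_h)) / n

print(round(unweighted_mean, 2), weighted_mean)
```

The unweighted mean reproduces the biased value 275/175, while the weighted estimate recovers the population mean of 2.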
The chief problem is to estimate the $\pi_h$. In the method proposed by
Politz and Simmons, the interviewer asks the respondent, for each of
the 5 previous nights, whether he was at home at the particular time
of day at which the question is asked. The respondents are then
classified into 6 strata, with estimated values of $\pi_h$ from 1/6 to 1
inclusive.
If the time at which the question is asked can be considered random,
this technique gives unbiased estimates of $\pi_h$, but the variances of the
estimates may be rather high. For instance, a person who was regularly
at home from 6 P.M. to 7 P.M. but out from 7 P.M. to 9 P.M. would
report a $\pi_h$ of unity if the interviewer called before 7 P.M. and would
not be in the sample if the interviewer called after 7 P.M. An alternative
is to ask about 5 random instants of time between 6 P.M. and
9 P.M.
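A small sketch of the classification into 6 strata follows. It assumes the usual reading of the procedure: the night of the interview counts as one night at home, so a respondent who reports being at home on j of the 5 previous nights receives the estimate (j + 1)/6. The function names are hypothetical.

```python
def estimated_pi(nights_at_home):
    """Estimated proportion of time at home, from the 5 previous nights.

    The night of the interview itself is counted as a sixth night on
    which the respondent was at home (an assumption of this sketch).
    """
    if not 0 <= nights_at_home <= 5:
        raise ValueError("nights_at_home must be between 0 and 5")
    return (nights_at_home + 1) / 6

def weight(nights_at_home):
    """Weight applied to a response: inversely as the estimated pi."""
    return 1.0 / estimated_pi(nights_at_home)

# The six strata and their weights.
strata = [(j, estimated_pi(j), weight(j)) for j in range(6)]
for j, p, w in strata:
    print(j, round(p, 3), round(w, 2))
```

A respondent who was out on all 5 previous nights thus counts six times as heavily as one who was always at home.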
The assumption that the time at which the interviewer calls is random
is, of course, open to question. Some judgment about the
reasonableness of the assumption can be obtained from an analysis of the
times at which interviewers do call.
The variance formula for this procedure and a comparison with the
"call-back" method of reducing non-response are given in the 1949
Politz and Simmons reference. Persons interested in using the method
should study this reference, since the presentation given here is
oversimplified, and the assumptions needed to obtain unbiased estimates
of $\bar{Y}$ require careful statement.

13.5 Errors of measurement. This term will be used in a broad
sense, to denote any difference between the correct value $\eta_i$ for an item
on the ith unit and the value $y_i$ which is assigned to that item in the
computations from which the estimates are made. In this sense,
errors of measurement include errors introduced in recording, editing,
and tabulating the data, as well as errors that result from deficiencies
in the measuring device.
The problem of errors of measurement is an old one in physics,
chemistry, and biology. A large amount of information has been
gathered about the behavior both of instruments of measurement and
of the human observer who plays a role in many measuring techniques.
Much of this knowledge should be applicable to errors of measurement
which occur in sample surveys.
Less is known about errors of measurement when the information
is given verbally in response to a verbal question in an interview.
Since the interplay between two individuals is involved, the pattern
of errors may be complex. The importance of this topic is now realized
by agencies which regularly sample human populations, and research
studies of interviewing errors are on the increase.

13.6 A mathematical model for errors of measurement. We now
construct a mathematical description of some of the major components
of errors of measurement. Consider a very large number of repetitions
of the measurement on the ith unit, and let $y_{i\alpha}$ be the value obtained
in the $\alpha$th repetition. Then we write

$$y_{i\alpha} = \eta_i + g_i' + e_{i\alpha} \qquad (13.16)$$

where $\eta_i$ = Correct value.
      $g_i'$ = Bias component.
      $e_{i\alpha}$ = "Random" component, with mean 0.

This model assumes that in repeated measurements of the ith unit
the errors $(g_i' + e_{i\alpha})$ follow some frequency distribution with mean $g_i'$.
With a well-controlled measuring instrument, the frequency distribution
is often approximately normal in shape. With a new and complex
measuring process, on the other hand, it cannot be taken for
granted that repetitions will follow any single frequency distribution:
the shape may change erratically with time. Such cases are not covered
by the theory presented here, since we assume that $e_{i\alpha}$ acts like a
random variate in the probability sense.
The next step is to consider how these components change when we
move from one unit to another. Various complications may occur.
For the bias component $g_i'$, there may be a constant bias $g$ that
affects all units alike. There may be a component $g_i$ which follows a
fairly simple frequency distribution over the population. This component
may be correlated with the correct value $\eta_i$: for instance, the
measuring device may consistently underestimate high values of $\eta_i$
and overestimate low values.
There may be a complex pattern of interrelationships between the
values of $g_i$ on different units, quite apart from any correlation that
is created as a result of the correlation between $g_i$ and $\eta_i$. The simplest
example is the "interviewer bias." Dramatic differences have sometimes
been found in the mean values of $y_i$ obtained by different
interviewers who are sampling apparently comparable parts of the same
population (see Lienau, 1941, and Mahalanobis, 1946). The same
effect has appeared when samples of a growing crop are cut by different
teams and when chemical or biological analyses are done in different
laboratories. The human factor is not the only cause for correlations
among units that are measured at about the same time. Many measuring
devices are affected by the weather: some use raw materials
whose quality varies from batch to batch.
To turn to the "random" component of error $e_{i\alpha}$, the frequency
distribution which it is presumed to follow may change from one unit
to another, although with a good system of measurement such changes
should be slight. As with $g_i$, there may be correlations between the
values of $e_{i\alpha}$ on different units.
The components of the error of measurement are summarized in
table 13.5.

TABLE 13.5  COMPONENTS OF THE ERROR OF MEASUREMENT ON THE iTH UNIT

Notation         Nature of component

$g$              Constant bias over all units.
$g_i$            "Variable" component of bias, which follows some
                 frequency distribution with mean zero, as i varies, and may
                 be correlated with the correct value $\eta_i$.
$e_{i\alpha}$    "Random" component of error, which follows some
                 frequency distribution, with mean zero, as $\alpha$ varies for fixed i.
                 These frequency distributions may change in shape as i
                 varies.

We have noted further that values of $g_i$ on different units may be
correlated with one another, and similarly for values of $e_{i\alpha}$ on different
units.
A more detailed account of the construction of a model of this kind
is given by Hansen et al. (1951) . This paper contains a number of the
results given in later sections.

13.7 Effects of constant bias. Suppose that the measurements $y_i$ on
all units are subject to a constant bias $g$ whose magnitude is unknown.
Then the sample mean $\bar{y}$ of a simple random sample is also subject to
bias $g$. In the estimated error variance which we attach to the sample
mean, the bias cancels out, since this estimate is derived from a sum
of squares of terms $(y_i - \bar{y})^2$. Consequently, the usual computation
of the confidence limits for $\bar{Y}$ from the sample data takes no account
of the bias. The same results hold in stratified random sampling.
The situation is also essentially the same with regression and ratio
estimates. Consider the regression estimate

$$\bar{y}_{lr} = \bar{y} + b(\bar{X} - \bar{x})$$

where both the $y_i$ and the $x_i$ may be subject to constant biases $g_y$ and
$g_x$, respectively. Since the least squares estimate $b$ remains unchanged,
and since the bias $g_x$ cancels out of the term $(\bar{X} - \bar{x})$, it follows that
$\bar{y}_{lr}$ is subject to a bias $g_y$. It is easy to verify that the sample estimate
of $V(\bar{y}_{lr})$ contains no contribution due to the biases.
With the ratio estimate

$$\bar{y}_R = \frac{\bar{y}}{\bar{x}} \bar{X}$$

the bias is also $g_y$, to a first approximation, since in large samples
$E(\bar{X}/\bar{x})$ is approximately 1 even if the $x_i$ are subject to a constant bias.
In large samples the sample estimate of variance

$$v(\bar{y}_R) = \frac{(N - n)}{Nn} \frac{\sum (y_i - \hat{R} x_i)^2}{n - 1}$$

will be almost free from bias as an estimate of

$$E(\bar{y}_R - \bar{Y}')^2$$

i.e. as an estimate of the variance about the biased mean $\bar{Y}'$.*


To summarize, a constant bias passes undetected by the sample
data. As we have seen (section 1.5), the 95 per cent confidence
probabilities are almost unaffected if the ratio of $g_y$ to the standard error of
the estimated mean is less than 0.1, but as the ratio increases beyond
this value, the computation of confidence limits becomes misleading.
It should also be remembered that when a number of independent
estimates, all subject to the same bias, are averaged, this ratio
increases, since the bias remains constant but the standard error of the
average diminishes. Estimates of change from one time period to
another, or from one stratum to another, remain unbiased, provided that
the bias is constant throughout.
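The effect of the ratio of bias to standard error on the nominal 95 per cent limits can be illustrated numerically. The sketch below assumes that the estimate is normally distributed about the biased mean; this normal approximation and the function names are assumptions made for illustration only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def coverage(bias_over_se, z=1.96):
    """Probability that estimate +/- z * SE covers the true mean when the
    estimate is normal about a mean displaced by bias_over_se * SE."""
    return normal_cdf(z - bias_over_se) - normal_cdf(-z - bias_over_se)

def ratio_after_averaging(bias_over_se, m):
    """Bias / SE for the average of m independent estimates that share the
    same bias: the bias is unchanged, the SE shrinks by sqrt(m)."""
    return bias_over_se * math.sqrt(m)

for r in (0.0, 0.1, 0.5, 1.0):
    print(r, round(coverage(r), 4))

# Averaging 100 estimates, each with bias / SE = 0.1, gives ratio 1.
print(round(coverage(ratio_after_averaging(0.1, 100)), 4))
```

With a ratio of 0.1 the coverage is still close to 95 per cent, but when 100 such estimates are averaged the ratio becomes 1 and the coverage falls well below the nominal level.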
* In one respect the ratio estimate is more affected by constant biases than the
regression or mean per unit estimates. If the relation between $\eta_i$ and $x_i$ is a straight
line through the origin, and $y_i = \eta_i + g_y$, the linear relation between $y_i$ and $x_i$
does not pass through the origin. The consequence is that the precision of the ratio
estimate, as an estimate of the biased mean $\bar{Y}'$, is affected by constant bias $g_y$ or $g_x$.
The point is not of practical importance if decisions about the use of the ratio
estimate are based on the observed relation between $y_i$ and $x_i$.
13.8 Effects of components that are independent from unit to unit.
If constant bias is ignored, the model becomes

$$y_{i\alpha} = \eta_i + g_i + e_{i\alpha}$$

The distinction between the bias $g_i$ and the random error of measurement
$e_{i\alpha}$ on the ith unit must be preserved if we are discussing sampling
plans in which the same unit is measured several times in order to
increase precision. Since discussion will be confined to the case where
any unit is measured only once, the error $(g_i + e_{i\alpha})$ can be combined
into a single term $\epsilon_{i\alpha}$. The model simplifies to

$$y_{i\alpha} = \eta_i + \epsilon_{i\alpha}$$

The nature of the population should be clearly realized. In repeated
sampling, whenever the ith unit appears in a sample the value of $\eta_i$
remains unchanged, since $\eta_i$ is the correct value for this unit. However,
a new measurement of the unit is made, giving a new value to
the error of measurement $\epsilon_{i\alpha}$. In any expressions which contain $\epsilon_{i\alpha}$,
we average both over all possible measurements of the unit and also
over all units or all samples of a given type.
Since constant bias is ignored, the average value of $\epsilon_{i\alpha}$ over the
population is taken as zero. This may be written

$$E\,E_\alpha\,\epsilon_{i\alpha} = 0$$

where $E_\alpha$ denotes an average over all measurements of the ith unit
and $E$ a subsequent average over all units. By hypothesis, $\epsilon_{i\alpha}$ may be
correlated with $\eta_i$, and it may also be correlated from unit to unit.
Now it happens that, if the errors $\epsilon_{i\alpha}$, $\epsilon_{j\beta}$ on any two different units
in the sample are independent, in the probability sense, the sampling
error formulas given in previous chapters remain valid, provided that
the population can be regarded as infinite.
This is easily seen for simple random sampling. The independent
drawing of successive units guarantees that the successive values of $\eta_i$
in a sample are mutually independent. It does not guarantee that the
values of the $\epsilon_{i\alpha}$ are mutually independent, but we have assumed that
this is so. Hence, the $y_{i\alpha}$ are independent on successive units, and the
ordinary theory for random sampling from an infinite population is
applicable to the $y_{i\alpha}$.
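In particular, if $E(\eta_i) = \bar{H}$ is the true population mean,

$$V(\bar{y}) = E(\bar{y} - \bar{H})^2 = \frac{\sigma_y^2}{n} = \frac{1}{n}(\sigma_\eta^2 + \sigma_\epsilon^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon) \qquad (13.17)$$

where

$$\sigma_\eta^2 = E(\eta_i - \bar{H})^2$$

$$\sigma_\epsilon^2 = E\,E_\alpha(\epsilon_{i\alpha}^2)$$

and $\rho_{\eta\epsilon}$ is the correlation between $\eta_i$ and $\epsilon_{i\alpha}$.
Further, since the $y_{i\alpha}$ are independent members of a simple random
sample from an infinite population,

$$v(\bar{y}) = \frac{\sum (y_{i\alpha} - \bar{y})^2}{n(n - 1)} \qquad (13.18)$$

is an unbiased sample estimate of $V(\bar{y})$. In this case, fortunately, the
ordinary formula for the estimated sampling error remains valid when
errors of measurement are present. In particular, two measuring
processes can be compared empirically by finding which gives the
smaller value of $v(\bar{y})$ for a given cost.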
In the same way it can be shown that the formulas given in previous
chapters for the sample estimates of error variances remain valid for
stratified sampling and for ratio and regression estimates, provided
that the errors of measurement in both $y_{i\alpha}$ and $x_{i\alpha}$ are independent
from unit to unit and that the fpc's can be ignored.
When $n/N$ is not negligible, some change is required in the formulas.
The development will be sketched briefly for the mean of a simple
random sample. With a finite population, the population variances
and covariance will be defined as follows:

$$S_\eta^2 = \frac{1}{(N - 1)} \sum_{i=1}^{N} (\eta_i - \bar{H})^2$$

$$\sigma_\epsilon^2 = \frac{1}{N} \sum_{i=1}^{N} E_\alpha(\epsilon_{i\alpha}^2)$$

$$\rho_{\eta\epsilon} S_\eta \sigma_\epsilon = \frac{1}{(N - 1)} \sum_{i=1}^{N} E_\alpha\{\epsilon_{i\alpha}(\eta_i - \bar{H})\}$$

Note the use of the divisor $N$ in defining $\sigma_\epsilon^2$; this helps to keep the
results simple.
The error of estimate of the sample mean is

$$\bar{y} - \bar{H} = (\bar{\eta} - \bar{H}) + \bar{\epsilon} \qquad (13.19)$$

By theorem 2.3,

$$E(\bar{\eta} - \bar{H})^2 = \frac{(N - n)}{Nn} S_\eta^2 \qquad (13.20)$$

For the average value of $\bar{\epsilon}^2$, we have

$$E_\alpha(\bar{\epsilon}^2) = \frac{1}{n^2} \sum_{i=1}^{n} E_\alpha(\epsilon_{i\alpha}^2)$$

since the errors of measurement on different units in the sample are by
hypothesis independent. When the average is taken over all simple
random samples of size $n$, we obtain

$$E(\bar{\epsilon}^2) = \frac{\sigma_\epsilon^2}{n} \qquad (13.21)$$

For the cross-product, $E\,\bar{\epsilon}(\bar{\eta} - \bar{H})$, let us suppose first that there is
a fixed error of measurement $\epsilon_{i\alpha}'$ associated with the ith unit. Then,
by the ordinary theory for simple random samples from a finite
population (theorem 2.3),

$$E\,\bar{\epsilon}'(\bar{\eta} - \bar{H}) = \frac{(N - n)}{Nn} \frac{\sum_{i=1}^{N} \epsilon_{i\alpha}'(\eta_i - \bar{H})}{(N - 1)}$$

where this average is over all simple random samples with this fixed
set of errors of measurement. Taking the average over all possible
sets of these errors, we have, by the definition of $\rho_{\eta\epsilon}$,

$$E\,\bar{\epsilon}(\bar{\eta} - \bar{H}) = \frac{(N - n)}{Nn} \rho_{\eta\epsilon} S_\eta \sigma_\epsilon \qquad (13.22)$$

Finally, from (13.20), (13.21), and (13.22),

$$V(\bar{y}) = \frac{(N - n)}{Nn} \{S_\eta^2 + 2\rho_{\eta\epsilon} S_\eta \sigma_\epsilon\} + \frac{\sigma_\epsilon^2}{n} \qquad (13.23)$$

This formula replaces (13.17) when the fpc is not negligible. If
$n = N$, the variance reduces to $\sigma_\epsilon^2/N$ instead of to zero, reflecting the
fact that some error of measurement remains in the average of $N$
independent measurements.
With this model, the mean square deviation from the sample mean,

$$s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$$

can be shown to be an unbiased estimate of

$$S_y^2 = S_\eta^2 + 2\rho_{\eta\epsilon} S_\eta \sigma_\epsilon + \sigma_\epsilon^2$$

The usual formula (section 2.6) for the estimated variance of $\bar{y}$ is

$$v(\bar{y}) = \frac{(N - n)}{Nn} s^2$$

Hence

$$E\,v(\bar{y}) = \frac{(N - n)}{Nn} \{S_\eta^2 + 2\rho_{\eta\epsilon} S_\eta \sigma_\epsilon + \sigma_\epsilon^2\}$$

By comparison with (13.23) it follows that $v(\bar{y})$ has a negative bias
which amounts to $\sigma_\epsilon^2/N$ and will usually be small. If the fpc is omitted
from $v(\bar{y})$, we obtain an overestimate. An unbiased estimate cannot
be constructed without knowledge of $\sigma_\epsilon^2$.
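The size of this bias can be checked numerically from the two expressions, (13.23) for the true variance and the average value of the usual estimate. The sketch below is illustrative only: the function names and the numerical inputs are assumptions, not values from the text.

```python
import math

def true_variance(S_eta2, sigma_eps2, rho, n, N):
    """V(ybar) from (13.23), for independent errors in a finite population."""
    cross = rho * math.sqrt(S_eta2 * sigma_eps2)
    return (N - n) / (N * n) * (S_eta2 + 2.0 * cross) + sigma_eps2 / n

def expected_usual_estimate(S_eta2, sigma_eps2, rho, n, N):
    """Average value of the usual estimate v(ybar) = (N - n) s^2 / (N n)."""
    cross = rho * math.sqrt(S_eta2 * sigma_eps2)
    return (N - n) / (N * n) * (S_eta2 + 2.0 * cross + sigma_eps2)

# Hypothetical inputs chosen only for illustration.
S_eta2, sigma_eps2, rho, n, N = 100.0, 25.0, 0.0, 50, 1000

V = true_variance(S_eta2, sigma_eps2, rho, n, N)
Ev = expected_usual_estimate(S_eta2, sigma_eps2, rho, n, N)
bias = Ev - V  # equals -sigma_eps2 / N

# With a complete census (n = N) the variance does not vanish.
V_census = true_variance(S_eta2, sigma_eps2, rho, N, N)

print(V, Ev, bias, V_census)
```

For any inputs the difference between the two expressions is $-\sigma_\epsilon^2/N$, and with $n = N$ the true variance is $\sigma_\epsilon^2/N$.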

13.9 Effects of correlation between errors on different units. With
the same model,

$$y_{i\alpha} = \eta_i + \epsilon_{i\alpha}$$

suppose that there are correlations between the $\epsilon_{i\alpha}$ on different units
in the same sample. Simple random sampling from an infinite population
is assumed.
In finding $E(\bar{y} - \bar{H})^2$, the only term to be added to the result (13.17)
in the preceding section is the cross-product term

$$\frac{2}{n^2} E\,E_\alpha \Big\{ \sum_{i<j} \epsilon_{i\alpha}\epsilon_{j\alpha} \Big\} \qquad (13.24)$$

We may define the average within-sample correlation coefficient $\rho_w$
by the equation

$$\rho_w \sigma_\epsilon^2 = \frac{2}{n(n - 1)} E\,E_\alpha \Big\{ \sum_{i<j}^{n} \epsilon_{i\alpha}\epsilon_{j\alpha} \Big\}$$

Hence the cross-product term (13.24) contributes

$$\frac{(n - 1)}{n} \rho_w \sigma_\epsilon^2$$
Adding this term to (13.17), we have

$$V(\bar{y}) = \frac{1}{n} [\sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2\{1 + (n - 1)\rho_w\}] \qquad (13.25)$$

The average value of $v(\bar{y})$ is found in the same way to be

$$E\,v(\bar{y}) = \frac{1}{n} [\sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2\{1 - \rho_w\}] \qquad (13.26)$$

Thus the standard estimated variance is biased. Since $\rho_w$ appears
to be positive for most types of measurement error, $v(\bar{y})$ is usually an
underestimate. Whether the underestimation is serious depends on
the relative sizes of $\sigma_y^2$ and $\sigma_\epsilon^2$ as well as on the value of $\rho_w$.
The analogy with systematic sampling, or more generally with
cluster sampling, is apparent.
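A numerical illustration of (13.25) and (13.26) may help to show how quickly a small within-sample correlation inflates the true variance. The inputs below are hypothetical and the function name is illustrative.

```python
import math

def variance_with_correlated_errors(sigma_eta2, sigma_eps2, rho, rho_w, n):
    """Return (V, Ev): the true variance (13.25) and the average value of
    the usual estimate (13.26), for an infinite population."""
    base = sigma_eta2 + 2.0 * rho * math.sqrt(sigma_eta2 * sigma_eps2)
    V = (base + sigma_eps2 * (1.0 + (n - 1) * rho_w)) / n
    Ev = (base + sigma_eps2 * (1.0 - rho_w)) / n
    return V, Ev

# Hypothetical values: error variance one-quarter of the true-value variance,
# a within-sample correlation of 0.1 between errors, and n = 50.
sigma_eta2, sigma_eps2, rho, rho_w, n = 100.0, 25.0, 0.0, 0.1, 50

V, Ev = variance_with_correlated_errors(sigma_eta2, sigma_eps2, rho, rho_w, n)
underestimate = V - Ev  # equals rho_w * sigma_eps2, whatever n is

print(V, Ev, underestimate)
```

The absolute underestimate $\rho_w\sigma_\epsilon^2$ does not diminish as $n$ grows, while the term $\sigma_\epsilon^2(n - 1)\rho_w/n$ in the true variance approaches $\rho_w\sigma_\epsilon^2$ rather than zero.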

13.10 Interpenetrating subsamples. When measurement errors are
correlated on different units, we would like to have some method of
obtaining a sample estimate of variance $v(\bar{y})$ that is unbiased and some
way of finding out the extent to which the correlations increase the
true $V(\bar{y})$ and hence decrease the precision of the estimate. This can
be done by a technique proposed by Mahalanobis (1946), who has
used it in a number of Indian surveys.
A simple random sample of $n$ units is divided at random into $k$
groups of units, each group containing $m = n/k$ units. The field work
in taking the sample is planned so that there is no correlation between
the errors of measurement of any two units that are in different groups.
For example, suppose that the correlation with which we have to cope
arises from biases of the interviewers. If each of $k$ interviewers is
assigned to a different group and if there is no correlation between
errors of measurement for different interviewers, we have an example
of the proposed technique.
With the same mathematical model, it is convenient to label the
units by a double subscript notation. Let

$$y_{ij\alpha} = \eta_{ij} + \epsilon_{ij\alpha}$$

where $i$ denotes the group and $j$ the member within the group. The
population is assumed infinite, and the average of the $\epsilon_{ij\alpha}$ over the
population is assumed zero.
Since group $i$ is a random subsample of a simple random sample, it
is itself a simple random sample from the population. Hence, by
equation (13.25), the variance of the group mean $\bar{y}_i$ is

$$V(\bar{y}_i) = \frac{1}{m} [\sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2\{1 + (m - 1)\rho_w\}]$$

Since errors are independent in the different groups,

$$V(\bar{y}) = \frac{1}{k} V(\bar{y}_i) = \frac{1}{n} [\sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2\{1 + (m - 1)\rho_w\}] \qquad (13.27)$$

putting $n = mk$.
If, as before, the variance of a single measurement of a unit by an
interviewer chosen at random is denoted by

$$\sigma_y^2 = \sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2$$

then (13.27) may be written

$$V(\bar{y}) = \frac{1}{n} [\sigma_y^2 + (m - 1)\rho_w\sigma_\epsilon^2] \qquad (13.28)$$

An alternative model which produces the same variance formula is

$$y_{ij\alpha} = \eta_{ij} + g_i + \epsilon_{ij\alpha} \qquad (13.29)$$

In this model $g_i$ is the personal bias of the ith interviewer. Over the
population of interviewers, $E(g_i) = 0$ (i.e. there is no bias common to
all interviewers), and $E(g_i^2)$ is denoted by $\sigma_g^2$. The "random" errors
of measurement, $\epsilon_{ij\alpha}$, are now assumed to be uncorrelated from unit to
unit, although they may be correlated with the $\eta_{ij}$, and to average to
zero over the population.
In repeated sampling, we select each time a random sample of $k$
interviewers from an assumed infinite population of interviewers.
Each interviewer is assigned to a random subsample of a simple random
sample of units.
In order to find $V(\bar{y})$ we write as usual

$$\bar{y} - \bar{H} = (\bar{\eta} - \bar{H}) + \bar{g} + \bar{\epsilon}$$

When the square is averaged over all samples, there is no contribution
from the product $\bar{g}(\bar{\eta} - \bar{H})$, because in repeated sampling a given set
of interviewers appears equally often with any given set of units. For
the same reason, $E(\bar{g}\bar{\epsilon})$ is also zero. Hence

$$V(\bar{y}) = \frac{1}{n} (\sigma_\eta^2 + 2\rho_{\eta\epsilon}\sigma_\eta\sigma_\epsilon + \sigma_\epsilon^2) + \frac{\sigma_g^2}{k} \qquad (13.30)$$

$$V(\bar{y}) = \frac{1}{n} \{\sigma_y^2 + (m - 1)\sigma_g^2\} \qquad (13.31)$$
by substituting $\sigma_y^2$ as found from (13.29). This is the analogue of
equation (13.28).
This equation shows that there is a limit to the precision obtainable
with a fixed number $k$ of interviewers. For given $n$, the precision is
greatest when $k = n$, that is, when each interviewer measures only one
unit. (This conclusion may be unrealistic, since it assumes that
interviewers are drawn at random from a pool of interviewers of equal
quality. In practice, average quality may well be higher when only a
few interviewers are to be recruited than when many must be found.)
With the same assumptions, it is easily verified that an unbiased
estimate of $V(\bar{y})$ from the sample is

$$v(\bar{y}) = \frac{\sum_{i=1}^{k} (\bar{y}_i - \bar{y})^2}{k(k - 1)}$$

This is the most useful property of the method of interpenetrating
subsamples. Note, however, that the number of degrees of freedom on
which the estimate is based is $(k - 1)$: this will be small if only a few
interviewers participate.
The analysis of variance of the sample data (table 13.6) is also of
interest. The expectations shown for the mean squares can be verified
from the model.
TABLE 13.6  EXPECTATIONS IN THE ANALYSIS OF VARIANCE OF THE SAMPLE
(ON A SINGLE-UNIT BASIS)

                  df          ms                                                                      Expectations

Between groups    $(k - 1)$   $s_b^2 = \frac{m \sum_i (\bar{y}_i - \bar{y})^2}{k - 1}$                $(\sigma_y^2 - \sigma_g^2) + m\sigma_g^2$
Within groups     $k(m - 1)$  $s_w^2 = \frac{\sum_i \sum_j (y_{ij} - \bar{y}_i)^2}{k(m - 1)}$         $(\sigma_y^2 - \sigma_g^2)$

The variance within groups $s_w^2$ is an unbiased estimate of the variance
per unit that would be obtained if interviewer biases could be
removed, while $s_b^2$ estimates the actual variance per unit in the survey.
Comparison of the two mean squares reveals the extent of the loss in
precision from the biases.
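The computations for interpenetrating subsamples are simple enough to sketch directly. The function below returns the estimate of $V(\bar{y})$ from the group means together with the two mean squares of the analysis of variance; the data and all names are hypothetical, and equal group sizes are assumed as in the text.

```python
def interpenetrating_anova(groups):
    """Analysis of k interpenetrating groups of equal size m.

    groups: list of k lists, each holding the m measurements made by one
    interviewer (or team). Returns the estimate of V(ybar) from the group
    means and the between- and within-group mean squares.
    """
    k = len(groups)
    m = len(groups[0])
    if k < 2 or m < 2 or any(len(g) != m for g in groups):
        raise ValueError("need k >= 2 groups of equal size m >= 2")

    group_means = [sum(g) / m for g in groups]
    grand_mean = sum(group_means) / k

    ss_between = sum((gm - grand_mean) ** 2 for gm in group_means)
    ss_within = sum((y - gm) ** 2
                    for g, gm in zip(groups, group_means)
                    for y in g)

    return {
        "v_ybar": ss_between / (k * (k - 1)),  # estimate of V(ybar)
        "s_b2": m * ss_between / (k - 1),      # between-groups mean square
        "s_w2": ss_within / (k * (m - 1)),     # within-groups mean square
    }

# Hypothetical data: 3 interviewers, 3 units each.
result = interpenetrating_anova([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(result)
```

Note that the estimate of $V(\bar{y})$ is the between-groups mean square divided by $n = mk$, so a between-groups mean square much larger than the within-groups mean square signals an appreciable loss of precision.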
The formula for $V(\bar{y})$ can also be used to determine the optimum
number of groups for a given field cost situation. A simple type of
cost function that might apply is

$$C = c_1 n + c_2 k$$
If $n$ and $k$ are chosen to minimize $V(\bar{y})$ for fixed $C$, we find that the
optimum size of group (number of interviews per interviewer) is

$$m_{opt} = \sqrt{\frac{c_2 (\sigma_y^2 - \sigma_g^2)}{c_1 \sigma_g^2}}$$

A sample estimate from a preliminary sample is

$$\hat{m}_{opt} = \sqrt{\frac{c_2\, m\, s_w^2}{c_1 (s_b^2 - s_w^2)}}$$

where $m$ is the size of group in the preliminary sample.
Structurally, these equations are the same as those in section 10.6
for determining optimum sampling and subsampling ratios in two-stage
sampling. In fact, the whole mathematical approach is the same
as for two-stage sampling. This analysis is subject to the assumption
already mentioned that we can vary the number of interviewers without
affecting their average quality.
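The optimum group size can be sketched as follows. The formula used is the one obtained by minimizing $V(\bar{y}) = (\sigma_y^2 - \sigma_g^2)/n + \sigma_g^2/k$ subject to a cost $C = c_1 n + c_2 k$, with $\sigma_g^2$ estimated from the mean squares of table 13.6 as $(s_b^2 - s_w^2)/m$. The cost symbols, function names, and numbers are assumptions made for illustration.

```python
import math

def optimum_group_size(c1, c2, var_without_bias, sigma_g2):
    """Optimum number of interviews per interviewer.

    c1: cost per interview; c2: cost per interviewer (assumed meanings);
    var_without_bias: sigma_y^2 - sigma_g^2; sigma_g2: variance of the
    interviewer biases.
    """
    return math.sqrt(c2 * var_without_bias / (c1 * sigma_g2))

def estimated_optimum_group_size(c1, c2, s_b2, s_w2, m_prelim):
    """Estimate from a preliminary sample with groups of size m_prelim."""
    sigma_g2_hat = (s_b2 - s_w2) / m_prelim
    if sigma_g2_hat <= 0:
        raise ValueError("between-groups mean square does not exceed "
                         "within-groups mean square")
    return optimum_group_size(c1, c2, s_w2, sigma_g2_hat)

# Hypothetical figures: an interviewer costs 4 times as much as an interview.
m_known = optimum_group_size(c1=1.0, c2=4.0, var_without_bias=9.0, sigma_g2=1.0)
m_estimated = estimated_optimum_group_size(c1=1.0, c2=4.0,
                                           s_b2=12.0, s_w2=9.0, m_prelim=3)
print(m_known, m_estimated)
```

The optimum group size increases with the relative cost of an interviewer and decreases as the interviewer biases become more variable.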
The method is not applicable if the groups are not random
subsamples. A common practice is to assign each interviewer to a small
geographic area near his home, in order to decrease travel cost per
interviewer. If there are real differences between the averages $\bar{\eta}_i$ for
different areas, these differences appear in the analysis of variance as
if they were interviewer biases. Thus $s_b^2$ becomes an overestimate of
the actual variance per unit, and $s_w^2$ an underestimate of the variance
that would apply if interviewer biases could be removed.
One way of avoiding this difficulty without too great an increase in
field costs is to stratify the sample into compact areas. The sample in
each stratum is then divided into random subgroups. Each interviewer
is required to travel over the whole of a stratum but not over
the whole sample. The analysis presented here then applies to an
individual stratum, and an unbiased estimate of $V(\bar{y}_{st})$ is built up in the
usual way. For other solutions of the problem, see Hansen et al. (1951).
This technique of interpenetrating subsamples is skillful, since it
gives not only an unbiased estimate of error but also an insight into
the magnitude of the effects of correlations in the errors of measure-
ment on different units. Its utility is not confined to the handling of
interviewer biases: the different groups may represent different field
teams who are taking crop samples, or different medical teams who are
making diagnostic examinations, and so on. Limitations of the tech-
nique are the increase in travel costs, which may be considerable in
some surveys, and the fact that the technique does not handle any
bias that is constant over all units.
13.11 Summary. From the point of view of their effects on the for-
mulas given in previous chapters, the additional sources of error may
be classified as follows :
i. Errors of measurement that are independent from unit to unit
and average to zero over the whole population are properly taken into
account in the usual formulas for computing the standard errors of the
estimates, provided that the fpc is negligible. Such errors do, of
course, decrease the precision, and it is worth while to learn something
about their magnitude in order to find out whether the decrease is
serious.
ii. With non-response, the usual formulas for the standard errors, as
computed from the units that were measured, are likely to be under-
estimates since they ignore the bias due to differences between respond-
ents and non-respondents. The sampler has no excuse for not being
aware of this problem: a complete record of all non-response, with
reasons, is an essential part of good practice. If non-response can be
reduced by expenditure of greater effort on a certain segment of the
population, the method of Hansen and Hurwitz (1946) shows how to
allocate resources to this segment.
iii. If errors of measurement are correlated from unit to unit, the
usual formulas for the standard errors are biased. The standard errors
are likely to be too small, since the correlations appear to be mostly
positive in practice. This type of disturbance is harder to detect and
has probably often passed unnoticed. The device of interpenetrating
subsamples gives a measure of the magnitude of the effect as well as
unbiased estimates of the real error variances (a constant bias
excepted).
iv. A constant bias that affects all units alike is hardest of all to
detect. No manipulations of the sample data will reveal this bias.
There is much work to be done on these problems. Perhaps the most
urgent need is for the accumulation of data on the nature and size of
errors of measurement in sample surveys. In many cases it will be
found that the usual formulas and techniques are disturbed to only a
minor degree. In others it will become evident, as has happened in
some types of study, that random sampling errors are the least of our
troubles, and that precise estimates are unattainable until a drastic
reduction in one of the other sources of error is made. More needs to
be learned, also, about what can be accomplished by good training and
supervision and by rechecks of the units by more experienced
personnel.
13.12 References.
BIRNBAUM, Z. W., and SIRKEN, M. G. (1950a). Bias due to non-availability in
sampling surveys. Jour. Amer. Stat. Assoc., 45, 98-111.
BIRNBAUM, Z. W., and SIRKEN, M. G. (1950b). On the total error due to
non-interview and to random sampling. Int. Jour. Opinion and Attitude Res., 4,
179-191.
FINKNER, A. L. (1950). Methods of sampling for estimating commercial peach
production in North Carolina. North Carolina Agr. Exp. Sta. Tech. Bull. 91.
HANSEN, M. H., and HURWITZ, W. N. (1946). The problem of non-response in
sample surveys. Jour. Amer. Stat. Assoc., 41, 517-529.
HANSEN, M. H., et al. (1951). Response errors in surveys. Jour. Amer. Stat.
Assoc., 46, 147-190.
LIENAU, C. C. (1941). Selection, training and performance of the National Health
Survey field staff. Amer. Jour. Hygiene, 34, 110-132.
MAHALANOBIS, P. C. (1946). Recent experiments in statistical sampling in the
Indian Statistical Institute. Jour. Roy. Stat. Soc., 109, 325-370.
POLITZ, A. N., and SIMMONS, W. R. (1949, 1950). An attempt to get the "not at
homes" into the sample without callbacks. Jour. Amer. Stat. Assoc., 44, 9-31,
and 45, 136-137.
ANSWERS TO EXERCISES

t.' r-51,473. Pr about 0.9.


1.6 BE (in 1000's) - (i) 14,800; (ii) 3900; (iii) 3140.
U 9.2. (i) 2.7; (ll) 2.4.

u 1064, 1336.
u Nearly conclusive.
u (i) 76.2 ± 3.6 per cent; (ii) 1738 ± 280 famililll.
a.7 Au - 13.
a.8 Average size of sample - m/P.

U (i) 2475; (li) 4950.


4.2 n = 21 (taking t = 2).
4.3 n = 484. For number of unemployed, cv would be about 15 per cent.
4.4 62 more.
U UNS)H
nopt - ( 2cv'2,;

5.1 (i) n1 = 375, n2 = 625; (ii) n1 = 250, n2 = 750.

5.2 RP = 181 per cent for proportional allocation and 214 per cent for optimum
allocation.
5.3 Best point of division is at 120 acres. RP = 82 per cent.
'IIo2e - r ,
nV(ii •• ) - 1 - (1 _ e- rD)

5.7 (i) Gain in precision is about 110 per cent. (ii) Gain from proportional
stratification over simple random sampling is about 90 per cent.
5.8 (i) 3.733; (ii) 1.111; (iii) 8.222.

6.1 Gain = 66 per cent. At least 11 units by the ratio method.

6.2 Quadratic limits (27,100; 29,870) : normal limits (27,030; 29,7(0).
6.3 Variance is 0.00184 computed by the ratio method and 0.00160 by the
binomial formula.
6.4 The variances are 46.5 for the separate ratio estimate and 40.6 for the
combined ratio estimate. In both cases the contribution of bias to the variance
is negligible.

7.1 V(ȳ_lr) = 1.03, V(ȳ_R) = 10.3 (one of the samples gives a very poor estimate
with the ratio method). Values of B/σ are 0.32 and 0.27, respectively, where
σ is taken about Ȳ.
7.2 r lr = 28,177 ± 570. RP = 113 per cent.
7.3 27,751 ± 694.
7.4 V(rlr.) = 34.5; V(r lrc) = 10.3.
8.1 Variances are 8.19 (systematic), 11.27 (simple random), 8.25 (stratified, 2),
U6 (stratified, 1).
' .1 No: variance is 8.78 with end corrections.
8.1 V", - 0.00141; V,. .. - 0.00340.
9.1 Relative net precisions are 111, 125, and 128, respectively, for the last three
types of unit relative to the first.
9.2 Relative precision of the household is 211 per cent for the sex ratio and 38
per cent for the proportion who had seen a doctor.
9.3 Relative precision of the large unit is 0.566 with simple random sampling
and 0.625 with stratified random sampling.
11.1 (i) If the standard deviation among large units in cl&llll h ex: M". (ii) If
probability ex: ~.
10.1 2.
10.1 Loss in precision is about 8 per cent.
11.1 Contributions to variance from

Method    Within units    Between units    Bias     Total
Ia        0.356           0.250            0.010    0.616
II        0.468           0.640                     1.108
III       0.404           0.240                     0.644
IV        0.386           0.414                     0.800
V         0.350           0.248            0.002    0.600
11.3 Total variance: 0.00482 (Ia), 0.02337 (II), 0.00554 (III).
11.1 Exact variance is

V(JIrI) -
nll'.l-
t
~ 1-1 [«NN -- I'll» (Y; - y)2 + M;(M; - m.) s.t]
m;

The variance formula deduced from theorem 11.2 is the same except that the factor
(N - n)/(N - 1) inside the bracket is replaced by 1.
11.11 To compute v(ȳ_u), use formula (11.29) in theorem 11.4, with z_i = 1/N.
This gives

    v(ȳ_u) = N²Σ / [n(n - 1)M²]

where N = 2823, M = 50,000, and Σ is the sum of squares of deviations of the
quantities M_i ȳ_i which appear in table 11.7. The SE is 2.45 years.
11.. n' > Ian.
12.3 By formula 12.29, SE = 1.25.
12.4 Per cent gains from the second to the sixth occasion are 50, 75, 91, 100,
and 105, respectively.
NOTE. Owing to differences in rounding procedures, readers' answers may
differ slightly from the above in the last significant figure.



Author Index

Armitage, P., 76, 110

Bernert, E. H., 96, 110
Birnbaum, Z. W., 296, 317
Black, C. A., 185, 188, 189, 214
Blythe, R. H., 62, 64
Bose, Chameli, 277, 291
Buckland, W. R., 188

Cameron, J. M., 226, 233
Chung, J. H., 42, 48
Cochran, W. G., 103, 110, 146, 159, 175, 181, 187, 227, 233
Corlett, T., 227, 233, 234, 266, 267
Cornell, F. G., 89, 90, 110
Cornfield, J., 17, 30, 55, 64, 204, 213
Cramér, H., 127, 128, 139

Dalenius, T., 96, 109, 110
Das, A. C., 184, 187
David, F. N., 123, 139
DeLury, D. B., 42, 48, 181, 187
Deming, W. E., 57, 62, 64, 109, 110

Eisenhart, C., 228, 233
Evans, W. D., 76, 82, 83, 110

Feller, W., 22, 30
Fieller, E. C., 121, 139
Finkner, A. L., 135, 139, 197, 213, 293, 317
Finney, D. J., 48, 176, 177, 178, 187
Fisher, R. A., 27, 28, 30, 42, 48, 146, 159, 176, 187

Gauss, C. F., 123
Gray, P. G., 227, 233, 234, 266, 267
Gurney, M., 96, 110, 131, 135, 139

Hagood, M. J., 96, 110
Haldane, J. B. S., 48, 49
Hansen, M. H., 131, 135, 139, 206, 213, 239, 249, 253, 267, 298, 301, 306, 314, 316, 317
Haynes, J. D., 184, 188
Hendricks, W. A., 199, 200, 213
Homeyer, P. G., 185, 188, 189, 214
Horvitz, D. G., 207, 214
Houseman, E. E., 30, 96, 110
Hurwitz, W. N., 131, 135, 139, 206, 213, 239, 249, 253, 267, 298, 301, 316, 317

Jebe, E. H., 219, 233, 239, 250, 267
Jessen, R. J., 30, 63, 84, 96, 110, 124, 139, 194, 199, 214, 266, 267, 284, 291
Johnson, F. A., 56, 64, 176, 181, 188, 189, 214

King, A. J., 96, 103, 110, 219, 233

Lienau, C. C., 306, 317

Mackenzie, W. A., 176, 187
Madow, L. H., 165, 168, 178, 188, 250, 267
Madow, W. G., 22, 30, 129, 139, 165, 168, 188
Mahalanobis, P. C., 110, 199, 214, 215, 306, 312, 317
Marcuse, S., 233
Matérn, B., 181, 188
McCarthy, P. J., 49
McCarty, D. E., 96, 103, 110
McPeek, M., 103, 110
McVay, F. E., 199, 205, 214
Merrington, M., 228, 233
Midzuno, H., 207, 214
Molina, E. C., 24, 30
Monroe, R. J., 197, 213
Morgan, J. J., 197, 213

Neyman, J., 73, 110, 123, 139, 268, 271, 291
Nordin, J. A., 62, 64

Osborne, J. G., 176, 181, 188

Patterson, H. D., 286, 291
Paulson, E., 121, 139
Payne, S. L., 3, 10
Politz, A. N., 303, 304, 317

Quenouille, M. H., 175, 184, 188

Romig, H. G., 37, 49

Satterthwaite, F. E., 73, 110
Sen, A. R., 207, 214
Simmons, W. R., 109, 110, 303, 304, 317
Sirken, M. G., 296, 317
Stein, C., 59, 60, 64
Stephan, F. F., 6, 10, 96, 103, 104, 110
Stevens, W. L., 42
Sukhatme, P. V., 83, 110, 189, 214, 229, 233

Thompson, C. M., 228, 233
Thompson, D. J., 207, 214
Tippett, L. H. C., 62, 64
Tschuprow, A. A., 73, 110
Tukey, J. W., 17, 30

Watson, D. J., 140, 159
West, Q. M., 27, 28, 30
Wishart, J., 17, 30
Wold, H., 176, 188

Yates, F., 42, 48, 62, 64, 141, 159, 173, 176, 177, 181, 182, 188, 266, 267, 286, 291
Subject Index

Acceptance sampling, 62 Bias, in small units, 189


Accuracy vs. precision, 10 in estimates from two-stage sampling,
Aligned systematic sample, 184 236, 240, 242, 243, 244, 247, 257
Allocation of sample sizes in stratified Binomial distribution, 36
sampling, see Optimum allocation confidence limits for, 39
in stratified sampling tables of (references), 37
Analysis of variance in estimation of proportions, 36
in estimation of gain from stratifica- as approximation to hypergeometric
tion, 101 distribution, 42, 43
in systematic sampling, 166 erroneous use in cluster sampling, 34,
in comparing different types of unit, 124
193, 195, 197
in two-stage sampling, 219, 221 Centrally located systematic samples,
in three-stage sampling, 229 160
in interpenetrating subsamples, 314 Change, estimates of, 283
Analytical studies by ratio method, 112
definition, 106 Characteristics, 12
allocation of sample sizes to strata, Cluster sampling (single-stage), 162
107 variance, 202
Arbitrary probabilities, see Selection effect of size of cluster, 198
with arbitrary probabilities optimum size of cluster, 189, 192, 195
Attributes, sampling for, see Propor- importance of heterogeneity within
tions, estimation of cluster, 163, 203, 204
Autocorrelated populations, 174 compared with simple random sam-
Auxiliary variate, 111, 140, 268 pling, 203, 204
for proportions, 124, 203
Best linear unbiased estimate, 123 use of pps sampling, 206
Bias See also Sampling unit, optimum
definition, 7 Cluster sampling (with subsampling),
effects on errors of estimation, 8, 306 see Subsampling
effects on averages over repeated Coefficient of variation (cv), 35
samples, 10, 307 Collapsed strata, 106, 183
effects on estimates of change, 307 Combined ratio estimate, 131
permissible amount of, 10 variance, 131
deliberate use of biased procedures, 7, estimated variance, 132
10 illustration of precision, 133
due to errors in stratum weights, 102 comparison with separate ratio esti-
of ratio estimates, 117, 130 mate, 132
of regression estimates, 147 optimum allocation with, 135
of non-response, 294 Combined regression estimate, 153
of interviewer, 305, 312 variance, 153, 155

Combined regression estimate, estimated Cost functions, with interpenetrating


variance, 157 subsamples, 314
bias, 156 Covariance of sample means, 17
comparison with separate regression Cumulant, <4, 28
estimate, 157
Complete census compared with sample Degrees of freedom, 20, 57
survey, 2 effective number in stratified sam-
Compromise allocation of sample sizes pling, 73
to strata, 85 Domains of study, 107
Conditional distribution Double sampling, 103
of proportions, 45 description, 268
worked example for proportions, 46 application to the problem of non-
of regression estimate, 145, 276 response, 298
Confidence limits Double sampling with ratio estimates,
in simple random sampling, 20 281
for proportions or percentages, 39, 44 variance, 281
in stratified random sampling, 73 estimated variance, 282
for ratio estimates, 120 Double sampling with regression esti-
for optimum size of subsample, 228 mates, 275
validity of normal approximation, variance, 278
22, 41 estimated variance, 280
effect of bias on, 8 comparison with simple random sam-
effect of non-response on, 294 pling, 278
conditional, 45 optimum sample sizes, 278
Consistency Double sampling with stratification, 269
definition, 13 variance, 270
of mean of simple random sample, 13 estimated variance, 273
of ratio estimate, 114 comparison with simple random sam-
of regression estimate, 141 pling, 272
Correction for continuity, 40 optimum sample sizes, 271
Correlation coefficient
in finite populations, 116, 143 E, average over all possible samples,
intra-cluster, 164, 202 14
within a systematic sample, 164, 165 Elements, 34, 125
Correlogram, 174-176 End corrections, 172
Cost functions Error, limits of, 51, 53, 55
in determining sample size, 61 Errors in surveys, types of, 292
in stratified random sampling, 75 Errors of measurement, 304
in analytical studies, 107 mathematical model for, 305
in determining optimum size of unit, effects of constant bias, 306
200 effects of errors that are independent
in determining optimum subsampling from unit to unit, 308
fraction, 225 effects of correlation between errors
in determining optimum probabilities on different units, 311
of selecting primary units, 253 use of interpenetrating subsamples,
in determining optimum sampling 312
fraction for non-respondents, 298 summary of effects, 316
in double sampling for stratification, Estimates of population variances
272 for determining sample size, 56
in double sampling for regression esti- for allocation in stratified sampling,
mates, 278 81, 82

Estimates of population variances, for Mail surveys, 293, 301


comparing different types of unit, Matching (in repeated sampling of same
1911 , population), 282, 286, 290
for determining sUbsampling fraction Maximum likelihood estimates, III
in two-stage sampling, 226 Mean
Expansion factor, 13 of sample (ȳ), 12
Eye estimates, 141 of population (Ȳ), 12
Measures of size, 205
Field costs, effect on optimum size of Multinomial distribution, 43, 207
unit, 202 Multistage sampling, 229
Finite population correction (fpc), 17 See also Subsampling
rule for ignoring, 17
effect on size of sample for specified Non-normality
limits of error, 54, 56, 88, 120 frequently encountered in sampling
in stratified random sampling, 69 practice, 23
for ratio estimates, 115 effect on confidence limits, 22, 41
for regression estimates, 142 effect on sample variance, 27
in two-stage sampling, 222 effect of stratification on, 27
Frame, 4 Non-response, 4, 292
illustration of, 293
Geographic stratification, 96, 134
bias due to, 294
Grid sample in two dimensions, 183
effect on confidence limits, 294
Hypergeometric distribution, 37, 44 estimation of sample size with, 296
confidence limits for, 39 optimum sampling fraction among
charts of (reference), 42 non-respondents, 298
worked example for, 38 "substitution" method, 302
Politz and Simmons' method, 303
Incomplete stratification, 225 Normal distribution, 5, 20
Inflation factor, 13 validity of use for sample estimates,
Interpenetrating subsamples, 312 22
variance, 313 as approximation to binomial distri-
estimated variance, 314 bution, 41
optimum size of subsample, 315 Notation
Interviewer bias, 305 for simple random sampling, 12
Intra-cluster correlation, 164, 202 for variances of estimates, 19
Item, definition, 12 for proportions, 31, 43
for stratified sampling, 66
Kurtosis, 28 for ratio estimates, 112
effect on sample variance, 28 for two-stage sampling, 215, 235
for errors of measurement, 305
Latin squares, use in systematic sam-
pling, 184 Optimum allocation in stratified sam-
Least squares estimates, 123, 140, 145, pling
153 fixed sample size, 73
Limits of error (tolerable), 52 fixed total cost, 75
Linear regression estimate, see Regres- determination from previous data, 81
sion estimate comparison with proportional alloca-
Listing of primary units, 241, 253 tion, 76, 80, 84
effect of listing cost on optimum prob- comparison with simple random sam-
ability of selection, 256 pling, 76
Loss due to errors of estimation, 61 with more than one item, 84
Optimum allocation in stratified sam- Probability proportional to size (pps),
pling, requiring more than 100% in stratified two-stage sampling,
sampling, 86 263, 265
in sampling for proportions, 91 Probability sampling, definition and
in analytical studies, 107 properties, 6
with ratio estimates, 135 Proportional allocation in stratified
with double sampling, 271 sampling, 67
effect of deviations from optimum, self-weighting sample obtained, 67
78 rule for use of, 81
effect of errors in strata variances, variance, 69
82 in sampling for proportions, 92
Optimum per cent matched comparison with simple random sam-
in sampling on two occasions, 286 pling, 76, 78
in sampling on more than two occa- comparison with optimum allocation,
sions, 290 76, 80, 84, 92
Optimum size of subsample comparison with stratification after
primary units of equal size, 225 selection, 104
primary units of unequal size, 253 estimation of gain in precision from,
Overall eampling fraction, 240, 246 100
Proportions, estimation of, 31
Percentages, estimation of, see Propor- in simple random sampling, 31
tions, estimation of more than two classes, 4~5
Periodic variation, effect on systematic in stratified random sampling, 90
sampling, 174, 179 in cluster sampling, 34, 124, 203
Poisson distribution, 24, 57 in two-stage sampling (units of equal
Politz and Simmons' method for han- size), 228
dling non-response, 303 in two-stage sampling (units of un-
Population equal sizes), 248
definition, 3 in double sampling, 271
with linear trend, 170 size of sample for, 50, 53
with periodic variation, 17. effect of population P on precision,
autocorrelated, 174 35
two-dimensional, 183 effect of non-response, 294
Precision Purposive selection, 7
vs. accuracy, 10
specification of, 52 Quadratic confidence limits for ratio
relative, 76, 79 estimate, 121
Primary sampling units (primary units), Qualitative characteristics, see Propor-
215 tions, estimation of
Probability proportional to estimated Quota sampling, 105
size
in single-stage sampling, 206 Raising factor, 13
in two-stage sampling, 239 "Random point" method of selection,
Probability proportional to size (pps), 197
206 Random sampling, see Simple random
method of drawing sample, 206 sampling
in single-stage sampling, 206-212 Rare items, sampling for, 36, 48
comparison with ratio estimate, 210 Ratio estimate, 111
in stratified single-stage sampling, 212 variance, 114-117
in two-stage sampling, 237, 243, 246, estimated variance, 118
247, 248, 256, 262 bias, 117

Ratio estimate, conditions under which Repeated sampling of the same popula-
bias is negligible, 118 tion, 282
consistency, 114 types of estimate wanted, 283
confidence limits, 120 replacement policy, 283, 286, 290
optimum conditions for, 123 sampling on two occasions, 284
in estimating proportions, 124 sampling on more than two occasions,
in stratified random sampling, 129 286
optimum allocation for, 135 Replacement of sample, see Repeated
in cluster sampling, 124, 203 sampling of the same population
in two-stage sampling, 248
sample size with, 120 Sampling fraction, 13
comparison with mean per unit, 122 overall sampling fraction, 240, 246
comparison with regression estimate, Sampling on more than two occasions,
148 286
as special case of regression estimate, estimate of current population mean,
149 287
comparison with stratification, 134 optimum per cent matched, 290
comparison with pps sampling, 210 Sampling on two occasions, 284
limiting distribution, 127 estimate of current population mean,
effect of measurement bias on, 307 284
See also Combined ratio estimate and optimum per cent matched, 285
Separate ratio estimate Sampling ratio, 13
Regression coefficient Sampling unit (unit)
least squares, 140 definition, 3
in finite populations, 142 optimum measure of size of, 205
variance, 145 Sampling unit, optimum, 189
combined from different strata, 153 method for determining, 189-195
Regression estimate, 140 worked examples of method, 189, 194,
uses, 140 197
large-sample variance, 142 use of survey data in determining, 196
estimated variance, 144 use of variance functions, 198
least squares theory, 144 effect of field costs on, 202
bias, 147 for proportions, 203
with arbitrary value of b, 149 Sampling with replacement, 12, 206,
with inefficient estimate of b, 149 245, 250, 269, 262
comparison with ratio estimate, 148 Sampling without replacement, 12
comparison with mean per unit, 148 Selection with arbitrary probabilities
in stratified random sampling, 150- in single-stage sampling, '1I.YT
158 in two-stage sampling, 239
in double sampling, 275 optimum probabilities of selection of
in repeated sampling of the same primary units, 263
population, 284-290 Self-weighting sample, 67
effect of measurement bias on, 307 Separate ratio estimate, 129
See also Combined regression estimate variance, 129
and Separate regression estimate liability to bias, 130
Relative net precision, 192 comparison with combined ratio esti-
Relative precision (RP) mate, 132
of stratified random and simple ran- estimated variance, 132
dom sampling, 76 illustration of precision, 133, 136
of optimum and general allocation, optimum allocation for, 136
79 Separate regression estimate, 150
Separate regression estimate, variance, Skewed population, experimental sam-
151, 152 ples from, 22, 26
liability to bias, 153 Skewness
comparison with combined regression coefficient of, 25
estimate, 157 effect on confidence limits, 22
estimated variance, 157 Square grid sample, 183
Simple expansion, 122 Standard error,
comparison with ratio estimate, 122 of mean of simple random sample, 16,
Simple random sampling 19, 55
definition, 11 of estimated population total from
method of drawing, 11 simple random sample, 17, 19
unbiased sample mean, 14 of sample standard deviation, 27
variance of sample mean, 15 of sample proportion, 32, 33, 34, 40,
estimated variance of sample mean, 45, 64
19 of total in population poseessing some
confidence limits for sample mean, attribute, 33
20 of weighted mean in stratified sam-
variance of sample proportion, 32, 35 pling, 69, 71, 72, 105
unbiased sample proportion, 32 of proportion estimated from strati-
estimated variance of sample propor- fied sample, 91
tion, 33 of ratio estimate, 115-119, 125-1'1:1
distribution of sample proportion, 36, of ratio estimate in stratified sam-
37 pling, 129, 131-134
confidence limits for sample propor- of regression estimate, 14~146, 150
tion, 39 of mean of systematic sample, 1~
for classification into more than two 167,171H82
classes, 43 of mean per element in two-stage
sample size needed, 50, 53, 55, 57 sampling, 217, 222, 224, 236-243,
precision compared with stratified 249-253, 259-265
random sampling, 76 in sampling with probability propor-
Size of sample for specified limits of tional to size, 20&-210
error in double sampling, '1:10, '1:18, '1:18, 280,
analysis of problem, 51 281, 282
with proportions, 5(}-64 with interpenetrating subsamples, 314
with continuous data, 56 (NOTE: in some cases the formula
with more than one item, 57 given is for the variance.)
in stratified random sampling, 87, 93 Stein's method of two-stage sampling,
with ratio estimates, 120 59
worked examples, 50, 55, 56, 60, 89 Steps in a sample survey, 2
Stein's method of two-stage sampling, Strata
59 definition, 65
by minimizing cost plus loss due to construction, 98
errors, 61 optimum number, ~
effect of non-response on, 296 effect of subdivision on precision, 93
Size of sample needed optimum boundaries between, 96
for normal approximation to confi- Stratification, 65
dence limits for continuous data, '1:1 reasons for, 65
for normal approximation to confi- best variable for, 93
dence limits of proportions, 41 geographic, 96, 134
for estimating optimum subsampling after selection of sample, 104
fractions, 226 with double sampling, 268
Stratification, effect on normality of Subsampling (units of unequal size),
variate, 27 units chosen with probability pro-
See also Strata portional to estimated size, 239,
Stratified random sampling, 65 243, 246, 247
estimate, ȳ_st, 66 estimation of proportions, 248
variance of ȳ_st, 69 general formulas for variances, 249
estimated variance of ȳ_st, 72 general formulas for estimated vari-
confidence limits for continuous data, ances, 259, 262
73 optimum probabilities of selection,
optimum allocation, 73 253
optimum allocation with varying costs, advantages of ratio estimates, 248,
75 250
size of sample, 87 comparison of biased and unbiased
for proportions, 90 estimates, 257
construction of strata, 93 in stratified sampling, 262
estimation of gain in precision from planning of sample, 266
stratification, 97-102 Subsampling of non-respondents, 298
with ratio estimates, 129 "Substitution" method for non-response,
with regression estimates, 150 302
in analytical studies, 106 Super-population, 169
with one unit per stratum, 105 Systematic sampling, 160
comparison with simple random sam- advantages, 160
pling, 76 variance, 162
comparison with systematic sampling, estimation of the variance, 179
167 worked example for, 165
effects of errors in stratum weights, recommendations about use, 185
102, 268, 271 relation to cluster sampling, 162
deliberate omission of a stratum, 294 comparison with simple random sam-
Stratified sampling, 65 pling, 163, 167, 168, 170
estimate, 66 comparison with stratified sampling,
variance of estimate, 68 165, 167, 170, 176
Stratum weight, Wh, 69 end corrections, 172
Subsampling (units of equal size), 215 in populations in "random" order,
advantage, 215 168, 180
notation, 215 in populations with linear trend, 170,
approximate variance of mean, 217 180
exact variance of mean, 220 effect of periodic variation, 174, 179
estimated variances, 218, 223 in autocorrelated populations, 174
prediction of variance for other sub- in natural populations, 176
sampling fractions, 218 in two dimensions, 183
optimum subsampling fractions, 225 in subsampling, 225
for proportions, 228 stratified systematic sampling, 182
in stratified sampling, 231
three-stage sampling, 229 t-distribution, 20, 27, 57, 73
when subsample is systematic, 225 Theory of sampling, function in sample
Subsampling (units of unequal size), 234 surveys, 5
notation, 235 Three-stage sampling, 229
units chosen with equal probabilities, Total in population, estimation
235, 236, 243, 244, 245, 247 by simple expansion, 13
units chosen with probability propor- by ratio estimate, 112
tional to size, 237, 243, 245, 247 in stratified random sampling, 70
Total in population, estimation, for at- Unbiased procedure or estimate, defini-
tributes, 31 tion, 7, 14
Two-dimensional population, 183 Unit (sampling unit), definition, 3
square grid sample, 183 See also Sampling unit
unaligned systematic sample, 183 Unrestricted random sampling, 11
simple random sample, 1M See also Simple random sampling
use of latin square principle, 184
Two-phase sampling, see Double sam- Variance, definition of S² and σ², 15
pling Variances of sample estimates, see Stand-
Two-stage sampling, definition, 216
ard error
See also Subsampling
Variance within units, as function of
Unaligned systematic sample, 183 size of unit, 198