
Chapter 1 Elements of the sampling problem


1.1 Introduction

Often we are interested in some characteristic of a finite population, e.g. the average income of last year's graduates from HKUST, or the unemployment rate of last quarter in HK. Since the population is usually very large, we would like to say something about it (i.e. make inference) by collecting and analyzing only a part of that population. The principles and methods of collecting and analyzing data from a finite population form a branch of statistics known as Sample Survey Methods; the theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.

1.2 Some technical terms


An element is an object on which a measurement is taken; e.g. in a survey of HKUST graduates, each graduate is an element.

A population is a collection of elements about which we require information.

A population characteristic is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from HKUST, or the total wheat yield of all farmers in a certain country.

Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or a coarser subdivision of the population, e.g. a household, which may contain more than one individual member, or some other group.

A frame is a list of sampling units, e.g. a telephone directory or a list of student IDs.

A sample is a collection of sampling units drawn from a frame or frames, e.g. a sample of 100 students drawn from a population of 2000.

1.3 Why sampling?

If the sample is equal to the population, we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:

- cost (money is limited);
- time (time is limited);
- destructiveness (testing a product can be destructive, e.g. light bulbs);
- accessibility (non-response can be a serious issue).

In those cases, sampling is the only alternative.

1.4 How to select the sample: the design of the sample survey

The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples that are representative of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other (non-probability) sampling schemes.

1. Probability sampling

This is a sampling scheme whereby the possible samples are enumerated and each has a known, non-zero probability of being selected. Because the probability structure is built into the design, we can attach error bounds to our estimates and make statements such as "our estimate is unbiased" and "we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.

2. Some other sampling schemes

Under these schemes the random structure is unknown, so no error bound can be attached to the estimates:

a) Volunteer sampling: e.g. TV telephone polls, or medical volunteers for research (volunteers for a drug trial may be far from representative of the control population).

b) Subjective sampling: we choose samples that we consider to be typical or representative of the population.

c) Quota sampling: one keeps sampling until a certain quota is filled.

All these sampling procedures provide some information about the population, but it is hard to deduce the nature of the population from such studies, as the samples are very subjective and often very biased. Furthermore, it is hard to measure the precision of the resulting estimates.

1.5 How to design a questionnaire and plan a survey

This can be the most important, and perhaps the most difficult, part of the survey sampling problem. We shall come back to this point in more detail later.

1.6 Some useful websites

Many government statistical organizations and other collectors of survey data now have Web sites where they provide information on the survey design. Here are a few examples. Note that these sites are subject to change, but you should be able to find the organization through a search.

Federal Interagency Council of Statistical Policy: www.fedstats.gov
U.S. Bureau of the Census: www.census.gov
Statistics Canada: www.statcan.ca
Statistics Norway: www.ssb.no
Statistics Sweden: www.scb.se
UK Office for National Statistics: www.ons.gov.uk
Australian Bureau of Statistics: www.statistics.gov.au
Statistics New Zealand: www.stats.govt.nz
Statistics Netherlands: www.cbs.nl
Gallup Organization: www.gallup.com
Nielsen Media Research: www.nielsenmedia.com
National Opinion Research Center: www.norc.uchicago.edu
Inter-University Consortium for Political and Social Research: www.icpsr.umich.edu

Chapter 2 Simple random sampling


Simple random sampling is the simplest sampling procedure, and is the building block for other, more complicated sampling schemes to be introduced in later chapters.

Definition: If a sample of size $n$ is drawn from a population of size $N$ in such a way that every possible sample of size $n$ has the same probability of being selected, the sampling procedure is called simple random sampling (s.r.s. for short). The resulting sample is called a simple random sample.

2.1 How to draw a simple random sample


Suppose that the population of size $N$ has values $\{u_1, u_2, \ldots, u_N\}$. There are $\binom{N}{n}$ possible samples of size $n$. If we assign probability $1/\binom{N}{n}$ to each of the different samples, then each sample thus obtained is a simple random sample. Denote such a s.r.s. as $(y_1, y_2, \ldots, y_n)$.

Remark: In other statistics courses, we use upper-case letters like $X$, $Y$ etc. to denote random variables and lower-case letters like $x$, $y$ etc. to represent fixed values. However, in survey sampling, by convention, we use lower-case letters like $y_1$, $y_2$ etc. to denote random variables. We have the following result.

Theorem 2.1.1 For simple random sampling, we have
$$P(y_1 = u_{i_1}, y_2 = u_{i_2}, \ldots, y_n = u_{i_n}) = \frac{(N-n)!}{N!},$$
where $i_1, i_2, \ldots, i_n$ are mutually different.

Proof. By the definition of s.r.s., the probability of obtaining the sample $\{u_{i_1}, u_{i_2}, \ldots, u_{i_n}\}$ (where the order is not important) is $1/\binom{N}{n}$. There are $n!$ ways of ordering $\{u_{i_1}, u_{i_2}, \ldots, u_{i_n}\}$. Therefore,
$$P(y_1 = u_{i_1}, \ldots, y_n = u_{i_n}) = \frac{1}{\binom{N}{n}\, n!} = \frac{(N-n)!\, n!}{N!\, n!} = \frac{(N-n)!}{N!}.$$

Recall that the total number of all possible samples is $\binom{N}{n}$, which could be very large if $N$ and $n$ are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw $n$ values at random without replacement from the $N$ population values. That is, we first draw one value at random from the $N$ population values, then draw another value at random from the remaining $N-1$ population values, and so on, until we get a sample of $n$ (different) values.

Theorem 2.1.2 A sample obtained by drawing $n$ values successively without replacement from the $N$ population values is a simple random sample.

Proof. Suppose that our sample obtained by drawing $n$ values without replacement from the $N$ population values is $\{a_1, a_2, \ldots, a_n\}$, where the order is not important. Let $\{a_{i_1}, a_{i_2}, \ldots, a_{i_n}\}$ be any permutation of $\{a_1, a_2, \ldots, a_n\}$. Since the sample is drawn without replacement, we have
$$P(y_1 = a_{i_1}, \ldots, y_n = a_{i_n}) = \frac{1}{N}\cdot\frac{1}{N-1}\cdots\frac{1}{N-n+1} = \frac{(N-n)!}{N!}.$$
Hence, the probability of obtaining the sample $\{a_1, \ldots, a_n\}$ (where the order is not important) is
$$\sum_{\text{all } (i_1, \ldots, i_n)} P(y_1 = a_{i_1}, \ldots, y_n = a_{i_n}) = n!\,\frac{(N-n)!}{N!} = \frac{1}{\binom{N}{n}}.$$
The theorem is thus proved by the definition of simple random sampling.

Two special cases will be used later: $n = 1$ and $n = 2$.

Theorem 2.1.3 For any $i, j = 1, \ldots, n$ and $s, t = 1, \ldots, N$,
(i) $P(y_i = u_s) = \dfrac{1}{N}$;
(ii) $P(y_i = u_s, y_j = u_t) = \dfrac{1}{N(N-1)}$, for $i \neq j$, $s \neq t$.

Proof. (i) Summing Theorem 2.1.1 over all ordered samples whose $i$th coordinate equals $u_s$, and noting that there are $(n-1)!\binom{N-1}{n-1} = \frac{(N-1)!}{(N-n)!}$ such samples,
$$P(y_i = u_s) = \frac{(N-1)!}{(N-n)!}\cdot\frac{(N-n)!}{N!} = \frac{1}{N}.$$
(ii) Similarly, there are $(n-2)!\binom{N-2}{n-2} = \frac{(N-2)!}{(N-n)!}$ ordered samples whose $i$th and $j$th coordinates equal $u_s$ and $u_t$ respectively, so
$$P(y_i = u_s, y_j = u_t) = \frac{(N-2)!}{(N-n)!}\cdot\frac{(N-n)!}{N!} = \frac{1}{N(N-1)}.$$

Example 1. A population contains $\{a, b, c, d\}$. We wish to draw a s.r.s. of size 2. List all possible samples and find the probability of drawing $\{b, d\}$.

Solution. Possible samples of size 2 are $\{a,b\}, \{a,c\}, \{a,d\}, \{b,c\}, \{b,d\}, \{c,d\}$. The probability of drawing $\{b, d\}$ is $1/6$.
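To see Theorems 2.1.1 and 2.1.2 in action, here is a short Python sketch (an illustration we add, not part of the original notes; the population and sizes are those of Example 1). It draws many samples successively without replacement and checks that each unordered sample appears with relative frequency close to $1/\binom{N}{n} = 1/6$.

```python
# Illustrative sketch: sequential draws without replacement yield a s.r.s.
import random
from collections import Counter
from math import comb

population = ["a", "b", "c", "d"]        # the Example 1 population, N = 4
N, n = len(population), 2
trials = 100_000

counts = Counter()
for _ in range(trials):
    sample = random.sample(population, n)    # draw n values without replacement
    counts[frozenset(sample)] += 1           # order is not important

# Each of the C(4, 2) = 6 unordered samples should appear with
# relative frequency close to 1 / C(N, n) = 1/6 (Theorem 2.1.2).
for s in sorted(counts, key=sorted):
    print(sorted(s), round(counts[s] / trials, 3), "vs", round(1 / comb(N, n), 3))
```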

2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

For a population of size $N$: $\{u_1, u_2, \ldots, u_N\}$, we are interested in the population mean
$$\mu = \frac{u_1 + u_2 + \cdots + u_N}{N} = \frac{1}{N}\sum_{i=1}^N u_i,$$
and the population variance
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^N (u_i - \mu)^2.$$

Given a s.r.s. of size $n$: $\{y_1, y_2, \ldots, y_n\}$, an obvious estimator for $\mu$ is the sample mean:
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i.$$

Theorem 2.2.1 (i) $E(y_i) = \mu$, $\operatorname{Var}(y_i) = \sigma^2$.
(ii) $\operatorname{Cov}(y_i, y_j) = -\dfrac{\sigma^2}{N-1}$, for $i \neq j$.

Proof. (i) By Theorem 2.1.3,
$$E(y_i) = \sum_{k=1}^N u_k P(y_i = u_k) = \sum_{k=1}^N u_k \frac{1}{N} = \mu,$$
$$\operatorname{Var}(y_i) = \sum_{k=1}^N (u_k - \mu)^2 P(y_i = u_k) = \sum_{k=1}^N (u_k - \mu)^2 \frac{1}{N} = \sigma^2.$$

(ii) By definition, $\operatorname{Cov}(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) = E(y_i y_j) - \mu^2$. Now,
$$E(y_i y_j) = \sum_{s \neq t} u_s u_t P(y_i = u_s, y_j = u_t) = \frac{1}{N(N-1)}\sum_{s \neq t} u_s u_t = \frac{1}{N(N-1)}\left[\left(\sum_{s=1}^N u_s\right)\left(\sum_{t=1}^N u_t\right) - \sum_{s=1}^N u_s^2\right]$$
$$= \frac{1}{N(N-1)}\left[(N\mu)^2 - \left(\sum_{s=1}^N (u_s - \mu)^2 + N\mu^2\right)\right] = \frac{N^2\mu^2 - N\sigma^2 - N\mu^2}{N(N-1)} = \mu^2 - \frac{\sigma^2}{N-1}.$$
Thus, $\operatorname{Cov}(y_i, y_j) = E(y_i y_j) - \mu^2 = -\dfrac{\sigma^2}{N-1}$.

Theorem 2.2.2
$$E(\bar{y}) = \mu, \qquad \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

Proof. Note $\bar{y} = \frac{1}{n}(y_1 + \cdots + y_n)$, so $E(\bar{y}) = \frac{1}{n}(Ey_1 + \cdots + Ey_n) = \frac{1}{n}(n\mu) = \mu$. Now
$$\operatorname{Var}(\bar{y}) = \frac{1}{n^2}\operatorname{Cov}\left(\sum_{i=1}^n y_i, \sum_{j=1}^n y_j\right) = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n \operatorname{Cov}(y_i, y_j) = \frac{1}{n^2}\left(\sum_{i \neq j}\operatorname{Cov}(y_i, y_j) + \sum_{i=1}^n \operatorname{Var}(y_i)\right)$$
$$= \frac{1}{n^2}\left(n(n-1)\left(-\frac{\sigma^2}{N-1}\right) + n\sigma^2\right) = \frac{\sigma^2}{n}\left(1 - \frac{n-1}{N-1}\right) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

Remark: From Theorem 2.2.2, $\bar{y}$ is an unbiased estimator for $\mu$. Also, as $n$ gets large (but $n \leq N$), $\operatorname{Var}(\bar{y})$ tends to 0; thus $\bar{y}$ becomes more accurate for $\mu$ as $n$ gets larger. In particular, when $n = N$, we have a census and $\operatorname{Var}(\bar{y}) = 0$.

Remark: In previous statistics courses, the sample $(y_1, y_2, \ldots, y_n)$ is usually independent and identically distributed (i.i.d.), namely drawn from the population with replacement. As a result,
$$E_{iid}(\bar{y}) = \mu, \qquad \operatorname{Var}_{iid}(\bar{y}) = \frac{\sigma^2}{n}.$$
Notice that $\operatorname{Var}_{iid}(\bar{y})$ is different from $\operatorname{Var}(\bar{y})$ in Theorem 2.2.2. In fact, for $n > 1$,
$$\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} < \frac{\sigma^2}{n} = \operatorname{Var}_{iid}(\bar{y}).$$
Thus, for the same sample size $n$, sampling without replacement produces a less variable estimator of $\mu$. Why?
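The inequality above can be confirmed numerically. The following simulation is a sketch we add for illustration (the toy population and sizes are arbitrary choices, not from the notes): it estimates $\operatorname{Var}(\bar{y})$ under both schemes and compares with the two formulas.

```python
# Illustrative simulation: Var(ybar) with vs. without replacement.
import random
import statistics

population = list(range(1, 21))                         # toy population, N = 20
N, n = len(population), 5
mu = statistics.fmean(population)
sigma2 = sum((u - mu) ** 2 for u in population) / N     # population variance

trials = 200_000
means_srs = [statistics.fmean(random.sample(population, n)) for _ in range(trials)]
means_iid = [statistics.fmean(random.choices(population, k=n)) for _ in range(trials)]

print("without replacement: theory", sigma2 / n * (N - n) / (N - 1),
      "simulated", round(statistics.pvariance(means_srs), 4))
print("with replacement:    theory", sigma2 / n,
      "simulated", round(statistics.pvariance(means_iid), 4))
```

The negative covariance between draws in Theorem 2.2.1(ii) is what drives the smaller variance without replacement.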

2.2.2 Estimation of $\sigma^2$ and $\operatorname{Var}(\bar{y})$

The population variance $\sigma^2$ is usually unknown. Now define
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar{y}^2\right).$$

Theorem 2.2.3
$$E(s^2) = \frac{N}{N-1}\sigma^2 \quad \text{(i.e., } s^2 \text{ is biased for } \sigma^2\text{)}.$$

Proof.
$$E(s^2) = \frac{1}{n-1}\left(\sum_{i=1}^n E y_i^2 - nE(\bar{y})^2\right) = \frac{1}{n-1}\left(\sum_{i=1}^n \left[\operatorname{Var}(y_i) + (Ey_i)^2\right] - n\left[\operatorname{Var}(\bar{y}) + (E\bar{y})^2\right]\right)$$
$$= \frac{1}{n-1}\left[n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} + \mu^2\right)\right] = \frac{\sigma^2}{n-1}\cdot\frac{n(N-1) - (N-n)}{N-1} = \frac{\sigma^2}{n-1}\cdot\frac{N(n-1)}{N-1} = \frac{N}{N-1}\sigma^2.$$

The bias in $s^2$ can easily be corrected. The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 $\dfrac{N-1}{N}s^2$ is an unbiased estimator of $\sigma^2$, i.e., $E\left(\dfrac{N-1}{N}s^2\right) = \sigma^2$.

We shall define
$$f = \frac{n}{N} \quad \text{(the sampling fraction)}, \qquad 1 - f = 1 - \frac{n}{N} \quad \text{(the finite population correction, abbreviated fpc)}.$$

Then we have the following theorem.

Theorem 2.2.5 An unbiased estimator for $\operatorname{Var}(\bar{y})$ is
$$\widehat{\operatorname{Var}}(\bar{y}) = \frac{s^2(1-f)}{n}.$$

Proof.
$$E\,\widehat{\operatorname{Var}}(\bar{y}) = (1-f)\frac{Es^2}{n} = \frac{N-n}{N}\cdot\frac{1}{n}\cdot\frac{N}{N-1}\sigma^2 = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} = \operatorname{Var}(\bar{y}).$$
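Theorems 2.2.3 and 2.2.5 can likewise be checked by averaging over many samples. The sketch below is ours, for illustration only (the toy population is arbitrary); note that Python's `statistics.variance` divides by $n-1$, matching the definition of $s^2$.

```python
# Illustrative check: E(s^2) = N sigma^2/(N-1) and E(s^2 (1-f)/n) = Var(ybar).
import random
import statistics

population = [2.0, 5.0, 7.0, 8.0, 12.0, 13.0, 17.0, 21.0]   # toy, N = 8
N, n = len(population), 3
mu = statistics.fmean(population)
sigma2 = sum((u - mu) ** 2 for u in population) / N
f = n / N

trials = 200_000
s2_vals = [statistics.variance(random.sample(population, n)) for _ in range(trials)]

print("E(s^2)  ~", round(statistics.fmean(s2_vals), 3),
      " theory:", round(N * sigma2 / (N - 1), 3))
print("E(Vhat) ~", round(statistics.fmean(s2_vals) * (1 - f) / n, 4),
      " theory Var(ybar):", round(sigma2 / n * (N - n) / (N - 1), 4))
```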

Confidence intervals for $\mu$

It can be shown that the sample average $\bar{y}$ under simple random sampling is approximately normally distributed provided $n$ is large ($\geq 30$, say) and $f = n/N$ is not too close to 0 or 1.

Central limit theorem: If $n \to \infty$, $N \to \infty$ such that $n/N \to \pi \in (0, 1)$, then
$$\frac{\bar{y} - \mu}{\sqrt{\operatorname{Var}(\bar{y})}} \sim N(0, 1) \quad \text{approximately}.$$

If $\operatorname{Var}(\bar{y})$ is replaced by its estimator $\widehat{\operatorname{Var}}(\bar{y})$, we still have
$$\frac{\bar{y} - \mu}{\sqrt{\widehat{\operatorname{Var}}(\bar{y})}} \sim N(0, 1) \quad \text{approximately, as } n, N \to \infty,\ n/N \to \pi > 0.$$

Thus,
$$1 - \alpha \approx P\left(\frac{|\bar{y} - \mu|}{\sqrt{\widehat{\operatorname{Var}}(\bar{y})}} \leq z_{\alpha/2}\right) = P\left(\bar{y} - z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} \leq \mu \leq \bar{y} + z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})}\right).$$

Therefore, an approximate $(1-\alpha)$ confidence interval for $\mu$ is
$$\bar{y} \pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} = \bar{y} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}\sqrt{1-f}.$$

$B := z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})}$ is called the bound on the error of estimation.

Example. A s.r.s. of size $n = 200$ is taken from a population of size $N = 1000$, resulting in $\bar{y} = 94$ and $s^2 = 400$. Find a 95% C.I. for $\mu$.

Solution. $94 \pm 1.96\cdot\dfrac{20}{\sqrt{200}}\sqrt{1 - 1/5} = 94 \pm 2.479$.

Example. A simple random sample of $n = 100$ water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be $\bar{y} = 12.5$ and $s^2 = 1252$. If we assume that there are $N = 10{,}000$ households within the community, estimate $\mu$, the true average daily consumption, and find a 95% confidence interval for $\mu$.

Solution. $\hat{\mu} = \bar{y} = 12.5$, and
$$\widehat{\operatorname{Var}}(\bar{y}) = (1 - n/N)\frac{s^2}{n} = (1 - 100/10000)\frac{1252}{100} = 12.3948, \qquad \sqrt{\widehat{\operatorname{Var}}(\bar{y})} = 3.5206.$$
A 95% C.I. for $\mu$ is $12.5 \pm 1.96 \times 3.5206 = (5.6, 19.4)$.
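For convenience, the interval can be wrapped in a small function. This is a sketch we add for illustration (the name `mean_ci` is ours); it reproduces the water-meter example.

```python
# Illustrative helper: approximate (1 - alpha) CI for the mean under s.r.s.
from math import sqrt

def mean_ci(ybar, s2, n, N, z=1.96):
    f = n / N                                  # sampling fraction
    bound = z * sqrt(s2 / n) * sqrt(1 - f)     # bound on the error of estimation
    return ybar - bound, ybar + bound

print(mean_ci(ybar=12.5, s2=1252.0, n=100, N=10_000))   # approx (5.6, 19.4)
```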

2.3 Selecting the sample size for estimating population means


We have seen that $\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}$. So the bigger the sample size $n$ (but $n \leq N$), the more accurate our estimate $\bar{y}$ is. It is of interest to find the minimum $n$ such that our estimate is within an error bound $B$ with certain probability $1 - \alpha$, say,
$$P(|\bar{y} - \mu| < B) \geq 1 - \alpha, \quad \text{i.e.,} \quad P\left(\frac{|\bar{y} - \mu|}{\sqrt{\operatorname{Var}(\bar{y})}} < \frac{B}{\sqrt{\operatorname{Var}(\bar{y})}}\right) \geq 1 - \alpha.$$

By the central limit theorem, this holds (approximately) if and only if
$$\frac{B}{\sqrt{\operatorname{Var}(\bar{y})}} \geq z_{\alpha/2} \iff \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} \leq \frac{B^2}{z_{\alpha/2}^2} =: D$$
$$\iff \frac{N-n}{n} \leq \frac{(N-1)D}{\sigma^2} \iff \frac{N}{n} \leq 1 + \frac{(N-1)D}{\sigma^2} = \frac{(N-1)D + \sigma^2}{\sigma^2} \iff n \geq \frac{N\sigma^2}{(N-1)D + \sigma^2},$$
where $D = \dfrac{B^2}{z_{\alpha/2}^2}$.

Remarks: (1) If $\alpha = 5\%$, then $z_{\alpha/2} = 1.96 \approx 2$, so $D \approx B^2/4$. This coincides with the formula in the textbook (page 93).

(2) The above formula requires knowledge of the population variance $\sigma^2$, which is typically unknown in practice. However, we can approximate $\sigma^2$ by the following methods: (i) from pilot studies; (ii) from previous surveys; (iii) from other studies.

Example. Suppose that a total of 1500 students are to graduate next year. Determine the sample size $n$ needed to ensure that the sample average starting salary is within \$40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately \$400.

Solution.
$$n = \frac{1500 \times 400^2}{1499 \times 40^2/1.645^2 + 400^2} = 229.37, \quad \text{so take } n = 230.$$

Example. (Example 4.5, p. 94, 5th edition.) The average amount of money $\mu$ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance $\sigma^2$, it is known that most accounts lie within a \$100 range. There are $N = 1000$ open accounts. Find the sample size needed to estimate $\mu$ with a bound on the error of estimation $B = \$3$ with probability 0.95.

Solution. The answer depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts. We need an estimate of $\sigma^2$. For the normal distribution $N(0, \sigma^2)$, we have
$$P(|N(0, \sigma^2)| \leq 1.96\sigma) = P(|N(0,1)| \leq 1.96) = 95\%, \qquad P(|N(0, \sigma^2)| \leq 3\sigma) = P(|N(0,1)| \leq 3) = 99.87\%.$$
So 95% of accounts lie within a $4\sigma$ range and 99.87% of accounts lie within a $6\sigma$ range. Here $B = 3$ and $N = 1000$. If "most" means 95%, we take $4\sigma = 100$, so $\sigma = 25$; then $n = 210.76 \approx 211$. If "most" means 99.87%, we take $6\sigma = 100$, so $\sigma = 50/3$; then $n \approx 107$.
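The sample-size rule translates directly into code. The sketch below is our illustration (function name is ours; we round up because $n$ is a minimum); it reproduces the starting-salary example.

```python
# Illustrative helper: minimum n with error bound B and confidence z.
from math import ceil

def sample_size_mean(N, sigma, B, z):
    D = B ** 2 / z ** 2
    return ceil(N * sigma ** 2 / ((N - 1) * D + sigma ** 2))

print(sample_size_mean(N=1500, sigma=400.0, B=40.0, z=1.645))   # 230
```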

2.3.1 A quick summary on estimation of population mean

The population mean is defined to be $\mu = \frac{1}{N}(u_1 + u_2 + \cdots + u_N)$. Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.

1) Estimators of $\mu$ and $\sigma^2$ are
$$\hat{\mu} = \bar{y} = \frac{1}{n}\sum_{i=1}^n y_i, \qquad s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2.$$

2) Properties of $\bar{y}$:
$$E(\bar{y}) = \mu, \qquad \operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

3) An unbiased estimator of $\operatorname{Var}(\bar{y})$ is
$$\widehat{\operatorname{Var}}(\bar{y}) = \frac{s^2(1-f)}{n}, \quad \text{where } f = n/N.$$

4) An approximate $(1-\alpha)$ C.I. for $\mu$ is
$$\bar{y} \pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\bar{y})} = \bar{y} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}\sqrt{1-f}.$$

5) The minimum sample size $n$ needed to have an error bound $B$ with probability $1-\alpha$ is
$$n \geq \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$

2.3.2 Estimation of population total

The population total is defined to be
$$\tau = u_1 + u_2 + \cdots + u_N = N\mu.$$

Suppose a simple random sample is $\{y_1, \ldots, y_n\}$.

1. An estimator of $\tau$ is $\hat{\tau} = N\bar{y}$.

2. The mean and variance of $\hat{\tau}$ are
$$E(\hat{\tau}) = \tau, \qquad \operatorname{Var}(\hat{\tau}) = N^2\,\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.$$

3. An estimator of $\operatorname{Var}(\hat{\tau})$ is
$$\widehat{\operatorname{Var}}(\hat{\tau}) = \widehat{\operatorname{Var}}(N\bar{y}) = N^2(1-f)\frac{s^2}{n}.$$

4. An approximate $(1-\alpha)$ confidence interval for $\tau$ is
$$\hat{\tau} \pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} = N\bar{y} \pm z_{\alpha/2}\,N\frac{s}{\sqrt{n}}\sqrt{1-f}.$$

Proof. By the central limit theorem, if $n \to \infty$, $N \to \infty$ such that $n/N \to \pi \in (0,1)$, then
$$\frac{\hat{\tau} - \tau}{\sqrt{\operatorname{Var}(\hat{\tau})}} \to_d N(0,1).$$
If $\operatorname{Var}(\hat{\tau})$ is replaced by its estimator $\widehat{\operatorname{Var}}(\hat{\tau})$, we still have
$$\frac{\hat{\tau} - \tau}{\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}} \to_d N(0,1).$$
Therefore,
$$1 - \alpha \approx P\left(\frac{|\hat{\tau} - \tau|}{\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}} \leq z_{\alpha/2}\right) = P\left(\hat{\tau} - z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} \leq \tau \leq \hat{\tau} + z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})}\right),$$
and $B := z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{\tau})} = z_{\alpha/2}\,N\frac{s}{\sqrt{n}}\sqrt{1-f}$ is called the bound on the error of estimation.

5. The minimum sample size $n$ needed to have an error bound $B$ with probability $1-\alpha$ is
$$n \geq \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}.$$

Example 4.6. (Page 95 of the textbook.) An investigator is interested in estimating the total weight gain in 0 to 4 weeks for $N = 1000$ chicks fed on a new ration. Obviously, weighing each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate $\tau$ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past; using data from these studies, the investigator found that $\sigma^2$, the population variance, was approximately 36.00 (grams)$^2$. Determine the required sample size.

Solution. $D = B^2/(1.96N)^2 = 1000^2/(1.96^2 \times 1000^2) = 0.26$, and
$$n = \frac{N\sigma^2}{(N-1)D + \sigma^2} = \frac{1000 \times 36}{999 \times 0.26 + 36} = 121.72 \approx 122.$$
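The computations for the total parallel those for the mean, with the extra factor $N$ (and $N^2$ inside $D$). The sketch below is our illustration (helper names are ours); the final call reproduces the chick example.

```python
# Illustrative helpers for the population total tau = N * mu.
from math import ceil, sqrt

def total_estimate(ybar, s2, n, N, z=1.96):
    tau_hat = N * ybar                               # point estimate of tau
    bound = z * N * sqrt(s2 / n) * sqrt(1 - n / N)   # error bound for tau_hat
    return tau_hat, (tau_hat - bound, tau_hat + bound)

def sample_size_total(N, sigma, B, z=1.96):
    D = B ** 2 / (N ** 2 * z ** 2)                   # note the extra N^2
    return ceil(N * sigma ** 2 / ((N - 1) * D + sigma ** 2))

print(sample_size_total(N=1000, sigma=6.0, B=1000.0))   # 122
```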


2.4 Estimation of population proportion

We are interested in the proportion $p$ of the population with a specified characteristic. Let
$$y_i = \begin{cases} 1 & \text{if the } i\text{th element has the characteristic,} \\ 0 & \text{if not.} \end{cases}$$

It is easy to see that $E(y_i) = E(y_i^2) = p$ (why?). Therefore, we have
$$\mu = E(y_i) = p, \qquad \sigma^2 = \operatorname{Var}(y_i) = p - p^2 = pq, \quad \text{where } q = 1 - p.$$

So the total number of elements in the sample of size $n$ possessing the specified characteristic is $\sum_{i=1}^n y_i$. Therefore:

1. An estimator of $p$ is
$$\bar{y} = \frac{\sum_{i=1}^n y_i}{n} =: \hat{p}.$$
An estimator of $\sigma^2 = pq$ is
$$s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar{y})^2 = \frac{1}{n-1}\left(\sum_{i=1}^n y_i^2 - n\bar{y}^2\right) = \frac{1}{n-1}\left(n\hat{p} - n\hat{p}^2\right) = \frac{n}{n-1}\hat{p}\hat{q}, \quad \text{where } \hat{q} = 1 - \hat{p}.$$

From Theorems 2.2.2 and 2.2.3, we have
$$E(\hat{p}) = p, \qquad E(s^2) = \frac{N}{N-1}\sigma^2 = \frac{N}{N-1}pq. \tag{4.1}$$

2. Again, from Theorem 2.2.2, the variance of $\hat{p}$ is
$$\operatorname{Var}(\hat{p}) = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} = \frac{pq}{n}\cdot\frac{N-n}{N-1}.$$

3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of $\hat{p}$ is
$$\widehat{\operatorname{Var}}(\hat{p}) = \frac{s^2(1-f)}{n} = \frac{\hat{p}\hat{q}}{n-1}(1-f).$$

4. An approximate $(1-\alpha)$ confidence interval for $p$ is
$$\hat{p} \pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{p})} = \hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n-1}}\sqrt{1-f}.$$

5. The minimum sample size $n$ required to estimate $p$ such that our estimate $\hat{p}$ is within an error bound $B$ with probability $1-\alpha$ is
$$n \geq \frac{Npq}{(N-1)D + pq}, \quad \text{where } D = \frac{B^2}{z_{\alpha/2}^2}.$$
Note that the right-hand side is an increasing function of $\sigma^2 = pq$.

a) $p$ is often unknown, so we can replace it by some estimate (from a previous study, pilot study, etc.).
b) If we don't have an estimate of $p$, we can replace it by $p = 1/2$, which maximizes $pq$ at $1/4$.

Example. A small town has $N = 800$ people. Let $p$ be the proportion of people with blood type A. (1) What sample size $n$ must be drawn in order to estimate $p$ to within 0.04 of $p$ with probability 0.95? (2) Suppose we know that no more than 10% of the population have blood type A; find $n$ again as in (1), and comment on the difference between (1) and (2). (3) A s.r.s. of size $n = 200$ is taken and 7% of the sample has blood type A; find a 90% confidence interval for $p$.

Solution. $N = 800$, $\alpha = 0.05$, $B = 0.04$. (1) Taking $p = 1/2$ in the formula, we get $n = 345$. (2) $p \leq 0.10$, so $\sigma^2 = pq \leq 0.09$; simple calculation yields $n = 171$. (3) $(0.040, 0.096)$.

Example. A simple random sample of $n = 40$ college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. 25 students answered affirmatively. Estimate $p$, the proportion of students on campus in favor of the change (assume $N = 2000$), and find a 95% confidence interval for $p$.

Solution. $\hat{p} = \bar{y} = 25/40 = 0.625$, and
$$\widehat{\operatorname{Var}}(\hat{p}) = \frac{\hat{p}\hat{q}}{n-1}(1 - n/N) = \frac{0.625 \times 0.375}{39}(1 - 40/2000) = 5.889 \times 10^{-3}, \qquad \sqrt{\widehat{\operatorname{Var}}(\hat{p})} = 0.07674.$$
A 95% C.I. for $p$ is $0.625 \pm 1.96 \times 0.0767 = (0.4746, 0.7754)$.
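Both formulas are easy to code. The sketch below is our illustration (helper names are ours); the printed call reproduces the college-students confidence interval, and `p=0.5` implements the conservative default of remark b).

```python
# Illustrative helpers for estimating a population proportion under s.r.s.
from math import ceil, sqrt

def proportion_ci(p_hat, n, N, z=1.96):
    q_hat = 1 - p_hat
    bound = z * sqrt(p_hat * q_hat / (n - 1)) * sqrt(1 - n / N)
    return p_hat - bound, p_hat + bound

def sample_size_proportion(N, B, z, p=0.5):      # p = 1/2 is the worst case
    D = B ** 2 / z ** 2
    return ceil(N * p * (1 - p) / ((N - 1) * D + p * (1 - p)))

print(proportion_ci(p_hat=0.625, n=40, N=2000))  # approx (0.4746, 0.7754)
```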

2.5 Comparing estimates

For simplicity, suppose $x_1, \ldots, x_m$ is a random sample (i.i.d.) from a population with mean $\mu_x$ and $y_1, \ldots, y_n$ is a random sample from a population with mean $\mu_y$. We are interested in the difference of means $\mu_y - \mu_x$, whose unbiased estimator is $\bar{y} - \bar{x}$, as $E(\bar{y} - \bar{x}) = \mu_y - \mu_x$. Further,
$$\operatorname{Var}(\bar{y} - \bar{x}) = \operatorname{Var}(\bar{y}) + \operatorname{Var}(\bar{x}) - 2\operatorname{Cov}(\bar{y}, \bar{x}).$$

Remark: If the two samples $\{x_1, \ldots, x_m\}$ and $\{y_1, \ldots, y_n\}$ are independent, then $\operatorname{Cov}(\bar{y}, \bar{x}) = 0$. However, a more interesting case is when the two samples are dependent, which is illustrated in the following example.

A dependent example. Suppose an opinion poll asks $n$ people the question "Do you favor abortion?" The opinions given are YES, NO, NO OPINION. Let the proportions of people who answer YES, NO and NO OPINION be $p_1$, $p_2$ and $p_3$, respectively. In particular, we are interested in comparing $p_1$ and $p_2$ by looking at $p_1 - p_2$. Clearly, $\hat{p}_1$ and $\hat{p}_2$ are dependent proportions, since if one is high, the other is likely to be low. Let $\hat{p}_1$, $\hat{p}_2$ and $\hat{p}_3$ be the three respective sample proportions in the sample of size $n$. Then $X = n\hat{p}_1$, $Y = n\hat{p}_2$ and $Z = n\hat{p}_3$ follow a multinomial distribution with parameters $(n, p_1, p_2, p_3)$. That is,
$$P(X = x, Y = y, Z = z) = \binom{n}{x, y, z} p_1^x p_2^y p_3^z = \frac{n!}{x!\,y!\,z!}\,p_1^x p_2^y p_3^z.$$
Please note that
$$\sum_{x, y \geq 0,\ x + y + z = n} \frac{n!}{x!\,y!\,z!}\,p_1^x p_2^y p_3^z = 1.$$

Question: What is the distribution of $X$? (Hint: classify the people into "Yes" and "Not Yes".)

Theorem 2.5.1
$$E(X) = np_1, \quad \operatorname{Var}(X) = np_1 q_1, \quad E(Y) = np_2, \quad \operatorname{Var}(Y) = np_2 q_2, \quad E(Z) = np_3, \quad \operatorname{Cov}(X, Y) = -np_1 p_2.$$

Proof. $X$ = number of people saying YES $\sim \operatorname{Bin}(n, p_1)$, so $EX = np_1$ and $\operatorname{Var}(X) = np_1 q_1$. Now $\operatorname{Cov}(X, Y) = E(XY) - (EX)(EY) = E(XY) - n^2 p_1 p_2$. But
$$E(XY) = \sum_{x, y \geq 0,\ x+y \leq n} xy\,P(X = x, Y = y) = \sum_{x, y \geq 1,\ x+y \leq n} xy\,\frac{n!}{x!\,y!\,(n-x-y)!}\,p_1^x p_2^y p_3^{n-x-y}$$
$$= \sum_{x, y \geq 1,\ x+y \leq n} \frac{n!}{(x-1)!\,(y-1)!\,(n-x-y)!}\,p_1^x p_2^y p_3^{n-x-y}$$
$$= n(n-1)p_1 p_2 \sum_{x_1, y_1 \geq 0,\ x_1 + y_1 \leq n-2} \frac{(n-2)!}{x_1!\,y_1!\,((n-2)-x_1-y_1)!}\,p_1^{x_1} p_2^{y_1} p_3^{(n-2)-x_1-y_1}$$
$$= n(n-1)p_1 p_2 = n^2 p_1 p_2 - np_1 p_2,$$
where $x_1 = x - 1$ and $y_1 = y - 1$. Therefore, $\operatorname{Cov}(X, Y) = E(XY) - n^2 p_1 p_2 = -np_1 p_2$.

Theorem 2.5.2
$$E(\hat{p}_1) = p_1, \quad \operatorname{Var}(\hat{p}_1) = p_1 q_1/n, \quad E(\hat{p}_2) = p_2, \quad \operatorname{Var}(\hat{p}_2) = p_2 q_2/n, \quad \operatorname{Cov}(\hat{p}_1, \hat{p}_2) = -p_1 p_2/n.$$

Proof. Note that $\hat{p}_1 = X/n$ and $\hat{p}_2 = Y/n$. Apply the last theorem.

From the last theorem, we have
$$\operatorname{Var}(\hat{p}_1 - \hat{p}_2) = \operatorname{Var}(\hat{p}_1) + \operatorname{Var}(\hat{p}_2) - 2\operatorname{Cov}(\hat{p}_1, \hat{p}_2) = \frac{p_1 q_1}{n} + \frac{p_2 q_2}{n} + \frac{2p_1 p_2}{n}.$$
One estimator of $\operatorname{Var}(\hat{p}_1 - \hat{p}_2)$ is
$$\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n}.$$
Therefore, an approximate $(1-\alpha)$ confidence interval for $p_1 - p_2$ is
$$(\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2)} = (\hat{p}_1 - \hat{p}_2) \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n}}.$$

Example. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6-7, 1994 gave the following results.

                 Banned   Special areas   No restrictions
   Nonsmokers      44%         52%               3%
   Smokers          8%         80%              11%

Using a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for (1) the true difference between the proportions choosing "Banned" between nonsmokers and smokers; (2) the true difference between the proportions among nonsmokers choosing between "Banned" and "Special areas".

Solution. A. The proportions choosing "Banned" are independent of each other; a high value of one does not force a low value of the other. An appropriate estimate of this difference is
$$0.44 - 0.08 \pm 2\sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.08 \times 0.92}{200}} = 0.36 \pm 0.06.$$

B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned"; if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is
$$0.52 - 0.44 \pm 2\sqrt{\frac{0.44 \times 0.56}{600} + \frac{0.52 \times 0.48}{600} + 2\,\frac{0.44 \times 0.52}{600}} = 0.08 \pm 0.08.$$

Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

Solution. Let $p_1$, $p_2$ be the proportions of Americans who blamed the players and the owners, respectively.
$$\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2) = \frac{\hat{p}_1\hat{q}_1}{n} + \frac{\hat{p}_2\hat{q}_2}{n} + \frac{2\hat{p}_1\hat{p}_2}{n} = \frac{0.29 \times 0.71 + 0.34 \times 0.66 + 2 \times 0.29 \times 0.34}{600} = 1.0458 \times 10^{-3}.$$
So an approximate 95% C.I. for $p_1 - p_2$ is
$$0.29 - 0.34 \pm z_{0.025}\sqrt{\widehat{\operatorname{Var}}(\hat{p}_1 - \hat{p}_2)} = -0.05 \pm 1.96 \times 0.03234 = (-0.11339, 0.01339).$$
Since the interval contains 0, the evidence does not suggest a real difference.
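The dependent-proportions interval takes only a few lines to compute. This sketch is our illustration (the function name is ours); it reproduces the baseball example, where the interval covering 0 is why the data do not establish a real difference.

```python
# Illustrative helper: CI for p1 - p2 with dependent multinomial proportions.
from math import sqrt

def dependent_diff_ci(p1, p2, n, z=1.96):
    # Var(p1_hat - p2_hat) = (p1 q1 + p2 q2 + 2 p1 p2) / n  (note the PLUS sign)
    var = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    d = p1 - p2
    return d - z * sqrt(var), d + z * sqrt(var)

print(dependent_diff_ci(p1=0.29, p2=0.34, n=600))   # approx (-0.113, 0.013)
```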

2.6 Randomization theory results for simple random sampling

Define $Z_i = I\{u_i \text{ is in the sample}\}$ for $i = 1, \ldots, N$. Then
$$\bar{y} = \sum_{i=1}^n \frac{y_i}{n} = \sum_{i=1}^N \frac{Z_i u_i}{n}.$$

The $Z_i$'s are the only random variables here. For simple random sampling, $Z_1, \ldots, Z_N$ are identically distributed (but not independent) Bernoulli random variables with
$$\pi_i = P(Z_i = 1) = P(\text{select unit } i \text{ in sample}) = \frac{\text{number of samples including unit } i}{\text{number of possible samples}} = \frac{\binom{N-1}{n-1}}{\binom{N}{n}} = \frac{n}{N}.$$

As a consequence, we have, for $i, j = 1, \ldots, N$, $i \neq j$,
$$E Z_i = E Z_i^2 = \frac{n}{N}, \qquad \operatorname{Var}(Z_i) = E Z_i^2 - (E Z_i)^2 = \frac{n}{N}\left(1 - \frac{n}{N}\right),$$
$$E(Z_i Z_j) = P(Z_i = 1, Z_j = 1) = P(Z_j = 1 \mid Z_i = 1)P(Z_i = 1) = \frac{n-1}{N-1}\cdot\frac{n}{N},$$
$$\operatorname{Cov}(Z_i, Z_j) = E(Z_i Z_j) - E(Z_i)E(Z_j) = \frac{n-1}{N-1}\cdot\frac{n}{N} - \left(\frac{n}{N}\right)^2 = -\frac{1}{N-1}\cdot\frac{n}{N}\left(1 - \frac{n}{N}\right).$$

Therefore,
$$E\bar{y} = \sum_{i=1}^N \frac{u_i}{n}\,E Z_i = \sum_{i=1}^N \frac{u_i}{n}\cdot\frac{n}{N} = \frac{1}{N}\sum_{i=1}^N u_i = \mu,$$
$$\operatorname{Var}(\bar{y}) = \frac{1}{n^2}\operatorname{Cov}\left(\sum_{i=1}^N Z_i u_i, \sum_{j=1}^N Z_j u_j\right) = \frac{1}{n^2}\sum_{i=1}^N\sum_{j=1}^N u_i u_j \operatorname{Cov}(Z_i, Z_j)$$
$$= \frac{1}{n^2}\left(\sum_{i=1}^N u_i^2 \operatorname{Var}(Z_i) + \sum_{i \neq j} u_i u_j \operatorname{Cov}(Z_i, Z_j)\right) = \cdots = \frac{N-n}{N-1}\cdot\frac{\sigma^2}{n}.$$
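For small $N$ and $n$, the formulas for $\pi_i$ and $\operatorname{Cov}(Z_i, Z_j)$ can be verified exactly by enumerating all $\binom{N}{n}$ equally likely samples. The following is a short sketch we add for illustration (toy sizes chosen arbitrarily).

```python
# Exact enumeration check of pi_i = n/N and Cov(Z_i, Z_j) for s.r.s.
from itertools import combinations

N, n = 6, 3
samples = list(combinations(range(N), n))   # all C(N, n) equally likely samples
M = len(samples)

pi_0 = sum(1 for s in samples if 0 in s) / M               # P(Z_0 = 1)
pi_01 = sum(1 for s in samples if 0 in s and 1 in s) / M   # P(Z_0 = 1, Z_1 = 1)
cov = pi_01 - pi_0 ** 2                                    # Cov(Z_0, Z_1)

print(pi_0, "vs", n / N)
print(cov, "vs", -(1 / (N - 1)) * (n / N) * (1 - n / N))
```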

2.7 Exercises

1. List all possible simple random samples of size $n = 2$ that can be selected from the population $\{0, 1, 2, 3, 4\}$. Calculate $\sigma^2$ and $\operatorname{Var}(\bar{y})$.

2. A simple random sample of $n = 100$ water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell, resulting in $\bar{y} = 12.5$ and $s^2 = 1252$. Assume $N = 10{,}000$ households. Estimate the true average daily consumption $\mu$, and find a 95% confidence interval.

3. A simple random sample of $n = 40$ college students was interviewed to determine the proportion of students in favor of converting from the semester to the quarter system. 25 students answered affirmatively. Estimate $p$, the proportion of students on campus in favor of the change. (Assume $N = 2000$.) Find a 95% confidence interval for $p$.

4. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?

5. (a) Suppose that a town has population size $N = 2000$. We are interested in the proportion $p$ of people who support building a childcare centre. Find the sample size $n$ required to estimate $p$ with an error bound $B = 0.05$ with probability 95%. (b) If we know that at least 80% of the people will support building the childcare centre, find $n$ again as in (a).

6. Show that, for estimating the population total $\tau$, the minimum sample size $n$ required so that our estimate is within an error bound $B$ with probability $1 - \alpha$ is given by
$$n \geq \frac{N\sigma^2}{(N-1)D + \sigma^2}, \quad \text{where } D = \frac{B^2}{N^2 z_{\alpha/2}^2}.$$

7. In estimating the proportion $p$ of the population with a specified characteristic, we used $\hat{p} = \sum y_i/n$, where $y_i = 1$ if the $i$th element has the characteristic and $0$ if not. We have seen in the lectures that $E(y_i) = E(y_i^2) = p$ and $\sigma^2 = \operatorname{Var}(y_i) = p - p^2$.
(1) Suppose that we use $\hat{\sigma}^2 = \hat{p} - \hat{p}^2$ as an estimator of $\sigma^2$. Is it an unbiased estimator of $\sigma^2$? If not, what is its bias?
(2) Find an unbiased estimator of $\sigma^2$, based on your calculation in (1). Compare this estimator with the one given in the lectures.

8. In a decision theory approach, two functions are specified: $L(n)$, the loss or cost of a bad estimate, and $C(n)$, the cost of taking the sample. Suppose
$$L(n) = k\operatorname{Var}(\bar{y}) = k\left(1 - \frac{n}{N}\right)\frac{\sigma^2}{n}, \qquad C(n) = c_0 + c_1 n,$$
for some constants $c_0$, $c_1$ and $k$. Find an optimal $n$ minimizing the total cost $L(n) + C(n)$.
