6 Inference Intervals Sample Size

Applied Marketing (Market Research Methods)
Topic 6: Inference, confidence intervals and sample size determination
Dr James Abdey
Outline: Overview; Choosing a sample size; Estimation; Sampling distribution of $\bar{X}$; Sampling distribution properties; Sample size and sampling fraction; Central Limit Theorem; Principle of confidence intervals; Construction: CI for $\mu$ (variance known); Construction: CI for $\mu$ (variance unknown); Choosing sample size; Adjusting the statistically determined sample size; Adjusting for non-response
Overview
Here we consider sample size determination in simple random sampling. Properties of the sampling distribution are discussed. We describe the required adjustments to statistically determined sample sizes to account for incidence and completion rates. Non-response issues in sampling are also covered, with ways of improving response rates.

Choosing a sample size

The question "How big a sample do I need to take?" is a common one when sampling data. The answer depends on the quality of inference that the researcher requires from the data. In the estimation context this can be expressed in terms of the accuracy of estimation. If the researcher requires that there should be a 95% chance that the estimation error is no bigger than d units (we refer to d as the tolerance), then this is equivalent to having a 95% confidence interval of width 2d. Note that d represents the half-width of the confidence interval since the point estimate is, by construction, at the centre of the confidence interval.

Simple random sampling (SRS)

Recall that a simple random sample is a sample selected by a process such that every possible sample (of the same size, n) has the same probability of selection. The selection process is left to chance, thus eliminating the effect of selection bias. Due to the random selection mechanism, we do not know (in advance) which sample will occur. Every population element has a known, non-zero probability of selection in the sample, but no element is certain to appear.

Simple random sampling (SRS) Example

Consider a population of size N = 6 elements: A, B, C, D, E and F. We consider all possible samples of size n = 2 (without replacement). There are 15 different, but equally likely, such samples: AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF. Since this is SRS, each sample has a probability of selection of 1/15.
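The enumeration above can be reproduced with a minimal Python sketch. The variable names are illustrative only; the only inputs taken from the example are the six labels and n = 2.

```python
from fractions import Fraction
from itertools import combinations

population = ["A", "B", "C", "D", "E", "F"]  # N = 6 elements from the example
n = 2                                        # sample size

# Each unordered subset of size n is one possible sample (without replacement).
samples = ["".join(subset) for subset in combinations(population, n)]

# Under SRS every possible sample is equally likely.
selection_probability = Fraction(1, len(samples))

print(len(samples))
print(samples)
print(selection_probability)
```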

Estimation
A population has particular characteristics of interest such as the mean, variance etc. Collectively we refer to these characteristics as parameters. If we do not have population data, the parameter values will be unknown. Statistical inference is the process of estimating the (unknown) parameter values using the (known) sample data. We use a statistic (estimator) calculated from sample observations to provide a point estimate.

Estimation Example
Returning to our example, recall there are 15 different samples of size 2 from a population of size 6. Suppose the variable of interest is income:

Individual          A   B   C   D   E   F
Income (in 000s)    3   6   4   9   7   7

If we seek the population mean, $\mu$, we will use the sample mean, $\bar{X}$, as our estimator:

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$

For example, if the observed sample was AB, the sample mean is (3000 + 6000)/2 = 4,500.
Estimation Example
Clearly, different observed samples will lead to different sample means. Consider $\bar{X}$ for all possible samples (in 000s):

Sample   Values   $\bar{X}$      Sample   Values   $\bar{X}$
AB       3, 6     4.5            BF       6, 7     6.5
AC       3, 4     3.5            CD       4, 9     6.5
AD       3, 9     6              CE       4, 7     5.5
AE       3, 7     5              CF       4, 7     5.5
AF       3, 7     5              DE       9, 7     8
BC       6, 4     5              DF       9, 7     8
BD       6, 9     7.5            EF       7, 7     7
BE       6, 7     6.5

So $\bar{X}$ values vary from 3.5 to 8, depending on the sample values.
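As a check on the table of sample means, here is a short Python sketch that recomputes $\bar{X}$ for every possible sample of size 2. The dictionary name `income` is an illustrative choice; the values are the incomes (in 000s) given in the example.

```python
from itertools import combinations

# Incomes (in 000s) of the six individuals in the example population.
income = {"A": 3, "B": 6, "C": 4, "D": 9, "E": 7, "F": 7}

# Sample mean for each of the 15 possible samples of size n = 2.
sample_means = {
    "".join(pair): sum(income[unit] for unit in pair) / 2
    for pair in combinations(income, 2)
}

for label, mean in sample_means.items():
    print(label, mean)

# Smallest and largest possible values of the sample mean.
print(min(sample_means.values()), max(sample_means.values()))
```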
The previous slide showed all possible values of the estimator $\bar{X}$. Since we have the population data here, we can actually compute the population mean (in 000s):

$$\mu = \frac{1}{N} \sum_{i=1}^{N} X_i = \frac{3 + 6 + 4 + 9 + 7 + 7}{6} = 6$$

So even with SRS, we obtain some $\bar{X}$ values far from $\mu$. Here only one sample (AD) results in $\bar{X} = \mu$. Let's now consider the maximum $|\bar{X} - \mu|$:

max $|\bar{X} - \mu|$    0       0.5     1       1.5     2       2.5
Number of samples        1       6       10      12      14      15
Probability              0.067   0.400   0.667   0.800   0.933   1.000

So, for example, there is an 80% chance of being within 1.5 units of $\mu$.
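The cumulative probabilities in the table can be recomputed by counting samples. The following is a minimal sketch; the helper name `prob_within` is hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from itertools import combinations

income = {"A": 3, "B": 6, "C": 4, "D": 9, "E": 7, "F": 7}  # in 000s
mu = Fraction(sum(income.values()), len(income))           # population mean = 6

# Absolute estimation error |x_bar - mu| for every sample of size 2.
errors = [
    abs(Fraction(income[a] + income[b], 2) - mu)
    for a, b in combinations(income, 2)
]

def prob_within(d):
    """Proportion of the equally likely samples with |x_bar - mu| <= d."""
    return Fraction(sum(e <= d for e in errors), len(errors))

for d in [0, Fraction(1, 2), 1, Fraction(3, 2), 2, Fraction(5, 2)]:
    print(float(d), round(float(prob_within(d)), 3))
```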
We now represent this as a frequency distribution. That is, we record the frequency of each possible value of $\bar{X}$:

$\bar{X}$   Frequency   Relative frequency
3.5         1           1/15 = 0.067
4.5         1           1/15 = 0.067
5.0         3           3/15 = 0.200
5.5         2           2/15 = 0.133
6.0         1           1/15 = 0.067
6.5         3           3/15 = 0.200
7.0         1           1/15 = 0.067
7.5         1           1/15 = 0.067
8.0         2           2/15 = 0.133

This is known as the sampling distribution of $\bar{X}$.

The sampling distribution is a central and vital concept in statistics. It can be used to evaluate how "good" an estimator is. Specifically, we care about how close the estimator is to the population parameter of interest. As we have seen, different samples yield different $\bar{X}$ values, as a consequence of the random sampling procedure. Hence estimators (of which $\bar{X}$ is an example) are random variables. So, $\bar{X}$ is our estimator of $\mu$. The observed value of $\bar{X}$ is a point estimate.

Sampling distribution properties
Like any distribution, we care about a sampling distribution's mean and variance. Together, these let us assess how "good" an estimator is. First, consider the mean: we seek an estimator which does not mislead us systematically. So the "average" (mean) value of an estimator, over all possible samples, should be equal to the population parameter.


Returning to our example:

$\bar{X}$   Frequency   Product
3.5         1           3.5
4.5         1           4.5
5.0         3           15.0
5.5         2           11.0
6.0         1           6.0
6.5         3           19.5
7.0         1           7.0
7.5         1           7.5
8.0         2           16.0
Total       15          90.0

Hence the mean of this sampling distribution is 90/15 = 6.

An important difference between a sampling distribution and other distributions is that the values in a sampling distribution are summary measures of whole samples (i.e. statistics/estimators) rather than individual observations. Formally, the mean of a sampling distribution is called the expected value of the estimator, denoted by $E[\cdot]$. Hence the expected value of the sample mean is $E[\bar{X}]$. An unbiased estimator has its expected value equal to the parameter being estimated. For our example, $E[\bar{X}] = 6 = \mu$.
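To illustrate unbiasedness on the running example, the sketch below builds the sampling distribution of $\bar{X}$ as a frequency table and computes its mean. The names `sampling_distribution` and `expected_value` are illustrative, not part of the original slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

income = {"A": 3, "B": 6, "C": 4, "D": 9, "E": 7, "F": 7}  # in 000s
mu = Fraction(sum(income.values()), len(income))           # population mean

# All 15 possible sample means for n = 2, kept as exact fractions.
means = [Fraction(income[a] + income[b], 2) for a, b in combinations(income, 2)]

# Sampling distribution: each possible value of x_bar mapped to its frequency.
sampling_distribution = Counter(means)

# E[X_bar] = sum of (value * frequency) divided by the number of samples.
expected_value = sum(x * f for x, f in sampling_distribution.items()) / len(means)

print(sorted((float(x), f) for x, f in sampling_distribution.items()))
print(expected_value, expected_value == mu)
```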

Fortunately the sample mean $\bar{X}$ is always an unbiased estimator of $\mu$ in SRS, regardless of the sample size, n, and the distribution of the (parent) population. This is a good illustration of a population parameter, $\mu$, being estimated by its sample counterpart, $\bar{X}$.

The unbiasedness of an estimator is clearly desirable; however, we also need to take into account the dispersion of the estimator's sampling distribution. Ideally, the possible values of the estimator should not vary much around the true parameter value. So, we seek an estimator with a small variance. Recall the variance is defined to be the mean of the squared deviations about the mean of the distribution. In the case of sampling distributions, it is referred to as the sampling variance.


Returning to our example:

$\bar{X}$   $\bar{X} - \mu$   $(\bar{X} - \mu)^2$   Frequency   Product
3.5         -2.5              6.25                  1           6.25
4.5         -1.5              2.25                  1           2.25
5.0         -1.0              1.00                  3           3.00
5.5         -0.5              0.25                  2           0.50
6.0          0.0              0.00                  1           0.00
6.5          0.5              0.25                  3           0.75
7.0          1.0              1.00                  1           1.00
7.5          1.5              2.25                  1           2.25
8.0          2.0              4.00                  2           8.00
Total                                               15          24.00

Hence the sampling variance is 24/15 = 1.6.
The population itself has a variance, the population variance, $\sigma^2$:

$X$     $X - \mu$   $(X - \mu)^2$   Frequency   Product
3       -3          9               1           9
6        0          0               1           0
4       -2          4               1           4
9        3          9               1           9
7        1          1               2           2
Total                               6           24

Hence the population variance is $\sigma^2 = 24/6 = 4$.

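We now consider the relationship between $\sigma^2$ and the sampling variance. Intuitively, a larger $\sigma^2$ should lead to a larger sampling variance (why?). For population size N and sample size n,

$$\text{Var}(\bar{X}) = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}$$

So for our example,

$$\text{Var}(\bar{X}) = \frac{6 - 2}{6 - 1} \cdot \frac{4}{2} = 1.6$$

We use the term standard error to refer to the standard deviation of the sampling distribution:

$$\text{S.E.}(\bar{X}) = \sqrt{\text{Var}(\bar{X})} = \sqrt{\frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}} = \sigma_{\bar{X}}$$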
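The agreement between the enumerated sampling variance and the formula can be verified numerically. This is a minimal sketch using exact fractions; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

values = [3, 6, 4, 9, 7, 7]   # population incomes (in 000s)
N, n = len(values), 2

mu = Fraction(sum(values), N)
sigma2 = sum((x - mu) ** 2 for x in values) / N   # population variance (divisor N)

# Brute force: mean squared deviation of x_bar about mu over all possible samples.
means = [Fraction(sum(s), n) for s in combinations(values, n)]
var_enumerated = sum((m - mu) ** 2 for m in means) / len(means)

# Formula with the finite population factor (N - n) / (N - 1).
var_formula = Fraction(N - n, N - 1) * sigma2 / n

print(sigma2, var_enumerated, var_formula)
print(float(var_formula) ** 0.5)   # standard error of x_bar
```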

Implications: as the sample size, n, increases, the sampling variance decreases, i.e. the precision increases. Provided the sampling fraction, n/N, is small, the term $\frac{N - n}{N - 1} \approx 1$ and so can be ignored; the precision then depends effectively on n only.

Although greater precision is desirable, data collection costs will rise with n (remember why we sample in the first place!).
Sample size and sampling fraction

The larger the sample, the less variability there will be between samples. Compare the frequency of each possible value of $\bar{X}$ for n = 2 and n = 4:

$\bar{X}$   n = 2   n = 4
3.50        1
4.50        1
5.00        3       2
5.25                1
5.50        2       1
5.75                3
6.00        1       1
6.25                2
6.50        3       3
6.75                1
7.00        1
7.25                1
7.50        1
8.00        2


There is a striking improvement in the precision of the estimator. The variability has decreased considerably: the range of possible $\bar{X}$ values goes from 3.5 to 8.0 (n = 2) down to 5.0 to 7.25 (n = 4). The sampling variance is reduced from 1.6 to 0.4. Note that precision in statistics refers to the inverse of the sampling variance.
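The comparison between n = 2 and n = 4 can be reproduced by enumerating all samples of each size. In this sketch the helper `sampling_summary` is a hypothetical name introduced for illustration.

```python
from fractions import Fraction
from itertools import combinations

values = [3, 6, 4, 9, 7, 7]               # population incomes (in 000s)
mu = Fraction(sum(values), len(values))   # population mean

def sampling_summary(n):
    """Return (smallest, largest, variance) of x_bar over all SRS samples of size n."""
    means = [Fraction(sum(s), n) for s in combinations(values, n)]
    variance = sum((m - mu) ** 2 for m in means) / len(means)
    return min(means), max(means), variance

for n in (2, 4):
    low, high, variance = sampling_summary(n)
    print(n, float(low), float(high), float(variance))
```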


The factor $\frac{N - n}{N - 1}$ decreases steadily as $n \to N$. When n = 1 the factor equals 1, and when n = N it equals 0. When sampling without replacement, increasing n must increase precision since less of the population is left out. In much practical sampling N is very large (e.g. several million), while n is comparatively small (e.g. at most 1,000, say). In such cases the factor is so close to 1 that its effect is negligible, hence

$$\text{Var}(\bar{X}) = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n} \approx \frac{\sigma^2}{n} \quad \text{for small } n/N$$

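n/N is called the sampling fraction. When N is large, it is the sample size n which is important in determining precision, not the sampling fraction. Consider two populations: $N_1$ = 3 million and $N_2$ = 200 million, both with the same variance $\sigma^2$. We sample $n_1 = n_2 = 1{,}000$ from each population, then

$$\sigma^2_{\bar{X}_1} = \frac{N_1 - n_1}{N_1 - 1} \cdot \frac{\sigma^2}{n_1} = (0.999667) \frac{\sigma^2}{1000}$$

$$\sigma^2_{\bar{X}_2} = \frac{N_2 - n_2}{N_2 - 1} \cdot \frac{\sigma^2}{n_2} = (0.999995) \frac{\sigma^2}{1000}$$

So $\sigma^2_{\bar{X}_1} \approx \sigma^2_{\bar{X}_2}$, despite $N_1 \ll N_2$.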
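The two correction factors quoted above can be checked directly. A small sketch, where `fpc` is an illustrative name for the factor $(N - n)/(N - 1)$:

```python
def fpc(N, n):
    """Finite population factor (N - n) / (N - 1)."""
    return (N - n) / (N - 1)

n = 1_000
for N in (3_000_000, 200_000_000):
    # Print the population size, the sampling fraction and the factor.
    print(N, n / N, round(fpc(N, n), 6))
```

The sampling fractions differ by a factor of more than 60, yet both correction factors are within 0.04% of 1.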
Central Limit Theorem

When sampling from (almost) any non-normal distribution, for sufficiently large n, $\bar{X}$:
1. is approximately normally distributed
2. has mean $\mu$
3. has variance $\sigma^2/n$ and standard error $\sigma/\sqrt{n}$

The approximation is reasonable for n at least 30, as a rule-of-thumb. Though because this is an asymptotic approximation (i.e. as $n \to \infty$), the bigger n is, the better the normal approximation. Special case: if the population distribution is itself Normal, $\bar{X}$ will have an exact Normal distribution for any sample size n.
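A small simulation can illustrate the mean and variance results in the theorem. This is only a sketch under stated assumptions: the population is taken to be exponential with rate 1 (a skewed, non-normal distribution chosen here for illustration, with $\mu = 1$ and $\sigma^2 = 1$), observations are drawn independently, and the number of replications and the seed are arbitrary.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate_sample_means(n, replications=5_000):
    """Draw `replications` samples of size n from Exp(1) and return their means."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(replications)
    ]

for n in (5, 30):
    means = simulate_sample_means(n)
    # Theory: the sample mean has mean mu = 1 and variance sigma^2 / n = 1 / n.
    print(
        n,
        round(statistics.fmean(means), 3),
        round(statistics.variance(means), 4),
        round(1 / n, 4),
    )
```

The simulated variance of the sample means should sit close to 1/n, and a histogram of `means` for n = 30 would look noticeably more symmetric than one for n = 5.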


Below is the sampling distribution of $\bar{X}$ for small (red) and large (black) n. As n increases, the sampling variability of $\bar{X}$ decreases.

[Figure: Sampling Distribution of Sample Mean. Horizontal axis: sample mean; vertical axis: density.]
Although the shape of the population distribution does not affect the generality of the CLT result, it does affect the speed of convergence of the sampling distribution of $\bar{X}$ to the Normal distribution. Obviously a symmetric population distribution would converge faster in n. In practice, n = 30 is usually adequate to make the Normal approximation reasonable.

Remember the CLT is based on SRS. Without probability sampling methods, there is absolutely no basis for the use of the CLT. This is principally why we insist on probability (random) sampling. Otherwise the whole structure of statistical inference collapses!

The CLT also makes the use of the variance more reasonable. The Normal distribution is completely characterised by its mean and variance. Hence it is sensible to focus attention on these two characteristics of the sampling distribution.

Principles of confidence intervals
A point estimate is our "best guess" of an unknown population parameter based on sample data. But as it's based on a sample, there is some uncertainty/imprecision. Confidence intervals (CIs) communicate the level of imprecision.


Formally, an x% confidence interval covers the unknown parameter with x% probability over repeated samples. The shorter the confidence interval, the more reliable the estimate. As we shall see, this is achievable by reducing the level of confidence or by increasing the sample size.

We now look at how to construct CIs. The general format (for our purposes) for a confidence interval is

statistic ± (multiplier coefficient) × standard error

Alternatively,

estimate ± margin of error

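CI for $\mu$ (variance known)

The point estimate for $\mu$ is calculated using

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Assuming the (population) variance $\sigma^2$ is known, the standard error of $\bar{X}$ is

$$\text{S.E.}(\bar{X}) = \sigma_{\bar{X}} = \sqrt{\frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n}} \approx \frac{\sigma}{\sqrt{n}}$$

Hence a 95% confidence interval for $\mu$ is

$$\bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}} = \left( \bar{X} - 1.96 \frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}} \right)$$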

This is a simple, but important result, forming a useful template. Note the above interval was for 95% confidence. Other levels of confidence pose no problem, but require a different multiplier coefficient. When the variance is known we obtain a multiplier from the standard normal distribution:

For 90% confidence, use the multiplier 1.645
For 95% confidence, use the multiplier 1.96 ($\approx 2$)
For 99% confidence, use the multiplier 2.576

Hence a 99% confidence interval for $\mu$ is

$$\bar{X} \pm 2.576 \frac{\sigma}{\sqrt{n}} = \left( \bar{X} - 2.576 \frac{\sigma}{\sqrt{n}},\; \bar{X} + 2.576 \frac{\sigma}{\sqrt{n}} \right)$$
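The multipliers and the interval template can be sketched in Python using the standard library's `statistics.NormalDist`. The function names and the numerical inputs ($\bar{x} = 6$, $\sigma = 2$, n = 100) are hypothetical and chosen only for illustration; the finite population factor is taken to be approximately 1.

```python
from math import sqrt
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided standard normal multiplier, e.g. 0.95 gives about 1.96."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def ci_known_variance(x_bar, sigma, n, confidence=0.95):
    """CI for mu as x_bar +/- z * sigma / sqrt(n), ignoring the finite population factor."""
    margin = z_multiplier(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

for level in (0.90, 0.95, 0.99):
    print(level, round(z_multiplier(level), 3))

# Hypothetical inputs, for illustration only: x_bar = 6, sigma = 2, n = 100.
print(ci_known_variance(6, 2, 100, confidence=0.95))
print(ci_known_variance(6, 2, 100, confidence=0.99))
```

The 99% interval printed last is wider than the 95% one, since only the multiplier changes.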


So we see that a higher level of confidence (a "good" thing) leads to a larger multiplier coefficient, and hence a wider confidence interval (a "bad" thing). Hence, other things equal, we face a trade-off between level of confidence and width of confidence interval. Since the width of a CI is part-determined by the standard error, by increasing n (costly) we will reduce the standard error, hence shorten the CI (a "good" thing).

CI for $\mu$ (variance unknown)
Unfortunately, the approach just discussed requires knowledge of the population variance, $\sigma^2$. This is because it is used in the standard error: $\bar{X} \pm z \frac{\sigma}{\sqrt{n}}$. In practice, we are unlikely to know $\sigma^2$. After all, it's a population characteristic, and so if we do not know $\mu$, why would we know $\sigma^2$?


Recall the sampling variance of $\bar{X}$ is

$$\text{Var}(\bar{X}) = \sigma^2_{\bar{X}} = \frac{N - n}{N - 1} \cdot \frac{\sigma^2}{n} \approx \frac{\sigma^2}{n}$$

But if $\sigma^2$ is unknown we have a problem. It is not that we are fundamentally interested in $\sigma^2$, only that we need to estimate it because the precision of $\bar{X}$ depends on it. And there is little point having a point estimate if we know nothing about its precision.


Our estimate of $\text{Var}(\bar{X})$ is

$$s^2_{\bar{X}} = \frac{N - n}{N} \cdot \frac{s^2}{n} \approx \frac{s^2}{n}$$

where

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n - 1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$$

Our estimate of the standard error is thus

$$s_{\bar{X}} = \sqrt{\frac{N - n}{N} \cdot \frac{s^2}{n}} \approx \frac{s}{\sqrt{n}}$$

since typically $\frac{N - n}{N} \approx 1$ in the social sciences. Once we have estimated this, we proceed as before to construct a CI, using the estimate of the standard error in place of the actual standard error.

So, for a 90% confidence interval we use $\bar{x} \pm 1.645 \frac{s}{\sqrt{n}}$. Similarly, for a 95% confidence interval we use $\bar{x} \pm 1.96 \frac{s}{\sqrt{n}}$. Finally, for a 99% confidence interval we use $\bar{x} \pm 2.576 \frac{s}{\sqrt{n}}$.
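The same template with an estimated standard error can be sketched as follows. The sample data are hypothetical, the function name and the optional `N` argument are illustrative, and the sketch follows the slides in using standard normal multipliers with $s$ in place of $\sigma$.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def ci_unknown_variance(sample, confidence=0.95, N=None):
    """CI for mu using s (divisor n - 1) in place of sigma.

    If a population size N is supplied, the factor (N - n) / N is applied;
    otherwise it is taken to be approximately 1.
    """
    n = len(sample)
    x_bar = mean(sample)
    s = stdev(sample)                              # sample standard deviation
    factor = (N - n) / N if N is not None else 1.0
    standard_error = sqrt(factor) * s / sqrt(n)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return x_bar - z * standard_error, x_bar + z * standard_error

# Hypothetical sample of incomes (in 000s), for illustration only.
sample = [3, 6, 4, 9, 7, 7, 5, 8, 6, 5]
print(ci_unknown_variance(sample, confidence=0.95))
print(ci_unknown_variance(sample, confidence=0.95, N=20))
```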

Choosing sample size
Note the trade-off between accuracy and data cost. Solution: fix the desired precision and find the smallest n which achieves this. If we want the sample mean to be within a tolerance d of $\mu$ with a specified probability, then

$$d = z \frac{\sigma}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{z\sigma}{d} \;\Rightarrow\; n = \frac{z^2 \sigma^2}{d^2}$$

n is the minimum sample size required to achieve the desired precision. n must be an integer, so always round up!
Choosing sample size Example

A random sample is to be taken from a population with unknown mean $\mu$ and $\sigma = 3$. How big a sample size would be needed if there is to be a 95% chance of $\bar{X}$ being within 1 unit of $\mu$? The sample size n required for a tolerance of 1 satisfies

$$1 = 1.96 \times \frac{3}{\sqrt{n}} \;\Rightarrow\; n = (1.96 \times 3)^2 = 34.57 \;\Rightarrow\; n = 35$$

Note that the required sample size in this type of calculation needs to be rounded up from a decimal fraction, since rounding down would result in a value not quite large enough!
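The worked example can be reproduced with a few lines of Python. The function name `required_sample_size` and its optional `z` argument are illustrative; passing `z=1.96` matches the rounded multiplier used in the slide, while leaving it out uses the multiplier from the standard normal distribution.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, d, confidence=0.95, z=None):
    """Smallest integer n with z * sigma / sqrt(n) <= d, i.e. n >= (z * sigma / d)^2."""
    if z is None:
        z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / d) ** 2)   # always round up

print(required_sample_size(sigma=3, d=1, z=1.96))    # worked example: sigma = 3, d = 1
print(required_sample_size(sigma=3, d=0.5, z=1.96))  # halving the tolerance
```

Halving the tolerance roughly quadruples the required sample size, because n is proportional to $1/d^2$.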
Adjusting the statistically determined sample size

Incidence rate refers to the rate of occurrence, or the percentage, of persons eligible to participate in the study. In general, if there are k qualifying factors with an incidence of $Q_1, Q_2, Q_3, \ldots, Q_k$, each expressed as a proportion:

$$\text{Incidence rate} = Q_1 \times Q_2 \times Q_3 \times \cdots \times Q_k$$

The completion rate is the percentage of qualified respondents who complete the interview, enabling researchers to account for anticipated refusals by people who qualify.

$$\text{Initial sample size} = \frac{\text{Final sample size}}{\text{Incidence rate} \times \text{Completion rate}}$$
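As a rough illustration of the adjustment, the sketch below applies the formula to hypothetical figures: a final sample size of 100, two qualifying factors with incidences 0.6 and 0.5, and a completion rate of 0.8. None of these numbers come from the slides, and rounding the result up to a whole number of contacts is an assumption made here.

```python
from math import ceil, prod

def initial_sample_size(final_size, incidence_factors, completion_rate):
    """Initial sample size = final size / (incidence rate * completion rate).

    The incidence rate is the product of the qualifying-factor proportions.
    Rounding up is an assumption of this sketch.
    """
    incidence_rate = prod(incidence_factors)
    return ceil(final_size / (incidence_rate * completion_rate))

# Hypothetical inputs, for illustration only.
print(initial_sample_size(100, [0.6, 0.5], 0.8))
```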


Adjusting for non-response

Sub-sampling of non-respondents: the researcher contacts a sub-sample of the non-respondents, usually by means of telephone or personal interviews.

In replacement, the non-respondents in the current survey are replaced with non-respondents from an earlier, similar survey. The researcher attempts to contact these non-respondents from the earlier survey and administer the current survey questionnaire to them, possibly by offering a suitable incentive.


In substitution, the researcher substitutes for non-respondents other elements from the sampling frame that are expected to respond. The sampling frame is divided into sub-groups that are internally homogeneous in terms of respondent characteristics but heterogeneous in terms of response rates. These sub-groups are then used to identify substitutes who are similar to particular non-respondents but dissimilar to respondents already in the sample.

Subjective estimates: when it is no longer feasible to increase the response rate by sub-sampling, replacement, or substitution, it may be possible to arrive at subjective estimates of the nature and effect of non-response bias. This involves evaluating the likely effects of non-response based on experience and available information.


Weighting attempts to account for non-response by assigning differential weights to the data depending on the response rates. For example, in a survey the response rates were 85%, 70% and 40%, respectively, for the high-, medium- and low-income groups. In analysing the data, these sub-groups are assigned weights inversely proportional to their response rates. That is, the weights assigned would be (100/85), (100/70) and (100/40), respectively, for the high-, medium- and low-income groups.
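The weights in this example follow directly from the response rates, as the short sketch below shows. The dictionary names are illustrative.

```python
# Response rates (in %) for each income group, as in the example.
response_rates = {"high": 85, "medium": 70, "low": 40}

# Weights inversely proportional to the response rates: 100 / rate.
weights = {group: 100 / rate for group, rate in response_rates.items()}

for group, weight in weights.items():
    print(group, round(weight, 3))
```

Each low-income respondent therefore counts 2.5 times, compensating for the fact that only 40% of that group responded.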

Imputation
Imputation involves imputing, or assigning, the characteristic of interest to the non-respondents based on the similarity of the variables available for both non-respondents and respondents. For example, a respondent who does not report brand usage may be imputed the usage of a respondent with similar demographic characteristics.

