Data Management
What is Data?
Data are raw facts that become useful information when organized in a meaningful way. Data can be
qualitative or quantitative in nature.
2. Sampling Methods
a. Nonprobability sampling – is any sampling method where some elements of the population have no
chance of selection or where the probability of selection can’t be accurately determined. The selection
of elements is based on some criteria other than randomness. These conditions give rise to exclusion
bias, caused by the fact that some elements of the population are excluded. Nonprobability sampling
does not allow the estimation of sampling errors. Information about the relationship between sample
and population is limited, making it difficult to extrapolate from the sample to the population.
Example: We visit every household in a given street, and interview the first person to answer the door.
In any household with more than one occupant, this is a nonprobability sample, because some people
are more likely to answer the door (e.g. an unemployed person who spends most of their time at home
is more likely to answer than an employed housemate who might be at work when the interviewer calls)
and it’s not practical to calculate these probabilities.
In addition, nonresponse effects may turn any probability design into a nonprobability design if the
characteristics of nonresponse are not well understood, since nonresponse effectively modifies each
element’s probability of being sampled.
b. Probability Sampling – it is possible to both determine which sampling units belong to which sample
and the probability that each sample will be selected. The following sampling methods are examples of
probability sampling:
i. Simple Random Sampling (SRS), all samples of a given size have an equal probability of being
selected and selections are independent. The frame is not subdivided or partitioned. The sample
variance is a good indicator of the population variance, which makes it relatively easy to estimate
the accuracy of results.
However, SRS can be vulnerable to sampling error because the randomness of the selection may
result in a sample that doesn’t reflect the makeup of the population. For instance, a simple random
sample of ten people from a given country will on average produce five men and five women, but
any given trial is likely to overrepresent one sex and underrepresent the other. Systematic and
stratified techniques, discussed below, attempt to overcome this problem by using information
about the population to choose a more representative sample.
In some cases, investigators are interested in research questions specific to subgroups of the
population. For example, researchers might be interested in examining whether cognitive ability, as a
predictor of job performance, is equally applicable across racial groups. SRS cannot accommodate
the needs of researchers in this situation because it does not provide subsamples of the population.
Stratified sampling, which is discussed below, addresses this weakness of SRS.
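As an illustrative sketch (the numbered population frame below is hypothetical), a simple random sample can be drawn with Python's standard library:

```python
import random

# Hypothetical sampling frame: 100 numbered units.
population = list(range(1, 101))

random.seed(42)  # fixed seed so the draw is reproducible
# Simple random sampling: every subset of size 10 is equally likely
# to be chosen, and the frame is not subdivided in any way.
sample = random.sample(population, k=10)
print(sorted(sample))
```

Because `random.sample` selects without replacement and treats every unit identically, each size-10 subset of the frame has the same chance of selection.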
ii. Systematic Sampling – relies on dividing the target population into strata (subpopulations) of equal
size and then selecting randomly one element from the first stratum and corresponding elements
from all other strata. A simple example would be to select every 10th name from the telephone
directory, with the first selection being random. SRS may select a sample from the beginning of the
list. Systematic sampling helps to spread the sample over the list.
As long as the starting point is randomized, systematic sampling is a type of probability sampling.
Sampling every 10th record in this way is especially useful for efficient sampling from databases.
However, systematic sampling is especially vulnerable to periodicities in the list. Consider a street
where the odd-numbered houses are all on one side of the road, and the even-numbered houses
are all on another side. Under systematic sampling, the houses sampled will all be either odd-
numbered or even-numbered. Another drawback of systematic sampling is that even in scenarios
where it is more accurate than SRS, its theoretical properties make it difficult to quantify that
accuracy.
Systematic sampling is not SRS, because different samples of the same size have different selection
probabilities; e.g. the set {4, 14, 24, 34, …} has a one-in-ten probability of selection, but the set
{4, 13, 24, 34, …} has zero probability of selection.
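A minimal sketch of this procedure, assuming a hypothetical ordered frame of 100 units and a sampling interval of 10:

```python
import random

def systematic_sample(frame, k):
    """Select every k-th element after a random start in [0, k)."""
    start = random.randrange(k)  # randomized starting point
    return frame[start::k]

random.seed(0)
frame = list(range(1, 101))            # hypothetical ordered frame
sample = systematic_sample(frame, 10)  # every 10th unit
print(sample)
```

Note that only 10 distinct samples are possible here (one per starting point), which is why systematic sampling is not SRS even though the start is random.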
iii. Stratified Sampling – when the population embraces a number of distinct categories, the frame
can be organized by these categories into separate “strata”. Each stratum is then sampled as an
independent sub-population. Dividing the population into strata can enable researchers to draw
inferences about specific subgroups that may be lost in a more generalized random sample. Since
each stratum is treated as an independent population, different sampling approaches can be
applied to different strata. However, implementing such an approach can increase the cost and
complexity of sample selection. Example: To determine the proportions of defective products being
assembled in a factory.
A stratified sampling approach is most effective when three conditions are met:
a. Variability within strata is minimized
b. Variability between strata is maximized
c. The variables upon which the population is stratified are strongly correlated with the desired
dependent variable (beer consumption is strongly correlated with gender).
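The stratified approach can be sketched as follows; the strata sizes and the proportional-allocation rule are hypothetical choices made for illustration:

```python
import random

# Hypothetical population of 1,000 units, stratified into two groups.
strata = {
    "men":   ["m%d" % i for i in range(600)],
    "women": ["w%d" % i for i in range(400)],
}

def stratified_sample(strata, n):
    """Proportional allocation: each stratum contributes its share of n,
    sampled independently by SRS within the stratum."""
    total = sum(len(units) for units in strata.values())
    sample = []
    for units in strata.values():
        share = round(n * len(units) / total)
        sample.extend(random.sample(units, share))
    return sample

random.seed(1)
sample = stratified_sample(strata, 50)
print(len(sample))  # 30 men + 20 women = 50
```

Because each stratum is sampled independently, a different method (or a different sampling fraction) could be substituted inside the loop for any individual stratum.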
iv. Cluster Sampling – sometimes it is cheaper to ‘cluster’ the sample in some way (e.g. by selecting
respondents from certain areas only, or certain time-periods only). Cluster sampling is an example
of two-stage random sampling: in the first stage a random sample of areas is chosen; in the second
stage a random sample of respondents within those areas is selected. This works best when each
cluster is a small copy of the population.
This can reduce travel and other administrative costs. Cluster sampling generally increases the
variability of sample estimates above that of simple random sampling, depending on how the
clusters differ between themselves, as compared with the within-cluster variation. If clusters chosen
are biased in a certain way, inferences drawn about population parameters will be inaccurate.
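The two-stage procedure above can be sketched in Python; the area and household labels are hypothetical:

```python
import random

# Hypothetical frame: 10 areas (clusters), each with 20 households.
areas = {f"area{a}": [f"area{a}-hh{h}" for h in range(20)] for a in range(10)}

random.seed(2)
# Stage 1: a random sample of areas.
chosen = random.sample(sorted(areas), k=3)
# Stage 2: a random sample of respondents within each chosen area.
sample = [hh for a in chosen for hh in random.sample(areas[a], k=5)]
print(len(sample))  # 3 areas x 5 households = 15
```

The interviewer only travels to 3 of the 10 areas, which is where the cost saving comes from; the price is that all 15 respondents come from just those 3 clusters.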
v. Matched random sampling – in this method, there are two (2) samples in which the members are
clearly paired, or are matched explicitly by the researcher (for example, IQ measurements or pairs
of identical twins). Alternatively, the same attribute, or variable, may be measured twice on each
subject, under different circumstances (e.g. the milk yields of cows before and after being fed a
particular diet).
a. Control – to be able to compare effects and make inferences about associations or predictions, one
typically has to subject different groups to different conditions. Usually, an experimental unit is
subjected to a treatment while a control group is not.
b. Random Assignments
The second fundamental design principle is the randomization of the allocation of treatments (the
controlled variables) to units. Treatment effects, if present, will then be similar within each group.
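A minimal sketch of random assignment, assuming 20 hypothetical subjects split evenly into treatment and control:

```python
import random

# Hypothetical experimental units.
units = ["subject%d" % i for i in range(20)]

random.seed(3)
random.shuffle(units)    # randomize the allocation order
treatment = units[:10]   # first half receives the treatment
control = units[10:]     # second half serves as the control group
print(len(treatment), len(control))  # 10 10
```

Shuffling before splitting ensures every subject is equally likely to land in either group, so systematic differences between the groups arise only by chance.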
c. Replication
All measurements, observations, or data collected are subject to variation, as there are no
completely deterministic processes. To reduce variability, the measurements in the experiment
must be repeated. The experiment itself should also allow for replication, so that it can be
checked by other researchers.
d. Blocking – is the arranging of experimental units in groups (blocks) that are similar to one another.
Typically, a blocking factor is a source of variability that is not of primary interest to the experimenter.
An example of a blocking factor might be the sex of a patient; by blocking on sex (that is comparing
men to men and women to women), this source of variability is controlled for, thus leading to greater
precision.
i. Randomized block design – is a collection of completely randomized experiments, each run within
one of the blocks of the total experiment. A matched-pairs design is its special case, in which the
blocks consist of just two (2) elements (measurements on the same patient before and after
treatment, or measurements on two (2) different but in some way similar patients).
Chi-Square
The chi-square test is used to determine whether there is a significant difference between the expected
frequencies and the observed frequencies in one or more categories.
There are two (2) types of chi-square tests. Both use the chi-square statistic and distribution, but for
different purposes:
a. The goodness-of-fit test, which determines whether sample data are consistent with a hypothesized
distribution.
b. The test for independence, which determines whether two (2) categorical variables are related.
Example: Researchers have conducted a survey of 1600 coffee drinkers asking how much coffee they drink in
order to confirm previous studies. Previous studies have indicated that 72% of Americans drink coffee. Below
are the results of previous studies (left) and the survey (right). At α = 0.05, is there enough evidence to
conclude that the distributions are the same?
a. The null hypothesis H₀: the population frequencies are equal to the expected frequencies
b. The alternative hypothesis Hₐ: the null hypothesis is false
c. α = 0.05
d. The degrees of freedom: k − 1 = 4 − 1 = 3
e. The test statistic can be calculated using the table below:
Response        | % of Coffee Drinkers | E                 | O   | O − E | (O − E)² | (O − E)²/E
2 cups per week | 15%                  | 0.15 × 1600 = 240 | 206 | −34   | 1156     | 4.817
1 cup per week  | 13%                  | 0.13 × 1600 = 208 | 193 | −15   | 225      | 1.082
1 cup per day   | 27%                  | 0.27 × 1600 = 432 | 462 | 30    | 900      | 2.083
2+ cups per day | 45%                  | 0.45 × 1600 = 720 | 739 | 19    | 361      | 0.5014
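Using the table's values, the test statistic is the sum of the (O − E)²/E column. A short computation (the critical value 7.815 is the standard chi-square value for α = 0.05 with 3 degrees of freedom):

```python
# Expected counts (from the prior-study percentages of 1600) and
# observed counts (from the survey).
expected = [240, 208, 432, 720]
observed = [206, 193, 462, 739]

# Chi-square statistic: the sum of (O - E)^2 / E over all categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))  # 8.483

# Critical value for alpha = 0.05 with k - 1 = 3 degrees of freedom.
critical = 7.815
print(chi2 > critical)  # True: reject H0
```

Since 8.483 > 7.815, H₀ is rejected: at α = 0.05 there is enough evidence to conclude that the survey distribution differs from the distribution reported in previous studies.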
The procedure for the chi-square test for independence is essentially the same. The differences are that:
a. H₀ is that the two (2) variables are independent.
b. Hₐ is that the two (2) variables are not independent (they are dependent).
c. The expected frequency E(r, c) for the entry in row r, column c is calculated using:

   E(r, c) = (sum of row r) × (sum of column c) / (sample size)

d. The degrees of freedom: (number of rows − 1) × (number of columns − 1)
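The expected-frequency formula can be sketched for a small hypothetical 2 × 3 contingency table (the counts are invented purely for illustration):

```python
# Hypothetical 2 x 3 contingency table of observed counts.
observed = [
    [40, 50, 30],
    [10, 40, 30],
]

row_sums = [sum(row) for row in observed]        # [120, 80]
col_sums = [sum(col) for col in zip(*observed)]  # [50, 90, 60]
total = sum(row_sums)                            # 200

# E(r, c) = (sum of row r) x (sum of column c) / sample size
expected = [[r * c / total for c in col_sums] for r in row_sums]
print(expected[0])  # [30.0, 54.0, 36.0]

# Degrees of freedom: (rows - 1) x (columns - 1) = 1 x 2 = 2
```

Each expected count is the total sample size apportioned by the product of the row and column proportions, which is exactly what independence of the two variables would predict.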
Example: The results of a random sample of children with pain from musculoskeletal injuries treated with
acetaminophen, ibuprofen, or codeine are shown in the table. At α = 0.10, is there enough evidence to
conclude that the treatment and result are independent?
Example: A doctor believes that the proportions of births in this country on each day of the week are equal. A
simple random sample of 700 births from a recent year is selected, and the results are below. At a significance
level of 0.01, is there enough evidence to reject the doctor's claim?
a. The null hypothesis H₀: the population frequencies are equal to the expected frequencies
b. The alternative hypothesis Hₐ: the null hypothesis is false
c. α = 0.01
d. The degrees of freedom: k − 1 = 7 − 1 = 6
e. The test statistic can be calculated using a table:
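Since the survey's results table is not reproduced here, the computation below uses hypothetical observed counts purely to illustrate the procedure; the critical value 16.812 is the standard chi-square value for α = 0.01 with 6 degrees of freedom:

```python
# Expected count under H0: births equally likely on each of the 7 days.
total = 700
expected = total / 7                 # 100 births per day

# Hypothetical observed counts (the original results table is not
# reproduced in this handout); they sum to 700.
observed = [65, 103, 114, 116, 115, 112, 75]

chi2 = sum((o - expected) ** 2 / expected for o in observed)
critical = 16.812                    # chi-square, alpha = 0.01, df = 6
print(chi2 > critical)  # True for these counts: reject H0
```

With these illustrative counts the statistic works out to 26.8, which exceeds 16.812, so H₀ would be rejected; with the survey's actual counts the same comparison decides the question.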