Question: What is a statistic?
POPULATION
In population data, data from every individual in the population is
available.
Measuring the entire population is called a census.
A parameter is a measure that describes the entire population. Examples
are the mean age of every Zimbabwean on Medicare, the proportion of
Zimbabweans addicted to cigarettes, or actual voter turnout.
The total population size is symbolized by N.
Population parameters: the analyses are done by the government or on
behalf of the government.
In a census, measurements or observations from the entire population
are used.
SAMPLE
In sample data, data is only available from some of the individuals in
the population.
Sample data is very commonly used in research studies of patients.
A statistic is a measure that describes only a sample of the
population. Examples are the mean age of a sample of Zimbabweans on
Medicare, the proportion of Zimbabweans in the Behavioral Risk Factor
Surveillance Survey who admit they are addicted to cigarettes, or the
proportion of people in opinion polls who say they plan to vote.
The sample size is symbolized by n.
Sample statistics: the analyses are from a study recruiting volunteers,
and the report mentions only surveying or measuring a sample of
individuals.
In a sample, measurements or observations from part of the population
are used.
DESCRIBING VS INFERRING
DATA
Data is a systematic record of a particular quantity. It is the different
values of that quantity represented together in a set. It is a collection
of facts and figures to be used for a specific purpose such as a survey
or analysis. When arranged in an organized form, it can be called
information. The source of data (primary data, secondary data) is also
an important factor.
TYPES OF DATA
Qualitative Data: They represent some characteristics or attributes.
They depict descriptions that may be observed but cannot be
computed or calculated. For example, data on attributes such as
intelligence, honesty, wisdom, cleanliness, and creativity collected
using the students of your class as a sample would be classified as
qualitative. They are more exploratory than conclusive in nature.
Quantitative Data: These can be measured and not simply observed.
They can be numerically represented and calculations can be
performed on them. For example, data on the number of students
playing different sports from your class gives an estimate of how many
of the total students play which sport. This information is numerical
and can be classified as quantitative.
DATA COLLECTION
Primary Data - These are the data that are collected for the first time
by an investigator for a specific purpose. Primary data are ‘pure’ in the
sense that no statistical operations have been performed on them and
they are original.
Secondary Data - These are the data that are sourced from someplace
that originally collected them. This means that this kind of data has
already been collected by some researchers or investigators in the
past and is available either in published or unpublished form. This
information is 'impure' as statistical operations may already have
been performed on it.
Discrete Data: These are data that can take only certain specific
values rather than a range of values. For example, data on the blood
group of a certain population or on their genders is termed as discrete
data. A usual way to represent this is by using bar charts.
Continuous Data: These are data that can take values between a
certain range with the highest and lowest values. The difference
between the highest and lowest value is called the range of data. For
example, the age of a person can take decimal values, and so can the
heights and weights of the students of your school.
These are classified as continuous data. Continuous data can be
tabulated in what is called a frequency distribution. They can be
graphically represented using histograms.
QUANTITATIVE (CONTINUOUS)
A variable that contains quantitative data is a quantitative variable.
Quantitative refers to a numerical measurement of something. Examples
are time of admit, year of diagnosis, systolic blood pressure, platelet
count, etc.
In algebraic equations, quantitative variables are represented by
symbols (e.g., x, y, or z).
There are two types of quantitative data: interval data and ratio data.
Interval data: the differences between data values are meaningful, but
there is no true zero, e.g. time cannot have a true zero (time of admit
is 08:39 am, year of diagnosis is 2020).
Ratio data: the differences between data values are meaningful and
there is a true zero, e.g. if you are dead, you have a systolic blood
pressure of zero.
Quantitative variables can be further classified as discrete or
continuous.
Discrete variables (aka integer variables): the data represent counts
of individual items or values, e.g. number of students in a class,
number of different tree species in a forest.
Continuous variables (aka ratio variables): the data represent
measurements of continuous or non-finite values, e.g. distance, volume,
age.
QUALITATIVE (CATEGORICAL)
A variable that contains categorical data is a categorical variable.
Qualitative refers to a "quality" or categorical characteristic of
something. Examples are health insurance, country of origin, stage of
cancer, etc.
In algebraic equations, qualitative variables are represented by
symbols (e.g., X, Y, or Z).
There are two types of qualitative data: nominal data and ordinal data.
Nominal data applies to categories, labels or names, and cannot be
ordered from smallest to largest, e.g. type of insurance and country of
origin.
Ordinal data applies to data that can be arranged in order of
categories, but the differences between data values cannot be
determined or are meaningless.
Ordinal data has a natural order, but the differences between levels
are meaningless, e.g. stage of cancer.
There are three types of categorical variables: binary, nominal and
ordinal variables.
Binary variables (aka dichotomous variables): the data represent yes/no
outcomes, e.g. heads/tails in a coin flip or win/lose in a football
game.
Nominal variables: the data represent groups with no rank or order
between them, e.g. species names, colors, brands.
Ordinal variables: the data represent groups that are ranked in a
specific order, e.g. finishing place in a race or rating scale
responses in a survey.
Confounding variables
Latent variables
A variable that can’t be directly measured, but that you represent via a
proxy e.g Salt tolerance in plants cannot be measured directly, but can
be inferred from measurements of plant health in our salt-addition
experiment.
Composite variables
SAMPLING
We take a sample of the population because we want to do “inferential
statistics”
The reasons not to measure the whole population are that it is
impractical and unnecessary.
SAMPLING FRAME
Sampling frame is that part of the population from which you want to
draw a sample. Therefore, you want everyone from your sampling
frame to have a chance of being selected for your sample.
A sampling frame is a list of individuals from which a sample is
actually selected.
The list may be a physical, concrete list, e.g. a list of students
enrolled at a nursing college.
The sampling frame may be a theoretical list not made up yet, e.g. the
list of patients who will present to the emergency department today.
UNDERCOVERAGE
ERRORS IN STATISTICS
SIMULATION
CONCEPTS IN SAMPLING
TYPE OF SAMPLING AND EXPLANATION
Simple Random Sampling
A simple random sample of n measurements from a population is a subset
of the population selected in such a manner that every sample of size n
from the population has an equal chance of being selected.
Example: you have a list of the population of students in a class. You
want to take a sample of 5 (n=5). If you take a simple random sample
from the class list, all the different possible groups of 5 students
you could pick from the list have an equal chance of being the sample
(group) you actually pick.
Methods of obtaining the sample:
1) Old-fashioned hat - give every individual in the population a unique
number (e.g. student ID number), put all the numbers in a place from
which you can draw without looking (like a hat), draw 5 IDs, and use
those students as your sample.
2) Electronic hat - generate a list of random numbers as long as the
population list, randomly assign these numbers to the individuals in
the list, and take the first 5 numbers (whoever gets assigned 1 through
5), e.g. raffle tickets.
Limitations
You need a list. If you don't know who will present at the emergency
department that day, how do you sample?
You need a good list; otherwise you risk undercoverage. What if
part-time students were not on the list? This will result in
non-sampling error.
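The "electronic hat" can be sketched in a few lines of Python. This is only an illustration: the class list of 30 student IDs is made up, and `simple_random_sample` is a hypothetical helper name. It relies on the standard library's `random.sample`, which draws without replacement so that every group of n individuals is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n individuals so that every group of size n is equally likely."""
    rng = random.Random(seed)  # the seed is optional, only for reproducibility
    return rng.sample(population, n)

# Hypothetical class list of 30 student IDs (illustrative data only).
class_list = [f"S{i:03d}" for i in range(1, 31)]

sample = simple_random_sample(class_list, 5, seed=42)
print(sample)
```

Note that the sketch still needs a complete list (sampling frame), which is exactly the limitation described above.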
Stratified Sampling
The list is divided into groups, or strata. This is a way to ensure
that the final sample contains certain proportions of each group.
Steps in Stratified Sampling
Divide the entire population into distinct subgroups called strata.
The strata are based on a specific characteristic, such as age, income,
educational level, and so on.
All members of a stratum share this specific characteristic.
Draw a simple random sample from each stratum.
Examples: in a high school, sampling so many students from each grade;
in hospitals, sampling so many patients or providers from each
department (e.g. different intensive care units).
Limitations
Oversampling one group means your summary statistic is unbalanced.
It is not possible to do without a list beforehand (as with a simple
random sample).
It is also harder, because you have to split the list into groups
(strata) and then take a simple random sample from each stratum.
Useful if it is necessary to make all strata equal, or to sample from
groups that are small in the large population.
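The steps of stratified sampling can be sketched as follows. This is a minimal illustration under stated assumptions: the student list and grades are invented, `stratified_sample` is a hypothetical helper, and the same number of individuals is drawn from every stratum (the "make all strata equal" case).

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Split the population into strata, then take a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)  # group by the shared characteristic
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = min(per_stratum, len(members))  # a small stratum cannot give more than it has
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical high school: 40 students as (id, grade), 10 in each of grades 9 to 12.
students = [(f"S{i:03d}", 9 + i % 4) for i in range(40)]

picked = stratified_sample(students, lambda s: s[1], per_stratum=3, seed=1)
print(picked)
```

Drawing a fixed count per stratum oversamples small groups, so any overall summary statistic would need to account for that imbalance.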
Systematic Sampling
Systematic sampling can be done with or without a list.
Steps in Systematic Sampling
Arrange all individuals of the population in a particular order.
Pick a random individual as a start.
Then take every kth member of the population into the sample.
- kth means every so many
Example: as people enter a shop, pick every kth person entering as a
member of the sample.
Characteristics of Systematic Sampling
You cannot do this when there is a pattern in the data
(boy/girl/boy/girl).
You can do it in a clinical setting, where you do not know who is going
to come in that day.
Systematic sampling is easy to do with or without a list: just pick a
random starting point, then pick every kth individual.
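A short sketch of systematic sampling, assuming the population is already in order (here, 100 hypothetical arrival numbers). Choosing the random start among the first k individuals is an assumption made for the illustration, and `systematic_sample` is a hypothetical helper name.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Pick a random start among the first k individuals, then take every kth one."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # 0-based index of the random starting individual
    return ordered_population[start::k]

# Hypothetical data: people numbered 1..100 in the order they entered the shop.
arrivals = list(range(1, 101))

chosen = systematic_sample(arrivals, k=10, seed=7)
print(chosen)
```

If the ordering has a repeating pattern with the same period as k (such as boy/girl/boy/girl with an even k), every chosen individual falls on the same part of the pattern, which is why the method fails in that case.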
Cluster Sampling
Why use cluster sampling when you could use stratified or simple random
sampling? Because the problem is localized to a particular geographic
location.
In cluster sampling, we begin by dividing the map into geographic
areas. Then we randomly pick clusters, or areas, from the map. We take
all people in each selected cluster.
Problems with cluster sampling
Sometimes the people located in a cluster are similar in a way that
makes the problem hard to study.
If cancer rates are high all over the clusters, it's hard to see if a
geographic location is causing higher rates.
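Cluster sampling differs from stratified sampling in that whole clusters are picked at random and everyone inside them is taken. A minimal sketch, assuming an invented map of four areas; `cluster_sample` and the area names are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then take every person in each picked cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random choice of cluster names
    people = [person for name in chosen for person in clusters[name]]
    return chosen, people

# Hypothetical map divided into geographic areas (illustrative data only).
areas = {
    "North": ["a", "b", "c"],
    "South": ["d", "e"],
    "East": ["f", "g", "h", "i"],
    "West": ["j"],
}

chosen_areas, people = cluster_sample(areas, 2, seed=3)
print(chosen_areas, people)
```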
Convenience Sampling
Convenience sampling is using results or data that are conveniently or
readily obtained.
It can be used under low-risk circumstances, e.g. which ice cream is
the best from the restaurant next to the hospital? The results are not
reliable.
It can be useful if not a lot of resources are allocated to the study.
It uses an already-assembled group for surveys, e.g. asking students in
a class to fill out a survey such as "Was the homework that I gave you
last week too hard?"
Problems with convenience sampling
There is a bias in every group.
It often misses important subpopulations (which stratified sampling
addresses).
Results can be severely biased.
Avoid using convenience sampling unless the question is low risk.
Use it if it is the only type of sampling possible under the
circumstances.
It is also used when resources are low.
Multi-stage Sampling
A combination of sampling strategies layered in stages.
Example
Stage 1: cluster sample of states (two census regions)
Stage 2: simple random sample of counties (from each state)
Stage 3: stratified sample of schools (urban/rural)
Stage 4: stratified sample of classrooms
Multi-stage sampling is usually used in large, governmental studies.
INTRODUCTION TO EXPERIMENTAL
DESIGN
BASIC GUIDELINES FOR PLANNING A STATISTICAL STUDY
1. State a hypothesis
2. Identify the individuals of interest
3. Specify the variables to measure
4. Determine if you will use the entire population or a sample
If you choose a sample, choose a sampling method
5. Address ethical concerns before data collection.
6. Collect the data.
7. Use descriptive or inferential statistics to answer your hypothesis
8. Note any concerns about your data collection or analysis
Make recommendations for future studies
EXAMPLE:
It matters whether you pick a census or a sample for your study design,
because with a census you will do one kind of analysis and with a
sample you can do a different kind of statistical analysis.
Replication:
2. Truthfulness of response
Respondents may lie on purpose, e.g. if asked a question that is too
personal or too hard to think about.
Respondents may lie inadvertently, i.e. they may not remember when
asked about something that happened a long time ago, or may have
recall bias influenced by events that have happened since the
original event.
3. Hidden bias
Question wording may induce a certain response, e.g. "How long have
you been using software A?" (which assumes the respondent uses it).
Order of questions and other wording may induce a certain response,
e.g. "Do you agree with Obamacare?" versus "More people have health
insurance than ever before. Do you agree with Obamacare?"
Scales of questions may not accurately measure responses, e.g. do
your feelings always fit on a scale of 1 to 5?
4. Interviewer influence
This is important with in-person and phone surveys
Best to have interviewer from same population as research
participant
All verbal and non-verbal influences matter
5. Vague wording
Avoid vague terms in a survey, e.g. instead of asking if a person
waited a long time in the waiting room, ask for the number of
minutes.
If you must use vague terms, include grounding language, e.g. "Where
10 is extremely important, and 1 is not at all important, how
important is having a controllable lifestyle to you in your future
career? A controllable lifestyle is defined as one that allows the
physician to control the number of hours devoted to practicing
his/her specialty."
Experiments/clinical trials.
Observing and recording well-defined events (e.g., counting the
number of patients waiting in emergency at specified times of the
day).
Obtaining relevant data from management information systems.
Administering surveys with closed-ended questions (e.g., face-to-face
and telephone interviews, questionnaires etc).
Interviews
Questionnaires
[Figure: histogram of frequency by class (miles transported), with
classes 1-8, 9-16, 17-24, 25-32, 33-40, 41-48]
It is the shape that is made if you draw a line along the edges of a
histogram.
A stem-and-leaf plot of the same data will make the same shape.
STEM LEAF
2 0
3 0 2 5
4 1 1 3 7 8
5 1 3 3 4 6 7 8 8 9
6 0 2 4 5 5 9
7 1 4 7
8 8
9
10 2
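The display above can be rebuilt from raw values with a short sketch. The data list is simply read back off the stems and leaves shown; `stem_and_leaf` is a hypothetical helper that assumes non-negative integers, with the tens as the stem and the ones digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (value // 10) and leaf (value % 10)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    # Keep empty stems so that gaps in the data stay visible in the display.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

# Values read back from the stem-and-leaf display in the notes.
data = [20, 30, 32, 35, 41, 41, 43, 47, 48, 51, 53, 53, 54, 56, 57, 58, 58, 59,
        60, 62, 64, 65, 65, 69, 71, 74, 77, 88, 102]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(map(str, leaves))}")
```

Turning the printed rows on their side gives the same outline as a histogram with classes of width 10.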
TYPES OF DISTRIBUTION
Normal distribution
The normal distribution is a continuous probability distribution that is
symmetrical on both sides of the mean, so the right side of the center
is a mirror image of the left side.
The area under the normal distribution curve represents probability
and the total area under the curve sums to one.
For a perfectly normal distribution the mean, median and mode will be
the same value, visually represented by the peak of the curve.
The normal distribution is often called the bell curve because the graph
of its probability density looks like a bell. It is also known as the
Gaussian distribution, after the German mathematician Carl Friedrich
Gauss, who described it.
Uniform distribution
Bimodal distribution
A data set is bimodal if it has two modes. This means that there is not
a single data value that occurs with the highest frequency. Instead,
there are two data values that tie for having the highest frequency.
The mode is one way to measure the center of a set of data.
Sometimes the average value of a variable is the one that occurs most
often. For this reason, it is important to see if a data set is bimodal.
Instead of a single mode, we would have two.
One major implication of a bimodal data set is that it can reveal to us
that there are two different types of individuals represented in a data
set.
A histogram of a bimodal data set will exhibit two peaks or humps.
OUTLIERS
Outliers are data values that are very different from other
measurements in the dataset.
In regression analysis, a data point that diverges greatly from the
overall pattern of data is called an outlier.
An outlier is an extreme value that differs greatly from other values in
a set of values.
Cumulative Frequency
In cumulative frequency, you add the frequencies of all the classes
before the class you are on to that class's own frequency.
The cumulative frequency of the first class is always the same as its
frequency.
Each cumulative frequency is equal to or higher than the last.
In an ogive, classes go along the x-axis and cumulative frequency along
the y-axis.
Because cumulative frequency goes up from class to class, the ogive
line never falls and ends at the total frequency.
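The running total behind a cumulative frequency column (and the ogive) can be sketched with `itertools.accumulate`. The class labels come from the histogram earlier in the notes, but the frequencies here are hypothetical, chosen only for illustration.

```python
from itertools import accumulate

classes = ["1-8", "9-16", "17-24", "25-32", "33-40", "41-48"]
frequencies = [14, 21, 11, 6, 4, 4]  # hypothetical counts, not taken from the notes

# Each entry is the class's own frequency plus all the frequencies before it.
cumulative = list(accumulate(frequencies))

for label, freq, cum in zip(classes, frequencies, cumulative):
    print(f"{label:>6}  frequency={freq:>2}  cumulative={cum:>2}")
```

The first cumulative value equals the first frequency, and the last equals the total number of observations, which is where the ogive line ends.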
BAR GRAPH
Features of a bar graph: bars can be vertical or horizontal, and are of
uniform width and uniform spacing.
The length of each bar represents the variable's frequency or
percentage of occurrence.
The same measurement scale is used for each bar.
Includes a title, bar labels, and scale labels on the axis or actual
values for each bar.
PIE CHARTS
Pie charts (also called circle graphs) are used with counts of mutually
exclusive categories.
Often made in graphing programs because they are difficult to do by
hand.
Features of a pie chart
Every individual must be put in only one category.
The variable can be qualitative or quantitative.
If quantitative, the data are grouped into classes and then graphed.
The categories must be mutually exclusive, e.g. "favorite color" works
but "check the colours you like" does not.
It is more informative to show % than frequency, but it is helpful to
show both.
Always include a title and legend.
FREQUENCY TABLES
MODE
The value that occurs most frequently in the dataset is the mode.
It is possible to have no mode.
It is also possible to have more than one mode.
It can get confusing with a lot of numbers in a short range: which is
most numerous?
It is less confusing when the scale is large.
The mode does not tell you much: only the most popular answer, the most
common result.
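A small sketch of finding the mode(s) by counting. The convention that a dataset where every value occurs exactly once has "no mode" is an assumption made here for illustration, and `modes` is a hypothetical helper name.

```python
from collections import Counter

def modes(values):
    """Return all values tied for the highest frequency (empty list if no mode)."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # every value occurs once: treated here as "no mode"
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3]))     # one mode
print(modes([1, 2, 2, 3, 3]))  # two modes (bimodal)
print(modes([1, 2, 3]))        # no mode
```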
MEDIAN
Example (odd number of values): 21 33 42 62 78
The middle (3rd) value, 42, is the median.
Example (even number of values):
Position 1 2 3 4 5 6
Value    1 3 7 8 9 12
Median is (7+8)/2 = 7.5
After arranging the values in order, you have an ordered set.
For n values, take n+1 and divide by 2. If n is odd, count up that
many values, and that value is the median. If n is even, the result
ends in .5, so average the two values on either side of it.
Example 1 (odd): n=21. 21+1=22, and 22/2=11. You would count
from the beginning of the ordered set, and the 11th value in it
would be the median.
Example 2 (even): n=14. 14+1=15, and 15/2=7.5. You would count
from the beginning of the ordered set, take the 7th value and
8th value, add them together, then divide by 2 for the median.
The median tells you the 50th percentile of the data, the middle rank of
the data. The median doesn't care much about the ends of the data:
outliers don't bother it, so it is resistant and stable.
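The (n+1)/2 counting rule can be written out as a short sketch. `median_by_position` is a hypothetical helper name; it follows the rule described in the notes rather than calling a library function, and is checked against the two small examples given there.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ordered set."""
    data = sorted(values)
    n = len(data)
    position = (n + 1) / 2  # 1-based position; ends in .5 when n is even
    if n % 2 == 1:
        return data[int(position) - 1]  # odd n: the value at that position
    lower = data[n // 2 - 1]  # even n: the value just below the .5 position
    upper = data[n // 2]      # and the value just above it
    return (lower + upper) / 2

print(median_by_position([21, 33, 42, 62, 78]))
print(median_by_position([1, 3, 7, 8, 9, 12]))
```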
MEAN
Medians: very resistant to outliers; very stable.
Means: not resistant to outliers; not very stable.
A very high value or very low value (outliers) can really throw off the
mean.
This is not a problem with the median.
One solution to make the mean more resistant is to trim data off each
end so the outliers get cut off.
If you trim, you have to be fair and trim the same amount off each
side.
How to make a 5% trimmed mean
Figure out how many data points you have. Then figure out what
5% of them would be, e.g. if you have 100 data points, 5% would be
5 data points.
Put the data in order.
Remove 5% from the top and 5% from the bottom, e.g. remove the 5
top ones and the 5 bottom ones.
Now calculate the mean of the remaining data.
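The trimming steps can be sketched as below. The data are hypothetical (the numbers 1 to 99 plus one extreme outlier), `trimmed_mean` is an illustrative helper, and rounding the number of trimmed points down is an assumption for cases where 5% is not a whole number.

```python
def trimmed_mean(values, percent):
    """Mean after removing `percent` % of the points from each end of the ordered data."""
    data = sorted(values)
    cut = int(len(data) * percent / 100)  # points removed from EACH end (rounded down)
    kept = data[cut:len(data) - cut]
    return sum(kept) / len(kept)

# Hypothetical data: 1..99 plus one extreme outlier, 100 points in total.
data = list(range(1, 100)) + [10000]

plain_mean = sum(data) / len(data)
print(plain_mean)             # pulled far upward by the outlier
print(trimmed_mean(data, 5))  # outlier is cut off along with the 5 lowest values
```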
Weighted average
Sometimes, certain values should count more toward the mean than
others.
If homework is 10% of your grade, and quizzes are 20% of your grade,
the quizzes count for more than homework
You can arrange this by doing a weighted mean.
Example: homework is worth 10%, quizzes 20%, and the final 70%. You got
an A (4.0) on homework, a B+ (3.5) on quizzes, and a B (3.0) on the
final.
Non-weighted average: (4.0 + 3.5 + 3.0)/3 = 3.5
Weighted average: (4.0*0.1) + (3.5*0.2) + (3.0*0.7) = 3.2
Formula: ∑xw/∑w
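The grade example can be checked with a direct translation of ∑xw/∑w. `weighted_mean` is a hypothetical helper name; the values and weights are the ones from the example.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of value * weight divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

grades = [4.0, 3.5, 3.0]   # homework, quizzes, final
weights = [0.1, 0.2, 0.7]  # 10%, 20%, 70%

print(round(sum(grades) / len(grades), 2))      # non-weighted average
print(round(weighted_mean(grades, weights), 2))  # weighted average
```

Dividing by the sum of the weights means the weights do not have to add up to 1; weights of 10, 20 and 70 would give the same result.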
MEASURES OF VARIATION
VARIATION
Range
Variance measures how much the data vary. Think: how well does the mean
represent the data, given their spread?
Standard deviation: "standard" as in following a standard, the same;
"deviation" as in deviating, like a deviated septum.
Variance and standard deviation are friends because this is how you
calculate them:
First, calculate the variance.
Then, take the square root of the variance, and that is the standard
deviation.
The formulas for sample variance and sample standard deviation are
different from those for population variance and population standard
deviation. We don't use the population ones that often, so we will
concentrate on how to use the sample ones.
There are two different ways of writing the formula (for both sample
and population): the defining formula and the computational formula.
Both get the same results.
The computational formula can be confusing, so the simpler defining
formula is used here.
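Sample defining formulas
The variance of a sample is defined by a slightly different formula
than the population variance, and uses slightly different notation:
Sample variance: $s^2 = \sum (x - \bar{x})^2 / (n - 1)$
Population variance: $\sigma^2 = \sum (X - \mu)^2 / N$
Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$
Population standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\sum (X - \mu)^2 / N}$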
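The defining formulas translate almost line for line into code. This sketch uses a small hypothetical dataset and illustrative function names; the only difference between the two functions is the divisor (n - 1 for a sample, N for a population).

```python
import math

def sample_variance(xs):
    """Defining formula: sum of squared deviations from the mean over (n - 1)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

def population_variance(xs):
    """Defining formula: sum of squared deviations from the mean over N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

print(sample_variance(data), math.sqrt(sample_variance(data)))
print(population_variance(data), math.sqrt(population_variance(data)))
```

Treating the same numbers as a sample gives a slightly larger variance than treating them as a whole population, because of the smaller divisor.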
A single statistic – the mode, the median or the mean may not be a
model that represents the entire dataset accurately. Anytime we use a
single number to represent the data, we lose the sense of variability in
the data.
Range uses only the extreme values of a dataset and is hence very
susceptible to outliers. It is advisable to use range only for very small
distributions with no outliers.
Interquartile range is good for skewed distributions. This is because
IQR is resistant to outliers in the data. It is generally paired with
the median to describe the data.
Standard deviation is a good measure of variability for normal
distributions or distributions that aren’t terribly skewed. Paired with
mean this is a good way to describe the data.
Variance is not used much as it is represented in squared units and is
not an intuitive measure.
PERCENTILES
n = (P/100) × N
where N = number of values in the data set, P = percentile, and n =
ordinal rank of a given value (with the values in the data set sorted
from smallest to largest)
Percentiles are frequently used to understand test scores and
biometric measurements.
Example: If you test at the 77th percentile, it means you did better
than 77% of the people taking the test. If 100 people took the test, you
would have done better than 77 of them.
Rules:
Percentiles can be between 1 and 99, i.e. you can't have a -2nd
percentile or a 105th percentile.
Whatever percentile you pick, that % of values fall below the
number and 100 minus that % of values fall above the number.
Example: 20 people take a test with a maximum score of 5. The 25th
percentile means 25% of the scores fall below this score, and 75%
fall above that score.
Let's say it is an easy test, and 12 people get a 4 and the
remaining 8 get a 5. The 25th percentile, the score that cuts off the
lowest 5 test scores, will be 4 (even the 50th percentile will be 4).
This would come out very different if it were a hard test and most
people got below a score of 3.
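The rank formula n = (P/100) × N can be tried on the easy-test example. This is a sketch, not a definitive method: rounding a fractional rank up (the "nearest-rank" convention) is an assumption, since the notes do not say how to handle it, and `percentile_value` is a hypothetical helper name.

```python
import math

def percentile_value(values, p):
    """Value at ordinal rank n = (P / 100) * N in the sorted data (rank rounded up)."""
    data = sorted(values)
    rank = math.ceil(p / 100 * len(data))  # assumption: round fractional ranks up
    rank = max(rank, 1)                    # keep the rank inside the data
    return data[rank - 1]

# The easy test from the notes: 12 people score 4, the remaining 8 score 5.
scores = [4] * 12 + [5] * 8

for p in (25, 50, 75):
    print(p, percentile_value(scores, p))
```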
Quartiles
59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90,
95, 98
First, mark down the median, Q2, which in this case is the tenth
value: 75.
Q1 is the central point between the smallest score and the median.
In this case, Q1 is the middle of the first nine scores, i.e. the
fifth score: 68. [Note that the median can also be included when
calculating Q1 or Q3 for an odd set of values. If we include the
median on either side of the middle point, then Q1 is the middle
value between the first and tenth score, which is the average of the
fifth and sixth score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5.]
Q3 is the middle value between Q2 and the highest score: 84. [Or,
if you include the median, Q3 = (82 + 84)/2 = 83.]
Now that we have our quartiles, let’s interpret their numbers. A
score of 68 (Q1) represents the first quartile and is the 25th
percentile. 68 is the median of the lower half of the score set in the
available data i.e. the median of the scores from 59 to 75.
Q1 tells us that 25% of the scores are less than 68 and 75% of the
class scores are greater. Q2 (the median) is the 50th percentile
and shows that 50% of the scores are less than 75, and 50% of the
scores are above 75. Finally, Q3, the 75th percentile, reveals that
25% of the scores are greater and 75% are less than 84.
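Both ways of finding the quartiles (excluding or including the median in each half) can be reproduced with a short sketch. The scores are the 19 values listed above; `quartiles` and its `include_median` flag are hypothetical names introduced for the illustration.

```python
from statistics import median

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole set, and upper half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 0:
        lower, upper = data[:half], data[half:]
    elif include_median:
        lower, upper = data[:half + 1], data[half:]  # median counted in both halves
    else:
        lower, upper = data[:half], data[half + 1:]  # median left out of both halves
    return median(lower), median(data), median(upper)

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

print(quartiles(scores))                       # median excluded from the halves
print(quartiles(scores, include_median=True))  # median included in the halves
```

The two conventions only differ when the number of values is odd, which is why software packages can report slightly different quartiles for the same data.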
Interquartile range
The Interquartile Range or IQR describes the middle 50% of the values
when ordered from lowest to highest value.
IQR is considered a good measure of variation in skewed datasets as it
is resistant to outliers.
To calculate the IQR, we find the median of the lower and upper half of
the data. These are Quartile 1 and Quartile 3. The IQR is the difference
between Quartile 3 and Quartile 1.
IQR = Q3 – Q1
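As a quick check of IQR = Q3 - Q1, the sketch below reuses the 19 scores from the quartile example. It relies on `statistics.quantiles` from the Python standard library (available from Python 3.8); its default `'exclusive'` method happens to match the median-excluded quartiles for this dataset, but other methods or other data may give slightly different cut points.

```python
from statistics import quantiles

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q3, iqr)
```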
[Figure: ordered data at positions 1 to 8, with Q1, the median, and Q3
marked]
Box-and-Whisker Plot
Measure                                           Population parameter   Sample statistic
Mean                                              μ = ∑x/N               x̄ = ∑x/n
Variance                                          σ²                     s²
Proportion of population elements that have
a particular attribute                            P                      p
Proportion of population elements that do not
have a particular attribute                       Q = 1 - P              q = 1 - p
Correlation coefficient                           ρ                      r
Set of elements                                   X                      x