
BUSINESS STATISTICS

Question: What is statistics?

 Statistics is the study of how to collect, organize, analyze, and interpret
numerical information.
 Statistics is both the science of uncertainty and the technology of
extracting information from data.

Statistics is used to help us make decisions. This is especially important in
health care and other disciplines.

INDIVIDUALS AND VARIABLES

 Individuals are the people or objects included in a study, e.g. 5 individuals
could be 5 people, 5 records, or 5 reports. Examples are kidney dialysis
patients, babies born to mothers who smoke cigarettes, towns with
hospitals, etc.
 A variable is a characteristic of the individual to be measured or
observed, e.g. the age of an individual person, the time an individual
record was entered, or the diagnosis listed on an individual report.
Examples are number of blood transfusions, birthweight, rate of
pathological gambling, rate of overdose, etc.

POPULATION PARAMETER AND SAMPLE STATISTICS
 A population is a group of people with a common theme. Example
theme: nurses who work at Chinhoyi hospital. Population: the list from
human resources of every currently employed nurse at Chinhoyi
hospital.
 A sample is a small portion of the population. Example: surveying only ICU
nurses at Chinhoyi hospital (not a representative sample) versus surveying
at least one nurse from each department (a more representative sample).

POPULATION
 In population data, data from every individual in the population is
available.
 Collecting data from the entire population is called a census.
 A parameter is a measure that describes the entire population. Examples
are the mean age of every Zimbabwean on Medicare, the proportion of
Zimbabweans addicted to cigarettes, or actual voter turnout.
 The size of the total population is symbolized by N.
 Population parameters usually come from analyses done by the government
or on behalf of the government.
 In a census, measurements or observations from the entire population are
used.

SAMPLE
 In sample data, data is only available from some of the individuals in the
population.
 Samples are very commonly used in research studies of patients.
 A statistic is a measure that describes only a sample of the population.
Examples are the mean age of a sample of Zimbabweans on Medicare, the
proportion of Zimbabweans in the Behavioral Risk Factor Surveillance Survey
who admit they are addicted to cigarettes, or the proportion of people in
opinion polls who say they plan to vote.
 The size of the sample is symbolized by n.
 Sample statistics usually come from studies that recruit volunteers, and
the report mentions surveying or measuring only a sample of individuals.
 In a sample, measurements or observations from part of the population are
used.

DESCRIBING VS INFERRING

 Descriptive statistics involves methods of organizing, picturing, and
summarizing information from samples and populations.
 Inferential statistics involves methods of using information from a
sample to draw conclusions regarding the population.

DATA
 Data is a systematic record of a particular quantity. It is the different
values of that quantity represented together in a set. It is a collection
of facts and figures to be used for a specific purpose such as a survey
or analysis. When arranged in an organized form, data can be called
information. The source of data (primary data, secondary data) is also
an important factor.

TYPES OF DATA
 Qualitative Data: They represent some characteristics or attributes.
They depict descriptions that may be observed but cannot be
computed or calculated. For example, data on attributes such as
intelligence, honesty, wisdom, cleanliness, and creativity collected
using the students of your class as a sample would be classified as
qualitative. They are more exploratory than conclusive in nature.
 Quantitative Data: These can be measured and not simply observed.
They can be numerically represented and calculations can be
performed on them. For example, data on the number of students
playing different sports from your class gives an estimate of how many
of the total students play which sport. This information is numerical
and can be classified as quantitative.

DATA COLLECTION
 Primary Data - These are the data that are collected for the first time
by an investigator for a specific purpose. Primary data are ‘pure’ in the
sense that no statistical operations have been performed on them and
they are original.

 Secondary Data - These are data sourced from a place
that originally collected them. This means that this kind of data has
already been collected by some researchers or investigators in the
past and is available either in published or unpublished form. These
data are ‘impure’ in the sense that statistical operations may already
have been performed on them.

DISCRETE AND CONTINUOUS DATA

 Discrete Data: These are data that can take only certain specific
values rather than a range of values. For example, data on the blood
group of a certain population or on their genders is termed as discrete
data. A usual way to represent this is by using bar charts.

 Continuous Data: These are data that can take any value within a
certain range, bounded by the lowest and highest values. The difference
between the highest and lowest value is called the range of the data. For
example, the age of persons can take decimal values, and the same is
true of the heights and weights of the students of your school.
These are classified as continuous data. Continuous data can be
tabulated in what is called a frequency distribution. They can be
graphically represented using histograms.

CLASSIFYING LEVELS OF MEASUREMENT


CLASSIFYING VARIABLES

QUANTITATIVE (CONTINUOUS)
 A variable that contains quantitative data is a quantitative variable.
 Quantitative refers to a numerical measurement of something. Examples are
time of admit, year of diagnosis, systolic blood pressure, platelet count,
etc.
 In algebraic equations, quantitative variables are represented by symbols
(e.g., x, y, or z).
 There are two types of quantitative data: interval data and ratio data.
 Interval data: the differences between data values are meaningful.
 Interval data: there is no true zero, e.g. time cannot have a true zero
since time of admit is 08:39 am and year of diagnosis is 2020.
 Ratio data: the differences between data values are meaningful.
 Ratio data: there is a true zero, e.g. if you are dead, you have a systolic
blood pressure of zero.
 Quantitative variables can be further classified as discrete or
continuous.
 Discrete variables (aka integer variables) - the data represent counts of
individual items or values, e.g. number of students in a class, number of
different tree species in a forest.
 Continuous variables (aka ratio variables) - the data represent
measurements of continuous or non-finite values, e.g. distance, volume, age.

QUALITATIVE (CATEGORICAL)
 A variable that contains categorical data is a categorical variable.
 Qualitative refers to a “quality” or categorical characteristic of
something. Examples are health insurance, country of origin, stage of
cancer, etc.
 In algebraic equations, qualitative variables are represented by symbols
(e.g., X, Y, or Z).
 There are two types of qualitative data: nominal data and ordinal data.
 Nominal data applies to categories, labels, or names, and cannot be
ordered from smallest to largest, e.g. type of insurance and country of
origin.
 Ordinal data applies to data that can be arranged in order of categories,
but the differences between data values cannot be determined or are
meaningless.
 Ordinal data has a natural order, but the differences between levels are
meaningless, e.g. stage of cancer.
 There are three types of categorical variables: binary, nominal, and
ordinal variables.
 Binary variables (aka dichotomous variables) - the data represent yes/no
outcomes, e.g. heads/tails in a coin flip or win/lose in a football game.
 Nominal variables - the data represent groups with no rank or order
between them, e.g. species names, colors, brands.
 Ordinal variables - the data represent groups that are ranked in a specific
order, e.g. finishing place in a race or rating scale responses in a
survey.

INDEPENDENT VARIABLES
 You manipulate the independent variable, the one you think might be the
cause.
 Independent variables (aka treatment variables) are variables you
manipulate in order to affect the outcome of an experiment, e.g. in a salt
tolerance experiment, the amount of salt added to each plant’s water.

DEPENDENT VARIABLES
 Then you measure the dependent variable, the one you think might be the
effect.
 Dependent variables (aka response variables) are variables that represent
the outcome of the experiment, e.g. in a salt tolerance experiment, any
measurement of plant health and growth: in this case, plant height and
wilting.

CONTROL VARIABLES
 They are held constant.
 Control variables are variables that are held constant throughout the
experiment, e.g. in a salt tolerance experiment, the temperature and light
in the room the plants are kept in, and the volume of water given to each
plant.

OTHER COMMON TYPES OF VARIABLES

Confounding variables

 A variable that hides the true effect of another variable in your
experiment. This can happen when another variable is closely related
to a variable you are interested in, but you haven’t controlled it in your
experiment, e.g. pot size and soil type might affect plant survival as
much or more than salt additions. In an experiment you would control
these potential confounders by holding them constant.

Latent variables

 A variable that can’t be directly measured, but that you represent via a
proxy e.g Salt tolerance in plants cannot be measured directly, but can
be inferred from measurements of plant health in our salt-addition
experiment.

Composite variables

 A variable that is made by combining multiple variables in an
experiment. These variables are created when you analyze data, not
when you measure it, e.g. the three plant health variables could be
combined into a single plant-health score to make it easier to present
your findings.

SAMPLING
 We take a sample of the population because we want to do “inferential
statistics”.
 We do not measure the whole population because doing so is usually
impractical and unnecessary.

SAMPLING FRAME

 The sampling frame is the part of the population from which you want to
draw a sample. Therefore, you want everyone in your sampling
frame to have a chance of being selected for your sample.
 The sampling frame is the list of individuals from which a sample is
actually selected.
 The list may be a physical, concrete list, e.g. the list of students
enrolled at a nursing college.
 The sampling frame may be a theoretical list not made up yet, e.g. the list
of patients who will present to the emergency department today.

UNDERCOVERAGE

 Undercoverage is the omission of population members from the sampling
frame.
 This happens when, for example, a list of nursing students does not include
everyone for administrative reasons, or when people who present to the
emergency department at night differ from those who present in the day.

ERRORS IN STATISTICS

FACT-OF-LIFE ERROR: SAMPLING ERROR
 The population mean will probably be different from your sample mean.
 The population percentage will probably be different from your sample
percentage.
 Sampling error is caused by the fact that, regardless of what you do, your
sample will not perfectly represent the population.

ERROR YOU WANT TO AVOID: NON-SAMPLING ERROR
 Using a bad list.
 Pay careful attention that everyone in the population who is supposed to
be represented in your sampling frame is in there!
 Non-sampling error is caused by poor sample design, sloppy data
collection, inaccurate measurement instruments, bias in data collection,
and other problems introduced by the researcher.

SIMULATION

 A simulation is a numerical facsimile or representation of a real-world
phenomenon.
 It is essentially working through a pretend situation to see how it
would come out if it were real.
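
As an illustration only, the sketch below works through a pretend situation: it builds a made-up population of patient ages, draws many samples from it, and shows that the sample means scatter around the population mean (sampling error). The population, the sample size of 30, the 200 repeats, and the function name simulate_sample_means are all assumptions chosen for the example, not part of the notes.

```python
import random
import statistics


def simulate_sample_means(population, n, repeats, seed=0):
    """Draw `repeats` random samples of size n and return the mean of each."""
    rng = random.Random(seed)  # fixed seed so the pretend run is repeatable
    return [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]


# Hypothetical population: ages of 1,000 pretend patients (made-up numbers).
_rng = random.Random(42)
population = [_rng.randint(18, 90) for _ in range(1000)]

population_mean = statistics.mean(population)
sample_means = simulate_sample_means(population, n=30, repeats=200)

print(f"population mean:      {population_mean:.2f}")
print(f"smallest sample mean: {min(sample_means):.2f}")
print(f"largest sample mean:  {max(sample_means):.2f}")
```

Each run of 30 patients gives a slightly different mean, which is the fact-of-life error described above.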

CONCEPTS IN SAMPLING

 It is important to do your best to avoid non-sampling error.
 This is achieved by making sure you do not have undercoverage when
sampling from your sampling frame.

TYPE OF SAMPLING AND EXPLANATION

Simple Random Sampling
 A simple random sample of n measurements from a population is a subset
of the population selected in such a manner that every sample of size n
from the population has an equal chance of being selected.
 Example: you have a list of the population of students in a class. You
want to take a sample of 5 (n=5). If you take a simple random sample from
the class list, all the different possible groups of 5 students you could
pick from the list have an equal chance of being the sample (group) you
actually pick.
 Methods of obtaining the sample are 1) Old-fashioned hat - number all of
the individuals in the population with a unique number, e.g. a student ID
number, put all the student ID numbers in a place from which you can draw
without looking (like a hat), draw 5 IDs, and use those students as your
sample. 2) Electronic hat - generate a list of random numbers as long as
the list of the population, randomly assign these numbers to the
population in the list, and take the first 5 numbers (whoever gets
assigned 1 through 5), like a raffle ticket.
 Limitations
 You need a list. If you don’t know who will present at the emergency
department that day, how do you sample?
 You need a good list, otherwise you risk undercoverage. What if
part-time students were not on the list? This will result in
non-sampling error.
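
The "electronic hat" can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical class list of 30 student IDs; the function name simple_random_sample and the ID format are invented for the example. It relies on the standard library's random.sample, which picks n distinct items so that every group of size n is equally likely.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Return n distinct individuals drawn at random from the list."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical class list of 30 student ID numbers (assumed for illustration).
class_list = [f"S{i:03d}" for i in range(1, 31)]

sample = simple_random_sample(class_list, n=5, seed=1)
print(sample)
```

Note that the code needs the full class list up front, which is exactly the first limitation listed above.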
Stratified Sampling
 In stratified sampling, the list is divided into groups, or strata.
 This is a way to ensure that there are certain proportions of groups in
the final sample.
 Steps in Stratified Sampling
 Divide the entire population into distinct subgroups called strata.
 The strata are based on a specific characteristic, such as age, income,
educational level, and so on.
 All members of a stratum share this specific characteristic.
 Draw a simple random sample from each stratum.
 Examples: in a high school, sampling so many students from each grade; in
hospitals, sampling so many patients or providers from each department
(e.g. different intensive care units).
 Limitations
 Oversampling one group means your summary statistic is unbalanced.
 It is not possible to do without a list beforehand (as with simple
random sampling).
 It is also hard because you have to split the list into groups (strata),
then draw a simple random sample from each stratum.
 Useful if it is necessary to make all strata equal, or to sample from
groups that are small in the large population.
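
The stratified steps can be sketched as follows. This is only an illustration under stated assumptions: the nurse list, the department names, and the helper stratified_sample are hypothetical, and the sketch takes the same number from every stratum (or the whole stratum if it is smaller), which is one possible allocation rule among several.

```python
import random
from collections import defaultdict


def stratified_sample(individuals, stratum_of, n_per_stratum, seed=None):
    """Split individuals into strata, then draw a simple random sample from each."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in individuals:
        strata[stratum_of(person)].append(person)
    sample = {}
    for name, members in strata.items():
        k = min(n_per_stratum, len(members))  # a small stratum is taken whole
        sample[name] = rng.sample(members, k)
    return sample


# Hypothetical nurses as (id, department) pairs; all values are made up.
departments = ["ICU"] * 10 + ["Maternity"] * 6 + ["Surgery"] * 2
nurses = list(enumerate(departments))

picked = stratified_sample(nurses, lambda nurse: nurse[1], n_per_stratum=3, seed=2)
for department, members in picked.items():
    print(department, members)
```

Taking 3 from each department over-represents the small Surgery stratum relative to its share of the population, which is the oversampling trade-off noted in the limitations.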
Systematic Sampling
 Systematic sampling can be done with or without a list.
 Steps in Systematic Sampling
 Arrange all individuals of the population in a particular order.
 Pick a random individual as a start.
 Then take every kth member of the population into the sample
(kth means every so many).
 Example: as people enter a shop, pick every kth person entering the
shop for the sample.
 Characteristics of Systematic Sampling
 You cannot do this when there is a pattern to the data
(boy/girl/boy/girl).
 You can do it in a clinical setting, where you do not know who is going
to come in that day.
 Systematic sampling is easy to do with or without a list: just pick a
random starting point, then pick every kth individual.
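
A minimal sketch of these steps, assuming the individuals are already in arrival order. The function name systematic_sample and the list of 100 arrivals are invented for the example, and choosing the random start among the first k individuals is one common convention rather than something the notes specify.

```python
import random


def systematic_sample(ordered_individuals, k, seed=None):
    """Pick a random start among the first k, then take every kth individual."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # index 0 .. k-1
    return ordered_individuals[start::k]


# Hypothetical arrival order: person 1 is the first to enter, person 100 the last.
arrivals = list(range(1, 101))

chosen = systematic_sample(arrivals, k=10, seed=3)
print(chosen)
```

If the arrivals alternated in a fixed pattern with the same period as k, every chosen person would fall at the same point of the pattern, which is why patterned data rules this method out.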
Cluster Sampling
 Why use cluster sampling when you could use stratified or simple random
sampling?
 Because the problem is localized to a particular geographic location.
 In cluster sampling, we begin by dividing the map into geographic areas.
 Then we randomly pick clusters, or areas, from the map. We take all
people in each chosen cluster.
 Problems with cluster sampling
 Sometimes the people located in a cluster are similar in a way that
makes the problem hard to study.
 If cancer rates are high all over the clusters, it’s hard to see if a
geographic location is causing higher rates.
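
As a rough sketch of the procedure, the code below randomly picks whole areas and then keeps everyone in them. The area names, the residents, and the function cluster_sample are hypothetical and serve only to illustrate the idea.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters, then take every person in each one."""
    rng = random.Random(seed)
    chosen_areas = rng.sample(sorted(clusters), n_clusters)
    people = [person for area in chosen_areas for person in clusters[area]]
    return chosen_areas, people


# Hypothetical map divided into four areas, each with its residents.
areas = {
    "North": ["n1", "n2", "n3"],
    "South": ["s1", "s2"],
    "East": ["e1", "e2", "e3", "e4"],
    "West": ["w1"],
}

chosen_areas, people = cluster_sample(areas, n_clusters=2, seed=4)
print(chosen_areas)
print(people)
```

Unlike stratified sampling, nobody from the unchosen areas appears in the sample at all.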
Convenience Sampling
 Convenience sampling is using results or data that are conveniently or
readily obtained.
 It can be used under low-risk circumstances, e.g. which ice cream is the
best from the restaurant next to the hospital? The results are not
reliable.
 It can be useful if not a lot of resources are allocated to the study.
 It uses an already-assembled group for surveys, e.g. asking students in a
class to fill out a survey such as “Was the homework that I gave you last
week too hard?”
 What are the problems with convenience sampling?
 There is a bias in every group.
 It often misses important subpopulations (which stratified sampling
addresses).
 Results can be severely biased.
 Avoid using convenience sampling unless the question is low risk.
 Use it if it is the only type of sampling possible under the
circumstances.
 It is also used when resources are low.
Multi-stage Sampling
 A combination of sampling strategies layered in stages.
 Example
 Stage 1: cluster sample of states (two census regions)
 Stage 2: simple random sample of counties (from each state)
 Stage 3: stratified sample of schools (urban/rural)
 Stage 4: stratified sample of classrooms
 Multi-stage sampling is usually used in large governmental studies.

INTRODUCTION TO EXPERIMENTAL
DESIGN
BASIC GUIDELINES FOR PLANNING A STATISTICAL STUDY

1. State a hypothesis
2. Identify the individuals of interest
3. Specify the variables to measure
4. Determine if you will use the entire population or a sample
 If you choose a sample, choose a sampling method
5. Address ethical concerns before data collection.
6. Collect the data.
7. Use descriptive or inferential statistics to test your hypothesis
8. Note any concerns about your data collection or analysis
 Make recommendations for future studies

EXAMPLE:

 Hypothesis: Air pollution causes asthma in children who live in urban
settings.
 Individuals: Children in urban settings
 Variables: Air pollution and asthma
 Either collect data or use an existing dataset, e.g. you can use a
government dataset for population measures.
 Can collect data from a sample for estimates
 Need to choose a sampling approach
 Will need consent if the study is legally found to be “human research”
 May need consent from parents to collect data about children.

 It matters whether you pick a census or a sample for your study design:
with a census you will do one kind of analysis, and with a sample you will
do a different kind of statistical analysis.

EXPERIMENT
 A treatment or intervention is deliberately assigned to individuals.
 The purpose is to study the possible effect of the treatment or
intervention on the variables measured.

OBSERVATIONAL STUDY
 Observations and measurements of individuals are taken.
 However, no treatment or intervention is assigned by the researcher.

Replication:

 Studies must be done rigorously enough to be replicated.
 Replicating the results of observational studies and experiments is
necessary for science to progress.

AVOIDING BIAS IN SURVEY DESIGN

 Surveys can provide a lot of useful information.
 However, it is important that all aspects of survey design and
administration minimize bias.
 Several considerations should be made:
1. Non-response and voluntary response
 If many people refuse your survey, the people who do complete it
are likely to have biased opinions.
 There may be a reason they do not complete your survey that has
to do with how they feel about your survey topic.

2. Truthfulness of response
 Respondents may lie on purpose, e.g. if asked a question that is too
personal or too hard to think about.
 Respondents may lie inadvertently, i.e. they may not remember when
asked about something that happened a long time ago, or may
have recall bias influenced by events that have happened since the
original event.

3. Hidden bias
 Question wording may induce a certain response, e.g. “How long have
you been using software A?”
 Order of questions and other wording may induce a certain
response, e.g. “Do you agree with Obamacare?” versus “More people have
health insurance than ever before. Do you agree with Obamacare?”
 Scales of questions may not accurately measure responses, e.g. do
your feelings always fit on a scale of 1 to 5?

4. Interviewer influence
 This is important with in-person and phone surveys
 Best to have interviewer from same population as research
participant
 All verbal and non-verbal influences matter

5. Vague wording
 Avoid vague terms in a survey, e.g. instead of asking if a person
waited a long time in the waiting room, ask for the number of
minutes.
 If you must use vague terms, include grounding language, e.g. “Where
10 is extremely important, and 1 is not at all important, how
important is having a controllable lifestyle to you in your future
career? A controllable lifestyle is defined as one that allows the
physician to control the number of hours devoted to practicing
his/her specialty.”

METHODS OF COLLECTING DATA


Quantitative Data collection methods
 Quantitative data collection methods rely on random
sampling and structured data collection instruments that fit diverse
experiences into predetermined response categories. They produce
results that are easy to summarize, compare, and generalize.

 Quantitative research is concerned with testing hypotheses derived
from theory and/or being able to estimate the size of a phenomenon of
interest. Depending on the research question, participants may be
randomly assigned to different treatments. If this is not feasible, the
researcher may collect data on participant and situational
characteristics in order to statistically control for their influence on the
dependent, or outcome, variable. If the intent is to generalize from the
research participants to a larger population, the researcher will employ
probability sampling to select participants.

 Typical quantitative data gathering strategies include:

 Experiments/clinical trials.
 Observing and recording well-defined events (e.g., counting the
number of patients waiting in emergency at specified times of the
day).
 Obtaining relevant data from management information systems.
 Administering surveys with closed-ended questions (e.g., face-to-face
and telephone interviews, questionnaires etc).

Interviews

 In Quantitative research (survey research), interviews are more
structured than in Qualitative research.
 In a structured interview, the researcher asks a standard set of
questions and nothing more. (Leedy and Ormrod, 2001)
 Face-to-face interviews: have a distinct advantage of enabling
the researcher to establish rapport with potential participants and
therefore gain their co-operation. These interviews yield the highest
response rates in survey research. They also allow the researcher
to clarify ambiguous answers and, when appropriate, seek follow-up
information. Disadvantages are that they are impractical when large
samples are involved, and that they are time consuming and expensive.
(Leedy and Ormrod, 2001)

 Telephone interviews: are less time consuming and less
expensive, and the researcher has ready access to anyone on the
planet who has a telephone. A disadvantage is that the response
rate is not as high as for the face-to-face interview, although it is
considerably higher than for the mailed questionnaire. The sample may
also be biased to the extent that people without phones are part of the
population about whom the researcher wants to draw inferences.
 Computer Assisted Personal Interviewing (CAPI): is a form of
personal interviewing, but instead of completing a questionnaire,
the interviewer brings along a laptop or hand-held computer to
enter the information directly into the database. This method saves
time involved in processing the data, as well as saving the
interviewer from carrying around hundreds of questionnaires.
However, this type of data collection method can be expensive to
set up and requires that interviewers have computer and typing
skills.

Questionnaires

 Paper-pencil questionnaires: can be sent to a large number of
people and save the researcher time and money. People are more
truthful when responding to questionnaires, regarding
controversial issues in particular, because their responses
are anonymous. But they also have drawbacks. The majority of the
people who receive questionnaires don't return them, and those who
do might not be representative of the originally selected sample.
(Leedy and Ormrod, 2001)

 Web-based questionnaires: A new and inevitably growing
methodology is the use of Internet-based research. This would mean
receiving an e-mail on which you would click on an address that
would take you to a secure web-site to fill in a questionnaire. This
type of research is often quicker and less detailed. Some
disadvantages of this method include the exclusion of people who
do not have a computer or are unable to access a computer. Also,
the validity of such surveys is in question, as people might be in a
hurry to complete them and so might not give accurate responses.
(http://www.statcan.ca/english/edu/power/ch2/methods/methods.htm)

 Questionnaires often make use of checklists and rating scales. These
devices help simplify and quantify people's behaviors and attitudes. A
checklist is a list of behaviors, characteristics, or other entities that the
researcher is looking for. Either the researcher or survey participant
simply checks whether each item on the list is observed, present or
true or vice versa. A rating scale is more useful when a behavior needs
to be evaluated on a continuum. They are also known as Likert scales.
(Leedy and Ormrod, 2001)

Qualitative Data collection methods

 Qualitative data collection methods play an important role in impact
evaluation by providing information useful to understand the processes
behind observed results and assess changes in people’s perceptions of
their well-being. Furthermore, qualitative methods can be used to
improve the quality of survey-based quantitative evaluations by
helping generate evaluation hypotheses, strengthening the design of
survey questionnaires, and expanding or clarifying quantitative
evaluation findings.
 These methods are characterized by the following attributes:
 they tend to be open-ended and have less structured protocols (i.e.,
researchers may change the data collection strategy by adding,
refining, or dropping techniques or informants)
 they rely more heavily on interactive interviews; respondents may
be interviewed several times to follow up on a particular issue,
clarify concepts or check the reliability of data
 they use triangulation to increase the credibility of their findings
(i.e., researchers rely on multiple data collection methods to check
the authenticity of their results)
 generally, their findings are not generalizable to any specific
population, rather each case study produces a single piece of
evidence that can be used to seek general patterns among different
studies of the same issue
 Regardless of the kinds of data involved, data collection in a qualitative
study takes a great deal of time. The researcher needs to record any
potentially useful data thoroughly, accurately, and systematically,
using field notes, sketches, audio tapes, photographs and other
suitable means. The data collection methods must observe the ethical
principles of research.
 The qualitative methods most commonly used in evaluation can be
classified in three broad categories:
 in-depth interview
 observation methods
 document review

FREQUENCY HISTOGRAMS AND DISTRIBUTIONS
FREQUENCY HISTOGRAM

 It is a specific type of bar chart made from data in a frequency table.
 A frequency table contains the frequencies and relative frequencies used
to draw frequency histograms and relative frequency histograms.
 The purpose of the chart is to identify or reveal the distribution of the
data.
 After making a frequency table, it is important to also make a
frequency histogram and/or a relative frequency histogram.

Steps to follow to draw a Frequency Histogram

 Make a frequency table

Class limits     Frequency   Relative    Cumulative
(lower-upper)                frequency   frequency
1-8 miles        14          0.23        14
9-16 miles       21          0.35        35
17-24 miles      11          0.18        46
25-32 miles      6           0.10        52
33-40 miles      4           0.07        56
41-48 miles      4           0.07        60
Total            60          1.00

 Draw a vertical line for the y-axis.

[Figure: frequency histogram. y-axis: frequency of patients (0 to 30); x-axis: class (miles transported), with classes 1-8, 9-16, 17-24, 25-32, 33-40, 41-48. Annotation: this is a distribution.]

 Write frequency of …………….. along the y-axis
 Draw a horizontal line for the x-axis
 Write the classes below the x-axis and label them
 For the first class, find the frequency in the table. Look for it on the
y-axis and draw a horizontal line.
 Draw the two vertical lines down to make a bar
 Repeat for all the other classes
 Colour in the bars
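
The relative and cumulative frequency columns of the table, and a rough text version of the histogram, can be reproduced with a short sketch. The class labels and frequencies are copied from the frequency table; the helper text_histogram is an invented name, and it draws the bars sideways (one '#' per patient) rather than as vertical bars, purely to keep the example free of plotting libraries.

```python
from itertools import accumulate

# Classes and frequencies copied from the frequency table.
classes = ["1-8", "9-16", "17-24", "25-32", "33-40", "41-48"]
frequencies = [14, 21, 11, 6, 4, 4]

total = sum(frequencies)
relative = [round(f / total, 2) for f in frequencies]  # frequency / total
cumulative = list(accumulate(frequencies))             # running total


def text_histogram(labels, counts):
    """Return one line per class, using one '#' per patient as the bar."""
    width = max(len(label) for label in labels)
    return [f"{label:>{width}} | {'#' * count} {count}"
            for label, count in zip(labels, counts)]


for line in text_histogram(classes, frequencies):
    print(line)
print("relative:  ", relative)
print("cumulative:", cumulative)
```

The last cumulative value equals the total of 60, which is a quick check that no class was skipped.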

RELATIVE FREQUENCY HISTOGRAM

 In the relative frequency histogram, the relative frequency goes on the
y-axis.
 The chart takes on a similar pattern.
 Relative frequency is better for comparing two populations or samples.
 On the diagram above, you just change the y-axis label to relative
frequency of patients and change the numbering.
DISTRIBUTION

 It is the shape that is made if you draw a line along the edges of a
histogram.
 A stem-and-leaf display of the same data will make the same shape.

STEM LEAF

2 0
3 0 2 5
4 1 1 3 7 8
5 1 3 3 4 6 7 8 8 9
6 0 2 4 5 5 9
7 1 4 7
8 8
9
10 2
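
A stem-and-leaf display like the one above can be built by splitting each value into a stem (the tens) and a leaf (the units). In this sketch the data values are read back off the display itself, so they are an assumption about the underlying data set, and the function name stem_and_leaf is invented for the example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (tens) to its sorted leaves (units), keeping empty stems."""
    leaves = defaultdict(list)
    for value in sorted(values):
        leaves[value // 10].append(value % 10)
    stems = range(min(leaves), max(leaves) + 1)
    return {stem: leaves.get(stem, []) for stem in stems}


# Values read back from the display above (assumed to be the original data).
data = [20, 30, 32, 35, 41, 41, 43, 47, 48,
        51, 53, 53, 54, 56, 57, 58, 58, 59,
        60, 62, 64, 65, 65, 69, 71, 74, 77, 88, 102]

display = stem_and_leaf(data)
for stem, stem_leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in stem_leaves)}")
```

Keeping the empty stem 9 in the output preserves the gap before 102, so the printed rows trace the same shape as a histogram turned on its side.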

TYPES OF DISTRIBUTION

1. Normal distribution (also called mound-shaped symmetrical)
2. Uniform distribution
3. Skewed left distribution
4. Skewed right distribution
5. Bimodal distribution

Normal distribution
 The normal distribution is a continuous probability distribution that is
symmetrical on both sides of the mean, so the right side of the center
is a mirror image of the left side.
 The area under the normal distribution curve represents probability
and the total area under the curve sums to one.
 For a perfectly normal distribution the mean, median and mode will be
the same value, visually represented by the peak of the curve.
 The normal distribution is often called the bell curve because the graph
of its probability density looks like a bell. It is also known as the
Gaussian distribution, after the German mathematician Carl Friedrich
Gauss.

Uniform distribution

 In statistics, a type of probability distribution in which all outcomes are
equally likely; each outcome has the same probability of occurring.
 The uniform distribution can be visualized as a straight horizontal line,
so for a coin flip returning a head or tail, both have a probability p =
0.50 and would be depicted by a line from the y-axis at 0.50.
 There are two types of uniform distributions: discrete and continuous.
The possible results of rolling a die provide an example of a discrete
uniform distribution: it is possible to roll a 1, 2, 3, 4, 5 or 6, but it is not
possible to roll a 2.3, 4.7 or 5.5. Therefore, the roll of a die generates a
discrete distribution with p = 1/6 for each outcome.
 Uniform Distribution. Suppose the random variable X can assume k
different values x_1, x_2, ..., x_k. Suppose also that P(X = x_i) is the
same for every i.

Then, P(X = x_i) = 1/k for each i = 1, 2, ..., k


 The term "uniform distribution" is also used to describe the shape of a
graph that plots observed values in a set of data. Graphically, when
the observed values in a set of data are equally spread across the
range of the data set, the distribution is also called a uniform
distribution. Graphically, a uniform distribution has no distinct peaks.
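
The rule that each of k equally likely outcomes has probability 1/k can be checked with a small sketch. The function name discrete_uniform is invented for the example; exact fractions are used so that the probabilities add up to exactly 1.

```python
from fractions import Fraction


def discrete_uniform(outcomes):
    """Give each of the k outcomes the same probability 1/k."""
    k = len(outcomes)
    return {outcome: Fraction(1, k) for outcome in outcomes}


die = discrete_uniform([1, 2, 3, 4, 5, 6])   # k = 6
coin = discrete_uniform(["head", "tail"])    # k = 2

print("P(roll = 3) =", die[3])
print("P(head)     =", coin["head"])
print("total probability for the die =", sum(die.values()))
```

Plotting these probabilities against the outcomes would give the flat horizontal line described above.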

Skewed left distribution

 In statistics, a negatively skewed (also known as left-skewed)


distribution is a type of distribution in which more values are
concentrated on the right side (tail) of the distribution graph while the
left tail of the distribution graph is longer.
 Unlike normally distributed data, where all measures of central
tendency (mean, median, and mode) equal each other, in negatively
skewed data the measures are spread apart. The general relationship
between the central tendency measures in a negatively skewed
distribution may be expressed using the following inequality:

Mode > Median > Mean


 In negatively skewed distributions, the arithmetic mean is generally
located to the left of the peak of the distribution.
 Data with significant negative skewness may not be suitable for
thorough statistical analysis, because high skewness can lead to
misleading results from statistical tests. For this reason, negatively
skewed data are often transformed to bring them closer to the normal
distribution. The statistical tests are usually run only when the
transformation of the data is complete.

Skewed right distribution


 In statistics, a positively skewed (or right-skewed) distribution is a type
of distribution in which most values are clustered on the left side of
the distribution while the right tail of the distribution is longer.
 Unlike normally distributed data, where all measures of central
tendency (mean, median, and mode) equal each other, in positively
skewed data the measures are spread apart. The general relationship
among the central tendency measures in a positively skewed
distribution may be expressed using the following inequality:

Mean > Median > Mode


 In contrast with a negatively skewed distribution, in which the mean is
located to the left of the peak of the distribution, in a positively skewed
distribution the mean can be found to the right of the distribution's
peak.
 Because a high level of skewness can generate misleading results from
statistical tests, extreme positive skewness is not desirable for a
distribution. To overcome this problem, data transformation
tools may be employed to make the skewed data closer to the normal
distribution.
 For positively skewed distributions, the most popular transformation is
the log transformation. The log transformation involves calculating
the natural logarithm of each value in the dataset. The method
reduces the skew of a distribution. Statistical tests are usually
run only when the transformation of the data is complete.
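
The effect of the log transformation can be illustrated with a small Python sketch. The data set is invented for this example, and the skewness function uses one common moment-based definition (the average cubed standardized deviation); the notes do not specify a formula, so treat it as an assumption.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness (an assumed definition): positive means right-skewed."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((x - mean) / sd) ** 3 for x in values) / len(values)

# Hypothetical right-skewed data: mostly small values plus a few very large ones.
data = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]

# Log transformation: natural logarithm of each value (values must be positive).
logged = [math.log(x) for x in data]

print("mean:", statistics.fmean(data), "median:", statistics.median(data))
print("skewness before:", skewness(data))
print("skewness after: ", skewness(logged))
```

In this sample the mean sits well above the median, as expected for right-skewed data, and the skewness of the logged values is smaller than that of the raw values.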

Bimodal distribution
 A data set is bimodal if it has two modes. This means that there is not
a single data value that occurs with the highest frequency. Instead,
there are two data values that tie for having the highest frequency.
 The mode is one way to measure the center of a set of data.
Sometimes the average value of a variable is the one that occurs most
often. For this reason, it is important to see if a data set is bimodal.
Instead of a single mode, we would have two.
 One major implication of a bimodal data set is that it can reveal to us
that there are two different types of individuals represented in a data
set.
 A histogram of a bimodal data set will exhibit two peaks or humps.

OUTLIERS

 Outliers are data values that are very different from other
measurements in the dataset.
 In regression analysis, a data point that diverges greatly from the
overall pattern of data is called an outlier.
 An outlier is an extreme value that differs greatly from other values in
a set of values.
Cumulative Frequency
 In cumulative frequency, you add the frequency of the class you are on
to the frequencies of all the classes before it
 The cumulative frequency of the first class is always the same as its frequency.
 Each cumulative frequency is equal to or higher than the last
 An ogive plots classes along the x-axis and cumulative frequency along the y-axis
 Because cumulative frequency goes up from class to class, the ogive
line never falls and ends at the total frequency
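
A running total is all that cumulative frequency needs. The sketch below uses invented class frequencies and the standard-library function itertools.accumulate.

```python
from itertools import accumulate

# Hypothetical frequencies for five classes, listed in class order.
frequencies = [4, 7, 10, 6, 3]

# Each entry adds the current class frequency to all the classes before it.
cumulative = list(accumulate(frequencies))

for f, c in zip(frequencies, cumulative):
    print(f"frequency {f:2d}  cumulative {c:2d}")
```

The first cumulative value equals the first frequency, and the last one equals the total number of data points.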

TIME SERIES GRAPHS


 Time series data are made up of measurements of the same variable for the
same individual taken at intervals over a period of time. A time series
graph puts time on the horizontal axis and the data of interest on the
vertical axis.
 Example: stock market prices
 Example: yearly rates of diseases such as influenza
 Time series graphs are useful for understanding trends over time
 Graphing more than one set of time series data on one graph can help
in comparing the difference between the data sets

BAR GRAPH
 Features of a bar graph are that the bars can be vertical or horizontal,
and are of uniform width and uniform spacing
 Length of each bar represents the variable’s frequency or percentage of
occurrence
 Same measurement scale is used for each bar
 Includes title, bar labels, and scale labels on the axis or actual values for
each bar

What is the difference between bar graph and histogram?

 The frequency histogram and relative frequency histogram are special
cases of a bar graph
 They are bar graphs that:
 Must have classes of a quantitative variable on the x-axis
 Must have frequency or relative frequency on the y-axis
PARETO CHART

 The height of the bar indicates the frequency of an event


 Arranged left to right according to decreasing height
 Meant to graph frequencies of problems
 Used more in engineering than in health care

PIE CHARTS
 Pie charts (also called circle graphs) are used with counts of mutually
exclusive categories
 Often made in graphing programs because they are difficult to draw by hand
 Features of a pie chart
 Every individual must be put in only one category
 The variable can be qualitative or quantitative
 If quantitative, the data are grouped into classes and then graphed
 Categories must be mutually exclusive, e.g. "favorite color" works, but
"check the colours you like" does not
 It is more informative to show % than frequency, but it is helpful to show both.
 Always include title and legend

Choosing the right kind of Graph

Type of Graph | Cases where Graph is Useful
Frequency histogram | For quantitative data, when you want to see the distribution.
Relative frequency histogram | For quantitative data, when you want to see the distribution. Also good for comparing to other data.
Stem-and-leaf display | For quantitative data, when you want to see the distribution. Easier to make by hand than a histogram.
Time series graph | For graphing a variable that changes over time and is measured at regular intervals.
Bar graph | For qualitative or quantitative data, and for displaying frequency and percentage.
Pareto chart | For frequencies of rare events in descending order.
Pie graph | For mutually exclusive categories (quantitative or qualitative).

FREQUENCY TABLES

 A frequency table displays each class along with the frequency
(number of data points) in each class.
 Selecting arbitrary class limits can make the frequency table
unbalanced.
 But not following the scientific literature can make your results
non-comparable.
 Class is an interval in the data e.g. between 30 and 40 miles.
 Class limits are the lowest and highest values that can fit in a class e.g.
30 would be the lower class limit and 40 would be the upper class
limit.
 Class width: how wide the class is. E.g. upper class limit (40) minus
lower class limit (30) = 10, then add 1 = 11, because the class holds
11 whole numbers: 30 31 32 33 34 35 36 37 38 39 40
 Frequency: how many values from the data fall in the class e.g. how
many patients were transported 30 to 40 miles.
 Decide on classes
 Classes should be the same width
 Class width can be determined empirically e.g age 18-24, 25-34, 35-
44, 45-54, 55-64, 65 and older and should be based on the scientific
literature
 Can also be determined using a formula
 Class width formula
 Calculate this number: maximum - minimum e.g. for the miles data,
47 - 1 = 46
 Divide this by the number of classes desired e.g. if we want 6
classes, 46/6 = 7.7
 Increase this to the next whole number e.g. we increase this up to 8
 Make sure that all the data points are accounted for only once in one of
the classes
 Make sure the classes cover all the data
 Make sure the total of your classes adds up to the total data points
 Frequency tables are necessary for organizing quantitative data.
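
The class width formula and the checks listed above can be sketched in Python. The miles values are invented (chosen so that maximum - minimum = 46), the function names are hypothetical, and "increase to the next whole number" is interpreted here as always moving up to the next integer, which is an assumption.

```python
import math

def class_width(minimum, maximum, n_classes):
    """(maximum - minimum) / n_classes, increased to the next whole number."""
    return math.floor((maximum - minimum) / n_classes) + 1

def frequency_table(data, n_classes):
    """Return (lower limit, upper limit, frequency) for each class."""
    lowest, highest = min(data), max(data)
    width = class_width(lowest, highest, n_classes)
    table = []
    for i in range(n_classes):
        lower = lowest + i * width
        upper = lower + width - 1   # whole-number data: limits are inclusive
        count = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, count))
    return table

# Hypothetical transport distances in whole miles.
miles = [1, 5, 8, 12, 13, 17, 21, 22, 22, 25, 30, 33, 38, 40, 47]
table = frequency_table(miles, 6)

for lower, upper, count in table:
    print(f"{lower:2d}-{upper:2d}: {count}")
```

Because every value falls in exactly one class, the frequencies add up to the number of data points.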
RELATIVE FREQUENCY TABLE
 Relative = in relationship to the rest of the data.
 Frequency = f
 Total sample size = n
 Relative frequency = f/n
 Relative frequency is the proportion of the values that are in that class.
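
As a quick illustration of f/n, the sketch below converts invented class frequencies into relative frequencies. The class labels and counts are hypothetical.

```python
# Hypothetical frequency table: class label -> frequency f.
frequencies = {"30-40": 6, "41-51": 9, "52-62": 5}

# n is the total sample size.
n = sum(frequencies.values())

# Relative frequency of each class is f / n.
relative = {cls: f / n for cls, f in frequencies.items()}

for cls, rf in relative.items():
    print(f"{cls}: {rf:.2f}")
```

The relative frequencies always add up to 1 (or 100%).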

STEM AND LEAF


 In a stem and leaf display, each row starts with a stem
 Leaves are then added to each stem as we tally through the data
 The display is unordered if the leaves are out of order
 After making the unordered version, order the leaves
 Then it is easier to count them up for your frequency table, no matter
what classes you use
 Or, make each leaf a class
 A stem and leaf is another way to organize quantitative data
 A stem and leaf is easier to make than frequency table and requires
less preparation
 Can help you put data in order to create a frequency table

Organizing Quantitative Data

Frequency table | Stem and leaf
Need to set up classes and class widths | Do not need to set up classes or class widths
Need to count frequency in each class | No need to count. Can tally the data as you go through the list.
Lots of pre-calculation | Quicker to do

MEASURES OF CENTRAL TENDENCY


CENTRAL TENDENCY

 How much does the data have a tendency to go to the center?


 This can only be asked about quantitative data
 Measures of central tendency are
 Mode
 Median
 With odd number of values
 With even number of values
 Mean
 Trimmed mean
 Weighted average

MODE

 The value that occurs most frequently in the dataset is the mode
 It is possible to have no mode
 It is also possible to have more than one mode
 It can get confusing with a lot of numbers in a short range: which is
most numerous?
 Less confusing when scale is large
 The mode does not tell you much: only the most popular answer, the most
common result

MEDIAN

 It is the center of the data


 Every set of quantitative data can be sorted in order of lowest to
highest
 Sometimes there are repeats in the data
 Sometimes there are outliers
 Sometimes all the data values are almost the same
 Even so, they can be arranged in order
 How to find the median
 Order the data from smallest to largest
 If there is an odd number of values, the median is the middle data
value

Position: 1 2 3 4 5
Value: 21 33 42 62 78

Median is 42 (the value in position 3)
 If there is an even number of values, the median is the sum of the
middle two values divided by 2

Position: 1 2 3 4 5 6
Value: 1 3 7 8 9 12

Median is (7+8)/2=7.5
 After arranging the values in order, you have an ordered set
 To find the position of the median among n values, take n+1 and
divide by 2. Count up that many from the beginning of the ordered
set, and that is the median
 Example 1 (odd): n=21. 21+1=22, and 22/2=11. You would count
from the beginning of the ordered set, and the 11th value in it
would be the median.
 Example 2 (even): n=14. 14+1=15, and 15/2=7.5. You would count
from the beginning of the ordered set, and take the 7th value and
8th value, add them together, then divide by 2 for the median.
 The median tells you the 50th percentile of the data, the middle rank of
the data. The median doesn’t care much about the ends of the data:
outliers don’t bother it, so it is resistant and stable.
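
The counting rule (n+1)/2 can be written as a short Python sketch. The function name median_by_position is made up for this illustration; Python's own statistics.median gives the same answers.

```python
def median_by_position(values):
    """Find the median by counting to position (n + 1) / 2 in the ordered set."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2                # 1-based position, e.g. 3 or 3.5
    if n % 2 == 1:
        return ordered[int(position) - 1]  # odd n: the middle value
    lower = ordered[int(position) - 1]     # even n: value just below the middle
    upper = ordered[int(position)]         # even n: value just above the middle
    return (lower + upper) / 2

print(median_by_position([42, 33, 21, 78, 62]))   # odd number of values
print(median_by_position([1, 3, 7, 8, 9, 12]))    # even number of values
```

The two calls reuse the odd and even data sets from the notes.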

MEAN

 Whenever you see this ∑, say in your head "sum of..."
 ∑x is pronounced "sum of x"
 x’s are the values in your data
 ∑x means add up all the x’s
 Another example: ∑xy, "sum of xy", means you must have a bunch of
xy’s and need to add them up.
 ∑ is used a lot in statistics
 Formula for mean
 ∑x = add up all x’s
 n = number of values in your data
 After summing up all the x’s, you divide by n: ∑x/n
 Example: n=6
∑x = 12 + 7 + 3 + 8 + 1 + 9 = 40
40/6 = 6.7
 Sample mean: x̄ = ∑x/n. Population mean: μ = ∑x/N
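
The mean calculation from the example above, written as a minimal Python sketch (the variable names are chosen for this illustration):

```python
values = [12, 7, 3, 8, 1, 9]

total = sum(values)   # sum of x
n = len(values)       # number of values
mean = total / n      # sum of x divided by n

print(total, n, round(mean, 1))
```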

Medians | Means
Very resistant to outliers | Not resistant to outliers
Very stable | Not very stable

Trimming the Mean

 A very high value or very low value (outliers) can really throw off the
mean.
 This is not a problem with the median.
 One solution to make the mean more resistant is to trim data off each
end so the outliers get cut off.
 If you trim, you have to be fair and trim the same amount off each
side.
 How to make a 5% trimmed mean
 Figure out how many data points you have. Then, figure out what
5% of them would be e.g if you have 100 data points, 5% would be
5 data points
 Put the data in order
 Remove 5% from the top and 5% from the bottom e.g remove the 5
top ones and the 5 bottom ones
 Now make the mean out of the remaining data.
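
The trimming steps can be sketched in Python as follows. The data set is invented (nineteen ordinary values plus one extreme outlier), the function name is hypothetical, and rounding the trim count down to a whole number is an assumption, since the notes only cover the case where 5% is a whole number of points.

```python
def trimmed_mean(values, percent):
    """Drop `percent` % of the points from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * percent / 100)   # points to remove from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data: 1 to 19 plus one very high outlier (20 points in total).
data = list(range(1, 20)) + [1000]

print("ordinary mean:  ", sum(data) / len(data))
print("5% trimmed mean:", trimmed_mean(data, 5))
```

With 20 points, 5% is one point from each end, so the outlier is removed and the trimmed mean lands near the middle of the data.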

Weighted average

 Sometimes, certain values should count more toward the mean than
others.
 If homework is 10% of your grade, and quizzes are 20% of your grade,
the quizzes count for more than homework
 You can arrange this by doing a weighted mean.
 Example: homework worth 10%, quizzes worth 20%, and final worth
70%. You got an A (4.0) on homework, B+ (3.5) on quizzes, and B
(3.0) on final.
Non-weighted average: (4.0+3.5+3.0)/3=3.5
Weighted average: (4.0*0.1)+(3.5*0.2)+(3.0*0.7)=3.2
 Formula is ∑xw/∑w
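
The grade example can be reproduced with a small Python sketch of ∑xw/∑w. The function name weighted_average is chosen for this illustration.

```python
def weighted_average(values, weights):
    """Sum of (value x weight) divided by the sum of the weights."""
    return sum(x * w for x, w in zip(values, weights)) / sum(weights)

grades = [4.0, 3.5, 3.0]     # homework, quizzes, final
weights = [0.1, 0.2, 0.7]    # 10%, 20%, 70%

plain = sum(grades) / len(grades)
weighted = weighted_average(grades, weights)

print(round(plain, 1), round(weighted, 1))
```

Because the final counts for 70% and has the lowest grade, the weighted average is lower than the plain average.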

Measures of central tendency on Normal Distribution and Skewed Distribution

 On a normal distribution, the mean, median and mode all lie on top of
each other
 On skewed distributions, the relative positions of the mean, median
and mode are different: the mode is at the peak, and the mean is pulled
furthest toward the long tail

MEASURES OF VARIATION
VARIATION

 It means how much does the data vary?


 Variation is a way to show how data is dispersed, or spread out.

Range

 The range is the difference between the maximum and minimum
value.
 It is easy to calculate
 Example: 42 33 21 78 62
 In the above data, 78 is the maximum
 21 is the minimum
 78-21=57
 57 is the range
 The range is not very useful for looking at variation because it just
relies on two points.
 It is not stable or resistant
 Therefore, other measures are used to look at variation typically.
 Example: we had a range of 57, we can just change these two
numbers and get a totally different range.
 Range is also the most affected by outliers as it uses only the extreme
values.

Variance and Standard Deviation

 Variance is how much the data vary. Think: how far do the values
spread out around the mean?
 Standard deviation: "standard" as in following a standard, and
"deviation" as in deviating from it (like a deviated septum)
 They are friends because this is how you calculate them:
 First, calculate the variance.
 Then, take the square root of the variance, and that is the standard
deviation.
 The formulas for sample variance and sample standard deviation are
different from those for population variance and population standard
deviation. We don’t use the population ones that often, so we will
concentrate on how to use the sample ones
 There are two different ways of writing the formula (for both sample
and population): the defining formula and the computational
formula
 Both get the same results
 The computational formula can be confusing
 Therefore, we use the defining formula, which is simpler
 Sample defining formulas

Sample variance

 The variance of a simple random sample is defined by a slightly
different formula, and uses slightly different notation:

s² = Σ ( x - x̄ )² / ( n - 1 )

 In a population, variance is the average squared deviation from
the population mean, as defined by the following formula:

σ² = Σ ( X - μ )² / N

Sample standard deviation

 For a simple random sample, the best estimate of the standard
deviation of a population is:

s = sqrt [ s² ] = sqrt [ Σ ( x - x̄ )² / ( n - 1 ) ]

 The standard deviation is the square root of the variance. Thus,
the standard deviation of a population is:

σ = sqrt [ σ² ] = sqrt [ Σ ( X - μ )² / N ]

 Variance is the average squared difference of values from the mean.
 To calculate variance, we square the difference between each data
value and the mean. For a population, we divide the sum of these
squares by the number of items in the dataset (N); for a sample, we
divide by n - 1.
 Using the sample formula (dividing by n - 1), the sample variance can
be considered an unbiased estimate of the true population variance.
 Therefore, if you need to estimate an unknown population variance,
based on data from a simple random sample, this is the formula to use.
 Because variance is a squared quantity, there is no intuitive way to
compare variance directly to data values or mean.
 Standard deviation is a measure of how much data values deviate
from the mean. The larger the standard deviation, the greater the
amount of variation.
 Standard deviation is calculated as the square root of variance.
 Standard deviation uses the original units of data which makes
interpretation easier. Hence standard deviation is the most commonly
used measure of variation.
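
The defining formulas for variance and standard deviation can be sketched in Python. The sketch reuses the five values from the range example (42 33 21 78 62); the function names are made up for this illustration, and the standard-library statistics module is used only as a cross-check.

```python
import math
import statistics

data = [42, 33, 21, 78, 62]

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """Sum of squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n

s2 = sample_variance(data)
s = math.sqrt(s2)                 # sample standard deviation
sigma2 = population_variance(data)
sigma = math.sqrt(sigma2)         # population standard deviation

# Cross-check against the standard library.
s2_library = statistics.variance(data)
sigma_library = statistics.pstdev(data)

print(s2, s, sigma2, sigma)
```

The sample variance is larger than the population variance for the same data because it divides by n - 1 instead of N.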

Why do we need measures of variation?

 A single statistic – the mode, the median or the mean may not be a
model that represents the entire dataset accurately. Anytime we use a
single number to represent the data, we lose the sense of variability in
the data.

When to use Range, IQR, Standard Deviation and Variance?

 Range uses only the extreme values of a dataset and is hence very
susceptible to outliers. It is advisable to use range only for very small
distributions with no outliers.
 Interquartile range is good for skewed distributions. This is because
IQR is resistant to outliers in the data. It is generally paired with
the median to describe the data.
 Standard deviation is a good measure of variability for normal
distributions or distributions that aren’t terribly skewed. Paired with
mean this is a good way to describe the data.
 Variance is not used much as it is represented in squared units and is
not an intuitive measure.

Effect of Changing Units

 Sometimes, researchers change units (minutes to hours, feet to
meters, etc.). Here is how measures of variability are affected when we
change units.
 If you add a constant to every value, the distance between
values does not change. As a result, all of the measures of
variability (range, interquartile range, standard deviation, and
variance) remain the same.
 On the other hand, suppose you multiply every value by a
constant. This has the effect of multiplying the range,
interquartile range (IQR), and standard deviation by that
constant. It has an even greater effect on the variance. It
multiplies the variance by the square of the constant.
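
Both rules can be checked numerically. The sketch below uses the five values from the range example, an added constant of 10 and a multiplier of 3; these constants and the helper name value_range are chosen only for illustration.

```python
import statistics

data = [42, 33, 21, 78, 62]
shifted = [x + 10 for x in data]   # add a constant to every value
scaled = [x * 3 for x in data]     # multiply every value by a constant

def value_range(values):
    return max(values) - min(values)

for name, values in [("original", data), ("shifted", shifted), ("scaled", scaled)]:
    print(name,
          value_range(values),
          round(statistics.stdev(values), 2),
          round(statistics.variance(values), 2))
```

Adding 10 leaves every measure unchanged; multiplying by 3 triples the range and standard deviation and multiplies the variance by 9.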
Coefficient of Variation

 The coefficient of variation (CV) is a statistical measure of the
dispersion of data points in a data series around the mean.
 The coefficient of variation represents the ratio of the standard
deviation to the mean, and it is a useful statistic for comparing the
degree of variation from one data series to another, even if the means
are drastically different from one another.
 The coefficient of variation shows the extent of variability of data in a
sample in relation to the mean.
 Always expressed as a %
 Below is the formula for how to calculate the coefficient of variation:

CV = (standard deviation / mean) x 100%, i.e. CV = (s / x̄) x 100% for a
sample and CV = (σ / μ) x 100% for a population

 Please note that if the mean in the denominator of the
coefficient of variation formula is negative or zero, the result could be
misleading.
 The coefficient of variation can also be used to compare variability
between different measures.
 The Coefficient of Variation should only be used to compare positive
data on a ratio scale. The CV has little or no meaning for
measurements on an interval scale.
 The CV is particularly useful when you want to compare results from
two different surveys or tests that have different measures or values.
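
A minimal Python sketch of the coefficient of variation as the ratio of the standard deviation to the mean, expressed as a percentage. The two data series are invented, the function name is hypothetical, and the sample standard deviation is used here as an assumption.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, as a percentage."""
    mean = statistics.fmean(values)
    if mean <= 0:
        # The CV is misleading when the mean is zero or negative.
        raise ValueError("CV needs a positive mean")
    return statistics.stdev(values) / mean * 100

# Hypothetical series with very different means but the same relative spread.
small = [10, 12, 11, 9, 13]
large = [1000, 1200, 1100, 900, 1300]

print(round(coefficient_of_variation(small), 1), "%")
print(round(coefficient_of_variation(large), 1), "%")
```

The standard deviations differ by a factor of 100, but the CV is the same for both series, which is what makes it useful for comparing data with different means.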

PERCENTILES

 Percentiles are used to understand and interpret data.


 The Pth percentile of a set of data is the value below which P percent of
the data is found. In everyday life, percentiles are used to
understand values such as test scores, health indicators, and other
measurements.
 Percentiles for the values in a given data set can be calculated using
the formula:

n = (P/100) x N
 where N = number of values in the data set, P = percentile, and n =
ordinal rank of a given value (with the values in the data set sorted
from smallest to largest)
 Percentiles are frequently used to understand test scores and
biometric measurements.
 Example: If you test at the 77th percentile, it means you did better
than 77% of the people taking the test. If 100 people took the test, you
would have done better than 77 of them.
 Rules:
 Percentiles can be between 1 and 99, i.e. you can’t have a -2nd
percentile or a 105th percentile
 Whatever percentile you pick, that % of values fall below the
corresponding value and 100 minus that % of values fall above it
 Example: 20 people take a test with a maximum score of 5. The
25th percentile means 25% of the scores fall below this score, and
75% fall above it
 Let’s say it is an easy test, and 12 people get a 4, and the
remaining 8 get a 5. The 25th percentile, or the score that cuts off
the lowest 5 test scores (25% of 20), will be 4. (Even the 50th
percentile will be 4.)
 This would come out very different if it were a hard test, and most
people got below a score of 3
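
The rank formula n = (P/100) x N and the easy-test example can be sketched in Python. The function name is made up for this illustration, and rounding the rank up to a whole position is an assumption; other percentile conventions exist and can give slightly different answers.

```python
import math

def percentile_value(values, p):
    """Value at ordinal rank n = (P/100) x N in the sorted data (rank rounded up)."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]   # ranks are 1-based

# The easy test from the notes: 12 people score 4, the remaining 8 score 5.
scores = [4] * 12 + [5] * 8

print(percentile_value(scores, 25))   # rank 5
print(percentile_value(scores, 50))   # rank 10
print(percentile_value(scores, 75))   # rank 15
```

The 25th and 50th percentiles both land among the twelve scores of 4, while the 75th percentile lands among the scores of 5.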

QUARTILES AND INTERQUARTILE RANGE

Quartiles

 A quartile is a statistical term describing a division of observations
into four defined intervals based upon the values of the data and how
they compare to the entire set of observations.
 It is a specific set of percentiles
 1st quartile: 25th percentile
 2nd quartile: 50th percentile (also median)
 3rd quartile: 75th percentile
 Each quartile contains 25% of the total observations. Generally, the
data is arranged from smallest to largest:
 First quartile: the lowest 25% of numbers
 Second quartile: between 25% and 50% (up to the median)
 Third quartile: between 50% and 75% (above the median)
 Fourth quartile: the highest 25% of numbers
 These can be calculated by hand.
 The upper quartile formula is:
Q3 = ¾(n + 1)th Term.
 Steps to calculate the quartile
 Order the data from the smallest to the largest
 Find the median i.e 2nd quartile or 50th percentile
 Find the median of the lower half of the data i.e 1st quartile or 25th
percentile
 Find the median of the upper half of the data i.e 3rd quartile or 75th
percentile
 Example: Suppose, the distribution of math scores in a class of 19
students in ascending order is:

59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90,
95, 98

 First, mark down the median, Q2, which in this case is the tenth
value: 75.
 Q1 is the central point between the smallest score and the median.
In this case, Q1 is the fifth score, the middle of the nine scores
below the median: 68. [Note that the median can also be included
when calculating Q1 or Q3 for an odd set of values. If we include
the median in each half, then Q1 is the middle value between
the first and tenth score, which is the average of the fifth and sixth
score: (fifth + sixth)/2 = (68 + 69)/2 = 68.5].
 Q3 is the middle value between Q2 and the highest score. In this
case, Q3 is the fifteenth score: 84. [Or if you include the median,
Q3 = (82 + 84)/2 = 83].
 Now that we have our quartiles, let’s interpret their numbers. A
score of 68 (Q1) represents the first quartile and is the 25th
percentile. 68 is the median of the lower half of the score set in the
available data i.e. the median of the scores from 59 to 75.
 Q1 tells us that 25% of the scores are less than 68 and 75% of the
class scores are greater. Q2 (the median) is the 50th percentile
and shows that 50% of the scores are less than 75, and 50% of the
scores are above 75. Finally, Q3, the 75th percentile, reveals that
25% of the scores are greater and 75% are less than 84.
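
The quartile steps (order the data, find the median, then find the median of each half) can be sketched in Python using the 19 math scores. The function names and the include_median switch are made up for this illustration; the switch covers the two conventions mentioned in the example.

```python
def middle(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole set, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        lower, upper = ordered[:half], ordered[n - half:]
    return middle(lower), middle(ordered), middle(upper)

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

print(quartiles(scores))                        # median left out of both halves
print(quartiles(scores, include_median=True))   # median kept in both halves
```

Leaving the median out reproduces Q1 = 68 and Q3 = 84; keeping it in reproduces the bracketed alternatives 68.5 and 83.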

Interquartile range

 The Interquartile Range or IQR describes the middle 50% of the values
when ordered from lowest to highest value.
 IQR is considered a good measure of variation in skewed datasets as it
is resistant to outliers.
 To calculate the IQR, we find the median of the lower and upper half of
the data. These are Quartile 1 and Quartile 3. The IQR is the difference
between Quartile 3 and Quartile 1.

IQR = Q3 – Q1

 Imagine: with 8 values
 The median would be the values in the 4th and 5th positions added
together and divided by 2
 Q1 would consider the 4 values below the median, and Q3 would
consider the 4 values above the median. In this case, for Q1,
because there are 4 values in the lower half, the values in the 2nd
and 3rd positions must be added together and divided by 2.
Similarly, for Q3, this must be done for the values in the 6th and 7th
positions.

Position: 1 2 3 4 5 6 7 8
Value: 41 74 90 97 121 126 142 155

Q1 lies between positions 2 and 3, the median between positions 4 and 5,
and Q3 between positions 6 and 7.
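
For the eight values above, the IQR can be worked out with a short Python sketch. It relies on statistics.median from the standard library and, as written, assumes an even number of values so that the data split cleanly into two halves.

```python
import statistics

values = [41, 74, 90, 97, 121, 126, 142, 155]

ordered = sorted(values)
half = len(ordered) // 2                 # 4 values in each half

median = statistics.median(ordered)      # average of 4th and 5th values
q1 = statistics.median(ordered[:half])   # average of 2nd and 3rd values
q3 = statistics.median(ordered[half:])   # average of 6th and 7th values
iqr = q3 - q1

print(q1, median, q3, iqr)
```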

Box-and-Whisker Plot

 This is a graph that summarizes numerical data based on quartiles,
which divide a data set into fourths.
 It is useful for revealing the central tendency and variability of a data
set, the distribution (particularly symmetry or skewness) of the data,
and the presence of outliers.
 It is also a powerful graphical technique for comparing samples from
two or more different treatments or populations.
 A box-and-whisker plot typically consists of a line (vertical or
horizontal) extending from the minimum value to the maximum value
and a box, the end lines of which depict the first quartile (Q1) and the
third quartile (Q3) and a central line within which depicts the second
quartile (Q2; also called the median). (The first quartile represents the
25th percentile, the second quartile represents the 50th percentile,
and the third quartile represents the 75th percentile.) Outliers are
plotted as individual data points.
 A boxplot is a graph that gives you a good indication of how the values
in the data are spread out.
STATISTICAL NOTATION

Summary measure | Population parameter | Sample statistic
Size | N | n
Standard deviation | σ | s
Mean | μ (it is called mu), μ = ∑x/N | x̄ (it is called x-bar), x̄ = ∑x/n
Variance | σ² | s²
Proportion of elements that have a particular attribute | P | p
Proportion of elements that do not have a particular attribute | Q = 1 - P | q = 1 - p
Correlation coefficient | ρ | r
Set of elements | X | x

QUESTIONS AND ANSWERS
