
Terminology

1
Introduction
2
Statistics (as opposed to statistic) is the science
of gathering, organizing, analyzing, and
interpreting numerical and categorical
information.

Statistics is the art of decision making in the
presence of uncertainty
CPT
3
Descriptive Statistics involve methods
of organizing, picturing, and
summarizing information from samples
or populations (Chapters 1-3)
Inferential Statistics involve methods of
using information from a sample to draw
conclusions regarding the population
(Chapters 7-11)
4
5
Descriptive
Statistics
Probability
Inferential
Statistics
1. Choose a topic and identify the problem to be
addressed.
2. Background research: review past research, including
historical descriptive statistics and references.
3. Develop a conjecture or hypothesis.
4. Design an experiment.
5. Gather additional information through
experimentation and observation.
6. Analyze and interpret the results; if the
hypothesis is rejected, repeat Steps 3 through 6.
7. Formulate conclusions and draw inferences.

6
A population is the group of individual
persons, objects, or items whose characteristics
one wishes to better understand and from which
samples are taken for statistical measurement.
The size of the population is denoted N.

The population data is the complete collection
of information from all of the individuals or
subjects of interest in a given study.
7
A census is a survey of every individual in the
population and the information gathered is called
the population data.

Recall: the population data is the complete
collection of information from all of the
individuals or subjects of interest in a given
study.
8
A sample is a partial collection of
information for only some of the individuals or
subjects of interest in a given study.

The sample size, n, of a sample is the number of
observations that constitute the sample. Whereas
the size of a population is denoted by the capital
letter N, the size of the sample is denoted by the
lowercase letter n.
9
Descriptive Statistics involve methods of
organizing and summarizing information (data)
and presenting it numerically or visually
(graphically).
Stem-and-leaf, Frequency Tables, and
Contingency Tables
Bar Charts, Pie Charts, and Histograms
Scatter Plots
Among others
10
The distribution of the data is a list of all the
values recorded in a sample; that is, the observed
outcomes and their frequency.

Distributions can be given in tabular form as a
frequency or contingency table or illustrated
graphically in the form of a bar chart or
histogram.

11
Characteristics of Distribution
Location: The mean, median, and mode are central tendencies
Shape: A uniform distribution is symmetric, not skewed
Spread: The range of information between minimum and maximum
12
Location within a distribution is a specific value
within the domain of the data such as the
extremes: minimum and maximum, central
tendencies, etc.

13
Outliers are data within a distribution of the
data, but outside the overall pattern (cluster) of
the graph; that is, extreme values that can distort
the interpretation of the data by creating
misleading statistics.

14
Common Locations
Extremes: Minimum and Maximum
Central Tendencies: Mean, Median, and Mode (among other weighted or trimmed means)
Percentiles: A value, x_p%, such that p% of the observed values are to the left
Quartiles: p% = 0%, 25%, 50%, 75%, 100%
15
Shape of a distribution describes the symmetry
or lack thereof (skewness).

Data that is symmetric exhibits balance and self-
similarity whereas skewness is a measure of the
asymmetry.

16
Common Shapes
Uniform: Equal frequencies; no mode
Bell Shaped: The mean equals the median equals the mode; looks like a bell
Symmetric: The above are symmetric
Skewed: Left: mean < median; Right: mean > median
17
Spread of a distribution is a measure which
indicates how the data values are distributed, a
measure of the dispersion or variability within a
group of values. Some appropriate measures
include:
Range (from minimum to maximum)
Variance (mean square error)
Standard deviation (square root of variance)
Where error = observed value − expected value
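As a rough illustration of the three measures of spread listed above, the following sketch computes each one for a small hypothetical data set (the numbers are invented for this example and do not come from the slides). It assumes the "expected value" in the error is the mean of the data and that the variance divides by the number of values.

```python
import statistics

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: distance from the minimum to the maximum.
data_range = max(data) - min(data)

# Variance as the mean squared error, where each error is
# (observed value - mean); pvariance divides by the number of values.
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance.
std_dev = statistics.pstdev(data)

print(data_range, variance, std_dev)
```

When the data are a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by n − 1 instead of n.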
18
Common Measures
Extremes: Minimum and Maximum
Central Tendencies: Mean, Median, and Mode (among other weighted or trimmed means)
Deviations: Range, Variance, and Standard Deviation
Frequency & Proportions: Count, Relative Frequency, Rate
19
An extreme is a characteristic farthest removed
from the ordinary; common extremes are the
minimum, which is the least observed value in a
sample, and the maximum, the greatest observed
value in a sample.
20
A central tendency is a measure of where a
characteristic clusters around a central value;
such measures, estimated from the data, are called
averages. There are three common measures: the
mode, median, and mean.

21
Common Central Tendencies
Mode: Most frequently observed value
Median: A value such that 50% of the observed values fall to the left
Mean: Sum of all data divided by number of data points (equal weights)
Weighted Mean: Average such that the weights are not equal
Trimmed Mean: Average such that some of the weights are zero
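The central tendencies named on this slide can be sketched in a few lines. The data set and the helper names `weighted_mean` and `trimmed_mean` are hypothetical, introduced only for illustration; the trimmed mean here simply drops the k smallest and k largest values, which is one common convention.

```python
import statistics

# Hypothetical data; 30 plays the role of an outlier.
data = [1, 2, 2, 3, 4, 7, 30]

mode = statistics.mode(data)      # most frequently observed value
median = statistics.median(data)  # half of the values fall to the left
mean = statistics.mean(data)      # sum of the data divided by the count


def weighted_mean(values, weights):
    """Average in which the weights need not be equal."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Average after giving the k smallest and k largest values zero weight."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


print(mode, median, mean)
print(weighted_mean([80, 90], [1, 3]))
print(trimmed_mean(data, 1))
```

Comparing `mean` with `trimmed_mean(data, 1)` shows how strongly a single extreme value can pull the ordinary mean.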
22
Common Deviations
Range: Maximum minus minimum
IQR (Interquartile Range): Q3 − Q1
Variance: Expected value of the squared differences (errors)
Standard Deviation: Square root of the variance
23
Common Frequency & Proportions
Count: 1, 2, 3, ...
Frequency: The number of times a given value occurs (a count)
Relative Frequency: The ratio of the frequency of one value to the sample size
Rate: Part in ratio to the whole; the relative frequency
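A short sketch of frequency and relative frequency, using invented eye-color observations (the data are hypothetical, not from the slides): the frequency is a count per value, and the relative frequency divides that count by the sample size n.

```python
from collections import Counter

# Hypothetical observations of a categorical variable.
observations = ["blue", "brown", "brown", "green",
                "blue", "brown", "hazel", "brown"]

# Frequency: the number of times each value occurs.
frequency = Counter(observations)

# Relative frequency: frequency of a value divided by the sample size n.
n = len(observations)
relative_frequency = {value: count / n for value, count in frequency.items()}

for value, count in frequency.items():
    print(value, count, relative_frequency[value])
```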
24
A statistical measure is sensitive if the
computed value changes readily even if a
single observed data value is different. Also, a
statistical method is sensitive if the decision
changes radically based on the assumptions
we made to develop it.
A statistical method is robust if the decision
is not strongly dependent on the assumptions.

Not formally defined in the text

25
The degree of confidence represents the
proportion of times the statistical methodology
used captures the true state of nature.

This value will be denoted (1 − α)%, where α is
the level of significance.


26
An event is considered statistically significant if
its occurrence is unlikely to happen by chance.

This value will be denoted by α%.


27
Inferential Statistics involves methods of
analyzing and interpreting descriptive statistics
to draw conclusions regarding a particular
characteristic in the population with a certain
degree of assurance based on a preset level of
significance and specified assumptions.
28
Hypothesis Testing is the use of a statistical
method in arguing for or against a hypothesized
value based on observed information and using
this information to make a decision regarding an
initial hypothesis and an alternative hypothesis.






Not formally defined in chapter 1

29
An experiment is the method (procedure) that
we follow to obtain data or information.
Experimental design is the art of planning
and executing experiments designed to gather
information (data) from the population, N, in
such a way as to ensure the sample, n, is
representative of the population, N.
30
Individuals are the people, places, or things
included in a study and for which information is
gathered. In medical research studies,
individuals are referred to as the subjects in the
study.
31
A variable is a distinct characteristic of an
individual to be observed and measured. These
observed data can be qualitative or quantitative.

A qualitative (categorical) variable is a variable
that describes the individual by placing the
individual into a category or group.
A quantitative (numerical) variable is a variable
that takes on a real value or numerical
measurement for which sums, differences, and
ratios have meaning.

32
Regression is a statistical procedure used to
estimate the relationship among variables,
specifically between the primary (response)
variable of interest and all other variables.

33
In a study, the response variable is the primary
variable of interest; that is, the objective in the
given study.

This variable is also referred to as the dependent
variable (although the relationship of
dependence or cause-and-effect is yet to be
determined).
34
The explanatory variables are the extraneous
variables that have been measured but are not the
primary variable of interest; they are used to
understand the behavior of the response
(primary) variable.

This variable is also referred to as the
independent variable (although the relationship
of dependent/independent has not yet been
established).


35
Lurking variable(s) are the unknown variables
that have not been measured; however, they do
contribute to the response (primary) variable and
are not included as an explanatory variable.

36
Correlation is a measure of association between
a response variable and an explanatory variable.

Correlation measures the strength and direction
of a simple linear relationship, that is, a straight
line.
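One way to make "strength and direction of a straight-line relationship" concrete is Pearson's correlation coefficient. The sketch below assumes that coefficient is the intended measure; the slide does not name it, and the helper name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# A perfectly increasing line gives r close to +1,
# a perfectly decreasing line gives r close to -1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

The sign gives the direction of the relationship and the magnitude gives its strength.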

37
Causation is more than a measure of association
between a response variable and an explanatory
variable. It also implies direct cause or
dependence.

Correlation between two events may be a
common response to a lurking variable.

38
Confounding variables are the variables that
have been measured and are significantly
contributing variables; however, their
independent contributions to the subject response
are indistinguishable and are not deemed
significantly contributing in the larger model.

39
Discrete/Continuous
40
An instrument is any means by which
information is gathered or measured such as an
exam, survey, or other rulers such as a barometer,
thermometer, etc.

41
A parameter is a numerical measure that
describes the outlined characteristic of the
population such as central tendencies (mean,
median, mode, and proportion), spread (range,
variance, and standard deviation), and shape
(symmetric and skewed).

In general, when a specific parameter is not
specified, the lowercase Greek letter theta (θ) is
used to denote a population parameter.

42
Common Parameters
Mean: μ
Variance: σ²
Standard Deviation: σ
Proportions: p
Correlation: ρ
43
A sample survey is a survey of only some of the
individuals in the population and the information
gathered is called the sample data.

The number of individuals included in the
sample survey is called the sample size, n.

The sample data is a subset of the population
data often denoted: x_1, x_2, ..., x_n.
44
A statistic is a numerical measure that yields an
estimate of a population parameter.

That is, a numerical measure that uses the data
from the sample to estimate the outlined
characteristic of the population.

As opposed to Statistics - the study of how to
gather, organize, analyze and interpret
information.

45
POPULATION IS TO SAMPLE AS
CENSUS IS TO SAMPLE SURVEY
POPULATION IS TO SAMPLE AS
PARAMETER IS TO STATISTIC
46
Measure involves any standard of comparison,
estimation, or judgment; property of an
individual given a numerical value; a quantity,
a count, a degree, a rate, or a proportion. In
terms of data collections, the measured values
are referred to as the outcomes or observed
values.

Two types of measure are discrete and
continuous.
47
A discrete measure is one whose possible
observed outcomes are separate, distinct, and
countable, such as a count.

Discrete measures are such that the outcomes can
be enumerated: one, two, three, etc.


48
Examples of Discrete Measures
Number: Number of children in a family tree: depending on the number of generations included in the tree, there can be 1, 2, 3, ..., but not 1.5; there is nothing between 1 and 2, or 2 and 3, etc.
Count: Count of whole beans: depending on the number of pods included, there can be 1, 2, 3, ..., but not 1.2, since the count is restricted to whole numbers.
Frequency: Frequency of blue-eyed men and green-eyed women.
49
A continuous measure is such that the set of
possible observed outcomes are infinite and
uncountable.

Continuous measures are dense; that is, between
any two values (outcomes) there exists another
value (outcome), such as a mean or rate.

50
Examples of Continuous Measures
Length: Length of a road: it can measure 1 mile or 2 miles, and between these possible measures exist 1.5 miles and 1.24 miles; in fact, between any two values there exist other possible values.
Height: Height of a man: a man can be 5 feet tall or 6 feet tall, and between these potential values exist 5.5 feet, 5.14 feet, etc. While we might not have an instrument precise enough to measure 1/100th of a foot, this measure exists.
Age: Age of a woman: between this moment and the next, there is a continuous existence. Between 1 yr and 2 yrs, 1.8 yrs exists, etc.
51
Samplings
52
53
Validity refers to the degree of accuracy to which a
study reflects the specific concept or characteristic
that the analyst or researcher is attempting to
measure.

Internal validity is the degree to which one can
draw valid conclusions about the causal effect
between variables. External validity is the degree
to which one can extend the findings that are
relevant to subjects and settings outside those
included in the experimental design.

54
For example, when evaluating a class of 180
students from a single mass lecture of STA
2023, can this information be used to evaluate
all students taking STA 2023 given there is more
than one section taught by different instructors?

Internal validity: drawing conclusions about
the specific subjects inside the study.
External validity: the ability to extend
conclusions to subjects outside of the study.
55
56
57
Bias is a consistent deviation of the statistics to
one side of the parameter.





[Figure: low bias vs. high bias]

58
For example, when weighing out coffee to be
ground and brewed at a coffee shop, the
employee forgets to zero-out the scale with the
cup used to measure the coffee. As a result, every
reading includes the weight of the cup, so each
measure of coffee is short by the weight of the cup.

Solution: add an amount of coffee equal to the
weight of the cup to each cup.
59
60
61
Variability measures the degree of dispersion
within a given data set.

Some common measures of dispersion include
range, mean (average) deviation, standard
deviation, variance, inter-quartile range, and
mean difference.

Variability can appear as gaps in the data when
illustrated graphically.

62
Reliable refers to the accuracy and precision of
the actual measuring instrument or procedure.

A reliable measure is a (precise) measurement
such that the random error is small.

63
Valid (Accurate)
Reliable (Precise)
We like samples to
represent the population
and the measures taken to
represent the parameters
estimated. These statistics
need to be a valid measure,
accurately estimating the
parameter with low bias as
well as be reliable,
measured with such
precision as to have low
variability when estimating
the parameter using
statistics.
64
ACCURACY (VALID MEASURE)
HITS THE TARGET'S BULL'S-EYE
PRECISION (RELIABLE MEASURE)
HITS THE SAME LOCATION REPEATEDLY
65
ACCURATE INACCURATE
66
PRECISE IMPRECISE
67
PRECISE IMPRECISE
68
Nominal, Ordinal, Interval, & Ratio
69
Common Levels of Measure
Nominal: Data that consist of names, labels, or categories
Ordinal: Data that can be arranged in order; however, differences between data values cannot be determined or are meaningless
Interval: Data that can be ordered and differences have meaning, but ratios do not (equal distances, but no fixed zero)
Ratio: Like ordinal and interval, but ratios also have meaning (equal distances and fixed zero)
70
A nominal measure is one that measures a
characteristic of an individual by name only;
information in the form of categorical data where
the order of the categories is not relevant.

Names only; no calculations can be performed.
71
Examples of Nominal Measures
Surnames: Can be made ordinal if considered alphabetically, but otherwise, this is a name only.
SSN: There are relations among the digits that make up such numbers, but there is not a true ordering, difference, or ratio.
Zip Code: While these codes can be ordered numerically, the order is arbitrary and therefore not meaningful; the zip code 33617 is not "less than" the zip code 33620, and the only difference is geographical.
Gender: Male/Female: these are clearly labels for which there is no order other than alphabetical; it is meaningless to argue "less than" or "greater than" in general.
72
An ordinal measure is one that measures a
characteristic of an individual by the rank order
(1st, 2nd, 3rd, etc.) of the entities measured or by
implied ordering such as worst, bad, good, great.

Ordering the measured outcomes.

73
A simple ranking imposes an order on the
measured characteristic of an individual and the
set of natural numbers by defining a relationship
that establishes the position within a sequence of
outcomes: "ranked higher than," "ranked lower
than," or "ranked equal to."

Imposing an ordinal scale.

74
A Likert scale establishes the hierarchy within a
sequence of outcomes.

For example, how attractive is a person on a
scale from 1 to 10, with 1 meaning "not very
attractive" and 10 representing "perfect
attraction"?

75
Examples of Ordinal Measures
Ranking: What is the best-selling flavor of ice cream?
Likert Scale: A five-point scale by which to evaluate an instructor: poor, unsatisfactory, satisfactory, good, great.
Dress Size & Shoe Size: Due to inconsistencies found in sizes between designers, a size 0 is smaller than a size 2, which is smaller than a 4, but this does not mean the difference between a 2 and a 4 is the same as the difference between a 0 and a 2. Furthermore, a 4 is not twice as large as a 2; this ratio has no meaning.


76
An interval measure is one that measures a
characteristic of an individual where differences
between measures have meaning; that is, the
distance between two adjacent units is the same
but there is no meaningful zero point. An interval
measure is such that sums and averages have
meaning; however, ratios do not have meaning.

Sums (differences) but not ratios.
77
Examples of Interval Measures
Time of Day: If your watch reads 12:05 and mine reads 12:07, then my watch reads a later time than yours; hence the measure is at least ordinal. Moreover, there is a 2-minute difference; therefore this measure is interval. It is not ratio, since 12:07 in ratio to 12:05 has no meaning.
Temperature: If the daytime temperature is 50°F in New York and 100°F in Miami, then it is 50°F hotter in Miami than it is in New York. While the ratio of 100°F to 50°F is 2, this ratio has no meaning and is therefore an invalid measure. You cannot say 100°F is twice as hot as 50°F.

Temperature in kelvins, which has an absolute zero, is ratio; however, temperature in degrees Fahrenheit or Celsius is interval.
78
A ratio measure is one that measures a
characteristic of an individual where not only do
differences between measures have meaning, but
ratios also have meaning. That is, a measure in
which any two adjoining values are the same
distance apart and there is a true zero point. Ratio
measures have fixed zeros; that is, an interval
measure with a true zero.

79
Examples of Ratio Measures
Time Past Noon: At 2:00, the measure is 2 hours past noon, and at 4:00, the measure is 4 hours past noon. 4 hours is greater than 2 hours; hence at least ordinal. The difference between 4 hours and 2 hours is 2 hours, which has meaning; hence at least interval. Moreover, the ratio of 4 hours to 2 hours is 2; that is, 4 hours is twice as much time as 2 hours. Thus this measure is ratio.
Height: If you are 6 feet tall and your child is 3 feet tall, then you are taller than your child (ordinal), you are 3 feet taller than your child (interval), and you are twice as tall as your child (ratio). Therefore, this measure is ratio.
Age: If you are 36 years old and your child is 12 years old, then you are older than your child (ordinal), you are 24 years older than your child (interval), and you are three times as old as your child (ratio). Therefore, this measure is ratio.
80
Changing Level of Measure
Ratio: What is your yearly salary? (a continuous scale)

Interval: What is your income bracket? (a discrete scale)
0-9,999,
10,000-19,999,
20,000-29,999,
30,000-39,999,
40,000-49,999,
50,000-59,999, etc.?
Where the difference between intervals is 10,000

Ordinal: What is your tax bracket? (a discrete scale)
0-9,999,
10,000-39,999,
40,000-59,999,
60,000 or more?
Where differences are not well-defined

Nominal: In what currency are you paid?
Dollar, Yen, Euro, etc. (ordinal if you consider exchange rates)
81
SRS, Systematic, Cluster, Stratified, etc.
82
Samples
Simple Random
Samples

Systematic

Cluster Samples

Stratified
Samples

Convenience
Samples
83
A random sample is a sample of size n taken
from a population of size N in such a way that
each individual observed has an equally likely
chance of being selected.

84
A simple random sample (SRS) is such that

(1) each individual has an equally likely chance
of being selected as well as

(2) all groups of size n have an equally likely
chance of being selected.
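A simple random sample can be sketched with Python's standard `random.sample`, which draws without replacement so that every group of size n is equally likely. The population of ID numbers and the seed below are hypothetical and serve only to make the example repeatable.

```python
import random

# Hypothetical population: individuals numbered 1 through N = 100.
population = list(range(1, 101))

# A seeded generator is used here only so the example is repeatable.
rng = random.Random(2023)

# Draw n = 10 individuals without replacement.
sample = rng.sample(population, 10)

print(sorted(sample))
```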

85
Common Sampling Schemes
Systematic: Using a system to select
Cluster: Using clusters of individuals that are pre-existing
Stratified: Using groups of individuals defined by specified strata
Convenience (Volunteer Response): Using individuals who are conveniently surveyed
Multi-stage: More than one stage of sampling done in succession
86
Systematic sampling is a sample such that every
kth individual or item is measured.

Every 3rd: 1, not 2, not 3, 4, not 5, not 6, 7, ...
Every 5th: 1, 6, 11, 16, ... or 2, 7, 12, 17, ...
or 5, 10, 15, 20, ..., etc.
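The "every kth" patterns above can be reproduced with list slicing. This is a minimal sketch; the function name `systematic_sample` and the list of 20 numbered individuals are assumptions made for illustration.

```python
def systematic_sample(items, k, start=0):
    """Take every k-th item, beginning at position `start` (0-based)."""
    return items[start::k]


# Hypothetical list of 20 individuals numbered 1 through 20.
individuals = list(range(1, 21))

print(systematic_sample(individuals, 3))           # every 3rd, starting at 1
print(systematic_sample(individuals, 5))           # every 5th, starting at 1
print(systematic_sample(individuals, 5, start=1))  # every 5th, starting at 2
print(systematic_sample(individuals, 5, start=4))  # every 5th, starting at 5
```

In practice the starting position is usually chosen at random among the first k individuals.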

87
Cluster sampling is such that groups are
selected from pre-existing groupings that are
arbitrary to the individual and not based on any
characteristic of the individual.
In the country, by region
In the state, by zip code
In the state or nation, by area code
For example, in a state, randomly selecting five
counties and surveying 100 individuals from each.
88
Stratified sampling is such that individuals are
first grouped by specific characteristics such as
gender and then samples are taken from each
group or stratum.
Individuals grouped by gender
Individuals grouped by age
Individuals grouped by race
For example, grouping individuals by gender,
male/female, then selecting 100 individuals
from each group
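Stratified sampling as described above can be sketched as "group first, then sample within each group". The helper name `stratified_sample`, the record layout, and the group labels are all hypothetical, and the sketch assumes a simple random sample is taken within each stratum.

```python
import random
from collections import defaultdict


def stratified_sample(individuals, key, per_stratum, rng):
    """Group individuals by `key`, then draw `per_stratum` from each group."""
    strata = defaultdict(list)
    for person in individuals:
        strata[key(person)].append(person)
    # Simple random sample (without replacement) inside every stratum.
    return {label: rng.sample(members, per_stratum)
            for label, members in strata.items()}


# Hypothetical population of 20 individuals with a recorded group label.
people = [{"id": i, "gender": "F" if i % 2 == 0 else "M"}
          for i in range(1, 21)]

samples = stratified_sample(people, key=lambda p: p["gender"],
                            per_stratum=3, rng=random.Random(1))

for label, members in samples.items():
    print(label, [p["id"] for p in members])
```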
89
Convenience sampling is such that individuals
are selected based upon ease of access. Such
sampling techniques are prone to bias. An
example of a convenience sampling is a
volunteer response.
Individuals as they passed by
Individuals willing to call in on a talk show
Individuals who agree to take online surveys
90
Multistage sampling is such that more than one
sampling technique is employed in the gathering
of information.

First stratify by gender, then systematically
take every other individual in each group.
First cluster individuals by state, then poll
these regions using mailers which individuals
have the option to fill out at their convenience
91
Too Regular

Implausible Numbers

Inconsistencies

Missing Information

Non-Adherers

Non-sampling Error

Hidden Agenda

Hidden Bias

Survey Error

Under-coverage

Incorrect Arithmetic
92
Control, Randomization, Replication,
& Enough Information
93
An observational study is an experiment
designed to observe without interference from
the observer in that every effort is made not to
sway the subject response or lead a subject in
their response.

Do not sway individuals!

94
Common Observational Studies
Retrospective studies: Historical data (past)
Cross Sectional: Single point in time (present)
Prospective studies (Longitudinal): Data gathered over an extended period of time (future)
95
An experimental study is an experiment designed
to be observed with interference from the observer
in that specific treatments are applied to the
individuals, in an effort to measure differences in the
subject response.

Note: the treatments used in an experiment are
intended to sway the outcome of the subject
response.

Subject → Treatment → Response (Outcome)

96
A treatment is any condition set forth that is
applied to the individual or subject in an effort to
determine differences among a variety of
treatments as compared to each other or a control
group.

97
A control group is a group created for sake of
comparison. This group can be one of the
treatment groups or a group that receives a false
treatment called a placebo.

Experimental Group:
Subject → Treatment → Response (Outcome)

Control Group:
Subject → No Treatment (placebo) or Secondary Treatment → Response (Outcome)


98
The placebo effect occurs when a subject
receives a false treatment (such as a sugar pill) or
no treatment, but (incorrectly) believes he or she
is in fact receiving treatment and responds
favorably.

99
In an experimental design, a block is a group of
individuals stratified based on a similar
characteristic and given treatments.

A block design is an experimental design in
which individuals or subjects are grouped into
categories or blocks and then test blocks are
treated as experimental units given different
treatments.


100
A randomized-block design is an experimental
design in which individual subjects are matched
based on a specific variable. The subjects are
then put into blocks of the same size as the
number of treatments and then each block is
assigned to different treatment groups randomly.

101
A (single) blind experiment is an experiment in
which individual subjects do not know the
treatment they receive; however, the researcher is
aware.

A double blind experiment is an experiment in
which neither the individual subjects nor the
researcher are aware of who received what
treatment.

102
Principles of Experimental Design
Control: A comparative or control group
Randomization: Selected at random
Replication: To verify validity and reliability
Enough Information: More important in inferential statistics and not so much in descriptive statistics
103
Stages of Sampling
Population: Define population of concern
Sampling Frame: The set of variables to be measured
Sampling Method: Systematic, Cluster, Stratified, etc.
Sampling Size (n): Large enough n (compared to N)
Experimental Design (ED): Implement sampling plan (ED)
Sampling: Action of data collection
104
Medical Trials and Simulations
105
Medical Trials
IRB: Institutional Review Board
IEC: Independent Ethics Committee
ERB: Ethical Review Board
Informed Consent: Requires that the individual (1) be informed and (2) give consent
106
Anonymity is when no personal information is
taken; a coding system is in place to allow the
subject to get the information regarding a survey
without giving out any personal information. That
is, the information is not personally identifiable.

Confidentiality is when personal information is
given, but not shared. Only the statistical
summaries are made available to other
organizations or persons involved in the study.


107
Informed consent is when the individual
person is both informed of the ramifications
involved in the study and gives consent to
participate in the knowledge of such things as
side effects.

108
Simulation is the imitation of a natural
process using general characteristics or
behaviors in an effort to mimic or model the
natural system.

"A simulation is only as good as the underlying
analytical model"
CPT
Can be used to verify statistical methods.


109
Examples of Simulation
Use a fair die to simulate the tossing of a fair coin.

ONE POSSIBILITY:

Let evens represent a head and odds represent a tail.

Hence the sequence 1, 5, 4, 6, 5 would represent T, T, H, H, T.
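The even/odd rule from this slide can be written as a small function. The name `rolls_to_tosses` and the seeded generator are hypothetical choices for this sketch; the first call reuses the rolls given on the slide.

```python
import random


def rolls_to_tosses(rolls):
    """Map die rolls to coin tosses: even -> head (H), odd -> tail (T)."""
    return ["H" if roll % 2 == 0 else "T" for roll in rolls]


# The sequence of rolls used on the slide.
slide_example = rolls_to_tosses([1, 5, 4, 6, 5])
print(slide_example)

# Simulating ten fresh tosses with a seeded generator (seed is arbitrary).
rng = random.Random(7)
simulated_rolls = [rng.randint(1, 6) for _ in range(10)]
simulated_tosses = rolls_to_tosses(simulated_rolls)
print(simulated_rolls, simulated_tosses)
```

Because three of the six faces are even, each simulated toss has probability 1/2 of being a head, matching a fair coin.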
110
A random digit chart is a table of digits
selected at random, such as the table in
Appendix B, which can be used to simulate or
sample data.

07892632401926795457
111
Examples of Simulation using Random Digits
A man has a 60% chance of having a boy and a 40% chance of having a girl. Use the random digit chart to simulate the birth order of three children.

Random digits:
07892632401926795457

Let 0-5 represent a boy and 6-9 represent a girl; hence, the sequence of random numbers

078

would simulate the sequence of children: boy, girl, girl.
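The digit-to-child rule can be checked mechanically. In this sketch the function name `digit_to_child` is hypothetical; the digit string and the 0-5 / 6-9 split are taken from the slide.

```python
# Row of random digits quoted on the slide.
DIGITS = "07892632401926795457"


def digit_to_child(digit):
    """Digits 0-5 (six of ten, i.e. 60%) mean boy; 6-9 (40%) mean girl."""
    return "boy" if int(digit) <= 5 else "girl"


# The first three digits simulate the birth order of three children.
birth_order = [digit_to_child(d) for d in DIGITS[:3]]
print(birth_order)
```

Reading the next three digits (926) would simulate a second family in the same way.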
112
Randomization or random charts can
be used to sample or re-sample the
data.

For example, if there are 100 data points available and we only
need 30, then we can randomly select this sample by enumerating
the data and using the random chart to select the required number
with or without replacement. With replacement, we can resample
200 times even though there are only half this many data points to
start; this technique is called bootstrapping.
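Both kinds of selection described above can be sketched with the standard library, using a pseudo-random generator in place of a printed random chart. The 100 data points and the seed are hypothetical stand-ins.

```python
import random

# Hypothetical data set of 100 points (the values stand in for real data).
data = list(range(1, 101))

# Seeded only so the sketch is repeatable.
rng = random.Random(0)

# Without replacement: 30 distinct data points out of the 100.
sample_without = rng.sample(data, 30)

# With replacement: 200 draws from only 100 points, so repeats must occur.
resample_with = rng.choices(data, k=200)

print(len(set(sample_without)), len(resample_with), len(set(resample_with)))
```

A bootstrap analysis typically repeats the with-replacement draw many times and recomputes the statistic of interest on each resample.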
113
Examples of Sampling using Random Digits
A committee of four is to be selected from a group of ten individuals: A, B, C, D, E, F, G, H, I, and J. Using the random set of digits

07892632401926795457

generate a random committee. Explain.

Let 0 represent A, 1 - B, 2 - C, 3 - D, 4 - E, 5 - F, 6 - G, 7 - H, 8 - I, and 9 - J. Using the random set of digits 9263, generate a random committee as follows:
9 - J
2 - C
6 - G
3 - D
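The committee selection can be sketched as reading digits left to right and skipping anyone already chosen. The function name `pick_committee` and its `start` parameter are hypothetical; `start=3` reproduces the slide's choice of the digits 9263 from within the row.

```python
DIGITS = "07892632401926795457"
MEMBERS = "ABCDEFGHIJ"  # digit 0 -> A, 1 -> B, ..., 9 -> J


def pick_committee(digits, members, size, start=0):
    """Read digits from `start`, skipping repeats, until `size` people are chosen."""
    chosen = []
    for digit in digits[start:]:
        person = members[int(digit)]
        if person not in chosen:  # a person cannot serve twice
            chosen.append(person)
        if len(chosen) == size:
            break
    return chosen


print(pick_committee(DIGITS, MEMBERS, 4, start=3))  # digits 9, 2, 6, 3
print(pick_committee(DIGITS, MEMBERS, 4))           # digits 0, 7, 8, 9
print(pick_committee(DIGITS, MEMBERS, 4, start=4))  # the repeated 2 is skipped
```

Any starting point in the row gives a valid random committee, as long as it is chosen before looking at the digits.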
114
115
Descriptive Statistics
vs.
Inferential Statistics

Population vs.
Sample

N vs. n

Census vs. Sample
Survey

Representative
Samples

Sampling Techniques

Simulations

Re-sampling

Statistical
Perspective
Biologists have
microscopes

Physicists have
telescopes

Statisticians have
kaleidoscopes

116
