
1. Principles of Biostatistics, Marcello Pagano
2. Statistics: Principles and Methods, Richard A. Johnson & Gouri K. Bhattacharyya
What is Statistics?
Statistics as a subject provides a
body of principles and methodology
for designing the process of data
collection, summarizing and
interpreting the data, and drawing
conclusions or generalizations
Statistical Methods
Descriptive statistics
Collecting and describing data
Inferential statistics
Drawing conclusions and/or making
decisions concerning a population
based only on sample data
Population and Sample
Population: use parameters to summarize features
Sample: use statistics to summarize features
Inference on the population from the sample
Population
A statistical population is the set of
measurements (or records of some
qualitative trait) corresponding to
the entire collection of units about
which information is sought
Sample
A sample from a statistical
population is the set of
measurements that are actually
collected in the course of an
investigation
Descriptive Statistics
Collect data
e.g. Survey
Present data
e.g. Tables and graphs
Characterize data
e.g. Sample mean:
$$\bar{X} = \frac{\sum X_i}{n}$$
Inferential Statistics
Estimation
e.g.: Estimate the
population mean weight
using the sample mean
weight
Hypothesis testing
e.g.: Test the claim that
the population mean
weight is 120 pounds
Drawing conclusions and/or making decisions
concerning a population based on sample results.
Why We Need Data
To provide input to survey
To provide input to study
To measure performance of service or
production process
To evaluate conformance to standards
To assist in formulating alternative
courses of action
To satisfy curiosity
Types of Data
Data
Categorical (Qualitative)
Numerical (Quantitative): Discrete, Continuous
Reasons for Drawing a Sample
Less time consuming than a census
Less costly to administer than a
census
Less cumbersome and more
practical to administer than a
census of the targeted population
Chapter Topics
Tabulating and graphing Univariate
categorical data
The summary table
Bar and pie charts, the Pareto diagram
Tabulating and graphing Bivariate
categorical data
Contingency tables
Side by side bar charts
Graphical excellence and common errors in
presenting data
Organizing numerical data
The ordered array and stem-leaf display
Tabulating and graphing Univariate
numerical data
Frequency distributions: tables,
histograms, polygons
Cumulative distributions: tables, the Ogive
Graphing Bivariate numerical data
Organizing Numerical Data
[Diagram: numerical data is organized as an ordered array or a stem-and-leaf display, then tabulated and graphed as frequency distributions and cumulative distributions (tables, histograms, polygons, ogive).]
Data in raw form (as collected):
24, 26, 24, 21, 27, 27, 30, 41, 32, 38
Data in ordered array from smallest to
largest:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41
Stem-and-leaf display:
2 | 144677
3 | 028
4 | 1
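The ordered array and stem-and-leaf display can be reproduced with a short sketch. It assumes non-negative two-digit integers, with the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is illustrative and not part of the slides.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows for non-negative two-digit integers."""
    rows = defaultdict(list)
    for v in sorted(values):            # sorting gives the ordered array
        stem, leaf = divmod(v, 10)      # tens digit is the stem, units digit the leaf
        rows[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(rows.items())]

raw = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]   # raw data from the slide
ordered = sorted(raw)
display = stem_and_leaf(raw)

print(ordered)
for line in display:
    print(line)
```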
Tabulating and Graphing
Numerical Data
[Diagram: frequency distributions and cumulative distributions of numerical data are presented as tables, histograms, polygons, and the ogive.]
Tabulating Numerical Data:
Frequency Distributions
Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (usually between 5 and 15)
Compute class interval (width): 10 (46/5 then round up)
Determine class boundaries (limits): 10, 20, 30, 40, 50, 60
Compute class midpoints: 15, 25, 35, 45, 55
Count observations & assign to classes
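The steps above can be sketched in a few lines. This is one possible reading of the procedure: the width is rounded up with `math.ceil`, and the first boundary is assumed to be the largest multiple of the width not exceeding the minimum (that starting rule is an assumption, chosen because it reproduces the boundaries listed above).

```python
import math

data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]

num_classes = 5
data_range = max(data) - min(data)              # 58 - 12 = 46
width = math.ceil(data_range / num_classes)     # 46 / 5 = 9.2, rounded up to 10

# Assumption: start the first class at a multiple of the class width.
lower = (min(data) // width) * width            # 10
boundaries = [lower + k * width for k in range(num_classes + 1)]
midpoints = [(a + b) / 2 for a, b in zip(boundaries, boundaries[1:])]

# Each class is "a but under b", so the lower limit is inclusive.
frequencies = [sum(a <= x < b for x in data)
               for a, b in zip(boundaries, boundaries[1:])]
relative = [f / len(data) for f in frequencies]

for (a, b), f, r in zip(zip(boundaries, boundaries[1:]), frequencies, relative):
    print(f"{a} but under {b}: frequency {f}, relative {r:.2f}, percentage {100 * r:.0f}")
```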
Frequency Distributions, Relative
Frequency Distributions and Percentage
Distributions




Class              Frequency   Relative Frequency   Percentage
10 but under 20        3             .15                15
20 but under 30        6             .30                30
30 but under 40        5             .25                25
40 but under 50        4             .20                20
50 but under 60        2             .10                10
Total                 20            1                  100
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
Graphing Numerical Data:
The Histogram
[Figure: histogram. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More (the bars meet at the class boundaries); vertical axis: frequency, 0 to 7. Bar heights: 0, 3, 6, 5, 4, 2, 0. No gaps between bars.]
Graphing Numerical Data:
The Frequency Polygon
[Figure: frequency polygon. Horizontal axis: class midpoints 5, 15, 25, 35, 45, 55, More; vertical axis: frequency, 0 to 7.]
Tabulating Numerical Data:
Cumulative Frequency
Class              Cumulative Frequency   Cumulative % Frequency
10 but under 20            3                      15
20 but under 30            9                      45
30 but under 40           14                      70
40 but under 50           18                      90
50 but under 60           20                     100
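The cumulative columns are running totals of the class frequencies. A minimal sketch, assuming the class frequencies 3, 6, 5, 4, 2 from the frequency distribution of the same data:

```python
from itertools import accumulate

frequencies = [3, 6, 5, 4, 2]                  # class frequencies for 10-20, ..., 50-60
upper_boundaries = [20, 30, 40, 50, 60]

cumulative = list(accumulate(frequencies))     # running total of frequencies
total = cumulative[-1]
cumulative_pct = [100 * c / total for c in cumulative]

# The ogive plots cumulative percentage against the upper class boundaries.
ogive_points = list(zip(upper_boundaries, cumulative_pct))
print(cumulative)
print(ogive_points)
```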
Graphing Numerical Data:
The Ogive (Cumulative % Polygon)
[Figure: ogive. Horizontal axis: class boundaries 10, 20, 30, 40, 50, 60 (not midpoints); vertical axis: cumulative percentage, 0 to 100.]
Graphing Bivariate Numerical
Data (Scatter Plot)
[Figure: Mutual Funds Scatter Plot. Horizontal axis: Net Asset Values (0 to 40); vertical axis: Total Year to Date Return (%) (0 to 40).]
Tabulating and Graphing
Categorical Data: Univariate Data
Categorical Data
Tabulating Data
The Summary Table
Graphing Data
Pie Charts
Pareto Diagram
Bar Charts
Summary Table
(for an Investors Portfolio)
Investment Category   Amount (in thousands $)   Percentage
Stocks                       46.5                  42.27
Bonds                        32                    29.09
CD                           15.5                  14.09
Savings                      16                    14.55
Total                       110                   100
Variables are Categorical
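The percentage column of the summary table is each amount divided by the total. A small sketch using the portfolio amounts from the table (the dictionary layout is just one convenient representation):

```python
# Amounts in thousands of dollars, taken from the summary table.
portfolio = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(portfolio.values())
percentages = {name: round(100 * amount / total, 2)
               for name, amount in portfolio.items()}

for name, pct in percentages.items():
    print(f"{name:8s} {portfolio[name]:6.1f} {pct:6.2f}")
print(f"{'Total':8s} {total:6.1f}")
```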
Graphing Categorical Data:
Univariate Data
[Diagram: categorical data is tabulated in a summary table and graphed with bar charts, pie charts, and the Pareto diagram.]
Bar Chart
(for an Investors Portfolio)
[Figure: horizontal bar chart titled "Investor's Portfolio". Categories: Stocks, Bonds, CD, Savings; horizontal axis: Amount in K$ (0 to 50).]
Pie Chart
(for an Investors Portfolio)
[Figure: pie chart titled "Amount Invested in K$". Stocks 42%, Bonds 29%, Savings 15%, CD 14%.]
Percentages are rounded to the nearest percent.
Summary Measures
Measures of central tendency
Mean, median, mode, geometric mean, midrange
Quartile
Measure of variation
Range, interquartile range, variance and standard
deviation, coefficient of variation
Shape
Symmetric, skewed, using box-and-whisker plots
Chapter Topics
Coefficient of correlation
Pitfalls in numerical descriptive
measures and ethical considerations
Summary Measures
Central Tendency: Mean, Median, Mode, Geometric Mean
Quartile
Variation: Range, Variance, Standard Deviation, Coefficient of Variation
Measures of Central Tendency
Central Tendency: Average, Median, Mode, Geometric Mean
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}, \qquad \mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
Mean (Arithmetic Mean)
Mean (arithmetic mean) of data values
Sample mean ($n$ = sample size):
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$
Population mean ($N$ = population size):
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$$
Mean (Arithmetic Mean)
The most common measure of central
tendency
Affected by extreme values (outliers)
[Figure: two dot plots. Left: scale 0 to 10, Mean = 5. Right: scale 0 to 14, Mean = 6.]
Median
Robust measure of central tendency
Not affected by extreme values



In an ordered array, the median is the middle number
If n or N is odd, the median is the middle number
If n or N is even, the median is the average of the
two middle numbers
[Figure: two dot plots. Left: scale 0 to 10, Median = 5. Right: scale 0 to 14, Median = 5.]
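The contrast between the mean and the median can be checked numerically. The two data sets below are hypothetical: the slides do not list the plotted values, so these were chosen only because they give the same means (5 and 6) and medians (5 and 5) shown in the figures.

```python
from statistics import mean, median

# Hypothetical values, not taken from the slides.
without_outlier = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 14]    # largest value replaced by an extreme value

print(mean(without_outlier), median(without_outlier))
print(mean(with_outlier), median(with_outlier))
```

Replacing 9 with 14 moves the mean but leaves the median unchanged, which is what "robust" means here.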
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode
There may be several modes

[Figure: two dot plots. First: scale 0 to 14, Mode = 9. Second: scale 0 to 6, No Mode.]
Geometric Mean
Useful for measuring the rate of change
of a variable over time
$$\bar{X}_G = (X_1 \times X_2 \times \cdots \times X_n)^{1/n}$$
Geometric mean rate of return
Measures the status of an investment over
time
$$\bar{R}_G = \left[(1 + R_1) \times (1 + R_2) \times \cdots \times (1 + R_n)\right]^{1/n} - 1$$
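A short sketch of the geometric mean rate of return, written as the n-th root of the product of the growth factors (1 + R_i), minus 1. The returns used are hypothetical (a 50% gain followed by a 50% loss) and do not come from the slides.

```python
import math

def geometric_mean_return(returns):
    """Geometric mean rate of return for per-period returns given as fractions."""
    growth = math.prod(1 + r for r in returns)   # total growth factor over all periods
    return growth ** (1 / len(returns)) - 1

# Hypothetical returns: +50% in year 1, -50% in year 2.
returns = [0.5, -0.5]
arithmetic = sum(returns) / len(returns)
geometric = geometric_mean_return(returns)

print(f"arithmetic mean return: {arithmetic:.4f}")
print(f"geometric mean return:  {geometric:.4f}")
```

The arithmetic mean of these returns is zero, yet the investment ends at 75% of its starting value; the negative geometric mean reflects that loss.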

Quartiles
Split Ordered Data into 4 Quarters
25% | 25% | 25% | 25%, divided at $Q_1$, $Q_2$, $Q_3$
Position of i-th Quartile:
$$\text{Position of } Q_i = \frac{i(n+1)}{4}$$
$Q_1$ and $Q_3$ Are Measures of Noncentral Location
$Q_2$ = Median, A Measure of Central Tendency
Data in Ordered Array: 11 12 13 16 16 17 18 21 22
$$\text{Position of } Q_1 = \frac{1(9+1)}{4} = 2.5, \qquad Q_1 = \frac{12 + 13}{2} = 12.5$$
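The position rule can be sketched as follows. The slide only shows a position ending in .5, which is resolved by averaging the two neighbours; the linear interpolation used here for any fractional position is an assumption that agrees with that example. The function name `quartile` is illustrative, and the sketch does not guard against positions that fall outside very small data sets.

```python
def quartile(ordered, i):
    """i-th quartile of an ordered list using the position i * (n + 1) / 4."""
    n = len(ordered)
    position = i * (n + 1) / 4       # 1-based position in the ordered array
    lower = int(position)            # whole part of the position
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    # Assumption: interpolate linearly between the two neighbouring values.
    below, above = ordered[lower - 1], ordered[lower]
    return below + fraction * (above - below)

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3)
```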
Range (Variation)
Measure of variation
Difference between the largest and the
smallest observations:
$$\text{Range} = X_{\text{Largest}} - X_{\text{Smallest}}$$
Ignores the way in which data are distributed
[Figure: two dot plots on a scale from 7 to 12; in both, Range = 12 - 7 = 5.]
Variance
Important measure of variation
Shows variation about the mean
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Standard Deviation
Most important measure of variation
Shows variation about the mean
Has the same units as the original data
Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$
Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Comparing Standard
Deviations
[Figure: three dot plots on a scale from 11 to 21, each with Mean = 15.5. Data A: s = 3.338. Data B: s = .9258. Data C: s = 4.57.]
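The sample variance and standard deviation can be computed directly from their definitions and checked against Python's `statistics` module. The three data sets below are assumptions: the slides give only the summary values, and these lists were chosen because they reproduce the stated mean (15.5) and standard deviations (3.338, .9258, 4.57).

```python
from math import isclose, sqrt
from statistics import mean, stdev, variance

def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)

# Assumed data sets, not listed in the slides.
data_a = [11, 12, 13, 16, 16, 17, 18, 21]
data_b = [14, 15, 15, 15, 16, 16, 16, 17]
data_c = [11, 11, 11, 12, 19, 20, 20, 20]

for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    s2 = sample_variance(data)
    # The hand-written formula should agree with the library function.
    assert isclose(s2, variance(data))
    print(f"Data {name}: mean = {mean(data)}, s^2 = {s2:.3f}, s = {sqrt(s2):.4f}")
```

All three sets share the same mean, so only the spread distinguishes them: the more tightly the values cluster around 15.5, the smaller s becomes.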
Comparing Coefficient
of Variation
Stock A:
Average price last year = $50
Standard deviation = $5
Stock B:
Average price last year = $100
Standard deviation = $5
Coefficient of variation:
Stock A:
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{\$5}{\$50}\right) 100\% = 10\%$$
Stock B:
$$CV = \left(\frac{S}{\bar{X}}\right) 100\% = \left(\frac{\$5}{\$100}\right) 100\% = 5\%$$
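The same comparison as a quick check in code, using the two stocks' figures from the slide (the function name is illustrative):

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean_value

cv_a = coefficient_of_variation(5, 50)     # Stock A
cv_b = coefficient_of_variation(5, 100)    # Stock B
print(f"Stock A: {cv_a}%  Stock B: {cv_b}%")
```

Both stocks have the same standard deviation in dollars, but relative to its average price Stock A varies twice as much as Stock B.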
Exploratory Data Analysis
Box-and-whisker plot
Graphical display of data using 5-number
summary
[Figure: box-and-whisker plot on a scale from 4 to 12, marking $X_{\text{smallest}}$, $Q_1$, Median ($Q_2$), $Q_3$, and $X_{\text{largest}}$.]
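A five-number summary can be sketched with the standard library. The data are the nine values from the quartile example, reused here for illustration since the values behind the plot are not given. `statistics.quantiles` with its default `method="exclusive"` uses (n + 1)-based positions, which matches the quartile position rule for this data.

```python
from statistics import quantiles

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]

q1, q2, q3 = quantiles(data, n=4)          # three cut points: Q1, median, Q3
five_number = (min(data), q1, q2, q3, max(data))
iqr = q3 - q1                              # interquartile range, the width of the box

print(five_number)
print(iqr)
```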
Coefficient of Correlation
Measures the strength of the linear
relationship between two quantitative
variables




$$r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \; \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
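A direct translation of the formula for r, as a sketch. The paired values are hypothetical and serve only to show a perfect positive, a perfect negative, and an intermediate linear relationship.

```python
import math

def correlation(xs, ys):
    """Coefficient of correlation r for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired data, not taken from the slides.
x = [1, 2, 3, 4, 5]
r_positive = correlation(x, [2, 4, 6, 8, 10])   # points on a rising line
r_negative = correlation(x, [10, 8, 6, 4, 2])   # points on a falling line
r_partial = correlation(x, [2, 1, 4, 3, 5])     # rising trend with scatter

print(r_positive, r_negative, r_partial)
```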
