Rajarshi Mukherjee
Stanford University
Class Logistics
Overview
Teaching Assistants
Sam Gross
Office Hours: Thursday 10 am - 12 pm
Room: Sequoia 227
e-mail: smgross@stanford.edu
Hera He
Office Hours: Friday 1 pm - 3 pm
Room: Sequoia 232
e-mail: hera.he@stanford.edu
Qingyuan Zhao
Office Hours: Monday 12 pm - 2 pm
Room: Sequoia 206
e-mail: qyzhao@stanford.edu
Xiaoying Tian
Office Hours: Tuesday 10 am - 12 pm
Room: Sequoia 232
e-mail: xtian@stanford.edu
Stats 141/Bio 141: Lecture 1
Sections
Prerequisites
Grading
Computing
Teaching Schedule
Compare classes
Stats 60: Statistics for the social sciences (mostly surveys).
Stats 202: Cutting-edge data mining and visualization (akin to bioinformatics).
Stats 300A: Very theoretical. For PhD students with a strong mathematics
background.
Introduction
Define Statistics.
Example 1
A new advertisement for Ben & Jerry's ice cream introduced in late May of last
year resulted in a 30% increase in ice cream sales for the following three months.
Thus, the advertisement was effective.
A major flaw is that ice cream consumption generally increases in the
months of June, July, and August regardless of advertisements.
This is an example of interpreting an outcome as the result of one variable (the
Ben & Jerry's advertisement) when another variable (time of year) is actually
responsible.
Example 2
The more churches in a city, the more crime there is. Thus, churches lead to
crime.
A major flaw is that both increased churches and increased crime rates can
be explained by larger populations.
In bigger cities, there are both more churches and more crime.
A third variable (number of people) can cause both situations.
However, people erroneously believe that there is a causal relationship
between the two primary variables rather than recognize that a third variable
can cause both.
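The role of the third variable can be sketched in a few lines of Python. The city figures below are invented purely for illustration (they are not data from the lecture), and `pearson` is a small helper written here for the example. The idea is to compare the correlation of the raw counts with the correlation of the per-resident rates, which removes the effect of population size.

```python
# Hypothetical cities: (population, number of churches, number of crimes).
# All figures are invented for illustration only.
cities = [
    (10_000, 12, 302),
    (50_000, 55, 1_480),
    (200_000, 210, 6_100),
    (800_000, 790, 24_300),
    (2_000_000, 2_050, 59_800),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)


# Raw counts: both grow with population, so they move together.
churches = [c for _, c, _ in cities]
crimes = [k for _, _, k in cities]
raw_r = pearson(churches, crimes)

# Rates per 10,000 residents: population size is divided out.
church_rate = [c / p * 10_000 for p, c, _ in cities]
crime_rate = [k / p * 10_000 for p, _, k in cities]
rate_r = pearson(church_rate, crime_rate)

print(f"correlation of raw counts:       {raw_r:.3f}")
print(f"correlation of per-capita rates: {rate_r:.3f}")
```

In this made-up data the raw counts are almost perfectly correlated, while the per-capita rates are only weakly related, which is the pattern one would expect if population size drives both counts.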
Example 3
75% more interracial marriages are occurring this year than 25 years ago. Thus,
our society accepts interracial marriages.
A major flaw is that we don't have the information that we need. What is
the rate at which interracial marriages are occurring?
Suppose only 1% of marriages 25 years ago were interracial, so that now
1.75% of marriages are interracial (1.75 is 75% higher than 1).
But this latter number is hardly evidence suggesting the acceptability of
interracial marriages.
In addition, the statistic provided does not rule out the possibility that the
number of interracial marriages has seen dramatic fluctuations over the years
and this year is not the highest.
Again, there is simply not enough information to understand fully the impact
of the statistics.
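The arithmetic behind this example can be checked with a short Python sketch. The 1% starting share is the supposition used above; the second scenario (20% rising to 35%) is an additional hypothetical added here only to show that the same "75% more" headline is compatible with very different situations. The function name `relative_increase` is an illustrative choice.

```python
def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old


# Scenario from the example: 1% of marriages 25 years ago, 75% more now.
old_share = 0.01
new_share = old_share * 1.75

# Hypothetical second scenario with the same relative increase.
old_share_b = 0.20
new_share_b = old_share_b * 1.75

for label, old, new in [("A", old_share, new_share),
                        ("B", old_share_b, new_share_b)]:
    rel = relative_increase(old, new) * 100   # percent change
    points = (new - old) * 100                # change in percentage points
    print(f"Scenario {label}: {old:.2%} -> {new:.2%} "
          f"(+{rel:.0f}% relative, +{points:.2f} percentage points)")
```

Both scenarios report the same relative increase, yet the absolute share of interracial marriages differs by a factor of twenty, which is why the headline number alone says little.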
Definition of Statistics
As a whole, these examples show that statistics are not only facts and figures.
They are something more than that.
In the broadest sense, statistics refers to the art (techniques and
procedures) for analyzing, interpreting, displaying, and making decisions
based on data.
Example
1
However, most of the time the population is too big (it can even be
infinite!), or impossible to census.
In that case we need to work with a sample.
A sample is a collection of individuals selected from the population
we want to study.
Data are the collection of measurements made on the individuals
in the sample.
Hopefully, the sample is representative of the population.
Aim of Statistics: Summarize characteristics of the data (a statistic) to give us
an idea about the corresponding population summaries (parameters). This process
is called inference.
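The distinction between a parameter and a statistic can be illustrated with a small simulation. Everything below is an assumption made for the sketch: the population is simulated (100,000 hypothetical heights in cm drawn from a normal distribution), the sample size of 50 is arbitrary, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

rng = random.Random(141)  # fixed seed so the sketch is reproducible

# Hypothetical population: 100,000 simulated heights in cm.
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Parameter: a summary of the whole population (usually unknown in practice).
parameter = mean(population)

# Sample: 50 individuals chosen at random, without replacement.
sample = rng.sample(population, 50)

# Statistic: the same summary computed on the sample only.
statistic = mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
```

In a real study only the sample is available, so the statistic is used as an estimate of the parameter.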
In This Course...
1
Data with n sampled units and p variables measured for each sampled unit
is therefore represented in an n × p table.
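As a rough sketch of this layout in Python, the block below stores a small table as one tuple per sampled unit and one position per variable. The five rows are taken from the start of this lecture's sample data, assuming its columns line up row by row; the helper name `get_column` is an illustrative choice.

```python
# Variable names (p = 6 columns).
columns = ["Gender", "Blood Gr.", "Health", "# Siblings", "Ht.(m)", "Wt.(Kg)"]

# One tuple per sampled unit (n = 5 rows here).
rows = [
    ("M", "A", "Fair", 5, 1.53, 63.83),
    ("M", "B", "Excellent", 4, 1.16, 92.98),
    ("F", "B", "Very Good", 3, 1.85, 65.19),
    ("M", "A", "Poor", 1, 1.51, 63.58),
    ("M", "B", "Fair", 1, 1.94, 66.37),
]

n, p = len(rows), len(columns)

# Every row must carry exactly one value per variable.
assert all(len(row) == p for row in rows)


def get_column(name):
    """Return all n measurements of one variable."""
    j = columns.index(name)
    return [row[j] for row in rows]


print(f"table shape: {n} x {p}")
print("# Siblings:", get_column("# Siblings"))
```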
Var 1    Var 2      Var 3       Var 4       Var 5    Var 6
Gender   Blood Gr.  Health      # Siblings  Ht.(m)   Wt.(Kg)
M        A          Fair        5           1.53     63.83
M        B          Excellent   4           1.16     92.98
F        B          Very Good   3           1.85     65.19
M        A          Poor        1           1.51     63.58
M        B          Fair        1           1.94     66.37
F        B          Very Good   1           1.62     61.69
F        O          Poor        4           1.22     87.51
F        AB         Very Good   3           1.09     64.83
F        O          Excellent   5           1.52     93.13
M        A          Excellent   0           1.09     87.91
F        O          Excellent   0           1.66     64.72
F        A          Good        2           1.4      64.22
M        O          Very Good   6           1.18     93.71
M        A          Good        3           1.48     89.22
M        O          Fair        0           1.72     65.57
F        A          Very Good   3           1.72     66.43
M        O          Good        2           1.87     65.11
F        AB         Poor        4           1.42     65.67
M        B          Very Good   4           1.53     66.73
M        A          Excellent   3           1.28     89.79
M        O          Excellent   4           1.86     92.08
M        B          Good        3           1.53     65.27
M        O          Excellent   8           1.42     65.54
M        O          Good        5           0.58     63.8
M        O          Excellent   1           2.53     90.91
Types of Variables
Categorical/Qualitative Variables
Nominal variables
Ordinal variables
Numeric Variables
Discrete variables
Continuous variables
We will now go through each of them and try to understand how to suitably
summarize them.
Nominal Variables
The simplest type of variable is a nominal variable, in which the observed values fall
into specific unordered categories/classes.
Examples
Gender: Male (M), Female (F).
Survival Status: alive, deceased.
Blood Group: O, A, B, AB.
Race/Ethnicity: Asian, Caucasian, African.
Cause of Death: natural, accident.
Political Affiliation: Republican, Democrat.
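Because nominal categories have no order, a natural summary is a table of counts and proportions. The sketch below does this for the Blood Gr. column of the lecture's sample data using `collections.Counter`; treating the most frequent category (the mode) as the "typical" value is one common choice, not the only one.

```python
from collections import Counter

# Blood group of the 25 sampled individuals (Blood Gr. column of the sample data).
blood = ["A", "B", "B", "A", "B", "B", "O", "AB", "O", "A", "O", "A", "O",
         "A", "O", "A", "O", "AB", "B", "A", "O", "B", "O", "O", "O"]

counts = Counter(blood)
n = len(blood)

# Frequency table: count and proportion for each category.
for group, count in counts.most_common():
    print(f"{group:>2}: {count:2d}  ({count / n:.0%})")

# The mode is the only "average" that makes sense for a nominal variable.
mode_group = counts.most_common(1)[0][0]
print("most common blood group:", mode_group)
```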
In the sample data table, Gender and Blood Gr. are the nominal variables.
Ordinal Variables
Variables are ordinal if the observed values fall into specific categories/classes and
the order among the categories is important.
Examples
Patient satisfaction with care received in the hospital: poor, fair, good,
very good, excellent.
Injury status: none, minor, moderate, severe, fatal.
Level of education: Bachelor, Masters, PhD.
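Since the categories of an ordinal variable are ordered, summaries that rely on order, such as the median category, become meaningful. The sketch below uses the Health column of the lecture's sample data; the ordering from "Poor" to "Excellent" is assumed from the usual meaning of the labels.

```python
from collections import Counter

# Category order, lowest to highest (assumed from the labels).
levels = ["Poor", "Fair", "Good", "Very Good", "Excellent"]
rank = {level: i for i, level in enumerate(levels)}

# Self-reported health of the 25 sampled individuals (Health column).
health = ["Fair", "Excellent", "Very Good", "Poor", "Fair", "Very Good",
          "Poor", "Very Good", "Excellent", "Excellent", "Excellent", "Good",
          "Very Good", "Good", "Fair", "Very Good", "Good", "Poor",
          "Very Good", "Excellent", "Excellent", "Good", "Excellent", "Good",
          "Excellent"]

# Frequency table listed in category order rather than alphabetically.
counts = Counter(health)
for level in levels:
    print(f"{level:<10} {counts[level]}")

# Median category: sort by rank and take the middle of the 25 values.
ordered = sorted(health, key=rank.__getitem__)
median_health = ordered[len(ordered) // 2]
print("median category:", median_health)
```

Note that the ranks are used only for sorting; the distance between "Poor" and "Fair" need not equal the distance between "Good" and "Very Good", so averaging the ranks would be harder to justify.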
In the sample data table, Health is the ordinal variable.
Categorical Variables
Discrete Variables
A variable is discrete if it can take only a specific set of values and both the
order and magnitude of those values matter.
Numbers are not merely labels; they are actual measurable quantities.
However, these quantities are restricted to taking on specified values only,
usually integers or counts.
Examples
Number of siblings a person has (0, 1, 2, 3, . . .).
Number of motor vehicle accidents in the city of Boston in a given week (0, 1, 2, 3, . . .).
Number of new cases of diabetes diagnosed in the United States over a
one-year period (0, 1, 2, 3, . . .).
Number of days it rained in Palo Alto in the first week of November (0, 1, 2, 3, 4, 5, 6, 7).
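For a discrete variable the values are genuine numbers, so numerical summaries such as the mean apply in addition to a frequency table. A minimal sketch using the # Siblings column of the lecture's sample data and Python's standard `statistics` module:

```python
from collections import Counter
from statistics import mean, median, mode

# Number of siblings of the 25 sampled individuals (# Siblings column).
siblings = [5, 4, 3, 1, 1, 1, 4, 3, 5, 0, 0, 2, 6,
            3, 0, 3, 2, 4, 4, 3, 4, 3, 8, 5, 1]

# Frequency table over the distinct counts that actually occur.
counts = Counter(siblings)
for value in sorted(counts):
    print(f"{value} siblings: {counts[value]}")

# Numerical summaries are meaningful because magnitude matters.
sib_mean = mean(siblings)
sib_median = median(siblings)
sib_mode = mode(siblings)
print("mean:", sib_mean, "median:", sib_median, "mode:", sib_mode)
```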
In the sample data table, # Siblings is the discrete variable.
Continuous Variables
Variables are continuous when both order and magnitude are important, but
quantities are not restricted to taking on specified values/counts.
Examples
Height.
Birth Weight.
Length of time a lung cancer patient survives after diagnosis.
Concentration of mercury in a particular fish.
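A continuous variable rarely repeats exact values, so a frequency table of raw values is not very informative; one common approach is to group the values into intervals first. The sketch below bins the Ht.(m) column of the lecture's sample data into intervals of width 0.5 m. The bin width is an arbitrary choice made for this illustration.

```python
from collections import Counter
from statistics import mean

# Heights in metres of the 25 sampled individuals (Ht.(m) column).
heights = [1.53, 1.16, 1.85, 1.51, 1.94, 1.62, 1.22, 1.09, 1.52, 1.09,
           1.66, 1.4, 1.18, 1.48, 1.72, 1.72, 1.87, 1.42, 1.53, 1.28,
           1.86, 1.53, 1.42, 0.58, 2.53]

BIN_WIDTH = 0.5  # metres; arbitrary choice for this sketch

# Bin index k covers the interval [k * BIN_WIDTH, (k + 1) * BIN_WIDTH).
bins = Counter(int(h // BIN_WIDTH) for h in heights)

for k in range(min(bins), max(bins) + 1):
    low, high = k * BIN_WIDTH, (k + 1) * BIN_WIDTH
    print(f"[{low:.1f}, {high:.1f}) m: {'*' * bins[k]} ({bins[k]})")

h_min, h_max, h_mean = min(heights), max(heights), mean(heights)
print(f"min {h_min} m, max {h_max} m, mean {h_mean:.2f} m")
```

Grouping the values this way also makes the smallest and largest heights easy to spot, since they fall in bins of their own.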
In the sample data table, Ht.(m) and Wt.(Kg) are the continuous variables.
Numeric Variables