Stats 141/Bio 141: Lecture 1

Rajarshi Mukherjee
Stanford University

September 23, 2014

Class Logistics

Overview

Instructor: Rajarshi Mukherjee
Office: Sequoia Hall 202
e-mail: rmukherj@stanford.edu

Class Meetings: 2:15 pm - 3:30 pm, Tuesday and Thursday, 200-02.

Office Hours: Thursday 4:00 pm - 6:00 pm, and by appointment. (Room TBD)

Who am I? Stein Fellow/Lecturer, Department of Statistics, Stanford University.

Definitely Not a Professor!

Teaching Assistants
Sam Gross
Office Hours: Thursday 10 am - 12 pm
Room: Sequoia 227
e-mail: smgross@stanford.edu
Hera He
Office Hours: Friday 1 pm - 3 pm
Room: Sequoia 232
e-mail: hera.he@stanford.edu
Qingyuan Zhao
Office Hours: Monday 12 pm - 2 pm
Room: Sequoia 206
e-mail: qyzhao@stanford.edu
Xiaoying Tian
Office Hours: Tuesday 10 am - 12 pm
Room: Sequoia 232
e-mail: xtian@stanford.edu

Sections

Wed 3:40 PM - 4:30 PM at 380-380X (listed as 3:15 PM - 4:30 PM)
Friday 3:15 PM - 4:05 PM at 380-380W
You can come to either of these sections.
Sections are not mandatory but recommended.
Sections will be used to review course materials, work through practice
problems, and discuss computing.

Prerequisites

Some background in algebra will be helpful.
There will be more theoretical development than in Stats 60/Stats 160.
However, no prior knowledge of statistics will be assumed.

Homework and Exams


Homework: Assigned each week on Thursday. Due at the beginning of
class on Thursday of the following week. The lowest of your homework
scores will be dropped, so if you fail to turn one in, you will still be OK.
Quizzes: Quiz 1 Thursday October 9th; Quiz 2 December 2nd. (In class)
Midterm Exam: Wednesday, November 12th, in the evening (time and
place to be announced). It will roughly cover material from Lectures 1
through 12.
Final Exam: Date, time and place are determined by the Registrar.
Currently, we are listed as having our final on Thursday, December 11th
from 7pm to 10pm. The location is TBA. The final will test knowledge from
all sections of the course.
All exams are open book, but no computers will be allowed. You may use
hand-held scientific calculators.

Grading

Grades: Your class grade will be calculated as follows:


Homework: 20%
Quizzes: 10% each (total 20%)
Midterm exam: 25%
Final exam: 35%
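
The weighting above is a plain weighted average. A minimal Python sketch follows (the course itself uses R). It assumes every component is scored on a 0-100 scale and that homework assignments count equally after the lowest is dropped; the slides state neither, and the student scores are invented.

```python
# Weights copied from the grading slide; together they make up 100%.
WEIGHTS = {"homework": 0.20, "quiz1": 0.10, "quiz2": 0.10,
           "midterm": 0.25, "final": 0.35}

def homework_score(homeworks):
    """Average homework score after dropping the single lowest score.

    Assumption: all homework assignments carry equal weight.
    """
    kept = sorted(homeworks)[1:]
    return sum(kept) / len(kept)

def class_grade(scores):
    """Weighted average of the component scores (each on a 0-100 scale)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Hypothetical student; the missed homework (0) is the one that is dropped.
hw = homework_score([90, 0, 80, 100])
scores = {"homework": hw, "quiz1": 80, "quiz2": 90, "midterm": 70, "final": 85}
print(f"homework average: {hw:.1f}")
print(f"class grade:      {class_grade(scores):.2f}")
```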

Textbook and Class Notes


Textbook:
Required: Statistics for the Life Sciences: Samuels, Witmer and Schaffner,
4th Edition (Prentice Hall)
Suggested: Introductory Statistics with R: Dalgaard, 2nd Edition (Springer)
Using R for Introductory Statistics: Verzani, 1st Edition (Chapman &
Hall/CRC)
Course Website URL:
https://coursework.stanford.edu/portal/site/F14-BIO-141-01
Lecture Notes will be uploaded on the website before class and should be
downloaded/printed out in advance and brought to class.
I will try to post or mention the textbook section(s) being covered.
Today: General Introduction, Sections 1.3 and 2.1, Types of Variables.

Computing

For statistical computation and programming we will use the R environment.
You can download it for free from http://cran.r-project.org/.
You can find extensive documentation for free on the Web, starting with the
same site and the suggested textbooks.
It is also suggested that you download an editor for R. We suggest RStudio,
which can be downloaded from http://www.rstudio.com/.
Weekly sections will talk about computing and data examples.
Nice Introduction: https://www.codeschool.com/courses/try-r.

Teaching Schedule

A tentative schedule of topics to be covered is given on the website in the
syllabus document.
I will not be teaching on October 16. Instead there will be a class-long TA
section devoted to review and questions.
October 10 Fri, 5:00 pm: Last day to add or drop a class.

Compare classes
Stats 60: Statistics for social sciences (mostly surveys).

Stats 110: A much more mathy class.

Stats 141: Understanding Tools used in Statistics.

Stats 202: Cutting edge data mining and visuals. (akin to bioinformatics)

Stats 300A: Very theoretical. For PhD students with lots of math
background.

Introduction

Let's Begin with the Following

Describe the range of applications of statistics.

Identify situations in which statistics can be misleading.

Define Statistics.

Range of applications of statistics


Statistics include numerical facts and figures.
Earth Sciences: The largest earthquake measured 9.2 on the Richter scale.
Social Sciences: Men are at least 10 times more likely than women to
commit murder.
Public Health: One in every 8 South Africans is HIV positive.
Biology: ANGPTL 3, 4 and 5 genes might be associated with
hypertriglyceridemia in Europe.
Statistical Physics, History, Anthropology, ...
http://en.wikipedia.org/wiki/List_of_fields_of_application_of_statistics

Be Cautious against Misleading Statistics

The study of statistics involves math and relies upon calculations of numbers.
But it also relies heavily on how the numbers are chosen and how the
statistics are interpreted.
For example, consider the following three scenarios and the interpretations
based upon the presented statistics.
You will find that the numbers may be right, but the interpretation may be
wrong.

Example 1
A new advertisement for Ben & Jerry's ice cream introduced in late May of last
year resulted in a 30% increase in ice cream sales for the following three months.
Thus, the advertisement was effective.
A major flaw is that ice cream consumption generally increases in the
months of June, July, and August regardless of advertisements.
This is an example where one interprets outcomes as the result of one variable
(the Ben & Jerry's advertisement) when another variable (time of the year) is
actually responsible.

Example 2
The more churches in a city, the more crime there is. Thus, churches lead to
crime.
A major flaw is that both increased churches and increased crime rates can
be explained by larger populations.
In bigger cities, there are both more churches and more crime.
A third variable (number of people) can cause both situations.
However, people erroneously believe that there is a causal relationship
between the two primary variables rather than recognize that a third variable
can cause both.
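
The role of the third variable can be sketched numerically. The Python snippet below uses invented figures for five hypothetical cities (not real data) and a hand-written Pearson correlation. It compares the association between raw counts of churches and crimes with the association between per-resident rates, which remove the effect of city size.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Invented data for five hypothetical cities (illustration only).
population_thousands = [10, 50, 100, 500, 1000]
churches = [6, 24, 55, 240, 520]
crimes = [45, 250, 400, 2000, 4600]

# Raw counts: both grow with city size, so they move together.
r_counts = pearson_r(churches, crimes)

# Rates per 1000 residents take city size out of the comparison.
church_rate = [c / p for c, p in zip(churches, population_thousands)]
crime_rate = [c / p for c, p in zip(crimes, population_thousands)]
r_rates = pearson_r(church_rate, crime_rate)

print(f"correlation of counts: {r_counts:.2f}")
print(f"correlation of rates:  {r_rates:.2f}")
```

In this made-up data the counts are almost perfectly correlated, while the rates are only weakly related, which is what one would expect if population size drives both counts.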

Example 3
75% more interracial marriages are occurring this year than 25 years ago. Thus,
our society accepts interracial marriages.
A major flaw is that we don't have the information that we need. What is
the rate at which marriages are occurring?
Suppose only 1% of marriages 25 years ago were interracial and so now
1.75% of marriages are interracial (1.75 is 75% higher than 1).
But this latter number is hardly evidence suggesting the acceptability of
interracial marriages.
In addition, the statistic provided does not rule out the possibility that the
number of interracial marriages has seen dramatic fluctuations over the years
and this year is not the highest.
Again, there is simply not enough information to understand fully the impact
of the statistics.
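
The arithmetic behind this example can be checked directly. The short Python sketch below separates the relative increase (the "75% more" headline) from the absolute change in percentage points. The 1% and 1.75% shares come from the slide; the second scenario (20% to 35%) is an invented contrast.

```python
def relative_increase(old, new):
    """Relative change from old to new, as a fraction of old."""
    return (new - old) / old

def change_in_points(old, new):
    """Absolute change between two shares, in percentage points."""
    return (new - old) * 100

# Scenario from the slide: 1% of marriages then, 1.75% now.
rare = (0.01, 0.0175)
# Invented contrast: 20% of marriages then, 35% now.
common = (0.20, 0.35)

for label, (old, new) in [("rare", rare), ("common", common)]:
    print(f"{label}: relative increase = {relative_increase(old, new):.0%}, "
          f"absolute change = {change_in_points(old, new):.2f} points")
```

Both scenarios produce the same relative increase, yet they describe very different situations, which is why the base rate is needed to interpret the headline figure.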

Definition of Statistics

As a whole, these examples show that statistics are not only facts and figures.
They are something more than that.
In the broadest sense, statistics refers to the art (techniques and
procedures) of analyzing, interpreting, displaying, and making decisions
based on data.

Why Study Statistics

Statistics Encountered in Daily Life


4 out of 5 dentists recommend Dentyne.
Almost 85% of lung cancers in men and 45% in women are tobacco-related.
A surprising new study shows that eating egg whites can increase one's life
span.
People predict that it is very unlikely there will ever be another baseball
player with a batting average over .400.

Why Study Statistics

Statistics are often presented in an effort to add credibility to an argument
or advice.
Hopefully, studying statistics will make us intelligent consumers of
statistical claims.
It can be a claim made by a biologist about collected data or a claim made
by a company about its product.
Charles Frederick Mosteller:
While it is easy to lie with statistics, it is even easier to lie without them.

Population and Sample


Statistics is the art of dealing with Data.
However, as we said earlier, one of the most important things to understand
is the context.
In statistics, we always want to study certain characteristics of individuals
(people, animals, organisms, objects, ...) in a population of interest.
It is important to identify the population of interest and the characteristics
of interest/ objectives of the study before embarking on statistical analysis.

Example
1. Height of Basketball Players: Population - all basketball players;
   characteristic of interest - height.
2. Rate of Cell Division of a Skin Cancer Tissue: Population - all skin
   cancer tissues; characteristic of interest - rate of cell division.
3. Relationship between Work Hours and Blood Pressure in Texas:
   Population - all people in Texas; characteristics of interest - work hours,
   blood pressure.

Population and Sample

However, most of the time the population is too big (it can even be
infinite!), or impossible to census.
In that case we need to work with a sample.
A sample is a collection of selected individuals chosen from the population
we want to study.
Data is the collection of measurements made on the selected individuals
in the sample.
Hopefully, the sample is representative of the population.
Aim of Statistics: Summarize data characteristics (Statistic) to give us an
idea about population summaries (Parameter): a process called inference.

Importance of Representative Sample: Random Sample


It is very important that the sampled data is representative of the
population we are trying to study.
Random Sample: Subjects are selected from a population so that each
individual has an equal chance of being selected.
Random samples tend to be representative of the source population.
Non-random samples may not be representative. If the sample is not
representative, the inference drawn is potentially false and misleading. This
problem is due to Sampling Bias.
Section 1.3 is a good read.
From now onwards, we will always assume that the data is a valid random
sample from the population.
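
A small simulation can make sampling bias concrete. The Python sketch below builds a hypothetical population of 1000 simulated heights, all invented. It then compares a simple random sample, where every individual has the same chance of selection, with a deliberately biased sample of only the tallest individuals. The seed, population size and sample size are arbitrary choices.

```python
import random
from statistics import mean

rng = random.Random(141)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights (cm) of 1000 individuals.
population = [rng.gauss(170, 10) for _ in range(1000)]

# Simple random sample: each individual is equally likely to be chosen.
random_sample = rng.sample(population, 50)

# Biased sample: only the 50 tallest individuals are measured
# (like estimating average height from basketball players alone).
biased_sample = sorted(population)[-50:]

pop_mean = mean(population)
print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {mean(random_sample):.1f}")
print(f"biased sample mean: {mean(biased_sample):.1f}")
```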

Central Dogma of Statistics

In This Course...
1. Descriptive Statistics: Given data, how to perform initial description and
   visualization of it.
   Identifying different types of data.
   Tabular and Graphical Representation.
   Numerical summaries of the data.

2. Probability and Inferential Statistics:
   Data as a sample from a larger population of interest.
   Given population variability, how to quantify sample variability?
   (Probability)
   Given sample variability, how to quantify population variability?
   (Inferential Statistics)
   Estimation (Sampling Distribution, Means, Proportions, Confidence
   Intervals etc.)
   Hypothesis Testing (Testing for means, proportions, group homogeneity
   and equality etc.)

Take Home Messages

Statistics is about drawing inference from data about a population of
interest.
Importance of Context: Ask the right questions to understand what the
population of interest is and which characteristics of the population we
are interested in.
Make sure the data is representative of the population of interest and measures
the characteristics of the individuals of the population: Random Sample.

Statistics Does Not Prove Anything!


You only observe data once.
What if someone else does another study on the same population with
another random sample drawn from the population?
Different random samples will include different subjects, with different
observations.
Each new random sample will lead to (slightly) different conclusions,
implying that, sometimes, not so precise conclusions will be drawn.
Absolute certainty cannot be expected, as conclusions are based on only a
small part (the sample) of the total population, which may be infinitely large.
Statistics is about quantifying this uncertainty from data.
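
The sample-to-sample variation described here can be simulated. The Python sketch below imagines five separate studies, each drawing its own random sample of 30 units from the same hypothetical population of the numbers 1 to 1000. The population, seed and sample size are all invented for illustration.

```python
import random
from statistics import mean

rng = random.Random(2014)  # fixed seed, chosen arbitrarily

# Hypothetical finite population of 1000 measurements.
population = list(range(1, 1001))
parameter = mean(population)  # the population mean (a parameter)

# Five independent "studies", each with its own random sample of 30 units.
sample_means = [mean(rng.sample(population, 30)) for _ in range(5)]

for i, m in enumerate(sample_means, start=1):
    print(f"study {i}: sample mean = {m:.1f}")
print(f"population mean = {parameter}")
print(f"spread of sample means = {max(sample_means) - min(sample_means):.1f}")
```

Each study reports a somewhat different sample mean even though the population never changes. That spread is the uncertainty that inferential statistics tries to quantify.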

Data and Variables

Data and Variables


Data is a collection of characteristics/measurements of sampled units from an
underlying population.

A variable is a characteristic/measurement of a sampled unit.

Usually data is stored and presented in a dataset, comprised of variables
measured on sampled units.

Setting Up Data: Make a table with
One sampled unit per row.
One variable per column.

Data with n sampled units and p variables measured for each sampled unit
is therefore represented in an n × p table.

Patient Data from a City Hospital: n = 25, p = 6

Sampled Individual  Var 1   Var 2      Var 3      Var 4       Var 5    Var 6
Subject Index       Gender  Blood Gr.  Health     # Siblings  Ht. (m)  Wt. (kg)
1                   M       A          Fair       5           1.53     63.83
2                   M       B          Excellent  4           1.16     92.98
3                   F       B          Very Good  3           1.85     65.19
4                   M       A          Poor       1           1.51     63.58
5                   M       B          Fair       1           1.94     66.37
6                   F       B          Very Good  1           1.62     61.69
7                   F       O          Poor       4           1.22     87.51
8                   F       AB         Very Good  3           1.09     64.83
9                   F       O          Excellent  5           1.52     93.13
10                  M       A          Excellent  0           1.09     87.91
11                  F       O          Excellent  0           1.66     64.72
12                  F       A          Good       2           1.4      64.22
13                  M       O          Very Good  6           1.18     93.71
14                  M       A          Good       3           1.48     89.22
15                  M       O          Fair       0           1.72     65.57
16                  F       A          Very Good  3           1.72     66.43
17                  M       O          Good       2           1.87     65.11
18                  F       AB         Poor       4           1.42     65.67
19                  M       B          Very Good  4           1.53     66.73
20                  M       A          Excellent  3           1.28     89.79
21                  M       O          Excellent  4           1.86     92.08
22                  M       B          Good       3           1.53     65.27
23                  M       O          Excellent  8           1.42     65.54
24                  M       O          Good       5           0.58     63.8
25                  M       O          Excellent  1           2.53     90.91
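
The row-per-unit, column-per-variable layout maps directly onto a simple data structure. The Python sketch below stores the first three subjects of the patient data as tuples. The column names (`gender`, `height_m`, and so on) and the `column` helper are hypothetical names chosen for this illustration.

```python
# One row per sampled unit, one column per variable.
columns = ["gender", "blood_group", "health", "siblings", "height_m", "weight_kg"]
rows = [
    ("M", "A", "Fair", 5, 1.53, 63.83),       # subject 1
    ("M", "B", "Excellent", 4, 1.16, 92.98),  # subject 2
    ("F", "B", "Very Good", 3, 1.85, 65.19),  # subject 3
]

n = len(rows)     # number of sampled units shown here (3 of the 25)
p = len(columns)  # number of variables

def column(name):
    """Return all values of one variable, in subject order."""
    j = columns.index(name)
    return [row[j] for row in rows]

print(f"table shape: {n} x {p}")
print("heights:", column("height_m"))
```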

Types of Variables

In order to describe and summarize different variables of observed data, it is
important to first identify the different types of commonly appearing
variables.
These are:
1. Categorical/Qualitative Variables
   Nominal variables
   Ordinal variables
2. Numeric Variables
   Discrete variables
   Continuous variables

We will now go through each of them and try to understand how to suitably
summarize them.

Nominal Variables

The simplest type of variable is a nominal variable, in which the observed values
fall into specific unordered categories/classes.

Examples
Gender: Male (M), Female (F).
Survival Status: alive, deceased.
Blood Group: O, A, B, AB.
Race/Ethnicity: Asian, Caucasian, African.
Cause of Death: natural, accident.
Political Affiliation: Republican, Democrat.

Patient Data from a City Hospital (n = 25, p = 6): Gender (Var 1) and Blood Gr. (Var 2) are nominal variables.

Properties of Nominal Variables


Numbers are often used to represent the categories for convenience, but
these numbers are merely labels.
Examples:
- Gender: Male = 1 and Female = 0.
- Race/Ethnicity: Asian = 1, Caucasian = 2, African = 3.
Nominal variables that take on one of two distinct values are said to be
dichotomous/binary. (Example: Survival Status: alive (= 1), deceased (= 0).)
Both the order and the magnitude of the numbers are unimportant.
Most arithmetic operations do not make sense for nominal data.
Example: It does not make sense to take the average of the labels.
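
A short Python sketch shows why averaging nominal labels is meaningless while counting is fine. It uses the Blood Gr. column of the patient data. The two numeric codings are arbitrary and invented for illustration, which is the point: the "average" changes with the coding, the counts do not.

```python
from collections import Counter

# Blood Gr. column of the patient data, subjects 1 to 25.
blood_group = ["A", "B", "B", "A", "B", "B", "O", "AB", "O", "A", "O", "A", "O",
               "A", "O", "A", "O", "AB", "B", "A", "O", "B", "O", "O", "O"]

# Counting categories is a valid summary of a nominal variable.
counts = Counter(blood_group)
most_common_group, _ = counts.most_common(1)[0]

# Two arbitrary numeric labelings of the same categories.
coding_1 = {"O": 0, "A": 1, "B": 2, "AB": 3}
coding_2 = {"A": 0, "B": 1, "AB": 2, "O": 3}
mean_1 = sum(coding_1[g] for g in blood_group) / len(blood_group)
mean_2 = sum(coding_2[g] for g in blood_group) / len(blood_group)

print("counts:", dict(counts))
print("most common group:", most_common_group)
print("'average' under coding 1:", mean_1)
print("'average' under coding 2:", mean_2)
```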

Ordinal Variables

Variables are ordinal if the observed values fall into specific categories/classes
and the order among the categories is important.

Examples
Patient satisfaction with care received in the hospital: poor, fair, good,
very good, excellent.
Injury status: none, minor, moderate, severe, fatal.
Level of education: Bachelor, Masters, PhD.

Patient Data from a City Hospital (n = 25, p = 6): Health (Var 3) is an ordinal variable.

Properties of Ordinal Variables


As with nominal variables, categories might again be represented by
numbers.
Example: injury status - none (0), minor (1), moderate (2), severe (3),
fatal (4).
The order of the numbers is important for interpretation.
We are still not concerned with the magnitudes.
Example: for interpretation, injury status coded as none (0), minor (1),
moderate (2), severe (3), fatal (4) means the same thing as none (5),
minor (10), moderate (15), severe (20), fatal (50).
In particular, it does not make sense to do most arithmetic operations such as
addition, subtraction, division, or averaging.
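
The claim that order matters but magnitude does not can be checked in code. The Python sketch below takes the Health column of the patient data and applies the two code sets from the injury-status example to its five ordered levels, which is an assumption made purely for illustration. An order-based summary (the median category) is the same under both codings; the mean is not.

```python
from statistics import mean, median_low

# Health column of the patient data, subjects 1 to 25.
health = ["Fair", "Excellent", "Very Good", "Poor", "Fair", "Very Good", "Poor",
          "Very Good", "Excellent", "Excellent", "Excellent", "Good", "Very Good",
          "Good", "Fair", "Very Good", "Good", "Poor", "Very Good", "Excellent",
          "Excellent", "Good", "Excellent", "Good", "Excellent"]

levels = ["Poor", "Fair", "Good", "Very Good", "Excellent"]  # low to high
coding_a = dict(zip(levels, [0, 1, 2, 3, 4]))
coding_b = dict(zip(levels, [5, 10, 15, 20, 50]))  # same order, new magnitudes

def summarize(coding):
    """Return (median category, mean of the numeric codes) under a coding."""
    codes = [coding[h] for h in health]
    decode = {code: level for level, code in coding.items()}
    return decode[median_low(codes)], mean(codes)

median_a, mean_a = summarize(coding_a)
median_b, mean_b = summarize(coding_b)
print("median category:", median_a, "/", median_b)
print("mean of codes:  ", mean_a, "/", mean_b)
```

Under the first coding the mean falls between the codes for Good and Very Good; under the second it falls between Very Good and Excellent, so the mean has no stable interpretation.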

Categorical Variables

Together, nominal and ordinal variables are called categorical variables.

Discrete Variables
A variable is discrete if it can take only a specific set of values, and both the
order and magnitude matter.
Numbers are not merely labels; they are actual measurable quantities.
However, these quantities are restricted to taking on specified values only,
usually integers or counts.

Examples
Number of siblings a person has (0, 1, 2, 3, ...).
Number of motor vehicle accidents in the city of Boston in a given week (0, 1, 2, 3, ...).
Number of new cases of diabetes diagnosed in the United States over a
one-year period (0, 1, 2, 3, ...).
Number of days it rained in Palo Alto in the first week of November (0, 1, 2, 3, 4, 5, 6, 7).

Patient Data from a City Hospital (n = 25, p = 6): # Siblings (Var 4) is a discrete variable.

Properties of Discrete Variables

Both order and magnitude of the values of the variable matter.
Arithmetic rules can be applied.
However, the result of the arithmetic operation might not be a valid value of the
discrete variable itself.
Example: If person 1 has 2 siblings and person 2 has 3 siblings, then on
average they have (2 + 3)/2 = 2.5 siblings, but 2.5 is not a valid value for
the number of siblings.
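
The two-person example can be reproduced in a few lines of Python, alongside the # Siblings column of the patient data. The helper names `average` and `is_possible_count` are hypothetical, introduced only for this sketch.

```python
def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def is_possible_count(x):
    """A sibling count must be a non-negative whole number."""
    return x >= 0 and float(x).is_integer()

# The two-person example: 2 siblings and 3 siblings.
pair_mean = average([2, 3])

# The # Siblings column of the patient data, subjects 1 to 25.
siblings = [5, 4, 3, 1, 1, 1, 4, 3, 5, 0, 0, 2, 6,
            3, 0, 3, 2, 4, 4, 3, 4, 3, 8, 5, 1]
column_mean = average(siblings)

print(pair_mean, is_possible_count(pair_mean))
print(column_mean, is_possible_count(column_mean))
```

The average is always a meaningful number, but only sometimes a value the variable itself could take: the pair averages to a fractional count, while the full column happens to average to a whole number.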

Continuous Variables

Variables are continuous when both order and magnitude are important, and the
quantities are not restricted to taking on specified values/counts.

Examples
Height.
Birth Weight.
Length of time a lung cancer patient survives after diagnosis.
Concentration of mercury in a particular fish.

Patient Data from a City Hospital (n = 25, p = 6): Ht. (m) (Var 5) and Wt. (Kg) (Var 6) are continuous variables.

Properties of Continuous Variables

Most arithmetic operations make sense.
The result of adding, subtracting, dividing etc. is again a valid continuous value.
The difference between any two values can be arbitrarily small.
Therefore fractional values are possible.
The accuracy of the measuring instrument is the only limiting factor.
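
These properties can be illustrated with the Ht. (m) column of the patient data. In the Python sketch below, the mean and the range are ordinary arithmetic on the recorded heights. The "true" height of 1.53482 m is a hypothetical value, used only to show how instrument precision limits what gets recorded.

```python
from statistics import mean

# Ht. (m) column of the patient data, subjects 1 to 25.
height_m = [1.53, 1.16, 1.85, 1.51, 1.94, 1.62, 1.22, 1.09, 1.52, 1.09,
            1.66, 1.4, 1.18, 1.48, 1.72, 1.72, 1.87, 1.42, 1.53, 1.28,
            1.86, 1.53, 1.42, 0.58, 2.53]

# Averages and differences of heights are themselves valid heights/lengths.
avg_height = mean(height_m)
height_range = max(height_m) - min(height_m)

# Hypothetical "true" height recorded by instruments of different precision.
true_height = 1.53482
to_cm = round(true_height, 2)  # instrument accurate to the centimetre
to_mm = round(true_height, 3)  # instrument accurate to the millimetre

print(f"mean height: {avg_height:.4f} m, range: {height_range:.2f} m")
print(f"recorded to cm: {to_cm} m, recorded to mm: {to_mm} m")
```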

Numeric Variables

Together, discrete and continuous variables are called numeric variables.
