
Lecture 17: Review Lecture

Sandy Eckel seckel@jhsph.edu

20 May 2008


Summary of the approach to modeling I

A general approach for most statistical modeling is to:

Define the Population of Interest
State the Scientific Questions & Underlying Theories
Describe and Explore the Observed Data
Define the Model

Systematic part (models the expectation / signal)

Probability part (models the randomness / noise)

Types of Biostatistics

1) Descriptive Statistics

Exploratory Data Analysis

often not in literature

Summaries

"Table 1" in a paper

Goal: visualize relationships, generate hypotheses

2) Inferential Statistics

Confirmatory Data Analysis (Methods Section of paper)

Hypothesis tests
Confidence Intervals
Regression modeling

Goal: quantify relationships, test hypotheses

Summary of the approach to modeling II

Estimate the Parameters in the Model

Fit the Model to the Observed Data

Make Inferences about Covariates
Check the Validity of the Model

Verify the Model Assumptions

Re-define, Re-fit, and Re-check the Model if necessary
Interpret the results of the Analysis in terms of the Scientific Questions of Interest

Descriptive Statistics

ALWAYS look at your data
If you can't see it, then don't believe it

Stem-and-Leaf Plots

Age in years (10 observations)

25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval   Observations
2*             5 6 9
3*             2 5 6 8
4*             4 9
5*             1
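The table above can be rebuilt mechanically: the stem is the tens digit and the leaf is the ones digit. A minimal Python sketch (the helper name `stem_and_leaf` is made up for illustration):

```python
from collections import defaultdict

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

def stem_and_leaf(values):
    """Group each value's leaf (ones digit) under its stem (tens digit)."""
    table = defaultdict(list)
    for v in sorted(values):
        table[v // 10].append(v % 10)
    return dict(table)

for stem, leaves in stem_and_leaf(ages).items():
    print(f"{stem}*  {' '.join(str(leaf) for leaf in leaves)}")
```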

Key Descriptive Statistics Ideas

Visualizing data

Stem and leaf plots
Histograms
Boxplots
Scatterplots

Describing data

Distribution shapes (especially skewness)
Quartiles
Measures of central tendency

Median, Mean, Mode

Measures of spread

Variance, Standard Deviation, Range, Interquartile Range

Histograms

Pictures of the frequency or relative frequency distribution

[Figure: "Histogram of Age", frequency by age category]

Box-and-Whisker Plots (Boxplot)

Box Plot of Age

[Figure: "Box Plot of Age", age in years from 25 to 50]

IQR = 44 – 29 = 15 Upper Fence = 44 + 15*1.5 = 66.5 Lower Fence = 29 – 15*1.5 = 6.5
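The fence arithmetic on this slide can be checked directly. A Python sketch, assuming quartiles are computed as medians of the lower and upper halves of the sorted data (which reproduces Q1 = 29 and Q3 = 44 here; other quartile conventions give slightly different values):

```python
import statistics

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

def tukey_fences(values):
    """Quartiles as medians of the lower/upper halves, then 1.5 * IQR fences."""
    s = sorted(values)
    half = len(s) // 2
    q1 = statistics.median(s[:half])   # lower half of the data
    q3 = statistics.median(s[-half:])  # upper half of the data
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr

q1, q3, iqr, lower, upper = tukey_fences(ages)
print(iqr, lower, upper)  # 15 6.5 66.5
```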


Skewness

Positively Skewed

Longer tail in the high values Mean > Median > Mode

[Figure: right-skewed density with the mode at the peak, the median in the middle, and the mean pulled toward the long right tail]

2 Continuous Variables

Scatterplot

[Figure: scatterplot "Age by Height in cm", age in years (25 to 50) against height (150 to 190 cm)]

Scatterplots visually display the relationship between two continuous variables

Skewness

Negatively Skewed

Longer tail in the low values Mode > Median > Mean

[Figure: left-skewed density with the mean pulled toward the long left tail, then the median, then the mode at the peak]

Symmetric

Right and left sides are mirror images

Left tail looks like right tail
Mean = Median = Mode

[Figure: symmetric density with the mean, median, and mode coinciding at the center]


Concepts from Biostat I used in this class

Other descriptive statistics – Review on your own

Quartiles
Measures of central tendency

Median, Mean, Mode

Measures of spread

Variance, Standard Deviation, Range, Interquartile Range


Key Ideas of Probability/Distributions

Mutually exclusive
Statistically independent
Addition rule
Conditional probability
Common distributions

Continuous: Normal, t-distribution, Chi-square, F-distribution
Discrete: Binomial

Key Ideas of The Normal Distribution and Statistical Inference

Normal distribution
Parameters: mean, variance
Standard normal
68-95-99.7 Rule
Areas under the curve and their relation to p-values

Statistical Inference
Population: parameters
Sample: statistics
We use sample statistics along with theoretical results to make inferences about population parameters
Sampling distribution of the sample mean
Central Limit Theorem

68 – 95 – 99.7 Rule

68% of the area under the curve is within 1 standard deviation of the mean
95% of the area under the curve is within 2 standard deviations of the mean
99.7% of the area under the curve is within 3 standard deviations of the mean
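The rule is easy to verify numerically from the standard normal CDF; a sketch using only the Python standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Area under the standard normal curve within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {area_within(k):.4f}")  # 0.6827, 0.9545, 0.9973
```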

Sampling Distribution of the Sample Mean

Usually µ is unknown and we would like to estimate it
We use the sample mean to estimate µ
We know the sampling distribution of the sample mean

Definition: Sampling distribution
The distribution of all possible values of some statistic, computed from samples of the same size randomly drawn from the same population, is called the sampling distribution of that statistic

Inferential Statistics


The Central Limit Theorem

Given a population of any distribution with mean µ and variance σ², the sampling distribution of the sample mean, computed from samples of size n from this population, will be approximately normally distributed with mean µ and variance σ²/n, when the sample size is large.

In general, this applies when n ≥ 25. The approximation to normality becomes better as n increases.
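A quick simulation illustrates the theorem: sample means of a skewed (exponential) distribution are approximately normal with mean µ and standard deviation σ/√n. A sketch with made-up simulation settings:

```python
import random
import statistics

random.seed(1)          # reproducible illustration
n, reps = 30, 2000      # sample size and number of repeated samples (made up)
# Exponential(1) is skewed, with mean mu = 1 and variance sigma^2 = 1
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(reps)
]

print(statistics.fmean(sample_means))   # close to mu = 1
print(statistics.stdev(sample_means))   # close to sigma / sqrt(n), about 0.183
```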

Confidence Intervals

Point estimation

An estimate of a population parameter

Interval estimation

A point estimate plus an interval that expresses the uncertainty or variability associated with the estimate

100(1 – α)% Confidence interval:

estimate ± (critical value of z or t) × (standard error of estimate)

The critical value is the cutoff such that the area under the curve in the tails beyond the critical value (both positive and negative direction) is α
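As a sketch of the formula, here is a z-based interval for a mean with σ assumed known (the data and σ = 9 are made up for illustration; with σ estimated from the data you would use a t critical value instead):

```python
import math
from statistics import NormalDist, fmean

def z_confidence_interval(data, sigma, alpha=0.05):
    """estimate +/- (critical value of z) * (standard error of the mean)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    se = sigma / math.sqrt(len(data))
    xbar = fmean(data)
    return xbar - z_crit * se, xbar + z_crit * se

ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
lo, hi = z_confidence_interval(ages, sigma=9)
print(round(lo, 2), round(hi, 2))  # 30.92 42.08
```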

Interpretation of a confidence interval? Use a CI for µ as an example:

Before the data are observed, the probability is at least (1 – α) that [L,U] will contain µ, the population parameter
In repeated sampling from a normally distributed population, 100(1 – α)% of all intervals of the form above will include the population mean µ
After the data are observed, the constructed interval [L,U] either contains the true mean or it does not (no probability involved anymore)

MANY types of hypothesis testing

We discussed:

A single mean
H0: µ = 3000 vs. Ha: µ ≠ 3000

A single proportion
H0: p = 0.35 vs. Ha: p ≠ 0.35

Difference of means
H0: µ1 – µ2 = 0 vs. Ha: µ1 – µ2 ≠ 0
Need to decide whether to assume equality of variance

Difference of proportions
H0: p1 – p2 = 0 vs. Ha: p1 – p2 ≠ 0

Others for regression modelling

Steps of Hypothesis Testing

Define the null hypothesis, H0.
Define the alternative hypothesis, Ha, where Ha is usually of the form "not H0".
Define the type 1 error, α, usually 0.05.
Calculate the test statistic.
Calculate the p-value.
If the p-value is less than α, reject H0; otherwise fail to reject H0.
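The steps above can be sketched as a one-sample z-test (σ assumed known; the data, µ0 = 30, and σ = 9 are made up for illustration):

```python
import math
from statistics import NormalDist, fmean

def one_sample_z_test(data, mu0, sigma, alpha=0.05):
    """Return (z statistic, two-sided p-value, whether to reject H0)."""
    z = (fmean(data) - mu0) / (sigma / math.sqrt(len(data)))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p, p < alpha

data = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
z, p, reject = one_sample_z_test(data, mu0=30, sigma=9)
print(round(z, 3), round(p, 4), reject)  # 2.284 0.0224 True
```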


Which test statistic do I use for each kind of test?

Usually, the form of the test statistic depends on

Population distribution
Sample size
Population variance (whether known or estimated, or assumptions about equality)

To find out which test statistic to use, check the summary sheets

Not something you really have to memorize


Example Summary: Hypothesis test for a single mean


Why is the power of a test important?

Power indicates the chance of finding a “significant” difference when there really is one

Low power: likely to obtain non-significant results even when significant differences exist

High power is desirable!

Low power is usually caused by small sample size


Relation between CI and hypothesis testing

General rule on the 100(1 – α)% confidence interval approach to two-sided hypothesis testing:

If the null hypothesis value is not contained in the confidence interval, you reject the null hypothesis with p-value < α
If the null hypothesis value is contained in the confidence interval, you fail to reject the null hypothesis with p-value > α

We’re not always right

Aim: to keep Type I error (α) small by specifying a small rejection region
α is set before performing a test, usually at 0.05
Aim: to keep Type II error (β) small and thus power high
Power = 1 – β

β: Probability of Type II Error

The value of β is usually unknown since it depends on a specified alternative value.
β depends on sample size and α.
Before data collection, scientists decide:

the test they will perform
α
the desired β

They will use this information to choose the sample size


Regression Modeling


P-Values

Definition: The p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme as or more extreme than (in the appropriate direction) the one actually observed.


Correlation

Measures the strength and direction of the linear relationship between two continuous variables
The correlation coefficient, ρ, takes values between -1 and +1

-1: perfect negative linear relationship
0: no linear relationship
+1: perfect positive linear relationship
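The sample version of ρ can be computed straight from its definition; a Python sketch:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: covariance over the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # r = 1 (up to float rounding)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # r = -1 (up to float rounding)
```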


Association and Causation

In general, association between two variables means there is some form of relationship between them
The relationship is not necessarily causal
Association does not imply causation, no matter how much we would like it to
Example: hot days, ice cream, and drowning


Simple Linear regression: Y= β 0 + β 1 X 1

Linear regression is used for continuous outcome variables
β0: mean outcome when X = 0 (center X!)

Binary X ("dummy variable" for group)
β1: mean difference in outcome between groups

Continuous X
β1: difference in mean outcome corresponding to a 1-unit increase in X
Center X to give meaning to β0

Test β1 = 0 in the population
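The least-squares estimates have closed forms: β1-hat = cov(X, Y) / var(X) and β0-hat = ybar − β1-hat · xbar. A sketch with made-up data:

```python
def slr_fit(xs, ys):
    """Least squares: beta1 = cov(x, y) / var(x), beta0 = ybar - beta1 * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    beta1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    beta0 = ybar - beta1 * xbar
    return beta0, beta1

# Made-up age (years) and height (cm) pairs echoing the earlier scatterplot
xs = [25, 30, 35, 40, 45, 50]
ys = [160, 165, 168, 174, 180, 185]
b0, b1 = slr_fit(xs, ys)
print(round(b0, 2), round(b1, 3))  # 134.29 1.006
```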


Why use linear regression?

Linear regression is very powerful. It can be used for many things:

Binary X
Continuous X
Categorical X
Adjustment for confounding
Interaction
Curved relationships between X and Y


Assumptions of Linear Regression

L – Linear relationship
I – Independent observations
N – Normally distributed around the line
E – Equal variance across X's

Most often assessed with graphs
One type: added-variable (AV) plots, which visualize the relationship between the outcome and a continuous predictor after adjusting for the effects of a third variable


In simple linear regression (SLR):

One Predictor / Covariate / Explanatory Variable: X

In multiple linear regression (MLR):

Same Assumptions as SLR (i.e. L.I.N.E.), but more than one Covariate: X1, X2, X3, …, Xp

Model:

Y ~ N(µ, σ²)
µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + … + βpXp

Regression Methods

Interactions can allow us to draw separate lines for two groups

X1 = Year
X2 = Group

Nested models

One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.

H0: all new β's are zero
Assess using an F-test
If only one additional variable, use a t-test


Effect Modification

In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.

If the 3rd predictor is binary, this results in a graph in which the two lines (for the two groups) are no longer parallel.

Confounding: example

Smoking is a confounder of the relation between coffee consumption (X) and lung cancer (Y) since:

[Diagram: Smoking C linked to both Coffee Consumption X and Lung Cancer Y]

Confounding: the epidemiologic definition

C is a confounder of the relation between X and Y if C is associated with the predictor X and with the outcome Y (and is not on the causal pathway between them)

[Diagram: Confounder C linked to both Predictor X and Outcome Y]

Modeling confounding and effect modification

Potential confounder(s)

Run the model without the confounder (model 1)
Run the model with the confounder (model 2)
Compare the model 2 estimate of the primary predictor to its model 1 CI to see whether the estimate has changed meaningfully

Effect modification

Model using an interaction term
Test whether it is statistically significant using a t-test


Spline Terms

Splines are used to allow the regression line to bend

The breakpoint is arbitrary and decided graphically or by hypothesis
The actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e. the change in slope)

Broken Arrow Model

[Figure: "Broken Arrow Model", expenditures (2000 to 3500) against length of stay in days (3 to 9); slope β1 below the breakpoint and β1 + β2 above it]
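The broken-arrow mean function can be written with a single linear spline term max(0, x − knot): slope β1 below the knot and β1 + β2 above it. A sketch with made-up coefficients:

```python
def spline_mean(x, beta0, beta1, beta2, knot):
    """Mean outcome with one linear spline term at `knot`."""
    return beta0 + beta1 * x + beta2 * max(0.0, x - knot)

# Hypothetical coefficients: intercept 2000, slope 100 below the knot at 5,
# and a change in slope of 150 above it (so slope 250 past the knot)
b0, b1, b2, knot = 2000.0, 100.0, 150.0, 5.0
print(spline_mean(4, b0, b1, b2, knot))  # 2400.0
print(spline_mean(6, b0, b1, b2, knot))  # 2750.0
```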


Using R 2 as a model selection criteria

The coefficient of determination, R², evaluates the entire model
R² shows the proportion of the total variation in Y that has been explained by the model

Model 1: R² = 0.0076, so 0.8% of variation explained
Model 2: R² = 0.05, so 5% of variation explained
Model 3: R² = 0.20, so 20% of variation explained

You want a model with large R²
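R² = 1 − SS_residual / SS_total; a sketch with made-up observed and fitted values:

```python
def r_squared(ys, yhats):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ybar = sum(ys) / len(ys)
    ss_tot = sum((y - ybar) ** 2 for y in ys)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(ys, yhats))
    return 1 - ss_res / ss_tot

ys = [1.0, 2.0, 3.0, 4.0]
print(r_squared(ys, [1.0, 2.0, 3.0, 4.0]))  # 1.0: all variation explained
print(r_squared(ys, [2.5, 2.5, 2.5, 2.5]))  # 0.0: no better than the mean
```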


Summary: Flexibility in linear models

A spline allows the "slope" for a continuous predictor to change at a given point; the coefficient is the change in slope at that point
An interaction term allows the effect of one variable to differ by the value of a second variable; in logistic regression the interaction coefficient is a difference in log odds ratios


Logistic Regression

Basic Idea:

Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution
Linear regression is the type of regression we use for a continuous, normally distributed response variable (Y)

Model the log odds of the probability, which we also call the logit
The baseline term is interpreted as a log odds
Other coefficients are log odds ratios
Transform a log odds / log odds ratio to the odds / odds ratio scale by exponentiating the coefficient


Logit Function

Relates log-odds (logit) to p = Pr(Y=1)

[Figure: the logit curve, log-odds from -10 to 10 plotted against probability of success from 0 to 1]
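The logit and its inverse are one-liners; a Python sketch (the name expit for the inverse is conventional but chosen here for illustration):

```python
import math

def logit(p):
    """Probability -> log odds."""
    return math.log(p / (1 - p))

def expit(log_odds):
    """Log odds -> probability (the inverse logit)."""
    return 1 / (1 + math.exp(-log_odds))

print(logit(0.5))   # 0.0: probability 0.5 means even odds
print(expit(0.0))   # 0.5
```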


Why we can interpret the difference in log odds as the log odds ratio

The slope in a logistic regression is

a difference in log odds associated with a 1 unit change in X (controlling for other X's), or equivalently
a log odds ratio associated with a 1 unit change in X (controlling for other X's)

Why? log(a) – log(b) = log(a/b)

so

log(odds|X=1) – log(odds|X=0) = log(OR for X=1 vs. X=0)


A basic logistic regression model and interpretation

logit(pi) = β0 + β1(agei – 45)

β0 = log odds of blindness among 45 year olds
exp(β0) = odds of blindness among 45 year olds
β1 = difference in log odds of blindness comparing a group that is one year older than another
exp(β1) = odds ratio of blindness comparing a group that is one year older than another


Multiplicative change interpretation of the slope

e^β1 is the proportional increase of the odds of not visiting a physician corresponding to a one year increase in age:

(odds for 30-yr-old) × e^β1 = (odds for 31-yr-old)

(e^β1)^10 = e^(10β1) is the proportional increase of the odds of not visiting a physician corresponding to a ten year increase in age


More useful math – how to get the probability from the odds

odds = probability / (1 – probability)

probability = odds / (1 + odds)

so, at X = 1:

P(Y = 1 | X = 1) = e^(β0 + β1) / (1 + e^(β0 + β1))
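These conversions go straight to code; a sketch with hypothetical coefficients β0 = -2 and β1 = 1:

```python
import math

def odds_from_prob(p):
    """Probability -> odds."""
    return p / (1 - p)

def prob_from_odds(odds):
    """Odds -> probability."""
    return odds / (1 + odds)

# Hypothetical coefficients for illustration only
beta0, beta1 = -2.0, 1.0
p = prob_from_odds(math.exp(beta0 + beta1))  # P(Y = 1 | X = 1)
print(round(p, 4))  # 0.2689
```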


Grand summary

Exploratory analysis includes graphs and tables – good for getting a feel for the data
Confirmatory analysis is useful for making definitive conclusions
Linear models provide us with a framework in which to perform confirmatory analysis in many settings


Comparing nested models with logistic regression

Models that differ by one variable

Compare models with p-value or CI using the Wald test, a test that applies the CLT

H0: the new variable is not needed, i.e. H0: βnew = 0 in the population

Models that differ by more than one variable

Likelihood ratio test (Chi-square test of deviance)

H0: all new variables are not needed, i.e. H0: all βnew = 0 in the population


Grand summary: linear models

Linear regression: for continuous (normal) outcomes
Logistic regression: for binary outcomes


Grand summary: modelling

In all generalized linear models, we can use the following tools to make models more flexible:

Adjust for confounders using additive covariates
Allow effect modification via interaction terms
Model curved and bent lines through polynomials and splines


References – textbooks: for more information

Friendly Intro Epidemiology textbook

Epidemiology by Leon Gordis

Intro to Biostatistical Modeling Textbook

Regression Modeling Strategies by Frank E. Harrell Jr.

Slightly more theoretical intro to statistical modeling textbook

Mathematical Statistics and Data Analysis by John A. Rice


Grand summary: testing

We can test the significance of a single predictor using a z-test (or a t-test for linear regression)
Test the significance of several covariates using a pair of nested models and a likelihood ratio test
Know how to interpret p-values and confidence intervals!


References – online: JHSPH open courseware, for further directed self-study

JHSPH Biostatistics Open Courseware

http://ocw.jhsph.edu/Topics.cfm?topic_id=33

Essentials of Probability and Statistical Inference IV
Methods in Biostatistics I & II
Statistical Reasoning I & II
Statistics for Laboratory Scientists I & II
Statistics for Psychosocial Research: Structural Models & Measurement
