Lecture 17: Review Lecture

Sandy Eckel seckel@jhsph.edu

20 May 2008


Summary of the approach to modeling I

A general approach for most statistical modeling is to:

- Define the population of interest
- State the scientific questions & underlying theories
- Describe and explore the observed data
- Define the model
  - Systematic part (models the expectation / signal)
  - Probability part (models the randomness / noise)

Types of Biostatistics

1) Descriptive Statistics
   - Exploratory data analysis (often not in the literature)
   - Summaries ("Table 1" in a paper)
   - Goal: visualize relationships, generate hypotheses

2) Inferential Statistics
   - Confirmatory data analysis (Methods section of a paper)
   - Hypothesis tests, confidence intervals, regression modeling
   - Goal: quantify relationships, test hypotheses


Summary of the approach to modeling II

- Estimate the parameters in the model
  - Fit the model to the observed data
- Make inferences about covariates
- Check the validity of the model
  - Verify the model assumptions
  - Re-define, re-fit, and re-check the model if necessary
- Interpret the results of the analysis in terms of the scientific questions of interest

Descriptive Statistics

ALWAYS look at your data. If you can't see it, then don't believe it.


Stem-and-Leaf Plots

Age in years (10 observations)

25, 26, 29, 32, 35, 36, 38, 44, 49, 51

Age Interval | Observations
2*           | 5 6 9
3*           | 2 5 6 8
4*           | 4 9
5*           | 1
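The display above can be reproduced with a short sketch (plain Python; the ages are hardcoded from the slide):

```python
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

# Group leaves (units digit) under their stem (tens digit)
stems = {}
for a in sorted(ages):
    stems.setdefault(a // 10, []).append(a % 10)

for stem in sorted(stems):
    print(f"{stem}* | {' '.join(str(leaf) for leaf in stems[stem])}")
# 2* | 5 6 9
# 3* | 2 5 6 8
# 4* | 4 9
# 5* | 1
```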


Key Descriptive Statistics Ideas

Visualizing data
- Stem-and-leaf plots
- Histograms
- Boxplots
- Scatterplots

Describing data
- Distribution shapes (especially skewness)
- Quartiles
- Measures of central tendency: median, mean, mode
- Measures of spread: variance, standard deviation, range, interquartile range
Histograms

Pictures of the frequency or relative frequency distribution.

[Figure: histogram of age; frequency by age category]

Box-and-Whisker Plots (Boxplot)

[Figure: box plot of age in years, scale 25 to 50]

IQR = 44 − 29 = 15
Upper fence = 44 + 1.5 × 15 = 66.5
Lower fence = 29 − 1.5 × 15 = 6.5
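The fence arithmetic can be checked directly; this small sketch uses the quartiles quoted on the slide (Q1 = 29, Q3 = 44):

```python
ages = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]

# Quartiles taken directly from the slide
q1, q3 = 29, 44
iqr = q3 - q1                    # 15
upper_fence = q3 + 1.5 * iqr     # 66.5
lower_fence = q1 - 1.5 * iqr     # 6.5

# Points beyond the fences would be plotted as outliers
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
print(iqr, lower_fence, upper_fence, outliers)  # 15 6.5 66.5 []
```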


Skewness

Positively Skewed

- Longer tail in the high values
- Mean > Median > Mode

[Figure: positively skewed distribution with mode, median, and mean marked left to right]

2 Continuous Variables

[Figure: scatterplot of height in centimeters (150–190) by age in years (25–50)]

Scatterplots visually display the relationship between two continuous variables.

Skewness

Negatively Skewed

- Longer tail in the low values
- Mode > Median > Mean

[Figure: negatively skewed distribution with mean, median, and mode marked left to right]

Symmetric

- Right and left sides are mirror images; the left tail looks like the right tail
- Mean = Median = Mode

[Figure: symmetric distribution with mean, median, and mode coinciding]

Concepts from Biostat I used in this class

Other descriptive statistics – Review on your own

- Quartiles
- Measures of central tendency: median, mean, mode
- Measures of spread: variance, standard deviation, range, interquartile range

Key Ideas of Probability/Distributions

- Mutually exclusive events
- Statistical independence
- Addition rule
- Conditional probability
- Common distributions
  - Continuous: normal, t, chi-square, F
  - Discrete: binomial

Key Ideas of The Normal Distribution and Statistical Inference

- Normal distribution
  - Parameters: mean, variance
  - Standard normal
  - 68-95-99.7 rule
  - Areas under the curve and their relation to p-values
- Statistical inference
  - Population: parameters; sample: statistics
  - We use sample statistics along with theoretical results to make inferences about population parameters
  - Sampling distribution of the sample mean
  - Central Limit Theorem

68 – 95 – 99.7 Rule

- 68% of the area under the curve is within 1 standard deviation of the mean
- 95% of the area under the curve is within 2 standard deviations of the mean
- 99.7% of the area under the curve is within 3 standard deviations of the mean
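The three percentages can be verified from the standard normal CDF; a minimal sketch using only the standard library's error function:

```python
from math import erf, sqrt

def area_within(k):
    """P(-k < Z < k) for a standard normal Z, via the error function."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```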

Sampling Distribution of the Sample Mean

Usually µ is unknown and we would like to estimate it. We use the sample mean to estimate µ, and we know the sampling distribution of the sample mean.

Definition: the sampling distribution of a statistic is the distribution of all possible values of that statistic, computed from samples of the same size randomly drawn from the same population.

Inferential Statistics

The Central Limit Theorem

Given a population of any distribution with mean µ and variance σ², the sampling distribution of the sample mean, computed from samples of size n from this population, will be approximately normally distributed with mean µ and variance σ²/n when the sample size is large.

In general, this applies when n ≥ 25. The approximation to normality improves as n increases.
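The CLT is easy to see by simulation. A minimal sketch, drawing repeated samples from a deliberately skewed (exponential) population with mean 2 and variance 4, and checking that the sample means have mean ≈ µ and variance ≈ σ²/n:

```python
import random
random.seed(1)

# Skewed population: exponential with mean 2 (variance 4)
n, reps = 30, 5000
means = [sum(random.expovariate(0.5) for _ in range(n)) / n
         for _ in range(reps)]

grand_mean = sum(means) / reps
var_of_means = sum((m - grand_mean) ** 2 for m in means) / reps
print(round(grand_mean, 2), round(var_of_means, 3))
# close to 2 and to sigma^2/n = 4/30 ≈ 0.133
```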

Confidence Intervals

- Point estimation: an estimate of a population parameter
- Interval estimation: a point estimate plus an interval that expresses the uncertainty or variability associated with the estimate
- 100(1 − α)% confidence interval: estimate ± (critical value of z or t) × (standard error of estimate)
- The critical value is the cutoff such that the area under the curve in the tails beyond it (in both the positive and negative directions) is α
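A minimal sketch of the "estimate ± critical value × standard error" recipe, reusing the age data from the descriptive-statistics slides. It uses the z critical value from the standard library; with n = 10, a t critical value (2.262 on 9 df) would strictly be the better choice:

```python
from statistics import NormalDist, mean, stdev

data = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]  # ages from an earlier slide
n = len(data)
est = mean(data)
se = stdev(data) / n ** 0.5

# 95% CI: estimate ± z_{0.975} × SE
z = NormalDist().inv_cdf(0.975)          # ≈ 1.96
lower, upper = est - z * se, est + z * se
print(round(est, 1), round(lower, 1), round(upper, 1))  # 36.5 30.9 42.1
```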

Interpretation of confidence interval? Use a CI for µ as an example:

- Before the data are observed, the probability is at least (1 − α) that [L, U] will contain µ, the population parameter
- In repeated sampling from a normally distributed population, 100(1 − α)% of all intervals of the form above will include the population mean µ
- After the data are observed, the constructed interval [L, U] either contains the true mean or it does not (no probability is involved anymore)

MANY types of hypothesis testing

We discussed:
- A single mean: H0: µ = 3000 vs. Ha: µ ≠ 3000
- A single proportion: H0: p = 0.35 vs. Ha: p ≠ 0.35
- Difference of means: H0: µ1 − µ2 = 0 vs. Ha: µ1 − µ2 ≠ 0 (need to decide whether to assume equality of variances)
- Difference of proportions: H0: p1 − p2 = 0 vs. Ha: p1 − p2 ≠ 0
- Others for regression modelling

Steps of Hypothesis Testing

1. Define the null hypothesis, H0.
2. Define the alternative hypothesis, Ha, usually of the form "not H0".
3. Define the type I error rate, α, usually 0.05.
4. Calculate the test statistic.
5. Calculate the p-value.
6. If the p-value is less than α, reject H0; otherwise, fail to reject H0.
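The steps above can be sketched for a single mean with the age data from earlier slides, testing a hypothetical null value µ0 = 30. The p-value here comes from the normal approximation; a t distribution on n − 1 df would be exact:

```python
from statistics import NormalDist, mean, stdev

# Steps 1-3: H0: mu = 30 vs. Ha: mu != 30, alpha = 0.05 (mu0 is hypothetical)
data = [25, 26, 29, 32, 35, 36, 38, 44, 49, 51]
mu0, alpha = 30, 0.05

# Step 4: test statistic = (sample mean - mu0) / (s / sqrt(n))
n = len(data)
t_stat = (mean(data) - mu0) / (stdev(data) / n ** 0.5)

# Step 5: two-sided p-value (normal approximation)
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Step 6: compare p-value to alpha
print(round(t_stat, 2), round(p_value, 3),
      "reject H0" if p_value < alpha else "fail to reject H0")
```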


Which test statistic do I use for each kind of test?

Usually, the form of the test statistic depends on:

- The population distribution
- The sample size
- The population variance: whether it is known or estimated, and assumptions about equality

To find out which test statistic to use, check the summary sheets; this is not something you really have to memorize.


Example Summary: Hypothesis test for a single mean

Why is the power of a test important?

Power indicates the chance of finding a “significant” difference when there really is one

Low power: likely to obtain non-significant results even when significant differences exist

High power is desirable!

Low power is usually caused by small sample size


Relation between CI and hypothesis testing

General rule for the 100(1 − α)% confidence interval approach to two-sided hypothesis testing:

- If the null hypothesis value is not contained in the confidence interval, you reject the null hypothesis with p-value < α
- If the null hypothesis value is contained in the confidence interval, you fail to reject the null hypothesis with p-value > α

We’re not always right

- Aim: keep the Type I error rate (α) small by specifying a small rejection region
  - α is set before performing a test, usually at 0.05
- Aim: keep the Type II error rate (β) small and thus power high
  - Power = 1 − β


β: Probability of Type II Error

The value of β is usually unknown since it depends on a specified alternative value. β depends on the sample size and α.

Before data collection, scientists decide:
- the test they will perform
- α
- the desired β

They use this information to choose the sample size.

Regression Modeling

P-Values

Definition: the p-value for a hypothesis test is the probability of obtaining by chance alone, when H0 is true, a value of the test statistic as extreme as or more extreme (in the appropriate direction) than the one actually observed.


Correlation

Correlation measures the strength and direction of the linear relationship between two continuous variables. The correlation coefficient, ρ, takes values between −1 and +1:

- −1: perfect negative linear relationship
- 0: no linear relationship
- +1: perfect positive linear relationship


Association and Causation

In general, association between two variables means there is some form of relationship between them. The relationship is not necessarily causal: association does not imply causation, no matter how much we would like it to. Example: hot days, ice cream, drowning.


Simple Linear Regression: E(Y) = β0 + β1X

Linear regression is used for continuous outcome variables.

- β0: mean outcome when X = 0 (center X!)
- Binary X ("dummy variable" for group):
  - β1: difference in mean outcome between groups
- Continuous X:
  - β1: difference in mean outcome corresponding to a 1-unit increase in X
  - Center X to give meaning to β0
- Test whether β1 = 0 in the population


Why use linear regression?

Linear regression is very powerful. It can be used for many things:

- Binary X
- Continuous X
- Categorical X
- Adjustment for confounding
- Interaction
- Curved relationships between X and Y


Assumptions of Linear Regression

L.I.N.E.:
- Linear relationship
- Independent observations
- Normally distributed around the line
- Equal variance across X’s

Most often assessed with graphs. One type, the added-variable (AV) plot, visualizes the relationship between the outcome and a continuous predictor after adjusting for the effects of a third variable.


In Simple Linear Regression

In simple linear regression (SLR):
- One predictor / covariate / explanatory variable: X

In multiple linear regression (MLR):
- Same assumptions as SLR (i.e., L.I.N.E.), but more than one covariate: X1, X2, X3, …, Xp
- Model: Y ~ N(µ, σ²), where µ = E(Y | X) = β0 + β1X1 + β2X2 + β3X3 + … + βpXp
Regression Methods

Interactions can allow us to draw separate lines for two groups, e.g. X1 = year (continuous) and X2 = group (binary).

Nested models

One model is nested within another if the parent model contains one set of variables and the extended model contains all of the original variables plus one or more additional variables.

- H0: all new β’s are zero
- Assess using an F-test; if only one additional variable, use a t-test
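The F-test for nested models compares how much the residual sum of squares drops per added parameter against the full model's residual variance. A sketch with hypothetical residual sums of squares and degrees of freedom:

```python
# Hypothetical fits: the full model adds 2 variables to the reduced model
rss_reduced, df_reduced = 500.0, 47   # reduced (parent) model
rss_full, df_full = 420.0, 45         # full (extended) model

# F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / (RSS_full / df_full)
f_stat = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
print(round(f_stat, 2))  # 4.29
```

The statistic is compared to an F distribution on (df_reduced − df_full, df_full) degrees of freedom; when only one variable is added, F equals the square of the corresponding t statistic.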


Effect Modification

In linear regression, effect modification is a way of allowing the association between the primary predictor and the outcome to change with the level of another predictor.

If the third predictor is binary, this results in a graph in which the two lines (for the two groups) are no longer parallel.
Confounding: the epidemiologic definition

C is a confounder of the relation between predictor X and outcome Y if C is associated with both X and Y (and is not on the causal pathway between them).

[Diagram: confounder C linked to both predictor X and outcome Y]

Confounding: example

Smoking (C) is a confounder of the relation between coffee consumption (X) and lung cancer (Y), since smoking is associated with both.

[Diagram: smoking C linked to both coffee consumption X and lung cancer Y]

Modeling confounding and effect modification

Potential confounder(s):
- Run the model without the confounder (model 1)
- Run the model with the confounder (model 2)
- Compare the model 2 estimate to the model 1 CI of the primary predictor to see whether the new estimate is significantly different

Effect modification:
- Model using an interaction term
- Test whether it is statistically significant using a t-test


Spline Terms

Splines are used to allow the regression line to bend.

- The breakpoint is arbitrary and decided graphically or by hypothesis
- The actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope)

[Figure: "broken arrow" model of expenditures (2000–3500) by length of stay in days (3–9); slope β1 below the breakpoint, slope β1 + β2 above it]
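The broken-arrow model can be sketched by constructing the spline term explicitly; the knot location and coefficients below are hypothetical:

```python
# Linear spline ("broken arrow") with a hypothetical knot at 5 days
knot = 5
los = [3, 4, 5, 6, 7, 8, 9]                  # length of stay (days)
spline = [max(x - knot, 0) for x in los]     # 0 below the knot, x - knot above

# Fitted mean: beta0 + beta1*x + beta2*max(x - knot, 0)
b0, b1, b2 = 2000, 100, 150                  # hypothetical coefficients
fitted = [b0 + b1 * x + b2 * s for x, s in zip(los, spline)]
print(fitted)  # [2300, 2400, 2500, 2750, 3000, 3250, 3500]
```

Below the knot consecutive fitted values rise by b1 = 100; above it they rise by b1 + b2 = 250, which is the slope change the spline coefficient captures.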

Using R 2 as a model selection criteria

The coefficient of determination, R², evaluates the entire model: it is the proportion of the total variation in Y that is explained by the model.

- Model 1: R² = 0.0076; 0.8% of variation explained
- Model 2: R² = 0.05; 5% of variation explained
- Model 3: R² = 0.20; 20% of variation explained

You want a model with a large R².
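R² can be computed from residual and total sums of squares; a sketch with hypothetical observed and fitted values:

```python
# R^2 = 1 - RSS/TSS, with hypothetical data
y     = [2.0, 3.1, 4.2, 4.8, 6.1]   # observed outcomes
y_hat = [2.2, 3.0, 4.0, 5.0, 5.9]   # fitted values from some model

y_bar = sum(y) / len(y)
tss = sum((yi - y_bar) ** 2 for yi in y)             # total variation in Y
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r_squared = 1 - rss / tss
print(round(r_squared, 3))  # 0.983
```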


Summary: Flexibility in linear models

- A spline allows the slope for a continuous predictor to change at a given point; the coefficient is the difference in slopes
- An interaction term allows the slope for one variable to differ by the value of a second variable; the coefficient is the difference in slopes between groups


Logistic Regression

Basic idea:

- Logistic regression is the type of regression we use for a response variable (Y) that follows a binomial distribution
- Linear regression is the type of regression we use for a continuous, normally distributed response variable (Y)

We model the log odds of the probability, which we also call the logit:

- The baseline term is interpreted as a log odds
- Other coefficients are log odds ratios
- Transform a log odds / log odds ratio to the odds / odds ratio scale by exponentiating the coefficient


Logit Function

The logit function relates the log-odds (logit) to p = Pr(Y = 1): logit(p) = log(p / (1 − p)).

[Figure: logit function; log-odds from −10 to 10 plotted against probability of success from 0 to 1]

Why we can interpret the difference in log odds as the log odds ratio

The slope in a logistic regression is:

- a difference in log odds associated with a 1-unit change in X (controlling for other X’s)
- a log odds ratio associated with a 1-unit change in X (controlling for other X’s)

Why? log(a) − log(b) = log(a/b), so

log(odds | X=1) − log(odds | X=0) = log(OR for X=1 vs. X=0)
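The identity is easy to check numerically; the two odds values below are hypothetical:

```python
from math import log

# Hypothetical odds of the outcome in the X=1 and X=0 groups
odds_x1, odds_x0 = 0.50, 0.25

diff_log_odds = log(odds_x1) - log(odds_x0)   # difference in log odds
log_or = log(odds_x1 / odds_x0)               # log of the odds ratio
print(round(diff_log_odds, 4), round(log_or, 4))  # 0.6931 0.6931 (both log 2)
```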


A basic logistic regression model and interpretation

logit(p_i) = β0 + β1(age_i − 45)

- β0 = log odds of blindness among 45-year-olds
- exp(β0) = odds of blindness among 45-year-olds
- β1 = difference in log odds of blindness comparing a group that is one year older than another
- exp(β1) = odds ratio of blindness comparing a group that is one year older than another

Multiplicative change interpretation of the slope

exp(β1) is the proportional increase in the odds of not visiting a physician corresponding to a one-year increase in age:

(odds for 30-yr-old) × exp(β1) = (odds for 31-yr-old)

equivalently, exp(β1) = (odds for 31-yr-old) / (odds for 30-yr-old)

(exp(β1))^10 = exp(10β1) is the proportional increase in the odds of not visiting a physician corresponding to a ten-year increase in age.
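The per-decade relationship can be checked numerically; β1 here is a hypothetical log odds ratio per year of age:

```python
from math import exp, isclose

b1 = 0.05                     # hypothetical log odds ratio per year of age
per_year = exp(b1)            # multiplicative change in odds per 1 year
per_decade = per_year ** 10   # compounding over 10 years

# (exp(b1))^10 is the same as exp(10*b1)
print(isclose(per_decade, exp(10 * b1)))  # True
```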

More useful math – how to get the probability from the odds

odds = probability / (1 − probability)

probability = odds / (1 + odds)

so

P(Y = 1 | X = 1) = e^(β0 + β1) / (1 + e^(β0 + β1))
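These conversions are easy to check in code; β0 and β1 below are hypothetical logistic regression coefficients:

```python
from math import exp

def prob_from_odds(odds):
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)

def inv_logit(log_odds):
    """P(Y=1) = e^eta / (1 + e^eta) for linear predictor eta."""
    return exp(log_odds) / (1 + exp(log_odds))

b0, b1 = -2.0, 0.1        # hypothetical coefficients
eta = b0 + b1 * 1         # linear predictor evaluated at X = 1

# Converting the odds exp(eta) gives the same probability as inv_logit(eta)
print(round(prob_from_odds(exp(eta)), 4), round(inv_logit(eta), 4))  # 0.1301 0.1301
```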

Grand summary

- Exploratory analysis includes graphs and tables; good for getting a feel for the data
- Confirmatory analysis is useful for making definitive conclusions
- Linear models provide a framework in which to perform confirmatory analysis in many settings


Comparing nested models with logistic regression

Models that differ by one variable:
- Compare models with a p-value or CI using the Wald test, a test that applies the CLT
- H0: the new variable is not needed, i.e., H0: β_new = 0 in the population

Models that differ by more than one variable:
- Likelihood ratio test (chi-square test of deviance)
- H0: none of the new variables are needed, i.e., H0: all β_new = 0 in the population


Grand summary: linear models

- Linear regression: for continuous (normal) outcomes
- Logistic regression: for binary outcomes


Grand summary: modelling

In all generalized linear models, we can use the following tools to make models more flexible:

- Adjust for confounders using additive covariates
- Allow effect modification through interaction terms
- Model curved and bent lines through polynomials and splines


Friendly Intro Epidemiology textbook

Epidemiology by Leon Gordis

Intro to Biostatistical Modeling Textbook

Regression Modeling Strategies by Frank E. Harrell Jr.

Slightly more theoretical intro to statistical modeling textbook

Mathematical Statistics and Data Analysis by John A. Rice


Grand summary: testing

- We can test the significance of a single predictor using a z-test (or a t-test for linear regression)
- Test the significance of several covariates using a pair of nested models and a likelihood ratio test
- Know how to interpret p-values and confidence intervals!


References – online: JHSPH Open Courseware, for further directed self-study

JHSPH Biostatistics Open Courseware: http://ocw.jhsph.edu/Topics.cfm?topic_id=33

- Essentials of Probability and Statistical Inference IV
- Methods in Biostatistics I & II
- Statistical Reasoning I & II
- Statistics for Laboratory Scientists I & II
- Statistics for Psychosocial Research: Structural Models & Measurement