Notes

Unit 5: Regression Basics

Table of Contents

• Introducing Regression
  o Introducing the Regression Line
  o The Uses of Regression
• Calculating the Regression Line
  o The Accuracy of a Line
  o Identifying the Regression Line
• Refining Regression
  o Quantifying the Predictive Power of Regression
  o Residual Analysis
  o Interpreting the Regression Coefficients
  o Revisiting R² and p

Introducing Regression

Introducing the Regression Line


• Regression analysis helps us find the mathematical relationship between two variables. We can use regression to describe a linear relationship: one that can be represented by a straight line and characterized by an equation of the form y = a + bx.
• Plot the behavior of two variables on a scatter diagram to observe patterns in their relationship.
• Use regression analysis to identify the linear relationship that best fits the data.
• The linear relationship has the form y = a + bx.
  o a is the y-intercept of the line
  o b is the slope of the line
• y is called the dependent variable and x is called the independent, or explanatory, variable.
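For illustration, here is a minimal Python sketch (using NumPy and Matplotlib with made-up data; the course itself works in Excel) of plotting a scatter diagram and fitting a line of the form y = a + bx:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical sample data (made-up numbers)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # independent variable
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8])   # dependent variable

    # np.polyfit returns the best-fit coefficients, highest power first
    b, a = np.polyfit(x, y, deg=1)                  # slope b and intercept a, so y = a + b*x

    plt.scatter(x, y, label="observed data")        # scatter diagram
    plt.plot(x, a + b * x, label=f"y = {a:.2f} + {b:.2f}x")
    plt.xlabel("x (independent variable)")
    plt.ylabel("y (dependent variable)")
    plt.legend()
    plt.show()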

The Uses of Regression


• We use regression analysis for two primary purposes: forecasting and studying the structure of the relationship between two variables.
• We can use regression to predict the value of the dependent variable for a specified value of the independent variable.
• The regression equation also tells us how the dependent variable has typically changed with changes in the independent variable.
• Use regression analysis to understand the structure of the relationship between two variables.
  o Structural relationship: y = a + bx
• Use regression analysis to forecast y for a value of x within the historically observed range of x-values.
• Be cautious about using regression to forecast for values beyond the historically observed range of x-values.
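As a sketch of the forecasting use (again with hypothetical data and NumPy), predicting y for a new x and flagging values outside the historically observed range:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # hypothetical observed x-values
    y = np.array([2.1, 2.9, 3.6, 4.4, 5.2, 5.8])   # hypothetical observed y-values
    b, a = np.polyfit(x, y, deg=1)                  # fitted slope b and intercept a

    def forecast(x_new):
        """Forecast y = a + b*x_new, warning when x_new lies outside the observed range."""
        if not (x.min() <= x_new <= x.max()):
            print("Warning: forecasting beyond the observed range of x; use with caution.")
        return a + b * x_new

    print(forecast(3.5))    # within the historically observed range
    print(forecast(12.0))   # beyond the observed range; triggers the warning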

Calculating the Regression Line

The Accuracy of a Line


• To find the line that best fits a data set, we first need a measure of the accuracy of a line's fit: the Sum of Squared Errors (SSE).
• To find the Sum of Squared Errors, we calculate the vertical distances from the data points to the line, square the distances, and sum the squares.
• Error = vertical distance from data point to line = actual value − predicted value = y − ŷ
• Measure of accuracy = Sum of Squared Errors
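A minimal sketch (hypothetical data and an arbitrary candidate line) of measuring a line's accuracy by the Sum of Squared Errors:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 2.7, 3.9, 4.1, 5.3])   # hypothetical observed values

    a, b = 1.0, 0.8                            # a candidate line: predicted y = a + b*x
    y_hat = a + b * x                          # predicted values on the line
    errors = y - y_hat                         # vertical distances (actual - predicted)
    sse = np.sum(errors ** 2)                  # Sum of Squared Errors
    print(sse)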

Identifying the Regression Line


• The line that most accurately fits the data (the regression line) is the line for which the Sum of Squared Errors is minimized.
• Lower SSE → Higher Accuracy
• Lowest SSE → Regression Line
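As a sketch (same hypothetical data) of the idea that the least-squares regression line has the lowest SSE of any line:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 2.7, 3.9, 4.1, 5.3])   # hypothetical data

    def sse(a, b):
        """Sum of Squared Errors of the line y = a + b*x against the data."""
        return np.sum((y - (a + b * x)) ** 2)

    b_ls, a_ls = np.polyfit(x, y, deg=1)       # least-squares (regression) line
    print(sse(a_ls, b_ls))                     # the lowest achievable SSE for this data
    print(sse(a_ls + 0.3, b_ls))               # any other line gives a higher SSE
    print(sse(a_ls, b_ls + 0.2))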

Refining Regression

Quantifying the Predictive Power of Regression


• R-squared (R²) measures how well the behavior of the independent variable explains the behavior of the dependent variable.
• R-squared is the ratio of the Regression Sum of Squares to the Total Sum of Squares. As such, it tells us what proportion of the total variation in the dependent variable is explained by its linear relationship with the independent variable.
• R² = percentage of variation in the dependent variable explained by the independent variable
• R² = (variation explained by the regression) / (total variation) = Regression Sum of Squares / Total Sum of Squares
• Equivalently, R² = 1 − (Residual Sum of Squares / Total Sum of Squares)
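A minimal sketch (hypothetical data) of computing R² both as Regression Sum of Squares / Total Sum of Squares and as 1 − Residual Sum of Squares / Total Sum of Squares:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.0, 2.7, 3.9, 4.1, 5.3])        # hypothetical data

    b, a = np.polyfit(x, y, deg=1)
    y_hat = a + b * x                               # values predicted by the regression line

    total_ss = np.sum((y - y.mean()) ** 2)          # Total Sum of Squares
    regression_ss = np.sum((y_hat - y.mean()) ** 2) # Regression Sum of Squares
    residual_ss = np.sum((y - y_hat) ** 2)          # Residual Sum of Squares

    print(regression_ss / total_ss)                 # R-squared
    print(1 - residual_ss / total_ss)               # equivalent definition, same value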

Residual Analysis
A complete regression analysis should include a careful inspection of the residuals. Plot the residuals against the independent variable to reveal patterns in the distribution of the residuals.

Residual = actual value − predicted value = y − ŷ
• Plot the residuals against the independent variable to reveal patterns.
• If the underlying relationship is linear, the residuals follow a normal distribution with mean zero and constant variance.
• Residual plots reveal nonlinear relationships and heteroskedasticity.
  o Heteroskedasticity: the variance of the residual distribution changes with the value of the independent variable.
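A minimal sketch (hypothetical data) of a residual plot against the independent variable; a curved pattern would suggest a nonlinear relationship, and a fanning-out pattern would suggest heteroskedasticity:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 2.8, 3.7, 4.5, 5.1, 5.9, 6.6, 7.5])   # hypothetical data

    b, a = np.polyfit(x, y, deg=1)
    residuals = y - (a + b * x)                # residual = actual - predicted

    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")             # residuals should scatter evenly around zero
    plt.xlabel("x (independent variable)")
    plt.ylabel("residual")
    plt.show()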

Interpreting the Regression Coefficients


• The slope and intercept of the regression line are estimates based on sample data: how closely they approximate the actual values is uncertain.
• Confidence intervals specify a range of likely values for the regression coefficients.
• Excel reports a p-value for each coefficient. If the p-value for a slope coefficient is less than 0.05, we can be 95% confident that the slope is nonzero, and hence that there is a linear relationship between the independent and dependent variables.

• Regression line coefficients are estimates based on sample data.
• Confidence intervals specify a range of likely values for each coefficient.
• If the true slope is 0 → no linear relationship

Two approaches to testing for the statistical significance of a coefficient:

• Confidence interval:
  o Confidence interval contains 0 → no linear relationship
• Hypothesis test:
  o The reported p-value indicates the likelihood that the true coefficient is zero.
  o p-value < 0.05 → linear relationship
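A minimal sketch (hypothetical data, using scipy.stats.linregress rather than Excel's regression output) of both approaches: the p-value for the slope, and a 95% confidence interval built from the slope's standard error:

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    y = np.array([2.1, 3.2, 3.5, 4.8, 5.0, 6.3, 6.4, 7.9])   # hypothetical data

    result = stats.linregress(x, y)            # slope, intercept, p-value, standard error

    # Hypothesis-test approach: p-value for the null hypothesis "true slope = 0"
    print(result.pvalue)                       # < 0.05 suggests a linear relationship

    # Confidence-interval approach: 95% interval for the slope
    t_crit = stats.t.ppf(0.975, df=len(x) - 2)
    low = result.slope - t_crit * result.stderr
    high = result.slope + t_crit * result.stderr
    print(low, high)                           # if the interval contains 0: no significant relationship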

Revisiting R² and p
The p-value and R² provide different information. A linear relationship can be statistically significant but not explain a large percentage of the variation, so a low p-value does not ensure a high R². Sample size is an important determinant of regression accuracy: as with all sampling, larger samples give more accurate estimates.

• R² = percentage of variation in the dependent variable explained by its relationship with the independent variable
• p = probability that there is no linear relationship between the dependent and independent variables
• Larger sample size → more accurate estimates
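A small simulation sketch (made-up parameters) of the point that a weak relationship can still be statistically significant in a large sample: the p-value is tiny even though R² stays low:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 5000                                   # large sample
    x = rng.normal(size=n)
    y = 0.1 * x + rng.normal(size=n)           # weak true slope, lots of noise

    result = stats.linregress(x, y)
    print(result.pvalue)                       # very small: the slope is significantly nonzero
    print(result.rvalue ** 2)                  # R-squared is still small (about 0.01)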
