
Teknik Mesin FTUI

Dr. Ir. Harinaldi, M.Eng

Learning Objectives
Explain the purpose of regression and correlation analysis.
Compute and interpret the regression equation and the standard error of the estimate for simple linear regression.
Use the results of the analysis to estimate an interval for the dependent variable.
Compute and explain the meaning of the coefficients of correlation and determination.

Linear Regression Analysis

Regression analysis is used to predict the value of one variable (the dependent variable) on the basis of other variables (the independent variables).

Dependent variable: denoted Y
Independent variables: denoted X1, X2, ..., Xk
If we only have ONE independent variable, the model is

y = a + bx

Linear Regression Analysis

y = a + bx

Variables:
X = Independent Variable (we provide this)
Y = Dependent Variable (we observe this)

Parameters:
a = Y-Intercept
b = Slope

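House Price: Lower vs. Higher Variability

House Price = 25,000 + 75(Size)

Houses with the same square footage can sell at different price points (e.g. décor options, cabinet upgrades, lot location).

[Figure: house price plotted against house size, shown with lower and with higher variability]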

Theoretical Linear Model

Correlation Analysis: -1 ≤ ρ ≤ 1

If we are interested only in determining whether a relationship exists, we employ correlation analysis. Example: students' height and weight.
[Figure: three scatter plots, each titled "Plot of Height vs Weight", with height on the vertical axis and weight on the horizontal axis]



Regression: Model Types

X = size of house, Y = cost of house

Deterministic Model: an equation or set of equations that allows us to fully determine the value of the dependent variable from the values of the independent variables.
y = $25,000 + ($75/ft²)(x)
Area of a circle: A = πr²

Probabilistic Model: a method used to capture the randomness that is part of a real-life process.
y = 25,000 + 75x + ε
E.g. do all houses of the same size (measured in square feet) sell for exactly the same price?
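A small sketch of the distinction, using the house-price line from the slide. The noise standard deviation (10,000) and the 2,000 ft² house are assumed values chosen only for illustration; they do not come from the slides.

```python
import random

def deterministic_price(size_sqft):
    # Deterministic model: the same size always gives the same price.
    return 25_000 + 75 * size_sqft

def probabilistic_price(size_sqft, rng, noise_sd=10_000):
    # Probabilistic model: add a random error term (epsilon).
    # noise_sd is a hypothetical value, not taken from the slides.
    epsilon = rng.gauss(0, noise_sd)
    return deterministic_price(size_sqft) + epsilon

rng = random.Random(0)  # fixed seed so the run is repeatable
size = 2_000
print(deterministic_price(size))  # 175000
print([round(probabilistic_price(size, rng)) for _ in range(3)])
```

The second line of output differs from house to house even though the size is the same, which is the behaviour the error term is meant to capture.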

Simple Linear Regression Model

Meaning of a and b
a = y-intercept (the value of y at x = 0)
b = slope (= rise/run)
b > 0: positive slope
b < 0: negative slope

Which line has the best fit to the data?


Estimating the Coefficients

In much the same way we base estimates of μ on x̄, we estimate a and b, the y-intercept and slope (respectively) of the least squares or regression line given by:

ŷ = a + bx

(This is an application of the least squares method, which produces the straight line that minimizes the sum of the squared differences between the points and the line.)
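For reference, the least squares estimates can be written in terms of the sample covariance and the sample variance of x. The symbols s_xy and s_x^2 are assumed here to mean the same quantities as sxy and sx2 in the worked table in these slides.

```latex
\[
b = \frac{s_{xy}}{s_x^2}, \qquad a = \bar{y} - b\,\bar{x},
\]
\[
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}, \qquad
s_x^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}.
\]
```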

Least Squares Line

These differences between the points and the line are called residuals or errors.


Least Squares Line

See if you can estimate the y-intercept and slope from this data.

Data points:
x    1    2    3    4    5    6
y    6    1    9    5   17   12

Fitted line: y = 0.933 + 2.114x

Least Squares Line: worked calculation

         X      Y    X - Xbar   Y - Ybar   (X - Xbar)*(Y - Ybar)   (X - Xbar)^2
         1      6     -2.500     -2.333            5.833               6.250
         2      1     -1.500     -7.333           11.000               2.250
         3      9     -0.500      0.667           -0.333               0.250
         4      5      0.500     -3.333           -1.667               0.250
         5     17      1.500      8.667           13.000               2.250
         6     12      2.500      3.667            9.167               6.250
Sum     21     50      0.000      0.000           37.000              17.500

Xbar = 21/6 = 3.500
Ybar = 50/6 = 8.333
sxy  = 37.00/(6-1) = 7.400
sx2  = 17.5/(6-1) = 3.500
b1   = 7.4/3.5 = 2.114 (the slope, b)
b0   = 8.333 - 2.114*3.500 = 0.933 (the y-intercept, a)
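The hand calculation above can be checked with a few lines of Python. This is a minimal sketch using only the six data points from the slide; the function name least_squares is chosen here for illustration.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y = a + b*x fitted by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance of x and y, and sample variance of x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x2 = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    b = s_xy / s_x2          # slope
    a = y_bar - b * x_bar    # y-intercept
    return a, b

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
a, b = least_squares(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")  # a = 0.933, b = 2.114
```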

Excel: Data Analysis - Regression

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7007
R Square             0.4910   (the proportion of the variation in Y that can be explained by your regression model)
Adjusted R Square    0.3637
Standard Error       4.5029   (will use later)
Observations         6

ANOVA
              df    SS             MS             F              Significance F
Regression    1     78.22857143    78.22857143    3.858149366    0.120968388
Residual      4     81.1047619     20.27619048
Total         5     159.3333333

Significance F is the same as the p-value for H0: the regression model is "no good".

                Coefficients    Standard Error    t Stat         P-value
Intercept       0.933333333     4.19198025        0.222647359    0.834716871
X Variable 1    2.114285714     1.076401159       1.96421724     0.120968388

The P-value for X Variable 1 tests H0: β1 = 0.

Excel: Plotted Regression Model

You will need to adjust the chart settings to get the plot to look good.

[Figure: "X Variable 1 Line Fit Plot", showing Y and Predicted Y against X Variable 1]

Assessing the Model

The least squares method will always produce a straight line, even if there is no relationship between the variables, or if the relationship is something other than linear. Hence, in addition to determining the coefficients of the least squares line, we need to assess it to see how well it fits the data. We'll see these evaluation methods now. They're based on what is called the sum of squares for error (SSE).
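Written out, and assuming the usual definition with the fitted value at each observed x denoted by ŷ_i, the sum of squares for error is:

```latex
\[
\mathrm{SSE} = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2,
\qquad \hat{y}_i = a + b\,x_i .
\]
```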

Standard Error

If the standard error of estimate, sε, is small, the fit is excellent and the linear model should be used for forecasting. If sε is large, the model is poor. But what is small and what is large?

Standard Error
Judge the value of sε by comparing it to the sample mean of the dependent variable (ȳ). In this example, sε = 0.3265 and ȳ = 14.841, so (relatively speaking) sε appears to be small; hence our linear regression model of car price as a function of odometer reading is good.
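The same comparison can be sketched for the six-point example used earlier in these slides. The formula sε = sqrt(SSE/(n - 2)) is the standard one for simple linear regression and is assumed here; the coefficients are copied from the Excel output.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
a, b = 0.933333, 2.114286  # coefficients from the Excel output

n = len(xs)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_eps = math.sqrt(sse / (n - 2))   # standard error of estimate
y_bar = sum(ys) / n

print(f"SSE = {sse:.3f}")                       # SSE = 81.105
print(f"s_eps = {s_eps:.4f}")                   # s_eps = 4.5029
print(f"s_eps / y_bar = {s_eps / y_bar:.2f}")   # s_eps / y_bar = 0.54
print(f"car example: {0.3265 / 14.841:.3f}")    # car example: 0.022
```

For the six points the standard error is about half the mean of y, which is large, whereas in the car example it is about 2% of the mean.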

Testing the Slope: Excel output does this for you.

If no linear relationship exists between the two variables, we would expect the regression line to be horizontal, that is, to have a slope of zero. We want to see if there is a linear relationship, i.e. we want to see if the population slope (β, which b estimates) is something other than zero.
Our research hypothesis becomes: H1: β ≠ 0
Thus the null hypothesis becomes: H0: β = 0
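As a sketch of what Excel computes for this test, the t statistic is the estimated slope divided by its standard error, with n - 2 degrees of freedom. The formula s_b = sε / sqrt(Σ(x - x̄)²) is the standard one and is assumed here; the data are the six points from the earlier example.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_eps = math.sqrt(sse / (n - 2))   # standard error of estimate
s_b = s_eps / math.sqrt(sxx)       # standard error of the slope
t_stat = b / s_b                   # compare with a t distribution, n - 2 d.f.

print(f"s_b = {s_b:.4f}, t = {t_stat:.4f}")  # s_b = 1.0764, t = 1.9642
```

These values match the "Standard Error" and "t Stat" entries for X Variable 1 in the Excel output, where the reported P-value of about 0.121 is too large to reject H0 at the 5% level.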

Testing the Slope

Coefficient of Determination
Tests thus far have shown whether a linear relationship exists; it is also useful to measure the strength of the relationship. This is done by calculating the coefficient of determination, R².
The coefficient of determination is the square of the coefficient of correlation (r), hence R² = r². (r will be computed shortly; this identity holds for models with only one independent variable.)

Coefficient of Determination
R² has a value of 0.6483. This means 64.83% of the variation in the auction selling prices (y) is explained by your regression model. The remaining 35.17% is unexplained, i.e. due to error.
Unlike the value of a test statistic, the coefficient of determination does not have a critical value that enables us to draw conclusions. In general, the higher the value of R², the better the model fits the data.
R² = 1: perfect match between the line and the data points.
R² = 0: there is no linear relationship between x and y.
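A minimal sketch of the calculation for the six-point example from the earlier slides (not the auction data, which is not reproduced here). It computes r from the sums of squares and then squares it.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

r = sxy / math.sqrt(sxx * syy)   # coefficient of correlation
r2 = r ** 2                      # coefficient of determination

print(f"r = {r:.4f}, R^2 = {r2:.4f}")  # r = 0.7007, R^2 = 0.4910
```

The results agree with "Multiple R" and "R Square" in the Excel output: about 49% of the variation in y is explained by the line.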

Correlation Analysis: -1 ≤ ρ ≤ 1

If the correlation coefficient is close to +1, you have a strong positive linear relationship.
If the correlation coefficient is close to -1, you have a strong negative linear relationship.
If the correlation coefficient is close to 0, you have little or no linear relationship.


Remember Excel's Output

An analysis of variance (ANOVA) table for the simple linear regression model can be given by:

Source of     Degrees of    Sums of                  Mean                 F-Statistic
Variation     Freedom       Squares                  Squares
Regression    1             SSR                      MSR = SSR/1          F = MSR/MSE
Error         n - 2         SSE                      MSE = SSE/(n - 2)
Total         n - 1         Variation in y (SST)
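The entries of this table can be reproduced for the six-point example with a short sketch. It assumes the usual identity SST = SSR + SSE, so SSR is obtained by subtraction.

```python
xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

sst = sum((y - y_bar) ** 2 for y in ys)                     # total variation in y
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))   # unexplained variation
ssr = sst - sse                                             # explained variation

msr = ssr / 1
mse = sse / (n - 2)
f_stat = msr / mse

print(f"SSR = {ssr:.3f}, SSE = {sse:.3f}, SST = {sst:.3f}")
# SSR = 78.229, SSE = 81.105, SST = 159.333
print(f"F = {f_stat:.3f}")  # F = 3.858
```

With a single independent variable, F equals the square of the t statistic for the slope (1.9642² ≈ 3.858), which is why the two p-values in the Excel output coincide.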

Using the Regression Equation

We could use our regression equation:
ŷ = 17.250 - 0.0669x
to predict the selling price of a car with 40,000 miles on it (x = 40, in thousands of miles):
ŷ = 17.250 - 0.0669(40) = 14.574, i.e. $14,574
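The arithmetic can be checked directly. This sketch assumes, as the dollar figure on the slide implies, that price is measured in thousands of dollars and the odometer reading in thousands of miles; the function name predict_price is illustrative.

```python
def predict_price(odometer_thousands):
    # Regression line from the slide: price in $1,000s, odometer in 1,000s of miles.
    return 17.250 - 0.0669 * odometer_thousands

price = predict_price(40)
print(f"{price:.3f}")  # 14.574
```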

We call this value ($14,574) a point prediction (estimate). Chances are though the actual selling price will be different, hence we can estimate the selling price in terms of a confidence interval.

Prediction Interval
The prediction interval is used when we want to predict one particular value of the dependent variable, given a specific value of the independent variable:

(xg is the given value of x we're interested in)
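A standard textbook form of the prediction interval for simple linear regression, offered here as an assumption about the intended formula, is shown below. Here s_ε is the standard error of estimate, s_x^2 is the sample variance of x, and t has n - 2 degrees of freedom.

```latex
\[
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_\varepsilon
\sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)\,s_x^2}}
\]
```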

Confidence Interval Estimator for Mean of Y

The confidence interval estimate for the expected value of y (the mean of Y) is used when we want an interval that we are pretty sure contains the true regression line. In this case, we are estimating the mean of y given a value of x:

(Technically this formula is used for infinitely large populations. However, we can interpret our problem as attempting to determine the average selling price of all Ford Tauruses, all with 40,000 miles on the odometer)
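A standard textbook form of this confidence interval estimator, offered here as an assumption about the intended formula, is shown below. Here x_g is the given value of x, s_ε is the standard error of estimate, s_x^2 is the sample variance of x, and t has n - 2 degrees of freedom.

```latex
\[
\hat{y} \;\pm\; t_{\alpha/2,\,n-2}\; s_\varepsilon
\sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)\,s_x^2}}
\]
```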

What's the Difference?

Prediction Interval: the expression under the square root includes the leading "1 +". Used to predict one individual value of y (at given x).

Confidence Interval: no "1 +" under the square root. Used to estimate the mean value of y (at given x).

The confidence interval estimate of the expected value of y will be narrower than the prediction interval for the same given value of x and confidence level. This is because there is less error in estimating a mean value as opposed to predicting an individual value.
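The difference in width can be seen numerically on the six-point example from the earlier slides. This is a sketch under stated assumptions: it uses the standard textbook interval formulas, a 95% confidence level with the tabulated t value 2.776 for 4 degrees of freedom, and a hypothetical given value x_g = 4.

```python
import math

xs = [1, 2, 3, 4, 5, 6]
ys = [6, 1, 9, 5, 17, 12]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)   # equals (n - 1) * s_x^2
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
s_eps = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

T_CRIT = 2.776   # t value for 95% confidence, n - 2 = 4 d.f. (from a t table)
x_g = 4          # hypothetical given value of x

y_hat = a + b * x_g
core = 1 / n + (x_g - x_bar) ** 2 / sxx
ci_half = T_CRIT * s_eps * math.sqrt(core)       # mean of y: no "1 +"
pi_half = T_CRIT * s_eps * math.sqrt(1 + core)   # single y: with "1 +"

print(f"point estimate: {y_hat:.2f}")
print(f"95% confidence interval: {y_hat - ci_half:.2f} to {y_hat + ci_half:.2f}")
print(f"95% prediction interval: {y_hat - pi_half:.2f} to {y_hat + pi_half:.2f}")
```

Because the only difference is the extra 1 under the square root, the prediction interval is always the wider of the two for the same x_g and confidence level.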

Regression Diagnostics
There are three conditions that are required in order to perform a regression analysis. These are:
The error variable must be normally distributed,
The error variable must have a constant variance, and
The errors must be independent of each other.

Procedure for Regression Diagnostics

1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model.
3. Draw the scatter diagram to determine whether a linear model appears to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Assess the model's fit.
6. If the model fits the data, use the regression equation to predict a particular value of the dependent variable and/or estimate its mean.

1. Building the Model: Collect Data

Student    Test 1    Test 2
1          50        32
2          51        33
3          52        34
4          53        35
5          54        36
6          55        37
7          56        39
8          57        40
9          58        41
10         59        42
11         60        43
12         61        44
13         62        46
14         63        47
15         64        48
16         65        49
17         66        50
18         67        51
19         68        53
20         69        54
21         70        55
22         71        56
23         72        57

From the data: estimate a and estimate b.
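One way to carry out this step is sketched below. It assumes Test 1 is the independent variable x and Test 2 the dependent variable y, which the slide does not state explicitly, and applies the same least squares formulas used earlier.

```python
test1 = list(range(50, 73))   # Test 1 scores for students 1 to 23 (50, 51, ..., 72)
test2 = [32, 33, 34, 35, 36, 37, 39, 40, 41, 42, 43, 44,
         46, 47, 48, 49, 50, 51, 53, 54, 55, 56, 57]

n = len(test1)
x_bar = sum(test1) / n
y_bar = sum(test2) / n

# Sums of cross-products and squares about the means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(test1, test2))
sxx = sum((x - x_bar) ** 2 for x in test1)

b = sxy / sxx            # estimated slope
a = y_bar - b * x_bar    # estimated y-intercept

print(f"a = {a:.3f}, b = {b:.3f}")  # a = -26.330, b = 1.160
```

Under that assumption, each additional point on Test 1 is associated with roughly 1.16 additional points on Test 2.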