
Exercise 1: a) We used the command describe to get information about the dataset.

The total sample size is 74. Each observation is one car (make and model). The dataset contains 12 variables:

make            Make and Model
price           Price
mpg             Mileage (mpg)
rep78           Repair Record 1978
headroom        Headroom (in.)
trunk           Trunk space (cu. ft.)
weight          Weight (lbs.)
length          Length (in.)
turn            Turn Circle (ft.)
displacement    Displacement (cu. in.)
gear_ratio      Gear Ratio
foreign         Car type (value label: origin)

b) To get the description statistics for variables we use the command summarize.

To get description for only foreign origin models only we add if foreign==1 to the command. To get description for only domestic models only we add if foreign==0 to the command. In the result, we found 52 domestic models, 22 foreign models. For price we have statistics below:

For mileage we have statistics below:

For length we have statistics below:

For weight we have statistics below:

c) We create new variables using the command gen:
mpgDos      Miles per gallon of gas (domestic models)
mpgFor      Miles per gallon of gas (foreign models)
priceDos    Price (domestic models)
priceFor    Price (foreign models)

To test the claim that domestic cars are cheaper than foreign cars, we conducted a two-sample mean comparison test.
H0: mean(priceDos) - mean(priceFor) <= 0
Ha: mean(priceDos) - mean(priceFor) > 0
We used the command ttest and got the result below. Right-tail test: the p-value equals 0.6701, which is higher than alpha = 0.05, so we fail to reject the null hypothesis. The data give no evidence that domestic cars are more expensive than foreign ones, which is consistent with the claim. Failing to reject H0 does not by itself prove that domestic cars are cheaper.
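The one-sided p-values of a t test are linked: for the same test statistic, the left-tail and right-tail p-values sum to 1. The following is a minimal Python sketch, not part of the Stata session, that starts from the right-tail p-value 0.6701 reported above; the helper name other_tail is a hypothetical name chosen for illustration.

```python
def other_tail(p_one_sided: float) -> float:
    """Return the p-value of the opposite one-sided alternative."""
    return 1.0 - p_one_sided

ALPHA = 0.05
p_right = 0.6701               # reported for Ha: mean(priceDos) - mean(priceFor) > 0
p_left = other_tail(p_right)   # corresponds to Ha: mean(priceDos) - mean(priceFor) < 0

for label, p in (("right-tail", p_right), ("left-tail", p_left)):
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"{label}: p = {p:.4f} -> {decision}")
```

Under this relationship the left-tail p-value is 0.3299, so neither one-sided alternative is supported at the 5% level.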

To test the claim that domestic cars have better mileage (they can go more miles per consumed gallon of gas), we conducted a two-sample mean comparison test.
H0: mean(mpgDos) - mean(mpgFor) >= 0
Ha: mean(mpgDos) - mean(mpgFor) < 0
We used the command ttest and got the result below.

Left-tail test: the p-value equals 0.0017, which is less than alpha = 0.05, so we reject the null hypothesis. We conclude that domestic cars go fewer miles per consumed gallon of gas than foreign ones (that is, foreign cars have better mileage).

d) We define the events:
D: the car has domestic origin
F: the car has foreign origin
E: the car is expensive (price > 5500)
To find the number of cars that are expensive and have domestic origin, we used the command below. The result is 18.

To find the number of cars that are expensive and have foreign origin, we used the command below. The result is 12.

There are 52 domestic cars and 22 foreign cars, for a total sample size of 74, so we can construct this table:

                 Domestic   Foreign   Totals
Expensive           18         12       30
Not Expensive       34         10       44
Totals              52         22       74
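The remaining cells of the table follow by subtraction from the column totals. A short Python sketch of that arithmetic, using only the four counts reported in the text (the variable names are chosen for illustration):

```python
# Counts reported in the text
n_domestic, n_foreign = 52, 22
exp_domestic, exp_foreign = 18, 12

# "Not Expensive" cells are the column totals minus the expensive counts
table = {
    "Expensive": (exp_domestic, exp_foreign),
    "Not Expensive": (n_domestic - exp_domestic, n_foreign - exp_foreign),
}

print(f"{'':15}{'Domestic':>9}{'Foreign':>9}{'Totals':>8}")
for row, (d, f) in table.items():
    print(f"{row:15}{d:>9}{f:>9}{d + f:>8}")
print(f"{'Totals':15}{n_domestic:>9}{n_foreign:>9}{n_domestic + n_foreign:>8}")
```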

From the table we have:
p(E|D) = 18/52 = 0.3462, nD = 52
p(E|F) = 12/22 = 0.5455, nF = 22
We now test the hypothesis that the proportion of expensive cars is higher for foreign models, using a two-sample proportion test.
H0: p(E|D) - p(E|F) <= 0
Ha: p(E|D) - p(E|F) > 0
We used the command below to get the result.
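As a cross-check outside Stata, the pooled two-sample z test for proportions can be sketched in Python from the counts in the table. This is a sketch under the usual large-sample normal approximation; the function name two_prop_ztest is hypothetical and not a Stata or library routine.

```python
from math import sqrt
from statistics import NormalDist

def two_prop_ztest(x1, n1, x2, n2):
    """Pooled two-sample z test for H0: p1 - p2 <= 0 vs Ha: p1 - p2 > 0."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)              # common proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_right = 1 - NormalDist().cdf(z)           # right-tail p-value
    return p1, p2, z, p_right

# Expensive cars: 18 of 52 domestic, 12 of 22 foreign
p_d, p_f, z, p_right = two_prop_ztest(18, 52, 12, 22)
print(f"p(E|D) = {p_d:.4f}, p(E|F) = {p_f:.4f}")
print(f"z = {z:.3f}, right-tail p-value = {p_right:.4f}")
```

Because this sketch works from the exact counts rather than from rounded proportions typed into a command, its p-value may differ slightly from the Stata output; the decision at alpha = 0.05 (fail to reject H0) is the same.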

Right-tail test. The p-value equals 0.9347, which is greater than alpha = 0.05. Consequently, we fail to reject H0: there is no evidence that the proportion of expensive cars is higher for domestic models, which is consistent with the proportion being higher for foreign models.

e) We estimate the relationship between price and mileage by constructing the simple linear regression equation:
price = b0 + b1(mpg)
We used the command below to get the results of the regression:
H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons is 11253.06 and its p-value P>|t| is 0.000. The coefficient is significantly different from 0 because its p-value is less than 0.05. So we reject H0. We have b0 = 11253.06.
H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for mpg is -238.89 and its p-value P>|t| is 0.000. The coefficient is significantly different from 0 because its p-value is less than 0.05. So we reject H0. We have b1 = -238.89.

We can conclude the estimated simple regression equation:
price = 11253.06 - 238.89(mpg)
For each one-unit increase in mileage (one more mile per gallon), we predict a decrease of 238.89 units in the price of the car. That is, the more miles the car can go per gallon of gas, the cheaper the car is (negative relationship). We have R-squared = 0.2196, which means that 21.96% of the total variation in the dependent variable (price) is explained by the variation of the independent variable (mpg) in our regression. However, the R-squared is small, so the regression does not have sufficient quality to claim a strong relationship between price and mileage.
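To illustrate how the estimated line is read, the sketch below plugs a few mileage values into price = b0 + b1(mpg) with the coefficients reported above. The mileage values are illustrative choices, not observations from the dataset, and predicted_price is a hypothetical helper name.

```python
B0 = 11253.06   # intercept (_cons) reported above
B1 = -238.89    # slope on mpg reported above

def predicted_price(mpg: float) -> float:
    """Fitted price from the estimated line price = b0 + b1 * mpg."""
    return B0 + B1 * mpg

for mpg in (15, 20, 30):   # illustrative mileage values, not taken from the data
    print(f"mpg = {mpg}: predicted price = {predicted_price(mpg):.2f}")
```

Each additional mile per gallon lowers the fitted price by the slope, 238.89.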

Question 3: We used the Data Editor to enter the given dataset into Stata. To get a description of the dataset, we used the command sum. There are 5 variables, namely grocery, housing, utilities, transportation, and healthcare, and 25 observations in total.

a. We need to plot the grocery index against each of the other indexes. The command we used is scatter.
For Grocery vs Housing: In the scatter plot given below, the observations for Grocery and Housing require extra attention, since the points tend to cluster on one side of the plot. However, the points do not clearly follow any linear direction, so at this point we could not observe any significant pattern. We keep this in mind when we do the regression.

For Grocery vs Utilities: The scatter plot given below suggests a pattern: the cost of grocery tends to increase when the cost of utilities increases. However, the points are widely scattered, so we may see a large variation around the predicted pattern. We can check this after doing the regression.

For Grocery vs Transportation: In the scatter plot given below, the points tend to follow a linear pattern: the cost of Grocery tends to increase with the cost of transportation. However, we suggest caution before claiming a relationship between these variables, since some points deviate clearly from the pattern.

For Grocery vs Healthcare: The scatter plot given below does not suggest any significant pattern, since the points are spread far apart from one another. We can check this after doing the regression.

b. Grocery vs Housing
To run the regression of the grocery index on the housing index, we construct the simple linear regression equation given below:
Grocery = b0 + b1(housing)
We test hypotheses about the coefficients. We used the command below:

H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for housing is 0.0517 and its p-value P>|t| is 0.186. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any significant linear relationship between Grocery and Housing.
H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons is 92.94 and its p-value P>|t| is 0.00. The coefficient is significantly different from 0 because its p-value is lower than 0.05. So we reject H0.

Treating the insignificant slope as zero, we can conclude the equation: Grocery = 92.94

We have R-squared = 0.1860, which means that only 18.6% of the total variation in the dependent variable is explained by variation in the independent variable; 81.4% of the total variation cannot be explained by the model. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of housing.

Grocery vs Utilities
To run the regression of the grocery index on the utilities index, we construct the simple linear regression equation given below:
Grocery = b0 + b1(utilities)
Command:

We test hypotheses about the coefficients.
H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for utilities is 0.1411 and its p-value P>|t| is 0.029. The coefficient is significantly different from 0 because its p-value is smaller than 0.05. So we reject H0. This means that, from the observations given, we can conclude a significant linear relationship between the cost of Grocery and the cost of utilities.
H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons, 83.99, is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject H0.
We can conclude the equation:
Grocery = 83.99 + 0.1411(utilities)
For each one-unit increase in the cost of utilities, we predict an increase of 0.1411 units in the cost of grocery.

We have R-squared = 0.1911, which means that only 19.11% of the total variation in the dependent variable is explained by variation in the independent variable. Although the slope is significant, the regression does not have sufficient quality to claim a strong relationship between the cost of Grocery and the cost of utilities.
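In a regression with a single regressor, the R-squared and the t statistic of the slope are tied together by t^2 = R^2 (n - 2) / (1 - R^2). The Python sketch below uses that identity as a consistency check on the reported R-squared = 0.1911 with n = 25; the function name slope_t_from_r2 is hypothetical.

```python
from math import sqrt

def slope_t_from_r2(r2: float, n: int) -> float:
    """|t| of the slope in a one-regressor OLS fit, from R-squared and sample size."""
    return sqrt(r2 * (n - 2) / (1 - r2))

N_OBS = 25                      # observations reported for the dataset
t_abs = slope_t_from_r2(0.1911, N_OBS)
print(f"|t| = {t_abs:.3f} on {N_OBS - 2} degrees of freedom")
```

A |t| a little above 2.3 on 23 degrees of freedom exceeds the two-sided 5% critical value of about 2.07, which agrees with the reported p-value of 0.029 being below 0.05.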

Grocery vs Transportation: To run the regression of grocery index vs transportation index, we construct a simple linear regression equation given below: Grocery = b0 + b1(transportation) Command :

We test hypotheses about the coefficients.
H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for transportation is 0.1372 and its p-value P>|t| is 0.45. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any linear relationship between the cost of Grocery and the cost of transportation.
H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons, 84.25, is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject H0.

Treating the insignificant slope as zero, we can conclude the estimated equation: Grocery = 84.25
We have R-squared = 0.0251, which means that only 2.51% of the total variation in the dependent variable is explained by variation in the independent variable. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of transportation.

Grocery vs Healthcare
To run the regression of the grocery index on the healthcare index, we construct the simple linear regression equation given below:
Grocery = b0 + b1(healthcare)
Command:

We test hypotheses about the coefficients.
H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for healthcare is 0.0869 and its p-value P>|t| is 0.258. The coefficient is not significantly different from 0 because its p-value is greater than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any linear relationship between the cost of Grocery and the cost of healthcare.
H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons, 89.44, is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject H0.
Treating the insignificant slope as zero, we can conclude the estimated equation: Grocery = 89.44
We have R-squared = 0.0552, which means that only 5.52% of the total variation in the dependent variable can be explained by our model. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of healthcare.

c. Log Grocery vs Log Housing
In order to estimate the elasticity of the grocery index with respect to the housing index, we construct the simple linear regression equation given below:
ln(Grocery) = b0 + b1 ln(housing)

H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for ln_housing is 0.066 and its p-value P>|t| is 0.199. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any significant linear relationship between Log Grocery and Log Housing. We conclude that the housing elasticity of the grocery index is not significantly different from 0.

We have R-squared = 0.0708, which means that only 7.08% of the total variation in the dependent variable is explained by our regression. The regression does not have sufficient quality to claim an elasticity of the grocery index with respect to the housing index.

Log Grocery vs Log Utilities
In order to estimate the elasticity of the grocery index with respect to the utilities index, we construct the simple linear regression equation given below:
ln(Grocery) = b0 + b1 ln(Utilities)

H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for ln_utilities is 0.131 and its p-value P>|t| is 0.047. The coefficient is significantly different from 0 because its p-value is lower than 0.05. So we reject H0. This means that, from the observations given, we can conclude a significant linear relationship between Log Grocery and Log Utilities. We predict that when the utilities index increases by 1%, the grocery index increases by about 0.131%. The estimated elasticity is 0.131.
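In a log-log model the slope is an elasticity, so the "1% gives about b1 %" reading is an approximation of the exact ratio (1 + change)^b1. A small Python sketch of that calculation with the reported slope 0.131 follows; the percentage changes tried are illustrative and pct_change_grocery is a hypothetical helper name.

```python
ELASTICITY = 0.131   # slope of ln(grocery) on ln(utilities) reported above

def pct_change_grocery(pct_change_utilities: float) -> float:
    """Exact % change in grocery implied by the log-log model."""
    ratio = (1 + pct_change_utilities / 100) ** ELASTICITY
    return (ratio - 1) * 100

for pct in (1, 10):   # illustrative changes in the utilities index
    print(f"utilities +{pct}% -> grocery {pct_change_grocery(pct):+.3f}%")
```

For a 1% change the exact figure is about 0.13%, matching the elasticity reading; for a 10% change it is about 1.26%, slightly below 10 times the elasticity.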

We have R-squared = 0.1605, which means that only 16.05% of the total variation in the dependent variable is explained by our regression. The regression does not have sufficient quality to claim a strong elasticity of the grocery index with respect to the utilities index.

Log Grocery vs Log Transportation
In order to estimate the elasticity of the grocery index with respect to the transportation index, we construct the simple linear regression equation given below:
ln(Grocery) = b0 + b1 ln(Transportation)

H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for ln_transportation is 0.1297 and its p-value P>|t| is 0.481. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any significant linear relationship between Log Grocery and Log Transportation. We conclude that the transportation elasticity of the grocery index is not significantly different from 0.

We have R-squared = 0.0218, which means that only 2.18% of the total variation in the dependent variable is explained by our regression. The regression does not have sufficient quality to claim an elasticity of the grocery index with respect to the transportation index.

Log Grocery vs Log Healthcare
In order to estimate the elasticity of the grocery index with respect to the healthcare index, we construct the simple linear regression equation given below:
ln(Grocery) = b0 + b1 ln(Healthcare)

H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for ln_Healthcare is 0.092 and its p-value P>|t| is 0.265. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0. This means that, from the observations given, we cannot conclude any significant linear relationship between Log Grocery and Log Healthcare. We conclude that the healthcare elasticity of the grocery index is not significantly different from 0.

We have R-squared = 0.0537, which means that only 5.37% of the total variation in the dependent variable is explained by our regression. The regression does not have sufficient quality to claim an elasticity of the grocery index with respect to the healthcare index.

d. Multiple regression
To estimate the multiple linear model, we construct the multiple regression equation below:
Grocery = b0 + b1(housing) + b2(utilities) + b3(transportation) + b4(healthcare)
We used the command below to do the regression:

H0: b0 = 0   Ha: b0 ≠ 0
The coefficient for _cons is 76.31 and its p-value P>|t| is 0.000. The coefficient is significantly different from 0 because its p-value is less than 0.05. So we reject H0. We have b0 = 76.31.
H0: b1 = 0   Ha: b1 ≠ 0
The coefficient for housing is 0.0859 and its p-value P>|t| is 0.109. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0 and treat b1 as 0.
H0: b2 = 0   Ha: b2 ≠ 0
The coefficient for utilities is 0.1677 and its p-value P>|t| is 0.018. The coefficient is significantly different from 0 because its p-value is less than 0.05. So we reject H0. We have b2 = 0.1677.
H0: b3 = 0   Ha: b3 ≠ 0
The coefficient for transportation is 0.0284 and its p-value P>|t| is 0.87. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0 and treat b3 as 0.
H0: b4 = 0   Ha: b4 ≠ 0
The coefficient for healthcare is -0.0659 and its p-value P>|t| is 0.53. The coefficient is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject H0 and treat b4 as 0.

We can conclude the multiple regression equation, treating the coefficients that are not significantly different from 0 as zero:
Grocery = 76.31 + 0(housing) + 0.1677(utilities) + 0(transportation) + 0(healthcare)
Grocery = 76.31 + 0.1677(utilities)
We have R-squared = 0.3145, which means that 31.45% of the total variation in the dependent variable (grocery) is explained by the variation of the independent variables (housing, utilities, transportation, healthcare) in our regression. Although this R-squared is higher than the R-squared in the individual regressions, the multiple regression still does not have sufficient quality to claim a strong relationship between the Grocery index and the other indexes.
In the multiple regression we increase the number of independent variables. The more independent variables we have, the more variation of the dependent variable can be explained and the smaller the error variation is, so R-squared cannot decrease when variables are added. A higher R-squared alone therefore does not show that the added variables improve the model; here only utilities has a coefficient significantly different from 0.
R-squared = SSR/SST
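Because R-squared cannot decrease when regressors are added, a common complementary measure is the adjusted R-squared, 1 - (1 - R^2)(n - 1)/(n - k - 1), which penalises the number of regressors k. The Python sketch below applies that formula to two R-squared values reported in this exercise (0.1911 for utilities alone, 0.3145 for all four indexes) with n = 25; the adjusted values are computed here and are not taken from the Stata output.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R-squared for n observations and k regressors (excluding the constant)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

N = 25   # observations in the dataset
models = {
    "utilities only": (0.1911, 1),     # (reported R-squared, number of regressors)
    "all four indexes": (0.3145, 4),
}
for name, (r2, k) in models.items():
    print(f"{name}: R2 = {r2:.4f}, adjusted R2 = {adjusted_r2(r2, N, k):.4f}")
```

The gap between the two models is much smaller on the adjusted scale than on the raw R-squared scale, which is why the raw increase should be read with caution.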
