
Innopolis, Fall 2014

Solution for SABD Problem Set #1


Smoking and Lung Cancer
By Artur Samigullin



Part 0. Problem Identification and Dataset
This problem set works through the relation between smoking and lung cancer using a (very)
small sample of five countries. The purpose of the exercise is to illustrate the mechanics of
ordinary least squares (OLS) regression: first calculate the regression by hand using the
formulas from class and the textbook, then use a statistical package to confirm the
calculation.
The data are summarized in Table #1. The variables are per capita cigarette consumption in
1930 (the independent variable, X) and the death rate from lung cancer in 1950 (the dependent
variable, Y). The cancer rates are taken from a later time period because lung cancer takes
time to develop and be diagnosed.
#   Country         Cigarettes consumed per    Lung cancer deaths per million
                    capita in 1930 (X)         people in 1950 (Y)
1   Switzerland      530                       250
2   Finland         1115                       350
3   Great Britain   1145                       465
4   Canada           510                       150
5   Denmark          380                       165
Table #1. Sample dataset.


Part 1. Calculations by hand
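1. Calculating the means of the variables:

$$ \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{5}(530 + 1115 + 1145 + 510 + 380) = 736 $$

$$ \bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{5}(250 + 350 + 465 + 150 + 165) = 276 $$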
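
The two means can be checked with a few lines of R. This is only a minimal sketch: the names x, y, n, x_bar, and y_bar are illustrative, and the data are entered in the country order of Table #1.

```r
# Data in the country order of Table #1
x <- c(530, 1115, 1145, 510, 380)  # cigarettes per capita, 1930
y <- c(250, 350, 465, 150, 165)    # lung cancer deaths per million, 1950
n <- length(x)

# Mean as the sum of the observations divided by their number
x_bar <- sum(x) / n
y_bar <- sum(y) / n

print(c(x_bar, y_bar))  # 736 276
```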
2. Calculating the standard deviation:
#     $(X_i - \bar{X})^2$    $(Y_i - \bar{Y})^2$
1     42436                  676
2     143641                 5476
3     167281                 35721
4     51076                  15876
5     126736                 12321
s^2   132792.5               17517.5
s     364.4071               132.3537
Table #2. Squared deviations, unbiased estimate of the variance (s^2, denominator n - 1), and standard deviation (s)
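
The columns of Table #2 can be reproduced roughly as follows. This is a sketch under the assumption that the variance uses the n - 1 denominator, which is what the tabulated values imply. All variable names are illustrative.

```r
x <- c(530, 1115, 1145, 510, 380)  # cigarettes per capita, 1930
y <- c(250, 350, 465, 150, 165)    # lung cancer deaths per million, 1950
n <- length(x)

# Squared deviations from the mean (the rows of Table #2)
dev2_x <- (x - mean(x))^2
dev2_y <- (y - mean(y))^2

# Unbiased variance estimate: sum of squared deviations over n - 1
var_x <- sum(dev2_x) / (n - 1)
var_y <- sum(dev2_y) / (n - 1)

# Standard deviation is the square root of the variance
s_x <- sqrt(var_x)
s_y <- sqrt(var_y)

print(data.frame(dev2_x, dev2_y))
print(c(var_x = var_x, var_y = var_y, s_x = s_x, s_y = s_y))
```

The results should agree with the built-in var() and sd(), which use the same n - 1 denominator.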
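3. Calculating the correlation coefficient:

$$ \operatorname{cov}(X, Y) = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = 44673.75 $$

$$ r(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sqrt{s_X^2 \, s_Y^2}} = 0.926253 $$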
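
As a cross-check of this step, the covariance and correlation can be computed from their definitions. The sketch assumes the n - 1 denominator for the covariance, consistent with the value 44673.75. Variable names are illustrative.

```r
x <- c(530, 1115, 1145, 510, 380)  # cigarettes per capita, 1930
y <- c(250, 350, 465, 150, 165)    # lung cancer deaths per million, 1950
n <- length(x)

# Cross products of the deviations from the means
cross <- (x - mean(x)) * (y - mean(y))

# Sample covariance with the n - 1 denominator
cov_xy <- sum(cross) / (n - 1)

# Correlation: covariance scaled by both standard deviations
r_xy <- cov_xy / (sd(x) * sd(y))

print(c(cov_xy = cov_xy, r_xy = r_xy))
```

Both values should match the built-in cov(x, y) and cor(x, y).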

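4. Computing the estimated slope coefficient and intercept term:

$$ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{178695}{531170} = 0.3364 $$

$$ \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 276 - 0.336418 \times 736 = 28.3966 $$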
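
The slope and intercept follow directly from the two sums in the fraction. A minimal sketch, with illustrative names sxy, sxx, b1, and b0 standing in for the sums and the two estimates:

```r
x <- c(530, 1115, 1145, 510, 380)  # cigarettes per capita, 1930
y <- c(250, 350, 465, 150, 165)    # lung cancer deaths per million, 1950

# Sum of cross products and sum of squared deviations of x
sxy <- sum((x - mean(x)) * (y - mean(y)))  # 178695
sxx <- sum((x - mean(x))^2)                # 531170

# OLS slope and intercept
b1 <- sxy / sxx
b0 <- mean(y) - b1 * mean(x)

print(round(c(intercept = b0, slope = b1), 4))
```

Because the intercept is computed from the unrounded slope, it can differ in the last digit from a calculation that uses 0.336418.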



5. Computing the predicted values and residuals for each country:
#   Predicted value $\hat{Y}_i$   Residual $Y_i - \hat{Y}_i$
1   206.6979                      43.30205
2   403.5023                      -53.5023
3   413.5948                      51.40515
4   199.9696                      -49.9696
5   156.2353                      8.764708
Table #3. Predicted values and residuals for each country
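
Table #3 can be reproduced by plugging each X into the fitted line and subtracting the result from the observed Y. The sketch below keeps the country order of Table #1. The names y_hat and res are illustrative.

```r
x <- c(530, 1115, 1145, 510, 380)  # cigarettes per capita, 1930
y <- c(250, 350, 465, 150, 165)    # lung cancer deaths per million, 1950

# OLS slope and intercept from the deviation sums
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Predicted values on the fitted line and residuals (observed minus predicted)
y_hat <- b0 + b1 * x
res <- y - y_hat

print(round(data.frame(predicted = y_hat, residual = res), 4))

# With an intercept in the model, OLS residuals sum to zero (up to rounding)
print(sum(res))
```

The residual sum is a quick sanity check on the hand calculation: a value noticeably different from zero would point to an arithmetic slip.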


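Part 2. Calculation in R
Code:
# Data sorted by x, so the observation order differs from Table #1
x <- c(380, 510, 530, 1115, 1145)  # cigarettes per capita in 1930
y <- c(165, 150, 250, 350, 465)    # lung cancer deaths per million in 1950
# Means and sample standard deviations
mean(x)
mean(y)
sd(x)
sd(y)
# Pearson correlation coefficient
cor(x, y)
# OLS regression of y on x
reg <- lm(y ~ x)
print(reg)
# Predicted values and residuals
fitted(reg)
residuals(reg)
# Scatter plot with the fitted regression line
plot(x, y)
abline(reg)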
Output:
> x <- c(380, 510, 530, 1115, 1145)
> y <- c(165, 150, 250, 350, 465)
> mean(x)
[1] 736
> mean(y)
[1] 276
> sd(x)
[1] 364.4071
> sd(y)
[1] 132.3537
> cor(x, y)
[1] 0.9262529
> reg <- lm(y ~ x)
> print(reg)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
    28.3966       0.3364
> fitted(reg)
       1        2        3        4        5
156.2353 199.9696 206.6979 403.5023 413.5948
> residuals(reg)
         1          2          3          4          5
  8.764708 -49.969595  43.302050 -53.502316  51.405153


Part 3. Visualization


Picture 1. Visualization of results with Excel

Picture 2. Visualization with the plot() command in R

Summary
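OLS regression on the sample shows a strong positive linear relationship (r = 0.926) between
per capita cigarette consumption in 1930 and the death rate from lung cancer in 1950. The
relationship can be described by the equation:

$$ \hat{Y} = 28.3966 + 0.3364 X $$

In this sample, one additional cigarette per capita in 1930 is associated with about 0.34
additional lung cancer deaths per million people in 1950.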
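
One way to use the fitted equation is to evaluate it at chosen values of X. The sketch below assumes the same lm() fit as in Part 2. The consumption level of 1000 is a hypothetical input chosen for illustration, not a value from the dataset.

```r
x <- c(380, 510, 530, 1115, 1145)  # cigarettes per capita, 1930
y <- c(165, 150, 250, 350, 465)    # lung cancer deaths per million, 1950
reg <- lm(y ~ x)

# Evaluate the fitted line at the sample mean of x and at a
# hypothetical consumption level of 1000 (not in the dataset)
new_x <- data.frame(x = c(mean(x), 1000))
pred <- predict(reg, newdata = new_x)
print(pred)

# The OLS line passes through the point of means, so the first
# prediction equals mean(y)
print(all.equal(unname(pred[1]), mean(y)))
```

The check at the mean is a property of any OLS fit with an intercept, so it is a convenient test that the coefficients were computed consistently.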


[Chart: scatter plot titled "Linear regression (OLS)", X on the horizontal axis (0 to 1400) and Y on the vertical axis (0 to 500), with labeled points Switzerland (530; 250), Finland (1115; 350), Great Britain (1145; 465), Canada (510; 150), Denmark (380; 165).]
