Predicting Diamond Price using Linear Model

Sarajit Poddar
26 July 2015

Contents

1 Executive Summary
  1.1 Objective
  1.2 About the data
2 Exploratory Analysis
  2.1 Loading relevant libraries
  2.2 Subsetting the dataset
  2.3 Plotting the characteristics of dataset
3 Predicting the diamond price
  3.1 Determining the Significant Predictors of Diamond price
  3.2 Exploring the predictors using box plot
  3.3 Generating the Model
  3.4 Analysing the variance between multiple models
  3.5 Analysing the Residuals
  3.6 Predicting using the fitted model
  3.7 Plotting the predicted data with actual data
4 Final conclusion

1 Executive Summary

1.1 Objective

Using the diamonds dataset from the ggplot2 library, we determine the predictors of diamond prices. We use a
linear model to predict the prices, compare the predicted prices with the actuals, and observe the residuals.
The accuracy of the fit can then be checked against other predictive models built with machine learning
algorithms. The machine learning algorithms will be explored in subsequent articles.

1.2 About the data

1.2.1 Description

Prices of 50,000 round cut diamonds

A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables
are as follows:

1.2.2 Details

price. price in US dollars ($326-$18,823)
carat. weight of the diamond (0.2-5.01)
cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color. diamond colour, from J (worst) to D (best)
clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x. length in mm (0-10.74)
y. width in mm (0-58.9)
z. depth in mm (0-31.8)
depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table. width of top of diamond relative to widest point (43-95)
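
As a quick check of the depth formula, the sketch below applies it to the first diamond in the dataset (x = 3.95, y = 3.98, z = 2.43, recorded depth 61.5). The factor of 100 is an assumption made here to express the ratio as a percentage; the variable names are illustrative.

```r
# Dimensions of the first diamond in the dataset, in mm
x <- 3.95
y <- 3.98
z <- 2.43

# Total depth percentage: z divided by the mean of x and y, times 100
depth_pct <- 100 * 2 * z / (x + y)
round(depth_pct, 1)   # 61.3, close to the recorded depth of 61.5
```

The small gap to the recorded value is plausible given that x, y and z are themselves rounded to two decimals.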

2 Exploratory Analysis

2.1 Loading relevant libraries

# Load required libraries
library(dplyr)
library(tidyr)
library(ggplot2)

2.2 Subsetting the dataset

The dataset is subset to a smaller size because the full dataset is large.

# Load the diamonds dataset
data(diamonds)
# Bin the continuous variables and store the bin index as a number
# Price: bins of width 1000
diamonds$price2 <- as.numeric(cut(diamonds$price,
                                  seq(from = 0, to = 20000, by = 1000)))
# Carat: bins of width 0.1
diamonds$carat2 <- as.numeric(cut(diamonds$carat,
                                  seq(from = 0, to = 6, by = 0.1)))
# Summary of diamonds dataset
summary(diamonds)
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
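
The price2 and carat2 columns in this summary are bin indices, not prices or weights. A minimal sketch of how they arise: cut() assigns each value to a right-closed interval, and as.numeric() returns the position of that interval. The input values are taken from the summary above; the vector names are illustrative.

```r
# Minimum, median and maximum price from the summary
price <- c(326, 2401, 18823)
price_bin <- as.numeric(cut(price, seq(from = 0, to = 20000, by = 1000)))
price_bin   # 1 3 19: (0,1000], (2000,3000], (18000,19000]

# Three carat values, including the maximum of 5.01
carat <- c(0.23, 0.31, 5.01)
carat_bin <- as.numeric(cut(carat, seq(from = 0, to = 6, by = 0.1)))
carat_bin   # 3 4 51: (0.2,0.3], (0.3,0.4], (5,5.1]
```

This matches the summary, where price2 runs from 1 to 19 and carat2 reaches 51.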

# Structure of the diamond dataset
str(diamonds)
## 'data.frame':    53940 obs. of  12 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ price2 : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ carat2 : num  3 3 3 3 4 3 3 3 3 3 ...
# Let's say that the input price range is 1000 to 5000 and
# the number of observations is 5000
input.pricerange.low <- 1000
input.pricerange.high <- 5000
input.obs <- 5000
# Subsetting the data based on the price range
data.sample <- subset(diamonds,
                      price >= input.pricerange.low &
                      price <= input.pricerange.high)
# Sampling the data from the subset
data.sample <- data.sample[sample(1:nrow(data.sample), input.obs,
                                  replace=FALSE),]
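
Because sample() draws at random, every run of the code above produces a different data.sample, and the figures and model coefficients change with it. A minimal sketch of making the draw repeatable with set.seed() follows. The seed value 2015 and the names in_range, s1 and s2 are assumptions for illustration and are not part of the original analysis.

```r
library(ggplot2)   # provides the diamonds dataset
data(diamonds)

# Keep only diamonds in the chosen price range
in_range <- subset(diamonds, price >= 1000 & price <= 5000)

# Fixing the seed before sampling makes the draw repeatable
set.seed(2015)   # hypothetical seed value
s1 <- in_range[sample(nrow(in_range), 5000, replace = FALSE), ]

set.seed(2015)
s2 <- in_range[sample(nrow(in_range), 5000, replace = FALSE), ]

identical(s1, s2)   # the two draws contain exactly the same rows
```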

2.3 Plotting the characteristics of dataset

2.3.1 Plotting using base graphics

#--------------------------------
# Plotting with Base graphics
#--------------------------------
x <- data.sample$price
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Diamond Price",
               main="Frequency Distribution of Diamond Price")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
# Add legend
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
       lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))

[Figure: Frequency Distribution of Diamond Price. Histogram of Diamond Price (x axis, 1000 to 5000) against Frequency (y axis), with the mean, the density curve and the normal curve overlaid.]
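
The rescaling in the base graphics code relies on the relation between a histogram's density and its counts: for equal-width bins, count = density * bin width * n. A minimal sketch on simulated prices checks this relation. The simulated vector and the variable names are illustrative and are not the diamonds sample.

```r
set.seed(1)
# Simulated stand-in for a price vector
x <- rnorm(1000, mean = 3000, sd = 800)

# Compute the histogram without drawing it
h <- hist(x, breaks = 10, plot = FALSE)

# Width of one bin, taken from the distance between bin midpoints
binwidth <- diff(h$mids[1:2])

# Density rescaled to the count scale
rescaled <- h$density * binwidth * length(x)
all.equal(rescaled, as.numeric(h$counts))
```

The same factor, bin width times the number of observations, is what puts the dnorm() curve on the frequency scale of the histogram.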
2.3.2 Plotting using ggplot

g <- ggplot(data.sample, aes(x=price))
g <- g + geom_histogram(aes(y = ..density..), fill="dark grey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
# Normal curve with the same mean and standard deviation as the sample
g <- g + stat_function(fun = dnorm, colour = "red",
                       args = list(mean = mean(data.sample$price),
                                   sd = sd(data.sample$price)))
g <- g + xlab("Diamond price")
g <- g + ylab("Frequency")
g <- g + ggtitle("Frequency Distribution of Diamond Price")
g

[Figure: Frequency Distribution of Diamond Price. Density-scaled histogram of Diamond price (x axis, 1000 to 5000; y axis from 0 to 5e-04), with the density estimate and the normal curve overlaid.]
2.3.3 Diamond price distribution with regards to Cut

g <- ggplot(data.sample)
# Using the cut to show the differences in the price due to the
# quality of the cut. geom_histogram bins the continuous price;
# older ggplot2 versions did this implicitly inside geom_bar.
g <- g + geom_histogram(aes(x=price, fill=cut))
g <- g + xlab("Price of Diamonds")
g <- g + ylab("Number of Diamonds")
g <- g + ggtitle("Prices of Sampled Diamonds")
g <- g + theme(legend.position="bottom")
g

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

[Figure: Prices of Sampled Diamonds. Stacked histogram of Price of Diamonds (x axis, 1000 to 5000) against Number of Diamonds (y axis, 0 to 300), filled by cut (Fair, Good, Very Good, Premium, Ideal).]

2.3.4 Regression line showing the impact of Carat on the price (Using lm)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: jittered scatter plot of price against carat, points coloured by clarity, with a red linear regression line.]

2.3.5 Regression line showing the impact of Carat on the price (Using Loess)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity))
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: scatter plot of price against carat, points coloured by clarity, with a blue loess curve.]

2.3.6 Regression line faceted by Colour and Cut

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + facet_grid(color~cut)
g <- g + geom_smooth(method=lm, col="salmon", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: price against carat, faceted by color (rows D to J) and cut (columns Fair to Ideal), points coloured by clarity, with a linear regression line in each panel.]

2.3.7 Correlation plot between all variables

library(corrplot)
# Convert all fields of the sampled dataset to numeric
diamonds.num <- data.sample
diamonds.num[, 1:12] <- sapply(diamonds.num[, 1:12], as.numeric)
# Remove price and carat and retain price2 and carat2
diamonds.num <- select(diamonds.num, cut:table, x:carat2)
M <- cor(diamonds.num)
corrplot.mixed(M)

[Figure: mixed correlation plot (corrplot.mixed) of cut, color, clarity, depth, table, x, y, z, price2 and carat2.]
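
The correlations with price can also be read numerically rather than from the plot. The sketch below is a simplified variant of the code above: it assumes the full diamonds dataset and the raw price column instead of the 5000-row sample and the binned price2, so its numbers will differ from the figure. The names num and price_cor are illustrative.

```r
library(ggplot2)   # provides the diamonds dataset
data(diamonds)

# Convert every column to numeric (ordered factors become their level codes)
num <- as.data.frame(lapply(diamonds, as.numeric))
M <- cor(num)

# Absolute correlation of each variable with price, strongest first
price_cor <- sort(abs(M["price", colnames(M) != "price"]), decreasing = TRUE)
round(price_cor, 2)
```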
From the correlation matrix, the variables most highly correlated with price are carat, x, y and z. The
other variables show little correlation with price. However, x, y and z are also highly correlated with
carat. This can also mean that taking carat2 as the

2.3.8 Exploratory plot

# Loading required libraries


library(ggplot2)
library(GGally)
library(scales)
# Sampling the data for the plot generation
diasamp <- diamonds[sample(1:length(diamonds$price), 500),]
# Generating the plot
ggpairs(diasamp, params = c(shape = I('.'), outlier.shape = I('.')))

3 Predicting the diamond price

3.1 Determining the Significant Predictors of Diamond price

model.data <- subset(data.sample, select = -c(price2, carat2))
full.model <- lm(price ~ ., data = model.data)
reduced.model <- step(full.model, direction="backward", k=2, trace=0)
summary(reduced.model)
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
##     table + x + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2307.76  -186.11   -18.29   179.54  1564.51
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared:  0.927,  Adjusted R-squared:  0.9267
## F-statistic:  2873 on 22 and 4977 DF,  p-value: < 2.2e-16

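We observe that the backward selection retains carat, cut, color, clarity, depth, table, x and z. Of these,
carat, cut, color and clarity have by far the strongest impact on the diamond price, while depth, table, x
and z contribute only marginally.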
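
In step(), the argument k is the penalty per model parameter: k = 2 corresponds to AIC, and k = log(n) is the penalty usually referred to as BIC. The sketch below compares the two on a fresh sample. The seed and the object names d, full, aic.model and bic.model are assumptions for illustration, and the selected terms will generally differ from the run shown above.

```r
library(ggplot2)   # provides the diamonds dataset
data(diamonds)

set.seed(1)   # hypothetical seed, only to make this sketch repeatable
d <- subset(diamonds, price >= 1000 & price <= 5000)
d <- d[sample(nrow(d), 5000), ]

full <- lm(price ~ ., data = d)

# Backward elimination with the AIC penalty (k = 2)
aic.model <- step(full, direction = "backward", k = 2, trace = 0)
# Backward elimination with the BIC penalty (k = log(n))
bic.model <- step(full, direction = "backward", k = log(nrow(d)), trace = 0)

formula(aic.model)
formula(bic.model)
```

The heavier BIC penalty tends to drop weakly significant terms that AIC keeps.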

3.2 Exploring the predictors using box plot

#--------------------------------
# Exploring the predictors using box plot
#--------------------------------
# Exploring association of Cut with Carat and Price
ggplot(data.sample, aes(factor(carat2), price)) +
  geom_boxplot(aes(fill = cut)) + xlab("Carat") +
  theme(legend.position="bottom")

[Figure: box plots of price (y axis, 1000 to 5000) for each carat bin (x axis, labelled Carat), filled by cut.]

# Exploring association of Clarity with Carat and Price


ggplot(data.sample, aes(factor(carat2), price)) +
geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
theme(legend.position="bottom")

[Figure: box plots of price (y axis, 1000 to 5000) for each carat bin (x axis, labelled Carat), filled by clarity.]

# Exploring association of Color with Carat and Price


ggplot(data.sample, aes(factor(carat2), price)) +
geom_boxplot(aes(fill = color)) + xlab("Carat") +
theme(legend.position="bottom")

[Figure: box plots of price (y axis, 1000 to 5000) for each carat bin (x axis, labelled Carat), filled by color.]

3.3 Generating the Model

# The Starting and Suggested Model
simple.model <- lm(price ~ carat, data = model.data)
fitted.model <- lm(price ~ carat + cut + clarity + color + table + y + z,
data = model.data)

3.4 Analysing the variance between multiple models

# Summary of the simple model and fitted model
summary(simple.model)
##
## Call:
## lm(formula = price ~ carat, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3138.17  -307.81   -14.44   299.14  2393.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -728.13      24.62  -29.57   <2e-16 ***
## carat        4694.20      32.72  143.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 529.1 on 4998 degrees of freedom
## Multiple R-squared:  0.8047, Adjusted R-squared:  0.8046
## F-statistic: 2.059e+04 on 1 and 4998 DF,  p-value: < 2.2e-16

summary(fitted.model)
##
## Call:
## lm(formula = price ~ carat + cut + clarity + color + table +
##     y + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2340.21  -184.12   -17.18   178.61  1571.15
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2253.841    180.766 -12.468  < 2e-16 ***
## carat        6154.635     66.980  91.888  < 2e-16 ***
## cut.L         316.192     17.328  18.247  < 2e-16 ***
## cut.Q        -126.597     14.558  -8.696  < 2e-16 ***
## cut.C         123.818     13.368   9.262  < 2e-16 ***
## cut^4          25.591     11.421   2.241  0.02509 *
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9265
## F-statistic:  3001 on 21 and 4978 DF,  p-value: < 2.2e-16

# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
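
The F statistic in this table can be reproduced by hand from the two residual sums of squares: it is the reduction in RSS per added parameter, divided by the residual variance of the larger model. A small sketch using only the numbers printed above (the variable names are illustrative):

```r
# Residual sum of squares and residual degrees of freedom from the table
rss1 <- 1399334980; df1 <- 4998   # Model 1: price ~ carat
rss2 <- 524428896;  df2 <- 4978   # Model 2: the fitted model

# F = ((RSS1 - RSS2) / (df1 - df2)) / (RSS2 / df2)
f.stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)
round(f.stat, 2)   # 415.24, as reported by anova()
```

The 20 additional parameters reduce the residual sum of squares by 874906084, which is why the larger model is strongly preferred.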

3.5 Analysing the Residuals

3.5.1 Checking unidentified patterns in the Residuals

The graph shows that the gap between the actual and the predicted price grows as the price of the diamond
increases. A factor that raises the price in the higher price range may not be captured in the model. Hence
the variance of the price cannot be adequately explained by the model with the available predictors.

# x holds the actual price, so the x axis is labelled accordingly
x <- model.data$price
y <- resid(fitted.model)
ggplot(data.frame(x, y), aes(x,y)) +
  geom_hline(yintercept=0, size=1) +
  geom_point(size=3, colour="black", alpha = 0.1) +
  geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Actual price") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)

[Figure: residuals of fitted.model (y axis, roughly -2000 to 1000) plotted against diamond price (x axis, 1000 to 5000), with a red loess curve.]
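
One caveat when reading this plot: the x axis carries the actual price (model.data$price), not the fitted value. In any least-squares fit the residuals are uncorrelated with the fitted values but positively correlated with the observed response, so some upward trend against the actual price is expected even from a well-specified model. The sketch below illustrates this on simulated data; the data and variable names are assumptions for illustration and are not the diamonds sample.

```r
set.seed(42)
# Simulated data from a correctly specified linear model
x <- runif(200)
y <- 2 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)

# Residuals against fitted values: no linear trend by construction
cor.fitted <- cor(resid(fit), fitted(fit))
# Residuals against the observed response: always positive
cor.actual <- cor(resid(fit), y)

c(fitted = cor.fitted, actual = cor.actual)
# The second correlation equals sqrt(1 - R^2)
sqrt(1 - summary(fit)$r.squared)
```

Plotting resid(fitted.model) against fitted(fitted.model) would therefore be the more usual way to look for unexplained patterns.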
3.5.2 Density plot of residuals to check Normal Distribution

The graph shows that the residuals follow an approximately normal pattern.

x <- residuals(fitted.model)
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
               xlab="Residuals",
               main="Frequency Distribution of residuals")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve
multiplier <- myhist$counts / myhist$density
mydensity <- density(x)
mydensity$y <- mydensity$y * multiplier[1]
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)

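[Figure: Frequency Distribution of residuals. Histogram of Residuals (x axis, roughly -2000 to 2000) against Frequency (y axis, 0 to 2000), with the mean, the density curve and the normal curve overlaid.]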
3.6 Predicting using the fitted model

The formula for prediction is

Diamond price = -4253.844 + 4920.324 * carat + xx * cut + yy * clarity + zz * color + 1.462 * table +
376.099 * y + 275.481 * z

Note: The values of xx, yy and zz depend on the level of the respective factor variable.

Coefficients:
(Intercept)      carat      cut.L      cut.Q      cut.C      cut^4  clarity.L  clarity.Q
  -4253.844   4920.324    261.542    -83.890     66.899     37.187   2011.021   -749.420
  clarity.C  clarity^4  clarity^5  clarity^6  clarity^7    color.L    color.Q    color.C
    477.879   -296.209     77.985    -30.944     27.387   -944.980   -226.105    -80.026
    color^4    color^5    color^6      table          y          z
     18.037     12.063     30.269      1.462    376.099    275.481
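
Because cut, clarity and color are ordered factors, lm() encodes them by default with orthogonal polynomial contrasts, which is where the .L, .Q, .C and ^4 terms come from. The contribution of a factor level to the predicted price is the contrast row for that level multiplied by the matching coefficients. A minimal sketch for cut, using the four cut coefficients listed above (the object names are illustrative):

```r
# Coefficients for cut.L, cut.Q, cut.C and cut^4 as listed above
cut.coefs  <- c(261.542, -83.890, 66.899, 37.187)
cut.levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Polynomial contrast matrix for a five-level ordered factor:
# one row per level, one column per polynomial term
P <- contr.poly(length(cut.levels))

# Price contribution of each cut level, in US dollars
cut.effect <- as.vector(P %*% cut.coefs)
names(cut.effect) <- cut.levels
round(cut.effect, 1)
```

The contributions sum to zero across the five levels, so they are best read as offsets around the average cut rather than relative to a baseline level.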
# Join the predicted and model data for comparison
pred.data <- model.data
pred.data <- select(pred.data, cut:z, carat)
pred <- predict(fitted.model, pred.data)
pred <- data.frame(model.data, pred)
# Round the predicted data
pred$pred <- round(pred$pred, 0)
# Determining RMSE (Root Mean Squared Error) to assess fit
model.rmse <- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 323.8607
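
The RMSE is closely related to the residual standard error reported by summary(fitted.model): both start from the residual sum of squares (RSS), but the RMSE divides by the number of observations while the residual standard error divides by the residual degrees of freedom. A short check using the RSS from the analysis of variance table (the variable names are illustrative):

```r
rss   <- 524428896   # residual sum of squares of the fitted model
n     <- 5000        # number of observations in the sample
df.re <- 4978        # residual degrees of freedom

rmse <- sqrt(rss / n)       # root mean squared error
rse  <- sqrt(rss / df.re)   # residual standard error

round(rmse, 4)   # 323.8607, matching model.rmse
round(rse, 1)    # 324.6, matching the summary output
```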

3.7 Plotting the predicted data with actual data

Here we see that the prediction is most accurate within the price range of USD 1000 to USD 5000. Outside
this price range, the prediction is not accurate. Perhaps a different prediction model should be created for
data that fall outside the range.
For the price range below USD 1000, the predicted price is lower than the actual price. Similarly, for the
price range above USD 4500, the predicted price is higher than the actual price.

g <- ggplot(pred, aes(y = price, x = pred))
g <- g + geom_point(size=3, colour="black", alpha = 0.1)
g <- g + geom_point(size=2, colour="salmon", alpha = 0.2)
# g <- g + geom_point()
g <- g + ylab("Actual Price")
g <- g + xlab("Predicted Price")
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g

[Figure: Actual Price (y axis, 0 to 6000) against Predicted Price (x axis, 0 to 6000), with a blue loess curve and a red linear regression line.]

par(mfrow=c(2, 2))
plot(fitted.model)

[Figure: the four lm diagnostic plots for fitted.model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance contours).]

The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.

4 Final conclusion

We have seen that, using a linear model, a good predictive model can be developed, provided that the variables
(predictors) which significantly impact the outcome (price in this case) can be accurately identified.
We also observe that the prediction may work only within some boundary conditions. If the boundary conditions
are accurately identified, then different models can be built for predicting data outside the range of the fitted model.
