
# Predicting Diamond Price using Linear Model

Sarajit Poddar
26 July 2015

Contents

1 Executive Summary
1.1 Objective
1.2 About the data
2 Exploratory Analysis
2.3 Plotting the characteristics of dataset
3 Predicting the diamond price
3.1 Determining the Significant Predictors of Diamond price
3.5 Analysing the Residuals
3.7 Plotting the predicted data with actual data
4 Final conclusion

## 1 Executive Summary

### 1.1 Objective

Using the diamonds dataset from the ggplot2 library, determine the predictors of diamond prices. Use a
linear model to predict the prices, compare the predicted prices with the actual prices, and observe the residuals.
Check the accuracy of the fit against other predictive models built with machine learning algorithms. The
machine learning algorithms will be explored in subsequent articles.

### 1.2 About the data

#### 1.2.1 Description

Prices of 50,000 round cut diamonds

A dataset containing the prices and other attributes of almost 54,000 diamonds. The variables
are as follows:

#### 1.2.2 Details

price. price in US dollars (\$326-\$18,823)

carat. weight of the diamond (0.2-5.01)
cut. quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color. diamond colour, from J (worst) to D (best)
clarity. a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x. length in mm (0-10.74)
y. width in mm (0-58.9)
z. depth in mm (0-31.8)
depth. total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43-79)
table. width of top of diamond relative to widest point (43-95)
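
The depth definition above can be checked by hand. The following is a minimal sketch, assuming the formula is multiplied by 100 to give a percentage (the stated range of 43-79 suggests this, but the factor is an assumption). The x, y, z and depth values are copied from the first three rows of the `str(diamonds)` output later in this report; the variable names are illustrative only.

```r
# Values copied from the first three rows of the str(diamonds) output
x <- c(3.95, 3.89, 4.05)   # length in mm
y <- c(3.98, 3.84, 4.07)   # width in mm
z <- c(2.43, 2.31, 2.31)   # depth in mm
depth_recorded <- c(61.5, 59.8, 56.9)

# Total depth percentage: z relative to the mean of x and y, times 100 (assumed)
depth_computed <- 100 * 2 * z / (x + y)

print(round(depth_computed, 1))
print(round(depth_computed - depth_recorded, 2))
```

The computed values agree with the recorded depth to within a few tenths of a percentage point, which is consistent with depth having been measured or rounded independently of x, y and z.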

## 2 Exploratory Analysis

### 2.1

library(dplyr)
library(tidyr)
library(ggplot2)

### 2.2

The dataset is subset to a smaller size because the full dataset is huge.

# Load the diamonds dataset
data(diamonds)
# Bin the continuous variables into numeric interval indices
# Price: intervals of 1000
diamonds\$price2 <- as.numeric(cut(diamonds\$price,
  seq(from = 0, to = 20000, by = 1000)))
# Carat: intervals of 0.1
diamonds\$carat2 <- as.numeric(cut(diamonds\$carat,
  seq(from = 0, to = 6, by = 0.1)))
# Summary of diamonds dataset
summary(diamonds)
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##        y                z              price2           carat2
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 2.000
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 1.000   1st Qu.: 4.000
##  Median : 5.710   Median : 3.530   Median : 3.000   Median : 7.000
##  Mean   : 5.735   Mean   : 3.539   Mean   : 4.398   Mean   : 8.468
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.: 6.000   3rd Qu.:11.000
##  Max.   :58.900   Max.   :31.800   Max.   :19.000   Max.   :51.000
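
To make the `price2` and `carat2` columns easier to read, here is a small sketch of what `as.numeric(cut(...))` returns. It uses base R only; the example values are picked from the ranges reported in the summary, and the variable names are illustrative, not part of the original analysis.

```r
# Same break points as used for price2 and carat2
price_breaks <- seq(from = 0, to = 20000, by = 1000)
carat_breaks <- seq(from = 0, to = 6, by = 0.1)

# Example values taken from the ranges reported in the summary
prices <- c(326, 1000, 1001, 18823)
carats <- c(0.23, 5.01)

# cut() assigns each value to a right-closed interval (a, b];
# as.numeric() turns the resulting factor into the index of that interval
price_bin <- as.numeric(cut(prices, price_breaks))
carat_bin <- as.numeric(cut(carats, carat_breaks))

print(price_bin)
print(carat_bin)
```

A price of exactly 1000 still falls in the first bin because the intervals are closed on the right. The maximum price 18823 lands in bin 19 and the maximum carat 5.01 in bin 51, which matches the maxima of `price2` and `carat2` in the summary.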

# Structure of the diamonds dataset
str(diamonds)
## 'data.frame': 53940 obs. of 12 variables:
## \$ carat  : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## \$ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## \$ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## \$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## \$ depth  : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## \$ table  : num 55 61 65 58 58 57 57 55 61 61 ...
## \$ price  : int 326 326 327 334 335 336 336 337 337 338 ...
## \$ x      : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## \$ y      : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## \$ z      : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## \$ price2 : num 1 1 1 1 1 1 1 1 1 1 ...
## \$ carat2 : num 3 3 3 3 4 3 3 3 3 3 ...
# Let's say that the input price range is 1000 to 5000 and
# the number of observations is 5000
input.pricerange.low <- 1000
input.pricerange.high <- 5000
input.obs <- 5000
# Subsetting the data based on the price range
data.sample <- subset(diamonds,
  price >= input.pricerange.low &
  price <= input.pricerange.high)
# Sampling the data from the subset
data.sample <- data.sample[sample(1:nrow(data.sample), input.obs,
  replace=FALSE),]
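
Because `sample()` draws at random, every run of the report selects different diamonds and produces slightly different coefficients. The sketch below shows one way to make such a draw repeatable. It runs on a hypothetical stand-in data frame rather than on `diamonds`; the seed value, the `toy` data and the guard on the sample size are assumptions added for illustration, not part of the original analysis.

```r
# Hypothetical stand-in for the diamonds data: an id and a price column
set.seed(42)  # assumed seed; any fixed value makes the draw repeatable
toy <- data.frame(id = 1:1000, price = round(runif(1000, 300, 19000)))

price_low  <- 1000
price_high <- 5000
n_obs      <- 50

# Keep only the rows inside the price range
in_range <- subset(toy, price >= price_low & price <= price_high)

# Guard against asking for more rows than the subset holds
n_take <- min(n_obs, nrow(in_range))

# Draw the rows without replacement
toy_sample <- in_range[sample(nrow(in_range), n_take, replace = FALSE), ]

print(nrow(toy_sample))
print(range(toy_sample$price))
```

Calling `set.seed()` once before the sampling step is enough to make the later model summaries reproducible across runs.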

### 2.3 Plotting the characteristics of dataset

#### 2.3.1 Plotting using base graphics

#-------------------------------
# Plotting with Base graphics
#-------------------------------
x <- data.sample\$price
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
xlab="Diamond Price",
main="Frequency Distribution of Diamond Price")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve
# Scale the density to the count scale: count = density * n * bin width
multiplier <- length(x) * diff(myhist\$mids[1:2])
mydensity <- density(x)
mydensity\$y <- mydensity\$y * multiplier
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and Standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist\$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)
legend('topright', c("Mean", "Density Curve", "Normal Curve"),
lty=c(1,1,1), lwd=c(2,2,2), col = c("darkgreen", "blue", "red"))

[Figure: histogram "Frequency Distribution of Diamond Price" (x-axis: Diamond Price; y-axis: Frequency), with the mean line, density curve and normal curve.]
#### 2.3.2 Plotting using ggplot

g <- ggplot(data.sample, aes(x=price))
g <- g + geom_histogram(aes(y = ..density..), fill="dark grey")
g <- g + geom_density(alpha=.3, fill="#FF6666")
g <- g + stat_function(fun = dnorm, colour = "red",
  args = list(mean = mean(data.sample\$price),
  sd=sd(data.sample\$price)))
g <- g + xlab("Diamond price")
g <- g + ylab("Frequency")
g <- g + ggtitle("Frequency Distribution of Diamond Price")
g

[Figure: ggplot histogram "Frequency Distribution of Diamond Price" on the density scale (x-axis: Diamond price; y-axis: Frequency), with the density curve and normal curve.]
#### 2.3.3 Diamond price distribution with regard to Cut

g <- ggplot(data.sample)
# Using the cut as the fill to show the differences in the price due to the
# quality of the cut
g <- g + geom_bar(aes(x=price, fill= cut))
g <- g + xlab("Price of Diamonds")
g <- g + ylab("Number of Diamonds")
g <- g + ggtitle("Prices of Sampled Diamonds")
g <- g + theme(legend.position="bottom")
g

[Figure: stacked bar chart "Prices of Sampled Diamonds" (x-axis: Price of Diamonds; y-axis: Number of Diamonds), filled by cut.]

#### 2.3.4 Regression line showing the impact of Carat on the price (Using lm)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: scatter plot of price against carat, coloured by clarity, with a linear (lm) regression line.]

#### 2.3.5 Regression line showing the impact of Carat on the price (Using Loess)

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity))
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: scatter plot of price against carat, coloured by clarity, with a loess smoothing curve.]

#### 2.3.6 Regression line faceted by Colour and Cut

g <- ggplot(data.sample, aes(y = price, x = carat))
g <- g + geom_point(aes(color=clarity), position="jitter")
g <- g + facet_grid(color~cut)
g <- g + geom_smooth(method=lm, col="salmon", lwd=1)
g <- g + theme(legend.position="bottom")
g

[Figure: price against carat with linear regression lines, faceted by colour (rows D to J) and cut (columns), points coloured by clarity.]

#### 2.3.7 Correlation plot between all variables

library(corrplot)
# Convert Diamonds dataset all fields to numeric
diamonds.num <- data.sample
diamonds.num[, 1:12] <- sapply(diamonds.num[, 1:12], as.numeric)
# Remove price and carat and retain price2 and carat2
diamonds.num <- select(diamonds.num, cut:table, x:carat2)
M <- cor(diamonds.num)
corrplot.mixed(M)

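[Figure: mixed correlation plot of cut, color, clarity, depth, table, x, y, z, price2 and carat2. Coefficient magnitudes legible in the lower triangle: price2 with cut 0.16, color 0.15, clarity 0.35, depth 0.06, table 0.13, x 0.85, y 0.73, z 0.82; carat2 with cut 0.25, color 0.31, clarity 0.59, depth 0.1, table 0.18, x 0.97, y 0.82, z 0.93, price2 0.86.]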
From the correlation matrix, the variables that are highly correlated with the price are carat,
x, y and z. The other variables are only weakly correlated with it. However, the variables x, y and z are also highly
correlated with carat. This can also mean that taking carat2 as the ...
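
When predictors are as strongly correlated as carat and the dimensions x, y and z, their individual coefficients become unstable. One common way to quantify this is the variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The sketch below illustrates the idea on simulated data, not on the diamonds sample: the cube-root relation between carat and length, the noise levels and all variable names are assumptions chosen only to mimic the pattern described above.

```r
set.seed(1)  # assumed seed so the simulated numbers are repeatable
n <- 200

# Simulated carat and a length that grows with the cube root of the weight
carat <- runif(n, 0.3, 1.5)
x_len <- 6.5 * carat^(1/3) + rnorm(n, sd = 0.05)
toy   <- data.frame(carat, x_len)

# Pairwise correlation between the two predictors
r <- cor(toy$carat, toy$x_len)

# VIF of x_len given carat: 1 / (1 - R^2) of the auxiliary regression
r2    <- summary(lm(x_len ~ carat, data = toy))$r.squared
vif_x <- 1 / (1 - r2)

print(round(r, 3))
print(round(vif_x, 1))
```

A large VIF indicates that the predictor carries little information beyond what the other predictors already provide.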
#### 2.3.8 Exploratory plot

library(ggplot2)
library(GGally)
library(scales)
# Sampling the data for the plot generation
diasamp <- diamonds[sample(1:length(diamonds\$price), 500),]
# Generating the plot
ggpairs(diasamp, params = c(shape = I('.'), outlier.shape = I('.')))

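## 3 Predicting the diamond price

### 3.1 Determining the Significant Predictors of Diamond price

# Drop the binned helper columns before modelling
model.data <- subset(data.sample, select = -c(price2, carat2))
# Fit the full model with every remaining variable as a predictor
full.model <- lm(price ~ ., data = model.data)
# Backward stepwise selection by AIC (k = 2)
reduced.model <- step(full.model, direction="backward", k=2, trace=0)
summary(reduced.model)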
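
For readers unfamiliar with `step()`, the following sketch shows the same backward-selection call on R's built-in `mtcars` data, so it runs without the diamonds sample. The dataset and the names `full_model` and `reduced_model` are stand-ins chosen for illustration; only the `step()` arguments mirror the call above.

```r
# Full model: every other column of mtcars as a predictor of mpg
full_model <- lm(mpg ~ ., data = mtcars)

# Backward elimination: drop one term at a time while the AIC (k = 2) improves
reduced_model <- step(full_model, direction = "backward", k = 2, trace = 0)

# Terms that survive the selection
print(formula(reduced_model))

# AIC as used by step(), before and after the selection
aic_full    <- extractAIC(full_model)[2]
aic_reduced <- extractAIC(reduced_model)[2]
print(c(full = aic_full, reduced = aic_reduced))
```

Selection by AIC keeps a term whenever dropping it would raise the AIC, so a term can survive even when its individual p-value is above 0.05.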
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
##     table + x + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2307.76  -186.11   -18.29   179.54  1564.51
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1757.964    436.220  -4.030 5.66e-05 ***
## carat        5946.169    114.468  51.946  < 2e-16 ***
## cut.L         289.409     18.773  15.416  < 2e-16 ***
## cut.Q        -118.602     14.754  -8.039 1.13e-15 ***
## cut.C         122.496     13.433   9.119  < 2e-16 ***
## cut^4          27.127     11.409   2.378   0.0175 *
## color.L      -889.472     16.658 -53.397  < 2e-16 ***
## color.Q      -205.282     14.891 -13.785  < 2e-16 ***
## color.C       -54.810     14.049  -3.901 9.69e-05 ***
## color^4        31.203     13.128   2.377   0.0175 *
## color^5        54.780     12.283   4.460 8.38e-06 ***
## color^6        43.795     11.144   3.930 8.62e-05 ***
## clarity.L    1938.848     28.227  68.689  < 2e-16 ***
## clarity.Q    -726.394     24.116 -30.121  < 2e-16 ***
## clarity.C     451.869     20.706  21.823  < 2e-16 ***
## clarity^4    -248.135     17.122 -14.492  < 2e-16 ***
## clarity^5      77.947     14.643   5.323 1.06e-07 ***
## clarity^6     -17.521     13.226  -1.325   0.1853
## clarity^7      18.544     11.973   1.549   0.1215
## depth         -10.051      4.490  -2.239   0.0252 *
## table          -4.957      2.624  -1.889   0.0589 .
## x              91.671     49.098   1.867   0.0619 .
## z              97.390     44.027   2.212   0.0270 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.1 on 4977 degrees of freedom
## Multiple R-squared: 0.927, Adjusted R-squared: 0.9267
## F-statistic: 2873 on 22 and 4977 DF, p-value: < 2.2e-16

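The backward selection shown above retains carat, cut, color, clarity, depth, table, x and z. Of these, carat, cut, color
and clarity are by far the most significant, while depth, table, x and z are only marginally significant (p-values between
about 0.03 and 0.06). The suggested model in the following sections uses cut, color, clarity, table, y, z and carat.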

### 3.2

#------------------------------
# Exploring association of Cut with Carat and Price

ggplot(data.sample, aes(factor(carat2), price)) +
geom_boxplot(aes(fill = cut)) + xlab("Carat") +
theme(legend.position="bottom")

[Figure: box plots of price by carat bin (x-axis: Carat; y-axis: price), filled by cut.]

# Exploring association of Clarity with Carat and Price

ggplot(data.sample, aes(factor(carat2), price)) +
geom_boxplot(aes(fill = clarity)) + xlab("Carat") +
theme(legend.position="bottom")

[Figure: box plots of price by carat bin (x-axis: Carat; y-axis: price), filled by clarity.]

# Exploring association of Color with Carat and Price

ggplot(data.sample, aes(factor(carat2), price)) +
geom_boxplot(aes(fill = color)) + xlab("Carat") +
theme(legend.position="bottom")

[Figure: box plots of price by carat bin (x-axis: Carat; y-axis: price), filled by color.]

### 3.3

# The Starting and Suggested Model

simple.model <- lm(price ~ carat, data = model.data)
fitted.model <- lm(price ~ carat + cut + clarity + color + table + y + z,
data = model.data)

### 3.4

# Summary of the simple model and fitted model

summary(simple.model)
##
## Call:
## lm(formula = price ~ carat, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3138.17  -307.81   -14.44   299.14  2393.19
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -728.13      24.62  -29.57   <2e-16 ***
## carat        4694.20      32.72  143.48   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 529.1 on 4998 degrees of freedom
## Multiple R-squared: 0.8047, Adjusted R-squared: 0.8046
## F-statistic: 2.059e+04 on 1 and 4998 DF, p-value: < 2.2e-16

summary(fitted.model)
##
## Call:
## lm(formula = price ~ carat + cut + clarity + color + table +
##     y + z, data = model.data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2340.21  -184.12   -17.18   178.61  1571.15
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2253.841    180.766 -12.468  < 2e-16 ***
## carat        6154.635     66.980  91.888  < 2e-16 ***
## cut.L         316.192     17.328  18.247  < 2e-16 ***
## cut.Q        -126.597     14.558  -8.696  < 2e-16 ***
## cut.C         123.818     13.368   9.262  < 2e-16 ***
## cut^4          25.591     11.421   2.241  0.02509 *
## clarity.L    1936.765     28.124  68.866  < 2e-16 ***
## clarity.Q    -733.049     24.012 -30.529  < 2e-16 ***
## clarity.C     454.915     20.721  21.954  < 2e-16 ***
## clarity^4    -247.327     17.145 -14.426  < 2e-16 ***
## clarity^5      79.403     14.658   5.417 6.34e-08 ***
## clarity^6     -18.140     13.241  -1.370  0.17075
## clarity^7      19.361     11.993   1.614  0.10653
## color.L      -890.439     16.679 -53.387  < 2e-16 ***
## color.Q      -205.497     14.892 -13.800  < 2e-16 ***
## color.C       -54.116     14.056  -3.850  0.00012 ***
## color^4        32.902     13.141   2.504  0.01232 *
## color^5        54.910     12.298   4.465 8.18e-06 ***
## color^6        44.704     11.158   4.007 6.25e-05 ***
## table          -1.865      2.462  -0.757  0.44892
## y              30.742     12.057   2.550  0.01081 *
## z              64.395     38.046   1.693  0.09060 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 324.6 on 4978 degrees of freedom
## Multiple R-squared: 0.9268, Adjusted R-squared: 0.9265
## F-statistic: 3001 on 21 and 4978 DF, p-value: < 2.2e-16

# Conduct Analysis of Variance between the simple model and the best fitted model
anova(simple.model, fitted.model)
## Analysis of Variance Table
##
## Model 1: price ~ carat
## Model 2: price ~ carat + cut + clarity + color + table + y + z
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)
## 1   4998 1399334980
## 2   4978  524428896 20 874906084 415.24 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
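
The F statistic in this table can be reproduced by hand from the two residual sums of squares. The sketch below does so using only the numbers printed in the table; the variable names are illustrative. It assumes the usual nested-model F test, in which the drop in RSS per added degree of freedom is compared with the residual variance of the larger model.

```r
# Values copied from the Analysis of Variance Table
rss_simple <- 1399334980   # RSS of price ~ carat
rss_fitted <- 524428896    # RSS of the larger model
df_simple  <- 4998         # residual degrees of freedom, model 1
df_fitted  <- 4978         # residual degrees of freedom, model 2

# Extra parameters in the larger model
df_diff <- df_simple - df_fitted

# F = (drop in RSS per extra parameter) / (residual variance of larger model)
f_stat <- ((rss_simple - rss_fitted) / df_diff) / (rss_fitted / df_fitted)

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_diff, df_fitted, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```

The result agrees with the F value of 415.24 reported by `anova()`, confirming that the 20 additional coefficients reduce the residual variance far more than chance would.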

### 3.5 Analysing the Residuals

#### 3.5.1 Checking unidentified patterns in the Residuals

The graph shows that the difference between the actual and predicted prices grows as the price of the diamond
increases. It is possible that a factor that raises the price in the higher price range is not captured in
the model. Hence the variation in price cannot be adequately captured by the model with the available
predictors.
# Residuals of the fitted model plotted against the actual price
x <- model.data\$price
y <- resid(fitted.model)
ggplot(data.frame(x, y), aes(x,y)) +
  geom_hline(yintercept=0, size=1) +
  geom_point(size=3, colour="black", alpha = 0.1) +
  geom_point(size=2, colour="salmon", alpha = 0.2) +
  xlab("Actual price") +
  ylab("Residual") +
  geom_smooth(method="loess", colour="red", lwd=1)

[Figure: residuals of the fitted model plotted against price, with a loess smoothing curve.]

#### 3.5.2

The graph shows that the residuals follow an approximately normal pattern.

x <- residuals(fitted.model)
# Plotting the histogram
myhist <- hist(x, breaks=10, density=10, col="darkgrey",
xlab="Residuals",
main="Frequency Distribution of residuals")
# Adding a vertical line for the mean
abline(v=mean(x), col="darkgreen", lwd=2)
# Plotting the density curve
# Scale the density to the count scale: count = density * n * bin width
multiplier <- length(x) * diff(myhist\$mids[1:2])
mydensity <- density(x)
mydensity\$y <- mydensity\$y * multiplier
lines(mydensity, col="blue", lwd=2)
# Plotting the normal curve with the same mean and Standard deviation
xfit <- seq(min(x), max(x), length=40)
yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
yfit <- yfit * diff(myhist\$mids[1:2]) * length(x)
lines(xfit, yfit, col="red", lwd=2)

[Figure: histogram "Frequency Distribution of residuals" (x-axis: Residuals; y-axis: Frequency), with the mean line, density curve and normal curve.]
### 3.6

The formula for prediction is

Diamond price = -4253.844 + 4920.324 * carat + xx * cut + yy * clarity + zz * color + 1.462 * table +
376.099 * y + 275.481 * z

Note: The values of xx, yy and zz depend on the level of the respective factor (cut, clarity, color).

Coefficients:
(Intercept)  carat      cut.L      cut.Q      cut.C      cut^4    clarity.L  clarity.Q
-4253.844    4920.324   261.542    -83.890    66.899     37.187   2011.021   -749.420
clarity.C    clarity^4  clarity^5  clarity^6  clarity^7  color.L  color.Q    color.C
477.879      -296.209   77.985     -30.944    27.387     -944.980 -226.105   -80.026
color^4      color^5    color^6    table      y          z
18.037       12.063     30.269     1.462      376.099    275.481
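
The terms cut.L, cut.Q, cut.C and cut^4 arise because cut is an ordered factor, which R encodes with orthogonal polynomial contrasts by default. The sketch below shows how the placeholder "xx * cut" in the formula above could be evaluated for each level of cut. It uses the four cut coefficients listed above and base R's `contr.poly()`; the names `cut_coef` and `cut_effect` are illustrative, and the sketch assumes the default contrasts were in effect when the model was fitted.

```r
# The five ordered levels of cut, from worst to best
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Orthogonal polynomial contrast matrix for a five-level ordered factor
contrasts_cut <- contr.poly(length(cut_levels))
rownames(contrasts_cut) <- cut_levels

# Cut coefficients copied from the list above (.L, .Q, .C, ^4)
cut_coef <- c(261.542, -83.890, 66.899, 37.187)

# Contribution of each cut level to the predicted price
cut_effect <- drop(contrasts_cut %*% cut_coef)

print(round(contrasts_cut, 3))
print(round(cut_effect, 1))
```

Each level's contribution is the dot product of its contrast row with the four coefficients. The contributions sum to zero across the five levels, so they are read as adjustments around the intercept rather than as offsets from a baseline level.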
# Join the predicted and model data for comparison
pred.data <- model.data
pred.data <- select(pred.data, cut:z, carat)
pred <- predict(fitted.model, pred.data)
pred <- data.frame(model.data, pred)
# Round the predicted data
pred\$pred <- round(pred\$pred, 0)
# Determining RMSE to assess fit (Root Mean Squared Error)
model.rmse <- sqrt(mean(residuals(fitted.model)^2))
model.rmse
## [1] 323.8607
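
The RMSE is close to, but not the same as, the residual standard error of 324.6 reported by `summary(fitted.model)`. Both derive from the same residual sum of squares, differing only in the divisor. The sketch below illustrates this using the RSS from the analysis of variance table; the variable names are illustrative.

```r
# Values copied from the model output
rss      <- 524428896   # residual sum of squares of fitted.model
n        <- 5000        # number of sampled observations
df_resid <- 4978        # residual degrees of freedom (n minus 22 coefficients)

# RMSE divides by the number of observations
rmse <- sqrt(rss / n)

# The residual standard error divides by the residual degrees of freedom
rse <- sqrt(rss / df_resid)

print(round(rmse, 4))
print(round(rse, 1))
```

With 5000 observations and only 22 estimated coefficients the two measures are nearly identical, and both say that a typical prediction error is a little over USD 320.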

### 3.7 Plotting the predicted data with actual data

Here we see that the prediction is more accurate within the price range of USD 1000 to USD 5000. Outside
this price range, the prediction is not accurate. Perhaps a different prediction model should be created for
data outside this range.
For prices below USD 1000, the predicted price is lower than the actual price. Similarly, for prices
above USD 4500, the predicted price is higher than the actual price.
g <- ggplot(pred, aes(y = price, x = pred))
g <- g + geom_point(size=3, colour="black", alpha = 0.1)
g <- g + geom_point(size=2, colour="salmon", alpha = 0.2)
# g <- g + geom_point()
g <- g + ylab("Actual Price")
g <- g + xlab("Predicted Price")
g <- g + geom_smooth(method=loess, col="blue", lwd=1)
g <- g + geom_smooth(method=lm, col="red", lwd=1)
g

[Figure: Actual Price against Predicted Price, with loess (blue) and linear (red) smoothing lines.]

par(mfrow=c(2, 2))
plot(fitted.model)

[Figure: the four diagnostic plots of the fitted model: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage (with Cook's distance).]

The points in the Q-Q plot lie more or less on the line, indicating that the residuals are approximately normally distributed.

## 4 Final conclusion

We have seen that a good predictive model can be developed using a linear model, provided that the variables
(predictors) which significantly impact the outcome (price in this case) can be accurately identified.
We also observe that the prediction may work only within some boundary conditions. If the boundary conditions
are accurately identified, then different models can be built for predicting the data outside the range of the fitted model.