
FORMAT OF THE LAB

TITLE - Real estate prices in Texas

GOAL

Study the change in house prices in Texas over a 10-year period and how this price change relates to
the following factors: household income, unemployment, employment, total labor force, population,
crime, and GDP.

DATASET

The following data were collected for the project:

1) Housing activity data from the Real Estate Center at Texas A&M University [1]. The data cover
monthly housing activity from January 1990 to January 2017. For each month the website reports the
number of sales, the total dollar volume of sales, the average price, the median price, the total
number of listings, and the months of inventory. The website also breaks the data down by
Metropolitan Statistical Area (MSA), Local Market Area (LMA), and county. The housing statistics
are based on listing data from 50 Multiple Listing Service (MLS) systems in Texas.

Date        Sales  Dollar Volume  Average Price  Total Listings
1990-01-01  7741   655670441      84701          93041
1990-02-01  6200   530013200      85486          102099
1990-03-01  8545   727418760      85128          104780
1990-04-01  8245   708229010      85898          106347

2) Texas employment data [2], collected from the United States Department of Labor website. The
table provides monthly data from 1976 to 2016, including the total labor force, the number
employed, the number unemployed, and the unemployment rate. The data for December 2017 are
preliminary.

Date        Labor Force  Employment  Unemployment
1990-01-01  8526661      7987215     539446
1990-02-01  8547340      8010107     537233
1990-03-01  8565895      8029827     536068
1990-04-01  8581401      8045204     536197
1990-05-01  8594032      8056920     537112
1990-06-01  8603974      8065550     538424
1990-07-01  8613510      8072606     540904
3) Texas crime data [3], collected from the Disaster Center website. The tables provide yearly data
from 1960 to 2016: the population, the crime index, and crime counts categorized as violent,
property, murder, forcible rape, robbery, aggravated assault, burglary, larceny theft, and vehicle
theft.

Year  Population  Index    Violent  Property  Murder  Forcible Rape  Robbery  Aggravated Assault  Burglary  Larceny Theft (truncated)
1990  16986510    1329494  129343   1200151   2389    8750           44297    73907               314512    7
1991  17349000    1356527  145743   1210784   2652    9266           49700    84125               312693    7
1992  17656000    1246148  142369   1103779   2239    9437           44588    86105               268928    6
1993  18031000    1161031  137419   1023612   2147    9922           40469    84881               233913    6
1994  18378000    1079225  129838   949387    2022    9102           37643    81071               214687    6

4) Texas Gross Domestic Product (GDP) [4], from the U.S. Bureau of Economic Analysis. The tables
provide yearly data from 1963 to 2016. The data cover all industries but have a discontinuity at
1997: they are presented under the Standard Industrial Classification (SIC) for 1963 to 1996 and
under the North American Industry Classification System (NAICS) for 1997 to 2015. The stated
reason for the discontinuity is the recognition of research and development expenditures as capital
and the capitalization of entertainment, literary, and other artistic originals. These improvements
have not been incorporated in the SIC-based statistics. The NAICS-based statistics of GDP by state
are consistent with U.S. gross domestic product (GDP), while the SIC-based statistics of GDP by
state are consistent with U.S. gross domestic income (GDI). Unit: millions of dollars.
Observation date  TXNGSP
1990-01-01        378943
1991-01-01        393574
1992-01-01        416401
1993-01-01        443775
1994-01-01        475990

5) Historical mortgage rates [5]. HSH's Fixed-Rate Mortgage Indicator (FRMI) averages 30-year
mortgages of all sizes, including conforming, expanded conforming, and jumbo.

Date    30 Year FRM
Jan-90  0.0999300
Feb-90  0.1026600
Mar-90  0.1033600
Apr-90  0.1043900
May-90  0.1054100
Jun-90  0.1021800
Jul-90  0.1011100
Aug-90  0.1017800
6) Dow Jones monthly data from 1990 to 2016 [7].

PREPROCESSOR

Data outliers, creation of new attributes, data integration, removal of noise and everything done to the
data

1) Data Cleaning
The tables were checked for incomplete, inconsistent, and noisy data.
Missing data: the tables were checked for missing data before they were selected. Because the data
came from different sources, some tables provided monthly data while others provided yearly data.
The housing activity data ran from 1990 to January 2017, but none of the other tables had data for
January 2017, so the data integration step restricted all tables to the periods for which every
source had data. Data for years before 1990 in the other tables were ignored, since there was no
real estate data for that period, and 2016 was also ignored.

Two tables were prepared for analysis. Table 1 holds the monthly data, which were available for the
real estate and employment sources; Table 2 holds the yearly data from 1990 to 2015.
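
The integration step can be pictured as an inner join on the date key, restricted to a period that every source covers. The sketch below uses a subset of the housing and employment sample rows shown in the DATASET section. The function name `integrate`, the dictionary layout, and the end date are illustrative assumptions, not the code used in the lab.

```python
from datetime import date

# A subset of the sample rows from the DATASET section (date -> values).
housing = {
    date(1990, 1, 1): {"sales": 7741, "average_price": 84701},
    date(1990, 2, 1): {"sales": 6200, "average_price": 85486},
    date(1990, 3, 1): {"sales": 8545, "average_price": 85128},
    date(1990, 4, 1): {"sales": 8245, "average_price": 85898},
}
employment = {
    date(1990, 1, 1): {"labor_force": 8526661, "employment": 7987215},
    date(1990, 2, 1): {"labor_force": 8547340, "employment": 8010107},
    date(1990, 3, 1): {"labor_force": 8565895, "employment": 8029827},
    # April is left out on purpose to show how the join drops unmatched months.
    date(1990, 5, 1): {"labor_force": 8594032, "employment": 8056920},
}


def integrate(left, right, start, end):
    """Inner-join two date-keyed tables, keeping only dates in [start, end]."""
    merged = {}
    for day in sorted(left.keys() & right.keys()):
        if start <= day <= end:
            merged[day] = {**left[day], **right[day]}
    return merged


# The end date here is an assumed cut-off chosen for illustration only.
table1 = integrate(housing, employment, date(1990, 1, 1), date(2016, 12, 1))
for day, row in table1.items():
    print(day.isoformat(), row)
```

Only months present in both sources survive the join, which is one way to guarantee that every row of the integrated table has a value for every attribute.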

Outlier analysis: two methods were used, the histogram method with a box plot and the Median
Absolute Deviation (MAD):

$$\mathrm{MAD} = \mathrm{median}\left(|X_i - \mathrm{median}(X)|\right), \qquad m_i = \frac{0.6745\,|X_i - \mathrm{median}(X)|}{\mathrm{MAD}}$$

Using these formulas, the MAD was calculated for each column of the data, then $m_i$ was calculated
for every value in the column. A value was declared an outlier if $|m_i| > 3.5$.
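
As a minimal sketch of the MAD rule described above, the function below computes the MAD of one column and flags values whose score exceeds 3.5. The function name and the input values are made up for illustration; they are not taken from the lab data.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return (MAD, outliers) for one column using the 0.6745 * |x - median| / MAD score."""
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        # The score is undefined when more than half the values equal the median.
        return mad, []
    outliers = [x for x in values if 0.6745 * abs(x - center) / mad > threshold]
    return mad, outliers


# Hypothetical column values, for illustration only.
column = [5.1, 5.3, 5.0, 5.2, 5.4, 5.1, 12.0]
mad, outliers = mad_outliers(column)
print(mad, outliers)
```

Because both the center and the spread are medians, a single extreme value barely moves them, which is why this rule is often preferred over a mean and standard deviation test on small samples.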

Attribute           MAD         Outliers
Sales               5071        0
Dollar Volume       1435908514  0
Average Price       40529       0
Median Price        28466       0
Total Listings      14027       0
Months Inventory    1.3         5
Labor Force         1209376     0
Employment          971229      0
Unemployment        82762       0
Population          21322190.5  0
Index               49198.5     1
Violent             7845        0
Property            45522       1
Murder              111.5       5
Forcible Rape       369.5       0
Robbery             3417.5      0
Aggravated Assault  2894.5      0
Burglary            13769       2
Larceny Theft       32522.5     0
Vehicle Theft       10172.5     2
GDP                 345263      0

VISUALIZE THE DATA

For the monthly data table, the data were visualized using the Weka visualization tools.

The visualizations plot the Average Price on the Y axis against Sales, Dollar Volume, Total
Listings, Labor Force, Employment, and Unemployment. These visualizations were done in Python.
The correlation coefficient between each attribute and the Average Price was also calculated.
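
The report does not say which correlation coefficient was used. Assuming the Pearson coefficient, the calculation for one attribute against the Average Price can be sketched as follows. The function name is illustrative, and the four rows come from the housing sample in the DATASET section, so the result is not comparable to the full-table values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# First four monthly rows of the housing sample (January to April 1990).
sales = [7741, 6200, 8545, 8245]
average_price = [84701, 85486, 85128, 85898]

r = pearson(sales, average_price)
print(f"Sales vs Average Price (4 rows only): {r:.4f}")
```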

For first table

Attribute               Correlation coefficient
Sales                   0.84607
Dollar Volume           0.937
Median Price            0.977
Total Listings          0.39287
Months Inventory        -0.738476
Labor Force             0.9743
Employment              0.9828
Unemployment            0.31384
Mortgage Interest Rate  -0.900
Dow Open                0.94651
Dow High                0.9457
Dow Low                 0.94417
Dow Close               0.94535
Dow Volume              0.47755
Dow Adjust Close        0.945349

For second table

Attribute           Correlation coefficient
Sales               0.89135
Dollar Volume       0.9779
Total Listings      0.416
Labor Force         0.71886
Employment          0.72855
Unemployment        0.8486
Population          0.98987
Index               -0.6946518
Violent             -0.68931
Property            -0.6833
Murder              -0.72697
Forcible Rape       -0.46589
Robbery             -0.833
Aggravated Assault  -0.557
Burglary            -0.561
Larceny Theft       -0.561
Vehicle Theft       -0.8522
GDP                 0.988

MODEL

First Table (Monthly Data)

Based on the correlations between the variables, Mortgage Rate and Employment were chosen as
predictors of the average price. All the models were tested on three combinations: Employment with
Average Price; Mortgage Rate with Average Price; and Employment and Mortgage Rate with Average Price.

All the classifier functions in Weka were tested on the data to determine the best model.

The modeling results for all three combinations showed trouble predicting prices after September
2008, when the stock market crash happened [6]. Dow Jones monthly data from 1990 to the present
were therefore added to the data set, which improved the accuracy of the predictions.

The model was developed from the data from 1990 until September 2008.

The model uses SMOreg, which implements the support vector machine for regression.

Transformed training data:

1. Average_Price
2. Employment
3. Mortgage_interest
4. Dow_open
5. Dow_close
6. Month
7. Quarter
8. Date-remapped
9. Lag_Average_Price-1
10. Lag_Average_Price-2
11. Lag_Average_Price-3
12. Lag_Average_Price-4
13. Lag_Average_Price-5
14. Lag_Average_Price-6
15. Lag_Average_Price-7
16. Lag_Average_Price-8
17. Lag_Average_Price-9
18. Lag_Average_Price-10
19. Lag_Average_Price-11
20. Lag_Average_Price-12
21. Date-remapped^2
22. Date-remapped^3
23. Date-remapped*Lag_Average_Price-1
24. Date-remapped*Lag_Average_Price-2
25. Date-remapped*Lag_Average_Price-3
26. Date-remapped*Lag_Average_Price-4
27. Date-remapped*Lag_Average_Price-5
28. Date-remapped*Lag_Average_Price-6
29. Date-remapped*Lag_Average_Price-7
30. Date-remapped*Lag_Average_Price-8
31. Date-remapped*Lag_Average_Price-9
32. Date-remapped*Lag_Average_Price-10
33. Date-remapped*Lag_Average_Price-11
34. Date-remapped*Lag_Average_Price-12
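
The lagged and time-trend attributes in the list above are generated by the Weka forecasting package. The sketch below shows how features of the same shape could be built by hand. The rescaling of the time index to [0, 1] is an assumption made for illustration and is not claimed to match Weka's internal date remapping; the function name is also hypothetical.

```python
def build_features(prices, max_lag=12):
    """Build lag, time-trend, and time-by-lag product features for a price series."""
    n = len(prices)
    rows = []
    # The first max_lag observations have no complete lag history, so skip them.
    for t in range(max_lag, n):
        time_idx = t / (n - 1)  # assumed remapping of the date to [0, 1]
        row = {
            "Average_Price": prices[t],
            "Date-remapped": time_idx,
            "Date-remapped^2": time_idx ** 2,
            "Date-remapped^3": time_idx ** 3,
        }
        for k in range(1, max_lag + 1):
            lagged = prices[t - k]
            row[f"Lag_Average_Price-{k}"] = lagged
            row[f"Date-remapped*Lag_Average_Price-{k}"] = time_idx * lagged
        rows.append(row)
    return rows


# Average prices for January to April 1990 from the housing sample.
# A maximum lag of 2 is used here only because the sample has four rows.
rows = build_features([84701, 85486, 85128, 85898], max_lag=2)
for row in rows:
    print(row)
```

With the full monthly series and `max_lag=12`, each row would carry the same twelve lags and twelve products listed above.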

Thus the Model is as follows

Average_Price:
SMOreg

weights (not support vectors):


- 0.0188 * (normalized) Employment
- 0.0088 * (normalized) Mortgage_interest
+ 0.0094 * (normalized) Dow_open
+ 0.0438 * (normalized) Dow_close
- 0.0392 * (normalized) Month=jan
- 0.0054 * (normalized) Month=feb
+ 0.0259 * (normalized) Month=mar
- 0.0192 * (normalized) Month=apr
+ 0.0109 * (normalized) Month=may
+ 0.0298 * (normalized) Month=jun
+ 0.022 * (normalized) Month=jul
- 0.0029 * (normalized) Month=aug
- 0.021 * (normalized) Month=sep
- 0.0151 * (normalized) Month=oct
- 0.0034 * (normalized) Month=nov
+ 0.0177 * (normalized) Month=dec
- 0.0188 * (normalized) Quarter=Q1
+ 0.0215 * (normalized) Quarter=Q2
- 0.0019 * (normalized) Quarter=Q3
- 0.0009 * (normalized) Quarter=Q4
+ 0.3854 * (normalized) Date-remapped
+ 0.396 * (normalized) Lag_Average_Price-1
+ 0.2176 * (normalized) Lag_Average_Price-2
- 0.0117 * (normalized) Lag_Average_Price-3
+ 0.0318 * (normalized) Lag_Average_Price-4
+ 0.0092 * (normalized) Lag_Average_Price-5
+ 0.0043 * (normalized) Lag_Average_Price-6
+ 0.0496 * (normalized) Lag_Average_Price-7
+ 0.1386 * (normalized) Lag_Average_Price-8
+ 0.0963 * (normalized) Lag_Average_Price-9
- 0.0413 * (normalized) Lag_Average_Price-10
- 0.015 * (normalized) Lag_Average_Price-11
- 0.0243 * (normalized) Lag_Average_Price-12
+ 0.3051 * (normalized) Date-remapped^2
- 0.1651 * (normalized) Date-remapped^3
- 0.0789 * (normalized) Date-remapped*Lag_Average_Price-1
- 0.1054 * (normalized) Date-remapped*Lag_Average_Price-2
- 0.0191 * (normalized) Date-remapped*Lag_Average_Price-3
+ 0.0079 * (normalized) Date-remapped*Lag_Average_Price-4
+ 0.0065 * (normalized) Date-remapped*Lag_Average_Price-5
- 0.0377 * (normalized) Date-remapped*Lag_Average_Price-6
- 0.1004 * (normalized) Date-remapped*Lag_Average_Price-7
- 0.0498 * (normalized) Date-remapped*Lag_Average_Price-8
+ 0.0159 * (normalized) Date-remapped*Lag_Average_Price-9
- 0.0057 * (normalized) Date-remapped*Lag_Average_Price-10
- 0.0394 * (normalized) Date-remapped*Lag_Average_Price-11
- 0.0252 * (normalized) Date-remapped*Lag_Average_Price-12
+ 0.0278
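
A linear model of this form is evaluated as a weighted sum of normalized attribute values plus the constant term. The sketch below assumes that "(normalized)" means min-max scaling to [0, 1] and that the target is scaled the same way, so the result is mapped back to dollars afterwards. Both points are assumptions about the Weka output, and the weights, ranges, and inputs in the example are hypothetical placeholders rather than values from the lab.

```python
def min_max(value, low, high):
    """Scale a raw value to [0, 1] given the attribute's training range."""
    return (value - low) / (high - low)


def predict_normalized(weights, bias, features, ranges):
    """Weighted sum of min-max normalized attributes plus the constant term."""
    total = bias
    for name, weight in weights.items():
        low, high = ranges[name]
        total += weight * min_max(features[name], low, high)
    return total


def denormalize(value, low, high):
    """Map a normalized prediction back to the target's original units."""
    return value * (high - low) + low


# Hypothetical two-attribute model, for illustration only.
weights = {"attribute_a": 0.5, "attribute_b": -0.2}
bias = 0.1
ranges = {"attribute_a": (0.0, 10.0), "attribute_b": (100.0, 200.0)}
features = {"attribute_a": 5.0, "attribute_b": 150.0}

normalized = predict_normalized(weights, bias, features, ranges)
price = denormalize(normalized, 80000.0, 200000.0)  # assumed target range
print(normalized, price)
```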
PREDICTION
This is the conclusion, discussion about the error and the future work that will be done on this, sources of
errors, assumptions

When the average price was predicted from Employment alone, the attribute with the highest
correlation, the error was 10% to 20%, and it increased after September 2008. Research showed that
there was a market crash that year, so the Dow Jones monthly data were added; after that the errors
dropped to roughly 2% to 4% for most months. A few results are listed below. This suggests that the
market crash was responsible for the sudden change in house prices. Population and crime may also
be factors, but because monthly data for them are not available, determining their impact is not
easy. This will be done with the yearly table.

Date %Error
2008-11-01 -4.0651
2008-12-01 -2.4709
2009-01-01 -15.166
2009-02-01 -3.5019
2009-03-01 -3.2116
2009-04-01 -4.8648
2009-05-01 1.09232
2009-06-01 2.41123
2009-07-01 -1.9488
2009-08-01 -2.2012
2009-09-01 -3.7518
2009-10-01 -3.8389
2009-11-01 -2.6099
2009-12-01 2.28865
2010-01-01 -8.3404
2010-02-01 -0.6837
2010-03-01 -0.1671
2010-04-01 -2.7713
2010-05-01 -1.2211
2010-06-01 3.05393
2010-07-01 2.43327
2010-08-01 -0.3438
2010-09-01 -2.766
2010-10-01 -0.962
2010-11-01 2.25093
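
The report does not define the percent error. A common definition, assumed here, is the signed difference between predicted and actual price relative to the actual price. The sketch below states that definition and summarizes the errors listed in the table above. The sign convention and the function name are assumptions.

```python
def percent_error(predicted, actual):
    """Signed percent error; negative means the prediction was too low (assumed convention)."""
    return 100.0 * (predicted - actual) / actual


# Percent errors copied from the table above (2008-11 through 2010-11).
errors = [
    -4.0651, -2.4709, -15.166, -3.5019, -3.2116, -4.8648, 1.09232, 2.41123,
    -1.9488, -2.2012, -3.7518, -3.8389, -2.6099, 2.28865, -8.3404, -0.6837,
    -0.1671, -2.7713, -1.2211, 3.05393, 2.43327, -0.3438, -2.766, -0.962,
    2.25093,
]

mean_abs_error = sum(abs(e) for e in errors) / len(errors)
worst = max(errors, key=abs)
within_5 = sum(1 for e in errors if abs(e) <= 5.0)

print(f"mean absolute error: {mean_abs_error:.2f}%")
print(f"largest error: {worst}%")
print(f"months within 5%: {within_5} of {len(errors)}")
```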

Time Series Analysis

The package used for time series analysis was the Weka Time Series Forecaster.

The Time Series Forecaster is a recent package added to Weka and is only available in versions
3.7.3 and later. Weka requires input files to be in ARFF format, so a Python script was written to
generate the ARFF files, as creating them manually is very time consuming.
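
The script itself is not included in the report. A minimal sketch of such a generator, assuming three attributes and a hypothetical relation name, could look like this. The rows are the first rows of the housing sample.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text; string values are quoted, numbers are written as-is."""
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        cells = [f'"{v}"' if isinstance(v, str) else str(v) for v in row]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Attribute names and the relation name are illustrative choices.
attributes = [
    ("Date", 'date "yyyy-MM-dd"'),
    ("Sales", "numeric"),
    ("Average_Price", "numeric"),
]
rows = [
    ("1990-01-01", 7741, 84701),
    ("1990-02-01", 6200, 85486),
    ("1990-03-01", 8545, 85128),
]

arff_text = to_arff("texas_housing", attributes, rows)
print(arff_text)
```

Writing `arff_text` to a file with the `.arff` extension gives a file in the header-plus-data layout that ARFF uses.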
[1] https://www.recenter.tamu.edu/data/housing-activity#!/activity/State/Texas
[2] https://data.bls.gov/pdq/SurveyOutputServlet
[3] http://www.disastercenter.com/crime/txcrime.htm
[4] https://www.bea.gov/iTable/iTable.cfm?reqid=70&step=1&isuri=1&acrdn=1#reqid=70&step=4&isuri=1&7003=200&7001=1200&7002=1&7090=70
[5] http://www.hsh.com/monthly-mortgage-rates.html
[6] https://www.thebalance.com/stock-market-crash-of-2008-3305535
[7] https://finance.yahoo.com/quote/%5EDJI/history?period1=631170000&period2=1483246800&interval=1mo&filter=history&frequency=1mo
