Polyprep New Case Study

Softpro Learning Center Development Group
Heart Disease Prediction

Dataset
*************************************************************************************
Client’s Name:- Dr APJ Abdul Kalam Technical University Of Uttar Pradesh
Project Coordinator’s Name:- Faculty Of MCA Dept.
Trainee’s Name:- …………………………………… Enrollment No:- ………………………..
1) Overview:-
In this project I detail the machine learning (ML) models I implemented to
accurately predict heart disease. The dataset for this experiment was accessed
from the technical website of Dr Ram Manohar Lohiya Medical College. The report
is organized to demonstrate the entire process: getting and cleaning the data,
exploratory analysis of the dataset to understand the distribution and
importance of the various features in influencing the algorithm, forming a
hypothesis, training ML models, and evaluating the models.
2) Introduction:-
It may often happen that you or someone close to you needs a doctor's help
immediately, but no doctor is available for some reason. The Heart Disease
Prediction application is an end-user support and online consultation project.
Here, we propose a web application that allows users to get instant guidance on
their heart disease through an intelligent online system.
© Copyright to SLC: Illegal copies of this material are prohibited.
3) OBJECTIVE:-
The prime objective of this project is to construct a working model capable of
predicting the value of houses. To do so, we need to separate the dataset into
features and the target variable. The features, ‘RM’, ‘LSTAT’, and ‘PTRATIO’,
give us quantitative information about each data point. The target
variable, ‘MEDV’, is the variable we seek to predict. These are stored in
features and prices, respectively.
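The split into features and prices can be sketched as follows. This is a minimal illustration in Python rather than the SAS workflow used in this report; the three sample rows are invented for the example and are not taken from the dataset, and the helper name split_features_target is hypothetical.

```python
import csv
import io

# Illustrative rows only: these values are invented, not taken from the dataset.
SAMPLE_CSV = """RM,LSTAT,PTRATIO,MEDV
6.0,10.0,18.0,20.0
7.0,5.0,15.0,33.0
5.5,20.0,21.0,14.0
"""

FEATURE_COLUMNS = ["RM", "LSTAT", "PTRATIO"]
TARGET_COLUMN = "MEDV"


def split_features_target(csv_text):
    """Parse CSV text and return (features, prices) as parallel lists."""
    features, prices = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # One list of predictor values per data point, in FEATURE_COLUMNS order.
        features.append([float(row[name]) for name in FEATURE_COLUMNS])
        # The target value for the same data point.
        prices.append(float(row[TARGET_COLUMN]))
    return features, prices


features, prices = split_features_target(SAMPLE_CSV)
print(features[0], prices[0])
```

Keeping features and prices as parallel lists means row i of features always corresponds to entry i of prices.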
This project aims to construct a mathematical model using Multiple Regression
to estimate the selling price of a house based on a set of predictor variables.
• Analysis Software Used – SAS (Statistical Analysis System)
4) SCOPE:-
In future, we can also include the latitude, longitude, and elevation of the
house in the model to predict the house price more accurately. Future work can
also include demographic variables such as income, number of children,
education, and age of the family group in the model, to explain the variability
in house pricing and to predict it more effectively.
4) Name and Description of Modules:-
The dataset (Delhi NCR Housing Price) was taken from the Delhi Public Library,
a national depository library in the Indian state of Delhi, and is freely
available for download from the Delhi State Website Repository. The dataset
consists of 506 observations of 14 attributes. The median value of house price
in $1000s, denoted by MEDV, is the outcome or dependent variable in our model.
Below is a brief description of each attribute and the outcome in our dataset:
1. CRIM – per capita crime rate by town
2. ZN – proportion of residential land zoned for lots over 25,000 sq.ft
3. INDUS – proportion of non-retail business acres per town
4. CHAS – Yamuna River dummy variable (1 if tract bounds river; else 0)
5. NOX – nitric oxides concentration (parts per 10 million)
6. RM – average number of rooms per dwelling
7. AGE – proportion of owner-occupied units built prior to 1990
8. DIS – weighted distances to five Delhi NCR employment centres
9. RAD – index of accessibility to radial highways
10. TAX – full-value property-tax rate per $10,000
11. PTRATIO – pupil-teacher ratio by town
12. BK – 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT – % lower status of the population
14. MEDV – Median value of owner-occupied homes in $1000
4) Architecture of House Prediction:-
##       CRIM                ZN               INDUS            CHAS
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000
##       NOX               RM              AGE               DIS
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127
##       RAD              TAX            PTRATIO            B
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90
##      LSTAT            MEDV
##  Min.   : 1.73   Min.   : 5.00
##  1st Qu.: 6.95   1st Qu.:17.02
##  Median :11.36   Median :21.20
##  Mean   :12.65   Mean   :22.53
##  3rd Qu.:16.95   3rd Qu.:25.00
##  Max.   :37.97   Max.   :50.00
Let us visualize the distribution and density of the outcome, MEDV. The black curve represents the
density. In addition, a boxplot is plotted to give an additional perspective. We see that the
median value of housing price is skewed to the right, with a number of outliers to the right. It may be
useful to transform the ‘MEDV’ column using a function such as the natural logarithm while modeling the
hypothesis for regression analysis.
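The effect of a natural log transform on a right-skewed variable can be checked numerically. The sketch below is an illustration in Python, not part of the original analysis; the price values are invented, and skewness here is the simple population (moment-based) coefficient.

```python
import math


def skewness(values):
    """Population skewness: third central moment divided by std**3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Invented right-skewed prices in $1000s (a long tail of expensive homes).
medv = [12.0, 15.0, 17.0, 19.0, 21.0, 22.0, 24.0, 27.0, 36.0, 50.0]
log_medv = [math.log(v) for v in medv]

print("skewness before log:", round(skewness(medv), 3))
print("skewness after log: ", round(skewness(log_medv), 3))
```

A positive coefficient indicates a tail to the right; for this sample the coefficient is smaller after the transform, which is the behaviour the paragraph above relies on.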
Summary Of Architecture:
LIST OF DEPENDENT AND INDEPENDENT VARIABLES
- We have 8 independent variables and 1 dependent variable. We screen variables
based on their correlation coefficient with price and the amount of variability
explained by the model (R-square).
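Screening by correlation with price can be sketched as below. This is an illustrative Python example under stated assumptions: the data values are invented, the "noise" column is a hypothetical placeholder, and ranking by absolute Pearson correlation is one simple screening rule rather than the exact procedure used in SAS.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values for illustration only.
price = [100.0, 150.0, 200.0, 250.0, 300.0]
candidates = {
    "bedrooms": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises exactly with price
    "age": [40.0, 35.0, 20.0, 22.0, 5.0],   # falls as price rises
    "noise": [3.0, 1.0, 2.0, 1.0, 3.0],     # unrelated to price
}

# Rank candidate predictors by the strength (absolute value) of correlation.
ranked = sorted(
    candidates,
    key=lambda name: abs(pearson_r(candidates[name], price)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson_r(candidates[name], price), 3))
```

The absolute value is used so that strongly negative predictors, such as age in this sample, are not discarded.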
STATISTICAL APPROACH
• The statistical approach used here is Multiple Regression.

EQUATION FOR MULTIPLE REGRESSION:
-> Y = b0 + b1*X1 + b2*X2 + ... + bp*Xp
-> X1, X2, ..., Xp are the independent variables, and Y is the housing price,
the dependent variable that is being predicted or explained.
-> b0 is the constant or intercept
-> b1 is the slope (Beta coefficient) for X1, b2 is the Beta coefficient for X2, etc.
• This equation is estimated using the Least-Squares method.
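A least-squares estimate of b0, b1, ..., bp can be sketched with NumPy. This is an illustration, not the SAS procedure used in the report: the data are invented so that Y = 2 + 3*X1 - 1*X2 holds exactly, and the function name fit_least_squares is hypothetical.

```python
import numpy as np


def fit_least_squares(X, y):
    """Return [b0, b1, ..., bp] minimizing the sum of squared residuals."""
    # Prepend a column of ones so that the first coefficient is the intercept b0.
    design = np.column_stack([np.ones(len(X)), X])
    coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefficients


# Invented predictors; y is built from known coefficients b0=2, b1=3, b2=-1.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

b = fit_least_squares(X, y)
print("b0, b1, b2 =", np.round(b, 6))
```

Because the sample data contain no noise, the fit recovers the coefficients used to generate y; with real data the estimates are the values that make the squared residuals as small as possible.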
EXPLORATORY DATA ANALYSIS

• The exploratory data analysis consists of scatter plots of house price
against each predictor variable, both with and without a natural log transform
of the price variable.
DISTRIBUTION OF HOUSING PRICE VARIABLE WITH NATURAL LOG TRANSFORM
Distribution plots: 1) Normal probability plot 2) Histogram
1) The housing price is transformed using the natural log and appears very close
to a normal distribution. This supports a linear relationship between housing
price and the other predictor variables.
2) The distribution is much less skewed than before the transformation.
MULTIPLE REGRESSION ANALYSIS:

• Multiple regression was done on the data set using the PROC REG procedure
in SAS.
ANOVA TABLE:
Main Points from SAS output:

• The F-value is 37.32 and the p-value is < 0.05, so the regression model
is significant.
• The p-values for the t-statistics of the selected variables are all
<= 0.05, so all the variables are significant in the model.
• The R-square is 0.8092, which means 80.92% of the total variability
is explained by the age, lotsizesqft, bedrooms, appliances_cnt and
numfloors variables.
• The Regression equation to predict the house price is
Identifying Outliers using residuals

• After identifying influential observations, the outliers were removed from
the data. The top 3 and bottom 3 cases were removed to see whether this improves
the variability explained by the model. The R-square value increased from 0.8092
to 0.8322, so we retain the model refit without the outliers.
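The refit-after-trimming step can be sketched as follows. This is an illustrative Python example with invented data and a single predictor; it drops the three largest and three smallest raw residuals, which is an assumption about how "top 3 and bottom 3 cases" were selected, and the function names are hypothetical.

```python
import numpy as np


def fit_and_score(x, y):
    """Fit y = b0 + b1*x by least squares; return (residuals, r_squared)."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    return residuals, 1.0 - ss_res / ss_tot


def drop_extreme_residuals(x, y, k=3):
    """Remove the k most negative and k most positive residual cases."""
    residuals, _ = fit_and_score(x, y)
    order = np.argsort(residuals)   # indices from lowest to highest residual
    keep = np.sort(order[k:-k])     # discard k cases at each end
    return x[keep], y[keep]


# Invented data: an exact line with three high and three low outliers.
x = np.arange(20, dtype=float)
y = 1.0 + 2.0 * x
y[[2, 9, 16]] += 15.0
y[[4, 11, 17]] -= 15.0

_, r2_before = fit_and_score(x, y)
x_clean, y_clean = drop_extreme_residuals(x, y, k=3)
_, r2_after = fit_and_score(x_clean, y_clean)
print("R-square before:", round(r2_before, 4), "after:", round(r2_after, 4))
```

Comparing R-square before and after the trim mirrors the check described above; on real data the gain should be weighed against the loss of genuine observations.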
CUMULATIVE DISTRIBUTION OF PREDICTION ERROR %
• The formula is (abs(actual-predicted)*100/actual).

• This cumulative chart shows that 70% (0.7 on the y-axis) of cases have
less than 9% prediction error when compared with the actual selling
price.
• 80% of cases have less than 10% prediction error.
• 90% of cases have less than 12% prediction error.
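The prediction error % formula and the cumulative shares read from the chart can be computed as in the sketch below. The actual and predicted prices are invented for illustration, and the helper names are hypothetical.

```python
def percent_errors(actual, predicted):
    """Prediction error % per case: abs(actual - predicted) * 100 / actual."""
    return [abs(a - p) * 100 / a for a, p in zip(actual, predicted)]


def share_within(errors, threshold):
    """Fraction of cases whose prediction error % is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


# Invented selling prices and model predictions.
actual = [100.0, 200.0, 250.0, 400.0, 500.0]
predicted = [95.0, 210.0, 250.0, 340.0, 520.0]

errors = percent_errors(actual, predicted)
for threshold in (9, 10, 12, 20):
    print(f"share of cases under {threshold}% error:", share_within(errors, threshold))
```

Evaluating share_within over a range of thresholds gives the points of the cumulative curve, with the threshold on the x-axis and the share of cases on the y-axis.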
5) Conclusion:-
We are able to predict the house price to within about 10% of the actual selling
price for most cases (80% of cases have less than 10% prediction error), and the
model has a good R-square of 0.83, which means 83% of the variability is
explained by the model. We are also able to interpret the estimates of the model.