
International Conference on Energy, Communication, Data Analytics and Soft Computing (ICECDS-2017)

A Scalable Solution for Heart Disease Prediction using Classification Mining Technique

Rashmi G Saboji, Dept of Computer Science & Engg, C.M.R Institute of Technology, Bangalore, India (brashmikaneri@gmail.com)
Prem Kumar Ramesh, Dept of Computer Science & Engg, C.M.R Institute of Technology, Bangalore, India (premkumar.r@cmrit.ac.in)

Abstract— Latest trends in the health industry suggest an ever-increasing amount of data accumulated digitally, making it one of the top data-intensive sectors. The evolution of technology has made it possible to process such large data and accurately predict outcomes of interest. In this paper, we propose a scalable framework that uses healthcare data to predict heart disease based on certain attributes. Our main contribution in this work is to predict the diagnosis of heart disease with a small number of attributes. Our prediction solution uses random forest on Apache Spark, which gives health care analysts a massive opportunity to deploy the solution on an ever-changing, scalable big data landscape for insightful decision making. Using this approach, we show that up to 98% accuracy is achieved. We also present a comparison against the Naïve-Bayes classifier, where we show that the random forest approach outperforms the former by a significant margin.

Keywords— Apache Spark, HDFS, Heart disease, Random forest.

Introduction

The health care system is rapidly adopting electronic health records (EHR), which will drastically increase the quantity of clinical data that is available digitally. For instance, worldwide digital healthcare data was estimated at 500 petabytes (10^15 bytes) and is expected to reach 25 exabytes (10^18 bytes) by 2020 [1]. Concurrently, fast progress is being made in clinical analytics, i.e., techniques for analyzing large volumes of data and deriving new insights from that analysis, also known as big data analytics. This opens remarkable opportunities to reduce the cost of health care as well as to diagnose diseases in a much simpler way. In this paper, we focus on heart disease, one such instance selected among others in healthcare. Heart disease is a general name for a variety of diseases whose symptoms may vary depending on the specific type of heart disease.

Hospitals use database systems to store and manage their patient data. These systems generate large volumes of data, but these data are rarely used to support insightful clinical decision making. Big data coupled with data mining algorithms makes it possible to do a multitude of things, such as identifying healthcare trends, disease prevention, and early diagnosis, to name a few. In this paper, we focus on the prediction of heart disease, one of the many conditions relevant to health care prediction and analysis.

Objectives

As stated earlier, the purpose of this work is to predict the diagnosis of heart disease with a reduced number of attributes. Each dataset stored in HDFS is classified based on attributes; thirteen attributes and one class are chosen for the prediction. This prediction solution, using random forest on Apache Spark, gives health care analysts a massive opportunity to deploy the solution on an ever-changing, scalable big data landscape for insightful decision making.

The scope of this work mainly deals with the data analysis part to improve the following key issues:

1. Complexity of the analysis: For some analysis algorithms, the computing time increases dramatically even with small amounts of data growth.
2. Accuracy in prediction: Different data mining algorithms in classification, clustering, regression, and association have different accuracy points when it comes to prediction.
3. Scale of the data: Even for simple data analysis, it could take several days, or even months, to obtain the result when the data is very large (e.g., zettabyte scale).
4. Parallelization of the computing model: For computationally intense problems, we can parallelize the analysis so that the problem can be solved by distributing tasks over many computers.

Our goal is to identify the key patterns or features from the medical data using the classifier model. The attributes that are most relevant to heart disease diagnosis are observed.


Related Work

Switching from regular databases to big data technologies such as Hadoop MapReduce provides processing speed and analytical advancement. However, extracting useful knowledge from big data requires passing through multiple configuration stages to achieve full utilization. Each stage, such as data aggregation, data maintenance, data integration, data analysis, and pattern interpretation/application, faces many challenges when dealing with healthcare big data (HBD). Three key focus areas, namely (1) complexity of the analysis, (2) scale of the data, and (3) parallelization of the computing model, are discussed in [2]. In [4], the authors designed a system that can efficiently discover rules to predict the risk level of patients based on given parameters about their health. The rules can be prioritized based on the user's requirements. The classification model covers rules based on decision trees.

Spark is a general distributed computing framework based on Hadoop MapReduce algorithms. It absorbs the advantages of Hadoop MapReduce but, unlike MapReduce, the intermediate and output results of Spark jobs can be stored in memory, which is called memory computing. Memory computing improves the efficiency of data evaluation [5].

Proposed System

To enhance the prediction of the classifiers, genetic search is integrated; the genetic search resulted in 13 attributes that contribute the most towards the prediction of cardiac disease. Classifier ensemble algorithms such as Random Forest are used for the prediction of patients with heart disease. The classifiers are fed with the reduced data set of 13 attributes. Results are presented in the Experiment and Results section. Observations exhibit that the Random Forest data mining technique outperforms other data mining techniques such as Naïve Bayes and Decision Tree after incorporating feature subset selection, but with high model construction time. Random Forest performs consistently before and after the reduction of attributes, with the same model construction time.

The proposed system has the following features:

• Collection of health care data of heart disease.
• Storage and processing of data using Hadoop HDFS and Spark.
• Applying the Random Forest algorithm.
• Analyzing performance in terms of accuracy and error rate.

Fig 1: System Architecture

Methodology

We adopt the Apache Spark and Hadoop platform due to its inherent support for scalability with attribute data sets. Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL, MLlib for machine learning, GraphX for graph processing, and Spark Streaming [5].

• Spark SQL supports SQL execution in Spark, with notable improvements in information affinity, performance, and feature augmentation.
• Spark Streaming is a data flow computing framework based on the Spark paradigm; it provides a rich API and consolidates batch, streaming and interactive query execution.
• Spark GraphX provides a concurrent computation API for managing Spark graphs and charts, with considerable performance gains and reduced memory-related overhead.
• Spark MLlib is a scalable machine learning library comprising pertinent tests and data generators. Machine learning algorithms on Spark can achieve up to a 100-times improvement compared to MapReduce. It supports many machine learning algorithms, such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.

Fig 2: Spark Eco System
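To make the methodology concrete, the following minimal sketch shows how this Spark stack is typically driven from Python. The application name and HDFS path are illustrative assumptions, not values taken from this work.

from pyspark import SparkConf, SparkContext

# Minimal PySpark setup (illustrative; the app name and path are assumptions).
conf = SparkConf().setAppName("HeartDiseasePrediction")
sc = SparkContext(conf=conf)

# Load the raw heart-disease CSV records from HDFS as an RDD of text lines.
raw_lines = sc.textFile("hdfs:///data/heart_disease/cleveland.csv")

print("Number of records loaded: %d" % raw_lines.count())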


When datasets become very large and grow beyond the storage capacity of a single machine, it becomes necessary to distribute the data among several machines. HDFS is the Hadoop file system designed for storing very large files, ranging from megabytes to gigabytes and terabytes in size, with streaming data access patterns, running on clusters of commodity hardware. HDFS follows a write-once, read-many-times paradigm. Datasets are usually generated by or transferred from different sources, and various analyses are then performed on these large datasets.

The HDFS framework has two types of nodes that operate in a master-slave fashion. A name node functions as the master node and the data nodes function as slave nodes. The name node manages the file system namespace: it maintains the file system tree and the metadata for all the files and directories in the tree. The name node also keeps track of which data nodes hold the blocks of a given file. The Hadoop HDFS architecture is shown in Fig 3.

A user accesses the file system by interacting with the name node, which in turn communicates with the data nodes on behalf of the user. The client provides a file system interface to the user, so the user need not know about the name node and data nodes. Data nodes store the blocks; they store and retrieve blocks on write and read operations issued by the client. The name node periodically receives from the data nodes the list of blocks they store.

Fig 3: Data storage on HDFS
The system architecture of the solution is shown in Figure 1. As the first step, the Spark ecosystem needs to be understood in order to take advantage of its functionalities and its support for machine learning libraries. Figure 2 showcases the Spark ecosystem, including its underlying resource manager, namely YARN, and its distributed file system, HDFS. Once this is done, the next task is to collect heart disease datasets in CSV files. These datasets need to be processed, i.e., labeled with a class.

The class is simply a numerical representation of the heart disease prediction based on the attribute values: class 0 means absence of heart disease and class 1 means its presence. Once the data is collected in CSV form, it is stored in HDFS, as HDFS provides fault tolerance. The data is then extracted and parsed in order to handle missing attribute values. The Random Forest algorithm is used to predict the class of newly arrived, unlabelled datasets. The same algorithm is applied to measure its accuracy as the training data set grows, which addresses the issue of scalability on big data. Finally, we check the computation time of the algorithm on Spark to address the issue of computational complexity. We measure the error rate, and hence the accuracy, for the given data set.

Implementation

The heart disease datasets are collected from the source given in [12]. The UCI machine learning repository is the most widely used repository containing datasets from various locations; these data sets are used for data mining and machine learning purposes. For heart disease prediction, data is collected from Cleveland, Switzerland and Hungary. The numbers collected in the CSV files represent the attribute values, which indicate either the presence or the absence of heart disease in the patient through another attribute called class. The range of this attribute is from 0 (no presence) to 4. Most of the experiments associated with the Cleveland database focus on distinguishing absence ("class" value 0) from presence ("class" values 1 to 4). For our experimentation, we use 2 classes for prediction, with 0 being absent and 1 being present. For privacy, the personal identification information of the patients has been replaced with dummy values.

The directory in the repository contains a dataset related to heart disease diagnosis. The Cleveland database contains a total of 76 raw attributes, but in these experiments only 14 of them are used. The dataset used in this experiment contains important parameters such as ECG, cholesterol, chest pain, fasting blood sugar, MHR (maximum heart rate) and more. The detailed information about these attributes and their domain ranges is given in Table 1 [4].

Name | Type | Description
Age | Continuous | Age in years
Sex | Discrete | 0 = female, 1 = male
Cp | Discrete | Chest pain type: 1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic
Trestbps | Continuous | Resting blood pressure (in mm Hg)
Chol | Continuous | Serum cholesterol in mg/dl
Fbs | Discrete | Fasting blood sugar > 120 mg/dl: 1 = true, 0 = false
Exang | Discrete | Exercise induced angina: 1 = yes, 0 = no
Thalach | Continuous | Maximum heart rate achieved
Oldpeak | Continuous | ST depression induced by exercise relative to rest
Slope | Discrete | The slope of the peak exercise ST segment: 1 = up sloping, 2 = flat, 3 = down sloping
Ca | Continuous | Number of major vessels (0-3) colored by fluoroscopy
Thal | Discrete | 3 = normal, 6 = fixed defect, 7 = reversible defect
Class | Discrete | Diagnosis class: 0 = no presence, 1 = least likely to have heart disease, 2 = >1, 3 = >2, 4 = most likely to have heart disease

Table 1: Attribute Information
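Before classification, each CSV record must be parsed into a labeled feature vector. The sketch below illustrates this step under the assumption that the fields follow the column order of Table 1 with the class as the last field; the imputation of missing values (marked "?" in the UCI files) with 0.0 and the HDFS path are simplifying assumptions of this sketch.

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="HeartDiseaseParsing")  # illustrative app name

def parse_record(line):
    # Split a CSV line into fields; replace missing values ("?") with 0.0
    # (an illustrative choice, not necessarily the imputation used here).
    fields = [0.0 if f.strip() == "?" else float(f) for f in line.split(",")]
    features = fields[:-1]                    # attribute values from Table 1
    raw_class = fields[-1]                    # original class ranges from 0 to 4
    label = 0.0 if raw_class == 0 else 1.0    # binarize: 0 = absent, 1 = present
    return LabeledPoint(label, features)

data = sc.textFile("hdfs:///data/heart_disease/cleveland.csv").map(parse_record)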
Random Forest Algorithm

The random forest algorithm [6] can be summarized as follows:

Input: Dataset
Output: Predicted class label

set the number of classes to N and the number of features to M
set m, the number of features considered at each node of a decision tree (m < M)
for each decision tree do
    randomly select, with replacement, a subset of the training data that represents the N classes, and use the rest of the data to measure the error of the tree
    for each node of this tree do
        randomly select m features to determine the decision at the node and calculate the best split accordingly
    end for
end for
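The pseudocode relies on two sources of randomness: bootstrap sampling of the training records for each tree, and random selection of m features at each node. The following toy Python sketch illustrates only these two sampling steps; it is not the MLlib implementation used in this work, and the choice of m = sqrt(M) is a common convention assumed here for illustration.

import math
import random

def bootstrap_sample(records):
    # Draw len(records) examples with replacement; out-of-bag records
    # can be used to estimate the error of the individual tree.
    n = len(records)
    picked = [random.randrange(n) for _ in range(n)]
    picked_set = set(picked)
    in_bag = [records[i] for i in picked]
    out_of_bag = [records[i] for i in range(n) if i not in picked_set]
    return in_bag, out_of_bag

def candidate_features(num_features):
    # Choose m = sqrt(M) feature indices at random for a single node split.
    m = max(1, int(math.sqrt(num_features)))
    return random.sample(range(num_features), m)

# Example with 6 dummy records of 13 features each.
records = [[0.0] * 13 for _ in range(6)]
in_bag, oob = bootstrap_sample(records)
print(len(in_bag), len(oob), candidate_features(13))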
and accuracy implementation. It provides the
Random Forest is predominantly an ensemble of unpruned classification trees. It provides remarkable performance on a number of practical problems, such as health care prediction problems, as it is not sensitive to noise in the data set and it is not prone to overfitting. It is built by combining the predictions of several trees, each of which is trained separately. It works fast and usually exhibits a significant performance improvement over many other tree-based algorithms such as decision trees. There are three main choices to be made when constructing a random tree [6]:

1. The method for splitting the leaves.
2. The type of predictor to use in each leaf.
3. The method for injecting randomness into the trees.

In this experiment, we have chosen optimized parameters for the decision trees created by the random forest. We have selected numClasses=2, numTrees=3, impurity='gini', maxDepth=4, and all features are given in categoricalFeaturesInfo.

Advantages of selecting the random forest algorithm for heart disease prediction [13]:

• It gives a significant improvement in accuracy compared to other current algorithms.
• It runs efficiently on large databases as well as on large data sets.
• It can handle thousands of input variables without variable deletion.
• It gives estimates of which variables are important in the classification.
• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
• It has methods for balancing the error in the class populations of unbalanced data sets.
• Generated trees can be saved for future use on other data sets.
• It can compute proximities between training instances, which can be extended to unlabeled data, leading to unsupervised clustering.

For prediction and accuracy, the following steps are performed (a sketch of this flow in PySpark is given after the figures below):

• Datasets are extracted from HDFS.
• Datasets are parsed to fill in missing values, providing complete supervised datasets.
• The datasets are then divided into training and testing datasets in a 70:30 proportion.
• Training is done with the random forest model with the optimized parameters.
• The model is evaluated on the test instances and the test error is computed.
• Based on the model applied to the test instances, heart disease is predicted.
• Accuracy is evaluated by comparing the original label values of the test data with the values predicted by the algorithm.

The figures below depict the data flow for the prediction and accuracy implementation. They provide the detailed path of the different modules involved in heart disease prediction and accuracy computation using the Random Forest algorithm on the Spark platform.

Fig 4: Prediction Data-flow-diagram

Fig 5: Accuracy Data-flow-diagram
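A minimal PySpark sketch of the training and evaluation flow listed above is shown below. It uses MLlib's RandomForest.trainClassifier with the parameters stated in this section; the empty categoricalFeaturesInfo dictionary (which treats all features as continuous), the random seed, and the variable names are simplifying assumptions of this sketch. The `data` RDD is the parsed LabeledPoint RDD built earlier from HDFS.

from pyspark.mllib.tree import RandomForest

# 70:30 split of the parsed data into training and test sets.
training, test = data.randomSplit([0.7, 0.3], seed=42)

model = RandomForest.trainClassifier(
    training,
    numClasses=2,                 # 0 = absence, 1 = presence of heart disease
    categoricalFeaturesInfo={},   # sketch simplification: all features treated as continuous
    numTrees=3,
    featureSubsetStrategy="auto",
    impurity="gini",
    maxDepth=4)

# Predict on the held-out test set and compute the test error / accuracy.
predictions = model.predict(test.map(lambda p: p.features))
labels_and_preds = test.map(lambda p: p.label).zip(predictions)
test_err = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test error = %.4f, accuracy = %.4f" % (test_err, 1.0 - test_err))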


Experiment and Results

The table below provides the system requirements for conducting the experiment of heart disease prediction using the classification algorithm, namely Random Forest, on the Spark framework.

Hard disk | 500 GB
RAM | 8 GB
Storage | 20 GB
Operating system | Ubuntu 16.04 LTS
Python runtime environment | Python 2.7.12
Development environment | PyCharm
Framework | Spark
Front end | Console
Platform | Apache Spark

Table 2: System Requirement

The graph below highlights the outcome of the random forest implementation on Spark. The same prediction model built using Naïve Bayes is also shown. The figure clearly shows that the Naïve Bayes prediction accuracy does not reach the expected accuracy level when compared to random forest. Figure 8 depicts the difference in accuracy between Random Forest and Naïve Bayes as the training datasets stored in HDFS grow. Figure 9 depicts the increase in accuracy with growing training datasets. It is observed that from 200 to 600 records, the accuracy increases from 88% to 98%. It is noteworthy that from 200 to 400 records (an increase of 200), the accuracy increased by 8%, while from 400 to 600 records (the next 200 records), the accuracy went up by only 2%, evidently showing the law of diminishing returns.

Fig 6: Prediction Sequence-diagram

Fig 7: Accuracy Sequence-diagram

Fig 8: Accuracy comparison chart
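For the comparison against Naïve Bayes discussed above, a baseline can be trained on the same split with MLlib's NaiveBayes, as in the following minimal sketch. The smoothing parameter is an assumed value, and MLlib's multinomial Naïve Bayes requires non-negative feature values; the `training` and `test` RDDs are those used for the random forest.

from pyspark.mllib.classification import NaiveBayes

# Train a Naive Bayes baseline on the same 70:30 split.
nb_model = NaiveBayes.train(training, lambda_=1.0)   # additive smoothing (assumed value)

# Compare predicted and true labels on the test set.
nb_preds = test.map(lambda p: (nb_model.predict(p.features), p.label))
nb_accuracy = nb_preds.filter(lambda pl: pl[0] == pl[1]).count() / float(test.count())
print("Naive Bayes accuracy = %.4f" % nb_accuracy)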


Fig 9: Accuracy graph of Random Forest

Fig 10: Computation time graph

Fig 11: Accuracy Error graph

CONCLUSION AND FUTURE ENHANCEMENT

Utilizing big data analytics, the healthcare data generated over time in the medical field can be processed faster for predicting diseases with no overhead. We proposed a scalable solution for predicting heart disease from its attributes and validated its accuracy. We implemented the random forest algorithm on the Spark framework for predicting heart disease, and showed that with as few as 600 dataset records we are able to achieve 98% accuracy.

As part of future work, we plan to explore the prediction of other diseases, such as the early prediction of certain types of cancer. We are also planning to investigate the impact of large supervised datasets at a colossal scale on performance and accuracy, running on high-performance clusters.

REFERENCES

[1] J. Sun and C. K. Reddy, "Big Data Analytics for Healthcare," tutorial presentation at the SIAM International Conference on Data Mining, Austin, TX, 2013.
[2] Mu-Hsing Kuo, Dillon Chrimes, Belaid Moa, and Wei Hu, "Design and Construction of a Big Data Analytics Framework for Health Applications," 2015 IEEE International Conference on Smart City/SocialCom/SustainCom together with DataCom 2015 and SC2 2015.
[3] K. Rajalakshmi and K. Nirmala, "Heart Disease Prediction with MapReduce by using Weighted Association Classifier and K-Means," Indian Journal of Science and Technology, Vol 9(19), DOI: 10.17485/ijst/2016/v9i19/93827, May 2016.
[4] Purushottam, Kanak Saxena, and Richa Sharma, "Efficient Heart Disease Prediction System using Decision Tree," International Conference on Computing, Communication and Automation (ICCCA 2015).
[5] Jian Fu, Junwei Sun, and Kaiyuan Wang, "SPARK—A Big Data Processing Platform for Machine Learning," 2016 International Conference on Industrial Informatics - Computing Technology, Intelligent Technology, Industrial Information Integration.
[6] Patil R. Priya and Kinariwala A. S., "Automated Diagnosis of Heart Disease using Random Forest Algorithm," International Journal of Advance Research, Ideas and Innovations in Technology.
[7] G. Hughes, "How big is 'Big Data' in healthcare?" URL: http://blogs.sas.com/content/hls/2011/10/21/how-big-is-big-data-inhealthcare/ [accessed 2014-09-26].
[8] M. Herland et al., "A review of data mining using big data in health informatics," Journal of Big Data, 1:2, 2014.
[9] https://hortonworks.com/apache/hdfs/
[10] https://spark.apache.org/
[11] http://data-flair.training/blogs/hadoop-mapreduce-vs-apache-spark/
[12] https://archive.ics.uci.edu/ml/datasets/Heart+Disease
[13] https://www.stat.berkeley.edu/~breiman/RandomForests/
[14] Ankush Verma, Ashik Hussain Mansuri, and Neelesh Jain, "Big Data Management Processing with Hadoop MapReduce and Spark Technology: A Comparison," 2016 Symposium on Colossal Data Analysis and Networking (CDAN).

