
DIAGNOSIS OF HEART DISEASE USING MACHINE LEARNING

by

MD. ISTIAQ HABIB KHAN

POST GRADUATE DIPLOMA IN INFORMATION AND COMMUNICATION TECHNOLOGY

Institute of Information and Communication Technology (IICT)


BANGLADESH UNIVERSITY OF ENGINEERING AND TECHNOLOGY (BUET)
October 2020
Dedicated

To

My Parents and Family Members

Table of Contents
List of Tables
List of Figures
Acknowledgment
Abstract

1 Introduction
1.1 Introduction
1.2 Literature Review
1.3 Objectives
1.4 Organization of the Report

2 Fundamentals of Machine Learning Algorithms
2.1 Machine Learning
2.2 Machine Learning Algorithms
2.2.1 Support Vector Machine
2.2.2 Decision Tree
2.2.3 Naïve Bayes Classifier
2.2.4 k-Nearest Neighbors Algorithm
2.2.5 Random Forest Classifier
2.2.6 Logistic Regression
2.2.7 Majority Voting

3 Methods for Data Driven Diagnosis
3.1 Working Procedure
3.2 Setting up Anaconda Distribution Package
3.3 Algorithm for Disease Prediction
3.4 Performance Metrics

4 Dataset Information
4.1 Three Individual Datasets
4.1.1 Dataset 1
4.1.2 Dataset 2
4.1.3 Dataset 3
4.2 Hybrid Dataset
4.2.1 Attributes of Hybrid Dataset
4.2.2 Normalization of Hybrid Dataset

5 Results & Discussion
5.1 Classification Results for Dataset 1
5.2 Classification Results for Dataset 2
5.3 Classification Results for Dataset 3
5.4 Classification Results for Hybrid Dataset
5.5 Comparative Results

6 Conclusion
6.1 Conclusion
6.2 Future Work

Appendix

References

LIST OF TABLES

Table No. Table Caption

Table 5.1 Accuracy score for dataset 1
Table 5.2 Confusion matrix and classification report for dataset 1
Table 5.3 Accuracy score with cross-validation for dataset 1
Table 5.4 Feature ranking for dataset 1
Table 5.5 Accuracy score for dataset 2
Table 5.6 Confusion matrix and classification report for dataset 2
Table 5.7 Accuracy score with cross-validation for dataset 2
Table 5.8 Feature ranking for dataset 2
Table 5.9 Accuracy score for dataset 3
Table 5.10 Confusion matrix and classification report for dataset 3
Table 5.11 Accuracy score with cross-validation for dataset 3
Table 5.12 Feature ranking for dataset 3
Table 5.13 Accuracy score for hybrid dataset
Table 5.14 Confusion matrix and classification report for hybrid dataset
Table 5.15 Comparative results for all datasets
Table 5.16 Comparative results of this work with the literature

LIST OF FIGURES

Figure No. Figure Caption

Figure 5.1 Accuracy score with cross-validation for dataset 1
Figure 5.2 Comparative results for all datasets
Figure A1 Download page of Anaconda distribution website
Figure A2 Anaconda installation process step 2
Figure A3 Anaconda installation process step 3
Figure A4 Anaconda installation process step 4
Figure A5 Anaconda installation process step 5
Figure A6 Anaconda installation process step 6
Figure A7 Anaconda navigator interface
Figure A8 Jupyter notebook interface

Acknowledgment

First of all, I would like to convey my gratitude to Almighty Allah for giving me the
opportunity to accomplish this project. I want to thank my supervisor Dr. Md. Rubaiyat
Hossain Mondal, Professor, IICT, BUET for giving me the chance to explore such an
interesting field of research and providing help and advice whenever I needed it. Without
his proper guidance, advice, continual encouragement, and active involvement in the
process of this work, it would not have been feasible.

Many thanks also go to all the teachers, officers, and staff of the Institute of Information
and Communication Technology (IICT) for their kind support and information
during the study.

Finally, I am very grateful to my parents and family members, whose continuous support
throughout my life has brought me this far in my career.

Abstract
Early detection of heart disease can help prevent its progression, and many risk factors
are associated with the disease. This project studies multiple datasets in order to find
the attributes and risk factors that are most valuable for heart disease prediction. The
first dataset, collected from the UCI machine learning repository, contains 14 attributes
(including the target attribute) and 303 instances. The second, collected from the Kaggle
repository, contains 10 attributes and 462 instances. The third, also available at the
Kaggle repository, contains 12 attributes and 70000 instances. Seven machine learning
algorithms are applied to these three individual datasets to study the most influential
attributes for heart disease prediction. One hybrid dataset is also generated using only
the attributes common to two of the individual datasets. The scikit-learn library for the
Python programming language is used for data analysis. A univariate feature selection
algorithm is applied to find the most valuable attributes associated with heart disease.
Heart disease is predicted using support vector machine (SVM), decision tree, k-nearest
neighbors (kNN), logistic regression, naïve Bayes, random forest, and majority voting.
The training and testing portions of each dataset are separated using the holdout and
cross-validation methods. The parameters of the individual algorithms are varied to find
the settings that give the highest accuracy. To evaluate the performance of the
algorithms, the classification report and confusion matrix are also calculated. Majority
voting over logistic regression, SVM, and naïve Bayes exhibits the best accuracy, 88.89%,
when applied to the first dataset. For the hybrid dataset, the classification accuracy is
lower than that of the individual datasets. Finally, the best result obtained in this
project is compared with the results of existing similar research approaches.


CHAPTER 1

Introduction

1.1 Introduction
Heart attack, or myocardial infarction, is one of the deadliest conditions in the world at
present, as it is the major cause of death and disability in many developed and developing
countries [1]. Most heart attacks occur due to coronary artery disease. A patient suffering
from a heart attack needs treatment within a very short time, so it is important to
determine whether a person is at risk of a heart attack from the associated risk factors.
Machine learning algorithms [2-4] are used in many application areas, including disease
prediction, and a number of research works have been reported for coronary artery disease
or heart disease [5-12]. Accurate analysis of medical data supports early heart disease
detection, patient care, and community services. However, the findings of these works
vary, and one reason is that different authors consider different attributes and collect
different datasets. Accuracy is also reduced when the medical data is incomplete.
Therefore, research is still required to find the most important attributes and to
understand how attribute selection influences disease prediction. This project focuses on
multiple datasets to find the important attributes associated with heart disease, and then
applies different machine learning algorithms to the attributes of these datasets.

1.2 Literature Review


A number of research papers report the use of machine learning algorithms to predict heart
disease [5-12]. Some works consider fuzzy logic [5-6, 8-9], while others consider the
application of machine learning (ML) [7, 10-12] in classifying heart disease patients.

The decision tree algorithm is one of the most popular data mining techniques used by
researchers for heart disease prediction. In [7], different types of decision tree are
compared to find out which performs better in predicting heart disease. That work uses a
model that combines discretization, decision trees, and voting to obtain a more accurate
method for heart disease prediction. Sensitivity, specificity, and accuracy are calculated
to compare the performance of the different types of decision tree.

A computer-based noninvasive coronary artery disease diagnosis system is presented in [8].
The aim of that work is to design a clinically interpretable fuzzy rule-based system. The
interval-scale variables are discretized, and the fuzzy rule-based system is then
formulated based on a neuro-fuzzy classifier. Multiple logistic regression and sequential
feature selection are used to select the required attributes. The combination of multiple
logistic regression and the neuro-fuzzy classifier exhibits the best performance.

A web-based fuzzy logic expert system is developed for the diagnosis of heart disease in
[9]. The system consists of a fuzzification module, a knowledge-based inference engine, and
a defuzzification module. The fuzzification module operates on every input using an
appropriate membership function. The inference engine then triggers the appropriate rule
from the knowledge base, and the output value is obtained with an appropriate
defuzzification method. HTML, CSS, JavaScript, jQuery, AJAX, PHP, Bootstrap, XML, and
MySQL are used to implement this web-based system. The system is reported to be cost
effective and efficient, and it showed very high accuracy when tested on the Cleveland
Clinic Foundation dataset from the UCI repository.

Ensemble classification, a method used to improve the accuracy of weak algorithms by
combining multiple classifiers, is the subject of investigation in [10]. A comparative
analytical approach is presented in that work to show that ensemble techniques can be
applied to improve accuracy in heart disease prediction. Ensemble techniques such as
bagging and boosting are shown to be effective in improving the prediction accuracy of
weak classifiers. Feature selection is also applied, and it further enhances the
prediction accuracy.

A mean-based splitting approach is used to partition a heart disease dataset in [11]. A
homogeneous ensemble is generated from the partitions, which are modeled by different
classification and regression trees. A classification accuracy of 93% is reported in this
work.

A stacked support vector machine (SVM) is reported for the diagnosis of heart disease in
[12]. In the stacked SVM expert system, the first SVM removes the irrelevant features of
the dataset, while the second SVM is used to predict the possibility of heart disease. A
hybrid grid search technique is considered for optimization of the SVMs. Compared with the
stand-alone SVM algorithm, the stacked SVM shows better performance in terms of reduced
training time and better classification accuracy.

Different data mining techniques, performance tools, and methods have been implemented,
which provide different perspectives on the prediction of heart disease. However, none of
the aforementioned research works shows how the results vary across multiple datasets.

1.3 Objectives
The goal of this project is to predict heart disease using different attributes. The
novelty of this work lies in considering multiple datasets and preparing a hybrid dataset
for heart disease prediction. Three different datasets are used in this project to find
the most important attributes and to predict heart disease. One hybrid dataset is also
created using only the attributes common to two of the individual datasets.

The specific aims of the work are as follows:

• To study the most influential attributes of three different datasets for predicting
heart disease.
• To prepare a hybrid dataset using two individual datasets.
• To apply different machine learning algorithms for the prediction of heart disease
on three datasets and one hybrid dataset.
• To compare the obtained results with the results reported in the literature.

1.4 Organization of the Report


The rest of this project report is organized as follows. Chapter 2 discusses the
background concepts necessary to understand this project work, including a brief
discussion of the machine learning algorithms used. Chapter 3 discusses the overall
methodology of the project step by step, and also provides the algorithm for
classification and the performance metrics for the diagnosis of heart disease. Three
individual datasets and one hybrid dataset are used to carry out this project work, and
Chapter 4 provides a detailed discussion of these datasets. Chapter 5 presents the results
on the performance of heart disease prediction with the necessary discussion, covering the
classification results for all the datasets and the comparative results. Chapter 6
provides the concluding remarks and a short discussion of considerations for future work.
Finally, the references are listed after Chapter 6, and the installation of the Anaconda
distribution package is described in the appendix.

CHAPTER 2

Fundamentals of Machine Learning Algorithms


This chapter discusses the theoretical knowledge necessary for understanding the project
work. First, a brief overview of machine learning is provided. Then, the basic concepts of
the machine learning algorithms used in this project are discussed.

2.1 Machine Learning


Machine learning is the field of study that enables computers to learn automatically and
improve from experience. Machine learning does not require any explicit programming or
instructions for performing tasks. Machine learning algorithms make predictions by
creating a mathematical model based on sample data, specifically known as training data.
These algorithms are often categorized as supervised and unsupervised. In supervised
learning, the algorithm is provided with a set of data which contains inputs and their desired
outputs. Unsupervised learning algorithms function without any labeled or categorized
data, that is, the algorithm is provided with data containing only inputs.

2.2 Machine Learning Algorithms


This section provides a brief discussion of the machine learning algorithms that are used
to carry out this project.

2.2.1 Support Vector Machine

Support vector machine (SVM) is a supervised machine learning model that can be used for
both classification and regression analysis. Given training data in which each example is
labeled as belonging to one of two categories, the SVM training algorithm builds a model
that assigns every new example to one of the two categories. For this reason, SVM is known
as a non-probabilistic binary linear classifier. SVM can also perform non-linear
classification efficiently by implicitly mapping the inputs into a high-dimensional
feature space, which is known as the kernel trick. SVM performs its analysis by
constructing a hyperplane or a set of hyperplanes in a high-dimensional space. The
functional margin is the distance between a hyperplane and its nearest training data point
of any class. In general, the larger the functional margin, the lower the generalization
error of the classifier, so a good separation is achieved by the hyperplane with the
largest margin.
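
As a rough illustration of the margin idea described above, the following sketch computes the distance from a candidate hyperplane to its nearest training point and compares two candidate hyperplanes. The points, the hyperplane coefficients, and the function names are hypothetical and chosen only for this example; an actual SVM solver searches for the maximizing hyperplane rather than comparing two fixed candidates.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm


def margin(w, b, points):
    """Smallest distance from the hyperplane to any training point."""
    return min(distance_to_hyperplane(w, b, p) for p in points)


# Hypothetical 2-D training points: the first two belong to one class,
# the last two to the other.
points = [(1.0, 1.0), (2.0, 3.0), (4.0, 4.0), (5.0, 6.0)]

# Two candidate separating lines: x1 + x2 - 6.5 = 0 and x1 - 3 = 0.
margin_a = margin((1.0, 1.0), -6.5, points)
margin_b = margin((1.0, 0.0), -3.0, points)

# The candidate with the larger margin is the preferred separator.
print(f"margin A = {margin_a:.3f}, margin B = {margin_b:.3f}")
```

Both lines separate the two classes, but the first leaves more room around the nearest points, which is the property that margin maximization rewards.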

2.2.2 Decision Tree

The decision tree algorithm is one of the most commonly used data mining techniques and is
based on a predictive modeling approach. Its objective is to predict the value of a target
variable from several input variables. Two types of decision tree are used in data
analysis: a classification tree, where the predicted result is the class to which the data
belongs, and a regression tree, where the predicted result is a real number. The structure
of a decision tree model is similar to a flowchart: each non-leaf node represents a test on
an attribute, each branch represents an outcome of the test, and each terminal node
represents a class label. The decision tree algorithm is easy to interpret and performs
well with large datasets, which makes it a popular data mining method.
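
The flowchart structure can be made concrete with a small hand-written tree. This is only a sketch: the attributes, thresholds, and labels below are hypothetical and are not clinical rules or results from this project. A real classification tree learns its tests from training data.

```python
# Hypothetical tree: each dict is a non-leaf node (a test on one attribute),
# each key "above" / "below_or_equal" is a branch, and each string is a leaf label.
tree = {
    "attribute": "age",
    "threshold": 55,
    "above": {
        "attribute": "chol",
        "threshold": 240,
        "above": "disease",
        "below_or_equal": "no disease",
    },
    "below_or_equal": "no disease",
}


def classify(node, sample):
    """Follow the branches from the root until a leaf (class label) is reached."""
    while isinstance(node, dict):
        if sample[node["attribute"]] > node["threshold"]:
            node = node["above"]
        else:
            node = node["below_or_equal"]
    return node


print(classify(tree, {"age": 60, "chol": 250}))
print(classify(tree, {"age": 40, "chol": 300}))
```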

2.2.3 Naïve Bayes Classifier

The naïve Bayes classifier belongs to the family of simple probabilistic classifiers in
machine learning. It is based on Bayes' theorem. The algorithm builds classifier models
that assign class labels to problem instances. For some types of probability model, the
naïve Bayes classifier can be trained very efficiently. The classifier is highly scalable,
and one prominent feature of the naïve Bayes technique is that only a small amount of
training data is required to estimate the parameters necessary for classification.

2.2.4 k-Nearest Neighbors Algorithm

The k-nearest neighbors algorithm (kNN) is a supervised, non-parametric machine learning
method. It can be used for classification, where the output is a class membership, and
also for regression. The algorithm is simple, versatile, and easy to implement. However,
it becomes slower as the number of examples and predictor variables increases. kNN is
considered a type of lazy learning, also known as instance-based learning.
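
A minimal sketch of kNN classification is given below, assuming Euclidean distance and a simple majority vote among the neighbors. The training points and labels are hypothetical. The sketch also shows why the method is called lazy: there is no training step, and every prediction scans all stored examples, which is the reason it slows down as the data grows.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by a majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical 2-D examples forming two well-separated groups.
train_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (2, 2), k=3))
print(knn_predict(train_points, train_labels, (6, 5), k=3))
```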

2.2.5 Random Forest Classifier

Random forest classifier is a supervised machine learning algorithm that is used for
classification and regression problems. It is an ensemble learning method that creates
multiple decision trees when the model is trained and receives prediction from each of
them. The final output is the mode of the classes of all the individual trees in case of
classification problems. When it is used to solve regression problems, the final result is the
mean prediction of the individual trees. Averaging the results in this way reduces the
overfitting that can occur when a single decision tree is used. Because of its simplicity
and versatility, the random forest classifier is one of the most used algorithms in
data science.

2.2.6 Logistic Regression

Logistic regression is a machine learning algorithm based on the concept of probability.
It is a predictive analysis algorithm that is mainly used to solve classification
problems. Logistic regression uses a logistic function, also known as the sigmoid
function, to model a binary dependent variable. The algorithm has three types: binomial,
multinomial, and ordinal. When the dependent variable has only two possible outcomes, it
is known as binary or binomial logistic regression. In multinomial logistic regression,
the outcome can have three or more possible types that are unordered, that is, with no
quantitative significance. The third type is ordinal logistic regression, where the
dependent variable can have three or more ordered possible outcomes.
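
To make the role of the sigmoid function concrete, the sketch below shows how a binomial logistic regression model turns a weighted sum of inputs into a probability and then into a class. The weights, bias, and the 0.5 threshold are hypothetical values chosen for illustration; in practice the weights are estimated from training data.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Binary decision obtained by thresholding the probability."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0


# Hypothetical model with two input features.
weights, bias = (0.8, -0.5), -1.0

print(predict_probability(weights, bias, (3.0, 1.0)))
print(predict_class(weights, bias, (3.0, 1.0)))
print(predict_class(weights, bias, (0.5, 2.0)))
```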

2.2.7 Majority Voting

The majority voting algorithm combines multiple base algorithms. In this project, majority
voting refers to hard voting, in which the predicted class is the one chosen by the
majority of the individual classifiers. For example, if a majority voting algorithm
consists of three individual classifiers whose predictions for a sample are class P, class
N, and class P, then the resulting prediction is class P, as there are two votes for class
P compared to one for class N.
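
The voting rule in this example can be sketched in a few lines. The function name and the per-classifier prediction lists are hypothetical; the first sample reproduces the P, N, P case described above.

```python
from collections import Counter


def hard_vote(predictions_per_classifier):
    """Combine per-classifier label lists into one label per sample by majority."""
    combined = []
    for sample_votes in zip(*predictions_per_classifier):
        label, _ = Counter(sample_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions of three classifiers for three samples.
clf_a = ["P", "N", "P"]
clf_b = ["N", "N", "P"]
clf_c = ["P", "P", "N"]

result = hard_vote([clf_a, clf_b, clf_c])
print(result)
```

With an odd number of classifiers and two classes, a tie cannot occur, which is one reason to combine three base classifiers.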

CHAPTER 3

Methods for Data Driven Diagnosis


This chapter describes the step-by-step working procedure of the project. The setup
process of the Anaconda distribution package and the algorithm for the classifiers are
discussed, along with the performance metrics for the diagnosis of heart disease.

3.1 Working Procedure


The overall working procedures to carry out this project are described as follows:

1. Three different datasets are collected from Kaggle and the UCI machine learning
repository, which are online communities for data scientists.
2. The Python programming language is used to carry out the data analysis. Python is
deployed through the Anaconda distribution, a free and open-source distribution that
simplifies package management. It includes a graphical user interface known as
Anaconda Navigator, which provides applications such as JupyterLab, Jupyter Notebook,
Spyder, Orange, and RStudio. In this project, Jupyter Notebook is used to run the code
for data analysis.
3. Different supervised machine learning algorithms are used, including support vector
machine, decision tree, k-nearest neighbors, naïve Bayes, random forest, and logistic
regression. A majority voting classifier, an ensemble classification method that
improves the accuracy of weak algorithms by combining multiple classifiers, is also
used. These algorithms are implemented with the scikit-learn library, a free machine
learning library that is included in the Anaconda distribution package.
4. The accuracy score for predicting heart disease is calculated for the three datasets,
using both the holdout method and the cross-validation method. For the holdout method,
four different testing sizes are used: 10%, 15%, 20%, and 25% of the total data
samples. The train_test_split() function from the scikit-learn library is used to split
the datasets into training and testing portions. For cross-validation, the dataset is
divided into k equal groups, and the result of cross-validation changes with the value
of k. A large value of k increases the computation time.
5. The parameters of the learning algorithms are varied, and the results are compared to
find the settings that give the highest accuracy score. For the support vector machine,
three kernels are applied: linear, rbf, and sigmoid. For the rbf and sigmoid kernels,
the C value is varied from 1 to 5 and the gamma value is switched between auto and
scale. For the k-nearest neighbors classifier, the value of k is varied from 1 to 50 to
find the best possible accuracy score. For logistic regression, two solvers are
applied: lbfgs and liblinear.
6. After the accuracy scores are calculated for the three datasets under the conditions
in steps 4 and 5, the results are compared to find the best accuracy score for each
algorithm. The conditions that give the best accuracy are noted, and the confusion
matrix and classification report (precision, recall, and F1-score) are calculated for
those conditions.
7. Feature selection is performed to find the attributes of a dataset that contribute
most to the diagnosis of heart disease. The method used is univariate feature
selection, in which each feature is scored individually against a specified criterion
and the features with the highest scores or ranks are selected.
8. One hybrid dataset is created using only the attributes common to two of the
individual datasets (detailed information about the hybrid dataset is given in the next
chapter). All the required normalization is done when constructing the hybrid dataset.
The accuracy score, confusion matrix, and classification report are then calculated for
this hybrid dataset according to steps 4, 5, and 6.
9. Finally, the best result obtained in this project is compared with the results
reported in the literature [7-8, 10].
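
Steps 4, 5, and 7 of the procedure can be sketched with scikit-learn as follows. This is an illustrative sketch, not the project's actual notebook: the data are synthetic (generated with make_classification as a stand-in for a heart disease dataset), the random_state values, max_iter, the 10-fold setting, and the f_classif scoring function are assumptions, and the dictionary keys are hypothetical names used only to label the settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Assumption: synthetic data with 13 input attributes and a binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)


def holdout_accuracy(model, test_size):
    """Train on the training portion and return accuracy on the held-out portion."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    return model.fit(X_train, y_train).score(X_test, y_test)


results = {}

# Steps 4 and 5: vary the testing size and the classifier parameters.
for test_size in (0.10, 0.15, 0.20, 0.25):
    results[("svm", "linear", test_size)] = holdout_accuracy(
        SVC(kernel="linear"), test_size)
    for kernel in ("rbf", "sigmoid"):
        for c in range(1, 6):
            for gamma in ("auto", "scale"):
                model = SVC(kernel=kernel, C=float(c), gamma=gamma)
                results[("svm", kernel, c, gamma, test_size)] = holdout_accuracy(
                    model, test_size)
    for k in range(1, 51):
        results[("knn", k, test_size)] = holdout_accuracy(
            KNeighborsClassifier(n_neighbors=k), test_size)
    for solver in ("lbfgs", "liblinear"):
        model = LogisticRegression(solver=solver, max_iter=1000)
        results[("logreg", solver, test_size)] = holdout_accuracy(model, test_size)

best_setting = max(results, key=results.get)
best_accuracy = results[best_setting]
print("best holdout setting:", best_setting, "accuracy:", round(best_accuracy, 4))

# Step 4 (cross-validation): k = 10 groups is an assumption for this sketch.
cv_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=10)
print("mean 10-fold accuracy:", round(cv_scores.mean(), 4))

# Step 7: univariate feature selection; each feature is scored on its own.
selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
ranking = sorted(range(X.shape[1]), key=lambda i: selector.scores_[i], reverse=True)
print("feature indices ranked by score:", ranking)
```

Selecting the best setting by test accuracy, as done here, mirrors the procedure described above; it tends to give an optimistic estimate, so a separate validation split would be needed for an unbiased figure.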

3.2 Setting up Anaconda Distribution Package


Anaconda is a free and open-source distribution of the Python and R programming
languages. It simplifies package management and deployment for Python, and can be used
for large-scale data processing. The Anaconda distribution is used to carry out this
project work. The installer file is downloaded from the Anaconda website [13]. The
step-by-step installation process is described in the Appendix.

The Anaconda distribution includes a graphical user interface called Anaconda Navigator,
which allows users to launch applications and manage conda packages, environments, and
channels without using a command-line interface. Applications available in the Navigator
include JupyterLab, Jupyter Notebook, Spyder, Orange, and RStudio. In this project,
Jupyter Notebook is used to run the code for data analysis. It is a web-based interactive
computational environment for creating Jupyter notebook documents.

3.3 Algorithm for Disease Prediction


Algorithm: Detection of heart disease using classifiers

Input: Heart disease dataset with several attributes

Output: Accuracy score/Confusion matrix/Classification report of predicted values

Process:

Step 1: Import the sklearn, pandas, and numpy libraries

Step 2: Import the classifier functions

Step 3: Import the train_test_split function, or import the cross_val_score function

Step 4: i) Import the accuracy_score function
ii) Import the confusion_matrix function
iii) Import the classification_report function

Step 5: Load the CSV file containing the data using the read_csv() function

Step 6: Separate the input and target attributes

Step 7: i) For the holdout method, separate the train and test data using the
train_test_split() function
ii) Model the classifier using the model.fit() function
iii) Predict the test data using the model.predict() function
Or apply cross-validation using the cross_val_score() function

Step 8: i) Find the accuracy using the accuracy_score() function
ii) Find the confusion matrix using the confusion_matrix() function
iii) Find the classification report using the classification_report() function
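
The steps of the algorithm map onto scikit-learn roughly as in the sketch below. It is a minimal illustration under stated assumptions rather than the code used in the project: synthetic data from make_classification replaces the read_csv() call of Step 5 so that the example runs without a data file, the 15% testing size, random_state, stratify, max_iter, and cv=5 are assumed values, and the voting classifier over logistic regression, SVM, and naïve Bayes is chosen because that combination is the one reported as best in the abstract.

```python
# Steps 1 to 4: import the classifiers, the splitting helpers, and the metrics.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Steps 5 and 6 (stand-in): with a real file, pandas.read_csv() would load the
# data and the target column would be separated from the input columns.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           random_state=0)

# Step 7 (holdout): split, fit, and predict.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y)

model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(kernel="linear")),
        ("nb", GaussianNB()),
    ],
    voting="hard",  # hard voting: each classifier casts one vote per sample
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Step 7 (alternative): cross-validation on the full dataset.
cv_scores = cross_val_score(model, X, y, cv=5)

# Step 8: accuracy, confusion matrix, and classification report.
accuracy = accuracy_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)

print("holdout accuracy:", round(accuracy, 4))
print("mean cross-validation accuracy:", round(cv_scores.mean(), 4))
print(matrix)
print(report)
```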

3.4 Performance Metrics


Similar to the work on disease diagnosis reported in the literature [14], this work
considers a number of metrics for assessing the diagnosis of patients with heart disease.
The metrics are training accuracy, testing accuracy, precision, recall, and F1-measure.
These metrics are defined in terms of true positives (TP), true negatives (TN), false
negatives (FN), and false positives (FP). In the context of this work, TP refers to the
number of patient samples that are correctly classified as abnormal, meaning that the
patients have heart disease. TN is the number of normal people who are correctly
classified as having a normal heart condition. FN refers to the number of people who
actually have heart disease but remain undetected by the system. FP refers to the number
of samples that are wrongly detected as having heart disease. The metrics are defined as
follows. The accuracy is the percentage of all normal and abnormal vectors that are
correctly classified. Accuracy, ac, can be expressed as follows.

$$ac = \frac{TP + TN}{TP + TN + FP + FN} \tag{3.1}$$

Training accuracy and testing accuracy are defined as the accuracy obtained for
training and testing samples, respectively. Precision, pr , can be mathematically written as
follows.

$$pr = \frac{TP}{TP + FP} \tag{3.2}$$

The term recall, re, is also known as sensitivity given by

$$re = \frac{TP}{TP + FN} \tag{3.3}$$

The F1-measure, f1, is given by

$$f1 = \frac{2 \times pr \times re}{pr + re} \tag{3.4}$$

A confusion matrix summarizes the performance of a classification algorithm by tabulating
the predicted instances against the actual instances.
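
Equations (3.1) to (3.4) can be checked with a short worked example. The counts below are hypothetical and are not results from this project; the function name is also only illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-measure from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (3.1)
    precision = tp / (tp + fp)                          # Equation (3.2)
    recall = tp / (tp + fn)                             # Equation (3.3)
    f1 = 2 * precision * recall / (precision + recall)  # Equation (3.4)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch assumes that at least one sample is predicted positive and at least one is actually positive; otherwise the precision or recall denominator is zero and the metric is undefined.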

CHAPTER 4

Dataset Information
Three different datasets have been used in this project work to carry out data analysis for
heart disease prediction. The detailed information about these datasets is discussed in this
chapter.

4.1 Three Individual Datasets

4.1.1 Dataset 1

Dataset 1 is collected from UCI machine learning repository [15]. It contains 14 attributes
including the target attribute and 303 instances. The attributes are as follows:

1. age (in years)
2. sex (1:male, 0:female)
3. cp (chest pain type) value: 1-4 (1:typical angina; 2:atypical angina; 3:non-anginal
pain; 4:asymptomatic)
4. trestbps (resting blood pressure in mmHg on admission to the hospital)
5. chol (serum cholesterol in mg/dl)
6. fbs (fasting blood sugar > 120 mg/dl) (1:true; 0:false)
7. restecg (resting electrocardiographic results) value: 0-2 (0:normal; 1:having ST-T
wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05
mV); 2:showing probable or definite left ventricular hypertrophy by Estes'
criteria)
8. thalach (maximum heart rate achieved)
9. exang (exercise induced angina) (1:yes; 0:no)
10. oldpeak (ST depression induced by exercise relative to rest)
11. slope (the slope of the peak exercise ST segment) value: 1-3 (1:upsloping; 2:flat;
3:downsloping)
12. ca (number of major vessels (0-3) colored by fluoroscopy)
13. thal (3:normal; 6:fixed defect; 7:reversible defect)
14. target

4.1.2 Dataset 2

Dataset 2 is collected from Kaggle [16]. It contains 10 attributes including the target
attribute and 462 instances. The attributes are as follows:

1. sbp (systolic blood pressure)
2. tobacco (cumulative tobacco)
3. ldl (low density lipoprotein cholesterol)
4. adiposity (a numeric vector)
5. famhist (family history of heart disease) value 1:present; 0:absent
6. typea (type-A behavior)
7. obesity (a numeric vector)
8. alcohol (current alcohol consumption)
9. age
10. chd (response, coronary artery disease)

4.1.3 Dataset 3

Dataset 3 is collected from Kaggle [17]. It contains 12 attributes including the target
attribute and 70000 instances. The attributes are as follows:

1. age (in days)
2. gender (1:female; 2:male)
3. height (cm)
4. weight (kg)
5. ap_hi (systolic blood pressure)
6. ap_lo (diastolic blood pressure)
7. cholesterol (1:normal; 2:above normal; 3:well above normal)
8. gluc (1:normal; 2:above normal; 3:well above normal)
9. smoke (whether patient smokes or not)
10. alco (alcohol intake)
11. active (physical activity)
12. cardio (target)

4.2 Hybrid Dataset


One hybrid dataset is created using only the attributes common to two of the individual
datasets described above. All the required normalization is done when creating the hybrid
dataset. The detailed information about this hybrid dataset is as follows.

4.2.1 Attributes of Hybrid Dataset

The hybrid dataset is created from datasets 1 and 3, using only the attributes common to
both. It contains 6 attributes, including the target attribute, and 70303 instances. The
6 common attributes considered in creating this hybrid dataset are given below (attribute
names are taken from dataset 3):

1. age
2. gender
3. ap_hi
4. cholesterol
5. gluc
6. cardio (target attribute)
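
The selection of common attributes can be sketched as follows. This is only an illustration, not the code used in this project: the helper names (D1_TO_D3, project_dataset1, project_dataset3, build_hybrid) are hypothetical, and the rows in any usage are assumed to be plain dictionaries. The name mapping itself follows the attribute pairs described in this chapter. The value normalization of Section 4.2.2 is a separate step that is not shown here.

```python
# Attribute names of dataset 1 mapped to the dataset 3 names used in the
# hybrid dataset (pairs taken from the text; helper names are hypothetical).
D1_TO_D3 = {
    "age": "age",
    "sex": "gender",
    "trestbps": "ap_hi",
    "chol": "cholesterol",
    "fbs": "gluc",
    "target": "cardio",
}
HYBRID_COLUMNS = ["age", "gender", "ap_hi", "cholesterol", "gluc", "cardio"]


def project_dataset1(row):
    """Rename the common dataset 1 fields to dataset 3 names; drop the rest."""
    renamed = {D1_TO_D3[name]: value for name, value in row.items() if name in D1_TO_D3}
    return {name: renamed[name] for name in HYBRID_COLUMNS}


def project_dataset3(row):
    """Keep only the six common fields of a dataset 3 row."""
    return {name: row[name] for name in HYBRID_COLUMNS}


def build_hybrid(rows1, rows3):
    """Stack the projected rows of both datasets into one list."""
    return [project_dataset1(r) for r in rows1] + [project_dataset3(r) for r in rows3]
```

With 303 rows from dataset 1 and 70000 rows from dataset 3, such a stacking yields the 70303 instances stated above.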

4.2.2 Normalization of Hybrid Dataset

Normalization is required to create this hybrid dataset. The detailed normalization process
for each of the 5 attributes (excluding target attribute cardio) is discussed as follows.

Attribute 1: age

In dataset 3, age is given in days. But in dataset 1, age is given in years. So, age in dataset
3 is converted from days to years.

Attribute 2: gender

In dataset 3, female is represented as 1 and male as 2. In dataset 1, female is represented
as 0 and male as 1. The values of dataset 1 are therefore changed to match those of dataset 3
(0 becomes 1 and 1 becomes 2).

Attribute 3: ap_hi

This attribute refers to systolic blood pressure. In dataset 3, it is named ap_hi and the
values are given in mmHg. In dataset 1, it is named trestbps and the values are also given
in mmHg. As the unit is the same in both datasets, no additional normalization is required
for this attribute.

Attribute 4: cholesterol

This attribute refers to the serum cholesterol level. In dataset 3, it is named cholesterol
and takes the values 1, 2, and 3, with the following meanings:

1: normal

2: above normal

3: well above normal

In dataset 1, this attribute is named chol and the values are given in mg/dl, so the values
of dataset 1 need normalization. They are mapped to the coding of dataset 3 as follows:

1: value less than 200 mg/dl

2: value 200-239 mg/dl

3: value equal to or greater than 240 mg/dl
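
The cholesterol mapping above can be written as a small function. This is a minimal sketch; the function name chol_to_category is hypothetical, and non-integer readings between 239 and 240 mg/dl are assumed to fall into category 2, which the text does not specify.

```python
def chol_to_category(chol_mg_dl):
    """Map serum cholesterol in mg/dl (dataset 1) to the 1/2/3 coding of dataset 3."""
    if chol_mg_dl < 200:
        return 1  # normal: less than 200 mg/dl
    if chol_mg_dl < 240:
        return 2  # above normal: 200-239 mg/dl
    return 3      # well above normal: 240 mg/dl or more


# Illustrative values only, not taken from the dataset.
print([chol_to_category(v) for v in (180, 200, 239, 240)])  # prints [1, 2, 2, 3]
```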



Attribute 5: gluc

This attribute refers to the glucose level or blood sugar level. In dataset 3, it is named
gluc and takes the values 1, 2, and 3, with the following meanings:

1: normal

2: above normal

3: well above normal

In dataset 1, this attribute is named fbs and takes the values 0 and 1, with the following
meanings:

0: value less than or equal to 120 mg/dl

1: value greater than 120 mg/dl

Normalization is done for dataset 3. The value 1 is replaced with 0 and the values 2 and 3
are replaced with 1.
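
The remaining recodings (age, gender, and gluc) can be sketched in the same way. The function names are hypothetical. Two details are assumptions because the text does not state them: the divisor of 365.25 days per year, and the truncation of the result to whole years.

```python
DAYS_PER_YEAR = 365.25  # assumption: the divisor actually used is not stated in the text


def age_days_to_years(age_days):
    """Dataset 3: convert age in days to whole years (truncation is an assumption)."""
    return int(age_days // DAYS_PER_YEAR)


def gender_d1_to_d3(sex):
    """Dataset 1 coding (0 = female, 1 = male) to dataset 3 coding (1 = female, 2 = male)."""
    return {0: 1, 1: 2}[sex]


def gluc_d3_to_binary(gluc):
    """Dataset 3 coding (1, 2, 3) to the 0/1 coding of fbs in dataset 1."""
    return {1: 0, 2: 1, 3: 1}[gluc]
```

The dictionary lookups raise a KeyError for any value outside the documented codings, which makes unexpected entries visible instead of silently recoding them.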

CHAPTER 5

Results and Discussion


This chapter provides the results and associated discussion on the performance of heart
disease prediction. For this, concepts of artificial intelligence, particularly machine
learning classifiers and performance metrics [18-23], are taken into consideration. As
mentioned earlier, the results are obtained using Python with the Anaconda distribution.

5.1 Classification Results for Dataset 1


This section provides the results and associated discussion on the performance of heart
disease prediction using dataset 1. Table 5.1 shows the testing accuracy of dataset 1 for
different classifiers. In this case, the parameters of the classifiers are varied. For example,
for the case of SVM, three different kernels are considered and the values of C and gamma
are varied.
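
The SVM settings that are varied can be enumerated as a parameter grid. The sketch below only lists the 25 combinations that appear in the accuracy tables of this chapter; it does not train any model. The function name svm_configurations is hypothetical. The dictionary keys mirror the kernel, gamma, and C keyword arguments of scikit-learn's SVC, which is assumed here to be the implementation used.

```python
from itertools import product

C_VALUES = [1, 2, 3, 4, 5]


def svm_configurations():
    """List every SVM setting reported in the tables: 5 kernel settings x 5 values of C."""
    # The linear kernel is listed without a gamma setting.
    configs = [{"kernel": "linear", "C": c} for c in C_VALUES]
    # The rbf and sigmoid kernels are each combined with gamma = auto and gamma = scale.
    for kernel, gamma, c in product(["rbf", "sigmoid"], ["auto", "scale"], C_VALUES):
        configs.append({"kernel": kernel, "gamma": gamma, "C": c})
    return configs


print(len(svm_configurations()))  # prints 25
```

Each dictionary could be passed to a classifier constructor inside a loop over the four test sizes, giving one accuracy value per table cell.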

Table 5.1: Accuracy score for dataset 1

Algorithms                              Test=25%  Test=20%  Test=15%  Test=10%
Decision Tree Classifier                74.27%    73.83%    75.33%    76.33%
SVM (kernel=linear; C=1)                83.73%    83.83%    85.11%    84.67%
SVM (kernel=linear; C=2)                82.27%    83.16%    83.55%    85.67%
SVM (kernel=linear; C=3)                81.73%    82.33%    80.89%    83.33%
SVM (kernel=linear; C=4)                81.87%    81.66%    83.78%    84.33%
SVM (kernel=linear; C=5)                82.8%     83.33%    81.56%    84%
SVM (kernel=rbf; gamma=auto; C=1)       53.87%    53.5%     57.55%    57.67%
SVM (kernel=rbf; gamma=auto; C=2)       52.66%    53.33%    55.3%     56.33%
SVM (kernel=rbf; gamma=auto; C=3)       56.13%    53.83%    59.11%    57.3%
SVM (kernel=rbf; gamma=auto; C=4)       56.53%    56.67%    52.22%    55.66%
SVM (kernel=rbf; gamma=auto; C=5)       56.27%    54.16%    56.89%    53.33%
SVM (kernel=rbf; gamma=scale; C=1)      67.86%    66.33%    64.67%    63.66%
SVM (kernel=rbf; gamma=scale; C=2)      67.6%     66.83%    68%       68.33%
SVM (kernel=rbf; gamma=scale; C=3)      69.2%     66.83%    67.11%    65.67%
SVM (kernel=rbf; gamma=scale; C=4)      69.07%    69.16%    68.89%    66.33%
SVM (kernel=rbf; gamma=scale; C=5)      68.13%    70.5%     70%       67.33%
SVM (kernel=sigmoid; gamma=auto; C=1)   54.66%    51.33%    53.78%    55%
SVM (kernel=sigmoid; gamma=auto; C=2)   53.07%    56.16%    54.44%    54.33%
SVM (kernel=sigmoid; gamma=auto; C=3)   54.53%    57.83%    51.56%    54%
SVM (kernel=sigmoid; gamma=auto; C=4)   51.6%     53.67%    55.78%    57.33%
SVM (kernel=sigmoid; gamma=auto; C=5)   54.27%    51%       55.56%    54.3%
SVM (kernel=sigmoid; gamma=scale; C=1)  53.2%     54.17%    54.66%    59.33%
SVM (kernel=sigmoid; gamma=scale; C=2)  53.33%    60.16%    58.22%    60.67%
SVM (kernel=sigmoid; gamma=scale; C=3)  58.8%     54.67%    61.11%    57%
SVM (kernel=sigmoid; gamma=scale; C=4)  59.07%    58.66%    56.44%    57.67%
SVM (kernel=sigmoid; gamma=scale; C=5)  55.2%     55.17%    52.89%    57.66%
kNN Classifier                          69.33%    73.33%    68.89%    76.67%
  (best number of neighbors)            14        6, 23     5         5, 9
Naïve Bayes Classifier                  83.33%    81.3%     84.22%    83%
Random Forest Classifier                80.8%     83.83%    82.89%    80.33%
Logistic Regression (solver=lbfgs)      84.13%    82.17%    81.78%    83.33%
Logistic Regression (solver=liblinear)  84.53%    81.33%    85.11%    84.3%
Majority Voting Classifier              85.33%    88.3%     88.89%    86.67%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Table 5.1 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 88.89% when the test size is 15% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear, C=2; test size 10%): 85.67%, logistic regression (solver=liblinear; test size
15%): 85.11%, naïve Bayes classifier (test size 15%): 84.22%, random forest classifier
(test size 20%): 83.83%, kNN classifier (k=5; test size 10%): 76.67%, and decision tree
classifier (test size 10%): 76.33%. Next, the confusion matrix and classification report are
calculated for these conditions, and the results are shown in Table 5.2. It can be seen from
Table 5.2 that among all the classifiers, majority voting has the best precision of 90%, the
best recall of 86% (shared with SVM), and the best F1 score of 88%.
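
The majority voting step can be illustrated with a short sketch. It assumes hard voting, where each of the three classifiers casts one vote per sample and the most frequent label wins; the function name majority_vote and the example predictions are hypothetical and are not results from this project.

```python
from collections import Counter


def majority_vote(*predictions):
    """Hard voting: return, for each sample, the label predicted by most classifiers."""
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    # zip(*predictions) groups the votes of all classifiers for one sample.
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Made-up predictions of three classifiers for four patients (1 = disease, 0 = no disease).
logistic_regression = [1, 0, 1, 0]
svm = [1, 1, 0, 0]
naive_bayes = [0, 1, 1, 0]
print(majority_vote(logistic_regression, svm, naive_bayes))  # prints [1, 1, 1, 0]
```

With three voters and two classes a tie cannot occur, so the combined label is always well defined.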

Table 5.2: Confusion matrix and classification report for dataset 1
(TN = true negative, FP = false positive, FN = false negative, TP = true positive)

Algorithms                              Test size  TN    FP    FN    TP    Precision  Recall  F1-score
Decision Tree Classifier                10%        12    4     3     11    0.73       0.79    0.76
SVM (kernel=linear; C=2)                10%        14    2     2     12    0.86       0.86    0.86
kNN Classifier (k=5)                    10%        10    3     5     12    0.80       0.71    0.75
Naïve Bayes Classifier                  15%        17    3     4     21    0.88       0.84    0.86
Random Forest Classifier                20%        29    3     7     21    0.88       0.75    0.81
Logistic Regression (solver=liblinear)  15%        20    3     4     18    0.86       0.82    0.84
Majority Voting Classifier              15%        22    2     3     18    0.90       0.86    0.88

Majority voting combines logistic regression, SVM, and naïve Bayes.
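
The classification report values follow directly from the confusion matrix counts. The sketch below uses the standard definitions of precision, recall, F1 score, and accuracy and applies them to the majority voting row of Table 5.2; the function name classification_metrics is hypothetical, and the counts are assumed to be such that no denominator is zero.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute precision, recall, F1 score, and accuracy from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Majority voting row of Table 5.2: TN = 22, FP = 2, FN = 3, TP = 18.
metrics = classification_metrics(tn=22, fp=2, fn=3, tp=18)
print({name: round(value, 2) for name, value in metrics.items()})
# prints {'precision': 0.9, 'recall': 0.86, 'f1': 0.88, 'accuracy': 0.89}
```

The resulting accuracy of 40/45 agrees with the 88.89% reported for majority voting at a test size of 15%.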

The testing accuracy of dataset 1 with cross-validation is discussed next. Table 5.3
shows the testing accuracy of dataset 1 with cross-validation for different classifiers.

Table 5.3: Accuracy score with cross-validation for dataset 1

Algorithms                              k=10      k=20      k=30
Decision Tree Classifier                77.13%    75.07%    74.3%
SVM (kernel=linear)                     83.15%    82.81%    82.23%
SVM (kernel=rbf; gamma=auto)            54.89%    54.9%     54.83%
SVM (kernel=rbf; gamma=scale)           66.34%    65.98%    67.51%
SVM (kernel=sigmoid; gamma=auto)        53.89%    53.9%     53.92%
SVM (kernel=sigmoid; gamma=scale)       53.55%    53.9%     53.92%
kNN Classifier                          68.71%    69.76%    69.8%
  (best number of neighbors)            23        7         30
Naïve Bayes Classifier                  84.5%     83.81%    83.05%
Random Forest Classifier                83.15%    83.45%    82.13%
Logistic Regression (solver=lbfgs)      83.16%    82.78%    82.26%
Logistic Regression (solver=liblinear)  83.5%     83.48%    82.6%
Majority Voting Classifier              84.83%    83.81%    82.9%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Feature ranking for dataset 1 is done using the univariate feature selection algorithm. Table
5.4 shows the feature ranking for dataset 1.

Table 5.4: Feature ranking for dataset 1

Feature Rank
thalach 1
ca 2
oldpeak 3
thal 4
exang 5
age 6
chol 7
trestbps 8
cp 9
restecg 10
slope 11
sex 12
fbs 13

Table 5.3 indicates that for dataset 1 using the cross-validation method, the best accuracy
of 84.83% is obtained by the majority voting classifier with logistic regression, SVM, and
naïve Bayes where k=10. This result is for the case when all 13 features are used for
prediction. The highest accuracy values of the individual classifiers for dataset 1 are
illustrated in Figure 5.1. It can be seen from Figure 5.1 that SVM, logistic regression, naïve
Bayes, random forest, and majority voting exhibit accuracy values above 80%.
Next, the univariate feature selection method is applied to see the impact of reducing the
number of features. From the experiments, the best result is obtained when the 11
highest-ranked features shown in Table 5.4 are used for classification. Using those 11
features, an accuracy of 85.15% is obtained with the majority voting classifier.
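
Univariate feature selection scores each feature on its own against the target and keeps the highest-ranked ones. The sketch below illustrates the idea only. The scoring function used in this project is not stated in this section, so the absolute Pearson correlation is used here purely as a stand-in assumption; the function names and the toy data are hypothetical.

```python
from math import sqrt


def abs_correlation(xs, ys):
    """Stand-in univariate score: absolute Pearson correlation of one feature with the target."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the target
    return abs(cov) / sqrt(var_x * var_y)


def rank_features(columns, target):
    """Return the feature names ordered from the highest to the lowest score."""
    scores = {name: abs_correlation(values, target) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


def select_top(columns, target, k):
    """Keep only the k highest-ranked features."""
    return {name: columns[name] for name in rank_features(columns, target)[:k]}


# Toy data for illustration only.
toy_columns = {"a": [1, 2, 3, 4], "b": [1, 1, 1, 1], "c": [3, 1, 2, 0]}
toy_target = [0, 0, 1, 1]
print(rank_features(toy_columns, toy_target))  # prints ['a', 'c', 'b']
```

Repeating the classification for each value of k and comparing the accuracy values gives the kind of experiment described above, in which 11 of the 13 features gave the best result.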

[Bar chart: accuracy (%) on the vertical axis, from 0 to 90; classifiers on the horizontal
axis: decision tree, SVM, kNN, naïve Bayes, random forest, logistic regression, and majority
voting]

Figure 5.1: Accuracy score with cross-validation for dataset 1



5.2 Classification Results for Dataset 2


This section provides the results and associated discussion on the performance of heart
disease prediction using dataset 2. Table 5.5 shows the testing accuracy of dataset 2 for
different classifiers.

Table 5.5: Accuracy score for dataset 2

Algorithms                              Test=25%  Test=20%  Test=15%  Test=10%
Decision Tree Classifier                61.55%    60%       60.57%    61.49%
SVM (kernel=linear; C=1)                73.88%    69.89%    69%       72.77%
SVM (kernel=linear; C=2)                72.76%    70.75%    73.71%    69.36%
SVM (kernel=linear; C=3)                69.4%     71.4%     71.86%    73.83%
SVM (kernel=linear; C=4)                69.31%    72.9%     69.29%    72.98%
SVM (kernel=linear; C=5)                70.95%    70%       72.14%    68.94%
SVM (kernel=rbf; gamma=auto; C=1)       67.59%    65.05%    66.14%    67.02%
SVM (kernel=rbf; gamma=auto; C=2)       63.53%    65.91%    64.43%    65.74%
SVM (kernel=rbf; gamma=auto; C=3)       65.95%    63.76%    64.57%    66.6%
SVM (kernel=rbf; gamma=auto; C=4)       62.58%    64.62%    63.57%    65.74%
SVM (kernel=rbf; gamma=auto; C=5)       65.52%    64.95%    66.29%    66.17%
SVM (kernel=rbf; gamma=scale; C=1)      65.95%    66.77%    67.29%    65.53%
SVM (kernel=rbf; gamma=scale; C=2)      66.81%    66.13%    68%       65.96%
SVM (kernel=rbf; gamma=scale; C=3)      65.86%    65.81%    65.57%    67.45%
SVM (kernel=rbf; gamma=scale; C=4)      65.95%    65.16%    65.29%    67.45%
SVM (kernel=rbf; gamma=scale; C=5)      67.41%    68.82%    66.71%    68.1%
SVM (kernel=sigmoid; gamma=auto; C=1)   65.43%    65.16%    67.43%    63.62%
SVM (kernel=sigmoid; gamma=auto; C=2)   66.38%    65.16%    68.14%    63.83%
SVM (kernel=sigmoid; gamma=auto; C=3)   63.62%    67.63%    63.29%    65.96%
SVM (kernel=sigmoid; gamma=auto; C=4)   65.1%     67.1%     64.43%    66.6%
SVM (kernel=sigmoid; gamma=auto; C=5)   65.7%     66.02%    68.9%     66.17%
SVM (kernel=sigmoid; gamma=scale; C=1)  60%       61.3%     61.71%    61.28%
SVM (kernel=sigmoid; gamma=scale; C=2)  55.26%    55.81%    55.29%    55.32%
SVM (kernel=sigmoid; gamma=scale; C=3)  54.91%    53.44%    55%       56.81%
SVM (kernel=sigmoid; gamma=scale; C=4)  56.38%    55.91%    56.14%    57.02%
SVM (kernel=sigmoid; gamma=scale; C=5)  55.1%     55.38%    55%       55.74%
kNN Classifier                          69.83%    69.9%     70%       70.2%
  (best number of neighbors)            42        47        13        11
Naïve Bayes Classifier                  70.52%    70%       72.28%    70.21%
Random Forest Classifier                70.17%    68.7%     68.14%    69.57%
Logistic Regression (solver=lbfgs)      71.12%    71.72%    72.9%     73.83%
Logistic Regression (solver=liblinear)  72.33%    70.75%    69%       71.49%
Majority Voting Classifier              73.28%    76.34%    75.71%    76.6%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Table 5.5 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 76.60% when the test size is 10% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear; C=1; test size 25%): 73.88%, logistic regression (solver=lbfgs; test size
10%): 73.83%, naïve Bayes (test size 15%): 72.28%, kNN classifier (k=11; test size 10%):
70.2%, random forest (test size 25%): 70.17%, and decision tree (test size 25%): 61.55%.
Next, confusion matrix and classification reports are calculated for these conditions and
the results are shown in Table 5.6. It can be seen from Table 5.6 that among all the
classifiers, majority voting and logistic regression have the best precision of 69%, naïve
Bayes has the best recall of 65%, and logistic regression has the best F1 score of 65%.

Table 5.6: Confusion matrix and classification report for dataset 2
(TN = true negative, FP = false positive, FN = false negative, TP = true positive)

Algorithms                              Test size  TN    FP    FN    TP    Precision  Recall  F1-score
Decision Tree Classifier                25%        54    23    16    23    0.50       0.59    0.54
SVM (kernel=linear; C=1)                25%        63    16    19    18    0.53       0.49    0.51
kNN Classifier (k=11)                   10%        23    7     9     8     0.53       0.47    0.50
Naïve Bayes Classifier                  15%        32    12    9     17    0.59       0.65    0.62
Random Forest Classifier                25%        56    21    16    23    0.52       0.59    0.55
Logistic Regression (solver=lbfgs)      10%        24    5     7     11    0.69       0.61    0.65
Majority Voting Classifier              10%        26    4     8     9     0.69       0.53    0.60

Majority voting combines logistic regression, SVM, and naïve Bayes.

The testing accuracy of dataset 2 with cross-validation is discussed next. Table 5.7
shows the testing accuracy of dataset 2 with cross-validation for different classifiers.

Table 5.7: Accuracy score with cross-validation for dataset 2

Algorithms                              k=10      k=20      k=30
Decision Tree Classifier                62.55%    62.12%    63.32%
SVM (kernel=linear)                     70.78%    71.19%    71.17%
SVM (kernel=rbf; gamma=auto)            65.37%    65.36%    65.42%
SVM (kernel=rbf; gamma=scale)           66.45%    65.35%    65.41%
SVM (kernel=sigmoid; gamma=auto)        65.37%    65.36%    65.42%
SVM (kernel=sigmoid; gamma=scale)       60.41%    59.34%    58.54%
kNN Classifier                          67.94%    68.36%    67.75%
  (best number of neighbors)            16        16        16
Naïve Bayes Classifier                  71.003%   70.34%    70.83%
Random Forest Classifier                69.69%    68.6%     70.14%
Logistic Regression (solver=lbfgs)      72.07%    72.27%    72.72%
Logistic Regression (solver=liblinear)  71.86%    71.63%    72.1%
Majority Voting Classifier              72.08%    72.27%    72.49%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Feature ranking for dataset 2 is done using the univariate feature selection algorithm. Table
5.8 shows the feature ranking for dataset 2.

Table 5.8: Feature ranking for dataset 2

Feature Rank
age 1
tobacco 2
adiposity 3
alcohol 4
sbp 5
ldl 6

famhist 7
typea 8
obesity 9

Table 5.7 indicates that for dataset 2 using cross validation method, the best accuracy of
72.72% is obtained by logistic regression (solver = lbfgs) where k=30. This result is for the
case when all the 9 features are used for prediction. Next, univariate feature selection
method is applied to see the impact of reducing the number of features. From the
experiments, the best result is obtained when the highest 8 ranked features shown in Table
5.8 are used for classification. Using those 8 features, a 73.62% accuracy is obtained with
logistic regression classifier.

5.3 Classification Results for Dataset 3


This section provides the results and associated discussion on the performance of heart
disease prediction using dataset 3. Table 5.9 shows the testing accuracy of dataset 3 for
different classifiers.

Table 5.9: Accuracy score for dataset 3

Algorithms                              Test=25%  Test=20%  Test=15%  Test=10%
Decision Tree Classifier                63.53%    63.55%    63.29%    63.44%
SVM (kernel=linear; C=1)                71.83%    71.66%    72.54%    72.66%
SVM (kernel=linear; C=2)                71.9%     71.78%    71.73%    71.84%
SVM (kernel=linear; C=3)                72.47%    72.98%    72.2%     72.5%
SVM (kernel=linear; C=4)                72.05%    72.76%    72.34%    72.59%
SVM (kernel=linear; C=5)                72.11%    71.76%    72.21%    72.09%
SVM (kernel=rbf; gamma=auto; C=1)       57.75%    57.49%    58.17%    57.3%
SVM (kernel=rbf; gamma=auto; C=2)       57.22%    57.34%    57.19%    57.14%
SVM (kernel=rbf; gamma=auto; C=3)       57.46%    57.53%    57.62%    57.96%
SVM (kernel=rbf; gamma=auto; C=4)       57.37%    57.81%    57.35%    57.17%
SVM (kernel=rbf; gamma=auto; C=5)       57.74%    58.19%    58.05%    57.51%
SVM (kernel=rbf; gamma=scale; C=1)      60.71%    61.06%    61.37%    59.46%
SVM (kernel=rbf; gamma=scale; C=2)      61.58%    61.48%    62.13%    61.49%
SVM (kernel=rbf; gamma=scale; C=3)      61.82%    61.83%    61.9%     60.5%
SVM (kernel=rbf; gamma=scale; C=4)      62.09%    61.21%    62.12%    61.97%
SVM (kernel=rbf; gamma=scale; C=5)      61.94%    62.11%    62.72%    60.64%
SVM (kernel=sigmoid; gamma=auto; C=1)   49.45%    49.72%    50.13%    49.74%
SVM (kernel=sigmoid; gamma=auto; C=2)   49.89%    49.64%    49.74%    49.41%
SVM (kernel=sigmoid; gamma=auto; C=3)   49.82%    49.59%    49.83%    49.89%
SVM (kernel=sigmoid; gamma=auto; C=4)   49.65%    49.52%    49.12%    48.74%
SVM (kernel=sigmoid; gamma=auto; C=5)   49.57%    49.54%    50.06%    50%
SVM (kernel=sigmoid; gamma=scale; C=1)  59.76%    40.2%     60.16%    40.51%
SVM (kernel=sigmoid; gamma=scale; C=2)  39.78%    59.89%    60.09%    39.96%
SVM (kernel=sigmoid; gamma=scale; C=3)  40.42%    40.1%     59.31%    39.69%
SVM (kernel=sigmoid; gamma=scale; C=4)  40.56%    40.01%    59.38%    41.29%
SVM (kernel=sigmoid; gamma=scale; C=5)  40.18%    59.34%    40.76%    39.91%
kNN Classifier                          70.66%    70.68%    71.17%    71.88%
  (best number of neighbors)            31        46        35        47
Naïve Bayes Classifier                  59.37%    59.31%    59.09%    59.4%
Random Forest Classifier                71.66%    71.39%    72.06%    71.44%
Logistic Regression (solver=lbfgs)      69.94%    69.09%    70.35%    70.91%
Logistic Regression (solver=liblinear)  71.09%    71.64%    71.54%    70.44%
Majority Voting Classifier              72.64%    70%       73.73%    74.8%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Table 5.9 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 74.80% when the test size is 10% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear; C=3; test size 20%): 72.98%, random forest classifier (test size 15%):
72.06%, kNN classifier (k=47; test size 10%): 71.88%, logistic regression
(solver=liblinear; test size 20%): 71.64%, decision tree classifier (test size 20%): 63.55%,
and naïve Bayes classifier (test size 10%): 59.4%. Next, confusion matrix and
classification reports are calculated for these conditions and the results are shown in Table
5.10. It can be seen from Table 5.10 that among all the classifiers, majority voting has the
best precision of 76%, random forest has the best recall of 69%, and random forest and
SVM have the best F1 score of 71%.

Table 5.10: Confusion matrix and classification report for dataset 3
(TN = true negative, FP = false positive, FN = false negative, TP = true positive)

Algorithms                              Test size  TN    FP    FN    TP    Precision  Recall  F1-score
Decision Tree Classifier                20%        4533  2560  2526  4381  0.63       0.63    0.63
SVM (kernel=linear; C=3)                20%        5330  1603  2283  4784  0.75       0.68    0.71
kNN Classifier (k=47)                   10%        2760  798   1224  2218  0.74       0.64    0.69
Naïve Bayes Classifier                  10%        3183  391   2401  1025  0.72       0.30    0.42
Random Forest Classifier                15%        3775  1411  1623  3691  0.72       0.69    0.71
Logistic Regression (solver=liblinear)  20%        5307  1783  2315  4595  0.72       0.66    0.69
Majority Voting Classifier              10%        2717  715   1297  2271  0.76       0.64    0.69

Majority voting combines logistic regression, SVM, and naïve Bayes.

The testing accuracy of dataset 3 with cross-validation is discussed next. Table 5.11
shows the testing accuracy of dataset 3 with cross-validation for different classifiers.

Table 5.11: Accuracy score with cross-validation for dataset 3

Algorithms                              k=10      k=20      k=30
Decision Tree Classifier                63.4%     63.13%    63.4%
SVM (kernel=linear)                     72.22%    72.1%     72.2%
SVM (kernel=rbf; gamma=auto)            52.84%    53.24%    53.28%
SVM (kernel=rbf; gamma=scale)           59.78%    59.72%    59.72%
SVM (kernel=sigmoid; gamma=auto)        50.04%    50.04%    50.03%
SVM (kernel=sigmoid; gamma=scale)       40.2%     41.12%    40.18%
kNN Classifier                          71.16%    71.22%    71.23%
  (best number of neighbors)            47        43        30
Naïve Bayes Classifier                  59.02%    59.02%    59.03%
Random Forest Classifier                71.56%    71.43%    71.43%
Logistic Regression (solver=lbfgs)      69.8%     69.87%    69.87%
Logistic Regression (solver=liblinear)  70.73%    70.77%    70.7%
Majority Voting Classifier              71.58%    71.58%    71.48%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Feature ranking for dataset 3 is done using the univariate feature selection algorithm. Table
5.12 shows the feature ranking for dataset 3.

Table 5.12: Feature ranking for dataset 3

Feature Rank
age 1
ap_lo 2
ap_hi 3
weight 4
cholesterol 5
gluc 6
active 7
smoke 8
alco 9
height 10
gender 11

Table 5.11 indicates that for dataset 3 using the cross-validation method, the best accuracy
of 72.22% is obtained by SVM (linear kernel) where k=10. This result is for the case when
all 11 features are used for prediction. Next, the univariate feature selection method is
applied to see the impact of reducing the number of features. In these experiments, the
accuracy score does not improve; instead, it decreases as the number of features is reduced.

5.4 Classification Results for Hybrid Dataset


This section provides the results and associated discussion on the performance of heart
disease prediction using the hybrid dataset. Table 5.13 shows the testing accuracy of hybrid
dataset for different classifiers. In this case, the parameters of the classifiers are varied. For
example, for the case of logistic regression, two different solvers are considered.

Table 5.13: Accuracy score for hybrid dataset

Algorithms                              Test=25%  Test=20%  Test=15%  Test=10%
Decision Tree Classifier                72.29%    72.4%     71.2%     71.29%
SVM (kernel=linear; C=1)                71.47%    71.53%    70.76%    71.48%
SVM (kernel=linear; C=2)                72.07%    71.77%    71.57%    70.38%
SVM (kernel=linear; C=3)                71.96%    71.23%    72.07%    71.01%
SVM (kernel=linear; C=4)                71.85%    72.34%    71.67%    71.23%
SVM (kernel=linear; C=5)                71.76%    71.86%    71.36%    71.5%
SVM (kernel=rbf; gamma=auto; C=1)       73.52%    71.88%    73.46%    73.28%
SVM (kernel=rbf; gamma=auto; C=2)       72.94%    72.91%    73.77%    72.79%
SVM (kernel=rbf; gamma=auto; C=3)       73.52%    72.6%     73.11%    72.89%
SVM (kernel=rbf; gamma=auto; C=4)       72.76%    73.38%    73.38%    73.08%
SVM (kernel=rbf; gamma=auto; C=5)       72.63%    73.26%    73.74%    73.6%
SVM (kernel=rbf; gamma=scale; C=1)      71.16%    71.3%     71.32%    71.8%
SVM (kernel=rbf; gamma=scale; C=2)      70.92%    71.62%    71.56%    71.32%
SVM (kernel=rbf; gamma=scale; C=3)      71.15%    71.58%    72.62%    72.24%
SVM (kernel=rbf; gamma=scale; C=4)      72.02%    72.1%     72.2%     71.78%
SVM (kernel=rbf; gamma=scale; C=5)      71.98%    72.09%    71.93%    72.44%
SVM (kernel=sigmoid; gamma=auto; C=1)   49.9%     49.73%    49.36%    49.21%
SVM (kernel=sigmoid; gamma=auto; C=2)   49.78%    49.53%    49.82%    49.92%
SVM (kernel=sigmoid; gamma=auto; C=3)   50.23%    49.94%    49.26%    49.61%
SVM (kernel=sigmoid; gamma=auto; C=4)   49.33%    49.73%    49.63%    49.95%
SVM (kernel=sigmoid; gamma=auto; C=5)   49.71%    49.97%    49.41%    49.89%
SVM (kernel=sigmoid; gamma=scale; C=1)  66.61%    63.66%    65.82%    62.96%
SVM (kernel=sigmoid; gamma=scale; C=2)  63.95%    66.21%    64.38%    66.94%
SVM (kernel=sigmoid; gamma=scale; C=3)  62.07%    65.37%    63.77%    64.53%
SVM (kernel=sigmoid; gamma=scale; C=4)  65.84%    65.77%    65.99%    63.25%
SVM (kernel=sigmoid; gamma=scale; C=5)  66.14%    65.23%    65.06%    64.63%
kNN Classifier                          72.64%    72.16%    73.21%    72.8%
  (best number of neighbors)            33        16, 39    43        40
Naïve Bayes Classifier                  58.31%    58.27%    58.26%    58.47%
Random Forest Classifier                72.68%    71.95%    72.8%     72.14%
Logistic Regression (solver=lbfgs)      71.19%    71.02%    71.81%    71.92%
Logistic Regression (solver=liblinear)  71.56%    71.6%     71.39%    71.98%
Majority Voting Classifier              71.48%    71.43%    71.96%    72.01%

Majority voting combines logistic regression, SVM, and naïve Bayes.

Table 5.13 indicates that among all the classifiers, SVM (kernel=rbf; gamma=auto; C=2)
exhibits the best accuracy of 73.77% when the test size is 15% of the data samples.
Moreover, we can see that each classifier has its own best accuracy score for a set of
conditions. These are kNN classifier (k=43; test size 15%): 73.21%, random forest
classifier (test size 15%): 72.80%, decision tree classifier (test size 20%): 72.40%, majority
voting (test size 10%): 72.01%, logistic regression (solver=liblinear; test size 10%):
71.98%, and naïve Bayes classifier (test size 10%): 58.47%. Next, confusion matrix and
classification reports are calculated for these conditions and the results are shown in Table
5.14. It can be seen from Table 5.14 that among all the classifiers, majority voting has the
best precision of 78%, random forest has the best recall of 69%, and kNN classifier has the
best F1 score of 72%. The testing accuracy of the hybrid dataset using cross-validation
method is calculated next for different classifiers. Experimental results show that among
all the classifiers, the best accuracy of 71.33% is obtained by majority voting classifier
with logistic regression, SVM, and naïve Bayes where k=5.

Table 5.14: Confusion matrix and classification report for hybrid dataset
(TN = true negative, FP = false positive, FN = false negative, TP = true positive)

Algorithms                              Test size  TN    FP    FN    TP    Precision  Recall  F1-score
Decision Tree Classifier                20%        5499  1617  2385  4765  0.75       0.67    0.70
SVM (kernel=rbf; gamma=auto; C=2)       15%        4203  1198  1730  3569  0.75       0.67    0.71
kNN Classifier (k=43)                   15%        4136  1201  1707  3656  0.75       0.68    0.72
Naïve Bayes Classifier                  10%        3087  526   2406  1114  0.68       0.32    0.43
Random Forest Classifier                15%        4143  1282  1637  3638  0.74       0.69    0.71
Logistic Regression (solver=liblinear)  10%        2730  846   1178  2379  0.74       0.67    0.70
Majority Voting Classifier              10%        2860  621   1324  2226  0.78       0.63    0.70

5.5 Comparative Results


From Sections 5.1-5.4, it can be seen that the results differ from one classifier to another.
Results also vary with the parameters of the classifiers and with the evaluation method
(holdout or cross-validation). Table 5.15 presents the conditions under which the best accuracy
values are obtained for the three individual datasets and for the hybrid dataset. It can be
seen that among all the cases shown in Table 5.15, the highest accuracy value of 88.89%
is obtained for dataset 1 using the majority voting algorithm. It can also be seen that the
accuracy values for the hybrid dataset are slightly lower than those of dataset 3.

Table 5.15: Comparative results for all datasets

Dataset         Method            Features     Classifier           Setting        Accuracy
Dataset 1       Holdout           All 13       Majority voting      test size=15%  88.89%
Dataset 1       Cross-validation  All 13       Majority voting      k=10           84.83%
Dataset 1       Cross-validation  Best 11      Majority voting      k=10           85.15%
Dataset 2       Holdout           All 9        Majority voting      test size=10%  76.60%
Dataset 2       Cross-validation  All 9        Logistic regression  k=30           72.72%
Dataset 2       Cross-validation  Best 8       Logistic regression  k=30           73.62%
Dataset 3       Holdout           All 11       Majority voting      test size=10%  74.80%
Dataset 3       Cross-validation  All 11       SVM (linear kernel)  k=10           72.22%
Dataset 3       Cross-validation  Best/All 11  SVM (linear kernel)  k=10           72.22%
Hybrid dataset  Holdout           All 5        SVM (rbf kernel)     test size=15%  73.77%
Hybrid dataset  Cross-validation  All 5        Majority voting      k=5            71.33%
Hybrid dataset  Cross-validation  Selected     Feature selection not applicable

Figure 5.2: Comparative results for all datasets

Figure 5.2 presents the best accuracy values obtained with the holdout and cross-validation
methods for the three individual datasets and the hybrid dataset used in this work. From
Figure 5.2, it can be seen that the accuracy value is always higher for dataset 1 than for
the other datasets.

Among the three individual datasets used in this project, dataset 1 from the UCI repository
is also used in [7-8, 10] for the prediction of heart disease. The best results reported in
[7-8, 10] are compared with the best result obtained in this work for this particular
dataset. This comparison is shown in Table 5.16. It can be seen from Table 5.16 that the
majority voting used in this work gives an accuracy of 88.89%, which is better than the
accuracy values of 84.10%, 84%, and 85.48% reported in [7], [8], and [10], respectively.

Table 5.16: Comparative results of this work with the literature

Source                      Accuracy  Approach
Shouman et al. (2011) [7]   84.1%     Nine voting equal frequency discretization gain ratio decision tree
Marateb et al. (2015) [8]   84%       Multiple logistic regression + Neuro-fuzzy classifier
Latha et al. (2019) [10]    85.48%    Majority vote with naïve Bayes, Bayes net, random forest and multilayer perceptron
This work (for dataset 1)   88.89%    Majority vote with logistic regression, SVM and naïve Bayes

CHAPTER 6

Conclusion

6.1 Conclusion
The goal of this project is to predict heart disease using different datasets. In this project,
seven different machine learning algorithms are applied to three individual datasets and
one hybrid dataset. The parameters of each machine learning algorithm are varied to find
the settings that deliver the best accuracy score. Experimental results show that the most
important attribute of dataset 1 is ‘thalach’, the maximum heart rate achieved by a patient,
while that of datasets 2 and 3 is patient age.

For dataset 1 using the holdout method, the best accuracy of 88.89% is obtained when the
majority voting classifier is used with logistic regression, SVM, and naïve Bayes. With the
cross-validation method, the best result of 84.83% accuracy for dataset 1 is obtained by the
same classifier. For dataset 2 using the holdout method, the best accuracy of 76.60% is
obtained when the majority voting classifier is used with logistic regression, SVM, and naïve
Bayes. With the cross-validation method, the best result of 72.72% accuracy for dataset 2 is
obtained by logistic regression. For dataset 3 using the holdout method, the best accuracy of
74.80% is obtained when the majority voting classifier is used with logistic regression, SVM,
and naïve Bayes. With the cross-validation method, the best result of 72.22% accuracy for
dataset 3 is obtained by SVM. For the hybrid dataset using the holdout method, the best
accuracy of 73.77% is obtained by SVM. With the cross-validation method, the best result of
71.33% accuracy for the hybrid dataset is obtained when the majority voting classifier is used
with logistic regression, SVM, and naïve Bayes.

Furthermore, it has been shown that for dataset 1 and dataset 2, the accuracy values
improve slightly when selected features rather than all features are used for classification.
For dataset 3, on the other hand, feature selection gives no improvement in accuracy. For the
hybrid dataset, the classification accuracy of 73.77% is lower than the accuracy of 88.89%
for dataset 1 and 74.80% for dataset 3. Hence, in these experiments, the effectiveness of
heart disease diagnosis reduces when multiple datasets are combined. Results also show that
among all the datasets and classifiers used in this work, the best accuracy is obtained from
majority voting when it combines logistic regression, SVM, and naïve Bayes and is applied to
dataset 1. In this particular case, majority voting gives an accuracy of 88.89%, which is
better than the other accuracy values reported in the literature using the same dataset.

It is observed from this project work that the results of the holdout method vary with the
testing size, and that they vary even when the testing size is the same. This is because the
samples are selected at random, so the training and testing sets differ from run to run. The
cross-validation method gives more stable results than the holdout method. It is also observed
that a classifier has to be chosen based on the dataset and its attributes in order to
reliably diagnose heart disease.
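
The difference between the two evaluation methods can be seen from how the sample indices are divided. The sketch below is an illustration under stated assumptions, not the splitting code used in this project: the function names are hypothetical, the test set size is rounded to the nearest integer, and Python's standard random module stands in for whatever shuffling the original experiments used.

```python
import random


def holdout_split(n_samples, test_fraction, seed):
    """Shuffle the indices once and cut off a test set of the requested size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)  # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(n_samples, k, seed=0):
    """Yield k (train, test) pairs in which every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

A holdout result depends on which samples happen to land in the single test set, so a different seed can give a different accuracy. In k-fold cross-validation the accuracy is averaged over k test sets that together cover all samples, which is consistent with the more stable results observed above.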

6.2 Future Work


The project work has been carried out for a limited number of datasets. More research work
needs to be done to find the most influential attributes for heart disease prediction.
Also, this project focuses only on the application of supervised machine learning
algorithms. Other supervised and unsupervised learning algorithms can be considered in
similar future work, to compare the findings and to validate the performance in terms of
accuracy.

Appendix: Setting up Anaconda Distribution Package


Anaconda is a free and open-source distribution of the Python and R programming
languages. It simplifies package management and deployment for Python and can be used
for large-scale data processing. The Anaconda distribution is used to carry out this project
work. The step-by-step process is discussed as follows.

Step 1: The installer file is downloaded from the Anaconda website. Two Python
versions are available for download. The 64-bit graphical installer for Python 3.7 is
downloaded for this project.

Figure A1: Download page of Anaconda distribution website



Step 2: After downloading the installer file, the next step is to launch the installation
wizard. Click on “Next” button to continue to next step.

Figure A2: Anaconda installation process step 2

Step 3: “License Agreement” window for Anaconda distribution package will appear.
Click on “I Agree” button to continue the setup process.

Figure A3: Anaconda installation process step 3



Step 4: “Select Installation Type” window will appear. Click on “Next” button to continue
to next step.

Figure A4: Anaconda installation process step 4

Step 5: “Choose Install Location” window will appear. Click “Next” button to continue
the setup process.

Figure A5: Anaconda installation process step 5



Step 6: “Advanced Installation Options” window will appear. Two options are provided
as shown in Figure A6. Check both check-boxes and click the “Install” button to complete
the installation process. This will install the Anaconda distribution package on the
computer.

Figure A6: Anaconda installation process step 6

The Anaconda distribution package includes a graphical user interface called Anaconda
Navigator, which allows users to launch applications and manage conda packages,
environments, and channels without using a command-line interface. By default, the
following applications are available in Anaconda Navigator:

• JupyterLab
• Jupyter Notebook
• Spyder
• Visual Studio Code
• Glue
• Orange
• RStudio

Figure A7: Anaconda navigator interface

In this project work, Jupyter Notebook is used to run the code required for data analysis.
It is a web-based interactive computational environment for creating Jupyter notebook
documents.

Figure A8: Jupyter notebook interface
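
After the installation, the environment can be checked from a notebook cell. The following is a minimal sketch: the function name check_environment is hypothetical, and the default package list (numpy, pandas, sklearn) is an assumption about what such a project typically needs rather than a list taken from this report.

```python
import sys
from importlib.util import find_spec


def check_environment(packages=("numpy", "pandas", "sklearn")):
    """Report the running Python version and whether each package can be imported."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = find_spec(name) is not None
    return report


print(check_environment())
```

A value of False for any package indicates that it still has to be installed, for example through Anaconda Navigator.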



References
[1] Go, A. S., Mozaffarian, D., Roger, V. L., Benjamin, E. J., Berry, J. D., Blaha, M.J.,
“Executive summary: heart disease and stroke statistics-2014 update: a report from the
American heart association”, Circulation, vol. 129, no. 3, pp. 399-410, Jan. 2014. doi:
10.1161/01.cir.0000442015.53336.12.
[2] Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J. and Scholkopf, B., “Support vector
machines”, in IEEE Intelligent systems and their applications, vol. 13, no. 4, pp. 18-28,
July-Aug. 1998.
[3] Wang, G., “A survey on training algorithms for support vector machine classifiers”,
2008 Fourth International Conference on Networked Computing and Advanced
Information Management, pp. 123-128, Gyeongju, 2008. doi: 10.1109/NCM.2008.103.
[4] Laaksonen, J. and Oja, E., “Classification with learning k-nearest neighbors”,
Proceedings of International Conference on Neural Networks (ICNN'96), Washington, DC,
USA, 1996, pp. 1480-1483 vol.3.
[5] Sanz, J.A., Galar, M., Jurio, A., Brugos, A., Pagola, M. and Bustince, H., “Medical
diagnosis of cardiovascular diseases using an interval-valued fuzzy rule-based
classification system”, Applied Soft Computing, vol. 20, pp. 103-111, July 2014. doi:
10.1016/j.asoc.2013.11.009.
[6] Setiawan, N.A., “Fuzzy decision support system for coronary artery disease diagnosis
based on rough set theory”, International Journal of Rough Sets and Data Analysis, vol. 1,
no. 1, pp. 65-80, Jan. 2014. doi: 10.4018/ijrsda.2014010105.
[7] Shouman, M., Turner, T. and Stocker, R., “Using decision tree for diagnosing heart
disease patients”, Proceedings of the Ninth Australian Data Mining Conference, Australia,
2011, pp. 23-30.
[8] Marateb, H.R. and Goudarzi, S., “A noninvasive method for coronary artery disease
diagnosis using a clinically interpretable fuzzy-rule based system”, Journal of Research in
Medical Sciences, vol. 20, no. 3, pp. 214-223, March 2015.

[9] Goni, M. Osman, “Development of a web based expert system for diagnosis of heart
disease using fuzzy logic”, M. Engg. Project, Institute of Information and Communication
Technology, BUET, 2019.
[10] Latha, C.B.C and Jeeva, S.C., “Improving the accuracy of prediction of heart disease
risk based on ensemble classification techniques”, Informatics in Medicine Unlocked, vol.
16, 2019, 100203. doi: 10.1016/j.imu.2019.100203.
[11] Mienye, I. D., Sun, Y. and Wang, Z., “An improved ensemble learning approach for
the prediction of heart disease risk”, Informatics in Medicine Unlocked, vol. 20, 2020,
100402. doi: 10.1016/j.imu.2020.100402.
[12] L. Ali et al., “An Optimized Stacked Support Vector Machines Based Expert System
for the Effective Prediction of Heart Failure,” in IEEE Access, vol. 7, pp. 54007-54014,
2019, doi: 10.1109/ACCESS.2019.2909969.
[13] Anaconda distribution website, https://www.anaconda.com/distribution/ [Last
accessed on 12 Feb. 2020]
[14] Raihan-Al-Masud, M. and Mondal, M. R. H., “Data-driven diagnosis of spinal
abnormalities using feature selection and machine learning algorithms”. PLOS ONE. 2020;
15(2): e0228422. doi: 10.1371/journal.pone.0228422.
[15] Heart disease dataset, UCI machine learning repository,
https://archive.ics.uci.edu/ml/datasets/Heart+Disease [Last accessed on 30 Sep. 2020]
[16] Cardiovascular disease, https://www.kaggle.com/yassinehamdaoui1/cardiovascular-disease
[Last accessed on 30 Sep. 2020]
[17] Cardiovascular disease dataset, https://www.kaggle.com/sulianova/cardiovascular-disease-dataset
[Last accessed on 30 Sep. 2020]
[18] Bharati S., Podder P., Mondal M. R. H., and Robel M. R. A., “Threats and
Countermeasures of Cyber Security in Direct and Remote Vehicle Communication
Systems”, Journal of Information Assurance and Security, MIR Labs, USA, vol. 15, pp.
153-164, May 2020.
[19] Mondal M. R. H., Bharati S., Podder P., Podder P., “Data Analytics for Novel
Coronavirus Disease”, Informatics in Medicine Unlocked, Elsevier, vol. 20, 2020, 100374.

[20] Khanam F., Nowrin I., and Mondal M. R. H., “Data Visualization and Analyzation of
COVID-19”, Journal of Scientific Research and Reports, vol. 26, no. 3, pp. 42-52, Apr.
2020.
[21] Bharati S., Podder P., and Mondal M. R. H., “Hybrid deep learning for detecting lung
diseases from X-ray images”, Informatics in Medicine Unlocked, Elsevier, vol. 20, 2020,
100391.
[22] Bharati S., Podder P., and Mondal M. R. H., “Artificial Neural Network Based Breast
Cancer Screening: A Comprehensive Review”, International Journal of Computer
Information Systems and Industrial Management Applications, MIR Labs, USA, vol. 12,
pp. 125-137, May 2020.
[23] Kourou K., Exarchos T. P., Exarchos K. P., Karamouzis M. V., Fotiadis D. I.,
“Machine learning applications in cancer prognosis and prediction”, Computational and
Structural Biotechnology Journal, vol. 13, pp. 8-17, 2015.
