Full Thesis (BUET)
Table of Contents
List of Tables
List of Figures
Acknowledgment
Abstract
1 Introduction
1.1 Introduction
1.3 Objectives
4 Dataset Information
6 Conclusion
6.1 Conclusion
Appendix
References
LIST OF TABLES
LIST OF FIGURES
Acknowledgment
First of all, I would like to convey my gratitude to Almighty Allah for giving me the opportunity to accomplish this project. I want to thank my supervisor Dr. Md. Rubaiyat Hossain Mondal, Professor, IICT, BUET, for giving me the chance to explore such an interesting field of research and for providing help and advice whenever I needed it. Without his proper guidance, advice, continual encouragement, and active involvement in this work, it would not have been feasible.
A big thanks also goes to all the teachers, officers, and staff of the Institute of Information and Communication Technology (IICT) for their kind support and information during the study.
Finally, I am very grateful to my parents and family members, whose continuous support all over my life has brought me this far in my career.
Abstract
Early detection of heart disease can help in preventing the disease progression. Different
risk factors are associated with heart disease prediction. This project focuses on multiple
datasets in order to find the most valuable attributes and risk factors associated with heart
disease. One dataset containing 14 attributes including the target attribute and 303 instances
is collected from UCI machine learning repository. The second one containing 10 attributes
and 462 instances is collected from Kaggle repository. The third one contains 12 attributes
of 70000 instances, and is available at Kaggle repository. Seven different machine learning
algorithms are applied on these three individual datasets to study the most influential
attributes for heart disease prediction. One hybrid dataset is also generated using only the
common attributes of two individual datasets. The scikit-learn library of the Python programming language is used for data analysis. A univariate feature selection algorithm is applied
in order to find the most valuable attributes associated with heart disease. The heart disease
is predicted using several machine learning algorithms including support vector machine
(SVM), decision tree, k-nearest neighbors (kNN), logistic regression, naïve Bayes, random
forest, and majority voting. The training and testing portions of each dataset are separated using holdout and cross-validation methods. Different parameters related to different
algorithms are altered and applied to find out which condition gives the highest accuracy.
To evaluate the performance of different algorithms, classification report and confusion
matrix are also calculated. It is shown here that majority voting as a combination of logistic
regression, SVM, and naïve Bayes exhibits the best accuracy of 88.89% when applied to
the first dataset. It is also shown that for the hybrid dataset, the classification accuracy is
lower than that of the individual datasets. Finally, the best result obtained from this project
work is compared with the results of existing similar research approaches.
CHAPTER 1
Introduction
1.1 Introduction
Heart attack, or myocardial infarction, is one of the deadliest diseases in the world at present, as it is the major cause of death and disability in many developed and developing countries [1]. Most heart attacks occur due to coronary artery disease. A patient suffering from a heart attack needs treatment within a very short time. So, it is very important to find out whether a person is at risk of having a heart attack, considering the risk factors associated with it.
Machine learning algorithms [2-4] are considered in different application areas including
disease prediction. A number of research works have also been reported for coronary artery
disease or heart disease [5-12]. Accurate analysis of medical data enables early heart
disease detection, patient care, and community services. However, the findings of these
works vary, and one of the reasons behind this is the consideration of different attributes
and collection of different datasets by different authors. Accuracy of the results is reduced
when the medical data is incomplete. Therefore, research is still required to find out the
most important attributes and how selection of the attributes influences the disease
prediction. This project focuses on multiple datasets for finding the important attributes
associated with heart disease. The project then focuses on applying different machine
learning algorithms on the factors of different datasets.
Decision tree algorithm is one of the most popular data mining techniques used by several
researchers for heart disease prediction. Different types of decision tree are used to find out
which performs better in predicting heart disease [7]. This research uses a model that
combines discretization, decision tree and voting to find out a more accurate method for
heart disease prediction. The sensitivity, specificity, and accuracy are calculated in order
to compare the performance of different types of decision trees.
A computer-based noninvasive coronary artery disease diagnosis system is used in [8]. The
target of this research work is to design a clinically interpretable fuzzy rule-based system.
Discretization is done for the interval-scale variables, and then, the fuzzy rule-based system
is formulated based on a neuro-fuzzy classifier. Multiple logistic regression and sequential
feature selection are used for required attributes. The combination of multiple logistic
regression and neuro-fuzzy classifier method has exhibited the best performance.
A web-based fuzzy logic expert system is developed for the diagnosis of heart disease in
[9]. The system consists of a fuzzification module, a knowledge-based inference engine, and a defuzzification module. The fuzzification module operates on every input based on an appropriate membership function. Then, the inference engine triggers the appropriate rule from the knowledge base to find the output value using an appropriate defuzzification method.
HTML, CSS, JavaScript, jQuery, AJAX, PHP, Bootstrap, XML, and MySQL have been
used to implement this web-based system. The system is cost effective and efficient and
showed a very high accuracy when tested using the dataset of Cleveland clinical foundation
from UCI repository.
A mean-based splitting approach is used to partition a heart disease dataset [11]. A homogeneous ensemble is generated from the partitions, which are modeled by different classification and regression trees. A classification accuracy of 93% is also reported in this literature.
Stacked support vector machine (SVM) is reported for the diagnosis of heart disease [12].
In stacked SVM expert system, the first SVM removes the irrelevant features of the dataset,
while the second SVM is used to predict the possibility of heart disease. A hybrid grid
search technique is considered for optimization of the SVMs. Compared with the stand-
alone SVM algorithm, the stacked SVM shows better performance in terms of reduced
training time, and better classification accuracy.
Different data mining techniques, performance tools, and methods have been implemented
which provide different perspective on the prediction of heart disease. However, none of
the aforementioned research works show the variation in the results using multiple datasets.
1.3 Objectives
The goal of this project is to predict heart disease using different attributes. The novelty of
this work is in the consideration of multiple datasets and preparing a hybrid dataset for
heart disease prediction. Three different datasets have been used in this project work for
finding the most important attributes and for the prediction of heart disease. One hybrid
dataset is also created using only the common attributes considering two individual datasets
in this project work. The specific objectives of this project are as follows:
• To study the most influential attributes of three different datasets for predicting
heart disease.
• To prepare a hybrid dataset using two individual datasets.
• To apply different machine learning algorithms for the prediction of heart disease
on three datasets and one hybrid dataset.
• To compare the obtained results with the results reported in the literature.
CHAPTER 2
Support vector machine (SVM) falls under the category of supervised machine learning models. SVM performs data analysis using two different methods, namely classification and regression analysis. Given training data, the SVM training algorithm builds a model that assigns each new example to one of two categories, which makes SVM a non-probabilistic binary linear classifier. SVM also performs non-linear classification with high efficiency by implicitly mapping the inputs into high-dimensional feature spaces, which is known as the kernel trick. SVM performs its analysis by creating a hyperplane or a set of hyperplanes in a high-dimensional space. The functional margin is the distance between a hyperplane and the nearest training data point of any class. The generalization error of the classifier is observed to be lower with a larger functional margin, and a good separation is achieved when this distance is the largest.
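As a minimal sketch of these ideas (using synthetic data and scikit-learn, not the thesis datasets), a non-linear SVM with the RBF kernel can be trained as follows:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Synthetic two-class data that is not linearly separable
X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

# The RBF kernel maps the inputs into a high-dimensional feature
# space implicitly (the kernel trick), so a separating hyperplane
# can be found there even though the classes are not linearly
# separable in the original space.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)

train_acc = clf.score(X, y)
```

The support vectors retained by the fitted model are the training points closest to the separating hyperplane, i.e. the ones that determine the margin.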
Decision tree algorithm is one of the most commonly used data mining techniques which
performs analysis based on predictive modeling approach. The objective of this machine
learning algorithm is to predict the value of a target variable using different input variables.
Two different types of decision tree are used in case of data analysis: one is classification
tree where the predicted result is a class to which data belongs, and the other one is
regression tree, where the predicted result is a real number. The structure of a decision tree model is similar to a flow-chart, where each non-leaf node represents a test on an attribute, each branch represents the outcome of a test, and each terminal node represents a class label.
Decision tree algorithm is easy to interpret and performs well with large datasets which
makes it a popular data mining method.
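A small sketch (using the bundled iris data for illustration, not the thesis datasets) showing the flow-chart structure of a classification tree:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# A classification tree: each internal node tests one attribute,
# each branch is a test outcome, each leaf is a class label.
data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# The flow-chart structure can be printed as indented text rules.
rules = export_text(tree, feature_names=list(data.feature_names))
```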
Random forest classifier is a supervised machine learning algorithm that is used for
classification and regression problems. It is an ensemble learning method that creates
multiple decision trees when the model is trained and receives prediction from each of
them. The final output is the mode of the classes of all the individual trees in case of
classification problems. When it is used to solve regression problems, the final result is the
mean prediction of the individual trees. Thus, by averaging the results, overfitting is
reduced which can happen when a single decision tree is used. Because of its simple and
diversified characteristics, random forest classifier is one of the most used algorithms in
data science.
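A brief sketch (synthetic data, not the thesis datasets) of the ensemble of trees whose majority vote gives the final class:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8,
                           random_state=0)

# An ensemble of decision trees; for classification the final
# prediction is the mode (majority class) of the trees' votes,
# which reduces the overfitting of a single tree.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

n_trees = len(forest.estimators_)
```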
Logistic regression is a machine learning algorithm that functions according to the concept
of probability. It is a predictive analysis algorithm, and it is mainly used to solve
classification problems. Logistic regression uses a logistic function, also known as sigmoid
function in order to model binary dependent variable. This algorithm can be of three
different types: binomial, multinomial, and ordinal. When the observed outcome for a
dependent variable is any of the two possible types, it is known as binary or binomial
logistic regression. In multinomial logistic regression, the outcome can have three or more
possible types which are unordered or with no quantitative significance. The third type is
ordinal logistic regression where the dependent variable can have three or more ordered
possible outcomes.
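A short sketch (synthetic data, not the thesis datasets) showing that scikit-learn's binary logistic regression applies the sigmoid function to the linear decision score:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic (sigmoid) function mapping any real value to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

X, y = make_classification(n_samples=200, n_features=4,
                           random_state=1)
model = LogisticRegression(solver="lbfgs")
model.fit(X, y)

# predict_proba applies the sigmoid to the linear decision score
z = model.decision_function(X[:1])
p_manual = sigmoid(z)[0]
p_sklearn = model.predict_proba(X[:1])[0, 1]
```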
CHAPTER 3
1. Three different datasets are collected from Kaggle and the UCI machine learning repository, which are online platforms for data scientists.
2. Python programming language is used to carry out data analysis. For the
deployment of Python, the Anaconda distribution is used, which is a free and open-source distribution that simplifies package management. It includes a graphical user
interface known as Anaconda navigator. Different useful applications are available
in the navigator such as JupyterLab, Jupyter notebook, Spyder, Orange, RStudio,
etc. For this project, Jupyter notebook is used to run the codes for data analysis.
3. Different supervised machine learning algorithms including support vector
machine, decision tree, k-nearest neighbors, naïve Bayes, random forest, and
logistic regression are used. Majority voting classifier, an ensemble classification
method to improve the accuracy of weak algorithms by combining multiple
classifiers, is also used in this project. To implement these algorithms, scikit-learn
library is used in this project. Scikit-learn library is a free software machine learning
library which is included in the Anaconda distribution package.
4. The accuracy score for predicting heart disease is calculated for three different
datasets. This is done by holdout method and by cross-validation method. For the
case of holdout method, the percentage of training and testing data is set to four
different values. These are testing size of 10%, 15%, 20%, and 25% of the total
data samples. The train_test_split() class from scikit-learn library is used to split
the datasets into training and testing portions. For cross-validation, the total dataset
is divided into k equal groups and based on the value of k, the result of cross-
validation changes. Large value of k will increase computation time.
5. Different parameters associated with different learning algorithms are altered and
applied to compare between the results to find out the desired condition that gives
the highest accuracy score. In support vector machine algorithm, three different
kernels are applied: linear, rbf, and sigmoid. For rbf and sigmoid kernels, the C
value is altered from 1 to 5 and the gamma value is altered between auto and scale.
In case of k-nearest neighbors classifier, the value of k is changed from 1 to 50 to
find out the best possible accuracy score. In logistic regression algorithm, two
different solvers are applied: lbfgs and liblinear.
6. After calculating the accuracy score for three datasets with the specific conditions
mentioned in step 4 and 5, the results are compared to find out the best accuracy
score for different algorithms. The conditions for which the best accuracy is
obtained, are noted. Then, confusion matrix and classification report (precision,
recall and F1-score) are calculated for those conditions.
7. Feature selection is performed to find the best attributes of a dataset that lead to the
diagnosis of heart disease. In this project, the feature selection method is univariate
feature selection method where each feature is scored individually on certain
specified criteria and the features are then selected based on the higher scores or
higher ranks.
8. One hybrid dataset is created using only the common attributes considering two
individual datasets collected before (detailed information about the hybrid dataset is
discussed in the next chapter). All the required normalization is done for
constructing the hybrid dataset. Then, accuracy score, confusion matrix, and
classification report are calculated for this hybrid dataset according to step 4, step
5, and step 6.
9. Finally, the best result obtained from this project work is compared with the results
reported in the literature [7-8, 10].
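The majority voting classifier of step 3, combining logistic regression, SVM, and naïve Bayes, can be sketched with scikit-learn's VotingClassifier (synthetic data for illustration; the exact settings used in this project are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10,
                           random_state=0)

# Hard voting: the predicted class is the mode of the individual
# classifiers' predictions.
ensemble = VotingClassifier(estimators=[
    ("lr", LogisticRegression(solver="lbfgs", max_iter=1000)),
    ("svm", SVC(kernel="linear")),
    ("nb", GaussianNB()),
], voting="hard")
ensemble.fit(X, y)
acc = ensemble.score(X, y)
```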
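The holdout and cross-validation splits of step 4 can be sketched as follows (a bundled scikit-learn dataset and a scaled logistic regression stand in for the thesis datasets and settings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(),
                      LogisticRegression(max_iter=1000))

# Holdout: 20% of the samples are held out for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=0)
model.fit(X_tr, y_tr)
holdout_acc = model.score(X_te, y_te)

# k-fold cross-validation with k=10: the dataset is divided into
# 10 equal groups, each serving once as the test fold
cv_scores = cross_val_score(model, X, y, cv=10)
cv_acc = cv_scores.mean()
```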
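The parameter variation of step 5 can be sketched as simple grid loops over the SVM kernel, C, and gamma, and over k for the kNN classifier (again on a stand-in dataset; the feature scaling applied below is an assumption, not necessarily what the project used):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20,
                                          random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Vary the SVM kernel, C value (1 to 5), and gamma setting,
# keeping the best test accuracy found
best_svm = 0.0
for kernel in ("linear", "rbf", "sigmoid"):
    for C in range(1, 6):
        for gamma in ("auto", "scale"):
            clf = SVC(kernel=kernel, C=C, gamma=gamma)
            best_svm = max(best_svm,
                           clf.fit(X_tr, y_tr).score(X_te, y_te))

# Vary k from 1 to 50 for the k-nearest neighbors classifier
best_knn = 0.0
for k in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    best_knn = max(best_knn, knn.score(X_te, y_te))
```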
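The univariate feature selection of step 7 can be sketched with scikit-learn's SelectKBest; the project does not state which scoring function was used, so the ANOVA F-test below is an assumption:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature individually (univariate) with the ANOVA
# F-test, then keep only the 10 highest-scoring features
selector = SelectKBest(score_func=f_classif, k=10)
X_sel = selector.fit_transform(X, y)
```

The per-feature scores in `selector.scores_` give the ranking from which the top-k features are chosen.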
Process:
Step 5: Load the CSV file containing data using read_csv() function
Step 7: i) For holdout method, separate the train and test data using train_test_split()
function
The performance metrics are defined in terms of true positive (TP), true negative (TN), false negative (FN), and false positive (FP) counts. In the context of this work, TP refers to the patient samples that are correctly classified as abnormal, meaning the patients have heart disease. The term TN is the number of normal people correctly identified as having a normal heart condition. The term FN refers to the people who actually have heart disease but remain undetected by the system. Furthermore, FP refers to the number of samples wrongly detected as having heart disease. These metrics are defined in the following. The accuracy is the percentage of all normal and abnormal vectors that are correctly classified.
Accuracy, ac, can be expressed as follows.

ac = (TP + TN) / (TP + TN + FP + FN)    (3.1)

Training accuracy and testing accuracy are defined as the accuracy obtained for training and testing samples, respectively. Precision, pr, can be mathematically written as follows.

pr = TP / (TP + FP)    (3.2)

Recall, re, is defined as follows.

re = TP / (TP + FN)    (3.3)

The F1-score, f1, is the harmonic mean of precision and recall.

f1 = (2 × pr × re) / (pr + re)    (3.4)
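As a worked illustration of Eqs. (3.1)-(3.4) with hypothetical labels (not results from the thesis datasets):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = heart disease (abnormal), 0 = normal
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# For binary labels, scikit-learn's confusion matrix is laid out
# as [[TN, FP], [FN, TP]], so ravel() yields the four counts.
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()

ac = (TP + TN) / (TP + TN + FP + FN)   # Eq. (3.1)
pr = TP / (TP + FP)                    # Eq. (3.2)
re = TP / (TP + FN)                    # Eq. (3.3)
f1 = 2 * pr * re / (pr + re)           # Eq. (3.4)
```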
CHAPTER 4
Dataset Information
Three different datasets have been used in this project work to carry out data analysis for
heart disease prediction. The detailed information about these datasets is discussed in this
chapter.
4.1.1 Dataset 1
Dataset 1 is collected from UCI machine learning repository [15]. It contains 14 attributes
including the target attribute and 303 instances. The attributes are as follows:
4.1.2 Dataset 2
Dataset 2 is collected from Kaggle [16]. It contains 10 attributes including the target
attribute and 462 instances. The attributes are as follows:
4.1.3 Dataset 3
Dataset 3 is collected from Kaggle [17]. It contains 12 attributes including the target
attribute and 70000 instances. The attributes are as follows:
The hybrid dataset is created from dataset 1 and 3, using only the common attributes from
these two individual datasets used in this project. This hybrid dataset contains 6 attributes
including the target attribute and 70303 instances. The common 6 attributes which are
considered to create this hybrid dataset, are given below (attribute names are considered
from dataset 3):
1. age
2. gender
3. ap_hi
4. cholesterol
5. gluc
6. cardio (target attribute)
Normalization is required to create this hybrid dataset. The detailed normalization process
for each of the 5 attributes (excluding target attribute cardio) is discussed as follows.
Attribute 1: age
In dataset 3, age is given in days. But in dataset 1, age is given in years. So, age in dataset
3 is converted from days to years.
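A minimal sketch of this conversion with pandas (the age values below are hypothetical, and truncation to whole years is an assumption about the rounding rule):

```python
import pandas as pd

# Hypothetical slice of dataset 3, where age is recorded in days
df3 = pd.DataFrame({"age": [18393, 20228, 23713]})

# Convert days to whole years to match dataset 1
df3["age"] = (df3["age"] / 365).astype(int)
```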
Attribute 2: gender
Attribute 3: ap_hi
This attribute refers to systolic blood pressure. In dataset 3, this attribute is named ap_hi, and the values are given in mmHg. In dataset 1, it is named trestbps, and the values are also given in mmHg. As the unit matches in both datasets, no additional normalization is required for this attribute.
Attribute 4: cholesterol
This attribute refers to serum cholesterol level. In dataset 3, this attribute is named as
cholesterol and the values are given as 1, 2 and 3. The meaning of them is as follows:
1: normal
2: above normal
3: well above normal
In dataset 1, this attribute is named as chol and the values are given in mg/dl unit. So, the
values for dataset 1 need normalization. Normalization is done as follows:
Attribute 5: gluc
This attribute refers to the glucose level or blood sugar level. In dataset 3, this attribute is named gluc and the values are given as 1, 2 and 3. The meaning of them is as follows:
1: normal
2: above normal
3: well above normal
In dataset 1, this attribute is named fbs (fasting blood sugar greater than 120 mg/dl) and the values are given as 0 or 1, where 1 means the fasting blood sugar exceeds 120 mg/dl and 0 means it does not.
Normalization is done for dataset 3. The value 1 is replaced with 0, and the values 2 and 3 are replaced with 1.
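A minimal sketch of this recoding with pandas (the gluc values below are hypothetical):

```python
import pandas as pd

# Hypothetical gluc values from dataset 3 (coded 1, 2 or 3)
df3 = pd.DataFrame({"gluc": [1, 2, 3, 1, 2]})

# Recode to the 0/1 scheme of the fbs attribute in dataset 1:
# 1 (normal) -> 0, while 2 and 3 (above normal) -> 1
df3["gluc"] = df3["gluc"].replace({1: 0, 2: 1, 3: 1})
```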
CHAPTER 5
Table 5.1 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 88.89% when the test size is 15% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear, C=2; test size 10%): 85.67%, logistic regression (solver=liblinear; test size
15%): 85.11%, naïve Bayes classifier (test size 15%): 84.22%, random forest classifier
(test size 20%): 83.83%, kNN classifier (k=5; test size 10%): 76.67%, and decision tree
classifier (test size 10%): 76.33%. Next, confusion matrix and classification reports are
calculated for these conditions and the results are shown in Table 5.2. It can be seen from
Table 5.2 that among all the classifiers, majority voting has the best precision of 90%, the
best recall of 86%, and the best F1 score of 88%.
The testing accuracy of the dataset 1 with cross-validation is discussed next. Table 5.3
shows the testing accuracy of dataset 1 with cross-validation for different classifiers.
Feature ranking for dataset 1 is done using the univariate feature selection algorithm. Table
5.4 shows the feature ranking for dataset 1.
Table 5.4: Feature ranking for dataset 1
Feature Rank
thalach 1
ca 2
oldpeak 3
thal 4
exang 5
age 6
chol 7
trestbps 8
cp 9
restecg 10
slope 11
sex 12
fbs 13
Table 5.3 indicates that for dataset 1 using cross validation method, the best accuracy of
84.83% is obtained by majority voting classifier with logistic regression, SVM, and naïve
Bayes where k=10. This result is for the case when all the 13 features are used for
prediction. The highest accuracy values of the individual classifiers for dataset 1 are illustrated in Figure 5.1. It can be seen from Figure 5.1 that SVM, logistic regression, naïve
Bayes, random forest, and majority voting exhibit accuracy values above 80%.
Next, univariate feature selection method is applied to see the impact of reducing the
number of features. From the experiments, the best result is obtained when the highest 11
ranked features shown in Table 5.4 are used for classification. Using those 11 features, an
85.15% accuracy is obtained with majority voting classifier.
[Figure 5.1: Highest accuracy values of the individual classifiers for dataset 1. The vertical axis shows accuracy (%) from 0 to 90; the horizontal axis lists the classifiers: decision tree, SVM, kNN, naïve Bayes, random forest, logistic regression, and majority voting.]
Table 5.5 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 76.60% when the test size is 10% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear; C=1; test size 25%): 73.88%, logistic regression (solver=lbfgs; test size
10%): 73.83%, naïve Bayes (test size 15%): 72.28%, kNN classifier (k=11; test size 10%):
70.2%, random forest (test size 25%): 70.17%, and decision tree (test size 25%): 61.55%.
Next, confusion matrix and classification reports are calculated for these conditions and
the results are shown in Table 5.6. It can be seen from Table 5.6 that among all the
classifiers, majority voting and logistic regression have the best precision of 69%, naïve
Bayes has the best recall of 65%, and logistic regression has the best F1 score of 65%.
The testing accuracy of the dataset 2 with cross-validation is discussed next. Table 5.7
shows the testing accuracy of dataset 2 with cross-validation for different classifiers.
Feature ranking for dataset 2 is done using the univariate feature selection algorithm. Table
5.8 shows the feature ranking for dataset 2.
Table 5.8: Feature ranking for dataset 2
Feature Rank
age 1
tobacco 2
adiposity 3
alcohol 4
sbp 5
ldl 6
famhist 7
typea 8
obesity 9
Table 5.7 indicates that for dataset 2 using cross validation method, the best accuracy of
72.72% is obtained by logistic regression (solver = lbfgs) where k=30. This result is for the
case when all the 9 features are used for prediction. Next, univariate feature selection
method is applied to see the impact of reducing the number of features. From the
experiments, the best result is obtained when the highest 8 ranked features shown in Table
5.8 are used for classification. Using those 8 features, a 73.62% accuracy is obtained with
logistic regression classifier.
Table 5.9 indicates that among all the classifiers, majority voting exhibits the best accuracy
of 74.80% when the test size is 10% of the data samples. Moreover, we can see that each
classifier has its own best accuracy score for a set of conditions. These are SVM
(kernel=linear; C=3; test size 20%): 72.98%, random forest classifier (test size 15%):
72.06%, kNN classifier (k=47; test size 10%): 71.88%, logistic regression
(solver=liblinear; test size 20%): 71.64%, decision tree classifier (test size 20%): 63.55%,
and naïve Bayes classifier (test size 10%): 59.4%. Next, confusion matrix and
classification reports are calculated for these conditions and the results are shown in Table
5.10. It can be seen from Table 5.10 that among all the classifiers, majority voting has the
best precision of 76%, random forest has the best recall of 69%, and random forest and
SVM have the best F1 score of 71%.
Table 5.10: Confusion matrix and classification report for dataset 3
The testing accuracy of the dataset 3 with cross-validation is discussed next. Table 5.11
shows the testing accuracy of dataset 3 with cross-validation for different classifiers.
Feature ranking for dataset 3 is done using the univariate feature selection algorithm. Table
5.12 shows the feature ranking for dataset 3.
Table 5.12: Feature ranking for dataset 3
Feature Rank
age 1
ap_lo 2
ap_hi 3
weight 4
cholesterol 5
gluc 6
active 7
smoke 8
alco 9
height 10
gender 11
Table 5.11 indicates that for dataset 3 using cross validation method, the best accuracy of
72.22% is obtained by SVM (linear kernel) where k=10. This result is for the case when
all the 11 features are used for prediction. Next, univariate feature selection method is
applied to see the impact of reducing the number of features. From the experiments, there is no improvement in the accuracy score; rather, the accuracy decreases with the reduction in the number of features.
Table 5.13 indicates that among all the classifiers, SVM (kernel=rbf; gamma=auto; C=2)
exhibits the best accuracy of 73.77% when the test size is 15% of the data samples.
Moreover, we can see that each classifier has its own best accuracy score for a set of
conditions. These are kNN classifier (k=43; test size 15%): 73.21%, random forest
classifier (test size 15%): 72.80%, decision tree classifier (test size 20%): 72.40%, majority
voting (test size 10%): 72.01%, logistic regression (solver=liblinear; test size 10%):
71.98%, and naïve Bayes classifier (test size 10%): 58.47%. Next, confusion matrix and
classification reports are calculated for these conditions and the results are shown in Table
5.14. It can be seen from Table 5.14 that among all the classifiers, majority voting has the
best precision of 78%, random forest has the best recall of 69%, and kNN classifier has the
best F1 score of 72%. The testing accuracy of the hybrid dataset using cross-validation
method is calculated next for different classifiers. Experimental results show that among
all the classifiers, the best accuracy of 71.33% is obtained by majority voting classifier
with logistic regression, SVM, and naïve Bayes where k=5.
Table 5.14: Confusion matrix and classification report for hybrid dataset
Figure 5.2 presents the best accuracy values that are obtained for holdout/cross-validation
methods and for three individual datasets and one hybrid dataset used in this work. From
Figure 5.2, it can be seen that the accuracy value is always higher for the case of dataset 1
compared to other datasets.
Among the three individual datasets used in this project, the dataset 1 from UCI repository
is also used in [7-8, 10] for the prediction of heart disease. A comparison is done among
the best results obtained from [7-8, 10] with the best result obtained in this work for this
particular dataset. This comparison is shown in Table 5.16. It can be seen from Table 5.16 that the majority voting used in this work gives an accuracy of 88.89%, which is better than the accuracy values of 84.10%, 84%, and 85.48% reported in [7], [8], and [10], respectively.
CHAPTER 6
Conclusion
6.1 Conclusion
The goal of this project is to predict heart disease using different datasets. In this project,
seven different machine learning algorithms are applied on three individual datasets and
one hybrid dataset. Different parameters associated with each machine learning algorithm are altered and applied to find out which case delivers the best accuracy score. Experimental results show that the most important attribute of dataset 1 is ‘thalach’, the maximum heart rate achieved by a patient, while that of datasets 2 and 3 is patient age. For dataset 1 using the holdout method, the best accuracy of 88.89% is obtained
when majority voting classifier is used with logistic regression, SVM, and naïve Bayes.
For the case of cross validation method, the best result of 84.83% accuracy for dataset 1 is
obtained by the same classifier. For dataset 2 using holdout method, the best accuracy of
76.60% is obtained when majority voting classifier is used with logistic regression, SVM,
and naïve Bayes. For the case of cross validation method, the best result of 72.72%
accuracy for dataset 2 is obtained by logistic regression. For dataset 3 using holdout
method, the best accuracy of 74.80% is obtained when majority voting classifier is used
with logistic regression, SVM, and naïve Bayes. For the case of cross validation method,
the best result of 72.22% accuracy for dataset 3 is obtained by SVM. For hybrid dataset
using holdout method, the best accuracy of 73.77% is obtained by SVM. For the case of
cross validation method, the best result of 71.33% accuracy for hybrid dataset is obtained
when majority voting classifier is used with logistic regression, SVM, and naïve Bayes.
Furthermore, it has been shown here that for dataset 1 and dataset 2, the accuracy values
slightly improve when selected features rather than all features are used for classification.
On the other hand, for dataset 3, there is no improvement in accuracy values with feature
selection. It is shown that for the hybrid dataset, the classification accuracy of 73.77% is
lower than the accuracy of 88.89% for dataset 1 and 74.80% for dataset 3. Hence, the effectiveness of heart disease diagnosis is reduced when multiple datasets are combined.
Results also show that among all the datasets and classifiers used in this work, the best
accuracy is obtained from the majority voting when used as a combination of logistic
regression, SVM, and naïve Bayes and when used on dataset 1. In this particular case, the
majority voting gives an accuracy of 88.89% which is better than other accuracy values
reported in the literature using the same dataset.
It is observed from this project work that the results of holdout method vary with the
variation in testing size, and the variation occurs even when the testing size is the same.
This is because the samples are drawn randomly, so they differ in each run. The cross-validation method gives a more stable result compared to the holdout method. It is also observed that, based on the dataset and its attributes, a suitable classifier has to be chosen to reliably diagnose heart disease.
Step 1: The installer file is downloaded from the Anaconda website. There are two Python versions available for download. The 64-bit graphical installer of the Python 3.7 version is downloaded for this project.
Step 2: After downloading the installer file, the next step is to launch the installation
wizard. Click on “Next” button to continue to next step.
Step 3: “License Agreement” window for Anaconda distribution package will appear.
Click on “I Agree” button to continue the setup process.
Step 4: “Select Installation Type” window will appear. Click on “Next” button to continue
to next step.
Step 5: “Choose Install Location” window will appear. Click “Next” button to continue
the setup process.
Step 6: The “Advanced Installation Options” window will appear, providing two options as shown in Figure 3.6. Check both options of the check-box and click the “Install” button to complete the installation process. This will install the Anaconda distribution package on the computer.
• JupyterLab
• Jupyter Notebook
• Spyder
• Visual Studio Code
• Glue
• Orange
• RStudio
In this project work, Jupyter Notebook is used to run the code needed for data analysis. It is a web-based interactive computational environment for creating notebook documents.