
Diabetes Classification with Fuzzy Genetic Algorithm

Wissanu Thungrut1 and Naruemon Wattanapongsakorn1


1 Department of Computer Engineering

King Mongkut’s University of Technology Thonburi, Bangkok, Thailand


126 Pracha-utid, Bangmod, Thoongkru, Bangkok, Thailand
thungrut@gmail.com, naruemon.wat@kmutt.ac.th

Abstract. In this work, we consider diabetes classification on the PIMA Indian dataset
using a Fuzzy Genetic Algorithm. We combine two algorithms, a fuzzy algorithm and a
genetic algorithm, to enhance classification performance. In addition, we use the
Synthetic Minority Over-sampling Technique (SMOTE) to handle the imbalanced dataset.
Our experiments show that evaluation with 5-fold cross-validation is a suitable approach
and gives very good results compared with those obtained from other techniques.

Keywords: Fuzzy genetic algorithm, diabetes, SMOTE, BaggingLOF, imputation technique

1 Introduction
Diabetes is known to be a disease that is difficult to cure completely, and its symptoms
vary across stages of severity. The human body is complex, and medicine cannot yet
explain everything that happens within it. Many factors affect the body's internal
systems and can contribute to diabetes, such as blood sugar, fat, insulin, blood pressure
and pedigree. The prime factor, however, is insulin. The pancreas produces insulin so
that the body can assimilate sugar, and a problem occurs when insulin cannot be produced
at a regular level, causing the assimilation of sugar to deteriorate. Secondary factors
that may affect diabetes include exercise, age and diet.
Several studies have proposed methods for diabetes diagnosis, among which rule-based
classification systems are widely used. A challenge is to produce a comprehensive,
optimal rule-set while balancing accuracy, sensitivity and specificity. Most studies
attempt to obtain a practically useful model with high accuracy. However, the problem is
not easy to solve across different datasets because of differences in nationality or
patient cases. Diabetes is described here in three levels with corresponding factors.
Type 1 involves the waist/hip ratio, fat accumulation, overweight and high blood
pressure. Type 2, or pre-diabetes, involves problems with sugar consumption and abnormal
insulin production. Type 3 is the last stage of diabetes and involves complications from
several diseases. A useful rule-set can be derived only if the datasets are reliable.
In 2002, N.V. Chawla et al. [1] studied the performance of Naive Bayes, the C4.5 decision
tree and the Ripper rule learner using the Synthetic Minority Over-sampling Technique
(SMOTE) to handle imbalanced datasets. Their evaluation measures were the Receiver
Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The results
showed better performance than those obtained with over-sampling and under-sampling.
Handling missing values is a strategy for avoiding misleading classification. Angeline
Christobel et al. [2] studied the negative impact of missing value imputation. They
found that mean substitution, data scaling and the K-nearest neighbor (KNN) algorithm
can enhance the accuracy of their work.
In 2013, N. Wattanapongsakorn et al. [3] studied online intrusion detection using a Fuzzy
Genetic Algorithm to create a rule-set with a trapezoidal membership function. They ran
various experiments, including offline testing with the KDD99 dataset and online testing
in their university environment. In 2015, H. Guan et al. [4] studied big data processing
with the vector similarity based local outlier factor (SLOF) and a bagging feature method
to improve precision and recall rates. Their report compared outlier detection methods
and found SLOF to be the most suitable for big data processing.
Then in 2016, S. Behera and R. Rani [5] looked for the most effective outlier detection
method on breast cancer data by comparing several methods, such as density-based spatial
clustering of applications with noise (DBSCAN), ordering points to identify the
clustering structure (OPTICS), and the local outlier factor (LOF). The results showed
that LOF is the best at detecting noise with high accuracy.
In addition, P. Palwisut [6] studied the improvement of the decision tree using SMOTE
for Internet Addiction Disorder data. SMOTE was tested on classification with Decision
Tree and Random Forest. The models were compared using a confusion matrix. The results
show improvements of +5.24% in accuracy, +13.82% in sensitivity, and +8.47% in
specificity.
M.S. Barale and D.T. Shirke [7] considered deleting records with missing values together
with the KNN technique to classify data.
F. Pourpanah et al. [8] studied a hybrid model of Fuzzy ARTMAP and Genetic Algorithm
(QFAM-GA) for data classification and rule extraction, and compared its performance on
different datasets with and without noise.
Next, Šárka Brodinová et al. [9] studied imbalanced clustering (IClust) in
high-dimensional media data by merging multiple groups based on an outlier detection
method using the local outlier factor (LOF).
M.B. Gorzałczany and F. Rudziński [10] presented a multi-objective genetic-fuzzy
optimization approach to interpretable and accurate medical data classification. They
used a Multi-Objective Genetic Algorithm (MOEA) with a fuzzy algorithm to produce rules
and tested it on several datasets, with satisfactory results.
R. Cheruku, D.R. Edla and V. Kuppili [11] studied the performance of SM-RuleMiner, a
spider monkey based rule miner using a novel fitness function for diabetes
classification, on the PIMA Indian dataset. Tested with 10-fold cross-validation, it
achieved good accuracy and sensitivity but low specificity. The results are satisfactory
in terms of sensitivity.
In this paper, we classify diabetes into two classes: “yes” (with diabetes) and “no”
(without diabetes). The Fuzzy Genetic Algorithm (Fuzzy GA) has recently been proposed to
solve many problems with satisfying results [4, 5, 10]. The technique can help generate
rules to classify disease when medical data is available.
The remainder of this paper is organized as follows. In Section 2, we discuss the
research methodology, dataset, data preprocessing and fuzzy genetic algorithm, which is
applied to classify the PIMA Indian dataset. Section 3 presents the classification
results. Finally, in Section 4, we give the conclusion of this research work.

2 Research Methodology

2.1 Dataset
We use the Pima Indian dataset from the UCI Machine Learning Repository [12] as our
input dataset. This dataset has 9 attributes and 768 instances. All attributes are
described in Table 1.

Table 1. Pima Indian dataset.

No  Attribute name                                    Mean    Standard deviation  Min…Max
1   Number of times pregnant                          3.8     3.4                 0…17
2   Plasma glucose concentration at 2 hours           120.9   32.0                0…199
3   Diastolic blood pressure (mm Hg)                  69.1    19.4                0…122
4   Triceps skin fold (mm)                            20.5    16.0                0…99
5   2-hour serum insulin (mu U/ml)                    79.8    115.2               0…846
6   Body mass index (weight in kg/(height in m)^2)    32.0    7.9                 0…67.1
7   Diabetes pedigree function                        0.5     0.3                 0.078…2.42
8   Age (years)                                       33.2    11.8                21…81
9   Class variable (yes/no)                           -       -                   -

2.2 Data Preprocessing


The major problem with this dataset is missing values. We studied several methods for
missing value imputation [8, 10, 12] and found that the best approach for this problem is
mean imputation within each class. The original dataset contains two output classes: yes
(with diabetes) and no (without diabetes). The ratio of “no” to “yes” records is 1:0.53,
as shown in Fig. 1.

Fig. 1. The original dataset.

The original dataset is imbalanced. We used SMOTE to deal with this imbalance, with
satisfactory outcomes. SMOTE must be tuned so that it synthesizes a density-based
population for the instance group. However, the synthetic data may contain some outliers,
so we used BaggingLOF to identify them, which directly improved the accuracy of the
classification. Our approach is explained in detail as follows.

Mean Imputation

We used mean imputation to deal with missing values. In the implementation, we split the
data into two groups, one for class “yes” and one for class “no”. We then calculated the
mean value of each attribute within each class and replaced each missing value with the
attribute mean of its class. This method reduces the error introduced by data replacement
better than other methods such as global mean imputation, K-means imputation, EM
imputation and median imputation.
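
The class-wise mean imputation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `classwise_mean_impute` is hypothetical, and it assumes that missing entries have already been marked as `None` by the caller.

```python
from statistics import mean

def classwise_mean_impute(rows, labels):
    """Replace None entries with the mean of the same attribute within the same class.

    Assumption: missing values are encoded as None before this function is called.
    """
    n_attrs = len(rows[0])
    # Collect the observed (non-missing) values per class and per attribute.
    observed = {}
    for row, label in zip(rows, labels):
        cols = observed.setdefault(label, [[] for _ in range(n_attrs)])
        for j, value in enumerate(row):
            if value is not None:
                cols[j].append(value)
    # Mean of each attribute within each class (None if nothing was observed).
    means = {label: [mean(col) if col else None for col in cols]
             for label, cols in observed.items()}
    # Fill every missing entry with the mean of its own class.
    return [[means[label][j] if value is None else value
             for j, value in enumerate(row)]
            for row, label in zip(rows, labels)]
```

For example, with two “yes” records `[1.0, None]` and `[3.0, 4.0]`, the missing second attribute is filled with 4.0, the mean of that attribute over the “yes” class only.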

SMOTE

We used the SMOTE [7] implementation in the Weka tool to generate synthetic samples. The
parameters were set to percentage of data to create = 86% and nearest neighbors = 15% of
all instances. This method synthesizes new samples from the neighborhood of existing
instances in the dataset.
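
To make the idea of synthesizing samples from a neighborhood concrete, the following is a minimal sketch of the core SMOTE step in plain Python. It is not the Weka filter used in this work and does not reproduce its parameters; the function name `smote` and the arguments `n_new`, `k` and `seed` are assumptions made for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from a list of minority-class feature vectors.

    Each synthetic point lies on the line segment between a randomly chosen
    minority sample and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Because every new point is an interpolation between two existing minority samples, the synthetic data stays inside the region already occupied by the minority class, although it can still fall close to majority-class records.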

BaggingLOF

We used the Weka tool to compute the BaggingLOF outlier probability [2, 3, 6]. The
threshold that defines an outlier is the mean probability over the whole dataset. The
original version of this algorithm gives no guidance on the threshold for noise
detection, so we had to determine it experimentally for the best result. Outliers can be
introduced when performing SMOTE, so an outlier detection method is needed to clean the
data.
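
As an illustration of this step, the sketch below computes a local outlier factor on random feature subsets, averages the scores over several rounds, and flags records whose score exceeds the mean score. It is an assumption-laden sketch and not the Weka implementation: the names `lof_scores`, `bagging_lof` and `flag_outliers`, the subset-size rule, the values of `k` and `rounds`, and the use of raw LOF scores in place of probabilities are all hypothetical choices.

```python
import math
import random

def lof_scores(points, k):
    """Local outlier factor of every point (ties in k-distance ignored for brevity)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Indices of the k nearest neighbours of each point, excluding itself.
    knn = [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
           for i in range(n)]
    k_dist = [dist[i][knn[i][-1]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in knn[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [sum(lrd[j] for j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)]

def bagging_lof(points, k=3, rounds=10, seed=0):
    """Average LOF scores over random feature subsets (feature bagging)."""
    rng = random.Random(seed)
    d = len(points[0])
    totals = [0.0] * len(points)
    for _ in range(rounds):
        # Assumed rule: use between d/2 and d-1 randomly chosen features per round.
        size = rng.randint(max(1, d // 2), max(1, d - 1))
        feats = rng.sample(range(d), size)
        projected = [[p[f] for f in feats] for p in points]
        for i, score in enumerate(lof_scores(projected, k)):
            totals[i] += score
    return [t / rounds for t in totals]

def flag_outliers(scores):
    """Flag records whose score is above the mean score of the whole dataset."""
    threshold = sum(scores) / len(scores)
    return [s > threshold for s in scores]
```

Records flagged in this way would be removed before training. Using the dataset mean as the threshold follows the text; any other cut-off would be a tuning decision.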

2.3 Fuzzy Genetic Algorithm


We used a Fuzzy Genetic Algorithm (Fuzzy GA) with a trapezoidal membership function,
based on the work of N. Wattanapongsakorn et al. [3]. In this work, the Fuzzy GA
classifies the diabetic output class effectively.

Fig. 2. A Trapezoidal membership function.

The goal of this work is to achieve high accuracy together with high sensitivity and
specificity. We maximize a fitness function that rewards records correctly classified as
“yes” and penalizes records incorrectly classified as “yes”. The fitness function is
defined as follows.

$$\mathit{fitness} = \frac{y}{Y} - \frac{n}{N} \qquad (1)$$

where
Y is the total number of “yes” records in the dataset.
N is the total number of “no” records in the dataset.
y is the number of “yes” records correctly classified as “yes”.
n is the number of “no” records incorrectly classified as “yes”.
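
Equation (1) can be sketched in code as follows. The sketch reads n as the number of “no” records wrongly classified as “yes”, which is the reading consistent with the denominator N (the total number of “no” records); this reading and the function name `fitness` are assumptions made for illustration.

```python
def fitness(predicted, actual):
    """Fitness of a rule: true-positive rate minus false-positive rate.

    Assumption: n counts "no" records that the rule labels "yes", so that
    n / N is a rate over the "no" records, as the denominator N suggests.
    """
    Y = sum(1 for a in actual if a == "yes")
    N = sum(1 for a in actual if a == "no")
    y = sum(1 for p, a in zip(predicted, actual) if a == "yes" and p == "yes")
    n = sum(1 for p, a in zip(predicted, actual) if a == "no" and p == "yes")
    return y / Y - n / N
```

Under this reading, a perfect rule scores 1.0, while a rule that labels every record “yes” scores 0.0, so the fitness cannot be maximized by predicting a single class.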

2.4 Fuzzy Algorithm

This paper uses the trapezoidal membership function together with the genetic algorithm.
The trapezoidal membership function offers a flexible shape, and hence a flexible rule,
depending on the values of its adjustable parameters, to predict the output class. As
shown in Fig. 2, the probabilities of the “yes” and “no” outputs are described by a
trapezoidal membership function. The trapezoid assigns a probability to each attribute
considered in a rule. Its parameters are a, b, c and d, each of which takes a value in
the range 1.0 to 7.0 in this work. The concept of the methodology is outlined in Fig. 3.

Fig. 3. A Fuzzy Algorithm

The shape of the trapezoidal membership function changes with the values of a, b, c and
d. In the graph shown in Fig. 2, the x-axis represents the value of an attribute and the
y-axis the probability of “yes” for the output class. The parameters must be sorted so
that a ≤ b ≤ c ≤ d. The probability of being “yes” is then calculated according to the
pseudocode shown in Fig. 4.

Fig. 4. Pseudocode for the trapezoidal membership function.

In Fig. 3, the fifth line of the fuzzy algorithm calls the trapezoidal membership
function given in Fig. 4. The gene referred to in Fig. 4 is a subunit of the chromosome
in the genetic algorithm, which we discuss next.
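
For reference, the standard textbook form of a trapezoidal membership function with parameters a ≤ b ≤ c ≤ d is sketched below. It may differ in detail from the pseudocode in Fig. 4, which is not reproduced here; the function name `trapezoid` is an assumption.

```python
def trapezoid(x, a, b, c, d):
    """Degree of membership of x in a trapezoid with corners a <= b <= c <= d."""
    if not a <= b <= c <= d:
        raise ValueError("parameters must satisfy a <= b <= c <= d")
    if x < a or x > d:
        return 0.0                      # outside the trapezoid
    if x < b:
        return (x - a) / (b - a)        # rising edge between a and b
    if x <= c:
        return 1.0                      # flat top between b and c
    return (d - x) / (d - c)            # falling edge between c and d
```

With a = 1, b = 3, c = 5 and d = 7, an attribute value of 2 gives a membership of 0.5, any value between 3 and 5 gives 1.0, and values outside the range 1 to 7 give 0.0. When a = b or c = d, the corresponding edge becomes vertical and no division takes place.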

2.5 Genetic Algorithm (GA)


A chromosome contains N genes and each gene has its corresponding a, b, c, d for
determining a rule for an attribute value.

Fig. 5. One gene with 4 blocks refers to one attribute with a range of 1-7.

The length of the chromosome is equal to the total number of attributes (genes). The
total number of blocks in the chromosome is equal to the number of attributes multiplied
by four. We chose an integer encoding for simplicity of implementation. The GA is used to
find appropriate values of a, b, c and d for each attribute, forming a classification
rule that maximizes the accuracy of prediction. The GA generates many candidate rules in
search of the best classification rule. The GA parameter settings are as follows.

• Crossover rate = 90%
• Mutation rate = 80%
• Population size (number of solutions considered at the same time) = 100
• Number of generations/iterations = 300

The main GA process consists of reproduction and mutation. We used the single-point
crossover method. For mutation, we randomly change the value of a block within the range
1 to 7. For parent selection, we chose tournament selection to pick the candidates that
create offspring.
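
The GA loop described above can be sketched as follows, using the stated encoding (four integer blocks per attribute, each in the range 1 to 7), single-point crossover, block mutation and tournament selection. Several details are assumptions that the text does not specify: the tournament size of 2, applying the mutation rate once per offspring, sorting the four blocks of a gene to obtain a ≤ b ≤ c ≤ d, keeping the best chromosome seen so far, and all function names.

```python
import random

LOW, HIGH = 1, 7          # block value range stated in the text
BLOCKS_PER_GENE = 4       # a, b, c, d

def random_chromosome(n_attrs, rng):
    return [rng.randint(LOW, HIGH) for _ in range(n_attrs * BLOCKS_PER_GENE)]

def decode(chromosome):
    """Split into genes and sort each so that a <= b <= c <= d (an assumption)."""
    return [sorted(chromosome[i:i + BLOCKS_PER_GENE])
            for i in range(0, len(chromosome), BLOCKS_PER_GENE)]

def tournament(population, scores, rng, size=2):
    """Pick `size` random candidates and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    return population[max(contenders, key=lambda i: scores[i])]

def single_point_crossover(p1, p2, rng):
    point = rng.randint(1, len(p1) - 1)
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def mutate(chromosome, rng):
    """Replace one randomly chosen block with a new value in [LOW, HIGH]."""
    child = list(chromosome)
    child[rng.randrange(len(child))] = rng.randint(LOW, HIGH)
    return child

def run_ga(fitness, n_attrs, pop_size=100, generations=300,
           crossover_rate=0.9, mutation_rate=0.8, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(n_attrs, rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        offspring = []
        while len(offspring) < pop_size:
            p1 = tournament(population, scores, rng)
            p2 = tournament(population, scores, rng)
            if rng.random() < crossover_rate:
                c1, c2 = single_point_crossover(p1, p2, rng)
            else:
                c1, c2 = list(p1), list(p2)
            for child in (c1, c2):
                if rng.random() < mutation_rate:
                    child = mutate(child, rng)
                offspring.append(child)
        population = offspring[:pop_size]
        # Keep the best chromosome found so far (assumed, not stated in the text).
        best = max(population + [best], key=fitness)
    return best
```

In practice, the `fitness` argument would decode a chromosome into one trapezoid per attribute, classify the training records, and return the value of Eq. (1).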

Table 2. Considering SMOTE and 5-Fold Cross-validation test

Dataset               Accuracy (%)   Sensitivity (%)   Specificity (%)
Original dataset      75.91          62.31             83.03
Dataset with SMOTE    83.26          84.33             82.19

Table 3. Considering SMOTE and BaggingLOF with 5-Fold Cross-validation test

Dataset               Accuracy (%)   Sensitivity (%)   Specificity (%)
SMOTE                 83.26          84.33             82.19
SMOTE+BaggingLOF      85.82          84.71             86.82



Table 4. Detailed results of the 5-Fold Cross-validation

Fold      TP rate   TN rate   FP rate   FN rate   Accuracy   Sensitivity   Specificity
          (%)       (%)       (%)       (%)       (%)        (%)           (%)
1         91.23     88.57     11.42     8.77      89.76      91.23         88.57
2         89.55     86.67     13.33     10.44     88.19      89.55         86.67
3         78.13     92.06     7.93      21.87     85.04      78.13         92.06
4         89.47     88.57     11.42     10.52     88.98      89.47         88.57
5         85.71     84.51     15.49     14.28     85.04      85.71         84.51
Average   86.82     88.08     11.91     13.17     87.40      86.82         88.08

3 Classification Result
The tests were run on the Pima Indian dataset and evaluated with 5-fold
cross-validation. We tested on a personal computer with a 2.9 GHz Intel i5 CPU and 8 GB
of RAM. The results are reported in Tables 2, 3 and 4, with an average accuracy of
87.40%, sensitivity of 86.82% and specificity of 88.08% (Table 4). Accuracy describes the
overall correctness of classification into disease and no disease.
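
The evaluation measures can be computed from the predictions on each fold as in the sketch below, which also shows one simple way to split record indices into five folds. The function names `k_fold_indices` and `evaluate`, the shuffling seed and the unstratified split are assumptions; the text does not say how the folds were formed.

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Shuffle the record indices and deal them into k folds of near-equal size."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def evaluate(predicted, actual, positive="yes"):
    """Accuracy, sensitivity (TP rate) and specificity (TN rate) of a classifier."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    tn = sum(p != positive and a != positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Each fold serves once as the test set while the remaining four folds are used for training, and the per-fold measures are then averaged, as in the last row of Table 4.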
We compared our FuzzyGA results with those of QFAM-GA [8], MOEA-GA [10] and
SM-RuleMiner [11]. The best accuracies obtained by QFAM-GA, MOEA-GA and SM-RuleMiner
were 90.35%, 81.5% and 89.87%, respectively. However, when tested with noise in the
dataset, QFAM-GA gave 85.61% accuracy, which is lower than that of our FuzzyGA on data
with noise. In addition, the results of our work are superior and well balanced in terms
of sensitivity and specificity.

4 Conclusion and Future Work

This paper applied FuzzyGA to classify diabetes. Data preprocessing is required to handle
missing values and the imbalanced dataset. Classifying or predicting patients with a
critical disease requires a highly accurate method with high sensitivity and specificity.
Our Fuzzy GA gave superior performance on this problem. We plan to apply this algorithm
to other research problems in the near future.

References
1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority
Over-sampling Technique. J. Artificial Intelligence Research 16, 321–357 (2002)
2. Christobel, A., Prakasam, S.: The Negative Impact of Missing Value Imputation in
Classification of Diabetes Dataset and Solution for Improvement. IOSR J. Computer
Engineering 7(4), 16–23 (2012)
3. Wattanapongsakorn, N., Jongsuebsuk, P., Charnsripinyo, C.: Real-Time Intrusion
Detection with Fuzzy Genetic Algorithm. In: 10th International Conference on Electrical
Engineering/Electronics, Computer, Telecommunications and Information Technology, Art.
no. 6559603 (2013)
4. Guan, H., Li, Q., Yan, Z., Wei, W.: SLOF: Identify Density-based Local Outliers in Big
Data. In: 12th International Conference on Web Information System and Application, Art.
no. 7396608, pp. 61–66 (2015)
5. Behera, S., Rani, R.: Comparative Analysis of Density based Outlier Detection
Techniques on Breast Cancer Data Using Hadoop and Map Reduce. In: International
Conference on Inventive Computation Technologies, Art. no. 7824883 (2016)
6. Palwisut, P.: Improving Decision Tree Technique in Imbalanced Data Sets Using SMOTE
for Internet Addiction Disorder Data. Information Technology Journal 12, 54–63 (2016)
7. Barale, M.S., Shirke, D.T.: Cascaded Modeling for PIMA Indian Diabetes Data.
International Journal of Computer Applications 139(11), 1–4 (2016)
8. Pourpanah, F., Lim, C.P., Saleh, J.M.: A Hybrid Model of Fuzzy ARTMAP and Genetic
Algorithm for Data Classification and Rule Extraction. Expert Systems with Applications,
74–85 (2016)
9. Brodinová, S., Zaharieva, M., Filzmoser, P., Ortner, T., Breiteneder, C.: Clustering
of Imbalanced High-dimensional Media Data. Advances in Data Analysis and Classification,
1–24 (2017)
10. Gorzałczany, M.B., Rudziński, F.: Interpretable and Accurate Medical Data
Classification – a Multi-objective Genetic-fuzzy Optimization Approach. Expert Systems
with Applications, 26–39 (2017)
11. Cheruku, R., Edla, D.R., Kuppili, V.: SM-RuleMiner: Spider Monkey based Rule Miner
using Novel Fitness Function for Diabetes Classification. Computers in Biology and
Medicine, 79–92 (2017)
12. Sigillito, V.: UCI Machine Learning Repository. Retrieved from
https://archive.ics.uci.edu/ml/datasets/pima+indians+diabetes
