1 Introduction
Diabetes is known as a disease that is difficult to cure completely, and its symptoms vary widely in severity. The human body remains mysterious in many ways, and medical science cannot yet explain everything that happens inside it. Many factors, such as blood sugar, fat, insulin, blood pressure, and family history, affect the body's internal systems and can contribute to diabetes. The prime factor, however, is insulin. Looking deeper into the human body, the pancreas produces insulin to help the body assimilate sugar; a problem occurs when insulin cannot be produced at a regular level, causing metabolism to deteriorate. Exercise habits, age, and diet may also contribute to diabetes as secondary factors.
Several lines of research have proposed methods such as rule-based classification systems, which are widely used for diabetes diagnosis. A key challenge is to produce a comprehensive, optimal rule-set while balancing accuracy, sensitivity, and specificity. Most research attempts to build a genuinely useful, accurate model, but the problem is not easy to solve across different datasets because of differences in nationality and patient cases. Diabetes has three levels with corresponding factors: type 1 is associated with waist/hip ratio, fat accumulation, overweight, and high blood pressure; type 2, or pre-diabetes, involves problems with sugar consumption and abnormal insulin production; and type 3, the last stage of diabetes, involves complications across several diseases. A useful rule-set can be classified only if those datasets are reliable.
In 2002, N.V. Chawla et al. [1] studied the performance of Naive Bayes, C4.5 Decision Tree, and Ripper Rule classifiers using the Synthetic Minority Over-sampling Technique (SMOTE) to handle imbalanced datasets. Their evaluation methods were the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). The results gave better performance than plain over-sampling and under-sampling.
Handling missing values is a strategy to avoid misleading classification. Angeline Christobel et al. [2] studied the negative impact of missing-value imputation. They found that mean substitution, data scaling, and the K-nearest neighbor (KNN) algorithm can enhance the accuracy of their work.
In 2013, N. Wattanapongsakorn et al. [3] studied the performance of online intrusion detection using a Fuzzy Genetic Algorithm to create a rule-set with a trapezoidal membership function. They ran various experiments, including offline testing with the KDD99 dataset and online testing in their university environment. In 2015, H. Guan et al. [4] studied big-data processing with the vector-similarity-based local outlier factor (SLOF) and a bagging-feature method to improve precision and recall rates. Their report compared outlier detection methods and found SLOF the most suitable for big-data processing.
Then in 2016, S. Behera et al. [5] studied the most effective outlier detection method for breast cancer data by comparing several methods, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS), and other outlier detection methods. The results showed that LOF is the best at detecting noise with high accuracy.
In addition, P. Palwisut [6] studied improving decision trees using SMOTE on Internet Addiction Disorder data. SMOTE was tested with Decision Tree and Random Forest classifiers, and all models were compared using a confusion matrix. The results showed improvements of +5.24% in accuracy, +13.82% in sensitivity, and +8.47% in specificity.
M.S. Barale and D.T. Shirke [7] considered deleting missing values together with the KNN technique to classify data.
In "A hybrid model of Fuzzy ARTMAP and Genetic Algorithm for data classification and rule extraction" [8], F. Pourpanah et al. studied a Fuzzy ARTMAP with Genetic Algorithm model (QFAM-GA), comparing its performance on different datasets, both noisy and noise-free.
Next, Šárka Brodinová et al. [9] studied imbalanced clustering (IClust) of high-dimensional media data by merging multiple groups based on an outlier detection method using the local outlier factor (LOF).
In "Interpretable and accurate medical data classification – a multi-objective genetic-fuzzy optimization approach" [10], M.B. Gorzałczany and F. Rudziński used a multi-objective evolutionary algorithm (MOEA) with a fuzzy algorithm to produce rules, tested on several datasets with satisfactory results.
In "SM-RuleMiner: Spider monkey based rule miner using novel fitness function for diabetes classification" [11], R. Cheruku, D.R. Edla, and V. Kuppili studied the performance of SM-RuleMiner on the Pima Indian dataset. With 10-fold cross-validation, accuracy and sensitivity were good, but specificity was low; the result is satisfactory in terms of sensitivity analysis.
In this paper, we classify diabetes into two classes: "yes" (with diabetes) and "no" (without diabetes). The Fuzzy Genetic Algorithm (Fuzzy GA) has recently been applied to many problems with satisfying results [4, 5, 10]. The technique can help generate rules to classify disease when medical data are available.
The remainder of this paper is organized as follows. Section 2 discusses the research methodology: the dataset, data preprocessing, and the fuzzy genetic algorithm, which is applied to classify the Pima Indian dataset. Section 3 presents the classification results. Finally, Section 4 concludes this research work.
2 Research Methodology
2.1 Dataset
We consider the Pima Indian dataset from the UCI Machine Learning Repository [12] as our input dataset. This dataset has 9 attributes and 768 instances. All attributes are described in Table 1.
The original dataset is imbalanced. We used SMOTE to deal with this problem, with satisfactory outcomes. SMOTE requires tuning to synthesize a density-based population for the minority group. However, the synthesized data may contain outliers, so we used BaggingLOF to identify them, which directly improved classification accuracy. Our approach is explained in detail as follows.
Mean Imputation
We used mean imputation to deal with missing values. For the implementation, we split the data into two sections, one for class "yes" and one for class "no", then calculated the mean of each attribute within each class and replaced each missing value with its attribute's class mean. This method reduces replacement error better than other methods such as global mean imputation, K-means imputation, EM imputation, and median imputation.
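The class-wise mean imputation described above can be sketched as follows (an illustrative Python version, not the exact implementation used in this work; the function name is ours):

```python
import numpy as np

def classwise_mean_impute(X, y):
    """Replace each missing value (NaN) with the mean of that attribute
    computed only over the rows sharing the same class label."""
    X = X.astype(float).copy()
    for label in np.unique(y):
        rows = (y == label)
        for col in range(X.shape[1]):
            vals = X[rows, col]
            mask = np.isnan(vals)
            if mask.any():
                vals[mask] = np.nanmean(vals)  # class-conditional mean
                X[rows, col] = vals
    return X
```

Splitting by class before averaging keeps the imputed values closer to the distribution of each diagnosis group than a single global mean would.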
SMOTE
We used SMOTE [1] in the Weka tool to generate samples, with the parameter settings percentage of created data = 86% and nearest neighbors = 15% of all instances. This method synthesizes new samples from the neighborhood of existing instances, generating the new sampling.
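For readers without Weka, the core SMOTE idea can be sketched in a few lines of Python (a simplified version of the algorithm in [1]; the function and parameter names are ours, with Weka's percentage and nearest-neighbor settings mapping roughly onto `n_new` and `k`):

```python
import numpy as np

def smote(minority, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: for each synthetic sample, pick a random
    minority instance, one of its k nearest minority neighbours, and
    interpolate at a random point on the segment between them."""
    rng = np.random.default_rng(rng)
    # pairwise distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)           # never pick yourself
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        j = neighbours[i, rng.integers(k)]
        gap = rng.random()                # random point on the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because every synthetic point lies between two real minority instances, the oversampled class stays inside its original region of the feature space.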
BaggingLOF
We used the Weka tool to generate the BaggingLOF probability [2, 3, 6]. The outlier threshold is the mean probability over the whole dataset. The original version of this algorithm does not specify a threshold for noise detection, so we tuned it ourselves for the best result. Outliers can be introduced when performing SMOTE; thus, an outlier detection method is needed to clean the data.
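Weka's BaggingLOF combines outlier scores over bagged feature subsets. As a simplified stand-in (ours, not the Weka implementation: it averages k-NN distance scores rather than full LOF densities, but keeps the feature bagging and the mean-score threshold described above):

```python
import numpy as np

def bagged_knn_outliers(X, k=5, n_bags=10, rng=0):
    """Feature-bagging outlier sketch: score each instance by its mean
    distance to its k nearest neighbours on a random feature subset,
    average the scores over all bags, and flag instances whose averaged
    score exceeds the overall mean score (the threshold used above)."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    scores = np.zeros(n)
    for _ in range(n_bags):
        cols = rng.choice(p, size=max(1, p // 2), replace=False)
        Xs = X[:, cols]
        d = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)       # ignore self-distance
        scores += np.sort(d, axis=1)[:, :k].mean(axis=1)
    scores /= n_bags
    return scores > scores.mean()         # True marks a suspected outlier
```

Averaging over random feature subsets makes the score less sensitive to any single noisy attribute, which is the point of the bagging step.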
From Fig. 2, the goal of this work is to maximize accuracy along with specificity and sensitivity. We maximize a fitness function that rewards correct "yes" outputs and penalizes "yes" records misclassified as "no". The fitness function is defined as follows.
fitness = y/Y − n/N    (1)
where
Y is the total number of “yes” records in the dataset.
N is the total number of “no” records in the dataset.
y is the number of “yes” records correctly classified as “yes”.
n is the number of “yes” records incorrectly classified as “no”.
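Equation (1) can be computed directly from these four counts; a minimal sketch (function name ours):

```python
def fitness(y_correct, n_miss, Y_total, N_total):
    """Fitness of a candidate rule (eq. 1): reward the fraction of "yes"
    records classified correctly, penalize the fraction misclassified
    as "no" relative to the "no" class size."""
    return y_correct / Y_total - n_miss / N_total
```

A perfect rule-set reaches fitness 1.0 (all "yes" records found, none missed), while misclassifications pull the value down.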
This paper uses the trapezoidal membership function together with the genetic algorithm. The trapezoidal membership function offers a flexible shape/rule, depending on its adjustable parameters, for identifying/predicting the output class. According to Fig. 2, the probabilities of the "yes" and "no" outputs are described with a trapezoidal membership function; the trapezoidal scale is assigned as the probability of each attribute considered in a rule. The parameters consist of a, b, c, and d, and the interval of each is defined as 1.0–7.0 in this work. We briefly explain the concept of the methodology as shown in Fig. 3.
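The trapezoidal membership function with parameters a, b, c, d can be sketched as follows (an illustrative version, assuming a < b ≤ c < d):

```python
def trapezoid(x, a, b, c, d):
    """Trapezoidal membership: 0 outside [a, d], rises linearly on
    [a, b], equals 1 on the plateau [b, c], falls linearly on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)   # rising edge
    return (d - x) / (d - c)       # falling edge
```

Shifting a, b, c, d moves and widens the plateau, which is exactly what the GA tunes per attribute to shape each rule.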
From Fig. 3, the fuzzy algorithm in the fifth line is the trapezoidal membership function shown in Fig. 4. In addition, in Fig. 4, a gene is a sub-part of the chromosome in the genetic algorithm, which we discuss next.
Fig. 5. One gene with 4 blocks refers to one attribute with a range of 1–7.
The length of the chromosome equals the total number of attributes/genes, so the total number of blocks in the chromosome equals the number of attributes multiplied by four. We chose integer encoding for a simple implementation. The GA is used to find appropriate values of a, b, c, and d for each attribute, forming a classification rule that maximizes prediction accuracy; it generates many candidate rules in the search for the best one. The GA parameter settings are as follows. The main GA process consists of reproduction and mutation. We used the single-point crossover method. For mutation, we randomly change a block value within the range 1 to 7. For parent selection, tournament selection is chosen because it offers a good balance of selection pressure when choosing candidates to create offspring.
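The three GA operators described above (single-point crossover, block mutation in the 1–7 range, and tournament selection) can be sketched as follows; the function names are ours:

```python
import random

def single_point_crossover(p1, p2, rng=random):
    """Swap the tails of two integer-encoded chromosomes at a random cut."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(chrom, rate=0.05, rng=random):
    """Replace each block, with probability `rate`, by a random integer
    in the encoding range 1..7."""
    return [rng.randint(1, 7) if rng.random() < rate else g for g in chrom]

def tournament_select(population, fitnesses, size=3, rng=random):
    """Pick `size` random candidates and return the fittest one."""
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]
```

With four blocks (a, b, c, d) per attribute, a chromosome for the 8 predictive attributes is a list of 32 integers, and these operators evolve it toward the rule with the highest fitness.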
                       Accuracy (%)   Sensitivity (%)   Specificity (%)
Dataset with SMOTE        83.26           84.33             82.19
3 Classification Result
We tested on the Pima Indian dataset, using 5-fold cross-validation as the evaluation criterion, on a personal computer with a 2.9 GHz Intel i5 CPU and 8 GB of RAM. The results are reported in Tables 2, 3, and 4, with 87.40% accuracy, 86.82% sensitivity, and 88% specificity, respectively. Accuracy describes the overall correctness of classification into disease and no disease.
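The three reported measures follow directly from the confusion-matrix counts; as a quick reference (function name ours):

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, and specificity from confusion-matrix
    counts: tp/fn count "yes" (disease) records classified right/wrong,
    fp/tn count "no" (healthy) records classified wrong/right."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    sensitivity = tp / (tp + fn)   # share of disease cases detected
    specificity = tn / (tn + fp)   # share of healthy cases detected
    return accuracy, sensitivity, specificity
```

Reporting sensitivity and specificity alongside accuracy matters here because a classifier can reach high accuracy on an imbalanced medical dataset while missing most disease cases.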
We compared our FuzzyGA results with those of QFAM-GA [8], MOEA [10], and SM-RuleMiner [11]. The best accuracies obtained by QFAM-GA, MOEA, and SM-RuleMiner were 90.35%, 81.5%, and 89.87%, respectively. However, when testing with noise in the dataset, QFAM-GA gave 85.61% accuracy, lower than our FuzzyGA under the same noise. In addition, our results are superior and well balanced in terms of both sensitivity and specificity.
4 Conclusion
This paper applied FuzzyGA to classify diabetes. Data preprocessing is required to handle missing values and the imbalanced dataset. Classifying or predicting patients with a critical disease requires a highly accurate method with high sensitivity and specificity. Our Fuzzy GA gave superior performance on this problem. We plan to apply this algorithm to other research problems in the near future.
References
1. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic Minority
Over-sampling Technique. J. Artificial Intelligence Research 16, 321–357 (2002)
2. Christobel, A., Prakasam, S.: The Negative Impact of Missing Value Imputation in Classi-
fication of Diabetes Dataset and Solution for Improvement. IOSR J. Computer Engineering
7(4), 16-23 (2012)