
1114 IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 14, NO. 4, JULY 2010

Intelligible Support Vector Machines for Diagnosis of Diabetes Mellitus

Nahla H. Barakat, Andrew P. Bradley, Senior Member, IEEE, and Mohamed Nabil H. Barakat

Abstract: Diabetes mellitus is a chronic disease and a major public health challenge worldwide. According to the International Diabetes Federation, there are currently 246 million diabetic people worldwide, and this number is expected to rise to 380 million by 2025. Furthermore, 3.8 million deaths are attributable to diabetes complications each year. It has been shown that 80% of type 2 diabetes complications can be prevented or delayed by early identification of people at risk. In this context, several data mining and machine learning methods have been used for the diagnosis, prognosis, and management of diabetes. In this paper, we propose utilizing support vector machines (SVMs) for the diagnosis of diabetes. In particular, we use an additional explanation module, which turns the “black box” model of an SVM into an intelligible representation of the SVM’s diagnostic (classification) decision. Results on a real-life diabetes dataset show that intelligible SVMs provide a promising tool for the prediction of diabetes, where a comprehensible ruleset has been generated, with prediction accuracy of 94%, sensitivity of 93%, and specificity of 94%. Furthermore, the extracted rules are medically sound and agree with the outcome of relevant medical studies.

Index Terms: Data mining, diabetes, machine learning, medical diagnosis.

Manuscript received June 6, 2009; revised October 20, 2009; accepted December 16, 2009. Date of publication January 12, 2010; date of current version July 9, 2010.
N. H. Barakat is with the Department of Applied Information Technology, German University of Technology in Oman, Muscat 130, Oman (e-mail: n.barakat@uq.edu.au).
A. P. Bradley is with the School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, QLD 4072, Australia (e-mail: bradley@itee.uq.edu.au).
M. N. H. Barakat is with the Department of the Noncommunicable Diseases Surveillance and Control, the Ministry of Health, Muscat 113, Oman (e-mail: mnbarkat@yahoo.co.uk).
Digital Object Identifier 10.1109/TITB.2009.2039485
1089-7771/$26.00 © 2010 IEEE

I. INTRODUCTION

DIABETES is a disease primarily associated with an increase in the level of blood glucose (hyperglycaemia) [1]. One reason for hyperglycaemia is insulin deficiency, where beta cells in the pancreas fail to produce enough insulin. This is known as type 1 diabetes. The other and most common type of diabetes is recognized as type 2, where the body cannot effectively use the insulin produced [2].

Chronic hyperglycaemia in people with diabetes increases the risk of microvascular damage, which leads to retinopathy, nephropathy, and neuropathy. Therefore, diabetes is the leading cause of blindness and visual impairment in adults in developed countries [2] and is responsible for over one million lower limb amputations each year. Diabetic people are also exposed to an elevated risk of macrovascular complications, where they are two to four times more likely to get cardiovascular disease (CVD) than people without diabetes. Due to these complications, diabetes is found to be the fourth leading cause of global death by disease.

The prevalence of type 2 diabetes is increasing at a fast pace due to obesity, in particular, central obesity, physical inactivity, and unhealthy dietary habits [3]. Early detection of diabetes would be of great value given the fact that at least 50%, and 80% in some countries, of all people with diabetes are unaware of their condition and will remain unaware until complications appear [2], [4].

Recent studies have shown that 80% of type 2 diabetes complications can be prevented or delayed by early identification and intervention in people at risk [2], [4], for example, by changing their lifestyle [3] and/or by therapeutic methods. Intelligent data analysis techniques, such as data mining and machine learning, are therefore valuable for identifying those people. In this context, several data mining and machine learning methods have been proposed for the diagnosis and management of diabetes [5]–[9]. It has also been shown that employing computer-aided diagnostic systems (CAD) as a “second opinion” has led to improved diagnostic decisions [10], and support vector machines (SVMs) have shown remarkable success in this area [11].

In this study, we propose a novel hybrid model for medical diagnosis, integrating three different data mining/machine learning techniques. In particular, an unsupervised and a supervised learning algorithm are used for sampling and model building, respectively, which are then followed by a rule-based explanation component. Furthermore, we validate our model using a real-life dataset for the prediction of type 2 diabetes. As we will demonstrate, our technique is able to predict diabetes with high accuracy, sensitivity, and specificity, which outperforms other techniques [8], [9] working on relevant problems. We also show that the diagnostic criteria learned by our model are valid from a medical standpoint and that our results are supported by other medical studies [12].

The paper is organized as follows. A brief background for SVMs and rule extraction from SVMs is provided in Section II. The experimental methodology is presented in Section III, followed by results and discussion in Section IV. Rule interpretation and validation is demonstrated in Section V and some conclusions are drawn in Section VI.

II. BACKGROUND

A. Support Vector Machines

SVMs operate by finding a linear hyperplane that separates the positive and negative examples with a maximum interclass distance or margin d. In the case of unequal misclassification costs, a cost factor J (C+/C−) is introduced by which training

errors on positive examples outweigh errors on negative examples [13]. Therefore, the optimization problem becomes

$$\text{minimize}\quad \frac{1}{2}\|w\|^2 + C_+ \sum_{i:\,y_i=1} \xi_i + C_- \sum_{j:\,y_j=-1} \xi_j$$

$$\text{subject to}\quad y_k\,(w \cdot x_k + b) \ge 1 - \xi_k,\qquad \xi_k \ge 0 \tag{1}$$

where $y_i$ is the class label, $w$ is normal to the hyper-plane, $|b|/\|w\|$ is the perpendicular distance from the hyper-plane to the origin, $\|w\|$ is the Euclidean norm of $w$, $C$ is a regularization parameter, which defines the tradeoff between the training error and the margin d, and $\xi_i$ is a slack variable to allow errors in classification [13].

To handle nonlinearly separable data, kernel functions are used. Introducing a kernel function $K$ and Lagrange multipliers $\alpha_i$, the dual optimization problem becomes

$$\text{maximize}\quad W(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1,\,j=1}^{l} \alpha_i y_i \alpha_j y_j K(x_i, x_j)$$

$$\text{subject to}\quad C \ge \alpha_i \ge 0 \;\; \forall i,\qquad \sum_{i=1}^{l} \alpha_i y_i = 0. \tag{2}$$

Solving for $\alpha$, training examples with a nonzero $\alpha$ are called support vectors (SVs) and the hyper-plane is completely defined by the SVs alone.

There is, however, a significant drawback to SVMs, in that they cannot provide a comprehensible justification for the classification decisions they make. That is, they are black box models. In medical diagnosis, it has been shown that the explanation of a classification decision is a crucial requirement for the acceptance of black box models by end users [14]–[16]. Therefore, techniques for rule extraction have been introduced [17] to enable SVMs to be more intelligible.

B. Rule Extraction From SVMs

Direct rule learners extract rules that describe a pattern or relationships between input features and output class labels directly from the data. In the case of black box models, these relationships are not comprehensible to end users. Therefore, the task of rule extraction is to devise rules from the model, rather than directly from the data. In doing so, an explanation of the knowledge learned by the black box model (from the data) and embedded in the structure of the model (SVs in the case of SVMs) is revealed and provided to the end users in a comprehensible form.

In this paper, we utilize two different techniques for rule extraction as the last module of our proposed approach. In particular, the SQRex-SVM [18] and the eclectic [19] methods are used to turn the SVM black box into a more intelligible model. The following subsections briefly describe these methods.

1) SQRex-SVM: A Sequential Covering Approach for Rule Extraction: SQRex-SVM [18] extracts rules directly from a subset of the SVM SVs [see (2)], using a modified sequential covering algorithm and based on an ordered search of the most discriminatory features. These features are then ranked and utilized to form an initial ruleset for the positive class, which is then refined and pruned. A prepruning strategy is adopted to prune rules with performance below a user-defined threshold. In the postpruning step, the algorithm utilizes the area under the receiver operating characteristic (ROC) curve (AUC) to control the tradeoff between the classifier (ruleset) performance, in terms of both the true positive (TP) and false positive (FP) rates and AUC, and comprehensibility as measured by the number of rules. Rules that do not result in a statistically significant (p < 0.05) increase in the ruleset AUC are pruned.

2) Eclectic Rule Extraction: In this approach [19], a labeled dataset is used to train an SVM to obtain a model (classifier) with acceptable accuracy, precision, and recall. Next, a synthetic dataset composed of the training examples that became SVs is constructed with the target class for these examples replaced by the class predicted by the SVM. Rules representing the concepts learned by the SVM are then extracted from the synthetic dataset using the C5 decision tree learner [20].

III. EXPERIMENTAL METHODOLOGY

A. Dataset

The dataset used in this paper has previously been used in [21], which investigated the prevalence of diabetes in Oman. Another study on this data [22] used data mining methods to learn rules for the diagnosis of diabetes. However, in this paper, we will be employing SVMs for the first time for the diagnosis of diabetes and justifying the SVM’s classification decisions by the extracted rules. A detailed description of the dataset can be found in [21]. However, for convenience, a brief description is provided next.

Data from 4682 subjects of age 20 years and above was collected using a questionnaire regarding demographic data, history, and anthropometric measures. Furthermore, blood pressure and blood samples were analyzed to measure fasting venous and 2-h postglucose load [oral glucose tolerance test (OGTT)] for the participants of the survey in healthcare centers. Other attributes were also collected, but omitted here as they are not relevant to this study.

The data collected about each subject include age in years (min 20 and max 80), sex (male/female), family history of diabetes (yes/no), body mass index (BMI) (min 14 and max 47), waist circumference in centimeters (waist) (min 45 and max 120), hip circumference (min 66 and max 125 cm), systolic blood pressure (min 90 and max 220), diastolic blood pressure (BPDIAS) (min 50 and max 120), cholesterol (min 1.9 and max 8.5 mmol/L), fasting blood sugar (FBS) (min 55 and max 320 mg/dL), and 2-h postglucose load (OGTT) (min 38 and max 570 mg/dL).

The diabetes detection method used in this dataset is the OGTT, according to the 1985 World Health Organization (WHO) criteria, where the subject is considered diabetic if their OGTT is equal to or greater than 200 mg/dL. Diabetic subjects who were taking oral hypoglycemic tablets or insulin (i.e., who were diagnosed as diabetic prior to the survey) were excluded from the study. In addition, all subjects with missing values in
any of the 11 features were excluded. As a result, there are a total of 3014 subjects in the dataset.

TABLE I
DATASET

TABLE II
RULE PERFORMANCE COMPARED TO THE SVM AND DIRECT RULE LEARNERS AT EQUAL MISCLASSIFICATION COSTS
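The exclusion and labeling criteria of Section III-A can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names in `FEATURES`, the flag `on_diabetes_medication`, and the function `build_labelled_dataset` are hypothetical and do not come from the survey data or from the authors' code. Only the 200 mg/dL OGTT cutoff and the two exclusion criteria are taken from the text.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical field names for the 11 attributes listed in Section III-A.
FEATURES = [
    "age", "sex", "family_history", "bmi", "waist", "hip",
    "bp_systolic", "bp_diastolic", "cholesterol", "fbs", "ogtt",
]
OGTT_CUTOFF_MG_DL = 200.0  # 1985 WHO criterion quoted in the text

Record = Dict[str, Any]


def build_labelled_dataset(records: List[Record]) -> List[Tuple[Record, int]]:
    """Apply the exclusion criteria, then label each remaining subject."""
    labelled = []
    for rec in records:
        # Exclude subjects already treated for diabetes before the survey
        # (hypothetical flag name).
        if rec.get("on_diabetes_medication"):
            continue
        # Exclude subjects with a missing value in any of the 11 attributes.
        if any(rec.get(name) is None for name in FEATURES):
            continue
        # Class label 1 = diabetic, -1 = nondiabetic, decided by the OGTT.
        label = 1 if rec["ogtt"] >= OGTT_CUTOFF_MG_DL else -1
        labelled.append((rec, label))
    return labelled
```

Because the OGTT value defines the class label, it would have to be dropped from the feature vector before training, which is consistent with the text's statement that only the first ten features are used as SVM inputs.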

B. Preprocessing and Sampling

The prevalence of diabetes in the dataset was about 9%, which means that the class priors of the dataset are highly skewed toward nondiabetic subjects (negative class).

We have selected a representative sample for the majority class (nondiabetic subjects) using subsampling and a K-means clustering algorithm. Subsampling was used to overcome the problem of extremely skewed class distribution [23], where it has been shown that class imbalance may not allow the learning algorithms to learn good models [24], [25]. Furthermore, decreasing the number of negative examples in the training set makes the learned model more sensitive to false positive classifications.

The clustering algorithm was employed for subsampling to ensure the selection of a representative training set, especially for the negative (majority) class. This was done because selecting training examples at random may result in important patterns in the original dataset being excluded from the training set, hence impeding the quality of the learned model.

The K-means [23] clustering algorithm was used as follows.
1) Five clusters were devised from the original dataset (as only 20% of negative examples were required).
2) Within each cluster, samples (subjects) were sorted based on their Euclidean distance from the cluster center.
3) Samples at the maximum distance from the cluster center were excluded, as they are likely to be outliers.
4) From each of the sorted clusters, every third sample is included in the final dataset, to select representative examples from each cluster, at different Euclidean distances from the center.
5) All positive samples were included in the final dataset.
6) The selected subjects in the final dataset were then randomly divided into two disjoint subsets: training and testing. Details of the training and testing sets are shown in Table I.

C. SVM Training

The first ten features (risk factors) were used to train the SVM to predict the diagnosis [diabetic (class label 1), nondiabetic (class label −1)]. Specifically, a subject is considered diabetic if his/her OGTT is ≥200 mg/dL, and nondiabetic if the OGTT is <200 mg/dL.

The SVMlight [26] implementation was used in all experiments. The SVM training parameters were selected using leave-one-out cross validation on the training set. A radial basis function (RBF) kernel with γ (gamma) of 0.0005 and C (regularization parameter) of 5 was chosen, as these values gave the best recall with the lowest error rate on the training set [see (1)].

The training procedure described in [19] was used as follows.
1) A number of SVM models were generated by varying the misclassification cost factor J [see (1)], starting with a small value and increasing J until no change in TP or FP rates was observed.
2) Each of the generated models was then used to classify the independent test set, and accuracy, TP, and FP rates were calculated.
3) Rules were then extracted from each of the models using the SQRex-SVM and eclectic methods described in Section II.
4) Each of the extracted rulesets was then used to classify the same independent test set, and again accuracy, TP and FP rates, as well as fidelity were computed.
5) ROC curves were then plotted for both the SVM and rulesets and the AUC was computed using trapezoidal integration. Standard errors for the AUC were estimated via the standard error of the Wilcoxon statistic [19].

IV. RESULTS AND DISCUSSION

Table II shows accuracy results of the SVM, the rule extraction methods, and three direct rule learners [20], [27], [28] at equal misclassification costs, as well as rule fidelity (the extent to which the prediction behavior of a ruleset mimics that of the SVM from which the rules were extracted) and comprehensibility (as measured by number of rules/antecedents).

Comparing the accuracy of the extracted rules to the SVM, it can be seen that the rules extracted by the SQRex-SVM and eclectic approaches have better accuracy than that of the SVM. The best accuracy is obtained by direct rule learning using C5 [20]. However, the difference in accuracy between C5 rules and those of SQRex-SVM is not statistically significant (p > 0.05). Furthermore, the SQRex-SVM ruleset has the best comprehensibility.

Table III shows the TP and FP rates for the rulesets extracted from the SVM and those of the SVM and direct rule learners. From these results, it can be seen that the eclectic approach achieved the best TP rate and AUC, followed by SQRex-SVM. To determine if the difference in AUC between the SQRex-SVM and the eclectic approaches is statistically significant, a large sample z-test was performed. Results indicate that the null

hypothesis that there is no difference in measured AUC between the two approaches cannot be rejected (p > 0.05). Furthermore, the differences in AUC between the SVM and those of the eclectic and the SQRex-SVM approaches are not statistically significant (p > 0.05). It should be noted that the same training and testing sets described in Table I have been used to train the SVM and direct rule learners, and the same test set is used to assess the quality of the SVM model, the extracted rules, and the rules obtained by direct rule learners.

TABLE III
RULE TP, FP RATES, AND AUC COMPARED TO THE SVM AND DIRECT RULE LEARNERS AT EQUAL MISCLASSIFICATION COSTS

The ROC curves for the performance of the SVM and the extracted rules on the test set at different misclassification costs are shown in Figs. 1 and 2, respectively.

Fig. 1. ROC curves for the eclectic rules and the SVM.

Fig. 2. ROC curves for SQRex-SVM rules and the SVM.

Comparing the ROC curves for the SVM and the extracted rulesets, it can be seen that both curves follow the same pattern with increasing misclassification cost. It can also be seen that the ROC curves for both the eclectic and SQRex-SVM methods are almost identical to the SVM ROC curves, which is also reflected by their AUC. It should be noted here that the ROC curve had to be manually connected to the point (1,1) for the SVM and extracted rulesets (as no value of J ensured the SVM always predicted positive).

The AUC for these ROC curves and the associated standard errors, as well as those of the direct rule learners, are shown in Table IV. From this table, it can be seen that C5 has slightly higher AUC than rules extracted from the SVM. However, this difference in AUC is not statistically significant (p > 0.05). Furthermore, the rulesets extracted from SVMs have a smaller number of rules, and therefore, improved comprehensibility. In addition, the difference between the AUC of rules extracted from the SVM by the two approaches, as well as the difference in AUC between each approach and the original SVM, are not statistically significant (p > 0.05).

TABLE IV
RULES AUC AT DIFFERENT MISCLASSIFICATION COSTS

We have also compared our results with those of similar studies, e.g., [8], [9], where a classification and regression tree (CART) decision tree [27] has been used for the prediction of diabetes, or of people at risk of diabetes; it can be seen that our approach has obtained improved results. The best results obtained in [9] were 88%, 75%, and 85%, while the best results obtained in [8] were 94.5%, 38%, and 73.6% for sensitivity, specificity, and AUC, respectively.

V. DIAGNOSTIC RULES VALIDATION AND INTERPRETABILITY

After confirming the quality of the rules extracted by the two methods and showing that they provide a succinct explanation of the concepts learned by SVMs, we now look at a major complementary measure of rule quality, rule interpretability, and whether these rules make sense to domain experts.

In principle, the rules extracted by the two methods have shown the correct and valid risk factors (features), which are used for the diagnosis and prediction of diabetes in their antecedent (namely, FBS, BPDIAS, and waist circumference).

However, there are slight variations in the cutoff values between the methods. One possible reason for this variation is the different inductive bias of each rule extraction method. Specifically, in the eclectic approach, a decision tree learner is used, which in turn uses information gain to evaluate both the best feature to split on and the preferred threshold value. Information gain is not the measure used in SQRex-SVM, which uses the TP and FP rates. Therefore, while both methods have selected the same features, they have set their preferred threshold values slightly differently. A more detailed description of the extracted rules and their validation is provided in the following subsections.
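To make the point about differing selection criteria concrete, the sketch below scores candidate cutoffs for a single numeric feature in two ways: by information gain, as a decision tree learner does, and by the difference between TP and FP rates. The second criterion is only an assumed stand-in, since the text says that SQRex-SVM "uses the TP and FP rates" without giving the exact formula. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Binary entropy (in bits) of a list of +1 / -1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(1 for y in labels if y == 1) / n
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def candidate_thresholds(values: Sequence[float]) -> List[float]:
    """Midpoints between consecutive distinct feature values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def best_by_information_gain(values, labels) -> Tuple[float, float]:
    """Cutoff maximizing information gain for the split `value > t`."""
    base, n = entropy(labels), len(labels)
    best_t, best_gain = None, -1.0
    for t in candidate_thresholds(values):
        low = [y for v, y in zip(values, labels) if v <= t]
        high = [y for v, y in zip(values, labels) if v > t]
        gain = base - len(low) / n * entropy(low) - len(high) / n * entropy(high)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


def best_by_tp_minus_fp(values, labels) -> Tuple[float, float]:
    """Cutoff maximizing TP rate minus FP rate (an assumed criterion)."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    best_t, best_score = None, -2.0
    for t in candidate_thresholds(values):
        tp = sum(1 for v, y in zip(values, labels) if v > t and y == 1)
        fp = sum(1 for v, y in zip(values, labels) if v > t and y != 1)
        score = tp / pos - fp / neg
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

On cleanly separable data the two criteria agree. On overlapping, imbalanced data they can select different cutoffs for the same feature, which is the kind of divergence the paragraph above attributes to the two extraction methods.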

A. SQRex-SVM Rules
It was shown in Section IV that SQRex-SVM extracts the best rules in terms of rule quality from the machine learning point of view, i.e., simplest and of best accuracy. It will therefore be interesting to see if these rules have the same quality when evaluated from the medical expert’s perspective.

The following are the rules extracted by SQRex-SVM at equal misclassification cost.
1) IF FBS > 106.2
Then diabetic.
2) IF waist circumference ≥ 91
and BPDIAS ≥ 90
Then diabetic.
Default rule, nondiabetic.

If we consider the first rule, this rule is valid from the medical perspective, as the diagnosis of diabetes can be solely done by the FBS [1]. Considering the cutoff value of 106.2, this value is supported by a previous study [29], which used two datasets (one of them being the data used in this study) and concluded that this value (106.2 mg/dL), even though it is lower than the cutoff value defined by the American Diabetes Association (ADA) and WHO for the diagnosis of diabetes [1], is the value that gave the best AUC, sensitivity, and specificity for the Omani population; hence, it is the value that best matches an OGTT ≥ 200 mg/dL. Al-Lawati and Barakat [29] have pointed out that if the ADA cutoff is applied, it would lead to underestimation of diabetes by 18% in the Omani population. Similar studies [30]–[32] have also shown that the ADA criteria would underestimate diabetes in other populations. This rule can also be used to discover undiagnosed diabetic subjects missed by the ADA and WHO criteria.

Considering the second rule, recent studies have shown an association between raised blood pressure (hypertension) and central obesity, which is mainly measured by waist circumference, and the risk of developing diabetes and/or metabolic syndrome [33]. A risk score for developing diabetes has been proposed in [12], where hypertension contributed three points and waist circumference contributed two points out of a ten-point scale for all diabetes risk factors. Furthermore, subjects with metabolic syndrome also face an elevated risk of developing type 2 diabetes and CVDs [34]. The International Diabetes Federation (IDF) defines metabolic syndrome as the presence of central obesity, plus any two of the following factors.
1) Raised triglyceride level.
2) Reduced high-density lipoprotein cholesterol.
3) Raised blood pressure (systolic blood pressure ≥ 130 or BPDIAS ≥ 85 mm·Hg).
4) Raised fasting plasma glucose (FPG ≥ 100 mg/dL, or previously diagnosed as type 2 diabetes).

As mentioned earlier, central obesity is measured by waist circumference, but it is also ethnicity specific [33].

Comparing the IDF cutoff points for waist circumference with the one obtained by the second rule, it can be seen that the cutoff point we obtained for waist circumference is valid, as it is close to the values defined by the IDF [33] for different ethnic groups. It should also be noted that there is no cutoff value specifically defined for the Middle East (Arab) population [33], and there is a recommendation to use the European data until more specific data becomes available.

To further validate the second rule, we have plotted two ROC curves for the waist circumference as a predictor of the two most common risk factors for metabolic syndrome, namely, hyperglycemia and hypertension, as shown in Figs. 3 and 4. The waist circumference thresholds (cutoff values) used to plot these ROC curves were <75, ≥75 and <80, ≥80 and <85, ≥85 and <91, and ≥91 cm.

Fig. 3. ROC curve for waist circumference as predictor for hyperglycemia.

From these figures, it can be seen that a waist circumference of 91 is a better predictor of each of these risk factors as compared to lower waist circumference. This is more evident in males than females (p < 0.05). We have also investigated the distribution of diabetic subjects (as defined by the WHO criteria) over different groups of waist circumferences. As Fig. 5 shows, the distribution confirms our results, where the highest percentage of diabetic subjects have a waist circumference ≥91 cm, followed by the range of waist circumference ≥85 cm.

Based on the discussion earlier, it can be concluded that the second rule can be used for opportunistic screening to identify people at risk [12] (predicting diabetes). If indicated, further investigation should be carried out, perhaps followed by an early intervention program to control the development of diabetes [3].

Figs. 3–5 also validate our rule, which suggests that the cutoff value of 91 cm can be used as a guiding or starting point for the definition of central obesity in both males and females in the Arabian Gulf area population. It can also be expected that a cutoff value between 85 and 91 cm for both males and females is more suitable for defining central obesity than the cutoff value for Europeans suggested by the IDF [33]. However, more retrospective studies are needed to decide the best cutoff value.

Fig. 4. ROC curve for waist circumference as predictor for hypertension.

Fig. 5. Distribution of diabetic subjects over waist circumference groups.

It should be noted that the cutoff values for FBS as well as the waist circumference in our rules, which mainly diagnose diabetes, were the same for both males and females, which is also the case in the WHO and IDF criteria. However, we have plotted separate ROC curves for males and females (see Figs. 3 and 4) for predicting different risks of metabolic syndrome, where IDF standards distinguish between males and females regarding waist circumference cutoff values. Unlike the IDF standards, our results suggest that the waist circumference cutoff value for predicting the risk of metabolic syndrome is the same for both males and females. However, there are other studies that also did not distinguish between males and females [35]. It is also worthwhile mentioning here that the average waist circumference of males and females is the same in the survey data, which explains the same cutoff value for both genders.

B. Eclectic Method Rules

The following are the rules extracted by the eclectic approach.
1) IF FBS > 124.2
Then diabetic.
2) IF FBS > 90
and FBS ≤ 124.2
and waist circumference > 84
and BPDIAS > 70
Then diabetic.

It should be noted that there were an additional three rules for nondiabetics, which were removed as they were basically the complement of the aforementioned rules.

Considering these rules, it can be seen that the same risk factors, namely, FBS, waist circumference, and BPDIAS, also appear as rule antecedents. Although there is a difference in the cutoff values due to the reasons mentioned earlier, the rules (especially the first rule, which agrees with the WHO cutoff) are still valid from a medical perspective. However, they do not have such good results as the SQRex-SVM method. This finding is again supported by [29] for this specific population. The second rule in particular is not consistent with generally accepted medical knowledge for the diagnosis of diabetes, as the cutoff value for the BPDIAS is lower than the one defined for hypertension. Therefore, this could be a redundant antecedent. However, this rule can again be used for screening of people at risk or undiagnosed subjects.

VI. CONCLUSION AND FUTURE PERSPECTIVES

In this paper, we have developed a hybrid system for medical diagnosis. In particular, we have employed SVMs for the diagnosis and prediction of diabetes, where an additional rule-based explanation component is utilized to provide comprehensibility. The SVM and the rules extracted from it are intended to work as a second opinion for diagnosis of diabetes and as a tool to predict diabetes through identifying people at high risk. According to domain experts, the significance of our approach lies in its simplicity, comprehensibility, and validity. They find the rules produced by the system helpful, as simple measurements in the outpatient setting could be used for opportunistic screening. This would offer an enhanced opportunity for timely and appropriate intervention to take place, which would reduce or control the incidence of diabetes or its expensive complications. Furthermore, the rules can help in the detection of undiagnosed subjects. Results show that our model is of high quality in terms of diagnostic and prediction accuracy, and outperforms other techniques working on similar problems.

One of the potential future extensions of this work is to conduct a prospective study to further refine the predictive results obtained by the proposed rules. Specifically, this could be achieved by following up the subjects with a waist

circumference ≥91 cm and/or BPDIAS ≥90 for a 3 to 5 year period to see if, and when, they develop diabetes. Based on the outcomes of such a study, a more sophisticated risk score could be developed, which could significantly decrease healthcare costs via early prediction and diagnosis of diabetes.

REFERENCES

[1] WHO/IDF 2006 (2007, Jan.). Definition and diagnosis of diabetes mellitus and intermediate hyperglycemia, World Health Organisation [Online]. Available: http://www.who.int/diabetes/publications/Definition%20and%20diagnosis%20of%20diabetes_new.pdf
[2] International Diabetes Federation, Diabetes Atlas, 3rd ed. Brussels, Belgium: International Diabetes Federation, 2007.
[3] M. Uusitupa, “Lifestyle matter in prevention of type 2 diabetes,” Diabetes Care, vol. 25, no. 9, pp. 1650–1651, 2002.
[4] M. Franciosi, G. D. Berardis, M. C. E. Rossi, and M. Sacco, “Use of the diabetes risk score for opportunistic screening and impaired glucose tolerance,” Diabetes Care, vol. 28, no. 5, pp. 1187–1193, 2005.
[5] Y. Huang, P. McCullagh, N. Black, and R. Harper, “Feature selection and classification model construction on type 2 diabetic patient’s data,” in Lecture Notes Artificial Intelligence, vol. 3275, P. Perner, Ed. Berlin, Germany: Springer-Verlag, 2004, pp. 153–162, ICDM 2004.
[6] R. Bellazzi, C. Larizza, P. Magni, S. Montani, and M. Stefanelli, “Intelligent analysis of clinical time series: An application in the diabetes mellitus domain,” Artif. Intell. Med., vol. 20, pp. 37–57, 2000.
[7] R. Bellazzi, “Telemedicine and diabetes management: Current challenges and future research directions,” J. Diabetes Sci. Technol., vol. 2, no. 1, pp. 98–104, 2008.
[8] R. Goel, A. Misra, D. Kondal, R. M. Pandey, N. K. Vikram, J. S. Wasir, V. Dhingra, and K. Luthra, “Identification of insulin resistance in Asian Indian adolescents: Classification and regression tree (CART) and logistic regression-based classification rules,” Clin. Endocrinol., vol. 70,
[22] M. N. Barakat, N. Barakat, J. Diederich, and J. Al Lawati, “Diagnosis of diabetes mellitus: A data mining approach,” Int. J. Diabetes Metab., vol. 13, no. 1, p. 42, 2005.
[23] P. N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. New York: Addison-Wesley, 2005.
[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.
[25] N. Japkowicz, “The class imbalance problem: Significance and strategies,” in Proc. Int. Conf. Artif. Intell. (IC-AI 2000), pp. 111–117.
[26] T. Joachims, Learning to Classify Text Using Support Vector Machines. Norwell, MA: Kluwer, 2002.
[27] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth and Brooks, 1984.
[28] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2005.
[29] J. A. Al-Lawati and M. N. Barakat, “Fasting cut-points in determining prevalence of diabetes in an Arab population of the middle east,” Diabetes Res. Clin. Prac., vol. 75, no. 2, pp. 241–245, 2007.
[30] M. I. Harris, R. C. Eastman, C. C. Cowie, K. M. Flegal, and M. S. Eberhardt, “Comparison of diabetes diagnostic categories in the U.S. population according to the 1997 American Diabetes Association and 1980–1985 World Health Organization diagnostic criteria,” Diabetes Care, vol. 20, no. 12, pp. 1859–1862, 1997.
[31] P. W. Wahl, P. J. Savage, B. M. Psaty, T. J. Orchard, J. A. Robbins, and R. P. Tracy, “Diabetes in older adults: Comparison of 1997 American Diabetes Association classification of diabetes mellitus with 1985 WHO classification,” Lancet, vol. 352, no. 9133, pp. 1012–1015, 1998.
[32] J. E. Shaw, M. D. Courten, E. J. Boyko, and P. Z. Zimmet, “Impact of new diagnostic criteria for diabetes on different populations,” Diabetes Care, vol. 22, no. 5, pp. 762–766, 1999.
[33] K. G. M. M. Alberti, P. Zimmet, and J. Shaw, “Metabolic syndrome – A new worldwide definition. A Consensus Statement from the International Diabetes Federation,” Diabet. Med., vol. 23, pp. 469–480, 2006.
[34] S. Grundy, “Obesity, metabolic syndrome, and cardiovascular disease,”
pp. 717–724, 2009.
J. Clin. Endocrinol. Metab., vol. 89, no. 6, pp. 2595–2600, 2004.
[9] K. E. Heikes, B. Arondekar, D. M. Eddy, and L. Schlessinger, “Diabetes
[35] H. Wahrenberg, K. Hertel, B.-M. Leijonhufvud, L.-G. Persson, E. Toft, and
risk calculator, a simple tool for detecting undiagnosed diabetes and pre-
P. Arner, (2005). Use of waist circumference to predict insulin resistance:
diabetes,” Diabetes Care, vol. 31, no. 5, pp. 1040–1045, 2008.
Retrospective study, Brit. Med. J., viewed Jan. 2007. [Online]. Available:
[10] N. Lavrac, E. Keravnou, and B. Zupan, “Intelligent data analysis in
http://www.bmj.com/cgi/rapidpdf/bmj.38429.473310.AEv1
medicine,” in Encyclopedia of Computer Science and Technology, vol. 42,
A. Kent et al., Ed. New York: Marcel Dekker, 2000, pp. 113–157.
[11] W. Kong, L. Tham, K. Y. Wong, and P. Tan, “Support vector machine
approach for cancer detection using amplified fragment length polymor-
phism (AFLP) method,” presented at the the 2nd Asia-Pac. Bioinformatics
Conf. (APBC 2004), Dunedin, New Zealand. Nahla H. Barakat received the Ph.D. degree in computer science from the
[12] J. A. Al-Lawati and J. Tuomilehto, “Diabetes risk score in Oman: A tool University of Queensland, Brisbane, Australia.
to identify prevalent type 2 diabetes among Arabs of the middle east,” She is currently an Associate Professor in computer science with German
Diabetes Res. Clin. Pract., vol. 77, no. 3, pp. 438–444, 2007. University of Technology in Oman, Muscat, Oman. For more than ten years, she
[13] K. Morik, P. Brockhausen, and T. Joachims, “Combining statistical learn- has industry experience in the area of IT in a multinational environment. Her
ing with knowledge-based approach-A case study in intensive care moni- current research interests include machine learning and medical data mining.
toring,” in Proc. Eur. Conf. Mach. Learn., 1998, pp. 268–277.
[14] R. L. Ye and P. E. Johnson, “The impact of explanation facilities on user
acceptance of expert systems advise,” MIS Q., vol. 19, pp. 157–172, Jun.
1995.
[15] Z. Chen, J. Li, and L. Wei, “A multiple kernel support vector machine
scheme for feature selection and rule extraction from gene expression Andrew P. Bradley (SM’97) received the Ph.D. degree from the University of
data of cancer tissue,” Artif. Intell. Med., vol. 41, pp. 161–175, 2007. Queensland, Brisbane, Australia, in 1996.
[16] C. J. Wyatt and D. G. Altman, “Prognostic models: Clinically useful or Since 1996, he has been a Researcher in Australia, the United Kingdom,
quickly forgotten?” Brit. Med. J., vol. 311, pp. 1539–1541, 1995. and Canada. He is currently an Associate Professor in biomedical engineering
[17] N. Barakat, “Rule extraction from support vector machines: Medical di- with the University of Queensland. His research interests include biomedical
agnosis prediction and explanation,” Ph.D. thesis, School Inf. Technol. applications of pattern recognition in signal and image analysis.
Electr. Eng. (ITEE), Univ. Queensland, Brisbane, Australia, 2007. Dr. Bradley is a Chartered Electrical Engineer of the IET.
[18] N. Barakat and A. P. Bradley, “Rule extraction from support vector ma-
chines: A sequential covering approach,” IEEE Trans. Knowl. Data Eng.,
vol. 19, no. 6, pp. 729–741, Jun. 2007.
[19] N. Barakat and A. P. Bradley, “Rule extraction from support vector ma-
chines: Measuring the explanation capability using the area under the ROC Mohamed Nabil H. Barakat received the B.Sc. and M.Sc. degrees in internal
curve,” presented at the 18th Int. Conf. Pattern Recognit. (ICPR 2006), medicine from Alexandria University, Alexandria, Egypt and the M.Sc. degree
Hong Kong. in nephrology and metabolism from Cairo University, Cairo, Egypt.
[20] J. R. Quinlan, C4.5: Programs for Machine Learning. SanMateo, CA: He is a Consultant Physician and a Diabetologist with the Department
Morgan Kaufmann, 1993. of Noncommunicable Disease Surveillance and Control, Ministry of Health,
[21] M. Asfour, A. Lambourne, A. Soliman, S. Al-Behlani, D. Al-Asfoor, Muscat, Oman. He has a long clinical experience in treatment and management
and A. Bold, “High prevalence of diabetes mellitus and impaired glucose protocols of diabetes. He is involved in the national program of diabetes control
tolerance in the Sultanate of Oman: Results of the 1991 national survey,” with the Ministry of Health. His current research interests include diabetes and
Diabetic Med., vol. 12, pp. 1122–1125, 1995. cardiovascular diseases prevention and control techniques.