
KNOSYS 2018 No.

of Pages 12, Model 5G


18 February 2011

Knowledge-Based Systems xxx (2011) xxx–xxx



Contents lists available at ScienceDirect

Knowledge-Based Systems
journal homepage: www.elsevier.com/locate/knosys

Domain-driven KDD for mining functionally novel rules and linking disjoint medical hypotheses

Y. Sebastian ⇑, Patrick H.H. Then
School of Engineering, Computing, and Science, Swinburne University of Technology (Sarawak Campus), Kuching, Sarawak 93350, Malaysia

Article info

Article history:
Received 8 September 2010
Received in revised form 16 December 2010
Accepted 22 January 2011
Available online xxxx

Keywords:
Association rules
Data mining methods
Interactive data exploration and discovery
Medical knowledge support systems

Abstract

Introduction: An important quality of association rules is novelty. However, evaluating rule novelty is AI-hard and has been a serious challenge for most data mining systems.
Objective: In this paper, we introduce functional novelty, a new non-pairwise approach to evaluating rule novelty. A functionally novel rule is interesting as it suggests previously unknown relations between user hypotheses.
Methods: We developed a novel domain-driven KDD framework for discovering functionally novel association rules. Association rules were mined from cardiovascular data sets. At post-processing, domain knowledge-compliant rules were discovered by applying semantic-based filtering based on the UMLS ontology. Their knowledge compliance scores were computed against medical knowledge in Pubmed literature. A cardiologist explored possible relationships between several pairs of unknown hypotheses. The functional novelty of each rule was computed based on its likelihood to mediate these relationships.
Results: Highly interesting rules were successfully discovered. For instance, a common rule such as diabetes mellitus $\Leftrightarrow$ coronary arteriosclerosis was functionally novel as it mediated a rare association between von Willebrand factor and intracardiac thrombus.
Conclusion: The proposed post-mining domain-driven rule evaluation technique and measures proved to be useful for estimating candidate functionally novel rules, with the results validated by a cardiologist.

© 2011 Elsevier B.V. All rights reserved.

1. Introduction

In knowledge discovery in databases (KDD), an association rule must exceed certain interestingness thresholds to qualify as knowledge [13]. Automatically estimating rule interestingness is AI-hard because interestingness is determined by various factors such as user task, preference, and discovery context [54]. Despite being extensively studied in data mining [17,37,19,31], the problem remains highly challenging [14,61,54].

There are nine rule interestingness criteria [19] and novelty is arguably the most important criterion [16,51,13]. According to the survey in [19], a rule is novel if the user 'did not know it before and is not able to infer it from other known patterns'. Three categories of techniques for discovering novel rules were reported in this survey: (a) post-mining filtering techniques [35,51]; (b) interactive incremental filtering techniques [47]; and (c) in-mining constraint-based filtering techniques [43]. Careful examination of these existing techniques reveals that they consistently measure rule novelty in pairwise terms. In other words, a novel rule ought to consist of a previously unknown, unexpected composition of rule antecedent and consequent with respect to the existing knowledge. We refer to this traditional pairwise approach as pairwise novelty.

Unfortunately, pairwise novelty is limited by two challenging problems. The first is the rare item problem. Because novel rules often point to rare cases, they are likely to have extremely low support counts [35]. In order to discover these rules, a data mining algorithm must operate at an extremely low minimum support (minsupp) threshold. This, however, will result in a large number of other uninteresting rules. No solution for this problem exists yet, as the optimal minimum support and minimum confidence (minconf) thresholds for mining interesting rules remain unknown [52].

The second problem is the difficulty in justifying the validity of novel rules due to the lack of supporting domain knowledge. User acceptance of rules is an important issue for KDD systems. In medical knowledge discovery, evidence showed that compliance with the existing domain knowledge is key to medical experts' acceptance of learned models [44].

To our knowledge, there is no compelling evidence to suggest that pairwise novelty is the only paradigm for assessing rule novelty. In fact, Han et al. [21] recommended that future pattern analysis should include the contextual analysis of patterns and not be limited to the analysis of pattern composition only.

⇑ Corresponding author. Tel.: +60 82 416353; fax: +60 82 423594.
E-mail addresses: ysebastian@swinburne.edu.my (Y. Sebastian), pthen@swinburne.edu.my (P.H.H. Then).

0950-7051/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.knosys.2011.01.008

Please cite this article in press as: Y. Sebastian, P.H.H. Then, Domain-driven KDD for mining functionally novel rules and linking disjoint medical hypotheses, Knowl. Based Syst. (2011), doi:10.1016/j.knosys.2011.01.008

Fig. 1. Swanson’s ABC inference model.

Jaroszewicz et al. [26] suggested that rule evaluation should consider the full joint probability on the data. Consequently, we ask this question: 'Can a common rule (i.e. having well-known item composition) at the same time be novel to the user? If yes, how?'. The question is important not only because it demands rethinking of the existing notion of rule novelty but also because the answer may provide means to help overcome the current limitations of the traditional pairwise novelty approach.

A few notable works in information retrieval and information science advocated reviewing existing and known pieces of knowledge in a concept-driven manner and linking them into a testable chain of hypotheses [56,6,5,27]. We illustrate this idea more clearly by using Swanson's ABC inference model [56] shown in Fig. 1. It was previously unknown that Fish Oil (A) could alleviate the symptoms of Raynaud's Syndrome (C). However, it was known that fish oil could lower the Blood Viscosity (B) level and that high Blood Viscosity was associated with the occurrence of Raynaud's Syndrome. By assembling these known pieces of information, Swanson syllogistically hypothesized that fish oil could eventually help Raynaud's Syndrome patients. This hypothesis was later validated through a medical experiment.

Note that at the time of Swanson's discovery, blood viscosity did not represent a novel piece of knowledge on its own. This human blood property was already known to medical experts. Nevertheless, one could argue that it was functionally novel as it was previously unknown that it could be the mediating agent between fish oil and the occurrence of Raynaud's Syndrome. So, within this knowledge synthesis scenario, a medical expert may view blood viscosity as a novel and interesting piece of knowledge not on account of its existence (i.e. by being blood viscosity), but on account of its functionality (i.e. by being the mediating agent of a previously unknown relation).

Jensen et al. [27] extended Swanson's ABC model to a more complex ABCD model in which a previously unknown relation between Rim11 and Erg9 proteins (AD) was inferred by assembling several already known protein-to-protein relations: Rim11 $\leftrightarrow$ Ume6 (AB), Ume6 $\leftrightarrow$ Ino2 (BC), and Ino2 $\leftrightarrow$ Erg9 (CD) (Fig. 2). How does this relate to the rule novelty evaluation problem? It means that an association rule need not be novel solely on account of its item composition. Instead, a common rule can be novel to a domain expert on account of its functionality, given a pair of previously unrelated hypotheses. This new non-pairwise approach is important as it may alleviate the rare item problem and the rule acceptance problem. The data mining algorithm can be set to focus on mining domain knowledge-compliant rules, which normally have high support counts. This makes it unnecessary to apply very low minimum support and minimum confidence thresholds, reducing the number of rules to be evaluated. By focusing the mining process on domain knowledge-compliant rules, the resulting rule outputs will also be more acceptable to domain experts.

Motivated by these observations, we propose functional novelty as a new measure of rule interestingness. In general, given a pair of user hypotheses A and D whose relation is previously unknown to the user, an association rule $x \Rightarrow y$ is functionally novel if it satisfies two conditions (see Fig. 3):

1. Rule $x \Rightarrow y$ must be compliant with the existing domain knowledge.
2. The association between A and x, and the association between y and D, must be supported by the existing domain knowledge.

We design and implement a novel domain-driven data mining (D³M) framework [11,9,62,10]. To be acceptable to domain experts, the inferred relationship between user hypotheses that is mediated by the rules should be validated to a certain extent by the existing knowledge. This makes the task challenging as it requires a comprehensive modeling of the user's domain knowledge. In biomedicine, this requires the incorporation of biomedical domain knowledge from the published literature [18] as well as a formalized ontology [30] into the new knowledge discovery framework.

This work is distinct from our previous work [48]. Here we present extensive experimental results to support the proposed framework. We organize this paper as follows. Section 2 defines two main research problems in more detail. In Section 3, we formalize the proposed functional novelty measures. Section 4 outlines the methods, proposed framework, and experimental designs. Experimental results are reported in Section 5, followed by discussions in Section 6. We conclude the paper in Section 7.
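The ABC and ABCD inference models from the introduction can be read as a path search over already known pairwise relations. The following is a minimal Python sketch of that reading, not the authors' implementation: the relation set simply copies the two examples from the text, the links are treated as undirected for illustration, and the names `KNOWN_RELATIONS`, `build_graph`, and `infer_chain` are hypothetical.

```python
from collections import deque

# Known pairwise relations, treated as undirected links for illustration.
# The entries mirror the two examples in the text; a real system would load
# them from literature or a curated knowledge base (assumption).
KNOWN_RELATIONS = {
    ("fish oil", "blood viscosity"),
    ("blood viscosity", "Raynaud's syndrome"),
    ("Rim11", "Ume6"),
    ("Ume6", "Ino2"),
    ("Ino2", "Erg9"),
}


def build_graph(relations):
    """Turn a set of (u, v) pairs into an undirected adjacency map."""
    graph = {}
    for u, v in relations:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def infer_chain(start, goal, relations):
    """Return the shortest chain of known relations linking start to goal,
    or None when no chain exists (breadth-first search)."""
    graph = build_graph(relations)
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# ABC model: blood viscosity mediates fish oil and Raynaud's syndrome.
abc = infer_chain("fish oil", "Raynaud's syndrome", KNOWN_RELATIONS)
# ABCD model: Ume6 and Ino2 mediate Rim11 and Erg9.
abcd = infer_chain("Rim11", "Erg9", KNOWN_RELATIONS)
print(abc)
print(abcd)
```

In this picture the intermediate nodes of a returned chain play the role of the functionally novel mediators: each is individually well known, and only its position between the two end points is new.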

Fig. 2. ABCD inference model.

Fig. 3. Functional novelty model.

2. Problems

2.1. Mining domain knowledge-compliant association rules

It is believed that rules with strongly correlated items are more likely to comply with the existing domain knowledge than those which occur by random chance. But high support and confidence values alone could not guarantee genuine correlation between rule


items [7]. Our first objective is to mine as many rules as possible from the target data set by setting very low minsupp and minconf thresholds and then evaluating the true compliance of each generated rule with the domain knowledge by using a reliable correlation metric.

One can rely on Chi-square ($\chi^2$) statistics [39] to measure genuine correlation between rule items. $\chi^2$ has been proved to be effective for mining correlation rules [7] and for pruning insignificant association rules [35]. Ohsaki et al. [42] ranked $\chi^2$ among the top five objective measures which correlated most with medical experts' real interests. However, relying on the conventional $\chi^2$ measure as the basis for determining domain knowledge-compliant rules is inadequate. Traditionally, $\chi^2$ is calculated solely based on rule item distribution in the data set. But because a data set only represents a relatively small portion of the entire domain knowledge, the recall rate for domain knowledge-compliant rules is likely to be low.

2.2. Evaluating functional novelty

Given a pair of user hypotheses AD and a set of n domain knowledge-compliant rules $x \Rightarrow y$ as seen in Fig. 3, how can we rank the rules from the most to the least functionally novel? The user should be allowed to prescribe a hypothesis pair AD that is not part of the data set items because this gives the user greater flexibility to experiment with a great variety of hypotheses using a single data set. This makes rule evaluation much more challenging because associations $A \Leftrightarrow x$ and $y \Leftrightarrow D$ can no longer be represented by association rules mined from the target data set.

To solve these problems, we propose the following methodology. We first mined candidate association rules $x \Leftrightarrow y$ from a cardiovascular data set using the FP-Growth algorithm with minimum support and minimum confidence thresholds.¹ Maximum itemset size was set to 2-itemset to simplify analysis. Discovering domain knowledge-compliant rules is achieved via two post-processing stages. Initially, we filtered out illogical rules based on semantic information specified in the Unified Medical Language System (UMLS) ontology. We defined illogical rules as rules whose antecedent and consequent are not semantically valid in the ontology. At the subsequent stage, we introduced $\chi^2_{lit}$, a literature-calibrated $\chi^2$ measure that is calculated based on the co-occurrence of rule items x and y in Pubmed literature. A $\chi^2_{lit}$ score was calculated for each candidate rule. We defined a domain knowledge-compliant rule as a rule whose $\chi^2_{lit}$ score suggests statistically significant correlation between its rule items. Finally, to determine the functional novelty of each rule, we acquired a pair of hypotheses AD from a medical expert. The correlation strengths of associations $A \Leftrightarrow x$ and $y \Leftrightarrow D$ were calculated using $\chi^2_{lit}$ based on their term co-occurrence in literature. Functional novelty of a rule was ranked based on its Min $\chi^2$ score, which is determined by the lesser of the $\chi^2_{litAx}$ and $\chi^2_{lityD}$ scores. The rule with the highest Min $\chi^2$ score was deemed most functionally novel. We let the medical expert evaluate the interestingness and validity of our mining results.

¹ Note that from this section onwards we replace notation $\Rightarrow$ with $\Leftrightarrow$ to avoid misinterpretation of the meaning of a rule's association. Since association rules are computed solely based on co-occurrences of items, they do not necessarily imply a particular type of relation between the items, such as cause-and-effect.

3. Notations and definitions

3.1. Association rules

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n binary items in the data set and $D = \{t_1, t_2, \ldots, t_m\}$ be a set of m transactions over I. $X = \{i_1, i_2, \ldots, i_k\}$ is a non-empty k-itemset, where $X \subseteq I$ and $0 < k \le n$. An association rule is an implication $x \Leftrightarrow y$, where $x, y \subseteq X$, $x \cap y = \emptyset$ [1]. We refer to x as the antecedent and y as the consequent.

Definition 1. Support of rule $x \Leftrightarrow y$, denoted as $supp(x \cup y)$, is the number of transactions in D containing $x \cup y$, divided by m transactions, $0 \le supp \le 1$.

Definition 2. Confidence of a rule, $conf(x \Leftrightarrow y) = supp(x \cup y)/supp(x)$, is the likelihood that transactions containing x also contain y, $0 \le conf \le 1$.

Definition 3. $\mu$ is the harmonic mean of supp and conf shown in Eq. (1) [25]. $\mu$ serves as the basic rule strength measure.

\[ \mu = \frac{2(supp)(conf)}{supp + conf} \tag{1} \]

Definition 4. Maxitem is the maximum number of items allowed in candidate itemsets, $0 < k \le maxitem$.

Definition 5. Ontology is ''an explicit specification of a conceptualization'' [20]. Let $C = \{c_1, c_2, \ldots, c_n\}$ be a set of all concepts in the ontology, $S = \{s_1, s_2, \ldots, s_m\}$ be a set of all semantic types defined over C, and $R = \{r_1, r_2, \ldots, r_p\}$ be a set of all semantic relations defined over S. The association between two concepts, $c_1$ and $c_2$, is semantically valid if the relation between their respective semantic types, $s_1$ and $s_2$, is defined in the ontology, i.e. $r_{s_1 \to s_2} \in R$. Note that the relation between two semantic types is asymmetric, $r_{s_1 \to s_2} \ne r_{s_2 \to s_1}$. Let $x \Leftrightarrow y$ be an association rule, and $c_x$ and $c_y$ be concepts derived for items x and y, respectively; the rule is semantically valid if $r_{s_x \to s_y} \in R$.

3.2. Domain knowledge-compliance measure: $\chi^2_{lit}$

For term association analysis, Prabowo and Thelwall [46] have demonstrated $\chi^2$ to be a more reliable correlation measure than mutual information or information gain. Hu et al. [25] used $\chi^2$ as a correlation measure for discovering hidden links among Pubmed literature.

Definition 6. Given antecedent x and consequent y, and a 2 × 2 contingency table (Table 1), cell a is the number of transactions in which x and y co-occur. $\chi^2$ is calculated for rule $x \Leftrightarrow y$ (Eq. (2)) based on information in the contingency table. $f_e$ is the expected frequency for a cell. For cell a, $f_e = \frac{(a+c)(a+b)}{a+b+c+d}$. $f_o$ is the observed frequency for a cell. Given p = 0.05 and 1 degree of freedom, the critical value threshold to mark the statistical significance of $\chi^2$ is 3.84 ($\alpha = 3.84$). Correlation between x and y is statistically significant in the data set if $\chi^2 > \alpha$.

\[ \chi^2 = \sum \frac{(f_e - f_o)^2}{f_e} \tag{2} \]

Table 1
A 2 × 2 contingency table.

        y    ¬y
x       a    b
¬x      c    d

Definition 7. Given concepts $c_x$ and $c_y$, each derived for items x and y respectively, a 2 × 2 contingency table is constructed based on the co-occurrence of concepts in literature. The contingency table allows $\chi^2_{lit}$ to be calculated (Eq. (3)). Given $\alpha = 3.84$, the correlation


Fig. 4. KELAM framework.
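The last two stages of the framework in Fig. 4 rest on two small computations from Section 3: the chi-square score of a 2 × 2 contingency table (Eq. (2), reused for the literature-based score in Eq. (3)) and the Min $\chi^2$ ranking of Eq. (4). The Python sketch below illustrates both under stated assumptions: expected frequencies use the usual row total × column total / grand total form, which matches the cell-a example in Definition 6; the function names and the sample scores are hypothetical, and nothing here reproduces the authors' Java code.

```python
CRITICAL_VALUE = 3.84  # chi-square critical value at p = 0.05, 1 degree of freedom


def chi_square(a, b, c, d):
    """Chi-square for a 2x2 table [[a, b], [c, d]]: the sum over all four
    cells of (fe - fo)^2 / fe, with fe = row_total * column_total / n."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty contingency table")
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    score = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected == 0:
                raise ValueError("expected frequency is zero; table is degenerate")
            score += (expected - observed[i][j]) ** 2 / expected
    return score


def min_chi_square(chi_ax, chi_yd):
    """Eq. (4): a rule is only as strong as the weaker of its two links."""
    return min(chi_ax, chi_yd)


def rank_rules(scored_rules):
    """scored_rules: iterable of (rule_name, chi_lit_Ax, chi_lit_yD).
    Returns (rule_name, min_chi, is_functionally_novel) tuples sorted by
    Min chi-square in descending order."""
    ranked = [(name, min_chi_square(ax, yd)) for name, ax, yd in scored_rules]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return [(name, score, score > CRITICAL_VALUE) for name, score in ranked]


# Hypothetical literature scores for three rules (made-up numbers).
example = [("rule_1", 12.0, 5.0), ("rule_2", 2.0, 50.0), ("rule_3", 9.0, 8.0)]
for name, score, novel in rank_rules(example):
    print(name, score, novel)
```

Note how `rule_2` ranks last despite one very strong link: taking the minimum penalizes any chain whose weaker association falls below the critical value.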

between items x and y is statistically significant in literature if $\chi^2_{lit} > \alpha$.

\[ \chi^2_{lit} = \sum \frac{(f_e - f_o)^2}{f_e} \tag{3} \]

3.3. Functional novelty score

Functional novelty score is defined as a function of the correlation strengths of the component associations that are involved in inferring the relationship between a user-prescribed hypothesis pair. Given a domain knowledge-compliant rule $x \Leftrightarrow y$, the functional novelty of the rule is equivalent to the weakest correlation of associations $A \Leftrightarrow x$ and $y \Leftrightarrow D$ in the literature. This is equivalent to assuming that the total strength of an inference chain is no greater than its weakest link [60].

Definition 8. Let $\chi^2_{litAx}$ and $\chi^2_{lityD}$ be the chi-square scores calculated for $A \Leftrightarrow x$ and $y \Leftrightarrow D$ from literature, respectively. Min $\chi^2\{x, y\}$ is the lowest score between $\chi^2_{litAx}$ and $\chi^2_{lityD}$ (Eq. (4)).

\[ \mathrm{Min}\,\chi^2\{x, y\} = \mathrm{Min}\left\{\chi^2_{litAx}, \chi^2_{lityD}\right\} \tag{4} \]

4. Materials and methods

4.1. KELAM knowledge discovery framework

KELAM (Knowledge Extraction via Logical Association Mining) is a five-staged domain-driven KDD framework designed to mine functionally novel association rules (Fig. 4). The term 'logical' is a loose term that describes our objective to mine association rules that will enable logical inference of a previously unknown relationship between two medical concepts. Rules should not only be statistically significant but also make sense medically (hence 'logical'). Recently, Sim et al. [52] attempted to discover association rules in which a rule must be reported only if there is enough 'logical evidence' in the data set. Logical evidence was collected by considering the presence and absence of items during data mining. However, the authors' approach is primarily dataset-oriented and therefore is distinct from our work.

During data preparation in Stage 1, the user (a) selects relevant data set attributes to be mined; (b) discretizes these attributes into binary items; and (c) maps each item to appropriate biomedical concept names in the UMLS ontology. Next, in Stage 2 the user specifies data mining parameters, including minsupp, minconf, and maxitem. In Stage 3, the rule set output is semantically filtered against semantic relations in the ontology, and in Stage 4 a $\chi^2_{lit}$ score is calculated for each rule that passes filtering, based on information obtained from Pubmed literature. Lastly, in Stage 5 the user prescribes a pair of hypotheses, each of which is mapped to a standard biomedical concept in the ontology. By querying information from Pubmed literature, functionally novel rules are ranked based on Min $\chi^2$ score and presented to the user. The following sections describe each stage of the framework in detail.

4.2. Stage 1: Semantic-based attribute discretization and mapping

UMLS is the most comprehensive source for biomedical ontologies.² It contains the Metathesaurus, which is a very large collection of biomedical and health-related vocabularies and concepts, their term variants, semantic types, and the relationships between semantic types. Another important component of UMLS is the Semantic Network, which defines 135 semantic types and 54 semantic relations used in the Metathesaurus.

² http://www.nlm.nih.gov/research/umls/

Enriching rules with conceptual information from ontologies helps bridge the semantic gap between pattern representations and user interpretation [4]. After a target data set had been selected, we discretized each attribute into binary item(s) and mapped each item to a standard biomedical concept in the Metathesaurus. Numerical data attributes were discretized using value ranges defined in the ontology. For instance, the Metathesaurus defined two value ranges that correspond to the numerical attribute Age: 6–12 years old for child and 13–18 years old for adolescent. In this case, discretization would produce two binary items: [Age 6–12 = Yes] and [Age 13–18 = Yes]. For a categorical attribute, e.g. Gender, with m categorical values, m items will be produced, e.g. [Male = Yes] and [Female = Yes].

For each binary item, we manually searched for the appropriate concept from the ontology. Following the previous example, [Age 6–12 = Yes] was assigned the corresponding concept Child (Age Group), and [Age 13–18 = Yes] was assigned Adolescent (Age Group). The corresponding semantic type of the concept is shown between the parentheses. In another example, item


Table 2
An example of semantic map contents.

Item                  Concept                CUI        Semantic type          Attribute   Qualifier
[Male = Yes]          Male gender            C0024554   Organism attribute     Gender
[Age 6–12 = Yes]      Child                  C0008059   Age group              Age
[hpt_stage1 = Yes]    Hypertensive disease   C0020538   Disease or syndrome    Sbp, Dbp    Stage 1
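The semantic map in Table 2 is what makes the Stage 3 semantic-based filter a simple lookup: each rule item is translated to its semantic type, and the rule is kept only if a relation from the antecedent's type to the consequent's type exists in the UMLS Semantic Network. The Python sketch below illustrates this under assumptions: the map holds only the three items of Table 2, the network holds only the four triples the paper lists as a partial depiction (Table 3), type names are compared case-insensitively because the two tables capitalize them differently, and all identifiers are hypothetical.

```python
# Item -> semantic type, copied from the semantic map in Table 2.
SEMANTIC_MAP = {
    "[Male = Yes]": "Organism attribute",
    "[Age 6–12 = Yes]": "Age group",
    "[hpt_stage1 = Yes]": "Disease or syndrome",
}

# (source type, relation, target type) triples; only the rows the paper
# shows from the UMLS Semantic Network, not the full network.
SEMANTIC_NETWORK = {
    ("Disease or Syndrome", "associated_with", "Organism Attribute"),
    ("Disease or Syndrome", "associated_with", "Clinical Attribute"),
    ("Disease or Syndrome", "isa", "Biologic Function"),
    ("Organism Function", "affects", "Human"),
}


def norm(name):
    """Normalize a semantic type name for case-insensitive comparison."""
    return name.casefold().strip()


def is_semantically_valid(antecedent, consequent,
                          semantic_map=SEMANTIC_MAP, network=SEMANTIC_NETWORK):
    """True if some relation is defined from the antecedent's semantic type
    to the consequent's semantic type. Direction matters: relations between
    semantic types are asymmetric."""
    s_x = semantic_map.get(antecedent)
    s_y = semantic_map.get(consequent)
    if s_x is None or s_y is None:
        return False  # unmapped items cannot be validated
    return any(norm(src) == norm(s_x) and norm(dst) == norm(s_y)
               for src, _relation, dst in network)


def filter_rules(rules):
    """Keep only the (antecedent, consequent) pairs that pass the filter."""
    return [rule for rule in rules if is_semantically_valid(*rule)]


R1 = ("[hpt_stage1 = Yes]", "[Male = Yes]")
R2 = ("[Male = Yes]", "[Age 6–12 = Yes]")
print(filter_rules([R1, R2]))
```

With these inputs R1 survives (Disease or Syndrome, associated_with, Organism Attribute is defined) and R2 is dropped, matching the worked example in Section 4.4.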

[Male = Yes] has a corresponding concept Male Gender in the Metathesaurus.

Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of all binary items in the data set. We then manually constructed a semantic map consisting of a vector of semantic information for each item $i_n$. An example of a semantic map is shown in Table 2.

4.3. Stage 2: FP-Growth algorithm

The FP-Growth algorithm was chosen for several reasons. It encodes the data set into a highly compact FP-Tree data structure and requires only two scans of the data set to generate all association rules. It avoids the expensive generate-and-test approach of Apriori-like algorithms and is hence more efficient. For certain types of transactional data sets, it outperformed Apriori by several orders of magnitude [57] and proved to be efficient particularly for mining large and dense data sets [22] such as biomedical data sets. We used the Java implementation of FP-Growth available via the RapidMiner open source data mining package [38].

4.4. Stage 3: Semantic-based filter

We constructed a semantic-based filter that utilizes information from the semantic map and the UMLS Semantic Network. A tiny portion of UMLS Semantic Network instances is shown in Table 3. In the first instance, semantic types Disease or Syndrome and Organism Attribute are shown to be semantically related via the associated_with semantic relation. Our filter looked for semantic relations between rule items (antecedent and consequent) among the contents of the Semantic Network. For example, consider the following association rules:

R1: [hpt_stage1 = Yes] $\Leftrightarrow$ [Male = Yes]
R2: [Male = Yes] $\Leftrightarrow$ [Age 6–12 = Yes]

Referring to the semantic map (Table 2), both rules could be translated into the following semantic associations:

R′1: Disease or Syndrome → Organism Attribute
R′2: Organism Attribute → Age Group

Our filter considers rule R1 as semantically valid because the relation between the Disease or Syndrome and Organism Attribute types was defined in the Semantic Network, i.e. Disease or Syndrome, associated_with, Organism Attribute (see Table 3). R2 is not semantically valid as the relation between Organism Attribute and Age Group was not defined in the Semantic Network. Hence the filter would exclude R2 from subsequent rule evaluation.

Table 3
Partial depiction of UMLS semantic network contents.

Disease or Syndrome,associated_with,Organism Attribute
...
Disease or Syndrome,associated_with,Clinical Attribute
Disease or Syndrome,isa,Biologic Function
...
Organism Function,affects,Human
...

The ontology-based filtering ensures that only association rules which are compliant with domain knowledge are preserved. It also increases the efficiency of the subsequent rule evaluation process. Our filtering technique is similar to [30]. Marinica and Guillet [36] also applied a variant of similar semantic-based filtering. However, our technique is effective and simpler than the authors' as we do not require an explicit calculation of item-relatedness between rule antecedent and consequent, which is not deemed necessary for the scope of our study.

4.5. Stage 4: Discovering domain-knowledge compliant rules

Pubmed³ includes over 19 million citations of biomedical articles in MEDLINE and other life science journals. Pubmed is accessible via Web interface keyword search. Each article in Pubmed is uniquely identified by its Pubmed identifier (PMID) and assigned Medical Subject Heading (MeSH) terms to indicate its topical subject.

Each rule that passes filtering was further checked for its compliance with the existing knowledge in biomedical literature. Compliance is marked by a significant $\chi^2_{lit}$ score calculated based on information in Pubmed.

4.6. Stage 5: Evaluating and ranking functionally novel rules

A pair of hypotheses which represents A and D was acquired from the user and mapped to standard biomedical concepts in the Metathesaurus. For each rule $x \Leftrightarrow y$, two $\chi^2_{lit}$ scores were calculated, one for association Ax and one for association yD. Based on these two scores, Min $\chi^2$ was calculated. Rules were then sorted based on Min $\chi^2$ in descending order. Rules with Min $\chi^2$ > 3.84 were considered functionally novel.

4.7. Technological architecture

We designed and implemented KELAM as a semi-automatic, interactive knowledge discovery application. Having interactive KDD systems is supported by the majority of data mining practitioners [2]. Fig. 5 shows the technological architecture. KELAM was implemented as a Web-based Java application to allow remote concurrent access by multiple users.

³ http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed.
⁴ http://www.nlm.nih.gov/pubs/factsheets/umlsmetamorph.html.
⁵ http://umlsks.nlm.nih.gov
⁶ http://ii.nlm.nih.gov/MMTx.shtml

Data preprocessing was performed manually by the user. Using DBMS and spreadsheet software, the target data set was converted from its native database format into .dat and .aml files, which are the required input file formats for the FP-Growth class implementation. Semantic map and UMLS Semantic Network files were acquired in .csv format. Metathesaurus contents were accessed through UMLS MetamorphoSys,⁴ UMLS Knowledge Source Server,⁵ or MetaMap Transfer (MMTx).⁶ Data mining was carried out by two main modules: the RapidMiner implementation of the FP-Growth algorithm and


Fig. 5. KELAM technological architecture.

the semantic-based rule filtering module. The rule evaluation subsystem consisted of two main modules: a hypothesis term mapping module that utilized MMTx classes for automatic mapping of hypothesis terms to Metathesaurus concepts, and a functional novelty evaluation module which calculated $\chi^2_{lit}$ and Min $\chi^2$ scores for each rule.

To access Pubmed citations and article contents, we integrated Entrez Programming Utilities (E-Utils)⁷ into the application. E-Utils is a collection of Java classes that programmatically retrieve Pubmed contents outside the regular Pubmed Web interface. Three E-Utils packages were used: EInfo for retrieving the total Pubmed article count; ESearch for retrieving the citation count along with a list of PMIDs in response to a query; and EFetch for retrieving article details (e.g. title, author list, abstract text, medical subject headings) based on PMIDs. Fig. 6 shows our data pipeline arrangement of ESearch and EFetch.

⁷ http://eutils.ncbi.nlm.nih.gov/.

The rule visualization subsystem was designed to rank and visualize data mining outputs. The rule ranking module ranked association rules from the most functionally novel to the least novel based on Min $\chi^2$ score. The literature module retrieved and displayed Pubmed article titles, authors, journal names, and abstract texts which supported the inferences made by the rule. The database module retrieved instances of medical records in the clinical database that were covered by each rule being evaluated. These modules provide the user with a context-rich environment for rule evaluation.

4.8. Experimental design

4.8.1. Medical expert

A widely used method for evaluating a KDD system's effectiveness is to test whether it can discover new knowledge that is compatible with prior knowledge [55,12,42,15,32]. The capability of a KDD system for discovering potentially new knowledge could be evaluated based on a domain expert's opinion that acts as the evaluation 'gold standard'. Unlike others, this method can be applied to a large variety of available medical data sets.

Here we briefly describe the qualifications of the medical expert. At the time of the experiment, the medical expert was a

cardiologist and Clinical Lead for Cardiac Magnetic Resonance Imaging at a general hospital in Malaysia. He had participated in over 20 Phase I, II and III clinical trials. He had also authored and reviewed articles published in international cardiology meetings and journals.

Fig. 6. E-Utils data pipeline for real time access to Pubmed.

4.8.2. Data set descriptions

Our experiments involved two medical data sets, as follows.

CHD_DB data set: The Coronary Heart Disease Database (CHD_DB)⁸ is a synthetic cardiovascular database curated by Suka et al. [53] to replicate the original Framingham Heart Study data sets. From this database we extracted a training data set containing 13,000 patient records. For complete descriptions of the data set attributes, refer to [53].

ECHO_MSCT data set: With the help of the medical expert, we extracted the ECHO_MSCT data set from cardiovascular clinical databases with permission from a local cardiac treatment centre. The data set contained 543 records obtained from echocardiography (ECHO) and multi-slice computed tomography scan (MSCT) examinations. The data set contained no missing values.

⁸ Available for use with permission from http://ichimura.ints.info.hiroshima-cu.ac.jp/chddb/.

4.8.3. Experiment I

Experiment I was designed to study the usefulness of χ²_lit for mining domain knowledge-compliant association rules in comparison to the μ and χ² measures. Its performance was measured based on the degree of agreement with the medical expert's evaluations. Experimental settings:

1. Given minsupp = 0.0001, minconf = 0.00, and maxitem = 2, we mined all 2-item association rules and computed the μ score for each rule.
2. If the number of rules n ≤ 25, we selected all rules and ranked them by μ in descending order. If n > 25, we ranked all rules by μ in descending order and selected the top-25 rules.
3. χ² and χ²_lit were computed for each rule selected in step 2.
4. We presented the rules to the medical expert without showing the μ, χ², and χ²_lit scores to avoid evaluation bias.
5. The medical expert manually labeled each rule according to his knowledge and expertise. Unlike [42], we did not use interesting or not interesting labels because the meanings of such labels are imprecise and unclear. Instead, we used compliant (CP), contradictory (CT), and not sure (NS) to indicate precisely whether a rule is compliant with the existing knowledge, contradictory to it, or neither, respectively.
6. For each of μ, χ² and χ²_lit:
   (a) We ranked rules in descending order.
   (b) We calculated Precision-at-five (P@5), Precision-at-ten (P@10), and Precision-at-fifteen (P@15), which are the fractions of top-5, top-10, and top-15 rules labeled as CP, respectively.
7. For each of the χ² and χ²_lit measures, we calculated Precision, Recall and MC(CP). Let R be the set of all rules to be evaluated and L_me(r) denote the label assigned to rule r by the medical expert. Let R_CP = {r | r ∈ R ∧ L_me(r) = CP} be the set of all rules labeled as CP by the medical expert. Let R_stat = {r | r ∈ R ∧ χ²(r) ∨ χ²_lit(r) > α} be the set of all statistically significant rules based on χ² or χ²_lit, α denoting the critical value. Let R_CP_stat = {r | r ∈ R_CP ∧ r ∈ R_stat} be the set of all statistically significant rules labeled as CP. We evaluated each measure using the evaluation metrics used in [42], where |·| denotes the number of rules in a set:

\[ \mathrm{Precision} = \frac{|R_{CP\_stat}|}{|R_{stat}|} \tag{5} \]

\[ \mathrm{Recall} = \frac{|R_{CP\_stat}|}{|R_{CP}|} \tag{6} \]

\[ MC(CP) = \frac{2\,(\mathrm{Precision})(\mathrm{Recall})}{\mathrm{Precision} + \mathrm{Recall}} \tag{7} \]

4.8.4. Experiment II

In the second experiment, we investigated the usefulness of Min χ²_lit for estimating and ranking rule functional novelty. Following the results obtained in Experiment I, which are described in the next section, χ²_lit was used as the main indicator for compliant rules. The medical expert suggested four pairs of medical hypotheses to be tested:

(i) Smoking ↔ endothelial progenitor cell levels (EPC lev.)
(ii) Smoking ↔ human anti murine antibody levels (HAMA lev.)
(iii) Matrix metalloproteinase (MMP) ↔ intracardiac thrombus
(iv) von Willebrand factor (VWF) ↔ intracardiac thrombus

We experimented with (i) and (ii) for mining the CHD_DB data set, and with (iii) and (iv) for the ECHO_MSCT data set. Experimental settings:

1. Given minsupp = 0.0001, minconf = 0.00 and maxitem = 2, we mined all 2-item association rules and computed the χ²_lit score for each rule.
2. We selected all domain knowledge-compliant rules (χ²_lit > α; α = 3.84).
3. The rules produced potential inferences for a given pair of user hypotheses and we calculated the Min χ² score for each rule, ranking them in descending order. We selected the top-10 and bottom-10 rules [42]. This method is suitable because in our case it is not feasible for the medical expert to evaluate all rules due to time and workload constraints.
4. For each inference, we identified and extracted its component associations. For instance, given an inference A ⇒ x ⇒ y ⇒ D, we extracted associations A ⇒ x and y ⇒ D, and presented them to the medical expert.
5. The medical expert labeled each association as compliant (CP), contradictory (CT), or not sure (NS).
6. We selected each rule x ⇒ y for which both A ⇒ x and y ⇒ D were labeled as CP.
7. We presented the rules back to the medical expert for final evaluation. We asked the medical expert to judge if rule x ⇒ y could potentially infer the association between hypotheses A and D in any
meaningful medical sense. Four types of inference quality labels were used:
   (a) Yes: all component associations are medically valid. A and D are likely to be related.
   (b) Probably: only some component associations are medically valid. A and D are less likely to be related.
   (c) No, as far as I know: the medical expert lacks sufficient background knowledge to make a proper judgment.
   (d) Not specific enough: the rule does not provide unambiguous and sufficient information to allow the medical expert to make a proper judgment.
   Additional reasons and comments from the medical expert were acquired.
8. Unlike Experiment I, it is not possible to precisely determine the recall rate of Min χ² due to the exclusion of middle-ranked rules. Hence, the performance of Min χ² had to be quantitatively determined by comparing the percentage of candidate functionally novel rules in the top-10 and bottom-10 rules after being ranked by Min χ² score. The higher the percentage of rules in the top-10 that are labeled as 'Yes' by the expert, the better the performance of Min χ². The quality of inference made by candidate functionally novel rules was qualitatively assessed by the medical expert based on his background knowledge and other relevant information in the literature.

To our knowledge, no optimal minsupp and minconf threshold values apply in all situations and the research to solve this problem is still ongoing [52]. From a domain-driven perspective, we can select threshold values that best satisfy the specific requirements of the domain and of the target data set. Considering our objective to mine as many rules as possible without leaving out significant rules, we chose minsupp = 0.0001 and minconf = 0.00 based on the following rationales:

1. The CHD_DB data set consists of 13,000 transactions. At minimum, an association rule must satisfy 1.3 transactions (0.0001 × 13,000 = 1.3). Such a low-support rule is likely to happen by chance and is medically insignificant as it only applies to one patient/clinical instance. Using this setting, we ensure that all medically significant rules will not be left out from subsequent rule evaluations.
2. The ECHO_MSCT data set consists of 543 transactions. At minimum, an association rule must satisfy 0.0543 transactions (0.0001 × 543 = 0.0543). Given this condition, virtually all rules will be generated.
3. A key weakness of confidence is its inability to accurately measure true implications [8]. A high confidence score does not guarantee true correlation between rule antecedent and consequent. We chose minconf = 0.00 so as not to exclude true correlation rules that may happen to reside in a relatively low confidence region.

Fig. 7 shows the number of rules generated by the FP-Growth algorithm for both data sets at various minsupp levels (minconf = 0.00 and maxitem = 2). Setting minsupp at 0.0001 could remove more than half of the generated rules in both the CHD_DB and ECHO_MSCT data sets (544 → 228; 312 → 124, respectively), greatly reducing the number of rules to be evaluated.

5. Results

5.1. Experiment I

5.1.1. CHD_DB results

The data mining algorithm produced 228 association rules, of which 168 passed the semantic filtering (26.32% reduction). Some rules contained redundant item pairs (two rules that have similar items which are positioned differently as antecedent and consequent). For example, we considered rule [chd = yes] ⇒ [smoker = yes] and rule [smoker = yes] ⇒ [chd = yes] as redundant because both rules cover the same item co-occurrence. Because association rules do not necessarily assert causal relations between items, one of the redundant rules could be removed to reduce the medical expert's evaluation workload. After removing redundant rules, 85 rules remained (49.41% reduction). The top 25 rules were selected for medical expert evaluation.

P@5, P@10, and P@15 scores are shown in Table 4. μ demonstrated the best ranking precision for compliant rules for top-5, top-10, and top-15 rules. A comparison between χ² and χ²_lit showed that even though χ²'s precision initially outperformed χ²_lit for top-5 rules, χ²_lit's precision gradually increased with the number of rules being evaluated, whereas χ²'s precision had the tendency to decrease.

Medical expert evaluation results are qualitatively presented in Fig. 8. This visualization method is comparable to [42]. CP denotes rules labeled as CP by the medical expert (R_CP), whereas N denotes otherwise (CT/NS). White cells indicate statistically significant rules (R_stat), whereas black cells indicate otherwise. The degree of agreement between the medical expert's evaluation and the χ² or χ²_lit score can be observed by the number of white cells matching CP (R_CP_stat). χ²_lit demonstrated a higher degree of agreement than χ². Finally, when measured across 25 rules, χ²_lit demonstrated significantly higher precision, recall, and MC(CP) rate for estimating domain knowledge-compliant rules compared to χ².
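
The ranking and agreement measures reported in this subsection (P@k, and Precision, Recall and MC(CP) from Eqs. (5)-(7)) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy scores and labels, and the default critical value alpha = 3.84 (the value quoted for Experiment II) are assumptions made here for illustration only.

```python
from typing import Dict, List, Tuple


def precision_at_k(ranked_rules: List[str], labels: Dict[str, str], k: int) -> float:
    """Fraction of the top-k ranked rules that the expert labeled CP."""
    top_k = ranked_rules[:k]
    if not top_k:
        return 0.0
    return sum(1 for rule in top_k if labels.get(rule) == "CP") / len(top_k)


def agreement_metrics(scores: Dict[str, float], labels: Dict[str, str],
                      alpha: float = 3.84) -> Tuple[float, float, float]:
    """Precision, Recall and MC(CP) in the sense of Eqs. (5)-(7)."""
    r_cp = {rule for rule, label in labels.items() if label == "CP"}
    r_stat = {rule for rule, score in scores.items() if score > alpha}
    r_cp_stat = r_cp & r_stat  # significant rules that were also labeled CP
    precision = len(r_cp_stat) / len(r_stat) if r_stat else 0.0
    recall = len(r_cp_stat) / len(r_cp) if r_cp else 0.0
    total = precision + recall
    mc_cp = 2 * precision * recall / total if total else 0.0
    return precision, recall, mc_cp


# Hypothetical toy data: four rules with made-up scores and expert labels.
toy_scores = {"r1": 10.0, "r2": 5.0, "r3": 1.0, "r4": 0.5}
toy_labels = {"r1": "CP", "r2": "NS", "r3": "CP", "r4": "CT"}

ranked = sorted(toy_scores, key=toy_scores.get, reverse=True)
print(precision_at_k(ranked, toy_labels, 2))
print(agreement_metrics(toy_scores, toy_labels))
```

In this sketch, empty sets yield 0.0 rather than a division error, and P@k divides by the number of rules actually available when fewer than k were ranked; both are design choices of the example, not behaviour described in the text.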

Table 4
CHD_DB: performance comparison between μ, χ², and χ²_lit.

        μ       χ²      χ²_lit
P@5     1.00    0.80    0.60
P@10    0.90    0.60    0.70
P@15    0.93    0.73    0.80

Fig. 7. Number of rules generated at various minsupp levels.

Fig. 8. CHD_DB evaluation results.

5.1.2. ECHO_MSCT results

We produced 124 association rules. Eighty-six rules passed semantic filtering (30.65% reduction) and after removing redundant rules, we obtained 43 rules (50.00% reduction). The top 25 rules were selected for evaluation.

Results in Table 5 show equal precisions among μ, χ², and χ²_lit. However, evaluation across all 25 rules shows that χ²_lit significantly outperformed χ² (Fig. 9), particularly in terms of recall. This result is consistent with the previous results obtained from the CHD_DB data set.

Table 5
ECHO_MSCT: performance comparison between μ, χ², and χ²_lit.

        μ       χ²      χ²_lit
P@5     1.00    1.00    1.00
P@10    1.00    1.00    1.00
P@15    1.00    1.00    1.00

5.2. Experiment II

Due to χ²_lit's high recall rate, in this experiment χ²_lit was used to automatically predict compliant rules. First we evaluated the functional novelty of rules mined from CHD_DB data for discovering the relationship between smoking and endothelial progenitor cell levels. Table 6 shows the top-10 and bottom-10 rules ranked by Min χ². χ²_litAx refers to χ²_lit calculated for the association between smoking (A) and rule antecedent (x). χ²_lityD was calculated for the association between rule consequent (y) and endothelial progenitor cell levels (D). CPAx = yes and CPyD = yes denote domain knowledge-compliant A ⇒ x and y ⇒ D, respectively.

The results show that all rules with both A ⇒ x and y ⇒ D labeled as Yes were among the top-10 rules: rule no. 2–4, 6–7 (rule no. 8 was excluded from consideration because rule smoking ⇒ [smoker = yes] carried little meaning to the user). Similar results were obtained when evaluating rules with other hypotheses. We selected these rules for further evaluation.

Table 7 presents the medical expert's final assessment on the feasibility of the inference for hypothesis (i) by rules mined from CHD_DB (inference no. 1–5). We also presented results for the remaining hypotheses: inference no. 6–10 for hypothesis (ii); no. 11 for hypothesis (iii); and no. 12 for hypothesis (iv).

The medical expert considered inference no. 12 as the most interesting discovery. The inference was labeled as Yes by the medical expert. It was commonly known that VWF was associated with diabetes mellitus ([dm = yes]). It was also widely known that diabetes mellitus was a key risk factor of coronary arteriosclerosis ([ca = yes]), and that the latter was associated with intracardiac thrombus. Interestingly, the association between VWF and intracardiac thrombus seemed to be suggested in [33]. However, the mechanism by which these were related remains little known. Pubmed search for 'von Willebrand factor AND intracardiac thrombus' returned only five citations, and Pubmed search for 'von Willebrand

Fig. 9. ECHO_MSCT evaluation results.

Table 6
Functional novelty evaluation obtained by mining CHD_DB data set for smoking ↔ endothelial progenitor cell levels.

Rank   Association rule                                Min χ²    χ²_litAx    CPAx   χ²_lityD   CPyD

Rank (top)
1.     [chol_low = yes] ⇒ [chd = yes]                  283.63    32760.4            283.63     Yes
2.     [hypertrophy_left = yes] ⇒ [chd = yes]          283.63    1423.23     Yes    283.63     Yes
3.     [chol_raised = yes] ⇒ [chd = yes]               283.63    1836.06     Yes    283.63     Yes
4.     [chol_border = yes] ⇒ [chd = yes]               283.63    4680.42     Yes    283.63     Yes
5.     [edu_high_not_graduate = yes] ⇒ [chd = yes]     283.63    13033.15           283.63     Yes
6.     [hpt_stage1 = yes] ⇒ [chd = yes]                283.63    2145.46     Yes    283.63     Yes
7.     [hpt_stage2 = yes] ⇒ [chd = yes]                283.63    2342        Yes    283.63     Yes
8.     [smoker = yes] ⇒ [chd = yes]                    283.63    311701.7    Yes    283.63     Yes
9.     [never_smoke = yes] ⇒ [chd = yes]               283.63    46082.72           283.63     Yes
10.    [edu_high_graduate = yes] ⇒ [chd = yes]         273.84    273.84             283.63     Yes

Rank (bot.)
10.    [chol_raised = yes] ⇒ [hpt_stage1 = yes]        0.37      1836.06     Yes    0.37
9.     [smoker = yes] ⇒ [hpt_stage1 = yes]             0.37      311701.7    Yes    0.37
8.     [chd = yes] ⇒ [chol_raised = yes]               0.12      85334.93    Yes    0.12
7.     [chd = yes] ⇒ [hpt_crisis = yes]                0.12      85334.93    Yes    0.12
6.     [hpt_stage1 = yes] ⇒ [chol_raised = yes]        0.12      2145.46     Yes    0.12
5.     [hpt_stage2 = yes] ⇒ [chol_raised = yes]        0.12      2342        Yes    0.12
4.     [never_smoke = yes] ⇒ [chol_raised = yes]       0.12      46082.72           0.12
3.     [chd = yes] ⇒ [chol_border = yes]               0.07      85334.93    Yes    0.07
2.     [hpt_stage2 = yes] ⇒ [chol_border = yes]        0.07      2342        Yes    0.07
1.     [hpt_stage1 = yes] ⇒ [chol_border = yes]        0.07      2145.46     Yes    0.07
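
In every row of Table 6, the Min χ² value equals the smaller of the two literature-based scores (for example, 273.84 for rank 10 at the top, and 0.37 for rank 10 at the bottom). Under that reading, which is inferred here from the table values rather than taken from a stated formula, the ranking and the top-n/bottom-n selection can be sketched in Python as follows. The class and function names are hypothetical, and the four sample rules are copied from Table 6.

```python
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    antecedent: str      # item x
    consequent: str      # item y
    chi2_lit_ax: float   # literature-based score for A => x
    chi2_lit_yd: float   # literature-based score for y => D


def min_chi2(rule: Rule) -> float:
    """Assumed reading of Min chi-square: the weaker of the two links."""
    return min(rule.chi2_lit_ax, rule.chi2_lit_yd)


def rank_rules(rules: List[Rule], n: int = 10) -> Tuple[List[Rule], List[Rule]]:
    """Rank by Min chi-square (descending) and return the top-n and bottom-n rules."""
    ranked = sorted(rules, key=min_chi2, reverse=True)
    return ranked[:n], ranked[-n:]


# Four rules taken from Table 6 (A = smoking, D = EPC levels).
rules = [
    Rule("hpt_stage1", "chol_border", 2145.46, 0.07),
    Rule("chol_raised", "chd", 1836.06, 283.63),
    Rule("edu_high_graduate", "chd", 273.84, 283.63),
    Rule("smoker", "hpt_stage1", 311701.7, 0.37),
]

top, bottom = rank_rules(rules, n=2)
for rule in top + bottom:
    print(rule.antecedent, "=>", rule.consequent, min_chi2(rule))
```

Note that when fewer than 2n rules are available, the top and bottom selections in this sketch overlap.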

factor AND diabetes mellitus AND coronary arteriosclerosis AND intracardiac thrombus' returned 0 citations. Consequently, rule [dm = yes] ⇒ [ca = yes] is functionally novel because it suggests the important role that the association between diabetes mellitus and coronary arteriosclerosis could play in bridging the relationship between VWF and intracardiac thrombus, which is relatively unknown.

Table 7
Medical expert's final evaluation on the functionally novel rules.

No.   A         Rule x ⇒ y (support/confidence) [Pubmed count]              D                       Label                  Comments
1.    Smoking   [hypertrophy_left = yes] ⇒ [chd = yes] (0.079/0.745) [0]    EPC lev.                No, as far as I know   Still under research
2.    Smoking   [chol_raised = yes] ⇒ [chd = yes] (0.292/0.594) [0]         EPC lev.                Yes
3.    Smoking   [chol_border = yes] ⇒ [chd = yes] (0.144/0.448) [0]         EPC lev.                Not specific enough    'border' is not a good term as in borderline male, borderline female, etc.
4.    Smoking   [hpt_stage1 = yes] ⇒ [chd = yes] (0.106/0.398) [0]          EPC lev.                No, as far as I know
5.    Smoking   [hpt_stage2 = yes] ⇒ [chd = yes] (0.177/0.545) [0]          EPC lev.                No, as far as I know
6.    Smoking   [hypertrophy_left = yes] ⇒ [chd = yes] (0.079/0.745) [0]    HAMA lev.               No, as far as I know
7.    Smoking   [chol_raised = yes] ⇒ [chd = yes] (0.292/0.594) [0]         HAMA lev.               Probably               Still under research, as HAMA levels is surrogate for EPC levels
8.    Smoking   [chol_border = yes] ⇒ [chd = yes] (0.144/0.448) [0]         HAMA lev.               Not specific enough    Reason as in no. 3
9.    Smoking   [hpt_stage1 = yes] ⇒ [chd = yes] (0.106/0.398) [0]          HAMA lev.               No, as far as I know   Still under research, as HAMA levels is surrogate for EPC levels
10.   Smoking   [hpt_stage2 = yes] ⇒ [chd = yes] (0.177/0.545) [0]          HAMA lev.               No, as far as I know   Still under research, as HAMA levels is surrogate for EPC levels
11.   MMP       [hypertension = yes] ⇒ [ca = yes] (0.145/0.305) [0]         Intracardiac thrombus   Probably               Still under research
12.   VWF       [dm = yes] ⇒ [ca = yes] (0.057/0.375) [0]                   Intracardiac thrombus   Yes

The second interesting discovery involves inference no. 2. It was well known that smoking would increase blood cholesterol level. It was also known that raised cholesterol level normally increased the risk of coronary heart disease ([chd = yes]). Endothelial progenitor cell level was also known to predict coronary heart disease outcomes. However, Pubmed search for 'smoking AND endothelial progenitor cell levels' and a search for 'smoking AND cholesterol levels raised AND coronary heart disease AND endothelial progenitor cell levels' returned 0 citations. Apparently, the relationship between smoking and EPC level had never been studied before. Therefore, rule [chol_raised = yes] ⇒ [chd = yes] is functionally novel as it suggests a possible association between smoking habit and EPC levels, which is currently not known. To the medical expert, such a finding can point to a valuable area for future medical research.

On a last note, we point to inference no. 11, which was labeled as Probably. It was known that hypertension ([hypertension = yes]) increased the likelihood of coronary arteriosclerosis ([ca = yes]), and that coronary arteriosclerosis was closely associated with intracardiac thrombus. The medical expert, however, was not completely sure of the association between MMP and hypertension. At the final assessment, however, the medical expert considered that the relation between MMP and intracardiac thrombus remained interesting as a potential subject of medical investigation.

6. Discussions

6.1. Usefulness of χ²_lit and Min χ² measures

Results obtained from Experiment I consistently demonstrated χ²_lit's better performance compared to χ² in recalling domain knowledge-compliant rules. This serves as evidence that incorporating more domain knowledge into the rule evaluation process could increase the recall rate for compliant rules, which eventually leads to more actionable knowledge [9]. We also argue that χ²_lit's high recall value is more crucial than its lack of precision. It is crucial to mine as many compliant rules as possible so that no important rule will be missed prior to functional novelty evaluation.

One may also notice that μ consistently outperformed χ² and χ²_lit for precision at top fifteen rules. This is actually in agreement with the common belief that rules with high support and high confidence tend to be well known by the user. However, it is difficult to use μ in practice for two reasons. Firstly, the μ score cannot be used to justify the statistical significance of an association rule. Secondly, the minimum threshold for μ cannot be determined because the optimal minsupp and minconf threshold values for rule mining remain unknown [52].

Min χ² is useful for estimating and ranking rule functional novelty. All functionally novel rules appeared among the top-10 rules. The only problem is that Min χ² seems to suffer from poor precision. This has resulted in a larger number of rules that need to be evaluated by the user. Lack of precision might have also resulted in less accurate rule ranking, such as for rule no. 1 (Table 6). Ranking accuracy could have also been affected by imprecise literature queries. In this study, we used standard Pubmed query statements in which keywords were merely joined with the AND operator. This may create problems because commonly used biomedical concepts (e.g. Adult, Smoking, Female Gender) tend to produce high citation counts without necessarily implying any meaningful correlation with the main subjects of the studies.

One way to solve the problem in the future is to focus keyword search on a particular section of the abstract texts, such as the Results or Conclusions section. This will help to distinguish true correlations from false ones. Alternatively, one may restrict the query specifically to title or major MeSH heading fields. Hristovski et al. [23,24] showed that focusing the search on Pubmed MeSH major descriptors and utilizing UMLS semantic relations may increase the relevance of the literature retrieval results.

Most importantly, results obtained from the second experiment highlight the contrast between functional novelty and the traditional pairwise novelty. Take for example rule [dm = yes] ⇒ [ca = yes] shown in Table 7. From the pairwise novelty perspective, the rule would not have been novel to the medical expert because diabetes mellitus is a known risk factor of coronary arteriosclerosis. However, the rule is now novel from the functionality point of view because it was previously unknown that it could mediate the relationship between von Willebrand factor (VWF) and intracardiac thrombus.

6.2. Significance and contributions

The observed extreme imbalance between numerous published data mining algorithms on one side and the lack of actionable KDD
frameworks on the other side has motivated major proponents of domain-driven data mining [62,10] to urge a paradigm shift from data mining to knowledge discovery [13]. As a result, more and more emphasis is now placed on producing novel and effective KDD frameworks and workflows that capitalize on the existing well-proven algorithms.

One important aspect of this research direction is to enhance post-mining interestingness evaluation and provide a solution for automatically incorporating ubiquitous intelligence into the process [10]. Our research contribution falls within this area. Firstly, we provide some evidence to support functional novelty as a new measure of rule novelty. This evidence is supported by an actual medical expert's interestingness evaluation outcome. Secondly, we provide a novel KDD framework that automatically acquires ubiquitous domain knowledge from literature, ontology, and the user. This is technically challenging because the existing data mining algorithms, e.g. FP-Growth, are naïve towards the semantics of a specific domain. They do not normally produce patterns or rules that are semantically enriched so as to enable automatic inference by a knowledge-based system. Additional processing is required to make these patterns more actionable and understandable. In our proposed framework, semantic enrichment of data items is key as it allows the system to 'interpret' the meaning of each rule. This enables, for instance, our semantic-based filtering mechanism to automatically separate rules which are likely to make sense to the domain expert from those which are not. Most importantly, because Pubmed uses similar semantic classification for indexing its literature, our technique can effectively discover functionally novel rules and perform the automatic inference of previously unknown relations between user-given hypotheses.

On a bigger landscape, our work in finding hypothesis linkage via interesting rules is valuable as we also see a serious growth of interest in link mining research recently, e.g. [50]. This trend is not merely motivated by the overarching need to solve the information overload problem, but can be interpreted as an increase in realization that much new and valuable knowledge is actually hidden within the known knowledge.

6.3. Related works

Several notable works have specifically studied rule novelty [28,51,47,34] but were limited to evaluating pairwise rule novelty. An interesting work by Aydin and Güvenir [3] approached rule interestingness evaluation as a post-mining classification problem but did not include rule novelty assessment.

Mostafa et al. [41] recently envisioned that a fruitful research area in the future will be one in which associations discovered from clinical data warehouses are validated by evidence in literature. Our work realizes this futuristic concept by applying a combination of both domain ontology and literature to evaluate rule interestingness. Our work is distinct from [30,15,29,32,36,4,40], which focused on the usage of either ontologies or literature, but not both. Works by [58,45] have, in different ways, used a combination of structured and unstructured information for biomedical knowledge discoveries. These works were characterized by the use of well-curated, research-oriented databases, e.g. gene expression or genomic sequence databases, in addition to literature and ontology. They targeted very specific knowledge discovery tasks such as the selection of gene candidates or the identification of cancer-related enzymes. Our work, on the other hand, focuses on a general methodology for mining clinical data sets that are not specifically well-curated for research purposes.

We incorporate a literature mining technique for the main purpose of evaluating the interestingness of association rules. Our work is different from other works that solely focused on literature-based knowledge discovery [56,60,27,23,59,25,49].

6.4. Limitations and recommendations

Our experiments involved only one medical expert due to time constraints and the substantial effort required of the expert to evaluate the rules. We compensated for this by involving a medical expert with a high level of expertise in cardiology. For comparison, [42] involved only two medical experts.

The usage of extremely low minsupp and minconf increases mining time and usage of computing resources. Setting maxitem = 2 helped to alleviate this problem by introducing an additional constraint. This is plausible. Allowing too many items in a rule eventually reduces its usefulness as the rule tends to overfit the data and has little predictive power [35].

Recommended future works include automating the process of attribute discretization and semantic mapping. The accuracy and relevance of the functional novelty scoring mechanism could be improved using other term analysis techniques such as term frequency-inverse document frequency (TF-IDF) or co-word clustering. Increasing the number of items allowed in a rule would also add complexity to the rule evaluation process. Future works should also look into this area.

7. Conclusions

Our experiments showed that functional novelty is a useful criterion for approximating rule novelty. Functional novelty has advantages over the traditional pairwise novelty. Functionally novel rules are likely to be more acceptable to medical experts as they remain compliant with domain knowledge. Mining functionally novel rules is also less likely to grapple with the rare item problem as we do not need to focus on rules with rare item combinations. To prove our idea, we designed and implemented a domain-driven KDD framework to automatically mine functionally novel rules. The new framework incorporated domain knowledge from biomedical literature and ontology to mine domain knowledge-compliant association rules from two cardiovascular data sets. Through the experiments, a practising cardiologist prescribed a few pairs of medical hypotheses that interest him. We successfully discovered some novel rules that seemed to mediate previously unknown connections between these hypotheses, making them highly interesting to the medical expert.

References

[1] R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, in: Proceedings of the 1993 ACM SIGMOD Conference, 1993, pp. 207–216.
[2] M. Ankerst, Report on the SIGKDD-2002 panel the perfect data mining tool: Interactive or automated?, SIGKDD Explorations 4 (2) (2002) 110–111.
[3] T. Aydin, H.A. Güvenir, Modeling interestingness of streaming association rules as a benefit-maximizing classification problem, Knowledge-Based Systems 22 (2009) 85–99.
[4] K. Becker, M. Vanzin, O3R: Ontology-based mechanism for a human-centered environment targeted at the analysis of navigation patterns, Knowledge-Based Systems 23 (2010) 455–470.
[5] M.V. Blagosklonny, A.B. Pardee, Unearthing the gems, Nature 416 (2002) 373.
[6] D. Bray, Reasoning for results, Nature 412 (2001) 863.
[7] S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing association rules to correlations, in: Proceedings of the 1997 ACM SIGMOD Conference, 1997, pp. 265–276.
[8] S. Brin, R. Motwani, J.D. Ullman, S. Tsur, Dynamic itemset counting and implication rules for market basket data, in: Proceedings of the 1997 ACM SIGMOD Conference, 1997, pp. 255–264.
[9] L. Cao, Data Mining for Business Applications, Springer, 2008. Ch. Introduction to domain driven data mining, pp. 3–10.
[10] L. Cao, P.S. Yu, C. Zhang, Y. Zhao, Domain Driven Data Mining, Springer, 2010.
[11] L. Cao, C. Zhang, Domain-driven, actionable knowledge discovery, IEEE Intelligent Systems 22 (4) (2007) 78–88.

[12] D.R. Carvalho, A.A. Freitas, N. Ebecken, Evaluating the correlation between objective rule interestingness measures and real human interest, Lecture Notes in Computer Science 3721 (2005) 453–461.
[13] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery in databases, AI Magazine 17 (1996) 37–54.
[14] U. Fayyad, G. Piatetsky-Shapiro, R. Uthurusamy, Summary from the KDD-03 panel: data mining: the next 10 years, ACM SIGKDD Explorations Newsletter 5 (2) (2003) 191–196.
[15] S.J. Fodeh, P.-N. Tan, Incorporating background knowledge from the world wide web for rule evaluation using the minimum discriminative information principle, in: Proceedings of the First International Workshop on Mining Multiple Information Sources (MIMS’07), 2007, pp. 22–30.
[16] W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge discovery in databases: an overview, AI Magazine 13 (3) (1992) 57–70.
[17] A.A. Freitas, On rule interestingness measures, Knowledge-Based Systems 12 (1999) 309–315.
[18] A.A. Freitas, Are we really discovering interesting knowledge from data?, Expert Update Special Issue on the 2nd UK KDD Workshop 9 (1) (2006) 41–47.
[19] L. Geng, H.J. Hamilton, Interestingness measures for data mining, ACM Computing Surveys 38 (3) (2006) 1–32.
[20] T. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition 5 (2) (1993) 199–220.
[21] J. Han, H. Cheng, X. Dong, Frequent pattern mining: current status and future directions, Data Mining and Knowledge Discovery 15 (2007) 55–86.
[22] J. Han, J. Pei, Y. Yin, R. Mao, Mining frequent patterns without candidate generation: A frequent-pattern tree approach, Data Mining and Knowledge Discovery 8 (1) (2004) 53–87.
[23] D. Hristovski, C. Friedman, T.C. Rindflesch, B. Peterlin, Exploiting semantic relations for literature-based discovery, in: Proceedings of the 2006 American Medical Informatics Association Annual Symposium, 2006, pp. 349–353.
[24] D. Hristovski, J. Stare, B. Peterlin, S. Dzeroski, Supporting discovery in medicine by association rules mining in MEDLINE and UMLS, Studies in Health Technology and Informatics 84 (2) (2001) 1344–1348.
[25] X. Hu, X. Zhang, X. Zhou, Comparison of Seven Methods for Mining Hidden links, Wiley Interscience, 2007. pp. 27–44.
[26] S. Jaroszewicz, T. Scheffer, D.A. Simovici, Scalable pattern mining with Bayesian networks as background knowledge, Data Mining and Knowledge Discovery 18 (2009) 56–100.
[27] L.J. Jensen, J. Saric, P. Bork, Literature mining for the biologist: from information retrieval to biological discovery, Nature Reviews Genetics 7 (2) (2006) 119–129.
[28] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, A.I. Verkamo, Finding interesting rules from large sets of discovered association rules, in: Proceedings of the Third International Conference on Information and Knowledge Management (CIKM’94), 1994, pp. 401–407.
[29] E.E. Kotsifakos, G. Marketos, Y. Theodoridis, A framework for integrating ontologies and pattern-bases, Information Science Reference (2008).
[30] Y.T. Kuo, A. Lonie, L. Sonenberg, K. Paizis, Domain ontology driven data mining: a medical case study, in: ACM SIGKDD International Workshop on Domain-Driven Data Mining (DDDM 07), 2007, pp. 11–17.
[31] P. Lenca, P. Meyer, B. Vaillant, S. Lallich, On selecting interestingness measures for association rules: user oriented description and multiple criteria decision aid, European Journal of Operational Research 184 (2) (2008) 610–626.
[32] J. Li, N. Circone, S.W.H. Wong, L.J. Yan, Enhancing rule importance measure using concept hierarchy, in: Proceedings of the Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models Workshop (QIMIE’09), 2009, pp. 43–54.
[33] G.Y. Lip, G.D. Lowe, M.J. Metcalfe, A. Rumley, F.G. Dunn, Effects of warfarin therapy on plasma fibrinogen, von Willebrand factor, and fibrin D-dimer in left ventricular dysfunction secondary to coronary artery disease with and without aneurysms, American Journal of Cardiology 76 (7) (1995) 453–458.
[34] B. Liu, W. Hsu, S. Chen, Y. Ma, Analyzing the subjective interestingness of association rules, IEEE Intelligent Systems 15 (5) (2000) 47–55.
[35] B. Liu, W. Hsu, Y. Ma, Mining association rules with multiple minimum supports, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999, pp. 337–341.
[36] C. Marinica, F. Guillet, Knowledge-based interactive postmining of association rules using ontologies, IEEE Transactions on Knowledge and Data Engineering 22 (6) (2010) 784–797.
[37] K. McGarry, A survey of interestingness measures for knowledge discovery, Knowledge Engineering Review 20 (01) (2005) 39–61.
[38] I. Mierswa, M. Wurst, R. Klinkenberg, M. Scholz, T. Euler, Yale: Rapid prototyping for complex data mining tasks, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’06), 2006, pp. 935–940.
[39] F. Mills, Statistical Methods, Pitman, 1955.
[40] L. Moss, D. Sleeman, M. Sim, M. Booth, M. Daniel, L. Donaldson, C. Gilhooly, M. Hughes, J. Kinsella, Ontology-driven hypothesis generation to explain anomalous patient responses to treatment, Knowledge-Based Systems 23 (2010) 309–315.
[41] J. Mostafa, K. Seki, W. Ke, Biological Data Mining, CRC Press, 2010. pp. 449, Ch. Beyond information retrieval: Literature mining for biomedical knowledge discovery.
[42] M. Ohsaki, H. Abe, S. Tsumoto, H. Yokoi, T. Yamaguchi, Evaluation of rule interestingness measures in medical knowledge discovery in databases, Artificial Intelligence in Medicine 41 (3) (2007) 177–196.
[43] B. Padmanabhan, A. Tuzhilin, A belief-driven method for discovering unexpected patterns, in: Proceedings of the Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1998, pp. 94–100.
[44] M.J. Pazzani, S. Mani, W.R. Shankle, Acceptance of rules generated by machine learning among medical experts, Methods of Information in Medicine 40 (2001) 380–385.
[45] P. Pospisil, L.K. Iyer, S.J. Adelstein, A.I. Kassis, A combined approach to data mining of textual and structured data to identify cancer-related targets, BMC Bioinformatics 7 (2006) 354.
[46] R. Prabowo, M. Thelwall, A comparison of feature selection methods for an evolving RSS feed corpus, Information Processing & Management 42 (6) (2006) 1491–1512.
[47] S. Sahar, Interestingness via what is not interesting, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’99), 1999, pp. 332–336.
[48] Y. Sebastian, B.C.S. Loh, P.H.H. Then, A paradigm shift: combined literature and ontology-driven data mining for discovering novel relations in biomedical domain, in: The 3rd IEEE ICDM International Workshop on Domain Driven Data Mining (DDDM 09), IEEE, Miami, 2009, pp. 51–57.
[49] K. Seki, J. Mostafa, Discovering implicit associations among critical biological entities, International Journal of Data Mining and Bioinformatics 3 (2) (2009) 105–123.
[50] D. Shahaf, C. Guestrin, Connecting the dots between news articles, in: Proceedings of the Sixteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 623–632.
[51] A. Silberschatz, A. Tuzhilin, What makes patterns interesting in knowledge discovery systems, IEEE Transactions on Knowledge and Data Engineering 8 (6) (1996) 970–974.
[52] A.T.Z. Sim, M. Indrawan, S. Zutshi, B. Srinivasan, Logic-based pattern discovery, IEEE Transactions on Knowledge and Data Engineering 22 (6) (2010) 798–811.
[53] M. Suka, T. Ichimura, K. Yoshida, Development of coronary heart disease database, Knowledge-Based Intelligent Information and Engineering Systems 3214 (2004) 1081–1088.
[54] E. Suzuki, Interestingness measures – limits, desiderata, and recent results, in: Proceedings of the Quality Issues, Measures of Interestingness and Evaluation of Data Mining Models (QIMIE/PAKDD 2009), 2009, pp. 1–3.
[55] V. Svatek, J. Rauch, Ontology-enhanced association mining, Semantics, Web and Mining 4289/2006 (2006) 163–179.
[56] D.R. Swanson, Fish oil, Raynaud’s syndrome, and undiscovered public knowledge, Perspectives in Biology and Medicine 30 (1) (1986) 7–18.
[57] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.
[58] N. Tiffin, J.F. Kelso, A.R. Powell, H. Pan, V.B. Bajic, W.A. Hide, Integration of text- and data-mining using ontologies successfully selects disease gene candidates, Nucleic Acids Research 33 (5) (2005) 1544–1552.
[59] V.I. Torvik, N.R. Smalheiser, A quantitative model for linking two disparate sets of articles in MEDLINE, Bioinformatics 23 (13) (2007) 1658–1665.
[60] J.D. Wren, Extending the mutual information measure to rank inferred literature relationships, BMC Bioinformatics 5 (1) (2004) 145.
[61] Q. Yang, X. Wu, C. Elkan, J. Gehrke, J. Han, D. Heckerman, D. Keim, J. Liu, D. Madigan, G. Piatetsky-Shapiro, V. Raghavan, R. Rastogi, S. Stolfo, A. Tuzhilin, B. Wah, 10 challenging problems in data mining research, Journal of Information Technology 5 (4) (2006) 597–604.
[62] C. Zhang, P.S. Yu, D. Bell, Introduction to the domain-driven data mining special edition, IEEE Transactions on Knowledge and Data Engineering 22 (6) (2010) 753–754.
