prediction of diseases in the health care industry. Data mining is the process of selecting, exploring, and modeling large amounts of data to discover unknown patterns or relationships useful to the data analyst. Medical data mining has emerged with considerable potential for exploring hidden patterns in data sets from the medical domain. These patterns can be utilized for fast and better clinical decision making for preventive and suggestive medicine. However, raw medical data are widely distributed, heterogeneous in nature, and too voluminous for ordinary processing. Data mining and statistics can collectively work better towards discovering hidden patterns and structures in data. In this paper, two major data mining techniques, viz. FP-Growth and Apriori, are applied to a diabetes dataset and association rules are generated by both of these algorithms.

Keywords- Association rules (AR), FP-tree, T-tree, Classification, and Pima Indian Diabetes Data (PIDD)

I. INTRODUCTION

The data mining process can be extremely useful for medical practitioners in extracting hidden medical knowledge; traditional pattern matching and mapping strategies could not be as effective and precise in prognosis or diagnosis without the application of data mining techniques. This work aims at correlating various diabetes input parameters for efficient classification of the diabetes dataset and, from there, at mining useful patterns. Knowledge discovery and data mining have found numerous applications in the business and scientific domains, and valuable knowledge can be discovered by applying data mining techniques in healthcare systems too [1]. Data preprocessing and transformation are required before data mining can be applied to clinical data; knowledge discovery and data mining is the core step, which results in the discovery of hidden but useful knowledge from massive databases [1].

An invariable characteristic of any prediction model is that it uses some data mining classification algorithm with two folds, a prediction model and an evaluation method, as shown in Fig. 2.1. In the first fold, the training dataset is used for screening the attributes and building the classification predictive model. In the second fold, the testing dataset is used for estimating classification efficiency. The classification algorithm indicates whether the patient's disease can be predicted at a high or a low level, and whether the patient is suffering from the disease or not (an illustrative sketch of this scheme follows below).

A decision tree has a tree-like structure; it is very simple, efficient and easy to implement, and is presented in [2]. A naive Bayesian classifier, based on Bayes' theorem, is a probabilistic statistical classifier. Its major advantage lies in its simplicity and in its efficiency in handling datasets containing many attributes.
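As a concrete illustration of the two-fold scheme and of the naive Bayesian classifier mentioned above, the minimal sketch below builds a Gaussian naive Bayes model on a training split of the PIDD and measures classification efficiency on a held-out testing split. This is our illustration, not the authors' code; the file name, the column names, and the use of the pandas and scikit-learn libraries are assumptions.

```python
# Minimal sketch of the two-fold scheme: build the model on training data,
# then estimate classification efficiency on held-out testing data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Assumed local copy of the PIDD file: 8 attribute columns plus a class label.
cols = ["pregnancies", "OGTT", "diastolic_bp", "triceps_skinfold",
        "insulin", "BMI", "pedigree", "age", "class"]
data = pd.read_csv("pima-indians-diabetes.data", header=None, names=cols)

X, y = data[cols[:-1]], data["class"]

# Fold 1: training dataset used to build the predictive model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = GaussianNB().fit(X_train, y_train)

# Fold 2: testing dataset used to measure classification efficiency.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Any of the classifiers mentioned above (for example a decision tree) could be substituted for GaussianNB without changing the two-fold structure.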
III. METHODOLOGY
value. There are no restrictions on the discrete values associated with a given data interval except that these values must induce some ordering on the discretized attribute domain. Data discretization significantly improves the quality of discovered knowledge and also reduces the running time of various data mining tasks such as association rule discovery, classification, and prediction [6]. Good discretization can lead to new and more accurate knowledge. On the other hand, bad discretization leads to unnecessary loss of information or, in some cases, to false information with disastrous consequences. There is a wide variety of discretization methods, starting with the naive methods often referred to as unsupervised methods, such as equal-width and equal-frequency binning, and supervised methods such as Minimum Description Length (MDL) and Pearson's χ² or Wilks' G² statistics based discretization algorithms [6, 7].
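To make the two unsupervised schemes above concrete, the short sketch below discretizes a handful of BMI values (invented for illustration, not taken from the PIDD) by equal-width and by equal-frequency binning, assuming the pandas library.

```python
# Equal-width vs. equal-frequency discretization of a single attribute.
import pandas as pd

bmi = pd.Series([18.2, 23.5, 26.1, 28.7, 30.0, 31.4, 33.6, 35.2, 38.9, 45.1])

# Equal-width: split the value range into 4 intervals of equal length.
equal_width = pd.cut(bmi, bins=4)

# Equal-frequency: choose cut points so each interval holds roughly the
# same number of records (quartiles here).
equal_frequency = pd.qcut(bmi, q=4)

print(equal_width.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())
```

Supervised schemes such as the MDL-based method mentioned above would additionally consult the class label when placing the cut points.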
IV. ALGORITHMS

A. Association Rule Mining

Association rule mining techniques are used to identify relationships among a set of items in a database [8]. These relationships are not based on inherent properties of the data themselves, as with functional dependencies, but rather on the co-occurrence of the data items. Association rules are more appropriate when we search for completely new rules [8]. In this context, the association rule mining technique may generate the probable causes of a particular disease, such as diabetes, in the form of association rules which can be used for fast and better clinical decision-making.

Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I, where I = {i1, i2, ..., im} is a set of literals, called items. Given the set of transactions D, the problem is to find association rules that have support and confidence greater than the user-specified minimum support and minimum confidence [8].

An association rule is an implication of the form X → Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The rule X → Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X → Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y [9, 10].

Given the set of transactions D, one may be interested in generating all rules that satisfy certain fixed constraints on support and confidence. Support and confidence are measures of the interestingness of a rule. A high level of support indicates that the rule is frequent enough for the organization to be interested in it. A high level of confidence shows that the rule is true often enough to justify a decision based on it [11].

Thus, for a rule X → Y,

Support(X → Y) = (number of transactions in which X and Y appear together) / |D|

Confidence(X → Y) = Support(X → Y) / Support(X)
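The following toy computation (invented transactions, not the paper's data) shows the two definitions in code: support counts how often the items occur together, and confidence normalizes that by the support of the rule's antecedent.

```python
# Support and confidence over a toy set of discretized PIDD-style transactions.
transactions = [
    {"OGTT=127", "pregnancies=3", "BMI=30"},
    {"OGTT=127", "pregnancies=3"},
    {"BMI=30", "age=35"},
    {"OGTT=127", "age=35"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs U rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

# Rule {OGTT=127} -> {pregnancies=3}: support = 2/4 = 0.5, confidence = 2/3.
print(support({"OGTT=127", "pregnancies=3"}))
print(confidence({"OGTT=127"}, {"pregnancies=3"}))
```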
B. Apriori

This algorithm consists of two parts [11, 12]. The first part finds the frequent itemsets and the second part identifies the rules. For finding the frequent itemsets, the following steps are followed:

Step 1: Scan all transactions and find all frequent items that have support above s%. Let this set of frequent items be L1.

Step 2: Build potential sets of k items from Lk-1 by using pairs of itemsets in Lk-1 such that each pair has the first k-2 items in common. The k-2 common items and the one remaining item from each of the two itemsets are combined to form a k-itemset. The set of such potentially frequent k-itemsets is the candidate set Ck. (For k = 2, the potential frequent pairs are built by combining every frequent item in L1 with every other item in L1; the set so generated is the candidate set C2.)

Step 3: Scan all transactions and find all k-itemsets in Ck that are frequent. The frequent set so obtained is Lk.

In other words, the first pass of the Apriori algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen function. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t [11, 12].

For finding the rules, the following straightforward algorithm is used: take a frequent itemset, say l, and find each non-empty proper subset a. For every such subset a, output a rule of the form a → (l - a) if support(l) / support(a) satisfies the minimum confidence.
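A compact sketch of the two parts described above is given below: level-wise candidate generation and counting, followed by rule generation from each frequent itemset. It is an illustrative implementation under simplifying assumptions (transactions as sets of hashable items, no subset pruning in the join step), not the code used for the experiments in this paper.

```python
# Illustrative Apriori: frequent-itemset generation, then rule generation.
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support."""
    n = len(transactions)
    candidates = {frozenset([i]) for t in transactions for i in t}  # C1
    frequent = {}
    while candidates:
        # Scan the transactions and keep the candidates meeting min_support (Lk).
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: unite pairs of frequent k-itemsets that differ in one item
        # to build the candidate (k+1)-itemsets (subset pruning omitted).
        keys = list(level)
        candidates = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
    return frequent

def generate_rules(frequent, min_confidence):
    """Output rules a -> (l - a) whose confidence reaches min_confidence."""
    rules = []
    for l, sup in frequent.items():
        for size in range(1, len(l)):
            for a in map(frozenset, combinations(l, size)):
                conf = sup / frequent[a]          # support(l) / support(a)
                if conf >= min_confidence:
                    rules.append((set(a), set(l - a), conf))
    return rules
```

With the toy transactions from the preceding snippet, generate_rules(apriori(transactions, 0.5), 0.7) yields the single rule {pregnancies=3} -> {OGTT=127} with confidence 1.0.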
C. Frequent Pattern Growth

FP-Growth is a two-step approach which allows frequent itemsets to be discovered without candidate itemset generation.

Step 1: Build a compact data structure called the FP-tree, using two passes over the dataset.

Step 2: Extract frequent itemsets directly from the FP-tree.

The FP-tree is constructed using two passes over the dataset:

Pass 1: compresses the database into a compact Frequent Pattern tree (FP-tree) structure.

Pass 2: develops efficient, FP-tree-based frequent pattern mining.
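The sketch below illustrates Step 1 only: a first scan counts item frequencies, and a second scan inserts each frequency-ordered, filtered transaction as a shared path in the tree. The recursive mining of Step 2 is omitted; this is our simplified illustration, not the implementation used in the paper.

```python
# Two scans of the transactions to build the FP-tree (mining step omitted).
class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 1, {}

def build_fp_tree(transactions, min_support_count):
    # Scan 1: count item occurrences and keep only the frequent items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    # Scan 2: insert each transaction, restricted to frequent items and
    # sorted by descending frequency, as a path in the compact tree.
    root = FPNode(None, None)
    root.count = 0
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            else:
                child.count += 1
            node = child
    return root, frequent
```

On the toy transactions used earlier, with a minimum support count of 2, the OGTT=127 node directly under the root ends up with a count of 3 because three of the four transactions share that prefix; this sharing of paths is what makes the structure compact.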
The major difference between FP-Growth and the Apriori algorithm discussed above is that FP-Growth does not generate the candidate itemsets and then test them [13, 14].

In the following Table 2, frequent itemsets and rules were produced using different approaches and different parameter settings.
1. IF (OGTT=127) THEN (number of times pregnant=3)
2. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN (Diabetes pedigree function=0.5) (unit of BMI: weight in kg/(height in m)^2)
3. IF (number of times pregnant=3) AND (Age=35) THEN (Diabetes pedigree function=0.33)
4. IF (Diastolic Blood Pressure=50) THEN Not Diabetic
5. IF (number of times pregnant=6) AND (BMI=34) THEN Not Diabetic
6. IF (Triceps Skinfold Thickness=22) AND (Diabetes pedigree function=0.5) THEN Not Diabetic
7. IF (number of times pregnant=5) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
8. IF (number of times pregnant=7) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
9. IF (BMI=35) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
10. IF (OGTT=103) THEN Diabetic
11. IF (OGTT=105) THEN Diabetic
12. IF (OGTT=119) THEN Diabetic
13. IF (OGTT=120) THEN Diabetic
14. IF (Diastolic Blood Pressure=63) THEN Diabetic
15. IF (number of times pregnant=2) AND (Diastolic Blood Pressure=75) THEN Diabetic
16. IF (number of times pregnant=3) AND (Triceps Skinfold Thickness=17) THEN Diabetic
17. IF (number of times pregnant=2) AND (BMI=30) THEN Diabetic
18. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN Diabetic
19. IF (number of times pregnant=3) AND (BMI=33) THEN Diabetic
20. IF (OGTT=113) AND (Diabetes pedigree function=0.33) THEN Diabetic
21. IF (OGTT=120) AND (Diabetes pedigree function=0.33) THEN Diabetic
22. IF (BMI=32) AND (Diabetes pedigree function=0.33) THEN Diabetic
23. IF (OGTT=119) AND (Diabetes pedigree function=0.5) THEN Diabetic
24. IF (Diastolic Blood Pressure=75) AND (Diabetes pedigree function=0.5) THEN Diabetic
25. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN (Diabetes pedigree function=0.5) AND Diabetic

These rules have the potential to improve the expert system and to support better clinical decision making. In a thickly populated country with scarce resources such as India, public awareness can also be achieved through the dissemination of the above knowledge.

REFERENCES

[1] Harleen Kaur and Siri Krishan Wasan, "Empirical Study on Applications of Data Mining Techniques in Healthcare," Journal of Computer Science, 2006.
[2] Duen-Yian Yeh, Ching-Hsue Cheng and Yen-Wen Chen, "A predictive model for cerebrovascular disease using data mining," 2011, pp. 8970-8977.
[3] Shantakumar B. Patil and Y. S. Kumaraswamy, "Predictive data mining for medical diagnosis of heart disease," 2011.
[4] D. Shanthi, G. Sahoo and N. Saravanan, "Designing an Artificial Neural Network Model for the Prediction of Thrombo-embolic Stroke," (IJBB), Volume 3, pp. 10-18, 2008.
[5] UCI Machine Learning Repository, Pima Indians Diabetes Data Set, http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes
[6] Marc Boullé, "Khiops: A Statistical Discretization Method of Continuous Attributes," Machine Learning, 2004.
[7] Ruoming Jin, Yuri Breitbart and Chibuike Muoh, "Data Discretization Unification," in Proceedings of the Seventh IEEE International Conference on Data Mining (ICDM), pp. 183-192, 2007.
[8] M. H. Margahny and A. A. Mitwaly, "Fast Algorithms for Mining Association Rules," AIML 05 Conference, 19-21 December 2005.
[9] Milan Zorman et al., "Mining Diabetes Database with Decision Trees and Association Rules," 2002.
[10] Carlos Ordonez, "Comparing Association Rules and Decision Trees for Disease Prediction," HIKM'06, November 11, 2006.
[11] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," September 1994.
[12] R. Agrawal et al., "Mining Association Rules between Sets of Items in Large Databases," in Proceedings of the ACM SIGMOD Conference on Management of Data, 1993.
[13] Christian Borgelt, "An Implementation of the FP-growth Algorithm," Workshop on Frequent Pattern Mining Implementations, Conference on Knowledge Discovery in Data Mining, 2005.
[14] Jiawei Han, Jian Pei and Yiwen Yin, "Mining Frequent Patterns without Candidate Generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.
[15] Frans Coenen, Paul Leng and Shakil Ahmed, "Data Structure for Association Rule Mining," IEEE Transactions on Knowledge and Data Engineering, Vol. 16, June 2004.