
International Journal of Computer Applications (0975-8887)

Volume * No.*, ___________ 2013

Mining Medical Data to Identify Frequent Diseases

Ami Vora
Student, Thakur College of Engineering and Technology
amivora00@gmail.com

Neha Aitavade
Student, Thakur College of Engineering and Technology
neha.aitavade@gmail.com

Saumil Shah
Student, Thakur College of Engineering and Technology
saumilshah500@gmail.com

Mrs. Rashmi Thakur
Faculty, Thakur College of Engineering and Technology
rashmi.thakur@thakureducation.org

ABSTRACT
Data mining is the process of analyzing large volumes of data
from different perspectives and summarizing it into useful
information. This information can be converted into
knowledge about historical patterns and future trends. Data
mining is the extraction of hidden predictive information from
large databases; it is a powerful technology with great
potential to help organizations focus on the most important
information in their data warehouses [6]. Another widely
used term is knowledge discovery from data, or KDD [7]. Data
mining plays a significant role in the field of information
technology. The health care industry today generates large
amounts of complex data about patients, hospital resources,
diseases, diagnosis methods, electronic patient records, etc.
Data mining techniques are very useful for making medical
decisions in curing diseases. Unfortunately, much of the
health care data that the industry collects is not mined to
discover hidden information for effective decision making.
The discovered knowledge can be used by health care
administrators to improve the quality of service. In this
paper, we present a method to identify the frequency of
diseases in a particular geographical area over a given time
period with the help of data mining tools.

General Terms
Data mining, Frequent diseases, Medical data, Future trends.

Keywords
Feature selection, KDD, Health care, Data mining, Apriori
algorithm, ID3 algorithm, K-means clustering algorithm.

1. INTRODUCTION

The health care domain faces many challenges and difficult
tasks; one of the most difficult is disease diagnosis. Data
mining is the process of analyzing large volumes of data from
different perspectives and summarizing it into useful
information [1]. Clinical databases are a part of the domain
where data mining has become indispensable, owing to the
steady growth of medical and clinical research data. Health
care industries can gain an advantage from data mining by
employing it as an intelligent diagnostic tool. As far as
medical data is concerned, knowledge and information about a
disease can be acquired from stored patient-specific
measurements. Data mining has therefore developed into a
vital domain in health care [2]. It is possible to predict
the efficiency of medical treatments by building data mining
applications. Data mining can deliver an assessment of which
courses of action prove effective by comparing and evaluating
causes, symptoms, and courses of treatment [3].
Real-life data mining applications are attractive since they
provide data miners with a varied set of problems, time and
again. Working on heart disease patient databases is one such
real-life application. The detection of a disease from
several factors or symptoms is a multi-layered problem and
might lead to false assumptions, frequently associated with
erratic effects. Therefore it appears reasonable to try
utilizing the knowledge and experience of several specialists,
collected in databases, to assist the diagnosis process [4], [5].
With the aid of data mining techniques, researchers in the
medical field identify and predict diseases and provide
effective care for patients. The overall goal of the data
mining process is to extract knowledge from an existing data
set and transform it into a human-understandable structure for
further use.

2. RELATED WORK
Mining Diseases based on Feature Selection
A large population creates a great demand for doctors, and
their shortage creates problems, so a console plays an
important role to some extent. It lets users assess their own
condition even if they are at a remote location where doctors
are hard to reach regularly. Integrating it into web portals
saves cost and time. Data mining derives its name from the
similarities between searching for valuable business
information in a large database and mining a mountain for a
vein of valuable ore.
Type 1 (Based on age): Different diseases occur frequently in
different age groups. Age groups are prone to different
diseases due to their daily activities. Hence it is important
to sort diseases by age group for future use and awareness.

Type 2 (Based on sex): Different diseases occur frequently in
each sex. Classification by sex is an important way of
differentiating and recognising types of diseases and of
drawing scientific and biological conclusions and predictions.
Hence it is important to sort diseases by sex for future use
and awareness.
Type 3 (Based on geographical area): Every area has
different geology, topography and climatic conditions. Hence
it is important to segregate frequently occurring diseases
by geographical area.
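
The three groupings above can be sketched as a simple frequency count. This is only an illustration: the record fields (`age`, `sex`, `area`, `disease`), the age bands in `age_group`, and the sample records are assumptions made for the example, not data from this study.

```python
from collections import Counter, defaultdict

# Hypothetical patient records; field names and values are assumed.
RECORDS = [
    {"age": 8,  "sex": "F", "area": "North", "disease": "asthma"},
    {"age": 9,  "sex": "M", "area": "North", "disease": "asthma"},
    {"age": 34, "sex": "F", "area": "South", "disease": "malaria"},
    {"age": 41, "sex": "M", "area": "South", "disease": "malaria"},
    {"age": 45, "sex": "M", "area": "South", "disease": "diabetes"},
    {"age": 70, "sex": "F", "area": "North", "disease": "diabetes"},
]


def age_group(age):
    """Map an age in years to a coarse age band (assumed cut-offs)."""
    if age < 18:
        return "child"
    if age < 60:
        return "adult"
    return "senior"


def frequent_diseases(records, group_of):
    """Count diseases per group; group_of maps a record to its group label."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[group_of(rec)][rec["disease"]] += 1
    return counts


by_age = frequent_diseases(RECORDS, lambda r: age_group(r["age"]))   # Type 1
by_sex = frequent_diseases(RECORDS, lambda r: r["sex"])              # Type 2
by_area = frequent_diseases(RECORDS, lambda r: r["area"])            # Type 3

# Most frequent disease in each geographical area.
for area, counter in sorted(by_area.items()):
    print(area, counter.most_common(1))
```

Passing the grouping rule as a function keeps one counting routine for all three types; a real system would replace the in-memory list with a query over the medical database.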

Similarly, frequent itemset mining can be applied to a
medical database, where the frequent itemsets are sets of
frequently occurring diseases.
The Apriori algorithm is meant for discovering locally
frequent patterns from medical data sources. The dataset is
collected from ABC hospital in Mumbai. Various data mining
techniques have been used earlier for medical data mining.
However, for finding locally frequent diseases we chose to
adapt Apriori, as it is suitable for discovering frequent
patterns. We modified the Apriori algorithm with a
preprocessing step that makes the algorithm work efficiently.
This algorithm can generate locally frequent diseases and
visualize the experimental results from various viewpoints. We
built a prototype application to demonstrate the proof of
concept. The empirical results reveal that our algorithm has
considerable scope to improve the Quality of Service in the
healthcare industry. Our work is significant in a context
where electronic health records and other historical medical
data, available in textual and graphical formats, are a gold
mine for researchers in the field. Our prototype is useful and
can be incorporated into real-world healthcare tools.
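
A minimal sketch of frequent-itemset mining over disease records follows. It is the textbook level-wise Apriori search, not the modified algorithm with the preprocessing step mentioned above. The function names `apriori` and `association_rules`, the sample `VISITS`, and both thresholds are assumptions chosen for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets seen in at least
    min_support transactions (absolute count)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        # and keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Support counting: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical data: the set of diseases recorded for each patient.
VISITS = [
    {"diabetes", "hypertension"},
    {"diabetes", "hypertension", "asthma"},
    {"diabetes", "asthma"},
    {"hypertension"},
    {"diabetes", "hypertension"},
]

frequent_sets = apriori(VISITS, min_support=3)
rules = association_rules(frequent_sets, min_confidence=0.7)
```

With these assumed records, "asthma" is dropped at the first level, so no candidate containing it is ever counted; that pruning is what keeps the search manageable on larger disease vocabularies.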

Proposed Ideas
[1] Apriori Algorithm (Association Rule)
Apriori is the fundamental and most important algorithm for
mining frequent itemsets. It was first proposed by Agrawal and
Srikant in 1994 [7]. It is a level-wise algorithm that works
iteratively to discover all frequent itemsets in a database,
using prior knowledge of frequent itemset properties [8].
Frequent itemsets are sets of items that satisfy a minimum
support threshold. The algorithm takes only categorical input
and associates attributes present in the dataset. It relies on
the Apriori property, which states that any subset of a
frequent itemset is also a frequent itemset. For example, if
{x,y,z} is a frequent set, then the sets {x}, {y}, {z},
{x,y}, {x,z} and {y,z} must also be frequent. The execution
of this algorithm is organized in two phases. In the first
phase the candidates are generated, and in the next phase the
frequent itemsets are generated [9]. The generated large
itemsets are used to produce association rules from the
database.

Figure: Generating association rules. Step 1: find frequent
itemsets (sets of items that satisfy the minimum support
threshold). Step 2: use the frequent itemsets to produce
association rules with a minimum confidence value.

[2] ID3 Algorithm

A decision tree is built top-down from a root node and
involves partitioning the data into subsets that contain
instances with similar values (homogeneous).
ID3 is a mathematical algorithm for building the decision tree.
It was invented by J. Ross Quinlan in 1979 and uses
Information Theory, invented by Shannon in 1948. It builds the
tree from the top down, with no backtracking. Information
Gain is used to select the most useful attribute for
classification.
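
The attribute selection described above can be sketched as follows. This is a minimal illustration of entropy, information gain and top-down tree building for categorical attributes; the helper names, the `fever`/`cough`/`diagnosis` fields and the four sample rows are assumptions, and the sketch omits pruning and handling of unseen attribute values.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def id3(rows, attributes, target):
    """Build a nested-dict decision tree top-down, with no backtracking."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # pure node: return the class
        return labels[0]
    if not attributes:                 # nothing left to split on: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attributes if a != best]
    return {best: {value: id3([r for r in rows if r[best] == value], rest, target)
                   for value in {r[best] for r in rows}}}


# Hypothetical training instances (symptoms -> diagnosis).
ROWS = [
    {"fever": "yes", "cough": "yes", "diagnosis": "flu"},
    {"fever": "yes", "cough": "no",  "diagnosis": "flu"},
    {"fever": "no",  "cough": "yes", "diagnosis": "cold"},
    {"fever": "no",  "cough": "no",  "diagnosis": "healthy"},
]

tree = id3(ROWS, ["fever", "cough"], "diagnosis")
```

On these assumed rows, splitting on `fever` yields a gain of 1.0 bit against 0.5 for `cough`, so `fever` becomes the root and `cough` is only used under the `fever = no` branch.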
ID3 is a non-incremental algorithm, meaning it derives
its classes from a fixed set of training instances. An
incremental algorithm, by contrast, revises the current concept
definition, if necessary, with each new sample. The classes
created by ID3 are inductive; that is, given a small set of
training instances, the specific classes created by ID3 are
expected to work for all future instances. The distribution of
the unknowns must be the same as that of the test cases.
Inductive classes cannot be proven to work in every case,
since they may classify an infinite number of instances. Note
that ID3 (or any inductive algorithm) may misclassify data.
Data mining is gaining popularity in almost all real-world
applications. One data mining technique, classification, is of
interest to researchers because it accurately and efficiently
classifies data for knowledge discovery. Decision trees are
popular because they produce human-readable classification
rules and are easier to interpret than other classification
methods. Frequently used decision tree classifiers are studied,
and experiments are conducted to find the best classifier for
medical diagnosis. The experimental results show that CART is
the best algorithm for identifying the frequently occurring
disease from the available medical data. It also performs well
for classification on medical data sets of increased size.
[3] K-means clustering Algorithm


K-means is an unsupervised learning algorithm that solves the
well-known clustering problem. The procedure follows a simple
and easy way to classify a given data set through a certain
number of clusters (assume k clusters) fixed a priori.
The main idea is to define k centers, one for each cluster.
These centers should be placed carefully, because different
locations cause different results. So, the better choice is to
place them as far away from each other as possible.
The next step is to take each point belonging to the data set
and associate it with the nearest center. When no point is
pending, the first step is completed and an early grouping is
done. At this point we need to re-calculate k new centroids as
barycenters of the clusters resulting from the previous step.
After we have these k new centroids, a new binding has to be
done between the same data set points and the nearest new
center. A loop has been generated. As a result of this loop,
the k centers change their location step by step until no more
changes are made; in other words, the centers do not move any
more. Finally, this algorithm aims at minimizing an objective
function known as the squared error function.
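
The assignment and update loop described above can be sketched in plain Python. This is a minimal illustration, assuming that points are numeric tuples and that the caller supplies the initial centers; the names `kmeans` and `squared_error` and the sample points are hypothetical.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: bind each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the barycenter of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:     # centers no longer move
            break
        centers = new_centers
    return centers, clusters


def squared_error(centers, clusters):
    """Sum of squared Euclidean distances from points to their cluster center."""
    return sum(math.dist(p, c) ** 2
               for c, cluster in zip(centers, clusters)
               for p in cluster)


# Hypothetical two-feature records, with initial centers placed far apart.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(POINTS, [(1.0, 1.0), (9.0, 8.0)])
print(centers, squared_error(centers, clusters))
```

Taking the initial centers as an argument keeps the run deterministic; the result of K-means depends on that choice, which is why the text recommends placing them far from each other.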

The squared error function is:

$$J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2$$

where,
$\lVert x_i - v_j \rVert$ is the Euclidean distance between $x_i$ and $v_j$.
$c_i$ is the number of data points in the $i$th cluster.
$c$ is the number of cluster centers.

The medical diseases were successfully clustered and the
frequently occurring disease per cluster was computed.
Occurrences of diseases that were far from the average
were flagged for further scrutiny. Hence the prototype can be
used to isolate and flag suspicious claims that can
subsequently be rechecked. This prototype can greatly increase
the medical claim fraud detection rate, which in turn will
yield savings that cover operational costs and allow the
quality of health care coverage to be increased, fully
justifying the investment.

3. COMPARATIVE ANALYSIS

The above three algorithms are compared on their time
complexity, space complexity, accuracy, data set size and the
type of data they work on. The comparison is given in Table 1
below.

Table 1: Comparative Analysis of Results

| Parameter | K-Means Algorithm | ID3 Algorithm | Apriori Algorithm |
|---|---|---|---|
| Principle | Based on the principle of clustering | Based on the principle of decision trees | Based on the principle of association rules |
| Time complexity | O(I*k*m*n) | O(m n^2) | O[MN + (R^1 + R^2 + ... + R^M)] = O(MN + (1 - R^M)/(1 - R)) |
| Space complexity | O((m+k)n) | O(n) | O(R^i) |
| Efficiency and implementation | Relatively efficient and easy to implement | Easy and comprehensible to implement | Relatively efficient and easy to implement |
| Sum of squared error (SSE) | Tries to minimize the sum of squared errors (SSE) | Not applicable | Not applicable |
| Works on | Unlabeled data | Labeled data | Unlabeled data |

4. CONCLUSION

Data mining has come to prominence over the last two
decades as a discipline in its own right, offering benefits
in many domains, both commercial and academic. Broadly, data
mining can be viewed as an application domain, as opposed to a
technology. The increasing ability of institutions to collect
electronic data, facilitated by advances in computer
processing, means that the desire to "mine" data is likely to
expand. The data mining community has a well-established set
of techniques available, which we are seeking to apply to an
ever greater variety of data. The driver for research in data
mining is the ever-increasing size of the data we wish to work
with. We are therefore also interested in techniques to mine
larger data sets (and an ever greater variety of data).
The algorithms described in this paper can be enhanced by
incorporating many more parameters for disease identification
and prediction, such as age, sex and geographical area.
The outcome of the study is that the algorithms can be
efficiently used to discover hidden patterns and generate rules
and trends from datasets.
The system can be used by researchers to predict future
diseases. The graphical representation helps in better
understanding the available statistics. The algorithms can
be further enhanced by incorporating many more parameters and
by creating a new hybrid algorithm that is more feasible for
the given environment.

5. REFERENCES

[1] J. Woods and S. O'Neil, Subband coding of Images, IEEE
Trans. on Acoustic Speech Signal Processing, Vol. 1, No. 3,
pp. 1278-1288, 1986.

[2] S. T. Haiang and J. W. Woods, Embedded image coding
using zero blocks of subband/wavelet coefficients and context
modeling, IEEE Int. Conf. on Circuits and Systems
(ISCA2000), vol. 3, pp. 662-665, Los Angeles, CA, May 2000.

[3] B. Klaus and P. Horn, Robot vision, Cambridge, MA,
MIT Press, 1986.

[4] Darwin, C.R., The Expression of emotion in man and
animals, John Murray publishers, 1st Edition, London, pp. 8,
1978.

[5] H. Freitag, Design methodologies for LSI circuitry, IBM
Tech, Place, Rep TR41736, pp. 80-82, 1983.

[6] International Journal of Computer Science, Engineering
and Information Technology (IJCSEIT), Vol. 2, No. 3, June
2012.

[7] Gitanjali J, C. Ranichandra, M. Pounambal, School of
Information Technology and Engineering, VIT University,
Vellore-632014, Tamil Nadu, India.

[8] Smitha T. and V. Sundaram, Association Models for
Prediction with Apriori Concept, International Journal of
Advances in Engineering & Technology, Nov. 2012, Vol. 5,
Issue 1, pp. 354-360.

[9] P. Kasemthaweesab and W. Kurutach, Association Analysis
of Diabetes Mellitus (DM) With Complication States Based
on Association Rules, 7th IEEE Conference on Industrial
Electronics and Applications (ICIEA), 2012.

[10] M. Ilayaraja and T. Meyyappan, Mining Medical Data to
Identify Frequent Diseases using Apriori Algorithm, In:
Proceedings of the 2013 International Conference on Pattern
Recognition, Informatics and Mobile Engineering (PRIME),
21-22 February.
