
Application of Data Mining Techniques in

Healthcare

Project report submitted in partial fulfillment of the requirement for the award of the degree of

MASTER OF BUSINESS
ADMINISTRATION
By

ABHINAV BHORKAR
(Regd.No.09451)

SCHOOL OF BUSINESS MANAGEMENT

SRI SATHYA SAI INSTITUTE OF HIGHER LEARNING


Sri Sathya Sai Institute of Higher Learning
(Deemed to be University)
Accredited by NAAC at A++ Level
Established Under Section 3 of the UGC Act, 1956
Vidyagiri, Prashanti Nilayam – 515134
Anantapur District, Andhra Pradesh, India

SCHOOL OF BUSINESS MANAGEMENT

CERTIFICATE AND DECLARATION


This project entitled “Application of Data Mining in Healthcare” is an
original work done by me under the supervision of Dr. Ramaier Sriram,
School of Business Management, Sri Sathya Sai Institute of
Higher Learning, Prasanthi Nilayam, in partial fulfillment of the
requirements for the award of the degree of Master of Business
Administration of this University, and has not formed the basis for
the award of any degree, diploma or any other such title by this or any
other university.

……………………………                         ……………………………
Dr. Ramaier Sriram                  Abhinav Bhorkar
Project Guide                       Regd. No. 09451

Prashanthi Nilayam                  ……………………………
20th December 2010                  A. Sudhir Bhaskar
                                    Professor and Dean
                                    School of Business Management
ACKNOWLEDGEMENTS
I express my deep and sincere gratitude to my Lord and Master, Bhagawan Sri Sathya Sai
Baba whose Divine grace made this work possible.

I thank my guide Prof. Ramaier Sriram who gave me all the support, confidence and guidance
at all stages of the study.

I thank Dr. Subramanian S. who helped me during the analysis of the study.

I thank Shri Renju Raghuveeran, Shri Piyush Shrivastava and Shri Prakash Chittranjan, whose
tireless efforts have enabled us to use one of the best computer centers.

I am also thankful to the Ganesh Xerox Centre for their services rendered to me at all times.

I thank all my classmates for their support. I also thank Sri G. V. R. Subbarao from SSSIHMS,
Whitefield, Bangalore, for providing me with all the necessary data.

Last, but not the least, I am immensely grateful to my family for their guidance, immense love
and blessings.

ABHINAV BHORKAR

Application of Data Mining in Healthcare Page i


ABSTRACT

Healthcare organizations are rich in clinical data, but limited access to the knowledge hidden in
that data restricts their ability to provide quality healthcare services and satisfy their patients.
They therefore need to analyze their data to find meaningful information that will help them
provide better healthcare services.

An enormous amount of data is recorded every second, which can be mined for extracting
interesting relationships that are very important for organizations. This study showcases the
use of data mining tools to discover the hidden relationships.

The software used for this purpose is XLMiner. The techniques of classification and clustering
are illustrated by applying them to the healthcare dataset.



CHAPTER – 1

INTRODUCTION

There is a wealth of data available within organizations, but the data's hidden value, the
relationships between its various attributes, often remains undetermined. Collecting and storing
large amounts of data is a waste of resources unless the data is used productively; otherwise the
organization is merely increasing overheads. Valuable knowledge can be discovered by applying
data mining techniques to healthcare systems, and business analysts can use the discovered
knowledge to improve the quality of service. The challenge is to extract relevant knowledge
from this data and act upon it in a timely manner.
The typical data mining process involves transferring data originally collected in production
systems into a data warehouse, cleaning or scrubbing the data to remove errors and check for
consistency of formats, and then searching the data using statistical queries, neural networks,
etc. (Jonathan C. Prather et al., 1997). With the increasing availability of healthcare data, there
is an increasing need for methods that can deal with missing data, allow integration of data from
various sources, and discover relationships between the various attributes of the dataset.

1.1 STATEMENT OF THE PROBLEM


To assess the incidence of heart-related ailments across age, gender, state and district
demographics, and to improve healthcare provision.

1.2 OBJECTIVE OF THE STUDY

To analyze a tertiary healthcare dataset in order to find relevant knowledge and enhance the
understanding of disease progression and management. Many people do not have access to
healthcare services; this study discusses the feasibility of extending healthcare services to those
who need them most.



1.3 SCOPE OF THE STUDY

This study focuses on a clinical and demographic dataset obtained from the Sri Sathya Sai
Institute of Higher Medical Sciences (SSSIHMS), Whitefield, Bangalore.

1.4 SCHEME OF CHAPTERISATION

Chapter 1: Introduction: gives an overview of data mining and its support in healthcare. It also
includes the statement of the problem and the objective and scope of the study.
Chapter 2: Literature Review: sheds light on the literature in the area of data mining, its
application in various fields including healthcare and CRM, data mining software, and its
challenges.
Chapter 3: Theoretical Background: gives an understanding of the basics of data mining, its
developmental history, technological elements, and different methods and their application in
healthcare.
Chapter 4: Methodology and Design: explains the methodology adopted for the study. It
includes the nature and objectives of the study, the collection and treatment of data, and the
limitations of the study.
Chapter 5: Data Analysis: the data mining techniques of classification and clustering are
applied to the healthcare dataset, and models are built that will be useful to healthcare
administrators for efficient decision making.
Chapter 6: Summary and Conclusions: gives a brief sketch of the whole study and presents
conclusions from the findings.



CHAPTER – 2

LITERATURE REVIEW

Clinical trials, electronic patient records and computer-supported disease management produce
huge amounts of clinical data. This data is a strategic resource for healthcare providers. The Sri
Sathya Sai Institute of Higher Medical Sciences is not merely a modern multi-super-specialty
hospital with the best medical and surgical care: the total cost of treatment is borne by the
hospital, and not a penny is charged to the patient. To achieve this noble objective, several
decisions need to be taken to cure patients not merely in body but in mind and spirit as well. Is
there really a need for intelligence techniques in such a non-profit organization? Yes: data
mining not only helps increase profits but also helps in identifying different patterns to support
various decisions. This chapter reviews the applications of data mining in healthcare for both
inpatients and outpatients.

Data mining and forecasting models are quite different. There are four major tools of data
mining, namely prediction, classification, clustering analysis and association rules discovery
(Keating, Barry 2008). Data mining uses well-established statistical and machine learning
techniques to build models that predict customer behavior for value creation. Value creation can
be achieved using campaign management technologies: first, identify market segments
containing customers or prospects with high profit potential; second, build and execute
campaigns that favorably impact the behavior of these individuals. Records can be scored when
campaigns are ready to run, allowing the use of the most recent data. "Fresh" data and the
selection of "high" scores within defined market segments improve direct marketing results, and
also lead to increased accuracy through the elimination of manually induced errors. The
campaign management software determines which records to score and when. It also helps
statisticians, because less time is spent on the mundane tasks of extracting and importing files,
leaving more time for the creative work of building and interpreting models (Kurt Thearling,
February 1999). Scalability helps greatly in building more creative models, since it allows a
good data mining model to be built as quickly as possible: by taking advantage of parallel
database management systems and




additional CPUs, you can solve a wide range of problems without needing to change your
underlying data mining environment. You can work with more data, build more models, and
improve their accuracy by simply adding additional CPUs. For example, if in a few minutes
analysts can create a group of models and graph their forecasts, they can pick from a large
number of model possibilities. If they have to wait overnight for each model, they will take steps
to reduce the number of models investigated, such as studying smaller sampled databases and
including fewer columns. Not surprisingly, their results will be different and quite likely not as
useful (Robert D. Small; Herbert A. Edelstein, 1997).

The main task of data mining models is to discover knowledge from patterns by continuously
monitoring the data with the help of algorithms. According to (Hand and Bolton 2004),
the data mining models have been made to focus on the algorithmic aspects of pattern discovery;
however, another fundamental part of data mining is concerned with detecting anomalies
amongst the vast mass of the data like the small deviations, unusual observations, unexpected
clusters of observations, or surprising blips in the data, which the model does not explain. Both
aspects can be covered by providing a theoretical base, linking the ideas to statistical work in
scan statistics, outlier detection, and other areas. One of the striking characteristics of work on
pattern discovery is that the ideas have been developed in several theoretical arenas, and also in
several application domains, with little apparent awareness of the fundamentally common nature
of the problem.

Data mining models include various data mining techniques for knowledge discovery from
huge databases. IBM (1996), in a white paper, presented a variety of data mining techniques and
explained how they can be used independently and cooperatively to extract high-quality
information from databases. It also explained how decision-makers can take advantage of
high-return opportunities in a timely fashion: they must be able to identify and utilize the
information hidden in the collected data. Data mining models function at their best when the
analyst has a complete
view of the targeted patient population and the providers with whom they work. This view
demands that all the data relating to the patient be gathered in one place. Once the healthcare
provider stores data at one location, it can make better use of predictive modeling and data
mining to more accurately identify patients, understand their needs, understand likely patient




cooperation levels and predict likely results. Then it can make better decisions about which
programs to pursue (diabetes, asthma or congestive heart failure) and how to work with
individual providers and patients in the targeted population (Rose Cintron 2004).
Data mining aids in the analysis of large databases and discovery of trends, patterns and
knowledge. Many applications in engineering design, manufacturing and logistics engineering
now incorporate data mining functions. Valuable knowledge can be discovered from application
of data mining techniques in the healthcare arena, viz. hospital infection control, ranking
hospitals, identifying high-risk patients, heart attack prediction, etc. Hospital infection control
involves computer-assisted surveillance research, which focuses on identifying high-risk
patients and possible cases through expert systems, and on detecting deviations in the
occurrence of predefined events.
Hospitals and their healthcare plans can be ranked based on information reported by them. Data
mining techniques have been implemented to examine reporting practices used by hospitals.
Data mining techniques help in identifying patients who are tending toward a high-risk
condition. This information gives nurse care coordinators a head start in identifying high-risk
patients so that steps can be taken to improve the patients’ quality of healthcare and to prevent
health problems in the future (Mary K. Obenshain 2004). According to (Michael Silver et al.
2001), Data mining is used to make business decisions that can influence cost, revenue, and
operational efficiency while maintaining a high level of care. Typical problems that data mining
addresses are how to classify data, cluster data, find associations between data items, and
perform time series analysis. The analysis deals with various attributes such as age, financial
status, etc.; the financial status, for instance, shows a clear picture of losses. The main challenge
in the healthcare industry is to make quick decisions during the initial diagnosis. According to (Varun
Kumar et al. 2008), outlier detection & analysis have a great potential to find useful information
from health care databases which consequently helps decision makers to automate & quicken the
process of decision making in clinical diagnosis as well as other domains of health care
management. They applied data mining techniques to breast cancer health care databases so that
useful information in the form of novel & hidden knowledge patterns can be generated to
improve public health. They found that data mining techniques help uncover the valuable
knowledge hidden in the data and aid decision makers in improving healthcare services. Using
medical profiles such as age, sex, blood pressure and blood sugar, data mining can predict
the likelihood of patients getting a heart disease. Data mining can be used for effective heart




attack prediction. Firstly, they have provided an efficient approach for the extraction of
significant patterns from the heart disease data warehouses. Then, for the efficient prediction of
heart attack based on the calculated significant weightage, the frequent patterns having value
greater than a predefined threshold were chosen for the valuable prediction of heart attack. They
defined five mining goals based on business intelligence and data exploration. The goals were
evaluated against the trained models. All these models could answer complex queries in
predicting heart attack (K.Srinivas et al. 2010).

The information flow in the pharmaceutical industry was traditionally relatively simple, and the
application of technology was limited. Today, technology is increasingly being used to help
pharmaceutical firms manage their inventories and develop new products and services. It helps
pharmacy firms compete on lower costs while improving the quality of drug discovery and
delivery methods. A deep understanding of the knowledge hidden in the pharmacy data is vital to a firm’s
competitive position and organizational decision-making (Jayanthi Ranjan 2007). According to
(Anny Miley 2000), Data mining techniques can improve treatment procedure, increase patient
satisfaction and wipe out over prescription, bad practice and fraud. Data mining methods can
save time and money by making possible efficient deployment of medication, effective
treatment, speedy diagnosis and preventive measures. The data collected from patients will
make them happy by answering questions which they feel are for their own benefit. Questions
such as how long they had to wait to see a doctor, how effective their treatment was, and how
happy they were with the service provided will generate good, reliable data. Analyzing this data
will help patients set reasonable expectations about waiting times and benefit from quality
service.
Now, the world is changing and all organizations are keeping pace with the new technologies.
Healthcare providers increasingly are able to mine patient data to help choose the best course of
treatment, and various software solutions are allowing them to fine-tune their claims process.
According to (Judith Lamont 2010), the growing availability of electronic medical records will
lead to increased evidence-based medicine and smarter healthcare. To achieve the highest rank
for patient satisfaction, organizations adopt data-driven decision-making combined with their
physicians’ skills, which enhances the organization’s image and develops trust among the
people. A rising number of claim denials, however, is a concern for healthcare providers.




They can use centralized mapping method in which all denial codes returned from payers are
mapped to a proprietary, centrally administered code set. When that code comes back in the
future, the mapping documentation contains a description of the issue and the steps a user can
take to fix the problem. This centralized mapping allows the organization to see a normalized
view of denials.

Healthcare provider organizations are faced with rising financial pressures while analyzing
large volumes of clinical and financial data. A large clinical database could be
successfully warehoused and mined to identify clinical factors associated with preterm birth,
even on a preliminary dataset. Newly discovered relationships found in clinical databases such as
these will potentially lead to better understanding between observations and outcomes in
providing prenatal care and other fields of medicine (Jonathan C. Prather et al., 1997). In this
article they created a dataset for analysis by extracting and cleaning selected variables and mined
data using exploratory factor analysis. The dream of data mining from an integrated clinically
oriented warehouse has yet to become a reality. Homer Warner (2001) argues that one of the
main reasons for failure in using data mining in healthcare is that the available data is too
complex for ad hoc queries from outside the IT department, and that key data is missing from
the reporting database or held in an unnormalized form. The most significant warehouse costs will be in
collecting, connecting and normalizing data and funneling that data to the warehouse.
Warehousing will need both commitment and direction from business and physician leaders to
succeed.
Very large datasets typically consist of millions of records, with many variables. Such datasets
are collected from both inpatients and outpatients and maintained by organizations because of
the perceived potential information they contain. However, the problem with very large datasets is
that traditional methods of data mining are not capable of retrieving this information because the
software may be overwhelmed by the memory or computing requirements. According to (David
Sibbritt et al. 2004), there is a method that can analyze very large datasets. The method initially
performs a data reduction step through the use of a summary table, which is then used as a
reference dataset that is recursively partitioned to grow a decision tree. When looking at a
decision tree, it may be misleading to assume that only those variables used to create the tree
(split variables) are the only important predictors of the outcome variable. At a given node, there




may have been one or more variables that created subsets that were almost as homogeneous as
the chosen variable. The method proposed in their article can not only be used for model
building purposes, but also for the additional purpose of reducing the dimension of the original
dataset, by indicating which variables can be ignored, so that traditional analysis methods can
then be used.



CHAPTER – 3

THEORETICAL BACKGROUND

3.1 INTRODUCTION
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviours, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can answer
business questions that traditionally were very time consuming to resolve. The data mining
techniques and tools are equally applicable in fields ranging from law enforcement to radio
astronomy, medicine, and industrial process control. In fact, hardly any of the data mining
algorithms were first invented with commercial applications in mind. The commercial data miner
employs a wide range of techniques borrowed from statistics, computer science, and machine
learning research. The choice of a particular combination of techniques to apply in a particular
situation depends on the nature of the data mining task, the nature of the available data, and the
skills and preferences of the data miner. Data mining comes in two flavors—directed and
undirected. Directed data mining attempts to explain or categorize some particular target field
such as income or response. Undirected data mining attempts to find patterns or similarities
among groups of records without the use of a particular target field or collection of predefined
classes.

3.2 ELEMENTS OF DATA MINING


Data mining can be regarded both as a single step in the knowledge discovery process and as a
general term for the larger process of knowledge discovery. The knowledge discovery process
includes the following steps:
Step 1: Task Discovery
The goals of the data mining operation must be well understood before the process begins. The
analyst must know what the problem to be solved is and what the questions that need answers



are. Typically, a subject specialist works with the data analyst to refine the problem to be solved
as part of the task discovery step.
Step 2: Data Discovery
In this stage, the analyst and the end user determine what data they need to analyze in order to
answer their questions, and then they explore the available data to see if what they need is
available.
Step 3: Data Selection and Cleaning
Once data has been selected, it will need to be cleaned up: missing values must be handled in a
consistent way such as eliminating incomplete records, manually filling them in, entering a
constant for each missing value, or estimating a value. Other data records may be complete but
wrong (noisy). These noisy elements must be handled in a consistent way.
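The missing-value strategies above can be sketched in a few lines of Python; the records and values below are invented for illustration:

```python
# Hypothetical illustration of Step 3: handling missing values consistently.
# Each record is (age, blood_pressure); None marks a missing entry.
records = [(45, 130), (52, None), (61, 140), (None, 125)]

# Strategy 1: eliminate incomplete records.
complete = [r for r in records if None not in r]

# Strategy 2: estimate a value -- fill missing blood pressure with the
# mean of the observed blood pressures.
observed_bp = [bp for _, bp in records if bp is not None]
mean_bp = sum(observed_bp) / len(observed_bp)
imputed = [(age, bp if bp is not None else mean_bp) for age, bp in records]

print(complete)  # [(45, 130), (61, 140)]
print(imputed)
```

Whichever strategy is chosen, the key point of this step is that it is applied uniformly across the whole dataset.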
Step 4: Data Transformation
Next, the data will be transformed into a form appropriate for mining. The process of data
transformation might include smoothing (For example, using bin means to replace data errors),
aggregation (For example, viewing monthly data rather than daily), generalization (For example,
defining people as young, middle-aged, or old instead of by their exact age), normalization
(scaling the data inside a fixed range), and attribute construction.
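The generalization and normalization transformations can be illustrated with a small Python sketch; the ages and the cut-off values are hypothetical:

```python
# Hypothetical Step 4 transformations on a small list of patient ages.
ages = [23, 37, 58, 71, 45]

# Generalization: replace exact ages with coarse categories
# (the cut-offs 35 and 60 are invented for illustration).
def age_group(age):
    if age < 35:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "old"

groups = [age_group(a) for a in ages]

# Normalization: min-max scaling into the fixed range [0, 1].
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]

print(groups)
print(scaled)
```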
Step 5: Data Reduction
The data will probably need to be reduced in order to make the analysis process manageable and
cost-efficient. Data reduction techniques include data cube aggregation, dimension reduction
(irrelevant or redundant attributes are removed), data compression (data is encoded to reduce its
size), numerosity reduction (models or samples are used instead of the actual data), discretization
and concept hierarchy generation.
Step 6: Discovering Patterns (Data Mining)
In this stage, the data is iteratively run through the data mining algorithms in an effort to find
interesting and useful patterns or relationships. Often, classification and clustering algorithms are
used first so that association rules can be applied. Some rules yield patterns that are more
interesting than others. This “interestingness” is one of the measures used to determine the
effectiveness of the particular algorithm. A pattern can be considered knowledge if it exceeds an
interestingness threshold. That threshold is defined by the user, is domain specific, and “is
determined by whatever functions and thresholds the user chooses”.



Step 7: Result Interpretation and Visualization
It is important that the output from the data mining step can be “readily absorbed and accepted
by the people who will use the results”. Tools from computer graphics and graphics design are
used to present and visualize the mined output.
Step 8: Putting the Knowledge to Use
Finally, the end user must make use of the output. In addition to solving the original problem, the
new knowledge can also be incorporated into new models, and the entire knowledge or data
mining cycle can begin again.

3.3 DATA MINING TASKS


There are six main data mining activities:
• Classification
• Estimation
• Prediction
• Affinity Grouping or Association Rules
• Clustering
• Profiling

3.3.1 Classification
In order to understand and communicate about the world, we are constantly classifying,
categorizing, and grading. Classification consists of examining the features of a newly presented
object and assigning it to one of a predefined set of classes. The objects to be classified are
generally represented by records in a database table or a file, and the act of classification consists
of adding a new column with a class code of some kind. The classification task is characterized
by a well-defined definition of the classes, and a training set consisting of preclassified
examples. The task is to build a model of some kind that can be applied to unclassified data in
order to classify it.
Examples of classification tasks include:
• Classifying credit applicants as low, medium, or high risk
• Determining which phone numbers correspond to fax machines



• Spotting fraudulent insurance claims.

Decision trees and nearest neighbour techniques are well suited to classification. Neural
networks and link analysis are also useful in certain circumstances.
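As a minimal sketch of the classification task, a 1-nearest-neighbour classifier in Python; the training records, attributes and risk labels are invented for illustration:

```python
import math

# Hypothetical 1-nearest-neighbour classifier: assign a new record to the
# class of its closest preclassified training example. Records are
# (age, systolic blood pressure) pairs; the labels are invented.
train = [
    ((25, 120), "low risk"),
    ((40, 135), "medium risk"),
    ((60, 160), "high risk"),
]

def classify(record):
    # Find the training example at minimum Euclidean distance.
    nearest = min(train, key=lambda ex: math.dist(ex[0], record))
    return nearest[1]

print(classify((58, 155)))  # "high risk"
```

The preclassified examples play the role of the training set described above; classifying a record amounts to adding a class code column to it.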

3.3.2 Estimation
Classification deals with discrete outcomes: yes or no; measles, rubella, or chicken pox.
Estimation deals with continuously valued outcomes. Given some input data, estimation comes
up with a value for some unknown continuous variable such as income, height, or credit card
balance.
Examples of estimation tasks include:
• Estimating a family's total household income
• Estimating the lifetime value of a customer

Regression models and neural networks are well suited to estimation tasks. Survival analysis is
well suited to estimation tasks where the goal is to estimate the time to an event, such as a
customer stopping.
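A simple regression model for an estimation task can be sketched as follows; the figures are invented for illustration:

```python
# Hypothetical estimation task: least-squares fit of a single predictor,
# estimating income (in thousands) from years of employment.
xs = [1, 2, 3, 4, 5]
ys = [30, 35, 42, 48, 55]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def estimate(x):
    # Return a continuously valued output, unlike a discrete class label.
    return intercept + slope * x

print(estimate(6))  # roughly 60.9
```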

3.3.3 Prediction
Prediction is the same as classification or estimation, except that the records are classified
according to some predicted future behaviour or estimated future value. In a prediction task, the
only way to check the accuracy of the classification is to wait and see. The primary reason for
treating prediction as a separate task from classification and estimation is that in predictive modeling
there are additional issues regarding the temporal relationship of the input variables or predictors
to the target variable.
Any of the techniques used for classification and estimation can be adapted for use in prediction
by using training examples where the value of the variable to be predicted is already known,
along with historical data for those examples. The historical data is used to build a model that
explains the current observed behaviour. When this model is applied to current inputs, the result
is a prediction of future behaviour.
Examples of prediction tasks include:

• Predicting the size of the balance that will be transferred if a credit card prospect



accepts a balance transfer offer.
• Predicting which customers will leave within the next 6 months.

• Predicting which telephone subscribers will order a value-added service such as
three-way calling or voice mail.
The choice of technique depends on the nature of the input data, the type of value to be
predicted, and the importance attached to explicability of the prediction.

3.3.4 Affinity Grouping or Association Rules

Association Rule Mining involves searching for interesting relationships between items in a data
set. Market basket analysis is a good example of this model. An example of an association rule is
“Customers who buy computers tend to also buy financial software”. Since association rules are
not always interesting or useful, constraints are applied which specify the type of knowledge to
be mined such as specific dates of interest, thresholds on statistical measures (rule
interestingness, support, confidence), or other rules applied by end users (Han & Kamber, 2001).
Examples of association rules include:

• Retail chains plan the arrangement of items on store shelves or in a catalog so that items
often purchased together will be seen together.

• To identify cross selling opportunities and for groupings of product and services.
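The support and confidence measures mentioned above can be computed directly; a Python sketch over a hypothetical transaction list:

```python
# Hypothetical market-basket sketch: support and confidence for the rule
# "computer -> financial software" over a small list of transactions.
transactions = [
    {"computer", "financial software"},
    {"computer", "printer"},
    {"computer", "financial software", "printer"},
    {"printer"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, the fraction that
    # also contain the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"computer", "financial software"}))       # 0.5
print(confidence({"computer"}, {"financial software"}))  # about 0.67
```

Thresholds on these two measures are exactly the constraints that separate interesting rules from trivial ones.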

3.3.5 Clustering
Clustering divides a database into different groups. The goal of clustering is to find groups that
are very different from each other, and whose members are very similar to each other. Unlike
classification (see Predictive Data Mining, below), you don’t know what the clusters will be
when you start, or by which attributes the data will be clustered. Consequently, someone who is
knowledgeable in the business must interpret the clusters.The clusters can be mutually exclusive,
hierarchical or overlapping. Each member of a cluster should be very similar to other members in
its cluster and dissimilar to other clusters. Clustering is the task of segmenting a heterogeneous
population into a number of more homogeneous subgroups of clusters. What distinguishes
clustering from classification is that clustering does not rely on predefined classes.



In clustering, there are no predefined classes and no examples. The records are grouped together
on the basis of self-similarity. It is up to the user to determine what meaning, if any, to attach to
the resulting clusters. Clusters of symptoms might indicate different diseases. This is one more
method used for finding meaningful patterns in the existing data. In XLMiner, K-means
clustering was used to perform clustering. A non-hierarchical approach to forming good clusters
is to specify a desired number of clusters, say, k, then assign each case (row) to one of the k
clusters so as to minimize a measure of dispersion within the clusters. A very common measure
is the sum of distances or sum of squared Euclidean distances from the mean of each cluster. The
problem can be set up as an integer programming problem but because solving integer programs
with a large number of variables is time consuming, clusters are often computed using a fast,
heuristic method that generally produces good (but not necessarily optimal) solutions. The
k-means algorithm is one such method.
K-Means Training starts with a single cluster with its centre as the mean of the data. This cluster
is split into two and the means of the new clusters are iteratively trained. These two clusters are
again split and the process continues until the specified number of clusters is obtained. If the
specified number of clusters is not a power of two, then the nearest power of two above the
number specified is chosen and then the least important clusters are removed and the remaining
clusters are again iteratively trained to get the final clusters.

3.3.6 Profiling
Sometimes the purpose of data mining is simply to describe what is going on in a complicated
database in a way that increases our understanding of the people, products, or processes that
produced the data in the first place. A good enough description of behaviour will often suggest
an explanation for it as well. Decision trees are a powerful tool for profiling customers (or
anything else) with respect to a particular target or outcome. Association rules and clustering can
also be used to build profiles.

3.4 DATA MINING TECHNIQUES

3.4.1 DECISION TREES

A decision tree is a predictive model that, as its name implies, can be viewed as a tree.
Specifically each branch of the tree is a classification question and the leaves of the tree are
partitions of the dataset with their classification. From a business perspective decision trees can
be viewed as creating a segmentation of the original dataset (each segment would be one of the
leaves of the tree). In this case the segmentation is done for a particular reason - namely for the
prediction of some important piece of information. The records that fall within each segment do so
because they are similar with respect to the information being predicted, not merely similar in
some vague, ill-defined sense. These predictive segments that are
derived from the decision tree also come with a description of the characteristics that define the
predictive segment.

Decision trees are a data mining technology that has been around in a form very similar to
today's for almost twenty years, and early versions of the algorithms date back
to the 1960s. Oftentimes these techniques were originally developed for statisticians, to
automate the process of determining which fields in their database were actually useful or
correlated with the particular problem they were trying to understand. Because decision trees
score so highly on so many of the critical features of data mining they can be used in a wide
variety of business problems for both exploration and for prediction. They have been used for
problems ranging from credit card attrition prediction to time series prediction of the exchange
rate of different international currencies. There are also some problems where decision trees will
not do as well.

Another way that the decision tree technology has been used is for preprocessing data for other
prediction algorithms. Because the algorithm is fairly robust with respect to a variety of
predictor types (e.g. number, categorical etc.) and because it can be run relatively quickly
decision trees can be used on the first pass of a data mining run to create a subset of possibly
useful predictors that can then be fed into neural networks, nearest neighbor and normal
statistical routines - which can take a considerable amount of time to run if there are large
numbers of possible predictors to be used in the model. Although some forms of decision trees
were initially developed as exploratory tools to refine and preprocess data for more standard
statistical techniques like logistic regression, they are also increasingly being used for
prediction. This is interesting because many statisticians will still use decision trees for
exploratory analysis, effectively building a predictive model as a byproduct, but then ignore
that model in favor of the techniques they are most comfortable with. Sometimes veteran analysts
will do this even when the discarded predictive model is superior to that produced by other
techniques. With a host of new products and skilled users now appearing, this tendency to use
decision trees only for exploration now seems to be changing.

A decision tree represents a series of questions. The answer to the first question determines the
follow-up question. The initial questions create broad categories with many members; follow-on
questions divide the broad categories into smaller and smaller sets. If the questions are well
chosen, a surprisingly short series is enough to accurately classify an incoming record. A record
enters the tree at the root node. The root node applies a test to determine which child node the
record will encounter next. There are different algorithms for choosing the initial test, but the
goal is always the same: to choose the test that best discriminates among the target classes. This
process is repeated until the record arrives at a leaf node. All the records that end up at a given
leaf of the tree are classified in the same way. There is a unique path from the root to each leaf.
That path is an expression of the rule used to classify the records.
In the process the proportion of the records in the desired class can be used as a score, which is
more useful than just the classification. For a binary outcome, a classification merely splits
records into two groups. A score allows the records to be sorted from most likely to least likely
to be members of the desired class. For many applications, a score capable of rank-ordering a
list is all that is required. This is sufficient to choose the top N percent for a mailing and to
calculate lift at various depths in the list.
Suppose the important business question is not who will respond but what will be the size of the
customer's next order? The decision tree can be used to answer that question too. Assuming that
order amount is one of the variables available in the pre-classified model set, the average order
size in each leaf can be used as the estimated order size for any unclassified record that meets the
criteria for that leaf. It is even possible to use a numeric target variable to build the tree; such a
tree is called a regression tree. Instead of increasing the purity of a categorical variable, each split
in the tree is chosen to decrease the variance in the values of the target variable within each child
node.
Since any multi-way split can be expressed as a series of binary splits, there is no real need for
trees with higher branching factors. Nevertheless, many data mining tools are capable of
producing trees with more than two branches. For example, some decision tree algorithms split
on categorical variables by creating a branch for each class, leading to trees with differing
numbers of branches at different nodes.
This splitting happens in such a way that each new generation of nodes has greater purity than its
ancestors with respect to the target variable. At the start of the process, there is a training set
consisting of pre-classified records; that is, the value of the target variable is known for all cases.
The goal is to build a tree that assigns a class (or a likelihood of membership in each class) to the
target field of a new record based on the values of the input variables. The tree is built by
splitting the records at each node according to a function of a single input field. The first task,
therefore, is to decide which of the input fields makes the best split. The best split is defined as
one that does the best job of separating the records into groups where a single class predominates
in each group.
The measure used to evaluate a potential split is purity. Low purity means that the set contains a
representative distribution of classes (relative to the parent node), while high purity means that
members of a single class predominate. The best split is the one that increases the purity of the
record sets by the greatest amount. A good split also creates nodes of similar size, or at least does
not create nodes containing very few records.
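The purity-based search for the best split can be sketched as follows, using the Gini impurity as the purity measure (a common choice; the field values and class labels below are hypothetical):

```python
# Choose the best binary split on one numeric field by the increase in
# purity, i.e. the reduction in Gini impurity. Illustrative data only.
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    """rows: (field_value, class_label) pairs; returns (gain, threshold)."""
    parent = gini([c for _, c in rows])
    best = None
    for threshold in sorted({v for v, _ in rows}):
        left = [c for v, c in rows if v <= threshold]
        right = [c for v, c in rows if v > threshold]
        if not left or not right:
            continue  # a split must create two non-empty children
        # size-weighted impurity of the two children; lower is purer
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or parent - child > best[0]:
            best = (parent - child, threshold)
    return best

gain, threshold = best_split([(22, "no"), (25, "no"), (40, "yes"), (55, "yes")])
# splitting at value <= 25 separates the two classes perfectly
```

The split with the largest impurity reduction is chosen; the non-empty-children check mirrors the point above that a good split should not create nodes with very few records.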


Figure 3.1 Example showing Poor and Good Split

In Figure 3.1, the first split is a poor one because there is no increase in purity. The initial
population contains equal numbers of the two sorts of dot; after the split, so does each child. The
second split is also poor because, although purity is increased slightly, the pure node has few
members and the purity of the larger child is only marginally better than that of the parent. The
final split is a good one because it leads to children of roughly the same size and with much higher
purity than the parent.

Tree-building algorithms are exhaustive. They proceed by taking each input variable in turn and
measuring the increase in purity that results from every split suggested by that variable. After
trying all the input variables, the one that yields the best split is used for the initial split,
creating two or more children. If no split is possible (because there are too few records) or if no
split makes an improvement, then the algorithm is finished with that node and the node becomes a
leaf node. Otherwise, the algorithm performs the split and repeats itself on each of the children.
An algorithm that repeats itself in this way is called a recursive algorithm.

The decision tree keeps growing as long as new splits can be found that improve the ability of the
tree to separate the records of the training set into increasingly pure subsets. Such a tree has been
optimized for the training set, so eliminating any leaves would only increase the error rate of the
tree on the training set. A decision tree algorithm makes its best split first, at the root node where
there is a large population of records. As the nodes get smaller, the tree finds general patterns at
the big nodes and patterns specific to the training set in the smaller nodes; that is, the tree
overfits the training set. The result is an unstable tree that will not make good predictions. The
cure is to eliminate the unstable splits by merging smaller leaves through a process called pruning.
The most widely used pruning algorithm is the CART pruning algorithm.


Figure 3.2 Pruning to get a stable tree

CART is a popular decision tree algorithm first published by Leo Breiman, Jerome Friedman,
Richard Olshen, and Charles Stone in 1984. The acronym stands for Classification and
Regression Trees. The CART algorithm grows binary trees and continues splitting as long as
new splits can be found that increase purity. As illustrated in Figure 3.2, inside a complex tree,
there are many simpler sub trees, each of which represents a different trade-off between model
complexity and training set misclassification rate. The CART algorithm identifies a set of such
sub trees as candidate models. These candidate sub trees are applied to the validation set and the
tree with the lowest validation set misclassification rate is selected as the final model.
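The final selection among candidate subtrees can be sketched as below; the leaf counts and validation error rates are hypothetical stand-ins for what cost-complexity pruning would actually produce:

```python
# Among candidate pruned subtrees, pick the one with the lowest
# validation-set misclassification rate, preferring the simpler tree on
# ties. The statistics below are made-up examples, not real results.
candidates = [
    {"leaves": 12, "val_error": 0.24},  # full tree: overfits the training set
    {"leaves": 7,  "val_error": 0.18},
    {"leaves": 4,  "val_error": 0.18},  # equally accurate but simpler
    {"leaves": 1,  "val_error": 0.43},  # root only: underfits
]
final_model = min(candidates, key=lambda t: (t["val_error"], t["leaves"]))
# final_model is the 4-leaf subtree
```

Sorting by validation error first and complexity second captures the trade-off described above: the most complex tree wins on the training set but not on unseen data.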

3.4.2 NEURAL NETWORKS

True neural networks are biological systems that detect patterns, make predictions and learn.
The artificial ones are computer programs implementing sophisticated pattern detection and
machine learning algorithms on a computer to build predictive models from large historical
databases. For example, suppose a credit card company has 3,000 records, 100 of which are known
fraud records. The neural network is trained on this data set so that it learns the difference
between the fraud records and the legitimate ones. The trained network is then run against the
company's million-record data set, and it flags the records whose patterns are the same as or
similar to those of the fraud records. Neural networks are known for not being very helpful in
teaching analysts about the data, just for finding patterns that match. Neural networks have also
been used for optical character recognition, helping the Post Office automate the delivery process
without having to use humans to read addresses.
During World War II, a seminal paper was published by McCulloch and Pitts which first
outlined the idea that simple processing units (like the individual neurons in the human brain)
could be connected together in large networks to create a system that could solve difficult
problems and display behavior that was much more complex than the simple pieces that made it
up. Since that time much progress has been made in finding ways to apply artificial neural
networks to real world prediction problems and in improving the performance of the algorithm in
general. In many respects the greatest breakthroughs in neural networks in recent years have
been in their application to more mundane real world problems like customer response prediction
or fraud detection rather than the loftier goals that were originally set out for the techniques such
as overall human learning and computer speech and image understanding.
Neural networks learn to detect the patterns and make better predictions in a similar way to the
way that human beings do. This view is encouraged by the way the historical training data is
often supplied to the network - one record (example) at a time. Neural networks do learn in a
very real sense but under the hood the algorithms and techniques that are being deployed are not
truly different from the techniques found in statistics or other data mining algorithms.
Neural networks are very powerful predictive modeling techniques but some of the power comes
at the expense of ease of use and ease of deployment. But these techniques are of tremendous
importance since they are being deployed against real business problems where significant
investments are made based on the predictions from the models. For example, consider trusting
the predictive model from a neural network that dictates which one million customers will
receive a $1 mailing.
One of the important problems in all of data mining is that of determining which predictors are
the most relevant and the most important in building models that are most accurate at prediction.
These predictors may be used by themselves or they may be used in conjunction with other
predictors to form “features”. Neural networks are very helpful in extracting these features. The
neural network shown in Figure 3.3 is used to extract features by requiring the network to learn
to recreate the input data at the output nodes by using just 5 hidden nodes.


Figure 3.3: Structure of a neural network

A neural network is loosely based on how some people believe that the human brain is organized
and how it learns. There are two main structures of consequence in the neural network:
The node – It loosely corresponds to the neuron in the human brain.

The link – It loosely corresponds to the connections between neurons (axons, dendrites and
synapses) in the human brain.

In Figure 3.4, there is a drawing of a simple neural network. The round circles represent the
nodes and the connecting lines represent the links. The neural network functions by accepting
predictor values at the left and performing calculations on those values to produce new values in
the node at the far right. The value at this node represents the prediction from the neural network
model. In this case the network takes in values for predictors for age and income and predicts
whether the person will default on a bank loan.

Figure 3.4: A simple neural network

In order to make a prediction, the neural network accepts the values for the predictors on what
are called the input nodes. These values are then multiplied by weights that are stored in the
links (in some ways similar to the weights that were applied to predictors in the nearest neighbor
method). The weighted values are then added together at the node at the far right (the output
node), a special threshold function is applied, and the resulting number is the prediction. In
this case, if the resulting number is 0 the record is considered a good credit risk (no default);
if the number is 1 the record is considered a bad credit risk (likely default).

Figure 3.5 Calculations in a neural network

The calculations for Figure 3.4 are done as shown in Figure 3.5. Here the age value of 47 is
normalized to fall between 0.0 and 1.0, giving 0.47, and the income is normalized to the value
0.65. This simplified neural network makes the prediction of no default for a 47-year-old making
$65,000. The links are weighted at 0.7 and 0.1, and the resulting value after multiplying the node
values by the link weights is 0.39. The network has been trained to learn that an output value of
1.0 indicates default and 0.0 indicates non-default. The output value calculated here (0.39) is
closer to 0.0 than to 1.0, so the record is assigned a non-default prediction.
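This arithmetic can be reproduced in a few lines; the normalization divisors are assumptions chosen so that an age of 47 maps to 0.47 and an income of $65,000 maps to 0.65:

```python
# Forward pass of the simple network in Figure 3.5: normalize the inputs,
# multiply by the link weights, sum at the output node, then threshold.
age, income = 47, 65_000
inputs = [age / 100, income / 100_000]   # assumed scaling -> 0.47, 0.65
weights = [0.7, 0.1]                     # link weights from Figure 3.5

output = sum(x * w for x, w in zip(inputs, weights))  # 0.329 + 0.065 = 0.394
prediction = "default" if output >= 0.5 else "no default"
# output (~0.39) is closer to 0.0 than to 1.0, so: no default
```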
Applications: Neural networks are used in a wide variety of applications. They have been used
in all facets of business from detecting the fraudulent use of credit cards and credit risk
prediction to increasing the hit rate of targeted mailings. They also have a long history of
application in other areas, from the military's automated driving of an unmanned vehicle at 30
miles per hour on paved roads to biological simulations such as learning the correct pronunciation
of English words from written text. Neural networks of various kinds can also be used for
clustering and prototype creation.

3.4.3 Association Rules


Association Rule Mining involves searching for interesting relationships between items in a data
set. These rules represent patterns in the data without a specific target. These were originally
derived from point-of-sales transactions. One of the most popular and widely used applications
of association rules is Market Basket Analysis. Rule induction on a database can be a massive
undertaking in which all possible patterns are systematically pulled out of the data, and an
accuracy and a significance are then attached to each, telling the user how strong the pattern is
and how likely it is to occur again. In general these rules are relatively simple; for a database
of items scanned in consumer market baskets, you might find interesting correlations such as:

• If biscuits are purchased then tea/coffee is purchased 90% of the time and this pattern
occurs in 3% of all shopping baskets.
• If Gillette shaving razor is purchased from a store then Gillette shaving cream is
purchased 60% of the time and these two items are bought together in 6% of the shopping
baskets.

The bane of rule induction systems is also their strength - that they retrieve all possible
interesting patterns in the database. This is a strength in the sense that no stone is left
unturned, but it can
also be viewed as a weakness because the user can easily become overwhelmed with such a large
number of rules that it is difficult to look through all of them. A second pass of data mining is
needed to go through the list of interesting rules that have been generated by the rule induction
system in the first place in order to find the most valuable gold nugget amongst them all. This
overabundance of patterns can also be problematic for the simple task of prediction: because all
possible patterns are culled from the database, there may be conflicting predictions made by
equally interesting rules. In rule induction systems, the rule itself is of a simple form: “if this
and this and this then this”. For example a rule that a supermarket might find in their data
collected from scanners would be: “if pickles are purchased then ketchup is purchased”.

In order for the rules to be useful there are two pieces of information that must be supplied as
well as the actual rule:

• Accuracy - How often is the rule correct?


• Coverage - How often does the rule apply?

Just because a pattern in the database is expressed as a rule does not mean that it is true all
the time. It is important to recognize and make explicit the uncertainty in the rule. This is what the
accuracy of the rule means. The coverage of the rule has to do with how much of the database
the rule “covers” or applies to. In some cases, accuracy is called the confidence of the rule and
coverage is called the support. Accuracy and coverage appear to be the preferred ways of
naming these two measurements.

Association rules provide information of this type in the form of "if-then" statements. These
rules are computed from the data and, unlike the if-then rules of logic, association rules are
probabilistic in nature. In addition to the antecedent (the "if" part) and the consequent (the "then"
part), an association rule has two numbers that express the degree of uncertainty about the rule.
In association analysis the antecedent and consequent are sets of items (called item sets) that are
disjoint (do not have any items in common).

The first number is called the support for the rule. The support is simply the fraction of
transactions that include all items in the antecedent and consequent parts of the rule. For
example, considering the data given in Table 3.1, the item set {milk, bread} has a support of
2 / 4 = 0.5, since it occurs in 50% of all transactions (2 out of 4 transactions).

The other number is known as the confidence of the rule. The confidence of a rule is defined as
the ratio of the number of transactions supporting the rule to the number of transactions where
the conditional part of the rule holds. Another way of saying this is that confidence is the ratio
of the number of transactions with all the items to the number of transactions with just the "if" items.

Sr.No. Milk Bread Butter Cheese
1 1 1 0 0
2 0 1 1 0
3 0 0 0 1
4 1 1 1 0
Table 3.1 Example dataset for association rules

For example, the rule {milk, bread} => {butter} has a confidence of 0.25 / 0.5 = 0.5 in this
database, which means that for 50% of the transactions containing milk and bread the rule is correct.

Lift: Lift answers the question of how much better than chance the rule is. Lift (also called
improvement) tells us how much better a rule is at predicting the result than just assuming the
result in the first place. Lift is the ratio of the density of the target after application of the
left-hand side to the density of the target in the population. Another way of saying this is that
lift is the ratio of the number of records that support the entire rule to the number that would
be expected, assuming no relationship between the products. It is defined as

lift(X => Y) = support(X ∪ Y) / (support(X) × support(Y))

For example, the rule {milk, bread} => {butter} has a lift of 0.25 / (0.5 × 0.5) = 1.0.

Figure 3.6: Lift chart

Figure 3.6 shows that the higher the lift value, the more likely the event is relative to chance.
A lift value greater than 1 is always desired.

Conviction: It is the ratio of the expected frequency with which X would occur without Y (if they
were independent) to the observed frequency. It is defined as

conviction(X => Y) = (1 − support(Y)) / (1 − confidence(X => Y))

For example, the rule {milk, bread} => {butter} has a conviction of (1 − 0.5) / (1 − 0.5) = 1.0.
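All four measures can be computed directly from the transactions of Table 3.1; the sketch below uses the running example {milk, bread} => {butter}:

```python
# Support, confidence, lift and conviction for {milk, bread} => {butter},
# computed from the four transactions of Table 3.1.
transactions = [
    {"milk", "bread"},             # transaction 1
    {"bread", "butter"},           # transaction 2
    {"cheese"},                    # transaction 3
    {"milk", "bread", "butter"},   # transaction 4
]

def support(itemset):
    """Fraction of transactions containing every item in the set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"milk", "bread"}, {"butter"}
supp = support(antecedent | consequent)              # 1/4  = 0.25
conf = supp / support(antecedent)                    # 0.25 / 0.5 = 0.5
lift = conf / support(consequent)                    # 0.5  / 0.5 = 1.0
conviction = (1 - support(consequent)) / (1 - conf)  # 0.5  / 0.5 = 1.0
```

A lift of exactly 1.0 says that, in this tiny dataset, butter appears in milk-and-bread baskets no more often than chance would predict.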

Association rules are required to satisfy a user-specified minimum support and a user-specified
minimum confidence at the same time. To achieve this, association rule generation is a two-step
process. First, minimum support is applied to find all frequent item sets in a database. Second,
these frequent item sets and the minimum confidence constraint are used to form rules. While the
second step is straightforward, the first step needs more attention.
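The two steps can be sketched as a brute-force search over the Table 3.1 transactions (real systems use the Apriori property to prune step 1; the minimum support and confidence thresholds below are illustrative):

```python
from itertools import combinations

# Step 1: find all item sets meeting minimum support.
# Step 2: form rules from them that meet minimum confidence.
transactions = [{"milk", "bread"}, {"bread", "butter"},
                {"cheese"}, {"milk", "bread", "butter"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.5, 0.8

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1 (brute force; Apriori would prune supersets of infrequent sets).
items = sorted({i for t in transactions for i in t})
frequent = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(set(c)) >= MIN_SUPPORT]

# Step 2: split each frequent item set into antecedent => consequent.
rules = []
for itemset in frequent:
    for r in range(1, len(itemset)):
        for ante in map(set, combinations(sorted(itemset), r)):
            conf = support(itemset) / support(ante)
            if conf >= MIN_CONFIDENCE:
                rules.append((ante, itemset - ante, conf))
# rules: butter => bread and milk => bread, each with confidence 1.0
```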



CHAPTER – 4
METHODOLOGY AND DESIGN

4.1 DATA PREPARATION


The primary dataset consists of clinical and demographic information on outpatients, together
with the treatment finalized for them after thorough diagnosis. It was obtained from the Sri
Sathya Sai Institute of Higher Medical Sciences (SSSIHMS), Whitefield, Bangalore.

Data Set Information

Source: HMIS Dept., Sri Sathya Sai Institute of Higher Medical Sciences (SSSIHMS).

Number of Instances: There are 16,383 instances, i.e. records of 16,383 patients. However, to
satisfy the limitations of the data mining tool, a representative sample of 600 instances was
considered. A simple random sampling method was used, in which records have an equal
probability of being chosen, and the sample was drawn so that instances appear in the same
proportion as in the original dataset.
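A proportional draw of this kind can be sketched as follows; the population counts below are hypothetical stand-ins chosen to mirror the diagnosis proportions reported later in Table 5.1:

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical population of diagnosis labels (10,000 records mirroring
# Table 5.1's proportions), sampled down to ~600 records while keeping
# each diagnosis group's share of the population.
population = (["CAD"] * 5657 + ["CHD"] * 2335 +
              ["RHD"] * 1602 + ["MISC"] * 406)
target = 600

sample = []
for label in ("CAD", "CHD", "RHD", "MISC"):
    group = [r for r in population if r == label]
    n = round(len(group) * target / len(population))  # keep the proportion
    sample.extend(random.sample(group, n))
# len(sample) is ~600 (599 here, due to rounding)
```

Drawing a simple random sample within each group, with the group sizes fixed in proportion to the population, is what keeps the sample representative.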
Number of Variables: There are 5 variables.
Variable Information: The 5 different variables are mentioned below along with a brief
description:

 Age code: It is the age of the patient, converted into discrete values as 1
for 0-17, 2 for 18-25, 3 for 26-35, 4 for 36-60, and 5 for 61+.

 Treatment: It is a treatment code associated with the patient, specifying the
type of treatment to be given for a heart disease. These codes are classified
into four basic categories as follows:

o Congenital heart disease:


ASD,ASD*,ASDP,DCRV,DORV,FONT,ICR,ICR*,IR,LUTR,PDA,RSOV,S
AM,SVAS,TAPV,TOF,VSD,VSD*,VSDA,VSDP,BDGI.
o Rheumatic heart disease: AVR, AVRO, MV_R, MR, MVR*, MVRT,
O/MV, PTMC, OMV, OVR.

o Coronary artery disease: CABG, CAG, CAG*, CBG*, PTCA.
o Miscellaneous heart disease: EPS, PPI

 States: It is the state of residence of the patient. There are a total of 16
states in the dataset. The data is coded from “1” for ‘ANDHRA PRADESH’ to
“24” for ‘SIKKIM’. The fully coded list of states is displayed at page no. 67.

 Districts: It is the district of residence of the patient. There are a total of
100 districts in the dataset. The data is coded from “1” for ‘B RURAL’ to
“356” for ‘YAVATMAL’. The fully coded list of districts is displayed at page
no. 67.

 Gender code: The values “Male” and “Female” are converted into numeric
values: 0 for Male and 1 for Female.

 Final Diagnosis: This variable is the target for prediction. It has four options:

o 1-Coronary Heart Disease.

o 2-Congenital Heart Disease.

o 3-Rheumatic Heart Disease.

o 4-Miscellaneous Heart Disease.

The codes for these treatment types are specified earlier in this section.

4.2 DATA MINING TOOL- XLMINER™

XLMiner™ is a comprehensive data mining add-in for Excel. Data mining is a discovery-
driven data analysis technology used for identifying patterns and relationships in data sets.
With overwhelming amounts of data now available from transaction systems and external
data sources, organizations are presented with increasing opportunities to understand their
data and gain insights into it. Often, there may be more than one approach to a problem.
XLMiner™ is a tool belt to help you get started quickly on data mining, offering a variety of
methods to analyze your data. It has extensive coverage of statistical and machine learning
techniques for classification, prediction, affinity analysis and data exploration and reduction.

4.3 LIMITATIONS OF THE STUDY


There are certain limitations of the study:
1. The dataset used for the classification and clustering techniques does not
contain many attributes that would be relevant for decision making in
healthcare.

2. The software used for analysis is open source software and hence does not support
additional features and functions.



CHAPTER – 5
DATA ANALYSIS

5.1 DESCRIPTIVE DATA SUMMARY

% Population of all variables

GENDER CODE        AGE CODE                                        FINAL DIAGNOSIS
Male    Female     0-17    18-25   26-35   36-60   60+             CAD     CHD     RHD     MISC
71.91%  28.09%     21.21%  6.88%   8.37%   52.13%  11.41%          56.57%  23.35%  16.02%  4.06%

Total Instances: 4684

Table 5.1 Summary of original Dataset

Table 5.1 shows the percentage population of all attributes: age code, gender code, and final
diagnosis. It is observed that the population is dominated by male patients. Among all age codes,
age code 4 has the highest and age code 2 the least composition in the original dataset. Of the
four types of treatments, CAD (Coronary Artery Disease) is predominant in the original dataset,
making up almost 57% of the overall population. The next highest category is CHD (Congenital
Heart Disease), with 24% of the overall population.

5.3 Classification Method

5.3.1 Data Pre-processing and Partitioning

For the purpose of classification, four independent demographic variables and one output
variable are used. The independent variables are:-
• Gender code

• Age code

• States

• Districts

These five variables are the most important in terms of building a model, training it, validating
the functionality of the model, and finally using it for classification or prediction. The
variable Final Diagnosis was selected as the output variable in the process of classification.

Data Partitioning

Before building a model, we typically partition the data using a partition utility. Partitioning
yields mutually exclusive datasets: a training dataset and a validation dataset, in a ratio
defined by the user. The training dataset is used to train the model for the purpose of
classification. This ratio was kept at 33.33% for the training dataset and 66.67% for the
validation dataset.
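The split can be sketched as below, using integer indices as stand-ins for the 600 sampled records:

```python
import random

random.seed(0)  # reproducible illustration

records = list(range(600))   # stand-ins for the 600 sampled rows
random.shuffle(records)      # randomize before cutting
cut = len(records) // 3      # 33.33% of the data for training
training, validation = records[:cut], records[cut:]
# 200 training rows and 400 validation rows, mutually exclusive
```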

Training Set

The partitioning results in the division of the data into two parts: the training set and the
validation set. The training dataset is used to train or build a model. (For further details refer
to the appendix.)

Validation Set

Once a model is built on training data, you need to find out the accuracy of the model on unseen
data. For this, the model should be used on a dataset that was not used in the training process i.e.
a dataset where you know the actual value of the target variable. (For further details refer to the
appendix).

5.4 SUMMARY OF CLASSIFICATION METHOD

The model is built using the training set supplied as above. This model is then applied to the
validation set for prediction of the “Treatment Type”. The training set summary is shown
below:-

5.4.1 TRAINING LOG

Classification predicts the value of a class. This predicted class is then compared with the
actual values, and a measure of the effectiveness of the model, called the misclassification rate,
is calculated. It is defined as the ratio of the number of instances incorrectly predicted to the
total number of instances provided. The training log shows the misclassification (error) rate as
each additional node is added to the tree. (For further details refer to the appendix.)
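The misclassification rate itself is simple to compute (the class labels below are illustrative, not the study's actual predictions):

```python
# Misclassification (error) rate: the share of cases whose predicted
# class differs from the actual class. Labels below are made up.
actual    = [1, 1, 2, 3, 2, 1, 4, 3]
predicted = [1, 2, 2, 3, 3, 1, 4, 1]

errors = sum(a != p for a, p in zip(actual, predicted))
misclassification_rate = errors / len(actual)   # 3 / 8 = 0.375
```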

5.4.2 Confusion Matrix


A confusion matrix shows counts of cases that were correctly and incorrectly classified in the
validation data set. Here it compares the actual and predicted class values for the four types of
treatments. It shows that, of the 400 values in the validation set, 215 cases of “CAD” (Coronary
Artery Disease) are predicted exactly as in the actual class. For “RHD” (Rheumatic Heart
Disease), however, 25 values are misclassified as “CAD” and 6 values as “CHD” (Congenital Heart
Disease). Similarly, in the case of “CHD”, 3 cases are misclassified as “CAD” and another 19 as
“RHD”.
Table 5.2 Confusion Matrix
Classification Confusion Matrix

                 Predicted Class
Actual Class      1     2     3     5
1 215 5 1 0

2 3 68 19 0

3 25 6 56 0

5 11 0 5 0
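Such a matrix can be tallied directly from paired actual/predicted labels. The labels below are the treatment-type codes from the table; the sample data is invented for illustration.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """matrix[i][j] = number of cases with actual class labels[i]
    that were predicted as class labels[j]."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = [1, 2, 3, 5]                # treatment-type codes
actual = [1, 1, 2, 3, 3, 5]
predicted = [1, 1, 2, 1, 3, 5]       # one RHD case mistaken for CAD
for row in confusion_matrix(actual, predicted, labels):
    print(row)
```

The diagonal holds the correctly classified counts; everything off the diagonal is a misclassification.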

5.4.3 Error Report

It shows that, of the validation cases, there are 5 errors among the 219 CAD cases, whereas
there are 22 errors for CHD. Overall, 82% of the instances are correctly classified. This
percentage of correctly classified instances determines the efficiency of the model; in our case the
model is efficient and workable for the process of classification.

Table 5.3 Error Report

Error Report

Class # Cases # Errors % Error

1 219 5 2.28

2 90 22 25.55

3 76 30 39.57

4 15 15 100.00

Overall 500 72 18.00
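The error report is derived from the confusion matrix: for each actual class, every off-diagonal count in its row is an error. A sketch using the counts from the confusion matrix above (the derived figures differ slightly from the report's own table, which came from XLMiner's tally):

```python
matrix = [
    [215, 5, 1, 0],   # actual class 1 (CAD)
    [3, 68, 19, 0],   # actual class 2 (CHD)
    [25, 6, 56, 0],   # actual class 3 (RHD)
    [11, 0, 5, 0],    # actual class 5
]

for i, row in enumerate(matrix):
    cases = sum(row)
    errors = cases - row[i]  # everything off the diagonal is misclassified
    print(f"row {i}: {cases} cases, {errors} errors, "
          f"{100 * errors / cases:.2f}% error")
```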

5.4.4 Classification Tree


A classification tree is built through a process known as binary recursive partitioning. This is an
iterative process that splits the data into two branches and then splits each branch
further. The actual trees formed through the classification process
for the training and validation datasets are as follows:-

5.5 Observations

5.5.1 Training Decision Tree


Out of the 200 patients in the training set, 116 are classified as Treatment Type “1”. Among
them there are 15 female patients and 101 male patients. Of these 15 female patients, 8
belong to the states Andhra Pradesh, Jammu & Kashmir, Madhya Pradesh, Tamil Nadu and Uttar

Pradesh. The other 17 patients belong to age group 5, i.e. 35-60 yrs. The remaining 101
patients split into 57 from states with state code < 10.5 and 55 from states with state code > 10.5.
Of the 55, there are 22 patients belonging to the states Delhi, Gujarat, Haryana, Jharkhand, Karnataka and
Kerala. Of these 22 patients, 17 belong to age group 5 and 5 belong to the 60+ yrs group.
Similarly, when we trace the paths down to the other terminal nodes
labelled “1”, “2”, “3” or “5”, we find that out of the 200 patients in total, there are 116 patients
with “Treatment Type” “1”, 53 patients with “Treatment Type” “2”, 29 patients with
“Treatment Type” “3” and 2 patients with “Treatment Type” “5”. This shows that treatment
types 1 & 2 represent 80% of the population.
To understand the structure of the tree better, XLMiner provides Full Tree Rules for the
training decision tree (for further details refer to the appendix).

5.5.2 User Specified Tree


The training decision tree created above is applied to the validation set of 500 patients. Here too,
tracing the different paths to the terminal nodes labelled “1”, “2”, “3” or “5”, we find
that of the 500 patients, 215 are correctly classified as “Treatment Type
1”, 68 as “Treatment Type 2” and 56 as “Treatment Type 3”. This shows that
treatment types 1 & 2 represent 71% of the population, which is close to the output generated for the
training dataset. Thus, we can say that our model is workable.

5.6 Result of Classification method


We found that the majority of patients come from the states ANDHRA PRADESH, WEST
BENGAL, KARNATAKA and KERALA. Of these, Kerala is the most prone to Coronary Artery
Disease (CAD). CAD was found to be the predominant treatment type among the four types of
treatment. The population to be focused on for providing healthcare is dominated by male patients
in the age group 35-60.
The hospital needs to prepare itself for managing resources such as medicines, doctors, blood,
etc. for these treatments, to provide uninterrupted service. It will also allow the hospital

administration to design preventive care measures for these patients awaiting
admission.
Hence, based on the decision tree model developed we can conclude that the TREATMENT
TYPE would be one out of the four basic categories as shown below:-

Treatment Type 1 - CORONARY ARTERY DISEASE (CAD).


• If AGE <=1.5 and STATES [6.5, 10.5] Then, 5 patients were diagnosed with CAD.

This statement means that five young patients belonging to age group 0-25, coming from
the states Uttar Pradesh, Assam, Bihar, Chhattisgarh and West Bengal, were classified
under the CORONARY ARTERY DISEASE (CAD) treatment type.

• If AGE>5.5 and GENDERCODE = 1 and STATES <=6 Then, 2 patients were diagnosed
with CAD.

This statement means that two adult female patients belonging to age group 35-60, coming
from the states Andhra Pradesh, Jammu & Kashmir, Madhya Pradesh, Tamil Nadu and Uttar
Pradesh, were classified under the CAD treatment type.

• If AGE [3.5, 5.5] and GENDERCODE = 1 STATES [3, 6] Then, 5 patients were diagnosed
with CAD.

• If AGE [3.5, 5.5] and GENDERCODE = 1 STATES <=3 Then, 2 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 1 STATES [6, 11] Then, 3 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 1 STATES [11, 15.5] Then, 10 patients were
diagnosed with CAD.

• If AGE >=3.5 and GENDERCODE = 1 STATES >= 15.5 Then, 5 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 2 STATES [6.5, 10.5] Then, 17 patients were
diagnosed with CAD.

• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES <=2 Then, 10 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 2 STATES >=15.5 Then, 32 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 2 STATES [2, 3.5] Then, 2 patients were diagnosed
with CAD.

• If AGE >=3.5 and GENDERCODE = 2 STATES [3.5, 6.5] Then, 18 patients were
diagnosed with CAD.

• If AGE >=3.5 and GENDERCODE = 2 STATES [3.5, 6.5] Then, 2 patients were diagnosed
with CAD.

• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES [10.5, 15.5] Then, 5 patients were
diagnosed with CAD.

• If AGE >= 5.5 and GENDERCODE = 2 STATES [10.5, 15.5] Then, 5 patients were
diagnosed with CAD.

• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES [15.5, 15.5] Then, 13 patients were
diagnosed with CAD.

Treatment Type 2 - CONGENITAL HEART DISEASE (CHD)


• If AGE<=1.5 and GENDERCODE=1 or 2 and STATES [6.5, 10.5] Then, 12 patients were
diagnosed with CONGENITAL HEART DISEASE (CHD).

• If AGE<=1.5 and STATES <=15.5 and GENDERCODE=1 or 2 Then, 15 patients were
diagnosed with CHD.

• If AGE<=1.5 and STATES [15.5, 16.5] and GENDERCODE=1 or 2 Then, 11 patients were
diagnosed with CHD.

• If AGE>=2.5 and STATES <=12 Then, 5 patients were diagnosed with CHD.

• If AGE>=2.5 and STATES >=12 Then, 2 patients were diagnosed with CHD.

• If AGE [1.5, 2.5] and GENDERCODE = 2 and STATES [12, 18.5] Then, 2 patients were
diagnosed with CHD.

Treatment Type 3 - RHEUMATIC HEART DISEASE (RHD)


• If AGE [1.5, 2.5] and GENDERCODE = 1 Then, 11 patients were diagnosed with
RHEUMATIC HEART DISEASE (RHD).

• If AGE>= 1.5 and GENDERCODE = 2 and STATES [6, 8] Then, 6 patients were diagnosed
with RHD.

• If AGE>= 1.5 and GENDERCODE = 2 and STATES <=6 Then, 3 patients were diagnosed
with RHD.

• If AGE>= 1.5 and GENDERCODE = 2 and STATES >=8 Then, 2 patients were diagnosed
with RHD.

• If AGE>= 2.5 and GENDERCODE = 2 and STATES [12, 18.5] Then, 3 patients were
diagnosed with RHD.

• If AGE >2.5 and STATES>=18.5 and GENDERCODE = 2 Then, 2 patients were diagnosed
with RHD.
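Each bullet above is an IF-THEN rule over the coded attributes, so the extracted tree can be replayed as a chain of predicates. A minimal sketch encoding just two of the rules (the function and its argument names are illustrative, not XLMiner output):

```python
def classify(age, gender, state):
    """Apply two of the extracted tree rules; return a treatment-type
    code, or None when no encoded rule fires."""
    # Rule: AGE <= 1.5 and STATES in [6.5, 10.5]  ->  CAD (type 1)
    if age <= 1.5 and 6.5 <= state <= 10.5:
        return 1
    # Rule: AGE in [1.5, 2.5] and GENDERCODE = 1  ->  RHD (type 3)
    if 1.5 <= age <= 2.5 and gender == 1:
        return 3
    return None

print(classify(age=1, gender=2, state=8))   # 1 (CAD)
print(classify(age=2, gender=1, state=12))  # 3 (RHD)
```

A full replay would encode every rule in the list; the tree guarantees that exactly one terminal-node rule fires for each record.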

5.7 Clustering Method

5.7.1 Data Pre-processing

For the purpose of clustering a representative sample of 200 instances is considered and a
combination of 5 attributes is used. These 5 attributes are:-
• Age

• Gender code

• States

• Districts

• Final diagnosis

The number of clusters was chosen as 6 after repeated iterations, each reviewed to check
whether it added any value over the output generated in the previous iterations. Having
reviewed all iterations, I found that the output generated with 6 clusters was the most
useful for gaining meaningful insights. The number of iterations was chosen as 10. This determines
how many times the program will start with an initial partition and follow through with the
clustering algorithm. The configuration of clusters may differ from one starting partition to
another; the program will go through the specified number of iterations and select the cluster
configuration that minimizes the distance measure.
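The restart behaviour described above can be sketched with a plain one-dimensional K-means: run the algorithm from several random starting partitions and keep the set of centres with the smallest total distance. The data and parameter values below are illustrative, not the study's.

```python
import random

def kmeans_1d(points, k, n_starts=10, n_steps=25, seed=0):
    """Plain 1-D K-means with random restarts: run from n_starts initial
    partitions and keep the centres whose total within-cluster distance
    is smallest -- the restart behaviour described above."""
    rng = random.Random(seed)
    best_cost, best_centres = None, None
    for _ in range(n_starts):                      # one restart per pass
        centres = rng.sample(points, k)            # random initial centres
        for _ in range(n_steps):
            groups = [[] for _ in range(k)]
            for p in points:                       # assign to nearest centre
                groups[min(range(k), key=lambda i: abs(p - centres[i]))].append(p)
            centres = [sum(g) / len(g) if g else centres[i]
                       for i, g in enumerate(groups)]
        cost = sum(min(abs(p - c) for c in centres) for p in points)
        if best_cost is None or cost < best_cost:  # keep the best configuration
            best_cost, best_centres = cost, sorted(centres)
    return best_centres

print(kmeans_1d([1, 2, 3, 20, 21, 22], k=2))  # centres near 2 and 21
```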
5.7.2 Analysis of Output using K-means Clustering
K-means clustering groups the records into clusters from which meaningful insights can be
drawn. XLMiner predicts the clusters and places each instance in the corresponding cluster.
Table 5.4 shows the cluster to which each instance belongs and its distance to each of the
cluster centres.

Table 5.4 Predicted Clusters


Row Cluster Distance Distance Distance Distance Distance Distance
Id. id cluster-1 cluster-2 cluster-3 cluster-4 cluster-5 cluster-6
1 1 1.2485 2.5192 2.6693 2.4223 5.0888 3.4185

2 1 1.5629 3.1967 2.8841 3.1639 3.5151 2.3551

3 1 1.4981 2.3165 3.1079 3.2327 3.9328 3.2252

4 1 1.2388 2.5233 2.6869 2.4207 5.1026 3.4172

5 1 1.1533 2.5791 2.883 2.4156 5.258 3.4114

6 1 1.2343 3.3927 2.6445 2.3225 3.8791 2.604

7 1 1.1103 2.6664 3.112 2.4376 5.4428 3.4246

8 1 1.6869 3.0146 5.3044 3.4926 5.0651 3.4724

9 1 1.5329 3.8712 3.8521 2.6821 5.9003 2.9159

10 1 1.5693 3.1941 2.8707 3.1649 3.5019 2.3567

11 1 1.5171 3.8582 3.8277 2.6695 5.8789 2.9046

12 1 1.4518 3.3214 2.1821 2.3638 3.5166 2.6469

13 1 1.5061 2.312 3.0928 3.2339 3.9184 3.2266

14 1 1.5629 3.1967 2.8841 3.1639 3.5151 2.3551

15 1 1.4541 3.1692 5.0005 2.7528 5.189 3.6473

16 1 1.459 3.1739 5.0075 2.7564 5.195 3.65

17 1 1.1536 2.5787 2.8819 2.4156 5.2572 3.4114

18 1 1.4707 3.185 5.0238 2.765 5.2091 3.6564

19 1 1.2333 3.3935 2.6479 2.3226 3.8819 2.6041

20 1 1.5273 3.8666 3.8435 2.6776 5.8928 2.9119

21 1 1.3491 2.9446 2.6193 1.8944 3.485 1.9531

22 1 1.2763 1.951 2.8567 2.0074 3.8996 2.9447

23 1 1.3491 2.9446 2.6193 1.8944 3.485 1.9531

24 1 1.3491 2.9446 2.6193 1.8944 3.485 1.9531

25 1 1.8497 2.134 3.8814 2.1187 5.4974 2.8119

26 2 1.8031 1.3553 2.8543 1.8392 3.4973 2.6222

27 2 1.8045 1.3534 2.8503 1.8396 3.4934 2.6226

28 2 3.1009 0.81757 2.949 3.2464 2.2518 2.6618

29 2 3.5873 1.5469 2.8105 3.6374 1.9044 3.134

30 2 2.5606 0.49035 3.5507 2.9063 3.1771 2.217

31 2 2.56 0.49273 3.5523 2.9062 3.1792 2.2168

32 2 2.5606 0.49035 3.5507 2.9063 3.1771 2.217

33 2 3.5946 1.5569 2.8099 3.6435 1.9012 3.1412

34 2 2.6302 0.2598 3.3991 2.9346 2.9689 2.2574

35 2 2.5168 0.72901 3.7074 2.9006 3.3836 2.206

36 2 2.5606 0.49035 3.5507 2.9063 3.1771 2.217

37 2 3.2448 1.0475 2.8855 3.3577 2.1203 2.799

38 2 2.5606 0.49035 3.5507 2.9063 3.1771 2.217

39 2 3.4163 0.81947 3.6324 3.3002 2.6939 2.4988

40 2 3.1512 1.2453 5.289 3.1801 3.6832 2.3212

41 3 2.8572 3.6625 1.0422 3.2083 2.6717 3.4381

42 3 2.9644 3.516 1.8792 3.8485 2.4068 3.236

43 3 3.2821 3.2151 1.5817 3.6279 3.1443 5.3743

44 3 3.2608 3.2007 1.5772 3.611 3.1452 5.3602

45 3 2.8503 3.6593 1.0433 3.2031 2.6733 3.4332

46 3 1.7906 3.3279 1.7507 2.5097 3.1974 2.7836

47 3 3.0881 3.7767 1.0355 3.3873 2.6272 3.6079

48 3 2.856 3.662 1.0424 3.2075 2.6719 3.4373

49 3 2.8067 2.9159 1.5545 3.2645 3.2015 5.0741

50 3 2.8411 3.655 1.0449 3.1961 2.6756 3.4266

51 3 2.546 3.329 1.986 3.5892 2.5767 2.9177

52 3 3.2809 3.2143 1.5815 3.627 3.1444 5.3735

53 3 2.8055 2.9152 1.5546 3.2636 3.2017 5.0734

54 3 5.2021 5.033 2.116 3.4146 5.2091 5.9947

55 3 2.8503 3.6593 1.0433 3.2031 2.6733 3.4332

56 3 2.8032 2.9139 1.5549 3.2619 3.2022 5.072

57 3 2.8171 2.9219 1.5533 3.2721 3.1993 5.0803

58 3 2.8468 3.6577 1.0439 3.2005 2.6742 3.4307

59 3 3.3163 3.9005 1.0867 3.5704 2.6068 3.7823

60 3 2.4004 3.4753 1.2278 2.8781 2.8301 3.1272

61 3 2.3818 3.469 1.2397 2.8655 2.8386 3.1154

62 3 2.8445 3.6566 1.0443 3.1987 2.6747 3.429

63 3 3.2289 3.7784 1.1218 2.5657 2.8907 3.5796

64 3 5.5381 5.6141 1.7519 3.9487 2.972 5.6815

65 3 3.6068 3.3435 1.6318 3.0705 3.331 5.4837

66 3 2.8948 3.2451 0.9852 2.902 2.0659 2.9642

67 3 3.8938 5.5528 2.0684 2.7277 3.8965 5.42

68 3 5.3937 5.0018 2.5901 3.119 5.0202 5.0109

69 3 5.3715 3.9857 2.5861 3.0917 5.0211 5.9938

70 3 3.7896 3.0703 1.9155 2.9479 3.0571 5.2666

71 3 5.5248 5.0026 2.2509 3.4178 2.4012 3.9559

72 3 5.1347 2.9953 2.4157 3.0502 2.9531 5.2021

73 3 5.16 3.7201 2.1243 2.9792 2.3726 3.5797

74 4 2.8857 5.0664 3.2301 2.208 5.257 5.7354

75 4 1.8836 3.9025 2.6918 1.8794 5.3701 3.2841

76 4 1.939 3.8657 2.4779 1.887 5.2168 3.2907

77 4 2.0512 3.8532 3.609 1.691 5.7951 2.9815

78 4 2.0102 3.4814 2.4648 1.3236 3.8679 2.8031

79 4 2.3415 3.226 2.6531 1.0418 3.6259 2.4556

80 4 2.3389 3.2278 2.6626 1.04 3.6342 2.4547

81 4 2.2955 3.2709 2.8573 1.0283 3.806 2.4469

82 4 2.297 2.3583 2.9007 1.2342 5.0371 3.2989

83 4 2.2527 2.417 3.0804 1.2243 5.1924 3.293

84 4 2.2938 3.2737 2.8681 1.0292 3.8156 2.4471

85 4 2.2962 3.2698 2.853 1.028 3.8022 2.4468

86 4 2.8024 2.232 3.2414 1.4112 3.9513 3.1922

87 4 5.2103 3.5902 2.9853 2.5755 5.0334 5.5614

88 4 2.8048 2.2296 3.2338 1.4127 3.9437 3.1929

89 4 2.8022 2.2323 3.2423 1.411 3.9521 3.1921

90 4 2.8045 2.2299 3.2346 1.4125 3.9446 3.1929

91 4 2.84 3.1344 3.0193 1.2473 3.5274 2.3108

92 4 3.6556 5.4699 5.5848 2.0321 5.4298 3.6995

93 4 2.7974 3.1859 3.2209 1.2342 3.7336 2.3002

94 4 3.5489 3.3836 3.6822 1.6903 5.7409 5.1135

95 4 2.9038 2.9477 5.4015 1.9285 5.0795 3.4388

96 5 3.634 2.1904 2.4157 3.9086 1.9701 3.614

97 5 5.3941 3.3752 2.6926 5.2733 0.56784 3.2403

98 5 3.6287 2.7041 2.5628 3.5886 0.71551 2.2485

99 5 5.0024 3.0206 2.5818 3.9176 0.41248 2.7488

100 5 3.9858 3.006 2.579 3.9027 0.41766 2.7273

101 5 5.8139 3.77 2.8893 5.6629 0.98948 3.7433

102 5 3.6269 2.7026 2.563 3.5871 0.71754 2.2459

103 5 3.6206 2.6975 2.5635 3.5816 0.72477 2.2371

104 5 5.8279 3.7833 2.897 5.676 1.0046 3.7597

105 5 3.98 3.0009 2.5781 3.8975 0.42 2.7198

106 5 3.9516 2.0319 2.8228 3.9504 1.8079 3.4967

107 5 3.9761 2.9975 2.5775 3.8941 0.4217 2.7148

108 5 5.0024 3.0206 2.5818 3.9176 0.41248 2.7488

109 5 3.617 2.6946 2.5638 3.5785 0.72896 2.232

110 5 3.9741 2.0607 2.8261 3.9701 1.806 3.5192

111 5 3.6215 2.6983 2.5634 3.5824 0.72373 2.2384

112 5 3.9809 3.0018 2.5783 3.8984 0.41959 2.721

113 5 5.8055 3.0727 3.1161 5.721 2.0237 5.3558

114 5 5.8236 3.7792 2.8946 5.672 0.99993 3.7547

115 5 3.9975 3.0163 2.581 3.9132 0.41378 2.7424

116 5 3.9858 3.006 2.579 3.9027 0.41766 2.7273

117 5 5.8236 3.7792 2.8946 5.672 0.99993 3.7547

118 5 5.8225 3.7782 2.894 5.671 0.99876 3.7534

119 5 3.9984 3.0171 2.5811 3.9141 0.41351 2.7437

120 5 3.9613 2.0443 2.8242 3.9589 1.807 3.5064

121 5 3.9761 2.9975 2.5775 3.8941 0.4217 2.7148

122 5 3.6188 2.6961 2.5637 3.58 0.72686 2.2346

123 5 3.6134 2.6918 2.5642 3.5754 0.7332 2.227

124 5 3.6179 2.6954 2.5637 3.5793 0.72791 2.2333

125 5 3.6161 2.6939 2.5639 3.5777 0.73002 2.2308

126 5 3.6152 2.6932 2.564 3.5769 0.73108 2.2295

127 5 3.9965 3.0154 2.5808 3.9123 0.41406 2.7412

128 5 3.9761 2.9975 2.5775 3.8941 0.4217 2.7148

129 5 5.8289 3.7844 2.8976 5.677 1.0058 3.761

130 5 3.9613 2.0443 2.8242 3.9589 1.807 3.5064

131 5 3.9916 3.0111 2.58 3.9079 0.41559 2.7349

132 5 3.6269 2.7026 2.563 3.5871 0.71754 2.2459

133 5 5.7991 3.0651 3.1128 5.715 2.0202 5.3493

134 5 5.3848 3.3666 2.6891 5.2648 0.55999 3.229

135 5 3.9662 2.0506 2.8249 3.9632 1.8066 3.5113

136 5 3.6215 2.6983 2.5634 3.5824 0.72373 2.2384

137 5 5.1818 3.8425 3.4095 5.8294 1.2207 3.7988

138 5 5.1768 3.8375 3.407 5.8245 1.2161 3.7926

139 5 5.102 2.8028 3.139 3.8004 1.0181 2.3364

140 6 1.8613 2.5838 2.5969 1.7166 3.0077 1.4234

141 6 2.0142 3.2269 3.9581 2.1667 5.393 1.9186

142 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

143 6 2.5926 2.2634 3.3763 2.8277 2.6727 0.2183

144 6 2.5538 2.3243 3.5312 2.8236 2.901 0.1008

145 6 2.5926 2.2634 3.3763 2.8277 2.6727 0.2183

146 6 2.5947 2.2612 3.3697 2.8283 2.6626 0.22678

147 6 2.6641 2.2199 3.2036 2.857 2.4012 0.4802

148 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

149 6 2.5546 2.3224 3.5268 2.8235 2.8947 0.099138

150 6 2.5538 2.3243 3.5312 2.8236 2.901 0.1008

151 6 2.5553 2.3209 3.5234 2.8234 2.8897 0.098372

152 6 2.555 2.3216 3.5251 2.8235 2.8922 0.098691

153 6 2.554 2.324 3.5303 2.8236 2.8997 0.10041

154 6 2.7053 2.9698 5.4954 3.1196 5.1548 1.3007

155 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

156 6 2.5955 2.2603 3.3672 2.8285 2.6589 0.23011

157 6 2.5548 2.322 3.526 2.8235 2.8935 0.098899

158 6 2.554 2.324 3.5303 2.8236 2.8997 0.10041

159 6 2.5537 2.3247 3.5321 2.8237 2.9022 0.10122

160 6 2.5997 2.2564 3.355 2.8298 2.6402 0.24782

161 6 2.5988 2.2572 3.3574 2.8295 2.6439 0.24414

162 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

163 6 2.5349 2.4172 3.7141 2.8414 3.1572 0.30833

164 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

165 6 2.5525 2.3279 3.539 2.8239 2.9122 0.10564

166 6 2.5936 2.2623 3.373 2.828 2.6677 0.22247

167 6 2.5947 2.2612 3.3697 2.8283 2.6626 0.22678

168 6 2.5931 2.2628 3.3746 2.8279 2.6702 0.22037

169 6 2.5542 2.3236 3.5294 2.8236 2.8985 0.10004

170 6 2.5538 2.3243 3.5312 2.8236 2.901 0.1008

171 6 2.5538 2.3243 3.5312 2.8236 2.901 0.1008

172 6 2.5944 2.2614 3.3705 2.8282 2.6639 0.22569

173 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

174 6 2.5546 2.3224 3.5268 2.8235 2.8947 0.099138

175 6 2.5534 2.3255 3.5338 2.8237 2.9047 0.10216

176 6 2.5353 2.4111 3.7032 2.8397 3.1422 0.29413

177 6 2.5997 2.2564 3.355 2.8298 2.6402 0.24782

178 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

179 6 2.5931 2.2628 3.3746 2.8279 2.6702 0.22037

180 6 2.5546 2.3224 3.5268 2.8235 2.8947 0.099138

181 6 2.5546 2.3224 3.5268 2.8235 2.8947 0.099138

182 6 2.5538 2.3243 3.5312 2.8236 2.901 0.1008

183 6 2.5537 2.3247 3.5321 2.8237 2.9022 0.10122

184 6 2.5921 2.264 3.378 2.8276 2.6752 0.21628

185 6 2.5543 2.3232 3.5286 2.8235 2.8972 0.09971

186 6 2.5525 2.3279 3.539 2.8239 2.9122 0.10564

187 6 2.5982 2.2577 3.359 2.8294 2.6464 0.24173

188 6 2.5546 2.3224 3.5268 2.8235 2.8947 0.099138

189 6 2.5528 2.3271 3.5373 2.8238 2.9097 0.10437

190 6 2.5347 2.4208 3.7205 2.8425 3.166 0.31699

191 6 2.5347 2.4219 3.7224 2.8428 3.1685 0.31951

192 6 3.3609 2.3346 3.5501 3.1719 2.2602 0.99178

193 6 3.2303 2.3825 3.8238 3.0982 2.7542 0.69999

194 6 3.2286 2.3846 3.8296 3.0977 2.7638 0.69718

195 6 3.2286 2.3846 3.8296 3.0977 2.7638 0.69718

196 6 3.1969 2.4433 3.9689 3.0938 2.9885 0.66873

197 6 3.1958 2.4467 3.9759 3.094 2.9995 0.66948

198 6 3.1981 2.44 3.962 3.0936 2.9776 0.66837

199 6 3.1817 2.5343 5.1366 3.1107 3.244 0.73208

200 6 3.3953 3.2242 5.0319 3.4677 5.4517 1.6758

We can see that, for each record, its distance to every cluster centre is calculated using the
K-means algorithm, and the record is assigned to the cluster with the minimum distance.
The values highlighted in yellow therefore give the division into clusters.


Thus, row id 3 is assigned to cluster 1 because its distance to cluster 1 is the least compared
with its distances to the other clusters.
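The assignment rule behind Table 5.4 is simply an argmin over the six distance columns, as a quick sketch shows:

```python
def assign_cluster(distances):
    """Return the 1-based id of the nearest cluster (smallest distance)."""
    return min(range(len(distances)), key=distances.__getitem__) + 1

# Distances of row id 3 to clusters 1..6, copied from Table 5.4
row3 = [1.4981, 2.3165, 3.1079, 3.2327, 3.9328, 3.2252]
print(assign_cluster(row3))  # 1 -> record 3 belongs to cluster 1
```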

Now, the most significant part of this study is to understand the cluster division and obtain
meaningful information for decision-making. Table 5.5 below shows the values of the
clustering attributes for the records in each cluster.

Table 5.5 Value of attributes at each cluster

CLUSTER ID   AGE   GENDERCODE   STATES   DISTRICT   FINAL DIAGNOSIS
1 1 1 150000 1506 2

1 1 2 150000 1520 1

1 1 1 150000 1523 1

1 1 1 150000 1523 2

1 1 1 160000 1610 2

1 1 2 160000 1611 2

1 1 1 170000 1722 2

1 1 1 210000 2115 1

1 1 2 210000 2123 2

1 1 2 150000 1506 1

1 1 2 210000 2103 2

1 1 2 140000 1405 2

1 1 1 150000 1506 1

1 1 2 150000 1520 1

1 1 1 210000 2106 2

1 1 1 210000 2112 2

1 1 1 160000 1609 2

1 1 1 210000 2126 2

1 1 2 160000 1614 2

1 1 2 210000 2116 2

1 2 2 160000 1609 2

1 2 1 160000 1605 2

1 2 2 160000 1609 2

1 2 2 160000 1609 2

1 3 1 200000 2002 2

2 3 1 150000 1529 2

2 3 1 150000 1525 2

2 4 1 100000 1001 1

2 4 1 70000 717 1

2 4 1 150000 1506 1

2 4 1 150000 1508 1

2 4 1 150000 1506 1

2 4 1 70000 709 1

2 4 1 140000 1402 1

2 4 1 160000 1602 1

2 4 1 150000 1506 1

2 4 1 90000 916 1

2 4 1 150000 1506 1

2 5 1 120000 1213 1

2 5 1 170000 1714 1

3 1 2 70000 702 2

3 1 2 70000 715 1

3 1 1 50000 511 2

3 1 1 50000 529 2

3 1 2 70000 708 2

3 1 2 120000 1202 2

3 1 2 60000 602 2

3 1 2 70000 703 2

3 1 1 70000 716 2

3 1 2 70000 716 2

3 1 2 90000 907 1

3 1 1 50000 512 2

3 1 1 70000 717 2

3 1 2 50000 521 4

3 1 2 70000 708 2

3 1 1 70000 719 2

3 1 1 70000 707 2

3 1 2 70000 711 2

3 1 2 50000 507 2

3 1 2 90000 909 2

3 1 2 90000 926 2

3 1 2 70000 713 2

3 2 2 70000 710 3

3 2 2 10000 105 3

3 2 1 50000 521 3

3 2 2 70000 714 2

3 2 2 70000 712 4

3 3 1 50000 506 4

3 3 1 50000 531 4

3 3 1 50000 531 3

3 4 2 30000 330 3

3 4 1 50000 509 3

3 4 2 50000 507 3

4 1 1 150000 1506 4

4 1 2 160000 1605 3

4 1 2 150000 1515 3

4 2 2 200000 2003 3

4 2 2 150000 1525 3

4 3 2 150000 1516 3

4 3 2 150000 1525 3

4 3 2 160000 1609 3

4 3 1 150000 1525 3

4 3 1 160000 1609 3

4 3 2 160000 1619 3

4 3 2 160000 1605 3

4 4 1 150000 1515 3

4 4 1 80000 811 4

4 4 1 150000 1506 3

4 4 1 150000 1516 3

4 4 1 150000 1507 3

4 4 2 150000 1503 3

4 4 2 210000 2105 4

4 4 2 160000 1619 3

4 4 1 150000 1521 4

4 4 1 210000 2110 3

5 3 1 50000 511 1

5 4 2 30000 314 1

5 4 2 70000 701 1

5 4 2 50000 504 1

5 4 2 50000 521 1

5 4 2 10000 115 1

5 4 2 70000 703 1

5 4 2 70000 710 1

5 4 2 10000 102 1

5 4 2 50000 527 1

5 4 1 50000 531 1

5 4 2 50000 531 1

5 4 2 50000 504 1

5 4 2 70000 714 1

5 4 1 50000 508 1

5 4 2 70000 709 1

5 4 2 50000 526 1

5 4 1 10000 104 1

5 4 2 10000 106 1

5 4 2 50000 509 1

5 4 2 50000 521 1

5 4 2 10000 106 1

5 4 2 10000 107 1

5 4 2 50000 508 1

5 4 1 50000 521 1

5 4 2 50000 531 1

5 4 2 70000 712 1

5 4 2 70000 718 1

5 4 2 70000 713 1

5 4 2 70000 715 1

5 4 2 70000 716 1

5 4 2 50000 510 1

5 4 2 50000 531 1

5 4 2 10000 101 1

5 4 1 50000 521 1

5 4 2 50000 515 1

5 4 2 70000 703 1

5 4 1 10000 110 1

5 4 2 30000 323 1

5 4 1 50000 516 1

5 4 2 70000 709 1

5 5 2 10000 117 1

5 5 2 10000 122 1

5 5 2 70000 707 1

6 3 2 150000 1515 2

6 3 2 210000 2116 2

6 4 2 150000 1506 1

6 4 2 150000 1527 1

6 4 2 160000 1610 1

6 4 2 150000 1527 1

6 4 2 150000 1519 1

6 4 2 140000 1409 1

6 4 2 150000 1506 1

6 4 2 160000 1605 1

6 4 2 160000 1610 1

6 4 2 160000 1601 1

6 4 2 160000 1603 1

6 4 2 160000 1609 1

6 4 2 210000 2111 1

6 4 2 150000 1506 1

6 4 2 150000 1516 1

6 4 2 160000 1604 1

6 4 2 160000 1609 1

6 4 2 160000 1611 1

6 4 2 150000 1501 1

6 4 2 150000 1504 1

6 4 2 150000 1506 1

6 4 2 170000 1715 1

6 4 2 150000 1506 1

6 4 2 160000 1619 1

6 4 2 150000 1523 1

6 4 2 150000 1519 1

6 4 2 150000 1525 1

6 4 2 160000 1608 1

6 4 2 160000 1610 1

6 4 2 160000 1610 1

6 4 2 150000 1520 1

6 4 2 150000 1506 1

6 4 2 160000 1605 1

6 4 2 160000 1613 1

6 4 2 170000 1703 1

6 4 2 150000 1501 1

6 4 2 150000 1506 1

6 4 2 150000 1525 1

6 4 2 160000 1605 1

6 4 2 160000 1605 1

6 4 2 160000 1610 1

6 4 2 160000 1611 1

6 4 2 150000 1529 1

6 4 2 160000 1607 1

6 4 2 160000 1619 1

6 4 2 150000 1506 1

6 4 2 160000 1605 1

6 4 2 160000 1617 1

6 4 2 170000 1722 1

6 4 2 170000 1724 1

6 5 2 130000 1301 1

6 5 2 150000 1516 1

6 5 2 150000 1524 1

6 5 2 150000 1524 1

6 5 2 160000 1610 1

6 5 2 160000 1619 1

6 5 2 160000 1601 1

6 5 2 170000 1720 1

6 5 2 220000 2201 1

Table 5.6 shows the number of records in each cluster, the percentage of the population each
represents, and the average distance of the cluster members from the centre of their cluster.

Table 5.6 Number of instances in each cluster and their average distance from the cluster centre
Cluster      #Observations      Average distance in cluster

Cluster-1 25 1.414

Cluster-2 15 0.879

Cluster-3 33 1.577

Cluster-4 22 1.499

Cluster-5 44 0.903

Cluster-6 61 0.346

Overall 200 0.972
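The share of the population covered by the three largest clusters can be checked directly from the counts in the table:

```python
sizes = {1: 25, 2: 15, 3: 33, 4: 22, 5: 44, 6: 61}  # observations per cluster

total = sum(sizes.values())                          # 200 records in the sample
share = sum(sizes[c] for c in (3, 5, 6)) / total
print(f"clusters 3, 5, 6 cover {share:.0%} of the patients")  # 69%
```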


Fig 5.1: composition of clusters

As is evident from Fig 5.1, clusters 3, 5 and 6 represent almost 70% of the population, which
is a majority. So our further analysis focuses on these three clusters.

The cluster division containing age, gender, state and district is presented in graphical form to
enable easy comprehension.


Thus, from Figure 5.2, it is found that clusters 3, 5 and 6 are dominated by male patients. This
means that male patients are more prone to Coronary Artery Disease (CAD).

Fig 5.2: Gender wise composition of clusters

Fig 5.3: Age wise composition of clusters

Thus, from Figure 5.3, it is observed that cluster 5 and cluster 6 are dominated by the age
group 35-60, i.e. adults, representing almost 70% of the population, whereas cluster 3 has
patients from the age group 0-17, representing 16% of the population. This means that adult
patients are prone to coronary artery disease.

Fig 5.4: State & District wise composition of clusters

Thus, from Fig 5.4, it is observed that in all three clusters the predominant states are KERALA,
KARNATAKA, TAMIL NADU and WEST BENGAL. Coronary artery disease was found to
be the most predominant in all the clusters. Cluster 6 has the highest percentage of patients
coming from Kerala, and the majority fall into the coronary artery disease category. In Kerala,
the districts with the majority of coronary artery disease patients are MALAPPURAM
and KANNUR, whereas in Karnataka the majority of such patients are
from BANGALORE.



CHAPTER 6
SUMMARY & CONCLUSION

SUMMARY
The complete analysis through clustering leads to the conclusion that the majority of patients
come from the states KERALA, TAMIL NADU, WEST BENGAL and
ANDHRA PRADESH. The districts to be focused on for providing better healthcare are
MALAPPURAM, KANNUR, BANGALORE, SALEM and TIRUVALLUR. Patients
in the age group 35-60 are more prone to Coronary Artery Disease (CAD). The above
cluster demographics hold for both male and female patients; however, the overall
population is dominated by male patients.

CONCLUSION

• The majority of patients come from the states KERALA, TAMIL NADU,
WEST BENGAL and ANDHRA PRADESH. So the hospital
administration should focus on these states to find out the reasons behind
the occurrence of coronary artery disease and take preventive action. For
example, they can collaborate with the Sri Sathya Sai mobile hospital, which has a
team of specialist doctors and provides healthcare to patients at their
doorstep, to better understand the reasons for the occurrence of
Coronary Artery Disease. The hospital authorities can also make use of the
telemedicine facility for the same purpose.

• The districts to be focused on for providing better healthcare are
MALAPPURAM, KANNUR, BANGALORE, SALEM and TIRUVALLUR.
The hospital authorities can discuss this with the medical practitioners of
these places.

• Patients in the age group 35-60 are more prone to Coronary Artery
Disease (CAD). So hospital authorities should identify the risk factors most
prevalent for CAD in adults and take corrective action to reduce the risk and
thus the number of CAD patients in future.

• The overall population is dominated by male patients.

• This gives us an opportunity to explore collaboration with Sri Sathya Sai
health camps and other organizations to provide healthcare services to those
who need it most and cannot afford it due to low income.

• The hospital administration can extend its collaboration with the Sri Sathya Sai
Village Integrated Program (SSSVIP) to provide health education through the
various health-awareness programmes conducted by SSSVIP.

• Health education should emphasize cleanliness and a healthy lifestyle.
REFERENCES

1. Andrássoyá, E., Paralič, J., “Knowledge discovery in databases - a comparison of


different views”, Presented at the 10th International Conference on Information and
Intelligent Systems, Sept. 1999, Varazdin, Croatia.
2. Berson, A., Smith, S., and Thearling, K., (2004) “Building Data Mining Applications
for CRM”, Tata McGraw-Hill Publishing Company Limited.
3. Cintron, R., “Shared Goals Demand- Shared Data”, Health Management Technology,
December 2004.

4. Edelstein, H., “Data Mining: Are We There Yet?”, Computing Science and Statistics,
34, Proceedings, 2002.

5. Hand, D. J., Bolton, R. J., “Pattern Discovery and Detection: A Unified Statistical
Methodology”, Journal of Applied Statistics, Oct 2004, Vol. 31 Issue 8, p885-924.
6. IBM white paper, “Data management solutions”, April 1996.
7. Jonathan C. P. et al., “Medical Data Mining: Knowledge Discovery in a Clinical Data
Warehouse” AMIA 1997.
8. Keating, B., “Data Mining: What Is It and How Is It Used?”, Journal of Business
Forecasting, Fall 2008, Vol. 27 Issue 3, p33-35.
9. Kumar, V., et al., “Outlier Mining in Medical Databases: An Application of Data
Mining in Health Care Management to Detect Abnormal Values Presented In Medical
databases” International Journal of Computer Science and Network Security, VOL.8
No.8, August 2008.

10. Kuonen, D., “A Statistical Perspective of Data Mining”, Dec 2, 2004, CRM Zine,
Vol. 48.
11. Lamont, J., “Data drives decision-making in healthcare”, KM World, Mar 2010, Vol.
19 Issue 3, p12-14.
12. Milley, A., “Healthcare and Data Mining”, Health Management Technology,
Aug 2000, Vol. 21 Issue 8, p44.
13. Obenshain K. M., “Application of Data Mining Techniques to Healthcare Data”,

Statistics for Hospital Epidemiology, Aug 2004, Vol. 25 No. 8.


14. Ranjan, J., “Applications of data mining techniques in pharmaceutical industry”
Journal of Theoretical and Applied Information Technology 2005 – 2007.
15. Robert D. S. et al., “Scalable data mining” Building, using, and managing the data
warehouse, 1997.
16. Sibbritt D., “The Effective Use of a Summary Table and Decision Tree
Methodology to Analyze Very Large Healthcare Datasets”, Health
Care Management Science 7, 163–171, 2004.
17. Silver, M., “How to Apply Data Mining Techniques in a Healthcare Data Warehouse”
Journal of healthcare information management, vol. 15, no. 2, summer 2001.
18. Srinivas, K. et al., “Applications of Data Mining Techniques in Healthcare and
Prediction of Heart Attacks”, International Journal on Computer Science and
Engineering Vol. 02, No. 02, 2010, 250-255.
19. Warner H., Jr., “Mining the gems”, Health Management Technology, Oct 2001, Vol.
22 Issue 10, p30.

WEBSITES:

• www.thearling.com
• www.wikipedia.com
AGE CODE  GENDERCODE  STATES  DISTRICTS  Final diagnosis
1 1 16 223 2

Appendix
1 1 5 296 2
1 2 12 355 2
1 2 16 328 2
1 2 15 66 2
1 1 16 185 2

Table 5.2 Training data set
1 1 15 55 2
1 2 5 296 2
1 1 16 261 2
1 2 15 150 2
1 1 5 218 2
1 2 3 291 2
1 2 7 15 1
1 2 9 81 1
1 2 21 113 2
1 1 16 223 2
1 1 15 65 2
1 2 16 206 2
1 2 21 51 2
1 1 5 356 2
1 2 15 150 2
1 2 6 12 2
1 2 5 258 2
1 2 15 337 2
1 2 15 55 2
1 1 15 93 2
1 1 7 205 2
1 2 15 25 2
1 2 5 356 2
1 1 15 105 2
1 2 15 112 2
1 2 16 223 2
1 1 16 355 2
1 1 16 328 2
1 2 5 117 2
1 2 7 10 5
1 2 17 301 2
1 2 7 15 2
1 1 16 335 2
1 1 15 55 2
1 2 5 332 2

Table 5.3 Validation data set

AGE CODE  GENDERCODE  STATES  DISTRICTS  Final diagnosis
1 1 5 97 1
1 2 15 55 3
1 1 16 210 1
1 2 5 111 3
1 2 15 55 1
1 1 15 203 3
1 2 17 205 2
1 2 16 210 2
1 2 5 285 2
1 2 16 210 2
1 2 21 195 2
1 1 5 218 2
2 2 16 328 2
1 1 16 185 2
2 1 3 305 1
1 1 5 218 3
2 1 15 75 2
1 2 8 318 2
2 2 20 257 2
1 1 16 185 3
2 2 5 356 2
1 1 7 251 3
2 2 15 25 2
2 1 21 68 3
2 2 7 57 3
3 2 1 100 3
3 2 21 68 3
1 2 21 161 2
1 2 21 52 2
1 2 15 112 2
1 2 1 198 2
1 2 23 131 2
1 1 1 211 2
1 2 5 122 2
1 1 5 258 2
1 2 3 193 2
1 2 16 206 2
1 1 15 55 2
1 2 6 33 2
1 2 1 95 2
1 2 21 165 2
1 1 7 76 2
1 2 16 210 2
1 2 7 15 2
1 2 7 15 2
1 1 21 255 2
1 2 6 26 2
1 2 12 283 2
1 2 15 308 2
1 2 7 155 2
1 2 25 11 2
1 2 15 203 2
1 2 5 331 2
1 1 15 150 2
1 2 15 55 2
1 2 7 21 2
1 2 16 261 2
1 1 5 325 2
1 2 7 239 2
1 1 12 71 2
1 2 5 122 2
1 2 12 130 2
1 2 15 203 2
1 1 15 93 2
1 2 7 15 2
1 2 1 272 2
1 1 7 76 3
1 2 21 51 2
1 1 7 56 2
1 1 16 206 2
1 2 5 296 2
1 2 15 65 2
1 1 3 193 2
1 1 9 65 2
1 1 7 239 2
1 2 7 21 2
1 1 7 57 2
1 2 5 356 2
1 2 15 337 2
1 2 16 210 2
1 2 9 103 2
1 1 16 39 2
1 1 5 330 2
1 1 1 152 2
1 1 5 356 2
1 1 15 55 2
1 1 15 150 2
1 2 15 112 3
1 1 15 75 3
1 1 6 38 3
1 2 5 122 2
1 2 15 65 2
1 2 7 231 3
1 2 5 332 3
2 1 17 166 2
2 1 15 250 3
2 1 7 156 2
2 1 15 150 3
2 1 5 325 2
2 2 7 231 2
2 2 15 65 3
2 2 15 250 2
2 2 5 122 3
2 2 9 267 3
2 2 5 356 2
2 2 1 157 2
2 2 15 65 2
2 2 21 59 2
2 1 5 331 3
2 2 5 326 3
2 2 5 189 3
2 2 15 55 5
2 2 5 356 3
2 1 7 231 3
2 2 16 355 3
2 1 1 353 3
2 2 15 57 2
2 2 16 261 2
2 2 6 232 2
2 1 5 218 3
2 1 15 66 3
2 2 7 76 3
2 1 21 255 3
2 2 16 328 5
3 1 15 55 2
3 2 15 128 2
3 2 6 53 3
3 1 1 30 3
3 2 15 112 3
3 1 16 185 3
3 2 15 55 2
3 1 15 150 3
3 1 15 190 3
3 2 15 203 3
3 2 15 150 1
3 2 7 277 2
3 2 9 276 3
3 2 15 55 5
3 2 15 337 2
3 1 7 17 3
3 1 15 55 3
3 1 16 185 3
3 1 5 326 3
3 2 15 55 5
3 1 2 169 3
3 1 7 168 3
3 1 1 30 3
3 1 7 57 3
3 1 15 55 2
3 2 7 168 3
3 1 7 76 3
3 2 16 185 3
3 1 1 351 3
3 2 16 278 3
3 1 1 358 2
3 1 15 150 3
3 2 7 105 3
3 2 16 355 3
3 2 16 210 3
3 1 15 95 3
3 1 5 350 3
3 1 15 337 3
3 1 17 325 3
3 1 3 310 3
5 2 1 256 1
5 1 15 1 2
5 1 7 57 3
5 1 15 105 3
5 2 7 155 3
5 1 21 201 5
5 2 20 257 5
5 2 15 65 1
5 1 5 356 5
5 2 17 55 1
5 2 3 351 1
5 1 1 9 1
5 2 17 79 3
5 2 5 296 3
5 2 15 112 3
5 2 16 83 3
5 1 15 337 5
5 1 5 325 5
5 2 5 97 1
5 2 15 308 1
5 2 15 190 1
5 1 15 55 3
5 2 1 9 1
5 2 7 56 1
5 1 17 259 2
5 2 5 111 3
5 2 7 231 3
5 2 16 261 1
5 2 15 226 1
5 1 15 112 1
5 2 7 56 3
5 1 15 339 3
5 2 15 77 3
5 2 1 211 1
5 2 5 19 1
5 1 7 205 1
5 2 21 165 1
5 2 7 15 1
5 2 1 188 1
5 2 15 308 1
5 2 7 10 3
5 2 7 239 3
5 2 15 279 3
5 2 23 36 3
5 2 1 259 1
5 2 15 207 1
5 2 5 122 1
5 2 16 335 1
5 2 16 223 5
5 2 16 261 1
5 2 5 185 1
5 2 15 137 1
5 2 17 275 1
5 1 21 50 1
5 2 15 55 1
5 2 7 76 1
5 2 16 355 1
5 1 15 55 2
5 1 16 185 3
5 2 1 198 3
5 2 17 299 1
5 2 15 105 1
5 1 15 55 1
5 2 16 191 3
5 1 15 55 5
5 2 7 277 1
5 2 5 296 1
5 2 15 250 1
5 2 1 100 1
5 1 7 57 1
5 2 7 156 1
5 2 16 39 1
5 2 15 250 1
5 2 1 9 1
5 2 5 258 1
5 2 5 97 1
5 2 5 296 1
5 2 7 239 1
5 2 15 55 1
5 2 21 101 1
5 1 17 325 1
5 1 1 157 1
5 2 16 223 1
5 2 5 97 1
5 1 15 66 1
5 2 5 313 1
5 2 16 209 1
5 2 16 39 1
5 2 15 18 3
5 2 1 213 1
5 1 15 55 1
5 2 16 328 1
5 2 16 336 1
5 2 16 185 1
5 2 9 307 1
5 2 5 122 1
5 2 16 223 1
5 2 16 261 1
5 2 1 272 3
5 2 5 89 1
5 2 17 205 1
5 2 16 355 1
5 1 15 18 1
5 2 5 296 1
5 1 15 55 1
5 2 5 326 1
5 2 15 55 1
5 2 15 55 1
5 2 5 97 1
5 1 16 223 1
5 1 15 55 1
5 2 16 210 1
5 2 16 185 3
5 1 15 55 1
5 2 7 155 1
5 2 21 50 1
5 2 5 19 1
5 2 16 210 1
5 1 5 122 1
5 2 15 190 1
5 2 7 15 1
5 2 21 165 1
5 2 7 15 1
5 2 16 223 1
5 2 16 210 1
5 1 15 55 1
5 2 7 156 1
5 1 5 89 1
5 1 5 97 1
5 2 15 55 1
5 2 7 168 1
5 2 5 185 1
5 1 15 308 1
5 2 1 303 1
5 1 10 323 1
5 2 15 55 1
5 2 5 332 1
5 2 16 210 1
5 2 5 218 1
5 2 15 226 1
5 2 16 83 1
5 2 16 83 1
5 1 15 65 3
5 2 7 15 5
5 2 1 272 1
5 1 7 155 1
5 2 16 355 1
5 2 7 155 1
5 2 16 185 1
5 2 7 15 1
5 2 15 55 1
5 2 17 237 1
5 2 1 319 1
5 1 5 335 5
5 1 15 250 1
5 2 15 308 1
5 2 16 223 1
5 2 16 355 1
5 2 1 152 1
5 2 1 9 1
5 2 17 316 1
5 2 3 91 1
5 1 7 76 1
5 2 16 83 1
5 2 15 250 1
5 1 1 52 1
5 2 15 105 1
5 1 15 203 1
5 2 16 223 1
5 2 15 159 1
5 2 16 223 1
5 2 7 15 1
5 2 5 359 1
5 2 16 223 1
5 1 5 296 1
5 2 16 185 1
5 2 7 231 1
5 2 7 57 1
5 2 15 339 1
5 2 16 261 1
5 2 15 203 1
5 1 15 55 1
5 2 16 206 1
5 2 5 89 1
5 1 16 336 1
5 2 5 212 1
5 2 5 189 1
5 2 15 337 1
5 1 5 97 1
5 2 1 100 1
5 2 16 191 1
5 1 12 253 1
5 2 16 83 1
5 2 3 200 1
5 2 16 206 1
5 2 16 335 1
5 2 16 185 1
5 2 7 155 1
5 2 17 166 1
5 1 15 112 1
5 2 16 83 1
5 2 5 350 1
5 2 15 55 1
5 2 16 83 1
5 2 7 225 1
5 1 16 185 1
5 2 15 55 1
5 1 15 128 1
5 2 15 65 1
5 2 1 351 1
5 2 15 25 1
5 2 16 121 1
5 2 5 218 1
5 2 21 101 1
5 1 16 335 1
5 2 5 97 1
5 2 3 107 1
5 2 16 261 1
5 2 5 111 1
5 2 7 155 1
5 2 16 336 1
5 2 1 272 1
5 2 15 55 1
5 2 17 263 1
5 2 15 55 1
5 2 16 185 1
5 2 1 351 1
5 2 16 210 1
5 2 16 185 1
5 2 16 261 1
5 2 16 185 1
5 2 15 279 1
5 2 7 225 1
5 2 15 308 1
5 2 5 122 1
5 2 1 95 1
5 2 15 202 1
5 2 1 52 1
5 2 5 296 1
5 2 3 67 1
5 2 7 168 1
5 2 16 185 1
5 2 5 122 1
5 2 15 65 1
5 2 9 238 1
5 2 9 300 1
5 2 7 239 5
5 2 16 355 1
5 2 16 210 1
5 2 15 66 1
5 2 7 205 5
5 2 7 57 1
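Tables 5.2 and 5.3 partition the coded patient records into training and validation sets. A random holdout split of that kind can be sketched as follows; this is an illustrative sketch only, and the actual partition used in the study is the one listed in the tables.

```python
import random

def holdout_split(records, validation_fraction=0.2, seed=42):
    """Shuffle the coded records and carve off a validation set,
    keeping the rest for training the classification tree."""
    rows = list(records)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(rows)
    cut = int(len(rows) * (1 - validation_fraction))
    return rows[:cut], rows[cut:]

# Each record: (AGE CODE, GENDERCODE, STATES, DISTRICTS, Final diagnosis)
sample = [(1, 1, 16, 223, 2), (1, 2, 12, 355, 2), (5, 2, 7, 15, 1),
          (3, 1, 15, 55, 2), (2, 2, 15, 65, 3)]
train, validate = holdout_split(sample, validation_fraction=0.2)
print(len(train), len(validate))  # → 4 1
```

Fixing the random seed makes the training/validation assignment repeatable, which matters when the same split must back both the tree-growing and the error figures reported later.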

Table 5.4 Training log

# Decision Nodes    % Error
0 52
1 17.5
2 15
3 15
4 15
5 12.5
6 12.5
7 12.5
8 12.5
9 12.5
10 12
11 12
12 12
13 12
14 12
15 12
16 12
17 12
18 12
19 12
20 11.5
21 11.5
22 11.5
23 11.5
24 11.5
25 11.5
26 11.5
27 11.5
28 11
29 11
30 11
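The training log above pairs tree size with misclassification error, and the error largely plateaus well before the full 30-node tree. One common way to use such a log is to pick the smallest tree whose error is within a small tolerance of the best error observed; the rule below is an illustrative heuristic, not necessarily the selection procedure the mining software applied.

```python
# (decision nodes, % error) pairs taken from a few rows of Table 5.4
training_log = [(0, 52.0), (1, 17.5), (2, 15.0), (3, 15.0), (5, 12.5),
                (10, 12.0), (20, 11.5), (28, 11.0), (30, 11.0)]

def select_tree_size(log, tolerance=0.5):
    """Return the smallest tree whose error is within `tolerance`
    percentage points of the minimum error in the log."""
    best_error = min(err for _, err in log)
    for nodes, err in sorted(log):  # consider the smallest trees first
        if err <= best_error + tolerance:
            return nodes
    # unreachable: the minimum-error entry always satisfies the bound

print(select_tree_size(training_log))  # → 20
```

With a 0.5-point tolerance the rule trades 28 nodes for 20 at almost no loss in accuracy, which is the usual motivation for pruning a full-grown tree.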
Fig 5.1 Classification Tree - Using Training Dataset

Fig 5.2 Classification Tree - Using Validation Dataset
Table 5.7 Full Tree Rules (Using Training Data)
#Decision Nodes 30    #Terminal Nodes 31

Level  Node ID  Parent ID  Split Variable  Split Value  Cases  Left Child  Right Child  Class  Node Type
0 0 N/A AGE CODE 3.5 200 1 2 1 Decision
1 1 0 AGE CODE 1.5 75 3 5 2 Decision
1 2 0 GENDERCODE 1.5 126 5 6 1 Decision
2 3 1 STATES 10.5 57 7 8 2 Decision
2 5 1 GENDERCODE 1.5 27 9 10 3 Decision
2 5 2 STATES 6 25 11 12 1 Decision
2 6 2 STATES 10.5 101 13 15 1 Decision
3 7 3 STATES 6.5 17 15 16 2 Decision
3 8 3 STATES 16.5 30 17 18 2 Decision
3 9 5 AGE CODE 2.5 9 19 20 2 Decision
3 10 5 STATES 12 18 21 22 3 Decision
3 11 5 AGE CODE 5.5 8 23 25 1 Decision
3 12 5 STATES 11 17 25 26 1 Decision
3 13 6 STATES 6.5 57 27 28 1 Decision
3 15 6 STATES 15.5 55 29 30 1 Decision
5 15 7 GENDERCODE 1.5 12 31 32 2 Decision
5 16 7 N/A N/A 5 N/A N/A 1 Terminal
5 17 8 STATES 15.5 25 33 35 2 Decision
5 18 8 N/A N/A 5 N/A N/A 2 Terminal
5 19 9 N/A N/A 3 N/A N/A 3 Terminal
5 20 9 STATES 12 6 35 36 2 Decision
5 21 10 STATES 6 11 37 38 3 Decision
5 22 10 STATES 18.5 7 39 50 3 Decision
5 23 11 STATES 3 6 51 52 1 Decision
5 25 11 N/A N/A 2 N/A N/A 1 Terminal
5 25 12 N/A N/A 3 N/A N/A 1 Terminal
5 26 12 STATES 15.5 15 53 55 1 Decision
5 27 13 STATES 2 30 55 56 1 Decision
5 28 13 N/A N/A 17 N/A N/A 1 Terminal
5 29 15 AGE CODE 5.5 22 57 58 1 Decision
5 30 15 N/A N/A 32 N/A N/A 1 Terminal
5 31 15 N/A N/A 5 N/A N/A 2 Terminal
5 32 15 N/A N/A 8 N/A N/A 2 Terminal
5 33 17 GENDERCODE 1.5 15 59 50 2 Decision
5 35 17 GENDERCODE 1.5 11 51 52 2 Decision
5 35 20 N/A N/A 5 N/A N/A 2 Terminal
5 36 20 N/A N/A 2 N/A N/A 2 Terminal
5 37 21 N/A N/A 3 N/A N/A 3 Terminal
5 38 21 STATES 8 8 53 55 3 Decision
5 39 22 AGE CODE 2.5 5 55 56 2 Decision
5 50 22 N/A N/A 2 N/A N/A 3 Terminal
5 51 23 N/A N/A 2 N/A N/A 1 Terminal
5 52 23 N/A N/A 5 N/A N/A 1 Terminal
5 53 26 N/A N/A 10 N/A N/A 1 Terminal
5 55 26 N/A N/A 5 N/A N/A 1 Terminal
5 55 27 N/A N/A 10 N/A N/A 1 Terminal
5 56 27 STATES 3.5 20 57 58 1 Decision
5 57 29 STATES 15.5 17 59 60 1 Decision
5 58 29 N/A N/A 5 N/A N/A 1 Terminal
6 59 33 N/A N/A 5 N/A N/A 2 Terminal
6 50 33 N/A N/A 9 N/A N/A 2 Terminal
6 51 35 N/A N/A 8 N/A N/A 2 Terminal
6 52 35 N/A N/A 3 N/A N/A 2 Terminal
6 53 38 N/A N/A 6 N/A N/A 3 Terminal
6 55 38 N/A N/A 2 N/A N/A 3 Terminal
6 55 39 N/A N/A 2 N/A N/A 2 Terminal
6 56 39 N/A N/A 3 N/A N/A 3 Terminal
6 57 56 N/A N/A 2 N/A N/A 1 Terminal
6 58 56 N/A N/A 18 N/A N/A 1 Terminal
6 59 57 N/A N/A 5 N/A N/A 1 Terminal
6 60 57 N/A N/A 13 N/A N/A 1 Terminal
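Each row of a rule table like Table 5.7 defines either a decision node (a case goes to the left child if its split variable is at or below the split value, otherwise to the right child) or a terminal node carrying a class. Classifying a record therefore means walking it from the root down to a terminal. The sketch below shows that traversal on a small hypothetical rule table; the node IDs and lower splits are illustrative stand-ins (only the root splits mirror Table 5.7), since the printed IDs are not fully legible.

```python
# Minimal rule table in the spirit of Table 5.7: each node is either a
# decision (split variable, split value, children) or a terminal class.
# IDs and lower splits are illustrative, not copied from the table.
rules = {
    0: {"type": "decision", "var": "AGE CODE", "value": 3.5, "left": 1, "right": 2},
    1: {"type": "terminal", "class": 2},
    2: {"type": "decision", "var": "GENDERCODE", "value": 1.5, "left": 3, "right": 4},
    3: {"type": "terminal", "class": 1},
    4: {"type": "terminal", "class": 3},
}

def classify(record, rules, node_id=0):
    """Walk a record (dict of variable -> coded value) down the rule table
    until a terminal node is reached, then return its class."""
    node = rules[node_id]
    while node["type"] == "decision":
        branch = "left" if record[node["var"]] <= node["value"] else "right"
        node = rules[node[branch]]
    return node["class"]

print(classify({"AGE CODE": 5, "GENDERCODE": 2}, rules))  # → 3
print(classify({"AGE CODE": 1, "GENDERCODE": 1}, rules))  # → 2
```

Encoding the tree as a flat ID-keyed table, exactly as the software exports it, makes the traversal a simple loop and keeps the printed rules and the executable model in the same shape.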

Table 5.8 User Specified Tree rules

#Decision Nodes 28    #Terminal Nodes 29

Level  NodeID  ParentID  SplitVar  SplitValue  Cases  LeftChild  RightChild  Class  NodeType
0 0 N/A AGE 52.5 500 1 2 1 Decision
1 1 0 AGE 15.5001 165 3 5 2 Decision
1 2 0 DISTRICTS 185.5 236 5 6 1 Decision
2 3 1 DISTRICTS 115 70 7 8 2 Decision
2 5 1 AGE 32.5 95 9 10 3 Decision
2 5 2 N/A N/A 119 N/A N/A 1 Terminal
2 6 2 STATES 8 117 11 12 1 Decision
3 7 3 DISTRICTS 78.5996 28 13 15 2 Decision
3 8 3 N/A N/A 52 N/A N/A 2 Terminal
3 9 5 DISTRICTS 61.0002 62 15 16 2 Decision
3 10 5 AGE 38 32 17 18 3 Decision
3 11 6 AGE 55.5 55 19 20 1 Decision
3 12 6 DISTRICTS 193.5 73 21 22 1 Decision
5 13 7 DISTRICTS 19.9999 21 23 25 2 Decision
5 15 7 AGE 8.5 7 25 26 1 Decision
5 15 9 GENDERCODE 1.5 15 27 28 2 Decision
5 16 9 DISTRICTS 191.5 57 29 30 3 Decision
5 17 10 AGE 35 18 31 32 3 Decision
5 18 10 DISTRICTS 236.5 15 33 35 3 Decision
5 19 11 AGE 56 25 35 36 1 Decision
5 20 11 DISTRICTS 350.5002 19 37 38 1 Decision
5 21 12 AGE 57 10 39 50 1 Decision
5 22 12 N/A N/A 63 N/A N/A 1 Terminal
5 23 13 N/A N/A 5 N/A N/A 2 Terminal
5 25 13 N/A N/A 17 N/A N/A 2 Terminal
5 25 15 N/A N/A 5 N/A N/A 1 Terminal
5 26 15 N/A N/A 3 N/A N/A 2 Terminal
5 27 15 N/A N/A 7 N/A N/A 2 Terminal
5 28 15 N/A N/A 8 N/A N/A 2 Terminal
5 29 16 DISTRICTS 131 23 51 52 3 Decision
5 30 16 DISTRICTS 263.9999 25 53 55 2 Decision
5 31 17 STATES 11.5 2 55 56 3 Decision
5 32 17 N/A N/A 16 N/A N/A 3 Terminal
5 33 18 AGE 50.5 12 57 58 3 Decision
5 35 18 AGE 50.5 2 59 50 5 Decision
5 35 19 N/A N/A 5 N/A N/A 1 Terminal
5 36 19 N/A N/A 20 N/A N/A 1 Terminal
5 37 20 N/A N/A 15 N/A N/A 5 Terminal
5 38 20 N/A N/A 5 N/A N/A 1 Terminal
5 39 21 N/A N/A 5 N/A N/A 1 Terminal
5 50 21 N/A N/A 5 N/A N/A 1 Terminal
6 51 29 N/A N/A 10 N/A N/A 3 Terminal
6 52 29 DISTRICTS 162.5 13 51 52 3 Decision
6 53 30 STATES 15.5 9 53 55 2 Decision
6 55 30 AGE 21.5 15 55 56 3 Decision
6 55 31 N/A N/A 1 N/A N/A 3 Terminal
6 56 31 N/A N/A 1 N/A N/A 2 Terminal
6 57 33 N/A N/A 5 N/A N/A 1 Terminal
6 58 33 N/A N/A 7 N/A N/A 3 Terminal
6 59 35 N/A N/A 2 N/A N/A 5 Terminal
6 50 35 N/A N/A 0 N/A N/A 1 Terminal
7 51 52 N/A N/A 7 N/A N/A 1 Terminal
7 52 52 N/A N/A 6 N/A N/A 3 Terminal
7 53 53 N/A N/A 7 N/A N/A 2 Terminal
7 55 53 N/A N/A 2 N/A N/A 2 Terminal
7 55 55 N/A N/A 5 N/A N/A 3 Terminal
7 56 55 N/A N/A 11 N/A N/A 2 Terminal