MASTER OF BUSINESS ADMINISTRATION

By

ABHINAV BHORKAR
(Regd. No. 09451)

……………………………
Dr. Ramaier Sriram
Project Guide

……………………………
Abhinav Bhorkar
Regd. No. 09451

……………………………
A. Sudhir Bhaskar
Professor and Dean
School of Business Management

Prashanthi Nilayam
20th December 2010
ACKNOWLEDGEMENTS
I express my deep and sincere gratitude to my Lord and Master, Bhagawan Sri Sathya Sai
Baba whose Divine grace made this work possible.
I thank my guide Prof. Ramaier Sriram who gave me all the support, confidence and guidance
at all stages of the study.
I thank Dr. Subramanian S. who helped me during the analysis of the study.
I thank Shri Renju Raghuveeran, Shri Piyush Shrivastava and Shri Prakash Chittranjan, whose
tireless efforts have enabled us to use one of the best computer centers.
I am also thankful to the Ganesh Xerox Centre for their services rendered to me at all times.
I thank all my classmates for their support. I also thank Sri G. V. R. Subbarao from SSSIHMS,
Whitefield, Bangalore for providing me with all the necessary data.
Last but not least, I am immensely grateful to my family for their guidance, immense love
and blessings.
ABHINAV BHORKAR
Healthcare organizations are rich in clinical data, but limited access to the knowledge hidden in
that data restricts their ability to provide a quality healthcare service and satisfy their patients.
They therefore need to analyze their data to find meaningful information which will help them to
provide better healthcare services.
An enormous amount of data is recorded every second, which can be mined to extract
interesting relationships that are very important to the organizations. This study showcases the
use of data mining tools to discover these hidden relationships.
The software used for this purpose is XLMINER, an Excel-based data mining add-in.
The techniques of classification and clustering are illustrated by applying them to the healthcare
dataset.
CHAPTER – 1
INTRODUCTION
There is a wealth of data available within organizations, but the data's hidden value, the
relationships between its various attributes, is rarely determined. Merely collecting and storing
large amounts of data is a waste of resources; unless the data is used productively, the
organization is only increasing its overheads. Valuable knowledge can be discovered by applying
data mining techniques to the healthcare system, and business analysts can use the discovered
knowledge to improve the quality of service. The challenge is to extract relevant knowledge from
this data and act upon it in a timely manner.
The typical data mining process involves transferring data originally collected in production
systems into a data warehouse, cleaning or scrubbing the data to remove errors and check for
consistency of formats, and then searching the data using statistical queries, neural networks, etc.
(Jonathan C. Prather et al., 1997). With the increasing availability of healthcare data there is an
increasing need to use methods which are capable of dealing with missing data, allow integrating
data from various sources and discover relationships between the various attributes of the
dataset.
The objective of this study is to analyze a tertiary healthcare dataset to find relevant knowledge
and to enhance the understanding of disease progression and management. There are many
people who do not have access to healthcare services, and this study discusses the feasibility of
extending healthcare services to those who need them most.
This study focuses on clinical and demographic dataset obtained from Sri Sathya Sai Institute of
Higher Medical Sciences (SSSIHMS), Whitefield, Bangalore.
Chapter 1: Introduction: gives an overview of data mining and its support in healthcare. It also
includes the statement of the problem and the objective and scope of the study.
Chapter 2: Literature Review: sheds light on the literature in the area of data mining, its
application in various fields including healthcare and CRM, data mining software, and its
challenges.
Chapter 3: Theoretical Background: gives an understanding of basics of data mining, its
developmental history, technological elements, different methods and their application in
healthcare.
Chapter 4: Methodology and Design: explains the methodology adopted for the study. It
includes the nature and objectives of the study, as well as collection and treatment of data. It
also includes the limitations of the study.
Chapter 5: Data Analysis: The data mining techniques of classification and clustering are
applied to the healthcare dataset, and a model is built that will be useful to healthcare
administrators for efficient decision making.
Chapter 6: Summary and Conclusions: gives a brief sketch of the whole study and presents
conclusions of the findings.
CHAPTER-2
LITERATURE REVIEW
Clinical trials, electronic patient records and computer-supported disease management
produce huge amounts of clinical data. This data is a strategic resource for healthcare providers.
The Sri Sathya Sai Institute of Higher Medical Sciences is not merely a modern multi-super-
specialty hospital with the best medical and surgical care. The total cost of the treatment is borne
by the hospital, and not a penny is charged to the patient. To achieve this noble objective they
need to take several decisions to cure the patient not merely in body but in mind and spirit as
well. Is there really a need for such intelligence techniques in a non-profit organization? Yes:
data mining not only helps to increase profits but also helps in identifying different patterns
useful in taking various decisions. This chapter reviews the applications of data mining in healthcare
for both inpatients and outpatients.
Data mining and forecasting models are quite different. There are 4 major tools of data mining,
namely prediction, classification, clustering analysis and association rules discovery (Keating,
Barry 2008). Data Mining uses well-established statistical and machine learning techniques to
build models that predict customer behavior for value creation. Value creation can be done
using campaign management technologies: first, identify market segments containing
customers or prospects with high profit potential and, second, build and execute campaigns that
favorably impact the behavior of these individuals. Records can be scored when campaigns are
ready to run, allowing the use of the most recent data. "Fresh" data and the selection of "high"
scores within defined market segments improve direct marketing results. Also it leads to
increased accuracy through the elimination of manually induced errors. The Campaign
Management software determines which records to score and when. It also helps statisticians,
because less time is spent on the mundane tasks of extracting and importing files, leaving more
time for creative work: building and interpreting models (Kurt Thearling, February 1999).
Scalability also helps to build a good data mining model as quickly as possible: by taking
advantage of parallel database management systems and
additional CPUs, you can solve a wide range of problems without needing to change your
underlying data mining environment. You can work with more data, build more models, and
improve their accuracy by simply adding additional CPUs. For example, if in a few minutes
analysts can create a group of models and graph their forecasts, they can pick from a large
number of model possibilities. If they have to wait overnight for each model, they will take steps
to reduce the number of models investigated, such as studying smaller sampled databases and
including fewer columns. Not surprisingly, their results will be different and quite likely not as
useful (Robert D. Small; Herbert A. Edelstein, 1997).
The main task of data mining models is to discover some knowledge from the patterns by
continuously monitoring it with the help of algorithms. According to (Hand and Bolton 2004),
the data mining models have been made to focus on the algorithmic aspects of pattern discovery;
however, another fundamental part of data mining is concerned with detecting anomalies
amongst the vast mass of the data like the small deviations, unusual observations, unexpected
clusters of observations, or surprising blips in the data, which the model does not explain. Both
aspects can be covered by providing a theoretical base, linking the ideas to statistical work in
scan statistics, outlier detection, and other areas. One of the striking characteristics of work on
pattern discovery is that the ideas have been developed in several theoretical arenas, and also in
several application domains, with little apparent awareness of the fundamentally common nature
of the problem.
Data mining models include various data mining techniques for knowledge discovery from
huge databases. An IBM white paper (IBM, 1996) presented a variety of data mining techniques, and
explained how they can be used independently and cooperatively to extract high quality
information from databases. It explained how decision-makers can take advantage of high-return
opportunities in a timely fashion. They must be able to identify and utilize information hidden in
the collected data. Data mining models function at their best when the analyst has a complete
view of the targeted patient population and the providers with whom they work. This view
demands that all the data relating to the patient be gathered in one place. Once the healthcare
provider stores data at one location, it can make better use of predictive modeling and data
mining to more accurately identify patients, understand their needs, understand likely patient
cooperation levels and predict likely results. Then it can make better decisions about which
programs to pursue (diabetes, asthma or congestive heart failure) and how to work with
individual providers and patients in the targeted population (Rose Cintron 2004).
Data mining aids in the analysis of large databases and discovery of trends, patterns and
knowledge. Many applications in engineering design, manufacturing and logistics engineering
now incorporate data mining functions. Valuable knowledge can be discovered from application
of data mining techniques in the healthcare arena, viz. hospital infection control, ranking hospitals,
identifying high-risk patients, heart attack prediction, etc. Hospital infection control involves
computer-assisted surveillance research which focuses on identifying high-risk patients, expert
systems, and possible cases and detecting deviations in the occurrence of predefined events.
Hospitals and their healthcare plans can be ranked based on information reported by them. Data
mining techniques have been implemented to examine reporting practices used by hospitals.
Data mining techniques help in identifying patients who are tending toward a high-risk
condition. This information gives nurse care coordinators a head start in identifying high-risk
patients so that steps can be taken to improve the patients’ quality of healthcare and to prevent
health problems in the future (Mary K. Obenshain 2004). According to (Michael Silver et al.
2001), Data mining is used to make business decisions that can influence cost, revenue, and
operational efficiency while maintaining a high level of care. Typical problems that data mining
addresses are how to classify data, cluster data, find associations between data items, and
perform time series analysis. The analysis deals with various attributes such as age, financial
status, etc. The financial status shows a clear picture of losses. The main challenge in the
healthcare industry is to make quick decisions during the initial diagnosis. According to (Varun
Kumar et al. 2008), outlier detection & analysis have a great potential to find useful information
from health care databases which consequently helps decision makers to automate & quicken the
process of decision making in clinical diagnosis as well as other domains of health care
management. They applied data mining techniques to breast cancer health care databases so that
useful information in the form of novel & hidden knowledge patterns can be generated to
improve public health. They found that data mining techniques help uncover the valuable
knowledge hidden in such databases and aid decision makers in improving healthcare
services. Using medical profiles such as age, sex, blood pressure and blood sugar, data mining can predict
the likelihood of patients getting a heart disease. Data mining can be used for effective heart
attack prediction. Firstly, they have provided an efficient approach for the extraction of
significant patterns from the heart disease data warehouses. Then, for the efficient prediction of
heart attack based on the calculated significant weightage, the frequent patterns having value
greater than a predefined threshold were chosen for the valuable prediction of heart attack. They
defined five mining goals based on business intelligence and data exploration. The goals were
evaluated against the trained models. All these models could answer complex queries in
predicting heart attack (K.Srinivas et al. 2010).
The information flow in the pharmaceutical industry was relatively simple and the application of
technology was limited. Today, technology is increasingly being used to help pharmaceutical
firms manage their inventories and to develop new products and services. It helps pharmacy
firms to compete on lower costs while improving the quality of drug discovery and delivery
methods. A deep understanding of the knowledge hidden in the pharmacy data is vital to a firm’s
competitive position and organizational decision-making (Jayanthi Ranjan 2007). According to
(Anny Miley 2000), Data mining techniques can improve treatment procedure, increase patient
satisfaction and wipe out over prescription, bad practice and fraud. Data mining methods can
save time and money by making possible efficient deployment of medication, effective
treatment, speedy diagnosis and preventive measures. Patients will happily answer questions
which they feel are for their own benefit, such as how long they had to wait to see a doctor,
how effective their treatment was and how happy they were with the service provided; such
questions generate good, reliable data. Analyzing this data will help patients set reasonable
expectations about waiting times and benefit from the quality of service.
Now, the world is changing and all organizations are keeping pace with the new technologies.
Healthcare providers increasingly are able to mine patient data to help choose the best course of
treatment, and various software solutions are allowing them to fine-tune their claims process.
According to (Judith Lamont 2010), the growing availability of electronic medical records will
lead to increased evidence-based medicine and smarter healthcare. To achieve the highest rank
for patient satisfaction, organizations go for data-driven decision-making combined with their
physicians' skills, which increases the organization's image and develops trust among the people.
A lack of trust increases denials for services, which is a concern for healthcare providers.
They can use centralized mapping method in which all denial codes returned from payers are
mapped to a proprietary, centrally administered code set. When that code comes back in the
future, the mapping documentation contains a description of the issue and the steps a user can
take to fix the problem. This centralized mapping allows the organization to see a normalized
view of denials.
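The centralized mapping described above amounts to a lookup table from payer-specific denial codes to one internal code set carrying a description and fix steps. The sketch below is only an illustration: the payer names, denial codes, and descriptions are all invented.

```python
# Sketch of centralized denial-code mapping: payer-specific codes map to one
# internal, centrally administered code set. All codes/text are hypothetical.
DENIAL_MAP = {
    ("PayerA", "CO-16"): "D001",
    ("PayerB", "N290"):  "D001",   # different payer code, same internal issue
    ("PayerA", "PR-204"): "D002",
}
INTERNAL_CODES = {
    "D001": {"issue": "Missing claim information",
             "fix": "Resubmit with the required field completed"},
    "D002": {"issue": "Service not covered by plan",
             "fix": "Verify coverage before resubmission"},
}

def resolve(payer, code):
    """Normalize a payer denial code to the internal description and fix."""
    internal = DENIAL_MAP.get((payer, code))
    return INTERNAL_CODES.get(internal, {"issue": "Unmapped code",
                                         "fix": "Add a mapping entry"})

print(resolve("PayerB", "N290")["issue"])  # normalized view across payers
```

Because every payer code funnels into one internal code, reports grouped by internal code give the normalized view of denials the text describes.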
Healthcare provider organizations face rising financial pressures while analyzing large
amounts of clinical and financial data. A large clinical database could be
successfully warehoused and mined to identify clinical factors associated with preterm birth,
even on a preliminary dataset. Newly discovered relationships found in clinical databases such as
these will potentially lead to better understanding between observations and outcomes in
providing prenatal care and other fields of medicine (Jonathan C. Prather et al., 1997). In this
article they created a dataset for analysis by extracting and cleaning selected variables and mined
data using exploratory factor analysis. The dream of data mining from an integrated clinically
oriented warehouse has yet to become a reality. (Homer Warner 2001) argues that one of the
main reasons for failure in using data mining in healthcare is that the available data is too
complex for ad hoc queries from outside the IT department; key data is missing from the
reporting database or held in an unnormalized form. The most significant warehouse costs will be in
collecting, connecting and normalizing data and funneling that data to the warehouse.
Warehousing will need both commitment and direction from business and physician leaders to
succeed.
Very large datasets typically consist of millions of records, with many variables. Such datasets
are collected from both inpatients/outpatients and maintained by organizations because of the
perceived potential information they contain. However, the problem with very large datasets is
that traditional methods of data mining are not capable of retrieving this information because the
software may be overwhelmed by the memory or computing requirements. According to (David
Sibbritt et al. 2004), there is a method that can analyze very large datasets. The method initially
performs a data reduction step through the use of a summary table, which is then used as a
reference dataset that is recursively partitioned to grow a decision tree. When looking at a
decision tree, it may be misleading to assume that the variables used to create the tree
(split variables) are the only important predictors of the outcome variable. At a given node, there
may have been one or more variables that created subsets almost as homogeneous as those
created by the chosen variable. The method proposed in their article can not only be used for model
building purposes, but also for the additional purpose of reducing the dimension of the original
dataset, by indicating which variables can be ignored, so that traditional analysis methods can
then be used.
CHAPTER - 3
THEORETICAL BACKGROUND
3.1 INTRODUCTION
Data mining, the extraction of hidden predictive information from large databases, is a powerful
new technology with great potential to help companies focus on the most important information
in their data warehouses. Data mining tools predict future trends and behaviours, allowing
businesses to make proactive, knowledge-driven decisions. Data mining tools can answer
business questions that traditionally were very time consuming to resolve. The data mining
techniques and tools are equally applicable in fields ranging from law enforcement to radio
astronomy, medicine, and industrial process control. In fact, hardly any of the data mining
algorithms were first invented with commercial applications in mind. The commercial data miner
employs a wide range of techniques borrowed from statistics, computer science, and machine
learning research. The choice of a particular combination of techniques to apply in a particular
situation depends on the nature of the data mining task, the nature of the available data, and the
skills and preferences of the data miner. Data mining comes in two flavors—directed and
undirected. Directed data mining attempts to explain or categorize some particular target field
such as income or response. Undirected data mining attempts to find patterns or similarities
among groups of records without the use of a particular target field or collection of predefined
classes.
3.3.1 Classification
In order to understand and communicate about the world, we are constantly classifying,
categorizing, and grading. Classification consists of examining the features of a newly presented
object and assigning it to one of a predefined set of classes. The objects to be classified are
generally represented by records in a database table or a file, and the act of classification consists
of adding a new column with a class code of some kind. The classification task is characterized
by a well-defined definition of the classes, and a training set consisting of preclassified
examples. The task is to build a model of some kind that can be applied to unclassified data in
order to classify it.
Examples of classification tasks include:
• Classifying credit applicants as low, medium, or high risk
• Determining which phone numbers correspond to fax machines
Decision trees and nearest-neighbour techniques are well suited to classification. Neural
networks and link analysis are also useful in certain circumstances.
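As a concrete illustration of the nearest-neighbour idea, the sketch below classifies a new record by the class of its single closest training example. The credit-risk features and records are invented for illustration.

```python
# Minimal 1-nearest-neighbour classifier sketch on made-up credit-risk data.
import math

def nearest_neighbour(train, query):
    """Return the class label of the training record closest to the query.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    def dist(a, b):
        # Euclidean distance between two feature tuples
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    _, label = min(train, key=lambda rec: dist(rec[0], query))
    return label

# Hypothetical training set: (income in $k, debt in $k) -> risk class
train = [((80, 5), "low"), ((40, 30), "high"), ((60, 15), "medium")]
print(nearest_neighbour(train, (75, 8)))  # closest to (80, 5), so "low"
```

A real application would use many more records, normalized features, and typically k > 1 neighbours with majority voting, but the classification step is the same.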
3.3.2 Estimation
Classification deals with discrete outcomes: yes or no; measles, rubella, or chicken pox.
Estimation deals with continuously valued outcomes. Given some input data, estimation comes
up with a value for some unknown continuous variable such as income, height, or credit card
balance.
Examples of estimation tasks include:
• Estimating a family's total household income
• Estimating the lifetime value of a customer
Regression models and neural networks are well suited to estimation tasks. Survival analysis is
well suited to estimation tasks where the goal is to estimate the time to an event, such as a
customer stopping.
3.3.3 Prediction
Prediction is the same as classification or estimation, except that the records are classified according
to some predicted future behaviour or estimated future value. In a prediction task, the only way
to check the accuracy of the classification is to wait and see. The primary reason for treating
prediction as a separate task from classification and estimation is that in predictive modeling
there are additional issues regarding the temporal relationship of the input variables or predictors
to the target variable.
Any of the techniques used for classification and estimation can be adapted for use in prediction
by using training examples where the value of the variable to be predicted is already known,
along with historical data for those examples. The historical data is used to build a model that
explains the current observed behaviour. When this model is applied to current inputs, the result
is a prediction of future behaviour.
Examples of prediction tasks include:
• Predicting the size of the balance that will be transferred if a credit card prospect accepts a
balance transfer offer
3.3.4 Association Rules
Association Rule Mining involves searching for interesting relationships between items in a data
set. Market basket analysis is a good example of this model. An example of an association rule is
“Customers who buy computers tend to also buy financial software”. Since association rules are
not always interesting or useful, constraints are applied which specify the type of knowledge to
be mined such as specific dates of interest, thresholds on statistical measures (rule
interestingness, support, confidence), or other rules applied by end users (Han & Kamber, 2001).
Examples of association rules include:
• Retail chains plan the arrangement of items on store shelves or in a catalog so that items
often purchased together will be seen together.
• To identify cross selling opportunities and for groupings of product and services.
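The support and confidence measures mentioned above can be computed directly. The sketch below scores the rule "customers who buy computers tend to also buy financial software" against a tiny set of market baskets invented for illustration.

```python
# Support and confidence for one candidate association rule over a
# made-up market-basket dataset of four transactions.
baskets = [
    {"computer", "financial_software"},
    {"computer", "financial_software", "printer"},
    {"computer", "printer"},
    {"printer"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs):
    """Of transactions containing lhs, the fraction also containing rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule: {computer} -> {financial_software}
print(support({"computer", "financial_software"}))   # 0.5
print(confidence({"computer"}, {"financial_software"}))
```

An algorithm such as Apriori enumerates candidate itemsets efficiently, but the thresholding idea is exactly this: keep only rules whose support and confidence exceed the user-specified minima.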
3.3.5 Clustering
Clustering divides a database into different groups. The goal of clustering is to find groups that
are very different from each other, and whose members are very similar to each other. Unlike
classification (see Predictive Data Mining, below), you don’t know what the clusters will be
when you start, or by which attributes the data will be clustered. Consequently, someone who is
knowledgeable in the business must interpret the clusters. The clusters can be mutually exclusive,
hierarchical or overlapping. Each member of a cluster should be very similar to other members of
its cluster and dissimilar to members of other clusters. Clustering is the task of segmenting a
heterogeneous population into a number of more homogeneous subgroups, or clusters. What distinguishes
clustering from classification is that clustering does not rely on predefined classes.
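The clustering idea can be sketched with a minimal k-means (Lloyd's algorithm) on one-dimensional data; the patient ages and starting centers below are invented, and no class labels are used anywhere, which is exactly what distinguishes clustering from classification.

```python
# Minimal k-means on 1-D data (hypothetical patient ages, two clusters).
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm: assign points to the nearest center, recompute means."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # each new center is the mean of its cluster (kept if cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

ages = [21, 23, 25, 60, 62, 66]
print(kmeans_1d(ages, [20, 70]))  # settles on a young and an elderly group
```

The algorithm finds the two homogeneous subgroups on its own; a domain expert must then interpret them (here, plausibly "young" versus "elderly" patients).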
3.3.6 Profiling
Sometimes the purpose of data mining is simply to describe what is going on in a complicated
database in a way that increases our understanding of the people, products, or processes that
produced the data in the first place. A good enough description of behaviour will often suggest
an explanation for it as well. Decision trees are a powerful tool for profiling customers (or
anything else) with respect to a particular target or outcome. Association rules and clustering can
also be used to build profiles.
Decision trees are a data mining technology that has existed in a form very similar to today's for
almost twenty years, and early versions of the algorithms date back to the 1960s. These
techniques were often originally developed to help statisticians automate the process of
determining which fields in their database were actually useful or correlated with the particular
problem that they were trying to understand. Because decision trees
score so highly on so many of the critical features of data mining they can be used in a wide
variety of business problems for both exploration and for prediction. They have been used for
problems ranging from credit card attrition prediction to time series prediction of the exchange
rate of different international currencies. There are also some problems where decision trees will
not do as well.
Another way that the decision tree technology has been used is for preprocessing data for other
prediction algorithms. Because the algorithm is fairly robust with respect to a variety of
predictor types (e.g. number, categorical etc.) and because it can be run relatively quickly
decision trees can be used on the first pass of a data mining run to create a subset of possibly
useful predictors that can then be fed into neural networks, nearest neighbor and normal
statistical routines - which can take a considerable amount of time to run if there are large
numbers of possible predictors to be used in the model. Although some forms of decision trees
were initially developed as exploratory tools to refine and preprocess data for more standard
statistical techniques like logistic regression, they are also increasingly being used for
prediction. This is interesting because many statisticians will still use
A decision tree represents a series of questions. The answer to the first question determines the
follow-up question. The initial questions create broad categories with many members; follow-on
questions divide the broad categories into smaller and smaller sets. If the questions are well
chosen, a surprisingly short series is enough to accurately classify an incoming record. A record
enters the tree at the root node. The root node applies a test to determine which child node the
record will encounter next. There are different algorithms for choosing the initial test, but the
goal is always the same: To choose the test that best discriminates among the target classes. This
process is repeated until the record arrives at a leaf node. All the records that end up at a given
leaf of the tree are classified in the same way. There is a unique path from the root to each leaf.
That path is an expression of the rule used to classify the records.
In the process the proportion of the records in the desired class can be used as a score, which is
more useful than just the classification. For a binary outcome, a classification merely splits
records into two groups. A score allows the records to be sorted from most likely to least likely
to be members of the desired class. For many applications, a score capable of rank-ordering a
list is all that is required. This is sufficient to choose the top N percent for a mailing and to
calculate lift at various depths in the list.
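The scoring idea can be sketched with a tiny hand-built tree whose leaf values are the proportion of responders observed at that leaf during training; sorting records by score rank-orders the mailing list. All thresholds and proportions here are invented.

```python
# A hand-built two-level decision tree that returns a class score: the
# (hypothetical) fraction of "responder" records at each training leaf.
def tree_score(record):
    """Route a record from root to leaf; return P(responder) at that leaf."""
    if record["age"] < 40:                 # root-node test
        if record["income"] < 50_000:      # child-node test
            return 0.15                    # leaf: 15% responders in training
        return 0.40                        # leaf: 40% responders
    return 0.70                            # leaf: 70% responders

customers = [{"age": 30, "income": 45_000},
             {"age": 55, "income": 80_000}]
# Sort most-likely-first, as when choosing the top N percent for a mailing
ranked = sorted(customers, key=tree_score, reverse=True)
print([tree_score(c) for c in ranked])  # [0.7, 0.15]
```

The score carries more information than the bare yes/no classification: it lets the list be cut at any depth and lift be computed at each cut.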
Suppose the important business question is not who will respond but what will be the size of the
customer's next order? The decision tree can be used to answer that question too. Assuming that
order amount is one of the variables available in the pre-classified model set, the average order
size in each leaf can be used as the estimated order size for any unclassified record that meets the
criteria for that leaf. It is even possible to use a numeric target variable to build the tree; such a
tree is called a regression tree. Instead of increasing the purity of a categorical variable, each split
in the tree is chosen to decrease the variance in the values of the target variable within each child
node.
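A regression-tree split in miniature: among all candidate thresholds on one feature, choose the one that most decreases the variance of the numeric target within the children, then predict with each leaf's mean. All numbers are invented.

```python
# One regression-tree split: minimize the weighted variance of the numeric
# target in the two children; each child's mean becomes its prediction.
def variance(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def best_split(rows):
    """rows: (feature, target) pairs. Return (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x < t]
        right = [y for x, y in rows if x >= t]
        if not left or not right:
            continue
        # weighted average of the children's variances
        score = (len(left) * variance(left)
                 + len(right) * variance(right)) / len(rows)
        if best is None or score < best[0]:
            best = (score, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]

# Hypothetical (age, order size) records
rows = [(25, 40.0), (30, 44.0), (55, 90.0), (60, 94.0)]
print(best_split(rows))  # splits at age 55; leaf means 42.0 and 92.0
```

Repeating this split recursively inside each child grows the full regression tree, exactly as the classification version does with purity in place of variance.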
Since any multi-way split can be expressed as a series of binary splits, there is no real need for
multi-way splits.
In Figure 3.1, the first split is a poor one because there is no increase in purity. The initial
population contains equal numbers of the two sorts of dot; after the split, so does each child. The
second split is also poor because, although purity is increased slightly, the pure node has few
members and the purity of the larger child is only marginally better than that of the parent. The
final split is a good one because it leads to children of roughly the same size and with much higher
purity than the parent.
Tree-building algorithms are exhaustive. They proceed by taking each input variable in turn and
measuring the increase in purity that results from every split suggested by that variable. After
trying all the input variables, the one that yields the best split is used for the initial split, creating
two or more children. If no split is possible (because there are too few records) or if no split
makes an improvement, then the algorithm is finished with that node and the node becomes a leaf
node. Otherwise, the algorithm performs the split and repeats itself on each of the children. An
algorithm that repeats itself in this way is called a recursive algorithm.
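The recursive procedure just described can be sketched as follows, using Gini impurity as the purity measure on a toy one-feature dataset; the thresholds, labels, and stopping rule are illustrative, not a full production algorithm.

```python
# Sketch of recursive tree growth: try every candidate split, keep the one
# that most reduces Gini impurity, and recurse until no split helps.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def grow(rows, min_size=2):
    """rows: list of (value, label). Returns a nested dict or a leaf label."""
    labels = [l for _, l in rows]
    if len(rows) < min_size or gini(labels) == 0:
        return max(set(labels), key=labels.count)        # leaf: majority class
    best = None
    for t in sorted({v for v, _ in rows}):               # exhaustive search
        left = [r for r in rows if r[0] < t]
        right = [r for r in rows if r[0] >= t]
        if not left or not right:
            continue
        score = (len(left) * gini([l for _, l in left])
                 + len(right) * gini([l for _, l in right])) / len(rows)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    if best is None or best[0] >= gini(labels):          # no improving split
        return max(set(labels), key=labels.count)
    _, t, left, right = best
    return {"threshold": t, "left": grow(left), "right": grow(right)}

rows = [(25, "no"), (30, "no"), (55, "yes"), (60, "yes")]
print(grow(rows))
```

On this toy data the algorithm finds one perfect split (at 55) and both children immediately become pure leaves, so the recursion stops after a single level.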
The decision tree keeps growing as long as new splits can be found that improve the ability of the
tree to separate the records of the training set into increasingly pure subsets. Such a tree has been
optimized for the training set, so eliminating any leaves would only increase the error rate of the
tree on the training set. A decision tree algorithm makes its best split first, at the root node where
there is a large population of records. As the nodes get smaller, the tree finds general patterns at
the big nodes and patterns specific to the training set in the smaller nodes; that is, the tree
overfits the training set. The result is an unstable tree that will not make good predictions. The cure is
to eliminate the unstable splits by merging smaller leaves through a process called pruning. The
most widely used pruning algorithm is the CART pruning algorithm.
CART is a popular decision tree algorithm first published by Leo Breiman, Jerome Friedman,
Richard Olshen, and Charles Stone in 1984. The acronym stands for Classification and
Regression Trees. The CART algorithm grows binary trees and continues splitting as long as
new splits can be found that increase purity. As illustrated in Figure 3.2, inside a complex tree,
there are many simpler sub trees, each of which represents a different trade-off between model
complexity and training set misclassification rate. The CART algorithm identifies a set of such
sub trees as candidate models. These candidate sub trees are applied to the validation set and the
tree with the lowest validation set misclassification rate is selected as the final model.
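The candidate-subtree selection step can be illustrated without a full CART implementation: here the "subtrees" are three hand-written rules of increasing complexity, and the one with the lowest validation-set misclassification rate wins. The data and rules are invented for illustration only.

```python
# CART-style model selection in miniature: score candidate models of
# increasing complexity on a held-out validation set, keep the best.
def misclassification(model, validation):
    """Fraction of validation records the model labels incorrectly."""
    return sum(model(x) != y for x, y in validation) / len(validation)

# Candidate "subtrees" over one numeric feature, simplest to most complex
candidates = {
    "stump":   lambda x: "yes",
    "1-split": lambda x: "yes" if x >= 50 else "no",
    "2-split": lambda x: "yes" if x >= 50 else ("no" if x >= 20 else "yes"),
}

validation = [(25, "no"), (30, "no"), (55, "yes"), (60, "yes"), (15, "no")]
best = min(candidates,
           key=lambda name: misclassification(candidates[name], validation))
print(best)  # the moderately complex tree generalizes best here
```

The overly complex "2-split" rule fits a pattern that does not hold on the validation set, which is exactly the overfitting that pruning is designed to remove.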
True neural networks are biological systems that detect patterns, make predictions and learn.
The artificial ones are computer programs that implement sophisticated pattern-detection and
machine-learning algorithms to build predictive models from large historical databases. For
example, suppose a credit card company has 3,000 records, 100 of which are known fraud
records. The neural network is trained on this data set so that it learns the difference between
the fraud records and the legitimate ones. The trained network is then run against the company's
million-record data set, and the network flags the records whose patterns are the same as or
similar to those of the fraud records. Neural networks are known for not being easy to interpret:
the patterns they learn are encoded as numeric weights rather than explicit rules.
A neural network is loosely based on how some people believe that the human brain is organized
and how it learns. There are two main structures of consequence in the neural network:
The node – It loosely corresponds to the neuron in the human brain.
The link – It loosely corresponds to the connections between neurons (axons, dendrites and
synapses) in the human brain.
In Figure 3.4, there is a drawing of a simple neural network. The round circles represent the
nodes and the connecting lines represent the links. The neural network functions by accepting
predictor values at the left and performing calculations on those values to produce new values in
the node at the far right. The value at this node represents the prediction from the neural network
model. In this case the network takes in values for predictors for age and income and predicts
whether the person will default on a bank loan.
The calculations for Figure 3.4 are done as shown in Figure 3.5. Here an age of 47 is
normalized to fall between 0.0 and 1.0, taking the value 0.47, and the income is normalized to
the value 0.65. This simplified neural network predicts no default for a 47-year-old making
$65,000. The links are weighted at 0.7 and 0.1, and the resulting value after multiplying the
node values by the link weights is 0.39. The network has been trained so that an output value of
1.0 indicates default and 0.0 indicates non-default. The output value calculated here (0.39) is
closer to 0.0 than to 1.0, so the record is assigned a non-default prediction.
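The arithmetic of this worked example can be reproduced in a few lines. The normalization divisors (100 for age, 100,000 for income) and the 0.5 decision threshold are assumptions chosen to match the figures quoted above:

```python
# Sketch of the Figure 3.5 arithmetic: normalize the inputs, multiply by
# the link weights, sum, and compare against the 0/1 training targets.
def predict_default(age, income, w_age=0.7, w_income=0.1):
    """Weights follow the worked example; normalization is an assumption."""
    a = age / 100.0          # age 47 -> 0.47
    i = income / 100000.0    # $65,000 -> 0.65
    output = a * w_age + i * w_income
    # trained so that 1.0 means default and 0.0 means non-default
    return output, ("default" if output >= 0.5 else "no default")

out, label = predict_default(47, 65000)   # out is roughly 0.39
```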
Applications: Neural networks are used in a wide variety of applications. They have been used
in all facets of business from detecting the fraudulent use of credit cards and credit risk
prediction to increasing the hit rate of targeted mailings. They also have a long history of
application in other areas such as the military for the automated driving of an unmanned vehicle
• If biscuits are purchased then tea/coffee is purchased 90% of the time and this pattern
occurs in 3% of all shopping baskets.
• If Gillette shaving razor is purchased from a store then Gillette shaving cream is
purchased 60% of the time and these two items are bought together in 6% of the shopping
baskets.
The bane of rule induction systems is also its strength - that it retrieves all possible interesting
patterns in the database. This is a strength in the sense that it leaves no stone unturned but it can
also be viewed as a weakness because the user can easily become overwhelmed with such a large
number of rules that it is difficult to look through all of them. A second pass of data mining is
needed to go through the list of interesting rules that have been generated by the rule induction
system in the first place in order to find the most valuable gold nugget amongst them all. This
overabundance of patterns can also be problematic for the simple task of prediction: because all
possible patterns are culled from the database, there may be conflicting predictions made by
equally interesting rules. In rule induction systems, the rule itself is of the simple form "if this
and this and this then this". For example, a rule that a supermarket might find in the data
collected from its scanners would be: "if pickles are purchased then ketchup is purchased".
Just because the pattern in the database is expressed as a rule does not mean that it is true all the
time. It is important to recognize and make explicit the uncertainty in the rule. This is what the
accuracy of the rule means. The coverage of the rule has to do with how much of the database
the rule “covers” or applies to. In some cases, accuracy is called the confidence of the rule and
coverage is called the support. Accuracy and coverage appear to be the preferred ways of
naming these two measurements.
Association rules provide information of this type in the form of "if-then" statements. These
rules are computed from the data and, unlike the if-then rules of logic, association rules are
probabilistic in nature. In addition to the antecedent (the "if" part) and the consequent (the "then"
part), an association rule has two numbers that express the degree of uncertainty about the rule.
In association analysis the antecedent and consequent are sets of items (called item sets) that are
disjoint (do not have any items in common).
The first number is called the support for the rule. The support is simply the fraction of
transactions that include all items in both the antecedent and the consequent of the rule. For
example, considering the data given in Table 3.1, the item set {milk, bread} has a support of
2 / 4 = 0.5, since it occurs in 50% of all transactions (2 out of 4 transactions).
The other number is known as the confidence of the rule. The confidence of a rule is defined as
the ratio of the number of transactions supporting the rule to the number of transactions in which
the conditional part of the rule holds. Another way of saying this is that confidence is the ratio
of the number of transactions with all the items to the number of transactions with just the "if" items.
For example, the rule {milk, bread} => butter has a confidence of 0.2 / 0.4 = 0.5 in the database,
which means that for 50% of the transactions containing milk and bread the rule is correct.
Lift: Lift answers the question of how much better than chance the rule is. Lift (also called
improvement) tells us how much better a rule is at predicting the result than just assuming the
result in the first place. Lift is the ratio of the density of the target after application of the left-
hand side to the density of the target in the population. Another way of saying this is that lift is
the ratio of the records that support the entire rule to the number that would be expected,
assuming that there is no relationship between the products. It is defined as

Lift(X => Y) = support(X and Y) / (support(X) * support(Y))

For example, the rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 * 0.4) = 1.25.
Conviction: It is the ratio of the expected frequency with which X occurs without Y (if they
were independent) to the observed frequency. It is defined as

Conviction(X => Y) = (1 - support(Y)) / (1 - confidence(X => Y))

For example, the rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2.
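These four measures can be recomputed on a small illustrative basket. The five transactions below are hypothetical, chosen so that the rule {milk, bread} => {butter} reproduces the numbers quoted above (support 0.2, confidence 0.5, lift 1.25, conviction 1.2):

```python
# Illustrative 5-transaction basket (hypothetical data).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"milk", "bread"}, {"butter"}
supp_rule = support(antecedent | consequent)                      # 0.2
confidence = supp_rule / support(antecedent)                      # 0.2 / 0.4 = 0.5
lift = supp_rule / (support(antecedent) * support(consequent))    # 1.25
conviction = (1 - support(consequent)) / (1 - confidence)         # 1.2
```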
Association rules are required to satisfy a user-specified minimum support and a user-specified
minimum confidence at the same time. To achieve this, association rule generation is a two-step
process. First, minimum support is applied to find all frequent item sets in a database. In a second
step, these frequent item sets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention.
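The two-step process can be sketched as below. The transactions and thresholds are illustrative, and the first step is done by brute force over all candidate itemsets; a real Apriori implementation prunes candidates level by level using the frequent sets found at the previous level:

```python
from itertools import combinations

# Hypothetical baskets and thresholds, for illustration only.
transactions = [{"milk", "bread", "butter"}, {"milk", "bread"},
                {"milk", "butter"}, {"bread"}, {"milk"}]
MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1: find all itemsets meeting minimum support (brute force here).
items = sorted(set().union(*transactions))
frequent = [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)
            if support(set(c)) >= MIN_SUPPORT]

# Step 2: split each frequent itemset into antecedent => consequent and
# keep the rules meeting minimum confidence.
rules = []
for itemset in frequent:
    for size in range(1, len(itemset)):
        for ante in combinations(itemset, size):
            ante = frozenset(ante)
            conf = support(itemset) / support(ante)
            if conf >= MIN_CONFIDENCE:
                rules.append((set(ante), set(itemset - ante), conf))
```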
CHAPTER-4
METHODOLOGY AND DESIGN
Source: HMIS Dept., Sri Sathya Sai Institute of Higher Medical Sciences (SSSHIMS).
Number of Instances: There are 16383 instances, i.e. records of 16383 patients. But, to
satisfy the limitations of the data mining tool, a representative sample of 600 instances was
considered. Simple random sampling was used, which gives every record an equal probability
of being chosen. Sampling was done by choosing instances in the same proportion as they were
present in the original dataset.
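Sampling "in the same proportion as the original dataset" amounts to stratified (proportional) random sampling, which can be sketched as follows; the field name and population counts are illustrative, not the study's actual figures:

```python
import random

def proportional_sample(records, key, n, seed=0):
    """Draw about n records, keeping each stratum's share of the population."""
    rng = random.Random(seed)
    strata = {}
    for rec in records:
        strata.setdefault(key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        k = round(n * len(group) / len(records))   # proportional quota
        sample.extend(rng.sample(group, k))        # random within stratum
    return sample

# Hypothetical population of 1000 records with three diagnosis codes.
population = ([{"diagnosis": 1}] * 570 +
              [{"diagnosis": 2}] * 240 +
              [{"diagnosis": 3}] * 190)
sample = proportional_sample(population, key=lambda r: r["diagnosis"], n=100)
```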
Number of Variables: There are 5 variables.
Variable Information: The 5 different variables are mentioned below along with a brief
description:
Age code: It is the age of the patients, converted into discrete values
(Age code) as 1 for 0-17, 2 for 18-25, 3 for 25-35, 4 for 35-60, and 5 for
61-69.
Gender code: The values "Male" and "Female" are converted into numeric
values as 0 for Male and 1 for Female.
Final Diagnosis: This variable is the target for prediction. It has four options;
the codes for these treatment types are specified earlier in this section.
XLMiner™ is a comprehensive data mining add-in for Excel. Data mining is a discovery-
driven data analysis technology used for identifying patterns and relationships in data sets.
With overwhelming amounts of data now available from transaction systems and external
data sources, organizations are presented with increasing opportunities to understand their
data and gain insights into it. Often, there may be more than one approach to a problem.
XLMiner™ is a tool belt to help you get started quickly on data mining, offering a variety of
2. The software used for analysis is open source software and hence does not support
additional features and functions.
CHAPTER - 5
DATA ANALYSIS
Table 5.1 shows the percentage population of all attributes: age code, gender code, and final
diagnosis. It is observed that the population is dominated by male patients. Among all age codes,
age code 4 has the highest and age code 2 the least composition in the original dataset. Out of the
four types of treatments, CAD (Coronary Artery Disease) is predominant in the original dataset,
with almost 57% composition in the overall population. The next highest category is CHD
(Congenital Heart Disease), with 24% composition in the overall population.
For the purpose of classification, 4 independent demographic variables and one dependent
variable are used:
• Age code
• Gender code
• States
• Districts
• Final diagnosis
These 5 variables are the most important in terms of building a model, training the model,
validating the functionality of the model, and finally using it for classification or prediction.
The variable Final Diagnosis was selected as output variable in the process of classification.
Data Partitioning
Before building a model, we typically partition the data using a partition utility. Partitioning
yields mutually exclusive datasets: a training dataset and a validation dataset, in a ratio
defined by the user. The training dataset is used to train the model for the purpose of
classification. The partitioning ratio was kept as 33.33% for the training dataset and 66.67%
for the validation dataset.
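The 33.33% / 66.67% partition can be sketched as a shuffle followed by a cut; this is a generic illustration, not XLMiner's partition utility:

```python
import random

def partition(records, train_fraction=1 / 3, seed=42):
    """Shuffle a copy of the records and cut at the training fraction."""
    shuffled = records[:]                 # leave the original untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, validation set)

# 600 record ids, as in the sampled dataset, split 200 / 400.
training, validation = partition(list(range(600)))
```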
Training Set
The partitioning results in the division of the data into two parts: the training set and the
validation set. The training dataset is used to train or build a model. (For further details refer to the
appendix).
Validation Set
Once a model is built on training data, you need to find out the accuracy of the model on unseen
data. For this, the model should be used on a dataset that was not used in the training process i.e.
a dataset where you know the actual value of the target variable. (For further details refer to the
appendix).
Classification predicts the value of the class. The predicted class is then compared with the
actual values, and a measure of the effectiveness of the model, called the misclassification rate,
is calculated. It is defined as the ratio of the instances incorrectly predicted to the total number
of instances provided. The training log shows the misclassification (error) rate as each additional
node is added to the tree. (For further details refer to the appendix).
                 Predicted Class
Actual Class     1     2     3     4
      1        215     5     1     0
      2          3    68    19     0
      3         25     6    56     0
      4         11     0     5     0

Error Report
Class    # Cases    # Errors    % Error
  1        219         5          2.28
  2         90        22         25.55
  3         76        30         39.57
  4         15        15        100.00
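The overall error rate can be recomputed from the confusion matrix above (class codes written 1-4 here): the off-diagonal counts divided by the total. Note that the resulting per-class totals differ slightly from the error report as printed:

```python
# Validation confusion matrix: actual class -> predicted-class counts.
confusion = {
    1: [215, 5, 1, 0],
    2: [3, 68, 19, 0],
    3: [25, 6, 56, 0],
    4: [11, 0, 5, 0],
}
classes = [1, 2, 3, 4]

total = sum(sum(row) for row in confusion.values())
# correct predictions sit on the diagonal
correct = sum(confusion[c][i] for i, c in enumerate(classes))
error_rate = (total - correct) / total    # roughly 0.18
```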
5.5 Observations
This statement means that five young patients belonging to the age group 0-25, coming from
the states Uttar Pradesh, Assam, Bihar, Chhattisgarh and West Bengal, were diagnosed with
CORONARY ARTERY DISEASE (CAD).
• If AGE>5.5 and GENDERCODE = 1 and STATES <=6 Then, 2 patients were diagnosed
with CAD.
This statement means that two adult female patients belonging to the age group 35-60, coming
from the states Andhra Pradesh, Jammu & Kashmir, Madhya Pradesh, Tamilnadu and Uttar
Pradesh, were diagnosed with CAD.
• If AGE [3.5, 5.5] and GENDERCODE = 1 STATES [3, 6] Then, 5 patients were diagnosed
with CAD.
• If AGE [3.5, 5.5] and GENDERCODE = 1 STATES <=3 Then, 2 patients were diagnosed
with CAD.
• If AGE >=3.5 and GENDERCODE = 1 STATES [6, 11] Then, 3 patients were diagnosed
with CAD.
• If AGE >=3.5 and GENDERCODE = 1 STATES [11, 15.5] Then, 10 patients were
diagnosed with CAD.
• If AGE >=3.5 and GENDERCODE = 2 STATES [6.5, 10.5] Then, 17 patients were
diagnosed with CAD.
• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES <=2 Then, 10 patients were diagnosed
with CAD.
• If AGE >=3.5 and GENDERCODE = 2 STATES >=15.5 Then, 32 patients were diagnosed
with CAD.
• If AGE >=3.5 and GENDERCODE = 2 STATES [2, 3.5] Then, 2 patients were diagnosed
with CAD.
• If AGE >=3.5 and GENDERCODE = 2 STATES [3.5, 6.5] Then, 18 patients were
diagnosed with CAD.
• If AGE >=3.5 and GENDERCODE = 2 STATES [3.5, 6.5] Then, 2 patients were diagnosed
with CAD.
• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES [10.5, 15.5] Then, 5 patients were
diagnosed with CAD.
• If AGE >= 5.5 and GENDERCODE = 2 STATES [10.5, 15.5] Then, 5 patients were
diagnosed with CAD.
• If AGE [3.5, 5.5] and GENDERCODE = 2 STATES [15.5, 15.5] Then, 13 patients were
diagnosed with CAD.
• If AGE<=1.5 and STATES [15.5, 16.5] and GENDERCODE=1 or 2 Then, 11 patients were
diagnosed with CHD.
• If AGE>=2.5 and STATES <=12 Then, 5 patients were diagnosed with CHD.
• If AGE>=2.5 and STATES >=12 Then, 2 patients were diagnosed with CHD.
• If AGE [1.5, 2.5] and GENDERCODE = 2 and STATES [12, 18.5] Then, 2 patients were
diagnosed with CHD.
• If AGE>= 1.5 and GENDERCODE = 2 and STATES [6, 8] Then, 6 patients were diagnosed
with RHD.
• If AGE>= 1.5 and GENDERCODE = 2 and STATES <=6 Then, 3 patients were diagnosed
with RHD.
• If AGE>= 1.5 and GENDERCODE = 2 and STATES >=8 Then, 2 patients were diagnosed
with RHD.
• If AGE>= 2.5 and GENDERCODE = 2 and STATES [12, 18.5] Then, 3 patients were
diagnosed with RHD.
• If AGE >2.5 and STATES>=18.5 and GENDERCODE = 2 Then, 2 patients were diagnosed
with RHD.
• Gender code
• States
• Districts
• Final diagnosis
The number of clusters was chosen as 6 after repeated iterations, each reviewed to check
whether it added value over the output of the previous iterations. Having reviewed all
iterations, I found that the output generated by taking the number of clusters as 6 is useful for
gaining meaningful insights. The number of iterations was chosen as 10. This determines how
many times the program starts with an initial partition and follows through with the clustering
algorithm. The configuration of clusters may differ from one starting partition to another. The
program goes through the specified number of iterations and selects the cluster configuration
that minimizes the distance measure.
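The role of the iteration count can be illustrated with a small sketch: run K-means from several random starting partitions and keep the configuration with the smallest total within-cluster distance. The toy one-dimensional data here stands in for the study's 5-variable records:

```python
import random

def kmeans_once(points, k, rng, steps=20):
    """One K-means run from a random start; returns (centers, total distance)."""
    centers = rng.sample(points, k)
    for _ in range(steps):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center
            clusters[min(range(k), key=lambda i: abs(p - centers[i]))].append(p)
        # move each center to the mean of its cluster (keep it if empty)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    cost = sum(min(abs(p - c) for c in centers) for p in points)
    return centers, cost

def kmeans_restarts(points, k, restarts=10, seed=0):
    """Repeat from several starting partitions, keep the lowest-cost run."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(restarts)]
    return min(runs, key=lambda run: run[1])

points = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2]
centers, cost = kmeans_restarts(points, k=3)
```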
5.7.2 Analysis of Output using K-means Clustering
K-means Clustering helps in predicting clusters to gain meaningful insights. XLMiner predicts
clusters and places each instance in the corresponding cluster. Table 5.4 shows the cluster to
which each instance belongs and its distance to each of the clusters.
We can see that for any record, its distance to each of the clusters is calculated using the
K-means algorithm, and the record is assigned to the cluster at the minimum distance. Thus the
values highlighted in yellow give the division into clusters. For example, row id 3 is assigned to
cluster 1 because its distance to cluster 1 is the smallest of its distances to all the clusters.
Now, the most significant part of this study is to understand the cluster division and obtain
meaningful information for decision-making purposes. Table 5.5 below shows, for each
instance, the cluster to which it belongs along with the attributes used for clustering.
CLUSTER ID   AGE   GENDERCODE   STATES   DISTRICT   FINAL DIAGNOSIS
1 1 1 150000 1506 2
1 1 2 150000 1520 1
1 1 1 150000 1523 1
1 1 1 150000 1523 2
1 1 1 160000 1610 2
1 1 2 160000 1611 2
1 1 1 170000 1722 2
1 1 1 210000 2115 1
1 1 2 210000 2123 2
1 1 2 150000 1506 1
1 1 2 210000 2103 2
1 1 2 140000 1405 2
1 1 1 150000 1506 1
1 1 2 150000 1520 1
1 1 1 210000 2106 2
1 1 1 210000 2112 2
1 1 1 160000 1609 2
1 1 1 210000 2126 2
1 1 2 210000 2116 2
1 2 2 160000 1609 2
1 2 1 160000 1605 2
1 2 2 160000 1609 2
1 2 2 160000 1609 2
1 3 1 200000 2002 2
2 3 1 150000 1529 2
2 3 1 150000 1525 2
2 4 1 100000 1001 1
2 4 1 70000 717 1
2 4 1 150000 1506 1
2 4 1 150000 1508 1
2 4 1 150000 1506 1
2 4 1 70000 709 1
2 4 1 140000 1402 1
2 4 1 160000 1602 1
2 4 1 150000 1506 1
2 4 1 90000 916 1
2 4 1 150000 1506 1
2 5 1 120000 1213 1
2 5 1 170000 1714 1
3 1 2 70000 702 2
3 1 2 70000 715 1
3 1 1 50000 511 2
3 1 1 50000 529 2
3 1 2 70000 708 2
3 1 2 120000 1202 2
3 1 2 60000 602 2
3 1 1 70000 716 2
3 1 2 70000 716 2
3 1 2 90000 907 1
3 1 1 50000 512 2
3 1 1 70000 717 2
3 1 2 50000 521 4
3 1 2 70000 708 2
3 1 1 70000 719 2
3 1 1 70000 707 2
3 1 2 70000 711 2
3 1 2 50000 507 2
3 1 2 90000 909 2
3 1 2 90000 926 2
3 1 2 70000 713 2
3 2 2 70000 710 3
3 2 2 10000 105 3
3 2 1 50000 521 3
3 2 2 70000 714 2
3 2 2 70000 712 4
3 3 1 50000 506 4
3 3 1 50000 531 4
3 3 1 50000 531 3
3 4 2 30000 330 3
3 4 1 50000 509 3
3 4 2 50000 507 3
4 1 1 150000 1506 4
4 1 2 160000 1605 3
4 1 2 150000 1515 3
4 2 2 150000 1525 3
4 3 2 150000 1516 3
4 3 2 150000 1525 3
4 3 2 160000 1609 3
4 3 1 150000 1525 3
4 3 1 160000 1609 3
4 3 2 160000 1619 3
4 3 2 160000 1605 3
4 4 1 150000 1515 3
4 4 1 80000 811 4
4 4 1 150000 1506 3
4 4 1 150000 1516 3
4 4 1 150000 1507 3
4 4 2 150000 1503 3
4 4 2 210000 2105 4
4 4 2 160000 1619 3
4 4 1 150000 1521 4
4 4 1 210000 2110 3
5 3 1 50000 511 1
5 4 2 30000 314 1
5 4 2 70000 701 1
5 4 2 50000 504 1
5 4 2 50000 521 1
5 4 2 10000 115 1
5 4 2 70000 703 1
5 4 2 70000 710 1
5 4 2 10000 102 1
5 4 2 50000 527 1
5 4 2 50000 531 1
5 4 2 50000 504 1
5 4 2 70000 714 1
5 4 1 50000 508 1
5 4 2 70000 709 1
5 4 2 50000 526 1
5 4 1 10000 104 1
5 4 2 10000 106 1
5 4 2 50000 509 1
5 4 2 50000 521 1
5 4 2 10000 106 1
5 4 2 10000 107 1
5 4 2 50000 508 1
5 4 1 50000 521 1
5 4 2 50000 531 1
5 4 2 70000 712 1
5 4 2 70000 718 1
5 4 2 70000 713 1
5 4 2 70000 715 1
5 4 2 70000 716 1
5 4 2 50000 510 1
5 4 2 50000 531 1
5 4 2 10000 101 1
5 4 1 50000 521 1
5 4 2 50000 515 1
5 4 2 70000 703 1
5 4 1 10000 110 1
5 4 2 30000 323 1
5 4 2 70000 709 1
5 5 2 10000 117 1
5 5 2 10000 122 1
5 5 2 70000 707 1
6 3 2 150000 1515 2
6 3 2 210000 2116 2
6 4 2 150000 1506 1
6 4 2 150000 1527 1
6 4 2 160000 1610 1
6 4 2 150000 1527 1
6 4 2 150000 1519 1
6 4 2 140000 1409 1
6 4 2 150000 1506 1
6 4 2 160000 1605 1
6 4 2 160000 1610 1
6 4 2 160000 1601 1
6 4 2 160000 1603 1
6 4 2 160000 1609 1
6 4 2 210000 2111 1
6 4 2 150000 1506 1
6 4 2 150000 1516 1
6 4 2 160000 1604 1
6 4 2 160000 1609 1
6 4 2 160000 1611 1
6 4 2 150000 1501 1
6 4 2 150000 1504 1
6 4 2 150000 1506 1
6 4 2 170000 1715 1
6 4 2 160000 1619 1
6 4 2 150000 1523 1
6 4 2 150000 1519 1
6 4 2 150000 1525 1
6 4 2 160000 1608 1
6 4 2 160000 1610 1
6 4 2 160000 1610 1
6 4 2 150000 1520 1
6 4 2 150000 1506 1
6 4 2 160000 1605 1
6 4 2 160000 1613 1
6 4 2 170000 1703 1
6 4 2 150000 1501 1
6 4 2 150000 1506 1
6 4 2 150000 1525 1
6 4 2 160000 1605 1
6 4 2 160000 1605 1
6 4 2 160000 1610 1
6 4 2 160000 1611 1
6 4 2 150000 1529 1
6 4 2 160000 1607 1
6 4 2 160000 1619 1
6 4 2 150000 1506 1
6 4 2 160000 1605 1
6 4 2 160000 1617 1
6 4 2 170000 1722 1
6 4 2 170000 1724 1
6 5 2 130000 1301 1
6 5 2 150000 1524 1
6 5 2 150000 1524 1
6 5 2 160000 1610 1
6 5 2 160000 1619 1
6 5 2 160000 1601 1
6 5 2 170000 1720 1
6 5 2 220000 2201 1
Table 5.2 shows the number of records in each cluster, the percentage of the population, and
the average distance of the cluster members from the centre of the cluster.
Table 5.2 Number of instances in each cluster and its distance from the cluster centre
Cluster      # Observations      Average distance in cluster
Cluster-1 25 1.414
Cluster-2 15 0.879
Cluster-3 33 1.577
Cluster-4 22 1.499
Cluster-5 44 0.903
Cluster-6 61 0.346
As is evident from Fig. 5.2, clusters 3, 5 and 6 represent almost 70% of the population, a clear
majority, so our further analysis focuses on these three clusters. The cluster division by age,
gender, state and district is presented in graphical form to enable easy comprehension.
From Figure 5.2 it is found that clusters 3, 5 and 6 are dominated by male patients, which
suggests that male patients are more prone to Coronary Artery Disease (CAD).
From Fig. 5.4 it is observed that in all three clusters the predominant states are KERALA,
KARNATAKA, TAMIL NADU and WEST BENGAL. Coronary heart disease was found to
be the most predominant in all the clusters. Cluster 6 has the highest percentage of patients
coming from Kerala, and the majority fall into the coronary heart disease category. In Kerala,
the districts with the majority of coronary heart disease patients are MALAPPURAM and
KANNUR, whereas in Karnataka the majority of coronary heart disease patients are from
BANGALORE.
CHAPTER 6
SUMMARY & CONCLUSION
SUMMARY
The complete analysis through clustering leads to the conclusion that the majority of
patients come from the states KERALA, TAMIL NADU, WEST BENGAL and ANDHRA
PRADESH. The districts to be focused on for providing better healthcare are
MALAPPURAM, KANNUR, BANGALORE, SALEM and TIRUVALLUR. Patients in the
age group 35-60 are more prone to Coronary Artery Disease (CAD). The above cluster
demographics hold for both male and female patients; however, the overall population is
dominated by male patients.
CONCLUSION
• The majority of patients come from the states KERALA, TAMIL NADU, WEST
BENGAL and ANDHRA PRADESH. The hospital administration should therefore
focus on these states to find out the reasons behind the occurrence of coronary
artery disease and take preventive action. For example, they can collaborate with
the Sri Sathya Sai Mobile Hospital, which has a team of specialist doctors and
provides healthcare to patients at their doorstep, to better understand the reasons
for the occurrence of Coronary Artery Disease. The hospital authorities can also
make use of the telemedicine facility for the same purpose.
• Patients in the age group 35-60 are more prone to Coronary Artery Disease
(CAD). Hospital authorities should therefore identify the risk factors most
prevalent for CAD in adults and take corrective action to reduce the risk, and
thus the occurrence of CAD patients, in future.
• The hospital administration can extend its collaboration with the Sri Sathya Sai
Village Integrated Programme (SSSVIP) to provide health education, covering
cleanliness and lifestyle, through the various health awareness programmes
conducted by SSSVIP.
REFERENCES
4. Edelstein, H., "Data Mining: Are We There Yet?", Computing Science and Statistics,
Vol. 34, 2002.
5. Hand, D. J., Bolton, R. J., “Pattern Discovery and Detection: A Unified Statistical
Methodology”, Journal of Applied Statistics, Oct2004, Vol. 31 Issue 8, p885-924.
6. IBM white paper, “Data management solutions”, April 1996.
7. Jonathan C. P. et al., “Medical Data Mining: Knowledge Discovery in a Clinical Data
Warehouse” AMIA 1997.
8. Keating, B., “Data Mining: What Is It and How Is It Used?” Journal of Business
Forecasting, Fall2008, Vol. 27 Issue 3, p33-35, 3p.
9. Kumar, V., et al., “Outlier Mining in Medical Databases: An Application of Data
Mining in Health Care Management to Detect Abnormal Values Presented In Medical
databases” International Journal of Computer Science and Network Security, VOL.8
No.8, August 2008.
10. Kuonen, D., “A Statistical Perspective of Data Mining”, Dec 2, 2004, CRM Zine,
Vol. 48.
11. Lamont, J., “Data drives decision-making in healthcare” KM World, Mar2010, Vol.
19 Issue 3, p12-14, 2p.
12. Milley, A., “Healthcare and Data Mining” Health Management Technology,
Aug2000, Vol. 21 Issue 8, p44, 2p.
13. Obenshain K. M., “Application of Data Mining Techniques to Healthcare Data”,
WEBSITES:
• www.thearling.com
• www.wikipedia.com
AGE CODE   GENDERCODE   STATES   DISTRICTS   Final diagnosis
1 1 16 223 2
Appendix
1 1 5 296 2
1 2 12 355 2
1 2 16 328 2
1 2 15 66 2
1 1 16 185 2
Table 5.2 Training data set
1 1 15 55 2
1 2 5 296 2
1 1 16 261 2
1 2 15 150 2
1 1 5 218 2
1 2 3 291 2
1 2 7 15 1
1 2 9 81 1
1 2 21 113 2
1 1 16 223 2
1 1 15 65 2
1 2 16 206 2
1 2 21 51 2
1 1 5 356 2
1 2 15 150 2
1 2 6 12 2
1 2 5 258 2
1 2 15 337 2
1 2 15 55 2
1 1 15 93 2
1 1 7 205 2
1 2 15 25 2
1 2 5 356 2
1 1 15 105 2
1 2 15 112 2
1 2 16 223 2
1 1 16 355 2
1 1 16 328 2
1 2 5 117 2
1 2 7 10 5
Table 5.3 Validation data set
1 2 17 301 2
1 2 7 15 2
1 1 16 335 2
1 1 15 55 2
1 2 5 332 2
1 1 5 97 1
1 2 15 55 3
1 1 16 210 1
1 2 5 111 3
1 2 15 55 1
1 1 15 203 3
1 2 17 205 2
1 2 16 210 2
1 2 5 285 2
1 2 16 210 2
1 2 21 195 2
1 1 5 218 2
2 2 16 328 2
1 1 16 185 2
2 1 3 305 1
1 1 5 218 3
2 1 15 75 2
1 2 8 318 2
2 2 20 257 2
1 1 16 185 3
2 2 5 356 2
1 1 7 251 3
2 2 15 25 2
2 1 21 68 3
2 2 7 57 3
3 2 1 100 3
3 2 21 68 3
1 2 21 161 2
1 2 21 52 2
1 2 15 112 2
1 2 1 198 2
1 2 23 131 2
1 1 1 211 2
1 2 5 122 2
1 1 5 258 2
1 2 3 193 2
1 2 16 206 2
1 1 15 55 2
1 2 6 33 2
1 2 1 95 2
1 2 21 165 2
1 1 7 76 2
1 2 16 210 2
1 2 7 15 2
1 2 7 15 2
1 1 21 255 2
1 2 6 26 2
1 2 12 283 2
1 2 15 308 2
1 2 7 155 2
1 2 25 11 2
1 2 15 203 2
1 2 5 331 2
1 1 15 150 2
1 2 15 55 2
1 2 7 21 2
1 2 16 261 2
1 1 5 325 2
1 2 7 239 2
1 1 12 71 2
1 2 5 122 2
1 2 12 130 2
1 2 15 203 2
1 1 15 93 2
1 2 7 15 2
1 2 1 272 2
1 1 7 76 3
1 2 21 51 2
1 1 7 56 2
1 1 16 206 2
1 2 5 296 2
1 2 15 65 2
1 1 3 193 2
1 1 9 65 2
1 1 7 239 2
1 2 7 21 2
1 1 7 57 2
1 2 5 356 2
1 2 15 337 2
1 2 16 210 2
1 2 9 103 2
1 1 16 39 2
1 1 5 330 2
1 1 1 152 2
1 1 5 356 2
1 1 15 55 2
1 1 15 150 2
1 2 15 112 3
1 1 15 75 3
1 1 6 38 3
1 2 5 122 2
1 2 15 65 2
1 2 7 231 3
1 2 5 332 3
2 1 17 166 2
2 1 15 250 3
2 1 7 156 2
2 1 15 150 3
2 1 5 325 2
2 2 7 231 2
2 2 15 65 3
2 2 15 250 2
2 2 5 122 3
2 2 9 267 3
2 2 5 356 2
2 2 1 157 2
2 2 15 65 2
2 2 21 59 2
2 1 5 331 3
2 2 5 326 3
2 2 5 189 3
2 2 15 55 5
2 2 5 356 3
2 1 7 231 3
2 2 16 355 3
2 1 1 353 3
2 2 15 57 2
2 2 16 261 2
2 2 6 232 2
2 1 5 218 3
2 1 15 66 3
2 2 7 76 3
2 1 21 255 3
2 2 16 328 5
3 1 15 55 2
3 2 15 128 2
3 2 6 53 3
3 1 1 30 3
3 2 15 112 3
3 1 16 185 3
3 2 15 55 2
3 1 15 150 3
3 1 15 190 3
3 2 15 203 3
3 2 15 150 1
3 2 7 277 2
3 2 9 276 3
3 2 15 55 5
3 2 15 337 2
3 1 7 17 3
3 1 15 55 3
3 1 16 185 3
3 1 5 326 3
3 2 15 55 5
3 1 2 169 3
3 1 7 168 3
3 1 1 30 3
3 1 7 57 3
3 1 15 55 2
3 2 7 168 3
3 1 7 76 3
3 2 16 185 3
3 1 1 351 3
3 2 16 278 3
3 1 1 358 2
3 1 15 150 3
3 2 7 105 3
3 2 16 355 3
3 2 16 210 3
3 1 15 95 3
3 1 5 350 3
3 1 15 337 3
3 1 17 325 3
3 1 3 310 3
5 2 1 256 1
5 1 15 1 2
5 1 7 57 3
5 1 15 105 3
5 2 7 155 3
5 1 21 201 5
5 2 20 257 5
5 2 15 65 1
5 1 5 356 5
5 2 17 55 1
5 2 3 351 1
5 1 1 9 1
5 2 17 79 3
5 2 5 296 3
5 2 15 112 3
5 2 16 83 3
5 1 15 337 5
5 1 5 325 5
5 2 5 97 1
5 2 15 308 1
5 2 15 190 1
5 1 15 55 3
5 2 1 9 1
5 2 7 56 1
5 1 17 259 2
5 2 5 111 3
5 2 7 231 3
5 2 16 261 1
5 2 15 226 1
5 1 15 112 1
5 2 7 56 3
5 1 15 339 3
5 2 15 77 3
5 2 1 211 1
5 2 5 19 1
5 1 7 205 1
5 2 21 165 1
5 2 7 15 1
5 2 1 188 1
5 2 15 308 1
5 2 7 10 3
5 2 7 239 3
5 2 15 279 3
5 2 23 36 3
5 2 1 259 1
5 2 15 207 1
5 2 5 122 1
5 2 16 335 1
5 2 16 223 5
5 2 16 261 1
5 2 5 185 1
5 2 15 137 1
5 2 17 275 1
5 1 21 50 1
5 2 15 55 1
5 2 7 76 1
5 2 16 355 1
5 1 15 55 2
5 1 16 185 3
5 2 1 198 3
5 2 17 299 1
5 2 15 105 1
5 1 15 55 1
5 2 16 191 3
5 1 15 55 5
5 2 7 277 1
5 2 5 296 1
5 2 15 250 1
5 2 1 100 1
5 1 7 57 1
5 2 7 156 1
5 2 16 39 1
5 2 15 250 1
5 2 1 9 1
5 2 5 258 1
5 2 5 97 1
5 2 5 296 1
5 2 7 239 1
5 2 15 55 1
5 2 21 101 1
5 1 17 325 1
5 1 1 157 1
5 2 16 223 1
5 2 5 97 1
5 1 15 66 1
5 2 5 313 1
5 2 16 209 1
5 2 16 39 1
5 2 15 18 3
5 2 1 213 1
5 1 15 55 1
5 2 16 328 1
5 2 16 336 1
5 2 16 185 1
5 2 9 307 1
5 2 5 122 1
5 2 16 223 1
5 2 16 261 1
5 2 1 272 3
5 2 5 89 1
5 2 17 205 1
5 2 16 355 1
5 1 15 18 1
5 2 5 296 1
5 1 15 55 1
5 2 5 326 1
5 2 15 55 1
5 2 15 55 1
5 2 5 97 1
5 1 16 223 1
5 1 15 55 1
5 2 16 210 1
5 2 16 185 3
5 1 15 55 1
5 2 7 155 1
5 2 21 50 1
5 2 5 19 1
5 2 16 210 1
5 1 5 122 1
5 2 15 190 1
5 2 7 15 1
5 2 21 165 1
5 2 7 15 1
5 2 16 223 1
5 2 16 210 1
5 1 15 55 1
5 2 7 156 1
5 1 5 89 1
5 1 5 97 1
5 2 15 55 1
5 2 7 168 1
5 2 5 185 1
5 1 15 308 1
5 2 1 303 1
5 1 10 323 1
5 2 15 55 1
5 2 5 332 1
5 2 16 210 1
5 2 5 218 1
5 2 15 226 1
5 2 16 83 1
5 2 16 83 1
5 1 15 65 3
5 2 7 15 5
5 2 1 272 1
5 1 7 155 1
5 2 16 355 1
5 2 7 155 1
5 2 16 185 1
5 2 7 15 1
5 2 15 55 1
5 2 17 237 1
5 2 1 319 1
5 1 5 335 5
5 1 15 250 1
5 2 15 308 1
5 2 16 223 1
5 2 16 355 1
5 2 1 152 1
5 2 1 9 1
5 2 17 316 1
5 2 3 91 1
5 1 7 76 1
5 2 16 83 1
5 2 15 250 1
5 1 1 52 1
5 2 15 105 1
5 1 15 203 1
5 2 16 223 1
5 2 15 159 1
5 2 16 223 1
5 2 7 15 1
5 2 5 359 1
5 2 16 223 1
5 1 5 296 1
5 2 16 185 1
5 2 7 231 1
5 2 7 57 1
5 2 15 339 1
5 2 16 261 1
5 2 15 203 1
5 1 15 55 1
5 2 16 206 1
5 2 5 89 1
5 1 16 336 1
5 2 5 212 1
5 2 5 189 1
5 2 15 337 1
5 1 5 97 1
5 2 1 100 1
5 2 16 191 1
5 1 12 253 1
5 2 16 83 1
5 2 3 200 1
5 2 16 206 1
5 2 16 335 1
5 2 16 185 1
5 2 7 155 1
5 2 17 166 1
5 1 15 112 1
5 2 16 83 1
5 2 5 350 1
5 2 15 55 1
5 2 16 83 1
5 2 7 225 1
5 1 16 185 1
5 2 15 55 1
5 1 15 128 1
5 2 15 65 1
5 2 1 351 1
5 2 15 25 1
5 2 16 121 1
5 2 5 218 1
5 2 21 101 1
5 1 16 335 1
5 2 5 97 1
5 2 3 107 1
5 2 16 261 1
5 2 5 111 1
5 2 7 155 1
5 2 16 336 1
5 2 1 272 1
5 2 15 55 1
5 2 17 263 1
5 2 15 55 1
5 2 16 185 1
5 2 1 351 1
5 2 16 210 1
5 2 16 185 1
5 2 16 261 1
5 2 16 185 1
5 2 15 279 1
5 2 7 225 1
5 2 15 308 1
5 2 5 122 1
5 2 1 95 1
5 2 15 202 1
5 2 1 52 1
5 2 5 296 1
5 2 3 67 1
5 2 7 168 1
5 2 16 185 1
5 2 5 122 1
5 2 15 65 1
5 2 9 238 1
5 2 9 300 1
5 2 7 239 5
5 2 16 355 1
5 2 16 210 1
5 2 15 66 1
5 2 7 205 5
5 2 7 57 1
# Decision Nodes    % Error
0 52
1 17.5
2 15
3 15
4 15
5 12.5
6 12.5
7 12.5
8 12.5
9 12.5
10 12
11 12
12 12
13 12
14 12
15 12
16 12
17 12
18 12
19 12
20 11.5
21 11.5
22 11.5
23 11.5
24 11.5
25 11.5
26 11.5
27 11.5
28 11
29 11
30 11
# Decision Nodes: 28    # Terminal Nodes: 29