
Clin Lab Med 28 (2008) xv

Dedication

Robert E. Reynolds, MD, DrPH

This issue is dedicated to Robert E. Reynolds, MD, DrPH, Professor of Internal Medicine and Public Health Sciences, and Vice President and CIO of the University of Virginia from 1999 to 2006. Dr. Reynolds created and
nurtured the UVa Clinical Data Repository, which is now a major integrated data resource for clinical research, and his vision and warm collegiality continue to inspire both students and faculty.
James H. Harrison, Jr, MD, PhD
Departments of Public Health Sciences and Pathology
University of Virginia
Hospital West Complex 3181
PO Box 800717
Charlottesville, VA 22908-0717, USA
E-mail address: james.harrison@virginia.edu


Clin Lab Med 28 (2008) xi–xiii

Preface

James H. Harrison, Jr, MD, PhD


Guest Editor

Clinical laboratory data are among the most detailed, objective, reliable,
and useful measures of patient characteristics contained in the medical
record. Numerous studies over the past 30 years based on laboratory data
alone and in aggregate with other clinical and experimental data have
revealed correlative and predictive patterns in laboratory data that have
improved our understanding of disease, therapeutic response, and health care
delivery processes. Additional useful patterns undoubtedly remain hidden in
the data, awaiting discovery by creative, prepared minds using effective
analysis techniques.
Some pathologists have recognized this opportunity; over the past 10
years there have been periodic reports in the literature that have used automated pattern recognition and modeling techniques collectively termed
data mining to identify patterns in laboratory data for various purposes.
Unfortunately, these efforts have been relatively few, whereas the use of data
mining techniques in other medical domains has increased dramatically (see
article by Dr. Harrison). There are several reasons that the use of data mining techniques has been inhibited in the laboratory. Data mining is a set of
statistical approaches to data analysis that are relatively technical and that
need to be correctly matched to an analysis task. This specialized knowledge
is generally outside the scope of laboratorians' training. Software tools for data mining by non-experts have been very expensive and were often a poor fit for laboratory databases. Most were designed to discover associations between discrete events and data elements in business analyses. As such, these tools are generally better suited to analyzing relationships between diagnosis or procedure codes in other medical domains than to recognizing important patterns in the time sequences of data elements that make up much of laboratory databases. Various political and legal forces have prevented laboratories and other care providers from collaboratively building the large data sets that optimally support data mining. Finally, dedication of effort and resources to a pattern discovery project can be difficult when the outcome is (by definition) unpredictable at the time of the investment decision.
This situation is changing. Data mining techniques are becoming more widely known, particularly those that are associated with high throughput genomics and proteomics analyses (see articles by Drs. Klee and Lee). Medical informaticians, who may be familiar (if not expert) with data mining, are also more generally available within pathology practices or as local collaborators. High-quality open source software for data mining is available that is appropriate for use by non-statisticians (see article by Drs. Zupan and Demsar). Techniques to incorporate time series databases into data mining analyses are being developed (see article by Drs. Post and Harrison). Efforts are underway to create a societal and governmental consensus for the secondary analysis of health care information (see article by Dr. Harrison).
One of the most important developments promoting data mining in the mainstream, however, is the coming convergence and correlation of genomic and proteomic data with data representing patient phenotype (see article by Dr. Harrison). These studies will use data mining techniques and will make data mining approaches and tools broadly available for application to clinical data. Because laboratory data present a high-quality, reliable representation of patient phenotype, they will be of substantial interest for aggregation with high throughput genomics and proteomics data. Laboratorians will have opportunities to contribute to, or lead, parts of this work.
Our intent in assembling this issue is to provide an introduction to standard techniques for managing and mining clinical data and to illustrate
these techniques with several applications related to laboratory medicine
and associated research. The issue is divided into a foundations section,
which provides a discussion of data mining techniques and tools, data warehousing, and time series analysis, and an applications section that presents
a set of projects that illustrate data aggregation, detection of interesting and
unusual patterns in laboratory data, infectious disease surveillance, and discovery of patterns indicating new biomarkers and gene expression profiles.
Balancing the level of complexity in introductory material is always challenging; we present the statistical discussions at a moderate level so as to be useful to readers who have some familiarity with statistical concepts, and provide references to additional materials appropriate for novices (see article by Dr. Harrison) or experts. We hope that this issue is useful in raising interest in data mining in the laboratory community and providing
a guide to the types of clinical and research opportunities that will become
available over the next several years.
James H. Harrison, Jr, MD, PhD
Departments of Public Health Sciences and Pathology
University of Virginia
Hospital West Complex 3181
PO Box 800717
Charlottesville, VA 22908-0717, USA
E-mail address: james.harrison@virginia.edu

Clin Lab Med 28 (2008) 1–7

Introduction to the Mining of Clinical Data
James H. Harrison, Jr, MD, PhD
Division of Clinical Informatics, Departments of Public Health Sciences and Pathology,
University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive,
Charlottesville, VA 22908-0717, USA

The progressive increase in the amount of clinical data stored in electronic form is, for the first time in history, making it possible to carry
out large-scale studies that focus on the interaction between genotype, phenotype, and disease at a population level. Such studies have extraordinary
potential to determine the effectiveness of treatment and monitoring strategies, identify subpopulations at risk for disease, define the real variability in
the natural history of disease and comorbidities, discover rational bases
for targeting therapies to particular patients, and determine the incidence
and contexts of unwanted health care outcomes. Matching patient responses (phenotype) with gene expression and known metabolic pathway
relationships across large numbers of individuals may be the best hope for
understanding the complex interplay between multiple genes and the environment that underlies some of the most common and debilitating health
problems [1]. Although serious issues remain to be resolved before the
large-scale secondary use of health data for research can become routine,
this topic has been recognized and identified as a national priority in Canada
[2] and the United States [3]. Clinical laboratory databases contain perhaps
the largest available collection of structured medical data representing human phenotypes of disease progression and response to therapy. Alone
and especially in combination with other clinical and environmental data,
laboratory databases have substantial value for translational research, including correlative studies linking gene expression with phenotype, and
for identifying groups of patients with similar characteristics for follow-up
analysis or inclusion in clinical studies.
Large-scale clinical databases permit targeted observational and correlative studies that complement randomized clinical trials [4,5]. These databases also hold the promise of more comprehensive analyses to reveal
unknown, useful real-world relationships among clinical data. The data volume and comprehensiveness that make these data sets useful, however, also
make them difficult or impossible to analyze by manual or traditional statistical methods. Analogous challenges have occurred previously in other domains, including the need to identify purchasing associations among billions
of retail transactions [6], the need to identify similarities in patterns among
terabytes of geologic data for oil exploration [7], and the need to identify
patterns in planetary mapping data [8], among many other examples. These
needs have been addressed using a set of techniques from the machine learning and pattern recognition fields collectively referred to as data mining.
In recent years, biomedical science has also begun to apply these techniques
to large-scale data analysis, as evidenced by the dramatic increase in biomedical publications referring to data mining over the past 10 years (Fig. 1).
Brief overview of data mining
Data mining has been described as the extraction of implicit, previously
unknown and potentially useful information [9], such as associations and
correlations between data elements, from large repositories of data. It is
the technical and statistical component of the process of knowledge discovery in databases (KDD, Refs. [10,11]), which has a primary goal of identifying useful new information and is sometimes used synonymously with
KDD. Although the data mining label is sometimes also applied to techniques designed to determine whether and to what extent prespecified patterns exist in data sets, those primarily data-querying methods are distinct from data mining. The latter, in contrast, relies on unsupervised machine learning and statistical techniques [12] to create a catalog of patterns found in data, without prespecification, and to assess their potential usefulness.

[Fig. 1. Number of entries in the PubMed database mentioning data mining, by year, plotted separately for articles and reviews. The figure shows a dramatic increase in articles and reviews in biomedical science mentioning data mining since 1981. The first article appeared in 1984, and single articles also appeared in 1995 and 1996; the first review appeared in 1997. Articles and reviews include clinical research topics and studies using bioinformatics/high-throughput analytic techniques. In 2006 there were 304 articles and 44 reviews mentioning data mining.]
Data mining techniques include methods to group data elements that
have similar features (pattern discovery, eg, association rules and clustering)
and methods to determine predictive and other relationships between groups
of data elements (model building, eg, regression techniques, classification,
neural networks, and support vector machines, among others) [12]. For
best results, these techniques require consistent, accurate data that are appropriately structured, and thus they are often applied in the setting of
integrated data repositories or warehouses designed for mining. Mining
typically yields a large number of patterns and relationships, some of which
are previously known or trivial. Ultimately, evaluation of the usefulness of
these patterns requires human expertise, although statistical and heuristic
methods are important for prioritizing, or pruning, found pattern sets
to yield a subset most likely to be of interest that is small enough to evaluate
manually. Data mining is often performed in settings in which the number of
features or dimensions (eg, over-expressed genes) contributing to a comparison between items of interest (eg, tumor tissue specimens) is relatively large
with respect to the population of items. Because traditional statistical methods
are not well-suited to evaluating the probability of coincidental patterns in
these high-dimensional data sets, methods specialized for data mining applications, such as false discovery rate calculations, have been developed [12].
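As an illustration of how such a correction might be applied in practice (not a method taken from this issue), the short Python sketch below implements the Benjamini-Hochberg false discovery rate procedure; the p-values in the example are invented.

```python
# Minimal sketch of false discovery rate control with the Benjamini-Hochberg
# procedure; the p-values below are invented for illustration only.
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean array marking which tests are retained at FDR level q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # sort p-values ascending
    ranked = p[order]
    passes = ranked <= (np.arange(1, m + 1) / m) * q
    keep = np.zeros(m, dtype=bool)
    if passes.any():
        last = np.max(np.nonzero(passes)[0])       # largest rank k with p_(k) <= (k/m)q
        keep[order[:last + 1]] = True              # retain the k smallest p-values
    return keep

# Example: p-values from many simultaneous pattern comparisons (invented)
pvals = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.45, 0.7, 0.88]
print(benjamini_hochberg(pvals, q=0.05))
```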
This issue presents an initial set of foundational articles in data mining
(techniques, tools, databases, and time sequence mining) followed by a set
of articles addressing data mining applications in health care. Basic data
mining techniques for discovering data patterns and their relationships are
introduced in the article by Brown, elsewhere in this issue. A substantial
number of high-quality software tools for applying these data mining techniques are available. An overview of open-source data mining tools is provided in the article by Zupan and Demsar, elsewhere in this issue, and
detailed descriptions of commercial tools are available from vendors, such
as SAS [13], SPSS [14], Cognos [15], Insightful [16], and Oracle [17]. Key
topics related to aggregation of data for mining are discussed in an article
by Lyman and colleagues, elsewhere in this issue, and topics related to
data handling for specific applications are addressed in later articles. Special considerations related to the developing field of time series mining are discussed in the article by Post and Harrison, elsewhere in this issue. The articles by Siadaty and Harrison, Harrison and Aller, Brossette and Hymel,
Klee, and Lee and colleagues, elsewhere in this issue examine data mining
applications relevant to laboratory medicine, including pattern detection
in laboratory databases in conjunction with knowledge bases, the characteristics and usefulness of several types of regional databases that may incorporate laboratory data, mining microbiology data in infection control, data
mining for biomarker discovery, and mining applied to high throughput genomics data sets. Some foundational topics are also addressed in the later articles in the setting of specific applications, for example pattern pruning
and false discovery rate evaluation in the articles by Siadaty and Harrison,
and Lee and colleagues, elsewhere in this issue.
Although topics in data mining can often be approached in an intuitive
manner, data mining methodology is based on mathematic and statistical
principles. Effective application of data mining techniques, including effective use of data mining software, requires a reasonable understanding of
these principles. A full introduction to the mathematics of data mining is beyond the scope of this brief review issue, but several references are generally
available for interested readers. Those who find the mathematic discussions
in the articles by Brown, Klee, and Lee and colleagues, elsewhere in this
issue too advanced may wish to start their study of data mining by reviewing
Tan's Introduction to Data Mining [18], which presents data mining topics intuitively using visualizations, or the initial sections of Dunham's Data Mining: Introductory and Advanced Topics [19], which takes a somewhat more mathematic approach. General reviews introducing data mining topics are
also available from Fayyad and colleagues [10], Hand and colleagues [20],
and Cios and Moore [21].
Special characteristics of medical data
Medical data have characteristics that make them uniquely difficult to analyze in an automated fashion by traditional techniques or by data mining
[21]. Some of these characteristics appear in data sets from other domains,
but medical data seem to combine more problematic and challenging features at once than almost any other type of data.
High dimensionality
High dimensionality means that many different data elements, each representing a dimension that can vary in value, characterize an item of interest, such as a patient, disease, or specimen. It is not unusual for a patient's medical record to contain 50 or even 100 different types of data elements. With so many variables across a limited number of comparisons, the likelihood of patients sharing coincidental patterns is high, and appropriate techniques must be used to minimize identification of spurious patterns.
Heterogeneity
Medical data may include textual descriptions, various types of images,
and discrete values using multiple scales. Values may be obtained from multiple methods, some of which may produce incompatible results for the same
observation. Data mining requires consistent data, which means that large
volumes of clinical data may need to be transformed to compatible representations. Some of these challenges are further addressed in the articles by Siadaty and Harrison, and Harrison and Aller, elsewhere in this issue.


Imprecision
Unlike retail transactions, which directly reflect a particular purchase act, medical observations commonly indicate a probability that a condition exists based on their sensitivity and specificity as indicators for that condition. A given feature thus may be consistent with more than one condition, or the condition may exist without the feature. Furthermore, sensitivities and specificities for most features are not known precisely for all data sets and at best may be estimated based on values obtained in other data sets. For these reasons, linking data elements to true characteristics of patients is not
straightforward.
Interpretations
Diagnoses and other summary data in medical records are generally human interpretations of aggregates of observations and objective data values.
Interpretations by different individuals may differ or even conflict, and are
often expressed as text that must be further interpreted to a form that
may be processed during data mining.
No canonical form
Although substantial progress continues to be made in developing standard medical terminologies, in the absence of a generally accepted representation for important medical concepts, many clinical data are still expressed
in idiosyncratic ways.
Incomplete and inconsistent data
Patients who have the same conditions may have substantially different types and timing of observations, unlike retail transactions, surveys, or technical data, which generally have comparable complements of data elements obtained at similar times. Clinical data may be inconsistent or conflicting for various reasons. These qualities add noise and spurious patterns that increase the difficulty of identifying real patterns of interest.
Difficult mathematic characterization
Medicine deals with concepts, such as inflammation, comorbidities, and disease severity, that strongly influence clinical outcome but are difficult to quantitate and incorporate into mathematic relationships with diagnoses and disease progression models.
Temporal patterns
Data elements in clinical records may not be meaningful outside of a particular temporal context. This phenomenon is particularly true for laboratory databases, which are largely composed of time sequences. For example, a time sequence of viral antigen and antibody values associated with hepatitis might be interpreted differently if it were randomly shuffled.
The important information that is implicit in time sequences presents a problem
because it is not accessible to data mining methods that catalog associations
between individual data elements. The article by Post and Harrison, elsewhere
in this issue discusses methods for abstracting temporal data sequences to
meaningful new data elements that can be used by data mining tools.
Need for generalization
The need to understand and generalize information about the mechanism
and causation of relationships in medical data makes data mining techniques
that do not reveal meaningful information about discovered relationships
(eg, neural networks) less desirable than those that do (eg, classification
trees), thus limiting applicable techniques.
Ethical, legal, and social issues
Important questions remain about the ownership and appropriate use of
clinical information, the need to ensure an appropriate balance of risk to patients with benefit to society, and the need to protect patient privacy and confidentiality. Controversy and confusion exist related to these issues that complicate the aggregation and analysis of clinical databases for reasons other than individual patient care. The call for a national framework for the secondary use of health information referenced above [3] is largely a recommendation that society, health care providers, and government resolve these issues so that the community, and ultimately patients, can benefit from techniques such as data mining.
These typical characteristics of medical data do not preclude data mining,
but they may necessitate extra data processing and compensatory techniques
that substantially complicate mining, increasing effort and costs.

Summary
Mining medical data, including laboratory data, is currently a challenging
exercise in data acquisition, aggregation, and reconciliation. Significant political and legal challenges may also exist for medical data mining projects.
Partly for these reasons, data mining techniques have not been widely used
in laboratory medicine (with a few notable exceptions). Substantial potential
exists, however. The last five articles in this issue show that data mining can
be applied successfully to clinical care, public health, and research problems.
As the volume of data online increases, standard data representations become more widespread, and issues related to secondary analysis of health
data are resolved, the cost and effort barriers to data mining projects will
decrease. Laboratory data represent a substantial volume of objective, relatively well-standardized and well-characterized data that directly express
patient phenotype in disease presentation and response to therapy. Because
of the value of these data, laboratories have the opportunity to contribute to
mining projects and to lead projects that particularly target laboratory data.
This is an opportunity that should not be missed and for which laboratories
can begin to prepare.

References
[1] Rees J. Complex disease and the new clinical sciences. Science 2002;296:698–700.
[2] Canadian Institutes of Health Research. Secondary use of personal information in health research: case studies. 2002. Available at: http://www.cihr-irsc.gc.ca/e/1475.html. Accessed August 26, 2007.
[3] Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc 2007;14(1):1–9.
[4] Grossman J, Mackenzie FJ. The randomized controlled trial: gold standard, or merely standard? Perspect Biol Med 2005;48(4):516–34.
[5] Jager K, Stel V, Wanner C, et al. The valuable contribution of observational studies to nephrology. Kidney Int 2007;72(5):539–42.
[6] Babcock C. Parallel processing mines retail data. Computerworld 1994;28(39):6.
[7] Harrison D. Backing up 100 terabytes. Network Computing 1993;413:98–104.
[8] Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: an overview. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Menlo Park (CA): AAAI Press; 1996. p. 1–34.
[9] Lee S, Siau K. A review of data mining techniques. Industrial Management & Data Systems 2001;100(1):41–6.
[10] Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine 1996;17(3):37–54.
[11] Hipp J, Guntzer U, Nakhaeizadeh G. Data mining of association rules and the process of knowledge discovery in databases. In: Perner P, editor. Advances in data mining: applications in e-commerce, medicine, and knowledge management. Berlin (Germany): Springer; 2002. p. 207–26.
[12] Hand DJ. Principles of data mining. Drug Saf 2007;30(7):621–2.
[13] SAS Institute Inc. SAS Enterprise Miner. Available at: http://www.sas.com/technologies/analytics/datamining/miner/index.html. Accessed August 26, 2007.
[14] SPSS Inc. Clementine. Available at: http://www.spss.com/clementine/. Accessed August 26, 2007.
[15] Cognos Inc. Data mining. Available at: http://www.cognos.com/data-mining.html. Accessed August 26, 2007.
[16] Insightful Corp. Insightful Miner. Available at: http://www.insightful.com/products/iminer/default.asp. Accessed August 26, 2007.
[17] Oracle Corp. Oracle data mining. Available at: http://www.oracle.com/technology/products/bi/odm/index.html. Accessed August 26, 2007.
[18] Tan P, Steinbach M, Kumar V. Introduction to data mining. Boston: Addison-Wesley Longman Publishing Co., Inc.; 2005.
[19] Dunham MH. Data mining: introductory and advanced topics. Upper Saddle River (NJ): Prentice Hall; 2002.
[20] Hand D, Blunt G, Kelly M, et al. Data mining for fun and profit. Stat Sci 2000;15(2):111–26.
[21] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26(1–2):1–24.

Clin Lab Med 28 (2008) 9–35

Introduction to Data Mining for Medical Informatics
Donald E. Brown, PhD
Department of Systems and Information Engineering, University of Virginia,
151 Engineers Way, Charlottesville, VA 22904, USA

Although data mining is a new field of study of interest to medical informatics, the application of analytic techniques to the discovery of patterns has
a rich history. Perhaps one of the most successful early uses of data analysis
for discovery and understanding was in medicine, specically infectious
diseases.
In the middle of the nineteenth century, London was hit with a pandemic
of infectious disease that killed large numbers of its citizenry. At that time
medicine knew little about the causes of infectious diseases but two theories
competed for consideration. The first and more popular theory, called the
miasma theory, suggested that bad air propagated disease. The second theory postulated infectious agents or germs as the source of infection.
A leading supporter of the miasma theory was Dr. William Farr, a civil
servant in the General Register Office. According to Farr, decaying organic
matter provided a mechanism for the transfer of disease. In areas closer to
the Thames River the air was particularly unhealthy from decaying matter,
whereas locations away from the Thames had more healthy air. His careful
analysis of available mortality data showed a strong negative correlation
with elevation above the Thames.
Dr. John Snow was a leading proponent of the germ theory. Snow was
a pioneer in anesthesia and had served as an obstetrician to Queen Victoria.
During the pandemic of this period, he meticulously collected observational
data on those infected with the disease and carefully looked for patterns that
could provide causal understanding. He quickly narrowed his search to understanding the contribution of water to the disease. During the outbreak in
1853 and 1854, he was particularly interested in assessing the association of the disease with specific water companies.

E-mail address: brown@virginia.edu



When a new outbreak of the disease occurred in 1852, Snow carefully collected data from higher and lower elevations that had different sources of water. His analysis showed clearly that once the source of water was taken into account, the elevation above the Thames was not predictive of infection. Despite the apparent relationship shown in Farr's findings, Snow showed strong evidence that germs within a polluted water supply were the mode of disease transmission.
This example illustrates the advantages and disadvantages of data mining. On the positive side, data mining can discover important patterns. We can frequently identify patterns even when we do not fully understand the causal mechanisms behind those patterns. In this sense, data mining can open pathways to research and discovery that may not have been evident. At the same time, data mining can also turn up irrelevant or cursory patterns that confuse more than they enlighten. Data mining thus should never replace careful analysis and directed reasoning about important problems.
The term data mining itself is somewhat unfortunate because we are generally not interested in mining data per se, but rather in mining information
from data. We live in data-rich times and as each day passes more data are collected and stored in databases. The desire to process and use these data to help
answer or understand important questions has driven the development of data
mining techniques. The goal of these techniques is to nd information within
the large stores of data. By information we mean patterns that are persistent
and meaningful. Data mining is sometimes referred to as knowledge discovery, which implies the deeper goal of extracting knowledge from data.
Most techniques for mining large data sets have emerged in the last 20 years
with the widespread use of databases, particularly relational databases. These
databases have provided outstanding capabilities for transaction processing,
meaning that individual records for many millions of individuals can be
quickly and securely updated based on real-time processing of a transaction.
These same characteristics that enable efficient operation of individual transactions in large databases frustrate the use of the data for analysis, leading to embarrassing shortcomings, such as the following: Although I can tell you the date, time, and results from your last TB test, I cannot tell you how many TB tests were performed by a specific laboratory over a specific period of time. Because of these shortcomings, most database developers have moved
to add functionality to provide some limited data summaries to their customers. The more intense search for patterns or information in these collections of data has only recently been available through data mining, however.
Although extracting information from data is the major motivation for
data mining, there is also a second, less discussed motivation. Much research
has shown that unaided human analysis of data for decision making is unintentionally flawed (see Ref. [1]). Even with small databases, data mining can provide protection against unaided human inference about patterns. In this use, data mining is an aid to human judgment, and for this reason data mining techniques should attempt to provide quantifiable measures behind the discovered patterns.


This article provides an introduction to common data mining techniques with a view toward their use. The next section describes common techniques for discovery and exploration of associations in observations and variables. These techniques typically help us to organize and understand data. The section on predictive methods provides descriptions of important techniques for relating one set of variables or features in the data to another set of variables. The latter set are commonly called response variables and are of particular interest to the analyst. For instance, the response variable may be the presence of coronary heart disease, and the goal is to find features in the data set that can predict this response. The article ends with a discussion of evaluation methods for data mining techniques. Regardless of the technique used, it is important to understand the basic approaches to evaluating the technique, and this section provides that overview.
Despite its relative newness, the data mining field has many more techniques than can be discussed in this introductory article. The focus here is
therefore on techniques that are common and whose understanding can provide the foundation for the understanding and use of the large variety of
techniques not discussed.

Discovery techniques
Discovery techniques look for the interdependence or association between observations or between variables in a data set. Unlike the methods
described in the section on predictive techniques, the data have not been segmented by the analyst into sets of particular interest. Specifically, there are no designated response variables, such as coronary heart disease and type II diabetes. Instead the concern is with finding patterns of association among the observations (eg, patients) or variables (eg, demographics). Many techniques can be used for finding associations among observations and variables. Nonetheless, to maintain a user perspective on the techniques, this article purposely separates techniques by the analyst's goal.
If the analyst is interested in finding associations among variables or features in a database, then the section on discovery methods for variables provides an introduction. These methods are particularly relevant to new problems in which databases have many more variables than observations. For example, in gene expression databases it is not uncommon to find tens of thousands of variables and only a few hundred observations.
The section on discovery methods for observations describes introductory methods for finding associations among observations. When used
well these methods provide analysts with improved understanding of their
data and sometimes serve as a preliminary step to the methods described
in the predictive techniques section. In other words, these methods frequently enable the segmentation of the data into sets of response and predictor variables needed by the predictive techniques.


Discovery methods for variables


The goal of discovery methods for variables is to find associations that link variables in a database. This problem has become increasingly important as the quantity of data collected has increased. The gene expression problem has many thousands of variables with relatively few observations. This situation is also the case for text mining, wherein there are tens of thousands of possible words in a text document and each word or combination of words can represent a single variable.
One of the most common and effective methods for associating variables uses singular value decomposition (SVD). Here SVD is described in terms of principal components. Principal components are linear combinations of the variables in the database, generated by a search for the linear combinations that best capture as much of the spread or variance of the original data as possible. The linear combinations are found that satisfy the following properties:
The variance of each principal component is maximized,
The principal components are orthogonal to each other, and
Each component is normalized so that the sum of its squared coefficients is one.
As indicated, the first property derives from a desire to display as much of the spread of the original data set as possible. The second property is for convenience. Orthogonality means that the projection in any two principal components has properties similar to those customarily observed in Cartesian displays. The final property is also for convenience; it provides a bounded solution to the optimization problem.
With these properties it is straightforward to find the linear combinations of the variables that produce the principal components. Let $a_{jk}$, $j, k = 1, 2, \ldots, K$, be the coefficients for the $j$th variable in the $k$th principal component. The value for observation $i$ of the $k$th component, $e_{ik}$, is given by

$$e_{ik} = \sum_{j=1}^{K} a_{jk} x_{ij}$$

In vector notation,

$$e_k = a_k^T x_i$$

where $T$ indicates transpose, $e_k^T = (e_{1k}, e_{2k}, \ldots, e_{nk})$, $a_k^T = (a_{1k}, a_{2k}, \ldots, a_{Kk})$, and $x_i^T = (x_{i1}, x_{i2}, \ldots, x_{iK})$, for $k = 1, 2, \ldots, K$.


Consider the first principal component, $k = 1$. Because the goal is to maximize the variance, this is equivalent to finding the $a_1^T = (a_{11}, a_{21}, \ldots, a_{K1})$ that maximizes the variance of the first principal component, $\lambda_1$. This variance is given by

$$\lambda_1 = a_1^T S a_1$$

where $S$ is the sample covariance matrix. As noted, the normalizing property provides a bound for this solution. The optimization thus requires the constraint that

$$a_1^T a_1 = 1.$$

With this normalizing constraint, the solution for the vector that maximizes the variance of the first principal component is the first eigenvector of $S$. So $a_1$ is the first eigenvector of $S$, and the variance of the first principal component, $\lambda_1$, is the first eigenvalue of $S$.
The remaining principal components are found through similar procedures. Specifically, the second principal component is found from the $a_2^T = (a_{12}, a_{22}, \ldots, a_{K2})$ that maximizes the variance of the second principal component, $\lambda_2$. This variance is given by

$$\lambda_2 = a_2^T S a_2$$

with the normalizing constraint

$$a_2^T a_2 = 1.$$

In this case there is also the second constraint given in the goal statement: the principal components should be orthogonal. This implies that

$$a_1^T a_2 = 0.$$

Under these constraints, the solution for $a_2$ is the second eigenvector of $S$ and $\lambda_2$ is the second eigenvalue. The procedure continues in this fashion to obtain the remaining $K - 2$ principal components. Each additional component found must be orthogonal to the ones preceding it. The solutions again are the eigenvectors of $S$ for the principal components and the eigenvalues for the variances. In data mining we typically look for solutions in which the number of principal components is less than the number of variables in the database, indicating that the procedure has identified associated variables and placed them in the same principal components.
The proportion of variance explained by the principal components is easily found. For example, the variance explained by the first two principal components is $(\lambda_1 + \lambda_2)/(\lambda_1 + \lambda_2 + \cdots + \lambda_K)$. Because data mining looks for associations among variables, the results are particularly interesting when a small number of principal components explain a large amount of the variance in the database. It is unrealistic to expect a small number of variables to explain nearly all of the variance; however, it is often possible to find a small number of principal components that explain as much as half of the original variance. As the number of variables gets larger it can become harder to achieve this goal.
A small but real data set is used to illustrate principal components and
the other techniques in this article. The data set consists of 768 female
Pima Indians evaluated for diabetes. There are nine variables in the data set:
Number of times pregnant
Two-hour oral glucose tolerance test (OGTT) plasma glucose
Diastolic blood pressure
Triceps skin fold thickness
Two-hour serum insulin
Body mass index
Diabetes pedigree function
Age
Diabetes onset within 5 years
The National Institute of Diabetes and Digestive and Kidney Diseases of
the National Institutes of Health originally had these data and in 1990 they
were provided to the University of California, Irvine; they can be downloaded at www.ics.uci.edu/mlearn/MLRepository.html. These data have
been used extensively in the data mining community to assess and understand data mining techniques (see Ref. [2]). The raw data have incorrect
and incomplete entries, such as a glucose of 0 or a body mass index of 0.
Once these observations are removed there are 392 entries remaining.
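A minimal sketch of this loading and cleaning step in Python is shown below. The file name pima-indians-diabetes.csv and the column names are assumptions made for illustration; they are not specified in the article.

```python
# Sketch of loading the Pima Indian diabetes data and removing physiologically
# impossible entries (eg, glucose of 0 or body mass index of 0), as described
# in the text. The file name and column names are assumptions for illustration.
import pandas as pd

columns = ["pregnant", "glucose", "blood_pressure", "skin_thickness",
           "insulin", "body_mass", "pedigree", "age", "diabetes"]

pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=columns)

# Treat zeros in the physiologic measurements as missing and drop those rows.
measured = ["glucose", "blood_pressure", "skin_thickness", "insulin", "body_mass"]
clean = pima[(pima[measured] != 0).all(axis=1)]

print(len(pima), "raw rows ->", len(clean), "rows after cleaning")  # text reports 392
```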
Fig. 1 shows the plot of the first two principal components from the Pima Indian data set. This graph shows that the variables glucose, blood pressure, and insulin are associated in that they have low loadings or weights in the first principal component and similar weights in the second principal component. The graph further shows groupings of observations and outliers
in these principal components. For example, observation 227 seems unusual
in the second principal component and in the combination of both principal
components. There also seems to be a tight clustering of observations at
(0.7, 0.0).
The complete list of principal components is shown in Table 1. This table
illustrates how the use of principal components combines variables and provides the analyst with a view toward variable associations. In these data
component two has similarly weighted body mass index and skin thickness.
It has also associated age and pregnancy.
[Fig. 1. Plot of the first two principal components (Comp.1 and Comp.2) for the Pima Indian data set, showing the loadings of Pregnant, Age, Glucose, BloodPress, Insulin, Pedigree, SkinThick, and BodyMass.]

Principal components can be viewed as a technique derived from singular value decomposition. To understand how this is done, consider a database with $n$ observations and $p$ variables. Decompose the $n \times p$ data matrix, $X$, into two orthogonal matrices and a diagonal matrix:

$$X = U D T^T$$

$U$ is $n \times p$ and is a matrix of eigenvectors or principal components. $T$ is $p \times p$, and $D$ is diagonal and $p \times p$. Also, $U^T U = I$ and $T^T T = I$. This decomposition thus produces the principal components of the data matrix.
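The following short numpy sketch illustrates this relationship, reusing the hypothetical clean data frame from the loading example above; it is one possible way to compute the components, not the analysis actually performed in the article.

```python
# Sketch relating principal components to the singular value decomposition.
# `clean` is the hypothetical cleaned Pima data frame from the loading sketch;
# the binary response column is excluded before decomposition.
import numpy as np

X = clean.drop(columns="diabetes").to_numpy(dtype=float)
Xc = X - X.mean(axis=0)                    # center each variable

# SVD of the centered data matrix: Xc = U * diag(d) * Vt
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)

loadings = Vt.T                            # columns are the loading vectors a_k
eigenvalues = d ** 2 / (len(Xc) - 1)       # eigenvalues of the covariance matrix S
explained = eigenvalues / eigenvalues.sum()

scores = Xc @ loadings                     # e_ik, the principal component values
print("variance explained by the first two components:", explained[:2].sum())
```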
Table 1
Principal component loadings

Variable          PC 1    PC 2    PC 3    PC 4    PC 5    PC 6    PC 7    PC 8
Pregnant          0.315   0.552   0.218   0.197   0.245   -       0.195   0.634
Glucose           0.424   -       0.474   0.227   -       0.723   -       -
Blood pressure    0.330   -       0.391   0.307   0.776   -       0.145   0.100
Skin thickness    0.383   0.413   0.302   -       0.381   0.122   0.642   0.106
Insulin           0.365   -       0.582   0.255   -       0.660   0.119   -
Body mass         0.368   0.505   0.256   -       0.141   -       0.711   0.107
Pedigree          0.167   0.188   0.256   0.847   0.388   -       -       -
Age               0.412   0.477   0.118   0.156   0.110   -       -       0.741

Abbreviation: PC, principal component.


In addition to principal components, many other methods exist for associating variables. Some representative methods include partial least squares
[3], ridge regression [4], and independent components [5]. A discussion and
comparison of methods can be found in the article by Copas [6].
Discovery methods for observations
Many people outside the field of data mining believe that associating observations is the only purpose for data mining. It is unquestionably the area
of data mining that has received the most attention in the popular press. It is
also the area of data mining that arguably contains the largest number of
intractable problems. The goal is similar to that of associating variables,
but when analysts associate observations they often want more than they require when associating variables. In particular, they seek strength of association and any causal implications. These requirements create inferential and
computational burdens on the proposed techniques.
This section provides an introduction to the techniques in this challenging
area. It starts with a description of the market basket problem and the Apriori algorithm often used for its solution. The section ends with an overview
of clustering methods and the commonly used hierarchical approaches.
Again the goal is to introduce representative techniques.
Data mining of customer purchase behavior is the aim of market basket analysis. Consider, for example, data on the purchase of items by customers at
a store over a recent period of time. Do these customers frequently buy the
same groups of items? So, for example, when they purchase peanut butter,
do they also purchase jelly? Understanding these associations may help store
managers to better inventory, display, and manage their marketable items. In
health care, market basket analysis can provide an understanding of associations among patients with demands for similar services and treatments.
Consider the set of all possible items that can be placed in a customer's market basket. Then each item has a value associated with it, which represents the quantity purchased by that customer. The goal of market basket analysis is to find those values of items for which their joint probability of occurrence
is high. Unfortunately, for even modest-sized stores this problem is
intractable.
Instead, analysts typically simplify the problem to allow only binary
values for the items. These values reflect a yes or no decision for that item and not the quantity. Each basket then is represented as a vector of binary valued variables. These vectors show the associations among the items. The results are typically formed into association rules. For example, "customers who buy peanut butter (pb) also buy jelly (j)" is converted to the rule

pb ⇒ j

These rules are augmented by the data to show the support and the confidence in the rule. Support for a rule means the proportion of observations or transactions in which both items occurred together. The confidence for a rule shows the proportion of times the consequent of the rule occurs within the set of transactions containing the antecedent. In the above example, the confidence for the rule would be the proportion of times that jelly was purchased among those customers who purchased peanut butter.
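These two definitions translate directly into code. The sketch below computes support and confidence for the rule pb ⇒ j over a handful of invented transactions.

```python
# Support and confidence for the rule pb => j, computed over a handful of
# invented transactions.
transactions = [
    {"pb", "j", "bread"},
    {"pb", "bread"},
    {"pb", "j"},
    {"milk", "bread"},
    {"j", "milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"pb", "j"} <= t)   # pb and j together
pb_only = sum(1 for t in transactions if "pb" in t)       # transactions with pb

support = both / n                  # proportion of all transactions with pb and j
confidence = both / pb_only         # proportion of pb transactions that also have j
print(f"support = {support:.2f}, confidence = {confidence:.2f}")   # 0.40 and 0.67
```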
Several algorithms have been developed to find rules of this sort. One of the earliest and most commonly used of these algorithms is the Apriori algorithm (Ref. [7]). The Apriori algorithm operates on sets of items in baskets (ie, those with value one in the binary formulation). These sets are called itemsets. The algorithm begins with the most frequently observed single itemsets, meaning those items most often purchased by themselves. The algorithm uses these sets to find the most commonly purchased two-item
itemsets. At each iteration it prunes itemsets that do not pass a threshold
on support or the frequency with which the itemset appears in the transactions. Once the common two-item itemsets are found that pass this
threshold, the algorithm uses these to consider three-item itemsets. These
are again pruned based on the support threshold. The algorithm proceeds
in this way and stops when the threshold test is not satisfied by any
itemset.
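A compact Python sketch of this iterative generate-and-prune idea is shown below; the transactions and the support threshold are invented for illustration, and the code is a simplified illustration rather than a production Apriori implementation.

```python
# Simplified sketch of the Apriori iteration: start from frequent single
# itemsets, extend them one item at a time, and prune candidates whose support
# falls below a threshold. Transactions and threshold are invented examples.
def apriori(transactions, min_support=0.4):
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    frequent = [frozenset([i]) for i in items
                if support(frozenset([i])) >= min_support]     # level 1
    found = {f: support(f) for f in frequent}

    k = 2
    while frequent:
        # Candidate k-itemsets formed by joining frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        frequent = [c for c in candidates if support(c) >= min_support]
        found.update({f: support(f) for f in frequent})
        k += 1
    return found

baskets = [frozenset(t) for t in
           [{"pb", "j", "bread"}, {"pb", "bread"}, {"pb", "j"},
            {"milk", "bread"}, {"j", "milk", "bread"}]]
for itemset, s in sorted(apriori(baskets).items(), key=lambda kv: -kv[1]):
    print(set(itemset), round(s, 2))
```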
Hierarchical clustering provides another technique for finding associations among observations. Clustering puts observations together into groups based on their similarity or distance from each other. In this sense, clustering compacts a database into a single variable that labels groups of observations. The groups or clusters represent associated observations.
Hierarchical clustering also provides a level that shows the point of formation of different clusters, which allows for viewing of the database in two dimensions: one (typically the abscissa) showing the cluster labels and the other (the ordinate) showing the level of cluster formation. This plot is called a dendrogram (illustrated in Figs. 2–4). The combination of labels
and levels provides an indication of the patterns and structures in the
database.
Hierarchical clustering begins with a measure of similarity or dissimilarity. Recall that a distance, $d_{ij}$, between two observations, $x_i$ and $x_j$, in $p$-dimensional space has three properties:

1. $d_{ij} = 0$ if $x_i = x_j$ and $d_{ij} > 0$ if $x_i \neq x_j$;
2. $d_{ij} = d_{ji}$; and
3. $d_{ik} \leq d_{ij} + d_{jk}$.

A dissimilarity, $d_{ij}$, between observations $x_i$ and $x_j$ measured in $p$ variables has the first two properties of a distance. The third property, called the triangle inequality, is not required.
Many possible choices can be used for dissimilarity measures, including standard distances between quantitative variables:

Euclidean:

$$d_{ij} = \left\{ \sum_{k=1}^{p} \left( x_{ik} - x_{jk} \right)^2 \right\}^{1/2}$$

Manhattan or city block:

$$d_{ij} = \sum_{k=1}^{p} \left| x_{ik} - x_{jk} \right|$$

Maximum:

$$d_{ij} = \max_{k=1,2,\ldots,p} \left| x_{ik} - x_{jk} \right|$$

Similarity measures, $s_{ij}$, between observations $x_i$ and $x_j$ measured in $p$ variables also have defining properties:

1. $s_{ij} = M$ if $x_i = x_j$ and $s_{ij} < M$ if $x_i \neq x_j$; and
2. $s_{ij} = s_{ji}$.

Here $M$ is some upper bound for the similarity, and typical choices are 1, 10, and 100. An example of a similarity between observations measured with quantitative variables is the absolute value of the correlation.

[Fig. 2. Dendrogram of single link clustering for the Pima Indian data set.]

[Fig. 3. Dendrogram of complete link clustering for the Pima Indian data set.]

[Fig. 4. Dendrogram of average link clustering for the Pima Indian data set.]

Once the similarity or dissimilarity measure has been chosen, hierarchical clustering proceeds with the following algorithm:

1. Put every observation, $x_1, x_2, \ldots, x_n$, in its own cluster, $c_1, c_2, \ldots, c_n$.
2. Join the two clusters with the minimum dissimilarity or maximum similarity.
3. Recompute the dissimilarity or similarity between clusters.
4. Stop if all observations are now in one cluster; otherwise, go to step 2.
Because there are a finite number of observations and at each step two clusters are joined together, this algorithm is guaranteed to stop with all observations in one cluster.
Step 3 of the algorithm recomputes the dissimilarity or similarities between clusters. The choice made in this step determines the type of hierarchical clustering. Three popular choices for type of clustering are single link
(SLINK), complete link (CLINK), and average link (AVELINK).
Single link uses a minimum distance updating. After two clusters are
joined then the new dissimilarities between clusters are the minimum dissimilarity between observations in one cluster and the observations in another
cluster. Formally this procedure computes a minimum spanning tree
through the observations.
The complete link method computes the maximum dissimilarity between
observations in one cluster and the observations in another cluster. Finally,
the average link method computes the average dissimilarity between observations in one cluster and the observations in another cluster. The above descriptions have used dissimilarities. When working with similarities, the minimums are replaced with maximums and vice versa.
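The three linkage choices map directly onto standard library routines. The sketch below, which again assumes the hypothetical clean Pima data frame from the earlier examples and uses Euclidean dissimilarity, is one way to reproduce this kind of analysis with scipy.

```python
# Sketch of single, complete, and average link hierarchical clustering of the
# Pima data with Euclidean dissimilarity; `clean` is the hypothetical cleaned
# data frame assumed in the earlier loading sketch.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist
import matplotlib.pyplot as plt

X = clean.drop(columns="diabetes").to_numpy(dtype=float)
dissimilarities = pdist(X, metric="euclidean")      # pairwise distances

for method in ("single", "complete", "average"):
    Z = linkage(dissimilarities, method=method)     # build the hierarchy
    labels = fcluster(Z, t=4, criterion="maxclust") # cut the tree into 4 groups
    sizes = sorted((labels == c).sum() for c in set(labels))
    print(method, "link cluster sizes:", sizes)

    plt.figure()
    dendrogram(Z, no_labels=True)                   # dendrograms as in Figs. 2-4
    plt.title(f"{method} link clustering")
plt.show()
```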
To illustrate the use of hierarchical clustering, consider again the Pima Indian data set. Figs. 2–4 show the dendrogram plots for single link, complete link, and average link clustering, respectively. The single link results have grouped most observations together, but in so doing they have identified
several unusual observations. For example, the single observation that is
the last to join the cluster is a patient whose blood pressure is in the bottom
quartile; skin thickness is almost in the bottom quartile and body mass index
is in the bottom half. Her insulin was the largest observed in the study, however, and she is a 59-year-old diabetic.
The complete and average link results pull out and organize a number of
groups in the data. For example, cutting the dendrogram to produce four
groups in each method produces the results shown in Table 2. The two
methods produce similar clusters; specifically, clusters 1 and 3 are identical,
whereas cluster 2 in average link is larger and cluster 4 in complete link is
larger. Table 3 shows the median values for several of the variables for
cluster 4 in the complete and average link results compared with the medians for the data set. The dierences between the complete and average
link results follow from the dierence in membership in this cluster as
shown in Table 2: the complete link cluster has 12 additional members.
In both cases, however, the clustering has found a group of patients who
have higher values on all of these variables. The complete link cluster
4 has 45% of its members who have diabetes, whereas the average link
cluster has 56%.
Table 4 shows the cluster 3 medians for the same variables in comparison
with the data set medians. In this case the cluster is identical using complete
or average link. Unlike cluster 4, the patients identified in this cluster have
the same median blood pressure as the study population, but much higher
insulin values.
These results illustrate how clustering can discover patterns of observations in data sets. For these data the clustering techniques have found
groups of patients with noticeably similar and distinct values for the variables. Of course, the simple analysis here only looked at partitions into

Table 2
Complete and average link clusters

                    Complete link
Average link        1       2       3       4
1                 314       0       0       9
2                   0      15       0       5
3                   0       0       3       0
4                   2       0       0      46


Table 3
Medians for complete and average link cluster 4

Method      Glucose   Blood pressure   Skin thickness   Insulin
Complete    141.5     74               32.5             274.5
Average     146.5     76               34.5             276.5
Data        119.0     70               29.0             125.0

four clusters. A more complete analysis would consider other partitions and
possibly other clustering methods.

Predictive techniques
Many data mining techniques go beyond discovering relationships between variables and observations. A major set of techniques focuses on
the prediction of variable values given the values of other variables. In
this sense, these techniques look for strong associations between sets of
variables.
Prediction requires a priori identification of the set of variables to consider as predictors and the set of variables to predict (the response variables). Although many methods attempt to identify the more important predictor variables, no methods can be expected to find predictors that
are not present in the data. The type of response variable provides a strong
constraint on the data mining technique.
This section introduces representative data mining techniques for prediction. To keep the notation manageable only the single-variable response is
described. The extension to the multivariate case is conceptually straightforward once the univariate case is understood. The section begins with numeric response, because this builds directly on commonly used regression
or least squares techniques. From there the discussion moves to the categorical response variables. Most data mining methods can handle both types of
response, although the actual mechanics of the methods change with changing response type. Part of the discussion of these methods involves variable selection and interpretation. In many applications, understanding the contribution of the variables to the prediction is important. Not all methods provide an interpretation, and, hence, this difference is noted where relevant.

Table 4
Medians for complete and average link cluster 3

Method      Glucose   Blood pressure   Skin thickness   Insulin
Complete    189.0     70               33.0             744.0
Data        119.0     70               29.0             125.0


Least squares regression


The inputs to predictive data mining techniques are values on variables
segmented into predictors and response variables. These methods are frequently termed supervised learning techniques. This name connotes
knowledge of the values of the response variable for a set of predictors.
The name also implies the desire to learn to predict the value of the response
in the presence of new observations on the predictors.
Data mining algorithms change either subtly or dramatically depending
on whether the response variable is numeric or categoric. Numeric response
variables have values over a continuum or a reasonably large set of integers.
Categorical response variables have values that are unordered labels, such as
names. An intermediate category of response values is simply ordered. So unlike names these values have an ordering, but unlike numeric values the ratio or difference of these values is meaningless. For the purposes of this
article, the methods that can handle categorical response can also handle
ordered response.
The oldest and most basic methods of predictive data mining for a numeric response use least squares regression. These methods find a linear function of the predictor variables that minimizes the sum of squared differences with the actual response values. Let $y_i$ be the response value and $x_i$ be the vector of predictor values for observation $i$, $i = 1, \ldots, n$. Also let $f$ be the function that estimates the response and $\theta$ be the vector of parameters in this function. For a given functional form, least squares chooses the parameters that minimize the sum of squared distances to each observed response value. So, the estimated parameters, $\hat{\theta}$, are given by

$$\hat{\theta} = \arg\min_{\theta} \left\{ \sum_{i=1}^{n} \left( y_i - f(\theta, x_i) \right)^2 \right\}$$

A convenient choice for $f$ is a linear form, and for $p$ predictor variables this gives the following:

$$f(\theta, x_i) = \theta_0 + \theta_1 x_1 + \cdots + \theta_p x_p$$
As an example of the use of least squares, the technique was applied to
the Pima Indian data to predict insulin. Insulin was used as the response because it is a numeric value, whereas the presence or absence of diabetes is
categorical. Actually, the log of insulin was predicted because the insulin
values in the data (Fig. 5) are heavily skewed. The log transformation removes this skewness. Fig. 6 shows a plot of the predicted versus actual
values of the log of insulin. Also shown is a line at 45°. Clearly, the predicted
values in this model tend to overestimate the insulin for lower values and
underestimate for higher values. Nonetheless, the model does predict reasonably well for this simple data set.


Fig. 5. Histogram of insulin in the Pima Indian data set (x-axis: insulin, 0 to 800; y-axis: frequency).

The least squares approach also identifies useful predictor variables. This identification is accomplished by hypothesis testing on the values of the parameters for each variable in the model. The hypothesis tested is whether a variable's parameter is significantly different from zero. The level of significance is chosen by the analyst.

Fig. 6. Predicted versus actual values of log(insulin) in a linear model of the Pima Indian data (x-axis: actual log(insulin); y-axis: predicted log(insulin)).


For the Pima Indian data set, two variables show significance in predicting insulin: glucose and body mass index. Glucose actually has a nonlinear
relationship with insulin and this nonlinearity can also be captured in the
model.
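To make the least squares step concrete, the following R sketch fits a linear model for log(insulin). It assumes the Pima Indian data as distributed in the mlbench package (data set PimaIndiansDiabetes2, whose column names such as glucose, mass, and insulin come from that package rather than from this article); the particular predictors chosen are illustrative only and are not the authors' original model.

    # Minimal sketch: least squares regression for log(insulin) in R.
    # Assumes the mlbench package; na.omit() keeps the complete cases,
    # which should correspond to the 392 observations analyzed here.
    library(mlbench)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    # Fit the linear form f(theta, x) by least squares.
    fit <- lm(log(insulin) ~ glucose + pressure + triceps + mass + age,
              data = pima)
    summary(fit)   # t-tests on the coefficients identify useful predictors

    # Predicted versus actual log(insulin), with a 45-degree reference line.
    plot(log(pima$insulin), predict(fit),
         xlab = "Actual log(insulin)", ylab = "Predicted log(insulin)")
    abline(0, 1)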
Logistic regression
Least squares regression models data mining problems with numeric response variables. To mine data with categorical response variables requires a different approach to regression. Consider the simplest case in which the categorical variable is binary (eg, diabetes or no diabetes). Least squares regression would not be appropriate for this problem because it would provide predictions that would lie outside the binary response values.
An extension to the regression approach is accomplished by modeling the probability of a binary response. With n independent observations, the probability of k occurrences of an event is given by a binomial distribution. Let μ be the parameter for this binomial distribution, which is simply the probability of an event in any observation. A convenient, but by no means unique, model assumes this probability μ is a logistic function of the predictors with parametric vector θ. This yields the following:

\log\left( \frac{\mu}{1 - \mu} \right) = \theta^T x

where x^T = (x_0, x_1, \ldots, x_k) and \theta^T = (\theta_0, \theta_1, \ldots, \theta_k).
This model can be applied to the Pima Indian data to classify the patients as diabetic or not based on the values of the predictor variables. Fig. 7 shows a plot of the actual versus predicted values for these data. The model does reasonably well. Using a test set of patients not used to construct the model, it achieved an error rate of 22%. This finding compares well to the base rate of diabetes in this data set of 33%. The plot shows, however, that there are patients who are not diabetic and yet are given a high probability (>0.9) of having diabetes, whereas there are other patients who have diabetes and yet are given a low probability (<0.1) of this event by the model.

Fig. 7. Actual versus predicted values in a logistic regression model of the Pima Indian data (x-axis: predicted probability; y-axis: actual diagnosis).
As with linear regression, logistic regression provides insight into the influence of the predictors on the response. Using likelihood ratio tests, two variables, glucose and body mass index, show significance (p < .05) for predicting diabetes in this population. A third variable, age, is significant at p < .1.
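In R, a logistic regression of this kind can be fit with glm, as in the hedged sketch below; it reuses the assumed mlbench version of the Pima data, and the chosen predictors and test-set split are illustrative rather than those of the article.

    # Minimal sketch: logistic regression for the binary diabetes response.
    library(mlbench)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    # Hold out roughly one third of the observations as a test set.
    set.seed(1)
    test_idx <- sample(nrow(pima), size = round(nrow(pima) / 3))
    train <- pima[-test_idx, ]
    test  <- pima[test_idx, ]

    # log(mu / (1 - mu)) = theta' x, fit by maximum likelihood.
    fit <- glm(diabetes ~ glucose + mass + age,
               family = binomial, data = train)
    summary(fit)                    # Wald tests for each coefficient
    drop1(fit, test = "Chisq")      # likelihood ratio tests, as in the text

    # Error rate on the held-out patients.
    prob <- predict(fit, newdata = test, type = "response")
    pred <- ifelse(prob > 0.5, "pos", "neg")
    mean(pred != test$diabetes)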
Classification trees
Classification trees provide an easily understood and interpretable approach to predictive data mining with a categoric response. The central idea is to partition the data set into regions for which a particular categorical response is prominent. The partitioning is accomplished through a series of questions. For example, is body mass index less than 40? Observations with affirmative answers to this question are separated from those with negative answers. Additional questions continue the partitioning until regions are found that primarily contain a single response value.
This partitioning can be viewed as a tree. Each node represents a question that partitions the data and the combination of all questions in the nodes provides the final partitioning. Fig. 8 shows a classification tree obtained for the Pima Indian data set.

Fig. 8. Classification tree for the Pima Indian data (splits on glucose at 127.5 and 165.5, insulin at 143.5, and age at 28.5 and 23.5; terminal nodes are labeled Diabetes or No Diabetes).

The top or root node partitions the data based on values of glucose less than 127.5. All observations with values less than 127.5 go left in the tree and those greater than 127.5 go right. The observations on each path are further partitioned. For example, those that went right are again partitioned by values of glucose, but this time they are compared with the value 165.5. Those that went left are now partitioned based on the value of insulin. The nodes at the base of the tree provide the classification labels. Again looking at the tree in Fig. 8, patients who have a value of glucose greater than 165.5 are classified as diabetic, whereas those who have a glucose measurement less than 127.5 and insulin less than 143.5 are classified as nondiabetic.
This example shows that tree classifiers can provide easily understood results with excellent interpretability. Obviously a larger tree with more variables becomes less easily understood, but even in these cases it is possible to view the tree in segments, which lends understanding to even very large databases. This ease of interpretability and understanding is one of the major reasons for the use of classification trees in data mining.
The construction of classification trees is made possible through various algorithms. One of the most effective of these, known as recursive partitioning, was developed by Breiman and colleagues [8]. This algorithm constructs trees by providing effective answers to three major tree construction questions: (1) when to stop growing the tree, (2) what label to put on a terminal node, and (3) how to choose a question at a node.
The first question they answered in an unusual but important way. They do not stop growing the tree. Instead the algorithm grows the tree out to its maximum size (eg, each observation in its own terminal node). It then prunes the tree back to a size that best predicts a set of holdout samples (the actual approach used is discussed in the evaluation section). This pruning approach avoids generating trees that are not effective because they did not consider a sufficiently large and cooperative set of nodes.
The second question is fairly easily answered by simply counting the number of members of each category that appear in a terminal node and choosing the winner. Ties are simply reported. This approach means that the algorithm provides a quick estimate of the probabilities for each label in the terminal node. For example, looking again at the tree in Fig. 8, patients who have a glucose reading greater than 165.5 are classified as diabetic. This classification has an estimated probability of 0.85 because 85% of the patients in the Pima Indian database who had glucose levels this high were diabetic.
The answer they provided to the third question was more involved. To develop a question for a node, their algorithm begins with the data that have arrived at the node. Each variable in the database is considered and each value or set of values (for categorical variables) is considered. The algorithm chooses from this large set the question that best partitions the data. Best is measured by purity of the results. So a question that partitions the data into nodes with dominant class labels is preferred to one that has the labels in roughly equal proportions.
Other approaches exist to building classification trees and use different answers to the questions on tree construction (eg, Ref. [9]). For example, it is possible to build trees with more than pairwise partitions at the nodes and to consider trees that ask more complicated questions involving more than one variable [10].
The interpretability and ease of understanding of the results make classification trees an important and useful technique in data mining. Their importance is evident in continuing work to improve their accuracy and applicability. Two of the more important recent extensions are boosting [11] and random forests [12]. Boosting provides a method for trees to improve in accuracy by adapting to the errors they make in classification. Random forests provides a mechanism for combining results from multiple classification trees to produce more accurate predictions.
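Recursive partitioning as described above is available in R through the rpart package; the sketch below grows a tree on the assumed Pima data and prunes it back using the cross-validated error, although the resulting splits will not necessarily match those in Fig. 8.

    # Minimal sketch: a classification tree by recursive partitioning (CART).
    library(mlbench)
    library(rpart)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    # Grow a large tree; rpart's internal 10-fold cross-validation (xval)
    # estimates the error of the pruned subtree at each complexity value.
    tree <- rpart(diabetes ~ ., data = pima, method = "class",
                  control = rpart.control(cp = 0.001, xval = 10))

    # Prune back to the subtree with the smallest cross-validated error.
    best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
    pruned  <- prune(tree, cp = best_cp)

    # The printed tree shows the question at each node and the class
    # counts in the terminal nodes.
    print(pruned)
    plot(pruned); text(pruned, use.n = TRUE)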
Neural networks
Neural networks provide data mining techniques meant to mimic the pattern recognition properties of biologic systems. The most commonly used of these techniques, multilayer perceptrons or backpropagation neural networks, begins with a simplified model of neural processing known as a perceptron. A perceptron typically applies a nonlinear transfer function to a weighted sum of the inputs. The inputs are the values for each predictor variable for an observation. The better performing transfer functions are smooth and continuous.
As the name implies, multilayer perceptrons use several perceptrons and organize them into different layers for processing the data. In most data mining applications three layers of perceptrons are used. The first layer is the input layer and provides an input node for every variable in the database. Another node, known as the bias node, is often used to provide the model with greater flexibility in modeling. This node always inputs the same value (eg, 1).
The next layer is known as the hidden layer. In fully connected networks every input node is connected to every node of the hidden layer. The number of nodes in this hidden layer is undetermined and can greatly impact the results. In most applications trials are made with different numbers of hidden nodes to find the number that works well for the specific application.
The final layer is the output. This layer depends on the values (scalar or vector) sought in the output; so, for a simple binary classification problem a single output node is sufficient.
Fig. 9 shows an example of a multilayer perceptron neural network for the Pima Indian data. The input layer contains nodes for each of the variables in this data set and for the bias term. The hidden layer contains some chosen number of nodes, say k. Finally, for this problem there is a single output node to report the classification of the patient.

Fig. 9. Multilayer perceptron for the Pima Indian data (input nodes for glucose, blood pressure, and the other predictors plus a bias node; hidden nodes 1 through k; a single output node giving the classification).
A multilayer perceptron requires values for the weights represented by
the arcs or connections in the neural network. In the example in Fig. 9 we
would need weights for each of the arcs connecting the nodes in the input
layer to each member of the hidden layer. Similarly the connections between
each hidden node and the output node require weights.
The basic algorithm used to calculate these weights is the backpropagation algorithm of Werbos. The name of this algorithm is what gives multilayer perceptrons their other commonly used name: backpropagation neural networks. This algorithm initializes by randomly assigning weights to the connections in the network. As the algorithm's first step, an observation from the database is presented to and processed by the neural net. The error is computed at the terminal node and this error is then propagated back through the network. The algorithm makes the largest changes to the weights that contributed most to the error. The algorithm proceeds in this fashion until the weights change little or not at all, which can take many presentations of the data in the database to the neural network. Werbos' basic algorithm has been modified by many researchers and faster algorithms now exist that do not require single observation processing.


Fig. 10 shows the actual classification values for the Pima Indian data plotted against the predictions from a five hidden node backpropagation neural network; this is the same plot as shown in Fig. 7 for logistic regression. As in that case the neural network does well on many cases but still makes some significant errors. One important difference between neural networks and both logistic regression and classification trees is that they do not provide an understandable interpretation of the results. By themselves, the fitted networks do not reveal which features or combination of features most influenced a prediction.
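In R, a network of this general kind can be fit with the nnet package (a single-hidden-layer implementation that optimizes the weights by a quasi-Newton method rather than plain backpropagation); the sketch below, on the assumed Pima data, is a rough illustration and not a reproduction of the five hidden node network discussed here.

    # Minimal sketch: a perceptron with one hidden layer of five nodes.
    library(mlbench)
    library(nnet)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    # size = number of hidden nodes; decay adds weight regularization.
    # In practice the predictors are usually rescaled first.
    set.seed(1)
    net <- nnet(diabetes ~ glucose + pressure + triceps + insulin + mass + age,
                data = pima, size = 5, decay = 0.01, maxit = 500, trace = FALSE)

    # Predicted probability of diabetes for each patient.
    prob <- predict(net, pima, type = "raw")
    head(prob)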
Support vector machines
The final technique discussed in this article is one of the more recent additions to data mining. Support vector machines (SVMs) were developed by Vapnik [13] and the technique seeks to predict class labels by separating the database into mutually exclusive regions. There are several important innovations in the approach taken by support vector machines to this problem. First, SVMs perform the separation based on the few points, the support vectors, near the boundary between the classes. In this perspective they differ from all the previous approaches described in this article, which form decision boundaries using data from all the points. Second, they transform the data into a space where separability between the classes is improved. Finally, rather than explicitly performing the transformation, they use kernel functions to provide computational tractability. This section provides an overview of support vector machines by discussing these innovations.

Fig. 10. Prediction results from a five hidden node multilayer perceptron for the Pima Indian data (x-axis: predicted diagnosis; y-axis: actual diagnosis).


The first essential component or innovation in support vector machines is the creation of support vector classifiers. Consider two linearly separable classes in two variables. One approach to predicting the label of a new observation is to make the assignment based on the new observation's position relative to a separating decision surface. Support vectors create the separating decision surface by first finding the nearest points between the two classes. Next the maximal distance or margin between these points is found as the solution to the following problem:

\min_{w} \; \lVert w \rVert \quad \text{subject to} \quad y_i \left( w^T x_i + w_0 \right) \ge 1, \quad i = 1, \ldots, n

In this formulation w is the vector of variable coefficients, y_i ∈ {0, 1} is the categoric response variable, and x_i is the vector of predictor variables for observations i = 1, ..., n. Notice that 2/||w|| is the margin. So the solution to this problem maximizes the margin.
Finally, the separating decision surface is the perpendicular through the center of the maximal margin. Decision surfaces of this type are called maximum margin classifiers. Fig. 11 shows a decision surface formed by two support vectors.
Fig. 11. Support vector classifier for linearly separable data (the plot marks the decision boundary, the margin, and the support vectors).

Most problems are not linearly separable, meaning that no linear surface can be placed without having points of each class on both sides of the surface. Support vector classifiers handle this situation by relaxing the requirement that all points must be on one side of the surface. Formally they do this by adding slack variables to the previous formulation. Let ξ_i, i = 1, ..., n be the slack variables, which have nonnegative values. Then replace the constraint in the previous formulation with

y_i \left( w^T x_i + w_0 \right) \ge 1 - \xi_i

Solving this constrained optimization problem allows some points to be misclassified.
The next major component of the support vector machines is the transformation of the original problem into a new space. This transformation enables the creation of nonlinear decision boundaries. This capability is important because many applications do not have the simple linear boundaries shown in Fig. 11. The SVM solution is to find a transformation of the original variables that enables linear separability. Even though the problem is not linearly separable in the original space defined by the variables in the database, SVM finds a transformation of the variables in which linear separability is possible.
The search for variable transformation and the maximal margin classifier in this new space can be computationally expensive. Fortunately, the third major contribution from SVM helps resolve these issues. Rather than actually perform the transformation and develop the maximal margin classifier in a new space, SVM uses kernel functions in the original variables. These kernel functions mean that there is no need to actually find the variable transformation for nonlinear decision surfaces. Kernel functions enable this because they allow for computation of the similarity (formally, the dot product) between two observations in the original space rather than in the transformed space. In so doing the kernels make it possible to solve the optimization for linear separability in reasonable time.
Several possible choices exist for kernel functions that enable the use of SVM for data mining. Two common choices are polynomial kernels and radial basis functions. Fig. 12 shows Pima Indian data classified with a polynomial kernel in two dimensions, glucose (X1) and body mass index (X2).
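An SVM of this kind can be fit in R with the e1071 package (an interface to the LIBSVM library); the hedged sketch below uses a polynomial kernel on glucose and body mass index from the assumed Pima data, roughly in the spirit of Fig. 12, with parameter values chosen only for illustration.

    # Minimal sketch: a support vector machine with a polynomial kernel.
    library(mlbench)
    library(e1071)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    # Two predictors only, so the decision regions can be drawn in the plane.
    fit <- svm(diabetes ~ glucose + mass, data = pima,
               kernel = "polynomial", degree = 3, cost = 1)

    summary(fit)                      # reports the number of support vectors
    plot(fit, pima, glucose ~ mass)   # shaded decision regions, as in Fig. 12

    # A radial basis function kernel is the other common choice:
    # svm(diabetes ~ glucose + mass, data = pima, kernel = "radial")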
Evaluation
The final section of this article examines methods to evaluate the results of data mining. This discussion focuses on evaluating predictive techniques. Discovery techniques are difficult to evaluate because unlike predictive techniques there is no available response value. The few available evaluation methods for discovery techniques build on those for the predictive techniques that are discussed later.
Evaluation techniques are vital to the use of predictive data mining. It is clearly not sufficient to simply apply data mining techniques to a database.


Fig. 12. SVM classifier for the Pima Indian data with a polynomial kernel.

The results from these data mining techniques must be objectively assessed
before they are used to inform decision making. Evaluation requires testing
procedures and metrics.
Testing procedures are normally defined by the application area. For some applications it is possible to conduct formal experiments designed using the variables in the database. For many applications experiments are not possible, however. In these situations, the analyst normally uses the observations in the database to evaluate the data mining results.
The goal of testing procedures is to provide the analyst with an objective view of the performance of the data mining technique on future observations. For many reasons it is best not to rely on the observations in the database that were used to parameterize a technique to assess its performance on future values. The major reason for this caveat is that each technique can be made to perform perfectly on a set of observations. This perfect performance on a known database would not translate into perfect performance on newly obtained observations, however. In fact, the performance on these would be poor because the technique was overfit to the existing data.
Testing procedures provide a way to avoid overfitting. The simplest testing procedure is to divide the database into two parts. One part, the training set, is used to build and parameterize the data mining technique. The second part is used to test the technique. For reasonably sized databases the division is normally two thirds for training and one third for testing. In addition, the choice of observations for each set is randomly made. It may be useful to use stratified sampling for either or both of the training and test sets if the distributions of groups within a target population are known.
Cross-validation is another testing procedure that is used when the database is small or when concerns exist about the representativeness of a test set. Cross-validation begins by dividing the data into M roughly equal-sized parts. For each part, i = 1, ..., M, the model is fit using the data in the other M − 1 parts. The metric is then computed using the data in the remaining part; this is done M times giving M separate estimates of the metric. The final estimate for the metric is simply the average over all M estimates.
Cross-validation has the advantage that it uses all the data for training and testing; this means that the analyst does not have to form a separate test set. Recursive partitioning, discussed in the classification trees section, uses cross-validation to determine the final size of the tree. In this way cross-validation is frequently used to find parameter values for the different data mining techniques. For those methods that do not use it for parameter estimation it provides a convenient testing approach to assess a data mining technique.
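Cross-validation of this kind is easy to code directly; the R sketch below estimates the error rate of the earlier logistic regression model by M-fold cross-validation on the assumed Pima data (the model and the value of M are illustrative, not the article's own choices).

    # Minimal sketch: M-fold cross-validation of a logistic regression model.
    library(mlbench)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    M <- 10
    set.seed(1)
    fold <- sample(rep(1:M, length.out = nrow(pima)))   # random fold labels

    errors <- numeric(M)
    for (m in 1:M) {
      train <- pima[fold != m, ]        # fit on the other M - 1 parts
      test  <- pima[fold == m, ]        # score on the held-out part
      fit   <- glm(diabetes ~ glucose + mass + age,
                   family = binomial, data = train)
      prob  <- predict(fit, newdata = test, type = "response")
      pred  <- ifelse(prob > 0.5, "pos", "neg")
      errors[m] <- mean(pred != test$diabetes)
    }

    mean(errors)   # final estimate: the average of the M error rates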
In addition to testing procedures, the analyst must also select a metric or
metrics to use to evaluate the techniques. For numeric response problems,
common metrics are functions of sums of squares or sums of absolute deviations. Both measures weight performance by distance to the correct response, but the former measure tends to penalize extreme errors more
than measures that use absolute deviation.
For categorical response, metrics that count the number of errors are typically used. In many applications the type of error is also important, however. This observation is particularly true in diagnostic applications. In
these cases it is convenient to separate the errors into false positives and false
negatives. False positives occur when the data mining technique predicts an
outcome and the outcome does not occur. False negatives happen when the
data mining technique fails to predict an outcome that occurred. The diabetes example illustrates a case wherein these two errors are not equally
weighted. In this case a false negative typically is worse than a false positive
because the latter error can be caught by subsequent testing. On the other
hand, it would be disastrous if only false positives occurred because this
would quickly overwhelm the available testing resources. In performing
evaluations on classifiers, therefore, both types of errors need to be measured and trade-offs made between their predicted values.
A useful display that allows for viewing of both metrics is the receiver
operating characteristic (ROC) curve. The name for this graphic derives
from its origin in WWII where it was used by the allies to assess the performance of early radar systems. The ROC curve shows the trade-offs between false positives and false negatives by plotting true positives (1 − false negatives) versus false positives. This plotting means that the ideal performance is in the upper left hand corner of the plot. The worst performance is in the lower right hand corner. Random performance is shown by a diagonal line at 45°.
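Given a vector of predicted probabilities and the true labels, an ROC curve like those in Fig. 13 can be drawn in R with the ROCR package; the sketch below continues the hedged logistic regression example on the assumed Pima data.

    # Minimal sketch: an ROC curve for logistic regression predictions.
    library(mlbench)
    library(ROCR)
    data(PimaIndiansDiabetes2)
    pima <- na.omit(PimaIndiansDiabetes2)

    set.seed(1)
    test_idx <- sample(nrow(pima), size = 100)     # held-out test set
    train <- pima[-test_idx, ]
    test  <- pima[test_idx, ]

    fit  <- glm(diabetes ~ glucose + mass + age, family = binomial, data = train)
    prob <- predict(fit, newdata = test, type = "response")

    # True positive rate versus false positive rate over all thresholds.
    pred <- prediction(prob, test$diabetes)
    perf <- performance(pred, measure = "tpr", x.measure = "fpr")
    plot(perf)
    abline(0, 1, lty = 2)    # diagonal marks random performance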
Fig. 13 shows an ROC curve for several of the data mining techniques discussed using the Pima Indian data set. These curves are for a test set of 100 observations selected randomly from the original 392 observations. This plot gives us a way to decide among the techniques given the desired trade-off between false positives and false negatives.
The plot in Fig. 13 illustrates another aspect of data mining techniques. In most applications, there is no clear winner among the techniques. The choice of technique depends on the application, as illustrated in this case through the choice of trade-offs between false positives and false negatives. The choice also depends on the importance of understanding and interpretability of the results, because some of the techniques provide these attributes more easily than others. Fortunately, the variety and the capabilities of data mining techniques continue to improve. This variety has to a large extent built on the successes of the methods described in this article.

Fig. 13. ROC curve for data mining techniques for the Pima Indian data set (curves shown for logistic regression, a classification tree, and a neural network; x-axis: false positive rate; y-axis: true positive rate).


References
[1] Kahneman D, Slovic P, Tversky A. Judgment under uncertainty: heuristics and biases. Cambridge (UK): Cambridge University Press; 1982.
[2] Hand D, Mannila H, Smyth P. Principles of data mining. Cambridge (MA): MIT Press; 2001.
[3] Wold H. Soft modeling by latent variables: the nonlinear iterative partial least squares (NIPALS) approach. In: Perspectives in probability and statistics, in honor of MS Bartlett. Sheffield (UK): Applied Probability Trust; 1975. p. 117–44.
[4] Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics 1970;12:55–67.
[5] Comon P. Independent component analysis, a new concept? Signal Processing 1994;36:287–314.
[6] Copas JB. Regression, prediction and shrinkage (with discussion). J Roy Stat Soc B 1983;45:311–54.
[7] Agrawal R, Mannila H, Srikant R, et al. Fast discovery of association rules. In: Fayyad UM, Piatetsky-Shapiro G, Smyth P, editors. Advances in knowledge discovery and data mining. Cambridge (MA): AAAI/MIT Press; 1996. p. 307–28.
[8] Breiman L, Friedman J, Olshen R, et al. Classification and regression trees. Belmont (CA): Wadsworth; 1984.
[9] Kass GV. An exploratory technique for investigating large quantities of categorical data. Appl Stat 1980;29:119–27.
[10] Brown DE, Pittard CL. Classification trees with optimal multi-variate splits. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, Le Touquet (France); 1993. p. 4758.
[11] Freund Y, Schapire R. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 1997;55:119–39.
[12] Breiman L. Random forests. Mach Learn 2001;45:5–32.
[13] Vapnik V. The nature of statistical learning theory. New York: Springer Verlag; 1995.

Clin Lab Med 28 (2008) 37–54

Open-Source Tools for Data Mining


Blaž Zupan, PhD(a,b,*), Janez Demšar, PhD(a)
(a) University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia
(b) Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA

The history of software packages for data mining is short but eventful. Although the term data mining was coined in the mid-1990s [1], statistics, machine learning, data visualization, and knowledge engineering (research fields that contribute their methods to data mining) were at that time already well developed and used for data exploration and model inference. Obviously, software packages were in use that supported various data mining tasks. But compared with the data mining suites of today, they were awkward, most often providing only command-line interfaces and at best offering some integration with other packages through shell scripting, pipelining, and file interchange. For an expert physician, the user interfaces of early data mining programs were as cryptic as the end of the last sentence. It took several decades and substantial progress in software engineering and user interface paradigms to create modern data mining suites, which offer simplicity in deployment, integration of excellent visualization tools for exploratory data mining, and, for those with some programming background, the flexibility of crafting new ways to analyze the data and adapting algorithms to fit the particular needs of the problem at hand.
Within data mining, there is a group of tools that have been developed by a research community and data analysis enthusiasts; they are offered free of charge using one of the existing open-source licenses. An open-source development model usually means that the tool is a result of a community effort, not necessarily supported by a single institution but instead the result of contributions from an international and informal development team. This development style offers a means of incorporating the diverse experiences
This work was supported by Program Grant P20209 and Project Grants J29699 and V20221 from the Slovenian Research Agency and NIH/NICHD Program Project Grant P01 HD39691.
* Corresponding author. University of Ljubljana, Tržaška 25, SI-1000 Ljubljana, Slovenia.
E-mail address: blaz.zupan@fri.uni-lj.si (B. Zupan).
0272-2712/08/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.002
labmed.theclinics.com


and views of multiple developers into a single platform. Open-source data mining suites may not be as stable and visually finished as their commercial counterparts but instead may offer high usefulness through alternative, exciting, and cutting-edge interfaces and prototype implementations of the most recent techniques. Being open source they are by definition extendable, and may offer substantial flexibility in handling various types of data.
In the following, we provide a brief overview of the evolution of approaches used in the development of data mining suites, with a particular focus on their user interfaces. We then elaborate on potential advantages of open-source data mining suites as compared with their commercial counterparts, and provide a wish list for techniques any data mining suite should offer to the biomedical data analyst. For our review of tools, we select several representative data mining suites available in open source, present each briefly, and conclude with a summary of their similarities and differences.
Evolution of open-source data mining tools
Early model inference and machine learning programs from the 1980s
were most often invoked from a command prompt (eg, from a UNIX or
DOS shell), with the user providing the name of the input data file and any parameters for the inference algorithm. A popular classification tree induction algorithm called C4.5 [2] came with such an implementation. C4.5 (the source code is available at http://www.rulequest.com/Personal) could also accept a separate input file with cases for testing the model, but included no procedures for sampling-based evaluation of the algorithm. Implementations of early rule-based learning algorithms, such as AQ [3] and CN2 [4], were similar to C4.5 in this respect. Much of the experimental verification of these programs was performed on data sets from medicine, including those related to cancer diagnosis and prediction (see UCI Machine Learning Repository [5]). Such evaluations most often demonstrated that the classification rules inferred had some meaningful medical interpretation and performed well in classification accuracy on a separate test set.
Inference of models from medical data requires elaborate testing, which
was not fully integrated into early data mining programs. Researchers typically used a scripting language, such as Perl [6], to separately implement
sampling procedures and then execute programs for model inference and
testing. To compare different algorithms, such scripts needed to reformat
the data for each algorithm, parse textual outputs from each model, and
use them to compute the corresponding performance scores. Needless to
say, the implementation of such schemata required a substantial amount
of programming and text processing.
As an alternative, several research groups started to develop suites of
programs that shared data formats and provided tools for evaluation and
reporting. An early example of such an implementation is MLC++ [7], a machine learning library in C++ with a command line interface that featured several then-standard data analysis techniques from machine learning. MLC++ was also designed as an object-oriented library, extendible through
algorithms written by a user who could reuse parts of the library as desired.
Command line interfaces, limited interaction with the data analysis environment, and textual output of inferred models and their performance
scores were not things a physician or medical researcher would get too
excited about. To be optimally useful for researchers, data mining programs
needed to provide built-in data visualization and the ability to easily interact
with the program. With the evolution of graphical user interfaces and operating systems that supported them, data mining programs started to incorporate these features. MLC++, for instance, was acquired by Silicon Graphics in the mid-1990s and turned into MineSet [8], at that time the most
sophisticated data mining environment with many interesting data and
model visualizations. MineSet implemented an interface whereby the data
analysis schema was in a way predefined: the user could change the parameters of analysis methods, but not the composition of the complete analysis
pathway. Clementine (http://www.spss.com/clementine), another popular
commercial data mining suite, pioneered user control over the analysis pathway by embedding various data mining tasks within separate components
that were placed in the analysis schema and then linked with each other
to construct a particular analysis pathway. Several modern open-source
data mining tools use a similar visual programming approach that, because
it is flexible and simple to use, may be particularly appealing to data analysts
and users with backgrounds other than computer science.
Flexibility and extensibility in analysis software arise from being able to
use existing code to develop or extend one's own algorithms. For example, Weka (http://www.cs.waikato.ac.nz/ml/weka/) [9], a popular data mining suite, offers a library of well-documented Java-based functions and classes that can be easily extended, provided sufficient knowledge of Weka's architecture and Java programming. A somewhat different approach has been taken by other packages, including R (http://www.r-project.org), which is one of the most widely known open-source statistical and data mining suites. Instead of extending R with functions in C (the language of its core), R also implements its own scripting language with an interface to its functions in C. Most extensions of R are then implemented as scripts, requiring no source-code compilation or use of a special development environment. Recently, with advances in the design and performance of general-purpose scripting languages and their growing popularity, several data mining tools have incorporated these languages. The particular benefit of integration with a scripting language is the speed (all computationally intensive routines are still implemented in some fast low-level programming language and are callable from the scripting language), flexibility (scripts may integrate functions from the core suite and functions from the scripting language's native library), and extensibility that goes beyond the sole use of the data mining suites through use of other packages that interface with that particular scripting language. Although harder to learn and use for novices
and those with little expertise in computer science or math than systems
driven completely by graphical user interfaces, scripting in data mining
environments is essential for fast prototyping and development of new techniques and is a key to the success of packages like R.
Why mine medical data with open-source tools?
Compared with off-the-shelf commercial data mining suites, open-source
tools may have several disadvantages. They are developed mostly by
research communities that often incorporate their most recent data analysis
algorithms, resulting in software that may not be completely stable. Commercial data mining tools are often closely integrated with a commercial
database management system, usually offered by the same vendor. Open-source data mining suites instead come with plug-ins that allow the user to query for the data from standard databases, but integration with these may require more effort than a single-vendor system.
These and other potential shortcomings are offset by several advantages offered by open-source data mining tools. First, open-source data mining
suites are free. They may incorporate new, experimental techniques, including some in prototype form, and may address emerging problems sooner
than commercial software. This feature is particularly important in biomedicine, with the recent emergence of many genome-scale data sets and new
data and knowledge bases that could be integrated within analysis schemata.
Provided that a large and diverse community is working with a tool, the set
of techniques it may offer can be large and thus may address a wide range of problems. Research-oriented biomedical groups find substantial usefulness
in the extendibility of the open-source data mining suites, the availability
of direct access to code and components, and the ability to cross-link the
software with various other data analysis programs. Modern scripting
languages are particularly strong in supporting this type of ad hoc integration. Documentation for open-source software may not be as polished as
that for commercial packages, but it is available in many forms and often
includes additional tutorials and use cases written by enthusiasts outside
the core development team. Finally, there is user support, which is different for open-source than for commercial packages. Users of commercial packages depend on the company's user support department, whereas users of
open-source suites are, as a matter of principle, usually eager to help each
other. This cooperation is especially true for open-source packages with
large and active established user bases. Such communities communicate
by online forums, mailing lists, and bug tracking systems to provide encouragement and feedback to developers, propose and prioritize improvements,
report on bugs and errors, and support new users.
As these open-source tools incorporate advances in user interfaces and reporting tools, implement the latest analysis methods, and grow their user bases, they are becoming useful alternatives and complements to commercial tools in medical data mining.

Open-source data mining toolbox: a wish list


To support medical data mining and exploratory analysis, a modern data
mining suite should provide an easy-to-use interface for physicians and
biomedical researchers that is well supported with data and model visualizations, offers data analysis tools to accommodate interactive search for any interesting data patterns, and allows interactive exploration of inferred models [10–12]. In addition to being simple, the tools have to be flexible, allowing the users to define their own schemata for data analysis. Modern open-source data mining suites are almost by definition extendible; although
this may not be a major concern of the users, it is important for data analysts and programmers in biomedical research groups who may need to
develop custom-designed data analysis components and schemata.
Most open-source data mining tools today come as comprehensive, integrated suites featuring a wide range of data analysis components. In our
opinion, the following set of tools and techniques should be on the wish
list of any biomedical data analyst:
• A set of basic statistical tools for primary inspection of the data
• Various data visualization techniques, such as histograms, scatterplots, distribution charts, parallel coordinate visualizations, mosaic and sieve diagrams, and so forth
• Standard components for data preprocessing that include querying from databases, case selection, feature ranking and subset selection, and feature discretization
• A set of techniques for unsupervised data analysis, such as principal component analysis, various clustering techniques, inference of association rules, and subgroup mining techniques
• A set of techniques for supervised data analysis, such as classification rules and trees, support vector machines, naïve Bayesian classifiers, discriminant analysis, and so forth
• A toolbox for model evaluation and scoring (classification accuracy, sensitivity, specificity, Brier score, and others) that also includes graphical analysis of results, such as receiver-operating characteristic curves and lift chart analysis
• Visualizations of inferred models developed from either supervised or unsupervised analysis
• An exploratory data analysis environment, wherein the user can select a set of cases, features, or components of the model and explore the selection in a subsequent data or model visualization component. The emphasis here is on the interplay between data visualization and interaction.
• Techniques for saving the model in some standard format (such as PMML, http://www.dmg.org/) for its later use in systems for decision support outside the data mining suite with which the model was constructed
• Reporting, that is, implementation of a notebook-style tool in which the user can save the present results of analysis and associated reports, include any comments, and later retrieve the corresponding analysis schema for further exploration
We use the above list implicitly when reviewing the open-source data
mining suites below and when summarizing our impressions of them at
the end of the article.

Selected open-source data mining suites


Below we review several open-source data mining suites, including some
of the largest and most popular packages, such as Weka and R. Although
representative of different types of user interfaces and implementations, the list is selective because the number of other data mining suites in open source is large; because of space limitations we necessarily selected
only a small sample. We direct interested readers to web pages, such as
KDnuggets (http://www.kdnuggets.com/) and Open Directory (http://
dmoz.org), for more comprehensive lists of open-source data mining tools.
For illustrations throughout this article we used a data set on heart disease
from UCI Machine Learning Repository [5].
R
R (http://www.r-project.org) is a language and environment for statistical
computing and graphics. Most of its computationally intensive methods are
efficiently implemented in C, C++, and Fortran, and then interfaced to R,
a scripting language similar to the S language originally developed at Bell
Laboratories [13]. R includes an extensive variety of techniques for statistical testing, predictive modeling, and data visualization, and has become a de
facto standard open-source library for statistics (Fig. 1). R can be extended
by hundreds of additional packages available at The Comprehensive R
Archive Network (http://cran.r-project.org) that cover virtually every aspect
of statistical data analysis and machine learning. For those interested in
genomic data analysis in bioinformatics, there is an R library and software
development project called Bioconductor (http://www.bioconductor.org).
The preferred interface to R is its command line and use through scripting. Scripting interfaces have distinct advantages: the data analysis procedure is stated clearly and can be saved for later reuse. The downside
is that scripting requires some programming skills. Users lacking them
can use R through extensions with graphical user interfaces. R Commander (http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/), for instance, implements a graphical user interface to compose and issue commands in R script. Rattle (http://rattle.togaware.com), another interface extension of R, is implemented as an R library and provides a graphical user interface to many of R's data analysis and modeling functions.

Fig. 1. Snapshot of the basic R environment (RGui) with an example script that reads the data, constructs an object that stores the result of hierarchical clustering, and displays it as a dendrogram in a separate window.
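An R script of the kind shown in Fig. 1 might look roughly like the sketch below; the file name heart.csv and its contents are placeholders rather than the actual file behind the figure.

    # Hypothetical sketch of an R session like the one in Fig. 1: read a data
    # set, cluster the cases hierarchically, and display the dendrogram.
    heart <- read.csv("heart.csv")                       # placeholder file name

    # Euclidean distances between cases on the numeric columns only.
    d <- dist(scale(heart[, sapply(heart, is.numeric)]))

    hc <- hclust(d, method = "average")                  # hierarchical clustering
    plot(hc)                                             # dendrogram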
Tanagra
Tanagra (http://eric.univ-lyon2.fr/~ricco/tanagra/) is a data mining suite
built around a graphical user interface wherein data processing and analysis
components are organized in a tree-like structure in which the parent component passes the data to its children (Fig. 2). For example, to score a prediction model in Tanagra, the model is used to augment the data table with
a column encoding the predictions, which is then passed to the component
for evaluation.
Although lacking more advanced visualizations, Tanagra is particularly
strong in statistics, offering a wide range of uni- and multivariate parametric
and nonparametric tests. Equally impressive is its list of feature selection
techniques. Together with a compilation of standard machine learning techniques, it also includes correspondence analysis, principal component analysis, and the partial least squares methods. Presentation of machine learning models is most often not graphical but instead, unlike other machine learning suites, includes several statistical measures. The difference in approaches is best illustrated by the naïve Bayesian classifier, whereby, unlike Weka and Orange, Tanagra reports the conditional probabilities and various statistical assessments of importance of the attributes (eg, χ², Cramér's V, and Tschuprow's T). Tanagra's data analysis components report their results in nicely formatted HTML.

Fig. 2. Snapshots of Tanagra with an experimental setup defined in the left column, which loads the data (Dataset), shows a scatterplot (Scatterplot 1), selects a set of features (Define status 1), computes linear correlations (Linear correlation 1), selects a subset of instances based on a set of conditions (Rule-based selection 1), computes the correlation and a scatterplot for these instances, and so on. The components of the data processing tree are dragged from the list at the bottom (Components); the snapshot shows only those related to statistics. The scatterplot on the right side shows the separation of the instances based on the first two axes as found by the partial least squares analysis, where each symbol represents a patient, with the symbol's shape corresponding to a diagnosis.


Weka
Weka (Waikato Environment for Knowledge Analysis, http://www.cs.
waikato.ac.nz/ml/weka/) [9] is perhaps the best-known open-source machine
learning and data mining environment. Advanced users can access its components through Java programming or through a command-line interface.
For others, Weka provides a graphical user interface in an application called
the Weka KnowledgeFlow Environment featuring visual programming, and
Weka Explorer (Fig. 3) providing a less flexible interface that is perhaps easier to use. Both environments include Weka's impressive array of machine learning and data mining algorithms. They both offer some functionality
for data and model visualization, although not as elaborate as in the other
suites reviewed here. Compared with R, Weka is much weaker in classical
statistics but stronger in machine learning techniques. Weka's community
has also developed a set of extensions (http://weka.sourceforge.net/wiki/
index.php/Related_Projects) covering diverse areas, such as text mining,
visualization, bioinformatics, and grid computing. Like R in statistics, Weka became a reference package in the machine learning community, attracting a number of users and developers. Medical practitioners would get the easiest start by using Weka Explorer, and combining it with extensions for more advanced data and model visualizations.

Fig. 3. Weka Explorer with which we loaded the heart disease data set and induced a naïve Bayesian classifier. On the right side of the window are the results of evaluation of the model using 10-fold cross-validation.
YALE
Among the reviewed graphical user interface environments, the visual programming in YALE (Yet Another Learning Environment, http://rapid-i.
com) is the closest to the traditional sense of the word programming: the
user defines an experiment by placing the functions (eg, components for reading the data, cross-validation, applying a chain of operators, and so forth) into a treelike structure and runs the program (Fig. 4). Internal nodes of the tree represent functions whose children are their arguments (which may in turn be, and usually are, functions). For example, an operator XValidation performs cross-validation and requires two child nodes. The first must be able to handle an ExampleSet and deliver a Model. The second child node gets an ExampleSet and a Model and outputs a PerformanceVector. The second child would typically be an operator chain consisting of a ModelApplier, which uses the prediction Model on an ExampleSet, resulting in a table of predictions and actual classes, and a PerformanceEvaluator, which takes the table and computes the corresponding classifier scores.

Fig. 4. A snapshot of YALE with the experimental setup for cross-validation that reads the data, computes some basic statistics about the features, and then cross-validates a classification tree inducer J48. Selection of any component from the list on the left of the window provides access to its parameters; those for cross-validation are displayed in the snapshot. The experiment log is displayed in the bottom part of the window. After executing the experiment, the results of experiments are available in the Results tab.
YALE incorporates a reasonable number of visualizations ranging from the basic histograms to multidimensional RadViz [14] projections. YALE is written in Java and is built on top of Weka, thus including its vast array of data analysis components. Although data miners with a background in programming easily grasp its visual functional programming concepts, medical practitioners and researchers with limited knowledge of computer science may find them somewhat complicated to understand and manage.

Fig. 5. Screenshot of KNIME. The central part of the window shows the experimental setup with several interconnected nodes; the right part contains a useful description of the selected node. The screenshot shows an experiment in which we loaded the data, colored the instances according to their class and showed them in a table, and used parallel coordinates and a scatterplot for visualization. In the middle of the graph we placed the nodes for testing the performance of a classification tree inducer; node Cross-validation has an internal workflow with the definition of the evaluated learning algorithm. At the bottom part of the graph are nodes for random partitioning of the data set, binning of the training set, and derivation of a classification tree used to predict the classes of the test set and obtain the related performance scores. In addition, we visualized the training set in a scatterplot, but put the instances through the HiLite Filter. With this setup, we can pick a node in the classification tree J48 Weka and see the corresponding examples in the Scatter Plot.
KNIME
KNIME (Konstanz Information Miner, http://www.knime.org) is a nicely
designed data mining tool that runs inside IBM's Eclipse development
environment. The application is easy to try out because it requires no installation besides downloading and unarchiving. Like YALE, KNIME is
written in Java and can extend its library of built-in supervised and unsupervised data mining algorithms with those provided by Weka. But unlike that of YALE, KNIME's visual programming is organized like a data flow. The user programs by dragging nodes from the node repository to the central part of the benchmark (Fig. 5). Each node performs a certain function, such as reading the data, filtering, modeling, visualization, or similar functions. Nodes have input and output ports; most ports send and receive data, whereas some handle data models, such as classification trees. Unlike nodes in Weka's KnowledgeFlow, different types of ports are clearly marked, relieving the beginner of the guesswork of what connects where.
Typical nodes in KNIME's KnowledgeFlow have two dialog boxes, one for configuring the algorithm or a visualization and the other for showing its results (Fig. 6). Each node can be in one of three states, depicted with a traffic-light display: they can be disconnected, not properly configured, or lack the input data (red); be ready for execution (amber); or have finished the processing (green). A nice feature called HiLite (Fig. 7) allows the user to select a set of instances in one node and have them marked in any other visualization in the current application, in this way further supporting exploratory data analysis.

Fig. 6. A dialog of the node CAIM Binner (from Fig. 5) that transforms continuous features into discrete features (discretization). Features to be discretized are selected in the bottom part of the window, with the top part of the window displaying the corresponding split points.

Fig. 7. KNIME HiLiteing (see Fig. 5), where the instances from the selected classification tree node are HiLited and marked in the scatterplot.
Orange
Orange (http://www.ailab.si/orange) is a data mining suite built using the
same principles as KNIME and Weka KnowledgeFlow. In its graphical
environment called Orange Canvas (Fig. 8), the user places widgets on a canvas
and connects them into a schema. Each widget performs some basic function, but unlike in KNIME with two data types (models and sets of instances), the signals passed around Orange's schemata may be of different types, and may include objects such as learners, classifiers, evaluation results, distance matrices, dendrograms, and so forth. Orange's widgets are also coarser than KNIME's nodes, so typically a smaller number of widgets is needed to accomplish the same task. The difference is most striking in setting up a cross-validation experiment, which is much more complicated in KNIME, but with the benefit of giving the user more control in setting up the details of the experiment, such as separate preprocessing of training and testing example sets.
Besides friendliness and simplicity of use, Orange's strong points are a large number of different visualizations of data and models, including intelligent search for good visualizations, and support of exploratory data analysis through interaction. In a concept similar to KNIME's HiLiteing (yet subtly different from it), the user can select a subset of examples in a visualization, in a model, or with an explicit filter, and pass them to, for instance, a model inducer or another visualization widget that can show them as a marked subset of the data (Fig. 9).

Fig. 8. Snapshot of the Orange canvas. The upper part of the schema centered around Test Learners uses cross-validation to compare the performance of three classifiers: naïve Bayes, logistic regression, and a classification tree. Numerical scores are displayed in Test Learners, with evaluation results also passed on to ROC Analysis and Calibration Plot that provide means to graphically analyze the predictive performance. The bottom part contains a setup similar to that in KNIME (see Fig. 5): the data instances are split into training and test sets. Both parts are fed into Test Learners, which, in this case, requires a separate test set and tests a classification tree built on the training set that is also visualized in Classification Tree Graph. Linear Projection visualizes the training instances, separately marking the subset selected in the Classification Tree Graph widget.
Orange is weak in classical statistics; although it can compute basic statistical properties of the data, it provides no widgets for statistical testing. Its reporting capabilities are limited to exporting visual representations of data and models. Similar to R, the computationally intensive parts of Orange are written in C++, whereas the upper layers are developed in the scripting language Python, allowing advanced users to supplement the existing suite with their own algorithms or with routines from Python's extensive scientific library (http://www.scipy.org).

Fig. 9. The linear projection widget from Orange displaying a two-dimensional projection of data, where the x and y axes are a linear combination of feature values whose components are delineated with feature vectors. Coming from the schema shown in Fig. 8, the points corresponding to instances selected in the classification tree are filled and those not in the selection are open.
GGobi
Data visualization was always considered one of the key tools for successful data mining. Particularly suited for data mining and explorative data
analysis, GGobi (http://www.ggobi.org) is an open-source visualization program featuring interactive visualizations through, for instance, brushing
(Fig. 10), whereby a user's selection is marked in all other opened visualizations, and grand tour (Fig. 11) [15], which uses two-dimensional visualizations and in a movie-like fashion shifts between two different projections.
GGobi can also plot networks, a potentially useful feature for analysis of
larger volumes of data, such as those from biomedicine. By itself GGobi
is only intended for visualization-based data mining, but can be nicely integrated with other statistical and data mining approaches when used as
a plug-in for R or used through interfaces for the scripting languages Perl
and Python.

Fig. 10. Scatterplot, a matrix of scatterplots and parallel coordinates as displayed by GGobi.
The instances selected in one visualization (scatterplot, in this case) are marked in the others.
Fig. 11. GGobi's grand tour shows a projection similar to the Linear Projection in Orange (see Fig. 9) but animates it by smoothly switching between different interesting projections, which gives a good impression of the positions of the instances in the multidimensional space.

Summary
State-of-the-art open-source data mining suites have come a long way from where they were only a decade ago. They offer polished graphical interfaces, focus on usability and interactivity, and support extensibility through augmentation of the source code or (better) through interfaces for add-on modules. They provide flexibility through either visual programming within graphical user interfaces or prototyping by way of scripting languages. Major toolboxes are well documented and use forums or discussion groups for user support and exchange of ideas.
The degree to which all of the above is implemented of course varies from one suite to another, but the packages we have reviewed in this article address most of these issues, and we could not find a clear winner that supports all of them best. For a medical practitioner or biomedical researcher starting with data mining, the choice of the right suite may be guided by the simplicity of the interface, whereas for research teams the choice of implementation or integration language (Java, R, C/C++, Python, and so forth) may be important. As for the wish list of data mining techniques, we find that all packages we have reviewed (with the exception of GGobi, which focuses on visualization only) cover most of the standard data mining operations, ranging from preprocessing to modeling, with some providing better support for statistics and others for visualization.
There are many open-source data mining tools available, and our intention was only to demonstrate the ripeness of the field through exemplary
implementations. We covered only general-purpose packages and, because of space limitations, did not discuss any of the specialized software
tools dealing with biomedical data analysis, such as text mining, bioinformatics, microarray preprocessing, analysis in proteomics, and so forth,
some of which are addressed in other articles in this issue. The number of
such tools is large, with new development projects being established almost
on a daily basis. Not all of these will be successful in the long term, but
many of them are available, stable, and have already been used in a large
number of studies. With growing awareness that in science we should share
the experimental data and knowledge, along with the tools we build to
analyze and integrate them, open-source frameworks provide the right environment for community-based development, fostering exchange of ideas,
methods, algorithms and their implementations.

References
[1] Fayyad U, Piatetsky-Shapiro G, Smyth P, et al, editors. Advances in knowledge discovery and data mining. Menlo Park (CA): AAAI Press; 1996.
[2] Quinlan JR. C4.5: programs for machine learning. San Mateo (CA): Morgan Kaufmann Publishers; 1993.
[3] Michalski RS, Kaufman K. Learning patterns in noisy data: the AQ approach. In: Paliouras G, Karkaletsis V, Spyropoulos C, editors. Machine learning and its applications. Berlin: Springer-Verlag; 2001. p. 22-38.
[4] Clark P, Niblett T. The CN2 induction algorithm. Machine Learning 1989;3:261-83.
[5] Asuncion A, Newman DJ. UCI Machine Learning Repository. Irvine (CA): University of California, Department of Information and Computer Science; 2007. Available at: http://www.ics.uci.edu/~mlearn/MLRepository.html. Accessed April 15, 2007.
[6] Wall L, Christiansen T, Orwant J. Programming Perl. 3rd edition. Sebastopol (CA): O'Reilly Media; 2000.
[7] Kohavi R, Sommerfield D, Dougherty J. Data mining using MLC++: a machine learning library in C++. International Journal on Artificial Intelligence Tools 1997;6:537-66.
[8] Brunk C, Kelly J, Kohavi R. MineSet: an integrated system for data mining. In: Proc. 3rd Intl. Conf. on Knowledge Discovery and Data Mining, Menlo Park (CA). p. 135-8.
[9] Witten IH, Frank E. Data mining: practical machine learning tools and techniques with Java implementations. 2nd edition. San Francisco (CA): Morgan Kaufmann; 2005.
[10] Zupan B, Holmes JH, Bellazzi R. Knowledge-based data analysis and interpretation. Artif Intell Med 2006;37:163-5.
[11] Bellazzi R, Zupan B. Predictive data mining in clinical medicine: current issues and guidelines. Int J Med Inform 2006; in press.
[12] Cios KJ, Moore GW. Uniqueness of medical data mining. Artif Intell Med 2002;26:1-24.
[13] Becker RA, Chambers JM. S: an interactive environment for data analysis and graphics. Pacific Grove (CA): Wadsworth & Brooks/Cole; 1984.
[14] Hoffman PE, Grinstein GG, Marx KE. DNA visual and analytic data mining. In: Proc. IEEE Visualization; Phoenix (AZ); 1997. p. 437-41.
[15] Asimov D. The grand tour: a tool for viewing multidimensional data. SIAM J Sci Statist Comput 1985;6:128-43.

Clin Lab Med 28 (2008) 55-71

The Development of Health Care Data Warehouses to Support Data Mining

Jason A. Lyman, MD, MS(a,b,*), Kenneth Scully, MS(a,b), James H. Harrison, Jr, MD, PhD(a,c)

(a) Division of Clinical Informatics, Department of Public Health Sciences, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA
(b) Clinical Data Repository, University of Virginia School of Medicine, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA
(c) Department of Pathology, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA

Data mining requires an underlying data set and these data may be acquired, stored, and managed in multiple ways. As this data set increases in volume, data mining techniques generally become more effective and useful. Medicine is becoming well-positioned to take advantage of the capabilities of data mining; there is a tremendous wealth of largely untapped clinical data available in the operational clinical information systems of laboratories, hospitals, and clinics around the world, and the volume of these data is increasing rapidly as medical centers adopt electronic medical records. An important initial step toward the most effective mining of these data for biomedical and translational research is the development of enterprise clinical data warehouses that offload information from production systems into separate, fully integrated databases optimized for performing population-based queries. Without such systems, mining of production clinical data requires complex and time-consuming preparatory data aggregation and processing steps on a project-by-project basis. Because most health care data have potential value for multiple research endeavors, the benefits of developing and maintaining large-scale multipurpose enterprise data warehouses can be considerable. In this article, we present an introduction to clinical data warehouses, highlight examples from the literature in
* Corresponding author. Division of Clinical Informatics, Department of Public Health
Sciences, Suite 3181 West Complex, 1335 Lee Street, University of Virginia Health System,
Charlottesville, VA 22908.
E-mail address: lyman@virginia.edu (J.A. Lyman).
0272-2712/08/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.003
labmed.theclinics.com
which they support data mining, and describe specific issues and challenges
related to their development and use.
Introduction to data warehouses
Health care data warehouses, which integrate data from multiple operational systems and provide longitudinal, population-based views of health
information, are becoming increasingly common [1-9]. One challenge in discussing these systems is the lack of consensus about the appropriate terminology. The terms data repository, data warehouse, knowledge warehouse,
and information warehouse have all been used in the academic and business
communities to describe databases that are distinct from production systems
(eg, electronic health records) and exist to support analytic processing and
strategic decision making (in the business world) and biomedical research
(in the health care sector). There is ambiguity in the terminology, however,
because the term clinical data repository is often used to describe a production clinical system designed to display integrated patient data to health
care providers for the purposes of patient care [10].
A data warehouse, in general terms, can be defined as "a copy of transaction data specifically structured for query and analysis" [11]. Transaction data, as it pertains to health care organizations, may include clinical laboratory results derived directly from a laboratory information system, medication orders, textual documents (eg, discharge summaries, radiology reports, surgical pathology reports), and administrative claims data, to name a few. Although small data warehouses may serve primarily to structure production data from individual systems for analytic purposes, the most useful warehouses incorporate data from multiple traditional production systems. Integrating these disparate data into a common system that provides longitudinal views of retrospective clinical data is the essence of a clinical data warehouse. A related concept, the clinical data mart, is a more specialized version of a clinical data warehouse with a subset of data pertinent to a particular setting or topic [12,13].
Creating a copy of transaction data in a separate integrated database usually yields significant performance benefits for research queries. Production clinical systems are optimized for high-speed, continuous updating of individual patient data elements, and individual patient queries in small transactions. These types of systems are generally referred to as online transaction processing (OLTP) systems, and their optimization for processing large numbers of small transactions at high speed introduces important constraints on their internal data models [14]. In contrast to production system transactions, researchers' queries may be complex, involving long time periods or multiple conditions and patients, and may return substantial data sets. Clinical production systems are not designed for these types of queries and often respond sluggishly, yielding poor performance with research queries and potentially compromising the speed of concurrent clinical care processes as
an unwanted side effect. Systems whose data models are structured to support data analysis are generally termed online analytical processing (OLAP) systems to distinguish them from OLTP systems [14]. Systems designed for OLAP, such as data warehouses, are also typically nonvolatile: data are intermittently added and never changed. Because updating performance is not critical, data additions are batch-scheduled during slow use periods and include many data types on multiple patients [15]. Consequently, the underlying data model can be optimized to maximize the performance of the types of complex queries across populations that are characteristic of research.
Data warehouses as sources for data mining initiatives
In recent years, there have been several published examples of the use of
data warehouses for data mining efforts related to clinical investigation. Researchers have used these databases for factor analysis to identify patients at risk for suboptimal outcomes, to explore associations between diseases and clinical findings, for improvement of clinical information systems, and to examine disease-disease and disease-procedure associations.
Prather and colleagues [16] describe the development of a clinical data warehouse at Duke University Medical Center and its subsequent use for a data mining project intended to identify factors associated with increased risk for preterm birth. Data on more than 45,000 patients were transferred from the perinatal database of their organization's electronic medical record system to a separate relational database designed for research. In a preliminary analysis restricted to a 2-year time period, multiple queries against the data warehouse were used to create a data set of 3902 patients. During this process, variables were "cleansed" to correct erroneous or discrepant values and formatting inconsistencies. Once the final data set was created, factor analysis was performed. For patients who met study criteria, more than 91% of the data values were usable, and three latent factors were identified that explained 48.7% of the variance in the data. In a subsequent larger analysis of 19,970 patients with 1622 variables per patient, 7 predictor variables for preterm birth risk were identified, yielding an area under the receiver operating characteristic curve of 0.72 [17].
In another example, researchers at Columbia University used co-occurrence statistics to discover associations between the presence of diseases and clinical findings [18]. Such associations, they believed, could improve the usefulness of an automated problem list summarization tool used in their electronic medical record system by allowing the removal of redundant, clinically non-informative findings (eg, chest pain in a patient who had myocardial infarction). They compared two methods, chi-square and the proportion confidence interval (PCI), assessing their respective abilities to detect clinically recognized disease-finding associations in discharge summaries. Using the former technique, 94% of the associations that were identified were believed to be true associations as judged by expert physicians,
whereas 77% of the associations found using the PCI approach were believed to be correct. Although the purpose of their knowledge discovery effort was to improve the usefulness and accuracy of an automated problem list generator, they acknowledged the potential usefulness of their approach for identifying novel disease-disease, disease-medication, disease-procedure, and other associations.
The identification of such associations was the focus of another data mining effort using the University of Virginia's Clinical Data Repository, an enterprise-wide data warehouse developed to support clinical investigation. Mullins and colleagues [19] described a collaborative endeavor between researchers at the University of Virginia, Virginia Commonwealth University, and IBM Life Sciences, in which a data set with 667,000 patients was mined using three different unsupervised methods to identify potentially interesting disease associations. Results were compared with automated searches of the biomedical literature in an effort to distinguish between associations that were well established and those that might represent previously unknown relationships. The analysis identified multiple associations of both types, including congestive heart failure-valvular disease-hypertension (a well-known association) and albuterol-tracheostomy-magnesium (an association not found in the biomedical literature).

Developing a health care data warehouse


The design and implementation of an integrated data warehouse is a resource-intensive process requiring a multidisciplinary approach and substantial investments of time and energy. Third-party systems are available, but fully functional warehouses can be developed using freely available open-source tools. A common limitation of systems currently offered by the vendor community is that they are often focused on the business aspects of health care (eg, finances, use) rather than the needs of biomedical investigators, which may require different types of queries with different optimal underlying data models and analysis techniques.
The development of a functional data warehouse involves tasks in three
areas: (1) design and implementation of the data warehouse structure, (2)
population of the warehouse, including data acquisition, processing, loading, and linking, and (3) data access and query management with appropriate security and user interface services. Data acquisition typically includes
the initial collection of historic data and the development of mechanisms
for prospective, real-time, or batch feeds for new data. Each of these task
areas requires a significant effort with careful planning.
Database design and implementation
Database design typically begins with requirements-gathering activities
in which the desires and needs of potential users are elicited and
documented. In the context of building a data warehouse for clinical investigation and data mining, this can be challenging because of the diverse,
ever-changing nature of biomedical research. It is difficult to anticipate at
the outset the breadth and depth of information that will ultimately be
required. This is in contrast to business-oriented applications, in which there
are typically a small number of focused questions that a data warehouse is
intended to support.
Murphy and colleagues [7] approached the requirements process by
retrospectively examining queries of an existing clinical information system
to identify particular data types that were of most value. In their study,
coded diagnoses and medications accounted for more than 90% of queries,
and this information was used to optimize their data model for these types
of queries. This approach has benets but is also limited because historical
queries can only use information that was readily available at the time, and
by denition cannot represent data or types of queries that may become
available in a new system.
By practical necessity, the development of a clinical data warehouse is often largely driven by the data that are electronically available at the time of
its creation. Universally collected data, at least in the United States, include
administrative information necessary to support billing and governmental
reporting requirements. These include coded diagnoses, procedures, demographics, discharge disposition, and payer information. Although there is
debate about whether the accuracy of such administrative data is sufficient
for clinical research or quality assessment [20,21], this type of information
has been used successfully for scores of clinical and health services research
projects. Because clinical laboratory results are typically stored in electronic
format at most institutions and are important indicators of diagnosis,
disease progression, and response to therapy, this information along with
administrative data often form the core around which a clinical data
warehouse is constructed. Other data, including medications (ordered or administered), vital signs, monitoring data (eg, EKG), or textual reports, are
often useful for research studies but are less commonly available electronically. Patient identifiers, although perhaps not necessary for many research projects, are still important for linking data from multiple systems over time, and their inclusion has significant ramifications for the security requirements
of the data warehouse.
Database design includes not only requirements gathering and the identification of desired data elements but also the adoption or development of the data model that will be used. The data model serves as the blueprint for database construction and ultimately determines how data will be organized by the database management system. Data models for data warehouses tend to be fundamentally different from those developed for transactional (OLTP) databases that support, for example, electronic medical records and other production systems. Analytic (OLAP) systems typically use a multidimensional data model that is implemented in one of two
ways: specialized multidimensional databases (multidimensional OLAP, or MOLAP) or relational databases with a multidimensional schema (relational OLAP, or ROLAP) [14,22]. A multidimensional data model is one in which the item under analysis is often some type of event (a time-stamped retail purchase or a patient care event, for example) that can be characterized by several features. Each feature represents a "dimension." Common dimensions in the clinical setting might include patient, physician, location, month of year, day of week, type of event, associated diagnosis, result or observation, treatment type, dose, and many more. For a set of events, dimensions can be selected and plotted against each other or otherwise compared to determine their relationships. These types of databases are often called "data cubes" because they share a set of common dimensions, although the number of dimensions is usually more than the three that define a physical cube.
MOLAP systems are implemented as a set of specialized arrays containing the dimensional data, with precalculated aggregates and summaries
derived from the data that are generated at load time. They are typically
completely loaded once from an external data store and are designed to support specific analyses that are particularly useful for business purposes. To
update the data, old cubes are discarded and new cubes are reloaded from
scratch. MOLAP systems generally yield fast query response because of
their highly optimized (nonrelational) internal design and the substantial
amount of precalculation that is done on loading. This design allows efficient data summarization ("roll-up") or inspection of detailed data contributing to a summary ("drill-down"). Their analytic capabilities may not be flexible enough for evolving biomedical research needs, however. MOLAP systems also tend to perform poorly in highly dimensional settings
(ie, those with more than 10 dimensions) because of the processing, memory,
and storage requirements for precomputing and storing summaries and relationships between many dimensions [23]. Medical environments often
yield 50 or more dimensions per event.
ROLAP systems are standard relational databases that implement multidimensional data models [14,22]. The relational database schemata that support these data models generally have a central "fact table" that contains an entry for each item (eg, a clinical event) along with a limited number of data, such as a time stamp. The remaining characteristics of the item (the dimensions) are contained in a set of "dimension tables" that are linked to the fact table. The linkage of dimension tables to each other through the fact table allows flexible and efficient correlation analyses between the dimension tables. For example, if values are specified for one or more dimensions, it is easy to look up the associated values in an additional dimension. This table structure, consisting of multiple data tables surrounding a central table that contains primarily linkage data, is referred to as a "star" or "snowflake" schema depending on implementation details and is characteristic of a relational data warehouse [7-9,11,14,24].
Fig. 1. Star schema for clinical laboratory data. Star schemata are characterized by a central fact table ("Clinical Lab Result") containing a minimal representation of an event, such as a laboratory test, with links to multiple surrounding dimension tables holding the detailed data on the characteristics of the events. FK, foreign key, which points to a primary key in another table; PK, primary key.

A fact table focused on laboratory test results might contain the following fields: patient identifier, date/time identifier, laboratory test identifier, specimen type, and numeric result (see Fig. 1). If a patient had multiple serum glucose tests, each result would be stored in a row in the table. In this example, a single "fact" refers to a specific result for a specific laboratory test that was obtained for a specific patient at a specific point in time. The identifiers function as links to the dimension tables: a patient dimension, time dimension, and laboratory test dimension, where more detailed information about each of those elements would be stored. Star schemata make it easy to add new dimensional characteristics. For example, if one of the dimensions is time, then the dimension table would allow you to group the time of events recorded in the fact table in different ways: by year, quarter, month, day of week, season, weekend indicator, holiday indicator, and so forth. A patient age dimension might be constructed to easily allow data to be queried or analyzed by adult versus pediatric, age decade, or some other locally defined age group.
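As a concrete illustration of the structure in Fig. 1, the sketch below builds a minimal laboratory fact table with surrounding dimension tables in SQLite and runs one population-level roll-up query; the table and column names are simplified assumptions for the example rather than an actual warehouse schema.

# Minimal star-schema sketch (hypothetical table and column names).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE patient_dim  (patient_key INTEGER PRIMARY KEY, sex TEXT, birth_year INTEGER);
CREATE TABLE lab_test_dim (test_key INTEGER PRIMARY KEY, loinc_code TEXT, test_name TEXT);
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, weekend INTEGER);

-- Central fact table: one lean row per result, with foreign keys into the dimensions.
CREATE TABLE lab_result_fact (
    patient_key INTEGER REFERENCES patient_dim,
    test_key    INTEGER REFERENCES lab_test_dim,
    time_key    INTEGER REFERENCES time_dim,
    specimen    TEXT,
    value_num   REAL
);
""")

# A typical population-level roll-up: mean glucose by year and patient sex.
query = """
SELECT t.year, p.sex, AVG(f.value_num) AS mean_glucose, COUNT(*) AS n
FROM lab_result_fact f
JOIN patient_dim  p ON p.patient_key = f.patient_key
JOIN lab_test_dim l ON l.test_key    = f.test_key
JOIN time_dim     t ON t.time_key    = f.time_key
WHERE l.test_name = 'GLUCOSE'
GROUP BY t.year, p.sex
"""
for row in con.execute(query):
    print(row)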
Depending on the intended application, a clinical data warehouse fact table may contain multiple different types of events, or there may be several fact tables, each with a narrower focus. For example, the RPDR data warehouse serving Partners Healthcare in Boston, Massachusetts, contains at least four separate fact tables [7]. Some investigators have described the importance of using different fact tables for different data types [25]. Clinical laboratory results, for example, may be numeric (further broken into integer or decimal types) or textual, and these data types should be indexed separately for best query performance.
ROLAP data warehouses may also have other structural differences from relational OLTP systems. Because data warehouse queries may be executed across millions of lines of data for thousands of patients at a time, storing redundant or precomputed data that are used commonly in queries can substantially improve query performance. Because this redundancy is designed in and programmatically evaluated at load time and during maintenance, it does not create a data integrity problem. For example, given a patient's date of birth and a date/time-stamped event, it is theoretically redundant to store both a patient's date of birth and age with the time-stamped event, because the age at the event can be calculated from the date of birth and time stamp. Precalculating age, however, can dramatically improve the performance of queries that specify an age condition or carry out mathematical calculations based on age. In aggregate, the characteristics of ROLAP systems described above yield data warehouses that can provide reasonable performance with larger data sets and more dimensions than MOLAP systems can effectively manage, while also allowing greater database design and analytic flexibility. Most large-scale health care data warehouses thus are based on a ROLAP design.
Implementation of the data warehouse requires selection of the underlying relational database management system (RDBMS) from among a wide
variety of commercial and open-source systems. Clinical data warehouses
have been successfully implemented using Microsoft SQL Server [16,
26,27], Oracle [8,28,29], Sybase [4], and the open-source systems Postgres (www.postgresql.org) [30] and MySQL (www.mysql.org) [31], to name
just a few examples. Although practically any fully functional RDBMS
can be used for a data warehouse, the ultimate decision depends on multiple
factors, including consideration of other institutional information systems,
available local expertise, database size, performance requirements, interface
requirements, and functionality.
Data acquisition and processing
Clinical data typically are entered and stored in multiple operational systems distributed throughout a health care organization. It is not unusual for
data mining to require data that are stored in several of these operational
systems. Because each data source typically has its own format and
idiosyncrasies, mining the data is difficult unless it is processed into an appropriate form and collected into a single database. One of the most important functions of data warehouses is to aggregate these data in a form that is appropriate for mining. Several steps are required to adequately organize and process the data before loading them into the data warehouse: extraction, filtering and transformation (sometimes referred to as "data cleansing"), classification, and aggregation. The ultimate goal is to standardize the information from disparate systems and transform it so that it is useful for mining.
Extraction
Data may be extracted from production systems as files or may be received in real time through existing hospital interfaces (Fig. 2). Real-time data are often communicated through an interface engine using the HL7 messaging standard [32], and there are general-purpose HL7 parsers (commercial and open source) that can extract data from HL7 feeds. Nonstandard real-time formats require custom programming. Alternatively, some data sources (particularly those focused on administrative requirements, such as billing) may deliver data as batch files 1 to 2 months after hospital discharge. These batch files can be parsed and processed using standard text parsing or database import libraries.
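As a rough illustration of the parsing such feeds require, the sketch below splits a simplified HL7 v2 result message into segments and fields using the standard delimiters; the message content is fabricated for the example, and a production interface would rely on a full HL7 parser or interface engine rather than code of this kind.

# Toy HL7 v2 field extraction (illustrative only; real feeds should go through
# a proper HL7 parser or interface engine). The sample message is fabricated.
SAMPLE_ORU = (
    "MSH|^~\\&|LIS|LAB|CDR|UVA|200801151230||ORU^R01|12345|P|2.3\r"
    "PID|1||MRN00042||DOE^JANE||19650412|F\r"
    "OBX|1|NM|2345-7^GLUCOSE^LN||105|mg/dL|70-110|N|||F\r"
)

def parse_segments(message):
    # Split an HL7 v2 message into (segment_id, fields) pairs.
    segments = []
    for seg in filter(None, message.split("\r")):
        fields = seg.split("|")
        segments.append((fields[0], fields))
    return segments

for seg_id, fields in parse_segments(SAMPLE_ORU):
    if seg_id == "OBX":
        code = fields[3].split("^")          # coded observation identifier
        print("test:", code[1], "result:", fields[5], fields[6])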
Filtering
Not all of the data provided by source systems are desirable or accurate. Some of the information may have no potential clinical research benefit and may be removed (eg, inventory data or clerical information about how reports were transcribed). Data related to patient identity (eg, social security number, insurance policy numbers), which are usually required for accurate linkage of records from multiple systems, need special handling. Data errors are a perpetual challenge, and the optimal approach to them is to identify errors at the time of data acquisition and filter them out. For example, invalid data or data with related missing critical values may need to be eliminated. Decisions must be made on which data will be filtered versus which will be transformed into "missing" or "unknown" values. For example, if a lab result is received with a date that is in the future, should the record be eliminated, should it be loaded with the bad date, or should the date be changed to a best guess or unknown value? Each approach has its advantages and disadvantages, and the answer should be determined by the goals and requirements of the data warehouse.
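One way to make such decisions explicit is to encode them as load-time rules. A minimal sketch follows, with invented field names and a configurable policy, flagging a result whose date lies in the future and either dropping the record or nulling the date.

# Load-time filtering sketch (field names and policy choices are assumptions).
from datetime import datetime

POLICY = "null_date"   # alternatives: "drop", "keep_as_is"

def filter_result(record, now=None):
    # Return the record to load, or None if it should be dropped.
    now = now or datetime.now()
    ts = record.get("result_time")
    if ts is not None and ts > now:          # implausible future timestamp
        if POLICY == "drop":
            return None
        if POLICY == "null_date":
            record = dict(record, result_time=None)   # keep result, mark date unknown
    return record

rec = {"mrn": "00042", "test": "GLUCOSE", "value": 105.0,
       "result_time": datetime(2099, 1, 1)}
print(filter_result(rec))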
Transformation
It may be useful to force some data to fit predetermined standard values. For example, a patient's gender may be represented in one data source as "M" or "F" but in another as "1" or "2". Some systems provide patient names in a non-atomic format that combines first, middle, and last name all into one field. Transformation should yield consistent data
Fig. 2. Data flow associated with the University of Virginia Clinical Data Repository (CDR). Data are derived from clinical production systems (left) as HL7 messages or batch files. Messages pass through an HL7 interface engine (1) to the CDR PHI database in a secure environment (top). PHI (HIPAA Safe Harbor data elements) are stripped and stored there, and the remaining limited data set is passed to a set of staging databases (middle) where data are held between batch updates of the main CDR database (CDR DB). During this time, data corrections are passed from the production system to the PHI and staging databases, and the data are filtered and transformed as necessary. An interface (2) is available in the secure CDR environment that allows CDR staff access to the PHI and de-identified databases, and also executes prebuilt queries for quality assurance studies or researchers who have permission to access patient identity data (see below). CDR users (bottom right) working with de-identified data or limited data sets with IRB permission query the database directly through an SSL-secured Web interface (3) located with the de-identified databases in a controlled-access environment (middle) that is restricted by user IP address and account name/password. When users who have permission to view PHI extract data, a call is passed from interface 3 to interface 2 to run an appropriate prebuilt query, and combined data from the de-identified and PHI databases are returned. There is no direct external user access into the secure CDR environment. The system also provides a separate database in the secure environment for PHI for external patients (xID DB, upper right) with capabilities similar to the primary PHI database, to accommodate, for example, outside data from multicenter trials. The CDR is implemented as MySQL databases running in Linux, and Web access is provided by way of JDBC from custom Java servlets running in the Apache Tomcat environment.

representation across the warehouse, and individual data elements should be separately accessible for query. Certain invalid data may also need to be transformed: a gender received as "G" or empty may be changed and stored in the database as "U" (for Unknown). This step should also transform data containing confidential information into acceptable forms. For example, a patient's medical record number may be converted to a disguised number. Transformation of local data representations into standard coding schemes or ontologies may have substantial benefit for simplifying queries and
supporting data sharing, and should be considered strongly where possible [31,33]. With respect to pathology data, local lab codes may be converted to LOINC [34], or text strings may be identified in pathology reports and encoded using a standard vocabulary [35].
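A transformation step of this kind can be written as a small set of mapping functions. The sketch below, with invented code tables and a hypothetical salted-hash disguise for the medical record number, normalizes gender codes and splits a non-atomic name field; an institution's real mappings and identifier-disguise scheme would of course differ.

# Transformation sketch: normalize codes and disguise identifiers.
# The mapping table, salt, and field layout are illustrative assumptions.
import hashlib

GENDER_MAP = {"M": "M", "F": "F", "1": "M", "2": "F"}    # source-specific codes
SALT = "local-secret-salt"                               # hypothetical site secret

def normalize_gender(raw):
    return GENDER_MAP.get(str(raw).strip().upper(), "U")  # "U" = unknown

def split_name(non_atomic):
    # Split "LAST^FIRST^MIDDLE" (or "LAST, FIRST") into atomic fields.
    parts = non_atomic.replace(",", "^").split("^") + ["", "", ""]
    return {"last": parts[0].strip(), "first": parts[1].strip(), "middle": parts[2].strip()}

def disguise_mrn(mrn):
    # One-way disguised identifier; the true MRN stays in the secured PHI store.
    return hashlib.sha256((SALT + mrn).encode()).hexdigest()[:16]

row = {"gender": "2", "name": "DOE^JANE^Q", "mrn": "00042"}
print(normalize_gender(row["gender"]), split_name(row["name"]), disguise_mrn(row["mrn"]))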
Classification
Data may be grouped and classified into new schemes designed to support particular data warehouse or mining requirements. For example, a person's age may be classified into discrete age groups (younger than 18 years, 18-40 years, and so forth). Many data mining software techniques require that data be submitted in a limited number of discrete categories rather than as a data element with an unlimited number of possible values, so pre-classifying may aid in readying a database for extraction and input into a data mining software tool. Classifying data can itself be a data mining operation, and it may be desirable to feed results of this data mining back into the database to help with other data mining activities. Although classification can have advantages, it is important to maintain the raw, detailed data in the database to retain flexibility for multiple uses and future reclassification as needed.
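Pre-classification often amounts to simple binning rules applied at load time. A minimal sketch, with arbitrarily chosen cut points and labels:

# Age-group pre-classification sketch; the bin edges and labels are arbitrary examples.
AGE_BINS = [(0, 18, "pediatric"), (18, 40, "18-40"), (40, 65, "40-65"), (65, 200, "65+")]

def age_group(age_years):
    for low, high, label in AGE_BINS:
        if low <= age_years < high:
            return label
    return "unknown"

for a in (7, 25, 70):
    print(a, "->", age_group(a))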
Aggregation
It may be useful from a query formulation and performance standpoint
to precalculate certain commonly used values (eg, totals, averages) and store
these in the database. The use of precalculated patient age was discussed
previously. Another candidate for precalculation might be the total number
of days for a hospitalization, calculated from admission and discharge dates.
This particular variable is often used as an outcome in health services research and precalculation improves the speed of the many queries that
make use of it. It may also be beneficial to aggregate hospital stay data by patient diagnosis, location, or service, and precompute average, maximum, minimum, and variance values, if there is sufficient local interest in
these data. As local needs change, the type of values that are precalculated
and stored can also change as long as the raw data on which the calculations
are based remain in the data warehouse.
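Such derived values are typically computed once at load time and stored alongside the raw dates. A minimal sketch of the length-of-stay and age-at-event precalculations mentioned above (column and function names are assumptions):

# Load-time precalculation sketch; names are illustrative.
from datetime import date

def length_of_stay(admit, discharge):
    # Whole days between admission and discharge, stored with the encounter row.
    return (discharge - admit).days

def age_at_event(birth_date, event_date):
    # Age in completed years at the time of the event, stored with the fact row.
    years = event_date.year - birth_date.year
    if (event_date.month, event_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

print(length_of_stay(date(2007, 3, 1), date(2007, 3, 9)))      # 8
print(age_at_event(date(1950, 6, 15), date(2007, 3, 1)))       # 56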
Security and confidentiality
The Health Insurance Portability and Accountability Act (HIPAA) privacy rule, which went into effect in April 2001, has significant implications for the use of clinical data in general and their use in biomedical research in particular [36]. Although the conduct of clinical research frequently does not require identifiable data for analysis, patient identifiers are needed when (1) patients must be contacted for study recruitment, or (2) the research requires existing data from multiple, disparate systems to be linked together by patient. From the standpoint of developing a data warehouse, identifiable data are
essential for linking records derived from disparate systems and for maintaining longitudinal records over time. At the University of Virginia, our
data warehouse, the Clinical Data Repository (CDR, see Fig. 2), addresses
this task by separating direct identifiers, such as name, medical record number, and social security number, and storing this information in a distinct database on its own highly protected secure server, accessible only to members of the CDR project team [4]. Date of birth, gender, and race are also stored on this server because they are sometimes useful for resolving ambiguous matches when linking records. Our system assigns a disguised identifier for each new patient, which links clinical records for that patient within the database that is accessible to our research users. This disguised identifier is mapped to the identifiable data in the highly secure database and serves as
the master link between the two systems. The names, dates of birth, medical
record numbers, and so forth can thus be omitted from the accessible research server but are available when necessary for record linking and for
research purposes with appropriate Institutional Review Board permission.
The HIPAA privacy rule specifies a list of 18 data elements that must be removed for a database to be considered de-identified [37]. A data warehouse that maintains a de-identified database for direct user access has security and convenience advantages that may be preferable to requiring various users to obtain permission to access a system that contains patient identifiers. Removal of HIPAA identifiers may not prevent identification of patients who have rare conditions or rare combinations of conditions, however. In these cases, queries may return very small groups or single patients [38]. For this reason, most clinical data warehouses that implement investigator access to de-identified data return only summary data if the data set resulting from a query contains fewer than a defined minimum number of patients.
Although many of the data elements that HIPAA includes as potentially identifiable are not needed at the user query stage, temporal information is often important. The HIPAA privacy rule prohibits specific health-related dates (other than year) in a de-identified data set, so disguising dates in a way that preserves their research usefulness is beneficial. The use of date offsets, the number of days (or hours, minutes, and so forth) between events, is one way to address researchers' needs for temporal constraints on query conditions. Typically the absolute date is not required for research, but it is necessary to determine whether events occurred within some specified time period (eg, readmission to the hospital within 30 days of discharge, or the use of a medication within 24 hours of hospital admission). The use of offsets requires the identification of a "time zero" for each patient; our data warehouse uses the date/time of the first event for a patient as a starting point, and all offsets are calculated based on that reference. The raw dates can be stored in the highly secure database containing the identifiable data set. This approach, although having the benefit of improved confidentiality, does have important limitations. Strictly adhering to this method means that a table storing outpatient clinic visits could not include the
date of the visit, just the year and the number of days between that visit and the patient's first event in the database. A user query aimed at identifying specific seasonal trends in hospitalized patients, for example, would require links back to the identifiable data to find the correct dates of origin for the offset calculations.
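The offset scheme can be expressed compactly: each patient's first recorded event becomes day zero, and every other date is stored as a day count from that reference. A minimal sketch, assuming event records keyed by a disguised patient identifier:

# Date-offset de-identification sketch (record layout is an assumption).
from datetime import date

events = [
    {"pid": "a1b2", "event": "admission",   "when": date(2006, 1, 10)},
    {"pid": "a1b2", "event": "discharge",   "when": date(2006, 1, 14)},
    {"pid": "a1b2", "event": "readmission", "when": date(2006, 2, 2)},
]

# "Time zero" for each patient is the date of the patient's first event.
time_zero = {}
for e in sorted(events, key=lambda e: e["when"]):
    time_zero.setdefault(e["pid"], e["when"])

# Research-facing rows keep the year and the day offset; the raw date stays
# only in the secured identifiable database.
for e in events:
    offset = (e["when"] - time_zero[e["pid"]]).days
    print(e["pid"], e["event"], e["when"].year, "day", offset)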
The challenges of textual data
Some of the richest, most clinically detailed information in the medical
record is stored as text. This information includes consultative reports (eg,
pathology, radiology), operative notes, progress notes, discharge summaries, and also some clinical laboratory results, including microbiology. These
data are often valued by researchers but present the data warehouse developer with multiple challenges, including ensuring confidentiality, classifying
data correctly, and providing useful query methods.
Confidentiality and textual data
As opposed to coded clinical/administrative data (eg, diagnoses, procedures, medications, and so forth) or numeric clinical laboratory tests, data in textual fields are much more susceptible to the inclusion of identifiable data, because the data are entered in a free text format, and the identifiers may be clinically and operationally important for communication between members of the health care team. A surgical pathology report typically includes phrases that describe how specimens are labeled, and may explicitly identify patient name and medical record number. A textual field in a laboratory result might include the phrase "Result called at 3/23/2006 9:12 AM to Dr. Smith" or a reference to a particular health care facility. Although access to these reports might be allowable for researchers who have IRB approval to review identifiable data, the ultimate goal is to provide as much de-identified information as possible so that researchers may work without identities unless they are truly necessary. The automated de-identification, or "scrubbing," of textual reports is an active area of medical informatics research, and there are increasing numbers of available tools to accomplish this [35,39,40]. A detailed discussion is beyond the scope of this article, but two approaches described in the literature include (1) automated extraction of accepted medical terms, which are stored in the de-identified database in lieu of the textual report [35], and (2) removal of identifiable data from the corpus of the text, leaving behind a report that is ostensibly clinically detailed but free from any information that might make the patient's identity known [39,40].
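As a very rough illustration of the second approach, the sketch below uses regular expressions to blank out date-like strings, clinician names following "Dr.", and medical-record-number patterns in a fabricated report fragment; real scrubbers, such as the tools cited above, use far more sophisticated lexical and contextual methods.

# Toy text-scrubbing sketch; the patterns and sample text are illustrative only
# and would miss much real PHI. Production systems use the tools cited above.
import re

PATTERNS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),
    (re.compile(r"\b\d{1,2}:\d{2}\s*(AM|PM)\b", re.I), "[TIME]"),
    (re.compile(r"\bDr\.\s+[A-Z][a-z]+\b"), "Dr. [NAME]"),
    (re.compile(r"\bMRN\s*\d+\b", re.I), "[MRN]"),
]

def scrub(text):
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

report = "Result called at 3/23/2006 9:12 AM to Dr. Smith. Specimen labeled MRN 00042."
print(scrub(report))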
Querying textual data
A common query into a clinical data warehouse might ask for all newly
diagnosed patients who have a certain form of cancer. Surgical pathology
reports are often a valuable source of information for this type of question,
but locating these cases means successfully searching potentially hundreds of
thousands of documents that are rife with abbreviations, homonyms, synonyms, and misspellings. The query challenges are, in essence, similar to those
that users face when searching the biomedical literature or even the World
Wide Web, and consequently similar approaches can be used to facilitate
successful queries. One of the commonly studied methods for addressing
this challenge is to use a computer-based auto-coding approach in which
the text is parsed and codes are assigned from a standard terminology,
such as the Unied Medical Language System Metathesaurus [35,41,42].
Currently, though, the large-scale use of standard terminologies for linkage
to, or replacement of, textual reports remains a future goal.

Performance issues
Except for very small databases, performance should be a major focus
when designing a relational clinical data warehouse. If the database will
be large, performance will be a significant issue and good database design
will be a particularly critical requirement for success.
One of the most significant features of a relational database system that can enhance performance is the ability to place indexes on any data element (or set of data elements). When searching the database or joining tables together, tremendous performance gains can be achieved by strategically adding indexes on those columns that will be used in database queries. Indexes are separate database files that are maintained internally and contain a copy of the indexed columns in sorted order along with pointers to corresponding rows in the original data table. Indexes allow the database system to use a fast binary search algorithm to quickly find data and the original table rows that contain it.
Indexes, however, usually should not be placed on every data element. An index is only useful if the ratio of unique data values to the number of rows is relatively high. Creating indexes on data elements that contain few distinct values (such as gender or race) typically offers little benefit and can actually be detrimental to query performance. There are times, therefore, when a query must find rows by doing a sequential read through the entire table. In these cases, query speed is related to table size, so tables should be designed to be as lean as possible.
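In practice this is a one-line definition per column. A minimal sketch (hypothetical table and column names) indexes the high-cardinality keys used in joins while leaving a low-cardinality column such as gender unindexed:

# Index selection sketch; table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE lab_result_fact (
                   patient_key INTEGER, test_key INTEGER,
                   gender TEXT, value_num REAL)""")

# High-cardinality columns used in joins and WHERE clauses benefit from indexes.
con.execute("CREATE INDEX idx_fact_patient ON lab_result_fact (patient_key)")
con.execute("CREATE INDEX idx_fact_test    ON lab_result_fact (test_key)")

# A column with only a handful of distinct values (eg, gender) is usually left
# unindexed; a sequential scan is about as fast and the index costs space.
print(con.execute("EXPLAIN QUERY PLAN SELECT * FROM lab_result_fact "
                  "WHERE patient_key = 42").fetchall())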
In the laboratory test result tables of our CDR, the lab test description
and unit of measure were extracted to separate tables. Some space was saved
because of reduction in redundancy, but there was a larger reward in performance. Descriptions tend to be large, so they can significantly increase the
size of a table, which in turn adds to sequential search times. By extracting
the descriptions, we decreased the size of the table yielding faster sequential
search times. The added time required to extract the description from
a separate table is comparatively minimal. Performance of complex queries, particularly those using sub-queries, may also be improved by judicious use of temporary database tables, pre-fetched data, and similar techniques, depending on the design and capabilities of the particular database management software.
User interfaces
A flexible, easy-to-use user interface for a clinical data warehouse increases the accessibility of data, promotes the use of the system by researchers, and helps moderate the demand on the warehouse staff for support of individual projects. Direct access to the data may also improve query results by allowing users to successively refine their queries as results are returned to yield the most valuable data set for a project. User interfaces for data warehouses may include several different components: query formulation (in which the user is insulated from manual SQL scripting), presentation of standard reports, creation of custom reports, downloading data sets, and statistical analysis. Although query formulation is a requisite for any data warehouse interface, the other components are optional. There is a wide range in the extent to which these other functions are supported in existing clinical data warehouses. For example, the RPDR at Partners Healthcare, Inc. offers an interface that has been developed using MS Access, allowing researchers to refine queries until the desired data set is identified [27]. At that point, researchers with the appropriate authorization can request these data, which are then compiled by the RPDR staff. Investigators at Indiana University/Regenstrief have developed a powerful query tool for users as part of the Shared Pathology Information Networks project that supports online statistical analysis and summarization of large data sets using open-source tools [43]. Finally, our Clinical Data Repository at the University of Virginia includes a flexible Web-based query and reporting interface that permits users to create, review, and refine queries online, and then directly download these data sets if they have appropriate authorization [4].
Summary
Clinical data warehouses are becoming increasingly common at academic
health centers in the United States, and their value will continue to grow as
electronic medical records gain further adoption and the volume of clinical
data in electronic form continues to increase. High-quality commercial and
open-source software is available for the construction of the warehouse itself
along with query and reporting interfaces and statistical analysis tools. Development of these systems is a significant undertaking that requires multidisciplinary expertise, but the benefits to clinical research are substantial. A robust clinical data warehouse offers a rich source of integrated, longitudinal patient data, and it has the potential to greatly facilitate all forms of
data mining. Even in cases in which data are exported from a data warehouse so that they can be restructured into an appropriate format for mining, such tasks are orders of magnitude less time-consuming and resource-intensive than de novo collection and processing of data from multiple clinical operational systems. Furthermore, data mining software developers often incorporate relational database connectivity into their applications,
allowing mining directly against warehouses without the need for data export and transformation. These developments indicate that the nature of
biomedical research is evolving: we are entering an era in which large
amounts of clinical data in electronic form will be accessible to researchers
using well-designed analysis tools to pursue biomedical and translational
knowledge discovery.

References
[1] Dewitt JG, Hampton PM. Development of a data warehouse at an academic health system: knowing a place for the first time. Acad Med 2005;80:1019-25.
[2] Kamal J, Pasuparthi K, Rogers P, et al. Using an information warehouse to screen patients for clinical trials: a prototype. Proc AMIA Symp 2005;1004.
[3] Bock BJ, Dolan CT, Miller GC, et al. The data warehouse as a foundation for population-based reference intervals. Am J Clin Pathol 2003;120:662-70.
[4] Einbinder JS, Scully KW, Pates RD, et al. Case study: a data warehouse for an academic medical center. J Healthc Inf Manag 2001;15:165-75.
[5] Tusch G, Muller M, Rohwer-Mensching K, et al. Data warehouse and data mining in a surgical clinic. Stud Health Technol Inform 2000;77:784-9.
[6] Wisniewski MF, Kieszkowski P, Zagorski BM, et al. Development of a clinical data warehouse for hospital infection control. J Am Med Inform Assoc 2003;10:454-62.
[7] Murphy SN, Morgan MM, Barnett GO, et al. Optimizing healthcare research data warehouse design through past COSTAR query analysis. Proc AMIA Symp 1999;892-6.
[8] Verma R, Harper J. Life cycle of a data warehousing project in healthcare. J Healthc Inf Manag 2001;15:107-17.
[9] Berndt DJ, Hevner AR, Studnicki J. The CATCH data warehouse: support for community health care decision-making. Decision Support Systems 2003;35(3):367-84.
[10] Sittig DF, Pappas J, Rubalcaba P. Building and using a clinical data repository. Available at: http://www.informatics-review.com/thoughts/cdr.html. Accessed April 23, 2007.
[11] Kimball R. The data warehouse toolkit. New York (NY): John Wiley & Sons; 1996.
[12] McNamee LA, Launsby BD, Frisse ME, et al. Scaling an expert system data mart: more facilities in real-time. Proc AMIA Symp 1998;498-502.
[13] Brandt CA, Morse R, Matthews K, et al. Metadata-driven creation of data marts from an EAV-modeled clinical research database. Int J Med Inform 2002;65:225-41.
[14] Rob P, Coronel C. Database systems: design, implementation, and management. 7th edition. Boston: Thomson/Course Technology; 2007.
[15] Inmon WH. Building the data warehouse. 4th edition. Indianapolis (IN): Wiley; 2005.
[16] Prather JC, Lobach DF, Goodwin LK, et al. Medical data mining: knowledge discovery in a clinical data warehouse. Proc AMIA Symp 1997;101-5.
[17] Goodwin LK, Iannacchione MA. Data mining methods for improving birth outcomes prediction. Outcomes Manag 2002;6:80-5.
[18] Cao H, Markatou M, Melton GB, et al. Mining a clinical data warehouse to discover disease-finding associations using co-occurrence statistics. Proc AMIA Symp 2005;106-10.
[19] Mullins IM, Siadaty MS, Lyman J, et al. Data mining and clinical data repositories: insights from a 667,000 patient data set. Comput Biol Med 2006;36:1351-77.
[20] Humphries KH, Rankin JM, Carere RG, et al. Co-morbidity data in outcomes research: are clinical data derived from administrative databases a reliable alternative to chart review? J Clin Epidemiol 2000;53:343-9.
[21] Iezzoni LI. Assessing quality using administrative data. Ann Intern Med 1997;127:666-74.
[22] Gorla N. Features to consider in a data warehousing system. Commun ACM 2003;46(11):111-5.
[23] Weber R, Schek H, Blott S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: VLDB '98, Proceedings of the 24th International Conference on Very Large Data Bases; 1998. p. 194-205.
[24] Levene M, Loizou G. Why is the snowflake schema a good data warehouse design? Information Systems 2003;28(3):225-40.
[25] Nadkarni PM, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc 1998;5:511-27.
[26] Breen C, Rodrigues LM. Implementing a data warehouse at Inglis Innovative Services. J Healthc Inf Manag 2001;15:87-97.
[27] Murphy SN, Gainer V, Chueh HC. A visual interface designed for novice users to find research patient cohorts in a large biomedical database. Proc AMIA Symp 2003;489-93.
[28] Ledbetter CS, Morgan MW. Toward best practice: leveraging the electronic patient record as a clinical data warehouse. J Healthc Inf Manag 2001;15:119-31.
[29] Nigrin DJ, Kohane IS. Scaling a data retrieval and mining application to the enterprise-wide level. Proc AMIA Symp 1999;901-5.
[30] Corwin J, Silberschatz A, Miller PL, et al. Dynamic tables: an architecture for managing evolving, heterogeneous biomedical data in relational database management systems. J Am Med Inform Assoc 2007;14:86-93.
[31] Lyman JA, Scully K, Tropello S, et al. Mapping from a clinical data warehouse to the HL7 reference information model. Proc AMIA Symp 2003;920.
[32] HL7. Health Level 7. Available at: http://www.hl7.org. Accessed June 15, 2007.
[33] Nardon FB, Moura LA. Knowledge sharing and information integration in healthcare using ontologies and deductive databases. Medinfo 2004;11(Pt 1):62-6.
[34] Khan AN, Griffith SP, Moore C, et al. Standardizing laboratory data by mapping to LOINC. J Am Med Inform Assoc 2006;13(3):353-5.
[35] Berman JJ. Concept-match medical data scrubbing. How pathology text can be used in research. Arch Pathol Lab Med 2003;127:680-6.
[36] National Institutes of Health. Clinical research and the HIPAA privacy rule. Available at: http://privacyruleandresearch.nih.gov/clin_research.asp. Accessed June 18, 2007.
[37] Schell SR. Creation of clinical research databases in the 21st century: a practical algorithm for HIPAA compliance. Surg Infect (Larchmt) 2006;7(1):37-44.
[38] El Emam K, Jabbouri S, Sams S, et al. Evaluating common de-identification heuristics for personal health information. J Med Internet Res 2006;8(4):E28.
[39] Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (de-id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol 2004;121:176-86.
[40] Beckwith BA, Mahaadevan R, Balis UJ, et al. Development and evaluation of an open source software tool for deidentification of pathology reports. BMC Med Inform Decis Mak 2006;6:12.
[41] Nadkarni P, Chen R, Brandt C. UMLS concept indexing for production databases: a feasibility study. J Am Med Inform Assoc 2001;8:80-91.
[42] Hazlehurst B, Frost HR, Sittig DF, et al. MediClass: a system for detecting and classifying encounter-based clinical events in any electronic medical record. J Am Med Inform Assoc 2005;12:517-29.
[43] McDonald CJ, Dexter P, Schadow G, et al. SPIN query tools for de-identified research on a humongous database. Proc AMIA Symp 2005;515-9.

Clin Lab Med 28 (2008) 73-82

Multi-Database Mining
Mir S. Siadaty, MD, MS*,
James H. Harrison, Jr, MD, PhD
Division of Clinical Informatics, Department of Public Health Sciences, University of Virginia,
Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA

Traditional data mining approaches are designed to identify patterns of interest (eg, association rules or clusters) in large integrated data sets, as discussed elsewhere in this issue. Commonly, large numbers of possibly interesting patterns are generated from these studies, and these result sets are pruned using statistical or heuristic methods to eliminate patterns that are less likely to be interesting. The inclusion of data of multiple types from multiple sources can broaden the scope and effectiveness of the initial identification of patterns and the pruning tasks. In biomedical science, for example, the integration of genotypic and phenotypic databases to identify genomic-phenotypic ("phenomic") associations improves the ability to identify genes important in disease and therapeutic responses [1,2]. Because clinical laboratory databases are among the largest generally accessible, detailed records of human phenotype in disease, they will likely have an important role in future studies designed to tease out associations between human gene expression and the presentation and progression of disease.
Data types that are derived from different sources, such as genomic, proteomic, or phenotypic (eg, clinical) data, often exist in separate databases. For traditional data mining, these data must be appropriately transformed and integrated into single databases so that they can be processed as a unit. Examples of this approach include the PhenCode database, which integrates ENCODE genome sequences with human phenotype and clinical data [3], integration of molecular genetic mutation data with clinical data for discovery of diagnostic markers in Marfan syndrome [4], and integration of gene expression data with a protein-protein network model derived from literature reports to more clearly identify genes that are likely to contribute to neurodegenerative disorders [5].

* Corresponding author.
E-mail address: mirsiadaty@virginia.edu (M.S. Siadaty).
0272-2712/08/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.004
labmed.theclinics.com

Techniques of data fusion [6], originally developed to combine heterogeneous sensor and image data for common analysis, have been used successfully to integrate health data [7,8]. Multiple databases may also be federated, or virtually integrated through communications, such that the databases remain separate but are presented in a unified view for analytic purposes [9]. TwinNET [10] uses this approach to integrate phenotypic and genotypic data on 600,000 twin pairs in seven European countries and Australia.
Multiple databases may also be mined separately, without data integration, to yield patterns whose relative occurrence between the individual databases has meaning. This technique, known as multi-database mining [11], is the focus of this article. It may have practical advantages when data models and formats are difficult to integrate into a single database, when data sharing or privacy issues differ substantially between data sets, or when an integrated database would be of unwieldy size. Multi-database mining may also have theoretical advantages when patterns or associations are not shared across all databases (and thus would be diluted by integration) and the primary issue of interest is the degree of correspondence between databases. For example, multi-database mining allows a distinction between patterns that are present commonly in all included databases and patterns that are present commonly in only a few of the databases. Depending on the application, this distinction may be useful for tasks such as pruning spurious patterns or identifying patterns that are unusual in particular ways.
Fig. 1 summarizes situations in which multiple database mining might be applied.

Fig. 1. Flow chart for applying single and multi-database data mining techniques.

When the data to be mined are contained in multiple structurally similar databases (see Fig. 1, Box A), analytic requirements determine whether it is best to pool the data or mine the databases separately (see Fig. 1, Box 1 versus 2). When the databases differ substantially in structure or data representation (see Fig. 1, Box B), the approach to mining is determined by whether logically related patterns and compatible measures of association strength can be defined across the set of databases. Logical relationships between patterns do not require uniform data representation or structure or the existence of identical patterns across the set of databases. If a logical correspondence between patterns and their association strengths cannot be created, data mining may be performed on one primary database (see Fig. 1, Box 3), and the remaining databases may be used to provide supplemental data or annotation of the results. If a correspondence between patterns in the databases can be established, multiple database data mining techniques may be applied (see Fig. 1, Box 4). In this case, multiple database mining can yield specific insights into the characteristics of the data set and make unique interpretations possible. The dual mining method, which is described later in this article, is an example of this scenario.
Methods specific to multi-database mining

Multi-database mining shares many technical features with single database mining, but it also includes some special requirements and methods. Methods for single database mining are covered elsewhere in this issue. This article briefly discusses some of the differences in methodology between single and multi-database mining and then illustrates several multi-database mining tasks.
Because databases vary in content and because there may be many candidate databases that could be used in a data mining project, multi-database mining benefits from ranking databases based on their similarity to each other and relevance to the problem at hand. Liu and colleagues [12] developed techniques for identifying and ranking relevant databases based on the pertinence of the information in the databases to the planned analysis. They argue that effective data mining from multi-databases should involve an explicit selection phase before mining in which the databases are objectively evaluated for relevance. Their method computes a relevance factor for each candidate database from a listing of the data elements it contains that are related to the intended mining project. Databases are then ranked and selected for inclusion on the basis of their relevance factor. Zhang and colleagues [13] also argue for the evaluation of databases before mining projects, and present a set of methods for database classification that are application independent.
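The selection step can be sketched in a few lines of code. The scoring function below, which simply measures the fraction of project-relevant data elements that each candidate database contains, is an illustrative stand-in rather than the exact relevance factor of Liu and colleagues [12]; the database names and concept lists are hypothetical.

    # A minimal sketch of relevance-based database selection. The score is the
    # fraction of project-relevant data elements present in each database,
    # which is an illustrative simplification of the published relevance factor.

    def relevance_factor(project_elements, database_elements):
        """Fraction of the data elements needed by the mining project
        that a candidate database actually contains."""
        project = set(project_elements)
        return len(project & set(database_elements)) / len(project)

    def rank_databases(project_elements, candidates, threshold=0.5):
        """Rank candidate databases by relevance and keep those above
        a selection threshold."""
        scored = [(name, relevance_factor(project_elements, elements))
                  for name, elements in candidates.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [(name, score) for name, score in scored if score >= threshold]

    # Hypothetical example: three laboratory databases described by the
    # concepts they contain.
    project = ["serum_creatinine", "hemoglobin", "icd9_diagnosis", "platelet_count"]
    candidates = {
        "lab_db_A": ["serum_creatinine", "hemoglobin", "platelet_count"],
        "lab_db_B": ["icd9_diagnosis", "cpt_procedure"],
        "lab_db_C": ["hemoglobin", "icd9_diagnosis", "platelet_count", "serum_creatinine"],
    }
    print(rank_databases(project, candidates))
    # lab_db_C (1.0) first, then lab_db_A (0.75); lab_db_B falls below the threshold.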
Cleansing and preparation of data for mining are important in single and multi-database mining. Because databases may differ substantially in data structure, representation, and integrity, mining multiple databases usually requires each database to be prepared individually using techniques appropriate for its characteristics. In addition, the data in each database must be prepared or transformed as necessary such that the patterns discovered during mining are comparable as intended across databases.
When data mining in each database is complete, the patterns across all databases are classified into one or more of four types [11]. Individual databases produce a set of findings termed local patterns, similar to patterns produced by a standard single database mining task. Some local patterns may also be shared across most of the databases mined. These patterns are termed high-voting patterns and generally have global implications. Other local patterns may be prominent in a few or only one of the databases. These exceptional patterns may highlight the unique characteristics of a database, for example, features that are visible only from a particular perspective. Patterns that are present across multiple databases with a moderate frequency slightly below that required for attention are termed suggestive patterns.
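A minimal sketch of this classification step is shown below, assuming that each local pattern has already been counted across the mined databases; the share thresholds separating high-voting, suggestive, and exceptional patterns are illustrative choices rather than values prescribed by the multi-database mining literature.

    # A minimal sketch of classifying local patterns by how many of the mined
    # databases report them. The thresholds are illustrative, not canonical.

    def classify_patterns(local_patterns, n_databases, high_vote=0.8, exceptional=0.2):
        """local_patterns maps each pattern to the number of databases in which
        it was found; returns a dict of pattern -> category."""
        categories = {}
        for pattern, count in local_patterns.items():
            share = count / n_databases
            if share >= high_vote:
                categories[pattern] = "high-voting"
            elif share <= exceptional:
                categories[pattern] = "exceptional"
            elif share >= high_vote - 0.2:
                categories[pattern] = "suggestive"
            else:
                categories[pattern] = "local"
        return categories

    # Hypothetical counts for patterns mined from 10 laboratory databases.
    counts = {"anemia & low_hemoglobin": 10,
              "sepsis & thrombocytopenia": 7,
              "rare_site_specific_pattern": 1}
    print(classify_patterns(counts, n_databases=10))
    # {'anemia & low_hemoglobin': 'high-voting',
    #  'sepsis & thrombocytopenia': 'suggestive',
    #  'rare_site_specific_pattern': 'exceptional'}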

Use of multi-database mining to improve pattern identification

In most applications, standard data mining techniques produce large numbers of patterns [14,15], many of which are previously known and not interesting. Inspecting this long list manually can be very time consuming or even infeasible; therefore, an automated method to prune the list of patterns before inspection by a human expert is desirable. This pruned list should include the most interesting patterns and exclude spurious or previously known patterns. Pruning methods have been developed that use a variety of techniques to measure the interestingness of each pattern. The way the methods define and measure interestingness separates them into several categories.
One family of rule interestingness measures is called subjective measures, in which previous relevant knowledge in the domain of a data set under analysis is gathered from human experts in the domain [16]. The knowledge is then coded into rules so that it can be processed. Software uses this representation of domain knowledge to prune uninteresting, irrelevant, or spurious patterns from the list generated by data mining, yielding a subset of patterns enriched in those that are interesting. Current data mining strategies almost always include a process of this type as an integral component. Because this knowledge acquisition process involves human experts and knowledge encoding, it is prone to higher costs (of human resources), recall errors, and the possibility of knowledge being outdated or inadvertently omitted. The authors have recently developed a multi-database method called dual mining that provides an automated method of evaluating the interestingness of found patterns in light of published knowledge, removing the manual knowledge acquisition bottleneck while taking advantage of the work of a large number of experts. This approach uses two or more databases to measure the interestingness of patterns and to prune uninteresting ones [17].

The dual mining method aims to solve the knowledge acquisition bottleneck for discovering useful or interesting patterns by automatically comparing the strengths of associations mined from a target database with the strengths of corresponding associations mined from a relevant knowledge base, for example, published biomedical literature. When the estimates of the strength of an association do not match in the knowledge base and target database, a high surprise score is assigned to that association to identify it as potentially interesting. The surprise score captures the degree of novelty or interestingness of mined patterns without the need for a domain expert to evaluate the patterns by hand.
As a simple example of surprise scores, consider four patterns mined from a target database for which the strengths of association, on a scale of 0 to 1, are 0.09, 0.12, 0.97, and 0.84, respectively. Although patterns 3 and 4 are obviously the stronger ones, these patterns might already be known and therefore not very interesting or useful. To determine which of the patterns might be interesting, we estimate the strengths of the same patterns in a pertinent knowledge base. Corresponding association strengths of the patterns in the knowledge base are 0.08, 0.96, 0.84, and 0.12. Patterns 2 and 3 are the strong ones in the knowledge base. We argue for a model in which patterns that are similarly associated in both the database and knowledge base (patterns 1 and 3) are less interesting than patterns that are strongly associated in the knowledge base but not the database or vice versa (patterns 2 and 4). Intuitively, a pattern with strong association in the database but weak association in the knowledge base may represent a discovery with little current knowledge in its support, warranting further investigation. Likewise, pattern 2 appears to be well-established knowledge, but the association estimated in the database is not large; therefore, pattern 2 also represents a surprise finding.
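The ranking implied by this example can be reproduced with a few lines of code. The published dual mining method also adjusts for the variance of the estimated strengths [17]; the sketch below simplifies the surprise score to the absolute difference between the two association strengths, which is proportional to the distance from the 45-degree diagonal discussed next.

    # A minimal sketch of the surprise idea using the four example patterns above.
    # The surprise score here is simplified to |database strength - knowledge base
    # strength|, which is the distance from the diagonal up to a constant factor;
    # the published method also accounts for the variance of the estimates [17].

    db_strength = [0.09, 0.12, 0.97, 0.84]   # target database (eg, CDR)
    kb_strength = [0.08, 0.96, 0.84, 0.12]   # knowledge base (eg, MEDLINE)

    surprise = [abs(d - k) for d, k in zip(db_strength, kb_strength)]
    ranked = sorted(range(len(surprise)), key=lambda i: surprise[i], reverse=True)

    for i in ranked:
        print(f"pattern {i + 1}: database={db_strength[i]:.2f} "
              f"knowledge base={kb_strength[i]:.2f} surprise={surprise[i]:.2f}")
    # Patterns 2 and 4 (surprise 0.84 and 0.72) rank above patterns 1 and 3
    # (surprise 0.01 and 0.13), matching the discussion above.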
The scatterplot in Fig. 2 visualizes the relationship between strengths of patterns in the database and strengths of patterns in the knowledge base. The diagonal 45-degree line represents equal database and knowledge base association strengths. According to our model, patterns close to the diagonal are not of much interest to the user, whereas patterns that are farther away from the diagonal are more likely to be interesting. The distance from the diagonal, rather than the strength of an association, defines its degree of interestingness. We define patterns that are located far from the diagonal as surprise patterns in dual mining. They are an example of the exceptional pattern in multi-database mining that was described in the previous section as occurring with substantially different frequency within a set of mined databases.
To apply the method, a target database is paired with a relevant knowledge base that contains facts and relationships describing the data items in the target.

Fig. 2. Scatterplot showing the association strength of four patterns (circles labeled 1–4) simultaneously mined from a database (x-axis) and knowledge base (y-axis). Patterns 1 and 3 are near the diagonal (similar association strengths in both data sets) and may represent uninteresting findings. Patterns 2 and 4 are far away from the diagonal (different association strengths in the data sets) and may be interesting. (Adapted from Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13; with permission.)

For example, the US National Library of Medicine's Medical Subject Headings (MeSH)-encoded MEDLINE database could represent a pertinent knowledge base applicable to a target database of clinical laboratory tests and diagnoses. Association strengths between items in the target database could be evaluated in comparison with association strengths of similar concepts in MEDLINE abstracts or sentences. The entire MEDLINE database is generally available through a no-cost lease contract with the National Library of Medicine [18].

A case study of the dual mining method


In an evaluation of this approach, the authors tested the dual mining method as outlined previously using the University of Virginia's Clinical Data Repository (CDR, see the article by Lyman and colleagues, elsewhere in this issue) as the target database and MEDLINE as the knowledge base [17]. We limited ourselves to laboratory test results and diagnosis code concepts, although many other data types are available in the CDR. During each visit, patients are assigned one or more diagnosis codes according to the International Classification of Diseases (ICD) standard, and those codes are stored in the CDR. Diagnostic laboratory tests with their results for each visit are also saved in the CDR.

Several thousand different laboratory tests and diagnosis codes (ie, laboratory and disease concepts) in the CDR are observed with varying frequency. We chose as concepts to study those that appeared with high frequency in the CDR to ensure an adequate number of patterns for analysis, and we focused on concepts that could be defined and detected with certainty in the CDR. The latter are concepts that can be coded directly based on unambiguous criteria that are generally accepted clinically. The appearance of those codes in the CDR yields high confidence that the condition existed in the patient. After applying these filtering criteria, we obtained 96 disease concepts and 105 laboratory concepts.
Because dual mining is based on a comparison of the incidence of associations across databases, concepts in the databases must be represented consistently so that their associations can be correctly categorized. Data representation in the CDR, which uses aggregate disease classifications such as ICD-9, differs from that in MEDLINE. The latter codes articles with MeSH, a detailed biomedical vocabulary developed for the medical research literature. To reconcile this difference in data representation, we obtained the MEDLINE database (including citations, abstracts, and MeSH encoding) from the National Library of Medicine [18] and used the Unified Medical Language System [19] to map MeSH and free-text medical terminology (contents of titles and abstracts) to ICD diagnosis codes in the CDR.
Concepts in the CDR were generated from ICD codes (disease concepts), the presence or absence of laboratory tests, and laboratory results (separately classified as abnormal, elevated, and depressed). The laboratory result concepts were based on reference ranges stored with the test result in the CDR. To detect the 96 disease and 105 laboratory concepts in MEDLINE, we used two complementary approaches. In one approach, the textual description of the CDR ICD codes and the laboratory test concepts was used to identify corresponding text strings in MEDLINE titles, abstracts, and MeSH terms. In the second approach, the automatic term mapping capability of ReleMed, a publicly accessible search engine for MEDLINE (www.relemed.com), was used to dynamically generate additional term mappings at runtime [20]. ReleMed evaluates both the presence of and relationships between query terms in MEDLINE records.
We constructed all possible pairs of one disease concept and one laboratory test concept, resulting in 10,080 patterns (96 disease concepts × 105 laboratory concepts). We additionally constructed a set of patterns containing each pair of diagnosis and laboratory concepts in gender (male, female) and race (black, white) subsets, for an additional 10,080 × 4 = 40,320 patterns. In total, we constructed 50,400 patterns containing associations of two to four concepts. Scripts written in Perl (www.perl.org) were used to scan the CDR database and MEDLINE citations for instances of these patterns. In the CDR, patterns were constrained to within patient visits. In MEDLINE, patterns were constrained to within articles (title, abstract, and full-text when available). In total, we scanned 27.5 million tests and diagnosis codes in the CDR (containing data from 9.4 million visits from 1993 to 2005) and 15.7 million MEDLINE citations.
Fig. 3 shows the correspondence between the association scores estimated in the CDR and MEDLINE for each pattern. Patterns with the 100 highest surprise scores are shown as larger circles. Note that some points are not in the top 100 even though they appear to be farther away from the diagonal line when compared with some circles. These patterns have scores with larger variances, making their surprise scores less significant [17]. As a preliminary evaluation of the ability of the surprise score to prune uninteresting patterns, we built a list of top patterns ranked according to the strength of their association in the CDR and compared it with a list of patterns ranked by surprise score. One would expect most strong associations in a clinical database to be previously described and thus not interesting. Consistent with this notion, 99% of patterns with strong associations in the database were eliminated by using the surprise score. The remaining 1% of associations may be of interest for more detailed follow-up. Table 1 shows a listing of ten representative patterns deemed interesting/surprising by the automated dual mining algorithm.

Fig. 3. Pairs of associations mined from the University of Virginia CDR database (x-axis) and MEDLINE knowledge base (y-axis). The diagonal line represents uninterestingness as in Fig. 2. The points appear non-homogeneously distributed because a weighted normalization procedure was used. The graph depicts more than 50,000 data points represented as tiny dots. The 100 patterns with the largest surprise scores are shown as larger circles in the upper left and lower right corners of the graph. (Adapted from Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13; with permission.)

Table 1
Sample interesting mined patterns

SS rank  Sex and race  Disease concept                Laboratory concept     rDBi   rKBi
5        Fw            Nephritis                      Hypercapnia            0.571  0.95
7        Tot           Secondary hyperparathyroidism  Hypophosph(or/at)emia  0.77   0.64
33       Fw            Ventricular fibrillation       Low serum albumin      0.615  0.859
45       Fw            Ventricular fibrillation       Anemia                 0.564  0.772
83       Fb            Apnea                          High serum albumin     0.605  0.848
88       Fb            Ventricular fibrillation       Thrombocytopenia       0.626  0.74
91       Fw            Ventricular tachycardia        Anemia                 0.521  0.785
92       Fw            Sleep apnea                    Thrombocytopenia       0.443  0.96
93       Fw            Glomerulonephritis             Hypercapnia            0.553  0.867
99       Fw            Ventricular tachycardia        High serum albumin     0.543  0.884

Abbreviations: Fw, female white; Fb, female black; rDBi, strength of association of the sex, race, disease, and laboratory concepts in the database (CDR); rKBi, strength of association of the concepts in the knowledge base (MEDLINE); SS rank, rank based on surprise score; Tot, all sex and race groups combined.
Data from Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13.

Although the surprise score indicates that these patterns occur at unexpected incidences in the database when compared with the knowledge base, further work is required to determine the extent to which they identify meaningful clinical associations or are useful in creating research hypotheses.
Summary

Data mining is often performed against data that were originally collected into multiple databases. In many cases, it is appropriate to integrate these databases for standard data mining using various data transformation and data fusion approaches, or to create a federated database across multiple contributing systems; however, multiple databases may provide multiple useful perspectives on a biomedical problem. Typical integrative approaches may lose the unique perspectives inherent in separate databases. Under appropriate conditions, these differing perspectives can be leveraged using multi-database mining techniques to yield valuable insights into a data set. The authors' dual mining approach is an example of this potential, in which the differing perspectives of related databases are used to identify interesting association patterns within the databases.
References
[1] Lussier YA, Liu Y. Computational approaches to phenotyping: high-throughput phenomics. Proc Am Thorac Soc 2007;4(1):18–25.
[2] Sax U, Schmidt S. Integration of genomic data in electronic health records: opportunities and dilemmas. Methods Inf Med 2005;44(4):546–50.
[3] Giardine B, Riemer C, Hefferon T, et al. PhenCode: connecting ENCODE data with mutations and phenotype. Hum Mutat 2007;28(6):554–62.
[4] Baumgartner C, Matyas G, Steinmann B, et al. A bioinformatics framework for genotype-phenotype correlation in humans with Marfan syndrome caused by FBN1 gene mutations. J Biomed Inform 2006;39(2):171–83.
[5] Limviphuvadh V, Tanaka S, Goto S, et al. The commonality of protein interaction networks determined in neurodegenerative disorders (NDDs). Bioinformatics 2007;23(16):2129–38. Available at: http://bioinformatics.oxfordjournals.org/cgi/reprint/btm307v1. Accessed July 1, 2007.
[6] Wald L. Definitions and terms of reference in data fusion. International Archives of Photogrammetry and Remote Sensing 1999;32(Part 7):651–4.
[7] Carvalho HS, Heinzelman WB, Murphy AL, et al. A general data fusion architecture. Proceedings of the Sixth International Conference of Information Fusion. Cairns, Australia, June 2003;2:1465–72.
[8] Raza M, Gondal I, Green D, et al. Fusion of FNA-cytology and gene-expression data using Dempster-Shafer theory of evidence to predict breast cancer tumors. Bioinformation 2006;1(5):170–5.
[9] Kerschberg L. Knowledge management in heterogeneous data warehouse environments. In: Kambayashi Y, et al, editors. Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery, September 5–7, 2001, Munich, Germany. Lecture Notes in Computer Science 2001;2114:1–10.
[10] Muilu J, Peltonen L, Litton J. The federated database: a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet 2007;15(7):718–23.
[11] Zhang S, Wu X, Zhang C. Multi-database mining. IEEE Computational Intelligence Bulletin 2003;2(1):5–13.
[12] Liu H, Lu H, Yao J. Toward multidatabase mining: identifying relevant databases. IEEE Transactions on Knowledge and Data Engineering 2001;13(4):541–53.
[13] Zhang S, Zhang C, Wu X. Knowledge discovery in multiple databases. New York: Springer; 2004.
[14] Ye N. The handbook of data mining. Mahwah (NJ): Lawrence Erlbaum Associates; 2003.
[15] Mitra S, Pal SK, Mitra P. Data mining in soft computing framework: a survey. IEEE Transactions on Neural Networks 2002;13:3–14.
[16] Silberschatz A, Tuzhilin A. What makes patterns interesting in knowledge discovery systems. IEEE Transactions on Knowledge and Data Engineering 1996;8(6):970–4.
[17] Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13.
[18] To lease MEDLINE/PubMed and other NLM databases. Available at: http://www.nlm.nih.gov/databases/license/license.html. Accessed July 1, 2007.
[19] Unified Medical Language System (UMLS). Available at: http://www.nlm.nih.gov/research/umls/. Accessed July 1, 2007.
[20] Siadaty MS, Shu J, Knaus WA. ReleMed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Med Inform Decis Mak 2007;7(1):1.

Clin Lab Med 28 (2008) 83–100

Temporal Data Mining

Andrew R. Post, MD, PhD a,*, James H. Harrison, Jr, MD, PhD b

a Division of Clinical Informatics, Department of Public Health Sciences, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908-0717, USA
b Division of Clinical Informatics, Department of Public Health Sciences and Pathology, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA

Clinical data repositories provide a population-based view of phenotypic responses associated with disease presentation and progression and response to therapy. Much of the clinical data in these databases, and particularly laboratory data, are in the form of time-stamped data elements whose temporal relationships may carry significant clinical meaning [1]. For example, the recent redefinition of myocardial infarct includes an increase followed by a decrease in cardiac injury markers according to a characteristic time course, in conjunction with EKG changes and symptoms [2]. The diagnosis of viral hepatitis requires evaluation of temporal relationships in the expression of multiple antibodies and viral antigens [3]. Depending on its temporal relationship with other data, an elevated blood digoxin level could mean overtreatment, but it could also be a signal of declining renal function (progression of disease), an inappropriately timed blood draw or drug dose (a medical process problem), or a drug-drug interaction with quinidine (effectively another medical process problem). The particular arrangement of data elements in time thus conveys clinical meaning that would be lost if the same set of values were randomly ordered. For this reason, most clinical databases can be considered time sequence databases, and data mining methods must take these temporal relationships into account to yield clinically meaningful results.
Conventional data mining methods treat time sequences as unrelated aggregates of individual data elements.

* Corresponding author.
E-mail address: arp4m@virginia.edu (A.R. Post).
0272-2712/08/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.005
labmed.theclinics.com

Although these techniques can identify associations between individual data values, they cannot identify frequently recurring subsequences with a particular profile of values (eg, trends) or detect sequences with similar profiles that occur over different time scales or have different baseline values. This limits the applicability of data mining methods in clinical laboratories and other settings in which time sequences are important. To make the features of time sequences accessible to data mining, methods are needed that treat time sequences as distinct entities and identify related subsequences within longer time sequences. Temporal data mining seeks to extend conventional data mining methods to incorporate recognition of these temporal features.
Representation of time in clinical information

Temporal relationships are inherent in the accurate expression of clinical histories, therapeutic procedures, and therapeutic outcomes. The importance of time representation in electronic clinical records was explicitly addressed more than 30 years ago in the Time-Oriented Database (TOD) model, which expresses the time of occurrence of clinical observations and events with a timestamp [4]. Standard relational databases support this model, and Structured Query Language (SQL) [5] allows basic comparisons of these timestamps in database queries. The TOD model does not identify time sequences as distinct objects, nor does it support segmentation of time sequences into subsequences, thus limiting its ability to support querying for specific profiles within time series.
An alternative representation of time expresses time sequences as intervals with start and end times, over which a feature or state holds. This representation is believed by many to be more compatible than pure timestamps with typical clinical reasoning about time [1,6]. Intervals may contain a sequence of time-stamped data elements and thus represent an interpretation of those data elements (Fig. 1), or they may represent a period of risk or a disease state that is inferred from a sequence of data elements but is outside of the temporal span of those elements. Extensions to SQL have been proposed that support creation and storage of interval data types that can represent, for example, disease and patient care processes [7]. These extensions also support query for temporal relationships involving the endpoints of intervals or individual timestamps, with relationships specified using a temporal reasoning language based on, for example, Allen's 13 interval relationships (eg, before, after; Fig. 2) [8], or Dechter and colleagues' [9] system of numerical relationships (eg, before by at least 2 days). Efforts to incorporate these changes into the SQL standard have thus far been unsuccessful [10].
The proposed SQL extensions do not provide explicit support for computing intervals from data subsequences, but separate development efforts specifically targeting clinical data have produced sophisticated temporal processing systems for relational databases that provide this capability [11–14].

Fig. 1. State intervals of digoxin drug levels. A patient's serum digoxin measurements were used to infer periods of decreasing and increasing values. A period of normal levels can be inferred from the first eight values because they are all within the normal range (between the thin horizontal lines). The last three values are above the upper limit of normal and thus indicate that the patient had toxic levels during that time period. Intervals like these can be computed by temporal abstraction (see Fig. 4).

These systems incorporate time sequence processing that allows recognition of intervals representing, for example, states and trends within sequences of clinical data, and can allow recognition of combinations of those intervals across multiple data types [15,16]. The scope of application of these interval identification systems has been limited, however, because the expert knowledge required to specify detailed criteria for clinically relevant states, trends, and other patterns is difficult to marshal [6].
Both within and outside of medicine, there has been a growing recognition of the need for specialized techniques for mining temporal data [17] to identify anomalies, detect change, and predict future events [18]. These techniques are relatively well developed in the business and engineering communities [19], and have been used in a wide range of applications, from the discovery of patterns in stock market data [20] to the identification of anomalies in space shuttle telemetry [21].
Fig. 2. Temporal relationships between intervals defined in Allen's temporal logic [8]. There are seven basic relationships and their inverses (not shown). In the case of Equals, the basic and inverse relationships are identical.
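For readers who prefer code to diagrams, the sketch below classifies the basic Allen relationship between two intervals represented as (start, end) pairs; it is a simplified illustration of Fig. 2, and the inverse relationships are obtained by swapping the arguments.

    # A minimal sketch that returns the basic Allen relationship of interval a
    # with respect to interval b. Only the seven basic relationships of Fig. 2
    # are handled directly; inverses are obtained by swapping the arguments.

    def allen_relation(a, b):
        a_start, a_end = a
        b_start, b_end = b
        if a_end < b_start:
            return "before"
        if a_end == b_start:
            return "meets"
        if a_start == b_start and a_end == b_end:
            return "equals"
        if a_start == b_start and a_end < b_end:
            return "starts"
        if a_end == b_end and a_start > b_start:
            return "finishes"
        if a_start > b_start and a_end < b_end:
            return "during"
        if a_start < b_start and b_start < a_end < b_end:
            return "overlaps"
        return "inverse relation (swap the arguments)"

    # Hypothetical intervals, eg, a period of low platelets versus high LDH.
    print(allen_relation((2, 5), (4, 9)))   # overlaps
    print(allen_relation((0, 3), (3, 6)))   # meets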

The use of temporal data mining with clinical time sequences has emerged more recently [22] because increasing volumes of patient data are being stored electronically [23]. Some kinds of clinical data (electrocardiogram [21] and intensive care unit [24] data, in particular) have been found to have properties resembling these business and engineering data sources and are thus amenable to similar types of analyses.
Temporal data mining techniques in the medical domain [18] are generally designed either for exploration or for prediction. Exploratory methods involve processing a database to identify groups of time series with similar combinations of frequent intervals (see Fig. 1) and temporal relationships (see Fig. 2). Clinical domain knowledge may be applied to these groups or clusters to determine if they represent useful or previously unknown relationships between data types. Predictive techniques may target a diagnosis, therapeutic response, or other clinical or patient care process, and search for combinations of intervals that frequently occur with some temporal relationship to the target. These methods vary in the temporal features of time sequences that they incorporate into pattern discovery and the ease with which they can be configured for use with a particular data set. All of them also depend on effective techniques for measuring the similarity of time series.
Time series similarity measures

The distance metrics used by conventional data mining techniques to determine similarity between data elements can be applied to detecting similarity in some clinical time series. The most commonly used measure for this purpose is the Euclidean distance (Table 1) [25], which computes the square root of the sum-squared differences between sequential pairs of points in two series. Euclidean distance requires each time series to have the same number of values, or new values must be interpolated into the time series to equalize their lengths. Although interpolation preserves the general profile of a time sequence, it distorts the speed with which the profile changes value (ie, the local slope of the graph) and assumes that additional values are accurately predictable from existing data. Euclidean distance is not sensitive to data order: shuffling the elements of two time sequences in the same way would not change their Euclidean distance, and the difference between time series with significant variation tends to be underestimated. Euclidean distance also disregards the durations of the time spans between data elements, is sensitive to outliers, and poorly estimates the difference between short time series. Results are thus usually unsatisfactory for clinically relevant comparisons between medical time series.
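A minimal sketch of the Euclidean distance as described above, with linear interpolation used to equalize series lengths, follows; the creatinine values are hypothetical, and timestamps are ignored entirely, which is precisely the limitation noted in the text.

    # A minimal sketch of Euclidean distance between two result series, with
    # linear interpolation used to equalize their lengths. Observation times
    # and spacing are ignored, as discussed above.
    import math

    def resample(values, n):
        """Linearly interpolate a series onto n evenly spaced positions."""
        if len(values) == n:
            return list(values)
        positions = [i * (len(values) - 1) / (n - 1) for i in range(n)]
        out = []
        for p in positions:
            lo = int(math.floor(p))
            hi = min(lo + 1, len(values) - 1)
            frac = p - lo
            out.append(values[lo] * (1 - frac) + values[hi] * frac)
        return out

    def euclidean_distance(series_a, series_b):
        n = max(len(series_a), len(series_b))
        a, b = resample(series_a, n), resample(series_b, n)
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Two hypothetical creatinine series sampled a different number of times.
    print(euclidean_distance([1.0, 1.2, 1.5, 2.0], [1.0, 1.4, 2.1]))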

Table 1
Representative methods for evaluating time series similarity

Euclidean distance
  Strategy: Evaluates the total distance between data elements.
  Application(a): Time series with equal length where data order and spacing are not important.

Pearson correlation distance
  Strategy: Evaluates the similarity between profiles of data sequences.
  Application(a): Time series with equal length where data order is important but spacing is not.

Dynamic time warping
  Strategy: Evaluates similarity after stretching and/or compressing portions of the time axis of one series to achieve best alignment.
  Application(a): Time series with possibly different lengths where the spacing of data and duration of temporal features are not important.

Fourier transform
  Strategy: Decomposes time series into sets of sine waves of varying frequency and amplitude, which are compared.
  Application(a): Time series where the lengths are equal, data are regularly spaced, and temporal features are periodic.

Discrete wavelet transform
  Strategy: Decomposes time series into wavelet coefficients determined at multiple time scales, which are compared.
  Application(a): Time series where the lengths are equal and data are regularly spaced; temporal features may be irregular.

(a) Length refers to the number of elements in the series rather than its total duration.

Several distance measures improve on Euclidean distance by correlating successive values of a pair of time series, thus taking into account the relative order of the data elements. The Pearson correlation distance (see Table 1) [26] recognizes similarities in the shapes of two sequences, as long as those sequences are the same length, and can also capture inverse relationships (an increase in the values of one time sequence with a concurrent decrease in those of another). For data sets in which time sequences all have the same number of data elements, the time spans between data elements are unimportant, and comparisons of entire sequences are of interest, this distance metric can produce satisfactory results. Because routine medical observations are not made with the intent of cross-population comparison, however, medical time series are often composed of varying numbers of data elements that are spaced irregularly and differently across episodes and patients. Traditional metrics such as the Pearson distance may therefore incorrectly classify profiles that are similar but are sampled at different times. If the durations of the time spans between data elements are important, the number of data elements in each sequence is variable, or temporal relationships between subsequences (eg, trends and periodicity) are important, more sophisticated methods are needed to incorporate these temporal relationships into similarity measures.
Dynamic time warping

Dynamic time warping (see Table 1) is a robust distance calculation method developed for speech recognition [27].

It computes the distance between two time series after stretching and compressing subsequences of one of the time series along the time axis until an alignment between the two series is found that minimizes their distance. Time warping works well for computing the similarity of time sequences with different numbers of data elements and irregular data spacing. It does not explicitly take the time spans between data elements into account, however, and thus it is best used when duration constraints are unimportant. In the clinical domain, it has been successfully applied to detection of anomalies in the serum creatinine values of kidney transplant patients [28], demonstrating its ability to find similarities in laboratory test result trends in patients who have different baseline values.
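A minimal sketch of the standard dynamic programming formulation of dynamic time warping is shown below; windowing constraints and path normalization used in production implementations are omitted, and the example series are hypothetical.

    # A minimal sketch of dynamic time warping between two series of possibly
    # different lengths, using the standard dynamic programming recurrence
    # with absolute differences as the local cost.

    def dtw_distance(a, b):
        n, m = len(a), len(b)
        inf = float("inf")
        # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
        cost = [[inf] * (m + 1) for _ in range(n + 1)]
        cost[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                local = abs(a[i - 1] - b[j - 1])
                cost[i][j] = local + min(cost[i - 1][j],      # step in a only
                                         cost[i][j - 1],      # step in b only
                                         cost[i - 1][j - 1])  # step in both
        return cost[n][m]

    # Two hypothetical creatinine trends sampled at different densities.
    print(dtw_distance([1.0, 1.1, 1.4, 2.0, 2.6], [1.0, 1.5, 2.6]))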
Transform-based methods

Transform-based techniques apply mathematic functions to time series that output an unordered list of values encapsulating the temporal characteristics of the data. The output of these functions can generally be input into a reverse transform that reproduces the original time sequence. Parameters for these functions control the degree to which the output faithfully reproduces every aspect of the original time sequence, and can be used to smooth noisy time sequences and reduce the impact of outliers on the output. The time spans between data elements are incorporated into these functions, unlike the distance metrics above. Once these lists of shape characteristics are assembled for each time series in the database, they can be compared for similarity using conventional distance measures such as those described earlier.
The Fourier transform (see Table 1) [29] characterizes the shape of a time sequence as a collection of sinusoidal waves with different amplitudes and phases as a function of frequency, the sum or integral of which reproduces the original time sequence. It assumes that the entire length of a time sequence can be represented by the same set of wave functions, an assumption that may not hold for time sequences reflecting multiple clinical processes occurring sequentially. Fourier transform techniques also assume regularly spaced data elements and require the number of data elements in each time series to be a power of two. These requirements often necessitate padding the data, which can be achieved through interpolation. The Fourier transform is best suited to characterizing data sequences that are periodic, such as diurnal variation in serum glucose levels [30].
A similar approach called the discrete wavelet transform (see Table 1) [31] localizes the wave functions describing a time sequence to periods of time, thus relaxing the Fourier transform's assumption of a single description of the entire time series, and supports repeating the transform at multiple time scales. A member of a class of wave functions called a wavelet, chosen empirically, is stretched and repeated across a time sequence at one or more time scales, transforming the sequence into a set of wavelet coefficients. The coefficients represent progressively finer levels of detail associated with their corresponding time scales. The wavelet transform can thus represent the general shape of a time sequence and its fine structure, effectively allowing for zooming in and out of the sequence's temporal features. It, like the Fourier transform, assumes regularly spaced data in which the number of data elements is a power of two, and interpolation can be used to satisfy these requirements. Because the functional descriptions produced by the wavelet transform are localized in time, this technique is applicable to detection of nonperiodic patterns in clinical data sequences. Clinical application areas include early detection of hemodynamic deterioration as measured by multiple physiologic variables in ICU patients [25] and early detection of infectious disease outbreaks by monitoring for spikes in emergency department chief complaints [32].
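The sketch below illustrates the wavelet idea with a hand-rolled single-wavelet (Haar) transform and a simple coefficient-based comparison; real applications would typically use a wavelet library and more decomposition levels, and the assumption that the series length is a power of two (here, eight regularly spaced glucose values) follows the requirement described above.

    # A minimal sketch of a Haar discrete wavelet transform and a coefficient-
    # based comparison of two series. The series length is assumed to be
    # a power of two, with interpolation or padding applied beforehand.
    import math

    def haar_step(values):
        """One level of the Haar transform: pairwise averages (approximation)
        and pairwise differences (detail), scaled to preserve energy."""
        approx = [(values[i] + values[i + 1]) / math.sqrt(2)
                  for i in range(0, len(values), 2)]
        detail = [(values[i] - values[i + 1]) / math.sqrt(2)
                  for i in range(0, len(values), 2)]
        return approx, detail

    def haar_features(values, levels=2):
        """Concatenate detail coefficients from several scales plus the final
        approximation, yielding a fixed-length shape description."""
        features, current = [], list(values)
        for _ in range(levels):
            current, detail = haar_step(current)
            features.extend(detail)
        return features + current

    def feature_distance(a, b, levels=2):
        fa, fb = haar_features(a, levels), haar_features(b, levels)
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(fa, fb)))

    # Hypothetical glucose values sampled 8 times per day on two days.
    glucose_day1 = [5.1, 6.8, 7.9, 6.0, 5.2, 6.9, 8.1, 6.1]
    glucose_day2 = [5.0, 6.7, 8.0, 6.1, 5.1, 7.0, 8.0, 6.0]
    print(feature_distance(glucose_day1, glucose_day2))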
Although the transform-based methods feature significantly improved temporal pattern recognition over plain distance measures, including the ability to identify periodic patterns and similarities between complex profiles, they suffer from the need to prespecify parameters (eg, the wavelet function for the wavelet transform) that may be unintuitive for clinical experts, and, like basic similarity measures and dynamic time warping, they compare the similarity of entire time sequences. In practice, however, interesting clinical features are most often expressed as characteristic subsequences within longer sequences of data. In clinical settings, comparison of entire time sequences is often undesirable, and methods are needed to split time sequences into shorter, clinically meaningful subsequences.

Subsequencing methods

Sliding window methods

Time series may be directly divided into subsequences by scanning a sliding window that views a fixed number of data elements across each sequence, creating the set of all possible subsequences of that length (Table 2). The subsequences overlap, and thus every data element is analyzed in its local temporal context. The sliding window method is illustrated in Fig. 3. The position of a subsequence within a sequence is lost in this technique, thus it is primarily useful for identifying common motifs or combinations of motifs across a collection of time sequence data for which the specific location of a motif within each sequence does not matter. Also, because the size of the sliding window is defined as a number of data elements rather than a duration, the sliding window approach is not appropriate when the duration of a motif is a critical feature, particularly if the time sequences are not regularly spaced or have missing values.
As compared with transform-based methods, the window size parameter is relatively intuitive, and the approach allows for clustering time series based on specific temporal features that can be contained by the window, even if the time series differ significantly in other respects. The window size can markedly affect what temporal features are found, however, and empirical testing may be required to pick the best window size. Also complicating the use of this technique is that time sequences from the same source tend to have spans with many similar, overlapping subsequences, which may dilute more distant and interesting subsequences that share similar features. Attempts to cluster time series data with sliding window methods thus may result in meaningless clusters if the specified temporal features of interest are too general [33].
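A minimal sketch of sliding window subsequencing, followed by a crude grouping of windows by the sign of their overall change in the spirit of Fig. 3, is shown below; the platelet counts are hypothetical.

    # A minimal sketch of sliding window subsequencing. The window width is
    # a count of data elements, so the durations spanned by different windows
    # may differ, as noted above.

    def sliding_windows(values, width=3):
        """All overlapping subsequences containing `width` consecutive values."""
        return [values[i:i + width] for i in range(len(values) - width + 1)]

    def label_trend(window):
        change = window[-1] - window[0]
        if change > 0:
            return "increasing"
        if change < 0:
            return "decreasing"
        return "flat"

    platelets = [210, 160, 120, 90, 110, 150]   # hypothetical counts, x 10^3/uL
    for w in sliding_windows(platelets, width=3):
        print(w, label_trend(w))
    # The first three windows are decreasing and the last is increasing,
    # analogous to clusters A and B in Fig. 3.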

Table 2
Selected methods for subsequencing time series

Sliding window
  Strategy: Breaks sequences into all possible subsequences of a specified length(a).
  Application: Time series with discrete temporal features of uniform duration, where location in the larger sequence is unimportant.

Segmentation
  Strategy: Breaks sequences into adjacent subsequences based on trend patterns and inflection points.
  Application: Time series with high-frequency, regularly spaced data that accurately represent temporal features.

Temporal abstraction
  Strategy: Specifically identifies subsequences of interest based on defined temporal and mathematic relationships between data elements.
  Application: Time series with features of interest that can be predefined; features may be complex and/or multivariate.

(a) Length refers to the number of elements in the series rather than its total duration.


Fig. 3. Illustration of a sliding window of length 3 scanning a time sequence of platelet counts. Subsequences are indicated by dashed rectangles and are labeled by number in the order in which they are processed. The first three scanned subsequences have negative slopes, and the last two subsequences have positive slopes. A data mining algorithm might cluster the first three and last two subsequences into two clusters representing decreasing (A) and increasing (B) values.

Segmentation

The problems caused by overlapping subsequences and a fixed sliding window size can be solved by partitioning or segmenting a time series using statistical techniques into nonoverlapping subsequences with particular trend characteristics (see Table 2). Segmentation algorithms (for a brief review, see Ref. [34]) find inflection points in the data at which the data change direction (eg, from increasing to decreasing based on the deviation from a regression line calculated across recent values). Segments do not overlap, and thus segmentation algorithms are less vulnerable to detection of trivially similar motifs than sliding window algorithms. Segmentation algorithms may allow a frequency threshold to be set below which segments are discarded, or they may scan for infrequent subsequences [21] that might reflect anomalies indicating adverse events or patient care processes that deviate from the standard of care.
Basic segmentation algorithms assume high-frequency, regularly spaced data, and have been applied successfully to summarizing physiologic variables in ICU data [24] and clustering RR intervals in ECG tracings [22]. They are sensitive to missing data, however, and prone to error for data sampled at low frequency or irregular intervals. A variant of segmentation has been proposed that performs iterative smoothing (eg, by the wavelet transform described earlier) of a time sequence, finds segments in each smoothed profile, and discards segments that are found in less than a prespecified number or percentage of smoothing cycles [35,36]. This approach allows generally similar profiles to be recognized as such, even if data are spaced differently across them, or if there are outliers. An alternative technique weights segments according to the number of data values they span and discards segments below a prespecified threshold length, under the assumption that the slopes of very short segments are more sensitive to outliers and missing data and less likely to reflect meaningful patterns [22].
Temporal abstraction

The segmentation methods described earlier use statistical methods for identifying temporal features within time series subsequences and are typically limited to trend detection. In the 1990s, investigators in artificial intelligence proposed software layers built on top of clinical databases that use expert knowledge to infer clinical states or processes implicit in clinical time series data and represent them as abstractions that specify an interval of time over which a state or process exists (see Fig. 1) [37].

These temporal abstractions may be inferred from raw time-stamped data elements based on prespecified mathematic relationships (eg, states, trends) in the data, yielding low-level abstractions, or from prespecified combinations of previously inferred abstractions based on temporal relationships between their intervals, yielding high-level abstractions. Examples of abstractions are shown in Fig. 4. Knowledge elicitation techniques have been developed to facilitate the process of encoding the expert knowledge defining abstractions in a computable form [38]. Temporal abstraction has been developed primarily in the medical domain, and has been applied successfully to pattern detection in laboratory test results [16,30,39] and children's growth [40,41] for decision support in inpatient and outpatient settings.
Fig. 4. Example time series of platelet (PLT) counts in HELLP (Hemolysis, Elevated Liver enzymes, Low Platelets) syndrome, and intervals identified by temporal abstraction. HELLP is a dangerous complication of pregnancy that appears during the latter part of the third trimester or after childbirth [50]. HELLP syndrome has been defined as pre-eclampsia with PLT less than 100,000/µL, lactate dehydrogenase (LDH) greater than 600 U/L, and aspartate aminotransferase (AST) greater than 70 U/L, and increasing PLT indicates recovery [51,52]. Subsequences of PLT, LDH, and AST values that satisfy these thresholds are labeled as PLT_Low, LDH_High, and AST_High intervals, respectively. Overlapping subsequences of low PLT, high LDH, and high AST are labeled as Lab_HELLP intervals, subsequences of increasing PLT count are labeled as PLT_Increasing intervals, and subsequences of increasing PLT count that begin at or after the start of a Lab_HELLP interval are labeled as Recovering intervals.

A recent extension to temporal abstraction adapts the technique for querying retrospective clinical data repositories for specified temporal patterns and relationships, returning populations of patients who display those patterns [42].
Low-level abstractions can resemble the trend-related subsequences returned by segmentation algorithms, and such abstraction-based segmentation (see Table 2) has been applied to analysis of physiologic data collected during hemodialysis [36]. In addition, low-level abstractions may specify and identify subsequences that satisfy independently defined and data-type-specific constraints in data values, such as whether laboratory test results are within the test's reference range (see Fig. 1). High-level abstractions can represent relationships between multiple subsequences in time series within and across data types that are useful for identifying complex states, such as disease severity and progression or response to therapy, and are particularly useful for this purpose when those states are not represented in standard coding schemes (eg, ICD-9-CM), are poorly expressed by codes, or may be inaccurately or incompletely coded in clinical databases [42]. Existing temporal abstraction methods require prespecification of high-level abstractions, which limits their applicability to clinical domains in which sufficient knowledge exists to define interesting interval relationships. Data mining techniques are available that can learn frequent interval associations that are analogous to high-level abstractions, and feed those associations back into the data mining process (see later discussion).
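A minimal sketch of low-level temporal abstraction is shown below: consecutive time-stamped results that satisfy a predicate are merged into a state interval, as in the PLT_Low intervals of Fig. 4, and two such intervals are then combined into a simple higher-level interval when they overlap. The values and the two-criterion combination are hypothetical simplifications of the three-criterion Lab_HELLP definition in the figure.

    # A minimal sketch of low-level temporal abstraction. Timestamps are day
    # numbers and counts are in 10^3/uL; gaps between observations are ignored
    # here, whereas real temporal abstraction systems apply persistence rules.

    def state_intervals(observations, predicate, label):
        """observations: list of (time, value) sorted by time. Returns
        (label, start_time, end_time) for maximal runs in which every
        observation satisfies the predicate."""
        intervals, run = [], []
        for t, v in observations:
            if predicate(v):
                run.append(t)
            elif run:
                intervals.append((label, run[0], run[-1]))
                run = []
        if run:
            intervals.append((label, run[0], run[-1]))
        return intervals

    def overlapping(a, b):
        """True if two (label, start, end) intervals overlap in time."""
        return a[1] <= b[2] and b[1] <= a[2]

    # Hypothetical platelet counts; threshold follows the PLT < 100,000/uL
    # criterion quoted in the Fig. 4 legend.
    plt_counts = [(1, 240), (2, 150), (3, 95), (4, 60), (5, 48), (6, 85), (7, 130)]
    plt_low = state_intervals(plt_counts, lambda v: v < 100, "PLT_Low")
    print(plt_low)   # [('PLT_Low', 3, 6)]

    # A simplified higher-level abstraction combining only two of the three
    # Lab_HELLP criteria: overlap of PLT_Low with a hypothetical LDH_High interval.
    ldh_high = [("LDH_High", 2, 5)]
    hellp = [("Lab_HELLP", max(p[1], l[1]), min(p[2], l[2]))
             for p in plt_low for l in ldh_high if overlapping(p, l)]
    print(hellp)     # [('Lab_HELLP', 3, 5)]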
The low- and high-level intervals of temporal abstraction have the potential to be used in data mining, although we have been unable to find any examples of such an application. They have several advantages over intervals produced by automated segmentation. Using expert knowledge, detection algorithms can be designed that recognize specific patterns of clinical interest and identify outliers or unusual states that are clinically important, while ignoring uninteresting noise. Time sequences can be processed in multiple, mathematically distinct ways rather than using a general one-size-fits-all approach for feature identification and smoothing that may not be optimal for all data types. Patterns that are clinically important could overlap within a time series, which is physiologically possible but difficult to handle with automatic segmentation. Finally, the ability to identify patterns that are highly likely to be clinically relevant as a precursor to data mining would reduce the processing requirements for mining and the volume of uninteresting patterns that must be pruned after mining. Temporal abstraction requires expert knowledge to identify the intervals to be incorporated into an interval database for mining, however, and because these intervals are not mined directly from the raw data, unanticipated low-level temporal relationships (ie, motifs) may be missed. Temporal abstraction is most appropriate as an adjunct to data mining when interesting low- and high-level intervals can be specified and the goal is discovering novel associations and predictive relationships between those intervals.

Mining interval databases for temporal patterns

Once interesting subsequences of time series are identified by segmentation, temporal abstraction, or other methods, temporal mining strategies can assemble them into multivariate patterns that may have clinical meaning. Methods have been proposed that identify combinations of intervals that frequently co-occur in a data set, and some of these methods allow for mining combinations of intervals that frequently co-occur with some temporal relationship (see Fig. 2). The former methods are generally called association learning methods, and the latter are called temporal rule learning methods.
Association learning methods use a variant of the Apriori algorithm [43], which is an iterative process that first identifies all single instances of a particular type of interval, finds all frequent combinations, or itemsets, of the first interval type with a second interval type, and then finds all frequent combinations of the first two interval types with a third interval type, and so on (Fig. 5). These aggregations can either be used for discovery of frequently co-occurring intervals, or they can be thought of as predictive rules in which the presence of the interval types in an itemset predicts the presence of the interval type added to the itemset in each iteration of the algorithm.

When used for discovery of frequent combinations of interval types, Apriori has a parameter called the minimum support threshold that defines the minimum number or percentage of time sequences in which an itemset must occur for the itemset to be passed on to the next iteration of the algorithm (see Fig. 5). For mining interval types that may have multiple occurrences within a time sequence, a useful alternative definition of support is the minimum total time duration of all instances of an interval type in a sequence. For rule learning, there is an additional parameter called the minimum confidence threshold, which defines the minimum conditional probability that a predicted interval type will occur within a sequence, given an already-discovered itemset. This parameter is useful for filtering out rules that have low predictive value. Apriori-based association rule learning has been applied to assess the quality of a hemodialysis service: frequent combinations of temporal abstractions in physiologic parameters were mined and used to improve the understanding of the contribution of physiologic factors to the values of a set of quality indicators [36].
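A minimal sketch of Apriori-style itemset discovery over per-patient sets of interval types, using a minimum support count as in Fig. 5, is shown below; the candidate generation is deliberately naive (all combinations of observed items are tested), whereas real implementations prune candidates using the frequent itemsets from the previous iteration, and the records are hypothetical.

    # A minimal sketch of frequent itemset discovery with a minimum support
    # count, in the spirit of Fig. 5. Candidate generation is naive; real
    # Apriori implementations prune candidates using the previous iteration.
    from itertools import combinations

    def frequent_itemsets(records, min_support=3, max_size=3):
        """records: list of sets of interval-type labels, one set per patient."""
        items = {item for record in records for item in record}
        results = {}
        for size in range(1, max_size + 1):
            for candidate in combinations(sorted(items), size):
                support = sum(1 for record in records if set(candidate) <= record)
                if support >= min_support:
                    results[candidate] = support
        return results

    # Hypothetical per-patient sets of abstracted intervals.
    records = [
        {"PLT_Low", "LDH_High", "AST_High"},
        {"PLT_Low", "LDH_High"},
        {"PLT_Low", "LDH_High", "AST_High"},
        {"PLT_Low", "Creatinine_Rising"},
        {"LDH_High", "AST_High", "PLT_Low"},
    ]
    for itemset, support in sorted(frequent_itemsets(records).items()):
        print(itemset, support)
    # Singletons, pairs, and the triplet ('AST_High', 'LDH_High', 'PLT_Low')
    # all meet the support threshold of 3; 'Creatinine_Rising' does not.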
Temporal rule learning extends Apriori by storing temporal relationships between interval types as additional attributes of each itemset [35,44], using a temporal reasoning language (see later discussion) to encode relationships. Many itemsets are created for a given group of interval types, one for each temporal configuration of those interval types that is found in the dataset based on the selected support and confidence parameters.

Fig. 5. Illustration of the Apriori algorithm processing a hypothetical database of clinical findings for frequent associations between findings, with a minimum support threshold of 3 occurrences in the database. In the left table (Items), findings and number of occurrences are listed. In the center table (Pairs), pairs of findings that co-occur in the same patient record are shown. In the right table (Triplets), combinations of three findings that co-occur in the same patient record are shown. Findings and combinations of findings that satisfy the minimum support threshold are highlighted in gray.

Temporal Apriori thus aggregates frequent interval types into frequent higher-level pattern specifications similar to high-level temporal abstractions, effectively discovering high-level abstractions with potential clinical relevance. Temporal rules are similar to the association rules described earlier, but store a temporal relationship between each interval type and the itemset to which it is added, and thus can predict not only whether an interval type will occur within a sequence, but when in that sequence it will occur. This approach has been applied in the hemodialysis domain similarly to the association discovery method described earlier [36].
Although Allen's temporal language (see Fig. 2) has been used previously for expressing temporal relationships in temporal rule learning [35,36], recent theoretic work indicates that distinguishing temporal configurations of interval types based on Allen's relationships may result in arbitrary distinctions between itemsets that are actually very similar. For example, in an appropriate context a "Starts" pattern may differ little from an "Overlaps" pattern with a small overlap at the start (see Fig. 2). An alternative language has been proposed with a smaller set of temporal relationships than Allen's and greater flexibility between the boundaries of each relationship [45]. In an initial evaluation of a temporal Apriori implementation using this language, similar patterns were found as compared with an implementation of temporal Apriori using Allen's language, but with fewer itemsets that more closely matched the expectations of a domain expert.
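The sensitivity of Allen's relations to small boundary shifts can be seen in a short sketch. The Python fragment below classifies a handful of Allen's 13 relations between two intervals; the intervals are invented, and a temporal rule miner would use the full relation set or the more tolerant language of reference [45] rather than this illustration.

    from collections import namedtuple

    Interval = namedtuple("Interval", "start end")

    def allen_relation(a, b):
        """Classify the temporal relation of interval a to interval b
        using a subset of Allen's 13 relations (exact boundary matches)."""
        if a.end < b.start:
            return "before"
        if a.end == b.start:
            return "meets"
        if a.start == b.start and a.end == b.end:
            return "equals"
        if a.start == b.start and a.end < b.end:
            return "starts"
        if a.start > b.start and a.end == b.end:
            return "finishes"
        if a.start > b.start and a.end < b.end:
            return "during"
        if a.start < b.start and b.start < a.end < b.end:
            return "overlaps"
        return "other"

    drug = Interval(start=10, end=20)          # eg, days of a drug exposure
    abnormal = Interval(start=10, end=30)      # eg, an abnormal-result interval
    shifted = Interval(start=11, end=30)

    print(allen_relation(drug, abnormal))   # 'starts'
    print(allen_relation(drug, shifted))    # 'overlaps': a one-day shift changes the relation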
Current status and future directions
Clinical data repositories contain records from millions of patients, and
these patients' data may reflect unexpected responses to therapy, previously
unknown relationships between disease states, and deviations from standard
clinical practice that may be of interest to clinicians and researchers. The patient data stored in these repositories are high dimensional, and many of
the data, such as care encounters, clinical orders, laboratory test results,
diagnostic and procedure codes, and records of therapy, include time
sequences of variable length that are sampled irregularly and at different frequencies. These databases are of great value for better understanding biomedical and operational aspects of health care, but the data they contain are difficult to process using traditional time series analysis and data mining methods. Temporal data mining methods are under development and have been used successfully for analyzing limited subsets of clinical data repositories that are characterized by few data types and high-frequency or regularly spaced timestamps [22,24,25,28,32,36]. These methods have yet to be applied more generally, and implementations thus far have been site specific.
One temporal data mining system has been reported to be in production
use [36], but otherwise these systems have been implemented only in research
environments.
Because clinical repositories contain a broad range of data types with different characteristics, the wider application of temporal data mining in medicine will probably require software systems that aggregate a number of
features currently found in several experimental systems. In particular, systems will likely need to support multiple methods of low-level time sequence
processing that identify intervals of interest. These methods will allow appropriate low-level processing approaches to be applied to different data
types, with the ability to group the intervals detected across a range of
data types into meaningful higher-level relationships. Such systems might
use, for example, statistical time series segmentation for high-frequency
data sets or those with highly characteristic inflection points, and knowledge-based detection of predefined motifs for low-frequency, irregular time sequences with specific duration constraints. Found intervals and interval groups will be stored in an interval database that may be mined for associative or predictive relationships using a variant of the Apriori algorithm.
Recently, the temporal abstraction method has been incorporated into a system that supports pluggable low-level processing algorithms and aggregation of intervals found by those algorithms [42]. This type of system could
support multiple data processing strategies to create a database of intervals
for mining.
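As a sketch of the low-level step, the fragment below applies a simple knowledge-based rule (consecutive results above a threshold, held for a minimum duration) to an irregularly sampled laboratory time sequence and emits rows suitable for a hypothetical interval table. The analyte, threshold, duration constraint, and values are invented for illustration and do not describe any of the cited systems.

    from datetime import datetime, timedelta

    def abnormal_intervals(observations, threshold, min_duration):
        """Group consecutive above-threshold results into intervals and keep
        those lasting at least min_duration. observations: [(timestamp, value)]."""
        intervals, run = [], []
        for ts, value in sorted(observations):
            if value > threshold:
                run.append(ts)
            else:
                if run and run[-1] - run[0] >= min_duration:
                    intervals.append((run[0], run[-1], "above_threshold"))
                run = []
        if run and run[-1] - run[0] >= min_duration:
            intervals.append((run[0], run[-1], "above_threshold"))
        return intervals

    # Invented creatinine results (mg/dL), irregularly spaced.
    obs = [
        (datetime(2007, 3, 1), 1.0),
        (datetime(2007, 3, 4), 2.1),
        (datetime(2007, 3, 9), 2.4),
        (datetime(2007, 3, 15), 1.1),
    ]
    for start, end, label in abnormal_intervals(obs, threshold=2.0,
                                                min_duration=timedelta(days=3)):
        print(start.date(), end.date(), label)   # rows for an interval database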
Adapting interval and relationship mining algorithms for use with large
clinical databases requires improving their abilities to find unexpected patterns and optimizing their performance. Although existing algorithms are designed to find the most frequent patterns in a data set, unexpected patterns are likely to be frequent only within relatively small patient subsets (eg, a small group of patients all with a relationship between drug administration and laboratory test result intervals reflecting an unexpected response
to a drug). Existing algorithms could be extended with methods from the
information retrieval [46] literature that weight infrequent, interesting words
in documents higher than words that are frequent, where documents are patients, and words are temporal patterns found in the patients' data [25]. Alternatively, found patterns could be filtered against a database of trivial and already-known temporal features that is populated with a combination of expert knowledge, literature values, and mined high-frequency patterns found off-line by existing algorithms, analogous to the co-mining approach
[47] described in the article by Siadaty and Harrison, elsewhere in this issue.
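The borrowed information-retrieval idea is essentially inverse document frequency: a temporal pattern found in only a few patients receives a higher weight than one found in most of them. The following sketch illustrates that weighting over hypothetical mined patterns; it is a generic illustration rather than the specific method of reference [25].

    import math
    from collections import defaultdict

    def idf_weights(patient_patterns):
        """patient_patterns maps a patient id to the set of temporal patterns
        found in that patient's data. Returns pattern -> IDF-style weight."""
        n_patients = len(patient_patterns)
        doc_freq = defaultdict(int)
        for patterns in patient_patterns.values():
            for p in patterns:
                doc_freq[p] += 1
        # Rare patterns (small document frequency) get the largest weights.
        return {p: math.log(n_patients / df) for p, df in doc_freq.items()}

    # Hypothetical mined patterns per patient.
    patterns = {
        "pt1": {"drug A overlaps high K", "high glucose"},
        "pt2": {"high glucose"},
        "pt3": {"high glucose"},
        "pt4": {"drug A overlaps high K", "high glucose"},
    }
    for pattern, w in sorted(idf_weights(patterns).items(), key=lambda x: -x[1]):
        print(f"{pattern}: {w:.2f}")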
Existing temporal relationship detection algorithms have sufficient performance for mining small, targeted subsets of a clinical data repository
that are reduced in the number of patients or the number of variables,
but are too computationally expensive to apply to an entire repository. Temporal relationship detection requires scanning every pair of previously found
intervals within a patient [16,35]. Improving performance may require preprocessing a repository into subsets, and applying temporal data mining
algorithms separately to each subset. One approach is to divide a data set
along the values of application-specific variables (eg, gender, age ranges)
[48], but appropriate variables have to be chosen carefully for a data mining
task so that interesting patterns involving those variables are not lost. Alternatively, theoretic work in nontemporal data mining has identified a class of "expensive" data mining algorithms for which the data can be preprocessed by a "cheap" algorithm that outputs large but manageable clusters for expensive processing, with a guarantee that similar data will not be separated prematurely into different clusters by the cheap algorithm [49]. Research is
needed to determine whether temporal data mining algorithms can be developed that also have this property.
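The cheap-then-expensive strategy of reference [49] is commonly described as canopy clustering: an inexpensive distance measure places every record into one or more overlapping canopies, and the expensive algorithm is then run only within each canopy. The sketch below, with invented patient summaries and thresholds, illustrates only the canopy step; whether temporal pattern mining can be partitioned this way with the guarantee just described is the open question.

    def canopies(points, cheap_distance, loose, tight):
        """Greedy canopy assignment: loose > tight. Points within the loose
        threshold join a canopy; points within the tight threshold stop
        being candidate canopy centers themselves."""
        remaining = list(points)
        result = []
        while remaining:
            center = remaining.pop(0)
            canopy = [center]
            still_remaining = []
            for p in remaining:
                d = cheap_distance(center, p)
                if d < loose:
                    canopy.append(p)
                if d >= tight:
                    still_remaining.append(p)
            remaining = still_remaining
            result.append(canopy)
        return result

    # Hypothetical patients summarized by a single cheap feature (eg, age).
    patients = [34, 36, 39, 62, 64, 70]
    for c in canopies(patients, cheap_distance=lambda a, b: abs(a - b),
                      loose=10, tight=5):
        print(c)   # each canopy would be mined separately by the expensive step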
Summary
Data mining of clinical time series has evolved over the past decade from
simple algorithms that only consider the order of the data and not their
timestamps, to sophisticated algorithms that aim to discover prediction
rules. Temporal data mining offers the potential for detecting previously unknown combinations of clinical observations and events that reflect novel
patient phenotypes and useful information about care delivery processes,
but clinically relevant patterns of interest may occur in only a small number
of patients. Large-scale processing of clinical data repositories may be
needed to detect these patterns. Temporal data mining algorithms have
thus far been applied to low-dimensional, homogeneous data sets. Although
these experiments have yielded useful information, the major benets of
data mining will come from its application to large-scale, high-dimensional,
heterogeneous data in general clinical repositories. We believe that the potential for combining interval identification methods (temporal abstraction
and statistical time sequence segmentation) with interval mining methods,
and recent research successfully scaling data mining algorithms to large
datasets, suggest an optimistic view that general-purpose temporal data
mining software applicable to clinical data repositories is practical and likely
will become an important tool for clinical research and health care quality
assurance.
References
[1] Shahar Y. Dimensions of time in illness: an objective view. Ann Intern Med 2000;132(1):45–53.
[2] Alpert JS, Thygesen K, Antman E, et al. Myocardial infarction redefined–a consensus document of The Joint European Society of Cardiology/American College of Cardiology Committee for the redefinition of myocardial infarction. J Am Coll Cardiol 2000;36(3):959–69.
[3] Imperial JC. Natural history of chronic hepatitis B and C. J Gastroenterol Hepatol 1999;14(Suppl):S15.
[4] Wiederhold G, Fries JF. Structured organization of clinical data bases. Proceedings of the American Federation of Information Processing Societies National Computer Conference (AFIPS) 1975;44:479–85.
[5] Elmasri R, Navathe SB. Fundamentals of database systems. 3rd edition. New York: Addison-Wesley; 2000.
[6] Keravnou ET, Shahar Y. Temporal reasoning in medicine. In: Fisher M, Gabbay D, Vila L, editors. Handbook of temporal reasoning in artificial intelligence. New York: Elsevier; 2005. p. 587–653.
[7] Snodgrass R, Bohlen MH, Jensen CS, et al. Transitioning temporal support in TSQL2 to SQL3. In: Etzion O, Jajodia S, Sripada S, editors. Temporal databases: research and practice, vol. 1399. Berlin: Springer; 1998. p. 150–94.
[8] Allen JF. Maintaining knowledge about temporal intervals. Commun ACM 1983;26(11):832–43.
[9] Dechter R, Meiri I, Pearl J. Temporal constraint networks. Artif Intell 1991;49:61–95.
[10] Adlassnig KP, Combi C, Das AK, et al. Temporal representation and reasoning in medicine: research directions and challenges. Artif Intell Med 2006;38(2):101–13.
[11] Dorda W, Gall W, Duftschmid G. Clinical data retrieval: 25 years of temporal query management at the University of Vienna Medical School. Methods Inf Med 2002;41(2):89–97.
[12] Nigrin DJ, Kohane IS. Temporal expressiveness in querying a time-stamp-based clinical database. J Am Med Inform Assoc 2000;7(2):152–63.
[13] O'Connor MJ, Tu SW, Musen MA. The Chronus II temporal database mediator. Proc AMIA Symp 2002;567–71.
[14] Spokoiny A, Shahar Y. A knowledge-based time-oriented active database approach for intelligent abstraction, querying and continuous monitoring of clinical data. Medinfo 2004;11(Pt 1):84–8.
[15] O'Connor MJ, Grosso WE, Tu SW, et al. RASTA: a distributed temporal abstraction system to facilitate knowledge-driven monitoring of clinical databases. Medinfo 2001;10(Pt 1):508–12.
[16] Shahar Y. A framework for knowledge-based temporal abstraction. Artif Intell 1997;90:79–133.
[17] Roddick JF, Spiliopoulou M. A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering 2002;14(4):750–67.
[18] Roddick JF, Fule P, Warwick JG. Exploratory medical knowledge discovery: experiences and issues. SIGKDD Explor Newsl 2003;5(1):94–9.
[19] Antunes CM, Oliveira AL. Temporal data mining: an overview. Paper presented at the Knowledge Discovery and Data Mining Workshop on Temporal Data Mining (KDD '01). San Francisco (CA); August 26–29, 2001.
[20] Fu T-c, Chung F-l, Luk R, et al. Preventing meaningless stock time series pattern discovery by changing perceptually important point detection. Fuzzy Systems and Knowledge Discovery 2005;1171–4.
[21] Keogh E, Lin J, Fu A. HOT SAX: finding the most unusual time series subsequence: algorithms and applications. Paper presented at the 5th IEEE International Conference on Data Mining. New Orleans (LA); November 27–30, 2005.
[22] Keogh E, Pazzani M. An enhanced representation of time series which allows fast and accurate classification, clustering, and relevance feedback. In: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining. AAAI Press; 1998. p. 239–41.
[23] Haux R. Health information systems–past, present, future. Int J Med Inform 2006;75(3–4):268–81.
[24] Li J, Leong TY. Using linear regression functions to abstract high-frequency data in medicine. Proc AMIA Symp 2000;492–6.
[25] Saeed M, Mark R. A novel method for the efficient retrieval of similar multiparameter physiologic time series using wavelet-based symbolic representations. AMIA Annu Symp Proc 2006;679–83.
[26] Altiparmak F, Ferhatosmanoglu H, Erdal S, et al. Information mining over heterogeneous and high-dimensional time-series data in clinical trials databases. IEEE Trans Inf Technol Biomed 2006;10(2):254–63.
[27] Ratanamahatana CA, Keogh E. Everything you know about dynamic time warping is wrong. Paper presented at the Third Workshop on Mining Temporal and Sequential Data (KDD-2004). Seattle (WA); August 22–25, 2004.
[28] Fritsche L, Schlaefer A, Budde K, et al. Recognition of critical situations from time series of laboratory results by case-based reasoning. J Am Med Inform Assoc 2002;9(5):520–8.
[29] Chatfield C. Analysis of time series. 4th edition. New York: Chapman and Hall; 1989.
[30] Bellazzi R, Larizza C, Riva A. Temporal abstractions for interpreting diabetic patients' monitoring data. Intelligent Data Analysis 1998;2(1–4):97–122.
[31] Graps A. An introduction to wavelets. IEEE Comput Sci Eng 1995;2(2):50–61.
[32] Zhang J, Tsui FC, Wagner MM, et al. Detection of outbreaks from time series data using wavelet transform. Proc AMIA Symp 2003;748–52.
[33] Keogh E, Lin J, Truppel W. Clustering of time series subsequences is meaningless: implications for previous and future research. Paper presented at The Third IEEE International Conference on Data Mining (ICDM '03). Melbourne (FL); November 19–22, 2003.
[34] Hoppner F. Time series abstraction methods–a survey. Paper presented at the GI Jahrestagung. Dortmund, Germany; September 30–October 3, 2002.
[35] Hoppner F. Learning dependencies in multivariate time series. Paper presented at the ECAI'02 Workshop on Knowledge Discovery in (Spatio-)Temporal Data. Lyon, France; July 22–23, 2002.
[36] Bellazzi R, Larizza C, Magni P, et al. Temporal data mining for the quality assessment of hemodialysis services. Artif Intell Med 2005;34(1):25–39.
[37] Stacey M, McGregor C. Temporal abstraction in intelligent clinical data analysis: a survey. Artif Intell Med 2007;39(1):1–24.
[38] Shahar Y, Chen H, Stites DP, et al. Semi-automated entry of clinical temporal-abstraction knowledge. J Am Med Inform Assoc 1999;6(6):494–511.
[39] Larizza C, Moglia A, Stefanelli M. M-HTP: a system for monitoring heart transplant patients. Artif Intell Med 1992;4:111–26.
[40] Kuilboer MM, Shahar Y, Wilson DM, et al. Knowledge reuse: temporal-abstraction mechanisms for the assessment of children's growth. Proc Annu Symp Comput Appl Med Care 1993;449–53.
[41] Kohane IS, Haimowitz IJ. Hypothesis-driven data abstraction with trend templates. Proc Annu Symp Comput Appl Med Care 1993;444–8.
[42] Post AR, Harrison JH Jr. PROTEMPA: a method for specifying and identifying temporal sequences in retrospective data for patient selection. J Am Med Inform Assoc 2007;14(5):674–83.
[43] Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Paper presented at the 20th International Conference on Very Large Data Bases (VLDB). Santiago de Chile, Chile; September 12–15, 1994.
[44] Morchen F, Ultsch A. Discovering temporal knowledge in multivariate time series. Paper presented at the Gesellschaft fur Klassifikation (GfKl). Dortmund, Germany; March 9–11, 2004.
[45] Morchen F. A better tool than Allen's relations for expressing temporal knowledge in interval data. Paper presented at the Theory and Practice of Temporal Data Mining (TPTDM 2006). Philadelphia; August 20–23, 2006.
[46] Korfhage RR. Information storage and retrieval. New York: Wiley; 1997.
[47] Siadaty MS, Knaus WA. Locating previously unknown patterns in data-mining results: a dual data- and knowledge-mining method. BMC Med Inform Decis Mak 2006;6:13.
[48] Tsoi AC, Zhang S, Hagenbuchner M. Pattern discovery on Australian medical claims data–a systematic approach. IEEE Trans Knowl Data Eng 2005;17(10):1420–35.
[49] McCallum A, Nigam K, Ungar LH. Efficient clustering of high-dimensional data sets with application to reference matching. Paper presented at the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Boston (MA); August 20–23, 2000.
[50] Sibai BM. The HELLP syndrome (hemolysis, elevated liver enzymes, and low platelets): much ado about nothing? Am J Obstet Gynecol 1990;162(2):311–6.
[51] Sibai BM, Barton JR. Dexamethasone to improve maternal outcome in women with hemolysis, elevated liver enzymes, and low platelets syndrome. Am J Obstet Gynecol 2005;193(5):1587–90.
[52] Fonseca JE, Mendez F, Catano C, et al. Dexamethasone treatment does not improve the outcome of women with HELLP syndrome: a double-blind, placebo-controlled, randomized clinical trial. Am J Obstet Gynecol 2005;193(5):1591–8.

Clin Lab Med 28 (2008) 101–117

Regional and National Health Care Data Repositories

James H. Harrison, Jr, MD, PhD(a,*), Raymond D. Aller, MD(b)

(a) Departments of Public Health Sciences and Pathology, University of Virginia, Suite 3181 West Complex, 1335 Hospital Drive, Charlottesville, VA 22908, USA
(b) Automated Disease Surveillance Section, Acute Communicable Disease Control Program, Los Angeles County Department of Public Health, 313 N. Figueroa Street, Room 222, Los Angeles, CA 90012, USA

Large-scale analysis of patient data, including data mining, has the


potential to improve substantially the understanding of disease presentation, care delivery processes, and the variability of disease progression
and response to therapy as they happen in the real world. The inherent
limitations of clinical trials in this respect and the benets of direct analysis
of patient data were described by Feinstein [1] more than 20 years ago and
have been reviewed more recently from the perspective of evidence-based
medicine [2]. Direct patient data analyses gain power with increasing data
volume, completeness, and accuracy and under conditions that limit selection bias. For this reason, data sets to address community health and health
care process questions optimally are aggregated at the population level and
cut across traditional socioeconomic categories and geographic divisions.
Such comprehensive health-related population data sets have not been available previously in the United States because of complex social, political, and
business-related barriers. Recently, the societal benets of such data have
been articulated clearly in a call for the creation of a national framework
for the appropriate secondary use of health data [3] and a national health
data warehouse [4]. These eorts may bear fruit in the future; in the meantime, useful although limited analyses have been performed against large
health-related data sets created for other purposes, as described in more
detail later.

* Corresponding author.
E-mail address: james.harrison@virginia.edu (J.H. Harrison).
0272-2712/08/$ - see front matter © 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.006
labmed.theclinics.com

Clinical laboratory data may play several important roles in aggregated
health-related databases. Some of the early work in health data warehousing
was pioneered from laboratory medicine by Altshuler [5,6] and Eggert and
Emmerich [7]. Since that time, most laboratory medicine publications on
data warehousing have focused on aggregating data from a single care provider, such as a hospital [8] or a regional organization [9,10], to be used for
local analysis. Although these analyses can have substantial value for local
laboratory operations [10], improvement of health care quality [11], and
support of clinical research [12], the approach has been underutilized [13].
In databases that aggregate laboratory data with other health-related
data, the content of laboratory data often is limited and targeted to a particular purpose. Even in administrative and other databases that may contain
only laboratory orders or specific disease-targeted laboratory results, however, these records serve as important objective values that can confirm or
question the accuracy of other values in the database such as diagnostic
or procedure codes [14,15].
Existing large-scale health-related databases can be characterized broadly
as Medicare/insurance claims databases, single-provider data warehouses,
regional/national collaborative provider data repositories, and public health
and government population databases. Most of these data sets were collected for purposes other than the delivery of health care, and they generally
contain summary data derived from patients' health records rather than a comprehensive clinical record. Thus they suffer shortcomings and biases
related to their design, purpose for data acquisition, and data selection
and summarization procedures [14,16]. Nonetheless, they have been useful
for obtaining a global view of particular aspects of health care and characteristics of patients in a number of large-scale studies. The strengths and
weaknesses of these databases, and their relationships with clinical laboratories and laboratory data, are considered here.
Claims data
Medicare and large insurance claims databases can provide coverage of
large patient populations. Within a claims database, records for patients
can be linked longitudinally through patient identity, yielding a temporal
view of patient care. The most useful data in these sets is generally based
on physician, facilities, and pharmacy claims (Table 1) [17]. Physician claims
contain American Medical Association Current Procedural Terminology
(CPT) codes describing procedures performed by physicians and other
care providers, International Classification of Disease (ICD, most commonly ICD-9-CM) codes describing diagnoses, and dates of service. Facilities claims contain ICD and CPT codes, including Health Care Common Procedure Coding System (HCPCS) codes representing services, supplies, therapeutics, and other items, and admit/discharge dates if applicable. Pharmacy claims include National Drug Code and/or HCPCS identifiers of
therapeutics, a therapeutic class categorization, and prescription date. Timestamps generally provide only the day of the charge (which may not be the day of the clinical event) and thus do not support precise short-term calculations such as most pharmacokinetics evaluations or confirmation of a rapid response to an event. Claims databases have been used in a large number of health care process and outcomes studies (eg, Ref. [18]), and similar data are used by the Health Plan Employer Data and Information Set (HEDIS) to compare quality of care among providers [19].

Table 1
Contents of common regional data sets

Data set: Claims data, from care providers and payers
Typical contents(a): ICD-9-CM codes(b); CPT codes(c); HCPCS codes(d); NDC codes(e); charges
Features: Charged items only; some diagnoses may be present only as "rule out"; results of observations and tests not included; time precision at the day level

Data set: Single and cooperative provider repositories
Typical contents(a): ICD/CPT codes; problem lists; clinical orders; observation/test results; procedure descriptions; textual case summaries
Features: Full clinical detail with precise timestamps

Data set: Public health and governmental databases
Typical contents(a): Diagnoses/codes; observation/test results; follow-up data as specified, including geographic and environmental data; large-scale survey results
Features: Detailed data confined to that mandated for reporting and follow-up or included in the survey

Abbreviations: CPT, Current Procedural Terminology codes; HCPCS, Healthcare Common Procedure Coding System; ICD-9, International Classification of Disease, edition 9; NDC, National Drug Code.
(a) Actual contents will be site dependent, particularly for clinical repositories from providers.
(b) Clinical diagnoses.
(c) Physician and other provider procedures.
(d) Services and supplies.
(e) Therapeutics.
A number of problems exist with claims data, and their use is controversial [15,16,20]. By their nature, claims records exist only for items that generate a financial transaction, and there is an incentive for the choice of codes to support the operation of the financial system. Many codes are added after patient care during a separate clinical records review, rather than by caregivers. Claims forms may provide a limited number of slots for codes, forcing code selection based on the needs of the current transaction. Diagnostic codes attached to procedures and laboratory tests may represent rule-out possibilities rather than confirmed diagnoses. Finally, the ICD coding system is not comprehensive for all clinical conditions and does
not support accurate description of differences in severity, presentation, or course among patients who have the same disease.
For these reasons, the sensitivity and specificity of claims data for identifying conditions in patients have been questioned. Results differ among diagnoses reviewed, although specificity is generally better than sensitivity (it is more common to fail to code an existing condition than to code a nonexistent one [20]), and the coding of comorbidities seems particularly problematic [21]. Suboptimal sensitivity and/or specificity of coding have been reported for cardiovascular diseases [22,23], acute renal failure [24], pneumonia [25], pediatric asthma [26], migraine [27], gout [28], and mild traumatic brain injury [29], among others. Several investigators have reported improved sensitivity and specificity for identifying particular conditions using algorithms that categorize patients based on multiple criteria in claims databases rather than by the presence or absence of specific codes [17,30]. In general, claims data should be used carefully, and the results should be interpreted conservatively, with an understanding of the biases and inaccuracies inherent in the data sets.
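Such algorithms typically combine several kinds of claims evidence rather than relying on a single code. The sketch below is a purely hypothetical illustration of that idea: the diagnosis and drug codes, the two-outpatient-visit rule, and the drug criterion are invented for the example and do not reproduce the published algorithms of references [17,30].

    from collections import namedtuple

    Claim = namedtuple("Claim", "patient_id claim_type code service_date")

    # Hypothetical criteria for flagging probable diabetes from claims:
    # one inpatient diagnosis, OR two outpatient diagnoses on different days,
    # OR one outpatient diagnosis plus a diabetes-related drug claim.
    DIAGNOSIS_CODES = {"250.00", "250.02"}          # illustrative ICD-9-CM codes
    DRUG_CODES = {"NDC-METFORMIN", "NDC-INSULIN"}   # placeholder drug identifiers

    def probable_case(claims):
        inpatient = [c for c in claims
                     if c.claim_type == "inpatient" and c.code in DIAGNOSIS_CODES]
        outpatient_days = {c.service_date for c in claims
                           if c.claim_type == "outpatient" and c.code in DIAGNOSIS_CODES}
        drugs = [c for c in claims
                 if c.claim_type == "pharmacy" and c.code in DRUG_CODES]
        return (len(inpatient) >= 1
                or len(outpatient_days) >= 2
                or (len(outpatient_days) >= 1 and len(drugs) >= 1))

    claims = [
        Claim("p1", "outpatient", "250.00", "2006-02-01"),
        Claim("p1", "pharmacy", "NDC-METFORMIN", "2006-02-10"),
    ]
    print(probable_case(claims))   # True under these illustrative criteria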
Single-provider data warehouses
Many health care organizations have or are developing local research
databases, but only a few large, integrated health care organizations and
national laboratories are able to marshal data sets that approach large
claims databases in size. Data sets from single health care organizations
can contain detailed clinical data that are linked through patient identity
and thus present a longitudinal view of patient care (see Table 1). National
laboratory data may not guarantee unique identification of patients outside
specic testing transactions and thus are less useful for longitudinal studies.
Because of the increased detail oered by these data sets, they may support
a substantially more comprehensive analysis from a medical perspective
than claims databases. Access to the data, however, may be limited to internal researchers whose activities are directed according to organizational
interest. Barriers to general analysis of such data sets include a lack of financial incentive for the organization to support and maintain data access and processing for arbitrary projects, a lack of financial incentive to support governmentally mandated requirements for research use of patient data, and the perception of organizational risk related to loss of patient confidentiality even if explicit patient identifiers are removed.
Several large organizations have published analyses from their internal
data sets. Kaiser Permanente (www.kaiserpermanente.org), which covers
a substantial number of patients in the Western United States, has implemented an organizational data warehouse [31] that identifies patients for inclusion in a Cardiovascular Risk Factor Management Program [32] and is
being used to standardize clinical care across diverse patient populations
[33]. Large national laboratories have used data warehouses consisting of


Veterans Administration (VA), which supports what is perhaps one of the
worlds largest integrated electronic health records, maintains a centralized
database into which data from local VA health care sites can be aggregated
for analysis. Although this database is not yet comprehensive, large targeted
data sets such as the Diabetes Epidemiology Cohort have been analyzed to
support improved patient care, identify process problems, and target appropriate education to clinicians [35].
Regional and national collaborative provider databases
Most health care providers are not large enough to create regional clinical repositories individually but could build similar databases through cooperative data-sharing agreements involving multiple providers (see Table 1).
The utility of these regional databases, along with the need for sharing clinical and administrative data between the multiple parties typically involved
in patient care, led to a proposal some years ago for the development of
cooperative community health information networks (CHINs). Although
there have been a few notable successes in developing community-based
data warehouses [36], these are unusual, and most initial work in CHIN
development ultimately did not come to fruition for a variety of technical,
nancial, administrative, and political reasons [3739]. The need for sharing
health information to improve care delivery remains, however, and has
driven the development of a modied successor of CHINs termed regional
health information organizations (RHIOs) [39,40]. A primary goal of the
RHIO approach is the communication of clinical information across organizations to support patient care. This goal may be met either through
establishment of an accessible central database of shared information or
by peer-to-peer communications without a central database; examples of
both approaches exist [40]. When actual or federated virtual databases are
created, RHIOs could provide an excellent foundation for data mining.
Only time will tell whether the RHIO movement will be successful; several
have been operating for many years, predating their identification as RHIOs. Many others have been formed only recently: the Massachusetts Health Data Consortium's national list of RHIOs included 53 organizations
as of July 2006 [41].
Health care organizations also have formed voluntary data-sharing consortia based on implementation of particular information systems rather
than on regional location. This approach has the advantage of reducing
the challenges of reconciling data models and data representation across systems by adopting a vendor's proprietary designs as a standard for data sharing, but it also relies on the vendor's continued viability and may be
problematic if sites change vendors. Furthermore, although this strategy
increases the volume of data available for analysis, it does not capture
a comprehensive longitudinal care record within a region, because data
transmitted to the central repository are de-identified (and thus data from
one patient cannot be linked across care providers), and vendors generally
do not have complete regional coverage. Thus vendor-based data aggregation is not likely to be satisfactory in the long term. Examples of this approach include hospitals using the ambulatory electronic medical record
from GE Healthcare Information Technologies (Milwaukee, Wisconsin)
[42] and the PedCath pediatric catheterization database (Scientific Software Solutions, Inc., Charlottesville, Virginia) [43].
The development of collaborative data sharing has been limited for a variety of reasons. Among the most important are a lack of clear financial incentives for institutions that bear the cost of creating and maintaining the data repositories and uncertainty concerning risks and benefits (eg, balancing patient confidentiality risks against organizational and societal benefits). As discussed in relation to data warehousing elsewhere in this
issue, developing and maintaining methods for accurately communicating
and cleansing data and for reconciling multiple data models and data representation strategies is demanding. Once created, data warehouses probably
will support work that will lead to recommendations for improved care.
This information may benefit society, but whether the called-for improvements in care would be financially beneficial or detrimental to health care organizations or would provide competitive advantage over other organizations that also may be participating in the data-sharing effort is unclear. In addition, linking patients' data longitudinally across multiple providers requires patient identity information of some type. Thus aggregated databases carry a risk of loss, albeit perhaps low, of patient confidentiality. If confidentiality were breached, the responsibility of the organization providing the data is not completely clear under current regulations; in any case, the perception of poor stewardship of health information would be detrimental. The incentive is limited for an institution to participate in a resource-intensive effort
that has a low but real risk for a negative outcome and an uncertain likelihood
of a positive outcome (for the institution). It is understandable that most
aggregated databases to date are intraorganizational: data processing, communication, and maintenance issues are simpler, and analysis can be targeted
to particular projects of interest to the organization. Until the incentives for
provider data sharing can be brought into alignment with public needs, public
mandates probably will remain the most effective method for aggregating
clinical data from multiple organizations into useful form.
Public health and related databases
In nations with different approaches to health care financing and different regulations related to data stewardship and privacy, the creation of large-scale clinical repositories for analysis has proceeded more rapidly than in
the United States. These projects have been initiated within health care
organizations or governmentally mandated at the regional or national level,
or they may involve data sharing between nations. Examples include regional [44] and national [45] general-purpose clinical data warehouses,
national [46] and international [47] disease-specific data warehouses,
national public health databases [48], and international federated genome
data sets for twin studies [49].
In the United States, reporting specific laboratory, diagnosis, and other data identified as important to public health has been mandated for many
years, and these data are gathered into regional databases designed for public
health surveillance and case follow-up (Fig. 1; see Table 1). These databases
typically contain geographic distributions of symptoms or disease (Fig. 2)
and may contain nonhealth care information from a variety of sources
that is pertinent to understanding disease transmission (Fig. 3). The unique
geographic aspects of these databases make them particularly useful for analysis using geographic information systems and spatial data mining [50].
In the past, public health reporting has been a hierarchical process managed by local and state public health authorities, with manual entry of paper forms from health care providers into multiple disease- and program-specific
database systems. These processes and systems are being replaced by electronic reporting and integrated data warehousing capabilities. A recent survey [51] indicated that 69% of state public health laboratories had
integrated data management capabilities, and an additional 16% planned
to acquire these capabilities by the end of the decade. States at the leading
edge of this development are relatively advanced, with electronic reporting,
data transfer collaborations with in-state RHIOs, and data warehousing at
the state level [52].
Data collected by the states is ultimately reported to the US Centers for
Disease Control (CDC), which supports active programs that aggregate the
data for a variety of special purposes. For example, the Public Health Information Network [53] provides an architecture for reporting and responding
to infectious disease and other health data that may signal public health
emergencies. CDC Wonder (wonder.cdc.gov) is a system that provides
access to a large selection of public health-related summary data based
on the state reports. Special programs such as the Vaccine Adverse Event
Reporting System collect data about a particular health-related issue that
then are accessible for data mining [54,55]. Public health data reporting
and databases are exempt from Health Insurance Portability and Accountability Act (HIPAA) privacy restrictions but have their own accessibility
requirements. In general, detailed public health data are accessible only to
public health practitioners, and thus data mining involves collaboration
with appropriate public health officials.
A number of other special-purpose governmental databases containing
survey data related to population health and health care services use (reviewed in Ref. [56]) or to Food and Drug Administration product surveillance [57] also may be available for analysis. All these public health and
governmental data sets are limited, in that they are not complete health records and contain only data that are mandated for survey, reporting, or capture during follow-up. Thus they are not regional or national clinical repositories in the sense of some of the European systems, but they can be useful for addressing particular questions in appropriate domains.

Fig. 1. Reportable diseases and health conditions, Los Angeles County, California. Reporting may be performed by submission of paper forms, telephone, or electronic communication. The submitted information, along with additional follow-up data, is incorporated into a regional public health data warehouse. (Courtesy of County of Los Angeles, Department of Health Services, Public Health, Los Angeles, CA.)

Fig. 2. Prevalence of a rash symptom in patients seen in emergency departments in Los Angeles County. Color coding indicates how statistically abnormal the rate is in each zip code of patient residence. The bright red area indicates a probable outbreak of a rash-related illness.
Challenges inherent in regional databases
Regional data warehouses, whether they house clinical data from collaborating care providers or public health data, offer all the challenges previously described in this issue for data warehouse construction, with the
added complexity of multiple independent data-contributing organizations.
Furthermore, regional warehouses may receive useful data from organizations or locations distinct from traditional health care providers, for example environmental or veterinary laboratories (see Fig. 3), nursing homes or
home health care providers, or remote sensors such as home blood-glucose
monitors. Some users may be outside the traditional health care provider or
health services researcher roles, and data presentation or analysis capabilities
should support their needs. For example, public health data are important for
highlighting geographic clusters of symptoms in biosurveillance (see Fig. 2)
and for targeting mosquito abatement in efforts to suppress the West Nile virus in areas of high bird loss (see Fig. 3). Beyond these special applications, regional data warehouses also have distinct challenges related to data communication and loading, reconciliation of varied data representation, linkage of data by patient, and normalization of laboratory and other numerical data.

Fig. 3. Veterinary data indicating the distribution of West Nile virus (WNV) in bird autopsies in Los Angeles County. Larger dots indicate a higher number of WNV-positive birds. These data permit the County to focus its mosquito-abatement efforts into high-incidence areas, which may help limit transmission to humans.

Communications and data loading


Traditionally, national hospital discharge databases accumulated data in
yearly chunks. Data processing and loading often took significant additional
time, so data might be several years old by the time they became accessible
for study. The applicability of conclusions based on such older data to current decisions was legitimately questioned. For a regional warehouse to be
useful for business planning or process improvement, the authors have proposed that data be accumulated and distributed at least quarterly. For detection of disease outbreaks, data should be loaded at least daily and
surveyed every day. Finally, if a database will be a hub that communicates
patient data to care providers, it should be designed as a transactional system, with updates from contributing systems within a few minutes or less.
Thus, depending on their purpose, regional systems can span a broader
range of updating requirements than typical for intraorganizational data
warehouses, which generally are updated weekly or monthly.
Data warehouses that are updated periodically may receive data as files
that are manually or automatically downloaded from contributing systems
and processed to an appropriate form for loading before or after electronic
transmission by standard means such as secure File Transfer Protocol. Ongoing developments in standard data formats based on Health Level 7 (HL7,
www.hl7.org) and Extensible Markup Language (XML, www.w3.org/
XML) and the use of standard HL7 parsers and XML processing software
may simplify these basic import-export steps, although the challenges described later related to data representation also will need to be addressed before a fully standard approach is practical. For more rapid updates related to
biosurveillance and clinical communication, real-time transactional interfaces will be necessary. In the United States, most clinical data interfaces
use the HL7 version 2 framework [58], which provides a transaction message
structure but does not define its data content fully. Thus system interfaces using HL7 require negotiation and adaptation, and creating separate interfaces for each contributing system can be a significant expense in the construction of regional data warehouses. It is hoped that HL7 version 3 will be able to standardize clinical interfaces better and reduce their costs through definition of a common data model and data element representations in addition to a messaging framework [59]. The Clinical Data Interchange Standards Consortium (www.cdisc.org) and the Cancer Biomedical Informatics Grid (cabig.nci.nih.gov) are engaged in similar efforts to standardize data models and representation related to clinical research. As these efforts progress and coordinate, they ultimately may decrease substantially the expense and effort
currently required to connect clinical information systems for data sharing.
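As a simplified illustration of the parsing such an interface performs, the sketch below pulls observation identifiers, values, units, and abnormal flags from the OBX segments of a pipe-delimited HL7 version 2 result message. The message is an invented, heavily abbreviated example; real interfaces must also handle escape sequences, repeating fields, acknowledgments, and the site-specific content variations discussed above.

    # A hypothetical, abbreviated HL7 v2 ORU (result) message; segments are
    # separated by carriage returns and fields by the '|' character.
    message = "\r".join([
        "MSH|^~\\&|LAB|HOSP|WAREHOUSE|RHIO|200801151230||ORU^R01|0001|P|2.3",
        "PID|1||123456^^^HOSP||DOE^JANE",
        "OBX|1|NM|2345-7^GLUCOSE^LN||101|mg/dL|70-99|H|||F",
        "OBX|2|NM|2160-0^CREATININE^LN||1.1|mg/dL|0.6-1.2|N|||F",
    ])

    def parse_obx(msg):
        """Yield (observation id, value, units, abnormal flag) from OBX segments."""
        for segment in msg.split("\r"):
            fields = segment.split("|")
            if fields[0] == "OBX":
                # OBX-3 observation identifier, OBX-5 value, OBX-6 units, OBX-8 flag
                obs_id = fields[3].split("^")[0]
                yield obs_id, fields[5], fields[6], fields[8]

    for row in parse_obx(message):
        print(row)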
Reconciliation of data representation
In the absence of a standard data representation specification such as that in HL7 version 3, data transferred using a standard messaging or file format
may represent values in multiple ways. For example, most laboratories have
created their own unique sets of test codes and text mnemonics, and other
data similarly may have site-specific representation. With a few exceptions,
it generally is necessary to build a translation table into each source system,
mapping its vocabulary to a standard target vocabulary used in the regional
warehouse. The Logical Observation Identifiers Names and Codes (LOINC)
has established a well-accepted standard for laboratory test names [60], and
organism names in microbiology can be converted to the Systematized
Nomenclature of Medicine Clinical Terminology (SNOMED CT [61]).
The task of fully coding a laboratory database for LOINC and SNOMED
nomenclature is significant, and most laboratories create mapping tables only for tests that are exported. This approach can cause problems, because each laboratory's mapping table must be maintained as tests change
and new tests are incorporated into the regional database. Failure to maintain the table appropriately can cause failure of data transmission and
require retroactive identification and transmission of unreported results.
In the future, as an increasing number of regulatory agencies and other organizations require the use of communication standards, laboratories
should be able to provide conversion to standard terminologies as a part
of routine data export. Similar data-representation considerations apply
to other clinical systems that supply data to a regional warehouse.
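In code, such a translation table is little more than a lookup, and the operational problem described above is what happens to local codes that have no mapping yet. The sketch below is a hypothetical illustration; the local mnemonics are invented, and only the LOINC codes shown in the comments are real.

    # Hypothetical mapping from one laboratory's local test mnemonics to LOINC.
    LOCAL_TO_LOINC = {
        "GLU": "2345-7",    # Glucose [Mass/volume] in Serum or Plasma
        "CREAT": "2160-0",  # Creatinine [Mass/volume] in Serum or Plasma
    }

    def translate(result, mapping, unmapped_log):
        """Return the result with a standard code attached, or queue it for
        manual mapping so transmission failures can be detected and repaired."""
        local_code = result["local_code"]
        if local_code in mapping:
            return {**result, "loinc": mapping[local_code]}
        unmapped_log.append(result)   # new or renamed test: table needs maintenance
        return None

    unmapped = []
    results = [
        {"local_code": "GLU", "value": 101, "units": "mg/dL"},
        {"local_code": "HGBA1C", "value": 7.2, "units": "%"},   # not yet mapped
    ]
    standardized = [r for r in (translate(x, LOCAL_TO_LOINC, unmapped) for x in results) if r]
    print(standardized)
    print("awaiting mapping:", unmapped)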
Linkage of data by patient
A health information data warehouse that is optimally useful for understanding disease presentation and course as well as care processes should
contain accurate longitudinal representations of disease, care, and therapeutic response for each patient. When data come from multiple care providers
who share no common patient identification mechanism, data linkage requires transmission of some type of patient identifier [62]. These identifiers may be stripped from the data before the data are used for analysis, but it is necessary to maintain the identifiers and their link to the patient data
in a secure environment in the data warehouse so that future data may be
linked to data from the same patient that are already in the warehouse,
irrespective of the data source. Although this architecture is used in regional
and national European health data warehouses [45,46], the authors are not
aware that it has been deployed successfully in a multiorganizational health
data warehouse in the United States.
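One generic way to hold such a linkage key, shown below only as an illustration and not as a description of the cited European systems, is to derive a pseudonymous identifier from normalized demographic fields with a keyed hash kept in the secured linkage environment, so that records from different providers for the same patient collide on the same key. Real implementations must also cope with name variants, typographic errors, and probabilistic matching.

    import hmac
    import hashlib

    SECRET_KEY = b"kept-only-in-the-secure-linkage-environment"   # hypothetical

    def linkage_key(last_name, first_name, birth_date, sex):
        """Derive a pseudonymous patient key from normalized demographics."""
        normalized = "|".join(
            s.strip().upper() for s in (last_name, first_name, birth_date, sex)
        ).encode("utf-8")
        return hmac.new(SECRET_KEY, normalized, hashlib.sha256).hexdigest()

    # Records from two providers for what is in fact the same patient.
    k1 = linkage_key("Doe", "Jane", "1961-04-02", "F")
    k2 = linkage_key("DOE ", "jane", "1961-04-02", "F")
    print(k1 == k2)   # True: both records link to the same warehouse patient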
Importance, standardization, and normalization of clinical laboratory data
Laboratory data occupy an important position in health data warehouses. Much of the data in these databases is coded (eg, ICD-9-CM codes),
and the codes typically are entered by people who are performing a rapid
review of all or part of the medical record for the purpose of financial
transactions [16]. In databases that contain textual records, physicians' comments in text may be unclear, ambiguous, or misleading. Laboratory data
are objective and clinically meaningful and generally are transmitted accurately. Laboratory data can serve as validation for correctly coded data
and as a flag for incorrect codes. Laboratory data also can support risk stratification and inference of disease severity or unusual presentation,
which are poorly represented in the currently common coding schemes
[63]. Finally, with appropriate analysis techniques, laboratory data can
allow correct classification of patients to conditions for which codes do
not exist or when codes are omitted [64].
The aggregation of data from multiple laboratories into data warehouses
offers several important challenges in addition to the previously mentioned differences in naming of test codes. Results that are expressed as categoric values or discrete scales may differ between laboratories based on different
category names and scales. These values must be reconciled or standardized
to a common set of values or scale. Many clinical laboratory tests that yield
numeric values are not standardized to reference material, and the variability of results between test methodologies, and even between laboratories using the same methodology, has been well described [65–67]. The authors have found that many analytes can yield differences between laboratories in a range that would affect clinical interpretation and yield spurious associations in data mining. Merely comparing the results to reference ranges is insufficient for rigorous manual evaluation and is inadequate for data mining. Statistical methods to normalize result values between different laboratories are available and should be considered for general application across
the data warehouse [68,69].
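A minimal statistical illustration, far simpler than the published normalization methods [68,69], is to re-express each result as a z-score against its own laboratory's distribution for that analyte before pooling, so that systematic between-laboratory offsets are not mistaken for clinical differences. The laboratories and values below are invented.

    from statistics import mean, stdev
    from collections import defaultdict

    def normalize_by_lab(results):
        """results: list of (lab_id, analyte, value). Returns the same rows with
        the value re-expressed as a z-score within its laboratory and analyte."""
        groups = defaultdict(list)
        for lab, analyte, value in results:
            groups[(lab, analyte)].append(value)
        stats = {k: (mean(v), stdev(v)) for k, v in groups.items() if len(v) > 1}
        normalized = []
        for lab, analyte, value in results:
            m, s = stats[(lab, analyte)]
            normalized.append((lab, analyte, (value - m) / s))
        return normalized

    # Invented calcium results (mg/dL) from two laboratories with a method offset.
    results = [
        ("lab A", "calcium", 9.0), ("lab A", "calcium", 9.4), ("lab A", "calcium", 9.8),
        ("lab B", "calcium", 9.6), ("lab B", "calcium", 10.0), ("lab B", "calcium", 10.4),
    ]
    for row in normalize_by_lab(results):
        print(row)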
Summary
Although the United States lags behind Europe in creating large-scale,
detailed health data repositories at the regional level, several types of repositories exist in this country that may provide data useful for particular data-mining applications. Regional repositories or data warehouses have special
requirements and constraints that distinguish them from intraorganizational
data warehouses. These differences are related to their intended purpose, the diversity of data they may receive, and their connection to multiple unrelated data providers. Public health databases differ further because they
are exempt from HIPAA privacy requirements and operate under their
own security and confidentiality mandates. As objective data that are highly
likely to be transmitted correctly, clinical laboratory results play an important role in all these databases in validating other data, including diagnosis
and procedure codes, and in supporting inference of disease severity and
risk-related information that is poorly handled in coding systems. Aggregation of laboratory results from multiple data providers yields specific challenges related to representation and normalization of data that must be
addressed successfully for these databases to be optimally useful. As data
communications related to public health continue to develop, more laboratories will become direct contributors to public health databases through
electronic data transfer. If the United States is successful in current efforts
to develop a general administrative and political framework for the secondary analysis of health data, laboratories also are likely to become important
and valuable contributors to multiple regional databases targeting improvement in community health, better understanding of therapeutic response,
and improvement in the medical process.
References
[1] Feinstein AR. An additional basic science for clinical medicine: II. The limitations of randomized trials. Ann Intern Med 1983;99(4):544–50.
[2] Grossman J, Mackenzie FJ. The randomized controlled trial: gold standard, or merely standard? Perspect Biol Med 2005;48(4):516–34.
[3] Safran C, Bloomrosen M, Hammond WE, et al. Toward a national framework for the secondary use of health data: an American Medical Informatics Association white paper. J Am Med Inform Assoc 2007;14(1):1–9.
[4] Blewett LA, Parente ST, Finch MD, et al. National health data warehouse: issues to consider. J Healthc Inf Manag 2004;18(1):52–8.
[5] Altshuler CH. Building a database for monitoring and facilitating health care. Clin Lab Med 1983;3(1):179–204.
[6] Altshuler CH. Use of comprehensive laboratory data as a management tool. Clin Lab Med 1985;5(4):673–96.
[7] Eggert AA, Emmerich KA. Long-term data storage in a clinical laboratory information system. J Med Syst 1989;13(6):347–54.
[8] Kamal J, Rogers P, Saltz J, et al. Information warehouse as a tool to analyze Computerized Physician Order Entry order set utilization: opportunities for improvement. AMIA Annu Symp Proc 2003;336–40.
[9] Maizlish NA, Shaw B, Hendry K. Glycemic control in diabetic patients served by community health centers. Am J Med Qual 2004;19(4):172–9.
[10] Bock BJ, Dolan CT, Miller GC, et al. The data warehouse as a foundation for population-based reference intervals. Am J Clin Pathol 2003;120(5):662–70.
[11] Bates DW, Pappius E, Kuperman GJ, et al. Using information systems to measure and improve quality. Int J Med Inform 1999;53(2–3):115–24.
[12] Gilbertson JR, Gupta R, Nie Y, et al. Automated clinical annotation of tissue bank specimens. Medinfo 2004;11(Pt 1):607–10.
[13] Aller RD. The clinical laboratory data warehouse. An overlooked diamond mine. Am J Clin Pathol 2003;120(6):817–9.
[14] McCarthy EP, Iezzoni LI, Davis RB, et al. Does clinical evidence support ICD-9-CM diagnosis coding of complications? Med Care 2000;38(8):868–76.
[15] Solberg LI, Engebretson KI, Sperl-Hillen JM, et al. Are claims data accurate enough to identify patients for performance measures or quality improvement? The case of diabetes, heart disease and depression. Am J Med Qual 2006;21:238–45.
[16] O'Malley KJ, Cook KF, Price MD, et al. Measuring diagnoses: ICD code accuracy. Health Serv Res 2005;40(5 Pt 2):1620–39.
[17] Rector TS, Wickstrom SL, Shah M, et al. Specificity and sensitivity of claims-based algorithms for identifying members of Medicare+Choice health plans that have chronic medical conditions. Health Serv Res 2004;39(6 Pt 1):1839–57.
[18] Udvarhelyi IS, Gatsonis C, Epstein AM, et al. Acute myocardial infarction in the Medicare population. Process of care and clinical outcomes. JAMA 1992;268(18):2530–6.
[19] National Committee for Quality Assurance. Health Plan Employer Data and Information Set 3.0. Washington, DC: National Committee for Quality Assurance; 1998.
[20] Ahmed F, Janes GR, Baron R, et al. Preferred provider organization claims showed high predictive value but missed substantial proportion of adults with high-risk conditions. J Clin Epidemiol 2005;58:624–8.
[21] Kern EF, Maney M, Miller DR, et al. Failure of ICD-9-CM codes to identify patients with comorbid chronic kidney disease in diabetes. Health Serv Res 2006;41(2):564–80.
[22] Birman-Deych E, Waterman AD, Yan Y, et al. Accuracy of ICD-9-CM codes for identifying cardiovascular and stroke risk factors. Med Care 2005;43(5):480–5.
[23] Bullano MF, Kamat S, Willey VJ, et al. Agreement between administrative claims and the medical record in identifying patients with a diagnosis of hypertension. Med Care 2006;44(5):486–90.
[24] Waikar SS, Wald R, Chertow GM, et al. Validity of international classification of diseases, ninth revision, clinical modification codes for acute renal failure. J Am Soc Nephrol 2006;17(6):1688–94.
[25] Aronsky D, Haug PJ, Lagor C, et al. Accuracy of administrative data for identifying patients with pneumonia. Am J Med Qual 2005;20(6):319–28.
[26] Dombkowski KJ, Wasilevich EA, Lyon-Callo SK. Pediatric asthma surveillance using Medicaid claims. Public Health Rep 2005;120(5):515–24.
[27] Kolodner K, Lipton RB, Lafata JE, et al. Pharmacy and medical claims data identified migraine sufferers with high specificity but modest sensitivity. J Clin Epidemiol 2004;57(9):962–72.
[28] Harrold LR, Saag KG, Yood RA, et al. Validity of gout diagnoses in administrative data. Arthritis Rheum 2007;57(1):103–8.
[29] Bazarian JJ, Veazie P, Mookerjee S, et al. Accuracy of mild traumatic brain injury case ascertainment using ICD-9 codes. Acad Emerg Med 2006;13(1):31–8.
[30] Mapel DW, Frost FJ, Hurley JS, et al. An algorithm for the identification of undiagnosed COPD cases using administrative claims data. J Manag Care Pharm 2006;12(6):457–65.
[31] Hollis J. Deploying an HMO's data warehouse. Health Manag Technol 1998;19(8):46–8.
[32] Joyce JS, Fetter MM, Klopfenstein DH, et al. The Kaiser Permanente Northwest Cardiovascular Risk Factor Management Program: a model for all. The Permanente Journal 2004;9(2):19–26.
[33] Sequist TD, Cullen T, Ayanian JZ. Information technology as a tool to improve the quality of American Indian health care. Am J Public Health 2005;95(12):2173–9.
[34] Koski E, Teates KS, Tellez P, et al. Exploring the role of Quest Diagnostics corporate data warehouse for timely influenza surveillance. Advances in Disease Surveillance 2006;1:41.
[35] Kupersmith J, Francis J, Kerr E, et al. Advancing evidence-based care for diabetes: lessons from the Veterans Health Administration. Health Aff (Millwood) 2007;26(2):w156–68.
[36] Berndt DJ, Hevner AR, Studnicki J. The Catch data warehouse: support for community health care decision-making. Decision Support Systems 2003;35:367–84.
[37] Starr P. Smart technology, stunted policy: developing health information networks. Health Aff (Millwood) 1997;16(3):91–105.
[38] Payton FC, Brennan PF. How a community health information network is really used. Commun ACM 1999;42(12):85–9.
[39] Berberabe T. Information: it's better when you share. Manag Care 2005;14(2):30, 35–7.
[40] Solomon MR. Regional health information organizations: a vehicle for transforming health care delivery? J Med Syst 2007;31(1):35–47.
[41] Massachusetts Health Data Consortium. Active Regional Health Information Organizations (RHIO) List. 2006. Available at: http://www.mahealthdata.org/data/library/20061127_ActiveRHIOs.pdf. Accessed August 19, 2007.
[42] Wright A, Ricciardi TN, Zwick M. Application of information-theoretic data mining techniques in a national ambulatory practice outcomes research network. AMIA Annu Symp Proc 2005;829–33.
[43] Everett AD, Ringel R, Rhodes JF, et al. Development of the MAGIC congenital heart disease catheterization database for interventional outcome studies. J Interv Cardiol 2006;19(2):173–7.
[44] Currie CJ, McEwan P, Peters JR, et al. The routine collation of health outcomes data from hospital treated subjects in the Health Outcomes Data Repository (HODaR): descriptive analysis from the first 20,000 subjects. Value Health 2005;8(5):581–90.
[45] van Bemmel JH, van Mulligen EM, Mons B, et al. Databases for knowledge discovery. Examples from biomedicine and health care. Int J Med Inform 2006;75(3–4):257–67.
[46] Ben Said M, le Mignot L, Mugnier C, et al. A multi-source information system via the Internet for end-stage renal disease: scalability and data quality. Stud Health Technol Inform 2005;116:994–9.
[47] Steil H, Amato C, Carioni C, et al. EuCliD–a medical registry. Methods Inf Med 2004;43(1):83–8.
[48] Hristovski D, Rogac M, Markota M. Using data warehousing and OLAP in public health care. Proc AMIA Symp 2000;369–73.
[49] Muilu J, Peltonen L, Litton JE. The federated database–a basis for biobank-based postgenome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. Eur J Hum Genet 2007;15(7):718–23.
[50] Scotch M, Parmanto B. Development of SOVAT: a numerical-spatial decision support system for community health assessment research. Int J Med Inform 2006;75(10–11):771–84.
[51] Inhorn SL, Wilcke BW Jr, Downes FP, et al. A comprehensive Laboratory Services Survey of State Public Health Laboratories. J Public Health Manag Pract 2006;12(6):514–21.
[52] Hanrahan LP, Foldy S, Barthell EN, et al. Medical informatics in population health: building Wisconsin's strategic framework for health information technology. WMJ 2006;105(1):16–20.
[53] Loonsk JW, McGarvey SR, Conn LA, et al. The Public Health Information Network (PHIN) Preparedness initiative. J Am Med Inform Assoc 2006;13(1):1–4.
[54] Banks D, Woo EJ, Burwen DR, et al. Comparing data mining methods on the VAERS database. Pharmacoepidemiol Drug Saf 2005;14(9):601–9.
[55] Iskander J, Pool V, Zhou W, et al. Data mining in the US using the Vaccine Adverse Event Reporting System. Drug Saf 2006;29(5):375–84.
[56] Zeni MB, Kogan MD. Existing population-based health databases: useful resources for nursing research. Nurs Outlook 2007;55(1):20–30.
[57] Cabell CH, Noto TC, Krucoff MW. Clinical utility of the Food and Drug Administration electrocardiogram warehouse: a paradigm for the critical pathway initiative. J Electrocardiol 2005;38(Suppl 4):175–9.
[58] Health Level Seven, Inc. About HL7. Available at: http://www.hl7.org/about/hl7about.htm. Accessed August 19, 2007.
[59] Blobel BGME, Engel K, Pharow P. Semantic interoperability–HL7 version 3 compared to advanced architecture standards. Methods Inf Med 2006;45(4):343–53.
[60] Regenstrief Institute, Inc. Logical Observation Identifiers Names and Codes (LOINC). Available at: http://www.regenstrief.org/medinformatics/loinc/. Accessed August 19, 2007.
[61] National Library of Medicine. SNOMED Clinical Terms (SNOMED CT). Available at: http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html. Accessed August 19, 2007.
[62] Black N. Secondary use of personal data for health and health services research: why identifiable data are essential. J Health Serv Res Policy 2003;8(3 Suppl 1):36–40.
[63] Emons MF. Integrated patient data for optimal patient management: the value of laboratory data in quality improvement. Clin Chem 2001;47(8):1516–20.

REGIONAL AND NATIONAL HEALTH CARE DATA

117

[64] Post AR, Harrison JH Jr. PROTEMPA: a method for specifying and identifying temporal
sequences in retrospective data for patient selection. J Am Med Inform Assoc 2007;14(5):
67483.
[65] Ricos C, Domenech MV, Perich C. Analytical quality specications for common reference
intervals. Clin Chem Lab Med 2004;42(7):85862.
[66] Westgard JO, Westgard SA. The quality of laboratory testing today: an assessment of sigma
metrics for analytic quality using performance data from prociency testing surveys and the
CLIA criteria for acceptable performance. Am J Clin Pathol 2006;125(3):34354.
[67] Viljoen A, Twomey PJ. True or not: uncertainty of laboratory results. J Clin Pathol 2007;
60(6):5878.
[68] Karvanen J. The statistical basis of laboratory data normalization. Drug Inf J 2003;37:
1017.
[69] Ruvuna F, Flores D, Mikrut B, et al. Generalized lab norms for standardizing data from
multiple laboratories. Drug Inf J 2003;37:6179.

Clin Lab Med 28 (2008) 119–126

Data Mining and Infection Control


Stephen E. Brossette, MD, PhD*,
Patrick A. Hymel, Jr, MD
Cardinal Health, 400 Vestavia Parkway, Suite 310, Birmingham, AL 35216, USA

We regard data mining as the data-driven, automated construction of descriptive and predictive statistical models. Data mining, infection control, and laboratory medicine intersect at the use of clinical laboratory data by computers to automatically construct models that describe or predict hospital epidemiology patterns of statistical and clinical significance. Of course, the main tenet of data mining is that the models and patterns contain insights that were previously unsuspected. For that reason alone, data mining is not an exercise in hypothesis-driven exploratory statistics, or hypothesis-driven statistical model building, because hypothesis-driven implies previously suspected. In this article, we examine data mining in laboratory medicine and infection control and describe future opportunities in the space.

Infection control is the quality control activity concerned primarily with the quantification and prevention of nosocomial infections (NIs). Its success depends on the timely identification and correction of process breakdowns that increase infection risks. It is difficult, however, for infection control to identify new risk threats, intervene, and track outcomes continuously, hospital-wide. These challenges can be mitigated by a properly designed data mining system.
Traditional collection and analysis of infection control data occur by hand. The Centers for Disease Control (CDC) recommends that NI case finding proceed by the manual application of clinical case definitions that perform poorly prospectively (sensitivity 0.61) and retrospectively (specificity 0.68) [1]. The inability of infection control practitioners to reliably identify NIs, much less patterns among them, is a clear limitation of the traditional CDC-endorsed system. For this reason, Brossette and colleagues [2] created the electronic Nosocomial Infection Marker (NIM,

* Corresponding author.
E-mail address: sbrossette@medmined.com (S.E. Brossette).
0272-2712/08/$ - see front matter 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.007
labmed.theclinics.com
patent pending, Cardinal Health). The NIM outperforms the CDC National Nosocomial Infection Surveillance system (NNIS) clinical case definitions in the ICU and the Study on the Efficacy of Nosocomial Infection Control (SENIC) case definitions house-wide [2], and is based solely on electronic clinical microbiology and electronic patient census and movement data. As a result, the NIM is reproducibly computable, thus solving a major limitation of manual case-finding methods. Data models based on the NIM, or other deterministically computable infection proxies, can specifically and reliably describe patterns of NIs, not just laboratory results, allowing for more specific and objective process improvement initiatives.
Predictive data mining
Descriptive data mining should reveal and describe important, previously unknown patterns of nosocomial (and community-acquired) infections, contamination, and colonization. Predictive data mining should construct models to predict NI risk. To some extent, the NIM algorithm accomplishes as much. If an NIM is detected, it is likely associated with NI; if an NIM is not detected, an NI is likely not present [2]. The previously described GermWatcher system implemented culture-based definitions for NI to accomplish similar goals [3]. The NIM and GermWatcher are expert rules systems; neither generates models to predict risk. They can, however, like the NIM, be used to provide data to model-generating systems.

Predictive data mining to build models of infection risk could use any of the classifier-building techniques from machine learning [4] (eg, neural networks) or even techniques used mostly for descriptive mining, such as association rules. This endeavor would require substantial research, but once developed, classifiers could be used to proactively target high-risk patients for prevention efforts. Of course, the exercise could lead to obvious conclusions, such as "neutropenic patients are at high risk for NI," but it also may provide insights that are currently unknown or underappreciated.
Descriptive data mining: the Data Mining Surveillance System
Descriptive data mining in laboratory medicine and infection control is at this time entirely represented by the Data Mining Surveillance System (DMSS) [5–8]. DMSS uses frequent set and association rule analysis to automatically construct, from laboratory medicine and patient movement data, patterns of statistical and clinical interest. The reason these techniques are useful for infection control is that NI risks are complex, and subtle patterns of infection, colonization, contamination, and multidrug resistance often go unnoticed. This is not hard to understand; the combinatorial complexity of a simple infection event space is substantial. A hypothetical hospital with 20 common bacterial pathogens, 10 specimen sources, 10 physicians or services, and 10 in vitro antimicrobial sensitivity results (each sensitive,
intermediately resistant, or resistant [S/I/R]) yields more than 100,000,000 possible events. Temporal and spatial clustering of these events composes patterns, so 100 million events with the added dimensions of time and space create a pattern space that exceeds the capacity of manual exploration. Data mining is a better way to approach these types of problems.
Frequent sets/association rules
Descriptive data mining is dominated by frequent set and association rule (FS/AR) techniques. These techniques are used in DMSS and are briefly described here. A more complete discussion of association rules and other software that uses them can be found in the article by Brown, elsewhere in this issue.

Frequent sets are sets of items that commonly appear together in transaction data. For example, if motor oil, shampoo, and chocolate candy bar X are purchased together in 100 transactions in a week from a major retailer, and any set of items purchased together more than 75 times is considered frequent, then A = {motor oil, shampoo, chocolate candy bar X} is a frequent set. By simple closure, all subsets of A are also frequent.

Using frequent sets, association rules can be constructed. Association rules are statements of how often certain items in frequent sets are found with other items in the same set. For example, the frequent sets {bread}, {milk}, and {bread, milk} can be used to generate the association rules {bread} → {milk} and {milk} → {bread}. Each is a statement of conditional probability of the form A → B, read as B given A, where A and B are frequent sets. If A occurs 100 times and B occurs with A 63 times, then we say the rule A → B has confidence 63/100.
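To make these definitions concrete, the following minimal Python sketch counts item-set support over a handful of toy transactions and computes the confidence of a candidate rule. The items, transactions, and minimum support are invented for illustration; this is not the DMSS implementation.

from itertools import combinations
from collections import Counter

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

# Count support for every itemset of size 1 or 2 (a toy frequent-set pass).
support = Counter()
for basket in transactions:
    for size in (1, 2):
        for itemset in combinations(sorted(basket), size):
            support[frozenset(itemset)] += 1

min_support = 2
frequent = {s: c for s, c in support.items() if c >= min_support}

def confidence(lhs, rhs):
    # Confidence of lhs -> rhs = support(lhs union rhs) / support(lhs).
    return support[frozenset(lhs) | frozenset(rhs)] / support[frozenset(lhs)]

print(confidence({"bread"}, {"milk"}))  # 3/4 = 0.75 in this toy data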
Data from retail checkout systems are used to identify sets of items that
people purchase together frequently. For this reason, FS/AR analysis is
often referred to as market basket analysis. In market basket analysis, association rules are used to design product placement strategies and marketing
campaigns. If consumers purchase items together, especially in unexpected
ways, then opportunities may exist to design campaigns that use buying
patterns or item attributes to increase sales.
In DMSS, the market basket contains clinical microbiology results with NIM status, along with admission, location, specimen timing, and patient demographic information. Frequent sets are computed to determine which attributes occur together, even at low frequency, and association rules are constructed to describe relationships between frequent sets. Once these rules are constructed for a given time-slice of data (eg, one month), their confidences are calculated historically and prospectively and are monitored for significant changes. Rather than identifying associations with a high confidence, as might be done in a simple retail market basket analysis, DMSS uses significant changes in the confidence of an association over time as an indicator of changes in the frequency of events of interest.
For example, four-drug-resistant Acinetobacter baumannii NIMs from lower respiratory specimens from patients in the medical intensive care unit (MICU) who were in floor location X 48 hours before specimen collection is an event that may occur with a very low frequency, say once every other month among the approximately 40 patients a month who are transferred between the two locations. If this event were to occur three times in one month among 40 patients, and this was statistically significant, DMSS would generate an alert describing the change. The frequent set for this event is

{R-drug1, R-drug2, R-drug3, R-drug4, A. baumannii, NIM, lower respiratory, MICU, locationX-48}

Because each of the nine items can be placed on the right or left side of an association rule, 2^9 = 512 rules can be generated from it. DMSS generates all rules, but uses rule templates, such as "keep all resistance traits together," to prune rules that are relatively uninformative, such as {R-drug2, NIM} → {R-drug1, R-drug3, A. baumannii, lower respiratory, MICU, locationX-48}. The event described above has the association rule:

{MICU, locationX-48} → {R-drug1, R-drug2, R-drug3, R-drug4, A. baumannii, NIM, lower respiratory}
with a confidence history something like {month1: 0/41, month2: 1/38, month3: 0/39, month4: 1/42, month5: 3/40}. When the history is broken into two parts, for example {months 1–4} and {month 5}, and the average confidence differs significantly between the parts, then DMSS generates an alert. The comparison of confidences is a comparison of two proportions, which can be accomplished with a Fisher's exact test or a χ² test [7].
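A minimal Python sketch of that comparison, pooling months 1 through 4 against month 5 of the confidence history above and applying SciPy's Fisher exact test, is shown below; the pooling choice and the idea of an alert threshold are illustrative assumptions, not DMSS behavior.

from scipy.stats import fisher_exact

# Confidence history as (events, denominator) per month.
history = [(0, 41), (1, 38), (0, 39), (1, 42), (3, 40)]
baseline, recent = history[:4], history[-1]

base_events = sum(e for e, n in baseline)         # 2
base_nonevents = sum(n - e for e, n in baseline)  # 158
recent_events, recent_n = recent                  # 3 of 40

table = [[base_events, base_nonevents],
         [recent_events, recent_n - recent_events]]
odds_ratio, p_value = fisher_exact(table)
print(p_value)  # flag an alert if p_value falls below a chosen threshold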
Unfortunately, data mining systems usually generate too many, often redundant, patterns, and schemes must be used to reduce the pattern load on the end user. In the example above, the nine-item frequent set also has 2^9 frequent subsets, all of which are generated in naïve schemes, and all of which have 2^(their number of items) association rules. Even after rule templates are applied, several related alerts are redundant. To address this problem, DMSS uses an alert clustering scheme to select only the most descriptive alert for presentation [7]. Conceptually, given two alerts A and B with A_rule = A_L → A_R and B_rule = B_L → B_R, and B_L ⊆ A_L and B_R ⊆ A_R, if the data that satisfy A are removed from B, which they also satisfy, and the resultant B is not an alert, then A captures B. For example, if A_rule = {MICU, locationX-48} → {R-drug1, R-drug2, R-drug3, R-drug4, A. baumannii, NIM, lower respiratory} and A_conf_hist = {month1: 0/41, month2: 1/38, month3: 0/39, month4: 1/42, month5: 3/40} and B_rule = {MICU, locationX-48} → {R-drug3, R-drug4, A. baumannii, NIM} and
B_conf_hist = {month1: 0/41, month2: 1/38, month3: 0/39, month4: 1/42, month5: 3/40}, identical to A_conf_hist, then A captures B. A is more descriptive than B, and B without A is nonexistent. For cases where A_conf_hist ≠ B_conf_hist, a difference history can be analyzed using a test of two proportions [7].
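One way to picture the capture test is as a re-test of the broader alert on only those records that do not also satisfy the more descriptive rule. The Python sketch below illustrates that idea under simplified assumptions (records held as attribute sets, and a caller-supplied significance test); it is not the DMSS clustering code.

def satisfies(record, itemset):
    # A record (a set of attribute values) satisfies an itemset if it contains every item.
    return itemset <= record

def captures(rule_a, rule_b, records, is_alert):
    # Each rule is (lhs, rhs) as frozensets; is_alert(subset, rule) re-runs the
    # confidence-change test on a subset of records and returns True or False.
    a_items = rule_a[0] | rule_a[1]
    remaining = [r for r in records if not satisfies(r, a_items)]
    # A captures B if B no longer alerts once the records supporting A are removed.
    return not is_alert(remaining, rule_b)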
Once alerts are reviewed, investigative action must be taken for effects to occur. Sometimes just showing up in the right place at the right time with the right information elicits the necessary changes in process to correct the underlying problem, even if explicit defects are never discovered. Because outbreak epidemiology is complex, alerts should be viewed as windows into broken systems. Without an alert, the need to look would not exist. Additionally, process improvements reduce alert frequency, and improvement patterns can be generated. These should be used to follow and enhance compliance with improvement recommendations.

Operational considerations
Although DMSS is a complex system and a full description is beyond the
scope of this article, a few distilled operational principles and challenges are
worth discussing.
Data collection
DMSS requires usable data. The garbage-in, garbage-out adage is applicable. Data are obtained from the laboratory information system (LIS), the admit-discharge-transfer (ADT) system, and the hospital census system. Clinical laboratory data, especially clinical microbiology data, are poorly structured, however, and contain free text and natural language.

Clinical microbiology data (including molecular testing) and infection-associated serology data from the LIS can be obtained in three ways: custom-built LIS queries, printed reports, and HL7 messages. Custom queries can be built to specification (content and presentation can be controlled), but they require programmer resources to construct, and their results must be checked against gold-standard results (usually printed reports) for completeness.

Printed reports from the LIS, specifically those used to present information to clinicians, are mostly complete, but may suppress results that are selectively reported (eg, imipenem susceptibility in Pseudomonas aeruginosa). Suppressed results limit the ability of frequent set and association rule algorithms to detect relationships that may exist, but these limitations are usually not significant because the results are not suppressed in cases in which the information is most useful (eg, imipenem resistance in the presence of aminoglycoside and cephalosporin resistance). Printed reports can also change format with LIS upgrades, the introduction of new tests, and the removal of discontinued tests. For these reasons, structure and content must be actively monitored for change. Printed reports can be readily
obtained in file format from printer queues (usually custom queues established expressly for file retrieval), but need to be parsed to load the data into a database. Tools such as Monarch Data Pump (www.datawatch.com) can be useful.
Clinical microbiology HL7 messages are often poorly structured but are
readily available from HL7 routers in most hospitals. Their modeling and
parsing, however, require considerable sophistication. Print-structured
data are often simply embedded in message segments, and therefore all challenges and considerations of print report modeling apply to HL7 message
modeling. Additional challenges exist, such as identifying and modeling
only applicable messages. DMSS obtains LIS data by HL7 messages.
Patient movement and census data are obtained from two sources: HL7
ADT messages and electronic census reports. Although ADT messages are
rich in content, near real-time, and precise, they are transaction based and
occasional message omission is not uncommon. For example, if for some
reason a discharge message is not generated for a patient, the patient
appears to never leave the hospital. For this reason, DMSS uses census
reports obtained throughout the day to reconcile ADT data errors.
Data cleaning/normalization
Once data are obtained using one of the three mechanisms above, they must be loaded into a database, quality checked, and mapped. Database design and population are beyond the scope of this article (see the article by Lyman and colleagues, elsewhere in this issue, for a general discussion of database design for mining), but once data files are retrieved and checked to make sure their sizes are within normal limits and their data are from the time periods expected, data can be loaded and mapped. Mapping requires, among other things, that "SA," "S. AURIUS," "STAPH AUREUS," and so forth all be mapped to Staphylococcus aureus. Original data are also maintained. Cardinal Health DMSS databases contain data from more than 250 hospitals and have hundreds of mappings to single organisms and specimen sources. For example, there are hundreds of terms for blood specimens, including ones with misspellings of "blood." The management of term mapping alone requires pattern recognition and quality assurance systems. After terms are mapped, data must be checked again to make sure certain common specimens, tests, and organisms exist within statistical limits.
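A much-simplified Python sketch of this kind of term normalization is shown below; the variant spellings and the mapping table are invented for illustration, and mapping at production scale also requires fuzzy matching and quality assurance review of unmapped terms.

ORGANISM_MAP = {
    "SA": "Staphylococcus aureus",
    "S. AURIUS": "Staphylococcus aureus",    # misspelled variant seen in source data
    "STAPH AUREUS": "Staphylococcus aureus",
    "E COLI": "Escherichia coli",
}

def map_organism(raw_term):
    # Return the canonical organism name while keeping the original term for audit.
    canonical = ORGANISM_MAP.get(raw_term.strip().upper())
    return {"original": raw_term, "canonical": canonical or "UNMAPPED"}

print(map_organism("staph aureus"))  # maps to Staphylococcus aureus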
The next step is to impart additional meaning to the data. For example,
NIM criteria are applied so that NIs, community-acquired infections, and
specimen contamination proxies (indicators) are computed. If this information were not imparted to the data before pattern analysis, patterns would
less reliably describe nosocomial versus community-acquired infections versus colonization or contamination. Electronic proxies for these clinical and
laboratory states, like the NIM, add value to the data and make data mining
more productive. Once data are annotated with these proxies, they can be
analyzed.
Frequent set and association rule analysis and alert generation
FS/AR analyses generally work as described above, but are typically fraught with complexity for the inexperienced practitioner. Time partitioning of the data, the organization of association rules obtained from each partition, and the ability to track changes among rules all need to be handled. Once rules are stored along with their confidences in time, rules whose confidences change significantly between two single or aggregate time periods compose alerts. Rules whose confidences are changing insignificantly are ignored. Alert clustering reduces alert volume by a factor of two to four and is yet another tool used to reduce pattern overload. All data mining steps, from data selection to pattern presentation, need to be designed with this problem in mind. Generating too many statistically significant but often meaningless or redundant patterns leads to user exhaustion and project failure.

The final step of data mining is report preparation. In DMSS, reports are prepared from clustered patterns by domain experts who select patterns by their usefulness. Pattern usefulness, or interestingness, is a function of clinical significance and actionability, and includes an estimate of how much information the end user can use efficiently. These are largely subjective measures that are difficult to code explicitly, but through experience we know that end users do well with 5 to 10 patterns a month, about one tenth of all clustered patterns (Table 1). DMSS pattern reports are currently presented monthly.

Data Mining Surveillance System results


DMSS identies new patterns of interest and detects known outbreaks in
historical data [7]. Patterns can be arbitrarily complex, and can describe
everything from slow changes in simple event frequency in large populations
(hospital-wide, for example), to location-specic outbreaks of 10-drug resistant A baumannii [7], to community outbreaks of infectious diarrhea [9].
Currently, more than 225 hospitals nationwide subscribe to Cardinal Health
Table 1
Monthly Data Mining Surveillance System statistics by hospital

Median
Interquartile
range

Inpatient
admits

Specimens

Tests

NIMs

CIMs

Clustered
patterns

Reported
patterns

1498
8092368

2728
13674289

3254
16045170

61
26112

245
157424

52
27.583

6
39

Specimens and tests are inpatient and outpatient.


Abbreviations: CIMs, community-acquired infection markers; NIMs, nosocomial infection
markers.

126

BROSSETTE & HYMEL

services that include DMSS pattern analysis, and from these hospitals more
than 20 DMSS-based abstracts have been presented at national conferences.

Future directions
In its current form, DMSS provides a practical illustration of the usefulness of data mining in health care. Access to additional electronic data could extend the model-building capabilities and usefulness of DMSS. For example, additional data about patient origin could allow models to describe or predict significant patterns from nursing homes, zip codes, counties, and so forth. Additional electronic data, such as surgical procedure, operating room, operative time, anesthesia scores, and wound class, could increase the descriptiveness of surgery-associated patterns. Antimicrobial use data or complete blood counts could increase the sensitivity and specificity of the NIM, even if only for specific subsets of patients. Any gains in pattern specificity and marker performance, however, add data acquisition costs and require additional effort for data validation and cleansing. These requirements must be matched by a corresponding increase in the clinical usefulness of alerts and reports to justify additional development.

References
[1] Emori TG, Edwards JR, Culver DH, et al. Accuracy of reporting nosocomial infections in intensive-care-unit patients to the National Nosocomial Infections Surveillance System: a pilot study. Infect Control Hosp Epidemiol 1998;19:308–16.
[2] Brossette SE, Hacek DM, Gavin PJ, et al. A laboratory-based, hospital-wide, electronic marker for nosocomial infection. Am J Clin Pathol 2006;125:34–9.
[3] Kahn MG, Steib SA, Fraser VJ, et al. An expert system for culture-based infection control surveillance. Proc Annu Symp Comput Appl Med Care 1993;171–5.
[4] Mitchell T. Machine learning. McGraw-Hill; 1997.
[5] Brossette SE, Sprague AP, Hardin JM, et al. Association rules and data mining in hospital infection control and public health surveillance. J Am Med Inform Assoc 1998;5:373–81.
[6] Brossette SE, Moser SA. Application of knowledge discovery and data mining to intensive care microbiologic data. Journal of Emerging Infectious Diseases 1999;5:454–7.
[7] Brossette SE, Sprague AP, Jones WT, et al. A data mining system for infection control surveillance. Methods Inf Med 2000;39:303–10.
[8] Peterson LR, Brossette SE. Hunting healthcare-associated infections from the clinical microbiology laboratory: passive, active, and virtual surveillance. J Clin Microbiol 2002;40:1–4.
[9] Peterson LR, Hacek DM, Rolland D, et al. Detection of a community infection outbreak with virtual surveillance [letter]. Lancet 2003;362(9395):1587–8.

Clin Lab Med 28 (2008) 127–143

Data Mining for Biomarker Development: A Review of Tissue Specificity Analysis
Eric W. Klee, PhD
Division of Experimental Pathology, Department of Laboratory Medicine and Pathology,
Mayo Clinic, 200 1st Street SW, Stabile 2-50, Rochester, MN 55905, USA

Tissue-specific expression profiling is the simultaneous measurement of the expression of thousands of genes in a target tissue or organ, and it can play an important role in biomarker development and in filling clinicians' unmet needs for improved marker-based assays. The National Cancer Institute's report on 2008 research, The Nation's Investment in Cancer Research, states that identifying candidate biomarkers that successfully translate into improved diagnostic and prognostic assays is a priority objective [1]. Such biomarkers can impact patient care and outcome through disease screening, early detection, risk stratification, and prediction of disease recurrence. Additionally, with the increasing trend toward individualized medicine and personalized treatment, theragnostic biomarkers are needed to identify patients responsive to specific therapies and to track the therapeutic effects [2,3]. To discover and bring to clinic these novel assays requires a data-driven development cycle that incorporates data-mining steps for discovery, qualification, verification, and validation [4].

Biomarker development has leveraged the enormous data content obtainable from high-throughput genomic and proteomic technologies to identify novel candidates, effectively transforming this research into a practical exercise in data mining. Genomic discovery using microarrays and proteomic discovery using mass spectrometry are becoming the primary methods for initial candidate marker selection [5–7]. The identification of differentially expressed candidate markers using these techniques is proving to be only the first step in the biomarker development process, however. Further marker characterization and prioritization are needed to increase the chances of these candidates being validated and translated into clinical assays.

E-mail address: klee.eric@mayo.edu


0272-2712/08/$ - see front matter 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.009
labmed.theclinics.com
The data mining of public transcriptomic databases to define tissue-specific expression profiles of biomarkers provides one way to do this. This article reviews four databases that were selected because they contain different types of transcriptomic data and provide interfaces for profiling tissue-specific gene expression. This review is followed by a discussion of eight tissue-specificity studies, with detailed descriptions of the data-analysis methods and metrics used to identify and quantify tissue specificity. The article concludes with a short summary of biomarker development projects that have mined the transcriptomic databases and applied tissue-specificity metrics.

Public transcriptomic databases


Transcriptomics is the study of the complete set of mRNA transcripts expressed in a tissue or cell type at one time. High-throughput sequencing methods have facilitated transcriptomic studies by generating an abundance of data describing the expression of gene transcripts within various human tissues and disease states. These data are maintained in large, public transcriptomic databases that are available to researchers for analysis. In this section, four representative databases are described to familiarize readers with the data content and interfaces available. The chosen databases each house data generated by a different expression-profiling technology, including Expressed Sequence Tag (EST) [8], Serial Analysis of Gene Expression (SAGE) [9], Massively Parallel Signature Sequencing (MPSS) [10,11], and microarray expression analysis. Although other public databases contain these data types, these databases were selected because they provide direct and clear tools for generating tissue-specific gene-expression profiles.
Unigene
Unigene is a National Institutes of Health (NIH)-sponsored database of assembled EST transcripts [12]. The database is designed to provide an organized collection of dbEST [13] sequences, processed and clustered to be associated with identifiable genes and presented in a format useful to researchers. ESTs are partially sequenced cDNAs obtained by reverse transcription of mRNA and provide a snapshot of gene expression at the time of cDNA synthesis. ESTs generally are sequenced using a single-pass approach, are 3′-biased, contain inaccuracies (approximately 2%), and are highly redundant [8]. Unigene predates the fully sequenced human genome, and previously it provided one of the most comprehensive data sets for gene discovery research. It continues to provide a compendium of information on gene expression and expression characteristics within tissue, disease states, and developmental stages, by estimating gene-expression levels using EST clone frequency in different libraries [14]. A systematic assembly method is used to assign short EST fragments to consensus sequence clusters that
are anchored at the 3′ end of a gene. The present build of the database (#201, updated 03/01/2007) contains sequence data for 77 species, including more than 6,694,833 human transcripts in 124,179 clusters.

The Unigene interface for tissue-specificity analysis is well suited for single-gene queries. Each EST cluster entry provides a Gene Expression Summary section that includes links to the cluster Expression Profile. Within the profile, a numeric and graphic display of the cluster expression in different tissue libraries is provided. Three distinct profiles are provided, illustrating the relative cluster expression in different normal tissues, in various health states, and in independent developmental stages. When more than half of a cluster's entire EST counts are assigned to a single profile state, that state is labeled as a restricted expression state. For example, PSA (KLK3), a well-known prostate cancer serum biomarker, shows restricted expression in prostate, prostate cancer, and adult developmental stage. Unigene also can be queried for specific tissue-expression profiles using the Digital Differential Display (DDD) tool, which generates information on tissue-specific distribution of expression defined by the EST library of origin. DDD uses the Fisher Exact Test to compare the distribution of transcripts in two different libraries, thereby allowing users to compare expression in two libraries (tissue types) of interest. Analysis is restricted to deeply sequenced libraries (>1000 sequences) to ensure validity of results. Overall, Unigene provides a rich information resource for biomarker data mining, generating gene-expression profiles across tissue, disease, and development stage strata.
The Cancer Genome Anatomy Project
SAGE is a transcriptional profiling technique that surveys short 3′ oligonucleotides derived from mRNA to quantify gene expression. High-throughput, highly redundant sequencing generates millions of transcripts that provide information on gene expression that is complementary to traditional EST methods. The Cancer Genome Anatomy Project (CGAP) of the NIH maintains a large database of SAGE sequences [15]. An internal analysis pipeline filters out erroneous data and maps SAGE transcripts to gene sequences. Based on the relative number of transcripts mapped to a gene and the disease or tissue characteristics of the originating cellular library, tissue-specific and disease-specific expression profiles can be generated.

Within CGAP, the SAGE Genie is used to mine the database and generate several tissue-specific expression profiles. The SAGE Anatomical Viewer creates a list of best SAGE tags for a specific gene. For any tag, three different visualizations of the data can be selected. The Ludwig Transcript Viewer provides a positional map of SAGE transcripts on the target gene, the Digital Northern provides a frequency-sorted list of tag occurrence in SAGE libraries, and the Anatomical Viewer generates three heat-map
pictographs (tissues, cell lines, and both), displaying the relative expression level of a gene in 24 normal and cancer tissues. Additionally, the SAGE Genie toolset includes the Differential Gene Expression Displayer and Experimental Viewer, which compare the relative expression of tags between two user-defined libraries, in the same manner as the Unigene DDD tool.

The CGAP database continues to grow as SAGE transcripts from recent research projects are added to it. For example, a study on glioblastomas generated 116,259 new SAGE tags that were summarily deposited in CGAP [16]. The study investigators then used the SAGE Genie tool suite to evaluate their expression data and identify several genes highly specific to this disease, demonstrating the power of the CGAP database.
The Ludwig Institute for Cancer Research–Massively Parallel Signature Sequencing data
MPSS is a technique based on a novel cloning and sequencing method employing enzymatic digestion and hybridization of short polyadenylated transcripts using microbeads [10]. This method is advantageous because it has a large dynamic range (10^5-fold differences) and consequently can generate a richer gene-expression data set. The Ludwig Institute for Cancer Research (New York, New York) developed an extensive MPSS database with expression data on 32 normal human tissue types [17]. The database is Web accessible and can be queried for gene-expression profiles across all tissue types for individual genes and gene lists, as well as for expression in restricted tissue types. This database provides a valuable resource for assessing the nominal expression characteristics of a gene but does not provide any information on expression characteristics in cancer/disease states.
SymAtlas V1.2.4
The SymAtlas is a microarray-expression database supported by the Genomics Institute of the Novartis Research Foundation (GNF) (Cambridge, Massachusetts) [18]. Custom human microarray chips, designed and fabricated by Affymetrix (Santa Clara, California), were used to measure gene expression in 46 normal human tissues. All samples were prepared and analyzed using a standard protocol to enable accurate comparisons between experiments. The database can be searched in a targeted manner for single genes, and a graphic display of expression across tissue types is returned, along with a set of textual annotations including basic gene characteristics, gene ontology-linked functional descriptors, associated transcripts and proteins, a textual summary of the gene, a list of corresponding probe sets, and a list of relevant literature citations. The graphic display gives a relative measure of expression for the target gene in all tissues. In addition to single-gene queries, users can also do batch queries for general expression profiles. The batch query form allows users to define unique expression thresholds (fold change relative to median) for all tissue types, combined
by a logical AND operator. Unlike the three previously described databases, in which data are derived from short oligo-sequencing technologies, GeneAtlas is based on expression information obtained from customized hybridization arrays created by Affymetrix.
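The batch-query logic can be pictured as a logical AND of per-tissue fold-change filters, as in the following Python sketch; the tissue names, thresholds, and expression values are illustrative and are not drawn from SymAtlas itself.

import numpy as np

tissues = ["prostate", "liver", "kidney", "lung"]
# Rows are genes, columns are tissues; values are expression intensities.
expression = np.array([
    [900.0, 40.0, 35.0, 30.0],    # gene A
    [120.0, 110.0, 100.0, 95.0],  # gene B
])

median_per_gene = np.median(expression, axis=1, keepdims=True)
fold_change = expression / median_per_gene

# Require >= 10x the median in prostate AND <= 2x the median in liver.
mask = (fold_change[:, tissues.index("prostate")] >= 10) & \
       (fold_change[:, tissues.index("liver")] <= 2)
print(mask)  # [ True False ]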

Tissue-specific expression-profiling methods


A number of studies have developed methods for analyzing the previously described data sets and generating rigorous metrics of tissue specificity. In this section, eight methods are discussed in detail (Table 1). Each method uses a unique statistical test or set of decision criteria for identifying genes with tissue-specific expression profiles.
Selective expression
The selective-expression approach is based on an algorithm for identifying genes with selective expression, defined as significantly different expression in one tissue compared with other tissues, based on four basic properties: intensity, source, source-set, and source-confidence [19]. The methods used are designed to be adaptable to different data types, including microarray data and EST, SAGE, or MPSS library transcript counts. The intensity parameter denotes the expression value for a given gene in a tissue type denoted by the source parameter. The source-set parameter allows sources to be aggregated into subsets of interest. The source-confidence parameter allows the inclusion of a quantifiable metric for weighting the quality of a measurement in the specificity analysis. For example, it can be used to correct for differences in cDNA library sequencing depth (sampling size).

Table 1
Methods for computing tissue specificity metrics from large gene-expression data sets

Study                            Data source        Specificity metric/test
Selective expression             EST, SAGE, MPSS    Dixon discordance test
TissueInfo                       EST                Expression ratio and rule-based decision tree
ExQuest                          EST                TSU
GEPIS                            EST                DEU - Z-statistic
Shannon Entropy                  EST, microarray    Shannon Entropy and Q-statistic
Akaike's Information Criterion   Microarray         Akaike's Information Criterion
Tissue selectivity               Microarray         Honest Significant Difference
ROKU                             Microarray         Shannon Entropy and Akaike's Information Criterion

Abbreviations: DEU, digital expression units; EST, Expressed Sequence Tag; GEPIS, Gene Expression Profiling in silico; SAGE, Serial Analysis of Gene Expression; MPSS, Massively Parallel Signature Sequencing; TSU, tissue-specific units.
Statistical significance of selective expression is measured using the Dixon discordance test for uniform distributions. The data analysis includes the following steps:

1. Filter the data points for high-quality measurements.
2. Verify a minimum number of data points passing step one for each source.
3. Apply the quantitative test of discordance and use the statistical significance score to determine if scores are reliable.
4. Adjust scores computed in step three for the baseline expression level of nonvariant genes.
5. Compute the minimum intensity gap (the separation between the largest intensity and the second largest intensity).
6. Compute an overall confidence for the selective expression parameter by combining steps four and five to provide an overall confidence level for the selectivity, defined by the numeric evaluation of parameters in a single decision function.

Step four of the process allows the data to be adjusted to reduce the significance of discordant values as the baseline expression level approaches saturation for the measurement system used. Step five introduces a level of robustness to the discordant values in step six by noting if the minimum intensity gap approaches the resolution power of the measurement system used.
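A toy Python sketch of the gap-and-range arithmetic behind a Dixon-style discordance check is shown below; the intensity values and the critical value are placeholders, and the exact test statistic, corrections, and thresholds of the published algorithm should be taken from [19].

def discordance_ratio(intensities):
    # Dixon-style ratio: gap between the two largest values over the full range.
    ordered = sorted(intensities, reverse=True)
    gap = ordered[0] - ordered[1]            # the minimum intensity gap of step five
    value_range = ordered[0] - ordered[-1]
    return gap / value_range if value_range else 0.0

profile = {"prostate": 850.0, "liver": 42.0, "kidney": 37.0, "lung": 31.0}
ratio = discordance_ratio(profile.values())
CRITICAL = 0.5   # placeholder threshold, not the published critical value
print(round(ratio, 3), ratio > CRITICAL)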
The decision metric described in this approach is designed to be universally applicable to any data source capable of reporting tissue-specific expression profiles. The six-step analysis method computes a numeric value for tissue selectivity that can be configured to the technique used to generate the data, to prevent erroneous predictions. The algorithm has been used in several studies to identify tissue-selective expression [20–24].
TissueInfo
The TissueInfo method and interface were constructed to provide tissue-specific expression profiles for a target gene sequence on the basis of EST sequence counts [25]. TissueInfo uses the Basic Local Alignment Search Tool (BLAST) [26] and MegaBLAST [27] sequence comparison programs to align individual EST sequences obtained from dbEST to a user-submitted query sequence. Based on the number of ESTs matching the query sequence and the associated annotation of the matching EST source libraries, the query sequence is annotated as (1) expressed in a tissue, (2) specifically expressed in a tissue, or (3) tissue specific. To be labeled as expressed in a tissue, the query sequence need only have a single matching EST sequence derived from a library constructed from that tissue type. To be specific to a tissue, the number of ESTs from the tissue in question matching the query sequence divided by the total number of ESTs matching the query sequence must be greater than a user-defined threshold for specificity (the authors use
0.95). To be tissue-specific, the query sequence can be expressed only in a maximum of two tissues, and the ratio of ESTs matching the query sequence coming from a pooled tissue set over all the ESTs matching the query sequence may not exceed one minus the threshold for specificity. This method for determining tissue specificity is a straightforward and intuitive ratio test but fails to incorporate statistical tests of significance included in several of the other methods described.
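The three labels reduce to simple count ratios, as in this Python sketch of the decision rules described above; the EST counts are invented, the 0.95 threshold follows the value quoted in the text, and the treatment of a pooled-tissue category is an assumption made for illustration.

def tissueinfo_labels(est_counts, specificity=0.95):
    # est_counts: dict of tissue -> number of ESTs matching the query sequence.
    total = sum(est_counts.values())
    expressed = [t for t, n in est_counts.items() if n > 0]
    labels = {"expressed_in": expressed, "specific_to": [], "tissue_specific": False}
    if total == 0:
        return labels
    for tissue, n in est_counts.items():
        if n / total > specificity:
            labels["specific_to"].append(tissue)
    pooled = est_counts.get("pooled", 0)
    if len(expressed) <= 2 and pooled / total <= 1 - specificity:
        labels["tissue_specific"] = True
    return labels

print(tissueinfo_labels({"prostate": 96, "liver": 2, "pooled": 0}))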
A unique characteristic of TissueInfo is that it computes tissue-specific expression values for both gene- and protein-query sequences by changing the sequence comparison algorithm used. The authors leveraged this characteristic to test the tissue-specificity predictions for a set of protein sequences obtained from the SwissProt database (http://ca.expasy.org/sprot/) with annotated Tissue Specificity. This testing showed the program was 69% accurate in predicting whether a query sequence was specific to (expressed predominantly in) a tissue and 80% accurate in predicting whether a query sequence was expressed in a tissue [25]. It is possible the accuracy estimates are lower than the true program accuracy, because it could be questioned whether the Tissue Specificity annotations for the proteins evaluated are based on sufficient experimental depth or tissue source breadth to be properly comparable with the range of tissues sampled by EST sequences and used in the TissueInfo calculations.

TissueInfo is hosted by the Weill Medical College of Cornell University (http://icb.med.cornell.edu/crt/tissueinfo/webservice.xml) and has been maintained with current updates to the underlying sequence sets. The interface allows users to generate a tissue-expression profile for a specific sequence or search the database for genes that are expressed or predominantly expressed in designated tissues.
ExQuest
The ExQuest program maps ESTs to a target sequence and uses the underlying EST library annotations to compute tissue-specificity profiles [28]. Sequence mapping is performed by MegaBLAST. Based on these comparisons, the actual number of ESTs matching the query sequence for each tissue category is recorded. These values subsequently are normalized for the relative depth of the underlying EST library sampling, using an Expected Hit metric. This metric is calculated for each tissue category by multiplying the ratio of the number of ESTs from a given tissue type over the total number of ESTs included in the ExQuest database by the total number of ESTs matching the query sequence. The result is an estimated EST hit rate for each tissue based completely on the underlying EST library sizes. Finally, for each tissue, the actual EST hits are divided by the expected EST hits to obtain a normalized metric of expression called tissue-specific units (TSU). Based on the distribution of TSU values across tissues, users can identify tissue-specific expression patterns for the query sequence.
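A short Python sketch of the Expected Hit and TSU arithmetic as described (the library sizes and hit counts are invented for illustration):

# Total ESTs per tissue category in a hypothetical ExQuest-style database.
library_sizes = {"pancreas": 120_000, "liver": 300_000, "brain": 180_000}
# ESTs from each category that matched the query sequence.
actual_hits = {"pancreas": 42, "liver": 6, "brain": 2}

total_ests = sum(library_sizes.values())
total_hits = sum(actual_hits.values())

tsu = {}
for tissue, size in library_sizes.items():
    expected = (size / total_ests) * total_hits   # Expected Hit metric
    tsu[tissue] = actual_hits[tissue] / expected  # tissue-specific units
print(tsu)  # values well above 1 indicate enrichment relative to library size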
ExQuest does not incorporate any statistical test for tissue specificity. The TSU metric provides an interesting and simple approach to normalizing expression measurements between libraries that are sampled to different extents. It should provide a fair estimate of tissue-specific expression levels, provided the analysis includes no libraries that are sampled at a very low level.

ExQuest provides a unique hierarchical organization of tissue categories embedded in the program. EST libraries are organized into hierarchical tissue bins in which related libraries are grouped together, thereby increasing the overall transcript count per tissue category. The hierarchical organization is designed to provide users with three levels of organization for querying specificity: at the primary tissue, the secondary tissue, and the individual EST library. For example, a primary tissue would be the pancreas, a secondary tissue would be the islet of Langerhans, and the individual EST library would be a single EST library derived from the islet of Langerhans. This hierarchical organization gives users the flexibility to balance sampling depth in a category with more exact feature specificity. A second unique feature of ExQuest is that users can define the degree of similarity by which EST sequences are mapped to gene entities by selecting the percentage alignment parameter used by MegaBLAST. The authors demonstrate that varying this parameter allows related paralog sequences to be differentiated from each other. Finally, ExQuest also provides an interesting interface option to view chromosomal mapping of sequence-associated ESTs. This feature enables users to zoom in and out of a chromosomal map while selecting specific tissues for which matching ESTs will be displayed. This interface lets users evaluate tissue-specific expression patterns in a positional context and associate these patterns with established genomic effects. The ExQuest program is available through a Web server (http://lena.jax.org/wdcb/ensRNA/exquest.html) but has not been updated since 2004 (Derry Roopenian, PhD, personal communication, Bar Harbor, Maine, 2007). The hierarchical grouping of libraries is a unique and useful feature not explicitly provided in other tissue-specificity methods, allowing users to adapt the analysis to the level best suited to their biomarker study. It is unfortunate the system is not in active development, but even in its present state it may be a useful resource for investigators.
Gene Expression Profiling in silico
The Gene Expression Profiling in silico (GEPIS) method uses EST sequence mapping to compute tissue-specific expression in 43 tissue types [29]. Before computing specificity, the GEPIS system executes a rigorous data-quality filtering protocol on the ESTs and EST libraries obtained from dbEST. All ESTs obtained from libraries with tissue-source annotations of unknown, ambiguous, or pooled tissue type are excluded from the analysis. EST libraries that have been normalized or subtracted and
EST libraries derived from fetal or embryonic tissues also are excluded from
the analysis. Finally, the authors also eliminated several EST libraries that
were deemed to have been misannotated according to their expression analysis. The remaining EST sequences were assigned to gene sequences using
BLAST comparisons.
GEPIS computes EST library normalized expression values, called
digital expression units (DEU), for all genes in all tissues. The DEU is
equal to the number of ESTs assigned to a gene in a tissue, divided by the
total ESTs in the tissue, and multiplied by a scaling factor of 1,000,000.
A Z-statistic then is used to compare any two tissue categories and determine statistical significance. The Z-statistic is computed by Equation 1, and the result is compared with a normal distribution to obtain a p-value.

$$Z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}\left(1-\hat{p}\right)\left(\frac{1}{N_A}+\frac{1}{N_B}\right)}} \qquad (1)$$

Where p̂_A and p̂_B are the DEU values for tissues A and B, p̂ is the DEU for the gene of interest across all tissues, and N_A and N_B are the total numbers of EST sequences from tissues A and B. For a given gene, the Web interface returns a graphic representation of the tissue-specific expression profile. It also provides a chart containing raw EST counts and DEU values for both normal and cancer tissue libraries, a p-value based on the Z-statistic comparing the normal and cancer DEU values within a tissue, and the highest p-value obtained from the Z-statistics comparing the gene expression in a normal tissue with the expression in each of the other normal tissues.
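The DEU normalization and the two-proportion comparison of Equation 1 can be sketched in Python as follows; the counts are invented, and the sketch uses the raw proportions (DEU without the 1,000,000 scaling factor) inside the Z formula.

import math

def deu(gene_ests, tissue_total):
    # Digital expression units: gene ESTs / total tissue ESTs x 1,000,000.
    return gene_ests / tissue_total * 1_000_000

# Tissue A versus tissue B for one gene.
gene_a, n_a = 30, 200_000
gene_b, n_b = 5, 150_000

p_a = gene_a / n_a
p_b = gene_b / n_b
p_pooled = (gene_a + gene_b) / (n_a + n_b)

z = (p_a - p_b) / math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
print(deu(gene_a, n_a), deu(gene_b, n_b), round(z, 2))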
GEPIS, much like the ExQuest program [28], computes a Regional Atlas that graphically shows the expression of all genes in proximity to the gene of interest, across a user-defined range and number of tissue types. This feature is useful when investigating events such as cancer-induced copy-number effects. The program authors also attempted to verify GEPIS gene-expression profiles experimentally using quantitative polymerase chain reaction (qPCR) measurements of 40 genes over a range of expression levels in normal and cancer colon samples [29]. The analysis revealed strong correlation between the GEPIS expression level and qPCR expression level at all expression ranges (low, medium, and high). GEPIS recently has been revised and released as GeneHub-GEPIS [30]. The upgraded version now allows users to search the database using multiple database identifiers.
Shannon Entropy
Schug and colleagues used Shannon Entropy (generally designated H in information theory) to measure the overall specificity of a gene, that is, the degree by which a gene-expression profile differs from a ubiquitous expression profile [31]. This statistical measure of specificity provides
a single metric for assessing the complete gene-expression profile but does not provide any information regarding the tissues in which a gene may be specifically expressed. To obtain this information, a new statistic, Q, is introduced to measure what the authors label categorical specificity, or expression specific to a tissue type. This method was used to measure tissue specificity in both the GNF Gene Expression Atlas [18] microarray data and the Database of Transcribed Sequences EST data [32]. The microarray expression values were analyzed, without modification, as obtained from the Gene Expression Atlas. For the EST data, counts of EST sequences associated with a given gene in a given tissue were normalized into pseudo-counts
by Equation 2.

$$w_{g,t} = \frac{n_{g,t} + 1}{N_t + N_g} \qquad (2)$$

Where n_{g,t} is the number of EST sequences mapped to gene g in tissue t, N_t is the total number of EST sequences associated with tissue t, and N_g is the total number of genes. Relative expression values p_{t|g} were computed for all expression values by dividing w_{g,t} by the total expression for a gene across all tissues. Using the relative expression values, the Shannon Entropy was computed using Equation 3.

$$H_g = -\sum_{1 \le t \le N} p_{t|g}\,\log_2\!\left(p_{t|g}\right) \qquad (3)$$

Where N is the total number of tissues evaluated. A low entropy value reflects a highly specific gene-expression profile. The Q-statistic, for measuring categorical specificity, then is computed from the Shannon Entropy using Equation 4.

$$Q_{g|t} = H_g - \log_2\!\left(p_{t|g}\right) \qquad (4)$$

Where a Q-value of zero denotes expression restricted to that tissue only, and an increasing Q-value is indicative of a more ubiquitous expression profile.
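A Python sketch of Equations 2 through 4 for a single gene (the EST counts, library sizes, and gene total are invented):

import math

# EST counts for one gene across tissues, plus each tissue library's total ESTs.
n_gt = {"prostate": 40, "liver": 1, "kidney": 0, "lung": 1}
N_t = {"prostate": 50_000, "liver": 60_000, "kidney": 45_000, "lung": 55_000}
N_genes = 20_000   # assumed total number of genes

# Equation 2: pseudo-counts.
w = {t: (n_gt[t] + 1) / (N_t[t] + N_genes) for t in n_gt}
# Relative expression p(t|g).
total_w = sum(w.values())
p = {t: w[t] / total_w for t in w}
# Equation 3: Shannon entropy of the gene's expression profile.
H = -sum(p_t * math.log2(p_t) for p_t in p.values())
# Equation 4: categorical specificity Q for each tissue.
Q = {t: H - math.log2(p[t]) for t in p}
print(round(H, 3), {t: round(q, 3) for t, q in Q.items()})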
A comparison of specificity measurements between the microarray and EST data using the same gene set and tissue set showed the overall distribution of entropy across all genes was slightly depressed in the EST data set versus the microarray data set. This finding may reflect insufficient sampling associated with some EST sequence libraries, so that several genes seem to be expressed in a more specific manner than found when analyzing the microarray data set. Independent experimental measurements would be required to understand the discrepancies between the specificity measurements from the two data sources.

The method described in this article introduces a classic measure from information theory, Shannon Entropy, to detect tissue-specific gene-expression patterns and a novel statistic, Q, to measure tissue-specific expression.
By using these two statistics together, genes that are expressed in a tissue-selective manner can be identified and ranked by the overall measure of specificity, in a manner similar to that developed to analyze the Gene Logic (http://www.genelogic.com) microarray data set [33]. It would be interesting to apply the Shannon Entropy method to the GeneLogic data and compare the overall ranks and measures of gene specificity computed by the two different approaches. The method also is advantageous, because it is applicable to data from two different sources (EST and microarray).
Akaike's Information Criterion
Akaike's Information Criterion can be used to identify genes with markedly different expression in one or a few tissues relative to the gene expression in most tissue types [34]. The criterion originally was developed to identify an optimal model from a class of competing models [35] but has been adapted to the detection of outlier gene expression. The authors rationalize the use of this criterion because it operates without requiring a significance level to be selected, thereby providing an objective method of gene selection. In a study of mouse microarray data obtained from the Riken Expression Array Database [36], 49 samples from different normal tissues were evaluated with the criterion. After filtering out low-quality data and normalizing to a reference array, genes of interest were selected using a U-statistic, where a low score was desirable, as defined in Equation 5.


$$U = n\log\sigma + s\left(2 + \frac{p\,\log n!}{n}\right) \qquad (5)$$

Where (n + s) equals the total number of observations across tissue types, s equals the number of outlier candidates, and σ equals the SD across scores from the n tissues (but not the s tissues).
A U-statistic was computed for up to X(X − 1)/2 combinations of possible outliers, where X = 1 + (n + s)/2. X varied according to the number of data points (ie, changed if there were missing data points) and used a combination of outliers that were both up- and down-regulated. This method can be adapted easily to the analysis of any large data set (eg, microarray data) for which expression ratios in multiple tissues are available or computable. The method is fairly straightforward and easy for other users to implement, and the authors believe that the lack of a p-value for specificity is an advantage. It does, however, require a significant amount of computation, up to N·X(X − 1) calculations, where X is the number of tissue types evaluated and N is the number of genes evaluated.
Tissue selectivity
A robust method to calculate tissue selectivity uses the GeneLogic microarray expression data set [33]. The methods were designed to identify
tissue-selective gene expression. Tissue-selective expression is defined as modified gene expression in one or a few biologically similar tissue types and is distinguished from tissue-specific expression, defined as gene expression restricted to one and only one tissue type. The Tukey-Kramer honest significant difference (HSD) test is used to compute pairwise comparisons of a gene's expression between all tissue types. The HSD test generates an adjusted p-value and Q-parameters for each comparison; the Q-parameter is equivalent to the absolute difference of two group means over the sample size-adjusted SD. Tissue-selective genes were identified subsequently using two rules based on the HSD p-values. First, 91 of 97 (N = 98) pairwise comparisons were required to have a p-value below the user-defined significance level. Second, the first rule could not be true for a gene in more than six different tissue types. Additionally, the authors used the HSD Q-scores to compute an overall probability value, called the enrichment score, for a gene, to provide a measure of error control. This calculation was
computed as shown in Equation 6.

ES = [1/(N − 1)] · Σ (i = 1 to N) [1 − (Qi − Min(Q)) / (Max(Q) − Min(Q))]     (6)

where N = k(k − 1)/2 and k equals the number of tissues evaluated. This method for assessing tissue-specific expression patterns is advantageous for two main reasons. The authors employed an expression-comparison test that includes statistical rigor to reduce erroneous predictions. The authors also defined a method of tissue-specific expression analysis that is flexible enough to allow a gene to be expressed specifically in a few tissues, not in just a single tissue. Additionally, users can adapt the selection criteria easily to increase the stringency or flexibility in defining exactly what tissue-selective expression is.
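For illustration, the enrichment score of Equation 6 can be computed from a gene's HSD Q-parameters with a few lines of Python; the function name and the toy Q values below are hypothetical, and the sketch assumes only the rescaling-and-averaging form reconstructed above.

import numpy as np

def enrichment_score(q_values):
    # Enrichment score (Equation 6) for one gene.
    # q_values: HSD Q-parameters for all N = k(k-1)/2 pairwise tissue comparisons.
    # ES = 1/(N-1) * sum_i [1 - (Q_i - min(Q)) / (max(Q) - min(Q))]
    q = np.asarray(q_values, dtype=float)
    n = q.size
    q_min, q_max = q.min(), q.max()
    scaled = (q - q_min) / (q_max - q_min)   # rescale Q onto [0, 1]
    return np.sum(1.0 - scaled) / (n - 1)

# Toy example: a gene whose expression differs sharply in one tissue produces
# a few large Q values and many small ones, giving an ES close to 1.
print(enrichment_score([8.2, 7.9, 7.5, 0.4, 0.6, 0.3, 0.5, 0.2, 0.7, 0.4]))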
ROKU
ROKU is a method for analyzing microarray data to detect tissue-specific gene expression that combines the Shannon Entropy method [31] for ranking genes with specific expression patterns and the Akaike's Information Criterion method [34] for detecting specific tissues with expression outliers [37]. The authors rationalize the approach by claiming that the Q-statistic developed to complement the Shannon Entropy fails to acknowledge redundancy in the selection of genes specifically expressed in a tissue, and this redundancy can be handled better with an outlier selection criterion. Furthermore, combining these two techniques into a single method enables the detection of tissue-specific expression that is over- or underexpressed relative to the nominal gene expression. Consequently, the ROKU method is designed not to detect genes that are expressed in a small number of tissues and are not expressed in the remaining tissues but to detect genes that are


differentially expressed in a small number of tissues relative to a generally stable expression in most tissues. That is, the ROKU method is designed to identify genes expressed at significantly higher or significantly lower levels in a small number of tissues relative to the general background expression level of that gene, a process the authors call broad sense specificity [37].
The tissue-specificity calculations use the same equation for Shannon Entropy (H) as previously described in Equation 4. The only difference is that the ROKU method first transforms the microarray expression values, wg,t, by subtracting the one-step Tukey biweight and taking the absolute value. This processing effectively recenters the data and enables the Shannon Entropy metric to detect genes that are specifically expressed in a broad sense. The tissues in which genes are specifically expressed then are identified using the Akaike's Information Criterion for outlier detection as described in Equation 1.
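A minimal Python sketch of the ROKU-style preprocessing is given below. It recenters each gene's expression vector by a one-step Tukey biweight (the tuning constants c and eps are the conventional choices and are assumptions, not values stated here) and then scores the recentered vector with the Shannon entropy of Equation 4; the outlier-detection step is not reproduced.

import numpy as np

def one_step_tukey_biweight(x, c=5.0, eps=1e-4):
    # One-step Tukey biweight location estimate (conventional c and eps assumed).
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    u = (x - med) / (c * mad + eps)
    w = np.where(np.abs(u) < 1.0, (1.0 - u**2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def shannon_entropy(values):
    # H = -sum(p_t * log2 p_t) over the normalized expression vector (Equation 4).
    v = np.asarray(values, dtype=float)
    p = v / v.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def roku_entropy(expression):
    # Broad-sense specificity: recenter by the one-step Tukey biweight,
    # take absolute values, then score with Shannon entropy (lower = more specific).
    x = np.asarray(expression, dtype=float)
    w_prime = np.abs(x - one_step_tukey_biweight(x))
    return shannon_entropy(w_prime)

# A gene strongly under-expressed in one tissue still receives a low entropy
# after recentering, which raw entropy on the original values would miss.
print(roku_entropy([10, 10, 10, 10, 1, 10, 10]))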
A clear disadvantage in the ROKU method is that the one-step Tukey biweight transformation prevents the analysis from identifying binomial expression patterns, in which a gene is expressed at one level in half the tissues and at a different level in the remaining tissues. The transformation adjusts the expression vector so that the overall expression profile appears uniform. The ROKU method, however, has other advantages over the Shannon Entropy method and the Akaike's Information Criterion method, because it can detect tissue-specific gene expression and rank genes according to overall tissue specificity and also can identify genes that are specifically expressed in only one tissue and not simply in a small number of tissues.
Summary
The studies described in this article provide an introductory review of data-mining techniques that have been applied to large transcriptomic data sets to derive tissue-specific expression profiles. The methods were selected to illustrate different decision criteria and data preprocessing applied to both microarray and short oligonucleotide (EST, SAGE, MPSS) data sets. Several other studies on tissue-specificity analysis have applied a spectrum of methods to the problem, from simple fold-change [38] and t-test [39] decision criteria to more complex principal component analysis [40] and binary indexing [41] algorithms. In addition, several programs have been developed to improve user access and provide more intuitive tissue-profiling functionality to the large transcriptomic databases previously described [42-44]. These methods provide researchers with a range of gene-expression data and analysis tools that can drive or complement biomarker development projects.
Tissue-specific expression profiling has been discussed using four different types of data. Most analysis methods were developed to analyze a specific data type, although several demonstrated interoperability between different data types. When undertaking a biomarker development study, however,


the question remains as to which data type should be analyzed to benefit the study best. Clearly, this decision depends on the specific nature of the study and whether the study-specific tissue is sampled, and sampled sufficiently, in a given database. In some cases, the study target could fall within a very specific tissue type and may be restricted to a single data source, such as the EST database or the CGAP SAGE collection. In most cases, however, any one of the four data types probably will be able to provide pertinent information. To address whether this information is redundant or complementary, Huminiecki and Bicknell [45] evaluated the congruence of specificity data obtained from SAGE and EST data analysis and later compared these data with microarray data analysis. The initial study concluded that evaluating SAGE and EST data together provided a more accurate assessment of tissue specificity than could be obtained from either data set alone. In a later study, the correlation of specificity analysis by EST and SAGE data compared with that of microarray data was strong in tissues that were extensively sampled [46]. The study reported, however, that correlation was not strong between microarray and the EST and SAGE data types in tissues that had complex cellular composition or that had not been sampled extensively. This finding suggests that EST and SAGE libraries measure up to microarray analysis only in tissues that are deeply sampled within those libraries. None of the microarray analysis methods discussed here have addressed the possible effect of low-level microarray measurements on the robustness of the tissue-specificity predictions. Low-expression measurements often fall within the background level of an array technology, and any analysis dependent on these signals is potentially erroneous. Consequently, it would seem that for genes expressed at a low level or in a poorly sampled tissue, analysis of a single data type would be insufficient. Therefore, researchers should consider evaluating tissue-specific expression in multiple (or all available) data types to obtain the most comprehensive expression profile.
Tissue-specificity expression profiling has been used widely for biomarker discovery but is equally or more applicable to candidate biomarker characterization. The literature is filled with references to studies that have identified candidate biomarkers by mining transcriptomic data for genes differentially expressed in normal and cancerous tissue [47-55]. This type of candidate biomarker discovery can be undertaken using the data-mining methods described herein. It is the purpose of this review, however, to encourage investigators to consider using these data-mining methods to generate tissue-specific expression profiles that can complement existing efforts to discover candidate biomarkers. This approach has been used to identify candidate cardiac markers that are specifically expressed in the heart [55], to identify candidate prostate cancer biomarkers that are differentially expressed in normal and cancer tissues and also are selectively expressed in prostate [54], to identify candidate brain-injury markers specific to the brain [53], and to identify candidate bladder carcinoma biomarkers specific


to the bladder [48]. Furthermore, specificity analysis has been suggested as an important metric for identifying candidate serum cancer biomarkers [56]. In all these instances, tissue-specific expression profiling provides a method for further refining lists of candidate biomarkers and intelligently selecting an enriched set of candidates to move forward in the biomarker development cycle.
References
[1] National Cancer Institute 2007. The nation's investment in cancer research. A plan and budget proposal for fiscal year 2008. Pub. L. No. 92-218, NIH Publication No. 06-6090.
[2] Batchelder K, Miller P. A change in the market – investing in diagnostics. Nat Biotechnol 2006;24(8):922–6.
[3] Ozdemir V, Williams-Jones B, Glatt S, et al. Shifting emphasis from pharmacogenomics to theragnostics. Nat Biotechnol 2006;24(8):942–6.
[4] Rifai N, Gillette MA, Carr SA. Protein biomarker discovery and validation: the long and uncertain path to clinical utility. Nat Biotechnol 2006;24(8):971–83.
[5] Cho WC. Contribution of oncoproteomics to cancer biomarker discovery. Mol Cancer 2007;6(1):25.
[6] Bharti A, Ma PC, Salgia R. Biomarker discovery in lung cancer – promises and challenges of clinical proteomics. Mass Spectrom Rev; 2007.
[7] He YD. Genomic approach to biomarker identification and its recent applications. Cancer Biomark 2006;2(3–4):103–33.
[8] Adams MD, Kelley JM, Gocayne JD, et al. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 1991;252(5013):1651–6.
[9] Velculescu VE, Zhang L, Vogelstein B, et al. Serial analysis of gene expression. Science 1995;270(5235):484–7.
[10] Brenner S, Johnson M, Bridgham J, et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 2000;18(6):630–4.
[11] Jongeneel CV, Iseli C, Stevenson BJ, et al. Comprehensive sampling of gene expression in human cell lines with massively parallel signature sequencing. Proc Natl Acad Sci U S A 2003;100(8):4702–5.
[12] Pontius JU, Wagner L, Schuler GD. UniGene: a unified view of the transcriptome. In: The NCBI handbook. Bethesda (MD): National Center for Biotechnology Information; 2003.
[13] Boguski MS, Lowe TM, Tolstoshev CM. dbEST – database for expressed sequence tags. Nat Genet 1993;4(4):332–3.
[14] Adams MD, Kerlavage AR, Fields C, et al. 3,400 new expressed sequence tags identify diversity of transcripts in human brain. Nat Genet 1993;4(3):256–67.
[15] Boon K, Osorio EC, Greenhut SF, et al. An anatomy of normal and malignant gene expression. Proc Natl Acad Sci U S A 2002;99(17):11287–92.
[16] Beaty RM, Edwards JB, Boon K, et al. PLXDC1 (TEM7) is identified in a genome-wide expression screen of glioblastoma endothelium. J Neurooncol 2007;81(3):241–8.
[17] Jongeneel CV, Delorenzi M, Iseli C, et al. An atlas of human gene expression from Massively Parallel Signature Sequencing (MPSS). Genome Res 2005;15(7):1007–14.
[18] Su AI, Cooke MP, Ching KA, et al. Large-scale analysis of the human and mouse transcriptomes. Proc Natl Acad Sci U S A 2002;99(7):4465–70.
[19] Greller LD, Tobin FL. Detecting selective expression of genes and proteins. Genome Res 1999;9(3):282–96.
[20] Stekel DJ, Git Y, Falciani F. The comparison of gene expression from multiple cDNA libraries. Genome Res 2000;10(12):2055–61.
[21] Castensson A, Emilsson L, Preece P, et al. High-resolution quantification of specific mRNA levels in human brain autopsies and biopsies. Genome Res 2000;10(8):1219–29.


[22] Lai C, Chou C, Chang L, et al. Identification of novel human genes evolutionarily conserved in Caenorhabditis elegans by comparative proteomics. Genome Res 2000;10(5):703–13.
[23] Walker MG, Volkmuth W, Sprinzak E, et al. Prediction of gene function by genome-scale expression analysis: prostate cancer-associated genes. Genome Res 1999;9(12):1198–203.
[24] Ewing RM, Kahla AB, Poirot O, et al. Large-scale statistical analyses of rice ESTs reveal correlated patterns of gene expression. Genome Res 1999;9(10):950–9.
[25] Skrabanek L, Campagne F. TissueInfo: high-throughput identification of tissue expression profiles and specificity. Nucleic Acids Res 2001;29(21):E102.
[26] Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215:403–10.
[27] Zhang Z, Schwartz S, Wagner L, et al. A greedy algorithm for aligning DNA sequences. J Comput Biol 2000;7:203–14.
[28] Brown AC, Kai K, May ME, et al. ExQuest, a novel method for displaying quantitative gene expression from ESTs. Genomics 2004;83(3):528–39.
[29] Zhang Y, Eberhard DA, Frantz GD, et al. GEPIS – quantitative gene expression profiling in normal and cancer tissues. Bioinformatics 2004;20(15):2390–8.
[30] Zhang Y, Luoh SM, Hon L, et al. GeneHub-GEPIS: digital expression profiling for normal and cancer tissues based on an integrated gene database. Nucleic Acids Res 2007;35:W152–8.
[31] Schug J, Schuller WP, Kappen C, et al. Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 2005;6(4):R33.
[32] The Computational Biology and Informatics Laboratory. AllGenes: a Web site providing access to an integrated database of known and predicted human (release 9.0, 2004) and mouse genes (release 10.0, 2004). Center for Bioinformatics, University of Pennsylvania. Available at: http://www.allgenes.org. Accessed November 19, 2007.
[33] Liang S, Li Y, Be X, et al. Detecting and profiling tissue-selective genes. Physiol Genomics 2006;26(2):158–62.
[34] Kadota K, Nishimura S, Bono H, et al. Detection of genes with tissue-specific expression patterns using Akaike's information criterion procedure. Physiol Genomics 2003;12(3):251–9.
[35] Akaike H. Information theory and an extension of the maximum likelihood principle. Proc 2nd Int Symp Information Theory. Budapest; 1973. p. 267–81.
[36] Miki R, Kadota K, Bono H, et al. Delineating developmental and metabolic pathways in vivo by expression profiling using the RIKEN set of 18,816 full-length enriched mouse cDNA arrays. Proc Natl Acad Sci U S A 2001;98(5):2199–204.
[37] Kadota K, Ye J, Nakai Y, et al. ROKU: a novel method for identification of tissue-specific genes. BMC Bioinformatics 2006;7:294.
[38] Saito-Hisaminato A, Katagiri T, Kakiuchi S, et al. Genome-wide profiling of gene expression in 29 normal human tissues with a cDNA microarray. DNA Res 2002;9(2):35–45.
[39] Hsiao LL, Dangond F, Yoshida T, et al. A compendium of gene expression in normal human tissues. Physiol Genomics 2001;7(2):95–6.
[40] Misra J, Schmitt W, Hwang D, et al. Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res 2002;12(7):1112–20.
[41] Vasmatzis G, Klee E, Kube DM, et al. Quantitating tissue specificity of human genes to facilitate biomarker discovery. Bioinformatics 2007;23(11):1348–55.
[42] Gupta S, Vingron M, Haas SA. T-STAG: resource and Web-interface for tissue-specific transcripts and genes. Nucleic Acids Res 2005;33(Web Server issue):W654–8.
[43] Wang J, Liang P. DigiNorthern, digital expression analysis of query genes based on ESTs. Bioinformatics 2003;19(5):653–4.
[44] Madden SF, O'Donovan B, Furney SJ, et al. Digital extractor: analysis of digital differential display output. Bioinformatics 2003;19(12):1594–5.
[45] Huminiecki L, Bicknell R. In silico cloning of novel endothelial-specific genes. Genome Res 2000;10(11):1796–806.


[46] Huminiecki L, Lloyd AT, Wolfe KH. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics 2003;4(1):31.
[47] Campagne F, Skrabanek L. Mining expressed sequence tags identifies cancer markers of clinical interest. BMC Bioinformatics 2006;7:481.
[48] Wang XS, Zhang Z, Wang HC, et al. Rapid identification of UCA1 as a very sensitive and specific unique marker for human bladder carcinoma. Clin Cancer Res 2006;12(16):4851–8.
[49] Wang AG, Yoon SY, Oh JH, et al. Identification of intrahepatic cholangiocarcinoma related genes by comparison with normal liver tissues using expressed sequence tags. Biochem Biophys Res Commun 2006;345(3):1022–32.
[50] Yoon SY, Kim JM, Oh JH, et al. Gene expression profiling of human HBV- and/or HCV-associated hepatocellular carcinoma cells using expressed sequence tags. Int J Oncol 2006;29(2):315–27.
[51] Huang ZG, Ran ZH, Lu W, et al. Analysis of gene expression profile in colon cancer using the Cancer Genome Anatomy Project and RNA interference. Chin J Dig Dis 2006;7(2):97–102.
[52] Aouacheria A, Navratil V, Barthelaix A, et al. Bioinformatic screening of human ESTs for differentially expressed genes in normal and tumor tissues. BMC Genomics 2006;7:94.
[53] Laterza OF, Modur VR, Crimmins DL, et al. Identification of novel brain biomarkers. Clin Chem 2006;52(9):1713–21.
[54] Asmann YW, Kosari F, Wang K, et al. Identification of differentially expressed genes in normal and malignant prostate by electronic profiling of expressed sequence tags. Cancer Res 2002;62(11):3308–14.
[55] Megy K, Audic S, Claverie JM. Heart-specific genes revealed by expressed sequence tag (EST) sampling. Genome Biol 2002;3(12):RESEARCH0074.1–11.
[56] Klee EW, Finlay J, McDonald C, et al. Bioinformatics methods for prioritizing serum biomarker candidates. Clin Chem 2006;52(11):2162–4.

Clin Lab Med 28 (2008) 145–166

Data Mining in Genomics


Jae K. Lee, PhD*, Paul D. Williams, PhD,
Sooyoung Cheon, PhD
Division of Biostatistics and Epidemiology, Department of Public Health Sciences,
Box 800717, University of Virginia, Charlottesville, VA 22908, USA

A great explosion of genomic data has occurred recently, because of the


advances in various high-throughput biotechnologies, such as RNA gene
expression microarrays. These large genomic data sets are informationrich and often contain much more information than the researchers generating the data may have anticipated. Such an enormous data volume enables
new types of analyses but also makes research questions dicult to answer
using traditional methods. Analysis of these massive genomic data has several unprecedented challenges.
Challenge 1: multiple comparisons issue
Analysis of high-throughput genomic data requires handling an astronomical number of candidate targets, most of which are false-positives [1,2]. For example, a traditional statistical testing criterion of a 5% significance level would result in an average of 500 false-positive genes from a 10K microarray study comparing two biological conditions for which no real biological differences appeared in gene regulation. If a small number of genes, for example 100, are differentially regulated, the real differentially expressed genes will be mixed with the 500 false-positives without any a priori information to discriminate the groups of genes. Confidence in the 600 targets identified with this statistical test is low, and further investigation of these candidates will have a poor yield. Simply tightening this statistical criterion, such as to a 1% or lower significance level, will result in a high false-negative error rate, with failure to identify many important real biological targets. This pitfall, the so-called multiple comparisons issue, becomes even more serious when trying to find novel biological mechanisms

This study is supported by NIH grant 1R01HL081690 of JKL.


* Corresponding author.
E-mail address: jaeklee@virginia.edu (J.K. Lee).
0272-2712/08/$ - see front matter 2008 Elsevier Inc. All rights reserved.
doi:10.1016/j.cll.2007.10.010
labmed.theclinics.com


and biomarker prediction models that involve multiple interacting targets and genes, because the number of candidate pathways or interaction mechanisms grows exponentially. Thus, data mining techniques must effectively minimize both false-positive and false-negative error rates in these genome-wide investigations.
Challenge 2: high-dimensional biological data
The second challenge is the high-dimensional nature of biological data in
many genomic studies [3]. In genomic data analysis, many gene targets are
investigated simultaneously, yielding dramatically sparse data points in the
corresponding high-dimensional data space. Mathematical and computational approaches often fail to capture these high-dimensional phenomena
accurately. For example, many search algorithms cannot freely move between local maxima in a high-dimensional space. Furthermore, inference
based on the combination of several lower-dimensional observations may
not provide a correct understanding of the real phenomenon in their joint
high-dimensional space. Consequently, unless appropriate statistical dimension reduction techniques are used to convert high-dimensional data problems into lower-dimensional ones, important variation and information in
the biological data may be obscured.
Challenge 3: small n and large p problem
The third challenge is the so-called small n and large p problem [2]. Desired performance of conventional statistical methods is achieved when the
sample size of the data, namely n (the number of independent observations and subjects), is much larger than the number of candidate prediction
parameters and targets, namely p. In many genomic data analyses this situation is often completely reversed. For example, in a microarray study, tens
of thousands of gene expression patterns may become the candidate prediction factors for a biological phenomenon of interest (eg, response versus
resistance to a chemotherapeutic regimen), but the number of independent
observations (eg, different patients or samples) is often a few tens or hundreds at most. Because of the experimental costs and limited availability
of biological materials, the number of independent samples may be even
smaller, sometimes only a few. Traditional statistical methods are not
designed for these circumstances and often perform very poorly; furthermore, statistical power must be strengthened using all sources of information in large-screening genomic data.
Challenge 4: computational limitation
No matter how powerful a computer system becomes, solving many
genomic data mining problems through exhaustive combinatorial search
and comparisons is often prohibitive [4]. In fact, many current problems


in genomic data analysis have been theoretically proven to be of non-polynomial-hard complexity, implying that no computational algorithm can search all possible candidate solutions. Thus, heuristic (most frequently statistical) algorithms that effectively search and investigate a very small portion of all possible solutions are often sought for genomic data mining problems. The success of many bioinformatics studies critically depends on the construction and use of effective and efficient heuristic algorithms, most of which are based on the careful application of probabilistic modeling and statistical inference techniques.
Challenge 5: noisy high-throughput biological data
The next challenge derives from the fact that high-throughput biotechnical data and large biological databases are inevitably noisy because biological information and signals of interest are often observed with many other
random or confounding factors. Furthermore, a one-size-fits-all experimental design for high-throughput biotechniques can introduce bias and error
for many candidate targets. Therefore, many investigations in bioinformatics can be performed successfully only when the variability of genomics
data is well understood. In particular, the distributional characteristics of
each data set must be analyzed using statistical and quality control techniques on initial data sets so that relevant statistical approaches may be
applied appropriately. This preprocessing step is critical for all subsequent
bioinformatics analyses, and reconciling dramatically different results that may stem from slightly different preprocessing procedures can sometimes be difficult. Although this issue has no easy answer, consistent preprocessing procedures within each analysis and across different analyses, with good documentation of procedures, must be used.
Challenge 6: integration of multiple, heterogeneous biological
data for translational bioinformatics research
The last challenge is the integration of genomic data with heterogeneous
biological data and associated metadata, such as gene function, biological
subjects' phenotypes, and patient clinical parameters. For example, multiple heterogeneous data sets, including gene expression data, biological responses, clinical findings, and outcomes data, may need to be combined to discover genomic biomarkers and gene networks that are relevant to disease and predictive of clinical outcomes, such as cancer progression and chemosensitivity to an anticancer compound. Some data sets exist in different formats and may require combined preprocessing, mapping between data elements, or other preparatory steps before correlative analysis, depending on their biological characteristics and data distributions. Effective combination and use of the information from these heterogeneous genomic, clinical, and other data resources remain a significant challenge.


This article reviews novel concepts and techniques for tackling various genomic data mining problems. In particular, because DNA microarray and GeneChip techniques have become an important tool in biological and biomedical investigations, this article focuses on statistical approaches that have been applied to various microarray data analyses to overcome some of the challenges mentioned earlier.

A new concept of statistical significance: false discovery rate


To avoid a large number of false-positive findings, the family-wise error rate (FWER) has classically been controlled for the random chance of multiple hypotheses (or candidates) by evaluating the probability that at most one false-positive is included at a cutoff level of a test statistic among all candidates. However, FWER has been found to be very conservative in microarray studies, resulting in a high false-negative error rate, often very close to 100% [1]. To avoid this pitfall, a novel concept of statistical significance, the so-called false discovery rate (FDR), and its refinement, the Q value, have been suggested [2,5] (Q-value package, www.bioconductor.org).
To illustrate what the false discovery rate is, suppose M candidates are available for simultaneous testing to reject the null hypothesis of no biological significance. Assume M0 among M to be the number of true negative candidates and M1 (= M − M0) to be the number of true positive candidates. At a cutoff value of a test statistic or data mining tool, let R denote the number of all positives (or significantly identified candidates), V the number of false-positives, and S the number of true-positives (Table 1). Then, the FDR is defined as V/R (for R greater than 0), the ratio between the false-positives (V) and all positive findings (R = V + S). Note that FDR is thus derived based on both the null (no significance) and alternative (significant target) distributions. In contrast, the classical P value (or type 1 error), here V/M0, and the statistical power (1 − type 2 error), here S/M1, are based on only one of the null and alternative distributions. Therefore, the FDR criterion can simultaneously balance between false-positives and false-negatives, whereas the classical P value and power can address only one of the two errors.

Table 1
Classification of the candidate hypotheses

                     Null accept    Null reject    Total
Null true            U              V              M0
Alternative true     T              S              M1
Total                W              R              M

Abbreviations: S, true positive; T, false negative; U, true negative; V, false positive; W, U + T; R, V + S; M0, U + V; M1, T + S; M, M0 + M1.


The FDR evaluation has been rapidly adopted for microarray data analysis, including the widely used significance analysis of microarrays (SAM) and other approaches [1,6]. Many different methods have been suggested for estimating FDR directly from test statistics, or indirectly from the classical P values of these statistics. The latter methods are convenient because standard P values can be simply converted into their corresponding FDR or Q values [5,7], the latter being based on a resampling technique. More careful FDR assessment can also be found in many other recent studies [7].
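As a simple illustration of converting ordinary P values into FDR-based values, the Benjamini-Hochberg step-up adjustment can be sketched in Python as follows; this is only one of the conversion methods alluded to above, and the Storey Q-value refinement (which additionally estimates the proportion of true nulls) is not reproduced here. The toy P values are invented for the example.

import numpy as np

def bh_fdr(pvalues):
    # Benjamini-Hochberg adjusted P values (monotone step-up FDR estimates).
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)           # p_(i) * m / i
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]    # enforce monotonicity
    fdr = np.empty(m)
    fdr[order] = np.clip(ranked, 0, 1)
    return fdr

# Ten tests: two genuinely small P values among uniform noise.
print(bh_fdr([0.0004, 0.0011, 0.21, 0.48, 0.33, 0.62, 0.75, 0.09, 0.55, 0.91]))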

Pairwise statistical tests for genomic data


The differential expression pattern of each gene in a microarray experiment is usually assessed with (typically pairwise) contrasts of mean expression values among experimental conditions. These comparisons have been routinely measured as fold changes, whereby genes with greater than two- or threefold changes are selected for further investigation. Results have frequently found that a gene that shows a high fold change between comparison conditions might also exhibit high variability in general, and hence its differential expression may not be significant. Similarly, a modest change in gene expression may be significant if its differential expression pattern is highly reproducible. Several authors have indicated this fundamental flaw in the fold change-based approach [1]. Thus, the emerging standard approach is based on statistical significance and hypothesis testing, with careful attention to the reliability of variance estimates and multiple comparison issues.
The classical t test and other traditional test statistics were initially used for testing the differential expression of each gene [6]. These classical testing procedures, however, rely on reasonable estimates of reproducibility or within-gene error, requiring a large number of replicated arrays. When a small number of replicates are available per condition (eg, duplicate or triplicate), the use of within-gene estimates of variability does not provide a reliable hypothesis-testing framework. For example, a gene may have similar differential expression values in duplicate experiments through chance alone. Furthermore, comparing means can be misled by outliers with dramatically smaller or larger expression intensities than other replicates, and therefore error estimates constructed solely within genes may result in underpowered tests for differential expression comparisons and also in large numbers of false-positives. Several approaches to improving estimates of variability and statistical tests of differential expression have thus recently emerged [8-10].
Significance analysis of microarrays
SAM has been proposed to improve the unstable error estimation of the two-tailed t test by adding a variance stabilization factor that minimizes


the variance variability across different intensity ranges [1]. Based on the observation that the signal-to-noise ratio varies with different gene expression intensities, SAM tries to stabilize gene-specific fluctuations and is defined based on the ratio of the change in gene expression to the standard deviation in the data for that gene. The relative difference d(i) in gene expression is defined as:

d(i) = [xI(i) − xU(i)] / [s(i) + s0]

where xI(i) and xU(i) are the average expression values of gene i in states I and U, respectively. The gene-specific scatter s(i) is the pooled standard deviation of replicated expression values of the gene in the two states. To compare values of d(i) across all genes, the distribution of d(i) is assumed to be independent of the level of gene expression. However, at low expression levels, variability in d(i) can be high because of small values of s(i). To ensure that the variance of d(i) is independent of gene expression, a positive constant s0 is added to the denominator. The value of s0 is chosen to minimize the coefficient of variation, where the coefficient of variation of d(i) is computed as a function of s(i) in moving windows across all the genes.
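A minimal Python sketch of the SAM-style relative difference follows. It is only an illustration: the moving-window, CV-minimizing choice of s0 described above is simplified here to the median of the gene-wise pooled standard errors (an assumption), and the simulated expression matrices are hypothetical.

import numpy as np

def sam_d(x_I, x_U, s0=None):
    # SAM relative difference d(i) for each gene.
    # x_I, x_U: genes x replicates arrays for the two states.
    # s0: variance-stabilizing constant; here defaulted to the median of the
    #     gene-wise pooled standard errors (a simplification of the CV-minimizing choice).
    x_I, x_U = np.asarray(x_I, float), np.asarray(x_U, float)
    mean_diff = x_I.mean(axis=1) - x_U.mean(axis=1)
    n_I, n_U = x_I.shape[1], x_U.shape[1]
    pooled_var = (((n_I - 1) * x_I.var(axis=1, ddof=1) +
                   (n_U - 1) * x_U.var(axis=1, ddof=1)) / (n_I + n_U - 2))
    s = np.sqrt(pooled_var * (1.0 / n_I + 1.0 / n_U))
    if s0 is None:
        s0 = np.median(s)
    return mean_diff / (s + s0)

rng = np.random.default_rng(0)
treated = rng.normal(0.0, 1.0, size=(100, 3))
control = rng.normal(0.0, 1.0, size=(100, 3))
treated[:5] += 3.0                      # five truly regulated genes
print(np.argsort(-np.abs(sam_d(treated, control)))[:5])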
Local pooled error
Based on a more careful error-pooling technique, the so-called local-pooled-error (LPE) test was also introduced. This testing technique is particularly useful when the sample size is very small (eg, two or three per condition). LPE variance estimates for genes are formed by pooling variance estimates for genes with similar expression intensities from replicated arrays within experimental conditions [6]. The LPE approach leverages the observations that genes with similar expression intensity values often show similar array-experimental variability within experimental conditions, and that the variance of individual gene expression measurements within experimental conditions typically decreases as a (nonlinear) function of intensity. LPE has been introduced specifically for analyzing small-sample microarray data, whereby error variance estimates for genes are formed through pooling variance estimates for genes with similar expression intensities from replicated arrays within experimental conditions (LPE package, www.bioconductor.org). The LPE approach is possible because common background noise can often be found within each local intensity region of the microarray data. At high levels of expression intensity, this background noise is dominated by the expression intensity, whereas at low levels the background noise is a larger component of the observed expression intensity, which can be easily observed in the so-called M versus A log-intensity scatter plot of two replicated chips among three different immune conditions (Fig. 1) [6]. The LPE approach controls the situation when a gene with low expression may have very low variance by chance and the resulting signal-to-noise ratio is unrealistically large.


Fig. 1. Log-intensity ratio (M) as a function of average gene expression between replicated
chips (A). Top panels represent the estimated error distributions (based on a non-parametric
regression) for (A) naive, (B) 48 hour activated, and (C) T-cell clone D4 conditions in the mouse
immune response microarray study.

Statistical significance of the LPE-based test is evaluated by first calculating each gene's medians, m1 and m2, under the two compared conditions to avoid artifacts from outliers. The LPE statistic for the median (log-intensity) difference Z is then calculated as:

Z = (m1 − m2) / sLPE(pooled)

where sLPE(pooled) is the pooled standard error from the LPE-estimated baseline variances of the two conditions. The LPE approach shows significantly better performance than the two-tailed t test, SAM, and Westfall-Young permutation tests, especially when the number of replicates is smaller than 10 [6].
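The local-pooling idea can be sketched very roughly in Python. This is not the LPE package algorithm: quantile binning of genes by mean intensity stands in for the package's nonparametric variance smoother, the scaling of the median's variance is approximate, and the simulated data and function names are hypothetical.

import numpy as np

def lpe_baseline_variance(replicates, n_bins=20):
    # Pool within-gene variances across genes of similar intensity (one condition).
    # Quantile binning is a crude stand-in for the nonparametric smoother.
    reps = np.asarray(replicates, float)
    mean_int = reps.mean(axis=1)
    var = reps.var(axis=1, ddof=1)
    edges = np.quantile(mean_int, np.linspace(0, 1, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, mean_int, side="right") - 1, 0, n_bins - 1)
    pooled = np.array([var[bins == b].mean() for b in range(n_bins)])
    return pooled[bins]

def lpe_z(cond1, cond2):
    # Median difference over the pooled baseline standard error (approximate).
    m1 = np.median(cond1, axis=1)
    m2 = np.median(cond2, axis=1)
    v1 = lpe_baseline_variance(cond1) / cond1.shape[1]
    v2 = lpe_baseline_variance(cond2) / cond2.shape[1]
    return (m1 - m2) / np.sqrt(v1 + v2)

rng = np.random.default_rng(1)
a = rng.normal(8, 1, size=(500, 3))
b = rng.normal(8, 1, size=(500, 3))
b[:10] += 2.5                      # ten differentially expressed genes
print(np.argsort(-np.abs(lpe_z(a, b)))[:10])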

Statistical modeling on genomic data


Genomic expression profiling studies are also frequently performed for comparing complex, multiple biological conditions and pathways. Several linear modeling approaches have been introduced for analyzing microarray data with multiple conditions. For example, an analysis of variance (ANOVA) model approach was considered to capture the effects of dye,


array, gene, condition, array-gene interaction, and condition-gene interaction separately on complementary DNA (cDNA) microarray data [11], and a two-stage mixed model was proposed to first model cDNA microarray data with the effects of array, condition, and condition-array interaction and then fit the residuals with the effects of gene, gene-condition interaction, and gene-array interaction [12]. Several approaches have also been developed using the Bayesian paradigm for analyzing microarray data, including Bayesian parametric modeling [13], the Bayesian regularized t test [8], Bayesian hierarchical modeling with a multivariate normal prior [14], and the Bayesian heterogeneous error model (HEM) with two error components [15].
Analysis of variance modeling
The use of ANOVA models has been suggested to estimate relative gene expression and to account for other sources of variation in microarray data [16]. Although the exact form of the ANOVA model depends on the particular data set, a typical ANOVA model for two-color-based cDNA microarray data can be defined as

yijkg = μ + Ai + Dj + Vk + Gg + (AD)ij + (AG)ig + (DG)jg + (VG)kg + eijkg

where yijkg is the measured intensity from array i, dye j, variety k, and gene g on an appropriate scale (typically the log scale). The generic term variety is often used to refer to the mRNA samples studied, such as treatment and control samples; cancer and normal cells; or time points of a biological process. The terms A, D, and AD account for the overall effects that are not gene-specific. The gene effects Gg capture the average levels of expression for genes, and the array-by-gene interactions (AG)ig capture differences caused by varying sizes of spots on arrays. The dye-by-gene interactions (DG)jg represent gene-specific dye effects. None of these effects are of biological interest, but they amount to a normalization of the data for ancillary sources of variation. The effects of primary interest are the interactions between genes and varieties, (VG)kg. These terms capture differences from overall averages that are attributable to the specific combination of variety k and gene g. Differences among these variety-by-gene interactions provide the estimates for the relative expression of gene g in varieties 1 and 2 through (VG)1g − (VG)2g. Note that AV, DV, and other higher-order interaction terms are typically assumed to be negligible and are considered together with the error terms. The error terms eijkg are often assumed to be independent and normal with mean zero and a common variance. However, such a global ANOVA model is difficult to implement in practice because of its computational restrictions. Instead, one often considers gene-by-gene ANOVA models such as

yijkg = μg + Ai + Dj + Vk + (AD)ij + (VG)kg + eijkg


Alternatively, a two-stage ANOVA model may be used [12]. The first layer is for the main effects that are not specific to the genes:

yijkg = μ + Ai + Dj + Vk + (AD)ij + (AV)ik + eijkg

Let rijkg be the residuals from this first ANOVA fit. Then, the second-layer ANOVA model for gene-specific effects is considered as

rijkg = Gg + (AG)ig + (DG)jg + (VG)kg + vijkg

Excepting the main effects of G and V and their interaction effects, the other terms A, D, (AD), (AG), and (DG) can be considered as random effects. These within-gene ANOVA models can be implemented using most standard statistical packages, such as R, SAS, or SPSS.
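For illustration, a within-gene ANOVA of this general form can be fit with ordinary least squares in Python (statsmodels). The toy dye-swap design, the simulated intensities, and the omission of interaction terms (the toy data set has too few observations to estimate them) are all assumptions made for the sketch.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Toy two-color design for ONE gene: 4 arrays, dye-swapped, 2 varieties
# (treatment vs control), log-scale intensities.
rng = np.random.default_rng(2)
rows = []
for array in range(4):
    for dye, variety in ((("Cy3", "trt"), ("Cy5", "ctl")) if array % 2 == 0
                         else (("Cy3", "ctl"), ("Cy5", "trt"))):
        effect = 0.8 if variety == "trt" else 0.0        # true variety effect
        rows.append({"array": array, "dye": dye, "variety": variety,
                     "y": 8.0 + 0.2 * array + effect + rng.normal(0, 0.1)})
df = pd.DataFrame(rows)

# Within-gene model y ~ array + dye + variety (interaction terms omitted
# because the toy design has too few observations to estimate them).
fit = smf.ols("y ~ C(array) + C(dye) + C(variety)", data=df).fit()
print(sm.stats.anova_lm(fit, typ=2))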
Heterogeneous error model
Similar to the statistical tests for comparing two sample conditions, the within-gene ANOVA modeling methods are underpowered and have inaccurate error estimation in microarray data with limited replication. HEM has been suggested as an alternative (HEM package at www.bioconductor.org). It is based on Bayesian hierarchical modeling and LPE error-pooling-based prior constructions, with two layers of error that decompose the total error variability into the technical and biological error components in microarray data [15]. The first layer is constructed to capture the array technical variation caused by many experimental error components, such as sample preparation, labeling, hybridization, and image processing: yijkl = xijk + eijkl, where eijkl is approximately iid Normal[0, σ2(xijk)], with i = 1, 2, ..., G; j = 1, 2, ..., C; k = 1, 2, ..., mij; l = 1, 2, ..., nijk. The second layer is then hierarchically constructed to capture the biological error component: xijk = μ + gi + cj + rij + bijk, where bijk is approximately iid Normal[0, σ2b(ij)]. Here, the genetic parameters are the grand mean (shift or scaling) constant, the gene, cell, and interaction effects, and the biological error; the last error term varies and is heterogeneous for each combination of different genes and conditions. The biological variability is individually assessed for the discovery of biologically relevant expression patterns. The HEM approach shows a significantly better performance than standard ANOVA methods, especially when the number of replicates is small (Fig. 2).

Unsupervised learning: clustering


Clustering analysis is widely applied to search for the groups (clusters)
in microarray data, because these techniques can effectively reduce the high-dimensional gene expression data into a two-dimensional dendrogram organized by each gene's expression association patterns (Fig. 3). Currently,


Fig. 2. Receiver operating characteristic curves from heterogeneous error model (solid lines) and
analysis of variance (dotted lines) models with two and five replicated arrays. The horizontal axis is 1 − false-positive error rate (FPR) and the vertical axis is 1 − false-negative error rate (FNR).

clustering analysis is one of the most frequently used techniques for genomic data mining in biomedical studies [17-19]. Some technical aspects of these approaches are summarized. A clustering approach first must be defined through a measure or distance index of similarity or dissimilarity, such as
- Euclidean: d(x, y) = Σk (xk − yk)²
- Manhattan: d(x, y) = Σk |xk − yk|
- Correlation: d(x, y) = 1 − r(x, y), where r(x, y) is a correlation coefficient
Next, an allocation algorithm must be defined based on one of these distance metrics. Two classes of clustering algorithms have been used in genomic data analysis: hierarchical and partitioning allocation algorithms. Hierarchical algorithms that allocate each subject to its nearest subject or group include the following (a minimal sketch combining one such distance and linkage appears after these lists):
- Agglomerative methods: average linkage based on group average distance, single linkage based on minimum (nearest) distance, and complete linkage based on maximum (furthest) distance;


Fig. 3. Dendrogram (top panel) and heatmap (bottom panel) of hierarchical clustering analysis
for the concordant complementary DNA (cDNA) and oligo array expression patterns on the
NCI-60 cancer cell lines. A region of the heatmap occupied by melanoma genes is shown from the combined set of 3297 oligo and cDNA transcripts. Each gene expression pattern is designated as coming from the cDNA or the oligo array set. The concordant oligo and cDNA microarray expressions are marked with blue bars.

- Probabilistic methods: Bayes factor, posterior probabilities of subclusters;
- Divisive methods: monothetic variable division, polythetic division.
Partitioning algorithms divide the data into a prespecified number of subsets, including
- Self-organizing map: division into a geometrically preset grid structure of subclusters;
- K-means: iterative relocation into a predefined number of subclusters;
- Partitioning around medoids: similar to but more robust than K-means clustering;
- Clara: division into fixed-size subdatasets for application to large data sets;
- Fuzzy algorithm: probabilistic fractions of membership rather than deterministic allocations.
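As a brief, non-authoritative sketch of one combination of the pieces above, the following Python code clusters simulated gene expression vectors with the correlation distance (d = 1 − r) and average-linkage agglomeration using scipy; the simulated data and the choice of three clusters are assumptions for the example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
# 60 genes x 10 samples: two co-expressed groups plus background noise.
base1, base2 = rng.normal(size=10), rng.normal(size=10)
genes = np.vstack([base1 + rng.normal(0, 0.3, size=(20, 10)),
                   base2 + rng.normal(0, 0.3, size=(20, 10)),
                   rng.normal(size=(20, 10))])

# Correlation distance d = 1 - r, average-linkage agglomeration.
d = pdist(genes, metric="correlation")
tree = linkage(d, method="average")
print(fcluster(tree, t=3, criterion="maxclust"))   # cut the dendrogram into 3 clusters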


One of the most difficult aspects of using these clustering analyses is the interpretation of their heuristic, often unstable, clustering results. To overcome this shortcoming, several refined clustering approaches have been suggested. For example, the use of bootstrapping was suggested to evaluate the consistency and confidence of each gene's membership in particular cluster groups [11]. The gene-shaving approach was suggested to find the clusters directly relevant to major variance directions of an array data set [3]. Recently, tight clustering, a refined bootstrap-based hierarchical clustering, was proposed to formally assess and identify the groups of genes that are most tightly clustered with each other [20].

Supervised learning: classification


Supervised classification learning on genomic data is often performed to obtain genomic prediction models for different groups of biological subjects (eg, macrophage cells under different immunologic conditions) and patients (different subclasses of patients who have cancer, such as those who have acute lymphoblastic leukemia versus acute myeloid leukemia). Prediction based on genomic expression signatures has received considerable attention in many challenging classification problems in biomedical research [21,22]. For example, these analyses have been conducted in cancer research as alternative diagnostic techniques to the traditional ones, such as classification through the origin of cancer tissues or microscopic appearance, which can be problematic for predicting many critical human disease subtypes [23]. Several different approaches to microarray classification modeling have been proposed, including gene voting [21], support vector machines (SVMs) [24], Bayesian regression models [22], partial least squares [25], and GA/KNN [26]. The following discussion considers strategies to evaluate and compare the performance of these different classification methods.
Measures for classification model performance
Microarray data often have tens of thousands of genes on each chip, whereas only a few tens of samples or replicated arrays are available in a microarray study. In classification modeling on genomic data, avoiding overfitting is essential, as is finding an optimal subset of the thousands of genes for constructing classification rules and models that are robust to different choices of training samples and consistent in prediction performance on future samples. Typically, in this kind of supervised learning, separate training and test sets (of subjects with known classes) are used, the former to fit classification prediction models, and the latter, which is independent of the former set, for rigorous model validation. Evaluation of prediction performance should then be carefully conducted among the extremely large number of competing models, especially in using appropriate performance selection criteria and in using the whole data for model training and


evaluation. Several different measures are currently used to evaluate the performance of classification models: the classification error rate, the area under the receiver operating characteristic curve (area under the curve [AUC]), and the product of posterior classification probabilities [27,28].
When a large number of candidate models (eg, approximately 10^8 two-gene models on 10K array data) are compared in their performance, these measures are often saturated (their maximum performance levels are achieved by many competing models), so that identification of the best (most robust) prediction model among them is extremely difficult. Furthermore, these measures cannot capture an important aspect of classification model performance, as follows: suppose three samples are classified using two classification models (or rules); one model provides the correct posterior classification probabilities 0.8, 0.9, and 0.4, and the other 0.8, 0.8, and 0.4 for the three samples. Assuming these were unbiased estimates of classification probabilities (on future data), the former model would be preferred because this model will perform better in terms of the expected number of correctly classified samples in future data.
Note that the two models provide the same misclassification error rate, one third. This aspect of classification performance cannot be captured by evaluating the commonly used error rate or AUC criteria, which simply add one count for each correctly classified sample, ignoring its degree of classification error probability.
To overcome this limitation, the so-called misclassification-penalized posterior (MiPP) criterion has been suggested recently [4]. This measure is the sum of the correct-classification (posterior) probabilities of correctly classified samples minus the sum of the misclassification (posterior) probabilities of misclassified samples. Suppose there are m classes πi, i = 1, ..., m, from a population of N samples, and let Xj, j = 1, ..., ni, denote the jth sample. Under a particular prediction model (eg, a one- or two-gene model) from a classification rule, such as linear discriminant analysis or SVMs, MiPP is then defined as:

L = Σcorrect pk(Xj) − Σwrong [1 − pk(Xj)]

where pk(Xj) is the posterior classification probability of sample Xj into the kth class. Here "correct" and "wrong" correspond to the samples that are correctly and incorrectly classified. In the two-class problem, correct classification simply means pk(Xj) is more than 0.5, but in general it occurs when pk(Xj) = max over i = 1, ..., m of pi(Xj). MiPP can also be shown to be the sum of the posterior probabilities of correct classification penalized by the number of misclassified samples (NM): L = Σ pk(Xj) − NM. Thus, MiPP is a continuous measure (compared with the discrete error rate) of classification performance that accounts for both the degree of classification certainty and the error rate, and it is sensitive enough to distinguish subtle differences in prediction performance among many competing models.
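The MiPP definition above can be computed directly from a matrix of posterior class probabilities; a minimal Python sketch follows (function name and toy posteriors are illustrative). It reproduces the three-sample example discussed earlier: both models have the same error rate, but the first has the higher MiPP.

import numpy as np

def mipp(posteriors, true_labels):
    # MiPP = sum of p_k(X_j) over correctly classified samples
    #        minus sum of (1 - p_k(X_j)) over misclassified samples,
    # where k is the true class of sample j.
    post = np.asarray(posteriors, dtype=float)
    y = np.asarray(true_labels)
    p_true = post[np.arange(len(y)), y]
    correct = post.argmax(axis=1) == y
    return p_true[correct].sum() - (1.0 - p_true[~correct]).sum()

# The two three-sample models discussed above: same error rate (1/3),
# but the first model has the higher (better) MiPP (1.1 vs 1.0).
print(mipp([[0.8, 0.2], [0.9, 0.1], [0.4, 0.6]], [0, 0, 0]))
print(mipp([[0.8, 0.2], [0.8, 0.2], [0.4, 0.6]], [0, 0, 0]))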


Classification modeling
Several classification modeling approaches are currently widely used in genomic data analysis.
Gene voting
Gene voting [21] is an intuitively derived technique that aggregates the weighted votes from all modeling gene signatures; the advantage of this technique is that it can be easily implemented without complicated computing and statistical arguments. It was proposed for predicting subclasses of patients who have acute leukemia observed with microarray gene expression data [21]. This method gains accuracy by aggregating predictors built from a learning set and casting their voting weights. For binary classification, each gene casts a vote for class 1 or 2 among p samples, and the votes are aggregated over genes. For gene gj, the vote is vj = aj(gj − bj), where aj = (m1 − m2)/(s1 + s2) and bj = (m1 + m2)/2 for sample means m1 and m2 and sample standard deviations s1 and s2. Using this method based on 50 gene predictors, 36 of 38 patients in an independent validation set were correctly classified between acute myeloid leukemia and acute lymphoblastic leukemia.
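A small Python sketch of the weighted voting rule defined above follows; the simulated training matrices and new sample are hypothetical, and the sign convention (positive totals favor class 1) is an assumption of the sketch.

import numpy as np

def gene_votes(train1, train2, new_sample):
    # Weighted gene voting between class 1 and class 2.
    # train1, train2: genes x samples training matrices for the two classes.
    # new_sample: expression vector g of the sample to classify.
    # Each gene j votes v_j = a_j * (g_j - b_j); positive totals favor class 1.
    m1, m2 = train1.mean(axis=1), train2.mean(axis=1)
    s1, s2 = train1.std(axis=1, ddof=1), train2.std(axis=1, ddof=1)
    a = (m1 - m2) / (s1 + s2)          # signal-to-noise weight
    b = (m1 + m2) / 2.0                # decision midpoint
    votes = a * (np.asarray(new_sample) - b)
    return votes.sum(), np.sign(votes.sum())

rng = np.random.default_rng(4)
class1 = rng.normal(1.0, 1.0, size=(50, 10))
class2 = rng.normal(-1.0, 1.0, size=(50, 10))
print(gene_votes(class1, class2, rng.normal(1.0, 1.0, size=50)))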
Linear and quadratic discriminant analysis
Linear or quadratic discriminant analysis is a classical statistical classification technique based on the multivariate normal distribution assumption. This technique is frequently found to be robust and powerful for many different applications, despite the distributional assumption; the gene voting technique can be considered a variant of linear discriminant analysis. Linear discriminant analysis can be applied with leave-one-out classification, assuming each class follows a multivariate normal distribution. Each sample is then allocated to the group k for which its classification probability is maximized. Quadratic discriminant analysis can be performed similarly, except that the covariance matrix of the multivariate normal distribution (for each of the m classes) is allowed to differ among the m classes. Differences between linear discriminant analysis and quadratic discriminant analysis are typically small, especially if polynomial factors are considered in linear discriminant analysis. In general, quadratic discriminant analysis requires more observations to estimate each variance-covariance matrix for each class. Linear and quadratic discriminant analysis have consistently shown high performance, not because the data are likely derived from Gaussian distributions, but more likely because the data can support only simple boundaries, such as linear or quadratic [28].
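Leave-one-out classification with LDA on a small gene subset can be sketched as follows in Python (scikit-learn); the simulated 38-sample, 3-gene data set mimics the training-set sizes mentioned later but is otherwise invented for the example.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(5)
# 38 samples x 3 selected genes, two classes with shifted means.
X = np.vstack([rng.normal(0.0, 1.0, size=(27, 3)),
               rng.normal(1.5, 1.0, size=(11, 3))])
y = np.array([0] * 27 + [1] * 11)

# Leave-one-out classification with LDA on the selected gene subset.
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print("LOO error rate:", 1.0 - acc.mean())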
Logistic regression
The logistic regression classification technique is based on a regression fit to the probabilistic odds among the compared conditions. This technique


requires no specific distributional assumption but is often found to be less sensitive than other approaches. Logistic regression methods simply maximize the conditional likelihood Pr(G = k | X), typically by a Newton-Raphson algorithm [29]. The allocation decision for a sample is based on the logistic regression fit:

logit(pi) = log[pi / (1 − pi)] = b^T x

where b is the logistic regression coefficient vector estimated from the microarray data. Logistic regression discriminant analysis is often used because of its flexible assumption about the underlying distribution, but if the data actually come from a Gaussian distribution, logistic regression shows a loss of about 30% efficiency in the (misclassification) error rate compared with linear discriminant analysis.
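The logit relationship above translates directly into a posterior probability; the short Python sketch below shows that computation for a hypothetical fitted coefficient vector (the values of beta and the expression vectors are invented for illustration).

import numpy as np

def logistic_posterior(beta, x):
    # Posterior Pr(G = 1 | x) from the fitted logit: log(p/(1-p)) = b^T x.
    eta = np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-eta))

# Illustrative coefficients for a two-gene model (hypothetical values):
# intercept plus two gene weights; x includes a leading 1 for the intercept.
beta = np.array([0.4, -1.1, 2.3])
for expr in ([1.0, 0.2, 0.5], [1.0, 2.5, -0.3]):
    print(expr, logistic_posterior(beta, expr))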
Support vector machines
Conceptually similar to gene voting, SVMs are one of the more recent machine-learning classification techniques, based on projecting the data into a high-dimensional kernel space. This technique also does not require a distributional assumption, yet it can perform better than other approaches in some complicated cases. However, it often requires large numbers of samples and predictor gene signatures for optimal performance. SVMs separate a given set of binary-labeled training data with a hyperplane that is maximally distant from them, known as the maximal margin hyperplane [24]. Based on a kernel, such as a polynomial of dot products, the current data space will be embedded in a higher-dimensional space. Commonly used kernels include (a brief sketch of these kernel functions follows the list)
- Radial basis function kernel: K(x, y) = exp(−|x − y|²/(2σ²))
- Polynomial kernel: K(x, y) = ⟨x, y⟩^d or K(x, y) = (⟨x, y⟩ + c)^d, where ⟨·, ·⟩ denotes the inner product.
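The two kernel functions just listed are simple to evaluate directly; the Python sketch below implements them for a pair of toy vectors (the parameter defaults sigma = 1, degree = 2, and c = 1 are arbitrary choices for the example).

import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # Radial basis function kernel K(x, y) = exp(-|x - y|^2 / (2 sigma^2)).
    d = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-np.dot(d, d) / (2.0 * sigma**2))

def polynomial_kernel(x, y, degree=2, c=1.0):
    # Polynomial kernel K(x, y) = (<x, y> + c)^degree.
    return (np.dot(x, y) + c) ** degree

x, y = np.array([1.0, 0.5, -0.2]), np.array([0.8, 0.4, 0.1])
print(rbf_kernel(x, y), polynomial_kernel(x, y))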
Comparison of classification methods
These classification techniques must be carefully applied in prediction model training on genomic data. In particular, if all the samples are used for model search/training and evaluation in a large screening search for classification models, a serious selection bias is inevitably introduced [30]. To avoid this pitfall, a stepwise (leave-one-out) cross-validated discriminant procedure has been suggested that gradually adds genes to the training set [4,28]. The prediction performance is typically found to be continuously improved (or at least not decreased) by adding more features into the model. This result is again caused by a sequential search-and-selection strategy over an astronomically large number of candidate models; some of them can show overoptimistic prediction performance for a particular


training set by chance. Furthermore, even though a leave-one-out or similar cross-validation strategy is used in this search, the number of candidate models is too big to eliminate the many random ones that survive cross-validation by chance. Thus, the test data should be completely independent of the training data to obtain an unbiased estimate of each model's performance. To address these pitfalls, the stepwise cross-validated discriminant (SCVD) procedure sequentially adds one gene at a time to identify the optimal prediction model based on both n-fold modeling and train-test validation strategies. SCVD can be used with any of the aforementioned classification methods.
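For orientation only, the forward, cross-validated search idea can be sketched in Python. This simplified stand-in uses leave-one-out accuracy rather than MiPP and omits the independent test-set assessment that the SCVD procedure requires; the simulated data, gene counts, and function name are assumptions.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

def greedy_forward_selection(X_train, y_train, max_genes=3):
    # Sequentially add the gene that most improves leave-one-out accuracy.
    chosen = []
    for _ in range(max_genes):
        best_gene, best_score = None, -np.inf
        for g in range(X_train.shape[1]):
            if g in chosen:
                continue
            cols = chosen + [g]
            score = cross_val_score(LinearDiscriminantAnalysis(),
                                    X_train[:, cols], y_train,
                                    cv=LeaveOneOut()).mean()
            if score > best_score:
                best_gene, best_score = g, score
        chosen.append(best_gene)
    return chosen

rng = np.random.default_rng(8)
X = rng.normal(size=(38, 100))
y = np.array([0] * 27 + [1] * 11)
X[y == 1, 5] += 2.0                 # gene 5 carries most of the class signal
print(greedy_forward_selection(X, y))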
Linear discriminant analysis, quadratic discriminant analysis, logistic regression, and SVMs with linear or radial basis function kernels have been compared using the SCVD approach [28]. The leukemia microarray data in the study by Golub and colleagues [21] had a training set of 27 acute lymphoblastic leukemia and 11 acute myeloid leukemia samples and an independent test set of 20 acute lymphoblastic leukemia and 14 acute myeloid leukemia samples. Because two distinct data sets exist, the model is constructed based on the training data and evaluated on the test data set. Each rule identified a somewhat different subset of features that showed the best performance within each classification method (Table 2). In terms of error rate, the SVM with a linear kernel seems to be the most accurate rule. However, linear discriminant analysis misclassified only one sample, and the SVM with the radial basis function kernel and quadratic discriminant analysis each misclassified two samples on the independent test data. Logistic regression does not seem to perform as well as the other rules, misclassifying 4 of 34 samples. Comparing the rules based on MiPP is somewhat complicated for SVMs, because the estimated probabilities of correct classification from SVMs are based on how far samples are from a decision boundary. Therefore, these are not true probabilities, as is the case with linear discriminant analysis, quadratic discriminant analysis, and logistic

Table 2
Classification results of the classification rules and the corresponding gene models

Method           Gene model          Error rate on    MiPP on          Error rate      MiPP on
                                     training data    training data    on test data    test data
LDA              1882, 1144          0%               37.91            2.9%            31.46
QDA              4847, 5062          0%               37.96            5.8%            29.81
Logistic         1807, 4211, 575     0%               37.99            11.8%           25.64
SVM, linear K    2020, 4377, 1882    0%               35.16            0%              29.26
SVM, RBF K       4847, 3867, 6281    0%               32.52            5.9%            21.71

Abbreviations: K, kernel; LDA, linear discriminant analysis; MiPP, misclassification-penalized posterior; QDA, quadratic discriminant analysis; RBF, radial basis function; SVM, support vector machines.


regression. In an application to a dierent microarray study on colon cancer, the radial basis functionkernel SVM model with three genes was found
to perform best among these classication techniques.
The MiPP-based SCVD procedure was the most robust classification model and could accurately classify samples with a very small number of features (only two or three genes for the two well-known microarray data sets), outperforming many previous models that used 50 to 100 features, although different classification methods may perform differently on different data sets. These data are consistent with the notion that many correlated genes share more or less the same information and may discriminate similarly among different subtypes of a particular disease, and that multiple small-feature models may perform well in the construction of a classification model. As shown, the prediction performance on the training set quickly saturates at a 0% error rate and very close to the maximum MiPP value of 38 (the total sample size). However, error rates and MiPP values vary greatly on the independent test set. The error rates were also found to be misleading and less informative than MiPP.
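As a rough illustration of how MiPP rewards confident, correct predictions while penalizing errors, the sketch below computes a MiPP-style score from class labels and predicted posterior probabilities. The exact formulation should be taken from the cited work [4,28]; this version, which sums the posterior probabilities of the correct class for correctly classified samples and subtracts the number of misclassified samples, is a paraphrase of that idea.

```python
# Illustrative MiPP-style score: sum of correct-class posteriors for correctly
# classified samples minus the number of misclassified samples (paraphrased from [4,28]).
import numpy as np

def mipp_score(y_true, posterior):
    """y_true: (n,) integer labels; posterior: (n, k) class probabilities."""
    y_true = np.asarray(y_true)
    posterior = np.asarray(posterior)
    y_pred = posterior.argmax(axis=1)
    correct = y_pred == y_true
    p_true = posterior[np.arange(len(y_true)), y_true]
    return p_true[correct].sum() - (~correct).sum()

# Perfect, fully confident classification of 38 samples gives the maximum score of 38.
y = np.array([0] * 27 + [1] * 11)
perfect = np.eye(2)[y]            # one-hot posteriors
print(mipp_score(y, perfect))     # 38.0
```

For SVMs, the posterior column would have to come from mapping decision-boundary distances to pseudo-probabilities, which is why the comparison above cautions that MiPP values involving SVMs are not strictly like-for-like.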

Genomic pathway modeling


Many recent pathway-modeling studies for transcriptional regulation and
gene functional networks have been performed based on genomic expression
data. These approaches can be largely divided into three categories: qualitative, quantitative, and integrative pathway modeling, based on the different types of genomic data used. This article introduces several pathway-modeling approaches with references. Pathway modeling is one of the most active research fields in current genomic sciences, and substantial additional information can be found in the references.
Qualitative pathway modeling
Pathway modeling has been performed using functional and annotation
information from several genomic databases. For example, computationally
predicting genomewide transcription units based on pathway-genome databases (PGDBs) and other organizational (ie, protein complexes) annotation improved transcription unit organization information in Escherichia coli and Bacillus subtilis [31]. A classification of transcription factors is also proposed to organize pathways that connect extracellular signaling to the regulation of transcription and constitutively activate nuclear factors in eukaryotic cells. This latter classification was performed based on known cellular characteristics that describe the roles of these factors within regulatory circuits to identify many downstream functional mechanisms, such as serine and tyrosine phosphorylation; Rel/nuclear factor-κB family, Ci/Gli, Wnt and Notch pathway, nuclear factor of activated T cells transcription factor activation; and Ca2+ increase [32].


Quantitative pathway modeling


Gene regulation networks have also been explored based on quantitative
genomic expression data. For example, Bayesian network modeling was used to capture regulatory interactions between genes based on genomewide expression measurements in yeast [33].
Probabilistic models for context-specific regulatory relationships were also proposed to capture complex expression patterns of many genes in various biological conditions, accounting for known variable factors, such as experimental settings, putative binding sites, or functional information in yeast stress and compendium data [34]. These quantitative pathway models have been found to characterize effectively both the relationships and the magnitudes of relevant genes' expression patterns, and have been used extensively in recent pathway modeling in various microarray studies [33-35].
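As a toy illustration of the kind of score-based structure search that underlies such network models, the sketch below greedily picks parents for each gene using a Gaussian BIC criterion on an expression matrix. It is not the Bayesian network procedure of [33] or the context-specific models of [34]; the simulated data, the small candidate-parent set, and the scoring details are simplifying assumptions.

```python
# Toy score-based network learning: greedy Gaussian-BIC parent selection per gene.
# A simplified stand-in for the Bayesian network / probabilistic models cited above.
import numpy as np

def bic_score(y, X):
    """BIC of a linear-Gaussian model y ~ X (X includes an intercept column)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = max(resid @ resid / n, 1e-12)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return loglik - 0.5 * (k + 1) * np.log(n)

def greedy_parents(expr, target, candidates, max_parents=3):
    """expr: samples x genes matrix; returns parent indices chosen for `target`."""
    n = expr.shape[0]
    parents, pool = [], list(candidates)
    design = lambda ps: np.column_stack([np.ones(n)] + [expr[:, p] for p in ps])
    best = bic_score(expr[:, target], design([]))
    while pool and len(parents) < max_parents:
        scores = {p: bic_score(expr[:, target], design(parents + [p])) for p in pool}
        p_best = max(scores, key=scores.get)
        if scores[p_best] <= best:            # stop when no parent improves the score
            break
        best = scores[p_best]
        parents.append(p_best)
        pool.remove(p_best)
    return parents

rng = np.random.default_rng(1)
expr = rng.normal(size=(60, 10))
expr[:, 0] = 0.8 * expr[:, 3] - 0.5 * expr[:, 7] + 0.3 * rng.normal(size=60)
print(greedy_parents(expr, target=0, candidates=range(1, 10)))  # expect genes 3 and 7
```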
Integrative pathway modeling
Integration of qualitative and quantitative gene network information has
been attempted in recent pathway modeling studies. For example, a comprehensive genomic module map was constructed by combining gene expression with known functional and transcriptional information, wherein each module represents a distinctive set of directly regulated and associated genes that act in concert to perform a specific function. In this study, different expression activities in tumors were described in terms of the behavior of these modules [35]. Regression on transcription motifs has also been proposed for discovering candidate regulatory motifs in the upstream sequences of genes that undergo expression changes in various biological conditions. This method combines known motif structural information and gene expression patterns through an integrated regression analysis [36].
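The core of such motif regression can be illustrated in a few lines: expression changes are regressed on per-gene motif scores, and motifs with large coefficients become candidate regulators. This is a minimal sketch of that idea, not the method of [36]; the motif-count matrix and expression log-ratios below are simulated placeholders.

```python
# Minimal motif-regression sketch: regress expression log-ratios on upstream motif counts.
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_motifs = 500, 20
motif_counts = rng.poisson(1.0, size=(n_genes, n_motifs))   # motif hits per upstream region
true_effect = np.zeros(n_motifs)
true_effect[[2, 11]] = [0.9, -0.6]                           # two motifs actually matter (simulated)
log_ratio = motif_counts @ true_effect + rng.normal(scale=0.5, size=n_genes)

# Ordinary least squares with an intercept; coefficient magnitudes rank candidate motifs.
X = np.column_stack([np.ones(n_genes), motif_counts])
beta, *_ = np.linalg.lstsq(X, log_ratio, rcond=None)
ranked = np.argsort(-np.abs(beta[1:]))                       # skip the intercept
print("top candidate motifs:", ranked[:3], "coefficients:", beta[1:][ranked[:3]].round(2))
```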

Genomic biomarkers for disease progression and chemosensitivity


Recently, genomic data have been used to predict the outcome and tumor
chemosensitivity in patients who have cancer. The prognosis of these
patients is frequently uncertain, and histologically based prognostic indicators are sometimes inaccurate because of the inherent complexities involved.
Deregulation of tumor cells leads to uncontrolled division, invasion, and
metastasis; the specific patterns of deregulation in a particular tumor and which genetic pathways are altered likely affect the course of the disease.
Based on these observations, genomic predictors have been developed for
the progression of metastatic disease after removal of breast cancer tumors
[37]. In another study, genomic predictors were generated using tumor samples from 78 patients, of whom 34 developed metastasis within 5 years of
surgical resection [38]. The first step in the development of the predictor was to determine, using microarray analysis, which genes were significantly differentially expressed compared with a pool of all specimens. Of 24,479 genes on the chip, 4968 were found to have at least a twofold change in
expression and a P value less than 0.01 in at least five tumors. To determine which genes could be used to predict metastasis, the correlation coefficients between gene expression values and metastasis development were calculated, and 231 genes were found with an absolute correlation coefficient greater than 0.3. Leave-one-out cross-validation on sequentially larger subsets of genes determined that a group of 70 genes yielded the classifier with the lowest error. Testing the classifier on a 19-patient data set resulted in two incorrect predictions, and this work was later expanded to predict survival times for patients who had breast cancer, accurately distinguishing disease-free long-term survivors from patients who had a poor prognosis [39].
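A schematic version of that workflow (fold-change and P-value filtering aside) looks roughly like the following: correlation-ranked genes are evaluated in nested subsets by leave-one-out cross-validation. The simulated data, the nearest-centroid classifier, and the subset step size are illustrative assumptions, not the exact procedure of [38].

```python
# Sketch: rank genes by correlation with outcome, then pick the subset size
# that minimizes leave-one-out error (illustrative; not the published pipeline).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(3)
X = rng.normal(size=(78, 1000))                  # 78 tumors x 1000 (filtered) genes, simulated
y = (rng.random(78) < 0.44).astype(int)          # 1 = metastasis within 5 years (simulated)

# Rank genes by absolute correlation with the outcome
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
order = np.argsort(-np.abs(corr))

best_k, best_err = None, np.inf
for k in range(5, 105, 5):                       # nested subsets of the top-ranked genes
    cols = order[:k]
    acc = cross_val_score(NearestCentroid(), X[:, cols], y, cv=LeaveOneOut()).mean()
    err = 1 - acc
    if err < best_err:
        best_k, best_err = k, err
print(f"best subset size: {best_k} genes, LOO error: {best_err:.3f}")
```

As reference [30] cautions, ranking genes on the full data set before cross-validation, as in this sketch, can bias the error estimate downward; re-ranking within each training fold avoids that bias.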
A combination of in vitro assays and genomic data was used to predict
the prognosis of patients who had non-small cell lung cancer based on Bayesian computational classification techniques [40]. In this study, microarray profiling analysis was performed to determine the gene expression profiles of the various tumors, using Affymetrix HG-U133 Plus 2.0 chips. An initial filtering step found approximately 2070 genes highly correlated with lung cancer recurrence. The k-means method was used to generate gene clusters, which were
then analyzed using singular value decomposition to generate a metagene,
the dominant average expression pattern of the cluster. These metagenes
were used in a binary regression tree to partition tumor samples into subsets
on which predictions of recurrence could be made; the prediction accuracy
for 5-year disease-free survival was more than 90% among the predicted long-term survivors, compared with less than 40% among the patients with a predicted poor prognosis.
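The metagene construction described here can be sketched as follows: genes are clustered with k-means, and the first singular vector of each cluster's expression submatrix serves as that cluster's metagene, the dominant expression pattern across samples. The code is an illustrative reading of that step, not the authors' implementation; the cluster count and data are placeholders, and the downstream binary regression tree is omitted.

```python
# Sketch: k-means gene clusters and per-cluster SVD "metagenes" (dominant expression
# pattern across samples). Illustrative only; the regression-tree step is not shown.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
expr = rng.normal(size=(89, 2070))           # samples x filtered genes (simulated)

k = 25                                       # number of gene clusters (assumed)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(expr.T)

metagenes = np.zeros((expr.shape[0], k))     # samples x metagenes
for c in range(k):
    sub = expr[:, labels == c]               # samples x genes in this cluster
    # First left singular vector = dominant expression pattern over samples
    u, s, vt = np.linalg.svd(sub - sub.mean(axis=0), full_matrices=False)
    metagenes[:, c] = u[:, 0] * s[0]

print(metagenes.shape)                       # (89, 25): one metagene profile per cluster
```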
Summary
The authors believe there are several reasons why these genomic data mining approaches have been successful and why they represent a promising direction for future work. First, gene networks inferred from genomic expression signatures have been found to be highly relevant to patient prognosis and chemotherapeutic responses [41-43]. Even though many individual gene expression values in these networks are often variable and noisy, the networks as a whole have been found to be consistent in their overall expression patterns [44-46]. Second, genomewide RNA expression profiling techniques such as microarrays and GeneChips have improved dramatically in recent years, so that the expression patterns of the entire human genome can now be measured accurately and cost-effectively on patient samples. In fact, microarray RNA profiling is one of the most accurately quantifiable and comprehensive profiling biotechnologies among all current high-throughput biotechniques, including comparative genomic hybridization, spectral karyotyping, serial
analysis of gene expression, two-dimensional gel electrophoresis, mass spectrometry, and protein arrays [47,48]. Third, bioinformatics analysis methods and techniques for these microarray data have improved significantly,
especially in testing (eg, SAM, LPE, false discovery rate), clustering (eg, hierarchical, self-organizing map, K-means, response projected clustering),
classification (linear discriminant analysis, SVMs, logistic regression, random forest), and pathway analysis (Gene Map Annotator and Pathway Profiler, Ingenuity Pathway Analysis) for investigating the complex and extensive information in massive genomic data sets effectively and efficiently [1,4,24]. Finally, and most importantly, thanks to significant efforts by the National Institutes of Health (Gene Expression Omnibus [GEO]) and the European Bioinformatics Institute (ArrayExpress), many valuable microarray data sets of cancer (cell lines and patients) have been archived for public access. For example, GEO currently archives more than 5550 microarray data sets on more than 150,000 different biomedical samples and human patients, with more than 1500 sets for cancer alone. Furthermore, despite their technical differences, microarray data sets from different time points, laboratories, and even platforms contain consistent information for many gene expression patterns, so that investigations can be performed successfully across these different genomic data sets. This large and rapidly increasing compendium of data demands data mining approaches and ensures that genomic data mining will continue to be a necessary and highly productive field.
References
[1] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 2001;98(9):5116-21.
[2] Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 2003;100(16):9440-5.
[3] Hastie T, Tibshirani R, Eisen MB, et al. Gene shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 2000;1(2):RESEARCH0003.
[4] Soukup M, Cho H, Lee JK. Robust classification modeling on microarray data using misclassification penalized posterior. Bioinformatics 2005;21(Suppl 1):i423-30.
[5] Benjamini Y, Drai D, Elmer G, et al. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 2001;125(1-2):279-84.
[6] Jain N, Thatte J, Braciale T, et al. Local-pooled-error test for identifying differentially expressed genes with a small number of replicated microarrays. Bioinformatics 2003;19(15):1945-51.
[7] Jain N, Cho H, O'Connell N, et al. Rank-invariant resampling based estimation of false discovery rate for analysis of small sample microarray data. BMC Bioinformatics 2005;6:187.
[8] Baldi P, Long AD. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 2001;17(6):509-19.
[9] Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet Epidemiol 2002;23(1):70-86.
[10] Kerr MK, Martin M, Churchill GA. Analysis of variance for gene expression microarray data. J Comput Biol 2000;7(6):819-37.
[11] Kerr MK, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc Natl Acad Sci U S A 2001;98(16):8961-5.
[12] Wolfinger RD, Gibson G, Wolfinger ED, et al. Assessing gene significance from cDNA microarray expression data via mixed models. J Comput Biol 2001;8(6):625-37.
[13] Newton MA, Kendziorski CM, Richmond CS, et al. On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 2001;8(1):37-52.
[14] Ibrahim JG, Chen M-H, Gray RJ. Bayesian models for gene expression with DNA microarray data. J Am Stat Assoc 2002;97:88-99.
[15] Cho H, Lee JK. Bayesian hierarchical error model for analysis of gene expression data. Bioinformatics 2004;20(13):2016-25.
[16] Kerr MK, Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet Res 2001;77(2):123-8.
[17] Lee JK, Bussey KJ, Gwadry FG, et al. Comparing cDNA and oligonucleotide array data: concordance of gene expression across platforms for the NCI-60 cancer cells. Genome Biol 2003;4(12):R82.
[18] Scherf U, Ross DT, Waltham M, et al. A gene expression database for the molecular pharmacology of cancer. Nat Genet 2000;24(3):236-44.
[19] Weinstein JN, Scherf U, Lee JK, et al. The bioinformatics of microarray gene expression profiling. Cytometry 2002;47(1):46-9.
[20] Tseng GC, Wong WH. Tight clustering: a resampling-based approach for identifying stable and tight patterns in data. Biometrics 2005;61(1):10-6.
[21] Golub TR, Slonim DK, Tamayo P, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999;286(5439):531-7.
[22] West M, Blanchette C, Dressman H, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc Natl Acad Sci U S A 2001;98(20):11462-7.
[23] Su AI, Welsh JB, Sapinoso LM, et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res 2001;61(20):7388-93.
[24] Furey TS, Cristianini N, Duffy N, et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics 2000;16(10):906-14.
[25] Nguyen DV, Rocke DM. Partial least squares proportional hazard regression for application to DNA microarray survival data. Bioinformatics 2002;18(12):1625-32.
[26] Li L, Darden TA, Weinberg CR, et al. Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Comb Chem High Throughput Screen 2001;4(8):727-39.
[27] Hand DJ. Construction and assessment of classification rules. Chichester: John Wiley & Sons; 1997.
[28] Soukup M, Lee JK. Developing optimal prediction models for cancer classification using gene expression data. J Bioinform Comput Biol 2004;1(4):681-94.
[29] Pampel FC. Logistic regression: a primer. Sage University Papers Series on Quantitative Applications in the Social Sciences; 2000.
[30] Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci U S A 2002;99(10):6562-6.
[31] Romero PR, Karp PD. Using functional and organizational information to improve genome-wide computational prediction of transcription units on pathway-genome databases. Bioinformatics 2004;20(5):709-17.
[32] Brivanlou AH, Darnell JE Jr. Signal transduction and the control of gene expression. Science 2002;295(5556):813-8.
[33] Friedman N, Linial M, Nachman I, et al. Using Bayesian networks to analyze expression data. J Comput Biol 2000;7(3-4):601-20.
[34] Segal E, Taskar B, Gasch A, et al. Rich probabilistic models for gene expression. Bioinformatics 2001;17(Suppl 1):S243-52.
[35] Segal E, Friedman N, Koller D, et al. A module map showing conditional activity of expression modules in cancer. Nat Genet 2004;36(10):1090-8.
[36] Conlon EM, Liu XS, Lieb JD, et al. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci U S A 2003;100(6):3339-44.
[37] van 't Veer LJ, Dai H, van de Vijver MJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002;415(6871):530-6.
[38] van 't Veer LJ, Dai H, van de Vijver MJ, et al. Expression profiling predicts outcome in breast cancer. Breast Cancer Res 2003;5(1):57-8.
[39] Dressman HK, Hans C, Bild A, et al. Gene expression profiles of multiple breast cancer phenotypes and response to neoadjuvant chemotherapy. Clin Cancer Res 2006;12(3 Pt 1):819-26.
[40] Potti A, Mukherjee S, Petersen R, et al. A genomic strategy to refine prognosis in early-stage non-small-cell lung cancer. N Engl J Med 2006;355(6):570-80.
[41] Miller LD, Smeds J, George J, et al. An expression signature for p53 status in human breast cancer predicts mutation status, transcriptional effects, and patient survival. Proc Natl Acad Sci U S A 2005;102(38):13550-5.
[42] Havaleshko DM, Cho H, Conaway M, et al. Prediction of drug combination chemosensitivity in human bladder cancer. Mol Cancer Ther 2007;6(2):578-86.
[43] Paik S, Shak S, Tang G, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N Engl J Med 2004;351(27):2817-26.
[44] Horvath S, Zhang B, Carlson M, et al. Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a molecular target. Proc Natl Acad Sci U S A 2006;103(46):17402-7.
[45] Bild AH, Yao G, Chang JT, et al. Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature 2006;439(7074):353-7.
[46] Potti A, Yao G, Chang JT, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med 2006;12(11):1294-300.
[47] Ma XJ, Patel R, Wang X, et al. Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay. Arch Pathol Lab Med 2006;130(4):465-73.
[48] Puskas LG, Juhasz F, Zarva A, et al. Gene profiling identifies genes specific for well-differentiated epithelial thyroid tumors. Cell Mol Biol (Noisy-le-grand) 2005;51(2):177-86.
