

From Data Mining to Knowledge Discovery in Databases

Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth

■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined. This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Across a wide variety of fields, data are being collected and accumulated at a
dramatic pace. There is an urgent need
for a new generation of computational theo-
ries and tools to assist humans in extracting
useful information (knowledge) from the
rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.¹

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1996 / $2.00 FALL 1996 37

intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes to an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, “If customer bought X, he/she is also likely to buy Y and Z.” Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).

Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM’s ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://…>). CRAYON (<http://…>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www.…>) from the San Jose Mercury News and FARCAST (<http://www.far…>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets


and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician’s “art” of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

[Figure omitted: a flow from Data to Target Data to Preprocessed Data, through transformation, data mining, and interpretation, to Knowledge.]
Figure 1. An Overview of the Steps That Compose the KDD Process.

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of


patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer’s viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different than models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention.
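The multistep process outlined above can be sketched as a small pipeline. This is a deliberately toy illustration: every function name and every selection, cleaning, and mining rule below is a hypothetical assumption for the sake of the example, not part of any deployed KDD system described in this article.

```python
# A minimal sketch of the KDD process as a pipeline (steps 2, 3, 4, and 7).
# All rules here are illustrative assumptions; real KDD is interactive,
# iterative, and domain specific.

def select_target(records, fields):
    # Step 2: create a target data set by focusing on a subset of fields.
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    # Step 3: a toy cleaning rule -- drop records with missing values.
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    # Step 4: data reduction/projection -- derive a debt-to-income ratio.
    return [dict(r, ratio=r["debt"] / r["income"]) for r in records]

def mine(records, threshold):
    # Step 7: a trivial "data-mining" step -- enumerate one candidate
    # pattern and count how many records support it.
    flagged = [r for r in records if r["ratio"] > threshold]
    return {"pattern": f"ratio > {threshold}", "support": len(flagged)}

raw = [
    {"income": 50, "debt": 40, "age": 30},
    {"income": 80, "debt": 20, "age": None},
    {"income": 30, "debt": 45, "age": 41},
]
target = select_target(raw, ["income", "debt"])
result = mine(transform(clean(target)), threshold=0.75)
print(result)
```

Steps 8 and 9 (interpretation and acting on the result) would then decide, with a human in the loop, whether such a pattern is actually interesting.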


The Data-Mining Step of the KDD Process
The data-mining component of the KDD process often involves repeated iterative application of particular data-mining methods. This section presents an overview of the primary goals of data mining, a description of the methods used to address these goals, and a brief description of the data-mining algorithms that incorporate these methods.

The knowledge discovery goals are defined by the intended use of the system. We can distinguish two types of goals: (1) verification and (2) discovery. With verification, the system is limited to verifying the user’s hypothesis. With discovery, the system autonomously finds new patterns. We further subdivide the discovery goal into prediction, where the system finds patterns for predicting the future behavior of some entities, and description, where the system finds patterns for presentation to a user in a human-understandable form. In this article, we are primarily concerned with discovery-oriented data mining.

[Figure omitted: a scatter of x and o data points plotted against income.]
Figure 2. A Simple Data Set with Two Classes Used for Illustrative Purposes.

Data mining involves fitting models to, or determining patterns from, observed data. The fitted models play the role of inferred knowledge: Whether the models reflect useful or interesting knowledge is part of the overall, interactive KDD process where subjective human judgment is typically required. Two primary mathematical formalisms are used in model fitting: (1) statistical and (2) logical. The statistical approach allows for nondeterministic effects in the model, whereas a logical model is purely deterministic. We focus primarily on the statistical approach to data mining, which tends to be the most widely used basis for practical data-mining applications given the typical presence of uncertainty in real-world data-generating processes.

Most data-mining methods are based on tried and tested techniques from machine learning, pattern recognition, and statistics: classification, clustering, regression, and so on. The array of different algorithms under each of these headings can often be bewildering to both the novice and the experienced data analyst. It should be emphasized that of the many data-mining methods advertised in the literature, there are really only a few fundamental techniques. The actual underlying model representation being used by a particular method typically comes from a composition of a small number of well-known options: polynomials, splines, kernel and basis functions, threshold-Boolean functions, and so on. Thus, algorithms tend to differ primarily in the goodness-of-fit criterion used to evaluate model fit or in the search method used to find a good fit.

In our brief overview of data-mining methods, we try in particular to convey the notion that most (if not all) methods can be viewed as extensions or hybrids of a few basic techniques and principles. We first discuss the primary methods of data mining and then show that the data-mining methods can be viewed as consisting of three primary algorithmic components: (1) model representation, (2) model evaluation, and (3) search. In the discussion of KDD and data-mining methods, we use a simple example to make some of the notions more concrete. Figure 2 shows a simple two-dimensional artificial data set consisting of 23 cases. Each point on the graph represents a person who has been given a loan by a particular bank at some time in the past. The horizontal axis represents the income of the person; the vertical axis represents the total personal debt of the person (mortgage, car payments, and so on). The data have been classified into two classes: (1) the x’s represent persons who have defaulted on their loans and (2) the o’s represent persons whose loans are in good status with the bank. Thus, this simple artificial data set could represent a historical data set that can contain useful knowledge from the point of view of the bank making the loans. Note that in actual KDD applications, there are typically many more dimensions (as many as several hundreds) and many more data points (many thousands or even millions).

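To make the running example concrete, here is a small synthetic stand-in for the kind of loan data set just described. The actual 23 data points of figure 2 are not recoverable from the text, so the values and the toy labeling rule below are generated for illustration, not taken from the article.

```python
# A synthetic stand-in for the figure-2 data set: 23 cases, each with an
# income, a total personal debt, and a class label ("x" = defaulted,
# "o" = loan in good status). The numbers are generated, not the article's.
import random

random.seed(0)
cases = []
for _ in range(23):
    income = random.uniform(20, 100)  # assumed units, e.g., $1000s
    debt = random.uniform(5, 80)
    # Hypothetical labeling rule: high debt relative to income -> default.
    label = "x" if debt > 0.6 * income else "o"
    cases.append({"income": income, "debt": debt, "class": label})

print(len(cases), sum(1 for c in cases if c["class"] == "x"))
```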

The purpose here is to illustrate basic ideas on a small problem in two-dimensional space.

Figure 3. A Simple Linear Classification Boundary for the Loan Data Set. The shaded region denotes class no loan.

Data-Mining Methods

The two high-level primary goals of data mining in practice tend to be prediction and description. As stated earlier, prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest, and description focuses on finding human-interpretable patterns describing the data. Although the boundaries between prediction and description are not sharp (some of the predictive models can be descriptive, to the degree that they are understandable, and vice versa), the distinction is useful for understanding the overall discovery goal. The relative importance of prediction and description for particular data-mining applications can vary considerably. The goals of prediction and description can be achieved using a variety of particular data-mining methods.
Classification is learning a function that maps (classifies) a data item into one of several predefined classes (Weiss and Kulikowski 1991; Hand 1981). Examples of classification methods used as part of knowledge discovery applications include the classifying of trends in financial markets (Apte and Hong 1996) and the automated identification of objects of interest in large image databases (Fayyad, Djorgovski, and Weir 1996). Figure 3 shows a simple partitioning of the loan data into two class regions; note that it is not possible to separate the classes perfectly using a linear decision boundary. The bank might want to use the classification regions to automatically decide whether future loan applicants will be given a loan or not.
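The linear classification boundary just described can be sketched in a few lines. This is an illustrative sketch only, not code from the article: the weights, bias, and historical cases below are invented, and one case is deliberately placed on the wrong side of the boundary to echo the point that a linear boundary cannot separate the classes perfectly.

```python
# Sketch of a fixed linear decision boundary in the income-debt plane,
# in the spirit of figure 3. All numbers here are invented for illustration.

def classify(income, debt, w_income=1.0, w_debt=-1.0, bias=-20.0):
    """Assign 'loan' when the point falls on the positive side of the
    hyperplane w_income*income + w_debt*debt + bias = 0."""
    score = w_income * income + w_debt * debt + bias
    return "loan" if score > 0 else "no loan"

# Hypothetical historical cases: (income, debt, recorded outcome).
cases = [
    (60, 10, "loan"),
    (80, 30, "loan"),
    (30, 40, "no loan"),
    (25, 15, "no loan"),
    (75, 50, "no loan"),  # a defaulter the linear boundary gets wrong
]
correct = sum(classify(inc, debt) == outcome for inc, debt, outcome in cases)
accuracy = correct / len(cases)  # 4 of 5 cases classified correctly
```

In a real system, the weights would of course be learned from the historical data rather than fixed by hand.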
Figure 4. A Simple Linear Regression for the Loan Data Set. (Debt plotted against income, with the fitted regression line.)

Regression is learning a function that maps
a data item to a real-valued prediction variable. Regression applications are many, for example, predicting the amount of biomass present in a forest given remotely sensed microwave measurements, estimating the probability that a patient will survive given the results of a set of diagnostic tests, predicting consumer demand for a new product as a function of advertising expenditure, and predicting time series where the input variables can be time-lagged versions of the prediction variable. Figure 4 shows the result of simple linear regression where total debt is fitted as a linear function of income: The fit is poor because only a weak correlation exists between the two variables.
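The straight-line fit in figure 4 is what ordinary least squares produces. The sketch below uses the standard closed-form slope and intercept; the (income, debt) pairs are invented and only weakly related, so the fit is poor, as in the figure.

```python
# Ordinary least-squares fit of debt as a linear function of income,
# in the spirit of figure 4. The data pairs are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

incomes = [20, 30, 40, 50, 60]
debts = [15, 40, 10, 45, 20]  # only weakly correlated with income
slope, intercept = fit_line(incomes, debts)
predicted_debts = [slope * x + intercept for x in incomes]
```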
Clustering is a common descriptive task where one seeks to identify a finite set of categories or clusters to describe the data (Jain and Dubes 1988; Titterington, Smith, and Makov 1985). The categories can be mutually exclusive and exhaustive or consist of a richer representation, such as hierarchical or overlapping categories. Examples of clustering applications in a knowledge discovery context include discovering homogeneous subpopulations for consumers in marketing databases and identifying subcategories of spectra from infrared sky measurements (Cheeseman and Stutz 1996). Figure 5 shows a possible clustering of the loan data set into three clusters; note that the clusters overlap, allowing data points to belong to more than one cluster. The original class labels (denoted by x's and o's in the previous figures) have been replaced by a + to indicate that the class membership is no longer assumed known. Closely related to clustering is the task of probability density estimation, which consists of techniques for estimating from data the joint multivariate probability density function of all the variables or fields in the database (Silverman 1986).

Figure 5. A Simple Clustering of the Loan Data Set into Three Clusters. Note that original labels are replaced by a +.

Summarization involves methods for finding a compact description for a subset of data. A simple example would be tabulating the mean and standard deviations for all fields. More sophisticated methods involve the derivation of summary rules (Agrawal et al. 1996), multivariate visualization techniques, and the discovery of functional relationships between variables (Zembowicz and Zytkow 1996). Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

Dependency modeling consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: (1) the structural level of the model specifies (often in graphic form) which variables are locally dependent on each other and (2) the quantitative level of the model specifies the strengths of the dependencies using some numeric scale. For example, probabilistic dependency networks use conditional independence to specify the structural aspect of the model and probabilities or correlations to specify the strengths of the dependencies (Glymour et al. 1987; Heckerman 1996). Probabilistic dependency networks are increasingly finding applications in areas as diverse as the development of probabilistic medical expert systems from databases, information retrieval, and modeling of the human genome.

Change and deviation detection focuses on discovering the most significant changes in the data from previously measured or normative values (Berndt and Clifford 1996; Guyon, Matic, and Vapnik 1996; Kloesgen 1996; Matheus, Piatetsky-Shapiro, and McNeill 1996; Basseville and Nikiforov 1993).

The Components of Data-Mining Algorithms

The next step is to construct specific algorithms to implement the general methods we outlined. One can identify three primary components in any data-mining algorithm: (1) model representation, (2) model evaluation, and (3) search.

This reductionist view is not necessarily complete or fully encompassing; rather, it is a convenient way to express the key concepts of data-mining algorithms in a relatively unified and compact manner. Cheeseman (1990) outlines a similar structure.

Model representation is the language used to describe discoverable patterns. If the representation is too limited, then no amount of training time or examples can produce an accurate model for the data. It is important that a data analyst fully comprehend the representational assumptions that might be inherent in a particular method. It is equally important that an algorithm designer clearly state which representational assumptions are being made by a particular algorithm. Note that increased representational power for models increases the danger of overfitting the training data, resulting in reduced prediction accuracy on unseen data.

Model-evaluation criteria are quantitative

statements (or fit functions) of how well a particular pattern (a model and its parameters) meets the goals of the KDD process. For example, predictive models are often judged by the empirical prediction accuracy on some test set. Descriptive models can be evaluated along the dimensions of predictive accuracy, novelty, utility, and understandability of the fitted model.

Search method consists of two components: (1) parameter search and (2) model search. Once the model representation (or family of representations) and the model-evaluation criteria are fixed, then the data-mining problem has been reduced to purely an optimization task: Find the parameters and models from the selected family that optimize the evaluation criteria. In parameter search, the algorithm must search for the parameters that optimize the model-evaluation criteria given observed data and a fixed model representation. Model search occurs as a loop over the parameter-search method: The model representation is changed so that a family of models is considered.

Some Data-Mining Methods

A wide variety of data-mining methods exist, but here, we only focus on a subset of popular techniques. Each method is discussed in the context of model representation, model evaluation, and search.

Decision Trees and Rules

Decision trees and rules that use univariate splits have a simple representational form, making the inferred model relatively easy for the user to comprehend. However, the restriction to a particular tree or rule representation can significantly restrict the functional form (and, thus, the approximation power) of the model. For example, figure 6 illustrates the effect of a threshold split applied to the income variable for a loan data set: It is clear that using such simple threshold splits (parallel to the feature axes) severely limits the type of classification boundaries that can be induced. If one enlarges the model space to allow more general expressions (such as multivariate hyperplanes at arbitrary angles), then the model is more powerful for prediction but can be much more difficult to comprehend. A large number of decision tree and rule-induction algorithms are described in the machine-learning and applied statistics literature (Quinlan 1992; Breiman et al. 1984).

Figure 6. Using a Single Threshold on the Income Variable to Try to Classify the Loan Data Set.

To a large extent, they depend on likelihood-based model-evaluation methods, with varying degrees of sophistication in terms of penalizing model complexity. Greedy search methods, which involve growing and pruning rule and tree structures, are typically used to explore the superexponential space of possible models. Trees and rules are primarily used for predictive modeling, both for classification (Apte and Hong 1996; Fayyad, Djorgovski, and Weir 1996) and regression, although they can also be applied to summary descriptive modeling (Agrawal et al. 1996).

Nonlinear Regression and Classification Methods

These methods consist of a family of techniques for prediction that fit linear and nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables. Examples include feedforward neural networks, adaptive spline methods, and projection pursuit regression (see Elder and Pregibon [1996], Cheng and Titterington [1994], and Friedman [1989] for more detailed discussions). Consider neural networks, for example. Figure 7 illustrates the type of nonlinear decision boundary that a neural network might find for the loan data set. In terms of model evaluation, although networks of the appropriate size can universally approximate any smooth function to any desired degree of accuracy, relatively little is known about the representation properties of fixed-size networks estimated from finite data sets. Also, the standard squared error and


cross-entropy loss functions used to train neural networks can be viewed as log-likelihood functions for regression and classification, respectively (Ripley 1994; Geman, Bienenstock, and Doursat 1992). Back propagation is a parameter-search method that performs gradient descent in parameter (weight) space to find a local maximum of the likelihood function starting from random initial conditions. Nonlinear regression methods, although powerful in representational power, can be difficult to interpret.

For example, although the classification boundaries of figure 7 might be more accurate than the simple threshold boundary of figure 6, the threshold boundary has the advantage that the model can be expressed, to some degree of certainty, as a simple rule of the form "if income is greater than threshold, then loan will have good status."

Figure 7. An Example of Classification Boundaries Learned by a Nonlinear Classifier (Such as a Neural Network) for the Loan Data Set.
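The view of back propagation as gradient descent in weight space can be made concrete with the smallest possible case: a single sigmoid unit trained on squared error. This is a sketch of the parameter-search idea only, not the article's algorithm; the scaled toy data, zero initialization, learning rate, and step count are all invented.

```python
import math

# Gradient descent as parameter search: one sigmoid unit fitted to
# toy (income, debt) points scaled to [0, 1]. Label 1 means "loan".

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(points, labels, steps=5000, lr=0.5):
    """Per-example gradient descent on squared error for weights (w1, w2, b)."""
    w1 = w2 = b = 0.0
    for _ in range(steps):
        for (x1, x2), y in zip(points, labels):
            p = sigmoid(w1 * x1 + w2 * x2 + b)
            grad = (p - y) * p * (1 - p)  # derivative of 0.5*(p - y)**2 w.r.t. z
            w1 -= lr * grad * x1
            w2 -= lr * grad * x2
            b -= lr * grad
    return w1, w2, b

points = [(0.9, 0.2), (0.8, 0.3), (0.2, 0.8), (0.3, 0.9)]
labels = [1, 1, 0, 0]  # high income, low debt -> loan
w1, w2, b = train(points, labels)
preds = [round(sigmoid(w1 * x1 + w2 * x2 + b)) for x1, x2 in points]
```

A full network applies the same idea layer by layer through the chain rule; the result is still only a local optimum reached from the starting weights.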
Example-Based Methods
The representation is simple: Use representative examples from the database to approximate a model; that is, predictions on new examples are derived from the properties of similar examples in the model whose prediction is known. Techniques include nearest-neighbor classification and regression algorithms (Dasarathy 1991) and case-based reasoning systems (Kolodner 1993). Figure 8 illustrates the use of a nearest-neighbor classifier for the loan data set: The class at any new point in the two-dimensional space is the same as the class of the closest point in the original training data set.
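The nearest-neighbor rule just described fits in a few lines; the training points below are invented loan-style examples, with income and debt in the same units so plain Euclidean distance is sensible.

```python
import math

# Nearest-neighbor classification, in the spirit of figure 8: a query point
# takes the class of the closest training point. Data are invented.

def nearest_neighbor(train, query):
    """train is a list of ((income, debt), label) pairs."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    _, label = min(train, key=lambda item: dist(item[0], query))
    return label

train = [
    ((60, 10), "loan"),
    ((80, 30), "loan"),
    ((30, 40), "no loan"),
    ((25, 15), "no loan"),
]
decision = nearest_neighbor(train, (70, 20))  # closest point is a "loan" case
```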
Figure 8. Classification Boundaries for a Nearest-Neighbor Classifier for the Loan Data Set.

A potential disadvantage of example-based methods (compared with tree-based methods) is that a well-defined distance metric for evaluating the distance between data points is required. For the loan data in figure 8, this would not be a problem because income and debt are measured in the same units. However, if one wished to include variables such as the duration of the loan, sex, and profession, then it would require more effort to define a
sensible metric between the variables. Model evaluation is typically based on cross-validation estimates (Weiss and Kulikowski 1991) of a prediction error: Parameters of the model to be estimated can include the number of neighbors to use for prediction and the distance metric itself. Like nonlinear regression methods, example-based methods are often asymptotically powerful in terms of approximation properties but, conversely, can be difficult to interpret because the model is implicit in the data and not explicitly formulated. Related techniques include kernel-density estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics). However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual


search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: The state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes. It is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Furthermore, low noise levels (few data errors) are another consideration. High amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making the application development more difficult, make it potentially much more useful because it is easier to retrain a system than a human. Finally, and perhaps one of the most important considerations, is prior knowledge. It is useful to know something about the domain: what are the important fields, what are the likely relationships, what is the user utility function, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problems that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables and millions of records and of a multigigabyte size are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables); so, the dimensionality of the problem is high. A high-dimensional data set creates problems in terms of increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
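The N/1000 arithmetic is easy to check by simulation. Under purely random data, a valid test's p-value is uniform on [0, 1], so each of N independent tests passes the 0.001 level with probability 0.001. The sketch below (an invented setup with a fixed seed) counts the resulting spurious acceptances and shows the Bonferroni-style adjustment that the text goes on to mention.

```python
import random

# Multiple-testing illustration: test many "models" on purely random data
# at level alpha and count how many look significant by chance alone.

random.seed(0)  # fixed seed so the run is repeatable
alpha = 0.001
n_models = 100_000

# Each model's p-value is uniform on [0, 1] when there is no real pattern.
spurious = sum(1 for _ in range(n_models) if random.random() < alpha)
expected = n_models * alpha  # about N/1000 spurious "discoveries"

# A Bonferroni-style adjustment divides alpha by the number of tests made.
bonferroni_alpha = alpha / n_models
```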


This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

Changing data and knowledge: Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only (Matheus, Piatetsky-Shapiro, and McNeill 1996). See also Agrawal and Psaila (1995) and Mannila, Toivonen, and Verkamo (1995).

Missing and noisy data: This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies (Heckerman 1996; Smyth et al. 1996).

Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed (Dzeroski 1996; Djoko, Cook, and Holder 1995).

Understandability of patterns: In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations (Buntine 1996; Heckerman 1996), rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule-refinement strategies (for example, Major and Mangano [1995]) can be used to address a related problem: The discovered knowledge might be implicitly or explicitly redundant.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search (for example, Simoudis, Livezey, and Kerber [1995]).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodating real-time sensor readings. Examples of integrated KDD systems are described by Simoudis, Livezey, and Kerber (1995) and Stolorz, Nakamura, Mesrobian, Muntz, Shek, Santos, Yi, Ng, Chien, Mechoso, and Farrara (1995).

Concluding Remarks: The Potential Role of AI in KDD

In addition to machine learning, other AI fields can potentially contribute significantly to various aspects of the KDD process. We mention a few examples of these areas here:

Natural language presents significant opportunities for mining in free-form text, especially for automated annotation and indexing prior to classification of text corpora. Limited parsing capabilities can help substantially in the task of deciding what an article refers to. Hence, the spectrum from simple natural language processing all the way to language understanding can help substantially. Also, natural language processing can contribute significantly as an effective interface for stating hints to mining algorithms and visualizing and explaining knowledge derived by a KDD system.

Planning considers a complicated data analysis process: It involves conducting complicated data-access and data-transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data-access constraints. Typically, data-processing steps are expressed in terms of desired postconditions and preconditions for the application of certain routines, which lends itself easily to representation as a planning problem. In addition, planning ability can play an important role in automated agents (see next item) to collect data samples or conduct a search to obtain needed data sets.

Intelligent agents can be fired off to collect necessary information from a variety of


sources. In addition, information agents can be activated remotely over the network or can trigger on the occurrence of a certain event and start an analysis operation. Finally, agents can help navigate and model the World-Wide Web (Etzioni 1996), another area growing in importance.

Uncertainty in AI includes issues for managing uncertainty, proper inference mechanisms in the presence of uncertainty, and reasoning about causality, all fundamental to KDD theory and practice. In fact, the KDD-96 conference had a joint session with the UAI-96 conference this year (Horvitz and Jensen 1996).

Knowledge representation includes ontologies, new concepts for representing, storing, and accessing knowledge. Also included are schemes for representing knowledge and allowing the use of prior human knowledge about the underlying process by the KDD system.

These potential contributions of AI are but a sampling; many others, including human-computer interaction, knowledge-acquisition techniques, and the study of mechanisms for reasoning, have the opportunity to contribute to KDD.

In conclusion, we presented some definitions of basic notions in the KDD field. Our primary aim was to clarify the relation between knowledge discovery and data mining. We provided an overview of the KDD process and basic data-mining methods. Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope: There are many data-mining techniques, particularly specialized methods for particular types of data and domains. Although various algorithms and applications might appear quite different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the task of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

This article represents a step toward a common framework that we hope will ultimately provide a unifying vision of the common overall goals and methods used in KDD. We hope this will eventually lead to a better understanding of the variety of approaches in this multidisciplinary field and how they fit together.

Acknowledgments

We thank Sam Uthurusamy, Ron Brachman, and KDD-96 referees for their valuable suggestions and ideas.

Note

1. Throughout this article, we use the term pattern to designate a pattern found in data. We also refer to models. One can think of patterns as components of models, for example, a particular rule in a classification model or a linear component in a regression model.

References

Agrawal, R., and Psaila, G. 1995. Active Data Mining. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 3–8. Menlo Park, Calif.: American Association for Artificial Intelligence.

Agrawal, R.; Mannila, H.; Srikant, R.; Toivonen, H.; and Verkamo, I. 1996. Fast Discovery of Association Rules. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 307–328. Menlo Park, Calif.: AAAI Press.

Apte, C., and Hong, S. J. 1996. Predicting Equity Returns from Securities Data with Minimal Rule Generation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 514–560. Menlo Park, Calif.: AAAI Press.

Basseville, M., and Nikiforov, I. V. 1993. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs, N.J.: Prentice Hall.

Berndt, D., and Clifford, J. 1996. Finding Patterns in Time Series: A Dynamic Programming Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 229–248. Menlo Park, Calif.: AAAI Press.

Berry, J. 1994. Database Marketing. Business Week, September 5, 56–62.

Brachman, R., and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human-Centered Approach. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 37–58. Menlo Park, Calif.: AAAI Press.

Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth.

Brodley, C. E., and Smyth, P. 1996. Applying Classification Algorithms in Practice. Statistics and Computing. Forthcoming.

Buntine, W. 1996. Graphical Models for Discovering Knowledge. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 59–82. Menlo Park, Calif.: AAAI Press.

Cheeseman, P. 1990. On Finding the Most Probable Model. In Computational Models of Scientific Discovery and Theory Formation, eds. J. Shrager and P. Langley, 73–95. San Francisco, Calif.: Morgan Kaufmann.

Cheeseman, P., and Stutz, J. 1996. Bayesian Classification (AUTOCLASS): Theory and Results. In Advances in Knowledge Discovery and Data Mining, eds.
FALL 1996 51

U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. ering Informative Patterns and Data Cleaning. In
Uthurusamy, 73–95. Menlo Park, Calif.: AAAI Press. Advances in Knowledge Discovery and Data Mining,
Cheng, B., and Titterington, D. M. 1994. Neural eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and
Networks—A Review from a Statistical Perspective. R. Uthurusamy, 181–204. Menlo Park, Calif.: AAAI
Statistical Science 9(1): 2–30. Press.
Codd, E. F. 1993. Providing OLAP (On-Line Analyti- Hall, J.; Mani, G.; and Barr, D. 1996. Applying
cal Processing) to User-Analysts: An IT Mandate. E. Computational Intelligence to the Investment Pro-
F. Codd and Associates. cess. In Proceedings of CIFER-96: Computational
Dasarathy, B. V. 1991. Nearest Neighbor (NN) Intelligence in Financial Engineering. Washington,
Norms: NN Pattern Classification Techniques. D.C.: IEEE Computer Society.
Washington, D.C.: IEEE Computer Society. Hand, D. J. 1994. Deconstructing Statistical Ques-
Djoko, S.; Cook, D.; and Holder, L. 1995. Analyzing tions. Journal of the Royal Statistical Society A. 157(3):
the Benefits of Domain Knowledge in Substructure 317–356.
Discovery. In Proceedings of KDD-95: First Interna- Hand, D. J. 1981. Discrimination and Classification.
tional Conference on Knowledge Discovery and Chichester, U.K.: Wiley.
Data Mining, 75–80. Menlo Park, Calif.: American Heckerman, D. 1996. Bayesian Networks for Knowl-
Association for Artificial Intelligence. edge Discovery. In Advances in Knowledge Discovery
Dzeroski, S. 1996. Inductive Logic Programming for and Data Mining, eds. U. Fayyad, G. Piatetsky-
Knowledge Discovery in Databases. In Advances in Shapiro, P. Smyth, and R. Uthurusamy, 273–306.
Knowledge Discovery and Data Mining, eds. U. Menlo Park, Calif.: AAAI Press.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Hernandez, M., and Stolfo, S. 1995. The MERGE -
Uthurusamy, 59–82. Menlo Park, Calif.: AAAI Press. PURGE Problem for Large Databases. In Proceedings
Elder, J., and Pregibon, D. 1996. A Statistical Per- of the 1995 ACM-SIGMOD Conference, 127–138.
spective on KDD. In Advances in Knowledge Discov- New York: Association for Computing Machinery.
ery and Data Mining, eds. U. Fayyad, G. Piatetsky- Holsheimer, M.; Kersten, M. L.; Mannila, H.; and
Shapiro, P. Smyth, and R. Uthurusamy, 83–116. Toivonen, H. 1996. Data Surveyor: Searching the
Menlo Park, Calif.: AAAI Press. Nuggets in Parallel. In Advances in Knowledge Dis-
Etzioni, O. 1996. The World Wide Web: Quagmire covery and Data Mining, eds. U. Fayyad, G. Piatet-
or Gold Mine? Communications of the ACM (Special sky-Shapiro, P. Smyth, and R. Uthurusamy,
Issue on Data Mining). November 1996. Forthcom- 447–471. Menlo Park, Calif.: AAAI Press.
ing. Horvitz, E., and Jensen, F. 1996. Proceedings of the
Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. Twelfth Conference of Uncertainty in Artificial Intelli-
From Digitized Images to On-Line Catalogs: Data gence. San Mateo, Calif.: Morgan Kaufmann.
Mining a Sky Survey. AI Magazine 17(2): 51–66. Jain, A. K., and Dubes, R. C. 1988. Algorithms for
Fayyad, U. M.; Haussler, D.; and Stolorz, Z. 1996. Clustering Data. Englewood Cliffs, N.J.: Prentice-
KDD for Science Data Analysis: Issues and Exam- Hall.
ples. In Proceedings of the Second International Kloesgen, W. 1996. A Multipattern and Multistrate-
Conference on Knowledge Discovery and Data gy Discovery Assistant. In Advances in Knowledge
Mining (KDD-96), 50–56. Menlo Park, Calif.: Amer- Discovery and Data Mining, eds. U. Fayyad, G. Piatet-
ican Association for Artificial Intelligence. sky-Shapiro, P. Smyth, and R. Uthurusamy,
Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 249–271. Menlo Park, Calif.: AAAI Press.
1996. From Data Mining to Knowledge Discovery: Kloesgen, W., and Zytkow, J. 1996. Knowledge Dis-
An Overview. In Advances in Knowledge Discovery covery in Databases Terminology. In Advances in
and Data Mining, eds. U. Fayyad, G. Piatetsky- Knowledge Discovery and Data Mining, eds. U. Fayyad,
Shapiro, P. Smyth, and R. Uthurusamy, 1–30. Men- G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
lo Park, Calif.: AAAI Press. 569–588. Menlo Park, Calif.: AAAI Press.
Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Kolodner, J. 1993. Case-Based Reasoning. San Fran-
Uthurusamy, R. 1996. Advances in Knowledge Dis- cisco, Calif.: Morgan Kaufmann.
covery and Data Mining. Menlo Park, Calif.: AAAI Langley, P., and Simon, H. A. 1995. Applications of
Press. Machine Learning and Rule Induction. Communica-
Friedman, J. H. 1989. Multivariate Adaptive Regres- tions of the ACM 38:55–64.
sion Splines. Annals of Statistics 19:1–141. Major, J., and Mangano, J. 1995. Selecting among
Geman, S.; Bienenstock, E.; and Doursat, R. 1992. Rules Induced from a Hurricane Database. Journal
Neural Networks and the Bias/Variance Dilemma. of Intelligent Information Systems 4(1): 39–52.
Neural Computation 4:1–58. Manago, M., and Auriol, M. 1996. Mining for OR.
Glymour, C.; Madigan, D.; Pregibon, D.; and ORMS Today (Special Issue on Data Mining), Febru-
Smyth, P. 1996. Statistics and Data Mining. Com- ary, 28–32.
munications of the ACM (Special Issue on Data Min- Mannila, H.; Toivonen, H.; and Verkamo, A. I.
ing). November 1996. Forthcoming. 1995. Discovering Frequent Episodes in Sequences.
Glymour, C.; Scheines, R.; Spirtes, P.; Kelly, K. 1987. In Proceedings of the First International Confer-
Discovering Causal Structure. New York: Academic. ence on Knowledge Discovery and Data Mining
Guyon, O.; Matic, N.; and Vapnik, N. 1996. Discov- (KDD-95), 210–215. Menlo Park, Calif.: American


Association for Artificial Intelligence. Spirtes, P.; Glymour, C.; and Scheines, R. 1993.
Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. Causation, Prediction, and Search. New York:
1996. Selecting and Reporting What Is Interesting: Springer-Verlag.
The KEfiR Application to Healthcare Data. In Ad- Stolorz, P.; Nakamura, H.; Mesrobian, E.; Muntz, R.;
vances in Knowledge Discovery and Data Mining, eds. Shek, E.; Santos, J.; Yi, J.; Ng, K.; Chien, S.; Me-
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. choso, C.; and Farrara, J. 1995. Fast Spatio-Tempo-
Uthurusamy, 495–516. Menlo Park, Calif.: AAAI ral Data Mining of Large Geophysical Datasets. In
Press. Proceedings of KDD-95: First International Confer-
Pearl, J. 1988. Probabilistic Reasoning in Intelligent ence on Knowledge Discovery and Data Mining,
Systems. San Francisco, Calif.: Morgan Kaufmann. 300–305. Menlo Park, Calif.: American Association
for Artificial Intelligence.
Piatetsky-Shapiro, G. 1995. Knowledge Discovery
Titterington, D. M.; Smith, A. F. M.; and Makov, U.
in Personal Data versus Privacy—A Mini-Sympo-
E. 1985. Statistical Analysis of Finite-Mixture Distribu-
sium. IEEE Expert 10(5).
tions. Chichester, U.K.: Wiley.
Piatetsky-Shapiro, G. 1991. Knowledge Discovery
U.S. News. 1995. Basketball’s New High-Tech Guru:
in Real Databases: A Report on the IJCAI-89 Work-
IBM Software Is Changing Coaches’ Game Plans.
shop. AI Magazine 11(5): 68–70.
U.S. News and World Report, 11 December.
Piatetsky-Shapiro, G., and Matheus, C. 1994. The
Weigend, A., and Gershenfeld, N., eds. 1993. Pre-
Interestingness of Deviations. In Proceedings of
dicting the Future and Understanding the Past. Red-
KDD-94, eds. U. M. Fayyad and R. Uthurusamy.
wood City, Calif.: Addison-Wesley.
Technical Report WS-03. Menlo Park, Calif.: AAAI
Press. Weiss, S. I., and Kulikowski, C. 1991. Computer Sys-
tems That Learn: Classification and Prediction Meth-
Piatetsky-Shapiro, G.; Brachman, R.; Khabaza, T.;
ods from Statistics, Neural Networks, Machine Learn-
Kloesgen, W.; and Simoudis, E., 1996. An Overview ing, and Expert Systems. San Francisco, Calif.:
of Issues in Developing Industrial Data Mining and Morgan Kaufmann.
Knowledge Discovery Applications. In Proceedings
Whittaker, J. 1990. Graphical Models in Applied Mul-
of the Second International Conference on Knowl-
tivariate Statistics. New York: Wiley.
edge Discovery and Data Mining (KDD-96), eds. J.
Han and E. Simoudis, 89–95. Menlo Park, Calif.: Zembowicz, R., and Zytkow, J. 1996. From Contin-
American Association for Artificial Intelligence. gency Tables to Various Forms of Knowledge in
Databases. In Advances in Knowledge Discovery and
Quinlan, J. 1992. C4.5: Programs for Machine Learn-
Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P.
ing. San Francisco, Calif.: Morgan Kaufmann.
Smyth, and R. Uthurusamy, 329–351. Menlo Park,
Ripley, B. D. 1994. Neural Networks and Related Calif.: AAAI Press.
Methods for Classification. Journal of the Royal Sta-
tistical Society B. 56(3): 409–437.
Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M.
A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.;
Marrone, M. P.; and Wong, R. W. H. 1995. The Fi- Usama Fayyad is a senior re-
nancial Crimes Enforcement Network AI System searcher at Microsoft Research.
( FAIS ): Identifying Potential Money Laundering He received his Ph.D. in 1991
from Reports of Large Cash Transactions. AI Maga- from the University of Michigan
zine 16(4): 21–39. at Ann Arbor. Prior to joining Mi-
crosoft in 1996, he headed the
Shrager, J., and Langley, P., eds. 1990. Computation-
Machine Learning Systems Group
al Models of Scientific Discovery and Theory Forma-
at the Jet Propulsion Laboratory
tion. San Francisco, Calif.: Morgan Kaufmann.
(JPL), California Institute of Tech-
Silberschatz, A., and Tuzhilin, A. 1995. On Subjec- nology, where he developed data-mining systems
tive Measures of Interestingness in Knowledge Dis- for automated science data analysis. He remains
covery. In Proceedings of KDD-95: First Interna- affiliated with JPL as a distinguished visiting scien-
tional Conference on Knowledge Discovery and tist. Fayyad received the JPL 1993 Lew Allen Award
Data Mining, 275–281. Menlo Park, Calif.: Ameri- for Excellence in Research and the 1994 National
can Association for Artificial Intelligence. Aeronautics and Space Administration Exceptional
Silverman, B. 1986. Density Estimation for Statistics Achievement Medal. His research interests include
and Data Analysis. New York: Chapman and Hall. knowledge discovery in large databases, data min-
Simoudis, E.; Livezey, B.; and Kerber, R. 1995. Using ing, machine-learning theory and applications, sta-
Recon for Data Cleaning. In Proceedings of KDD-95: tistical pattern recognition, and clustering. He was
First International Conference on Knowledge Discov- program cochair of KDD-94 and KDD-95 (the First
International Conference on Knowledge Discovery
ery and Data Mining, 275–281. Menlo Park, Calif.:
and Data Mining). He is general chair of KDD-96,
American Association for Artificial Intelligence.
an editor in chief of the journal Data Mining and
Smyth, P.; Burl, M.; Fayyad, U.; and Perona, P. Knowledge Discovery, and coeditor of the 1996 AAAI
1996. Modeling Subjective Uncertainty in Image Press book Advances in Knowledge Discovery and Da-
Annotation. In Advances in Knowledge Discovery and ta Mining.
Data Mining, 517–540. Menlo Park, Calif.: AAAI


Gregory Piatetsky-Shapiro is a principal member of the technical staff at GTE Laboratories and the principal investigator of the Knowledge Discovery in Databases (KDD) Project, which focuses on developing and deploying advanced KDD systems for business applications. Previously, he worked on applying intelligent front ends to heterogeneous databases. Piatetsky-Shapiro received several GTE awards, including GTE's highest technical achievement award for the KEFIR system for health-care data analysis. His research interests include intelligent database systems, dependency networks, and Internet resource discovery. Prior to GTE, he worked at Strategic Information developing financial database systems. Piatetsky-Shapiro received his M.S. in 1979 and his Ph.D. in 1984, both from New York University (NYU). His Ph.D. dissertation on self-organizing database systems received NYU awards as the best dissertation in computer science and in all natural sciences. Piatetsky-Shapiro organized and chaired the first three (1989, 1991, and 1993) KDD workshops and helped in developing them into successful conferences (KDD-95 and KDD-96). He has also been on the program committees of numerous other conferences and workshops on AI and databases. He edited and coedited several collections on KDD, including two books, Knowledge Discovery in Databases (AAAI Press, 1991) and Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996), and has many other publications in the areas of AI and databases. He is a coeditor in chief of the new Data Mining and Knowledge Discovery journal. Piatetsky-Shapiro founded and moderates the KDD Nuggets electronic newsletter and is the web master for Knowledge Discovery Mine (<~kdd/index.html>).

Padhraic Smyth received a first-class-honors Bachelor of Engineering from the National University of Ireland in 1984 and an MSEE and a Ph.D. from the Electrical Engineering Department at the California Institute of Technology (Caltech) in 1985 and 1988, respectively. From 1988 to 1996, he was a technical group leader at the Jet Propulsion Laboratory (JPL). Since April 1996, he has been a faculty member in the Information and Computer Science Department at the University of California at Irvine. He is also currently a principal investigator at JPL (part-time) and is a consultant to private industry. Smyth received the Lew Allen Award for Excellence in Research at JPL in 1993 and has been awarded 14 National Aeronautics and Space Administration certificates for technical innovation since 1991. He was coeditor of the book Advances in Knowledge Discovery and Data Mining (AAAI Press, 1996). Smyth was a visiting lecturer in the Computational and Neural Systems and Electrical Engineering Departments at Caltech (1994) and regularly conducts tutorials on probabilistic learning algorithms at national conferences (including UAI-93, AAAI-94, CAIA-95, and IJCAI-95). He is general chair of the Sixth International Workshop on AI and Statistics, to be held in 1997. Smyth's research interests include statistical pattern recognition, machine learning, decision theory, probabilistic reasoning, information theory, and the application of probability and statistics in AI. He has published 16 journal papers, 10 book chapters, and 60 conference papers on these topics.