
To appear in the Handbook of Technology Management, H. Bidgoli (Ed.), John Wiley and Sons, 2010.

Data Mining
Gary M. Weiss, Ph.D., Department of Computer and Information Science, Fordham University
Brian D. Davison, Ph.D., Department of Computer Science and Engineering, Lehigh University

INTRODUCTION

The amount of data being generated and stored is growing exponentially, due in large part to the continuing advances in computer technology. This presents tremendous opportunities for those who can unlock the information embedded within this data, but also introduces new challenges. In this chapter we discuss how the modern field of data mining can be used to extract useful knowledge from the data that surround us. Those that can master this technology and its methods can derive great benefits and gain a competitive advantage.

In this introduction we begin by discussing what data mining is, why it developed now and what challenges it faces, and what types of problems it can address. In subsequent sections we look at the key data mining tasks: prediction, association rule analysis, cluster analysis, and text, link and usage mining. Before concluding we provide a list of data mining resources and tools for those who wish further information on the topic.

What is Data Mining?

Data mining is a process that takes data as input and outputs knowledge. One of the earliest and most cited definitions of the data mining process, which highlights some of its distinctive characteristics, is provided by Fayyad, Piatetsky-Shapiro and Smyth (1996), who define it as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Note that because the process must be non-trivial, simple computations and statistical measures are not considered data mining. Thus predicting which salesperson will make the most future sales by calculating who made the most sales in the previous year would not be considered data mining. The connection between "patterns in data" and "knowledge" will be discussed shortly. Although not stated explicitly in this definition, it is understood that the process must be at least partially automated, relying heavily on specialized computer algorithms (i.e., data mining algorithms) that search for patterns in the data.

It is important to point out that there is some ambiguity about the term "data mining", which is in large part purposeful. This term originally referred to the algorithmic step in the data mining process, which initially was known as the Knowledge Discovery in Databases (KDD) process. However, over time this distinction has been dropped and data mining, depending on the context, may refer to the entire process or just the algorithmic step. This entire process, as originally envisioned by Fayyad, Piatetsky-Shapiro and Smyth (1996), is shown in Figure 1. In this chapter we discuss the entire process, but, as is common with most texts on the subject, we focus most of our attention on the algorithmic data mining step.

The first three steps in Figure 1 involve preparing the data for mining. The relevant data must be selected from a potentially large and diverse set of data, any necessary preprocessing must then be performed, and finally the data must be transformed into a representation suitable for the data mining algorithm that is applied in the data mining step. As an example, the preprocessing step might involve computing the day of week from a date field, assuming that the domain experts thought that having the day of week information would be useful. An example of data transformation is provided by Cortes and Pregibon (1998). If each data record describes one phone call but the goal is to predict whether a phone number belongs to a business or residential customer based on its calling patterns, then all records associated with each phone number must be aggregated, which will entail creating attributes corresponding to the average number of calls per day, average call duration, etc.

While data preparation does not get much attention in the research community or the data mining community in general, it is critical to the success of any data mining project because without high quality data it is often impossible to learn much from the data. Furthermore, although most research on data mining pertains to the data mining algorithms, it is commonly acknowledged that the choice of a specific data mining algorithm is generally less important than doing a good job in data preparation. In practice it is common for the data preparation steps to take more time and effort than the actual data mining step. Thus, anyone undertaking a data mining project should ensure that sufficient time and effort is allocated to the data preparation steps. For those interested in this topic, there is a book (Pyle 1999) that focuses exclusively on data preparation for data mining.

The fourth step in the data mining process is the data mining step. This step involves applying specialized computer algorithms to identify patterns in the data. Many of the most common data mining algorithms, including decision tree algorithms and neural network algorithms, are described in this chapter. The patterns that are generated may take
[Figure 1 depicts the process as a pipeline of five steps: Selection, Preprocessing, Transformation, Data Mining, and Evaluation, which successively turn raw Data into Target Data, Preprocessed Data, Transformed Data, Patterns, and finally Knowledge.]
Figure 1: The Data Mining Process

various forms (e.g., decision tree algorithms generate decision trees). At least for predictive tasks, which are probably the most common type of data mining task, these patterns collectively can be viewed as a model. For example, if a decision tree algorithm is used to predict who will respond to a direct marketing offer, we can say that the decision tree models how a consumer will respond to a direct mail offer. Finally, the results of data mining cannot simply be accepted, but must be carefully evaluated and interpreted. As a simple example, in the case of the direct-mail example just described, we could evaluate the decision tree based on its accuracy—the percentage of its predictions that are correct. However, many other evaluation or performance metrics are possible and for this specific example return on investment might actually be a better metric.

The data mining process is an iterative process, although this is not explicitly reflected in Figure 1. After the initial run of the process is complete, the user will evaluate the results and decide whether further work is necessary or if the results are adequate. Normally, the initial results are either not acceptable or there is an expectation that further improvements are possible, so the process is repeated after some adjustments are made. These adjustments can be made at any stage of the process. For example, additional data records may be acquired, additional fields (i.e., variables) may be generated from existing information or obtained (via purchase or measurement), manual cleaning of the data may be performed, or new data mining algorithms may be selected. At some point the results may become acceptable and the mined knowledge then will be communicated and may be acted upon. However, even once the mined knowledge is acted upon the data mining process may not be complete and may have to be repeated, since the data distribution may change over time, new data may become available, or new evaluation criteria may be introduced.

Motivation and Challenges

Data mining developed as a new discipline for several reasons. First, the amount of data available for mining grew at a tremendous pace as computing technology became widely deployed. Specifically, high speed networks allowed enormous amounts of data to be transferred and rapidly decreasing disk costs permitted this data to be stored cost-effectively. The size and scope of these new datasets is remarkable. According to a recent industry report (International Data Corporation 2007), in 2006 161 Exabytes (161 billion gigabytes) of data were created and in 2010 988 Exabytes of data will be created. While these figures include data in the form of email, pictures and video, these and other forms of data are increasingly being mined. Traditional corporate datasets, which include mainly fixed-format numerical data, are also quite huge, with many companies maintaining Terabyte datasets that record every customer transaction.

This remarkable growth in the data that is collected and stored presents many problems. One basic problem concerns the scalability of traditional statistical techniques, which often cannot handle data sets with millions or billions of records and hundreds or thousands of variables. A second problem is that much of the data that is available for analysis, such as text,

audio, video and images, is non-numeric and highly unstructured (i.e., cannot be represented by a fixed set of variables). This data cannot easily be analyzed using traditional statistical techniques. A third problem is that the number of data analysts has not matched the exponential growth in the amount of data, which has caused much of this data to remain unanalyzed in a "data tomb" (Fayyad 2003).

Data mining strives to address these challenges. Many data mining algorithms are specifically designed to be scalable and perform well on very large data sets, without fixed size limitations, and degrade gracefully as the size of the data set and the number of variables increases. Scalable data mining algorithms, unlike many early machine learning algorithms, should not require that all data be loaded into and remain in main memory, since this would prevent very large data sets from being mined. A great deal of effort has also been expended to develop data mining methods to handle non-numeric data. This includes mining of spatial data (Ester et al. 1998), images (Hsu, Lee, and Zhang 2002), video (Zhu et al. 2005), text documents (Sebastiani 2002), and the World Wide Web (Chakrabarti 2002, Liu 2007). While a discussion of all of these methods is beyond the scope of this chapter, text mining and web mining, which have become increasingly popular due to the success of the Web, are discussed later in this chapter.

Data mining also attempts to offload some of the work from the data analyst so that more of the collected data can be analyzed. One can see how data mining aids the data analyst by contrasting data mining methods with the more conventional statistical methods. Most of statistics operates using a hypothesize-and-test paradigm where the statistician first decides on a specific hypothesis to be tested (Hand 1998). The statistician then makes assumptions about the data (e.g., that it is normally distributed) and then tries to fit a model based on these assumptions. In data mining the analyst does not need to make specific assumptions about the data nor formulate a specific hypothesis to test. Instead, more of the responsibility for finding a good model is assigned to the data mining algorithm, which will search through a large space of potential models in order to identify a good model. The data mining process, unlike the deductive process typically used by a statistician, is typically data-driven and inductive, rather than hypothesis-driven and deductive. It needs to be noted, however, that for data mining to be successful there must be a sufficient amount of high quality (i.e., relatively noise-free) data. This amount depends on the complexity of the problem and cannot easily be estimated a priori.

Overview of Data Mining Tasks

The best way to gain an understanding of data mining is to understand the types of tasks, or problems, that it can address. At a high level, most data mining tasks can be categorized as either having to do with prediction or description. Predictive tasks allow one to predict the value of a variable based on other existing information. Examples of predictive tasks include predicting when a customer will leave a company (Wei and Chiu 2002), predicting whether a transaction is fraudulent or not (Fawcett and Provost 1997), and identifying the best customers to receive direct marketing offers (Ling and Li 2000). Descriptive tasks, on the other hand, summarize the data in some manner. Examples of such tasks include automatically segmenting customers based on their similarities and differences (Chen et al. 2006) and finding associations between products in market basket data (Agrawal and Srikant 1994). Below we briefly describe the major predictive and descriptive data mining tasks. Each task is subsequently described in greater detail later in the chapter.

Classification and Regression

Classification and regression tasks are predictive tasks that involve building a model to predict a target, or dependent, variable from a set of explanatory, or independent, variables. For classification tasks the target variable usually has a small number of discrete values (e.g., "high" and "low") whereas for regression tasks the target variable is continuous. Identifying fraudulent credit card transactions (Fawcett and Provost 1997) is a classification task while predicting future prices of a stock (Enke and Thawornwong 2005) is a regression task. Note that the term "regression" in this context should not be confused with the regression methods used by statisticians (although those methods can be used to solve regression tasks).

Association Rule Analysis

Association rule analysis is a descriptive data mining task that involves discovering patterns, or associations, between elements in a data set. The associations are represented in the form of rules, or implications. The most common association rule task is market basket analysis. In this case each data record corresponds to a transaction (e.g., from a supermarket checkout) and lists the items that have been purchased as part of the transaction. One possible association rule from supermarket data is {Hamburger Meat} → {Ketchup}, which indicates that those transactions that include Hamburger Meat tend to also include Ketchup. It should be noted that although this is a descriptive task,

highly accurate association rules can be used for prediction (e.g., in the above example it might be possible to use the presence of "Hamburger Meat" to predict the presence of "Ketchup" in a grocery order).

Cluster Analysis

Cluster analysis is a descriptive data mining task where the goal is to group similar objects in the same cluster and dissimilar objects in different clusters. Applications of clustering include clustering customers for the purpose of market segmentation and grouping similar documents together in response to a search engine request (Zamir and Etzioni 1998).

Text Mining Tasks

Much available data is in the form of unstructured or semi-structured text, which is very different from conventional data, which is completely structured. Text is unstructured if there is no predetermined format, or structure, to the data. Text is semi-structured if there is structure associated with some of the data, as is the case for web pages, since most web pages will have a title denoted by the title tag, images denoted by image tags, etc. While text mining tasks often fall into the classification, clustering and association rule mining categories, we discuss them separately because the unstructured nature of text requires special consideration. In particular, the method for representing textual data is critical. Example applications of text mining include the identification of specific noun phrases such as people, products and companies, which can then be used in more sophisticated co-occurrence analysis to find non-obvious relationships among people or organizations. A second application area that is growing in importance is sentiment analysis, in which blogs, discussion boards, and reviews are analyzed for opinions about products or brands.

Link Analysis Tasks

Link analysis is a form of network analysis that examines associations between objects. For example, in the context of email, the objects might be people and the associations might represent the existence of email between two people. On the Web each page can link to others, and so web link analysis considers the web graph resulting from such links. Given a graph showing relationships between objects, link analysis can find particularly important or well-connected objects and show where networks may be weak (e.g., in which all paths go through one or a small number of objects).

PREDICTION TASKS: CLASSIFICATION AND REGRESSION

Classification and regression tasks are the most commonly encountered data mining tasks. These tasks, as described earlier, involve mapping an object to either one of a set of predefined classes (classification) or to a numerical value (regression). In this section we introduce the terminology required to describe these tasks and the framework for performing predictive modeling. We then describe several key characteristics of predictive data mining algorithms and finish up by describing the most popular predictive data mining algorithms in terms of these characteristics.

Terminology and Background

Most prediction tasks assume that the underlying data is represented as a collection of objects or records, which, in data mining, are often referred to as instances or examples. Each example is made up of a number of variables, commonly referred to as features or attributes. The attribute to be predicted is of special interest and may be referred to as the target, or, for classification tasks, the class. In the majority of cases the number of attributes is fixed and thus the data can be represented in a tabular format, as shown in Table 1. The data in Table 1 describes automobile loans and contains 10 examples with 5 attributes, where the binary target/class attribute resides in the last column and indicates whether the customer defaulted on their loan.

Table 1: Sample Auto Loan Default Data

Age         Income  Student  Credit Rating  Default
Youth       Medium  Yes      Fair           No
Youth       Low     Yes      Fair           No
Senior      Low     No       Excellent      No
Senior      Medium  No       Excellent      No
Senior      High    No       Poor           Yes
Senior      Medium  No       Poor           Yes
Senior      Low     Yes      Fair           No
Middle Age  Low     No       Fair           Yes
Middle Age  Medium  Yes      Fair           No
Middle Age  Low     No       Fair           Yes

This data in Table 1 can be used to build a predictive model to classify customers based on whether they will default on their loan. That is, the model generated by the data mining algorithm will take
the values for the age, income, student, and credit-rating attributes and map them to "Yes" or "No". The induced classification model could take many forms and Figure 2 provides a plausible result for a rule-based learning algorithm. In this case each rule in Figure 2 is evaluated in order and the classification is determined by the first rule for which the left-hand side evaluates to true. In this case, if the first two rules do not fire, then the third one is guaranteed to fire and thus acts as a default rule. Note that the three rules in Figure 2 can be used to correctly classify all ten examples in Table 1.

1) If Credit-Rating = "Poor" → Default = "Yes"
2) If Age = "Middle Age" and Income = "Low" → Default = "Yes"
3) → Default = "No"

Figure 2: Rule Set for Predicting Auto Loan Defaults

In data mining, the predictive model is induced from a training set where the target or class value is provided. This is referred to as supervised learning, since it requires that someone acts as a teacher and provides the answers (i.e., class/target value) for each training example. The model induction is accomplished in the data mining step using any of a number of data mining algorithms. Once the model is generated, it can then be applied to data where the value for the target is unknown. Because we are primarily interested in evaluating how well a model performs on new data (i.e., data not used to build the model), we typically reserve some of the labeled data and assign it to a test set for evaluation purposes. When the test set is used for evaluating the predictive performance of the model, the target value is examined only after the prediction is made. Evaluating the model on a test set that is independent of the training data is crucial because otherwise we would have an unrealistic (i.e., overly optimistic) estimate of the performance of the model.

[Figure 3 shows three two-dimensional data sets of "+" and "-" examples with dashed decision boundaries: (a) two axis-parallel lines that isolate the positive examples in one quadrant, (b) the diagonal line y = x, and (c) three globular regions enclosing the positive examples.]

Figure 3: Three Sample Classification Tasks

Classification requires the data mining algorithm to partition the input space in such a way as to separate the examples based on their class. To help illustrate this, consider Figure 3, which shows representations of three data sets, where each example is described by three attributes, one of which is the class variable that may take on the values "+" and "-".

The dashed lines in Figure 3 form decision boundaries that partition the input space in such a way that the positive examples are separated from the negative examples. In Figure 3a the two lines partition the space into four quadrants so that all positive examples are segregated in the top left quadrant. Similarly, in Figure 3b the equation y = x forms the decision boundary that perfectly separates the two classes. The data in Figure 3c requires three globular decision boundaries to perfectly identify the positive examples. Some classification tasks may require that complex decision boundaries be formed in order to achieve good classification performance, but even in such cases it is often impossible to perfectly separate the examples by class. This is due to the fact that domains may be complex and the data may be noisy. Note that the decision boundary in Figure 3b, because it is not parallel to either axis, requires both attribute values to be considered at once, something not required by the data in Figure 3a. This is significant because not all prediction algorithms (e.g., decision trees) produce models with this capability.

Characteristics of Predictive Data Mining Algorithms

Before the most commonly used data mining algorithms are introduced, it is useful to understand the characteristics that can be used to describe and compare them. These characteristics are described briefly in this section and then referred to in subsequent

Table 2: Summary of Predictive Data Mining Algorithms

Learning Method    Tasks Handled               Expressive Power  Training Time  Testing Time  Model Comprehensibility
Decision Trees     Classification              Fair              Fast           Fast          Good
Rule-Based         Classification              Fair              Fast           Fast          Good
ANN                Classification, Regression  Good              Slow           Fast          Poor
Nearest-Neighbor   Classification, Regression  Good              No Time        Slow          No model generated but predictions are explainable
Naïve Bayesian     Classification              Good              Fast           Fast          Poor
sections. The first characteristic concerns the type of predictive tasks that can be handled by the algorithm. Predictive data mining algorithms may handle only classification tasks, only regression tasks, or may handle both types of tasks.

The second characteristic concerns the expressive power of the data mining model. The expressive power of a model is determined by the types of decision boundaries that it can form. Some learners can only form relatively simple decision boundaries and hence have limited expressive power. Algorithms with limited expressive power may not perform well on certain tasks, although it is difficult to determine in advance which algorithms will perform best for a given task. In fact, it is not uncommon for those algorithms that generate less complex models to perform competitively with those with more expressive power.

The format of the model impacts the third criterion, which is the comprehensibility, or explainability, of the predictive model. Certain models are easy to comprehend or explain, while others are nearly impossible to comprehend and, due to their nature, must essentially be viewed as "black boxes" that given an input, somehow produce a result. Whether comprehensibility is important depends on the goal of the predictive task. Sometimes one only cares about the outcome—perhaps the predictive accuracy of the algorithm—but often one needs to be able to explain or defend the predictions. In some cases comprehensibility may even be the primary evaluation criterion if the main goal is to better understand the domain. For example, if one can build an effective model for predicting manufacturing errors, then one may be able to use that model to determine how to reduce the number of future errors.

The fourth criterion concerns the computation time of the data mining algorithm. This is especially important due to the enormous size of many data sets. With respect to computation time, we are interested in the training time, how long it will take to build the model, and the testing time, how long it will take to apply the model to new data in order to generate a prediction. Computational requirements are much more of a concern if the learning must occur in real-time, but currently most data mining occurs "off-line."

Table 2 describes some of the most popular data mining methods in terms of the characteristics just introduced (the methods themselves are described in the next section). Experience has shown that there is no one best method and that the method that performs best depends on the domain as well as the goals of the data mining task (e.g., does the induced model need to be easily understandable). The five listed methods are all in common use and are implemented by most major data mining packages.

Predictive Data Mining Algorithms

In this section we briefly describe some of the most common data mining algorithms. Because the purpose of this chapter is to provide a general description of data mining, its capabilities, and how it can be used to solve real-world problems, many of the technical details concerning the algorithms are omitted. A basic knowledge of the major data mining algorithms, however, is essential in order to know when each algorithm is relevant, what the advantages and disadvantages of each algorithm are, and how these algorithms can be used to solve real-world problems.

Decision Trees

Decision tree algorithms (Quinlan 1993; Breiman et al. 1984) are a very popular class of learning algorithms for classification tasks. A sample decision tree, generated from the automobile loan data in Table 1, is shown in Figure 4. The internal nodes of the decision tree each represent an attribute while the terminal nodes (i.e., leaf nodes displayed as rectangles) are labeled with a class value. Each branch is labeled with an attribute value, and, when presented with an example, one follows the branches that match the

attribute values for the example, until a leaf node is reached. The class value assigned to the leaf node is then used as the predicted value for the example. In this simple example the decision tree will predict that a customer will default on their automobile loan if their credit rating is "poor" or it is not "poor" (i.e., "fair" or "excellent") but the person is "middle aged" and their income level is "low".

[Figure 4 shows the induced tree. The root tests Credit-Rating: the "Poor" branch leads to the leaf Default = Yes, while the "Fair, Excellent" branch leads to a test on Age. From Age, the "Youth, Senior" branch leads to the leaf Default = No and the "Middle Age" branch leads to a test on Income, whose "Low" branch leads to the leaf Default = Yes and whose "Medium, High" branch leads to the leaf Default = No.]

Figure 4: Decision Tree Model Induced from the Automobile Loan Data

Decision tree algorithms are very popular. The main reason for this is that the induced decision tree model is easy to understand. Additional benefits include the fact that the decision tree model can be generated quickly and new examples can be classified quickly using the induced model. The primary disadvantage of a decision tree algorithm is that it has limited expressive power, namely because only one attribute can be considered at a time. Thus while a decision tree classifier could easily classify the data set represented in Figure 3a, it could not easily classify the data set represented in Figure 3b, assuming that the true decision boundary corresponds to the line y=x. A decision tree could approximate that decision boundary if it were permitted to grow very large and complex, but could never learn it perfectly. Note that since decision trees cannot handle regression tasks, other methods must be used for those tasks.

Rule-based Classifiers

Rule-based classifiers generate classification rules, such as the rule set shown earlier in Figure 2. The way in which classifications are made from a rule set varies. For some rule-based systems the first rule to fire (i.e., have the left-hand side of the rule satisfied) determines the classification, whereas in other cases all rules are evaluated and the final classification is made based on a voting scheme. Rule-based classifiers are very similar to decision-tree learners and have similar expressive power, computation time, and comprehensibility. The connection between these two classification methods is even more direct since any decision tree can trivially be converted into a set of mutually exclusive rules, by creating one rule corresponding to the path from the root of the tree to each leaf. While some rule-based learners such as C4.5Rules (Quinlan 1993) operate this way, other rule learners, such as RIPPER (Cohen 1995), generate rules directly.

Artificial Neural Networks

Artificial Neural Networks (ANNs) were originally inspired by attempts to simulate some of the functions of the brain and can be used for both classification and regression tasks (Gurney 1997). An ANN is composed of an interconnected set of nodes that includes an input layer, zero or more hidden layers, and an output layer. The links between nodes have weights associated with them. A typical neural network is shown in Figure 5.

The ANN in Figure 5 accepts three inputs, I1, I2, and I3 and generates a single output O1. The ANN computes the output value from the input values as follows. First, the input values are taken from the attributes of the training example, as it is inputted to the ANN. These values are then weighted and fed into the next set of nodes, which in this example are H1 and H2. A non-linear activation function is then applied to this weighted sum and then the resulting value is passed to the next layer, where this process is repeated, until the final value(s) are outputted. The ANN learns by incrementally modifying its weights so that, during the training phase, the predicted output value moves closer to the observed value. The most popular algorithm for modifying the weights is the backpropagation algorithm (Rumelhart, Hinton, and Williams 1986). Due to the nature of ANN learning, the entire training set is applied repeatedly, where each application is referred to as an epoch.

[Figure 5 shows a feed-forward network with an input layer of nodes I1, I2, and I3, a hidden layer of nodes H1 and H2, and weighted links (w11, w12, and so on) connecting the layers.]

Figure 5: A Typical Artificial Neural Network
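The forward computation just described can be sketched in a few lines of Python for a network shaped like the one in Figure 5 (three inputs, two hidden nodes, one output). The sigmoid activation, the initial weights, the learning rate, and the training example are illustrative assumptions rather than details from the chapter, and the weight update shown is a simplified gradient step on the output weights only; full backpropagation would also propagate the error back to the hidden-layer weights.

```python
import math

def sigmoid(x):
    # Non-linear activation function applied to each weighted sum
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    # Hidden nodes: weighted sum of the inputs, passed through the activation
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, inputs))) for ws in w_hidden]
    # Output node: same computation over the hidden-node values
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return output, hidden

# Illustrative network matching Figure 5: 3 inputs, 2 hidden nodes, 1 output.
# Initial weights are arbitrary small values (an assumption for this sketch).
w_hidden = [[0.1, 0.2, -0.1], [-0.2, 0.1, 0.3]]
w_out = [0.2, -0.1]
example, target = [1.0, 0.5, -1.0], 1.0

for epoch in range(100):  # each full pass over the training data is an epoch
    pred, hidden = forward(example, w_hidden, w_out)
    error = target - pred
    # Simplified gradient step on the output weights only; backpropagation
    # would apply a corresponding update to w_hidden as well
    for j in range(len(w_out)):
        w_out[j] += 0.5 * error * pred * (1 - pred) * hidden[j]

print(forward(example, w_hidden, w_out)[0])  # prediction has moved toward 1.0
```

Repeating the update over many epochs moves the predicted output toward the observed target value, which is the essence of ANN training.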

ANNs can naturally handle regression tasks, since numerical values are passed through the nodes and are ultimately passed through to the output layer. However, ANNs can also handle classification tasks by thresholding on the output values. ANNs have a great deal of expressive power and are not subject to the same limitations as decision trees. In fact, most ANNs are universal approximators in that they can approximate any continuous function to any degree of accuracy. However, this power comes at a cost. While the induced ANN can be used to quickly predict the values for unlabeled examples, training the model takes much more time than training a decision tree or rule-based learner and, perhaps most significantly, the ANN model is virtually incomprehensible and therefore cannot be used to explain or justify its predictions.

Nearest-Neighbor.
Nearest-neighbor learners (Cover and Hart 1967) are very different from any of the learning methods just described in that no explicit model is ever built. That is, there is no training phase and instead all of the work associated with making the prediction is done at the time an example is presented. Given an example the nearest-neighbor method first determines the k most similar examples in the training data and then determines the prediction based on the class values associated with these k examples, where k is a user-specified parameter. The simplest scheme is to predict the class value that occurs most frequently in the k examples, while more sophisticated schemes might use weighted voting, where those examples most similar to the example to be classified are more heavily weighted. People naturally use this type of technique in everyday life. For example, realtors typically base the sales price of a new home on the sales price of similar homes that were recently sold in the area. Nearest-neighbor learning is sometimes referred to as instance-based learning.

Nearest-neighbor algorithms are typically used for classification tasks, although they can also be used for regression tasks. These algorithms also have a great deal of expressive power. Nearest-neighbor algorithms generate no explicit model and hence have no training time. Instead, all of the computation is performed at testing time and this process may be relatively slow since all training examples may need to be examined. It is difficult to evaluate the comprehensibility of the model since none is produced. We can say that because no model is produced, one cannot gain any global (i.e., high-level) insight into the domain. However, individual predictions can easily be explained and justified in a very natural way, by referring to the nearest-neighbors. We thus say that this method does not produce a comprehensible model but its predictions are explainable.

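The nearest-neighbor scheme described above, finding the k most similar training examples and letting them vote, can be sketched in a few lines. The data and the function name are invented for illustration, and the simplest majority-vote scheme is used.

```python
import math
from collections import Counter

def knn_classify(training, new_example, k=3):
    """Predict a class label by majority vote among the k most similar
    training examples (similarity measured by Euclidean distance)."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort the training examples by distance and keep the k nearest.
    neighbors = sorted(training, key=lambda ex: distance(ex[0], new_example))[:k]
    # Let the k neighbors vote; the most frequent class wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Invented toy data: (attribute values, class label)
training = [([1.0, 1.0], "cheap"), ([1.2, 0.9], "cheap"),
            ([5.0, 5.0], "expensive"), ([5.2, 4.8], "expensive"),
            ([4.9, 5.1], "expensive")]
prediction = knn_classify(training, [5.0, 4.9], k=3)
```

Note that, as the chapter observes, all of the work happens at prediction time: there is no training step, only a scan of the stored examples.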
Naïve Bayesian Classifiers
Most classification tasks are not completely deterministic. That is, even with complete knowledge about an example you may not be able to correctly classify it. Rather, the relationship between an example and the class it belongs to is often probabilistic. Naïve Bayesian classifiers (Langley et al. 1992) are probabilistic classifiers that allow us to exploit statistical properties of the data in order to predict the most likely class for an example. More specifically, these methods use the training data and the prior probabilities associated with each class and with each attribute value and then utilize Bayes’ theorem to determine the most likely class given a set of observed attribute values. This method is naïve in that it assumes that the values for each attribute are independent (Bayesian Belief Networks do not make this assumption but a discussion of those methods is beyond the scope of this chapter). The naïve Bayes method is used for classification tasks. These methods are quite powerful, can express complex concepts, and are fast to generate and to classify new examples. However, these methods do not build an explicit model that can then be interpreted.

Ensemble Methods
Ensemble methods are general methods for improving the performance of predictive data mining algorithms. The most notable ensemble methods are bagging and boosting, which permit multiple models to be combined. With bagging (Breiman 1996) the training data are repeatedly randomly sampled with replacement, so that each of the resulting training sets has the same number of examples as the original training data but is composed of different training examples. A classifier is induced from each of the generated training sets and then each unlabeled test example is assigned the classification most frequently predicted by these classifiers (i.e., the classifiers “vote” on the classification). Boosting (Freund and Schapire 1997) is somewhat similar to bagging and also generates multiple classifiers, but boosting adaptively changes the distribution of training examples such that the training examples that are misclassified are better represented in the next iteration. Thus boosting focuses more attention on the examples that are difficult to classify. As with bagging, unlabeled examples are classified based on the predictions of all of the generated classifiers. Most data mining packages now implement a variety of ensemble methods, including boosting and bagging. Most of these packages also permit different models (e.g., a neural network model and a decision tree model) to be combined so that, in theory, one can get the best aspects of each.

ASSOCIATION ANALYSIS
Many businesses maintain huge databases of transactional data, which might include all purchases made from suppliers or all customer sales. Association analysis (Agrawal, Imielinski and Swami 1993) attempts to find patterns either within or between these transactions.

Simple Example using Market Basket Data
Consider the data in Table 3, which includes five transactions associated with purchases at a grocery store. These data are referred to as market basket data since each transaction includes the items found in a customer’s shopping “basket” during checkout. Each record contains a transaction identifier and then a list of all of the items purchased as part of the transaction.

Table 3: Market Basket Data from a Grocery Store

Transaction ID   Items
1                {Ketchup, Hamburgers, Soda}
2                {Cereal, Milk, Diapers, Bread}
3                {Hot dogs, Ketchup, Soda, Milk}
4                {Greeting Card, Cake, Soda}
5                {Greeting Card, Cake, Milk, Cereal}

The data in Table 3 are very different from the relational data used for predictive data mining, such as the data in Table 1, where each record is composed of a fixed number of fields. Here the key piece of data is a variable-length list of items, in which the list only indicates the presence or absence of an item—not the number of instances purchased.

In market basket analysis, a specific instance of association analysis, the goal is to find patterns between items purchased in the same transaction. As an example, using the limited data in Table 3, a data mining algorithm might generate the association rule {Ketchup} → {Soda}, indicating that a customer that purchases Ketchup is likely to also purchase Soda.

Uses of Association Analysis
There are many benefits of performing association analysis. Continuing with the grocery store example, an association rule of the form A → B can be exploited in many ways. For example, items A and B could be located physically close together in a store in order to
assist the shoppers or could be located far apart to increase the chance that a shopper will encounter and purchase items that would otherwise be missed. Sales strategies can also be developed to exploit these associations. For example, the store could run a sale on item A in order to increase the sales of item B or could have a coupon printed out for item B at checkout for those who purchase item A, in order to increase the likelihood that item B will be purchased on the next shopping trip. Applications of association rule analysis are not limited to market basket data and other association mining techniques have been applied to sequence data (Agrawal and Srikant 1995), spatial data (Huang, Shekhar, and Xiong 2004) and graph-based data (Kuramochi and Karypis 2001). We discuss sequence mining toward the end of this section.

Generation and Evaluation of Association Rules
There are many algorithms for performing association rule analysis, but the earliest and most popular such algorithm is the Apriori algorithm (Agrawal and Srikant 1994). This algorithm, as well as most others, utilizes two stages. In the first stage, the sets of items that co-occur frequently are identified. In the second stage association rules are generated from these “frequent itemsets.” As an example, the frequent itemset {Ketchup, Soda} could be generated from the data in Table 3, since it occurs in two transactions, and from this the association rule {Ketchup} → {Soda} could be generated.

The frequency of itemsets and association rules can be quantified using the support measure. The support of an itemset (or association rule) is the fraction of all transactions that contain all items in the itemset (or association rule). Thus, using the data in Table 3, the itemset {Ketchup, Soda} and the association rule {Ketchup} → {Soda} both have a support of 2/5, or .4. The user must specify the minimum support level, minsup, that is acceptable and only association rules that satisfy that support will be generated.

Not all association rules that satisfy the support constraint will be useful. The confidence of an association rule attempts to measure the predictive value of the rule. The confidence of a rule is calculated as the fraction of transactions that contain the items on the right-hand side of the rule given that the transaction contains the items in the left-hand side of the rule. Using the data in Table 3, the association rule {Ketchup} → {Soda} has a confidence of 2/2 or 1.0, since Soda is always found in the transaction if Ketchup is purchased. The confidence of the association rule {Soda} → {Ketchup} is only 2/3, or .66, since Ketchup only occurs in two of the three transactions that contain Soda. In association rule mining the user must also specify a minimum confidence value, minconf. Table 4 shows the results of an association rule mining algorithm, such as the Apriori algorithm, when applied to the data in Table 3 with minsup = .4 and minconf = .75. Note that more than one item may appear on either side of an association rule, but this does not occur for this example due to the simplicity of the sample data.

Table 4: Association Rules Generated from Market Basket Data (minsup=.4 and minconf=.75)

Association Rule              Support   Confidence
{Ketchup} → {Soda}            0.4       1.0
{Cereal} → {Milk}             0.4       1.0
{Greeting Card} → {Cake}      0.4       1.0
{Cake} → {Greeting Card}      0.4       1.0

Sequential Pattern Mining
Sequential pattern mining in transactional data is a variant of association analysis, in that one is looking for patterns between items in a sequence, rather than in a set of items. As an example, suppose a company rents movies and keeps records of all of the rental transactions. The company can then mine these data to determine patterns within the sequences of movie rentals. Some patterns will be expected and hence may not have much business value, such as Star Wars Episode I → Star Wars Episode II, but less obvious patterns may also be found. Sequential patterns abound and sequence mining algorithms have been applied to a variety of domains. For example, sequence mining has been applied to sequences of network alarms in order to predict network equipment failures (Weiss and Hirsh 1998), to computer audit data in order to identify network intrusions (Lee, Stolfo and Mok 2000), to biological sequences to find regions of local similarity (Altschul et al. 1990), and to web clickstream data to find web pages that are frequently accessed together (Tan and Kumar 2002).

CLUSTER ANALYSIS
Cluster analysis (Jain, Murty, and Flynn 1999; Parsons, Haque, and Liu 2004) automatically partitions data into meaningful groups based on the characteristics of the data. Similar objects are placed into the same cluster and dissimilar objects are placed into different clusters. Clustering is an unsupervised
learning task in that the training data do not include the “answer” (i.e., a mapping from example to cluster). Clustering algorithms operate by measuring the similarity and dissimilarity between objects and then finding a clustering scheme that maximizes intra-cluster similarity and inter-cluster dissimilarity. Clustering requires that a similarity measure be defined between objects, which, for objects with numerical attributes, may be the Euclidean distance between the points. Figure 6 shows one possible clustering of eleven objects, each described by three attributes. The cluster boundaries are denoted by the dashed shapes.

Figure 6: Eleven Examples Placed into Three Clusters

Uses of Clustering
There are many reasons to cluster data. The main reason is that it allows us to build simpler, more understandable models of the world, which can be acted upon more easily. People naturally cluster objects for this reason all the time. For example, we are able to identify objects as a “chair” even if they look quite different and this allows us to ignore the specific characteristics of a chair if they are irrelevant. Clustering algorithms automate this process and allow us to exploit the power of computer technology. A secondary use for clustering is for dimensionality reduction or data compression. For example, one could identify ten attributes for a data set, cluster the examples using these attributes, and then replace the ten attributes with one new attribute that specifies the cluster number. Reducing the number of dimensions (i.e., attributes) can simplify the data mining process. Clustering can also aid with data compression by replacing complex objects with an index into a table of the object closest to the center of that object’s cluster.

There are many specific applications of clustering and we list only a few here. Clustering can be used to automatically segment customers into meaningful groups (e.g., students, retirees, etc.), so that more effective, customized, marketing plans can be developed for each group. In document retrieval tasks the returned documents may be clustered and presented to the users grouped by these clusters (Zamir and Etzioni 1998) in order to present the documents to the user in a more organized and meaningful way. For example, clustering can be employed by a search engine so that the documents retrieved from the search term “jaguar” cluster the documents related to the jaguar animal separately from those related to the Jaguar automobile (the ask.com search engine currently provides this capability). The clustering algorithm can work effectively in this case because one set of returned documents will repeatedly have the term “car”, “automobile” or “S-type” in it while the other set may have the terms “jungle” or “animal” appear.

Categories of Clustering Algorithms
Clustering algorithms can be organized by the basic approach that they employ. These approaches are also related to the type of clustering that the algorithm produces. The two main types of clusterings are hierarchical and non-hierarchical. A hierarchical clustering has multiple levels while a non-hierarchical clustering has only a single level. An example of a hierarchical clustering is the taxonomy used by biologists to classify living organisms (although that hierarchy was not formed using data mining).

The non-hierarchical clustering algorithms will take the presented objects and place each into one of k clusters, where each cluster must have at least one object. Most of these algorithms require the user to specify the value of k, which is often a liability, since the user will generally not know ahead of time the optimal number of meaningful clusters. The framework used by many of these algorithms is to form an initial random clustering of the objects and then repeatedly move objects between clusters to improve the overall quality of the clusters. One of the oldest and most notable of these methods is the K-means clustering algorithm (Jain and Dubes 1988). This algorithm randomly assigns each object to one of the k clusters and then computes the mean (i.e., center or centroid) of the points in each cluster. Then each object is reassigned to the cluster whose centroid it is closest to and then the centroids of each cluster are recomputed. This cycle continues until no changes are
made. This very simple method sometimes works well. Another way to generate non-hierarchical clusterings is via density-based clustering methods, such as DBSCAN (Ester et al. 1996), which find regions of high density that are separated from regions of low density. One advantage of DBSCAN is that because it is sensitive to the density differences it can form clusters with arbitrary shapes.

Hierarchical clustering algorithms are the next type of clustering algorithms. These algorithms can be divided into agglomerative and divisive algorithms. The agglomerative algorithms start with each object as an individual cluster and then at each iteration merge the most similar pair of clusters. The divisive algorithms take the opposite approach and start with all objects in a single partition and then iteratively split one cluster into two. The agglomerative techniques are by far the more popular method. These methods are appropriate when the user prefers a hierarchical clustering of the objects.

TEXT, LINK AND USAGE MINING
In this section we focus on mining unstructured and semi-structured, non-numeric data. While these data cannot be effectively stored in a conventional relational database, they are the dominant form for human communication, especially given the advent and explosive growth of email, instant messaging and the World Wide Web (WWW). In many cases the data mining tasks associated with these data are not new, but are described in this section because of their importance and because they typically utilize specialized data mining methods. As an example, text mining tasks include classification (i.e., text classification) but the methods used must take into account the unstructured nature and high dimensionality of the data. Other data mining tasks, such as link mining, can be considered a new type of data mining task, although they may still be used in conjunction with existing tasks (i.e., link mining can be used to aid in classification).

Text Mining
The basic unit for analysis in text mining is a document. A document can contain an arbitrary number of terms from an arbitrarily large vocabulary, which is the union of all terms in the collection of documents being analyzed. If one represents a document using attributes that denote the presence or absence of each term, then the number of attributes will generally be very large (in the thousands or millions, depending on the collection). This causes difficulty for most data mining algorithms and thus some text-specific methods are often needed.

Text representation.
While a number of methods for representing text have been developed, essentially all use the same framework. A document (e.g., a web page, blog post, book, or search query) is treated as a “bag” of words or terms, which means that the order of the terms is ignored and only the distinct set of terms is stored, along with a weight corresponding to its importance or prevalence. In the simplest model the weight might be a Boolean value, indicating whether the term occurs in the document. The vector space model, in contrast, encodes a document as a vector of real-valued weights that might incorporate the relative frequency of the term in the document and the relative popularity of the term in all documents in the collection.

Text classification and clustering.
In text classification, two methods are common. The naïve Bayes algorithm provides a computationally inexpensive method for text classification, which can be interpreted probabilistically. When higher accuracy is required, support vector machines (Vapnik 1999) are used. The support vector machine (SVM) method operates by learning a hyperplane to separate two classes in a real-valued multidimensional space. Modern SVMs (Joachims 1998) are designed to efficiently handle large numbers of attributes but still provide high predictive accuracy.

Document clustering is similar to other forms of clustering. Typically the K-means method described earlier is used, where each example corresponds to the term vector used to represent each document. However, in this case the similarity function used in the clustering process incorporates the weights associated with each term. The most common similarity measure calculates the cosine of the angle between the two vectors of term weights.

Link Mining
Many kinds of data are characterized by relationships, or links, between entities. Such relationships include co-authorship and co-citation (i.e., scholarly articles cited in the same article). Hyperlinks between web pages form a particularly useful relationship. These relationships can form large graphs, where the entities correspond to nodes in the graph and the relationships correspond to the edges in the graph. In some cases these relationships have been studied for decades. In social network analysis (Wasserman and Faust 1994) the relationships and interactions between people are
analyzed to find communities or participants with particular centrality or prestige within the network. In bibliometrics the citation network of scholarly publications is analyzed, using relationships such as co-citation (Small 1973) and bibliographic coupling (Kessler 1963), to determine the importance of authors, papers, and venues.

Research in the WWW community has rekindled interest in these ideas and has provided substantial applications. In particular, the graph of the Web provides something similar to a citation network, since links are often construed as recommendations or citations from one page to another. Google’s PageRank algorithm (Page et al. 1998), for example, uses the idea of rank prestige from social network analysis. In it, a page’s importance is dependent not only on the number of votes (i.e., links) received from other pages, but also on the importance of those pages. Kleinberg (1999), in contrast, used bibliometric ideas to define measures for web hubs and authorities. In his HITS model, a good web hub is a page that points to a number of good web authorities; similarly, a good web authority is a page to which many good hubs point. Both PageRank and HITS utilize recursive definitions that, when applied to all pages simultaneously, correspond to the calculation of the eigenvector of the matrix form of the web graph. The authority values generated by this process are used by search engines such as Google and Ask.com in combination with estimates of query relevance to rank pages for retrieval. Given the large number of relevant results for most queries, estimates of page importance have become essential in generating useful result rankings.

Content Mining
While mining the link structure of the Web is significant, there is an enormous amount of data within web pages that is ripe for mining. Given a web page, one might first ask what the page is about (sometimes called “gisting”). Assigning a web page to one of a number of possible classes is more difficult than traditional text classification—the Web places no restrictions on the format, length, writing style, validity, or uniqueness of the content of a web page. Fortunately, careful use of the content of neighboring pages can dramatically improve classification accuracy (Qi and Davison 2006). Topical classification of web pages is particularly important for contextual advertising, for automated web directory creation, and focused crawling.

Many web pages contain structured data that are retrieved from some underlying but otherwise inaccessible database. This structured data includes product information from online stores, job postings, search results, news articles, and much more. The process of selecting these data from the page in which they are embedded is called data extraction, and can be automated in a supervised or unsupervised manner (Liu 2007). Such data can then be mined for knowledge more directly as homogeneous records. An even greater amount of data is believed to be available via the “deep Web” – those pages that result from content submitted through forms (typically for a search or database lookup) – and are available for harvesting when the appropriate form content to submit is known.

Even when pages are classified as being in a category of interest, the content within them (perhaps already extracted from the rest of the page) may still be unstructured. A common concern for many organizations is to determine not only which pages discuss topics or products of interest, but also what the attitudes are with respect to those topics or products. The Web has resulted in a huge expansion of the ways that customers can express their opinions, e.g., through blogs, discussion groups and product reviews on merchant sites. Mining these opinions provides organizations with valuable insights into product and brand reputations, insights into the competition, and consumer trends. Opinion mining (Liu 2007) can also be of value to customers who want advice before a purchase and to advertisers who want to promote their products. The simplest form of opinion mining is sentiment classification, in which the text is classified as being positive or negative. For more detail, feature-based opinion mining and summarization might be performed to extract details, such as product characteristics mentioned, and determine whether the opinions expressed were positive or negative. A third variation would be to search for explicit comparisons within the opinions and thus be able to provide relative comparisons between similar products.

Web Usage Mining
The content and structure of the Web provide significant opportunity for web mining, as described above. Usage of the Web also provides tremendous information as to the quality, interestingness, and effectiveness of web content, and insights into the interests of users and their habits. By mining clickstream data and other data generated by users as they interact with resources on one or more web sites, behavioral patterns can be discovered and analyzed. Discovered patterns include collections of frequent queries or pages visited by users with common interests. By modeling user behavior it is possible to personalize a web site, improve web performance by
fetching web pages before they are needed, and to cross-sell or up-sell a potential customer for improved profitability. Such analysis is possible on the basis of web server logs, but stronger models are possible with clickstream traffic captured by a web proxy as it can also capture cross-server activity.

Another record of web activity is from search engines. The queries submitted represent the express interests or information needs of the searchers and thus provide an unmatched view into the interests, needs, and desires of a user population. However, in order to characterize a query log, one must first be able to classify the queries. Query classification is also important for monetization of search through relevant advertising. In general, query classification is known to be difficult, primarily because typical query strings are short and often ambiguous. As a result, some additional knowledge is normally used, such as incorporating the text of the results of the query to provide expanded content for classification.

In addition to queries, search engine providers also capture the clicks corresponding to which results the searcher visited. Analysis of the patterns of clicks can provide important feedback on searcher satisfaction and how to improve the search engine rankings (Joachims et al. 2005).

Finally, it is important to note that the collection, storage, transmission and use of usage data is often subject to legal constraints in addition to privacy expectations. Not surprisingly, methods for the anonymization of user data continue to be an active research topic.

DATA MINING RESOURCES & TOOLS
For those wishing to obtain more information on data mining, there are a number of general resources. A good electronic resource for staying current in the field is KDnuggets (http://kdnuggets.com/), a website that provides information on data mining in the form of news articles, job postings, publications, courses, and conferences, and a free bimonthly email newsletter. There are a number of general textbooks on data mining. Those who have some background in computer science and are interested in the technical aspects of data mining, including how the data mining algorithms operate, should consider the texts by Han and Kamber (2006), Tan, Steinbach and Kumar (2006) and Liu (2007). Those with a business background, or whose primary interest is in how data mining can address business problems, may want to consider the texts by Berry and Linoff (2004) and Pyle (2003). The primary journals in the area are Data Mining and Knowledge Discovery and ACM’s Transactions on Knowledge Discovery from Data and the primary professional organization is ACM’s Special Interest Group (SIG) on Knowledge Discovery and Data Mining (http://www.sigkdd.org). Major conferences in the field include the ACM SIGKDD International Conference on Knowledge Discovery & Data Mining and the IEEE International Conference on Data Mining. A comprehensive list of data mining texts, journals, and conferences is available from http://kdnuggets.com/.

There are a wide variety of data mining tools available. Some of these tools implement just one data mining method (e.g., decision trees) whereas others provide a comprehensive suite of methods and a uniform interface. Many of the tools provided by the academic community are available for free, while many of the commercial data mining tools can be quite expensive. The commercial tools are frequently provided by companies that also provide statistical tools and in these cases are often marketed as an extension to these tools.

One of the most frequently used tools for research has been C4.5 (Quinlan 1993), a decision tree tool, which is still freely available for download over the Internet (www.rulequest.com/Personal/). However, this tool is no longer supported and its capabilities are somewhat dated. A more powerful commercial version of this product, C5.0, is available for a modest price from Rulequest Research (www.rulequest.com). There are a number of powerful data mining packages that provide support for all major data mining algorithms. These packages also provide support for the entire data mining process, including data preparation and model evaluation, and provide access through a graphical user interface so that a programming background is not required. These data mining packages include Weka (Witten and Frank 2005), which is free and available for download over the Internet, and commercial packages such as Enterprise Miner and Clementine, from SAS Institute Inc. and SPSS Inc., respectively. A more complete list of data mining tools is available from KDnuggets at www.kdnuggets.com/software.

CONCLUSION
Data mining initially generated a great deal of excitement and press coverage, and, as is common with new “technologies”, overblown expectations. However, as data mining has begun to mature as a discipline, its methods and techniques have not only proven to be useful, but have begun to be accepted by the wider community of data analysts. As a consequence, courses in data mining are now not only being taught in
Computer Science departments, but also in most business schools. Even many of the social sciences that have long relied almost exclusively on statistical techniques have begun to realize that some knowledge of data mining is essential and will be required to ensure future success.

All “knowledge workers” in our information society, particularly those who need to make informed decisions based on data, should have at least a basic familiarity with data mining. This chapter provides this familiarity by describing what data mining is, its capabilities, and the types of problems that it can address. Further information on this topic can be acquired via the resources listed in the previous section.

GLOSSARY
Artificial Neural Network (ANN) – A computational device, inspired by the brain and modeled as an interconnected set of nodes, which learns to predict by adjusting the weights between its nodes so that the output it generates better matches the target value encoded with the training examples.

Association analysis – A data mining task that looks for associations between items that occur within a set of transactions. Example: if Hamburger Meat is purchased then Ketchup is purchased 30% of the time.

Bayesian Classifier – A probabilistic classifier that determines, via Bayes’ theorem, the most likely class given a set of observed attributes.

Classification – The predictive data mining task that involves assigning an example to one of a set of predefined classes. Example: predicting who will default on a loan.

Cluster Analysis – The data mining task that automatically partitions data into clusters (i.e., groups) such that similar objects are placed into the same cluster and dissimilar objects are placed into different clusters. Clustering and association rule mining are the two main descriptive data mining tasks.

Predictive accuracy – The fraction, or percentage, of predictions that are correct.

Prediction task – The data mining task that involves predicting a value based on other existing information. The main prediction tasks are classification and regression tasks.

Regression task – The predictive data mining task that involves mapping an example to a numerical, possibly continuous, value. Example: predicting a future stock price.

Supervised learning – A type of learning task where the “answer” is provided along with the input.

Test set – The labeled data used for predictive data mining tasks that is reserved to evaluate the effectiveness (e.g., accuracy) of the predictive model built using the training data.

Training set – The data provided as input to a data mining algorithm that is used to train or build a model.

Unsupervised learning – A type of learning task where the “answer” is not provided along with the input. Clustering and association rule mining are the main examples of unsupervised learning in data mining.

REFERENCES
Agrawal, R., Imielinski, T., and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data, 207-216, Washington, DC.

Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 1994 International Conference on Very Large Databases, 487-499, Santiago, Chile.

Agrawal, R., and Srikant, R. (1995). Mining sequential
different clusters. patterns. In Proceedings of the International Conference
on Data Engineering, 3-14, Taipei, Taiwan.
Data mining process – The nontrivial process of Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and
identifying valid, novel, potentially useful, and Lipman, D. J. (1990) Basic local alignment search tool.
ultimately understandable patterns in data (Fayyad Journal of Molecular Biology, 215(3):403-410.
et al. 1996). Synonymous with the knowledge Berry, M., and Linoff, G. S. (2004). Data Mining Techniques
discovery process. for Marketing, Sales, and Customer Relationship
Management. Wiley.
Data mining step – The algorithmic step in the data
Breiman, L. 1996. Bagging predictors. Machine Learning,
mining process that extracts patterns from the data. 24(2):123-140.
Description task – The data mining task that
summarizes the data in some manner. Clustering

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth International Group.

Chakrabarti, S. (2002). Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann.

Chen, Y., Zhang, G., Hu, D., and Wang, S. (2006). Customer segmentation in customer relationship management based on data mining. In Knowledge Enterprise: Intelligent Strategies in Product Design, Manufacturing, and Management, 288-293. Boston: Springer.

Cohen, W. (1995). Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, 115-123, Tahoe City, CA.

Cortes, C., and Pregibon, D. (1998). Giga-mining. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 174-178.

Cover, T., and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13:21-27.

Enke, D., and Thawornwong, S. (2005). The use of data mining and neural networks for forecasting stock market returns. Expert Systems with Applications, 29(4):927-940.

Ester, M., Frommelt, A., Kriegel, H., and Sander, J. (1998). Algorithms for characterization and trend detection in spatial databases. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 44-50, New York, NY.

Ester, M., Kriegel, H., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 226-231.

Fayyad, U. M. (2003). Editorial. SIGKDD Explorations, 5(2).

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3):37-54.

Fawcett, T., and Provost, F. (1997). Adaptive fraud detection. Data Mining and Knowledge Discovery, 1(3):291-316.

Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139.

Gurney, K. (1997). An Introduction to Neural Networks. CRC Press.

Han, J., and Kamber, M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann.

Hand, D. J. (1998). Data mining: Statistics and more? American Statistician, 52(2):112-118.

Hsu, W., Lee, M. L., and Zhang, J. (2002). Image mining: Trends and developments. Journal of Intelligent Information Systems, 19:7-23.

Huang, Y., Shekhar, S., and Xiong, H. (2004). Discovering co-location patterns from spatial datasets: a general approach. IEEE Transactions on Knowledge and Data Engineering, 16(12):1472-1485.

International Data Corporation (2007). The expanding digital universe: A forecast of worldwide information growth through 2010. http://www.emc.com/collateral/analyst-reports/expanding-digital-idc-white-paper.pdf.

Jain, A. K., and Dubes, R. (1988). Algorithms for Clustering Data. Prentice Hall.

Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3):264-323.

Joachims, T. (1998). Making large-scale support vector machine learning practical. In Scholkopf, Burges, and Smola (eds.), Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press.

Joachims, T., Granka, L., Pan, B., Hembrooke, H., and Gay, G. (2005). Accurately interpreting clickthrough data as implicit feedback. In Proceedings of the 28th Annual ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil.

Kessler, M. (1963). Bibliographic coupling between scientific papers. American Documentation, 14.

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632.

Kuramochi, M., and Karypis, G. (2001). Frequent subgraph discovery. In Proceedings of the 2001 IEEE International Conference on Data Mining, 313-320, San Jose, CA.

Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the 10th National Conference on Artificial Intelligence, 223-228.

Lee, W., Stolfo, S. J., and Mok, K. W. (2000). Adaptive intrusion detection: a data mining approach. Artificial Intelligence Review, 14(6):533-567.

Ling, C. X., and Li, C. (2000). Applying data mining to direct marketing. In W. Kou and Y. Yesha (eds.), Electronic Commerce Technology Trends: Challenges and Opportunities, 185-198, IBM Press.

Liu, B. (2007). Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. New York: Springer-Verlag.

Page, L., Brin, S., Motwani, R., and Winograd, T. (1998). The PageRank citation ranking: Bringing order to the Web. Technical Report, Computer Science Department, Stanford University.

Parsons, L., Haque, E., and Liu, H. (2004). Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1):90-105.

Pyle, D. (1999). Data Preparation for Data Mining. Morgan Kaufmann.

Pyle, D. (2003). Business Modeling and Data Mining. Morgan Kaufmann.

Qi, X., and Davison, B. D. (2006). Knowing a web page by the company it keeps. In Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM), 228-237, Arlington, VA.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533-536.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34:1-47.

Small, H. (1973). Co-citation in the scientific literature: A new measure of the relationship between two documents. Journal of the American Society for Information Science, 24(4):265-269.

Tan, P. N., and Kumar, V. (2002). Mining association patterns in web usage data. In Proceedings of the International Conference on Advances in Infrastructure for e-Business, e-Education, e-Science, and e-Medicine on the Internet.

Tan, P., Steinbach, M., and Kumar, V. (2006). Introduction to Data Mining. Pearson Addison Wesley.

Vapnik, V. (1999). The Nature of Statistical Learning Theory, 2nd ed. New York: Springer-Verlag.

Wasserman, S., and Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge: Cambridge University Press.

Wei, C., and Chiu, I. (2002). Turning telecommunications call details to churn prediction: A data mining approach. Expert Systems with Applications, 23(2):103-112.

Weiss, G. M., and Hirsh, H. (1998). Learning to predict rare events in event sequences. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 359-363, Menlo Park, CA.

Witten, I. H., and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques, 2nd edition. San Francisco: Morgan Kaufmann.

Zamir, O., and Etzioni, O. (1998). Web document clustering: A feasibility study. In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, 46-54.

Zhu, X., Wu, X., Elmagarmid, A. K., Feng, Z., and Wu, L. (2005). Video data mining: semantic indexing and event detection from the association perspective. IEEE Transactions on Knowledge and Data Engineering, 17(5):665-677.