Presented By:
Ankita Agarwal
Deepika Raipuria
Mody Institute of Technology
and Science, Laxmangarh
Data mining refers to extracting or “mining” knowledge from large amounts of data. The
term is actually a misnomer. Mining of gold from rocks or sand is referred to as gold
mining rather than rock or sand mining. Thus data mining should have been appropriately
named “knowledge mining from data,” which is unfortunately somewhat long.
Nevertheless, mining is a vivid term characterizing the process that finds a small set of
precious nuggets from a great deal of raw material. Thus such a misnomer that carries
both “data” and “mining” became a popular choice. There are many other terms carrying
a similar or slightly different meaning to data mining such as knowledge mining from
databases, knowledge extraction, data or pattern analysis, data archeology and data
dredging. Many people treat data mining as a synonym for another popularly used term,
“Knowledge Discovery in Databases,” or KDD. Knowledge discovery consists of an iterative
sequence of the following steps:
Fig. 1: Data mining as the core of the knowledge discovery process. Stages shown, in
order: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection;
Task-relevant Data; Data Mining; Pattern Evaluation.
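The stages named in Fig. 1 can be sketched as a chain of small functions. This is only a
toy illustration, not the pipeline of any particular system: the records, the attribute
names, the function names and the "count cities" mining step are all assumptions made up
for the example.

```python
# Hypothetical toy records from two sources; None marks a missing value.
source_a = [{"id": 1, "age": 25, "city": "Jaipur"}, {"id": 2, "age": None, "city": "Delhi"}]
source_b = [{"id": 3, "age": 41, "city": "Jaipur"}, {"id": 1, "age": 25, "city": "Jaipur"}]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: merge sources, keeping one record per id."""
    merged = {}
    for source in sources:
        for r in source:
            merged.setdefault(r["id"], r)
    return list(merged.values())

def select(records, attributes):
    """Selection: keep only the task-relevant attributes."""
    return [{a: r[a] for a in attributes} for r in records]

def mine(records):
    """Data mining: count how often each city occurs (a trivial pattern)."""
    counts = {}
    for r in records:
        counts[r["city"]] = counts.get(r["city"], 0) + 1
    return counts

def evaluate(patterns, min_count):
    """Pattern evaluation: keep only patterns that meet a threshold."""
    return {k: v for k, v in patterns.items() if v >= min_count}

warehouse = integrate(clean(source_a), clean(source_b))
task_data = select(warehouse, ["city"])
patterns = evaluate(mine(task_data), min_count=2)
print(patterns)
```

In practice the process is iterative: the result of pattern evaluation may send the
analyst back to an earlier stage with a different selection or threshold.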
The architecture of a typical data mining system has the following major
components, as shown in Fig. 2:
Fig. 2: Architecture of a typical data mining system. Components shown: pattern
evaluation, knowledge base, database or data warehouse server, databases, data warehouse.
Database, Data warehouse, or other information repository:
This is one or a set of databases, data warehouses, spreadsheets or other kinds of
information repositories. Data cleaning and data integration are performed on the data.
Knowledge base:
This is the domain knowledge that is used to guide the search, or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
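A concept hierarchy can be pictured as a mapping from each value to its more general
parent. The sketch below assumes a hypothetical "location" attribute with the levels
city, state and country; the table `PARENT` and the function `generalize` are
illustrative names, not part of any real system.

```python
# Hypothetical concept hierarchy for a "location" attribute: city -> state -> country.
PARENT = {
    "Laxmangarh": "Rajasthan",
    "Jaipur": "Rajasthan",
    "Mumbai": "Maharashtra",
    "Rajasthan": "India",
    "Maharashtra": "India",
}

def generalize(value, levels=1):
    """Climb `levels` steps up the hierarchy, stopping at the top level."""
    for _ in range(levels):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value

print(generalize("Jaipur"))      # one level up: the state
print(generalize("Jaipur", 2))   # two levels up: the country
```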
Data mining functionalities are used to specify the kind of patterns to be found in data
mining tasks. They can be classified into two categories: descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make
predictions.
Data can be associated with classes or concepts. It can be useful to describe individual
classes or concepts in summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions can be
derived via:
Data Characterization: It is a summarization of the general characteristics or features of a
target class of data. The data corresponding to the user-specified class are typically
collected by a database query.
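As a rough illustration of data characterization, the sketch below collects a target
class with a simple filter (standing in for the database query) and summarizes it. The
student records, the attribute names and the choice of count and mean as summary measures
are assumptions made for the example.

```python
from statistics import mean

# Hypothetical student records; names and values are invented for illustration.
students = [
    {"name": "A", "program": "graduate", "age": 24, "score": 80},
    {"name": "B", "program": "graduate", "age": 26, "score": 90},
    {"name": "C", "program": "undergraduate", "age": 19, "score": 70},
]

def characterize(records, target_class):
    """Summarize the records that belong to the user-specified class."""
    # This filter stands in for the database query that collects the target class.
    target = [r for r in records if r["program"] == target_class]
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "mean_score": mean(r["score"] for r in target),
    }

summary = characterize(students, "graduate")
print(summary)
```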
Association analysis:
More formally, association rules are of the form X => Y, that is,
“A1 ^ … ^ Am => B1 ^ … ^ Bn”, where Ai (for i in {1, …, m}) and Bj (for j in {1, …, n})
are attribute-value pairs. The association rule X => Y is interpreted as “database tuples
that satisfy the conditions in X are also likely to satisfy the conditions in Y.”
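One common way to quantify "also likely" is through the support and confidence of a rule.
These two measures are not defined in the text above, so treat them as an assumption of
this sketch: support is the fraction of tuples satisfying both X and Y, and confidence is
the fraction of tuples satisfying X that also satisfy Y. The tuples and attribute names
are hypothetical.

```python
# Hypothetical database tuples, each a set of attribute-value pairs.
tuples = [
    {"age": "young", "income": "high", "buys": "computer"},
    {"age": "young", "income": "high", "buys": "computer"},
    {"age": "young", "income": "low", "buys": "printer"},
    {"age": "old", "income": "high", "buys": "computer"},
    {"age": "young", "income": "high", "buys": "printer"},
]

def satisfies(t, conditions):
    """True if tuple t matches every attribute-value pair in conditions."""
    return all(t.get(attr) == value for attr, value in conditions.items())

def support(x, y):
    """Fraction of all tuples that satisfy both X and Y."""
    both = sum(satisfies(t, x) and satisfies(t, y) for t in tuples)
    return both / len(tuples)

def confidence(x, y):
    """Fraction of the tuples satisfying X that also satisfy Y."""
    matches_x = [t for t in tuples if satisfies(t, x)]
    return sum(satisfies(t, y) for t in matches_x) / len(matches_x)

# Rule: age = young ^ income = high => buys = computer
X = {"age": "young", "income": "high"}
Y = {"buys": "computer"}
print(support(X, Y), confidence(X, Y))
```

Here three tuples satisfy X and two of them also satisfy Y, so the rule holds for two of
the five tuples overall and for two of the three tuples that match X.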
Classification is the process of finding a set of models that describe and distinguish data
classes or concepts, for the purpose of being able to use the model to predict the class of
objects whose class label is unknown. The derived model is based on the analysis of a set
of training data, i.e. data objects whose class label is known. Classification can be used
for predicting the class label of data objects. Prediction refers to both data value
prediction and class label prediction. It also encompasses the identification of
distribution trends based on the available data.
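A minimal sketch of classification follows. It uses a nearest-neighbour rule, chosen here
only because it is short; the text does not prescribe any particular kind of model. The
training points and the class labels "A" and "B" are hypothetical.

```python
import math

# Hypothetical training data: (feature vector, known class label).
training = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((5.0, 5.0), "B"),
    ((6.0, 5.5), "B"),
]

def predict(point):
    """Predict the label of `point` as the label of its nearest training object."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label

print(predict((1.2, 1.1)))
print(predict((5.5, 5.0)))
```

The labelled objects play the role of the training data, and `predict` is applied to
objects whose class label is unknown.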
Cluster Analysis:
Unlike classification and prediction, clustering analyses data objects without consulting a
known class label. It can be used to generate such labels. The objects are clustered or
grouped based on the principle of maximizing the intraclass similarity and minimizing
the interclass similarity. That is, clusters of objects are formed so that objects within a
cluster have high similarity in comparison to one another, but are very dissimilar to
objects in other clusters. Each cluster that is formed can be viewed as a class of objects,
from which rules can be derived.
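The grouping principle can be illustrated with a small k-means style procedure. K-means
is one well-known clustering method, used here as an assumption; the text does not name a
specific algorithm. The points and the choice of initial centroids are hypothetical, and
distance is used as the inverse of similarity.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Assign each point to its nearest centroid, then recompute the centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = coordinate-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters

# Hypothetical unlabeled objects; no class label is consulted.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
clusters = kmeans(points, centroids=[points[0], points[3]])
print(clusters)
```

Each resulting cluster can then be treated as a class of objects, so the cluster index
serves as a generated label.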
Outlier Analysis:
A database may contain data objects that do not comply with the general behavior or the
model of the data. These data objects are outliers. Most data mining methods discard
outliers as noise or exceptions. However, in some applications such as fraud detection,
the rare events can be more interesting than the more regularly occurring ones. The
analysis of outlier data is referred to as outlier mining.
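One simple way to flag candidate outliers is to measure how far each value lies from the
mean in units of standard deviation. This z-score rule and the threshold of 2.0 are
assumptions of the sketch, not a method taken from the text, and the transaction amounts
are hypothetical.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = find_outliers(amounts)
print(suspicious)
```

In a fraud detection setting, the flagged values would be kept for closer inspection
rather than discarded as noise.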