
Chapter 1

Data Mining-An Introduction


1.1 Why Data Mining?
We live in a world where vast amounts of data are collected daily. Analysing such data
is an important need. Data mining can meet this need by providing tools to discover
knowledge from data.

The explosive growth of available data volume is a result of the computerization
of our society and the fast development of powerful data collection and storage
tools. Powerful and versatile tools are badly needed to automatically uncover
valuable information from the tremendous amounts of data and to transform
such data into organized knowledge. This necessity has led to the birth of data
mining.

Data mining can be viewed as a result of the natural evolution of information
technology. The database and data management industry evolved in the
development of several critical functionalities: data collection and database
creation, data management (including data storage and retrieval and database
transaction processing), and advanced data analysis (involving data
warehousing and data mining). The widening gap between data and information
calls for the systematic development of data mining tools that can turn data
tombs into golden nuggets of knowledge.

1.2 What Is Data Mining?


It is no surprise that data mining can be defined in many different ways. Even the term
data mining does not really present all the major components in the picture. To refer to
the mining of gold from rocks or sand, we say gold mining instead of rock or sand
mining. Analogously, data mining should have been more appropriately named
knowledge mining from data, which is unfortunately somewhat long. However, the
shorter term, knowledge mining, may not reflect the emphasis on mining from large
amounts of data. Nevertheless, mining is a vivid term characterizing the process that
finds a small set of precious nuggets from a great deal of raw material. Thus, such a
misnomer carrying both data and mining became a popular choice. In addition,
many other terms have a similar meaning to data mining, for example, knowledge
mining from data, knowledge extraction, data/pattern analysis, data archaeology, and
data dredging.

Figure 1.1 Data mining as a step in the process of knowledge discovery

Many people treat data mining as a synonym for another popularly used term,
knowledge discovery from data, or KDD, while others view data mining as merely an
essential step in the process of knowledge discovery. The knowledge discovery process
is an iterative sequence of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the
database)

Poornima University, Jaipur

B.tech, Computer Engineering


4. Data transformation (where data are transformed and consolidated into forms
appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract
data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge
based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation
techniques are used to present mined knowledge to users)
Steps 1 through 4 are different forms of data pre-processing, where data are prepared for
mining. The data mining step may interact with the user or a knowledge base. The
interesting patterns are presented to the user and may be stored as new knowledge in the
knowledge base.
The preceding view (Figure 1.1) shows data mining as one step in the knowledge
discovery process, though an essential one because it uncovers hidden patterns for
evaluation. However, in industry, in media, and in the research environment, the term
data mining is often
used to refer to the entire knowledge discovery process (perhaps because the term is
shorter than knowledge discovery from data). Therefore, we adopt a broad view of data
mining functionality: Data mining is the process of discovering interesting patterns and
knowledge from large amounts of data. The data sources can include databases, data
warehouses, the Web, other information repositories, or data that are streamed into the
system dynamically.
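
The seven knowledge discovery steps listed in this section can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two small data sources, the field names, the choice of counting item pairs as the "mining" method, and the use of support (the fraction of transactions containing a pair) as the interestingness measure. It is not the only way to realize these steps.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources (illustrative only).
store_a = [
    {"customer": "c1", "items": ["bread", "milk"], "region": "north"},
    {"customer": "c2", "items": ["bread", "butter", "milk"], "region": "north"},
    {"customer": "c3", "items": None, "region": "north"},  # noisy record
]
store_b = [
    {"customer": "c4", "items": ["Bread", "Milk", "eggs"], "region": "south"},
    {"customer": "c5", "items": ["butter", "eggs"], "region": "north"},
]

def clean(records):
    """Step 1: drop records whose item list is missing or empty."""
    return [r for r in records if r["items"]]

def integrate(*sources):
    """Step 2: combine several data sources into one collection."""
    return [r for source in sources for r in source]

def select(records, region):
    """Step 3: keep only the data relevant to the analysis task."""
    return [r for r in records if r["region"] == region]

def transform(records):
    """Step 4: consolidate each record into a set of lowercase item names."""
    return [frozenset(item.lower() for item in r["items"]) for r in records]

def mine(transactions):
    """Step 5: count how often each pair of items occurs together."""
    counts = Counter()
    for transaction in transactions:
        counts.update(combinations(sorted(transaction), 2))
    return counts

def evaluate(pair_counts, n_transactions, min_support):
    """Step 6: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_transactions
        for pair, count in pair_counts.items()
        if count / n_transactions >= min_support
    }

def present(patterns):
    """Step 7: render the surviving patterns as readable lines."""
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in sorted(patterns.items())]

def run_pipeline(min_support=0.5):
    records = integrate(clean(store_a), clean(store_b))
    transactions = transform(select(records, region="north"))
    patterns = evaluate(mine(transactions), len(transactions), min_support)
    return present(patterns)

if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

Steps 1 through 4 correspond to the pre-processing functions, and only `mine` performs the data mining step itself, which mirrors the narrow view of data mining as one step of the process.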

1.3 Which Technologies Are Used?


As a highly application-driven domain, data mining has incorporated many techniques
from other domains such as statistics, machine learning, pattern recognition, database
and data warehouse systems, information retrieval, visualization, algorithms,
high-performance computing, and many application domains (Figure 1.2).


Figure 1.2 Data mining adopts techniques from many domains

1.3.1 Statistics
Statistics studies the collection, analysis, interpretation or explanation, and presentation
of data. Data mining has an inherent connection with statistics. A statistical model is a
set of mathematical functions that describe the behaviour of the objects in a target class
in terms of random variables and their associated probability distributions.
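
As a small illustration of a statistical model, the sketch below describes one attribute of a target class with a normal distribution estimated from sample data. The observations, the choice of a normal distribution, and the three-standard-deviation threshold are all assumptions made for this example.

```python
from statistics import NormalDist

# Hypothetical observations of one attribute for objects in a target class.
amounts = [48.0, 52.0, 50.0, 49.0, 51.0]

# Estimate the model parameters (mean and sample standard deviation) from the data.
model = NormalDist.from_samples(amounts)

def is_unusual(value, model, threshold=3.0):
    """Flag a value lying more than `threshold` standard deviations from the mean."""
    z = (value - model.mean) / model.stdev
    return abs(z) > threshold

if __name__ == "__main__":
    print(model.mean, model.stdev)
    print(is_unusual(51.0, model), is_unusual(70.0, model))
```

Here the random variable is the attribute value, and the fitted distribution summarizes how objects of the class are expected to behave.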
1.3.2 Machine Learning
Machine learning investigates how computers can learn (or improve their
performance) based on data. A main research area is for computer programs to
automatically learn to recognize complex patterns and make intelligent decisions based
on data.
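
To make the idea of learning from data concrete, here is a minimal sketch of a nearest-centroid classifier. The technique is chosen only for brevity, and the training points and labels are hypothetical.

```python
from math import dist

def fit_centroids(examples):
    """Learn one centroid (mean feature vector) per label from labeled examples."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in acc)
        for label, acc in sums.items()
    }

def predict(centroids, features):
    """Assign the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical training data: (features, label) pairs.
training = [
    ((1.0, 1.0), "A"),
    ((2.0, 1.0), "A"),
    ((8.0, 9.0), "B"),
    ((9.0, 8.0), "B"),
]
centroids = fit_centroids(training)

if __name__ == "__main__":
    print(predict(centroids, (2.0, 2.0)))
    print(predict(centroids, (7.0, 9.0)))
```

The program is never told a rule for separating the two classes; the decision boundary follows entirely from the training data.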
1.3.3 Database Systems and Data Warehouses
Database systems research focuses on the creation, maintenance, and use of databases
for organizations and end-users. Particularly, database systems researchers have
established highly recognized principles in data models, query languages, query
processing and optimisation methods, data storage, and indexing and accessing
methods. A data warehouse integrates data originating from multiple sources and
various timeframes.
1.3.4 Information Retrieval
Information retrieval (IR) is the science of searching for documents or information in
documents. Documents can be text or multimedia, and may reside on the Web. The
differences between traditional information retrieval and database systems are twofold:
Information retrieval assumes that (1) the data under search are unstructured; and (2) the
queries are formed mainly by keywords, which do not have complex structures (unlike
SQL queries in database systems).
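
The keyword-based querying described above can be sketched with a small inverted index over unstructured text. The documents and the rule that every query keyword must appear in a result are assumptions for this example; real IR systems typically also rank results.

```python
from collections import defaultdict

# Hypothetical unstructured documents, keyed by document id.
documents = {
    1: "Data mining discovers patterns in large data sets",
    2: "A data warehouse integrates data from multiple sources",
    3: "Information retrieval searches documents by keywords",
}

def build_index(docs):
    """Map each lowercase word to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the sorted ids of documents containing every keyword in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return sorted(results)

index = build_index(documents)

if __name__ == "__main__":
    print(search(index, "data"))
    print(search(index, "data mining"))
```

The query is just a bag of keywords with no structure, in contrast to an SQL query, which names tables, columns, and conditions.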

1.4 Data Mining Applications


Data mining is widely used in the following areas:

• Financial Data Analysis
• Retail Industry
• Telecommunication Industry
• Biological Data Analysis
• Other Scientific Applications
• Intrusion Detection

Financial Data Analysis


Financial data in the banking and financial industry is generally reliable and of high
quality, which facilitates systematic data analysis and data mining. Some typical
cases are as follows:

• Design and construction of data warehouses for multidimensional data analysis
  and data mining.
• Loan payment prediction and customer credit policy analysis.
• Classification and clustering of customers for targeted marketing.
• Detection of money laundering and other financial crimes.

Retail Industry
Data mining has great application in the retail industry because the industry collects
large amounts of data on sales, customer purchasing history, goods transportation,
consumption, and services. The quantity of data collected will naturally continue
to expand rapidly because of the increasing ease, availability, and popularity of the
web.


Data mining in the retail industry helps identify customer buying patterns and trends
that lead to improved quality of customer service and better customer retention and
satisfaction. Examples of data mining in the retail industry include:

• Design and construction of data warehouses based on the benefits of data
  mining.
• Multidimensional analysis of sales, customers, products, time, and region.
• Analysis of the effectiveness of sales campaigns.
• Customer retention.
• Product recommendation and cross-referencing of items.
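
Multidimensional analysis of sales can be pictured as summing a measure over any chosen combination of dimensions. The sketch below is only an illustration: the sales records, the three dimension names, and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales facts: (product, region, quarter, amount).
sales = [
    ("laptop", "north", "Q1", 1200.0),
    ("laptop", "south", "Q1", 800.0),
    ("phone", "north", "Q1", 500.0),
    ("phone", "north", "Q2", 700.0),
    ("laptop", "north", "Q2", 1000.0),
]
DIMENSIONS = ("product", "region", "quarter")

def roll_up(facts, by):
    """Total the sales amount for each combination of the chosen dimensions."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[p] for p in positions)
        totals[key] += fact[-1]  # the amount is the last field
    return dict(totals)

if __name__ == "__main__":
    print(roll_up(sales, ["product"]))
    print(roll_up(sales, ["region", "quarter"]))
```

Changing the `by` argument views the same facts along different dimensions, which is the kind of analysis a retail data warehouse is designed to support.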

Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries,
providing services such as fax, pager, cellular phone, internet messenger,
images, e-mail, and web data transmission. Due to the development of new computer
and communication technologies, the telecommunication industry is expanding
rapidly, which is why data mining has become very important for understanding
the business.
Data mining in the telecommunication industry helps identify telecommunication
patterns, catch fraudulent activities, make better use of resources, and improve
quality of service. Examples of how data mining improves telecommunication
services include:

• Multidimensional analysis of telecommunication data.
• Fraudulent pattern analysis.
• Identification of unusual patterns.
• Multidimensional association and sequential pattern analysis.
• Mobile telecommunication services.
• Use of visualization tools in telecommunication data analysis.

Biological Data Analysis


In recent times, we have seen tremendous growth in fields of biology such as
genomics, proteomics, functional genomics, and biomedical research. Biological data
mining is a very important part of bioinformatics. Data mining contributes to
biological data analysis in the following aspects:


• Semantic integration of heterogeneous, distributed genomic and proteomic
  databases.
• Alignment, indexing, similarity search, and comparative analysis of multiple
  nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein
  pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.
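
As one simple illustration of similarity search over nucleotide sequences, the sketch below uses edit distance (the minimum number of single-character insertions, deletions, and substitutions). This measure is an assumption chosen for brevity, and the sequences are made up; practical sequence alignment tools use more specialized scoring.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def most_similar(query, sequences):
    """Return the stored sequence with the smallest edit distance to the query."""
    return min(sequences, key=lambda s: edit_distance(query, s))

if __name__ == "__main__":
    database = ["CCCCCCC", "GATTTCA", "TTTT"]  # hypothetical sequences
    print(most_similar("GATTACA", database))
```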

Other Scientific Applications


The applications discussed above tend to handle relatively small and homogeneous
data sets, for which statistical techniques are appropriate. By contrast, huge amounts
of data have been collected in scientific domains such as the geosciences and
astronomy. Large numbers of data sets are also being generated by fast numerical
simulations in fields such as climate and ecosystem modeling, chemical engineering,
and fluid dynamics. Data mining is applied to scientific data in the following areas:

• Data warehouses and data preprocessing.
• Graph-based mining.
• Visualization and domain-specific knowledge.

Intrusion Detection
Intrusion refers to any kind of action that threatens the integrity, confidentiality, or
availability of network resources. In this world of connectivity, security has become
a major issue. The increased usage of the internet and the availability of tools and
tricks for intruding into and attacking networks have prompted intrusion detection to
become a critical component of network administration. Data mining technology may
be applied to intrusion detection in the following areas:

• Development of data mining algorithms for intrusion detection.
• Association and correlation analysis, and aggregation to help select and build
  discriminating attributes.
• Analysis of stream data.
• Distributed data mining.
• Visualization and query tools.
