
Introduction to Data Mining: Motivation and Importance of Data Mining; Role of Data Mining in Knowledge Discovery

Robin Prakash Mathur, Asst. Professor, Department of CSE, LPU, Phagwara, Jalandhar

Why Data Mining?


- The explosive growth of data: from terabytes to petabytes
- Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data:
  - Business: Web, e-commerce, transactions, stocks
  - Science: remote sensing, bioinformatics, scientific simulation
  - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
March 23, 2013 Data Mining: Concepts and Techniques 2

What Is Data Mining?


- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.


Knowledge Discovery (KDD) Process


- This is a view from the typical database systems and data warehousing communities.
- Data mining plays an essential role in the knowledge discovery process.
[Figure: the KDD process, flowing from Databases through Data Cleaning and Data Integration into a Data Warehouse, then Selection of Task-relevant Data, Data Mining, and Pattern Evaluation]


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
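The KDD steps above can be sketched as a pipeline of small functions on a toy record list. This is illustrative only: the attribute names (age, spend) and step bodies are hypothetical stand-ins, and the integration, evaluation, and presentation steps are omitted for brevity.

```python
# Sketch of a KDD pipeline on hypothetical customer records.
def clean(records):      # 1. data cleaning: drop records with missing fields
    return [r for r in records if None not in r.values()]

def select(records):     # 3. data selection: keep task-relevant attributes
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def transform(records):  # 4. data transformation: consolidate spend to a coarser level
    return [{**r, "spend": round(r["spend"], -1)} for r in records]

def mine(records):       # 5. data mining: a trivial "pattern" (mean spend, age < 30)
    young = [r["spend"] for r in records if r["age"] < 30]
    return sum(young) / len(young)

raw = [{"age": 25, "spend": 104.0}, {"age": 27, "spend": 96.0},
       {"age": 41, "spend": None}]
print(mine(transform(select(clean(raw)))))  # 100.0
```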

ARCHITECTURE OF DATA MINING

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

Data Summarization, Data Cleaning, Data Transformation, Concept Hierarchy, Structure


Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?


- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data


Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing



Data Cleaning
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data


Missing Data
Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- no record of the history or changes of the data


Missing data may need to be inferred.


How to Handle Missing Data?


- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible!
- Use a global constant to fill in the missing value: e.g., "unknown" (a new class?!)
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
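Two of the fill-in strategies above (global attribute mean, and per-class attribute mean) can be sketched in plain Python. The income values and class labels are hypothetical, and None stands for a missing value.

```python
# Sketch: filling in missing values (None) with the attribute mean,
# globally and per class. Data are hypothetical.
def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace each None with the mean of same-class observed values."""
    result = []
    for v, lbl in zip(values, labels):
        if v is None:
            same = [x for x, l in zip(values, labels)
                    if l == lbl and x is not None]
            v = sum(same) / len(same)
        result.append(v)
    return result

income = [30, None, 50, None, 70]
labels = ["a", "a", "b", "b", "b"]
print(fill_with_mean(income))                 # [30, 50.0, 50, 50.0, 70]
print(fill_with_class_mean(income, labels))   # [30, 30.0, 50, 60.0, 70]
```

The per-class variant is "smarter" because it conditions the estimate on the class label instead of pooling all tuples.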


Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?


- Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions
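A minimal sketch of the clustering approach to outliers: given cluster centres (here hypothetical 1-D centres), any value lying farther than a chosen threshold from every centre is flagged as an outlier.

```python
# Sketch: outlier detection via clustering. Centres and threshold
# are hypothetical; a real system would learn centres (e.g. k-means).
def flag_outliers(points, centres, threshold):
    """Return points whose distance to every centre exceeds threshold."""
    return [p for p in points
            if min(abs(p - c) for c in centres) > threshold]

data = [2, 3, 4, 21, 22, 23, 95]
print(flag_outliers(data, centres=[3, 22], threshold=5))  # [95]
```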

Simple Discretization Methods: Binning


Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In the example that follows, the data are first sorted and then partitioned into equal-frequency bins of size 4 (i.e., each bin contains four values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.

Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
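The example above can be reproduced with a short sketch (same prices, bin depth 4):

```python
# Sketch: the slide's equi-depth binning example in plain Python.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def partition(data, depth):
    """Split sorted data into consecutive bins of `depth` values."""
    return [data[i:i + depth] for i in range(0, len(data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min/max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = partition(prices, 4)
print(smooth_by_means(bins))       # [[9,9,9,9], [23,23,23,23], [29,29,29,29]]
print(smooth_by_boundaries(bins))  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```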

Cluster Analysis
Unlike in classification, the class labels are not present in the training data, simply because they are not known to begin with. Clustering can be used to generate such labels. Objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Examples: k-means, k-medoids.


Regression
- Regression is the process of predicting one variable from another variable.
[Figure: data points such as (X1, Y1) fitted by the regression line y = x + 1]

Several commercial tools can aid in the step of discrepancy detection.
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools.


Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations and different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration


- Redundant data occur often when multiple databases are integrated:
  - The same attribute may have different names in different databases
  - One attribute may be derived from an attribute in another table, e.g., annual revenue
- Redundancy may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
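A sketch of correlation analysis for redundancy detection, using the Pearson coefficient. The monthly/annual figures below are made up, with the annual attribute derived from the monthly one, so the two are perfectly correlated.

```python
from math import sqrt

# Sketch: Pearson correlation to flag a redundant (derived) attribute.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly = [10, 12, 11, 15]
annual = [12 * m for m in monthly]   # derived, hence redundant
print(pearson(monthly, annual))      # close to 1.0 -> removal candidate
```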

Data Transformation
- Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation: Normalization


- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
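The three normalizations can be sketched directly from the formulas. The sample figures (an income attribute with min 12,000, max 98,000, mean 54,000, standard deviation 16,000, mapped to [0, 1]) are illustrative only.

```python
# Sketch of the three normalization formulas; figures are illustrative.
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Express v in standard deviations from the mean."""
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    """Divide by the smallest power of 10 that makes |v'| < 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))        # ~0.716
print(z_score(73600, 54000, 16000))        # 1.225
print(decimal_scaling(-986, max_abs=986))  # -0.986
```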



Data Reduction Strategies


- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.

Data Cube Aggregation


The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid. Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size.
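A minimal sketch of climbing one level of the lattice: rolling per-quarter sales (hypothetical figures) up to per-year totals, which shrinks the stored data from one row per quarter to one per year.

```python
# Sketch: rolling a base cuboid (per-quarter sales) up to per-year
# totals. Rows are (year, quarter, amount); figures are hypothetical.
base = [("2022", "Q1", 224), ("2022", "Q2", 408),
        ("2023", "Q1", 350), ("2023", "Q2", 279)]

def roll_up(rows):
    """Aggregate away the quarter dimension, keeping yearly totals."""
    totals = {}
    for year, _quarter, amount in rows:
        totals[year] = totals.get(year, 0) + amount
    return totals

print(roll_up(base))  # {'2022': 632, '2023': 629}
```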

Example of Decision Tree Induction


- Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labeled Class 1 and Class 2]
- Reduced attribute set: {A1, A4, A6}



Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short, and vary slowly with time

Data Compression
[Figure: lossless compression recovers the original data exactly from the compressed data; lossy compression recovers only an approximation of the original data]

Regression and Log-Linear Models


- Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X. The two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd.
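A sketch of the least-squares estimates for the simple linear case, using the closed form β = cov(X, Y)/var(X) and α = mean(Y) - β·mean(X). The sample points are chosen to lie exactly on y = x + 1.

```python
# Sketch: least-squares fit of Y = a + b*X (a, b play the roles of
# alpha and beta). Sample points lie exactly on y = x + 1.
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 3, 4, 5])
print(a, b)  # 1.0 1.0
```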

Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
[Figure: a histogram of price values, with buckets spanning 10,000 to 90,000]
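An equal-width histogram sketch: only the bucket boundaries and counts are stored, not the raw values. The price list below is hypothetical.

```python
# Sketch: equal-width histogram as a data reduction - keep only
# (bucket_low, bucket_high, count) triples instead of raw values.
def equal_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # clamp the maximum value into the last bucket
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15]
print(equal_width_histogram(prices, 2))  # [(1.0, 8.0, 5), (8.0, 15.0, 11)]
```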

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis


Discretization and Concept Hierarchy

- Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

Concept hierarchy generation for categorical data


- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes


Specification of a set of attributes


A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
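The heuristic is a one-liner: sort the attributes by their distinct-value counts, with the fewest distinct values at the top of the hierarchy. The counts are the ones from the slide.

```python
# Sketch: auto-generating the hierarchy order from distinct-value
# counts (fewest values = highest level), per the slide's heuristic.
counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(counts, key=counts.get)  # top level first
print(" < ".join(hierarchy))  # country < province_or_state < city < street
```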


THANKS.
