
Introduction to Data Mining: Motivation and Importance of Data Mining; Role of Data Mining in Knowledge Discovery

Robin Prakash Mathur, Asst. Professor, Department of CSE, LPU, Phagwara, Jalandhar

Why Data Mining?


- The explosive growth of data: from terabytes to petabytes
- Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data:
  - Business: Web, e-commerce, transactions, stocks
  - Science: remote sensing, bioinformatics, scientific simulation
  - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets
March 23, 2013 Data Mining: Concepts and Techniques 2

What Is Data Mining?


- Data mining (knowledge discovery from data): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.


Knowledge Discovery (KDD) Process


- This is a view from the typical database systems and data warehousing communities.
- Data mining plays an essential role in the knowledge discovery process.
[Figure: the KDD process, flowing from Databases through Data Cleaning and Data Integration into a Data Warehouse, then Selection of Task-relevant Data, Data Mining, and Pattern Evaluation]


1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present the mined knowledge to the user)
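The KDD steps above can be sketched as a pipeline of small functions on a toy record list. This is illustrative only: the attribute names (age, spend) and step bodies are hypothetical stand-ins, and the integration, evaluation, and presentation steps are omitted for brevity.

```python
# Sketch of a KDD pipeline on hypothetical customer records.
def clean(records):      # 1. data cleaning: drop records with missing fields
    return [r for r in records if None not in r.values()]

def select(records):     # 3. data selection: keep task-relevant attributes
    return [{"age": r["age"], "spend": r["spend"]} for r in records]

def transform(records):  # 4. data transformation: consolidate spend to a coarser level
    return [{**r, "spend": round(r["spend"], -1)} for r in records]

def mine(records):       # 5. data mining: a trivial "pattern" (mean spend, age < 30)
    young = [r["spend"] for r in records if r["age"] < 30]
    return sum(young) / len(young)

raw = [{"age": 25, "spend": 104.0}, {"age": 27, "spend": 96.0},
       {"age": 41, "spend": None}]
print(mine(transform(select(clean(raw)))))  # 100.0
```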

ARCHITECTURE OF DATA MINING

Database, data warehouse, World Wide Web, or other information repository: This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data. Database or data warehouse server: The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Data mining engine: This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association and correlation analysis, classification, prediction, cluster analysis, outlier analysis, and evolution analysis.

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search toward interesting patterns.

Data Summarization, Data Cleaning, Data Transformation, Concept Hierarchy, Structure


Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?


- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data


Major Tasks in Data Preprocessing


Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration
Integration of multiple databases, data cubes, or files

Data transformation
Normalization and aggregation

Data reduction
Obtains a reduced representation that is much smaller in volume but produces the same or similar analytical results

Data discretization
Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing



Data Cleaning
Data cleaning tasks
- Fill in missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data


Missing Data
Data is not always available

E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to:
- equipment malfunction
- inconsistency with other recorded data, leading to deletion
- data not entered due to misunderstanding
- certain data not being considered important at the time of entry
- no record of the history or changes of the data


Missing data may need to be inferred.


How to Handle Missing Data?


- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible!
- Use a global constant to fill in the missing value: e.g., "unknown" (a new class?!)
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
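Two of the fill-in strategies above (global attribute mean, and per-class attribute mean) can be sketched in plain Python. The income values and class labels are hypothetical, and None stands for a missing value.

```python
# Sketch: filling in missing values (None) with the attribute mean,
# globally and per class. Data are hypothetical.
def fill_with_mean(values):
    """Replace each None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace each None with the mean of same-class observed values."""
    result = []
    for v, lbl in zip(values, labels):
        if v is None:
            same = [x for x, l in zip(values, labels)
                    if l == lbl and x is not None]
            v = sum(same) / len(same)
        result.append(v)
    return result

income = [30, None, 50, None, 70]
labels = ["a", "a", "b", "b", "b"]
print(fill_with_mean(income))                 # [30, 50.0, 50, 50.0, 70]
print(fill_with_class_mean(income, labels))   # [30, 30.0, 50, 60.0, 70]
```

The per-class variant is "smarter" because it conditions the estimate on the class label instead of pooling all tuples.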


Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?


- Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have a human check them
- Regression: smooth by fitting the data to regression functions
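A minimal sketch of the clustering approach to outliers: given cluster centres (here hypothetical 1-D centres), any value lying farther than a chosen threshold from every centre is flagged as an outlier.

```python
# Sketch: outlier detection via clustering. Centres and threshold
# are hypothetical; a real system would learn centres (e.g. k-means).
def flag_outliers(points, centres, threshold):
    """Return points whose distance to every centre exceeds threshold."""
    return [p for p in points
            if min(abs(p - c) for c in centres) > threshold]

data = [2, 3, 4, 21, 22, 23, 95]
print(flag_outliers(data, centres=[3, 22], threshold=5))  # [95]
```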

Simple Discretization Methods: Binning


Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. In the example that follows, the data are first sorted and then partitioned into equal-frequency bins of size 4 (i.e., each bin contains four values). In smoothing by bin means, each value in a bin is replaced by the mean value of the bin. Smoothing by bin medians can be employed, in which each bin value is replaced by the bin median. In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries, and each bin value is replaced by the closest boundary value.

Binning Methods for Data Smoothing


* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
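The example above can be reproduced with a short sketch (same prices, bin depth 4):

```python
# Sketch: the slide's equi-depth binning example in plain Python.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def partition(data, depth):
    """Split sorted data into consecutive bins of `depth` values."""
    return [data[i:i + depth] for i in range(0, len(data), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin's min/max."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = partition(prices, 4)
print(smooth_by_means(bins))       # [[9,9,9,9], [23,23,23,23], [29,29,29,29]]
print(smooth_by_boundaries(bins))  # [[4,4,4,15], [21,21,25,25], [26,26,26,34]]
```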

Cluster Analysis
Unlike in classification, the class labels are not present in the training data, simply because they are not known to begin with. Clustering can be used to generate such labels. Objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. Examples: k-means, k-medoids.


Regression
- Regression is the process of predicting one variable from another variable.
[Figure: data points such as (X1, Y1) fitted by the regression line y = x + 1]

Several commercial tools can aid in the step of discrepancy detection.
Data scrubbing tools use simple domain knowledge (e.g., knowledge of postal addresses, and spell-checking) to detect errors and make corrections in the data. These tools rely on parsing and fuzzy matching techniques when cleaning data from multiple sources.
Data auditing tools find discrepancies by analyzing the data to discover rules and relationships, and detecting data that violate such conditions. They are variants of data mining tools.


Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
- Entity identification problem: identify real-world entities across multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources differ; possible reasons include different representations and different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration


- Redundant data occur often when multiple databases are integrated:
  - The same attribute may have different names in different databases
  - One attribute may be derived from an attribute in another table, e.g., annual revenue
- Redundancy may be detected by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
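A sketch of correlation analysis for redundancy detection, using the Pearson coefficient. The monthly/annual figures below are made up, with the annual attribute derived from the monthly one, so the two are perfectly correlated.

```python
from math import sqrt

# Sketch: Pearson correlation to flag a redundant (derived) attribute.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

monthly = [10, 12, 11, 15]
annual = [12 * m for m in monthly]   # derived, hence redundant
print(pearson(monthly, annual))      # close to 1.0 -> removal candidate
```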

Data Transformation
- Data transformation routines convert the data into forms appropriate for mining. For example, attribute data may be normalized so as to fall within a small range, such as 0.0 to 1.0.
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling

Data Transformation: Normalization


- min-max normalization:
  v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
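The three normalizations can be sketched directly from the formulas. The sample figures (an income attribute with min 12,000, max 98,000, mean 54,000, standard deviation 16,000, mapped to [0, 1]) are illustrative only.

```python
# Sketch of the three normalization formulas; figures are illustrative.
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    """Map v from [lo, hi] onto [new_lo, new_hi]."""
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    """Express v in standard deviations from the mean."""
    return (v - mean) / std

def decimal_scaling(v, max_abs):
    """Divide by the smallest power of 10 that makes |v'| < 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

print(min_max(73600, 12000, 98000))        # ~0.716
print(z_score(73600, 54000, 16000))        # 1.225
print(decimal_scaling(-986, max_abs=986))  # -0.986
```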



Data Reduction Strategies


- A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
- Data reduction strategies:
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Strategies for data reduction include the following:
1. Data cube aggregation, where aggregation operations are applied to the data in the construction of a data cube.
2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or dimensions may be detected and removed.
3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set size.
4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller data representations such as parametric models (which need store only the model parameters instead of the actual data) or nonparametric methods such as clustering, sampling, and the use of histograms.
5. Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies.

Data Cube Aggregation


The cube created at the lowest level of abstraction is referred to as the base cuboid. The base cuboid should correspond to an individual entity of interest, such as sales or customer. In other words, the lowest level should be usable, or useful for the analysis. A cube at the highest level of abstraction is the apex cuboid. Data cubes created for varying levels of abstraction are often referred to as cuboids, so that a data cube may instead refer to a lattice of cuboids. Each higher level of abstraction further reduces the resulting data size.
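A minimal sketch of climbing one level of the lattice: rolling per-quarter sales (hypothetical figures) up to per-year totals, which shrinks the stored data from one row per quarter to one per year.

```python
# Sketch: rolling a base cuboid (per-quarter sales) up to per-year
# totals. Rows are (year, quarter, amount); figures are hypothetical.
base = [("2022", "Q1", 224), ("2022", "Q2", 408),
        ("2023", "Q1", 350), ("2023", "Q2", 279)]

def roll_up(rows):
    """Aggregate away the quarter dimension, keeping yearly totals."""
    totals = {}
    for year, _quarter, amount in rows:
        totals[year] = totals.get(year, 0) + amount
    return totals

print(roll_up(base))  # {'2022': 632, '2023': 629}
```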

Example of Decision Tree Induction


- Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: a decision tree that splits on A4 at the root, then on A1 and A6, with leaves labeled Class 1 and Class 2]
- Reduced attribute set: {A1, A4, A6}



Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio)
  - Typically short, and vary slowly with time

Data Compression
[Figure: lossless compression recovers the original data exactly from the compressed data; lossy compression recovers only an approximation of the original data]

Regression and Log-Linear Models


- Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X. The two parameters α and β specify the line and are estimated from the data at hand, using the least-squares criterion on the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd.
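A sketch of the least-squares estimates for the simple linear case, using the closed form β = cov(X, Y)/var(X) and α = mean(Y) - β·mean(X). The sample points are chosen to lie exactly on y = x + 1.

```python
# Sketch: least-squares fit of Y = a + b*X (a, b play the roles of
# alpha and beta). Sample points lie exactly on y = x + 1.
def fit_line(xs, ys):
    """Return (a, b) minimizing the sum of squared residuals."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 3, 4, 5])
print(a, b)  # 1.0 1.0
```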

Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
[Figure: a histogram of price values, with buckets spanning 10,000 to 90,000]
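An equal-width histogram sketch: only the bucket boundaries and counts are stored, not the raw values. The price list below is hypothetical.

```python
# Sketch: equal-width histogram as a data reduction - keep only
# (bucket_low, bucket_high, count) triples instead of raw values.
def equal_width_histogram(values, n_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # clamp the maximum value into the last bucket
        i = min(int((v - lo) / width), n_buckets - 1)
        counts[i] += 1
    return [(lo + i * width, lo + (i + 1) * width, c)
            for i, c in enumerate(counts)]

prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15]
print(equal_width_histogram(prices, 2))  # [(1.0, 8.0, 5), (8.0, 15.0, 11)]
```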

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis


Discretization and Concept Hierarchy

- Discretization: reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
- Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

Concept hierarchy generation for categorical data


- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes


Specification of a set of attributes


A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.

- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: 674,339 distinct values
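The heuristic is a one-liner: sort the attributes by their distinct-value counts, with the fewest distinct values at the top of the hierarchy. The counts are the ones from the slide.

```python
# Sketch: auto-generating the hierarchy order from distinct-value
# counts (fewest values = highest level), per the slide's heuristic.
counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}
hierarchy = sorted(counts, key=counts.get)  # top level first
print(" < ".join(hierarchy))  # country < province_or_state < city < street
```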


THANKS.
