
Data Preprocessing in Data Warehouse

Group Members:
Kaavya Johri 090101088
Palash Gaur 090101122
Roshan P Babu 090101143
Sakshi Kulbhaskar 090101145

Today's real-world databases are highly susceptible to incomplete data (lacking attribute values or certain attributes of interest), noisy data (containing errors), missing data, and inconsistent data because of their huge size and their origin from multiple, heterogeneous sources; low-quality data leads to low-quality mining results.
So, in order to improve the efficiency and ease of the mining process, the data is pre-processed.
Data preprocessing is an important step in the knowledge discovery process, because quality decisions must be based on quality data.
Detecting data anomalies, rectifying them early, and reducing the amount of data to be analysed can lead to huge payoffs for decision making.
Introduction
Data have quality if they satisfy the requirements of their intended use.
Many factors make up data quality, including
Accuracy,
Completeness,
Consistency,
Timeliness (whether the data are updated in a timely manner),
Believability (how much the data are trusted by users), and
Interpretability (how easily the data are understood).

Data Preprocessing
Data cleaning
Works to clean the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration
Integration of multiple databases, data cubes, or files, i.e. combining data from multiple sources.
Data reduction
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Data transformation
Data are transformed into forms appropriate for mining.
Major Tasks in Data Preprocessing
Forms of Data Preprocessing
Data cleaning routines work to "clean" the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies; as we know, dirty data can cause confusion for the mining procedure.
Data Cleaning
Data is not always available.
E.g., many tuples have no recorded value for several attributes, such as employee income in sales data.
Missing data may be due to
equipment malfunction
data that was inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not being considered important at the time of entry
failure to register the history or changes of the data

Missing Data/values
Ignore the tuple: this is usually done when the class label is missing (assuming the mining task involves classification). It is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably.
Fill in the missing value manually: time consuming and may not be feasible.
Use a global constant to fill in the missing value: replace all missing attribute values with the same constant, such as "unknown".
Use the attribute mean or median to fill in the missing value (these strategies are sketched in code below).
Use the most probable value to fill in the missing value: inference-based methods such as a Bayesian formula or decision tree induction.
Data Cleaning Methods
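As an illustrative sketch (not part of the original slides), the snippet below applies three of these strategies with pandas; the table, column names, and the "unknown" placeholder are made-up assumptions.

```python
import pandas as pd

# Hypothetical sales data with missing values (NaN).
df = pd.DataFrame({
    "employee": ["A", "B", "C", "D"],
    "income":   [52000, None, 61000, None],
    "region":   ["north", None, "south", "south"],
})

# Ignore the tuple: drop rows whose class label (here 'region') is missing.
labelled = df.dropna(subset=["region"])

# Global constant: replace missing categorical values with "unknown".
df["region"] = df["region"].fillna("unknown")

# Attribute mean (or median): fill missing numeric values.
df["income"] = df["income"].fillna(df["income"].mean())

print(df)
```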
Noise is a random error or variance in a measured variable.
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitations
inconsistency in naming conventions

To overcome this, we smooth the data to remove noise.
Noisy Data
Binning method: this method performs local smoothing
first sort the data and partition it into (equi-depth) bins
then smooth by bin means, by bin medians, by bin boundaries, etc. (see the worked example and code sketch below)
Clustering
detects and removes outliers. Similar values are organised into clusters, and values that fall outside the clusters are outliers.
Regression
smooths by fitting the data to a regression function, i.e. conforms data values to a function. Linear regression involves finding the best line to fit two attributes so that one can be used to predict the other (a small sketch follows below).
Smoothing Techniques
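The following is a minimal sketch of regression-based smoothing, assuming a hypothetical pair of attributes and using NumPy's least-squares line fit; it is an illustration added here, not part of the original slides.

```python
import numpy as np

# Hypothetical paired attributes: x is used to predict y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.3, 11.9])

# Linear regression: find the best-fitting line y ~ a*x + b.
a, b = np.polyfit(x, y, deg=1)

# Smooth y by replacing each value with its value on the fitted line.
y_smoothed = a * x + b
print(y_smoothed)
```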
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Example of Binning Method
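A minimal Python sketch of the equal-depth binning above (an illustration added here, not part of the slides); it reproduces the price example and prints the smoothed bins.

```python
# Equal-depth binning with smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
depth = 4  # equi-depth bins of 4 values each

bins = [sorted(prices)[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value with its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the closer boundary.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```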
Smoothing: remove noise from data
Aggregation: summarization, data cube construction
Generalization: concept hierarchy climbing
Normalization: attribute values are scaled to fall within a small, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Data Transformation
min-max normalization
v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A

z-score normalization
v' = (v - mean_A) / stand_dev_A

normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Data Transformation: Normalization
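As a hedged illustration of the three normalization formulas above (not from the original slides), the sketch below normalizes a made-up list of values in plain Python.

```python
import math

values = [200, 300, 400, 600, 1000]  # made-up example values

# Min-max normalization to the new range [0.0, 1.0].
lo, hi, new_lo, new_hi = min(values), max(values), 0.0, 1.0
min_max = [(v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo for v in values]

# Z-score normalization: subtract the mean, divide by the standard deviation.
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_score = [(v - mean) / std for v in values]

# Decimal scaling: divide by 10^j, with j the smallest integer
# such that max(|v'|) < 1 (here j = 4, since max |v| = 1000).
j = len(str(max(abs(v) for v in values)))  # crude digit count for integers
dec_scaled = [v / 10 ** j for v in values]

print(min_max, z_score, dec_scaled, sep="\n")
```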
Thank You
