
ARCHITECTURES, CONCEPT DESCRIPTION

Syllabus: ... Reduction - Discretization & Concept Hierarchy Generation - Data Mining Primitives - Query Language - Graphical User Interfaces - Architectures - Concept Description - Data Generalization - Characterizations - Class Comparisons - Descriptive Statistical Measures.

Data Preprocessing

- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Is Data Dirty?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=""
- noisy: containing errors or outliers
  - e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names
  - e.g., was rating "1, 2, 3", now rating "A, B, C"
  - e.g., discrepancy between duplicate records

Data Mining: Concepts and Techniques, April 27, 2016

Dirty data may come from:
- different considerations between the time the data was collected and the time it is analyzed
- human, hardware, or software problems
- faults at data collection, entry, or transmission

Why Is Data Preprocessing Important?
- No quality data, no quality mining results: flawed data may cause incorrect or even misleading statistics.
- Quality decisions must be based on quality data.
- "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse." (Bill Inmon)

Measures of Data Quality
- Accuracy
- Completeness
- Consistency
- Timeliness
- Believability
- Value added
- Interpretability
- Accessibility

Broad categories: intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: combine data from multiple sources
- Data transformation: normalization and aggregation
- Data reduction: obtain a reduced representation that produces the same or similar analytical results
- Data discretization: part of data reduction, especially for numerical data

[Figure: forms of data preprocessing]


Data Cleaning

Importance:
- "Data cleaning is one of the three biggest problems in data warehousing." (Ralph Kimball)
- "Data cleaning is the number one problem in data warehousing." (DCI survey)

Missing Data

Data is not always available: many tuples have no recorded value for several attributes, such as customer income in sales data.

Missing data may be due to:
- equipment malfunction
- data not entered, or not considered important, at the time of entry

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious and often infeasible.
- Fill in automatically with:
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or decision tree
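The fill-in strategies can be sketched in plain Python (the records, incomes, and class labels are made up for illustration):

```python
# Toy dataset: (class_label, income); None marks a missing value.
# All values are made-up illustrations of the fill-in strategies above.
records = [("high", 90), ("high", None), ("high", 110),
           ("low", 30), ("low", 50), ("low", None)]

def mean(xs):
    return sum(xs) / len(xs)

# Strategy 1: fill with the overall attribute mean.
known = [v for _, v in records if v is not None]
overall = mean(known)  # (90 + 110 + 30 + 50) / 4 = 70.0

# Strategy 2 (smarter): fill with the mean of the same class.
def class_mean(label):
    vals = [v for c, v in records if c == label and v is not None]
    return mean(vals)

filled = [(c, v if v is not None else class_mean(c)) for c, v in records]
print(filled)  # the missing "high" income becomes 100.0, the "low" one 40.0
```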

Noisy Data

Noise: random error or variance in a measured variable.

Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning:
- duplicate records
- incomplete data
- inconsistent data

How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equi-depth) bins; then smooth by bin means, bin medians, or bin boundaries.
- Clustering: detect and remove outliers.
- Combined computer and human inspection: detect suspicious values and check them by a human (e.g., deal with possible outliers).
- Regression: smooth by fitting the data to regression functions.

Binning

Binning methods smooth a sorted data value by consulting its neighborhood, that is, the values around it. The sorted values are distributed into a number of buckets, or bins. Because binning methods consult the neighborhood of values, they perform local smoothing.

Equal-width (distance) partitioning:
- Divides the range into N intervals of equal size: a uniform grid.
- If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A)/N.
- The most straightforward approach, but outliers may dominate the presentation, and skewed data is not handled well.

Equal-depth (frequency) partitioning:
- Divides the range into N intervals, each containing approximately the same number of samples.
- Good data scaling; managing categorical attributes can be tricky.
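A sketch of equal-width partitioning under the W = (B - A)/N rule, using the price list from the smoothing example (the helper name is mine):

```python
# Equal-width partitioning: W = (B - A) / N.
data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_width_bins(values, n):
    a, b = min(values), max(values)
    w = (b - a) / n                      # interval width W = (B - A)/N
    bins = [[] for _ in range(n)]
    for v in values:
        # index of the interval containing v; clamp the max into the last bin
        i = min(int((v - a) / w), n - 1)
        bins[i].append(v)
    return bins

print(equal_width_bins(data, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
```

Note how the skew of the data leaves the bins unevenly filled, which is exactly the weakness the slide points out.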

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
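The equi-depth bins and both smoothing rules above can be reproduced in a few lines of Python (helper names are mine):

```python
# Equi-depth binning and smoothing, reproducing the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

# Partition the sorted data into 3 bins of equal depth (4 values each).
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```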

Cluster Analysis

[Figure: 2-D data points grouped into clusters; values falling outside every cluster are candidate outliers]

Regression

Regression analysis models the relationship between the dependent variable (target field) and one or more independent variables. The dependent variable is the one whose values you want to predict, whereas the independent variables are the variables that you base your prediction on. Common regression models: linear, polynomial, and logistic regression.

[Figure: a regression line y = x + 1 fitted through the data; a point (X1, Y1) is smoothed to the line's value at X1]
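As an illustration of least-squares fitting (the points are made up to lie exactly on the slide's line y = x + 1):

```python
# Least-squares fit of y = a*x + b; the data points are made up.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 4.0, 5.0]          # exactly y = x + 1

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope and intercept from the normal equations.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
b = my - a * mx
print(a, b)  # 1.0 1.0 -> the smoothing line y = x + 1
```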

Data Cleaning as a Process

Discrepancy detection:
- Use metadata (e.g., domain, range, dependency, distribution).
- Check field overloading.
- Check the uniqueness rule, consecutive rule, and null rule.
- Use commercial tools:
  - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections.
  - Data auditing: discover rules and relationships to detect violators (e.g., use correlation and clustering to find outliers).


Data cleaning tool support:
- Data migration tools: allow transformations to be specified.
- ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface.
- Integration of the two processes: iterative and interactive (e.g., Potter's Wheel).

Data Integration

Data integration combines data from multiple sources into a coherent store.

- Schema integration: integrate metadata from different sources.
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#.
- Detecting and resolving data value conflicts:
  - For the same real-world entity, attribute values from different sources are different.
  - Possible reasons: different representations, different scales (e.g., metric vs. British units).

Handling Redundancy in Data Integration

Redundant data occur often when integrating multiple databases:
- The same attribute may have different names in different databases.
- One attribute may be derivable from an attribute in another table, e.g., annual revenue.

Redundant attributes may be detected by correlational analysis. Careful integration may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.

Correlation Analysis

Correlation analysis uses statistical correlation to evaluate the strength of the relations between variables. It measures the relationship between two items, for example, a security's price and an indicator. The resulting value (the "correlation coefficient") shows whether changes in one item (e.g., an indicator) will result in changes in the other item (e.g., the security's price).

Correlation Analysis (Pearson's product-moment coefficient)

    r(A,B) = Σ_{i=1..n} (a_i - Ā)(b_i - B̄) / ((n - 1) σ_A σ_B)
           = (Σ_{i=1..n} a_i b_i - n Ā B̄) / ((n - 1) σ_A σ_B)

where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-products.

If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation. r(A,B) = 0: independent; r(A,B) < 0: negatively correlated.
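As a sketch in Python (my own helper code, reusing the stock-price pairs from the covariance example later in this section):

```python
# Pearson correlation via the formula above.
import math

a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

ma, mb = sum(a) / n, sum(b) / n
# Sample standard deviations (divide by n - 1, matching the formula).
sa = math.sqrt(sum((x - ma) ** 2 for x in a) / (n - 1))
sb = math.sqrt(sum((y - mb) ** 2 for y in b) / (n - 1))

r = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / ((n - 1) * sa * sb)
print(round(r, 3))  # 0.941 -> strongly positively correlated
```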

χ² (Chi-Square) Test

    χ² = Σ (Observed - Expected)² / Expected

The larger the χ² value, the more likely the variables are related. The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count.

Note: correlation does not imply causality. For example, two counts may be correlated only because both are causally linked to a third variable, such as population.

χ² Test: An Example

                           Play chess   Not play chess   Sum (row)
    Like science fiction   250 (90)     200 (360)        450
    Not like sci. fiction  50 (210)     1000 (840)       1050
    Sum (col.)             300          1200             1500

(Numbers in parentheses are expected counts calculated based on the data distribution in the two categories.)

    χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 = 507.93

This shows that liking science fiction and playing chess are correlated in the group.
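The χ² value above can be checked mechanically with a minimal Python sketch:

```python
# Verify the chi-square value for the 2x2 contingency table above.
observed = [250, 50, 200, 1000]
expected = [90, 210, 360, 840]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))  # about 507.94, matching the slide's 507.93
```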

Visually Evaluating Correlation

[Figure: scatter plots showing the similarity from -1 to 1]

Correlation measures the linear relationship between objects. To compute correlation, we standardize the data objects A and B, and then take their dot product:

    correlation(A, B) = A' · B'

Covariance

    Cov(A, B) = E((A - Ā)(B - B̄)) = E(A·B) - Ā·B̄

Correlation coefficient:

    r(A,B) = Cov(A, B) / (σ_A σ_B)

where Ā and B̄ are the respective mean or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

- Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
- Negative covariance: if Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
- Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true: some pairs of random variables may have a covariance of 0 but are not independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.

Covariance: An Example

Suppose two stocks A and B have the following prices in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14). If the stocks are affected by the same industry trends, will their prices rise or fall together?

    E(A) = (2 + 3 + 5 + 4 + 6)/5 = 4;  E(B) = (5 + 8 + 10 + 11 + 14)/5 = 9.6
    Cov(A,B) = (2·5 + 3·8 + 5·10 + 4·11 + 6·14)/5 - 4 × 9.6 = 42.4 - 38.4 = 4

Since Cov(A,B) > 0, the prices of A and B tend to rise and fall together.
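A quick check of the arithmetic, assuming population (divide-by-n) covariance as the slides use:

```python
# Population covariance for the stock prices above.
a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
n = len(a)

mean_ab = sum(x * y for x, y in zip(a, b)) / n   # E(A*B) = 42.4
cov = mean_ab - (sum(a) / n) * (sum(b) / n)      # E(A*B) - E(A)E(B)
print(round(cov, 6))  # 4.0 > 0 -> prices tend to rise together
```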

Data Transformation

attribute to a new set of replacement values s.t. each

old value can be identified with one of the new values

Methods

range

min-max normalization

z-score normalization

Attribute/feature construction

Data Mining: Concepts and

from the given ones

April 27, 2016

Techniques

33

Data Transformation: Normalization

Min-max normalization:

    v' = (v - min_A) / (max_A - min_A) × (new_max_A - new_min_A) + new_min_A

Z-score normalization:

    v' = (v - mean_A) / stand_dev_A

Normalization by decimal scaling:

    v' = v / 10^j,  where j is the smallest integer such that max(|v'|) < 1

Normalization: An Example

Suppose the minimum and maximum values for the attribute income are $12,000 and $98,000, and we map income to [0.0, 1.0]. Then $73,600 is transformed to:

    (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0) + 0 = 0.716

With z-score normalization, given mean $54,000 and standard deviation $16,000, $73,600 is transformed to:

    (73,600 - 54,000) / 16,000 = 1.225
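Both normalizations can be verified with a few lines of Python (function names are my own):

```python
# Min-max and z-score normalization of the income example above.
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mean, std):
    return (v - mean) / std

print(round(min_max(73_600, 12_000, 98_000), 3))  # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```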


Data Reduction Strategies

Complex data analysis/mining may take a very long time to run on the complete data set.

Data reduction: obtain a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results.

Data reduction strategies:
- Data cube aggregation
- Dimensionality reduction: remove unimportant attributes
- Data compression
- Numerosity reduction: fit data into models
- Discretization and concept hierarchy generation

Data Cube Aggregation

- The lowest level of a data cube holds the aggregated data for an individual entity of interest.
- Use the smallest representation which is enough to solve the task.
- Queries regarding aggregated information should be answered using the data cube, when possible.

Dimensionality Reduction

Feature (attribute subset) selection:
- Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features.
- Reduces the number of attributes in the discovered patterns, making the patterns easier to understand.

Heuristic methods (due to the exponential number of choices):
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination
- decision-tree induction

Example of Decision-Tree Induction

[Figure: a decision tree over the initial attribute set {A1, A2, A3, A4, A5, A6} branches on A4, then A6 and A1, leading to leaves Class 1 and Class 2; the reduced attribute set is {A1, A4, A6}]

Data Compression

- String compression: there are extensive theories and well-tuned algorithms; typically lossless, but only limited manipulation is possible without expansion.
- Audio/video compression: typically lossy compression, with progressive refinement; sometimes small fragments of signal can be reconstructed without reconstructing the whole.
- Time sequences (which are not audio): typically short, and they vary slowly with time.

[Figure: original data reduced to compressed data; lossless compression recovers the original data exactly, lossy compression only an approximation]

Wavelet Transformation

- Discrete wavelet transform (DWT): linear signal processing, multiresolutional analysis (e.g., the Haar2 and Daubechie4 wavelet families).
- Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients.
- Lossy compression, localized in space.
- Method: pad the data length to an integer power of 2 (when necessary), then repeatedly apply the transform to pairs of data, halving the length at each step.

[Figure: DWT for image compression; the image is passed repeatedly through low-pass and high-pass filter banks]

Principal Component Analysis (PCA)

- Given N data vectors from k dimensions, find c ≤ k orthogonal vectors that can be best used to represent the data.
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
- Each data vector is a linear combination of the c principal component vectors.
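A minimal PCA sketch with NumPy's SVD, on made-up 2-D points (variable names are mine; this is an illustration, not the textbook's code):

```python
# PCA sketch: project 2-D points onto their first principal component.
import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1],
              [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)              # center the data
# Rows of Vt are the orthogonal principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

c = 1                                    # keep c = 1 principal component
Z = Xc @ Vt[:c].T                        # N vectors on c reduced dimensions
X_approx = Z @ Vt[:c] + X.mean(axis=0)   # approximate reconstruction

print(Z.shape)  # (10, 1) -> each vector is now a single coordinate
```

Because the points lie close to a line, the single component preserves most of the variance, so X_approx stays near the original X.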

[Figure: points in the (X1, X2) plane with principal axes Y1 and Y2; Y1 captures the greatest variance]

Numerosity Reduction

- Parametric methods: assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces.
- Non-parametric methods: do not assume models; examples include histograms, clustering, and sampling.

Regression and Log-Linear Models

- Linear regression: data are modeled to fit a straight line.
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
- Log-linear model: approximates discrete multidimensional probability distributions.

Regression and Log-Linear Models (cont.)

- Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are to be estimated by using the data at hand, applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ....
- Multiple regression: Y = b0 + b1 X1 + b2 X2.
  - Many nonlinear functions can be transformed into the above.
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.

Histograms

- A popular data reduction technique.
- Divide data into buckets and store the average (or sum) for each bucket.
- Can be constructed optimally in one dimension using dynamic programming.
- Related to quantization problems.

[Figure: histogram over values 10,000-100,000, bucket counts between 5 and 40]
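The divide-into-buckets-and-store-averages idea, as a Python sketch with made-up values and a made-up bucket width:

```python
# Histogram-style reduction: replace each bucket by (count, average).
data = [12, 15, 22, 27, 31, 33, 38, 44, 45, 59]
width = 10

buckets = {}
for v in data:
    lo = (v // width) * width            # bucket start: 10, 20, 30, ...
    buckets.setdefault(lo, []).append(v)

reduced = {lo: (len(vs), sum(vs) / len(vs)) for lo, vs in buckets.items()}
print(reduced)
# 10 raw values collapse to 5 (count, average) pairs, e.g. 30: (3, 34.0)
```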

Clustering

- Partition the data set into clusters, and store the cluster representation only.
- Can be very effective if the data is clustered, but not if the data is smeared.
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures.
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8.

Sampling

- Allows a mining algorithm to run at a cost potentially sub-linear to the size of the data.
- Choose a representative subset of the data: simple random sampling may have very poor performance in the presence of skew.
- Develop adaptive sampling methods, e.g., stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database.
- Note: sampling may not reduce database I/Os (data is read a page at a time).
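The sampling schemes can be sketched with Python's random module (the records and the 80/20 skew are made up; the seed is fixed so the run is repeatable):

```python
# SRSWOR, SRSWR, and stratified sampling sketches.
import random

random.seed(7)
data = list(range(100))                         # 100 raw records
strata = {"low": data[:80], "high": data[80:]}  # a skewed 80/20 split

srswor = random.sample(data, 10)                    # without replacement
srswr = [random.choice(data) for _ in range(10)]    # with replacement

# Stratified: keep each stratum's share (10% of each) despite the skew.
stratified = [v for name, rows in strata.items()
              for v in random.sample(rows, max(1, len(rows) // 10))]

print(len(srswor), len(set(srswor)))  # 10 10 -> no duplicates possible
print(len(stratified))                # 8 + 2 = 10
```

A plain random sample of 10 could easily miss the "high" stratum entirely; the stratified sample always keeps its two representatives.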

[Figure: raw data reduced by SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement)]

[Figure: raw data compared with a cluster/stratified sample of it]

Hierarchical Reduction

- Use multi-resolution structures with different degrees of reduction.
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than clusters.
- Parametric methods are usually not amenable to hierarchical representation.
- Hierarchical aggregation:
  - An index tree hierarchically divides a data set into partitions by value range of some attributes.
  - Each partition can be considered as a bucket.
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram.


Discretization

Three types of attributes:
- Nominal: values from an unordered set
- Ordinal: values from an ordered set
- Continuous: real numbers

Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Reduce data size by discretization.
- Prepare for further analysis.

Discretization and Concept Hierarchy

Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).

Discretization techniques can be categorized based on how the discretization is performed: whether it uses class information, and in which direction it proceeds (i.e., top-down vs. bottom-up). If the discretization process uses class information, we say it is supervised discretization; otherwise, it is unsupervised.

Discretization and Concept Hierarchy (cont.)

If the process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals, it is called top-down discretization or splitting. This contrasts with bottom-up discretization or merging, which starts by considering all of the continuous values as potential split points, removes some by merging neighborhood values to form intervals, and then recursively applies this process to the resulting intervals.

Discretization can be performed recursively on an attribute to provide a hierarchical or multiresolution partitioning of the attribute values, known as a concept hierarchy. Concept hierarchies are useful for mining at multiple levels of abstraction.

Concept Hierarchy Generation for Numeric Data

Manual specification of concept hierarchies can be a difficult and time-consuming task for a user or a domain expert. Several discretization methods can be used to automatically generate or dynamically refine concept hierarchies for numerical attributes. Furthermore, many hierarchies for categorical attributes are implicit within the database schema and can be automatically defined at the schema definition level.

- Binning: a top-down splitting technique
- Histogram analysis: an unsupervised discretization technique
- Clustering analysis: an unsupervised discretization technique
- Entropy-based discretization
- Segmentation by natural partitioning

Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization. The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

    Ent(S) - E(T, S) < δ

Entropy-based discretization may reduce data size and improve classification accuracy.
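A toy supervised split chooser under the formula above (the samples and helper names are made up; Ent is the usual class-label entropy):

```python
# Entropy-based binary split: choose the boundary T minimizing E(S, T).
import math

# Toy labeled samples (value, class); values and labels are made up.
samples = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]

def ent(part):
    """Class-label entropy Ent of a non-empty sample set."""
    n = len(part)
    out = 0.0
    for c in {"a", "b"}:
        p = sum(1 for _, lab in part if lab == c) / n
        if p > 0:
            out -= p * math.log2(p)
    return out

def best_split(s):
    values = sorted(v for v, _ in s)
    # Candidate boundaries: midpoints between consecutive values.
    cands = [(x + y) / 2 for x, y in zip(values, values[1:])]
    def e(t):  # E(S, T) = |S1|/|S| Ent(S1) + |S2|/|S| Ent(S2)
        s1 = [p for p in s if p[0] <= t]
        s2 = [p for p in s if p[0] > t]
        return len(s1) / len(s) * ent(s1) + len(s2) / len(s) * ent(s2)
    return min(cands, key=e)

print(best_split(samples))  # 5.0 -> perfectly separates classes a and b
```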

Segmentation by Natural Partitioning

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, natural intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals.
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals.
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals.

Example of the 3-4-5 Rule

- Step 1: profit values range from Min = -$351 to Max = $4,700; the low and high (5th and 95th percentile) values are -$159 and $1,838.
- Step 2: msd = 1,000, so round to Low = -$1,000 and High = $2,000.
- Step 3: the interval (-$1,000 - $2,000) covers 3 distinct values at the most significant digit, so partition it into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000).
- Step 4: adjust the boundary intervals to cover Min and Max: the first interval shrinks to (-$400 - 0) and an interval ($2,000 - $5,000) is added; recursive partitioning then yields (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0), (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000), ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $2,000), ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000).

Concept Hierarchy Generation for Categorical Data

- Specification of a partial ordering of attributes explicitly at the schema level by users or experts:
  - street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping:
  - {Urbana, Champaign, Chicago} < Illinois
- Specification of a set of attributes:
  - The system automatically generates the partial ordering by analysis of the number of distinct values.
- Specification of only a partial set of attributes:
  - E.g., only street < city, not others.

Automatic Concept Hierarchy Generation

Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set: the attribute with the most distinct values is placed at the lowest level of the hierarchy. (Note the exception: weekday, month, quarter, year.)

- country: 15 distinct values
- province_or_state: 65 distinct values
- city: 3,567 distinct values
- street: the most distinct values, placed at the lowest level
