Concepts and Techniques: - Chapter 2

Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign
Simon Fraser University
©2013 Han, Kamber, and Pei. All rights reserved.
1
Chapter 2: Getting to Know Your Data
 Data Objects and Attribute Types
 Basic Statistical Descriptions of Data
 Data Visualization
 Measuring Data Similarity and Dissimilarity
 Summary
2
Types of Data Sets
 Record
 Relational records
 Data matrix, e.g., numerical matrix,
crosstabs
 Document data: text documents: term-
frequency vector
 Transaction data
 Graph and network
 World Wide Web
 Social or information networks
 Molecular Structures
 Ordered
 Video data: sequence of images
 Temporal data: time-series
 Sequential Data: transaction sequences
 Genetic sequence data
 Spatial, image and multimedia:
 Spatial data: maps
 Image data:
 Video data:
3
Data Objects
 Data sets are made up of data objects.

 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points, objects,
tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns ->attributes.
4
Attributes
 Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
 E.g., customer _ID, name, address
 Types:
 Nominal
 Binary
 Ordinal
 Numeric
 Interval-scaled
 Ratio-scaled
5
Attribute Types
 Nominal: categories, states, or “names of things”, order not meaningful
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important
 e.g., gender
 Asymmetric binary: outcomes not equally important.
 e.g., medical test (positive vs. negative)
 Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude between
successive values is not known.
 Size = {small, medium, large}, grades, army rankings
6
Numeric Attribute Types
 Quantity (integer or real-valued)
 Interval
 Measured on a scale of equal-sized units
 Values have order
 E.g., temperature in C˚or F˚, calendar dates
 No true zero-point
 Ratio
 Inherent zero-point
 We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as
high as 5 K˚).
 e.g., temperature in Kelvin, length, counts,
monetary quantities
7
Discrete vs. Continuous Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values
 E.g., zip codes, profession, or the set of words in a
collection of documents
 Continuous Attribute
 Has real numbers as attribute values
 E.g., temperature, height, or weight
 Continuous attributes are typically represented as floating-

point variables
8
 Summary
9
Basic Statistical Descriptions of Data
 Measures of central tendency
 Mean, median, mode
 Dispersion of data
 range, quartiles and interquartile range, five-number
summary and boxplots, variance and standard deviation
10
Measuring the Central Tendency
1 n
x   xi
 Mean:
Note: n is sample size and N is population size. n i 1
n
w x
 Weighted arithmetic mean:
i i
 Trimmed mean: chopping extreme values x i 1
n
 Median: w
i 1
i
 Middle value if odd number of values, or average of the

middle two values otherwise
 Estimated by interpolation (for grouped data):
Median
interval
 Mode
 Value that occurs most frequently in the data
 Unimodal, bimodal, trimodal
 Empirical formula: mean  mode  3  (mean  median)
11
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric
symmetric, positively and negatively
skewed data
positively skewed negatively skewed
May 21, 2019 Data Mining: Concepts and Techniques 12

Measuring the Dispersion of Data
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, median, Q3, max
 Boxplot: A simple graphical device to display the overall shape of a
distribution, including the outliers. Ends of the box are the quartiles; median
is marked within the box; add whiskers to mark min and max, and plot
outliers individually
 Outlier: Values less than Q1-1.5*IQR and greater than Q3+1.5*IQR are
outliers
13
Boxplot Analysis
 Draw a box plot for the following dataset.
10.2, 14.1, 14.4. 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Here,
Q2(median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 14.9-14.4 = 0.5
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
So, the outliers are at 10.2, 15.9, and 16.4.
The ends of the box are at 14.4 and 14.9.

The median 14.6 is marked within the box.
The whiskers extend to 14.1 and 15.1.
The outliers 10.2, 15.9, and 16.4 are plotted individually.
14
Measuring the Dispersion of Data
 Variance and standard deviation
 Variance:
 Standard deviation σ is the square root of variance σ2
15
Example of Standard Deviation
Find out the Mean, the Variance, and the Standard Deviation of the following
dataset.
9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
Here, the mean = 7
Using the formula for variance,
σ2 = 8.9
Therefore, the standard deviation is σ = 2.98
16
Graphic Displays of Basic Statistical Descriptions
 Boxplot: graphic display of five-number summary

 Histogram: x-axis are values, y-axis represents frequencies
 Quantile plot: each value xi is paired with fi indicating that
approximately fi * 100% of data are below the value xi
 Quantile-quantile (q-q) plot: graphs the quantiles of one
univariant distribution against the corresponding quantiles of
another
 Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
17
Histogram Example
18
Quantile Plot
 Used to check whether your data is normal

 To make a quantile plot:
 If the data distribution is close to normal, the plotted points

will lie close to a slopped straight line
19
Quantile Plot Example
Create a quantile plot for the following dataset.

3, 5, 1, 4, 10
First sort the data: 1, 3, 4, 5, 10
Calculate the sample quantiles: 0.1, 0.3, 0.5, 0.7, 0.9
Plot the graph
20
Scatter Plot
21
Positively and Negatively Correlated Data
22
Uncorrelated Data
23
 Summary
24
Data Visualization
 Why data visualization?
 Gain insight into the data
 Search for patterns, trends, structure, irregularities, relationships among
data
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations
25
Pixel-Oriented Visualization Techniques
 For a data set of m dimensions, create m windows on the screen, one
for each dimension
 The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
 The colors of the pixels reflect the corresponding values
(a) Income (b) Credit Limit (c) transaction volume (d) age
26
Geometric Projection Visualization Techniques
 Visualization of geometric transformations and projections of

the data
 Helps users find interesting projections of multidimensional data
 Methods
 Scatterplot and scatterplot matrices
27
Icon-Based Visualization Techniques
 Visualization of the data values as features of icons

 Typical visualization methods
 Chernoff Faces
 Stick Figures
28
Chernoff Faces
 A way to display multidimensional data of up to 18 dimensions as a cartoon

human face
 Components of the faces such as eyes, ears, mouth, and nose represent
values of the dimensions by their shape, size, placement and orientation
29
Stick Figure
 A 5-piece stick figure (1 body and 4 limbs)
 Two attributes mapped to the two axes, remaining attributes mapped to angle or
length of limbs
A census data
figure showing
age, income,
gender,
education, etc.
Data Mining: Concepts and Techniques 30

Hierarchical Visualization Techniques
 Visualization of the data using a hierarchical

partitioning into subspaces instead of visualizing all
dimensions at the same time
 Methods
 Dimensional Stacking
 Worlds-within-Worlds
 Tree-Map
 Cone Trees
 InfoCube
31
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
32
Visualizing Complex Data and Relations
 Visualizing non-numerical data: text and social networks
 Tag cloud: visualizing user-generated tags
 The importance of tag is
represented by font
size/color
Newsmap: Google News Stories in 2005

 Summary
34
Similarity and Dissimilarity
 Similarity
 Numerical measure of how alike two data objects are
 Value is higher when objects are more alike
 Often falls in the range [0,1]
 Dissimilarity
 Numerical measure of how different two data objects are
 Lower when objects are more alike
 Two data structures commonly used to measure the above
 Data matrix
 Dissimilarity matrix
35
Data Matrix and Dissimilarity Matrix
 Data matrix
 Stores n data points with  x11 ... x1f ... x1p 
p dimensions  
 ... ... ... ... ... 
 Two modes – stores both x ... xif ... xip 
objects and attributes  i1 
 ... ... ... ... ... 
x ... xnf ... xnp 
 n1 
 Dissimilarity matrix
 Stores n data points, but
 0 
registers only the  d(2,1) 
dissimilarity between  0 
objects i and j  d(3,1) d ( 3,2) 0 
 
 A triangular matrix  : : : 
 Single mode as it only
d ( n,1) d ( n,2) ... ... 0
stores dissimilarity values
36
Proximity Measure for Nominal Attributes
 Can take 2 or more states, e.g., map_color may have

attributes red, yellow, blue and green
 Dissimilarity can be computed based on the ratio of
mismatches
 m: # of matches, p: total # of attributes
d (i, j)  p  p
m
37
Dissimilarity between Nominal Attributes
Here, we have one nominal attribute, test-1, so p=1.

The dissimilarity matrix is as shown below:
38
Proximity Measure for Binary Attributes
Object j
 A contingency table for binary data
Object i
 Dissimilarity for symmetric binary

attributes:
 Dissimilarity for asymmetric binary

attributes :
 Jaccard coefficient (similarity

measure for asymmetric binary
attributes):
39
Dissimilarity between Binary Variables
 Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
 Gender is a symmetric attribute, others are asymmetric binary

 Let the values Y and P be 1, and the value N 0
 Suppose the distance between patients is computed based only on the
asymmetric attributes
01
d ( jack , mary )   0.33
2 01
11
d ( jack , jim )   0.67
111
1 2
d ( jim , mary )   0.75
11 2
40
Distance on Numeric Data: Minkowski Distance
 Minkowski distance: A popular distance measure
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and h is the order (the distance
so defined is also called L-h norm)
 Properties
 d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
 d(i, j) = d(j, i) (Symmetry)
 d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
 A distance that satisfies these properties is a metric
41
Special Cases of Minkowski Distance
 h = 1: Manhattan distance
d (i, j) | x  x |  | x  x | ... | x  x |
i1 j1 i2 j 2 ip jp
 h = 2: Euclidean distance
d (i, j)  (| x  x |2  | x  x |2 ... | x  x |2 )
i1 j1 i2 j2 ip jp
 h  . supremum distance
 This is the maximum difference between any component (attribute)
of the vectors
42
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
L2 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum
L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0
x4 3 1 5 0
43
Proximity Measures for Ordinal Variables
 Let M represent the number of possible values that an ordinal

attribute can have
 replace each xif by its corresponding rank rif {1,...,M f }
 map the range of each variable onto [0, 1] by replacing i-th
object in the f-th variable by
rif 1
zif 
M f 1
 compute the dissimilarity using any of the distance
measures for numeric attributes, e.g., Euclidean distance
44
Proximity Measures for Ordinal Variables
 Consider the data in the adjacent table:

 Here, the attribute Test has three states: fair, good Student Test
and excellent, so Mf=3 1 Excellent
 For step 1, the four attribute values are assigned the 2 Fair
ranks 3,1,2 and 3 respectively. 3 Good
 Step 2 normalizes the ranking by mapping rank 1 to 4 Excellent
0.0, rank 2 to 0.5 and rank 3 to 1.0
 For step 3, using Euclidean distance, a dissimilarity
matrix is obtained as shown
 Therefore, students 1 and 2 are most dissimilar, as are
students 2 and 4
45
Attributes of Mixed Type
 A database may contain all attribute types : Nominal, symmetric binary,
asymmetric binary, numeric, ordinal
 One may use a weighted formula to combine their effects
 pf  1 ij( f ) dij( f )
d (i, j) 
 pf  1 ij( f )
 For nominal and ordinal attributes, use the technique mentioned earlier
to compute dissimilarity matrix
 For numeric attributes use the following formula to calculate
dissimilarity
46
Attributes of Mixed Type - Example
 Consider the data in the table:
 The dissimilarity matrices for
the nominal and ordinal data
are shown to the right
computed using the methods
discussed before
 To compute dissimilarity
matrix for the numeric
attribute, maxhxh=64,
minhxh=22. Using the formula
from previous slide, the
dissimilarity matrix is obtained
as shown below:
47
Attributes of Mixed Type - Example
 The three dissimilarity matrices can now be used to compute the
overall dissimilarity between two objects using the equation
 pf  1 ij( f ) dij( f )
d (i, j) 
 pf  1 ij( f )
 The resulting dissimilarity matrix is
48
Cosine Similarity
 Cosine similarity is a measure of similarity that can be used to compare
documents.
 A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document
called term-frequency vector.
 Cosine measure: If d1 and d2 are two term-frequency vectors, then

cos(x, y) = (x  y) /||x|| ||y|| ,
where  indicates dot product, ||x|| is the length of vector x defined as
. Similarly, ||y|| is the length of vector y.
 A cosine value of 0 means that the two vectors are at 90 degrees to each other
and there is no match.
 The closer the cosine value to 1, the greater the match between the vectors
49
Example: Cosine Similarity
 Ex: Find the similarity between documents 1 and 2 from previous slide.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
Using the formula, cos(d1, d2) = (d1  d2) /||d1|| ||d2|| ,
d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94
The cosine similarity shows that the two documents are quite similar.
50
 Summary
51
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
 Many types of data sets, e.g., numerical, text, graph, image, etc.
 Gain insight into the data by:
 Basic statistical data description: central tendency,
dispersion, graphical displays

 Data visualization:
 Measure data similarity
 Above steps are the beginning of data preprocessing

Exercise 1 of 2
 Find the dissimilarity value between Alyssa and Chris and
between Alyssa and Diane.
Student Gender HairColor Test1 Test2

Alyssa F Black Excellent 80
Chris M Black Good 85
Jessica F Brown Fair 55
Diane F Blonde Excellent 80
Exercise 2 of 2
 Given two objects represented by the tuples (22, 1, 42, 10) and
(20, 0, 36, 8), compute the:
 Euclidean distance
 Manhattan distance
 Supremum distance
 Cosine similarity

Concepts and Techniques: - Chapter 2

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Concepts and Techniques: - Chapter 2

Загружено:

Авторское право:

Доступные форматы

Data Mining:

Concepts and Techniques

Jiawei Han, Micheline Kamber, and Jian Pei

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Data sets are made up of data objects.

 E.g., zip codes, profession, or the set of words in a

 E.g., temperature, height, or weight

 Continuous attributes are typically represented as floating-

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Middle value if odd number of values, or average of the

positively skewed negatively skewed

May 21, 2019 Data Mining: Concepts and Techniques 12

The ends of the box are at 14.4 and 14.9.

 Standard deviation σ is the square root of variance σ2

 Boxplot: graphic display of five-number summary

 Used to check whether your data is normal

 If the data distribution is close to normal, the plotted points

Create a quantile plot for the following dataset.

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Search for patterns, trends, structure, irregularities, relationships among

 Geometric projection visualization techniques

 Icon-based visualization techniques

 Hierarchical visualization techniques

 Visualizing complex data and relations

 Visualization of geometric transformations and projections of

 Visualization of the data values as features of icons

 A way to display multidimensional data of up to 18 dimensions as a cartoon

Data Mining: Concepts and Techniques 30

 Visualization of the data using a hierarchical

Newsmap: Google News Stories in 2005

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

 Can take 2 or more states, e.g., map_color may have

Here, we have one nominal attribute, test-1, so p=1.

 Dissimilarity for symmetric binary

 Dissimilarity for asymmetric binary

 Jaccard coefficient (similarity

 Gender is a symmetric attribute, others are asymmetric binary

 Let M represent the number of possible values that an ordinal

 Consider the data in the adjacent table:

 The resulting dissimilarity matrix is

 Cosine measure: If d1 and d2 are two term-frequency vectors, then

Using the formula, cos(d1, d2) = (d1  d2) /||d1|| ||d2|| ,

||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481

||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Measuring Data Similarity and Dissimilarity

dispersion, graphical displays

 Measure data similarity

 Above steps are the beginning of data preprocessing

Student Gender HairColor Test1 Test2

Вам также может понравиться

||d1||= (55+00+33+00+22+00+00+22+00+00)0.5=(42)0.5 = 6.481

||d2||= (33+00+22+00+11+11+00+11+00+11)0.5=(17)0.5 = 4.12