— Chapter 2 —
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector
Transaction data
Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
Data Objects
Types:
Nominal
Binary
Ordinal
Numeric
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”, order not meaningful
Hair_color = {auburn, black, blond, brown, grey, red, white}
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of one value as being a multiple of another
(10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts,
monetary quantities
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., the set of words in a collection of documents
Continuous Attribute
Has real numbers as attribute values
Chapter 2: Getting to Know Your Data
Data Visualization
Summary
Basic Statistical Descriptions of Data
Measures of central tendency
Mean, median, mode
Dispersion of data
range, quartiles and interquartile range, five-number
summary and boxplots, variance and standard deviation
Measuring the Central Tendency
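Mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (population)
Note: n is sample size and N is population size.
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Trimmed mean: chopping extreme values
Median: the middle value of the ordered data (the average of the two middle values when n is even)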
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean − mode ≈ 3 × (mean − median)
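The measures of central tendency listed on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the sample `data`, the weights, and the helper names `weighted_mean` and `trimmed_mean` are assumptions made for the example.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


# Hypothetical sample with one extreme value (100).
data = [2, 3, 3, 4, 5, 7, 100]

print("mean        :", mean(data))             # pulled upward by 100
print("median      :", median(data))           # middle of the sorted data
print("mode(s)     :", multimode(data))        # most frequent value(s)
print("trimmed mean:", trimmed_mean(data, 1))  # drops 2 and 100
print("weighted    :", weighted_mean([1, 2, 3], [1, 1, 2]))
```

The extreme value moves the mean far more than the median or the trimmed mean, which is the usual reason for reporting a robust measure alongside the mean.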
Symmetric vs. Skewed Data
Median, mean and mode of symmetric, positively skewed,
and negatively skewed data
Boxplot Analysis
Draw a box plot for the following dataset.
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Here,
Q2(median) = 14.6
Q1 = 14.4
Q3 = 14.9
IQR = Q3 – Q1 = 14.9-14.4 = 0.5
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
So, the outliers are at 10.2, 15.9, and 16.4.
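The boxplot calculation above can be checked with a short script. This is a sketch under one stated assumption: quartiles are taken as the medians of the lower and upper halves of the sorted data, excluding the overall median when n is odd. Other quartile conventions exist and can give slightly different values. The function names are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the overall median is excluded from both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


def outliers(values):
    """Points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]


data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

print(five_number_summary(data))
print(outliers(data))
```

With this convention the script gives Q1 = 14.4, median = 14.6, Q3 = 14.9 and flags 10.2, 15.9 and 16.4, matching the worked example.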
Measuring the Dispersion of Data
Variance and standard deviation
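Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$ (population form); the standard deviation $\sigma$ is the square root of the variance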
Example of Standard Deviation
Find out the Mean, the Variance, and the Standard Deviation of the following
dataset.
9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4
Here, the mean = 7
Using the formula for variance,
σ² = 8.9
Therefore, the standard deviation is σ = 2.98
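The figures in this example can be reproduced as follows. The sketch assumes the population form of the variance (dividing by N rather than N − 1), which is the form that yields 8.9 for this dataset.

```python
from statistics import mean, pstdev, pvariance

data = [9, 2, 5, 4, 12, 7, 8, 11, 9, 3, 7, 4, 12, 5, 4, 10, 9, 6, 9, 4]

mu = mean(data)

# Population variance written out term by term ...
var_manual = sum((x - mu) ** 2 for x in data) / len(data)
# ... and the same quantity from the standard library.
var = pvariance(data)
sd = pstdev(data)

print("mean     :", mu)
print("variance :", var_manual, var)
print("std. dev.:", round(sd, 2))
```

Dividing by N − 1 instead (the sample variance, `statistics.variance`) would give a slightly larger value.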
Graphic Displays of Basic Statistical Descriptions
Histogram Example
Quantile Plot
Scatter Plot
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
Why data visualization?
Gain insight into the data
Categorization of visualization methods:
Pixel-oriented visualization techniques
Pixel-Oriented Visualization Techniques
For a data set of m dimensions, create m windows on the screen, one
for each dimension
The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
The colors of the pixels reflect the corresponding values
Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age
Geometric Projection Visualization Techniques
Icon-Based Visualization Techniques
Chernoff Faces
Stick Figure
A 5-piece stick figure (1 body and 4 limbs)
Two attributes mapped to the two axes, remaining attributes mapped to angle or
length of limbs
A census data figure showing age, income, gender, education, etc.
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the
outer x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
Visualizing Complex Data and Relations
Visualizing non-numerical data: text and social networks
Tag cloud: visualizing user-generated tags
The importance of a tag is represented by font size/color
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two data objects are
Lower when objects are more alike
Two data structures commonly used to measure the above
Data matrix
Dissimilarity matrix
Data Matrix and Dissimilarity Matrix
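Data matrix
Stores n data points with p dimensions
Two modes – stores both objects and attributes
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix
Stores n data points, but registers only the dissimilarity between objects i and j
A triangular matrix
Single mode as it only stores dissimilarity values
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$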
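The relationship between the two structures can be sketched in code: a data matrix (n rows, p columns) goes in, and a lower-triangular dissimilarity matrix comes out. The function names, the three sample rows, and the choice of a sum of absolute differences as the distance are all assumptions made for illustration. Any distance function could be passed in instead.

```python
from typing import Callable, Sequence

Row = Sequence[float]


def dissimilarity_matrix(data: Sequence[Row],
                         dist: Callable[[Row, Row], float]) -> list[list[float]]:
    """Lower-triangular matrix: row i holds d(i,0), ..., d(i,i-1), then 0."""
    return [[dist(data[i], data[j]) for j in range(i)] + [0.0]
            for i in range(len(data))]


def abs_diff_sum(a: Row, b: Row) -> float:
    """Illustrative distance: sum of absolute attribute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical data matrix: 3 objects (rows) with 2 attributes (columns).
data = [(1.0, 2.0), (3.0, 5.0), (2.0, 0.0)]

tri = dissimilarity_matrix(data, abs_diff_sum)
for row in tri:
    print(row)
```

Only the lower triangle is stored because d(i, j) = d(j, i) and d(i, i) = 0, so the upper triangle carries no extra information.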
Proximity Measure for Nominal Attributes
Dissimilarity between Nominal Attributes
Proximity Measure for Binary Attributes
A contingency table for binary data (rows: object i, columns: object j)
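Minkowski Distance
Minkowski distance: $d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two
p-dimensional data objects, and h is the order (the distance
so defined is also called L-h norm)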
Properties
d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
d(i, j) = d(j, i) (Symmetry)
d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance
h = 1: Manhattan distance
$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
h = 2: Euclidean distance
$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
h → ∞: supremum distance
This is the maximum difference between any component (attribute)
of the vectors
Example: Minkowski Distance
Data points
point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Dissimilarity Matrices

Manhattan (L1)
L1    x1     x2     x3     x4
x1    0
x2    5      0
x3    3      6      0
x4    6      1      7      0

Euclidean (L2)
L2    x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0

Supremum (L∞)
L∞    x1     x2     x3     x4
x1    0
x2    3      0
x3    2      5      0
x4    3      1      5      0
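The three matrices in this example can be recomputed with one Minkowski function. This is a minimal sketch: the names `minkowski` and `matrix` are hypothetical, and the supremum case is handled by passing `math.inf` as the order h.

```python
import math


def minkowski(a, b, h):
    """L-h distance between two equal-length vectors; h = math.inf gives the supremum."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)


# The four points from the example.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}


def matrix(h):
    """Lower-triangle entries {(row, col): distance}, rounded to 2 decimals."""
    names = list(points)
    return {(names[i], names[j]): round(minkowski(points[names[i]], points[names[j]], h), 2)
            for i in range(len(names)) for j in range(i)}


for label, h in (("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)):
    print(label, matrix(h))
```

The same pair of points can rank differently under different orders: x3 is closer to x1 than x2 is under all three, but the gap between the two distances changes with h.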
Proximity Measures for Ordinal Variables
Attributes of Mixed Type
A database may contain all attribute types: nominal, symmetric binary,
asymmetric binary, numeric, ordinal
One may use a weighted formula to combine their effects
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
For nominal and ordinal attributes, use the technique mentioned earlier
to compute dissimilarity matrix
For numeric attributes use the following formula to calculate
dissimilarity: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$
Attributes of Mixed Type - Example
Consider the data in the table:
The dissimilarity matrices for the nominal and ordinal data are shown
to the right, computed using the methods discussed before
To compute the dissimilarity matrix for the numeric attribute,
$\max_h x_h = 64$ and $\min_h x_h = 22$. Using the formula from the
previous slide, the dissimilarity matrix is obtained as shown below:
Attributes of Mixed Type - Example
The three dissimilarity matrices can now be used to compute the
overall dissimilarity between two objects using the equation
$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
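The combination step can be sketched as below. Several things here are assumptions, since the table and the per-type formulas are not reproduced in this text: the four objects are hypothetical (chosen so that the numeric attribute spans 22 to 64), nominal dissimilarity is taken as 0 for a match and 1 for a mismatch, ordinal values are mapped to ranks and rescaled to [0, 1], numeric differences are divided by the attribute's range, and the indicator is taken as 0 when a value is missing (`None`) and 1 otherwise.

```python
# Hypothetical objects: (nominal code, ordinal grade, numeric score).
objects = [
    ("A", "excellent", 45),
    ("B", "fair", 22),
    ("C", "good", 64),
    ("A", "excellent", 28),
]

GRADES = {"fair": 1, "good": 2, "excellent": 3}  # assumed rank order
scores = [o[2] for o in objects]
NUM_RANGE = max(scores) - min(scores)            # 64 - 22 = 42


def nominal(x, y):
    """0 if the two states match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def ordinal(x, y):
    """Map ranks 1..M onto [0, 1], then take the absolute difference."""
    m = len(GRADES)
    return abs((GRADES[x] - 1) / (m - 1) - (GRADES[y] - 1) / (m - 1))


def numeric(x, y):
    """Absolute difference normalized by the attribute's range."""
    return abs(x - y) / NUM_RANGE


MEASURES = (nominal, ordinal, numeric)


def mixed_dissimilarity(a, b):
    """Average the per-attribute dissimilarities over attributes with indicator 1."""
    total, count = 0.0, 0
    for f, measure in enumerate(MEASURES):
        if a[f] is None or b[f] is None:
            continue  # indicator 0: attribute f contributes nothing
        total += measure(a[f], b[f])
        count += 1
    if count == 0:
        raise ValueError("no attribute is available for both objects")
    return total / count


for i in range(len(objects)):
    print([round(mixed_dissimilarity(objects[i], objects[j]), 2) for j in range(i + 1)])
```

Because every per-attribute dissimilarity lies in [0, 1], the combined value does too, so attributes of different types contribute on a comparable scale.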
Cosine Similarity
Cosine similarity is a measure of similarity that can be used to compare
documents.
A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as a keyword) or phrase in the document;
this representation is called a term-frequency vector.
Example: Cosine Similarity
Ex: Find the similarity between documents 1 and 2 from previous slide.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
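cos(d1, d2) = (d1 · d2) / (||d1|| × ||d2||)
d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = √(5² + 3² + 2² + 2²) = √42 ≈ 6.481
||d2|| = √(3² + 2² + 1² + 1² + 1² + 1²) = √17 ≈ 4.123
cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94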
The cosine similarity shows that the two documents are quite similar.
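The cosine similarity of the two term-frequency vectors in this example can be verified with a few lines of Python. The function name `cosine_similarity` is an illustrative choice, and the sketch assumes neither vector is all zeros (otherwise the denominator is zero).

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

Because the measure depends only on the angle between the vectors, scaling one document's counts (for example, doubling every frequency) leaves the similarity unchanged.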
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled,
ratio-scaled
Many types of data sets, e.g., numerical, text, graph, image, etc.
Gain insight into the data by:
Basic statistical data description: central tendency,
Manhattan distance
Supremum distance
Cosine similarity