
Data mining homework #2

3.3) Using the data given below:


13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70.
a) Smooth the data by binning, using a bin depth of 3.
Sol: Partition into equal-frequency (equi-depth) bins:
Bin 1: 13, 15, 16
Bin 2: 16, 19, 20
Bin 3: 20, 21, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 40, 45
Bin 9: 46, 52, 70
Smoothing by bin means:
Bin 1: 14.7, 14.7, 14.7
Bin 2: 18.3, 18.3, 18.3
Bin 3: 21, 21, 21
Bin 4: 24, 24, 24
Bin 5: 26.7, 26.7, 26.7
Bin 6: 33.7, 33.7, 33.7
Bin 7: 35, 35, 35
Bin 8: 40.3, 40.3, 40.3
Bin 9: 56, 56, 56

Smoothing by bin boundaries:


Bin 1: 13, 16, 16
Bin 2: 16, 20, 20
Bin 3: 20, 22, 22
Bin 4: 22, 25, 25
Bin 5: 25, 25, 30
Bin 6: 33, 33, 35
Bin 7: 35, 35, 35
Bin 8: 36, 36, 45
Bin 9: 46, 46, 70
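
The partitioning and both smoothing rules can be sketched in Python as follows. This is only an illustration: the function names are made up here, bin means are rounded to one decimal place, and a value exactly midway between the two bin boundaries is sent to the upper boundary (an assumption, chosen because it reproduces Bin 3).

```python
from statistics import mean

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]


def equal_depth_bins(values, depth):
    """Split the sorted values into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean (1 decimal place)."""
    return [[round(mean(b), 1)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the nearer of the bin minimum and maximum.

    Assumption: ties go to the upper boundary.
    """
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if (v - lo) < (hi - v) else hi for v in b])
    return smoothed


bins = equal_depth_bins(ages, 3)
for number, (m, b) in enumerate(
        zip(smooth_by_means(bins), smooth_by_boundaries(bins)), start=1):
    print(f"Bin {number}: means={m} boundaries={b}")
```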
b) Outliers can be identified by drawing a boxplot of the data.
[Figure: box plot of the data]

A point beyond an inner fence is called a mild outlier, and a point beyond an
outer fence is called an extreme outlier.
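
As a rough check, the fences can be computed directly. The sketch below assumes the common convention of inner fences at 1.5 x IQR and outer fences at 3 x IQR beyond the quartiles, which the solution does not state, and uses the default "exclusive" quartile method of Python's statistics.quantiles; other quartile conventions may shift the fences slightly.

```python
from statistics import quantiles

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

# Quartiles with the default "exclusive" method (an assumption; the
# exercise does not say which quartile convention to use).
q1, median, q3 = quantiles(ages, n=4)
iqr = q3 - q1

# Assumed fence multipliers: 1.5 * IQR (inner) and 3 * IQR (outer).
inner_fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outer_fences = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)


def classify(value):
    """Label a value as an 'extreme' outlier, a 'mild' outlier, or 'none'."""
    if value < outer_fences[0] or value > outer_fences[1]:
        return "extreme"
    if value < inner_fences[0] or value > inner_fences[1]:
        return "mild"
    return "none"


outliers = {v: classify(v) for v in ages if classify(v) != "none"}
print("Q1 =", q1, "median =", median, "Q3 =", q3, "IQR =", iqr)
print("inner fences:", inner_fences, "outer fences:", outer_fences)
print("outliers:", outliers)
```

Under these assumptions only the value 70 falls outside the inner fences, and it stays inside the outer fences, so it would count as a mild outlier.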

c) Regression and outlier analysis are other methods of data smoothing.
Regression smooths the data by fitting it to a function. Outlier analysis
detects clusters, and values that fall outside the clusters are treated as
outliers.
3.6) Using the following methods of normalization, normalize the data:
200, 300, 400, 600, 1000.
a) Min-max method

b) Z-score normalization for set of data A is:

c) By using mean absolute deviation method.

d) By using decimal scaling.
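
The four methods can be sketched in Python as below. Several choices here are assumptions rather than part of the exercise as quoted: the min-max target range is taken to be [0, 1], the standard deviation is the population form (dividing by n), and the helper names are illustrative only.

```python
from math import sqrt

data = [200, 300, 400, 600, 1000]


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization onto [new_min, new_max] (assumed [0, 1])."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]


def z_score(values):
    """Z-score using the population standard deviation (divide by n)."""
    mu = sum(values) / len(values)
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]


def z_score_mad(values):
    """Z-score variant that divides by the mean absolute deviation."""
    mu = sum(values) / len(values)
    mad = sum(abs(v - mu) for v in values) / len(values)
    return [(v - mu) / mad for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer giving max |v'| < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print("min-max:        ", min_max(data))
print("z-score:        ", [round(z, 3) for z in z_score(data)])
print("z-score (MAD):  ", [round(z, 3) for z in z_score_mad(data)])
print("decimal scaling:", decimal_scaling(data))
```

For this data the mean is 500, the population standard deviation is about 282.8, the mean absolute deviation is 240, and decimal scaling divides by 10^4.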

3.8) Using the age and fat percentage values:


a) Normalize values using z-score normalization:
z-score normalization of age

z-score normalization of fat percent

b) Covariance matrix of age and fat percent:

Correlation coefficient of age and fat percentage

Because the correlation coefficient is 0.81, which is > 0, age and fat percentage
are positively correlated.
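
The calculations in this exercise follow the pattern sketched below. The age and fat values in the sketch are hypothetical placeholders, not the exercise's data, so its output is not the 0.81 reported above; the helper names are illustrative and the population forms (dividing by n) are assumed.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pop_std(xs):
    """Population standard deviation (divide by n)."""
    mu = mean(xs)
    return sqrt(sum((x - mu) ** 2 for x in xs) / len(xs))


def z_scores(xs):
    """Z-score normalization of a list of values."""
    mu, sigma = mean(xs), pop_std(xs)
    return [(x - mu) / sigma for x in xs]


def covariance(xs, ys):
    """Population covariance of two equally long lists."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson_r(xs, ys):
    """Pearson correlation coefficient: cov(x, y) / (std_x * std_y)."""
    return covariance(xs, ys) / (pop_std(xs) * pop_std(ys))


# Hypothetical placeholder values, NOT the table from the exercise.
age = [25, 35, 45, 55, 65]
fat = [12.0, 20.0, 24.0, 31.0, 33.0]

r = pearson_r(age, fat)
print("z(age):", [round(z, 2) for z in z_scores(age)])
print("z(fat):", [round(z, 2) for z in z_scores(fat)])
print("cov =", covariance(age, fat), "r =", round(r, 4))
print("positively correlated" if r > 0 else "not positively correlated")
```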

4. Suppose that a data warehouse consists of the four dimensions date, spectator, location, and
game, and the two measures count and charge, where charge is the fare that a spectator pays
when watching a game on a given date. Spectators may be students, adults, or seniors, with each
category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.

Sales
Fact table
Date_id
Spectator_id
Game_id
Location_id
Count
Charge

Date
Dimension table
Date_id
Day
Month
Quarter
Year

Spectator
Dimension table
Spectator_id
Spectator_name
Status
Phone
Address

Game
Dimension table
Game_id
Game_name
Description
Producer

Location
Dimension table
Location_id
Location_name
Phone
Street
City
Province
Country

(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP operations
should you perform in order to list the total charge paid by student spectators at GM Place in
2010?
Roll-up on date from date id to year.
Roll-up on game from game id to all.
Roll-up on location from location id to location name.
Roll-up on spectator from spectator id to status.
Dice with status=students, location name=GM Place, and year=2010.
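
In relational terms, the roll-ups and the dice above amount to joining the fact table to three dimensions, filtering, and summing the charge. The sketch below uses Python's built-in sqlite3 module with an abbreviated version of the schema; the table names (date_dim and so on) and every inserted row are hypothetical, chosen only to make the query runnable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Abbreviated star schema: only the columns the query needs.
con.executescript("""
CREATE TABLE date_dim      (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE spectator_dim (spectator_id INTEGER PRIMARY KEY, status TEXT);
CREATE TABLE location_dim  (location_id INTEGER PRIMARY KEY,
                            location_name TEXT);
CREATE TABLE game_dim      (game_id INTEGER PRIMARY KEY, game_name TEXT);
CREATE TABLE sales (date_id INTEGER, spectator_id INTEGER, game_id INTEGER,
                    location_id INTEGER, "count" INTEGER, charge REAL);
""")

# Hypothetical sample rows, for illustration only.
con.executemany("INSERT INTO date_dim VALUES (?, ?)", [(1, 2010), (2, 2011)])
con.executemany("INSERT INTO spectator_dim VALUES (?, ?)",
                [(1, "students"), (2, "adults"), (3, "students")])
con.executemany("INSERT INTO location_dim VALUES (?, ?)",
                [(1, "GM Place"), (2, "Other Arena")])
con.executemany("INSERT INTO game_dim VALUES (?, ?)",
                [(1, "Game A"), (2, "Game B")])
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 1, 10.0),  # student, GM Place, 2010: counted
    (1, 3, 2, 1, 1, 12.0),  # student, GM Place, 2010: counted
    (1, 2, 1, 1, 1, 25.0),  # adult: excluded
    (2, 1, 1, 1, 1, 10.0),  # 2011: excluded
    (1, 1, 1, 2, 1, 10.0),  # other location: excluded
])

# Game is rolled up to "all", so it is neither joined nor grouped on.
total = con.execute("""
    SELECT SUM(f.charge)
    FROM sales f
    JOIN date_dim d      ON f.date_id = d.date_id
    JOIN spectator_dim s ON f.spectator_id = s.spectator_id
    JOIN location_dim l  ON f.location_id = l.location_id
    WHERE s.status = 'students'
      AND l.location_name = 'GM Place'
      AND d.year = 2010
""").fetchone()[0]
print("Total charge:", total)
```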
(c) Bitmap indexing is useful in data warehousing. Taking this cube as an example, briefly
discuss advantages and problems of using a bitmap index structure.
Bitmap indexing is advantageous for low-cardinality domains. For example, in this cube,
if dimension location is bitmap indexed, then comparison, join, and aggregation operations over
location are reduced to bit arithmetic, which substantially reduces the processing time.
Furthermore, strings of long location names can be represented by a single bit, which leads to
a significant reduction in space and I/O. The problem arises for dimensions with high cardinality,
such as date in this example, where the vector used to represent the bitmap index could be very
long. For example, a 10-year collection of data could result in 3650 date records, meaning that
every tuple in the fact table would require 3650 bits (or approximately 456 bytes) to hold the
bitmap index.
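
A toy bitmap index illustrates both points. In this sketch each distinct value gets one bit vector, stored as a Python integer whose bit i is set when row i holds that value; the column contents and function name are hypothetical.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose bit i marks row i."""
    index = {}
    for row, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row)
    return index


# Hypothetical fact-table columns (one entry per row).
location = ["GM Place", "Other Arena", "GM Place", "GM Place", "Other Arena"]
status = ["students", "students", "adults", "students", "adults"]

location_index = build_bitmap_index(location)
status_index = build_bitmap_index(status)

# A selection on two dimensions becomes a single bitwise AND.
match = location_index["GM Place"] & status_index["students"]
matching_rows = [i for i in range(len(location)) if (match >> i) & 1]
print("rows matching both conditions:", matching_rows)

# Cost side: one bit per distinct value for every row of the fact table.
distinct_dates = 3650              # 10 years of daily dates
bytes_per_row = distinct_dates / 8
print("bytes per fact row for the date index:", bytes_per_row)
```

With only two distinct values per column the index costs two bits per row, whereas the date dimension costs 3650 bits per row, which is the trade-off described above.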