
S J P N Trust's
Hirasugar Institute of Technology, Nidasoshi
Inculcating Values, Promoting Prosperity
Approved by AICTE, Recognized by Govt. of Karnataka, Affiliated to VTU Belagavi,
Accredited at "A" Grade by NAAC and Recognized Under Section 2(f) of UGC Act, 1956

CSE Dept. | DMDW | Module-II Notes | 2018-19 (EVEN)

Subject: DATA MINING AND DATA WAREHOUSING (15CS651)
Module No: II
Module Name: DATA WAREHOUSE IMPLEMENTATION & DATA MINING
Prepared By: Prof. Shilpa B. Hosagoudra

Module – 2 (8 Hours)

Data Warehouse Implementation & Data Mining:

Efficient Data Cube Computation: An Overview; Indexing OLAP Data: Bitmap Index and Join Index; Efficient Processing of OLAP Queries; OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP.

Introduction to Data Mining: What Is Data Mining, Challenges, Data Mining Tasks; Data: Types of Data, Data Quality, Data Preprocessing, Measures of Similarity and Dissimilarity.


2. DATA WAREHOUSE IMPLEMENTATION


Data warehouses contain huge volumes of data. OLAP servers demand that decision support queries
be answered in the order of seconds. Therefore, it is crucial for data warehouse systems to support
highly efficient cube computation techniques, access methods, and query processing techniques.

2.1 EFFICIENT DATA CUBE COMPUTATION: AN OVERVIEW

At the core of multidimensional data analysis is the efficient computation of aggregations across many
sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s. Each group-by can
be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids defining a data cube.

The “compute cube” Operator and the Curse of Dimensionality


The compute cube operator computes aggregates over all subsets of the dimensions specified in the
operation. This can require excessive storage space, especially for large numbers of dimensions.

What is the total number of cuboids, or group-by's, that can be computed for this data cube?

Assume there are three attributes, city, item, and year, as the dimensions for the data cube, and sales in dollars as the measure. The total number of cuboids, or group-by's, that can be computed for this data cube is 2^3 = 8.
The possible group-by's are the following: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()}, where () means that the group-by is empty.

Fig: 2.1

In Fig. 2.1 above, a data cube is created for AllElectronics sales; its dimensions are city, item, and year, and its measure is sales in dollars.
The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any
combination of the three dimensions.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the total
sum of all sales.

The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most
generalized (least specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid
and explore downward in the lattice, this is equivalent to drilling down within the data cube. If we start
at the base cuboid and explore upward, this is akin to rolling up.

The cube operator is the n-dimensional generalization of the group-by operator. For a cube with n dimensions, there are a total of 2^n cuboids, including the base cuboid.

If we write a statement such as "compute cube sales cube", it would explicitly instruct the system to compute the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the empty subset.
Advantages of pre-computing the cube: pre-computation leads to fast response times and avoids some redundant computation. Most, if not all, OLAP products resort to some degree of pre-computation of multidimensional aggregates.

Curse of Dimensionality (a major challenge related to pre-computation)


The required storage space may explode if all the cuboids in a data cube are pre-computed,
especially when the cube has many dimensions. The storage requirements are even more excessive
when many of the dimensions have associated concept hierarchies, each with multiple levels. This
problem is referred to as the curse of dimensionality. The extent of the curse of dimensionality is
illustrated here.

How many cuboids are there in an n-dimensional data cube?


Ans:
If there were no hierarchies associated with each dimension, then the total number of cuboids for an n-dimensional data cube, as we have seen, is 2^n. But in practice, many dimensions do have hierarchies (e.g., time is usually explored not at only one conceptual level, such as year, but rather at multiple conceptual levels, as in the hierarchy "day < month < quarter < year"). In such cases, for an n-dimensional data cube, the total number of cuboids that can be generated (including the cuboids generated by climbing up the hierarchies along each dimension) is:

Total number of cuboids = (L1 + 1) x (L2 + 1) x ... x (Ln + 1)

where Li is the number of levels associated with dimension i.

One is added to Li in the equation above to include the virtual top level, all.

Ex: If the cube has 10 dimensions and each dimension has five levels (including all), the total number of cuboids that can be generated is 5^10 ≈ 9.8 x 10^6. The size of each cuboid also depends on the cardinality (i.e., number of distinct values) of each dimension.
As the number of dimensions, number of conceptual hierarchies, or cardinality increases, the
storage space required for many of the group-by’s will grossly exceed the (fixed) size of the input
relation.


Partial Materialization: Selected Computation of Cuboids.

If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial
materialization; that is, to materialize only some of the possible cuboids that can be generated.
There are three choices for data cube materialization given a base cuboid:

1. No materialization: Do not precompute any of the "nonbase" cuboids. This leads to computing expensive multidimensional aggregates on-the-fly, which can be extremely slow.
2. Full materialization: Precompute all of the cuboids. The resulting lattice of computed cuboids
is referred to as the full cube. This choice typically requires huge amounts of memory space in
order to store all of the precomputed cuboids.
3. Partial materialization: Selectively compute a proper subset of the whole set of possible
cuboids. Alternatively, we may compute a subset of the cube, which contains only those cells
that satisfy some user-specified criterion, such as where the tuple count of each cell is above
some threshold. We will use the term subcube to refer to the latter case, where only some of the
cells may be precomputed for various cuboids. Partial materialization represents an interesting
trade-off between storage space and response time.

The partial materialization of cuboids or subcubes should consider three factors:


(1) Identify the subset of cuboids or subcubes to materialize;
(2) Exploit the materialized cuboids or subcubes during query processing;
(3) Efficiently update the materialized cuboids or subcubes during load and refresh.

Alternative Strategy of Materialization: Several OLAP products have adopted heuristic approaches for cuboid and subcube selection. A popular approach is to materialize the set of cuboids on which other frequently referenced cuboids are based.

ICEBERG CUBE: It is a data cube that stores only those cube cells with an aggregate value (e.g.,
count) that is above some minimum support threshold.

SHELL CUBE: This involves precomputing the cuboids for only a small number of dimensions (e.g.,
three to five) of a data cube.


2.2: INDEXING OLAP DATA: BITMAP INDEX AND JOIN INDEX


To facilitate efficient data accessing, most data warehouse systems support index structures and
materialized views (using cuboids).

Bitmap Index:
The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap
index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s domain.
If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap
index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then
the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits
for that row are set to 0.

Advantage of Bitmap indexing:


1) It is efficient compared to hash and tree indices.
2) It is especially useful for low-cardinality domains because comparison, join, and aggregation
operations are then reduced to bit arithmetic, which substantially reduces the processing time.
3) Bitmap indexing leads to significant reductions in space and input/output (I/O) since a string of
characters can be represented by a single bit.
4) For higher-cardinality domains, the method can be adapted using compression techniques.

Join Index:
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database.
(Ex: If two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively.)

Advantage of join indexing:


1) The join index records can identify joinable tuples without performing costly join operations.
2) Join indexing is especially useful for maintaining the relationship between a foreign key and its matching primary keys from the joinable relation.


Example on Bitmap indexing. In the AllElectronics data warehouse, suppose the dimension item
at the top level has four values (representing item types): “home entertainment,” “computer,”
“phone,” and “security.” Each value (e.g., “computer”) is represented by a bit vector in the item
bitmap index table. Suppose that the cube is stored as a relation table with 100,000 rows. Because the
domain of item consists of four values, the bitmap index table requires four bit vectors (or lists), each
with 100,000 bits. Figure 2.2 shows a base (data) table containing the dimensions item and city, and its
mapping to bitmap index tables for each of the dimensions.

Fig: 2.2


Example on Join indexing: Consider the star schema for AllElectronics of the form "sales star [time, item, branch, location]: dollars sold = sum (sales in dollars)". An example of a join index
relationship between the sales fact table and the location and item dimension tables is shown in Figure
2.3. For example, the “Main Street” value in the location dimension table joins with tuples T57, T238,
and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table joins with
tuples T57 and T459 of the sales fact table. The corresponding join index tables are shown in Figure
2.4

Fig: 2.3

Join index tables based on the linkages between the sales fact table and the location and item dimension tables shown in Fig. 2.3:
Fig 2.4


2.3: EFFICIENT PROCESSING OF OLAP QUERIES:

The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should proceed as follows:

1. Determine which operations should be performed on the available cuboids:


This involves transforming any selection, projection, roll-up (group-by), and drill-down operations
specified in the query into corresponding SQL and/or OLAP operations. For example, slicing and
dicing a data cube may correspond to selection and/or projection operations on a materialized cuboid.

2. Determine to which materialized cuboid(s) the relevant operations should be applied:


This involves identifying all of the materialized cuboids that may potentially be used to answer the
query, pruning the set using knowledge of “dominance” relationships among the cuboids, estimating
the costs of using the remaining materialized cuboids, and selecting the cuboid with the least cost.

Example on OLAP query processing.


Suppose that we define a data cube for AllElectronics of the form
“sales cube [time, item, location]: sum(sales in dollars).” The dimension hierarchies used are
“day < month < quarter < year” for time; “item name < brand < type” for item; and
“street < city < province or state < country” for location.
Suppose that the query to be processed is on {brand, province or state}, with the selection constant
“year = 2010”.
Suppose that there are four materialized cuboids available, as follows:
cuboid 1: {year, item name, city}
cuboid 2: {year, brand, country}
cuboid 3: {year, brand, province or state}
cuboid 4: {item name, province or state}, where year = 2010

The question now is: which of these four cuboids should be selected to process the query?

We know that finer-granularity data cannot be generated from coarser-granularity data.
Therefore, cuboid 2 cannot be used because country is a more general concept than province or state.
Cuboids 1, 3, and 4 can be used to process the query because
(1) They have the same set or a superset of the dimensions in the query,
(2) The selection clause in the query can imply the selection in the cuboid, and
(3) The abstraction levels for the item and location dimensions in these cuboids are at a finer level than
brand and province or state, respectively.
Finally, some cost-based estimation is required to decide which of these cuboids should be selected to process the query.


2.4: OLAP Server Architectures: ROLAP versus MOLAP versus HOLAP

Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP
servers include optimization for each DBMS back end, implementation of aggregation navigation
logic, and additional tools and services. ROLAP technology tends to have greater scalability than
MOLAP technology. The DSS server of MicroStrategy, for example, adopts the ROLAP approach.

Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views
through array-based multidimensional storage engines. They map multidimensional views directly to
data cube array structures. The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. Notice that with multidimensional data stores, the storage utilization
may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be
explored.

Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
For example, a HOLAP server may allow large volumes of detailed data to be stored in a relational
database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 2000, for instance, supports a hybrid OLAP server.

Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query
language and query processing support for SQL queries over star and snowflake schemas in a read-
only environment.

How are data actually stored in ROLAP and MOLAP architectures?


ROLAP uses relational tables to store data for online analytical processing. The fact table
associated with a base cuboid is referred to as a base fact table. The base fact table stores data
at the abstraction level indicated by the join keys in the schema for the given data cube.
Aggregated data can also be stored in fact tables, referred to as summary fact tables. Some
summary fact tables store both base fact table data and aggregated data. And MOLAP uses
multidimensional array structures to store data for online analytical processing.


2.5: INTRODUCTION TO DATA MINING


Data mining is a technology that blends traditional data analysis methods with sophisticated algorithms
for processing large volumes of data. It has also opened up exciting opportunities for exploring and
analyzing new types of data and for analyzing old types of data in new ways.

Applications that require new techniques for data analysis.


1) Business: Data mining techniques can be used to support a wide range of business intelligence
applications such as customer profiling, targeted marketing, workflow management, store
layout, and fraud detection.
2) Medicine, Science, and Engineering: Researchers in medicine, science, and engineering are
rapidly accumulating data that is key to important new discoveries.

Applications of data mining


1) Banking: loan/credit card approval: Given a database of 100,000 names, which persons are the
least likely to default on their credit cards?
2) Customer relationship management: Which of my customers are likely to be the most loyal,
and which are most likely to leave for a competitor?
3) Targeted marketing: Identify likely responders to promotions
4) Fraud detection: telecommunications, financial transactions – from an online stream of events, identify fraudulent events
5) Manufacturing and production: Automatically adjust knobs when process parameter changes
6) Medicine: disease outcome, effectiveness of treatments:– Analyze patient disease history: find
relationship between diseases
7) Molecular/Pharmaceutical: Identify new drugs
8) Scientific data analysis: Identify new galaxies by searching for sub clusters
9) Web site/store design and promotion:– Find affinity of visitor to pages and modify layout


2.6: WHAT IS DATA MINING?


Data mining is the process of automatically discovering useful information in large data
repositories. Data mining techniques are deployed to scour large databases in order to find novel
and useful patterns that might otherwise remain unknown. They also provide capabilities to
predict the outcome of a future observation.

(By contrast, looking up individual records or finding particular Web pages via a query to an Internet search engine are tasks related to the area of information retrieval, not data mining.)

Data Mining and Knowledge Discovery:


Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information. This process consists of a series of transformation steps, from data preprocessing to post-processing of data mining results.

Fig 2.5: The Process of KDD.

The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.

The purpose of preprocessing is to transform the raw input data into an appropriate format for
subsequent analysis.

The steps involved in data preprocessing include fusing data from multiple sources, cleaning data
to remove noise and duplicate observations, and selecting records and features that are relevant
to the data mining task at hand.


2.7: MOTIVATING CHALLENGES:


The following are some of the specific challenges that motivated the development of data mining: scalability, high dimensionality, heterogeneous and complex data, data ownership and distribution, and non-traditional analysis.

1) SCALABILITY: Due to advances in technology, data is generated at an exponential rate, producing data sets of gigabytes, terabytes, or even petabytes. To handle such huge data sizes, data mining algorithms must be scalable; many algorithms employ special search strategies to handle exponential search problems. Scalability may also require the implementation of novel data structures, and it can be improved by using sampling or by developing parallel and distributed algorithms.

2) HIGH DIMENSIONALITY: It is now common to encounter data sets with hundreds or thousands of attributes instead of the handful common a few decades ago. Traditional data
analysis techniques that were developed for low-dimensional data often do not work well for
such high dimensional data. Also, for some data analysis algorithms, the computational
complexity increases rapidly as the dimensionality (the number of features) increases.

Example-1:
In bioinformatics, progress in microarray technology has produced gene expression data
involving thousands of features. Data sets with temporal or spatial components also tend to have
high dimensionality.
Example-2:
Consider a data set that contains measurements of temperature at various locations. If the
temperature measurements are taken repeatedly for an extended period, the number of
dimensions (features) increases in proportion to the number of measurements taken.

3) HETEROGENEOUS AND COMPLEX DATA: Traditional data analysis often deals with data sets whose attributes are all of the same type. As the role of data mining in business, science, and other fields has grown, so has the need to handle heterogeneous attributes and more complex, non-traditional data objects. Techniques developed for mining such complex objects should take into consideration relationships in the data, such as temporal and spatial autocorrelation, graph connectivity, and parent-child relationships between the elements in semi-structured text and XML documents.

Examples of such non-traditional types of data:


1) Collections of Web pages containing semi-structured text and hyperlinks;
2) DNA data with sequential and three-dimensional structure;
3) Climate data that consists of time series measurements (temperature, pressure, etc.) at various
locations on the Earth's surface.

4) DATA OWNERSHIP AND DISTRIBUTION: Sometimes, the data needed for an analysis is not stored in one location or owned by one organization. Instead, the data is geographically distributed among resources belonging to multiple entities. This requires the development of distributed data mining techniques. The key challenges faced by distributed data mining algorithms include:
(1) How to reduce the amount of communication needed to perform the distributed computation?
(2) How to effectively consolidate the data mining results obtained from multiple sources?


(3) How to address data security issues?

5) NON-TRADITIONAL ANALYSIS: The traditional statistical approach is based on a


hypothesize-and-test paradigm. In other words, a hypothesis is proposed, an experiment is
designed to gather the data, and then the data is analyzed with respect to the hypothesis.
Unfortunately, this process is extremely labor-intensive. Current data analysis tasks often require
the generation and evaluation of thousands of hypotheses, and consequently, the development of
some data mining techniques has been motivated by the desire to automate the process of
hypothesis generation and evaluation. Furthermore, the data sets analyzed in data mining are
typically not the result of a carefully designed experiment and often represent opportunistic
samples of the data, rather than random samples. Also, the data sets frequently involve non-
traditional types of data and data distributions.

2.8: THE ORIGINS OF DATA MINING


Data mining draws upon ideas as shown in the Fig. 2.6, such as (1) sampling, estimation, and
hypothesis testing from statistics and (2) search algorithms, modeling techniques, and learning
theories from artificial intelligence, pattern recognition, and machine learning. Data mining has
also been quick to adopt ideas from other areas, including optimization, evolutionary computing,
information theory, signal processing, visualization, and information retrieval.
Database systems are also needed to provide support for efficient storage, indexing, and query
processing. Techniques from high performance (parallel) computing are often important in
addressing the massive size of some data sets. Distributed techniques can also help address the
issue of size and are essential when the data cannot be gathered in one location.

Fig: 2.6 Data mining as a confluence of many disciplines


2.9: DATA MINING TASKS:


Data mining tasks are generally divided into two major categories:

Predictive tasks: The objective of these tasks is to predict the value of a particular attribute
based on the values of other attributes. The attribute to be predicted is commonly known as the
target or dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.

Descriptive tasks: The objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data mining
tasks are often exploratory in nature and frequently require post-processing techniques to
validate and explain the results.

Fig 2.7: Four core data mining tasks

Four of the core data mining tasks are

1) Predictive modeling refers to the task of building a model for the target variable as a function
of the explanatory variables. There are two types of predictive modeling tasks: classification,
which is used for discrete target variables, and regression, which is used for continuous target
variables. The goal of both tasks is to learn a model that minimizes the error between the
predicted and true values of the target variable.

2) Association analysis is used to discover patterns that describe strongly associated features in
the data. The discovered patterns are typically represented in the form of implication rules or
feature subsets. Because of the exponential size of its search space, the goal of association
analysis is to extract the most interesting patterns in an efficient manner.

3) Cluster analysis seeks to find groups of closely related observations so that observations that
belong to the same cluster are more similar to each other than observations that belong to other
clusters.

4) Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are known as anomalies or
outliers. The goal of an anomaly detection algorithm is to discover the real anomalies and avoid
falsely labeling normal objects as anomalous. In other words, a good anomaly detector must have
a high detection rate and a low false alarm rate. Applications of anomaly detection include the
detection of fraud, network intrusions, unusual patterns of disease, and ecosystem disturbances.

2.10 DATA: THE TYPES OF DATA:


The attributes used to describe data objects can be of different types (quantitative or qualitative), and data sets may have special characteristics. A data set can often be viewed as a collection of
data objects. Other names for a data object are record, point, vector, pattern, event, case, sample,
observation, or entity. In turn, data objects are described by a number of attributes.

Definition of Attribute: An attribute is a property or characteristic of an object that may vary, either from one object to another or from one time to another.

The following properties (operations) of numbers are typically used to describe attributes:

1. Distinctness: = and ≠
2. Order: <, <=, >, and >=
3. Addition: + and -
4. Multiplication: * and /

Based on above properties we can define four types of attributes:


1) Nominal, 2) Ordinal, 3) Interval, and 4) Ratio.


Table 2.10.1: Different types of attributes.

Categorical (Qualitative) attributes:

- Nominal: The values of a nominal attribute are just different names; i.e., nominal values provide only enough information to distinguish one object from another (=, ≠).
  Examples: zip codes, employee ID numbers, eye color, gender.
  Operations: mode, entropy, contingency correlation, chi-square test.

- Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >).
  Examples: hardness of minerals, {good, better, best}, grades, street numbers.
  Operations: median, percentiles, rank correlation, run tests, sign tests.

Numeric (Quantitative) attributes:

- Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, -).
  Examples: calendar dates, temperature in Celsius or Fahrenheit.
  Operations: mean, standard deviation, Pearson's correlation, t and F tests.

- Ratio: For ratio attributes, both differences and ratios are meaningful (*, /).
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current.
  Operations: geometric mean, harmonic mean, percent variation.

Nominal and ordinal attributes: These are collectively referred to as categorical or qualitative attributes. Qualitative attributes, such as employee ID, lack most of the properties of numbers.

Interval and ratio: These are collectively referred to as quantitative or numeric attributes.
Quantitative attributes are represented by numbers and have most of the properties of numbers.
Note that quantitative attributes can be integer-valued or continuous.

The types of attributes can also be described in terms of transformations that do not change the
meaning of an attribute.

Table 2.10.2: Transformations that define attribute levels.

- Nominal (categorical/qualitative): Any one-to-one mapping, e.g., a permutation of values. Comment: If all employee ID numbers are reassigned, it will not make any difference.
- Ordinal (categorical/qualitative): An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function. Comment: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.
- Interval (numeric/quantitative): new_value = a * old_value + b, where a and b are constants. Comment: The Fahrenheit and Celsius temperature scales differ in the location of their zero value and the size of a degree (unit).
- Ratio (numeric/quantitative): new_value = a * old_value. Comment: Length can be measured in meters or feet.

Table 2.10.2 above shows the permissible (meaning-preserving) transformations for the four attribute types of Table 2.10.1.


Describing Attributes by the Number of Values


Discrete: A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical, such as zip codes or ID numbers, or numeric, such as counts. Discrete attributes are often represented using integer variables. Binary attributes are a special case of discrete attributes and assume only two values, e.g., true/false, yes/no, male/female, or 0/1. Binary attributes are often represented as Boolean variables, or as integer variables that only take the values 0 or 1.

Continuous: A continuous attribute is one whose values are real numbers. Examples include
attributes such as temperature, height, or weight. Continuous attributes are typically represented
as floating-point variables. Practically, real values can only be measured and represented with
limited precision.

Asymmetric Attributes: For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important. Binary attributes where only non-zero values are important are called asymmetric binary attributes. This type of attribute is particularly important for association analysis. For instance, if the number of credits associated with each course is recorded, then the resulting data set will consist of asymmetric discrete or continuous attributes.

2.10.1: TYPES OF DATA SETS:

There are many types of data sets, and as the field of data mining develops and matures, a greater variety of data sets become available for analysis.
The types of data sets are grouped into three groups: 1) record data, 2) graph-based data, and 3) ordered data.
In general, there are three characteristics that apply to many data sets and have a significant impact on the data mining techniques that are used: 1) dimensionality, 2) sparsity, and 3) resolution.
1) Dimensionality: The dimensionality of a data set is the number of attributes that the objects in the data set possess.

2) Sparsity: For some data sets, such as those with asymmetric features, most attributes of
an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In
practical terms, sparsity is an advantage because usually only the non-zero values need to
be stored and manipulated. This results in significant savings with respect to computation
time and storage.

3) Resolution: It is frequently possible to obtain data at different levels of resolution, and often the properties of the data are different at different resolutions.


The types of data sets are grouped into three groups:


I) Record data, II) Graph based data, and III) Ordered data.

I) Record Data
Much data mining work assumes that the data set is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes). Record data is usually stored either in flat
files or in relational databases.
Different variation of record data

1) Transaction or Market Basket Data: Transaction data is a special type of record data,
where each record (transaction) involves a set of items. This type of data is called market
basket data because the items in each record are the products in a person's "market
basket." Transaction data is a collection of sets of items, but it can be viewed as a set of
records whose fields are asymmetric attributes.

2) The Data Matrix: If the data objects in a collection of data all have the same fixed set of
numeric attributes, then the data objects can be thought of as points (vectors) in a
multidimensional space, where each dimension represents a distinct attribute describing
the object. A set of such data objects can be interpreted as an m by n matrix, where there
are m rows, one for each object, and n columns, one for each attribute. This matrix is
called a data matrix or a pattern matrix.

3) The Sparse Data Matrix: A sparse data matrix is a special case of a data matrix in which the attributes are of the same type and are asymmetric; i.e., only non-zero values are important. Transaction data is an example of a sparse data matrix that has only 0-1 entries. Another common example is document data. In particular, if the order of the terms (words) in a document is ignored, then a document can be represented as a term vector, where each term is a component (attribute) of the vector and the value of each component is the number of times the corresponding term occurs in the document. This representation of a collection of documents is often called a document-term matrix.

II) Graph-Based Data:


A graph can sometimes be a convenient and powerful representation for data. We consider two
specific cases: (1) the graph captures relationships among data objects and (2) the data objects
themselves are represented as graphs.

1) Data with Relationships among Objects: The relationships among objects frequently
convey important information. In such cases, the data is often represented as a graph. In
particular, the data objects are mapped to nodes of the graph, while the relationships
among objects are captured by the links between objects and link properties, such as
direction and weight. Ex: Web pages on the World Wide Web, which contain both text
and links to other pages.

2) Data with Objects That Are Graphs: If objects have structure, that is, the objects contain sub-objects that have relationships, then such objects are frequently represented as


graphs. Ex: the structure of chemical compounds can be represented by a graph, where
the nodes are atoms and the links between nodes are chemical bonds.

III) Ordered Data:


For some types of data, the attributes have relationships that involve order in time or space.
The different types of ordered data are: sequential data (temporal data), sequence data, time series data, and spatial data.

1) Sequential Data: Also referred to as temporal data, it can be thought of as an extension of record data, where each record has a time associated with it. Ex: each record could be
the purchase history of a customer, with a listing of items purchased at different times.

2) Sequence Data: It consists of a data set that is a sequence of individual entities, such as a
sequence of words or letters. It is quite similar to sequential data, except that there are no
time stamps; instead, there are positions in an ordered sequence. For example, the genetic
information of plants and animals can be represented in the form of sequences of
nucleotides that are known as genes.

3) Time Series Data: It is a special type of sequential data in which each record is a time
series, i.e., a series of measurements taken over time. For example, a financial data set
might contain objects that are time series of the daily prices of various stocks.

4) Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as
other types of attributes. An example of spatial data is weather data (precipitation,
temperature, pressure) that is collected for a variety of geographical locations. An
important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically
close tend to be similar in other ways as well.


2.11: DATA QUALITY:


Data mining applications are often applied to data that was collected for another purpose, or for future but unspecified applications. For that reason, data mining cannot usually take advantage of the significant benefits of "addressing quality issues at the source".

Since preventing data quality problems is typically not an option, data mining focuses on:
(1) the detection and correction of data quality problems, and
(2) the use of algorithms that can tolerate poor data quality.

Examples of data quality problems: 1) noise and outliers, 2) missing values, and 3) duplicate data.

The first step, detection and correction, is often called data cleaning. The following sections
discuss specific aspects of data quality.

Measurement and Data Collection Issues:


It is unrealistic to expect that data will be perfect. There may be problems due to human error,
limitations of measuring devices, or flaws in the data collection process. Values or even entire
data objects may be missing.
In this context, we need to deal with 1) a variety of problems that involve measurement error: noise, artifacts, bias, precision, and accuracy; and 2) data quality issues that may involve both measurement and data collection problems: outliers, missing and inconsistent values, and duplicate data.
1) Noise: Noise refers to the modification of original values.
Ex: distortion of a person's voice when talking on a poor phone, and "snow" on a television screen.
2) Outliers: Outliers are data objects with characteristics that are considerably different than
most of the other data objects in the data set.
3) Missing Values: The reasons for missing values are
i) Information is not collected (e.g., people decline to give their age and weight)
ii) Attributes may not be applicable to all cases (e.g., annual income is not applicable to
children).
4) Handling missing values:
i) Eliminate Data Objects
ii) Estimate Missing Values
iii) Ignore the Missing Value During Analysis
iv) Replace with all possible values (weighted by their probabilities).
5) Duplicate Data:
i) Data set may include data objects that are duplicates, or almost duplicates of one
another
ii) Major issue when merging data from heterogeneous sources. Ex: Same person with
multiple email addresses
6) Data cleaning: Process of dealing with duplicate data issues.
7) Precision: The closeness of repeated measurements (of the same quantity) to one another.
8) Bias: A systematic variation of measurements from the quantity being measured.
9) Accuracy: The closeness of measurements to the true value of the quantity being measured.

2.12: DATA PREPROCESSING

Data preprocessing is a broad area and consists of a number of different strategies and techniques
that are interrelated in complex ways. Preprocessing steps should be applied to make the data
more suitable for data mining.
The most important approaches and ideas are:
1) Aggregation
2) Sampling
3) Dimensionality reduction
4) Feature subset selection
5) Feature creation
6) Discretization and Binarization
7) Variable transformation

Aggregation: Combining two or more attributes (or objects) into a single attribute (or object).
Purpose: a) Data reduction
b) Reduce the number of attributes or objects
c) Change of scale, e.g., cities aggregated into regions, states, countries, etc.
d) More "stable" data: aggregated data tends to have less variability

Sampling: Sampling is the main technique employed for data selection. It is often used for both
the preliminary investigation of the data and the final data analysis. Statisticians sample because
obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used
in data mining because processing the entire set of data of interest is too expensive or time
consuming.

The key principle for effective sampling is the following:


a) using a sample will work almost as well as using the entire data sets, if the sample is
representative
b) A sample is representative if it has approximately the same property (of interest) as the
original set of data

Types of Sampling
a) Simple Random Sampling: There is an equal probability of selecting any particular item
b) Sampling without replacement: As each item is selected, it is removed from the
population
c) Sampling with replacement: Objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked up
more than once.
d) Stratified sampling: Split the data into several partitions; then draw random samples from
each partition

Dimensionality Reduction:
Purpose: a) Avoid the curse of dimensionality
b) Reduce the amount of time and memory required by data mining algorithms
c) Allow data to be more easily visualized
d) May help to eliminate irrelevant features or reduce noise

Meaning of the Curse of Dimensionality: the curse of dimensionality refers to the phenomenon
that many types of data analysis become significantly harder as the dimensionality of the data
increases. Specifically, as dimensionality increases, the data becomes increasingly sparse in the
space that it occupies.

Feature Subset Selection:


a) Another way to reduce dimensionality of data
b) Redundant features: Duplicate much or all of the information contained in one or more
other attributes. Ex: Purchase price of a product and the amount of sales tax paid
c) Irrelevant features: Contain no information that is useful for the data mining task at hand.
Ex: students' ID is often irrelevant to the task of predicting students' GPA

Techniques of Feature Subset Selection:


a) Brute-force approach: Try all possible feature subsets as input to the data mining algorithm
b) Embedded approaches: Feature selection occurs naturally as part of the data mining
algorithm
c) Filter approaches: Features are selected before data mining algorithm is run
d) Wrapper approaches: Use the data mining algorithm as a black box to find best subset
of attributes.


Feature Creation: Create new attributes that can capture the important information in a data set
much more efficiently than the original attributes

Three general methodologies:

a) Feature extraction (often highly domain-specific)
b) Mapping data to a new space
c) Feature construction (e.g., combining features)

Discretization and Binarization: Some data mining algorithms require that the data be in the
form of categorical attributes. Algorithms that find association patterns require that the data be in
the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a
categorical attribute (discretization). Both continuous and discrete attributes may need to be
transformed into one or more binary attributes (binarization).

Attribute Transformation: A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified with one of the new
values.


2.13: MEASURES OF SIMILARITY AND DISSIMILARITY

Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as clustering, nearest-neighbor classification, and anomaly detection. In many cases, the initial data set is not needed once these similarities or dissimilarities have been computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity) space and then performing the analysis.

PROXIMITY: The term proximity is used to refer to either similarity or dissimilarity. The proximity between two objects is a function of the proximity between the corresponding attributes of the two objects.
SIMILARITY: “The similarity between two objects is a numerical measure of the degree to
which the two objects are alike”. Consequently, similarities are higher for pairs of objects that
are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and
1 (complete similarity).

DISSIMILARITY: "The dissimilarity between two objects is a numerical measure of the degree to which the two objects are different". Dissimilarities are lower for more similar pairs of objects. Frequently, the term distance is used as a synonym for dissimilarity. Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to infinity.

TRANSFORMATIONS: Transformations are often applied to convert a similarity to a dissimilarity, or vice versa, or to transform a proximity measure to fall within a particular range, such as [0, 1].

SIMILARITY/DISSIMILARITY FOR SIMPLE ATTRIBUTES

Let p and q be the attribute values for two data objects. The standard measures, by attribute type, are:
- Nominal: d = 0 if p = q, and d = 1 if p ≠ q; s = 1 if p = q, and s = 0 if p ≠ q.
- Ordinal: d = |p - q| / (n - 1), where the values are mapped to integers 0 to n - 1 and n is the number of values; s = 1 - d.
- Interval or Ratio: d = |p - q|; s = -d, s = 1/(1 + d), or s = 1 - (d - min_d)/(max_d - min_d).


DISSIMILARITIES BETWEEN DATA OBJECTS

Euclidean Distance: The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space, is given by the following familiar formula:

d(x, y) = sqrt( sum_{k=1..n} (x_k - y_k)^2 )

where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.


Minkowski Distance: It is a generalization of the Euclidean distance:

d(x, y) = ( sum_{k=1..n} |x_k - y_k|^r )^(1/r)

where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are, respectively, the kth attributes (components) of data objects x and y.

The following are the three most common examples of Minkowski distances.
1) r = 1. City block (Manhattan, taxicab, L1 norm) distance: A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.
2) r = 2. Euclidean distance (L2 norm).
3) r -> infinity. "Supremum" (L_max norm, L_infinity norm) distance: This is the maximum difference between any component of the vectors.


Common Properties of a Distance:


a) Distances, such as the Euclidean distance, have some well-known properties.
b) If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) >= 0 for all x and y,
(b) d(x, y) = 0 only if x = y

2. Symmetry
d(x,y)= d(y,x) for all x and y.

3. Triangle Inequality
d(x,z) <= d(x, y) + d(y, z) for all points x, y, and z.

Measures that satisfy all three properties are known as metrics.

Common Properties of a Similarity


If s(x, y) is the similarity between points x and y, then the typical properties of similarities are
the following:
1. s(x,y) = 1 only if x= y. (0 <= s <= 1)
2. s(x,y)= s(y, x) for all x and y. (Symmetry)

Similarity Measures for Binary Data:


Let x and y be two objects that consist of n binary attributes. The comparison of two such objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1

Simple Matching Coefficient (SMC):

It is a commonly used similarity coefficient, defined as follows:

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)


Jaccard Coefficient (J): It is frequently used to handle objects consisting of asymmetric binary attributes:

J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)

Cosine Similarity: If x and y are two document vectors, then

cos(x, y) = (x . y) / (||x|| ||y||)

where . indicates the vector dot product and ||x|| is the length (Euclidean norm) of vector x.


Extended Jaccard Coefficient (Tanimoto Coefficient): It can be used for document data, and it reduces to the Jaccard coefficient in the case of binary attributes:

EJ(x, y) = (x . y) / (||x||^2 + ||y||^2 - x . y)

Correlation: The correlation between two data objects that have binary or continuous attributes is a measure of the linear relationship between the attributes of the objects. Pearson's correlation coefficient between two data objects, x and y, is defined as:

corr(x, y) = covariance(x, y) / (standard_deviation(x) * standard_deviation(y)) = s_xy / (s_x * s_y)


ASSIGNMENT - II
Sem: VI Sub: Data Mining and Data Warehousing Sub. Code: 15CS651
Max. Marks : 25 Mapped CO: C320.2 Date: 11-03-2019

Module – II
Each question carries 5 marks; the RBT (Revised Bloom's Taxonomy) level is given in parentheses.

BATCH 1
1. Write a short note on: i) ROLAP ii) MOLAP iii) HOLAP. OR Describe the servers involved in the implementation of a warehouse server. (5 marks, L2)
2. What is Data Mining? Explain KDD with a neat diagram. (5 marks, L2)
3. How are data cubes computed efficiently using the 'compute cube' operator and data cube materialization? (5 marks, L1)
4. Explain briefly the properties that should hold for distance measures, and explain 1) Euclidean distance, 2) Minkowski distance. (5 marks, L2)
5. What are the important approaches of data preprocessing? Explain. OR Define data preprocessing. List the different strategies involved in it. (5 marks, L2)

BATCH 2
1. Explain the different types of data sets with examples. (5 marks, L2)
2. Explain: i) Partial materialization ii) Efficient processing of OLAP queries. (5 marks, L2)
3. How are data mining tasks categorized? Explain. (5 marks, L2)
4. Explain Pearson's correlation coefficient. (5 marks, L2)
5. List and explain the different types of similarity measures for binary data. (5 marks, L2)

BATCH 3
1. What are the challenges that motivate the use of data mining? Explain data mining tasks and applications in detail. (5 marks, L2)
2. Explain indexing of OLAP data with respect to the Bitmap Index and the Join Index. (5 marks, L2)
3. List and explain the four types of data attributes, with suitable examples. (5 marks, L2)
4. What is sampling? Explain the different types of sampling approaches. (5 marks, L2)
5. Explain the Simple Matching and Jaccard coefficients with examples. (5 marks, L2)

Last Date of Submission: 18-03-2019

Course Coordinator Module Coordinator HOD


Mrs. Shilpa B. Hosagoudra Mr. S. V. Manjaragi Dr. Parashuram Baraki

