Module – 2 8 Hours
Introduction to data mining: What is data mining, Challenges, Data Mining Tasks, Data: Types of
Data, Data Quality, Data Preprocessing, Measures of Similarity and Dissimilarity,
Dept. of CSE, HSIT Nidasoshi, Taq: Hukkeri, Dist: Belagavi, Karnataka - 591 236
Prepared by: Prof Shilpa B. Hosagoudra, mail-id : shilpahosgoudar.cse@hsit.ac.in
S J P N Trust's CSE Dept.
Hirasugar Institute of Technology, Nidasoshi. DMDW
Inculcating Values, Promoting Prosperity Module-II Notes
Approved by AICTE, Recognized by Govt. of Karnataka, Affiliated to VTU Belagavi,
Accredited at “A” Grade by NAAC and Recognized Under Section 2(f) of UGC Act, 1956 2018-19 (EVEN)
At the core of multidimensional data analysis is the efficient computation of aggregations across many
sets of dimensions. In SQL terms, these aggregations are referred to as group-by’s. Each group-by can
be represented by a cuboid, where the set of group-by’s forms a lattice of cuboids defining a data cube.
What is the total number of cuboids, or group-by’s, that can be computed for this
data cube?
Assume there are three attributes, city, item, and year, as the dimensions of the data cube, and sales in
dollars as the measure. The total number of cuboids, or group-by’s, that can be computed for this data
cube is 2^3 = 8.
The possible group-by’s are the following: {(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()}, where () means that the group-by is empty.
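The enumeration above can be reproduced with a few lines of Python. This is only an illustrative sketch; the function name `enumerate_cuboids` is made up for this example and is not part of any OLAP product.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Return every subset of the dimensions, from the base cuboid down to the apex ()."""
    cuboids = []
    for size in range(len(dimensions), -1, -1):
        cuboids.extend(combinations(dimensions, size))
    return cuboids

cuboids = enumerate_cuboids(["city", "item", "year"])
for cuboid in cuboids:
    print(cuboid)
print(len(cuboids))  # 2**3 = 8
```

The first tuple printed is the base cuboid (city, item, year) and the last is the empty group-by ().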
Fig: 2.1
In Fig. 2.1, a data cube is created for AllElectronics sales, with city, item, and year as the
dimensions and sales in dollars as the measure.
The base cuboid contains all three dimensions, city, item, and year. It can return the total sales for any
combination of the three dimensions.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty. It contains the total
sum of all sales.
The base cuboid is the least generalized (most specific) of the cuboids. The apex cuboid is the most
generalized (least specific) of the cuboids, and is often denoted as all. If we start at the apex cuboid
and explore downward in the lattice, this is equivalent to drilling down within the data cube. If we start
at the base cuboid and explore upward, this is akin to rolling up.
The cube operator is the n-dimensional generalization of the group-by operator. For a cube with n
dimensions, there are a total of 2^n cuboids, including the base cuboid.
If we write a statement such as “compute cube sales_cube”, it explicitly instructs the system
to compute the sales aggregate cuboids for all eight subsets of the set {city, item, year}, including the
empty subset.
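A minimal sketch of what computing the full cube means is given below. The fact rows, city names, and the function `compute_cube` are hypothetical; a real OLAP engine would share work between cuboids instead of rescanning the rows for every group-by.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (city, item, year, sales in dollars).
rows = [
    ("Vancouver", "TV", 2018, 400.0),
    ("Vancouver", "phone", 2018, 150.0),
    ("Toronto", "TV", 2018, 300.0),
    ("Toronto", "TV", 2019, 250.0),
]
DIMENSIONS = ("city", "item", "year")

def compute_cube(rows, dimensions):
    """Sum the measure for every group-by, i.e. every subset of the dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group_by in combinations(range(len(dimensions)), size):
            totals = defaultdict(float)
            for row in rows:
                key = tuple(row[i] for i in group_by)
                totals[key] += row[-1]  # the last field is the measure
            names = tuple(dimensions[i] for i in group_by)
            cube[names] = dict(totals)
    return cube

cube = compute_cube(rows, DIMENSIONS)
print(len(cube))                      # number of cuboids: 8
print(cube[()][()])                   # apex cuboid: total of all sales
print(cube[("city",)][("Toronto",)])  # sales rolled up to one city
```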
Advantages of pre-computing the cube: Pre-computation leads to fast response time and avoids some
redundant computation. Most, if not all, OLAP products resort to some degree of pre-computation of
multidimensional aggregates.
Ex: If the cube has 10 dimensions and each dimension has five levels (including all), the total number
of cuboids that can be generated is 5^10 ≈ 9.8 × 10^6. The size of each cuboid also depends on the
cardinality (i.e., number of distinct values) of each dimension.
As the number of dimensions, number of conceptual hierarchies, or cardinality increases, the
storage space required for many of the group-by’s will grossly exceed the (fixed) size of the input
relation.
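The cuboid count in the example can be checked directly. In this sketch each entry of the list is the number of levels of one dimension, counting the top level "all"; the function name is hypothetical.

```python
from math import prod

def total_cuboids(levels_per_dimension):
    """Number of cuboids when dimension i has levels_per_dimension[i] levels,
    with the top level 'all' counted as one of them."""
    return prod(levels_per_dimension)

print(total_cuboids([2, 2, 2]))  # 8: no hierarchies, each dimension is either kept or 'all'
print(total_cuboids([5] * 10))   # 9765625, about 9.8 x 10^6
```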
If there are many cuboids, and these cuboids are large in size, a more reasonable option is partial
materialization; that is, to materialize only some of the possible cuboids that can be generated.
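There are three choices for data cube materialization given a base cuboid:
1) No materialization: do not precompute any of the non-base cuboids.
2) Full materialization: precompute all of the cuboids.
3) Partial materialization: selectively compute a proper subset of the possible cuboids.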
Alternative Strategy of Materialization: Several OLAP products have adopted heuristic approaches
for cuboid and subcube selection. A popular approach is to materialize the set of cuboids on which other
frequently referenced cuboids are based.
ICEBERG CUBE: It is a data cube that stores only those cube cells with an aggregate value (e.g.,
count) that is above some minimum support threshold.
SHELL CUBE: This involves precomputing the cuboids for only a small number of dimensions (e.g.,
three to five) of a data cube.
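The iceberg condition described above can be sketched as a filter on cell counts. The rows and the function `iceberg_cells` are hypothetical, and the threshold is applied here as "count at least min_support", which is one common reading of the definition. This naive version counts every cell first and filters afterwards; practical iceberg algorithms prune while computing.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows with dimensions (city, item, year).
rows = [
    ("Vancouver", "TV", 2018),
    ("Vancouver", "TV", 2018),
    ("Vancouver", "phone", 2018),
    ("Toronto", "TV", 2019),
]

def iceberg_cells(rows, n_dims, min_support):
    """Keep only the cube cells whose count is at least min_support."""
    cells = Counter()
    for row in rows:
        for size in range(n_dims + 1):
            for dims in combinations(range(n_dims), size):
                # A cell is identified by the dimensions it fixes and their values.
                cells[(dims, tuple(row[i] for i in dims))] += 1
    return {cell: count for cell, count in cells.items() if count >= min_support}

iceberg = iceberg_cells(rows, n_dims=3, min_support=2)
for (dims, values), count in sorted(iceberg.items()):
    print(dims, values, count)
```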
Bitmap Index:
The bitmap indexing method is popular in OLAP products because it allows quick searching in data
cubes. The bitmap index is an alternative representation of the record ID (RID) list. In the bitmap
index for a given attribute, there is a distinct bit vector, Bv, for each value v in the attribute’s domain.
If a given attribute’s domain consists of n values, then n bits are needed for each entry in the bitmap
index (i.e., there are n bit vectors). If the attribute has the value v for a given row in the data table, then
the bit representing that value is set to 1 in the corresponding row of the bitmap index. All other bits
for that row are set to 0.
Join Index:
The join indexing method gained popularity from its use in relational database query processing.
Traditional indexing maps the value in a given column to a list of rows having that value. In contrast,
join indexing registers the joinable rows of two relations from a relational database.
(Ex: If two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record
contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations,
respectively).
Example on Bitmap indexing. In the AllElectronics data warehouse, suppose the dimension item
at the top level has four values (representing item types): “home entertainment,” “computer,”
“phone,” and “security.” Each value (e.g., “computer”) is represented by a bit vector in the item
bitmap index table. Suppose that the cube is stored as a relation table with 100,000 rows. Because the
domain of item consists of four values, the bitmap index table requires four bit vectors (or lists), each
with 100,000 bits. Figure 2.2 shows a base (data) table containing the dimensions item and city, and its
mapping to bitmap index tables for each of the dimensions.
Fig: 2.2
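A bitmap index of this kind can be sketched in Python as follows. The six-row item and city columns are hypothetical (the example table has 100,000 rows), and `build_bitmap_index` is a name chosen for this illustration.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bit vector with one bit per row."""
    index = {value: [0] * len(column) for value in set(column)}
    for row_id, value in enumerate(column):
        index[value][row_id] = 1
    return index

# Hypothetical columns of a six-row base table.
items = ["home entertainment", "computer", "phone", "security",
         "home entertainment", "computer"]
cities = ["Vancouver", "Vancouver", "Toronto", "Toronto", "Vancouver", "Toronto"]

item_index = build_bitmap_index(items)
city_index = build_bitmap_index(cities)

print(item_index["computer"])  # [0, 1, 0, 0, 0, 1]

# Rows where item = "computer" AND city = "Vancouver": a bitwise AND of two vectors.
matches = [a & b for a, b in zip(item_index["computer"], city_index["Vancouver"])]
print(matches)  # [0, 1, 0, 0, 0, 0]
```

Because selections reduce to bit operations on the vectors, comparisons and joins on low-cardinality attributes become cheap, which is the reason the method suits OLAP workloads.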
Example on Join indexing: Consider the star schema for AllElectronics of the form “sales_star
[time, item, branch, location]: dollars_sold = sum(sales_in_dollars).” An example of a join index
relationship between the sales fact table and the location and item dimension tables is shown in Figure
2.3. For example, the “Main Street” value in the location dimension table joins with tuples T57, T238,
and T884 of the sales fact table. Similarly, the “Sony-TV” value in the item dimension table joins with
tuples T57 and T459 of the sales fact table. The corresponding join index tables are shown in Figure
2.4.
Fig: 2.3
Join index tables based on the linkages between the sales fact table and the location and item
dimension tables shown in Fig: 2.3
Fig 2.4
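The linkages named in this example can be turned into join index tables with a short sketch. Only the pairs stated in the text are used; the variable and function names are hypothetical.

```python
from collections import defaultdict

# (dimension value, fact tuple) linkages taken from the example above.
location_sales = [("Main Street", "T57"), ("Main Street", "T238"), ("Main Street", "T884")]
item_sales = [("Sony-TV", "T57"), ("Sony-TV", "T459")]

def build_join_index(pairs):
    """Group fact-table tuple identifiers by the dimension value they join with."""
    index = defaultdict(list)
    for dimension_value, fact_id in pairs:
        index[dimension_value].append(fact_id)
    return dict(index)

location_index = build_join_index(location_sales)
item_index = build_join_index(item_sales)

# Fact tuples that join with both "Main Street" and "Sony-TV".
both = sorted(set(location_index["Main Street"]) & set(item_index["Sony-TV"]))
print(both)  # ['T57']
```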
The purpose of materializing cuboids and constructing OLAP index structures is to speed up query
processing in data cubes. Given materialized views, query processing should proceed as follows:
Relational OLAP (ROLAP) servers: These are the intermediate servers that stand in between a
relational back-end server and client front-end tools. They use a relational or extended-relational
DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces. ROLAP
servers include optimization for each DBMS back end, implementation of aggregation navigation
logic, and additional tools and services. ROLAP technology tends to have greater scalability than
MOLAP technology. The DSS server of MicroStrategy, for example, adopts the ROLAP approach.
Multidimensional OLAP (MOLAP) servers: These servers support multidimensional data views
through array-based multidimensional storage engines. They map multidimensional views directly to
data cube array structures. The advantage of using a data cube is that it allows fast indexing to
precomputed summarized data. Notice that with multidimensional data stores, the storage utilization
may be low if the data set is sparse. In such cases, sparse matrix compression techniques should be
explored.
Hybrid OLAP (HOLAP) servers: The hybrid OLAP approach combines ROLAP and MOLAP
technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP.
For example, a HOLAP server may allow large volumes of detailed data to be stored in a relational
database, while aggregations are kept in a separate MOLAP store. The Microsoft SQL Server 2000
supports a hybrid OLAP server.
Specialized SQL servers: To meet the growing demand of OLAP processing in relational databases,
some database system vendors implement specialized SQL servers that provide advanced query
language and query processing support for SQL queries over star and snowflake schemas in a read-
only environment.
Ex: Finding particular Web pages via a query to an Internet search engine is a task related to the
area of information retrieval.
The input data can be stored in a variety of formats (flat files, spreadsheets, or relational tables)
and may reside in a centralized data repository or be distributed across multiple sites.
The purpose of preprocessing is to transform the raw input data into an appropriate format for
subsequent analysis.
The steps involved in data preprocessing include fusing data from multiple sources, cleaning data
to remove noise and duplicate observations, and selecting records and features that are relevant
to the data mining task at hand.
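The preprocessing steps listed above (fusing sources, cleaning, selecting features) can be sketched as follows. The two sources, the field names, and the noise rule in `is_plausible` are all hypothetical and only serve to make the steps concrete.

```python
# Two hypothetical sources holding records about the same kind of object.
site_a = [
    {"id": 1, "age": 34, "income": 52000, "comment": "ok"},
    {"id": 2, "age": -5, "income": 48000, "comment": "typo"},
]
site_b = [
    {"id": 1, "age": 34, "income": 52000, "comment": "ok"},
    {"id": 3, "age": 45, "income": 61000, "comment": ""},
]

def is_plausible(record):
    """Hypothetical noise rule: ages outside 0..120 are treated as noise."""
    return 0 <= record["age"] <= 120

def preprocess(sources, features):
    # Step 1: fuse the data from all sources into one collection.
    fused = [record for source in sources for record in source]
    # Step 2: clean it by dropping noisy records and duplicate observations.
    seen = set()
    cleaned = []
    for record in fused:
        key = tuple(sorted(record.items()))
        if key in seen or not is_plausible(record):
            continue
        seen.add(key)
        cleaned.append(record)
    # Step 3: keep only the features relevant to the mining task.
    return [{name: record[name] for name in features} for record in cleaned]

prepared = preprocess([site_a, site_b], features=["age", "income"])
print(prepared)
```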
1) SCALABILITY: Because of advances in data generation and collection, data sets with sizes of
gigabytes, terabytes, or even petabytes are becoming common. To handle such massive data sets,
data mining algorithms must be scalable. Many algorithms employ special search strategies to
handle exponential search problems. Scalability may also require the implementation of novel
data structures, and it can be improved by using sampling or by developing parallel and
distributed algorithms.
Example-1:
In bioinformatics, progress in microarray technology has produced gene expression data
involving thousands of features. Data sets with temporal or spatial components also tend to have
high dimensionality.
Example-2:
Consider a data set that contains measurements of temperature at various locations. If the
temperature measurements are taken repeatedly for an extended period, the number of
dimensions (features) increases in proportion to the number of measurements taken.
4) DATA OWNERSHIP AND DISTRIBUTION: Sometimes, the data needed for an analysis
is not stored in one location or owned by one organization. Instead, the data is geographically
distributed among resources belonging to multiple entities. This requires the development of
distributed data mining techniques. The key challenges faced by distributed data mining
algorithms include:
(1) How to reduce the amount of communication needed to perform the distributed computation?
(2) How to effectively consolidate the data mining results obtained from multiple sources?
Predictive tasks: The objective of these tasks is to predict the value of a particular attribute
based on the values of other attributes. The attribute to be predicted is commonly known as the
target or dependent variable, while the attributes used for making the prediction are known as the
explanatory or independent variables.
Descriptive tasks: The objective is to derive patterns (correlations, trends, clusters, trajectories,
and anomalies) that summarize the underlying relationships in data. Descriptive data mining
tasks are often exploratory in nature and frequently require post-processing techniques to
validate and explain the results.
1) Predictive modeling refers to the task of building a model for the target variable as a function
of the explanatory variables. There are two types of predictive modeling tasks: classification,
which is used for discrete target variables, and regression, which is used for continuous target
variables. The goal of both tasks is to learn a model that minimizes the error between the
predicted and true values of the target variable.
2) Association analysis is used to discover patterns that describe strongly associated features in
the data. The discovered patterns are typically represented in the form of implication rules or
feature subsets. Because of the exponential size of its search space, the goal of association
analysis is to extract the most interesting patterns in an efficient manner.
3) Cluster analysis seeks to find groups of closely related observations so that observations that
belong to the same cluster are more similar to each other than observations that belong to other
clusters.
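To make the association analysis task (item 2 above) concrete, the sketch below counts how often pairs of items occur together and keeps the pairs that reach a minimum count. The transactions, the threshold, and the function name are hypothetical. Checking every pair is only feasible for tiny data, since the number of candidate item sets grows exponentially, which is the efficiency concern mentioned above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

def frequent_pairs(transactions, min_count):
    """Return the item pairs that occur together in at least min_count transactions."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_count}

pairs = frequent_pairs(transactions, min_count=3)
print(pairs)  # {('beer', 'diapers'): 3}
```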
The following properties (operations) of numbers are typically used to describe attributes.
Nominal and ordinal attributes: These are collectively referred to as categorical or qualitative
attributes. Some qualitative attributes, such as employee ID, lack most of the properties of
numbers.
Interval and ratio: These are collectively referred to as quantitative or numeric attributes.
Quantitative attributes are represented by numbers and have most of the properties of numbers.
Note that quantitative attributes can be integer-valued or continuous.
The types of attributes can also be described in terms of transformations that do not change the
meaning of an attribute.
Continuous: A continuous attribute is one whose values are real numbers. Examples include
attributes such as temperature, height, or weight. Continuous attributes are typically represented
as floating-point variables. Practically, real values can only be measured and represented with
limited precision.
Asymmetric Attributes: For these attributes, only presence (a non-zero attribute value) is regarded
as important. Binary attributes where only non-zero values are important are called asymmetric
binary attributes. This type of attribute is particularly important for association analysis.
Asymmetric attributes can also be discrete or continuous: for instance, if the number of credits
associated with each course is recorded, then the resulting data set will consist of asymmetric
discrete or continuous attributes.
There are many types of data sets, and as the field of data mining develops and matures, a greater
variety of data sets becomes available for analysis.
The types of data sets are grouped into three groups: 1) Record data, 2) Graph based data,
and 3) Ordered data.
And in general there are three characteristics that apply to many data sets and have a significant
impact on the data mining techniques that are used: 1) Dimensionality, 2) Sparsity, and 3)
Resolution.
1) Dimensionality: The dimensionality of a data set is the number of attributes that the
objects in the data set possess.
2) Sparsity: For some data sets, such as those with asymmetric features, most attributes of
an object have values of 0; in many cases, fewer than 1% of the entries are non-zero. In
practical terms, sparsity is an advantage because usually only the non-zero values need to
be stored and manipulated. This results in significant savings with respect to computation
time and storage.
I) Record Data
Much data mining work assumes that the data set is a collection of records (data objects), each of
which consists of a fixed set of data fields (attributes). Record data is usually stored either in flat
files or in relational databases.
Different variation of record data
1) Transaction or Market Basket Data Transaction data is a special type of record data,
where each record (transaction) involves a set of items. This type of data is called market
basket data because the items in each record are the products in a person's "market
basket." Transaction data is a collection of sets of items, but it can be viewed as a set of
records whose fields are asymmetric attributes.
2) The Data Matrix If the data objects in a collection of data all have the same fixed set of
numeric attributes, then the data objects can be thought of as points (vectors) in a
multidimensional space, where each dimension represents a distinct attribute describing
the object. A set of such data objects can be interpreted as an m by n matrix, where there
are m rows, one for each object, and n columns, one for each attribute. This matrix is
called a data matrix or a pattern matrix.
3) The Sparse Data Matrix A sparse data matrix is a special case of a data matrix in which
the attributes are of the same type and are asymmetric; i.e., only non-zero values are
important. Transaction data is an example of a sparse data matrix that has only 0-1
entries. Another common example is document data. In particular, if the order of the
terms (words) in a document is ignored, then a document can be represented as a term
vector, where each term is a component (attribute) of the vector and the value of each
component is the number of times the corresponding term occurs in the document. This
representation of a collection of documents is often called a document-term matrix.
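A document-term matrix can be sketched as below. The two documents are hypothetical, and each row is stored sparsely as a dictionary holding only the non-zero counts; `dense_row` expands one row when the full vector is needed.

```python
from collections import Counter

# Hypothetical documents; word order is ignored once they are counted.
documents = [
    "data mining finds patterns in data",
    "olap cubes summarize data",
]

def document_term_matrix(documents):
    """Return the vocabulary and one sparse row (term -> count) per document."""
    vocabulary = sorted({term for doc in documents for term in doc.split()})
    sparse_rows = [dict(Counter(doc.split())) for doc in documents]
    return vocabulary, sparse_rows

def dense_row(vocabulary, sparse_row):
    """Expand a sparse row into a full term vector, filling absent terms with 0."""
    return [sparse_row.get(term, 0) for term in vocabulary]

vocabulary, sparse_rows = document_term_matrix(documents)
print(vocabulary)
print(dense_row(vocabulary, sparse_rows[0]))  # [0, 2, 1, 1, 1, 0, 1, 0]
```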
1) Data with Relationships among Objects: The relationships among objects frequently
convey important information. In such cases, the data is often represented as a graph. In
particular, the data objects are mapped to nodes of the graph, while the relationships
among objects are captured by the links between objects and link properties, such as
direction and weight. Ex: Web pages on the World Wide Web, which contain both text
and links to other pages.
2) Data with Objects That Are Graphs: If objects have structure, that is, the objects contain
sub-objects that have relationships, then such objects are frequently represented as
graphs. Ex: the structure of chemical compounds can be represented by a graph, where
the nodes are atoms and the links between nodes are chemical bonds.
2) Sequence Data: It consists of a data set that is a sequence of individual entities, such as a
sequence of words or letters. It is quite similar to sequential data, except that there are no
time stamps; instead, there are positions in an ordered sequence. For example, the genetic
information of plants and animals can be represented in the form of sequences of
nucleotides that are known as genes.
3) Time Series Data: It is a special type of sequential data in which each record is a time
series, i.e., a series of measurements taken over time. For example, a financial data set
might contain objects that are time series of the daily prices of various stocks.
4) Spatial Data: Some objects have spatial attributes, such as positions or areas, as well as
other types of attributes. An example of spatial data is weather data (precipitation,
temperature, pressure) that is collected for a variety of geographical locations. An
important aspect of spatial data is spatial autocorrelation; i.e., objects that are physically
close tend to be similar in other ways as well.
Because preventing data quality problems is typically not an option, data mining focuses on
(1) The detection and correction of data quality problems and
(2) The use of algorithms that can tolerate poor data quality.
Examples of data quality problems: 1) noise and outliers, 2) missing values, 3) duplicate data.
The first step, detection and correction, is often called data cleaning. The following sections
discuss specific aspects of data quality.
Data preprocessing is a broad area and consists of a number of different strategies and techniques
that are interrelated in complex ways. Preprocessing steps should be applied to make the data
more suitable for data mining.
The most important approaches and ideas are:
1) Aggregation
2) Sampling
3) Dimensionality reduction
4) Feature subset selection
5) Feature creation
6) Discretization and Binarization
7) Variable transformation
Aggregation: Combining two or more attributes (or objects) into a single attribute (or object).
Purpose: a) Data reduction: reduce the number of attributes or objects
b) Change of scale. Ex: cities aggregated into regions, states, countries, etc.
c) More “stable” data: aggregated data tends to have less variability
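A change of scale by aggregation can be sketched as follows. The sales figures and the city-to-state mapping are hypothetical values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical city-level sales and a mapping to the coarser state level.
city_sales = {"Belagavi": 120.0, "Hubballi": 80.0, "Pune": 200.0, "Mumbai": 350.0}
city_to_state = {
    "Belagavi": "Karnataka", "Hubballi": "Karnataka",
    "Pune": "Maharashtra", "Mumbai": "Maharashtra",
}

def aggregate(values, mapping):
    """Combine fine-grained objects into coarser ones by summing their values."""
    totals = defaultdict(float)
    for key, value in values.items():
        totals[mapping[key]] += value
    return dict(totals)

state_sales = aggregate(city_sales, city_to_state)
print(state_sales)  # four city objects reduced to two state objects
```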
Sampling: Sampling is the main technique employed for data selection. It is often used for both
the preliminary investigation of the data and the final data analysis. Statisticians sample because
obtaining the entire set of data of interest is too expensive or time consuming. Sampling is used
in data mining because processing the entire set of data of interest is too expensive or time
consuming.
Types of Sampling
a) Simple Random Sampling: There is an equal probability of selecting any particular item
b) Sampling without replacement: As each item is selected, it is removed from the
population
c) Sampling with replacement: Objects are not removed from the population as they are
selected for the sample. In sampling with replacement, the same object can be picked up
more than once.
d) Stratified sampling: Split the data into several partitions; then draw random samples from
each partition
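The sampling schemes above map onto Python's standard `random` module as sketched below. The population, the strata, and the helper names are hypothetical; `random.Random.sample` draws without replacement and `random.Random.choices` draws with replacement.

```python
import random
from collections import defaultdict

def sample_without_replacement(population, k, rng):
    """Each selected item is removed from further consideration."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """The same item may be picked more than once."""
    return rng.choices(population, k=k)

def stratified_sample(records, stratum_of, k_per_stratum, rng):
    """Partition the records, then draw a random sample from each partition."""
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(k_per_stratum, len(members))))
    return sample

rng = random.Random(0)  # fixed seed so that runs are repeatable
population = list(range(100))
without = sample_without_replacement(population, 10, rng)
with_repl = sample_with_replacement(population, 10, rng)
# Hypothetical strata: even and odd numbers.
stratified = stratified_sample(population, lambda x: x % 2, 5, rng)
print(without, with_repl, stratified, sep="\n")
```

Stratified sampling is useful when some groups are rare, since simple random sampling may miss them entirely.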
Dimensionality Reduction:
Purpose: Avoid curse of dimensionality, Reduce amount of time and memory required by data
mining algorithms, Allow data to be more easily visualized, May help to eliminate irrelevant
features or reduce noise
Meaning of the Curse of Dimensionality: the curse of dimensionality refers to the phenomenon
that many types of data analysis become significantly harder as the dimensionality of the data
increases. Specifically, as dimensionality increases, the data becomes increasingly sparse in the
space that it occupies.
Feature Creation: Create new attributes that can capture the important information in a data set
much more efficiently than the original attributes.
Discretization and Binarization: Some data mining algorithms require that the data be in the
form of categorical attributes. Algorithms that find association patterns require that the data be in
the form of binary attributes. Thus, it is often necessary to transform a continuous attribute into a
categorical attribute (discretization). Both continuous and discrete attributes may need to be
transformed into one or more binary attributes (binarization).
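Both transformations can be sketched in a few lines. Equal-width binning and one binary attribute per category are assumptions made for this illustration; they are only one possible choice among several discretization and binarization schemes.

```python
def discretize_equal_width(values, n_bins):
    """Map each continuous value to a bin number 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for value in values:
        bin_id = int((value - lo) / width) if width else 0
        bins.append(min(bin_id, n_bins - 1))  # the maximum falls into the last bin
    return bins

def binarize(categories):
    """Turn a categorical attribute into one binary attribute per category."""
    levels = sorted(set(categories))
    return [[1 if category == level else 0 for level in levels] for category in categories]

values = [1.0, 2.5, 4.0, 5.5, 7.0, 10.0]
bins = discretize_equal_width(values, n_bins=3)
print(bins)            # [0, 0, 1, 1, 2, 2]
print(binarize(bins))  # one row per value, one column per bin
```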
Attribute Transformation: A function that maps the entire set of values of a given attribute to a
new set of replacement values such that each old value can be identified with one of the new
values.
Similarity and dissimilarity are important because they are used by a number of data mining
techniques, such as clustering, nearest neighbor classification, and anomaly detection. In many
cases, the initial data set is not needed once these similarities or dissimilarities have been
computed. Such approaches can be viewed as transforming the data to a similarity (dissimilarity)
space and then performing the analysis.
PROXIMITY: The term proximity is used to refer to either similarity or dissimilarity. The
proximity between two objects is a function of the proximity between the corresponding
attributes of the two objects.
SIMILARITY: “The similarity between two objects is a numerical measure of the degree to
which the two objects are alike”. Consequently, similarities are higher for pairs of objects that
are more alike. Similarities are usually non-negative and are often between 0 (no similarity) and
1 (complete similarity).
DISSIMILARITY: “The dissimilarity between two objects is a numerical measure of the degree
to which the two objects are different”. Dissimilarities are lower for more similar pairs of
objects. Frequently, the term distance is used as a synonym for dissimilarity. Dissimilarities
sometimes fall in the interval [0,1], but it is also common for them to range from 0 to ∞.
Euclidean Distance: The Euclidean distance, d, between two points, x and y, in one-, two-,
three-, or higher-dimensional space is given by the following familiar formula:
\[ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \]
where n is the number of dimensions (attributes) and x_k and y_k are, respectively, the kth
attributes (components) of data objects x and y.
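As a quick worked example (the two points are made up for illustration), take x = (1, 2) and y = (4, 6) in two dimensions. Summing the squared differences of the components and taking the square root gives:

```latex
\[
d(x, y) = \sqrt{(1 - 4)^2 + (2 - 6)^2} = \sqrt{9 + 16} = \sqrt{25} = 5
\]
```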
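Minkowski Distance: The Euclidean distance is generalized by the Minkowski distance:
\[ d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r} \]
where r is a parameter, n is the number of dimensions (attributes), and x_k and y_k are,
respectively, the kth attributes (components) of data objects x and y.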
The following are the three most common examples of Minkowski distances.
1) r = 1. City block (Manhattan, taxicab, L1 norm) distance: A common example of this is the
Hamming distance, which is just the number of bits that are different between two binary vectors.
2) r = 2. Euclidean distance (L2 norm).
3) r = ∞. “Supremum” (Lmax norm, L∞ norm) distance: This is the maximum difference between
any component of the vectors.
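The three cases above can be sketched in a single function. This is an illustrative implementation, not a library API: the name `minkowski` and the use of `math.inf` to select the supremum case are assumptions, and r is assumed to be at least 1.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance between equal-length vectors x and y for a parameter r >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)                      # supremum (L-infinity) distance
    return sum(d ** r for d in diffs) ** (1 / r)


x, y = (1, 2), (4, 6)                          # hypothetical points
print(minkowski(x, y, 1))                      # city block distance
print(minkowski(x, y, 2))                      # Euclidean distance
print(minkowski(x, y, math.inf))               # supremum distance

# With r = 1 on binary vectors, the result is the Hamming distance.
print(minkowski((1, 0, 1, 1), (0, 0, 1, 0), 1))
```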
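Distances, such as the Euclidean distance, have some well-known properties. If d(x, y) is the
distance between two points x and y, then the following properties hold.
1. Positivity
(a) d(x, y) >= 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) <= d(x, y) + d(y, z) for all points x, y, and z.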
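Symmetry and the triangle inequality can be checked empirically on a small set of points. The sketch below is illustrative only: the function names, the tolerance, and the sample points are assumptions, and passing the check on a few points does not prove the property in general. It compares the Euclidean distance with the squared Euclidean distance, which is symmetric but violates the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def is_symmetric(points, d, tol=1e-9):
    """Check d(x, y) = d(y, x) for every pair of the given points."""
    return all(abs(d(x, y) - d(y, x)) <= tol
               for x, y in itertools.product(points, repeat=2))


def satisfies_triangle_inequality(points, d, tol=1e-9):
    """Check d(x, z) <= d(x, y) + d(y, z) for every triple of the given points."""
    return all(d(x, z) <= d(x, y) + d(y, z) + tol
               for x, y, z in itertools.product(points, repeat=3))


points = [(0, 0), (3, 4), (6, 8), (1, 1)]      # hypothetical sample points
print(is_symmetric(points, euclidean),
      satisfies_triangle_inequality(points, euclidean))
print(is_symmetric(points, squared_euclidean),
      satisfies_triangle_inequality(points, squared_euclidean))
```

For the squared distance, the triple (0, 0), (3, 4), (6, 8) is a counterexample: the direct distance is 100, while the two legs add up to only 25 + 25 = 50.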
Jaccard Coefficient (J): It is frequently used to handle objects consisting of asymmetric binary
attributes.
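A minimal sketch of the Jaccard coefficient, shown next to the simple matching coefficient (SMC) for contrast, is given below. It assumes the standard definitions: with f11, f10, f01, and f00 counting the attributes where (x, y) take the values (1, 1), (1, 0), (0, 1), and (0, 0), SMC = (f11 + f00) / (f00 + f01 + f10 + f11) and J = f11 / (f01 + f10 + f11). The function names, the sample vectors, and the convention of returning 0 when both vectors are all zeros are assumptions made for the example.

```python
def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length binary vectors."""
    f00 = f01 = f10 = f11 = 0
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            f11 += 1
        elif a == 1 and b == 0:
            f10 += 1
        elif a == 0 and b == 1:
            f01 += 1
        else:
            f00 += 1
    return f00, f01, f10, f11


def smc(x, y):
    """Simple matching coefficient: 0-0 and 1-1 matches count equally."""
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)


def jaccard(x, y):
    """Jaccard coefficient: 0-0 matches are ignored."""
    f00, f01, f10, f11 = binary_counts(x, y)
    denominator = f01 + f10 + f11
    return f11 / denominator if denominator else 0.0  # assumed convention for all-zero vectors


x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(x, y), jaccard(x, y))
```

The two vectors share no 1s, so the Jaccard coefficient is 0, while SMC is 0.7 because seven of the ten attributes are 0 in both vectors. This is why the Jaccard coefficient suits asymmetric binary attributes, where shared absences carry little information.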
Correlation:
ASSIGNMENT - II
Sem: VI | Sub: Data Mining and Data Warehousing | Sub. Code: 15CS651
Max. Marks: 25 | Mapped CO: C320.2 | Date: 11-03-2019
Module – II

Q. No. | Description of Question | Marks | RBT Level

BATCH 1
1 | Write a short note on: i) ROLAP ii) MOLAP iii) HOLAP OR Describe the servers involved in implementation of a warehouse server. | 5 | L2
2 | What is Data Mining? Explain KDD with neat diagram. | 5 | L2
3 | How to compute data cubes efficiently by using ‘Compute Cube’ operator and data cube materialization? | 5 | L1
4 | Briefly explain the properties that distance measures should satisfy, and explain 1) Euclidean Distance, 2) Minkowski Distance. | 5 | L2
5 | What are the important approaches of data processing? Explain. OR Define data preprocessing. List the different strategies involved in it. | 5 | L2

BATCH 2

BATCH 3
1 | What are the challenges that motivate the use of data mining? Explain data mining tasks and applications in detail. | 5 | L2
2 | Explain Indexing the OLAP Data with respect to Bitmap Index and Join Index. | 5 | L2
3 | List and explain four types of data attributes, with suitable example. | 5 | L2
4 | What is sampling? Explain different types of sampling approaches. | 5 | L2
5 | Explain Simple Matching and Jaccard coefficient with example. | 5 | L2