Data Warehousing and Data Mining

PREREQUISITES
This course is a part of the Information Technology program (B.Sc. (IT)) of Kuvempu University.
A student registering for the fifth semester of B.Sc.(IT) of Kuvempu University must have completed
fourth semester of B.Sc.(IT). The student should have attained the knowledge of the following modules:
NIIT
Algorithms
Java Programming
Unix & Shell Programming
Software Engineering
Coordinator Guide Data Warehousing and Data Mining 1
CHAPTER-SPECIFIC INPUTS
Chapter One
Objectives
In this chapter, the students have learned to:
Describe basic database management concepts

Define data warehouse and data mining systems
Focus Areas
Introduce the need for a database management system with the help of an example, such as the need for a
company to organize its data and manage it through a front-end application. Ensure that the students
understand the drawbacks of a file management system as compared to a database management system
(DBMS). Explain elementary database concepts using ample examples.
Next, give the following analogy to the students to introduce data warehouses.
"A large insurance company has stored and organized its relevant official papers on racks in each room.
It stores a large volume of papers in a separate warehouse. Some of the papers have now become
historical and are not required in day-to-day working. However, these papers cannot be destroyed, as they
are an essential asset for analysis of market trends and the companys progress. Moreover, it is also
mandatory to maintain such papers till a fixed period of time. Therefore, it becomes essential that the
papers required for day-to-day processing are stored in areas where they can be accessed quickly and the
historical papers are stored for future reference in a bigger storage area.
If you compare a database to the racks containing papers for day-to-day processing, then the database's
counterpart for the warehouse is called a data warehouse."
Now, explain the term data warehouse and the usage of a data warehouse. Similarly, explain data mining.
After introducing these concepts, tell the students that before proceeding further, they must understand
two important database management concepts called normalization and entity relationships.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Database Normalization
Normalization is the process in which database objects such as attributes, keys, and relationships are
restructured to remove redundancy and dependencies, and stabilize and simplify the database.
Normalization is an important concept because it has a direct impact on the storage efficiency of the
database.
Normalization is done on the basis of certain rules. Each set of rules makes a database object, such as a
table, more "normal". Each set of rules is called a "normal form." If the first set of rules is observed, the
database is said to be in the "first normal form." If the second set of rules is observed, the database is
considered to be in the "second normal form." Finally, if the third set of rules is observed, the database is
considered to be in the "third normal form." The third normal form is the highest level of normalization
necessary for most applications.
2 Coordinator Guide Data Warehousing and Data Mining
NIIT
Consider the following definitions of normal forms:

First Normal Form or 1NF
1NF aims at removing repeating, multi-valued attributes from a table. The steps to be followed for
converting a table to a 1NF table are:
1.
2.
3.
Eliminate repeating groups in individual tables.

Create a separate table for each set of related data.
Identify each set of related data with a primary key.
Second normal form (2NF)

2NF aims at removing attributes that are not dependent on the primary key completely. The steps to be
followed for converting a table to a 2NF table are:
1.
2.
3.
Ensure that the table meets all the requirements of the first normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors using foreign keys.
Third normal form (3NF)

3NF aims at removing attributes that are dependent on other attributes, which are not a part of the
primary key. The steps to be followed for converting a table to a 2NF table are:
1.
2.
Ensure that the table meets all the requirements of the second normal form.
Remove columns that are not dependent upon the primary key.
Example:
Consider the following unnormalized table to understand normalization:
Students
Student
Subject 1
Teacher 1
Subject 2
Teacher 2
Subject 3
Teacher 3
John
English-01
George
Mathematics-01
Mary
Science-01
Greg
Tony
English-01
James
Mathematics-02
Mary
Science-01
Peter
John
English-02
Pat
Mathematics 02
Mary
Science 01
Greg
The sample table Students, given above, is unnormalized because it has repeating (subject-teacher)
groups for the student and also does not have any primary key. To convert the table into 1 NF, it must be
given a primary key and repeating groups must be eliminated.
The I NF form of the above table is:
RNo
NIIT
Student
Subject
Teacher
Code
1102
John
English
George
01
1102
John
Mathematics
Mary
01
1102
John
Science
Greg
01
1105
Tony
English
James
01
1105
Tony
Mathematics
Mary
02
RNo
Student
Subject
Teacher
Code
1105
Tony
Science
Peter
02
1109
John
English
Pat
02
1109
John
Mathematics
Mary
02
1109
John
Science
Greg
01
After implementing 1NF on the above table:
Each row is uniquely identified by a primary key (Roll No now identifies that there two different
students with the name John)
The sequence of rows and columns is insignificant (The sequence of columns was not insignificant
in the unnormalized form).
Each column is unique (Unlike the unnormalized form where there were repeating groups such as
subject 1, subject 2 and subject 3).
Each column has single values (Now subject name and code are two separate columns, unlike the
unnormalized form where code was stored as a part of the column name for subject).
Similarly, other normal forms can be implemented after the requirements for the previous normal forms
are met.
Entity Relationship Model

An entity is a concrete or abstract object that exists and is distinguishable from other objects. It is
represented by a set of attributes, that is, properties. For example, "Employee" is an entity and has
attributes like Name, and Address, which define its characteristics. In terms of a database, an entity can
be mapped to a table or a view. For instance, an entity can be represented as:
Employee
First Name
Last Name
Address
DOB
Entities in a database can share a relationship. A relationship is an association between several entities.
For example, two entities Employee and Department can be related as shown in the following figure:
Employee
Works in
First Name
Last Name
Address
DOB
Department
Name
DNo.
Head
Three different types of relationships can exist between two entities:

1.
2.
One-to-One Relationships: This type of relationship exists when a single occurrence of an entity is
related to just one occurrence of another entity. For example, a person has one PAN number and a
PAN number is allotted to only one person. Therefore, the entity Person has a one-to-one
relationship with the entity PAN number.
One-to-Many Relationships: This type of relationship exists when a single occurrence of an entity
is related to many occurrences of another entity. For example, a student studies in one school but a
school has various students. Therefore, the relationship between School and Students is one-to-many.
NIIT
3.
Many-to-Many Relationships: This type of relationship exists when many occurrences of an entity
are related to many occurrences of another entity. For example, resources are allocated to many
projects; and a project is allocated many resources. Therefore, the relationship between Resources
and Projects is many-to-many.
The relationships between entities are depicted using specialized graphics known as ER diagrams or
Entity-relationship diagrams.
NIIT
FAQ
1. What is the disadvantage of File Management System over DBMS?
Ans:
Some of the disadvantages of file management system over database management system are:
Data redundancy and inconsistency

Difficulty in accessing data
Difficulty in integrating data into new enterprise level applications because of varying formats
Lack of support for concurrent updates by multiple users.
Lack of inherent security
2. Are relational databases the only possible type of database models?

Ans:
No. Apart from relational, other models include network and hierarchical models. However, these two
models are obsolete. Nowadays, relational and object-oriented models are preferred.
3. What is referential integrity and how it is achieved in a relational database?
Ans:
Referential integrity is a feature of DBMS that prevents the user from entering inconsistent data. This is
mainly achieved by having a foreign key constraint on a table.
4. What are the higher normal forms?
Ans:
A normal form considered higher than 3NF is the Boyce Codd Normal Form (BCNF). The BCNF differs
from the 3 NF when there are more than one composite and disjoint candidate keys.
NIIT
Chapter Two
Objectives
Define data warehouse

Identify the data warehouse process
Identify the process flow in a data warehouse
Identify the architecture for a data warehouse
Apply data warehouse schemas
Partition Fact table into separate partitions
Identify the need for meta data and data marts
Focus Areas
Initiate the discussion by asking the following questions:
What is data?
What is the need to store this data?
How should the data be stored?
What techniques can be adopted to retrieve enormous data efficiently?
You can also discuss the architecture for storing voluminous data with the help of data warehouse.
Besides that, a difference between the data warehouse, data mart, and metadata should be discussed
clearly.
Additional Inputs
Evolution of Data Warehousing

Large organizations, all over the world, are increasingly using data for analytical purposes to assist them
in taking business decisions in real time. The applications used for such analyses can be termed as
Business Intelligence Applications. These analyses assist in determining trends based on which, future
business decisions can be formulated.
One of the fundamental requirements of these types of complex analyses is that they require a large
volume of data to reach a consistent level of sampling.
For example, soft drink manufacturers often use historical data to forecast the quantity of bottles to be
currently manufactured. These forecasts are based on parameters such as the temperature, purchasing
capacity of the customers, age group and so on. In order to obtain a better forecast, it becomes necessary
that the amount of sample data be very large. In fact, the larger the volumes of data, the better are the
chances for the forecast to be accurate.
The data requirement for such kind of analysis is of two types:
A large volume of data

Historical data, capturing the data at various frames of time
The On Line Transaction Processing (OLTP) systems are not suitable for the kind of complex analysis.
This is because OLTP systems:
NIIT
Contain only transactional/recent data.

Are specifically designed to manage transaction processing.
There are therefore two distinct uses of data in organizations. The first is the analysis of historical data
for business analytical purposes. The second is the usage of data that takes care of the daily transactional
activities.
The use of data for business analytical purposes has led to the development of data warehouses. Data
warehouse is one of the key components of Business Intelligence Systems. The data stored in a data
warehouse is used by the querying or reporting tools, data mining applications, and On Line Analysis
Processing (OLAP) applications for business analysis.
Data Mart
A data mart is a specific subset of the contents of a data warehouse, stored within its own database. It
contains data focused at a department level or on a specific business area of the organization. The volume
of data in a data mart is less than that of a typical data warehouse, making query processing faster.
Experts use data marts for analysis, rather than using the main data warehouse. For example, a large
organization can treat its individual departments or divisions as independent business units. Each of these
units can have its own data mart, used regularly by the analysts of that specific unit. Data marts
contribute to the main data warehouse on a regular basis.
Data Warehouse vs. Data Mart

A data mart is simply a mini-data warehouse. More specifically, it is a data warehouse designed for a
specific purpose or to be analyzed by a particular group within the institution. The data collected can be
extracted from the data warehouse. Basically, a data warehouse collects a wide range of data types, while
a data mart specifically involves only data the user will want.
Data Warehouse and CRM

Customer Relationship Management (CRM) uses data warehousing for value enhancement purposes. In
CRM, data warehousing fulfills two main needs. It helps in analyzing customer habits, which in turn
helps businesses, such as hotel and airlines, to offer better services to customers. It also helps in
analyzing the behavior trend of customer segments. This analysis helps in determining segment
commonalties and thereby, develops a business package based on their common habits.
Data Warehouse and DSS

Decision support systems (DSS) are used by organizations to improve strategic, tactical and operational
decisions. Such decisions are based on complex analysis of historical data. The data warehouses provide
a repository of historical data that support DSS functioning.
Need for Security of a Data Warehouse

Let us consider an example:
Consider a company, which has all the information in the centralized data warehouse. The data
warehouse will surely contain confidential information also. Every department's analyst must be in need
of this data warehouse to analyze their specific data. However, they want to hide this data from others. In
such a case, roles are critical.
Consider an enterprise data warehouse, such as the airlines system. The data of all airlines is stored in one
data warehouse. If security is not implemented, the analyst of one airline can easily get the secrets of
another airline.
Hence, security is the crucial factor that we need to consider while implementing a data warehouse.
Metadata
Due to the large volume of data that exists and is queried in a data warehouse, it is often useful to classify
the type of data. This helps increase query response activity. In a data warehouse, a specific type of data,
known as metadata, is used. Metadata contains information about types of data.
NIIT
For example, the class of an object and the object, that is, instance of the class. If data is the instance then
metadata is the class. Metadata is data about data.
Characteristics of a Data Warehouse

There are four characteristics intrinsic to a data warehouse:
Consolidated and consistent data

Subject-oriented data
Historical data
Read-only data
Consolidated and Consistent Data

A data warehouse combines operational data from a variety of sources with consistent naming
conventions, measurements, physical attributes, and semantics.
For example, the dates are often in different formats. Different regions often refer to the same piece of
data differently. "Total" can also be known as "Balance" or "Total Amount" to represent the amount of
cash transactions.
This is unacceptable because data in a data warehouse is stored in a central location and cannot be stored
in different formats or referred to differently. Data has to be stored in the data warehouse in a single,
agreed-upon format, despite variations in the operational sources. This enables data from across the
enterprise to be combined in the data warehouse and cross referenced by the analysts.
Subject-Oriented Data
Often, a large amount of data is useful only to the person with whom it exists or who has created that
data. However, analysts might not need this entire data. It serves no purpose to store such data in a data
warehouse. During data cleaning and reformatting, data that cannot be used for analysis is discarded.
For example, it is customary to release all goods from the warehouse against an invoice. In the local
database, the invoice number is an integral part of the data and so is the name of the person who is
entering as well as approving the goods disbursement. However, the information such as who entered the
invoice or who approved the invoice has no relevance to business or executive reporting. The presence of
this data makes business analysis cumbersome and therefore, it must be organized in a way to make data
querying manageable.
In a data warehouse, only the key business information from operational sources must be stored and
organized for analysis.
Historical Data
Data in OLTP systems always represent the current value of data at any moment in time. However, data
stored in a data warehouse represents data at specific points or frames in time. Data stored in a data
warehouse represents historical rather than current information. This is a typical and mandatory
requirement of data warehouse.
For example, an order-entry form, which is an OLTP application, shall always display the current levels
of a specific item in the inventory. It will not show data at some point of time in the past. On the other
hand, when all inventory-specific data is stored in the data warehouse, it must contain the snapshots of
inventory on a daily, weekly, or monthly basis. Moreover, the analysis applications, which access this
data from the warehouse, should also be able to exclusively take these snapshots and make them
accessible to the end users.
NIIT
UPDATES
RDBMS
DATA LOADS
QUERIES
DATA
WAREHOUSE
QUERIES
Read-only Data
After data has been moved to the data warehouse, it cannot be changed. Data stored in a data warehouse
pertains to a point in time or to a specific timeframe, therefore it must never be updated. The only
operations that occur in a data warehouse, when it has been set up, are loading and querying.
Differences in the Way Data Behaves in an RDBMS and a Data Warehouse
FAQ
1. On which layer of application architecture does data warehouse operate?
Ans:
A data warehouse is a server-side repository to store data.
2. What is a data warehouse?
Ans:
A data warehouse is a huge repository of data used for very complex business analysis.
3. What are the benefits of data warehousing?
Ans:
The main benefit of data warehousing is to keep data in such a form that complex business analysis can
be done in minimum amount of time.
4. What are the application areas of a data warehouse?
Ans:
There are various application areas of a data warehouse. Some of these are:
Airlines
Meteorology
Logistics
Insurance
5. Where can you use data warehouse successfully?

Ans:
It is ideal to implement data warehousing when there is a large amount of historical data that needs to be
processed for extensive analytical purposes.
NIIT
6. When and where is a data mart useful?

Ans:
A data mart helps to provide data for conducting analysis at a specialized level. It will always be used at a
strategic business unit level such as the department level for a business unit.
7. What does historical data support in data warehouse?
Ans:
Historical data is used for supplying pre-processed and non-pre-processed data for conducting business
analysis.
NIIT
Chapter Three
Objectives
Identify the characteristics of the Star Flake Schemas

Design Fact tables
Design dimension tables
Design the Star Flake schema
Identify the concept of query redirection
Analyze data using multi-dimensional schema
Focus Areas
Conduct a quick recap to ensure that the students have clearly understood the concept of a data
warehouse. Initiate a discussion on the concept of schemas. Explain the Star Flake Schema. Discuss the
use of Fact and Dimension Tables. Also, clarify the distinction between Fact and Dimension Table so that
there is no confusion in this respect.
Discuss the guidelines to be followed while creating the Fact and Dimension Tables. Also, discuss the
concept of query redirection and its purpose in helping users to get the most appropriate results in time.
Additional Inputs
Fact Tables
Fact tables contain data that describe a specific event within a business, such as a financial transaction or
a product sale. Fact tables can also contain data aggregations such as sales per month per region. Under
normal conditions, existing data within a fact table is not updated, however, new data is loaded. Fact
tables contain the majority of data stored in a data warehouse. This makes the structuring of fact tables
very crucial. The features of fact tables are:
Contain data that is static in nature
Contain many records, possibly billions
Store numeric data
Have multiple foreign keys
For example, a fact table can contain data such as product ID numbers and geographical IDs of areas.
Dimension Tables
Dimension tables contain data used to reference the data stored in the fact table, such as product
descriptions, customer names, and addresses. Data, in this case, is mainly stored in characters. It is
possible to optimize queries by separating the fact table data from the dimension table data. Dimension
tables do not contain as many rows as fact tables. Dimension tables can change and must be structured to
permit change. Dimension tables:
Can be updated
Have lesser number of rows as compared to fact tables
Store data in characters
Have many columns to manage dimension hierarchies
Have one primary key, also known as the dimensional key
NIIT
For example, a Retailer dimension table can contain retailer ID and retailer name columns.
Keys: Unique identifiers used to query data stored in the central fact table. The dimensional key, such as
a primary key, links a row in the fact table with one dimension table. This structure makes it easy to
construct complex queries and support a drill-down analysis in decision-support applications.
Star Schema
The star schema is a relational database structure and is a popular design technique used to implement a
data warehouse. In this schema, data is maintained in a single fact table at the centre of the schema. Each
dimension table is directly related to the fact table by a key column.
The star schema design increases query performance by reducing the volume of data read from disk to
satisfy a query. Queries first analyze data in the dimension tables to obtain the dimension keys that index
into the central fact table. In this way, the number of rows that have to be scanned to satisfy a query is
reduced greatly.
S E A S O N T A B LE
R E G IO N A L T A B LE
S
S
S
S
E
E
S
E
R
R
R
R
R
R
eason ID
e a s o n D e s c r ip t io n
ta rt D a te
ta rt M o nth
nd D ate
nd M onth
ta rt Y e ar
nd Y ear
D im e n s io n T a b le
io n I D
io n D e s c r ip t io n
io n L o c a t io n D is t r ic t
io n L o c a t io n S t a t e
io n L o c a t io n C o u n t r y
io n a l I n c h a r g e I D
S A LE S T A B LE
S
R
R
P
U
S
R E T A ILE R T A B LE
eg
eg
eg
eg
eg
eg
eas on ID
e g io n a l I D
e t a ile r I D
ro du c t ID
n it s S o ld
e a s o n a l D is c o u n t
F a c t T a b le
R e t a ile r I D
R e t a ile r s h ip s t a r t d a t e
R e t a ile r s h ip v a lid it y p e r io d
P RO D UCT T A B LE
P rod u c t ID
P r o d u c t D e s c r ip t io n
Launch D ate
C o n t in u it y S t a t u s
U n it
P r ic e
B ran d N a m e
P r o d u c t P r o p e r t y 1 V a lu e
P r o d u c t P r o p e r t y 2 V a lu e
Star Schema
OLAP
The arrangement and analysis of data in a data warehouse is done using On Line Analysis Processing
(OLAP) systems. The historical data supports business decisions at various levels or departments, from
strategic business planning to financial performance appraisal of a separate organizational unit and/or
distinct business functional areas.
After being collected from various heterogeneous sources, data is extracted, cleaned or scrubbed, and
stored in the data warehouse in a homogeneous form. This is the first part of the activity.
The analysts now have to choose the right data. Applications have to be built to assist analysts in
completing their activity in a limited amount of time. It is necessary for another application to arrange
data in a format, which shall make it easily accessible to the end user on querying. OLAP arranges data in
an easily accessible format typical to a data warehouse.
OLAP technology thus enables data warehouses to be used effectively for:
NIIT
Online analysis
Providing quick responses to complex iterative queries posed by analysts
OLAP achieves this through its multi-dimensional data models. Multi-dimensional data models are to
data warehouse what tables are to RDBMS. These are the units where data is stored. These models are
used to organize and summarize large amounts of data, making it easy to evaluate using online analysis
and graphical tools. It also provides the speed and flexibility to support the analysts, helping them
complete complex analysis in an acceptable time limit.
B u s in e s s A n a ly s t s a n d
O t h e r B u s in e s s U s e r s
LO
D A
D A
D A
O p e r a t io n a l /
T r a n s a c t io n a l
D ata;
H e te ro g e n o u s D a ta
AD
T A
T A
T A
M AN AG ER
E X T R A C T IO N
T R A N S F O R M A T IO N
L O A D IN G
O L AP
DAT A ST O RAG E
D A TA W A R E H O U S E
Overview of how OLAP Services are Correlated to the Data Warehouse
The OLTP databases have basic constituents such as tables and relations. In the case of data warehouses,
these constituents are fact tables and dimension tables. A fact table contains measurable columns such as
sales, costs, and expenses. A dimension table contains columns on which fact table columns can be
categorized. For example, a Product dimension table can have columns such as product family, product
category, and product ID on which sales, costs and expenses can be categorized. Similarly, these fact
table columns can also be categorized on Stores table columns based on country and region.
The following table displays the sales categorized according to the product family (from Product
dimension) and country (from Store dimension):
USA
Germany
Mexico
Drink
2500000
450000
670000
Food
3400000
675000
774800
Solutions to Chapter Three Questions

8. What is query redirection?
Ans.
When the available data grows beyond size, partitioning becomes essential. Query redirection means that
the queries should be directed to the appropriate partitions that store the data required by the query.
FAQ
1. What are the benefits of OLTP?
Ans:
On Line Transaction Processing (OLTP) assists in storing current business transactional data. It also
supports a large number of concurrent users to access data at the same time.
NIIT
2. Why the OLTP cannot provide history data for analysis?

Ans:
Data in a data warehouse comes from an OLTP system only. However, it cannot be directly used for
analysis. The reason is that the data in OLTP systems is not organized to give results quickly from
billions of records. In a data warehouse, data is classified into various categories and so it is possible to
give the results quickly.
3. Why is the data in the data warehouse not stored in a normalized form as in OLTP?
Ans:
The objective of storing data in a normalized form in OLTP is to reduce redundancy and minimize disk
storage. The key objective in a data warehouse is to enhance the query response time. The easier the
access to data better will be the query response time. Hence, the normalization rules do not matter in a
data warehouse.
4. An integral part of OLTP is its support for hundreds of concurrent users. The number of concurrent
users supported by a data warehouse is comparable to OLTP. Is this statement true or false? Justify your
answer.
Ans:
The statement is false. This is because the number of people involved in data analysis is very low as
compared to front-end users who engage in transactional data. Moreover, the percentage of CPU usage
per user is very high in case of data warehousing as compared to OLTP users.
5. Explain why a data warehouse does not use current or OLTP data for analysis.
Ans:
The main purpose of a data warehouse is to provide historical data to analyze business trends. Therefore,
historical data needs to be a snapshot of events over time and not only on the current data.
6. What is the advantage of MOLAP as storage model?
Ans:
MOLAP dimensions provide better query performance. Here the contents of the dimension are
processed and stored on the Analysis Server and not on a Relational Server.
7. What kind of data does a fact table contain?
Ans:
A fact table contains numeric data.
8. What are the different OLAP storage models?
Ans:
Following are the different OLAP storage models: MOLAP, ROLAP and HOLAP.
9. A data analysis has to be done in the fastest possible means on data stored in Multi-dimensional
format. Which storage model is best suited in this case?
Ans:
MOLAP.
NIIT
Chapter Four
Objectives
Identify a partitioning strategy

Use horizontal partitioning
Use vertical partitioning
Use hardware partitioning
Focus Areas
Introduce partitioning by asking the students to identify ways in which the performance of a data
warehouse and its manageability can be improved. Inform students that if data is broken into smaller
manageable chunks, it can be scanned faster and managed easily. In this context, introduce partitioning.
Ask students to identify the advantages of partitioning such as, faster access, better manageability due to
smaller size, improved recovery time, and reduced effect of failure or breakdown.
Explain the various types of partitioning with the help of examples for each. You can also explain
Stripping discussed in the Additional Inputs section. This section also gives examples of how partitioning
such as Stripping can help in optimization. To demonstrate partitioning, you may explain it with
reference to the 'Partitioning in Oracle" section discussed in Additional Inputs.
Additional Inputs
Optimizing through Stripping

Data warehouses handle a large amount of historical, present, and future data. As a result, the I/O
performance is a crucial consideration in data warehouse operations.
A technique called stripping is frequently used for improving I/O performance in data warehouses
Stripping divides the data of a large table into small portions and stores them on separate data files on
separate disks. You can stripe tablespaces for tables, indexes, rollback segments, and temporary
tablespaces, where the data is spread over hardware components such as controllers, I/O channels, and
internal buses.
You can optimize a data warehouse object across multiple hardware components such as hard disks
through stripping. However, it is essential that the requirement be studied carefully before a partitioning
strategy is adopted for the data warehouse. The following examples will demonstrate how:
Case A
Eric Pitt, the data warehousing architect at a stock purchase company, will require frequent full table
scans on the data warehouse customer related table(s) to retrieve data for business intelligence reporting.
Eric can improve the I/O performance of the data warehouse by placing a table on multiple hard disks for
a faster scan.
Case B
Eric's organization also has a BI team, which constantly monitors market trends, fluctuations and
compares it with historical data to make predictions several times during the day. The availability of the
data warehouse is crucial in this case. For this, it is important to restrict the tablespace to a few hard
disks. In this case, if Eric wants to improve full table scan as well as availability, he can maximize
partitions of the tables but minimize partitions for each tablespace.
NIIT
Types of Stripping
Stripping is of the following types:
Global: Global Stripping entails disks and partitions. It is often used when you need to access data in
only one partition. For implementing global partition in such cases, you can spread the data in that
partition across many disks to improve performance for parallel execution operations. However,
global Stripping has a single point of failure. That is, if one disk fails, and the disks are not mirrored,
all the partitions are affected.
Local: Local Stripping deals with partitioned tables and indexes. It is a simple form of partitioning in
which each partition has its own set of disks and files. Access to the disk on which the partition
resides or to the files does not overlap. Unlike global partition, an advantage of local Stripping is that
if one disk fails, it does not affect other partitions. However, its main disadvantages are on the cost,
and maintenance side. In local partitioning since each partition requires multiple disks of its own, it
adds a cost and maintenance overhead due to these multiple hardware components. In this case, if
you want to limit the number of disks used, you will have to reduce the number of partitions. This
consequently makes local Stripping inappropriate for parallel operations. Local Stripping is a good
choice if availability is a critical concern in your data warehouse.
Automatic: Automatic Stripping is the Stripping done by the operating system itself based on some
settings. It is a simple and flexible way of Stripping and useful for parallel processing requirements for the same operation or multiple operations. However, the advantages of this Stripping are limited
by the hardware such as I/O buses. That is, unlike local Stripping the DOP is not a function of disks.
As per Oracle's recommendation, the stripe size must be at least 64 KB for good performance.
Manual: Manual stripping is the process of adding multiple files to each tablespace, such that each
file is on a separate disk. When using manual Stripping, the degree of parallelism of processing
depends on the number of disks rather than of the number of processors. If manual Stripping is used
correctly, system's performance improves significantly.
Partitioning in Oracle
Oracle is one of the most preferred choices in the field of data warehousing. Oracle 9i supports various
types partitioning techniques such as:
Hash partitioning: In this type of partitioning a table's records are partitioned based on the value of
a particular field in the table. The value, which has to be mapped for deciding each record's partition,
is called the hash value.
Range partitioning: In this type of partitioning a table's records are partitioned based on the range
of values in a particular field of the table.
List partitioning: In this type of partitioning a table's records are partitioned based on a value from a
list of values.
Composite range-hash partitioning: In this type of partitioning, a table is first partitioned on the
basis of range partitioning and then the partitions are further partitioned based on hash partitioning.
Composite range-list partitioning: In this type of partitioning, a table is first partitioned on the
basis of range partitioning and then the partitions are further partitioned based on list partitioning
As an example, consider how Hash partitioning can be performed. Suppose, you are designing a data
warehouse for an insurance company. The company stores the details of its investors in its policies
released till date (suppose 3) in a table INSURANCE_DATA. This table has a field POLICY_TYPE.
You can partition the table through hash partitioning with POLICY_TYPE as the hash value as follows:
CREATE TABLE INSURANCE_DATA
(POLICY_NUMBER,...,INSURANCE TYPE);
PARTITION BY HASH(INSURANCE_TYPE)
(PARTITION P1_TYPE TABLESPACE TBLSPC01,
PARTITION P2_TYPE TABLESPACE TBLSPC02,
PARTITION P3_TYPE TABLESPACE TBLSPC03,)
This way you can map one policy type to one partition.
NIIT
FAQ
1. Name the 2 important parameters that decide the granularity of partitions.
Ans:
Two important factors that decide the granularity of partitions are the overall size and manageability of
the system. Both parameters are to be balanced against each other while deciding on a partitioning
strategy. Suppose a data containing information about the population is partitioned on the basis of state,
the two maintenance related issues that could be faced by the administrator are:
The query needs the information of all the states, such as particular languages spoken in the states,
when all the states have to be scanned.
If the definition of state changes (state is redefined), the entire fact table needs to be built again.
2. Are there any disadvantages of data partitioning?

Ans:
Data partitioning is by and large an advantageous technique for improving performance. However, it
increases the implementation complexity and imposes constraints in query design.
3. Can partitions be indexed?
Ans:
Yes, partitions can be indexed if supported by the platform. For example, in Oracle 9i, you can create
various types of partitions in indexes.
4. If you have a huge amount of historical data, which is too old to be useful often but cannot be
discarded, then can partitioning help?
Ans:
Essentially the answer to this question depends on various factors such as availability of resources and
design strategies. However, you can partition data on the basis of the date that it was last accessed and
keep the historical data on a separate partition. In fact, you can use Stripping to keep it on a separate disk
to improve access speed to the more useful data.
NIIT
Chapter Five
Objectives
Define aggregation
Design and create summary tables
Focus Areas
Introduce aggregation as an operation that has a very significant impact on the performance of a data
warehouse. Explain the need for aggregates. Also, identify goals associated with aggregation from the
Additional Inputs section. Explain the considerations for designing aggregations referring to Additional
Inputs section. Also, explain the concept of an aggregate navigator.
Explain Summary tables' (or aggregates or aggregate fact tables) design and its creation.
Additional Inputs
Goals Associated with Aggregation

Aggregation aims at improving the performance of a data warehouse. An effective aggregation must meet
certain goals. Kimball, a pioneer in the field of data warehousing, suggests some goals based on which
you can design a good aggregate strategy. These goals are:
Provide dramatic performance gains for as many categories of users queries as possible.
Add only a reasonable amount of extra data storage to the warehouse. What is reasonable is up to the
DBA. However, many data warehouse DBAs strive to increase the overall disk storage for the data
ware house by a factor of two or less.
Be completely transparent to end users and to application designers except for the obvious
performance benefits; in other words, no end-user application SQL should reference the aggregates.
Directly benefit all users of data warehouse, regardless of which query tool they use.
Keep the impact of the cost of the data extract system to the minimum. Inevitably, a lot of aggregates
must be built every time data is loaded, but their specification should be as automated as possible.
Keep the impact of the DBA's administrative responsibility to the minimum. The metadata that
supports aggregates should be limited and easy to maintain.
Selecting Candidates for Aggregation

Before creating aggregates you must select appropriate candidates for creating aggregations. The
selection of candidates for aggregation is primarily based on the common requests of business users. The
selection may also be based on a statistical view that ensures that even if a dimension does not qualify as
a commonly queried dimension, it may still be aggregated if it has a large range of values for its
attribute(s). The statistical view also incorporates the relation between various attributes within a
dimension and aims at selecting aggregates based on these relations.
Some generic factors based on which you can select candidates are:
NIIT
Dimensions whose attributes could be candidates for aggregation because they are used often
Attributes commonly used together
The number or range of values for a particular attribute
Possible aggregates that may be used to create other aggregates on the fly, if not directory required
by the business users
Aggregate Navigator
An aggregate navigator is a middleware component between a client and the database server. It intercepts
the client's SQL queries and transforms them into SQL queries that can be applied on aggregates. It
contains up-to-date meta data about the aggregates of the data warehouse. Based on this meta data, it
finds the appropriate aggregate, which can handle the basic SQL query sent by a client and transforms the
query for the aggregate. The aggregate navigation algorithm suggested by Kimball is as follows:
1.
2.
3.
4.
Create the aggregation as per the following design goals:

a. Aggregates must be stored in their own fact tables, separate from the base atomic data. In
addition, each distinct aggregation level must occupy its own unique fact table.
b. The dimension tables attached to the aggregate fact tables must, wherever possible be shrunken
version of the dimension tables associated with the base fact table.
c. The basic atomic fact table and all of its related aggregate fact tables must be associated together
as a 'family of schemas' so that the aggregate navigator knows which tables are related to each
other.
d. Force all SQL statements created by any end user data access tool or application to refer
explicitly to the base fact table and its associated full-size dimension tables.
Sort the schemas from the smallest to the largest based on the row count. For any given SQL
statement presented to the database server, find the smallest fact table. Choose the smallest schema.
Compare the table fields in the SQL statement to the table fields in the series of lookups in DBMS
system catalogue. If all of the field in the SQL statement can be found in the fact and dimension
tables being examined, alter the original SQL by simply substituting destination table names, for
original table names. No field names need to be changed. If any field in the SQL statement cannot be
found in the current fact and dimension tables, go back to step 2 and find the next larger fact table.
This process is guaranteed to terminate successfully because eventually you arrive at the base
schema, which is always guaranteed to satisfy the query.
Run the altered SQL. It is guaranteed to return the correct answer because all of the fields in the SQL
statement are present in the chosen schema.
FAQ
1. Are there any risks associated with aggregation?
Ans:
The main risk associated with aggregates is that of increase in disk storage space.
2. Once created, is an aggregate permanent?
Ans:
No, aggregates keep changing as per the need of the business. In fact, they can be taken offline or put
online anytime by the administrator. Aggregates, which have become obsolete, can also be deleted to free
up disk space.
3. Can operations such as MIN and MAX value be determined once a summary table has been created?
Ans:
Operations such as MIN and MAX cannot be determined correctly once the summary table has been
created. To determine their value they must be calculated and stored at the time that the summary table
was derived from the base table.
4. How much storage increase might be required in the data warehouse system when using aggregates?
Ans:
The storage needs typically increase by a factor of 1 or sometimes even 2 for aggregates.
NIIT
Chapter Six
Objectives
Identify the need for a data mart

Differentiate between a data warehouse and a data mart
Describe access control issues in data mart design
Focus Areas
Introduce the concept of data marts using the store-house retail outlet analogy given in the book.
Compare data marts with data warehouses to bring out the difference between the two. Explain the need
for a data mart. In this context, explain that a data warehouse can actually be built from scratch or be built
up on data marts. Introduce EDMA and DS/DMA as two data warehouse architectures based on data
marts.
Explain the access control issues in data mart design briefly.
Additional Inputs
Data Mart vs. Data Warehouse

Criteria
Data Mart
Data Warehouse
Purpose
Business-driven
Technology-driven
Scope
Localized
Centralized
Cost
In hundreds of thousands of dollars
In millions of dollars
Development Time
6-8 months
1.5-2 years
Data
Data about specific

departments/subjects
Data about the entire enterprise
Number
An enterprise usually has multiple

data marts
An enterprise usually has one

data warehouse
Granularity of data
Detailed
Summarized
Data Mart Types

You can classify Data marts in the following two types:
Independent data mart

Dependent data mart
An independent data mart is the one where data is sourced from a legacy application or a source other
than the data warehouse. It is created with the aim of integrating into a data warehouse.
A dependent data mart is the one whose data source is the data warehouse itself.
NIIT
Data Mart Based Data Warehouse Architectures

There are three important Data warehouse architectures based on data marts:
Enterprise Data Mart Architecture (EDMA)

Data Stage/Data Mart Architecture (DS/DMA)
Distributed Data Warehouse/Data Mart Architecture (DDW/DMA)
A data warehouse architecture which implements an incremental approach to designing a data warehouse
uses data marts and a shared Global metadata repository (refer to chapter 6). This architecture also
supports a common data staging area. This data staging area is called a Dynamic Data Store (DDS). The
DDS plays a crucial role in integrating data marts with the data warehouse. In this architecture, star
schema modeling should be used if relational technology is used for the data warehouse.
In Data Stage/Data Mart Architecture, no single data warehouse is physically implemented. Instead, the
warehouse is considered a logical group of all the data marts.
DDW/DMA is also similar to EDMA as it has a dynamic staging area and a common global metadata
repository.
FAQ
1. What are conformed dimensions?
Ans:
A conformed dimension is the one whose meaning is independent of the fact table from which it is being
referred to.
2. What are virtual data marts?
Ans:
Virtual data marts are logical views of multiple physical data marts based on user requirement.
3. Which tool supports data mart based data warehouse architectures?
Ans:
Informtica is commonly used for implementing data mart based data warehouse architectures.
4. Is the data in data marts also historical like in data warehouses?
Ans:
The data in data marts is historical only to some extent. In fact, it is not the same as the data in data
warehouse because of the difference in the purpose and approaches of the two.
NIIT
Chapter Seven
Objectives
Define metadata
Identify the uses of metadata
Focus Areas
Explain metadata as "data about data". Explain its important and its need in a data warehouse. Discuss the
various types of metadata referring to the Additional Inputs section. Explain its usage in transformation
and loading, data management and query generation. Explain the concept of metadata management
referring to the Additional Inputs section.
Additional Inputs
Metadata Types
With the growth and advances in the data warehousing field, metadata has become an invaluable
resource. According to the source of the data being described, metadata can be classified as:
Source metadata: This includes the information about the source data. It could include schemas,
formats, graphics, relational tables, and ownership, administrative, business descriptions. It could
also include process-related information such as extraction settings, schedules, and results of specific
jobs that were performed on the source systems.
Data staging metadata: This includes all the metadata required to load the data into the staging
area. It could include, data acquisition information, definitions of conformed dimensions and facts,
slowly changing dimension policies, data cleaning specifications, data enhancement and mapping
transformations, target schema designs, data flows, load scripts, aggregate definitions, various
process logs, and other business documentation.
DBMS metadata: This includes the metadata describing various definitions, settings and
specifications after the data has been loaded into the data warehouse. It could include, partition
settings, indexes, Stripping specifications, security privileges, administrative scripts, view
definitions, backup status, and procedures.
Front room metadata: It could include, names and descriptions for attributes and tables, canned
query and report definitions. In addition, it includes end-user documentation, user profiles, network
security user privileges and profiles, network usage statistics, and usage instructions for data
elements, tables, views and reports.
Metadata Management
Metadata management is as crucial as metadata itself. Metadata management ensures that a metadata can
be represented and shared in a standard format. Metadata management involves two essential
components:
NIIT
Metadata modeling
Metadata repository
In order to standardize representation of metadata it must be modeled at separate layers. This way
varying source systems that have metadata at different layers of abstraction can be mapped to one of the
layers and standardized. A metadata model typically has four layers as shown in the following figure:
Layers of Metadata Model

The Meta-Metamodel describes the structure of various entities of a database. A metamodel defines the
structure and semantics of the Metadata. The Metadata describes the format and semantics of the data. By
taking this model as a basis while designing your data warehouse and managing the metadata, you can
standardize the representation of the metadata to be used for the data warehouse.
The second important component of metadata management is metadata repository. A metadata repository
is a common storage for all the metadata required for running a data warehouse. A metadata repository
can be of two types depending on the architecture used:
Centralized meta data repository

Decentralized meta data repository
In a centralized meta data repository, the metadata is defined and controlled through a single schema
stored in a centralized repository, called the global metadata repository. This single schema represents the
composite schema of all the sub systems.
In a decentralized repository, data repository is required in a distributed environment. It consists of a
central global metadata repository, as well as local metadata repositories. A global repository contains the
metadata that is to be shared and reused among the local repositories. This global metadata is a single
schema that is used by all the local metadata repositories. The local metadata repositories, on the other
hand, contain the metadata specific to their individual uses.
FAQ
1. How can you classify metadata?
Ans:
You can classify metadata according the use of metadata as:
Administrative metadata: Metadata that describes the data used for managing the data in terms of
statistics such as time of creation, access rights, and last access time.
Structural metadata: Metadata that describes the structure of the data.
Descriptive metadata: Metadata that describes the purpose or functionality of the data.
NIIT
2. What is backroom metadata?

Ans:
Backroom metadata is the metadata related to the process of extracting, cleaning, and loading. It is of use
to the DBA and business users but not to the end-user.
3. What is a metadata catalogue?
Ans:
A metadata catalogue is the same as a metadata repository. It is also called metadatabase.
4. Are there any tools for metadata management?
Ans:
Yes, there are various tools that facilitate metadata management. One such Windows based tool is Saphir.
SQL Server 2000 also enables metadata management to some extent.
NIIT
Chapter Eight
Objectives
Identify the various types of managers in a data warehouse environment

Identify the responsibilities of each type of system manager
Identify the responsibilities of each type of process manager
Focus Areas
Tell students that the various components and techniques applied to implement a data warehouse must be
managed together. This management is the responsibility of two types of management tools - system
management tools and process management tools. Now, explain what system managers and process
managers are. Emphasize that these are not persons but tools.
Explain each type of manager and their responsibilities in detail. Explain the various components of SQL
Server 2000, which act as one or more of these managers by referring to the Additional Inputs section.
The main points are given in the Additional Inputs section. You can refer to the SQL Server Help if you
need to detail students on any specific point.
Additional Inputs
SQL Server 2000 for Data Warehousing Process Management

SQL Server 2000, although not preferred over Oracle, stands as a good Data warehousing tool by itself. It
includes various components for enabling the extract-transform-load (ETL) process. The three process
managers, query manager, load manager, and warehouse manager, primarily manage the ETL process.
Various services in SQL Server provide the functionality of these process managers. These services
include:
Data Transformation services, which primarily acts as the load manager.

Metadata services, which primarily act as the warehouse manager along with DTS
Job Scheduler and Query Analyzer, which primarily act as the query manager along with DTS
Note that the terms load, query, and warehouse managers are theoretical terms and may not completely
map to the components of a data warehousing software. Their functionality may be spread across various
components or services of a tool. Alternatively, a tool may not actually have separate components
mapping to each manager type. These are just terms used to collectively indicate a set of tasks that need
to be performed. These tasks may be performed by a data warehousing software in any way.
Data Transformation Services (DTS)

DTS is a set of tools that enables you to build a collection of various elements that execute to perform the
ETL process. This collection of elements that perform ETL is called a DTS package. DTS includes three
primary components:
Connections: In order to enable the elements of a package to function for data transformation, you
must establish a connection between a data source and target.
Tasks: A DTS task is a single step that is performed in the data transformation process. For example,
export data from a source.
Transformation: DTS supports various field level mappings and transformation between data
sources. Examples of supported transformations include:
Copy column transformation: In this, the source column is directly copied to the destination
column without making any changes.
NIIT
Date Time String transformations: In this, transformations on date/time fields, which are
either of string data type or date/time data type, are performed.
ActiveX script transformation: In this, ActiveX scripts are written that programmatically
transform fields for every row being copied from the source to the target.
DTS can be used simply using the Import/Export Wizard, which is an easy but a less flexible way of
loading and transforming data. A more complex but advanced and flexible way to use DTS is through the
DTS designer - a full-fledged designing environment in which you can create transformation packages.
To understand how DTS actually works as a load manager. Observer the following snapshot of the DTS
Import/Export Wizard in which data is loaded from an Excel sheet to a SQL Server database.
The functionality offered by the Import/Export wizard is only limited to loading and transformation. The
DTS designer actually provides a host of other functionalities. A snapshot of the DTS designer is given
here:
Under the Task pane, the various tasks supported by DTS are:
NIIT
File transfer protocol task

ActiveX Script Task
Transform Data Task
Execute Process Task
Execute SQL Task
Data Drive Query Task
Copy SQL Server Objects Task
Send Mail Task
Bulk Insert Task
Execute Package Task
Message Queue Task
Transfer Error Message Task
Transfer Databases Task
Mater Stored Procedures Task
Transfer Jobs Task
Transfer Login Task
Dynamic Properties Task
NIIT
Meta Data Services

SQL Server includes repository tables to store and manage SQL Server metadata. The SQL Server Meta
Data Services provides storage for metadata including:
SQL Server metadata

Metadata associated with specific DTS packages
Online analytical processing (OLAP) metadata
SQL Server also supports a Meta Data browser that enables you to view metadata.
Job Scheduler and Query Analyzer

A Job Scheduler in SQL Server 2000 is used to schedule various jobs or processes that need to be
executed. The Wizard that facilitates job scheduling is shown below:
As shown in the figure, it can be used to schedule SQL queries, ActiveX scripts and shell commands.
Queries can be created and tested using the SQL Query Analyzer.
Using query analyzer you can:
NIIT
Create queries, SQL scripts, and commonly used database objects from predefined scripts
Execute queries on SQL Server databases
Execute stored procedures without knowing the parameters
Debug stored procedures
Copy existing database objects
Debug query performance problems
Locate objects within databases or view and work with objects
Insert, update, or delete rows in a table
Add frequently used commands to the Tools menu
FAQ
1. Are the system and process management devoid of any manual intervention considering that process
manager is a tool and not a person?
Ans:
No. Although the system and process manager are themselves tools that automate system and process
management in data warehouses, they must be configured and sometimes handled through manual
intervention at times. These tasks may be done by the Database Administrator.
2. Does SQL Server also provide system managers?
Ans:
Yes. SQL Server includes various components that enable system management through management and
security services:
3. What is Oracle Warehouse Builder(OWB)?
Ans:
It is one of the commonly used data warehouse development tools with various advanced features such as
support for large databases, automated summary management, and embedded multidimensional OLAP
engine. Unlike SQL Server, which is only for the Windows platform, OWB can be used on all platforms.
It is also more fast, reliable, and scaleable than SQL Server.
4. What is replication?
Ans:
Replication is the process of creating multiple copies of data on the same or different platform and
keeping the copies in sync.
NIIT
Chapter Nine
Objectives
Define data mining

Identify the data that can be mined
Identify the functionalities of data mining
Categorize data mining systems
Identify the application fields for data mining
Focus Areas
Initiate a classroom discussion by asking the students the following questions:
What is data mining?

Explain data mining by giving various definitions.
What is Knowledge Discovery in Databases (KDD)?
Explain the various steps in KDD using a figure.
What type of data can be mined?
List and explain the following types of data that can be mined: flat files, relational databases,
data warehouses, multimedia databases, spatial databases, time-series databases, and the World
Wide Web can be mined.
What benefits does data mining provide?
Explain the following benefits of data mining: characterization, discrimination, association
analysis, classification, prediction, clustering, outlier analysis, evolution and deviation analysis.
On what criteria can data mining systems be categorized?
Explain that data mining systems can be categorized on the following criteria:
Type of data source mined
Data model used
Kind of knowledge discovered
Mining techniques used
What are the issues in data mining?
Explain the following issues in context of data mining: security and social issues, user interface
issues, mining methodology issues, performance issues, data source issues.
Why is data mining becoming so popular?
Explain the following reasons for the growing popularity of data mining: Growing data volume,
limitation of human analysis, low cost of machine learning.
What are the various application areas of data mining?
Explain that data mining has found its application in various areas such as retail, marketing,
banking, insurance, health care, transportation, and medicine.
Additional Inputs
Data Mining
Data Mining is the process of finding new and potentially useful knowledge from data. Data mining
software allows users to analyze large databases to solve business decision problems. Data mining is, in
some ways, an extension of statistics, with a few artificial intelligence and machine learning twists
thrown in. Like statistics, data mining is not a business solution, it is just a technology.
NIIT
For example, consider a catalog retailer who needs to decide who should receive information about a new
product. The information operated on by the data mining process is contained in a historical database of
previous interactions with customers and the features associated with the customers, such as age, zip
code, their responses. The data mining software would use this historical information to build a model of
customer behavior that could be used to predict which customers would be likely to respond to the new
product. By using this information, a marketing manager can select only the customers who are most
likely to respond. The operational business software can then feed the results of the decision to the
appropriate touchpoint systems (call centers, web servers, email systems, etc.) so that the right customers
receive the right offers.
Clustering
Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that
can be used as a starting point for exploring further relationships. This technique supports the
development of population segmentation models, such as demographic-based customer segmentation.
Additional analyses using standard analytical and other data mining techniques can determine the
characteristics of these segments with respect to some desired outcome. For example, the buying habits
of multiple population segments might be compared to determine which segments to target for a new
sales campaign.
Classification
Classification, perhaps the most commonly applied data mining technique, employs a set of pre-classified
examples to develop a model that can classify the population of records at large. Fraud detection and
credit-risk applications are particularly well suited to this type of analysis. This approach frequently
employs decision tree or neural network-based classification algorithms. The use of classification
algorithms begins with a training set of pre-classified example transactions. For a fraud detection
application, this would include complete records of both fraudulent and valid activities, determined on a
record-by-record basis. The classifier training algorithm uses these pre-classified examples to determine
the set of parameters required for proper discrimination. The algorithm then encodes these parameters
into a model called a classifier.
The approach affects the explanation capability of the system. Once an effective classifier is developed, it
is used in a predictive mode to classify new records into these same predefined classes. For example, a
classifier capable of identifying risky loans could be used to aid in the decision of whether to grant a loan
to an individual.
KDD
Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies
for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet
and the widespread use of databases have created an immense need for KDD methodologies. The
challenge of extracting knowledge from data draws upon research in statistics, databases, pattern
recognition, machine learning, data visualization, optimization, and high-performance computing to
deliver advanced business intelligence and web discovery solutions.
KDD refers to a multi-step process that can be highly interactive and iterative. It includes data
selection/sampling, preprocessing and transformation for subsequent steps. Data mining algorithms are
then used to discover patterns, clusters and models from data. These patterns and hypotheses are then
rendered in operational forms that are easy for people to visualize and understand. Data mining is a step
in the overall KDD process.
Data mining (also known as Knowledge Discovery in Databases - KDD) has been defined as "The
nontrivial extraction of implicit, previously unknown, and potentially useful information from data". It
uses machine learning, statistical and visualization techniques to discovery and presents knowledge in a
form that is easily comprehensible to humans. The main idea in KDD is to discover a high level
knowledge (abstract knowledge) from lower levels of relatively raw data, or to discover a higher level of
interpretation and abstraction than those previously known.
NIIT
FAQ
1. What is KDD Process?
Ans:
The unifying goal of the KDD process is to extract knowledge from data in the context of large
databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed
knowledge, according to the specifications of measures and thresholds, using a database along with any
required preprocessing, subsampling, and transformations of that database.
2. What is Data Visualization?
Ans:
Data Visualization presents data in three dimensions and colors to help users view complex patterns.
They also provide advanced manipulation capabilities to slice, rotate or zoom the objects to identify
patterns.
3. What are the constituents of Multidimensional objects?
Ans:
Dimensions and Measures.
4. What does level specify within dimension?
Ans:
Levels specify the contents and structure of the dimension's hierarchy.
5. What is data mining?
Ans:
Data Mining is the process of finding new and potentially useful knowledge from data
6. What does Data Mining Software do?
Ans:
A Data Mining Software searches large volume of data, looking for patterns that accurately predict
behavior, such as customers most likely to maintain relationship with the company etc. Common
techniques employed by Data Mining Software include Neural Networks, Decision Trees and standard
statistical modeling.
7. What is Oracle Data Mining?
Ans:
Oracle Data Mining is enterprise data mining software that combines the ease of a Windows-based client
with the power of a fully scalable, multi-algorithmic, UNIX server-based solution. Oracle Data Mining
provides comprehensive predictive modeling capabilities that take advantage of parallel computing
techniques to rapidly extract valuable customer intelligence information. Oracle Data Mining can
optionally generate deployable models in C, C++, or Java code, delivering the "power of prediction" to
call center, campaign management, and Web-based applications enterprise-wide.
NIIT
8. How does data mining differ from OLAP?

Ans:
Simply put, OLAP compares and data mining predicts. OLAP performs roll-ups, aggregations, and
calculations, and it compares multiple results in a clearly organized graphical or tabular display. Data
mining analyzes data on historical cases to discover patterns and uses the patterns to make predictions or
estimates of outcomes for unknown cases. An analyst may use OLAP to discover a business problem, and
then apply data mining to make the predictions necessary for a solution. An OLAP user can apply data
mining to discover meaningful dimensions that should be compared. OLAP may be used to perform rollups and aggregations needed by the data mining tool. Finally, OLAP can compare data mining
predictions or values derived from predictions.
9. What are some typical data mining applications?
Ans:
Following are some of the data mining applications:
Customer retention
Cross selling
Response modeling / target marketing
Profitability analysis
Product affinity analysis
Fraud detection
NIIT
Chapter Ten
Objectives
Identify the steps in the data preparation process

Identify data mining primitives
Use data mining querying language
Identify strategies to define a graphical user interface based on a data mining query language
Identify architectures of data mining systems
Focus Areas
Introduce the first step of data mining, which is data preprocessing. Enumerate the steps to be followed to
preprocess data and make it ready for application of data mining techniques. Initiate a discussion on data
preparation. However, before that, you must explain how data is acquired for a data mining system from
the Additional Inputs section. After data acquisition, explain data preparation with the help of an
appropriate example of a dataset. Next, briefly discuss data mining primitives and then shift focus to data
mining querying language. Explain data mining querying language in detail with the help of examples.
Finally, conclude the session with a brief discussion on data mining system architectures. Hold a
discussion for identifying scenarios for selecting a particular architecture.
Additional Inputs
Preparing Data for Preprocessing

Before we can preprocess the data, it is very important to collect the right data from the right source.
The process of data collection involves the following activities:
Acquiring data from various sources

Describing the acquired data
Exploring the data
Verifying the quality of the data for the mining process
The process of acquiring data involves various activities such as identifying sources from which data can
be acquired, ways of acquiring data, and actually gathering data. Data can be acquired from multiple
sources in various ways such as interviews, surveys, observational data, and transactional databases.
After the culmination of this process, the following output must be produced:
List of methods used for acquiring data

List of sources from which data was acquired
Problems faced during the process and their subsequent resolution
List of the data acquired
After data has been acquired, it must be described. Data description involves various activities such as:
NIIT
Defining the volume of data

Identifying missing attributes and values. For example, if a table is supposed to have a field as per
the meta data or requirements specification but the field does not exist, or if all values for a given
attribute are empty, then you can conclude the data attributes and value are missing.
Identifying type and meaning of data attributes. For example, if a data field collected is named
"Distance", then you must be able to describe its meaning as which distance and in which units (such
as Km, m, or miles). This activity often requires revisiting the data acquisition step and makes it an
iterative process, till all the data has been clearly defined.
Identifying the initial, basic, format of the data by ensuring that metadata is available. Meta data is
very crucial for describing data.
After data has been adequately defined, it must be explored. Data exploration refers to understanding the
basic structure and schema of data, and evaluating its usefulness for data mining. Basic exploration can
start by analyzing the meta data. Meta data gives an idea of structure as well as the detailed meaning of
data. This helps in analyzing the usefulness of data. Beyond studying the meta data, basic exploration
uses simple as well as sophisticated statistical techniques to reveal the properties of the data.
After the data has been gathered, described, and explored, its quality needs to be verified. Quality data is
the one that is complete, least complex, has enough meta data, has unambiguous attributes, and possesses
context independence. It is essentially that before data is preprocessed and mined, its quality be verified.
This is because mining data of low quality will lead to inaccurate results and may cost an enterprise a
huge amount in terms of money and effort wasted. Data quality can be harmed at any step from gathering
data to delivering and storing it. For example, typing mistakes during storage of data can lead to loss of
quality due to incorrect data values.
After complete verification of data quality, data can now be preprocessed for mining.
Choosing a Data Mining System - Commercial Aspect

There is very little common in data mining systems in commercial applications. The functionality and
methodology required is different and so is the data to be mined. For the success of a data mining system
it is essential to identify the data mining system that you will use. A good data mining system has the
following characteristics:
It is easy to use.
It should provide at least 80% accuracy of prediction.
It should be able to perform all common data mining tasks such as cleaning, import, export, and
formatting.
FAQ
1. What is Noisy Data?
Ans:
Noise is a random error or variance in data. It can happen because of:
Faulty data collection and data entry mistake such as a typing mistake
Data transmission and storage problem
Inconsistency in naming convention
Noise makes data inaccurate for predictions and renders it futile for mining systems.
2. Which are the major data mining tasks?
Ans:
The main data mining tasks include:
Classification
Clustering
Associations
Prediction
Characterization and Discrimination
Evolution Analysis
NIIT
3. What are some other Data Mining Languages and standardization of primitives apart from DMQL?
Ans:
Some other Data Mining Languages and standardizations of primitives apart from DMQL include:
MSQL
Mine Rule
Query flocks based on Data log syntax
OLEDB for DM
CRISP-DM
4. Which Data Mining tools are used commercially?

Ans:
Some Data Mining tools used commercially are:
Clementine
Darwin
Enterprise Miner
Intelligent Miner
Mine Set
5. How can noisy data be smoothened?

Ans:
Noisy data can be smoothened using the following techniques:
NIIT
Binning
Clustering
Computer/Human inspection
Regression
Chapter Eleven
Objectives
Identify the techniques of Data Mining

Apply Apriori Algorithm
Build Decision Trees
Focus Areas
Initiate a discussion by asking the students the importance and need of Data Mining. Explain that various
techniques have to be adopted for data mining as it is a specialized field and needs to be handled through
appropriate techniques. Discuss the various techniques that can be used for mining the data. Also focus
on issues, if any, related to any of the techniques and the key features of various techniques.
Additional Inputs
Data mining using FP-Tree Algorithm

The basic idea of Apriori algorithm is to generate the set of candidate patterns of length (k+1) from the
set of frequent patterns of length k and use database scan and pattern matching to collect counts for the
candidate itemsets.
However, it has few bottlenecks:
1.
2.
Huge candidate sets generation: Needs 2100 >> 1030 candidates to be generated to discover frequent
patterns of size 100 such as {a1, a2, ,a100}.
Multiple scans of database: Needs (n +1) scans, n is the length of the longest pattern
The solution is in the form of FP-Tree algorithm that shows performance improvements over Apriori and
its variations since it uses a compressed data representation (nodes and a tree structure) and does not need
to generate candidate sets. However, FP-Tree based mining uses a complete data structure and
performance gains are very sensitive to the support threshold setting. Update of the database requires a
complete repetition of the scan process and construction of a new tree.
Definition of FP-Tree:
FP-Tree is an extended prefix-tree structure storing crucial, quantitative information about frequent
patterns.
It is highly condensed, but complete for frequent pattern mining. The advantage over Apriori is that it
avoids costly database scans and does not require candidate generation.
NIIT
Process of FP-Tree Mining

To implement FP-Tree Mining on a transaction database D, perform the following steps assuming that
support is set to 4:
Step1: Scan the transaction database D (shown in the following figure) once. Collect the set of frequent
items (with count more than minimum support) F and their supports. Sort F in descending order of
support as L, the list of frequent items.
Step 2: Create the root of an FP-Tree, and label it as "null". For each transaction Trans in D do the
following.
Select and sort the items in Trans according to the order of L. Let the sorted frequent item list in Trans be
[p|P], where p is the first element and P is the remaining list. Call insert_tree ([p|P], T], which is
performed as follows. If T has a child, N such that N.item-name = p.item-name, then increment N's count
by 1. Otherwise, create a new node N, and let its count be 1, its parent link be linked to T, and its nodelink to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree
(P, N) recursively.
Step 3: Mine FP-Tree using FP-Growth algorithm to generate frequent itemsets.

Starting at the frequent header table in the FP-tree:
1. For each item in header table, create its conditional pattern base by accumulating all prefix paths of
that item.
2. Expand the item and create its conditional FP-Tree.
3. Repeat the process recursively on each constructed FP-Tree until FP-Tree is either empty or contains
only one path.
NIIT
4.
Enumerate all the combinations of the items from a single path that forms the frequent patterns
FAQ
1. What are variations to Apriori algorithm?
Ans:
Following are some of the variations to Apriori algorithm that improves the efficiency of the original
algorithm:
Transaction reduction: Reducing the number of transaction scanned in future iterations

Partitioning: Partitioning data to find candidate itemsets
Sampling: Mining on a subset of the given data
Dynamic itemset counting: Adding candidate itemsets at different points during the scan
2. Which is the best approach when we are interested in finding all possible interactions among a set of
attributes?
Ans:
The best approach to find all possible interactions among a set of attributes is association rule mining.
3. What is over fitting in neural network?
Ans:
Over fitting is a common problem in neural network design. Over fitting occurs when a network has
memorized the training set but has not learned to generalize to new inputs. Over fitting produces a
relatively small error on the training set but will produce a much larger error when new data is presented
to the network.
NIIT
4. What is back propagation neural network?

Ans:
The back propagation is a neural network algorithm for classification that employs a method of gradient
descent. It searches for a set of weights that can model the data to minimize the mean squared distance
between the network's class prediction and the actual class label of data samples. Rules may be extracted
from trained neural networks in order to help improve the interpretability of the learned network.
NIIT
Chapter Twelve
Objectives
Identify and apply the guidelines for a KDD
Focus Areas
Recall the chain of data-->organized data (information)-->knowledge and introduce Knowledge
Discovery in Databases (KDD) as the complete process of mining knowledge from information stored in
databases. Explain the KDD environment set up referring to the Additional Inputs section. Tell the
students that a KDD environment is a part of an enterprise whose core competency is data mining.
Consequently, it is important that the enterprise understands what makes a KDD environment successful.
List the factors, which make a KDD environment successful from the Additional Inputs section. Next,
enumerate and explain the various guidelines from the textbook.
Finally, initiate a discussion on how to act on mining results.
Additional Inputs
Setting up a KDD Environment

KDD environment needs to be set up in an organization that wants to shift from guesswork and personal
opinions of experts to analysis and facts derived from actual information. KDD is often believed to shift
an enterprise's focus from "product or service" to "customer". That is, by the knowledge gained from data
mining, various improvements can be made in the product or service, which are beneficial for an
enterprise and most importantly for its customer base.
A KDD environment consists of the following elements:
A group that develops data mining skills

Communication links into the required business units
Set of tools, hardware and software for data mining
Access to the data of the entire enterprise
Ability to publish data mining results so that the enterprise can act on them
Success Factors of a KDD Environment

The success of a KDD environment depends on five crucial factors. These factors are:
The team involved in KDD should ideally consist of 8-10 multi-skilled people, with an excellent
aptitude for technical aspects as well as business aspects. They need to be experts in various
disciplines such as statistical analysis, understanding business users, comprehending data owners,
and managing.
The team should be lead by a single person who has a proven track record in KDD.
Business units must be an integral part of the KDD process. The knowledge gained during the KDD
process is foremost for the business units of an enterprise and their participation is inevitable in the
process.
IT should be involved and form the basis for the implementation of the KDD process from the
beginning.
A good pilot project should be selected and run to demonstrate the efficiency and effectiveness of the
KDD process.
NIIT
Acting on Data Mining Results

Data Mining serves no purpose if you never act on the results of the model. Acting on the results can take
several different forms:
Insights: New facts discovered from the data during the mining process, may lead to insights about
the business and about the customers.
One-time results: If the results are focused on a particular activity, such as a marketing campaign,
the marketing campaign should be carried out based on the trends determined by the data mining
process.
Remembered results: Sometimes results provide interesting and valuable information about
customers. This information should be accessible through a data warehouse.
Periodic predictions: The results may be used to score customer periodically, to determine the best
strategies to be adopted thereafter to determine whom to target for retention efforts.
Real-time scoring: The results may be incorporated into another system to provide real-time
predictions.
Fixing data: Sometimes the data mining results uncover data problems. These results often results in
cleaner, more complete data for the future.
FAQ
1. What is the difference between KDD and data mining?
Ans:
KDD refers to the overall process of discovering useful knowledge from data. It also includes the choice
of encoding schemes, preprocessing, sampling, and projections of the data prior to the data-mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the
additional steps of the KDD process. It is essentially the modeling step in the KDD process.
2. Is data stored for data mining different from other operational systems?
Ans:
The two systems differ in the usage patterns associated with data mining. Some of the basic differences
are:
Operational systems support transactional processing whereas data mining systems support
analytical processing.
Operational systems are process-oriented whereas data mining systems are subject-oriented.
Operational systems are concerned with current data whereas data mining systems deal with
historical data.
Operational systems are updated frequently and have volatile data whereas data mining systems have
non-volatile data and are rarely changed.
Operational systems are optimized for fast updates whereas data mining systems are optimized for
fast retrievals.
3. What are the primary goals of the KDD process?

Ans:
The two primary goals of data mining are:
NIIT
Prediction of unknown or future values

Description, which focuses on finding human-interpretable patterns
4. Does KDD need to be scalable?

Ans:
KDD should be scalable because of the ever-increasing data in enterprises. Limiting the size of a data
mining system will affect the accuracy of predictions.
NIIT
Chapter Thirteen
Objectives
Identify areas of application of data mining
Focus Areas
Start the session by asking students to name the fields where data mining has practical application. List
the various fields and explain the application of data mining in these fields. Let the students participate
actively in the discussion to validate their understanding of the data mining process.
Refer to the Additional Inputs section to explain application of data mining in fields such as
telecommunication and manufacturing.
Additional Inputs
Data Mining for the Telecommunication Industry

The telecommunication industry can be considered to comprise of three elements - networks, operators,
and customers. Data mining is crucial and beneficial for each of these three elements. That is data mining
can be used for analyzing and predicting. For instance, it can be used for analysis of network operations
such as fraud detection, call record analysis, and customer behavior modeling. The need for data mining
has grown with the integration of the Internet and the telecommunication world and consequent
expansion of the market.
Tools for OLAP visualization, linkage visualization, association visualization, clustering, and outlier
visualization have been shown to be very useful for telecommunication data analysis. Specifically, data
mining is crucial for the telecommunication industry in the following ways:
The data collected in the telecommunication industry includes various dimensions such as calling
time, duration, location of caller, location of the called, and type of call. The multidimensional
analysis of this data can be used to identify and compare the data traffic, system workload, resource
usage, user group behavior, and profit. The telecommunication data can be consolidated into large
data warehouses and multidimensional analysis using OLAP and visualization tools can be routinely
performed on it.
The data collected can be mined to identify fraudulent activity. It is important to identify potentially
fraudulent users and their typical usage patterns, detect attempts to gain fraudulent entry to customer
accounts, and discover unusual patterns such as periodic calls from automatic dial-out equipment.
The discovery of association and sequential patterns in multidimensional analysis can be used to
promote telecommunication services. For example, usage patterns by customer group, by month, and
by time of day can be found.
Data Mining for the Manufacturing Processes

Data mining has proved to be useful in marketing, sales and financial units of an industry. It can also
prove to be useful for the manufacturing processes in an industry through industrial process
improvement. Data mining is easier and more beneficial to apply in a manufacturing unit, which has a
factory automation environment. Nevertheless, it can prove to be useful regardless of whether factory
automation has been implemented or not.
The efficiency of many manufacturing processes depends on various parameters. Small variations in
temperature, pressure, weight, viscosity, speed, humidity, and air pressure can affect the productivity of a
NIIT
manufacturing unit. Data Mining has helped to successfully understand the variations in these factors and
their inter-relationship in diverse manufacturing processes.
Data mining can help in improving industrial processes in various ways such as:
Reducing costs and improving service throughout the supply chain in a supply chain management
system
Improving sales and operations planning as a result of conducting historical sales analysis to improve
customer service, reduce inventory cost, reduce obsolete raw material costs, and reduce shipping
costs through improved routing and distribution.
Supporting process and product quality initiatives, through analyzing trends in defects. Early
detection and active management of defects can have a significant impact on the profitability of a
project.
Measuring worker productivity and evaluating the resources and processes based on it.
Solutions to Chapter Thirteen Questions

1. Identify an application and also explain the techniques that can be incorporated into, in solving the
problem using data mining techniques.
Ans.
An important application and contribution of data mining has been in the field of DNA analysis.
Research through mining data related to DNA has proved useful in finding out causes of various genetic
diseases.
Humans have around one hundred thousand genes in their bodies. Various diseases are deeply related to
gene patterns. It is therefore important to analyze these gene patterns. Some data mining techniques that
can help in DNA data analysis include:
Association analysis for identifying co-occurring gene sequences.

Path analysis for co-relating genes to different stages of development of the disease.
Similarity search and comparison among DNA sequences.
Visualization tools and genetic data analysis can be used to analyze DNA data with the help of
graphs, trees, cuboids and chains.
To understand how these techniques help in the DNA analysis field, consider the case of path analysis.
Different genes may become active at different stages of a disease. Using path analysis the sequence of
genetic activities across different stages of the disease can be identified. This knowledge may help in
developing specific medicines for specific stages.
FAQ
1. What are different types of Query Answering Mechanisms?
Ans:
Query answering can be classified into two categories based on their method of response:
Direct Query answering: It means that query is answered by returning exactly what is being asked.
Intelligent Query answering: It refers to analyzing the intent of the query and providing
generalized, neighborhood, or associated information relevant to the query.
2. Are there any social concerns related to data mining?

Ans:
The most important social issue is privacy of data pertaining to individual. KDD poses a threat to
privacy. The discovered patterns often classify individuals into categories, revealing their confidential
personal information. Moreover, it raises very sensitive and controversial issues, such as those that
involve race, gender or religion. In addition, it may correlate and disclose confidential, sensitive facts
about individuals.
NIIT
3. What is Market Basket Analysis?

Ans:
Market Basket Analysis is one of the most common and useful techniques for data analysis for
marketing. The purpose of market basket analysis is to determine what products customers purchase
together in a market. This can be helpful to a retailer, merchant or manufacturer or any other type of
organization interested in studying consumer buying patterns.
4. What is Visual Data Mining?
Ans:
Visual data mining integrates data mining and data visualization in order to discover implicit and useful
knowledge from large data sets.
NIIT
NIIT

Data Warehousing and Data Mining

Загружено:

Сведения о документе

Оригинальное название

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Data Warehousing and Data Mining

Загружено:

Авторское право:

Доступные форматы

PREREQUISITES

Coordinator Guide Data Warehousing and Data Mining 1

Describe basic database management concepts

2 Coordinator Guide Data Warehousing and Data Mining

Consider the following definitions of normal forms:

Eliminate repeating groups in individual tables.

Second normal form (2NF)

Third normal form (3NF)

Coordinator Guide Data Warehousing and Data Mining 3

After implementing 1NF on the above table:

Entity Relationship Model

Three different types of relationships can exist between two entities:

4 Coordinator Guide Data Warehousing and Data Mining

Coordinator Guide Data Warehousing and Data Mining 5

Data redundancy and inconsistency

2. Are relational databases the only possible type of database models?

6 Coordinator Guide Data Warehousing and Data Mining

Define data warehouse

Evolution of Data Warehousing

A large volume of data

Contain only transactional/recent data.

Coordinator Guide Data Warehousing and Data Mining 7

Data Warehouse vs. Data Mart

Data Warehouse and CRM

Data Warehouse and DSS

Need for Security of a Data Warehouse

8 Coordinator Guide Data Warehousing and Data Mining

Characteristics of a Data Warehouse

Consolidated and consistent data

Consolidated and Consistent Data

Coordinator Guide Data Warehousing and Data Mining 9

5. Where can you use data warehouse successfully?

10 Coordinator Guide Data Warehousing and Data Mining

6. When and where is a data mart useful?

Coordinator Guide Data Warehousing and Data Mining 11

Identify the characteristics of the Star Flake Schemas

12 Coordinator Guide Data Warehousing and Data Mining

Coordinator Guide Data Warehousing and Data Mining 13

Overview of how OLAP Services are Correlated to the Data Warehouse

Solutions to Chapter Three Questions

14 Coordinator Guide Data Warehousing and Data Mining

2. Why the OLTP cannot provide history data for analysis?

Coordinator Guide Data Warehousing and Data Mining 15

Identify a partitioning strategy

Optimizing through Stripping

16 Coordinator Guide Data Warehousing and Data Mining

Coordinator Guide Data Warehousing and Data Mining 17

2. Are there any disadvantages of data partitioning?

18 Coordinator Guide Data Warehousing and Data Mining

Goals Associated with Aggregation

Selecting Candidates for Aggregation

Coordinator Guide Data Warehousing and Data Mining 19

Create the aggregation as per the following design goals:

20 Coordinator Guide Data Warehousing and Data Mining

Identify the need for a data mart

Data Mart vs. Data Warehouse

In hundreds of thousands of dollars

Data about specific

Data about the entire enterprise

An enterprise usually has multiple

An enterprise usually has one

Data Mart Types

Independent data mart