This course is a part of the Information Technology program (B.Sc. (IT)) of Kuvempu University.
A student registering for the fifth semester of B.Sc.(IT) of Kuvempu University must have completed the
fourth semester of B.Sc.(IT). The student should have attained the knowledge of the following modules:
Algorithms
Java Programming
Unix & Shell Programming
Software Engineering
CHAPTER-SPECIFIC INPUTS
Chapter One
Objectives
In this chapter, the students have learned to:
Focus Areas
Introduce the need for a database management system with the help of an example, such as the need for a
company to organize its data and manage it through a front-end application. Ensure that the students
understand the drawbacks of a file management system as compared to a database management system
(DBMS). Explain elementary database concepts using ample examples.
Next, give the following analogy to the students to introduce data warehouses.
"A large insurance company has stored and organized its relevant official papers on racks in each room.
It stores a large volume of papers in a separate warehouse. Some of the papers have now become
historical and are not required in day-to-day working. However, these papers cannot be destroyed, as they
are an essential asset for analysis of market trends and the company's progress. Moreover, it is also
mandatory to maintain such papers for a fixed period of time. Therefore, it becomes essential that the
papers required for day-to-day processing are stored in areas where they can be accessed quickly and the
historical papers are stored for future reference in a bigger storage area.
If you compare a database to the racks containing papers for day-to-day processing, then the database's
counterpart for the warehouse is called a data warehouse."
Now, explain the term data warehouse and the usage of a data warehouse. Similarly, explain data mining.
After introducing these concepts, tell the students that before proceeding further, they must understand
two important database management concepts called normalization and entity relationships.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Database Normalization
Normalization is the process in which database objects such as attributes, keys, and relationships are
restructured to remove redundancy and dependencies, and stabilize and simplify the database.
Normalization is an important concept because it has a direct impact on the storage efficiency of the
database.
Normalization is done on the basis of certain rules. Each set of rules makes a database object, such as a
table, more "normal". Each set of rules is called a "normal form." If the first set of rules is observed, the
database is said to be in the "first normal form." If the second set of rules is observed, the database is
considered to be in the "second normal form." Finally, if the third set of rules is observed, the database is
considered to be in the "third normal form." The third normal form is the highest level of normalization
necessary for most applications.
The rules for the second normal form are:
Ensure that the table meets all the requirements of the first normal form.
Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors using foreign keys.
The rules for the third normal form are:
Ensure that the table meets all the requirements of the second normal form.
Remove columns that are not dependent upon the primary key.
Example:
Consider the following unnormalized table to understand normalization:
Students

Student | Subject 1  | Teacher 1 | Subject 2      | Teacher 2 | Subject 3  | Teacher 3
John    | English-01 | George    | Mathematics-01 | Mary      | Science-01 | Greg
Tony    | English-01 | James     | Mathematics-02 | Mary      | Science-01 | Peter
John    | English-02 | Pat       | Mathematics-02 | Mary      | Science-01 | Greg
The sample table Students, given above, is unnormalized because it has repeating (subject-teacher)
groups for the student and also does not have any primary key. To convert the table into 1NF, it must be
given a primary key and repeating groups must be eliminated.
The 1NF form of the above table is:
RNo  | Student | Subject     | Teacher | Code
1102 | John    | English     | George  | 01
1102 | John    | Mathematics | Mary    | 01
1102 | John    | Science     | Greg    | 01
1105 | Tony    | English     | James   | 01
1105 | Tony    | Mathematics | Mary    | 02
1105 | Tony    | Science     | Peter   | 02
1109 | John    | English     | Pat     | 02
1109 | John    | Mathematics | Mary    | 02
1109 | John    | Science     | Greg    | 01
The table is now in the first normal form because:
Each row is uniquely identified by a primary key (RNo now identifies that there are two different
students with the name John).
The sequence of rows and columns is insignificant (the sequence of columns was significant in the
unnormalized form).
Each column is unique (unlike the unnormalized form, where there were repeating groups such as
Subject 1, Subject 2, and Subject 3).
Each column has single values (subject name and code are now two separate columns, unlike the
unnormalized form, where the code was stored as part of the column name for the subject).
Similarly, other normal forms can be implemented after the requirements for the previous normal forms
are met.
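To make the progression toward higher normal forms concrete, the following SQL sketch shows one possible way of splitting the 1NF Students data so that subject and teacher details are no longer repeated for every student row. The table and column names (STUDENTS, SUBJECTS, STUDENT_SUBJECTS, and so on) are hypothetical and are used only for illustration.

-- Students: one row per student; RNo is the primary key
CREATE TABLE STUDENTS (
    RNO          NUMBER       PRIMARY KEY,
    STUDENT_NAME VARCHAR2(50)
);

-- Subjects: each subject and its teacher are stored once, not once per student
CREATE TABLE SUBJECTS (
    SUBJECT_CODE VARCHAR2(20) PRIMARY KEY,   -- for example, 'English-01'
    SUBJECT_NAME VARCHAR2(50),
    TEACHER_NAME VARCHAR2(50)
);

-- Student-Subject: resolves the many-to-many relationship through foreign keys
CREATE TABLE STUDENT_SUBJECTS (
    RNO          NUMBER       REFERENCES STUDENTS (RNO),
    SUBJECT_CODE VARCHAR2(20) REFERENCES SUBJECTS (SUBJECT_CODE),
    PRIMARY KEY (RNO, SUBJECT_CODE)
);

Each non-key column now depends only on the key of its own table, which is the essence of the second and third normal forms.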
Entities in a database can share a relationship. A relationship is an association between several entities.
For example, two entities Employee and Department can be related as shown in the following figure:
[Figure: an ER diagram in which the entity Employee (First Name, Last Name, Address, DOB) is related to the entity Department (Name, DNo., Head) through the relationship Works in]
Relationships can be of the following types:
1. One-to-One Relationships: This type of relationship exists when a single occurrence of an entity is
related to just one occurrence of another entity. For example, a person has one PAN number and a
PAN number is allotted to only one person. Therefore, the entity Person has a one-to-one
relationship with the entity PAN number.
2. One-to-Many Relationships: This type of relationship exists when a single occurrence of an entity
is related to many occurrences of another entity. For example, a student studies in one school but a
school has various students. Therefore, the relationship between School and Students is one-to-many.
3. Many-to-Many Relationships: This type of relationship exists when many occurrences of an entity
are related to many occurrences of another entity. For example, resources are allocated to many
projects, and a project is allocated many resources. Therefore, the relationship between Resources
and Projects is many-to-many.
The relationships between entities are depicted using specialized graphics known as ER diagrams or
Entity-relationship diagrams.
FAQ
1. What are the disadvantages of a File Management System as compared to a DBMS?
Ans:
Some of the disadvantages of a file management system as compared to a database management system are:
Chapter Two
Objectives
In this chapter, the students have learned to:
Focus Areas
Initiate the discussion by asking the following questions:
What is data?
What is the need to store this data?
How should the data be stored?
What techniques can be adopted to retrieve enormous amounts of data efficiently?
You can also discuss the architecture for storing voluminous data with the help of a data warehouse.
Besides that, the differences between a data warehouse, a data mart, and metadata should be discussed
clearly.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
On Line Transaction Processing (OLTP) systems are not suitable for this kind of complex analysis.
This is because OLTP systems:
There are therefore two distinct uses of data in organizations. The first is the analysis of historical data
for business analytical purposes. The second is the usage of data that takes care of the daily transactional
activities.
The use of data for business analytical purposes has led to the development of data warehouses. A data
warehouse is one of the key components of Business Intelligence systems. The data stored in a data
warehouse is used by querying and reporting tools, data mining applications, and On Line Analytical
Processing (OLAP) applications for business analysis.
Data Mart
A data mart is a specific subset of the contents of a data warehouse, stored within its own database. It
contains data focused at a department level or on a specific business area of the organization. The volume
of data in a data mart is less than that of a typical data warehouse, making query processing faster.
Experts use data marts for analysis, rather than using the main data warehouse. For example, a large
organization can treat its individual departments or divisions as independent business units. Each of these
units can have its own data mart, used regularly by the analysts of that specific unit. Data marts
contribute to the main data warehouse on a regular basis.
Metadata
Due to the large volume of data that exists and is queried in a data warehouse, it is often useful to classify
the type of data. This helps improve query response times. In a data warehouse, a specific type of data,
known as metadata, is used. Metadata contains information about types of data.
For example, consider the class of an object and the object itself, that is, an instance of the class. If data is
the instance, then metadata is the class. Metadata is data about data.
[Figure: an RDBMS receives both updates and queries, whereas a data warehouse receives only data loads and queries]
Read-only Data
After data has been moved to the data warehouse, it cannot be changed. Data stored in a data warehouse
pertains to a point in time or to a specific timeframe; therefore, it must never be updated. The only
operations that occur in a data warehouse, once it has been set up, are loading and querying.
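As a minimal sketch of this load-and-query pattern, assuming a hypothetical staging table SALES_STAGING and fact table SALES_FACT, the only statements routinely issued against the warehouse would look like the following; UPDATE and DELETE statements are normally not issued against warehouse data.

-- Periodic load: append the newly extracted rows to the warehouse
INSERT INTO SALES_FACT (SALE_DATE, PRODUCT_ID, REGION_ID, UNITS_SOLD)
SELECT SALE_DATE, PRODUCT_ID, REGION_ID, UNITS_SOLD
FROM SALES_STAGING;

-- Analysis: read-only queries against the loaded data
SELECT REGION_ID, SUM(UNITS_SOLD)
FROM SALES_FACT
WHERE SALE_DATE BETWEEN DATE '2003-01-01' AND DATE '2003-12-31'
GROUP BY REGION_ID;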
Differences in the Way Data Behaves in an RDBMS and a Data Warehouse
FAQ
1. On which layer of the application architecture does a data warehouse operate?
Ans:
A data warehouse is a server-side repository for storing data.
2. What is a data warehouse?
Ans:
A data warehouse is a huge repository of data used for very complex business analysis.
3. What are the benefits of data warehousing?
Ans:
The main benefit of data warehousing is to keep data in such a form that complex business analysis can
be done in the minimum amount of time.
4. What are the application areas of a data warehouse?
Ans:
There are various application areas of a data warehouse. Some of these are:
Airlines
Meteorology
Logistics
Insurance
Chapter Three
Objectives
In this chapter, the students have learned to:
Focus Areas
Conduct a quick recap to ensure that the students have clearly understood the concept of a data
warehouse. Initiate a discussion on the concept of schemas. Explain the Star Flake Schema. Discuss the
use of Fact and Dimension Tables. Also, clarify the distinction between Fact and Dimension Table so that
there is no confusion in this respect.
Discuss the guidelines to be followed while creating the Fact and Dimension Tables. Also, discuss the
concept of query redirection and its purpose in helping users to get the most appropriate results in time.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Fact Tables
Fact tables contain data that describe a specific event within a business, such as a financial transaction or
a product sale. Fact tables can also contain data aggregations such as sales per month per region. Under
normal conditions, existing data within a fact table is not updated; however, new data is loaded. Fact
tables contain the majority of data stored in a data warehouse. This makes the structuring of fact tables
very crucial. The features of fact tables are:
Contain data that is static in nature
Contain many records, possibly billions
Store numeric data
Have multiple foreign keys
For example, a fact table can contain data such as product ID numbers and geographical IDs of areas.
Dimension Tables
Dimension tables contain data used to reference the data stored in the fact table, such as product
descriptions, customer names, and addresses. Data, in this case, is mainly stored in characters. It is
possible to optimize queries by separating the fact table data from the dimension table data. Dimension
tables do not contain as many rows as fact tables. Dimension tables can change and must be structured to
permit change. Dimension tables:
Can be updated
Have fewer rows than fact tables
Store mainly character data
Have many columns to manage dimension hierarchies
Have one primary key, also known as the dimensional key
For example, a Retailer dimension table can contain retailer ID and retailer name columns.
Keys: Unique identifiers used to query data stored in the central fact table. The dimensional key, such as
a primary key, links a row in the fact table with one dimension table. This structure makes it easy to
construct complex queries and support a drill-down analysis in decision-support applications.
Star Schema
The star schema is a relational database structure and is a popular design technique used to implement a
data warehouse. In this schema, data is maintained in a single fact table at the centre of the schema. Each
dimension table is directly related to the fact table by a key column.
The star schema design increases query performance by reducing the volume of data read from disk to
satisfy a query. Queries first analyze data in the dimension tables to obtain the dimension keys that index
into the central fact table. In this way, the number of rows that have to be scanned to satisfy a query is
reduced greatly.
[Figure: a star schema in which a central SALES fact table (Season ID, Regional ID, Retailer ID, Product ID, Units Sold, Seasonal Discount) is linked to four dimension tables: a SEASON table (Season ID, Season Description, Start Date, Start Month, End Date, End Month, Start Year, End Year), a REGIONAL table (Region ID, Region Description, Region Location District, Region Location State, Region Location Country, Regional Incharge ID), a RETAILER table (Retailer ID, Retailership start date, Retailership validity period), and a PRODUCT table (Product ID, Product Description, Launch Date, Continuity Status, Unit Price, Brand Name, Product Property 1 Value, Product Property 2 Value)]
Star Schema
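The following SQL sketch shows how a small version of the star schema in the figure could be declared. It is illustrative only; the column lists are abbreviated versions of those shown in the figure, and the data types are assumptions.

-- Dimension tables: one primary (dimensional) key each
CREATE TABLE SEASON   (SEASON_ID   NUMBER PRIMARY KEY, SEASON_DESCRIPTION VARCHAR2(50));
CREATE TABLE REGIONAL (REGIONAL_ID NUMBER PRIMARY KEY, REGION_DESCRIPTION VARCHAR2(50));
CREATE TABLE RETAILER (RETAILER_ID NUMBER PRIMARY KEY, RETAILERSHIP_START_DATE DATE);
CREATE TABLE PRODUCT  (PRODUCT_ID  NUMBER PRIMARY KEY, PRODUCT_DESCRIPTION VARCHAR2(50));

-- Central fact table: numeric measures plus one foreign key per dimension
CREATE TABLE SALES (
    SEASON_ID         NUMBER REFERENCES SEASON   (SEASON_ID),
    REGIONAL_ID       NUMBER REFERENCES REGIONAL (REGIONAL_ID),
    RETAILER_ID       NUMBER REFERENCES RETAILER (RETAILER_ID),
    PRODUCT_ID        NUMBER REFERENCES PRODUCT  (PRODUCT_ID),
    UNITS_SOLD        NUMBER,
    SEASONAL_DISCOUNT NUMBER
);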
OLAP
The arrangement and analysis of data in a data warehouse is done using On Line Analytical Processing
(OLAP) systems. The historical data supports business decisions at various levels or departments, from
strategic business planning to financial performance appraisal of a separate organizational unit and/or
distinct business functional areas.
After being collected from various heterogeneous sources, data is extracted, cleaned or scrubbed, and
stored in the data warehouse in a homogeneous form. This is the first part of the activity.
The analysts now have to choose the right data. Applications have to be built to assist analysts in
completing their activity in a limited amount of time. It is necessary for another application to arrange
data in a format that makes it easily accessible to the end user on querying. OLAP arranges data in
an easily accessible format typical to a data warehouse.
OLAP technology thus enables data warehouses to be used effectively for:
Online analysis
Providing quick responses to complex iterative queries posed by analysts
OLAP achieves this through its multi-dimensional data models. Multi-dimensional data models are to a
data warehouse what tables are to an RDBMS. These are the units where data is stored. These models are
used to organize and summarize large amounts of data, making it easy to evaluate using online analysis
and graphical tools. These models also provide the speed and flexibility to support the analysts, helping them
complete complex analysis in an acceptable time limit.
[Figure: heterogeneous operational/transactional data passes through extraction, transformation, and loading into the data warehouse's data storage, from where OLAP makes the data available to business analysts and other business users]
The OLTP databases have basic constituents such as tables and relations. In the case of data warehouses,
these constituents are fact tables and dimension tables. A fact table contains measurable columns such as
sales, costs, and expenses. A dimension table contains columns on which fact table columns can be
categorized. For example, a Product dimension table can have columns such as product family, product
category, and product ID on which sales, costs and expenses can be categorized. Similarly, these fact
table columns can also be categorized on Stores table columns based on country and region.
The following table displays the sales categorized according to the product family (from Product
dimension) and country (from Store dimension):
Product Family | USA     | Germany | Mexico
Drink          | 2500000 | 450000  | 670000
Food           | 3400000 | 675000  | 774800
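A cross-tabulation such as the one above can be produced from a star schema with an ordinary join and GROUP BY. The following query is only a sketch; the Sales fact table, the Product and Store dimension tables, and their column names are assumptions made for the example.

-- Total sales categorized by product family and store country
SELECT p.PRODUCT_FAMILY,
       s.COUNTRY,
       SUM(f.SALES_AMOUNT) AS TOTAL_SALES
FROM SALES_FACT f
     JOIN PRODUCT p ON p.PRODUCT_ID = f.PRODUCT_ID
     JOIN STORE   s ON s.STORE_ID   = f.STORE_ID
GROUP BY p.PRODUCT_FAMILY, s.COUNTRY
ORDER BY p.PRODUCT_FAMILY, s.COUNTRY;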
FAQ
1. What are the benefits of OLTP?
Ans:
On Line Transaction Processing (OLTP) assists in storing current business transactional data. It also
allows a large number of concurrent users to access the data at the same time.
Chapter Four
Objectives
In this chapter, the students have learned to:
Focus Areas
Introduce partitioning by asking the students to identify ways in which the performance of a data
warehouse and its manageability can be improved. Inform students that if data is broken into smaller
manageable chunks, it can be scanned faster and managed easily. In this context, introduce partitioning.
Ask students to identify the advantages of partitioning, such as faster access, better manageability due to
smaller size, improved recovery time, and reduced effect of failure or breakdown.
Explain the various types of partitioning with the help of examples for each. You can also explain
striping, discussed in the Additional Inputs section. This section also gives examples of how techniques
such as striping can help in optimization. To demonstrate partitioning, you may explain it with
reference to the "Partitioning in Oracle" section discussed in Additional Inputs.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Types of Striping
Striping is of the following types:
Global: Global striping spreads data across disks and partitions. It is often used when you need to access data in
only one partition. In such cases, you can spread the data in that partition across many disks to improve
the performance of parallel execution operations. However, global striping has a single point of failure.
That is, if one disk fails and the disks are not mirrored, all the partitions are affected.
Local: Local striping deals with partitioned tables and indexes. It is a simple form of striping in
which each partition has its own set of disks and files. Access to the disks on which a partition
resides, or to its files, does not overlap with other partitions. Unlike global striping, an advantage of local striping is that
if one disk fails, it does not affect other partitions. However, its main disadvantages are on the cost
and maintenance side. In local striping, since each partition requires multiple disks of its own, these
multiple hardware components add a cost and maintenance overhead. In this case, if
you want to limit the number of disks used, you will have to reduce the number of partitions. This
consequently makes local striping inappropriate for parallel operations. Local striping is a good
choice if availability is a critical concern in your data warehouse.
Automatic: Automatic striping is striping done by the operating system itself, based on some
settings. It is a simple and flexible way of striping and is useful for parallel processing requirements for the same operation or multiple operations. However, the advantages of this form of striping are limited
by the hardware, such as I/O buses. That is, unlike local striping, the degree of parallelism (DOP) is not a function of the number of disks.
As per Oracle's recommendation, the stripe size must be at least 64 KB for good performance.
Manual: Manual striping is the process of adding multiple files to each tablespace, such that each
file is on a separate disk. When manual striping is used, the degree of parallelism
depends on the number of disks rather than on the number of processors. If manual striping is used
correctly, the system's performance improves significantly.
Partitioning in Oracle
Oracle is one of the most preferred choices in the field of data warehousing. Oracle 9i supports various
types of partitioning techniques, such as:
Hash partitioning: In this type of partitioning a table's records are partitioned based on the value of
a particular field in the table. The value, which has to be mapped for deciding each record's partition,
is called the hash value.
Range partitioning: In this type of partitioning a table's records are partitioned based on the range
of values in a particular field of the table.
List partitioning: In this type of partitioning a table's records are partitioned based on a value from a
list of values.
Composite range-hash partitioning: In this type of partitioning, a table is first partitioned on the
basis of range partitioning and then the partitions are further partitioned based on hash partitioning.
Composite range-list partitioning: In this type of partitioning, a table is first partitioned on the
basis of range partitioning and then the partitions are further partitioned based on list partitioning.
As an example, consider how hash partitioning can be performed. Suppose you are designing a data
warehouse for an insurance company. The company has released three types of policies till date and
stores the details of its investors in a table INSURANCE_DATA. This table has a field POLICY_TYPE.
You can partition the table through hash partitioning, with POLICY_TYPE as the hash value, as follows:
CREATE TABLE INSURANCE_DATA
(POLICY_NUMBER, ..., POLICY_TYPE)
PARTITION BY HASH (POLICY_TYPE)
(PARTITION P1_TYPE TABLESPACE TBLSPC01,
 PARTITION P2_TYPE TABLESPACE TBLSPC02,
 PARTITION P3_TYPE TABLESPACE TBLSPC03);
This way, you can map one policy type to one partition.
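Range partitioning can be sketched in a similar way. The following illustrative example assumes a hypothetical INSURANCE_HISTORY table with a POLICY_START_DATE column and places each year's policies in a separate partition:

CREATE TABLE INSURANCE_HISTORY
(POLICY_NUMBER NUMBER, POLICY_START_DATE DATE)
PARTITION BY RANGE (POLICY_START_DATE)
(PARTITION P_2001 VALUES LESS THAN (TO_DATE('01-01-2002','DD-MM-YYYY')) TABLESPACE TBLSPC01,
 PARTITION P_2002 VALUES LESS THAN (TO_DATE('01-01-2003','DD-MM-YYYY')) TABLESPACE TBLSPC02,
 PARTITION P_MAX  VALUES LESS THAN (MAXVALUE) TABLESPACE TBLSPC03);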
FAQ
1. Name the two important parameters that decide the granularity of partitions.
Ans:
Two important factors that decide the granularity of partitions are the overall size and the manageability of
the system. Both parameters have to be balanced against each other while deciding on a partitioning
strategy. Suppose data containing information about the population is partitioned on the basis of state;
two maintenance-related issues that could be faced by the administrator are:
If a query needs information about all the states, such as the particular languages spoken in them,
all the state partitions have to be scanned.
If the definition of a state changes (a state is redefined), the entire fact table needs to be built again.
Chapter Five
Objectives
In this chapter, the students have learned to:
Define aggregation
Design and create summary tables
Focus Areas
Introduce aggregation as an operation that has a very significant impact on the performance of a data
warehouse. Explain the need for aggregates. Also, identify goals associated with aggregation from the
Additional Inputs section. Explain the considerations for designing aggregations referring to Additional
Inputs section. Also, explain the concept of an aggregate navigator.
Explain the design and creation of summary tables (also called aggregates or aggregate fact tables).
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
An aggregation strategy should meet the following goals:
Provide dramatic performance gains for as many categories of user queries as possible.
Add only a reasonable amount of extra data storage to the warehouse. What is reasonable is up to the
DBA. However, many data warehouse DBAs strive to increase the overall disk storage for the data
warehouse by a factor of two or less.
Be completely transparent to end users and to application designers except for the obvious
performance benefits; in other words, no end-user application SQL should reference the aggregates.
Directly benefit all users of data warehouse, regardless of which query tool they use.
Keep the impact of the cost of the data extract system to the minimum. Inevitably, a lot of aggregates
must be built every time data is loaded, but their specification should be as automated as possible.
Keep the impact of the DBA's administrative responsibility to the minimum. The metadata that
supports aggregates should be limited and easy to maintain.
While designing aggregations, consider the following:
Dimensions whose attributes could be candidates for aggregation because they are used often
Attributes commonly used together
The number or range of values for a particular attribute
Possible aggregates that may be used to create other aggregates on the fly, if not directly required
by the business users
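As one concrete illustration of these considerations, a summary table that pre-aggregates sales by month and region could be created from the base fact table as follows. The table and column names (SALES_FACT, TIME_DIM, STORE_DIM, and so on) are assumptions made for the example.

-- Aggregate fact table: one row per month per region instead of one row per sale
CREATE TABLE SALES_MONTH_REGION_AGG AS
SELECT t.MONTH_ID,
       s.REGION_ID,
       SUM(f.UNITS_SOLD)   AS TOTAL_UNITS,
       SUM(f.SALES_AMOUNT) AS TOTAL_SALES
FROM SALES_FACT f
     JOIN TIME_DIM  t ON t.TIME_ID  = f.TIME_ID
     JOIN STORE_DIM s ON s.STORE_ID = f.STORE_ID
GROUP BY t.MONTH_ID, s.REGION_ID;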
Aggregate Navigator
An aggregate navigator is a middleware component between a client and the database server. It intercepts
the client's SQL queries and transforms them into SQL queries that can be applied to aggregates. It
contains up-to-date metadata about the aggregates of the data warehouse. Based on this metadata, it
finds the appropriate aggregate that can handle the base SQL query sent by a client and transforms the
query for that aggregate. The aggregate navigation algorithm suggested by Kimball is as follows:
1. Rank the aggregate fact tables of the data warehouse from the smallest (most summarized) to the largest.
2. For the smallest candidate, compare the tables and fields referenced in the client's query with the fields available in that aggregate and its dimension tables.
3. If the aggregate can supply every field the query needs, rewrite the query to use the aggregate and pass it to the database server.
4. Otherwise, move to the next larger aggregate and repeat the comparison; if no aggregate qualifies, run the query against the base fact table.
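As an illustration of what the navigator does, and assuming the hypothetical summary table sketched earlier, a query written by a client tool against the base tables could be transparently rewritten against the aggregate:

-- Query issued by the client tool (references only base tables)
SELECT s.REGION_ID, SUM(f.UNITS_SOLD)
FROM SALES_FACT f JOIN STORE_DIM s ON s.STORE_ID = f.STORE_ID
GROUP BY s.REGION_ID;

-- Equivalent query issued by the aggregate navigator (references the aggregate)
SELECT REGION_ID, SUM(TOTAL_UNITS)
FROM SALES_MONTH_REGION_AGG
GROUP BY REGION_ID;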
FAQ
1. Are there any risks associated with aggregation?
Ans:
The main risk associated with aggregates is the increase in disk storage space they require.
2. Once created, is an aggregate permanent?
Ans:
No. Aggregates keep changing as per the needs of the business. In fact, they can be taken offline or put
online at any time by the administrator. Aggregates that have become obsolete can also be deleted to free
up disk space.
3. Can operations such as MIN and MAX be determined once a summary table has been created?
Ans:
Operations such as MIN and MAX cannot be determined correctly once the summary table has been
created. To determine their values correctly, they must be calculated and stored at the time the summary
table is derived from the base table.
4. How much storage increase might be required in the data warehouse system when using aggregates?
Ans:
With aggregates, the storage needs typically increase by a factor of one (that is, the disk space roughly doubles), and sometimes even by a factor of two.
Chapter Six
Objectives
In this chapter, the students have learned to:
Focus Areas
Introduce the concept of data marts using the store-house retail outlet analogy given in the book.
Compare data marts with data warehouses to bring out the difference between the two. Explain the need
for a data mart. In this context, explain that a data warehouse can actually be built from scratch or be built
up on data marts. Introduce EDMA and DS/DMA as two data warehouse architectures based on data
marts.
Explain the access control issues in data mart design briefly.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
                    | Data Mart       | Data Warehouse
Purpose             | Business-driven | Technology-driven
Scope               | Localized       | Centralized
Cost                |                 | In millions of dollars
Development Time    | 6-8 months      | 1.5-2 years
Data                | Number          |
Granularity of data | Detailed        | Summarized
An independent data mart is one in which data is sourced from a legacy application or a source other
than the data warehouse. It is created with the aim of being integrated into a data warehouse later.
A dependent data mart is one whose data source is the data warehouse itself.
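For example, a dependent data mart for the sales department could be populated directly from the warehouse. The following sketch uses hypothetical table and column names.

-- Sales data mart: a department-level subset of the warehouse fact table
CREATE TABLE SALES_MART AS
SELECT f.TIME_ID, f.PRODUCT_ID, f.UNITS_SOLD, f.SALES_AMOUNT
FROM SALES_FACT f
WHERE f.DEPARTMENT_ID = 10;  -- only the rows relevant to the sales department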
A data warehouse architecture which implements an incremental approach to designing a data warehouse
uses data marts and a shared Global metadata repository (refer to chapter 6). This architecture also
supports a common data staging area. This data staging area is called a Dynamic Data Store (DDS). The
DDS plays a crucial role in integrating data marts with the data warehouse. In this architecture, star
schema modeling should be used if relational technology is used for the data warehouse.
In Data Stage/Data Mart Architecture, no single data warehouse is physically implemented. Instead, the
warehouse is considered a logical group of all the data marts.
DDW/DMA is also similar to EDMA as it has a dynamic staging area and a common global metadata
repository.
FAQ
1. What are conformed dimensions?
Ans:
A conformed dimension is one whose meaning is independent of the fact table from which it is
referred.
2. What are virtual data marts?
Ans:
Virtual data marts are logical views of multiple physical data marts based on user requirement.
3. Which tool supports data mart based data warehouse architectures?
Ans:
Informatica is commonly used for implementing data mart based data warehouse architectures.
4. Is the data in data marts also historical like in data warehouses?
Ans:
The data in data marts is historical only to some extent. In fact, it is not the same as the data in a data
warehouse because of the difference in the purposes and approaches of the two.
Chapter Seven
Objectives
In this chapter, the students have learned to:
Define metadata
Identify the uses of metadata
Focus Areas
Explain metadata as "data about data". Explain its importance and the need for it in a data warehouse. Discuss the
various types of metadata referring to the Additional Inputs section. Explain its usage in transformation
and loading, data management and query generation. Explain the concept of metadata management
referring to the Additional Inputs section.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Metadata Types
With the growth and advances in the data warehousing field, metadata has become an invaluable
resource. According to the source of the data being described, metadata can be classified as:
Source metadata: This includes information about the source data. It could include schemas,
formats, graphics, relational tables, ownership details, and administrative and business descriptions. It could
also include process-related information such as extraction settings, schedules, and results of specific
jobs that were performed on the source systems.
Data staging metadata: This includes all the metadata required to load the data into the staging
area. It could include data acquisition information, definitions of conformed dimensions and facts,
slowly changing dimension policies, data cleaning specifications, data enhancement and mapping
transformations, target schema designs, data flows, load scripts, aggregate definitions, various
process logs, and other business documentation.
DBMS metadata: This includes the metadata describing various definitions, settings, and
specifications after the data has been loaded into the data warehouse. It could include partition
settings, indexes, striping specifications, security privileges, administrative scripts, view
definitions, backup status, and procedures.
Front room metadata: This could include names and descriptions for attributes and tables, and canned
query and report definitions. In addition, it includes end-user documentation, user profiles, network
security user privileges and profiles, network usage statistics, and usage instructions for data
elements, tables, views, and reports.
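A small part of such metadata can itself be pictured as ordinary tables in a repository. The following sketch, with hypothetical names, records which source column feeds which warehouse column and when it was last loaded; it is only an illustration of the idea, not a prescribed design.

-- Simplified metadata repository table for transformation and loading
CREATE TABLE ETL_METADATA (
    SOURCE_SYSTEM  VARCHAR2(30),
    SOURCE_COLUMN  VARCHAR2(60),
    TARGET_TABLE   VARCHAR2(30),
    TARGET_COLUMN  VARCHAR2(30),
    TRANSFORMATION VARCHAR2(200),  -- business rule applied during loading
    LAST_LOAD_DATE DATE
);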
Metadata Management
Metadata management is as crucial as metadata itself. Metadata management ensures that metadata can
be represented and shared in a standard format. Metadata management involves two essential
components:
Metadata modeling
Metadata repository
In order to standardize the representation of metadata, it must be modeled at separate layers. This way,
varying source systems that have metadata at different levels of abstraction can be mapped to one of the
layers and standardized. A metadata model typically has four layers.
In a centralized metadata repository, the metadata is defined and controlled through a single schema
stored in a centralized repository, called the global metadata repository. This single schema represents the
composite schema of all the subsystems.
A decentralized metadata repository is required in a distributed environment. It consists of a
central global metadata repository, as well as local metadata repositories. The global repository contains the
metadata that is to be shared and reused among the local repositories. This global metadata is a single
schema that is used by all the local metadata repositories. The local metadata repositories, on the other
hand, contain the metadata specific to their individual uses.
FAQ
1. How can you classify metadata?
Ans:
You can classify metadata according to its use as:
Administrative metadata: Metadata that describes the data used for managing the data in terms of
statistics such as time of creation, access rights, and last access time.
Structural metadata: Metadata that describes the structure of the data.
Descriptive metadata: Metadata that describes the purpose or functionality of the data.
Chapter Eight
Objectives
In this chapter, the students have learned to:
Focus Areas
Tell students that the various components and techniques applied to implement a data warehouse must be
managed together. This management is the responsibility of two types of management tools - system
management tools and process management tools. Now, explain what system managers and process
managers are. Emphasize that these are not persons but tools.
Explain each type of manager and their responsibilities in detail. Explain the various components of SQL
Server 2000, which act as one or more of these managers by referring to the Additional Inputs section.
The main points are given in the Additional Inputs section. You can refer to the SQL Server Help if you
need to detail students on any specific point.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Note that the terms load, query, and warehouse managers are theoretical terms and may not completely
map to the components of data warehousing software. Their functionality may be spread across various
components or services of a tool. Alternatively, a tool may not actually have separate components
mapping to each manager type. These are just terms used to collectively indicate a set of tasks that need
to be performed. These tasks may be performed by data warehousing software in any way.
Connections: In order to enable the elements of a package to function for data transformation, you
must establish a connection between a data source and target.
Tasks: A DTS task is a single step that is performed in the data transformation process. For example,
export data from a source.
Transformation: DTS supports various field level mappings and transformation between data
sources. Examples of supported transformations include:
Copy column transformation: In this, the source column is directly copied to the destination
column without making any changes.
Date Time String transformations: In this, transformations on date/time fields, which are
either of string data type or date/time data type, are performed.
ActiveX script transformation: In this, ActiveX scripts are written that programmatically
transform fields for every row being copied from the source to the target.
DTS can be used simply through the Import/Export Wizard, which is an easy but less flexible way of
loading and transforming data. A more complex but advanced and flexible way to use DTS is through the
DTS designer - a full-fledged designing environment in which you can create transformation packages.
To understand how DTS actually works as a load manager, observe the following snapshot of the DTS
Import/Export Wizard, in which data is loaded from an Excel sheet into a SQL Server database.
The functionality offered by the Import/Export wizard is only limited to loading and transformation. The
DTS designer actually provides a host of other functionalities. A snapshot of the DTS designer is given
here:
Under the Task pane, the various tasks supported by DTS are:
SQL Server also supports a Meta Data browser that enables you to view metadata.
As shown in the figure, it can be used to schedule SQL queries, ActiveX scripts and shell commands.
Queries can be created and tested using the SQL Query Analyzer.
Using the Query Analyzer, you can:
Create queries, SQL scripts, and commonly used database objects from predefined scripts
Execute queries on SQL Server databases
Execute stored procedures without knowing the parameters
Debug stored procedures
Copy existing database objects
Debug query performance problems
Locate objects within databases or view and work with objects
Insert, update, or delete rows in a table
Add frequently used commands to the Tools menu
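For instance, the kind of statements typically created and tested in the Query Analyzer look like the following Transact-SQL sketch; the Retailer table and its columns are hypothetical.

-- Insert, update, and query rows interactively
INSERT INTO Retailer (RetailerID, RetailerName) VALUES (101, 'City Stores');

UPDATE Retailer SET RetailerName = 'City Stores Ltd' WHERE RetailerID = 101;

SELECT RetailerID, RetailerName FROM Retailer ORDER BY RetailerName;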
FAQ
1. Are system and process management devoid of any manual intervention, considering that a process
manager is a tool and not a person?
Ans:
No. Although the system and process managers are themselves tools that automate system and process
management in data warehouses, they must be configured, and at times handled, through manual
intervention. These tasks may be done by the database administrator.
2. Does SQL Server also provide system managers?
Ans:
Yes. SQL Server includes various components that enable system management through management and
security services:
3. What is Oracle Warehouse Builder(OWB)?
Ans:
It is one of the commonly used data warehouse development tools, with various advanced features such as
support for large databases, automated summary management, and an embedded multidimensional OLAP
engine. Unlike SQL Server, which is only for the Windows platform, OWB can be used on all platforms.
It is also faster, more reliable, and more scalable than SQL Server.
4. What is replication?
Ans:
Replication is the process of creating multiple copies of data on the same or different platform and
keeping the copies in sync.
Chapter Nine
Objectives
In this chapter, the students have learned to:
Focus Areas
Initiate a classroom discussion by asking the students the following questions:
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
Data Mining
Data Mining is the process of finding new and potentially useful knowledge from data. Data mining
software allows users to analyze large databases to solve business decision problems. Data mining is, in
some ways, an extension of statistics, with a few artificial intelligence and machine learning twists
thrown in. Like statistics, data mining is not a business solution; it is just a technology.
For example, consider a catalog retailer who needs to decide who should receive information about a new
product. The information operated on by the data mining process is contained in a historical database of
previous interactions with customers and the features associated with the customers, such as age, zip
code, and their responses. The data mining software would use this historical information to build a model of
customer behavior that could be used to predict which customers would be likely to respond to the new
product. By using this information, a marketing manager can select only the customers who are most
likely to respond. The operational business software can then feed the results of the decision to the
appropriate touchpoint systems (call centers, web servers, email systems, etc.) so that the right customers
receive the right offers.
Clustering
Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that
can be used as a starting point for exploring further relationships. This technique supports the
development of population segmentation models, such as demographic-based customer segmentation.
Additional analyses using standard analytical and other data mining techniques can determine the
characteristics of these segments with respect to some desired outcome. For example, the buying habits
of multiple population segments might be compared to determine which segments to target for a new
sales campaign.
Classification
Classification, perhaps the most commonly applied data mining technique, employs a set of pre-classified
examples to develop a model that can classify the population of records at large. Fraud detection and
credit-risk applications are particularly well suited to this type of analysis. This approach frequently
employs decision tree or neural network-based classification algorithms. The use of classification
algorithms begins with a training set of pre-classified example transactions. For a fraud detection
application, this would include complete records of both fraudulent and valid activities, determined on a
record-by-record basis. The classifier training algorithm uses these pre-classified examples to determine
the set of parameters required for proper discrimination. The algorithm then encodes these parameters
into a model called a classifier.
The approach affects the explanation capability of the system. Once an effective classifier is developed, it
is used in a predictive mode to classify new records into these same predefined classes. For example, a
classifier capable of identifying risky loans could be used to aid in the decision of whether to grant a loan
to an individual.
KDD
Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing upon methodologies
for extracting useful knowledge from data. The ongoing rapid growth of online data due to the Internet
and the widespread use of databases have created an immense need for KDD methodologies. The
challenge of extracting knowledge from data draws upon research in statistics, databases, pattern
recognition, machine learning, data visualization, optimization, and high-performance computing to
deliver advanced business intelligence and web discovery solutions.
KDD refers to a multi-step process that can be highly interactive and iterative. It includes data
selection/sampling, preprocessing and transformation for subsequent steps. Data mining algorithms are
then used to discover patterns, clusters and models from data. These patterns and hypotheses are then
rendered in operational forms that are easy for people to visualize and understand. Data mining is a step
in the overall KDD process.
Data mining (also known as Knowledge Discovery in Databases - KDD) has been defined as "the
nontrivial extraction of implicit, previously unknown, and potentially useful information from data". It
uses machine learning, statistical, and visualization techniques to discover and present knowledge in a
form that is easily comprehensible to humans. The main idea in KDD is to discover high-level
(abstract) knowledge from lower levels of relatively raw data, or to discover a higher level of
interpretation and abstraction than was previously known.
FAQ
1. What is KDD Process?
Ans:
The unifying goal of the KDD process is to extract knowledge from data in the context of large
databases. It does this by using data mining methods (algorithms) to extract (identify) what is deemed
knowledge, according to the specifications of measures and thresholds, using a database along with any
required preprocessing, subsampling, and transformations of that database.
2. What is Data Visualization?
Ans:
Data visualization presents data in three dimensions and colors to help users view complex patterns.
Visualization tools also provide advanced manipulation capabilities to slice, rotate, or zoom the objects to identify
patterns.
3. What are the constituents of Multidimensional objects?
Ans:
Dimensions and Measures.
4. What does a level specify within a dimension?
Ans:
Levels specify the contents and structure of the dimension's hierarchy.
5. What is data mining?
Ans:
Data Mining is the process of finding new and potentially useful knowledge from data.
6. What does Data Mining Software do?
Ans:
Data mining software searches large volumes of data, looking for patterns that accurately predict
behavior, such as which customers are most likely to maintain a relationship with the company. Common
techniques employed by data mining software include neural networks, decision trees, and standard
statistical modeling.
7. What is Oracle Data Mining?
Ans:
Oracle Data Mining is enterprise data mining software that combines the ease of a Windows-based client
with the power of a fully scalable, multi-algorithmic, UNIX server-based solution. Oracle Data Mining
provides comprehensive predictive modeling capabilities that take advantage of parallel computing
techniques to rapidly extract valuable customer intelligence information. Oracle Data Mining can
optionally generate deployable models in C, C++, or Java code, delivering the "power of prediction" to
call center, campaign management, and Web-based applications enterprise-wide.
Typical application areas of data mining include:
Customer retention
Cross selling
Response modeling / target marketing
Profitability analysis
Product affinity analysis
Fraud detection
Chapter Ten
Objectives
In this chapter, the students have learned to:
Focus Areas
Introduce the first step of data mining, which is data preprocessing. Enumerate the steps to be followed to
preprocess data and make it ready for application of data mining techniques. Initiate a discussion on data
preparation. However, before that, you must explain how data is acquired for a data mining system from
the Additional Inputs section. After data acquisition, explain data preparation with the help of an
appropriate example of a dataset. Next, briefly discuss data mining primitives and then shift focus to the data
mining query language. Explain the data mining query language in detail with the help of examples.
Finally, conclude the session with a brief discussion on data mining system architectures. Hold a
discussion for identifying scenarios for selecting a particular architecture.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
The process of acquiring data involves various activities such as identifying sources from which data can
be acquired, ways of acquiring data, and actually gathering data. Data can be acquired from multiple
sources in various ways such as interviews, surveys, observational data, and transactional databases.
After the culmination of this process, the following output must be produced:
After data has been acquired, it must be described. Data description involves various activities such as:
Identifying the initial, basic format of the data by ensuring that metadata is available. Metadata is
very crucial for describing data.
After data has been adequately defined, it must be explored. Data exploration refers to understanding the
basic structure and schema of data, and evaluating its usefulness for data mining. Basic exploration can
start by analyzing the metadata. Metadata gives an idea of the structure as well as the detailed meaning of
data. This helps in analyzing the usefulness of data. Beyond studying the metadata, basic exploration
uses simple as well as sophisticated statistical techniques to reveal the properties of the data.
After the data has been gathered, described, and explored, its quality needs to be verified. Quality data is
data that is complete, least complex, has enough metadata, has unambiguous attributes, and possesses
context independence. It is essential that the quality of data be verified before it is preprocessed and mined.
This is because mining data of low quality will lead to inaccurate results and may cost an enterprise a
huge amount in terms of money and effort wasted. Data quality can be harmed at any step, from gathering
data to delivering and storing it. For example, typing mistakes during storage of data can lead to loss of
quality due to incorrect data values.
After complete verification of data quality, data can now be preprocessed for mining.
A good data mining system or tool should satisfy criteria such as the following:
It should be easy to use.
It should provide at least 80% accuracy of prediction.
It should be able to perform all common data mining tasks such as cleaning, import, export, and
formatting.
FAQ
1. What is Noisy Data?
Ans:
Noise is a random error or variance in data. It can happen because of:
Faulty data collection and data entry mistakes, such as typing mistakes
Data transmission and storage problems
Inconsistencies in naming conventions
Noise makes data inaccurate for predictions and renders it useless for mining systems.
2. Which are the major data mining tasks?
Ans:
The main data mining tasks include:
Classification
Clustering
Associations
Prediction
Characterization and Discrimination
Evolution Analysis
3. What are some other Data Mining Languages and standardization of primitives apart from DMQL?
Ans:
Some other Data Mining Languages and standardizations of primitives apart from DMQL include:
MSQL
Mine Rule
Query flocks based on Datalog syntax
OLEDB for DM
CRISP-DM
Clementine
Darwin
Enterprise Miner
Intelligent Miner
Mine Set
Noisy data can be smoothed using techniques such as:
Binning
Clustering
Computer/Human inspection
Regression
Chapter Eleven
Objectives
In this chapter, the students have learned to:
Focus Areas
Initiate a discussion by asking the students the importance and need of Data Mining. Explain that various
techniques have to be adopted for data mining as it is a specialized field and needs to be handled through
appropriate techniques. Discuss the various techniques that can be used for mining the data. Also focus
on issues, if any, related to any of the techniques and the key features of various techniques.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
The main bottlenecks of the Apriori algorithm are:
Huge candidate set generation: It needs 2^100 ≈ 10^30 candidates to be generated to discover a frequent
pattern of size 100, such as {a1, a2, ..., a100}.
Multiple scans of the database: It needs (n + 1) scans, where n is the length of the longest pattern.
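The order of magnitude of the candidate count can be checked with a quick calculation:

$2^{100} = \left(2^{10}\right)^{10} \approx \left(10^{3}\right)^{10} = 10^{30}$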
The solution is the FP-Tree algorithm, which shows performance improvements over Apriori and
its variations, since it uses a compressed data representation (nodes and a tree structure) and does not need
to generate candidate sets. However, FP-Tree based mining uses a complete data structure, and
performance gains are very sensitive to the support threshold setting. An update of the database requires a
complete repetition of the scan process and the construction of a new tree.
Definition of FP-Tree:
FP-Tree is an extended prefix-tree structure storing crucial, quantitative information about frequent
patterns.
It is highly condensed, but complete for frequent pattern mining. The advantage over Apriori is that it
avoids costly database scans and does not require candidate generation.
Step 2: Create the root of an FP-Tree, and label it as "null". For each transaction Trans in D, do the
following.
Select and sort the items in Trans according to the order of L. Let the sorted frequent-item list in Trans be
[p|P], where p is the first element and P is the remaining list. Call insert_tree([p|P], T), which is
performed as follows. If T has a child N such that N.item-name = p.item-name, then increment N's count
by 1. Otherwise, create a new node N, let its count be 1, link its parent link to T, and link its node-link to
the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N)
recursively.
4. Enumerate all the combinations of the items from a single path; these combinations form the frequent patterns.
FAQ
1. What are the variations of the Apriori algorithm?
Ans:
Following are some of the variations of the Apriori algorithm that improve the efficiency of the original
algorithm:
2. Which is the best approach when we are interested in finding all possible interactions among a set of
attributes?
Ans:
The best approach to find all possible interactions among a set of attributes is association rule mining.
3. What is overfitting in a neural network?
Ans:
Overfitting is a common problem in neural network design. Overfitting occurs when a network has
memorized the training set but has not learned to generalize to new inputs. Overfitting produces a
relatively small error on the training set but will produce a much larger error when new data is presented
to the network.
Chapter Twelve
Objectives
In this chapter, the students have learned to:
Focus Areas
Recall the chain of data-->organized data (information)-->knowledge and introduce Knowledge
Discovery in Databases (KDD) as the complete process of mining knowledge from information stored in
databases. Explain the KDD environment setup referring to the Additional Inputs section. Tell the
students that a KDD environment is a part of an enterprise whose core competency is data mining.
Consequently, it is important that the enterprise understands what makes a KDD environment successful.
List the factors that make a KDD environment successful from the Additional Inputs section. Next,
enumerate and explain the various guidelines from the textbook.
Finally, initiate a discussion on how to act on mining results.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
The team involved in KDD should ideally consist of 8-10 multi-skilled people, with an excellent
aptitude for technical aspects as well as business aspects. They need to be experts in various
disciplines such as statistical analysis, understanding business users, comprehending data owners,
and managing.
The team should be led by a single person who has a proven track record in KDD.
Business units must be an integral part of the KDD process. The knowledge gained during the KDD
process is foremost for the business units of an enterprise and their participation is inevitable in the
process.
IT should be involved and form the basis for the implementation of the KDD process from the
beginning.
A good pilot project should be selected and run to demonstrate the efficiency and effectiveness of the
KDD process.
Insights: New facts discovered from the data during the mining process, may lead to insights about
the business and about the customers.
One-time results: If the results are focused on a particular activity, such as a marketing campaign,
the marketing campaign should be carried out based on the trends determined by the data mining
process.
Remembered results: Sometimes results provide interesting and valuable information about
customers. This information should be accessible through a data warehouse.
Periodic predictions: The results may be used to score customers periodically, to determine the best
strategies to be adopted thereafter, such as whom to target for retention efforts.
Real-time scoring: The results may be incorporated into another system to provide real-time
predictions.
Fixing data: Sometimes the data mining results uncover data problems. These results often result in
cleaner, more complete data for the future.
FAQ
1. What is the difference between KDD and data mining?
Ans:
KDD refers to the overall process of discovering useful knowledge from data. It also includes the choice
of encoding schemes, preprocessing, sampling, and projections of the data prior to the data-mining step.
Data mining refers to the application of algorithms for extracting patterns from data without the
additional steps of the KDD process. It is essentially the modeling step in the KDD process.
2. Is data stored for data mining different from other operational systems?
Ans:
The two systems differ in the usage patterns associated with data mining. Some of the basic differences
are:
Operational systems support transactional processing whereas data mining systems support
analytical processing.
Operational systems are process-oriented whereas data mining systems are subject-oriented.
Operational systems are concerned with current data whereas data mining systems deal with
historical data.
Operational systems are updated frequently and have volatile data whereas data mining systems have
non-volatile data and are rarely changed.
Operational systems are optimized for fast updates whereas data mining systems are optimized for
fast retrievals.
Chapter Thirteen
Objectives
In this chapter, the students have learned to:
Focus Areas
Start the session by asking students to name the fields where data mining has practical application. List
the various fields and explain the application of data mining in these fields. Let the students participate
actively in the discussion to validate their understanding of the data mining process.
Refer to the Additional Inputs section to explain application of data mining in fields such as
telecommunication and manufacturing.
Additional Inputs
The following section provides some extra inputs on the important topics covered in the SG:
The data collected in the telecommunication industry includes various dimensions such as calling
time, duration, location of the caller, location of the called party, and type of call. The multidimensional
analysis of this data can be used to identify and compare data traffic, system workload, resource
usage, user group behavior, and profit. The telecommunication data can be consolidated into large
data warehouses, and multidimensional analysis using OLAP and visualization tools can be routinely
performed on it.
The data collected can be mined to identify fraudulent activity. It is important to identify potentially
fraudulent users and their typical usage patterns, detect attempts to gain fraudulent entry to customer
accounts, and discover unusual patterns such as periodic calls from automatic dial-out equipment.
The discovery of association and sequential patterns in multidimensional analysis can be used to
promote telecommunication services. For example, usage patterns by customer group, by month, and
by time of day can be found.
Many inter-related factors affect the working of a manufacturing unit. Data mining has helped to successfully understand the variations in these factors and
their inter-relationships in diverse manufacturing processes.
Data mining can help in improving industrial processes in various ways such as:
Reducing costs and improving service throughout the supply chain in a supply chain management
system
Improving sales and operations planning as a result of conducting historical sales analysis to improve
customer service, reduce inventory cost, reduce obsolete raw material costs, and reduce shipping
costs through improved routing and distribution.
Supporting process and product quality initiatives, through analyzing trends in defects. Early
detection and active management of defects can have a significant impact on the profitability of a
project.
Measuring worker productivity and evaluating the resources and processes based on it.
To understand how these techniques help in the DNA analysis field, consider the case of path analysis.
Different genes may become active at different stages of a disease. Using path analysis the sequence of
genetic activities across different stages of the disease can be identified. This knowledge may help in
developing specific medicines for specific stages.
FAQ
1. What are the different types of query answering mechanisms?
Ans:
Query answering can be classified into two categories based on the method of response:
Direct query answering: The query is answered by returning exactly what is being asked.
Intelligent query answering: The intent of the query is analyzed, and generalized,
neighborhood, or associated information relevant to the query is provided.