
Business Intelligence and Tools Unit 6

Sikkim Manipal University Page No.: 130


Unit 6 Data Extraction
Structure
6.1 Introduction
Objectives
6.2 ETL Overview
6.2.1 Significance of ETL Processes
6.2.2 ETL Requirements and Steps
Self Assessment Question(s) (SAQs)
6.3 Overview of the Data Extraction Process
Self Assessment Question(s) (SAQs)
6.4 Types of Data in Operational Systems
6.4.1 Current Value
6.4.2 Periodic Status
Self Assessment Question(s) (SAQs)
6.5 Source Identification
Self Assessment Question(s) (SAQs)
6.6 Data Extraction Techniques
6.6.1 Immediate Data Extraction
6.6.2 Deferred Data Extraction
Self Assessment Question(s) (SAQs)
6.7 Evaluation of the Techniques
Self Assessment Question(s) (SAQs)
6.8 Summary
6.9 Terminal Questions (TQs)
6.10 Multiple Choice Questions (MCQs)
6.11 Answers to SAQs, TQs, and MCQs
6.11.1 Answers to Self Assessment Questions (SAQs)
6.11.2 Answers to Terminal Questions (TQs)
6.11.3 Answers to Multiple Choice Questions (MCQs)
6.1 Introduction
In this Unit, we discuss the data extraction process that is carried out from
the source systems into data warehouses or data marts. As we have
already discussed, data extraction is the first step in the execution of the
ETL (Extraction, Transformation, and Loading) functions to build a data
warehouse. This extraction can be done from OLTP databases and non-OLTP
systems, such as text files, legacy systems, and spreadsheets. The
data extraction process is complex in nature because of the tremendous
diversity that exists among the source systems in practice.
Objectives:
The objectives of this Unit are to make you understand:
Source identification for extraction of the data.
Various methods used for data extraction.
Evaluation of the extraction techniques.
Exception handling in case some data has not been extracted properly.
6.2 ETL Overview
Most of the information contained in a data warehouse comes from the
operational systems. But we all know that the operational systems cannot
be used to provide strategic information. So you need to carefully
understand the difference between the data in the source operational
systems and the information in the data warehouse. It is the ETL
functions that reshape the relevant data from the source systems into useful
information to be stored in the data warehouse. There would be no strategic
information in a data warehouse in the absence of these functions.
6.2.1 Significance of ETL Processes
The ETL functions act as the back-end processes that cover the extraction
of the data from the source systems. They also include all the functions and
procedures for changing the source data into the exact formats and
structures appropriate for storage in the data warehouse database. After the
transformation of the data, they include the processes that physically move
the data into the data warehouse repository. Once the data is captured, you
cannot simply dump it into the data warehouse. You have to carefully
subject the extracted data to all manner of transformations so that it
becomes fit to be converted into information.
Data Extraction: extraction from heterogeneous internal and external source systems
Data Summarization: creating aggregate datasets on the basis of predefined procedures
Data Transformation: conversion and restructuring of data as per the transformation rules
Data Integration: combining the entire data on the basis of source-to-target mapping
Data Cleansing: scrubbing and enriching as per the cleansing rules
Initial Data Loading: applying initial data in large volumes to the warehouse
Meta Data Updating: maintaining and using metadata for the ETL functions
Ongoing Data Loading: applying ongoing incremental loads and periodic refreshes to the warehouse
Fig 6.1: ETL Processes Summary
Let us try to understand the significance of the ETL functions by taking an
example. Suppose you want to analyze and compare sales by store,
product, and month, but the sales data is available in various applications of
your organization. You therefore have to bring the entire sales details into
the data warehouse database. You can do this by providing the sales and
price in a fact table, the products in a product dimension table, the stores in
a store dimension table, and the months in a time dimension table. To do
this, you need to extract the data from the respective source systems,
reconcile the variations in the data representations among the source
systems, transform the entire sales details, and load the sales into the fact
and dimension tables. Thus the execution of ETL functions is challenging
because of the nature of the source systems.
Also, the time spent on performing the ETL functions can be as much as
50-70% of the total effort of building a data warehouse. To extract the data,
you have to know the time window during each day in which data can be
extracted from a specific source system without impacting the usage of that
system. You also need to determine the mechanism for capturing the
changes in the data in each of the relevant systems. Apart from the core
ETL functions, the building of a data warehouse includes functions like data
integration, data summarization, and metadata updating. Figure 6.1 details
the processes involved in executing the ETL functions in building a data
warehouse.
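As a simple illustration of the ETL flow described above, consider sales records arriving from two hypothetical source applications that use different field names and date formats; the records are reconciled and loaded into one fact table plus dimension tables. All system and field names below are invented for the sketch.

```python
# Hypothetical extracts from two disparate source systems.
system_a = [{"prod": "P1", "store": "S1", "dt": "2024-01-15", "amt": 100.0}]
system_b = [{"product_code": "P2", "outlet": "S2",
             "sale_date": "15/01/2024", "value": 250.0}]

def transform_a(rec):
    # System A already uses ISO dates; just rename fields and keep the month.
    return {"product": rec["prod"], "store": rec["store"],
            "month": rec["dt"][:7], "sales": rec["amt"]}

def transform_b(rec):
    # System B uses dd/mm/yyyy; reconcile it to the warehouse's YYYY-MM form.
    day, month, year = rec["sale_date"].split("/")
    return {"product": rec["product_code"], "store": rec["outlet"],
            "month": f"{year}-{month}", "sales": rec["value"]}

# Integrate: apply each system's transformation rules, then load.
fact_table = [transform_a(r) for r in system_a] + [transform_b(r) for r in system_b]
product_dim = sorted({row["product"] for row in fact_table})
store_dim = sorted({row["store"] for row in fact_table})
time_dim = sorted({row["month"] for row in fact_table})
```

The point of the sketch is the reconciliation step: each source needs its own transformation rules before the rows can share one fact table.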
6.2.2 ETL Requirements and Steps
Ideally, you are required to go through the steps shown in Fig. 6.2 for the
execution of the ETL functions.
1. Determine the target data of the data warehouse
2. Identify the internal and external data sources
3. Map the sources with the target data elements
4. Establish comprehensive data extraction rules
5. Prepare data transformation and cleansing rules
6. Plan for aggregate tables
7. Organize the data staging area and test tools
8. Write the procedures for all data loads
9. Execute ETL functions for dimension tables
10. Execute ETL functions for fact tables
Fig 6.2: Steps involved in executing the ETL functions
Self Assessment Question(s) (SAQs)
For Section 6.2
1. Mention the ETL functions and explain how they are important in
building a data warehouse.
2. The implementation of the ETL functions includes various steps, from
determining the source systems to the preparation of fact and dimension
tables. List the major steps involved in the ETL process in sequence.
6.3 Overview of the Data Extraction Process
In general, data extraction is performed within the source system itself,
especially if it is a relational database to which the extraction procedures
can easily be added. Also, it is possible for the extraction logic to exist in the
data warehouse staging area and query the source system for data using
ODBC (Open Database Connectivity), OLE DB (Object Linking and
Embedding, Database), or other APIs (Application Programming Interfaces).
The most common method of data extraction for the legacy systems is to
produce text files, although many newer systems offer direct query APIs or
accommodate access through ODBC or OLE DB.
Data extraction in the context of data warehousing is different from that for
an operational system. For a data warehouse, you have to extract data from
many disparate sources. Also, you have to extract data on the changes for
ongoing incremental loads as well as for a one-time initial full load. But for
operational systems, all you need is one-time extractions and data
conversions. These two factors increase the complexity of data extraction
for a data warehouse, and so an effective data extraction is a key to the
success of your data warehouse. Therefore, you need to pay special
attention to the issues and formulate an exclusive data extraction strategy
for your data warehouse.
Some of the important issues that you need to take care while extracting the
data from various source systems are as follows:
Source identification that involves identification of the source
applications and source structures
Method of data extraction from each of the source systems and to define
whether the extraction process is to be carried out manually or by using
the tool
Preparation of time window for each data source to denote the time for
the data extraction
Determination of the extraction frequency for each data source in order
to establish how frequently the data extraction must be done (like daily,
weekly, quarterly, etc.)
Job sequencing to determine whether the beginning of one job in an
extraction job stream has to wait until the previous job has finished
successfully
Exception handling to determine an approach to handle input records
that cannot be extracted
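The issues listed above can be recorded as a per-source extraction plan, together with a simple reject route for exception handling. This is only a sketch; the source names, methods, windows, and frequencies are hypothetical.

```python
# One entry per source system: method, time window, and extraction frequency.
extraction_plan = {
    "orders_db": {"method": "transaction_log",
                  "window": ("01:00", "03:00"), "frequency": "daily"},
    "legacy_billing": {"method": "file_comparison",
                       "window": ("02:00", "05:00"), "frequency": "weekly"},
}

rejected = []  # exception handling: input records that cannot be extracted

def extract(records, is_valid):
    """Keep valid records; route the rest to a reject list for review."""
    good = []
    for rec in records:
        (good if is_valid(rec) else rejected).append(rec)
    return good

# A record with a missing key cannot be extracted and lands in `rejected`.
clean = extract([{"id": 1}, {"id": None}], lambda r: r["id"] is not None)
```

Keeping the plan as data rather than hard-coding it per source makes it easy to review the window and frequency decisions for every system in one place.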
Self Assessment Question(s) (SAQs)
For Section 6.3
1. What do you mean by Data Extraction? Discuss various issues involved
while extracting the data in building a data warehouse?
6.4 Types of Data in Operational Systems
As the data in the data warehouse is extracted from the operational
systems, we now discuss the characteristics of the data in an operational
system. The data in the source operational systems is said to be
time-dependent or temporal, as the source data changes with time. As
business transactions keep changing the data in the source systems, the
value of an attribute in a source system is the value of that attribute at the
current time. For instance, when a customer moves to another location, the
customer's address changes in the customer table in the source system.
Similarly, the product data gets changed in the source system when a new
feature is added to the existing product line. In the case of a change of
address, only the current address of the customer is important in the
operational systems. But in the data warehouse, the entire history is to be
maintained. If you want to perform a sales analysis based on location for a
given time interval, the sales made by the customer prior to the change of
address are taken for the previous period, and the current address is
considered for the current period.
Now let us discuss how the data is stored in the source systems. There are
two ways of storing the data, and the type of data extraction method you
have to use depends on the nature of each of these two categories.
6.4.1 Current Value
The current value is the value of an attribute stored at a given moment in
time. These values are transient and change as and when business
transactions happen. You cannot forecast when the value will change next.
The current balance of a bank account and the stock price of a company
are examples of this category. The value of an attribute remains constant
until a business transaction changes it, and so data extraction that
preserves the history of the changes in the data warehouse gets quite
involved.
6.4.2 Periodic Status
Here, the value of an attribute is stored as a status every time a change
occurs; that is, the status value is stored with reference to the time. For
example, the data about an insurance policy is stored as the status of the
policy at each point in time. The history of the changes is thus preserved in
the source systems themselves, and data extraction becomes relatively
easier.
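The contrast between the two storage types can be sketched in a few lines; the account and policy records here are invented examples. A current-value attribute overwrites its old value, while periodic status keeps one row per status change.

```python
# Current value: only the latest balance survives in the source system.
account = {"acct_no": "A100", "balance": 500.0}
account["balance"] = 450.0          # the old value 500.0 is lost at the source

# Periodic status: each change is stored with its effective time, so the
# history is preserved in the source itself.
policy_status = [
    {"policy": "P9", "status": "active", "as_of": "2024-01-01"},
    {"policy": "P9", "status": "lapsed", "as_of": "2024-06-01"},
]
```

This is why history preservation must be engineered into the extraction for current-value data, whereas for periodic status data the history can simply be read out of the source.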
Self Assessment Question(s) (SAQs)
For Section 6.4
1. What are the types of storing the data in the source systems?
6.5 Source Identification
Source identification is a critical step in the data extraction process. For
instance, if you intend to design a data warehouse to provide strategic
information on the fulfillment of orders, you need to store information on
both fulfilled and pending orders. If you deliver the orders through
multiple channels, you also need to capture the data about the delivery
channels.
If you provide the customers an option to get the status of the order, you are
required to extract the data on the status of each of these orders. To
achieve this, you need to maintain the attributes like total order amount,
discounts, and expected delivery time in the fact table for order fulfillment.
Also, you need to prepare dimension tables for product, customers, order
disposition, delivery channel, etc.
You may take up the following step-wise approach for the source
identification:
Prepare a list of all data items or facts needed in the fact tables
Prepare a list of the dimension attributes from all dimensions
Find out the source system and source data item for every target data
item
Identify the preferred source in case there are multiple sources for a
specific data item
Identify any single source field that feeds multiple target fields and
establish splitting rules
Ascertain the default values and inspect the source data for missing
values
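The step-wise approach above amounts to building a source-to-target map with a preferred source and a default value per target field. The systems, fields, and defaults below are hypothetical, shown only to make the mapping idea concrete.

```python
# Source-to-target mapping for the order-fulfillment fact table.
mapping = [
    # target field        preferred source      source field   default
    ("order_amount",      "order_entry_db",     "ord_amt",     0.0),
    ("delivery_channel",  "fulfilment_system",  "channel_cd",  "UNKNOWN"),
]

def resolve(target_field, source_record):
    """Look up a target field's value in a source record, falling back
    to the mapping's default when the source value is missing."""
    for tgt, _system, src, default in mapping:
        if tgt == target_field:
            return source_record.get(src, default)
    raise KeyError(target_field)

# A record missing the channel code picks up the ascertained default.
value = resolve("delivery_channel", {"ord_amt": 120.0})
```

Recording the preferred source alongside each target field is what resolves the "multiple sources for one data item" issue before any extraction code runs.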
Self Assessment Question(s) (SAQs)
For Section 6.5
1. Discuss various steps involved in the source identification process of
extracting the data.
6.6 Data Extraction Techniques
The initial data that you might have moved to the data warehouse to get it
started is called the initial load. After the initial load, you need to update the
data warehouse so as to maintain the history of the changes.
Broadly, there are two types of data extraction from the source operational
systems: 'as is' (or static) data, and data of revisions.
Static data is the capture of data at a specific point in time. For
current-value data, this capture includes all transient data identified for
extraction. In the case of periodic status data, the capture has to include
each status at each point in time as available in the source systems.
Generally, static data capture is used for the initial load of the data
warehouse, though you may also want a full refresh of a dimension table in
certain cases. For instance, you may do a full refresh of the product
dimension table if the product master of the source application is
completely revamped.
Data of revisions (also known as incremental data capture) includes the
revisions since the last time data was captured. If the source data is
transient, the capture of the revisions is a difficult exercise. But for
periodic status data, the incremental data capture includes the values of
attributes at specific times; here, you need to extract the statuses that
have been recorded since the last date of extraction. Incremental data
capture can be divided into immediate data extraction and deferred data
extraction.
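The two capture types can be sketched side by side: static capture takes everything at a point in time, while incremental capture takes only what changed since the last extract. The rows and dates are invented for the example.

```python
# Each hypothetical source row carries its last-updated date (ISO strings
# compare correctly as text).
source_rows = [
    {"id": 1, "updated": "2024-03-01"},
    {"id": 2, "updated": "2024-03-10"},
    {"id": 3, "updated": "2024-03-20"},
]

def static_capture(rows):
    """Full snapshot of the source, e.g. for the initial load."""
    return list(rows)

def incremental_capture(rows, last_extract):
    """Only the revisions made after the previous extraction date."""
    return [r for r in rows if r["updated"] > last_extract]

initial = static_capture(source_rows)                  # full load: all rows
delta = incremental_capture(source_rows, "2024-03-10") # ongoing load: new rows
```

After the initial full load, running only the incremental capture keeps the warehouse current without re-reading the whole source.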
6.6.1 Immediate Data Extraction
Immediate data extraction is real-time data extraction. There are three
methods of immediate data extraction, as discussed below.
6.6.1.1 Capture through Transaction Logs
In this method, the transaction logs of the DBMSs are used to capture the
data. Whenever a database table is updated, the DBMS writes entries to
the log file. The extraction method reads the transaction log and selects all
the committed transactions. Since logging is already a part of transaction
processing, this method imposes no additional overhead.
However, you need to ensure extraction of all the transactions before the log
file gets refreshed. This method works well if all of the source systems are
database applications. In case of non-database applications, you need to
use some other data extraction method as there are no log files for these
applications.
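A toy model of log-based capture: the extract step replays only the committed entries, mirroring how this method selects committed transactions from the DBMS log. The log entries here are invented stand-ins for real DBMS log records.

```python
# A simplified transaction log: each entry notes the operation and whether
# the transaction was committed.
log = [
    {"txn": 1, "op": "UPDATE", "table": "customer",
     "row": {"id": 7}, "committed": True},
    {"txn": 2, "op": "INSERT", "table": "orders",
     "row": {"id": 9}, "committed": False},
]

def capture_from_log(entries):
    """Select only committed changes, before the log file is refreshed."""
    return [e for e in entries if e["committed"]]

changes = capture_from_log(log)   # uncommitted txn 2 is excluded
```

The caveat in the text maps directly onto this sketch: if the log file is refreshed before `capture_from_log` runs, those entries are gone.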
6.6.1.2 Capture through Database Triggers
Database triggers are special stored procedures or programs stored in the
database that are fired when a pre-defined event occurs. Here, you can
create trigger programs for all events for which you need to capture the
data. The output of these trigger programs is written to a separate file,
which is then used to extract data for the data warehouse. For instance, you
can write a trigger program to capture all updates and deletes on a specific
table if you intend to capture all the changes that happen to the records of
that table. But this method, too, is applicable only if the source operational
systems are database applications.
This method is quite reliable, as the data capture occurs right at the source
and you can capture both before and after images. However, you need to
put in additional effort up front to build and maintain the trigger programs.
Also, the execution of trigger procedures during transaction processing
imposes additional overhead on the source systems.
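To make the trigger approach concrete, here is a small sketch using SQLite, which ships with Python; the table and trigger names are illustrative only. An update trigger writes the before and after images of each change to a separate changes table, which the extract step can then read.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product(id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE product_changes(id INTEGER, old_price REAL, new_price REAL);
    -- The trigger fires on every UPDATE and records both images.
    CREATE TRIGGER trg_product_upd AFTER UPDATE ON product
    BEGIN
        INSERT INTO product_changes VALUES (OLD.id, OLD.price, NEW.price);
    END;
""")
con.execute("INSERT INTO product VALUES (1, 10.0)")
con.execute("UPDATE product SET price = 12.5 WHERE id = 1")

# The change table now holds the before image (10.0) and after image (12.5).
captured = con.execute("SELECT * FROM product_changes").fetchall()
```

Note how the capture happens inside the transaction itself, which is exactly why this method is reliable but also why it adds overhead to the source system.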
6.6.1.3 Capture in Source Applications
In this method, the source application is made to assist in the data capture
for the data warehouse. Here, you need to modify the relevant application
programs that write to the source files and databases. You may revise the
programs to record all additions and deletions to the source files and
database tables, and the extract programs can then use the separate file
containing the changes to the source data.
The advantage of this method is that it can be used for all types of source
data irrespective of the type of source system: databases, indexed files, or
other flat files. But the difficulty lies in revising the programs in the source
operational systems and maintaining them properly. Also, the method may
reduce the performance of the source applications, as additional processing
is required to capture the changes to separate files. This method is also
referred to as application-assisted data capture.
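Application-assisted capture can be sketched as a hypothetical source program whose write routines are revised to also append every addition and deletion to a separate change file. The functions and record layout are invented for the illustration.

```python
change_file = []          # stands in for the separate extract file

def save_order(db, order):
    db[order["id"]] = order
    change_file.append(("ADD", order))       # extra write added to the app

def delete_order(db, order_id):
    order = db.pop(order_id)
    change_file.append(("DELETE", order))    # extra write added to the app

# The source application runs as usual; the change file fills up as a
# by-product, ready for the extract programs to read.
db = {}
save_order(db, {"id": 1, "qty": 5})
delete_order(db, 1)
```

The sketch also shows the cost the text mentions: every write path in the application now does extra work, which is why this method can reduce source-system performance and is expensive to maintain.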
6.6.2 Deferred Data Extraction
All the methods of immediate data extraction involve real-time data
capture. In contrast, the deferred data extraction methods do not capture
the changes in real time but capture them at a later time.
6.6.2.1 Capture based on Date and Time Stamp
In this method, every source record that is created or updated is marked
with a stamp showing the date and time. The data capture occurs at a later
time after the creation or updating of a source record, and the time stamp
provides the basis for selecting records for extraction. The limitation of the
method is that it presumes that all the relevant source records contain date
and time stamps. Also, the technique works well only when the number of
revised records is small.
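Timestamp-based deferred capture reduces to one comparison: select the records whose stamp is newer than the previous extraction run. The rows and run time below are hypothetical.

```python
from datetime import datetime

# Each source record carries a creation/update time stamp.
rows = [
    {"id": 1, "stamp": datetime(2024, 5, 1, 9, 0)},
    {"id": 2, "stamp": datetime(2024, 5, 2, 9, 0)},
]
last_run = datetime(2024, 5, 1, 23, 0)   # when the previous extract finished

# Deferred capture: anything stamped after the last run is extracted.
changed = [r for r in rows if r["stamp"] > last_run]
```

The sketch also makes the stated limitation visible: a record without a usable `stamp` field simply cannot be selected by this method.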
6.6.2.2 Capture by Comparing Files
If none of the above techniques is feasible for extracting the data in your
environment, you may consider this method. Here, you capture the changes
by comparing the current source data with the previously captured data.
Though the method seems simple, the comparison becomes difficult in the
case of large files. This method is therefore only an option for dealing with
legacy data sources that do not have transaction logs or time stamps on the
source records. It is also called the snapshot differential method, as it
compares two snapshots of the source data.
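The snapshot differential idea can be shown with two small snapshots keyed by record id; the records are invented for the sketch. Comparing the previous and current snapshots yields the inserts, updates, and deletes.

```python
# Previous and current snapshots of a hypothetical customer file,
# keyed by record id.
previous = {1: {"name": "ACME"}, 2: {"name": "Globex"}}
current = {2: {"name": "Globex Ltd"}, 3: {"name": "Initech"}}

inserts = {k: v for k, v in current.items() if k not in previous}   # new ids
deletes = {k: v for k, v in previous.items() if k not in current}   # gone ids
updates = {k: v for k, v in current.items()
           if k in previous and previous[k] != v}                   # changed
```

For large files this pairwise comparison is exactly the expensive step the text warns about, which is why the method is the last resort.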
Self Assessment Question(s) (SAQs)
For Section 6.6
1. Differentiate the terms static data capture and incremental data
capture to extract the data from the source systems?
2. Discuss various methods involved in the immediate data capture of
data extraction?
3. What is Deferred data extraction? Describe the methods of deferred
data extraction?
6.7 Evaluation of the Techniques
As discussed in the sections above, you may use any of the following
methods based on their applicability in your environment:
A. Capture of static data
B. Capture through transaction logs
C. Capture through database triggers
D. Capture in source applications
E. Capture based on date and time stamp
F. Capture by comparing files
Among these methods, capture through transaction logs and capture
through database triggers rely on facilities readily available in database
products. These methods are also comparatively easy to implement and
economical; the technique based on transaction logs is perhaps the least
expensive. The method of data capture in source applications may be the
most expensive in terms of deployment and maintenance, as it requires
substantial revisions to the existing source systems. The method of file
comparison is time-consuming, and so it can be used only when none of
the other methods is applicable.
The features of various data capture methods and their advantages and
limitations in implementation are as follows:
A. Capture of static data
Flexible to capture specifications
Does not affect the performance of source systems
No revisions required to existing applications
Can be used on legacy systems and file-oriented systems
B. Capture through transaction logs
Not flexible to capture specifications
Does not affect the performance of source systems
No revisions to existing applications
Can be used on most legacy systems, but cannot be used on file-
oriented systems.
C. Capture through database triggers
Not flexible to capture specifications
Does not seriously affect the performance of source systems
No revisions to existing applications
Cannot be used on most legacy systems, file-oriented systems
D. Capture in source applications
Flexible to capture specifications
Affects the performance of source systems
Major revisions to existing applications
Can be used on most legacy systems, file-oriented systems
High internal costs because of in-house work
E. Capture based on date and time stamp
Flexible to capture specifications
Does not affect the performance of source systems
Likely revisions to existing applications
Cannot be used on legacy systems
Can be used on file-oriented systems
F. Capture by comparing files
Flexible to capture specifications
Does not affect the performance of source systems
No revisions to existing applications
May be used on legacy systems, file-oriented systems
Self Assessment Question(s) (SAQs)
For Section 6.7
1. List out the advantages and limitations of various data extraction
methods that are under practice.
6.8 Summary
ETL (Extraction, Transformation, and Loading) functions are crucial in
building a data warehouse. These ETL functions act as the back-end
processes, and the time spent on performing them can be as much as
50-70% of the total effort of building a data warehouse. These functions
include the following processes: Data Extraction, Data
Transformation, Data Integration, Data Cleansing, Data Summarization,
Initial Data Loading, Meta Data Updating, Ongoing Data Loading. Here, data
extraction is the first and crucial process.
Following are some of the challenges one may face while implementing the
data extraction process: identification of all source systems, choosing an
appropriate method of data extraction, preparation of the time window,
determination of the extraction frequency, job sequencing, and exception
handling. There are two ways of storing the data in the source systems:
current value and periodic status. The current value is the value of an
attribute stored at a given moment in time. Periodic status is the status
stored every time a change occurs, so the status value is stored with
reference to a specific time frame. The type of data extraction method you
have to use depends on the nature of each of these two categories.
Broadly, there are two types of data extraction: 'as is' (or static) data and
data of revisions. Static data is the capture of data at a specific point in
time; for current-value data, this capture includes all transient data
identified for extraction. Data of revisions (also known as incremental data
capture) includes the revisions since the last time data was captured. There
are three methods of immediate data extraction: capture through
transaction logs, capture through database triggers, and capture in source
applications. All the methods of immediate data extraction involve real-time
data capture. There are also two methods in practice under deferred data
extraction, where the data capture is done at a later time, not in real time:
capture based on date and time stamp, and capture by comparing files.
One has to be careful in choosing an appropriate method to implement the
data extraction process, and the choice has to be made after understanding
the advantages and limitations of these methods.
6.9 Terminal Questions (TQs)
1. Give the appropriate reasons why you think ETL functions are most
challenging in a data warehouse environment.
2. Briefly describe the major issues you may face in the process of data
extraction and explain how to deal with them.
3. Supposing you have chosen an ETL tool to perform the extraction,
transformation, and loading functions, discuss the activities carried out
by the ETL tool in executing the above-mentioned functions.
6.10 Multiple Choice Questions (MCQs)
1. What do you mean by 'exception handling' in extracting the data?
a. to determine the ways to handle the input records that cannot be
extracted
b. to identify the source systems for extracting the data into a
warehouse
c. to arrange the alternative system for a data warehouse to work in
contingency situations
d. to re-design the architecture of a data warehouse if the performance
of the warehouse is not up to the mark
2. The time period set for repetition of the data extraction from a specific
source system is called ______.
a. Extraction period
b. Extraction level
c. Extraction frequency
d. None of the above
3. The process of data extraction for a data warehouse is complex
compared to that of an operational system, because ________.
a. For a data warehouse, you have to extract data from many disparate
sources
b. For a data warehouse, you have to extract the data on the changes
for ongoing incremental loads as well as for a one-time initial full load
c. Both (a) and (b)
d. None of the above
4. Why is the data in a source system said to be time-dependent or
temporal?
a. The data in the source system changes with the time
b. The data in the source system does not change with the time
c. The attributes of the data in source system changes with the time
d. The data in the source system gets erased or deleted after a specific
period of time
5. Which of the following statements is false?
a. The history of data records is ignored in case of OLTP systems
b. Operational data in the source systems can be categorized into
'current value' and 'periodic status'
c. The information required for OLTP systems is drawn from the data in
the data warehouse
d. None of the above
6. If the data capture is not done immediately or not done in real-time, the
type of extraction is called _________.
a. Delayed extraction
b. Deferred extraction
c. Delimit extraction
d. Deteriorated extraction
7. Which of the following is not an option for immediate data extraction?
a. Capture through transaction logs
b. Capture in source applications
c. Capture through the source code
d. Capture through database triggers
8. The ETL functions in a data warehouse environment are challenging in
nature. Because ________.
a. Source systems are very diverse and disparate
b. Many source systems are older legacy applications running on
obsolete database technologies.
c. Generally, historical data on changes in values are not preserved
in source operational systems. Historical information is critical in a
data warehouse.
d. All the above
9. Which of the following represents the duration in which the data can be
extracted from a specific source system so that the system does not
get impacted by the extraction process?
a. Source window
b. Extraction window
c. Time window
d. Impact window
10. The creation of aggregate datasets on the basis of the pre-determined
procedures is called ___________.
a. Data cleansing
b. Data integration
c. Data setting
d. Data summarization
11. Which of the following is a type of Immediate Data Extraction method to
capture the data from the source systems?
a. Capture through Transaction Logs
b. Capture through Database Triggers
c. Capture in Source Applications
d. All the above
12. Which of the following is a type of Deferred Data Extraction method for
the data capture?
a. Capture based on date and time Stamp
b. Capture by comparing files
c. Both of (a) and (b)
d. None of the above
13. Which of the following methods is the least preferred method and is
implemented only when any of the other methods cannot be
implemented?
a. Capture based on date and time Stamp
b. Capture by comparing files
c. Capture in Source Applications
d. Capture through Database Triggers
6.11 Answers to SAQs, TQs, and MCQs
6.11.1 Answers to Self Assessment Questions (SAQs)
Section 6.2
1. ETL stands for Extraction, Transformation and Loading. To implement
these functions, you may be required to undergo the following
processes: data extraction, data transformation, data integration, data
cleansing, data summarization, initial data loading, metadata updating,
and ongoing data loading. Also write the significance of these ETL
processes as provided in the Section 6.2.1.
2. The steps to be followed in the implementation of ETL functions are
listed in the Figure 6.2. You may list out these steps with a small
description of each of these steps.
Section 6.3
1. Data extraction is the process of retrieving and collecting the data from
the appropriate source systems. The various issues one may face in the
data extraction process include identification of all source systems,
choosing an appropriate method of data extraction, preparation of the
time window, determination of the extraction frequency, job sequencing,
and exception handling. You may describe these processes as
discussed in Section 6.3.
Section 6.4
1. There are two ways of storing the data in the source systems: current
value and periodic status. The current value is the value of an attribute
stored at a given moment in time. Periodic status is the status stored
every time a change occurs, so the status value is stored with reference
to a specific time frame. You may describe these methods as discussed
in Sections 6.4.1 and 6.4.2.
Section 6.5
1. The source identification process includes the following steps:
preparation of a list of all data items or facts needed in the fact tables,
preparation of a list of the dimension attributes from all dimensions,
identifying the source system and source data item for every target data
item, identifying the preferred source in case there are multiple sources
for a specific data item, identifying any single source field that feeds
multiple target fields and establishing splitting rules, and ascertaining
the default values and inspecting the source data for missing values.
Section 6.6
1. Static data is the capture of data at a specific point of time. For current
data, this capture includes all transient data identified for extraction.
Data of revisions (also known as incremental data capture) includes the
revisions since the last time the data was captured. You may include the
other points as discussed in the Section 6.6.
2. There are three types of methods in the immediate data extraction
process. They are: capture through transaction logs, capture through
database triggers, and capture in source applications. You can describe
each of these methods as discussed in the Section 6.6.1.
3. The deferred data extraction methods do not capture the changes in real
time; they capture the data at a later time. The methods used in
deferred data extraction are capture based on date and time stamp and
capture by comparing files. You can describe each of these methods as
discussed in Section 6.6.2.
Section 6.7
1. One can adopt any of the following methods for the data extraction;
Capture of static data, Capture through transaction logs, Capture
through database triggers, Capture in source applications, Capture
based on date and time stamp, and Capture by comparing files. The
advantages and limitations of each of these methods are discussed in
the Section 6.7.
6.11.2 Answers to Terminal Questions (TQs)
1. Because of the following reasons, you can say that the ETL functions
are challenging in the context of building a data warehouse:
Source systems are very diverse and disparate. You may have to
deal with source systems on multiple platforms and different
operating systems.
Many source systems are legacy applications running on obsolete
database technologies. The quality of data is also dubious in many
of the old source systems.
Source system structures keep changing over time because of new
business conditions. Therefore, the ETL functions need to be
modified accordingly.
Historical data on changes in values is not preserved in source
operational systems, yet historical information is critical in a data
warehouse.
Even when inconsistent data is detected among disparate source
systems, lack of a means for resolving mismatches escalates the
problem of inconsistency.
Many of the source systems do not represent the data in types or
formats that are meaningful to the users.
2. The major issues involved in the data extraction process are discussed
in the Section 6.3. Accordingly, you may discuss the issues viz., source
identification, selection of an appropriate method of data extraction,
preparation of time window for each of the data source systems,
determination of the extraction frequency, sequencing of the jobs,
exception handling to handle the missing records during the extraction
process.
3. Before the selection of an ETL tool, you need to understand whether the
tool has the following capabilities:
Data extraction from various relational databases of leading vendors,
from old legacy databases, indexed files, and flat files
Data transformation from one format to another with variations in
source and target fields
Performing of standard conversions, key reformatting, and structural
changes
Provision of audit trails from source to target
Application of business rules for extraction and transformation
Combining of several records from the source systems into one
integrated target record
Recording and management of metadata.
6.11.3 Answers to Multiple Choice Questions (MCQs)
1. Ans: a
2. Ans: c
3. Ans: c
4. Ans: a
5. Ans: c
6. Ans: b
7. Ans: c
8. Ans: d
9. Ans: c
10. Ans: d
11. Ans: d
12. Ans: c
13. Ans: b