
DATA WAREHOUSING

ETL is the process of copying data from one database to another, but it is not a physical implementation. The process is not a one-time event; it is an ongoing and recurring part of the data warehouse.

ETL

1. Starter
2. ETL Components
2.1 Extraction
2.2 Extraction Methods
2.2.1 Logical Extraction Methods
2.2.2 Physical Extraction Methods
2.3 Transformation
2.4 Loading
2.4.1 Loading Mechanisms
2.5 Meta Data

3. ETL Design Considerations

4. ETL Architectures
4.1 Homogenous Architecture
4.2 Heterogeneous Architecture

5. ETL Development
5.1 Identify and Map Data
5.2 Identify Source Data
5.3 Identify Target Data
5.4 Map Source Data to Target Data

1. Starter

Companies know they have valuable data lying around throughout their networks that
needs to be moved from one place to another such as from one business application to
another or to a data warehouse for analysis.

The only problem is that the data lies in all sorts of heterogeneous systems, and
therefore in all sorts of formats. For instance, a CRM system may define a customer in
one way, while a back-end accounting system may define the same customer
differently.

To solve the problem, companies use extract, transform and load (ETL) software, which reads data from its source, cleans it up and formats it uniformly, and then writes it to the target repository to be exploited.

During the ETL process, data is extracted from an OLTP database, transformed to match
the data warehouse schema, and loaded into the data warehouse database. Many data
warehouses also incorporate data from non-OLTP systems, such as text files, legacy
systems, and spreadsheets; such data also requires extraction, transformation, and
loading.

In its simplest form, ETL is the process of copying data from one database to another.
This simplicity is rarely found in data warehouse implementations. In reality, ETL is a
complex combination of process and technology that consumes a significant portion of
the data warehouse development efforts and requires the skills of business analysts,
database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process,
not a physical implementation. ETL systems vary from data warehouse to data
warehouse and even between department data marts within a data warehouse.

The ETL process is not a one-time event, because new data is added to the data warehouse
periodically. Typical periodicity may be monthly, weekly, daily, or even hourly,
depending on the purpose of the data warehouse and the type of business it serves.

Because ETL is an ongoing and recurring part of a data warehouse, ETL processes must
be automated and operational procedures documented. ETL also changes and evolves
as the data warehouse evolves, so ETL processes must be designed for ease of
modification. A solid, well-designed, and documented ETL system is necessary for the
success of a data warehouse project.

2. ETL Components

Regardless of how they are implemented, all ETL systems have a common purpose:
they move data from one database to another. Generally, ETL systems move data from
OLTP systems to a data warehouse, but they can also be used to move data from one
data warehouse to another. An ETL system consists of four distinct functional
components:


1. Extraction
2. Transformation
3. Loading
4. Meta data

2.1 Extraction

The ETL extraction component is responsible for extracting or pulling the data from the
source system. During extraction, data may be removed from the source system, or a copy may be made and the original data retained in the source system.

It is common to move historical data that accumulates in an operational OLTP system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort to implement such offload processes, so legacy data is often copied into the data warehouse, leaving the original data in place.

Extracted data is loaded into the data warehouse staging area (a relational database
usually separate from the data warehouse database), for manipulation by the remaining
ETL processes.


Data extraction is generally performed within the source system itself, especially if it is a
relational database to which extraction procedures can easily be added. It is also
possible for the extraction logic to exist in the data warehouse staging area and query
the source system for data using ODBC, OLE DB, or other APIs. For legacy systems, the
most common method of data extraction is for the legacy system to produce text files,
although many newer systems offer direct query APIs or accommodate access through
ODBC or OLE DB.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.

2.2 Extraction Methods

There are different methods available to extract the data from source databases. Two important categories are logical extraction methods and physical extraction methods; we will look at both.

2.2.1 Logical Extraction Methods

Following are two kinds of logical extraction:

[A] Full Extraction

The data is extracted completely from the source system. Since this extraction reflects
all the data currently available on the source system, there's no need to keep track of
changes to the data source. An example of a full extraction would be an export file of a single table or a remote SQL statement that scans the complete source table.
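
As a rough illustration, a full extraction can be a single remote SQL statement that scans the whole source table into the staging area. The table, column, and database link names below are assumptions made for this sketch, not objects defined in this document.

-- Full extraction: copy every row of the source table into the staging area.
-- customers, stg_customers and the database link src_link are illustrative names.
INSERT INTO stg_customers (customer_id, customer_name, city, status)
SELECT customer_id, customer_name, city, status
FROM   customers@src_link;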

[B] Incremental Extraction

At a specific point in time, only the data that has changed since a well-defined event back in history is extracted. This event may be the last time of extraction or a more complex business event such as the last booking day of a financial period. To identify this delta change, it must be possible to identify all the information that has changed since that specific time event. This information can be provided either by the source data itself or by a change table, where an additional mechanism keeps track of the changes alongside the originating transactions.
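
A minimal sketch of incremental extraction, assuming the source table carries a last-modified timestamp column and that the last extraction time is recorded in a run log (both are assumptions for this example):

-- Incremental extraction driven by a last_updated timestamp.
-- The last_updated column and the etl_run_log table are assumed for illustration.
INSERT INTO stg_customers (customer_id, customer_name, city, status)
SELECT c.customer_id, c.customer_name, c.city, c.status
FROM   customers@src_link c
WHERE  c.last_updated > (SELECT MAX(last_extract_time)
                         FROM   etl_run_log
                         WHERE  source_name = 'customers');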

Many data warehouses do not use any change-capture techniques as part of the
extraction process. Instead, entire tables from the source systems are extracted to the
data warehouse or staging area, and these tables are compared with a previous extract
from the source system to identify the changed data. This approach may not have
significant impact on the source systems, but it clearly can place a considerable burden
on the data warehouse processes, particularly if the data volumes are large.
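
A sketch of this compare-full-extracts approach, assuming the current and previous extracts sit side by side in the staging area, is a simple set difference (MINUS in Oracle; EXCEPT in most other databases):

-- Rows that are new or changed since the previous extract.
-- The three staging table names are illustrative.
INSERT INTO stg_customers_delta
SELECT * FROM stg_customers_current
MINUS
SELECT * FROM stg_customers_previous;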


2.2.2 Physical Extraction Methods

Depending on the chosen logical extraction method and the capabilities and restrictions on the source side, the extracted data can be physically extracted by two mechanisms: the data can either be extracted online from the source system or from an offline structure. This gives the following two methods of physical extraction.

[A] Online Extraction

The data is extracted directly from the source system itself. The extraction process can connect directly to the source system to access the source tables themselves, or to an intermediate system that stores the data in a preconfigured manner (for example, snapshot logs or change tables). Note that the intermediate system is not necessarily physically different from the source system.

With online extractions, you need to consider whether the distributed transactions are
using original source objects or prepared source objects.

[B] Offline Extraction

The data is not extracted directly from the source system but is staged explicitly outside the original source system. The data either already has an existing structure (for example, redo logs) or was created by an extraction routine.

2.3 Transformation

Data transformations are the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data scrubbing techniques.

The data can be transformed using multistage data transformation, as follows.

The data transformation logic for most data warehouses consists of multiple steps. For
example, in transforming new records to be inserted into a customer table, there may
be separate logical transformation steps to validate each dimension key.

Consider, for example, implementing each transformation as a separate SQL operation and creating a separate, temporary staging table (such as the tables new_sales_step1 and new_sales_step2 in the sketch below) to store the incremental results for each step. This load-then-transform strategy also provides a natural checkpointing scheme for the entire transformation process, which enables the process to be more easily monitored and restarted. However, a disadvantage of multistaging is that the space and time requirements increase.

It may also be possible to combine many simple logical transformations into a single
SQL statement or single PL/SQL procedure. Doing so may provide better performance
than performing each step independently, but it may also introduce difficulties in
modifying, adding, or dropping individual transformations, as well as recovering from
failed transformations.
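
The following sketch shows the multistage idea with the new_sales_step1 and new_sales_step2 tables; the column names, the stg_sales source table, and the dim_product lookup are assumptions for illustration only.

-- Step 1: keep only raw sales rows that carry a product code.
INSERT INTO new_sales_step1 (sales_id, product_code, sale_date, amount)
SELECT sales_id, product_code, sale_date, amount
FROM   stg_sales
WHERE  product_code IS NOT NULL;

-- Step 2: resolve the product code against the product dimension.
INSERT INTO new_sales_step2 (sales_id, product_key, sale_date, amount)
SELECT t.sales_id, p.product_key, t.sale_date, t.amount
FROM   new_sales_step1 t
JOIN   dim_product p ON p.product_code = t.product_code;

Because each intermediate table is persisted, a failed run can be inspected and restarted from the last completed step, which is the checkpointing benefit described above.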


Listed below are some basic examples that illustrate the types of transformations
performed by this element:

1. Data Validation

Check that all rows in the fact table match rows in dimension tables to enforce data
integrity.
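
For example, a query of the following shape (fact and dimension names are assumed) reports fact rows whose customer key has no matching dimension row:

-- Fact rows that reference a customer key missing from the dimension table.
SELECT f.sales_id, f.customer_key
FROM   fact_sales f
LEFT JOIN dim_customer d ON d.customer_key = f.customer_key
WHERE  d.customer_key IS NULL;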

2. Data Accuracy

Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

3. Data Type Conversion

Ensure that all values for a specified field are stored the same way in the data
warehouse regardless of how they were stored in the source system. For example, if
one source system stores "off" or "on" in its status field and another source system
stores "0" or "1" in its status field, then a data type conversion transformation converts
the content of one or both of the fields to a specified common value such as "off" or
"on".

4. Business Rule Application

Ensure that the rules of the business are enforced on the data stored in the warehouse.
For example, check that all customer records contain values for both FirstName and
LastName fields.
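
A rule of this kind can be checked with a simple query before loading; the staging table name is assumed:

-- Customer records that violate the FirstName/LastName business rule.
SELECT customer_id
FROM   stg_customers
WHERE  FirstName IS NULL
   OR  LastName IS NULL;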


[B] Scrubbing

Data scrubbing is the process of fixing or eliminating individual pieces of data that are
incorrect, incomplete or duplicated before the data is passed to a data warehouse or
another application.

The need to scrub data is made pretty clear by simple questions like this one:

Are Shweta A. Menon of 16 Ambedkar Rd., Dadar, Mumbai, and S. Menon of 16, Ambedkar Road, Dadar, Mumbai, the same person?

You would probably say that most likely they are. But a computer, without help from specialized software, would treat the information as though it were about two different people.

The human eye and mind recognize that the differences between the two sets of data
records are probably the result of mistakes or inconsistencies in data entry. Weeding out
and fixing or discarding inconsistent, incorrect or incomplete data is what's called data
scrubbing or cleansing.

"Dirty data" has been a problem for as long as there have been computers -- or maybe
for as long as people have attempted to gather and analyze information. It's a large part
of the "garbage in" that can result in the worthless "garbage out" of a computing
process.

The issue of data hygiene has become increasingly important as more and more
corporations implement complex customer relationship management (CRM) systems and
build data warehouses that merge information from many different sources.

Without data cleansing, the IT staffs of those companies face the unappetizing prospect
of merging corrupt or incomplete bits of data from multiple databases. A single piece of
dirty data might seem like a trivial problem, but if you multiply that "trivial" problem by
thousands or millions of pieces of erroneous, duplicated or inconsistent data, it becomes
a prescription for chaos.

In a 2001 report on organizations implementing data warehouses for business intelligence, the following were identified as common sources of dirty data:

o Poor data entry, which includes misspellings, typos and transpositions, and variations in spelling or naming.

o Data missing from database fields.

o Lack of companywide or industrywide data coding standards (a big problem in health care, for example).

o Multiple databases scattered throughout different departments or organizations, with the data in each structured according to the idiosyncratic rules of that particular database.

o Older systems that contain poorly documented or obsolete data.

As the list suggests, data scrubbing is aimed at more than eliminating errors and
redundancy. The goal is also to bring consistency to various data sets that may have
been created with different, incompatible business rules. Without data scrubbing, those
sets of data aren't very useful when they're merged into a warehouse that's supposed to
feed business intelligence across an organization.

In the early days of computing, most data scrubbing was done by hand. And when
performed by bleary-eyed humans, the laborious task of finding and then fixing or
eliminating incorrect, incomplete or duplicated records was costly - and it often led to
the introduction of new errors.

Now, specialized software tools use sophisticated algorithms to parse, standardize, correct, match and consolidate data. Their functions range from simple cleansing and enhancement of single sets of data to matching, correcting and consolidating database entries from different databases and file systems.

Most of these tools are able to reference comprehensive data sets and use them to
correct and enhance data. For example, customer data for a CRM application could be
referenced and matched to additional customer information, such as household income
and other demographic information.

Companies that want to use specialized data cleansing tools can get them from several
sources. Building the tools in-house was the most common choice among companies
studied by Arlington, Mass.-based Cutter Consortium; of the surveyed companies that
said they were using such tools, 31% said they were building them in-house.

But companies that choose to buy data cleansing software have plenty of options.
Oracle Corp., Ascential Software Corp. in Westboro, Mass., and Group 1 Software Inc. in
Lanham, Md., led other vendors in the Cutter survey, with 8% of the market each. Other
vendors, including PeopleSoft Inc. in Pleasanton, Calif., SAS Institute Inc. in Cary, N.C.,
and Informatica Corp. in Redwood City, Calif., were bunched a few percentage points
behind. The major data warehouse and business-intelligence vendors also include data
scrubbing functionality in their products.

2.4 Loading

The ETL loading component is responsible for loading transformed data into the data
warehouse database. Data warehouses are usually updated periodically rather than
continuously, and large numbers of records are often loaded to multiple tables in a
single data load.

The data warehouse is often taken offline during update operations so that data can be loaded faster and OLAP cubes can be updated to incorporate the new data.


2.4.1 Loading Mechanisms


You can use the following mechanisms for loading a warehouse:

[A] SQL*Loader
[B] External Tables
[C] OCI and Direct-Path APIs
[D] Export/Import

Let us learn these mechanisms:

[A] SQL*Loader

Before any data transformations can occur within the database, the raw data must be made accessible to the database. One approach is to load it into the database. The most common technique for transporting data is by way of flat files.

SQL*Loader is used to move data from flat files into a data warehouse. During this data
load, SQL*Loader can also be used to implement basic data transformations. When
using direct-path SQL*Loader, basic data manipulation, such as datatype conversion and
simple NULL handling, can be automatically resolved during the data load. Most data
warehouses use direct-path loading for performance reasons.
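
A minimal SQL*Loader control file might look like the sketch below. The file, table, and column names are assumptions, and a production load would normally add error handling and logging options.

-- load_sales.ctl (illustrative): load comma-separated rows into a staging table
LOAD DATA
INFILE 'sales_extract.dat'
APPEND
INTO TABLE stg_sales
FIELDS TERMINATED BY ','
(sales_id, product_code, sale_date DATE "YYYY-MM-DD", amount)

It would typically be invoked with something like sqlldr control=load_sales.ctl direct=true, where direct=true requests the direct-path load mentioned above.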

[B] External Tables

Another approach for handling external data sources is using external tables. Oracle9i's external table feature enables you to use external data as a virtual table that can be queried and joined directly and in parallel, without requiring the external data to be first loaded into the database. You can then use SQL, PL/SQL, and Java to access the external data.

The main difference between external tables and regular tables is that externally
organized tables are read-only. No DML operations (UPDATE/INSERT/DELETE) are
possible and no indexes can be created on them.
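
A hedged sketch of an external table definition follows; the directory object, file name, and columns are illustrative assumptions:

-- The directory object ext_dir and the file sales_extract.dat are assumed to exist.
CREATE TABLE ext_sales (
  sales_id     NUMBER,
  product_code VARCHAR2(20),
  amount       NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('sales_extract.dat')
);

-- The flat file can now be queried like an ordinary (read-only) table:
SELECT product_code, SUM(amount) FROM ext_sales GROUP BY product_code;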

[C] OCI and Direct-Path APIs

OCI and direct-path APIs are frequently used when the transformation and computation
are done outside the database and there is no need for flat file staging.

[D] Export/Import

Export and import are used when the data is inserted as is into the target system. No
large volumes of data should be handled and no complex extractions are possible.

2.5 Meta Data

The ETL meta data functional component is responsible for maintaining information
(meta data) about the movement and transformation of data, and the operation of the
data warehouse. It also documents the data mappings used during the transformations.


Meta data logging provides possibilities for automated administration, trend prediction,
and code reuse.

Examples of data warehouse meta data that can be recorded and used to analyze the
activity and performance of a data warehouse include:

o Data Lineage, such as the time that a particular set of records was loaded into the data warehouse.

o Schema Changes, such as changes to table definitions.

o Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data type.

o Transformation Statistics, such as the execution time of each stage of a transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on.

o DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package.

o Data Warehouse Usage Statistics, such as query times for reports.

3. ETL Design Considerations

Regardless of their implementation, a number of design considerations are common to all ETL systems:

[A] Modularity

ETL systems should contain modular components that perform discrete tasks. This
encourages reuse and makes them easy to modify when implementing changes in
response to business and data warehouse changes. Monolithic systems should be
avoided.

[B] Consistency

ETL systems should guarantee consistency of data when it is loaded into the data
warehouse. An entire data load should be treated as a single logical transaction—either
the entire data load is successful or the entire load is rolled back. In some systems, the
load is a single physical transaction, whereas in others it is a series of transactions.
Regardless of the physical implementation, the data load should be treated as a single
logical transaction.
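
As a minimal sketch, the whole load can be wrapped in a single PL/SQL block so that it commits or rolls back as one unit; the procedure names are placeholders for the individual load steps.

-- Either every step of the load succeeds and commits together, or nothing does.
BEGIN
  load_dim_customer;   -- placeholder procedures, one per load step
  load_dim_product;
  load_fact_sales;
  COMMIT;
EXCEPTION
  WHEN OTHERS THEN
    ROLLBACK;          -- undo the entire load if any step fails
    RAISE;
END;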

[C] Flexibility

ETL systems should be developed to meet the needs of the data warehouse and to
accommodate the source data environments. It may be appropriate to accomplish some transformations in text files and some on the source data system; others may require
the development of custom applications. A variety of technologies and techniques can
be applied, using the tool most appropriate to the individual task of each ETL functional
element.

[D] Speed

ETL systems should be as fast as possible. Ultimately, the time window available for ETL
processing is governed by data warehouse and source system schedules. Some data
warehouse elements may have a huge processing window (days), while others may
have a very limited processing window (hours). Regardless of the time available, it is
important that the ETL system execute as rapidly as possible.

[E] Heterogeneity

ETL systems should be able to work with a wide variety of data in different formats. An
ETL system that only works with a single type of source data is useless.

[F] Meta Data Management

ETL systems are arguably the single most important source of meta data about both the data in the data warehouse and the data in the source systems. In addition, the ETL process itself generates useful meta data that should be retained and analyzed regularly.

4. ETL Architectures

It is important to understand the different ETL architectures and how they relate to each
other. Essentially, ETL systems can be classified in two architectures: the homogenous
architecture and the heterogeneous architecture.

4.1 Homogenous Architecture

A homogenous architecture for an ETL system is one that involves only a single source
system and a single target system. Data flows from the single source of data through
the ETL processes and is loaded into the data warehouse, as shown in the following
diagram.

Operational data -> ETL System -> Data Warehouse


Most homogenous ETL architectures have the following characteristics:

o Single data source: Data is extracted from a single source system, such as an OLTP system.

o Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.

o Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.

o Light structural transformation: Because the data comes from a single source, the amount of structural changes such as table alteration is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.

o Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.

4.2 Heterogeneous Architecture

A heterogeneous architecture for an ETL system is one that extracts data from multiple
sources, as shown in the following diagram.

Operational data (multiple sources) -> ETL System -> Data Warehouse

The complexity of this architecture arises from the fact that data from more than one source must be merged, rather than from the fact that data may be formatted differently in the different sources. However, significantly different storage formats and database schemas do add further complications.

Most heterogeneous ETL architectures have the following characteristics:


o Multiple data sources.

o More complex development: The development effort required to extract the data is
increased because there are multiple source data formats for each record type.

o Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Pre-computed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.

o Significant structural transformation: Because the data comes from multiple sources, the amount of structural changes, such as table alteration, is significant.

o Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data
marts.

5. ETL Development

ETL development consists of two general phases: identifying and mapping data, and
developing functional element implementations. Both phases should be carefully
documented and stored in a central, easily accessible location, preferably in electronic
form.

5.1 Identify and Map Data

This phase of the development process identifies sources of data elements, the targets
for those data elements in the data warehouse, and the transformations that must be
applied to each data element as it is migrated from its source to its destination. High
level data maps should be developed during the requirements gathering and data
modeling phases of the data warehouse project. During the ETL system design and
development process, these high level data maps are extended to thoroughly specify
system details.

5.2 Identify Source Data

For some systems, identifying the source data may be as simple as identifying the server
where the data is stored in an OLTP database and the storage type (SQL Server
database, Microsoft Excel spreadsheet, or text file, among others).

In other systems, identifying the source may mean preparing a detailed definition of the
meaning of the data, such as a business rule, a definition of the data itself, such as
decoding rules (O = On, for example), or even detailed documentation of a source
system for which the system documentation has been lost or is not current.


5.3 Identify Target Data

Each data element is destined for a target in the data warehouse. A target for a data
element may be an attribute in a dimension table, a numeric measure in a fact table, or
a summarized total in an aggregation table. There may not be a one-to-one
correspondence between a source data element and a data element in the data
warehouse because the destination system may not contain the data at the same
granularity as the source system. For example, a retail client may decide to roll data up
to the SKU level by day rather than track individual line item data. The level of item
detail that is stored in the fact table of the data warehouse is called the grain of the
data. If the grain of the target does not match the grain of the source, the data must be
summarized as it moves from the source to the target.
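
A sketch of summarizing to the target grain, assuming a line-item staging table and a daily, SKU-level fact table (both names are illustrative):

-- Roll individual line items up to the grain of the fact table:
-- one row per SKU per day.
INSERT INTO fact_daily_sales (sku, sale_date, quantity, sales_amount)
SELECT   sku, sale_date, SUM(quantity), SUM(line_amount)
FROM     stg_line_items
GROUP BY sku, sale_date;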

5.4 Map Source Data to Target Data

A data map defines the source fields of the data, the destination fields in the data
warehouse and any data modifications that need to be accomplished to transform the
data into the desired format for the data warehouse. Some transformations require
aggregating the source data to a coarser granularity, such as summarizing individual
item sales into daily sales by SKU. Other transformations involve altering the source data
itself as it moves from the source to the target. Some transformations decode data into
human readable form, such as replacing "1" with "on" and "0" with "off" in a status field.
If two source systems encode data destined for the same target differently (for
example, a second source system uses Yes and No for status), a separate
transformation for each source system must be defined. Transformations must be
documented and maintained in the data maps. The relationship between the source and
target systems is maintained in a map that is referenced to execute the transformation
of the data before it is loaded in the data warehouse.
