
A Comparative Study of ETL Tools

Sana Yousuf
Department of Computer Science, Military College of Signals,
National University of Sciences & Technology, Islamabad, Pakistan
sn_ysf@yahoo.com

Sanam Shahla Rizvi
Department of Computer Science, Military College of Signals,
National University of Sciences & Technology, Islamabad, Pakistan
ssrizvi@mcs.edu.pk

Abstract: In many organizations valuable data is wasted because it lies around in different formats and in various resources. Data warehouses (DWs) are complex systems of consolidated data whose objective is to assist knowledge workers in the decision making process. The key components of DWs are the Extraction-Transformation-Loading (ETL) processes. Since incorrect or misleading data may produce wrong decisions, the selection of appropriate ETL tools for a DW is necessary to improve data quality. The selection of an ETL tool is a complex and important issue in data warehousing because it determines the quality of the data warehouse. This paper first highlights the ETL process briefly, then discusses some of the available ETL tools along with a general criterion used as a measuring parameter for selecting appropriate ETL tools. At the end, an analysis of the tools based on the generalized criteria is presented to give insight into which tool is better for which circumstance.

Keywords: data warehouses, ETL tools, complex systems, enterprise systems

I. INTRODUCTION

A data warehouse is a large data repository that consolidates various types of data, transformed into a single suitable format. Depending on specific business needs it can be architected differently. In general, however, data stored in operational databases is transferred to a data warehouse pre-processing platform, also known as the staging area; after processing it moves into the data warehouse, and lastly it is transformed into sets of conformed data marts.

A. ETL Process and Concepts

Extract, Transform and Load (ETL) is an important component of the data warehousing architecture. The process includes extraction of data from various data sources, transformation of the extracted data according to business requirements, and loading of that data into the data warehouse. Any programming language can be used to build an ETL process, but building one from bits and pieces is quite complex. Various ETL tools are available in the market, allowing an enterprise to select one based on its requirements and needs. Over time these tools have matured and now provide much more than just extraction, transformation and loading of data. The improvements include capabilities such as "data profiling, data quality control, monitoring and cleansing, real-time and on-demand data integration in a service oriented architecture, and metadata management" [12]. Moreover, ETL tools are now customizable according to the functional requirements of an enterprise data warehouse.

a) Extraction

Being the first step in the ETL process, its focus is on extracting data from different source systems. These are called source systems because they can be of any type: internal, external, structured or unstructured. Thus source systems can be mainframe applications, flat files, ERP applications, relational databases, non-relational databases, CRM tools or even message queues. These sources may have different data formats, i.e. different internal representations, which makes extraction a difficult process. An extraction tool should therefore be able to:

- Understand all the different data storage formats
- Communicate with various relational databases
- Read and understand the different file formats used in an organization
- Extract only relevant data before bringing it into the DW
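To make the extraction step concrete, the sketch below pulls rows from a relational source and records from a CSV flat file into one common in-memory representation. It is a minimal Python illustration, not the API of any surveyed tool; the database, file, table and column names are hypothetical.

    import csv
    import sqlite3

    def extract_from_database(db_path, query):
        """Extract rows from a relational source as a list of dicts."""
        conn = sqlite3.connect(db_path)
        conn.row_factory = sqlite3.Row
        try:
            return [dict(row) for row in conn.execute(query)]
        finally:
            conn.close()

    def extract_from_flat_file(csv_path):
        """Extract records from a CSV flat file as a list of dicts."""
        with open(csv_path, newline="") as f:
            return list(csv.DictReader(f))

    # Hypothetical sources: only the relevant columns are pulled,
    # so irrelevant data never reaches the staging area.
    orders = extract_from_database("erp.db", "SELECT id, amount, ts FROM orders")
    customers = extract_from_flat_file("crm_customers.csv")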

b) Transformation

The transformation phase ensures data consistency and performs data cleansing before the data is loaded into the data warehouse. To transform the data properly, a number of rules and business calculations are applied to the extracted data so that the different data formats are mapped into a single format. Transformation can be integrated with the extraction or loading phase, depending upon when it is performed.
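As a minimal illustration of rule-based transformation (generic Python, not a vendor mapping language; the field names, date format and the 17% tax rule are hypothetical), records arriving in different shapes can be mapped into one target format:

    from datetime import datetime

    def transform(record):
        """Apply business rules so records from different sources
        share one format before loading."""
        return {
            # Normalize inconsistent key names from different sources.
            "customer_id": record.get("cust_id") or record.get("CustomerID"),
            # Cleanse: trim whitespace and unify casing.
            "name": record.get("name", "").strip().title(),
            # Map the source date format to a single ISO representation.
            "order_date": datetime.strptime(record["date"], "%d/%m/%Y").date().isoformat(),
            # Business calculation (placeholder rule): gross = net + 17% tax.
            "gross_amount": round(float(record["net"]) * 1.17, 2),
        }

    # Sample input standing in for the output of the extraction step.
    extracted_rows = [{"cust_id": 7, "name": " alice SMITH ",
                       "date": "01/03/2010", "net": "100.00"}]
    clean_rows = [transform(r) for r in extracted_rows]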

c) Loading

After the extracted data has been transformed and cleansed, it is loaded into the fact and dimension tables of the data warehouse, where it serves various analytical purposes. Loading is done regularly so that data does not pile up. It can be required in one of two situations:

- Loading the new data currently contained in the operational database
- Loading the updates corresponding to the changes that occurred in the operational database

Reference [3] states that "incremental loading is the preferred approach to data warehouse refreshment because it generally reduces the amount of data that has to be extracted, transformed, and loaded by the ETL system. ETL jobs for incremental loading require access to source data that has been changed since the previous loading cycle. For this purpose, so called Change Data Capture (CDC) mechanisms at the sources can be exploited, if available. Additionally, ETL jobs for incremental loading potentially require access to the overall data content of the operational sources."
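A minimal sketch of timestamp-based change data capture for incremental loading follows. It assumes the source table carries a last_modified column and that the warehouse tables already exist; it illustrates the principle described in [3], not the CDC mechanism of any particular tool.

    import sqlite3

    def incremental_load(source_db, warehouse_db, last_load_ts):
        """Extract only rows changed since the previous loading cycle
        and upsert them into the warehouse fact table."""
        src = sqlite3.connect(source_db)
        dwh = sqlite3.connect(warehouse_db)
        changed = src.execute(
            "SELECT id, amount, last_modified FROM orders "
            "WHERE last_modified > ?", (last_load_ts,)).fetchall()
        # Upsert: updated rows replace earlier versions, new rows are inserted.
        dwh.executemany(
            "INSERT OR REPLACE INTO fact_orders (id, amount, last_modified) "
            "VALUES (?, ?, ?)", changed)
        dwh.commit()
        # The highest timestamp seen becomes the watermark for the next cycle.
        return max((row[2] for row in changed), default=last_load_ts)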

The following section provides insight into the background of ETL tools. Section III presents a brief overview of various ETL tools. Section IV sets the criteria used to rank the available tools, while Section V presents a comparative analysis of the tools. The paper ends with a conclusion of the overall study in Section VI.

II. BACKGROUND OF ETL TOOLS

An ETL tool must provide a certain set of basic ETL processing facilities, as explained in Section I, to rank as a proper ETL tool. Since 2003 Passionned, a consultancy and research firm, has been closely monitoring the market for both ETL and data integration tools [4]. Earlier surveys were based on the main market-driving entities, also known as visionaries. Many organizations used to assume that they had automatically made the right choice if they purchased a tool from one of the market leaders. However, the trend changed over time, and organizations started building ETL tools themselves according to their own requirements. Since the late nineties, all the major business intelligence (BI) vendors have purchased or developed their own ETL tools. BI tools had more reliable ETL processes and a well designed method of maintaining the data warehouse. BI provided a better solution, but it consumed 70-80% of the costs involved in a successful BI system.

Passionned, in its ETL Tools survey 2009, described the importance of evaluating and promoting ETL tools, because many organizations still built their data warehouses by hand, i.e. by writing complex PL/SQL or SQL and stored procedures. The surveyors' finding was that developer productivity would increase by a factor of 3-5 if a proper ETL tool were used. Thus, if proper guidance were available to enterprises, choosing the right product would become easier and less risky for the organization itself. As explained by reference [5], construction of data warehouses through ETL tools resulted in a better, more stable and more reliable data warehouse that allowed more aspects to be checked and monitored in relation to each other. Companies also present comparisons of their offered products with market competitors on their own official websites; Adeptia [10], Microsoft SSIS and Informatica [3] are such examples.

III. SOME FAMOUS ETL TOOLS

Some famous ETL tools available in the market are as follows:

A. Pentaho Data Integration

Pentaho [12] is a commercial open-source Business Intelligence suite with a data integration product named Kettle. Using an innovative metadata-driven approach, it is fast and has an easy-to-use GUI. Started in 2001, it has grown and today has a strong community of 13,500 registered users. It also supports multi-format data and allows data movement between many different databases and files.

B. Talend Open Studio

Talend Open Studio (TOS) [10] is another open-source tool with data integration support. Started in 2006, it has a smaller community of followers but still holds a considerable market share, as two of its backers are finance companies. Rather than being metadata driven it uses a code-driven approach, with a GUI for user interaction. Its code generation capability produces executable Java or Perl code that can later be run on a server.

C. Informatica Power Center

Informatica Power Center (IPC) [3] is not open-source software but a commercial, widely recommended data integration suite, and the market share leader in data integration tools. Founded in 1993, it has made its place in the market with consistency and leadership; today it has 2,600 registered users, of which 100 are listed stock exchange companies. The main focus of IPC is data integration, with numerous capabilities, e.g. enterprise-scale architecture, data cleansing, data profiling, web services and interoperability with current and legacy systems.

D. Inaplex Inaport

Inaplex [12] provides mid-market solutions focused on customer relationship management (CRM) data integration. Besides CRM, it also emphasizes simple solutions for data integration and accountancy handling.

E. Oracle Warehouse Builder

The Oracle Warehouse Builder (OWB) [13] is "a comprehensive tool for ETL, relational and dimensional modeling, data quality, data auditing, and full lifecycle management of data and metadata" [13]. It provides high performance, security and scalability by using the Oracle database as its metadata repository and transformation engine.

F. IBM Information Server

IBM Information Server (IBM IS, DataStage) [10] is a product by IBM that is well known for its services. The tool's capabilities include data consolidation, synchronization and distribution across disparate databases; automatic data profiling and analysis in terms of content and structure; data quality enhancement; and transformation and delivery to and from complex sources, i.e. the capability to take data from any source format and deliver it to any target, within or outside the enterprise, at the right time. It also allows integration and information access for diverse data and content regardless of where the data is placed. With its data replication services, customer information management can be done quickly.

G. Microsoft SQL Server Integration Services

Microsoft SQL Server Integration Services (MS SSIS) [14] allows run-time data transfer and management. Designed to support enterprise-wide applications, it provides a platform for performing ETL functions and for creating and controlling data packages. It allows building script applications with .NET platform support, offers increased scalability through thread pooling, and provides a more advanced import and export wizard. It also allows customization of packages to suit specific organizational needs, supports digital signatures for security, and supports service-oriented architecture.

IV. ETL TOOL FEATURES

With the available span of functionality and the sizable number of ETL tool vendors, it is quite difficult to rank the full variety of tools, as every tool has some special features of its own. Some generic criteria have been identified by [5], on the basis of which the following comparison and graphs were produced. The following general aspects can be kept in mind when evaluating an ETL tool.

A. Architecture

When evaluating any tool with respect to architecture, aspects such as support for parallel processing, symmetric multiprocessing, massively parallel processing, clustering, load balancing and feasibility for grid computing should be considered. Support for multi-user management of ETL processes running on multiple machines, and support for a common meta-model, i.e. allowing the exchange of metadata with the same brand and other brands, should be considered as well.
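As a rough illustration of why parallel processing matters (a generic Python sketch; real ETL engines distribute work across server processes or cluster nodes, and the 10% markup rule is a placeholder), a transformation can be applied to partitions of the input concurrently:

    from concurrent.futures import ProcessPoolExecutor

    def transform_partition(rows):
        """Transform one partition of the extracted data (placeholder rule)."""
        return [{**r, "amount": float(r["amount"]) * 1.1} for r in rows]

    def parallel_transform(rows, workers=4):
        # Split the input into roughly equal partitions, one per worker.
        size = max(1, len(rows) // workers)
        parts = [rows[i:i + size] for i in range(0, len(rows), size)]
        # Each partition is transformed in a separate worker process.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = pool.map(transform_partition, parts)
        return [row for part in results for row in part]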

B. Functionality

Two main aspects of an ETL tool's functionality are important: the metadata support and the overall functionality provided by the tool. The main functionality question is whether the tool is data cleansing oriented or data transformation oriented, or performs both equally well. This gives a clear picture of which tool to select depending on the nature of the data that will be fed into the tool. Support for a direct connection to the data source for input is also an important aspect of functionality. Metadata support, on the other hand, is a key aspect too: an ETL tool is also responsible for using metadata to map source data to its destination. Thus choosing a tool that conforms to the organization's metadata strategy is very important.
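The role of metadata in mapping sources to destinations can be illustrated with a small declarative mapping, kept separate from the code that executes it (the column names and the mapping format are hypothetical; real tools hold such mappings in a metadata repository):

    # Metadata: declarative source-to-target column mapping,
    # maintained separately from the code that applies it.
    MAPPING = {
        "cust_id":   "customer_id",
        "full_name": "name",
        "ord_total": "amount",
    }

    def apply_mapping(source_row, mapping=MAPPING):
        """Rename source columns to target columns as dictated by metadata."""
        return {target: source_row[source] for source, target in mapping.items()}

    row = {"cust_id": 7, "full_name": "Alice Smith", "ord_total": "120.50"}
    print(apply_mapping(row))  # {'customer_id': 7, 'name': 'Alice Smith', 'amount': '120.50'}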

C. Usability

Usability is one of the important factors of any tool. The tool should be easy to use, easy to understand and quick to become familiar with. Aspects of concern are that the tool should have a well balanced interface and must support the typical task sequences of ETL usage.

D. Reusability

Reusability requires that the components of a data warehouse architecture constructed using the ETL tool be reusable and able to handle parameters. The tool should be capable of dividing the process into small building blocks, allowing the user to define functions and to use those functions in the process flow.
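The building-block idea can be sketched as small composable, parameterized steps with user-defined functions slotted into the flow (a generic illustration, not any tool's actual component model):

    def pipeline(*steps):
        """Compose small reusable building blocks into one process flow."""
        def run(rows):
            for step in steps:
                rows = step(rows)
            return rows
        return run

    def filter_rows(predicate):          # parameterized, reusable block
        return lambda rows: [r for r in rows if predicate(r)]

    def user_defined(fn):                # wrap a user-defined function as a block
        return lambda rows: [fn(r) for r in rows]

    # The same blocks can be recombined for other flows.
    flow = pipeline(
        filter_rows(lambda r: r["amount"] > 0),
        user_defined(lambda r: {**r, "amount": round(r["amount"], 2)}),
    )
    print(flow([{"amount": 10.567}, {"amount": -3}]))  # [{'amount': 10.57}]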

E. Connectivity

The main aspects to consider include the native connections the tool supports, the packages it can read metadata from, the types of message queuing products the tool can connect to, the capability to join tables graphically, support for the changed data capture principle, transformation matching and address cleansing ability, as well as options for data profiling such as uniqueness and distribution.

F. Interoperability

Last but not least, the tool should be capable of running on a number of platforms and also on different versions of a product.

V. ANALYSIS OF ETL TOOLS

With all the aspects discussed in Section IV in mind, an analysis of the services provided by the tools is presented hereafter. In choosing any tool, its respective aspects should be considered; the following graph-based analysis supports that decision making. For this analysis, various websites, vendors' white papers, web blogs, comparisons and previous surveys were consulted, and the analysis was conducted against the basic set of features discussed in Section IV. Each of the ETL tools discussed in Section III is graded with points according to the level of service it supports, while in the graphs the vendors are denoted by acronyms instead of full names.
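The grading itself is simple point aggregation: one point per supported aspect, summed per category. The sketch below shows the method only; the tool names and feature flags are placeholders, not the survey's actual findings.

    # One point per supported aspect; the values below are placeholders,
    # not the actual survey data behind the graphs.
    features = {
        "Tool A": {"cleansing": True, "transformation": True,
                   "integration_services": True, "common_metadata_model": False},
        "Tool B": {"cleansing": True, "transformation": True,
                   "integration_services": False, "common_metadata_model": False},
    }

    # Summing boolean flags counts the supported aspects per tool.
    scores = {tool: sum(supported.values()) for tool, supported in features.items()}
    print(scores)  # {'Tool A': 3, 'Tool B': 2}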

A. Architectural Aspects

Based on support for enterprise architecture, clustering, partitioning of data into groups, a web-based application interface and cloud deployment, the following graph depicts the services currently supported by the tools. IPC and OWB stand out in architectural support, with SSIS coming up right behind.

B. ETL Functionality

Points have been granted depending upon the completeness of the tools in terms of functionality. The main aspects considered are support for data cleansing, transformation, integration services and a common metadata model. The graph is drawn by granting one point for each supported aspect and adding up the points that fall into one category. The same was done for both trends, i.e. the basic functionality in 2007 and the improvements up to 2010.

Figure 1. Architectural Support (points per vendor: IBM IS, IPC, Talend OS, OWB, MS SSIS, BO SAP, SAS DIS, Others; criteria: web-based UI, enables SOA, clustering and job distribution, deploy-in-cloud option)

Figure 2. Functionality (points per vendor for the 2007 baseline and the improvement up to 2010)

C. Usability

This graph covers all the points granted to a tool on the basis of an easy-to-use, well designed and balanced interface. What-you-see-is-what-you-get (WYSIWYG) behaviour and task compatibility are further grading criteria. Each point is accumulated for the presence of one of a set of services necessary for ease of use and understanding. The ease of training new users to become accustomed to the interface is also part of the criterion.

D. Reusability

The following graph depicts a comparison and point grading on the basis of the reusability supported, the capability of splitting data streams, automatic documentation, and support for defining user-defined functions and using them in the process flow.

Figure 3. Usability (ease-of-use points per vendor; original 2007 scores and improvements up to 2010)

Figure 4. Reusability (points per vendor for reusable service repository, data partitioning, split data streams, and automatic documentation)

E. Connectivity

Connectivity, as the name indicates, is calculated by aggregating the points granted to a tool on the following aspects: the total number of sources that can be read without any additional middleware, the enterprise applications supported by the tool, the platforms it can run on and, last but not least, support for messaging (i.e. real-time data handling).

F. Interoperability

Detailed platform support is shown in the following graph. All Windows versions are counted as one, as are all Linux versions, while UNIX variants are counted separately.

Figure 5. Connectivity (points per vendor for platforms, data sources, packages, and messages)

Figure 6. Interoperability (per-vendor platform support: Windows, Linux, Sun Solaris, HP-UX, IBM AIX, IBM iSeries OS/400, IBM zSeries MVS, HP Tru64, OpenVMS)

From all the analysis conducted it is still hard to generalize which tool is the best. Informatica proves better on many features, but MS SSIS and OWB have improved well over time and now keep pace with the top contenders. Overall, considering pure ETL tools, IPC can still be ranked as the market leader, with IBM IS coming second alongside Talend OS. When it comes to database-integrated tools, OWB and SSIS follow IPC directly. Thus one should be careful in selecting a tool, as it may not be the best for the organization just by the name of the vendor; the capabilities of the tool should be reviewed before selection.

VI. CONCLUSION

Important data in most organizations is underutilized just because it lies around in different formats and in various resources. Data warehouses (DWs) are complex systems of consolidated data whose main objective is to assist knowledge workers in the decision making process. The key components of DWs are the Extraction-Transformation-Loading (ETL) processes. The goal of this paper is to elaborate the ETL process and its importance to data warehouses, and to provide a comparison based on generalized criteria for finding the suitability of a tool for a certain category of consumers. The paper provides a brief overview of the ETL tools available in the market, specifies some key points for generalizing the capabilities provided by a tool, and uses graph-based analysis on a grade point scale to grade the selected tools. All this provides a comparison of the available tools in terms of the features they provide, helping an organization choose the tool that will best suit its needs.

REFERENCES

[1] T. Y. Wah, H. Peng, and C. S. Hok, "Building Data Warehouse," Proc. 24th South East Asia Regional Computer Conference, Bangkok, Thailand, November 18-19, 2007.
[2] T. M. Nguyen and A. M. Tjoa, "Zero-Latency Data Warehousing for Heterogeneous Data Sources and Continuous Data Streams," Institute of Software Technology and Interactive Systems, Favoritenstr. 9-11/188, 2003.
[3] T. Jörg and S. Dessloch, "Near Real-Time Data Warehousing Using State-of-the-Art ETL Tools," University of Kaiserslautern, 67653 Kaiserslautern, Germany, 2009.
[4] Passionned, "The BI Tool survey report," 2008.
[5] Passionned, "ETL Tools survey report," 2009.
[6] J. Levin, "ETL Tools Comparison," March 2008.
[7] R. Chillar and B. Kochar, "Extraction Transformation Loading: A Road to Data Warehouse," 2nd National Conference on Mathematical Techniques: Emerging Paradigms for Electronics and IT Industries.
[8] Guide to Data Warehousing and Business Intelligence, "ETL Process," available at http://data-warehouses.net/architecture/etlprocess.html.
[9] Pervasive Systems, "Extraordinarily Flexible ETL Platform," available at http://www.pervasiveintegration.com/scenarios/Pages/etl_tools_data_aggregation.aspx.
[10] Adeptia incorporation, "ETL Vendors Comparison."
[11] Guide to Data Warehousing and Business Intelligence, "Architectural Overview," available at http://data-warehouses.net/architecture/overview.html.
[12] ETL Tools Survey, "What is ETL," available at http://www.etltool.com/what-is-etl.htm.
[13] Oracle Warehouse Builder 11g, "A Technical Overview," available at http://www.oracle.com/technology/products/warehouse/index.html.
[14] ETL data warehouse concepts.