All rights reserved Javlin 2011

March 2011
Mgr. Jan Ulrych

ETL
Organizational Matters
Introduction to ETL
More about ETL
CloverETL intro
ETL Projects
Current Trends
Presenter
Jan Ulrych
Graduated from Faculty of Mathematics and Physics
at Charles University, Prague
Has worked for Javlin a.s. as an ETL Consultant since 2008
E-mail: jan.ulrych@javlin.eu

Professional experience
DVRA project at DHL IT Services Europe, Prague
ETL Consultant on various data integration projects
Pre-sales consultant for CloverETL since 2010

Javlin Overview
Javlin since 2005
Javlin is a software developer and services provider
CloverETL platform
Javlin services and ETL consulting
Software development for major clients

Employees: staff of 60+
Developers and consultants
Service, sales, support
Executive management

Experienced management team
Data and ETL software development legacy
Key industry expertise: finance, health, media, logistics, government

Office locations
Prague
Brno
Greater Washington DC
Selected Customers
Session 1: Origins & Motivation
Data Warehousing
A data warehouse is a system that
extracts, cleans, conforms, and delivers source data
into a dimensional data store
and then supports and implements querying
and analysis for the purpose of decision making.
Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004

The most visible part is
querying and analysis

The most complex and time consuming part is
extracts, cleans, conforms, and delivers

How complex is it?
A reliable ETL process makes up 70-80% of a BI (DI or DW) project
Getting data into DW
How to load data into DW?
Scripts in Linux shell, Perl, Python
sqlldr + SQL
Hardcoded in Java, C#, C
In-house built ETL tool
Off-the-shelf ETL tool

Aspects to be kept in mind
Manageability
Maintainability
Transparency
Scalability
Flexibility
Complexity
Auditing
Job restartability
Testing
Session 1: Introduction to ETL
ETL
How to load data into a DW the right way?
Introduce formal ETL process

This section covers
What ETL is
Motivation
Where to use ETL
How to Implement ETL
Key ETL Aspects


Motivation
Is ETL an interesting area?
A reliable ETL process makes up 70-80% of a BI (DI or DW) project.

Let's have a look at the DW & DI market size
In 2003, DI was a USD 9.3 billion market
In 2008, DI was a USD 13 billion market
By 2010, yearly growth was estimated at USD 2.2 billion
TCO of DI can reach USD 509,600 annually

The more systems in the world,
the more work in Data Integration!
What is ETL?
ETL = Extract Transform Load
Extract
Get the data from source system as efficiently as possible

Clean
Perform data cleansing and dimension conforming

Transform
Perform calculations on data

Load
Load the data in the target storage
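
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the way any particular ETL tool works. The sample data, the field names (`customer`, `amount`) and the in-memory SQLite target are assumptions made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical source data; a real extract would read a file or query a system.
SOURCE = "customer,amount\nalice, 10.5\nBob,abc\nalice,4.5\n"

def extract(text):
    """Extract: read raw records from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(records):
    """Clean: trim whitespace, conform names, drop records with invalid amounts."""
    for rec in records:
        try:
            yield {"customer": rec["customer"].strip().lower(),
                   "amount": float(rec["amount"])}
        except ValueError:
            continue  # a real process would log or route the rejected record

def transform(records):
    """Transform: aggregate the amount per customer."""
    totals = {}
    for rec in records:
        totals[rec["customer"]] = totals.get(rec["customer"], 0.0) + rec["amount"]
    return totals

def load(totals, conn):
    """Load: write the result into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer_total (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO customer_total VALUES (?, ?)", sorted(totals.items()))
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(clean(extract(SOURCE))), conn)
rows = conn.execute("SELECT customer, amount FROM customer_total ORDER BY customer").fetchall()
```

Each step is a separate function, so a step can be tested, replaced or restarted on its own.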
Why is ETL (System) Important?
Adds value to data
Removes mistakes and corrects data
Documented measures of confidence in data
Captures the flow of transactional data
Adjusts data from multiple sources to be used
together (conforming)
Structures data to be usable by BI tools
Enables subsequent business / analytical data
processing
ETL Disambiguation
ETL = Extract Transform Load
Not tied specifically to DW anymore

Process/System
A complete process including
Data extraction
Enforcing DQ and consistency standards
Conforming data from disparate systems
Delivering data to target
People, HW, Documentation, Support, etc.
Tool
A piece of software implementing the
three (four) E-(C)-T-L steps.
A tool designed specifically to perform data transformations
ETL Process
Data: DBMS (MS SQL, MySQL, Oracle), XML, flat files, CSV, mainframe
ETL Data Integration: ETL
Analytics: BI Tools (SAP, Cognos), KPI, Data Mining
Presentation: Dashboards, Reports, Portals
Extracting value from the database
Integrated data value for applications
ETL Tool: True Data Integration
Source A: Files, Databases, Message Queues, Web Services
ETL: Read, Apply Logic, Write
Source B: Files, Databases, Message Queues, Web Services
True Data Integration is agnostic of source or target application

ETL is a bridge for bi-directional flow
ETL Data Integration Solutions (1)
Data Migration
Process of transferring data between storage types or
formats. An automated migration frees up human resources
from tedious tasks. Design, extraction, cleansing, load and
verification are done for moderate to high complexity jobs.
Data Consolidation
Usually associated with moving data from remote locations
to a central location or combining data due to an
acquisition or merger.
Data Integration
Process of combining data residing at different sources and
providing a unified view. Emerges in both commercial and
scientific fields and is the focus of extensive theoretical work. Also
referred to as Enterprise Information Integration.
ETL Data Integration Solutions (2)
Master Data Management
Processes and tools to define and manage non-transactional
data. Provides for collecting, aggregating, matching,
consolidating, quality-assuring, persisting and distributing
data to an organization to ensure consistency and control.
Data Warehouse
Repository of electronically stored data. ETL facilitates
populating, reporting and analysis. Includes business
intelligence as well as metadata retrieval and management
tools.
Data Synchronization
Process of making sure two or more locations contain the
same up-to-date files. When a file is added, changed, or deleted
at one location, synchronization mirrors the action at the other
location.
Where is ETL used?
How to Implement ETL System (1)
Source: Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004
How to Implement ETL System (2)
Scripting (shell, perl, python)
PL/SQL, sqlldr
Transformation hardcoded in Java, C#
Develop (universal) ETL tool in-house
Using off-the-shelf ETL tool

ETL Tool Key Features (1)
Extract, Load => flexible on interfaces
Flat files, DBMS, XML data, XLS,
MQ, web services, LDAP
Semi-structured data (emails, web logs, wiki pages)
Unstructured data (blogs, documents)
Extensibility with custom connectors
Local data, remote data FTP(S), SFTP, SCP, http(s)
Clean
Lookups, Validations, Filters, Translations
Transform
Changing data structure, Joins, (De)Normalization,
Aggregation, RollUp, Sorting, Partitioning,
Data De-duplication
Ability to call external tools
ETL Tool Key Features (2)
Performance
Symmetric Multiprocessing (SMP)
Pipeline processing
Multithreaded processing
Massively Parallel Processing (MPP)
Clustering
MapReduce
Load balancing
User friendliness
GUI
Metadata capture
Training time
Development
Reusable components
Impact Analysis / Data Lineage
Documentation
ETL Tool Key Features (3)
Manageability
Team collaboration
Transformation repository
Metadata repository
Development process (Dev -> Test -> Prod)
Security
Runtime
Scheduler Automation
Recovery and Restart
Workflow
Others
Vendor stability
Release cycle
Support
Source: Ted Friedman, Mark A. Beyer, Eric Thoo: Magic Quadrant for Data Integration Tools; Gartner RAS Core Research Note G00207435; 19 November 2010
ETL Market
Well Known ETL Tools
Commercial
Ab Initio
IBM DataStage
Informatica PowerCenter
Microsoft SQL Server Integration Services
Oracle Data Integrator
SAP Business Objects Data Integrator
SAS Data Integration Studio
Open-source based
Adeptia Integration Suite
Apatar
CloverETL
Pentaho Data Integration (Kettle)
Talend Open Studio/Integration Suite
The list above is not meant to be comprehensive

ETL or ELT?
ETL = Extract Transform Load
Much more flexible
ELT = Extract Load Transform
Pushed forward (mainly) by Oracle
First get the data into database
Then use Oracle DB tools to work with it
Less flexible
Tightly coupled to the vendor's database/solution
Less flexible on output formats
Requires staging area
Possibly better performance
All data are processed in the same database
Nothing is downloaded from database for Transform step
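
The difference between the two approaches can be illustrated with a small Python sketch that uses an in-memory SQLite database as a stand-in for the target database. The table and column names are hypothetical. In the ETL variant the transformation runs outside the database; in the ELT variant the raw data is first loaded into a staging table and then transformed with SQL inside the database.

```python
import sqlite3

RAW = [("alice", "10.5"), ("bob", "4"), ("alice", "1.5")]  # hypothetical extract

def etl(conn, raw):
    """ETL: transform in the integration layer, then load only the result."""
    totals = {}
    for name, amount in raw:
        totals[name] = totals.get(name, 0.0) + float(amount)
    conn.execute("CREATE TABLE etl_total (name TEXT, total REAL)")
    conn.executemany("INSERT INTO etl_total VALUES (?, ?)", totals.items())

def elt(conn, raw):
    """ELT: load raw data into a staging table, then transform with SQL."""
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", raw)
    conn.execute(
        "CREATE TABLE elt_total AS "
        "SELECT name, SUM(CAST(amount AS REAL)) AS total FROM staging GROUP BY name"
    )

conn = sqlite3.connect(":memory:")
etl(conn, RAW)
elt(conn, RAW)
etl_rows = conn.execute("SELECT name, total FROM etl_total ORDER BY name").fetchall()
elt_rows = conn.execute("SELECT name, total FROM elt_total ORDER BY name").fetchall()
```

Both variants produce the same totals. What differs is where the work happens and that ELT needs the extra staging table.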
Session 2: Dominate Your Data with CloverETL
What is CloverETL
A data integration software platform
Manages, designs and runs your data
Embeddable and scalable
Integrates easily with databases, operating
systems and applications

ETL platform that dominates your data
Extract Transform Load
Reads from one or more data sources
Transforms data in almost any way imaginable
Writes to any number of data targets

Legacy of open source and commitment to commercial use

CloverETL Engine can be used as embedded OEM as well
www.cloveretl.com
Clover works on all OS
Linux
Windows
HP-UX
AIX
IBM AS/400
Solaris
Mac OS X
CloverETL Key Features
Platform independent: Java, integration, library support
Scalability: Desktop, Enterprise, Cluster
Usability: built by ETL experts for ETL experts
Performance: outstanding at all production levels
Services: Clover was built to require minimal services
OEM Embeddable: small footprint, extensible and customizable
Cost: delivers lowest total cost of ownership
CloverETL Product Suite
Design, Manage, Runtime
CloverETL Engine
Pure Java 6.0
Embeddable
Extensible
Ideal for OEM
CloverETL Designer
Visual Designer
Transformation Developer
Eclipse Platform
CloverETL Server
Production Platform
Web apps (Tomcat, WebSphere, GlassFish)
For Enterprise Integration
CloverETL Designer
Features
Transformation design
Intuitive GUI
Drag & drop
Components Library
Debug
CloverETL Vision
Vision
CloverETL enables companies to get more value from
their data quickly without massive infrastructure expense
or years of project investment.
Dominate your data

Approach
CloverETL is the best value for money. It builds on the
open source foundation of the CloverETL Engine and
scales from desktop to enterprise to cluster.
Get it done quickly

Investment
CloverETL is easy-to-use ETL software that can grow in
both core features and user expertise at a fraction of
the cost of larger system vendors.
Low cost buy-in for our customers
CloverETL Approach
Input > Transform > Output

Components
Prebuilt algorithms

Graph
Processing algorithm in visual form

Data Flow
Edges between components

Process
Build > Connect > Configure
Data processed as structured records with named fields
Components operate on record fields
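
The idea of components connected by edges can be sketched in plain Python. This is not the CloverETL API. All function names below are hypothetical and only illustrate records with named fields flowing from an input component through transform components to an output component.

```python
# Records are dictionaries with named fields; components are functions over
# record streams; an "edge" is simply one component consuming another's output.

def reader(rows):
    """Input component: emits records one by one."""
    for row in rows:
        yield dict(row)

def filter_component(records, predicate):
    """Transform component: passes on only records matching the predicate."""
    return (rec for rec in records if predicate(rec))

def reformat(records, mapping):
    """Transform component: derives output fields from input fields."""
    for rec in records:
        yield {name: fn(rec) for name, fn in mapping.items()}

def writer(records, target):
    """Output component: appends records to the target list."""
    for rec in records:
        target.append(rec)

# Build > Connect > Configure
source = [{"name": "Alice", "age": 31}, {"name": "Bob", "age": 17}]
output = []
edge1 = reader(source)
edge2 = filter_component(edge1, lambda r: r["age"] >= 18)
edge3 = reformat(edge2, {"NAME": lambda r: r["name"].upper()})
writer(edge3, output)
```

Because every component only sees records arriving on its input edge, components can be rearranged or reused without changing their code.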
Transformation capabilities
50+ specialized components available for use
Readers and writers
Text or binary files, CSV, XML
Archives including ZIP or GZIP
Remote transfers over HTTP/FTP(S)
Access to messaging via JMS

Database connectivity
JDBC connectivity, support for bulk loaders
Supports any SQL statements, stored procedure calls

Transformers and aggregators
Variety of data manipulation algorithms: sorting, deduping, joining,
arithmetic, aggregating, statistical functions and more
Customizable with user-defined code
Alternative implementations for efficient execution
Physical architecture
Others: Server-centric with thin clients
Server is necessary for development and execution
Transformations and data are stored remotely on server
Limited use when working on multiple sites or in restricted-access
networks

CloverETL: Designer is a standalone application
Development and execution are possible without a central server
Transformations and metadata are stored on the local machine
Transformations can also be deployed to a server for centralized
management
Designer available for all major platforms (Linux, Windows, Mac)

Repository
Others: Central repository
With proprietary storage format (binary files, tables)
Often managed with a proprietary version control system
Can be problematic with mass changes or when hacking is necessary

CloverETL: XML format files and directories
No proprietary repository, uses plain files and directories
Transformation and metadata stored in human-readable XML
Open to any version control system
CVS, SVN, IBM Rational ClearCase, git
Integrated with Eclipse VMS clients
Subclipse, EGit, ClearCase
Flexibility and expressive power
Others: Expression-based languages only
Expressions and built-in functions for simple data manipulation
No support for programming statements (loops, user functions)
Limited when data manipulation requires complex coding

CloverETL: Scripting and more
Components are customizable with CTL or Java code
CTL: scripting language with simple syntax
Allows simple expressions as well as complex code
Has variables, loops, user-defined functions
Many built-in data validation functions
Java: mature programming language for more complex coding
Access to variety of existing libraries
GC-based memory management = rapid development
Extensibility
Others: Custom components
Limited support for developing custom components
Impossible to extend other aspects of data processing

CloverETL: Plugin-based extensible platform
Ready to extend, customize and modify
Plugin-based architecture for easy extension management
Supports custom components, connections, functions
Implemented in Java
Access to libraries
Memory management
Developers

Session 2: CloverETL Designer Examples
CloverETL Server
Features
Automation and scheduling
File and event triggers
Workflows
Monitoring and logging
User management
Real-time ETL
Clustering
Load balancing
Failover
Distributed processing
Inexpensive buy-in
Key features
Runtime automation
Allows integration with existing infrastructure
Simplified management and execution of data transformations

Scalability and optimization
Increases transformation performance
Shortens response time in continuous/transactional processing

Security
Controls access to data, transformations and server configuration
Secures communication between server and clients

Clustering
Allows cooperation between multiple processing nodes
Improves scalability, performance and error resiliency
Runtime automation
Internal scheduler
Supports one-time or periodic execution
Allows interval-based scheduling, including flexible cron-like rules

Integration with enterprise schedulers
Transformation execution can be started by external scheduler
cron, IBM Tivoli (Maestro), Autosys, UC4
Scheduler instructs CloverETL Server via one of its interfaces (HTTP, JMX)

Events, tasks and dependencies
Tasks and transformations are started on internal or external events
Internal: transformation finished or failed
External: file arrived or its size changed
Allows creating dependencies between executed tasks
Suitable for logical sequencing of execution or monitoring purposes
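
An external file event ("file arrived or its size changed") can be detected by comparing a watched directory with its previously seen state. The sketch below shows the idea in plain Python. It is not how CloverETL Server implements its triggers, and the function names and the `orders.csv` file are hypothetical.

```python
import os
import tempfile

def detect_file_events(directory, known):
    """Compare the directory with the previously seen state.

    `known` maps file name -> size from the last check. Returns a list of
    (event, file name) tuples and the new state.
    """
    events = []
    current = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        size = os.path.getsize(path)
        current[name] = size
        if name not in known:
            events.append(("arrived", name))
        elif known[name] != size:
            events.append(("size_changed", name))
    return events, current

def run_transformation(event, name, log):
    """Hypothetical task started by an event; here it only records the call."""
    log.append(f"{event}:{name}")

# Demo: simulate two checks of a watched directory.
log = []
with tempfile.TemporaryDirectory() as watched:
    state = {}
    with open(os.path.join(watched, "orders.csv"), "w") as f:
        f.write("id\n1\n")
    events, state = detect_file_events(watched, state)
    for event, name in events:
        run_transformation(event, name, log)
    with open(os.path.join(watched, "orders.csv"), "a") as f:
        f.write("2\n")
    events, state = detect_file_events(watched, state)
    for event, name in events:
        run_transformation(event, name, log)
```

In practice such a check would run periodically, and the started task could in turn raise an internal "finished" or "failed" event for dependent tasks.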
Runtime automation (cont.)
Monitoring
Supports alert emails or messages with configurable content
Automatically populated with execution status, log, statistics
Allows integration with ticketing systems, support team

Execution history
Automatically stores performance and statistics about each execution
Stored in database tables and log files
Open for any trend analysis and reporting

Archiving
Configurable cleanup of execution logs and history
Can be further extended by scripting
Scalability and optimization
Parallel execution
Executes multiple transformations in parallel
Can execute multiple instances of a single transformation (SOA)

Graph pooling
Improves response time
Useful with SOA architectures and Launch Services

Launch services
Applications implemented as data transformations and deployed as
RESTful web services
Schema on next slide
CloverETL Cluster
Large quantity of data loaded > Processed in parallel in Cluster > Written out in parallel

Increased processing and throughput
- Parallel execution of transformations over multiple servers or nodes
- Load balancing based on individual node utilization

Increased fault tolerance
- Failover in case of problems with particular nodes
- Data replication

Increased flexibility
- Cluster can be dynamically reconfigured by adding or removing nodes
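
The "load, process in parallel, write out" pattern can be sketched on a single machine. In the Python example below, worker threads stand in for cluster nodes. That is an assumption made for illustration only and says nothing about how CloverETL Cluster distributes work. All names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(records, n):
    """Split records round-robin into n partitions."""
    return [records[i::n] for i in range(n)]

def process_partition(part):
    """Work done by one node: here, square every value."""
    return [x * x for x in part]

def run_parallel(records, nodes=3):
    """Partition the input, process partitions in parallel, merge the results."""
    parts = partition(records, nodes)
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        results = list(pool.map(process_partition, parts))  # keeps partition order
    # Gather: merge partial results into a single sorted output
    return sorted(x for part in results for x in part)

result = run_parallel(list(range(10)))
```

Adding or removing nodes only changes the `nodes` argument; the partitioning and merging logic stays the same.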
Session 3: Data Integration Projects
Data Integration Projects
This section covers
How to manage DI projects
Phases of DI project
Responsibilities
Typical issues
Data Integration Projects
Phases of typical Data Integration Project
Requirements
Planning
Analysis (of ETL steps)
Implementation
Documentation
RTP (release to production)
Support
Requirements
Functional
Input data
Output data
Output data format
Transformation logic
Non-functional
Time restrictions
Availability
Frequency of update
Data Latency
How to handle erroneous records
Security requirements

Planning
ETL system implementation must be planned properly
Time for implementation
Correctly prioritized
Thorough data analysis extremely important
Unforeseen data quality issues cause delays
Biggest risk is unexpected data quality issues
Communicate properly

Keep it simple
Do not try to save the world
If you think it can be done simply, do it simply

Analysis Extract
In which source systems is the data we need?
How can we access the system?
Flat files / database access / XML / web service / JMS
Full extract / incremental / change notification / on-request
Local access / ftp(s) / sftp / scp / MQ
Data Syntax
What is a record?
What is the record delimiter?
What is the field delimiter?
What is the data length / data type / format?
Data Semantics
What are the field names?
What data does the field represent?
Ok, what data does the field really represent?
Are there any duplicates?
What are the limitations / restrictions?
What are the data volumes?
How often can the export be done?
Any impact on network (NIA)?
What is the expected data growth rate?
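
Several of the syntax and semantics questions above can be answered by profiling a sample extract. The following Python sketch assumes a semicolon-delimited file with a header row and an `id` key field. The sample data and all names are made up for this example.

```python
import csv
import io

SAMPLE = "id;name;amount\n1;Alice;10.50\n2;Bob;\n2;Bob;\n3;Carol;7\n"  # hypothetical extract

def profile(text, delimiter=";", key="id"):
    """Collect basic syntax and semantics facts about a delimited extract."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    seen, duplicates = set(), []
    for row in rows:
        if row[key] in seen:
            duplicates.append(row[key])
        seen.add(row[key])
    return {
        "fields": reader.fieldnames,
        "record_count": len(rows),
        "duplicate_keys": duplicates,
        "max_length": {f: max(len(r[f]) for r in rows) for f in reader.fieldnames},
        "empty_values": {f: sum(1 for r in rows if r[f] == "") for f in reader.fieldnames},
    }

report = profile(SAMPLE)
```

A report like this is a cheap way to check the source documentation against what the data really contains before implementation starts.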
Analysis Clean & Transform
Which fields need to be validated?
Against which source?
How to handle erroneous data?

What is the data flow?
Between source and target data
What is the transformation logic?
Action on error?
Stop transformation
Process valid data only

Transformation restartability
Small data volumes: transaction-based
Huge data volumes: process in bulk mode
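
The two error actions named above (stop the transformation, or process valid data only) can be sketched as a switch in the processing loop. The validation rules and field names in this Python example are hypothetical.

```python
def to_record(raw):
    """Validate and convert one raw record; raises ValueError on bad data."""
    if not raw.get("id"):
        raise ValueError("missing id")
    return {"id": int(raw["id"]), "amount": float(raw["amount"])}

def run(raw_records, on_error="skip"):
    """Process records; on_error is 'stop' or 'skip' (process valid data only)."""
    valid, rejected = [], []
    for line_no, raw in enumerate(raw_records, start=1):
        try:
            valid.append(to_record(raw))
        except (ValueError, KeyError) as exc:
            if on_error == "stop":
                raise RuntimeError(f"record {line_no} rejected: {exc}") from exc
            rejected.append((line_no, raw, str(exc)))
    return valid, rejected

DATA = [{"id": "1", "amount": "9.5"}, {"id": "", "amount": "3"}, {"id": "3", "amount": "x"}]
valid, rejected = run(DATA, on_error="skip")
```

Keeping the rejected records together with their position and reason makes it possible to report them back to the source system owner.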



Analysis Load
What is the target schema?
What is the target data volume?
What are the history requirements?
Usually SCD type 1, 2 or 3
Data Syntax
What is a record?
What is the record delimiter?
What is the field delimiter?
What is the data length / data type / format?
Data Semantics
What are the field names?
What data does the field represent?
Ok, what data does the field really represent?
What are the limitations / restrictions?
What are the data volumes?
How often can the export be done?
Any impact on network (NIA)?
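
The history requirement decides how changes are loaded. With SCD (slowly changing dimension) type 1 the old value is overwritten, with type 2 a new row version is added, and with type 3 the previous value is kept in an extra column. A minimal Python sketch of type 2 follows; the row layout (`key`, `city`, `valid_from`, `valid_to`, `current`) is an assumption made for this example.

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """Apply SCD type 2: close the current row and append a new version on change.

    `dimension` is a list of row dicts with keys: key, city, valid_from,
    valid_to, current. `incoming` maps natural key -> new attribute value.
    """
    for key, city in incoming.items():
        current = next((r for r in dimension if r["key"] == key and r["current"]), None)
        if current is not None and current["city"] == city:
            continue  # no change, nothing to do
        if current is not None:
            current["valid_to"] = today   # expire the old version
            current["current"] = False
        dimension.append({"key": key, "city": city, "valid_from": today,
                          "valid_to": None, "current": True})
    return dimension

dim = [{"key": "C1", "city": "Prague", "valid_from": date(2010, 1, 1),
        "valid_to": None, "current": True}]
apply_scd2(dim, {"C1": "Brno", "C2": "Vienna"}, date(2011, 3, 1))
```

After the call the dimension holds the expired Prague row, a current Brno row for C1 and a new current row for C2, so the full history stays queryable.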
Implementation
Development
Enforce standardization
Naming conventions
Best practices
Generating surrogate keys
Looking up keys
Applying default values
Testing
Review
Testing on Production data
Unit Testing
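
Generating surrogate keys, looking up keys and applying default values are good candidates for a standardized, unit-tested helper. The Python sketch below is one possible shape. The class, the `customer_id` and `country` fields and the default value are hypothetical.

```python
class KeyGenerator:
    """Assigns a stable integer surrogate key to each natural key."""

    def __init__(self, start=1):
        self._next = start
        self._keys = {}

    def key_for(self, natural_key):
        """Look up the surrogate key; generate a new one on first sight."""
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]

DEFAULTS = {"country": "UNKNOWN"}  # hypothetical default values

def conform(record, keys):
    """Replace the natural key with a surrogate key and fill in defaults."""
    out = dict(DEFAULTS)
    out.update({k: v for k, v in record.items() if v not in (None, "")})
    out["customer_sk"] = keys.key_for(out.pop("customer_id"))
    return out

keys = KeyGenerator()
a = conform({"customer_id": "A-17", "country": "CZ"}, keys)
b = conform({"customer_id": "B-02", "country": ""}, keys)
c = conform({"customer_id": "A-17", "country": None}, keys)
```

A unit test for this helper only needs to assert that repeated natural keys map to the same surrogate key and that empty values fall back to the defaults.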
Implementation
Documentation
Data sources / targets / transformations
Data Lineage
Important to know and publish
Frequency of ETL process runs
Error handling
Support Monitoring checklist

RTP

Planning & Leadership
Contact person for each source system
Contact business person
ETL team responsibilities
Define ETL scope
Perform source system data analysis
Define data quality strategy
Gather & document business rules from business
users
ETL Implementation
Defining & executing Unit & QA testing
Implementing production
ETL team Roles & Responsibilities
ETL Manager
ETL Architect
ETL Developer
System Analyst
Data-Quality Specialist
Database Administrator (DBA)
Dimension Manager
Fact Table Provider
Typical Issues
Project Management
Poor Data Analysis
Data Understanding
Performance
Scalability
Typical Issues Project Management
ETL takes 70-80% of project resources
Be aware of this from the beginning
Communicate this to stakeholders
Plan ETL phase properly

Data Quality
Unexpected data quality issues are the biggest risk
DQ issues cause delays
DQ issues generate extra work



Typical Issues Data Understanding
Source system (data)
Not documented
Documented incorrectly
Represents something other than it should
Data not clean

Transformation/Requirements
Not specified properly
Not specified at all
Initial analysis has not revealed issues/complexity
Requirements being changed

Typical Issues Performance & Scalability
Performance
Is the performance ok now?
Will performance be ok in 5 years?

Scalability
What is the data growth rate?
Are we testing on production data volumes?

Change Data Capture
Time consuming task
Issue on old systems
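
When a source system offers no change log or reliable timestamps, one common fallback is to compare two full extracts. The Python sketch below shows this snapshot comparison, assuming each row has an `id` key. It is only one of several change data capture techniques, and the sample data is made up.

```python
import hashlib

def row_hash(row):
    """Fingerprint of a row's content, independent of field order."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def capture_changes(previous, current, key="id"):
    """Compare two full extracts and return inserted, updated and deleted keys."""
    old = {r[key]: row_hash(r) for r in previous}
    new = {r[key]: row_hash(r) for r in current}
    inserted = sorted(k for k in new if k not in old)
    deleted = sorted(k for k in old if k not in new)
    updated = sorted(k for k in new if k in old and new[k] != old[k])
    return inserted, updated, deleted

yesterday = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 4, "name": "Dan"}]
today = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Robert"}, {"id": 3, "name": "Carol"}]
inserted, updated, deleted = capture_changes(yesterday, today)
```

The cost is a full extract on every run, which is why this approach is time-consuming on large or old systems.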

Session 4: Current Trends
Market Trends
Shift to Semi-structured & Unstructured data
Emails, documents, blogs

Real-time processing
CRM, Zero-latency business
SOA, Web Services, ESB, JMS, MQ

Cloud-mania
Cloud, Cluster, Elastic Cluster
MapReduce, Apache Hadoop

Reducing TCO
License costs, Development cost, Maintenance
Emphasis on value, not price

Services for small customers
Require better ROI


Literature
Ralph Kimball, Joe Caserta: The Data Warehouse ETL Toolkit; Wiley 2004
Ralph Kimball, Margy Ross: The Data Warehouse Toolkit; Wiley
Len Silverston: The Data Model Resource Book; Wiley
Contact Javlin
Web: www.cloveretl.com

US
Javlin Inc.
8000 Towers Crescent Drive
Suite 1350
Vienna, VA 22182
USA
Web: www.javlininc.com
E-mail: info@javlininc.com
Phone: +1 703 847 3600

Europe
Javlin a.s.
Kemencova 18
110 00 Praha 1
Czech Republic
Web: www.javlin.eu
E-mail: info@javlin.eu
Phone: +420 277 003 200