
ETL Software Development

Process and ETL Toolkit



Dell Confidential
ITETL0001
TOOLKIT Part 1
ETL
ETL - EXTRACT -> TRANSFORM -> LOAD
Extract
Extract from source system.
Typically done by source system owner
Not typically done by our team
The end result is flat files stored on our central NFS server
Transform
Transformation is application of business rules and other modifications to facilitate
reporting.
This is the main purpose of the ETL team
Ab Initio is a proprietary software package we use for transformation
Load
Data is loaded to the reporting database (typically Teradata, abbreviated TD)
Other manipulation or summarization, such as rollups, might be done within the database
Application of Ab Initio
Transformation of disparate sources
Aggregation
Referential integrity checking
Transformations for business rules
Database loading
Aggregation for mart tables
Extraction



GDE and Co-Operating System
GDE (1.13.7)
Co-Operating System (2.13)
Sun Solaris
Red Hat Linux
IBM AIX
Windows NT
HP-UX
Compaq Tru64 UNIX
IBM DYNIX/ptx
IBM OS/390
Silicon Graphics IRIX
NCR MP-RAS

Databases
Oracle
Sybase
Teradata
MS SQL Server 7


Co-Operating System Services

Parallel and distributed application execution.
Transactional semantics at the application level (not in the database)
Checkpointing
Monitoring and Debugging
Parallel file management
Parameter-driven components
Lab 1: Setting up Linux directories
Create your directory under /usr/dell/abinitio/training on abidev01
using your NT account
cd /usr/dell/abinitio/training
mkdir your_name

Create a data directory under /serial/data/ddw/serial/training/
mkdir your_name
cd your_name
mkdir lab1


Lab 1: Logon via GDE
Create a login account from GDE
host=abidev01
login= your NT account
host_directory=/usr/dell/abinitio/training/your_name
location=/usr/dell/abinitio/abinitio


Lab 1: Create an Empty Sandbox
Create an empty sandbox
Projects->create sandbox
Directory=/usr/dell/abinitio/training/your_name/lab1
Go to the sandbox in Linux and do an ls -l
xfr
run
mp
dml
db
**The goal is to have a development environment which enables the migration
of a graph or set of graphs to any other environment/location without any
changes.
** Save the graph as build_customer.mp in your mp directory
Lab 1: Collection Files
Collection files are located in /usr/dell/abinitio/training/collection
customer_header.dat
cust_number
cust_name
birth_year
birth_month
birth_day
customer_detail.dat
cust_number
cust_address
cust_zip
address_type

Lab 1: Edit Sandbox
Edit the sandbox to define the sandbox variables
Project->Edit Sandbox
Add AI_ prefix to all the sandbox directories
Define AI_SERIAL_OUT_DATA=/serial/data/ddw/serial/training
Define COLL_HOME=/usr/dell/abinitio/training/collection
$AI_RUN run directory usually contains deployed shells
$AI_DML record format files
$AI_XFR transform files
$AI_MP graphs
$AI_DB database config files
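The sandbox variables exist so that a graph names locations symbolically and never hard-codes an absolute path. The Python sketch below illustrates that idea outside Ab Initio; it is not how the GDE resolves parameters. The values of AI_SERIAL_OUT_DATA and COLL_HOME come from the slide, while the AI_DML value and the resolve helper are hypothetical.

```python
from string import Template

# Sandbox variables. The first two values come from the slide; the AI_DML
# value is a hypothetical sandbox path used only for illustration.
sandbox = {
    "AI_SERIAL_OUT_DATA": "/serial/data/ddw/serial/training",
    "COLL_HOME": "/usr/dell/abinitio/training/collection",
    "AI_DML": "/usr/dell/abinitio/training/your_name/lab1/dml",
}


def resolve(path: str, variables: dict) -> str:
    """Expand $NAME references in a path using the given variables."""
    return Template(path).substitute(variables)


# The graph only ever mentions $COLL_HOME; moving the collection directory
# means changing one sandbox value, not the graph.
header_url = resolve("$COLL_HOME/customer_header.dat", sandbox)
```

Pointing the same path at another environment only requires a different variable mapping, which is the migration goal stated for the sandbox.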
Lab 1: Transform Requirements
Lab 1 transformation requirements:
1. Customer header records should be de-duplicated on customer
number.
2. All leading zeros should be removed from the customer number.
3. The final file should be a left outer join of customer header
and customer detail.
4. In the final file, the date fields should be consolidated into one
field formatted as YYYYMMDD.
5. The final file should have all the columns from header
and detail.
Lab 1: DML for Customer Header
Add an Input File component with its URL pointing to $COLL_HOME/customer_header.dat
and set its DML port to the record format below:
record
string("~|") cust_number = NULL("");
string("~|") cust_name = NULL("");
string("~|") birth_year = NULL("");
string("~|") birth_month = NULL("");
string("\n") birth_day = NULL("");
end;
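To make the header layout concrete, here is a minimal Python sketch that reads one line the way the DML above describes it: four fields terminated by "~|" and a last field terminated by the newline. The sample record and the parse_header helper are hypothetical, and treating an empty string as missing is one reading of NULL("").

```python
HEADER_FIELDS = ["cust_number", "cust_name", "birth_year", "birth_month", "birth_day"]


def parse_header(line: str) -> dict:
    """Split one customer_header.dat line into named fields."""
    # The last field is delimited by "\n", so drop it before splitting.
    values = line.rstrip("\n").split("~|")
    if len(values) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} fields, got {len(values)}")
    # NULL("") in the DML: an empty string stands for a missing value.
    return {name: (value if value != "" else None)
            for name, value in zip(HEADER_FIELDS, values)}


sample = "000123~|Jane Doe~|1975~|04~|09\n"  # hypothetical record
row = parse_header(sample)
```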
Lab 1: DML for Customer Detail
Add an Input File component with its URL pointing to $COLL_HOME/customer_detail.dat
and set its DML port to the record format below:
record
string("~|") cust_number = NULL("");
string("~|") cust_address = NULL("");
string("~|") cust_zip = NULL("");
string("~|") address_type = NULL("");
string(1) newline;
end;
Question: Why is newline defined explicitly in customer detail but not in
customer header?
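One way to explore the question is to parse a detail line by hand. The Python sketch below assumes, from the DML alone, that every detail field (including address_type) ends with "~|" and that the line break is a separate one-character field; check this against the actual file. The sample record and the parse_detail helper are hypothetical.

```python
DETAIL_FIELDS = ["cust_number", "cust_address", "cust_zip", "address_type"]


def parse_detail(line: str) -> dict:
    """Split one customer_detail.dat line into named fields."""
    # Assumed layout: field~|field~|field~|field~| followed by a 1-char newline.
    if not line.endswith("~|\n"):
        raise ValueError("expected a trailing '~|' followed by a newline")
    body = line[:-3]  # drop the last "~|" delimiter and the newline field
    values = body.split("~|")
    if len(values) != len(DETAIL_FIELDS):
        raise ValueError(f"expected {len(DETAIL_FIELDS)} fields, got {len(values)}")
    # NULL("") in the DML: an empty string stands for a missing value.
    return {name: (value if value != "" else None)
            for name, value in zip(DETAIL_FIELDS, values)}


sample = "000123~|1 Main St~|78701~|H~|\n"  # hypothetical record
detail = parse_detail(sample)
```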
Lab 1: Reformat

Add a Reformat for customer header whose output carries the date
as a single YYYYMMDD field
OUTPUT DML:
record
string("~|") cust_number = NULL("");
string("~|") cust_name = NULL("");
string("\n") date_of_birth = NULL("");
end;
Add a Reformat for customer detail to remove the leading zeros
from customer number.
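The two Reformat rules can be stated as plain functions. This is a Python sketch of the logic only, not Ab Initio transform code; the function names are hypothetical, and keeping a single "0" for an all-zero customer number is an assumption the lab does not address.

```python
def consolidate_date(year, month, day):
    """Combine year, month and day strings into YYYYMMDD; None if any part is missing."""
    if not (year and month and day):
        return None
    return f"{int(year):04d}{int(month):02d}{int(day):02d}"


def strip_leading_zeros(cust_number):
    """Remove leading zeros from a customer number held as a string."""
    if not cust_number:
        return cust_number  # leave missing or empty values untouched
    # Assumption: an all-zero key collapses to "0" rather than to an empty string.
    return cust_number.lstrip("0") or "0"
```

For example, consolidate_date("1975", "4", "9") gives "19750409" and strip_leading_zeros("000123") gives "123".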

Lab 1: Sort and Dedup


Sort and Dedup customer header on customer number
Sort customer detail on customer number
The Sort and Dedup keys should be the same. Why? How will
this behave in a multifile system?
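A small Python sketch can help with the first question. It de-duplicates by comparing only adjacent records, which is an assumption about how a dedup over sorted input behaves, so equal keys must already sit next to each other. The sort_and_dedup helper and the choice to keep the first record per key are illustrative.

```python
from itertools import groupby
from operator import itemgetter


def sort_and_dedup(rows, key="cust_number"):
    """Sort rows on key, then keep the first row of each run of equal keys."""
    by_key = itemgetter(key)
    ordered = sorted(rows, key=by_key)  # stable sort: ties keep input order
    # groupby only merges neighbours, so it relies on the sort using the same key.
    return [next(group) for _, group in groupby(ordered, key=by_key)]


rows = [
    {"cust_number": "2", "cust_name": "a"},
    {"cust_number": "1", "cust_name": "b"},
    {"cust_number": "2", "cust_name": "c"},
]
unique = sort_and_dedup(rows)
```

The same reasoning suggests that, when data is split across partitions, records with equal keys would need to land in the same partition before an adjacent-record dedup could catch them.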
Lab 1: Join
Do a left outer join of the header and detail records on customer number.
The final file should have all the columns from header and detail:
record
string("~|") cust_number = NULL("");
string("~|") cust_name = NULL("");
string("~|") date_of_birth = NULL("");
string("~|") cust_address = NULL("");
string("~|") cust_zip = NULL("");
string("~|") address_type = NULL("");
string(1) newline;
end;
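The expected shape of the join output can be sketched in Python. This version uses an in-memory lookup table rather than a join over sorted inputs, so it illustrates the result only, not the component's method; the left_outer_join helper is hypothetical.

```python
from collections import defaultdict

DETAIL_COLUMNS = ("cust_address", "cust_zip", "address_type")


def left_outer_join(headers, details, key="cust_number"):
    """Return every header row, extended with detail columns where a match exists."""
    details_by_key = defaultdict(list)
    for detail in details:
        details_by_key[detail[key]].append(detail)

    joined = []
    for header in headers:
        # No match: emit the header once with empty (None) detail columns.
        matches = details_by_key.get(header[key]) or [dict.fromkeys(DETAIL_COLUMNS)]
        for detail in matches:
            joined.append({**header, **{c: detail.get(c) for c in DETAIL_COLUMNS}})
    return joined


headers = [
    {"cust_number": "1", "cust_name": "A", "date_of_birth": "19750409"},
    {"cust_number": "2", "cust_name": "B", "date_of_birth": "19801231"},
]
details = [
    {"cust_number": "1", "cust_address": "1 Main St", "cust_zip": "78701", "address_type": "H"},
]
result = left_outer_join(headers, details)
```

Customer 2 has no detail record, so it still appears in the result with its three detail columns empty.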

Lab 1: More Thoughts on Join
How will you do an inner join?
How will you do a full outer join?
How will you do a right outer join?
How can you override the join keys?
How can you manipulate the output through join transform?
Why is \n hard coded?
Lab 1: Loading in TD
CREATE MULTISET TABLE test1.lab1_your_name ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL
(
CUST_NUMBER INTEGER NOT NULL,
CUST_NAME VARCHAR(18) CHARACTER SET LATIN NOT CASESPECIFIC,
DATE_OF_BIRTH DATE FORMAT 'yyyy-mm-dd',
CUST_ADDRESS VARCHAR(40) CHARACTER SET LATIN NOT CASESPECIFIC,
CUST_ZIP INTEGER,
ADDRESS_TYPE CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC
)
PRIMARY INDEX XNUP_lab1_your_name ( cust_number );
Lab 1: Loading in TD
UTILITIES:

1) TD_DML_GENERATOR: generates the Ab Initio DML for a Teradata table. Copy the
utility from /usr/dell/abinitio/training/utils/td_dml_generator.ksh into your local utils
directory.

2) Getpasswd: returns the password for an Oracle username on a particular instance.
Configuration management has to add an entry for the username, and you need
permission to access the password.

3) Gettdpasswd: does the same as getpasswd, for Teradata

Example: gettdpasswd ddwdev us_svc_tag_etl
Lab 1: Loading in TD
Accessing the DDW developed TD components

To access TD we use our own Ab Initio components. In order to use them we need to add
the folder to the component organizer of GDE.

Right Click in Component Organizer.
Select New->Top Level Folder
In the box, select HOST and give the path /usr/dell/abinitio/Teradata/components
Lab 1: Loading in TD
Load the final file in Teradata in table test1.lab1_your_name:
Generate DML from the table using td_dml_generator.ksh and
name it r_customer.dml
Add a Reformat before loading to change the delimiter to and
the date format to YYYY-MM-DD
Use DDW_INSERT to load the data
Add TD_LOGON in the sandbox
Execute .profile from settings or start script
Save it in a separate graph called upload_customer.mp
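The Reformat before the load has two jobs: converting the date and changing the delimiter. A Python sketch of that logic follows, for illustration only. The slide does not state the target delimiter, so it is passed in as a parameter, and the "|" used in the example is a placeholder. The function names are hypothetical.

```python
from datetime import datetime

LOAD_COLUMNS = ["cust_number", "cust_name", "date_of_birth",
                "cust_address", "cust_zip", "address_type"]


def to_td_date(yyyymmdd):
    """Convert YYYYMMDD to YYYY-MM-DD; None or empty input stays None."""
    if not yyyymmdd:
        return None
    # strptime also rejects impossible dates such as 19750231.
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()


def to_load_line(row: dict, delimiter: str) -> str:
    """Render one joined row as a delimited line ready for loading."""
    converted = {**row, "date_of_birth": to_td_date(row.get("date_of_birth"))}
    values = (converted.get(column) for column in LOAD_COLUMNS)
    return delimiter.join("" if value is None else value for value in values) + "\n"


row = {"cust_number": "123", "cust_name": "Jane Doe", "date_of_birth": "19750409",
       "cust_address": "1 Main St", "cust_zip": "78701", "address_type": "H"}
line = to_load_line(row, delimiter="|")  # "|" is a placeholder, not the lab's delimiter
```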


Lab 1: Order of parameter evaluation
1. The host setup script is run
2. Common project parameters
3. Project/sandbox parameters
4. Graph parameters
5. Graph start script parameter
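One practical consequence of an evaluation order is that a later level can build on values set by an earlier one. The Python sketch below assumes exactly that, and also assumes that a later definition replaces an earlier one; neither behaviour is confirmed by the slide. The evaluate_layers helper and the LAB_OUT and OUT_FILE names are hypothetical.

```python
from string import Template


def evaluate_layers(layers):
    """Evaluate parameter layers in order; values may reference earlier parameters."""
    env = {}
    for layer in layers:
        for name, raw_value in layer.items():
            # substitute() raises KeyError if a value refers to a parameter
            # that has not been evaluated yet.
            env[name] = Template(raw_value).substitute(env)
    return env


layers = [
    {},                                                           # 1. host setup script
    {"AI_SERIAL_OUT_DATA": "/serial/data/ddw/serial/training"},   # 2. common project
    {"LAB_OUT": "$AI_SERIAL_OUT_DATA/your_name/lab1"},            # 3. project/sandbox
    {"OUT_FILE": "$LAB_OUT/customer.dat"},                        # 4. graph
    {},                                                           # 5. graph start script
]
env = evaluate_layers(layers)
```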
