This document provides instructions for a lab on ETL software development using Ab Initio. It describes the ETL process of extract, transform, load. It outlines transforming customer data from flat files to load into a Teradata database table. The lab instructions include: setting up directories; creating an empty sandbox; defining collection files, DML formats and transforms; joining, sorting and deduplicating files; loading into Teradata using Ab Initio components while addressing data types and formats.
Dell Confidential
ITETL0001 TOOLKIT Part 1

ETL

ETL - EXTRACT -> TRANSFORM -> LOAD

Extract
    Extract from the source system.
    Typically done by the source system owner; not typically done by our team.
    The end result is flat files stored on our central NFS server.
Transform
    Transformation is the application of business rules and other modifications to facilitate reporting.
    This is the main purpose of the ETL team.
    Ab Initio is a proprietary software package we use for transformation.
Load
    Data is loaded to the reporting database (typically TD).
    Other manipulation or summarization (aka rollups) might be done within the DB.

Application of Ab Initio

    Transformation of disparate sources
    Aggregation
    Referential integrity checking
    Transformations for business rules
    Database loading
    Aggregation for mart tables
    Extraction
GDE and Co-Operating System

GDE (1.13.7)
Co-Operating System (2.13)
    Sun Solaris
    Red Hat Linux
    IBM AIX
    Windows NT
    HP-UX
    Compaq Tru64 UNIX
    IBM DYNIX/ptx
    IBM OS/390
    Silicon Graphics IRIX
    NCR MP-RAS

Databases

    Oracle
    Sybase
    Teradata
    MS SQL Server 7
Co-Operating System Services

    Parallel and distributed application execution
    Transactional semantics at the application level (not in the DB)
    Checkpointing
    Monitoring and debugging
    Parallel file management
    Parameter-driven components

Lab 1: Setting up Linux directories

Create your directory under /usr/dell/abinitio/training on abidev01 using your NT account:

    cd /usr/dell/abinitio/training
    mkdir your_name

Create a data directory under /serial/data/ddw/serial/training/:

    cd /serial/data/ddw/serial/training
    mkdir your_name
    cd your_name
    mkdir lab1

Lab 1: Logon via GDE

Create a login account from GDE:

    host=abidev01
    login=your NT account
    host_directory=/usr/dell/abinitio/training/your_name
    location=/usr/dell/abinitio/abinitio
Lab 1: Create an Empty Sandbox

Create an empty sandbox:

    Projects->create sandbox
    Directory=/usr/dell/abinitio/training/your_name/lab1

Go to the sandbox in Linux and do an ls -l. It should list these directories:

    xfr  run  mp  dml  db

** The goal is to have a development environment which enables the migration of a graph or set of graphs to any other environment/location without any changes. **

Save the graph as build_customer.mp in your mp directory.

Lab 1: Collection Files

Collection files are located in /usr/dell/abinitio/training/collection.

customer_header.dat
    cust_number
    cust_name
    birth_year
    birth_month
    birth_day

customer_detail.dat
    cust_number
    cust_address
    cust_zip
    address_type
Lab 1: Edit Sandbox

Edit the sandbox to define the sandbox variables:

    Project->Edit Sandbox
    Add the AI_ prefix to all the sandbox directories
    Define AI_SERIAL_OUT_DATA=/serial/data/ddw/serial/training
    Define COLL_HOME=/usr/dell/abinitio/training/collection

    $AI_RUN    run directory, usually contains deployed shells
    $AI_DML    record format files
    $AI_XFR    transform files
    $AI_MP     graphs
    $AI_DB     database config files

Lab 1: Transform Requirements

Lab 1 transformation requirements:

1. Customer header records should be de-duped on customer number.
2. All leading zeros should be removed from the customer number.
3. The final file should be a left outer join of customer header and customer detail.
4. In the final file, the date fields should be consolidated into a single field as YYYYMMDD.
5. The final file should have all the columns from header and detail.

Lab 1: DML for Customer Header

Input file component with its URL pointing to $COLL_HOME/customer_header.dat and its dml port set to the record below:

    record
      string("~|") cust_number = NULL("");
      string("~|") cust_name = NULL("");
      string("~|") birth_year = NULL("");
      string("~|") birth_month = NULL("");
      string("\n") birth_day = NULL("");
    end;

Lab 1: DML for Customer Detail

Input file component with its URL pointing to $COLL_HOME/customer_detail.dat and its dml port set to the record below:

    record
      string("~|") cust_number = NULL("");
      string("~|") cust_address = NULL("");
      string("~|") cust_zip = NULL("");
      string("~|") address_type = NULL("");
      string(1) newline;
    end;

??? Why is newline defined explicitly in customer detail and not in customer header?

Lab 1: Reformat
Add a Reformat for customer header whose output has the date as YYYYMMDD.

OUTPUT DML:

    record
      string("~|") cust_number = NULL("");
      string("~|") cust_name = NULL("");
      string("\n") date_of_birth = NULL("");
    end;

Add a Reformat for customer detail to remove the leading zeros from the customer number.
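The two Reformat steps can be sketched outside Ab Initio to make the intended record-level logic concrete. The Python below is only an illustration, not Ab Initio DML or a transform: the function names are hypothetical, and the handling of empty (NULL) date parts, unpadded month/day values, and an all-zero customer number are assumptions that the lab does not specify.

```python
from typing import Dict, List

DELIM = "~|"  # field delimiter taken from the lab's DML
HEADER_FIELDS = ["cust_number", "cust_name", "birth_year", "birth_month", "birth_day"]


def parse_header(line: str) -> Dict[str, str]:
    """Split one customer_header line; the last field ends at the newline."""
    values: List[str] = line.rstrip("\n").split(DELIM)
    if len(values) != len(HEADER_FIELDS):
        raise ValueError(f"expected {len(HEADER_FIELDS)} fields, got {len(values)}")
    return dict(zip(HEADER_FIELDS, values))


def strip_leading_zeros(cust_number: str) -> str:
    """Remove leading zeros. Assumption: an all-zero value becomes "0"."""
    stripped = cust_number.lstrip("0")
    if stripped:
        return stripped
    return "0" if cust_number else ""


def reformat_header(rec: Dict[str, str]) -> Dict[str, str]:
    """Consolidate birth_year/birth_month/birth_day into date_of_birth (YYYYMMDD)."""
    parts = (rec["birth_year"], rec["birth_month"], rec["birth_day"])
    if all(parts):
        year, month, day = (int(p) for p in parts)
        date_of_birth = f"{year:04d}{month:02d}{day:02d}"
    else:
        # Assumption: if any part is empty (NULL in the DML), the date is NULL too.
        date_of_birth = ""
    return {
        "cust_number": rec["cust_number"],
        "cust_name": rec["cust_name"],
        "date_of_birth": date_of_birth,
    }


def reformat_detail(rec: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of a detail record with leading zeros removed from cust_number."""
    return {**rec, "cust_number": strip_leading_zeros(rec["cust_number"])}
```

For example, `reformat_header(parse_header("000123~|Ann~|1980~|7~|4\n"))` yields a record whose `date_of_birth` is `"19800704"`.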
Lab 1: Sort and Dedup

    Sort and Dedup customer header on customer number.
    Sort customer detail on customer number.

Sort and Dedup keys should be the same. Why is that so? How will it behave in a multifile system?

Lab 1: Join

Do a left outer join on customer number for the header and detail records. The final file should have all the columns from header and detail:

    record
      string("~|") cust_number = NULL("");
      string("~|") cust_name = NULL("");
      string("~|") date_of_birth = NULL("");
      string("~|") cust_address = NULL("");
      string("~|") cust_zip = NULL("");
      string("~|") address_type = NULL("");
      string(1) newline;
    end;
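As a rough illustration of the sort, dedup and left outer join steps, here is a minimal Python sketch. It is not how the Ab Initio components are implemented: the function names are hypothetical, keeping the first record per key is an assumption, and a dictionary lookup stands in for the sorted merge that a join on sorted inputs would use. The dedup function does hint at why the sort key and dedup key should match: duplicates are only detected when they sit next to each other.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, List

Record = Dict[str, str]
DETAIL_COLUMNS = ["cust_address", "cust_zip", "address_type"]


def sort_and_dedup(records: List[Record], key: str = "cust_number") -> List[Record]:
    """Sort on key, then keep the first record of each run of equal keys.

    groupby only groups adjacent records, so the input must be sorted on
    the same key that is used for deduplication.
    """
    ordered = sorted(records, key=itemgetter(key))  # stable sort
    return [next(group) for _, group in groupby(ordered, key=itemgetter(key))]


def left_outer_join(headers: List[Record], details: List[Record],
                    key: str = "cust_number") -> List[Record]:
    """Keep every header record; detail columns are empty when nothing matches."""
    details_by_key: Dict[str, List[Record]] = {}
    for detail in details:
        details_by_key.setdefault(detail[key], []).append(detail)

    null_detail = {col: "" for col in DETAIL_COLUMNS}  # "" plays the role of NULL
    joined: List[Record] = []
    for header in headers:
        for detail in details_by_key.get(header[key], [null_detail]):
            row = dict(header)
            row.update({col: detail[col] for col in DETAIL_COLUMNS})
            joined.append(row)
    return joined
```

Dropping headers that have no matching detail would turn this into an inner join, which is one way to start on the join questions in the lab.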
Lab 1: More Thoughts on Join

    How will you do an inner join?
    How will you do a full outer join?
    How will you do a right outer join?
    How can you override the join keys?
    How can you manipulate the output through the join transform?
    Why is \n hard coded?

Lab 1: Loading in TD

    CREATE MULTISET TABLE test1.lab1_your_name ,NO FALLBACK ,
         NO BEFORE JOURNAL,
         NO AFTER JOURNAL
         (
          CUST_NUMBER INTEGER NOT NULL,
          CUST_NAME VARCHAR(18) CHARACTER SET LATIN NOT CASESPECIFIC,
          DATE_OF_BIRTH DATE FORMAT 'yyyy-mm-dd',
          CUST_ADDRESS VARCHAR(40) CHARACTER SET LATIN NOT CASESPECIFIC,
          CUST_ZIP INTEGER,
          ADDRESS_TYPE CHAR(1) CHARACTER SET LATIN NOT CASESPECIFIC
         )
    PRIMARY INDEX XNUP_lab1_your_name ( cust_number );

Lab 1: Loading in TD

UTILITIES:

1) TD_DML_GENERATOR: Generates the Ab Initio DML for a Teradata table. Copy the utility from /usr/dell/abinitio/training/utils/td_dml_generator.ksh into your local utils directory.
2) Getpasswd: Returns the password for the Oracle username for a particular instance. Configuration management has to add the entry for the username, and you need permission to access the password.
3) Gettdpasswd: Does the same as getpasswd, for Teradata.

Example:

    gettdpasswd ddwdev us_svc_tag_etl

Lab 1: Loading in TD

Accessing the DDW developed TD components
To access TD we use our own Ab Initio components. To use them, add their folder to the Component Organizer of GDE:

    Right-click in the Component Organizer.
    Select New->Top Level Folder.
    In the box, select HOST and give the path /usr/dell/abinitio/Teradata/components

Lab 1: Loading in TD

Load the final file into the Teradata table test1.lab1_your_name:

    Generate DML from the table using td_dml_generator.ksh and name it r_customer.dml
    Add a Reformat before loading to change the delimiter and to convert the date format to YYYY-MM-DD
    Use DDW_INSERT to load the data
    Add TD_LOGON in the sandbox
    Execute .profile from settings or start script
    Save it in a separate graph called upload_customer.mp
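The pre-load Reformat has to turn the YYYYMMDD string from the join output into the yyyy-mm-dd form declared on DATE_OF_BIRTH in the table. A minimal Python sketch of that conversion follows; it is not Ab Initio code, the function names are hypothetical, and the output delimiter is passed in as a parameter because the lab text does not state which delimiter the generated r_customer.dml uses.

```python
from datetime import datetime
from typing import Dict

# Column order follows the CREATE TABLE statement in the lab.
LOAD_COLUMNS = ["cust_number", "cust_name", "date_of_birth",
                "cust_address", "cust_zip", "address_type"]


def to_td_date(yyyymmdd: str) -> str:
    """Convert YYYYMMDD to YYYY-MM-DD; an empty (NULL) date stays empty."""
    if not yyyymmdd:
        return ""
    # strptime also rejects impossible dates such as 19800231.
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()


def to_load_line(rec: Dict[str, str], out_delim: str) -> str:
    """Build one load-file line using the caller's delimiter (an assumption)."""
    converted = dict(rec, date_of_birth=to_td_date(rec["date_of_birth"]))
    return out_delim.join(converted[col] for col in LOAD_COLUMNS) + "\n"
```

Calling `to_td_date("19800704")` returns `"1980-07-04"`. CUST_NUMBER and CUST_ZIP are INTEGER columns in the table, so a real Reformat would also need to make sure those fields hold numeric text.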
Lab 1: Order of parameter evaluation

1. The host setup script is run
2. Common project parameters
3. Project/sandbox parameters
4. Graph parameters
5. Graph start script parameter
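One practical consequence of a fixed evaluation order is that a parameter can only refer to values that were evaluated before it. The Python sketch below models just that ordering idea with `$NAME` substitution. It is an assumption-laden toy, not a description of how the Co-Operating System resolves parameters: the layer contents are illustrative, and `HEADER_FILE` is a hypothetical graph parameter that does not appear in the lab.

```python
from string import Template
from typing import Dict, List, Tuple

Layer = Tuple[str, Dict[str, str]]

# Layers listed in the evaluation order given on the slide.
LAYERS: List[Layer] = [
    ("host setup script", {}),
    ("common project parameters", {}),
    ("project/sandbox parameters", {
        "AI_SERIAL_OUT_DATA": "/serial/data/ddw/serial/training",
        "COLL_HOME": "/usr/dell/abinitio/training/collection",
    }),
    # HEADER_FILE is hypothetical; it only shows a later layer using an earlier value.
    ("graph parameters", {"HEADER_FILE": "$COLL_HOME/customer_header.dat"}),
    ("graph start script", {}),
]


def evaluate_layers(layers: List[Layer]) -> Dict[str, str]:
    """Evaluate layers in order; each value may reference names defined earlier."""
    env: Dict[str, str] = {}
    for _layer_name, params in layers:
        for name, raw_value in params.items():
            # substitute() raises KeyError when a referenced name is not defined yet.
            env[name] = Template(raw_value).substitute(env)
    return env
```

With the layers in the listed order, `evaluate_layers(LAYERS)["HEADER_FILE"]` resolves to the full path of customer_header.dat. Evaluating the same layers in reverse raises `KeyError`, because `COLL_HOME` would not be defined yet.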