
Best practices using Teradata with PowerCenter


Content
This document discusses how to use Teradata with PowerCenter. It covers Teradata basics and describes some adjustments that may be necessary to support some common practices.

Teradata Basics
Teradata is a relational database management system from NCR. It offers high performance for very large databases using a highly parallel architecture. While Teradata can run on other platforms, it is predominantly found on NCR hardware (which runs NCR's version of UNIX). It is very fast and very scalable.

Teradata Hardware
The NCR computers on which Teradata runs support both MPP (Massively Parallel Processing) and SMP (Symmetric Multi-Processing). Each MPP node (or semi-autonomous processing unit) can support SMP. Teradata can be configured to communicate directly with a mainframe's I/O channel; this is known as channel attached. Alternatively, it can be network attached, that is, configured to communicate via TCP/IP over a LAN. Channel attached is not always faster than network attached: similar performance has been observed across a channel attachment and a 100 MB LAN. In addition, channel attached requires an additional sequential data move, because the data must be moved from the PowerCenter server to the mainframe before moving it across the mainframe channel to Teradata.

Teradata Software
There are Teradata Director Program IDs (TDPIDs), databases and users. The TDPID is the name used to connect from a Teradata client to a Teradata server (similar to an Oracle tnsnames.ora entry). Teradata databases and users are somewhat synonymous: a user has a userid, password and space to store tables, while a database is basically a user without a login and password (or, equivalently, a user is a database with a userid and password). Teradata Access Module Processors (AMPs) are Teradata's parallel database engines. Although they are strictly software ('virtual processors' in NCR terminology), Teradata uses AMP and hardware node interchangeably, since in the past an AMP was a piece of hardware.

Client Configuration Basics for Teradata


The client side configuration is done using the hosts file (/etc/hosts on UNIX or winnt\system32\drivers\etc\hosts on Windows). In the hosts file, the name of the Teradata instance (i.e. the TDPID - Teradata Director Program ID) is indicated by the letters and numbers that precede the string cop1 in a hosts file entry.

Example:

127.0.0.1      localhost demo1099cop1
192.168.80.113 curly pcop1

This tells Teradata that when a client tool references the instance demo1099, it should direct requests to localhost (IP address 127.0.0.1), and when a client tool references the instance p, it should direct requests to the server curly (IP address 192.168.80.113). This entry does not contain any database-server-specific information (the TDPID is not the same as an Oracle instance ID). That is, the TDPID is used strictly to define the name a client uses to connect to a server. Teradata takes the name you specify, looks in the hosts file to map the cop1 (or cop2, etc.) entry to an IP address, and then attempts to establish a connection with Teradata at that IP address. There can be multiple entries in a hosts file with similar TDPIDs:

127.0.0.1      localhost demo1099cop1
192.168.80.113 curly_1 pcop1
192.168.80.114 curly_2 pcop2
192.168.80.115 curly_3 pcop3
192.168.80.116 curly_4 pcop4

This setup allows load balancing of clients among multiple Teradata nodes. That is, most Teradata systems have many nodes, and each node has its own IP address. Without the multiple hosts file entries, every client will connect to one node, and eventually that node will be doing a disproportionate share of the client processing. With multiple hosts file entries, if the node specified with the cop1 suffix (i.e. curly_1) takes too long to respond to the client request to connect to p, then the client will automatically attempt to connect to the node with the cop2 suffix (i.e. curly_2), and so forth.

PowerCenter and Teradata


PowerCenter accesses Teradata through several Teradata tools. Each is described below, along with how it is configured within PowerCenter.

ODBC
Teradata provides 32-bit ODBC drivers for Windows and UNIX platforms. If possible, use the ODBC driver from Teradata's TTU7 release (or above) of their client software, because this version supports array reads. Tests have shown these drivers (3.02) can be 20% - 30% faster than the 3.01 drivers. The Teradata TTU 8.0 release uses ODBC 3.0421. Teradata's ODBC is on a performance par with Teradata's SQL CLI; in fact, ODBC is Teradata's recommended SQL interface for its partners. Do not use ODBC to write to Teradata unless you are writing very small data sets (and even then, you should probably use Tpump instead), because Teradata's ODBC is optimized for query access, not for writing data. So use Teradata ODBC for Teradata sources and lookups. PowerCenter Designer uses Teradata's ODBC to import all Teradata objects (sources, lookups, targets, etc.).

ODBC Windows
Configure the Teradata ODBC driver with the following information on Windows:
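A minimal sketch of a typical DSN definition (the data source name shown is an assumption for illustration; the server address and account name reuse values that appear elsewhere in this document):

Data Source Name:         TeraTest
Name or IP address:       148.162.247.34
Username:                 infatest
Default Database:         infatest

The name or IP address field (the DBCName) can be either an IP address or a TDPID name resolved through the hosts file, as described in the Client Configuration Basics section.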

ODBC UNIX
When the PowerCenter server is running on UNIX, ODBC is required to read both sources and lookups from Teradata. As with all UNIX ODBC drivers, the key to configuring the UNIX ODBC driver is adding the appropriate entries to the .odbc.ini file. To configure the .odbc.ini file correctly, there must be an entry under [ODBC Data Sources] that includes the Teradata ODBC driver shared library (tdata.sl on HP-UX, standard shared library extensions on other types of UNIX). The following example shows the required entries from an actual .odbc.ini file (note that the path to the driver may be different on each computer):

[ODBC Data Sources]
dBase=MERANT 3.60 dBase Driver
Oracle8=MERANT 3.60 Oracle 8 Driver
Text=MERANT 3.60 Text Driver
Sybase11=MERANT 3.60 Sybase 11 Driver
Informix=MERANT 3.60 Informix Driver
DB2=MERANT 3.60 DB2 Driver
MS_SQLServer7=MERANT SQLServer driver
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34


Similar to the client hosts file setup, you can specify multiple IP addresses for the DBCName to balance the client load across multiple Teradata nodes. Consult a Teradata administrator for the exact details (or copy the entries from the hosts file on the client machine; refer to the Client Configuration Basics section).
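One approach, sketched here under the assumption that the pcop1 through pcop4 hosts file entries shown earlier are in place (confirm the exact syntax with a Teradata administrator), is to point the DBCName at the TDPID name rather than a single address, letting the cop entries in the hosts file spread connections across nodes:

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=p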

Important note:
Make sure that the DataDirect ODBC path precedes the Teradata ODBC path information in the PATH and SHLIB_PATH (or LD_LIBRARY_PATH, etc.) environment variables. This is because both sets of ODBC software use some of the same file names. PowerCenter must use the DataDirect files because this is the software that has been certified.
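For example, in the shell that starts the PowerCenter server (a minimal sketch; the install paths below are assumptions and will differ per environment):

PATH=/opt/datadirect/bin:/usr/odbc/bin:$PATH
SHLIB_PATH=/opt/datadirect/lib:/usr/odbc/lib:$SHLIB_PATH
export PATH SHLIB_PATH

On platforms that use LD_LIBRARY_PATH instead of SHLIB_PATH, order the library path the same way, with the DataDirect directory first.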

Teradata external loaders


PowerCenter supports four different Teradata external loaders: Tpump, FastLoad, MultiLoad and Teradata Warehouse Builder (TWB). The actual Teradata loader executables (tpump, mload, fastload and tbuild) must be accessible by the PowerCenter Server application. All of the Teradata loader connections require a value for the TDPID attribute; refer to the first section of this document to understand how to enter the value correctly. All of these loaders require:

A load file (can be configured to be a stream/pipe; autogenerated by PowerCenter)
A control file of commands that tell the loader what to do (autogenerated by PowerCenter)

All of these loaders also produce a log file. This log file is the means to debug the loader if something goes wrong. Because these are external loaders, all PowerCenter receives back from the loader is whether it ran successfully or not. By default, the input file, control file and log file are created in the $PMTargetFileDir directory of the PowerCenter Server executing the workflow.
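As a sketch of what to expect on disk for a target file named td_test.out (the exact names are illustrative; the .ldrlog suffix matches the control-file example later in this document):

$PMTargetFileDir/td_test.out          - the load file (data)
$PMTargetFileDir/td_test.out.ctl      - the control file (loader commands)
$PMTargetFileDir/td_test.out.ldrlog   - the loader log file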


Any of these loaders can be used by configuring the target in the PowerCenter session to be a File Writer and then choosing the appropriate loader:

To override the auto-generated control file, click the Pencil icon next to the loader connection name:


Scroll to the bottom of the connection attribute list and click the value next to the Control File Content Override attribute. Then click the down arrow.


Click the Generate button and change the control file as desired. The changed control file is stored in the repository.

Most of the loaders also use some combination of internal Work, Error and Log tables. By default, these will be in the same database as the target table. All of these can now be overridden in the attributes of the connection.


To write the input flat file that the loaders need to disk, the Is Staged attribute must be checked. If the Is Staged attribute is not set, the file is piped/streamed to the loader. If you select the non-staged mode for a loader, you should also set the Checkpoint property to '0'. This effectively turns off checkpoint processing. Checkpoint processing is used for recovery/restart of FastLoad and MultiLoad sessions. However, if you are not using a physical file as input, but rather a named pipe, then the recovery/restart mechanism of the loaders does not work. Not only does this impact performance (checkpoint processing is not free, and unnecessary overhead should be eliminated wherever possible), but a non-zero checkpoint value will sometimes cause seemingly random errors and session failures when used with named pipe input (as is the case with streaming mode).
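In the generated control file, the checkpoint setting appears in the .BEGIN IMPORT MLOAD block. A minimal sketch of the relevant fragment with checkpointing disabled (the table and database names are taken from the merged control-file example later in this document):

.BEGIN IMPORT MLOAD
TABLES infatest.TD_TEST
ERRLIMIT 1
CHECKPOINT 0
TENACITY 10000
SESSIONS 1
SLEEP 6 ;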

Teradata loader requirements for PowerCenter servers on UNIX


All Teradata load utilities require a non-null standard output and standard error to run properly. Standard output (STDOUT) and standard error (STDERR) are UNIX conventions that determine the default location for a program to write output and error information. When you start the PowerCenter Server without explicitly defining STDOUT and STDERR, both point to the current terminal session. If you log out of UNIX, UNIX redirects STDOUT and STDERR to /dev/null (a placeholder that throws out anything written to it). At this point, Teradata loader sessions will fail because they do not permit STDOUT and STDERR to be /dev/null. Therefore, you must start the PowerCenter Server as follows (from the PowerCenter installation directory):

./pmserver ./pmserver.cfg > ./pmserver.out 2>&1

This starts the PowerCenter Server using the pmserver.cfg configuration file and points STDOUT and STDERR to the file pmserver.out. In this way, STDERR and STDOUT remain defined even after the terminal session logs out.

Important note:
There are no spaces in the token 2>&1. This tells UNIX to redirect STDERR to the same place as STDOUT. As an alternative to this method, you can specify the console output file name in the pmserver.cfg file. That is, information written to standard output and standard error will go to the file specified as follows:

ConsoleOutputFilename=

With this entry in the pmserver.cfg file, you can start the PowerCenter Server normally (i.e. ./pmserver).
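A minimal sketch of this alternative (the output path below is an assumption; substitute whatever location suits your environment):

ConsoleOutputFilename=/opt/informatica/pmserver/pmserver.out

With that line in pmserver.cfg, starting the server with ./pmserver sends console output to the named file without any shell redirection.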


Partitioned Loading
With PowerCenter, if you set a round robin partition point on the target definition and set each target instance to be loaded using the same loader connection instance, then PowerCenter automatically writes all data to the first partition and starts only one instance of FastLoad or MultiLoad. You will know you are getting this behavior if you see the following entry in the session log:

MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.

If you do not see this message, then chances are the session fails with the following error:

WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.

Tpump
Tpump is an external loader that supports inserts, updates, upserts, deletes and data driven updates. Multiple Tpump instances can execute simultaneously against the same table, because Tpump does not use many resources, nor does it require table-level locks. It is often used to trickle load a table. As stated earlier, it is a faster way to update a table than ODBC, but it is not as fast as the other loaders.

MultiLoad
This is a sophisticated bulk load utility and is the primary method PowerCenter uses to load/update mass quantities of data into Teradata. Unlike bulk load utilities from other vendors, MultiLoad supports inserts, updates, upserts, deletes and data driven operations in PowerCenter. You can also use variables and embed conditional logic in MultiLoad scripts. It is very fast (millions of rows in a few minutes). It can be resource intensive and will take a table lock.


Cleaning up after a failed MultiLoad

MultiLoad puts the target table into the MultiLoad state. Upon successful completion, the target table is returned to the Normal (non-MultiLoad) state. Therefore, when a MultiLoad fails for any reason, the table is left in the MultiLoad state, and you cannot simply re-run the same MultiLoad; MultiLoad will report an error. In addition, MultiLoad also queries the target table's MultiLoad log table to see if it contains any errors. If a MultiLoad log table exists for the target table, you will also not be able to rerun your MultiLoad job. To recover from a failed MultiLoad, you must release the target table from the MultiLoad state and also drop the MultiLoad log table. You can do this using BTEQ or QueryMan to issue the following commands:

drop table mldlog_<table name>;
release mload <table name>;

Note:
The drop table command assumes that you are recovering from a MultiLoad script generated by PowerCenter (PowerCenter always names the MultiLoad log table mldlog_<table name>). If you're working with a hand-coded MultiLoad script, the name of the MultiLoad log table could be anything. Here is the actual text from a BTEQ session which cleans up a failed load to the table td_test owned by the user infatest:

BTEQ -- Enter your DBC/SQL request or BTEQ command:
drop table infatest.mldlog_td_test;

drop table infatest.mldlog_td_test;

*** Table has been dropped.
*** Total elapsed time was 1 second.

BTEQ -- Enter your DBC/SQL request or BTEQ command:
release mload infatest.td_test;

release mload infatest.td_test;

*** Mload has been released.
*** Total elapsed time was 1 second.


Using one instance of MultiLoad to load multiple tables

MultiLoad can require a large amount of resources on a Teradata system. Some systems have hard limits on the number of concurrent MultiLoad sessions allowed. By default, PowerCenter starts an instance of MultiLoad for every target file. To use a single instance of MultiLoad to load multiple tables (or to load both inserts and updates into the same target table), the generated MultiLoad script file must be edited.

Note:
This should not be an issue with Tpump, because Tpump is not as resource intensive as MultiLoad (and multiple concurrent instances of Tpump can target the same table). Here is a workaround:

1. Use a dummy session (i.e. set test rows to 1 and target a test database) to generate MultiLoad control files for each of the targets.
2. Merge the multiple control files (one per target table) into a single control file (one for all target tables).
3. Configure the session to call MultiLoad from a post-session script using the control file created in step 2.

Integrated support cannot be used because each input file is processed sequentially, and this causes problems when combined with PowerCenter's integrated named pipes and streaming. Details on merging the control files:

1. There is a single log file for each instance of MultiLoad, so you do not have to change or add anything in the LOGFILE statement. However, you might want to change the name of the log table, since it may be a log that spans multiple tables.
2. Copy the work and error table delete statements into the common control file.
3. Modify the BEGIN MLOAD statement to specify all the tables that the MultiLoad will be hitting.
4. Copy the Layout sections into the common control file and give each a unique name. Organize the file such that all the layout sections are grouped together.
5. Copy the DML sections into the common control file and give each a unique name. Organize the file such that all the DML sections are grouped together.
6. Copy the Import statements into the common control file and modify them to reflect the unique names created for the referenced LAYOUT and DML sections in steps 4 and 5. Organize the file such that all the Import sections are grouped together.
7. Run chmod -w on the newly created control file so PowerCenter does not overwrite it, or name it something different so PowerCenter cannot overwrite it.

Note:
A single instance of MultiLoad can target at most 5 tables. Therefore, do not combine more than 5 target files into a common file.

Example: Here is a control file merged from two default control files:

.DATEFORM ANSIDATE;
.LOGON demo1099/infatest,infatest;
.LOGTABLE infatest.mldlog_TD_TEST;
DROP TABLE infatest.UV_TD_TEST ;
DROP TABLE infatest.WT_TD_TEST ;
DROP TABLE infatest.ET_TD_TEST ;
DROP TABLE infatest.UV_TD_CUSTOMERS ;
DROP TABLE infatest.WT_TD_CUSTOMERS ;
DROP TABLE infatest.ET_TD_CUSTOMERS ;
.ROUTE MESSAGES WITH ECHO TO FILE c:\LOGS\TgtFiles\td_test.out.ldrlog ;
.BEGIN IMPORT MLOAD
TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS
ERRLIMIT 1
CHECKPOINT 10000
TENACITY 10000
SESSIONS 1
SLEEP 6 ;


/* Begin Layout Section */
.Layout InputFileLayout1;
.Field CUST_KEY 1 CHAR(12) NULLIF CUST_KEY = '*' ;
.Field CUST_NAME 13 CHAR(20) NULLIF CUST_NAME = '*' ;
.Field CUST_DATE 33 CHAR(10) NULLIF CUST_DATE = '*' ;
.Field CUST_DATEmm 33 CHAR(2);
.Field CUST_DATEdd 36 CHAR(2);
.Field CUST_DATEyyyy 39 CHAR(4);
.Field CUST_DATEtd CUST_DATEyyyy||'/'||CUST_DATEmm||'/'||CUST_DATEdd NULLIF CUST_DATE = '*' ;
.Filler EOL_PAD 43 CHAR(2) ;

.Layout InputFileLayout2;
.Field CUSTOMER_KEY 1 CHAR(12);
.Field CUSTOMER_ID 13 CHAR(12);
.Field COMPANY 25 CHAR(50) NULLIF COMPANY = '*' ;
.Field FIRST_NAME 75 CHAR(30) NULLIF FIRST_NAME = '*' ;
.Field LAST_NAME 105 CHAR(30) NULLIF LAST_NAME = '*' ;
.Field ADDRESS1 135 CHAR(72) NULLIF ADDRESS1 = '*' ;
.Field ADDRESS2 207 CHAR(72) NULLIF ADDRESS2 = '*' ;
.Field CITY 279 CHAR(30) NULLIF CITY = '*' ;
.Field STATE 309 CHAR(2) NULLIF STATE = '*' ;
.Field POSTAL_CODE 311 CHAR(10) NULLIF POSTAL_CODE = '*' ;
.Field PHONE 321 CHAR(30) NULLIF PHONE = '*' ;
.Field EMAIL 351 CHAR(30) NULLIF EMAIL = '*' ;
.Field REC_STATUS 381 CHAR(1) NULLIF REC_STATUS = '*' ;
.Filler EOL_PAD 382 CHAR(2) ;
/* End Layout Section */

/* begin DML Section */
.DML Label tagDML1;
INSERT INTO infatest.TD_TEST
( CUST_KEY, CUST_NAME, CUST_DATE )
VALUES
( :CUST_KEY, :CUST_NAME, :CUST_DATEtd ) ;

.DML Label tagDML2;
INSERT INTO infatest.TD_CUSTOMERS
( CUSTOMER_KEY, CUSTOMER_ID, COMPANY, FIRST_NAME, LAST_NAME, ADDRESS1, ADDRESS2, CITY, STATE, POSTAL_CODE, PHONE, EMAIL, REC_STATUS )
VALUES
( :CUSTOMER_KEY, :CUSTOMER_ID, :COMPANY, :FIRST_NAME, :LAST_NAME, :ADDRESS1, :ADDRESS2, :CITY, :STATE, :POSTAL_CODE, :PHONE, :EMAIL, :REC_STATUS ) ;
/* end DML Section */

/* Begin Import Section */


.Import Infile c:\LOGS\TgtFiles\td_test.out
Layout InputFileLayout1
Format Unformat
Apply tagDML1 ;
.Import Infile c:\LOGS\TgtFiles\td_customers.out
Layout InputFileLayout2
Format Unformat
Apply tagDML2 ;
/* End Import Section */
.END MLOAD;
.LOGOFF;

FastLoad
As the name suggests, this is a very fast utility for loading data into Teradata; in fact, it is the fastest method. However, there is one major restriction: the target table must be empty.
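A common pattern for working within this restriction (a sketch of general practice, not a PowerCenter-specific feature; the table names below are illustrative) is to FastLoad into an empty staging table and then move the rows with SQL in BTEQ:

/* after FastLoading into the empty staging table */
INSERT INTO infatest.TD_CUSTOMERS
SELECT * FROM infatest.STG_TD_CUSTOMERS;

DELETE FROM infatest.STG_TD_CUSTOMERS;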

Teradata Warehouse Builder (TWB)


Teradata Warehouse Builder (TWB) is a single utility that was intended to replace FastLoad, MultiLoad, Tpump and FastExport. It was to support a single scripting environment with different modes, where each mode roughly equates to one of the legacy utilities. It was also to support parallel loading (i.e. multiple instances of a TWB client running and loading the same table at the same time, something the legacy loaders cannot do). Unfortunately, NCR/Teradata does not support TWB, and TWB has never been formally released. According to NCR, the release was delayed primarily because of issues with the mainframe version.


Defining primary keys for tables to support updates, upserts and deletes
Like any other database technology, primary keys have to be specified for Teradata tables when doing updates, upserts or deletes using Teradata loaders. Sometimes, however, there are no primary keys defined for the underlying Teradata tables. In this case, primary keys have to be defined in the metadata when target table definitions are imported using the Warehouse Designer. The list of primary keys to be used in this definition can be obtained from the Teradata DBA. If a table has a partition defined, then the key(s) on which the partition is defined should also be marked as primary key(s) when defining the target table. This information can be obtained either from the DBA or by looking at the table definition scripts. In the example below, the Customer_ID, Customer_Name and Effective_Date fields have been marked as primary keys even though they are not primary keys in the underlying Teradata table.
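One way to see the table definition yourself is to display the DDL in BTEQ (the database and table names below are illustrative):

BTEQ -- Enter your DBC/SQL request or BTEQ command:
show table infatest.td_customers;

The output includes the PRIMARY INDEX clause (and the PARTITION BY clause, if the table is partitioned); mark those columns as primary keys in the target definition.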


Note:
If the primary keys are not defined, then any attempt to update, upsert or delete data using Teradata loaders will result in an error.

More Information

Reference
Teradata documentation: http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm
Teradata Forum: http://www.teradataforum.com

Related Documents
Teradata Frequently Asked Questions
FAQ: What versions of Teradata does PowerCenter support?

Applies To


Database: Teradata
Operating Systems:
Other Software:
Product: PowerCenter

Last Modified Date: 4/10/2009 11:46 PM
ID: 16182
