WebSphere DataStage
Version 8
SC18-9889-00
Note: Before using this information and the product that it supports, be sure to read the general information under Notices and trademarks on page 63.
Contents

Chapter 1. Introduction
Chapter 2. Tutorial project goals
Chapter 3. Module 1: Opening and running the sample job
    Lesson 1.1: Opening the sample job
        The Designer client
        The sample job for the tutorial
        Starting the Designer client and opening the sample job
        Lesson checkpoint
    Lesson 1.2: Viewing and compiling the sample job
        Exploring the Sequential File stage
        Exploring the Data Set stage
        Compiling the sample job
        Lesson checkpoint
    Lesson 1.3: Running the sample job
        Running the job
        Viewing the data set
        Lesson checkpoint
    Module 1: Summary
Lesson 3.1: Designing the transformation job
    The transformer job
    Creating the transformation job and adding stages and links
    Configuring the Data Set stages
    Configuring the Transformer stage
    Running the transformation job
    Lesson checkpoint
Lesson 3.2: Combining data in a job
    Using a Lookup stage
    Creating a lookup job
    Configuring the Lookup File Set stage
    Configuring the Lookup stage
    Lesson checkpoint
Lesson 3.3: Capturing rejected data
    Lesson checkpoint
Lesson 3.4: Performing multiple transformations in a single job
    Adding new stages and links
    Configuring the Business_Rules Transformer stage
    Configuring the Lookup operation
    Lesson checkpoint
Module 3 Summary
Chapter 8. Tutorial summary
Appendix. Installing and setting up the tutorial
    Creating a folder for the tutorial files
    Creating the tutorial project
    Copying the data files to the project folder or directory
    Importing the tutorial components into the tutorial project
    Creating a target database table
    Creating a DSN for the tutorial table on a Windows computer
    Creating a DSN for the tutorial table on a UNIX or Linux computer
Accessing information about IBM
    Contacting IBM
    Accessible documentation
    Providing comments on the documentation
Index
Chapter 1. Introduction
In this tutorial, you will learn the basic skills that you need to design and run WebSphere DataStage parallel jobs.
Learning objectives

As you work through the job scenario, you will learn how to do the following tasks:
- Design parallel jobs that extract, transform, and load data
- Run the jobs that you design and view the results
- Create reusable objects that can be included in other job designs

This tutorial should take approximately four hours to finish. If you explore other concepts related to this tutorial, it can take longer to complete.
Skill level
You can complete this tutorial with only a beginner's understanding of WebSphere DataStage concepts.
Audience
This tutorial is intended for WebSphere DataStage designers who want to learn how to create parallel jobs.
System requirements
The tutorial requires the following hardware and software:
- WebSphere DataStage clients installed on a Windows XP platform.
- A connection to a WebSphere DataStage server on a Windows or UNIX platform (Windows servers can be on the same computer as the clients).
- To run the parallel processing module (Module 5), the WebSphere DataStage server must be installed on a multiprocessor system (SMP or MPP).
Prerequisites
You need to complete the following tasks before starting the tutorial:
- Get DataStage developer privileges from the WebSphere DataStage administrator.
- Check that the WebSphere DataStage administrator has installed and set up the tutorial by following the procedures described in Appendix A.
Copyright IBM Corp. 2006
- Obtain the name of the tutorial folder on the WebSphere DataStage client computer and the tutorial project folder or directory on the WebSphere DataStage server computer from the WebSphere DataStage administrator.
Learning objectives
After you complete the lessons in this module, you will understand how to do the following tasks:
- Start the WebSphere DataStage and QualityStage Designer (Designer client) and attach to a project.
- Open an existing job.
- Compile a job so that it is ready to run.
- Open the Director client and run a job.
- View the results of the job.

This module should take approximately 30 minutes to complete.
Prerequisites
Ensure that you have DataStage user authority.
Lesson checkpoint
In this lesson, you opened your first job. You learned the following tasks:
- How to start the Designer client
- How to open a job
- Where to find the tutorial objects in the repository tree
The Data Set stage editor does not have a Format tab because the data set does not require any formatting data. Although the View Data button is available on this tab, there is no data for this stage yet. If you click the View Data button, you will receive a message that no data exists. The data gets created when the job runs.
Lesson checkpoint
In this lesson, you explored a simple data extraction job that reads data from a file and writes it to a staging area. You learned the following tasks:
- How to open stage editors
- How to view the data that a stage represents
- How to compile a job so that it is ready to run
2. Select the sample job in the right pane of the Director client, and select Job > Run Now.
3. In the Job Run Options window, specify the path of the project folder (for example, C:\IBM\InformationServer\Server\Projects\Tutorial) and click Run. The job status changes to Running.
4. When the job status changes to Finished, select View > Log.
5. Examine the job log to see the type of information that the Director client reports as it runs a job. The messages that you see are either control or information type. Jobs can also have Fatal and Warning messages. The following figure shows the log view of the job.
Lesson checkpoint
In this lesson you ran the sample job and looked at the results. You learned the following tasks:
- How to start the Director client from the Designer client
- How to run a job and look at the log file
- How to view the data written by the job
Module 1: Summary
You have now opened, compiled, and run your first data extraction job. Now that you have run a data extraction job, you can start creating your own jobs. The next module guides you through the process of creating a simple job that does more data extraction.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
- Starting the Designer client.
- Opening an existing job.
- Compiling the job.
- Starting the Director client from the Designer client.
- Running the sample job.
- Viewing the results of the sample job and seeing how the job extracts data from a comma-separated file and writes it to a staging area.
Additional resources
For more information about the features that you have learned about, see the following guides:
- IBM WebSphere DataStage Designer Client Guide
- IBM WebSphere DataStage Director Client Guide
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
- Add stages and links to a job.
- Specify the properties of the stages and links to determine what they will do when the job is run.
- Specify column metadata.
- Consolidate your knowledge of compiling and running jobs.

This module should take approximately 90 minutes to complete.
Lesson checkpoint
In this lesson you created a job and saved it to a specified place in the repository. You learned the following tasks:
- How to create a job in the Designer client.
- How to name the job and save it to a folder in the repository tree.
Always use specific names for your stages and links rather than the default names that the Designer client assigns. Using specific names makes your job designs easier to document and easier to maintain.
8. Select File > Save to save the job. Your job design should now look something like the one shown in this figure:
Specifying properties and column metadata for the Sequential File stage
You will now edit the first of the stages that you added to specify what the stage does when you run the job. You will also specify the column metadata for the data that will flow down the link that joins the two stages. To edit the stages and add properties and metadata:

1. Double-click the country_codes Sequential File stage to open the stage editor. The editor opens in the Properties tab of the Output page.
2. Select the File property under the Source category.
3. In the File field, type the path name for your project folder (where the data files were copied when the tutorial was set up) and add the name CustomerCountry.csv (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\CustomerCountry.csv), and then press Enter. (If you prefer, you can browse for the path name by clicking the browse button to the right of the File field.) You specified the name of the comma-separated file that the stage reads when the job runs.
4. Select the First Line is Column Names property under the Options category.
5. Click the down arrow next to the First Line is Column Names field and select True from the list. The row that contains the column names is dropped when the job reads the file.
6. Click the Format tab.
7. In the record-level category, select the Record delimiter string property from the Available properties to add.
8. Select DOS format from the Record delimiter string list. This setting ensures that the file can be read by UNIX or Linux WebSphere DataStage servers.
9. Click the Columns tab. Because the CustomerCountry.csv file contains only three columns, type the column definitions into the Columns tab. (If a file contains many columns, it is less time consuming and more accurate to import the column definitions directly from the data source.) Note that column names are case-sensitive, so use the case shown in the instructions.
10. Double-click the first line of the table. Fill in the fields as follows:
Column Name       Key  SQL Type  Length  Description
CUSTOMER_NUMBER   Yes  Char      7       Key column for the lookup - the customer identifier
You will use the default values for the remaining fields.
11. Add two more rows to the table to specify the remaining two columns, and fill them in as follows:
Column Name  Key  SQL Type  Length  Description
COUNTRY      No   Char      2       The code that identifies the customer's country
LANGUAGE     No   Char      2       The code that identifies the customer's language
Your Columns tab should look like the one in the following figure (if you have National Language Support installed, there is an additional field named Extended):
12. Click the Save button to save the column definitions that you specified as a table definition object in the repository. The definitions can then be reused in other jobs.
13. In the Save Table Definition window, enter the following information:

Option            Value
Data source type  Saved
Data source name  CustomerCountry.csv
Table/file name   country_codes_data
Description       Table definition for country codes source file (the default description is the date and time of saving)
14. Click OK to specify the locator for the table definition. The locator identifies the table definition.
15. In the Save Table Definition As window, save the table definition in the Tutorial folder and name it country_codes_data.
16. Click the View Data button and click OK in the Data Browser window to use the default settings. The data browser shows you the data that the CustomerCountry.csv file contains. Because you specified the column definitions, the Designer client can read the file and show you the results.
17. Close the Data Browser window.
18. Click OK to close the stage editor.
19. Save the job. Notice that a small table icon has appeared on the Country_codes_data link. This icon shows that the link now has metadata. You have designed the first part of your job.
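The configuration in steps 1 through 19 amounts to: read a comma-separated file, drop the header row (First Line is Column Names = True), and apply fixed column definitions. The following minimal Python sketch models that behavior; the sample data is illustrative, not the tutorial's actual file contents.

```python
import csv
import io

# Sample rows shaped like the tutorial's CustomerCountry.csv: a header
# line followed by CUSTOMER_NUMBER, COUNTRY, and LANGUAGE values.
CSV_TEXT = """CUSTOMER_NUMBER,COUNTRY,LANGUAGE
CUST001,US,EN
CUST002,FR,FR
"""

def read_country_codes(text):
    """Read the comma-separated data, treating the first line as column
    names and dropping it, the way the stage property does."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

rows = read_country_codes(CSV_TEXT)
print(rows[0]["COUNTRY"])  # US
```

Each row comes back as a mapping from column name to value, which is roughly how the downstream stages see the data once the column metadata is in place.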
Specifying properties for the Lookup File Set stage and running the job
In this part of the lesson, you configure the next stage in your job. You already specified the column metadata for data that will flow down the link between the two stages, so there are fewer properties to specify in this task. To configure the Lookup File Set stage:

1. Double-click the country_code_lookup Lookup File Set stage to open the stage editor. The editor opens in the Properties tab of the Input page.
2. Select the Lookup Keys category; then double-click the Key property in the Available properties to add area.
3. In the Key field, click the down arrow, select CUSTOMER_NUMBER from the list, and press Enter. You specified that the CUSTOMER_NUMBER column will be the lookup key for the lookup table that you are creating.
4. Select the Lookup File Set property under the Target category.
5. In the Lookup File Set field, type the path name for the lookup file set that the stage will create (for example, C:\IBM\InformationServer\Server\Projects\Tutorial\countrylookup.fs) and press Enter.
6. Click OK to save your property settings and close the Lookup File Set stage editor.
7. Save the job, and then compile and run the job by using the techniques that you learned in Lesson 1.

You have now written a lookup table that can be used by another job later in the tutorial.
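Conceptually, the lookup file set that this job writes is a table of reference rows indexed on the lookup key. A rough Python model of that idea, with illustrative data (not the tutorial's actual rows):

```python
# Reference rows shaped like the country codes data; values are made up.
reference_rows = [
    {"CUSTOMER_NUMBER": "CUST001", "COUNTRY": "US", "LANGUAGE": "EN"},
    {"CUSTOMER_NUMBER": "CUST002", "COUNTRY": "FR", "LANGUAGE": "FR"},
]

def build_lookup(rows, key):
    """Index the reference rows on the named key column, the way the
    Lookup File Set indexes its rows on CUSTOMER_NUMBER."""
    return {row[key]: row for row in rows}

lookup_table = build_lookup(reference_rows, "CUSTOMER_NUMBER")
print(lookup_table["CUST002"]["COUNTRY"])  # FR
```

Indexing on the key up front is what makes the later Lookup stage fast: each incoming row needs only a single keyed probe rather than a scan of the reference data.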
Lesson checkpoint
You have now designed and run your very first job. You learned the following tasks:
- How to add stages and links to a job
- How to set the stage properties that determine what the stage will do when you run the job
- How to specify column metadata for the job and to save the column metadata to the repository for use in other jobs
Name special_handling_data
Your job design should now look like the one shown in this figure:
3. Open the stage editor for the special_handling Sequential File stage and specify that it will read the file SpecialHandling.csv and that the first line of this file contains column names.
4. Click the Format tab.
5. In the record-level category, select the Record delimiter string property from the Available properties to add.
6. Select DOS format from the Record delimiter string list. This setting ensures that the file can be read by UNIX or Linux WebSphere DataStage servers.
7. Click the Columns tab.
8. Click Load. You load the column metadata from the table definition that you previously saved as an object in the repository.
9. In the Table Definitions window, browse the repository tree to the folder where you stored the SpecialHandling.csv column definitions.
10. Select the SpecialHandling.csv table definition and click OK.
11. In the Selected Columns window, ensure that all of the columns appear in the Selected columns list and click OK. The column definitions appear in the Columns tab of the stage editor.
12. Close the Sequential File stage editor.
13. Open the stage editor for the special_handling_lookup stage.
14. Specify a path name for the destination file set, specify that the lookup key is the SPECIAL_HANDLING_CODE column, and then close the stage editor.
15. Save, compile, and run the job.
Lesson checkpoint
You have now added to your job design and learned how to import the metadata that the job uses. You learned the following tasks:
- How to import column metadata directly from a data source
- How to load column metadata from a definition that you saved in the repository
Job parameters
Sometimes, you want to specify information when you run the job rather than when you design it. In your job design, you can specify a job parameter to represent this information. When you run the job, you are then prompted to supply a value for the job parameter.

You specified the location of four files in the job that you designed in Lesson 2.3. In each part of the job, you specified a file that contains the source data and a file to write the lookup data set to. In this lesson, you will replace all four file names with job parameters. You will then supply the actual path names of the files when you run the job.

You will save the definitions of these job parameters in a parameter set in the repository. When you want to use the same job parameters in a job later in this tutorial, you can load them into the job design from the parameter set. Parameter sets enable the same job parameters to be used by different jobs.
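Job parameters can be pictured as named placeholders in the job design that are bound to real values at run time. The sketch below uses a #name# placeholder convention and illustrative parameter names; it is a simplified model, not the product's actual substitution mechanism.

```python
# A fragment of a "design" whose file properties refer to parameters
# by name; the property and parameter names here are illustrative.
design = {
    "source_file": "#country_codes_source#",
    "lookup_fileset": "#country_codes_lookup#",
}

def resolve(design, values):
    """Substitute run-time values for #parameter# placeholders, the way
    the Job Run Options window binds values to job parameters."""
    resolved = {}
    for prop, setting in design.items():
        for name, value in values.items():
            setting = setting.replace("#" + name + "#", value)
        resolved[prop] = setting
    return resolved

run_values = {
    "country_codes_source": "C:/tutorial/CustomerCountry.csv",
    "country_codes_lookup": "C:/tutorial/countrylookup.fs",
}
resolved = resolve(design, run_values)
print(resolved["source_file"])  # C:/tutorial/CustomerCountry.csv
```

The design stays fixed; only the value bindings change from run to run, which is exactly why parameters make a job reusable across environments.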
Parameter name           Prompt                                              Type
…                        Path name for the country codes lookup file set     path name
special_handling_source  Path name for the special handling codes file       path name
special_handling_lookup  Path name for the special handling lookup file set  path name
The Parameters tab of the Job Properties window should now look like the one in the following figure:
9. Click OK to close the Job Properties window.
10. Select File > Save to save the job.
The job runs, using the values that you supplied for the job parameters.
Lesson checkpoint
You defined job parameters to represent the file names in your job and specified values for these parameters when you ran the job. You learned the following tasks:
- How to define job parameters
- How to add job parameters in your job design
- How to specify values for the job parameters when you run the job
Parameter sets
You use parameter sets to define job parameters that you are likely to reuse in other jobs. Whenever you need this set of parameters in a job design, you can insert them into the job properties from the parameter set. You can also define different sets of values for each parameter set. These parameter sets are stored as files in the WebSphere DataStage server installation directory and are available to use in your job designs or when you run jobs that use these parameter sets.

If you make any changes to a parameter set object, these changes are reflected in job designs that use this object up until the time the job is compiled. The parameters that a job is compiled with are the ones that are available when the job is run. However, if you change the design after the job is compiled, the job will link to the current version of the parameter set.

You can create parameter sets from existing job parameters, or you can specify the job parameters as part of the task of creating a new parameter set.
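A parameter set can be modelled as a named group of parameters plus one or more named sets of values. The structure and all names below are illustrative, chosen to mirror the value set used later in the tutorial.

```python
# A parameter set: the parameters it groups, plus named value sets
# (like the 'lookupvalues1' set selected in the Job Run Options window).
parameter_set = {
    "parameters": ["country_codes_source", "country_codes_lookup"],
    "value_sets": {
        "lookupvalues1": {
            "country_codes_source": "C:/tutorial/CustomerCountry.csv",
            "country_codes_lookup": "C:/tutorial/countrylookup.fs",
        },
    },
}

def values_for(pset, value_set_name):
    """Return the run-time values stored under one named value set."""
    return pset["value_sets"][value_set_name]

vals = values_for(parameter_set, "lookupvalues1")
print(vals["country_codes_source"])  # C:/tutorial/CustomerCountry.csv
```

Selecting a value set by name at run time is what lets several jobs share one consistent group of parameter values instead of retyping paths in every run.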
10. Click OK, specify a repository folder in which to store the parameter set, and then click Save.
11. The Designer client asks if you want to replace the selected parameters with the parameter set that you have just created. Click No.
12. Click OK to close the Job Parameters window.
13. Save the job.

You created a parameter set that is available for another job that you will create later in this tutorial. The current job continues to use the individual parameters rather than the parameter set.
Lesson checkpoint
You have now created a parameter set. You learned the following tasks:
- How to create a parameter set from a set of existing job parameters
- How to specify a set of default values for the parameters in the parameter set
Module 2 Summary
In this module, you designed and ran a data extraction job. You also learned how to create reusable objects such as table definitions and parameter sets that you can include in other jobs that you design.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
- Creating new jobs and saving them in the repository
- Adding stages and links and specifying their properties
- Specifying column metadata and saving it as a table definition to reuse later
- Specifying job parameters to make your job design more flexible, and saving the parameters in the repository to reuse later
Learning objectives
After completing the lessons in this module, you will understand how to do the following tasks:
- Use a Transformer stage to transform data
- Handle rejected data
- Combine data by using a Lookup stage

This module should take approximately 60 minutes to complete.
5. Drop the Transformer stage between the two Data Set stages and name the Transformer stage Trim_and_Strip.
6. Right-click the GlobalCoBillTo Data Set stage and drag a link to the Transformer stage. This method of linking the stages is fast and easy. You do not need to go back to the palette and grab a link to connect each stage.
7. Use the same method to link the Transformer stage to the int_GlobalCoBillTo Data Set stage.
8. Name the first link full_bill_to and name the second link stripped_bill_to. Your job should look like the one in the following picture:
2. In the upper left pane of the stage editor, select the following columns:
- CUSTOMER_NUMBER
- CUST_NAME
- ADDR_1
- ADDR_2
- CITY
- REGION_CODE
- ZIP
- TEL_NUM
- REVIEW_MONTH
- SETUP_DATE
- STATUS_CODE
3. Drag these columns from the upper left pane to the stripped_bill_to link in the upper right pane of the stage editor. You are specifying that only these columns will flow through the Transformer stage when the job is run. The remaining columns will be dropped.
4. In the stripped_bill_to column definitions at the bottom of the right pane, edit the SQL type and Length fields for your columns as specified in the following table:
Column           SQL Type  Length
CUSTOMER_NUMBER  Char      7
CUST_NAME        VarChar   30
ADDR_1           VarChar   30
ADDR_2           VarChar   30
CITY             VarChar   30
REGION_CODE      Char      2
ZIP              VarChar   10
TEL_NUM          VarChar   10
REVIEW_MONTH     VarChar   2
SETUP_DATE       VarChar   12
STATUS_CODE      Char      1
By specifying stricter data typing for your data, you will be able to better diagnose inconsistencies in your source data when you run the job.
5. Double-click the Derivation field for the CUSTOMER_NUMBER column in the stripped_bill_to link. The expression editor opens.
6. In the expression editor, type the following text: Trim(full_bill_to.CUSTOMER_NUMBER, " ", "A"). The text specifies a function that deletes all of the space characters from the CUSTOMER_NUMBER column on the full_bill_to link before writing it to the CUSTOMER_NUMBER column on the stripped_bill_to link. Your Transformer stage editor should look like the one in the following figure:
7. Click OK to close the Transformer stage editor.
8. Open the stage editor for the int_GlobalCoBillTo Data Set stage and go to the Columns tab of the Input page. Notice that the stage editor has acquired the metadata from the stripped_bill_to link.
9. Save and then compile your TrimAndStrip job.
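The Trim derivation in step 6 deletes every space character from the value before it is written to the output link. Its effect can be modelled in Python; the sample value is illustrative.

```python
def trim_all(value, char=" "):
    """Remove every occurrence of char from value, mirroring the effect
    of the Trim(value, " ", "A") derivation described above."""
    return value.replace(char, "")

cleaned = trim_all(" CUST 001 ")
print(cleaned)  # CUST001
```

Note that this removes embedded spaces as well as leading and trailing ones, which is why the derivation is suitable for normalizing key columns before a lookup.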
Lesson checkpoint
In this lesson you learned how to design and configure a transformation job. You learned the following tasks:
- How to configure a Transformer stage
- How to link stages by using a different method for drawing links
- How to load column metadata into a link by using a drag-and-drop operation
- How to run a job from within the Designer client and monitor the performance of the job
6. Delete the int_GlobalCoBillTo Data Set stage. It will be replaced with a different Data Set stage.
7. Select the File area in the palette and drag a Lookup File Set stage to the job. Position it immediately above the Lookup stage and name it Country_Code_Fileset.
8. Draw a link from the Country_Code_Fileset Lookup File Set stage to the Lookup_Country Lookup stage and name it country_reference. The link appears as a dotted line, which indicates that the link is a reference link.
9. Drag a Data Set stage from the palette to the job and position it to the right of the Lookup stage. Name the Data Set stage temp_dataset.
10. Draw a link from the Lookup stage to the Data Set stage and name it country_code. The job that you designed should look like the one in the following figure:
4. Double-click the Condition bar in the country_reference link. The Lookup Stage Conditions window opens. Select the Lookup Failure field and select Continue from the list. You are specifying that, if a CUSTOMER_NUMBER value from the stripped_bill_to link does not match any CUSTOMER_NUMBER column values in the reference table, the job continues with the next row.
5. Close the Lookup stage editor.
6. Open the temp_dataset Data Set stage and specify a file name for the data set.
7. Save, compile, and run the job. The Job Run Options window displays all of the parameters in the parameter set.
8. In the Job Run Options window, select lookupvalues1 from the list next to the parameter set name. The parameter values are filled in with the path names that you specified when you created the parameter set.
9. Click Run to run the job, and then click View Data in the temp_dataset stage to examine the results.
Lesson checkpoint
With this lesson, you started to design more complex and sophisticated jobs. You learned the following tasks:
- How to copy stages, links, and associated configuration data between jobs.
- How to combine data in a job by using a Lookup stage.
3. Double-click the Lookup_Country Lookup stage to open the Lookup stage editor.
4. Double-click the Condition bar in the country_reference link to open the Lookup Stage Conditions window.
5. In the Lookup Stage Conditions window, select the Lookup Failure field and select Reject from the list. Close the Lookup stage editor. This step specifies that, whenever a row from the stripped_bill_to link has no matching entry in the country code lookup table, the row is rejected and written to the Rejected_Rows Sequential File stage.
6. Edit the Rejected_Rows Sequential File stage and specify a path name for the file that the stage will write to (for example, c:\tutorial\rejects.txt). This stage derives the column metadata from the Lookup stage, and you cannot alter it.
7. Save and compile the CleansePrepare job, and then run the job.
8. Open the Rejected_Rows Sequential File stage editor and click View Data to look at the rows that were rejected.
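The Reject behavior configured above can be modelled with a few lines of Python. This is an illustrative sketch, not the product's implementation; the reference data and row values are made up.

```python
# Illustrative reference table keyed on CUSTOMER_NUMBER.
reference = {"CUST001": {"COUNTRY": "US"}}

def apply_lookup(rows, reference, key):
    """Split input rows into matched output rows and rejected rows, the
    way Lookup Failure = Reject diverts unmatched rows to a reject link."""
    output, rejects = [], []
    for row in rows:
        match = reference.get(row[key])
        if match is None:
            rejects.append(row)              # written to Rejected_Rows
        else:
            output.append({**row, **match})  # input row plus looked-up columns
    return output, rejects

rows = [{"CUSTOMER_NUMBER": "CUST001"}, {"CUSTOMER_NUMBER": "CUST999"}]
output, rejects = apply_lookup(rows, reference, "CUSTOMER_NUMBER")
print(len(output), len(rejects))  # 1 1
```

Capturing the rejects in a separate file, rather than silently dropping them, is what lets you inspect and repair bad source data after the run.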
Lesson checkpoint
You learned the following tasks:
- How to add a reject link to your job
- How to configure the Lookup stage so that it rejects data where a lookup fails
In the sample bill_to data, one of the columns is overloaded. The SETUP_DATE column can contain a special handling code as well as the date that the account was set up. The transformation logic that is being added to the job extracts this special handling code into a separate column. The job then looks up the text description that corresponds to the code from the lookup table that you populated in Lesson 2 and adds the description to the output data. The transformation logic also adds a row count to the output data.
The new columns appear in the graphical representation of the link, but are highlighted in red because they do not yet have valid derivations.
4. In the graphical area, double-click the Derivation field of the SOURCE column.
5. In the expression editor, type "GlobalCo":. Position your mouse pointer immediately to the right of this text, right-click, and select Input Column from the menu. Then select the COUNTRY column from the list. When you run the job, the SOURCE column for each row will contain the two-letter country code prefixed with the text GlobalCo, for example, GlobalCoUS.
6. In the Transformer stage editor toolbar, click the Stage Properties tool on the far left. The Transformer Stage Properties window opens.
7. Click the Variables tab and, by using the techniques that you learned for defining table definitions, add the following stage variables to the grid:
Stage variable         SQL Type  Precision
xtractSpecialHandling  Char      1
TrimDate               VarChar   10

When you close the Properties window, these stage variables appear in the Stage Variables area above the with_business_rules link.
8. Double-click the Derivation field of each of the stage variables in turn and type the following expressions in the expression editor:
xtractSpecialHandling
Expression: If Len(country_code.SETUP_DATE) < 2 Then country_code.SETUP_DATE Else Field(country_code.SETUP_DATE, " ", 2)
Description: This expression checks whether the SETUP_DATE column contains a special handling code. If the column contains only a code, the value of xtractSpecialHandling is set to that code. If the column contains a date and a code, the code is extracted and the value of xtractSpecialHandling is set to that code.

TrimDate
Expression: If Len(country_code.SETUP_DATE) < 3 Then "01/01/0001" Else Field(country_code.SETUP_DATE, " ", 1)
Description: This expression checks whether the SETUP_DATE column contains a date. If the SETUP_DATE column does not contain a date, the expression sets the value of the TrimDate variable to the string 01/01/0001. If the SETUP_DATE column contains a date, the date is extracted and the value of the TrimDate variable is set to a string that contains the date.
9. Select the xtractSpecialHandling stage variable, drag it to the Derivation field of the SPECIAL_HANDLING_CODE column, and drop it on the with_business_rules link. A line is drawn between the stage variable and the column, and the name xtractSpecialHandling appears in the Derivation field. For each row that is processed, the SPECIAL_HANDLING_CODE column writes the current value of the xtractSpecialHandling variable.
10. Select the TrimDate stage variable, drag it to the Derivation field of the SETUP_DATE column, and drop it on the with_business_rules link. A line is drawn between the stage variable and the column, and the name TrimDate appears in the Derivation field. For each row processed, the SETUP_DATE column writes the current value of the TrimDate variable.
11. Double-click the Derivation field of the RECNUM column and type "GC": in the expression editor. Then right-click, select System Variable from the menu, and select @OUTROWNUM. You added row numbers to your output. Your Transformer stage editor should look like the one in the following picture:
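The two stage-variable derivations can be sketched in Python. This is a rough model, assuming SETUP_DATE holds either a date such as 01/01/2001, a date followed by a space and a one-character code, or just the code on its own; the field() helper mimics the DataStage Field() function.

```python
def field(value, delimiter, n):
    """Rough model of the DataStage Field() function: return the nth
    delimiter-separated field (1-based), or "" if there is no such field."""
    parts = value.split(delimiter)
    return parts[n - 1] if n <= len(parts) else ""

def xtract_special_handling(setup_date):
    # If Len(SETUP_DATE) < 2 Then SETUP_DATE Else Field(SETUP_DATE, " ", 2)
    return setup_date if len(setup_date) < 2 else field(setup_date, " ", 2)

def trim_date(setup_date):
    # If Len(SETUP_DATE) < 3 Then "01/01/0001" Else Field(SETUP_DATE, " ", 1)
    return "01/01/0001" if len(setup_date) < 3 else field(setup_date, " ", 1)

print(xtract_special_handling("01/01/2001 3"))  # 3
print(trim_date("3"))                           # 01/01/0001
```

Running the two derivations over the same input is what un-overloads the column: the code and the date each end up in their own output field.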
- SPECIAL_HANDLING_CODE
6. Select the DESCRIPTION column in the special_handling reference link and drag it to the finished_data output link (the LANGUAGE column is not used).
7. Double-click the Condition bar in the special_handling reference link to open the Lookup Stage Conditions window.
8. Specify that the processing will continue if the lookup fails for a data row. You do not need to specify a reject link for this stage. Only a minority of the rows in the bill_to data contain a special handling code, so if the rows that do not contain a code were rejected, most of the data would be rejected.
9. Specify a job parameter to represent the file that the Target Sequential File stage will write to, and add this job parameter to the stage. Save, compile, and run the CleansePrepare job.
Lesson checkpoint
In this lesson, you consolidated your existing skills in defining transformation jobs and added some new skills. You learned the following tasks:
- How to define and use stage variables in a Transformer stage
- How to use system variables to generate output column values
Module 3 Summary
In this module you refined and added to your job design skills. You learned how to design more complex jobs that transform the data that your previous jobs extracted.
Lessons learned
By completing this module, you learned the following concepts and tasks:
- How to drop data columns from your data flow
- How to use the transform functions that are provided with the Designer client
- How to combine data from two different sources
- How to capture rejected data
Learning objectives
After completing the lessons in this module, you will know how to do the following tasks:
v Define a data connection object that you can use and reuse to connect to a database
v Import column metadata from a database
v Write data to a relational database target
This module should take approximately 60 minutes to complete.
Prerequisites
Ensure that your database administrator runs the relevant database scripts that are supplied with the tutorial and sets up a DSN for you to use when connecting through the ODBC connector.
1. Select the tutorial folder in the repository, right-click, and select New > Other > Data Connection from the shortcut menu.
2. On the General page of the Data Connection window, enter a name for the data connection object (for example, tutorial_connect) and provide a short description and a long description of the object.
3. Open the Parameters page.
4. Click the browse button next to the Connect using Stage Type field.
5. In the Open window, open the Stage Types > Parallel > Database folder, select the ODBC Connector item, and click Open. The Connection parameters grid is populated and shows the connection parameters that are required by the stage type that you selected.
6. Enter values for each of the parameters as shown in the following table:
Parameter name      Value
ConnectionString    Type the DSN name.
Username            Type the user name for connecting to the database by using the specified DSN.
Password            Type the password for connecting to the database by using the specified DSN.
7. Click OK.
8. In the Save Data Connection As window, select the tutorial folder and click Save.
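The three stored parameters map directly onto a standard ODBC connection string, which is what the connector assembles behind the scenes. This is an illustrative Python sketch under that assumption; the DSN name and credentials are placeholders, and the helper name is hypothetical.

```python
# Hypothetical sketch: the ConnectionString (DSN), Username, and Password
# values saved in the data connection object correspond to the DSN, UID,
# and PWD keywords of an ODBC connection string.
def build_connection_string(dsn: str, user: str, password: str) -> str:
    """Assemble an ODBC connection string from the stored parameters."""
    return f"DSN={dsn};UID={user};PWD={password}"
```

A client such as pyodbc would accept the resulting string directly, which is one reason a single data connection object can be reused by any ODBC-based stage.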
Lesson checkpoint
You learned how to create a data connection object and store the object in the repository.
10. Click the Test Connection link to ensure that you can connect to the database by using the connection details, and then click Next.
11. In the Filter page, select the schema from the Schema list (ask your database administrator if you do not know the name of the schema) and click Next.
12. In the Selection page, select the tutorial table from the list and click Next.
13. In the Confirm import page, review the import details, and then click Import.
14. In the Select Folder window, select the tutorial folder and click OK.
The table definition is imported and appears in the tutorial folder. The table definition has a different icon from the table definitions that you used previously. This icon identifies that the table definition was imported by using a connector and is available to other projects and to other suite components.
Lesson checkpoint
You learned how to import column metadata from a database using a connector.
Connectors
Connectors are stages that you use to connect to data sources and data targets to read or write data. The Database section of the palette in the Designer contains many types of stages that connect to the same types of data sources or targets. For example, if you click the down arrow next to the ODBC icon in the palette, you can choose to add either an ODBC Connector stage or an ODBC Enterprise stage to your job. If your database type supports connector stages, use them, because they provide the following advantages over other types of stages:
v You can create job parameters from within the connector stage (without first defining the job parameters in the job properties).
v You can save any connection information that you specify in the stage as a data connection object.
v They reconcile data types between source and target to avoid runtime errors.
v They generate detailed error information if a connector encounters problems when the job runs.
d. Click the SQL tab to view the SQL statement; then click OK to close the SQL builder. The SQL statement is displayed in the Insert statement field, and your ODBC connector should look like the one in the following figure:
9. Click OK to close the ODBC connector.
10. Save, compile, and run the job.
You wrote the BillTo data to the tutorial database table. This table forms the bill_to dimension of the star schema that is being implemented for the GlobalCo delivery data in the business scenario that the tutorial is based on.
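The SQL builder's output is a parameterized INSERT statement with one marker per mapped column. The sketch below shows the general shape of such a statement; the table and column names are assumptions for illustration, not taken from the tutorial database.

```python
# Hypothetical sketch of the kind of parameterized INSERT statement the
# SQL builder generates for a dimension table. Table and column names
# here are placeholders, not the tutorial's actual schema.
def build_insert(table: str, columns: list) -> str:
    """Return a parameterized INSERT with one "?" marker per column."""
    markers = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({markers})"
```

At run time, the connector binds each incoming row's column values to the markers, so one prepared statement serves every row that the job writes.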
Lesson checkpoint
You learned how to use a connector stage to connect to and write to a relational database table. You learned the following tasks:
v How to configure a connector stage
v How to use a data connection object to supply database connection details
v How to use the SQL builder to define the SQL statement by accessing the database
Module 4 summary
In this module, you designed a job that writes data to a table in a relational database. In Lesson 4.1, you learned how to define a data connection object; in Lesson 4.2, you imported column metadata from a database; and in Lesson 4.3, you learned how to write data to a relational database target.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
v How to load data into data targets
v How to use the Designer client's reusable components
Learning objectives
After completing the lessons in this module, you will know how to do the following tasks:
v Use the configuration file to optimize parallel processing
v Control parallel processing at the stage level in your job design
v Control the partitioning of data so that it can be handled by multiple processors
This module should take approximately 60 minutes to complete.
Prerequisites
You must be working on a computer with multiple processors. You must have DataStage administrator privileges to create and use a new configuration file.
{
    node "node1" {
        fastname "R101"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
    node "node2" {
        fastname "R101"
        pools ""
        resource disk "C:/IBM/InformationServer/Server/Datasets" {pools ""}
        resource scratchdisk "C:/IBM/InformationServer/Server/Scratch" {pools ""}
    }
}
The default configuration file is created when WebSphere DataStage is installed. Although the system has four processors, the configuration file specifies two processing nodes. Specify fewer processing nodes than there are physical processors to ensure that your computer has processing resources available for other tasks while it runs WebSphere DataStage jobs. This file contains the following fields:
node
    The name of the processing node that this entry defines.
fastname
    The name of the node as it is referred to on the fastest network in the system. For an SMP system, all processors share a single connection to the network, so the fastname is the same for all the nodes that you define in the configuration file.
pools
    Specifies that nodes belong to a particular pool of processing nodes. A pool of nodes typically has access to the same resource, for example, access to a high-speed network link or to a mainframe computer. The pools string is empty for both nodes, specifying that both nodes belong to the default pool.
resource disk
    Specifies the name of the directory where the processing node writes data set files. When you create a data set or file set, you specify what the controlling file is called and where it is stored, but the controlling file points to other files that store the data. These files are written to the directory that is specified by the resource disk field.
resource scratchdisk
    Specifies the name of a directory where intermediate, temporary data is stored.
Configuration files can be more complex and sophisticated than the example file and can be used to tune your system to get the best possible performance from the parallel jobs that you design.
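Because every node entry in an SMP configuration repeats the same fastname and resource directories, a file like the one above can be generated mechanically. This is a hypothetical helper, not a DataStage tool; it simply emits one node entry per processing node in the format shown.

```python
# Hypothetical generator for a parallel configuration file like the default
# one shown above: one node entry per processing node, all sharing the same
# fastname, resource disk, and scratch disk (typical for an SMP system).
def make_config(n_nodes: int, fastname: str, disk: str, scratch: str) -> str:
    """Return configuration-file text with n_nodes node entries."""
    entries = []
    for i in range(1, n_nodes + 1):
        entries.append(
            f'\tnode "node{i}" {{\n'
            f'\t\tfastname "{fastname}"\n'
            f'\t\tpools ""\n'
            f'\t\tresource disk "{disk}" {{pools ""}}\n'
            f'\t\tresource scratchdisk "{scratch}" {{pools ""}}\n'
            f'\t}}'
        )
    return "{\n" + "\n".join(entries) + "\n}"
```

Raising n_nodes is how you would later build the larger configuration file used in the partitioning lessons, while keeping the per-node fields identical.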
Lesson checkpoint
In this lesson, you learned how the configuration file is used to control parallel processing. You learned the following concepts and tasks:
v About configuration files
v How to open the default configuration file
v What the default configuration file contains
In the simplest scenario, you do not need to worry about how your data is partitioned: WebSphere DataStage can partition your data and implement the most efficient partitioning method. Most partitioning operations result in a set of partitions that are as near to equal size as possible, which ensures an even load across your processors. For some operations, however, you need to control partitioning to ensure that you get consistent results. For example, suppose that you are using an Aggregator stage to summarize your data. You must ensure that related data is grouped together in the same partition before the summary operation is performed on that partition. In this lesson, you run the sample job that you ran in Lesson 1.3. By default, the data that is read from the file is not partitioned when it is written to the data set. You change the job so that the data set has the same number of partitions as there are nodes defined in your system's default configuration file.
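The requirement that related rows land in the same partition is exactly what hash partitioning guarantees: rows are assigned to a partition by hashing the grouping key, so equal keys always travel together. This is an illustrative Python sketch of the idea, not DataStage's implementation.

```python
# Hypothetical sketch of hash partitioning: rows that share a key always
# land in the same partition, so a per-partition aggregation still
# produces correct group totals.
from zlib import crc32

def hash_partition(rows, key, n_partitions):
    """Distribute rows into n_partitions lists, keeping equal keys together."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        # The same key value always hashes to the same partition number.
        p = crc32(str(row[key]).encode()) % n_partitions
        partitions[p].append(row)
    return partitions
```

Round-robin partitioning, by contrast, balances sizes well but can scatter one group's rows across partitions, which is why a grouping operation such as an Aggregator stage needs a key-based method instead.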
4. Click the disk icon in the toolbar to open the Data Set viewer.
5. View the data in the data set to see its structure.
6. Close the window.
5. Compile and run the job. 6. Return to the data set management tool and open the GlobalCo_BillTo.ds data set. You can see that the data set now has multiple data partitions. The following figure shows the data set partitions on the system.
Lesson checkpoint
In this lesson, you learned some basics about data partitioning. You learned the following tasks:
v How to use the data set management tool to view data sets
v How to set a partitioning method for a stage
5. In the General tab of the Project Properties window, click Environment. 6. In the Categories tree of the Environment variables window, select the Parallel node. 7. Select the APT_CONFIG_FILE environment variable, and edit the file name in the path name under the Value column heading to point to your new configuration file. The Environment variables window should resemble the one in the following picture:
You deployed your new configuration file. Keep the Administrator client open, because you will use it to restore the default configuration file at the end of this lesson.
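What the Administrator client changes here is the value of the APT_CONFIG_FILE environment variable, which the parallel engine reads to find its configuration file. As an illustrative sketch (the path is a placeholder, and the helper name is hypothetical), the same effect can be expressed as setting that variable for a session:

```python
# Hypothetical sketch: deploying a configuration file amounts to pointing
# the APT_CONFIG_FILE environment variable at it before parallel jobs run.
import os

def deploy_config(path: str) -> str:
    """Point parallel jobs at a specific configuration file and return it."""
    os.environ["APT_CONFIG_FILE"] = path
    return os.environ["APT_CONFIG_FILE"]
```

Restoring the default file at the end of the lesson is then just setting the variable back to its original value, which is why the Administrator client is kept open.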
Lesson checkpoint
You learned how to create a configuration file and use it to alter the operation of parallel jobs. You learned the following tasks:
v How to create a configuration file based on the default file
v How to edit the configuration file
v How to deploy the configuration file
Chapter 7. Module 5: Processing in parallel
Module 5 summary
In this module, you learned how to use the configuration file to control how your parallel jobs are run. You also learned how to control the partitioning of data at the level of individual stages.
Lessons learned
By completing this module, you learned about the following concepts and tasks:
v The configuration file
v How to use the configuration editor to edit the configuration file
v How to control data partitioning
Lessons learned
By completing this tutorial, you learned about the following concepts and tasks:
v How to extract, transform, and load data by using WebSphere DataStage
v How to use the parallel processing power of WebSphere DataStage
v How to reuse job design elements
The WebSphere DataStage server might be on the same Windows computer as the clients, or it might be on a separate Windows, UNIX, or Linux computer. When you created the project for the tutorial, you automatically created a folder or directory for that project on the server computer.
1. Open the tutorial folder that you created on the client computer and locate all the files that end with .csv:
v CustomerCountry.csv
v SpecialHandling.csv
v GlobalCo_BillTo.csv
2. Open the project folder on the server computer for the tutorial project that you created. The default path name for a Windows server is c:\IBM\InformationServer\Server\Projects\tutorial_project. The default path name for a UNIX or Linux server is /opt/IBM/InformationServer/Server/Projects/tutorial_project.
3. Copy the files from the tutorial folder on the client computer to the project folder on the server computer.
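If the client and server folders are both reachable from one machine, step 3 can be scripted. This is an optional convenience sketch, not part of the tutorial; the helper name is hypothetical, and the directory paths would be the client tutorial folder and the server project folder described above.

```python
# Hypothetical sketch of step 3: copy the tutorial .csv files from the
# client folder into the server project folder.
import shutil
from pathlib import Path

def copy_tutorial_files(client_dir: str, project_dir: str) -> list:
    """Copy every .csv file from client_dir to project_dir; return the
    sorted file names that were copied."""
    src, dst = Path(client_dir), Path(project_dir)
    copied = []
    for csv_file in src.glob("*.csv"):
        shutil.copy2(csv_file, dst / csv_file.name)  # preserves timestamps
        copied.append(csv_file.name)
    return sorted(copied)
```

When the server is a separate UNIX or Linux machine, a file-transfer tool would replace the local copy, but the set of files (the three .csv files listed above) is the same.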
Contacting IBM
You can contact IBM by telephone for customer support, software services, and general information.
Customer support
To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).
Software services
To learn about available service options, call one of the following numbers: v In the United States: 1-888-426-4343 v In Canada: 1-800-465-9600
General information
To find general information in the United States, call 1-800-IBM-CALL (1-800-426-2255). Go to www.ibm.com for a list of numbers outside of the United States.
Accessible documentation
Documentation is provided in XHTML format, which is viewable in most Web browsers.
XHTML allows you to view documentation according to the display preferences that you set in your browser. It also allows you to use screen readers and other assistive technologies. Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing the online documentation using a screen reader.
Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact: IBM Corporation J46A/G4 555 Bailey Avenue San Jose, CA 95141-1003 U.S.A. Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee. The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement or any equivalent agreement between us. Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. All statements regarding IBMs future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only. This information is for planning purposes only. The information herein is subject to change before the products described become available. 
This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows: (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.
If you are viewing this information softcopy, the photographs and color illustrations may not appear.
Trademarks
IBM trademarks and certain non-IBM trademarks are marked at their first occurrence in this document. See http://www.ibm.com/legal/copytrade.shtml for information about IBM trademarks. The following terms are trademarks or registered trademarks of other companies: Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel Inside (logos), MMX and Pentium are trademarks of Intel Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Other company, product or service names might be trademarks or service marks of others.
Index

A
accessibility 62
adding
    job parameters 20
    stages 14
Administrator client 57
    starting 52

C
column metadata
    importing 18, 40
    loading 18
comma-separated files 58
comments on documentation 62
compiling jobs 9
configuration file 47, 51
    default 47
configurations viewer 47, 52
connector
    ODBC 40
        configuring 42
contacting IBM 61
creating
    a job 13
    job parameters 22

D
data
    combining 29
    looking up 29
    partitioning 49
    reject 32
data browser 11
data connection
    creating 39
data files 58
data set management tool
Data Set stage 8
database 58
Designer client 5, 58
    starting 6
Director client
    starting 9
documentation
    accessible 62
    ordering 61
    Web site 61
DSN 59

E
environment variables 53

F
files
    data 58
    folder 57

I
importing
    column metadata 18, 40
    tutorial components 58
installing 57

J
job parameters
    adding 20
    parameter sets 22
job properties 20
jobs
    compiling 9
    creating 13
    opening 5
    running 9
    sample job 5

L
legal notices 63
loading column metadata 18
lookup 29
Lookup File Set stage 17, 30
Lookup stage 29

M
metadata
    importing 18, 40
    loading 18

O
ODBC 40, 42, 59
opening a job 5

P
parameter set
    creating 22
parameter sets 22
partitioning data 49
partitions
    creating 50
project
    creating 57
properties
    job 20
    stage 15

R
readers comment form 62
reject data 32
relational database 58
repository 5
repository objects
    data connections 39
    parameter set 22
    table definition 18, 40
running jobs 9

S
screen readers 62
Sequential File stage 8
setting up 57
source data files 58
SQL Builder 42
stage
    Lookup File Set 17, 30
    Transformer 26
stage properties 15
stages
    adding 14
starting the Designer client

T
table definition 18, 40
trademarks 65
Transformer stage 26
tutorial folder 57

V
viewing data 11
Printed in USA
SC18-9889-00