© 2014 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any
means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation. All other
company and product names may be trade names or trademarks of their respective owners and/or copyrighted
materials of such owners.
Abstract
This document describes how to use Informatica Big Data Edition Sandbox for Hortonworks to run sample mappings
based on common big data use cases. After you understand the sample big data use cases, you can create and run
your own big data mappings.
Supported Versions
Hortonworks 2.1.3
Table of Contents
Installation and Configuration Overview
Step 1. Download the Software
   Download and Install VMWare Player
   Register at Informatica Marketplace
   Download the Big Data Trial Sandbox for Hortonworks Files
Step 2. Start the Big Data Trial Sandbox for Hortonworks Virtual Machine
Step 3. Configure and Install the Big Data Trial Sandbox for Hortonworks Client
   Configure the Domain Properties on the Windows Machine
   Configure a Static IP Address on the Windows Machine
   Install the Big Data Trial Sandbox for Hortonworks Client
Step 4. Access the Big Data Trial Sandbox for Hortonworks Sandbox
   Apache Ambari
   Informatica Administrator
   Informatica Developer
Big Data Trial Sandbox for Hortonworks Samples
Running Common Tutorial Mappings on Hadoop
Performing Data Discovery on Hadoop
Performing Data Warehouse Optimization
Processing Complex Files
   Reading and Parsing Complex Files
   Writing to Complex Files
Working with NoSQL Databases
   HBase
Troubleshooting
Note: The Informatica Big Data Trial Sandbox for Hortonworks installation and configuration document is available on
the desktop of the virtual machine.
The Big Data Trial Sandbox for Hortonworks client installs the libraries and binaries required for the Informatica
Developer (Developer tool) client.
Step 2. Start the Big Data Trial Sandbox for Hortonworks Virtual Machine
Optionally, in VMware Player, click Browse > Import to extract the contents of the virtual machine to the selected location and start the virtual machine. Then, click Play virtual machine.
You are logged in to the virtual machine. The Informatica services and Hadoop services start automatically.
Step 3. Configure and Install the Big Data Trial Sandbox for Hortonworks Client
Before you run the Big Data Trial Sandbox for Hortonworks client installation, you must configure the domain properties so that the client can communicate with the virtual machine.
Optionally, to avoid updating the IP address of the virtual machine each time it changes, you can configure a static IP
address for the virtual machine.
Then, you can run the silent installer to install the Big Data Trial Sandbox for Hortonworks client.
Configure the Domain Properties on the Windows Machine
1. Click Applications > System Tools > Terminal to open the terminal to run commands.
2. Run the ifconfig command to find the IP address of the virtual machine.
The ifconfig command returns all interfaces on the virtual machine. Select the eth interface to get the IP address.
In the ifconfig output, the inet addr value for the eth interface is the IP address of the virtual machine.
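For reference, the eth0 entry in the ifconfig output typically looks like the following excerpt. The inet addr and HWaddr values shown are the sample values used elsewhere in this document; the remaining fields are illustrative and will differ on your virtual machine:
eth0      Link encap:Ethernet  HWaddr 00:0C:29:10:F9:4C
          inet addr:192.168.159.159  Bcast:192.168.159.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1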
3. Add the IP address and the default hostname hdp-bde-demo to the hosts file on the Windows machine on which you install the Developer tool.
The hosts file can be located in the following location: C:\Windows\System32\drivers\etc\hosts. Add the
following line to the hosts file: <IP address> <hostname>. For example, add the following line:
192.168.159.159 hdp-bde-demo
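To confirm that the hosts entry resolves, you can ping the hostname from a Windows command prompt. This is an optional check, not part of the documented procedure:
rem Optional check: confirm that hdp-bde-demo resolves to the virtual machine.
ping hdp-bde-demo
rem Expect replies from 192.168.159.159 if the entry is correct and the virtual machine is running.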
Configure a Static IP Address on the Windows Machine
1. Click Applications > System Tools > Terminal to open the terminal to run commands.
2. Run the ifconfig command to find the IP address and hardware ethernet address of the virtual machine.
The ifconfig command returns all interfaces on the virtual machine. Select the eth interface to get the IP address and the hardware ethernet address.
In the ifconfig output, the inet addr and HWaddr values for the eth interface are the IP address and the hardware ethernet address.
3. Edit vmnetdhcp.conf to add the values for host name, IP address, and hardware ethernet address (see the note after this procedure).
vmnetdhcp.conf is located in the following directory: C:\ProgramData\VMware
Add the following entry before the #END tag at the end of the file:
host <hostname> {
hardware ethernet <your HWaddr>;
fixed-address <your inet addr>;
}
The following sample code shows how to set a static IP address:
host hdp-bde-demo {
hardware ethernet 00:0C:29:10:F9:4C;
fixed-address 192.168.159.159;
}
4. Add the IP address and the default hostname hdp-bde-demo to the hosts file on the Windows machine on which you install the Developer tool.
The hosts file can be located in the following location: C:\Windows\System32\drivers\etc\hosts. Add the
following line to the hosts file: <IP address> <hostname>. For example, add the following line:
192.168.159.159 hdp-bde-demo
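Note that edits to vmnetdhcp.conf typically take effect only after the VMware DHCP service restarts. The service name VMnetDHCP is an assumption based on common VMware Player installations; verify the name in services.msc. From an elevated Windows command prompt:
rem Restart the VMware DHCP service so that the edited vmnetdhcp.conf is reloaded.
rem The service name VMnetDHCP is an assumption; confirm it in services.msc.
net stop vmnetdhcp
net start vmnetdhcp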
Install the Big Data Trial Sandbox for Hortonworks Client
Run the silent installer on the Windows machine.
The silent installer runs in the background. The process can take several minutes.
The command window displays a message that indicates that the installation is complete.
You can find the Informatica_Version_Client_InstallLog.log file in the following directory: C:\Informatica\9.6.1_BDE_Trial\
After the installation process is complete, you can launch the Big Data Trial Sandbox for Hortonworks Client.
Step 4. Access the Big Data Trial Sandbox for Hortonworks Sandbox
Apache Ambari
You can log in to Ambari from the following URL: http://hdp-bde-demo:8080/#/login.
Enter the following credentials to log in to Ambari:
User name: admin
Password: admin
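As an optional command-line check, you can confirm that Ambari is up through its REST API with the same credentials, from any machine that resolves hdp-bde-demo. This is a minimal sketch, not part of the documented procedure:
curl -u admin:admin http://hdp-bde-demo:8080/api/v1/clusters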
Informatica Administrator
You can access the Administrator tool from the following URL: http://hdp-bde-demo:6005
Enter the following credentials to log in to the Administrator tool:
User name: Administrator
Password: Administrator
Informatica Developer
You can start the Developer tool client from the Windows Start menu.
Enter the following credentials to connect to the Model repository Infa_mrs:
User name: Administrator
Password: Administrator
After you run the mappings in the Developer tool, you can monitor the mapping jobs in the Administrator tool.
Running Common Tutorial Mappings on Hadoop
m_DataLoad_2
m_DataLoad_2 loads data from the READ_WordFile2 flat file from your machine to the
WRITE_HDFSWordFile2 file on HDFS.
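After m_DataLoad_2 runs, you can confirm that the target file exists on HDFS from a terminal on the virtual machine. The target directory below is a placeholder; use the path configured in the WRITE_HDFSWordFile2 data object:
hadoop fs -ls <target directory>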
m_WordCount
m_WordCount reads two source files from HDFS, parses the data, and writes the output to a flat file on HDFS.
The following image shows the mapping m_WordCount:
Expression transformations. Remove the carriage return and newline characters from a word.
Performing Data Discovery on Hadoop
Profile_CustomerData. Profiles the customer data to determine its characteristics.
Use the samples to understand how to perform data discovery on Hadoop. Before you use the customer data in the CustomerData flat file as a mapping source, verify its quality to determine whether the data is ready for processing. Run the Profile_CustomerData profile on the source data to determine the characteristics of the customer data.
The profile determines the characteristics of columns in a data source, such as value frequencies, unique values, null
values, patterns, and statistics.
The profile determines the following characteristics of source data:
The number of unique and null values in each column, expressed as a number and percentage.
The patterns of data in each column and the frequencies with which these values occur.
Statistics about the column values, such as the maximum value length, minimum value length, first value, and
last value in each column.
You can analyze the profile results to determine the characteristics of the customer data.
Performing Data Warehouse Optimization
To run the workflow, enter the following command from the command line:
./infacmd.sh wfs startWorkflow -dn infa_domain -sn infa_dis -un Administrator -pd Administrator -Application App_DataWarehouseOptimization -wf wf_DataWarehouseOptimization
To run the mappings in the workflow individually, open a mapping and right-click it to run it.
The workflow contains the following mappings and transformations:
Mapping_Day1
The workflow object Mapping_Day1 reads customer data from flat files in a local file system and writes to an
HDFS target for the first 24-hour period.
Mapping_Day2
The workflow object Mapping_Day2 reads customer data from flat files in a local file system and writes to an
HDFS target for the next 24-hour period.
m_CDC_DWHOptimization
The workflow object m_CDC_DWHOptimization captures the changed data. It reads data from HDFS and
identifies the data that has changed. To increase performance, you can configure the mapping to run on
Hadoop cluster nodes in a Hive environment.
The following image shows the mapping m_CDC_DWHOptimization:
Sources. HDFS files that were the targets of the previous two mappings. The Data Integration Service
reads all of the data as a single column.
Expression transformations. Extract a key from the non-key values in the data. The expressions use the INSTR and SUBSTR functions to extract the key values (see the sketch after this list).
Joiner transformation. Performs a full outer join on the two sources based on the keys generated by the
Expression transformations.
Filter transformations. Use the output of the Joiner transformation to filter rows based on whether the rows should be updated, deleted, or inserted.
Targets. HDFS files. The Data Integration Service writes the data to three HDFS files based on whether
the data is inserted, deleted, or updated.
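The actual expression logic ships with the sample mapping. As a rough shell analog of what INSTR and SUBSTR do here, the sketch below pulls a leading key out of a delimited row that was read as a single column. The comma delimiter and the row layout are assumptions for illustration only:
# Shell analog of the INSTR/SUBSTR key extraction, for illustration only.
# Assumes each row arrived as one column in the form "CUST1001,Jane,Smith".
echo "CUST1001,Jane,Smith" | awk '{
    pos  = index($0, ",")          # INSTR: position of the first delimiter
    key  = substr($0, 1, pos - 1)  # SUBSTR: characters before the delimiter
    rest = substr($0, pos + 1)     # SUBSTR: the remaining non-key values
    print key, rest
}'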
Consolidated_Mapping
The workflow object Consolidated_Mapping consolidates the data in the HDFS files and loads the data to the
data warehouse.
The following figure shows the mapping Consolidated_Mapping:
Sources. The HDFS files that were the target of the previous mapping are the sources of this mapping.
Expression transformations. Add the deleted, updated, or inserted tags to the data rows.
Target. Flat file that acts as a staging location on the local file system.
Processing Complex Files
Reading and Parsing Complex Files
The following image shows the sample web log processing workflow:
To run the workflow, enter the following command from the command line:
./infacmd.sh wfs startWorkflow -dn infa_domain -sn infa_dis -un Administrator -pd Administrator -Application app_logProcessing -wf wf_LogProcessing
To run the mappings in the workflow individually, open a mapping and right-click it to run it.
You can run the following mappings and transformations in the workflow:
m_LoadData
The workflow object m_LoadData reads the parsed web log data and writes to a flat file target. The source
and target are flat files.
The following image shows the mapping m_LoadData:
m_sample_weblog_parsing
The workflow object m_sample_weblog_parsing is a logical data object read mapping that reads data from an HDFS source, parses the data with a Data Processor transformation, and writes to a logical data object.
The following image shows the expanded logical data object read mapping m_sample_weblog_parsing:
Source. HDFS file that was the target of the previous mapping.
Data Processor transformation. Processes the input binary stream of data, parses the data, and writes to
XML format.
Joiner transformation. Combines the activity of visitors who return to the website on the same day with
stock queries.
Writing to Complex Files
The following figure shows the Complex File Writer sample mapping:
HDFS output
The output, Write_binary_single_file, is a complex file stored in HDFS.
Working with NoSQL Databases
HBase
Use HBase when you need random, real-time read and write access to a database. HBase is a non-relational distributed
database that runs on top of the Hadoop Distributed File System (HDFS) and can store sparse data. Big Data Trial
Sandbox for Hortonworks provides samples that demonstrate how to read and process binary data from HBase.
The HBase_Binary_Data project in the Developer tool includes samples that you can use to read binary data from HBase tables, process it, and write it as string data to a flat file target.
The sample HBase table contains the details of people and the cars that they purchased over a period of time. The table contains the Details and Cars column families. The column names of the Cars column family are of the String data type. You can get all columns in the Cars column family as a single binary column. You can use the sample Java transformation to convert the binary data to string data. You can join the data from both column families and write it to a flat file.
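To look at the sample tables directly, you can open the HBase shell from a terminal on the virtual machine. This is optional exploration; the table name below is a placeholder, so run list first to see the actual sample table names:
hbase shell
# Inside the HBase shell: list the tables, then scan one of them.
list
scan '<sample table name>'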
To process the HBase binary data, use the wf_HBase_Binary_Data workflow.
The following figure shows the wf_HBase_Binary_Data workflow:
To run the workflow, enter the wfs startWorkflow command from the command line.
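The full command follows the same pattern as the earlier workflows. A sketch by analogy; the application name is a placeholder because this document does not spell it out:
./infacmd.sh wfs startWorkflow -dn infa_domain -sn infa_dis -un Administrator -pd Administrator -Application <application name> -wf wf_HBase_Binary_Data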
To run the mappings in the workflow individually, open a mapping and right-click it to run it.
The workflow contains the following mappings and transformations:
m_person_Cars_Write_Static
The workflow object references the m_person_Cars_Write_Static HBase write data object mapping that
writes data to the columns in the Cars and Details column family.
m_person_Cars_Write_Static1
The workflow object references the m_pers_cars_static_reader mapping that transforms the binary data in an HBase data object to columns of the String data type and writes the details to a flat file data object.
Person_Car_Static_Read
The first source for the mapping is an HBase data object named Person_Car_Static that contains the
columns in the Details column family. The HBase read data object operation is named
Person_Car_Static_Read.
pers_cars_Static_bin_read
The second source for the mapping is an HBase data object named Person_cars_Static_bin that
contains the data in the Cars column family. The HBase read data object operation is named
pers_cars_Static_bin_read.
Transformations
The Sorter transformation sorts the data in ascending order based on the row ID.
The Expression and Aggregator transformations convert the row data to columnar data.
The Joiner transformation combines the data from both the HBase input sources before you load the
data to the flat file data object.
The Filter transformation filters out any person with age less than or equal to 43.
Write_Person_Cars_FF
The target for the mapping is a flat file data object named Person_Cars_FF. The flat file data object write operation, named Write_Person_Cars_FF, writes data from the Cars and Details column families.
The Data Integration Service converts the binary column in Person_cars_Static_bin, joins the data in
Person_Car_Static, and writes the data to the flat file data object Write_Person_Cars_FF.
Troubleshooting
This section provides troubleshooting information.
Informatica Services shut down
The Informatica services might shut down when the machine on which you run the virtual machine goes into
hibernation or when you resume the virtual machine.
Run the following command to restart the services on the operating system of the virtual machine:
sh /home/infauser/BDETRIAL/.cmdInfaServiceUtil.sh start
Debug mapping failures
To debug mapping failures, check the error messages in the mapping log file.
The mapping log file appears in the following location: /home/infauser/bdetrial_repo/informatica/informatica/tomcat/bin/disTemp
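To find the most recent mapping log quickly, you can sort that directory by modification time from a terminal on the virtual machine. The log file name below is a placeholder:
cd /home/infauser/bdetrial_repo/informatica/informatica/tomcat/bin/disTemp
ls -lt | head            # newest log files appear first
tail -n 100 <mapping log file>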
Virtual machine does not start because of a 64-bit error
VMWare Player displays a message that states it cannot power on a 64-bit virtual machine. Or, VMware
Player might display the following error when you play the virtual machine: The host supports Intel VT-x,
but Intel VT-x is disabled. Intel VT-x might be disabled if it has been disabled in the
BIOS/firmware settings or the host has not been power-cycled since changing this setting.
You must enable Intel Virtualization Technology in the BIOS of the machine on which VMware Player runs. For more information, refer to the VMware Knowledge Base.
Author
Big Data Edition Team