
IBM InfoSphere CDC for DataStage with Hadoop Distributed File System (HDFS) support


Overview
This article describes the following:
- A new feature in IBM InfoSphere CDC for InfoSphere DataStage V10.2 Interim Fix 2 and later that supports specifying an HDFS directory as the output directory for flat files.
- A use case scenario in which the changed data delivered to flat files in HDFS is processed by Hive, the data warehousing tool of the Apache Hadoop platform.
The HDFS and Hive components used here are part of the IBM InfoSphere BigInsights product.

Prerequisites
The following components need to be installed:
- IBM InfoSphere CDC Management Console V10.2
- IBM InfoSphere CDC Access Server V10.2
- IBM InfoSphere CDC for DataStage with HDFS support, V10.2 Interim Fix 2 and later, which must be installed on the same machine as IBM InfoSphere BigInsights
- IBM InfoSphere BigInsights V2.0

High-Level Architecture

InfoSphere CDC for InfoSphere DataStage V10.2 Interim Fix 2 and later supports specifying HDFS paths for delivering data changes in flat files. All paths starting with hdfs:// are treated as HDFS paths, and the flat files are written to the specified HDFS directory. When an HDFS path is specified, the following configuration options must be set:
- The path to the Hadoop jar files is specified in the CLASSPATH environment variable.
- The HADOOP_CONF_DIR environment variable must be set to point to a directory containing the target Hadoop cluster configuration files.
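A quick way to verify both settings from a shell (a minimal sketch using standard shell commands):

# Confirm the Hadoop configuration directory is set
echo $HADOOP_CONF_DIR
# Confirm the Hadoop core jar is on the classpath
echo $CLASSPATH | tr ':' '\n' | grep hadoop-core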

When an HDFS path has been chosen, no .STOPPED files are generated, unlike in the case of a regular file system. Hadoop tools such as Hive treat files prefixed with an underscore as temporary files and ignore them. Temporary flat files that are created before a file is hardened are therefore prefixed with an underscore so that they are ignored by Hive.
A sample custom data formatter (SampleDataFormatForHdfs.java) that makes the data suitable for consumption by Hive is also shipped with this release. This Java file is available in Samples.jar under the <cdc_install_dir>/lib directory.

Step-by-step configuration to enable replication to HDFS:


Ensure the InfoSphere BigInsights environment is initialized.
The required environment for the Hadoop cluster is set up automatically after the installation of InfoSphere BigInsights. To confirm, check that the CLASSPATH points to the Hadoop core jar files and that the HADOOP_CONF_DIR environment variable points to the directory containing the Hadoop configuration files.
If for some reason these are not set, the Hadoop environment can be initialized manually by running the script biginsights.sh under <biginsights_install_dir>/conf.
By default, the CLASSPATH environment variable points only to the Hadoop core jar file (hadoop-core-1.0.3.jar). For writing to HDFS, the following jar files also need to be added to the CLASSPATH:
- commons-configuration-1.6.jar
- commons-logging-1.1.1.jar
- commons-lang-2.4.jar
All of these jar files are available under the <biginsights_install_dir>/IHC/lib directory.
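For example, a minimal shell sketch of this manual setup, assuming /opt/ibm/biginsights as the install directory (adjust to your environment):

# Initialize the Hadoop environment shipped with BigInsights
. /opt/ibm/biginsights/conf/biginsights.sh
# Add the jar files required for writing to HDFS to the CLASSPATH
IHC_LIB=/opt/ibm/biginsights/IHC/lib
export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-configuration-1.6.jar
export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-logging-1.1.1.jar
export CLASSPATH=$CLASSPATH:$IHC_LIB/commons-lang-2.4.jar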
Start the HDFS and Hive components of the Hadoop cluster.
IBM InfoSphere BigInsights ships with several Hadoop components such as MapReduce, HDFS, Hive, HCatalog, HBase, and Oozie. These services can be started either through the BigInsights console or through the command line:
- To open the BigInsights console in a web browser, specify the following URL: http://<your-server>:8080/data/html/index.html#redirect-welcome. Start the HDFS and Hive services by selecting them on the Cluster Status tab of the console.
- To start the services through the command line, run the script start-all.sh under the <biginsights_install_dir>/bin directory. (Note that there are no scripts to start the services selectively through the command line.)
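For example, again assuming /opt/ibm/biginsights as the install directory:

# Start all BigInsights services, including HDFS and Hive
/opt/ibm/biginsights/bin/start-all.sh
# Sanity check: list the HDFS root once the services are up
hadoop fs -ls /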

Create an HDFS directory for the flat files and set up a CDC for DataStage (V10.2 Interim Fix 2 and later) instance.
From the 'Files' tab in the BigInsights console, create the directory where you need the flat files to be created (as shown in Figure 1 and Figure 2 below).

Figure 1

Figure 2
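The directory can also be created from the command line with the standard hadoop fs commands; in this sketch the path /tmp/hdfstest is illustrative:

# Create the target directory for the CDC flat files
hadoop fs -mkdir /tmp/hdfstest
# Confirm that it exists
hadoop fs -ls /tmp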

Create a CDC for DataStage with HDFS instance by following these steps.
Create a subscription and map a simple table using the step-by-step instructions, with the option of either single record or multiple record. While mapping, give the full path of the HDFS directory, as shown in Figure 3:
hdfs://<your-server>:9000/<location of the directory>
Note that the port number must be specified, as shown above.

Figure 3

Enable the custom formatter by compiling the sample formatter Java file as per the following instructions. This sample is customized to the requirements of Hive. Some of the customizations made are:
- No quotes around column data. This enables Hive to import data matching the data types of the source database.
- Hive doesn't support the DATE data type, so it is converted to a TIMESTAMP by appending 00:00:00 to it.
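As a purely illustrative example (the column values are invented; only the quoting and DATE handling reflect the customizations above), the same change record would be written as follows:

# Default formatter: every value quoted, DATE kept as-is
"2013-06-01 10:20:30","1234","I","dbuser","1","abc","2013-06-01"
# Sample custom formatter: no quotes, DATE widened to a timestamp
2013-06-01 10:20:30,1234,I,dbuser,1,abc,2013-06-01 00:00:00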

Setting up Hive tables with InfoSphere CDC for DataStage using the default data formatter:

Hive tables can be created from the Hive command prompt:
Type 'hive' from the <biginsights_install_dir>/hive/bin directory to get the Hive prompt.
From the Hive prompt, create the Hive tables using the definitions described in the next section. An external table is created that points to the HDFS directory where CDC for DataStage writes the flat files.
With the default data formatter, which puts quotes around column values, all data imported by Hive has to use the string data type.

Sample table definitions for Hive with InfoSphere CDC for DataStage using the default data formatter:

- Only columns with string data types need to be created in Hive.
- The first four columns of the Hive table hold the default audit trail: the <timestamp>, the <transaction id>, the <operation type (I, U, D)>, and the <user> who made the change.
- If the single record option was chosen while mapping, the Hive table needs twice the number of columns as the source, in addition to the four audit columns above. For the two-column source table below, that is 4 + 2x2 = 8 columns.
- If the multiple record option was chosen, the Hive table needs the same number of columns as the source, in addition to the four audit columns. For the two-column source table below, that is 4 + 2 = 6 columns.

Definition of the Source table:


create table abc.hdfs(
col1 number(9),
col2 varchar(10));
Definition of Hive table if Single record option is chosen:
hive> create external table testhdfs(
col1 string,
col2 string,
col3 string,
col4 string,
col5 string,
col6 string,
col7 string,
col8 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
hdfs_directory must be specified in the standard UNIX format, e.g. /tmp/hdfstest.
Definition of Hive table if multiple record option is chosen:
hive> create external table testhdfs(
col1 string,
col2 string,
col3 string,
col4 string,
col5 string,
col6 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
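Once replication is running, the audit columns can be queried directly. A small sketch, assuming the audit-column order described above (col3 holds the operation type):

hive> select col3, count(*) from testhdfs group by col3;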
Setting up Hive tables with InfoSphere CDC for DataStage using the custom data formatter:
With the custom formatter, data can be imported into Hive with data types matching those of the source database; in the definitions below, for example, number(3) maps to tinyint, number(9) to bigint, double precision to double, and varchar(10) to string.
Definition of source table:
create table abc.hdfs(
col1 number(3),
col2 smallint,
col3 int,
col4 number(9),
col5 float,
col6 double precision,
col7 varchar(10));
Definition of Hive table for single record option:
hive> create external table testhdfs(
col1 timestamp,
col2 string,
col3 string,
col4 string,
col5 tinyint,
col6 smallint,
col7 int,
col8 bigint,
col9 float,
col10 double,
col11 string,
col12 tinyint,
col13 smallint,
col14 int,
col15 bigint,
col16 float,
col17 double,
col18 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
Definition of Hive table for multiple record option:
hive> create external table testhdfs(
col1 timestamp,
col2 string,
col3 string,
col4 string,
col5 tinyint,
col6 smallint,
col7 int,
col8 bigint,
col9 float,
col10 double,
col11 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
Starting replication and checking the flat files generated in HDFS
The setup is now complete. Now start mirroring.

Using the Files tab of the BigInsights console, check that the flat file is written to the specified HDFS directory.

Then check that the Hive table has been populated with the data.
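The same checks can be made from a shell; this sketch uses the illustrative directory and table names from the earlier sections:

# List the flat files written by CDC (hardened files have no leading underscore)
hadoop fs -ls /tmp/hdfstest
# Spot-check the replicated rows through the external Hive table
hive -e "select * from testhdfs limit 10;"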

Supported data types in Hive

Hive doesn't support the DATE data type. As mentioned in the previous sections, with the custom data formatter the DATE value can be converted to a TIMESTAMP by appending 00:00:00 to it before importing it into a Hive table. The TIMESTAMP data type in Hive does not accept null values, which is a limitation. However, if a null value is inserted on the source for a timestamp/date column, CDC replicates it as an empty value to the target.

Troubleshooting problems

- If the subscription fails with the following error, the required Hadoop jar files are not specified in the CLASSPATH or the HDFS directory doesn't exist.

- If the subscription fails with the following error, the Hadoop environment has not been initialized correctly. Follow the steps for manually initializing the Hadoop environment described above.
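Both causes can be checked quickly from a shell; the path below is the illustrative one used earlier:

# Is the target directory present in HDFS?
hadoop fs -test -d /tmp/hdfstest && echo "directory exists" || echo "directory missing"
# Are the required jar files on the CLASSPATH?
echo $CLASSPATH | tr ':' '\n' | grep -E 'commons-(configuration|logging|lang)'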

Conclusion
This article has described in detail the new feature, available in IBM InfoSphere CDC for DataStage V10.2 Interim Fix 2 and later, that supports change data delivery to HDFS and its consumption by Hive.
