Prerequisites
The following components need to be installed:
IBM InfoSphere CDC Management Console V10.2
IBM InfoSphere CDC Access Server V10.2
IBM InfoSphere CDC for DataStage with HDFS support, V10.2 Interim Fix 2 or later,
which must be installed on the same machine as IBM InfoSphere BigInsights
IBM InfoSphere BigInsights V2.0
High-Level Architecture
InfoSphere CDC for InfoSphere DataStage V10.2 Interim Fix 2 and later supports
specifying HDFS paths for delivering data changes in flat files. Any path starting with
hdfs:// is treated as an HDFS path, and the flat files are written to the specified
HDFS directory. When an HDFS path is specified, the following configuration
options must be set:
The path to the Hadoop jar files must be specified in the CLASSPATH environment
variable.
The HADOOP_CONF_DIR environment variable must point to a directory containing
the target Hadoop cluster configuration files.
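These two settings can be sanity-checked before starting the instance. The following is a minimal sketch, not part of the product: hadoop_env_ready is an illustrative helper, and the paths shown are assumptions to adjust for your installation.

```python
# Illustrative helper (not part of CDC): checks the two settings the
# article requires before an hdfs:// target path will work.
def hadoop_env_ready(env):
    conf_dir = env.get("HADOOP_CONF_DIR", "")
    classpath = env.get("CLASSPATH", "")
    # HADOOP_CONF_DIR must be set, and the CLASSPATH must include at
    # least one Hadoop jar file.
    return bool(conf_dir) and ".jar" in classpath

# Example values; the paths are assumptions, adjust to your install.
env = {
    "HADOOP_CONF_DIR": "/opt/ibm/biginsights/hadoop-conf",
    "CLASSPATH": "/opt/ibm/biginsights/IHC/hadoop-core.jar",
}
print(hadoop_env_ready(env))  # True
```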
When an HDFS path is chosen, no .STOPPED files are generated, unlike the case of a
regular file system. Hadoop tools such as Hive treat files prefixed with an
underscore as temporary files and ignore them. Temporary flat files created before
a file is hardened are prefixed with an underscore so that Hive ignores them.
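The underscore convention can be illustrated with a short sketch. visible_to_hive is a hypothetical helper used only for illustration; the actual skipping behavior lives inside Hive itself.

```python
def visible_to_hive(filenames):
    # Hive (and other Hadoop tools) skip files whose names begin with
    # an underscore, treating them as temporary.
    return [name for name in filenames if not name.startswith("_")]

# A directory while CDC is still writing: one unhardened file plus two
# hardened ones. Only the hardened files are picked up by Hive.
listing = ["_part0003", "part0001", "part0002"]
print(visible_to_hive(listing))  # ['part0001', 'part0002']
```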
A sample custom data formatter (SampleDataFormatForHdfs.java) that makes the data
suitable for consumption by Hive is also shipped with this release. This Java file is
available in Samples.jar in the <cdc_install_dir>/lib folder.
To start the services from the command line, run the start-all.sh script in the
<biginsights_install_dir>/bin directory. (Note that there are no scripts to
selectively start individual services from the command line.)
Create an HDFS directory for the flat files and set up a CDC for DataStage (V10.2
Interim Fix 2 or later) instance.
From the 'Files' tab in the BigInsights console, create the directory where you need
the flat files to be written (as shown in Figure 1 and Figure 2 below).
Figure 1
Figure 2
Create a CDC for DataStage with HDFS instance by following these steps.
Create a subscription and map a simple table using these step-by-step instructions,
choosing either the single record or multiple record option. While mapping, specify
the full path of the HDFS directory, as shown in Figure 3.
hdfs://<your-server>:9000/<location of the directory>
Note that the port number must be specified, as shown above.
Figure 3
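The two rules above (an hdfs:// prefix and an explicit port) can be expressed as a small validation sketch. valid_cdc_hdfs_path is a hypothetical helper, not part of the product:

```python
from urllib.parse import urlparse

def valid_cdc_hdfs_path(path):
    # CDC treats any path beginning with hdfs:// as an HDFS target...
    if not path.startswith("hdfs://"):
        return False
    parsed = urlparse(path)
    # ...and the port (9000 in the example above) must be explicit,
    # along with a directory component.
    return parsed.port is not None and bool(parsed.path)

print(valid_cdc_hdfs_path("hdfs://myserver:9000/cdc/flatfiles"))  # True
print(valid_cdc_hdfs_path("hdfs://myserver/cdc/flatfiles"))       # False (no port)
```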
Enable the custom formatter by compiling the sample formatter Java file, following
the instructions below. This sample is customized for the requirements of Hive.
Some of the customizations are:
No quotes around column data. This enables Hive to import data matching the
data types of the source database.
Hive does not support the DATE data type, so DATE values are converted to
TIMESTAMP by appending 00:00:00.
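The two customizations can be mimicked in a short sketch. This is only an illustration of the rules, not the shipped SampleDataFormatForHdfs.java (which is written in Java); format_value is a hypothetical helper.

```python
from datetime import date

def format_value(value):
    # Rule 2: Hive has no DATE type, so a DATE becomes a timestamp by
    # appending 00:00:00.
    if isinstance(value, date):
        return value.isoformat() + " 00:00:00"
    # Rule 1: no quotes around column data, so Hive can parse numeric
    # columns with their native types.
    return str(value)

row = [42, 3.14, date(2013, 5, 1), "abc"]
print(",".join(format_value(v) for v in row))
# 42,3.14,2013-05-01 00:00:00,abc
```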
Setting up Hive tables with InfoSphere CDC for DataStage using the default data
formatter:
Sample table definitions for Hive with InfoSphere CDC for DataStage using the default
data formatter:
Only columns with string data types need to be created in Hive.
In the Hive table, the first 4 columns hold the default audit trail. They are
<timestamp>, <transaction id>, <operation type (I, U, D)>, and the <user> who made
the change.
If the single record option is chosen while mapping, the Hive table must be created
with twice the number of columns as in the source, in addition to the first 4 audit
columns mentioned above.
If the multiple record option is chosen while mapping, the Hive table must have the
same number of columns as the source, in addition to the first 4 audit columns
mentioned in the previous point.
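The column-count rules above can be summarized in a small sketch; hive_column_count is an illustrative helper, not part of the product.

```python
def hive_column_count(source_columns, single_record):
    # 4 leading audit columns: timestamp, transaction id,
    # operation type (I/U/D), and user.
    audit_columns = 4
    if single_record:
        # Single record option: each source column appears twice.
        return audit_columns + 2 * source_columns
    # Multiple record option: same number of columns as the source.
    return audit_columns + source_columns

# For the 7-column source table defined in this article:
print(hive_column_count(7, single_record=True))   # 18
print(hive_column_count(7, single_record=False))  # 11
```

These counts match the sample Hive definitions below: 18 columns for the single record option and 11 for the multiple record option.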
Setting up Hive tables with InfoSphere CDC for DataStage using the custom data
formatter:
With the custom formatter, data can be imported into Hive with data types matching
those of the source database.
Definition of source table:
create table abc.hdfs(
col1 number(3),
col2 smallint,
col3 int,
col4 number(9),
col5 float,
col6 double precision,
col7 varchar(10));
Definition of Hive table for single record option:
hive> create external table testhdfs(
col1 timestamp,
col2 string,
col3 string,
col4 string,
col5 tinyint,
col6 smallint,
col7 int,
col8 bigint,
col9 float,
col10 double,
col11 string,
col12 tinyint,
col13 smallint,
col14 int,
col15 bigint,
col16 float,
col17 double,
col18 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
Definition of Hive table for multiple record option:
hive> create external table testhdfs(
col1 timestamp,
col2 string,
col3 string,
col4 string,
col5 tinyint,
col6 smallint,
col7 int,
col8 bigint,
col9 float,
col10 double,
col11 string)
row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile
location '<hdfs_directory>';
Starting replication and checking the flat files generated in HDFS
The setup is now complete. Start mirroring.
Using the Files tab of the BigInsights console, check whether the flat file has been
written to the specified HDFS directory.
Then check whether the Hive table has been populated with the data.
If the subscription fails with the following error, the required Hadoop jar files are
not specified in the CLASSPATH, or the HDFS directory does not exist.
If the subscription fails with the following error, the Hadoop environment has not
been initialized correctly. Follow the steps for manually initializing the Hadoop
environment described above.
Conclusion
This article has described in detail the new feature available in IBM InfoSphere CDC for
DataStage V10.2 Interim Fix 2 and later for supporting change data delivery to HDFS
and its consumption by Hive.