Abstract
This white paper explains how Pentaho Data Integration (Kettle)
can be configured and used with the Greenplum Database by using
the Greenplum Loader (GPLOAD). This improves the connectivity and
interoperability of Pentaho Data Integration with the Greenplum
Database.
February 2012
Table of Contents
Executive summary
Audience
Executive summary
Greenplum Database is a popular analytical database that works with open
source data integration products such as Pentaho Data Integration (PDI), also
known as Kettle. Pentaho Kettle is part of the Pentaho Business Intelligence
suite. Greenplum Database is capable of managing, storing, and analyzing large
amounts of data.
One of Pentaho's latest enhancements for expanded OLAP support is a native
bulk loader integration with EMC Greenplum that improves data loading
performance. Pentaho offers native adapter support for the Greenplum GPLoad
capability (bulk loader), which enables joint customers to leverage data
integration capabilities to quickly capture, transform, and load massive amounts
of data into Greenplum Databases.
Currently, Pentaho Data Integration connects to Greenplum through JDBC (Java
Database Connectivity) drivers. Greenplum Database can be used on both the source
and target sides of Pentaho ETL transformations.
Audience
This white paper is intended for EMC field-facing employees such as sales, technical
consultants, and support staff, as well as customers who will be using the Pentaho Data
Integration tool for their ETL work. It is neither an installation guide nor
introductory material on Pentaho. It documents Pentaho's connectivity and
operation capabilities with the Greenplum Loader, and shows how Pentaho
PDI can be used in conjunction with the Greenplum Database to retrieve, transform, and
present data to users. Although the reader is not expected to have extensive Pentaho
knowledge, a basic understanding of Pentaho data integration concepts and ETL tools
will help the reader get more from this document.
The Enterprise Edition (EE) Data Integration Server includes the Data Integration
Engine, security integration with LDAP/Active Directory, a monitor/scheduler, and
content management.
Pentaho is capable of loading big data sets, on the order of terabytes or petabytes,
into the Greenplum Database, taking full advantage of the massively parallel
processing environment provided by the Greenplum product family.
Figure 1
Greenplum's Scatter/Gather Streaming (SGS) technology ensures parallelism by scattering
data from source systems across hundreds or thousands of parallel streams that
simultaneously flow to all nodes of the Greenplum Database. Performance scales with the
number of Greenplum Database nodes, and the technology supports both large batch and
continuous near-real-time loading patterns with negligible impact on concurrent
database operations.
Figure 2 shows how the final gathering and storage of data to disk takes place on all nodes
simultaneously, with data automatically partitioned across nodes and optionally
compressed. This technology is exposed via a flexible and programmable external table
(explained below) interface and a traditional command-line loading interface.
Figure 2
External Tables
External tables enable users to access data in external sources as if it were in a table in the
database. In Greenplum database, there are two types of external data sources, external
tables and Web tables. They have different access methods, external tables contain static
data that can be scanned multiple times. The data does not change during queries. Web
tables provide access to dynamic data sources as if those sources were regular database
tables. Web tables cannot be scanned multiple times. The data can change during the
course of a query.
In the above example, gpfdist is set up to run on the Greenplum DIA server, anticipating
data loading from flat files stored in the directory /etl-data. Port 8887 is opened and
listens for data requests, and a log file called etl-log is created in /home/gpadmin.
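The setup just described amounts to a single gpfdist invocation; the directory, port, and log file below simply restate the values from the paragraph above:

```shell
# Start gpfdist on the DIA server: serve files from /etl-data,
# listen on port 8887, and write the log to /home/gpadmin/etl-log.
gpfdist -d /etl-data -p 8887 -l /home/gpadmin/etl-log &
```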
VERSION: 1.0.0.1
DATABASE: db_name
USER: db_username
HOST: master_hostname
PORT: master_port
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - hostname_or_ip
         PORT: http_port
       | PORT_RANGE: [start_port_range, end_port_range]
         FILE:
           - /path/to/input_file
    - COLUMNS:
        - field_name: data_type
    - FORMAT: text | csv
    - DELIMITER: 'delimiter_character'
    - ESCAPE: 'escape_character' | 'OFF'
    - NULL_AS: 'null_string'
    - FORCE_NOT_NULL: true | false
    - QUOTE: 'csv_quote_character'
    - HEADER: true | false
    - ENCODING: database_encoding
    - ERROR_LIMIT: integer
    - ERROR_TABLE: schema.table_name
   OUTPUT:
    - TABLE: schema.table_name
    - MODE: insert | update | merge
    - MATCH_COLUMNS:
        - target_column_name
    - UPDATE_COLUMNS:
        - target_column_name
    - UPDATE_CONDITION: 'boolean_condition'
    - MAPPING:
   SQL:
    - BEFORE: "sql_command"
    - AFTER: "sql_command"
The above example shows the syntax of a GPLOAD control file, written in YAML. The file
is divided into sections for easy reference. For example, users can run a load job as
defined in my_load.yml with:
gpload -f my_load.yml
It is recommended to confirm that gpload runs successfully before building
transformations, to reduce the chance of future errors. As a first step, run gpload at the
system (command) prompt to verify. By copying a small representative sample of a source
file together with a control (YAML) file, you can run gpload.py against that sample load
control file.
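One way to carry out this smoke test is sketched below. The directory, sample row, and table names are illustrative, and the final gpload invocation is left commented out so the files can be prepared before a live cluster is available:

```shell
# Sketch: prepare a tiny sample data file plus a matching control file.
mkdir -p /tmp/gpload-test
printf 'stapler|9.99|office|desk supplies|2012-02-01\n' \
  > /tmp/gpload-test/sample.dat

cat > /tmp/gpload-test/sample_load.yml <<'EOF'
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         FILE:
           - /tmp/gpload-test/sample.dat
    - COLUMNS:
        - name: text
        - amount: float4
        - category: text
        - desc: text
        - date: date
    - FORMAT: text
    - DELIMITER: '|'
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: insert
EOF

# With a live cluster available, the test run would be:
# gpload -f /tmp/gpload-test/sample_load.yml
```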
If the gpload.py script does not execute successfully, confirm the following settings:
Check that the correct version is installed by checking the gpload readme.
Check that the PATH, GPHOME_LOADERS, and PYTHONPATH environment variables are
correctly set.
Check that the path environment variables point to, or include, the correct paths.
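At the shell prompt, these checks might look like the following (the variable names are those used by the Greenplum load tools; the output will vary by installation):

```shell
# Confirm which loader version is actually on the PATH.
gpload --version

# Confirm the load-tools environment variables are set.
echo "GPHOME_LOADERS=$GPHOME_LOADERS"
echo "PYTHONPATH=$PYTHONPATH"
echo "PATH=$PATH"
```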
Example of the load control file - my_load.yml:
---
VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
   INPUT:
    - SOURCE:
         LOCAL_HOSTNAME:
           - etl1-1
           - etl1-2
           - etl1-3
           - etl1-4
         PORT: 8081
         FILE:
           - /var/load/data/*
    - COLUMNS:
        - name: text
        - amount: float4
        - category: text
        - desc: text
        - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - ERROR_TABLE: payables.err_expenses
   OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
   SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
Note: a YAML control file is not free-format; field names and most of the content must
follow a specific layout and indentation.
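For the example above to run, the target and audit tables must already exist. One plausible set of definitions is sketched below; the column types for the audit table are assumptions, since the document does not define it:

```sql
-- Target table matching the COLUMNS section of my_load.yml.
-- (The error table payables.err_expenses is created by gpload itself.)
CREATE TABLE payables.expenses
    (name text, amount float4, category text, "desc" text, "date" date);

-- Hypothetical audit table matching the BEFORE/AFTER SQL hooks.
CREATE TABLE audit (event text, logged_at timestamp);
```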
When using Pentaho, you do not need to write your own YAML file; there are pre-built
steps inside the Bulk loading folder of the Design window in Spoon. The customized
Greenplum step is called Greenplum Load, and it generates the YAML file
when all the necessary details are provided.
The Greenplum Load step wraps the Greenplum GPLoad data loading utility just
discussed. The GPLoad utility performs massively parallel data loading
using Greenplum's external table parallel loading feature. As shown in the above
example, four ETL servers feed data into Greenplum through GPLOAD.
GPLoad can be deployed on either a single Pentaho ETL server or multiple servers. The
following diagrams show typical deployment scenarios for performing parallel loading
into the Greenplum Database:
1. Double-click the Text File Input step and choose the correct delimited input file.
2. Click on the Contents tab to define how the CSV file is parsed:
3. Go to the Fields tab and click Get Fields to define all the fields:
The details of the Greenplum Load step need to be defined as follows:
First, choose the correct connection and target table.
Then click the Get fields button to generate all the target table fields:
After that, click the Edit Mapping button to define all the mappings from sources to targets:
Next, go to the GP Configuration tab to define the correct GPLOAD program, control file,
and data file locations:
When everything is defined and saved, you can execute the transformation/job by clicking
the green arrow in the top left corner.
Once execution finishes, check the Logging and Step Metrics sections to verify that the
transformation executed successfully. You can also verify that the data was loaded
through gpload into the target Greenplum table, lineitem.
The above transformation is just a sample; users can add other components to it or
incorporate it into a fully developed job for transforming the data.
Conclusion
This white paper discussed how to use the Greenplum Loader step (GPLOAD) to
enhance the loading capability and performance of Pentaho Data Integration. It
covered the basic interoperability between Pentaho PDI and the Greenplum Database
for data integration and business intelligence projects, using Greenplum's
Scatter/Gather Streaming technology embedded in the Greenplum Loader.