Data Warehousing with Greenplum
Open Source Massively Parallel Data Analytics
Second Edition
Marshall Presser
REPORT
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Data Warehousing
with Greenplum, the cover image, and related trade dress are trademarks of O’Reilly
Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the
information and instructions contained in this work are accurate, the publisher and
the author disclaim all responsibility for errors or omissions, including without limi‐
tation responsibility for damages resulting from the use of or reliance on this work.
Use of the information and instructions contained in this work is at your own risk. If
any code samples or other technology this work contains or describes is subject to
open source licenses or the intellectual property rights of others, it is your responsi‐
bility to ensure that your use thereof complies with such licenses and/or rights.
This work is part of a collaboration between O’Reilly and Pivotal. See our statement
of editorial independence.
978-1-492-05810-6
[LSI]
Table of Contents
Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
3. Deploying Greenplum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Custom(er)-Built Clusters 21
Greenplum Building Blocks 23
Public Cloud 24
Private Cloud 25
Greenplum for Kubernetes 26
Choosing a Greenplum Deployment 27
Additional Resources 27
4. Organizing Data in Greenplum. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Distributing Data 30
Polymorphic Storage 33
Partitioning Data 33
Orientation 37
Compression 37
Append-Optimized Tables 39
External Tables 39
Indexing 40
Additional Resources 41
5. Loading Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
INSERT Statements 43
\COPY Command 43
The gpfdist Process 44
The gpload Tool 46
Additional Resources 47
External Web Tables 92
Additional Resources 93
Foreword to the Second Edition
Nearly all the data in this problem space was structured, and our
user base and business intelligence tool suite used the universal lan‐
guage of SQL. Upon analysis, we realized that we needed a new data
store to resolve these issues.
A team of experienced technology professionals spanning multiple
organizational levels evaluated the pain points of our current data
store product suite in order to select the next-generation platform.
The team’s charter was to identify the contenders, define a set of
evaluation criteria, and perform an impartial evaluation. Some of
the key requirements for this new data store were that the product
could easily scale, provide dramatic query response time improve‐
ments, be ACID and ANSI compliant, leverage deep data compres‐
sion, and support a software-only implementation model. We also
needed a vendor that had real-world enterprise experience, under‐
stood the problem space, and could meet our current and future
needs. We conducted a paper exercise on 12 products followed by
two comprehensive proofs-of-concept (PoCs) with our key applica‐
tion stakeholders. We tested each product’s utility suite (load,
unload, backup, restore), its scalability capability along with linear
query performance, and the product’s ability to recover seamlessly
from server crashes (high availability) without causing an applica‐
tion outage. This extensive level of testing allowed us to gain an inti‐
mate knowledge of the products, how to manage them, and even
some insight into how their service organizations dealt with soft‐
ware defects. We chose Greenplum due to its superior query perfor‐
mance using a columnar architecture, ease of migration and server
management, parallel in-database analytics, the product’s vision and
roadmap, and strong management commitment and financial back‐
ing.
Supporting our Greenplum decision was Pivotal’s commitment to
our success. Our users had strict timelines for their migration to the
Greenplum platform. During the POC and our initial stress tests, we
discovered some areas that required improvement. Our deployment
schedule was aggressive, and software fixes and updates were
needed at a faster cadence than Greenplum’s previous software-
release cycle. Scott Yara, one of Greenplum’s founders, was actively
engaged with our account, and he responded to our needs by assign‐
ing Ivan Novick, Greenplum’s current product manager, to work
with us and adapt their processes to meet our need for faster soft‐
ware defect repair and enhancement delivery. This demonstrated
Pivotal’s strong customer focus and commitment to Morgan Stanley.
To improve the working relationship even further and align our
engineering teams, Pivotal established a Pivotal Tracker (issue
tracker, similar to Jira) account, which shortened the feedback loop
and improved Morgan Stanley’s communication with the Pivotal
engineering teams. We had direct access to key engineers and visi‐
bility into their sprints. This close engagement allowed us to do
more with Greenplum at a faster pace.
Our initial Greenplum projects were highly successful and our plant
doubled annually. The partnership with Pivotal evolved and Pivotal
agreed to support our introduction of Postgres into our environ‐
ment, even though Postgres was not a Pivotal offering at the time.
As we became customer zero on Pivotal Postgres, we aligned our
online transaction processing (OLTP) and big data analytic offerings
on a Postgres foundation. Eventually, Pivotal would go all in with
Postgres by open sourcing Greenplum and offering Pivotal Postgres
as a generally available product, making Greenplum the first open source massively parallel processing (MPP) database built on Postgres.
— Howard Goldberg
Executive Director
Global Head of Database Engineering
Morgan Stanley
Foreword to the First Edition
In the mid-1980s, the phrase “data warehouse” was not in use. The
concept of collecting data from disparate sources, finding a histori‐
cal record, and then integrating it all into one repository was barely
technically possible. The biggest relational databases in the world
did not exceed 50 GB in size. The microprocessor revolution was
just getting underway, and two companies stood out: Tandem,
which lashed together microprocessors and distributed online trans‐
action processing (OLTP) across the cluster; and Teradata, which
clustered microprocessors and distributed data to solve the big data
problem. Teradata was named from the concept of a terabyte of data
—1,000 GB—an unimaginable amount of data at the time.
Until the early 2000s Teradata owned the big data space, offering its
software on a cluster of proprietary servers that scaled beyond its
original 1 TB target. The database market seemed set and stagnant,
with Teradata at the high end; Oracle and Microsoft’s SQL Server
product in the OLTP space; and others working to hold on to their
diminishing share.
But in 1999, a new competitor, soon to be renamed Netezza, entered
the market with a new proprietary hardware design and a new
indexing technology, and began to take market share from Teradata.
By 2005, other competitors, encouraged by Netezza’s success,
entered the market. Two of these entrants are noteworthy. In 2003,
Greenplum entered the market with a product based on Post‐
greSQL; it utilized the larger memory in modern servers to good
effect with a data flow architecture, and reduced costs by deploying
on commodity hardware. In 2005, Vertica was founded based on a
major reengineering of the columnar architecture first implemented
by Sybase. The database world would never again be stagnant.
This book is about Greenplum, and there are several important
characteristics of this technology that are worth pointing out.
The concept of flowing data from one step in the query execution
plan to another without writing it to disk was not invented at Green‐
plum, but it implemented this concept effectively. This resulted in a
significant performance advantage.
Just as important, Greenplum elected to deploy on regular, nonpro‐
prietary hardware. This provided several advantages. First, Green‐
plum did not need to spend R&D dollars engineering systems. Next,
customers could buy hardware from their favorite providers using
any volume purchase agreements that they already might have had
in place. In addition, Greenplum could take advantage of the fact
that the hardware vendors tended to leapfrog one another in price
and performance every four to six months. Greenplum was achiev‐
ing a 5–15% price/performance boost several times a year—for free.
Finally, the hardware vendors became a sales channel. Big players
like IBM, Dell, and HP would push Greenplum over other players if
they could make the hardware sale.
Building Greenplum on top of PostgreSQL was also noteworthy. Not
only did this allow Greenplum to offer a mature product much
sooner, it could use system administration, backup and restore, and
other PostgreSQL assets without incurring the cost of building them
from scratch. The architecture of PostgreSQL, which was designed
for extensibility by a community, provided a foundation from which
Greenplum could continuously grow core functionality.
Vertica was proving that a full implementation of a columnar archi‐
tecture offered a distinct advantage for complex queries against big
data, so Greenplum quickly added a sophisticated columnar capabil‐
ity to its product. Other vendors were much slower to react and
then could only offer parts of the columnar architecture in response.
The ability to extend the core paid off quickly, and Greenplum’s
implementation of columnar still provides a distinct advantage in
price and performance.
Further, Greenplum saw an opportunity to make a very significant
advance in the way big data systems optimize queries, and thus the
GPORCA optimizer was developed and deployed.
— Rob Klopp
Ex-CIO, US Social Security Administration
Author of The Database Fog Blog
The need to do federated queries and integrate data from a wide variety of sources has led to new tools for using and ingesting data into Greenplum. This is discussed in Chapter 8.
In addition, there is material on monitoring and managing Green‐
plum, deployment options, and some other topics as well.
Acknowledgments
I owe a huge debt to my colleagues at Pivotal who helped explicitly
with this work and from whom I’ve learned so much in my years at
Pivotal and Greenplum. I cannot name them all, but you know who
you are.
Special callouts to the section contributors (in alphabetical order by
last name):
• Scott Kahler for Greenplum Management Tools
• John Knapp for Greenplum-GemFire Connector
• Jing Li for Greenplum Command Center
• Frank McQuillan for Data Science on Greenplum
• Venkatesh Raghavan for GPORCA, Optimizing Query
Response
• Craig Sylvester and Bharath Sitaraman for GP Text
Other commentators:
• Stanley Sung
• Amil Khanzada
• Jianxia Chen
• Kelly Indrieri
• Omer Arap
CHAPTER 1
Introducing the Greenplum Database
There are many databases available. Why did the founders of Green‐
plum feel the need to create another? A brief history of the problem
they solved, the company they built, and the product architecture
will answer these questions.
Responses to the Challenge
One alternative was NoSQL. Advocates of this position contended
that SQL itself was not scalable and that performing analytics on
large datasets required a new computing paradigm. Although the
NoSQL advocates had successes in many use cases, they encoun‐
tered some difficulties. There are many varieties of NoSQL databases, often with incompatible underlying models. Existing tools had years of experience in speaking to relational systems. The community that understood NoSQL was much smaller than the SQL community, and analytics in this environment were still immature. The NoSQL move‐
ment morphed into a Not Only SQL movement, in which both para‐
digms were used when appropriate.
Another alternative was Hadoop. Originally a project to index the
World Wide Web, Hadoop soon became a more general data analyt‐
ics platform. MapReduce was its original programming model; this
required developers to be skilled in Java and have a fairly good
understanding of the underlying architecture to write performant
code. Eventually, higher-level constructs emerged that allowed pro‐
grammers to write code in Pig or even let analysts use HiveQL, SQL
on top of Hadoop. However, SQL was never as complete or per‐
formant as a true relational system.
In recent years, Spark has emerged as an in-memory analytics plat‐
form. Its use is rapidly growing as the dramatic drop in price of
memory modules makes it feasible to build large memory servers
and clusters. Spark is particularly useful in iterative algorithms and
large in-memory calculations, and its ecosystem is growing. Spark is
still not as mature as older technologies, such as relational systems,
and it has no native storage model.
Yet another response was the emergence of clustered relational sys‐
tems, often called massively parallel processing (MPP) systems. The
first entrant into this world was Teradata, in the mid- to late 1980s.
In these systems, the relational data, traditionally housed in a single-
system image, is dispersed into many systems. This model owes
much to the scientific computing world, which discovered MPP
before the relational world. The challenge faced by the MPP rela‐
tional world was to make the parallel nature transparent to the user
community so coding methods did not require change or sophisti‐
cated knowledge of the underlying cluster.
• Fraud analytics
• Financial risk management
• Cybersecurity
• Customer churn reduction
• Predictive maintenance analytics
• Manufacturing optimization
• Smart cars and internet of things (IoT) analytics
• Insurance claims reporting and pricing analytics
• Health-care claim reporting and treatment evaluations
• Student performance prediction and dropout prevention
• Advertising effectiveness
• Traditional data warehouses and business intelligence (BI)
• Greenplum
• Pivotal Postgres, a supported version of the open source Post‐
greSQL database
Private Interconnect
The master must communicate with the segments and the segments
must communicate with one another. They do this on a private User
Datagram Protocol (UDP) network that is distinct from the public
network on which users connect to the master. This separation is critical: were the segments to communicate on the public network, user downloads and other heavy loads would greatly affect Greenplum performance. Greenplum requires a 10 Gb network and strongly recommends redundant 10 Gb switches.
Other than the master, the standby, and the segment servers, some
other servers might be plumbed into the private interconnect net‐
work. Greenplum will use these to do fast parallel data loading and
unloading. We discuss this topic in Chapter 5.
Mirror Segments
In addition to the redundancy provided by the standby master,
Greenplum strongly urges the creation of mirror segments. These are
segments that maintain a copy of the data on a primary segment, the
one that actually does the work. Should either a primary segment or
the host housing a primary segment fail, the mirrored segment continues the work without an interruption in service.
Additional Resources
This book is but the first step in learning about Greenplum. There
are many other information sources available to you.
Greenplum Documentation
The Greenplum documentation will answer many questions that
both new and experienced users have. You can download this up-to-
date information, free of charge, from the Greenplum documenta‐
tion site.
There is also a wealth of information about Greenplum available on
the Pivotal website.
Latest Documentation
When you see a Greenplum documentation page URL
that includes the word “latest,” this refers to the latest
Greenplum release. That web page can also get you to
documentation for all supported releases.
ter issues that pertain to appliances, cloud-based Greenplum, or
customer-built commodity clusters.
Other Sources
Internet searches on Greenplum return many hits. It’s wise to
remember that not everything on the internet is accurate. In partic‐
ular, as Greenplum evolves over time, comments made in years past
might no longer reflect the current state of things.
CHAPTER 2
What’s New in Greenplum?
R and Python data science modules
These are collections of open source packages that data scien‐
tists find useful. They can be used in conjunction with the pro‐
cedural languages for writing sophisticated analytic routines.
New datatypes
JSON, UUID, and improved XML support.
Enhanced query optimization
The GPORCA query optimizer has increased support for more
complex queries.
PXF extension format for integrating external data
PXF is a framework for accessing data external to Greenplum.
This is discussed in Chapter 8.
analyzedb enhancement
Critical for good query performance is the understanding of the
size and data contents of tables. This utility was enhanced to
cover more use cases and provide increased performance.
PL/Container for untrusted languages
Python and R are untrusted languages because they contain OS
callouts. As a result, only the database superuser can create
functions in these languages. Greenplum 5 added the ability to
run such functions in a container isolated from the OS proper
so superuser powers are not required.
Improved backup and restore and incremental backup
Enhancements to the tools used to back up and restore Green‐
plum. These will be discussed in Chapter 7.
Resource groups to improve concurrency
The importance of being able to control queries in an analytic environment cannot be overstated. The new resource group mechanism is discussed in Chapter 7.
Greenplum-Kafka integration
Kafka has emerged as a leading technology for data dissemina‐
tion and integration for real-time and near-real-time data
streaming. Its integration with Greenplum is discussed in Chap‐
ter 8.
• New features
• Changed features
• Beta features
• Deprecated features
• Resolved issues
• Known issues and limitations
CHAPTER 3
Deploying Greenplum
Custom(er)-Built Clusters
For those customers who wish to deploy Greenplum on hardware in
their own datacenter, Greenplum has always provided a cluster-
aware installer but assumed that the customer had correctly built the
cluster. This strategy provided a certain amount of flexibility. For
example, customers could configure exactly the number of segment
hosts they required and could add hosts when needed. They had no
restrictions on which brand of network gear to use, how much
memory per node, or the number or size of the disks. On the other
hand, building a cluster is considerably more difficult than configur‐
ing a single server. To this end, Greenplum has a number of facilities
that assist customers in building clusters.
Today, there is much greater experience and understanding in build‐
ing MPP clusters, but a decade ago, this was much less true and
some early deployments were suboptimal due to poor cluster design
and its effect on performance. The gpcheckperf utility checks disk
I/O bandwidth, network performance, and memory bandwidth.
Assuring a healthy cluster before Greenplum deployment goes a
long way to having a performant cluster.
Customer-built clusters should be constructed with the following
principles in mind:
Public Cloud
Many organizations have decided to move much of their IT environ‐
ment from their own datacenters to the cloud. Public cloud deploy‐
ment offers the quickest time to deployment. You can configure a
functioning Greenplum cluster in less than an hour.
Pivotal now has Greenplum available, tested, and certified on AWS,
Microsoft Azure, and GCP. For all three, Pivotal has built and tested
the templates that define the cloud resources such as machine
images, the nodes, networks, firewalls, and so on, as well as the auto‐
mation scripting for deployment and self-healing.
Pivotal takes a very opinionated view of the configurations it will
support in the public cloud marketplaces; for example:
These are not arbitrary choices. Greenplum has tested these config‐
urations and found them suitable for efficient operation in the
cloud. That being said, customers have built their own clusters
without using the Greenplum-specific deployment options on both
Azure and AWS. Although they are useful for QA and development
needs, these configurations might not be optimal for use in a pro‐
duction environment.
The use of the marketplace for creating Greenplum clusters has many beneficial features that derive from the facilities that the public cloud providers bake into their offerings.
Private Cloud
Some enterprises have legal or policy restrictions that prevent data
from being moved into a public cloud environment. These organiza‐
tions can deploy Greenplum on private clouds, mostly running
VMware infrastructure. Some important principles apply to all vir‐
tual deployments:
• There must be adequate network bandwidth between the virtual
hosts.
• Shared-disk contention is a performance hindrance.
• Automigration must be turned off.
• As in real hardware systems, primary and mirror segments
must be separated, not only on different virtual servers, but also
on physical infrastructure.
• Adequate memory to prevent swapping is a necessity.
Additional Resources
You can download the mainline Pivotal Greenplum distribution
from the Pivotal Network site.
CHAPTER 4
Organizing Data in Greenplum
Distributing Data
One of the most important methods for achieving good query per‐
formance from Greenplum is the proper distribution of data. All
other things being equal, having roughly the same number of rows
in each segment of a database is a huge benefit. In Greenplum, the
data distribution policy is determined at table creation time. Green‐
plum adds a distribution clause to the Data Definition Language
(DDL) for a CREATE TABLE statement. Prior to Greenplum 6,
there were two distribution methods. In random distribution, each
row is randomly assigned a segment when the row is initially inser‐
ted. The other distribution method uses a hash function computed
on the values of some columns in the table.
Here are some examples:
CREATE TABLE bar
(id INT, stuff TEXT, dt DATE) DISTRIBUTED BY (id);
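The companion tables referenced in the discussion below (foo, gruz, and repl) do not appear in this excerpt. A minimal sketch of what their definitions might look like (the column names are illustrative, not from the original):
CREATE TABLE foo
(fid INT, ftype TEXT, fdate DATE) DISTRIBUTED RANDOMLY;
CREATE TABLE gruz
(gid INT, gender CHAR(1), gdate DATE);  -- no DISTRIBUTED clause
CREATE TABLE repl
(rid INT, rname TEXT) DISTRIBUTED REPLICATED;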
In the case of the bar table, Greenplum will compute a hash value
on the id column of the table when the row is created and will then
use that value to determine the segment in which the row will reside.
The foo table will have rows distributed randomly among the seg‐
ments. Although this will generate a good distribution with almost
no skew, it might not be useful when colocated joins will help per‐
formance. Random distribution is fine for small-dimension tables
and lookup tables, but probably not for large fact tables.
In the case of the gruz table, there is no distribution clause. Green‐
plum will use a set of default rules. If a column is declared to be a
primary key, Greenplum will hash on this column for distribution.
Otherwise, it will choose the first column in the text of the table def‐
inition if it is an eligible data type. If the first column is a user-
defined type or a geometric type, the distribution policy will be
random. It’s a good idea to explicitly distribute the data and not rely
There will be a copy of the repl table on each segment if the DIS‐
TRIBUTED REPLICATED clause is supplied. You can use replicated
tables to improve query performance by eliminating broadcast
motions for the table. Typically, you might use this for larger-
dimension tables in a star schema. A more sophisticated use is for
user-defined functions that run in the segments rather than the mas‐
ter where functions require access to all rows of the table. This is not
advised for large tables.
Currently an update on the distributed column(s) is allowed only
with the use of the Pivotal Query Optimizer (discussed later in this
chapter). With the legacy query planner, updates on the distribution
column were not permitted, because this would require moving the
data to a different segment.
A poorly chosen distribution key can result in suboptimal performance. Consider a database with 64 segments, and suppose the gruz table had been distributed by its gender column, with three possible values: “M,” “F,” and NULL for unknown. It’s likely that roughly 50% of the values would be “M,” roughly 50% “F,” and a small proportion NULL. This would mean that only 2 of the 64 segments would be doing useful work and the others would have nothing to do. Instead of achieving an increase in speed of 64 times over a nonparallel database, Greenplum could generate an increase of only two times, or, in the worst case, no speed increase at all if the two segments with the “F” and “M” data live on the same segment host.
The distribution key can be a set of columns, but experience has
shown that one or two columns usually is sufficient. Distribution
columns should always have high cardinality, although that in and
of itself will not guarantee good distribution. That’s why it is impor‐
tant to check the distribution. To determine how the data actually is
distributed, Greenplum provides a simple SELECT statement that
shows the data distribution:
SELECT gp_segment_id, count(*) FROM foo GROUP BY 1;
gp_segment_id | count
---------------+--------
1 | 781632
0 | 749898
2 | 761375
3 | 753063
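If this check reveals serious skew, you can change a table’s distribution key in place, and Greenplum will redistribute the existing rows across the segments; a minimal sketch, using the hypothetical gruz table above and a high-cardinality column:
ALTER TABLE gruz SET DISTRIBUTED BY (gid);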
Polymorphic Storage
Greenplum employs the concept of polymorphic storage. That is,
there are a variety of methods by which Greenplum can store data in
a persistent manner. This can involve partitioning data, organizing
the storage by row or column orientation, various compression
options, and even storing it externally to the Greenplum Database.
There is no single best way that suits all or even the majority of use
cases. The storage method is completely transparent to user queries,
and thus users do not do coding specific to the storage method. Uti‐
lizing the storage options appropriately can enhance performance.
Partitioning Data
Partitioning a table is a common technique in many databases, par‐
ticularly in Greenplum. It involves separating table rows for effi‐
ciency in both querying and archiving data. You partition a table
when you create it by using a clause in the DDL for table creation.
Here is an example of a table partitioned by range of the timestamp
column into partitions for each week:
CREATE TABLE foo (fid INT, ftype TEXT, fdate DATE)
DISTRIBUTED by (fid)
PARTITION BY RANGE(fdate)
(
PARTITION week START ('2017-11-01'::DATE)
END ('2018-01-31'::DATE)
EVERY ('1 week'::INTERVAL)
);
This example creates 13 partitions of the table, one for each week
from 2017-11-01 to 2018-01-31. Be aware that the end date does not
include January 31, 2018. To include that date, end on February 1 or
use the INCLUSIVE clause.
What is happening here? As rows enter the system, they are dis‐
tributed into segments based upon the hash values for the fid col‐
umn. Partitioning works at the storage level in each and every
segment. All rows whose fdate values fall in the same week will be stored together in their own file in the segment host’s filesystem. Why is this
a good thing? Because Greenplum usually does full-table scans
rather than index scans; if a WHERE clause in the query can limit
the date range, the optimizer can eliminate reading the files for all
dates outside that range.
If our query were
SELECT * FROM foo WHERE fdate > '2018-01-14'::DATE;
the optimizer would know that it had to read data from only the
three latest partitions, which eliminates about 77% of the partition
scans and would likely reduce query time by a factor of four. This is
known as partition elimination.
In addition to range partitioning, there is also list partitioning, in
which the partitions are defined by discrete values, as demonstrated
here:
CREATE TABLE bar (bid integer, bloodtype text, bdate date)
DISTRIBUTED by (bid)
PARTITION BY LIST(bloodtype)
(PARTITION a values('A+', 'A', 'A-'),
PARTITION b values ('B+', 'B-', 'B'),
PARTITION ab values ('AB+', 'AB-', 'AB'),
PARTITION o values ('O+', 'O-', 'O')
DEFAULT PARTITION unknown);
Distribution spreads data across segments, whereas partitioning is used to separate data within segments. You can envision this two-
dimensional “striping” of data as “plaiding,” with stripes running in
perpendicular directions. There is a further confusion: PARTITION
is a keyword in the definition of Window Functions. This use of par‐
tition has nothing to do with table partitioning.
Just as Greenplum did not allow update of the distribution columns,
it also forbade the update of the partition column until the use of the
GPORCA optimizer.
In many analytic data warehouses, recent data is more important
than older data, and it is helpful to archive older data. This older
data may be occasionally useful, but it might be best archived out‐
side the database. The external table mechanism is ideal for this use.
It is possible to add, drop, and split partitions. The Greenplum Data‐
base documentation covers this in detail.
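For illustration, here is a sketch of these operations against the foo table partitioned earlier (the dates and partition names are illustrative):
ALTER TABLE foo ADD PARTITION
START ('2018-02-01'::DATE) INCLUSIVE
END ('2018-02-08'::DATE) EXCLUSIVE;
ALTER TABLE foo DROP PARTITION FOR (RANK(1));
ALTER TABLE foo SPLIT PARTITION FOR ('2018-01-16'::DATE)
AT ('2018-01-18'::DATE)
INTO (PARTITION jan_a, PARTITION jan_b);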
Here are some best practices on partitioning:
• It makes little sense to partition small lookup tables and the like.
• Never partition on columns that are part of the distribution
clause.
• In column-oriented tables, the number of files for heavily parti‐
tioned tables can be very large as each column will have its own
file. A table with 50 columns, partitioned hourly with 10 sub‐
partitions over two years, will have 50 × 24 × 365 × 2 × 10 =
8,760,000 files per segment. With a segment host having eight
segments this would be more than 70 million files that the
server must open and close to do a full-table scan. This will have
a huge performance impact.
Compression
Compression is normally considered a method to decrease the stor‐
age requirements for data. In Greenplum, it has an additional bene‐
fit. When doing analytics with large datasets, queries often read the
entire table from disk. If the table is compressed, the smaller size can
result in many fewer disk reads, enhancing query performance.
Small tables need not be compressed, because compression will yield neither significant performance nor storage gains. Compression and
expansion require CPU resources. Systems already constrained by
CPU need to experiment to determine whether compression will be
helpful in query performance.
Like distribution, you specify compression when you create a table.
Greenplum offers many compression options but only for append-
optimized or column-oriented tables. The first statement in the fol‐
lowing example creates a table, the second generates an error:
CREATE TABLE foobar (fid INT, ft TEXT)
WITH (compresstype = quicklz,orientation=column,appendonly=true)
DISTRIBUTED by (fid);
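The second statement does not appear in this excerpt; it presumably tried to apply compression to an ordinary heap table, which Greenplum rejects because compression is available only for append-optimized tables. A sketch of what it might have looked like:
CREATE TABLE foobar2 (fid INT, ft TEXT)
WITH (compresstype = quicklz)
DISTRIBUTED by (fid);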
External Tables
The Greenplum concept of an external table is data that is stored
outside the database proper but can be accessed via SQL statements,
as though it were a normal database table. The external table is use‐
ful as a mechanism for:
• Loading data
• Archiving data
• Integrating data into distributed queries
simple file protocol. Chapter 5, on loading data, focuses on the
gpfdist protocol, Greenplum’s parallel transport protocol.
In the following example, raw data to be loaded into Greenplum is
located in a small flat file on a segment host. It has comma-separated
values (CSV) with a header line. The head of the file would look like
the following:
dog_id, dog_name, dog_dob
123,Fido,09/09/2010
456,Rover,01/21/2014
789,Bonzo,04/15/2016
The corresponding table would be as follows:
CREATE TABLE dogs
(dog_id int, dog_name text, dog_dob date) distributed randomly;
The external table would be defined like this:
CREATE READABLE EXTERNAL TABLE dogs_ext (
dog_id int, dog_name text, dog_dob date)
LOCATION ('file://seghost1/external/dogs.csv')
FORMAT 'csv' (header);
It could be queried like this:
SELECT * from dogs_ext;
Indexing
In transaction systems, indexes are used very effectively to access a
single row or a small number of rows. In analytic data warehouses,
this sort of access is relatively rare. Most of the time, the queries
access all or many of the rows of the table, and a full table scan gives
better performance in an MPP architecture than an index scan. In
general, Greenplum favors full-table scans over index scans in most
cases.
For that reason, when moving data warehouses to Greenplum, best
practice is to drop all the indexes and add them on an as-needed
basis for particular commonly occurring queries. This has two added benefits: it saves considerable disk space, and it saves load time, because maintaining indexes during large data loads degrades performance. Best
practice is to drop the indexes on a table during a load and re-create
them thereafter. Even though this is a parallel operation, it does con‐
sume computing resources.
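A minimal sketch of this pattern, reusing the dogs table from earlier (the index name and column are illustrative):
-- add an index only for a known, frequently run selective query
CREATE INDEX dogs_dob_idx ON dogs (dog_dob);
-- drop it before a large load and re-create it afterward
DROP INDEX dogs_dob_idx;
-- ... bulk load runs here ...
CREATE INDEX dogs_dob_idx ON dogs (dog_dob);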
Additional Resources
The Pivotal Greenplum Administrator Guide has a section on table
creation and storage options. It’s well worth reading before begin‐
ning any Greenplum project.
Jon Roberts discusses Greenplum storage in some blog posts on his
PivotalGuru website. Topics include append-optimized tables, notes
on tuning that rely heavily on data organization in Greenplum, and
a section on external storage with Amazon S3.
As mentioned earlier, although Greenplum does not require exten‐
sive indexing, the documentation includes a detailed discussion of
indexing strategies.
CHAPTER 5
Loading Data
INSERT Statements
The simplest way to insert data is to use the INSERT SQL statement,
which facilitates inserting a few rows of data. However, because the
insert is done via the master node of Greenplum Database, it cannot
be parallelized.
An insert like the one that follows is fine for populating the values in
a small lookup table:
INSERT INTO faa.d_cancellation_codes
VALUES ('A', 'Carrier'),
('B', 'Weather'),
('C', 'NAS'),
('D', 'Security'),
('', 'none');
There are also set-based INSERT statements that you can parallelize
(these are discussed later in this chapter).
\COPY Command
The \COPY command is a psql command. There is also the COPY
statement in SQL. There are minor differences, one of which is that
only a user with superuser privileges can use COPY.
You can run \COPY in psql scripts. psql is the command-line tool
for accessing Greenplum as well as PostgreSQL. A Greenplum ver‐
sion is available as part of the normal installation process.
\COPY is efficient for loading smallish datasets—a few thousand
rows or so—but because it is single-threaded through the master
server, it is less useful for loads of millions of rows or more.
The data is in a file, the top lines of which look like this:
dog_id, dog_name, dog_dob
123,Fido,09/09/2010
456,Rover,01/21/2014
789,Bonzo,04/15/2016
The corresponding table would be as follows:
CREATE TABLE dogs
(dog_id int, dog_name text, dog_dob date) distributed randomly;
Here’s the SQL statement that copies the three rows of data to the
table dogs:
\COPY dogs FROM '/home/gpuser/Exercises/dogs.csv'
CSV HEADER LOG ERRORS SEGMENT REJECT LIMIT 50 ROWS;
Raw data is often filled with errors. Without the REJECT clause, the
\COPY statement would fail if there were errors. In this example,
the REJECT clause allows the script to continue loading until there
are 50 errors on a segment. The LOG ERRORS clause will place the
errors in a log file. This is explained in the Greenplum SQL Com‐
mand Reference.
You also can use \COPY to unload small amounts of data:
\COPY dogs TO '/home/gpuser/Exercises/dogs_out.csv' CSV HEADER;
Start the gpfdist process on the host housing the data to be loaded.
The -d command-line argument points to the directory in which the
files live:
gpfdist -d /home/gpuser/data/ -p 8081 > gpfdist.log 2>&1 &
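The external table referenced by the INSERT below is not shown in this excerpt; it would be defined against the gpfdist location, roughly as follows (the host address comes from the surrounding text, and the filename is illustrative):
CREATE READABLE EXTERNAL TABLE dogs_ext (
dog_id int, dog_name text, dog_dob date)
LOCATION ('gpfdist://10.0.0.99:8081/dogs.csv')
FORMAT 'csv' (header);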
After the gpfdist process has been started, the data can easily be
imported by using the following statement:
INSERT INTO dogs SELECT * FROM dogs_ext;
When the INSERT statement is executed, all of the segments
engaged in the INSERT statement will issue requests to the gpfdist
process running on the server with address 10.0.0.99 for chunks of
data. They will parse each row, and if the row should belong to the
segment that imports it, it will be stored there. If not, the row will be
shipped across the private interconnect to the segment to which it
belongs and it will be stored there. This process is known as Scatter-
Gather. The mechanics of this are completely transparent to the user
community.
The Greenplum documentation describes a number of methods for
deploying gpfdist. In particular, the number of gpfdist processes
per external server can have a large impact on load speeds.
Figure 5-1 shows one example in which one gpfdist process is run‐
ning on the external ETL server.
Notice that there are two PORT fields in the YAML file. The first is
the Greenplum listener port to which user sessions attach. The sec‐
ond is the port that gpfdist and Greenplum use to transfer data.
There are many useful optional features in gpload:
Additional Resources
The Rollout Blog has a good tutorial on YAML.
The Greenplum Documentation has a thorough discussion of load‐
ing and unloading data.
Pivotal’s Jon Roberts has written Outsourcer, a tool that does
Change Data Capture from Oracle and SQL Server into Greenplum.
Though this is not a Pivotal product and is thus not supported by
Pivotal, Jon makes every effort to maintain, support, and enhance
this product.
It’s all well and good to efficiently organize and load data, but the
purpose of an analytic data warehouse is to gain insight. Initially,
data warehousing used business intelligence tools to investigate his‐
torical data owned by the enterprise. This was embodied in key per‐
formance indicators (KPIs) and some limited “what-if ” scenarios.
Greenplum was more ambitious. It wanted users to be able to do
predictive analytics and process optimization using data in the
warehouse. To that end, it employed the idea of bringing the analyt‐
ics to the data rather than the data to the analytics because moving
large datasets between systems is inefficient and costly. Greenplum
formed an internal team of experienced practitioners in data science
and developed analytic tools to work inside Greenplum. This chap‐
ter explores the ways in which users can gain analytic insight using
the Pivotal Greenplum Database.
how to build and run the appropriate predictive analytics models on
the pertinent data to realize this business value.
Much of the interest is due to the explosion of data, for example, the
IoT. But it has been made possible by advances in computer mem‐
ory, CPU, and network capabilities. How can a business use data to
its greatest benefit? The first step is to use the data for retrospective
analysis. It is important to see what has happened in the past and be
able to slice and dice that data to fully understand what’s going on.
This is commonly called business intelligence (BI), which makes use
of descriptive statistics. The next step is to understand patterns in
the data and infer causality. This inference can lead to deeper under‐
standing of a problem and also has a predictive component. This
process is called data science. The algorithms come from the fields of
artificial intelligence (AI) and machine learning as well as the more
established branches of engineering and statistics. But you need big
data technology to make all of this work. Greenplum has been
developed with the goal of enabling data science, and in this section,
we explore how it does that.
Apache MADlib
Frank McQuillan
MADlib is a comprehensive library that has been developed over
several years. It comprises more than 50 principal algorithms along
with many additional utilities and helper functions, as illustrated in
Figure 6-1.
Figure 6-2. MADlib and Greenplum architecture
Algorithm Design
Realizing predictive models in a distributed manner is complex. It
needs to be done on an algorithm-by-algorithm basis, because there
is no general or automatic way to make a given predictive algorithm
run in parallel.
Generally, MADlib achieves linear scalability with data size and the
key parameters of a particular algorithm. Though MADlib develop‐
ers use the most appropriate and performant implementations from
the literature, different algorithms can achieve varying degrees of
parallelism depending on their nature. For example, time-series data
can be challenging to parallelize, whereas linear regression has a
closed form solution that can be readily parallelized.
Figure 6-3 shows how iterative predictive models are executed in the
Greenplum Database. A key component is the stored procedure for
the model. It comprises a transition function that updates model
state in parallel on the segments, a merge function that combines
models from different segments on the master, and a final function
that produces the output value. The iteration of transition, merge,
and final functions proceeds until model convergence is achieved.
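To make this concrete, here is a minimal sketch of training and scoring a model with MADlib’s linear-regression module entirely in SQL (the table and column names are illustrative):
-- train: the model is computed in parallel across the segments
SELECT madlib.linregr_train(
    'houses',                   -- source table
    'houses_linregr',           -- output table holding the model
    'price',                    -- dependent variable
    'ARRAY[1, sqft, bedrooms]'  -- independent variables
);
-- score: apply the model to the rows where they live
SELECT h.id,
       madlib.linregr_predict(m.coef, ARRAY[1, h.sqft, h.bedrooms]) AS predicted_price
FROM houses h, houses_linregr m;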
R Interface
R is one of the most commonly used languages by data scientists.
Although it provides a breadth of capability and is a mature library,
it is not designed for working with datasets that exceed the memory
of the client machine.
This is where PivotalR comes in. PivotalR is a CRAN package that
provides an R interface to MADlib. The goal is to harness the famili‐
arity and usability of R and at the same time take advantage of the
scalability and performance benefits of in-database computation
that come with Greenplum.
PivotalR translates R code into SQL, which is sent to the database
for execution. The database could be local running on a laptop or it
could be a separate Greenplum Database cluster. Figure 6-4 demon‐
strates the flow.
Deep Learning
Deep learning is a type of machine learning, inspired by the biology
of the brain, that uses a class of algorithms called artificial neural
networks. These neural networks have been shown to perform very
well on certain problems such as computer vision, speech recogni‐
tion, and machine translation when coupled with acceleration by
graphics processing units (GPUs). There is rapid innovation in this
domain and a number of open source deep learning libraries are
now available to help design and run these networks.
A new area of R&D by the Greenplum team focuses on the tooling
needed to run deep learning libraries with GPU acceleration on data
residing in Greenplum Database. The idea is to avoid the need for
users to move data from the database to another execution frame‐
work to run deep learning workloads. One project goal is to train
models faster by using the power of MPP coupled with that of
GPUs. This project is in the relatively early stages but shows
promise.
Text Analytics
Craig Sylvester and Bharath Sitaraman
• GPText database schema provides in-database access to Apache
Solr indexing and searching
• Build indexes with database data or external documents and
search with the GPText API
• Custom tokenizers for international text and social media text
• Universal Query Processor that accepts queries with mixed syn‐
tax from supported Solr query processors
• Faceted search results
• Term highlighting in results
• Natural language processing (NLP), including part-of-speech
tagging and named entity extraction
• High availability (HA) on par with the Greenplum database
1 Apache Solr is a server providing access to Apache Lucene full-text indexes. Apache
SolrCloud is an HA, fault-tolerant cluster of Apache Solr servers. GPText cluster refers
to a SolrCloud cluster deployed by GPText for use with a Greenplum cluster.
Configuring Solr/GPText
In addition to these architectural enhancements, Pivotal essentially
ported Solr so that GPText provides the full flexibility you get with a
standalone SolrCloud installation. Activities like configuring or
managing memory or performance tuning are fully supported.
Index and search configurations such as adding or removing stop
words, synonyms, protected words, stemming, and so on are also
fully supported (with a provided utility to aid in these operations).
GPText includes OpenNLP libraries and analyzer classes to classify
indexed terms’ parts of speech (POS), and to recognize named enti‐
ties, such as the names of persons, locations, and organizations
(NER). GPText saves NER terms in the field’s term vector, prepen‐
ded with a code to identify the type of entity recognized. This allows
searching documents by entity type.
In addition to the analyzers and filters available through Solr,
GPText adds two additional text processing analyzers for parsing
international text (multiple languages) and social media text
(including #hashtags, URLs, emojis, and @mentions).
Troubleshooting slow queries requires diving into multiple parts of the database as well as an understanding of how queries are executed in an MPP environment. Common culprits include locking and blocking of tables by competing queries or the cluster’s inability to spare the
resources to keep up with the demand of the current workload.
Sometimes, a badly written query at a point of heavy workload can
cause a cascading effect and block critical queries from execution.
These situations can be more or less common given the Greenplum
environment and normal usage, but to effectively diagnose the situa‐
tion requires a monitoring tool that surfaces deep insights into all of
the aforementioned factors.
Greenplum Command Center (GPCC) is an Enterprise Greenplum
tool that presents DBAs with a wealth of information about Green‐
plum performance in a web-based interface. The current version
(GPCC v4) allows the user to drill down deep into potential issues
for the Greenplum cluster, down to the query execution step. The
GUI interface not only presents the information, but aims to sim‐
plify and aggregate it so that a potential problem is spotted right
away. The goal is to present any user, expert or novice, with the nec‐
essary information to allow an accurate understanding of what is
currently happening and what may be the underlying cause.
Figure 7-1 shows the GPCC dashboard, which provides a quick
overview of the status of the Greenplum cluster.
For users drilling down into a specific query, the GPCC shows a
number of important metrics regarding the performance, as well as
a visual representation of the query execution steps. The informa‐
tion is summed up into data points to help users figure out how
much longer the query is expected to run, and decide whether to
terminate the query to release the table lock or free up system
resources.
Figure 7-4 shows query details and metrics vital to understanding
the query’s current performance and contribution to the overall
cluster resource usage. You can expand each step of the plan for
more information. You also can expand the plan to a full-page view
to facilitate troubleshooting for a complex query.
After the problematic trends and problem origins are identified, the
DBA can capture this information and surface alerts for cluster and query health.
Workload Management
Whether it’s routine ETL or ad hoc user queries, controlling
resource utilization and concurrency is important to ensure smooth
database operation. Data warehouse queries with high enough con‐
currency can consume enough CPU, memory, and I/O to over‐
whelm the cluster. To control resource usage, Greenplum provides two mechanisms: resource queues and resource groups.
Although each controls resource usage, resource groups are a much
more robust method.
Resource Queues
Prior to version 5, Greenplum used the notion of resource queues.
Each user or role is associated with a resource queue. Each queue
defines the limits on resources that its associated roles can use.
For example, the following two statements will create a resource
queue named adhoc and associate the role group2 with the queue:
CREATE RESOURCE QUEUE adhoc WITH (ACTIVE_STATEMENTS=3);
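The second statement, which does not appear in this excerpt, would associate the role with the queue, and the expanded-format output that follows presumably comes from the gp_toolkit status view:
ALTER ROLE group2 RESOURCE QUEUE adhoc;
SELECT * FROM gp_toolkit.gp_resqueue_status;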
-[ RECORD 1 ]--+------------
queueid | 228978
rsqname | power
rsqcountlimit | 3
rsqcountvalue | 0
rsqcostlimit | −1
rsqcostvalue | 0
rsqmemorylimit | 1.57286e+09
rsqmemoryvalue | 0
rsqwaiters | 0
rsqholders | 0
-[ RECORD 2 ]--+------------
queueid | 6055
rsqname | pg_default
rsqcountlimit | 20
rsqcountvalue | 0
rsqcostlimit | −1
rsqcostvalue | 0
rsqmemorylimit | −1
rsqmemoryvalue | 0
rsqwaiters | 0
rsqholders | 0
A fuller description of the resource queues is available in the Pivotal
documentation.
Resource queues do an important job in proactively managing
workload, but they do not respond to issues with dynamic resource
consumption. For example, after a query leaves a queue and begins
executing, the queue mechanism does nothing to further constrain
its behavior. Poorly written SQL can easily overwhelm a system.
Resource Groups
New in Greenplum Version 5
Resource groups were introduced in Greenplum 5 and
are not available in previous releases.
To switch from resource queues to resource groups, set the gp_resource_manager server configuration parameter:
gpconfig -c gp_resource_manager -v group
Then restart the database. After the database is restarted, you can
view and modify resource groups through the command line or
through the Workload Management tab on the GPCC user interface,
as seen in Figure 7-7.
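A minimal sketch of creating a resource group and assigning a role to it from the command line (the group name and limits are illustrative):
CREATE RESOURCE GROUP rg_adhoc WITH
(CONCURRENCY=10, CPU_RATE_LIMIT=20, MEMORY_LIMIT=25);
ALTER ROLE group2 RESOURCE GROUP rg_adhoc;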
System Health
After the database is up and running, the following utilities are used
to determine the overall health of a running cluster and address any
issues that are reported:
Monitoring the state of the cluster: gpstate
Quickly provides information regarding the state of your
Greenplum cluster, including the validity of the master, standby
master, primary, and mirror segments.
Recovering from a primary segment failure: gprecoverseg
One of Greenplum’s features is high availability. This is partly
accomplished by mirroring the primary data segments in case of
hardware failures (disk, power, etc.). When a primary segment
fails-over to the mirror, the cluster continues functioning. There
comes a time when the original primary segment needs to be
reintroduced to the cluster. This utility synchronizes the pri‐
mary with the mirror based on the transactions that have occurred since the failover.
Additional Resources
The Pivotal Greenplum Command Center documentation has both
release notes as well as detailed information on installation, sup‐
ported platforms, access, and upgrades from previous versions.
DBAs should read two documents on workload management:
• Workload management overview
• Workload management with GPCC
dblink
New in Greenplum Version 5
dblinks are new in Greenplum 5.
Introduced in PostgreSQL 8.4, dblink makes it easier to query tables that live in another database.
dblink is intended for database users to perform short ad hoc quer‐
ies in other databases. It is not intended for use as a replacement for
external tables or for administrative tools such as gpcopy. As with all
powerful tools, DBAs must be careful with granting this privilege to
users, lest they should gain access to information that they’re not
permitted to see.
dblink is a user-contributed extension. Before you can use it, you
must create the extension in the Greenplum instance by using gpadmin, the database superuser. Much of its functionality has been
replaced by foreign data wrappers in Greenplum 6.
For this example, suppose that there is a table called ducks in data‐
base gpuser22 in the same Greenplum instance as database gpuser.
One goal is to be able to query the ducks table while operating in
database gpuser.
First, the database superuser must create the dblink functions by
using a Greenplum-provided script:
[gpadmin@myhost ~]$ psql -d gpuser
gpuser=# \i /usr/local/greenplum-db/share/postgresql/contrib/dblink.sql
SET
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
REVOKE
REVOKE
CREATE FUNCTION
...
CREATE FUNCTION
CREATE TYPE
CREATE FUNCTION
gpuser=# SELECT dblink_connect('mylocalconn',
'dbname=gpuser22 user=gpadmin');
 dblink_connect
----------------
 OK
(1 row)
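With the connection established, the remote ducks table can be queried from the gpuser database; a minimal sketch (the column list is illustrative):
SELECT * FROM dblink('mylocalconn', 'SELECT duck_id, duck_name FROM ducks')
    AS d (duck_id int, duck_name text);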
Foreign data wrappers are much like dblink. They both allow cross-
database queries, but foreign data wrappers are more performant,
closer to standard SQL, and persistent, whereas dblinks are active
only on a session basis.
Foreign data wrappers require a bit more work than dblinks. You need to create several objects: the foreign data wrapper extension itself, a foreign server, a user mapping, and a foreign table that maps to the remote table.
Functionally, PXF uses the external table syntax. For a Parquet file
created by Hive in HDFS, an external file definition might be as fol‐
lows:
CREATE EXTERNAL TABLE public.pxf_read_providers_parquet
(
provider_id text,
provider_name text,
specialty text,
address_street text,
address_city text,
address_state text,
address_zip text
)
LOCATION ('pxf://user/hive/warehouse/providers.parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
In addition, PXF needs to know how to get to the HDFS cluster.
This information is held in the configuration files in /usr/local/
greenplum-pxf/servers/default.
To access the contents of the Parquet file, a simple select will do:
select * from public.pxf_read_providers_parquet order
by provider_id;
Had the data been stored in a plain CSV text file in HDFS, the exter‐
nal table definition might have been as shown in the preceding
example, but the profile, format, and filename would be different:
...
LOCATION ('pxf://tmp/data/providers/providers_part1.csv?PROFILE=hdfs:text')
FORMAT 'CSV' (HEADER QUOTE '"' delimiter ',' )
The Pivotal PXF documentation describes how to write the three
functions for other data types. It’s a nontrivial exercise.
Greenplum-Kafka Integration
New in Greenplum Version 5
Greenplum-Kafka integration requires Greenplum 5.
Many of the source systems that provide data to Greenplum are able
to publish to Kafka streams, called “topics” in Kafka nomenclature.
The Greenplum-Kafka Integration module facilitates Greenplum’s
ability to act as a consumer of records in a topic.
The gpkafka utility incorporates the gpsscli and gpss commands
of the GPSS and makes use of a YAML configuration file.
A full example is beyond the scope of this book but is well described
in the Greenplum-Kafka Integration documentation.
Greenplum-Informatica Connector
New in Greenplum Version 5
Implementing the Greenplum-Informatica Connector
requires Greenplum 5 and Informatica’s Power Center.
Greenplum-Spark Connector
Apache Spark is an in-memory cluster computing platform. It is pri‐
marily used for its fast computation. Although originally built on
top of Hadoop MapReduce, it has enlarged the paradigm to include
other types of computations including interactive queries and
stream processing. A full description is available on the Apache
Spark website.
Spark was not designed with its own native data platform and origi‐
nally used HDFS as its data store. However, Greenplum provides a
connector that allows Spark to load a Spark DataFrame from a
Greenplum table. Users run computations in Spark and then write a
Spark DataFrame into a Greenplum table. Even though Spark does
provide a SQL interface, it’s not as powerful as that in Greenplum.
However, Spark does iteration on DataFrames quite nicely, something that is not a SQL strong point.
Figure 8-3 shows the flow of the Greenplum-Spark Connector. The
Spark driver initiates a Java Database Connectivity (JDBC) connec‐
tion to the master in a Greenplum cluster to get metadata about the
table to be loaded. Table columns are assigned to Spark partitions,
and executors on the Spark nodes speak to the Greenplum segments
to obtain data.
Amazon S3
Greenplum has the ability to use readable and writable external
tables as files in the Amazon S3 storage tier. One useful feature of
this is hybrid queries in which some of the data is natively stored in
Greenplum, with other data living in S3. For example, S3 could be
an archival location for older, less frequently used data. Should the
need arise to access the older data, you can access it transparently in
place without moving it into Greenplum. You also can use this pro‐
cess to have some table partitions archived to Amazon S3 storage.
For tables partitioned by date, this is a very useful feature.
In the following example, raw data located in an Amazon S3 bucket is loaded into an existing Greenplum table.
The data is in a file, the top lines of which are as follows:
dog_id, dog_name, dog_dob
123,Fido,09/09/2010
456,Rover,01/21/2014
789,Bonzo,04/15/2016
The corresponding table would be as follows:
CREATE TABLE dogs
(dog_id int, dog_name text, dog_dob date) distributed randomly;
The external table definition uses the Amazon S3 protocol:
CREATE READABLE EXTERNAL TABLE dogs_ext (LIKE dogs)
LOCATION ('s3://s3-us-west-2.amazonaws.com/s3test.foo.com/normal/')
FORMAT 'csv' (header)
LOG ERRORS SEGMENT REJECT LIMIT 50 rows;
Assuming that the Amazon S3 URI is valid and the file is accessible, a simple SELECT * FROM dogs_ext will show the three rows from the CSV file on Amazon S3 as though it were an internal Greenplum table. Of course, performance will not be as good as if the data were stored
internally. The gphdfs protocol allows for reading and writing data
to HDFS. For Greenplum used in conjunction with Hadoop as a
landing area for data, this provides easy access to data ingested into
Hadoop and possibly cleaned there with native Hadoop tools.
Additional Resources
The gRPC website provides a wealth of documentation as well as
tutorials. That said, knowledge of gRPC is not required to run GPSS.
There is more detailed information about the GemFire-Greenplum
Connector in the Pivotal GemFire-Greenplum Connector docu‐
mentation.
GemFire is very different from Greenplum. This brief tutorial from
Pivotal is a good place to begin learning about it.
For more on GPSS, visit this Pivotal documentation page.
Greenplum PXF is described in detail on this Pivotal documentation
page.
For more details on dblinks, see the Pivotal dblink Functions page.
CHAPTER 9
Optimizing Query Response
1 “New Benchmark Results: Pivotal Query Optimizer Speeds Up Big Data Queries up to
1000x”.
SQL is a declarative language that is used to define, manage, and query the data
stored in relational/stream data management systems.
Declarative languages describe the desired result, not the logic
required to produce it. The responsibility for generating an optimal
execution plan lies solely with the query optimizer employed in the
database management system. To understand how query processing
works in Greenplum, there is an excellent description in the Greenplum documentation.
GPORCA is a top-down query optimizer based on the Cascades
optimization framework,2 which is not tightly coupled with the host
system. This unique feature enables GPORCA to run as a stand‐
alone service outside the database system. Therefore, GPORCA sup‐
ports products with different computing architectures (e.g., MPP
and Hadoop) using a single optimizer. It also takes advantage of the
extensive legacy of relational optimization in different query pro‐
cessing paradigms like Hadoop. For a given user query, there can be
a significantly large number of ways to produce the desired result
set, some much more efficient than others.
While searching for an optimal query execution plan, the optimizer
must examine as many plans as possible and follow heuristics and
best practices in choosing some plans and ignoring others. Statistics
capturing the characteristics of the stored data (such as skew), num‐
ber of distinct values, histograms, and percentage of null values, as
well as the cost model for the various operations are all crucial
ingredients the query optimizer relies on when navigating the space
of all viable execution plans. For GPORCA to be most effective, it is
crucial for DBAs to maintain up-to-date statistics for the queried
data.
Until recently, Greenplum used what is referred to as the legacy
query optimizer (LQO). This is a derivative of the original Post‐
greSQL planner that was adapted to the Greenplum code base ini‐
tially. The PostgreSQL planner was originally built for single-node
PostgreSQL optimized for online transaction processing (OLTP)
queries. In contrast, an MPP engine is built for long-running online
analytical processing (OLAP) queries. For this reason, the Post‐
greSQL planner was not built with an MPP database in mind.
2 G. Graefe, “The Cascades Framework for Query Optimization,” IEEE Data Eng Bull
18(3), 1995.
Previously, the LQO was the default, but as of Greenplum 5.0, GPORCA is the default query optimizer (see Figure 9-1). You can switch between optimizers at the database level or the session level by setting the GUC parameter optimizer (on for GPORCA, off for the legacy planner). When enabling GPORCA, we
request that users or DBAs ensure that statistics have been collected
on the root partition of a partitioned table. This is because, unlike
the legacy planner, GPORCA uses the statistics at the root partitions
rather than using the statistics of individual leaf partitions.
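For example, here is a sketch of switching optimizers and collecting root-partition statistics (the database and table names are illustrative):
-- choose the optimizer for the current session: on = GPORCA, off = legacy planner
SET optimizer = on;
-- or change the default for a single database
ALTER DATABASE mydb SET optimizer = off;
-- gather statistics on the root partition of a partitioned table for GPORCA
ANALYZE ROOTPARTITION foo;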
Figure 9-3 presents the query plan produced by GPORCA; the opti‐
mizer status shows the version of GPORCA used to generate the
plan.
Figure 9-3. GPORCA plan for a correlated subquery on the part table