
Technical Publications

Greenplum Database 3.0


Overview Guide

Last Revised: April 27, 2007 1:56 pm


Contents


Preface
    About This Guide
    Greenplum Database Documentation
        Document Conventions
    Contact Us
Chapter 1: About Greenplum Database
    Greenplum Database Overview
        Greenplum Database and PostgreSQL
        Greenplum Database and Bizgres
    Features and Benefits
        Open Source
        Commodity Hardware
        Shared Nothing Architecture
        High Availability
        Parallel Data Loading
        Workload Management
        Scalability and Flexibility
        Performance and Price
    About Greenplum
        The Greenplum Vision
Chapter 2: Greenplum Database Architecture
    The Greenplum Database High-Level Architecture
        The Master
        The Segments
        The Interconnect
    High Availability Architectures
        Mirroring and Fault Tolerance
        Fault Detection and Recovery
        No Single Point of Failure
    Bizgres Loader
    Management and Monitoring
        System Management Suite
        Greenplum Monitor
Chapter 3: Greenplum Database System Processes
    How Greenplum Database Stores Data
    How Greenplum Database Executes Queries
        Example of Query Processing in Greenplum Database
Chapter 4: System Requirements and Supported Features
    System Requirements
        System Prerequisites
        Supported Operating System Environments
        Recommended Hardware Configuration
    Supported Features
        SQL Support
        Client API Support
        Supported PostgreSQL Server Applications
Glossary

Copyright © 2007 by Greenplum, Inc. All rights reserved.


This publication pertains to Greenplum software and to any subsequent release until otherwise indicated in new
editions or technical notes.

Information in this document is subject to change without notice. The software described herein is furnished under a
license agreement, and it may be used or copied only in accordance with the terms of that agreement. Upgrades are
provided only at regularly scheduled software release dates. No part of this publication may be reproduced,
transmitted, or translated in any form or by any means, electronic, mechanical, manual, optical, or otherwise, without
the prior written permission of Greenplum, Inc. Greenplum makes no warranty of any kind with respect to the
completeness or accuracy of this manual.

Greenplum, the Greenplum logo, Bizgres, the Bizgres logo, and Greenplum Database are trademarks of Greenplum,
Inc. PostgreSQL is a trademark of Marc Fournier held in trust for The PostgreSQL Development Group. All other
company and product names used herein may be trademarks or registered trademarks of their respective companies.




Preface
This guide provides an overview of the Greenplum Database product. It explains the
features, concepts, system architecture, and system processes.
• About This Guide
• Greenplum Database Documentation
• Contact Us

About This Guide


This guide provides a general understanding of the Greenplum Database product, its
high-level architecture, and its features. This guide is intended for customers or
potential customers of Greenplum who are considering a deployment of Greenplum
Database.
This guide contains the following chapters and a glossary:
• Chapter 1, “About Greenplum Database” provides a high-level overview of
Greenplum Database and how it can be used to manage large data volumes. This
chapter explains all of the features and benefits of Greenplum Database.
• Chapter 2, “Greenplum Database Architecture” explains all of the components of
Greenplum Database and how the high-level system architecture is laid out.
• Chapter 3, “Greenplum Database System Processes” provides information about
how Greenplum Database stores data and executes queries.
• Chapter 4, “System Requirements and Supported Features” explains the hardware
system requirements for Greenplum Database. It also lists the supported SQL
commands and features as compared to PostgreSQL.
• The Glossary provides definitions for Greenplum Database terminology used
throughout the Greenplum Database documentation.

Greenplum Database Documentation


The Greenplum Database documentation is provided in PDF and HTML format. The
documentation can be found in the $GPHOME/docs directory of your Greenplum
Database installation, or on the Greenplum Network.
The Greenplum Database documentation is intended as a supplement to the
PostgreSQL 8.2.4 documentation, upon which Greenplum Database is based.
The following documents are provided with this release of Greenplum Database:
• Release Notes — The release notes (README) explain the new features and any
known issues associated with this release.
• Greenplum Database Overview Guide — This guide provides a high-level
overview of Greenplum Database including its architecture, components, features,
usage, and system requirements.


• Greenplum Database Administrator Guide — This guide contains instructions
and reference information for system administrators responsible for installing and
maintaining the Greenplum Database system. Installation instructions and server
maintenance information are provided in this guide.
• Greenplum Database User Guide — This guide is intended for database
administrators and users of Greenplum Database. It contains information about
creating and managing databases, accessing databases, loading data, and issuing
queries. Reference information for supported SQL commands is also found in this
guide.
• Greenplum Database Performance Tuning Guide — This guide is intended for
system and database administrators. It contains information and advice for
achieving the maximum level of performance from your Greenplum Database
system.

Document Conventions
The following conventions are used throughout the Greenplum Database
documentation to help you identify certain types of information.

Text Conventions

Table 0.1 Text Conventions

bold
    Usage: Button, menu, tab, page, and field names in GUI applications.
    Example: Click Cancel to exit the page without saving your changes.

italics
    Usage: New terms where they are defined; database objects, such as schema,
    table, or column names.
    Examples: The master instance is the postmaster process that accepts client
    connections. Catalog information for Greenplum Database resides in the
    pg_catalog schema.

monospace
    Usage: File names and path names; programs and executables; command names
    and syntax; parameter names.
    Examples: Edit the postgresql.conf file. Use gpstart to start Greenplum
    Database.

monospace italics
    Usage: Variable information within file paths and file names; variable
    information within command syntax.
    Examples: /home/gpadmin/config_file
    COPY tablename FROM 'filename'

monospace bold
    Usage: Calls attention to a particular part of a command, parameter, or
    code snippet.
    Example: Change the host name, port, and database name in the JDBC
    connection URL: jdbc:postgresql://host:5432/mydb

UPPERCASE
    Usage: Environment variables; SQL commands; keyboard keys.
    Examples: Make sure that the Java /bin directory is in your $PATH.
    SELECT * FROM my_table;
    Press CTRL+C to escape.

Command Syntax Conventions

Table 0.2 Command Syntax Conventions

{ }
    Within command syntax, curly braces group related command options. Do not
    type the curly braces.
    Example: FROM { 'filename' | STDIN }

[ ]
    Within command syntax, square brackets denote optional arguments. Do not
    type the brackets.
    Example: TRUNCATE [ TABLE ] name

...
    Within command syntax, an ellipsis denotes repetition of a command,
    variable, or option. Do not type the ellipsis.
    Example: DROP TABLE name [, ...]

|
    Within command syntax, the pipe symbol denotes an “OR” relationship. Do not
    type the pipe symbol.
    Example: VACUUM [ FULL | FREEZE ]
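For example, reading the VACUUM synopsis above: the square brackets mean the
keyword is optional, and the pipe means FULL and FREEZE are alternatives, so
all of the following are valid commands:

    VACUUM;          -- no optional keyword
    VACUUM FULL;     -- with the optional FULL keyword
    VACUUM FREEZE;   -- with the optional FREEZE keyword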


Naming Conventions and Acronyms

Table 0.3 Naming Conventions and Acronyms

$GPHOME
    The base directory where Greenplum Database is installed, for example:
    /usr/local/greenplum-db-3.0.0.0

gpadmin
    The default name for the Greenplum Database super user.

KB  Kilobytes (2¹⁰ bytes)
MB  Megabytes (2²⁰ bytes)
GB  Gigabytes (2³⁰ bytes)
TB  Terabytes (2⁴⁰ bytes)

Contact Us
To contact Greenplum customer support, call 1-866-410-6060 or send email to:
support@greenplum.com


1 About Greenplum Database


Greenplum Database is the industry’s first massively parallel processing (MPP)
database server based on open-source technology. It is explicitly designed to support
business intelligence (BI) applications and large, multi-terabyte data warehouses. This
chapter provides some high-level information about Greenplum Database and its uses,
features, and benefits.
• Greenplum Database Overview
• Features and Benefits
• About Greenplum

Greenplum Database Overview


Greenplum Database is the first database server powered by open source technology
that can scale to support multi-terabyte data warehousing demands. It is based on
PostgreSQL, the
most advanced open-source database available. This section explains Greenplum
Database’s similarities and differences as compared to PostgreSQL and the Bizgres
single-server edition.
• Greenplum Database and PostgreSQL
• Greenplum Database and Bizgres

Greenplum Database and PostgreSQL


The object-relational database management system known as PostgreSQL is derived
from the POSTGRES package written at the University of California at Berkeley.
With more than two decades of development behind it, PostgreSQL is now the most
advanced open-source database available.
Greenplum Database is built upon the PostgreSQL 8.2.4 code base and has many
similarities to PostgreSQL. For example, many of the client and server applications,
configuration files, supported SQL commands, and syntax will be the same or very
similar to PostgreSQL.
Greenplum Database is essentially several PostgreSQL instances acting as one
cohesive database management system. The internals of PostgreSQL have been
modified or supplemented to support the parallel structure of Greenplum Database.
For example, the system catalog has been supplemented to track all of the segment
instances that comprise a Greenplum database. The query parser, query planner, query
optimizer, and query executor processes have been modified and enhanced to be able
to execute queries in parallel across all of the segments.
Data Query and Manipulation Language (DQL/DML) is essentially supported as it is
in PostgreSQL. SELECT, INSERT, UPDATE, and DELETE are DQL/DML commands. All
other SQL commands are considered Data Definition Language (DDL) or utility
commands. Most DDL and utility SQL statements are supported in Greenplum
Database as they are in PostgreSQL, with a few minor exceptions. See “SQL Support”
in Chapter 4 for more information.
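Because the DQL/DML layer behaves as it does in PostgreSQL, everyday statements
carry over unchanged. A minimal sketch (the names table is hypothetical):

    -- Standard PostgreSQL DML runs as-is when issued through the Greenplum master.
    INSERT INTO names (id, name) VALUES (1, 'Alice');
    UPDATE names SET name = 'Bob' WHERE id = 1;
    SELECT id, name FROM names;
    DELETE FROM names WHERE id = 1;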

Greenplum Database and Bizgres


Bizgres is the first open source, production-ready database server focused exclusively
on supporting Business Intelligence applications. Bizgres is Greenplum’s single-host
product that is driven directly from the efforts of Bizgres.org, a Greenplum-sponsored
and community-supported open source project, the mission of which is to build a
comprehensive database platform for Business Intelligence on top of PostgreSQL.
As with Greenplum Database, Bizgres is based on PostgreSQL 8.2. Bizgres includes
all of the features of PostgreSQL 8.2, plus enhancements and features (such as the
Bizgres Loader and Workload Management) that optimize PostgreSQL for Business
Intelligence applications. Bizgres runs on a single host and is designed to support
smaller databases (under a terabyte), while Greenplum Database is a parallel database
system running on multiple hosts designed to support multi-terabyte databases.
Greenplum Database uses both intra-host and inter-host parallelism to increase query
performance.
The Bizgres and PostgreSQL development communities are closely connected and
each contributes to the other. For example, many features first introduced in Bizgres
were later included in PostgreSQL, such as COPY performance enhancements and table
partitioning. Features currently in Bizgres, such as bitmap indexes and workload
management, are planned for PostgreSQL 8.3. Likewise, each PostgreSQL release is
merged back into the Bizgres code base.
Greenplum Database and Bizgres are developed in parallel, and are not derived from
the same code base. Both products do share some common characteristics and
features, such as the Bizgres Loader, and are designed to be upgrade compatible.
Users who begin with Bizgres can easily upgrade to Greenplum Database at a later
time as their data grows beyond the capacity of Bizgres.

Features and Benefits


Greenplum Database provides many advantages over traditional data warehousing
systems.
• Open Source
• Commodity Hardware
• Shared Nothing Architecture
• High Availability
• Parallel Data Loading
• Workload Management
• Scalability and Flexibility
• Performance and Price


When compared to the competition, Greenplum Database is the only product that
performs exceptionally well in all of these categories.

Figure 1.1 Greenplum Database and the Competition
(A chart rating each vendor as good, fair, or poor on open source, commodity
hardware, shared nothing, high availability, scalability/flexibility, and
performance/price.)

Open Source
The global adoption and momentum building around Linux have demonstrated the
power and value of utilizing open source software in the enterprise. Open source
offers many of the benefits that have been missing from the traditional proprietary
commercial software industry. Open source software:
• Insulates enterprises from vendor lock-in
• Lowers the cost of ownership
• Leverages the efforts of a global developer community
In 2005, Greenplum founded Bizgres.org, the first open source database project
focused on Business Intelligence (BI). The Bizgres Project aims to make PostgreSQL
the world’s most robust open source database for Business Intelligence. Greenplum
Database is an enterprise-ready commercial solution that builds upon and extends the
offerings of Bizgres.

Commodity Hardware
One of Greenplum Database’s greatest strengths is that it can run on off-the-shelf,
low-cost commodity servers. Greenplum Database was designed specifically to take
advantage of the tremendous price/performance advantages that commodity
computing delivers over traditional proprietary SMP-based systems.
Greenplum Database supports standard hardware configurations from Dell, HP, Sun,
and other hardware vendors. A typical Greenplum Database compute host has the
following hardware resources:
• 2 dual-core CPUs (typically Xeon or Opteron)


• 16 GB of RAM
• 2 Gigabit Ethernet interfaces
• 1 SATA RAID disk controller per 8 drives
• 16 SATA 400 GB hard drives
By leveraging commodity systems, Greenplum Database requires less than $25,000
(US) of hardware per terabyte of usable warehousing capacity.

Shared Nothing Architecture


Business Intelligence (BI) processing normally involves repeated scanning of the
entire contents of a deep repository of data to compute the results of complex queries.
On the other hand, most of today’s general-purpose relational database management
systems have been designed for Online Transaction Processing (OLTP) applications,
where simple queries are repeatedly processed using small amounts of data. Databases
designed to handle OLTP workloads often perform poorly when faced with BI
applications that require full-table scans, many table joins, sorting, or aggregation
against very large volumes of data.
When a query scans the entire contents of the data stored in a database, its speed will
be limited by the bandwidth of its connections to the physical storage. Greenplum
Database’s shared-nothing approach separates the physical storage into small units on
individual segment instances, each with a dedicated, independent high-speed channel
connection to local disks.

Figure 1.2 Shared Everything versus Shared Nothing Architectures
(In a shared-everything system, all processors contend for the same buffers,
locks, and control blocks; in a shared-nothing system, each processing unit has
its own.)

These segment instances are connected by the Greenplum Database Interconnect and
database optimizer technology. They perform work in parallel and use all disk
connections simultaneously. As a result, the database system consists of a number of
self-contained parallel processing units and is able to scale storage capacity and
processing power together to answer complex queries on growing data repositories.
Each segment instance acts as a self-contained database processor that owns and
manages a distinct portion of the overall data.
Because shared-nothing databases automatically distribute data and make query
workloads parallel across all available hardware, they dramatically outperform
general-purpose database systems for BI workloads.


High Availability
Greenplum Database provides for redundancy of its components so that there is no
single point of failure in the Greenplum Database system. Greenplum Database
provides a high degree of system fail-over through cyclical data redundancy. Each
data segment is mirrored on an alternate host, where each segment instance manages
one segment (either a primary or a backup segment). In a typical Greenplum Database
implementation, there is generally one segment instance per CPU, several to a host.
When an active segment instance fails, Greenplum Database redirects connections to
the backup segment on an alternate host.

Parallel Data Loading


One challenge of large scale, multi-terabyte data warehouses is getting large amounts
of data loaded within a given maintenance window. Greenplum supports fast, parallel
data loading with its external tables feature. Using external tables, data can be loaded
in excess of 2 TB an hour.
External tables provide an easy way to perform basic extraction, transformation, and
loading (ETL) tasks that are common in data warehousing. External table files are
read in parallel by the Greenplum Database segment instances, so they also provide a
means for fast data loading. External tables consist of flat files that reside
outside of the database. Creating an external table allows you to access these flat files
as though they were a regular database table. External table data can be queried
directly (and in parallel) using regular SQL commands.
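A minimal sketch of the external tables feature (the table definition, host,
and file path are illustrative, and the available LOCATION protocols and FORMAT
options are described in the Greenplum Database User Guide):

    -- Define an external table over a pipe-delimited flat file.
    CREATE EXTERNAL TABLE ext_sale (
        cn  integer,
        vn  integer,
        pn  integer,
        dt  date,
        qty integer,
        prc float
    )
    LOCATION ('file://etlhost/data/sale.dat')
    FORMAT 'TEXT' (DELIMITER '|');

    -- Query the flat file directly, or use it to load a regular table in parallel.
    INSERT INTO sale SELECT * FROM ext_sale;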

Workload Management
The purpose of Greenplum workload management is to limit the number of active
queries in the system at any given time in order to avoid exhausting system resources
such as memory, CPU, and disk I/O. This is accomplished by creating role-based
resource queues. A resource queue has attributes that limit the size and/or total
number of queries that can be executed by the users (or roles) in that queue. By
assigning all of your database roles to the appropriate resource queue, administrators
can control concurrent user queries and prevent the system from being overloaded.
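A brief sketch of how this looks in SQL (the queue name is illustrative, and
the ACTIVE THRESHOLD clause is an assumption about the exact limit syntax; see
the Greenplum Database User Guide for the CREATE RESOURCE QUEUE reference):

    -- Create a queue that limits how many of its members' queries run at once.
    CREATE RESOURCE QUEUE adhoc ACTIVE THRESHOLD 3;  -- clause syntax: see User Guide

    -- Assign a role to the queue; that role's queries are then governed by it.
    ALTER ROLE analyst RESOURCE QUEUE adhoc;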

Scalability and Flexibility


Greenplum Database allows for incremental, host-centric expansion. Compute,
bandwidth, or mass storage capacity issues can be easily addressed and corrected
simply by adding an extra host (or hosts) into the system. With linear scalability
inherent to the Greenplum Database architecture, you can easily model and make
provisions for how many hosts will be required to support data warehouse growth.

Performance and Price


Compared to traditional data warehousing solutions, Greenplum Database offers the
best price/performance ratio on the market. Greenplum Database has the lowest
overall total cost of ownership — 80% less than Teradata, 70% less than Oracle on
SMP, and 50% less than Netezza.

Greenplum Database achieves its tremendous performance advantages through
parallelism. SQL statements executed within Greenplum Database are broken into
smaller components, and all components are worked on at the same time by the
individual segments to deliver a single result set. All relational operations—such as
table scans, index scans, joins, aggregations, and sorts—execute in parallel across the
segments simultaneously. Each operation is performed on a segment independent of
the data associated with the other segments. This parallel execution delivers results up
to 100 times faster than traditional database management systems.
Greenplum Database’s use of commodity hardware and its flexible architecture allow
you to dial in your performance/cost ratio by choosing a deep configuration (smaller
numbers of servers with a higher density of data stored on each disk) or a fast
configuration (larger numbers of servers with a lower density of data stored on each
disk).

Figure 1.3 Price / Performance Advantage of Greenplum Database
(A chart of cost versus number of terabytes: traditional data warehouse
solutions sit at high cost, while Greenplum Database configurations span a
dial-in range from fastest to deepest.)

About Greenplum
Greenplum was formed in 2003 by the merger of Metapa and Didera with the goal of
developing a low cost, high-performance, large-scale data warehousing solution built
on open source software. Greenplum is led by pioneers in open source, database
systems, data warehousing, supercomputing, and Internet performance acceleration
with technical staff from companies such as Oracle, Sybase, Informix, Teradata,
Netezza, Tandem, and Sun. Greenplum company headquarters are in San Mateo,
California.

The Greenplum Vision


By utilizing open source software and commodity, off-the-shelf hardware,
Greenplum’s vision is to make enterprise data as available and easy to use for business
users as Web data is for consumers. Businesses need to make faster, more accurate


business decisions to gain competitive advantage. The task of managing and scaling
data for business reporting has traditionally been difficult and expensive, and in the
past 20 years the database infrastructure on which business intelligence (BI) systems
are built has not evolved significantly.
Greenplum recognizes that companies are moving to low cost computing and
replacing proprietary, Unix-based hardware and software with Intel-based hardware
running Linux. Greenplum’s offerings are specifically designed to help companies
take advantage of the price and performance returns of Linux.
Greenplum Database’s target markets include organizations that manage large and
growing data volumes. Customer warehouse sizes range from under one terabyte to
multi-terabyte systems. Primary markets include telecommunications companies,
retail firms, financial services companies, online service providers and application
service providers.


2 Greenplum Database Architecture


This chapter describes the Greenplum Database architecture and components. It
contains the following topics:
• The Greenplum Database High-Level Architecture
• High Availability Architectures
• Bizgres Loader
• Management and Monitoring

The Greenplum Database High-Level Architecture


A database in Greenplum is actually an array of individual databases, usually running
on different servers or hosts, all working together to present a single database image.
The master is the entry point to the Greenplum Database system. It is the database
instance where users connect to the database and execute SQL statements. The master
coordinates the work amongst the other database instances in the system—the
segments, where the user data resides.

Figure 2.1 High-Level Greenplum Database Architecture
(Client sessions reach the master instance, which communicates with segment
instances 1 through n over the Interconnect, a Gigabit Ethernet switch.)

This section describes the basic components of the Greenplum Database system, and
how they work together:
• The Master
• The Segments


• The Interconnect

The Master
The master is the entry point to the Greenplum Database system. It is the database
process that accepts client connections and processes the SQL commands issued by
the users of the system.
Since Greenplum Database is based on PostgreSQL, end-users interact with
Greenplum Database (through the master) as they would a typical PostgreSQL
database. They can connect to the database using client programs such as psql or
application programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides (the set of system tables that
contain metadata about the Greenplum Database system itself); however, the master
does not contain any user data. User data resides only on the segments. The master
does the work of processing the incoming SQL commands, distributing the work load
between the segments, coordinating the results returned by each of the segments, and
presenting the final results to the user.
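Because the master presents a standard PostgreSQL interface, ordinary catalog
queries work unchanged once a client (psql, JDBC, ODBC) is connected. For
example:

    -- List the databases known to the Greenplum master.
    SELECT datname FROM pg_database;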

The Segments
In Greenplum Database, the segments are where the user data resides. User-defined
tables and their indexes are distributed across the available segments in the
Greenplum Database system, each segment containing a distinct portion of the data.
Segment instances are the database server processes that serve segments. Users do not
interact directly with the segment instances in a Greenplum Database system, but do
so through the master.
In the recommended Greenplum Database hardware configuration, there is one
primary segment instance per effective CPU or CPU core.

The Interconnect
The Interconnect component in Greenplum Database is responsible for moving data
between the segments during query execution. The Interconnect delivers messages,
moves data, collects results, and coordinates work among the segments in the system.
The Interconnect rides on top of a standard Gigabit Ethernet switching fabric.

High Availability Architectures


This section explains how to deploy Greenplum Database in environments that
demand high availability and data redundancy. It covers the following topics:
• Mirroring and Fault Tolerance
• Fault Detection and Recovery
• No Single Point of Failure


Mirroring and Fault Tolerance


When you deploy your Greenplum Database system, you have the option to configure
mirror segments.
Mirror segments allow database queries to fail over to a backup segment if the primary
segment is unavailable. To configure mirroring, you must have enough hosts in your
Greenplum Database system so that the secondary segment always resides on a
different host than its primary. Figure 2.2 shows how table data is distributed across
the segments when mirroring is configured. The mirror segment for a distributed table
resides on a different host than its primary segment. Primary segments and mirror
segments are served by different segment instances.

Figure 2.2 Data Mirroring in Greenplum Database
(The MPP master on the master host holds the global catalog. Hosts 1 through n
each run a primary segment instance and a mirror segment instance, arranged so
that every mirror segment resides on a different host than its corresponding
primary: for example, host 1 holds segment 1 (primary) and segment 2 (mirror).)

Fault Detection and Recovery


Greenplum Database is able to detect when a host is unavailable or when a segment
database server process is down. When this occurs, the master will mark the primary
segments on that host as out-of-service and immediately switch over to the mirror
segments so that operations can continue. If there is no primary or mirror available for
a particular segment (for example, if paired mirror hosts are down) then the entire
Greenplum Database system will be unavailable until they are brought back online.
Administrators can choose whether to operate Greenplum Database in read-only or
read-write mode. If the system is in read-only mode, then failed segments can be
automatically re-introduced into the system once they are back online. In read-write
mode, the administrator must bring the failed segment back online after synchronizing
the data with the current active segment. Greenplum Database provides utilities for
managing these tasks.


No Single Point of Failure


In addition to the data redundancy provided by mirroring, you can deploy Greenplum
Database so that there is no single point of failure in the system. Failover and recovery
for the segment instances is handled by mirroring. Redundancy for the other major
Greenplum Database components, the master and the Interconnect, is covered in this
section.

Figure 2.3 High Availability Greenplum Database Architecture
(Two Gigabit Ethernet switches provide redundant Interconnect paths to segment
instances 1 through n; a primary master instance and a backup master instance
are kept current through automatic synchronization.)

Redundant Greenplum Database Masters


The Greenplum Database master instance is the entry point to the Greenplum database
system. If the master host or database server process becomes unavailable,
administrators can bring the backup master online in its place. Since the master does
not contain any user data, only the system catalog tables need to be synchronized
between the primary and backup copies. These tables are not updated frequently, but
when they are, changes are automatically copied over to the backup master so that it is
always kept current with the primary.

Redundant Interconnect
The Greenplum Database Interconnect rides on top of a standard Gigabit Ethernet
switching fabric. A highly available Interconnect can be achieved by deploying dual
Gigabit Ethernet switches on your network, and redundant Gigabit connections to the
Greenplum Database host servers.


Bizgres Loader
Bizgres Loader is a Java command-line program that can be used to load large
quantities of data into a Greenplum Database database. Its functionality and data
formatting options are similar to the PostgreSQL COPY command, which is also
supported in Greenplum Database. Bizgres Loader provides several additional
features such as error logging, concurrent load execution, optimized performance,
data batching, and enhanced configuration options.
Bizgres Loader can be run from the master or any host on the network connected to
the master. It takes as input a control file. The control file contains one or more LOAD
commands, which are the specifications for loading the data into a table. Data can be
loaded from either a data file or standard input.
Bizgres Loader can be run in batch mode, which means that large quantities of data
are loaded into the database in batches. You can specify the number of rows and a size
limit that define a batch. If a batch contains data errors, the bad batches are written to
a log file so you can go back and troubleshoot the rows that did not get loaded.

Figure 2.4 The Bizgres Loader High-Level Architecture
(Bizgres Loader reads a control file and source data, writes load log files,
and loads data into segment instances 1 through n.)

Management and Monitoring


Administration of the Greenplum Database system is handled through two easy-to-use
interfaces.
• System Management Suite
• Greenplum Monitor


System Management Suite


The system management suite is a command-line interface for performing the
common administration tasks of Greenplum Database. Most of the functionality
provided by the PostgreSQL client and server applications is also provided in
Greenplum Database; however, some of these applications have been modified to
handle the distributed nature of a Greenplum database. For more information on the
supported PostgreSQL management features, see “Supported PostgreSQL Client
Applications” and “Supported PostgreSQL Server Applications” in Chapter 4.
The system management suite handles the following Greenplum Database
administration tasks:
• Installing Greenplum Database on an Array
• Initializing a Greenplum Database System
• Starting and Stopping Greenplum Database
• Adding or Removing a Host
• Managing Recovery for Failed Segment Instances
• Managing Failover and Recovery for Failed Master Instances
• Backing Up and Restoring a Greenplum Database Database (in Parallel)
• Loading Data in Parallel
• System State Reporting

Greenplum Monitor
The Greenplum Monitor server sits outside of your Greenplum Database core
installation and monitors the system state, network activity, processing load, database
capacity and utilization, user connections, and system alerts. Each master instance and
segment instance has a small monitoring agent running on it that reports status
information back to the Greenplum Monitor server. The status information is stored in
a standard Bizgres or PostgreSQL database running on the Greenplum Monitor server.


Figure 2.5 Greenplum Monitor High-Level Architecture
(A monitoring agent runs on the master instance and on each segment host,
reporting over the Gigabit Ethernet switch to the monitoring server, which
hosts the Greenplum Monitor UI.)

Administrators access the Greenplum Monitor data through a web application that is
also running on the Greenplum Monitor server host.


3 Greenplum Database System Processes


This chapter describes how some basic database processes work in Greenplum
Database. It contains the following topics:
• How Greenplum Database Stores Data
• How Greenplum Database Executes Queries

How Greenplum Database Stores Data


To understand how Greenplum Database stores data across the various hosts and
segment instances, consider the following simple logical database. In Figure 3.1,
primary keys are shown in bold font and foreign key relationships are indicated by a
line from the foreign key in the referring relation to the primary key of the referenced
relation. In data warehouse terminology, this is referred to as a star schema. In this
type of database schema, the sale table is usually called a fact table and the other
tables (customer, vendor, product) are usually called the dimension tables.

customer (cn integer, cname text)
sale     (cn integer, vn integer, pn integer, dt date, qty integer, prc float)
vendor   (vn integer, vname text, loc text)
product  (pn integer, pname text)

Figure 3.1 Sample Database Star Schema
(Primary keys, shown in bold in the original figure, are customer.cn,
vendor.vn, and product.pn; the sale fact table references them through its cn,
vn, and pn columns.)

In Greenplum Database all tables are distributed, which means a table is divided into
non-overlapping sets of rows or parts. Each part resides on a single database known as
a segment within the Greenplum Database system. The parts are distributed evenly
across all of the available segments using a sophisticated hashing algorithm.
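The distribution policy is declared when a table is created. A sketch using the
sale table from Figure 3.1 (the choice of cn as the distribution key is
illustrative):

    -- Rows of sale are hashed on cn and spread across all segments.
    CREATE TABLE sale (
        cn  integer,
        vn  integer,
        pn  integer,
        dt  date,
        qty integer,
        prc float
    ) DISTRIBUTED BY (cn);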


The Greenplum Database physical database implements the logical database on an
array of individual database instances — a master instance and two or more segment
instances. The master instance does not contain any user data, only the global catalog
tables. The segment instances contain disjoint parts (collections of rows) for each
distributed table.

Figure 3.2 Table Distribution in a Greenplum Database Physical Database
(The master holds only the global catalog; segments 1, 2, and 3 each hold a
distinct part of the sale, customer, product, and vendor tables.)

How Greenplum Database Executes Queries


For most purposes, you can query a Greenplum Database database in the same way
you would any RDBMS. For example, the following query will produce the expected
result as you would see in any RDBMS:
SELECT c.cname FROM customer c, sale s WHERE c.cn = s.cn;
The difference with Greenplum Database is that the tables customer and sale are
actually distributed across an array of machines each of which is running one or more
independent database servers (called segment instances in Greenplum Database).
What actually happens with the example query above is that Greenplum Database
recognizes that the customer and sale tables are distributed. It transforms
the query into slices that can be executed against each of the databases in the
system, and then collects the individual results into a single result to present
to the user.
Greenplum Database performance advantages are achieved through query parallelism.
That is, single SQL statements executed within Greenplum Database are broken into
smaller components, and all components are worked on at the same time by the
individual databases to deliver a single result set. All relational operations—such as
table scans, index scans, joins, aggregations, and sorts—execute in parallel across the
segments simultaneously. Each operation is performed on a segment database
independent of the data associated with the other segment databases.

Example of Query Processing in Greenplum Database


To illustrate how query processing works in Greenplum Database, consider the
following query:
SELECT c.cname FROM customer c, sale s WHERE c.cn = s.cn;


The query is received through the master, which parses the query, optimizes the query,
and creates a query execution plan. The query execution plan is divided and sent to the
individual segment instances for execution, where the segment instances execute their
slice of the plan in parallel.
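You can inspect this division of work yourself, since EXPLAIN is supported (see
Chapter 4); the plan for the example join includes the motion operations,
described below, that move tuples between segments:

    -- Show the parallel query plan for the example join.
    EXPLAIN SELECT c.cname FROM customer c, sale s WHERE c.cn = s.cn;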

Figure 3.3 Query Parallelism in Greenplum Database
(The master instance dispatches work to segment instances 1 through n, which
execute their slices of the plan in parallel.)

Query parallelism is enabled in Greenplum Database by using a sophisticated hashing
algorithm to distribute the data across all segments defined in the system. Data rows
are assigned to a specific segment, and each segment is then responsible for executing
local database operations for its particular set of data. Greenplum’s hash distribution
scheme ensures balanced processing and data placement across the entire system.
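The distribution scheme for each table is chosen at creation time; in addition
to hash distribution on one or more columns, the SQL Support table in Chapter 4
lists a DISTRIBUTED RANDOMLY option for spreading rows without a key. A sketch
(table names are illustrative):

    -- Hash-distribute on a column (co-locates rows that share a cn value).
    CREATE TABLE sale_by_cn (cn integer, qty integer) DISTRIBUTED BY (cn);

    -- Even, key-less distribution when no natural distribution column exists.
    CREATE TABLE staging_rows (payload text) DISTRIBUTED RANDOMLY;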
Greenplum executes database queries by creating a query execution plan, which is a
tree plan of nodes. The bottom nodes are generally scan nodes (either table scans or
index scans), and the higher-level nodes might be sort, join, or aggregation nodes.
Greenplum Database supports an additional executor node type called a motion node.
The purpose of the motion node is to move tuples among the segments.
The following diagram shows how a query plan is executed on an individual segment
instance:


Figure 3.4 Example Query Plan Processing on a Segment Instance
(Scans of the sale and customer tables feed motion nodes that exchange tuples
with the other segment instances; hash nodes receive the redistributed tuples
and a hash join combines them; the result is sent to the GP master instance.
The numbered steps below trace this flow.)

1. The customer and sale table segments are scanned on a segment instance.

2. Tuples are dynamically rehashed on the join column and sent to the correct
segment instance.

3. The hash node receives its slice of data from all other segments and starts the join
operation.

4. The local segment result set is sent to the master, which materializes the results
and presents them to the user.


4 System Requirements and Supported Features
This section describes the hardware system requirements for a host that is running
Greenplum Database. It also describes the PostgreSQL features that are supported in
Greenplum Database.
• System Requirements
• Supported Features

System Requirements
This section outlines the minimum system requirements for a machine that will be
running Greenplum Database.

System Prerequisites
The following table lists minimum recommended specifications for servers intended
to support Greenplum Database. It is recommended that you work with your
Greenplum Systems Engineer to review your anticipated environment to ensure an
appropriate configuration for Greenplum Database.
Table 4.1 System Prerequisites for Greenplum Database 3.0

Minimum CPU
    Pentium Pro compatible (P3/Athlon and above)

Minimum Memory
    1 GB RAM per server

Disk Requirements
    • 32 MB required for installation
    • Appropriate free space for data
    • 700 MB free space for the Installation Verification Test
    • High-speed, local storage for Greenplum Database databases

Network Requirements
    Gigabit Ethernet within the array

Software and Utilities
    bash shell
    GNU tar
    GNU zip
    GNU readline
    GCC runtime libraries (glibc, etc.)

Supported Operating System Environments


The following table indicates the currently supported production operating systems
and versions/patch levels.
Table 4.2 Supported OS Environments for Greenplum Database 3.0

Red Hat Enterprise Linux 3.0 or higher
Solaris x86 v10


Recommended Hardware Configuration


Greenplum Database supports standard hardware configurations from Dell, HP, Sun,
and other hardware vendors. The recommended Greenplum Database compute host
has the following hardware resources:
• 2 dual-core CPUs (typically Xeon or Opteron)
• 16 GB of RAM
• 2 Gigabit Ethernet interfaces
• 1 SATA RAID disk controller per 8 drives
• 16 SATA 400 GB hard drives

Supported Features
Greenplum Database is based on PostgreSQL 8.2.4 and supports many of the database
features, SQL commands, client applications, and server applications provided by
PostgreSQL. Some features of PostgreSQL have been modified or are unsupported in
Greenplum Database due to the distributed nature of the Greenplum database and its
parallel architecture.
This section provides a reference of the PostgreSQL features and how those features
are supported in Greenplum Database. For more information on PostgreSQL, refer to
the PostgreSQL Documentation.

SQL Support
The following table lists all of the SQL commands supported in PostgreSQL, and
whether or not they are supported in Greenplum Database. For full SQL syntax and
references, see the Greenplum Database User Guide.
Data Query and Manipulation Language (DQL/DML) is essentially supported as it is
in PostgreSQL. SELECT, INSERT, UPDATE, and DELETE are DQL/DML commands. All
other SQL commands are considered Data Definition Language (DDL) or utility
commands. The majority of DDL and utility statements are supported with a few
minor exceptions.
Table 4.3 SQL Support in Greenplum Database
(Each entry lists the SQL command, whether it is supported in Greenplum
Database, and any modifications, limitations, or exceptions.)
ALTER AGGREGATE YES

ALTER CONVERSION YES

ALTER DATABASE YES

ALTER DOMAIN YES

ALTER FUNCTION YES

ALTER GROUP YES Deprecated in PostgreSQL 8.1 - see ALTER ROLE


ALTER INDEX YES

ALTER LANGUAGE YES

ALTER OPERATOR YES

ALTER OPERATOR CLASS NO

ALTER RESOURCE QUEUE YES
    Greenplum Database workload management feature; not in PostgreSQL 8.2.4.

ALTER ROLE YES
    Greenplum Database Clauses: RESOURCE QUEUE queue_name | none

ALTER SCHEMA YES

ALTER SEQUENCE YES

ALTER TABLE YES
    Unsupported Clauses / Options: CLUSTER ON, ENABLE/DISABLE TRIGGER

ALTER TABLESPACE YES

ALTER TRIGGER NO

ALTER TYPE YES

ALTER USER YES Deprecated in PostgreSQL 8.1 - see ALTER ROLE

ANALYZE YES

BEGIN YES

CHECKPOINT YES

CLOSE YES

CLUSTER YES

COMMENT YES

COMMIT YES

COMMIT PREPARED NO

COPY YES
    Modified Clauses: ESCAPE [ AS ] 'escape' | 'OFF'

CREATE AGGREGATE YES
    Unsupported Clauses / Options: [ , SORTOP = sort_operator ]
    Greenplum Database Clauses: [ , PREFUNC = prefunc ]
    Limitations: The functions used to implement the aggregate must be
    IMMUTABLE functions.

CREATE CAST YES


CREATE CONSTRAINT TRIGGER NO

CREATE CONVERSION YES

CREATE DATABASE YES

CREATE DOMAIN YES

CREATE EXTERNAL TABLE YES
    Greenplum Database parallel ETL feature; not in PostgreSQL 8.2.4.

CREATE FUNCTION YES
    Limitations: Functions defined as STABLE or VOLATILE can be executed in
    Greenplum Database provided that they are executed on the master only.
    STABLE and VOLATILE functions cannot be used in statements that execute
    at the segment level. See the Greenplum Database User Guide for more
    information.

CREATE GROUP YES Deprecated in PostgreSQL 8.1 - see CREATE ROLE

CREATE INDEX YES
    Limitations: UNIQUE indexes are allowed only if the initial columns of the
    index key are identical to the Greenplum distribution key.

CREATE LANGUAGE YES

CREATE OPERATOR YES
    Limitations: The function used to implement the operator must be an
    IMMUTABLE function.

CREATE OPERATOR CLASS NO

CREATE RESOURCE QUEUE YES
    Greenplum Database workload management feature; not in PostgreSQL 8.2.4.

CREATE ROLE YES
    Greenplum Database Clauses: RESOURCE QUEUE queue_name | none

CREATE RULE YES

CREATE SCHEMA YES

CREATE SEQUENCE YES
    Limitations:
    • The lastval and currval functions are not supported.
    • The setval function is only allowed in queries that do not operate on
      distributed data.
    • The nextval function is not allowed in UPDATE or DELETE queries if
      mirrors are enabled.


CREATE TABLE YES
    Unsupported Clauses / Options: [GLOBAL | LOCAL], REFERENCES, FOREIGN KEY,
    [DEFERRABLE | NOT DEFERRABLE]
    Limited Clauses:
    • Only one UNIQUE constraint allowed per table
    • UNIQUE not allowed if the table has a PRIMARY KEY
    Greenplum Database Clauses:
    DISTRIBUTED BY (column, [ ... ] ) | DISTRIBUTED RANDOMLY

CREATE TABLE AS YES

CREATE TABLESPACE YES
    Greenplum Database Clauses:
    LOCATION 'segdir', 'segdir', ...
    MIRROR LOCATION 'segdir', 'segdir', ...

CREATE TRIGGER NO

CREATE TYPE YES
    Limitations: The functions used to implement a new base type must be
    IMMUTABLE functions.

CREATE USER YES Deprecated in PostgreSQL 8.1 - see CREATE ROLE

CREATE VIEW YES

DEALLOCATE YES

DECLARE YES
    Unsupported Clauses / Options: SCROLL, FOR UPDATE [ OF column [, ...] ]
    Limitations: Cursors are non-scrollable and non-updateable.

DELETE YES
    Unsupported Clauses / Options: RETURNING
    Limitations:
    • Joins must be on a common Greenplum distribution key (equijoins)
    • Cannot use STABLE or VOLATILE functions in a DELETE statement if
      mirrors are enabled

DROP AGGREGATE YES

DROP CAST YES

DROP CONVERSION YES

DROP DATABASE YES


DROP DOMAIN YES

DROP EXTERNAL TABLE YES
    Greenplum Database parallel ETL feature; not in PostgreSQL 8.2.4.

DROP FUNCTION YES

DROP GROUP YES Deprecated in PostgreSQL 8.1 - see DROP ROLE

DROP INDEX YES

DROP LANGUAGE YES

DROP OPERATOR YES

DROP OPERATOR CLASS NO

DROP OWNED YES

DROP RESOURCE QUEUE YES
    Greenplum Database workload management feature; not in PostgreSQL 8.2.4.

DROP ROLE YES

DROP RULE YES

DROP SCHEMA YES

DROP SEQUENCE YES

DROP TABLE YES

DROP TABLESPACE YES

DROP TRIGGER NO

DROP TYPE YES

DROP USER YES Deprecated in PostgreSQL 8.1 - see DROP ROLE

DROP VIEW YES

END YES

EXECUTE YES

EXPLAIN YES

FETCH YES
    Unsupported Clauses / Options: LAST, PRIOR, BACKWARD, BACKWARD ALL
    Limitations: Cannot fetch rows in a nonsequential fashion (no scrolling).

GRANT YES


INSERT YES
    Unsupported Clauses / Options: RETURNING

LISTEN NO

LOAD YES

LOCK YES

MOVE YES See FETCH

NOTIFY NO

PREPARE YES

PREPARE TRANSACTION NO

REASSIGN OWNED YES

REINDEX YES

RELEASE SAVEPOINT YES

RESET YES

REVOKE YES

ROLLBACK YES

ROLLBACK PREPARED NO

ROLLBACK TO SAVEPOINT YES

SAVEPOINT YES

SELECT YES

SELECT INTO YES
    Limitations:
    • Limited use of VOLATILE and STABLE functions in FROM or WHERE clauses
    • Limited use of correlated subquery expressions (see the Greenplum
      Database User Guide)

SET YES

SET CONSTRAINTS NO
    In PostgreSQL, this only applies to foreign key constraints, which are
    currently not supported in Greenplum Database.

SET ROLE YES

SET SESSION AUTHORIZATION YES Deprecated in PostgreSQL 8.1 - see SET ROLE

SET TRANSACTION YES

SHOW YES

START TRANSACTION YES

TRUNCATE YES

UNLISTEN NO


UPDATE YES
    Unsupported Clauses: RETURNING
    Limitations:
    • SET not allowed for Greenplum distribution key columns.
    • Joins must be on a common Greenplum distribution key (equijoins).
    • Cannot use STABLE or VOLATILE functions in an UPDATE statement if
      mirrors are enabled.

VACUUM YES
    Limitations: VACUUM FULL is not recommended in Greenplum Database.

VALUES YES
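To make two of the limitations above concrete, here is a sketch that follows
the CREATE TABLE and CREATE INDEX rules (the table and index names are
illustrative):

    CREATE TABLE customer (
        cn    integer,
        cname text
    ) DISTRIBUTED BY (cn);

    -- Allowed: the unique index leads with the distribution key column.
    CREATE UNIQUE INDEX customer_cn_idx ON customer (cn);

    -- Not allowed: cname is not the distribution key, so uniqueness cannot be
    -- enforced locally on each segment.
    -- CREATE UNIQUE INDEX customer_cname_idx ON customer (cname);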

Supported PostgreSQL Client Applications

This section contains support information for PostgreSQL client applications
and utilities. In Greenplum Database, client applications must be run on the
master, and most require super user privileges.

Table 4.4 PostgreSQL Client Application Support in Greenplum Database
(Each entry lists the client application, whether it is supported in Greenplum
Database, a description, and any notes.)

clusterdb YES
    Cluster a database.

createdb YES
    Create a new database.

createlang YES
    Define a new procedural language.

createuser YES
    Define a new Greenplum Database role (user or group).

dropdb YES
    Remove a database.

droplang YES
    Remove a procedural language.

dropuser YES
    Remove a Greenplum Database role (user or group).

ecpg YES
    Embedded SQL C preprocessor.

gp_dump YES
    Dumps a Greenplum Database database in parallel by dumping each segment
    instance individually. See also pg_dump.

gp_dump_agent YES
    Called by gp_dump to create a dump file for an individual segment instance.

gp_restore YES
    Restores a Greenplum database in parallel from archive files created by
    gp_dump. See also pg_restore.

gp_restore_agent YES
    Called by gp_restore to restore an individual segment instance from its
    associated dump file.

gpsyncmaster YES
    Process that synchronizes a standby master host with the primary master
    host.

pg_config YES
    Retrieve information about the installed version of Greenplum Database.
    Prints information for the Greenplum master instance only.

pg_dump YES
    Extract a Greenplum Database database into a single script file or other
    archive file. Only use pg_dump when the system is quiet or no DML is being
    executed; the dump operation is not executed in a serializable transaction.
    Use the --gp-syntax command-line option to include the DISTRIBUTED BY
    clause in CREATE TABLE statements. See also gp_dump.

pg_dumpall YES
    Extract a database cluster into a script file (all databases, roles, and
    system catalog info). Only use pg_dumpall when the system is quiet or no
    DML is being executed; the dump operation is not executed in a serializable
    transaction. Use the --gp-syntax command-line option to include the
    DISTRIBUTED BY clause in CREATE TABLE statements. See also gp_dump.

pg_restore YES
    Restore a Greenplum Database database from an archive file created by
    pg_dump. See also gp_restore.

psql YES
    PostgreSQL / Greenplum Database interactive terminal.

reindexdb YES
    Reindex a database.

vacuumdb YES
    Garbage-collect and analyze a database.


Client API Support


Greenplum Database also supports the following APIs (as well as client software that
utilizes these APIs) for accessing a database. Client connections to the database are
made through the master using the standard PostgreSQL database drivers.
• ODBC (Open Database Connectivity)
• JDBC (Java Database Connectivity)
• Perl DBI (Perl Database Interface)

Supported PostgreSQL Server Applications


This section contains support information and links for PostgreSQL server
applications and utilities. In Greenplum Database, server applications must be run on
the master.
Table 4.5 PostgreSQL Server Application Support in Greenplum Database

Server Application | Description | Supported in Greenplum | Notes

initdb | Initialize a Greenplum Database instance. | REPLACED | Called by gpcreatecluster.sh, which initializes the database catalog and data storage areas on the master and all segment instances.

ipcclean | Remove shared memory and semaphores from a failed Greenplum Database server. | NO

pg_controldata | Display control information for a Greenplum Database database. | REPLACED | Replaced by gpstate.sh, which reports information for a Greenplum Database database, including the master and all segment instances.

pg_ctl | Start, stop, or restart Greenplum Database. | REPLACED | Replaced by gpstart.sh and gpstop.sh, which start and stop the entire Greenplum Database system.

pg_resetxlog | Reset the write-ahead log and other control information of a Greenplum Database database. | NO


postgres | The postgres executable is the actual PostgreSQL server process that processes queries. It is not called directly; instead, a postmaster process is started. | REPLACED | In Greenplum Database, use gpstart.sh and gpstop.sh to start and stop all postmasters in the system at once. The postmaster process creates postgres processes as needed to handle client connections.

postmaster | postmaster is the PostgreSQL database server listener process that accepts client connections. In Greenplum Database, a postmaster process runs on the Greenplum master instance and on each segment instance. | REPLACED | In Greenplum Database, use gpstart.sh and gpstop.sh to start and stop all postmasters in the system at once. Starting a postmaster process directly on a single host in a Greenplum array is not recommended.


Glossary
A

array
The set of physical devices (hosts, servers, switches, etc.) used to house a Greenplum Database
system.

B
bandwidth
Bandwidth is the maximum amount of information that can be transmitted along a channel, such as
a network or I/O channel. This data transfer rate is usually measured in megabytes per second (MB/s).

BIOS
This is the Basic Input/Output System and is installed on the computer’s motherboard. It controls the
most basic operations and is responsible for starting up the computer and initializing hardware such
as disk drives, I/O devices, and so on.

C
catalog
See system catalog.

cluster
In PostgreSQL, a cluster refers to a collection of databases that is managed by a single instance of a
running database server. In file system terms, a database cluster is a single directory under which all
data is stored, referred to as the data directory or data storage area. In Greenplum Database, a cluster
refers to a single global catalog (on the master) and the database objects owned and contained in that
catalog.

correlated subquery
A correlated subquery is a nested SELECT statement that refers to a column from an outer SELECT
statement. For example:
SELECT * FROM product
WHERE EXISTS (SELECT * FROM sale WHERE qty > 0 AND pn = product.pn);

CPU
CPU stands for Central Processing Unit; the CPU is often simply called the processor. It is the part of a computer (a microprocessor chip) that does most of the data processing. Sometimes the term CPU is used to describe the whole box that contains the chip (along with the motherboard, expansion cards, disk drives, power supply, and so on).
See also dual-core CPU.


D
data directory
The data directory is the location on disk where database data is stored. The master data directory contains the global system catalog only; no user data is stored on the master. The data directories on the segment instances contain user data for their segment plus a local system catalog. The data directory contains several subdirectories, control files, and configuration files.

distributed
Certain database objects in Greenplum Database, such as tables and indexes, are distributed. They are divided into equal parts and spread out amongst the segment instances based on a hashing algorithm. To the end user and client software, however, a distributed object appears as a conventional database object.

distribution key
In a Greenplum table, one or more columns are used as the distribution key, meaning those columns are used to divide the data amongst all of the segments. The distribution key should be the primary key of the table or a unique column. If that is not possible, choose a column with many distinct values to ensure the most even data distribution.
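For example, a table can be created with an explicit distribution key using the DISTRIBUTED BY clause (a minimal sketch; the table and column names are hypothetical):

    CREATE TABLE customer (
        cust_id   int,           -- unique identifier: a good distribution key
        name      varchar(100),
        region    char(2)        -- few distinct values: a poor distribution key
    ) DISTRIBUTED BY (cust_id);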

DDL
Data Definition Language. A subset of SQL commands used for defining and examining the structure
of a database.

DML
Data Manipulation Language. SQL commands that store, manipulate, and retrieve data from tables. INSERT, UPDATE, DELETE, and SELECT are DML commands.

dual-core CPU
A dual-core CPU is basically two separate processors on a single microprocessor chip. Those two
processors can outperform single-core processors on most multithreaded applications while running
at lower clock speeds and consuming less power. An application with multiple software threads, such
as Greenplum Database, will run faster on a dual-core processor because the operating system can
assign an individual thread to its own processor core. Multithreaded applications running on a
single-core processor must wait for one thread to finish before another thread can be processed.
See also CPU.

G
Greenplum Database
Greenplum Database is the industry’s first massively parallel processing (MPP) database server
based on open-source technology. It is explicitly designed to support business intelligence (BI)
applications and large, multi-terabyte data warehouses. Greenplum Database is based on
PostgreSQL.


Greenplum Database system
One or more Greenplum instances running on an array, which can be composed of one or more hosts.

Greenplum instance
The process that serves a database. An instance of Greenplum Database is composed of a master instance and one or more segment instances.

H
host
A host represents a physical machine or compute node in a Greenplum Database system. In
Greenplum Database, one host is designated as the master. The other hosts in the system have one or
more segment instances running on them.

hyper-threaded CPU
A hyper-threaded CPU is a single processor that presents itself to the operating system as two virtual
processors. The processor can work on two sets of tasks simultaneously, use resources that otherwise
would sit idle, and get more work done in the same amount of time. Hyper-threading was pioneered
by Intel on the Xeon processor family for servers. See also, dual-core CPU.

I
Interconnect
The Interconnect component in Greenplum is responsible for moving data between the segments
during query execution. The Interconnect delivers messages, moves data, collects results, and
coordinates work among the segments in the system. The Interconnect rides on top of a standard
Gigabit Ethernet switching fabric over a private local area network (LAN).

I/O
Input/Output (I/O) refers to the transfer of data to and from a system or device using a communication channel.

J
JDBC
Java Database Connectivity is an application program interface (API) specification for connecting
programs written in Java to data in a database management system (DBMS). The application
program interface lets you encode access request statements in SQL that are then passed to the
program that manages the database.


M
master
The master (also referred to as the Greenplum master or master instance) is the entry point to the
Greenplum Database system. It is the database listener process (postmaster process) that accepts
client connections and processes the SQL commands issued by the users of the system.
The master is where the global system catalog resides; however, the master does not contain any user data. User data resides only on the segment instances. The master does the work of processing the incoming SQL commands, distributing the workload among the segment instances, coordinating the results returned by each of the segments, and presenting the final results to the user.

master instance
The database process that serves the Greenplum master. See master.

mirror
A mirror is a backup copy of a segment (or master) that is stored on a different host than the primary
copy. Mirrors are useful for maintaining operations if a host in your Greenplum Database system
fails. Mirroring is an optional feature of Greenplum Database. Mirror segments are evenly distributed
amongst other hosts in the array. If a host that holds a primary segment fails, Greenplum Database
will switch to the mirror or secondary host.

motion node
A motion node is a portion of a query execution plan responsible for moving tuples amongst the Greenplum Database segment instances.

motherboard
The main circuit board of a computer, which houses all the vital components usually including the
microprocessor, internal memory, and device controllers such as for the disk drives.

MPP
Massively Parallel Processing.

O
ODBC
Open Database Connectivity, a standard database access method that makes it possible to access any
data from any client application, regardless of which database management system (DBMS) is
handling the data. ODBC manages this by inserting a middle layer, called a database driver, between
a client application and the DBMS. The purpose of this layer is to translate the application’s data
queries into commands that the DBMS understands.


P
partitioned
Partitioning is a way to logically divide the data in a table for better performance and easier
maintenance. In Greenplum Database, partitioning is a procedure that creates multiple sub-tables (or
child tables) from a single large table (or parent table). The primary purpose is to improve
performance by scanning only the relevant data needed to satisfy a query. Note that partitioned tables
are also distributed.
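As an illustration only, the following PostgreSQL-style sketch shows a parent table with one child (sub-) table created through inheritance; it is not necessarily the exact partitioning DDL used by Greenplum Database, and the table names are hypothetical:

    -- Parent table; child tables hold the actual date-range slices.
    CREATE TABLE sale (pn int, sdate date, qty int) DISTRIBUTED BY (pn);

    -- Child table for one quarter; the CHECK constraint describes its slice.
    CREATE TABLE sale_2007q1
        (CHECK (sdate >= '2007-01-01' AND sdate < '2007-04-01'))
        INHERITS (sale);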

Perl DBI
Perl Database Interface (DBI) is an API for connecting programs written in Perl to database management systems (DBMS). Perl DBI is the most common database interface for the Perl programming language.

PostgreSQL
PostgreSQL is a SQL-compliant, open source relational database management system (RDBMS).
Greenplum Database uses a modified version of PostgreSQL as its underlying database server. For
more information on PostgreSQL go to http://www.postgresql.org.

postgres process
The postgres process is the server process that processes queries. It is not called directly, but is a
subprocess of a postmaster process.

postmaster process
The postmaster is the PostgreSQL database server listener process. In order for a client application
to access a database it connects (over a network or locally) to a running postmaster process. The
postmaster also manages the communication among server subprocesses (see postgres process).
In Greenplum Database, there is a postmaster process running on the master and on each segment
instance. Users and client applications always connect to Greenplum Database using the postmaster
process on the master.

psql
This is the interactive terminal to PostgreSQL. You can use psql to access a database and issue SQL
commands. For more information on psql, see the psql reference in the PostgreSQL documentation.

Q
QD
See query dispatcher.

QE
See query executor.


query dispatcher
The query dispatcher (QD) is a process that is initiated when users connect to the master and issue
commands. This process represents a user session in the context of a Greenplum Database single
database image as a whole. The query dispatcher process spawns one or more query executor
processes to assist in the execution of SQL commands.

query executor
A query executor process (QE) is associated with a query dispatcher (QD) process and operates on
its behalf. Query executor processes run on the segment instances and execute their portion of a query
plan for a segment.

query execution plan
Greenplum Database executes database queries by creating a query execution plan, which is a tree of plan nodes for executing the query in the most efficient way. The bottom nodes are usually scan nodes (table or index scans), and the higher-level nodes might be sort, join, or aggregation nodes. Greenplum Database supports an additional executor node type called a motion node.
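You can view the plan generated for a query with the standard EXPLAIN command (a minimal sketch; the table and column names are hypothetical, and the exact plan output varies by system):

    EXPLAIN SELECT pn, SUM(qty) FROM sale GROUP BY pn;
    -- In Greenplum Database, the resulting plan includes motion nodes showing
    -- where tuples are moved between the segment instances and the master.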

R
rack
A type of shelving to which computer components can be attached vertically, one on top of the other.
Components are normally screwed into front-mounted, tapped metal strips with holes which are
spaced so as to accommodate the height of devices of various U-sizes. Racks usually have their
height denominated in U-units.

RAID
Redundant Array of Independent (or Inexpensive) Disks. RAID is a system of using multiple hard drives for sharing or replicating data among the drives. The benefit of RAID is increased data integrity, fault tolerance, and/or performance compared with using a single drive. Multiple hard drives are grouped and seen by the OS as one logical hard drive.

RAID-10 (1+0)
The most popular of the multiple RAID levels, RAID 10 combines the best features of striping and
mirroring to yield large arrays with high performance and superior fault tolerance. RAID 10 is a
stripe across a number of mirrored sets. Half of the total disk capacity is used to store the mirrored copies.

RAID-5
One of the most popular RAID levels, RAID 5 stripes both data and parity information across three
or more hard drives. It uses a distributed parity algorithm, writing data and parity blocks across all
the drives in the array. One disk’s worth of capacity is reserved for parity.


RAID-Z
The ZFS file system in Solaris provides a RAID-Z configuration with single parity fault tolerance,
which is similar to RAID-5. In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are
full-stripe writes. This design is only possible because ZFS integrates file system and device
management in such a way that the file system’s metadata has enough information about the
underlying data replication model to handle variable-width RAID stripes. If you have three disks in
a single-parity RAID-Z configuration, parity data occupies space equal to one of the three disks. No
special hardware is required to create a RAID-Z configuration.

RAM
Random Access Memory. The main memory of a computer system used for storing programs and
data. RAM provides temporary read/write storage while hard disks offer semi-permanent storage.

S
SATA
Serial Advanced Technology Attachment is a computer bus primarily designed for the transfer of data to and from a hard disk. Unlike IDE, which uses parallel signaling, SATA uses serial signaling technology. Because of this, SATA cables are thinner than the ribbon cables used by IDE hard drives. SATA cables can also be longer, allowing you to connect to more distant devices without fear of signal interference. There is also more room to grow, with data transfer speeds starting at 150 MB per second.

SCSI
Small Computer System Interface (pronounced ‘scuzzy’). A set of interface standards for connecting
certain peripheral devices, particularly mass storage units such as disk drives, to a computer system.
SCSI interfaces provide for faster data transmission rates (up to 80 megabytes per second) than
standard serial and parallel ports. In addition, you can attach many devices to a single SCSI port, so
that SCSI is really an I/O bus rather than simply an interface.

segment
A segment represents a portion of data in a Greenplum database. User-defined tables and their
indexes are distributed across the available number of segment instances in the Greenplum Database
system. Each segment instance contains a distinct portion of the user data. A primary segment
instance and its mirror both store the same segment of data.

segment instance
The segment instance is the database server process (postmaster process) that serves segments. Users
do not connect to segment instances directly, but through the master.

star schema
A relational database schema used in data warehousing. The star schema is organized around a
central table (fact table) joined to a few smaller tables (dimension tables) using foreign key
references. The fact table contains raw numeric items that represent relevant business facts (price,
number of units sold, etc.).
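For example, a typical star schema query joins the fact table to a dimension table through a foreign key and aggregates the numeric facts (a minimal sketch; the table and column names are hypothetical):

    SELECT d.region, SUM(f.amt) AS total_sales
    FROM sale_fact f
    JOIN store_dim d ON f.store_id = d.store_id
    GROUP BY d.region;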


system catalog
The system catalogs are the place where a relational database management system stores schema
metadata, such as information about tables and columns, and internal bookkeeping information. The
system catalog in Greenplum Database is the same as the PostgreSQL catalog with some additional
tables to support the distributed nature of the Greenplum system and databases. In Greenplum
Database, the master contains the global system catalog and each segment instance contains its own
local system catalog.
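Catalog metadata can be queried with ordinary SQL. For example, the standard PostgreSQL pg_tables view lists the tables visible in a database (a minimal sketch):

    SELECT schemaname, tablename
    FROM pg_tables
    WHERE schemaname NOT IN ('pg_catalog', 'information_schema');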

T
TPC-H
The Transaction Processing Performance Council (TPC) is a third-party organization that provides
database benchmark tools for the industry. TPC-H is their ad-hoc, decision support benchmark. This
benchmark illustrates decision support systems that examine large volumes of data, execute queries
with a high degree of complexity, and give answers to critical business questions. The TPC-H toolkit
is used for Bizgres and Greenplum Database functional and performance testing.

tuple
A tuple is another name for a row or record in a relational database table.

W
WAL
Write-Ahead Logging (WAL) is a standard approach to transaction logging. WAL’s central concept
is that changes to data files (where tables and indexes reside) are logged before they are written to
permanent storage. Data pages do not need to be flushed to disk on every transaction commit. In the
event of a crash, data changes not yet applied to the database can be recovered from the log. A major
benefit of using WAL is a significantly reduced number of disk writes.


Index

A

about Greenplum: 5
ALTER AGGREGATE: 24
ALTER CONVERSION: 24
ALTER DATABASE: 24
ALTER DOMAIN: 24
ALTER FUNCTION: 24
ALTER GROUP: 24
ALTER INDEX: 25
ALTER LANGUAGE: 25
ALTER OPERATOR: 25
ALTER OPERATOR CLASS: 25
ALTER RESOURCE QUEUE: 25
ALTER SCHEMA: 25
ALTER SEQUENCE: 25
ALTER TABLE: 25
ALTER TABLESPACE: 25
ALTER TRIGGER: 25
ALTER TYPE: 25
ALTER USER: 25
ANALYZE: 25
architecture: 12
  shared-nothing: 8
array
  in Greenplum Database: 34

B

BEGIN: 25
benefits: 6
Bizgres Loader: 16
Bizgres open source: 6

C

catalog: 34
CHECKPOINT: 25
CLOSE: 25
CLUSTER: 25
cluster: 34
clusterdb: 30
COMMENT: 25
COMMIT: 25
COMMIT PREPARED: 25
commodity hardware: 7
COPY: 25
correlated subquery: 34
CPU: 34
  dual-core: 35
CREATE AGGREGATE: 25
CREATE CAST: 25
CREATE CONSTRAINT TRIGGER: 26
CREATE CONVERSION: 26
CREATE DATABASE: 26
CREATE DOMAIN: 26
CREATE EXTERNAL TABLE: 26
CREATE FUNCTION: 26
CREATE GROUP: 26
CREATE INDEX: 26
CREATE LANGUAGE: 26
CREATE OPERATOR: 26
CREATE OPERATOR CLASS: 26
CREATE RESOURCE QUEUE: 26
CREATE ROLE: 26
CREATE RULE: 26
CREATE SCHEMA: 26
CREATE SEQUENCE: 26
createdb: 30
createlang: 30
createuser: 30
creating
  databases: 30
  languages: 30
  users: 30

D

data
  distribution: 19
  loading: 16
  storage: 19
data directory: 35
database
  creating: 30
  deleting: 30
  users: 30
  vacuum: 31
DDL: 35
deleting
  databases: 30
  languages: 30
  users: 30
dispatcher: 39
distributed: 35
DML: 35
DROP ROLE: 28
DROP RULE: 28
DROP SCHEMA: 28
DROP SEQUENCE: 28
DROP TABLE: 28
DROP TABLESPACE: 28
DROP TRIGGER: 28
DROP TYPE: 28
DROP USER: 28
DROP VIEW: 28
dropdb: 30
droplang: 30
dropuser: 30
dual-core CPU: 35

E

ecpg: 30
END: 28
EXECUTE: 28
executor: 39


EXPLAIN: 28

F

fault tolerance: 14
features: 6
FETCH: 28
flexibility: 9

G

global system catalog: 41
GRANT: 28
Greenplum: 10
Greenplum Database
  and Bizgres: 6
  and PostgreSQL: 5
  architecture: 12
  array: 34
  benefits: 6
  cluster: 34
  features: 6
  interconnect: 13
  master: 13
  overview: 5
  queries: 20
  segment: 13

H

hardware
  commodity: 7
  default host configuration: 7
  requirements: 24
high availability: 9, 14
host: 36

I

initdb: 32
INSERT: 29
interconnect: 13
ipcclean: 32

L

languages: 30
LISTEN: 29
LOAD: 29
loader: 16
LOCK: 29

M

master: 13
master instance: 13, 37
mirror: 37
mirroring: 14
motion node: 37
MOVE: 29
MPP: 37

N

NOTIFY: 29

O

open source: 7
overview: 5
  architecture: 12
  of features: 6

P

partitioned: 38
performance: 9
pg_config: 31
pg_controldata: 32
pg_ctl: 32
pg_dump: 31
pg_dumpall: 31
pg_resetxlog: 32
pg_restore: 31
postgres: 33
PostgreSQL: 5, 38
  client applications: 30
  server applications: 32
postmaster: 33, 38
PREPARE: 29
PREPARE TRANSACTION: 29
prerequisites: 23
price: 9
programming languages: 30
psql: 31, 38

Q

QD: 38
QE: 38
query
  dispatcher: 39
  execution plan: 39
  executor: 39
query processing: 20

R

rack: 39
RAID: 39
REASSIGN OWNED: 29
recovery: 14
redundancy: 14
REINDEX: 29
RELEASE SAVEPOINT: 29
RESET: 29
REVOKE: 29
ROLLBACK: 29
ROLLBACK PREPARED: 29
ROLLBACK TO SAVEPOINT: 29


S
SATA: 40
SAVEPOINT: 29
scalability: 9
schema
  star: 40
SCSI: 40
segment: 13, 40
segment instance: 13, 40
SELECT INTO: 29
SET: 29
SET CONSTRAINTS: 29
SET ROLE: 29
SET SESSION AUTHORIZATION: 29
SET TRANSACTION: 29
shared-nothing: 8
SHOW: 29
single point of failure: 15
SQL support: 24
star schema: 40
START TRANSACTION: 29
system catalog: 41
system requirements: 23

T
TPC-H: 41
TRUNCATE: 29
tuple: 41

U
UNLISTEN: 29
UPDATE: 30
user
  creating: 30
  deleting: 30

V
VACUUM: 30
vacuumdb: 31
VALUES: 30

