Information in this document is subject to change without notice. The software described herein is furnished under a
license agreement, and it may be used or copied only in accordance with the terms of that agreement. Upgrades are
provided only at regularly scheduled software release dates. No part of this publication may be reproduced,
transmitted, or translated in any form or by any means, electronic, mechanical, manual, optical, or otherwise, without
the prior written permission of Greenplum, Inc. Greenplum makes no warranty of any kind with respect to the
completeness or accuracy of this manual.
Greenplum, the Greenplum logo, Bizgres, the Bizgres logo, and Greenplum Database are trademarks of Greenplum,
Inc. PostgreSQL is a trademark of Marc Fournier held in trust for The PostgreSQL Development Group. All other
company and product names used herein may be trademarks or registered trademarks of their respective companies.
Preface
This guide provides an overview of the Greenplum Database product. It explains the
features, concepts, system architecture, and system processes.
• About This Guide
• Greenplum Database Documentation
• Contact Us
Document Conventions
The following conventions are used throughout the Greenplum Database
documentation to help you identify certain types of information.
Text Conventions
Table 0.1 Text Conventions

Convention: bold
  Usage: Button, menu, tab, page, and field names in GUI applications.
  Example: Click Cancel to exit the page without saving your changes.

Convention: italics
  Usage: New terms where they are defined; database objects, such as schema, table, or column names.
  Examples: The master instance is the postmaster process that accepts client connections. Catalog information for Greenplum Database resides in the pg_catalog schema.

Convention: monospace
  Usage: File names and path names; programs and executables; command names and syntax; parameter names.
  Examples: Edit the postgresql.conf file. Use gpstart to start Greenplum Database.

Convention: monospace italics
  Usage: Variable information within file paths and file names; variable information within command syntax.
  Examples: /home/gpadmin/config_file; COPY tablename FROM 'filename'

Convention: monospace bold
  Usage: Used to call attention to a particular part of a command, parameter, or code snippet.
  Example: Change the host name, port, and database name in the JDBC connection URL: jdbc:postgresql://host:5432/mydb

Convention: UPPERCASE
  Usage: Environment variables; SQL commands; keyboard keys.
  Examples: Make sure that the Java /bin directory is in your $PATH. SELECT * FROM my_table; Press CTRL+C to escape.
Contact Us
To contact Greenplum customer support, call 1-866-410-6060 or send email to:
support@greenplum.com
Greenplum Database Overview Guide 3.0 – Chapter 1: About Greenplum Database
commands. Most DDL and utility SQL statements are supported in Greenplum
Database as they are in PostgreSQL, with a few minor exceptions. See “SQL Support”
on page 24 for more information.
When compared to the competition, Greenplum Database is the only product that
performs exceptionally well in all of these categories.
Open Source
The global adoption and momentum building around Linux have demonstrated the
power and value of utilizing open source software in the enterprise. Open source
offers many of the benefits that have been missing from the traditional proprietary
commercial software industry. Open source software:
• Insulates enterprises from vendor lock-in
• Lowers the cost of ownership
• Leverages the efforts of a global developer community
In 2005, Greenplum founded Bizgres.org, the first open source database project
focused on Business Intelligence (BI). The Bizgres Project aims to make PostgreSQL
the world’s most robust open source database for Business Intelligence. Greenplum
Database is an enterprise-ready commercial solution that builds upon and extends the
offerings of Bizgres.
Commodity Hardware
One of Greenplum Database’s greatest strengths is that it can run on off-the-shelf,
low-cost commodity servers. Greenplum Database was designed specifically to take
advantage of the tremendous price/performance advantages that commodity
computing delivers over traditional proprietary SMP-based systems.
Greenplum Database supports standard hardware configurations from Dell, HP, Sun,
and other hardware vendors. A typical Greenplum Database compute host has the
following hardware resources:
• 2 dual-core CPUs (typically Xeon or Opteron)
• 16 GB of RAM
• 2 Gigabit Ethernet interfaces
• 1 SATA RAID disk controller per 8 drives
• 16 SATA 400 GB hard drives
By leveraging commodity systems, Greenplum Database requires less than $25,000
(US) of hardware per terabyte of usable warehousing capacity.
These segment instances are connected by the Greenplum Database Interconnect and
database optimizer technology. They perform work in parallel and use all disk
connections simultaneously. As a result, the database system consists of a number of
self-contained parallel processing units and is able to scale storage capacity and
processing power together to answer complex queries on growing data repositories.
Each segment instance acts as a self-contained database processor that owns and
manages a distinct portion of the overall data.
Because shared-nothing databases automatically distribute data and make query
workloads parallel across all available hardware, they dramatically outperform
general-purpose database systems for BI workloads.
High Availability
Greenplum Database provides for redundancy of its components so that there is no
single point of failure in the Greenplum Database system. Greenplum Database
provides a high degree of system fail-over through cyclical data redundancy. Each
data segment is mirrored on an alternate host, where each segment instance manages
one segment (either a primary or a backup segment). In a typical Greenplum Database
implementation, there is generally one segment instance per CPU, several to a host.
When an active segment instance fails, Greenplum Database redirects connections to
the backup segment on an alternate host.
Workload Management
The purpose of Greenplum workload management is to limit the number of active
queries in the system at any given time in order to avoid exhausting system resources
such as memory, CPU, and disk I/O. This is accomplished by creating role-based
resource queues. A resource queue has attributes that limit the size and/or total
number of queries that can be executed by the users (or roles) in that queue. By
assigning all of your database roles to the appropriate resource queue, administrators
can control concurrent user queries and prevent the system from being overloaded.
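As an illustrative sketch (the queue and role names here are hypothetical, and the exact DDL syntax varies by Greenplum release), a resource queue and its role assignment might look like this:

```sql
-- Hypothetical queue limiting its members to 10 concurrently active
-- statements. Syntax is release-dependent; treat this as a sketch,
-- not a reference.
CREATE RESOURCE QUEUE reporting ACTIVE THRESHOLD 10;

-- Queries issued by this role now count against the queue's limit.
CREATE ROLE analyst LOGIN RESOURCE QUEUE reporting;
```

An administrator would typically create one queue per workload class and assign every database role to an appropriate queue.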
[Figure: positioning chart plotting cost against number of terabytes, contrasting traditional DW solutions with the fastest, deepest, lowest-cost position claimed for Greenplum]
About Greenplum
Greenplum was formed in 2003 by the merger of Metapa and Didera with the goal of
developing a low cost, high-performance, large-scale data warehousing solution built
on open source software. Greenplum is led by pioneers in open source, database
systems, data warehousing, supercomputing, and Internet performance acceleration
with technical staff from companies such as Oracle, Sybase, Informix, Teradata,
Netezza, Tandem, and Sun. Greenplum company headquarters are in San Mateo,
California.
business decisions to gain competitive advantage. The task of managing and scaling
data for business reporting has traditionally been difficult and expensive, and in the
past 20 years the database infrastructure on which business intelligence (BI) systems
are built has not evolved significantly.
Greenplum recognizes that companies are moving to low cost computing and
replacing proprietary, Unix-based hardware and software with Intel-based hardware
running Linux. Greenplum’s offerings are specifically designed to help companies
take advantage of the price and performance returns of Linux.
Greenplum Database’s target markets include organizations that manage large and
growing data volumes. Customer warehouse sizes range from under one terabyte to
multi-terabyte systems. Primary markets include telecommunications companies,
retail firms, financial services companies, online service providers and application
service providers.
Greenplum Database Overview Guide 3.0 – Chapter 2: Greenplum Database Architecture
[Figure: Greenplum Database architecture. A master instance and segment instances 1 through n connected by the Interconnect, a Gigabit Ethernet switch]
This section describes the basic components of the Greenplum Database system, and
how they work together:
• The Master
• The Segments
• The Interconnect
The Master
The master is the entry point to the Greenplum Database system. It is the database
process that accepts client connections and processes the SQL commands issued by
the users of the system.
Since Greenplum Database is based on PostgreSQL, end-users interact with
Greenplum Database (through the master) as they would a typical PostgreSQL
database. They can connect to the database using client programs such as psql or
application programming interfaces (APIs) such as JDBC or ODBC.
The master is where the global system catalog resides (the set of system tables that
contain metadata about the Greenplum Database system itself); however, the master
does not contain any user data. User data resides only on the segments. The master
does the work of processing the incoming SQL commands, distributing the work load
between the segments, coordinating the results returned by each of the segments, and
presenting the final results to the user.
The Segments
In Greenplum Database, the segments are where the user data resides. User-defined
tables and their indexes are distributed across the available number of segments in the
Greenplum Database system, each segment containing a distinct portion of the data.
Segment instances are the database server processes that serve segments. Users do not
interact directly with the segment instances in a Greenplum Database system, but do
so through the master.
In the recommended Greenplum Database hardware configuration, there is one
primary segment instance per effective CPU or CPU core.
The Interconnect
The Interconnect component in Greenplum Database is responsible for moving data
between the segments during query execution. The Interconnect delivers messages,
moves data, collects results, and coordinates work among the segments in the system.
The Interconnect rides on top of a standard Gigabit Ethernet switching fabric.
[Figure: high-availability configuration. A primary master instance with automatic synchronization to a backup master instance, segment instances 2 through n, and redundant Gigabit Ethernet switches]
Redundant Interconnect
The Greenplum Database Interconnect rides on top of a standard Gigabit Ethernet
switching fabric. A highly available Interconnect can be achieved by deploying dual
Gigabit Ethernet switches on your network, and redundant Gigabit connections to the
Greenplum Database host servers.
Bizgres Loader
Bizgres Loader is a Java command-line program that can be used to load large
quantities of data into a Greenplum Database database. Its functionality and data
formatting options are similar to the PostgreSQL COPY command, which is also
supported in Greenplum Database. Bizgres Loader provides several additional
features such as error logging, concurrent load execution, optimized performance,
data batching, and enhanced configuration options.
Bizgres Loader can be run from the master or any host on the network connected to
the master. It takes as input a control file. The control file contains one or more LOAD
commands, which are the specifications for loading the data into a table. Data can be
loaded from either a data file or standard input.
Bizgres Loader can be run in batch mode, which means that large quantities of data
are loaded into the database in batches. You can specify the number of rows and a size
limit that define a batch. If a batch contains data errors, the bad batches are written to
a log file so you can go back and troubleshoot the rows that did not get loaded.
[Figure: Bizgres Loader data flow. Source data is loaded through Bizgres Loader into segment instances 2 through n]
Greenplum Monitor
The Greenplum Monitor server sits outside of your Greenplum Database core
installation and monitors the system state, network activity, processing load, database
capacity and utilization, user connections, and system alerts. Each master instance and
segment instance has a small monitoring agent running on it that reports status
information back to the Greenplum Monitor server. The status information is stored in
a standard Bizgres or PostgreSQL database running on the Greenplum Monitor server.
[Figure: Greenplum Monitor architecture. Monitoring agents on segment hosts 1 through n report to the Greenplum Monitor UI]
Administrators access the Greenplum Monitor data through a web application that is
also running on the Greenplum Monitor server host.
[Figure: example schema]
customer (cn integer, cname text)
sale (cn integer, vn integer, pn integer, dt date, qty integer, prc float)
vendor (vn integer, vname text, loc text)
product (pn integer, pname text)
In Greenplum Database all tables are distributed, which means a table is divided into
non-overlapping sets of rows or parts. Each part resides on a single database known as
a segment within the Greenplum Database system. The parts are distributed evenly
across all of the available segments using a sophisticated hashing algorithm.
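As a sketch of how a distribution key might be declared for the example sale table (the DISTRIBUTED BY clause is assumed here; the column layout follows the example schema above):

```sql
-- Each row is hashed on the distribution key (cn) and stored on exactly
-- one segment, so the per-segment parts are non-overlapping.
CREATE TABLE sale (
    cn  integer,
    vn  integer,
    pn  integer,
    dt  date,
    qty integer,
    prc float
) DISTRIBUTED BY (cn);
```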
The query is received through the master, which parses the query, optimizes the query,
and creates a query execution plan. The query execution plan is divided and sent to the
individual segment instances for execution, where the segment instances execute their
slice of the plan in parallel.
[Figure: parallel query execution. The master instance dispatches plan slices to segment instances 1 through n, which execute them in parallel]
[Figure: distributed hash join plan. On each segment instance, the Sale and Customer tables are scanned (1), motion nodes send and receive redistributed tuples to and from the other segment instances (2), hash nodes receive their slices of data (3), and a hash join is performed (4), with results sent to the GP master instance]
1. The customer and sale table segments are scanned on each segment instance.
2. Tuples are dynamically rehashed on the join column and sent to the correct
segment instance.
3. The hash node receives its slice of data from all other segments and starts the join
operation.
4. The local segment result set is sent to the master, which materializes the results
and presents them to the user.
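A query that would give rise to a plan of this shape, joining sale and customer on their common cn column, might look like the following (illustrative only):

```sql
-- Tuples from both tables are rehashed on the join column cn (step 2),
-- each segment performs its local hash join (steps 1, 3, 4), and the
-- master assembles the final result.
SELECT c.cname, SUM(s.qty * s.prc) AS total_sales
FROM sale s
JOIN customer c ON s.cn = c.cn
GROUP BY c.cname;
```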
System Requirements
This section outlines the minimum system requirements for a machine that will be
running Greenplum Database.
System Prerequisites
The following table lists minimum recommended specifications for servers intended
to support Greenplum Database. It is recommended that you work with your
Greenplum Systems Engineer to review your anticipated environment to ensure an
appropriate configuration for Greenplum Database.
Table 4.1 System Prerequisites for Greenplum Database 3.0
Minimum CPU: Pentium Pro compatible (P3/Athlon and above)
Greenplum Database Overview Guide 3.0 – Chapter 4: System Requirements and Supported Features
Supported Features
Greenplum Database is based on PostgreSQL 8.2.4 and supports many of the database
features, SQL commands, client applications, and server applications provided by
PostgreSQL. Some features of PostgreSQL have been modified or are unsupported in
Greenplum Database due to the distributed nature of the Greenplum database and its
parallel architecture.
This section provides a reference of the PostgreSQL features and how those features
are supported in Greenplum Database. For more information on PostgreSQL, refer to
the PostgreSQL Documentation.
SQL Support
The following table lists all of the SQL commands supported in PostgreSQL, and
whether or not they are supported in Greenplum Database. For full SQL syntax and
references, see the Greenplum Database User Guide.
Data Query and Manipulation Language (DQL/DML) is essentially supported as it is
in PostgreSQL. SELECT, INSERT, UPDATE, and DELETE are DQL/DML commands. All
other SQL commands are considered Data Definition Language (DDL) or utility
commands. The majority of DDL and utility statements are supported with a few
minor exceptions.
Table 4.3 SQL Support in Greenplum Database

SQL Command | Supported in Greenplum | Modifications, Limitations, Exceptions
ALTER TRIGGER NO
ANALYZE YES
BEGIN YES
CHECKPOINT YES
CLOSE YES
CLUSTER YES
COMMENT YES
COMMIT YES
COMMIT PREPARED NO
CREATE AGGREGATE YES
Limitations:
The functions used to implement the aggregate must be IMMUTABLE functions.
CREATE EXTERNAL TABLE YES Greenplum Database parallel ETL feature - not in
PostgreSQL 8.2.4.
CREATE TABLE YES
Limited Clauses:
• Only one UNIQUE constraint allowed per table
• UNIQUE not allowed if the table has a PRIMARY KEY
CREATE TRIGGER NO
DEALLOCATE YES
DECLARE YES
Limitations:
Cursors are non-scrollable and non-updateable
DELETE YES
Limitations:
• Joins must be on a common Greenplum distribution key (equijoins)
• Cannot use STABLE or VOLATILE functions in a DELETE statement if mirrors are enabled
DROP EXTERNAL TABLE YES Greenplum Database parallel ETL feature - not in
PostgreSQL 8.2.4.
DROP TRIGGER NO
END YES
EXECUTE YES
EXPLAIN YES
FETCH YES
Limitations:
Cannot fetch rows in a nonsequential fashion (no scrolling)
GRANT YES
LISTEN NO
LOAD YES
LOCK YES
NOTIFY NO
PREPARE YES
PREPARE TRANSACTION NO
REINDEX YES
RESET YES
REVOKE YES
ROLLBACK YES
ROLLBACK PREPARED NO
SAVEPOINT YES
SELECT YES
SET YES
SET SESSION AUTHORIZATION YES Deprecated in PostgreSQL 8.1 - see SET ROLE
SHOW YES
TRUNCATE YES
UNLISTEN NO
UPDATE YES
Limitations:
• SET not allowed for Greenplum distribution key columns.
• Joins must be on a common Greenplum distribution key (equijoins).
• Cannot use STABLE or VOLATILE functions in an UPDATE statement if mirrors are enabled.
VALUES YES
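To illustrate the UPDATE limitations, assume the example sale table is hypothetically distributed on its cn column:

```sql
-- Allowed: qty is not part of the distribution key.
UPDATE sale SET qty = qty + 1 WHERE pn = 100;

-- Not allowed: cn is the distribution key, so SET cn = ... is rejected.
-- UPDATE sale SET cn = 2 WHERE pn = 100;
```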
Client Application | Description | Supported in Greenplum | Notes
pg_config | Retrieve information about the installed version of Greenplum Database. | YES | Prints out information for the Greenplum Master Instance only.
pg_dump | Extract a Greenplum Database database into a single script file or other archive file. | YES | Only use pg_dump when the system is quiet or no DML is being executed. The dump operation is not executed in a serializable transaction. Use the --gp-syntax command-line option to include the DISTRIBUTED BY clause in CREATE TABLE statements. See also gp_dump.
pg_dumpall | Extract a database cluster into a script file (all databases, roles, and system catalog info). | YES | Only use pg_dumpall when the system is quiet or no DML is being executed. The dump operation is not executed in a serializable transaction. Use the --gp-syntax command-line option to include the DISTRIBUTED BY clause in CREATE TABLE statements. See also gp_dump.
Server Application | Description | Supported in Greenplum | Notes
Glossary
A
array
The set of physical devices (hosts, servers, switches, etc.) used to house a Greenplum Database
system.
bandwidth
Bandwidth is the maximum amount of information that can be transmitted along a channel, such as
a network or I/O channel. This data transfer rate is usually measured in megabytes per second (MB/s).
BIOS
This is the Basic Input/Output System and is installed on the computer’s motherboard. It controls the
most basic operations and is responsible for starting up the computer and initializing hardware such
as disk drives, I/O devices, and so on.
C
catalog
See system catalog.
cluster
In PostgreSQL, a cluster refers to a collection of databases that is managed by a single instance of a
running database server. In file system terms, a database cluster is a single directory under which all
data is stored, referred to as the data directory or data storage area. In Greenplum Database, a cluster
refers to a single global catalog (on the master) and the database objects owned and contained in that
catalog.
correlated subquery
A correlated subquery is a nested SELECT statement that refers to a column from an outer SELECT
statement. For example:
SELECT * FROM product WHERE EXISTS (SELECT * FROM sale WHERE qty > 0 AND pn = product.pn);
CPU
CPU stands for Central Processing Unit and is often, simply called, the processor. The part of a
computer (a microprocessor chip) that does most of the data processing. Sometimes the term CPU is
used to describe the whole box that contains the chip (along with the motherboard, expansion cards,
disk drives, power supply, and so on).
See also dual-core CPU.
D
data directory
The data directory is the location on disk where database data is stored. The master data directory
contains the global system catalog only — no user data is stored on the master. The data directories
on the segment instances contain user data for their segments plus a local system catalog. Each data
directory contains several subdirectories, control files, and configuration files.
distributed
Certain database objects in Greenplum Database, such as tables and indexes, are distributed. They
are divided into equal parts and spread out amongst the segment instances based on a hashing
algorithm. To the end-user and client software, however, a distributed object appears as a
conventional database object.
distribution key
In a Greenplum table, one or more columns are used as the distribution key, meaning those columns
are used to divide the data amongst all of the segments. The distribution key should be the primary
key of the table or a unique column. If that is not possible, then choose a column with a large number
of distinct values to ensure the most even data distribution.
DDL
Data Definition Language. A subset of SQL commands used for defining and examining the structure
of a database.
DML
Data Manipulation Language. SQL commands that store, manipulate, and retrieve data from
tables. INSERT, UPDATE, DELETE, and SELECT are DML commands.
dual-core CPU
A dual-core CPU is basically two separate processors on a single microprocessor chip. Those two
processors can outperform single-core processors on most multithreaded applications while running
at lower clock speeds and consuming less power. An application with multiple software threads, such
as Greenplum Database, will run faster on a dual-core processor because the operating system can
assign an individual thread to its own processor core. Multithreaded applications running on a
single-core processor must wait for one thread to finish before another thread can be processed.
See also CPU.
Greenplum Database
Greenplum Database is the industry’s first massively parallel processing (MPP) database server
based on open-source technology. It is explicitly designed to support business intelligence (BI)
applications and large, multi-terabyte data warehouses. Greenplum Database is based on
PostgreSQL.
Greenplum instance
The process that serves a database. An instance of Greenplum Database is comprised of a master
instance and one or more segment instances.
host
A host represents a physical machine or compute node in a Greenplum Database system. In
Greenplum Database, one host is designated as the master. The other hosts in the system have one or
more segment instances running on them.
hyper-threaded CPU
A hyper-threaded CPU is a single processor that presents itself to the operating system as two virtual
processors. The processor can work on two sets of tasks simultaneously, use resources that otherwise
would sit idle, and get more work done in the same amount of time. Hyper-threading was pioneered
by Intel on the Xeon processor family for servers. See also, dual-core CPU.
I
Interconnect
The Interconnect component in Greenplum is responsible for moving data between the segments
during query execution. The Interconnect delivers messages, moves data, collects results, and
coordinates work among the segments in the system. The Interconnect rides on top of a standard
Gigabit Ethernet switching fabric over a private local area network (LAN).
I/O
Input/Output (I/O) refers to the transfer of data to and from a system or device using a
communication channel.
JDBC
Java Database Connectivity is an application program interface (API) specification for connecting
programs written in Java to data in a database management system (DBMS). The application
program interface lets you encode access request statements in SQL that are then passed to the
program that manages the database.
M
master
The master (also referred to as the Greenplum master or master instance) is the entry point to the
Greenplum Database system. It is the database listener process (postmaster process) that accepts
client connections and processes the SQL commands issued by the users of the system.
The master is where the global system catalog resides; however, the master does not contain any user
data. User data resides only on the segment instances. The master does the work of processing the
incoming SQL commands, distributing the work load between the segment instances, coordinating
the results returned by each of the segments, and presenting the final results to the user.
master instance
The database process that serves the Greenplum master. See master.
mirror
A mirror is a backup copy of a segment (or master) that is stored on a different host than the primary
copy. Mirrors are useful for maintaining operations if a host in your Greenplum Database system
fails. Mirroring is an optional feature of Greenplum Database. Mirror segments are evenly distributed
amongst other hosts in the array. If a host that holds a primary segment fails, Greenplum Database
will switch to the mirror or secondary host.
motion node
A motion node is a portion of a query execution plan responsible for moving tuples amongst the
Greenplum Database segment instances.
motherboard
The main circuit board of a computer, which houses all the vital components usually including the
microprocessor, internal memory, and device controllers such as for the disk drives.
MPP
Massively Parallel Processing.
ODBC
Open Database Connectivity, a standard database access method that makes it possible to access any
data from any client application, regardless of which database management system (DBMS) is
handling the data. ODBC manages this by inserting a middle layer, called a database driver, between
a client application and the DBMS. The purpose of this layer is to translate the application’s data
queries into commands that the DBMS understands.
P
partitioned
Partitioning is a way to logically divide the data in a table for better performance and easier
maintenance. In Greenplum Database, partitioning is a procedure that creates multiple sub-tables (or
child tables) from a single large table (or parent table). The primary purpose is to improve
performance by scanning only the relevant data needed to satisfy a query. Note that partitioned tables
are also distributed.
Perl DBI
Perl Database Interface (DBI) is an API for connecting programs written in Perl to database
management systems (DBMS). Perl DBI (DataBase Interface) is the most common database
interface for the Perl programming language.
PostgreSQL
PostgreSQL is a SQL compliant, open source relational database management system (RDBMS).
Greenplum Database uses a modified version of PostgreSQL as its underlying database server. For
more information on PostgreSQL go to http://www.postgresql.org.
postgres process
The postgres process is the server process that processes queries. It is not called directly, but is a
subprocess of a postmaster process.
postmaster process
The postmaster is the PostgreSQL database server listener process. In order for a client application
to access a database it connects (over a network or locally) to a running postmaster process. The
postmaster also manages the communication among server subprocesses (see postgres process).
In Greenplum Database, there is a postmaster process running on the master and on each segment
instance. Users and client applications always connect to Greenplum Database using the postmaster
process on the master.
psql
This is the interactive terminal to PostgreSQL. You can use psql to access a database and issue SQL
commands. For more information on psql, see the psql reference in the PostgreSQL documentation.
QD
See query dispatcher.
QE
See query executor.
query dispatcher
The query dispatcher (QD) is a process that is initiated when users connect to the master and issue
commands. This process represents a user session in the context of a Greenplum Database single
database image as a whole. The query dispatcher process spawns one or more query executor
processes to assist in the execution of SQL commands.
query executor
A query executor process (QE) is associated with a query dispatcher (QD) process and operates on
its behalf. Query executor processes run on the segment instances and execute their portion of a query
plan for a segment.
R
rack
A type of shelving to which computer components can be attached vertically, one on top of the other.
Components are normally screwed into front-mounted, tapped metal strips with holes which are
spaced so as to accommodate the height of devices of various U-sizes. Racks usually have their
height denominated in U-units.
RAID
Redundant Array of Independent (or Inexpensive) Disks. RAID is a system of using multiple hard
drives for sharing or replicating data among the drives. The benefit of RAID is increased data
integrity, fault-tolerance and/or performance, over using drives singularly. Multiple hard drives are
grouped and seen by the OS as one logical hard drive.
RAID-10 (1+0)
The most popular of the multiple RAID levels, RAID 10 combines the best features of striping and
mirroring to yield large arrays with high performance and superior fault tolerance. RAID 10 is a
stripe across a number of mirrored sets. Half of the total disk capacity is used for the mirror copies.
RAID-5
One of the most popular RAID levels, RAID 5 stripes both data and parity information across three
or more hard drives. It uses a distributed parity algorithm, writing data and parity blocks across all
the drives in the array. One disk’s worth of capacity is reserved for parity.
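The parity idea can be illustrated with plain XOR arithmetic. This is a toy sketch of the recovery math only, not a real RAID implementation; the block contents are invented.

```python
# XOR parity in miniature: any single lost block can be rebuilt from
# the surviving blocks plus the parity block.
from functools import reduce

def xor_blocks(blocks):
    """XOR corresponding bytes of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

# Data blocks as they might sit on three drives, plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# The drive holding data[1] fails: rebuild its block by XORing the
# surviving data blocks with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

In RAID 5 the parity blocks are rotated across all the drives rather than kept on a dedicated disk, but the arithmetic is the same.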
RAID-Z
The ZFS file system in Solaris provides a RAID-Z configuration with single parity fault tolerance,
which is similar to RAID-5. In RAID-Z, ZFS uses variable-width RAID stripes so that all writes are
full-stripe writes. This design is only possible because ZFS integrates file system and device
management in such a way that the file system’s metadata has enough information about the
underlying data replication model to handle variable-width RAID stripes. If you have three disks in
a single-parity RAID-Z configuration, parity data occupies space equal to one of the three disks. No
special hardware is required to create a RAID-Z configuration.
RAM
Random Access Memory. The main memory of a computer system used for storing programs and
data. RAM provides temporary read/write storage while hard disks offer semi-permanent storage.
SATA
Serial Advanced Technology Attachment is a computer bus designed primarily for transferring data to
and from a hard disk. Unlike IDE, which uses parallel signaling, SATA uses serial signaling
technology. Because of this, SATA cables are thinner than the ribbon cables used by IDE hard
drives. SATA cables can also be longer, allowing you to connect to more distant devices without fear
of signal interference. SATA also offers room to grow, with data transfer speeds starting at 150 MB
per second.
SCSI
Small Computer System Interface (pronounced ‘scuzzy’). A set of interface standards for connecting
certain peripheral devices, particularly mass storage units such as disk drives, to a computer system.
SCSI interfaces provide for faster data transmission rates (up to 80 megabytes per second) than
standard serial and parallel ports. In addition, you can attach many devices to a single SCSI port, so
that SCSI is really an I/O bus rather than simply an interface.
segment
A segment represents a portion of the data in a Greenplum database. User-defined tables and their
indexes are distributed across the available segment instances in the Greenplum Database system.
Each segment instance contains a distinct portion of the user data. A primary segment instance and
its mirror both store the same segment of data.
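Distribution across segments is driven by hashing a table's distribution key. The following toy sketch shows the idea; CRC32 is used only as a stand-in for Greenplum's internal hash function, and the segment count and row keys are invented.

```python
# Sketch of hash distribution: each row lands on exactly one segment,
# determined by hashing its distribution-key value.
import zlib

NUM_SEGMENTS = 4  # hypothetical array with four segment instances

def segment_for(distribution_key: str) -> int:
    # CRC32 stands in for Greenplum's internal hash; the point is that
    # the same key always maps to the same segment.
    return zlib.crc32(distribution_key.encode()) % NUM_SEGMENTS

rows = ["cust_1001", "cust_1002", "cust_1003", "cust_1004"]
placement = {key: segment_for(key) for key in rows}
assert all(0 <= seg < NUM_SEGMENTS for seg in placement.values())
```

Because the mapping is deterministic, rows with the same key always co-locate on the same segment, which is what lets each query executor work on its slice of the data independently.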
segment instance
The segment instance is the database server process (postmaster process) that serves segments. Users
do not connect to segment instances directly, but through the master.
star schema
A relational database schema used in data warehousing. The star schema is organized around a
central table (fact table) joined to a few smaller tables (dimension tables) using foreign key
references. The fact table contains raw numeric items that represent relevant business facts (price,
number of units sold, etc.).
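A minimal star schema can be sketched with SQLite standing in for a warehouse database; all table and column names here are invented for the example.

```python
# One central fact table joined to small dimension tables by foreign keys.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE fact_sales (            -- central fact table
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        units_sold INTEGER,
        price      REAL
    );
    INSERT INTO dim_product VALUES (1, 'widget');
    INSERT INTO dim_store   VALUES (1, 'Berlin');
    INSERT INTO fact_sales  VALUES (1, 1, 10, 2.5);
""")
row = con.execute("""
    SELECT p.name, s.city, f.units_sold * f.price
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_store   s USING (store_id)
""").fetchone()
print(row)  # → ('widget', 'Berlin', 25.0)
```

The fact table holds the raw numeric measures; descriptive attributes live in the dimension tables and are reached through the foreign-key joins, which is the "star" shape.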
system catalog
The system catalog is where a relational database management system stores schema metadata, such as
information about tables and columns, and internal bookkeeping information. The
system catalog in Greenplum Database is the same as the PostgreSQL catalog with some additional
tables to support the distributed nature of the Greenplum system and databases. In Greenplum
Database, the master contains the global system catalog and each segment instance contains its own
local system catalog.
TPC-H
The Transaction Processing Performance Council (TPC) is a third-party organization that provides
database benchmark tools for the industry. TPC-H is their ad-hoc, decision support benchmark. This
benchmark illustrates decision support systems that examine large volumes of data, execute queries
with a high degree of complexity, and give answers to critical business questions. The TPC-H toolkit
is used for Bizgres and Greenplum Database functional and performance testing.
tuple
A tuple is another name for a row or record in a relational database table.
WAL
Write-Ahead Logging (WAL) is a standard approach to transaction logging. WAL’s central concept
is that changes to data files (where tables and indexes reside) are logged before they are written to
permanent storage. Data pages do not need to be flushed to disk on every transaction commit. In the
event of a crash, data changes not yet applied to the database can be recovered from the log. A major
benefit of using WAL is a significantly reduced number of disk writes.
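The log-before-write rule at the heart of WAL can be sketched in a few lines. This is a toy model; real WAL records are binary, page-oriented, and fsynced to stable storage at commit.

```python
# Write-ahead logging in miniature: record every change in the log
# BEFORE touching the data pages, then replay the log after a crash.
log, pages = [], {}

def write(key, value):
    log.append((key, value))   # 1. append the WAL record first
    pages[key] = value         # 2. only then update the data page

def recover(from_log):
    """Rebuild page state by replaying logged changes in order."""
    recovered = {}
    for key, value in from_log:
        recovered[key] = value
    return recovered

write("balance", 100)
write("balance", 250)
assert recover(log) == {"balance": 250}
```

Because the log alone is sufficient to reconstruct the pages, the data pages themselves need not be flushed on every commit, which is where the reduction in disk writes comes from.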
Greenplum Database Overview Guide 3.0 - Index
Index
A
about Greenplum: 5
ALTER AGGREGATE: 24
ALTER CONVERSION: 24
ALTER DATABASE: 24
ALTER DOMAIN: 24
ALTER FUNCTION: 24
ALTER GROUP: 24
ALTER INDEX: 25
ALTER LANGUAGE: 25
ALTER OPERATOR: 25
ALTER OPERATOR CLASS: 25
ALTER RESOURCE QUEUE: 25
ALTER SCHEMA: 25
ALTER SEQUENCE: 25
ALTER TABLE: 25
ALTER TABLESPACE: 25
ALTER TRIGGER: 25
ALTER TYPE: 25
ALTER USER: 25
ANALYZE: 25
architecture: 12
  shared-nothing: 8
array
  in Greenplum Database: 34

B
BEGIN: 25
benefits: 6
Bizgres Loader: 16
Bizgres open source: 6

C
catalog: 34
CHECKPOINT: 25
CLOSE: 25
CLUSTER: 25
cluster: 34
clusterdb: 30
COMMENT: 25
COMMIT: 25
COMMIT PREPARED: 25
commodity hardware: 7
COPY: 25
correlated subquery: 34
CPU: 34
  dual-core: 35
CREATE AGGREGATE: 25
CREATE CAST: 25
CREATE CONSTRAINT TRIGGER: 26
CREATE CONVERSION: 26
CREATE DATABASE: 26
CREATE DOMAIN: 26
CREATE EXTERNAL TABLE: 26
CREATE FUNCTION: 26
CREATE GROUP: 26
CREATE INDEX: 26
CREATE LANGUAGE: 26
CREATE OPERATOR: 26
CREATE OPERATOR CLASS: 26
CREATE RESOURCE QUEUE: 26
CREATE ROLE: 26
CREATE RULE: 26
CREATE SCHEMA: 26
CREATE SEQUENCE: 26
createdb: 30
createlang: 30
createuser: 30
creating
  databases: 30
  languages: 30
  users: 30

D
data
  distribution: 19
  loading: 16
  storage: 19
data directory: 35
database
  creating: 30
  deleting: 30
  users: 30
  vacuum: 31
DDL: 35
deleting
  databases: 30
  languages: 30
  users: 30
dispatcher: 39
distributed: 35
DML: 35
DROP ROLE: 28
DROP RULE: 28
DROP SCHEMA: 28
DROP SEQUENCE: 28
DROP TABLE: 28
DROP TABLESPACE: 28
DROP TRIGGER: 28
DROP TYPE: 28
DROP USER: 28
DROP VIEW: 28
dropdb: 30
droplang: 30
dropuser: 30
dual-core CPU: 35

E
ecpg: 30
END: 28
EXECUTE: 28
executor: 39
EXPLAIN: 28

F
fault tolerance: 14
features: 6
FETCH: 28
flexibility: 9

G
global system catalog: 41
GRANT: 28
Greenplum: 10
Greenplum Database
  and Bizgres: 6
  and PostgreSQL: 5
  architecture: 12
  array: 34
  benefits: 6
  cluster: 34
  features: 6
  interconnect: 13
  master: 13
  overview: 5
  queries: 20
  segment: 13

H
hardware
  commodity: 7
  default host configuration: 7
  requirements: 24
high availability: 9, 14
host: 36

I
initdb: 32
INSERT: 29
interconnect: 13
ipcclean: 32

L
languages: 30
LISTEN: 29
LOAD: 29
loader: 16
LOCK: 29

M
master: 13
master instance: 13, 37
mirror: 37
mirroring: 14
motion node: 37
MOVE: 29
MPP: 37

N
NOTIFY: 29

O
open source: 7
overview: 5
  architecture: 12
  of features: 6

P
partitioned: 38
performance: 9
pg_config: 31
pg_controldata: 32
pg_ctl: 32
pg_dump: 31
pg_dumpall: 31
pg_resetxlog: 32
pg_restore: 31
postgres: 33
PostgreSQL: 5, 38
  client applications: 30
  server applications: 32
postmaster: 33, 38
PREPARE: 29
PREPARE TRANSACTION: 29
prerequisites: 23
price: 9
programming languages: 30
psql: 31, 38

Q
QD: 38
QE: 38
query
  dispatcher: 39
  execution plan: 39
  executor: 39
query processing: 20

R
rack: 39
RAID: 39
REASSIGN OWNED: 29
recovery: 14
redundancy: 14
REINDEX: 29
RELEASE SAVEPOINT: 29
RESET: 29
REVOKE: 29
ROLLBACK: 29
ROLLBACK PREPARED: 29
ROLLBACK TO SAVEPOINT: 29

S
SATA: 40
SAVEPOINT: 29
scalability: 9
schema
  star: 40
SCSI: 40
segment: 13, 40
segment instance: 13, 40
SELECT INTO: 29
SET: 29
SET CONSTRAINTS: 29
SET ROLE: 29
SET SESSION AUTHORIZATION: 29
SET TRANSACTION: 29
shared-nothing: 8
SHOW: 29
single point of failure: 15
SQL support: 24
star schema: 40
START TRANSACTION: 29
system catalog: 41
system requirements: 23

T
TPC-H: 41
TRUNCATE: 29
tuple: 41

U
UNLISTEN: 29
UPDATE: 30
user
  creating: 30
  deleting: 30

V
VACUUM: 30
vacuumdb: 31
VALUES: 30