PostgreSQL Architecture
Prepared by: Deepak Kumar Padhi
We will discuss PostgreSQL internals, its architecture, and how the various
components of PostgreSQL interact with one another. This will serve as a starting
point and building block for the remainder of our Become a PostgreSQL DBA blog
series.
When you start PostgreSQL, the postmaster starts first and allocates the shared
memory. It also accepts connections and spins off a backend for each new connection,
so each backend (server process) gets its pointers to shared memory from the
postmaster. It would be pretty disastrous if the postmaster died with backends still
running, so we have it do as little as possible, so that there is less that can crash it.
Postgres does have a pool of shared memory; however, it does not have a library or
dictionary cache stored in that memory. This means that statements do need to be
parsed and planned every time they are entered. If parse/plan overhead is an issue,
we suggest the use of prepared statements. While Oracle is able to avoid the
repeated parse/plan overhead, it must still do enough analysis of the query to
determine whether the information is present in the library cache, which also
consumes some time and CPU resources. The parser is quite lightweight, so we feel
that the overhead of parsing the query each time is acceptable.
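For example, if a client issues the same statement repeatedly, it can be prepared once and then executed with different parameters, paying the parse/plan cost only once (the table and column names below are illustrative):

PREPARE get_emp (int) AS
    SELECT * FROM employees WHERE emp_id = $1;  -- parsed and planned once
EXECUTE get_emp(100);   -- later executions skip the parse/plan step
EXECUTE get_emp(200);
DEALLOCATE get_emp;     -- optional; prepared statements vanish at session end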
Before we proceed, you should understand the basic PostgreSQL system architecture.
Understanding how the parts of PostgreSQL interact will make this chapter somewhat
clearer.
● A server process, which manages the database files, accepts connections to
the database from client applications and performs database actions on behalf
of the clients. The database server program is called Postgres.
● The user's client (frontend) application that wants to perform database
operations. Client applications can be very diverse in nature: a client could be a
text-oriented tool, a graphical application, a web server that accesses the
database to display web pages, or a specialized database maintenance tool.
Some client applications are supplied with the PostgreSQL distribution; most
are developed by users.
The PostgreSQL server can handle multiple concurrent connections from clients. To
achieve this it starts ("forks") a new process for each connection. From that point
on, the client and the new server process communicate without intervention by the
original postgres process. Thus, the master server process is always running, waiting
for client connections, whereas client and associated server processes come and go.
(All of this is of course invisible to the user. We only mention it here for
completeness.)
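You can observe the per-connection backend yourself: each session can ask for the operating-system PID of the server process serving it.

SELECT pg_backend_pid();  -- PID of the backend serving this connection
-- Running the same query from a second psql session returns a different PID.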
PostgreSQL Architecture
Figure 1-1. PostgreSQL structure
Shared Memory refers to the memory reserved for database caching and transaction
log caching. The most important elements in shared memory are the Shared Buffer and
the WAL buffers.
Shared Buffer
The purpose of the Shared Buffer is to minimize disk I/O. For this purpose, the
following principles must be met:
● Very large buffers (tens or hundreds of gigabytes) must be accessible quickly.
● Contention must be minimized when many users access the buffer at the same time.
● Frequently used blocks must remain in the buffer for as long as possible.
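As a rough sketch of how you might inspect the Shared Buffer, the contrib extension pg_buffercache (assuming it is installed) exposes one row per buffer:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;
-- Which relations currently occupy the most shared buffers?
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;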
WAL Buffer
The WAL buffer is a buffer that temporarily stores changes to the database. The
contents stored in the WAL buffer are written to the WAL file at a predetermined
point in time. From a backup and recovery point of view, WAL buffers and WAL files
are very important.
Postmaster Process
The Postmaster process is the first process started when you start PostgreSQL. At
startup, it performs recovery, initializes shared memory, and runs the background
processes. It also creates a backend process whenever there is a connection request
from a client process. (See Figure 1-2.)
If you check the relationships between processes with the pstree command, you can
see that the Postmaster process is the parent process of all processes. (For clarity,
I added the process name and argument after the process ID)
Background Process
The background processes required for PostgreSQL operation are as follows; each is
described later in this section. (See Table 1-1.)
Table 1-1. Background processes and their roles
● checkpointer: performs a checkpoint, writing all dirty buffers to disk
● background writer (BGWriter): writes dirty shared buffers to disk between checkpoints
● WAL writer: flushes the WAL buffer to the WAL files
● archiver: copies filled WAL segment files to the archive destination
● stats collector: collects usage statistics, exposed through the pg_stat_* views
● logger (syslogger): writes the activity of all processes to the log files
Backend Process
1. work_mem: Space used for sorting, bitmap operations, hash joins, and merge
joins. The default setting is 4 MB.
2. maintenance_work_mem: Space used for VACUUM and CREATE INDEX. The
default setting is 64 MB.
3. temp_buffers: Space used for temporary tables. The default setting is 8 MB.
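All three areas are ordinary configuration parameters and can be inspected, or raised for a single session, with SHOW and SET:

SHOW work_mem;               -- 4MB by default
SET work_mem = '16MB';       -- affects only the current session
SHOW maintenance_work_mem;   -- 64MB by default
SHOW temp_buffers;           -- 8MB by default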
A backend process is assigned for every client connection. Usually the postmaster
process forks a child process that is dedicated to serving that connection.
CLIENTS PROCESS:
● Whatever query or action we (the client) issue is submitted through the client
process.
● It is the front end.
● The front end may be a text application, a graphical application, or a web server page.
● Clients access the server through TCP/IP.
● Many users can access the DB at the same time.
● FORKS – Forking makes this multi-user access possible; it does not disturb the
main postgres process.
POSTMASTER:
● The postmaster listens on the port (default 5432), authenticates incoming
connections, and allocates a backend process for each user.
SERVER PROCESS:
I. Instance
II. Storage
1. Memory Buffer
2. Utility Process
1. Memory Buffer:
a) Shared_buffer:
Sets the amount of memory the database server uses for shared memory buffers.
The default is typically 128 megabytes (128MB), but might be less if your kernel
settings will not support it (as determined during initdb). This setting must be at
least 128 kilobytes. (Non-default values of BLCKSZ change the minimum.) However,
settings significantly higher than the minimum are usually needed for good
performance. This parameter can only be set at server start.
On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate,
so as to leave adequate space for the operating system. Also, on Windows, large
values for shared_buffers aren't as effective. You may find better results keeping
the setting relatively low and using the operating system cache more instead. The
useful range for shared_buffers on Windows systems is generally from 64MB to
512MB.
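A minimal sketch of checking and changing the parameter (the 2GB value is purely illustrative):

SHOW shared_buffers;                      -- e.g. 128MB on a default install
ALTER SYSTEM SET shared_buffers = '2GB';  -- written to postgresql.auto.conf
-- The new value takes effect only after the server is restarted.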
b) WAL_buffer:
The amount of shared memory used for WAL data that has not yet been written to
disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of
shared_buffers, but not less than 64kB nor more than the size of one WAL segment,
typically 16MB. This value can be set manually if the automatic choice is too large or
too small, but any positive value less than 32kB will be treated as 32kB. This
parameter can only be set at server start.
The contents of the WAL buffers are written out to disk at every transaction
commit, so extremely large values are unlikely to provide a significant benefit.
However, setting this value to at least a few megabytes can improve write
performance on a busy server where many clients are committing at once. The
auto-tuning selected by the default setting of -1 should give reasonable results in
most cases.
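To see the WAL advancing, you can compare the current WAL location before and after a change (function name as of PostgreSQL 10 and later; mydata is a hypothetical table):

SELECT pg_current_wal_lsn();    -- current WAL insert location
INSERT INTO mydata VALUES (1);  -- any change generates WAL records
SELECT pg_current_wal_lsn();    -- the location (LSN) has advanced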
c) CLOG Buffers:
CLOG buffers are one of the SLRU-style buffers oriented toward circular "rings"
of data, such as which transaction numbers have been committed or rolled back.
d) Temp_buffers:
Sets the maximum number of temporary buffers used by each database session.
These are session-local buffers used only for access to temporary tables. The
default is eight megabytes (8MB). The setting can be changed within individual
sessions, but only before the first use of temporary tables within the session;
subsequent attempts to change the value will have no effect on that session.
A session will allocate temporary buffers as needed up to the limit given by
temp_buffers. The cost of setting a large value in sessions that do not actually need
many temporary buffers is only a buffer descriptor, or about 64 bytes, per
increment in temp_buffers. However if a buffer is actually used an additional 8192
bytes will be consumed for it (or in general, BLCKSZ bytes).
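Because the setting is frozen at the first use of a temporary table, raise it at the top of the session, for instance:

SET temp_buffers = '64MB';           -- must run before any temporary table is used
CREATE TEMP TABLE scratch (id int);  -- from here on, the session's limit is fixed
-- A later SET temp_buffers in this session would have no effect.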
e) Work_mem:
Specifies the amount of memory to be used by internal sort operations and hash
tables before writing to temporary disk files. The value defaults to four megabytes
(4MB). Note that for a complex query, several sort or hash operations might be
running in parallel; each operation will be allowed to use as much memory as this value
specifies before it starts to write data into temporary files. Also, several running
sessions could be doing such operations concurrently. Therefore, the total memory
used could be many times the value of work_mem; it is necessary to keep this fact in
mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT,
and merge joins. Hash tables are used in hash joins, hash-based aggregation, and
hash-based processing of IN subqueries.
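As an illustration, EXPLAIN ANALYZE reports whether a sort fit in work_mem or spilled to disk (big_table is hypothetical, and the exact output wording varies by version):

SET work_mem = '1MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM big_table ORDER BY some_col;
-- Look for "Sort Method: external merge  Disk: ..." in the output;
-- with a sufficiently large work_mem it becomes "Sort Method: quicksort  Memory: ...".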
f) Maintenance_work_mem:
Specifies the maximum amount of memory to be used by maintenance operations,
such as VACUUM and CREATE INDEX. The default is 64 megabytes (64MB). Since only
one of these operations can be executed at a time by a database session, it is safe
to set this value significantly larger than work_mem.
2. Utility Process:
a) BGWriter:
There is a separate server process called the background writer, whose function is to
issue writes of "dirty" (new or modified) shared buffers. It writes shared buffers so
server processes handling user queries seldom or never need to wait for a write to
occur. However, the background writer does cause a net overall increase in I/O load,
because while a repeatedly-dirtied page might otherwise be written only once per
checkpoint interval, the background writer might write it several times as it is dirtied
in the same interval. The parameters discussed in this subsection can be used to tune
the behavior for local needs.
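Its activity can be observed through the pg_stat_bgwriter statistics view (column names as of recent releases; they have shifted between versions):

SELECT buffers_clean,    -- buffers written by the background writer
       maxwritten_clean  -- times it stopped upon hitting bgwriter_lru_maxpages
FROM pg_stat_bgwriter;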
b) WAL Writer:
The delay between activity rounds for the WAL writer. In each round the writer will
flush WAL to disk. It then sleeps for wal_writer_delay milliseconds, and repeats.
The default value is 200 milliseconds (200ms). Note that on many systems, the
effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a
value that is not a multiple of 10 might have the same results as setting it to the
next higher multiple of 10. This parameter can only be set in the postgresql.conf file
or on the server command line.
c) Syslogger:
As per the figure, all of the utility processes, the user backends, and the
Postmaster daemon are attached to the syslogger process, which logs information
about their activities. Every process's information is logged under
$PGDATA/pg_log in files ending with .log.
Note that more detailed process logging causes overhead on the server. Minimal
settings are always recommended; increase the debug level only when required.
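As one example of tuning log volume, slow-statement logging can be enabled and reloaded without a restart (the 1s threshold is illustrative):

ALTER SYSTEM SET log_min_duration_statement = '1s';  -- log statements slower than 1s
SELECT pg_reload_conf();                             -- apply without restarting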
d) CHECKPOINTS:
When a checkpoint occurs, all dirty pages must be written to disk. If
checkpoint_segments is increased, checkpoints occur less often, so less I/O is spent
writing to disk (in newer releases, max_wal_size plays this role). Conversely, if a
large amount of data is inserted, checkpoints are generated more often.
Write-Ahead Logging (WAL) puts a checkpoint in the transaction log every so often.
The CHECKPOINT command forces an immediate checkpoint when the command is
issued, without waiting for a scheduled checkpoint.
A checkpoint is a point in the transaction log sequence at which all data files have
been updated to reflect the information in the log. All data files will be flushed to
disk.
Only superusers can call CHECKPOINT. The command is not intended for use during
normal operation.
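For example:

CHECKPOINT;               -- superuser only: force an immediate checkpoint
SHOW checkpoint_timeout;  -- interval between scheduled checkpoints (default 5min)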
e) Stats Collector:
The statistics collector gathers information about server activity, such as table
and index access counts, and makes it available through the pg_stat_* system views.
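The collected numbers surface in the pg_stat_* views, for example:

SELECT relname, seq_scan, idx_scan, n_tup_ins, n_tup_upd
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 5;  -- tables most often read by sequential scan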
f) Archiver:
Setting up the database in archive mode means capturing the WAL data of each
segment file once it is filled, and saving that data somewhere before the segment
file is recycled for reuse.
When the database is in archive mode, once the WAL data fills a WAL segment, a
status file named after that segment is created under
$PGDATA/pg_xlog/archive_status by the WAL writer, with the suffix ".ready"
(i.e. "segment-filename.ready").
The archiver process triggers on finding files in the ".ready" state created by
the WAL writer process. It picks up the segment file number of the .ready file and
copies that file from the $PGDATA/pg_xlog location to the archive destination given
in the archive_command parameter (postgresql.conf).
Example:
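A minimal archiving setup might look like this (the /archive path and the cp command are placeholders; archive_mode requires a restart):

ALTER SYSTEM SET archive_mode = on;
ALTER SYSTEM SET archive_command = 'cp %p /archive/%f';
-- %p is the path of the WAL segment to archive, %f its file name.
-- Once the archiver has copied a segment, its ".ready" status file
-- under archive_status is renamed to ".done".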
File Locations:
data_directory:
Specifies the directory to use for data storage. This parameter can only be set at
server start.
config_file:
Specifies the main server configuration file (customarily called postgresql.conf). This
parameter can only be set on the postgres command line.
hba_file:
Specifies the configuration file for host-based authentication (customarily called
pg_hba.conf). This parameter can only be set at server start.
ident_file:
Specifies the configuration file for user name mapping (customarily called
pg_ident.conf). This parameter can only be set at server start.
external_pid_file:
Specifies the name of an additional process-ID (PID) file that the server should
create for use by server administration programs. This parameter can only be set at
server start.
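All of these locations can be inspected on a running server:

SELECT name, setting
FROM pg_settings
WHERE name IN ('data_directory', 'config_file', 'hba_file',
               'ident_file', 'external_pid_file');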
PG_LOG:
This is not a core PostgreSQL directory; it is the directory where (on RHEL-style
installs) the actual textual logs are stored.
PG_XLOG:
Here the write-ahead logs are stored. These log files record both committed and
uncommitted transactions. The number of segment files kept is limited (governed by
the checkpoint settings), and the oldest are recycled and overwritten; if the
archiver is on, filled segments are copied to the archive destination first.
PG_CLOG:
It contains the commit log files, which record the commit status of each
transaction and are used during crash recovery.
PG_VERSION:
A file containing the major version number of the PostgreSQL release that created
this cluster.
Base:
Subdirectory containing per-database subdirectories, named by database OID.
Global:
Subdirectory containing cluster-wide tables, such as pg_database.
PG_MULTIXACT:
Subdirectory containing multitransaction status data (used for shared row locks).
PG_SUBTRANS:
Subdirectory containing subtransaction status data.
PG_TBLSPC:
Subdirectory containing symbolic links to tablespaces.
PG_TWOPHASE:
Subdirectory containing state files for prepared transactions.
POSTMASTER.OPTS:
A file recording the command-line options the postmaster was last started with
POSTMASTER.PID:
A lock file recording the current postmaster PID and shared memory segment ID (not
present after postmaster shutdown)
1. Parser:
● The parser has to check the query string (which arrives as plain text) for valid
syntax. If the syntax is correct, a parse tree is built up and handed back.
● The parser defined in gram.y and scan.l is built using the Unix tools bison and
flex.
● The transformation process makes modifications and augmentations to the data
structures returned by the parser.
Note: The mentioned transformations and compilations are normally done automatically
using the makefiles shipped with the PostgreSQL source distribution.
A detailed description of bison or the grammar rules given in gram.y would be beyond
the scope of this paper. There are many books and documents dealing with flex and
bison. You should be familiar with bison before you start to study the grammar given
in gram.y; otherwise you won't understand what happens there.
2. Traffic Cop:
The traffic cop is the agent that is responsible for differentiating between simple
and complex query commands. Transaction control commands such as BEGIN and
ROLLBACK are simple enough to not need additional processing, whereas other
commands such as SELECT and JOIN are passed on to the rewriter and
planner/optimizer.
(A "shared read" means a block comes from disk and was not cached. If the query is
run again, and the cache configuration is correct, it will show up as a "shared hit".)
3. Rewriter:
PostgreSQL's rule system has had two implementations:
● The first one worked using row level processing and was implemented deep in
the executor. The rule system was called whenever an individual row had been
accessed. This implementation was removed in 1995 when the last official
release of the Berkeley Postgres project was transformed into Postgres95.
● The second implementation of the rule system is a technique called query
rewriting. The rewrite system is a module that exists between the parser
stage and the planner/optimizer. This technique is still implemented.
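Rules are still user-visible; this example, adapted from the PostgreSQL documentation (mytable is hypothetical), attaches an extra action that the rewriter splices into every UPDATE:

CREATE RULE notify_me AS ON UPDATE TO mytable DO ALSO NOTIFY mytable;
-- Every "UPDATE mytable ..." is rewritten to also send the NOTIFY event.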
4. Planner/Optimizer:
The task of the planner/optimizer is to create an optimal execution plan. A given SQL
query (and hence, a query tree) can be actually executed in a wide variety of
different ways, each of which will produce the same set of results. If it is
computationally feasible, the query optimizer will examine each of these possible
execution plans, ultimately selecting the execution plan that is expected to run the
fastest.
The planner's search procedure actually works with data structures called paths,
which are simply cut-down representations of plans containing only as much
information as the planner needs to make its decisions. After the cheapest path is
determined, a full-fledged plan tree is built to pass to the executor. This represents
the desired execution plan in sufficient detail for the executor to run it. In the rest
of this section we'll ignore the distinction between paths and plans.
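EXPLAIN shows the plan the planner chose (the orders table is hypothetical):

EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- The output is the chosen plan tree, one line per node, with the
-- estimated cost and row count for each (e.g. Index Scan vs. Seq Scan).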
5. Executor:
The executor takes the plan created by the planner/optimizer and recursively
processes it to extract the required set of rows. This is essentially a demand-pull
pipeline mechanism. Each time a plan node is called, it must deliver one more row, or
report that it is done delivering rows.
The executor mechanism is used to evaluate all four basic SQL query types: SELECT,
INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs
to send each row returned by the query plan tree off to the client. For INSERT,
each returned row is inserted into the target table specified for the INSERT. This is
done in a special top-level plan node called ModifyTable. (A simple INSERT ...
VALUES command creates a trivial plan tree consisting of a single Result node, which
computes just one result row, and ModifyTable above it to perform the insertion. But
INSERT ... SELECT can demand the full power of the executor mechanism.) For
UPDATE, the planner arranges that each computed row includes all the updated
column values, plus the TID (tuple ID, or row ID) of the original target row; this
data is fed into a ModifyTable node, which uses the information to create a new
updated row and mark the old row deleted. For DELETE, the only column that is
actually returned by the plan is the TID, and the ModifyTable node simply uses the
TID to visit each target row and mark it deleted.
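The ModifyTable node described above shows up in the plan of any data-changing statement, for example (accounts is hypothetical, and the plan shape will vary):

EXPLAIN UPDATE accounts SET balance = balance - 100 WHERE id = 1;
-- Typical shape:
--   Update on accounts                        (the ModifyTable node)
--     ->  Index Scan using accounts_pkey ...  (feeds rows plus their TIDs)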
3. Directory Structure:
The file name assigned at table and index creation time is the OID, and at that
point the OID and pg_class.relfilenode are the same. However, when a rewrite
operation (TRUNCATE, CLUSTER, VACUUM FULL, REINDEX, etc.) is performed, the
relfilenode value of the affected object changes, and the file name is changed to
the relfilenode value as well. You can easily check the file location and name by
using pg_relation_filepath('<object name>').
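For instance (t1 is a previously created table; the path shown is illustrative):

SELECT pg_relation_filepath('t1');  -- e.g. base/16384/16385
VACUUM FULL t1;                     -- a rewrite operation assigns a new relfilenode
SELECT pg_relation_filepath('t1');  -- the file name has changed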
template0, template1, postgres databases
If you query pg_database after initdb, you can see that the template0, template1,
and postgres databases have been created.
● Through the datistemplate column, you can see that the template0 and
template1 databases are template databases for user database creation.
● The datallowconn column indicates whether the database can be accessed.
Since the template0 database can't be accessed, the contents of that
database can't be changed either.
● The reason for providing two template databases is that template0 is the
initial-state template, while template1 is the template to which the user can
add objects.
● The postgres database is the default database, created using the template1
database. If you do not specify a database at connection time, you will be
connected to the postgres database.
● The database is located under the $PGDATA/base directory. The directory
name is the database OID number.
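These properties are visible directly in pg_database:

SELECT oid, datname, datistemplate, datallowconn
FROM pg_database;
-- template0: datistemplate = t, datallowconn = f  (frozen initial template)
-- template1: datistemplate = t, datallowconn = t  (user-modifiable template)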
pg_default tablespace
● The pg_default tablespace is the cluster's default tablespace; it is physically
located at $PGDATA/base.
pg_global tablespace
● The pg_global tablespace holds cluster-wide (shared) tables.
● For example, shared tables such as pg_database provide the same information
no matter which database they are accessed from. (See Figure 1-5.)
● The location of the pg_global tablespace is $PGDATA/global.
postgres=# create tablespace myts01 location '/data01';
Querying pg_tablespace shows that the myts01 tablespace has been created.
Connect to the postgres and mydb01 databases and create a table in each.
If you look in the /data01 directory after creating the tables, you will see that an
OID directory for each of the postgres and mydb01 databases has been created, and
that each directory contains a file with the same OID as the t1 table.
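Continuing the example, with myts01 created as above (the table definition is hypothetical):

CREATE TABLE t1 (id int) TABLESPACE myts01;
SELECT pg_relation_filepath('t1');
-- Returns a path under pg_tblspc/, e.g. pg_tblspc/16390/PG_15_202209061/16384/16391;
-- $PGDATA/pg_tblspc/16390 is a symbolic link pointing at /data01.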
Note: Tablespaces are also very useful in environments that use partition tables.
Because you can use different tablespaces for each partition table, you can more
flexibly cope with file system capacity problems.
#1 and #2 are generally required for DBMS management, but #3 and #4 are
necessary because of the PostgreSQL MVCC feature.