
 

 
 
 
 
 

PostgreSQL 
Architecture 
 
 
 

Prepared by: Deepak Kumar Padhi
Database Consultant
deepakpadhi16@gmail.com (8686182035)


 
 
 
 

 
 
 
 
 

 
 

PostgreSQL is probably the most advanced database in the open-source relational database market. It was first released in 1989, and since then there have been a lot of enhancements. According to DB-Engines, it is the fourth most used database at the time of writing.

We will discuss PostgreSQL internals, its architecture, and how the various components of PostgreSQL interact with one another. This will serve as a starting point and building block for the remainder of our Become a PostgreSQL DBA blog series.

When you start PostgreSQL, the postmaster starts first and allocates the shared memory. It also accepts connections and spins off a backend for each new connection, so each backend (server process) gets its pointers to shared memory from the postmaster. It is pretty disastrous if the postmaster dies with backends still running, so it does as little as possible so that there isn't much that can crash it. Postgres does have a pool of shared memory; however, it does not have a library or dictionary cache stored in that memory. This means that statements do need to be parsed and planned every time they are entered. If parse/plan overhead is an issue, we suggest the use of prepared statements. While Oracle is able to avoid the repeated parse/plan overhead, it must still do enough analysis of the query to determine whether the information is present in the library cache, which also consumes some time and CPU resources. The parser is quite lightweight, so we feel that the overhead of parsing the query each time is acceptable.
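As a minimal sketch of that suggestion (the users table and its userid column are only assumed for illustration), a statement can be prepared once per session and then executed repeatedly, so the parse/plan work is not repeated:

PREPARE get_user (integer) AS
    SELECT * FROM users WHERE userid = $1;   -- parsed and planned once per session

EXECUTE get_user(42);    -- executes without re-parsing the statement text
EXECUTE get_user(99);

DEALLOCATE get_user;     -- optional; prepared statements disappear when the session ends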

Before we proceed, you should understand the basic PostgreSQL system architecture. 
Understanding how the parts of PostgreSQL interact will make this chapter somewhat 
clearer. 

In database jargon, PostgreSQL uses a client/server model. A PostgreSQL session consists of the following cooperating processes (programs):

● A server process, which manages the database files, accepts connections to the database from client applications, and performs database actions on behalf of the clients. The database server program is called postgres.
● The user's client (frontend) application that wants to perform database 
operations. Client applications can be very diverse in nature: a client could be a 
text-oriented tool, a graphical application, a web server that accesses the 
database to display web pages, or a specialized database maintenance tool. 
Some client applications are supplied with the PostgreSQL distribution; most 
are developed by users. 

As is typical of client/server applications, the client and the server can be on 
different hosts. In that case, they communicate over a TCP/IP network connection. 
You should keep this in mind because the files that can be accessed on a client 
machine might not be accessible (or might only be accessible using a different file 
name) on the database server machine. 

The PostgreSQL server can handle multiple concurrent connections from clients. To 
achieve this it starts ("forks") a new process for each connection. From that point 
on, the client and the new server process communicate without intervention by the 
original postgres process. Thus, the master server process is always running, waiting 
for client connections, whereas client and associated server processes come and go. 
(All of this is of course invisible to the user. We only mention it here for 
completeness.) 
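As a small illustrative check of this process-per-connection model (not part of the original text), each session can ask which server process is serving it:

SELECT pg_backend_pid();   -- the PID of the backend process forked for this connection

Running this from two different psql sessions returns two different PIDs, one per forked backend.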

PostgreSQL Architecture 

The physical structure of PostgreSQL is very simple. It consists of shared memory, a few background processes, and data files. (See Figure 1-1.)

Figure 1-1. 
PostgreSQL structure 

Shared Memory 

Shared memory refers to the memory reserved for database caching and transaction log caching. The most important elements in shared memory are the shared buffer and the WAL buffers.

Shared Buffer 

The purpose of the shared buffer is to minimize disk I/O. For this purpose, the following principles must be met:

● You need to access very large (tens or hundreds of gigabytes) buffers quickly.
● You should minimize contention when many users access it at the same time.
● Frequently used blocks must stay in the buffer for as long as possible.

WAL Buffer 

The WAL buffer is a buffer that temporarily stores changes to the database. The contents stored in the WAL buffer are written to the WAL file at a predetermined point in time. From a backup and recovery point of view, WAL buffers and WAL files are very important.
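To see how these two areas are sized on a given server, a quick check (not from the original text) is:

SHOW shared_buffers;   -- size of the shared buffer pool
SHOW wal_buffers;      -- size of the WAL buffers; the default (-1 in postgresql.conf) auto-tunes this from shared_buffers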

PostgreSQL has four process types:

1. Postmaster (Daemon) Process
2. Background Process
3. Backend Process
4. Client Process

Postmaster Process 

The postmaster process is the first process started when you start PostgreSQL. At startup, it performs recovery, initializes shared memory, and starts the background processes. It also creates a backend process whenever there is a connection request from a client process. (See Figure 1-2.)


Figure 1-2. Process relationship diagram 

If you check the relationships between processes with the pstree command, you can 
see that the Postmaster process is the parent process of all processes. (For clarity, 
I added the process name and argument after the process ID) 

Background Process 

The background processes required for PostgreSQL operation are as follows. (See Table 1-1.)

Table 1-1. Background processes

Process              Role
logger               Writes error messages to the log file.
checkpointer         When a checkpoint occurs, writes the dirty buffers to the data files.
writer               Periodically writes dirty buffers to the data files.
wal writer           Writes the WAL buffer to the WAL file.
autovacuum launcher  Forks autovacuum workers when autovacuum is enabled. It is the responsibility of the autovacuum daemon to carry out vacuum operations on bloated tables on demand.
archiver             When in archive mode, copies completed WAL files to the specified directory.
stats collector      Collects DBMS usage statistics such as session execution information (pg_stat_activity) and table usage statistics (pg_stat_all_tables).
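On newer releases these processes can also be seen from SQL as well as from pstree (the backend_type column is assumed to be available, i.e. PostgreSQL 10 or later):

SELECT pid, backend_type
FROM pg_stat_activity
ORDER BY backend_type;
-- Typical rows include: autovacuum launcher, background writer, checkpointer, walwriter,
-- plus one "client backend" row per connected session.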

Backend Process 

The maximum number of backend processes is set by the max_connections parameter, and the default value is 100. The backend process performs the query request of the user process and then transmits the result. Some memory structures are required for query execution; this is called local memory. The main parameters associated with local memory are:

1. work_mem: Space used for sorting, bitmap operations, hash joins, and merge joins. The default setting is 4 MB.
2. maintenance_work_mem: Space used for VACUUM and CREATE INDEX. The default setting is 64 MB.
3. temp_buffers: Space used for temporary tables. The default setting is 8 MB.
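These settings can be inspected, and the session-level ones changed, with SHOW and SET; a small illustrative sketch:

SHOW work_mem;
SHOW maintenance_work_mem;
SHOW temp_buffers;

SET work_mem = '64MB';               -- affects only the current session
SET maintenance_work_mem = '256MB';  -- e.g. before a large CREATE INDEX in this session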

Client Process 

Client process refers to the user connection for which a dedicated backend process is assigned. Usually the postmaster process forks a child process that is dedicated to serving that user connection.

Architecture Explanation With Query Flow 

LIBPQ (PostgreSQL C Client Library)

● The tools used by connected clients to talk to the database are built on libpq.
● libpq is the C application programmer's interface to PostgreSQL. libpq is a set of library functions that allow client programs to pass queries to the PostgreSQL backend server and to receive the results of these queries.
● Client programs that use libpq must include the header file libpq-fe.h and must link with the libpq library.

● There are also several complete examples of libpq applications in the directory src/test/examples in the source code distribution.

CLIENTS PROCESS:

● Whatever query or action we (the client) issue is performed through the client process.
● It is the front end.
● The front end may be a text application, a graphical application, or a web server page.
● Clients access the server through TCP/IP.
● Many users can access the database at the same time.
● FORKS: Forking is what makes multi-user access possible. It does not disturb the original postgres (postmaster) process.

POSTMASTER: 

● The postmaster listens on the PostgreSQL port (5432 by default), authenticates incoming connection requests, and allocates a backend process for each user.

SERVER PROCESS: 

● It is also called postgres. It accepts connections from clients, manages the database files, and performs database actions on behalf of the clients.

The Postgres server is divided into two parts:

I. Instance

II. Storage

Postgres Server

I. The instance is divided into two parts:

1. Memory buffers

2. Utility processes

1. Memory Buffers:

a) shared_buffers:

Sets the amount of memory the database server uses for shared memory buffers. 
The default is typically 128 megabytes (128MB), but might be less if your kernel 
settings will not support it (as determined during initdb). This setting must be at 
least 128 kilobytes. (Non-default values of BLCKSZ change the minimum.) However, 
settings significantly higher than the minimum are usually needed for good 
performance. This parameter can only be set at server start. 

If you have a dedicated database server with 1GB or more of RAM, a reasonable 
starting value for shared_buffers is 25% of the memory in your system. There are 
some workloads where even large settings for shared_buffers are effective, but 
because PostgreSQL also relies on the operating system cache, it is unlikely that an 
allocation of more than 40% of RAM to shared_buffers will work better than a 
smaller amount. Larger settings for shared_buffers usually require a corresponding 
increase in checkpoint_segments, in order to spread out the process of writing large 
quantities of new or changed data over a longer period of time. 

On systems with less than 1GB of RAM, a smaller percentage of RAM is appropriate, 
so as to leave adequate space for the operating system. Also, on Windows, large 
values for shared_buffers aren't as effective. You may find better results keeping 
the setting relatively low and using the operating system cache more instead. The 
useful range for shared_buffers on Windows systems is generally from 64MB to 
512MB. 
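As a sketch of how this is typically adjusted (the 2GB value below is only an example for a dedicated server with about 8GB of RAM; on versions before 9.4, edit postgresql.conf instead of using ALTER SYSTEM):

ALTER SYSTEM SET shared_buffers = '2GB';   -- roughly 25% of RAM on an 8GB machine
-- restart the server for the new value to take effect, then verify:
SHOW shared_buffers;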

b) wal_buffers:

The amount of shared memory used for WAL data that has not yet been written to 
disk. The default setting of -1 selects a size equal to 1/32nd (about 3%) of 
shared_buffers, but not less than 64kB nor more than the size of one WAL segment, 
typically 16MB. This value can be set manually if the automatic choice is too large or 
too small, but any positive value less than 32kB will be treated as 32kB. This 
parameter can only be set at server start. 

The contents of the WAL buffers are written out to disk at every transaction 
commit, so extremely large values are unlikely to provide a significant benefit. 
However, setting this value to at least a few megabytes can improve write 
performance on a busy server where many clients are committing at once. The 
auto-tuning selected by the default setting of -1 should give reasonable results in 
most cases. 

c) CLOG buffers:

CLOG buffers are one of the SLRU-style buffer areas, which are oriented toward circular "rings" of data, such as which transaction numbers have been committed or rolled back.

d) temp_buffers:

Sets the maximum number of temporary buffers used by each database session. 
These are session-local buffers used only for access to temporary tables. The 
default is eight megabytes (8MB). The setting can be changed within individual 
sessions, but only before the first use of temporary tables within the session; 
subsequent attempts to change the value will have no effect on that session. 

A session will allocate temporary buffers as needed up to the limit given by 
temp_buffers. The cost of setting a large value in sessions that do not actually need 
many temporary buffers is only a buffer descriptor, or about 64 bytes, per 
increment in temp_buffers. However if a buffer is actually used an additional 8192 
bytes will be consumed for it (or in general, BLCKSZ bytes). 
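A small illustration of the session-local behavior described above (the table name is made up):

SET temp_buffers = '32MB';          -- must be done before the first temp-table use in this session
CREATE TEMP TABLE scratch (id int);
-- From this point on, changing temp_buffers in this session has no effect.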

e) work_mem:

Specifies the amount of memory to be used by internal sort operations and hash 
tables before writing to temporary disk files. The value defaults to four megabytes 
(4MB). Note that for a complex query, several sort or hash operations might be 
running in parallel; each operation will be allowed to use as much memory as this value 
specifies before it starts to write data into temporary files. Also, several running 
sessions could be doing such operations concurrently. Therefore, the total memory 
used could be many times the value of work_mem; it is necessary to keep this fact in 
mind when choosing the value. Sort operations are used for ORDER BY, DISTINCT, 
and merge joins. Hash tables are used in hash joins, hash-based aggregation, and 
hash-based processing of IN subqueries. 
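A common way to see whether a given sort fits in work_mem or spills to disk (the table and column are hypothetical):

SET work_mem = '4MB';
EXPLAIN (ANALYZE) SELECT * FROM orders ORDER BY order_date;
-- "Sort Method: external merge  Disk: ..." means the sort spilled to temporary files;
-- "Sort Method: quicksort  Memory: ..." means it fit within work_mem.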

f) maintenance_work_mem:

Specifies the maximum amount of memory to be used by maintenance operations, such as VACUUM, CREATE INDEX, and ALTER TABLE ADD FOREIGN KEY. It defaults to
64 megabytes (64MB). Since only one of these operations can be executed at a time 
by a database session, and an installation normally doesn't have many of them running 
concurrently, it's safe to set this value significantly larger than work_mem. Larger 
settings might improve performance for vacuuming and for restoring database dumps. 

Note that when autovacuum runs, up to autovacuum_max_workers times this memory may be allocated, so be careful not to set the default value too high. It may be useful to control for this by separately setting autovacuum_work_mem.
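For example, raising it for a single index build in the current session only (table and index names are illustrative):

SET maintenance_work_mem = '512MB';
CREATE INDEX idx_orders_customer ON orders (customer_id);
RESET maintenance_work_mem;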

2. Utility (Background) Processes:

a) BGWriter:

There is a separate server process called the background writer, whose function is to 
issue writes of "dirty" (new or modified) shared buffers. It writes shared buffers so 
server processes handling user queries seldom or never need to wait for a write to 
occur. However, the background writer does cause a net overall increase in I/O load, 
because while a repeatedly-dirtied page might otherwise be written only once per 
checkpoint interval, the background writer might write it several times as it is dirtied 
in the same interval. The parameters discussed in this subsection can be used to tune 
the behavior for local needs. 
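The background writer's activity can be observed in the statistics views (an illustrative query; the exact column set varies a little between versions):

SELECT buffers_clean,      -- buffers written by the background writer
       maxwritten_clean,   -- times it stopped because it hit bgwriter_lru_maxpages
       buffers_checkpoint, -- buffers written by the checkpointer
       buffers_backend     -- buffers written directly by backends
FROM pg_stat_bgwriter;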

b) WAL Writer:

WAL buffers are written out to disk at every transaction commit, so extremely large 
values are unlikely to provide a significant benefit. However, setting this value to at 
least a few megabytes can improve write performance on a busy server where many 
clients are committing at once. The auto-tuning selected by the default setting of -1 
should give reasonable results in most cases. 

The delay between activity rounds for the WAL writer. In each round the writer will 
flush WAL to disk. It then sleeps for wal_writer_delay milliseconds, and repeats. 
The default value is 200 milliseconds (200ms). Note that on many systems, the 
effective resolution of sleep delays is 10 milliseconds; setting wal_writer_delay to a 
value that is not a multiple of 10 might have the same results as setting it to the 
next higher multiple of 10. This parameter can only be set in the postgresql.conf file 
or on the server command line. 

c) SysLogger: Error Reporting and Logging

As per the figure, all of the utility processes, the user backends, and the postmaster daemon are attached to the syslogger process, which logs information about their activities. Each process's information is logged under $PGDATA/pg_log in files with a .log extension.

Collecting more debug information about the processes causes overhead on the server, so minimal tuning is always recommended; increase the debug level only when required. See the logging parameters in the PostgreSQL documentation for further details.

The logging collector is a background process that captures log messages sent to stderr and redirects them into log files.

● log_directory: the directory where log files are written; it is interpreted relative to the data directory unless an absolute path is given.
● log_filename: the default is postgresql-%Y-%m-%d_%H%M%S.log.
● The default permissions for the log files are 0600.
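A quick way to check the current logging setup from psql (illustrative):

SHOW logging_collector;
SHOW log_directory;
SHOW log_filename;
SHOW log_min_messages;   -- raising this to a debug level increases logging overhead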

d) CHECKPOINTS:

When a checkpoint occurs, all the dirty pages must be written to disk. If we increase checkpoint_segments, checkpoints will occur less often, and so there will be less I/O because the same pages do not have to be written out as often. If a large amount of data is inserted, checkpoints are generated more frequently.

Write-Ahead Logging (WAL) puts a checkpoint in the transaction log every so often. 
The CHECKPOINT command forces an immediate checkpoint when the command is 
issued, without waiting for a scheduled checkpoint. 

A checkpoint is a point in the transaction log sequence at which all data files have 
been updated to reflect the information in the log. All data files will be flushed to 
disk. 

If executed during recovery, the CHECKPOINT command will force a restartpoint rather than writing a new checkpoint.

Only superusers can call CHECKPOINT. The command is not intended for use during 
normal operation. 
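A minimal illustration (CHECKPOINT requires superuser rights and is mainly for maintenance or testing):

CHECKPOINT;                         -- force an immediate checkpoint
SHOW checkpoint_timeout;            -- interval between automatic checkpoints (default 5min)
SHOW checkpoint_completion_target;  -- how far checkpoint writes are spread over that interval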

e) Stats Collector:

PostgreSQL's statistics collector is a subsystem that supports collection and reporting of information about server activity. Presently, the collector can count accesses to tables and indexes in both disk-block and individual-row terms. It also tracks the total number of rows in each table, and information about vacuum and analyze actions for each table. It can also count calls to user-defined functions and the total time spent in each one.

PostgreSQL also supports reporting of the exact command currently being executed 
by other server processes. This facility is independent of the collector process. 

The statistics collector transmits the collected information to other PostgreSQL processes through temporary files. These files are stored in the directory named by
the stats_temp_directory parameter, pg_stat_tmp by default. For better 
performance, stats_temp_directory can be pointed at a RAM-based file system, 
decreasing physical I/O requirements. When the server shuts down cleanly, a 
permanent copy of the statistics data is stored in the pg_stat subdirectory, so that 
statistics can be retained across server restarts. When recovery is performed at 
server start (e.g. after immediate shutdown, server crash, and point-in-time 
recovery), all statistics counters are reset. 
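Two illustrative queries against the collector's views:

SELECT relname, seq_scan, idx_scan, n_live_tup, n_dead_tup,
       last_vacuum, last_autovacuum, last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

SELECT pid, usename, state, query
FROM pg_stat_activity
WHERE state <> 'idle';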

f) Archiver:

The archiver process is optional; it is off by default.

Setting up the database in Archive mode means to capture the WAL data of each 
segment file once it is filled and save that data somewhere before the segment file is 
recycled for reuse. 

When the database is in archive mode, once the WAL data fills a WAL segment, a status file for that filled segment is created under $PGDATA/pg_xlog/archive_status by the WAL writer, named "segment-filename.ready".

Archiver Process triggers on finding the files which are in “.ready” state created by 
the WAL Writer process. Archiver process picks the ‘segment-file_number’ of .ready 
file and copies the file from $PGDATA/pg_xlog location to its concerned Archive 
destination given in ‘archive_command’ parameter(postgresql.conf). 

On successful completion of the copy from source to destination, the archiver process renames "segment-filename.ready" to "segment-filename.done". This completes the archiving process.

It is understood that if any files named "segment-filename.ready" are found in $PGDATA/pg_xlog/archive_status, they are the pending files still to be copied to the archive destination.
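A typical (purely illustrative) archive configuration and a way to check archiver progress; the destination path is only an example, and pg_stat_archiver is assumed to be available (9.4+):

ALTER SYSTEM SET archive_mode = 'on';                    -- requires a server restart
ALTER SYSTEM SET archive_command = 'cp %p /archive/%f';  -- %p = WAL file path, %f = file name
-- after the restart:
SELECT archived_count, last_archived_wal, failed_count
FROM pg_stat_archiver;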

II. Storage:

● In addition to the postgresql.conf file already mentioned, PostgreSQL uses two other manually-edited configuration files, which control client authentication.
● All three configuration files are stored in the database cluster's data directory.
● The parameters described in this section allow the configuration files to be placed elsewhere.

Example: 

data_directory: 

Specifies the directory to use for data storage. This parameter can only be set at 
server start. 

config_file: 

Specifies the main server configuration file (customarily called postgresql.conf). This 
parameter can only be set on the postgres command line. 

hba_file: 

Specifies the configuration file for host-based authentication (customarily called pg_hba.conf). This parameter can only be set at server start.

ident_file: 

Specifies the configuration file for user name mapping (customarily called pg_ident.conf). This parameter can only be set at server start.

external_pid_file : 

Specifies the name of an additional process-ID (PID) file that the server should 
create for use by server administration programs. This parameter can only be set at 
server start. 
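These locations can be checked on a running server (a quick illustrative check):

SHOW data_directory;
SHOW config_file;
SHOW hba_file;
SHOW ident_file;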

PG_LOG: 

This is not an actual PostgreSQL directory; it is simply the directory where the packaged installation (RHEL, for example) stores the actual textual log files.

PG_XLOG: 

The write-ahead logs are stored here. This is the log where the records of both committed and uncommitted transactions are kept. Only a limited number of segment files are retained; the oldest are recycled and overwritten. If the archiver is on, completed segments are copied to the archive destination first.

PG_CLOG: 

It contains the commit log (transaction status) files, which are used for recovery after an instance crash.

PG_VERSION: 

A file containing the major version number of PostgreSQL 

Base: 

Subdirectory containing per-database subdirectories 

Global: 

Subdirectory containing cluster-wide tables, such as pg_database 

PG_MULTIXACT: 

Subdirectory containing multitransaction status data (used for shared row locks) 

PG_SUBTRANS: 

Subdirectory containing subtransaction status data 

PG_TBLSPC: 

Subdirectory containing symbolic links to tablespaces 

PG_TWOPHASE: 

Subdirectory containing state files for prepared transactions 

POSTMASTER.OPTS: 

A file recording the command-line options the postmaster was last started with 

POSTMASTER.PID:  

A lock file recording the current postmaster PID and shared memory segment ID (not 
present after postmaster shutdown) 

PostgreSQL Query Flow

1. Parser:

The parser stage consists of two parts: 

● The parser defined in gram.y and scan.l is built using the Unix tools bison and 
flex. 
● The transformation process does modifications and augmentations to the data 
structures returned by the parser. 
● The parser has to check the query string (which arrives as plain text) for valid 
syntax. If the syntax is correct a parse tree is built up and handed back; 

otherwise an error is returned. The parser and lexer are implemented using 
the well-known Unix tools bison and flex. 
● The lexer is defined in the file scan.l and is responsible for recognizing 
identifiers, the SQL key words etc. For every key word or identifier that is 
found, a token is generated and handed to the parser. 
● The parser is defined in the file gram.y and consists of a set of grammar rules 
and actions that are executed whenever a rule is fired. The code of the 
actions (which is actually C code) is used to build up the parse tree. 
● The file scan.l is transformed to the C source file scan.c using the program 
flex and gram.y is transformed to gram.c using bison. After these 
transformations have taken place a normal C compiler can be used to create 
the parser. Never make any changes to the generated C files as they will be 
overwritten the next time flex or bison is called. 

Note: The mentioned transformations and compilations are normally done automatically 
using the makefiles shipped with the PostgreSQL source distribution. 

A detailed description of bison or the grammar rules given in gram.y would be beyond the scope of this paper. There are many books and documents dealing with flex and bison. You should be familiar with bison before you start to study the grammar given in gram.y; otherwise you won't understand what happens there.

2. Traffic Cop:

The traffic cop is the agent that is responsible for differentiating between simple and complex query commands. Transaction control commands such as BEGIN and ROLLBACK are simple enough so as to not need additional processing, whereas other commands such as SELECT and JOIN are passed on to the rewriter. This discrimination reduces the processing time by performing minimal optimization on the simple commands and devoting more time to the complex ones.

Parsing is of two types:

● Soft parse: when the parsed representation of a submitted SQL statement already exists in the Postgres server (shared buffer), the server performs syntax and semantic checks but avoids the relatively costly operation of query optimization. It reuses the existing SQL area, which already has the execution plan required to execute the SQL statement.
● Hard parse: if a statement cannot be reused, or if it is the very first time the SQL statement is being loaded into the Postgres server (shared buffer), it results in a hard parse. Also, when a statement is aged out of the Postgres server (shared buffer) because the shared buffer is limited in size, reloading it again results in another hard parse. So the size of the shared buffer can also affect the number of parse calls.
1. We can query pg_prepared_statements to see what is cached. Note that it is not shared across sessions and is visible only to the current session.
2. The pg_buffercache module provides a means for examining what's happening in the shared buffer cache in real time.
3. The query below can even tell how many data blocks came from disk and how many came from shared_buffers, i.e. memory.
4. explain (analyze, buffers) statement:

explain (analyze,buffers) select * from users order by userid limit 20; 

A "shared read" means the block came from disk and was not cached. If the query is run again, and if the cache configuration is correct (we will discuss this below), it will show up as a "shared hit".
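Two illustrative queries for items 1 and 2 above; pg_buffercache must first be installed as an extension:

SELECT * FROM pg_prepared_statements;   -- statements prepared in the current session

CREATE EXTENSION IF NOT EXISTS pg_buffercache;
SELECT c.relname, count(*) AS buffers   -- which relations occupy the most shared buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;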

3. Rewriter:

The PostgreSQL rule system has had two implementations:

● The first one worked using row level processing and was implemented deep in 
the executor. The rule system was called whenever an individual row had been 
accessed. This implementation was removed in 1995 when the last official 
release of the Berkeley Postgres project was transformed into Postgres95. 
● The second implementation of the rule system is a technique called query 
rewriting. The rewrite system is a module that exists between the parser 
stage and the planner/optimizer. This technique is still implemented. 

4. Optimizer:

The task of the planner/optimizer is to create an optimal execution plan. A given SQL 
query (and hence, a query tree) can be actually executed in a wide variety of 
different ways, each of which will produce the same set of results. If it is 
computationally feasible, the query optimizer will examine each of these possible 
execution plans, ultimately selecting the execution plan that is expected to run the 
fastest. 

The planner's search procedure actually works with data structures called paths, 
which are simply cut-down representations of plans containing only as much 
information as the planner needs to make its decisions. After the cheapest path is 
determined, a full-fledged plan tree is built to pass to the executor. This represents 
the desired execution plan in sufficient detail for the executor to run it. In the rest 
of this section we'll ignore the distinction between paths and plans. 

5. Executor:

The executor takes the plan created by the planner/optimizer and recursively 
processes it to extract the required set of rows. This is essentially a demand-pull 
pipeline mechanism. Each time a plan node is called, it must deliver one more row, or 
report that it is done delivering rows. 

The executor mechanism is used to evaluate all four basic SQL query types: SELECT, 
INSERT, UPDATE, and DELETE. For SELECT, the top-level executor code only needs 
to send each row returned by the query plan tree off to the client. For INSERT, 
each returned row is inserted into the target table specified for the INSERT. This is 
done in a special top-level plan node called ModifyTable. (A simple INSERT ... 
VALUES command creates a trivial plan tree consisting of a single Result node, which 
computes just one result row, and ModifyTable above it to perform the insertion. But 
INSERT ... SELECT can demand the full power of the executor mechanism.) For 
UPDATE, the planner arranges that each computed row includes all the updated 
column values, plus the TID (tuple ID, or row ID) of the original target row; this 
data is fed into a ModifyTable node, which uses the information to create a new 
updated row and mark the old row deleted. For DELETE, the only column that is 
actually returned by the plan is the TID, and the ModifyTable node simply uses the 
TID to visit each target row and mark it deleted. 

Directory Structure:
 

All the data needed for a database cluster is stored within the cluster's data 
directory, commonly referred to as PGDATA. You can get the detailed description at 
below link: 
http://www.enterprisedb.com/docs/en/9.2/pg/storage-file-layout.html 
 
I see the diagram left out the one I would like to add: pg_serial. pg_serial is used to 
track summarized information about committed serializable transactions which might 
still become part of a serialization failure rolling back some not-yet-committed 
transaction to protect data integrity. 
The catalog cache is information from the system tables which describes the tables, 
indexes, views, etc. in the database. If you had to re-read that from the system 
tables each time, it would be slow. Even shared memory would be clumsy for that, so 
each backend process has its own cache of system catalog data for fast lookup. 
When anything changes, all backends are sent a signal to update or reload their cache 
data. When pages are read or written, they go through the OS cache, which is not 
directly under PostgreSQL control. The optimizer needs to keep track of a lot of 
information while it parses and plans a query, which is why that is shown. A plan has 
execution nodes, some of which may need to use memory; that is where work_mem 
comes in -- a sort or hash table (as examples) will try not to exceed work_mem *for 
that node*. It is significant that one query might use quite a few nodes which each 
allocate memory up to work_mem. But since most queries are simpler and might not 
use any work_mem allocations, people often do their calculations based on an 
expected maximum of one allocation per backend (i.e., per connection). But that could 
be off by quite a bit if all connections might be running queries with five nodes 
allocating memory. 
 
It is worth noting that if there is enough RAM on the machine to have a good-sized 
OS cache, a PostgreSQL page read will often just be a copy from system cache to pg 
shared_buffers, and a page write will often just be a copy from pg shared_buffers 
to the system cache. The fsync of tables which is part of the checkpoint process is 
when they are actually written from the OS to the storage system. But even there a 
server may have a battery-backed RAM cache, so the OS write to storage is often 
just a copy in RAM.... unless there is so much writing that the RAID controller's 
cache fills, at which point writes suddenly become hundreds of times slower than they 
were. 
 
Other interesting dynamics: pg will try to minimize disk writes by hanging onto dirty 
buffers (ones which have logically been updated) before writing them to the OS. But 
buffers may need to be written so they can be freed so that a new read or write has 
a buffer to use. If a request to read a page or write to a new buffer can't find an 
idle page, the query might need to write a buffer dirtied by some other backend 
before it can do its read (or whatever). The background writer can help with this. It 
tries to watch how fast new pages are being requested and write out dirty pages at a 
rate which will stay ahead of demand. 
 

Here are some things that are important to know when attempting to understand the 
database structure of PostgreSQL. 

Items related to the database 

1. PostgreSQL consists of several databases. This is called a database cluster.
2. When initdb() is executed, the template0, template1, and postgres databases are created.
3. The template0 and template1 databases are template databases for user database creation and contain the system catalog tables.
4. The list of tables in the template0 and template1 databases is the same immediately after initdb(). However, the user can create needed objects in the template1 database.
5. The user database is created by cloning the template1 database.

Items related to the tablespace

1. The pg_default and pg_global tablespaces are created immediately after initdb().
2. If you do not specify a tablespace at the time of table creation, the table is stored in the pg_default tablespace.
3. Tables managed at the database cluster level are stored in the pg_global tablespace.
4. The physical location of the pg_default tablespace is $PGDATA/base.
5. The physical location of the pg_global tablespace is $PGDATA/global.
6. One tablespace can be used by multiple databases. In that case, a database-specific subdirectory is created in the tablespace directory.
7. Creating a user tablespace creates a symbolic link to the user tablespace in the $PGDATA/pg_tblspc directory.

Items related to the table

1. There are three files per table.
2. One is a file for storing table data. The file name is the OID of the table.
3. One is a file to manage table free space. The file name is OID_fsm.
4. One is a file for managing the visibility of the table blocks. The file name is OID_vm.
5. An index does not have a _vm file; that is, it is composed of two files, OID and OID_fsm.

Other Things to Remember... 

The file name at the time of table and index creation is the OID, and OID and pg_class.relfilenode are the same at this point. However, when a rewrite operation (TRUNCATE, CLUSTER, VACUUM FULL, REINDEX, etc.) is performed, the relfilenode value of the affected object is changed, and the file name is also changed to the new relfilenode value. You can easily check the file location and name by using pg_relation_filepath('<object name>'), as shown below.
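For example (the table name is hypothetical; the returned path is relative to the data directory):

SELECT pg_relation_filepath('t1');   -- e.g. base/16384/16385
VACUUM FULL t1;                      -- a rewrite operation
SELECT pg_relation_filepath('t1');   -- the file name (relfilenode) has changed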

Running Tests 

If you query the pg_database view after initdb(), you can see that the template0, template1, and postgres databases have been created, as in the query below.

● Through the datistemplate column, you can see that the template0 and template1 databases are template databases for user database creation.
● The datallowconn column indicates whether the database can be accessed. Since the template0 database can't be accessed, its contents can't be changed either.
● The reason for providing two template databases is that the template0 database is the initial-state template, while the template1 database is the template to which the user can add objects.
● The postgres database is the default database created using the template1 database. If you do not specify a database at connection time, you will be connected to the postgres database.
● The databases are located under the $PGDATA/base directory. Each directory name is the database OID number.
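A minimal check:

SELECT oid, datname, datistemplate, datallowconn
FROM pg_database;
-- template0: datistemplate = true,  datallowconn = false
-- template1: datistemplate = true,  datallowconn = true
-- postgres:  datistemplate = false, datallowconn = true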


Create User Database 

The user database is created by cloning the template1 database. To verify this, create a user table T1 in the template1 database. After creating the mydb01 database, check that the T1 table exists, as in the example below. (See Figure 1-3.)
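A minimal sketch of that verification (the column definition is arbitrary):

\c template1
CREATE TABLE t1 (c1 int);

CREATE DATABASE mydb01;   -- clones template1, including t1

\c mydb01
\d t1                     -- the table copied from template1 is present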

Figure 1-3. Relationship between the template database and the user database

pg_default tablespace

If you query pg_tablespace after initdb(), you can see that the pg_default and pg_global tablespaces have been created, for example:
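SELECT oid, spcname FROM pg_tablespace;
-- returns pg_default and pg_global (plus any user-created tablespaces)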

The location of the pg_default tablespace is $PGDATA/base. In this directory there is a subdirectory per database OID. (See Figure 1-4.)

Figure 1-4. pg_default tablespace and database relationships from a physical configuration perspective

pg_global tablespace 

The pg_global tablespace is a tablespace for storing data to be managed at the 'database cluster' level.

● For example, tables like the pg_database table provide the same information no matter which database they are accessed from. (See Figure 1-5.)
● The location of the pg_global tablespace is $PGDATA/global.

Figure 1-5. Relationship between 
pg_global tablespace and database 

Create User Tablespace 


postgres=# create tablespace myts01 location '/data01'; 

Querying pg_tablespace shows that the myts01 tablespace has been created.

Symbolic links in the $PGDATA/pg_tblspc directory point to the tablespace directories.

Connect to the postgres and mydb01 databases and create a table in the new tablespace, as in the sketch below.
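A minimal sketch (the table definition is illustrative):

\c postgres
CREATE TABLE t1 (c1 int) TABLESPACE myts01;

\c mydb01
CREATE TABLE t1 (c1 int) TABLESPACE myts01;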


If you look up the /data01 directory after creating the table, you will see that the 
OID directory for the postgres and mydb01 databases has been created and that 
there is a file in each directory that has the same OID as the T1 table. 

How to Change Tablespace Location 

PostgreSQL specifies a directory when creating a tablespace. Therefore, if the file system where the directory is located becomes full, data can no longer be stored there. To solve this problem, you can use a volume manager. However, if you can't use a volume manager, you can consider changing the tablespace location. A sketch of the general approach is given after the note below.

Note: Tablespaces are also very useful in environments that use partition tables. 
Because you can use different tablespaces for each partition table, you can more 
flexibly cope with file system capacity problems. 
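One common approach, sketched here as an assumption rather than the document's exact procedure, is to create a new tablespace on a file system with free space and move objects into it (paths and object names are only examples):

CREATE TABLESPACE myts02 LOCATION '/data02';   -- new file system with free space

ALTER TABLE t1 SET TABLESPACE myts02;          -- moves the table's files (takes a lock while copying)
ALTER INDEX t1_idx SET TABLESPACE myts02;      -- indexes are moved separately
-- or move a whole database (no active connections to it are allowed):
-- ALTER DATABASE mydb01 SET TABLESPACE myts02;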

What is Vacuum? Vacuum does the following:

1. Gathers table and index statistics
2. Reorganizes the table
3. Cleans up dead blocks in tables and indexes
4. Freezes old record XIDs to prevent XID wraparound

#1 and #2 are generally required for DBMS management, but #3 and #4 are necessary because of PostgreSQL's MVCC feature.
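Typical invocations (the table name is illustrative):

VACUUM (VERBOSE, ANALYZE) t1;   -- reclaim dead tuples and refresh statistics
VACUUM FULL t1;                 -- rewrite the table to return disk space (takes an exclusive lock)
VACUUM FREEZE t1;               -- aggressively freeze tuple XIDs to prevent wraparound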
