
Redundant Linux/PostgreSQL Setup for High Availability

Evan Summers

Rev 1.2, 2011-12-10

This document is sponsored by http://BizSwitch.net. Read the latest version on Google Docs via http://evanx.info.

1 Platform overview
2 PostgreSQL hot-standby overview
3 Hardware
3.1 Multiple firewalls
3.2 Maximum server RAM
3.3 RAID10
3.4 SSD cache
3.5 Hot spare drive
3.6 Cold spares
3.7 Redundant power supply
4 Backup with rdiff-backup and/or rsync
5 PostgreSQL warm-standby for PITR
6 Fail-over procedure
6.1 Fail-over procedure commands
6.2 Fail-back
6.3 Scheduled fail-over drill
7 Conclusion
8 References

1 Platform overview
The application server is a Linux server, e.g. running CentOS 6, with a PostgreSQL 9 database and the Apache Tomcat application server, hosting a custom OLTP application and reports. Our goals are data redundancy, high performance and high availability. Fail-over is presented as a manual procedure that might take a few minutes once the decision to fail over has been made, during which time the system is unavailable. For reference, five nines (99.999%) availability allows at most about 5 minutes of downtime per annum.

The server performs self-monitoring of the application, CPU load, disk space, RAID status, and PostgreSQL replication. The open-source Nagios package is used for this purpose, with a custom application plugin for application errors. An attached modem, or alternatively an SMS gateway on the Internet, can be used to send urgent notifications via SMS to system administrators, e.g. when problems are detected which require urgent attention, and possibly a fail-over to the standby server. Alternatively, or in conjunction, a cloud monitoring service could be used, e.g. bijk.com, pingdom.com etc.
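A custom application plugin for Nagios can be as simple as a shell function following the Nagios plugin convention (one status line on stdout; exit code 0 for OK, 2 for CRITICAL). The sketch below is illustrative only: the log path and the ERROR marker it greps for are assumptions, not taken from any particular application.

```shell
# check_app_errors LOGFILE
# Nagios-style check: CRITICAL (2) if the application log contains ERROR
# lines, OK (0) otherwise. A missing log file is treated as OK here; a
# production plugin would report UNKNOWN instead.
check_app_errors() {
    errors=$(grep -c 'ERROR' "$1" 2>/dev/null)
    if [ "${errors:-0}" -gt 0 ]; then
        echo "CRITICAL - ${errors} application error(s) in $1"
        return 2
    fi
    echo "OK - no application errors in $1"
    return 0
}
```

Nagios would invoke this via a one-line wrapper script, e.g. `check_app_errors /var/log/app/app.log` (path illustrative).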

2 PostgreSQL hot-standby overview

The primary application server is mirrored to a hot-standby server for fail-over in the event of a non-recoverable failure of the primary server. The standby server can be on-site (together with the primary server) and/or off-site; that is, two standby servers can be deployed, one on-site and the other off-site. The off-site standby then serves as a backup which caters for a disaster or disruption at the primary site.

The standby server configuration mirrors that of the primary server, except that the PostgreSQL database operates in hot-standby mode. This is a read-only mode in which the database engine applies changes from the primary database as they occur, via a streaming replication connection. Additionally, the standby server can be used for reporting purposes, to take that load off the primary database. We thereby improve the capacity and performance of our database for both transactional and analytical (reporting) purposes.
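For reference, streaming replication to a PostgreSQL 9.0 hot standby rests on a few configuration settings along these lines; the host name, user and segment count are illustrative, not prescriptive.

```
# postgresql.conf on the primary
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 128

# postgresql.conf on the standby
hot_standby = on

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```

The primary's pg_hba.conf must additionally permit a `replication` connection from the standby's address.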

3 Hardware
3.1 Multiple firewalls
The primary and standby servers are connected to the Internet via a firewall, which forwards the application ports to the primary server and, in the event of a fail-over, switches those ports to the standby server instead. In the event of failure of the primary server, e.g. as alerted by the Nagios monitoring service, the system administrator takes the decision to fail over to the standby server and performs the fail-over procedure. This procedure, with practice, can be performed within a minute or two. Note that in addition to this hardware firewall, we also configure iptables on the Linux servers, so that we have as many levels of security as possible. Additionally, e.g. in a cloud-computing environment, we might deploy our database server into a VM whose hypervisor functions as a virtual firewall.
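The iptables layer on each server can be sketched as below. The port numbers (8080 for the application, 5432 for PostgreSQL replication) and the standby's address are assumptions to adapt; a real policy would be stricter.

```
# Default-deny inbound; allow loopback, established traffic, SSH, the
# application port, and replication from the standby's address only.
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp -s 192.0.2.20 --dport 5432 -j ACCEPT
```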

3.2 Maximum server RAM

Since RAM is so much faster than hard drives, and database servers are I/O bound, we maximise the RAM in the server to boost performance via caching. Identify the largest DIMM size at which the price per gigabyte is still fairly linear, i.e. without too big a jump in price per gigabyte, e.g. an 8GB DIMM, and then populate all the available slots in the server with this size, e.g. 12x 8GB giving 96GB of RAM. For high performance and large data sets, consider installing the maximum possible RAM, e.g. 12x 16GB DIMMs for 192GB in total.

3.3 RAID10

Since disks are the most common hardware failure (with an MTBF of e.g. 400k hours), we use RAID for storage, ideally hardware RAID, e.g. via an LSI MegaRAID card. Ideally we use a RAID controller with a write-back cache enabled by a battery backup unit, e.g. on an LSI MegaRAID controller, or alternatively an SSD write cache. This provides reliability and improved database performance. Linux software RAID is an alternative option (via md or btrfs), but does not provide the performance benefit of a battery-backed write-back cache or managed SSD cache.

Internal storage is configured as RAID10, which is preferred for database applications. The RAID10 array might consist of 4x 2T drives, providing 4T of actual storage capacity, since the disks are mirrored. In this configuration, sometimes denoted RAID1+0, logical disks are striped across mirrored pairs. The striping improves write performance, and the mirroring improves read performance (by load balancing), as well as ensuring redundancy.
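For the software-RAID alternative mentioned above, a four-drive RAID10 array can be created with mdadm along these lines; the device names are illustrative.

```
# 4x 2T drives in RAID10 yield roughly 4T of usable capacity
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
```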

3.4 SSD cache

For a further performance boost, high-end RAID controllers enable the use of an SSD array attached to the hybrid controller to function as a large second-level read and write cache. For example, we use 4x SSDs in a RAID10 array for such a cache, connected to the controller at 6Gb/s. Alternatively we can attach 2x SSDs in a RAID1 configuration using Linux software RAID, or add a PCI SSD card with built-in redundant flash, as a fast tablespace for indexes and/or the PostgreSQL WAL directory (pg_xlog), using symlinks. In this case we benefit from the fast seek times and I/O performance of SSDs. However, if the indexes are already fully cached in RAM, this will not accelerate read access.
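The symlink approach for pg_xlog can be sketched as follows. The example paths are illustrative, and PostgreSQL must be stopped while the directory is moved.

```shell
# move_wal DATADIR SSD_TARGET
# Relocate the WAL directory onto an SSD volume, leaving a symlink behind
# so that PostgreSQL still finds pg_xlog at its usual location.
# e.g. move_wal /var/lib/pgsql/9.0/data /ssd/pg_xlog   (illustrative paths)
move_wal() {
    mv "$1/pg_xlog" "$2"        # move the WAL files onto the SSD volume
    ln -s "$2" "$1/pg_xlog"     # symlink from the data directory
}
```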

3.5 Hot spare drive

Additionally, a hot-spare drive is included in the array, so that in the event of a disk failure this hot spare can be utilised immediately, and the faulty drive (which is excluded from the RAID array in favour of the hot spare) can be scheduled for replacement. MegaRAID, for example, can be configured to auto-rebuild onto the pre-configured hot spare. We use Nagios to monitor the RAID array, e.g. via the MegaCli command, so that we know when this occurs in order to schedule a swap-out of the failed drive.
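A minimal Nagios-style wrapper around MegaCli output might look like the following. The exact output format varies by controller firmware, so the Online/Hotspare states matched here are assumptions to verify against your own `MegaCli -PDList -aALL` output.

```shell
# check_raid_pd: read `MegaCli -PDList -aALL` output on stdin and flag any
# physical drive whose firmware state is neither Online nor Hotspare.
check_raid_pd() {
    bad=$(grep '^Firmware state:' | grep -cv -e 'Online' -e 'Hotspare')
    if [ "$bad" -gt 0 ]; then
        echo "CRITICAL: $bad drive(s) degraded"
        return 2
    fi
    echo "OK: all drives online"
    return 0
}
```

Usage: `MegaCli -PDList -aALL | check_raid_pd`, scheduled as a Nagios check.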

3.6 Cold spares

In addition to the hot-spare drive already installed in the server, a further spare drive is kept for each server as a cold standby, for a scheduled replacement in the event of a drive failure. Ideally the server hardware should support hot-swap of drives, so that the server does not have to be shut down in order to swap out a failed hard drive. The spare firewall security router should be pre-configured as per the live router, so that it can be swapped in without requiring further configuration, other than changing its IP address to that of the faulty router that is removed.

3.7 Redundant power supply

To ensure high-availability, consider using enterprise server hardware with redundant power supplies, and hot-swap drive bays.

4 Backup with rdiff-backup and/or rsync

The open-source rdiff-backup utility is invoked via a daily cron script to keep incremental backups of the primary server on the standby server, including the configuration of the server and the application components, the log files, and a snapshot of the database. Prior to the rdiff-backup, the daily application log files are gzipped by this nightly cron script, and then included in the rdiff-backup. Additionally, this nightly cron script performs a pg_dumpall of the PostgreSQL database, also to be included in the rdiff-backup. If the remote machine does not have rdiff-backup installed, or has an incompatible version thereof, then we rdiff-backup directories such as /etc and /opt to the machine itself, to benefit from rdiff-backup's time-based recovery feature, and then rsync the rdiff-backup directory to the remote server. When rsyncing to a backup server, we use its --hard-links and --link-dest options.
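Put together, the nightly job can be sketched along these lines. The host name, log path and backup layout are assumptions, and this is a sketch rather than a hardened backup script.

```
# Nightly cron script (sketch), e.g. run from /etc/cron.daily/
gzip /var/log/app/app.$(date +%F --date=yesterday).log   # compress yesterday's log
pg_dumpall -U postgres | gzip > /backup/pg/dumpall.$(date +%F).sql.gz
rdiff-backup /etc /backup/rdiff/etc   # local repositories, for time-based recovery
rdiff-backup /opt /backup/rdiff/opt
# mirror the backup area to the standby, preserving hard links;
# --link-dest can additionally be used for rotating snapshot directories
rsync -a --hard-links --delete /backup/ standby:/backup/primary/
```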

5 PostgreSQL warm-standby for PITR

In addition to the hot-standby PostgreSQL databases (possibly one on-site and another off-site), a warm-standby database for point-in-time recovery (PITR) is maintained on the primary server. The WALs (write-ahead log files) of the primary database are archived, and these are replayed nightly to the warm standby to catch up with the primary database. Alternatively, the PostgreSQL data directory can be rsynced overnight and LVM snapshots used, so that the PITR database can be used as a live test database (up to date as of the previous night), and reverted and recovered to any point in time of the current day, or to any time when LVM snapshots were taken and the subsequent WAL files are archived. If a PITR recovery is required for a given time of the current day, the archived WAL files can then be replayed to that point in time.
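WAL archiving and point-in-time recovery rest on two small pieces of configuration; the archive path and target time below are illustrative.

```
# postgresql.conf on the primary: archive each completed WAL segment
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'

# recovery.conf on the warm standby, to recover to a given point in time
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2011-12-10 14:30:00'
```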

6 Fail-over procedure
The following steps are taken to fail over to the on-site standby server.

1. Shut down the primary application server.
2. Shut down the primary database.
3. Promote the standby PostgreSQL database.
4. Start up the standby application.
5. Redirect ports on the firewall to the standby server.

In general, in the case of fail-over, we must take care to shut down the primary server and otherwise ensure that transactions cannot be processed by both the primary server and the standby server at the same time.

6.1 Fail-over procedure commands

These steps are effected with the following Linux commands, run as root. As is standard on Linux, the required scripts are in the /etc/init.d/ directory.

On the primary, we stop the application server and the database using their init scripts.

primary# /etc/init.d/tomcat stop
primary# /etc/init.d/postgresql-9.0 stop
primary# /etc/init.d/postgresql-9.0 demote

Note that the demote command parameter above (and similarly promote) is a custom addition to this otherwise standard script, for our fail-over setup.

On the standby server, we promote the PostgreSQL database from a hot standby to our live transactional database, using the postgresql-9.0 script with the custom parameter, as follows.

standby# /etc/init.d/postgresql-9.0 promote

Finally we start the application on the standby server using the application script.

standby# /etc/init.d/tomcat start
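For context, PostgreSQL 9.0 has no built-in promote command: a standby is promoted by creating the trigger_file named in its recovery.conf. The custom promote action in the init script can therefore be sketched as below; the trigger path is an assumption that must match a recovery.conf line such as trigger_file = '/var/lib/pgsql/9.0/data/failover.trigger'.

```shell
# promote_standby [TRIGGER_PATH]
# Creating the trigger file causes the 9.0 standby to leave recovery
# and open read-write; the default path here is illustrative.
promote_standby() {
    touch "${1:-/var/lib/pgsql/9.0/data/failover.trigger}"
}
```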

6.2 Fail-back
In the case of an on-site fail-over, we recover the primary server as the new standby server, i.e. its database is restored from the operational database (on the promoted standby) and put into hot-standby mode to replicate transactions. We might then wish to schedule a fail-back, where the original primary server becomes the active primary server again and the standby is demoted to hot-standby mode, i.e. replicating from the primary database. In this case the procedure is reversed, as follows.

standby# /etc/init.d/tomcat stop
standby# /etc/init.d/postgresql-9.0 demote

Once the standby is shut down as above, we promote the primary back and start it up.

primary# /etc/init.d/postgresql-9.0 promote
primary# /etc/init.d/tomcat start

6.3 Scheduled fail-over drill

We recommend that a scheduled fail-over and fail-back be performed, e.g. once per year, in order to be assured that everything is in order for such a fail-over, and to provide the necessary practice so that the fail-over can be performed rapidly in the event of an emergency.

7 Conclusion
We use a Linux/PostgreSQL platform, utilising PostgreSQL streaming replication to enable a hot-standby server. Since the most common server failure is hard disk failure, we naturally employ disk mirroring via a RAID controller with a battery-backed write-back cache (and/or a large SSD cache), which also serves to boost database performance. The configuration of the primary server is mirrored to a standby server, and transactions are replicated as they occur by PostgreSQL. An additional benefit of the standby PostgreSQL 9 database is that it can be used for application reports, which further improves the performance of the system, in addition to its redundancy.

In addition to a hot-standby database, we configure a warm standby to enable point-in-time recovery (PITR). Additionally, using LVM for the PITR data enables it to be used as a test database that can be reverted and recovered to any point in time, as needed.

Scripts are customised to make the fail-over procedure straightforward: shut down the application on the failed primary server, promote the standby database, start the application on the standby server, and finally redirect the ports on the firewall to the standby server. Fail-back is essentially a reverse of the fail-over procedure.

Notwithstanding the hot-standby and warm-standby solutions, we recommend using cron to schedule nightly historical snapshots of the database using pg_dumpall, to back up those files and daily logs remotely using rsync, and to back up directories with configuration files etc. using rdiff-backup.

8 References
Apache Tomcat - http://tomcat.apache.org/
CentOS - http://www.centos.org/
Nagios - http://www.nagios.org/
LSI MegaRAID Controller with CacheCade etc - http://www.lsi.com
RAID10 - http://en.wikipedia.org/wiki/RAID
rdiff-backup - http://www.nongnu.org/rdiff-backup/
SMS Server Tools - http://smstools3.kekekasvi.com/
PostgreSQL - http://www.postgresql.org/