
Redundant Linux/PostgreSQL Setup for High Availability

Evan Summers

Rev 1.2, 2011-12-10

This document is sponsored by http://BizSwitch.net. Read the latest version on Google Docs via http://evanx.info.

1 Platform overview
2 PostgreSQL hot-standby overview
3 Hardware
3.1 Multiple firewalls
3.2 Maximum server RAM
3.3 RAID10
3.4 SSD cache
3.5 Hot spare drive
3.6 Cold spares
3.7 Redundant power supply
4 Backup with rdiff-backup and/or rsync
5 PostgreSQL warm-standby for PITR
6 Fail-over procedure
6.1 Fail-over procedure commands
6.2 Fail-back
6.3 Scheduled fail-over drill
7 Conclusion
8 References

1 Platform overview
The application server is a Linux server, e.g. running CentOS 6, with a PostgreSQL 9 database and the Apache Tomcat application server, hosting a custom OLTP application and reports. Our goals are data redundancy, high performance and high availability. Fail-over is presented as a manual procedure that might take a few minutes once the decision to fail over has been made, during which time the system is unavailable. For reference, five nines (99.999%) availability allows at most about 5 minutes of downtime per annum.

The server performs self-monitoring of the application, CPU load, disk space, RAID status, and PostgreSQL replication. The open-source Nagios package is used for this purpose, with a custom application plugin for application errors. An attached modem, or alternatively an SMS gateway on the Internet, can be used to send urgent notifications via SMS to system administrators, e.g. when problems are detected which require urgent attention, and possibly a fail-over to the standby server. Alternatively, or in conjunction, a cloud monitoring service could be used, e.g. bijk.com, pingdom.com etc.
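A custom application plugin for Nagios can be as simple as a shell function following the Nagios plugin convention (one status line on stdout; exit code 0 for OK, 2 for CRITICAL). The sketch below is illustrative only: the log path and the ERROR marker it greps for are assumptions, not taken from any particular application.

```shell
# check_app_errors LOGFILE
# Nagios-style check: CRITICAL (2) if the application log contains ERROR
# lines, OK (0) otherwise. A missing log file is treated as OK here; a
# production plugin would report UNKNOWN instead.
check_app_errors() {
    errors=$(grep -c 'ERROR' "$1" 2>/dev/null)
    if [ "${errors:-0}" -gt 0 ]; then
        echo "CRITICAL - ${errors} application error(s) in $1"
        return 2
    fi
    echo "OK - no application errors in $1"
    return 0
}
```

Nagios would invoke this via a one-line wrapper script, e.g. `check_app_errors /var/log/app/app.log` (path illustrative).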

2 PostgreSQL hot-standby overview

The primary application server is mirrored to a hot-standby server for fail-over in the event of a non-recoverable failure of the primary server. The standby server can be on-site (together with the primary server) and/or off-site; that is, two standby servers can be deployed, one on-site and the other off-site. The off-site standby then serves as a backup which caters for a disaster or disruption at the primary site.

The standby server configuration mirrors that of the primary server, except that the PostgreSQL database operates in hot-standby mode. This is a read-only mode in which the database engine applies changes from the primary database as they occur, via a streaming replication connection. Additionally, the standby server can be used for reporting purposes, to take that load off the primary database. We thereby improve the capacity and performance of our database for both transactional and analytical (reporting) purposes.
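For reference, streaming replication to a PostgreSQL 9.0 hot standby rests on a few configuration settings along these lines; the host name, user and segment count are illustrative, not prescriptive.

```
# postgresql.conf on the primary
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 128

# postgresql.conf on the standby
hot_standby = on

# recovery.conf on the standby
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```

The primary's pg_hba.conf must additionally permit a `replication` connection from the standby's address.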

3 Hardware
3.1 Multiple firewalls
The primary and standby servers are connected to the Internet via a firewall, which forwards the application ports to the primary server and, in the event of a fail-over, switches those ports to the standby server instead. In the event of failure of the primary server, e.g. as alerted by the Nagios monitoring service, the system administrator takes the decision to fail over to the standby server and performs the fail-over procedure. This procedure, with practice, can be performed within a minute or two. Note that in addition to this hardware firewall, we also configure iptables on the Linux servers, so that we have as many levels of security as possible. Additionally, e.g. in a cloud-computing environment, we might deploy our database server into a VM whose hypervisor functions as a virtual firewall.
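The iptables layer on each server can be sketched as below. The port numbers (8080 for the application, 5432 for PostgreSQL replication) and the standby's address are assumptions to adapt; a real policy would be stricter.

```
# Default-deny inbound; allow loopback, established traffic, SSH, the
# application port, and replication from the standby's address only.
iptables -P INPUT DROP
iptables -A INPUT -i lo -j ACCEPT
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
iptables -A INPUT -p tcp --dport 8080 -j ACCEPT
iptables -A INPUT -p tcp -s 192.0.2.20 --dport 5432 -j ACCEPT
```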

3.2 Maximum server RAM

Since RAM is so much faster than hard drives, and database servers are I/O bound, we maximise the RAM in the server to boost performance via caching. Identify the largest DIMM size at which the price per gigabyte is still fairly linear, i.e. without too big a jump in price per gigabyte, e.g. an 8GB DIMM, and then populate all the available slots in the server with this size, e.g. 12x 8GB giving 96GB of RAM. For high performance and large data sets, consider installing the maximum possible RAM, e.g. 12x 16GB DIMMs for 192GB in total.

3.3 RAID10

Since disks are the most common hardware failure (with an MTBF of e.g. 400k hours), we use RAID for storage, ideally hardware RAID, e.g. via an LSI MegaRAID card. Ideally we use a RAID controller with a write-back cache enabled by a battery backup unit, e.g. on an LSI MegaRAID controller, or alternatively an SSD write cache. This provides reliability and improved database performance. Linux software RAID is an alternative option (via md or btrfs), but does not provide the performance benefit of a battery-backed write-back cache or managed SSD cache.

Internal storage is configured as RAID10, which is preferred for database applications. The RAID10 array might consist of 4x 2T drives, providing 4T of actual storage capacity, since the disks are mirrored. In this configuration, sometimes denoted RAID1+0, logical disks are striped across mirrored pairs. The striping improves write performance, and the mirroring improves read performance (by load balancing), as well as ensuring redundancy.
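For the software-RAID alternative mentioned above, a four-drive RAID10 array can be created with mdadm along these lines; the device names are illustrative.

```
# 4x 2T drives in RAID10 yield roughly 4T of usable capacity
mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mkfs.ext4 /dev/md0
```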

3.4 SSD cache

For a further performance boost, high-end RAID controllers enable the use of an SSD array attached to the hybrid controller to function as a large second-level read and write cache. For example, we use 4x SSDs in a RAID10 array for such a cache, connected to the controller at 6Gb/s. Alternatively we can attach 2x SSDs in a RAID1 configuration using Linux software RAID, or add a PCI SSD card with built-in redundant flash, as a fast tablespace for indexes and/or the PostgreSQL WAL directory (pg_xlog), using symlinks. In this case we benefit from the fast seek times and I/O performance of SSDs. However, if the indexes are already fully cached in RAM, this will not accelerate read access.
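The symlink approach for pg_xlog can be sketched as follows. The example paths are illustrative, and PostgreSQL must be stopped while the directory is moved.

```shell
# move_wal DATADIR SSD_TARGET
# Relocate the WAL directory onto an SSD volume, leaving a symlink behind
# so that PostgreSQL still finds pg_xlog at its usual location.
# e.g. move_wal /var/lib/pgsql/9.0/data /ssd/pg_xlog   (illustrative paths)
move_wal() {
    mv "$1/pg_xlog" "$2"        # move the WAL files onto the SSD volume
    ln -s "$2" "$1/pg_xlog"     # symlink from the data directory
}
```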

3.5 Hot spare drive

Additionally, a hot-spare drive is included in the array, so that in the event of a disk failure this hot spare can be utilised immediately, and the faulty drive (which is excluded from the RAID array in favour of the hot spare) can be scheduled for replacement. MegaRAID, for example, can be configured to auto-rebuild onto the pre-configured hot spare. We use Nagios to monitor the RAID array, e.g. via the MegaCli command, so that we know when this occurs in order to schedule a swap-out of the failed drive.
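A minimal Nagios-style wrapper around MegaCli output might look like the following. The exact output format varies by controller firmware, so the Online/Hotspare states matched here are assumptions to verify against your own `MegaCli -PDList -aALL` output.

```shell
# check_raid_pd: read `MegaCli -PDList -aALL` output on stdin and flag any
# physical drive whose firmware state is neither Online nor Hotspare.
check_raid_pd() {
    bad=$(grep '^Firmware state:' | grep -cv -e 'Online' -e 'Hotspare')
    if [ "$bad" -gt 0 ]; then
        echo "CRITICAL: $bad drive(s) degraded"
        return 2
    fi
    echo "OK: all drives online"
    return 0
}
```

Usage: `MegaCli -PDList -aALL | check_raid_pd`, scheduled as a Nagios check.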

3.6 Cold spares

In addition to the hot-spare drive already installed in the server, a further spare drive is kept for each server as a cold standby, for a scheduled replacement in the event of a drive failure. Ideally the server hardware should support hot-swap of drives, so that the server does not have to be shut down in order to swap out a failed hard drive. The spare firewall security router should be pre-configured as per the live router, so that it can be swapped in without requiring further configuration, other than changing its IP address to that of the faulty router that is removed.

3.7 Redundant power supply

To ensure high-availability, consider using enterprise server hardware with redundant power supplies, and hot-swap drive bays.

4 Backup with rdiff-backup and/or rsync

The open-source rdiff-backup utility is invoked via a daily cron script to keep incremental backups of the primary server on the standby server, including the configuration of the server and the application components, the log files, and a snapshot of the database. Prior to the rdiff-backup, the daily application log files are gzipped by this nightly cron script, and then included in the rdiff-backup. Additionally, this nightly cron script performs a pg_dumpall of the PostgreSQL database, also to be included in the rdiff-backup. If the remote machine does not have rdiff-backup installed, or has an incompatible version thereof, then we rdiff-backup directories such as /etc and /opt to the machine itself, to benefit from rdiff-backup's time-based recovery feature, and then rsync the rdiff-backup directory to the remote server. When rsyncing to a backup server, we use its --hard-links and --link-dest options.
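Put together, the nightly job can be sketched along these lines. The host name, log path and backup layout are assumptions, and this is a sketch rather than a hardened backup script.

```
# Nightly cron script (sketch), e.g. run from /etc/cron.daily/
gzip /var/log/app/app.$(date +%F --date=yesterday).log   # compress yesterday's log
pg_dumpall -U postgres | gzip > /backup/pg/dumpall.$(date +%F).sql.gz
rdiff-backup /etc /backup/rdiff/etc   # local repositories, for time-based recovery
rdiff-backup /opt /backup/rdiff/opt
# mirror the backup area to the standby, preserving hard links;
# --link-dest can additionally be used for rotating snapshot directories
rsync -a --hard-links --delete /backup/ standby:/backup/primary/
```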

5 PostgreSQL warm-standby for PITR

In addition to the hot-standby PostgreSQL databases (possibly one on-site and another off-site), a warm-standby database for point-in-time recovery (PITR) is maintained on the primary server. The WALs (write-ahead log files) of the primary database are archived, and these are replayed nightly to the warm standby to catch up with the primary database. Alternatively, the PostgreSQL data directory can be rsynced overnight and LVM snapshots used, so that the PITR database can be used as a live test database (up to date as of the previous night), and reverted and recovered to any point in time of the current day, or to any time when LVM snapshots were taken and the subsequent WAL files are archived. If a PITR recovery is required for a given time of the current day, the archived WAL files can then be replayed to that point in time.
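WAL archiving and point-in-time recovery rest on two small pieces of configuration; the archive path and target time below are illustrative.

```
# postgresql.conf on the primary: archive each completed WAL segment
archive_mode = on
archive_command = 'cp %p /archive/wal/%f'

# recovery.conf on the warm standby, to recover to a given point in time
restore_command = 'cp /archive/wal/%f %p'
recovery_target_time = '2011-12-10 14:30:00'
```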

6 Fail-over procedure
The following steps are taken to fail over to the on-site standby server.

1. Shut down the primary application server.
2. Shut down the primary database.
3. Promote the standby PostgreSQL database.
4. Start up the standby application.
5. Redirect ports on the firewall to the standby server.

In general, in the case of fail-over, we must take care to shut down the primary server and otherwise ensure that transactions cannot be processed by both the primary server and the standby server at the same time.

6.1 Fail-over procedure commands

These steps are effected with the following Linux commands, run as root. As is standard on Linux, the required scripts are in the /etc/init.d/ directory.

On the primary, we stop the application server and the database using their init scripts.

primary# /etc/init.d/tomcat stop
primary# /etc/init.d/postgresql-9.0 stop
primary# /etc/init.d/postgresql-9.0 demote

Note that the demote command parameter above (and similarly promote) is a custom addition to this otherwise standard script, for our fail-over setup.

On the standby server, we promote the PostgreSQL database from a hot standby to our live transactional database, using the postgresql-9.0 script with the custom parameter, as follows.

standby# /etc/init.d/postgresql-9.0 promote

Finally we start the application on the standby server using the application script.

standby# /etc/init.d/tomcat start
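For context, PostgreSQL 9.0 has no built-in promote command: a standby is promoted by creating the trigger_file named in its recovery.conf. The custom promote action in the init script can therefore be sketched as below; the trigger path is an assumption that must match a recovery.conf line such as trigger_file = '/var/lib/pgsql/9.0/data/failover.trigger'.

```shell
# promote_standby [TRIGGER_PATH]
# Creating the trigger file causes the 9.0 standby to leave recovery
# and open read-write; the default path here is illustrative.
promote_standby() {
    touch "${1:-/var/lib/pgsql/9.0/data/failover.trigger}"
}
```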

6.2 Fail-back
In the case of an on-site fail-over, we recover the primary server as the new standby server, i.e. its database is restored from the operational database (on the promoted standby) and put into hot-standby mode to replicate transactions. We might then wish to schedule a fail-back, where the original primary server becomes the active primary server again and the standby is demoted to hot-standby mode, i.e. replicating from the primary database. In this case the procedure is reversed, as follows.

standby# /etc/init.d/tomcat stop
standby# /etc/init.d/postgresql-9.0 demote

Once the standby is shut down as above, we promote the primary back and start it up.

primary# /etc/init.d/postgresql-9.0 promote
primary# /etc/init.d/tomcat start

6.3 Scheduled fail-over drill

We recommend that a scheduled fail-over and fail-back be performed, e.g. once per year, in order to be assured that everything is in order for such a fail-over, and to provide the necessary practice so that the fail-over can be performed rapidly in the event of an emergency.

7 Conclusion
We use a Linux/PostgreSQL platform, utilising PostgreSQL streaming replication to enable a hot-standby server. Since the most common server failure is hard disk failure, we naturally employ disk mirroring via a RAID controller with a battery-backed write-back cache (and/or a large SSD cache), which also serves to boost database performance. The configuration of the primary server is mirrored to a standby server, and transactions are replicated as they occur by PostgreSQL. An additional benefit of the standby PostgreSQL 9 database is that it can be used for application reports, which further improves the performance of the system, in addition to its redundancy.

In addition to a hot-standby database, we configure a warm standby to enable point-in-time recovery (PITR). Additionally, using LVM for the PITR data enables it to be used as a test database that can be reverted and recovered to any point in time, as needed.

Scripts are customised to make the fail-over procedure straightforward: shut down the application on the failed primary server, promote the standby database, start the application on the standby server, and finally redirect the ports on the firewall to the standby server. Fail-back is essentially a reverse of the fail-over procedure.

Notwithstanding the hot-standby and warm-standby solutions, we recommend using cron to schedule nightly historical snapshots of the database using pg_dumpall, to back up those files and daily logs remotely using rsync, and to back up directories with configuration files etc. using rdiff-backup.

8 References
Apache Tomcat - http://tomcat.apache.org/
CentOS - http://www.centos.org/
Nagios - http://www.nagios.org/
LSI MegaRAID Controller with CacheCade etc - http://www.lsi.com
RAID10 - http://en.wikipedia.org/wiki/RAID
rdiff-backup - http://www.nongnu.org/rdiff-backup/
SMS Server Tools - http://smstools3.kekekasvi.com/
PostgreSQL - http://www.postgresql.org/