
MySQL:

A Guide to High Availability

A MySQL Strategy Whitepaper August 2013

Copyright 2012, 2013. Oracle and/or its affiliates. All rights reserved.

Table of Contents

1. Executive Summary
2. Understanding the Causes and Effects of Downtime
3. Determining High Availability Requirements
4. Database Replication
5. MySQL Enterprise HA: Clustering & Virtualization
6. Shared-Nothing, Failover Clusters
7. Comparing MySQL HA Solutions
8. Third-Party HA Technologies
9. Operational Best Practices
10. Conclusion
11. Additional Resources


1. Executive Summary

Data is the currency of today's web, mobile, social, enterprise and cloud applications. Ensuring data is always available is a top priority for any organization: minutes of downtime will result in significant loss of revenue and reputation.

There is no "one size fits all" approach to delivering High Availability (HA). Unique application attributes, business requirements, operational capabilities and legacy infrastructure can all influence HA technology selection. Moreover, technology is only one element in delivering HA: people and processes are just as critical as the technology itself.

This Guide is designed to assist Developers, Architects and DBAs in navigating the complex waters of HA. It presents:
- A methodology for selecting the right HA solution to meet Service Level Agreements;
- A tour of the leading certified HA solutions for MySQL;
- Operational best practices to implement and support HA.

As the world's leading open source database, MySQL offers many options for HA, scaling all the way up to 99.999% uptime. This guide discusses those options and shows you how to get the best levels of availability for your application.

2. Understanding the Causes and Effects of Downtime

In developing a strategy to make services highly available, it is important to understand the different causes of downtime and the impact they can have on your organization. As shown in Figure 1 below, downtime can generally be attributed to one of four events:
- System Failures: server faults, software bugs or crashes, networking errors;
- Physical Disasters: events causing failures of an entire data center, including fire, flood, hurricanes;
- Scheduled Maintenance: hardware and software upgrades, patches, hot-fixes;
- Operator or User Errors: accidental or malicious activities such as file deletion, malware, and poor operational procedures.

For obvious reasons, organizations are reluctant to share details on outages they experience, but anecdotal evidence suggests the following:
- 50% of all outages are the result of failure and/or disaster events;
- 30% of all outages are the result of scheduled maintenance operations;
- 20% of all outages are the result of operator or user errors.

Additional research complements these numbers by attributing 80% of downtime to people and process (40% each), while products (technology) contribute the remaining 20% [1].

[1] http://www.gartner.com/DisplayDocument?id=334197


Figure 1: Cause, Effect and Impact of Downtime

2.1. Calculating the Cost & Impact of Downtime

The starting point in identifying the appropriate HA strategy for an application is usually the calculation of revenue losses arising from downtime over a given time period, for example:
- Orders can't be placed;
- Financial trades can't be completed;
- Subscribers can't be billed, etc.

However, direct revenue loss is only one aspect in calculating the true impact of downtime. To gain a complete perspective, it is also necessary to factor in other, often less quantifiable factors that collectively can dwarf any immediate direct revenue loss, including:
- Damage to brand image;
- Impact to customer relationships, satisfaction and loyalty;
- Loss in employee productivity;
- Potential regulatory issues if an essential service is unavailable or important data (e.g. customer records, financial transactions, etc.) are corrupted.

The impact of downtime varies by application, and is dependent on factors such as the affected number of internal and external users, the value and volume of transactions, regulatory and/or competitive pressures, etc. As an example, internal procurement systems will not incur the same cost of downtime as a web-based content management system, which in turn is nowhere near as damaging as the outage of an eCommerce engine or telecoms service.

To help refine cost analysis and guide technology selection, it is important to understand the amount of time an IT service can be unavailable before the organization suffers a material loss. This is a factor often used in Business Continuity planning, and is referred to as the Recovery Time Objective (RTO), which is discussed in the next section of the Guide.


The length of acceptable downtime will again vary across business processes. For example, a high-volume eCommerce web site, where users expect rapid response times and for which customer switching costs are very low, is likely to have a very low tolerance to any length of downtime. However, for systems that support back-end operations such as shipping and billing, the length of downtime can be higher without materially affecting the business.

Understanding both the direct and indirect costs of downtime for each application or service is essential because this will determine:
a. The SLAs required by the business;
b. The HA technology chosen to meet those SLAs;
c. Operational processes to implement, monitor and manage the HA technology.

3. Determining High Availability Requirements

Implementing HA for applications can be complex and costly. It is therefore critical that an organization perform a thorough analysis of its business requirements, and that it set expectations with business users. While avoiding any form of downtime is always highly desirable, it is largely impractical.

Higher levels of availability are typically achieved by deploying systems with increasing levels of redundancy and fault-tolerance. However, greater redundancy will also increase the total cost and complexity of the system due to requirements for more hardware and software, as well as demanding a larger investment in IT staff, processes, and services. Change management becomes more rigidly defined, which can impact business agility.

To guide analysis and technology selection, RTO and RPO are two important considerations.

Recovery Time Objective (RTO): The availability level of an HA solution only defines application uptime over a specified period, e.g. 99.99% per year or 99.9% per week. To choose the right architecture for high availability it is also important to define the maximum acceptable downtime per incident in order to avoid a break in business continuity. This measure is defined as the Recovery Time Objective.

Recovery Point Objective (RPO): The Recovery Point Objective is the point in time to which data must be recovered when a service is re-established. RPO is typically determined by considering the type of application. For example, financial transactions will demand a different RPO from clickstream data. The RPO allows an organization to define a window of time before a disaster during which data may be lost. The RPO can be anything from microseconds to days.

Analysis of the business requirements for application availability, including RTO and RPO, coupled with an understanding of the associated costs, enables an optimal solution to be developed that is balanced to meet the needs of the organization, within its financial and resource constraints.
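The relationship between an availability percentage and the RTO can be made concrete with a little arithmetic. The following Python sketch is illustrative only: the function names and the sample incident durations are assumptions introduced here, not part of any MySQL tooling. It converts an availability target into a downtime budget and checks a list of incidents against both the budget and a per-incident RTO.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_minutes(availability_pct, period_hours=HOURS_PER_YEAR):
    """Total downtime (in minutes) permitted by an availability target."""
    return (1 - availability_pct / 100.0) * period_hours * 60


def meets_targets(incident_minutes, availability_pct, rto_minutes,
                  period_hours=HOURS_PER_YEAR):
    """Return (within_budget, within_rto) for a list of outage durations.

    within_budget: the summed downtime stays inside the availability target.
    within_rto:    no single incident lasted longer than the RTO.
    """
    budget = downtime_budget_minutes(availability_pct, period_hours)
    within_budget = sum(incident_minutes) <= budget
    within_rto = all(m <= rto_minutes for m in incident_minutes)
    return within_budget, within_rto


if __name__ == "__main__":
    # 99.999% per year leaves roughly five minutes of downtime.
    print(round(downtime_budget_minutes(99.999), 2))
    # Two hypothetical outages (30 and 20 minutes) against 99.99% and a 15-minute RTO.
    print(meets_targets([30, 20], availability_pct=99.99, rto_minutes=15))
```

In the second call the two outages fit within the yearly budget for 99.99% (about 52.6 minutes), yet both exceed the 15-minute RTO, which is why the availability level alone is not enough to choose an architecture.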

3.1. Establishing Service Level Agreements (SLAs)

Using the cost and impact analysis described above, the organization can start to define its applications' SLAs. Many organizations categorize their applications into four tiers, and then select the HA solution that best achieves the defined SLAs.
- Tier 1 applications are mission-critical applications that incur maximum disruption if they are unavailable. They have the most stringent HA requirements, with systems needing to be available on a continuous or near-continuous basis. Good examples include emergency service systems, utilities, including telecommunications, eCommerce and market trading platforms.
- Tier 2 applications are typically business-critical, but do not need to maintain the 99.999% availability demanded by Tier 1, mission-critical applications. Examples include web content management systems, user authentication, session management, Customer Relationship Management (CRM) systems, corporate-wide email, etc.
- Tier 3 business applications still require some type of HA mechanism to reduce downtime, but are not as critical as the applications above. They would typically serve specific web functions such as feeds, blogs or wikis, or internal Line of Business processes such as Procurement, Human Resources, Data Marts, etc.
- Tier 4 applications may be related to internal development or small departmental deployments. Systems supporting these processes usually do not have the HA requirements of the higher tiers.

Once the uptime requirements of the application have been agreed, the next step is to evaluate the capabilities of various HA architectures and select those that best meet the SLA requirements of the business.

3.2. Mapping Application Needs to HA Architectures

There are multiple architectures that can be used to achieve highly available database services, each differentiated by the levels of uptime they offer. These architectures can be grouped into three main categories:
- Database Replication (typically implemented across loosely coupled clusters of servers);
- Tightly Coupled Clusters & Virtualized Systems;
- Shared-Nothing, Geographically-Replicated Clusters.

As illustrated in the figure below, each of these architectures offers progressively higher levels of uptime, but this needs to be balanced against the potentially greater levels of cost and complexity each will incur. Simply deploying a high availability architecture is not a guarantee of actually delivering HA. In fact, a poorly implemented and managed shared-nothing cluster could easily deliver lower levels of availability than a simple data replication solution.

Figure 2: Mapping High Availability Architectures to Systems Downtime

By understanding the availability requirements of each application, it is possible to map the database deployment model to the appropriate HA architecture. As the world's most popular open source database, MySQL offers many different approaches to delivering highly available services. The following sections of the Guide discuss the HA architectures certified and supported by Oracle.


4. Database Replication

Replication is the most common approach to delivering high availability for MySQL. Replication is a native feature of MySQL, available out-of-the-box without any complex add-ons or options.

Replication enables MySQL to copy changes from one instance to others (i.e. from the master to one or more slave instances) in a loosely coupled cluster. This is used to increase the availability and scalability of a database, enabling MySQL to scale out beyond the capacity constraints of a single system.

When deployed for HA, database updates are replicated from a master to one or more slaves, with the goal of failing over to a slave in the event of an outage of the master, whether due to a failure or a maintenance event. Enhancing flexibility, MySQL is able to replicate to slaves both within and across multiple geographically dispersed data centers, enabling disaster recovery.

The release of MySQL 5.6 includes the broadest set of enhancements to MySQL replication ever delivered in a single release. Global Transaction Identifiers (GTIDs) are one of those core enhancements, providing a foundation for building self-healing, highly available data clusters. This is discussed later.

4.1. Replication Modes and Data Consistency

There are multiple modes of replication, defined as asynchronous, semi-synchronous or synchronous.

Asynchronous Replication

By default, MySQL replication is asynchronous. Updates are committed to the database on the master and then relayed to the slave, where they are also applied. The master does not wait for the slave to receive the update, and so is able to continue processing further write operations without being blocked as it waits for acknowledgement from the slave.

When using asynchronous replication, there are no guarantees that all updates are replicated to the slave in the event of an outage of the master. Semi-synchronous replication, discussed below, can be configured to enhance consistency and durability between a MySQL master and its slaves, reducing the risk of data loss in the event of a failover.

Any delay (lag) of committed updates to the slaves is most noticeable with highly transactional applications where there is an abundance of write operations. With the correct components and tuning, replication can appear to be almost instantaneous to the application.

Figure 3: Contrasting different replication modes

Using asynchronous replication, slaves need not be connected permanently to receive updates from the master. This means that updates can occur over long-distance connections and even over temporary or intermittent connections. Depending on the configuration, you can replicate all databases, selected databases, or even selected tables within a database.

Semi-Synchronous Replication

Semi-synchronous replication can be used as an alternative to MySQL's default asynchronous replication, serving to enhance data integrity. Using semi-synchronous replication, a commit is returned to the client only when a slave has received the update, or a timeout occurs. Therefore it is assured that the data exists on the master and at least one slave (note that the slave will have received the update but not necessarily applied it when the commit is returned to the client).


It is possible to combine the different modes of replication, so that some MySQL slaves are configured with asynchronous replication while others use semi-synchronous replication. This ultimately means that the Developer / DBA can determine the appropriate level of data consistency and performance on a per-slave basis.

The different replication modes described above can be contrasted with fully synchronous replication, whereby data is committed to two or more instances at the same time using a two-phase commit protocol. Synchronous replication gives assured consistency across multiple systems, and facilitates faster failover times in the event of an outage, but it can add a performance overhead as a result of additional messaging between nodes.
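The trade-off between asynchronous and semi-synchronous replication described in this section can be illustrated with a toy model. The Python sketch below is not MySQL code and does not reflect the server's internals; the function name, the mode strings and the delay values are all assumptions made for illustration. For one commit it returns the extra latency the client observes and the number of servers guaranteed to hold the update at the moment the commit returns.

```python
def commit(mode, slave_ack_delays_ms, timeout_ms=1000):
    """Toy model of a single commit on the master.

    slave_ack_delays_ms: time each slave needs to acknowledge receipt of the
    update (use float("inf") for a slave that is disconnected).
    Returns (added_latency_ms, servers_guaranteed_to_hold_the_update).
    """
    if mode == "async":
        # The master does not wait: only its own copy is guaranteed.
        return 0, 1
    if mode == "semisync":
        fastest = min(slave_ack_delays_ms, default=float("inf"))
        if fastest <= timeout_ms:
            # One slave has received (not necessarily applied) the update.
            return fastest, 2
        # Timeout: the commit returns anyway, with only the master's copy.
        return timeout_ms, 1
    raise ValueError("unknown replication mode: %r" % mode)


if __name__ == "__main__":
    print(commit("async", [5, 40]))
    print(commit("semisync", [5, 40]))
    print(commit("semisync", [float("inf")]))
```

The model shows why semi-synchronous replication trades a small amount of commit latency for a second guaranteed copy, and why the timeout matters: when no slave acknowledges in time, the guarantee falls back to that of asynchronous replication.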

4.2. Implementing Replication in MySQL

MySQL Replication is implemented by configuring one instance as a master, with one or more additional instances configured as slaves. The master will log the changes to the database, which are then sent and applied to the slave(s) immediately or after a set time interval [2]. The figure below represents the implementation of MySQL replication.

Figure 4: MySQL Replication Workflow

Beyond HA, MySQL replication is often employed to scale out the database across a cluster of physical servers, as illustrated in the figure below. All write operations (and any reads which need to include the most recent changes) are directed to the master, while other SELECT statements are directed to the slave(s), with query routing implemented either via the appropriate MySQL connector (e.g. the Connector/J JDBC or PHP drivers), or within the application logic.
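When query routing is implemented within the application logic, it can be as simple as the following Python sketch. The class name, the `needs_latest` flag and the host names are hypothetical and introduced only for illustration; the sketch returns the address a statement should be sent to and leaves the actual connection handling to whichever driver the application uses.

```python
import itertools


class ReplicationRouter:
    """Pick a server for each statement: writes to the master, reads to slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._slaves = itertools.cycle(slaves) if slaves else None

    def route(self, sql, needs_latest=False):
        """Return the server address that should execute `sql`.

        needs_latest=True marks a read that must see the most recent changes
        and therefore has to run on the master.
        """
        is_read = sql.lstrip().upper().startswith("SELECT")
        if not is_read or needs_latest or self._slaves is None:
            return self.master
        return next(self._slaves)


if __name__ == "__main__":
    router = ReplicationRouter("master:3306", ["slave1:3306", "slave2:3306"])
    print(router.route("INSERT INTO orders VALUES (1)"))
    print(router.route("SELECT * FROM orders"))
    print(router.route("SELECT * FROM orders"))
    print(router.route("SELECT * FROM orders WHERE id = 1", needs_latest=True))
```

The prefix check is deliberately naive: a statement such as SELECT ... FOR UPDATE would be sent to a slave by this sketch, so a production router needs a more careful classification or explicit hints from the caller.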

[2] Time Delayed Replication is a feature of MySQL 5.6.


Figure 5: MySQL Replication Supports HA and Read Scalability Out of the Box

MySQL replication can be deployed in a range of topologies to support diverse scaling and HA requirements. To learn more about these options and how to configure MySQL replication, refer to the following whitepaper: http://www.mysql.com/why-mysql/white-papers/mysql-replication-introduction

MySQL replication is a mature and well-proven approach to scaling workloads while providing a foundation for HA. The following table summarizes the current status of MySQL replication, prior to the new MySQL 5.6 release.

4.3. MySQL 5.6 Replication Enhancements

MySQL 5.6 introduces the most far-reaching set of replication enhancements ever released. The figure below summarizes those enhancements, which are focused on HA, performance, data integrity and ease-of-use.

The most significant HA enhancements come from the introduction of Global Transaction Identifiers (GTIDs). GTIDs are unique identifiers comprising the server UUID (of the original master) and a transaction number. They are automatically generated as a header for every transaction and written with the transaction to the binary log. GTIDs make it simple to track and compare replicated transactions, and therefore replication progress, between the master and slaves, which in turn enables simple recovery from failures of the master. The default InnoDB storage engine must be used with GTIDs to get the full benefits of HA.

GTIDs also introduce greater flexibility in the provisioning and on-going management of multi-tier or ring (circular) replication topologies.
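To see why identifiers built from a server UUID and a transaction number make replication progress easy to compare, consider the Python sketch below. It assumes the textual GTID set notation used by MySQL 5.6, in which a UUID is followed by colon-separated transaction numbers or ranges (for example `uuid:1-5:8`); the function names and the sample UUID are illustrative, and expanding ranges into Python sets is only reasonable for small examples.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:8,uuid2:3' into {uuid: set_of_transaction_numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            first, _, last = interval.partition("-")
            # A bare number such as '8' is a range of length one.
            numbers.update(range(int(first), int(last or first) + 1))
    return result


def missing_on_slave(master_set, slave_set):
    """Transactions recorded on the master that the slave has not yet seen."""
    master, slave = parse_gtid_set(master_set), parse_gtid_set(slave_set)
    missing = {}
    for uuid, numbers in master.items():
        gap = numbers - slave.get(uuid, set())
        if gap:
            missing[uuid] = sorted(gap)
    return missing


if __name__ == "__main__":
    uuid = "3E11FA47-71CA-11E1-9E33-C80AA9429562"  # sample value
    print(missing_on_slave(uuid + ":1-10", uuid + ":1-7"))
```

Because every transaction carries the same identifier on every server, working out what a slave is missing reduces to a set difference, with no need to translate binary log file names and positions between servers.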


To complement GTIDs, two new Replication Utilities have also been released. These utilities provide administration of GTID-enabled slaves, and monitoring with automatic failover and on-demand switchover, coupled with slave promotion, enabling self-healing recovery.

mysqlfailover: Provides continuous monitoring of the replication topology, enabling failover to a slave in the event of an outage of the master. It can be run within a user's terminal or as a daemon process. The default behavior is to promote the first slave in the list that meets the minimum criteria, but the promotion policies are fully configurable. Therefore, a user can nominate a specific candidate slave to become the new master (e.g. because it has better performing hardware). Before being promoted, the candidate slave will temporarily be made a slave to each of the other servers in turn to receive any updates it might be missing. This ensures no replicated transactions are lost, even if the candidate is not the most current slave when failover is initiated.

mysqlrpladmin: If a user needs to take a master offline for scheduled maintenance, mysqlrpladmin can perform a switchover to a nominated candidate slave. It can also perform a manual failover in the event a master server has gone offline.

Figure 6: MySQL 5.6, delivering the largest set of replication enhancements ever

With either of these utilities, the user may register scripts to be called in the event of a slave promotion - for example to redirect an application to use the new master for writes and for reads that must always be consistent. You can learn more about these utilities from a video tutorial: http://dev.mysql.com/tech-resources/articles/mysql-replication-utilities.html
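The promotion policy described above for mysqlfailover can be summarized in a short Python sketch. This is not the utility's implementation: the dictionary keys (`name`, `healthy`, `executed`) and the notion of "healthy" are hypothetical stand-ins for whatever minimum criteria are configured, and transactions are represented as plain integers rather than GTIDs.

```python
def choose_new_master(slaves, candidates=()):
    """Pick the slave to promote and list the transactions it must fetch first.

    slaves:     ordered list of dicts with the hypothetical keys
                'name', 'healthy' (bool) and 'executed' (set of transaction ids).
    candidates: optional names nominated by the user, in order of preference.
    """
    eligible = [s for s in slaves if s["healthy"]]
    if not eligible:
        raise RuntimeError("no slave meets the minimum criteria")
    by_name = {s["name"]: s for s in eligible}
    # A nominated candidate wins; otherwise take the first eligible slave.
    chosen = next((by_name[c] for c in candidates if c in by_name), eligible[0])
    # Before promotion, collect what the other reachable slaves already have.
    catch_up = set()
    for other in eligible:
        if other is not chosen:
            catch_up |= other["executed"] - chosen["executed"]
    return chosen["name"], sorted(catch_up)


if __name__ == "__main__":
    slaves = [
        {"name": "s1", "healthy": False, "executed": {1, 2, 3, 4}},
        {"name": "s2", "healthy": True, "executed": {1, 2}},
        {"name": "s3", "healthy": True, "executed": {1, 2, 3}},
    ]
    print(choose_new_master(slaves))
    print(choose_new_master(slaves, candidates=["s3"]))
```

In the first call the first eligible slave is behind its peer, so the sketch reports the transaction it must retrieve before taking over; this corresponds to the step in which the candidate is temporarily made a slave of the other servers.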

4.4. Monitoring MySQL Replication

In order to achieve high availability it is crucial to monitor systems and receive automatic notifications of issues or potential problems before they impact performance or availability of the application. Therefore, comprehensive management and monitoring tools should be regarded as mandatory in any HA installation. Many MySQL customers use the MySQL Enterprise Monitor (discussed in more depth in the Operational Best Practices section) with its GUI dashboard to manage their replication topologies. MySQL Enterprise Monitor makes it easier to scale-out and achieve high availability using the MySQL Replication Monitor, providing auto detection, grouping, documenting and monitoring of all master / slave hierarchical relationships. Changes and additions to existing replication topologies are also auto-detected and displayed, providing DBAs with instant visibility into newly implemented updates. As the Replication Advisor identifies a problem and sends out an alert, the DBA can use the alert content along with the Replication Monitor to drill into the status of the affected master and/or slave. Using the Replication Monitor and the expert advice from the Replication Advisor they can review the current master/slave status and view metrics (such as Slave I/O, Slave SQL thread, seconds behind master, master binlog position, last error, etc.) that are relevant to diagnosing and correcting any problems.


The Replication Monitor is designed and implemented to save DevOps time writing and maintaining scripts that collect, consolidate and monitor similar MySQL Replication status and diagnostic data.
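For context, the kind of hand-written check that such scripts typically perform is sketched below in Python. It assumes the status of a slave has already been fetched (for example with `SHOW SLAVE STATUS`) into a dictionary keyed by column name; the function name and the lag threshold are illustrative assumptions, and the sketch says nothing about how MySQL Enterprise Monitor itself is implemented.

```python
def replication_problems(status, max_lag_seconds=30):
    """Return a list of human-readable problems found in one slave's status."""
    problems = []
    if status.get("Slave_IO_Running") != "Yes":
        problems.append("I/O thread not running")
    if status.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL thread not running")
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # The lag is reported as NULL when it cannot be determined.
        problems.append("lag unknown")
    elif lag > max_lag_seconds:
        problems.append("lag %ds exceeds %ds" % (lag, max_lag_seconds))
    if status.get("Last_Error"):
        problems.append("last error: " + status["Last_Error"])
    return problems


if __name__ == "__main__":
    sample = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 120,
        "Last_Error": "",
    }
    print(replication_problems(sample))
```

A script like this has to be run against every slave, scheduled, and wired to an alerting channel, which is the maintenance effort a consolidated monitoring tool is meant to remove.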

Figure 7: MySQL Enterprise Replication Monitor

5. Clustering & Virtualization

To achieve higher levels of availability, it is necessary to deploy systems in more tightly coupled failover clusters using heartbeating mechanisms and cluster resource managers. These mechanisms monitor hardware, OS, network and database processes, automatically fail over to standby servers when a failure is detected, and redirect applications to new masters. As demonstrated earlier, the enhancements delivered as part of MySQL 5.6 evolve replication more in the direction of this category of HA technology.

A failover cluster is a group of independent nodes that are physically connected by a local-area or wide-area network and programmatically connected by cluster software. The group of nodes is managed as a single system and shares a common namespace. The group usually includes multiple network connections and data storage connected to the nodes via storage area networks (SANs) or local distributed storage. The failover cluster operates by moving resources between nodes to provide service if system components fail.

Heartbeating mechanisms implement protocols that send messages at regular intervals between two or more servers. If a message is not received from a node within a given interval, it is assumed the node has failed and the cluster resource manager initiates a failover action.

Typically the IP address of the server will be virtualized (Virtual IP) and also fails over to the standby server, enabling applications to continue running without having to connect to a different IP address. The virtual IP address is associated with a specific service, and can move between specific servers.

It is also common to integrate failover clusters with virtualization as a means to enhance HA. Among other operational benefits, virtualization enables resource consolidation and ease of provisioning. Virtualization provides a layer of abstraction between the physical hardware, the operating systems and the MySQL Server. As a result, a user can migrate running Virtual Machines (VMs) to other hosts in order to eliminate downtime resulting from planned maintenance operations. Virtualization can also ensure that in the case of an unplanned hardware failure, affected virtual machines are rapidly restarted on another host in the cluster, and load balanced between hosts so that the service is only failed over to hosts that have sufficient spare capacity.

Oracle currently certifies and supports the following clustering solutions for the MySQL Server.


All of the certified solutions in this category of HA are based around an Active / Passive cluster offering automatic failover or live migration and protection against the loss of committed transactions. With the exception of DRBD, all rely on centralized shared storage to protect data. All updates are persisted to a SAN (Storage Area Network), which is accessible by both the Active and Passive nodes in the cluster [3], ensuring no data loss in the event of a failover.

DRBD uses a different approach to ensuring data consistency, synchronously replicating updates made to the Active instance across to the Passive instance before committing the transaction. Therefore, no data is lost if there is a failover between the Active and Passive instances.

The Oracle VM Template and DRBD based solutions are both designed for Linux environments, while Windows and Solaris clustering provide HA when MySQL is deployed on those platforms. Oracle supports the full stack of OS, HA clustering middleware and database, with the exception of the Windows solution, where the OS and clustering mechanisms are supported by Microsoft while Oracle supports MySQL. As an alternative to Windows Cluster, users could deploy MySQL Replication or MySQL Cluster to achieve HA on the Windows platform.

To ensure fully automatic recovery in each of the solutions above, it is recommended that users deploy the InnoDB storage engine (the default engine starting from MySQL 5.5) and leverage its full crash-recovery capabilities.

The clustering mechanisms used in each of these certified solutions can also be combined with MySQL replication to provide scale-out of slave systems, and geographic redundancy if needed. Users therefore enjoy the benefits of highly available MySQL Active / Passive clusters, and couple that with MySQL replication to scale queries across multiple systems. Each of the solutions above is discussed in more detail in the following sections.

[3] At any point in time, the clustering software ensures that only the current active node can access the data held in this storage.


5.1. The Oracle VM Template for MySQL Enterprise Edition

Packaged as a single downloadable image, the Oracle VM Template for MySQL Enterprise Edition provides a pre-installed and pre-configured virtualized MySQL 5.5 Enterprise Edition software image running on Oracle Linux and Oracle VM, certified for production use and available with 24 x 7 support for the entire stack [4].

As illustrated in the images below, the Oracle VM Template for MySQL Enterprise Edition enables users to rapidly provision multiple, virtualized MySQL instances to an Oracle VM Server pool, and integrates HA capabilities to provide resource monitoring, failover, recovery and load balancing of MySQL instances across the Server Pool.

The template packages all of the components necessary to create a virtualized and highly available MySQL instance in a single downloadable file. The components of the template include:
- Oracle Linux with the Unbreakable Enterprise Kernel [5]
- Oracle VM [6]
- Oracle Cluster File System 2 (OCFS2) [7]
- MySQL Enterprise Edition [8]

Figure 8: The Oracle VM Template for MySQL

High Availability Delivered by the Oracle VM Template

In addition to the proven benefits of virtualization, Oracle VM enhances MySQL high availability through:
- Secure Live VM Migration: Eliminates service outages associated with planned maintenance by migrating running VMs to other servers over secure SSL links, without interruption. Oracle VM is the first major virtualization solution to SSL-encrypt migration traffic by default to protect sensitive data from exploitation.
- Automatic VM Restart: Detects and automatically restarts instances within the server pool after failures of physical server hardware, VM instances or MySQL.
- Automatic or Manual Server Pool Load Balancing: Guest VMs are automatically placed on the server with the most resources available in the pool at start-up, or can be started within a user-designated subset of servers.
- Automated Network Management: Oracle VM configures a common, virtualized system IP that is automatically bound and re-bound to physical network layers, regardless of the platform it is initially started on, thereby eliminating the manual administration effort involved with updating routing tables or network configurations.

[4] Requires subscription to the Oracle Unbreakable Linux Network
[5] http://www.oracle.com/us/technologies/linux/overview/index.html
[6] http://www.oracle.com/us/technologies/virtualization/oraclevm/index.html
[7] http://www.oracle.com/us/technologies/linux/025995.htm
[8] http://www.mysql.com/products/enterprise/


Figure 9: Oracle VM Template delivering HA planned events and failures

Underpinning the HA mechanisms, Oracle VM provides:
- Powerful cluster-based network and storage heartbeat algorithms to quickly and deterministically identify failed and/or isolated servers in the server pool to ensure rapid recovery;
- Sophisticated distributed lock management functionality for SAN and iSCSI storage that ensures VMs or entire servers can be rapidly restarted with no risk of data corruption.

You can learn more about the template by reading the whitepaper posted as follows: http://www.mysql.com/why-mysql/white-papers/mysql_wp_oracle-vm-template-for-mee.php

MySQL Enterprise Edition includes access to 24 x 7 support from Oracle. Support for Oracle Linux and Oracle VM provides a single point of contact for support of the entire virtualized stack, and is available by subscribing to the Unbreakable Linux Network [9].

5.2. Oracle Linux and DRBD

DRBD is a Linux kernel module, integrated into the Oracle Linux Unbreakable Enterprise Kernel. The complete stack includes:
- Oracle Linux with Unbreakable Enterprise Kernel and DRBD kernel module;
- DRBD userland utilities;
- Pacemaker and Corosync cluster messaging and management processes;
- MySQL Enterprise Edition.

Collectively, the stack provides for:
- Automatic failover and recovery for service continuity;
- Live migration for planned maintenance;
- Mirroring, via synchronous replication, to ensure failover between nodes without the risk of losing committed transactions;
- Building of HA clusters from commodity hardware with local storage.

At the lowest level, two hosts are required in order to provide physical redundancy; if using a virtual environment, those two hosts should be on different physical machines. It is an important feature that no shared storage is required. At any point in time, the services will be active on one host and in standby mode on the other.
[9] http://linux.oracle.com/


Pacemaker and Corosync combine to provide the clustering layer that sits between the services and the underlying hosts and operating systems. Pacemaker is responsible for starting and stopping services, ensuring that they're running on exactly one host, delivering high availability and avoiding data corruption. Corosync provides the underlying messaging infrastructure between the nodes that enables Pacemaker to do its job; it also handles the nodes' membership within the cluster and informs Pacemaker of any changes.

The core Pacemaker process does not have built-in knowledge of the specific services to be managed; instead, agents are used which provide a wrapper for the service-specific actions. For example, in this solution we use agents for Virtual IP Addresses, MySQL and DRBD; these are all existing agents and come packaged with Pacemaker. The essential services managed by Pacemaker in this configuration are DRBD, MySQL and the Virtual IP Address that applications use to connect to the active MySQL service.
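The membership tracking performed by the messaging layer rests on the heartbeating idea introduced at the start of Section 5: a node that stays silent for longer than a given interval is presumed to have failed. The Python sketch below illustrates only that idea. It is not how Corosync is implemented; the class name, the timeout value and the node names are assumptions, and timestamps are passed in explicitly so that the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Track the last heartbeat seen from each node and flag silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def heartbeat(self, node, now):
        """Record that `node` sent a heartbeat at time `now` (in seconds)."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout_seconds=3)
    monitor.heartbeat("node1", now=0)
    monitor.heartbeat("node2", now=0)
    monitor.heartbeat("node1", now=4)    # node2 has gone quiet
    print(monitor.failed_nodes(now=5))   # a resource manager would act on this
```

In a real cluster the list of presumed-failed nodes is what triggers the resource manager to move the services, including the Virtual IP, to the surviving host.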

Figure 10: MySQL / DRBD / Pacemaker / Corosync Stack

DRBD synchronizes data at the block device level (typically a spinning or solid state disk), transparently to the application, database and even the file system. DRBD requires the use of a journaling file system such as ext3 or ext4. For this solution it acts in an active-standby mode; this means that at any point in time the directories being managed by DRBD are accessible for reads and writes on exactly one of the two hosts and inaccessible (even for reads) on the other. Any changes made on the active host are synchronously replicated to the standby host by DRBD. You can learn more about this stack, with instructions on how to configure and deploy it, from the whitepaper posted here: http://www.mysql.com/why-mysql/white-papers/mysql_wp_drbd.php
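The effect of synchronous block-level mirroring can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not DRBD's implementation or interface: `ToyBlockDevice` and `ToySyncMirror` are hypothetical names, and network and disk failures are ignored. The point it shows is that a write is acknowledged only once both copies hold the data.

```python
from typing import Dict


class ToyBlockDevice:
    """In-memory stand-in for a disk: maps a block number to its contents."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}

    def write(self, block_no: int, data: bytes) -> None:
        self.blocks[block_no] = data


class ToySyncMirror:
    """Acknowledges a write only after both copies have been updated."""

    def __init__(self, active: ToyBlockDevice, standby: ToyBlockDevice) -> None:
        self.active = active
        self.standby = standby

    def write(self, block_no: int, data: bytes) -> bool:
        self.active.write(block_no, data)
        # Synchronous step: the standby copy is written before returning,
        # so every acknowledged write exists on both hosts.
        self.standby.write(block_no, data)
        return True  # acknowledgement to the caller


active, standby = ToyBlockDevice(), ToyBlockDevice()
mirror = ToySyncMirror(active, standby)
acked = mirror.write(7, b"committed row")

# If the active host were lost now, the standby already holds the block.
standby_copy = standby.blocks[7]
```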

5.3. MySQL with Windows Server Failover Clustering

Microsoft Windows is consistently ranked as the top development platform for MySQL, based on surveys of the MySQL user community. MySQL Enterprise Edition is certified and supported [10] with Windows Server 2008 R2 Failover Clustering (WSFC), enabling organizations to safely deploy business-critical applications demanding high levels of availability using Microsoft's native Windows clustering services.

[10] Users must escalate issues related to Windows Server and its associated clustering mechanisms directly to Microsoft.


The following figure illustrates the integration of MySQL with Windows Server Failover Clustering to provide a highly available service. In this architecture, MySQL is deployed in an Active / Passive configuration. Failures of either MySQL or the underlying server are automatically detected and the MySQL instance is restarted on the Passive node. Applications accessing the database, as well as any MySQL replication slaves, can automatically reconnect to the new MySQL process using the same Virtual IP address once MySQL recovery has completed and it starts accepting connections.

MySQL with Windows Failover Clustering requires at least two servers within the cluster together with shared storage (for example FC-AL SAN or iSCSI disks). The MySQL binaries and data files are stored in the shared storage and Windows Failover Clustering ensures that only one of the cluster nodes will access those files at any point in time.

Clients connect to the MySQL service through a Virtual IP Address (VIP) and so in the event of failover they experience a brief loss of connection but otherwise do not need to be aware that the failover has happened, other than to handle the failure of any in-flight transactions.
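On the application side, handling a failover comes down to reconnecting to the same address and redoing any transaction that was in flight. The helper below is a minimal sketch of that pattern, not part of MySQL or of any connector: `run_with_retry`, `connect` and `transaction` are hypothetical names, and the built-in `ConnectionError` stands in for whatever error the database driver actually raises.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    connect: Callable[[], object],
    transaction: Callable[[object], T],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> T:
    """Reconnect to the same address and re-run the whole transaction."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            conn = connect()          # always the same virtual IP
            return transaction(conn)  # in-flight work is redone from the start
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay_seconds)  # give the failover time to complete
    raise RuntimeError("database still unavailable after retries") from last_error


# Demo with a fake connection that fails twice, as if a failover were under way.
calls = {"n": 0}


def fake_connect() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ConnectionError("VIP not answering yet")
    return "connection-to-vip"


result = run_with_retry(
    fake_connect,
    lambda conn: f"committed via {conn}",
    delay_seconds=0.0,
)
```

Re-running a whole transaction is only appropriate when the application can be sure the failed attempt did not commit, or when the work is safe to repeat.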

Figure 11: Typical MySQL HA Configuration with WSFC

You can learn more about configuring MySQL with Windows Server Failover Clustering from the whitepaper posted here: http://www.mysql.com/why-mysql/white-papers/mysql_wp_windows_failover_clustering.php

5.4. Solaris Cluster

Built on Oracle Solaris, the leading enterprise operating system, Oracle Solaris Cluster provides high availability and load balancing to mission-critical applications and services in physical or virtualized environments. With Oracle Solaris Cluster, organizations have a scalable and flexible solution that is suited equally to small clusters in local datacenters or larger multi-site, multi-cluster deployments that are part of enterprise disaster recovery implementations.

Overview

At its simplest, Oracle Solaris Cluster monitors the health of cluster components, including the stack of applications, middleware, operating system, servers, storage, and network interconnects. The software automatically reacts to failures of any of those components through the appropriate action, either leveraging the built-in hardware redundancy to resume operations or executing a policy-based, service-specific or application-specific recovery action. Thanks to kernel-based heartbeating and low-level monitoring, server failure detection is near-immediate and resilient to load. As soon as a server goes offline and ceases its heartbeat, it is isolated. Applications are failed over to another server quickly and transparently to users, and data is protected from corruption by fencing off the failing server. With this architecture, application service levels are increased while preserving data integrity.


Figure 12: Oracle Solaris Cluster enables multiple servers and storage systems to act as a single system

Failover, scalable, and cluster-aware MySQL agents

Failover and scalable agents are software programs that enable Oracle or ISV applications to take full advantage of Oracle Solaris Cluster features. The Oracle Solaris Cluster MySQL agent integrates seamlessly with MySQL and the MySQL replication protocol. It offers a selection of configuration options in the various Oracle Solaris Cluster topologies:

- Replication: The Oracle Solaris Cluster MySQL agent controls and monitors MySQL replication. MySQL can either be a master or a slave database, or switch roles at any point in time. Such a role switch does not require an action in Oracle Solaris Cluster. With the Oracle Solaris Cluster MySQL agent, customers have the option to get highly available masters and highly available slaves; both can be configured as different resources in the same cluster.
- Scalable Topology: With the scalable topology, customers run multiple MySQL databases and access them through Oracle Solaris Cluster's internal load-balancer, providing an ideal environment to implement a slave farm with scalable MySQL slaves.
- Disaster Recovery: The Oracle Solaris Cluster Geographic Edition features tie two distinct clusters (one configured with a MySQL master and one with a MySQL slave database) together to implement a disaster recovery solution. In conjunction with the Oracle Solaris Cluster MySQL agent, this solution manages the MySQL replication and offers an automated approach to protect the data.
- Virtualization: The Oracle Solaris Cluster MySQL agent also leverages the full set of virtualization options offered by Oracle Solaris Cluster: Oracle Solaris Containers nodes, Oracle Solaris Container clusters and failover containers. The Oracle Solaris Cluster MySQL agent can be configured on all three of them depending on the customer requirements.
The MySQL agent is included with Oracle Solaris Cluster, which you can learn more about and download from here: http://www.oracle.com/technetwork/server-storage/solaris-cluster/overview/index.html

6. Shared-Nothing, Failover Clusters

The methods of achieving high availability discussed above satisfy the uptime requirements of many applications, but there are classes of services that are highly transactional and update-intensive, demanding near-continuous availability. Examples include eCommerce and financial transactions, billing, user access and authentication, network infrastructure applications and telecommunications services. Other services such as social gaming and online marketing campaign tracking also have to deal with increasingly demanding SLAs dictating 99.999% uptime. In these cases, MySQL Cluster provides the familiarity and ease-of-use of the regular MySQL Server, while delivering 99.999% availability (roughly 5 minutes of downtime per year), coupled with auto-sharding (partitioning) for high write performance and low latency. MySQL Cluster is proven in environments demanding the highest levels of availability, continuing to deliver service in the event of failures, disasters and planned maintenance operations.
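An availability percentage converts to a yearly downtime budget with simple arithmetic. The sketch below assumes a 365-day year; the function name `downtime_minutes_per_year` is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime allowed by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Downtime budgets, in minutes per year, for three common targets.
budgets = {
    target: round(downtime_minutes_per_year(target), 2)
    for target in (99.9, 99.99, 99.999)
}
```

Under that assumption, 99.9% allows roughly 8.8 hours of downtime a year, 99.99% a little under an hour, and 99.999% a little over five minutes.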

6.1. MySQL Cluster

MySQL Cluster is a write-scalable, real-time, ACID-compliant transactional database, combining 99.999% availability with the low TCO of open source. Designed around a distributed, multi-master architecture with no single point of failure, MySQL Cluster scales horizontally on commodity hardware to serve read and write intensive workloads, accessed via SQL and NoSQL interfaces. MySQL Cluster's real-time design delivers predictable, millisecond response times with the ability to service millions of operations per second. Support for in-memory and disk-based data, automatic data partitioning (sharding) with load balancing, and the ability to add nodes to a running cluster with zero downtime allow linear database scalability to handle the most unpredictable workloads. The following figure shows the architecture of MySQL Cluster.

Figure 13: MySQL Cluster's distributed architecture eliminates any single point of failure

MySQL Cluster comprises three node types which collectively provide high availability to the application. By using a single namespace, the different nodes are transparent to the application, which can connect to any node, and queries are routed automatically:

- Data nodes manage the storage and access to data. Tables are automatically sharded across the data nodes, which also transparently handle load balancing, replication, failover and self-healing. There is no need for any type of additional heartbeating or resource management middleware; all of these functions are integrated directly into MySQL Cluster.


- Application nodes provide connectivity from the application logic to the data nodes. Multiple APIs are presented to the application. MySQL provides a standard SQL interface, including connectivity to all of the leading web development languages and frameworks. There is also a range of NoSQL interfaces including Node.js [11], Memcached, REST/HTTP, C++ (NDB-API), Java and JPA.
- Management nodes are used to configure the cluster and provide arbitration in the event of a network partition to avoid a split brain, which would lead to data inconsistency.

Resilience to Failures with Self-Healing Recovery

The distributed, shared-nothing architecture of MySQL Cluster has been carefully designed to ensure resilience to failures, with automated, self-healing recovery:

- The data within a data node is synchronously replicated to a neighboring node. If a data node fails, then there is always at least one other data node storing the same information.
- In the event of a data node failure, the MySQL Server or application node can use any other data node in the node group to execute transactions. The application simply retries the transaction and the remaining data nodes will successfully satisfy the request.
- MySQL Cluster detects any failures instantly and control is automatically failed over to other active nodes in the cluster, without interrupting service to the clients.
- In the event of a failure, the MySQL Cluster data nodes are able to self-heal by automatically restarting, recovering, and resynchronizing themselves with the rest of the cluster, all of which is completely transparent to the application.
- Duplicate management server nodes can be deployed so that no management or arbitration functions are lost if a single management server fails.

Designing the cluster in this way makes the system reliable and highly available since single points of failure have been eliminated. Any node can be lost without it affecting the system as a whole.
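The interplay of sharding, node groups and failures can be sketched with a small model. Everything here is an assumption made for illustration: the layout of two node groups with two data nodes each, the node names, and the SHA-256 based mapping, which is not the hashing scheme MySQL Cluster itself uses. The rule it encodes is that service continues while every node group keeps at least one live data node and at least one application node is reachable.

```python
import hashlib
from typing import Dict, List, Set

# Hypothetical layout: two node groups, each holding two replicas of its shards.
NODE_GROUPS: Dict[int, List[str]] = {
    0: ["data1", "data2"],
    1: ["data3", "data4"],
}


def node_group_for(primary_key: str) -> int:
    """Map a row to a node group by hashing its key (illustrative scheme only)."""
    digest = hashlib.sha256(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % len(NODE_GROUPS)


def nodes_serving(primary_key: str, alive_data_nodes: Set[str]) -> List[str]:
    """Surviving replicas that can still execute a transaction on this row."""
    members = NODE_GROUPS[node_group_for(primary_key)]
    return [node for node in members if node in alive_data_nodes]


def cluster_operational(alive_data_nodes: Set[str], alive_app_nodes: Set[str]) -> bool:
    """True while every node group keeps a live data node and an app node is up."""
    every_group_served = all(
        any(node in alive_data_nodes for node in members)
        for members in NODE_GROUPS.values()
    )
    return every_group_served and len(alive_app_nodes) > 0


# Lose one data node from each group: service continues.
survivors = {"data2", "data3"}
still_up = cluster_operational(survivors, {"mysqld1"})

# Lose a whole node group: part of the data has no live replica left.
group_lost = cluster_operational({"data1", "data2"}, {"mysqld1"})
```

In this model `still_up` is true and `group_lost` is false, and `nodes_serving` returns the single surviving replica for any key when one node per group has failed.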

Figure 14: With no single point of failure, MySQL Cluster delivers extreme resilience to failures

As demonstrated in the figure above, MySQL Cluster continues to deliver service, even in the event of catastrophic failures. As long as one data node from each node group and an application server remain available, the cluster will remain operational.

In addition to the site-level high availability achieved through its redundant architecture, MySQL Cluster also supports geographic distribution between datacenters:

- Geographic Replication mirrors complete clusters between geographically remote sites;
[11] Node.js API for MySQL Cluster is currently provided as a preview release.


- Multi-site clustering enables a single cluster to be split across remote data centers, with synchronous replication between sites. (Note: this requires a very high quality WAN.)

Whichever mode is chosen, all sites are Active/Active, and therefore able to accept write operations, with MySQL Cluster handling conflict detection and resolution. This ensures organizations do not have to carry the overhead of provisioning and maintaining systems that are idle for most of the time.

Maintaining Availability During Scheduled Maintenance Activities

As discussed earlier, around 30% of all downtime is attributable to scheduled maintenance activities. MySQL Cluster supports all of the following events as online operations, ensuring the database continues to provide service:

- Scaling the cluster by adding new nodes;
- Updating the schema with new columns, tables and indexes;
- Re-sharding of tables across data nodes to allow better data distribution;
- Performing back-up operations;
- Upgrading or patching the underlying hardware and operating system;
- Upgrading or patching MySQL Cluster, with full online upgrades between releases.

Through the HA capabilities described above, MySQL Cluster is able to eliminate downtime from both planned maintenance and unplanned outages in order to deliver the 99.999% availability required by the most critical applications.

Getting Started with MySQL Cluster

You can learn more about MySQL Cluster from the whitepaper posted here: http://mysql.com/why-mysql/white-papers/mysql_wp_scaling_web_databases.php

7. Comparing MySQL HA Solutions

The following table compares solutions for MySQL HA.

* http://www.mysql.com/support/supportedplatforms/database.html
** InnoDB recovery time dependent on cache and database size, database activity, etc.
*** http://www.mysql.com/support/supportedplatforms/cluster.html


As the table demonstrates, users have a wide range of options to achieve optimum levels of availability for their applications:

- MySQL replication and MySQL Cluster support a range of operating systems, while the other solutions are limited to specific platforms.
- All of the HA solutions with the exception of MySQL Cluster use the general purpose InnoDB storage engine, which is also the default engine of the MySQL server. MySQL Cluster uses its own NDB storage engine to support capabilities such as auto-sharding, failover and recovery. Therefore, if users are migrating from InnoDB, they will need to optimize their queries and schema to achieve the best possible performance from MySQL Cluster.
- All of the HA solutions support application level failover, with the exception of MySQL replication, which requires users to integrate their own scripts for this functionality.
- Failover times vary by technology. MySQL Cluster is designed for sub-second failover times, while Active/Passive solutions require the added overhead of InnoDB recovery.
- HA solutions such as the Oracle VM Template along with Windows and Solaris Clustering rely on shared storage to maintain data consistency, while DRBD and MySQL Cluster use local storage with synchronous replication. MySQL replication uses asynchronous replication as standard, which can result in updates that have not been propagated from the master to a slave being lost if the master fails. Configuring semi-synchronous replication can mitigate this.
- MySQL Cluster is the only HA solution supporting a fully active/active, multi-master architecture, enabling users to scale both read and write operations across the cluster.
- Each solution delivers progressively higher availability levels, with MySQL replication designed for 99.9% uptime, all the way to MySQL Cluster with 99.999% uptime.
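The difference between asynchronous and semi-synchronous replication in terms of lost updates can be shown with a toy model. The classes below are hypothetical and are not MySQL's replication code; in particular, real semi-synchronous replication involves acknowledgement timeouts that this sketch ignores. It only contrasts acknowledging a commit before versus after a slave has received the change.

```python
from typing import List


class ToySlave:
    """Toy slave that simply records the changes it has received."""

    def __init__(self) -> None:
        self.received: List[str] = []


class ToyMaster:
    """Toy master whose commit can optionally wait for the slave to receive it."""

    def __init__(self, slave: ToySlave, semi_sync: bool) -> None:
        self.slave = slave
        self.semi_sync = semi_sync
        self.committed: List[str] = []  # acknowledged to the application
        self.unsent: List[str] = []     # committed but not yet shipped

    def commit(self, change: str) -> None:
        if self.semi_sync:
            # The slave receives the change before the commit is acknowledged.
            self.slave.received.append(change)
        else:
            # Acknowledge at once; shipping to the slave happens later.
            self.unsent.append(change)
        self.committed.append(change)

    def ship_pending(self) -> None:
        self.slave.received.extend(self.unsent)
        self.unsent.clear()


def lost_on_master_failure(master: ToyMaster) -> List[str]:
    """Changes acknowledged to the application that the slave never received."""
    return [c for c in master.committed if c not in master.slave.received]


async_master = ToyMaster(ToySlave(), semi_sync=False)
async_master.commit("t1")
async_master.ship_pending()
async_master.commit("t2")  # the master fails before t2 is shipped
async_lost = lost_on_master_failure(async_master)

semi_master = ToyMaster(ToySlave(), semi_sync=True)
semi_master.commit("t1")
semi_master.commit("t2")
semi_lost = lost_on_master_failure(semi_master)
```

In the asynchronous run the acknowledged change `t2` exists only on the failed master, while in the semi-synchronous run nothing acknowledged is missing from the slave.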

8. Third-Party HA Technologies

In addition to the HA solutions certified and supported by Oracle, there is a range of third-party technologies that can also be used to increase the uptime of MySQL deployments. Examples include Red Hat Cluster Suite and Veritas Cluster Server. Support for any third-party HA products must be obtained from the respective vendors. Oracle provides support for MySQL on supported platforms [12], even when used with third-party HA technologies, as long as any issues can be recreated in standalone environments.

9. Operational Best Practices

High Availability is not only a function of the underlying technology, but also of well established and tested operating procedures managed by a highly skilled operations team. As discussed earlier in the Guide, industry analysts estimate that 80% of downtime is the result of people and process, so the importance of operational best practices cannot be overstated. Oracle offers a range of tools and services to enable MySQL users to achieve operational excellence and deliver against their committed SLAs.

Oracle University

Training of operational and administrative teams reduces the risk of human error that can result in accidental system outages. Oracle University offers an extensive range of MySQL training, from introductory courses (e.g. MySQL Essentials, MySQL DBA) through to advanced certifications such as MySQL High Availability and MySQL Cluster Administration. It is also possible to define custom training plans for delivery at the customer site.
[12] MySQL support is dependent on a valid MySQL subscription or support agreement.


You can learn more about MySQL training from Oracle University here: http://www.mysql.com/training/

MySQL Consulting

To ensure adherence to best practices from the initial design phase of a project through to implementation and ongoing support, users can engage Oracle's MySQL Professional Services consultants. Delivered remotely or onsite, these engagements help in optimizing the architecture and increasing operational efficiency. Oracle offers a full range of consulting services, from Architecture and Design through to High Availability, Replication and Clustering. You can learn more at http://www.mysql.com/consulting/

MySQL Enterprise Edition and MySQL Cluster Carrier Grade Edition (CGE)

The commercial editions of MySQL deliver the most comprehensive set of MySQL production, backup, monitoring, modeling, development, and administration tools so organizations can achieve the highest levels of availability, performance and security. Key components of MySQL Enterprise Edition and MySQL Cluster CGE are discussed below.

24x7 Global Support

MySQL offers 24x7x365 access to Oracle's MySQL Support team, which is staffed by seasoned database experts ready to help with the most complex technical issues, with direct access to the MySQL development team. Oracle's Premier support provides you with:

- 24x7x365 phone and online support;
- Rapid diagnosis and solution to complex issues;
- Unlimited incidents;
- Emergency hot fix builds;
- Access to Oracle's MySQL Knowledge Base;
- Consultative support services.

The Support team partners with customers in the analysis and remediation of issues that are causing outages, leading to faster problem resolution, and if needed, generates hot fixes to restore service. This level of assistance offers significant benefits for HA over community or self-supported environments. Access to the best practices Knowledge Base is included within MySQL support agreements. The Knowledge Base offers great insight into how to configure, provision and manage highly available MySQL environments. You can learn more at http://www.mysql.com/support/

MySQL Enterprise Monitor

During normal operations, monitoring of the infrastructure is key to maintaining high availability and can help you detect problems before they occur. MySQL Enterprise Monitor provides at-a-glance views of the health of your databases, continuously monitoring your MySQL servers and alerting you to potential problems before they impact your system. MySQL Enterprise Monitor automatically tracks hundreds of MySQL variables to analyze current status. A sophisticated rules-based engine alerts administrators whenever parameters exceed defined thresholds so that DevOps and DBA teams can proactively avoid downtime or performance degradation. Administrators are alerted immediately should an outage occur, and are presented with diagnostics information to speed remediation of the issue and quickly restore service availability. MySQL Enterprise Monitor also stores historical MySQL status data so that analysis of issues is greatly simplified. You can learn more about the MySQL Enterprise Monitor here: http://www.mysql.com/products/enterprise/monitor.html

MySQL Enterprise Backup


Database backups are well-established processes in production environments. Depending on the technique used, backup operations can affect on-going services in several ways:

- Increased server load, impacting performance of production queries;
- Blocking of write operations, limiting the service to read-only queries during the backup process;
- Complete (planned) downtime during backup.

Of course, for HA services none of these is acceptable. A full online backup that does not consume MySQL Server resources is therefore the right choice to achieve HA.

MySQL Enterprise Backup performs online "Hot", non-blocking backups of your MySQL databases. Full backups can be performed on all InnoDB data, while MySQL is online, without interrupting queries or updates. In addition, incremental backups are supported, where only data that has changed is backed up. Partial backups are also supported when only certain tables or tablespaces need to be captured. MySQL Enterprise Backup restores your data from a full backup with full backward compatibility. Consistent Point-in-Time Recovery (PITR) enables DBAs to perform a restore to a specific point in time. You can learn more about MySQL Enterprise Backup from http://www.mysql.com/products/enterprise/backup.html
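The idea behind an incremental backup, copying only what has changed since the previous backup, can be sketched in a few lines. This is a toy model and not how MySQL Enterprise Backup is implemented: the page map, the per-page change counter and the function names are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical data set: page id -> (change counter when last modified, contents).
Pages = Dict[int, Tuple[int, str]]


def full_backup(pages: Pages) -> Pages:
    """Copy every page."""
    return dict(pages)


def incremental_backup(pages: Pages, since: int) -> Pages:
    """Copy only pages modified after the previous backup's change counter."""
    return {pid: page for pid, page in pages.items() if page[0] > since}


def restore(full: Pages, *incrementals: Pages) -> Pages:
    """Apply incrementals, oldest first, on top of the full backup."""
    restored = dict(full)
    for inc in incrementals:
        restored.update(inc)
    return restored


pages: Pages = {1: (10, "a"), 2: (12, "b"), 3: (15, "c")}
base = full_backup(pages)
base_counter = 15  # highest change counter captured by the full backup

pages[2] = (20, "b2")  # changed after the full backup
pages[4] = (21, "d")   # added after the full backup
delta = incremental_backup(pages, since=base_counter)
recovered = restore(base, delta)
```

Here the incremental copy contains only pages 2 and 4, and applying it to the full backup reproduces the current data set.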


10. Conclusion
High availability is a critical concern for any organization looking to deliver services to users and customers. As this whitepaper has demonstrated, there is a range of HA technologies available for MySQL delivering 99.9% to 99.999% uptime. Before selecting the technology, it is important to assess the actual requirements of the application; not everything needs 99.999% uptime, however desirable that may first appear. Combining operational best practices with technology solutions is essential to delivering true HA. This paper has presented a methodology to enable you to determine application requirements, and from there, select the right HA solution for your MySQL environment coupled with tools and services that reduce risk, cost and complexity.

11. Additional Resources


- MySQL Replication Whitepaper: http://www.mysql.com/why-mysql/white-papers/mysql-replication-introduction
- Oracle VM Template for MySQL Enterprise Edition Whitepaper: http://www.mysql.com/why-mysql/white-papers/mysql_wp_oracle-vm-template-for-mee.php
- Oracle Linux and DRBD for MySQL Enterprise Edition Whitepaper: http://www.mysql.com/why-mysql/white-papers/mysql_wp_drbd.php
- MySQL with Windows Server Failover Clustering Whitepaper: http://www.mysql.com/why-mysql/white-papers/mysql_wp_windows_failover_clustering.php
- MySQL Cluster Whitepaper: http://mysql.com/why-mysql/white-papers/mysql_wp_scaling_web_databases.php
- MySQL High Availability Documentation: http://dev.mysql.com/doc/refman/5.6/en/ha-overview.html
- http://www.mysql.com/cluster

Copyright 2012, 2013 Oracle Corp. MySQL is a registered trademark of Oracle Corp. in the U.S. and in other countries. Other products mentioned may be trademarks of their companies.
