
Ten vSphere HA and DRS Misconfiguration Issues: High Availability

2012



VMware's vSphere and vCenter family makes up a popular and powerful virtualization suite. This post focuses on ten of the
biggest mistakes people make when configuring the High Availability (HA) and Distributed Resource Scheduler (DRS) features.
We'll begin by looking at five common HA issues, then we'll look at four common DRS issues, and conclude with an issue that
affects both HA and DRS.

HA Issues
HA is included in almost every version of vSphere, including one of the small business bundles (Essentials Plus), as the impact
of an ESXi host failure is much bigger than the loss of a single server in the traditional world because many virtual machines
(VMs) are affected. Thus, it is very important to get HA designed and configured correctly.
Purchasing Differently Configured Servers
One of the common mistakes people make is buying differently sized servers (more CPU and/or memory in some servers than
others) and placing them in the same cluster. This is often done with the idea that some VMs require a lot more resources than
others, and the big, powerful servers are more expensive than several smaller servers. The problem with this thinking is that HA
is pessimistic and assumes that the largest servers will fail.
Solution: Either buy servers that are configured the same (or at least similarly) or create a couple of different clusters, with each
cluster containing identically configured servers. Some people also implement affinity rules to keep the big VMs on designated
servers, but this impacts DRS; we'll cover that issue later.
Insufficient Hosts to Run All VMs When Accounting for HA Overhead
When budgets are tight, many administrators size their environments with sufficient resources to run all the VMs that are
needed but forget to take into account the overhead HA imposes to guarantee that sufficient resources exist to restart the VMs
on a failed host (or multiple hosts, if you are pessimistic). VMware's best practice is to always leave Admission Control enabled
so that HA automatically sets aside resources to restart VMs after a host failure.
Solution: Plan for the HA overhead and purchase sufficient hardware to cover the resources required by the VMs in the
environment plus the overhead for HA.
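The sizing arithmetic above can be sketched in a few lines. This is an illustrative back-of-the-envelope helper, not a VMware tool; the function name and the workload numbers are made-up assumptions.

```python
# Hypothetical sizing sketch: how many identical hosts are needed to run a
# workload while still tolerating N host failures (the "HA overhead").
import math

def hosts_needed(total_vm_ghz, total_vm_gb, host_ghz, host_gb, failures_tolerated=1):
    """Return the host count covering the workload plus failover spares."""
    for_cpu = math.ceil(total_vm_ghz / host_ghz)
    for_mem = math.ceil(total_vm_gb / host_gb)
    # Size for the larger of the two constraints, then add spare hosts
    # so the cluster can absorb the tolerated failures.
    return max(for_cpu, for_mem) + failures_tolerated

# e.g. 180 GHz / 900 GB of VM demand on hosts with 40 GHz / 256 GB each:
print(hosts_needed(180, 900, 40, 256, failures_tolerated=1))  # 6 hosts
```

The point of the `+ failures_tolerated` term is exactly the overhead that gets forgotten: five hosts can run the workload, but only six can both run it and survive a failure.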
Using the Host Failures Cluster Tolerates Policy
Recall that there are three admission control policies:

Host failures the cluster tolerates: The original (and, at first, only) option for HA, this policy assumes the loss of a specified
number of hosts (one to four in versions 3 and 4, up to 31 in vSphere 5).

Percentage of cluster resources reserved as failover spare capacity: Introduced in vSphere 4, this option sets aside a
specified percentage of both CPU and memory resources from the cluster total for failover use; vSphere 5 improved
this option by allowing different percentages to be specified for CPU and memory.

Specify failover hosts: This policy designates a standby host that runs all the time but is never used for running VMs
unless a host in the cluster fails. It was introduced in vSphere 4 and upgraded in version 5 by allowing multiple failover
hosts to be specified.
As described previously, HA is pessimistic and always assumes the largest host will fail, reserving more resources than usually
needed if the hosts are sized differently (though, per issue one, we don't recommend that). The Host failures policy also uses a
concept called slots to reserve the right amount of spare capacity, but it takes a one-size-fits-all approach in this regard, using
the largest CPU reservation and the largest memory reservation among the VMs as the slot size for all VMs.
Solution: Use the VMware-recommended policy, percentage of cluster resources reserved as failover spare capacity, instead;
it takes a percentage of the entire cluster's resources and uses the actual reservation on each VM rather than the largest
reservations in the cluster.
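The contrast between the two policies can be illustrated with some rough numbers. This is a simplified sketch with invented reservations; the real HA slot algorithm also factors in VM memory overhead and per-host slot counts, which are omitted here.

```python
# Contrast the "host failures" slot model with the percentage model.
vms = [  # (cpu_reservation_mhz, mem_reservation_mb) -- illustrative only
    (500, 1024),
    (1000, 2048),
    (4000, 8192),  # one large VM inflates the slot size for everyone
]

# Slot model: slot size = largest CPU reservation x largest memory reservation.
slot_cpu = max(cpu for cpu, _ in vms)
slot_mem = max(mem for _, mem in vms)
host_cpu, host_mem = 20000, 65536  # one host: 20 GHz CPU, 64 GB memory
slots_per_host = min(host_cpu // slot_cpu, host_mem // slot_mem)

# Percentage model: capacity is tracked against actual reservations instead.
reserved_cpu = sum(cpu for cpu, _ in vms)
reserved_mem = sum(mem for _, mem in vms)

print(slots_per_host)              # only 5 slots, sized for the biggest VM
print(reserved_cpu, reserved_mem)  # 5500 MHz, 11264 MB actually reserved
```

One 4 GHz/8 GB VM forces every slot to that size, so a 20 GHz host yields only five slots even though most VMs reserve a fraction of that; the percentage policy counts each VM's real reservation instead.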
Forgetting to Update the Percentage Admission Control Policy as the Cluster Grows
If the Percentage of cluster resources reserved as failover spare capacity policy is used (as suggested), it is important to
reserve the correct amount of CPU and memory based on the needs of the VMs and the size of the cluster. For example, in a
two-node cluster, the loss of one of the nodes removes half of the cluster resources (assuming they are sized the same). Thus,
the percentage may be set to 50. However, if additional nodes are added to the cluster later, that value is probably too high and
should be reduced to account for the additional node(s) and the number of simultaneous failures expected (for example, with
four nodes, the loss of one node suggests the percentage be set to 25, while if two simultaneous failures are expected, 50
percent should be used).
Solution: Go back and recalculate the appropriate percentage whenever hosts are added to or removed from the cluster.
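For same-sized hosts, the recalculation above reduces to simple division. A hypothetical helper (the function name is ours, not a vSphere API):

```python
# Failover percentage for N same-sized hosts tolerating F simultaneous
# failures: reserve F/N of the cluster's resources.
def failover_percentage(num_hosts, failures_tolerated=1):
    if failures_tolerated >= num_hosts:
        raise ValueError("cannot tolerate the loss of every host")
    return round(100 * failures_tolerated / num_hosts)

print(failover_percentage(2))     # 50 -- two-node cluster
print(failover_percentage(4))     # 25 -- after growing to four nodes
print(failover_percentage(4, 2))  # 50 -- four nodes, two failures expected
```

The two-node and four-node values match the worked example in the text; the formula only holds when the hosts are identically sized, which is another reason to avoid mixed clusters (issue one).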
Configuring VM Restart Priorities Inefficiently
One of the settings available in an HA cluster is the default restart priority of VMs after a host failure. It defaults to
Medium but can be set to Low, Medium, High, or Disabled (the last if most VMs should not be restarted after a host failure).
Solution: Consider setting the cluster default for restart priority to Low, leaving the two higher levels available for more
important VMs. For example, infrastructure VMs such as domain controllers or DNS servers might be the highest priority (set
those VMs to High), followed by critical services such as database or e-mail servers (set those VMs to Medium), with the rest of
the VMs remaining at the default (Low). Any VMs that don't need to be restarted can be set to Disabled to save resources after
a host failure.
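The tiering scheme above amounts to a simple ordering. A small sketch of the idea, with made-up VM names and a hand-built mapping rather than anything queried from vCenter:

```python
# Sketch of the tiered restart scheme: High-priority infrastructure first,
# then Medium-priority critical services, then everything else at Low.
RESTART_ORDER = {"High": 0, "Medium": 1, "Low": 2}  # Disabled VMs never restart

vm_priority = {  # illustrative inventory, not real vCenter data
    "dc01": "High",      # domain controller
    "dns01": "High",
    "sql01": "Medium",   # database server
    "mail01": "Medium",
    "testvm1": "Low",    # everything else stays at the cluster default
    "scratch": "Disabled",
}

restartable = [vm for vm, p in vm_priority.items() if p != "Disabled"]
restartable.sort(key=lambda vm: RESTART_ORDER[vm_priority[vm]])
print(restartable)  # ['dc01', 'dns01', 'sql01', 'mail01', 'testvm1']
```

Note that HA restart priority orders restart attempts; it is not a hard dependency mechanism, so a High VM's services are not guaranteed to be up before Medium VMs boot.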
Recommended Courses
VMware vSphere: Fast Track [V5.0]

VMware vSphere: Install, Configure, Manage [V5.0]

VMware vSphere: Whats New [V5.0]
Reprinted with permission from Ten vSphere HA and DRS Misconfiguration Issues
