ESXTOP Troubleshooting

ESXTOP
Introduction
One of the many jobs administrators are tasked with is performance monitoring, to ensure that the
environments we are responsible for run as smoothly and efficiently as possible. This holds true
whether we are running in a physical or a virtual environment. While the graphs in the vSphere
Client / vSphere Web Client provide some of that functionality, the tool most likely to be used
should we decide to involve VMware technical support is ESXTOP.
ESXTOP is not covered in the official VMware 5.1 Install, Configure, and Manage course (it is
mentioned on one slide); however, ESXTOP (and its sibling resxtop) gives us a way to view live
performance data directly on the host using counters and percentages. For administrators familiar
with Linux / UNIX, ESXTOP is used the same way the top command is used. Personally, I believe that
for performance troubleshooting, ESXTOP is the best native tool for monitoring resources such as
CPU, memory, disk, and network usage. In this document, I outline some of the more common options
used with the ESXTOP command along with their descriptions, recommended thresholds, and examples of
readings that may signal a potential issue within our environment.
Starting esxtop
Being a command line tool, we first need to determine how we are going to access that command line.
Depending on the method chosen, the command used to invoke it may change somewhat. If using
SSH or the ESXi shell, we would log in and enter:
esxtop
In the event we would prefer to use the downloadable vCLI (vSphere Command Line Interface)
package or vMA (vSphere Management Assistant), the command changes to:
resxtop --server <server name or IP address>
Regardless of the method used to access the tool, the first screen encountered is the CPU statistics
(Figure 1). This screen refreshes automatically every 5 seconds; however, the refresh rate can be
changed by pressing 's' and then entering the desired number of seconds between refreshes. As an
example, the following input would set the page to refresh every 3 seconds:
s3
We can navigate to screens displaying other resources by using different keystrokes. Although there
are a multitude of available options, the following are the most commonly used and therefore the ones
I'll focus on:
c = cpu
m = memory
n = network
d = disk adapter
u = disk device (includes NFS as of 4.0 Update 2)
v = virtual machine disk activity
CPU
Figure 1. CPU View (at rest)

As previously mentioned, the first screen defaults to the CPU view, which provides valuable
information on how the physical CPU is being utilized by the VMkernel. The 'NAME' column shows the
processes running (including VMs) and how those processes are using the physical CPU (PCPU).
Figure 1 above shows a screenshot of an ESXi host with multiple VMs running on it (Win7, WinXP-1,
WinXP-2, WinXP-3), each with a %RDY time of less than one (1). Using the chart below (Table 1), we
can surmise that no VM is currently waiting for a PCPU and that all VMs are receiving the CPU
resources required to meet their demands.
View  Column  Threshold  Description
CPU   %RDY    10         Percentage of time a vCPU was ready to run but could not
                         be scheduled onto a PCPU.
CPU   %CSTP              Percentage of time a vSMP VM spent de-scheduled to try
                         and equalize the threads processed. Possibly due to a
                         high vSMP configuration.
CPU   %MLMTD             Maximum limited time. A value larger than 0 means a
                         limit has been set.
Table 1. CPU counters
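As a rough illustration of how Table 1 might be applied to readings copied from the CPU view, here
is a minimal Python sketch. The function name, the dictionary layout, and the sample values are
hypothetical and not part of ESXTOP. Only the two thresholds that Table 1 actually states are
checked (%RDY above 10, %MLMTD above 0); %CSTP is left out because no threshold is given for it.

```python
# Thresholds as stated in Table 1. %CSTP is omitted because the table gives no value for it.
TABLE_1_THRESHOLDS = {"%RDY": 10.0, "%MLMTD": 0.0}


def cpu_warnings(vm_name, counters, thresholds=TABLE_1_THRESHOLDS):
    """Return one warning string per counter that exceeds its threshold."""
    warnings = []
    for column, limit in thresholds.items():
        value = counters.get(column)
        if value is not None and value > limit:
            warnings.append(f"{vm_name}: {column} = {value} exceeds {limit}")
    return warnings


# Hypothetical readings typed in by hand from the CPU screen.
sample = {"%RDY": 42.5, "%CSTP": 0.0, "%MLMTD": 0.0}
print(cpu_warnings("WinXP-1", sample))
```

A VM whose counters all sit below the thresholds simply yields an empty list.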
Figure 2. CPU View (High %RDY)

In figure 2, the %RDY values change significantly. Looking at this counter, we can see a number of
VMs showing very high readings. The reasons for these readings are twofold. First, the physical host
on which they are running is only a dual-core machine, which limits the amount of PCPU available.
Second, the VMs are currently running CPU-intensive calculations. Together, these two factors
translate to a lack of CPU resources for these VMs. The kernel scheduler attempts to schedule the
VMs fairly; however, there are not enough cycles to meet the demand. In this case, our only options
are to add CPU resources to the box or to migrate some VMs to another host that is less CPU
constrained.
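To make the arithmetic behind this contention concrete, the following Python sketch uses a
deliberately simplified toy model. It assumes every vCPU is fully CPU-bound and that the scheduler
splits PCPU time equally among them. It ignores hyperthreading, shares, limits, and scheduling
overhead, so it is not how the VMkernel actually computes %RDY. The function name is hypothetical.

```python
def estimated_ready_percent(busy_vcpus, pcpus):
    """Toy estimate of per-vCPU ready time when CPU-bound vCPUs share PCPUs equally."""
    if busy_vcpus <= 0 or pcpus <= 0:
        raise ValueError("busy_vcpus and pcpus must be positive")
    if busy_vcpus <= pcpus:
        # Every vCPU can have a PCPU to itself, so nothing has to wait.
        return 0.0
    run_fraction = pcpus / busy_vcpus  # share of time each vCPU actually runs
    return round((1.0 - run_fraction) * 100.0, 1)


# Four CPU-bound single-vCPU VMs on a dual-core host.
print(estimated_ready_percent(4, 2))  # 50.0
```

Under these assumptions, moving two of the four VMs to another host brings the estimate back to
zero, which matches the remediation options described above.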
As we saw here, high %RDY times can be caused by a lack of CPU resources; however, they can also be
caused by external factors such as bottlenecks on other resources. One example that we'll revisit is
disk latency. If a VM is experiencing disk latency, it may show up as high %RDY times, so it is
important to look at all of a VM's resources when troubleshooting latency problems in an
environment.
Another issue commonly seen in environments has to do with vSMP (virtual symmetric multiprocessing)
VMs. Although it is possible to create VMs with multiple vCPUs, it is recommended that vSMP VMs be
the exception and not the norm. The reason is that when a vSMP VM requires CPU cycles, it looks for
x PCPUs (where 'x' is the number of vCPUs allocated to the VM) on which to execute. If x are not
available, the relaxed co-scheduler allows some of the threads to be placed on the available cores.
At first this doesn't seem like a problem; however, at some point the thread(s) furthest along must
be de-scheduled from a CPU in order to allow their siblings to catch up. In figure 3, this can be
seen because a vSMP VM (WinXP-4) has been introduced to the environment and is now also conducting
CPU-intensive calculations. With all VMs performing these calculations, WinXP-4's vCPUs must be
forcibly de-scheduled, which is reflected in the %CSTP (co-stop) counter.
Figure 3. CPU View (High %CSTP)

Disk
As previously mentioned, high numbers in one category may not be the result of that specific
resource. One example of this is disk latency.
Figure 4. CPU View (High %RDY)

In Figure 4, we once again see high %RDY values in the CPU view of ESXTOP; however, when we change
to the disk view, the numbers on that screen may shed some light on why this is occurring.
Figure 5. Adapter View (High DAVG)

In figure 5, we've navigated to the storage adapter view (using the 'd' key) and can see how our
storage is responding. In this view, vmhba37 is currently seeing a lot of activity. The numbers to
focus on in this view are DAVG, KAVG, and GAVG (see table 2 for more details). Using these numbers,
we can see that the VMkernel is processing commands without any lag (KAVG); however, the response
from the physical disks (DAVG) is a different story. Another view, which narrows the problem down to
the LUN associated with this vmhba, can be reached by using the 'u' key as we've done in figure 6.
Figure 6. Disk device view (High DAVG)

In this view, we can see that the latency is being experienced on the LUNs identified by their t10
identifiers. The two disks seen here are actually slow PATA drives on a remote device with a high
number of VMs located on it. Because of the demands placed on the storage and its inability to keep
up, the VMkernel will not schedule the VM onto a PCPU until the disks have processed the SCSI
commands requested of them. This is the reason for the high %RDY time seen in the CPU view of
ESXTOP.
View  Column  Threshold  Description
DISK  GAVG    25         Latency as perceived by the guest (VM). The sum of DAVG
                         and KAVG.
DISK  DAVG    25         Average amount of time it takes for the disk to process
                         a SCSI command.
DISK  KAVG               Average amount of time it takes the kernel to process a
                         SCSI command.
DISK  QUED               Disk queue maxed out. Queue depth determined by array
                         vendor.
Table 2. Disk Counters
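The relationship in Table 2 (GAVG is the sum of DAVG and KAVG) can be sketched in a few lines of
Python. This is only an illustration: the function name and the verdict strings are hypothetical,
the sample readings are made up, and the values are assumed to be in milliseconds. The 25
thresholds for GAVG and DAVG come from Table 2; no KAVG threshold is stated there, so the sketch
only points at KAVG when DAVG alone does not explain a high GAVG.

```python
GAVG_THRESHOLD_MS = 25.0  # from Table 2
DAVG_THRESHOLD_MS = 25.0  # from Table 2


def diagnose_latency(davg_ms, kavg_ms):
    """Return (gavg_ms, verdict) for one adapter or device row."""
    gavg_ms = davg_ms + kavg_ms  # latency as perceived by the guest
    if gavg_ms <= GAVG_THRESHOLD_MS:
        verdict = "ok"
    elif davg_ms > DAVG_THRESHOLD_MS:
        verdict = "device latency (DAVG) is high"
    else:
        verdict = "guest latency is high; check KAVG and queuing"
    return gavg_ms, verdict


# Hypothetical readings resembling the scenario described above.
print(diagnose_latency(40.0, 0.5))
print(diagnose_latency(5.0, 0.25))
```

In the first call nearly all of the latency comes from the device, which mirrors the case where the
VMkernel keeps up but the physical disks do not.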

Network
Figure 6. Network View (At rest)
Similar to what we saw with disk latency, virtual machine latency may also be caused by network
problems. These problems may be the result of dropped packets, which in turn may be the result of a
saturated uplink. Figure 6 is a view of the activity on the virtual network. In this screenshot,
network activity is minimal; however, figure 7 shows the same network under higher utilization.
Figure 7. Network view (High utilization)

Although this environment is not experiencing any dropped packets, I have seen problems caused by
such saturation, and this screen is where we would make that determination. The network screen shows
how much traffic is being seen by the kernel and on which device (Management port group, vmnic,
etc.). The network screen can be accessed by using the 'n' key.
View     Column  Threshold  Description
Network  %DRPTX             Percentage of packets dropped during transmission.
                            Possible network saturation.
Network  %DRPRX             Percentage of packets dropped on receipt. Possible
                            network saturation.
Table 3. Network Counters

Memory
One of the most coveted and highly contested resources within a vSphere environment is memory.
Although the ESXi host has multiple mechanisms to use memory more efficiently, there can come a time
when the amount of memory installed in the host is not enough to meet the demands of the environment
(e.g. overcommitment, an HA event, etc.). Accessing the memory statistics through ESXTOP can help us
determine when we have reached that point and allow us to take immediate action to mitigate the
situation (e.g. vMotion VMs to another host, add another host to the cluster, etc.). The memory
screen can be accessed using the 'm' key (Figure 8).
Figure 8. Memory View (At rest)

Using table 4, we can see that the VMs displayed in figure 8 are currently not experiencing any
problems. Through this view, we can see the amount of memory that has been allocated to each VM
(MEMSZ) and any swapping that may be taking place (SWCUR); however, one counter not available in the
default view is the ballooning counter. To add this and other counters, we can use the 'f' key,
which displays the screen seen in figure 9.
Figure 9. Counter options
Because we're interested in seeing the ballooning statistics, the counter we choose is j (MCTL). We
can then exit this screen and are returned to the memory statistics screen (figure 10), which now
contains the ballooning counters (MCTL). By looking at these counters, we can determine whether
memory is being reclaimed from some VMs for use by others. Ballooning begins when the host is
running low on memory: the balloon driver inside a guest reclaims pages from that VM so that the
host can hand them to another VM. If the reclaimed memory wasn't being used by the victimized VM, no
performance impact may be felt; however, if the pages taken were hot pages, this may force the
victimized VM to swap pages to disk, which brings a performance penalty with it.
Figure 10. Memory (balloon statistics)

In figure 11, we began running memory-intensive applications within some VMs in the environment. In
this scenario, we see that the Ubuntu 12.04 VM has begun borrowing memory from other VMs to try to
keep up. Because all the VMs in this scenario were being forced to use all their memory, the
ballooning caused them to begin swapping to disk, and a large performance penalty was felt
throughout the environment.
Figure 11. Memory (ballooning VM)
View    Column  Threshold  Description
Memory  MCTLSZ             The amount of guest memory that has been reclaimed
                           through the balloon driver.
Memory  SWCUR              If greater than 0, the host is currently swapping the
                           VM's memory out to disk. Possible cause:
                           overcommitment.
Memory  SWW/s              If larger than 0, the host is actively writing to
                           vswp. Possible cause: excessive memory
                           overcommitment.
Memory  ZIP/s              If larger than 0, the host is actively compressing
                           memory. Possible cause: memory overcommitment.
Table 4. Memory Counters
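The "larger than 0" rules in Table 4 lend themselves to a simple check. The Python sketch below is
only an illustration under stated assumptions: the function name, the argument names, and the sample
values are hypothetical, and MCTLSZ and SWCUR are assumed to be sizes in MB. Treating any non-zero
MCTLSZ as a sign of ballooning follows from its description rather than from a stated threshold.

```python
def memory_pressure_signs(mctlsz_mb, swcur_mb, sww_per_s, zip_per_s):
    """List which Table 4 counters indicate memory pressure for one VM."""
    signs = []
    if mctlsz_mb > 0:
        signs.append("MCTLSZ: memory reclaimed by the balloon driver")
    if swcur_mb > 0:
        signs.append("SWCUR: VM memory swapped out to disk")
    if sww_per_s > 0:
        signs.append("SWW/s: host actively writing to vswp")
    if zip_per_s > 0:
        signs.append("ZIP/s: host actively compressing memory")
    return signs


# Hypothetical readings for a VM that is ballooned and being swapped.
for sign in memory_pressure_signs(512.0, 128.0, 3.5, 0.0):
    print(sign)
```

An empty result corresponds to the "at rest" state shown in figure 8.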
It is possible for some problems to manifest themselves during off hours (every morning at 0200,
for example). If no one is available to monitor at that time, troubleshooting such issues can be a
challenge; however, in these cases we can schedule ESXTOP to run and capture performance snapshots,
which can be saved and replayed at our convenience or sent to VMware technical support for further
analysis. The command used for this can vary between ESXi versions (see
http://kb.vmware.com/kb/1967); however, one example would be:
vm-support -p -d 600 -i 20 > perfsnap.csv
This command runs vm-support specifically for performance (-p) for ten minutes (-d is the duration
in seconds) at 20-second intervals (-i). It then exports the result into the perfsnap.csv file for
later analysis.
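When choosing values for the duration and interval, it can help to work out how many snapshots a
capture will contain. The Python sketch below does that arithmetic for the values in the example
command. The function name is hypothetical, and the result assumes one snapshot per complete
interval; whether the tool also records a sample at the very start is not covered here.

```python
def snapshot_count(duration_s, interval_s):
    """Number of complete intervals that fit in the capture window."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    return duration_s // interval_s


# Values from the example command: 600 seconds at 20-second intervals.
print(snapshot_count(600, 20))  # 30
```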
ESXTOP is a command that provides us with a plethora of options, and each of these options could be
discussed at great length. In this document, I have attempted to detail some of the more common
areas where issues may crop up; however, it was not intended to be exhaustive. For further reading
on this topic, I've provided links to VMware documentation and VMworld presentations that you may
find useful.
http://www.vmware.com/pdf/esx2_using_esxtop.pdf
http://kb.vmware.com/kb/1008205
http://communities.vmware.com/docs/DOC-9279
http://communities.vmware.com/docs/DOC-5240
http://www.vmware.com/pdf/vsphere4/r41/vsp_41_resource_mgmt.pdf
VMworld 2011 - ESXTOP for Advanced Users
VMworld 2011 - Performance Best Practices and Troubleshooting
