Acropolis 4.7
06-Oct-2016
Notice
Copyright
Copyright 2016 Nutanix, Inc.
Nutanix, Inc.
1740 Technology Drive, Suite 150
San Jose, CA 95110
All rights reserved. This product is protected by U.S. and international copyright and intellectual property
laws. Nutanix is a trademark of Nutanix, Inc. in the United States and/or other jurisdictions. All other marks
and names mentioned herein may be trademarks of their respective companies.
License
The provision of this software to you does not grant any licenses or other rights under any Microsoft
patents with respect to anything other than the file server implementation portion of the binaries for this
software, including no licenses or any other rights in any hardware or any devices or software that are used
to communicate with or in connection with this software.
Conventions
Convention            Description
user@host$ command    The commands are executed as a non-privileged user (such as nutanix) in the system shell.
root@host# command    The commands are executed as the root user in the vSphere or Acropolis host shell.
> command             The commands are executed in the Hyper-V host shell.
Version
Last modified: October 6, 2016 (2016-10-06 3:54:57 GMT-7)
1: Cluster Management............................................................................... 6
Controller VM Access.......................................................................................................................... 6
Port Requirements.................................................................................................................... 6
Starting a Nutanix Cluster................................................................................................................... 6
Stopping a Cluster............................................................................................................................... 7
Destroying a Cluster............................................................................................................................ 8
Creating Clusters from a Multiblock Cluster........................................................................................9
Cluster IP Address Configuration............................................................................................. 9
Configuring the Cluster........................................................................................................... 10
Verifying IPv6 Link-Local Connectivity....................................................................................13
Failing from one Site to Another....................................................................................................... 15
Disaster failover.......................................................................................................................15
Planned failover...................................................................................................................... 15
Fingerprinting Existing vDisks............................................................................................................15
2: Changing Passwords............................................................................ 17
Changing User Passwords................................................................................................................ 17
Changing the SMI-S Provider Password (Hyper-V)............................................................... 17
Changing the Controller VM Password............................................................................................. 18
6: Self-Service Restore..............................................................................35
Requirements and Limitations of Self-Service Restore.....................................................................35
Enabling Self-Service Restore...........................................................................................................36
Restoring a File as a Guest VM Administrator................................................................................. 37
Restoring a File as a Nutanix Administrator..................................................................................... 38
7: Logs........................................................................................................ 40
Sending Logs to a Remote Syslog Server........................................................................................40
Configuring the Remote Syslog Server Settings.................................................................... 41
Common Log Files.............................................................................................................................42
Nutanix Logs Root.................................................................................................................. 42
Self-Monitoring (sysstats) Logs...............................................................................................43
/home/nutanix/data/logs/cassandra...................................................................................... 43
Controller VM Log Files.......................................................................................................... 43
Correlating the FATAL log to the INFO file....................................................................................... 46
Stargate Logs.....................................................................................................................................47
Cassandra Logs................................................................................................................................. 48
Prism Gateway Log........................................................................................................................... 49
Zookeeper Logs................................................................................................................................. 50
Genesis Logs..................................................................................................................................... 50
Diagnosing a Genesis Failure................................................................................................ 51
ESXi Log Files................................................................................................................................... 52
8: Troubleshooting Tools.......................................................................... 53
Nutanix Cluster Check (NCC)........................................................................................................... 53
Installing NCC from an Installer File.......................................................................................54
Upgrading NCC Software....................................................................................................... 56
NCC Usage............................................................................................................................. 57
Diagnostics VMs................................................................................................................................ 58
Running a Test Using the Diagnostics VMs........................................................................... 58
Diagnostics Output.................................................................................................................. 59
Syscheck Utility.................................................................................................................................. 60
Using Syscheck Utility.............................................................................................................60
1: Cluster Management
Although each host in a Nutanix cluster runs a hypervisor independent of other hosts in the cluster, some
operations affect the entire cluster.
Controller VM Access
Most administrative functions of a Nutanix cluster can be performed through the web console or nCLI.
Nutanix recommends using these interfaces whenever possible and disabling Controller VM SSH access
with password or key authentication. Some functions, however, require logging on to a Controller VM
with SSH. Exercise caution whenever connecting directly to a Controller VM as the risk of causing cluster
issues is increased.
Warning: When you connect to a Controller VM with SSH, ensure that the SSH client does
not import or change any locale settings. The Nutanix software is not localized, and executing
commands with any locale other than en_US.UTF-8 can cause severe cluster issues.
To check the locale used in an SSH session, run /usr/bin/locale. If any environment variables
are set to anything other than en_US.UTF-8, reconnect with an SSH configuration that does not
import or change any locale settings.
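For example, a quick check from within the SSH session looks like the following (the full output lists every LC_* variable, and each one should report en_US.UTF-8):
nutanix@cvm$ /usr/bin/locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
...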
Port Requirements
Nutanix uses a number of ports for internal communication. The following unique ports are required for
external access to Controller VMs in a Nutanix cluster.
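Starting a Nutanix Cluster
To start the cluster, log on to any Controller VM in the cluster with SSH and run the following command:
nutanix@cvm$ cluster start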
If the cluster starts properly, output similar to the following is displayed for each node in the cluster:
CVM: 10.1.64.60 Up
Zeus UP [3704, 3727, 3728, 3729, 3807, 3821]
Scavenger UP [4937, 4960, 4961, 4990]
SSLTerminator UP [5034, 5056, 5057, 5139]
Hyperint UP [5059, 5082, 5083, 5086, 5099, 5108]
Medusa UP [5534, 5559, 5560, 5563, 5752]
DynamicRingChanger UP [5852, 5874, 5875, 5954]
Pithos UP [5877, 5899, 5900, 5962]
Stargate UP [5902, 5927, 5928, 6103, 6108]
Cerebro UP [5930, 5952, 5953, 6106]
Chronos UP [5960, 6004, 6006, 6075]
Curator UP [5987, 6017, 6018, 6261]
Prism UP [6020, 6042, 6043, 6111, 6818]
CIM UP [6045, 6067, 6068, 6101]
AlertManager UP [6070, 6099, 6100, 6296]
Arithmos UP [6107, 6175, 6176, 6344]
SysStatCollector UP [6196, 6259, 6260, 6497]
Tunnel UP [6263, 6312, 6313]
ClusterHealth UP [6317, 6342, 6343, 6446, 6468, 6469, 6604, 6605,
6606, 6607]
Janus UP [6365, 6444, 6445, 6584]
NutanixGuestTools UP [6377, 6403, 6404]
What to do next: After you have verified that the cluster is running, you can start guest VMs.
(Hyper-V only) If the Hyper-V failover cluster was stopped, start it by logging on to a Hyper-V host and
running the Start-Cluster PowerShell command.
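For example, from the Hyper-V host shell (running Start-Cluster with no parameters starts the failover cluster that the local node belongs to):
> Start-Cluster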
Warning: By default, Nutanix clusters have redundancy factor 2, which means they can tolerate the failure of a single node or drive. Clusters configured with redundancy factor 3 can withstand the failure of two nodes or drives in different blocks.
• Never shut down or restart multiple Controller VMs or hosts simultaneously.
• Always run the cluster status command to verify that all Controller VMs are up before
performing a Controller VM or host shutdown or restart.
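For example:
nutanix@cvm$ cluster status
The output should report every Controller VM and every service as Up, as in the example earlier in this chapter.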
Stopping a Cluster
Before you begin: Shut down all guest virtual machines, including vCenter if it is running on the cluster.
Do not shut down Nutanix Controller VMs.
(Hyper-V only) Stop the Hyper-V failover cluster by logging on to a Hyper-V host and running the Stop-Cluster PowerShell command.
Note: This procedure stops all services provided by guest virtual machines, the Nutanix cluster,
and the hypervisor host.
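The cluster itself is stopped from any Controller VM with the following command:
nutanix@cvm$ cluster stop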
Wait to proceed until output similar to the following is displayed for every Controller VM in the cluster.
Destroying a Cluster
Before you begin: Reclaim licenses from the cluster to be destroyed by following Reclaiming Licenses
When Destroying a Cluster in the Web Console Guide.
Note: If you have destroyed the cluster and did not reclaim the existing licenses, contact Nutanix
Support to reclaim the licenses.
Note: If you have a cluster running the Cloud Platform System (CPS), the procedure to destroy it
is different. See the "Destroying the Cluster" section in the CPS Standard on Nutanix Guide.
Destroying a cluster resets all nodes in the cluster to the factory configuration. All cluster configuration and
guest VM data is unrecoverable after destroying the cluster.
Note: If the cluster is registered with Prism Central (the multiple cluster manager VM), unregister
the cluster before destroying it. See Registering with Prism Central in the Web Console Guide for
more information.
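The destroy operation itself is run from any Controller VM:
nutanix@cvm$ cluster destroy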
Wait to proceed until output similar to the following is displayed for every Controller VM in the cluster.
Caution: Performing this operation deletes all cluster and guest VM data in the cluster.
2. Create one or more new clusters by following Configuring the Cluster on page 10.
AOS includes a web-based configuration tool that automates assigning IP addresses to cluster
components and creates the cluster.
Requirements
The web-based configuration tool requires that IPv6 link-local be enabled on the subnet. If IPv6 link-local is
not available, you must configure the Controller VM IP addresses and the cluster manually. The web-based
configuration tool also requires that the Controller VMs be able to communicate with each other.
All Controller VMs and hypervisor hosts must be on the same subnet. The hypervisor can be multihomed
provided that one interface is on the same subnet as the Controller VM.
Guest VMs can be on a different subnet.
Video: Click here to see a video (MP4 format) demonstration of this procedure. (The video may
not reflect the latest features described in this section.)
Note: Internet Explorer requires protected mode to be disabled. Go to Tools > Internet
Options > Security, clear the Enable Protected Mode check box, and restart the browser.
The value of the inet6 addr field up to the / character is the IPv6 address of the Controller VM.
4. Type a virtual IP address for the cluster in the Cluster External IP field.
This parameter is required for Hyper-V clusters and is optional for vSphere and AHV clusters.
You can connect to the external cluster IP address with both the web console and nCLI. In the event
that a Controller VM is restarted or fails, the external cluster IP address is relocated to another
Controller VM in the cluster.
5. (Optional) If you want to enable redundancy factor 3, set Cluster Max Redundancy Factor to 3.
Redundancy factor 3 has the following requirements:
• Redundancy factor 3 can be enabled only when the cluster is created.
• A cluster must have at least five nodes for redundancy factor 3 to be enabled.
• For guest VMs to tolerate the simultaneous failure of two nodes or drives in different blocks, the data
must be stored on containers with replication factor 3.
• Controller VMs must be configured with 24 GB of memory.
6. Type the appropriate DNS and NTP addresses in the respective fields.
Note: You must enter NTP servers that the Controller VMs can reach in the CVM NTP Servers
field. If reachable NTP servers are not entered or if the time on the Controller VMs is ahead of
the current time, cluster services may fail to start.
For Hyper-V clusters, the CVM NTP Servers parameter must be set to the IP addresses of one or more
Active Directory domain controllers.
The Hypervisor NTP Servers parameter is not used in Hyper-V clusters.
8. Type the appropriate default gateway IP addresses in the Default Gateway row.
9. Select the check box next to each node that you want to add to the cluster.
Note: The unconfigured nodes are not listed according to their position in the block. Ensure
that you assign the intended IP address to each node.
If the cluster is running properly, output similar to the following is displayed for each node in the cluster:
CVM: 10.1.64.60 Up
Zeus UP [3704, 3727, 3728, 3729, 3807, 3821]
Scavenger UP [4937, 4960, 4961, 4990]
SSLTerminator UP [5034, 5056, 5057, 5139]
Hyperint UP [5059, 5082, 5083, 5086, 5099, 5108]
Medusa UP [5534, 5559, 5560, 5563, 5752]
DynamicRingChanger UP [5852, 5874, 5875, 5954]
Pithos UP [5877, 5899, 5900, 5962]
Stargate UP [5902, 5927, 5928, 6103, 6108]
Cerebro UP [5930, 5952, 5953, 6106]
Chronos UP [5960, 6004, 6006, 6075]
Curator UP [5987, 6017, 6018, 6261]
Prism UP [6020, 6042, 6043, 6111, 6818]
CIM UP [6045, 6067, 6068, 6101]
→ Mac OS
$ ifconfig en0
Note the IPv6 link-local addresses, which always begin with fe80. Omit the / character and anything following.
→ Linux/Mac OS
$ ping6 ipv6_linklocal_addr%interface
• Replace ipv6_linklocal_addr with the IPv6 link-local address of the other laptop.
• Replace interface with the interface identifier on the other laptop (for example, 12 for Windows, eth0
for Linux, or en0 for Mac OS).
If the ping packets are answered by the remote host, IPv6 link-local is enabled on the subnet. If the
ping packets are not answered, ensure that firewalls are disabled on both laptops and try again before
concluding that IPv6 link-local is not enabled.
5. Reenable the firewalls on the laptops and disconnect them from the network.
Results:
• If IPv6 link-local is enabled on the subnet, you can use automated IP address and cluster configuration
utility.
• If IPv6 link-local is not enabled on the subnet, you have to manually set IP addresses and create the
cluster.
Note: An IPv6 connectivity issue might occur if there is a VLAN tag mismatch. ESXi hosts shipped from the factory have no VLAN tagging configured (VLAN tag 0), whereas the workstation (laptop) that you connect might be on an access port that uses a different VLAN tag. Ensure that the switch port connected to the ESXi host is in trunking mode.
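One way to check the VLAN tags currently configured on the ESXi host is to list its virtual switches and port groups from the host shell; the output shows the VLAN ID assigned to each port group:
root@host# esxcfg-vswitch -l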
Disaster failover
Planned failover
Connect to the primary site and specify the failover site to migrate to.
ncli> pd migrate name="pd_name" remote-site="remote_site_name2"
Run the vDisk manipulator utility from any Controller VM in the cluster.
nutanix@cvm$ vdisk_manipulator --operation="add_fingerprints" \
--stats_only="false" --nfs_container_name="ctr_name" \
--nfs_relative_file_path="vdisk_path"
• Replace ctr_name with the name of the container where the vDisk to fingerprint resides.
• (Web console) Log on to the web console as the user whose password is to be changed and select
Change Password from the user icon pull-down list of the main menu.
For more information about changing properties of the current users, see the Web Console Guide.
• Replace username with the name of the user whose password is to be changed.
• Replace curr_pw with the current password.
• Replace new_pw with the new password.
Note: If you change the password of the admin user from the default, you must specify the
password every time you start an nCLI session from a remote system. A password is not
required if you are starting an nCLI session from a Controller VM where you are already logged
on.
1. Log on to the system where the SCVMM console is installed and start the console.
4. Update the username and password to include the new credentials and ensure that Validate domain
credentials is not checked.
3. Respond to the prompts, providing the current and new nutanix user password.
Changing password for nutanix.
Old Password:
New password:
Retype new password:
Password changed.
AOS includes a web-based configuration tool that automates the modification of Controller VM IP
addresses and configures the cluster to use these new IP addresses. Other cluster components must be
modified manually.
The internal virtual switch manages network communications between the Controller VM and the
hypervisor host. This switch is associated with a private network on the default VLAN and uses the
192.168.5.0/24 address space. The traffic on this subnet is typically restricted to the internal virtual switch,
but might be sent over the physical wire, through a host route, to implement storage high availability on
ESXi and Hyper-V clusters. This traffic is on the same VLAN as the Nutanix storage backplane.
Note: For guest VMs and other devices on the network, do not use a subnet that overlaps with
the 192.168.5.0/24 subnet on the default VLAN. If you want to use an overlapping subnet for such
devices, make sure that you use a different VLAN.
The following tables list the interfaces and IP addresses on the internal virtual switch on different
hypervisors:
Interfaces and IP Addresses on the Internal Virtual Switch virbr0 on an AHV Host
eth1:1 192.168.5.254
Interfaces and IP Addresses on the Internal Virtual Switch vSwitchNutanix on an ESXi Host
eth1:1 192.168.5.254
Interfaces and IP Addresses on the Internal Virtual Switch InternalSwitch on a Hyper-V Host
eth1:1 192.168.5.254
The external virtual switch manages communication between the virtual machines, between the virtual
machines and the host, and between the hosts in the cluster. The traffic on this virtual switch also includes
Controller VM–driven replication traffic for the purposes of maintaining the specified replication factor (RF),
as well as any ADSF traffic that cannot be processed locally. The external switch is assigned a NIC team or
bond as the means to provide connectivity outside of the host.
Note: Make sure that the hypervisor and Controller VM interfaces on the external virtual switch
are not assigned IP addresses from the 192.168.5.0/24 subnet.
The following tables list the interfaces and IP addresses on the external virtual switch on different
hypervisors:
Interfaces and IP Addresses on the External Virtual Switch br0 on an AHV Host
Interfaces and IP Addresses on the External Virtual Switch ExternalSwitch on a Hyper-V Host
1. Place the cluster in reconfiguration mode. See the "Before you begin" section in Changing a Controller
VM IP Address (manual) on page 25.
• Confirm that the system you are using to configure the cluster meets the following requirements:
• (Hyper-V only) Confirm that the hosts have only one type of NIC (10 GbE or 1 GbE) connected during
cluster creation. If the nodes have multiple types of network interfaces connected, disconnect them until
after you join the hosts to the domain.
Warning: If you are reassigning a Controller VM IP address to another Controller VM, you must
perform this complete procedure twice: once to assign intermediate IP addresses and again to
assign the desired IP addresses.
For example, if Controller VM A has IP address 172.16.0.11 and Controller VM B has IP address
172.16.0.10 and you want to swap them, you would need to reconfigure them with different IP
addresses (such as 172.16.0.100 and 172.16.0.101) before changing them to the IP addresses in
use initially.
The cluster must be stopped and in reconfiguration mode before changing the Controller VM IP addresses.
Note: Internet Explorer requires protected mode to be disabled. Go to Tools > Internet
Options > Security, clear the Enable Protected Mode check box, and restart the browser.
The value of the inet6 addr field up to the / character is the IPv6 address of the Controller VM.
4. Click Reconfigure.
5. Wait until the Log Messages section of the page reports that the cluster has been successfully
reconfigured, as shown in the following example.
Configuring IP addresses on node S10264822116570/A...
Success!
Configuring IP addresses on node S10264822116570/C...
Success!
Configuring IP addresses on node S10264822116570/B...
Success!
Configuring IP addresses on node S10264822116570/D...
Success!
Configuring Zeus on node S10264822116570/A...
Configuring Zeus on node S10264822116570/C...
Configuring Zeus on node S10264822116570/B...
Configuring Zeus on node S10264822116570/D...
Reconfiguration successful!
The IP address reconfiguration disconnects any SSH sessions to cluster components. The cluster is
taken out of reconfiguration mode.
Web Console
> Name Servers
• Log on to a Controller VM in the cluster and check that all hosts are part of the metadata store.
nutanix@cvm$ ncli host ls | grep "Metadata store status"
For every host in the cluster, Metadata store enabled on the node should be shown.
Wait to proceed until output similar to the following is displayed for every Controller VM in the cluster.
1. Log on to the hypervisor host with SSH or the IPMI remote console (vSphere or AHV) or remote
desktop connection (Hyper-V).
3. Restart genesis.
nutanix@cvm$ genesis restart
Caution: Do not place a backup copy of the ifconfig-ethX script files in the /etc/sysconfig/
network-scripts/ directory. Ensure that there are no backup files (.bkp or similar file extension)
in this location.
d. Press Esc.
Caution: Failure to update these entries according to the provided steps might result in
services failing to start or other network-related anomalies that might prevent cluster use.
• For each node in the cluster, a message containing the current Zookeeper mapping is displayed.
Found mapping {'10.4.100.70': 1, '10.4.100.243': 3, '10.4.100.251': 2}
The second number in each pair is the Zookeeper ID. Each cluster has either three or five Zookeeper nodes.
• A message Safe to proceed with modification of Zookeeper mapping indicates that you can
change the Zookeeper IP addresses.
Replace cvm_ip_addr1, cvm_ip_addr2, and cvm_ip_addr3 with the new IP addresses that you want
to assign to the Controller VMs.
1. If you changed the IP addresses by modifying the Controller VM configuration files directly rather than
using the Nutanix utility, take the cluster out of reconfiguration mode.
Perform these steps for every Controller VM in the cluster.
c. Restart genesis.
nutanix@cvm$ genesis restart
If the cluster starts properly, output similar to the following is displayed for each node in the cluster:
CVM: 10.1.64.60 Up
Zeus UP [3704, 3727, 3728, 3729, 3807, 3821]
Scavenger UP [4937, 4960, 4961, 4990]
SSLTerminator UP [5034, 5056, 5057, 5139]
Hyperint UP [5059, 5082, 5083, 5086, 5099, 5108]
Medusa UP [5534, 5559, 5560, 5563, 5752]
DynamicRingChanger UP [5852, 5874, 5875, 5954]
Pithos UP [5877, 5899, 5900, 5962]
Stargate UP [5902, 5927, 5928, 6103, 6108]
Cerebro UP [5930, 5952, 5953, 6106]
Chronos UP [5960, 6004, 6006, 6075]
Curator UP [5987, 6017, 6018, 6261]
Prism UP [6020, 6042, 6043, 6111, 6818]
CIM UP [6045, 6067, 6068, 6101]
AlertManager UP [6070, 6099, 6100, 6296]
Arithmos UP [6107, 6175, 6176, 6344]
SysStatCollector UP [6196, 6259, 6260, 6497]
What to do next: Run the following NCC checks to verify the health of the Zeus configuration:
• ncc health_checks system_checks zkalias_check_plugin
1. From the Server Manager, add and enable the Multipath I/O feature in Tools > MPIO.
a. Add support for iSCSI devices by checking the box in the Discovered Multipaths tab.
b. Enable multipath for the targets by checking the box in the Microsoft iSCSI Initiator and selecting
the IP addresses for the Target Portal IP.
2. Set the default load balancing policy for all LUNS to Fail Over Only by running the following PowerShell
cmdlet on each Windows Server 2012 VM that is being used for Windows Failover Clustering:
> Set-MSDSMGlobalDefaultLoadBalancePolicy -Policy FOO
3. Log on to any Controller VM in your cluster through an SSH session and access the Acropolis
command line.
nutanix@cvm$ acli
<acropolis>
4. Create a volume group, then add a disk to the newly created volume group. Verify that the new volume group and disk were created successfully.
a. Create a volume group, where vg_name is the name of the volume group.
<acropolis> vg.create vg_name shared=true
b. Add a disk to the newly created volume group, where container_name is the name of the container and disk_size is the disk size (for example, create_size=1000G creates a disk with a capacity of 1000G). (Optional) Use index_number to index the disk within the cluster; otherwise, the system automatically assigns index numbers.
Note: When you specify the disk size, a lowercase g (as in create_size=20g) indicates gigabytes and a capital G (as in create_size=1000G) indicates gibibytes.
<acropolis> vg.disk_create vg_name container=container_name \
create_size=disk_size index=index_number
Note: For best results, Nutanix recommends that you configure 1 vDisk per volume group.
Note: If you have more than one disk inside the target, the vg.get command displays all
disks within the target.
<acropolis> vg.get vg_name
5. From Windows Server, get the iSCSI initiator name. Then, from the Acropolis CLI, attach the external
initiator to the volume group and verify the connection.
a. From the Windows Server Manager VM, in the Disk Management window, click Tools > iSCSI Initiator, then click the Configuration tab and copy the iSCSI initiator name from the text box.
b. From the Acropolis CLI, attach the external initiators, where initiator_name is the copied initiator
name.
<acropolis> vg.attach_external vg_name initiator_name
c. Repeat this step for any remaining external initiators. Verify that the external initiators are connected.
<acropolis> vg.get vg_name
Note: You can also create a volume group and enable multiple initiators to access the
volume group by using Prism web console. For more information, see Creating a Volume
Group section of Prism Web Console Guide.
a. From the Windows Server Manager VM, in the iSCSI Initiator Properties window, click Discovery
and add the Controller VM IP address, then click OK.
b. Repeat this step for the remaining Controller VM IP addresses by doing the same to the next target.
c. To verify that the IP addresses are connected, go to the Targets tab and click Refresh.
d. Click OK to exit.
7. From the Server Manager, place the target disks online and create a New Simple Volume.
a. In the Disk Management window, right-click each disk and choose the Online option.
Repeat for any remaining disks.
b. Click New Simple Volume Wizard and verify the information in the following windows until you reach the Format Partition window.
c. Enter a name for the volume in Volume label and complete the remaining wizard steps.
Note: (Optional) If a formatting window appears, you can format the Simple Volume.
8. From the Server Manager, create a Windows Guest VM Failover Cluster and add the disks.
a. In the Server Manager, click Tools > Failover Cluster Manager and click Create Cluster.
b. Click Browse and in the Select Computer window enter the names of the VMs you want to add, then
click OK. Click Yes to validate configuration tests.
c. Verify the information, then enter a name and IP address for the Windows Failover Cluster. Click OK.
d. In the Failover Cluster Manager, click the volume group and click Storage > Disks. Choose Add
disk.
e. Select the disks you want to add to the cluster and click OK.
The new cluster and disks have been created and configured.
5: Virtual Machine Pinning
VM pinning enables the administrator to select the storage tier preference for a particular virtual disk (vDisk) of a virtual machine that runs latency-sensitive, mission-critical applications. For example, a mission-critical workload (such as a SQL database) with a large working set, running alongside other workloads in the cluster, may be too large to fit into the SSD tier (hot tier) and could be migrated to the HDD tier (cold tier). For extremely latency-sensitive workloads, this migration to the HDD tier could seriously affect read and write performance.
By using VM pinning, you can pin a particular virtual disk to the SSD tier so that all the data of that virtual disk resides only in the SSD tier and is never down-migrated to the HDD tier. This ensures that critical VMs get consistent performance and higher throughput after their disks are pinned to the SSD tier. A maximum of 25% of the overall SSD capacity across the cluster is available for virtual disk pinning.
You can configure VM pinning in the following two ways.
• Full pinning
• Partial pinning
Full Pinning
If the pinned space configured for a virtual disk equals the total size of the virtual disk, the virtual disk is considered fully pinned to the SSD tier. The policy applies to new data as it is written to or read from the storage tier: when data is written or accessed, it is first placed in the SSD tier (subject to SSD tier space availability). For a fully pinned virtual disk, once data is placed in the SSD tier it is never down-migrated to the HDD tier as long as the pinning configuration is preserved. All sequential and random read and write requests are served from the SSD tier.
Caution: If the SSD tier is 100% full and you have fully pinned a virtual disk to the SSD tier, subsequent writes to that virtual disk may fail.
Partial Pinning
If the pinned space configured for a virtual disk is less than the total size of the virtual disk, the virtual disk is considered partially pinned to the SSD tier. If you do not want to place the complete virtual disk in the SSD tier, partial pinning gives you the flexibility to select how much of the virtual disk is pinned to the SSD tier. This ensures that only the working set of the virtual disk resides in the SSD tier at all times. If the virtual disk uses more SSD space than configured, the excess data may be down-migrated to the HDD tier. Partial pinning keeps only the hot data of a particular virtual disk in the SSD tier while still providing higher IOPS and throughput.
• If the virtual disk is partially pinned to the SSD tier, data usage of the virtual disk above the pinned space in the SSD tier may be down-migrated to the HDD tier.
• If the virtual disk is partially pinned to the SSD tier, preference is given to the hot working set of the virtual disk when down-migrating data to the HDD tier.
Disks with this feature enabled on VMs running on ESXi are not backed up by the disaster recovery workflows. However, the original disks of the VMs are backed up. If you replicate VMs with self-service-enabled disks, the following scenarios occur.
• If you attempt to attach a self-service disk to a VM that is part of a protection domain and is being replicated to a remote site running a release earlier than AOS 4.5, an error message is displayed during the disk attach operation.
• If a snapshot with an attached self-service disk is replicated to a remote site running a release earlier than AOS 4.5, an alert message is displayed during the replication process.
Requirements
Replace virtual_machine_id with the ID of the VM. You can retrieve the ID of a VM by using the ncli> vm ls command. If NGT is not enabled, output similar to the following is displayed.
VM Id: 00051a34-066f-72ed-0000-000000005400::38dc7bf2-a345-4e52-9af6-c1601e759987
Nutanix Guest Tools Enabled: false
File Level Restore: false
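This NGT status is typically retrieved with the following command (assuming the nCLI ngt get subcommand available in this release):
ncli> ngt get vm-id=virtual_machine_id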
Replace virtual_machine_id with the ID of the VM. Replace application_names with file-
level-restore. For example, to enable NGT and self-service restore for the VM with ID
00051a34-066f-72ed-0000-000000005400::38dc7bf2-a345-4e52-9af6-c1601e759987, use the following
command.
ncli> ngt enable application-names=file-level-restore \
vm-id=00051a34-066f-72ed-0000-000000005400::38dc7bf2-a345-4e52-9af6-c1601e759987
4. (Optional) To disable self-service restore for a VM, use the following command.
ncli> ngt disable-applications application-names=file-level-restore vm-id=virtual_machine_id
5. List the snapshots and virtual disks that are present for the guest VM by using the following command.
ngtcli> flr ls-snaps
The snapshot ID, disk labels, logical drives, and creation time of each snapshot are displayed. The guest VM administrator can use this information to decide which snapshot contains the data to restore. For example, if the files are present in logical drive C: for snapshot 41 and disk label scsi0:0, the guest VM administrator can use this snapshot ID and disk label to attach the disk.
For example, to attach a disk with snapshot ID 16353 and disk label scsi0:1, type the command.
ngtcli> flr attach-disk snapshot-id=16353 disk-label=scsi0:1
After successfully running the command, a new disk with label "G" is attached to the guest VM.
If enough logical drive letters are not available, the action to bring the disks online fails. In this case, detach the current disk, free up slots by detaching other self-service disks, and then attach the disk again.
7. Go to the attached disk label drive and restore the desired files.
For example, to remove the disk with disk label scsi0:3, type the command.
ngtcli> flr detach-disk attached-disk-label=scsi0:3
If the disk is not removed by the guest VM administrator, the disk is automatically removed after 24
hours.
9. To view all the attached disks to the VM, use the following command.
ngtcli> flr list-attached-disks
The attached disk does not come online automatically. Administrators must bring the disk online by using the Windows Disk Management utility.
Once the disk is attached, the guest VM administrator can restore the files from the attached disk.
• As the logs are forwarded from a Controller VM, the logs display the IP address of the Controller VM.
• You can only configure one rsyslog server; you cannot specify multiple servers.
• The remote syslog server is enabled by default.
• Supported transport protocols are TCP and UDP.
• rsyslog-config supports and can report messages from the following Nutanix modules:
Module name    With monitor logs disabled, these logs are forwarded    With monitor logs enabled, these logs are also forwarded

AOS log levels    Contain information from these syslog log levels
ERROR             ERROR
1. As the remote syslog server is enabled by default, disable it while you configure settings.
ncli> rsyslog-config set-status enable=false
2. Create a syslog server (which adds it to the cluster) and confirm it has been created.
ncli> rsyslog-config add-server name=remote_server_name ip-address=remote_ip_address
port=port_num network-protocol={tcp | udp}
ncli> rsyslog-config ls-servers
Name : remote_server_name
IP Address : remote_ip_address
Port : port_num
Protocol : TCP or UDP
remote_server_name    A descriptive name for the remote server receiving the specified messages
remote_ip_address     The remote server's IP address
port_num              Destination port number on the remote server
tcp | udp             Choose tcp or udp as the transport protocol
3. Choose a module to forward log information from and specify the level of information to collect.
ncli> rsyslog-config add-module server-name=remote_server_name module-name=module
level=loglevel include-monitor-logs={ false | true }
• (Optional) Set include-monitor-logs to specify whether the monitor logs are sent. It is enabled
(true) by default. If disabled (false), only certain logs are sent.
Note: If enabled, the include-monitor-logs option sends all monitor logs, regardless of the
level set by the level= parameter.
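For example, to forward only Cassandra error messages without the monitor logs, and then re-enable the rsyslog server once configuration is complete (the module name cassandra is illustrative; substitute the Nutanix module you want to forward):
ncli> rsyslog-config add-module server-name=remote_server_name module-name=cassandra \
level=ERROR include-monitor-logs=false
ncli> rsyslog-config set-status enable=true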
.FATAL Logs
If a component fails, it creates a log file named according to the following convention:
component-name.cvm-name.log.FATAL.date-timestamp
The first character indicates whether the log entry is an Info, Warning, Error, or Fatal. The next four
characters indicate the day on which the entry was made. For example, if an entry starts with F0820, it
means that at some time on August 20th, the component had a failure.
Tip: The cluster also creates .INFO and .WARNING log files for each component. Sometimes, the
information you need is stored in one of these files.
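For example, to list the .FATAL files for all components on a Controller VM, newest last:
nutanix@cvm$ ls -ltr /home/nutanix/data/logs/*.FATAL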
/home/nutanix/data/logs/cassandra
This is the directory where the Cassandra metadata database stores its logs. The Nutanix process that
starts the Cassandra database (cassandra_monitor) logs to the /home/nutanix/data/logs directory.
However, the most useful information relating to the Cassandra is found in the system.log* files located in
the /home/nutanix/data/logs/cassandra directory.
This directory contains the output for each of these commands, along with the corresponding timestamp.
Location: /home/nutanix/data/logs
curator.[out, ERROR, FATAL, INFO, WARNING] Metadata health and ILM activity
Location: /home/nutanix/data/logs/cassandra
Log Contents
Location: /home/nutanix/data/logs/sysstats
iostat.INFO    I/O activity for each physical disk, every 5 sec (sudo iostat)
lsof.INFO      List of open files and the processes that open them, every 1 min (sudo lsof)
Location: /home/nutanix/data/serviceability/alerts
Log Contents
num.processed Alerts that have been processed
Log Contents
When a process fails, the reason for the failure is recorded in the corresponding FATAL log. There are two
ways to correlate this log with the INFO file to get more information:
1. Search for the timestamp of the FATAL event in the corresponding INFO files.
c. Open the INFO file with vi and go to the bottom of the file (Shift+G).
d. Analyze the log entries immediately before the FATAL event, especially any errors or warnings.
In the following example, the latest stargate.FATAL determines the exact timestamp:
nutanix@cvm$ cat stargate.FATAL
In the above example, the timestamp is F0907 01:22:23, or September 7 at 1:22:23 AM.
Next, grep for this timestamp in the stargate*INFO* files:
nutanix@cvm$ grep "^F0907 01:22:23" stargate*INFO* | cut -f1 -d:
stargate.NTNX-12AM3K490006-2-CVM.nutanix.log.INFO.20130904-220129.7363
2. If a process is repeatedly failing, it might be faster to do a long listing of the INFO files and select
the one immediately preceding the current one. The current one would be the one referenced by the
symbolic link.
For example, in the output below, the last failure would be recorded in the file
stargate.NTNX-12AM3K490006-2-CVM.nutanix.log.INFO.20130904-220129.7363.
ls -ltr stargate*INFO*
Tip: You can use the procedure above for the other types of files as well (WARNING and
ERROR) in order to narrow the window of information. The INFO file provides all messages,
WARNING provides only warning, error, and fatal-level messages, ERROR provides only error
and fatal-level messages, and so on.
Stargate Logs
This section discusses common entries found in Stargate logs and what they mean.
The Stargate logs are located at /home/nutanix/data/logs/stargate.[INFO|WARNING|ERROR|FATAL].
This message is generic and can occur for a variety of reasons. While Stargate is initializing, a watchdog process monitors it to ensure a successful startup. If Stargate has trouble connecting to other components (such as Zeus or Pithos), the watchdog process stops it.
If Stargate is running, this indicates that the alarm handler thread is stuck for longer than 30 seconds. The
stoppage could be due to a variety of reasons, such as problems connecting to Zeus or accessing the
Cassandra database.
To analyze why the watchdog fired, first locate the relevant INFO file, and review the entries leading up to the failure.
This message indicates that Stargate is unable to communicate with Medusa. This may be due to a
network issue.
Analyze the ping logs and the Cassandra logs.
Log Entry: CAS failure seen while updating metadata for egroup egroupid or Backend returns error
'CAS Error' for extent group id: egroupid
This is a benign message and usually does not indicate a problem. This warning message means that
another Cassandra node has already updated the database for the same key.
Log Entry: Fail-fast after detecting hung stargate ops: Operation with id <opid> hung for 60secs
F0712 14:19:13.088392 30295 stargate.cc:912] Fail-fast after detecting hung stargate ops:
Operation with id 3859757 hung for 60secs
F0907 01:22:23.124495 10559 zeus.cc:1779] Timed out waiting for Zookeeper session
establishment
This message indicates that Stargate had 5 failed attempts to connect to Medusa/Cassandra.
Review the Cassandra log (cassandra/system.log) to see why Cassandra was unavailable.
Log Entry: multiget_slice() failed with error: error_code while reading n rows from cassandra_keyspace
Log Entry: Forwarding of request to NFS master ip:2009 failed with error kTimeout.
This message indicates that Stargate cannot connect to the NFS master on the node specified.
Review the Stargate logs on the node specified in the error.
Cassandra Logs
After analyzing Stargate logs, if you suspect an issue with Cassandra/Medusa, analyze the Cassandra
logs. This topic discusses common entries found in system.log and what they mean.
The Cassandra logs are located at /home/nutanix/data/logs/cassandra. The most recent file is named
system.log. When the file reaches a certain size, it rolls over to a sequentially numbered file (example,
system.log.1, system.log.2, and so on).
This is a common log entry and can be ignored. It is equivalent to the CAS errors in the stargate.ERROR
log. It simply means that another Cassandra node updated the keyspace first.
This message indicates that the node could not communicate with the Cassandra instance at the specified
IP address.
Either the Cassandra process is down (or failing) on that node or there are network connectivity issues.
Check the node for connectivity issues and Cassandra process restarts.
Log Entry: Caught Timeout exception while waiting for paxos read response from leader: x.x.x.x
This message indicates that the node encountered a timeout while waiting for the Paxos leader.
Either the Cassandra process is down (or failing) on that node or there are network connectivity issues.
Check the node for connectivity issues or for the Cassandra process restarts.
This section discusses common entries found in prism_gateway.log and what they mean. This log is
located on the Prism leader. The Prism leader is the node which is running the web server for the Nutanix
UI. This is the log to analyze if there are problems with the UI such as long loading times.
The Prism log is located at /home/nutanix/data/logs/prism_gateway.log on the Prism leader.
To identify the Prism leader, you can run cluster status | egrep "CVM|Prism" and determine which node
has the most processes. In the output below, 10.3.176.242 is the Prism leader.
nutanix@cvm$ cluster status | egrep "CVM|Prism"
The stats_aggregator component periodically issues an RPC request for all Nutanix vdisks in the cluster. It
is possible that all the ephemeral ports are exhausted.
If there are issues with connecting to the Nutanix UI, escalate the case and provide the
output of the ss -s command as well as the contents of prism_gateway.log.
Zookeeper Logs
Genesis Logs
When checking the status of the cluster services, if any of the services are down, or the Controller VM
is reporting Down with no process listing, review the log at /home/nutanix/data/logs/genesis.out to
determine why the service did not start, or why Genesis is not properly running.
Check the contents of genesis.out if a Controller VM reports multiple services as DOWN, or if the entire
Controller VM status is DOWN.
Like other component logs, genesis.out is a symbolic link to the latest genesis.out instance and has the
format genesis.out.date-timestamp.
An example of healthy output:
nutanix@cvm$ tail -F genesis.out
Under normal conditions, the genesis.out file logs the following messages periodically:
Unpublishing service Nutanix Controller
Publishing service Nutanix Controller
Zookeeper is running as [leader|follower]
Prior to these occasional messages, you should see Starting [n]th service. This is an indicator that all
services were successfully started. As of 4.1.3, there are 20 services.
Tip: You can filter out the INFO messages logged by Genesis by running the command:
grep -v -w INFO /home/nutanix/data/logs/genesis.out
Possible Errors
2013-09-09 16:20:01 ERROR rpc.py:303 Json Rpc request for unknown Rpc object NodeManager
2013-09-09 16:20:18 WARNING node_manager.py:2038 Could not load the local ESX configuration
Any of the above messages means that Genesis was unable to log on to the ESXi host using the
configured password.
1. Examine the contents of the genesis.out file and locate the stack trace (indicated by the CRITICAL
message type).
In the example above, the certificates in AuthorizedCerts.txt were not updated, which means that Genesis could not connect to the NutanixHostAgent service on the host.
Location: /var/log
Log Contents
Location: /vmfs/volumes/
Log Contents
NCC Output
Each NCC plugin is a test that completes independently of other plugins. Each test completes with one of
these status types.
PASS
The tested aspect of the cluster is healthy and no further action is required.
FAIL
The tested aspect of the cluster is not healthy and must be addressed.
WARN
The plugin returned an unexpected value and must be investigated.
INFO
The plugin returned an expected value; however, the result cannot be evaluated as PASS or FAIL.
Tip: Note the MD5 value of the file as published on the support portal.
• Some NCC versions include a single installer file (ncc_installer_filename.sh) that you can download and
run from any Controller VM.
• Some NCC versions include an installer file inside a compressed tar file (ncc_installer_filename.tar.gz)
that you must first extract, then run from any Controller VM.
1. Download the installation file to any Controller VM in the cluster and copy the installation file to the /home/nutanix directory.
2. Check the MD5 value of the file. It must match the MD5 value published on the support portal. If the
value does not match, delete the file and download it again from the support portal.
nutanix@cvm$ md5sum ./ncc_installer_filename.sh
3. Perform these steps for NCC versions that include a single installer file (ncc_installer_filename.sh)
b. Install NCC.
nutanix@cvm$ ./ncc_installer_filename.sh
NCC installation file logic tests the NCC tar file checksum and prevents installation if it detects file
corruption.
• If it verifies the file, the installation script installs NCC on each node in the cluster.
• If it detects file corruption, it prevents installation and deletes any extracted files. In this case,
download the file again from the Nutanix support portal.
4. Perform these steps for NCC versions that include an installer file inside a compressed tar file
(ncc_installer_filename.tar.gz).
Replace ncc_installer_filename.tar.gz with the name of the compressed installation tar file.
The --recursive-unlink option is needed to ensure old installs are completely removed.
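a. Extract the installation file; a typical invocation (this is the step that the --recursive-unlink note above refers to) is:
nutanix@cvm$ tar xvmf ncc_installer_filename.tar.gz --recursive-unlink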
b. Run the install script. Provide the installation tar file name if it has been moved or renamed.
nutanix@cvm$ ./ncc/bin/install.sh [-f install_file.tar]
The installation script copies the install_file.tar tar file to each node and installs NCC on each node in the cluster.
5. Check the output of the installation command for any error messages.
• If installation is successful, a Finished Installation message is displayed. You can check any
NCC-related messages in /home/nutanix/data/logs/ncc-output-latest.log.
• In some cases, output similar to the following is displayed. Depending on the NCC version installed, the installation file might log the output to /home/nutanix/data/logs/ or /home/nutanix/data/serviceability/ncc.
Copying file to all nodes [ DONE ]
-------------------------------------------------------------------------------+
+---------------+
| State | Count |
+---------------+
| Total | 1 |
+---------------+
Plugin output written to /home/nutanix/data/logs/ncc-output-latest.log
What to do next:
• As part of installation or upgrade, NCC automatically restarts the cluster health service on each node
in the cluster, so you might observe notifications or other slight anomalies as the service is being
restarted.
Note:
• If you are adding one or more nodes to expand your cluster, the latest version of NCC might not
be installed on each newly-added node. In this case, re-install NCC in the cluster after you have
finished adding the one or more nodes.
• This topic describes how to install NCC software from the Prism web console. To install NCC
from the command line, see Installing NCC from an Installer File on page 54.
If the check reports a status other than PASS, resolve the reported issues before proceeding. If you are
unable to resolve the issues, contact Nutanix support for assistance.
2. Do this step to download and automatically install the NCC upgrade files. To manually upgrade, see the
next step.
a. Log on to the Prism web console for any node in the cluster.
b. Click Upgrade Software from the gear icon in the Prism web console, then click NCC in the dialog
box.
3. Do this step to download and manually install the NCC upgrade files.
a. Log on to the Nutanix support portal and select Downloads > Tools & Firmware.
c. Log on to the Prism web console for any node in the cluster.
d. Click Upgrade Software from the gear icon in the Prism web console, then click NCC in the dialog
box.
f. Click Choose File for the NCC metadata and binary files, respectively, browse to the file locations,
and click Upload Now.
g. When the upload process is completed, click Upgrade > Upgrade Now, then click Yes to confirm.
The Upgrade Software dialog box shows the progress of your selection, including pre-installation
checks.
As part of installation or upgrade, NCC automatically restarts the cluster health service on each node
in the cluster, so you might observe notifications or other slight anomalies as the service is being
restarted.
NCC Usage
The general usage of NCC is as follows:
nutanix@cvm$ ncc ncc-flags module sub-module [...] plugin plugin-flags
Typing ncc with no arguments yields a table listing the modules that can be run. The Type column distinguishes between modules (M) and plugins (P). The Impact tag identifies a plugin as intrusive or non-intrusive. By default, only non-intrusive checks are used when a module is run with the run_all plugin.
nutanix@cvm$ ncc
+-----------------------------------------------------------------------------+
| Type | Name | Impact | Short help |
+-----------------------------------------------------------------------------+
| M | cassandra_tools | N/A | Plugins to help with Cassandra ring analysis |
| M | file_utils | N/A | Utilities for manipulating files on the |
| | | | cluster |
| M | health_checks | N/A | All health checks |
| M | info | N/A | Contains all info modules (legacy |
| | | | health_check.py) |
+-----------------------------------------------------------------------------+
The usage table is displayed for any module specified on the command line. Specifying a plugin runs its
associated checks.
nutanix@cvm$ ncc info
+------------------------------------------------------------------------------+
| Type | Name | Impact | Short help |
+------------------------------------------------------------------------------+
| M | cluster_info | N/A | Displays summary of info about this cluster. |
| M | cvm_info | N/A | Displays summary of info about this CVM. |
| M | esx_info | N/A | Displays summary of info about the local esx |
| | | | host. |
| M | ipmi_info | N/A | Displays summary of info about the local IPMI.|
| P | run_all | N/A | Run all the plugins in this module |
+------------------------------------------------------------------------------+
+------------------------------------------------------------------------------+
| Type | Name | Impact | Short help |
+------------------------------------------------------------------------------+
| P | file_copy | Non-Intrusive | Copies a local file to all CVMs. |
| P | remove_old_cores | Non-Intrusive | Removing cores older than 30 days |
| P | remove_old_fatals | Non-Intrusive | Removing fatals older than 90 days|
| P | run_all | N/A | Run all the plugins in this module|
+------------------------------------------------------------------------------+
Usage Examples
• Run the NCC with a named output file and a non-standard path for ipmitool.
nutanix@cvm$ ncc --ncc_plugin_output_history_file=ncc.out health_checks \
hardware_checks ipmi_checks run_all --ipmitool_path /usr/bin/ipmitool
Note: The flags override the default configurations of the NCC modules and plugins. Do not
run with these flags unless your cluster configuration requires these modifications.
Diagnostics VMs
Nutanix provides a diagnostics capability to allow partners and customers to run performance tests on the
cluster. This is a useful tool in pre-sales demonstrations of the cluster and while identifying the source of
performance issues in a production cluster. Diagnostics should also be run as part of setup to ensure that
the cluster is running properly before the customer takes ownership of the cluster.
The diagnostic utility deploys a VM on each node in the cluster. The Controller VMs control the diagnostic
VM on their hosts and report back the results to a single system.
The diagnostics test provides the following data:
• Sequential write bandwidth
• Sequential read bandwidth
• Random read IOPS
• Random write IOPS
Because the test creates new cluster entities, it is necessary to run a cleanup script when you are finished.
(vSphere only) In vCenter, right-click any diagnostic VMs labeled as "orphaned", select Remove from
Inventory, and click Yes to confirm removal.
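The run and cleanup operations are invoked from a Controller VM; typical invocations (the script path shown is the usual location but may vary by AOS version) are:
nutanix@cvm$ diagnostics/diagnostics.py run
nutanix@cvm$ diagnostics/diagnostics.py cleanup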
Diagnostics Output
System output similar to the following indicates a successful test.
Note:
• Expected results vary based on the specific AOS version and hardware model used.
• The IOPS values reported by the diagnostics script are higher than the values reported by the Nutanix management interfaces. This difference is because the diagnostics script reports physical disk I/O, and the management interfaces show IOPS reported by the hypervisor.
• If the reported values are lower than expected, the 10 GbE ports may not be active. For more
information about this known issue with ESXi 5.0 update 1, see VMware KB article 2030006.
Syscheck Utility
Syscheck is a tool that runs load on a cluster and evaluate its performance characteristics. This tool
provides pass or fail feedback on all the checks. The current checks are network throughput and direct
disk random write performance. Syscheck tracks the tests on a per node basis and prints the result at the
conclusion of the test.
After you execute the command, a message that lists all the considerations for running this test is displayed. When prompted, type yes to run the check.
The test returns either a pass or a fail result. The latest result is placed in the /home/nutanix/data/syscheck directory. An output tar file is also placed in the /home/nutanix/data/ directory each time you run this utility.
Platform Default
The following tables show the minimum memory and vCPU requirements and recommendations for the Controller VM on each node for platforms that do not follow the default.
Nutanix Platforms
Dell Platforms
Lenovo Platforms
The following table lists the minimum amount of memory required when enabling features. The memory size requirements are in addition to the default or recommended memory available for your platform (Nutanix, Dell, Lenovo) as described in Controller VM Memory Configurations for Base Models. The additional memory required for enabled features cannot exceed 16 GB.
Note: Default or recommended platform memory + memory required for each enabled feature =
total Controller VM Memory required
Platform Default