
Monitoring HP BladeSystem servers

HP BladeSystem servers are a different breed compared with their DL, ML or even BL siblings: among other things, their management is not based on iLO but on the Onboard Administrator (OA). iLO supports the great RIBCL protocol, which is by far the best option for monitoring HP servers: it is XML-based, and therefore easily parseable, and it is native (no need to install SNMP daemons on our servers). Sadly, there is no equivalent to RIBCL in Onboard Administrator. It supports a telnet/SSH command interpreter, but parsing the output of a facility addressed to human administrators instead of machines is more than tricky: bet that the output format of the command you parse will change in the next firmware revision. It's true that the blades contained in a BladeSystem enclosure, since they are considered servers, do support iLO, but the output you get when you submit a RIBCL command is not 100% real: for instance, a single virtual fan is shown to represent all the fans available in the enclosure, and something similar happens with power supplies. What blade servers publish via RIBCL is an abstraction of the enclosure's reality.

SNMP is the answer

So the only option for fine-grained monitoring of a BladeSystem is SNMP. HP BladeSystem c3000 and c7000 enclosures support the CPQRACK-MIB MIB (1.3.6.1.4.1.232.22), which stores interesting information for monitoring the system's health:

- The enclosure itself, by polling the table cpqRackCommonEnclosureTable (CPQRACK-MIB.2.3.1.1)
- Enclosure manager (the Onboard Administrators themselves) information is located in the table cpqRackCommonEnclosureManagerTable (CPQRACK-MIB.2.3.1.6)
- Temperature data can be found in the table cpqRackCommonEnclosureTempTable (CPQRACK-MIB.2.3.1.2)
- Fan info is located in the table cpqRackCommonEnclosureFanTable (CPQRACK-MIB.2.3.1.3)
- Fuses are represented in the table cpqRackCommonEnclosureFuseTable (CPQRACK-MIB.2.3.1.4)
- FRU (Field Replaceable Unit) information is stored in the table cpqRackCommonEnclosureFruTable (CPQRACK-MIB.2.3.1.5)
- Power systems (global and power supply specific) can be monitored by polling the tables cpqRackPowerEnclosureTable (CPQRACK-MIB.2.3.3.1) and cpqRackPowerSupplyTable (CPQRACK-MIB.2.5.1.1)
- Blade information is stored in the table cpqRackServerBladeTable (CPQRACK-MIB.2.4.1.1)
- Finally, network I/O subsystems can be polled via the table cpqRackNetConnectorTable (CPQRACK-MIB.2.6.1.1)

A quick way to confirm that your Onboard Administrator exposes these tables is to walk one of them, as sketched right after this list.
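The snippet below is a minimal sketch of such a walk, assuming the pysnmp library (classic synchronous hlapi, 4.x-style) is installed; the hostname oa.example.com and the "public" community string are placeholders for your own OA settings.

from pysnmp.hlapi import (
    nextCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

# cpqRackCommonEnclosureTable = CPQRACK-MIB.2.3.1.1, in numeric form
ENCLOSURE_TABLE = '1.3.6.1.4.1.232.22.2.3.1.1'

for err_indication, err_status, err_index, var_binds in nextCmd(
        SnmpEngine(),
        CommunityData('public', mpModel=1),           # SNMP v2c, community "public"
        UdpTransportTarget(('oa.example.com', 161)),  # hypothetical OA address
        ContextData(),
        ObjectType(ObjectIdentity(ENCLOSURE_TABLE)),
        lexicographicMode=False):                     # stop at the end of the subtree
    if err_indication:
        raise SystemExit(err_indication)
    if err_status:
        raise SystemExit(err_status.prettyPrint())
    for oid, value in var_binds:
        print(oid.prettyPrint(), '=', value.prettyPrint())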

MIB in detail

All of these tables store per-item working status and levels, which is exactly what a monitoring system needs to build a picture of the status and performance of a blade system:

- cpqRackCommonEnclosureCondition (cpqRackCommonEnclosureTable.1.16) stores the status of the whole enclosure: OK (2), degraded (3), failed (4) or other (1).
- cpqRackCommonEnclosureManagerCondition (cpqRackCommonEnclosureManagerTable.1.12) stores the status of each manager: OK (2), degraded (3), failed (4) or other (1).
- cpqRackCommonEnclosureManagerRedundant (cpqRackCommonEnclosureManagerTable.1.11) stores the manager redundancy status: redundant (3), notRedundant (2) or other (1).
- cpqRackCommonEnclosureTempCondition (cpqRackCommonEnclosureTempTable.1.8) states the temperature condition of a single sensor: OK (2), degraded (3), failed (4) or other (1). You can get the actual temperature value (in Celsius) from cpqRackCommonEnclosureTempCurrent (cpqRackCommonEnclosureTempTable.1.6) and its factory threshold from cpqRackCommonEnclosureTempThreshold (cpqRackCommonEnclosureTempTable.1.7).
- cpqRackCommonEnclosureFanCondition (cpqRackCommonEnclosureFanTable.1.11) returns a single fan's status: OK (2), degraded (3), failed (4) or other (1).
- cpqRackCommonEnclosureFanRedundant (cpqRackCommonEnclosureFanTable.1.9) returns whether a fan is in a redundant configuration: redundant (3), notRedundant (2) or other (1).
- cpqRackCommonEnclosureFuseCondition (cpqRackCommonEnclosureFuseTable.1.7) stores the condition of a single fuse: OK (2), failed (4) or other (1).
- cpqRackPowerEnclosureCondition (cpqRackPowerEnclosureTable.1.9) stores the overall power system status: OK (2), degraded (3) or other (1).
- cpqRackPowerSupplyCondition (cpqRackPowerSupplyTable.1.17) returns the working condition of a single power supply: OK (2), degraded (3), failed (4) or other (1). If you like LOTS of detail, cpqRackPowerSupplyStatus (cpqRackPowerSupplyTable.1.14) stores the precise status of the element: noError (1), generalFailure (2), bistFailure (3), fanFailure (4), tempFailure (5), interlockOpen (6), epromFailed (7), vrefFailed (8), dacFailed (9), ramTestFailed (10), voltageChannelFailed (11), orringdiodeFailed (12), brownOut (13), giveupOnStartup (14), nvramInvalid (15) or calibrationTableInvalid (16).
- cpqRackServerBladeStatus (cpqRackServerBladeTable.1.21) returns the status of a single blade: OK (2), degraded (3), failed (4) or other (1).
- cpqRackServerBladePowered (cpqRackServerBladeTable.1.25) returns the power state of a single blade: on (2), off (3), powerStaggedOff (4), rebooting (5) or other (1).

A sketch showing how to turn these integer codes into readable states follows this list.
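Since nearly every condition column uses the same other/ok/degraded/failed encoding, a single lookup table covers them all. Below is a minimal sketch, under the same pysnmp and connection assumptions as before, that reads cpqRackCommonEnclosureCondition for enclosure index 1; the full numeric OID is derived from the table numbers above (entry = table.1, then column, then index) and is worth double-checking against your copy of the MIB.

from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

CONDITION = {1: 'other', 2: 'ok', 3: 'degraded', 4: 'failed'}

# cpqRackCommonEnclosureCondition = CPQRACK-MIB.2.3.1.1.1.16; the trailing
# ".1" is the index of the first (usually only) enclosure in the table.
ENCLOSURE_CONDITION = '1.3.6.1.4.1.232.22.2.3.1.1.1.16.1'

err_indication, err_status, err_index, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData('public', mpModel=1),           # SNMP v2c
    UdpTransportTarget(('oa.example.com', 161)),  # hypothetical OA address
    ContextData(),
    ObjectType(ObjectIdentity(ENCLOSURE_CONDITION))))

if err_indication or err_status:
    raise SystemExit(err_indication or err_status.prettyPrint())

value = var_binds[0][1]
print('enclosure condition:', CONDITION.get(int(value), 'unknown'))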

Using traps

Maybe you are an experienced monitoring technician and you rule out polling data continuously because you prefer to manage the BladeSystem status based on SNMP traps (the truth is that plotting fan speeds and temperatures is cool, but impractical). If you choose this approach, focus on handling at least these traps. All of them derive from cpqHoGenericTrap (.1.3.6.1.4.1.232.0), defined in CPQHOST-MIB (and inherited by CPQRACK-MIB):

- Managers: cpqRackEnclosureManagerDegraded (cpqHoGenericTrap.22037), cpqRackEnclosureManagerOk (cpqHoGenericTrap.22038)
- Temperatures: cpqRackEnclosureTempFailed (cpqHoGenericTrap.22005), cpqRackEnclosureTempDegraded (cpqHoGenericTrap.22006), cpqRackEnclosureTempOk (cpqHoGenericTrap.22007)
- Fans: cpqRackEnclosureFanFailed (cpqHoGenericTrap.22008), cpqRackEnclosureFanDegraded (cpqHoGenericTrap.22009), cpqRackEnclosureFanOk (cpqHoGenericTrap.22010)
- Power supplies: cpqRackPowerSupplyFailed (cpqHoGenericTrap.22013), cpqRackPowerSupplyDegraded (cpqHoGenericTrap.22014), cpqRackPowerSupplyOk (cpqHoGenericTrap.22015)
- Power system: cpqRackPowerSubsystemNotRedundant (cpqHoGenericTrap.22018), cpqRackPowerSubsystemLineVoltageProblem (cpqHoGenericTrap.22019), cpqRackPowerSubsystemOverloadCondition (cpqHoGenericTrap.22020)
- Blades: cpqRackServerBladeStatusRepaired (cpqHoGenericTrap.22052), cpqRackServerBladeStatusDegraded (cpqHoGenericTrap.22053), cpqRackServerBladeStatusCritical (cpqHoGenericTrap.22054)
- Network I/O subsystem: cpqRackNetConnectorFailed (cpqHoGenericTrap.22046), cpqRackNetConnectorDegraded (cpqHoGenericTrap.22047), cpqRackNetConnectorOk (cpqHoGenericTrap.22048)

A sketch of a trap handler that classifies these trap numbers follows the list.
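With Net-SNMP's snmptrapd, the usual way to react to traps is a traphandle script: snmptrapd writes the sender's hostname, its transport address, and then one "OID value" pair per line to the handler's stdin. Below is a minimal sketch of such a handler; the severity mapping is my own reading of the trap names above, and the installation path in the comment is hypothetical.

#!/usr/bin/env python3
# Minimal snmptrapd traphandle sketch; wire it up in snmptrapd.conf with
# a line such as:  traphandle default /usr/local/bin/cpqrack_traps.py
# (path hypothetical).
import sys

# Severity per trap number, derived from the trap names above.
SEVERITY = {
    22005: 'CRITICAL', 22006: 'WARNING', 22007: 'OK',        # temperatures
    22008: 'CRITICAL', 22009: 'WARNING', 22010: 'OK',        # fans
    22013: 'CRITICAL', 22014: 'WARNING', 22015: 'OK',        # power supplies
    22018: 'WARNING',  22019: 'WARNING', 22020: 'CRITICAL',  # power system
    22037: 'WARNING',  22038: 'OK',                          # managers
    22046: 'CRITICAL', 22047: 'WARNING', 22048: 'OK',        # net connectors
    22052: 'OK',       22053: 'WARNING', 22054: 'CRITICAL',  # blades
}

lines = sys.stdin.read().splitlines()
host, address = lines[0], lines[1]

for line in lines[2:]:
    oid, _, value = line.partition(' ')
    # snmpTrapOID.0 (1.3.6.1.6.3.1.1.4.1.0) carries the trap identity,
    # e.g. ...enterprises.232.0.22054; its last arc is the trap number.
    if '1.3.6.1.6.3.1.1.4.1.0' in oid or 'snmpTrapOID' in oid:
        trap_number = int(value.rsplit('.', 1)[-1])
        severity = SEVERITY.get(trap_number, 'UNKNOWN')
        print('%s (%s): trap %d -> %s' % (host, address, trap_number, severity))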

Getting the MIB itself

You can browse CPQRACK-MIB in different places, but be warned that it is not always shown in its latest version: mibdepot, for instance, doesn't cover the all-important cpqRackServerBladeStatus field in the cpqRackServerBladeTable blade table, which defines the status of a blade. If you need the CPQRACK-MIB file itself, you can download it from Plixer.

Monitoring BladeSystem servers in Nagios

If you are a practical guy, or you feel too lazy to do any programming, I recommend Trond H. Amundsen's check_hp_bladechassis Nagios plugin. It is based on polling the tables above and is able to generate performance data. A sketch of wiring it into Nagios follows.
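As an illustration, here is a minimal sketch of the Nagios object definitions that could drive the plugin. The host name, plugin path and option spellings are assumptions on my part; check the plugin's documentation for the exact flags it supports.

define command {
    command_name  check_hp_bladechassis
    # -H (host) and -C (community) flags are assumed; verify against the plugin's help
    command_line  $USER1$/check_hp_bladechassis -H $HOSTADDRESS$ -C $ARG1$
}

define service {
    use                  generic-service      ; assumes a generic-service template exists
    host_name            c7000-oa             ; hypothetical host object for the OA
    service_description  Blade chassis health
    check_command        check_hp_bladechassis!public
}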

http://monitoringtt.blogspot.com/2012/11/monitoring-hp-bladesystem-devices.html
