Who am I?
Who are you?
• Name
• What you do
• Experience with Opsview
• What you are most interested in
Ton Voon is the Product Architect for Opsview and is the main person in charge of the design
and scope of Opsview. He has been involved in the development of Opsview since 2005.
The main documentation site is http://docs.opsview.org/doku.php?id=opsview-community
Aims of training course
Active checks are "polling checks". Try to use active checks because when things are fixed, the
service will automatically change status at the next polling interval.
Examples of passive checks: a backup start/finish message; a link up or link down from an
interface; or entries in a log file.
Passive checks need to be reset, otherwise the state stays the same. You can:
- submit a result via the contextual menu for the service. This will send the result to the slave (if
appropriate), which will process it and send the result back up
- configure the service check so that after a defined interval, it will auto-reset the state
You can also submit a result to test failure scenarios.
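The reset options above can also be exercised programmatically: Nagios accepts passive results as external commands written to its command pipe. A minimal sketch of building such a command line (the host and service names are illustrative; the command-pipe path in the comment is the usual Opsview default):

```python
import time

# Nagios return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def passive_result(host, service, code, output, ts=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    On a live system this line would be written to the Nagios command
    pipe (usually /usr/local/nagios/var/rw/nagios.cmd).
    """
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, code, output)

# Reset a stuck passive check back to OK:
print(passive_result("web01", "Backup", OK, "backup finished", ts=1234567890))
```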
State changes
When a service transitions from an OK state to a failed state, the check attempt will be 1. This
increments for each subsequent failure. When the max check attempts is reached, the state is
considered a hard state.
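The attempt counting above can be sketched as a small simulation (a simplification: real Nagios bookkeeping also handles soft recoveries and retry intervals):

```python
def classify(results, max_check_attempts=3):
    """Walk a sequence of check results (True = OK, False = failed) and
    report (state, attempt, hardness) after each one. OK results are
    shown with attempt 1 for simplicity."""
    out = []
    attempt = 0
    for ok in results:
        if ok:
            attempt = 0
            out.append(("OK", 1, "hard"))
        else:
            # attempt is 1 on the first failure, then increments
            attempt = min(attempt + 1, max_check_attempts)
            hardness = "hard" if attempt >= max_check_attempts else "soft"
            out.append(("CRIT", attempt, hardness))
    return out

print(classify([True, False, False, False, False]))
```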
Hostgroup Hierarchy always shows soft states, so it always shows the most recent state information.
Beware of ignoring soft states - there may be a transient problem that needs resolving (usually load related).
Notifications
Notifications are only sent on hard state changes which means a delay is introduced. You can
set max check attempts to 1 if you want to be notified straight away.
Opsview by default supports 3 different notification methods:
Emails (this will require setting up a mail system on the Opsview server)
SMS (need to have either an SMS gateway or a GSM modem attached)
RSS
See: http://docs.opsview.org/doku.php?id=opsview-community:notificationmethods for more
information.
If you want to overcome the limitations of slaves, you can send notifications from the master.
A "recovery" state is when a host or service transitions from a hard failure back to OK.
Flapping
The flapping value is calculated from the number of state changes that have occurred across the
last 21 recorded states for a service.
There is a high flapping threshold (when something goes into a flapping state) and a low flapping
threshold (when something comes out of a flapping state).
We use the default values of high: 30% and low: 20%.
You can disable flap detection for each servicecheck. The default is to set this on.
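As a sketch of the calculation, the percentage of state changes across the last 21 states can be compared against the two thresholds with hysteresis (Nagios actually weights recent state changes more heavily than older ones; this flat average is a simplification):

```python
def flap_percent(states):
    """Percentage of state changes across the last 21 recorded states."""
    recent = states[-21:]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return 100.0 * changes / (len(recent) - 1)

def update_flapping(currently_flapping, pct, high=30.0, low=20.0):
    """Hysteresis: start flapping above the high threshold, stop only
    below the low threshold; in between, keep the current state."""
    if pct >= high:
        return True
    if pct <= low:
        return False
    return currently_flapping

# A service alternating OK/CRIT on every check is 100% flapping:
print(flap_percent(["OK", "CRIT"] * 11))
```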
Parent/child relationships
If a host is marked as down, then its parents are checked to determine the network reachability.
Performance graphing
RRD stands for Round Robin Database. It is a fixed size database and very fast, but it loses
resolution over time.
If your plugin returns valid performance data, the database will be automatically updated.
There is a default resolution for all RRDs:
• 5 minutes is the smallest resolution
• 50 hours averaged over 5 minutes
• 14 days averaged over 30 minutes
• 2 months averaged over 2 hours
• 2 years averaged over 1 day
This produces an RRD file which is 24K in size.
The service check must run at least once an hour, otherwise RRD will record no data for that period.
Gauge is the automatic default data type. To record counter values, suffix the value with “c”.
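The retentions above translate directly into row counts per round-robin archive. A quick sketch, assuming a 5-minute base step (the month lengths are approximations):

```python
# Rows needed in each round-robin archive for the retentions above:
# (name, retention span in seconds, resolution in seconds)
archives = [
    ("50 hours @ 5 min",   50 * 3600,       300),
    ("14 days @ 30 min",   14 * 86400,      1800),
    ("2 months @ 2 hours", 62 * 86400,      7200),
    ("2 years @ 1 day",    2 * 365 * 86400, 86400),
]
for name, span, resolution in archives:
    print("%-20s %5d rows" % (name, span // resolution))
```

At 8 bytes per stored value, 600 + 672 + 744 + 730 = 2746 rows is roughly 21.5K of data, consistent with the 24K file size quoted above once the file header is included.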
Checkpoint
Opsview has 4 types of service checks: active, snmp polling, passive and snmp traps. The first
two are active checks and the last two are passive.
You could get fractional values for data points that are only integers, because of the constant
averaging in the RRD.
Setting max check attempts to 1 is useful for passive checks, and also for services you want to be
alerted on immediately.
Answer!
(diagram: a timeline of check results moving from OK into CRIT; the first CRIT results are soft
states while the check attempt increments, and the state becomes hard once max check attempts
is reached)
One of the main features of Opsview is the handling of distributed system. Opsview handles:
* installing the Opsview software on the slave
* upgrading Opsview software when master is upgraded
* synchronising /usr/local/nagios/libexec
* generating configuration for master and slave
* single point of control from the master web UI
The technology used to send results back to the master is Nagios Service Check Acceptor
(NSCA).
You can enable the web interface on the slave server, so that you can see standard Nagios
screens from the slave.
"Freshness checking", the Nagios concept of expecting results within a certain timeframe, has an
additional 30 minute window before it starts to mark services into a different state.
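A sketch of how that window extends the configured threshold (the exact internals may differ; all values are in seconds):

```python
def is_stale(last_result_age, freshness_threshold, extra_window=30 * 60):
    """A passive result is treated as stale once its age exceeds the
    configured freshness threshold plus the additional 30-minute window
    described above."""
    return last_result_age > freshness_threshold + extra_window

# A 15-minute threshold effectively becomes 45 minutes:
print(is_stale(3600, 900))   # an hour-old result is stale
print(is_stale(1000, 900))   # under 45 minutes old: not yet stale
```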
There is a "Slave-node: {hostname}" service that will be automatically created that monitors the
slave.
http://docs.opsview.org/doku.php?id=opsview-community:slavesetup
Distributed slaves, part 2
Slaves need to be the same architecture as the master because Opsview master sends all files
in /usr/local/nagios/bin, including the nagios executable, to the slaves. This will include
architecture specific files.
Plugin output from a slave to the master is limited to the first 511 bytes or the first line, whichever
comes first. This is due to the transport mechanism used.
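The effect of that limit can be sketched as follows (an illustration of the behaviour, not the actual transport code):

```python
def truncate_for_nsca(output, limit=511):
    """Mimic the transport limit: keep only the first line of plugin
    output, then at most `limit` bytes of it."""
    first_line = output.split("\n", 1)[0]
    return first_line.encode()[:limit].decode(errors="ignore")

print(truncate_for_nsca("DISK OK - 21% free\nlong perf details..."))
```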
Opsview uses NSCA for sending results from the slaves to the master. If it fails to send the
check results, it will drop them, so you could lose results.
Time must be synchronised between master and slave. The automatically created
check_opsview_slave_node check verifies that the clocks are within 5 seconds of each other.
The timezone does not have to match, but we recommend it is set the same as the master.
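The 5-second tolerance amounts to a simple comparison (a sketch, not the actual check implementation):

```python
def clocks_in_sync(master_epoch, slave_epoch, tolerance=5):
    """True if master and slave clocks agree to within `tolerance`
    seconds, as the slave-node check requires."""
    return abs(master_epoch - slave_epoch) <= tolerance

print(clocks_in_sync(1000, 1004))  # within tolerance
print(clocks_in_sync(1000, 1006))  # skewed too far
```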
Software level clustering for slaves is provided in Opsview. You can use your own virtual
machines to provide failover and redundancy if desired.
Synchronisation of states
A limitation of Opsview prior to 3.5.2 was that if you moved a host or service that was
acknowledged from one slave to another, the new slave would not know about the
acknowledgement and thus notifications from the slave would be sent.
From Opsview 3.5.2 onwards, the state of the slave is synchronised with the master as long as
the last_updated field on the slave is older than the masterʼs view. This occurs at Opsview
reload time and at cluster takeover time.
Distributed Architecture Diagram
(diagram: web clients connect over HTTP (port 80) to the Opsview Master; the master connects
over ssh (port 22) to Slave A in Datacenter 1 and to the clustered slaves B1, B2 and B3 in
Datacenter 2; slaves send results back via nsca and optionally expose their own web interface on
HTTP port 80; slaves monitor hosts using NRPE, SNMP or check_by_ssh)
All communication between master and slave is over port 22 (SSH). This is usually initiated from
the master to the slave. It is possible to do a “reverse SSH”, so the slave initiates an SSH tunnel
from the slave to the master which the master will then use to connect to slaves.
Setup of slave requires exchanging SSH keys.
Cluster nodes require an ssh key exchange between each other so they can share their state
information between themselves.
You can have a different number of slave cluster nodes for each slave system.
Clustered slaves
(diagram: the Opsview Master is connected to a slave system of three cluster nodes, Slave-node
A, B and C, which together monitor hosts 1 to 6)
The Opsview Master has 6 hosts which are being monitored by the slave system. There are 3
nodes in the slave.
At Opsview reload time, the 6 hosts are split across all the nodes (based on an algorithm called
Set::Cluster on CPAN - http://search.cpan.org/dist/Set-Cluster/ ).
The “Cluster-node” services are automatically generated to look at every other node in that slave
system. This requires SSH access between the nodes.
Each cluster node service has a list of hosts it will takeover in the case of the specific node
failure - this is calculated at reload time (and not dynamically).
Clustered slaves, part 2
• MRTG data stored locally: the first node in the cluster will poll devices
• NMIS data stored locally: will rsync with other cluster nodes
• Every 15 minutes, the status information for each node is sent to every other node (for
synchronisation at takeover)
• Limitation: single node failure only
• Limitation: slave clusters should be in the same network segment
The NMIS data is rsyncʼd between cluster nodes every hour (based on nagios userʼs crontab)
Checkpoint
There is no software limit to the number of slaves, but there is a cost as reload times increase.
For 2 cluster nodes, 2 checks are automatically setup, each monitoring the other.
For 3 cluster nodes, 6 checks would be setup.
For 4, 12 checks are setup.
The formula is: N x (N-1) where N is the number of nodes in a slave
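The formula can be confirmed with a one-liner:

```python
def cluster_node_checks(n):
    """Each node checks every other node: N x (N - 1) checks."""
    return n * (n - 1)

for n in (2, 3, 4):
    print("%d nodes -> %d checks" % (n, cluster_node_checks(n)))
```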
If two nodes failed, some of the hosts/services would be marked as stale.
Distributed slaves setup
process-cache-data uses send_nsca to transport results back to the master. This has a limit of
511 characters in the output.
Troubleshooting
We use the term Modules for functionality that is “loosely coupled” to Opsview but we still
provide integration with it.
Integration:
• Apache configuration
• Authentication
• Host group hierarchy when choosing host groups
Because Nagvis is PHP5 based, Opsview delegates the PHP5 page rendering to Apache, by
disabling proxying through to the Opsview Web application.
When configuring Apache, you can use the auth ticket method for authenticating. This means
that users of Opsview can access Nagvis using this authentication ticket seamlessly. If a user
tries to access nagvis without this ticket, they will be redirected back to the Opsview login
screen.
Beware of Nagvis access controls! Maps can be assigned to EVERYONE for view and
EVERYONE for edit. This means any authenticated user could edit your maps. Also, since the
maps can be edited, this means it is possible to get a drop down list of all the hosts and services
on your system! You can overcome this by making sure you only have named users for edit.
MRTG and NMIS
• Write it
• Run it on the command line
• Have a -h option to print out help text
• Drop it into /usr/local/nagios/libexec
• It will automatically be available in the Opsview servicecheck page
• Recommendation: plugins should return a short amount of data on one line
You can create a plugin in any language as long as it is executable by the nagios user.
We will create an example plugin that just returns OK
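Since a plugin can be written in any language, here is a minimal sketch in Python of a plugin that always returns OK (the plugin name and message text are illustrative):

```python
"""check_always_ok - a minimal example plugin that always reports OK."""
import sys

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main(argv):
    """Print one short line of status and return the exit code."""
    if "-h" in argv:
        print("Usage: check_always_ok [-h]")
        print("Example plugin that always reports OK.")
        return UNKNOWN
    # A real plugin would examine something here; this one always passes.
    print("OK - example plugin ran successfully")
    return OK

# A real plugin would finish with:
#   if __name__ == "__main__":
#       sys.exit(main(sys.argv[1:]))
```

Nagios interprets the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.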
You may not need to write your own plugin, if an existing one can handle the checking for you.
For instance, instead of a dedicated virus update plugin, just use check_file_age to test that the
virus definition file is up to date.
You could also use a "proxy" method for getting results. For instance, query a database to get
results for a test (say, number of sessions in your web application, balance of a test account)
FOSDEM example plugin: http://nagiosplugins.org/fosdem
Creating plugins easily
Use Nagios::Plugin
Distributed with Opsview Agents, installed in /usr/local/nagios/perl/lib
Start plugin with:
• use FindBin;
• use lib "$FindBin::Bin/../perl/lib";
• use Nagios::Plugin;
More documentation:
• http://search.cpan.org/dist/Nagios-Plugin/lib/Nagios/Plugin.pm
Opsera have been very active in using NDOutils and updating the software.
Nagios has started to move data into a database structure - this is the project called NDOutils
(Nagios Data Objects).
Opsview uses NDOutils, and has been actively promoting and updating the software, but
recognises there are some limitations in how the data is represented in NDOutils. For instance,
recording a result in NDOutils takes about 1K per result, whereas in ODW it takes about
250 bytes.
Opsview Data Warehouse
If you change the crontab entry, be aware that an upgrade will revert the crontab back again.
Be aware that the tables used by ODW are MyISAM tables which do table level locking. If you
have a long running query, it may lock up the import process, but the import process will
continue when tables are unlocked again.
If you do a lot of reporting, you may want to consider setting up replication of the ODW database
onto a different mysql server where you can run the reports without affecting the main Opsview
instance.
Architecture
The PDF reports are retained in Opsview Core, but are being phased out due to complexity of
code and poor functionality.
The new reports infrastructure uses Jasper Reports for its base technology, and includes several
report types.
API
/usr/local/nagios/etc
• opsview.conf and opsview.defaults
• map.local and map
/usr/local/opsview-web
• opsview_web_local.yml and opsview_web.yml
/usr/local/nagios/share/stylesheets
• custom.css
/usr/local/nagios/nagvis/etc
• nagvis.ini.php
/usr/local/nagios/nmis/conf
• nmis.conf
opsview.defaults is a shipped file and is subject to change. If there are variables you need to
override, copy them into opsview.conf and amend them there. opsview.conf will not usually be
changed by an upgrade.
Similarly for the map and the map.local file. However, you shouldnʼt need to use the map file as
correctly formatted performance data will be automatically graphed.
The nagios configuration files are regenerated every time.
Backup and recovery
You will have to design your own backup strategy for long term archival
Long term archival is not handled as part of the backup script because of the amount of data that
could be in ODW and Runtime.
For more information about backups: http://docs.opsview.org/doku.php?id=opsview-community:backups
Data to consider backing up
ODW
Runtime
MRTG
NMIS
Slaves
• MRTG
• NMIS
• Nagios logs
Opsview uses Parallel::Forker for the workflow management. There is a limit set to only run 4
concurrent jobs at once - you can increase this if you have more CPUs on your master server.
Reload troubleshooting
/usr/local/nagios/var/log/create_and_send_configs.debug
• Last reload process workflow
Opsview uses Log4perl for logging and you can set location and rotation information. However,
this file is overwritten as part of an upgrade
Opsview not running
You must use su - nagios. The dash means to pick up some environment variables required by
Nagios.
Errors relating to browser