Who am I?
Who are you?
• Name
• What you do
• Experience with Opsview
• What you are most interested in
Ton Voon is the Product Architect for Opsview and is the main person in charge of the design
and scope of Opsview. He has been involved in the development of Opsview since 2005.
The main documentation site is http://docs.opsview.org/doku.php?id=opsview-community
Aims of training course
Active checks are "polling checks". Try to use active checks because when things are fixed, the
service will automatically change status at the next polling interval.
Examples of passive checks: a backup start/finish message; a link up or link down from an
interface; or entries in a log file.
Passive checks need to be reset, otherwise the state stays the same. You can:
- submit a result via the contextual menu for the service. This will send the result to the slave (if
appropriate), which will process it and send the result back up
- configure the service check so that after a defined interval, it will auto-reset the state
You can also submit a result to test failure scenarios.
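The reset options above can also be exercised programmatically: Nagios accepts passive results as external commands written to its command pipe. A minimal sketch of building such a command line (the host and service names are illustrative; the command-pipe path in the comment is the usual Opsview default):

```python
import time

# Nagios return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def passive_result(host, service, code, output, ts=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line.

    On a live system this line would be written to the Nagios command
    pipe (usually /usr/local/nagios/var/rw/nagios.cmd).
    """
    ts = int(ts if ts is not None else time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        ts, host, service, code, output)

# Reset a stuck passive check back to OK:
print(passive_result("web01", "Backup", OK, "backup finished", ts=1234567890))
```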
State changes
When a service transitions from an OK state to a failed state, the check attempt will be 1. This
increments for each subsequent failure. When the max check attempts is reached, the state is
considered a hard state.
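The attempt counting above can be sketched as a small simulation (a simplification: real Nagios bookkeeping also handles soft recoveries and retry intervals):

```python
def classify(results, max_check_attempts=3):
    """Walk a sequence of check results (True = OK, False = failed) and
    report (state, attempt, hardness) after each one. OK results are
    shown with attempt 1 for simplicity."""
    out = []
    attempt = 0
    for ok in results:
        if ok:
            attempt = 0
            out.append(("OK", 1, "hard"))
        else:
            # attempt is 1 on the first failure, then increments
            attempt = min(attempt + 1, max_check_attempts)
            hardness = "hard" if attempt >= max_check_attempts else "soft"
            out.append(("CRIT", attempt, hardness))
    return out

print(classify([True, False, False, False, False]))
```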
Hostgroup Hierarchy always shows soft states, so it always shows the most recent state information.
Beware of ignoring soft states - there may be a transient problem that needs resolving (usually load related).
Notifications
Notifications are only sent on hard state changes which means a delay is introduced. You can
set max check attempts to 1 if you want to be notified straight away.
Opsview by default supports 3 different notification methods:
Emails (this will require setting up a mail system on the Opsview server)
SMS (need to have either an SMS gateway or a GSM modem attached)
RSS
See: http://docs.opsview.org/doku.php?id=opsview-community:notificationmethods for more
information.
If you want to overcome the limitations of slaves, you can send notifications from the master.
A "recovery" state is when a host or service transitions from a hard failure back to OK.
Flapping
The flapping value is calculated from the number of state changes that have occurred across the
last 21 recorded states for a service.
There is a high flapping threshold (when something goes into a flapping state) and a low flapping
threshold (when something comes out of a flapping state).
We use the default values of high: 30% and low: 20%.
You can disable flap detection for each servicecheck. The default is to set this on.
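As a sketch of the calculation, the percentage of state changes across the last 21 states can be compared against the two thresholds with hysteresis (Nagios actually weights recent state changes more heavily than older ones; this flat average is a simplification):

```python
def flap_percent(states):
    """Percentage of state changes across the last 21 recorded states."""
    recent = states[-21:]
    changes = sum(1 for a, b in zip(recent, recent[1:]) if a != b)
    return 100.0 * changes / (len(recent) - 1)

def update_flapping(currently_flapping, pct, high=30.0, low=20.0):
    """Hysteresis: start flapping above the high threshold, stop only
    below the low threshold; in between, keep the current state."""
    if pct >= high:
        return True
    if pct <= low:
        return False
    return currently_flapping

# A service alternating OK/CRIT on every check is 100% flapping:
print(flap_percent(["OK", "CRIT"] * 11))
```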
Parent/child relationships
If a host is marked as down, then its parents are checked to determine the network reachability.
Performance graphing
RRD stands for Round Robin Database. It is a fixed size database and very fast, but it loses
resolution over time.
If your plugin returns valid performance data, the database will be automatically updated.
There is a default resolution for all RRDs:
• 5 minutes is the smallest resolution
• 50 hours averaged over 5 minutes
• 14 days averaged over 30 minutes
• 2 months averaged over 2 hours
• 2 years averaged over 1 day
This produces an RRD file which is 24K in size.
The service check must run at least once an hour, otherwise RRD will record no data for that period.
Gauge is the automatic default data type. To record counter values, suffix the value with “c”.
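The retentions above translate directly into row counts per round-robin archive. A quick sketch, assuming a 5-minute base step (the month lengths are approximations):

```python
# Rows needed in each round-robin archive for the retentions above:
# (name, retention span in seconds, resolution in seconds)
archives = [
    ("50 hours @ 5 min",   50 * 3600,       300),
    ("14 days @ 30 min",   14 * 86400,      1800),
    ("2 months @ 2 hours", 62 * 86400,      7200),
    ("2 years @ 1 day",    2 * 365 * 86400, 86400),
]
for name, span, resolution in archives:
    print("%-20s %5d rows" % (name, span // resolution))
```

At 8 bytes per stored value, 600 + 672 + 744 + 730 = 2746 rows is roughly 21.5K of data, consistent with the 24K file size quoted above once the file header is included.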
Checkpoint
Opsview has 4 types of service checks: active, snmp polling, passive and snmp traps. The first
two are active checks and the last two are passive.
You could get fractional values for data points that are only integers, because of the constant
averaging in the RRD.
Setting max check attempts to 1 is useful for passive checks, and also for services you want to be
alerted on immediately.
Answer!
(diagram: a timeline of check results moving from OK into CRIT; the first CRIT results are soft
states while the check attempt increments, and the state becomes hard once max check attempts
is reached)
One of the main features of Opsview is the handling of distributed system. Opsview handles:
* installing the Opsview software on the slave
* upgrading Opsview software when master is upgraded
* synchronising /usr/local/nagios/libexec
* generating configuration for master and slave
* single point of control from the master web UI
The technology used to send results back to the master is Nagios Service Check Acceptor
(NSCA).
You can enable the web interface on the slave server, so that you can see standard Nagios
screens from the slave.
"Freshness checking", the Nagios concept of expecting results within a certain timeframe, has an
additional 30 minute window before it starts to mark services into a different state.
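A sketch of how that window extends the configured threshold (the exact internals may differ; all values are in seconds):

```python
def is_stale(last_result_age, freshness_threshold, extra_window=30 * 60):
    """A passive result is treated as stale once its age exceeds the
    configured freshness threshold plus the additional 30-minute window
    described above."""
    return last_result_age > freshness_threshold + extra_window

# A 15-minute threshold effectively becomes 45 minutes:
print(is_stale(3600, 900))   # an hour-old result is stale
print(is_stale(1000, 900))   # under 45 minutes old: not yet stale
```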
There is a "Slave-node: {hostname}" service that will be automatically created that monitors the
slave.
http://docs.opsview.org/doku.php?id=opsview-community:slavesetup
Distributed slaves, part 2
Slaves need to be the same architecture as the master because Opsview master sends all files
in /usr/local/nagios/bin, including the nagios executable, to the slaves. This will include
architecture specific files.
Plugin output from a slave to the master is limited to the first 511 bytes or the first line, whichever
comes first. This is due to the transport mechanism used.
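The effect of that limit can be sketched as follows (an illustration of the behaviour, not the actual transport code):

```python
def truncate_for_nsca(output, limit=511):
    """Mimic the transport limit: keep only the first line of plugin
    output, then at most `limit` bytes of it."""
    first_line = output.split("\n", 1)[0]
    return first_line.encode()[:limit].decode(errors="ignore")

print(truncate_for_nsca("DISK OK - 21% free\nlong perf details..."))
```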
Opsview uses NSCA for sending results from the slaves to the master. If it fails to send the
check results, it will drop them, so you could lose results.
Time must be synchronised between master and slave. The automatically created
check_opsview_slave_node check verifies that the clocks are within 5 seconds of each other.
The timezone does not have to match, but we recommend it is set the same as the master.
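The 5-second tolerance amounts to a simple comparison (a sketch, not the actual check implementation):

```python
def clocks_in_sync(master_epoch, slave_epoch, tolerance=5):
    """True if master and slave clocks agree to within `tolerance`
    seconds, as the slave-node check requires."""
    return abs(master_epoch - slave_epoch) <= tolerance

print(clocks_in_sync(1000, 1004))  # within tolerance
print(clocks_in_sync(1000, 1006))  # skewed too far
```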
Software level clustering for slaves is provided in Opsview. You can use your own virtual
machines to provide failover and redundancy if desired.
Synchronisation of states
A limitation of Opsview prior to 3.5.2 was that if you moved a host or service that was
acknowledged from one slave to another, the new slave would not know about the
acknowledgement and thus notifications from the slave would be sent.
From Opsview 3.5.2 onwards, the state of the slave is synchronised with the master as long as
the last_updated field on the slave is older than the masterʼs view. This occurs at Opsview
reload time and at cluster takeover time.
Distributed Architecture Diagram
(diagram: web clients connect over HTTP (port 80) to the Opsview Master; the master connects
over ssh (port 22) to Slave A in Datacenter 1 and to the clustered slaves B1, B2 and B3 in
Datacenter 2; slaves send results back via nsca and optionally expose their own web interface on
HTTP port 80; slaves monitor hosts using NRPE, SNMP or check_by_ssh)
All communication between master and slave is over port 22 (SSH). This is usually initiated from
the master to the slave. It is possible to do a “reverse SSH”, so the slave initiates an SSH tunnel
from the slave to the master which the master will then use to connect to slaves.
Setup of slave requires exchanging SSH keys.
Cluster nodes require an ssh key exchange between each other so they can share their state
information between themselves.
You can have a different number of slave cluster nodes for each slave system.
Clustered slaves
(diagram: the Opsview Master is connected to a slave system of three cluster nodes, Slave-node
A, B and C, which together monitor hosts 1 to 6)
The Opsview Master has 6 hosts which are being monitored by the slave system. There are 3
nodes in the slave.
At Opsview reload time, the 6 hosts are split across all the nodes (based on an algorithm called
Set::Cluster on CPAN - http://search.cpan.org/dist/Set-Cluster/ ).
The “Cluster-node” services are automatically generated to look at every other node in that slave
system. This requires SSH access between the nodes.
Each cluster node service has a list of hosts it will takeover in the case of the specific node
failure - this is calculated at reload time (and not dynamically).
Clustered slaves, part 2
• MRTG data stored locally: the first node in the cluster will poll devices
• NMIS data stored locally: will rsync with other cluster nodes
• Every 15 minutes, the status information for each node is sent to every other node (for
synchronisation at takeover)
• Limitation: single node failure only
• Limitation: slave clusters should be in the same network segment
The NMIS data is rsyncʼd between cluster nodes every hour (based on nagios userʼs crontab)
Checkpoint
There is no software limit to the number of slaves, but there is a cost as reload times increase.
For 2 cluster nodes, 2 checks are automatically setup, each monitoring the other.
For 3 cluster nodes, 6 checks would be setup.
For 4, 12 checks are setup.
The formula is: N x (N-1) where N is the number of nodes in a slave
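The formula can be confirmed with a one-liner:

```python
def cluster_node_checks(n):
    """Each node checks every other node: N x (N - 1) checks."""
    return n * (n - 1)

for n in (2, 3, 4):
    print("%d nodes -> %d checks" % (n, cluster_node_checks(n)))
```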
If two nodes failed, some of the hosts/services would be marked as stale.
Distributed slaves setup
process-cache-data uses send_nsca to transport results back to the master. This has a limit of
511 characters in the output.
Troubleshooting
We use the term Modules for functionality that is “loosely coupled” to Opsview but we still
provide integration with it.
Integration:
• Apache configuration
• Authentication
• Host group hierarchy when choosing host groups
Because Nagvis is PHP5 based, Opsview delegates the PHP5 page rendering to Apache, by
disabling proxying through to the Opsview Web application.
When configuring Apache, you can use the auth ticket method for authenticating. This means
that users of Opsview can access Nagvis using this authentication ticket seamlessly. If a user
tries to access nagvis without this ticket, they will be redirected back to the Opsview login
screen.
Beware of Nagvis access controls! Maps can be assigned to EVERYONE for view and
EVERYONE for edit. This means any authenticated user could edit your maps. Also, since the
maps can be edited, this means it is possible to get a drop down list of all the hosts and services
on your system! You can overcome this by making sure you only have named users for edit.
MRTG and NMIS
• Write it
• Run it on the command line
• Have a -h option to print out help text
• Drop it into /usr/local/nagios/libexec
• It will automatically be available in the Opsview servicecheck page
• Recommendation: plugins should return a short amount of data on one line
You can create a plugin in any language as long as it is executable by the nagios user.
We will create an example plugin that just returns OK
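Since a plugin can be written in any language, here is a minimal sketch in Python of a plugin that always returns OK (the plugin name and message text are illustrative):

```python
"""check_always_ok - a minimal example plugin that always reports OK."""
import sys

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main(argv):
    """Print one short line of status and return the exit code."""
    if "-h" in argv:
        print("Usage: check_always_ok [-h]")
        print("Example plugin that always reports OK.")
        return UNKNOWN
    # A real plugin would examine something here; this one always passes.
    print("OK - example plugin ran successfully")
    return OK

# A real plugin would finish with:
#   if __name__ == "__main__":
#       sys.exit(main(sys.argv[1:]))
```

Nagios interprets the exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.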
You may not need to write your own plugin, if an existing one can handle the checking for you.
For instance, instead of a dedicated virus update plugin, just use check_file_age to test that the
virus definition file is up to date.
You could also use a "proxy" method for getting results. For instance, query a database to get
results for a test (say, number of sessions in your web application, balance of a test account)
FOSDEM example plugin: http://nagiosplugins.org/fosdem
Creating plugins easily
Use Nagios::Plugin
Distributed with Opsview Agents, installed in /usr/local/nagios/perl/lib
Start plugin with:
• use FindBin;
• use lib "$FindBin::Bin/../perl/lib";
• use Nagios::Plugin;
More documentation:
• http://search.cpan.org/dist/Nagios-Plugin/lib/Nagios/Plugin.pm
Opsera have been very active in using NDOutils and updating the software.
Nagios has started to move data into a database structure - this is the project called NDOutils
(Nagios Data Objects).
Opsview uses NDOutils, and has been actively promoting and updating the software, but
recognises there are some limitations in how the data is represented in NDOutils. For instance,
recording a result in NDOutils takes about 1K per result, whereas in ODW it takes about
250 bytes.
Opsview Data Warehouse
If you change the crontab entry, be aware that an upgrade will revert the crontab back again.
Be aware that the tables used by ODW are MyISAM tables which do table level locking. If you
have a long running query, it may lock up the import process, but the import process will
continue when tables are unlocked again.
If you do a lot of reporting, you may want to consider setting up replication of the ODW database
onto a different mysql server where you can run the reports without affecting the main Opsview
instance.
Architecture
The PDF reports are retained in Opsview Core, but are being phased out due to complexity of
code and poor functionality.
The new reports infrastructure uses Jasper Reports for its base technology, and includes several
report types.
API
/usr/local/nagios/etc
• opsview.conf and opsview.defaults
• map.local and map
/usr/local/opsview-web
• opsview_web_local.yml and opsview_web.yml
/usr/local/nagios/share/stylesheets
• custom.css
/usr/local/nagios/nagvis/etc
• nagvis.ini.php
/usr/local/nagios/nmis/conf
• nmis.conf
opsview.defaults is a shipped file and is subject to change. If there are variables you need to
override, copy them into opsview.conf and amend them there. opsview.conf will not usually be
changed by an upgrade.
Similarly for the map and the map.local file. However, you shouldnʼt need to use the map file as
correctly formatted performance data will be automatically graphed.
The nagios configuration files are regenerated every time.
Backup and recovery
You will have to design your own backup strategy for long term archival
Long term archival is not handled as part of the backup script because of the amount of data that
could be in ODW and Runtime.
For more information about backups: http://docs.opsview.org/doku.php?id=opsview-community:backups
Data to consider backing up
ODW
Runtime
MRTG
NMIS
Slaves
• MRTG
• NMIS
• Nagios logs
Opsview uses Parallel::Forker for the workflow management. There is a limit set to only run 4
concurrent jobs at once - you can increase this if you have more CPUs on your master server.
Reload troubleshooting
/usr/local/nagios/var/log/create_and_send_configs.debug
• Last reload process workflow
Opsview uses Log4perl for logging and you can set location and rotation information. However,
this file is overwritten as part of an upgrade
Opsview not running
You must use su - nagios. The dash means to pick up some environment variables required by
Nagios.
Errors relating to browser