
OV203: Advanced Opsview Configuration and Management

Ton Voon, Opsera
Web training
March 2010


Introduction

 Who am I?
 Who are you?
• Name
• What you do
• Experience with Opsview
• What you are most interested in

 What will we be learning?


Ton Voon is the Product Architect for Opsview and is the main person in charge of the design
and scope of Opsview. He has been involved in the development of Opsview since 2005.
The main documentation site is http://docs.opsview.org/doku.php?id=opsview-community
Aims of training course

 Understand the advanced concepts of how Opsview monitors
 Understand how a distributed Opsview system works
 Understand how to add new custom plugins
 Understand what backups are required for Opsview

Theory: Understanding the advanced concepts that will appear in Opsview


Distributed Opsview: Understand how it works and what limitations exist
Plugins: How to extend Opsview to monitor specific characteristics
Agenda

 Advanced monitoring concepts
 Distributed architecture
 Modules
 Custom plugins
 ODW
 API
 Configuration files
 Backup and recovery
 Troubleshooting


Advanced concepts



Checks: active versus passive

 Active: run on a periodic basis
 Passive: result arrives on demand
• Manually reset the state of a failure
• Or set a freshness interval to change the state automatically after a period of time

Active checks are "polling checks". Try to use active checks because when things are fixed, the
service will automatically change status at the next polling interval.
Examples of passive checks: a backup start/finish message; a link up or link down from an
interface; or entries in a log file.
Passive checks need to be reset, otherwise the state stays the same. You can:
- submit a result via the contextual menu for the service. This will send the result to the slave (if
appropriate), which will process it and send the result back up
- configure the service check so that after a defined interval it will automatically reset the state
You can also submit a result to test failure scenarios.
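A result can also be injected from the command line through the standard Nagios external
command file; a minimal sketch, assuming the default command pipe location and a hypothetical
host and service:

#!/usr/bin/perl
# Sketch: submit a passive service result via the Nagios external command
# file. Host and service names are hypothetical; the return code uses the
# plugin convention 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
use strict;
use warnings;

my $cmd_file = '/usr/local/nagios/var/rw/nagios.cmd';    # default command pipe
open my $fh, '>>', $cmd_file or die "Cannot open $cmd_file: $!";
printf $fh "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n",
    time(), 'webserver01', 'Nightly backup', 2, 'CRITICAL - backup did not complete';
close $fh;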
State changes

 A Soft state is an initial failure state
 A failure state will change to Hard if it is still failing after a certain number of checks
 Based on Max Check Attempts
 Can retry at different intervals
 Aim: cater for temporary “glitches”

When a service transitions from an OK state to a failed state, the check attempt will be 1. This
increments for each subsequent failure. When the max check attempts is reached, the state is
considered a hard state.
Hostgroup Hierarchy always shows soft states, so it is always the most recent state information.
Beware of ignoring soft states - there may be something that is a transient problem that needs
resolving (usually load related).
Notifications

 Notifications are sent on hard state changes
 Can be configured as emails, SMS or RSS/Atom feeds
• Emails: generated on the slave; RSS: on the master; SMS: configurable
 Extra state: recovery
 Checks continue to be run - Opsview will re-notify unless the problem is acknowledged or marked as downtime
 Limitation: a notification from a slave will not necessarily have all the information that the master has, since the parent/child topology will be different

Notifications are only sent on hard state changes which means a delay is introduced. You can
set max check attempts to 1 if you want to be notified straight away.
Opsview by default supports 3 different notification methods:
Emails (this will require setting up a mail system on the Opsview server)
SMS (need to have either an SMS gateway or a GSM modem attached)
RSS
See: http://docs.opsview.org/doku.php?id=opsview-community:notificationmethods for more
information.
If you want to overcome the limitations of slaves, you can send notifications from the master.
A "recovery" state is when a host or service transitions from a hard failure back to OK.
Flapping

 If a host or service changes state too frequently
 Disables notifications temporarily

The flapping value is calculated as the percentage of state changes across the last 21 recorded
states for a service.
There is a high flapping threshold (when something goes into a flapping state) and a low flapping
threshold (when something comes out of a flapping state).
We use the default values of high: 30% and low: 20%.
You can disable flap detection for each service check; it is enabled by default.
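As a simplified sketch of that calculation (real Nagios additionally weights recent state changes
more heavily than older ones, which is omitted here):

#!/usr/bin/perl
# Simplified flap-detection sketch: percentage of state changes across the
# last 21 recorded states. The @states sample data is made up.
use strict;
use warnings;

my @states = qw(OK OK CRIT OK OK OK WARN WARN OK OK CRIT
                OK OK OK OK CRIT CRIT OK OK OK OK);    # 21 states

my $changes = 0;
for my $i ( 1 .. $#states ) {
    $changes++ if $states[$i] ne $states[ $i - 1 ];
}
my $percent = 100 * $changes / $#states;    # 20 possible transitions

printf "%.1f%% state change\n", $percent;   # flapping starts above 30%,
                                            # and stops again below 20%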
Parent/child relationships

 A host can have a parent to denote a dependency
 You can assign multiple parents to a host
 Defines a network topology, acting as a dependency for determining network reachability

If a host is marked as down, then its parents are checked to determine the network reachability.
Performance graphing

 A way of storing numeric data over time
 Automatically created from a plugin’s performance data
 RRD based, averaged
 5 minute intervals
 Service checks must run at least once an hour, otherwise “gaps” appear
 Gauge or counter data points
 Map file for adapting non-compliant plugins

RRD stands for Round Robin Database. It is a fixed size database and very fast, but it loses
resolution over time.
If your plugin returns valid performance data, the database will be updated automatically.
There is a default resolution for all RRDs:
- 5 minutes is the smallest resolution
- 50 hours averaged over 5 minutes
- 14 days averaged over 30 minutes
- 2 months averaged over 2 hours
- 2 years averaged over 1 day
This produces an RRD file which is about 24K in size.
The service check must run at least once an hour, otherwise the RRD will record gaps of missing data.
Gauge is the automatic default data point type. To record counter values, suffix the value with “c”.
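As a back-of-the-envelope check on that 24K figure (a sketch: the row counts are simply derived
from the retention periods stated above):

#!/usr/bin/perl
# Approximate row counts implied by the stated RRD retentions.
use strict;
use warnings;

my @rras = (
    [ '5 min resolution, 50 hours',  50 * 3600,       300   ],
    [ '30 min averages, 14 days',    14 * 86400,      1800  ],
    [ '2 hour averages, ~2 months',  61 * 86400,      7200  ],
    [ '1 day averages, 2 years',     2 * 365 * 86400, 86400 ],
);

my $rows = 0;
for my $rra (@rras) {
    my ( $desc, $period, $step ) = @$rra;
    my $n = int( $period / $step );
    printf "%-30s ~%4d rows\n", $desc, $n;
    $rows += $n;
}
# ~2,700 rows at 8 bytes per stored value, plus headers, is roughly 24K
printf "Total ~%d rows, ~%dK of values\n", $rows, $rows * 8 / 1024;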
Checkpoint

 What are the two types of checks?
 What are the two state change types?
 A process graph shows 3.2 processes. Why do I get a non-integer value?
 What happens with a service defined with max check attempts of 1? When is this useful?
 An active service is checked every 3 minutes, with a retry check interval of 1 minute and max check attempts of 4. How long will it be between the last OK and the time you are notified?

Opsview has 4 types of service checks: active, SNMP polling, passive and SNMP traps. The first
two are active checks and the last two are passive.
You can get fractional values for data points that are only ever integers because of the constant
averaging in the RRD.
Max check attempts of 1 is useful for passive checks, and also for services you want to be
alerted on immediately.
Answer!

[Timeline diagram: OK checks 3 minutes apart; the first CRIT arrives 3 minutes after the last OK
(soft, attempt 1), then retries follow at 1-minute intervals; the fourth consecutive CRIT is a hard
state and triggers the notification.]

6 minutes since the last OK


Distributed architecture



Distributed slaves

 Slaves provide monitoring from a different location
• Reduces bandwidth
• Spreads load
• Independent monitoring system
• Simplifies firewall configuration
 Slaves can consist of clustered nodes
• Balances workload
• Redundancy and automatic failover


Distributed slaves, part 2

 Runs as an independent monitoring server, reporting to the master (using NSCA)
 Stores MRTG and NMIS data locally
 Can enable a web interface on the slave (standard Nagios CGIs)
 Managed from the master server
 Each host is assigned to a slave
 Results from a slave are marked as “stale” if no results arrive
 Slave-node services automatically created

One of the main features of Opsview is its handling of distributed systems. Opsview handles:
* installing the Opsview software on the slave
* upgrading Opsview software when master is upgraded
* synchronising /usr/local/nagios/libexec
* generating configuration for master and slave
* single point of control from the master web UI

The technology used to send results back to the master is Nagios Service Check Acceptor
(NSCA).

You can enable the web interface on the slave server, so that you can see standard Nagios
screens from the slave.
"Freshness checking", a Nagios concept of expecting results within a certain timeframe, has an
additional 30 minute window before it will start to mark services into a different state.
There is a "Slave-node: {hostname}" service that will be automatically created that monitors the
slave.
http://docs.opsview.org/doku.php?id=opsview-community:slavesetup
Distributed slaves, part 3

 Limitation: same OS and architecture as the master
 Limitation: plugin output on a slave only sends the first 511 bytes or the 1st line to the master
 Limitation: loses results if the connection drops
 Limitation: acknowledgements and downtime not synchronised
 Limitation: time must be synchronised, but the time zone can be different
 Can have them clustered
• Failover
• Load balancing

Slaves need to be the same architecture as the master because the Opsview master sends all files
in /usr/local/nagios/bin, including the nagios executable, to the slaves. This will include
architecture specific files.
Plugin output from a slave to the master is limited to the 1st 511 bytes or the 1st line, whichever
comes first. This is due to the transport mechanism used.
Opsview uses NSCA for sending results from the slaves to the master. If it fails to send the
check results, it will drop those results, hence you could have lost results.
Time must be synchronised between master and slave. The automatically created
check_opsview_slave_node check verifies that the clocks are within 5 seconds of each other.
The time zone does not have to match, but we recommend it is set the same as the master.
Software level clustering for slaves is provided in Opsview. You can use your own virtual
machines to provide failover and redundancy if desired.
Synchronisation of states

 Principle: single point of control from the Opsview master
 Changes made through the user interface are propagated to slaves (5 second delay)
• So acknowledgements and downtimes on the master are replicated on slaves
 At Opsview reload, states for hosts and services on a slave will be synchronised with the master if the last updated time is older on the slave

A limitation of Opsview prior to 3.5.2 was that if you moved a host or service that was
acknowledged from one slave to another, the new slave would not know about the
acknowledgement and thus notifications from the slave would be sent.

From Opsview 3.5.2 onwards, the state of the slave is synchronised with the master as long as
the last_updated field on the slave is older than the master's view. This occurs at Opsview
reload time and at cluster takeover time.
Distributed Architecture Diagram

[Diagram: web clients connect to the Opsview master over HTTP on port 80. The master
communicates with each slave over SSH on port 22, with HTTP on port 80 to the slave as an
option; slaves send results back to the master via NSCA. Slave A stands alone in Datacenter 1,
while slaves B1, B2 and B3 form a cluster in Datacenter 2, connected to each other by SSH.
Slaves monitor their hosts using NRPE, SNMP or check_by_ssh.]

All communication between master and slave is over port 22 (SSH). This is usually initiated from
the master to the slave. It is possible to do a “reverse SSH”, where the slave initiates an SSH
tunnel to the master which the master will then use to connect to the slave.
Setting up a slave requires exchanging SSH keys.
Cluster nodes require an SSH key exchange with each other so they can share state information
between themselves.
You can have a different number of cluster nodes for each slave system.
Clustered slaves

 Can have an arbitrary number of clustered nodes, usually 2 or 3
 At reload time, the hosts for a slave are split across the cluster nodes
 On each node, a service (“Cluster-node: {nodename}”) is automatically generated to monitor every other node
 An event handler is set up to take over on failure
 Requires an SSH connection between the slave cluster nodes

For more information about slave clusters, see:
http://docs.opsview.org/doku.php?id=opsview-community:slaveclusters
Example slave cluster system

[Diagram: the master monitors hosts 1-6 through a slave system of three nodes, with
automatically created “Slave-node: A/B/C” services on the master. Node A monitors hosts 1 and 2
and runs “Cluster-node: B” (takeover host 3) and “Cluster-node: C” (takeover host 5). Node B
monitors hosts 3 and 4 and runs “Cluster-node: A” (takeover host 1) and “Cluster-node: C”
(takeover host 6). Node C monitors hosts 5 and 6 and runs “Cluster-node: A” (takeover host 2)
and “Cluster-node: B” (takeover host 4).]

The Opsview master has 6 hosts which are being monitored by the slave system. There are 3
nodes in the slave.

At Opsview reload time, the 6 hosts are split across all the nodes (based on an algorithm called
Set::Cluster on CPAN - http://search.cpan.org/dist/Set-Cluster/ ).

The “Cluster-node” services are automatically generated to look at every other node in that slave
system. This requires SSH.
Each cluster-node service has a list of hosts it will take over in the case of that specific node
failing - this is calculated at reload time (and not dynamically).
Clustered slaves, part 2

 MRTG data stored locally: the first node in the cluster will
poll devices
 NMIS data stored locally: will rsync with other cluster nodes
 Every 15 minutes, the status information for node is sent to
every other node (for synchronisation at takeover)
 Limitation: Single node failure only
 Limitation: Slave clusters should be in the same network
segment


The NMIS data is rsync'd between cluster nodes every hour (based on the nagios user's crontab)
Checkpoint

 Why would you use slaves?
 How many slaves can you run?
 If there are 2 cluster nodes in a slave, how many Cluster-node checks are automatically created?
 What if there are 4 cluster nodes?
 What would happen if 2 nodes in a cluster failed?

There is no software limit to the number of slaves, but there is a cost as reload times increase.
For 2 cluster nodes, 2 checks are automatically setup, each monitoring the other.
For 3 cluster nodes, 6 checks would be setup.
For 4, 12 checks are setup.
The formula is: N x (N-1) where N is the number of nodes in a slave
If two nodes failed, some of the hosts/services would be marked as stale.
Distributed slaves setup


Setup

 Set up users and groups on slaves
 Prerequisite software needs to be installed: use the opsview-slave package
 Doesn’t install the actual software - it will be sent from the master
 Upgrades will update slaves as part of the upgrade process
 Can enable a web interface on the slave

The procedure for setting up a slave web interface is documented at
http://docs.opsview.org/doku.php?id=opsview-community:slavesetup#slave_web_interface
Operation

 The slave runs its active checks
 Every result is written to /usr/local/nagios/var:
• cache_host.log
• cache_service.log
 Every 5 seconds, /usr/local/nagios/bin/process-cache-data is called to send host and service results to the master
 Output from these calls is saved to cache.log and the return code is saved to /usr/local/nagios/var/ocsp.status
 The Opsview master will show the service as passive, though it is actively run on the slave

process-cache-data uses send_nsca to transport results back to the master. This has a limit
of 511 characters in the output.
Troubleshooting

 Check ocsp.status for the last return code on the slave
 Check cache.log for errors
 Check:
• echo “” | /usr/local/nagios/bin/send_nsca -H 127.0.0.1 -c /usr/local/nagios/etc/send_nsca.cfg
• Expected output: 0 data packet(s) sent to host successfully.
 Check ssh on the master
 Check netstat -an | grep 5667 on the master


Modules


We use the term Modules for functionality that is “loosely coupled” to Opsview but which we still
provide integration with.

Opsview Core comes with:
* Nagvis
* MRTG
* NMIS
Nagvis

 Nagvis provides a visual representation of the status of various objects
 Maps are groupings of objects together with a background image
 Automap is a replacement network map view
 Technology:
• PHP5
 Integration:
• Apache configuration
• Authentication
• Host group hierarchy when choosing host groups

Because Nagvis is PHP5 based, Opsview delegates the PHP5 page rendering to Apache, by
disabling proxying through to the Opsview Web application.

When configuring Apache, you can use the auth ticket method for authentication. This means
that users of Opsview can access Nagvis using this authentication ticket seamlessly. If a user
tries to access Nagvis without this ticket, they will be redirected back to the Opsview login
screen.

For more information about Nagvis:
http://docs.opsview.org/doku.php?id=opsview-community:nagvis
Nagvis: Using

 Initial page at /nagvis
 Access to maps
 Adding a new map
 Adding a background
 Adding a new state object (host groups can be based on the hierarchy)
 Limitation: no fine-grained access controls within Nagvis
 Limitation: times displayed are in UTC

Beware of Nagvis access controls! Maps can be assigned to EVERYONE for view and
EVERYONE for edit, which means any authenticated user could edit your maps. Since an editable
map exposes a drop-down list of all the hosts and services on your system, make sure you only
grant edit rights to named users.
MRTG and NMIS

 Both used for interface statistics
 Use SNMP to collect information from hosts
 Use RRD files to store their data

More information about MRTG and NMIS in the OV204 course


Custom Plugins



Plugin specifications

 A plugin must provide:
• A return code on completion
• Output to stdout (preferably 1 line, less than 511 bytes)
 A plugin should provide:
• Help output when run with -h (written to stdout)
 A plugin can provide:
• Performance data: everything after the pipe symbol (“|”) is considered performance data (preferably on that 1st line)
 Full plugin guidelines:
• http://nagiosplug.sourceforge.net/developer-guidelines.html


Performance data format

 Performance data will be automatically graphed if it is in the correct format
 label=value[uom][;warn][;critical]
 Can have multiple sections of these - use a space to separate them
 The order and number of the sections can change; the insert routine will update the data appropriately
 Be aware of averaging!

The warning and critical levels are optional.
The full performance format includes maximum and minimum values, but these are ignored in
Opsview.
If the label changes, then it will be considered a new performance plot.
ODW will save the raw value. However, threshold information is not retained.

For counter values, use performance data like:
inputbytes=119c
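For instance, a status line carrying a gauge with thresholds plus a counter might be printed like
this sketch (metric names illustrative):

#!/usr/bin/perl
# Illustrative plugin output: two performance sections after the pipe,
# separated by a space. "utilisation" is a gauge with warning (80) and
# critical (95) thresholds; "inputbytes" is a counter, marked by the "c".
print "IFACE OK - eth0 up|utilisation=42%;80;95 inputbytes=119c\n";
exit 0;    # 0 = OK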
Custom checks

 Write it
 Run it on the command line
 Give it a -h option to print out help text
 Drop it into /usr/local/nagios/libexec
 It will be automatically available in the Opsview service check page
 Recommendation: plugins should return a short amount of data on 1 line

You can create a plugin in any language as long as it is executable by the nagios user.
We will create an example plugin that just returns OK (see the sketch after these notes).
You may not need to write your own plugin if an existing one can handle the checking for you.
For instance, instead of a dedicated virus update plugin, just use check_file_age to test that the
virus definition file is up to date.
You could also use a "proxy" method for getting results. For instance, query a database to get
results for a test (say, the number of sessions in your web application, or the balance of a test
account).
FOSDEM example plugin: http://nagiosplugins.org/fosdem
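A sketch of that always-OK example plugin (the name check_always_ok is made up):

#!/usr/bin/perl
# check_always_ok: minimal example plugin that just returns OK.
# Drop it into /usr/local/nagios/libexec and make it executable by the
# nagios user.
use strict;
use warnings;

if ( @ARGV and $ARGV[0] eq '-h' ) {
    print "check_always_ok: example plugin that always returns OK\n";
    exit 3;    # UNKNOWN is the customary exit code for usage/help output
}

print "EXAMPLE OK - nothing checked, everything assumed fine\n";
exit 0;        # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN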
Creating plugins easily

 Use Nagios::Plugin
 Distributed with Opsview Agents, installed in /usr/local/nagios/perl/lib
 Start the plugin with:
• use FindBin;
use lib "$FindBin::Bin/../perl/lib";
use Nagios::Plugin;
 More documentation:
• http://search.cpan.org/dist/Nagios-Plugin/lib/Nagios/Plugin.pm
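Building on that stub, a minimal complete plugin using Nagios::Plugin might look like the
following sketch; the "sessions" metric and its hard-coded value are illustrative:

#!/usr/bin/perl
# Sketch of a threshold-checking plugin built on Nagios::Plugin.
use strict;
use warnings;
use FindBin;
use lib "$FindBin::Bin/../perl/lib";
use Nagios::Plugin;

my $np = Nagios::Plugin->new(
    usage => "Usage: %s -w <warning> -c <critical>",
);
$np->add_arg( spec => 'warning|w=s',  help => 'Warning threshold' );
$np->add_arg( spec => 'critical|c=s', help => 'Critical threshold' );
$np->getopts;

my $sessions = 42;    # a real plugin would measure something here

# Performance data after the pipe is graphed automatically by Opsview
$np->add_perfdata(
    label    => 'sessions',
    value    => $sessions,
    warning  => $np->opts->warning,
    critical => $np->opts->critical,
);

my $code = $np->check_threshold(
    check    => $sessions,
    warning  => $np->opts->warning,
    critical => $np->opts->critical,
);
$np->nagios_exit( $code, "$sessions active sessions" );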


ODW



History

 Nagios is very good at “what is happening now”
 Nagios’ reporting uses the nagios.log file to get status
 Lots of logic to work out status changes is held in the report code
 Need to move to being database driven


NDOutils and Runtime

 NDOutils (Nagios Data Objects) is a project to put Nagios status data into a MySQL database
 Some limitations in how NDOutils saves its data:
• No configuration information over time
• Some tables are too big

Nagios has started to move data into a database structure - this is the project called NDOutils
(Nagios Data Objects).
Opsview uses NDOutils, and Opsera has been actively promoting and updating the software, but
recognises there are some limitations in how the data is represented in NDOutils. For instance,
the row to record a result in NDOutils takes about 1K per result, whereas in ODW it takes about
250 bytes.
Opsview Data Warehouse

 ODW is designed to be a data warehouse
 Denormalised data, for easier searching
 Long term storage tables
 Raw results - performance data is stored exactly as received
 Summary tables for quick queries
 Schema diagram and further documentation:
• http://docs.opsview.org/doku.php?id=opsview-community:odw


ODW: Operation

 Need to opt in to the import process from System Preferences
 A cron job called import_runtime runs at 4 minutes past the hour to collect data and summarise it
 Plugin: check_odw_hostgroup_availability

If you change the crontab entry, be aware that an upgrade will revert the crontab back again.
Be aware that the tables used by ODW are MyISAM tables which do table level locking. If you
have a long running query, it may lock up the import process, but the import process will
continue when tables are unlocked again.
If you do a lot of reporting, you may want to consider setting up replication of the ODW database
onto a different mysql server where you can run the reports without affecting the main Opsview
instance.
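As an illustration of such a reporting query (a sketch only: the credentials, table and column
names below are hypothetical - consult the ODW schema documentation for the real ones):

#!/usr/bin/perl
# Sketch: querying ODW over DBI for a weekly report. The database
# credentials, table name and columns are all hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'DBI:mysql:database=odw;host=localhost',
    'odw_user', 'odw_password', { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    SELECT hostname, servicename, AVG(percent_ok) AS avg_ok
    FROM   service_availability_summary      -- hypothetical summary table
    WHERE  start_datetime >= NOW() - INTERVAL 7 DAY
    GROUP  BY hostname, servicename
});
$sth->execute;
while ( my $row = $sth->fetchrow_hashref ) {
    printf "%s / %s: %.1f%%\n", @{$row}{qw(hostname servicename avg_ok)};
}
$dbh->disconnect;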
Architecture


Only the import_runtime script is used to add data into ODW.

import_runtime is also aware of multiple Opsview masters, so it is possible to have a shared
ODW across multiple Opsview masters. This allows comparisons between hosts and services
on different systems.
The main limitation is that host names must be unique across all Opsview masters:
http://docs.opsview.org/doku.php?id=opsview-community:sharedodw
Reports

 Opsview comes with some PDF reports
 These are being phased out in favour of the Opsview Enterprise Reports Module
• Based on Jasper Reports technology
• Available to Enterprise subscribers
• Documentation at http://docs.opsview.org/doku.php?id=reportingmodule
• Some predefined reports, such as Weekly Availability by Keyword and Weekly Performance by Keyword
 Uses ODW to gather all data metrics for availability and performance

The PDF reports are retained in Opsview Core, but are being phased out due to the complexity
of the code and poor functionality.

The new reports infrastructure uses Jasper Reports as its base technology, and includes several
report types.
API



Reasons for an API

 Automated configuration changes
 Currently supports:
• creating / cloning / deleting hosts
• scheduling downtime
• reloads

For further information about the API, see:
http://docs.opsview.org/doku.php?id=opsview-community:api

<opsview>
  <authentication>
    <username>admin</username>
    <password>initial</password>
  </authentication>
  <host action="create">
    <name>host</name>
    <ip>10.10.10.10</ip>
    <check_command><name>ping</name></check_command>
    <hostgroup><id>2</id></hostgroup>
    <icon><name>LOGO - Opsview</name></icon>
  </host>
</opsview>

This is an example XML file to push to Opsview.


Example invocations

 curl -H 'Content-Type: text/xml' -d @file.xml http://opsviewserver/api
 opsview_api -f file.xml
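The curl invocation can equally be scripted; a minimal sketch in Perl using LWP::UserAgent,
posting the same file.xml:

#!/usr/bin/perl
# Sketch: POST the example XML to the Opsview API with LWP::UserAgent.
# Replace "opsviewserver" with your master's hostname.
use strict;
use warnings;
use LWP::UserAgent;

my $xml = do { local $/; open my $fh, '<', 'file.xml' or die $!; <$fh> };

my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'http://opsviewserver/api',
    'Content-Type' => 'text/xml',
    Content        => $xml,
);
print $res->is_success ? $res->decoded_content : 'Error: ' . $res->status_line;
print "\n";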


Configuration files



Configuration files

 /usr/local/nagios/etc
• opsview.conf and opsview.defaults
• map.local and map

 /usr/local/opsview-web
• opsview_web_local.yml and opsview_web.yml

 /usr/local/nagios/share/stylesheets
• custom.css

 /usr/local/nagios/nagvis/etc
• nagvis.ini.php

 /usr/local/nagios/nmis/conf
• nmis.conf


opsview.defaults is a shipped file and is subject to change. If there are settings you need to
override, copy the variable into opsview.conf and amend it there; opsview.conf will not usually
be changed by an upgrade.
Similarly for the map and map.local files. However, you shouldn't need to use the map file, as
correctly formatted performance data will be automatically graphed.
The Nagios configuration files are regenerated every time.
Backup and recovery



Backup scope

 A cron job runs around 3am which invokes a backup
 The backups save configuration data and some key Runtime data
 The scope is to be able to recover a system quickly
 Key variables in opsview.conf:
• $backup_dir - the directory in which to store the backups
• $backup_retention_days - number of days' worth of backups to keep
 You will have to design your own backup strategy for long term archival

Long term archival is not handled as part of the backup script because of the amount of data that
could be in ODW and Runtime.
For more information about backups:
http://docs.opsview.org/doku.php?id=opsview-community:backups
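For example, the overrides might be set like this in opsview.conf (values illustrative; the syntax
follows the Perl-style assignments used in opsview.defaults):

# Illustrative overrides in /usr/local/nagios/etc/opsview.conf, copied
# from opsview.defaults and amended (see "Configuration files" above)
$backup_dir            = '/var/backups/opsview';
$backup_retention_days = 30;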
Data to consider backing up

 ODW
 Runtime
 MRTG
 NMIS
 Slaves
• MRTG
• NMIS
• Nagios logs



Restore process

 Assumes you are restoring to the same server as the original Opsview server
 Stop Opsview and Opsview Web
 Restore files
 Restore the Opsview, Runtime (subset) and Reports databases
 Restart Opsview and Opsview Web
 Reload Opsview

For the latest restore process, see:
http://docs.opsview.org/doku.php?id=opsview-community:backups
If you are looking to migrate Opsview onto different hardware, see:
http://docs.opsview.org/doku.php?id=opsview-community:migratinghardware
Troubleshooting



Reload process

 Creates Nagios configuration for the master
 Runs MRTG configuration generation in the background
 For each slave:
• Creates Nagios configuration
• Transfers files
 Validates all configuration
 Reloads Nagios simultaneously

Opsview uses Parallel::Forker for the workflow management. There is a limit set to only run 4
concurrent jobs at once - you can increase this if you have more CPUs on your master server.
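The pattern looks roughly like the following sketch, shown here with Parallel::ForkManager, a
related CPAN module, rather than the Parallel::Forker module Opsview actually uses:

#!/usr/bin/perl
# Sketch of the reload concurrency pattern: at most 4 jobs run at once.
use strict;
use warnings;
use Parallel::ForkManager;

my @slaves = qw(slave-a slave-b1 slave-b2 slave-b3);    # example names
my $pm = Parallel::ForkManager->new(4);    # the 4 concurrent job limit

for my $slave (@slaves) {
    $pm->start and next;    # fork; the parent moves to the next slave
    # child: generate and transfer configuration for this slave
    print "generating configuration for $slave\n";
    $pm->finish;            # child exits
}
$pm->wait_all_children;     # then validate everything and reload Nagios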
Reload troubleshooting

 /admin/reload - will show common errors, including Nagios validation errors
 /usr/local/nagios/var/rw/config_output
• Last full debug output of all configuration generation
 /usr/local/nagios/var/log/create_and_send_configs.debug
• Last reload process workflow


File locations

 /usr/local/nagios - Nagios and the main Opsview core
 /usr/local/opsview-web - Opsview web application
 /var/log/opsview - logs for the master server
• opsviewd.log - for the main opsviewd daemon and other master jobs
• opsview-web.log - for the web application
 /usr/local/nagios/var - logs and status files for Nagios (on slaves too)
• var/log/opsview-slave.log - for Opsview specific slave jobs
 Performance RRD data
• /usr/local/nagios/var/rrd/{hostname}/{servicename}/{metric}/value.rrd
• thresholds.rrd

Performance graphs are stored in
/usr/local/nagios/var/rrd/{hostname}/{servicename}/{metric}/value.rrd.
Thresholds are stored in thresholds.rrd, with information about the warning and critical levels.

Opsview uses Log4perl for logging, and you can set location and rotation information. However,
this file is overwritten as part of an upgrade.
Opsview not running

 Restarting Opsview: /etc/init.d/opsview restart
 Restarting Opsview Web: /etc/init.d/opsview-web restart
 Do these either as root or switch to the nagios user
• su - nagios

You must use su - nagios. The dash means to pick up some environment variables required by
Nagios.
Errors relating to the browser

 “The Opsview Web Server is not running”
• Shown if an upgrade is in progress
 “Error retrieving update from Opsview. Will continue to retry”
• An AJAX update problem; the page will continue to repoll. Check for problems with the web service


Errors relating to Opsview

 If a service for the Opsview master or slaves fails, escalate to Opsera

Opsera provide commercial support for Opsview. See
http://opsera.com/jsp/opsera_product/Opsview%20product.jsp for details.
Summary

 Understand the advanced concepts of how Opsview monitors
 Understand how a distributed Opsview system works
 Understand how to add new custom plugins
 Understand what backups are required for Opsview
