
System Monitoring Best Practices

For

BCM 6.0 & 7.0


version 1.0

Contents
OVERVIEW
KEY AREAS TO MONITOR
  Automatically triggered events/reports
  Daily manual check
  Weekly manual check
  Monthly manual check
  General good practices
LOGS IN BCM 6.0
  Generally:
  CEM:
  Call Dispatcher
  Media Routing Server:
  Federation Bridge:
  WEB:
  Connection Server:
  H.323 bridge:
  High Availability Controller:
  Data Collector:
  SIP bridge:
  Detailed CEM errors for timeouts between CoS & CDT
LOGS IN BCM 7.0
  Error types
  Features
  Logging Categories and Locations
SOLUTIONS IN USE BY CUSTOMERS
  Call Robot

OVERVIEW

Once a BCM system is installed and running in live production, it is recommended to monitor your
system on a regular basis. This helps to ensure the system continues to run smoothly, and any
potential issues are identified early. Many problems are preventable, such as a server running out of
disk space. By proactively monitoring your system, you can prevent problems from escalating and
potentially causing system downtime.
This document is designed to highlight the key areas to monitor in your BCM environment, and
suggest some best practices.

These are only guidelines, which each customer and partner must review and adapt for their own
environments. The named third-party solutions are only examples; you may have other solutions for
the same purposes.

Please also see the infrastructure_guide.pdf.


KEY AREAS TO MONITOR

Note that the frequency of checking is only a suggestion.

Automatically triggered events/reports


• Critical server elements: A server monitoring program should be set up to monitor items such as
hard disk space, CPU usage and available memory on the production servers. Solutions such as
MS Operations Manager, HP OpenView, WhatsUp or Nagios are available to monitor the system
components and create reports. Typically, partners already have an established system for
hardware monitoring, which they adapt for the BCM servers.
• Hard disk space, CPU usage and available memory: These should create automatic triggers when
one of these elements passes a specified limit, such as less than 10 GB of hard disk space available.
• SNMP traps from the alarm server: The alarm server, included in the BCM software package, can
be configured to send SNMP traps. For example, the Alarm Server can send alerts to third-party
software so that emails, SMS messages or SNMP alerts notify you of any failure.
Other customers have implemented a “ping check” that constantly polls the virtual unit’s IP
address to see whether it is reachable.

SNMP (Simple Network Management Protocol) is part of the Internet Protocol Suite as defined by the
Internet Engineering Task Force (IETF). SNMP is used in network management systems to monitor
network-attached devices for conditions that warrant administrative attention.
Several SNMP management systems exist on the market, and there is no particular preference for any
of them.

Among others, SNMP defines the following message types:


- SNMP GET REQUEST
- SNMP SET REQUEST
- SNMP TRAP

SAP BCM sends only SNMP traps. SNMP GET and SET messages can be used for managing other
SAP BCM infrastructure devices that support SNMP. SNMP management systems typically include
sophisticated functions, such as the ability to:
- Filter the most relevant messages.
- Send e-mail and SMS alerts initiated by events in the monitored system.
- Log events for further investigation.
- Export event data to files, sheets or syslog servers for reporting purposes.

• Network traffic monitoring: BCM software relies on stable network connections, so it is vital to
have network traffic monitoring in place. This can raise an alarm if network connectivity
between the sites encounters severe delays or connection breaks. This is particularly important
when there are two or more active sites.
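
The disk-space trigger described above can be sketched in a few lines of Python. This is only a minimal illustration, not part of BCM or of any of the named monitoring products: the function name, the 10 GiB default and the example path are assumptions, and a real deployment would normally rely on the existing hardware monitoring solution instead.

```python
import shutil

GIB = 1024 ** 3  # bytes in one gibibyte


def free_space_alerts(paths, min_free_gib=10, usage=shutil.disk_usage):
    """Return (path, free_gib) for every path whose free space is below the limit.

    `usage` is injectable so the check can be exercised without real disks.
    """
    alerts = []
    for path in paths:
        free_gib = usage(path).free / GIB
        if free_gib < min_free_gib:
            alerts.append((path, round(free_gib, 1)))
    return alerts


if __name__ == "__main__":
    # "." is a placeholder; list the drives that hold BCM data and logs.
    for path, free_gib in free_space_alerts(["."]):
        print(f"ALERT: {path} has only {free_gib} GiB free")
```

The returned list can be handed to whatever notification channel is already in place (email, SMS, SNMP).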

Daily manual check


• IA: Open IA and check that the processes are running, and that they are running on the correct
server. It helps to keep IA running on a large display visible in the office, so that you can easily
monitor the current status of all virtual units.
• Windows Event Viewer

Weekly manual check

• SQL jobs: There are SQL jobs in the BCM system which are scheduled to run on a regular
basis. These include the job that creates the reports.
• BCM logs: Check that no errors appear (see the later section). After making any major
configuration changes, it is recommended to check the logs for the related components immediately
after the change and again the following day, in addition to normal testing after the change.
• Reports: Check that the BCM reports are created correctly, and that the Monitoring application
displays data.

Monthly manual check


• Windows/SQL: Checking for the latest Windows and SQL updates helps keep your system
secure. Currently Microsoft releases updates on the second Tuesday of each month.
• BCM updates: Keeping BCM updated with the latest fixes is important. Critical fixes are
contained in hotfixes, which are released when needed. Service Packs are released infrequently
and contain a complete build. It is important to keep the Service Pack level up to date, as
typically hotfixes are only released for the latest Service Pack version.
• Check for updated BCM documentation on SAP Service Marketplace.
• Reports: Check that the BCM reports exist and contain data.
• Windows Performance Monitor: Log on to the server(s) to check CPU and memory usage, and
available hard disk space.
• On the SQL server, check the database and transaction log size, and available disk space.
Compare these with last month's sizes to plan ahead.
• Remove old files from the mail server. Old mails can also be deleted from the mail server
automatically by setting an appropriate value for the parameter DeleteOldMailsAfterDays. See the
System Administration Guide for more information about application parameters.
• Remove old log files. Log file directories are defined during the installation. Define appropriate
paths and logging levels, and regularly monitor the amount of collected data. The amount of data
depends on the system size, configuration and tasks it is used for.
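
Removing old log files can be scripted. The following is a minimal sketch, assuming that log files end in .log and that age is judged by the file's last-modified time. The function names and the retention period are illustrative and not part of BCM. The default is a dry run, so nothing is deleted until the output has been reviewed.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_old_logs(log_dir, max_age_days, now=None):
    """Return *.log files under log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(log_dir).rglob("*.log")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def remove_old_logs(log_dir, max_age_days, dry_run=True):
    """List old log files, and delete them only when dry_run is False."""
    old_files = find_old_logs(log_dir, max_age_days)
    for path in old_files:
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()
    return old_files
```

A call such as remove_old_logs(log_directory, 90), where log_directory is the Log File Directory chosen at installation, prints the candidates without deleting anything.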

General good practices


• Maintenance: Schedule daily backups of at least the CEM, CPM, VWU and WDU databases.
• BCM training: It is important that key staff are trained on BCM. Whenever new
versions are released, it is a good idea to attend a partner training session to keep up to date
with the BCM software. Having a minimum of two BCM-trained experts per system is a good
practice.
• Document any issues that arise: Several agents might have the same issue, and if issues are
recorded somewhere, related ones can be linked, e.g. calls from a certain gateway having audio issues.
• Good communication between teams: Good communication between IT, telephony and
users is essential so that any issues are reported in a timely manner, and that the correct
people are informed.

LOGS IN BCM 6.0

Check that there are no errors in the logs (for example, the CEM log). Normally this is a manual
process that could be done, for example, on a weekly basis. Using WV to highlight
“ERR>|error|WRN>|EXC>” makes this easier, as does a batch file with the command
    findstr /r /s "ERR> error WRN> EXC> " "[VU_logs_folder]\*.*"

The "log grepping" could be automated with third-party applications. We are not aware of what
monitoring tools other partners use to examine the logs themselves, and at the moment we do not
have ready templates. Creating templates that best suit each installation is a learning process, so
expect to refine what is searched for, and to filter out false positives, over time.

Generally:
ERR> , EXC> , WRN>

CEM:
ERR> , EXC> , WRN>, Failed

Call Dispatcher
ERR> , EXC> , WRN>

Media Routing Server:


ERR> , EXC> , WRN>, Terminating WCFPCore

Federation Bridge:
ERR> , EXC> , WRN>

WEB:
error, odbc error (ignore case)

Connection Server:
fail (ignore case. Will include words such as Fail, Failure, Failed)

H.323 bridge:
EXC>

High Availability Controller:


Inactive, Inactivating, "no such process"

Data Collector:
ERR>, "Failed to write events into database transaction buffer" (This would mean that either the
location where the data collector is running is out of space, or it has no connection to the database
and its buffer has run out.)

SIP bridge:
(below are actual errors reported in SIP Bridge logs)
Received command to unknown session
Message handler failed
Couldn't parse message from
Caught unknown exception while handling message from
Unknown protocol
Socket create error
Socket bind error
getsockopt error
Socket WSAAsyncSelect error
Socket recv error
ReceiveAsync error:
OnConnectFailure:
OnReceive error:
failed with error
failed. Reason=
DnsQuery failed getting SRV record for
failed. Error
send error
socket listen error
socket accept error
socket select error
disconnected with code
disconnected on receive timeout
can't send - not connected yet
SetSecurityOptions() failed.
Expired transaction

WRN> entries are not usually critical, but they can be included just to see how many appear.
We would advise simply monitoring these to start with, to see how many occurrences they produce.
Before implementing these searches in the tool, it is a good idea to manually check, for example,
your last hour's or day's worth of logs in case these strings are already present. For example, an
email channel with a wrong username can cause errors and exceptions.
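
The per-component search strings above can be collected into a small scanner. This is a minimal sketch and not a ready template: the dictionary keys are hypothetical labels chosen for illustration, the SIP bridge strings are left out for brevity, and the UTF-8 encoding of the log files is an assumption.

```python
import re

# Search strings per component, copied from the lists above.
# The dictionary keys are hypothetical labels, not BCM identifiers.
PATTERNS = {
    "CEM": re.compile(r"ERR>|EXC>|WRN>|Failed"),
    "MediaRoutingServer": re.compile(r"ERR>|EXC>|WRN>|Terminating WCFPCore"),
    "Web": re.compile(r"error", re.IGNORECASE),  # also covers "odbc error"
    "ConnectionServer": re.compile(r"fail", re.IGNORECASE),
    "H323Bridge": re.compile(r"EXC>"),
    "HighAvailabilityController": re.compile(r"Inactive|Inactivating|no such process"),
    "DataCollector": re.compile(
        r"ERR>|Failed to write events into database transaction buffer"
    ),
}
DEFAULT_PATTERN = re.compile(r"ERR>|EXC>|WRN>")  # the "Generally" list


def scan_lines(component, lines):
    """Return (line_number, text) for every line matching the component's pattern."""
    pattern = PATTERNS.get(component, DEFAULT_PATTERN)
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def scan_file(component, path):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return scan_lines(component, handle)
```

Counting the hits per day first, before wiring them to alerts, matches the advice above to observe how many occurrences a pattern produces.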

Detailed CEM errors for timeouts between CoS & CDT


(i.e. the server and the agent’s workstation)
1) These lines occur when no messages have arrived from the client for about 30 seconds. If there are
no other messages to send, the client should send at least a keep-alive message. The client is set to
“paused” so that no calls are sent to it. At this point, the CDT might show “Opening connection”.
08:53:44.126 3284 ERR> LIBIPCServer : REC_Timeout : Timeout for received messages (796B6A0E92E67148907C40A5FE3076AF) : ('i109',)

2) If the connection recovers between 30 seconds and 2 minutes later, “Opening connection” disappears:
08:31:47.063 3284 ERR> LIBIPCServer : REC_Timeout : Timeout for received messages (8C32315692BD8F4A871C68E42F7B60A3) : ('i109',)
08:32:10.352 7616 REC> _CHANNEL_RECOVERY_ = {'_EVT': '_CHANNEL_RECOVERY_', '_ID': 'i109', '_SAP_ID': 'CONTROL', '_REP_ID': 'i109'}

3) If the connection between CoS and the client has still not been re-established after 2 minutes,
the CEM stops the current session from its point of view. Note that the “Closing login session”
line shows the affected extension number and user name:
08:55:06.826 4420 INF> LIBIPCServer.OnLostChannel : ('796B6A0E92E67148907C40A5FE3076AF', 'i109', '796B6A0E92E67148907C40A5FE3076AF')
08:55:44.448 1976 INF> Closing login session due to timeout : ('1234', 'John Smith')
08:55:44.448 1976 INF> LIBIPCServer.CloseChannel : ('796B6A0E92E67148907C40A5FE3076AF', 'i109', '796B6A0E92E67148907C40A5FE3076AF')

4) If there is a network break of over two minutes, from the user’s point of view the CDT just shows
“Opening connection” until the connection is restored, and then recovers automatically.
08:25:31.438 6248 INF> {'CallType': 'In', 'ANumber': '987654321', '_SAP_ID': 'CALL_CONTROL', '_CMD': '_UNKNOWN_REP', 'BNumber': '1234', '_EVT': 'Disconnect', 'CALL_ID': 'CI_101L', '_MGR': 'UI_MGR', '_CLS': 'UI', '_REP_ID': 'REP_20324', '_ID': 'BRW_20324', 'DATA': 'ERROR'} : ('REJ>',)

Possible causes for these timeouts in the logs:


- CDT/atl is somehow terminated so that it does not log out properly.
- The workstation is so busy that it does not send the messages in time. The session is then
probably restored within 2 minutes.
- Workstation restart: a new session should follow shortly. In this case at least number 1234 did
not reconnect (to CEM, at least), so this might be a network problem.
- Network problems. Then the session might be restored within 2 minutes.
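
The timeout sequences above can be told apart automatically. The sketch below is an illustration only: it assumes each log entry sits on one physical line, treats the quoted value such as 'i109' as the channel identifier, and uses regular expressions derived solely from the sample lines shown in this section, so real logs may need adjusted patterns.

```python
import re

# Patterns derived from the sample CEM lines in this section (assumed layout).
TIMEOUT = re.compile(r"REC_Timeout .*\('(?P<chan>[^']+)',\)")
RECOVERY = re.compile(r"_CHANNEL_RECOVERY_ = .*'_ID': '(?P<chan>[^']+)'")
LOST = re.compile(r"LIBIPCServer\.OnLostChannel : \('[^']+', '(?P<chan>[^']+)'")


def classify_timeouts(lines):
    """Map each channel that timed out to 'pending', 'recovered' or 'lost'.

    The state reflects the last relevant event seen for that channel.
    """
    state = {}
    for line in lines:
        if (match := TIMEOUT.search(line)):
            state[match["chan"]] = "pending"
        elif (match := RECOVERY.search(line)) and match["chan"] in state:
            state[match["chan"]] = "recovered"
        elif (match := LOST.search(line)) and match["chan"] in state:
            state[match["chan"]] = "lost"
    return state
```

Channels that end up as 'lost' correspond to case 3 above and are the ones worth reporting; 'recovered' channels correspond to case 2.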

LOGS IN BCM 7.0
(initial version)

In BCM 7.0, logs can be configured in a versatile way to increase relevancy and avoid collecting
unnecessary files.
For example, it is possible to set the logging level of one location or category to debug while
other parts of the code are logged at error level only.

In addition to the legacy BCM log file format, two new formats have been added: SAP Generic Log
File (GLF) and List Log. The GLF format enables the use of SAP Log Viewer and SAP Solution
Manager for analyzing and managing logs.

Error types
Almost all modules will use the above logging mechanism, file formats and configuration style
(including CEM, MRS, CD etc. since 7.0 SP1).

Thus, in BCM-formatted logs all errors are marked ERR> (EXC> should disappear, with ERR> printed
instead). If you choose GLF formatting, errors are marked error. Warnings are marked WRN> in BCM
format and warning in GLF.

Features
Configure the logging function via the Windows registry, using values whose names start with Log
followed by attributes delimited with periods, such as
Log<ObjectType>.<ObjectIdentifier>.<Attribute>.

Modules collect information in log files that are named with the syntax:
<module-name>_<computer-name>_<virtual-unit-name>_yyyymmdd[_hh][_nnn].log

Log files are collected in the folder that is defined during installation as Log File Directory of the Virtual
Unit installation variable. The default value is $VU_HOME$\logs.
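
The file name syntax above can be parsed when sorting or archiving log files. The following is a minimal sketch under stated assumptions: module, computer and virtual unit names contain no underscores, hh is a two-digit hour and nnn a three-digit sequence number. The example file name in the usage note is hypothetical.

```python
import re
from datetime import date

# <module-name>_<computer-name>_<virtual-unit-name>_yyyymmdd[_hh][_nnn].log
LOG_NAME = re.compile(
    r"^(?P<module>[^_]+)_(?P<computer>[^_]+)_(?P<vu>[^_]+)"
    r"_(?P<date>\d{8})(?:_(?P<hour>\d{2}))?(?:_(?P<seq>\d{3}))?\.log$"
)


def parse_log_name(file_name):
    """Split a BCM 7.0 style log file name into its parts, or return None."""
    match = LOG_NAME.match(file_name)
    if match is None:
        return None
    stamp = match["date"]
    return {
        "module": match["module"],
        "computer": match["computer"],
        "virtual_unit": match["vu"],
        "date": date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:])),
        "hour": int(match["hour"]) if match["hour"] else None,
        "sequence": int(match["seq"]) if match["seq"] else None,
    }
```

For instance, parse_log_name("CEM_SERVER01_VU1_20120315_08_001.log") yields the module CEM, the computer SERVER01, the virtual unit VU1, the date 15 March 2012, hour 8 and sequence 1.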

Logging Categories and Locations


Logging categories make it possible, for example, for database administrators to receive only
database-related log files while network administrators follow the network logs.

Log Levels

Level     Log Writing Includes
always    Messages that should always be printed, such as process startup notifications.
error     An unrecoverable error has occurred. Some or all functions of the module are
          inoperable or perform incorrectly. Often indicates malfunctioning hardware, major
          misconfiguration, or a bug in the product.
warning   Warning messages indicate problems that are somehow recoverable or are in less
          important functions. For example, minor misconfigurations, performance warnings,
          temporary unavailability of a service.
info      Informative messages about business-level or service-level objects.
trace     Messages to trace the execution of the code in more detail. Mostly contains
          information useful for developers only, showing how execution of the code occurred.
debug     Most detailed messages about the internal execution logic of the code. Mostly
          contains information useful for developers only. Often used for tracing rare issues
          or bugs in the product.

The global log level is the root level and cannot be set to inherit. Log targets (log files,
console etc.) and logging modules (locations and categories) inherit the global log level by
default, but they can also be assigned their own level setting. Some logging modules may be child
modules of another logging module; the level they inherit is then the level filter of the parent
logging module. Threads are set by default to inherit, that is, to follow the other effective level
filters, but they may also be given their own thread-local log level filter setting. By combining
these level filters and settings, you can control which specific parts of the code, threads or
types of events produce log entries.

Example

A server module that prints debug-level log to file and error-level log to console. Also, entries from
LibIpc code location are narrowed down to error-level. The registry of the module (module's own
registry key) would have the following registry value names and value data (shown as
ValueName=ValueData):
LogLevel=debug
LogConsoleLevel=error
LogModule.LibIpc.Level=error
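
The effect of these three values can be worked through with a small model. This is a sketch of the inheritance idea only, not of the actual BCM implementation: it assumes that an entry is written when it passes both the module filter and the target filter, it ignores parent modules and thread-local filters, and the module name Cem is a hypothetical example.

```python
# Levels ordered from least to most verbose, as in the table above.
LEVELS = ["always", "error", "warning", "info", "trace", "debug"]


def effective_level(settings, value_name):
    """Return the level set under value_name, else inherit the global LogLevel."""
    return settings.get(value_name, settings["LogLevel"])


def passes(entry_level, filter_level):
    """An entry passes a filter that is at least as verbose as the entry."""
    return LEVELS.index(entry_level) <= LEVELS.index(filter_level)


def is_written(settings, entry_level, module, target_value_name=None):
    """Assumed rule: the entry must pass the module filter and the target filter.

    With target_value_name=None the target simply inherits the global level,
    as the log file does in the example above.
    """
    module_filter = effective_level(settings, f"LogModule.{module}.Level")
    if target_value_name is None:
        target_filter = settings["LogLevel"]
    else:
        target_filter = effective_level(settings, target_value_name)
    return passes(entry_level, module_filter) and passes(entry_level, target_filter)


settings = {
    "LogLevel": "debug",
    "LogConsoleLevel": "error",
    "LogModule.LibIpc.Level": "error",
}
print(is_written(settings, "debug", "Cem"))                     # file target
print(is_written(settings, "debug", "Cem", "LogConsoleLevel"))  # console target
print(is_written(settings, "debug", "LibIpc"))                  # narrowed module
```

Under these assumptions, a debug entry from an ordinary module reaches the file but not the console, and debug entries from LibIpc are dropped everywhere.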

SOLUTIONS IN USE BY CUSTOMERS
One large German customer uses the WhatsUp program, although this just collects SNMP traps and
monitors active virtual IPs.

One Swedish customer uses an application called www.pingplotter.com that pings the VUs every 10
seconds. It sends email alerts according to the rules that have been set up. They also use
Pingplotter for monitoring (pinging) the Internet, servers and WAN servers. By pinging across the
WAN (or LAN), you can see historical graphs of ping times and jitter.

Solutions from BaseN & Noval have been used for monitoring the network infrastructure and to spot
possible problems with connections and/or quality. They can also simulate RTP traffic and, if
necessary, even monitor the connections from certain sites/workstations.

One Finnish partner has been using test robots (built by Goodsign, partly based on our previous
WSAM) which use the ClientCore phone to make test calls from remote sites every x minutes and, at
the same time, test logon/logoff and a simple query over HTTP. They also measure the time taken, so
that if delays suddenly grow they can alert someone to have a look even before
“everything explodes”. With the test calls they can validate “the whole chain”, i.e. a call from a
softphone through a gateway to a service pool or voicemail box.

One Finnish partner uses an application called CastleRock. The two screenshots referenced here show
examples of the interface and display.

[Screenshot: CastleRock SNMP network view]

[Screenshot: CastleRock SNMP SAP BCM servers view]

Call Robot
One option would be to use Call Robots to check service availability as experienced by phone users,
e.g. by utilizing the ClientCore interface of BCM. The robot could make automated calls with a given
pattern and confirm that the system responds as expected. If any anomalies were found, the robot
could generate an alert to notify the administrators.

In a more advanced scenario, the robot could, in addition to calling, make simple test queries to
the web server to confirm that the web sites and databases respond without unexpected delays.
The design of the automated robots can vary depending on what you want to monitor and measure.
An important factor is not to create too much load through the testing itself.
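
The web query part of such a robot can be sketched as a timed probe. This is a minimal illustration and not an existing BCM tool: it does not cover the ClientCore call part, the URL passed to http_probe would be site-specific, and the rule of alerting when a response takes more than twice the recent average is an assumption chosen for the example.

```python
import time
import urllib.request


def timed(probe):
    """Run probe() and return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        probe()
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, time.monotonic() - start


def http_probe(url, timeout=5.0):
    """Build a probe that fetches the first byte of url (url is site-specific)."""
    def probe():
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read(1)
    return probe


def evaluate(succeeded, elapsed, history, factor=2.0):
    """Return an alert text, or None when the result looks normal.

    history holds earlier response times in seconds; the alert rule is an
    assumed example, not a recommended threshold.
    """
    if not succeeded:
        return "probe failed"
    if history and elapsed > factor * (sum(history) / len(history)):
        return f"slow response: {elapsed:.2f}s"
    return None
```

Running the probe only every few minutes keeps the extra load on the system small, in line with the note above.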
© Copyright 2012 SAP AG. All rights reserved. These materials are subject to change without notice.
