ITA Zabbix Reference

17/06/2014

01 Network Related
01.001 Unreachable
The host cannot be reached by ICMP ping from our monitoring server.

Action: You must carefully check why the host can't be reached.
Contact: POC of the device
01.002 Smokeping Packet Loss detected
Smokeping sends 20 ICMP pings to one host. If a certain number of them do not
return you will get this alarm. This means that the host is, generally speaking,
UP, but doesn't have a 100% functional communication line to the monitoring
server.

Action: Carefully check where the packet loss is initiated. Try mtr from noc1 to
the host. Also check the Zabbix packet loss graph to see if this started at a
specific point.
Contact: Core and/or Infrastructure.
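
The loss rule described above can be sketched as follows. This is only an
illustration: the function names and the max_lost threshold are assumptions,
since this reference only says that "a certain number" of the 20 pings must go
missing before the alarm is raised.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Return the share of probes that did not come back, in percent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def loss_alarm(sent: int, received: int, max_lost: int) -> bool:
    """True when more than max_lost probes are missing but the host still answers.

    max_lost is a hypothetical threshold; a host that answers no probe at all
    is an "Unreachable" case (01.001) rather than a packet loss case.
    """
    lost = sent - received
    return received > 0 and lost > max_lost


# 18 of 20 pings returned: the host is up, but the line is not 100% functional.
print(packet_loss_percent(20, 18), loss_alarm(20, 18, max_lost=1))
```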

01.003 Smokeping High Latency Warning (local/national/international..)

Smokeping sends 20 ICMP pings to one host and also measures the average
round-trip time of each ping. On a local network (within Luanda or similar) we
expect a very low latency (e.g. <30 ms), nationally (like Luanda <> Benguela)
it can be up to 55 ms, and internationally it can be several hundred
milliseconds.

This alarm shows that the actual latency is higher than we expect.

Action: Verify where the latency is coming from and also test carefully for
packet loss.
Contact: Infrastructure/Core
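
A minimal sketch of the comparison behind this alarm. The local and national
limits are the values quoted above; the international limit of 300 ms and all
names in the code are assumptions made for illustration, not the values
configured in the monitoring system.

```python
from statistics import mean

# Expected upper bounds in milliseconds. The local and national values come
# from the text above; the international value is an assumed placeholder.
EXPECTED_MAX_MS = {
    "local": 30.0,
    "national": 55.0,
    "international": 300.0,  # assumption: "several hundred" is not a fixed number
}


def latency_too_high(scope: str, rtts_ms: list[float]) -> bool:
    """Return True when the average round-trip time exceeds the expectation."""
    if not rtts_ms:
        raise ValueError("no round-trip samples; check for packet loss instead")
    return mean(rtts_ms) > EXPECTED_MAX_MS[scope]


# A national link averaging 75 ms is above the 55 ms expectation.
print(latency_too_high("national", [70.0, 80.0]))
```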

01.004 Low Throughput on backhaul links
The trigger measures the actual bandwidth/throughput of a link (== how much
data is currently flowing).
If this is lower than expected you will get this error. The expectation depends
on the link. Check the severity.

Action: Verify if there is any problem. If it is on a radio link, check the
radio parameters and/or other alarms.
Contact: Infrastructure and/or Core

01.005 Low Throughput on International Links
The trigger measures the actual bandwidth/throughput of a link (== how much
data is currently flowing).
If this is lower than expected you will get this error. The expectation depends
on the link. Check the severity. For a planned outage, however, you should be
aware beforehand (check email)! If not, this should be escalated immediately to
the core team and to the link provider (e.g. Angola Telecom/Angola Cables,
satellite provider).

Action: Contact Core + provider
02 Routing Device Related (Cisco/Juniper)
02.001 CPU Temperature high
If the temperature of a CPU goes above a certain level, the router might get
damaged or might shut down automatically.

Action: Verify the temperature by logging into the router. Check the room
temperature sensors (and possibly the intake temperatures of the same device).
Likely the air conditioning failed in the room.

Contact: Infrastructure Department (in case of AC failure), Core for other
reasons (like fan errors etc.)
02.002 Fan Errors or Malfunction
Fans keep the air inside a router circulating. Keeping them functional is
critical to keep a device cool. If fans are malfunctioning you will likely see
the CPU temperature going up.

Action: Verify the fans by logging into the router
Contact: Core
02.003 Intake/Inlet Temperature High
This is the temperature of the air entering the router for cooling. It should
be the same as or similar to the normal room temperature. In an air-conditioned
environment this should not exceed 27 degrees.

Action: Verify the room temperature or temperature sensors, verify the CPU
temperature of the device
Contact: Infrastructure
02.004 CPU Load
If a router is very busy handling requests, the CPU load will increase. If it
is too high, the router will start to malfunction (e.g. dropping packets etc.),
because its CPU is not strong enough to handle the requests.

Normally the CPU load should stay below 60%. We have a serious problem if it is
above 90%.

Action: Verify alarm
Contact: Core
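
The two limits above translate into a simple classification. The sketch below
is illustrative only; the label names ("normal", "elevated", "serious") are
assumptions and do not correspond to Zabbix severities defined in this
reference.

```python
def classify_cpu_load(percent: float) -> str:
    """Map a CPU load percentage onto the ranges described in the text."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("CPU load must be between 0 and 100 percent")
    if percent > 90.0:
        return "serious"   # above 90%: serious problem
    if percent >= 60.0:
        return "elevated"  # no longer in the normal range below 60%
    return "normal"


for load in (35.0, 72.5, 96.0):
    print(load, classify_cpu_load(load))
```
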
02.005 Reboot Detected
We detected that a device rebooted. This could have many causes, e.g. someone
doing maintenance, power problems, or router hardware or software problems. A
reboot must be checked carefully and action must be taken!

Action: Verify the reboot (check uptime)
Contact: Core and/or Infrastructure
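
One common way to detect a reboot from uptime readings is to compare two
consecutive polls: if the uptime went down, the device restarted in between.
The sketch below assumes uptime values in seconds and a hypothetical function
name; it is not necessarily how the trigger in our Zabbix is defined.

```python
def reboot_detected(previous_uptime_s: float, current_uptime_s: float) -> bool:
    """Return True when the uptime counter went backwards between two polls.

    Caveat: a 32-bit counter in hundredths of a second, such as SNMP
    sysUpTime, wraps after roughly 497 days, and this simple comparison
    would report that wrap as a reboot too.
    """
    return current_uptime_s < previous_uptime_s


# The device reported 10 days of uptime, then 5 minutes: it rebooted.
print(reboot_detected(10 * 86400, 5 * 60))
```
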
02.006 Outlet Temperature
Normally not measured, but this is the outlet temperature of a device.
The air flows intake > inside (CPU) > outlet, so the outlet temperature should
be higher than the inlet temperature.

Action: See 02.003
02.007 Temperature Check
A combined temperature check done by the router.

Action: Verify by logging in, check the temperature of other devices in the
same POP
Contact: Infrastructure
02.008 Chassis Temperature
The temperature inside the chassis of the device.

Action: Verify by logging in, check the temperature of other devices in the
same POP
Contact: Infrastructure


03 Radio Problems
03.001 Low Signal/Low RSL
The signal level received is not as good as expected. This can cause link
degradation. Also check the severity; it might depend on the actual signal
level seen.
This trigger can be seen on various technologies: microwave, VSAT.

Many high frequencies are sensitive to rain. If there is currently heavy rain,
there is no need to escalate this!

Action: Log in to the device, confirm the signal degradation. Check for actual
packet loss.
Contact: Infrastructure or VSAT
03.002 Low Capacity/Bitrate
It has been detected that, due to the radio conditions, a link (most likely a
P2P link) does not have the expected or required bandwidth to support the
network.

Action: Check the radio for signal degradation, verify the problem, check for
packet loss
Contact: Infrastructure
03.003 Errors on Microwave Radio Link (ES/SES/UAS)

The error rate is higher than expected.

This can be caused by bad weather/rain.

Action: Escalate to Infrastructure Department

04 Power
04.001 No network Power
Applicable to UPS systems, this indicates that there is currently no input
power and the system is running on batteries. Under normal circumstances this
means there is no power from the municipality and the generator did not start!

Action: Log in to the UPS, verify the alarm, check battery voltage level and
uptime.
Contact: Infrastructure Team. If in the provinces, the province standby team.
04.002 Battery voltage low
There is no input power [04.001] and additionally the battery voltage is low,
meaning the site might go down very soon.

Action: Verify the alarm, take highest-priority action to contact the
infrastructure team to go to the site and get the generator started
Contact: Infrastructure
04.003 UPS Alarms/Major/Minor/Critical

The UPS reports an alarm of the mentioned severity.

Action: Log in to the UPS and read the alarm description. If critical, escalate
immediately to the Infrastructure Team

04.100 Genset Battery Voltage

The measured battery voltage of a generator is low. That can leave the
generator unable to start on power cuts!

Action: Contact Infrastructure to check
04.101 Genset Maintenance Due

The generator has been running for a long time and it must go for maintenance.

Action: Email Infrastructure. Make sure they reset the counter after the
maintenance has been performed. Then this alarm will disappear.
04.102 Genset Engine Running

The generator is currently running.
04.103 Genset Engine running for >48h

The generator has been running without a stop for more than 48 hours.

Action: Escalate to Infrastructure to verify with the property owner whether
there is a power problem.

05 Radio Access Network Related
05.1XX Alvarion/Telrad 4Motion Related
05.101 AAA Low Latency Disabled
The base station requires a low latency to the AAA server unless a specific
setting is set. Unfortunately the setting does not survive a reboot of the base
station. If you see this alarm in the provinces we have a problem.

Action: Enter the BS using Telnet and enter the following commands:

NPU login: root
Password:

npu# conf t
npu(config)# authenticatoreaptransferinterval5000
npu(config)# exit
npu#

If the error does not disappear, escalate to the systems team! URGENT
05.102 AAA Switched
A base station authenticates CPEs using RADIUS to a AAA server. We have two of
them, one at lda6 and one at lda11 (aaa1.lda6, aaa1.lda11).

Now the base station has decided to use the alternative AAA server for
authentication.

Action: Check why this happened and whether other base stations decided to do
the same. Watch carefully for any events or complaints on the WiMAX link. Watch
the affected BS carefully for registered subscribers.
Contact: Core (network issues) or Systems (AAA issues)
05.103 No subscribers on a specific sector
Almost all of our base stations have multiple sectors pointing in different
directions. This error indicates that there is one sector that does not have
any CPEs connected.
This error can happen on very empty base stations (for example on weekends or
during power cuts), but must not happen on busy ones!

Action: Log in to the base station and verify the number of subscribers
(shmsinfo). Also check on Zabbix (click on the trigger > Simple Graph) the
number of usually connected CPEs. Log in to AlvariSTAR and check for errors on
the BS.
Contact: Infrastructure
05.104 Low GPS Satellite Count
The base station needs a working GPS to keep the subscriber units synchronized
over the network. If the GPS cannot see enough satellites for a long time, the
system might get out of sync and malfunction.

Action: Check the weather conditions; this can be normal in heavy rain. If not,
escalate to Infrastructure in business hours.
Contact: Infrastructure
05.105 ODU errors

The system detected an ODU (Outdoor Unit) error. This likely affects services.

Action: Log in to the base station and escalate immediately if confirmed.
Contact: Infrastructure
05.106 AU Error

The system detected an AU error. This likely affects service.

Action: Log in to the base station and escalate immediately if confirmed.
Contact: Infrastructure
05.107 MS Counter Difference

The number of registered CPEs on a base station suddenly changed. This can be
normal in many cases (for example shortly before 8 am when businesses open),
but if this happens repeatedly and on other base stations too, urgent attention
is required.

Action: Check the Simple Graph of registered CPEs and see whether it is
abnormal or not.
Contact: Infrastructure
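
A sudden change can be expressed as the relative difference between two
consecutive readings of the registered CPE count. The sketch below is an
assumption-laden illustration: the 30% limit and the function name are
hypothetical, and this reference does not state how the real trigger is
calculated.

```python
def sudden_change(previous: int, current: int,
                  max_relative_change: float = 0.3) -> bool:
    """Return True when the CPE count moved by more than the allowed share.

    max_relative_change is an assumed threshold (0.3 == 30%).
    """
    if previous < 0 or current < 0:
        raise ValueError("CPE counts cannot be negative")
    if previous == 0:
        # No baseline to compare against: any registration counts as a change.
        return current > 0
    return abs(current - previous) / previous > max_relative_change


# 100 registered CPEs dropping to 60 is a 40% change.
print(sudden_change(100, 60))
```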
05.108 No MS Registrations
There is not a single CPE registered to the entire base station! Except on very
new base stations, this normally indicates a problem.

Action: Verify the base station for errors, verify the bearer network, verify
the AAA
Contact: Infrastructure and/or Core
05.120 AAA Processing Requests
This monitors the AAA server(s) required for CPE authentication. If a AAA
server should stop working you should see a lot of [05.102].

If this happens, escalate to the systems team, as it is highly important that
we have both servers operational at all times.

Action: Check for possible network issues that could have caused this.
Contact: Core (if network issues) or Systems!
05.200 Alvarion/Telrad FDD Micro Base Stations
05.201 ODU not operational
The Outdoor Unit reports that it is not operational. This most likely affects
service!

Action: Log in and check the ODU status or any log messages
Contact: Infrastructure
06 Server
06.001 Low Memory
The server reports low memory (RAM). This will likely impact the performance of
the server and might even make it unresponsive!

Action: Contact Systems department
06.002 High System Load
The busyness of a UNIX server is indicated by the system load (== the number
of processes in the scheduler's run queue). The expected value can be very
different from server to server depending on what tasks they run. Mostly it is
a value of about 3.0; m3ms can go up to 30 without performance problems. If
this trigger is up for a while and the severity is HIGH, escalate.

Action: If possible, check if the system is responsive. If in any doubt,
escalate according to the severity.
Contact: Systems

06.003 Standard TCP port test
Most servers provide network services reachable over TCP. Several tests exist
to test if a TCP port is up and listening. You can verify all ports using
TELNET.

http    TCP/80    Web services
https   TCP/443   Web services over SSL
imap    TCP/143   Mailbox access
imaps   TCP/993   Mailbox access over SSL
pop3    TCP/110   Mailbox access
pop3s   TCP/995   Mailbox access over SSL
smtp    TCP/25    Simple Mail Transfer Protocol
smtps   TCP/465   Simple Mail Transfer Protocol over SSL
ssh     TCP/22    Secure Shell (administration)

(there might be others)

Action: Check if it affects customers
Contact: Systems
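
As an alternative to a manual TELNET check, a TCP connect test can be sketched
with the Python standard library. The function name and the example host name
are placeholders, not real servers from this reference; the sketch only shows
that something is listening on the port, not that the service behind it works.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # "mail.example.org" is a placeholder host name.
    for name, port in (("smtp", 25), ("imap", 143), ("https", 443)):
        print(name, port, tcp_port_open("mail.example.org", port))
```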
06.004 IPA Replication Down
IPA is the system we use for centralized authentication. Any change replicates
to various other servers across the Group (ITA/ITZ/ITN). This test checks
whether the replication works.

Action: Contact Systems!
06.005 Mailserver Mailq

This monitors the current length of the queue of mails waiting to be delivered.
If this exceeds a certain number, it is most likely caused by malware that
abuses the server to relay spam. This cannot really be avoided automatically
because such malware normally uses valid authentication of a hacked account.

Requires immediate action by the systems team to find the hacked mailbox, block
it and delete the spam mails from the queue. If we do not take action the mail
server will end up on a blacklist and all mail can get blocked. Disaster!

Action: Contact Systems!
06.006 Mailserver Test

This is a script testing the general functionality of a mail server. If raised,
the mail server is most likely not functioning properly and customers are
affected.

Action: Contact Systems immediately!
06.007 Webmail Test

This is a script that emulates the usage of the webmail (e.g.
http://webmail.maxnet.ao). If it fails, the webmail is most likely broken. As
customers likely use this, immediate action is required.

Action: Contact Systems
06.008 Backups
The servers are configured to perform automatic backups to remote locations. A
script checks if the backup was made successfully. If not, this requires
attention by the systems team.

Action: Contact the systems team during the next office hours. The alarm will
disappear once a recent backup is found.
06.100 MBMS related
06.101 mgraphpoll
This is a periodic script that polls the gateway routers and creates bandwidth
graphs. All the graphs (except the iDirect ones) are created and updated by
this script and are accessible via m3ms, the customer portal and the reseller
portal.

If this script does not run at least once every 7 minutes we have gaps in the
graphs! This has to be avoided!

Action: Contact the systems department as quickly as possible!
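
The 7-minute rule can be sketched as a freshness check on the script's last run
time. This is an illustration under assumptions: the names are hypothetical and
the way the real check obtains the run timestamps is not described in this
reference.

```python
from datetime import datetime, timedelta

# Maximum allowed time between two runs before graphs show a gap (from the text).
MAX_INTERVAL = timedelta(minutes=7)


def poll_overdue(last_run: datetime, now: datetime,
                 max_interval: timedelta = MAX_INTERVAL) -> bool:
    """Return True when the last run is older than the allowed interval."""
    return now - last_run > max_interval


def find_gaps(run_times: list[datetime],
              max_interval: timedelta = MAX_INTERVAL) -> list[tuple[datetime, datetime]]:
    """Return (start, end) pairs of consecutive runs that are too far apart."""
    ordered = sorted(run_times)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


start = datetime(2014, 6, 17, 12, 0)
runs = [start, start + timedelta(minutes=5), start + timedelta(minutes=15)]
print(find_gaps(runs))  # one gap: between the 12:05 and the 12:15 run
```
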
06.102 idgraphpoll
This is the equivalent script to [06.101] for iDirect. It will poll the iDirect
hubs and create graphs for each iDirect remote.

If this script does not run at least once every 7 minutes we have gaps in the
graphs! This has to be avoided!

Action: Contact the systems department as quickly as possible
06.103 idgraphdiscover
This is a script running periodically. It polls the iDirect hubs looking for
new/unknown iDirect remotes (modems) and, if any are found, will update the
mbms database.

If it doesn't run for a while you get the error, and as a result newly
commissioned links will not appear in m3ms.

Action: Contact Systems in the next business hours
06.104 idgraphupdate
This is a script running periodically. It checks the local iDirect remotes
(modems) database, then polls the hub and looks for updates such as the iDirect
name, Ethernet address and a few other things.

If it doesn't run we might get some outdated (non-critical) information in m3ms

Action: Contact Systems in the next business hours
06.105 mbmsmailsender
On many occasions mbms is able to send mails. One example is the ticketing
system. All mails generated are first saved to a spool database, then they get
sent out via a script.
If the monitoring system detects that the spool database has slightly old data
which has not been sent, this alarm will go on.

Be aware that ticket replies are NOT SENT if this alarm is on.

Action: Contact systems department
06.106 mgraphmonitor
This is the script that monitors all customer circuits defined in m3ms. It is
supposed to run about every 5 minutes and will try to detect, using various
methods, if a circuit is up or down. It will then mark the circuit as either
UP/DOWN or, if it can't monitor it, as UNMONITORED.

If this script doesn't run you will see outdated information regarding a
customer circuit status. Customers/resellers also have access to this
information on the customer portals, and it might confuse them.

Action: Contact systems department

06.107 sugraphpoll
This script polls the Alvarion/Telrad FDD WiMAX base stations for the signal
levels of the CPEs. It will then create the signal level graph, which also
influences the signal level indicator and the color of the dot in the radio
map. This information is for internal use only.

If this script doesn't run we will get gaps in the graphs and see outdated
information in m3ms.

Action: Contact Systems
06.108 sugraph4xml
This is the equivalent to [06.107] but for the Alvarion/Telrad 4Motion base
stations. In this case the information is extracted from a performance XML
file which is created every 15 min by the base station.

If this script doesn't run there will be no gaps in the graphs because it keeps
file history. So it's not critical.

Action: Contact Systems
06.109 sugraph4mnpu
A periodic script that polls the Alvarion/Telrad 4Motion base stations' NPU for
a list of connected CPEs. It then performs various updates on the database,
e.g. to which sector/base station a CPE is connected, discovering new CPEs etc.

If this script doesn't run we might have outdated information in the database.
Mostly uncritical.

Action: Contact Systems
06.110 sugraphbsgraph
This periodic script takes all CPE signal levels, averages them per base
station and creates a graph. This graph can be seen in m3ms tools > wimax >
base station statistics

If the script doesn't run we might get gaps in the graph. The information is
for internal use only.

Action: Contact Systems
06.111 Postgresql
PostgreSQL is the database we use for mbms. Everything links here; if
PostgreSQL is down many things will not work: m3ms, customer/reseller portal,
email service, DNS update, all the mbms scripts etc. DISASTER. Escalate
immediately at any time.

Action: Contact Systems immediately.

07 Infrastructure
07.001 Battery Voltage Power Generator

The measured battery voltage of a generator is low. That can leave the
generator unable to start on power cuts!

Action: Contact Infrastructure to check
07.002 Maintenance Due

The generator has been running for a long time and it must go for maintenance.

Action: Email Infrastructure. Make sure they reset the counter after the
maintenance has been performed. Then this alarm will disappear.

08 ITA Office
08.001 SIP Peer unreachable

This is related to our PABX. It is interconnected with the provider using SIP
trunks. Phones are also connected using SIP.
The trigger monitors important SIP trunks that must not fail. For example,
Mundo Startel provides us with our main office telephone number. If that one is
down we can't place or receive calls!

Contact: Systems Department, Severity: Depending
