
June 28, 2016

CUSUM Anomaly Detection

By Kinga Farkas

CUSUM Anomaly Detection | Farkas

Abstract
Introduction
Background
Anomaly Detection in Network Traffic Flows
CUSUM Charts
CUSUM Anomaly Detection (CAD)
Applying CUSUM Charts to Internet Performance Variables
CAD's Design
Overview
Sliding Windows
Finding the Training Set
Applying the CUSUM Chart
Interpreting the CUSUM Chart Results
Tuning Parameters
Examples and Results
Example 1: CAD Applied to the Time Series from the ISP Interconnection and its Impact on Consumer Internet Performance Study
Example 2: CAD Applied to Data collected from Iran before, during and after the 2013 Iranian Presidential Elections
Evaluation of CAD's Performance
Conclusion
2016 Measurement Lab Consortium


Abstract
Today, researchers can collect data on a wide range of indicators related to Internet access, speed, and latency. What can we learn from all this data? There is an increasing need for analysis that uses automated methods to sift through the data and uncover unusual patterns, outliers, and anomalous sequences.

The CUSUM anomaly detection (CAD) method is based on CUSUM statistical process control charts. CAD is used to detect anomalous subsequences of a time series that show a subtle shift in the mean relative to the context of the sequence itself. CAD was applied in order to look for anomalies in M-Lab's database of Network Diagnostic Test (NDT) results. CAD's success is based on the observation that the NDT time series can be viewed as being composed of varying-length subsequences of real-valued random variates, where each of these subsequences corresponds to a normal distribution with a specific mean and standard deviation.

We describe the basic design of CAD, illustrate how it functions by applying it to time series from M-Lab's NDT database that contain known anomalies, and demonstrate its effectiveness by showing that CAD successfully and automatically detected each of the Internet performance degradation incidents with very few false negatives or false positives.

Introduction
M-Lab's mission is "to advance network research and empower the public with useful information about their broadband and mobile connections. By enhancing Internet transparency, M-Lab helps sustain a healthy, innovative Internet." [1]

The CUSUM anomaly detection algorithm addresses the need for an automated method of searching M-Lab's vast database of Network Diagnostic Test (NDT) results not for single outlier points, but for series of unusually high or low measurements. This project was developed during the course of a three-month-long Outreachy [2] internship at Measurement Lab in the summer of 2015.

One of the most important features of the algorithm is finding and defining the normal pattern for the time series, relative to which deviations can be classified as anomalies. Using a sliding window technique, the statistically significant shifts in the mean are detected relative to the normal pattern (the training set). The output of the algorithm is the list of potential anomalies along with the corresponding plot of the time series and its anomalies.

[1] Measurement Lab, About, http://www.measurementlab.net/about.
[2] GNOME Foundation, Outreachy, https://www.gnome.org/outreachy/.


Background
Anomaly Detection in Network Traffic Flows
There have been several attempts to characterize time series of network traffic flows to detect anomalies, which include outages, abuse, or Internet filtering. Anomaly detection is becoming an increasingly studied field, given the central role that the Internet plays in global communications. These methodologies vary from a symbolic representation of time series to an automated detection of Internet filtering [3].

In their paper, A symbolic representation of time series, with implications for streaming algorithms, J. Lin, E. Keogh, S. Lonardi, and B. Chiu [4] make an attempt to create a representation of time series that allows dimensionality/numerosity reduction, and that also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series.

In Visualizing and discovering non-trivial patterns in large time series databases [5], the same authors describe a time series pattern discovery and visualization system, VizTree, based on augmenting suffix trees. VizTree visually summarizes both the global and local structures of time series data at the same time. This provides solutions to motif discovery, anomaly detection and query content.

P. Barford and D. Plonka [6] describe their work of collecting and analyzing network flow data by using the FlowScan open source software. The goal of their work is to identify the statistical properties of anomalies and, if they exist, their invariant properties.

The most comprehensive and elaborate work in detecting Internet filtering from geographic time series is presented by J. Wright, A. Darer and O. Farnan in their paper, Detecting Internet filtering from geographic time series [7]. The goal of their work is to identify global patterns of Internet filtering through technical network measurements and to link these events to their social context. Their approach to detecting anomalies is based on principal component analysis. It should be pointed out that the goal of CAD is very similar to theirs, but the approach is very different and based on statistical process control.

[3] J. Wright, A. Darer, O. Farnan, Detecting Internet filtering from geographic time series, Oxford Internet Institute, July 21, 2015. http://arxiv.org/abs/1507.05819.
[4] J. Lin, E. Keogh, S. Lonardi, B. Chiu, A symbolic representation of time series, with implications for streaming algorithms, DMKD, June 13, 2003, San Diego, CA, USA.
[5] J. Lin, E. Keogh, S. Lonardi, Visualizing and discovering non-trivial patterns in large time series databases, Information Visualization (2005), 4, 61-82.
[6] P. Barford and D. Plonka, Characteristics of network traffic flow anomalies, C.S. Department at the University of Wisconsin, Madison (2001).
[7] J. Wright, A. Darer, O. Farnan, Detecting Internet filtering from geographic time series.


CUSUM Charts
Statistical control charts are graphs that are used to show how a process changes over time. All statistical control charts have a center line for the average and an upper control and a lower control line. These lines are based on historical values of the process mean and standard deviation. An out-of-control process will have points on the chart that land above the upper control line or below the lower control line. [8]

The CUSUM (cumulative sum) control chart is a statistical control chart used to track the variation of a process [9]. It is a method that is able to detect small shifts in the process mean. The CUSUM chart uses four parameters:
1. the expected mean of the process, μ
2. the expected standard deviation of the process, σ
3. the size of the shift that is to be detected, k
4. the control limit, H
The expected mean and standard deviation are defined to be the historical mean and standard deviation of the process when the process is normal and in statistical control [10]. The parameter k determines the slack that is allowed in the process; its usual value is about 0.5σ. The parameter H is the threshold for the process; its value is usually set to 5σ.

The CUSUM chart works by tracking the individual cumulative sums of the positive and negative deviations from the mean, the high and low sums respectively.
The high sum is given by the recursive sequence

$$S_i^+ = \max\{0,\; S_{i-1}^+ + x_i - \mu - k\}, \quad S_0^+ = 0, \quad \text{for } i = 1, 2, \ldots, N,$$

whereas the low sum is defined by

$$S_i^- = \min\{0,\; S_{i-1}^- + x_i - \mu + k\}, \quad S_0^- = 0, \quad \text{for } i = 1, 2, \ldots, N.$$

Note that the parameter k does indeed provide for slack in the procedure, since $S_i^+ > S_{i-1}^+$ only if $x_i > \mu + k$, and $S_i^- < S_{i-1}^-$ only if $x_i < \mu - k$. [11] If either of the cumulative sums, $S_i^-$ or $S_i^+$, reaches the threshold H, the process is considered out of control.
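The two recursions can be sketched in a few lines of code. The Python below is only an illustration (the work described here used R's qcc package); the function names are hypothetical, μ, k and H are taken in the same units as the data, and it assumes that the low sum signals when it falls to −H, mirroring the high sum reaching H.

```python
def cusum_sums(xs, mu, k):
    """Return (high, low): the one-sided cumulative sums for the values xs."""
    high, low = [], []
    s_hi = s_lo = 0.0  # S_0^+ = S_0^- = 0
    for x in xs:
        # The high sum only grows when x exceeds mu + k.
        s_hi = max(0.0, s_hi + x - mu - k)
        # The low sum only falls when x is below mu - k.
        s_lo = min(0.0, s_lo + x - mu + k)
        high.append(s_hi)
        low.append(s_lo)
    return high, low


def out_of_control(high, low, h):
    """0-based indices where the high sum reaches h or the low sum reaches -h (assumed convention)."""
    return [i for i, (hi, lo) in enumerate(zip(high, low)) if hi >= h or lo <= -h]


high, low = cusum_sums([50, 52, 53, 47], mu=50, k=1)
signals = out_of_control(high, low, h=3)
```

With these four values, the high sum climbs while the observations stay above μ + k and resets to zero as soon as an observation falls far enough below the mean.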

As an example, consider the sequence $V = \{v_i\}_{i=1}^{240}$, where the $v_i$ are randomly selected from a normal distribution with mean μ = 50 and standard deviation σ = 3. This sequence then represents a process that is statistically stable; its departures from the target value, the mean μ = 50, are the expected variations due to chance. Next, suppose that the middle 40 elements, $\{v_i\}_{i=101}^{140}$, are replaced by $\{w_i\}_{i=1}^{40}$, where the $w_i$ are randomly selected from a normal distribution with mean μ = 54 and standard deviation σ = 3. The resulting sequence $V = \{v_i\}_{i=1}^{100} \cup \{w_i\}_{i=1}^{40} \cup \{v_i\}_{i=141}^{240} = \{v_i\}_{i=1}^{240}$ is no longer statistically stable because the subsequence mean changes drastically right in the middle of the sequence.

[8] Control Chart, ASQ, http://asq.org/learnaboutquality/datacollectionanalysistools/overview/controlchart.html.
[9] Keeping the Process on Target: CUSUM Charts, BPI Consulting, LLC, 2014, http://www.spcforexcel.com/knowledge/variablecontrolcharts/keepingprocesstargetcusumcharts.
[10] D.C. Montgomery, Introduction to Statistical Quality Control (John Wiley & Sons, 1991), 103.
[11] Keeping the Process on Target: CUSUM Charts, BPI Consulting, LLC, 2014, http://www.spcforexcel.com/knowledge/variablecontrolcharts/keepingprocesstargetcusumcharts.

The package qcc [12] in R contains an implementation of the CUSUM chart, the cusum function. Figure 1 illustrates the results as the cusum function is applied to a time series sequence V, created, also using R, according to the specifications described above. The CUSUM parameters were set to μ = 49.83 and σ = 2.89, the mean and standard deviation of V respectively, whereas the control limit was H = 5 and shift size was k = .

The time series sequence V and its CUSUM chart are plotted in Figure 1. Within the plot of V, the green and red horizontal lines indicate the mean of the original sequence V and of its anomalous subsequence $\{v_i\}_{i=101}^{140}$ respectively. The CUSUM chart of V (the bottom plot of Figure 1) detects the shift in the mean of V, since the upper sum $S_i^+$ reaches and surpasses H = 5 when i = 105. Note that in the qcc package implementation of the CUSUM chart, the term Decision interval refers to the parameter H, and the term Shift detection to the parameter k.

Figure 1: Plot of V and the CUSUM Chart for V
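The simulated example can be approximated without R. The sketch below is a Python stand-in, not the qcc run behind Figure 1: it builds a 240-point sequence with a shifted block at positions 101 to 140 and lists where the high sum reaches the limit. The seed, the function name, and the choices k = 0.5σ and H = 5σ (with the target mean 50 rather than the sample mean) are assumptions made for illustration, so the exact alarm index will differ from the i = 105 reported above.

```python
import random


def upper_alarms(xs, mu, k, h):
    """0-based indices at which the high cumulative sum is at or above h."""
    s_hi, alarms = 0.0, []
    for i, x in enumerate(xs):
        s_hi = max(0.0, s_hi + x - mu - k)
        if s_hi >= h:
            alarms.append(i)
    return alarms


rng = random.Random(2016)  # fixed seed so the run is repeatable (arbitrary choice)
sigma = 3.0

# 240 in-control values, then replace elements 101..140 (1-based) by shifted ones.
v = [rng.gauss(50.0, sigma) for _ in range(240)]
v[100:140] = [rng.gauss(54.0, sigma) for _ in range(40)]

alarms = upper_alarms(v, mu=50.0, k=0.5 * sigma, h=5.0 * sigma)
in_shift = [i + 1 for i in alarms if 100 <= i < 140]  # 1-based positions inside the shifted block
print("first alarm inside the shifted block:", in_shift[0] if in_shift else None)
```

Because the shifted block raises the mean by more than one standard deviation, the high sum gains about 2.5 units per step there and should cross the limit within a handful of observations.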

[12] L. Scrucca, qcc: an R package for quality control charting and statistical process control, R News 4/1 (2004), pp. 11-17.


CUSUM Anomaly Detection (CAD)
CUSUM Anomaly Detection (CAD) is a statistical method; it is an anomaly detection technique for univariate time series. It uses the out-of-control signals of the CUSUM charts to locate anomalous points. The detection of periodicity is not yet part of CAD, nor is it a method that searches for measurements that do not follow the expected periodic behavior.

M-Lab's data consists mainly of Network Diagnostic Test results, which means that the data collection rate, or sampling rate, is variable. In order to create a time series of equally spaced measurements, the accepted practice is to find the median of the measurements per unit time. Still, short-term jumps in a time series' values could be due to chance alone. Therefore, CAD is optimized to find anomalous subsequences [13] whose length is greater than an adjustable minimum length, whose default value is 5 units of time. Specifically, CAD is not designed to find contextual anomalies [14], that is, single data points considered to be anomalous within the context of the time series itself. CAD is designed to detect sustained changes, rather than a single anomalous data point.
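The median-per-unit-time step can be sketched as follows, assuming one day as the unit of time and measurements given as (timestamp, value) pairs. The function name and the sample values are hypothetical, and days without any test are simply absent from the result rather than filled in.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import median


def daily_medians(samples):
    """Collapse irregularly timed (datetime, value) samples into one median per calendar day."""
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[timestamp.date()].append(value)
    # Sort by day so the output is a proper time series.
    return [(day, median(values)) for day, values in sorted(by_day.items())]


samples = [
    (datetime(2013, 6, 9, 8, 0), 1.0),
    (datetime(2013, 6, 9, 12, 30), 5.0),
    (datetime(2013, 6, 9, 20, 15), 3.0),
    (datetime(2013, 6, 10, 9, 0), 10.0),
    (datetime(2013, 6, 10, 10, 0), 20.0),
]
series = daily_medians(samples)
```

Using the median rather than the mean keeps a single unusually fast or slow test from dominating the value for that day.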

Defining a Training Set and Applying CUSUM Charts to Internet Performance Variables
The time series of an Internet performance variable like round trip time, download throughput or packet retransmission rate can be viewed as output variables characterizing the process of transmitting information through the Internet.

[13] D. Cheboli, A Thesis Submitted to the Faculty of the Graduate School of the University of Minnesota, Anomaly Detection of Time Series (2010), http://conservancy.umn.edu/bitstream/handle/11299/92985/?sequence=1.
[14] Ibid., http://conservancy.umn.edu/bitstream/handle/11299/92985/?sequence=1, 6.


By applying the CUSUM chart to one of these time series we implicitly create a local model for the time series in question. This model defines the local normal for the time series, and it is defined by the following:
1. it is normally distributed
2. it has a well-defined mean and standard deviation
3. it is in statistical control
An anomaly signaled by the CUSUM chart implies a shift in the mean with respect to this local model of the Internet performance variable.

Example: CUSUM Chart applied to an Internet performance variable time series
Consider M-Lab's Iran daily median download throughput for the year 2013 (Figure 2) [15]. In order to be able to apply the CUSUM chart to this time series, it must have at least one subsequence whose values have a moderately normal distribution. One such subsequence, if it exists, can be used as the training set. This training set would define a local normal for Iran's daily median download throughput and it would be used to calculate the expected mean and the expected standard deviation for the time series.

Figure 2: Iran's Daily Median Download Throughput for 2013, An M-Lab Dataset

As it turns out, the time series in question does have such a subsequence. The measurements during the time period of June 9, 2013 to October 6, 2013 fit the requirements. This is the subsequence highlighted in red in Figure 3.

[15] Iran held its presidential elections on June 14, 2013.


Figure 3: Iran's Daily Download Throughput with the Training Set in Red

In order to demonstrate that the selected training set's distribution is close to normal, the density plot of the training set and the normal distribution curve with the mean and standard deviation of the training set are shown in Figure 4. Since the experimental distribution curve (in blue) closely approximates the theoretical distribution curve (in red), we can claim that the selected training set's distribution is close to normal.

Figure 4: Experimental and Theoretical Distributions of the Training Set


When the CUSUM parameters H and k are set to 5 and 3 respectively, the subsequence is in statistical control. Neither the upper nor the lower cumulative sum reaches the control lines, denoted UDB (Upper Decision Boundary) and LDB (Lower Decision Boundary), as shown in Figure 5. Note that the plot of the training set is also provided in Figure 5 as a reference for the CUSUM chart.

Figure 5: The Training Set and its CUSUM Chart


The CUSUM chart of the entire time series is then found using the mean of the training set as the expected mean, the standard deviation of the training set as the expected standard deviation, and the CUSUM parameters H = 5 and k = 3.

Figure 6 contains the graphs of the entire time series and its CUSUM chart. The red dots in the CUSUM chart of Iran's daily median throughput appear at the points where either the upper sum is above target or the lower sum is below target. The days over which the upper sum is above target coincide with the days when the time series values show a steep increase, whereas the days over which the lower sum is below target coincide with the dates when the time series values drop drastically.

Figure 6: Plot of Iran's Daily Median Throughput and its CUSUM Chart


CAD's Design
Overview
The implementation of CAD was written in R and it uses the qcc package [16] to find the CUSUM chart of a time series. CAD uses the sliding window technique. For each window, CAD searches the time series along the window for a training set. If one is found, CAD applies the CUSUM chart to the entire time series along the window. After interpreting the results of the CUSUM chart, some of the points are designated as possible anomalies. This procedure is repeated for every window down the length of the time series. The output of the process is the indexes of the anomalies within the time series, a graph of the time series with anomalies in red, and a bar chart of the number of times each point was labeled an anomaly.
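The last output, the number of times each point was labeled an anomaly, amounts to tallying indices across windows. The following Python sketch shows one way to do that; it assumes each window reports its anomalies as indices into the full time series, and the function name is hypothetical rather than part of CAD's R code.

```python
from collections import Counter


def count_labels(per_window_anomalies):
    """Count, for every index, how many windows labeled it as an anomaly."""
    counts = Counter()
    for indices in per_window_anomalies:
        # set() guards against an index being reported twice by the same window.
        counts.update(set(indices))
    return counts


# Three overlapping windows, each reporting indices into the full series.
counts = count_labels([[3, 4, 5], [4, 5, 6], [5]])
anomalies = sorted(counts)  # every index labeled at least once
```

Because consecutive windows overlap heavily, a point inside a sustained shift tends to be labeled by many windows, which is what the bar chart makes visible.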

Sliding Windows
The length of the moving window, w, is roughly one third of the length of the entire time series, as this length seemed to provide the best outcome. Future work could consider other values for the length or a systematic way to determine the optimal length. The window overlap is w − 1, that is, the window is always shifted one data point to the right at a time.
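A minimal Python sketch of this windowing scheme follows. It assumes that "roughly one third" means integer division of the series length by 3, which the text does not specify, and the function name is hypothetical.

```python
def sliding_windows(n, w=None):
    """Yield (start, end) pairs, end exclusive, for windows of length w shifted by one point."""
    if w is None:
        w = max(1, n // 3)  # assumed reading of "roughly one third"
    for start in range(0, n - w + 1):
        yield start, start + w


windows = list(sliding_windows(9))
```

A series of n points therefore produces n − w + 1 windows, and every pair of neighboring windows shares w − 1 points.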

Finding the Training Set
For each window, the portion of the time series contained in the window is searched for a training set.

The search starts with looking at all the subsequences with length floor(w/3). If a training set is not found, the length of the subsequence is decreased by one and the process repeats until either a subsequence is found that has the right properties or the subsequence length has reached 24 [17].

[16] L. Scrucca, qcc: an R package for quality control charting and statistical process control, R News 4/1 (2004), 11-17.

The examination of each subsequence entails calculating the p-value of the Shapiro-Wilk test, which checks whether a random sample $y_1, y_2, \ldots, y_N$ comes from a normal distribution. The two hypotheses of the Shapiro-Wilk test are the null hypothesis ($H_0$), that the distribution is normal, and the alternative hypothesis ($H_a$), that the distribution is not normal. When the p-value is greater than or equal to 0.05, $H_0$ cannot be rejected. When the p-value is less than 0.05, $H_0$ is rejected and the distribution is considered non-normal. In this last case, the subsequence is discarded, since its points were found to have a non-normal distribution. For each subsequence for which the p-value is greater than or equal to 0.05, the kurtosis and skewness values are calculated, and the smallest values of the CUSUM parameters H and k for which the subsequence is in statistical control are identified.

If the set of subsequences with a given length and with a p-value greater than or equal to 0.05 is non-empty, the subsequence that minimizes the quantities |1 − skewness|, |kurtosis − 3|, |p-value − 1|, H, and k is chosen to be the training set.

If the subsequence length decreases all the way to 24 and no suitable training set has been found, the entire process of anomaly detection halts with the conclusion that CAD cannot be applied to the time series in question.
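The search loop can be sketched in Python under several simplifying assumptions. The normality test is passed in as a callable `p_value` (hypothetical parameter; with SciPy installed it could be `lambda xs: scipy.stats.shapiro(xs)[1]`), the candidates are ranked by the plain sum of the three distribution-shape quantities listed above, and the search for the smallest H and k is left out. How CAD actually combines the five quantities is not stated in the text, so the scoring here is a guess rather than the author's rule.

```python
from statistics import fmean


def _central_moment(xs, order):
    m = fmean(xs)
    return fmean([(x - m) ** order for x in xs])


def skewness(xs):
    return _central_moment(xs, 3) / _central_moment(xs, 2) ** 1.5


def kurtosis(xs):
    # Non-excess kurtosis: equals 3 for a normal distribution.
    return _central_moment(xs, 4) / _central_moment(xs, 2) ** 2


def find_training_set(window, p_value, min_len=24, alpha=0.05):
    """Return (start, length) of the chosen subsequence, or None if none qualifies."""
    length = len(window) // 3
    while length >= min_len:
        best = None
        for start in range(len(window) - length + 1):
            sub = window[start:start + length]
            p = p_value(sub)
            if p < alpha:
                continue  # normality rejected: discard this subsequence
            # Assumed scoring: unweighted sum of the shape quantities from the text.
            score = abs(1 - skewness(sub)) + abs(kurtosis(sub) - 3) + abs(p - 1)
            if best is None or score < best[0]:
                best = (score, start, length)
        if best is not None:
            return best[1], best[2]
        length -= 1  # nothing suitable at this length: try shorter subsequences
    return None
```

Returning `None` corresponds to the case where the length has dropped to 24 without success and CAD cannot be applied to the series.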

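Applying the CUSUM Chart
Once a training set is found along with its CUSUM parameters H and k, the CUSUM chart is applied to the entire time series. The parameters for the CUSUM chart are set as follows: H and k are taken from the training set's CUSUM parameters, the expected mean is the mean of the training set, and the expected standard deviation is the standard deviation of the training set. The output of the CUSUM chart is the indices of the upper sum violations (if there are any) and of the lower sum violations (if there are any), and the values of the upper and lower sums.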
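A rough Python equivalent of this step is shown below. It is a sketch rather than the qcc call used by CAD: the function name is hypothetical, and `h` and `k` are assumed to be expressed in multiples of the training set's standard deviation, with illustrative default values.

```python
from statistics import mean, stdev


def apply_cusum(series, training, h=5.0, k=0.5):
    """Run a two-sided CUSUM over series using the training set's mean and standard deviation.

    Returns (upper_violations, lower_violations, high_sums, low_sums).
    """
    mu = mean(training)
    sd = stdev(training)
    slack, limit = k * sd, h * sd  # convert to the units of the data
    s_hi = s_lo = 0.0
    upper, lower, highs, lows = [], [], [], []
    for i, x in enumerate(series):
        s_hi = max(0.0, s_hi + x - mu - slack)
        s_lo = min(0.0, s_lo + x - mu + slack)
        highs.append(s_hi)
        lows.append(s_lo)
        if s_hi >= limit:
            upper.append(i)
        if s_lo <= -limit:
            lower.append(i)
    return upper, lower, highs, lows


training = [9, 10, 11, 10, 9, 11]
series = [10] * 5 + [14] * 5  # a sustained upward shift starting at index 5
upper, lower, highs, lows = apply_cusum(series, training)
```

In this toy series the first shifted point alone does not reach the limit; the violation starts one step later, once the deviations have accumulated.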
Interpreting the CUSUM Chart Results
Given the CUSUM chart results for the time series, the potential anomalies are identified by finding the increasing subsequences of the upper sum violations and the decreasing subsequences of the lower sum violations that reach the minimum anomaly length, if there are any. The indices of these subsequence elements pinpoint the potential anomalies in the time series for the window in question.
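One possible reading of this rule is sketched below in Python: consecutive violation indices are grouped into runs for as long as the cumulative sum keeps moving away from zero, and only runs of at least the minimum anomaly length are kept. The grouping criterion, the function name and the parameter names are assumptions; the text does not spell out how CAD delimits the subsequences.

```python
def anomalous_runs(violations, sums, min_len=5, kind="upper"):
    """Group violation indices into monotone runs and keep those with at least min_len points.

    violations: sorted indices where the sum crossed its decision boundary
    sums: the full list of upper (or lower) cumulative sums
    kind: "upper" keeps increasing runs, "lower" keeps decreasing runs
    """
    sign = 1 if kind == "upper" else -1
    runs, current = [], []
    for i in violations:
        consecutive = bool(current) and i == current[-1] + 1
        if consecutive and sign * (sums[i] - sums[i - 1]) > 0:
            current.append(i)
        else:
            if len(current) >= min_len:
                runs.append(current)
            current = [i]  # start a new run at this violation
    if len(current) >= min_len:
        runs.append(current)
    return runs


sums = [0, 0, 6, 7, 8, 9, 10, 9, 0, 6, 7]
violations = [i for i, s in enumerate(sums) if s >= 5]
runs = anomalous_runs(violations, sums, min_len=5)
```

Lowering `min_len` would also keep the short run at the end of this example, which illustrates how the minimum anomaly length trades sensitivity for false positives.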

[17] By a statistical process control rule of thumb, 12 to 24 values are sufficient to calculate the CUSUM parameters. See http://asq.org/qualityprogress/2012/07/backtobasics/smartcharting.html.


Tuning Parameters
CAD is mostly automated; however, there are still a few parameters that, although they have default values, can nonetheless be adjusted by the user. These are:

- minimum anomaly length: the minimum length of the anomalous subsequences that CAD should detect; its default value is 5. Decreasing it allows CAD to search for short-duration sharp increases or decreases in the time series. Its minimum value is 1, and at this setting CAD will allow for the detection of single-point anomalies, but with an increased risk of false positives.
- k offset: adjusts the k value of the CUSUM chart applied to the main time series. It is an offset, a value that is added to the CUSUM parameter k. Its default value is 3. Adjusting it adjusts the sensitivity of CAD: the higher the value, the less sensitive CAD gets. There is no maximal value for it.
- type: the choices for this parameter are upper or lower. It determines the type of anomaly CAD should search for. When type = upper, CAD looks for subsequences of the time series with mean larger than the local mean for the time series. On the other hand, when type = lower, subsequences will be labeled anomalous if their mean value is below the local mean of the time series. The default setting for type is lower.


Examples and Results
CAD was tested on Internet performance time series that contained known anomalous subsequences. There are two different sets of examples considered here, and the data for both comes from M-Lab's Network Diagnostic Test dataset.

The first set of examples shows the results of CAD being applied to some of the time series from M-Lab's ISP Interconnection and its Impact on Consumer Internet Performance [18] study. The events described within the interconnection study were identified by inspection or by prior questions about where potential degradation had occurred. The data for Internet performance variables that form these time series were used by the study to show sustained degradation of broadband performance for end users. These events show up as anomalous subsequences of the Internet performance variable time series. The results will show that CAD uncovers these very same anomalies when applied to these time series.

The second example uses CAD to find the anomalies resulting from a prominent case of confirmed Internet censorship that occurred in Iran, just before the presidential elections on June 14, 2013. The dataset used in the second example is also composed of Internet performance data that is known to contain anomalies, since Iran's government has admitted to slowing down the Internet in order to preserve calm during the election period. [19]

Example 1: CAD Applied to the Time Series from the ISP Interconnection and its
Impact on Consumer Internet Performance Study
The M-Lab Consortium Technical Report, ISP Interconnection and its Impact on Consumer Internet Performance, uncovered instances of performance degradation in the US using M-Lab's NDT datasets. The decline in Internet performance can be observed as a steep drop in median download throughput and a sharp increase in the packet retransmit rate of access ISPs across some of the transit ISPs. The subsequence of the time series of the median download throughput of an ISP corresponding to the time period over which the median download throughput values have drastically dropped is considered to be an anomalous subsequence. Similarly, the subsequence with drastically large values of the packet retransmit rate time series is an anomalous subsequence.

In this example, we focused on the download throughput and packet retransmit rate data from the New York City area, concerning the customers of Time Warner Cable, Comcast, and Verizon connecting across the transit ISP Cogent. These NDT time series spanned the time period from January 1, 2012 to September 30, 2014. M-Lab's report demonstrated the degradation of Internet performance between April-June 2013 and late February 2014. [20] Therefore, we expected that the anomalies detected by CAD would fall into this time range as well.

[18] ISP Interconnection and its Impact on Consumer Internet Performance, Measurement Lab (2014), http://www.measurementlab.net/publications/ispinterconnectionimpact.pdf.
[19] Golnaz Esfandiari, Iran Admits Throttling Internet To Preserve Calm During Election, Radio Free Europe/Radio Liberty, June 26, 2013, http://www.rferl.org/content/iranInternetdisruptionselection/25028696.html.

To start, CAD was applied to the median download throughput for Time Warner Cable across Cogent in New York. For this time series the settings were the default settings: type = lower, k offset = 3, minimum anomaly length = 5. The detected anomalies, plotted in red in Figure 7, range over the time periods of May 7, 2013 to June 11, 2013 and July 15, 2013 to February 25, 2014. These time periods are mostly in agreement with the time periods of slow Internet service demonstrated by the ISP Interconnection Study. We will not consider the June 11, 2013 to July 15, 2013 gap in the list of anomalies as an error. By visual inspection of the time series, it is clear that the download throughput values of this time period are much higher than the neighboring values, both on the left and right. So, it is reasonable that the measurements in this time period are not labeled as lower anomalies by CAD.

Figure 7:
TWC Download Throughput Using the Transit ISP Cogent in the New York City
Area with Anomalies Detected by CAD in Red.

Next, CAD was applied to download throughput measurements between Comcast and Cogent in New York City (Figure 8). The CAD settings were: type = lower, k offset = 5, minimum anomaly length = 5. The detected anomalies occurred during the time periods of January 24, 2013 to January 31, 2013, February 5, 2013 to March 14, 2013, and April 17, 2013 to February 20, 2014.

Of these anomalies, the ones occurring during January 24, 2013 to January 31, 2013 and February 5, 2013 to March 14, 2013 are outside the expected time range. However, an examination of the plot of the time series in Figure 8 shows that there is indeed a drop in values during those periods, resulting in an overall drop in the average value for these time intervals.
[20] ISP Interconnection and its Impact on Consumer Internet Performance, Measurement Lab (2014), http://www.measurementlab.net/publications/ispinterconnectionimpact.pdf, 9.


These drops in the value of the mean were significant enough that most points in this date range retained the anomalous label for all values of the parameter for which the points for the date range April 17, 2013 to February 20, 2014 were still deemed anomalous.

Figure 8:
Comcast Download Throughput Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.

Finally, CAD was applied to download throughput measurements between Verizon and Cogent in New York City (Figure 9). The CAD settings were: type = lower, k offset = 3, minimum anomaly length = 5. The detected anomalies occurred during the time period of May 18, 2013 to February 26, 2014, which is consistent with the results of the ISP Interconnection Study.

Figure 9:
Verizon Download Throughput Using the Transit ISP Cogent in the New York City
Area with Anomalies Detected by CAD in Red.


Next, still within the context of the example, CAD was applied to time series of packet retransmission rates.
Figure 10 shows the results of CAD being applied to the daily median packet retransmission rate between Time Warner Cable and Cogent in New York City. CAD's settings were type = upper, k offset = 1, minimum anomaly length = 3. The detected anomalies occur during the time periods: January 8, 2013 to January 11, 2013, May 6, 2013 to June 11, 2013, July 21, 2013 to August 14, 2013, August 25, 2013 to September 9, 2013, September 14, 2013 to September 25, 2013, September 27, 2013 to October 9, 2013, October 19, 2013 to October 23, 2013, and November 2, 2013 to December 20, 2013.

Figure 10: TWC Packet Retransmit RATE Using the Transit ISP Cogent in the New York City Area with Anomalies Detected by CAD in Red.

Most of the detected anomalies fall into the expected range, with the exception of those that occurred between January 8, 2013 and January 11, 2013. However, there was a sharp increase in values during this period (see Figure 10), and so CAD designating these points as anomalies is not unreasonable. What is more concerning is that CAD failed to designate the measurements from the end of December 2013 to the end of February 2014 as anomalies, although from the graph in Figure 10 it seems that they are indeed anomalously high measurements. Future work will focus on fixing this type of issue.

Figure 11 shows the results of CAD being applied to the daily median packet retransmit rate between Comcast and Cogent in New York City. The CAD settings were: type = upper, k offset = 5, minimum anomaly length = 5. The detected anomalies occurred during the time periods of January 24, 2013 to February 6, 2013, June 1, 2013 to June 14, 2013, and July 5, 2013 to February 20, 2014. The anomalies occurring between January 23, 2013 and February 6, 2013 seem to be false positives. The rest of the anomalies occurred during the expected time period. However, CAD again failed to designate some measurements as anomalous. From Figure 11, it seems clear that the data points from May 20, 2013 to June 1, 2013 and from June 14, 2013 to July 5, 2013 have anomalously high values when compared to the rest of the time series. Investigating why these peaks were not detected could be addressed in future work.

Figure 11:
Comcast Packet Retransmit RATE Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.

The last dataset we consider from the ISP Interconnection Study is the daily median packet retransmit rate between Verizon and Cogent in New York City. When applied to this time series, CAD's parameters were set to type = upper, k offset = 3, minimum anomaly length = 1. As shown in Figure 12, the detected anomalies occur during the time periods: May 14, 2013 to July 6, 2013, July 14, 2013 to July 18, 2013, July 28, 2013 to July 30, 2013, January 2, 2014 to January 7, 2014, January 10, 2014 to January 14, 2014, January 17, 2014 to January 26, 2014, February 4, 2014 to February 9, 2014, and February 12, 2014 to February 17, 2014.


Figure 12:
Verizon Packet Retransmit RATE Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.

In Verizon's case, all the anomalies detected by CAD fell into the expected date range. There are no instances of false positive errors. However, measurements from the time period of July 30, 2013 to January 2, 2014 should likely have been classified as anomalies.

Example 2: CAD Applied to Data collected from Iran before, during and after the
2013 Iranian Presidential Elections
Following the methodology used by Collin Anderson in his paper Dimming the Internet: Detecting Throttling as a Mechanism of Censorship in Iran [21], it seems that the Internet performance variable that is most visibly affected by throttling is download throughput. In this example CAD was applied to the daily median download throughput of Iranian clients during the time period of January 1, 2013 to July 1, 2014 from M-Lab's NDT dataset.

CAD's parameters were set to type = lower, k offset = 3, minimum anomaly length = 1. The detected anomalies occur during the time period of May 24, 2013 to June 13, 2013 (see Figure 13).

[21] Collin Anderson, Dimming the Internet: Detecting Throttling as a Mechanism of Censorship in Iran, Cornell University Library, June 18, 2013, http://arxiv.org/abs/1306.4361.


Figure 13: Iran's Daily Median Download Throughput with Anomalies Detected by CAD in Red.

According to the June 26, 2013 Radio Free Europe/Radio Liberty blog post by Golnaz Esfandiari, Iran's minister for communications and information technology, Mohammad Hassan Nami, has acknowledged that the country restricted the speed of the Internet in the days leading up to the June 14 presidential election. [22] Therefore, the expected dates of the anomalies are the days prior to June 14, 2013.

The anomalies detected by CAD do indeed fall into the expected date range. None of the points tested by CAD were misclassified as anomalies. However, upon visual inspection of the plot of the daily median download throughput and the anomalies detected by CAD, it seems that CAD should probably have labeled at least three additional points as anomalous: May 21, 2013 to May 23, 2013.

Evaluation of CAD's Performance on the Time Series from Examples 1 and 2
The type of anomaly detection that CAD was developed for is a binary classification problem: a point is either labeled an anomaly or it is not. In order to evaluate the effectiveness of CAD we adapt the binary classification terminology and define the following terms:
- false positive error (type I error): improperly classifying a point in the time series as anomalous
- false negative error (type II error): failing to label an anomalous point as anomalous
- false positive rate: the probability of a false positive error occurring; it is denoted by α
- false negative rate: the probability of a false negative error occurring (the proportion of the number of points tested that result in a false negative error); it is denoted by β
- specificity of the method: the quantity 1 − α
- sensitivity of the method: the quantity 1 − β [23]

[22] Golnaz Esfandiari, Iran Admits Throttling Internet To Preserve Calm During Election, Radio Free Europe/Radio Liberty, June 26, 2013, http://www.rferl.org/content/iranInternetdisruptionselection/25028696.html

The results of the evaluation of CAD's performance on the time series from Example 1 and Example 2 are shown in Table 1.

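Table 1: Evaluating CAD's Performance on Examples 1 and 2

| Time Series | Length of the Time Series | False Positive Errors | False Negative Errors | False Positive Rate (α) | False Negative Rate (β) | Specificity | Sensitivity |
|---|---|---|---|---|---|---|---|
| TWC DT [24] | 997 | 0 | 0 | 0 | 0 | 1 | 1 |
| Comcast DT | 997 | 44 | 0 | 0.044 | 0 | 0.956 | 1 |
| Verizon DT | 997 | 0 | 0 | 0 | 0 | 1 | 1 |
| TWC PRR [25] | 997 | 0 | 62 | 0 | 0.062 | 1 | 0.938 |
| Comcast PRR | 997 | 14 | 33 | 0.014 | 0.033 | 0.986 | 0.967 |
| Verizon PRR | 997 | 0 | 155 | 0 | 0.155 | 1 | 0.845 |
| Iran DT | 912 | 0 | 3 | 0 | 0.003 | 1 | 0.997 |
| Average | | | | 0.008 | 0.036 | 0.992 | 0.964 |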
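The rates in Table 1 follow from the error counts and the length of each series. The Python sketch below reproduces the Comcast PRR row (14 false positives and 33 false negatives among 997 points). It follows the definition given in this section, where each rate is a proportion of all points tested, which differs from the textbook definitions that divide by the number of actual negatives or positives. The function name is hypothetical.

```python
def evaluate(n_points, false_pos, false_neg):
    """Rates as defined in this report: errors as a proportion of all points tested."""
    alpha = false_pos / n_points  # false positive rate
    beta = false_neg / n_points   # false negative rate
    return {
        "alpha": alpha,
        "beta": beta,
        "specificity": 1 - alpha,
        "sensitivity": 1 - beta,
    }


comcast_prr = {name: round(value, 3) for name, value in evaluate(997, 14, 33).items()}
```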

Future work could consider a more robust evaluation of CAD, with an eye towards reducing the false negative and false positive rates.

Conclusion
The CAD method works well for discovering anomalies in network performance data, with a high rate of successful anomaly detection and a low rate of false positives. One of the strengths of the CUSUM Anomaly Detection algorithm is that, within each sliding window, it finds a subsequence of the time series that is normally distributed and in statistical control. This subsequence then represents the normal behavior of the time series within the window and it is used as its training set. Based on this training set a CUSUM chart is created. This chart is used in identifying the statistically significant anomalies, if there are any. One of the tunable parameters of CAD, the k offset, fine-tunes the behavior of the CUSUM chart. The output of the algorithm is a list of possible anomalies and a plot of the time series and the anomalies.

[23] False positives and false negatives, Wikipedia, https://en.wikipedia.org/wiki/False_positives_and_false_negatives.
[24] DT is the abbreviation for Download Throughput.
[25] PRR is the abbreviation for Packet Retransmit Rate.

Although CAD is successful when applied to the type of time series it was developed for, there are several potential limitations. CAD has not been tested outside this narrow scope, and, as an automatic process, it does not provide a list of points that can be labeled as anomalous with absolute certainty. However, it seeds the research process by producing a list of possible anomalies that the user must assess. Thereafter, the user may adjust parameters as necessary to produce more reliable results. Future work could potentially focus on reducing the need for tunable parameters and extending its application domain to a more general category of univariate time series.


Appendix
CAD was tested on additional examples.

Example 1: Internet performance degradation in Vietnam due to problems with the AAG undersea cable in late 2013 and early 2014
The Asia-America Gateway (AAG) cable was cut on December 21, 2013 and it was not repaired until January 5, 2014. Then, in early March 2014 the AAG cable was again under repair. It was a planned event but it still adversely affected Internet performance [26].
Figure 14 contains a screenshot by Renesys Internet Intelligence, now known as Dyn Internet Intelligence [27], obtained from the Dyn Research blog post Beware the Ides of March: Subsea Cable Cut Trend Continues. [28] It clearly shows the impact of the cable outage on Internet latency from Tokyo to Vietnam.

Figure 14: Screenshot by Renesys Internet Intelligence

CAD was applied to M-Lab time series of data collected from Vietnam between September 1,
2013 and May 30, 2014 in order to see whether CAD could also detect the sustained
degradation in Internet performance.

The minimum round trip time, packet retransmit rate and download throughput from Vietnam
during the time period of September 1, 2013 to May 2, 2014 were downloaded from M-Lab's
Google BigQuery database. The plots of the daily medians of these Internet performance
variables and the anomalies detected by CAD are shown in Figures 15 through 17.
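
As an illustration of the aggregation step described above, the sketch below reduces per-test
measurements to one median value per day. It is a minimal Python example under stated
assumptions: the function name `daily_medians` and the `(timestamp, value)` record layout are
hypothetical and do not reflect M-Lab's BigQuery schema or the author's code, and the sample
values are made up for demonstration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_medians(records):
    """Group (timestamp, value) pairs by calendar day and return (date, median) pairs.

    `records` is a hypothetical per-test layout: one (datetime, float) pair per test.
    The result is sorted by date, ready to be used as a daily time series.
    """
    by_day = defaultdict(list)
    for timestamp, value in records:
        by_day[timestamp.date()].append(value)
    return [(day, median(values)) for day, values in sorted(by_day.items())]


# Made-up round trip times in milliseconds (not M-Lab data).
sample = [
    (datetime(2013, 12, 21, 8, 0), 210.0),
    (datetime(2013, 12, 21, 14, 30), 250.0),
    (datetime(2013, 12, 21, 20, 15), 230.0),
    (datetime(2013, 12, 22, 9, 0), 400.0),
    (datetime(2013, 12, 22, 18, 45), 380.0),
]
for day, value in daily_medians(sample):
    print(day, value)
```

Using the median rather than the mean keeps a handful of extreme individual tests from
dominating a day's value, so the resulting series reflects the typical test on each day.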
[26] Doug Madory, "Beware the Ides of March: Subsea Cable Cut Trend Continues," Dyn Research,
March 31, 2014, http://research.dyn.com/2014/03/bewaretheidesofmarch/#!prettyPhoto.
[27] Dyn Intelligence, http://dyn.com/dyninternetintelligence/.
[28] Doug Madory, "Beware the Ides of March: Subsea Cable Cut Trend Continues," Dyn Research,
March 31, 2014, http://research.dyn.com/2014/03/bewaretheidesofmarch/#!prettyPhoto.

Figure 15: Daily Median Round Trip Time in Vietnam, Anomalies Detected by CAD in Red.

Figure 16: Daily Median Packet Retransmit Rate in Vietnam, Anomalies Detected by CAD in Red.

Figure 17: Daily Median Download Throughput in Vietnam, Anomalies Detected by CAD in Red.

Example 2: M-Lab Test Volume Increase After the Internet Health Test Launch
A new version of the Network Diagnostic Tool (NDT) was released in late April 2015. This new
version allowed for measuring network performance from the browser without a need for
browser plugins. Battle for the Net,[29] a coalition of public interest advocacy organizations, was
among the first to take advantage of this new update,[30] and in May 2015 Battle for the Net
launched The Internet Health Test.[31] This test uses M-Lab infrastructure and code, and all the
data collected by the test are hosted by M-Lab. Figure 18, obtained from M-Lab's blog post
"New Opportunities for Test Deployment and Continued Analysis of Interconnection
Performance,"[32] shows that the launch of the Internet Health Test resulted in a sharp
increase in the amount of M-Lab data collected.

[29] "Net Neutrality is Under Attack," Fight For The Future, https://www.battleforthenet.com/.
[30] Collin Anderson, "New Opportunities for Test Deployment and Continued Analysis of
Interconnection Performance," Measurement Lab, June 24, 2015,
https://www.measurementlab.net/blog/interconnection_and_measurement_update/.
[31] "The Internet Health Test," Fight For The Future,
https://www.battleforthenet.com/internethealthtest/.
[32] Collin Anderson, "New Opportunities for Test Deployment and Continued Analysis of
Interconnection Performance," Measurement Lab, June 24, 2015,
https://www.measurementlab.net/blog/interconnection_and_measurement_update/.

Figure 18: Graphic from M-Lab's blog post "New Opportunities for Test Deployment and Continued Analysis of Interconnection Performance"

The time series of daily network diagnostic test counts from M-Lab for the time period of
December 1, 2014 to July 31, 2015, together with its known anomalously high values in May 2015,
was used to test CAD's anomaly detection capabilities.
Figure 19 shows M-Lab's daily test count, with the anomalies detected by CAD in red.
CAD's parameters were set to type = upper, with the two remaining parameters set to 4 and 3.

Figure 19: Daily Network Diagnostic Test Count

About the Author
Kinga Farkas is a data science consultant for Fast Forward, Inc. and Sperling's Best Places. In
her current job, Kinga uses various machine learning techniques for the predictive analysis of a
diverse set of demographic and socioeconomic data. She was an Outreachy data science intern
at M-Lab from May to August 2015. It was during this period that she created CAD, the CUSUM
Anomaly Detection method. A mathematician by training, Kinga holds an MS in mathematics, a
BS in mathematics and a BS in physics from Oregon State University. She has done further
graduate-level work in algebraic number theory, algebraic geometry and elliptic curve
cryptography at the same institution. Kinga writes about her data science escapades in her blog
at http://nthturn.com/.

The R code for CAD can be found here: https://github.com/kingakfarkas/CAD.

2016 Measurement Lab
