CUSUM Anomaly Detection
By Kinga Farkas
Abstract
Introduction
Background
Anomaly Detection in Network Traffic Flows
CUSUM Charts
CUSUM Anomaly Detection (CAD)
Applying CUSUM Charts to Internet Performance Variables
CAD's Design
Overview
Sliding Windows
Finding the Training Set
Applying the CUSUM Chart
Interpreting the CUSUM Chart Results
Tuning Parameters
Examples and Results
Example 1: CAD Applied to the Time Series from the ISP Interconnection and its Impact on Consumer Internet Performance Study
Example 2: CAD Applied to Data Collected from Iran Before, During and After the 2013 Iranian Presidential Elections
Evaluation of CAD's Performance
Conclusion
2016 Measurement Lab Consortium
Abstract
Today, researchers can collect data on a wide range of indicators related to Internet access, speed, and latency. What can we learn from all this data? There is an increasing need for analysis that uses automated methods to sift through the data and uncover unusual patterns, outliers, and anomalous sequences.
The CUSUM anomaly detection (CAD) method is based on CUSUM statistical process control charts. CAD is used to detect anomalous subsequences of a time series that show a subtle shift in the mean relative to the context of the sequence itself. CAD was applied in order to look for anomalies in M-Lab's database of Network Diagnostic Test (NDT) results. CAD's success is based on the observation that the NDT time series can be viewed as being comprised of varying-length subsequences of real-valued random variates, where each of these subsequences corresponds to a normal distribution with a specific mean and standard deviation.
We describe the basic design of CAD, illustrate how it functions by applying it to time series from M-Lab's NDT database that contain known anomalies, and demonstrate its effectiveness by showing that CAD successfully and automatically detected each of the Internet performance degradation incidents with very few false negatives or false positives.
Introduction
M-Lab's mission is "to advance network research and empower the public with useful information about their broadband and mobile connections. By enhancing Internet transparency, M-Lab helps sustain a healthy, innovative Internet." [1]
The CUSUM anomaly detection algorithm addresses the need for an automated method of searching M-Lab's vast database of Network Diagnostic Test (NDT) results, not for single outlier points, but for series of unusually high or low measurements. This project was developed during a three-month Outreachy [2] internship at Measurement Lab in the summer of 2015.
One of the most important features of the algorithm is finding and defining the normal pattern for the time series, relative to which deviations can be classified as anomalies. Using a sliding window technique, statistically significant shifts in the mean are detected relative to this normal pattern (the training set). The output of the algorithm is a list of potential anomalies, along with a plot of the time series and its anomalies.
Background
Anomaly Detection in Network Traffic Flows
There have been several attempts to characterize time series of network traffic flows to detect anomalies, which include outages, abuse, or Internet filtering. Anomaly detection is becoming an increasingly studied field, given the central role that the Internet plays in global communications. These methodologies vary from a symbolic representation of time series to an automated detection of Internet filtering [3].
In their paper, "A symbolic representation of time series, with implications for streaming algorithms," J. Lin, E. Keogh, S. Lonardi, and B. Chiu [4] make an attempt to create a representation of time series that allows dimensionality/numerosity reduction, and that also allows distance measures to be defined on the symbolic approach that lower bound the corresponding distance measures defined on the original series.
In "Visualizing and discovering non-trivial patterns in large time series databases," [5] the same authors describe a time series pattern discovery and visualization system, VizTree, based on augmenting suffix trees. VizTree visually summarizes both the global and local structures of time series data at the same time. This provides solutions to motif discovery, anomaly detection and query by content.
P. Barford and D. Plonka [6] describe their work of collecting and analyzing network flow data using the FlowScan open source software. The goal of their work is to identify the statistical properties of anomalies and, if they exist, their invariant properties.
The most comprehensive and elaborate work on detecting Internet filtering from geographic time series is presented by J. Wright, A. Darer and O. Farnan in their paper, "Detecting Internet filtering from geographic time series" [7]. The goal of their work is to identify global patterns of Internet filtering through technical network measurements and to link these events to their social context. Their approach to detecting anomalies is based on principal component analysis. It should be pointed out that the goal of CAD is very similar to theirs, but the approach is very different and is based on statistical process control.
[3] J. Wright, A. Darer, O. Farnan, "Detecting Internet filtering from geographic time series," Oxford Internet Institute, July 21, 2015. http://arxiv.org/abs/1507.05819.
[4] J. Lin, E. Keogh, S. Lonardi, B. Chiu, "A symbolic representation of time series, with implications for streaming algorithms," DMKD, June 13, 2003, San Diego, CA, USA.
[5] J. Lin, E. Keogh, S. Lonardi, "Visualizing and discovering non-trivial patterns in large time series databases," Information Visualization (2005), 4, 61-82.
[6] P. Barford and D. Plonka, "Characteristics of network traffic flow anomalies," C.S. Department at the University of Wisconsin, Madison (2001).
[7] J. Wright, A. Darer, O. Farnan, "Detecting Internet filtering from geographic time series."
CUSUM Charts
Statistical control charts are graphs that are used to show how a process changes over time. All statistical control charts have a center line for the average, an upper control line, and a lower control line. These lines are based on historical values of the process mean and standard deviation. An out-of-control process will have points on the chart that land above the upper control line or below the lower control line. [8]
The CUSUM (cumulative sum) control chart is a statistical control chart used to track the variation of a process [9]. It is a method that is able to detect small shifts in the process mean. The CUSUM chart uses four parameters:
1. the expected mean of the process, μ
2. the expected standard deviation of the process, σ
3. the size of the shift that is to be detected, k
4. the control limit, H
The expected mean and standard deviation are defined to be the historical mean and standard deviation of the process when the process is normal and in statistical control [10]. The parameter k determines the slack that is allowed in the process; its usual value is about [...]. The parameter H is the threshold for the process; its value is usually set to 5.
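The CUSUM chart works by tracking the individual cumulative sums of the positive and negative deviations from the mean, the high and low sums respectively.
The high sum is given by the recursive sequence
$$S_i^+ = \max\{0,\; S_{i-1}^+ + x_i - \mu - k\}, \quad S_0^+ = 0, \quad \text{for } i = 1, 2, \ldots, N,$$
where μ is the expected mean of the process, whereas the low sum is defined by
$$S_i^- = \min\{0,\; S_{i-1}^- + x_i - \mu + k\}, \quad S_0^- = 0, \quad \text{for } i = 1, 2, \ldots, N.$$
Note that the parameter k does indeed provide for slack in the procedure, since $S_i^+ > S_{i-1}^+$ only if $x_i > \mu + k$, and $S_i^- < S_{i-1}^-$ only if $x_i < \mu - k$. [11]
If either of the cumulative sums, $S_i^-$ or $S_i^+$, reaches the threshold (H for the high sum, −H for the low sum), the process is considered out of control.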
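The recursion above can be sketched in a few lines of Python. This is an illustrative sketch only, not the author's R implementation: the function names `cusum_sums` and `out_of_control` are hypothetical, and `k` and `h` are assumed to be expressed in the same units as the data.

```python
def cusum_sums(xs, mu, k):
    """Return the lists of high sums S_i^+ and low sums S_i^- for xs."""
    high, low = [], []
    s_hi = s_lo = 0.0  # S_0^+ = S_0^- = 0
    for x in xs:
        s_hi = max(0.0, s_hi + x - mu - k)  # grows only when x > mu + k
        s_lo = min(0.0, s_lo + x - mu + k)  # falls only when x < mu - k
        high.append(s_hi)
        low.append(s_lo)
    return high, low


def out_of_control(high, low, h):
    """Indices at which either sum has reached the threshold (h or -h)."""
    return [i for i, (hi, lo) in enumerate(zip(high, low))
            if hi >= h or lo <= -h]
```

For the sequence 50, 52, 55, 55, 45 with mean 50 and k = 1, the high sum climbs while the values stay above 51 and falls back once a low value arrives, which is the slack behaviour described in the text.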
Consider a sequence $\{v_i\}_{i=1}^{240}$, where the $v_i$ are randomly selected from a normal distribution with mean μ = 50 and standard deviation σ = 3. This sequence then represents a process that is statistically stable; its departures from the target value, the mean μ = 50, are the expected variations due to chance. Next, suppose that the middle 40 elements, $\{v_i\}_{i=101}^{140}$, are replaced by $\{w_i\}_{i=1}^{40}$, where the $w_i$ are randomly selected from a normal distribution with mean μ = 54 and standard deviation σ = 3. The resulting sequence
$$V = \{v_i\}_{i=1}^{100} \cup \{w_i\}_{i=1}^{40} \cup \{v_i\}_{i=141}^{240} = \{v_i\}_{i=1}^{240}$$
is no longer statistically stable, because the subsequence mean changes drastically right in the middle of the sequence.
[8] "Control Chart," ASQ, http://asq.org/learnaboutquality/datacollectionanalysistools/overview/controlchart.html.
[9] "Keeping the Process on Target: CUSUM Charts," BPI Consulting, LLC, 2014, http://www.spcforexcel.com/knowledge/variablecontrolcharts/keepingprocesstargetcusumcharts.
[10] D. C. Montgomery, Introduction to Statistical Quality Control (John Wiley & Sons, 1991), 103.
[11] "Keeping the Process on Target: CUSUM Charts," BPI Consulting, LLC, 2014, http://www.spcforexcel.com/knowledge/variablecontrolcharts/keepingprocesstargetcusumcharts.
The time series sequence V and its CUSUM chart are plotted in Figure 1. Within the plot of V, the green and red horizontal lines indicate the mean of the original sequence V and of its anomalous subsequence $\{v_i\}_{i=101}^{140}$, respectively. The CUSUM chart of V (the bottom plot of Figure 1) detects the shift in the mean of V, since the upper sum $S_i^+$ reaches and surpasses the threshold H.
Figure 1: Plot of V and the CUSUM Chart for V
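The synthetic example can be reproduced approximately with the standard library. The sketch below is an assumption-laden illustration rather than the code behind Figure 1: the seed, the helper names, the slack of half a standard deviation, and the reading of H = 5 in units of σ are all choices made here for demonstration.

```python
import random


def upper_cusum(xs, mu, k):
    """High-side cumulative sum S_i^+ for each element of xs."""
    sums, s = [], 0.0
    for x in xs:
        s = max(0.0, s + x - mu - k)
        sums.append(s)
    return sums


def simulate_v(seed=0):
    """240 draws from N(50, 3); elements 101..140 (1-based) redrawn from N(54, 3)."""
    rng = random.Random(seed)
    v = [rng.gauss(50, 3) for _ in range(240)]
    v[100:140] = [rng.gauss(54, 3) for _ in range(40)]
    return v


SIGMA = 3.0
v = simulate_v()
# Assumed slack: half a standard deviation (not stated in the text).
sums = upper_cusum(v, mu=50.0, k=0.5 * SIGMA)
# Assumed threshold: H = 5, read here in units of sigma.
alarms = [i for i, s in enumerate(sums) if s >= 5 * SIGMA]
```

With a mean shift of four units and a slack of 1.5, the upper sum gains about 2.5 per step inside the shifted block, so it is expected to cross the threshold a few points after index 100.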
[12] L. Scrucca, "qcc: an R package for quality control charting and statistical process control," R News 4/1 (2004), pp. 11-17.
CUSUM Anomaly Detection (CAD)
CUSUM Anomaly Detection (CAD) is a statistical method: an anomaly detection technique for univariate time series. It uses the out-of-control signals of CUSUM charts to locate anomalous points. The detection of periodicity is not yet part of CAD, nor is it a method that searches for measurements that do not follow the expected periodic behavior.
M-Lab's data consists mainly of Network Diagnostic Test results, which means that the data collection rate, or sampling rate, is variable. In order to create a time series of equally spaced measurements, the accepted practice is to find the median of the measurements per unit time. Still, short-term jumps in a time series' values could be due to chance alone. Therefore, CAD is optimized to find anomalous subsequences [13] of length greater than an adjustable minimum, whose default value is 5 units of time. Specifically, CAD is not designed to find contextual anomalies [14], that is, single data points considered to be anomalous within the context of the time series itself. CAD is designed to detect sustained changes, rather than a single anomalous data point.
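The median-per-unit-time step can be illustrated as follows. This is a minimal sketch assuming one day as the unit of time and ISO-formatted timestamps; the function name `daily_medians` and the input layout are hypothetical and do not describe M-Lab's actual data schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def daily_medians(samples):
    """Collapse irregularly timed (ISO timestamp, value) pairs into daily medians.

    Returns a list of (date, median) pairs sorted by date. Days without any
    measurement are simply absent from the result.
    """
    by_day = defaultdict(list)
    for timestamp, value in samples:
        by_day[datetime.fromisoformat(timestamp).date()].append(value)
    return [(day, median(values)) for day, values in sorted(by_day.items())]
```

The median is less affected by a handful of extreme tests on a given day than the mean would be, which suits data collected at a variable rate.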
[13] D. Cheboli, "Anomaly Detection of Time Series," a thesis submitted to the Faculty of the Graduate School of the University of Minnesota (2010), http://conservancy.umn.edu/bitstream/handle/11299/92985/?sequence=1.
[14] Ibid., http://conservancy.umn.edu/bitstream/handle/11299/92985/?sequence=1, 6.
Applying CUSUM Charts to Internet Performance Variables
By applying the CUSUM chart to one of these time series we implicitly create a local model for the time series in question. This model defines the local normal for the time series, and it is defined by the following:
1. it is normally distributed
2. it has a well-defined mean and standard deviation
3. it is in statistical control
An anomaly signaled by the CUSUM chart implies a shift in the mean with respect to this local model of the Internet performance variable.
Example: CUSUM Chart applied to an Internet performance variable time series
Consider M-Lab's Iran daily median download throughput for the year 2013 (Figure 2) [15]. In order to be able to apply the CUSUM chart to this time series, it must have at least one subsequence whose values are approximately normally distributed. One such subsequence, if it exists, can be used as the training set. This training set would define a local normal for Iran's daily median download throughput, and it would be used to calculate the expected mean and the expected standard deviation for the time series.
Figure 2: Iran's Daily Median Download Throughput for 2013 (An M-Lab Dataset)
As it turns out, the time series in question does have such a subsequence. The measurements during the time period of June 9, 2013 to October 6, 2013 fit the requirements. This is the subsequence highlighted in red in Figure 3.
[15] Iran held its presidential elections on June 14, 2013.
Figure 3: Iran's Daily Download Throughput with the Training Set in Red
In order to demonstrate that the selected training set's distribution is close to normal, the density plot of the training set and the normal distribution curve with the mean and standard deviation of the training set are shown in Figure 4. Since the experimental distribution curve (in blue) closely approximates the theoretical distribution curve (in red), we can claim that the selected training set's distribution is close to normal.
Figure 4: Experimental and Theoretical Distributions of the Training Set
The CUSUM chart of the entire time series is then found using the mean of the training set as the expected mean, the standard deviation of the training set as the expected standard deviation, and the CUSUM parameters H = 5 and k = 3.
Figure 6 contains the graphs of the entire time series and its CUSUM chart. The red dots in the CUSUM chart of Iran's daily median throughput appear at the points where either the upper sum is above target or the lower sum is below target. The days over which the upper sum is above target coincide with the days when the time series values show a steep increase, whereas the days over which the lower sum is below target coincide with the dates when the time series values drop drastically.
Figure 6: Plot of Iran's Daily Median Throughput and its CUSUM Chart
CAD's Design
Overview
The implementation of CAD was written in R, and it uses the qcc package [16] to find the CUSUM chart of a time series. CAD uses the sliding window technique. For each window, CAD searches the time series along the window for a training set. If one is found, CAD applies the CUSUM chart to the entire time series along the window. After interpreting the results of the CUSUM chart, some of the points are designated as possible anomalies. This procedure is repeated for every window down the length of the time series. The output of the process is the indexes of the anomalies within the time series, a graph of the time series with the anomalies in red, and a bar chart of the number of times each point was labeled an anomaly.
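Because the windows overlap, a point can be labeled anomalous by many windows. A small sketch of the tally behind the bar chart mentioned above follows; the name `tally_flags` and the list-of-lists input are assumptions made for illustration, not the structure used by the R code.

```python
from collections import Counter


def tally_flags(per_window_anomalies):
    """Count, for each index, how many windows labeled it a possible anomaly.

    per_window_anomalies holds one iterable of time-series indices per window.
    """
    counts = Counter()
    for indices in per_window_anomalies:
        counts.update(set(indices))  # each window contributes at most once per index
    return counts
```

Indices with a high count were flagged consistently across windows, which may be a useful signal when reviewing the list of candidates by hand.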
Sliding Windows
The length of the moving window, w, is roughly one third of the length of the entire time series, as this length seemed to provide the best outcome. Future work could consider other values for the length or a systematic way to determine the optimal length. The window overlap is w − 1; that is, the window is always shifted one data point to the right at a time.
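The window layout described above can be sketched as a generator of index ranges. The helper name `sliding_windows` is hypothetical, and rounding "roughly one third" down with integer division is an assumption.

```python
def sliding_windows(n, w=None):
    """Yield (start, end) pairs, end exclusive, for windows shifted by one point.

    n is the length of the time series. If w is not given it defaults to
    n // 3, one reading of "roughly one third" of the series.
    """
    if w is None:
        w = n // 3
    for start in range(n - w + 1):
        yield start, start + w
```

For a series of 997 points this gives windows of 332 points and 666 window positions, each overlapping its neighbour in all but one point.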
Finding the Training Set
For each window, the portion of the time series contained in the window is searched for a training set.
The search starts by looking at all the subsequences of length floor(w/3). If a training set is not found, the length of the subsequence is decreased by one and the process repeats until either a subsequence is found that has the right properties or the subsequence length has reached 24 [17].
[16] L. Scrucca, "qcc: an R package for quality control charting and statistical process control," R News 4/1 (2004), 11-17.
The examination of each subsequence entails calculating the p-value of the Shapiro-Wilk test, which checks whether a random sample $y_1, y_2, \ldots, y_N$ comes from a normal distribution. The two hypotheses of the Shapiro-Wilk test are the null hypothesis ($H_0$), that the distribution is normal, and the alternative hypothesis ($H_a$), that the distribution is not normal. When the p-value is greater than or equal to 0.05, $H_0$ cannot be rejected. When the p-value is less than 0.05, $H_0$ is rejected and the distribution is considered non-normal. In this last case, the subsequence is discarded, since the test gives evidence that its points are not normally distributed. For each subsequence for which the p-value is greater than or equal to 0.05, the kurtosis and skewness values are calculated, and the smallest values of the CUSUM parameters H and k for which the subsequence is in statistical control are identified.
If the set of subsequences with a given length and with a p-value greater than or equal to 0.05 is non-empty, the subsequence that minimizes the quantities |1 − skewness|, |kurtosis − 3|, |p-value − 1|, H, and k is chosen to be the training set.
If the subsequence length decreases all the way to 24 and no suitable training set is found, the entire process of anomaly detection halts with the conclusion that CAD cannot be applied to the time series in question.
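The search can be outlined as below. This is a sketch under several stated assumptions, not the author's R code: the normality test is passed in as a `p_value` callable (with SciPy installed, `lambda s: scipy.stats.shapiro(s)[1]` would be one way to supply it), the criteria are combined by a plain sum because the text does not say how they are weighed, the skewness term follows the text as read here, and the H and k terms are left out.

```python
from statistics import fmean


def skewness(xs):
    """Population skewness (third standardized moment)."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5


def kurtosis(xs):
    """Population kurtosis (equals 3 for a normal distribution)."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m4 = fmean([(x - m) ** 4 for x in xs])
    return m4 / m2 ** 2


def find_training_set(window, p_value, min_len=24, alpha=0.05):
    """Return (start, length) of the chosen subsequence, or None if none qualifies."""
    for length in range(len(window) // 3, min_len - 1, -1):
        candidates = []
        for start in range(len(window) - length + 1):
            sub = window[start:start + length]
            p = p_value(sub)
            if p >= alpha:  # normality not rejected
                # Assumed scoring: unweighted sum of the listed quantities.
                score = abs(1 - skewness(sub)) + abs(kurtosis(sub) - 3) + abs(p - 1)
                candidates.append((score, start))
        if candidates:
            return min(candidates)[1], length
    return None
```

Note that a window shorter than 72 points never enters the loop, since floor(w/3) is then already below 24; the text does not say how that case is handled.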
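Applying the CUSUM Chart
Once a training set is found along with its CUSUM parameters H and k, the CUSUM chart is applied to the entire time series. The parameters for the CUSUM chart are set to the following values: H and k are taken from the values found for the training set, the expected mean is the mean of the training set, and the expected standard deviation is the standard deviation of the training set. The output of the CUSUM chart is the indices of the upper sum violations (if there are any) and of the lower sum violations (if there are any), and the values of the upper and lower sums.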
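A compact sketch of this step is given below. It is not the qcc call used by CAD: the function name `apply_cusum` is hypothetical, and expressing both h and k in units of the training set's standard deviation is an assumption about how the parameters are scaled.

```python
from statistics import fmean, stdev


def apply_cusum(series, training, h=5.0, k=3.0):
    """Chart `series` against the mean and standard deviation of `training`.

    Returns (upper violation indices, lower violation indices,
    upper sums, lower sums).
    """
    mu = fmean(training)
    sigma = stdev(training)
    upper, lower, upper_idx, lower_idx = [], [], [], []
    s_hi = s_lo = 0.0
    for i, x in enumerate(series):
        z = (x - mu) / sigma          # deviation in standard-deviation units
        s_hi = max(0.0, s_hi + z - k)
        s_lo = min(0.0, s_lo + z + k)
        upper.append(s_hi)
        lower.append(s_lo)
        if s_hi >= h:
            upper_idx.append(i)
        if s_lo <= -h:
            lower_idx.append(i)
    return upper_idx, lower_idx, upper, lower
```

The four returned values mirror the chart output listed in the text: the two sets of violation indices and the two running sums.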
Interpreting the CUSUM Chart Results
Given the CUSUM chart results for the time series sequence, the potential anomalies are identified by finding the increasing subsequences of the upper sum violations, and the decreasing subsequences of the lower sum violations, that have at least the minimum anomaly length, if there are any. The indices of these subsequence elements pinpoint the potential anomalies in the time series for the window in question.
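One possible reading of this rule is sketched below: keep runs of consecutive violation indices over which the sum strictly increases, provided the run is long enough. The interpretation of "increasing subsequence" as a run of consecutive points, and the name `increasing_runs`, are assumptions; the lower-sum case can reuse the same function on the negated sums.

```python
def increasing_runs(sums, violations, min_len=5):
    """Indices in runs of consecutive violations where `sums` strictly increases.

    sums: cumulative sum at every index of the series.
    violations: indices at which the sum is beyond the threshold.
    Only runs of at least min_len points are kept.
    """
    anomalies, run = [], []
    for i in sorted(set(violations)):
        if run and i == run[-1] + 1 and sums[i] > sums[i - 1]:
            run.append(i)
        else:
            if len(run) >= min_len:
                anomalies.extend(run)
            run = [i]
    if len(run) >= min_len:
        anomalies.extend(run)
    return anomalies
```

Requiring a minimum run length is what steers the method toward sustained shifts instead of isolated spikes.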
[17] By a statistical process control rule of thumb, 12 to 24 values are sufficient to calculate the CUSUM parameters. See http://asq.org/qualityprogress/2012/07/backtobasics/smartcharting.html.
Tuning Parameters
CAD is mostly automated; however, there are still a few parameters that, although they have default values, can nonetheless be adjusted by the user. These are:
Minimum anomaly length: the minimum length of the anomalous subsequences that CAD should detect. Its default value is 5. Decreasing it allows CAD to search for short-duration sharp increases or decreases in the time series. Its minimum value is 1, and at this setting CAD will allow for the detection of single-point anomalies, but with an increased risk of false positives.
k offset: adjusts the k value of the CUSUM chart applied to the main time series. It is an offset, a value that is added to the CUSUM parameter k. Its default value is 3. Adjusting it adjusts the sensitivity of CAD: the higher the value, the less sensitive CAD gets. There is no maximal value for it.
type: the choices for this parameter are upper or lower. It determines the type of anomaly CAD should search for. When type = upper, CAD looks for subsequences of the time series with a mean larger than the local mean for the time series. On the other hand, when type = lower, subsequences will be labeled anomalous if their mean value is below the local mean of the time series. The default setting for type is lower.
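The three tunable parameters and their defaults can be collected in a small settings object. The class and field names below are hypothetical and chosen for readability; they are not the argument names of the R implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CadSettings:
    """Illustrative container for CAD's user-adjustable parameters."""
    min_length: int = 5      # minimum length of an anomalous subsequence
    k_offset: float = 3.0    # value added to the CUSUM parameter k
    kind: str = "lower"      # "upper" or "lower" anomalies

    def __post_init__(self):
        if self.min_length < 1:
            raise ValueError("min_length must be at least 1")
        if self.kind not in ("upper", "lower"):
            raise ValueError("kind must be 'upper' or 'lower'")
```

For instance, `CadSettings(kind="upper", k_offset=1.0, min_length=3)` would describe a more sensitive search for short upward shifts.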
Examples and Results
CAD was tested on Internet performance time series that contained known anomalous subsequences. There are two different sets of examples considered here, and the data for both comes from M-Lab's Network Diagnostic Test dataset.
The first set of examples shows the results of CAD being applied to some of the time series from M-Lab's "ISP Interconnection and its Impact on Consumer Internet Performance" [18] study. The events described within the interconnection study were identified by inspection or by prior questions about where potential degradation had occurred. The data for the Internet performance variables that form these time series were used by the study to show sustained degradation of broadband performance for end users. These events show up as anomalous subsequences of the Internet performance variable time series. The results will show that CAD uncovers these very same anomalies when applied to these time series.
The second example uses CAD to find the anomalies resulting from a prominent case of confirmed Internet censorship that occurred in Iran, just before the presidential elections on June 14, 2013. The dataset used in the second example is also comprised of Internet performance data that is known to contain anomalies, since Iran's government has admitted to slowing down the Internet in order to preserve calm during the election period. [19]
Example 1: CAD Applied to the Time Series from the ISP Interconnection and its
Impact on Consumer Internet Performance Study
The M-Lab Consortium Technical Report, "ISP Interconnection and its Impact on Consumer Internet Performance," uncovered instances of performance degradation in the US using M-Lab's NDT datasets. The decline in Internet performance can be observed as a steep drop in median download throughput and a sharp increase in the packet retransmit rate of access ISPs across some of the transit ISPs. The subsequence of the time series of the median download throughput of an ISP corresponding to the time period over which the median download throughput values dropped drastically is considered to be an anomalous subsequence. Similarly, the subsequence of the packet retransmit rate time series with drastically large values is an anomalous subsequence.
In this example, we focused on the download throughput and packet retransmit rate data from the New York City area, concerning the customers of Time Warner Cable, Comcast, and Verizon connecting across the transit ISP Cogent. These NDT time series spanned the time period from January 1, 2012 to September 30, 2014. M-Lab's report demonstrated the degradation of Internet performance between April-June 2013 and late February 2014. [20] Therefore, we expected that the anomalies detected by CAD would fall into this time range as well.
[18] "ISP Interconnection and its Impact on Consumer Internet Performance," Measurement Lab (2014), http://www.measurementlab.net/publications/ispinterconnectionimpact.pdf.
[19] Golnaz Esfandiari, "Iran Admits Throttling Internet To Preserve Calm During Election," Radio Free Europe/Radio Liberty, June 26, 2013, http://www.rferl.org/content/iranInternetdisruptionselection/25028696.html.
To start, CAD was applied to the median download throughput for Time Warner Cable across Cogent in New York. For this time series the settings were the default settings: type = lower, k offset = 3, minimum anomaly length = 5. The detected anomalies, plotted in red in Figure 7, range over the time periods of May 7, 2013 to June 11, 2013 and July 15, 2013 to February 25, 2014. These time periods are mostly in agreement with the time periods of slow Internet service demonstrated by the ISP Interconnection Study. We will not consider the June 11, 2013 to July 15, 2013 gap in the list of anomalies as an error. By visual inspection of the time series, it is clear that the download throughput values of this time period are much higher than the neighboring values, both on the left and on the right. So, it is reasonable that the measurements in this time period are not labeled as lower anomalies by CAD.
Figure 7:
TWC Download Throughput Using the Transit ISP Cogent in the New York City
Area with Anomalies Detected by CAD in Red.
Next, CAD was applied to download throughput measurements between Comcast and Cogent in New York City (Figure 8). The CAD settings were: type = lower, k offset = 5, minimum anomaly length = 5. The detected anomalies occurred during the time periods of January 24, 2013 to January 31, 2013, February 5, 2013 to March 14, 2013, and April 17, 2013 to February 20, 2014.
Of these anomalies, the ones occurring during January 24, 2013 to January 31, 2013 and February 5, 2013 to March 14, 2013 are outside the expected time range. However, an examination of the plot of the time series in Figure 8 shows that there is indeed a drop in values during those periods, resulting in an overall drop in the average value for these time intervals. These drops in the value of the mean were significant enough that most points in this date range retained the anomalous label for all values of the parameter for which the points for the date range April 17, 2013 to February 20, 2014 were still deemed anomalous.
[20] "ISP Interconnection and its Impact on Consumer Internet Performance," Measurement Lab (2014), http://www.measurementlab.net/publications/ispinterconnectionimpact.pdf, 9.
Figure 8:
Comcast Download Throughput Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.
Finally, CAD was applied to download throughput measurements between Verizon and Cogent in New York City (Figure 9). The CAD settings were: type = lower, k offset = 3, minimum anomaly length = 5. The detected anomalies occurred during the time period of May 18, 2013 to February 26, 2014, which is consistent with the results of the ISP Interconnection Study.
Figure 9:
Verizon Download Throughput Using the Transit ISP Cogent in the New York City
Area with Anomalies Detected by CAD in Red.
Next, still within the context of the example, CAD was applied to time series of packet retransmission rates.
Figure 10 shows the results of CAD being applied to the daily median packet retransmission rate between Time Warner Cable and Cogent in New York City. CAD's settings were type = upper, k offset = 1, minimum anomaly length = 3. The detected anomalies occur during the time periods: January 8, 2013 to January 11, 2013; May 6, 2013 to June 11, 2013; July 21, 2013 to August 14, 2013; August 25, 2013 to September 9, 2013; September 14, 2013 to September 25, 2013; September 27, 2013 to October 9, 2013; October 19, 2013 to October 23, 2013; and November 2, 2013 to December 20, 2013.
Figure 10:
TWC Packet Retransmit RATE Using the Transit ISP Cogent in the New York City
Area with Anomalies Detected by CAD in Red.
Most of the detected anomalies fall into the expected range, with the exception of those that occurred between January 8, 2013 and January 11, 2013. However, there was a sharp increase in values during this period (see Figure 10), and so CAD designating these points as anomalies is not unreasonable. What is more concerning is that CAD failed to designate the measurements from the end of December 2013 to the end of February 2014 as anomalies, although from the graph in Figure 10 it seems that they are indeed anomalously high measurements. Future work will focus on fixing this type of issue.
Figure 11 shows the results of CAD being applied to the daily median packet retransmit rate between Comcast and Cogent in New York City. The CAD settings were: type = upper, k offset = 5, minimum anomaly length = 5. The detected anomalies occurred during the time periods of January 24, 2013 to February 6, 2013, June 1, 2013 to June 14, 2013, and July 5, 2013 to February 20, 2014. The anomalies occurring between January 24, 2013 and February 6, 2013 seem to be false positives. The rest of the anomalies occurred during the expected time period. However, CAD again failed to designate some measurements as anomalous. From Figure 11, it seems clear that the data points from May 20, 2013 to June 1, 2013 and from June 14, 2013 to July 5, 2013 have anomalously high values when compared to the rest of the time series. Investigating why these peaks were not detected could be addressed in future work.
Figure 11:
Comcast Packet Retransmit RATE Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.
The last dataset we consider from the ISP Interconnection Study is the daily median packet retransmit rate between Verizon and Cogent in New York City. When applied to this time series, CAD's parameters were set to type = upper, k offset = 3, minimum anomaly length = 1. As shown in Figure 12, the detected anomalies occur during the time periods: May 14, 2013 to July 6, 2013; July 14, 2013 to July 18, 2013; July 28, 2013 to July 30, 2013; January 2, 2014 to January 7, 2014; January 10, 2014 to January 14, 2014; January 17, 2014 to January 26, 2014; February 4, 2014 to February 9, 2014; and February 12, 2014 to February 17, 2014.
Figure 12:
Verizon Packet Retransmit RATE Using the Transit ISP Cogent in the New York
City Area with Anomalies Detected by CAD in Red.
In Verizon's case, all the anomalies detected by CAD fell into the expected date range. There are no instances of false positive errors. However, measurements from the time period of July 30, 2013 to January 2, 2014 should likely have been classified as anomalies.
Example 2: CAD Applied to Data collected from Iran before, during and after the
2013 Iranian Presidential Elections
Following the methodology used by Collin Anderson in his paper "Dimming the Internet: Detecting Throttling as a Mechanism of Censorship in Iran" [21], it seems that the Internet performance variable that is most visibly affected by throttling is download throughput. In this example CAD was applied to the daily median download throughput of Iranian clients during the time period of January 1, 2013 to July 1, 2014 from M-Lab's NDT dataset.
[21] Collin Anderson, "Dimming the Internet: Detecting Throttling as a Mechanism of Censorship in Iran," Cornell University Library, June 18, 2013, http://arxiv.org/abs/1306.4361.
Figure 13:
Iran's Daily Median Download Throughput with Anomalies Detected by CAD in
Red.
According to the June 26, 2013 Radio Free Europe/Radio Liberty blog post by Golnaz Esfandiari, Iran's minister for communications and information technology, Mohammad Hassan Nami, has acknowledged that the country restricted the speed of the Internet in the days leading up to the June 14 presidential election. [22] Therefore, the expected dates of the anomalies are the days prior to June 14, 2013.
The anomalies detected by CAD do indeed fall into the expected date range. None of the points tested by CAD were misclassified as anomalies. However, upon visual inspection of the plot of the daily median download throughput and the anomalies detected by CAD, it seems that CAD should probably have labeled at least three additional points as anomalous: May 21, 2013 to May 23, 2013.
[22] Golnaz Esfandiari, "Iran Admits Throttling Internet To Preserve Calm During Election," Radio Free Europe/Radio Liberty, June 26, 2013, http://www.rferl.org/content/iranInternetdisruptionselection/25028696.html.
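Evaluation of CAD's Performance
false positive rate: the probability of a false positive error occurring (the proportion of the number of points tested that result in a false positive error)
false negative rate: the probability of a false negative error occurring (the proportion of the number of points tested that result in a false negative error)
specificity of the method: the quantity 1 − (false positive rate)
sensitivity of the method: the quantity 1 − (false negative rate) [23]
The results of the evaluation of CAD's performance on the time series from Example 1 and Example 2 are shown in Table 1.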
Table 1: Evaluating CAD's Performance on Examples 1 and 2
Time Series    Length of the  False Positive  False Negative  False Positive  False Negative  Specificity  Sensitivity
               Time Series    Errors          Errors          Rate            Rate
TWC DT [24]    997            -               -               -               -               -            -
Comcast DT     997            44              -               0.044           -               0.956        -
Verizon DT     997            -               -               -               -               -            -
TWC PRR [25]   997            -               62              -               0.062           -            0.938
Comcast PRR    997            14              33              0.014           0.033           0.986        0.967
Verizon PRR    997            -               155             -               0.155           -            0.845
Iran DT        912            -               -               -               0.003           -            0.997
Average        -              -               -               0.008           0.036           0.992        0.964
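The rates in Table 1 follow directly from the error counts and the length of each time series. The sketch below uses this document's definitions, in which each rate is a proportion of all points tested; the function name `evaluate` is hypothetical.

```python
def evaluate(n_points, false_positives, false_negatives):
    """Compute the error rates, specificity and sensitivity as defined above."""
    fp_rate = false_positives / n_points
    fn_rate = false_negatives / n_points
    return {
        "false_positive_rate": fp_rate,
        "false_negative_rate": fn_rate,
        "specificity": 1 - fp_rate,
        "sensitivity": 1 - fn_rate,
    }


# Comcast PRR row: 997 points, 14 false positives, 33 false negatives.
comcast_prr = {name: round(value, 3) for name, value in evaluate(997, 14, 33).items()}
```

Rounded to three decimals, this reproduces the Comcast PRR row of the table.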
Future work could consider a more robust evaluation of CAD, with an eye towards reducing the false negative and false positive rates.
Conclusion
The CAD method works well for discovering anomalies in network performance data, with a high rate of successful anomaly detection and a low rate of false positives. One of the strengths of the CUSUM Anomaly Detection algorithm is that, within each sliding window, it finds a subsequence of the time series that is normally distributed and in statistical control. This subsequence then represents the normal behavior of the time series within the window, and it is used as its training set. Based on this training set a CUSUM chart is created. This chart is used in identifying the statistically significant anomalies, if there are any. One of the tunable parameters of CAD, the k offset, fine-tunes the behavior of the CUSUM chart. The output of the algorithm is a list of possible anomalies and a plot of the time series and the anomalies.
[23] "False positives and false negatives," Wikipedia, https://en.wikipedia.org/wiki/False_positives_and_false_negatives.
[24] DT is the abbreviation for Download Throughput.
[25] PRR is the abbreviation for Packet Retransmit Rate.
Although CAD is successful when applied to the type of time series it was developed for, there are several potential limitations. CAD has not been tested outside this narrow scope and, as an automatic process, it does not provide a list of points that can be labeled as anomalous with absolute certainty. However, it seeds the research process by producing a list of possible anomalies that the user must assess. Thereafter, the user may adjust parameters as necessary to produce more reliable results. Future work could potentially focus on reducing the need for tunable parameters and extending CAD's application domain to a more general category of univariate time series.
Appendix
CAD was tested on additional examples.
[...] and it was not repaired until January 5, 2014. Then, in early March 2014, the AAG cable was again under repair. It was a planned event, but it still adversely affected Internet performance [26].
Figure 14 contains a screenshot by Renesys Internet Intelligence, now known as Dyn Internet Intelligence [27], obtained from the Dyn Research blog post "Beware the Ides of March: Subsea Cable Cut Trend Continues." [28] It clearly shows the impact of the cable outage on Internet latency from Tokyo to Vietnam.
Figure 14:
Screenshot by Renesys Internet Intelligence
CAD was applied to M-Lab time series of data collected from Vietnam between September 1, 2013 and May 30, 2014 in order to see whether CAD could also detect the sustained degradation in Internet performance.
The minimum round trip time, packet retransmit rate and download throughput from Vietnam during the time period of September 1, 2013 to May 2, 2014 were downloaded from M-Lab's Google BigQuery database. The plots of the daily medians of these Internet performance variables and the anomalies detected by CAD are shown below:
[26] Doug Madory, "Beware the Ides of March: Subsea Cable Cut Trend Continues," Dyn Research, March 31, 2014, http://research.dyn.com/2014/03/bewaretheidesofmarch/#!prettyPhoto.
[27] Dyn Intelligence, http://dyn.com/dyninternetintelligence/.
[28] Doug Madory, "Beware the Ides of March: Subsea Cable Cut Trend Continues," Dyn Research, March 31, 2014, http://research.dyn.com/2014/03/bewaretheidesofmarch/#!prettyPhoto.
Figure 15:
Daily Median Round Trip Time in Vietnam, Anomalies Detected by CAD in Red.
Figure 16:
Daily Median Packet Retransmit Rate in Vietnam, Anomalies Detected by CAD in
Red.
Figure 17:
Daily Median Download Throughput in Vietnam, Anomalies Detected by CAD
in Red.
Example 2: M-Lab Test Volume Increase After the Internet Health Test Launch
A new version of the Network Diagnostic Tool (NDT) was released in late April 2015. This new version allowed for measuring network performance from the browser without a need for browser plugins. Battle for the Net [29], a coalition of public interest advocacy organizations, was among the first to take advantage of this new update [30], and in May 2015 Battle for the Net launched The Internet Health Test [31]. This test uses M-Lab infrastructure and code, and all the data collected by the test are hosted by M-Lab. Figure 18, obtained from M-Lab's blog post "New Opportunities for Test Deployment and Continued Analysis of Interconnection Performance," [32] shows that the launching of the Internet Health Test resulted in a sharp increase in the amount of M-Lab data collected.
[29] "Net Neutrality is Under Attack," Fight For The Future, https://www.battleforthenet.com/.
[30] Collin Anderson, "New Opportunities for Test Deployment and Continued Analysis of Interconnection Performance," Measurement Lab, June 24, 2015, https://www.measurementlab.net/blog/interconnection_and_measurement_update/.
[31] "The Internet Health Test," Fight For The Future, https://www.battleforthenet.com/internethealthtest/.
[32] Collin Anderson, "New Opportunities for Test Deployment and Continued Analysis of Interconnection Performance," Measurement Lab, June 24, 2015, https://www.measurementlab.net/blog/interconnection_and_measurement_update/.
The time series of daily Network Diagnostic Test counts from M-Lab for the time period of December 1, 2014 to July 31, 2015, with its known anomalously high values in May of 2015, was used to test CAD's anomaly detection capabilities.
Figure 19 shows the graph of M-Lab's daily test count and the anomalies detected by CAD in red. CAD's parameters were set to type = upper, k offset = 4, minimum anomaly length = 3.
Figure 19:
Daily Network Diagnostic Test Count
About the Author
Kinga Farkas is a data science consultant for Fast Forward, Inc. and Sperling's BestPlaces. In her current job, Kinga uses various machine learning techniques for the predictive analysis of a diverse set of demographic and socioeconomic data. She was an Outreachy data science intern at M-Lab during May-August 2015. It was during this period that she created CAD, the CUSUM Anomaly Detection method. A mathematician by training, Kinga holds an MS in mathematics, a BS in mathematics and a BS in physics from Oregon State University. She has done further graduate-level work in algebraic number theory, algebraic geometry and elliptic curve cryptography at the same institution. Kinga writes about her data science escapades in her blog at: http://nthturn.com/.
The R code for CAD can be found here: https://github.com/kingakfarkas/CAD.