
This page is a work in progress describing Ensemble Selection, a relatively new and powerful approach to supervised machine learning. The original paper can be found here: http://www.cs.cornell.edu/~caruana/caruana.icml04.revised.rev2.ps

Ensemble Selection in a Nutshell (informal description)
Over the years, people in the field of Machine Learning have come up with all sorts of algorithms and improvements to existing approaches. There are so many unique supervised learning algorithms that it's really hard to keep track of them all. Furthermore, no one method is best, because it really depends on the characteristics of the data you are working with. In practice, different algorithms are able to take advantage of different characteristics and relationships of a given dataset.

Intuitively, the goal of ensemble selection is to automatically detect and combine the strengths of these unique algorithms to create a whole that is greater than the sum of its parts. This is accomplished by creating a model library that is intended to be as diverse as possible, to capitalize on a large number of unique learning approaches. This paradigm of overproducing a huge number of models is very different from more traditional ensemble approaches. Thus far, our results have been very encouraging.

Preliminary Results of our WEKA Implementation

Thus far, we've seen considerable performance increases with our WEKA implementation of Ensemble Selection. The following table shows performance on 15 learning problems from the UC Irvine dataset repository (7 binary and 8 multiclass). These results were obtained by using half of each dataset as a train set and half as a test set. The Ensemble Selection default settings were used without any parameter tuning. Unfortunately, we do not have any error bars, as training model libraries is quite expensive and we only had the computing time to try one of each. However, as you can see, it did quite well across the board, and performance of the ensembles clearly improved over the best library models on all but a few of the datasets. Although the computing time necessary was large, we ran our classifier out of the box, so the human interaction time was really low.

It is definitely also worth noting that even in cases where performance of the ensemble does not improve over the model library, you still get statistics on all the models in your ensemble library. This is useful and interesting in and of itself. We foresee many situations where people would want to see a ranked performance summary of how a default set of Ensemble Selection classifiers did on their dataset.

(For a description of the metric loss reduction listed in the reduction column in the table below, please see section 5.2 of the Ensemble Selection paper.)

The Ensemble Selection Algorithm

The basic algorithm can be briefly explained in two parts:

The first step is to create a model library. This library should be a large and diverse set of classifiers with different parameters. To the best of our ability, we throw the kitchen sink at your data: the more the merrier. Keep in mind that it should not really hurt performance to train bad models, as they will simply not be chosen by the Ensemble Selection algorithm if they hurt performance.

The second step is to combine models from your library with the Ensemble Selection algorithm. Although there are a lot of very complex refinements to prevent overfitting, the basic idea behind the algorithm is simple. We start with the model from our whole library that did the best job on our validation set (held-aside data). At this point we have an ensemble with only one model in it. Then, we add models one at a time to our ensemble. To figure out which model to add, each time we separately average the predictions of each candidate model from the library with the current ensemble, and then choose the model that provided the most performance improvement.

The Overfitting Problem

This is a greedy hillclimbing approach, and as previously mentioned it has a lot of overfitting problems (while performance on the validation set improves, performance on the test set degrades). To prevent the problems with overfitting, as described in the original paper, we take advantage of three proven strategies: model bagging, replacement, and sort initialization. For a more in-depth description of these three methods, please refer to the original paper.
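The greedy hillclimbing loop described above can be sketched in a few lines. This is a simplified, hypothetical illustration, not the actual WEKA EnsembleSelection code: library models are represented only by their validation-set predictions as plain double arrays, RMSE is the hillclimb metric, and selection with replacement is allowed, so a model's weight is the number of times it was picked.

```java
public class ForwardSelectionSketch {

    // Root-mean-squared error between averaged predictions and targets.
    static double rmse(double[] preds, double[] targets) {
        double sum = 0;
        for (int i = 0; i < preds.length; i++) {
            double d = preds[i] - targets[i];
            sum += d * d;
        }
        return Math.sqrt(sum / preds.length);
    }

    /**
     * Greedy forward selection: repeatedly add the library model whose
     * predictions, averaged into the current ensemble, give the lowest
     * RMSE on the validation targets. Models may be added more than once
     * (selection with replacement); returns how often each was picked.
     */
    static int[] selectEnsemble(double[][] library, double[] targets, int iterations) {
        int n = targets.length;
        int[] weights = new int[library.length];
        double[] ensembleSum = new double[n]; // running sum of chosen models' predictions
        int chosen = 0;
        for (int it = 0; it < iterations; it++) {
            int best = -1;
            double bestErr = Double.POSITIVE_INFINITY;
            for (int m = 0; m < library.length; m++) {
                double[] avg = new double[n];
                for (int i = 0; i < n; i++) {
                    avg[i] = (ensembleSum[i] + library[m][i]) / (chosen + 1);
                }
                double err = rmse(avg, targets);
                if (err < bestErr) { bestErr = err; best = m; }
            }
            for (int i = 0; i < n; i++) ensembleSum[i] += library[best][i];
            chosen++;
            weights[best]++;
        }
        return weights;
    }

    public static void main(String[] args) {
        double[] targets = {1, 0, 1, 0};
        double[][] library = {
            {0.9, 0.1, 0.8, 0.2},  // good model
            {0.5, 0.5, 0.5, 0.5},  // uninformative model
            {0.1, 0.9, 0.2, 0.8},  // anti-correlated model
        };
        int[] w = selectEnsemble(library, targets, 5);
        System.out.println(java.util.Arrays.toString(w)); // the good model dominates
    }
}
```

In this toy run the good model is chosen every iteration; with a realistic library, complementary models get picked and the weight vector spreads out.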

Ensemble Selection: A Slightly More Technical Explanation (link to this on a separate WIKI page)

Ensemble Selection is an ensemble learning method. The focus of the Ensemble Selection learning algorithm is combining a large set of diverse models into a high-performance ensemble by determining appropriate weights for the models. The algorithm was inspired by forward stepwise feature selection, wherein features are added one at a time to greedily improve performance of the model being trained. Here, we greedily add models to our ensemble to optimize for some metric, such as accuracy. Initially, we add the model which has the best performance on some validation data. Then, for some number of iterations, we add the model which, when combined with the current set of models, helps performance the most on the validation data. When we are done, we can consider the number of times a model was added as its weight, and our ensemble makes predictions by taking a weighted average over the chosen models.

Some other techniques are used to help performance and avoid overfitting. One is sort initialization. The idea is to sort models by their performance and add the best N models, where N is the number of top models which optimizes performance on the validation set. Another important strategy is model bagging. The idea here is to only consider some random subset of the entire library of available models, for example half (the default), and run ensemble selection for that subset. This is done some number of times, similarly to Breiman's Bagging method, and the weights determined by Ensemble Selection for each bag are added up to create the final ensemble.

Ensemble Selection can be run using a set-aside validation set, or using cross-validation. The cross-validation implemented in the Ensemble Selection classifier is a method proposed by Caruana et al. called embedded cross-validation. Embedded cross-validation is slightly different from typical cross-validation. In embedded cross-validation, as with standard cross-validation, we divide up the training data into n folds,

with separate training data and validation data for each fold. We then train each base classifier configuration n times, once for each set of training data. We are then left with n versions of each base classifier, and we combine them in Ensemble Selection into what we call a model (represented by the EnsembleSelectionLibraryModel class). Thus, a model in Ensemble Selection is actually made up of n classifiers, one for each fold.

The point of embedded cross-validation is to choose models for the ensemble. That is, rather than being interested in the performance of a single trained classifier, we are concerned with how well the model, or base classifier configuration, performed (based on the performance of its constituent classifiers). Notice that for every instance in the training set, there is one classifier for each model/configuration which was not trained on that instance. Thus, to evaluate the performance of the model on that instance, we can simply use the single classifier from that model which was not trained on the instance. In this way, we can evaluate each model using the entire training set, since for every instance in the training set and every model, we have a prediction which is not based on a classifier that was trained with that instance. Since we can evaluate all the models in our library on the entire training set, we can perform ensemble selection using the entire training set for hillclimbing.
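The held-out evaluation trick can be sketched as follows. This is an illustrative stand-in, not the EnsembleSelectionLibraryModel code itself: each fold classifier is reduced to a precomputed prediction array, and foldOf records which fold held each training instance out.

```java
public class EmbeddedCvSketch {

    /**
     * A "model" under embedded cross-validation holds one trained
     * classifier per fold. foldPreds[f][i] is fold-classifier f's
     * prediction for training instance i, and foldOf[i] says which fold
     * instance i was held out of. To score the model on instance i we use
     * foldPreds[foldOf[i]][i]: the one classifier that never saw i during
     * training. This yields an unbiased prediction for every training
     * instance, so hillclimbing can use the whole training set.
     */
    static double[] heldOutPredictions(double[][] foldPreds, int[] foldOf) {
        double[] out = new double[foldOf.length];
        for (int i = 0; i < foldOf.length; i++) {
            out[i] = foldPreds[foldOf[i]][i];
        }
        return out;
    }

    public static void main(String[] args) {
        // 2 folds, 4 instances; instances 0,1 were held out of fold 0's
        // training data, instances 2,3 out of fold 1's training data.
        double[][] foldPreds = {
            {0.9, 0.2, 0.7, 0.6},  // classifier trained without instances 0,1
            {0.8, 0.3, 0.1, 0.9},  // classifier trained without instances 2,3
        };
        int[] foldOf = {0, 0, 1, 1};
        double[] p = heldOutPredictions(foldPreds, foldOf);
        System.out.println(java.util.Arrays.toString(p)); // [0.9, 0.2, 0.1, 0.9]
    }
}
```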

User Guide
As with all WEKA classifiers, there are two ways to run Ensemble Selection: either from the command line or from the GUI. However, this classifier is slightly unique in a couple of ways. First, it requires the additional first step of creating a model list file. This file basically tells the algorithm which models you would like to train for your model library. Second, the Ensemble Selection classifier requires you to specify a working directory on a writable file system where it will be able to save models that it has trained, for use in building ensembles.

Building the model list
Whether you are going to run the Ensemble Selection algorithm from the command line or from the GUI (e.g. the Explorer), you will first need to create a model list. To do this, you will need to use the Library Editor GUI that was made for this purpose: build a model list and then save it to a .model.xml file.

We created this user interface to let us quickly build lists, aka libraries, of classifiers to be trained. The design goal of this widget was to let us quickly throw together really long lists of models that we will train for Ensemble Selection libraries. We want to be able to take shortcuts like: build me every neural net with every possible value in this range of learning rate values, and this range of momentum values, and both with and without the

nominalToBinary filter turned on, and then BAM, you have a list of classifiers consisting of every possible combination of the parameter ranges specified.

You can bring up the Ensemble Library Editor by simply clicking on the Ensemble Selection library attribute in the Explorer (highlighted below).
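The "every possible combination" expansion described above boils down to a cartesian product over the per-parameter value lists. A minimal sketch (the option strings below are made up for illustration and are not the LibraryEditor's actual internal representation):

```java
import java.util.ArrayList;
import java.util.List;

public class ParameterGridSketch {

    /** All combinations of the given option values, one inner list per parameter. */
    static List<List<String>> cartesian(List<List<String>> ranges) {
        List<List<String>> result = new ArrayList<>();
        result.add(new ArrayList<>()); // start with one empty combination
        for (List<String> range : ranges) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> partial : result) {
                for (String value : range) {
                    List<String> extended = new ArrayList<>(partial);
                    extended.add(value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical ranges: 3 learning rates x 2 momentum values x 2 filter flags
        List<List<String>> ranges = List.of(
            List.of("-L 0.1", "-L 0.2", "-L 0.3"),
            List.of("-M 0.2", "-M 0.5"),
            List.of("-B true", "-B false"));
        System.out.println(cartesian(ranges).size()); // 3 * 2 * 2 = 12 models
    }
}
```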

Clicking the attribute will bring up the Model List editor, which is a separate GUI window that consists of 4 tabbed panels. While the first panel displays all the models in the current model list, the other three allow you to add models to this list.

Model List Panel - This is the first panel, responsible for displaying all models that are currently in the model library. It also lets you save/load model lists in either xml format (.model.xml) or as a simple flat file (.mlf; recommended for logging purposes only!).

Add Models Panel - Lets you generate sets of models to add by specifying Classifiers and ranges of parameters for them.

Add Default Set Panel - Lets you add models from one of the Ensemble Selection default lists.

Load Models Panel - Detects all of the .elm files in the working directory currently specified for your Ensemble Selection classifier and builds a list containing all the models found.

We tried to make these user interfaces as intuitive and simple as we could, so we think you could probably just jump in and figure them out by playing around. If you get stuck, the following four sections describe these panels in more detail.

The Model List Panel

The first panel you will see is the Model List Panel (shown below). As the name suggests, this tab simply shows you all the models that have currently been chosen to go into your library list. Since you shouldn't have any models in your list at the beginning, the list should be empty when it first comes up.

From this panel you can also save/load model list files with the respective buttons at the bottom, which bring up standard file choosers.

While loading, the file chooser will only let you choose files with either the .model.xml extension or the .mlf extension, depending on which file type is currently selected.

The Add Models Panel

OK, so let's add some models to our now empty library. First hit the add models tab to make a small working set. There are two parts of the add models panel. The top half shows you all of the parameters for the currently selected Classifier type. This is where you specify value ranges and combinations. Try playing around with collapsing and expanding the tree nodes. Also note that we created tool tips for all the parameters, so if you hold your mouse still over one for a few seconds, a comment describing it will pop up. The bottom half shows you a current temporary working set from which you can add models to the main Library List panel (above).

For numeric attributes, you can specify ranges. For example, with the neural net classifier, the following says to try all learning rates from 0.05 to 0.5 in increments of 0.05, which will give us 0.05, 0.10, 0.15, etc.

Note that you can also define exponential ranges, that is, multiply by the step value instead of adding it to get each value in the range. You can toggle this mode by hitting the += button; when you do, it will show *= instead. As an example, for neural nets it's often useful to try exponentially increasing ranges of training epochs. The following would give the training epoch values of 500, 1000, 2000, 4000, and 8000.
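The two range modes can be captured in one small helper. This is an illustrative sketch of the semantics described above, not the editor's actual code: in += mode the step is added each iteration; in *= mode each value is multiplied by the step (so min must be positive there).

```java
import java.util.ArrayList;
import java.util.List;

public class NumericRangeSketch {

    /**
     * Generate values from min up to max. In "+=" mode the step is added
     * each time; in "*=" mode (the exponential toggle) each value is
     * multiplied by the step instead. A tiny tolerance on the upper bound
     * absorbs floating-point drift in the additive case.
     */
    static List<Double> range(double min, double max, double step, boolean multiply) {
        List<Double> values = new ArrayList<>();
        for (double v = min; v <= max + 1e-9; v = multiply ? v * step : v + step) {
            values.add(v);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(range(0.05, 0.5, 0.05, false)); // 0.05, 0.10, ..., ~0.50
        System.out.println(range(500, 8000, 2, true));     // 500, 1000, 2000, 4000, 8000
    }
}
```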

For nominal values (either an enumeration of values or binary true/false booleans), you simply check the boxes of the values you wish to try.

Example of Binary Attribute (subtreeRaising in J48):

Example of an Enumeration Attribute (distanceWeighting in IBk):

Now let's see what happens when we generate models from our specified parameter ranges. Consider the example below using the J48 classifier. We've specified 2 values for binarySplits (true and false), 5 values for confidenceFactor (0.1, 0.2, 0.3, 0.4, and 0.5), and 5 values for minNumObj (1, 2, 4, 8, and 16). So we should get a total of 50 models generated (2 x 5 x 5). When you have selected the appropriate parameter space up top, you hit the generate models button in the middle to generate and add all of the models into the temporary working list on the bottom. This is a scrollable list in which each row represents a model (see below). To get more information about a model, leave your mouse pointer over it to see a tool tip showing all the respective parameter values. Furthermore, list-like

keyboard shortcuts also work: <ctrl>+a will select all models in the list and <delete> will delete the models currently selected. You can prune this working list to your liking by selecting the models you don't want (<ctrl>+click or <shift>+click to select several) and then hitting the remove selected button. Once you are satisfied with the working list and are ready to add the models to the model library, just hit the Add all button and every model in the list will be added to the main panel.

It should be mentioned that it is possible to use the GUI to specify invalid values for classifier parameters; in this case these models will appear highlighted. You can use the mouse tooltip to find out what the problem is (specifically, what the exception text was

when trying to set the options for that particular model). For example, trying to specify a value greater than 1 for the confidence value parameter of J48 trees causes an exception. Therefore, these models will be highlighted, and holding the mouse pointer over them will display the tooltip explaining why. If you want to remove all models that had errors, just hit the remove invalid button. Also note that these won't get added to the main panel; they'll just be ignored. Oh, and one other strange thing we've encountered is that some classifiers will not throw an exception when given invalid input but instead will just replace it with a default value. So in these cases the invalid models won't be highlighted: they simply won't appear in the list, and a model with a valid value will appear instead.

Finally, it should be mentioned that the tree list is recursively defined. That is, a classifier with a classifier as a parameter (i.e. some of the meta classifiers) will result in that parameter having its own subtree of parameters, just like the root node. Try it: select the meta.Bagging classifier and experiment. You can even create bags of bags of bags. Although that is a really dumb idea, it's nice to know the original WEKA flexibility remains. This allows powerful combinations. For example, consider the following: we are wrapping a J48 classifier inside of two layers of meta classifiers. We can actually specify parameter ranges on all three levels, and it will generate all the possible combinations as expected.

This subtree behavior applies to any objects that are rendered by the GenericObjectEditor, not just classifiers. As an example, consider the BayesNet classifier, which has two parameters, estimator and searchAlgorithm, that are custom objects specific to that algorithm with their own sets of parameters. These are not classifiers, but since they use the GenericObjectEditor as a GUI, the LibraryEditor knows how to grab their relevant information and display their sub-arguments within the classifier tree.

Add Default Set Panel

This panel is probably the right place to start if you just want to try out ensemble selection without too much effort. It lets you choose models from an already existing model list. All you have to do is select the list you want from the dropdown list at the top. We currently have two notable default lists: one targeting multiclass and one targeting binary class datasets. Note that it is normal for the GUI to freeze up for a little bit after selecting one of these lists, because they are quite large (>1000 models) and the GUI needs to process and test each one to make sure that all the options are valid.

For a list to be selectable within this GUI, it needs to

1) be listed in the weka/classifiers/meta/ensembleSelection/DefaultModels.props file

2) actually be located in the weka/classifiers/meta/ensembleSelection/ directory in the classpath

You will also see near the top a dropdown list labeled "Exclude models with large:" with the selectable options of train time, test time, and file size. This lets you prune the default list by removing models that will take too many resources to train. Once you select the resource you care about, you can hit the exclude button and the list should shrink. The logic that determines which models get removed for each type of resource (train time, test time, and file size) is based on a list of regular expressions that can be found in the weka/classifiers/meta/ensembleSelection/DefaultModels.props file.

Currently, these lists of regular expressions are fairly basic, e.g. IBk classifiers will get removed for test time, MultilayerPerceptron will get removed for train time, etc.
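The regex-driven exclusion can be sketched like this (the pattern and model command-line strings here are made up for illustration; the real patterns live in DefaultModels.props):

```java
import java.util.List;
import java.util.stream.Collectors;

public class ExcludeModelsSketch {

    /** Drop any model whose command line matches one of the exclusion patterns. */
    static List<String> exclude(List<String> models, List<String> patterns) {
        return models.stream()
                .filter(m -> patterns.stream().noneMatch(m::matches))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> models = List.of(
            "weka.classifiers.lazy.IBk -K 5",
            "weka.classifiers.trees.J48 -C 0.25");
        // Hypothetical "large test time" pattern: any IBk configuration
        List<String> kept = exclude(models, List.of(".*IBk.*"));
        System.out.println(kept); // only the J48 model survives
    }
}
```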

Similar to the Add Models Panel, you simply hit either add all or add selected to move these models to the model library shown in the list models panel.

Load Models Panel

This GUI will automatically search the file system for .elm files in the currently selected working directory and load them into a list. This is really just meant for convenience, assuming that you've already trained a library and would now like to build ensembles with it.

It is important to note that the default behavior is to search ALL directories in your current working directory. So if you've trained two different libraries with different classifiers in your working directory, the list that will appear in the load models panel will be the union of both these model sets.

The reload set button will simply reload the list in case 1) you've changed your working directory parameter or 2) the set of models in your directory has somehow changed.

Choosing a Working Directory
The working directory is just a String argument to Ensemble Selection representing the file path to your ensemble. While that seems pretty straightforward, there are a lot of details about the anatomy of the working directory that aren't. This section will explain the default behaviors surrounding this parameter, the file structure of a working directory, and the naming conventions used.

As previously mentioned, in a typical run of Ensemble Selection you are going to want to train a lot of models. So we assumed from the beginning that there was no way all of them would fit into system memory. To solve this problem, we assume that the user will specify a working directory for the library models to be saved in. This gives our classifier a place where it can serialize all of its library models to the file system, along with other information it needs to track which models were trained with what sets of data. Alternately, if you have already trained a library of models, then by specifying a working directory you are telling Ensemble Selection where to look for the models it needs.

The following subsections explain the internals of the working directory.

Establishing the working directory

The first thing that the classifier does is establish the working directory by finding the directory specified as the working directory. If you specify a working directory that does not exist on the file system, it will be created (assuming that it's a well-formed file path; otherwise you get an exception).

If you do not specify a working directory, then Ensemble Selection will create a default working directory in your home directory named EnsembleX, where X is the first unused number between 1 and 999 in your home directory. If you already have directories named Ensemble1 all the way to Ensemble999 in your home directory, then you will get an exception, not to mention a lot of working directories!

Dataset Directories and Naming Conventions

The next thing that the classifier does at train time is establish a directory in your working directory that is specific to the set of instances it was given to train on. For each set of instances that you train in a working directory, a directory with a name unique to that set of instances will be created.

To get the unique directory name for the train set, the checksum of the Instances.toString() method is used to get a sequence of 8 alphanumeric characters. Note that the Instances.toString() method returns the dataset in .arff format. The checksum is then appended to the number of instances in the dataset. For example, a typical directory name will be something like:

2391_instances_46778c6d

If it doesn't already exist, the dataset directory gets created and the classifier begins training all the classifiers that were specified in the model list argument. All of these models are stored in the directory associated with the training dataset. If the dataset directory already exists, then each model in the model list is created and stored there only if it doesn't already exist in the dataset directory.
At first, this checksum naming convention might seem strange, but it fulfills a very important property: it makes sure that the models used to create an ensemble were trained on the very same data that was given to train its underlying library models. Since we are allowing model libraries to be created separately from ensembles, we decided that we needed some mechanism to enforce this. This is an important property!
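A sketch of the naming scheme is below. CRC32 is an illustrative assumption here; the real implementation may use a different checksum function, so the hex digits will not match the directory names WEKA actually produces.

```java
import java.util.zip.CRC32;

public class DatasetDirNameSketch {

    /**
     * Build a directory name unique to a training set: the instance count
     * plus an 8-hex-digit checksum of the dataset's ARFF text. The same
     * dataset always maps to the same directory, so separately trained
     * libraries and ensembles can find each other's files.
     */
    static String datasetDirName(int numInstances, String arffText) {
        CRC32 crc = new CRC32();
        crc.update(arffText.getBytes());
        return String.format("%d_instances_%08x", numInstances, crc.getValue());
    }

    public static void main(String[] args) {
        String name = datasetDirName(2391, "@relation demo\n@attribute a numeric\n@data\n1.0\n");
        System.out.println(name); // e.g. 2391_instances_ followed by 8 hex digits
    }
}
```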

Logging Files

It is also worth mentioning that at train time, the output of Instances.toString() is automatically saved in a file with the same name as the dataset directory but with the .arff extension, indicating it is the dataset file. For example, in the 2391_instances_46778c6d directory, we would find:

2391_instances_46778c6d.arff

That file stores the specific set of instances used to train all models in that directory. We thought this would be useful for logging purposes.

Also, the set of models that you attempted to train for the library will be saved in this directory in both the xml (.model.xml) and human-readable flat file (.mlf) formats. These lists are saved with a file name reflecting the time training started, along with the number of models that were going to be trained. For example, you would find model list files with names similar to this:

2006.05.18.16.34_1041_models.mlf
2006.05.18.16.34_1041_models.model.xml

Ensemble Library Model files (.elm extension)

The next thing to explain is the files used to store our ensemble library models. What we do is take the command line that would normally be used to train the library model and turn it into a file name with four transformations. First, we turn all space and quote (") characters into underscore (_) characters to make things more readable. Second, for file names that would be greater than 128 characters, we trim the end off so they will satisfy the 128-character limit of most operating systems. Third, we append a checksum of the original model command line to guarantee that the file names are unique for each model. Fourth, we add the .elm file extension, indicating the file is an Ensemble Library Model file.

To summarize, the model represented by this command line string:

weka.classifiers.trees.J48 -C 0.25 -B -M

will be saved in a file with the name:

weka.classifiers.trees.J48_-C_0.25_-B_-M_23bae0c94.elm
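The four transformations can be sketched as follows. CRC32 and the exact trim point are illustrative assumptions, so this sketch will not reproduce the checksum shown in the example above; it only demonstrates the shape of the scheme.

```java
import java.util.zip.CRC32;

public class ElmFileNameSketch {

    /**
     * Turn a model's training command line into a file name:
     * 1) replace spaces and quotes with underscores,
     * 2) trim so the final name stays within a 128-character limit,
     * 3) append a checksum of the ORIGINAL command line (so two long
     *    command lines that share a trimmed prefix still get unique names),
     * 4) add the .elm extension.
     */
    static String elmFileName(String commandLine) {
        String base = commandLine.replace(' ', '_').replace('"', '_');
        CRC32 crc = new CRC32();
        crc.update(commandLine.getBytes());
        String suffix = "_" + Long.toHexString(crc.getValue()) + ".elm";
        int maxBase = 128 - suffix.length();
        if (base.length() > maxBase) base = base.substring(0, maxBase);
        return base + suffix;
    }

    public static void main(String[] args) {
        System.out.println(elmFileName("weka.classifiers.trees.J48 -C 0.25 -B -M"));
    }
}
```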

Note that the .elm files hold all models (one for each data fold) associated with a particular classifier and associated set of parameters, along with other information that Ensemble Selection will need later on to build ensembles.

Lock Files

We created what we informally call lock files, which are created before each library model is trained and deleted after the model is finished training. Since these lock files correspond to a specific model, indicating that it is currently being worked on, they have the same naming convention except they carry the .LCK file extension. The .elm file in the previous example will have the lock file name:

weka.classifiers.trees.J48_-C_0.25_-B_-M_23bae0c94.LCK

These are informal lock files we use for processes to say "hey, I'm working on that model right now". When training a list of models, Ensemble Selection will simply skip any models in its list for which it detects a lock file. We do this for two reasons:

1) This is nice when training libraries in parallel on clusters. Each node knows not to train a model that has a .LCK file. These get deleted when the process is done training the respective model.

2) These files are also useful when training models on a single computer, because they help you figure out which (if any) models didn't train. When Ensemble Selection detects these .LCK files, it assumes that it shouldn't deal with the classifier and skips to the next one in the list.

You may be wondering why we don't use real file locks (which Java does support). Our main reason is that we felt this would be overkill. The fact is that when training models in parallel, we don't really need to enforce any guarantees that the .LCK files truly reflect the current state of training. The worst case (which should be extremely rare) is that two nodes would train the same model, which isn't really all that bad. Furthermore, implementing true file locks would have taken more time.
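The informal protocol described above can be sketched with plain java.io.File operations. This is a hypothetical sketch, not WEKA's code: File.createNewFile atomically creates the file only if it does not yet exist, which is enough for the best-effort guarantee discussed above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

public class LockFileSketch {

    /**
     * Train each model in the list, skipping any that is already trained
     * or whose .LCK marker exists (another process is working on it).
     */
    static void trainAll(List<String> elmNames, File workDir) {
        for (String elm : elmNames) {
            File model = new File(workDir, elm);
            File lock = new File(workDir, elm.replaceAll("\\.elm$", ".LCK"));
            boolean locked = false;
            try {
                if (model.exists()) continue;          // already trained
                if (!lock.createNewFile()) continue;   // another process holds the lock
                locked = true;
                // ... train the classifier and serialize it to `model` here ...
                model.createNewFile(); // stand-in for writing the trained model file
            } catch (IOException e) {
                throw new RuntimeException(e);
            } finally {
                if (locked) lock.delete(); // release only a lock we created ourselves
            }
        }
    }

    public static void main(String[] args) {
        File dir = new File(System.getProperty("java.io.tmpdir"), "es-lock-demo");
        dir.mkdirs();
        trainAll(List.of("J48_-C_0.25_abc123.elm"), dir);
        System.out.println(new File(dir, "J48_-C_0.25_abc123.elm").exists());
    }
}
```

Running the same command on several cluster nodes with a shared working directory then divides the list between them with no coordination beyond the file system.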

Ensemble Selection from the command line
We recommend training the library models as a separate step, because it will make debugging easier should something go wrong.

Step 1: Creating a model list file

Even though you are running Ensemble Selection from the command line, you still need to create the model list (.model.xml) file from the GUI. Follow the instructions in the previous section about building model list files to do this. Once you've saved the model list file, you will specify it on the command line with the -L path/to/your/mode/list/file.model.xml option described below.

Step 2: Training the library

To build a library from the command line, you need to use the -A library option to tell Ensemble Selection that it shouldn't build any ensemble models and that you only want to train the base classifiers.

WARNING! You should basically NEVER specify cross-validation (CV) from the command line with the -x option that is read by Evaluation.java. Ensemble Selection fully supports cross-validation, but it is implemented very carefully to use validation data to build ensembles. Instead, you should use the -X option to specify the number of folds. Also, this may seem counterintuitive, but the -no-cv and -v options are extremely important, since they prevent the Evaluation class from defaulting to 10-fold CV to try to create a performance estimate. This is a lot of wasted CPU time, since we are only training a library and don't care about performance estimates. With Ensemble Selection, you should always tell the classifier evaluation code not to perform cross-validation, regardless of whether you are doing it for Ensemble Selection.

Finally, the -V <validation ratio> option is only meaningful when you do not specify a number of cross-validation folds greater than 1. If you are using CV, then validation sets are generated automatically; otherwise, if you have only one fold, the percentage specified with the -V option will be held aside for validation.

The following is an example command line to build an ensemble library:
java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -A library -X 5 -S 1 -O -D -t yourTrainingInstances.arff

Step 3: Building the Ensemble

Just like any WEKA classifier, you specify the file to save your model to with the -d flag and the train instances with the -t flag. Note that these train instances should be the same ones used in the previous step to train the library. In addition, the following options given reflect most of the default values:

10 model bags with a model ratio of 50% for each

using RMSE as the hillclimbing metric while adding models
using 100 iterations of the hillclimbing algorithm to add models
using greedy sort initialization with 100% of the models
five-fold cross-validation

The following is an example command line to build and save an Ensemble Selection model from an ensemble library:


java weka.classifiers.meta.EnsembleSelection -no-cv -v -L path/to/your/mode/list/file.model.xml -W /path/to/your/working/directory -B 10 -P rmse -A forward -E 0.5 -H 100 -I 1.0 -X 5 -S 1 -G -O -R -D -d /path/to/your/model.model -t yourTrainingInstances.arff

Step 4: Testing the Ensemble

Just like any WEKA classifier, you specify the file to load your model from with the -l flag and the test instances with the -T flag. There's really nothing special to note here. The following is an example command line to test an ensemble library:


java weka.classifiers.meta.EnsembleSelection -no-cv -W /path/to/your/working/directory -O -D -l /path/to/your/model.model -T yourTestingInstances.arff -i -k

Ensemble Selection from the GUI
A Note on Memory Usage from the Explorer

Running Ensemble Selection from the Explorer is only recommended for either building model lists or playing around with small datasets. You will find that with a decent-sized model list on anything but the smallest datasets, you will quickly run out of memory.

The main problem is the fact that the Evaluation code in WEKA asks classifiers to make predictions one at a time instead of passing them all at once as a collection. We are not criticizing this design choice; it's just that in our case it means we had to implement a small workaround to make things acceptably efficient. For Ensemble Selection to work, we need to average predictions for all our ensemble models together. The original implementation did something like this:
For each model
    Deserialize the model from the file system
    For each Instance
        Get a prediction for the given instance
    Garbage collect the model

This wasn't too expensive. We could just load models from the file system one at a time, get all their test predictions, and then throw the model away. There's no need to keep all the models in memory. However, since Evaluation.java asks our classifier to make predictions on test instances one at a time, we would be stuck doing something more like this:


For each Instance
    For each model
        Deserialize the model from the file system
        Get a prediction for the given instance
        Garbage collect the model

This means that if you have 1000 test data points, you are going to have to deserialize every ensemble model 1000 times! We felt this was unacceptable and decided to force the classifier to keep all ensemble models in memory to prevent all the deserialization. Furthermore, we implemented prediction caching in our main method, where test predictions for all ensemble models are cached before handing control over to Evaluation.java. Unfortunately, this workaround only works when invoking Ensemble Selection from the command line. This is because in the Explorer, unlike our main method, there is no time when control is handed to our classifier with all the test instances.

So to summarize, what all this means is that when training Ensemble Selection with reasonably sized model lists on reasonably sized datasets, you should do it from the command line (as described in the previous section), or you might get an out-of-memory exception. Otherwise, we think that the GUI is fine for playing around to get a feel for how ensemble selection works with small model lists (such as the toylist.model.xml in the defaults panel).

Step 1: Specify the models to use

This is slightly different from the command line in that you don't specify the file. Simply click on the library attribute in the Ensemble Selection property panel to bring up the library editor. As mentioned in the previous section on the Library Editor GUI, you can either load an already existing list or just add some models to your library with one of the panels.

Step 2: Training the library

After typing in the name of your working directory and specifying the models for your library, you just need to tell the classifier it is only training models. In the dropdown menu for algorithm on the Ensemble Selection panel, select library to indicate that you only want to train the base classifiers.

WARNING! As described in the command line section, you should basically NEVER specify cross-validation (CV) from the Explorer Classifier testing GUI (shown below). Instead, choose one of the other options, as CV is prohibitively expensive from Ensemble Selection just to get performance estimates.

Step 3: Building and Testing the Ensemble

These steps can be done just like they can with any other WEKA classifier. Just keep in mind our warning about the out-of-memory errors, and don't be afraid to try things from the command line.

Overview of Classes used by Ensemble Selection
The implementation of Ensemble Selection and its library editor was a significant undertaking. The following is a brief overview of the packages created to support the classifier, with a list of the class files stored in each. For more information, please see the associated javadocs and code comments.

Package: weka.classifiers.meta

The actual classifier:
weka.classifiers.meta.EnsembleSelection.java

Package: weka.classifiers

These are base classes we created so that others could use the basic EnsembleLibrary functionality of the LibraryEditor GUI.


weka.classifiers.EnsembleLibraryModelComparator.java

weka.classifiers.EnsembleLibraryModel.java
weka.classifiers.EnsembleLibrary.java

Package: weka.classifiers.meta.ensembleSelection

These classes support the main Ensemble Selection algorithm. There is also a properties file and three model lists that are used to populate the default models panel in the library editor GUI.


weka.classifiers.meta.ensembleSelection.EnsembleModelMismatchException.java
weka.classifiers.meta.ensembleSelection.ModelBag.java
weka.classifiers.meta.ensembleSelection.EnsembleMetricHelper.java
weka.classifiers.meta.ensembleSelection.EnsembleSelectionLibrary.java
weka.classifiers.meta.ensembleSelection.EnsembleSelectionLibraryModel.java
weka.classifiers.meta.ensembleSelection.DefaultModels.props
weka.classifiers.meta.ensembleSelection.large_binary_class.model.xml
weka.classifiers.meta.ensembleSelection.large_multi_class.model.xml
weka.classifiers.meta.ensembleSelection.toylist.model.xml

Package: weka.gui

EnsembleLibraryEditor is a base class we created so that others could use the basic LibraryEditor GUI. We extend this class with EnsembleSelectionLibraryEditor to do more specific Ensemble Selection things.


weka.gui.EnsembleLibraryEditor.java
weka.gui.EnsembleSelectionLibraryEditor.java

Package: weka.gui.ensembleLibraryEditor

These are classes implementing and supporting the panels in the LibraryEditor GUI.
weka.gui.ensembleLibraryEditor.ModelList.java
weka.gui.ensembleLibraryEditor.AddModelsPanel.java
weka.gui.ensembleLibraryEditor.ListModelsPanel.java
weka.gui.ensembleLibraryEditor.DefaultModelsPanel.java
weka.gui.ensembleLibraryEditor.LoadModelsPanel.java
weka.gui.ensembleLibraryEditor.LibrarySerialization.java

Package: weka.gui.ensembleLibraryEditor.tree
This entire package supports just the add models panel. Getting the neat JTree user interface to work for building lists of classifiers was by far one of the greatest challenges we faced. Most of these classes implement the functionality needed for a single node in the tree.
weka.gui.ensembleLibraryEditor.tree.GenericObjectNode.java
weka.gui.ensembleLibraryEditor.tree.CheckBoxNodeEditor.java
weka.gui.ensembleLibraryEditor.tree.ModelTreeNodeRenderer.java
weka.gui.ensembleLibraryEditor.tree.ModelTreeNodeEditor.java
weka.gui.ensembleLibraryEditor.tree.DefaultNode.java
weka.gui.ensembleLibraryEditor.tree.CheckBoxNode.java

weka.gui.ensembleLibraryEditor.tree.NumberClassNotFoundException.java
weka.gui.ensembleLibraryEditor.tree.GenericObjectNodeEditor.java
weka.gui.ensembleLibraryEditor.tree.InvalidInputException.java
weka.gui.ensembleLibraryEditor.tree.NumberNodeEditor.java
weka.gui.ensembleLibraryEditor.tree.NumberNode.java
weka.gui.ensembleLibraryEditor.tree.PropertyNode.java

User FAQ
When is EnsembleSelection a good idea?

The interface allows for easy creation of a large set of classifiers with minimal (human) effort, usually providing state-of-the-art performance. Other competing methods such as Bayesian Model Averaging and Stacking are known to overfit with large libraries of models. EnsembleSelection also has the capability of not just training many models, but evaluating their performance on a test set (using EnsembleSelection.main() and the -V option).

When is ensemble selection a bad idea?

It's very time consuming and takes a lot of memory. Training an ensemble library for a reasonably sized dataset will take days to weeks of compute time. Also, at the end of training your ensemble it is difficult, or perhaps even impossible, to intuitively understand its mechanism: by that we mean the underlying logic your model uses to make predictions. Whereas with something like a single tree classifier you can look at the branches and validate/understand why it was built the way it was, with ensemble selection you have predictions averaged across hundreds of models, which makes this process difficult if not impossible.

Expensive, eh? So is it possible to parallelize this?

If you have a cluster, we have made the library training easily parallelizable (assuming your cluster nodes have a shared file system, e.g. NFS). All you have to do is invoke the same command line arguments to train your ensemble library on all nodes you wish to use to train your model library. Make sure that you specify the same path for your working directory (which again should be remotely reachable by all of them) and

make sure to use the -A library option to tell them all to only train the library and not to do anything else. Note that the step of using the library to then build an ensemble with the EnsembleSelection algorithm is not currently parallelizable.

What if I just want to train a bunch of models and maybe find the best one, without having to deal with all this fancy EnsembleSelection model stuff?

We can do that too. Just use the -A best option. The ensemble will be made up of the single model which performed best on the validation data. Furthermore, if you want to get the performance of all the models with respect to the validation data, you can do this as well using the -V (verboseOutput) option.

I got an out of memory error when using EnsembleSelection. What can I do?

Two things: 1) make sure you use the java -Xmx option to increase the amount of memory available to the JVM, and 2) if you got the error from the GUI, try it again from the command line. For an explanation of the memory usage problems in the Explorer, please see the User Guide section about this.

Why do I need a working directory? Can't I keep the models in memory?

No, that takes too much memory. EnsembleSelection is designed for very large libraries of models (e.g. >1,000), and so in most real-world situations the model library could not possibly be held in memory.

I don't know which models to use. Are there default lists of models I can use?

Yes. This was high on our list when we first got started building this classifier; we wanted people to be able to choose from reasonable default model lists. Just fire up the LibraryEditor and click on the default models tab to see a few default lists.

I want to try different models than in the default list. Can I do that?

Yes, go check out the Add Models Panel section in the user guide.

I noticed two types of model list files, .mlf and .model.xml. What's the deal?

Originally, we thought it would be easiest to make the model list file a simple flat file with a different Classifier + set of options on each line. It turns out this has many problems (getOptions and setOptions do not necessarily interact properly or even fully represent the associated classifiers). These are the model list files with the .mlf extension. Later we adopted an XML file schema that works great but doesn't have the nice simplicity/human readableness of the old .mlf files. These XML model files have the .model.xml extension. Anyway, we've had great success with the new format and we recommend you use it instead of the .mlf's.

Can I use the same workspace directory to train models for two different problems? Two different partitions of the data for the same problem? Two different model lists?

Yes, yes, yes. See the section describing the dataset/checksum naming convention.

I got an EnsembleModelMismatchException. What does that mean?

This is a tough one. Basically, we track all of the dataset information that was used to create each model. This is because we want to protect users from doing foreseeably bad things, e.g. trying to build an ensemble for a dataset with models that were trained on the wrong partitioning of the dataset. This could lead to artificially high performance, because instances used for the test set to gauge performance could have accidentally been used to train the base classifiers. So in a nutshell, we are preventing people from unintentionally "cheating" by enforcing that the seed, number of folds, validation ratio, and the checksum of the Instances.toString() method ALL match exactly. If you try to build an ensemble and one of its models was not trained with all of these same parameter values, then we throw that specific exception.

Why did the load models tab in the LibraryEditor show me all the models I trained for a bunch of different datasets that were in different directories?

Yes, it can be confusing, but the load models tab is populated by all models in subdirectories of the working directory. (This is out of necessity: at the time we populate it, we don't know what dataset you're using.)

Is it possible to modify my list.model.xml by hand or with a script?

Theoretically yes; in practice, no. Do so at your own risk. We highly recommend using the list editor GUI instead. The .mlf files may be edited by hand, but as noted elsewhere, not all classifiers can be properly configured using command line options.

How do I specify the models for EnsembleSelection to use from the command line?

Use the -L option, and provide a .model.xml file. You can create a .model.xml file using the LibraryEditor (see the user's guide for more detail). It is also possible to use a file in .mlf format, which is simply a list of classifiers and options for them.
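The consistency check behind EnsembleModelMismatchException can be sketched roughly as follows. This is an illustrative, hypothetical class, not the actual WEKA implementation; only the rule it enforces (seed, number of folds, validation ratio, and dataset checksum must all match) comes from the text above.

```java
import java.security.MessageDigest;

// Illustrative sketch of the check behind EnsembleModelMismatchException:
// a model may only join an ensemble if the seed, number of folds,
// validation ratio, and dataset checksum it was trained with all match
// the ensemble's settings exactly, preventing accidental "cheating".
public class ModelMetadata {
    final long seed;
    final int numFolds;
    final double validationRatio;
    final String dataChecksum; // checksum of the dataset's string form

    public ModelMetadata(long seed, int numFolds,
                         double validationRatio, String dataChecksum) {
        this.seed = seed;
        this.numFolds = numFolds;
        this.validationRatio = validationRatio;
        this.dataChecksum = dataChecksum;
    }

    // All four values must match, otherwise the model was trained on a
    // different partitioning of the data and the exception would be thrown.
    public boolean matches(ModelMetadata other) {
        return seed == other.seed
            && numFolds == other.numFolds
            && validationRatio == other.validationRatio
            && dataChecksum.equals(other.dataChecksum);
    }

    // A simple MD5 checksum over the dataset's string representation
    // (WEKA uses the result of Instances.toString() for this purpose).
    public static String checksum(String datasetAsString) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(datasetAsString.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Two models trained from the same data and settings produce equal metadata and are allowed into the same ensemble; changing any single field makes the check fail.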
So I made a bunch of models. How do I know specifically what data they were trained on?

Every time you train a library, we write out a .arff file into the respective instances directory containing all of the instances used to train that set of models. It's named based on the data and number of instances. Convenient, no?

So I made a bunch of models. How do I know what models were in the library at train time?

For logging purposes, we also write out both a .model.xml and a .mlf file at train time listing all the models you are attempting to train.

What happens if not all of my models train, i.e., throw an exception for whatever reason?

This is actually not a matter of if, it's a matter of when. Not all models can use all data sets. Lately, though, we trap most exceptions and throw models that didn't train out of the ensemble. Most of the time this works and we're okay. However, if we ran out of memory, this could still kill the process no matter how many errors we trap. Sometimes you just have to figure out which model(s) are gumming up the works and manually remove them from the model list, although in our experience this isn't needed too often.

Hey, I have a bunch of existing Weka models, and I'd like to make an Ensemble out of them. Can I do that?

Sorry, currently not supported. It would be a great future addition though.

Hey, I just ran EnsembleSelection to find the best model, and now I want to save that model for later use.

Sorry, can't do that either. It's kind of a hack, but what you can do is build an Ensemble of only the best model (-A best). This will give you an EnsembleSelection classifier to classify future instances. Ultimately though, we think that there should be a mechanism to extract the actual models from our .elm files.

Why is there a separate list for binary vs. multiclass problems in the default list panel in the LibraryEditor? I thought all Weka classifiers worked for either?

There are some models that don't scale well for multiclass problems. Models such as SVMs (functions.SMO) have train + test times that increase exponentially with the number of classes for a problem. We've observed classifiers which took 15 minutes to train on a binary problem take over 8 hours to train on a multiclass problem, while other classifiers in the library did not have the same increase. So we felt the best approach would be to maintain two separate default model lists: one for multiclass and one for binary.

Is it okay for me to combine training the library of models and the EnsembleSelection in one step?

Yes, that's fine, but it doesn't allow for parallelization. Also, separating the steps can make things easier for debugging.

What's with the crazy letters and numbers in my .elm files and the subdirectories of my working directory?

See the section on file naming and checksums in the Working Directory.

Can I use a trained EnsembleSelection classifier on new data after the ensemble is built?

Yes, just use the .model file like you would any saved WEKA classifier. However, it must have access to the models it selected, which are saved in separate .elm files. The directory where they can be found can be specified using the -W option.

How does cross-validation work in EnsembleSelection?

That's a doozy...

(Note: this is just copied and pasted from above.) EnsembleSelection can be run using a set-aside validation set, or using cross-validation. The cross-validation implemented in the EnsembleSelection classifier is a method proposed by Caruana et al. called embedded cross-validation. Embedded cross-validation is slightly different from typical cross-validation. In embedded cross-validation, as with standard cross-validation, we divide up the training data into n folds, with separate training data and validation data for each fold. We then train each base classifier configuration n times, once for each set of training data. We are then left with n versions of each base classifier, and we combine them in EnsembleSelection into what we call a model (represented by the EnsembleSelectionLibraryModel class). Thus, a model in EnsembleSelection is actually made up of n classifiers, one for each fold.

The point of embedded cross-validation is to choose models for the ensemble. That is, rather than being interested in the performance of a single trained classifier, we are concerned with how well the model, or base classifier configuration, performed (based on the performance of its constituent classifiers). Notice that for every instance in the training set, there is one classifier for each model/configuration which was not trained on that instance. Thus, to evaluate the performance of the model on that instance, we can simply use the single classifier from that model which was not trained on the instance. In this way, we can evaluate each model using the entire training set, since for every instance in the training set and every model, we have a prediction which is not based on a classifier that was trained with that instance. Since we can evaluate all the models in our library on the entire training set, we can perform ensemble selection using the entire training set for hillclimbing.

I noticed that some of my models appeared red in the LibraryEditor. What does that mean?

When our LibraryEditor dynamically generates a bunch of models from the parameter ranges you specify in the Add Models panel, it tries to instantiate each classifier to see if the given set of parameters is valid. If it traps an error for a set of parameters, then it flags that classifier as red. Note that you don't have to manually remove these invalid models from the temp list. When you add models to the main library list, the invalid ones will be automatically removed.

Can I use the LibraryEditor to define parameter ranges for both meta classifiers and their base classifiers simultaneously?

YES! The Classifier tree in the Add Models panel is recursive, which lets you do all sorts of crazy powerful things. You can use multiple layers of meta classifiers and set parameter ranges across each layer, and it will generate all the possible combinations. But watch out, or you'll end up with extremely huge lists. This was tough to implement, but we think it was worth it.

I have two slightly different model lists that are REALLY long. Is there any easy way to know the difference between the two without staring at them both forever?

Save both lists as flat files from the LibraryEditor (the .mlf format), and then just use the diff command line tool.

Can I modify the default model lists that appear in the Default List Panel? Alternately, can I just add a default list of my own?

Yes and yes. The model lists are found in the weka/classifiers/meta/ensembleSelection directory of the WEKA classpath. You can modify the model lists there. To add a list, just add it to the appropriate line in the DefaultModels.props file found in that directory and then drop your list into that directory. Note: this requires that you unjar your weka.jar file if you are using a .jar file.
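The embedded cross-validation evaluation described in the answer above can be sketched with a toy model. The class and array names here are made up for illustration; the only idea taken from the text is that each instance is scored by the one fold-classifier that never saw it.

```java
// Illustrative sketch of embedded cross-validation: a "model" consists of n
// classifiers, one per fold. To score the model on a training instance we use
// the one classifier that was NOT trained on that instance, so the entire
// training set can be used for hill-climbing without leakage.
public class EmbeddedCV {
    // foldOf[i]   = fold that instance i was held out in
    // preds[f][i] = prediction for instance i by the classifier trained with
    //               fold f held out (the only "clean" prediction for i when
    //               f == foldOf[i])
    // labels[i]   = true label of instance i
    public static double modelAccuracy(int[] foldOf, int[][] preds, int[] labels) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            // use the single classifier that never saw instance i
            if (preds[foldOf[i]][i] == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }
}
```

With 4 instances split into 2 folds, the fold-0 classifier scores instances held out in fold 0 and the fold-1 classifier scores the rest, so every training instance contributes one leakage-free prediction to the model's score.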
Can I modify the regular expressions used to prune the model list in the default model list panel?

Yes, these regular expressions are found in DefaultModels.props. Also, if you make changes to these that seem reasonable enough to be rolled back in as defaults, please share them with us, as refining these properties is on our list of improvements to make.

Developer FAQ

Why the long prediction caching main method in EnsembleSelection? Are you guys crazy?

Yes, we are crazy! But not because of the prediction caching thing. We cache predictions in the main method so that we can achieve reasonable speed and memory use even with a large ensemble.

Our main problem is that Evaluation.evaluateModel() only gives us one instance at a time. The cost of deserializing every one of our models from the file system for each and EVERY testing instance means that 1) your hard drive will be given a thorough workout and 2) test time will take thousands and millions of years.

It's a little hackish, but what we do is cache all the test predictions from all of our models beforehand, before handing control to Evaluation. So when Evaluation hands us instances one at a time, we already have the answers. When this is not run from the command line and EnsembleSelection is run from the GUI, we get around this by keeping ALL of our library models in memory, which is why we run out of memory from the GUI so often.

The only way we could get away without doing this is if Evaluation passed all the instances to us at once instead of one at a time. However, this does not seem realistic, as it would require significantly changing the rest of WEKA. So while this does seem slightly hackish, it does work well.

So why didn't you just make the ModelList a string argument to a file and the LibraryEditor a separate GUI?

The closest thing we could find to what we wanted to do with the LibraryEditor was the CostMatrixEditor used for classifiers like MetaCost. As much as possible, we wanted to do things the WEKA way, so we followed this classifier/editor pair as a design pattern, and it seemed to work well. We were just trying to follow the precedent.

Why did you declare the LibraryEditor class a part of the weka.gui package? Shouldn't it be somewhere in a subpackage associated with the EnsembleSelection algorithm?

We did initially, and then it occurred to us that pieces of our LibraryEditor interface could be useful elsewhere. If someone else would like to use the neat classifier parameter tree GUI, they can. So what we did was actually make a base class, LibraryEditor, that has all the functionality we thought other people might want, and then extended it with the EnsembleLibraryEditor that has all the specific EnsembleSelection stuff we need.

What is going on with the EnsembleLibrary and EnsembleLibraryModel classes? These seem sort of pointless. Why not just use an array of Classifier[] instead of messing around with a bunch of extra wrapper classes?

Two reasons. First, we have to be careful about memory. As previously discussed, when creating ensemble models we could be dealing with thousands or possibly tens of thousands of models. Each instantiation of these models could potentially take up a lot of memory. So we created a wrapper class that contains only the information necessary and useful for building lists of classifiers, while making sure that the actual instantiations of the classifiers are garbage collected when they need to be.

Second, we are actually extending both of these base classes (respectively called EnsembleSelectionLibrary and EnsembleSelectionLibraryModel) to do much more work for us in our implementation of EnsembleSelection. So while these two classes might seem a tad simple and unnecessary, they are actually very important base classes that lay the foundation for the EnsembleSelectionLibrary and EnsembleSelectionLibraryModel classes.

Why do you have those strange static methods at the end of the LibraryEditor class?

Basically, this is a hack. Our problem is that we need to access some weka.gui classes such as PropertyPanel, PropertyText, CostMatrixEditor, etc., and all of these classes in weka.gui are not declared public. The only alternative I could think of to having these seemingly out-of-place static methods was to throw all of our classes into the weka.gui package as well; it seemed like that would clutter things up too much, so I went the static method route.
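The memory-conscious wrapper idea described above can be sketched as a simplified, hypothetical class. This is not the real EnsembleLibraryModel; it only illustrates the pattern of keeping a cheap description (class name plus options) and instantiating the heavyweight object on demand so it can be garbage collected between uses.

```java
// Simplified sketch of a library-model wrapper: store only the class name
// and options needed to describe a classifier, and instantiate the (possibly
// large) object via reflection only when it is actually needed.
public class LazyModelWrapper {
    private final String className;
    private final String[] options;
    private Object instance; // null until needed, null again after release()

    public LazyModelWrapper(String className, String[] options) {
        this.className = className;
        this.options = options;
    }

    // Instantiate lazily; repeated calls return the same cached object.
    public Object getInstance() {
        if (instance == null) {
            try {
                instance = Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        return instance;
    }

    // Drop the reference so the instantiation can be garbage collected.
    public void release() {
        instance = null;
    }

    public String getClassName() {
        return className;
    }
}
```

A list of thousands of such wrappers costs only a few strings per entry, while a list of fully instantiated classifiers could not be held in memory.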
So I just implemented a bunch of methods that check for the class types of the numbers, cast them to whatever actual class they are, perform the desired arithmetic operation, and then return the result. I'm sure there's got to be a much better way to do this; please let me know if you can think of one.
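What the paragraph above describes amounts to instanceof-based dispatch over Number subclasses. A minimal sketch (a hypothetical helper, not the actual NumberNode code) looks like this:

```java
// Minimal sketch of type-preserving arithmetic on Number subclasses:
// check the runtime class, cast, perform the operation, and return a
// result boxed as the same type as the first operand.
public class NumberArithmetic {
    public static Number add(Number a, Number b) {
        if (a instanceof Integer) return ((Integer) a) + b.intValue();
        if (a instanceof Long)    return ((Long) a) + b.longValue();
        if (a instanceof Float)   return ((Float) a) + b.floatValue();
        if (a instanceof Double)  return ((Double) a) + b.doubleValue();
        throw new IllegalArgumentException(
                "Unsupported Number type: " + a.getClass());
    }
}
```

The same dispatch pattern repeats for each arithmetic operation, which is exactly why it feels like there should be a better way.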

Known issues
There are a few outstanding bugs/issues:

Using the GUI (e.g. Explorer) you will run out of memory on reasonably large datasets or model lists. Please see the User Guide section on memory usage problems when running from the GUI. We currently don't know if our workaround that is only available from the command line is sufficient, or if there's some other solution we could implement. For now, we are going to list this as an open issue.

We currently don't do bounds checking on all values. Sorry! Stick with defaults, or use reasonable values somewhere around the defaults, and you'll be fine.

Phantom default models. When you don't specify any models in the GUI LibraryEditor, our classifier just defaults to 10 REPTrees with different seeds. The problem is that these 10 REPTrees get added outside of the GUI interface and won't show up in the LibraryEditor. So if you train an Ensemble in the Explorer with no specified models, 10 REPTrees will be added to the library. After training, these 10 models will be in the library but won't show up in the ModelList GUI. This is a relatively small bug. We're not sure of the best way to select a default list; perhaps we should force the user to select some default set before training?

Intermittent StateExceptions in the LibraryEditor when editing model lists. This exception gets raised occasionally from the LibraryEditor GUI. We are not sure what the cause or the fix is. However, this exception seems to be benign and can safely be ignored. Perhaps it should just be trapped and thrown away?

When using the -V (verbose) option to get individual model performance on the validation set, the output is kind of ugly, and when RMSE is used, we actually display (1 - RMSE).

The .mlf format for model lists is dangerous. First of all, not all models support their command line arguments properly. Secondly, the current implementation within EnsembleSelection for turning a classifier and its options in String format into the actual classifier may have some problems handling things like nested options for meta classifiers wrapping meta classifiers wrapping base classifiers; somewhere in there things can become jumbled. For these reasons, we think it might make sense to force users to use only the .model.xml format.

Future Enhancements and other Desirable Improvements
The following is a prioritized wish list of future enhancements to our implementation of EnsembleSelection, in ranking order of desirability.

Library models should be saved separately. One option would be to store every model in n+1 files, where we're using n-fold cross-validation. One file would be the .elm file, somewhat like we have now, which would contain important information such as cached predictions for that model, the number of folds, training data hash, random seed, etc. But the classifiers themselves would be saved in separate .model files, with automatically generated names (e.g. foo.fold1.model, foo.fold2.model, etc.). This would allow easy exporting of trained models, because they'd simply already exist in the correct format in the working directory. Furthermore, in some cases this could speed up the performance of EnsembleSelection, because in many cases it would not have to load the classifiers themselves in order to train the ensemble, where currently it does.

Our multiclass and binary class lists need to be refined. Currently, the approximately 1,000 models in the multiclass and 1,400 models in the binary class default model lists were sort of thrown together as a first attempt. To our knowledge, no one has tried to do anything like this before: creating a comprehensive list of Classifiers + parameter ranges that will give you halfway decent coverage of the model diversity available in WEKA. These two lists that you will find in the Default models panel are based on parameter ranges used in the original paper on Ensemble Selection for most of the same base classifier types. In addition, we also added a fair number of WEKA classifiers that seemed to behave well on a large number of UCI problems. We think this is a good start, but only that: a start. These lists need to evolve and improve in order to provide the best coverage we can of the model space available from WEKA.

Refine the DefaultModels.props regular expressions. The regular expressions specifying models with large train times, model sizes, and test times in the DefaultModels.props file are currently a sort of placeholder. The regular expressions currently used to define which models should be removed for large train times, etc. are based on some ad hoc observations. However, it would be nice to be a little more specific and perhaps base these values on something a little more concrete. Originally we planned on plotting the train times, test times, and file sizes for models across a large number of problems in the UCI datasets and then adding.

Metric calculation is slow. We could implement the calculation of metrics separately from Evaluation, and probably see significant speedup. (Evaluation updates everything every time a model is evaluated on an instance.)

We could handle prediction caching more efficiently. Currently, we cache the prediction for each model/base classifier. We could just cache the current prediction of EnsembleSelection itself.

Make prediction caching an option. Currently there is no choice in the matter, and the cache might get too big.
MismatchExceptions can be confusing. If you use the same dataset for two runs and specify the same working directory, but specify a different number of folds or validationRatio, you'll get a model mismatch exception, and rightly so. However, people might get confused about this. Perhaps a solution to this would be to

Integrate calibration and other filters. It is possible to use filters now, but we could make it easier in the GUI. Also, some preliminary work has shown that calibrating ensemble library models (with something like Platt's method) clearly improves performance. This was actually on our original list of desired features, but unfortunately we weren't able to add it due to time constraints.

Custom serialization is currently not implemented for EnsembleSelectionLibraryModels. Implementing it would let trained models be forward/backward compatible across different versions with minimal headaches.

Making a self-contained EnsembleSelection model. Currently, you must have a directory containing all the EnsembleLibraryModels that an EnsembleSelection classifier uses available to it for it to work.

Allow users to import existing WEKA models as library models. We're on the fence as to whether this would be a desirable feature, as it would circumvent a lot of our error checking for training library models safely on the same data intended for the Ensemble. However, surely some would find it convenient.
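The regex-based pruning of default model lists discussed in this wish list could be sketched as follows. The class name and the example patterns are made up for illustration; only the idea of dropping model specifications that match "too expensive" regular expressions comes from the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of pruning a model list with regular expressions, in the spirit of
// the DefaultModels.props properties: any model specification that matches
// one of the exclusion patterns (e.g. for large train times) is dropped.
public class ModelListPruner {
    public static List<String> prune(List<String> models, List<String> regexes) {
        List<Pattern> patterns = new ArrayList<>();
        for (String r : regexes) {
            patterns.add(Pattern.compile(r));
        }
        List<String> kept = new ArrayList<>();
        for (String model : models) {
            boolean excluded = false;
            for (Pattern p : patterns) {
                if (p.matcher(model).find()) {
                    excluded = true;
                    break;
                }
            }
            if (!excluded) {
                kept.add(model);
            }
        }
        return kept;
    }
}
```

Basing the exclusion patterns on measured train/test times rather than ad hoc observations, as the wish list suggests, would only change the regex strings fed to such a pruner, not the pruning mechanism itself.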
