Why Apache Spark is a Crossover Hit for Data Scientists


by Sean Owen

March 03, 2014

Spark is a compelling multipurpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist, or so I've been told, but what I do is actually quite different from what other data scientists do. For example, there are those practicing investigative analytics and those implementing operational analytics. (I'm in the second camp.)

Data scientists performing investigative analytics use interactive statistical environments like R to perform ad hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs
It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it's not similar to the mainstream languages developers know.

Python is a general-purpose programming language with excellent libraries for data analysis, like Pandas and scikit-learn. But like R, it's still limited to working with an amount of data that can fit on one machine.

It's possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level, and complex computations are difficult to express in it.

Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do wonder with envy sometimes: why can't we have a nice, REPL-like investigative analytics environment like the one Python and R users have? That's still scalable and distributed? And has the nice distributed collection design of Crunch? And can equally be used in operational contexts?

Common Ground in Spark

These are the desires that make me excited about Apache Spark. While discussion about Spark for data science has mostly noted its ability to keep data resident in memory, which can speed up iterative machine learning workloads compared to MapReduce, this is perhaps not even the big news, not to me. It does not solve every problem for everyone. However, Spark has a number of features that make it a compelling crossover platform for investigative as well as operational analytics:
Spark comes with a machine learning library, MLlib, albeit bare bones so far.

Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.

For Java programmers, Scala still presents a learning curve. But at least, any Java library can be used from within Scala.

Spark's RDD (Resilient Distributed Dataset) abstraction resembles Crunch's PCollection, which has proved a useful abstraction in Hadoop that will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)

Spark imitates Scala's collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python (see the sketch after this list). Scala is also a compelling choice for statistical computing.

Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.

Code that is implemented in the REPL environment can be used mostly as is in an operational context.

Data operations are transparently distributed across the cluster, even as you type.
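
To illustrate the collections-style API point above, here is a minimal sketch, not taken from the original post; it assumes the Spark shell, where sc is the provided SparkContext. The same functional operations read nearly identically on a local Scala collection and on a distributed RDD:

// On a local Scala collection, evaluated in a single JVM
val localEvenSquares = (1 to 10).map(x => x * x).filter(_ % 2 == 0)

// The same operations on an RDD, evaluated lazily across the cluster
val distributedEvenSquares = sc.parallelize(1 to 10).map(x => x * x).filter(_ % 2 == 0)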
Spark, and MLlib in particular, still has a lot of growing to do. For example, the project needs optimizations, fixes, and
deeper integration with YARN. It doesn't yet provide nearly the depth of library functions that conventional data analysis tools do. But as a best-of-most-worlds platform, it is already sufficiently interesting for a data scientist of any denomination to look at seriously.

In Action: Tagging Stack Overflow Questions


A complete example will give a sense of using Spark as an environment for transforming data and building models on Hadoop. The following example uses a dump of data from the popular Stack Overflow Q&A site. On Stack Overflow, developers can ask and answer questions about software. Questions can be tagged with short strings like "java" or "sql". This example will build a model that can suggest new tags to questions based on existing tags, using the alternating least squares (ALS) recommender algorithm: questions are "users" and tags are "items".
Getting the Data

Stack Exchange provides complete dumps of all data, most recently from January 20, 2014. The data is provided as a torrent containing different types of data from Stack Overflow and many sister sites. Only the file stackoverflow.com-Posts.7z needs to be downloaded from the torrent.

This file is just a bzip-compressed file. Spark, like Hadoop, can directly read and split some compressed files, but in this case it is necessary to uncompress a copy onto HDFS. In one step, that's:

bzcat stackoverflow.com-Posts.7z | hdfs dfs -put - /user/srowen/Posts.xml

Uncompressed, it consumes about 24.4GB, and contains about 18 million posts, of which 2.1 million are questions. These questions have about 9.3 million tags from approximately 34,000 unique tags.

Set Up Spark

Given that Spark's integration with Hadoop is relatively new, it can be time-consuming to get it working manually. Fortunately, CDH hides that complexity by integrating Spark and managing setup of its processes. Spark can be installed separately with CDH 4.6.0, and is included in CDH 5 Beta 2. This example uses an installation of CDH 5 Beta 2.

This example uses MLlib, which uses the jblas library for linear algebra, which in turn calls native code using LAPACK and Fortran. At the moment, it is necessary to manually install the Fortran library dependency to enable this. The package is called libgfortran or libgfortran3, and should be available from the standard package manager of major Linux distributions. For example, for RHEL 6, install it with:

sudo yum install libgfortran

This must be installed on all machines that have been designated as Spark workers.

Log in to the machine designated as the Spark master with ssh. It will be necessary, at the moment, to ask Spark to let its workers use a large amount of memory. The code in MLlib that is used in this example, in version 0.9.0, has a memory issue, one that is already fixed for the next release. To configure for more memory and launch the shell:

export SPARK_JAVA_OPTS="-Dspark.executor.memory=8g"
spark-shell

Interactive Processing in the Shell

The shell is the Scala REPL. It's possible to execute lines of code, define methods, and in general access any Scala or Spark functionality in this environment, one line at a time. You can paste the following steps into the REPL, one by one.

First, get a handle on the Posts.xml file:

val postsXML = sc.textFile("hdfs:///user/srowen/Posts.xml")

In response the REPL will print:

postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

The text file is an RDD (Resilient Distributed Dataset) of Strings, which are the lines of the file. You can query it by calling methods of the RDD class. For example, to count the lines:

postsXML.count

This command yields a great deal of output from Spark as it counts lines in a distributed way, and finally prints:
18066983.
The next snippet transforms the lines of the XML file into a collection of (questionID, tag) tuples. This demonstrates Scala's functional programming style, and other quirks. (Explaining them is out of scope here.) RDDs behave like Scala collections, and expose many of the same methods, like map. (You can copy the source for this transformation from here.)
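
A minimal sketch of such a transformation follows; it is not the post's original snippet. It assumes each post is a single <row .../> line whose Id and Tags attributes carry the post ID and its HTML-escaped tags, and the regexes and intermediate names are illustrative:

val postIDTags = postsXML.flatMap { line =>
  // Each post is a single <row .../> element; pull out the Id and Tags attributes
  val idRegex = "Id=\"(\\d+)\"".r
  val tagsRegex = "Tags=\"([^\"]+)\"".r
  // Individual tags appear HTML-escaped, as &lt;tag&gt;, inside the Tags attribute
  val tagRegex = "&lt;([^&]+)&gt;".r
  for {
    idMatch   <- idRegex.findFirstMatchIn(line).toSeq
    tagsMatch <- tagsRegex.findFirstMatchIn(line).toSeq
    tag       <- tagRegex.findAllMatchIn(tagsMatch.group(1)).map(_.group(1))
  } yield (idMatch.group(1).toInt, tag)
}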
You will notice that this returns immediately, unlike previously. So far, nothing requires Spark to actually perform this transformation. It is possible to force Spark to perform the computation by, for example, calling a method like count. Or Spark can be told to compute and persist the result through checkpointing, for example.
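
For instance, here is a minimal sketch using the postIDTags RDD defined above; cache, which this post uses later, persists the computed result in memory, and is a related but simpler mechanism than checkpointing:

// count is an action: it forces the lazy transformation above to actually execute
postIDTags.count

// cache marks the RDD to be kept in memory once computed, so later actions
// reuse it rather than re-reading and re-parsing the XML
postIDTags.cache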
The MLlib implementation of ALS operates on numeric IDs, not strings. The tags (items) in this dataset are strings. It will be sufficient here to hash tags to a non-negative integer value, use the integer values for the computation, and then use a reverse mapping to translate back to tag strings later. Here, a hash function is defined since it will be reused shortly.

// Hash a tag to a non-negative Int, since ALS needs numeric IDs
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF

// Reverse mapping from hashed ID back to the original tag string
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))

Now, you can convert the tuples from before into the format that the ALS implementation expects, and the model can be computed:

import org.apache.spark.mllib.recommendation._
// Convert to Rating(Int,Int,Double) objects
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))
// Train model with 40 features, 10 iterations of ALS
val model = ALS.trainImplicit(alsInput, 40, 10)

This will take minutes or more, depending on the size of your cluster, and will spew a large amount of output from the workers. Take a moment to find the Spark master web UI, which can be found from Cloudera Manager, and will run by default at http://[master]:18080. There will be one running application. Click through, then click Application Detail UI. In this view it's possible to monitor Spark's distributed execution of lines of code in ALS.scala.

When it is complete, a factored matrix model is available in Spark. It can be used to predict question-tag associations by recommending tags to questions. At this early stage of MLlib's life, there is not even a proper recommend method
yet that would give suggested tags for a question. However, it is easy to define one:

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  // Build list of one question and all items and predict value for all of them
  val predictions = model.predict(tagHashes.map(t => (questionID, t._1)))
  // Get top howMany recommendations ordered by prediction value
  val topN = predictions.top(howMany)(Ordering.by[Rating, Double](_.rating))
  // Translate back to tags from IDs
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}

And to call it, pick any question with at least four tags, like "How to make substring matching query work fast on a large table?" and get its ID from the URL. Here, that's 7122697:

recommend(7122697).foreach(println)

This method will take a minute or more to complete, which is slow. The lookups in the last line are quite expensive since each requires a distributed search. It would be somewhat faster if this mapping were available in memory. It's possible to tell Spark to do this:

tagHashes = tagHashes.cache

Because of the magic of Scala closures, this does in fact affect the object used inside the recommend method just defined. Run the method call again and it will return faster. The result in both cases will be something similar to the following:

(sql,0.17745152481166354)
(database,0.13526622226672633)
(oracle,0.1079428707621154)
(ruby-on-rails,0.06067207312463499)
(postgresql,0.050933613169706474)

(Your result will not be identical, since ALS starts from a random solution and iterates.) The original question was tagged postgresql, query-optimization, substring, and text-search. It's reasonable that the question might also be tagged sql and database. oracle makes sense in the context of questions about optimization and text search, and ruby-on-rails often comes up with PostgreSQL, even though these tags are not in fact related to this particular question.

Something for Everyone


Of course, this example could be more efficient and more general. But for the practicing data scientists out there, whether you came in as an R analyst, Python hacker, or Hadoop developer, hopefully you saw something familiar in different elements of the example, and have discovered a way to use Spark to access some benefits that the other tribes take for granted.

Learn more about Spark's role in an EDH, and join the discussion in our brand-new Spark forum.
Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Apache Mahout. Sean was primary author of recommender components in Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.

Filed under: Data Science, Spark, Use Case

10 Responses
ANTONIO PICCOLBONI / MARCH 03, 2014 / 3:15 PM

Au contraire, R has exceptional tools for the manipulation of data prior to analysis, see http://vita.had.co.nz/papers/tidy-data.pdf
Some of those are being ported to MapReduce, see this project: https://github.com/RevolutionAnalytics/plyrmr/
JOWANZA JOSEPH / MARCH 03, 2014 / 3:27 PM

Thanks for posting. I have been messing around with SparkR, an R integration with Spark, for a few days and I like it. I have a great amount of hope for this framework.

FLORIAN LEITNER / MARCH 06, 2014 / 11:51 PM

You guys seem to never have heard of IPython, either. Indeed, it is very easy to do distributed computing with Python these days. So, at least two of your opening arguments are quite a bit moot.
SEAN OWEN / MARCH 09, 2014 / 7:57 PM

Thank you for the comments. The comments in the post are indeed generalizations and simplifications. It is not as if nobody has tried to clean data in R or parallelize Python. I believe data manipulation in R remains quite difficult compared to a modern programming language. IPython, which I have certainly heard of, provides a parallelization primitive among other things, but does not compare seriously to Hadoop for distributed computation. To each his/her own tool, but I would suggest these are exactly the audiences that should be looking at Spark.
MAJID ALDOSARI / MARCH 12, 2014 / 1:51 PM

Parallel programming and distributed data is more about paradigms than particular association with a programming language. It's possible to have a framework accessible from many languages.
What prompted me to say this is when you said Python is limited to processing in memory. I'm sure I can search for Python interfaces (or even Python-native ones) to distributed data stores and processing.
SEAN OWEN / MARCH 13, 2014 / 7:45 AM

On rereading, I agree that this is too dismissive of Python-related tools for distributed computation. To Python plus scikit, for example, it does require adding another, different platform like IPython. And I suspect that for most people this paradigm is different again from where your other data and computation lives, like Hadoop. I would retreat to a weaker point then, and this is up for debate, that the distributed-Python world is relatively hard to access outside of its specialist world, compared to, say, Hadoop. Really, the point was not that Python or its associated tools are deficient, but to create a simplistic sketch of the either/or tradeoffs your average organization faces when engaging these tools.
ALTON ALEXANDER / MARCH 13, 2014 / 11:37 AM

Very impressed with what I've seen from Spark so far. Looking forward to adding it to my arsenal alongside R and Python.
ALEX MCLINTOCK / MAY 09, 2014 / 3:37 AM

Hi Sean, I heard you give this talk last night at Big Data London. I am wondering how SAS fits into this picture. It seems to me that some people like SAS and use it a bit like R, and that Cloudera and SAS talk to each other, presumably through the Hive ODBC driver. How do I find out more?
JUSTIN KESTELYN (@KESTELYN) / MAY 09, 2014 / 10:37 AM

Alex,
This post may help:
http://blog.cloudera.com/blog/2013/05/how-the-sas-and-cloudera-platforms-work-together/
NEAL MCBURNETT / JANUARY 25, 2015 / 10:34 AM

Great post, thank you.
Python is indeed an awesome language for data scientists. Thankfully, Spark now supports pyspark, an official API binding to regular CPython 2.6 and up. This seems quite relevant re: the Python discussion in the comments.
pyspark augments the other amazing multi-paradigm distributed processing opportunities provided by IPython parallel (http://ipython.org/ipython-doc/2/parallel/index.html) as well as scikit, and the wonderful world of literate programming via IPython Notebooks, etc.
