Why Apache Spark is a Crossover Hit for Data Scientists


by Sean Owen

March 03, 2014

Spark is a compelling multipurpose platform for use cases that span investigative, as well as operational, analytics.

Data science is a broad church. I am a data scientist, or so I've been told, but what I do is actually quite different from what other data scientists do. For example, there are those practicing investigative analytics and those implementing operational analytics. (I'm in the second camp.)

Data scientists performing investigative analytics use interactive statistical environments like R to perform ad hoc, exploratory analytics in order to answer questions and gain insights. By contrast, data scientists building operational analytics systems have more in common with engineers. They build software that creates and queries machine learning models that operate at scale in real-time serving environments, using systems languages like C++ and Java, and often use several elements of an enterprise data hub, including the Apache Hadoop ecosystem.

And there are subgroups within these groups of data scientists. For example, some analysts who are proficient with R have never heard of Python or scikit-learn, or vice versa, even though both provide libraries of statistical functions that are accessible from a REPL (Read-Evaluate-Print Loop) environment.

A World of Tradeoffs
It would be wonderful to have one tool for everyone, and one architecture and language for investigative as well as operational analytics. If I primarily work in Java, should I really need to know a language like Python or R in order to be effective at exploring data? Coming from a conventional data analyst background, must I understand MapReduce in order to scale up computations? The array of tools available to data scientists tells a story of unfortunate tradeoffs:

R offers a rich environment for statistical analysis and machine learning, but it has some rough edges when performing many of the data processing and cleanup tasks that are required before the real analysis work can begin. As a language, it's not similar to the mainstream languages developers know.

Python is a general-purpose programming language with excellent libraries for data analysis, like Pandas and scikit-learn. But like R, it's still limited to working with an amount of data that can fit on one machine.

It's possible to develop distributed machine learning algorithms on the classic MapReduce computation framework in Hadoop (see Apache Mahout). But MapReduce is notoriously low-level, and complex computations are difficult to express in it.

Apache Crunch offers a simpler, idiomatic Java API for expressing MapReduce computations. But still, the nature of MapReduce makes it inefficient for iterative computations, and most machine learning algorithms have an iterative component.

And so on. There are both gaps and overlaps between these and other data science tools. Coming from a background in Java and Hadoop, I do wonder with envy sometimes: why can't we have a nice, REPL-like investigative analytics environment like the one Python and R users have? That's still scalable and distributed? And has the nice distributed collection design of Crunch? And can equally be used in operational contexts?

Common Ground in Spark

These are the desires that make me excited about Apache Spark. While discussion about Spark for data science has mostly noted its ability to keep data resident in memory, which can speed up iterative machine learning workloads compared to MapReduce, this is perhaps not even the big news, not to me. It does not solve every problem for everyone. However, Spark has a number of features that make it a compelling crossover platform for investigative as well as operational analytics:
Spark comes with a machine learning library, MLlib, albeit bare bones so far.

Being Scala-based, Spark embeds in any JVM-based operational system, but can also be used interactively in a REPL in a way that will feel familiar to R and Python users.

For Java programmers, Scala still presents a learning curve. But at least, any Java library can be used from within Scala.

Spark's RDD (Resilient Distributed Dataset) abstraction resembles Crunch's PCollection, which has proved a useful abstraction in Hadoop that will already be familiar to Crunch developers. (Crunch can even be used on top of Spark.)

Spark imitates Scala's collections API and functional style, which is a boon to Java and Scala developers, but also somewhat familiar to developers coming from Python (see the sketch after this list). Scala is also a compelling choice for statistical computing.

Spark itself, and Scala underneath it, are not specific to machine learning. They provide APIs supporting related tasks, like data access, ETL, and integration. As with Python, the entire data science pipeline can be implemented within this paradigm, not just the model fitting and analysis.

Code that is implemented in the REPL environment can be used mostly as is in an operational context.

Data operations are transparently distributed across the cluster, even as you type.
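
To illustrate the collections-style API point above, here is a minimal sketch, not taken from the original post; it assumes the Spark shell, where sc is the provided SparkContext. The same functional operations read nearly identically on a local Scala collection and on a distributed RDD:

// On a local Scala collection, evaluated in a single JVM
val localEvenSquares = (1 to 10).map(x => x * x).filter(_ % 2 == 0)

// The same operations on an RDD, evaluated lazily across the cluster
val distributedEvenSquares = sc.parallelize(1 to 10).map(x => x * x).filter(_ % 2 == 0)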
Spark, and MLlib in particular, still has a lot of growing to do. For example, the project needs optimizations, fixes, and
deeper integration with YARN. It doesn't yet provide nearly the depth of library functions that conventional data analysis tools do. But as a best-of-most-worlds platform, it is already sufficiently interesting for a data scientist of any denomination to look at seriously.

In Action: Tagging Stack Overflow Questions


A complete example will give a sense of using Spark as an environment for transforming data and building models on Hadoop. The following example uses a dump of data from the popular Stack Overflow Q&A site. On Stack Overflow, developers can ask and answer questions about software. Questions can be tagged with short strings like "java" or "sql". This example will build a model that can suggest new tags to questions based on existing tags, using the alternating least squares (ALS) recommender algorithm: questions are "users" and tags are "items".
Getting the Data

Stack Exchange provides complete dumps of all data, most recently from January 20, 2014. The data is provided as a torrent containing different types of data from Stack Overflow and many sister sites. Only the file stackoverflow.com-Posts.7z needs to be downloaded from the torrent.

This file is just a bzip-compressed file. Spark, like Hadoop, can directly read and split some compressed files, but in this case it is necessary to uncompress a copy onto HDFS. In one step, that's:

bzcat stackoverflow.com-Posts.7z | hdfs dfs -put - /user/srowen/Posts.xml

Uncompressed, it consumes about 24.4GB, and contains about 18 million posts, of which 2.1 million are questions. These questions have about 9.3 million tags from approximately 34,000 unique tags.

Set Up Spark

Given that Spark's integration with Hadoop is relatively new, it can be time-consuming to get it working manually. Fortunately, CDH hides that complexity by integrating Spark and managing setup of its processes. Spark can be installed separately with CDH 4.6.0, and is included in CDH 5 Beta 2. This example uses an installation of CDH 5 Beta 2.

This example uses MLlib, which uses the jblas library for linear algebra, which in turn calls native code using LAPACK and Fortran. At the moment, it is necessary to manually install the Fortran library dependency to enable this. The package is called libgfortran or libgfortran3, and should be available from the standard package manager of major Linux distributions. For example, for RHEL 6, install it with:

sudo yum install libgfortran

This must be installed on all machines that have been designated as Spark workers.

Log in to the machine designated as the Spark master with ssh. It will be necessary, at the moment, to ask Spark to let its workers use a large amount of memory. The code in MLlib that is used in this example, in version 0.9.0, has a memory issue, one that is already fixed for the next release. To configure for more memory and launch the shell:

export SPARK_JAVA_OPTS="-Dspark.executor.memory=8g"
spark-shell

Interactive Processing in the Shell

The shell is the Scala REPL. It's possible to execute lines of code, define methods, and in general access any Scala or Spark functionality in this environment, one line at a time. You can paste the following steps into the REPL, one by one.

First, get a handle on the Posts.xml file:

val postsXML = sc.textFile("hdfs:///user/srowen/Posts.xml")

In response the REPL will print:

postsXML: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

The text file is an RDD (Resilient Distributed Dataset) of Strings, which are the lines of the file. You can query it by calling methods of the RDD class. For example, to count the lines:

postsXML.count

This command yields a great deal of output from Spark as it counts lines in a distributed way, and finally prints:
18066983.
The next snippet transforms the lines of the XML file into a collection of (questionID, tag) tuples. This demonstrates Scala's functional programming style, and other quirks. (Explaining them is out of scope here.) RDDs behave like Scala collections, and expose many of the same methods, like map. (You can copy the source for this transformation from here.)
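
A minimal sketch of such a transformation follows; it is not the post's original snippet. It assumes each post is a single <row .../> line whose Id and Tags attributes carry the post ID and its HTML-escaped tags, and the regexes and intermediate names are illustrative:

val postIDTags = postsXML.flatMap { line =>
  // Each post is a single <row .../> element; pull out the Id and Tags attributes
  val idRegex = "Id=\"(\\d+)\"".r
  val tagsRegex = "Tags=\"([^\"]+)\"".r
  // Individual tags appear HTML-escaped, as &lt;tag&gt;, inside the Tags attribute
  val tagRegex = "&lt;([^&]+)&gt;".r
  for {
    idMatch   <- idRegex.findFirstMatchIn(line).toSeq
    tagsMatch <- tagsRegex.findFirstMatchIn(line).toSeq
    tag       <- tagRegex.findAllMatchIn(tagsMatch.group(1)).map(_.group(1))
  } yield (idMatch.group(1).toInt, tag)
}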
You will notice that this returns immediately, unlike previously. So far, nothing requires Spark to actually perform this transformation. It is possible to force Spark to perform the computation by, for example, calling a method like count. Or Spark can be told to compute and persist the result through checkpointing, for example.
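
For instance, here is a minimal sketch using the postIDTags RDD defined above; cache, which this post uses later, persists the computed result in memory, and is a related but simpler mechanism than checkpointing:

// count is an action: it forces the lazy transformation above to actually execute
postIDTags.count

// cache marks the RDD to be kept in memory once computed, so later actions
// reuse it rather than re-reading and re-parsing the XML
postIDTags.cache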
The MLlib implementation of ALS operates on numeric IDs, not strings. The tags (items) in this dataset are strings. It will be sufficient here to hash tags to a non-negative integer value, use the integer values for the computation, and then use a reverse mapping to translate back to tag strings later. Here, a hash function is defined since it will be reused shortly.

// Hash a tag to a non-negative Int, since ALS needs numeric IDs
def nnHash(tag: String) = tag.hashCode & 0x7FFFFF

// Reverse mapping from hashed ID back to the original tag string
var tagHashes = postIDTags.map(_._2).distinct.map(tag => (nnHash(tag), tag))

Now, you can convert the tuples from before into the format that the ALS implementation expects, and the model can be computed:

import org.apache.spark.mllib.recommendation._
// Convert to Rating(Int,Int,Double) objects
val alsInput = postIDTags.map(t => Rating(t._1, nnHash(t._2), 1.0))
// Train model with 40 features, 10 iterations of ALS
val model = ALS.trainImplicit(alsInput, 40, 10)

This will take minutes or more, depending on the size of your cluster, and will spew a large amount of output from the workers. Take a moment to find the Spark master web UI, which can be found from Cloudera Manager, and will run by default at http://[master]:18080. There will be one running application. Click through, then click Application Detail UI. In this view it's possible to monitor Spark's distributed execution of lines of code in ALS.scala.

When it is complete, a factored matrix model is available in Spark. It can be used to predict question-tag associations by recommending tags to questions. At this early stage of MLlib's life, there is not even a proper recommend method
yet that would give suggested tags for a question. However, it is easy to define one:

def recommend(questionID: Int, howMany: Int = 5): Array[(String, Double)] = {
  // Build list of one question and all items and predict value for all of them
  val predictions = model.predict(tagHashes.map(t => (questionID, t._1)))
  // Get top howMany recommendations ordered by prediction value
  val topN = predictions.top(howMany)(Ordering.by[Rating, Double](_.rating))
  // Translate back to tags from IDs
  topN.map(r => (tagHashes.lookup(r.product)(0), r.rating))
}

And to call it, pick any question with at least four tags, like "How to make substring matching query work fast on a large table?" and get its ID from the URL. Here, that's 7122697:

recommend(7122697).foreach(println)

This method will take a minute or more to complete, which is slow. The lookups in the last line are quite expensive since each requires a distributed search. It would be somewhat faster if this mapping were available in memory. It's possible to tell Spark to do this:

tagHashes = tagHashes.cache

Because of the magic of Scala closures, this does in fact affect the object used inside the recommend method just defined. Run the method call again and it will return faster. The result in both cases will be something similar to the following:

(sql,0.17745152481166354)
(database,0.13526622226672633)
(oracle,0.1079428707621154)
(ruby-on-rails,0.06067207312463499)
(postgresql,0.050933613169706474)

(Your result will not be identical, since ALS starts from a random solution and iterates.) The original question was tagged postgresql, query-optimization, substring, and text-search. It's reasonable that the question might also be tagged sql and database. oracle makes sense in the context of questions about optimization and text search, and ruby-on-rails often comes up with PostgreSQL, even though these tags are not in fact related to this particular question.

Something for Everyone


Of course, this example could be more efficient and more general. But for the practicing data scientists out there, whether you came in as an R analyst, Python hacker, or Hadoop developer, hopefully you saw something familiar in different elements of the example, and have discovered a way to use Spark to access some benefits that the other tribes take for granted.

Learn more about Spark's role in an EDH, and join the discussion in our brand-new Spark forum.
Sean is Director of Data Science for EMEA at Cloudera, helping customers build large-scale machine learning solutions on Hadoop. Previously, Sean founded Myrrix Ltd, producing a real-time recommender and clustering product evolved from Apache Mahout. Sean was primary author of recommender components in Mahout, and has been an active committer and PMC member for the project. He is co-author of Mahout in Action.

Filed under: Data Science, Spark, Use Case

10 Responses
ANTONIO PICCOLBONI / MARCH 03, 2014 / 3:15 PM

Au contraire, R has exceptional tools for the manipulation of data prior to analysis, see http://vita.had.co.nz/papers/tidy-data.pdf
Some of those are being ported to MapReduce, see this project: https://github.com/RevolutionAnalytics/plyrmr/
JOWANZA JOSEPH / MARCH 03, 2014 / 3:27 PM

Thanks for posting. I have been messing around with SparkR, an R integration with Spark, for a few days and I like it. I have a great amount of hope for this framework.

FLORIAN LEITNER / MARCH 06, 2014 / 11:51 PM

You guys seem to never have heard of IPython, either. Indeed, it is very easy to do distributed computing with Python these days. So, at least two of your opening arguments are quite a bit moot.
SEAN OWEN / MARCH 09, 2014 / 7:57 PM

Thank you for the comments. The comments in the post are indeed generalizations and simplifications. It is not as if nobody has tried to clean data in R or parallelize Python. I believe data manipulation in R remains quite difficult compared to a modern programming language. IPython, which I have certainly heard of, provides a parallelization primitive among other things, but does not compare seriously to Hadoop for distributed computation. To each his/her own tool, but I would suggest these are exactly the audiences that should be looking at Spark.
MAJID ALDOSARI / MARCH 12, 2014 / 1:51 PM

Parallel programming and distributed data is more about paradigms than particular association with a programming language. It's possible to have a framework accessible from many languages.
What prompted me to say this is when you said Python is limited to processing in memory. I'm sure I can search for Python interfaces (or even Python-native ones) to distributed data stores and processing.
SEAN OWEN / MARCH 13, 2014 / 7:45 AM

On rereading, I agree that this is too dismissive of Python-related tools for distributed computation. To Python plus scikit, for example, it does require adding another, different platform like IPython. And I suspect that for most people this paradigm is different again from where your other data and computation lives, like Hadoop. I would retreat to a weaker point then, and this is up for debate, that the distributed-Python world is relatively hard to access outside of its specialist world, compared to, say, Hadoop. Really, the point was not that Python or its associated tools are deficient, but to create a simplistic sketch of the either/or tradeoffs your average organization faces when engaging these tools.
ALTON ALEXANDER / MARCH 13, 2014 / 11:37 AM

Very impressed with what I've seen from Spark so far. Looking forward to adding it to my arsenal alongside R and Python.
ALEX MCLINTOCK / MAY 09, 2014 / 3:37 AM

Hi Sean, I heard you give this talk last night at Big Data London. I am wondering how SAS fits into this picture. It seems to me that some people like SAS and use it a bit like R, and that Cloudera and SAS talk to each other, presumably through the Hive ODBC driver. How do I find out more?
JUSTIN KESTELYN (@KESTELYN) / MAY 09, 2014 / 10:37 AM

Alex,
This post may help:
http://blog.cloudera.com/blog/2013/05/how-the-sas-and-cloudera-platforms-work-together/
NEAL MCBURNETT / JANUARY 25, 2015 / 10:34 AM

Great post, thank you.
Python is indeed an awesome language for data scientists. Thankfully, Spark now supports pyspark, an official API binding to regular CPython 2.6 and up. This seems quite relevant re: the Python discussion in the comments.
pyspark augments the other amazing multi-paradigm distributed processing opportunities provided by IPython parallel (http://ipython.org/ipython-doc/2/parallel/index.html) as well as scikit, and the wonderful world of literate programming via IPython Notebooks, etc.
