Blog - Angulosolido.pt: Keywords

5/3/2014
blog.angulosolido.pt
Informal blog of Angulo Sólido
Monday, 27 January 2014
Core confusion
Keywords
cpuinfo, hyperthreading, intel, amd, cores, modules, numa
Introduction
Comparing processor features and performance is a complicated subject. Back in the day we had RISC vs CISC; then we had the growing market of turtle-slow PCs vs large-scale machines from Sun, IBM and others. Then PCs got acceptably fast, then they got more cores at a lower clock speed, then came Hyperthreading, then clock speeds went up again, then servers got even faster, and at the end of 2011 AMD launched a processor architecture called Bulldozer, in which the computing units are something between an Intel core and an Intel thread but are also called cores. Needless to say, for the high-level IT manager this is a confusing mess, perhaps driven by marketing, that must be looked at carefully, first from the procurement side and later from the operations side.
Core counting
Before purchasing a server we usually look at the available options for the number of CPUs (N), the CPU clock frequency (F) and the number of cores per CPU (n). When our applications are both CPU-intensive and parallel in nature, it makes sense to estimate the total available computing power by calculating

P = N * F * n

To the guaranteed computing power calculated in this manner we would add a non-guaranteed, workload-dependent extra that comes from Hyperthreading, if the processor is Intel. The ability to run a second, separate thread in the same core translates into a performance boost of somewhere between 10 and 30%, depending on the workload.

If the processors are recent Opterons from AMD we can't use the same formula. We must adapt it to look like this:

P = N * F * n * S

where S is a scaling factor that compensates for the fact that what AMD advertises as a core is not as independent as an Intel core. In their new terminology AMD talks about modules, inside which there are two cores. They could instead have counted each module as a core and invented some word, e.g. Astrothreading, that would make things easily comparable (and it seems that there is a word for that). But the result on sales might not be as good :)

Given the physical situation described, 2 "cores" sharing resources inside a module, it is tempting to propose, as an educated guess, S = 0.5. In fact, we have seen a similar value in an HPC tender, used to allow recent Intel and AMD systems to be compared with a simple N * F * n formula. It was something like:

    Intel E5-26XX       S = 1.00
    Intel E5-26XX v2    S = 1.00
    Intel E5-24XX       S = 1.00
    Intel E5-46XX       S = 1.00
    Intel E7-28XX       S = 0.71
    AMD Opteron 62XX    S = 0.47
    AMD Opteron 63XX    S = 0.47

That being so, for a crude estimate of the guaranteed performance we could go back to the initial formula

P = N * F * n

where n would now be the number of Intel cores, or the number of advertised AMD cores multiplied by 0.47 (roughly a division by 2).
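As a rough illustration of the formula above, the following shell sketch computes P = N * F * n * S with awk. The function name estimate_power is made up for this example. The inputs (two sockets, 4 cores at 2.4 GHz, and 8 advertised cores at 3.2 GHz) correspond to the two systems examined later in this article; S = 1.00 for the Xeon E5620 is an assumption, since that model does not appear in the tender list.

```shell
#!/usr/bin/env bash
# Sketch: estimate "guaranteed" parallel computing power P = N * F * n * S.
# estimate_power is a hypothetical helper name, not part of any tool.
#   $1 = N, number of CPUs (sockets)
#   $2 = F, clock frequency in GHz
#   $3 = n, advertised cores per CPU
#   $4 = S, scaling factor (defaults to 1)
estimate_power() {
    LC_ALL=C awk -v N="$1" -v F="$2" -v n="$3" -v S="${4:-1}" \
        'BEGIN { printf "%.2f\n", N * F * n * S }'
}

# Two-socket Intel system, 4 cores at 2.4 GHz, S = 1.00 (assumed).
estimate_power 2 2.4 4 1.00
# Two-socket AMD Opteron 63XX system, 8 advertised cores at 3.2 GHz, S = 0.47.
estimate_power 2 3.2 8 0.47
```

With these inputs the sketch prints 19.20 and 24.06, in units of "GHz times cores". The numbers are only as good as the S values fed in, which come from one specific tender.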
We should keep in mind that the non-guaranteed extra performance will, in principle, be greater on AMD Opterons than what Intel's Hyperthreading has to offer. We should also keep in mind that the S factor mentioned above was calculated empirically for a specific HPC scenario, although it matches our educated guess. It means that for this specific HPC scenario AMD processors seem to behave as if they had only half of the advertised cores.

We will see, later in this article, how this calculation holds up on a simplistic benchmark we have performed, even more closely when the comparison takes into account the extra performance given by Intel's Hyperthreading and AMD's core 'duplication'.

http://blog.angulosolido.pt/2014/01/core-confusion.html
Linux classification of computing units
We will now analyze two different systems, both exposing 16 virtual processors to the operating system. This is an example of an Intel Xeon E5620 CPU, as seen from dmidecode:

    Version: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
    Voltage: Unknown
    External Clock: 133 MHz
    Max Speed: 2400 MHz
    Current Speed: 2400 MHz
    Status: Populated, Enabled
    Upgrade: Other
    L1 Cache Handle: 0x0005
    L2 Cache Handle: 0x0006
    L3 Cache Handle: 0x0007
    Serial Number: To Be Filled By O.E.M.
    Asset Tag: To Be Filled By O.E.M.
    Part Number: To Be Filled By O.E.M.
    Core Count: 4
    Core Enabled: 4
    Thread Count: 8

And this is an example of an AMD Opteron 6328:

    Version: AMD Opteron(tm) Processor 6328
    Voltage: 1.1 V
    External Clock: 200 MHz
    Max Speed: 3200 MHz
    Current Speed: 3200 MHz
    Status: Populated, Enabled
    Upgrade: Socket G34
    L1 Cache Handle: 0x0005
    L2 Cache Handle: 0x0006
    L3 Cache Handle: 0x0007
    Serial Number: To Be Filled By O.E.M.
    Asset Tag: To Be Filled By O.E.M.
    Part Number: To Be Filled By O.E.M.
    Core Count: 8
    Core Enabled: 8
    Thread Count: 8

From these examples it would seem that Linux had adopted AMD's terminology regarding core counts. However, dmidecode only displays what is stored in the computer's BIOS and made available through the DMI interface (man dmidecode). Let's see what is present in /proc/cpuinfo.

For the Intel processor we have

    processor : 0
    vendor_id : GenuineIntel
    cpu family : 6
    model : 44
    model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
    stepping : 2
    cpu MHz : 1600.000
    cache size : 12288 KB
    physical id : 0
    siblings : 8
    core id : 0
    cpu cores : 4

whereas for the AMD Opteron we see

    processor : 0
    vendor_id : AuthenticAMD
    cpu family : 21
    model : 2
    model name : AMD Opteron(tm) Processor 6328
    stepping : 0
    cpu MHz : 1400.000
    cache size : 2048 KB
    physical id : 0
    siblings : 8
    core id : 0
    cpu cores : 4
From what is present in /proc/cpuinfo we see that Linux has NOT adopted AMD's terminology. In fact it treats AMD's core duplication in the same way it treats Intel's Hyperthreading: the number of cores is exactly the same and the total number of core threads, called "siblings", is twice as large. That is, in both cases we have 2 threads per core, and each of those appears to the operating system as a virtual CPU. We will see as many entries as the number of siblings in each processor times the number of processors. But not all entries are equal: each core with Hyperthreading (Intel) or "double core in a module" (AMD) is represented by two entries that share resources and thus don't behave as independent processors.

Note: the displayed cpu MHz values are not the maximum ones; the energy-saving mechanism lowers the frequency when the system is idle and increases it on demand.
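The relation between "siblings" and "cpu cores" can be checked with a few lines of shell. This is only a sketch: threads_per_core is a hypothetical helper name, it looks at the first processor entry only, and it assumes the x86-style field names shown above are present.

```shell
#!/usr/bin/env bash
# Sketch: derive threads per core from the "siblings" and "cpu cores"
# fields of the first processor entry in a cpuinfo-style file.
# threads_per_core is a hypothetical helper name.
#   $1 = file to read (defaults to /proc/cpuinfo)
threads_per_core() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    LC_ALL=C awk -F':' '
        /^siblings/  && !s { s = $2 + 0 }   # virtual CPUs per physical package
        /^cpu cores/ && !c { c = $2 + 0 }   # cores per physical package
        END {
            if (s > 0 && c > 0) print s / c
            else exit 1                     # fields missing on this platform
        }
    ' "$cpuinfo"
}

if [ -r /proc/cpuinfo ]; then
    threads_per_core || echo "siblings/cpu cores not reported on this system"
fi
```

A result of 2 means the system contains pairs of virtual CPUs that share a core; a result of 1 means every virtual CPU is reported as an independent core.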
Distinguishing independent and coupled virtual CPUs
We have seen that any time the number of siblings is twice the number of cores, the system will contain pairs of virtual CPUs that depend on each other. The Linux kernel is aware of that and spreads the load over the available physical cores before allowing the second thread of each core to be put to use. But the user might want to manually bind certain long-running, CPU-intensive processes to specific CPUs. How can the user know which virtual CPUs depend on each other?

We can group the relevant processor info by running something like
    cat /proc/cpuinfo | egrep "processor|physical id|core id" | sed 's/^processor/\nprocessor/g'
For the Intel system we have

    processor : 0
    physical id : 0
    core id : 0

    processor : 1
    physical id : 0
    core id : 1

    processor : 2
    physical id : 0
    core id : 9

    processor : 3
    physical id : 0
    core id : 10

    processor : 4
    physical id : 1
    core id : 0

    processor : 5
    physical id : 1
    core id : 1

    processor : 6
    physical id : 1
    core id : 9

    processor : 7
    physical id : 1
    core id : 10

    processor : 8
    physical id : 0
    core id : 0

    processor : 9
    physical id : 0
    core id : 1

    processor : 10
    physical id : 0
    core id : 9

    processor : 11
    physical id : 0
    core id : 10

    processor : 12
    physical id : 1
    core id : 0

    processor : 13
    physical id : 1
    core id : 1

    processor : 14
    physical id : 1
    core id : 9

    processor : 15
    physical id : 1
    core id : 10

For the AMD system we have

    processor : 0
    physical id : 0
    core id : 0

    processor : 1
    physical id : 0
    core id : 1

    processor : 2
    physical id : 0
    core id : 2

    processor : 3
    physical id : 0
    core id : 3

    processor : 4
    physical id : 0
    core id : 0

    processor : 5
    physical id : 0
    core id : 1

    processor : 6
    physical id : 0
    core id : 2

    processor : 7
    physical id : 0
    core id : 3

    processor : 8
    physical id : 1
    core id : 0

    processor : 9
    physical id : 1
    core id : 1

    processor : 10
    physical id : 1
    core id : 2

    processor : 11
    physical id : 1
    core id : 3

    processor : 12
    physical id : 1
    core id : 0

    processor : 13
    physical id : 1
    core id : 1

    processor : 14
    physical id : 1
    core id : 2

    processor : 15
    physical id : 1
    core id : 3
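Reading such listings by eye is error-prone, so here is a small sketch that groups processors by the (physical id, core id) pair that /proc/cpuinfo reports. The name group_by_core is hypothetical. The grouping only repeats what cpuinfo claims: on the Intel system it matches the real thread pairs, but on the AMD system discussed in this article the real pairs turn out to be different.

```shell
#!/usr/bin/env bash
# Sketch: group virtual CPUs by the (physical id, core id) pair reported
# in a cpuinfo-style file. group_by_core is a hypothetical helper name.
#   $1 = file to read (defaults to /proc/cpuinfo)
group_by_core() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    LC_ALL=C awk -F':[ \t]*' '
        /^processor/   { cpu = $2 }
        /^physical id/ { pkg = $2 }
        /^core id/ {
            key = pkg "/" $2
            if (key in cpus) {
                cpus[key] = cpus[key] "," cpu
            } else {
                cpus[key] = cpu
                order[++count] = key      # remember first-seen order
            }
        }
        END {
            for (i = 1; i <= count; i++)
                printf "physical id/core id %s: processors %s\n", order[i], cpus[order[i]]
        }
    ' "$cpuinfo"
}

if [ -r /proc/cpuinfo ]; then
    group_by_core
fi
```

Fed the Intel listing above, the first line of output would pair processors 0 and 8 under physical id 0, core id 0.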
In each of the previous listings, consider the two threads that belong to a single physical core. We see that even with the same number of cores and threads per processor, the output is different on these two systems. For the Intel processor, the two threads of the same core are the ones that share the same physical id and core id (for example, processors 0 and 8), whereas for the AMD processor the thread pairs have adjacent processor ids (for example, processors 0 and 1).

We can reach this conclusion empirically by developing a simple CPU performance test program, say cputest.sh, and running two instances of it on an otherwise idle machine. With the taskset command we can bind each instance to a different virtual CPU and evaluate the performance. Whenever we detect a performance penalty due to the second instance, we have hit a pair of core threads.

Example for the Intel CPU, different physical cores:

    [user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
    procid 0 time 4.10 opspersec 24
    procid 0 time 4.10 opspersec 24
    procid 0 time 4.11 opspersec 24
    procid 0 time 4.09 opspersec 24

    [user@srv1 ~]$ taskset -c 7 ./cputest.sh 7
    procid 0 time 4.10 opspersec 24
    procid 0 time 4.10 opspersec 24
    procid 0 time 4.11 opspersec 24
    procid 0 time 4.09 opspersec 24

Example for the Intel CPU, same physical core:

    [user@srv1 ~]$ taskset -c 0 ./cputest.sh 0
    procid 0 time 4.09 opspersec 24
    procid 0 time 6.06 opspersec 16
    procid 0 time 7.05 opspersec 14
    procid 0 time 7.02 opspersec 14

    [user@srv1 ~]$ taskset -c 8 ./cputest.sh 8
    procid 8 time 7.05 opspersec 14
    procid 8 time 7.03 opspersec 14
    procid 8 time 7.03 opspersec 14
    procid 8 time 7.01 opspersec 14

Example for the AMD CPU, different physical cores:

    [user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
    procid 0 time 4.00 opspersec 25
    procid 0 time 4.01 opspersec 24
    procid 0 time 4.02 opspersec 24
    procid 0 time 4.02 opspersec 24

    [user@terminalserver01 ~]$ taskset -c 3 ./cputest.sh 3
    procid 3 time 4.00 opspersec 25
    procid 3 time 4.01 opspersec 24
    procid 3 time 4.00 opspersec 25
    procid 3 time 4.00 opspersec 25

Example for the AMD CPU, same physical core:

    [user@terminalserver01 ~]$ taskset -c 0 ./cputest.sh 0
    procid 0 time 4.01 opspersec 24
    procid 0 time 4.00 opspersec 25
    procid 0 time 4.79 opspersec 20
    procid 0 time 5.00 opspersec 20

    [user@terminalserver01 ~]$ taskset -c 1 ./cputest.sh 1
    procid 1 time 5.02 opspersec 19
    procid 1 time 5.00 opspersec 20
    procid 1 time 5.01 opspersec 19
    procid 1 time 5.00 opspersec 20
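The cputest.sh script itself is not reproduced here. For readers who want to repeat the experiment, the following is a hypothetical stand-in, not the original script: it assumes GNU date (for the %N nanosecond format) and uses an arbitrary awk loop as the floating point workload, so its absolute numbers are not comparable with the ones above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for cputest.sh (the original script is not shown).
# Each round runs a fixed batch of floating point operations in awk, then
# prints the elapsed time and the resulting rate.
#   $1 = label printed as procid, $2 = rounds, $3 = work units per round
cputest() {
    local procid="$1" rounds="${2:-4}" units="${3:-100}"
    local i start end
    for ((i = 0; i < rounds; i++)); do
        start="$(date +%s.%N)"            # GNU date: seconds.nanoseconds
        LC_ALL=C awk -v units="$units" 'BEGIN {
            x = 0
            for (k = 1; k <= units * 100000; k++) x += sqrt(k) / (k + 1)
        }'
        end="$(date +%s.%N)"
        LC_ALL=C awk -v id="$procid" -v s="$start" -v e="$end" -v units="$units" 'BEGIN {
            t = e - s
            if (t <= 0) t = 0.000001      # guard against a zero interval
            printf "procid %s time %.2f opspersec %d\n", id, t, units / t
        }'
    done
}

# Run only when arguments are given, e.g.: ./cputest.sh 0
if [ "$#" -gt 0 ]; then
    cputest "$@"
fi
```

Saved as cputest.sh, two instances can be pinned as in the examples above, for instance `taskset -c 0 ./cputest.sh 0` in one terminal and `taskset -c 8 ./cputest.sh 8` in another. A drop in opspersec when the second instance starts suggests the two virtual CPUs share a core.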
In the previous examples we see, as expected, a per-program performance drop when two CPU-intensive programs run on virtual CPUs that share common resources, i.e., programs running on different threads of the same core. We tried to normalize the examples by translating AMD's new terminology to the usual Intel case; otherwise we would have to say, in the AMD case, that the performance penalty happens whenever we run two programs on cores belonging to the same module.

It is also interesting to compare the performance penalties or, put another way, the amount of extra computing power arising from Hyperthreading or AMD's core duplication. Even though we see a per-program performance penalty, the total number of operations increases when the two programs are running on the same core.

On the Intel system we saw the number of operations per second increase from 24 to 28 (14+14) as we added a second instance of cputest.sh on the same core. On the AMD processor that number increased from 24 to 40 (roughly 20+20). So, for this particular workload, which is a rather trivial series of floating point operations, Hyperthreading increases performance by about 17% whereas AMD's core duplication increases performance by 67%.

On the other hand, we see that Intel's 2.4 GHz processor performs the same 24 single-thread operations per second that AMD's 3.2 GHz processor is capable of. With two threads per core, the AMD processor performs 67% more operations than either processor does with a single thread (40 vs 24). These numbers are not of absolute value in real-world scenarios; for more rigorous benchmarking you can look at this article.

Still, it is at least instructive to work out some trivial numbers. If we were to compare systems made of these two processors, for the specific simple calculations we referred to, we could write

P = N * F * n * S

For the Intel system, considering the advertised 4 cores without Hyperthreading, and using this system as the reference (S = 1), we would have

P_Intel = N * 2.4 * 4 * 1

and for the AMD system, considering the advertised 8 'cores' put to use, we would have

P_AMD = P_Intel * (40/24)

Since we also have

P_AMD = N * 3.2 * 8 * S

we conclude that S = 0.625.

Fairer comparisons can also be performed. Let us consider single-thread performance. Since in that case the number of operations per second is the same for both processors, we have

P_AMD = P_Intel
P_AMD = N * 3.2 * 4 * S

and therefore S = 2.4/3.2 = 0.75. On the other hand, if we prefer to consider the more efficient, and more realistic, scenario where we run two threads on each core (Intel's terminology), we should write

P_AMD = P_Intel * (40/28)
P_AMD = N * 3.2 * A * S

from which we obtain S = 30/(7A). In the previous expression A is what we consider to be AMD's core count. If we use AMD's marketing core count (A = 8) we find S = 0.54. Otherwise, i.e. using A = 4, we have S ~ 1.

Thus, if we count the cores the same way Intel and the Linux kernel do, we see that the throughput per MHz of the AMD Opteron processor is roughly equivalent to the throughput per MHz of the Intel processor, at least for this particular use case. Counting cores in that way enables us to use the formula directly:

P = N * F * n

The Opteron 6328 3.2 GHz processor would then seem to have higher throughput simply because it has a higher clock frequency, its internal core duplication being nearly equivalent to Intel's Hyperthreading, for the tested case.

What we have found using a very simple benchmark is consistent with our first "educated guess" and with what we mentioned before as seen in a Linux HPC system tender.
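The S values above all come from solving the same equation, so the arithmetic can be reproduced with a short sketch. The function name scale_factor and its argument order are made up for this example; N is taken to be the same on both systems, so it cancels out.

```shell
#!/usr/bin/env bash
# Sketch: solve P_new = P_ref * (ops_new / ops_ref) for S, with
# P = N * F * n * S, S = 1 on the reference system and equal N:
#   S = (F_ref * n_ref * ops_new) / (ops_ref * F_new * n_new)
# scale_factor is a hypothetical helper name.
#   $1 $2 $3 = reference clock (GHz), core count, measured ops per second
#   $4 $5 $6 = compared clock (GHz), core count, measured ops per second
scale_factor() {
    LC_ALL=C awk -v f_ref="$1" -v n_ref="$2" -v ops_ref="$3" \
                 -v f_new="$4" -v n_new="$5" -v ops_new="$6" \
        'BEGIN { printf "%.3f\n", (f_ref * n_ref * ops_new) / (ops_ref * f_new * n_new) }'
}

# Intel without Hyperthreading (24) vs AMD with all 8 advertised cores (40).
scale_factor 2.4 4 24 3.2 8 40
# Single thread per core on both systems (24 vs 24), 4 cores each.
scale_factor 2.4 4 24 3.2 4 24
# Two threads per core on both (28 vs 40), AMD counted as A = 8 cores.
scale_factor 2.4 4 28 3.2 8 40
# Same measurements, AMD counted as A = 4 cores.
scale_factor 2.4 4 28 3.2 4 40
```

The four calls print 0.625, 0.750, 0.536 and 1.071, matching the values S = 0.625, S = 0.75, S = 0.54 and S ~ 1 derived in the text.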
Another important conclusion from this analysis is that /proc/cpuinfo is no longer a consistent source of information. What is present there depends on the specific kernel CPU driver, and there were public discussions between Intel and AMD engineers about what should be available in cpuinfo in the face of the new processor architectures. Apparently /proc/cpuinfo is seen as deprecated, its replacement being the information available at

    /sys/devices/system/node/nodeX/cpuY/topology

where X is the NUMA node and Y the virtual CPU id. For example, by looking at the contents of

    /sys/devices/system/node/node0/cpu0/topology/thread_siblings_list

on the AMD system we would have seen that virtual processors 0 and 1 are in fact a pair of siblings.

NUMA is the architecture that replaced SMP, and we expect it to be present in all new multiprocessor machines; each NUMA "node" has faster access to a specific memory region. We won't go into details in this post, besides mentioning that a summary of NUMA-related information can be seen by running

    numactl --hardware

The same command can be used to bind tasks to a NUMA node. This might be useful if the tasks are not only CPU-intensive but also very intensive in terms of memory I/O.
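To list every sibling pair at once instead of inspecting one file at a time, something like the following sketch can be used. The name sibling_groups is hypothetical, and the default directory /sys/devices/system/cpu is an assumption of this example: it holds the same cpuY/topology entries without the per-node split, and a node directory such as /sys/devices/system/node/node0 can be passed instead.

```shell
#!/usr/bin/env bash
# Sketch: print each distinct set of sibling threads once, using the
# per-CPU topology files in sysfs. sibling_groups is a hypothetical name.
#   $1 = directory holding the cpuN entries (defaults to the usual location)
sibling_groups() {
    local root="${1:-/sys/devices/system/cpu}"
    # Every CPU in a set reports the same list, and distinct sets start
    # with distinct CPU numbers, so a numeric unique sort is enough.
    cat "$root"/cpu[0-9]*/topology/thread_siblings_list 2>/dev/null \
        | LC_ALL=C sort -n -u
}

sibling_groups
```

Each output line is one group of virtual CPUs that share a core, in the kernel's list notation (for example 0-1 or 0,8). Once the groups are known, a task can be kept on one node's CPUs and memory with numactl, for example `numactl --cpunodebind=0 --membind=0 ./cputest.sh 0`.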
Conclusions
A short summary of what we learned/reviewed:

- performance is not totally comparable by clock speed values
- performance is not totally measurable by number of cores
- the advertised number of cores is a very different thing for Intel processors and current AMD Opterons

To compare the guaranteed parallel CPU performance of different machines one needs to calculate N * F * n and run a single-threaded test application on each one in order to derive a performance scale factor. The first value estimates the theoretical parallel computing power, which needs to be normalized across machines by a value proportional to the effective number of "operations" (whatever those are) per second per MHz that each machine is capable of.

The same thing can be done with two or more working instances of a test application in order to calculate the total throughput of a physical core, exploring its internal subdivisions.

If testing is not possible, a good starting point is to calculate N * F * n for the candidate systems, dividing n by 2 for each AMD system in the list. From there, we can estimate the cost per "effective parallel MHz" and become conscious buyers. All conclusions will, of course, be very specific to the tested workloads.

While this article is focused on CPU performance, please note that I/O is just as important in many real-world scenarios.

Posted by Panoramix at 01:48
Tags: hardware, sysadmin