
Getting top performance from NXP's LPC processors

Maarten Pennings

2009 November 17

1 Introduction
This document tries to explain the speed of operation of an LPC processor of NXP. It looks at the PLL
settings, checks the effects of the MAM settings, and shows the difference in speed between running from
flash and running from RAM. It shows the difference between the newer fast GPIO and the older slow
GPIO via the APB bus. Finally, it explains measured performance figures using the theoretical figures from
the ARM manual.
All in all, an optimized GPIO pin toggler is nearly 250 times as fast as one using default settings.

1.1 The experiment


The practical work has been done on a Keil MCB2140 evaluation board, containing an NXP LPC2148
processor. The processor has an ARM7TDMI core and several peripherals, amongst others general purpose
input/output (GPIO), a memory accelerator module (MAM), a phase-locked loop (PLL), a pulse width
modulator (PWM), and several others that are more functional but less interesting from the point of view of
performance evaluation. We used an older ULINK USB-JTAG probe to program the LPC and an even
older Fluke PM3082 scope.
The software was written with the evaluation version (version 3.80a) of Keil's uVision IDE with
ARM's RealView compiler.

1.2 References
[NXP's LPC2xxx] http://www.standardics.nxp.com/products/lpc2000
[Keil's MCB2140 board] http://www.keil.com/mcb2140
[Keil's old ULINK] http://www.keil.com/ulink1
[Keil's IDE] http://www.keil.com/arm/mdk.asp
[ARM's ARM7TDMI core] http://www.arm.com/products/CPUs/ARM7TDMI.html
[NXP LPC2148 manual rev 2] http://www.standardics.nxp.com/support/documents/microcontrollers/pdf/user.manual.lpc2141.lpc2142.lpc2144.lpc2146.lpc2148.pdf
[Wikipedia PLL] http://en.wikipedia.org/wiki/Phase-locked_loop
[ARM7TDMI-S ref manual] http://infocenter.arm.com/help/topic/com.arm.doc.ddi0084f/DDI0084.pdf

1.3 Version history


V3  2009 November 17   Textual improvements after review
V2  2009 October 04    Added MAM, fast GPIO, and theory
V1  2009 September 17  Created first version
1.4 Table of contents
1 Introduction
  1.1 The experiment
  1.2 References
  1.3 Version history
  1.4 Table of contents
2 Using PWM to get a reliable measurement
  2.1 The background
  2.2 The software
  2.3 Startup
  2.4 The practical results
  2.5 Some afterthoughts
3 Using the PLL to speed up CCLK
  3.1 The background
  3.2 The software
  3.3 The practical results
4 Faster code fetches with the MAM
  4.1 The background
  4.2 The software part 1
  4.3 The software part 2
  4.4 The practical results
  4.5 Other results
5 Instruction timing
  5.1 The software
  5.2 The results
  5.3 More results
6 Using fast GPIO instead of slow GPIO
  6.1 The software
  6.2 The results
  6.3 Some final theory
7 Conclusions
  7.1 Performance
  7.2 Theory
  7.3 Future work
2 Using PWM to get a reliable measurement
We need a reliable way to determine the performance of the LPC. Only then can we reliably see the effect
of a change in the configuration. One of the crucial ingredients for the processor performance is the clock
that drives the core. In this section we will therefore focus on measuring the clock speed.

2.1 The background


The ARM core runs on a clock known as the CCLK. How can we reliably measure the CCLK? We
could run a program on the core that toggles a pin, but there are too many settings influencing the result.
So, instead, we decided to try to get the CCLK on an external pin.

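[Figure: the crystal (Fosc) feeds the PLL, which generates CCLK for the ARM core; the APB divider
derives PCLK from CCLK, and PCLK drives the PWM.]

The three main frequencies (Fosc, CCLK and PCLK) and their relation.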

As the figure above shows, the PLL generates the CCLK (core clock) from the crystal (Fosc). The PCLK
(peripheral clock) is derived from the CCLK with the so-called APB divider. The peripheral clock drives
many peripherals like the UARTs, the timers, etc. One peripheral in particular seems to suit our needs:
the PWM block. It is a hardware-only block; once configured by software, it runs standalone.

2.2 The software


To drive the PWM we used the following code.
void pwm_init( void )
{
    // We use P0.21/PWM5 as output pin

    // Power the pwm block (note: it's on by default after reset)
    PCONP |= 1<<5;

    // Set the peripheral clock divider (to 1, so that PCLK=CCLK), VPBDIV is the APB divider
    VPBDIV = 1; // 0 -> PCLK = 1/4 CCLK, 1 -> PCLK = 1/1 CCLK, 2 -> PCLK = 1/2 CCLK

    // Configure pin (PWM5 is function 01)
    PINSEL1 &= ~ ( 3<<10 );
    PINSEL1 |= ( 1<<10 );

    // Set the PWM prescaler (so that the PWM clock=PCLK=CCLK)
    PWMPR = 0;

    // Configure the PWM curve
    PWMMR0 = 2; // Set PWM period to 2 PWM clock ticks
    PWMMR5 = 1; // Flip line after 1 tick (so, we run at half the CCLK)

    // Configure the PWM block
    PWMMCR = 0x00000002; // Reset TC on MR0
    PWMPCR = 0x1<<13; // Enable PWM5 (and set it to single edge)
    PWMLER = 0x7f; // Latch all match registers (we only use 0 and 5)

    // Start PWM-ing
    PWMPC = 0; // Prescale counter to 0
    PWMTC = 0; // Reset timer to 0
    PWMTCR = 0x09; // Enable PWM mode and start timer
}

Observe the following aspects of pwm_init():
- We use the PWM5 output via pin P0.21.
- The APB divider, known in Keil as VPBDIV, is set to 1.
- The PWM period is set to 2 PCLK ticks; the PWM output starts low, and after 1 tick the PWM output
  is raised. Effectively, the PWM5 output runs at half the PCLK, and since the APB divider is 1,
  PWM5 runs at half the CCLK.
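
The chain from CCLK to the frequency seen on the scope can be written down as a small host-side calculation. This is only a sketch: the helper names apb_divider() and pwm_output_hz() are made up for illustration, and treating the VPBDIV value 3 like 0 is an assumption, since the code comment above only documents the values 0, 1 and 2.

```c
#include <stdint.h>

/* Map the VPBDIV register value to the APB divider
 * (0 -> 4, 1 -> 1, 2 -> 2), as listed in the comment in pwm_init(). */
static uint32_t apb_divider(uint32_t vpbdiv)
{
    switch (vpbdiv & 3u) {
    case 1u: return 1u;
    case 2u: return 2u;
    default: return 4u; /* 0 is documented; 3 is treated like 0 (assumption) */
    }
}

/* Expected PWM output frequency in Hz (hypothetical helper).
 * pwmpr:  prescaler register, the PWM clock is PCLK / (pwmpr + 1)
 * pwmmr0: PWM period in PWM clock ticks */
static uint32_t pwm_output_hz(uint32_t cclk_hz, uint32_t vpbdiv,
                              uint32_t pwmpr, uint32_t pwmmr0)
{
    uint32_t pclk_hz = cclk_hz / apb_divider(vpbdiv);
    return pclk_hz / (pwmpr + 1u) / pwmmr0;
}
```

With the settings of pwm_init() (VPBDIV = 1, PWMPR = 0, PWMMR0 = 2), a 12 MHz CCLK gives pwm_output_hz(12000000, 1, 0, 2), which is 6 MHz on the pin.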
2.3 Startup
Our first measurement program is very simple: main() first calls the function pwm_init() from the
previous section, next it runs an infinite loop. See the code fragment below.
int main( void )
{
    pwm_init();
    while( 1 )
        ; // infinite loop
}

It should be noted that Keil's uVision generates an assembler file that (to put it simply) maps the reset
vector to main(). This assembler file may also initialize the PLL, the MAM and some other things, but
for these performance experiments, we disabled that (see figure below).

Keil's configuration wizard tab for startup.s (instead of the text file tab), with most settings disabled

When one runs the program, and breaks it, a very nice feature of uVision is available in the menu
Peripherals | System Control Block | Phase Locked Loop 0. It is a dialog showing the current PLL settings
(and it even allows one to make live changes).

PLL dialog from Keil's uVision, showing the PLL is not enabled (top left checkbox
labeled PLLE). It also confirms the CCLK is 12 MHz (bottom line).

2.4 The practical results


As the scope shows, at this high frequency, we do not get nice square pulses; there are ripples when
swinging low and there are ripples when swinging high. Nevertheless, we get 3 V pulses at a clear 6
MHz pulse frequency. In other words, the CCLK is 12 MHz.

The PWM output on the scope. The two vertical dashed lines are so-called track lines; the text at
the top of the scope shows they are 167 ns apart (so the pulse rate is 5.99 MHz). The black line
shows the theoretical square pulses.
2.5 Some afterthoughts
When we compare the crystal schematics of the Keil MCB2140 board (see below) with Figure 47 in
the LPC214x user manual, we conclude that we have a hardware layout matching the "b) oscillation
mode" of operation.

Zooming in on the crystal in the schematics of Keil's MCB2140 board

In this mode, the crystal should generate a frequency between 1 MHz and 30 MHz. Indeed, the MCB2140
board has a crystal running at 12 MHz. This means we have Fosc = 12 MHz. Since the PLL is not enabled,
CCLK is also 12 MHz. Since the APB divider is 1, the PCLK is also 12 MHz. And since the PWM runs at
half the frequency, it is 6 MHz.

3 Using the PLL to speed up CCLK


To speed up the CCLK we need to configure the PLL.

3.1 The background


Hardware-wise it is easy to divide a clock signal (see e.g. the APB divider). However, it is not possible to
multiply a clock signal. But it is possible to run another oscillator of a much higher frequency, whose (output)
frequency is automatically raised or lowered until it matches a reference (input) oscillator in both frequency and
phase. This control system is known as a phase-locked loop ("loop" from the feedback path).

[Figure: input -> phase detector -> variable oscillator -> output, with the output fed back to the
phase detector.]

A diagram of a PLL
The figure above is a diagram of a PLL; it features one more element, namely a frequency divider in the
feedback path. This allows the output frequency to be a factor higher than the input frequency. So, with
a PLL and a divider, we implement a multiplier.
The LPC2148 features a PLL with two dividers; they are known as M and P. The so-called current
controlled oscillator (CCO) has a working range of 156 MHz to 320 MHz. This leads to the
following diagram.
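[Figure: Fosc enters the + input of the phase detector, which drives the CCO (156..320 MHz); the CCO
output passes through a divide-by-2P stage to give CCLK, and CCLK is fed back to the phase detector
through a divide-by-M stage.]

The PLL in the LPC2148 with two dividers and a current controlled oscillator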

Since the + input of the phase detector is 12 MHz, the other (feedback) input should also be 12 MHz. By
setting M to 1, 2, 3, 4, or 5, CCLK needs to be 12, 24, 36, 48, or 60 MHz respectively¹ (60 MHz is the
maximum for the LPC2148). The trick is not in setting M, because that's just a matter of picking the
wanted CCLK from the five possibilities. The trick is in selecting a P so that the CCO can operate in its
working range (156 MHz .. 320 MHz). P can only be 1, 2, 4, or 8.
The table below shows which P we have to pick for the 5 CCLKs we can choose from (with Fosc = 12
MHz). The table also lists options for P in case Fosc would have been 10 MHz, just to illustrate that
sometimes there is more than one option for P.
                        FCCO (MHz)
Fosc   M   CCLK    P=1    P=2    P=4    P=8
 10    1    10      20     40     80    160*
 10    2    20      40     80    160*   320*
 10    3    30      60    120    240*   480
 10    4    40      80    160*   320*   640
 10    5    50     100    200*   400    800
 10    6    60     120    240*   480    960
 12    1    12      24     48     96    192*
 12    2    24      48     96    192*   384
 12    3    36      72    144    288*   576
 12    4    48      96    192*   384    768
 12    5    60     120    240*   480    960

The M and P combinations where the CCO is in its working range 156 MHz .. 320 MHz (light gray in
the original, marked * here). Fosc and CCLK are in MHz.
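
The selection of P can be sketched as a small search over the four allowed values. The relation FCCO = Fosc × M × 2 × P is read off the table; the function names are invented for this illustration, and returning the smallest valid P is an arbitrary choice for the cases where more than one P fits.

```c
/* Working range of the CCO in MHz, as stated in the text. */
#define FCCO_MIN_MHZ 156
#define FCCO_MAX_MHZ 320

/* FCCO = CCLK * 2 * P, with CCLK = M * Fosc (all in MHz). */
static int fcco_mhz(int fosc_mhz, int m, int p)
{
    return fosc_mhz * m * 2 * p;
}

/* Return the smallest P in {1, 2, 4, 8} that puts the CCO in its
 * working range, or 0 if no such P exists. */
static int pick_p(int fosc_mhz, int m)
{
    int p;
    for (p = 1; p <= 8; p *= 2) {
        int f = fcco_mhz(fosc_mhz, m, p);
        if (f >= FCCO_MIN_MHZ && f <= FCCO_MAX_MHZ)
            return p;
    }
    return 0;
}
```

For the 12 MHz board, pick_p(12, 5) returns 2 (FCCO = 240 MHz) and pick_p(12, 1) returns 8 (FCCO = 192 MHz), matching the marked cells.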

The bit fields MSEL and PSEL relate to dividers M and P according to the following table.
P   PSEL      M   MSEL
1   00        1   00000
2   01        2   00001
4   10        3   00010
8   11        4   00011
              5   00100
Mapping of P and M to PSEL and MSEL bit fields

The PLLCFG register has MSEL in bits 0..4 and it has PSEL in bits 5 and 6. So, for our (12 MHz) board
the only legal values for PLLCFG are given in the table below.

¹ The relation between Fosc, CCLK and M is CCLK = M × Fosc. Therefore, M is known as the
multiplier. Mathematically (functionally) this is true, but technically it is not.
Fosc = 12 MHz
CCLK (MHz)   M   P   PSEL   MSEL    PLLCFG (bin)   PLLCFG (hex)
12           1   8   11     00000   1100000        60
24           2   4   10     00001   1000001        41
36           3   4   10     00010   1000010        42
48           4   2   01     00011   0100011        23
60           5   2   01     00100   0100100        24

An overview of all possible values for CCLK, the associated values for M and P, the underlying
bit fields MSEL and PSEL, the complete register PLLCFG (binary) and finally the
PLLCFG value in hex.
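
The encoding in the two tables above can be checked with a few lines of C. This is a sketch under assumptions: pllcfg_encode() is a made-up name, MSEL = M - 1 is extrapolated from the five rows of the mapping table to the full five-bit field, and 0xFF is an arbitrary error value chosen because bit 7 is never set in a valid encoding.

```c
#include <stdint.h>

/* Encode multiplier M and divider P (1, 2, 4 or 8) into a PLLCFG value:
 * MSEL = M - 1 in bits 0..4, PSEL = log2(P) in bits 5..6.
 * Returns 0xFF for arguments that do not fit (hypothetical error value). */
static uint8_t pllcfg_encode(unsigned m, unsigned p)
{
    unsigned psel;

    if (m < 1u || m > 32u)
        return 0xFFu;

    switch (p) {
    case 1u: psel = 0u; break;
    case 2u: psel = 1u; break;
    case 4u: psel = 2u; break;
    case 8u: psel = 3u; break;
    default: return 0xFFu;
    }
    return (uint8_t)((psel << 5) | (m - 1u));
}
```

For example, pllcfg_encode(5, 2) gives 0x24 and pllcfg_encode(1, 8) gives 0x60, the first and last rows of the PLLCFG table read from the bottom up.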

3.2 The software


To set the PLL, one must configure and enable it. Next, as a security measure², the PLL must be fed with
magic values. This starts the CCO running, and the feedback path and the detector will tune it. It takes
some time before the PLL is stable (locked), so as a third step, PLLSTAT must be checked for a
lock. If everything is OK, the PLL may be connected, and this must again be followed by a feed.
void pll_init( int cfg )
{
    int loop_ctr;

    // Step 1: Set CFG and CON
    PLL0CFG = cfg;
    PLL0CON = 0x01; // PLL Enable

    // Step 2: Security measure: feed
    pll_feed();

    // Step 3: Wait for the lock into the new frequency
    loop_ctr = 10000;
    while( ((PLL0STAT&(1<<10))==0) && (loop_ctr>0) ) loop_ctr--;
    // if PLL0STAT & (1<<10) does not hold, we have an issue...

    // Step 4: Connect the PLL
    PLL0CON |= 0x03;

    // Step 5: Security measure: feed
    pll_feed();
}

Where pll_feed() is defined³ as
static void pll_feed( void )
{
    PLL0FEED = 0xAA;
    PLL0FEED = 0x55;
}

The main() function now becomes:
int main( void )
{
    pll_init(0x24); // legal values: 0x60, 0x41, 0x42, 0x23, 0x24
    pwm_init();
    while(1); // infinite loop
}

3.3 The practical results


We run this program, checking pin P0.21 (PWM) on the scope. The practical results are as expected:

PLLCFG (hex) passed   CCLK in theory   Measured frequency   CCLK from measurement
to pll_init()         (MHz)            (MHz)                (MHz)
24                    60               30.1                 60.2
23                    48               24.1                 48.2
42                    36               18.0                 36.0
41                    24               12.0                 24.0
60                    12               5.99                 12.0

The possible PLLCFG values, the theoretical resulting CCLK, the measured
frequency, and the associated practical CCLK.

So, by configuring the PLL, we achieve a speedup of a factor of 5.

² Quoting the user manual: "Since all chip operations, including the Watchdog Timer, are
dependent on the PLL0 when it is providing the chip clock, accidental changes to the PLL setup could
result in unexpected behavior of the microcontroller."
³ If you have interrupts active, this function is not correct: no bus operation may take place between
the two feeds, so interrupts have to be temporarily disabled.

4 Faster code fetches with the MAM


We now know how to control the CCLK, and we have a way (the PWM output pin) to actually measure it. The
next step is to measure the execution speed of instructions. Since the ARM is pipelined, we would hope for
one instruction per CCLK tick.

4.1 The background


To execute instructions, they need to be fetched first. There are three possible routes. Firstly, an
instruction can come directly from the flash. Secondly, the MAM (memory accelerator module) might be
enabled; it prefetches instructions, speeding up the rather slow flash. Thirdly, the ARM core may fetch
instructions from RAM (if code happens to be located there).

[Figure: the ARM core fetches instructions from flash directly, from flash via the MAM, or from RAM.]

The ARM core and the three sources of an instruction (flash, MAM, RAM)

The address ranges 40000000 up to 40007FFF (32 kbytes) and 7FD00000 up to 7FD01FFF (8 kbytes) are
mapped to RAM. So code fetches in these ranges are fetches from RAM. The address range 00000000 up
to 0007FFFF (512 kbytes) is mapped to flash. So code fetches in this range are fetches from flash,
optionally via the MAM.
The MAM is a sort of mini cache. It can either be disabled or enabled. If it is enabled, an instruction fetch
from the ARM is usually satisfied by the 128-bit (4 words, or 4 instructions) prefetch buffer in the
MAM. If the prefetch buffer does not contain the instruction, the ARM is stalled and the MAM fetches
an entire line of 128 bits into the prefetch buffer. Similarly, a data fetch causes the MAM to fetch an
entire line of 128 bits, which is stored in the data buffer. There is a third buffer, the branch trail buffer,
also 128 bits, that is used when there is a break in the sequential flow of instruction fetches.
When the MAM is enabled, we have to configure how many CCLK ticks the MAM should use for flash
access. The register for this is known as MAMTIM and takes values 1 up to 7. When MAMTIM is 1 the ARM
core runs at native speed. For high CCLK frequencies, MAMTIM must be greater than 1, because of the
speed limitations of the flash.

4.2 The software part 1


How do we measure the actual instruction speed? We decide to set and clear pin P1.16. We attach a
scope to that pin so that we can measure how fast it toggles. Note: a toggle here means a full period of
P1.16 first being low and next being high.
To be in full control of the instructions, we code them in assembler. We added the routine SBlink() to the
assembler file startup.s, which is already part of our project.
        EXPORT  SBlink
SBlink
        LDR     R0, =0x00010000     ; mask for pin 16
        LDR     R1, =0xE0028010     ; base address of the slow GPIO port
SBlinkLoop
        STR     R0,[R1,#0x04]       ; set port pin
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        STR     R0,[R1,#0x0C]       ; clear port pin
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        NOP
        B       SBlinkLoop

Observe the following points:
- R0 is loaded with the mask for pin 16.
- R1 is loaded with the base address (E0028010) of the slow SFRs controlling port 1.
- We first store R0 in R1+04 (so in E0028014, or IO1SET), which raises P1.16.
- Later, we store R0 in R1+0C (so in E002801C, or IO1CLR), which lowers P1.16 again.
- At the end there is a (relative) jump back to the IO1SET instruction.
- The NOPs are added to have some meat in the code⁴; the complete routine is now 20
  instructions.
The main function now looks as follows.
void SBlink( void ); // in assembler

int main( void )
{
    pll_init(0x60); // 12MHz CCLK
    pwm_init(); // To check CCLK
    mam_init(4); // Init MAM

    // Configure port P1.16 for slow general purpose output
    SCS &= ~(1<<1); // Select slow mode (for port 1)
    PINSEL2 &= ~ (1<<3); // Set port 1 (pins 16..25) to GPIO in one go
    IODIR1 |= (1<<16); // Set pin 16 for output

    // Start blinking the slow I/O port
    SBlink();
}

The mam_init() used here is new:
void mam_init( int cycles )
{
    MAMCR = 0x00; // Disable the Memory Accelerator Module
    MAMTIM = cycles; // MAM fetch cycles
    MAMCR = 0x02; // Enable the Memory Accelerator Module
}

Observe the following points:
- We run at the lowest CCLK of 12 MHz.
- We still enable the PWM to check the clock.⁵
- The MAM is enabled (and set to 4 fetch cycles).
- Port 1 is configured for slow (traditional, legacy) GPIO.
- Pins P1.16..P1.25 are configured for function GPIO.
- Pin P1.16 is given direction output.
- Finally, we call the never-ending SBlink() routine (whose prototype is added just before main).

⁴ The MAM has a small buffer, so without any NOPs, the whole 3-instruction program would
fit in the MAM buffer. Secondly, as explained later, the STR instructions take unexpectedly many
clock ticks. Adding NOPs mitigates this somewhat.
⁵ As it appears later, pwm_init() sets VPBDIV, and this also influences the IO1SET and
IO1CLR speed (since slow I/O is done by the GPIO peripheral on the APB bus).

4.3 The software part 2


A second experiment is running SBlink() from RAM. This is achieved by declaring an array (named code in
the fragment below), copying sufficient bytes from function SBlink to array code, and executing array code
(using a typecast). This requires some juggling with typecasts, as the code below illustrates.
#include <string.h> // for memcpy()

void SBlink( void ); // in assembler

typedef void(*func_t)(void);

int main(void)
{
    char code[500]; // Array to hold the SBlink code in RAM
    pll_init(0x60); // 12 MHz CCLK
    pwm_init(); // To check CCLK

    // Configure port P1.16 for slow general purpose output
    SCS &= ~(1<<1); // Select slow mode (for port 1)
    PINSEL2 &= ~ (1<<3); // Set port 1 (pins 16..25) to GPIO in one go
    IODIR1 |= (1<<16); // Set pin 16 for output

    // Copy SBlink to code, and run it
    memcpy( (int*)code, (int*)&SBlink, sizeof(code) );
    ((func_t)code)();
}

The MAM is not needed for execution from RAM.

4.4 The practical results


The first program (from flash) is run 8 times: seven times with MAMTIM from 1 up to 7, and once with
mam_init() not called (so with the MAM left in disabled state, the hardware default). The second program
(from RAM) is run once. In each run, the time of one toggle on P1.16 is measured.⁶ The last column
shows the toggle time not in nanoseconds but in CCLK ticks (each of 83 ns, since the CCLK runs at 12
MHz, since the PLL is configured with 0x60).

Code in   MAMCR      MAMTIM   T(toggle)   clock ticks @ 12 MHz
Flash     Disabled   N/A      13600 ns    163
Flash     Enabled    7        4080 ns     49
Flash     Enabled    6        3730 ns     45
Flash     Enabled    5        3410 ns     41
Flash     Enabled    4        3070 ns     37
Flash     Enabled    3        2980 ns     35
Flash     Enabled    2        2900 ns     35
Flash     Enabled    1        2820 ns     34
RAM       N/A        N/A      2810 ns     34

Time of one toggle on P1.16 (in nanoseconds and in clock ticks) for different MAM settings
(CCLK is 12 MHz).
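
The conversion from a scope reading to clock ticks is a single multiplication; a minimal sketch follows. The name ns_to_ticks() is invented here, and rounding to the nearest tick is an assumption about how the tick column was derived.

```c
/* Convert a duration in nanoseconds to CCLK ticks, rounded to the
 * nearest tick. ticks = t_ns * f_MHz / 1000. */
static unsigned ns_to_ticks(unsigned t_ns, unsigned cclk_mhz)
{
    return (t_ns * cclk_mhz + 500u) / 1000u;
}
```

For the first row, ns_to_ticks(13600, 12) evaluates to 163, and for the RAM row ns_to_ticks(2810, 12) evaluates to 34.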

We see that the flash is considerably slower than RAM: nearly a factor of 5 (13600/2810). We also see
that the MAM really helps in closing that gap (2820 ns versus 2810 ns). We also noticed that with the
MAM enabled, the predictability decreased: toggle periods differ in length.
Since the size of the code is 20 instructions, the (4-word) prefetch buffer should be reloaded 4 times per
toggle period (the branch trail buffer is also used due to the branch at the end of the toggle period). This
means a penalty of 4 ticks per increment of MAMTIM. We do see this for MAMTIM 4 to 5, 5 to 6, and 6
to 7. We cannot explain the smaller penalty for MAMTIM 1 to 2, 2 to 3, and 3 to 4.

⁶ Actually two periods are measured (and that time is halved) because the scope shows that periods differ
in length (and a shorter one always seems to be followed by a longer one).

4.5 Other results


We added some variation to the experiment.
The first variation was to increase the number of NOP instructions. As we see in the table below (compare
columns 2str+17nop+1b and 2str+18nop+1b), when going from 17 to 18 NOP instructions, we consistently
get 1 clock tick of extra time spent (for RAM and for flash through the MAM with any timing setting).
We also reduced the number of NOPs to below 3, as in the code fragment below.
SBlinkLoop
        STR     R0,[R1,#0x04]       ; set port pin
        NOP
        STR     R0,[R1,#0x0C]       ; clear port pin
        NOP
        B       SBlinkLoop

In this case, the MAM buffers never need to be reloaded (presumably the prefetch buffer holds the first 4
instructions and the branch trail buffer holds the last one), so the toggle period does not vary with the MAMTIM
setting. See the table below for measurements with 2 and 1 NOP in the code (compare 2str+2nop+1b with
2str+3nop+1b).

Time of one toggle in ns (clock ticks), CCLK = 12 MHz
Code    MAM        MAMTIM   2str+1nop+1b   2str+2nop+1b   2str+3nop+1b   2str+17nop+1b   2str+18nop+1b
Flash   disabled   N/A      4540 (54)      5100 (61)      5700 (68)      13600 (163)     14400 (173)
Flash   enabled    7        1505 (18)      1585 (19)      2180 (26)      4080 (49)       4180 (50)
Flash   enabled    6        1505 (18)      1585 (19)      2095 (25)      3730 (45)       3830 (46)
Flash   enabled    5        1510 (18)      1575 (19)      1995 (24)      3410 (41)       3540 (42)
Flash   enabled    4        1505 (18)      1575 (19)      1910 (23)      3070 (37)       3160 (38)
Flash   enabled    3        1505 (18)      1580 (19)      1815 (22)      3005 (36)       3080 (37)
Flash   enabled    2        1510 (18)      1575 (19)      1775 (21)      2900 (35)       3010 (36)
Flash   enabled    1        1505 (18)      1575 (19)      1675 (20)      2820 (34)       2940 (35)
RAM     N/A        N/A      1505 (18)      1580 (19)      1665 (20)      2810 (34)       2890 (35)

Time of one toggle on P1.16 for different MAM settings and various numbers of NOP instructions
(note: 2str+3nop+1b stands for an SBlink() routine containing 2 store, 3 nop and 1 branch
instruction).

The second variation was to change the PLL setting so that we get a higher CCLK. See the table below for
the results. We see that the number of clock ticks remains the same. In other words, the real-world
performance increases linearly with the clock speed. Or, rephrased, the flash can keep up with the speed
of the ARM core.

Time of one toggle in ns (clock ticks), 2str+17nop+1b
Code    MAM        MAMTIM   12 MHz        36 MHz       60 MHz
Flash   disabled   N/A      13600 (163)   4610 (166)   2750 (165)
Flash   enabled    7        4080 (49)     1355 (49)    815 (49)
Flash   enabled    6        3730 (45)     1245 (45)    750 (45)
Flash   enabled    5        3410 (41)     1135 (41)    690 (41)
Flash   enabled    4        3070 (37)     1025 (37)    620 (37)
Flash   enabled    3        3005 (36)     1005 (36)    600 (36)
Flash   enabled    2        2900 (35)     975 (35)     585 (35)
Flash   enabled    1        2820 (34)     945 (34)     crash
RAM     N/A        N/A      2810 (34)     945 (34)     565 (34)

Time of one toggle on P1.16 for different MAM settings and various CCLK speeds
There is one exception: when running at full speed (60 MHz), and no waits in the MAM (MAMTIM=1),
the microcontroller crashed. The surprise here is that crashes didn't happen sooner (on all boxes shaded
light gray in the original table). As the LPC manual explains:
    "For system clock slower than 20 MHz, MAMTIM can be 001. For system clock
    between 20 MHz and 40 MHz, Flash access time is suggested to be 2 CCLKs, while
    in systems with system clock faster than 40 MHz, 3 CCLKs are proposed."
If we put these suggestions in a table we get the following result.
CCLK           Suggested MAMTIM   Flash access time
10 .. 20 MHz   1                  100 ns .. 50 ns
20 .. 40 MHz   2                  100 ns .. 50 ns
40 .. 60 MHz   3                  75 ns .. 50 ns

The MAMTIM setting from the LPC2148 manual for various CCLK
speeds suggests a minimal flash access time of 50 ns

So, in the test with 60 MHz and MAMTIM=1 (a flash access time of about 17 ns), we are overclocking
the system. This is out of spec!
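
The manual's guidance and the resulting flash access time can be captured in two small functions. This is a sketch: the function names are invented, and how the exact boundary frequencies (20 and 40 MHz) are assigned is an assumption, since the quoted text leaves that open.

```c
/* Suggested MAMTIM for a CCLK in MHz, following the quoted guidance:
 * below 20 MHz -> 1, 20..40 MHz -> 2, above 40 MHz -> 3.
 * The handling of exactly 20 and 40 MHz is an assumption. */
static unsigned suggested_mamtim(unsigned cclk_mhz)
{
    if (cclk_mhz < 20u)
        return 1u;
    if (cclk_mhz <= 40u)
        return 2u;
    return 3u;
}

/* Time in ns that the flash gets per access: MAMTIM CCLK periods
 * (integer division, so the result is truncated). */
static unsigned flash_access_ns(unsigned cclk_mhz, unsigned mamtim)
{
    return (1000u * mamtim) / cclk_mhz;
}
```

At 60 MHz, suggested_mamtim(60) is 3 and flash_access_ns(60, 3) is 50, whereas flash_access_ns(60, 1) is only 16, well below the 50 ns the table implies.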

5 Instruction timing
The previous section shows that a 20-instruction routine (known above as 2str+17nop+1b) executes in 34
cycles instead of the 20 one might expect from a pipelined RISC processor like the ARM. This section
explains the less-than-expected performance. The next section shows a way to speed it up.

5.1 The software


We use three versions of the blinker: a base program, the base program with an extra STR instruction
and the base program with an extra B instruction. By measuring the difference in run time, we know the
cost (in ticks) of the STR and B instructions.
2str+1b
SBlinkLoop
        STR     R0,[R1,#0x04]       ; set port pin
        STR     R0,[R1,#0x0C]       ; clear port pin
        B       SBlinkLoop

3str+1b
SBlinkLoop
        STR     R0,[R1,#0x04]       ; set port pin
        STR     R0,[R1,#0x04]       ; set port pin
        STR     R0,[R1,#0x0C]       ; clear port pin
        B       SBlinkLoop

2str+2b
SBlinkLoop
        STR     R0,[R1,#0x04]       ; set port pin
        B       SBlinkLoopCont
SBlinkLoopCont
        STR     R0,[R1,#0x0C]       ; clear port pin
        B       SBlinkLoop

5.2 The results


We run the three blinkers from RAM, with CCLK set to 12 MHz.
2str+1b              3str+1b              2str+2b
1420 ns (17 ticks)   1995 ns (24 ticks)   1670 ns (20 ticks)

Measuring individual instructions (slow GPIO)

We now have measured individual instructions:
- The STR instruction takes 7 ticks (24 ticks for 3str+1b minus 17 ticks for 2str+1b).
- The B instruction takes 3 ticks (20 ticks for 2str+2b minus 17 ticks for 2str+1b).
- The NOP instruction takes 1 tick (35 ticks for 2str+18nop+1b minus 34 ticks for
  2str+17nop+1b; see previous chapter).
These timing figures explain to the digit the timing results of the previous chapter: 2str+17nop+1b runs
in 2×7 + 17×1 + 1×3 = 14 + 17 + 3 = 34 cycles.
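
The per-instruction costs measured above form a simple additive model that can be checked against the other RAM measurements. A minimal sketch, with loop_ticks() as a made-up name; it only applies to code running from RAM with slow GPIO and VPBDIV = 1, the conditions under which the costs were measured.

```c
/* Measured costs in CCLK ticks (RAM, slow GPIO, VPBDIV = 1). */
enum { STR_TICKS = 7, B_TICKS = 3, NOP_TICKS = 1 };

/* Predicted ticks for one pass through a blinker loop. */
static int loop_ticks(int n_str, int n_nop, int n_b)
{
    return n_str * STR_TICKS + n_nop * NOP_TICKS + n_b * B_TICKS;
}
```

loop_ticks(2, 17, 1) gives 34 and loop_ticks(2, 1, 1) gives 18, which agree with the RAM rows of the tables in the previous chapter.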

5.3 More results


It suddenly struck us that slow GPIO runs on the Advanced Peripheral Bus (APB). The APB runs on
the PCLK, which is derived from CCLK via the APB divider. The controlling SFR VPBDIV is set to 1
in pwm_init(). We decided to rerun the tests with varying PCLKs.

                                         2str+1b              3str+1b
VPBDIV=1 (PCLK = 1/1 CCLK = 12 MHz)      1420 ns (17 ticks)   1995 ns (24 ticks)
VPBDIV=2 (PCLK = 1/2 CCLK = 6 MHz)       1820 ns (22 ticks)   2660 ns (32 ticks)
VPBDIV=0 (PCLK = 1/4 CCLK = 3 MHz)       2980 ns (36 ticks)   4315 ns (52 ticks)

Measuring individual instructions (slow GPIO) with varying PCLK

When we look at the STR instruction we see that it takes 24 − 17 = 7 ticks when the APB
divider is 1, that it takes 32 − 22 = 10 ticks when the APB divider is 2 and that it takes 52 − 36 = 16 ticks when
the APB divider is 4.
We can explain this by assuming that the STR instruction takes 4 internal ARM core cycles and 3 APB
bus cycles. When the APB divider is 1, we get 4 + 3 = 7, when the APB divider is 2 we get 4 + 2×3 = 10 and
when the APB divider is 4, we indeed get 4 + 4×3 = 16.
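
The assumed split into core cycles and APB cycles can be written as a one-line model. This is a sketch of the hypothesis in the text, not a documented property of the chip; slow_str_ticks() is a made-up name and its argument is the actual divider (1, 2 or 4), not the VPBDIV register value.

```c
/* Hypothesis from the text: a store to a slow GPIO register costs
 * 4 CCLK ticks in the core plus 3 APB cycles, and each APB cycle
 * lasts apb_divider CCLK ticks. */
static int slow_str_ticks(int apb_divider)
{
    return 4 + 3 * apb_divider;
}
```

slow_str_ticks(1), slow_str_ticks(2) and slow_str_ticks(4) evaluate to 7, 10 and 16, the three STR costs derived from the table.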

6 Using fast GPIO instead of slow GPIO


We now know that the STR instruction for the slow GPIO SFRs takes 7 ticks (when PCLK=CCLK). We
expected 1 tick, so that is indeed slow. How much faster would the new fast GPIO be? We repeat the
experiment of the previous chapter, now with fast GPIO.

6.1 The software


The main() function changes slightly; we have to set up P1.16 for fast GPIO.
#include <string.h> // for memcpy()

void FBlink( void ); // in assembler

typedef void(*func_t)(void); // as in the RAM experiment of section 4.3

int main(void)
{
    char code[500]; // Array to hold the FBlink code in RAM
    pll_init(0x60); // 12MHz CCLK

    // Configure port P1.16 for fast general purpose output
    SCS |= 1<<1; // Select fast mode (for port 1)
    PINSEL2 &= ~ (1<<3); // Set port 1 (pins 16..25) to GPIO in one go
    FIO1MASK &= ~(1<<16); // Enable pin for set/clear
    FIO1DIR |= (1<<16); // Set pin for output

    // Copy FBlink to code, and run it
    memcpy( (int*)code, (int*)&FBlink, sizeof(code) );
    ((func_t)code)();
}

We also have to write FBlink (in assembler, in startup.s) to use the fast SFRs instead of the slow SFRs.
        EXPORT  FBlink
FBlink
        LDR     R0, =0x00010000     ; mask for pin 16
        LDR     R1, =0x3FFFC020     ; base address of the fast GPIO port
FBlinkLoop
        STR     R0,[R1,#0x18]       ; set port pin
        STR     R0,[R1,#0x1C]       ; clear port pin
        B       FBlinkLoop

Note:
- R1 is loaded with the base address (3FFFC020) of the fast SFRs controlling port 1.
- We first store R0 in R1+18 (so in 3FFFC038, or FIO1SET), which raises P1.16.
- Later, we store R0 in R1+1C (so in 3FFFC03C, or FIO1CLR), which lowers P1.16 again.
- At the end there is a (relative) jump back.

6.2 The results


We run the three blinkers from RAM, with CCLK set to 12 MHz.
2str+1b             3str+1b             2str+2b
585 ns (7 ticks)    755 ns (9 ticks)    840 ns (10 ticks)

Measuring individual instructions (fast GPIO)

We now have measured individual instructions:
- The STR instruction takes 2 ticks (9 ticks for 3str+1b minus 7 ticks for 2str+1b).
- The B instruction still takes 3 ticks (10 ticks for 2str+2b minus 7 ticks for 2str+1b).
The conclusion is that an STR to slow GPIO is 3.5 times as slow as an STR to fast GPIO (7 ticks versus 2), 5
times as slow (10 ticks versus 2) or even 8 times as slow (16 ticks versus 2), depending on the APB divider.

6.3 Some final theory


The LPC2148 contains an ARM7TDMI-S core. As the ARM7TDMI-S reference manual explains,
this core uses a pipeline to increase the speed of the flow of instructions. This allows several operations to
take place simultaneously, and the processing and memory systems to operate continuously. A three-
stage pipeline is used, so instructions are executed in three stages:
- Fetch (the instruction is fetched from memory)
- Decode (the registers used in the instruction are decoded)
- Execute (registers are read from the register bank, the ALU operates, and the registers are written back)
The ARM7TDMI-S has a Von Neumann architecture, with a single 32-bit data bus carrying
both instructions and data. Only load, store, and swap instructions can access data from memory.
The ARM7TDMI-S has four basic types of memory cycle:
- Idle cycle (I)
- Nonsequential cycle (N)
- Sequential cycle (S)
- Coprocessor register transfer cycle (C)
In the pipelined architecture of the ARM7TDMI-S, while one instruction is being fetched, the previous
instruction is being decoded, and the one prior to that is being executed. The table below (taken from
the ARM manual) lists the number of cycles required by an instruction, when that instruction reaches
the execute stage.
Instruction        Qualifier               Cycle count
Any unexecuted     Condition codes fail    +S
Data processing    Single cycle            +S
B, BL                                      +N+2S
STR                                        +N+N
SWP                                        +N+N+I+S
MCR                                        +(b)I+C+N
(more)

Excerpt from the timing table from the ARM7TDMI-S reference manual

We see that a B instruction has a fetch, decode, nonsequential, sequential, and sequential cycle. The
ARM7TDMI-S reference manual explains the operations in the three execute steps:
1. During the first cycle, a branch instruction calculates the branch destination while performing a
   prefetch from the current PC. This prefetch is done in all cases because, by the time the decision to
   take the branch has been reached, it is already too late to prevent the prefetch.
2. During the second cycle, the ARM7TDMI-S performs a Fetch from the branch destination.
3. During the third cycle, the ARM7TDMI-S performs a Fetch from the destination.
The STR instruction has a fetch, decode, nonsequential, and nonsequential cycle. The ARM7TDMI-S
reference manual explains the operations in the two execute steps:
1. During the first cycle, the ARM7TDMI-S calculates the address to be stored.
2. During the second cycle, the ARM7TDMI-S performs the base modification, and writes the data to
   memory (if required).
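
The cycle counts from the excerpt can be compared with the fast GPIO measurements of section 6.2 by counting execute-stage cycles. The sketch below assumes that every N or S cycle costs exactly one CCLK tick (no wait states) and that the loop time is bounded by the execute stage; the type and function names are invented for illustration.

```c
/* Memory cycles spent in the execute stage, per the excerpt above. */
struct exec_cycles {
    int n; /* nonsequential (N) cycles */
    int s; /* sequential (S) cycles */
};

static const struct exec_cycles CYCLES_B   = { 1, 2 }; /* B:   N + 2S */
static const struct exec_cycles CYCLES_STR = { 2, 0 }; /* STR: N + N  */

/* Assumption: every N or S cycle costs exactly one CCLK tick. */
static int exec_ticks(struct exec_cycles c)
{
    return c.n + c.s;
}

/* Predicted ticks for a loop of n_str stores and n_b branches. */
static int fast_loop_ticks(int n_str, int n_b)
{
    return n_str * exec_ticks(CYCLES_STR) + n_b * exec_ticks(CYCLES_B);
}
```

Under these assumptions fast_loop_ticks(2, 1), fast_loop_ticks(3, 1) and fast_loop_ticks(2, 2) come out at 7, 9 and 10 ticks, the same values that were measured for 2str+1b, 3str+1b and 2str+2b with fast GPIO.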

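[Figure: pipeline timing diagram over ticks 0..18 for the repeating sequence STR, STR, B. Each STR
passes through fetch (F), decode (D) and two execute cycles (XN, XN); each B passes through F, D and
three execute cycles (XN, XS, XS). Two consecutive loop iterations are each marked as 7 ticks.]

The timing of the fast GPIO loop (bounded by the execution phase) is indeed 7 ticks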

The figure above illustrates the timing of the fast GPIO loop.
For slow GPIO, the STR takes 7 cycles instead of 2. This has to do with the slow GPIO going through the
Advanced Peripheral Bus. An explanation would be that each N access has a wait state introduced by
the AHB wrapper, and that there is an additional wait of 3 APB clocks (see figure below). This results in a
7-clock execute phase (as measured). Furthermore, it also explains why an APB divider set to 2 makes the
execute phase of the store last 4 + 2×3 = 10 ticks.
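[Figure: pipeline timing diagram over ticks 0..22 for the sequence STR, STR, B, STR. Each STR passes
through F, D and an execute phase of X1N, ws, APB, APB, APB, X2N, ws; the B passes through F, D, XN,
XS, XS. One loop iteration is marked as 17 ticks.]

A possible explanation of the timing of the slow GPIO loop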

7 Conclusions

7.1 Performance
We have looked at several aspects of tuning the system for performance.
- The PLL allows us to boost an external clock with a minimum frequency of 10 MHz to 60 MHz: a
  speedup of a factor of 6 (our board had a crystal of 12 MHz, not 10 MHz).
- With an enabled MAM with minimal timing (or when running from RAM) we get a speedup of nearly
  5 with respect to running from flash directly.
- Fast GPIO is 3.5 times (or 5 or 8 times) as fast as slow GPIO (depending on the APB divider).
- Total speedup achieved is 6 × 4.8 × 8 = 230.4.
So, the system's performance window (for GPIO) is a factor 230 wide.
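
The total is simply the product of the three independent factors; the sketch below makes that explicit. The function name is invented, and the factor values passed in are the ones used in the list above (6 for the PLL, 4.8 for flash versus RAM, 8 for slow versus fast GPIO at APB divider 4).

```c
/* Combined speedup of independent tuning steps is their product. */
static double total_speedup(double pll, double mam, double gpio)
{
    return pll * mam * gpio;
}
```

Calling total_speedup(6.0, 4.8, 8.0) yields approximately 230.4; with the factor of 5 actually available on the 12 MHz board, the same product would be 192.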
7.2 Theory
- We now understand the purpose and architecture of the PLL in the system, namely multiplying the
  external clock. The details of choosing an oscillator are not yet clear.
- We understand the purpose of the MAM in the system, namely bridging the speed gap between the flash
  and the ARM core. The details of the timing and the purpose of the three buffers are not completely
  clear. We have seen that code in RAM performs optimally.
- We have seen that STR instructions to slow GPIO SFRs perform really poorly (nonsequential accesses
  delayed by the APB bus), and that STR instructions to fast GPIO perform much better. The slow GPIO
  is even slower when the APB divider kicks in.
- The details of why each instruction clocks as measured are not yet completely understood, but the theory
  roughly matches the practice.
