
CGRA structure

This page describes the structure of the CGRA.

The compute module
The most fine-grained modules used in the CGRA are the functional units (FUs). The following FUs are available:
Load Store Unit (LSU)
Register File Unit (RF)
Arithmetic Logic Unit (ALU)
Immediate Unit (IU)
Accumulate and Branch Unit (ABU)
Multiplier Unit (MUL)
Besides these units there are Instruction Decoders (IDs), Instruction Fetch units (IFs) and switchboxes (swb).
These units are instantiated based on the architecture description, an XML file describing the FUs available, their
properties and interconnect. The interconnect can be either fixed (specified at design time in the XML file), which
we call static, or reconfigurable, which we call dynamic. Dynamic CGRAs use switchboxes to make connections
between functional units and between IDs and FUs. By doing so processors can be constructed, either at design
time with a static CGRA or at run time with a dynamic CGRA.
The FUs, IDs, IFs and switchboxes are contained in the compute module. An example of such a compute
module is shown in the figure below. Please note that the figure is not meant to show a particular instantiation;
it merely aims to show the structure of the CGRA. The switchboxes are not shown for clarity.

The arrows in the figure above indicate interfaces to a higher hierarchical level. For the LSUs these are interfaces
to local or global memory. For the IF units the arrows indicate an interface to the instruction memory.
The compute module also contains scan chains for loading the CGRA configuration and for reading or writing the
processor state. The control wires are available to the higher hierarchical level, the compute wrapper module.

The compute wrapper module
The compute wrapper module contains the compute module of the CGRA and adds some administrative
functionality. The independent global memory connections for each of the LSUs are connected to an arbiter
which manages access to the global memory bus. The local memory connections pass through this module since
they do not need to be arbitrated, as do the instruction memory connections to the IFs. Configuration loading is
managed by the configuration loader. This module also contains some status and control registers, which are
described in CGRA external interfaces. The configuration loader also takes care of filling the instruction
memories with the operations defined in the binary. Another, optional, module is the state controller. This module
allows reading and writing of the entire state of the compute module, which can be useful for multi-threading or
debugging.

The connections for the data memory, loader and state controller are all DTL (Device Transaction Layer) buses,
essentially a simplified AXI bus. These buses are compatible with the CompSoC platform but are also very
convenient to connect to other devices, for example via DTL-to-AXI translators. The DTL port for the state
controller is an optional port and will only be instantiated when the global define INCLUDE_STATE_CONTROL
is defined.
The connections between the local memories and the LSUs, as well as between the instruction memories and the
IFs, are simple direct memory interfaces. These memories are not contained within the compute wrapper module since
the ASIC tool flow typically has no memory generator. Therefore the compute wrapper module is the top-level
module for ASIC synthesis.

The core module
The core module combines the compute wrapper module with the local data memories and instruction memories.
These memories are contained in a separate module, very imaginatively called the memory module. Besides
adding the memories to the compute wrapper module, the core module does not introduce any new functionality.
The DTL buses simply pass through this module.
The core module is a ready-to-use CGRA block that can be included in FPGA designs (or the CompSoC
platform). The DTL ports have to be connected to the host processor and the required external memories; more
about this in Using the CGRA.

The top module
The top module is an (almost) stand-alone version of the CGRA and is mostly used for simulation. All
required memories, such as the global and state storage memories, and the peripherals are contained in this module.
The only memory not contained in this module (because it is assumed to be supplied by an external system) is the
memory where the application binary resides. The top module assumes a simple hardware initiator that sends a
pointer to the address in the binary memory where the application binary resides. Peripherals specified in the
architecture XML file will be instantiated in the top module; any required inputs and outputs will be added based
on the architecture description.
The global memory bus, and therefore also the peripherals, is arbitrated between the external world (a host
processor can therefore also control the peripherals) and the CGRA. The arbiter is round-robin but does not
grant slots to ports without requests: if there are no requests from the external world, there is no penalty in CGRA
performance (in cycles) from having the arbiter present.
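The arbitration policy described above can be sketched in a few lines of Python. This is only a behavioural illustration under stated assumptions, not the Verilog arbiter itself: the function name round_robin_grant, the port numbering (0 = CGRA, 1 = external world) and the one-grant-per-cycle model are all hypothetical.

```python
def round_robin_grant(requests, last_granted):
    """Return the index of the port to grant this cycle, or None.

    requests     -- list of booleans, one per port (True = port is requesting)
    last_granted -- index of the most recently granted port (-1 if none yet)
    """
    n = len(requests)
    # Start looking at the port after the last granted one and wrap around.
    # Ports that are not requesting are skipped, so they never consume a slot.
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None  # no port is requesting


# Hypothetical numbering: port 0 = CGRA, port 1 = external world.
last = -1
grants = []
for cycle_requests in [[True, False], [True, False], [True, True], [True, True]]:
    granted = round_robin_grant(cycle_requests, last)
    grants.append(granted)
    if granted is not None:
        last = granted

print(grants)  # [0, 0, 1, 0]
```

In the first two cycles only the CGRA requests and it is granted back to back, which matches the statement that an idle external port costs no cycles.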

Test bench
The test bench contains the simulation logic and application binary. It asserts the proper DTL control signals to
configure and start the CGRA. It can also be used to poll the state of the CGRA or retrieve data from the global
memory. Control signals for peripherals are available within the test bench; it is up to the user, however, to use
these signals (they are not included in the default control logic).

Verilog code structure
This section describes the structure of the Verilog code generated by the CGRA tool flow. We use the following
conventions:
[design]: is the name of the design, e.g. the test bench name 'BinarizationStatic'
[unit]: is the name of a functional unit, e.g. lsu
The code is structured as follows:
TB_TOP: top-level test bench, loading the binary and initiating configuration of the DUT.
  dut: The Device Under Test, in our case the CGRA instance used.
    GM_inst: Global memory.
    DTL_[peripheral name]_inst: Peripheral connected to the global memory bus.
    [design]_Core_inst: Wrapper around the memory and compute modules.
      [design]_Memory_inst: Module containing all the memories for the CGRA
                            (instruction, local data and global data memories).
        LM_[unit]: Memory for functional units, usually LSUs, that are connected
                   to a local scratchpad memory.
        IM_[unit]: Instruction memory for units such as the IDs and the immediate
                   units.
      [design]_Compute_Wrapper_inst: Module containing a loader and the compute
                                     module.
        Loader_inst: Module that manages configuration of the network and FUs.
                     It also manages loading the instruction memories.
        [design]_Compute_inst: Module containing all FUs, switchboxes (if
                               present) and all wiring in between.
          IF_[unit]_inst: Instruction fetcher for an ID.
          SWB_DATA_[unit]_inst: Switchbox module for the data network.
          SWB_CONTROL_[unit]_inst: Switchbox module for the control
                                   (decoded instructions) network.
          [unit]_inst: Instance of a functional unit; can be an ID, IU, ALU, ABU,
                       LSU, RF or MUL.
Instruction Set
The CGRA consists of multiple functional units (FUs) which each have their own
instruction set. These FUs are the Load Store Unit (LSU), the Register File (RF)
and the Arithmetic Logic Unit (ALU), although other units might be added in the
future. For all FUs that are implemented and functionally tested the instruction
set is listed below.

Load Store Unit:


Type 1 instructions:

These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):

Opcode [11:7], DataType [6:5], Destination [4:4], InputB [3:2], InputA [1:0]

The 5 bit opcode describes the operation that is to be performed. The DataType
field is used for loading bytes, halfwords and words. The remaining 6 bits are
divided into 2 sets of 3 bit, the first set specifying load operations and the
second set specifying store operations. These 3-bit sets are formatted as
follows:

Bit 2: Load or Store / NOP


Bit 1: Global / Local memory
Bit 0: Implicit / Addressed

The DataType field can be one of the following values:

00: BYTE
01: HWORD
10: WORD
11: DWORD

The destination specifies which output register has to be written.

The input describes which of the inputs of the functional unit will be selected
for a specific operand. For all operations in this instruction type: input B is the
address input and input A is the data input.

Note: when a simultaneous load and store are performed on the same read-
and write address the NEW value will be returned by the load.
Operation                     Description                                   Opcode and parameters
NOP                           No OPeration                                  0000000_?_??_??
PASS outD, inA                PASS A to output                              0010011_D_??_AA
SLA TYPE, inB, inA            Store Local Address                           00001_TT_?_BB_AA
SLI TYPE, inA                 Store Local Implicit                          00010_TT_?_??_AA
SGA TYPE, inB, inA            Store Global Address                          00011_TT_?_BB_AA
SGI TYPE, inA                 Store Global Implicit                         10110_TT_?_??_AA
LLA TYPE, outD, inB           Load Local Address                            00101_TT_D_BB_??
LLI TYPE, outD                Load Local Implicit                           00110_TT_D_??_??
LGA TYPE, outD, inB           Load Global Address                           00111_TT_D_BB_??
LGI TYPE, outD                Load Global Implicit                          01000_TT_D_??_??
LLI_SLA TYPE, outD, inB, inA  Load Local Implicit | Store Local Address     01001_TT_D_BB_AA
LGI_SLA TYPE, outD, inB, inA  Load Global Implicit | Store Local Address    01010_TT_D_BB_AA
LLA_SLI TYPE, outD, inB, inA  Load Local Address | Store Local Implicit     01011_TT_D_BB_AA
LLI_SLI TYPE, outD, inA       Load Local Implicit | Store Local Implicit    01100_TT_D_??_AA
LGA_SLI TYPE, outD, inB, inA  Load Global Address | Store Local Implicit    01101_TT_D_BB_AA
LGI_SLI TYPE, outD, inA       Load Global Implicit | Store Local Implicit   01110_TT_D_??_AA
LLI_SGA TYPE, outD, inB, inA  Load Local Implicit | Store Global Address    01111_TT_D_BB_AA
LGI_SGA TYPE, outD, inB, inA  Load Global Implicit | Store Global Address   10001_TT_D_BB_AA
LLA_SGI TYPE, outD, inB, inA  Load Local Address | Store Global Implicit    10010_TT_D_BB_AA
LLI_SGI TYPE, outD, inA       Load Local Implicit | Store Global Implicit   10011_TT_D_??_AA
LGA_SGI TYPE, outD, inB, inA  Load Global Address | Store Global Implicit   10100_TT_D_BB_AA
LGI_SGI TYPE, outD, inA       Load Global Implicit | Store Global Implicit  10101_TT_D_??_AA
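
As an illustration of the LSU Type 1 format, the following Python sketch packs the fields into a 12-bit instruction word. It is not part of the CGRA tool flow: the function name encode_lsu_type1 is hypothetical, only three opcodes from the table are included, and don't-care bits are assumed to be encoded as 0.

```python
# Subset of the LSU Type 1 opcodes, copied from the table above.
LSU_OPCODES = {
    "SLA": 0b00001,
    "LLA": 0b00101,
    "LGI_SLA": 0b01010,
}
# DataType field values as listed for the LSU.
DATA_TYPES = {"BYTE": 0b00, "HWORD": 0b01, "WORD": 0b10, "DWORD": 0b11}


def encode_lsu_type1(op, dtype, dest=0, in_b=0, in_a=0):
    """Pack Opcode[11:7], DataType[6:5], Destination[4], InputB[3:2], InputA[1:0].

    Unused (don't-care) fields default to 0, which is an assumption.
    """
    if dest not in (0, 1):
        raise ValueError("destination is a single bit")
    if not (0 <= in_b <= 3 and 0 <= in_a <= 3):
        raise ValueError("input selectors are two bits wide")
    return (
        (LSU_OPCODES[op] << 7)
        | (DATA_TYPES[dtype] << 5)
        | (dest << 4)
        | (in_b << 2)
        | in_a
    )


# LLA WORD, out1, in2  ->  00101_10_1_10_??
word = encode_lsu_type1("LLA", "WORD", dest=1, in_b=2)
print(format(word, "012b"))  # 001011011000
```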

Type 2 instructions:

This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:

Opcode [11:6], Register Y [5:2], InputA [1:0]

Input A is the data input.

The SRM operation is used to write the LSUs configuration registers, which can
be found in the LSU description.
The LRM operation can be used to read configuration registers. To ensure
compatibility with the LRM operation used for the RF, this operation always
writes to the highest output port number.

Operation    Description                                          Opcode and parameters
SRM rY, inA  Store data from input A in register YYYY             100000_YYYY_AA
LRM rY       Load data from register YYYY into the output buffer  101000_YYYY_??

Register File
Type 1 instructions:

These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):

Opcode [11:5], Destination [4:4], InputB [3:2], InputA [1:0]

The 7 bit opcode describes the operation that is to be performed. For the RF,
the only instruction of this type is NOP.

Operation Description Opcode and parameters


NOP No OPeration 0000000_?_??_??

Type 3 instructions:

This type of instruction has an opcode of only 2 bits. The parameters
are: register X, register Y, and input A. The format of this instruction is:
Opcode [11:10], Register X [9:6], Register Y [5:2], InputA [1:0]

The RF only has one instruction of this type and it performs a parallel register
read and write on two different addresses. The input (A) specifies the input for
the data that is to be used in the register write part of the operation.

Operation            Description                                                              Opcode and parameters
LRM_SRM rX, rY, inA  Load data from register XXXX | Store data from input A in register YYYY  11_XXXX_YYYY_AA

Type 2 instructions:

This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:

Opcode [11:6], Register Y [5:2], InputA [1:0]

Input A is the data input and is only used in register write operations.

? = don't care

Operation    Description                                          Opcode and parameters
SRM rY, inA  Store data from input A in register YYYY             100000_YYYY_AA
LRM rY       Load data from register YYYY into the output buffer  101000_YYYY_??

Type 4 instructions:

This type of instruction takes an input-specified address and a data input number
as parameters. The opcode is 8 bit in size, hence the instruction format is as
follows:

Opcode [11:4], InputB [3:2] , InputA [1:0]

Input A is the data input and is only used in register write operations. Input B is
the input on which the register address is present.

? = don't care

Operation     Description                                                              Opcode and parameters
SRA inB, inA  Store data from input A in register specified on input B                 10010000_BB_AA
LRA inB       Load data from register specified on input B into the output buffer      10001000_BB_??
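
Because the four RF instruction types use opcodes of different lengths, a decoder has to test the opcode prefixes separately. The Python sketch below shows one way to do this for the encodings listed above. The function name decode_rf and the returned dictionary layout are illustrative assumptions, not the actual instruction decoder.

```python
def decode_rf(instr):
    """Decode a 12-bit Register File instruction into (mnemonic, fields)."""
    if not 0 <= instr < (1 << 12):
        raise ValueError("RF instructions are 12 bits wide")

    in_a = instr & 0b11            # InputA   [1:0]
    in_b = (instr >> 2) & 0b11     # InputB   [3:2]  (Type 4)
    reg_y = (instr >> 2) & 0b1111  # Register Y [5:2] (Type 2 and 3)
    reg_x = (instr >> 6) & 0b1111  # Register X [9:6] (Type 3)

    if instr >> 10 == 0b11:        # Type 3: 2-bit opcode
        return "LRM_SRM", {"rX": reg_x, "rY": reg_y, "inA": in_a}
    if instr >> 6 == 0b100000:     # Type 2: 6-bit opcode
        return "SRM", {"rY": reg_y, "inA": in_a}
    if instr >> 6 == 0b101000:
        return "LRM", {"rY": reg_y}
    if instr >> 4 == 0b10010000:   # Type 4: 8-bit opcode
        return "SRA", {"inB": in_b, "inA": in_a}
    if instr >> 4 == 0b10001000:
        return "LRA", {"inB": in_b}
    if instr >> 5 == 0:            # Type 1: 7-bit opcode, only NOP for the RF
        return "NOP", {}
    raise ValueError("unknown RF instruction")


print(decode_rf(0b11_0011_0101_10))  # ('LRM_SRM', {'rX': 3, 'rY': 5, 'inA': 2})
```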

Arithmetic Logic Unit


Type 1 instructions:

These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):

Opcode [11:5], Destination [4:4], InputB [3:2], InputA [1:0]

The 7 bit opcode describes the operation that is to be performed.

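For the sign-extending (_SE) instructions the upper three opcode bits (TTT in the
table below) form the DataType field, which can be one of the following values: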

110: BYTE
010: HWORD
101: WORD

The destination specifies which output register has to be written.

The input describes which of the inputs of the functional unit will be selected
for a specific operand.

Operation                    Description                                                                          Opcode and parameters
NOP                          No OPeration                                                                         0000000_?_??_??
ADD outD, inB, inA           Add A and B                                                                          0011010_D_BB_AA
ADD_SE TYPE, outD, inB, inA  Add A and B and sign extend the inputs from specified width to datapath width        TTT1010_D_BB_AA
SUB outD, inB, inA           Subtract B from A                                                                    0011011_D_BB_AA
SUB_SE TYPE, outD, inB, inA  Subtract B from A and sign extend the inputs from specified width to datapath width  TTT1011_D_BB_AA
AND outD, inB, inA           Bitwise AND of A and B                                                               0010000_D_BB_AA
NAND outD, inB, inA          Bitwise NAND of A and B                                                              0110000_D_BB_AA
OR outD, inB, inA            Bitwise OR of A and B                                                                0010001_D_BB_AA
NOR outD, inB, inA           Bitwise NOR of A and B                                                               0110001_D_BB_AA
XOR outD, inB, inA           Bitwise XOR of A and B                                                               0010010_D_BB_AA
XNOR outD, inB, inA          Bitwise XNOR of A and B                                                              0110010_D_BB_AA
NEG outD, inA                Bitwise negate of A                                                                  0110011_D_??_AA
CMOV outD, inB, inA          Conditional move (based on flag from compare) on A (flag=F) or B (flag=T)            1110011_D_BB_AA
ECMOV outD, inB, inA         External Conditional move (flag is bit 0 from input 0) on A (flag=F) or B (flag=T)   0000011_D_BB_AA
PASS outD, inA               PASS A to output                                                                     0010011_D_??_AA
PASS_SE TYPE, outD, inA      PASS A to output and sign extend the input from specified width to datapath width    TTT0011_D_??_AA
EQ outD, inB, inA            Test if A and B are equal                                                            1101111_D_BB_AA
NEQ outD, inB, inA           Test if A and B are not equal                                                        1011111_D_BB_AA
LTU outD, inB, inA           Test if A less than B (unsigned)                                                     0011111_D_BB_AA
LTS outD, inB, inA           Test if A less than B (signed)                                                       1001111_D_BB_AA
GEU outD, inB, inA           Test if A greater or equal to B (unsigned)                                           0111111_D_BB_AA
GES outD, inB, inA           Test if A greater or equal to B (signed)                                             0101111_D_BB_AA
SHLL1 outD, inA              Logic shift left (1 bit) on A                                                        0010100_D_??_AA
SHLL4 outD, inA              Logic shift left (4 bit) on A                                                        0010101_D_??_AA
SHRL1 outD, inA              Logic shift right (1 bit) on A                                                       0010110_D_??_AA
SHRL4 outD, inA              Logic shift right (4 bit) on A                                                       0010111_D_??_AA
SHRA1 outD, inA              Arithmetic shift right (1 bit) on A                                                  0000110_D_??_AA
SHRA4 outD, inA              Arithmetic shift right (4 bit) on A                                                  0000111_D_??_AA
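
The effect of the sign-extending variants can be illustrated with a small behavioural model of ADD_SE. This is a sketch under assumptions: the names sign_extend and add_se are hypothetical, a 32-bit datapath is used as the default, and the result is assumed to wrap modulo the datapath width.

```python
# Width in bits of each data type accepted by the _SE instructions.
TYPE_WIDTH = {"BYTE": 8, "HWORD": 16, "WORD": 32}


def sign_extend(value, from_bits, to_bits):
    """Sign extend the low from_bits of value to a to_bits-wide bit pattern."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):   # sign bit set: value is negative
        value -= 1 << from_bits
    return value & ((1 << to_bits) - 1)  # two's complement pattern


def add_se(a, b, dtype, d_width=32):
    """Model of ADD_SE: sign extend both inputs, then add (wrapping, assumed)."""
    width = TYPE_WIDTH[dtype]
    total = sign_extend(a, width, d_width) + sign_extend(b, width, d_width)
    return total & ((1 << d_width) - 1)


print(hex(sign_extend(0x80, 8, 32)))     # 0xffffff80
print(hex(add_se(0xFF, 0x01, "BYTE")))   # 0x0   (-1 + 1)
print(hex(add_se(0xFF, 0x01, "HWORD")))  # 0x100 (255 + 1)
```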

Immediate Unit:
Type 5 instructions:

These instructions have the following format:

Opcode [N-1:N-1], IMmediate [N-2:0]

The single-bit opcode specifies whether the immediate value has to be written to the
output of the immediate unit.
Since the Immediate Unit (IU) is a special version of an instruction decoder (ID),
the width of these instructions can differ from the instruction width of
the other IDs.

The size of the IMmediate (M) field scales with the instruction size (e.g. a 9-bit
instruction will have an 8-bit data field).

Operation  Description                      Opcode and parameters
NOPI       No OPeration                     0_{N-1{?}}
IMM value  write IMMediate to data network  1_{N-1{M}}

Accumulate & Branch Unit


Type 1 instructions:

These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):

Opcode [11:5], Destination [4:4], InputB [3:2], InputA [1:0]

The 7 bit opcode describes the operation that is to be performed.

Operation     Description                                                                                          Opcode and parameters
NOP           No OPeration                                                                                         0000000_?_??_??
JR inB        Jump Relative, input B (signed) is added to the program counter                                      1100000_?_BB_??
JA inB        Jump Absolute, input B (unsigned) overwrites the program counter                                     1101000_?_BB_??
BCR inB, inA  Branch Conditional Relative, input B (signed) is added to the program counter when condition A != 0  1110000_?_BB_AA
BCA inB, inA  Branch Conditional Absolute, input B (unsigned) overwrites the program counter when condition A != 0  1111000_?_BB_AA

Type 2 instructions:
This type of instruction takes a 4-bit register address and input number as
parameters. The opcode is 6 bit in size, hence the instruction format is as
follows:

Opcode [11:6], Register Y [5:2], InputA [1:0]

Input A is the data input and is only used in register write operations.

? = don't care

Operation     Description                                          Opcode and parameters
SRM rY, inA   Store data from input A in register YYYY             100000_YYYY_AA
LRM rY        Load data from register YYYY into the output buffer  101000_YYYY_??
ACCU rY, inA  Unsigned accumulate on register YYYY, with input A   110010_YYYY_AA
ACCS rY, inA  Signed accumulate on register YYYY, with input A     110011_YYYY_AA

Type 6 instructions:

This type of instruction takes a 6-bit immediate value and input number as
parameters. The opcode is 4 bit in size, hence the instruction format is as
follows:

Opcode [11:8], IMmediate [7:2], InputA [1:0]

Input A is the input specifying the branch condition and is only used in
conditional branch operations.

? = don't care

Operation        Description                                                                                                         Opcode and parameters
JRI value        Jump Relative Immediate, the immediate M (signed) is added to the program counter                                   0001_MMMMMM_??
JAI value        Jump Absolute Immediate, the immediate M (unsigned) overwrites the program counter                                  0011_MMMMMM_??
BCRI value, inA  Branch Conditional Relative Immediate, the immediate M (signed) is added to the program counter when condition A != 0  0101_MMMMMM_AA
BCAI value, inA  Branch Conditional Absolute Immediate, the immediate M (unsigned) overwrites the program counter when condition A != 0  0111_MMMMMM_AA
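
The Type 6 format leaves 6 bits for the immediate, so relative branches can reach -32 to +31 and absolute branches 0 to 63. The following Python sketch encodes the four instructions, assuming two's complement for the signed immediate and 0 for don't-care bits. The function name encode_type6 is illustrative and not taken from the tool flow.

```python
# Type 6 opcodes of the ABU, copied from the table above.
TYPE6_OPCODES = {"JRI": 0b0001, "JAI": 0b0011, "BCRI": 0b0101, "BCAI": 0b0111}


def encode_type6(op, value, in_a=0):
    """Pack Opcode[11:8], IMmediate[7:2], InputA[1:0] into a 12-bit word."""
    relative = op in ("JRI", "BCRI")
    # Relative immediates are signed, absolute immediates are unsigned.
    low, high = (-32, 31) if relative else (0, 63)
    if not low <= value <= high:
        raise ValueError("immediate does not fit in 6 bits")
    if not 0 <= in_a <= 3:
        raise ValueError("input selector is two bits wide")
    # Masking with 0x3F yields the 6-bit two's complement pattern
    # (assumed encoding for negative offsets).
    return (TYPE6_OPCODES[op] << 8) | ((value & 0x3F) << 2) | in_a


print(format(encode_type6("JRI", -2), "012b"))            # 000111111000
print(format(encode_type6("BCAI", 5, in_a=1), "012b"))    # 011100010101
```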

Multiplier Unit
Type 1 instructions:

These instructions have the following format (bit widths for destination and
inputs might change in the future as we decide to add or remove in- or outputs
from the FUs):

Opcode [11:5], Destination [4:4], InputB [3:2], InputA [1:0]

The 7 bit opcode describes the operation that is to be performed.

The destination specifies which output register has to be written.

The input describes which of the inputs of the functional unit will be selected
for a specific operand.

Operation                  Description                                                                                                        Opcode and parameters
NOP                        No OPeration                                                                                                       0000000_?_??_??
MULLU outD, inB, inA       Unsigned multiplication of A and B, output is the lower part                                                       100_1000_D_BB_AA
MULLU_SH8 outD, inB, inA   Unsigned multiplication of A and B, output is the lower part, result shifted right by 8 bit.                       100_1001_D_BB_AA
MULLU_SH16 outD, inB, inA  Unsigned multiplication of A and B, output is the lower part, result shifted right by 16 bit.                      100_1010_D_BB_AA
MULLU_SH24 outD, inB, inA  Unsigned multiplication of A and B, output is the lower part, result shifted right by 24 bit.                      100_1011_D_BB_AA
MULLS outD, inB, inA       Signed multiplication of A and B, output is the lower part                                                         101_1000_D_BB_AA
MULLS_SH8 outD, inB, inA   Signed multiplication of A and B, output is the lower part, result shifted right by 8 bit.                         101_1001_D_BB_AA
MULLS_SH16 outD, inB, inA  Signed multiplication of A and B, output is the lower part, result shifted right by 16 bit.                        101_1010_D_BB_AA
MULLS_SH24 outD, inB, inA  Signed multiplication of A and B, output is the lower part, result shifted right by 24 bit.                        101_1011_D_BB_AA
MULU outD, inB, inA        Unsigned multiplication of A and B, complete multiplier output on port D and D+1                                   110_1000_D_BB_AA
MULU_SH8 outD, inB, inA    Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 8 bit.   110_1001_D_BB_AA
MULU_SH16 outD, inB, inA   Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 16 bit.  110_1010_D_BB_AA
MULU_SH24 outD, inB, inA   Unsigned multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 24 bit.  110_1011_D_BB_AA
MULS outD, inB, inA        Signed multiplication of A and B, complete multiplier output on port D and D+1                                     111_1000_D_BB_AA
MULS_SH8 outD, inB, inA    Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 8 bit.     111_1001_D_BB_AA
MULS_SH16 outD, inB, inA   Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 16 bit.    111_1010_D_BB_AA
MULS_SH24 outD, inB, inA   Signed multiplication of A and B, complete multiplier output on port D and D+1, result shifted right by 24 bit.    111_1011_D_BB_AA
LH outD                    Load contents of the high part of the multiplier output register to output D (does not do a multiplication)        010_0000_D_??_??
Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) of the CGRA can take a number of inputs (by default configured to 4 inputs) and
perform arithmetic and logic operations on two of these inputs. The inputs are specified by the instruction and
selected by the two 4-input multiplexers in the top of the figure below.
The operations the ALU can perform are divided over three functional groups:
Shift operations
Logic operations
Arithmetic operations
Additionally, the output of the arithmetic operations can be used for comparing two operands. The result of the
comparison is stored in the flag register and can be used for CMOV operations. The comparison output can also
be routed to the datapath, which allows transmitting the flag to other ALUs in the pipeline or using it as a result
(e.g. binarization). When routed to the datapath the flag is extended from 1 bit to the width of the datapath. This
means that flag=0 will result in a value of 0 on the datapath while flag=1 will result in 2^D_WIDTH - 1. The flag
can also be inverted such that not only LT (Less Than) and EQ (Equals) are available but NEQ (Not Equal) and
GE (Greater than or Equal) as well.
The output of the logic module can be inverted, which results in the operations NAND, NOR, XNOR and Invert.
The destination operand specifies to which register the output has to be written. Output 0 is a buffered or
unbuffered output (depending on configuration Bit[0], 0=unbuffered and 1=buffered), whereas the other outputs
are buffered. The unbuffered output could be used to construct single-cycle complex operations.
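The flag extension described above (flag=1 becomes 2^D_WIDTH - 1, i.e. all ones) is what makes a comparison directly usable for binarization. A minimal Python sketch, assuming an 8-bit datapath and a greater-or-equal comparison; the function names flag_to_datapath and binarize are hypothetical:

```python
def flag_to_datapath(flag, d_width):
    """Extend a 1-bit flag to the datapath width: 0 -> 0, 1 -> 2**d_width - 1."""
    return (1 << d_width) - 1 if flag else 0


def binarize(pixels, threshold, d_width=8):
    """Binarize pixel values with an unsigned greater-or-equal comparison."""
    return [flag_to_datapath(p >= threshold, d_width) for p in pixels]


print(binarize([10, 200, 128], threshold=128))  # [0, 255, 255]
```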
Multiplier Unit
The Multiplier Unit of the CGRA can take a number of inputs (by default configured to 4 inputs) and perform
signed and unsigned multiplications on two of these inputs. The inputs are specified by the instruction and
selected by the two 4-input multiplexers in the top of the figure below.
Since the output of the multiplier is 2 times the width of the input data, the result does not fit on one output port.
The multiplier supports two methods for reading the full output:
A normal multiplication outputs the lower half of the result directly on the chosen output port. A second
instruction then reads the higher part of the result to a specified output port.
The complete result is written to two ports at once; the outputs will be dest (lower part) and dest+1 (higher
part).
The destination operand specifies to which register the output has to be written.
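How the double-width product splits into a lower and a higher part can be modelled as follows. This Python sketch covers only the unshifted multiplications; the shifted (_SH8/_SH16/_SH24) variants are deliberately left out because the text does not specify their exact semantics. The function name multiply and the 32-bit default width are assumptions.

```python
def multiply(a, b, d_width=32, signed=False):
    """Return (low, high): the two d_width-bit halves of the full product."""
    mask = (1 << d_width) - 1
    a &= mask
    b &= mask
    if signed:
        # Reinterpret the bit patterns as two's complement numbers.
        if a >> (d_width - 1):
            a -= 1 << d_width
        if b >> (d_width - 1):
            b -= 1 << d_width
    product = (a * b) & ((1 << (2 * d_width)) - 1)  # 2*d_width-bit pattern
    low = product & mask     # appears on port dest
    high = product >> d_width  # appears on port dest+1, or is read later with LH
    return low, high


print([hex(x) for x in multiply(0xFFFFFFFF, 2)])               # ['0xfffffffe', '0x1']
print([hex(x) for x in multiply(0xFFFFFFFF, 2, signed=True)])  # ['0xfffffffe', '0xffffffff']
```

The same pair of inputs yields a different higher part in signed and unsigned mode, while the lower part is identical.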
Register File Unit
The Register File unit (RF) has 4 inputs, of which one can be selected for addressing and one as data input. In
contrast to other functional units the register file does not have multiple addressable outputs. Instead, the highest
output number (by default output 1) can be addressed in two ways:
Through the instruction, as an immediate.
Through operand B (from the data network).
The other outputs (0 to N-2) are directly connected to the registers corresponding to their output number (e.g. r0 ->
out0). This allows for reading multiple registers from the register file at once without the cost of requiring extra
read ports.
In order to store a value into a register, the data has to be available on operand A. The register where the data
has to be written is specified by either an immediate in the instruction or via the data network.
It is possible to perform a read and write on different registers at once (using immediates for both operations).
Note that the outputs of the RF are not registered, therefore the result of a load is available to other units within
the same cycle.
Load Store Unit
The Load Store Unit (LSU) has (by default) 4 inputs, of which one can be selected for addressing and one as
data input. All outputs are buffered and one of the (by default 2) outputs can be selected as the target register.
The LSU supports operations on both local (private to each LSU) and global (shared between LSUs) memory. Currently
all memories are considered to be single cycle, but in the future an arbiter will be inserted between the global
memory and the LSUs connected to it. This will also mean that we will have to implement some kind of stall
signal.
On both the global and local memory the following operations can be performed:
Load
Store
Load implicit
Store implicit

Data Types
The LSU supports loading and storing multiple data types:
DWORD (64 bit)
WORD (32 bit)
HWORD (16 bit)
BYTE (8 bit)
However, the maximum supported width is equal to the datapath width (e.g. a 32-bit CGRA supports BYTE,
HWORD and WORD).

Load and store
These operations take all their required information (address and data) from the inputs. Operand A is used for
data and Operand B is used for addressing. Currently we consider the maximum address space of the local
memory to be 16 bit and the address space of the global memory to be 32 bit. This means that e.g. a CGRA with
an 8-bit datapath cannot directly address all memory (both local and global) with addresses supplied from the
data network. To overcome this, the LSU contains several configuration registers, which also contain registers
that hold the higher bytes/words of the addresses. The contents of these registers, together with the address
supplied on the input, will form the final memory address. With a 16-bit CGRA all local memory can be directly
addressed and with a 32-bit CGRA all global memory can be directly addressed as well.
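
The way the higher address bits and the input address combine can be sketched as a simple concatenation. This Python model is an assumption-laden illustration: the function name compose_address is hypothetical, the high part is passed in as a single integer rather than read from specific configuration registers, and the exact register-to-byte mapping of the hardware is not modelled.

```python
def compose_address(high_part, input_addr, d_width, addr_bits):
    """Form a memory address from configuration bits and the input address.

    high_part  -- upper address bits held in the LSU configuration registers
    input_addr -- address supplied on operand B (d_width bits wide)
    d_width    -- datapath width in bits
    addr_bits  -- address space width (16 for local, 32 for global memory)
    """
    input_addr &= (1 << d_width) - 1
    addr_mask = (1 << addr_bits) - 1
    if d_width >= addr_bits:
        # The datapath is wide enough: the input addresses memory directly.
        return input_addr & addr_mask
    return ((high_part << d_width) | input_addr) & addr_mask


print(hex(compose_address(0x12, 0x34, d_width=8, addr_bits=16)))        # 0x1234
print(hex(compose_address(0xDEAD, 0xBEEF, d_width=16, addr_bits=32)))   # 0xdeadbeef
print(hex(compose_address(0x99, 0x1234, d_width=16, addr_bits=16)))     # 0x1234
```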

Implicit load and store
These operations use the configuration registers to implement some automatic address generation. The
configuration registers allow specifying the start address and the stride. Each time an implicit operation is
performed the address is incremented by the stride. This is done separately for loads and stores and for global and
local accesses.
For the global memory the start address and the bytes for memory address extension (for 8-bit and 16-bit
CGRAs) are shared, meaning that this address will increment with each implicit operation. For local memory the
start address and memory address extension are separated.
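
The automatic address generation can be modelled as a counter that advances by the stride on every implicit access. The Python sketch below is illustrative only: the class name ImplicitAddressGenerator is hypothetical, and it assumes that the current address is used first and incremented afterwards, and that the address wraps at the address-space width. The text does not state either detail.

```python
class ImplicitAddressGenerator:
    """Model of one implicit address counter (e.g. for local loads)."""

    def __init__(self, start, stride, addr_bits=16):
        self.address = start
        self.stride = stride
        self.mask = (1 << addr_bits) - 1

    def next(self):
        """Return the address for this access, then advance by the stride."""
        current = self.address
        self.address = (self.address + self.stride) & self.mask
        return current


# Separate counters for loads and stores, as described above.
loads = ImplicitAddressGenerator(start=0x100, stride=4)
stores = ImplicitAddressGenerator(start=0x200, stride=1)
print([hex(loads.next()) for _ in range(3)])   # ['0x100', '0x104', '0x108']
print([hex(stores.next()) for _ in range(3)])  # ['0x200', '0x201', '0x202']
```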

Dual issue
Some operations that do not conflict with respect to input selection can be executed in parallel (e.g. a store on
the local memory and an implicit load from the global memory). This allows for a higher memory bandwidth and for
very efficient memory copy or shuffling. The operations that can be performed in parallel have a special
instruction facilitating this (e.g. LGI_SLA), where the underscore denotes the simultaneous execution.

LSU Configuration Registers
In order to overcome limitations related to datapath width and the width required to address the local and global
memories, the Load Store Unit (LSU) has some configuration registers. Additionally these registers can be used for
implicit addressing (automatically incrementing memory addresses), which allows parallel load and store operations.
The mapping of these configuration registers is as follows:

Register  Description
0         Start address for implicit local load operations (high byte in 8-bit datapath, not used for width larger than 8)
1         Start address for implicit local load operations (low byte in 8-bit datapath)
2         Stride for implicit (local and global) load operations
3         Start address for implicit local store operations (high byte in 8-bit datapath, not used for width larger than 8)
4         Start address for implicit local store operations (low byte in 8-bit datapath)
5         Stride for implicit (local and global) store operations
6         Global load address (also implicit counter) (byte 3 for W=8, unused for W=16, unused for W=32 and higher)
7         Global load address (byte 2 for W=8, unused for W=16, unused for W=32 and higher)
8         Global load address (byte 1 for W=8, word 1 for W=16, unused for W=32 and higher)
9         Global load address (byte 0 for W=8, word 0 for W=16, full for W=32 and higher)
10        Global store address (also implicit counter) (byte 3 for W=8, unused for W=16, unused for W=32 and higher)
11        Global store address (byte 2 for W=8, unused for W=16, unused for W=32 and higher)
12        Global store address (byte 1 for W=8, word 1 for W=16, unused for W=32 and higher)
13        Global store address (byte 0 for W=8, word 0 for W=16, full for W=32 and higher)
14        Local load address high byte (only for W=8)
15        Local store address high byte (only for W=8)
Accumulate & Branch Unit
The Accumulate and Branch Unit (ABU) can be configured to perform two tasks, as the name implies.
It can be used as a multi-register accumulator.
It can be used to calculate program counters and hence function as a branch unit.
Selection between these two functionalities is made at configuration time by setting (branch functionality) or
clearing (accumulate functionality) the configuration bit.

Accumulate mode:
The width of the accumulation registers is 16 bit in 8-bit and 16-bit modes and 32 bit in 32-bit mode. The accumulate
output of the selected register is available at the highest port number(s).
In 8-bit mode two outputs are used to output one of the 16-bit accumulation registers; 16 accumulate
registers are available.
In 16-bit mode one output is used to output the selected accumulate register; 16 accumulate registers are
available. Any additional outputs are connected directly to the register number corresponding to the port
number.
In 32-bit mode one output is used to output the selected accumulate register. A pair of 16-bit registers is
concatenated in this mode, therefore 8 32-bit accumulate registers are available. Any additional outputs are
connected directly to the register number corresponding to the port number.

Branch mode:
The program counter is available at the highest port number(s) and uses a 16-bit counter. In 16-bit and 32-bit
mode the port N-1 can be used to load values from any of the 16 internal 16-bit registers. In 8-bit mode this option
is not available due to port number limitations.
The branch unit supports absolute/relative conditional/unconditional jumps. When the branch only has to jump a
limited number of instructions it is possible to use an immediate branch instruction, otherwise the branch target
has to be present on one of the inputs. Conditions always have to be present on the inputs.
In 8-bit mode two outputs are used to output the 16-bit program counter. No register reading is supported.
Any additional outputs are connected directly to the register number corresponding to the port number.
In 16-bit and 32-bit mode the highest output port produces the program counter, port number N-1 reads
the selected register. Any additional outputs are connected directly to the register number corresponding to
the port number.
Not yet implemented: The additional registers will, in the future, be used for configuration of hardware loop
support.
Immediate Unit
The Immediate Unit (IU) is an Instruction Decoder that does not perform any other action than to put a value,
encoded in the instruction, on the data network.
If the width of the immediate part of the instruction is equal to or larger than that of the data network, the immediate
will be directly available on the network. If the width of the immediate part of the instruction is smaller than the
data network width, the immediate is built from several immediate instructions. Each time an immediate
instruction is executed the data is shifted by the number of bits available in each immediate instruction.
For example:
Our instruction width is 9 bit, therefore the size of the immediate is 8 bit (9 minus one write-enable bit).
We assume the data network to be 32 bit, meaning we need 4 loads to fill the output of the immediate unit
with the 32-bit value.
If we wanted to load the immediate value 0xAABBCCDD, we would execute the following instructions:
imm 0xAA (IU output value: 0x??????AA)
imm 0xBB (IU output value: 0x????AABB)
imm 0xCC (IU output value: 0x??AABBCC)
imm 0xDD (IU output value: 0xAABBCCDD)
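The build-up shown in the example can be reproduced with a short Python model of the shift register. The function name iu_shift_in is hypothetical, and the initial output value, which is unknown in hardware, is assumed to be 0 here.

```python
def iu_shift_in(current, imm, imm_bits=8, d_width=32):
    """Shift the output left by imm_bits and insert the new immediate."""
    imm &= (1 << imm_bits) - 1
    return ((current << imm_bits) | imm) & ((1 << d_width) - 1)


value = 0  # unknown (0x????????) in hardware; 0 is assumed for the model
for imm in (0xAA, 0xBB, 0xCC, 0xDD):
    value = iu_shift_in(value, imm)
    print(hex(value))
# prints 0xaa, 0xaabb, 0xaabbcc, 0xaabbccdd
```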
The IU can be configured (with a parameter in Verilog) to insert a bubble or not. This allows the total number of
pipeline stages from the IF to the data arriving on the data network to be equal for both the normal ID and FUs
and the IU. The number of pipeline stages is 2 without bubble insertion and 3 with bubble insertion.
