Team Members:
Mohammad Noman (CMPE)
Advisor:
Hanho Lee, Ph.D.
Assistant Professor, Electrical and Computer Engineering
University of Connecticut
lee@engr.uconn.edu
Table of Contents
Introduction
Reasons for Low Power
Architecture
    Instruction Fetch
    Instruction Decode
    Execute
    Memory
    Write Back
    Data Forward Unit
    Hazard Detection Unit
Instruction Set
Power Reduction Method
FPGA Design Flow
Testing and Verification Method
Synthesis Results
Conclusion
References
Appendix A: Detailed Architecture
Appendix B: Instruction Set
Appendix C: Simulation Waveforms
Appendix D: Synthesis Report
Appendix E: VHDL Code for Hex File Based Instruction Memory
Appendix F: FPGA Floorplan
Introduction:
An embedded processor is a processor that has been “embedded” into a device. It can be
programmed to interact with different pieces of hardware. In terms of performance, an embedded
processor can outperform a microcontroller but falls short of a general-purpose
microprocessor.
Low-power embedded processors are used in a wide variety of applications including cars,
phones, digital cameras, printers, and other such devices. The reason for their wide use is that
they are small; therefore, they do not take up much die area and are cost effective to fabricate.
Also, embedded processors are verified, eliminating the need to spend additional engineering
man-hours tracking down hardware flaws. Another great advantage in using embedded
processors is that they run software, which enables one to deal with changing specifications as
various system requirements change.
Low-power processors are the key to the realization of portable electronic devices, in which
power consumption is an important factor. Low power consumption helps to reduce heat
dissipation, lengthen battery life, and increase device reliability. In this project we will
implement a 16-bit RISC-type embedded processor that will support a pre-defined instruction set.
This processor will follow a RISC architecture because it allows for a simpler implementation of
our design. There are several power-saving techniques that can be used in our design; however,
the main focus of our processor’s low-power architecture will be clock gating.
Manufacturing issues such as packaging costs also come into play when considering the advantages
of a low-power design. Heat dissipation is one of the major factors considered when a chip
is packaged. A low-power processor design greatly reduces heat dissipation,
which in turn reduces the packaging costs.
Architecture:
The overall diagram of the processor architecture is shown in Appendix A. As seen from the
diagram, the architecture consists of a five-stage pipeline. The stages are Instruction Fetch,
Instruction Decode, Execute, Memory, and Write Back. Also, there is a Data Forward and
Hazard Detection unit to maintain proper data flow through the pipeline stages. In an effort to
reduce power consumption we have employed clock gating and signal gating throughout the
design wherever applicable. Each pipeline stage, along with the Data Forward and
Hazard Detection units, is described in detail below.
Instruction Fetch:
This stage consists of the Program Counter, Instruction Memory, and the Branch Decide Unit.
Program Counter: The Program Counter (PC) contains the address of the instruction that will
be fetched from the Instruction Memory during the next clock cycle. Normally the PC is
incremented by one during each clock cycle unless a branch instruction is executed. When a
branch instruction is encountered, the PC is incremented/decremented by the amount indicated
by the branch offset. The PC Write input of the PC serves as an enable signal. When the PC Write
signal is high, the contents of the PC are updated during the next clock cycle; when it is
low, the contents of the PC remain unchanged.
Instruction Memory: The Instruction Memory contains the instructions that are executed by
the processor. The input to this unit is a 16-bit address from the Program Counter and the output
is a 16-bit instruction word. This module supports up to 64 K words of memory, where each
word is 16 bits long.
Branch Decide Unit: The Branch Decide Unit is responsible for determining whether a branch
is to take place or not based on the 2-bit Branch signal from the Control Unit and the Zero flag
from the Arithmetic Logic Unit (ALU). The output of this unit is a 1-bit value which is high
when a branch is to take place, and otherwise it is low. This output controls a multiplexer which
in turn controls whether the PC gets incremented by one or by the amount indicated by the
branch offset.
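The interaction between the Program Counter, the Branch Decide Unit, and the PC multiplexer can be sketched as a small behavioral model in Python. This is only an illustration under stated assumptions: the report does not give the encoding of the 2-bit Branch signal, so the four constants below are hypothetical, and the function names are ours rather than the names used in the VHDL design.

```python
# Hypothetical encoding of the 2-bit Branch signal; the report does not specify one.
NO_BRANCH, BRANCH_IF_ZERO, BRANCH_IF_NOT_ZERO, BRANCH_ALWAYS = 0, 1, 2, 3


def branch_decide(branch, zero):
    """Return 1 when a branch is to take place, otherwise 0."""
    if branch == BRANCH_ALWAYS:
        return 1
    if branch == BRANCH_IF_ZERO:
        return int(zero == 1)
    if branch == BRANCH_IF_NOT_ZERO:
        return int(zero == 0)
    return 0


def next_pc(pc, pc_write, taken, offset):
    """Return the PC value for the next clock cycle.

    `offset` is the branch offset as a signed integer (already sign extended).
    """
    if not pc_write:                  # PC Write low: hold the current value
        return pc & 0xFFFF
    if taken:                         # multiplexer selects the branch target
        return (pc + offset) & 0xFFFF
    return (pc + 1) & 0xFFFF          # normal sequential fetch


# Example: a taken backward branch of two words from address 10.
print(next_pc(10, 1, branch_decide(BRANCH_ALWAYS, 0), -2))
```

The 16-bit mask models the width of the PC, so the address wraps around at 64 K words.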
Instruction Decode:
This stage consists of the Control Unit, Register File, Y-Register, and the Sign Extend Unit.
Control Unit: The control unit generates all the control signals needed to coordinate
the components of the processor. The input to this unit is the 4-bit opcode
field of the instruction word. This unit generates signals that control all the read and write
operations of the Register File, Y-Register, and the Data Memory. It is also responsible for
generating signals that decide when to use the multiplier and when to use the ALU, and it also
generates appropriate branch flags that are used by the Branch Decide unit. In addition, this unit
provides clock gating signals for the ALU Control and the Branch Adder module.
Register File: This is a two-port register file that can perform two simultaneous reads and one
write operation. It contains sixteen 16-bit general-purpose registers. The registers are named R0
through R15. R0 is a special register that always contains the value zero, and any write request
to this register is ignored. When the Reg_Write signal is high, a write operation is
performed to the register indicated by the write address; otherwise the values contained in the
registers indicated by the read addresses are output.
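A minimal behavioral sketch of the register file described above is given below in Python. It is not the VHDL from the design; the class and method names are illustrative assumptions, and the model only captures the two read ports, the single write port, and the hardwired zero in R0.

```python
class RegisterFile:
    """Sixteen 16-bit registers with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 16

    def read(self, addr_a, addr_b):
        """Return the values of the two registers selected by the read addresses."""
        return self.regs[addr_a & 0xF], self.regs[addr_b & 0xF]

    def write(self, reg_write, addr, data):
        """Write `data` when Reg_Write is high; writes to R0 are ignored."""
        addr &= 0xF
        if reg_write and addr != 0:
            self.regs[addr] = data & 0xFFFF


rf = RegisterFile()
rf.write(1, 3, 0x1234)   # R3 <- 0x1234
rf.write(1, 0, 0xFFFF)   # ignored: R0 always reads as zero
print(rf.read(3, 0))
```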
Y-Register: The Y-Register is a special 16-bit register that is used to store the upper 16 bits
(bits 31-16) of the result generated by the multiplier. When the Y_Write signal is high, a new value
is written to this register; otherwise the currently stored value is output.
Sign Extend Unit: The input to this unit is the 8-bit immediate value provided by all the
immediate type instructions. This unit sign extends the 8-bit value to a 16-bit signed value.
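Sign extension copies bit 7 of the immediate into bits 15-8 of the result. A short Python sketch, with a function name chosen for illustration only:

```python
def sign_extend_8_to_16(imm8):
    """Sign extend an 8-bit two's complement value to a 16-bit pattern."""
    imm8 &= 0xFF
    if imm8 & 0x80:            # negative: replicate the sign bit into bits 15-8
        return imm8 | 0xFF00
    return imm8                # non-negative: upper bits stay zero


print(hex(sign_extend_8_to_16(0x0F)))
print(hex(sign_extend_8_to_16(0xF0)))
```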
Execute:
This stage consists of the Branch Adder, Multiplier, Arithmetic Logic Unit (ALU), and the ALU
Control Unit.
Branch Adder: The branch adder adds the 12-bit signed branch offset to the current value of
the PC to calculate the branch target. The 12-bit offset is provided by the branch instruction.
The output of this unit goes to the PC control multiplexer which updates the PC with this value
only when a branch is to be taken.
Multiplier:
The high-level block diagram of the multiplier is shown in diagram 1 below. It consists of four
distinct components. They are the Booth Encoder, Partial Product Generator, Carry Save Adder,
and the Carry Lookahead Adder. Our multiplier architecture employs two main techniques to
increase the speed of the multiplication process. The first technique is to reduce the number of
partial products, and the second is to increase the speed at which the partial products are added.
The individual components shown in diagram 1 are explained in detail below.
Booth Encoder: This module encodes the 16-bit multiplier using the radix-4 Booth algorithm.
Radix-4 encoding halves the total number of multiplier digits, which means
in this case the number of multiplier digits is reduced from 16 to 8. The algorithm arranges the
original multiplier into groups of three consecutive bits, where the outermost bit in each group is
shared with the adjacent group. Each of these groups of three bits then
corresponds to one of the digits from the set {2, 1, 0, -1, -2}. Each encoder produces a 3-bit
output in which the first bit represents a magnitude of 1 and the second bit represents a magnitude of 2.
The third and final bit indicates whether the magnitude selected by the first or second bit is negative. Since
there are 16 input bits, there are a total of 8 Booth encoder modules in the overall multiplier
architecture. The way the outputs are determined is shown in table 1 below.
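The grouping rule can be illustrated with a short Python model. It uses the standard radix-4 Booth recoding, digit = -2*b[2i+1] + b[2i] + b[2i-1] with b[-1] = 0. The mapping of a digit to the three output bits follows the prose description above; how the hardware encodes a zero digit (for example whether the sign bit is set for the group 111) is an assumption here, and the function names are ours.

```python
def booth_radix4_digits(multiplier):
    """Return the 8 radix-4 Booth digits of a 16-bit multiplier, least significant first."""
    m = (multiplier & 0xFFFF) << 1           # append the implicit bit b[-1] = 0
    digits = []
    for i in range(8):
        group = (m >> (2 * i)) & 0b111       # bits b[2i+1], b[2i], b[2i-1]
        b_hi, b_mid, b_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-2 * b_hi + b_mid + b_lo)
    return digits


def encode_digit(digit):
    """Map a Booth digit to (one, two, neg) control bits; zero is encoded as (0, 0, 0)."""
    return int(abs(digit) == 1), int(abs(digit) == 2), int(digit < 0)


digits = booth_radix4_digits(7)
print(digits)                                 # 7 = -1 + 2*4
print([encode_digit(d) for d in digits[:2]])
```

Weighting digit i by 4^i and summing recovers the multiplier as a signed 16-bit number, which is a convenient check on the recoding.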
Partial Product Generator (PPG): The output from the Booth encoder is used in this module
to generate the partial products. Since there are eight Booth encoders there will be a total of
eight partial products. The multiplication by two is implemented by shifting the multiplicand left
one bit and the negation is implemented by taking the two’s complement of the multiplicand.
The architecture of the partial product generator is shown in diagram 2.
Each row of the diagram corresponds to one partial product. Even though the diagram does not
show it, there are eight such rows corresponding to eight partial products. Also, each partial
product is shifted two bits to the left relative to the partial product above it to account for the
radix 4 Booth encoding of the multiplier.
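As an illustration of how the Booth digits select and position the partial products, the following Python sketch forms each partial product from the multiplicand and shifts row i left by 2*i bits. It works on Python integers rather than fixed-width bit vectors, so sign extension of the rows is implicit; that simplification and the function names are assumptions made for the example.

```python
def to_signed16(x):
    """Interpret a 16-bit pattern as a two's complement integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def partial_products(multiplicand, digits):
    """Return one partial product per Booth digit (least significant digit first)."""
    a = to_signed16(multiplicand)
    products = []
    for i, d in enumerate(digits):
        if abs(d) == 2:
            pp = a << 1            # multiply by two: shift left one bit
        elif abs(d) == 1:
            pp = a
        else:
            pp = 0
        if d < 0:
            pp = -pp               # negation (two's complement in hardware)
        products.append(pp << (2 * i))   # each row is shifted two more bits left
    return products


# Booth digits of 7 are [-1, 2, 0, 0, 0, 0, 0, 0]; the rows sum to 5 * 7.
rows = partial_products(5, [-1, 2, 0, 0, 0, 0, 0, 0])
print(rows, sum(rows))
```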
Wallace Tree: This module is responsible for adding the partial products that were generated in
the PPG module. This module uses 3-to-2 carry save adders (CSA) to implement the Wallace
Tree. The individual CSAs are nothing more than full adders, with the exception that the carry-ins
and the carry-outs are handled in a special way. Each column of bits in the partial
products is added using this method. Diagram 3 below shows how this method works for adding
8 bits. The carry-outs generated in each stage of addition are transferred to the Wallace Tree of
the column of partial product bits on the left, and the carry-ins come from the column to the
right. The advantage of using a Wallace Tree structure for addition is that, for adding eight bits,
the result is available after only four full adder delays. If the same addition were performed
using a ripple carry adder, it would require seven full adder delays. Therefore, although
the structure of the adder is somewhat more complicated, it greatly increases the speed of addition.
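The reduction idea can be sketched at word level in Python: a 3-to-2 carry save adder turns three operands into a sum word and a carry word without propagating carries, and layers of such adders reduce many operands to two. This is a simplified model, not the column-wise structure of Diagram 3; it assumes at least two non-negative operands, and the function names are illustrative.

```python
def csa(a, b, c):
    """3-to-2 carry save adder: returns (sum, carry) with sum + carry == a + b + c."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def wallace_reduce(operands):
    """Reduce operands to two words with layers of CSAs; returns (sum, carry, levels)."""
    rows = list(operands)
    levels = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])   # leftover rows pass through
        rows = nxt
        levels += 1
    return rows[0], rows[1], levels


s, c, levels = wallace_reduce([3, 5, 7, 9, 11, 13, 15, 17])
print(s + c, levels)
```

For eight operands this grouping needs four CSA levels (8, 6, 4, 3, 2 rows), which matches the four full adder delays quoted above; the final two words are then added by a conventional carry-propagating adder.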
Carry Lookahead Adder (CLA): This unit is used to add the final sum and carry vectors
generated by the Wallace Trees for each column of bits from the partial products. Only a 28-bit
CLA is needed, instead of a full 32 bits, because some of the bits of the final result are already
available from the Wallace Trees.
Arithmetic Logic Unit (ALU): The ALU is responsible for all arithmetic and logic operations
that take place within the processor. These operations can have one operand or two, with these
values coming from either the register file or from the immediate value from the instruction
directly. The low-power design of the ALU involves gating the input signals to each of the
separate components of the ALU. These inputs are gated using transmission gates. When a
particular component of the ALU is not being used, the input to that component is held in a
High-Z state by the output of the transmission gate. The operations supported by the ALU
include add, subtract, compare, and, or, not, xor, logical shift, and arithmetic shift. The output of
the ALU goes either to the data memory (in the case where the output is an address) or through a
multiplexer back to the register file.
The add, subtract, and compare operations are performed by the adder component. This adder
component is essentially a 16-bit carry look-ahead adder, which performs the specified operation
based on the input signals it receives from the ALU Control Unit.
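A bit-level Python sketch of a 16-bit adder built from generate and propagate signals is shown below. The carry recurrence c[i+1] = g[i] OR (p[i] AND c[i]) is evaluated sequentially here, whereas lookahead hardware expands it into parallel logic. Treating subtract as a + NOT(b) + 1 and compare as a subtraction that only produces the Zero flag are common conventions that we assume for illustration; the report does not describe the adder at this level.

```python
def cla_add16(a, b, carry_in=0):
    """Add two 16-bit values using generate/propagate signals; returns (sum, carry_out)."""
    a &= 0xFFFF
    b &= 0xFFFF
    carry = carry_in & 1
    result = 0
    for i in range(16):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai ^ bi            # generate and propagate for bit i
        result |= (p ^ carry) << i         # sum bit
        carry = g | (p & carry)            # carry into bit i + 1
    return result, carry


def alu_adder(a, b, subtract=False):
    """Return (result, zero_flag) for add, or for subtract/compare when subtract is True."""
    if subtract:
        result, _ = cla_add16(a, ~b, 1)    # a - b = a + NOT(b) + 1
    else:
        result, _ = cla_add16(a, b, 0)
    return result, int(result == 0)


print(alu_adder(12, 16))
print(alu_adder(5, 5, subtract=True))      # equal operands set the Zero flag
```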
The shift operations are performed by the shift component. The shifter is capable of performing
an arithmetic shift left or right, as well as a logical shift left or right, based on the inputs it receives.
In the logical shift operation, zeros are pushed into the vacated bit positions. In the operation
that this design calls an arithmetic shift, the vacated bit positions are filled with the bit values
that have been pushed out of the operand, in a wraparound fashion. This behavior is more commonly
called a rotate; a conventional arithmetic right shift would instead replicate the sign bit.
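The two behaviors described above can be contrasted with a short Python sketch for 16-bit operands. The function names are ours, and the wraparound variant models the behavior as it is described in this report rather than a sign-replicating arithmetic shift.

```python
def logical_shift(value, amount, left):
    """Shift a 16-bit value, filling the vacated positions with zeros."""
    value &= 0xFFFF
    if left:
        return (value << amount) & 0xFFFF
    return value >> amount


def wraparound_shift(value, amount, left):
    """Shift a 16-bit value, refilling vacated positions with the bits pushed out."""
    value &= 0xFFFF
    amount %= 16
    if not left:
        amount = (16 - amount) % 16      # a right shift by n equals a left shift by 16 - n
    return ((value << amount) | (value >> (16 - amount))) & 0xFFFF


print(hex(logical_shift(0x8001, 1, left=True)))
print(hex(wraparound_shift(0x8001, 1, left=True)))
```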
The Xor, Or, and And operations take two 16-bit values and perform the respective bitwise
operation on those two operands. The Not is a unary operation that takes only one 16-bit value
and inverts all the bits.
ALU Control Unit: This unit is responsible for providing signals to the ALU that indicate the
operation that the ALU will perform. The input to this unit is the 4-bit opcode and the 4-bit
function field of the instruction word. It uses these bits to decide the correct ALU operation for
the current instruction cycle. This unit also provides another set of outputs that is used to gate the
signals to the parts of the ALU that will not be used for the current operation.
Memory:
This stage consists of the Data Memory module.
Data Memory: This module supports up to 64 K 16-bit data words. The Load and
Store instructions are used to access this module. When new data is to be written to the memory,
the Mem_Write signal is asserted. When the Mem_Write signal is low, a read operation is
performed for the given memory location.
Write Back:
This stage consists of control circuitry that forwards the appropriate data, generated by the
ALU/MAC or read from the Data Memory, to the register file to be written into the designated
register.
Instruction Set:
There are three basic types of instructions supported by this processor. These are the Register
Type, Branch Type, and the Immediate Type. The specification for each type of instructions is
given below.
Register Type: In this format bits 15-12 represent the opcode. Bits 11-8 represent the address
of the first source register, which is also the address of the destination register. Bits 7-4 give the
address of the second source register. The last four bits, 3-0, represent the function code that
represents the ALU function that is to be performed. If the opcode bits, 15-12, do not indicate an
ALU function, then the function bits are ignored. Diagram 4 below shows the basic format of
this instruction type.
Bits:    15-12    11-8      7-4    3-0
Field:   opcode   Rs1/Rd    Rs2    Function
Diagram 4: Register Type Instruction
Immediate Type: As with the Register Type instruction, bits 15-12 represent the opcode, and
bits 11-8 represent the source register, which is also the destination register. Bits
7-0 of this instruction type represent an 8-bit immediate value given in 2’s complement form.
When the opcode represents a unary operation, the value in this immediate field is used as the
operand (instead of the value in Rs). Diagram 5 below shows the instruction format.
Bits:    15-12    11-8     7-0
Field:   opcode   Rd/Rs    Immediate
Diagram 5: Immediate Type Instruction
Branch Type: Bits 15-12 of this instruction format represent the type of branch operation to be
performed. The remaining 12 bits, 11-0, represent the branch offset in 2’s complement format.
This number is added to the value of the PC to obtain the branch target address. The instruction
format for this type is shown in Diagram 6 below.
Bits:    15-12    11-0
Field:   opcode   Branch target
Diagram 6: Branch Type Instruction
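The three formats can be illustrated by extracting their fields from a 16-bit instruction word. The Python sketch below uses instruction words that appear in the test program of Appendix E; the function and key names are illustrative and are not taken from the VHDL design.

```python
def decode_register(word):
    """Split a Register Type word into opcode, Rs1/Rd, Rs2, and function fields."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rs1_rd": (word >> 8) & 0xF,
        "rs2": (word >> 4) & 0xF,
        "function": word & 0xF,
    }


def decode_immediate(word):
    """Split an Immediate Type word; the 8-bit immediate is read as 2's complement."""
    imm = word & 0xFF
    return {
        "opcode": (word >> 12) & 0xF,
        "rd_rs": (word >> 8) & 0xF,
        "immediate": imm - 0x100 if imm & 0x80 else imm,
    }


def decode_branch(word):
    """Split a Branch Type word; the 12-bit offset is read as 2's complement."""
    offset = word & 0xFFF
    return {
        "opcode": (word >> 12) & 0xF,
        "offset": offset - 0x1000 if offset & 0x800 else offset,
    }


print(decode_register(0b1101000100100000))    # add R1, R2
print(decode_immediate(0b1000000100010000))   # movi R1, 16
print(decode_branch(0b1001000000000101))      # beq 5
```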
Table 2 below summarizes all the instructions supported by this processor. A more detailed table
of the instruction set along with the description for each instruction can be found in Appendix B.
[Diagram: FPGA design flow, showing Schematic Entry, Synthesis, and Configuration with Verification steps between the stages]
Schematic Entry: The design is entered into a synthesis design system using a hardware
description language. The language used for this design was VHDL and we used the editor
provided by Xilinx Integrated Software Environment (ISE Version 5.2).
Synthesis: A netlist is generated using the VHDL code and the Xilinx synthesis tool.
Place and Route: The place process decides the best location of the cells and the best routing
strategy for the given design and desired performance. The route process makes the connections
between the cells and the blocks. This process was also completed with Xilinx ISE.
Configuration: This step was not completed because the xc95108 CPLD chip that we received
did not have enough built-in memory blocks to properly synthesize our design. More precisely,
the data and instruction memory blocks could not be synthesized for it.
Verification: At each step of the design process, we verified our architecture using software
simulation. We used the ModelSim XE II software package to simulate our VHDL code.
instructions that were to be tested. The reason for using binary representation is that otherwise it
would take too much time to develop a completely new assembler to interpret text based
assembly code. The hex file that was created after running this program was substituted in place
of the ROM for the Instruction Memory when performing simulation with ModelSim. The
VHDL program used to create the hex file is shown in Appendix E.
The procedure for creating the hex file and setting up the testbench for verification is listed
below.
• We created a new project and added the VHDL program that creates the hex file.
• A new testbench waveform file was attached to this VHDL program by going to
Project –> New Source…–> Test Bench Waveform.
• ModelSim Simulator was invoked using the option “Simulate Behavioral VHDL model”
for the newly created waveform file in the previous step.
• Once ModelSim finishes running completely, the hex file is created automatically and put
in the work directory.
• All the processor VHDL files were added to this same project.
• Another new Test Bench Waveform file was created for the processor files and the
desired output signals were specified.
• ModelSim Simulator was invoked for this file the same way as before.
• Since the instruction memory was simulated by the hex file, when the ModelSim
simulator ran, it automatically opened up the previously created hex file and executed the
instructions one at a time showing the output waveforms in the Wave window.
It should be noted that the VHDL file for the instruction memory was written so that
it opens the hex file for reading when it first starts running. This VHDL file is
included in Appendix E. Once all the tests were complete, it was rewritten to use the ROM
modules from the Xilinx Unisim library instead of relying on hex files.
We tested our processor architecture by running many test programs that were created using the
method described above. The basic verification approach was to compare the simulated output
results with the expected results that we computed by hand. Whenever we found a mismatch
between the two, we identified the problem(s) and took care of them appropriately. We tested
the functionality of all the instructions, the interactions among the instructions in the pipeline,
and the correctness of the data as a result of executing those instructions.
Synthesis Results:
As mentioned above, after the VHDL code of our design was complete, we synthesized it
using the Xilinx Integrated Software Environment tool (Version 5.2). Since we were unable to
implement our design on the xc95108 CPLD, we chose the xc2v250 chip from the Virtex-II
FPGA family for synthesis purposes. From the synthesis estimate, the minimum clock period that
can be achieved in our architecture is 23.028 ns, which translates to a maximum operating
frequency of 43.425 MHz. The critical path that determines this delay comes from the execute
stage of the pipeline through the Multiplier. This was an expected result, since the Multiplier is
the biggest module in our design in terms of circuit complexity and size. The complete synthesis
report can be found in Appendix D. It includes all the timing information along with device
resource utilization summary. The FPGA floorplan for the implementation of this design is
shown in Appendix F.
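As a quick check of the timing figures quoted above, the maximum operating frequency is the reciprocal of the minimum clock period. The variable names in this small Python calculation are illustrative.

```python
# Minimum clock period reported by the synthesis estimate, in nanoseconds.
min_period_ns = 23.028

# f = 1 / T; a period in nanoseconds gives a frequency in MHz when divided into 1000.
max_frequency_mhz = 1000.0 / min_period_ns

print(round(max_frequency_mhz, 3))
```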
We also used the Xilinx XPower software to analyze the power dissipation of our
design. From the software simulation, the estimated power dissipation is approximately 780 mW. This figure
was obtained under the assumption that the clock frequency is 43.425 MHz and the default activity
rate of the signals in the design is 100% relative to the clock frequency. We presume that this
figure overestimates the actual power dissipated by the design, since it does
not take into account the effect of circuit inactivity due to tri-state buffering and clock gating.
Conclusion:
Like any other engineering design, ours had to be tested consistently and modified
whenever a problem arose. We added new pipelines and remapped our diagram of the
processor before we had the code fully working. The project greatly enhanced our understanding
of embedded processor design, low-power techniques, and the important role they play in today’s electronic
world. Power consumption in embedded processors will continue to improve as technology
uncovers new and more efficient ways to reduce it.
Description: This instruction copies the value in the immediate field and writes it
into r1.

Instruction: Move Y
Bits:    15-12    11-8      7-4    3-0
Format:  1100     Rs1/Rd    N/A    N/A
Syntax: Movy r1
Description: The instruction will copy the value from the Y register and write it
into r1.

Instruction: Load Word
Bits:    15-12    11-8      7-4    3-0
Format:  0101     Rs1/Rd    Rs2    N/A
Syntax: Lw r1, r2
Description: This instruction is used to load a word from the data memory. r2
contains the base address of the memory location, and r1 contains an offset. The
effective memory address is found by adding r1 and r2. The data word is
loaded into r1.

... not set. The 12-bit immediate field specifies the branch offset.

Instruction: Branch Always
Bits:    15-12    11-0
Format:  1011     Branch Offset
Syntax: Ba, immd
Description: The instruction will always perform a branch regardless of
the condition of the zero flag. The 12-bit immediate field specifies the branch
offset.

Instruction: No Operation
Bits:    15-12    11-0
Format:  0000     N/A
Syntax: Nop
Description: No action is taken.
Diagram 8 below shows the simulation result of clock gating for the ALU Control Unit and the Branch Adder. The inputs to the
Branch Adder change only twice since it encounters only two branch instructions. When the instruction is something other than
branch, the inputs remain unchanged as a result of clock gating. In a similar fashion, the inputs to the ALU Control Unit remain
unchanged when the current instruction does not require the operations provided by the ALU (such as multiplication and branch).
Diagram 8: Clock gating for the ALU Control Unit and the Branch Adder
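The effect described for the Branch Adder can be reproduced with a small Python model of a clock-gated input register. This is a behavioral sketch, not the VHDL from the design: the class name is ours, the program is the first twelve instruction words of the Appendix E test program, and the set of branch opcodes contains only the two opcodes confirmed by Appendices B and E (other branch opcodes may exist).

```python
BRANCH_OPCODES = {0b1001, 0b1011}    # beq and ba; other branch opcodes are not modeled

# First twelve instruction words of the Appendix E test program.
PROGRAM = [int(bits, 2) for bits in (
    "0000000000000000", "1000000100010000", "1001000000000101", "1000001000001100",
    "1000001100100000", "1000100100000101", "0001000100001111", "1101000100100000",
    "1111000100001111", "1110000100100000", "1101000100110100", "1001000000000010",
)]


class GatedRegister:
    """Register with a gated clock: it captures its input only when enabled."""

    def __init__(self):
        self.value = 0
        self.captures = 0

    def tick(self, enable, data):
        if enable:                    # clock reaches the register
            self.value = data
            self.captures += 1
        return self.value             # otherwise the old value is held


def run(program):
    """Clock the Branch Adder input register once per instruction."""
    branch_adder_input = GatedRegister()
    for word in program:
        is_branch = (word >> 12) in BRANCH_OPCODES
        branch_adder_input.tick(is_branch, word & 0xFFF)
    return branch_adder_input


reg = run(PROGRAM)
print(reg.captures, reg.value)
```

With this program the register captures a new value only for the two beq instructions, so the Branch Adder inputs toggle twice and stay unchanged for every other instruction.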
Design Statistics
# IOs : 66
Macro Statistics :
# Registers : 38
# 1-bit register : 17
# 12-bit register : 1
# 16-bit register : 13
# 2-bit register : 1
# 4-bit register : 5
# 8-bit register : 1
# Tristates : 18
# 1-bit tristate buffer : 4
# 16-bit tristate buffer : 13
# 4-bit tristate buffer : 1
# Adders/Subtractors : 1
# 16-bit adder : 1
# Comparators : 8
# 4-bit comparator equal : 6
# 4-bit comparator not equal : 2
# Xors : 201
# 1-bit xor3 : 201
Cell Usage :
# BELS : 2330
# GND : 1
# LUT1 : 15
# LUT1_L : 1
# LUT2 : 136
# LUT2_D : 24
# LUT2_L : 7
# LUT3 : 365
# LUT3_D : 58
# LUT3_L : 23
# LUT4 : 1247
# LUT4_D : 125
# LUT4_L : 137
# MUXCY : 15
# MUXF5 : 144
# rom32x1 : 16
# VCC : 1
# XORCY : 15
# FlipFlops/Latches : 519
# FD : 175
# FDCE : 16
# FDE : 72
# LD : 256
# RAMS : 2
# ram16x8s : 2
# Tri-States : 215
# BUFT : 215
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 65
# IBUF : 1
# OBUF : 64
=========================================================================
Timing Summary:
---------------
Speed Grade: -6
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity instr_dump is
    Port ( a : in std_logic;
           b : out std_logic);
end instr_dump;

architecture Behavioral of instr_dump is
begin
mem_load: process is --(address, data_in, wrt) is
type load_file_type is file of std_logic_vector(15 downto 0);
file load_file: load_file_type open write_mode is "instruction.txt";
begin
write(load_file,std_logic_vector'("0000000000000000")); -- nop
write(load_file,std_logic_vector'("1000000100010000")); -- movi R1, D'16'
write(load_file,std_logic_vector'("1001000000000101")); -- beq D'5'
write(load_file,std_logic_vector'("1000001000001100")); -- movi R2, D'12'
write(load_file,std_logic_vector'("1000001100100000")); -- movi R3, D'32'
write(load_file,std_logic_vector'("1000100100000101")); -- movi R9, D'5'
write(load_file,std_logic_vector'("0001000100001111")); -- andi R1, D'15'
write(load_file,std_logic_vector'("1101000100100000")); -- add R1, R2
write(load_file,std_logic_vector'("1111000100001111")); -- muli R1, D'15'
write(load_file,std_logic_vector'("1110000100100000")); -- mul R1, R2
write(load_file,std_logic_vector'("1101000100110100")); -- cmp R1, R3
write(load_file,std_logic_vector'("1001000000000010")); -- beq D'2'
write(load_file,std_logic_vector'("1101000100101000")); -- sll R1, 2
write(load_file,std_logic_vector'("1101001100100111")); -- not R3, R2
write(load_file,std_logic_vector'("1101000100110100")); -- cmp R1, R3
write(load_file,std_logic_vector'("0100001011111111")); -- ori R2, 11111111
write(load_file,std_logic_vector'("0111001100000000")); -- mov R3, R0
write(load_file,std_logic_vector'("0000000000000000")); -- nop
write(load_file,std_logic_vector'("0000000000000000")); -- nop
write(load_file,std_logic_vector'("0000000000000000")); -- nop
write(load_file,std_logic_vector'("0000000000000000")); -- nop
write(load_file,std_logic_vector'("0000000000000000")); -- nop
wait on a;
end process mem_load;
end Behavioral;
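library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity I_memory is
    Port ( address  : in  std_logic_vector(15 downto 0);
           data_out : out std_logic_vector(15 downto 0));
end I_memory;

architecture Behavioral of I_memory is
begin
mem_behavior: process is --(address) is
    -- 64 K words of 16 bits each, indexed 0 to 65535
    type mem_array is array (0 to 65535) of std_logic_vector(15 downto 0);
    variable mem: mem_array;
    type load_file_type is file of std_logic_vector(15 downto 0);
    file load_file: load_file_type open read_mode is "instruction.txt";
    variable index: natural;
begin
    -- Load the instruction words written by instr_dump into the memory array
    index:= 0;
    while not endfile(load_file) loop
        read(load_file,mem(index));
        index:= index+1;
    end loop;
    -- The rest of this process is missing from the source listing. The loop below is an
    -- assumed completion, not the authors' code: it drives data_out from the loaded array.
    loop
        data_out <= mem(to_integer(unsigned(address)));
        wait on address;
    end loop;
end process mem_behavior;
end Behavioral;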