
Low-Power Embedded Processor Design

Senior Design Project


Electrical and Computer Engineering Department
University of Connecticut

Team Members:
Mohammad Noman (CMPE)

Olugbenga Odesina (CMPE)

Pouyan Afshar (EE)

Advisor:
Hanho Lee, Ph.D.
Assistant Professor, Electrical and Computer Engineering
University of Connecticut
lee@engr.uconn.edu
Table of Contents

Introduction
Reasons for Low Power
Architecture
    Instruction Fetch
    Instruction Decode
    Execute
    Memory
    Write Back
    Data Forward Unit
    Hazard Detection Unit
Instruction Set
Power Reduction Method
FPGA Design Flow
Testing and Verification Method
Synthesis Results
Conclusion
References
Appendix A: Detailed Architecture
Appendix B: Instruction Set
Appendix C: Simulation Waveforms
Appendix D: Synthesis Report
Appendix E: VHDL Code for Hex-File-Based Instruction Memory
Appendix F: FPGA Floorplan

Introduction:
An embedded processor is a processor that has been “embedded” into a device. It can be
programmed to interact with different pieces of hardware. Performance-wise, an embedded
processor can outperform a microcontroller but falls short of a general-purpose
microprocessor.

Low-power embedded processors are used in a wide variety of applications, including cars,
phones, digital cameras, printers, and other such devices. They are widely used because they
are small; they therefore take up little die area and are cost-effective to fabricate.
Embedded processors also come already verified, eliminating the need to spend additional
engineering man-hours tracking down hardware flaws. Another great advantage of embedded
processors is that they run software, which makes it possible to accommodate changing
specifications as system requirements change.

Low-power processors are the key to the realization of portable electronic devices, in which
power consumption is an important factor. Low power consumption helps to reduce heat
dissipation, lengthen battery life, and increase device reliability. In this project we
implement a 16-bit RISC-type embedded processor that supports a pre-defined instruction set.
The processor follows a RISC architecture because it allows for a simpler implementation of
our design. Several power-saving techniques could be used in our design; however, the main
focus of our processor’s low-power architecture is clock gating.

Reasons for Low-Power:


There are several reasons for emphasizing low power dissipation in modern processor designs.
Some of these reasons relate to device performance, while others relate to manufacturing. As
noted in the introduction, low-power processors are very important in today’s mobile devices,
and one of the main advantages of a low-power design is that it prolongs battery life.
Another performance-related benefit is that, in many cases, reducing the switching current in
the processor increases the reliability of the device.

Manufacturing issues such as packaging costs also favor a low-power design. Heat dissipation
is one of the major factors considered when a chip is packaged. A low-power processor design
greatly reduces heat dissipation, which in turn reduces packaging costs.

Architecture:
The overall diagram of the processor architecture is shown in Appendix A. As seen from the
diagram, the architecture consists of a five-stage pipeline. The stages are Instruction Fetch,
Instruction Decode, Execute, Memory, and Write Back. In addition, a Data Forward Unit and a
Hazard Detection Unit maintain proper data flow through the pipeline stages. In an effort to
reduce power consumption, we have employed clock gating and signal gating throughout the
design wherever applicable. Each stage of the pipeline, along with the Data Forward and
Hazard Detection units, is described in detail below.

Instruction Fetch:
This stage consists of the Program Counter, Instruction Memory, and the Branch Decide Unit.

Program Counter: The Program Counter (PC) contains the address of the instruction that will
be fetched from the Instruction Memory during the next clock cycle. Normally the PC is
incremented by one during each clock cycle unless a branch instruction is executed. When a
branch is taken, the PC is instead incremented or decremented by the amount indicated by the
branch offset. The PC Write input of the PC serves as an enable signal. When the PC Write
signal is high, the contents of the PC are updated during the next clock cycle, and when it is
low, the contents of the PC remain unchanged.

Instruction Memory: The Instruction Memory contains the instructions that are executed by
the processor. The input to this unit is a 16-bit address from the Program Counter and the output
is a 16-bit instruction word. This module supports up to 64K words of memory, where each
word is 16 bits long.

Branch Decide Unit: The Branch Decide Unit is responsible for determining whether a branch
is to take place or not based on the 2-bit Branch signal from the Control Unit and the Zero flag
from the Arithmetic Logic Unit (ALU). The output of this unit is a 1-bit value which is high
when a branch is to take place, and otherwise it is low. This output controls a multiplexer which
in turn controls whether the PC gets incremented by one or by the amount indicated by the
branch offset.
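
As an illustration only, the next-PC selection described above can be modeled behaviorally in
Python. This is a sketch, not the VHDL used in the design. The encoding of the 2-bit Branch
signal (BR_NONE, BR_EQ, BR_NE, BR_ALWAYS) and the function names are hypothetical, since the
report does not list them, and pipeline timing is ignored.

```python
# Behavioral sketch of the Instruction Fetch next-PC logic.
# The 2-bit Branch encoding below is hypothetical; the report does not give it.
BR_NONE, BR_EQ, BR_NE, BR_ALWAYS = 0b00, 0b01, 0b10, 0b11


def branch_decide(branch: int, zero: bool) -> bool:
    """Return True when a branch is to take place (the 1-bit output)."""
    if branch == BR_EQ:
        return zero
    if branch == BR_NE:
        return not zero
    return branch == BR_ALWAYS


def next_pc(pc: int, pc_write: bool, taken: bool, offset: int) -> int:
    """Hold the PC when pc_write is low; otherwise add 1 or the branch offset."""
    if not pc_write:
        return pc
    step = offset if taken else 1
    return (pc + step) & 0xFFFF  # the 16-bit address wraps around
```

For example, with the PC at 10 and a taken branch with offset -4, next_pc returns 6; with
PC Write low it returns 10 regardless of the branch decision.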

Instruction Decode:
This stage consists of the Control Unit, Register File, Y-Register, and the Sign Extend Unit.

Control Unit: The Control Unit generates all the control signals needed to coordinate the
components of the processor. The input to this unit is the 4-bit opcode field of the
instruction word. This unit generates the signals that control all the read and write
operations of the Register File, Y-Register, and the Data Memory. It also generates the
signals that decide when to use the Multiplier and when to use the ALU, as well as the
branch flags that are used by the Branch Decide Unit. In addition, this unit
provides clock gating signals for the ALU Control and the Branch Adder modules.

Register File: This is a two-port register file that can perform two simultaneous reads and
one write. It contains sixteen 16-bit general-purpose registers, named R0 through R15. R0 is a
special register that always contains the value zero; any write request to this register is
ignored. When the Reg_Write signal is high, a write operation is performed to the register
indicated by the write address; otherwise, the values contained in the registers indicated by
the read addresses are output.
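
The register file behavior can be sketched as a small Python model. The class and method names
are illustrative assumptions, not names taken from the VHDL source, and the model ignores clock
timing.

```python
class RegisterFile:
    """Sixteen 16-bit registers, two read ports, one write port; R0 reads as zero."""

    def __init__(self):
        self.regs = [0] * 16

    def read(self, addr_a: int, addr_b: int):
        """Return the values at both read addresses (two simultaneous reads)."""
        return self.regs[addr_a & 0xF], self.regs[addr_b & 0xF]

    def write(self, addr: int, value: int, reg_write: bool):
        """Write only when Reg_Write is high, and never to R0."""
        if reg_write and (addr & 0xF) != 0:
            self.regs[addr & 0xF] = value & 0xFFFF  # keep only 16 bits
```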

Y-Register: The Y-Register is a special 16-bit register that is used to store the upper 16 bits
(bits 16-31) of the result generated by the multiplier. When the Y_Write signal is high, a new
value is written to this register; otherwise, the currently stored value is output.

Sign Extend Unit: The input to this unit is the 8-bit immediate value provided by the
immediate type instructions. This unit sign extends the 8-bit value to a 16-bit signed value.
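
A worked example of sign extension, written as a Python sketch (the function name and its
default widths are assumptions chosen for illustration):

```python
def sign_extend(value: int, from_bits: int = 8, to_bits: int = 16) -> int:
    """Sign-extend a from_bits-wide two's-complement field to to_bits."""
    value &= (1 << from_bits) - 1
    if value & (1 << (from_bits - 1)):
        # Sign bit is set: fill the upper bits with ones.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value
```

With these defaults, 0x7F (+127) stays 0x007F, while 0x80 (-128) becomes 0xFF80 and 0xFF (-1)
becomes 0xFFFF.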

Execute:
This stage consists of the Branch Adder, Multiplier, Arithmetic Logic Unit (ALU), and the ALU
Control Unit.

Branch Adder: The branch adder adds the 12-bit signed branch offset to the current value of
the PC to calculate the branch target. The 12-bit offset is provided by the branch instruction.
The output of this unit goes to the PC control multiplexer, which updates the PC with this value
only when a branch is to be taken.

Multiplier:

The high-level block diagram of the multiplier is shown in diagram 1 below. It consists of four
distinct components: the Booth Encoder, the Partial Product Generator, the Carry Save Adders
(arranged as a Wallace Tree), and the Carry Lookahead Adder. Our multiplier architecture
employs two main techniques to increase the speed of the multiplication process. The first is
to reduce the number of partial products, and the second is to increase the speed at which the
partial products are added. The individual components shown in diagram 1 are explained in
detail below.

Diagram 1: Architecture of the Multiplier


Booth Encoder: This module encodes the 16-bit multiplier using the radix 4 Booth’s algorithm.
Radix 4 encoding reduces the total number of multiplier digits by a factor of two, which means
in this case the number of multiplier digits is reduced from 16 to 8. The algorithm arranges
the original multiplier into overlapping groups of three consecutive bits, where each group
shares its outermost bit with the adjacent group. Each of these groups of three bits then
corresponds to one of the numbers from the set {2, 1, 0, -1, -2}. Each encoder produces a 3-bit
output where the first bit represents the number 1 and the second bit represents the number 2.
The third and final bit indicates whether the number in the first or second bit is negative.
Since there are 16 input bits, there are a total of 8 Booth encoder modules in the overall
multiplier architecture. The way the outputs are determined is shown in table 1 below.

Multiplier bits           Output bits       Operation on
y(i+1)  y(i)  y(i-1)      NEG   2    1      multiplicand
  0      0      0          0    0    0          0x
  0      0      1          0    0    1         +1x
  0      1      0          0    0    1         +1x
  0      1      1          0    1    0         +2x
  1      0      0          1    1    0         -2x
  1      0      1          1    0    1         -1x
  1      1      0          1    0    1         -1x
  1      1      1          1    0    0          0x
Table 1: Booth algorithm
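
The encoding in table 1 can be checked with a short Python sketch. The function names are
hypothetical, and the Boolean expressions for the NEG, 2, and 1 outputs are one way to
reproduce the table rows; they are not necessarily the gate-level equations used in the design.

```python
def booth_table_outputs(group: int):
    """Map one 3-bit group (y(i+1), y(i), y(i-1)) to the (NEG, 2, 1) outputs of table 1."""
    y_hi, y_mid, y_lo = (group >> 2) & 1, (group >> 1) & 1, group & 1
    neg = y_hi
    one = y_mid ^ y_lo
    two = int(group in (0b011, 0b100))
    return neg, two, one


def booth_digit(group: int) -> int:
    """Signed digit in {-2, -1, 0, 1, 2} selected by one encoder."""
    neg, two, one = booth_table_outputs(group)
    magnitude = 2 * two + one
    return -magnitude if neg else magnitude


def booth_radix4_digits(multiplier: int, bits: int = 16):
    """Return the bits/2 Booth digits, least significant first."""
    # Append the implicit y(-1) = 0 to the right of the multiplier.
    y_ext = (multiplier & ((1 << bits) - 1)) << 1
    return [booth_digit((y_ext >> i) & 0b111) for i in range(0, bits, 2)]
```

Summing digit k times 4^k over the eight digits gives back the multiplier interpreted as a
signed 16-bit number, which is why eight partial products are sufficient.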

Partial Product Generator (PPG): The output from the Booth encoder is used in this module
to generate the partial products. Since there are eight Booth encoders there will be a total of
eight partial products. The multiplication by two is implemented by shifting the multiplicand left
one bit and the negation is implemented by taking the two’s complement of the multiplicand.
The architecture of the partial product generator is shown in diagram 2.

Diagram 2: Partial Product Generator


Each row of the diagram corresponds to one partial product. Even though the diagram does not
show it, there are eight such rows corresponding to eight partial products. Also, each partial
product is shifted two bits to the left relative to the partial product above it to account for the
radix 4 Booth encoding of the multiplier.
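
To illustrate how the eight shifted partial products combine into the 32-bit result, the
following Python sketch builds each row from a Booth digit and sums the rows. It is a
behavioral model under the assumption that both operands are treated as signed 16-bit values;
the function names are hypothetical.

```python
def to_signed(value: int, bits: int = 16) -> int:
    """Interpret a bits-wide pattern as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def booth_digit_at(y: int, i: int) -> int:
    """Radix 4 Booth digit for even bit position i, using y(-1) = 0."""
    y_lo = (y >> (i - 1)) & 1 if i > 0 else 0
    return -2 * ((y >> (i + 1)) & 1) + ((y >> i) & 1) + y_lo


def partial_products(multiplicand: int, multiplier: int, bits: int = 16):
    """One row per Booth digit; row k is shifted left by 2k bits."""
    x = to_signed(multiplicand, bits)
    y = multiplier & ((1 << bits) - 1)
    return [(booth_digit_at(y, 2 * k) * x) << (2 * k) for k in range(bits // 2)]


def multiply(multiplicand: int, multiplier: int, bits: int = 16):
    """Return (lower 16 bits, upper 16 bits) of the 32-bit product."""
    product = sum(partial_products(multiplicand, multiplier, bits))
    product &= (1 << (2 * bits)) - 1
    return product & 0xFFFF, product >> 16
```

In this model the first element of the returned pair corresponds to the value written to the
destination register and the second to the value written to the Y-Register.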

Wallace Tree: This module is responsible for adding the partial products that were generated in
the PPG module. It uses 3-to-2 carry save adders (CSA) to implement the Wallace Tree. The
individual CSAs are nothing more than full adders, except that the carry-ins and the
carry-outs are handled in a special way. Each column of bits in the partial products is added
using this method. Diagram 3 below shows how this method works for adding 8 bits. The
carry-outs generated in each stage of addition are transferred to the Wallace Tree of the
column of partial product bits on the left, and the carry-ins come from the column to the
right. The advantage of using a Wallace Tree structure for addition is that, for adding eight
bits, the result is available after only four full adder delays. If the same addition were
performed using a ripple carry adder, it would require seven full adder delays. Therefore,
although the structure of the adder is somewhat more complicated, it greatly increases the
speed of addition.
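
The four-level figure can be reproduced with a small Python sketch. Note the simplification:
the design applies the carry save adders column by column, whereas this model applies them to
whole words, which gives the same level count for eight operands. The function names are
hypothetical.

```python
def csa(a: int, b: int, c: int):
    """3-to-2 carry save adder on whole words: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def wallace_reduce(operands):
    """Reduce non-negative integers to two words, counting the CSA levels used."""
    operands = list(operands)
    levels = 0
    while len(operands) > 2:
        full = len(operands) - len(operands) % 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(csa(*operands[i:i + 3]))
        reduced.extend(operands[full:])  # leftovers pass through to the next level
        operands = reduced
        levels += 1
    return operands, levels
```

Eight operands shrink to 6, then 4, then 3, then 2 words, so four adder levels are needed
before the final two words are handed to a conventional adder.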

Diagram 3: Wallace Tree


Carry Lookahead Adder (CLA): This unit is used to add the final sum and carry vectors
generated by the Wallace Trees for each column of bits from the partial products. Only a 28-bit
CLA is needed, instead of a full 32 bits, because some of the bits of the final result are already
available from the Wallace Trees.

Arithmetic Logic Unit (ALU): The ALU is responsible for all arithmetic and logic operations
that take place within the processor. These operations can have one operand or two, with the
values coming either from the register file or directly from the immediate field of the
instruction. The low-power design of the ALU involves gating the input signals to each of the
separate components of the ALU. These inputs are gated using transmission gates. When a
particular component of the ALU is not being used, the input to that component is in a
High Z state due to the output of the transmission gate. The operations supported by the ALU
include add, subtract, compare, and, or, not, xor, logical shift, and arithmetic shift. The
output of the ALU goes either to the data memory (in the case where the output is an address)
or through a multiplexer back to the register file.

The add, subtract, and compare operations are performed by the adder component. This adder
component is essentially a 16-bit carry look-ahead adder, which performs the specified operation
based on the input signals it receives from the ALU Control Unit.
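
A behavioral Python sketch of a 16-bit carry look-ahead adder follows. The generate and
propagate terms are the standard ones; the loop evaluates the carries one bit at a time for
readability, whereas the hardware derives them in parallel. Implementing subtract as addition
of the inverted operand with a carry-in of one is an assumption about this design, as are the
function names.

```python
MASK16 = 0xFFFF


def cla16(a: int, b: int, carry_in: int = 0):
    """Add two 16-bit words using generate/propagate signals; return (result, carry_out)."""
    a &= MASK16
    b &= MASK16
    generate = a & b    # bit i creates a carry by itself
    propagate = a ^ b   # bit i passes an incoming carry along
    carry = carry_in & 1
    carries = 0
    for i in range(16):
        carries |= carry << i
        carry = ((generate >> i) & 1) | (((propagate >> i) & 1) & carry)
    return (propagate ^ carries) & MASK16, carry


def subtract16(a: int, b: int) -> int:
    """a - b computed as a + (not b) + 1."""
    return cla16(a, ~b & MASK16, 1)[0]


def compare_zero_flag(a: int, b: int) -> bool:
    """Zero flag of a compare: set when the two operands are equal."""
    return subtract16(a, b) == 0
```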

The shift operations are performed by the shift component. The shifter can perform an
arithmetic shift left or right, as well as a logical shift left or right, based on the inputs
it receives. In the logical shift operation, zeros are pushed into the vacated bit positions.
In the arithmetic shift operation, as implemented here, the vacated bit positions are filled
with the bit values that have been pushed out of the operand in a wraparound fashion. This
behavior is conventionally called a rotate; a conventional arithmetic right shift would
instead replicate the sign bit.
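
The difference between the shift variants is easiest to see on a concrete value. The Python
sketch below models the logical shift and the wraparound behavior described in the text, and
adds a sign-replicating right shift purely for comparison. All function names are hypothetical.

```python
MASK = 0xFFFF


def logical_shift_left(value: int, n: int) -> int:
    return (value << n) & MASK


def logical_shift_right(value: int, n: int) -> int:
    return (value & MASK) >> n


def wraparound_shift_right(value: int, n: int) -> int:
    """Vacated bits are refilled with the bits pushed out, as the text describes."""
    n %= 16
    value &= MASK
    return ((value >> n) | (value << (16 - n))) & MASK


def wraparound_shift_left(value: int, n: int) -> int:
    n %= 16
    value &= MASK
    return ((value << n) | (value >> (16 - n))) & MASK


def sign_replicating_shift_right(value: int, n: int) -> int:
    """Sign-bit-replicating right shift, shown only for comparison."""
    value &= MASK
    signed = value - 0x10000 if value & 0x8000 else value
    return (signed >> n) & MASK
```

Shifting 0x0003 right by one place gives 0x0001 for the logical shift and 0x8001 for the
wraparound shift, because the bit pushed out on the right re-enters on the left.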

The Xor, Or, and And operations take two 16-bit values and perform the respective bitwise
operation on those two operands. The Not is a unary operation that takes only one 16-bit value
and inverts all the bits.

ALU Control Unit: This unit is responsible for providing the signals that indicate the
operation the ALU will perform. The inputs to this unit are the 4-bit opcode and the 4-bit
function field of the instruction word. It uses these bits to decide the correct ALU operation
for the current instruction cycle. This unit also provides another set of outputs that is used
to gate the signals to the parts of the ALU that will not be used for the current operation.

Memory:
This stage consists of the Data Memory module.

Data Memory: This module supports up to 64K 16-bit data words. The Load and
Store instructions are used to access this module. When new data is to be written to the memory,
the Mem_Write signal is asserted. When the Mem_Write signal is low, a read operation is
performed for the given memory location.

Write Back:
This stage consists of control circuitry that forwards the appropriate data, either generated
by the ALU or the Multiplier or read from the Data Memory, to the register file to be written
into the designated register.

Data Forward Unit:


This unit is responsible for maintaining proper data flow to the ALU and the Multiplier. Its
primary function is to compare the destination register addresses held in the Memory and
Write Back pipeline registers, whose data is waiting to be written back to the register file,
with the source registers needed by the ALU or the Multiplier, and to forward the most
up-to-date data to these units. It performs the same operation for the Y-register data. By
forwarding the data at the appropriate time, this unit ensures that the pipeline works
smoothly and does not stall as a result of data dependencies.
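
The comparison performed by the Data Forward Unit can be sketched in Python as follows. The
dictionary keys ("reg_write", "rd") and the returned labels are hypothetical names chosen for
this illustration; the priority given to the Memory-stage result follows from it being the
more recent of the two.

```python
def forward_select(src_reg: int, mem_stage: dict, wb_stage: dict) -> str:
    """Choose the source of an ALU/Multiplier operand.

    mem_stage and wb_stage describe the Memory and Write Back pipeline registers:
    'reg_write' is True when the instruction will write the register file,
    'rd' is its destination register address (0-15).
    """
    # R0 always reads as zero, so it is never forwarded.
    if mem_stage["reg_write"] and mem_stage["rd"] != 0 and mem_stage["rd"] == src_reg:
        return "MEM"      # most recent result
    if wb_stage["reg_write"] and wb_stage["rd"] != 0 and wb_stage["rd"] == src_reg:
        return "WB"
    return "REGFILE"      # no dependency: use the value read in Instruction Decode
```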

Hazard Detection Unit:


This unit detects conditions under which data forwarding is not possible and stalls the pipeline
for one or two clock cycles in order to make sure that instructions are executed with the correct
data set. When it detects that a stall is necessary, it disables any write operation in the instruction
decode pipeline registers, stops the PC from incrementing, and clears all the control signals
generated by the control unit. By taking these steps it can delay the execution of any instruction
by one clock cycle. It can do this as many times as necessary to ensure proper execution of
instructions.
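
The stall actions listed above can be sketched in Python. The report does not enumerate the
conditions under which forwarding is impossible, so the load-use check below is an assumed,
textbook example of such a condition; all names are hypothetical.

```python
def needs_stall(ex_mem_read: bool, ex_rd: int, id_rs1: int, id_rs2: int) -> bool:
    """Assumed trigger: the instruction in Execute is a load whose destination
    is a source register of the instruction currently being decoded."""
    return ex_mem_read and ex_rd != 0 and ex_rd in (id_rs1, id_rs2)


def apply_stall(stall: bool, controls: dict) -> dict:
    """On a stall: freeze the PC and the decode pipeline register, clear the controls."""
    if not stall:
        return {"pc_write": True, "decode_reg_write": True, "controls": dict(controls)}
    return {
        "pc_write": False,            # the PC stops incrementing
        "decode_reg_write": False,    # the decode pipeline register keeps its contents
        "controls": {name: 0 for name in controls},  # the instruction becomes a bubble
    }
```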

Instruction Set:
There are three basic types of instructions supported by this processor. These are the Register
Type, Branch Type, and the Immediate Type. The specification for each type of instruction is
given below.

Register Type: In this format bits 15-12 represent the opcode. Bits 11-8 represent the address
of the first source register, which is also the address of the destination register. Bits 7-4
give the address of the second source register. The last four bits, 3-0, represent the function
code that selects the ALU function to be performed. If the opcode bits, 15-12, do not indicate
an ALU function, then the function bits are ignored. Diagram 4 below shows the basic format of
this instruction type.

Bits:    15-12     11-8      7-4     3-0
Field:   opcode    Rs1/Rd    Rs2     Function
Diagram 4: Register Type Instruction

Immediate Type: As with the Register Type instruction, bits 15-12 represent the opcode, and
bits 11-8 represent the source register, which is also the destination register. Bits
7-0 of this instruction type represent an 8-bit immediate value given in 2’s complement form.
When the opcode represents a unary operation, the value in this immediate field is used as the
operand (instead of the value in Rs). Diagram 5 below shows the instruction format.

Bits:    15-12     11-8     7-0
Field:   opcode    Rd/Rs    Immediate
Diagram 5: Immediate Type Instruction

Branch Type: Bits 15-12 of this instruction format represent the type of branch operation to be
performed. The remaining 12 bits, 11-0, represent the branch offset in 2’s complement format.
This number is added to the value of the PC to obtain the branch target address. The
instruction format for this type is shown in Diagram 6 below.

Bits:    15-12     11-0
Field:   opcode    Branch target
Diagram 6: Branch Type Instruction
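
The three formats can be exercised with a short Python sketch that packs and unpacks the bit
fields. The function names are hypothetical; the opcodes used in the examples (1101 with
function 0000 for Add, 0010 for Addi, 1001 for Beq) are taken from the table in Appendix B.

```python
def encode_register(opcode: int, rd: int, rs2: int, funct: int) -> int:
    """Register Type: opcode[15:12] Rs1/Rd[11:8] Rs2[7:4] Function[3:0]."""
    return ((opcode & 0xF) << 12) | ((rd & 0xF) << 8) | ((rs2 & 0xF) << 4) | (funct & 0xF)


def encode_immediate(opcode: int, rd: int, imm: int) -> int:
    """Immediate Type: opcode[15:12] Rd/Rs[11:8] Immediate[7:0] (2's complement)."""
    return ((opcode & 0xF) << 12) | ((rd & 0xF) << 8) | (imm & 0xFF)


def encode_branch(opcode: int, offset: int) -> int:
    """Branch Type: opcode[15:12] offset[11:0] (2's complement)."""
    return ((opcode & 0xF) << 12) | (offset & 0xFFF)


def decode(word: int) -> dict:
    """Split a 16-bit instruction word into every field any format may use."""
    word &= 0xFFFF
    return {
        "opcode": word >> 12,
        "rd": (word >> 8) & 0xF,
        "rs2": (word >> 4) & 0xF,
        "funct": word & 0xF,
        "imm8": word & 0xFF,
        "offset12": word & 0xFFF,
    }
```

For example, Add r1, r2 encodes to 0xD120, Addi r3, -1 to 0x23FF, and a Beq with an offset of
-4 to 0x9FFC.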

Table 2 below summarizes all the instructions supported by this processor. A more detailed table
of the instruction set along with the description for each instruction can be found in Appendix B.

Instruction   Description                    Instruction Type
Add           Addition                       Register
Addu          Unsigned addition              Register
Addi          Addition (immediate)           Immediate
Sub           Subtraction                    Register
Subu          Unsigned subtraction           Register
Subi          Subtraction (immediate)        Immediate
Mul           Multiplication                 Register
Muli          Multiplication (immediate)     Immediate
Cmp           Compare                        Register
And           AND                            Register
Andi          AND (immediate)                Immediate
Or            OR                             Register
Ori           OR (immediate)                 Immediate
Not           NOT                            Register
Xor           XOR                            Register
Sll           Logical shift left             Register
Srl           Logical shift right            Register
Sla           Arithmetic shift left          Register
Sra           Arithmetic shift right         Register
Lw            Load word                      Register
Sw            Store word                     Register
Mov           Move data between registers    Register
Movi          Move data (immediate)          Immediate
Beq           Branch if equal to 0           Branch
Bne           Branch if not equal to 0       Branch
Ba            Branch always                  Branch
Movy          Move data from Y register      Register
Nop           No operation                   N/A
Table 2: Instruction Set

Power Reduction Method:


The main power-reducing method that has been explored in this architecture is clock gating.
Clock gating is a method where the clock signal is prevented from reaching various modules
of the processor. The absence of the clock signal prevents any register or flip-flop from
changing its value. As a result, the input to any combinational logic circuit remains
unchanged, and thus no switching activity takes place in those circuits. Since, in CMOS
circuits, most of the power dissipation results from switching activity, clock gating greatly
reduces the overall power consumption.

In this design, we mostly concentrated on the Multiplier, ALU, ALU Control Unit, and the
Branch Adder for clock gating. The reason for this is that these are the biggest modules in our
architecture in terms of number of logic gates. Therefore, a significant improvement in power
use can be achieved by gating these components. The Control Unit is responsible for generating
the clock gating signals based on the current instruction. It then forwards these signals to
the appropriate pipeline registers to make sure that the write operation to these registers is
disabled for the duration of the execution cycle of the instruction.

Also, in an effort to be more power efficient, the ALU has been built in a special modular
fashion. Each of the operations supported by the ALU is performed by a different sub-module
inside the ALU. Since almost 70% of the instructions in the instruction set use the ALU, it is
beneficial to be able to operate only the part of the ALU needed by the current instruction and
turn off the rest. Each of the modules of the ALU is preceded by a set of transmission gates
that controls all the inputs to that module. When a module is needed, the transmission gates
allow the data to pass through; otherwise they simply put that portion of the ALU in an
electrically disconnected state. The ALU Control Unit is responsible for generating the signals
that control the blocks of transmission gates.

The simulation waveforms showing the results of clock/signal gating are shown and discussed in
Appendix C.
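
The effect of clock gating on switching activity can be illustrated with a toy Python model
that counts bit toggles at the output of a 16-bit register. This is not a power estimate, and
it does not reproduce the waveforms in Appendix C; the function name and the sample values
are invented for the illustration.

```python
def count_toggles(values, enables):
    """Count output bit toggles of a register whose clock is gated by `enables`.

    values:  the word presented at the register input in each cycle
    enables: True when the clock reaches the register in that cycle
    """
    held, toggles = 0, 0
    for value, enable in zip(values, enables):
        if enable:
            toggles += bin(held ^ value).count("1")  # bits that change state
            held = value
        # When the clock is gated off, the register holds its value, so the
        # combinational logic it drives sees no input change.
    return toggles
```

With the input sequence 0x00FF, 0xFF00, 0x0F0F, 0xF0F0, an ungated register toggles 48 bits,
while gating the clock off during the two middle cycles reduces this to 16.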

FPGA Design Flow:


We followed a typical FPGA design flow for this project. Each of the steps of the flow is shown
in Diagram 7 and discussed below.

Schematic entry –> Verification –> Synthesis –> Verification –> Place and Route –>
Verification –> Configuration

Diagram 7: FPGA design flow

Schematic Entry: The design is entered into a synthesis design system using a hardware
description language. The language used for this design was VHDL and we used the editor
provided by Xilinx Integrated Software Environment (ISE Version 5.2).

Synthesis: A netlist is generated using the VHDL code and the Xilinx synthesis tool.

Place and Route: The place process decides the best location of the cells and the best routing
strategy for the given design and desired performance. The route process makes the connections
between the cells and the blocks. This process was also completed with Xilinx ISE.

Configuration: This step was not completed because the xc95108 CPLD chip that we received
did not have enough built-in memory blocks to properly synthesize our design. More precisely,
it was not able to synthesize the data and instruction memory blocks.

Verification: At each step of the design process, we verified our architecture using software
simulation. We used the ModelSim XE II software package to simulate our VHDL code.

Testing and Verification Method:


For testing purposes, we simulated the instruction memory of our processor using a hex file. We
did so because it was very difficult to reprogram the ROM component from the Xilinx Unisim
library that was used in the final version of the design. The hex file option gave us the
flexibility to test many cases quickly. To create the hex file, we wrote a simple VHDL program
containing the binary representation of all the instructions to be tested. We used the binary
representation because developing a completely new assembler to interpret text-based assembly
code would have taken too much time. The hex file created by running this program was
substituted for the ROM as the Instruction Memory when performing simulation with ModelSim.
The VHDL program used to create the hex file is shown in Appendix E.

The procedure for creating the hex file and setting up the testbench for verification is listed
below.
• We created a new project and added the VHDL program that creates the hex file.
• A new testbench waveform file was attached to this VHDL program by going to
Project –> New Source…–> Test Bench Waveform.
• ModelSim Simulator was invoked using the option “Simulate Behavioral VHDL model”
for the newly created waveform file in the previous step.
• Once ModelSim finishes running completely, the hex file is created automatically and put
in the work directory.
• All the processor VHDL files were added to this same project.
• Another new Test Bench Waveform file was created for the processor files and the
desired output signals were specified.
• ModelSim Simulator was invoked for this file the same way as before.
• Since the instruction memory was simulated by the hex file, when the ModelSim
simulator ran, it automatically opened up the previously created hex file and executed the
instructions one at a time showing the output waveforms in the Wave window.

It should be noted that the VHDL file for the instruction memory was written so that it would
open the hex file for reading when it first started running. This VHDL file is included in
Appendix E. Once all the tests were complete, it was rewritten to use the ROM modules from the
Xilinx Unisim library instead of relying on hex files.
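
The conversion from binary instruction words to a hex file can also be sketched in Python. This
is not the VHDL program from Appendix E; the one-word-per-line, four-hex-digit layout and the
function names are assumptions made for the illustration, since the exact file format is not
described here.

```python
def binary_to_hex_lines(instructions):
    """Convert 16-bit binary instruction strings to 4-digit hex words."""
    lines = []
    for text in instructions:
        bits = text.replace(" ", "").replace("_", "")  # allow spaced-out fields
        if len(bits) != 16 or set(bits) - {"0", "1"}:
            raise ValueError(f"not a 16-bit binary word: {text!r}")
        lines.append(f"{int(bits, 2):04X}")
    return lines


def write_hex_file(path, instructions):
    """Write one hex word per line (assumed layout)."""
    with open(path, "w") as handle:
        handle.write("\n".join(binary_to_hex_lines(instructions)) + "\n")
```

For example, the word "1101 0001 0010 0000" (Add r1, r2 in the encoding of Appendix B) is
written as D120.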

We tested our processor architecture by running many test programs that were created using the
method described above. The basic verification approach was to compare the simulated output
with the expected results that we computed by hand. Whenever we found a mismatch between the
two, we identified the cause and corrected it. We tested the functionality of all the
instructions, the interactions among the instructions in the pipeline, and the correctness of
the data resulting from executing those instructions.

Synthesis Results:
As mentioned above, after the VHDL code of our design was complete, we synthesized it using the
Xilinx Integrated Software Environment tool (Version 5.2). Since we were unable to implement
our design on the xc95108 CPLD, we chose the xc2v250 chip from the Virtex2 FPGA family for
synthesis purposes. According to the synthesis estimate, the minimum clock period that can be
achieved in our architecture is 23.028 ns, which translates to a maximum operating frequency
of 43.425 MHz. The critical path that determines this delay runs through the Multiplier in the
execute stage of the pipeline. This was an expected result, since the Multiplier is the biggest
module in our design in terms of circuit complexity and size. The complete synthesis report
can be found in Appendix D. It includes all the timing information along with a device
resource utilization summary. The FPGA floorplan for the implementation of this design is
shown in Appendix F.

We also used the Xilinx XPower software to analyze the power dissipation of our design. From
the software simulation, the estimated power dissipation is approximately 780 mW. This figure
assumes a clock frequency of 43.425 MHz and the default signal activity rate of 100% relative
to the clock frequency. We presume that this figure overestimates the actual power dissipated
by the design, since it does not take into account the circuit inactivity due to tri-state
buffering and clock gating.

Conclusion:
As with any other engineering design, we had to test our design continually and make
modifications whenever a problem arose. We added new pipelines and remapped our diagram of the
processor before we had the code fully working. The project greatly enhanced our understanding
of embedded processor design and low-power techniques, and of the important role they play in
today’s electronic world. Power consumption in embedded processors will continue to improve as
technology reveals new and more efficient ways to reduce it.

References:

1. Brake, Cliff, “Power Management In Portable ARM Based Systems”,
http://www.microsoft.com/windows/embedded/docs/Power_Management.doc

2. “MIPS IV Instruction Set”,
http://techpubs.sgi.com/library/manuals/2000/007-2597-001/pdf/007-2597-001.pdf

3. Brown, Richard, “A Microprocessor Design Project in an Introductory VLSI Course”,
IEEE Transactions on Education, Vol. 43, No. 3, August 2000.

4. Hamblen, James and Furman, Michael, Rapid Prototyping of Digital Systems, 2nd
edition, Boston: Kluwer Academic Publishers, 2001.

5. Zemva, Andrej, “VLSI Design Synthesis Flow”,
http://www.cbl.ncsu.edu/publications/1996-Thesis-PhD-Zemva/1996-Thesis-PhD-Zemva-HTML/node5.html

6. “Introduction to Embedded Processors”,
http://www.cs.ucsd.edu/classes/sp02/cse291_E/slides/armlect.pdf

7. “A Microelectronics Primer”,
http://www.cmc.ca/about/corporation/plan/Module5/appendix5a.html

8. “ECE 252 / CSE 252 Digital Systems Design Lecture 2”,
http://www.engr.uconn.edu/~chandy/ece252/252ln02.pdf

9. Hamacher, Vranesic, and Zaky, Computer Organization, 5th edition, New York:
McGraw-Hill Companies, 2002.

10. Liao and Roberts, “A High-Performance and Low-Power 32-bit Multiply-Accumulate
Unit With Single-Instruction-Multiple-Data (SIMD) Feature”, IEEE Journal of
Solid-State Circuits, Vol. 37, No. 7, July 2002.

Appendix A: Detailed Architecture


Appendix B: Instruction Set

Instruction   Description                    Instruction Type   Opcode   Function Code
Add           Addition                       Register           1101     0000
Addu          Unsigned addition              Register           1101     0010
Addi          Addition (immediate)           Immediate          0010     N/A
Sub           Subtraction                    Register           1101     0001
Subu          Unsigned subtraction           Register           1101     0011
Subi          Subtraction (immediate)        Immediate          0011     N/A
Mul           Multiplication                 Register           1110     N/A
Muli          Multiplication (immediate)     Immediate          1111     N/A
Cmp           Compare                        Register           1101     0100
And           AND                            Register           1101     0101
Andi          AND (immediate)                Immediate          0001     N/A
Or            OR                             Register           1101     0110
Ori           OR (immediate)                 Immediate          0100     N/A
Not           NOT                            Register           1101     0111
Xor           XOR                            Register           1101     1100
Sll           Logical shift left             Register           1101     1000
Srl           Logical shift right            Register           1101     1001
Sla           Arithmetic shift left          Register           1101     1010
Sra           Arithmetic shift right         Register           1101     1011
Lw            Load word                      Register           0101     N/A
Sw            Store word                     Register           0110     N/A
Mov           Move                           Register           0111     N/A
Movi          Move data (immediate)          Immediate          1000     N/A
Beq           Branch if equal to 0           Branch             1001     N/A
Bne           Branch if not equal to 0       Branch             1010     N/A
Ba            Branch always                  Branch             1011     N/A
Movy          Move data from Y register      Register           1100     N/A
Nop           No operation                   N/A                0000     N/A
Table 3: Instruction Set Description

Instruction Set Summary

Instruction: Addition
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0000
Syntax: Add r1, r2
Description: This instruction will add the value in r1 with the value in r2, and store the result in r1.

Instruction: Unsigned Addition
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0010
Syntax: AddU r1, r2
Description: This instruction will add the value in r1 with the value in r2, and store the result in r1. The values in r1 and r2 will be treated as unsigned integers.

Instruction: Addition Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 0010  | Rs1/Rd | Immediate
Syntax: Addi r1, immd
Description: This instruction will add the value in r1 with the 8-bit value specified in the immediate field, and store the result in r1.

Instruction: Subtraction
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0001
Syntax: Sub r1, r2
Description: This instruction will subtract the value in r1 from the value in r2, and store the result in r1.

Instruction: Subtraction Unsigned
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0011
Syntax: SubU r1, r2
Description: This instruction will subtract the value in r1 from the value in r2, and store the result in r1. The values in r1 and r2 will be treated as unsigned integers.

Instruction: Subtraction Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 0011  | Rs1/Rd | Immediate
Syntax: Subi r1, immd
Description: This instruction will subtract the value in r1 from the 8-bit value specified in the immediate field, and store the result in r1.

Instruction: Multiplication
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1110  | Rs1/Rd | Rs2 | N/A
Syntax: Mul r1, r2
Description: The instruction will take the values in r1 and r2 and perform multiplication. The lower 16 bits will be stored in r1 and the upper 16 bits will be stored in the Y register.

Instruction: Multiplication Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 1111  | Rs1/Rd | Immediate
Syntax: Muli r1, immd
Description: The instruction will multiply the value in r1 with the 8-bit value specified in the immediate field. The lower 16 bits of the result will be stored in r1 and the upper 16 bits will be stored in the Y register.

Instruction: Compare
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0100
Syntax: Comp r1, r2
Description: The purpose of this instruction is to compare the value in register r1 with the value in register r2. If the values are equal it will set the zero flag.

Instruction: AND
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0101
Syntax: And r1, r2
Description: This instruction takes the values in r1 and r2 and performs bitwise AND operation. The output will be stored in r1.
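The way the multiplication instructions split their result between r1 and the Y register can be illustrated with a short Python model. This is only a sketch under stated assumptions: the name mul16 is hypothetical, and the operands are assumed to be unsigned 16-bit values, which the instruction description does not specify.

```python
MASK16 = 0xFFFF  # registers are 16 bits wide


def mul16(a, b):
    """Model Mul/Muli: return (r1, y) for the 32-bit product of a and b.

    Assumption: both operands are treated as unsigned 16-bit values.
    """
    product = (a & MASK16) * (b & MASK16)
    low = product & MASK16            # lower 16 bits, written back to r1
    high = (product >> 16) & MASK16   # upper 16 bits, written to the Y register
    return low, high


r1, y = mul16(0x1234, 0x0010)
print(hex(r1), hex(y))  # 0x2340 0x1
```

A subsequent Movy instruction would then copy the upper half from the Y register into a general-purpose register.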

Instruction: AND Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 0001  | Rs1/Rd | Immediate
Syntax: Andi r1, immd
Description: This instruction will take the value in r1 and an 8-bit value from the immediate field and perform bitwise AND operation. The result will be stored in r1.

Instruction: OR
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0110
Syntax: Or r1, r2
Description: The instruction takes the values in r1 and r2 and performs bitwise OR operation. The result will then be stored in r1.

Instruction: OR Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 0100  | Rs1/Rd | Immediate
Syntax: Ori r1, immd
Description: This instruction takes the value in r1 and an 8-bit value from the immediate field and performs bitwise OR operation. The result is stored in r1.

Instruction: NOT
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 0111
Syntax: Not r1, r2
Description: The instruction will take the value in r2 and perform bitwise NOT operation. The result will be stored in r1.

Instruction: Xor
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1101  | Rs1/Rd | Rs2 | 1100
Syntax: Xor r1, r2
Description: This instruction will take the values in r1 and r2 and perform a bitwise XOR operation. The result will be stored in r1.

Instruction: Logical Shift Left
Bits:   15-12 | 11-8   | 7-4   | 3-0
Format: 1101  | Rs1/Rd | Immd. | 1000
Syntax: Sll r1, immd
Description: This operation performs logical left shift on the value in r1. The number of places to be shifted is specified by the 4-bit immediate field. The result is stored in r1.

Instruction: Logical Shift Right
Bits:   15-12 | 11-8   | 7-4   | 3-0
Format: 1101  | Rs1/Rd | Immd. | 1001
Syntax: Srl r1, immd
Description: This operation performs logical right shift on the value in r1. The number of places to be shifted is specified by the 4-bit immediate field. The result is stored in r1.

Instruction: Arithmetic Shift Left
Bits:   15-12 | 11-8   | 7-4   | 3-0
Format: 1101  | Rs1/Rd | Immd. | 1010
Syntax: Sla r1, immd
Description: This operation performs arithmetic left shift on the value in r1. The number of places to be shifted is specified by the 4-bit immediate field. The result is stored in r1.

Instruction: Arithmetic Shift Right
Bits:   15-12 | 11-8   | 7-4   | 3-0
Format: 1101  | Rs1/Rd | Immd. | 1011
Syntax: Sra r1, immd
Description: This operation performs arithmetic right shift on the value in r1. The number of places to be shifted is specified by the 4-bit immediate field. The result is stored in r1.

Instruction: Move
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 0111  | Rs1/Rd | Rs2 | N/A
Syntax: Mov r1, r2
Description: This instruction will copy the value from r2 and write it into r1.

Instruction: Move Immediate
Bits:   15-12 | 11-8   | 7-0
Format: 1000  | Rs1/Rd | Immediate
Syntax: Movi r1, immd
Description: This instruction copies the value in the immediate field and writes it into r1.

Instruction: Move Y
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 1100  | Rs1/Rd | N/A | N/A
Syntax: Movy r1
Description: The instruction will copy the value from the Y register and write it into r1.

Instruction: Load Word
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 0101  | Rs1/Rd | Rs2 | N/A
Syntax: Lw r1, r2
Description: This instruction is used to load a word from the data memory. r2 contains the base address of the memory location, and r1 contains an offset. The effective memory address is found by adding r1 and r2. The data word is loaded into r1.

Instruction: Store Word
Bits:   15-12 | 11-8   | 7-4 | 3-0
Format: 0110  | Rs1/Rd | Rs2 | N/A
Syntax: Sw r1, r2
Description: This instruction is used to store data in the data memory. The data value to be stored is kept in r1 and the memory location is kept in r2.

Instruction: Branch Equal
Bits:   15-12 | 11-0
Format: 1001  | Branch Offset
Syntax: Beq immd
Description: The instruction will perform a branch when the zero flag is set. The 12-bit immediate field specifies the branch offset.

Instruction: Branch Not Equal
Bits:   15-12 | 11-0
Format: 1010  | Branch Offset
Syntax: Bne immd
Description: This instruction will perform a branch when the zero flag is not set. The 12-bit immediate field specifies the branch offset.

Instruction: Branch Always
Bits:   15-12 | 11-0
Format: 1011  | Branch Offset
Syntax: Ba immd
Description: The instruction will always perform a branch regardless of the condition of the zero flag. The 12-bit immediate field specifies the branch offset.

Instruction: No Operation
Bits:   15-12 | 11-0
Format: 0000  | N/A
Syntax: Nop
Description: No action is taken.
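The three instruction layouts used in this appendix (register, immediate, and branch) amount to packing fixed-width fields into a 16-bit word. The Python sketch below shows that packing. The helper names are hypothetical, and the branch offset is treated as a raw 12-bit field because the appendix does not say whether it is signed.

```python
def _check(value, bits, name):
    """Return value unchanged if it fits in an unsigned field of the given width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name} does not fit in {bits} bits: {value}")
    return value


def encode_register(opcode, rd, rs2, funct=0):
    """Register format: opcode[15:12] | Rs1/Rd[11:8] | Rs2 or shift amount[7:4] | function[3:0]."""
    return ((_check(opcode, 4, "opcode") << 12) | (_check(rd, 4, "rd") << 8)
            | (_check(rs2, 4, "rs2") << 4) | _check(funct, 4, "funct"))


def encode_immediate(opcode, rd, imm8):
    """Immediate format: opcode[15:12] | Rs1/Rd[11:8] | immediate[7:0]."""
    return ((_check(opcode, 4, "opcode") << 12) | (_check(rd, 4, "rd") << 8)
            | _check(imm8, 8, "imm8"))


def encode_branch(opcode, offset12):
    """Branch format: opcode[15:12] | branch offset[11:0] (raw field, signedness not assumed)."""
    return (_check(opcode, 4, "opcode") << 12) | _check(offset12, 12, "offset12")


print(format(encode_register(0b1101, 1, 2, 0b0000), "016b"))  # 1101000100100000 (Add r1, r2)
print(format(encode_immediate(0b1000, 1, 16), "016b"))        # 1000000100010000 (Movi r1, 16)
print(format(encode_branch(0b1001, 5), "016b"))               # 1001000000000101 (Beq 5)
```

The three printed words match the corresponding entries of the test program listed in Appendix E.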

Appendix C: Simulation Waveforms


Diagram 7 below shows the result of gating the input signals of the ALU and the multiplier. In any given clock cycle, either the multiplier or only one module of the ALU receives the input signals. The remaining modules are disconnected from the circuit by transmission gates. As a result, most of the modules are in the HIGH Z state (represented by the blue lines) except when they are needed to execute the current instruction. For example, since the given sequence of instructions contains only two multiplication operations, the input to the MAC is defined for only two clock cycles. Similar observations can be made for all the other instructions.
Diagram 7: Input Signal Gating for ALU and MAC
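The input gating behaviour can be mimicked in a few lines of Python. This is an illustrative sketch, not the VHDL used in the design: the unit names and the function gate_operands are hypothetical, and the string "Z" merely stands in for the high-impedance state seen in the waveform.

```python
HIGH_Z = "Z"  # stand-in for the high-impedance state
UNITS = ("adder", "logic", "shifter", "multiplier")  # hypothetical unit names


def gate_operands(active_unit, a, b):
    """Route operands (a, b) to the active unit only; every other unit sees HIGH_Z."""
    if active_unit not in UNITS:
        raise ValueError(f"unknown unit: {active_unit}")
    return {
        unit: ((a, b) if unit == active_unit else (HIGH_Z, HIGH_Z))
        for unit in UNITS
    }


inputs = gate_operands("multiplier", 16, 12)
print(inputs["multiplier"], inputs["adder"])  # (16, 12) ('Z', 'Z')
```

Because the idle units never see a changing operand, their internal nodes do not toggle, which is the source of the power saving.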

Diagram 8 below shows the simulation result of clock gating for the ALU Control Unit and the Branch Adder. The inputs to the Branch Adder change only twice because the program contains only two branch instructions. For any other instruction, the inputs remain unchanged as a result of clock gating. Similarly, the inputs to the ALU Control Unit remain unchanged when the current instruction does not require the ALU (for example, multiplication and branch instructions).
Diagram 8: Clock Gating for ALU Control Unit and Branch Adder
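The effect of clock gating on the Branch Adder inputs can be sketched with a small Python model. The class GatedRegister and the function run are hypothetical names, and the model is a simplification: it assumes the gated register captures only the 12-bit branch offset, whereas the real datapath is described in the architecture section.

```python
BRANCH_OPCODES = {0b1001, 0b1010, 0b1011}  # Beq, Bne, Ba


class GatedRegister:
    """Register that latches new data only on clock edges where enable is true."""

    def __init__(self, value=0):
        self.value = value
        self.updates = 0  # number of clock edges that actually reached the register

    def clock(self, data, enable):
        if enable:
            self.value = data
            self.updates += 1
        return self.value


def run(program):
    """Clock the gated register once per 16-bit instruction word."""
    reg = GatedRegister()
    for word in program:
        opcode = word >> 12
        reg.clock(word & 0xFFF, enable=opcode in BRANCH_OPCODES)
    return reg


# Movi r1, 16; Beq 5; Add r1, r2; Beq 2; Nop
program = [0x8110, 0x9005, 0xD120, 0x9002, 0x0000]
reg = run(program)
print(reg.updates, reg.value)  # 2 2
```

With two branch instructions in the sequence, the register is updated exactly twice, mirroring the two input changes visible in the waveform.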

Appendix D: Synthesis Report


=========================================================================
* Final Report *
=========================================================================
Final Results
RTL Top Level Output File Name : processor_top.ngr
Top Level Output File Name : processor_top
Output Format : NGC
Optimization Criterion : Speed
Keep Hierarchy : NO
Macro Generator : macro+

Design Statistics
# IOs : 66

Macro Statistics :
# Registers : 38
# 1-bit register : 17
# 12-bit register : 1
# 16-bit register : 13
# 2-bit register : 1
# 4-bit register : 5
# 8-bit register : 1
# Tristates : 18
# 1-bit tristate buffer : 4
# 16-bit tristate buffer : 13
# 4-bit tristate buffer : 1
# Adders/Subtractors : 1
# 16-bit adder : 1
# Comparators : 8
# 4-bit comparator equal : 6
# 4-bit comparator not equal : 2
# Xors : 201
# 1-bit xor3 : 201

Cell Usage :
# BELS : 2330
# GND : 1
# LUT1 : 15
# LUT1_L : 1
# LUT2 : 136
# LUT2_D : 24
# LUT2_L : 7
# LUT3 : 365
# LUT3_D : 58
# LUT3_L : 23
# LUT4 : 1247
# LUT4_D : 125
# LUT4_L : 137
# MUXCY : 15
# MUXF5 : 144
# rom32x1 : 16
# VCC : 1
# XORCY : 15
# FlipFlops/Latches : 519
# FD : 175
# FDCE : 16
# FDE : 72
# LD : 256
# RAMS : 2
# ram16x8s : 2
# Tri-States : 215
# BUFT : 215
# Clock Buffers : 1
# BUFGP : 1
# IO Buffers : 65
# IBUF : 1
# OBUF : 64
=========================================================================

Device utilization summary:


---------------------------

Selected Device : 2v250fg456-6

Number of Slices: 1165 out of 1536 75%


Number of Slice Flip Flops: 519 out of 3072 16%
Number of 4 input LUTs: 2172 out of 3072 70%
Number of bonded IOBs: 65 out of 200 32%
Number of TBUFs: 215 out of 768 27%
Number of GCLKs: 1 out of 16 6%

Timing Summary:
---------------
Speed Grade: -6

Minimum period: 23.028ns (Maximum Frequency: 43.425MHz)


Minimum input arrival time before clock: 2.588ns
Maximum output required time after clock: 9.881ns

Timing constraint: Default period analysis for Clock 'clk'


Delay: 23.028ns (Levels of Logic = 21)
Source: pipe_mem_web_inst_11_8_out_2
Destination: pipe_ex_mem_mult_up_16_out_14
Source Clock: clk rising
Destination Clock: clk rising

Data Path: pipe_mem_web_inst_11_8_out_2 to pipe_ex_mem_mult_up_16_out_14


Gate Net
Cell:in->out fanout Delay Delay Logical Name (Net Name)
---------------------------------------- ------------
FD:c->q 14 0.449 1.203 pipe_mem_web_inst_11_8_out_2
(pipe_mem_web_inst_11_8_out_2)
LUT2:i1->o 4 0.347 0.818 data_forward_ker230351 (data_forward_n23037)
LUT4_L:I0->LO 1 0.347 0.100 data_forward__n001121 (choice4520)
LUT4:i2->o 1 0.347 0.312 data_forward__n001133 (choice4522)
LUT4_L:I2->LO 1 0.347 0.100 data_forward__n001169 (choice4526)
LUT4:i3->o 18 0.347 1.386 data_forward__n0011173 (choice4549)
LUT4:i3->o 16 0.347 1.294 execute_stg__n00061 (execute_stg__n0006)
LUT4_L:I0->LO 1 0.347 0.100 execute_stg_mmux_alu_pin_2_input_i0_result38
(choice4238)
LUT4:i2->o 7 0.347 0.932 execute_stg_mmux_alu_pin_2_input_i0_result77
(execute_stg_alu_pin_2_input<15>)
BUFT:i->o 22 0.433 1.569 execute_stg_t_g_mult_2_i0_0
(execute_stg_t_g_mult2_out<15>)
LUT4_D:I3->O 9 0.347 1.010 lut_167274711 (execute_stg_mac_ppg_pp2<29>)
LUT3:i2->o 12 0.347 1.125
execute_stg_mac_wallace_add_columns31_c1_mxor_sum_
result1
(execute_stg_mac_wallace_add_columns31_c1s)
LUT3:i1->o 5 0.347 0.855 execute_stg_mac_wallace_add_columns20_c4_cout1
(execute_stg_mac_wallace_add_cin3<21>)
LUT4_L:I0->LO 1 0.347 0.100
execute_stg_mac_wallace_add_columns21_c7_mxor_sum_
result1_sw0 (n74368)
LUT4:i0->o 5 0.347 0.855
execute_stg_mac_wallace_add_columns21_c7_mxor_sum_
result1 (execute_stg_mac_w_sum<21>)
LUT4_D:I0->LO 1 0.347 0.100 execute_stg_mac_cla_add_add_16_19_ker248491_sw0
(N77691)
LUT4:i3->o 5 0.347 0.855 execute_stg_mac_cla_add_add_16_19_ker248491
(execute_stg_mac_cla_add_add_16_19_n24851)
LUT4_D:I1->O 7 0.347 0.932 execute_stg_mac_cla_add_carry2021 (choice4562)
LUT4:i1->o 5 0.347 0.855 execute_stg_mac_cla_add_carry2032_sw0 (n73490)
LUT4_D:I1->LO 1 0.347 0.100 execute_stg_mac_cla_add_carry2432 (N77740)
LUT4:i1->o 1 0.347 0.312 execute_stg_mac_cla_add_add_24_27_mxor_sum<2>_xo08
(choice2213)
LUT4_L:I2->LO 1 0.347 0.000
execute_stg_mac_cla_add_add_24_27_mxor_sum<2>_resu
lt1 (ex_mult_high_16<14>)
FD:d 0.293 pipe_ex_mem_mult_up_16_out_14
----------------------------------------
Total 23.028ns (8.115ns logic, 14.913ns route)
(35.2% logic, 64.8% route)
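The headline figures in the timing summary follow directly from the critical-path delay. The short Python check below reproduces them from the numbers in the report; the variable names are illustrative only.

```python
period_ns = 23.028  # minimum clock period (critical-path delay) from the report
logic_ns = 8.115    # portion of the path spent in logic cells
route_ns = 14.913   # portion of the path spent in routing

max_freq_mhz = 1000.0 / period_ns          # f = 1 / T, with T in ns giving MHz
logic_pct = 100.0 * logic_ns / period_ns
route_pct = 100.0 * route_ns / period_ns

print(f"{max_freq_mhz:.3f} MHz")                          # 43.425 MHz
print(f"{logic_pct:.1f}% logic, {route_pct:.1f}% route")  # 35.2% logic, 64.8% route
```

Nearly two thirds of the critical path is routing delay, and the path runs from the data forwarding logic through the multiplier to the upper 16 bits of the multiplication result.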

Appendix E: VHDL Code for Hex File Based Instruction Memory

VHDL file for creating the hex file:


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

-- Simulation-only helper that writes the test program to "instruction.txt".
entity instr_dump is
    Port ( a : in  std_logic;    -- an event on 'a' resumes the process
           b : out std_logic);   -- not driven
end instr_dump;

architecture Behavioral of instr_dump is

begin
    mem_load: process is
        -- A file type is declared over a type mark, so the 16-bit word
        -- is given a named subtype first.
        subtype word is std_logic_vector(15 downto 0);
        type load_file_type is file of word;
        file load_file: load_file_type open write_mode is "instruction.txt";
    begin
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        write(load_file,std_logic_vector'("1000000100010000")); -- movi R1, D'16'
        write(load_file,std_logic_vector'("1001000000000101")); -- beq D'5'
        write(load_file,std_logic_vector'("1000001000001100")); -- movi R2, D'12'
        write(load_file,std_logic_vector'("1000001100100000")); -- movi R3, D'32'
        write(load_file,std_logic_vector'("1000100100000101")); -- movi R9, D'5'
        write(load_file,std_logic_vector'("0001000100001111")); -- andi R1, D'15'
        write(load_file,std_logic_vector'("1101000100100000")); -- add R1, R2
        write(load_file,std_logic_vector'("1111000100001111")); -- muli R1, D'15'
        write(load_file,std_logic_vector'("1110000100100000")); -- mul R1, R2
        write(load_file,std_logic_vector'("1101000100110100")); -- comp R1, R3
        write(load_file,std_logic_vector'("1001000000000010")); -- beq D'2'
        write(load_file,std_logic_vector'("1101000100101000")); -- sll R1, D'2'
        write(load_file,std_logic_vector'("1101001100100111")); -- not R3, R2
        write(load_file,std_logic_vector'("1101000100110100")); -- comp R1, R3
        write(load_file,std_logic_vector'("0100001011111111")); -- ori R2, D'255'
        write(load_file,std_logic_vector'("0111001100000000")); -- mov R3, R0
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        write(load_file,std_logic_vector'("0000000000000000")); -- nop
        -- Suspend until 'a' changes; each resumption writes the program again.
        wait on a;
    end process mem_load;

end Behavioral;
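The comments in the listing above can be cross-checked against the opcode and function codes of Appendix B with a small disassembler. The Python sketch below is an illustration only: the function name disassemble and the output format are hypothetical, while the code tables are taken from the instruction set description.

```python
IMMEDIATE = {0b0001: "Andi", 0b0010: "Addi", 0b0011: "Subi",
             0b0100: "Ori", 0b1000: "Movi", 0b1111: "Muli"}
REGISTER = {0b0101: "Lw", 0b0110: "Sw", 0b0111: "Mov", 0b1110: "Mul"}
BRANCH = {0b1001: "Beq", 0b1010: "Bne", 0b1011: "Ba"}
ALU_FUNCT = {0b0000: "Add", 0b0001: "Sub", 0b0010: "AddU", 0b0011: "SubU",
             0b0100: "Comp", 0b0101: "And", 0b0110: "Or", 0b0111: "Not",
             0b1000: "Sll", 0b1001: "Srl", 0b1010: "Sla", 0b1011: "Sra",
             0b1100: "Xor"}
SHIFTS = {"Sll", "Srl", "Sla", "Sra"}


def disassemble(bits):
    """Turn a 16-character binary string into an assembly-like text line."""
    if len(bits) != 16:
        raise ValueError("expected a 16-bit word")
    word = int(bits, 2)
    opcode, rd = word >> 12, (word >> 8) & 0xF
    rs2, funct = (word >> 4) & 0xF, word & 0xF
    if opcode == 0b0000:
        return "Nop"
    if opcode in IMMEDIATE:
        return f"{IMMEDIATE[opcode]} r{rd}, {word & 0xFF}"
    if opcode in REGISTER:
        return f"{REGISTER[opcode]} r{rd}, r{rs2}"
    if opcode in BRANCH:
        return f"{BRANCH[opcode]} {word & 0xFFF}"
    if opcode == 0b1100:
        return f"Movy r{rd}"
    # The only remaining opcode is 1101: ALU operations selected by the function field.
    if funct not in ALU_FUNCT:
        raise ValueError(f"unknown function code: {funct:04b}")
    name = ALU_FUNCT[funct]
    if name in SHIFTS:
        return f"{name} r{rd}, {rs2}"  # bits 7-4 hold the shift amount
    return f"{name} r{rd}, r{rs2}"


print(disassemble("1000000100010000"))  # Movi r1, 16
print(disassemble("1101000100110100"))  # Comp r1, r3
print(disassemble("1101000100101000"))  # Sll r1, 2
```

Running every word of the test program through such a check is a quick way to catch a mismatch between a hand-encoded bit pattern and its comment.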

Instruction Memory using the hex file:


library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;   -- provides conv_integer

entity I_memory is
    Port ( address  : in  std_logic_vector(15 downto 0);
           data_out : out std_logic_vector(15 downto 0));
end I_memory;

architecture Behavioral of I_memory is

begin
    mem_behavior: process is
        subtype word is std_logic_vector(15 downto 0);
        -- A 16-bit address selects one of 65536 words (0 to 65535).
        type mem_array is array (0 to 65535) of word;
        variable mem: mem_array;
        type load_file_type is file of word;
        file load_file: load_file_type open read_mode is "instruction.txt";
        variable index: natural;
    begin
        -- Load the program. After the first pass the file is at its end,
        -- so later activations of the process skip this loop.
        index := 0;
        while not endfile(load_file) loop
            read(load_file, mem(index));
            index := index + 1;
        end loop;

        -- Drive the word selected by the current address.
        data_out <= mem(conv_integer(address));

        wait on address;
    end process mem_behavior;

end Behavioral;

Appendix F: FPGA Floorplan

Floorplan using Xilinx Virtex2 xc2v250 FPGA