Final Project Ee577a

University of Southern California
Viterbi School of Engineering

Ming Hsieh Department of Electrical Engineering
EE 577a, Spring 2014, Prof. Nazarian

Final Project: Phase 3

Submitted by:
1. Ren Chen, USC ID: 4844077297
2. Deepak Anand, USC ID: 8115945023
3. Santosh Kumar Karnagaran, USC ID: 4069922119
4. Joydeep Saha, USC ID: 6594466840
5. Nikhil Sarode, USC ID: 6601506789

May 11, 2014

Part 1 Report Summary
1. List of all the designs.
CPU Processor;
Divider;

2. Performance results

Design        | Estimated area  | Clock period (schematic) | LVS result                         | Power*Area*Delay
CPU Processor | 500 um x 196 um | 1.5 ns                   | Only the components passed the LVS | Unknown
Divider       | 300 um x 80 um  | 1.6 ns                   | Only the components passed the LVS | Unknown

3. The basic cells used in the designs
For divider:
3-bit Gray code generator;
Four 16-bit Brent-Kung adders (used as subtractor);
3-to-8 decoder;
Two 8-to-1 Multiplexers;
For CPU:
64-bit XOR unit (CMOS complementary circuit);
64-bit AND unit (Dynamic circuit);
64-bit OR unit (Dynamic circuit);
64-bit ADD unit (using four 16-bit Brent-Kung adders);
1Kbit SRAM (Supports 64-bit Load/Store);
Four 64-bit input registers (for operand A and B);
Two 64-bit output registers (for output results);

Part 2 Divider
1. Schematic

1) The implemented algorithm used for the divider
We used the following algorithm for the divider implementation. The algorithm will
divide the dividend by divisor placing the quotient in Q and the remainder in R.
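
The algorithm figure is not reproduced here. As a rough behavioral sketch only, the following Python model shows one way to read the datapath described in the later parts: two dividend bits enter per clock cycle, and the new remainder is chosen among R, R-D, R-2D, and R-3D. The function name `divide_radix4` and the 16-bit width default are assumptions made for illustration, not names taken from the design.

```python
def divide_radix4(dividend, divisor, width=16):
    """Behavioral model of restoring division that retires two bits per step.

    Assumption: unsigned operands of `width` bits, most significant
    dividend bits consumed first.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    remainder = 0
    quotient = 0
    for step in range(width // 2):
        # Bring in the next two dividend bits (MSB first).
        shift = width - 2 * (step + 1)
        remainder = (remainder << 2) | ((dividend >> shift) & 0b11)
        # Pick the largest multiple of the divisor that still fits.
        digit = 0
        for k in (3, 2, 1):
            if remainder >= k * divisor:
                digit = k
                break
        remainder -= digit * divisor          # R, R-D, R-2D, or R-3D
        quotient = (quotient << 2) | digit    # two quotient bits per step
    return quotient, remainder
```

For a 16-bit dividend the loop runs eight times, which is consistent with the eight-state select counter described in Part 1.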

2) Original Design

As shown above, the divider is composed of six parts. The details of the six parts will be
introduced next.
Part 1

As shown in the figure above, the main components are two 8-to-1 multiplexers, one Gray
code generator, and three 1-bit registers.
The three 1-bit registers are reset to 100 when the divider starts to operate.
After the reset signal, the register values advance with each clock in Gray code order:
000->001->011->010->110->111->101->100.
Because only one select bit changes per clock, glitches in the multiplexers are avoided, which
reduces delay and power consumption.
The two 8-to-1 multiplexers feed two bits of the dividend into the divider in each clock cycle.
Note that, to reduce the delay, we implemented the multiplexers with transmission gates (TGs);
hence we also added a data buffer to isolate the transmission gates from the downstream circuits.
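
The select sequence above is the standard reflected Gray code. A minimal Python sketch (the helper name `gray_sequence` is hypothetical) reproduces it and checks the property the design relies on, namely that exactly one bit changes between consecutive states, including the wrap from the reset value 100 back to 000:

```python
def gray_sequence(bits=3):
    """Return the reflected Gray code sequence for the given bit width."""
    return [i ^ (i >> 1) for i in range(1 << bits)]


def single_bit_steps(seq):
    """True if every consecutive pair (with wrap-around) differs in one bit."""
    n = len(seq)
    return all(bin(seq[i] ^ seq[(i + 1) % n]).count("1") == 1 for i in range(n))


codes = [format(v, "03b") for v in gray_sequence(3)]
print("->".join(codes))
print(single_bit_steps(gray_sequence(3)))
```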

Fig. Gray code generator used in the divider

Fig. Components used in the Gray code generator

Part 2

Fig. Schematic design of Part 2 in the divider
Part 2 is composed of sixteen 4-to-1 multiplexers and one 16-bit register.
The sixteen 4-to-1 multiplexers select which subtractor result becomes the new remainder.
In each clock cycle, the new remainder can be R, R-D, R-2D, or R-3D.
The select bits Q1 and Q0 are generated from the overflow bits of the subtractors.

Part 3

Part 3 consists of three 16-bit registers.
In our earlier version of the divider, we did not have these three 16-bit registers.
To reduce the delay of the critical path, we added them.
By adding these registers, the clock period of the divider was reduced from 2 ns to 1.6 ns.
These registers store the results of the subtractors.

Part 4


Part 4 is composed of a 3-to-8 decoder, sixteen 2-to-1 multiplexers, and a 16-bit register.
The 3-to-8 decoder generates the select bits for the sixteen 2-to-1 multiplexers.
The sixteen 2-to-1 multiplexers route the correct bits to the quotient output.
The 16-bit register is the output register for the quotient.
Part 5

Part 5 is composed of four 16-bit Brent-Kung adders.
In our earlier version of the divider, we used carry-select adders.
To reduce the delay of the critical path, we replaced the carry-select adders with Brent-Kung
adders.
In this way, the clock period of the divider was reduced from 2 ns to 1.6 ns.
The signals A_case1, B_case2, and C_case3 represent the comparison results of the subtractors.
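
For readers unfamiliar with the Brent-Kung structure, the following bit-level Python model sketches a generic 16-bit Brent-Kung parallel-prefix adder. It is a textbook formulation, not a transcription of the schematic; the function name `brent_kung_add` and its parameters are assumptions made for illustration.

```python
def brent_kung_add(a, b, cin=0, width=16):
    """Add two unsigned values with a Brent-Kung prefix tree.

    Returns (sum, carry_out). `width` must be a power of two.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Bitwise propagate and generate signals.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]

    def combine(i, j):
        # Merge the group ending at bit j into the group ending at bit i.
        G[i] = G[i] | (P[i] & G[j])
        P[i] = P[i] & P[j]

    levels = width.bit_length() - 1
    # Up-sweep: build group signals over spans of 2, 4, 8, ... bits.
    for d in range(levels):
        for i in range((2 << d) - 1, width, 2 << d):
            combine(i, i - (1 << d))
    # Down-sweep: fill in the remaining prefixes.
    for d in range(levels - 2, -1, -1):
        for i in range((2 << d) + (1 << d) - 1, width, 2 << d):
            combine(i, i - (1 << d))

    # carries[i] is the carry into bit i; carries[width] is the carry out.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(width)]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]
```

Used as a subtractor, `brent_kung_add(r, ~d & 0xFFFF, 1)` computes `r - d` in two's complement, and the carry-out is 1 exactly when `r >= d`. That carry-out is one plausible source for the comparison signals mentioned above, although the report does not spell this out.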

Fig. Schematic design of 16-bit Brent-Kung in the divider

Part 6

Part 6 is composed of a 2-bit register and a generator that derives Q0 and Q1 from A_case1,
B_case2, and C_case3.
The signals A_case1, B_case2, and C_case3 represent the comparison results of the subtractors.
The signals Q0 and Q1 represent the two quotient bits produced in the current cycle.
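
The report does not give the logic equations of the generator. As a sketch under an explicit assumption, suppose A_case1, B_case2, and C_case3 are 1 when R >= D, R >= 2D, and R >= 3D respectively. Then the quotient digit is simply the number of comparisons that passed, and the encoding reduces to two small equations. The function name `quotient_bits` is hypothetical.

```python
def quotient_bits(a_case1, b_case2, c_case3):
    """Encode three comparison flags (0 or 1) into the quotient bits (Q1, Q0).

    Assumed meaning of the flags: a_case1 = (R >= D), b_case2 = (R >= 2D),
    c_case3 = (R >= 3D), so valid inputs are 000, 100, 110, and 111.
    """
    q1 = b_case2
    q0 = c_case3 | (a_case1 & (b_case2 ^ 1))
    return q1, q0


for flags in [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)]:
    print(flags, "->", quotient_bits(*flags))
```

If the actual signals are active-low or encode borrows instead, the equations would need to be inverted accordingly.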

Fig. Schematic design of the S0S1 generator

3) Layout of the design components

Layout of the 16-bit Brent-Kung adder:

Layout of the 3-bit gray code generator:

Layout of the 3-to-8 decoder:

Layout of the S0S1Gen:

Layout of the 8-to-1 Mux:

4) Layout of the final design

5) Simulation result

Fig. Simulation result of the given testbench during the demo (Quotient)
The quotient result during 20 ns to 25 ns: 1101 0011 = D3
The quotient result during 40 ns to 45 ns: 0000 0110 = 06

Fig. Simulation result of the given testbench during the demo (Remainder)
The remainder result during 20 ns to 25 ns: 0110 1000 = 68
The remainder result during 40 ns to 45 ns: 0000 0110 = 06

Part 2 CPU Processor
1. Schematic

1) Original Design

As shown above, the CPU processor is composed of five parts. The details of the five parts are
introduced next.
Note that the schematic design makes extensive use of wire labels.

Part 1

Fig. The schematic design of the part 1 in the CPU
As shown in the figure above, Part 1 includes:
64 4-to-1 multiplexers, used to shift the 64-bit load data to ensure correct 16-bit and 32-bit
operations;
64 1-to-2 demultiplexers, used to route the load data to operand A or operand B;
64 2-to-1 multiplexers, used to select between the data from SRAM and the immediate number for
operand B;
64 1-to-2 demultiplexers, used to route operand A to the 64-bit adder or to the ALU, which includes
the 64-bit XOR, 64-bit AND, and 64-bit OR units;
64 1-to-2 demultiplexers, used to route operand B to the 64-bit adder or to the ALU, which includes
the 64-bit XOR, 64-bit AND, and 64-bit OR units.

Fig. 4-to-1 multiplexer

Fig. 1-to-2 demultiplexer

Part 2

Fig. Schematic design of Part 2 in the CPU
Part 2 is composed of the 64-bit XOR, OR, and AND units.
The details of Part 2 are shown in the figure below:

Fig. Schematic design of the EX1 stage (Part 2) in the CPU
From the figure above, we can see that after operand A and operand B enter the EX1
stage, both are stored in the two 64-bit input registers. Operand A and operand B are then
routed to the 64-bit XOR, OR, or AND unit according to the given operation.
The results of the 64-bit XOR, OR, and AND units are multiplexed onto a single output; 64 4-to-1
multiplexers are used for this purpose.
Each of the three units takes one clock cycle to finish its operation.
64-bit Static XOR unit

Fig. Schematic design of the 64-bit static XOR unit

Fig. Schematic design of the 1-bit static XOR unit

64-bit Dynamic AND unit

Fig. Schematic design of the 64-bit dynamic AND unit

Fig. Schematic design of the 1-bit dynamic AND unit

64-bit Dynamic OR unit

Fig. Schematic design of the 64-bit dynamic OR unit

Fig. Schematic design of the 1-bit dynamic OR unit

Clock gating unit

Fig. Schematic design of clock gating unit used for dynamic units
Part 3

Part 3 is the 64-bit adder.
We use four 16-bit Brent-Kung adders to build the 64-bit adder using a carry chain structure.

Overview of the schematic design of the 64-bit adder

Fig. Schematic design of the 64-bit adder
The 64-bit adder takes four cycles to finish a 64-bit addition.
To improve the CPU performance, we reordered the instructions in the testbench
so that the data for additions are always loaded first.
The addition is then executed in parallel with one of the OR, XOR, and AND
operations. In this way, we overlapped the execution time of the addition with the other
operations.
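
The carry chain can be illustrated with a small behavioral model. The sketch below assumes that one 16-bit slice is resolved per clock cycle, starting from the least significant slice, with the carry-out of each slice feeding the next. This is one reading of the four-cycle latency, and the function name `add64_in_four_cycles` is hypothetical. Plain integer addition stands in for each 16-bit Brent-Kung adder.

```python
def add64_in_four_cycles(a, b):
    """Add two 64-bit values as four chained 16-bit slice additions.

    Returns (sum, carry_out). Each loop iteration models one clock cycle.
    """
    carry = 0
    result = 0
    for cycle in range(4):
        shift = 16 * cycle
        slice_a = (a >> shift) & 0xFFFF
        slice_b = (b >> shift) & 0xFFFF
        total = slice_a + slice_b + carry      # stands in for one 16-bit adder
        result |= (total & 0xFFFF) << shift
        carry = total >> 16                    # carry into the next slice
    return result, carry
```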
Recirculate unit

Fig. Schematic design of the 1-bit recirculate unit
We used the recirculate unit to implement the multi-cycle adder.
The input registers for the adder operands are enabled when the operands are loaded.
The input registers are then disabled, holding their values, until the next addition.
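
A recirculating register is commonly built as a flip-flop with a 2-to-1 multiplexer that feeds either new data or the current output back to the input. The Python model below sketches that behavior; the class name `RecirculatingRegister` and its interface are assumptions made for illustration, not a description of the schematic.

```python
class RecirculatingRegister:
    """Register that loads new data when enabled and otherwise holds its value."""

    def __init__(self, width=64, reset_value=0):
        self.mask = (1 << width) - 1
        self.q = reset_value & self.mask

    def clock(self, d, enable):
        """Model one rising clock edge and return the stored value."""
        # The 2-to-1 mux picks new data when enabled, else recirculates q.
        self.q = (d & self.mask) if enable else self.q
        return self.q


reg = RecirculatingRegister()
reg.clock(0xABCD, enable=True)    # operand loaded
reg.clock(0x1234, enable=False)   # held while the addition completes
print(hex(reg.q))
```

Holding the operands stable for the full four cycles is what lets the multi-cycle adder finish before its inputs change.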
Part 4

Fig. Schematic design of Part 4 (EX2 stage) in the CPU

Fig. Schematic design of EX2 stage in the CPU

Fig. Schematic design of 1-bit Unit used in the EX2 stage in the CPU
The EX2 stage registers the 64-bit result from the adder and the 64-bit
result from one of the AND, XOR, and OR units.
The 1-bit unit used in the EX2 stage registers one bit of the result.
At the top level of the EX2 stage, 64 such 1-bit units are used in total.

Part 5

Part 5 multiplexes the output results from the EX2 stage and the input data supplied by the
Perl-generated vectors.
The output of Part 5 is written into the SRAM.
Part 6

Part 6 is the SRAM design.
In Lab 2, we designed an SRAM supporting 16-bit data loads and stores.
To reduce the delay of loads and stores, we redesigned the SRAM so that it supports 64-bit data
loads and stores.
However, this introduces an issue: the results of 16-bit operations may always be written to the
first 16 bits of a 64-bit SRAM line. To solve this issue, we added a unit that shifts
the data loaded from SRAM. This unit was introduced in the Part 1 design.
We also optimized the SRAM decoder, which reduced the clock period of the SRAM from 2 ns
to 0.8 ns.
In this way, our final CPU processor can operate correctly with a clock period of 1.5 ns.
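
The shifting unit can be pictured as a 4-to-1 selection among shifted copies of the loaded word. The sketch below assumes shift amounts of 0, 16, 32, and 48 bits toward the least significant end, chosen by a 2-bit select. Both the shift direction and the amounts are assumptions, since the report does not state them, and the name `shift_loaded_word` is hypothetical.

```python
MASK64 = (1 << 64) - 1


def shift_loaded_word(word, sel):
    """Return the 64-bit word shifted right by 16 * sel bits (sel in 0..3)."""
    if sel not in (0, 1, 2, 3):
        raise ValueError("sel must be a 2-bit value")
    return (word & MASK64) >> (16 * sel)


line = 0x1111222233334444
# Select the third 16-bit field of the line and align it to bit 0.
print(hex(shift_loaded_word(line, 2) & 0xFFFF))
```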

2) Layout of the design components in the CPU

Part 1

Fig. The layout design of the part 1 in the CPU

Fig. 4-to-1 multiplexer

Fig. 1-to-2 demultiplexer

Part 2

Fig. Layout design of Part 2 in the CPU
Static XOR unit

Fig. Layout design of the 1-bit static XOR unit

Dynamic AND unit

Fig. Layout design of the 1-bit dynamic AND unit

Dynamic OR unit

Fig. Layout design of the 1-bit dynamic OR unit

Part 3

Fig. Layout design of Part 3 (64-bit Adder) in the CPU
Recirculate unit

Fig. Layout design of the 2-bit recirculate unit

Part 4

Fig. Layout design of Part 4 (EX2 stage) in the CPU

Fig. Layout design of 2-bit Unit used in the EX2 stage in the CPU
Part 5

Fig. Layout design of Part 5 (for two bits) in the CPU

Part 6

Fig. Layout design of Part 6 in the CPU

The final layout of the entire CPU Processor

3) Simulation result

Fig. Simulation result of the given testbench during the demo (SRAM Output 0~20)



Part 3 PERL
1. Perl Code Overview

1) Flowchart

Fig. The flowchart of the Perl program
Steps shown in the flowchart:
Read input file;
Reorder instructions;
Generate loads and stores for arithmetic units;
SRAM access address registering;
Vector generation;
Print vector file;
2) Details of each step in the flowchart

Parameter definition

Print the header of the vector file

Reorder the instructions so that additions are executed first

Reorder the instructions to avoid consecutive instructions of the same type

Generate stores and loads for arithmetic operations

Add delay to the stores for additions so that result of adders can be stored with
correct data

SRAM address registering, to make sure addresses have been registered in the
previous clock cycle.

The main function to generate the vector file
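
The Perl listings for these steps are not reproduced here. As a loose illustration of the first reordering step only, the Python sketch below moves additions to the front while preserving the relative order within each group. The instruction format (a tuple whose first element is the opcode) and the function name `adds_first` are hypothetical, and the sketch ignores data dependencies between instructions, which a real reordering pass would have to respect.

```python
def adds_first(instructions):
    """Stable partition: ADD instructions first, all others after."""
    adds = [ins for ins in instructions if ins[0] == "ADD"]
    others = [ins for ins in instructions if ins[0] != "ADD"]
    return adds + others


program = [("XOR", 1, 2), ("ADD", 3, 4), ("AND", 5, 6), ("ADD", 7, 8)]
print(adds_first(program))
```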

3) Main subroutines
Only a few subroutines are listed here; please refer to the source code for the
details.

Subroutine 1: read each line

Subroutine 2: Vector generation for load

Subroutine 3: Check burst length and data

Subroutine 4: Control bit generation for SRAM access

Subroutine 5: Immediate (Imm) generation
