Part 1 Report Summary
1. List of all the designs
CPU Processor; Divider
2. Performance results
Design | Estimated area | Clock period (schematic) | LVS result | Power*Area*Delay
CPU Processor | 500 um x 196 um | 1.5 ns | Only the components passed LVS | Unknown
Divider | 300 um x 80 um | 1.6 ns | Only the components passed LVS | Unknown
3. The basic cells used in the designs
For divider:
3-bit Gray code generator;
Four 16-bit Brent-Kung adders (used as subtractors);
3-to-8 decoder;
Two 8-to-1 multiplexers;
For CPU:
64-bit XOR unit (CMOS complementary circuit);
64-bit AND unit (dynamic circuit);
64-bit OR unit (dynamic circuit);
64-bit ADD unit (using four 16-bit Brent-Kung adders);
1 Kbit SRAM (supports 64-bit load/store);
Four 64-bit input registers (for operands A and B);
Two 64-bit output registers (for output results);
Part 2 Divider
1. Schematic
1) The implemented algorithm used for the divider
We used the following algorithm for the divider implementation. The algorithm divides the dividend by the divisor, placing the quotient in Q and the remainder in R.
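The original algorithm listing is not reproduced in this text. As a rough behavioral sketch consistent with the description in the later parts (two dividend bits consumed per clock cycle, next remainder chosen among R, R-D, R-2D, and R-3D), the division could be modeled as follows. The function name, the 16-bit width, and the loop structure are assumptions for illustration, not the actual design.

```python
def radix4_divide(dividend, divisor, width=16):
    """Behavioral model of a radix-4 divider (assumed structure).

    Each iteration stands for one clock cycle: two dividend bits are
    shifted into the remainder R, and a 2-bit quotient digit is chosen
    by comparing R against D, 2D, and 3D.
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    q = 0
    r = 0
    for cycle in range(width // 2):
        # Take the next two dividend bits, most significant pair first.
        shift = width - 2 * (cycle + 1)
        two_bits = (dividend >> shift) & 0b11
        r = (r << 2) | two_bits
        # Pick the largest multiple of the divisor that still fits in R.
        digit = 0
        for k in (3, 2, 1):
            if r >= k * divisor:
                digit = k
                break
        r -= digit * divisor          # R, R-D, R-2D, or R-3D
        q = (q << 2) | digit          # append the 2-bit quotient digit
    return q, r
```

For a 16-bit dividend this loop runs eight times, which matches the eight states of the 3-bit Gray code counter described in Part 1 of the divider.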
2) Original Design
As shown above, the divider is composed of six parts (labeled Part 1 to Part 6 in the figure). Each part is described below.
Part 1
As shown in the figure above, the main components are two 8-to-1 multiplexers, one Gray code generator, and three 1-bit registers. The three 1-bit registers are reset to 100 when the divider starts to operate. After the reset signal, their values advance with the clock in the sequence 000->001->011->010->110->111->101->100. Because only one select bit changes per clock cycle, glitches in the multiplexers are avoided, which reduces delay and power consumption. The two 8-to-1 multiplexers feed two bits of the dividend into the divider in each clock cycle. Note that, to reduce the delay, we used transmission gates (TG) to implement the multiplexers; we therefore also added one data buffer to separate the transmission gates from the downstream circuits.
Fig. Gray code generator used in the divider
Fig. Components used in the Gray code generator
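The counting sequence described above can be checked with a small behavioral model. This is only a sketch: the function name next_gray is hypothetical, and the model converts through binary rather than mirroring the gate-level generator.

```python
def next_gray(g, bits=3):
    """Return the Gray code that follows g in a wrap-around count."""
    # Gray -> binary: XOR together g, g>>1, g>>2, ...
    b = 0
    x = g
    while x:
        b ^= x
        x >>= 1
    # Advance the binary count, wrapping at 2**bits.
    b = (b + 1) % (1 << bits)
    # Binary -> Gray.
    return b ^ (b >> 1)


# Starting from the reset value 100, list the next eight states.
state = 0b100
sequence = []
for _ in range(8):
    state = next_gray(state)
    sequence.append(format(state, "03b"))
print("->".join(sequence))
```

Each pair of consecutive states differs in exactly one bit, which is the property that keeps the multiplexer select lines from changing together.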
Part 2
Fig. Schematic design of Part 2 in the divider
Part 2 is composed of 16 4-to-1 multiplexers and one 16-bit register. The 16 4-to-1 multiplexers select which subtractor result becomes the new remainder. In each clock cycle, the new remainder can be R, R-D, R-2D, or R-3D. The select bits Q1 and Q0 are generated from the overflow bits of the subtractors.
Part 3
Fig. Schematic design of Part 3 in the divider
Part 3 includes three 16-bit registers, which store the results of the subtractors. Our earlier version of the divider did not have these registers; we added them to reduce the delay of the critical path. With these registers, the clock period of the divider is reduced from 2 ns to 1.6 ns.
Part 4
Fig. Schematic design of Part 4 in the divider
Part 4 is composed of a 3-to-8 decoder, 16 2-to-1 multiplexers, and a 16-bit register. The 3-to-8 decoder generates the select bits for the 16 2-to-1 multiplexers. The 16 2-to-1 multiplexers route the correct bits to the quotient output. The 16-bit register is the output register for the quotient.
Part 5
Fig. Schematic design of Part 5 in the divider
Part 5 is composed of four 16-bit Brent-Kung adders. Our earlier version of the divider used carry-select adders. To reduce the delay of the critical path, we replaced the carry-select adders with Brent-Kung adders. In this way, the clock period of the divider is reduced from 2 ns to 1.6 ns. The signals A_case1, B_case2, and C_case3 represent the comparison results of the subtractors.
Fig. Schematic design of 16-bit Brent-Kung in the divider
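For readers unfamiliar with the structure, the following is a minimal bit-level sketch of a Brent-Kung parallel-prefix adder and its use as a subtractor. It is a textbook-style model, not a transcription of the schematic; the function names and the way the carry-in is folded in are assumptions for illustration.

```python
def brent_kung_add(a, b, cin=0, width=16):
    """Add two width-bit values with a Brent-Kung prefix tree.

    Returns (sum, carry_out). width must be a power of two.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    # Per-bit propagate and generate signals.
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]  # group signals, combined in place by the tree

    # Up-sweep: combine spans of 2, 4, 8, ... bits.
    d = 1
    while d < width:
        for i in range(2 * d - 1, width, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d *= 2

    # Down-sweep: fill in the prefixes the up-sweep skipped.
    d = width // 4
    while d >= 1:
        for i in range(3 * d - 1, width, 2 * d):
            G[i] |= P[i] & G[i - d]
            P[i] &= P[i - d]
        d //= 2

    # G[i], P[i] now cover bits 0..i; derive the carry into each bit.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(width)]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


def brent_kung_sub(a, b, width=16):
    """Compute a - b as a + ~b + 1; carry_out == 1 means a >= b."""
    mask = (1 << width) - 1
    return brent_kung_add(a, ~b & mask, cin=1, width=width)
```

When the adder is used as a subtractor in this way, the carry-out doubles as a comparison result, which is one plausible source for signals such as A_case1, B_case2, and C_case3.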
Part 6
Fig. Schematic design of Part 6 in the divider
Part 6 is composed of a 2-bit register and a generator that derives Q0 and Q1 from A_case1, B_case2, and C_case3. The signals A_case1, B_case2, and C_case3 represent the comparison results of the subtractors. The signals Q0 and Q1 are the two quotient bits produced in the current cycle.
Fig. Schematic design of S0S1 generator
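The exact logic of the generator is only shown in the figure. One possible encoding is sketched below, under the assumption that A_case1, B_case2, and C_case3 are active-high flags meaning R >= D, R >= 2D, and R >= 3D respectively; the actual polarity and meaning in the schematic may differ. The function names are hypothetical.

```python
def quotient_digit(a_case1, b_case2, c_case3):
    """Encode three comparison flags (0 or 1) into quotient bits (Q1, Q0).

    Assumed meaning: a_case1 = (R >= D), b_case2 = (R >= 2D),
    c_case3 = (R >= 3D).
    """
    q1 = b_case2
    q0 = c_case3 | (a_case1 & (1 - b_case2))
    return q1, q0


def select_remainder(r, d):
    """Pick the next remainder among R, R-D, R-2D, R-3D using (Q1, Q0)."""
    flags = (int(r >= d), int(r >= 2 * d), int(r >= 3 * d))
    q1, q0 = quotient_digit(*flags)
    candidates = [r, r - d, r - 2 * d, r - 3 * d]  # 4-to-1 mux inputs
    return candidates[2 * q1 + q0], (q1, q0)
```

Here the pair (Q1, Q0) serves both as the partial quotient and as the select input of the 4-to-1 multiplexers in Part 2.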
3) Layout of the design components
Layout of the 16-bit Brent-Kung adder:
Layout of the 3-bit Gray code generator:
Layout of the 3-to-8 decoder:
Layout of the S0S1Gen:
Layout of the 8-to-1 Mux:
4) Layout of the final design
5) Simulation result
Fig. Simulation result of the given testbench during the demo (Quotient)
The quotient result during 20 ns to 25 ns: 1101 0011 = D3
The quotient result during 40 ns to 45 ns: 0000 0110 = 06
Fig. Simulation result of the given testbench during the demo (Remainder)
The remainder result during 20 ns to 25 ns: 0110 1000 = 68
The remainder result during 40 ns to 45 ns: 0000 0110 = 06
Part 2 CPU Processor
1. Schematic
1) Original Design
As shown above, the CPU processor is composed of five parts (labeled Part 1 to Part 5 in the figure). Each part is described below. Note that we used many wire labels in the schematic design.
Part 1
Fig. The schematic design of Part 1 in the CPU
As shown in the figure above, Part 1 includes:
64 4-to-1 multiplexers, used to shift the 64-bit load data to ensure correct 16-bit and 32-bit operations;
64 1-to-2 demultiplexers, used to route the load data to operand A or operand B;
64 2-to-1 multiplexers, used to select between the data from SRAM and the immediate number for operand B;
64 1-to-2 demultiplexers, used to route operand A to the 64-bit adder or to the ALU (the 64-bit XOR, 64-bit AND, and 64-bit OR units);
64 1-to-2 demultiplexers, used to route operand B to the 64-bit adder or to the ALU (the 64-bit XOR, 64-bit AND, and 64-bit OR units).
Fig. 4-to-1 multiplexer
Fig. 1-to-2 demultiplexer
Part 2
Fig. Schematic design of Part 2 in the CPU
Part 2 is composed of the 64-bit XOR, OR, and AND units. The details of Part 2 are shown in the figure below:
Fig. Schematic design of the EX1 stage (Part 2) in the CPU
As the figure above shows, when operand A and operand B enter the EX1 stage, both are stored in the two 64-bit input registers. Operand A and operand B are then routed to the 64-bit XOR, OR, or AND unit according to the given operation. The results of the 64-bit XOR, OR, and AND units are multiplexed onto the output, so 64 4-to-1 multiplexers are used to output the results. Each of the three units takes one clock cycle to complete its operation.
64-bit Static XOR unit
Fig. Schematic design of the 64-bit static XOR unit
Fig. Schematic design of the 1-bit static XOR unit
64-bit Dynamic AND unit
Fig. Schematic design of the 64-bit dynamic AND unit
Fig. Schematic design of the 1-bit dynamic AND unit
64-bit Dynamic OR unit
Fig. Schematic design of the 64-bit dynamic OR unit
Fig. Schematic design of the 1-bit dynamic OR unit
Clock gating unit
Fig. Schematic design of clock gating unit used for dynamic units
Part 3
Fig. Schematic design of Part 3 in the CPU
Part 3 is the 64-bit adder. We use four 16-bit Brent-Kung adders, connected in a carry chain, to build the final 64-bit adder.
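The carry-chain composition can be illustrated with a short behavioral model. This is a sketch only: each 16-bit slice is modeled with ordinary integer addition rather than a Brent-Kung tree, and processing one slice per loop step is an assumption made to mirror a multi-cycle addition, not a statement of how the hardware is scheduled.

```python
MASK16 = 0xFFFF


def add64_carry_chain(a, b):
    """Add two 64-bit values as four 16-bit slices linked by their carries.

    Returns (sum, carry_out).
    """
    carry = 0
    result = 0
    for step in range(4):              # least significant slice first
        shift = 16 * step
        slice_a = (a >> shift) & MASK16
        slice_b = (b >> shift) & MASK16
        s = slice_a + slice_b + carry  # one 16-bit adder with carry-in
        result |= (s & MASK16) << shift
        carry = s >> 16                # carry-out feeds the next slice
    return result, carry
```

The model makes the dependency explicit: slice k cannot finish until the carry from slice k-1 is known, which is why the full 64-bit result takes longer than a single 16-bit addition.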
Overview of the schematic design of the 64-bit adder
Fig. Schematic design of the 64-bit adder
The 64-bit adder takes four cycles to finish a 64-bit addition. To improve the CPU performance, we reordered the instructions in the testbench so that the data for an addition are always loaded first. The addition is then executed in parallel with one of the OR, XOR, and AND operations. In this way, the execution time of the addition overlaps with that of the other operations.
Recirculate unit
Fig. Schematic design of the 1-bit recirculate unit
We used the recirculate unit to implement the multi-cycle adder. The input registers for the adder operands are enabled when the operands are loaded, and are then disabled until the next addition.
Part 4
Fig. Schematic design of Part 4 (EX2 stage) in the CPU
Fig. Schematic design of EX2 stage in the CPU
Fig. Schematic design of 1-bit Unit used in the EX2 stage in the CPU
The EX2 stage registers the 64-bit result from the adder and the 64-bit result from one of the AND, XOR, and OR units. The 1-bit unit used in the EX2 stage registers one bit of the result. At the top level of the EX2 stage, 64 such 1-bit units are used in total.
Part 5
Fig. Schematic design of Part 5 in the CPU
Part 5 multiplexes the output results from the EX2 stage and the input data from Perl. The output of Part 5 is written into the SRAM.
Part 6
Fig. Schematic design of Part 6 in the CPU
Part 6 is the SRAM design. In Lab 2, we designed an SRAM supporting 16-bit data load and store. To reduce the delay of load and store, we redesigned the SRAM so that it supports 64-bit data load and store. However, this also introduces an issue: the results of 16-bit operations may always be written to the first 16 bits of a 64-bit SRAM line. To solve this issue, we added the unit that shifts the data loaded from the SRAM, which was introduced in the Part 1 design. We also optimized the SRAM decoder and reduced the clock period of the SRAM from 2 ns to 0.8 ns. As a result, our final CPU processor operates correctly with a clock period of 1.5 ns.
2) Layout of the design components in the CPU
Part 1
Fig. The layout design of the part 1 in the CPU
Fig. 4-to-1 multiplexer
Fig. 1-to-2 demultiplexer
Part 2
Fig. Layout design of Part 2 in the CPU
Static XOR unit
Fig. Layout design of the 1-bit static XOR unit
Dynamic AND unit
Fig. Layout design of the 1-bit dynamic AND unit
Dynamic OR unit
Fig. Layout design of the 1-bit dynamic OR unit
Part 3
Fig. Layout design of Part 3 (64-bit Adder) in the CPU
Recirculate unit
Fig. Layout design of the 2-bit recirculate unit
Part 4
Fig. Layout design of Part 4 (EX2 stage) in the CPU
Fig. Layout design of 2-bit Unit used in the EX2 stage in the CPU
Part 5
Fig. Layout design of Part 5 (for two bits) in the CPU
Part 6
Fig. Layout design of Part 6 in the CPU
The final layout of the entire CPU Processor
3) Simulation result
Fig. Simulation result of the given testbench during the demo (SRAM Output 0~20)
Fig. Simulation result of the given testbench during the demo (SRAM Output 41~21)
Fig. Simulation result of the given testbench during the demo (SRAM Output 63~42)
Part 3 PERL
1. Perl Code Overview
1) Flowchart
Fig. The flowchart of the Perl program
Read input file -> Reorder instructions -> Generate loads and stores for arithmetic units -> SRAM access address registering -> Vector generation -> Print vector file
2) Details of each step in the flowchart
Parameter definition
Print the header of the vector file
Reorder the instructions so that additions are executed first
Reorder the instructions to avoid consecutive instructions of the same type
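The Perl source for these two reordering steps is not reproduced in this text. The idea can be sketched in Python as a greedy pass, assuming the instructions are independent of each other (no data dependencies) and that each one is a hypothetical (opcode, payload) tuple; the real program's data structures and tie-breaking rules may differ.

```python
from collections import defaultdict, deque


def reorder(instructions):
    """Greedy reordering sketch.

    Starts with an ADD when one exists, then avoids placing two
    instructions with the same opcode back to back whenever another
    opcode is still available.
    """
    queues = defaultdict(deque)
    for ins in instructions:
        queues[ins[0]].append(ins)

    ordered = []
    last = None
    while any(queues.values()):
        # Opcodes that still have work and differ from the previous one.
        candidates = [op for op, q in queues.items() if q and op != last]
        if not candidates:
            # Only the previous opcode is left: a repeat is unavoidable.
            candidates = [op for op, q in queues.items() if q]
        if not ordered and queues["ADD"]:
            op = "ADD"
        else:
            # Prefer the opcode with the most instructions remaining.
            op = max(candidates, key=lambda o: len(queues[o]))
        ordered.append(queues[op].popleft())
        last = op
    return ordered
```

Picking the opcode with the most remaining instructions is a common heuristic for keeping repeats to a minimum; it is a design choice of this sketch, not something stated in the report.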
Generate stores and loads for arithmetic operations
Add delay to the stores for additions so that result of adders can be stored with correct data
SRAM address registering, to make sure addresses have been registered in the previous clock cycle.
The main function to generate the vector file
3) Main subroutines
Only several subroutines are listed here; please refer to the source code for the details.
Subroutine 1: read each line
Subroutine 2: Vector generation for load
Subroutine 3: Check burst length and data
Subroutine 4: Control bit generation for SRAM access
EE 552 (Logic Design and Switching Theory) Project: Quantitative Measurement of The Benefits of Reduction Techniques For Asynchronous Finite State Machines