ECE 327 Slides VHDL Verilog Digital Hardware Design

E&CE 327: Digital Systems Engineering Lecture Slides
Mark Aagaard 2011t1Winter University of Waterloo Dept of Electrical and Computer Engineering
Contents
I Lecture Notes
1 VHDL
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
1.1.2 VHDL Origins and History
1.1.3 Semantics
1.1.4 Synthesis of a Simulation-Based Language
1.1.5 Solution to Synthesis Sanity
1.1.6 Standard Logic 1164
1.2 Comparison of VHDL to Other Hardware Description Languages
1.3 Overview of Syntax
1.3.1 Syntactic Categories
1.3.2 Library Units
1.3.3 Entities and Architecture
1.3.4 Concurrent Statements
1.3.5 Component Declaration and Instantiations
1.3.6 Processes
1.3.7 Sequential Statements
1.3.8 A Few More Miscellaneous VHDL Features
1.4 Concurrent vs Sequential Statements
1.4.1 Concurrent Assignment vs Process
1.4.2 Conditional Assignment vs If Statements
1.4.3 Selected Assignment vs Case Statement
1.4.4 Coding Style
1.5 Overview of Processes
1.5.1 Combinational Process vs Clocked Process
1.5.2 Latch Inference
1.6 Details of Process Execution
1.6.1 Simple Simulation
1.6.2 Temporal Granularities of Simulation
1.6.3 Intuition Behind Delta-Cycle Simulation
1.6.4 Definitions and Algorithm
1.6.4.1 Process Modes
1.6.4.2 Simulation Algorithm
1.6.4.3 Delta-Cycle Definitions
1.6.5 Example 1: Process Execution (Bamboozle)
1.6.6 Example 2: Process Execution (Flummox)
1.6.7 Ex: Need for Provisional Asn
1.6.8 Delta-Cycle Simulations of Flip-Flops
1.7 Register-Transfer-Level Simulation
1.7.1 Overview
1.7.2 Technique for Register-Transfer Level Simulation
1.7.3 Examples of RTL Simulation
1.7.3.1 RTL Simulation Example 1
1.8 VHDL and Hardware Building Blocks
1.8.1 Basic Building Blocks
1.8.2 Deprecated Building Blocks for RTL
1.8.3 Hardware and Code for Flops
1.8.3.1 Flops with Waits and Ifs
1.8.3.2 Flops with Synchronous Reset
1.8.3.3 Flop with Chip-Enable and Mux on Input
1.8.3.4 Flops with Chip-Enable, Muxes, and Reset
1.8.4 An Example Sequential Circuit
1.9 Arrays and Vectors
1.10 Arithmetic
1.10.1 Arithmetic Packages
1.10.2 Shift and Rotate Operations
1.10.3 Overloading of Arithmetic
1.10.4 Different Widths and Arithmetic
1.10.5 Overloading of Comparisons
1.10.6 Different Widths and Comparisons
1.10.7 Type Conversion
1.11 Synthesizable vs Non-Synthesizable Code
1.11.1 Unsynthesizable Code
1.11.1.1 Initial Values
1.11.1.2 Wait For
1.11.1.3 Different Wait Conditions
1.11.1.4 Multiple if rising_edge in Process
1.11.1.5 if rising_edge and wait in Same Process
1.11.1.6 if rising_edge with else Clause
1.11.1.7 if rising_edge Inside a for Loop
1.11.1.8 wait Inside of a for loop
1.12 Synthesizable VHDL Coding Guidelines
2 RTL Design with VHDL
2.1 Prelude to Chapter
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
2.2.2 Area Estimation
2.2.2.1 Interconnect for Generic FPGA
2.2.2.2 Clocks for Generic FPGAs
2.2.2.3 Special Circuitry in FPGAs
2.2.3 Generic-FPGA Coding Guidelines
2.3 Design Flow
2.4 Algorithms and High-Level Models
2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design
2.5.1.1 Mealy vs Moore State Machines
2.5.1.2 Introduction to State Machines and VHDL
2.5.1.3 Explicit vs Implicit State Machines
2.5.2 Implementing a Simple Moore Machine
2.5.2.1 Implicit Moore State Machine
2.5.2.2 Explicit Moore with Flopped Output
2.5.2.3 Explicit Moore with Combinational Outputs
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment
2.5.2.5 E-C+N Moore with Comb Proc
2.5.3 Implementing a Simple Mealy Machine
2.5.4 Reset
2.5.5 State Encoding
2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
2.6.2 Dataflow Diagrams, Hardware, and Behaviour
2.6.3 Dataflow Diagram Execution
2.6.4 Performance Estimation
2.6.5 Area Estimation
2.6.6 Design Analysis
2.6.7 Area / Performance Tradeoffs
2.7 Design Example: Massey
2.8 Design Example: Vanier
2.8.1 Requirements
2.8.2 Algorithm
2.8.3 Initial Dataflow Diagram
2.8.4 Reschedule to Meet Requirements
2.8.5 Optimize Resources
2.8.6 Assign Names to Registered Values
2.8.7 Input/Output Allocation
2.8.8 Tangent: Combinational Outputs
2.8.9 Register Allocation
2.8.10 Datapath Allocation
2.8.11 Hardware Block Diagram and State Machine
2.8.11.1 Control for Registers
2.8.11.2 Control for Datapath Components
2.8.11.3 Control for State
2.8.11.4 Complete State Machine Table
2.8.12 VHDL Code with Explicit State Machine
2.8.13 Peephole Optimizations
2.8.14 Notes and Observations
2.9 Pipelining
2.9.1 Introduction to Pipelining
2.9.2 Partially Pipelined
2.9.3 Terminology
2.10 Design Example: Pipelined Massey
2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations
2.11.2 Memory Arrays in VHDL
2.11.3 Data Dependencies
2.11.4 Memory and Dataflow Diagrams
2.11.5 Ex: Mem Array and Dataflow Diagram
2.12 Input / Output Protocols
2.13 Example: Moving Average
2.13.1 Requirements and Environmental Assumptions
2.13.2 Algorithm
2.13.3 Pseudocode and Dataflow Diagrams
2.13.4 Control Tables and State Machine
2.13.5 VHDL Code
3 Performance Analysis and Optimization
3.1 Introduction
3.2 Defining Performance
3.3 Comparing Performance
3.3.1 General Equations
3.3.2 Example: Performance of Printers
3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics
3.4.2 Example: CISC vs RISC and CPI
3.4.3 Effect of Instruction Set on Performance
3.4.4 Effect of Time to Market on Relative Performance
3.4.5 Summary of Equations
3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed
3.5.2 Examples of Dataflow Diagrams for Two Instructions
3.5.2.1 Scheduling of Operations for Different Clock Periods
3.5.2.2 Performance Computation for Different Clock Periods
3.5.2.3 Example: Two Instructions Taking Similar Time
3.5.2.4 Example: Same Total Time, Different Order for A
3.5.3 Example: From Algorithm to Optimized Dataflow
3.6 General Optimizations
3.6.1 Strength Reduction
3.6.1.1 Arithmetic Strength Reduction
3.6.1.2 Boolean Strength Reduction
3.6.2 Replication and Sharing
3.6.2.1 Mux-Pushing
3.6.2.2 Common Subexpression Elimination
3.6.2.3 Computation Replication
3.6.3 Arithmetic
3.7 Retiming
4 Functional Verification
4.1 Overview
4.1.1 Terminology: Validation / Verification / Testing
4.1.2 The Difficulty of Designing Correct Chips
4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)
4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)
4.2 Test Cases and Coverage
4.2.1 Coverage
4.2.2 Floating Point Divider Example
4.3 Testbenches
4.3.1 Overview of Test Benches
4.3.2 Reference Model Style Testbench
4.3.3 Relational Style Testbench
4.3.4 Coding Structure of a Testbench
4.3.5 Datapath vs Control
4.3.6 Verification Tips
4.4 Functional Verification for Datapath Circuits
4.4.1 A Spec-Less Testbench
4.4.2 Use an Array for Test Vectors
4.4.3 Build Spec into Stimulus
4.4.4 Have Separate Specification Entity
4.4.5 Generate Test Vectors Automatically
4.4.6 Relational Specification
4.5 Functional Verification of Control Circuits
4.5.1 Overview of Queues in Hardware
4.5.2 VHDL Coding
4.5.2.1 Package
4.5.2.2 Other VHDL Coding
4.5.3 Code Structure for Verification
4.5.4 Instrumentation Code
4.5.5 Assertions
4.5.6 VHDL Coding Tips
4.5.7 Queue Specification
4.5.8 Queue Testbench
4.6 Example: Microwave Oven
5 Timing Analysis
5.1 Delays and Definitions
5.1.1 Background Definitions
5.1.2 Clock-Related Timing Definitions
5.1.2.1 Clock Skew
5.1.2.2 Clock Latency
5.1.2.3 Clock Jitter
5.1.3 Storage-Related Timing Definitions
5.1.3.1 Flops and Latches
5.1.4 Propagation Delays
5.1.5 Timing Constraints
5.1.5.1 Minimum Clock Period
5.1.5.2 Hold Constraint
5.1.5.3 Example Timing Violations
5.2 Timing Analysis of Latches and Flip Flops
5.2.1 Simple Multiplexer Latch
5.2.1.1 Structure and Behaviour of Multiplexer Latch
5.2.1.2 Strategy for Timing Analysis of Storage Devices
5.2.1.3 Clock-to-Q Time of a Multiplexer Latch
5.2.1.4 Setup Timing of a Multiplexer Latch
5.2.1.5 Hold Time of a Multiplexer Latch
5.2.1.6 Example of a Bad Latch
5.3 Critical Paths and False Paths
5.3.1 Introduction to Critical and False Paths
5.3.1.1 Example of Critical Path in Full Adder
5.3.1.2 Preliminaries for Critical Paths
5.3.1.3 Longest Path and Critical Path
5.3.2 Longest Path
5.3.3 Detecting a False Path
5.3.3.1 Preliminaries
5.3.3.2 Almost-Correct Algorithm to Detect a False Path
5.3.3.3 Examples of Detecting False Paths
5.3.4 Finding the Next Candidate Path
5.3.4.1 Algorithm to Find Next Candidate Path
5.3.4.2 Examples of Finding Next Candidate Path
5.3.5 Correct Algorithm to Find Critical Path
5.3.5.1 Rules for Late Side Inputs
5.3.5.2 Monotone Speedup
5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
5.3.5.4 Complete Algorithm
5.3.5.5 Complete Examples
5.3.6 Further Extensions to Critical Path Analysis
5.3.7 Increasing the Accuracy of Critical Path Analysis
5.4 Elmore Timing Model
5.4.1 RC-Networks for Timing Analysis
5.4.2 Derivation of Analog Timing Model
5.4.2.1 Example Derivation: Equation for Voltage at Node 3
5.4.2.2 General Derivation
5.4.3 Elmore Timing Model
5.4.4 Examples of Using Elmore Delay
5.4.4.1 Interconnect with Single Fanout
5.4.4.2 Interconnect with Multiple Gates in Fanout
5.5 Practical Usage of Timing Analysis
5.5.1 Speed Binning
5.5.1.1 FPGAs, Interconnect, and Synthesis
5.5.2 Worst Case Timing
5.5.2.1 Fanout delay
5.5.2.2 Derating Factors
6 Power Analysis and Power-Aware Design
6.1 Overview
6.1.1 Importance of Power and Energy
6.1.2 Industrial Names and Products
6.1.3 Power vs Energy
6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?
6.1.4.2 Battery Life and Efficiency
6.1.4.3 Battery Life and Power
6.2 Power Equations
6.2.1 Switching Power
6.2.2 Short-Circuited Power
6.2.3 Leakage Power
6.2.4 Glossary
6.2.5 Note on Power Equations
6.3 Overview of Power Reduction Techniques
6.4 Voltage Reduction for Power Reduction
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power
6.5.2 Example Problem: Sixteen Pulser
6.5.2.1 Problem Statement
6.5.2.2 Additional Information
6.5.2.3 Answer
6.6 Clock Gating
6.6.1 Introduction to Clock Gating
6.6.2 Implementing Clock Gating
6.6.3 Design Process
6.6.4 Effectiveness of Clock Gating
6.6.5 Example: Reduced Activity Factor with Clock Gating
6.6.6 Clock Gating with Valid-Bit Protocol
6.6.6.1 Valid-Bit Protocol
6.6.6.2 How Many Clock Cycles for Module?
6.6.6.3 Adding Clock-Gating Circuitry
6.6.7 Example: Pipelined Circuit with Clock-Gating
7 Fault Testing and Testability
7.1 Faults and Testing
7.1.1 Overview of Faults and Testing
7.1.1.1 Faults
7.1.1.2 Causes of Faults
7.1.1.3 Testing
7.1.1.4 Burn In
7.1.1.5 Bin Sorting
7.1.1.6 Testing Techniques
7.1.1.7 Design for Testability (DFT)
7.1.2 Example Problem: Economics of Testing
7.1.3 Physical Faults
7.1.3.1 Types of Physical Faults
7.1.3.2 Locations of Faults
7.1.3.3 Layout Affects Locations
7.1.3.4 Naming Fault Locations
7.1.4 Detecting a Fault
7.1.4.1 Which Test Vectors will Detect a Fault?
7.1.5 Mathematical Models of Faults
7.1.5.1 Single Stuck-At Fault Model
7.1.6 Generate Test Vector to Find a Mathematical Fault
7.1.6.1 Algorithm
7.1.6.2 Example of Finding a Test Vector
7.1.7 Undetectable Faults
7.1.7.1 Redundant Circuitry
7.1.7.2 Curious Circuitry and Fault Detection
7.2 Test Generation
7.2.1 A Small Example
7.2.2 Choosing Test Vectors
7.2.2.1 Fault Domination
7.2.2.2 Fault Equivalence
7.2.2.3 Gate Collapsing
7.2.2.4 Node Collapsing
7.2.2.5 Fault Collapsing Summary
7.2.3 Fault Coverage
7.2.4 Test Vector Generation and Fault Detection
7.2.5 Generate Test Vectors for 100% Coverage
7.2.5.1 Collapse the Faults
7.2.5.2 Check for Fault Domination
7.2.5.3 Required Test Vectors
7.2.5.4 Faults Not Covered by Required Test Vectors
7.2.5.5 Order to Run Test Vectors
7.2.5.6 Summary of Technique to Find and Order Test Vectors
7.2.6 One Fault Hiding Another
7.3 Scan Testing in General
7.3.1 Structure and Behaviour of Scan Testing
7.3.2 Scan Chains
7.3.2.1 Circuitry in Normal and Scan Mode
7.3.2.2 Scan in Operation
7.3.2.3 Scan in Operation with Example Circuit
7.3.3 Summary of Scan Testing
7.3.4 Time to Test a Chip
7.3.4.1 Example: Time to Test a Chip
7.4 Boundary Scan and JTAG
7.4.1 Scan Instructions
7.5 Built In Self Test
7.5.1 Block Diagram
7.5.1.1 Components
7.5.1.2 Linear Feedback Shift Register (LFSR)
7.5.1.3 Maximal-Length LFSR
7.5.2 Test Generator
7.5.3 Signature Analyzer
7.5.4 Result Checker
7.5.5 Arithmetic over Binary Fields
7.5.6 Shift Registers and Characteristic Polynomials
7.5.6.1 Circuit Multiplication
7.5.7 Bit Streams and Characteristic Polynomials
7.5.8 Division
7.5.9 Signature Analysis: Math and Circuits
7.6 Scan vs Self Test
8 Review
8.1 Overview of the Term
8.2 VHDL
8.2.1 VHDL Topics
8.2.2 VHDL Example Problems
8.3 RTL Design Techniques
8.3.1 Design Topics
8.3.2 Design Example Problems
8.4 Functional Verification
8.4.1 Verification Topics
8.4.2 Verification Example Problems
8.5 Performance Analysis and Optimization
8.5.1 Performance Topics
8.5.2 Performance Example Problems
8.6 Timing Analysis
8.6.1 Timing Topics
8.6.2 Timing Example Problems
8.7 Power
8.7.1 Power Topics
8.7.2 Power Example Problems
8.8 Testing
8.8.1 Testing Topics
8.8.2 Testing Example Problems
8.9 Formulas to be Given on Final Exam
Part I Lecture Notes
Chapter 1 VHDL: The Language
CHAPTER 1. VHDL
1.1 Introduction to VHDL
1.1.1 Levels of Abstraction
Transistor: Signal values and time are continuous (analog). Each transistor is modeled by a resistor-capacitor network.
Switch: Time is continuous, but voltage may be either continuous or discrete. Linear equations are used.
Gate: Transistors are grouped together into gates. Voltages are discrete values such as 0 and 1.
Register transfer level: Hardware is modeled as assignments to registers and combinational signals. Basic unit of time is one clock cycle.
Transaction level: A transaction is an operation such as transferring data across a bus. Building blocks are processors, controllers, etc. Described in VHDL, SystemC, or SystemVerilog.
Electronic-system level: Looks at an entire electronic system, with both hardware and software.
1.1.2 VHDL Origins and History
VHDL = VHSIC Hardware Description Language
VHSIC = Very High Speed Integrated Circuit
"The VHSIC Hardware Description Language (VHDL) is a formal notation intended for use in all phases of the creation of electronic systems. Because it is both machine readable and human readable, it supports the development, verification, synthesis and testing of hardware designs, the communication of hardware design data, and the maintenance, modification, and procurement of hardware."
Language Reference Manual (IEEE Design Automation Standards Committee, 1993a)
VHDL is a lot more than synthesis of digital hardware
1.1.3 Semantics
The original goal of VHDL was to simulate circuits. The semantics of the language define circuit behaviour.
[Figure: simulation of c <= a AND b; producing waveforms for signals a, b, and c]
But now, VHDL is used in simulation and synthesis. Synthesis is concerned with the structure of the circuit. Synthesis: converts one type of description (behavioural) into another, lower level, description (usually a netlist).
[Figure: synthesis of c <= a AND b; into a gate with inputs a and b and output c]
Synthesis
Synthesis is a computer-aided design (CAD) technique that transforms a designer's concise, high-level description of a circuit into a structural description of a circuit.
[Figure: synthesis of c <= a AND b; into a gate with inputs a and b and output c]
CAD Tools
CAD tools allow designers to automate lower-level design processes in implementing the desired functionality of a system.
NOTE: EDA = Electronic Design Automation. In digital hardware design, EDA = CAD.
Synthesis vs Simulation
For synthesis, we want the code we write to define the structure of the hardware that is generated.
[Figure: synthesis of c <= a AND b; into a gate with inputs a and b and output c]
Synthesis vs Simulation
The VHDL semantics define the behaviour of the hardware that is generated, not the structure of the hardware.
[Figure: c <= a AND b; can be synthesized into two circuits with different structure; simulating either circuit gives the same behaviour for a, b, and c]
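As a minimal sketch of this point, the two architectures below give c the same behaviour as a function of a and b, but are written so that a naive gate-by-gate translation would produce different structures. The entity name and2_demo and the architecture names direct and demorgan are hypothetical and chosen only for illustration. A real synthesis tool is free to optimize both into the same circuit.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only for illustration.
entity and2_demo is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2_demo;

-- Direct form: a single AND gate.
architecture direct of and2_demo is
begin
  c <= a AND b;
end direct;

-- De Morgan form: two inverters feeding a NOR gate.
-- Same truth table as 'direct' for inputs '0' and '1'.
architecture demorgan of and2_demo is
begin
  c <= (NOT a) NOR (NOT b);
end demorgan;
```

Simulating either architecture with the same stimulus on a and b should show the same waveform on c, even though the written structure differs.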
1.1.4 Synthesis of a Simulation-Based Language

This section reserved for your reading pleasure
1.1.5 Solution to Synthesis Sanity
Pick a high-quality synthesis tool and study its documentation thoroughly
Learn the idioms of the tool
Different VHDL code with same behaviour can result in very different circuits
Be careful if you have to port VHDL code from one tool to another
KISS: Keep It Simple Stupid
VHDL examples will illustrate reliable coding techniques for the synthesis tools from Synopsys, Mentor Graphics, Altera, Xilinx, and most other companies as well. Follow the coding guidelines and examples from lecture As you write VHDL, think about the hardware you expect to get. Note: If you cant predict the hardware, then the hardware probably wont be very good (small, fast, correct, etc)
1.1.6 Standard Logic 1164
std_logic_1164: IEEE standard for signal values in VHDL.
  'U'  uninitialized
  'X'  strong unknown
  '0'  strong 0
  '1'  strong 1
  'Z'  high impedance
  'W'  weak unknown
  'L'  weak 0
  'H'  weak 1
  '-'  don't care
The most common values are: 'U', 'X', '0', '1'. If you see 'X' in a simulation, it usually means that there is a mistake in your code.
1.2 Comparison of VHDL to Other Hardware Description Languages

1.3 Overview of Syntax
1.3.1 Syntactic Categories

1.3.2 Library Units

1.3.3 Entities and Architecture
Each hardware module is described with an Entity/Architecture pair
[Figure: Entity and Architecture]
Entity
  library ieee;
  use ieee.std_logic_1164.all;
  entity and_or is
    port ( a, b, c : in  std_logic;
           z       : out std_logic );
  end and_or;
Example of an entity
Architecture
  architecture main of and_or is
    signal x : std_logic;
  begin
    x <= a AND b;
    z <= x OR (a AND c);
  end main;
Example of architecture
1.3.4 Concurrent Statements
Architectures contain concurrent statements
Concurrent statements execute in parallel (Figure 1.4)

Concurrent statements make VHDL fundamentally different from most software languages.
Hardware (gates) naturally executes in parallel
VHDL mimics the behaviour of real hardware.
At each infinitesimally small moment of time, each gate:
1. samples its inputs
2. computes the value of its output
3. drives the output
  architecture main of bowser is
  begin
    x1 <= a AND b;
    x2 <= NOT x1;
    z  <= NOT x2;
  end main;

  architecture main of bowser is
  begin
    z  <= NOT x2;
    x2 <= NOT x1;
    x1 <= a AND b;
  end main;
[Figure: circuit with inputs a and b, internal signals x1 and x2, and output z]
The order of concurrent statements doesn't matter
Types of Concurrent Statements

conditional assignment: similar to conventional if-then-else
  c <= a+b when sel='1' else a+c when sel='0' else "0000";
selected assignment: similar to conventional case/switch
  with color select d <= "00" when red , "01" when . . .;
component instantiation: use a hardware module/component
  add1 : adder port map( a => f, b => g, s => h, co => i);
for-generate: create multiple pieces of hardware
  bgen: for i in 1 to 7 generate
    b(i) <= a(7-i);
  end generate;
if-generate: conditionally create some hardware
  okgen : if optgoal /= fast generate
    result <= ((a and b) or (d and not e)) or g;
  end generate;
  fastgen : if optgoal = fast generate
    result <= '1';
  end generate;
process: description of complex behaviour (Section 1.3.6)
1.3.5 Component Declaration and Instantiations

1.3.6 Processes
Processes are used to describe complex and potentially unsynthesizable behaviour
A process is a concurrent statement (Section 1.3.4).
The body of a process contains sequential statements (Section 1.3.7)
Processes are the most complex and difficult to understand part of VHDL (Sections 1.5 and 1.6)
Example Process with Sensitivity List

  process (a, b, c)
  begin
    y <= a AND b;
    if (a = '1') then
      z1 <= b AND c;
      z2 <= NOT c;
    else
      z1 <= b OR c;
      z2 <= c;
    end if;
  end process;
Example Process with Wait Statements

  process
  begin
    y <= a AND b;
    z <= '0';
    wait until rising_edge(clk);
    if (a = '1') then
      z <= '1';
      y <= '0';
      wait until rising_edge(clk);
    else
      y <= a OR b;
    end if;
  end process;
Sensitivity Lists and Wait Statements

Processes must have either a sensitivity list or at least one wait statement on each execution path through the process. Processes cannot have both a sensitivity list and a wait statement.
Sensitivity List
The sensitivity list contains the signals that are read in the process.
A process is executed when a signal in its sensitivity list changes value.
An important coding guideline to ensure consistent synthesis and simulation results is to include all signals that are read in the sensitivity list. There is one exception to this rule: for a process that implements a flip-flop with an if rising_edge statement, it is acceptable to include only the clock signal in the sensitivity list. Other signals may be included, but are not needed.
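The guideline and its exception can be seen side by side in a small sketch. The entity name sens_demo and its port names are hypothetical, chosen only for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity, used only to illustrate sensitivity lists.
entity sens_demo is
  port ( clk, sel, a, b : in  std_logic;
         y_comb, y_reg  : out std_logic );
end sens_demo;

architecture main of sens_demo is
begin
  -- Combinational process: every signal that is read (sel, a, b)
  -- appears in the sensitivity list.
  comb : process (sel, a, b)
  begin
    if (sel = '1') then
      y_comb <= a;
    else
      y_comb <= b;
    end if;
  end process;

  -- Flip-flop process: the exception to the guideline.
  -- Only clk is listed, although a is also read.
  reg : process (clk)
  begin
    if rising_edge(clk) then
      y_reg <= a;
    end if;
  end process;
end main;
```

If b were left out of the first sensitivity list, a simulator would not re-run the process when b changes, while a synthesis tool would most likely still build the multiplexer, so the two results could disagree.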
1.3.7 Sequential Statements
Used inside processes and functions.
  wait                wait until . . . ;
  signal assignment   . . . <= . . . ;
  if-then-else        if . . . then . . . elsif . . . end if;
  case                case . . . is
                        when . . . | . . . => . . . ;
                        when . . . => . . . ;
                      end case;
  loop                loop . . . end loop;
  while loop          while . . . loop . . . end loop;
  for loop            for . . . in . . . loop . . . end loop;
  next                next . . . ;
The most commonly used sequential statements
1.3.8 A Few More Miscellaneous VHDL Features

1.4 Concurrent vs Sequential Statements
All concurrent assignments can be translated into sequential statements. But, not all sequential statements can be translated into concurrent statements.
1.4.1 Concurrent Assignment vs Process
The two code fragments below have identical behaviour:
  architecture main of tiny is
  begin
    b <= a;
  end main;

  architecture main of tiny is
  begin
    process (a) begin
      b <= a;
    end process;
  end main;
1.4.2 Conditional Assignment vs If Statements

The two code fragments below have identical behaviour:
Concurrent Statements
  t <= <val1> when <cond> else <val2>;
Sequential Statements
  if <cond> then
    t <= <val1>;
  else
    t <= <val2>;
  end if;
1.4.3 Selected Assignment vs Case Statement

The two code fragments below have identical behaviour:
Concurrent Statements
  with <expr> select
    t <= <val1> when <choices1>,
         <val2> when <choices2>,
         <val3> when <choices3>;
Sequential Statements
  case <expr> is
    when <choices1> => t <= <val1>;
    when <choices2> => t <= <val2>;
    when <choices3> => t <= <val3>;
  end case;

1.4.4 Coding Style
Code that's easy to write with sequential statements, but difficult with concurrent:
  case <expr> is
    when <choice1> =>
      if <cond> then
        o <= <expr1>;
      else
        o <= <expr2>;
      end if;
    when <choice2> =>
      ...
  end case;
1.5 Overview of Processes
Processes are the most difficult VHDL construct to understand. This section gives an overview of processes. Section 1.6 gives the details of the semantics of processes.
Within a process, statements are executed almost sequentially
Among processes, execution is done in parallel
Remember: a process is a concurrent statement!
Process Semantics
VHDL mimics hardware
Hardware (gates) executes in parallel
Processes execute in parallel with each other
All possible orders of executing processes must produce the same simulation results (waveforms)
If a signal is not assigned a value, then it holds its previous value
All orders of executing concurrent statements must produce the same waveforms
  architecture
    procA: process
      stmtA1;
      stmtA2;
      stmtA3;
    end process;
    procB: process
      stmtB1;
      stmtB2;
    end process;
[Figure: three execution sequences of A1, A2, A3 and B1, B2. Single threaded: procA before procB. Single threaded: procB before procA. Multithreaded: procA and procB in parallel.]
All execution orders must have same behaviour
1.5.1 Combinational Process vs Clocked Process

Each well-written synthesizable process is either combinational or clocked.
Combinational process:
Executing the process takes part of one clock cycle
Target signals are outputs of combinational circuitry
A combinational process must have a sensitivity list
A combinational process must not have any wait statements
A combinational process must not have any rising_edges or falling_edges
The hardware for a combinational process is just combinational circuitry
Clocked process:
Executing the process takes one (or more) clock cycles
Target signals are outputs of flops
Process contains one or more wait or if rising_edge statements
Hardware contains combinational circuitry and flip-flops
Note: Clocked processes are sometimes called sequential processes, but this can be easily confused with sequential statements, so in E&CE 327 we'll refer to synthesizable processes as either combinational or clocked.
Combinational or Clocked Process? (1)

  process (a,b,c)
  begin
    p1 <= a;
    if (b = c) then
      p2 <= b;
    else
      p2 <= a;
    end if;
  end process;

  process begin
    wait until rising_edge(clk);
    b <= a;
  end process;

  process (clk) begin
    if rising_edge(clk) then
      b <= a;
    end if;
  end process;

  process (clk) begin
    a <= clk;
  end process;

  process begin
    wait until rising_edge(a);
    c <= b;
  end process;
1.5.2 Latch Inference
The semantics of VHDL require that if a signal is assigned a value on some passes through a process and not on other passes, then on a pass through the process when the signal is not assigned a value, it must maintain its value from the previous pass.
  process (a, b, c)
  begin
    if (a = '1') then
      z1 <= b;
      z2 <= b;
    else
      z1 <= c;
    end if;
  end process;
[Figure: signals a, b, c, z1, z2]
Example of latch inference
Latch Inference
When a signal's value must be stored, VHDL infers a latch or a flip-flop in the hardware to store the value.
If you want a latch or a flip-flop for the signal, then latch inference is good.
If you want combinational circuitry, then latch inference is bad.
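When combinational circuitry is wanted, one common way to avoid an inferred latch is to assign every target signal on every pass through the process. The sketch below reworks the latch-inference example so that z2 always receives a value; the entity name no_latch and the default value '0' are assumptions made for illustration, not part of the original example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity wrapping the latch-inference example.
entity no_latch is
  port ( a, b, c : in  std_logic;
         z1, z2  : out std_logic );
end no_latch;

architecture main of no_latch is
begin
  process (a, b, c)
  begin
    -- Default assignment (the value '0' is an assumption): z2 now gets
    -- a value on every pass, so no storage is needed.
    z2 <= '0';
    if (a = '1') then
      z1 <= b;
      z2 <= b;   -- overrides the default on this path
    else
      z1 <= c;
    end if;
  end process;
end main;
```

Within a process, the last assignment to a signal on the executed path is the one that takes effect, so the default only applies on the else branch.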
Loop, Latch, Flop

[Figure: three circuits: a combinational loop (signals a, b, z), a latch (with enable EN), and a flip-flop]
Question:
Write VHDL code for each of the above circuits
1.6 Details of Process Execution
1.6.1 Simple Simulation

[Figure: circuit with signals a, b, c, d, e and its simulation waveforms, with time marks at 0ns, 10ns, 12ns, and 15ns]
Different Programs, Same Behaviour

All three programs below synthesize to the circuit on the previous slide. The goal of VHDL semantics is that all three programs have the same behaviour.
  process (a,b) begin
    c <= a and b;
  end process;
  process (b,c,d) begin
    d <= not c;
    e <= b and d;
  end process;

  process (a,b,c,d) begin
    c <= a and b;
    d <= not c;
    e <= b and d;
  end process;

  process (a,b) begin
    c <= a and b;
  end process;
  process (c) begin
    d <= not c;
  end process;
  process (b,d) begin
    e <= b and d;
  end process;

1.6.2 Temporal Granularities of Simulation

1.6.3 Intuition Behind Delta-Cycle Simulation
In zero-delay simulation, a sequence of dependent events must appear to happen instantaneously (in zero time). In particular, the effect of an event must propagate instantaneously through combinational circuitry.
Two fundamental rules for zero-delay simulation:
1. events appear to propagate through combinational circuitry instantaneously.
2. all of the gates appear to operate in parallel
Intuition for Delta Cycles

To make it appear that events propagate instantaneously, VHDL introduces an artificial unit of time, the delta cycle, to represent an infinitesimally small amount of time.
In each delta cycle, every gate in the circuit will sample its inputs, compute its result, and drive its output signal with the result.
Simulators simulate one gate at a time, but the waveforms make it appear that all of the gates were run in parallel.
In each delta cycle, the simulator executes all gates whose inputs changed.
To preserve the illusion that the gates ran in parallel, the effect of simulating a gate remains invisible until the end of the delta cycle.
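A small simulation-only sketch can make the "invisible until the end of the delta cycle" rule concrete. The entity swap_demo is hypothetical, and the initial values on x and y are used here only to keep the testbench short (they are for simulation, not synthesis).

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical, simulation-only testbench.
entity swap_demo is
end swap_demo;

architecture sim of swap_demo is
  signal x : std_logic := '0';
  signal y : std_logic := '1';
begin
  process
  begin
    wait for 10 ns;
    -- Both assignments read the values from before this delta cycle,
    -- so the two signals swap instead of both ending up equal.
    x <= y;
    y <= x;
    -- Suspend for one delta cycle so that the new values become visible.
    wait for 0 ns;
    assert (x = '1') and (y = '0')
      report "swap did not behave as expected" severity error;
    wait;  -- suspend forever
  end process;
end sim;
```

In a software language the same two statements executed in sequence would leave both variables with the same value; here neither new value is visible until the process suspends.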
1.6.4 Definitions and Algorithm
1.6.4.1 Process Modes

[Figure: process-mode state diagram with states suspended, postponed, and active, and transitions resume, activate, and suspend]
Suspended
Nothing to currently execute
A process stays suspended until the event that it is waiting for occurs: either a change in a signal on its sensitivity list or the condition in a wait statement
Postponed
Wants to execute, but not currently active
A process stays postponed until the simulator chooses it from the pool of postponed processes
Active
Currently executing
A process stays active until it hits a wait statement or sensitivity list, at which point it suspends

1.6.4.2 Simulation Algorithm
The algorithm presented here is a simplification of the actual algorithm in the VHDL Standard. This algorithm does not support delayed assignments; for example: (a <= b after 2 ns;).
A somewhat ironic note: only six of the two hundred pages in the VHDL Standard are devoted to the semantics of executing processes.
The Algorithm
Simulations start at step 1 with all processes postponed and all signals with a default value (e.g., 'U' for std_logic).
1. While there are postponed processes:
   (a) Pick one or more postponed processes to execute (become active).
   (b) Provisionally execute assignments (new values become visible at step 3)
   (c) A process executes until it hits its sensitivity list or a wait statement, at which point it suspends.
   (d) Processes that become suspended, stay suspended until there are no more postponed or active processes.
2. Each process checks its sensitivity list or wait condition to see if it should resume
3. Update signals with their provisional values
4. If no postponed processes, then increment simulation time to next event.
Notes on Simulation Algorithm

At a wait statement, the process will suspend even if the condition is true in the current simulation cycle. The process will resume when the condition changes to true.
In n-threaded execution, at most n processes are active at a time
1.6.4.3 Delta-Cycle Definitions
Definition simulation step: Executing one sequential assignment or process mode change.
Definition simulation cycle: The operations that occur in one iteration of the simulation algorithm.
Definition delta cycle: A simulation cycle that does not advance simulation time.
Definition simulation round: A sequence of simulation cycles that all have the same simulation time.
1.6.5 Example 1: Process Execution (Bamboozle)

1.6.6 Example 2: Process Execution (Flummox)

This example is a variation of the Bamboozle example from section 1.6.5.

Legend
  process mode (S=suspended, P=postponed, A=active)
  simulation-step pointer (one per process)
  visible-assignment value
  provisional-assignment value
  initial values
  simulation step

  proc1: process (a, b, c) begin
    c <= a AND b;
    d <= NOT c;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;
  proc3: process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;

[Figure: delta-cycle simulation table with rows sim round, sim cycle, delta cycle, proc1, proc2, proc3, a, b, c, d, e, starting at 0ns; all signals start at U]

1. While there are postponed processes:
   (a) Pick process(es) to activate
   (b) Execute active processes, record prov asns
   (c) Suspend at sens list or wait statement
   (d) Once suspended, stay suspended
2. Check sens lists, wait conditions for changes
3. Update signals with provisional values
4. If no postponed procs, increment time
From Delta-Time to Real Time

[Figure: waveforms for a, b, c, d, e drawn first against delta cycles (0ns, +1, +2, +3, 3ns, +1, +2, +3, 102ns) and then against real time (0ns, 1ns, 2ns, 3ns, 4ns, ..., 100ns, 101ns, 102ns); all signals start at U]
Note and Questions

Note: If a signal is updated with the same value it had in the previous simulation cycle, then it does not change, and therefore does not trigger processes to resume.
Question: What are the different granularities of time that occur when doing delta-cycle simulation?
Question: What is the order of granularity, from finest to coarsest, amongst the different granularities related to delta-cycle simulation?

1.6.7 Ex: Need for Provisional Asn
  architecture main of swindle is
  begin
    p_c: process (a, b) begin
      c <= a AND b;
    end process;
    p_d: process (a, c) begin
      d <= a XOR c;
    end process;
  end main;
Question: draw the circuit
Circuit to illustrate need for provisional assignments
1. Start with all signals at 0.
2. Simultaneously change to a = 1 and b = 1.
With Provisional Assignments, c Before d

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
[Figure: delta-cycle table for p_c, p_d, a, b, c, d; all signals start at 0]
If p_c is scheduled before p_d, then d will have a 1 pulse.
With Provisional Assignments, d Before c

If assignments are not visible within same simulation cycle (correct: i.e. provisional assignments are used)
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
[Figure: delta-cycle table for p_c, p_d, a, b, c, d; all signals start at 0]
If p_d is scheduled before p_c, then d will have a 1 pulse.
Without Prov. Assignments, c Before d

If assignments are visible within same simulation cycle (incorrect)
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
[Figure: delta-cycle table for p_c, p_d, a, b, c, d; all signals start at 0]
If p_c is scheduled before p_d, then d will stay constant 0.
Without Prov. Assignments, d Before c

If assignments are visible within same simulation cycle (incorrect)
  p_c: process (a, b) begin
    c <= a AND b;
  end process;
  p_d: process (a, c) begin
    d <= a XOR c;
  end process;
[Figure: delta-cycle table for p_c, p_d, a, b, c, d; all signals start at 0]
If p_d is scheduled before p_c, then d will have a 1 pulse.
Need for Provisional Assignment

With provisional assignments, both orders of scheduling processes result in the same behaviour on all signals. Without provisional assignments, different scheduling orders result in different behaviour.
1.6.8 Delta-Cycle Simulations of Flip-Flops

  p_a : process begin
    a <= '0';
    wait for 15 ns;
    a <= '1';
    wait for 20 ns;
  end process;
  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;
  flop : process ( clk ) begin
    if rising_edge( clk ) then
      q <= a;
    end if;
  end process;
[Figure: delta-cycle simulation table with rows sim round, sim cycle, delta cycle, p_a, p_clk, flop, a, clk, q, starting at 0ns with all processes postponed and all signals at U]
Redraw with Normal Time Scale
[Figure: waveforms for a, clk, q from 0ns to 35ns]
Back-to-Back Flops
  p_a : process begin
    a <= '0';
    wait for 15 ns;
    a <= '1';
    wait for 20 ns;
  end process;
  p_clk : process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;
  flops : process ( clk ) begin
    if rising_edge( clk ) then
      q1 <= a;
      q2 <= q1;
    end if;
  end process;
[Figure: delta-cycle simulation table with rows sim round, sim cycle, delta cycle, p_a, p_clk, flops, a, clk, q1, q2, covering 10ns, 15ns, 20ns, 30ns, and 35ns]
[Figure: waveforms from 0ns to 35ns]
External Inputs and Flops

Question: Do the signals b1 and b2 have the same behaviour from 20 to 30 ns?
  architecture mathilde of sauve is
    signal clk, a, b : std_logic;
  begin
    process begin
      clk <= '1';
      wait for 10 ns;
      clk <= '0';
      wait for 10 ns;
    end process;
    process begin
      wait for 20 ns;
      a1 <= '1';
    end process;
    process begin
      wait until rising_edge(clk);
      a1 <= '1';
    end process;
    process begin
      wait until rising_edge( clk );
      b1 <= a1;
      ...
Testbenches and Clock Phases

  env : process begin
    a <= '1';
    clk <= '0';
    wait for 10 ns;
    a <= '0';
    clk <= '1';
    wait for 10 ns;
  end process;
  flop : process ( clk ) begin
    if rising_edge( clk ) then
      q1 <= a;
    end if;
  end process;
[Figure: delta-cycle simulation table with rows sim round, sim cycle, delta cycle, env, flop1, flop2, a, clk, q1, starting at 0ns]
[Figure: waveforms for a, clk, q1 at 0ns, 10ns, 20ns]
Warning
Note: Testbench signals. For consistent results across different simulators, simulation scripts vs test benches, and timing simulation vs zero-delay simulation, do not change signals in your testbench or script at the same time as the clock changes.
[Figure: three sets of waveforms for a, clk, q1 from 0ns to 60ns:
  a is output of clocked or combinational process
  a is output of timed process (testbench or environment): POOR DESIGN
  a is output of timed process (testbench or environment): GOOD DESIGN]
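One way to follow the warning is to offset the stimulus from the clock edges. The testbench below is a sketch under assumed timing: a 20 ns clock period and a 5 ns offset are illustrative choices, and the entity name tb_phase is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical, simulation-only testbench.
entity tb_phase is
end tb_phase;

architecture sim of tb_phase is
  signal clk, a, q1 : std_logic;
begin
  -- Clock with a 20 ns period: rising edges at 10 ns, 30 ns, 50 ns, ...
  p_clk : process
  begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;

  -- Stimulus: a changes at 5 ns, 25 ns, 45 ns, ...,
  -- never at the same time as a clock edge.
  env : process
  begin
    a <= '0';
    wait for 5 ns;
    loop
      a <= not a;
      wait for 20 ns;
    end loop;
  end process;

  flop : process (clk)
  begin
    if rising_edge(clk) then
      q1 <= a;
    end if;
  end process;
end sim;
```

Because a is stable for 5 ns before and 15 ns after each rising edge, the value captured in q1 does not depend on the order in which the simulator runs env and p_clk.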
1.7 Register-Transfer-Level Simulation
[Figure: delta-cycle simulation (columns 0ns, 0ns+1, 0ns+2, ..., 3ns+1, 3ns+2, 3ns+3, 102ns) compared with RTL simulation (columns 0ns, 1ns, 2ns, 3ns, 102ns) for proc1, proc2, proc3 and signals a, b, c, d, e]

1.7.1 Overview
Much simpler than delta cycle
Columns are real time: clock cycles, nanoseconds, etc.
Can simulate both synthesizable and unsynthesizable code
Cannot simulate combinational loops
Same values as delta-cycle at end of simulation round
Question: In this code, what value should b have at 10 ns?
  process begin
    a <= '0';
    wait for 10 ns;
    a <= '1';
    ...
  end process;
  process begin
    b <= '0';
    wait for 10 ns;
    b <= a;
    ...
  end process;
1.7.2 Technique for Register-Transfer Level Simulation

1. Pre-processing
   (a) Separate processes into combinational and non-combinational (clocked and timed)
   (b) Decompose each combinational process into separate processes with one target signal per process
   (c) Sort processes into topological order based on dependencies
2. For each clock cycle or unit of time:
   (a) Run non-combinational processes in any order. Non-combinational assignments read from earlier clock cycle / time step, except that clocked processes read the current value of the clock signal.
   (b) Run combinational processes in topological order. Combinational assignments read from current clock cycle / time step.

1.7.3 Examples of RTL Simulation
1.7.3.1 RTL Simulation Example 1
We revisit an earlier example from delta-cycle simulation, but change the code slightly and do register-transfer-level simulation.
  proc1: process (a, b, c) begin
    d <= NOT c;
    c <= a AND b;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;
  proc3: process begin
    a <= '1';
    b <= '0';
    wait for 3 ns;
    b <= '1';
    wait for 99 ns;
  end process;
Decompose and sort comb procs

Decomposed
  proc1d: process (c) begin
    d <= NOT c;
  end process;
  proc1c: process (a, b) begin
    c <= a AND b;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;
Sorted
  proc1c: process (a, b) begin
    c <= a AND b;
  end process;
  proc1d: process (c) begin
    d <= NOT c;
  end process;
  proc2: process (b, d) begin
    e <= b AND d;
  end process;
Waveforms
[Figure: RTL-simulation waveforms for a, b, c, d, e at 0ns, 1ns, 2ns, 3ns, and 102ns; all signals start at U]
Example: Communicating State Machines
  huey: process begin
    clk <= '0';
    wait for 10 ns;
    clk <= '1';
    wait for 10 ns;
  end process;
  dewey: process begin
    a <= to_unsigned(0,4);
    wait until re(clk);
    while (a < 4) loop
      a <= a + 1;
      wait until re(clk);
    end loop;
  end process;
  louie: process begin
    d <= '1';
    wait until re(clk);
    if (a >= 2) then
      d <= '0';
      wait until re(clk);
    end if;
  end process;
[Figure: waveforms for clk, a, d]

1.8 VHDL and Hardware Building Blocks
1.8.1 Basic Building Blocks
Different classes of building blocks:
Boolean
Conditional
Arithmetic
Storage
Basic Building Blocks: Boolean

[Schematic symbols omitted]
  VHDL    Description
  and     AND gate
  or      OR gate
  not     inverter
  nand    NAND gate
  nor     NOR gate
  xor     exclusive-or gate
Basic Building Blocks: Conditional

  VHDL                                            Description
  if-then-else, when-else, with-select, case      Multiplexer
Basic Building Blocks: Arithmetic

  VHDL        Description
  +           adder
  -           subtracter
  asl, lsl    left shifter
  asr, lsr    right shifter
Basic Building Blocks: Storage

  VHDL                Description
  clocked process     flip-flop (pins D, CE, S, R, Q)
  memory component    single-port memory (pins WE, A, DI, DO)
  memory component    dual-port memory (pins WE, A0, DI0, DO0, A1, DO1)
1.8.2 Deprecated Building Blocks for RTL
Some of the common gates you have encountered in previous courses should be avoided when synthesizing register-transfer-level hardware, particularly if FPGAs are the implementation technology.
Latches: Use flops, not latches
T, JK, SR, etc. flip-flops: Limit yourself to D-type flip-flops
Tri-State Buffers: Use multiplexers, not tri-state buffers
Note: Unfortunately and surprisingly, PalmChip has been awarded a US patent for using uni-directional busses (i.e. multiplexers) for system-on-chip designs. The patent was filed in 2000, so all fourth-year design projects completed after that date will need to pay royalties to PalmChip
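As an illustration of the "multiplexers, not tri-state buffers" guideline, the sketch below selects between two sources with a conditional assignment instead of letting two tri-state drivers share one wire. The entity name bus_mux and its ports are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: two sources share one destination.
entity bus_mux is
  port ( sel    : in  std_logic;
         d0, d1 : in  std_logic;
         y      : out std_logic );
end bus_mux;

architecture main of bus_mux is
begin
  -- A conditional assignment describes a 2:1 multiplexer.
  -- The signal y has exactly one driver, so 'Z' is never needed.
  y <= d1 when sel = '1' else d0;
end main;
```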
What is This?
  process (a) begin
    if rising_edge(a) then
      c <= b;
    end if;
  end process;

1.8.3 Hardware and Code for Flops
1.8.3.1 Flops with Waits and Ifs
  process (clk) begin
    if rising_edge(clk) then
      q <= d;
    end if;
  end process;
VHDL Code for Flip-Flop: Wait-Style

  process begin
    wait until rising_edge(clk);
    q <= d;
  end process;

1.8.3.2 Flops with Synchronous Reset
  process (clk) begin
    if rising_edge(clk) then
      if (reset = '1') then
        q <= '0';
      else
        q <= d;
      end if;
    end if;
  end process;

Flop with Synchronous Reset: Wait-Style
  process begin
    wait until rising_edge(clk);
    if (reset = '1') then
      q <= '0';
    else
      q <= d0;
    end if;
  end process;
Variation on a Floppy Theme

Question: Synchronous or asynchronous reset?
  process (clk, reset) begin
    if (reset = '1') then
      q <= '0';
    else
      if rising_edge(clk) then
        q <= d;
      end if;
    end if;
  end process;

Variated Flop of a Theme
Question: Synchronous or asynchronous reset?
  process begin
    if (reset = '1') then
      q <= '0';
    else
      q <= d0;
    end if;
    wait until rising_edge(clk);
  end process;
Flop with Chip-Enable

  process (clk) begin
    if rising_edge(clk) then
      if (ce = '1') then
        q <= d;
      end if;
    end if;
  end process;
Wait-style flop with chip-enable included in course notes
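For reference, a wait-style flop with chip-enable can be sketched by analogy with the wait-style flop with synchronous reset. This is an illustrative version, not necessarily the one in the course notes, and the entity name flop_ce is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity for a D flip-flop with chip-enable.
entity flop_ce is
  port ( clk, ce, d : in  std_logic;
         q          : out std_logic );
end flop_ce;

architecture main of flop_ce is
begin
  process
  begin
    wait until rising_edge(clk);
    -- q is updated only when ce is '1'; otherwise it holds its value.
    if (ce = '1') then
      q <= d;
    end if;
  end process;
end main;
```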
Q: Flop with a Mux on the Input?

[Figure: flip-flop (D, Q) clocked by clk, with a multiplexer on its D input; sel selects between d0 and d1]
Q: Flops with a Mux on the Output?

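[Figure: two flip-flops (D, Q) clocked by clk, with inputs d0 and d1 and outputs q0 and q1; a multiplexer controlled by sel selects q from q0 and q1]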
Question: For the circuits with mux-on-input and mux-on-output, does q have the same behaviour in both circuits?
1.8.3.3 Flop with Chip-Enable and Mux on Input
Hint: Chip Enable
  process (clk) begin
    if rising_edge(clk) then
      if (ce = '1') then
        q <= d;
      end if;
    end if;
  end process;

1.8.3.4 Flops with Chip-Enable, Muxes, and Reset

1.8.4 An Example Sequential Circuit

1.9 Arrays and Vectors

1.10 Arithmetic
VHDL includes all of the common arithmetic and logical operators. Use the VHDL arithmetic operators and let the synthesis tool choose the best implementation for you.
1.10.1 Arithmetic Packages
To do arithmetic with signals, use the numeric_std package. This package defines types signed and unsigned, which are std_logic vectors on which you can do signed or unsigned arithmetic.
numeric_std supersedes earlier arithmetic packages, such as std_logic_arith.
Use only one arithmetic package, otherwise the different definitions will clash and you can get strange error messages.
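A minimal sketch of arithmetic with numeric_std follows. The entity name add4 and the 4-bit width are assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;   -- the only arithmetic package used

-- Hypothetical 4-bit unsigned adder.
entity add4 is
  port ( a, b : in  unsigned(3 downto 0);
         sum  : out unsigned(3 downto 0) );
end add4;

architecture main of add4 is
begin
  -- "+" on two 4-bit unsigned operands returns a 4-bit result,
  -- so the sum wraps around modulo 16.
  sum <= a + b;
end main;
```

The adder itself is left to the synthesis tool, which chooses the implementation.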
1.10.2 Shift and Rotate Operations

1.10.3 Overloading of Arithmetic
1.10.4 Different Widths and Arithmetic

1.10.5 Overloading of Comparisons
Overloading of Comparison Operations (=, /=, >=, >, <)
  src1/2      src2/1
  unsigned    integer    OK
  signed      integer    OK
  unsigned    signed     fails in analysis

1.10.6 Different Widths and Comparisons

1.10.7 Type Conversion
The functions unsigned, signed, to_integer, to_unsigned and to_signed are used to convert between integers, std_logic vectors, signed vectors and unsigned vectors. If you convert between two types of the same width, then no additional hardware will be generated. The listing below summarizes the types of these functions.
  unsigned( val : std_logic_vector )               return unsigned;
  signed( val : std_logic_vector )                 return signed;
  to_integer( val : signed )                       return integer;
  to_integer( val : unsigned )                     return integer;
  to_unsigned( val : integer; width : natural )    return unsigned;
  to_signed( val : integer; width : natural )      return signed;
Note: More details in course notes
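The conversions can be chained, as in the sketch below. The entity name conv_demo and the 8-bit width are hypothetical. The final step uses the type conversion std_logic_vector(...), which is not in the listing above but is the standard way to go from unsigned back to std_logic_vector.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity that round-trips a vector through an integer.
entity conv_demo is
  port ( slv_in  : in  std_logic_vector(7 downto 0);
         slv_out : out std_logic_vector(7 downto 0) );
end conv_demo;

architecture main of conv_demo is
  signal u : unsigned(7 downto 0);
  signal n : integer range 0 to 255;
begin
  u       <= unsigned(slv_in);                      -- std_logic_vector to unsigned
  n       <= to_integer(u);                         -- unsigned to integer
  slv_out <= std_logic_vector(to_unsigned(n, 8));   -- integer to unsigned to std_logic_vector
end main;
```

All three signals are 8 bits wide, so these conversions describe wiring only.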
1.11 Synthesizable vs Non-Synthesizable Code

Synthesis is done by matching VHDL code against templates or patterns.
It's important to use idioms that your synthesis tools recognize.
Think like hardware: when you write VHDL, you should know what hardware you expect to be produced by the synthesizer.

1.11.1 Unsynthesizable Code
1.11.1.1 Initial Values
Initial values on signals (UNSYNTHESIZABLE)
  signal bad_signal : std_logic := '0';
Reason: At powerup, the values on signals are random (except for some FPGAs).
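A common synthesizable way to get a known starting value is to load it through a reset, as in the synchronous-reset flop from Section 1.8.3.2. The sketch below assumes a reset input is available; the entity name init_by_reset and the signal name good_signal are hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: a known starting value via reset.
entity init_by_reset is
  port ( clk, reset, d : in  std_logic;
         q             : out std_logic );
end init_by_reset;

architecture main of init_by_reset is
  signal good_signal : std_logic;   -- no initial value in the declaration
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        good_signal <= '0';         -- known value loaded by reset
      else
        good_signal <= d;
      end if;
    end if;
  end process;

  q <= good_signal;
end main;
```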
1.11.1.2 Wait For
Wait for length of time (UNSYNTHESIZABLE)
  wait for 10 ns;
Reason: Delays through circuits are dependent upon both the circuit and its operating environment, particularly supply voltage and temperature. For example, imagine trying to build an AND gate that will have exactly a 2ns delay in all environments.
1.11.1.3 Different Wait Conditions
wait statements with different conditions in a process (UNSYNTHESIZABLE)
  -- different clock signals
  process begin
    wait until rising_edge(clk1);
    x <= a;
    wait until rising_edge(clk2);
    x <= a;
  end process;
Reason: Would require the flip-flops to use different clock signals at different times.

  -- different clock edges
  process begin
    wait until rising_edge(clk);
    x <= a;
    wait until falling_edge(clk);
    x <= a;
  end process;
Reason: Would require flip-flop to be sensitive to different clock edges at different times.
1.11.1.4 Multiple if rising_edge in Process
Multiple if rising_edge statements in a process (UNSYNTHESIZABLE)
  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    end if;
    if rising_edge(clk) then
      q1 <= d1;
    end if;
  end process;
Reason: The idioms for synthesis tools generally expect just a single if rising_edge statement in each process. The simpler the VHDL code is, the easier it is to synthesize hardware. Programmers of synthesis tools make idiomatic (idiotic?) restrictions to make their jobs simpler.
1.11.1.5 if rising_edge and wait in Same Process
An if rising_edge statement and a wait statement in the same process (UNSYNTHESIZABLE)
  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    end if;
    wait until rising_edge(clk);
    q0 <= d1;
  end process;
Reason: The idioms for synthesis tools generally expect just a single type of flop-generating statement in each process.

1.11.1.6 if rising_edge with else Clause
The if statement has a rising_edge condition and an else clause (UNSYNTHESIZABLE).
  process (clk) begin
    if rising_edge(clk) then
      q0 <= d0;
    else
      q0 <= d1;
    end if;
  end process;
Reason: Generally, an if-then-else statement synthesizes to a multiplexer.
1.11.1.7
if rising edge Inside a for Loop
An if rising edge statement in a for-loop (UNSYNTHESIZABLE-Synopsys) process (clk) begin for i in 0 to 7 loop if rising_edge(clk) then q(i) <= d; end if; end loop; end process; Reason: just an idiom of the synthesis tool. Some loop statements are synthesizable (Rushton Section 8.7). For-loops in general are described in Ashenden.
117
Synthesizable Alternative
A synthesizable alternative to an if rising_edge statement in a for-loop is to put the if rising_edge outside of the for-loop.

    process (clk) begin
      if rising_edge(clk) then
        for i in 0 to 7 loop
          q(i) <= d;
        end loop;
      end if;
    end process;
1.11.1.8 wait Inside of a for Loop
wait statements in a for loop (UNSYNTHESIZABLE)

    process begin
      for i in 0 to 7 loop
        wait until rising_edge(clk);
        x <= to_unsigned(i,4);
      end loop;
    end process;

Reason: Unknown. while-loops with the same behaviour are synthesizable.

Note: Combinational for-loops. Combinational for-loops are usually synthesizable. They are often used to build a combinational circuit for each element of an array.

Note: Clocked for-loops. Clocked for-loops are not synthesizable, but are very useful in simulation, particularly to generate test vectors for test benches.
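The simulation use of clocked for-loops mentioned in the note can be sketched as a small test bench. This is only an illustration: the entity name for_loop_tb, the 10 ns clock period, and the done flag are assumptions, not part of the course material.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Simulation-only test bench (hypothetical name); not intended for synthesis.
entity for_loop_tb is
end for_loop_tb;

architecture sim of for_loop_tb is
  signal clk  : std_logic := '0';
  signal x    : unsigned(3 downto 0) := (others => '0');
  signal done : boolean := false;
begin
  -- Free-running clock with an assumed 10 ns period; stops once done is true.
  clk <= not clk after 5 ns when not done else '0';

  -- Clocked for-loop: drives one test vector onto x per rising clock edge.
  stimulus : process
  begin
    for i in 0 to 7 loop
      wait until rising_edge(clk);
      x <= to_unsigned(i, 4);
    end loop;
    wait until rising_edge(clk);
    assert x = to_unsigned(7, 4)
      report "last test vector should be 7" severity error;
    done <= true;
    wait;  -- suspend forever so that the simulation can end
  end process;
end sim;
```

In a real test bench, x would be connected to an input port of the design under test instead of being checked directly.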
Synthesizable Alternative to Wait-Inside-For

while loop (synthesizable). This is the synthesizable alternative to the wait statement in a for loop above.

    process begin
      -- output values from 0 to 4 on i
      -- sending one value out each clock cycle
      i <= to_unsigned(0,4);
      wait until rising_edge(clk);
      while (4 > i) loop
        i <= i + 1;
        wait until rising_edge(clk);
      end loop;
    end process;
1.12 Synthesizable VHDL Coding Guidelines

Chapter 2 RTL Design with VHDL: From Requirements to Optimized Code
2.1 Prelude to Chapter
2.2 FPGA Background and Coding Guidelines
2.2.1 Generic FPGA Hardware
2.2.1.1 Generic FPGA Cell
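Cell = Logic Element (LE) in Altera = Configurable Logic Block (CLB) in Xilinx
[Figure: generic FPGA cell with inputs carry_in, data_in, and ctrl_in; a combinational block (comb) feeding a flip-flop (D, CE); outputs data_out and carry_out]

Configurable Comb/Flop Connection
[Figure: cell with separate comb_data_in/comb_data_out and flop_data_in/flop_data_out ports, carry_in, ctrl_in, carry_out; the flip-flop has D, CE, and R inputs]

Separate Comb and Flop
[Figure: cell configured with the combinational block and the flip-flop used independently]

Connect Comb and Flop
[Figure: cell configured with the combinational block driving the flip-flop]

Flopped and Unflopped Outputs
[Figure: cell configured to provide both the combinational output and the flip-flop output]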
2.2.2 Area Estimation
To estimate the number of FPGA cells that will be required to implement a circuit, recall that an FPGA lookup-table can implement any function with up to four inputs and one output. We will describe two methods to estimate the area (number of FPGA cells) required to implement a gate-level circuit:
1. Rough estimate based simply upon the number of flip-flops and primary inputs that are in the fanin of each flip-flop.
2. A more accurate estimate, based upon greedily including as many gates as possible into each FPGA cell.
Lower Bound on Area for Circuit with one Target

    Source flops/inputs   Minimum cells
             1                  1
             2                  1
             3                  1
             4                  1
             5                  2
             6                  2
             7                  2
             8                  3
             9                  3
            10                  3
            11                  4

For a single target signal, this technique gives a lower bound on the number of cells needed. For multiple target signals, this technique might be an overestimate, because a single cell can drive several other cells.
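The table above appears to follow a simple rule: the first four-input cell absorbs up to four sources, and each further cell adds at most three new sources, because one of its inputs is taken by the output of another cell. A minimal sketch of that rule as a VHDL function follows; the package and function names are hypothetical, and the closed form is inferred from the table rather than stated in the slides.

```vhdl
-- Hypothetical helper package: lower bound on cells for one target signal,
-- assuming four-input lookup tables.
package area_est_pkg is
  function min_cells(num_sources : positive) return positive;
end package area_est_pkg;

package body area_est_pkg is
  function min_cells(num_sources : positive) return positive is
  begin
    if num_sources <= 4 then
      -- a single lookup table covers up to four sources
      return 1;
    else
      -- ceiling((num_sources - 1) / 3), written with integer division
      return (num_sources - 1 + 2) / 3;
    end if;
  end function min_cells;
end package body area_est_pkg;
```

For example, min_cells(5) evaluates to 2 and min_cells(11) to 4, in agreement with the table.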
Question: How many cells are needed to implement a 4:1 mux?

3 Cells for 10:1 Function
Estimate Area for Circuit

For each flip-flop and output: traverse backward through the fanin, gathering as much combinational circuitry as possible into the FPGA cell.
Stopping conditions:
- flip-flop
- more than four inputs
However, a point with more than four input signals is not always a stopping point: further back in the fanin, the circuit may collapse back to four or fewer signals.

Question: Map the combinational circuits below onto generic FPGA cells.
[Figure: three circuits to be mapped, each drawn over a row of generic FPGA cells (comb block plus flip-flop with D, CE, Q); signal labels a, b, c, d, z]
2.2.2.1 Interconnect for Generic FPGA

2.2.2.2 Clocks for Generic FPGAs
Characteristics of clock signals:
- High fanout (drive many gates)
- Long wires (destination gates scattered all over chip)

Characteristics of FPGAs:
- Very few gates that are large (strong) enough to support a high fanout.
- Very few wires that traverse the entire chip and can be connected to every flip-flop.
2.2.2.3 Special Circuitry in FPGAs
Memory
For more than five years, FPGAs have had special circuits for RAM and ROM. In Altera FPGAs, these circuits are called ESBs (Embedded System Blocks). These special circuits are possible because many FPGAs are fabricated on the same processes as SRAM chips. So, the FPGAs simply contain small chunks of SRAM.
Microprocessors
A new feature to appear in FPGAs in 2001 and 2002 is hardwired microprocessors on the same chip as programmable hardware.

                            Hard                            Soft
    Altera                  Arm 922T with 200 MIPs          Nios with ?? MIPs
    Xilinx: Virtex-II Pro   Power PC 405 with 420 D-MIPs    Microblaze with 100 D-MIPs

The Xilinx Virtex-II Pro has 4 Power PCs and enough programmable hardware to implement the first-generation Intel Pentium microprocessor.
Arithmetic Circuitry
A new feature to appear in FPGAs in 2001 and 2002 is hardwired circuits for multipliers and adders.

    Altera: Mercury          16 × 16 at 130 MHz
    Xilinx: Virtex-II Pro    18 × 18 at ??? MHz

Using these resources can significantly improve both the area and performance of a design.
Input / Output
Recently, high-end FPGAs have started to include special circuits to increase the bandwidth of communication with the outside world.

    Vendor    Product
    Altera    True-LVDS (1 Gbps)
    Xilinx    Rocket I/O (3 Gbps)

2.2.3 Generic-FPGA Coding Guidelines
Flip Flops Are Free
Flip-flops are almost free in FPGAs.

reason: In FPGAs, the area consumed by a design is usually determined by the amount of combinational circuitry, not by the number of flip-flops.
Use It or Lose It
Aim for using 80-90% of the cells on a chip.
reason: If you use more than 90% of the cells on a chip, then the place-and-route program might not be able to route the wires to connect the cells.
reason: If you use less than 80% of the cells, then probably:
- there are optimizations that will increase performance and still allow the design to fit on the chip; or
- you spent too much human effort on optimizing for low area; or
- you could use a smaller (cheaper!) chip.
exception: In E&CE 327 (unlike in real life), the mark is based on the actual number of cells used.
Just One Clock

Use just one clock signal.
reason: If all flip-flops use the same clock, then the clock does not impose any constraints on where the place-and-route tool puts flip-flops and gates. If different flip-flops used different clocks, then flip-flops that are near each other would probably be required to use the same clock.
Just One Clock Edge

Use only one edge of the clock signal.
reason: There are two ways to use both rising and falling edges of a clock signal: have rising-edge and falling-edge flip-flops, or have two different clock signals that are inverses of each other. Most FPGAs have only rising-edge flip-flops. Thus, using both edges of a clock signal is equivalent to having two different clock signals, which is deprecated by the preceding guideline.
2.3 Design Flow
2.4 Algorithms and High-Level Models

2.5 Finite State Machines in VHDL
2.5.1 Introduction to State-Machine Design
2.5.1.1 Mealy vs Moore State Machines
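Moore Machines
- Outputs are dependent upon only the state
- No combinational paths from inputs to outputs
[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0; s0 goes to s1 on a and to s2 on !a]

Mealy Machines
- Outputs are dependent upon both the state and the inputs
- Combinational paths from inputs to outputs
[Figure: Mealy state diagram with states s0, s1, s2, s3 and edge labels a/1, !a/0, /0, /0]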
2.5.1.2 Introduction to State Machines and VHDL
A state machine is generally written as a single clocked process, or as a pair of processes, where one is clocked and one is combinational.
Design Decisions
- Moore vs Mealy (Sections 2.5.2 and 2.5.3)
- Implicit vs Explicit (Section 2.5.1.3)
- State values in explicit state machines: enumerated type vs constants (Section 2.5.5)
- State values for constants: encoding scheme (binary, gray, one-hot, ...) (Section 2.5.5)
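The last two design decisions can be made concrete with a small package. This is a sketch only: the package, type, and constant names are hypothetical, and the particular bit patterns are one possible choice for each encoding scheme.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical package contrasting the two ways to define state values.
package state_encoding_pkg is

  -- Option 1: enumerated type. The synthesis tool chooses the encoding.
  type state_enum_ty is (s0, s1, s2, s3);

  -- Option 2a: constants with a binary encoding (two bits for four states).
  subtype state_bin_ty is std_logic_vector(1 downto 0);
  constant bin_s0 : state_bin_ty := "00";
  constant bin_s1 : state_bin_ty := "01";
  constant bin_s2 : state_bin_ty := "10";
  constant bin_s3 : state_bin_ty := "11";

  -- Option 2b: constants with a one-hot encoding (one bit per state).
  subtype state_hot_ty is std_logic_vector(3 downto 0);
  constant hot_s0 : state_hot_ty := "0001";
  constant hot_s1 : state_hot_ty := "0010";
  constant hot_s2 : state_hot_ty := "0100";
  constant hot_s3 : state_hot_ty := "1000";

end package state_encoding_pkg;
```

With constants, the designer can test individual state bits directly, at the cost of losing the type checking that an enumerated type provides.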
VHDL Constructs for State Machines

The following VHDL control constructs are useful to steer the transition from state to state:
- if ... then ... else
- case
- loop
- for ... loop
- while ... loop
- next
- exit
2.5.1.3 Explicit vs Implicit State Machines
There are two styles of writing state machines in VHDL: explicit and implicit.
Explicit
- State signal appears explicitly in VHDL code
- At most one wait statement per process
- Two sub-categories of explicit state machines:
  Explicit-Current
  - State signal represents current state
  - Next-state computation done in a clocked process
  Explicit-Current+Next
  - Two state signals: current state and next state
  - Next-state computation done in a combinational process
  - Current-state <= next-state is a registered assignment
Implicit
- Use multiple wait statements in a process to describe the state machine implicitly
Implicit State Machines

For the implicit style of writing state machines, the synthesis program adds an implicit register to hold the state signal and combinational circuitry to update the state signal. In Synopsys synthesis tools, the state signal defined by the synthesizer is named multiple_wait_state_reg. In Mentor Graphics, the state signal is named STATE_VAR.
We can think of the VHDL code for implicit state machines as having zero state signals, explicit-current state machines as having one state signal (state), and explicit-current+next state machines as having two state signals (state and state_next).
State Machine Tradeoffs

Explicit-Current+Next
- Most detailed, closest to hardware
- Greatest opportunity for manual optimization
- Most labour-intensive
- Susceptible to small, subtle, hard-to-find bugs
Explicit-Current
- Almost as much opportunity for manual optimization as Explicit-Current+Next
- Easier to write than Explicit-Current+Next
- Less susceptible to subtle bugs
Implicit
- Taught infrequently
- Least detailed, furthest from actual hardware
- Rely on synthesis for optimization
- Usually least labour to write, shortest code
- Easiest to write correctly (but must understand VHDL synthesis!)
Limitation of Implicit State Machines

Because implicit state machines are written with loops, if-then-elses, cases, etc., it is difficult to write some state machines with complicated control flows in an implicit style. The following example illustrates the point.
[Figure: state diagram with states s0/0, s1/1, s2/0, s3/0 and transitions labelled a and !a out of more than one state]
Terminology
Note: The terminology of explicit and implicit is somewhat standard, in that some descriptions of processes with multiple wait statements describe the processes as having implicit state machines. There is no standard terminology to distinguish between the two explicit styles: explicit-current+next and explicit-current.
2.5.2 Implementing a Simple Moore Machine

[Figure: Moore state diagram with states s0/0, s1/1, s2/0, s3/0; s0 goes to s1 on a and to s2 on !a]

    entity simple is
      port (
        a, clk : in std_logic;
        z : out std_logic
      );
    end simple;
2.5.2.1 Implicit Moore State Machine

Flops:    Gates:    Delay:

    architecture moore_implicit_v1a of simple is
    begin
      process begin
        z <= '0';
        wait until rising_edge(clk);
        if (a = '1') then
          z <= '1';
        else
          z <= '0';
        end if;
        wait until rising_edge(clk);
        z <= '0';
        wait until rising_edge(clk);
      end process;
    end moore_implicit_v1a;

Implicit Moore State Machine
[Figure: synthesized hardware for the implicit Moore machine]
2.5.2.2 Explicit Moore with Flopped Output

Flops:    Gates:    Delay:

    architecture moore_explicit_v1 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
                z <= '1';
              else
                state <= s2;
                z <= '0';
              end if;
            when s1 | s2 =>
              state <= s3;
              z <= '0';
            when s3 =>
              state <= s0;
              z <= '0';  -- z is registered with the state, and s0 outputs 0
          end case;
        end if;
      end process;
    end moore_explicit_v1;

Explicit Moore with Flopped Outputs
[Figure: synthesized hardware for the explicit Moore machine with flopped output]
2.5.2.3 Explicit Moore with Combinational Outputs

Flops:    Gates:    Delay:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          case state is
            when s0 =>
              if (a = '1') then
                state <= s1;
              else
                state <= s2;
              end if;
            when s1 | s2 =>
              state <= s3;
            when s3 =>
              state <= s0;
          end case;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;

Explicit Moore with Combinational Outputs
[Figure: synthesized hardware for the explicit Moore machine with combinational output]
2.5.2.4 Explicit-Current+Next Moore with Concurrent Assignment

Flops:    Gates:    Delay:

    architecture moore_explicit_v3 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      state_nxt <= s1 when (state = s0) and (a = '1')
              else s2 when (state = s0) and (a = '0')
              else s3 when (state = s1) or (state = s2)
              else s0;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v3;

Explicit-Current+Next Moore with Concurrent Assignment

The hardware synthesized from this architecture is the same as that synthesized from moore_explicit_v2, which is written in the explicit-current style.
2.5.2.5 E-C+N Moore with Comb Proc

Change the conditional assignment to state_nxt into a combinational process using a case statement.
Flops:    Gates:    Delay:
Same hardware as moore_explicit_v2 and moore_explicit_v3.

    architecture moore_explicit_v4 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state, state_nxt : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          state <= state_nxt;
        end if;
      end process;
      process (state, a) begin
        case state is
          when s0 =>
            if (a = '1') then
              state_nxt <= s1;
            else
              state_nxt <= s2;
            end if;
          when s1 | s2 =>
            state_nxt <= s3;
          when s3 =>
            state_nxt <= s0;
        end case;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v4;

Explicit-Current+Next Moore with Combinational Process
2.5.3 Implementing a Simple Mealy Machine

Mealy machines have a combinational path from inputs to outputs, which often violates good coding guidelines for hardware. Thus, Moore machines are much more common. You should know how to write a Mealy machine if needed, but most of the state machines that you design will be Moore machines. This section reserved for your reading pleasure
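As a rough illustration of what a Mealy machine can look like in the explicit-current style, the sketch below reuses the entity simple. The transitions and the output rule (z is 1 only while in s0 with a asserted) are assumptions read off the edge labels a/1, !a/0, and /0 of the Mealy diagram, and the architecture name is hypothetical.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity simple is
  port (
    a, clk : in std_logic;
    z      : out std_logic
  );
end simple;

-- Hypothetical Mealy counterpart of the simple Moore machine.
architecture mealy_explicit_sketch of simple is
  type state_ty is (s0, s1, s2, s3);
  signal state : state_ty;
begin
  -- Clocked process: state transitions only
  process (clk) begin
    if rising_edge(clk) then
      case state is
        when s0 =>
          if (a = '1') then
            state <= s1;
          else
            state <= s2;
          end if;
        when s1 | s2 =>
          state <= s3;
        when s3 =>
          state <= s0;
      end case;
    end if;
  end process;

  -- Mealy output: depends on both the state and the input a,
  -- so there is a combinational path from a to z.
  z <= '1' when (state = s0) and (a = '1') else '0';
end mealy_explicit_sketch;
```

Compared with the Moore versions, z responds to a in the same clock cycle, which is exactly the input-to-output combinational path that the coding guidelines warn about.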
2.5.4 Reset
All circuits should have a reset signal that puts the circuit back into a good initial state. However, not all flip-flops within the circuit need to be reset. In a circuit that has a datapath and a state machine, the state machine will probably need to be reset, but the datapath may not need to be reset. There are standard ways to add a reset signal to both explicit and implicit state machines. It is important that reset is tested on every clock cycle, otherwise a reset might not be noticed, or your circuit will be slow to react to reset and could generate illegal outputs after reset is asserted.
Reset with Implicit State Machine

- Insert a loop
- Test for reset after each wait
Example from section 2.5.2.1:

    architecture moore_implicit of simple is
    begin
      process begin
        init : loop                        -- outermost loop
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          if (a = '1') then
            z <= '1';
          else
            z <= '0';
          end if;
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
          z <= '0';
          wait until rising_edge(clk);
          next init when (reset = '1');    -- test for reset
        end loop init;
      end process;
    end moore_implicit;
Reset with Explicit State Machine

Reset is often easier to include in an explicit state machine, because we need only put a test for reset = '1' in the clocked process for the state. The pattern for an explicit-current style of machine is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          if ... then
            state <= ...;
          elsif ... then
            ... -- more tests and assignments to state
          end if;
        end if;
      end if;
    end process;
Reset with Explicit State Machine

Applying this pattern to the explicit Moore machine from section 2.5.2.3 produces:

    architecture moore_explicit_v2 of simple is
      type state_ty is (s0, s1, s2, s3);
      signal state : state_ty;
    begin
      process (clk) begin
        if rising_edge(clk) then
          if (reset = '1') then
            state <= s0;
          else
            case state is
              ...
            end case;
          end if;
        end if;
      end process;
      z <= '1' when (state = s1) else '0';
    end moore_explicit_v2;
Reset with Explicit-Current+Next

The pattern for an explicit-current+next style is:

    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state_cur <= <reset state>;
        else
          state_cur <= state_nxt;
        end if;
      end if;
    end process;
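As an illustration, the pattern might be applied to the explicit-current+next Moore machine of section 2.5.2.5 as sketched below. The entity simple_rst, which adds a reset port to simple, and the architecture name are assumptions made for this example; the current-state signal keeps the name state used in that section.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical entity: the ports of simple plus a synchronous reset input.
entity simple_rst is
  port (
    a, clk, reset : in std_logic;
    z             : out std_logic
  );
end simple_rst;

architecture moore_explicit_v4_rst of simple_rst is
  type state_ty is (s0, s1, s2, s3);
  signal state, state_nxt : state_ty;
begin
  -- Clocked process: reset is tested on every rising clock edge
  process (clk) begin
    if rising_edge(clk) then
      if reset = '1' then
        state <= s0;          -- s0 is the reset state
      else
        state <= state_nxt;
      end if;
    end if;
  end process;

  -- Combinational next-state process: unchanged by the addition of reset
  process (state, a) begin
    case state is
      when s0 =>
        if (a = '1') then
          state_nxt <= s1;
        else
          state_nxt <= s2;
        end if;
      when s1 | s2 =>
        state_nxt <= s3;
      when s3 =>
        state_nxt <= s0;
    end case;
  end process;

  z <= '1' when (state = s1) else '0';
end moore_explicit_v4_rst;
```

Only the clocked process changes, which is one reason reset is easy to add in this style.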
2.5.5 State Encoding

2.6 Dataflow Diagrams
2.6.1 Dataflow Diagrams Overview
Dataflow diagrams are data-dependency graphs where the computation is divided into clock cycles.
Purpose:
- Provide a disciplined approach for designing datapath-centric circuits
- Guide the design from algorithm, through high-level models, and finally to register transfer level code for the datapath and control circuitry
- Estimate area and performance
- Make tradeoffs between different design options
Background
- Based on techniques from high-level synthesis tools
- Some similarity between high-level synthesis and software compilation
- Each dataflow diagram corresponds to a basic block in software compiler terminology
Data-Dependency Graph
[Figure: data-dependency graph for z = a + b + c + d + e + f: a chain of five adders with intermediate signals x1 to x4]

Dataflow Diagrams
[Figure: dataflow diagram for z = a + b + c + d + e + f]

Clock Cycle Boundaries
[Figure: the same dataflow diagram] Horizontal lines mark clock cycle boundaries

Latency
[Figure: the same dataflow diagram with one addition per clock cycle] Latency = 6 clock cycles
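A dataflow diagram with a latency of six can be written down almost directly as a high-level model with one wait statement per clock cycle boundary. The sketch below is an assumption-laden illustration rather than the course's code: the entity name sum6, the 8-bit width (overflow is ignored), and the choice to sample all six inputs at the first boundary are all hypothetical, and no register sharing has been done yet.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity for z = a + b + c + d + e + f
entity sum6 is
  port (
    clk              : in std_logic;
    a, b, c, d, e, f : in unsigned(7 downto 0);
    z                : out unsigned(7 downto 0)
  );
end sum6;

architecture one_add_per_cycle of sum6 is
  signal ra, rb, rc, rd, re, rf : unsigned(7 downto 0);
  signal x1, x2, x3, x4         : unsigned(7 downto 0);
begin
  process begin
    wait until rising_edge(clk);  -- boundary 1: register the inputs
    ra <= a; rb <= b; rc <= c; rd <= d; re <= e; rf <= f;
    wait until rising_edge(clk);  -- boundary 2
    x1 <= ra + rb;
    wait until rising_edge(clk);  -- boundary 3
    x2 <= x1 + rc;
    wait until rising_edge(clk);  -- boundary 4
    x3 <= x2 + rd;
    wait until rising_edge(clk);  -- boundary 5
    x4 <= x3 + re;
    wait until rising_edge(clk);  -- boundary 6: registered output
    z  <= x4 + rf;
  end process;
end one_add_per_cycle;
```

Each wait corresponds to one horizontal line in the diagram, so counting the waits gives the latency of six clock cycles.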
Latency
[Figure: the same computation scheduled with up to two additions per clock cycle] Latency = 4 clock cycles
Question: Why would a good hardware engineer find this design dissatisfying?

Flip Flops
[Figure: dataflow diagram for z = a + b + c + d + e + f] Signals crossing clock boundaries are flip-flops

Registered Inputs and Outputs
[Figure: dataflow diagram] Flops on both inputs and outputs

Registered Inputs, Combinational Outputs
[Figure: dataflow diagram] Flops on inputs, but not outputs (Latency = 5)

Datapath Components
[Figure: dataflow diagram] Blocks in clock cycles are datapath components

Inputs
[Figure: dataflow diagram] Unconnected signal tails are inputs

Outputs
[Figure: dataflow diagram] Unconnected signal heads are outputs

Summary
[Figure: dataflow diagram with all of the preceding annotations]
2.6.2 Dataflow Diagrams, Hardware, and Behaviour

Primary Input
[Figure: three panels (Dataflow Diagram, Hardware, Behaviour) for a primary input i driving x; the behaviour panel is a waveform of clk, i, x]

Register Input
[Figure: three panels (Dataflow Diagram, Hardware, Behaviour) for an input i registered into x; waveform of clk, i, x]

Register Signal
[Figure: three panels (Dataflow Diagram, Hardware, Behaviour) for x as the registered sum of i1 and i2; waveform of clk, i1, i2, x]

Combinational-Component Output
[Figure: three panels (Dataflow Diagram, Hardware, Behaviour) for x as the combinational sum of i1 and i2; waveform of clk, i1, i2, x]
2.6.3 Dataflow Diagram Execution

Execution with Registers on Both Inputs and Outputs
[Figure: sequence of slides stepping the dataflow diagram for z = a + b + c + d + e + f through clock cycles 0 to 6, beside a waveform of clk, a, x1, x2, x3, x4, x5, z]

Execution Without Output Registers
[Figure: the same dataflow diagram and waveform without the output register]
2.6.4 Performance Estimation
Performance Equations

    Performance ∝ 1 / TimeExec
    TimeExec = Latency × ClockPeriod

Definition Latency: Number of clock cycles from inputs to outputs. A combinational circuit has latency of zero. A single register has a latency of one. A chain of n registers has a latency of n.

Performance of Dataflow Diagrams
- Latency: count horizontal lines in diagram
- Min clock period (max clock speed) limited by longest path in a clock cycle
2.6.5 Area Estimation
- Maximum number of blocks in a clock cycle is total number of that component that are needed
- Maximum number of signals that cross a cycle boundary is total number of registers that are needed
- Maximum number of unconnected signal tails in a clock cycle is total number of inputs that are needed
- Maximum number of unconnected signal heads in a clock cycle is total number of outputs that are needed
These estimates give lower bounds. Other constraints might force you to use more components.

Area Estimation
Implementation-technology factors, such as the relative size of registers, multiplexers, and datapath components, might force you to make tradeoffs that increase the number of datapath components to decrease the overall area of the circuit.
- With some FPGA chips, a 2:1 multiplexer has the same area as an adder.
- With some FPGA chips, a 2:1 multiplexer can be combined with an adder into one FPGA cell per bit.
- In FPGAs, registers are usually free, in that the area consumed by a circuit is limited by the amount of combinational logic, not the number of flip-flops.
2.6.6 Design Analysis
[Figure: dataflow diagram for z = a + b + c + d + e + f, to be annotated with: num inputs, num outputs, num registers, num adders, min clock period, latency]

Design Analysis (Contd)
[Figure: the dataflow diagram with a registered output (x5 feeding z), to be annotated with the same six quantities]
2.6.7 Area / Performance Tradeoffs

[Figure: two dataflow diagrams for z = a + b + c + d + e + f side by side: one add per clock cycle (clock cycles 0 to 6) and two adds per clock cycle (clock cycles 0 to 4)]
Note: In the two-add design, half of the last clock cycle is wasted.
Two Adds per Clock Cycle

[Figure: dataflow diagram with two adds per clock cycle (clock cycles 0 to 4), beside a waveform of clk, a, x1, x2, x3, x4, x5 over cycles 0 to 6]
Design Comparison

[Figure: the one-add-per-clock-cycle and two-adds-per-clock-cycle dataflow diagrams side by side]

                    One add per clock cycle    Two adds per clock cycle
    inputs          6                          6
    outputs         1                          1
    registers       6                          6
    adders          1                          2
    clock period    flop + 1 add               flop + 2 add
    latency         6                          4

Question: Under what circumstances would each design option be fastest?
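One way to approach the question is to plug the table entries into TimeExec = Latency × ClockPeriod. The derivation below is a sketch under the assumption that each design runs at its minimum clock period; the symbols t_flop (flip-flop overhead per cycle) and t_add (delay of one adder) are introduced here for illustration.

```latex
\begin{align*}
T_{\text{one-add}} &= 6\,(t_{\text{flop}} + t_{\text{add}})
                    = 6\,t_{\text{flop}} + 6\,t_{\text{add}} \\
T_{\text{two-add}} &= 4\,(t_{\text{flop}} + 2\,t_{\text{add}})
                    = 4\,t_{\text{flop}} + 8\,t_{\text{add}} \\
T_{\text{two-add}} < T_{\text{one-add}}
  &\iff 4\,t_{\text{flop}} + 8\,t_{\text{add}} < 6\,t_{\text{flop}} + 6\,t_{\text{add}} \\
  &\iff 2\,t_{\text{add}} < 2\,t_{\text{flop}} \\
  &\iff t_{\text{add}} < t_{\text{flop}}
\end{align*}
```

Under these assumptions, the two-add design finishes sooner only when an adder is faster than the flip-flop overhead; otherwise the one-add design is at least as fast and uses one adder fewer.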
2.7 Design Example: Massey

2.8 Design Example: Vanier
We'll go through the following artifacts:
1. requirements
2. algorithm
3. dataflow diagram
4. high-level models
5. hardware block diagram
6. RTL code for datapath
7. state machine
8. RTL code for control
Design Process
1. Scheduling (allocate operations to clock cycles)
2. I/O allocation
3. First high-level model
4. Register allocation
5. Datapath allocation
6. Connect datapath components, insert muxes where needed
7. Design implicit state machine
8. Optimize
9. Design explicit-current state machine
10. Optimize
2.8.1 Requirements
Functional requirements: compute the following formula:
    output = (a × d) + c + (d × b) + b
Performance requirements:
- Max clock period: flop plus (2 adds or 1 multiply)
- Max latency: 4
Cost requirements:
- Maximum of two adders
- Maximum of two multipliers
- Unlimited registers
- Maximum of three inputs and one output
- Maximum of 5000 student-minutes of design effort
Registered inputs and outputs
2.8.2 Algorithm
    output = (a × d) + c + (d × b) + b
Create a data-dependency graph for the algorithm.
[Figure: data-dependency graph with inputs a, d, b, c, two multipliers, three adders, and output z]

2.8.3 Initial Dataflow Diagram
Schedule operations into clock cycles.
[Figure: initial dataflow diagram with inputs a, d, b, c and output z]
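2.8.4 Reschedule to Meet Requirements
[Figure: two dataflow diagrams with inputs a, d, b, c and output z, before and after rescheduling]

Fix Clock Period Violation
[Figure: two dataflow diagrams with inputs d, b, c and output z, before and after fixing the clock period violation]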
2.8.5 Optimize Resources
[Figure: dataflow diagram with inputs a, d, b, c and output z after resource optimization]

Analysis
[Figure: dataflow diagram for analysis]
Question: Should we move the second addition from third clock cycle to second?
Define Entity
Having finalized our input/output scheduling, we can write our entity.
Note: we will add a reset signal later, when we design the state machine to control the datapath.

    entity vanier is
      port (
        clk : in std_logic;
        i_1, i_2 : in std_logic_vector(15 downto 0);
        o_1 : out std_logic_vector(15 downto 0)
      );
    end vanier;
2.8.6 Assign Names to Registered Values

[Figure: dataflow diagram with names assigned to signals that cross clock cycle boundaries]
Question: Why do we not need to assign names to combinational signals?
Question: Why do we not need to assign a new name to x1, x2, and x4 the second time they cross a clock cycle boundary?
2.8.7 Input/Output Allocation
[Figure: dataflow diagram with registered values x1 (d), x2 (b), x3 (a), x4, x5 (c), x6, x7, x8 and output z]
VHDL Code!

    architecture hlm_v1 of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
        wait until rising_edge(clk);
        x_8 <= x_6 + (x_4 + x_7);
      end process;
      o_1 <= std_logic_vector(x_8);
    end hlm_v1;

[Figure: dataflow diagram with inputs allocated to i1 (d, then a) and i2 (b, then c) and output z on o1, beside a waveform of i1, i2, x1 to x8 over clock cycles 0 to 5]
2.8.8 Tangent: Combinational Outputs

    architecture hlm_v1c of vanier is
      signal x_1, x_2, x_3, x_4, x_5, x_6, x_7 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        x_1 <= unsigned(i_1);
        x_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_3 <= unsigned(i_1);
        x_4 <= x_1(7 downto 0) * x_2(7 downto 0);
        x_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        x_6 <= x_3(7 downto 0) * x_1(7 downto 0);
        x_7 <= x_2 + x_5;
      end process;
      o_1 <= std_logic_vector(x_6 + (x_4 + x_7));
    end hlm_v1c;

[Figure: dataflow diagram with combinational output: inputs i1 (d, a) and i2 (b, c), registered values x1 to x7, output z on o1]
2.8.9 Register Allocation
[Figure: dataflow diagram with inputs i1 (d, a) and i2 (b, c), registered values x1 to x7, and output z on o1, before registers are allocated]
New VHDL Code!

[Figure: dataflow diagram annotated with the register allocation: x1 in r1, x2 in r2, x3 in r3, x4 in r4, x5 in r5, x6 in r2, x7 in r5, x8 in r5]

    architecture hlm_v2 of vanier is
      signal r_1, r_2, r_3, r_4, r_5 : unsigned(15 downto 0);
    begin
      process begin
        wait until rising_edge(clk);
        r_1 <= unsigned(i_1);
        r_2 <= unsigned(i_2);
        wait until rising_edge(clk);
        r_3 <= unsigned(i_1);
        r_4 <= r_1(7 downto 0) * r_2(7 downto 0);
        r_5 <= unsigned(i_2);
        wait until rising_edge(clk);
        r_2 <= r_3(7 downto 0) * r_1(7 downto 0);
        r_5 <= r_2 + r_5;
        wait until rising_edge(clk);
        r_5 <= r_2 + (r_4 + r_5);
      end process;
      o_1 <= std_logic_vector(r_5);
    end hlm_v2;
2.8.10 Datapath Allocation
[Figure: the register-allocated dataflow diagram, to be annotated with datapath components]
2.8.11 Hardware Block Diagram and State Machine

1. Calculate number of states that are needed
2. Control signals for registers
   - Chip enable
   - Mux select on input
3. Control signals for datapath components
   - Instruction (e.g. add/sub for ALU)
   - Mux select on inputs

For our example: use four states, S0..S3, one for each clock cycle.
2.8.11.1 Control for Registers

Build a table with one row per state, one column per register.
[Figure: dataflow diagram divided into states S0 to S3 and annotated with registers r1 to r5 and datapath components m1, a1, a2]
[Table to fill in: rows S0 to S3; columns r1 to r5, each with sub-columns ce and d]
Optimize chip enables and muxes

          r1        r2        r3        r4        r5
          ce  d     ce  d     ce  d     ce  d     ce  d
    S0    1   i1    1   i2    -   -     -   -     -   -
    S1    0   -     0   -     1   i1    1   m1    1   i2
    S2    -   -     1   m1    -   -     0   -     1   a1
    S3    -   -     -   -     -   -     -   -     1   a1

(- marks an entry left blank in the table.)
Chip enable: a register holds a value for multiple clock cycles.
Mux: a register loads values from multiple sources.

Optimized Chip Enables and Muxes

          r1=i1   r2        r3=i1   r4=m1   r5
          ce      ce  d             ce      d
    S0    1       1   i2            -       -
    S1    0       0   -             1       i2
    S2    -       1   m1            0       a1
    S3    -       -   -             -       a1
2.8.11.2 Control for Datapath Components
Table for datapath components. One row per state. One column per datapath component. Sub-columns for sources and instructions (e.g. add/sub for ALU).
[Figure: dataflow diagram divided into states S0 to S3 and annotated with registers r1 to r5 and datapath components m1, a1, a2]

          a1            a2            m1
          src1  src2    src1  src2    src1  src2
    S0    -     -       -     -       -     -
    S1    -     -       -     -       r1    r2
    S2    r2    r5      -     -       r3    r1
    S3    r2    a2      r4    r5      -     -

Optimize Datapath Control Table

          a1            a2            m1
          src1  src2    src1  src2    src1  src2
    S0    -     -       -     -       -     -
    S1    -     -       -     -       r1    r2
    S2    r2    r5      -     -       r1    r3
    S3    r2    a2      r4    r5      -     -
2.8.11.3 Control for State
We need to control the transition from one state to the next. For this example, the transition is very simple: each state transitions to its successor, S0 → S1 → S2 → S3 → S0 ...
2.8.11.4 Complete State Machine Table

          r1   r2   r2    r4   r5    a1        m1        state
          ce   ce   sel   ce   sel   src2 sel  src2 sel
    S0    1    1    i2    -    -     -         -         S1
    S1    0    0    -     1    i2    -         r2        S2
    S2    -    1    m1    0    a1    r5        r3        S3
    S3    -    -    -     -    a1    a2        -         S0

Question: What values should we use for don't cares?
Don't-Care Instantiations

          r1   r2   r2    r4   r5    a1        m1        state
          ce   ce   sel   ce   sel   src2 sel  src2 sel
    S0    1    1    i2    0    a1    a2        r3        S1
    S1    0    0    m1    1    i2    a2        r2        S2
    S2    1    1    m1    0    a1    r5        r3        S3
    S3    1    1    m1    0    a1    a2        r3        S0
2.8.12 VHDL Code with Explicit State Machine
We chose a one-hot encoding of the state, which usually results in small and fast hardware for state machines with sixteen or fewer states.

    architecture explicit_v1 of vanier is
      signal r_1, r_2, r_3, r_4, r_5 : std_logic_vector(15 downto 0);
      -- one-hot state encoding: exactly one bit of state is '1' at a time
      subtype state_ty is std_logic_vector(3 downto 0);
      constant s0 : state_ty := "0001";
      constant s1 : state_ty := "0010";
      constant s2 : state_ty := "0100";
      constant s3 : state_ty := "1000";
      signal state : state_ty;
    begin
      ----------------------- r_1
      process (clk) begin
        if rising_edge(clk) then
          if state /= S1 then
            r_1 <= i_1;
          end if;
        end if;
      end process;
      ----------------------- r_2
      process (clk) begin
        if rising_edge(clk) then
          if state /= S1 then
            if state = S0 then
              r_2 <= i_2;
            else
              r_2 <= m_1;
            end if;
          end if;
        end if;
      end process;
      ----------------------- r_3
      process (clk) begin
        if rising_edge(clk) then
          r_3 <= i_1;
        end if;
      end process;
      ----------------------- r_4
      process (clk) begin
        if rising_edge(clk) then
          if state = S1 then
            r_4 <= m_1;
          end if;
        end if;
      end process;
      ----------------------- r_5
      process (clk) begin
        if rising_edge(clk) then
          if state = S1 then
            r_5 <= i_2;
          else
            r_5 <= a_1;
          end if;
        end if;
      end process;
      ----------------------- combinational datapath
      with state select
        a1_src2 <= r_5 when S2,
                   a_2 when others;
      with state select
        m1_src2 <= r_2 when S1,
                   r_3 when others;
      a_1 <= a_2 + a1_src2;
      a_2 <= r_4 + r_5;
      m_1 <= r_1 * m1_src2;
      o_1 <= r_5;
      ----------------------- state machine
      process (clk) begin
        if rising_edge(clk) then
          if reset = '1' then
            state <= S0;
          else
            case state is
              when S0 => state <= S1;
              when S1 => state <= S2;
              when S2 => state <= S3;
              when S3 => state <= S0;
            end case;
          end if;
        end if;
      end process;
      ----------------------
    end explicit_v1;
Hardware Block Diagram

[Figure: state-annotated dataflow diagram (S0 to S3) beside the hardware block diagram: inputs i1 and i2; registers r1 to r5; multiplier m1; adders a1 and a2; output o1]
2.8.13 Peephole Optimizations

    -- r_1
    process (clk) begin
      if rising_edge(clk) then
        if state /= S1 then
          r_1 <= i_1;
        end if;
      end if;
    end process;

    -- r_1 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if ... then
          r_1 <= i_1;
        end if;
      end if;
    end process;
    -- r_2
    process (clk) begin
      if rising_edge(clk) then
        if state /= S1 then
          if state = S0 then
            r_2 <= i_2;
          else
            r_2 <= m_1;
          end if;
        end if;
      end if;
    end process;

    -- r_2 (optimized)
    process (clk) begin
      if rising_edge(clk) then
        if state(1) = '0' then
          if state(0) = '1' then
            r_2 <= i_2;
          else
            r_2 <= m_1;
          end if;
        end if;
      end if;
    end process;
    -- state machine
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          state <= S0;
        else
          case state is
            when S0 => state <= S1;
            when S1 => state <= S2;
            when S2 => state <= S3;
            when S3 => state <= S0;
          end case;
        end if;
      end if;
    end process;

    -- state machine (optimized)
    -- NOTE: "st" = "state"
    process (clk) begin
      if rising_edge(clk) then
        if reset = '1' then
          st <= S0;
        else
          -- rotate the one-hot state vector by one position
          for i in 0 to 3 loop
            st( (i+1) mod 4 ) <= st( i );
          end loop;
        end if;
      end if;
    end process;
2.8.14 Notes and Observations
Our functional requirements were written as:
    output = (a × d) + (d × b) + b + c
Alternatively, we could have achieved exactly the same functionality with the functional requirements written as (the two statements are mathematically equivalent):
    output = (a × d) + b + (d × b) + c
Data Dependency Graphs: Clean vs Ugly

The naive data dependency graph for the alternative formulation is much messier than the data dependency graph for the original formulation:
  Original:    (a × d) + (d × b) + b + c
  Alternative: (a × d) + c + (d × b) + b

[Figure: data dependency graphs for the original and alternative formulations, each with inputs a, d, b, c and output z.]
2.9 Pipelining
Pipelining is an optimization that increases performance by overlapping the execution of multiple parcels (instructions). The cost is an increase in area, because we cannot reuse datapath components, registers, inputs, or outputs.

2.9.1 Introduction to Pipelining
Review of unpipelined dataflow diagram

[Figure: unpipelined dataflow diagram. Inputs a through f are accumulated over clock cycles 0 to 5 using registers r1 and r2 and a single adder (add1), producing z. The accompanying waveform spans clock cycles 0 to 13.]
Question: How soon can we start to execute ?
Pipelined dataflow diagram

Each stage is treated as a separate dataflow diagram. A double line denotes the boundary between stages.

[Figure: pipelined dataflow diagram with stages 1 to 5, registers r1 to r10, and adders add1 to add5 summing inputs a through f into z. The accompanying waveform spans clock cycles 0 to 13.]
Question: How soon can we start to execute ?
Sequential (Unpipelined) Hardware

[Figure: sequential hardware with inputs i1 and i2, registers r1 and r2, one adder (add1), state bits State(0) to State(4) with reset, and output o1.]

Pipelined Hardware

[Figure: pipelined hardware with inputs i1 to i6, registers r1 to r10, five stages with adders add1 to add5, and output o1.]
Pipelined VHDL Code

-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
end process;
-- stage 2
process begin
  wait until rising_edge(clk);
  r3 <= r1 + r2;
  r4 <= i3;
end process;
-- stage 3
process begin
  wait until rising_edge(clk);
  r5 <= r3 + r4;
  r6 <= i4;
end process;
-- stage 4
process begin
  wait until rising_edge(clk);
  r7 <= r5 + r6;
  r8 <= i5;
end process;
-- stage 5
process begin
  wait until rising_edge(clk);
  r9  <= r7 + r8;
  r10 <= i6;
end process;
-- output
o1 <= r9 + r10;
2.9.2 Partially Pipelined
Fully pipelined: throughput is one parcel per clock cycle.
Partially pipelined: throughput is less than one parcel per clock cycle.
Superscalar: throughput is more than one parcel per clock cycle.

[Figure: partially pipelined dataflow diagram with three stages, registers r1 to r6, and adders add1 to add3 summing inputs a through f into z over clock cycles 0 to 5. The accompanying waveform spans clock cycles 0 to 13.]
Question: How do we execute followed by ?
Hardware for Partially Pipelined

[Figure: partially pipelined hardware with inputs i1 and i2, state bits State(0) and State(1) with reset, three stages with registers r1 to r6 and adders add1 to add3, and output o1.]

2.9.3 Terminology
Definition Depth: The depth of a pipeline is the number of stages on the longest path through the pipeline.
Definition Latency: The latency of a pipeline is measured the same as for an unpipelined circuit: the number of clock cycles from inputs to outputs.
Definition Throughput: The number of parcels consumed or produced per clock cycle.
Definition Upstream/downstream: Because parcels flow through the pipeline analogously to water in a stream, the terms upstream and downstream are used respectively to refer to earlier and later stages in the pipeline. For example, stage1 is upstream from stage2.
Definition Bubble: When a pipe stage is empty (contains invalid data), it is said to contain a bubble.
Question: How do we know whether the output of the pipeline is a bubble or is valid data?
2.10 Design Example: Pipelined Massey
Requirements

Functional requirements:
  Compute the sum of six 8-bit numbers: output = a + b + c + d + e + f
  Registered inputs, combinational outputs

Performance requirements:
  Maximum clock period: unlimited
  Maximum latency: four

Cost requirements:
  Maximum of five adders
  Small miscellaneous hardware (e.g. muxes) is unlimited
  Maximum of six inputs and one output
  Design effort is unlimited
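As a quick sanity check on the cost requirement, the adder count and the minimum adder depth for a six-input sum can be worked out directly. This is only a sketch: it assumes that every adder has exactly two inputs, which the requirements do not state explicitly.

```latex
\begin{align*}
\text{adders needed for } n \text{ inputs} &= n - 1 = 6 - 1 = 5 \\
\text{minimum adder depth (balanced tree)} &= \lceil \log_2 n \rceil = \lceil \log_2 6 \rceil = 3
\end{align*}
```

Under that assumption the limit of five adders leaves no slack, and a balanced tree places at most three adders on any path from an input to the output.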
Initial Dataflow Diagrams

[Figure: original dataflow diagram and final unpipelined dataflow diagram, each summing inputs a through f into z.]

Dataflow Diagram Exploration

[Figure: variation on the original dataflow diagram (inputs a through f, output z) and the pipelined dataflow diagram, which adds i_valid and o_valid.]
VHDL Code
-- stage 1
process begin
  wait until rising_edge(clk);
  r1 <= i1;
  r2 <= i2;
  r3 <= i3;
  r4 <= i4;
  v1 <= i_valid;
end process;
a1 <= r1 + r2;
a2 <= r3 + r4;
-- stage 2
process begin
  wait until rising_edge(clk);
  r5 <= a1;
  r6 <= a2;
  r7 <= i5;
  r8 <= i6;
  v2 <= v1;
end process;
a3 <= r5 + r6;
a4 <= r7 + r8;
-- stage 3
process begin
  wait until rising_edge(clk);
  r9  <= a3;
  r10 <= a4;
  v3  <= v2;
end process;
a5 <= r9 + r10;
-- outputs
z <= a5;
o_valid <= v3;
2.11 Memory Arrays and RTL Design
2.11.1 Memory Operations
Read of Memory with Registered Inputs

[Figure: hardware (memory M with ports WE, A, DI, and DO, driven by we, a, and clk) and behaviour waveform (clk, we, a, M(a), do) for a read.]

Write to Memory with Registered Inputs

[Figure: hardware (memory M driven by we, a, di, and clk) and behaviour waveform (clk, we, a, di, M(a), do) for a write.]

Dual-Port Memory with Registered Inputs

[Figure: hardware (memory M with ports WE, A0, DI0, DO0, A1, and DO1) and behaviour waveform (clk, we, a0, di0, a1, M(a), do0, do1).]

Sequence of Memory Operations

[Figure: dual-port memory hardware and behaviour waveform (clk, we, a0, di0, a1, M(a), do0, do1) for a sequence of operations.]
2.11.2 Memory Arrays in VHDL
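A minimal sketch of how a memory array is commonly written in VHDL is given here for orientation. It is not taken from the course notes: the entity name mem_sketch, the 16 x 8 size, and the port names d_in and d_out are assumptions made for illustration, and the read returns the old contents when a write hits the same address in the same cycle.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical single-port memory: 16 words of 8 bits (sizes are assumptions)
entity mem_sketch is
  port (
    clk   : in  std_logic;
    we    : in  std_logic;                     -- write enable
    a     : in  std_logic_vector(3 downto 0);  -- address
    d_in  : in  std_logic_vector(7 downto 0);  -- write data
    d_out : out std_logic_vector(7 downto 0)   -- read data
  );
end mem_sketch;

architecture main of mem_sketch is
  -- the memory array itself: an array of words
  type mem_ty is array (0 to 15) of std_logic_vector(7 downto 0);
  signal M : mem_ty;
begin
  process (clk) begin
    if rising_edge(clk) then
      -- write: update M(a) on the clock edge when we is asserted
      if we = '1' then
        M(to_integer(unsigned(a))) <= d_in;
      end if;
      -- read: d_out shows the contents of M(a) one clock cycle
      -- after the address is presented
      d_out <= M(to_integer(unsigned(a)));
    end if;
  end process;
end main;
```

Whether a synthesis tool maps this array to a dedicated memory block or to flip-flops depends on the tool and the target device.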

2.11.3 Data Dependencies
Definition of Three Types of Dependencies

  Read after Write      Write after Write     Write after Read
  (True dependency)     (Load dependency)     (Anti dependency)
  M[i] := ...           M[i] := ...           ... := M[i]
  ... := M[i]           M[i] := ...           M[i] := ...

Instructions in a program can be reordered, so long as the data dependencies are preserved.

Purpose of Dependencies
  W0:  R3 := ...
  W1:  R3 := ...            (producer)
  R1:  ... := ... R3 ...    (consumer)
  W2:  R3 := ...
WAW ordering prevents W0 from happening after W1.
RAW ordering prevents R1 from happening before W1.
WAR ordering prevents W2 from happening before R1.
Each of the three types of memory dependencies (RAW, WAW, and WAR) serves a specific purpose in ensuring that producer-consumer relationships are preserved.
Ordering of Memory Operations
Data Dependencies

Initial memory contents: M[3] = 30, M[2] = 20, M[1] = 10, M[0] = 0

  Initial Program
  M[2] := 21
  M[3] := 31
  A := M[2]
  B := M[0]
  M[3] := 32
  M[0] := 01
  C := M[3]

Data Dependencies (Cont'd)

  Initial Program       Valid Modification
  M[2] := 21            M[2] := 21
  M[3] := 31            B := M[0]
  A := M[2]             A := M[2]
  B := M[0]             M[3] := 31
  M[3] := 32            M[3] := 32
  M[0] := 01            M[0] := 01
  C := M[3]             C := M[3]

Data Dependencies (Cont'd)

  Initial Program       Valid (or Bad?) Modification
  M[2] := 21            M[2] := 21
  M[3] := 31            B := M[0]
  A := M[2]             A := M[2]
  B := M[0]             M[3] := 31
  M[3] := 32            C := M[3]
  M[0] := 01            M[3] := 32
  C := M[3]             M[0] := 01
2.11.4 Memory and Dataflow Diagrams
Legend for Dataflow Diagrams

[Figure: legend of node shapes for input port, output port, state signal, array read (name(rd)), and array write (name(wr)).]

Basic Memory Operations

  Memory Read:   data := mem[addr];
  Memory Write:  mem[addr] := data;

[Figure: dataflow diagrams for a memory read (mem(rd)) and a memory write (mem(wr)); the write shows an anti-dependency on mem.]
Dataflow Diagrams and Data Dependencies
Read after Write Dependencies

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Figure: read-after-write dataflow diagram with mem(wr) (inputs mem, data_in, wr_addr) followed by mem(rd) (input rd_addr), producing mem and data_out.]

Read after Write Optimization

Algo: mem[wr_addr] := data_in;
      data_out := mem[rd_addr];

[Figure: dataflow diagram with mem(wr) and mem(rd) placed side by side.]
Optimization when rd_addr = wr_addr

Write after Write Dependencies

Algo: mem[wr1_addr] := data1;
      mem[wr2_addr] := data2;

[Figure: write-after-write dataflow diagram with two mem(wr) operations in sequence.]

Write after Write Scheduling Option

[Figure: the write-after-write dataflow diagram beside an alternative in which the two mem(wr) operations are scheduled in the other order.]
Scheduling option when wr1_addr = wr2_addr

Write after Read Dependencies

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Figure: write-after-read dataflow diagram with mem(rd) followed by mem(wr), producing rd_data and mem.]

Write after Read Optimization

Algo: rd_data := mem[rd_addr];
      mem[wr_addr] := wr_data;

[Figure: dataflow diagram with mem(rd) and mem(wr) placed side by side.]
Optimization when rd_addr = wr_addr
2.11.5 Ex: Mem Array and Dataflow Diagram

  1  M[2] := 21
  2  M[3] := 31
  3  A := M[2]
  4  B := M[0]
  5  M[3] := 32
  6  M[0] := 01
  7  C := M[3]

[Figure: dataflow diagram for the seven operations above, built from M(wr) and M(rd) nodes.]

Dependencies for Known Addresses

[Figure: the dataflow diagram annotated with dependencies for known addresses.]

Anti-Dependencies for Known Addresses

[Figure: the dataflow diagram annotated with anti-dependencies for known addresses.]

Minimal Dependencies

[Figure: memory array with minimal dependencies.]

Memory Array with Orderings

[Figure: memory array with orderings.]

Place Operations in Clock Cycles

[Figure: memory operations placed in clock cycles.]

Final Dataflow Diagram

[Figure: final version of DFD.]
2.12 Input / Output Protocols

2.13 Example: Moving Average

In this section we will design a circuit that performs a moving average as it receives a stream of data. When each new data item is received, the output is the average of the four most recently received data.
  Time    0  1  2  3  4  5  6  7  8  9  10
  i_data  2  3  5  6  6  0  2  2  5  3  1
  o_avg            4  5  4  3

2.13.1 Requirements and Environmental Assumptions

1. Input data is sent sporadically, with at least 2 clock cycles of bubbles (invalid data) between valid data.
2. When the input data is valid, the signal i_valid is asserted for exactly one clock cycle.
3. Input data will be 8-bit signed numbers.
4. When output data is ready, o_valid shall be asserted.
5. The output data (o_avg) shall be the average of the four most recently received input data. Output numbers shall be truncated to integer values.
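2.13.2 Algorithm
Generic equation with input data $x_i$:
  $\mathit{avg}_i = (x_{i-3} + x_{i-2} + x_{i-1} + x_i)/4$
Decompose into sum and avg:
  $\mathit{sum}_i = x_{i-3} + x_{i-2} + x_{i-1} + x_i$
  $\mathit{avg}_i = \mathit{sum}_i/4$
Look for patterns and potential optimizations:
  $\mathit{sum}_5 = x_2 + (x_3 + x_4 + x_5)$
  $\mathit{sum}_6 = (x_3 + x_4 + x_5) + x_6 = \mathit{sum}_5 - x_2 + x_6$
Generalized recurrence equation:
  $\mathit{sum}_i = \mathit{sum}_{i-1} - x_{i-4} + x_i$
  $\mathit{avg}_i = \mathit{sum}_i/4$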
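The recurrence can be checked against the example stream from the start of this section (i_data = 2, 3, 5, 6, 6, 0, 2, ...). This worked check assumes that $x_i$ is the value received at time $i$ and that truncation is the floor of the quotient, which holds for these non-negative values.

```latex
\begin{align*}
\mathit{sum}_3 &= 2 + 3 + 5 + 6 = 16
  & \mathit{avg}_3 &= \lfloor 16/4 \rfloor = 4 \\
\mathit{sum}_4 &= \mathit{sum}_3 - x_0 + x_4 = 16 - 2 + 6 = 20
  & \mathit{avg}_4 &= \lfloor 20/4 \rfloor = 5 \\
\mathit{sum}_5 &= \mathit{sum}_4 - x_1 + x_5 = 20 - 3 + 0 = 17
  & \mathit{avg}_5 &= \lfloor 17/4 \rfloor = 4 \\
\mathit{sum}_6 &= \mathit{sum}_5 - x_2 + x_6 = 17 - 5 + 2 = 14
  & \mathit{avg}_6 &= \lfloor 14/4 \rfloor = 3
\end{align*}
```

The results 4, 5, 4, 3 agree with the o_avg values listed in the example table.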
Summary of Behaviour
1. Define a signal new for the value of i_data each time that i_valid is 1.
2. Define a memory array M to store a sliding window of the four most recent values of i_data.
3. Define a signal old for the oldest data value from the sliding window.
4. Update $\mathit{sum}_i$ with $\mathit{sum}_{i-1} - \mathit{old}_i + \mathit{new}_i$.
Sliding Window
Two design patterns to choose from: shift register vs circular buffer

[Figure: shift register (M[3] down to M[0], with new entering and old leaving) and circular buffer (M[0..3], with new and old).]

For FIFO behaviour, a circular buffer is usually preferred: smaller and lower power.

Sliding Window with Registers

[Figure: register array with chip-enables and decoded multiplexer. Four 8-bit registers M[0] to M[3], each with a chip enable ce[i] selected by idx[i]; inputs d, we, and addr; output q.]
2.13.3 Pseudocode and Dataflow Diagrams
First Pseudocode
Real 3-address pseudocode:
  new    = i_data
  old    = M[idx]
  tmp    = sum - old
  sum    = tmp + new
  M[idx] = new
  idx    = idx rol 1
  o_avg  = sum/4

[Figure: data-dependency graph for the first pseudocode, with inputs sum, M, idx, and i_data; nodes Rd, Wr, and a wired shift; intermediate signals new, old, and tmp; outputs sum, o_avg, and idx.]

Remove intermediate signal old:
  new    = i_data
  tmp    = sum - M[idx]
  sum    = tmp + new
  M[idx] = new
  idx    = idx rol 1
  o_avg  = sum/4

Reading new from memory:
  tmp    = sum - M[idx]
  M[idx] = i_data
  new    = M[idx]
  sum    = tmp + new
  idx    = idx rol 1
  o_avg  = sum/4

Remove intermediate signal new:
  tmp    = sum - M[idx]
  M[idx] = i_data
  sum    = tmp + M[idx]
  idx    = idx rol 1
  o_avg  = sum/4
Data-dependency graph after removing new

[Figure: data-dependency graph with inputs sum, M, idx, and i_data; nodes Rd, Wr, Rd, and a wired shift; outputs sum, o_avg, and idx.]

Dataflow Diagram

[Figure: two dataflow diagrams, one with a latency of three clock cycles (states S0, S1, S2) and one with a latency of two clock cycles (states S0, S1).]

Two clock cycles potentially preferable for performance, but requires an additional multiplexer.

Latency of two clock cycles with registered address

[Figure: dataflow diagram with a latency of two clock cycles and a registered address.]

Removes need for multiplexer on address input to circular buffer.

Register and Datapath Allocation

[Figure: dataflow diagram annotated with the registers idx and sum and the datapath components as1 and rol.]
2.13.4 Control Tables and State Machine

[Figure: dataflow diagram with register and datapath allocation (registers idx and sum; components as1 and rol).]

Register control table
        M                  idx          sum
        we   addr   d      ce   d       ce   d
  S0    1    idx    x      0            1    as1
  S1    0    idx           1    rol     1    as1

Datapath control table
        as1                  rol
        sub   src1   src2    src1   src2
  S0    0     M      sum
  S1    1     sum    M       idx    1

Optimized control table
        M     idx   as1
        we    ce    sub
  S0    1     1     0
  S1    0     0     1

Static assignments in control table
  M.addr   = idx
  M.d      = x
  idx.d    = rol
  sum.d    = as1
  as1.src1 = sum
  as1.src2 = M

Control Table and Bubbles

Almost final control table
          M     idx   sum   as1
          we    ce    ce    sub
  S0      1     0     1     0
  S1      0     1     1     1
  idle    0     0     0

Final control table
          M     idx   sum   as1
          we    ce    ce    sub
  S0      1     0     1     0
  S1      0     1     1     1
  idle    0     0     0     0

Static assignments
  M.addr   = idx
  M.d      = x
  idx.d    = rol
  sum.d    = as1
  as1.src1 = sum
  as1.src2 = M

State Machine
          i_valid   valid1
  S0      1         0
  S1      0         1
  idle    0         0

Final control table with state encoding
          state               M     idx   sum   as1
          i_valid   valid1    we    ce    ce    sub
  S0      1         0         1     0     1     0
  S1      0         1         0     1     1     1
  idle    0         0         0     0     0     0

  M.we    = i_valid
  idx.ce  = valid1
  sum.ce  = i_valid OR valid1
  as1.sub = valid1
2.13.5 VHDL Code
-- valid bits
process begin
  wait until rising_edge(clk);
  valid1  <= i_valid;
  o_valid <= valid1;
end process;

-- idx
process begin
  wait until rising_edge(clk);
  if reset = '1' then
    idx <= "0001";
  else
    if valid1 = '1' then
      idx <= idx rol 1;
    end if;
  end if;
end process;

-- sliding window
process begin
  wait until rising_edge(clk);
  for i in 3 downto 0 loop
    if (i_valid = '1') and (idx(i) = '1') then
      M(i) <= i_data;
    end if;
  end loop;
end process;
mem_out <= M(0) when idx(0) = '1' else
           M(1) when idx(1) = '1' else
           M(2) when idx(2) = '1' else
           M(3);

-- add sub
add_sub <= sum - mem_out when valid1 = '1' else
           sum + mem_out;

-- sum
process begin
  wait until rising_edge(clk);
  if i_valid = '1' or valid1 = '1' then
    sum <= add_sub;
  end if;
end process;
Hardware

[Figure: moving-average hardware with inputs i_valid and i_data, the valid1 register, memory M addressed by idx, the add/sub unit feeding the sum register, wired shifts, and outputs o_valid and o_avg.]

Chapter 3 Performance Analysis and Optimization

3.1 Introduction
Hennessy and Patterson's Quantitative Computer Architecture (textbook for E&CE 429) has good information on performance. We will use some of the same definitions and formulas as Hennessy and Patterson, but we will move away from generic definitions of performance for computer systems and focus on performance for digital circuits.

3.2 Defining Performance
  $\text{Performance} = \dfrac{\text{Work}}{\text{Time}}$
You can double your performance by:
  doing twice the work in the same amount of time, OR
  doing the same amount of work in half the time

Benchmarking
  $\text{Performance} = \dfrac{\text{Work}}{\text{Time}}$
Measuring time is easy, but how do we accurately measure work? The game of "benchmarketing" is finding a definition of work that makes your system appear to get the most work done in the least amount of time.

  Measure of Work      Measure of Performance
  clock cycle          MHz
  instruction          MIPs
  synthetic program    Whetstone, Dhrystone, D-MIPs (Dhrystone MIPs)
  real program         SPEC
  travel 1/4 mile      drag race
SPEC Benchmarks
The SPEC benchmarks are among the most respected and accurate predictors of real-world performance.
Definition SPEC: Standard Performance Evaluation Corporation. MISSION: To establish, maintain, and endorse a standardized set of relevant benchmarks and metrics for performance evaluation of modern computer systems (http://www.spec.org).
The SPEC organization has different benchmarks for integer software, floating-point software, web-serving software, etc.
3.3 Comparing Performance
3.3.1 General Equations

Equation for "Big is n% greater than Small":
  $n\% = \dfrac{\text{Big} - \text{Small}}{\text{Small}}$
Using the "n% greater" formula, the phrase "The performance of A is n% greater than the performance of B" is:
  $n\% = \dfrac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}$
Performance is inversely proportional to time:
  $\text{Performance} = \dfrac{1}{\text{Time}}$
Substituting the above equation into the equation for "the performance of A is n% greater than the performance of B" gives:
  $n\% = \dfrac{\text{Time}_B - \text{Time}_A}{\text{Time}_A}$
In general, the equation for a fast system to be n% faster than a slow system is:
  $n\% = \dfrac{T_{\text{Slow}} - T_{\text{Fast}}}{T_{\text{Fast}}}$
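The substitution step can be written out in full. The derivation uses only the two relations already stated (the "n% greater" formula and Performance = 1/Time); the numbers in the final line are hypothetical and serve only as a sanity check.

```latex
\begin{align*}
n\% &= \frac{\text{Performance}_A - \text{Performance}_B}{\text{Performance}_B}
     = \frac{\dfrac{1}{\text{Time}_A} - \dfrac{1}{\text{Time}_B}}{\dfrac{1}{\text{Time}_B}} \\[4pt]
    &= \frac{\text{Time}_B}{\text{Time}_A} - 1
     = \frac{\text{Time}_B - \text{Time}_A}{\text{Time}_A} \\[4pt]
\text{e.g. } \text{Time}_A &= 8\,\text{s},\ \text{Time}_B = 10\,\text{s}
  \ \Rightarrow\ n\% = \frac{10 - 8}{8} = 25\%
\end{align*}
```

Note that the faster system's time ends up in the denominator, so a system that takes 20% less time is 25% faster, not 20% faster.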
Another useful formula is the average time to do one of k different tasks, each of which happens $\%_i$ of the time and takes an amount of time $T_i$ each time it is done:
  $T_{\text{Avg}} = \sum_{i=1}^{k} (\%_i)(T_i)$
We can measure the performance of practically anything (cars, computers, vacuum cleaners, printers, ...).
3.3.2 Example: Performance of Printers

3.4 Clock Speed, CPI, Program Length, and Performance
3.4.1 Mathematics
  CPI           Cycles per instruction
  NumInsts      Number of instructions
  ClockSpeed    Clock speed
  ClockPeriod   Clock period

  $\text{Time} = \text{NumInsts} \times \text{CPI} \times \text{ClockPeriod}$
  $\text{Time} = \dfrac{\text{NumInsts} \times \text{CPI}}{\text{ClockSpeed}}$
3.4.2 Example: CISC vs RISC and CPI

                     Clock Speed   SPECint
  AMD Athlon         1.1 GHz       409
  Fujitsu SPARC64    675 MHz       443

The AMD Athlon is a CISC microprocessor (it uses the IA-32 instruction set). The Fujitsu SPARC64 is a RISC microprocessor (it uses Sun's SPARC instruction set). Assume that it requires 20% more instructions to write a program in the SPARC instruction set than the same program requires in IA-32.

SPECint and Performance
Question: Which of the two processors has higher performance?

Relative CPI
Question: What is the ratio between the CPIs of the two microprocessors?
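One way to approach the ratio is to rearrange the Time equation for CPI and divide. This is a sketch under two assumptions that the slide does not spell out: that the SPECint score is inversely proportional to execution time on the same workload, and that the 20% instruction-count difference applies to that workload.

```latex
\begin{align*}
\text{CPI} &= \frac{\text{Time} \times \text{ClockSpeed}}{\text{NumInsts}} \\[4pt]
\frac{\text{CPI}_{\text{Athlon}}}{\text{CPI}_{\text{SPARC}}}
  &= \frac{\text{Time}_{\text{Athlon}}}{\text{Time}_{\text{SPARC}}}
     \times \frac{\text{ClockSpeed}_{\text{Athlon}}}{\text{ClockSpeed}_{\text{SPARC}}}
     \times \frac{\text{NumInsts}_{\text{SPARC}}}{\text{NumInsts}_{\text{Athlon}}} \\[4pt]
  &= \frac{443}{409} \times \frac{1100}{675} \times 1.2 \approx 2.12
\end{align*}
```

Under these assumptions the Athlon needs roughly twice as many clock cycles per instruction as the SPARC64, which is plausible given that each of its instructions does more work.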
Absolute CPI
Question: Can you determine the absolute (actual) CPI of either microprocessor?
3.4.3 Effect of Instruction Set on Performance

Your group designs a microprocessor and you are considering adding a fused multiply-accumulate to the instruction set. (A fused multiply-accumulate is a single instruction that does both a multiply and an addition. It is often used in digital signal processing.) Your studies have shown that, on average, half of the multiply operations are followed by an add instruction that could be done with a fused multiply-add. Additionally, you know:

           CPI            %
  ADD      0.8 CPIavg     15%
  MUL      1.2 CPIavg     5%
  Other    1.0 CPIavg     80%

Options
You have three options:
  option 1: no change
  option 2: add the MAC instruction, increase the clock period by 20%, and MAC has the same CPI as MUL.
  option 3: add the MAC instruction, keep the clock period the same, and the CPI of a MAC is 50% greater than that of a multiply.

Question: Which option will result in the highest overall performance?
3.4.4 Effect of Time to Market on Relative Performance

Assume that performance of the average product in your market segment doubles every 18 months. You are considering an optimization that will improve the performance of your product by 7%.
Question: If you add the optimization, how much can you allow your schedule to slip before the delay hurts your relative performance compared to not doing the optimization and launching the product according to your current schedule?
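A possible way to set up the break-even calculation is sketched below. It assumes that relative performance means the product's performance divided by the market average at launch, and that the market average grows smoothly as $2^{t/18}$ with $t$ in months.

```latex
\begin{align*}
\text{market growth during a slip of } t \text{ months} &= 2^{t/18} \\
\text{break-even:}\quad 2^{t/18} &= 1.07 \\
t &= 18 \log_2 1.07 \approx 18 \times 0.0976 \approx 1.76 \text{ months}
\end{align*}
```

Under these assumptions, a slip of more than about 1.76 months (roughly seven and a half weeks) cancels the benefit of the 7% improvement.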
3.4.5 Summary of Equations

3.5 Performance Analysis and Dataflow Diagrams
3.5.1 Dataflow Diagrams, CPI, and Clock Speed
One of the challenges in designing a circuit is to choose the clock speed. Choosing a clock period affects many aspects of the design, not just the overall performance.
  Some goals will push you toward a short clock period.
  Some goals will push you toward a long clock period.

  Goal                                                   Action    Effect
  Minimize area
  Increase scheduling flexibility
  Decrease percentage of clock cycle spent in flops
    (overhead: time in flops is not doing useful work)
  Decrease time to execute an instruction
Outline to Choose Clock Period

Outline of plan to find optimal latency and clock period for maximum performance:

1. Start with the smallest possible clock period.
2. Allocate operations to clock cycles.
3. Calculate the average time to execute an instruction.
4. If latency > 1, then: increase the clock period until the latency is reduced; return to Step 2.
   Else (latency = 1): choose the clock period and dataflow diagram that resulted in the highest performance.
5. Optimize the dataflow diagram to reduce area.
3.5.2 Examples of Dataflow Diagrams for Two Instructions

Circuit supports two instructions, A and B.
Each instruction occurs 50% of the time.
The delay through a register is 5 ns.
Find the clock period and dataflow diagram that maximize overall performance.

[Figure: dataflow diagrams for Instruction A and Instruction B, built from operations f (30 ns), g (50 ns), h (20 ns), and i (40 ns).]

3.5.2.1 Scheduling of Operations for Different Clock Periods
Scheduling (1)

[Figure: schedules of Instr A and Instr B with a 55 ns clock period.]

Scheduling (2)

[Figure: schedules for further clock periods; only the labels 15 ns and 25 ns survive.]

Scheduling (3)

[Figure: schedule for a further clock period; only the labels 15 ns and 25 ns survive.]

3.5.2.2 Performance Computation for Different Clock Periods

Question: Which clock speed will result in the highest overall performance?

  Clock Period   CPI_A   CPI_B   T_avg
  55 ns
  75 ns
  85 ns
  95 ns
  155 ns
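Each row of the table can be filled in with the average-time formula from Section 3.3.1 specialized to two equally likely instructions. The sketch below also writes down the clock-period constraint implied by the 5 ns register delay; the CPI values themselves depend on the schedules and are not assumed here.

```latex
\begin{align*}
T_{\text{avg}} &= 0.5 \times \text{CPI}_A \times \text{ClockPeriod}
               + 0.5 \times \text{CPI}_B \times \text{ClockPeriod} \\
              &= \text{ClockPeriod} \times \frac{\text{CPI}_A + \text{CPI}_B}{2} \\[4pt]
\text{ClockPeriod} &\ge (\text{longest combinational delay scheduled in one cycle}) + 5\,\text{ns}
\end{align*}
```

For example, the smallest candidate period of 55 ns is consistent with the slowest single operation, g (50 ns), plus the 5 ns register delay.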
3.5.2.3 Example: Two Instructions Taking Similar Time

Question: For the flow below, which clock speed will result in the highest overall performance?

[Figure: flows for instructions A and B; operation delays as extracted: 30ns, 40ns, 50ns, 50ns, 20ns, 40ns, 50ns.]

  Clock Period   CPI_A   CPI_B   T_avg
  ___ ns
  ___ ns
  ___ ns
  ___ ns
  ___ ns
  ___ ns

3.5.2.4 Example: Same Total Time, Different Order for A

Question: For the flow below, which clock speed will result in the highest overall performance?

[Figure: flows for instructions A and B; operation delays as extracted: 30ns, 40ns, 20ns, 50ns, 50ns, 40ns, 50ns.]

  Clock Period   CPI_A   CPI_B   T_avg
  ___ ns
  ___ ns
  ___ ns
  ___ ns

3.5.3 Example: From Algorithm to Optimized Dataflow
This question involves doing some of the design work for a circuit that implements InstP and InstQ using the components described below.

  Instruction   Algorithm                            Frequency of Occurrence
  InstP         a × b × ((a × b) + (b × d) + e)      75%
  InstQ         (i + j + k + l) × m                  25%

  Component      Delay
  2-input Mult   40 ns
  2-input Add    25 ns
  Register       5 ns

NOTES
  There is a resource limitation of a maximum of 3 input ports. (There are no other resource limitations.)
  You must put registers on your inputs; you do not need to register your outputs. The environment will directly connect your outputs (its inputs) to registers.
  Each input value (a, b, c, d, e, i, j, k, l, m) can be input only once; if you need to use a value in multiple clock cycles, you must store it in a register.
Questions
Question: What clock period will result in the best overall performance?
Question: Find a minimal set of resources that will achieve the performance you calculated.
3.6 General Optimizations
3.6.1 Strength Reduction
Strength reduction replaces one operation with another that is simpler.

3.6.1.1 Arithmetic Strength Reduction

  Multiply by a constant power of two    wired shift logical left
  Multiply by a power of two             shift logical left
  Divide by a constant power of two      wired shift logical right
  Divide by a power of two               shift logical right
  Multiply by 3                          wired shift and addition
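Two rows of the table can be sketched in VHDL as follows. The entity name strength_sketch, the port names, and the bit widths are illustrative assumptions; the shifts use shift_left and shift_right from ieee.numeric_std with constant shift amounts, so they amount to wiring rather than logic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; names and widths are assumptions
entity strength_sketch is
  port (
    x      : in  unsigned(7 downto 0);
    times3 : out unsigned(9 downto 0);  -- 10 bits hold 3 * 255 = 765
    div4   : out unsigned(7 downto 0)
  );
end strength_sketch;

architecture main of strength_sketch is
begin
  -- multiply by 3: x*3 = (x*2) + x, i.e. a wired shift and one addition
  times3 <= shift_left(resize(x, 10), 1) + resize(x, 10);

  -- divide by a constant power of two: wired shift logical right
  -- (truncates toward zero for unsigned values)
  div4 <= shift_right(x, 2);
end main;
```

Widening x to 10 bits before the shift keeps the top bit from being lost in the multiply.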
3.6.1.2 Boolean Strength Reduction
Boolean tests that can be implemented as wires:
  is odd, is even
  is neg, is pos
By choosing your encodings carefully, you can sometimes reduce a vector comparison to a wire. For example, if your state uses a one-hot encoding, then the comparison state = S3 reduces to state(3) = '1'. You might expect a reasonable logic-synthesis tool to do this reduction automatically, but most tools do not do this reduction.
When using encodings other than one-hot, Karnaugh maps can be useful tools for optimizing vector comparisons. By carefully choosing our state assignments, when we use a full binary encoding for 8 states, the comparison
  (state = S0 or state = S3 or state = S4) = '1'
can be reduced from looking at 3 bits to looking at just 2 bits. If we have a condition that is true for four states, then we can find an encoding that looks at just 1 bit.
3.6.2 Replication and Sharing
3.6.2.1 Mux-Pushing
Pushing multiplexors into the fanin of a signal can reduce area.

Before:
  z <= a + b when (w = '1')
       else a + c;
After:
  tmp <= b when (w = '1')
         else c;
  z <= a + tmp;

The first circuit will have two adders, while the second will have one adder. Some synthesis tools will perform this optimization automatically, particularly if all of the signals are combinational.
3.6.2.2 Common Subexpression Elimination
Introduce new signals to capture subexpressions that occur multiple places in the code.

Before:
  y <= a + b + c when (w = '1')
       else d;
  z <= a + c + d when (w = '1')
       else e;
After:
  tmp <= a + c;
  y <= b + tmp when (w = '1')
       else d;
  z <= d + tmp when (w = '1')
       else e;

Note: Clocked subexpressions. Care must be taken when doing common subexpression elimination in a clocked process. Putting the temporary signal in the clocked process will add a clock cycle to the latency of the computation, because the tmp signal will be a flip-flop. The tmp signal must be combinational to preserve the behaviour of the circuit.
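The note can be illustrated with a small sketch in which y and z are registered but the shared subexpression is not. The entity name cse_sketch and the 8-bit widths are assumptions for illustration; the expressions are the ones from the example above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; names and widths are assumptions
entity cse_sketch is
  port (
    clk           : in  std_logic;
    w             : in  std_logic;
    a, b, c, d, e : in  unsigned(7 downto 0);
    y, z          : out unsigned(7 downto 0)
  );
end cse_sketch;

architecture main of cse_sketch is
  signal tmp : unsigned(7 downto 0);
begin
  -- shared subexpression: assigned outside the clocked process,
  -- so tmp is combinational and adds no latency
  tmp <= a + c;

  -- y and z are flip-flops; they see tmp in the same clock cycle
  process (clk) begin
    if rising_edge(clk) then
      if w = '1' then
        y <= b + tmp;
        z <= d + tmp;
      else
        y <= d;
        z <= e;
      end if;
    end if;
  end process;
end main;
```

Moving the assignment to tmp inside the clocked process would turn tmp into a flip-flop, and y and z would then be computed from the previous cycle's value of a + c.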
3.6.2.3 Computation Replication
To improve performance:
  If the same result is needed at two very distant locations and wire delays are significant, it might improve performance (increase clock speed) to replicate the hardware.
To reduce area:
  If the same result is needed at two different times that are widely separated, it might be cheaper to reuse the hardware component to repeat the computation than to store the result in a register.
Note: Muxes are not free. Each time a component is reused, multiplexors are added to inputs and/or outputs. Too much sharing of a component can cost more area in additional multiplexors than would be spent in replicating the component.

3.6.3 Arithmetic
VHDL is left-associative. The expression a + b + c + d is interpreted as (((a + b) + c) + d). You can use parentheses to suggest parallelism. Perform arithmetic on the minimum number of bits needed. If you only need the lower 12 bits of a result, but your input signals are 16 bits wide, trim your inputs to 12 bits. This results in a smaller and faster design than computing all 16 bits of the result and trimming the result to 12 bits.
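Both suggestions can be combined in one short sketch: parentheses group the additions into a balanced tree, and the inputs are trimmed to 12 bits before any arithmetic is done. The entity name sum4_sketch and the signal names are assumptions for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity; names are assumptions
entity sum4_sketch is
  port (
    a, b, c, d : in  unsigned(15 downto 0);
    z          : out unsigned(11 downto 0)  -- only the lower 12 bits are needed
  );
end sum4_sketch;

architecture main of sum4_sketch is
  signal a12, b12, c12, d12 : unsigned(11 downto 0);
begin
  -- trim the inputs first, so that every adder is 12 bits wide
  a12 <= a(11 downto 0);
  b12 <= b(11 downto 0);
  c12 <= c(11 downto 0);
  d12 <= d(11 downto 0);

  -- parentheses suggest two parallel additions followed by a third,
  -- instead of the left-associative chain ((a + b) + c) + d
  z <= (a12 + b12) + (c12 + d12);
end main;
```

Trimming first is safe here because the lower 12 bits of a sum depend only on the lower 12 bits of its operands.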
3.7 Retiming

[Figure: circuit and waveform before retiming, showing signals state, a, b, c, sel, x, y, and z over states S0 to S3, with the critical path marked.]

process begin
  wait until rising_edge(clk);
  if state = S1 then
    z <= a + c;
  else
    z <= b + c;
  end if;
end process;

Retimed Circuit and Waveform

[Figure: retimed circuit and waveform, showing signals state, a, b, c, sel, x, y, and z over states S0 to S3.]

process (state) begin
  if state = S1 then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;

process begin
  wait until rising_edge(clk);
  if state = ________ then
    sel <= '1';
  else
    sel <= '0';
  end if;
end process;
process begin
  wait until rising_edge(clk);
  if sel = '1' then
    ... -- code for z
  end if;
end process;
Chapter 4 Functional Verification

4.1 Overview
4.1.1 Terminology: Validation / Verification / Testing
4.1.2 The Difficulty of Designing Correct Chips
4.1.2.1 Notes from Kenn Heinrich (UW E&CE grad)

Everyone should get a lecture on why their first industrial design won't work in the field. Note: There are six reasons in your notes.

4.1.2.2 Notes from Aart de Geus (Chairman and CEO of Synopsys)

More than 60% of the ASIC designs that are fabricated have at least one error, issue, or problem whose severity forced the design to be reworked. Note: There is a pretty picture in your notes.
4.2 Test Cases and Coverage
4.2.1 Coverage
To be absolutely certain that an implementation is correct, we must check every combination of values. This includes both input values and internal state (flip-flops). If we have $n_i$ bits of inputs and $n_s$ bits in flip-flops, we have to test $2^{n_i + n_s}$ different cases when doing functional verification.
Question: If we have $n_c$ combinational signals, why don't we have to test $2^{n_i + n_s + n_c}$ different cases?

4.2.2 Floating Point Divider Example
This example illustrates the difficulty of achieving significant coverage on realistic circuits. Consider doing the functional simulation for a double precision (64-bit) floating-point divider.

Given Information
  Data width                                                  64 bits
  Number of gates in circuit                                  10 000
  Number of assembly-language instructions to simulate
    one gate for one test case                                100
  Number of clock cycles required to execute one
    assembly-language instruction on the computer that
    is running the simulation                                 0.5
  Clock speed of computer that is running the simulation      1 Gigahertz
Number of Cases
Question: How many cases must be considered?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9

Simulation Run Time
Question: How long will it take to simulate all of the different possible cases using a single computer?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9

Coverage
Question: If you can run simulations non-stop for one year on ten computers, what coverage will you achieve?
width=64b, gates=10 000, instrs/gate=100, cycles/instr=0.5, cycles/sec=10^9
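One way to work through the three questions is sketched below. It assumes that the divider takes two 64-bit operands and has no internal state, so that $n_i = 128$ and $n_s = 0$, and it uses a 365-day year; neither assumption is stated in the given information.

```latex
\begin{align*}
\text{cases} &= 2^{64+64} = 2^{128} \approx 3.4 \times 10^{38} \\[4pt]
\text{time per case} &= \frac{10^4 \text{ gates} \times 100 \text{ instrs/gate} \times 0.5 \text{ cycles/instr}}
                            {10^9 \text{ cycles/sec}} = 5 \times 10^{-4} \text{ s} \\[4pt]
\text{total time} &\approx 3.4 \times 10^{38} \times 5 \times 10^{-4} \text{ s}
                   = 1.7 \times 10^{35} \text{ s} \approx 5.4 \times 10^{27} \text{ years} \\[4pt]
\text{cases in one year on ten computers} &= \frac{10 \times 3.15 \times 10^{7} \text{ s}}{5 \times 10^{-4} \text{ s}}
                   \approx 6.3 \times 10^{11} \\[4pt]
\text{coverage} &\approx \frac{6.3 \times 10^{11}}{3.4 \times 10^{38}} \approx 1.9 \times 10^{-27}
\end{align*}
```

Under these assumptions, exhaustive simulation is out of reach by many orders of magnitude, which motivates the careful selection of test cases.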
Simulation vs the Real World

From Validating the Intel(R) Pentium(R) Microprocessor by Bob Bentley, Design Automation Conference 2001. (Link on E&CE 327 web page.) Simulating the Pentium 4 Processor on a Pentium 3 Processor ran at about 15 MHz.
By tapeout, over 200 billion simulation cycles had been run on a network of computers. All of these simulations represent less than two minutes of running a real processor.
4.3 Testbenches
4.3.1 Overview of Test Benches

[Figure: testbench containing stimulus, specification, and check blocks around the implementation.]

Implementation: Circuit that you're checking for bugs; also known as "design under test" or "unit under test".
Stimulus: Generates test vectors.
Specification: Describes desired behaviour of implementation.
Check: Checks whether implementation obeys specification.
4.3.2 Reference Model Style Testbench

[Figure: reference model testbench with stimulus, specification, and implementation blocks.]

4.3.3 Relational Style Testbench

[Figure: relational testbench with stimulus, check, and implementation blocks.]

4.3.4 Coding Structure of a Testbench

[Figure: testbench with stimulus, specification, check, and implementation blocks.]

architecture main of athabasca_tb is
  component declaration for implementation;
  other declarations
begin
  implementation instantiation;
  stimulus process;
  specification process (or component instantiation);
  check process;
end main;

4.3.5 Datapath vs Control
Datapath and control circuits tend to use different styles of testbenches.

[Figure: reference model testbench (stimulus, specification, implementation) and relational testbench (stimulus, check, implementation).]
4.3.6
Verication Tips
Suggested order of simulation for functional verication. 1. Write high-level model. 2. Simulate high-level model until have correct functionality and latency. 3. Write synthesizable model. 4. Use zero-delay simulation (uw-sim) to check behaviour of synthesizable model against high-level model. 5. Optimize the synthesizable model. 6. Use zero-delay simulation (uw-sim) to check behaviour of optimized model against high-level model. 7. Use timing-simulation (uw-timsim) to check behaviour of optimized model against high-level model. section 4.4 describes a series of testbenches that are particularly useful for debugging datapath circuits in the early phases of the design cycle.
4.4 Functional Verification for Datapath Circuits

In this section we will incrementally develop a testbench for a very simple circuit: an AND gate.
Implementation
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port (
    a, b : in  std_logic;
    c    : out std_logic
  );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' AND b = '1') else '0';
end main;
4.4.1 A Spec-Less Testbench
First, use a waveform viewer to check that the implementation generates reasonable outputs for a small set of inputs.
entity and2_tb is
end and2_tb;

architecture main_tb of and2_tb is
  component and2 ... end component;
  signal ta, tb, tc_impl : std_logic;
  signal ok : boolean;
begin
  ---------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ---------------------------------------------
  stimulus : process
  begin
    ta <= '0'; tb <= '0';
    wait for 10 ns;
    ta <= '1'; tb <= '1';
    wait for 10 ns;
  end process;
  ---------------------------------------------
end main_tb;
4.4.2 Use an Array for Test Vectors
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    type test_datum_ty is record
      ra, rb : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b
      ( ( '0', '0'),
        ( '1', '1')
      );
  begin
    for i in test_vectors'low to test_vectors'high loop
      ta <= test_vectors(i).ra;
      tb <= test_vectors(i).rb;
      wait for 10 ns;
    end loop;
  end process;
end main_tb;
4.4.3 Build Spec into Stimulus
stimulus : process
  type test_datum_ty is record
    ra, rb, rc : std_logic;
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    -- a, b: inputs
    -- c   : expected output
    --  a    b    c
    ( ( '0', '0', '0'),
      ( '0', '1', '0'),
      ( '1', '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
Build Spec into Stimulus (Cont'd)

stimulus : process
  ...
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    tc_spec <= test_vectors(i).rc;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
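The ok signal has to be inspected in the waveform viewer, and it can glitch for a delta cycle when tc_spec changes before tc_impl has settled. A minimal sketch of one alternative is to place an assert at the end of each vector's 10 ns window, so that mismatches appear in the simulator log. The testbench name and2_assert_tb and the message text are assumptions made for this illustration; and2 is repeated only to keep the example self-contained.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity and2 is
  port ( a, b : in std_logic; c : out std_logic );
end and2;

architecture main of and2 is
begin
  c <= '1' when (a = '1' and b = '1') else '0';
end main;

library ieee;
use ieee.std_logic_1164.all;

entity and2_assert_tb is   -- hypothetical testbench name
end and2_assert_tb;

architecture main_tb of and2_assert_tb is
  signal ta, tb, tc_impl, tc_spec : std_logic;
begin
  impl : entity work.and2 port map (a => ta, b => tb, c => tc_impl);

  stimulus : process
    type test_datum_ty is record
      ra, rb, rc : std_logic;
    end record;
    type test_vectors_ty is array(natural range <>) of test_datum_ty;
    constant test_vectors : test_vectors_ty :=
      --  a    b    c (expected)
      ( ( '0', '0', '0'),
        ( '0', '1', '0'),
        ( '1', '1', '1') );
  begin
    for i in test_vectors'range loop
      ta      <= test_vectors(i).ra;
      tb      <= test_vectors(i).rb;
      tc_spec <= test_vectors(i).rc;
      wait for 10 ns;
      -- Both signals have settled by now, so compare them once per vector.
      assert tc_impl = tc_spec
        report "error: and2 output differs from spec for vector "
               & integer'image(i)
        severity warning;
    end loop;
    wait;  -- suspend forever after one pass through the vectors
  end process;
end main_tb;
```

With a correct and2, the simulation finishes without printing any assertion messages.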
4.4.4 Have Separate Specification Entity
entity and2_spec is
  ...(same as and2 entity)...
end and2_spec;

architecture spec of and2_spec is
begin
  c <= a AND b;
end spec;
Testbench for Separate Specification

architecture main_tb of and2_tb is
  component and2 ...;
  component and2_spec ...;
  signal ta, tb, tc_impl, tc_spec : std_logic;
  signal ok : boolean;
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  spec : and2_spec port map (a => ta, b => tb, c => tc_spec);
  ------------------------------------------
  stimulus process...
  check process...
end
Testbench for Separate Spec (Cont'd)

stimulus : process
  ...
  constant test_vectors : test_vectors_ty :=
    --  a    b
    ( ( '0', '0'),
      ( '1', '1')
    );
begin
  for i in test_vectors'low to test_vectors'high loop
    ta <= test_vectors(i).ra;
    tb <= test_vectors(i).rb;
    wait for 10 ns;
  end loop;
end process;
------------------------------------------
check : process (tc_impl, tc_spec)
begin
  ok <= (tc_impl = tc_spec);
end process;
------------------------------------------
end main_tb;
4.4.5 Generate Test Vectors Automatically
architecture main_tb of and2_tb is
  ...
begin
  ...
  stimulus : process
    -- restrict the loop values to '0' and '1'
    subtype std_test_ty is std_logic range '0' to '1';
  begin
    for va in std_test_ty'low to std_test_ty'high loop
      for vb in std_test_ty'low to std_test_ty'high loop
        ta <= va;
        tb <= vb;
        wait for 10 ns;
      end loop;
    end loop;
  end process;
  ...
end main_tb;
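The same idea extends to multi-bit inputs by looping over integers and converting each value to a vector. The sketch below is an illustration only: the entity name exhaustive_stim_tb and the 4-bit width are assumptions, and no implementation is instantiated.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity exhaustive_stim_tb is   -- hypothetical testbench name
end exhaustive_stim_tb;

architecture main_tb of exhaustive_stim_tb is
  constant width : natural := 4;   -- assumed input width
  signal ta, tb : std_logic_vector(width-1 downto 0);
begin
  stimulus : process
  begin
    -- Enumerate every pair of input values: 2**width * 2**width vectors.
    for va in 0 to 2**width - 1 loop
      for vb in 0 to 2**width - 1 loop
        ta <= std_logic_vector(to_unsigned(va, width));
        tb <= std_logic_vector(to_unsigned(vb, width));
        wait for 10 ns;
      end loop;
    end loop;
    wait;  -- suspend forever after one pass
  end process;
end main_tb;
```

Exhaustive generation grows as 2**(2*width), so it is practical only for narrow inputs.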
4.4.6 Relational Specification
Sometimes we want to check a relationship between the output and the input, rather than check that the output has a specific value. To do this, we drop the spec process, and put the brains into the check process.
architecture main_tb of and2_tb is
  ...
begin
  ------------------------------------------
  impl : and2 port map (a => ta, b => tb, c => tc_impl);
  ------------------------------------------
  stimulus : process
    ...
  end process;
  ------------------------------------------
  -- there is no tc_spec signal in this style; the check reads ta, tb, and tc_impl
  check : process (ta, tb, tc_impl)
  begin
    ok <= NOT (tc_impl = '1' AND (ta = '0' OR tb = '0'));
  end process;
  ------------------------------------------
end main_tb;
4.5 Functional Verification of Control Circuits

Control circuits are often more challenging to verify than datapath circuits.
In this section, we will explore the functional verification of state machines via a First-In First-Out queue.
4.5.1 Overview of Queues in Hardware

[Figure: Structure of queue: a queue block with write and read ports]
[Figure: Write Sequence: starting from an empty queue, a write of A (Write 1, Write 2)]
[Figure: A Second Example Write: a write into a queue holding A and B (Write 1, Write 2)]
[Figure: Example Read Sequence (Read 1, Read 2)]
[Figure: Write Illustrating Index Wrap: a queue holding B through J (Write 1, Write 2)]
[Figure: Write Illustrating Full Queue: a write of K into a queue holding B through J (Write 1, Write 2)]
[Figure: Queue Signals: memory mem (ports WE, A0, DI0, A1, DO0, DO1) with signals do_wr, data_wr, wr_idx, do_rd, rd_idx, data_rd, and empty. Control circuitry not shown.]
[Figure: Incomplete Queue Blocks]
4.5.2 VHDL Coding
4.5.2.1 Package
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package queue_pkg is
  subtype data is std_logic_vector(3 downto 0);
  function to_data(i : integer) return data;
end queue_pkg;

package body queue_pkg is
  function to_data(i : integer) return data is
  begin
    return std_logic_vector(to_unsigned(i, 4));
  end to_data;
end queue_pkg;
4.5.2.2 Other VHDL Coding
This section reserved for your reading pleasure.
4.5.3 Code Structure for Verification
Verification things to notice in queue implementation:
1. instrumentation code
2. coverage monitors
3. assertions
Code Structure for Verification

architecture ... is
  ...
begin
  ... normal implementation ...
  process (clk) begin
    if rising_edge(clk) then
      ... instrumentation code ...
      prev_signame <= signame;
    end if;
  end process;
  ... assertions ...
  ... coverage monitors ...
end;
4.5.4 Instrumentation Code
- Added to implementation to support verification
- Usually keeps track of previous values of signals
- Does not create hardware (optimized away during synthesis)
- Does not feed any output signals
- Must use synthesizable subset of VHDL
process (clk) begin
  if rising_edge(clk) then
    prev_rd_idx <= rd_idx;
    prev_wr_idx <= wr_idx;
    prev_do_rd  <= do_rd;
    prev_do_wr  <= do_wr;
  end if;
end process;
Coverage Events for Queue

Question: What events should we monitor to estimate the coverage of our functional tests?
Coverage Monitor Template

process (signals read)
begin
  if (condition) then
    report "coverage: message";
  elsif (condition) then
    report "coverage: message";
  else
    report "error: case fall through on message"
      severity warning;
  end if;
end process;
Coverage Monitor Code

Events related to rd_idx equals wr_idx.
process (prev_rd_idx, prev_wr_idx, rd_idx, wr_idx)
begin
  if (rd_idx = wr_idx) then
    if ( prev_rd_idx = prev_wr_idx ) then
      report "coverage: read = write both moved";
    elsif ( rd_idx /= prev_rd_idx ) then
      report "coverage: Read caught write";
    elsif ( wr_idx /= prev_wr_idx ) then
      report "coverage: Write caught read";
    else
      report "error: case fall through on rd/wr catching"
        severity warning;
    end if;
  end if;
end process;
Coverage Monitor Code

Events related to rd_idx wrapping.
process (rd_idx)
begin
  if (rd_idx = low_idx) then
    report "coverage: rd mv to low";
  elsif (rd_idx = high_idx) then
    report "coverage: rd mv to high";
  else
    report "coverage: rd mv normal";
  end if;
end process;
4.5.5 Assertions
Assertions for Queue
1. If rd_idx changes, then it increments or wraps.
2. If rd_idx changes, then do_rd was '1', or reset is '1'.
3. If wr_idx changes, then it increments or wraps.
4. If wr_idx changes, then do_wr was '1', or reset is '1'.
5. And many others....
Assertion Template
process (signals read)
begin
  assert (required condition)
    report "error: message" severity warning;
end process;
Assertions: Read Index

process (rd_idx)
begin
  assert ((rd_idx > prev_rd_idx) or (rd_idx = low_idx))
    report "error: rd inc" severity warning;
  assert ((prev_do_rd = '1') or (reset = '1'))
    report "error: rd imp do_rd" severity warning;
end process;
Assertions: Write Index

process (wr_idx)
begin
  assert ((wr_idx > prev_wr_idx) or (wr_idx = low_idx))
    report "error: wr inc" severity warning;
  assert ((prev_do_wr = '1') or (reset = '1'))
    report "error: wr imp do_wr" severity warning;
end process;
4.5.6 VHDL Coding Tips
Vector Type Declaration
type data_array_ty is array(natural range <>) of data;
signal data_array : data_array_ty(7 downto 0);
Functions
function to_idx
  (i : natural range data_array'low to data_array'high)
  return idx_ty
is
begin
  return to_unsigned(i, idx_ty'length);
end to_idx;
Conversion to Index
Without function: rd_idx <= to_unsigned(5, 3);
With function:    rd_idx <= to_idx(5);
The function code is verbose, but is very maintainable, because neither the function itself nor uses of the function need to know the width of the index vector.
Attributes
function inc_idx (idx : idx_ty) return idx_ty is
begin
  if idx < data_array'high then
    return (idx + 1);
  else
    return (to_idx(data_array'low));
  end if;
end inc_idx;
Feedback Loops, and Functions

Coding guideline: use functions. Don't use procedures.
inc as function:  wr_idx <= inc_idx(wr_idx);
inc as procedure: inc_idx(wr_idx);
Functions clearly distinguish between reading from a signal and writing to a signal. By examining the use of a procedure, you cannot tell which signals are read from and which are written to. You must examine the declaration or implementation of the procedure to determine the modes of signals. Modifying a signal within a procedure results in a tri-state signal. This is bad.
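To make the difference concrete, the sketch below declares both forms side by side. The package name idx_pkg, the procedure name inc_idx_proc, and the 3-bit index width are assumptions for illustration. The wrap here relies on unsigned arithmetic rolling over, which is simpler than the attribute-based inc_idx shown earlier.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package idx_pkg is   -- hypothetical package name
  subtype idx_ty is unsigned(2 downto 0);   -- assumed 3-bit index

  -- Function: the argument is only read; the new value is the result,
  -- so the caller's assignment shows which signal is written.
  function inc_idx (idx : idx_ty) return idx_ty;

  -- Procedure: the signal is both read and written, but that is
  -- visible only in this declaration (mode inout), not at the call.
  procedure inc_idx_proc (signal idx : inout idx_ty);
end idx_pkg;

package body idx_pkg is
  function inc_idx (idx : idx_ty) return idx_ty is
  begin
    return idx + 1;   -- rolls over from 7 to 0 for a 3-bit unsigned
  end inc_idx;

  procedure inc_idx_proc (signal idx : inout idx_ty) is
  begin
    idx <= idx + 1;
  end inc_idx_proc;
end idx_pkg;
```

At the call site, `wr_idx <= inc_idx(wr_idx);` shows the write explicitly, whereas `inc_idx_proc(wr_idx);` gives no hint that wr_idx is modified.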
File I/O (textio package)

TEXTIO defines the read, write, readline, and writeline procedures. Described in: http://www.eng.auburn.edu/department/ee/mgc/vhdl.html#textio
These procedures can be used to read test vectors from a file and write results to a file.
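As a rough sketch of reading test vectors from a file, the stimulus process below reads two bits per line and drives them onto ta and tb. The file name vectors.txt, its format (one line per vector, such as "0 1"), and the entity name and2_file_tb are assumptions for this example.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

entity and2_file_tb is   -- hypothetical testbench name
end and2_file_tb;

architecture main_tb of and2_file_tb is
  signal ta, tb : std_logic;
begin
  stimulus : process
    -- assumed file: each line holds two bits, e.g. "0 1"
    file vec_file : text open read_mode is "vectors.txt";
    variable l      : line;
    variable va, vb : bit;
  begin
    while not endfile(vec_file) loop
      readline(vec_file, l);   -- fetch one line of the file
      read(l, va);             -- first bit on the line
      read(l, vb);             -- second bit on the line
      ta <= to_stdulogic(va);
      tb <= to_stdulogic(vb);
      wait for 10 ns;
    end loop;
    wait;  -- suspend forever at end of file
  end process;
end main_tb;
```

Results can be written back in the same style with write and writeline on a file opened in write_mode.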
4.5.7 Queue Specification
Most bugs in queues are related to the queue becoming full, becoming empty, and/or wrap of indices.
The specification should be obviously correct.
Avoid bugs in the specification by making the specification queue larger than the maximum number of writes that we will do in the test suite. Thus, the specification queue will never become full or wrap. However, the implementation queue will become full and wrap.
Write Index Update in Specification

We increment the write index on every write; we never wrap.
process (clk) begin
  if rising_edge(clk) then
    if (reset = '1') then
      wr_idx <= 0;
    elsif (do_wr = '1') then
      wr_idx <= wr_idx + 1;
    end if;
  end if;
end process;
Things to Notice
Things to notice in queue specification:
1. don't care conditions ('-')
2. uninitialized data (hint: what is the value of rd_data when we do more reads than writes?)
Don't Care
rd_data <= data_array(rd_idx) when (do_rd = '1')
           else (others => '-');
4.5.8 Queue Testbench
Things to notice in queue testbench:
1. running multiple test sequences
2. uninitialized data ('U')
3. std_match to compare spec and impl data
[Table comparing the values matched by = and by std_match. Recoverable entries: 0 with 0; 1 with 1; everything else; 0 with L; 1 with H; - with everything]
With equality, '-' /= '1', but we want to use '-' to mean "don't care" in the specification. The solution is to use std_match, rather than =, to check implementation signals against the specification.
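A small sketch of the difference, using std_match from ieee.numeric_std. The entity name std_match_demo and the data values are assumptions chosen for illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity std_match_demo is   -- hypothetical entity name
end std_match_demo;

architecture main of std_match_demo is
  -- spec marks bits 2 and 0 as don't care
  signal spec_data : std_logic_vector(3 downto 0) := "1-0-";
  signal impl_data : std_logic_vector(3 downto 0) := "1100";
begin
  check : process
  begin
    wait for 1 ns;
    -- "=" compares element by element, so '-' equals only '-':
    -- the two vectors are reported as different.
    assert impl_data /= spec_data
      report "unexpected: = treated '-' as a wildcard"
      severity error;
    -- std_match treats '-' as matching any value.
    assert std_match(impl_data, spec_data)
      report "error: impl data does not match spec data"
      severity error;
    wait;  -- suspend forever
  end process;
end main;
```

Both assertions hold for these values, so the simulation runs without printing a message.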
Stimulus Process Structure

The stimulus process runs multiple test vectors in a single simulation run.
stimulus : process
  type test_datum_ty is record
    r_reset, ... normal fields ...
  end record;
  type test_vectors_ty is array(natural range <>) of test_datum_ty;
  constant test_vectors : test_vectors_ty :=
    ( -- reset   ... other signal ...
      ( '1', normal fields), -- test case 1
      ( '0', normal fields),
      ...
      ( '1', normal fields), -- test case 2
      ( '0', normal fields),
      ...
    );
begin
  for i in test_vectors'range loop
    if (test_vectors(i).r_reset = '1') then
      ... reset code ...
    end if;
    reset <= '0';
    ... normal sequence ...
    wait until rising_edge(clk);
  end loop;
end process;
4.6 Example: Microwave Oven
This question concerns the VHDL code microwave, which controls a simple microwave oven; the properties prop1...prop3; and two proposed changes to the VHDL code.
INSTRUCTIONS:
1. Assume that the code as currently written is correct; any change to the code that causes a change to the behaviour of the signals heat or count is a bug.
2. For each of the two proposed code changes, answer whether the code change will cause a bug.
3. If the code change will cause a bug, provide a test case that will exercise the bug and identify all of the given properties (prop1, prop2, and prop3) that will detect the bug with the test case you provide.
4. If none of the three properties can detect the bug, provide a property of your own that will detect the bug with the test case you provide.
Question: For each of the three properties prop1...prop3, answer whether the property is best checked as part of a testbench or assertion. For each property, justify why a testbench or an assertion is the best method to validate that property.
prop1: If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2: If the door is open, then heat is off.
prop3: If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Implementation
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity microwave is
  port (
    timer             -- time input from user
      : in unsigned(7 downto 0);
    reset,            -- resets microwave
    clk,              -- clock signal input
    is_open,          -- detects when door is open
    start             -- start button input from user
      : in std_logic;
    heat              -- 1=on, 0=off
      : out std_logic
  );
end microwave;

architecture main of microwave is
  signal count  : unsigned(7 downto 0);  -- internal time count
  signal x_heat : std_logic;
begin
  -- heat process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        x_heat <= '0';
      elsif (is_open = '0') and (start = '1') and (timer > 0) then
        x_heat <= '1';
      elsif (is_open = '0') and (count > 0) then
        x_heat <= x_heat;
      else
        x_heat <= '0';
      end if;
    end if;
  end process;
  -- region of change #1

  -- count process ------------------------------
  process (clk)
  begin
    if rising_edge(clk) then
      if (reset = '1') then
        count <= to_unsigned(0, 8);
      elsif (start = '1') then
        count <= timer;
      elsif (count > 0) then
        count <= count - 1;
      end if;
    end if;
  end process;
  heat <= x_heat;
end main;
-- region of change #2
Properties
prop1: If start is pushed and the door is closed, then heat remains on for exactly the time specified by the timer when start was pushed, assuming reset remains false and the door remains closed.
prop2: If the door is open, then heat is off.
prop3: If start is not pushed, reset is false, and count is greater than zero, then count is decremented.
Change #1
From:
  elsif (start = '1') then
    count <= timer;
  elsif (count > 0) then
    count <= count - 1;
To:
  elsif (count > 0) then
    count <= count - 1;
  elsif (start = '1') then
    count <= timer;
Change #2
From:
  elsif (is_open = '0') and (start = '1') and (timer > 0) then
    x_heat <= '1';
  elsif (is_open = '0') and (count > 0) then
    x_heat <= x_heat;
To:
  elsif (is_open = '0') and ((start = '1') or (count > 0)) then
    x_heat <= '1';
  else
    x_heat <= '0';
Coverage
Question: If the msb of src1 is 1 and the lsb of src2 is 0 or sum(3) is 1, then the result is wrong. What is the minimum coverage needed to detect the bug? What is the minimum coverage needed to guarantee that the bug will be detected?
Chapter 5 Timing Analysis
5.1 Delays and Definitions
In this section we will look at the different timing parameters of circuits. Our focus will be on those parameters that limit the maximum clock speed at which a circuit will work correctly.
5.1.1 Background Definitions
5.1.2 Clock-Related Timing Definitions
5.1.2.1 Clock Skew

[Figure: waveforms of clk1, clk2, clk3, and clk4, marking the skew between arrivals of the same clock edge]
Definition Clock Skew: The difference in arrival times for the same clock edge at different flip-flops.
Clock skew is caused by the difference in interconnect delays to different points on the chip.
Clock Tree Design

Clock tree design is critical in high-performance designs to minimize clock skew. Sophisticated synthesis tools put lots of effort into clock tree design, and the techniques for clock tree design still generate PhD theses.
5.1.2.2 Clock Latency
[Figure: waveforms of the master clock, intermediate clock, and final clock, marking the latency between them]
Definition Clock Latency: The difference in arrival times for the same clock edge at different levels of interconnect along the clock tree. (Intuitively: different points in the clock generation circuitry.)
Note: Clock latency does not affect the limit on the minimum clock period.
5.1.2.3 Clock Jitter
[Figure: an ideal clock compared with a clock with jitter]
Definition Clock Jitter: The difference between the actual clock period and the ideal clock period.
Causes of Clock Jitter

Clock jitter is caused by:
- temperature and voltage variations over time
- temperature and voltage variations across different locations on a chip
- manufacturing variations between different parts
5.1.3 Storage-Related Timing Definitions
5.1.3.1 Flops and Latches

[Figure: Flop Behaviour and Latch Behaviour: waveforms of clk, d, and q for each]
Storage devices have two modes: load mode and store mode. Flops are edge sensitive; they are in load mode just before the clock edge. Latches are level sensitive; they are in load mode while their enable signal is asserted high (low for active-low latches).
Timing Parameters
[Figure: timing diagrams of d, clk, and q for a flip-flop, an active-high latch, and an active-low latch, marking Setup, Hold, and Clock-to-Q]
Setup and hold define the window in which input data are required to be constant in order to guarantee that the storage device will store data correctly. Clock-to-Q defines the delay from the clock edge to when the output is guaranteed to be stable.
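In simulation, the setup and hold window can be checked with a small monitor placed beside a rising-edge flop. The following is a sketch only: the entity name flop_timing_monitor and the default values of t_setup and t_hold are assumptions, not parameters of any particular device.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity flop_timing_monitor is   -- hypothetical, simulation-only monitor
  generic (
    t_setup : time := 2 ns;     -- assumed setup time
    t_hold  : time := 1 ns      -- assumed hold time
  );
  port ( clk, d : in std_logic );
end flop_timing_monitor;

architecture sim of flop_timing_monitor is
begin
  process (clk, d)
    variable last_rise : time    := 0 ns;   -- time of the latest rising edge
    variable seen_rise : boolean := false;  -- no edge seen yet
  begin
    if rising_edge(clk) then
      -- Setup: d must have been constant for t_setup before the edge.
      assert d'last_event >= t_setup
        report "setup violation on d" severity warning;
      last_rise := now;
      seen_rise := true;
    end if;
    if d'event and seen_rise then
      -- Hold: d must stay constant for t_hold after the edge.
      assert (now - last_rise) >= t_hold
        report "hold violation on d" severity warning;
    end if;
  end process;
end sim;
```

The monitor has no outputs; it only reports warnings when d changes inside the window around a rising clock edge.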
5.1.4 Propagation Delays
Propagation delay: the time it takes a signal to travel from the source (driving) flop to the destination flop.
propagation delay = load delay + interconnect delay
Load delay: combinational gates between the flops.
Interconnect delay: wires between gates and flops.
5.1.5 Timing Constraints
5.1.5.1 Minimum Clock Period

[Figure: a path from signal a (clocked by clk1) to signal b (clocked by clk2), with waveforms of clk1, clk2, a, and b over one clock period. Legend: signal is stable; signal may change; signal may rise; signal may fall.]
ClockPeriod >
5.1.5.2 Hold Constraint
5.1.5.3 Example Timing Violations
Good Timing

[Figure: Good Timing: circuit with signals a, b, c, d and clk; waveforms marking Clock-to-Q, Prop, Setup, and Hold]
Setup Violation
[Figure: Setup Violation: waveforms of a, clk, b, c, and d marking Clock-to-Q, Prop, and Setup; c and d become unknown (???)]
Hold Violation
[Figure: Hold Violation: waveforms of a, clk, b, c, and d marking Clock-to-Q, Prop, and Hold; the captured value becomes unknown (???)]
5.2 Timing Analysis of Latches and Flip Flops

In this section, we show how to find the clock-to-Q, setup, and hold times for latches, flip-flops, and other storage elements.
5.2.1 Simple Multiplexer Latch
5.2.1.1 Structure and Behaviour of Multiplexer Latch

[Figure: multiplexer latch with inputs clk and i and output o, shown in loading / pass-through mode and in storage mode]
Unfold Multiplexer to Simple Gates

[Figure: Multiplexer: symbol and implementation (inputs a and b, select s, output o)]

[Figure: Latch implementation (clk, i, o) using the unfolded multiplexer]
Latch Glitching
[Figure: latch circuit with signals d, clk, and o]
Note: inverters on clk. Both of the inverters on the clk signal are needed. Together, they prevent a glitch on the OR gate when clk is deasserted. If there were only one inverter, a glitch would occur. For more on this, see section 5.2.1.6.
Loading and Storing Values

[Figure: four annotated copies of the latch circuit: Loading 0 (d=0, clk=1), Loading 1 (d=1, clk=1), Storing 0 (clk=0), and Storing 1 (clk=0)]
5.2.1.2 Strategy for Timing Analysis of Storage Devices

The key to calculating setup and hold times of a latch, flop, etc. is to identify:
1. how the data is stored when not connected to the input (often a pair of inverters in a loop)
2. the gate(s) that the clock uses to cause the stored data to drive the output (often a transmission gate or multiplexer)
3. the gate(s) that the clock uses to cause the input to drive the output (often a transmission gate or multiplexer)
5.2.1.3 Clock-to-Q Time of a Multiplexer Latch

[Figure: sequence of snapshots of the latch circuit (signals d, clk, l1, l2, c2, cn, s1, s2, qn, q)]
5.2.1.4 Setup Timing of a Multiplexer Latch

[Figure: sequence of annotated snapshots of the latch circuit, with the captions below]
Circuit is stable in load mode
t=0: Clk transitions from load to store
t=1: Clk transitions from load to store
t=2: s1 propagates to s2, because cn turns on AND gate
t=3: l2 is set to 0, because c2 turns off AND gate
t=4: value from store path propagates to q
t=5: value from store path completes cycle
Setup Violation
[Figure: sequence of annotated snapshots of the latch circuit, with the captions below]
Circuit is stable in load mode with the old value
t=-1: D transitions from the old value to the new value
t=0: new value propagates through inverter; Clk transitions from load to store
t=1: new value propagates through AND; Clk propagates through inverter
Trouble: inconsistent values on load path and store path. The old value is still in the store path when the store path is enabled.
t=2: old value propagates through AND
t=3: l2 is set to 0, because c2 turns off AND gate
t=4: old/new value from store path propagates to q
t=5: old/new value from store path completes cycle
t=5: Illustrate instability with the two values set to 0 and 1
[Figure: waveform "setup with negative margin" showing d, l1, l2, qn, q, s1, s2, clk, cn, and c2 over t = -3 to 6]
We now repeat the analysis of setup violation, but illustrate the minimum violation (the input transitions from the old value to the new value 3 time-units before the clock edge).
[Figure: sequence of annotated snapshots of the latch circuit, with the captions below]
Circuit is stable in load mode with the old value
t=-3: D transitions from the old value to the new value
t=-2: new value propagates through inverter
t=-1: new value propagates through AND
t=0: Clk transitions from load to store
t=1: Clk propagates through inverter
Trouble: inconsistent values on load path and store path. The old value is still in the store path when the store path is enabled.
t=2: old value propagates through AND
t=3: l2 is set to 0, because c2 turns off AND gate
t=4: old/new value from store path propagates to q
t=5: old/new value from store path completes cycle
t=5: Illustrate instability with the two values set to 0 and 1
[Figure: waveform "setup with negative margin" showing d, l1, l2, qn, q, s1, s2, clk, cn, and c2 over t = -3 to 6]
Minimum Setup Time

[Figure: latch circuit and waveform of d, l1, l2, qn, q, s1, s2, clk, cn, and c2, marking the setup time]
5.2.1.5 Hold Time of a Multiplexer Latch

[Figure: latch circuit with signals d, clk, cn, s2, s1, l1, c2, l2, qn, and q]
Hold Time Behaviour

[Figure: sequence of snapshots of the latch circuit (signals d, clk, cn, l1, c2, l2, qn, s2, s1, q)]
5.2.1.6 Example of a Bad Latch

[Figure: latch circuit and waveform of d, l1, l2, qn, q, s1, s2, clk, c2, and cn]
5.3 Critical Paths and False Paths
5.3.1 Introduction to Critical and False Paths

Definition critical path: The slowest path on the chip between flops, or between flops and pins. The critical path limits the maximum clock speed.
Definition false path: A path along which an edge cannot travel from beginning to end.
Outline
The algorithm that we present comes from McGeer and Brayton in a DAC 198? paper. The algorithm to find the critical path through a circuit is presented in several parts.
1. Section 5.3.2: Find the longest path, ignoring the possibility of false paths.
2. Section 5.3.3: Almost-correct algorithm to test whether a candidate critical path is a false path.
3. Section 5.3.4: If a candidate path is a false path, then find the next candidate path, and repeat the false-path detection algorithm.
4. Section 5.3.5: Correct, complete, and complex algorithm to find the critical path in a circuit.
Notes
Note: The analysis of critical paths and false paths assumes that all inputs change values at exactly the same time. Timing differences between inputs are modelled by the skew parameter in timing analysis.
Throughout our discussion of critical paths, we will use the delay values for gates shown in the table below.
gate  delay
NOT   2
AND   4
OR    4
XOR   6
5.3.1.1 Example of Critical Path in Full Adder
Question: Find the critical path through the full-adder circuit shown below.
[Figure: full-adder circuit with inputs ci, a, b, internal signals i, j, k, and outputs co and s]
Alternative Excitation
Question: Do the input values of ci=0, a=, b=1 exercise the critical path?
[Figure: full-adder circuit with inputs ci, a, b, internal signals i, j, k, and outputs co and s]
5.3.1.2 Preliminaries for Critical Paths
5.3.1.3 Longest Path and Critical Path
The longest path through the circuit might not be the critical path, because the behaviour of the gates might prevent an edge (0→1 or 1→0) from travelling along the path.
Example False Path

Question: Determine whether the longest path in the circuit below is a false path.
[Figure: circuit with inputs a and b and output y, with waveforms for four cases: a = 0, b = 0→1; a = 0, b = 1→0; a = 1, b = 0→1; a = 1, b = 1→0]
Question: How can we determine analytically that this is a false path?
[Figure: the circuit with inputs a and b and output y]
Preview of Complete Example

Question: Find the critical path through the circuit below.
[Figure: circuit with signals a through g]
5.3.2 Longest Path
Outline of Algorithm to Find Longest Path

The basic idea is to annotate each signal with the maximum delay from it to an output.
- Start at destination signals and traverse through fanin to source signals.
- Destination signals have a delay of 0.
- At each gate, annotate the inputs by the delay through the gate plus the delay of the output.
- When a signal fans out to multiple gates, annotate the output of the source (driving) gate with the maximum delay of the destination signals.
- The primary input signal with the maximum delay is the start of the longest path.
- The delay annotation of this signal is the delay of the longest path.
- The longest path is found by working from the source signal to the destination signals, picking the fanout signal with the maximum delay at each step.
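The annotation rule above can be summarized as a recurrence. The notation is an assumption introduced here: $D(s)$ is the delay annotation of signal $s$, $d_g$ is the delay of gate $g$, $\mathrm{fanout}(s)$ is the set of gates that $s$ drives, and $\mathrm{out}(g)$ is the output signal of gate $g$.

```latex
\[
D(s) =
\begin{cases}
0 & \text{if } s \text{ is a destination signal} \\[4pt]
\max\limits_{g \,\in\, \mathrm{fanout}(s)}
  \bigl( d_g + D(\mathrm{out}(g)) \bigr) & \text{otherwise}
\end{cases}
\]
```

For example, with the gate delays used in this section, a signal that drives a NOT gate (delay 2) whose output is annotated with 4, and an AND gate (delay 4) whose output is a destination signal, is annotated with $\max(2 + 4,\; 4 + 0) = 6$.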
5.3.3 Detecting a False Path
5.3.3.1 Preliminaries
The controlling value of a gate is the value such that if one of the inputs has this value, the output can be determined independently of the other inputs. The controlled output value is the value produced by the controlling input value.
Gate   Controlling Value   Controlled Output
AND
OR
NAND
NOR
XOR
Path Input, Side Input

Definition path input: For a gate on a path (either a candidate critical path, or a real critical path), the path input is the input signal that is on the path.
Definition side input: For a gate on a path (either a candidate critical path, or a real critical path), the side inputs are the input signals that are not on the path.
Reconvergent Fanout
Definition reconvergent fanout: There are paths from signals in the fanout of a gate that reconverge at another gate.
[Figure: circuit with signals a, b, c, d, e, f, g, h, y, and z illustrating reconvergent fanout]
If a candidate path has reconvergent fanout, then the rising or falling edge on the input to the path might cause a side input along the path to have a rising or falling edge, rather than a stable 0 or 1.
Rules for Propagating an Edge Along a Path

[Figure: rules for propagating an edge through NOT, AND, OR, and XOR gates, annotated with the side-input values]
Missing Rules?
Question: Why do the rules not have falling edges for AND gates or rising edges for OR gates on the side input?
[Figure: two gates with inputs a and b and output c]
Viability Condition of a Path

Definition Viability condition: For a path (p) through a circuit, the viability condition is a Boolean expression in terms of the input signals that defines the cases where an edge will propagate along the path.
Based upon the rules for propagating an edge that we have seen so far, the viability condition for a path is: every side input has a non-controlling value. As always, section 5.3.5 has the complete viability condition.
5.3.3.2 Almost-Correct Algorithm to Detect a False Path

1. Annotate each side input along the path with its non-controlling value. These annotations are the constraints that must be satisfied for the candidate path to be exercised.
2. Propagate the constraints backward from the side inputs of the path to the inputs of the circuit under consideration.
3. If there is a contradiction amongst the constraints, then the candidate path is a false path.
4. If there is no contradiction, then the constraints on the inputs give the conditions under which an edge will traverse along the candidate path from input to output.
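Under the early-side-input assumption used so far, the viability condition and the false-path test can be written compactly. The notation is an assumption introduced here: $\mathrm{side}(g, p)$ is the set of side inputs of gate $g$ with respect to path $p$, and $\mathrm{nc}(g)$ is the non-controlling value of gate $g$.

```latex
\[
V(p) \;=\; \bigwedge_{g \,\in\, p} \;\; \bigwedge_{s \,\in\, \mathrm{side}(g,\,p)}
  \bigl( s = \mathrm{nc}(g) \bigr)
\]
\[
p \text{ is a false path} \iff V(p) \equiv 0
\]
```

Steps 1 and 2 of the algorithm build $V(p)$ as an expression over the circuit inputs; step 3 checks whether it is identically false, and step 4 reads off the input conditions when it is not.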
5.3.3.3 Examples of Detecting False Paths
False-Path Example 1
Question: Determine if the longest path in the circuit below is a false path.
[Figure: circuit with signals a through k, annotated with delays to the outputs: a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0]
side input | non-controlling value | constraint
5.3.4 Finding the Next Candidate Path
If the longest path is a false path, we need to find the next longest path in the circuit, which will be our next candidate critical path. If this candidate fails, we continue to find the next longest of the remaining paths, ad infinitum.
5.3.4.1 Algorithm to Find Next Candidate Path
1. Initialize the path table with primary inputs, their potential delay, and fanout.
2. Sort the path table by potential delay.
3. If the partial path with the max delay has just one unused fanout signal, then extend the partial path with this signal. Otherwise:
   (a) Extend the path through the unused fanout with max delay.
   (b) Delete this fanout signal from the list of unused fanout signals.
4. Compute the constraint that the side input has a non-controlling value.
5. If the new constraint does not cause a contradiction, then return to step 3. Otherwise:
   (a) Mark this partial path as false.
   (b) For each partial path that is a prefix of the false path: recalculate the potential delay of the path.
   (c) Return to step 2.
5.3.4.2 Examples of Finding Next Candidate Path
Next-Path Example 1

Question: Starting from the initial delay calculation and longest path, find the next candidate path and test if it is a false path.
[Figure: circuit with signals a through k, annotated with delays to the outputs: a 16, b 12, c 10, d 14, e 8, f 12, g 8, h 4, i 4, j 0, k 0]
potential delay | unused fanout | path
10              | e             | c
12              | h, g          | b
16              | d             | a
side input | non-controlling value | constraint
5.3.5 Correct Algorithm to Find Critical Path
We now remove the assumption that side inputs always arrive earlier than path inputs.
5.3.5.1 Rules for Late Side Inputs

[Table: gate behaviour for the four combinations of side-input and path-input values (side=non-ctrl path=non-ctrl; side=CTRL path=CTRL; side=CTRL path=non-ctrl; side=non-ctrl path=CTRL). Early Side entries: path input causes glitch; path input propagates; side input propagates; neither input propagates. Late Side entries: monotone speedup; monotone speedup; path input propagates; side input causes glitch.]
The complete and correct rule: a path input excites the gate if the side input is non-controlling, or the side input arrives late and the path input is controlling.
5.3.5.2 Monotone Speedup
Definition monotonic: A function (f) is monotonic if increasing its input causes the output to increase or remain the same. Mathematically: $x < y \implies f(x) \le f(y)$.
Definition monotonous: A lecture is monotonous if increasing the length of the lecture increases the number of people who are asleep.
Definition monotone speedup: The maximum clock speed of a circuit should be monotonic with respect to the speed of any gate or sub-circuit. That is, if we increase the speed of part of the circuit, we should either increase the clock speed of the circuit, or leave it unchanged.
5.3.5.3 Analysis of Side-Input-Causes-Glitch Situation
5.3.5.4 Complete Algorithm
If we find a contradiction on the path, check for side inputs that are on previously discovered false paths.
If a gate and its side input are on a previously discovered false path, then the side input defines a prefix of a false path that is a late-arriving side input.
For each late-arriving prefix, compute its viability (the conditions under which an edge will propagate along the prefix to the late side input).
To the row of the late-arriving side input in the constraint table, add as a disjunction the constraint that: the path input has a controlling value and at least one of the prefixes is viable.
5.3.5.5 Complete Examples
Complete Example 1
Question: Find the critical path in the circuit below.

[Figure: circuit for Complete Example 1]

potential delay   unused fanout   path
false                             a,b,d,e,f,g
10                g, c            a
10                                a,c,f,g

side input   non-controlling value   constraint
f[e]         1                       a
g[a]         1                       a
Complete Example 2
Question: Find the critical path in the circuit below.
[Figure: circuit for Complete Example 2, with signals a through j annotated with delays]

potential delay   unused fanout   path
false                             b,d,e,g,h,i,j
8                 f               a
12                h               c
14                f, g            b,d,e
14                                b,d,e,g,i,j

side input   non-ctrl value   constraint
h[c]         0                c
i[h]         0                cb
j[f]         0                ab
Complete Example 3
Monotone speedup
Critical path: a, c, e, f
Late side input: e[d]
Total delay: 10
Excitation: a = rising edge
[Figure: timing of the circuit (signals a through f) under rising edge excitation, falling edge excitation, and fast timing]
Complete Example 4
Late side inputs sometimes must have an edge.
Find the second-longest path with contradiction using early sides:
[Figure: circuit with signals a through k, annotated with signal arrival times]
Complete Example 5
Late side paths must be viable.
Question: Find the critical path in the circuit below.
[Figure: circuit with signals a through k]
5.3.6 Further Extensions to Critical Path Analysis

McGeer and Brayton's paper includes extensions to the critical path algorithm presented here that we will not cover:
gates with more than two inputs
finding all input values that will exercise the critical path
multiple paths with the same delay to the same gate
5.3.7 Increasing the Accuracy of Critical Path Analysis

When doing critical path calculations, it is often useful to strike a balance between accuracy and effort. In the examples so far, we assumed that all signals had the same wire and load delays. This assumption simplifies calculations, but reduces accuracy. Section 5.4 discusses how the analog world affects timing analysis.
5.4 Elmore Timing Model
5.4.1 RC-Networks for Timing Analysis

[Figure: P-transistor at mask level, transistor level, and switch level, with cross-section of fabricated transistor]
[Figure: N-transistor at mask level, transistor level, and switch level, with cross-section of fabricated transistor]

Different Levels of Abstraction for Inverter
[Figure: inverter with input a and output b at transistor level, gate level, and mask level]

RC-Network models of P- and N-transistors
[Figure: P-transistor modeled with Rpu and Cp; N-transistor modeled with Rpd and Cp]

RC-Network for Timing Analysis
[Figure: RC network for the inverter between VDD and GND, with Rpu, Rpd, Cp, and CL]

A Pair of Inverters
[Figure: pair of inverters at transistor level, gate level, and mask level]

A Pair of Inverters (Cont'd)
[Figure: mask level; RC network for timing analysis; trimmed RC network with Rpu, Rpd, Cp, RW, CW, RV, and CL]

A Circuit with Fanout
[Figure: circuit with signals a, b, c, d at gate level, gate level (physical layout), and transistor level]

A Circuit with Fanout (Cont'd)
[Figure: transistor level and mask level]

A Circuit with Fanout (Cont'd)
[Figure: mask level and RC network with wire segments RW1/CW1, RW2/CW2, and RW3/CW3]

RC-Network for Timing Analysis (trimmed)
[Figure: trimmed RC network with Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, and CL]

RC-Network for Timing Analysis (cleaned up)
[Figure: cleaned-up RC network with Rpu, Rpd, Cp, RW1, CW1, RW2, CW2, RV, and CL]
5.4.2 Derivation of Analog Timing Model
Real Waveforms
[Figure: input voltage and output voltage versus time, for a slow input and for a fast input]
Steps Toward Approximation

We begin with two simplifications as steps toward calculating a single delay value for a circuit.
1. Look at the circuit's response to a step-function input.
2. Measure the delay to go from GND to 65% of VDD and from VDD to 35% of VDD.
Definition Trip Points: A high or 1 trip point is the voltage level where an upwards transition means the signal represents a 1. A low or 0 trip point is the voltage level where a downwards transition means the signal represents a 0.
The source (VDD in our case) and each capacitor is a node. We number the nodes, capacitors, and resistors. Resistors are numbered according to the capacitor to their right. Multiple resistors in series without an intervening capacitor are lumped into a single resistor. All nodes except the source start at GND. We calculate the voltage at a node when we turn on the P-transistor (connect to VDD).
The process for analyzing a transition from VDD to GND on a node is the dual of the process just described. The source node is GND, all other nodes start at VDD, we calculate the voltage when we turn on the N-transistor (connect it to GND).
[Figure: Node Numbering, Initial Conditions. RC network with source node 0 (VDD), nodes 1 to 5, and resistors R1 to R5]
Define: Path and Downstream

Definition path: The path from the source node to a node i is the set of all resistors between the source and i. Example: path(3) = {R1, R2, R3}
Definition down: The set of capacitors downstream from a node is the set of all capacitors where current would flow through the node to charge the capacitor. You can think of this as the set of capacitors that are between the node and ground. Example: down(2) = {C2, C3, C4, C5}. Example: down(3) = {C3, C4}
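The two definitions can be checked with a short Python sketch. The tree topology below (node 5 branching off node 2, node 4 hanging below node 3) is an assumption inferred from the examples path(3), down(2), and down(3); the names `PARENT`, `path`, and `down` are illustrative. Integer n stands for both resistor Rn and capacitor Cn.

```python
# PARENT[n] is the node upstream of node n; resistor Rn connects them.
# Node 0 is the source. Topology inferred from the examples in the text.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}


def path(node):
    """Indices of the resistors between the source and `node`."""
    resistors = set()
    while node != 0:
        resistors.add(node)
        node = PARENT[node]
    return resistors


def down(node):
    """Indices of the capacitors whose charging current flows through `node`."""
    return {k for k in PARENT if node in path(k)}


print(sorted(path(3)))   # resistor indices on the way to node 3
print(sorted(down(2)))   # capacitor indices downstream of node 2
print(sorted(down(3)))   # capacitor indices downstream of node 3
```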
5.4.2.1 Example Derivation: Equation for Voltage at Node 3

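$$V_3(t) = V_0(t) - (\text{voltage drop from Node 0 to Node 3})$$
The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node 3:
$$V_3(t) = V_0(t) - \sum_{r \in path(3)} R_r I_r(t)$$
$$V_3(t) = V_0(t) - \left( R_1 I_1(t) + R_2 I_2(t) + R_3 I_3(t) \right)$$
The current through a resistor is the sum of the currents through all of the downstream capacitors:
$$I_r(t) = \sum_{c \in down(r)} I_c$$
$$I_1(t) = I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}$$
$$I_2(t) = I_{c2} + I_{c3} + I_{c4} + I_{c5}$$
$$I_3(t) = I_{c3} + I_{c4}$$
Substitute $I_r$ into the equation for $V_3$:
$$V_3(t) = V_0(t) - \left[ R_1 (I_{c1} + I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_2 (I_{c2} + I_{c3} + I_{c4} + I_{c5}) + R_3 (I_{c3} + I_{c4}) \right]$$
Use associativity to group terms by currents:
$$V_3(t) = V_0(t) - \left[ I_{c1} R_1 + I_{c2} (R_1 + R_2) + I_{c3} (R_1 + R_2 + R_3) + I_{c4} (R_1 + R_2 + R_3) + I_{c5} (R_1 + R_2) \right]$$
Current through a capacitor:
$$I_c(t) = C_c \frac{\partial V_c(t)}{\partial t}$$
Substitute $I_c$ into the equation for $V_3$:
$$V_3(t) = V_0(t) - \left[ R_1 C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + (R_1 + R_2) C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + (R_1 + R_2 + R_3) C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + (R_1 + R_2) C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \right]$$
$$R_{i,k} = \sum_{r \in (path(i) \cap path(k))} R_r$$
$$R_{3,1} = R_1$$
$$R_{3,2} = R_1 + R_2$$
$$R_{3,3} = R_1 + R_2 + R_3$$
$$R_{3,4} = R_1 + R_2 + R_3$$
$$R_{3,5} = R_1 + R_2$$
Substitute $R_{i,k}$ into $V_3$:
$$V_3(t) = V_0(t) - \left[ R_{3,1} C_{c1} \frac{\partial V_{c1}(t)}{\partial t} + R_{3,2} C_{c2} \frac{\partial V_{c2}(t)}{\partial t} + R_{3,3} C_{c3} \frac{\partial V_{c3}(t)}{\partial t} + R_{3,4} C_{c4} \frac{\partial V_{c4}(t)}{\partial t} + R_{3,5} C_{c5} \frac{\partial V_{c5}(t)}{\partial t} \right]$$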
5.4.2.2 General Derivation
$$V_i(t) = V_0(t) - (\text{voltage drop from Node 0 to Node } i)$$
The voltage drop is the sum of the voltage drops across the resistors on the path from Node 0 to Node i:
$$V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r I_r(t)$$
The current through a resistor is the sum of the currents through all of the downstream capacitors:
$$I_r(t) = \sum_{c \in down(r)} I_c$$
Substitute $I_r$ into the equation for $V_i$:
$$V_i(t) = V_0(t) - \sum_{r \in path(i)} R_r \sum_{c \in down(r)} I_c$$
Use associativity to push $R_r$ into the summation over c:
$$V_i(t) = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r I_c$$
Current through a capacitor:
$$I_c(t) = C_c \frac{\partial V_c(t)}{\partial t}$$
Substitute $I_c$ into the equation for $V_i$:
$$V_i(t) = V_0(t) - \sum_{r \in path(i)} \sum_{c \in down(r)} R_r C_c \frac{\partial V_c(t)}{\partial t}$$
A little bit of handwaving to prepare for Elmore resistance:
$$V_i(t) = V_0(t) - \sum_{k \in Nodes} \sum_{r \in (path(i) \cap path(k))} R_r C_k \frac{\partial V_k(t)}{\partial t}$$
Define Elmore resistance $R_{i,k}$:
$$R_{i,k} = \sum_{r \in (path(i) \cap path(k))} R_r$$
Substitute $R_{i,k}$ into $V_i$:
$$V_i(t) = V_0(t) - \sum_{k \in Nodes} R_{i,k} C_k \frac{\partial V_k(t)}{\partial t}$$
5.4.3 Elmore Timing Model
Assume that V0(t) is a step function from 0 to 1 at time 0. Derive upper and lower bounds for Vi(t). Find RC time constants for upper and lower bounds. Elmore delay is guaranteed to be between upper and lower bounds.
[Figure: response curves for the upper and lower bounds, the Elmore model, and the RC-network model; time axis marked at TDi - TRi, TRi, TP - TRi, TP, and TDi]
Equations for Curves

Time : 0 1+ t TDi TP TDi TRi TP TRi TDi TP t TR TRi 1 ie TP 1 et/TDi
Upper
Elmore
Lower
TDi 1 t + TRi
TP TRi t TDi TP e 1 TP
Fact: $0 \le T_{Ri} \le T_{Di} \le T_P$

Definitions of Time Constants

$$T_{Ri} = \sum_{k \in Nodes} \frac{R_{k,i}^2 C_k}{R_{i,i}}$$
Mathematical artifact, no intuitive meaning
$$T_{Di} = \sum_{k \in Nodes} R_{k,i} C_k$$
Elmore delay
$$T_P = \sum_{k \in Nodes} R_{k,k} C_k$$
RC-time constant for lumped network
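The three time constants can be computed directly from their definitions. The sketch below is illustrative only: the tree topology is the one inferred from the node 3 worked example, the resistor and capacitor values are made-up numbers, and the function names are hypothetical.

```python
# PARENT[n] is the node upstream of node n; resistor R[n] connects them.
PARENT = {1: 0, 2: 1, 3: 2, 4: 3, 5: 2}

# Hypothetical component values, chosen only for illustration.
R = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0, 5: 5.0}
C = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}


def path(node):
    """Indices of the resistors between the source (node 0) and `node`."""
    resistors = set()
    while node != 0:
        resistors.add(node)
        node = PARENT[node]
    return resistors


def elmore_resistance(i, k):
    """R_{i,k}: total resistance shared by path(i) and path(k)."""
    return sum(R[r] for r in path(i) & path(k))


def time_constants(i):
    """Return (T_Ri, T_Di, T_P) for node i."""
    t_d = sum(elmore_resistance(k, i) * C[k] for k in PARENT)
    t_p = sum(elmore_resistance(k, k) * C[k] for k in PARENT)
    t_r = sum(elmore_resistance(k, i) ** 2 * C[k] for k in PARENT)
    t_r /= elmore_resistance(i, i)
    return t_r, t_d, t_p


t_r3, t_d3, t_p = time_constants(3)
print(t_r3, t_d3, t_p)
```

With these values the ordering T_R3 <= T_D3 <= T_P holds, which matches the stated fact about the time constants.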
Picking the Trip Point

$$V_i(t) = VDD\,(1 - e^{-t/T_{Di}})$$
Pick trip point of $V_i(t) = 0.65\,VDD$, then solve for t:
$$0.65\,VDD = VDD\,(1 - e^{-t/T_{Di}})$$
$$0.35 = e^{-t/T_{Di}}$$
Take ln of both sides:
$$\ln 0.35 = \ln(e^{-t/T_{Di}})$$
$$\ln 0.35 = -1.05 \approx -1.0$$
$$-1.0 = -t/T_{Di}$$
$$t = T_{Di}$$
By picking a trip point of 0.65 VDD, the time for Vi to reach the trip point is the Elmore delay.
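The approximation in this derivation can be checked numerically. The sketch below assumes the single-exponential model Vi(t) = VDD(1 - e^(-t/TDi)) from this section; the function name is hypothetical.

```python
import math


def time_to_trip(t_d, trip_fraction):
    """Time for VDD*(1 - exp(-t/t_d)) to reach trip_fraction*VDD."""
    return -t_d * math.log(1.0 - trip_fraction)


# With T_Di normalized to 1, a 65% trip point is reached slightly after T_Di.
t_65 = time_to_trip(1.0, 0.65)
print(t_65)

# The trip fraction reached at exactly t = T_Di is 1 - 1/e.
exact_fraction = 1.0 - math.exp(-1.0)
print(exact_fraction)
```

The 65% trip point is therefore a rounded version of the fraction 1 - 1/e (about 63%), which is where the exponential reaches exactly one time constant.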
5.4.4 Examples of Using Elmore Delay
5.4.4.1 Interconnect with Single Fanout
[Figure: gate G1 connected to gate G2 through antifuses and wire segments, with the equivalent RC network: Rpu, Rpd, Cp at G1, then Ra1, Rw1, Ra2, Rw2, Ra3, Rw3, Ra4 in series, with C1, C2, C3, and CG2]

G*    gate
C*    capacitance on wire
Ra*   resistance through antifuse
Rw*   resistance through wire

Question: Calculate delay from gate 1 to gate 2
[Figure: RC network from G1 to G2 with Rpu, Ra1, Rw1, Ra2, Rw2, Ra3, Rw3, Ra4 and Cp, C1, C2, C3, CG2; output voltage Vi]
Doubling Antifuses
Question: If you double the number of antifuses and wires needed to connect two gates, what will be the approximate effect on the wire delay between the gates?
5.4.4.2 Interconnect with Multiple Gates in Fanout

[Figure: source inverter G1 driving gates G2 and G3]
Question: Assuming that wire resistance is much less than antifuse resistance and that all antifuses have equal resistance, calculate the delay from the source inverter (G1) to G2
Delay to G2 vs G3
Question: Assuming all wire segments at the same level have roughly the same capacitance, which is greater, the delay to G2 or the delay to G3?
[Figure: layout and RC network from G1 to G2 and G3, with resistances R1 to R6, capacitances C1 to C7, and nodes n1 to n7]
5.5 Practical Usage of Timing Analysis
Speed Grading
Fabs sort chips according to their speed (sorting is known as speed grading or speed binning)
Faster chips are more expensive
In FPGAs, sorting is usually based on propagation delay through an FPGA cell. As wires become a larger portion of delay, some analysis of wire delays is also being done.
Propagation delay is the average of the rising and falling propagation delays.
Typical speed grades for FPGAs:
Std   standard speed grade
1     15% faster than Std
2     25% faster than Std
3     35% faster than Std
Worst-Case Timing
Maximum Delay in CMOS. When?
Minimum voltage
Maximum temperature
Slow-slow conditions (process variation/corner which results in slow p-channel and slow n-channel). We could also have fast-fast, slow-fast, and fast-slow process corners
Increasing temperature increases delay

Temp ↑ ⇒ resistivity ↑
resistivity ↑ ⇒ electron vibration ↑
electron vibration ↑ ⇒ colliding with current electrons ↑
colliding with current electrons ↑ ⇒ delay ↑
Increasing supply voltage decreases delay

supply voltage ↑ ⇒ current ↑
current ↑ ⇒ load capacitor charge time ↓
load capacitor charge time ↓ ⇒ total delay ↓
Derating factor is a number used to adjust timing numbers to account for voltage and temp conditions
ASIC manufacturers' classes, based on variety of environments:

             VDD         TA (ambient temp)   TC (case temp)
Commercial   5V ± 5%     0 to +70°C
Industrial   5V ± 10%    −40 to +85°C
Military     5V ± 10%                        −55 to +125°C

What is important is the transistor temperature inside the chip, TJ (junction temperature)
5.5.1 Speed Binning
Speed binning is the process of testing each manufactured part to determine the maximum clock speed at which it will run reliably.
Manufacturers sell chips off of the same manufacturing line at different prices based on how fast they will run.
A speed bin is the clock speed that chips will be labeled with when sold.
Overclocking: running a chip at a clock speed faster than what it is rated for (and hoping that your software crashes more frequently than your over-stressed hardware will).
5.5.1.1 FPGAs, Interconnect, and Synthesis

On FPGAs, 40-60% of the clock cycle is consumed by interconnect.
When synthesizing, increasing effort (number of iterations) of place and route can significantly reduce the clock period on large designs.
5.5.2 Worst Case Timing
5.5.2.1 Fanout delay
In Smith's book, Table 5.2 (Fanout delay) combines two separate parameters:
capacitive load delay
interconnect delay
into a single parameter (fanout). This is common, and fine. But, when reading a table such as this, you need to know whether fanout delay is combining both capacitive load delay and interconnect delay, or is just capacitive load.
5.5.2.2 Derating Factors
Delays are dependent upon supply voltage and temperature.
Temp ↑ ⇒ Delay ↑
Supply voltage ↑ ⇒ Delay ↓
Temperature
Temp ↑ ⇒ Delay ↑
Temp ↑ ⇒ Resistivity of wires ↑
As temp goes up, atoms vibrate more, and so have greater probability of colliding with electrons flowing with current.
Supply Voltage
Supply voltage ↑ ⇒ Delay ↓
Supply voltage ↑ ⇒ current ↑ (V = IR)
current ↑ ⇒ time to charge load capacitors to threshold voltage ↓
Derating Factor Definition

A derating factor is a number to adjust timing numbers to account for different temperature and voltage conditions.
Excerpt from table 5.3 in Smith's book (Actel Act 3 derating factors):

Derating factor   Temp     Vdd
1.17              125C     4.5V
1.00              70C      5.0V
0.63              -55C     5.5V
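Applying a derating factor is a single multiplication. The sketch below uses only the three table entries quoted above and assumes that the delay being adjusted was specified at the condition whose factor is 1.00; the 10 ns delay and the function name are hypothetical.

```python
# (temperature in C, Vdd in V) -> derating factor, from the excerpt above.
DERATING = {
    (125, 4.5): 1.17,
    (70, 5.0): 1.00,
    (-55, 5.5): 0.63,
}


def derated_delay(nominal_delay_ns, temp_c, vdd):
    """Scale a delay given at the factor-1.00 condition to another condition."""
    return nominal_delay_ns * DERATING[(temp_c, vdd)]


# Hypothetical 10 ns path delay, adjusted to the hot/low-voltage
# and cold/high-voltage conditions.
print(derated_delay(10.0, 125, 4.5))
print(derated_delay(10.0, -55, 5.5))
```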
Chapter 6 Power Analysis and Power-Aware Design
6.1 Overview
6.1.1 Importance of Power and Energy
Laptops, PDAs, cell-phones, etc: obvious!
For microprocessors in personal computers, every watt above 40W adds $1 to manufacturing cost
Approx 25% of operating expense of server farm goes to energy bills
(Dis)Comfort of Unix labs in E2
Sandia Labs had to build a special sub-station when they took delivery of Teraflops massively parallel supercomputer (over 9000 Pentium Pros)
High-speed microprocessors today can run so hot that they will damage themselves: Athlon reliability problems, Pentium 4 processor thermal throttling
In 2000, information technology consumed 8% of total power in US.
Future power viruses: cell phone viruses that cause the cell phone to run in full power mode and consume battery very quickly; PC viruses that cause CPU to meltdown batteries
6.1.2 Industrial Names and Products
Note: Lots of links from E&CE 327 web pages under Documentation
6.1.3 Power vs Energy
Most people talk about power reduction, but sometimes they mean power and sometimes energy.
Power minimization is usually about heat removal
Energy minimization is usually about battery life or energy costs

Type     Units    Equivalent Types   Equations
Energy   Joules   Work               = Volts × Coulombs = 1/2 × C × Volts^2
Power    Watts    Energy / Time      = Volts × I = Joules / sec

6.1.4 Batteries, Power and Energy
6.1.4.1 Do Batteries Store Energy or Power?

Energy = Volts × Coulombs
Power = Energy / Time
Batteries rated in Amp-hours at a voltage.
battery = Amps × Seconds × Volts
        = (Coulombs / Seconds) × Seconds × Volts
        = Coulombs × Volts
        = Energy
Batteries store energy.
6.1.4.2 Battery Life and Efficiency
To extend battery life, we want to increase the amount of work done and/or decrease energy consumed. Work and energy are the same units, therefore to extend battery life, we truly want to improve efficiency.
Power efficiency of microprocessors is normally measured in MIPS/Watt. Is this a real measure of efficiency?
$$\frac{MIPS}{Watts} = \frac{\text{millions of instructions} / \text{Seconds}}{\text{Energy} / \text{Seconds}} = \frac{\text{millions of instructions}}{\text{Energy}}$$
Both instructions executed and energy are measures of work, so MIPS/Watt is a measure of efficiency.
Question: What is the weakness of this analysis?
6.1.4.3 Battery Life and Power
Question: Running a VHDL simulation requires executing an average of 1 million instructions per simulation step. My computer runs at 700MHz, has a CPI of 1.0, and burns 70W of power. My battery is rated at 10V and 2.5AH. Assuming all of my computer's clock cycles go towards running VHDL simulations, how many simulation steps can I run on one battery charge?

Question: If I use the SpeedStep feature of my computer, my computer runs at 600MHz with 60W of power. With SpeedStep activated, how much longer can I keep the computer running on one battery?

Question: With SpeedStep activated, how many more simulation steps can I run on one battery?
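One way to work these three questions is sketched below in Python. It assumes an ideal battery whose full rated energy (Volts × Amp-hours × 3600) is usable and constant power draw; the function names are hypothetical and this is not presented as the official course solution.

```python
def battery_energy_joules(volts, amp_hours):
    """Energy = Volts x Amps x Seconds (1 hour = 3600 seconds)."""
    return volts * amp_hours * 3600.0


def runtime_seconds(energy_j, power_w):
    """Time = Energy / Power."""
    return energy_j / power_w


def simulation_steps(clock_hz, cpi, instr_per_step, seconds):
    """Instructions per second divided by instructions per step, times time."""
    return clock_hz / cpi / instr_per_step * seconds


energy = battery_energy_joules(10.0, 2.5)

# Normal mode: 700 MHz at 70 W.
normal_time = runtime_seconds(energy, 70.0)
normal_steps = simulation_steps(700e6, 1.0, 1e6, normal_time)

# SpeedStep mode: 600 MHz at 60 W.
slow_time = runtime_seconds(energy, 60.0)
slow_steps = simulation_steps(600e6, 1.0, 1e6, slow_time)

print(energy, normal_time, normal_steps)
print(slow_time - normal_time, slow_steps - normal_steps)
```

Under these assumptions the battery lasts longer with SpeedStep, but the number of simulation steps per charge does not change, because power and clock speed were scaled by the same ratio.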
6.2 Power Equations
$$Power = \underbrace{SwitchPower + ShortPower}_{DynamicPower} + \underbrace{LeakagePower}_{StaticPower}$$
Dynamic Power: dependent upon clock speed
  Switching Power: useful, charges up transistors
  Short Circuit Power: not useful, both N and P transistors are on
Static Power: independent of clock speed
  Leakage Power: not useful, leaks around transistor

Dynamic Power
Dynamic power is proportional to how often signals change their value (switch).
Roughly 20% of signals switch during a clock cycle.
Need to take glitches into account when calculating activity factor. Glitches increase the activity factor.
Equations for dynamic power contain clock speed and activity factor.

6.2.1 Switching Power
[Figure: charging a capacitor (0->1 on CapLoad) and discharging a capacitor (1->0 on CapLoad)]
$$\text{energy to (dis)charge capacitor} = \tfrac{1}{2} \times CapLoad \times VoltSup^2$$

Switching Power
When a capacitor C is charged to a voltage V, the energy stored in the capacitor is $\frac{1}{2}CV^2$.
The energy required to charge the capacitor from 0 to V is $CV^2$. Half of the energy ($\frac{1}{2}CV^2$) is dissipated as heat through the pullup resistance. Half of the energy is transferred to the capacitor.
When the capacitor discharges from V to 0, the energy stored in the capacitor ($\frac{1}{2}CV^2$) is dissipated as heat through the pulldown resistance.

Switching Power
f: frequency at which inverter goes through complete charge-discharge cycle (eqn 15.4 in Smith)
$$\text{average switching power} = f \times CapLoad \times VoltSup^2$$
ClockSpeed: clock speed
ActFact: average number of times that signal switches from 0->1 or from 1->0 during a clock cycle
$$\text{average switching power} = \tfrac{1}{2} \times ActFact \times ClockSpeed \times CapLoad \times VoltSup^2$$
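The switching power equation translates directly into code. In the sketch below the function name and the example operating point (100 MHz, 1 pF, 5 V) are hypothetical; the 0.2 activity factor follows the rough 20% figure quoted for dynamic power.

```python
def switching_power(act_fact, clock_speed_hz, cap_load_f, volt_sup):
    """Average switching power: 1/2 * ActFact * ClockSpeed * CapLoad * VoltSup^2.

    The factor 1/2 appears because ActFact counts single transitions,
    while a complete charge-discharge cycle needs two of them.
    """
    return 0.5 * act_fact * clock_speed_hz * cap_load_f * volt_sup ** 2


p_full = switching_power(0.2, 100e6, 1e-12, 5.0)
p_half_voltage = switching_power(0.2, 100e6, 1e-12, 2.5)
print(p_full, p_half_voltage)
```

Halving the supply voltage in this example cuts the switching power to one quarter, which illustrates the quadratic dependence on VoltSup.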
6.2.2 Short-Circuited Power
[Figure: inverter with input Vi, output Vo, and short-circuit current IShort; gate voltage waveform between GND and VoltSup showing the interval TimeShort, between VoltThresh and VoltSup - VoltThresh, during which both the P-transistor and the N-transistor are on]
$$PwrShort = ActFact \times ClockSpeed \times TimeShort \times IShort \times VoltSup$$
6.2.3 Leakage Power
[Figure: cross section of inverter showing parasitic diode; leakage current ILeak through parasitic diode]
$$PwrLk = ILeak \times VoltSup$$
$$ILeak \propto e^{\left( \frac{-q \times VoltThresh}{kT} \right)}$$
6.2.4 Glossary
6.2.5 Note on Power Equations

6.3 Overview of Power Reduction Techniques

We can divide power reduction techniques into two classes: analog and digital.
Analog Parameters
Power reduction parameters at the analog level.
capacitance: for example, Silicon on Insulator (SOI)
resistance: for example, copper wires
voltage: low-voltage circuits

Analog Techniques
Power reduction techniques at the analog level.
dual-VDD: Two different supply voltages: high voltage for performance-critical portions of design, low voltage for remainder of circuit. Alternatively, can vary voltage over time: high voltage when running performance-critical software and low voltage when running software that is less sensitive to performance.
dual-Vt: Two different threshold voltages: transistors with low threshold voltage for performance-critical portions of design (can switch more quickly, but more leakage power), transistors with high threshold voltage for remainder of circuit (switches more slowly, but reduces leakage power).
exotic circuits: Special flops, latches, and combinational circuitry that run at a high frequency while minimizing power
adiabatic circuits: Special circuitry that consumes power on 0->1 transitions, but not 1->0 transitions. These sacrifice performance for reduced power.
clock trees: Up to 30% of total power can be consumed in clock generation and clock tree

Digital Parameters
Power-reduction parameters at the digital level.
capacitance (number of gates)
activity factor
clock frequency

Digital Techniques
Power-reduction techniques at the digital level.
multiple clocks: Put a high speed clock in performance-critical parts of design and a low speed clock for remainder of circuit
clock gating: Turn off clock to portions of a chip when they are not being used
data encoding: Gray coding vs one-hot vs fully encoded vs ...
glitch reduction: Adjust circuit delays or add redundant circuitry to reduce or eliminate glitches.
asynchronous circuits: Get rid of clocks altogether....
Additional low-power design techniques for RTL from a Qualis engineer: http://home.europa.com/celiac/lowpower.html
6.4 Voltage Reduction for Power Reduction

If our goal is to reduce power, the most promising approach is to reduce the supply voltage, because, from:
$$Power = (ActFact \times ClockSpeed \times \tfrac{1}{2} CapLoad \times VoltSup^2) + (ActFact \times ClockSpeed \times TimeShort \times IShort \times VoltSup) + (ILeak \times VoltSup)$$
we observe:
$$Power \propto VoltSup^2$$

Reducing Difference Between Supply and Threshold Voltage

As the supply voltage decreases, it takes longer to charge up the capacitive load, which increases the load delay of a circuit. In the chapter on timing analysis, we saw that increasing the supply voltage will decrease the delay through a circuit. (From V = IR, increasing V causes an increase in I, which causes the capacitive load to charge more quickly.) However, it is more accurate to take into account both the value of the supply voltage, and the difference between the supply voltage and the threshold voltage.
$$MaxClockSpeed \propto \frac{(VoltSup - VoltThresh)^2}{VoltSup}$$
Effect of Decreasing Supply Voltage on Delay

Question: If the delay along the critical path of a circuit is 20 ns, the supply voltage is 2.8 V, and the threshold voltage is 0.7 V, calculate the critical path delay if the supply voltage is dropped to 2.2 V.
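A possible way to work this question is sketched below. It assumes that critical path delay is inversely proportional to MaxClockSpeed, so that delay scales with VoltSup / (VoltSup - VoltThresh)^2; the function names are hypothetical.

```python
def relative_delay(volt_sup, volt_thresh):
    """Delay is proportional to 1/MaxClockSpeed = VoltSup / (VoltSup - VoltThresh)^2."""
    return volt_sup / (volt_sup - volt_thresh) ** 2


def scaled_delay(old_delay_ns, old_volt_sup, new_volt_sup, volt_thresh):
    """Scale a known delay from one supply voltage to another."""
    ratio = relative_delay(new_volt_sup, volt_thresh) / relative_delay(old_volt_sup, volt_thresh)
    return old_delay_ns * ratio


new_delay = scaled_delay(20.0, 2.8, 2.2, 0.7)
print(new_delay)
```

Under this proportionality, dropping the supply from 2.8 V to 2.2 V lengthens the 20 ns critical path by a factor of about 1.54.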
Reducing Threshold Voltage Increases Leakage Current

If we reduce the supply voltage, we want to also reduce the threshold voltage, so that we do not increase the delay through the circuit. However, as threshold voltage drops, leakage current increases:
$$ILeak \propto e^{\left( \frac{-q \times VoltThresh}{kT} \right)}$$
And increasing the leakage current increases the power:
$$Power \propto ILeak$$
So, we need to strike a balance between reducing VoltSup (which has a quadratic effect on reducing power), and increasing ILeak, which has a linear effect on increasing power.
6.5 Data Encoding for Power Reduction
6.5.1 How Data Encoding Can Reduce Power

Data encoding is a technique that chooses data values so that normal execution will have a low activity factor. The most common example is Gray coding where exactly one bit changes value each clock cycle when counting.

Decimal   Gray   Binary
0         0000   0000
1         0001   0001
2         0011   0010
3         0010   0011
4         0110   0100
5         0111   0101
6         0101   0110
7         0100   0111
8         1100   1000
9         1101   1001
10        1111   1010
11        1110   1011
12        1010   1100
13        1011   1101
14        1001   1110
15        1000   1111
8-bit Counter
Question: For an eight-bit counter, how much more power will a binary counter consume than a Gray-code counter?
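Counting bit toggles gives a first-order comparison of the two encodings. The sketch below counts only the toggles of the counter's state bits over one full wrap-around cycle; it assumes every bit drives the same capacitance and ignores the combinational logic, so it is an estimate rather than a full power analysis. The function names are hypothetical.

```python
def to_gray(n):
    """Standard binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def to_binary(n):
    """Plain binary encoding (identity), for comparison."""
    return n


def toggles_per_cycle(encode, bits):
    """Total bit toggles while counting through all 2^bits values and wrapping."""
    size = 2 ** bits
    total = 0
    for i in range(size):
        changed = encode(i) ^ encode((i + 1) % size)
        total += bin(changed).count("1")
    return total


binary_toggles = toggles_per_cycle(to_binary, 8)
gray_toggles = toggles_per_cycle(to_gray, 8)
print(binary_toggles, gray_toggles, binary_toggles / gray_toggles)
```

The Gray counter toggles exactly one bit per clock cycle, while the binary counter toggles nearly two bits per cycle on average, so the state bits of the binary counter switch almost twice as often.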
Random Data
Question: For completely random eight-bit data, how much more power will a binary circuit consume than a Gray-code circuit?
6.5.2 Example Problem: Sixteen Pulser
6.5.2.1 Problem Statement
Your task is to do the power analysis for a circuit that should send out a one-clock-cycle pulse on the done signal once every 16 clock cycles. (That is, done is 0 for 15 clock cycles, then 1 for one cycle, then repeat with 15 cycles of 0 followed by a 1, etc.)
[Figure: Required behaviour. Waveforms of clk and done over clock cycles 1 to 33, with done pulsing once every 16 cycles]
You have been asked to consider three different types of counters: a binary counter, a Gray-code counter, and a one-hot counter. (The table below shows the values from 0 to 15 for the different encodings.)
Question: What is the relative amount of power consumption for the different options?
6.5.2.2 Additional Information
Your implementation technology is an FPGA where each cell has a programmable combinational circuit and a flip-flop. The combinational circuit has 4 inputs and 1 output. The capacitive load of the combinational circuit is twice that of the flip-flop.
[Figure: FPGA cell containing a PLA and a flip-flop]
1. You may neglect power associated with clocks.
2. You may assume that all counters:
   (a) are implemented on the same fabrication process
   (b) run at the same clock speed
   (c) have negligible leakage and short-circuit currents

Data Encoding
Decimal   Gray   One-Hot            Binary
0         0000   0000000000000001   0000
1         0001   0000000000000010   0001
2         0011   0000000000000100   0010
3         0010   0000000000001000   0011
4         0110   0000000000010000   0100
5         0111   0000000000100000   0101
6         0101   0000000001000000   0110
7         0100   0000000010000000   0111
8         1100   0000000100000000   1000
9         1101   0000001000000000   1001
10        1111   0000010000000000   1010
11        1110   0000100000000000   1011
12        1010   0001000000000000   1100
13        1011   0010000000000000   1101
14        1001   0100000000000000   1110
15        1000   1000000000000000   1111

6.5.2.3 Answer
Sketch the Circuitry
Name the output done and the count digits d().
Capacitance
                         cap   number   subtotal cap
Gray     d()    PLAs
                Flops
         done   PLAs
                Flops
1-Hot    d()    PLAs
                Flops
         done   PLAs
                Flops
Binary   d()    PLAs
                Flops
         done   PLAs
                Flops

Activity Factors
Gray Coding Activity Factor
signal   activity factor
d(0)     8/16
d(1)     4/16
d(2)     2/16
d(3)     2/16
done     2/16

One-Hot Activity Factor
signal                activity factor
d(0), d(1), d(2), ...   2/16 each
done                  2/16

Binary Coding Activity Factor
signal   activity factor
d(0)     16/16
d(1)     8/16
d(2)     4/16
d(3)     2/16
done     2/16

Putting it all Together
                         subtotal cap   act fact   power
Gray     d()    PLAs
                Flops
         done   PLAs
                Flops
         Total
1-Hot    d()    PLAs
                Flops
         done   PLAs
                Flops
         Total
Binary   d()    PLAs
                Flops
         done   PLAs
                Flops
         Total
6.6 Clock Gating
The basic idea of clock gating is to reduce power by turning off the clock when a circuit isn't needed. This reduces the activity factor.
6.6.1 Introduction to Clock Gating

Examples of Clock Gating
Condition                                          Circuitry turned off
O/S in standby mode                                Everything except core state (PC, registers, caches, etc)
No floating point instructions for k clock cycles  floating point circuitry
Instruction cache miss                             Instruction decode circuitry
No instruction in pipe stage i                     Pipe stage i

6.6.2 Implementing Clock Gating
Clock gating is implemented by adding a component that disables the clock when the circuit isn't needed.
[Figure: Without clock gating. Circuit with inputs i_data, i_valid, clk and outputs o_data, o_valid]
[Figure: With clock gating. The same circuit clocked by cool_clk, with a Clock Enable State Machine that takes i_wakeup and drives clk_en]
6.6.3 Design Process
6.6.4 Effectiveness of Clock Gating
Parameters to characterize effectiveness of clock gating:
Eff = effectiveness of clock gating
PctValid = percentage of clock cycles with valid data in the circuit (the clock must be toggling)
PctClk = percentage of clock cycles that clock toggles
Effectiveness measures the percentage of clock cycles with invalid data in which the clock is turned off.
Equation for effectiveness of clock gating:
$$Eff = \frac{PctClkOff}{PctInvalid} = \frac{1 - PctClk}{1 - PctValid}$$
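The effectiveness equation can be tried on a few values with a small Python sketch. The function name is hypothetical, percentages are expressed as fractions between 0 and 1, and the case PctValid = 1 (no invalid cycles at all) is rejected rather than guessed at.

```python
def clock_gating_effectiveness(pct_clk, pct_valid):
    """Eff = (1 - PctClk) / (1 - PctValid), with both arguments as fractions."""
    if pct_valid >= 1.0:
        raise ValueError("effectiveness is undefined when data is always valid")
    return (1.0 - pct_clk) / (1.0 - pct_valid)


# Clock toggles only when there is valid data.
print(clock_gating_effectiveness(0.7, 0.7))
# Clock always toggles.
print(clock_gating_effectiveness(1.0, 0.7))
```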
Clock Gating Effectiveness Questions

Question: What is the effectiveness if the clock toggles only when there is valid data?
Question: What is the effectiveness of a clock that always toggles?
Question: What does it mean for a clock gating scheme to be 75% effective?
Question: What happens if PctClk < PctValid?

Effect of Effectiveness
We can see the effect of the effectiveness of a clock-gating scheme on the activity factor:
[Figure: new activity factor A' plotted against Eff from 0 to 1, with the levels A and PctValid * A marked on the vertical axis]
The new activity factor with a clock gating scheme is:
$$A' = A - (1 - PctValid) \times Eff \times A$$
6.6.5 Example: Reduced Activity Factor with Clock Gating

Question: How much power will be saved in the following clock-gating scheme?
70% of the time the main circuit has valid data
clock gating circuit is 90% effective (90% of the time that the circuit has invalid data, the clock is off)
clock gating circuit has 10% of the area of the main circuit
clock gating circuit has same activity factor as main circuit
neglect short-circuiting and leakage power
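One possible way to set up this calculation is sketched below. It relies on several assumptions that the problem statement does not spell out: switching power is proportional to activity factor times capacitance, capacitance is proportional to area, and the clock gating circuit itself is never gated. The function names are hypothetical.

```python
def gated_activity_factor(act_fact, pct_valid, eff):
    """A' = A - (1 - PctValid) * Eff * A."""
    return act_fact - (1.0 - pct_valid) * eff * act_fact


def relative_power_with_gating(pct_valid, eff, gating_area_ratio):
    """Power relative to the ungated main circuit (normalized to 1.0).

    Assumes power is proportional to activity factor times area, and that
    the gating circuit runs at the original, ungated activity factor.
    """
    main = gated_activity_factor(1.0, pct_valid, eff)
    gating = gating_area_ratio * 1.0
    return main + gating


relative_power = relative_power_with_gating(0.70, 0.90, 0.10)
power_saved = 1.0 - relative_power
print(relative_power, power_saved)
```

Under these assumptions the main circuit's activity factor drops to 73% of its original value, the gating circuit adds 10% back, and the net saving is about 17%.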
6.6.6 Clock Gating with Valid-Bit Protocol
6.6.6.1 Valid-Bit Protocol
Need a mechanism to tell circuit when to pay attention to data inputs
[Figure: circuit with inputs clk, i_valid, i_data and outputs o_valid, o_data, with waveforms of clk, i_valid, i_data, o_valid, and o_data]
i_valid: high when i_data has valid data; signifies whether circuit should pay attention to or ignore data.
o_valid: high when o_data has valid data; signifies whether environment should pay attention to output of circuit.
For more on circuit protocols, see section 2.12.

Microscopic Analysis
Which clock edges are needed?
[Figure: waveforms of clk, i_valid, and o_valid]

6.6.6.2 How Many Clock Cycles for Module?

Given a module with latency Lat, if the module receives a stream of NumPcls consecutive valid parcels, how many clock cycles must the clock-enable signal be asserted?
[Figure: waveforms of i_valid, o_valid, and clk_en, annotated with Latency, NumPcls, and NumClkEn]

6.6.6.3 Adding Clock-Gating Circuitry
Before Clock Gating
[Figure: circuit with inputs data_in, valid_in, clk and outputs data_out, valid_out; waveforms of clk, valid_in, data_in, valid_out, and data_out, with don't-care and uninitialized values marked]

After Clock Gating: Circuitry
[Figure: circuit with data_in, valid_in, data_out, valid_out, clocked by cool_clk; Clock Enable State Machine with inputs hot_clk and wakeup_in and outputs clk_en and wakeup_out]
hot_clk: clock that always toggles
cool_clk: gated clock; sometimes toggles, sometimes stays low
wakeup: alerts circuit that valid data will be arriving soon
clk_en: turns on cool_clk

After Clock Gating: New Signals
[Figure: waveforms of hot_clk, wakeup_in, valid_in, data_in, clk_en, cool_clk, valid_out, data_out, and wakeup_out]
6.6.7 Example: Pipelined Circuit with Clock-Gating

Design a clock enable state machine for the pipelined component described below.
capacitance of pipelined component = 200
latency varies from 5 to 10 clock cycles, even distribution of latencies
contains a maximum of 6 instructions (parcels of data)
60% of incoming parcels are valid
average length of continuous sequence of valid parcels is 80
use input and output valid bits for wakeup
leakage current is negligible
short-circuit current is negligible
LUTs have a capacitance of 1, flops have a capacitance of 2

Waveforms for Parcel Count
[Figure: waveforms of i_valid, o_valid, parcel_count, and parcel_clk_en over clock cycles 1 to 24]

Waveforms for Cycle Count
[Figure: waveforms of i_valid, o_valid, cycle_count, and cycle_clk_en over clock cycles 1 to 24]

Summary of Design Process

Outline:
1. sketch out circuitry for parcel count and cycle count state machine
2. estimate capacitance of each state machine
3. estimate activity factor of main circuit, based on behaviour
562
Parcel Count Design

Need to count (0..6) parcels, therefore need 3 bits for counter. Counter must be able to increment and decrement. Equations for counter action (increment/decrement/no-change):

i_valid  o_valid  action
0        0        no change
0        1        decrement
1        0        increment
1        1        no change
Chapter 7 Fault Testing and Testability
7.1 Faults and Testing
7.1.1 Overview of Faults and Testing
7.1.1.1 Faults
During manufacturing, faults can occur that make the physical product behave incorrectly.
Definition: A fault is a manufacturing defect that causes a wire, poly, diffusion, or via to either break or connect to something it shouldn't.
[Figure: good wires, shorted wires, and an open wire]
7.1.1.2 Causes of Faults
- Fabrication process (initial construction is bad): chemical mix, impurities, dust
- Manufacturing process (damage during construction):
  - handling: probing, cutting, mounting
  - materials: corrosion, adhesion failure, cracking, peeling
7.1.1.3 Testing
Definition: Testing is the process of checking that the manufactured wafer/chip/board/system has the same functionality as the simulations.
7.1.1.4 Burn In
Definition: Burn-in is the process of subjecting chips to extreme conditions (high and low temperatures, high and low voltages, high and low clock speeds) before and during testing.
[Figure: a wire that is soon to break]
7.1.1.5 Bin Sorting
Each chip (or wafer) is run at a variety of clock speeds. The chips are grouped and labeled (binned) by the maximum clock frequency at which they will work reliably. For example, chips coming off of the same production line might be labelled as 800MHz, 900MHz, and 1000MHz.
7.1.1.6 Testing Techniques
7.1.1.7 Design for Testability (DFT)
7.1.2 Example Problem: Economics of Testing

Note: There is a tradeoff between the amount of money spent on testing chips vs dealing with (e.g. replacing) faulty chips. Usually the best tradeoff is to ship chips with a small, but non-zero probability that the chip has a fault.
7.1.3 Physical Faults
7.1.3.1 Types of Physical Faults
[Figure: a good circuit with wires a, b, c, d, and bad circuits showing each fault type: open; wired-AND bridging short; wired-OR bridging short; stronger-wins bridging short (b is stronger); short to VDD; short to GND]
7.1.3.2 Locations of Faults
Each segment of wire, poly, diffusion, via, etc is a potential fault location. Different segments affect different gates in the fanout. A potential fault location is a segment or segments where a fault at any position affects the same set of gates in the same way.
7.1.3.3 Layout Affects Locations
[Figure: two layouts of the same circuit (signals a to h), with fault locations labelled L1 through L5; the layout determines where the fault locations are]
7.1.3.4 Naming Fault Locations
Two ways to name a fault location:
- pin-fault model: Faults are modelled as occurring on input and output pins of gates.
- net-fault model: Faults are modelled as occurring on segments of wires.
In E&CE 327, we'll use the net-fault model, because it is simpler to work with and is closer to what actually happens in hardware.
7.1.4 Detecting a Fault
To detect a fault, we compare the actual output of the circuit against the expected value.
7.1.4.1 Which Test Vectors will Detect a Fault?
Question: For the good circuit and faulty circuit shown below, which test vectors will detect the fault?
[Figure: good circuit and faulty circuit with inputs a, b, c and internal signals d, e]
Answer:
a  b  c  good  faulty
0  0  0  0     0
0  0  1  1     1
0  1  0  0     0
0  1  1  1     1
1  0  0  0     0
1  0  1  1     1
1  1  0  1     0
1  1  1  1     1
Sometimes multiple test vectors will catch the same fault. Sometimes a single test vector can catch multiple faults.
Another fault
[Figure: good circuit and a second faulty circuit with inputs a, b, c and internal signals d, e]
a  b  c  good  faulty
1  1  0  1     0
The test vector 110 can catch both this fault and the previous one.
Note (detect vs. diagnose): Testing detects faults. Testing does not diagnose which fault occurred.
7.1.5 Mathematical Models of Faults
Goal: develop a reliable and predictable technique for detecting faults in circuits.
Observations:
- The possible faults in a circuit depend on the physical layout of the circuit.
- There is a very wide variety of possible faults.
- A single test vector can catch many different faults.
Need: a mathematical model for faults that is abstracted from the complexities of circuit layout and the plethora of possible faults, yet still detects most or all possible faults.
7.1.5.1 Single Stuck-At Fault Model
Two simplifying assumptions:
1. A maximum of one fault per tested circuit (hence "single")
2. All faults are either:
   (a) stuck-at 1: short to VDD
   (b) stuck-at 0: short to GND
   (hence "stuck-at")
Example of Stuck-At Faults

a b c d i
Question: If we consider all possible stuck-at faults, how many faulty circuits would we need to test for?
Question: If we consider only single-stuck-at faults, how many faulty circuits would we need to test for?
7.1.6 Generate Test Vector to Find a Mathematical Fault

Faults are detected by stimulating circuits (real, manufactured circuit, not a simulation!) with test-vectors and checking that the real circuit gives the correct output.
7.1.6.1 Algorithm
1. compute Karnaugh map for correct circuit
2. compute Karnaugh map for faulty circuit
3. find region of disagreement
4. any assignment in the region of disagreement is a test vector that will detect the fault
5. any assignment outside of the region of disagreement will result in the same output on both the correct and faulty circuit
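The algorithm can be sketched in a few lines of Python by replacing the Karnaugh maps with exhaustive truth tables. This is only an illustration: the function name `region_of_disagreement` is hypothetical, and the two equations (ab + bc for the correct circuit, a + c for the faulty one) are borrowed from the L2@1 example later in this chapter.

```python
from itertools import product

def region_of_disagreement(good, faulty, num_inputs):
    """Return every input assignment on which the two circuits differ.

    Each returned assignment is a test vector that detects the fault.
    """
    return [bits for bits in product((0, 1), repeat=num_inputs)
            if good(*bits) != faulty(*bits)]

# Assumed example circuits (equations taken from the L2@1 example).
good = lambda a, b, c: (a & b) | (b & c)   # z = ab + bc
faulty = lambda a, b, c: a | c             # z = a + c

test_vectors = region_of_disagreement(good, faulty, 3)
print(test_vectors)  # [(0, 0, 1), (1, 0, 0), (1, 0, 1)]
```

Any assignment not in the returned list gives the same output on both circuits, so it cannot detect this fault.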
7.1.6.2 Example of Finding a Test Vector
[Figure: good circuit and faulty circuit with inputs a, b, c and internal signals d, e, together with blank Karnaugh maps over a, b, c]
Question: Find a test vector that will detect the faulty circuit.
7.1.7 Undetectable Faults
Not all faults are detectable.
1. If a circuit is irredundant then all single stuck-at faults can be detected. A redundant circuit is one where one or more gates can be removed without affecting the functional behaviour.
2. If we are not trying to find all of the faults in a circuit, then a fault that we aren't looking for can mask a fault that we are looking for.
7.1.7.1 Redundant Circuitry
Some faults are undetectable. Undetectable stuck-at faults are located in redundant parts of a circuit.
Timing Hazards
- Static hazard
- Dynamic hazard
Timing hazards are often removed by adding redundant circuitry.
Redundant Circuitry
[Figure: irredundant circuit with signals a, b, c, d, e, f, g, and an illustration of a timing hazard with the transitions on each signal annotated]
Glitch on g is caused because the AND gate for e turns off before f turns on.
Redundant Circuitry
Question: Add one or more gates to the circuit so that the static hazard is guaranteed to be prevented, independent of the delay values through the gates
[Figure: the circuit with inputs a, b, c, d and the hazard transitions annotated, to be extended with additional gates]
Redundant Circuitry
Question: Has the redundant circuitry introduced any undetectable faults? If so, identify an undetectable fault.
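A minimal sketch of how redundancy creates an undetectable fault. The circuit here is a hypothetical one chosen for illustration, not the circuit in the figure: z = ab + a'c with the consensus term bc added to remove the static hazard. By the consensus theorem the extra AND gate does not change the function, so a stuck-at-0 fault on its output cannot be observed.

```python
from itertools import product

def good(a, b, c):
    # Hazard-free cover: the consensus term (b AND c) is logically redundant.
    return (a & b) | ((1 - a) & c) | (b & c)

def consensus_at_0(a, b, c):
    # Output of the redundant AND gate stuck at 0.
    return (a & b) | ((1 - a) & c) | 0

def consensus_at_1(a, b, c):
    # Output of the redundant AND gate stuck at 1.
    return (a & b) | ((1 - a) & c) | 1

def detecting_vectors(good_fn, faulty_fn):
    """All input vectors on which the faulty circuit differs from the good one."""
    return [v for v in product((0, 1), repeat=3) if good_fn(*v) != faulty_fn(*v)]

print(detecting_vectors(good, consensus_at_0))  # [] : no test vector exists
print(detecting_vectors(good, consensus_at_1))  # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 0, 1)]
```

In this assumed circuit only the stuck-at-0 fault on the redundant gate is undetectable; the stuck-at-1 fault on the same wire is still caught.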
7.1.7.2 Curious Circuitry and Fault Detection

Curiously, the stuck-at fault at L1 is undetectable, but faults at either L2 or L3 are detectable.
[Figure: circuit with inputs a, b, c, output z, and fault locations L1, L2, L3; table with columns fault, eqn, K-map, diff w/ ckt, and rows for L2@0 and L2@1]
7.2 Test Generation
7.2.1 A Small Example
[Figure: circuit computing ab + bc with inputs a, b, c and fault locations L2, L4, L5; table with columns fault, eqn, K-map, diff w/ ckt, test vectors, and rows 1) L2@1, 2) L4@1, 3) L5@1]
7.2.2 Choosing Test Vectors
The goal of test vector generation is to find the smallest set of test vectors that will detect the faults of interest. Test vector generation requires analyzing the faults. We can simplify the task of fault analysis by reducing the number of faults that we have to analyze. Smith has examples of this in Figures 14.13 and 14.14.
7.2.2.1 Fault Domination
fault     eqn    test vectors
1) L5@1   ab+c   101, 001
2) L6@1   1      101, 001, 100, 010, 000
[The K-map and diff w/ ckt columns of the table are figures.]
Definition (dominates): f1 dominates f2 if any test vector that detects f1 will also detect f2. When choosing test vectors, we can ignore the dominated fault, but must keep the dominant fault.
Question: To detect both L5@1 and L6@1, can we ignore one of the faults?
Question: What would happen if we ignored the wrong fault?
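Under the definition above, domination is a subset test on the sets of detecting vectors. The sketch below applies it to the two faults in the table; the function name `dominates` is made up for this illustration.

```python
def dominates(tv_f1, tv_f2):
    """f1 dominates f2 when every test vector that detects f1 also detects f2."""
    return set(tv_f1) <= set(tv_f2)

# Detecting vectors copied from the table (bit order a, b, c).
l5_at_1 = {"101", "001"}
l6_at_1 = {"101", "001", "100", "010", "000"}

print(dominates(l5_at_1, l6_at_1))  # True:  any vector chosen for L5@1 also catches L6@1
print(dominates(l6_at_1, l5_at_1))  # False: e.g. 100 detects L6@1 but misses L5@1
```

The second result illustrates the risk of ignoring the wrong fault: a vector picked only to cover L6@1 (such as 100) could leave L5@1 undetected.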
7.2.2.2 Fault Equivalence
fault     eqn
1) L1@1   b
2) L3@1   b
[The K-map and diff w/ ckt columns of the table are figures.]
Definition (fault equivalence): f1 is equivalent to f2 if f1 and f2 are detected by exactly the same set of test vectors. That is, all of the test vectors that detect f1 will also detect f2, and vice versa. When choosing test vectors we can ignore one of the faults and just include the other.
7.2.2.3 Gate Collapsing
A stuck-at-1 fault on the input to an OR gate is equivalent to a stuck-at-1 fault on the output of the OR gate.
Definition (gate collapsing): The technique of looking at the functionality of a gate and finding equivalent faults between inputs and outputs.
Sets of collapsible faults for common gates:
gate  input faults  output fault
AND   @0            @0
OR    @1            @1
Question: What is the set of collapsible faults for a NAND gate?
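The collapsible sets for AND and OR can be checked by brute force: two faults collapse when they have the same set of detecting vectors. The helper below is a hypothetical sketch for a single two-input gate in isolation, with fault sites named `in0`, `in1`, and `out` for this example only.

```python
from itertools import product

def detecting_set(gate, fault):
    """Input vectors on which a faulty 2-input gate differs from the good gate.

    fault is (site, value) with site in {'in0', 'in1', 'out'}.
    """
    site, value = fault
    vectors = set()
    for a, b in product((0, 1), repeat=2):
        fa = value if site == "in0" else a
        fb = value if site == "in1" else b
        out = value if site == "out" else gate(fa, fb)
        if out != gate(a, b):
            vectors.add((a, b))
    return vectors

AND = lambda a, b: a & b
OR = lambda a, b: a | b

# OR: input stuck-at-1 is equivalent to output stuck-at-1.
print(detecting_set(OR, ("in0", 1)) == detecting_set(OR, ("out", 1)))    # True
# AND: input stuck-at-0 is equivalent to output stuck-at-0.
print(detecting_set(AND, ("in0", 0)) == detecting_set(AND, ("out", 0)))  # True
# AND: input stuck-at-1 is NOT equivalent to output stuck-at-1.
print(detecting_set(AND, ("in0", 1)) == detecting_set(AND, ("out", 1)))  # False
```

The same function can be called with a NAND lambda to explore the question above.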
7.2.2.4 Node Collapsing
Note: Node collapsing is relevant only for the pin-fault model.
7.2.2.5 Fault Collapsing Summary
When calculating the test vectors to detect a set of faults, apply the following fault collapsing techniques to reduce the number of faults that you must examine:
- gate collapsing
- node collapsing (if using pin-fault model)
- general fault equivalence (intelligent collapsing)
- fault domination
7.2.3 Fault Coverage
Definition (fault coverage): the percentage of detectable faults that are detected by a set of test vectors.
FaultCoverage = DetectedFaults / DetectableFaults
Some people's definition of fault coverage has a denominator of AllPossibleFaults, not just those that are detectable.
7.2.4 Test Vector Generation and Fault Detection

There are two ways to generate vectors and check results: built-in tests and scan testing. Both require:
- generating test vectors
- overriding the normal datapath to send test vectors, rather than normal inputs, as inputs to flops
- comparing outputs of flops to the expected result
7.2.5 Generate Test Vectors for 100% Coverage

In this section we will find the test vectors to achieve 100% coverage of single stuck-at faults for the circuit of the day. We will use a simple algorithm; there are much more sophisticated algorithms that are more efficient. The problem of test vector generation is often called Automatic Test Pattern Generation (ATPG) and continues to be an active area of research.
[Figure: Example Circuit with Fault Locations and Karnaugh Map. The circuit computes ab + bc from inputs a, b, c and has fault locations L1 to L8.]
7.2.5.1 Collapse the Faults
Initial circuit with potential faults:
[Figure: the example circuit with stuck-at-0 and stuck-at-1 faults marked at each of L1 to L8]
Gate Collapsing
[Table with columns: gate, faults, kept fault]
For each set of equivalent faults, we will keep the fault shown in bold and eliminate the other faults. A good heuristic for choosing which fault to keep: keep the fault closest to the output. The closer a fault is to the output, the easier it is to analyze its behaviour, because the equation for the output will be simpler.
Intelligent Collapsing
1. delete faults that we previously decided could be ignored
2. by intelligent analysis of the circuit, find equivalent faults
[Figure: the example circuit with stuck-at-0 and stuck-at-1 faults marked at each of L1 to L8, to be crossed off as faults are collapsed]
7.2.5.2 Check for Fault Domination

fault     eqn
1) L2@1   a+c
2) L3@1   b
3) L4@1   a+bc
4) L5@1   ab+c
5) L6@0   bc
6) L7@0   ab
7) L8@0   0
8) L8@1   1
[The K-map and diff w/ ckt columns of the table are figures over a, b, c.]
Remove dominated faults

Current faults:
[Figure: the example circuit with stuck-at-0 and stuck-at-1 faults marked at each of L1 to L8]
Dominated faults:
7.2.5.3 Required Test Vectors
Definition (required test vector): A test vector tv is required if there is a fault for which tv is the only test vector that will detect the fault.
fault     eqn
1) L3@1   b
2) L4@1   a+bc
3) L5@1   ab+c
4) L6@0   bc
5) L7@0   ab
[The K-map and diff w/ ckt columns of the table are figures over a, b, c.]
7.2.5.4 Faults Not Covered by Required Test Vectors

fault     eqn
1) L4@1   a+bc
2) L5@1   ab+c
[The K-map and diff w/ ckt columns of the table are figures over a, b, c.]
Test vector(s) required to catch these faults:
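The two steps (find required test vectors, then cover the remaining faults) can be sketched by enumerating truth tables instead of drawing Karnaugh maps. The fault equations are the ones listed in the tables above for the circuit ab + bc; the variable names are illustrative, and the final step assumes a single vector can cover all remaining faults, which happens to hold for this example.

```python
from itertools import product

good = lambda a, b, c: (a & b) | (b & c)          # correct circuit: ab + bc

# Faulty-circuit equations copied from the table of remaining faults.
faults = {
    "L3@1": lambda a, b, c: b,
    "L4@1": lambda a, b, c: a | (b & c),
    "L5@1": lambda a, b, c: (a & b) | c,
    "L6@0": lambda a, b, c: b & c,
    "L7@0": lambda a, b, c: a & b,
}

def detecting(faulty):
    """Set of test vectors (as 'abc' strings) on which faulty differs from good."""
    return {"%d%d%d" % v for v in product((0, 1), repeat=3)
            if good(*v) != faulty(*v)}

detects = {name: detecting(fn) for name, fn in faults.items()}

# A vector is required if it is the only vector that detects some fault.
required = {next(iter(tvs)) for tvs in detects.values() if len(tvs) == 1}

# Faults that none of the required vectors detect.
uncovered = {name for name, tvs in detects.items() if not (tvs & required)}

# Vectors that detect every uncovered fault at once (assumes such a vector exists).
candidates = set.intersection(*(detects[name] for name in uncovered))

print(sorted(required))    # ['010', '011', '110']
print(sorted(uncovered))   # ['L4@1', 'L5@1']
print(sorted(candidates))  # ['101']
```

The result agrees with the four test vectors (110, 010, 011, 101) used in the ordering table of the next section.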
7.2.5.5 Order to Run Test Vectors
The order in which the test vectors are run is important because it can affect how long a faulty chip stays in the tester before the chip's fault is detected. The first vector to run should be the one that detects the most faults. Build a table for which faults each test vector will detect.
[Table: one row per fault, 1) L1@0, 2) L1@1, 3) L2@0, 4) L2@1, 5) L3@0, 6) L3@1, 7) L4@0, 8) L4@1, 9) L5@0, 10) L5@1, 11) L6@0, 12) L6@1, 13) L7@0, 14) L7@1, 15) L8@0, 16) L8@1; one column per test vector, 110, 010, 011, 101. Each cell holds a K-map over a, b, c and a 1 where the vector detects the fault. The final row, "Faults detected", totals each column.]
7.2.5.6 Summary of Technique to Find and Order Test Vectors

1. identify all possible faults
2. gate collapsing
3. node collapsing
4. intelligent collapsing
5. fault domination
6. determine required test vectors
7. choose minimal set of test vectors to detect remaining faults
8. order test vectors based on number of faults detected (NOTE: when iterating through this step, need to take into account faults detected by earlier test vectors)
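Step 8 can be sketched as a greedy loop that repeatedly picks the vector detecting the most not-yet-detected faults. The detection table below is an assumption: it was derived by comparing ab + bc against the eight collapsed fault equations of Section 7.2.5.2, so it covers only those eight faults rather than all sixteen. The function name `order_vectors` is hypothetical.

```python
def order_vectors(detects):
    """Greedy ordering: detects maps test vector -> set of faults it detects.

    Returns a list of (vector, number of new faults detected) pairs.
    Ties are broken by the order in which vectors appear in `detects`.
    """
    remaining = set().union(*detects.values())
    pending = list(detects)
    order = []
    while pending and remaining:
        best = max(pending, key=lambda tv: len(detects[tv] & remaining))
        order.append((best, len(detects[best] & remaining)))
        remaining -= detects[best]   # later vectors get no credit for these
        pending.remove(best)
    return order

# Assumed detection table for the eight collapsed faults.
detects = {
    "110": {"L6@0", "L8@0"},
    "010": {"L3@1", "L8@1"},
    "011": {"L7@0", "L8@0"},
    "101": {"L2@1", "L4@1", "L5@1", "L8@1"},
}

print(order_vectors(detects))
# [('101', 4), ('110', 2), ('010', 1), ('011', 1)]
```

Subtracting already-detected faults before each pick is what the NOTE in step 8 asks for: 010 and 011 each detect two faults in isolation, but only one new fault once 101 and 110 have run.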
7.2.6 One Fault Hiding Another
[Figure: the example circuit with inputs a, b, c and fault locations L1 to L8]
Assume that we are not trying to detect all faults: L1 is viewed as not being at risk for faults, but L3 is at risk for faults.
[Figure: circuit with inputs a, b, c, output z, and fault locations L1 and L3]
Fault Hiding
Problem: If L1 is stuck-at 1, the test vectors that normally detect L3@0 will not detect L3@0. In the presence of other faults, the set of test vectors to detect a fault will change.
fault(s)      eqn
L3@0          ab
L1@1, L3@0    b
[The K-map and diff w/ ckt columns of the table are figures over a, b, c.]
7.3 Scan Testing in General
7.3.1 Structure and Behaviour of Scan Testing

[Figure: Normal Circuit. "another circuit #0" drives data_in(3..0) into the circuit under test, which drives zeta_in(3..0) into "another circuit #1".]
[Figure: Circuit with Scan Chains Added. Scan chain 0 (mode0, scan_in0, scan_out0) sits between "another circuit" and the circuit under test; scan chain 1 (mode1, scan_in1, scan_out1) sits between the circuit under test and "yet another circuit".]
7.3.2 Scan Chains
[Figure: Normal Circuit and Circuit with Scan Chains Added, side by side, showing data_in(3..0), zeta_in(3..0), mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1]
7.3.2.1 Circuitry in Normal and Scan Mode

[Figure: the circuit under test with its scan chains in Normal Mode and in Scan Mode]
7.3.2.2 Scan in Operation
[Figure: Circuit under test with scan chains, and waveforms for clk, mode0, scan_in0, scan_out0, scan_in1, scan_out1]
Sequence of load; test; unload

[Figure: three steps: Load Test Vector (1 cycle per bit); Run Test Vector Through Circuit; Unload Result (1 cycle per bit)]
Unload and Load at Same Time

[Figure: three steps: Unload Prev Result / Load Cur Test Vector (1 cycle per bit); Run Cur Test Vector Through Circuit; Unload Cur Result / Load New Test Vector (1 cycle per bit). Waveforms for clk, mode0, scan_out0, scan_in0, scan_out1, scan_in1 show previous results leaving on scan_out while the current vector enters on scan_in, then current results leaving while the next test vector enters.]
7.3.2.3 Scan in Operation with Example Circuit

[Figure: Circuit under test (inputs a, b, c; outputs y, z), and Circuit under test with scan test circuitry (mode0, scan_in0, scan_out0, mode1, scan_in1, scan_out1)]
[Figure sequence, each panel showing the circuit with waveforms for clk and mode0: Start Loading Test Vector; Load; Load; Load; Run Test Vector; Test Values Propagate; Flop-In Result, Start (Un)loading Test Vector; Continue (Un)loading Test Vector; Continue (Un)loading Test Vector; Finish (Un)loading Test Vector; Run Next Test Vector]
7.3.3 Summary of Scan Testing
Adding scan circuitry

1. Registers around circuit to be tested are grouped into scan chains
2. Replace each flop with mux + flop
3. Flops and muxes wired together into scan chains
4. Each scan chain is connected to dedicated I/O pins for loading and unloading test vectors
Running test vectors

1. Put scan chain in scan mode
2. Load in test vector (one element of vector per clock cycle)
3. Put scan chain in normal mode
4. Run circuit for one clock cycle to load result of test into flops
5. Unload results of current test vector while simultaneously loading in next test vector (one element of vector per clock cycle)
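The steps for running test vectors can be mimicked with a small behavioural model. This is a deliberately simplified sketch, not a description of real scan hardware: it assumes a single chain whose flops both drive the circuit under test and capture its result (the slides use separate input and output chains), and the function name `run_scan_tests` is invented for the example.

```python
def run_scan_tests(circuit, vectors):
    """Behavioural model of one scan chain of n flops around `circuit`.

    circuit: function from a tuple of n bits to a tuple of n bits.
    vectors: list of n-bit tuples to apply.
    Returns (results, cycles): captured result per vector, and total clock cycles.
    """
    n = len(vectors[0])
    chain = [0] * n
    results, cycles = [], 0
    for k in range(len(vectors) + 1):
        # Scan mode: shift n times, unloading the old contents while loading.
        incoming = vectors[k] if k < len(vectors) else (0,) * n
        out_bits = []
        for bit in reversed(incoming):
            out_bits.append(chain[-1])      # bit leaving on scan_out
            chain = [bit] + chain[:-1]      # bit entering on scan_in
            cycles += 1
        if k > 0:
            results.append(tuple(reversed(out_bits)))
        # Normal mode: one clock cycle to capture the circuit's response.
        if k < len(vectors):
            chain = list(circuit(tuple(chain)))
            cycles += 1
    return results, cycles

# Hypothetical circuit under test: inverts every bit.
invert = lambda bits: tuple(1 - b for b in bits)

results, cycles = run_scan_tests(invert, [(1, 0, 1), (0, 0, 1)])
print(results)  # [(0, 1, 0), (1, 1, 0)]
print(cycles)   # 11
```

With 2 vectors and a chain of length 3, the model spends 3 cycles on the first load and then 1 + 3 cycles per vector, for 11 cycles in total.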
7.3.4 Time to Test a Chip
If the length (number of flops) of a scan chain is n, then it takes 2n + 1 clock cycles to run a single test: n clock cycles to scan in the test vector, 1 clock cycle to execute the test vector, and n cycles to scan out the results. Once the results are scanned out, they can be compared to the expected results for a correctly working circuit. If we run 2 or more tests (and chips generally are subjected to hundreds of thousands of tests), then we speed things up by scanning in the next test vector while we scan out the previous result.
ScanLength = number of flip-flops in a scan chain
NumVectors = number of test vectors in test suite
TimeScan = number of clock cycles to run test suite
         = NumVectors × (ScanLength + 1) + ScanLength
7.3.4.1 Example: Time to Test a Chip
An 800 MHz chip has scan chains of length 20,000 bits, 18,000 bits, 21,000 bits, 22,000 bits, and two of 15,000 bits. 500,000 test vectors are used for each scan chain. The tests are run at 80% of full speed.
Question: Calculate the total test time.
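One way to set up the calculation, using TimeScan = NumVectors × (ScanLength + 1) + ScanLength. The sketch below rests on an assumption that the problem statement does not spell out: all scan chains are loaded and unloaded in parallel, so the longest chain determines the test time. If the chains were instead tested one after another, the per-chain cycle counts would have to be summed.

```python
def scan_test_cycles(num_vectors, scan_length):
    """Clock cycles to run a test suite through one scan chain,
    overlapping the unload of each result with the load of the next vector."""
    return num_vectors * (scan_length + 1) + scan_length

chains = [20_000, 18_000, 21_000, 22_000, 15_000, 15_000]
num_vectors = 500_000
test_clock_hz = 800_000_000 * 80 // 100   # 80% of 800 MHz = 640 MHz

# Assumption: chains operate in parallel, so the longest chain dominates.
cycles = scan_test_cycles(num_vectors, max(chains))
seconds = cycles / test_clock_hz

print(cycles)             # 11000522000
print(round(seconds, 2))  # 17.19
```

As a sanity check, a single vector through a chain of length n gives `scan_test_cycles(1, n) == 2 * n + 1`, matching the single-test case.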
7.4 Boundary Scan and JTAG
- Boundary scan originated as a technique to test wires on printed circuit boards (PCBs).
- The goal was to replace "bed-of-nails" style testing with a technique that would work for high-density PCBs (lots of small wires close together).
- Now used to test both boards and chip internals.
- Used both on boundaries (I/O pins) and internal flops.
Boundary Scan with JTAG

Standardized by IEEE (1149) and previously by JTAG:
- 4 required signals (Scan Pins: TDI, TDO, TCK, TMS)
- 1 optional signal (Scan Pin: TRST)
- protocol to connect circuit under test to tester and other circuits
- state machine to drive test circuitry on chip
- Boundary Scan Description Language (BSDL): structural language used to describe which features of JTAG a circuit supports
JTAG circuitry is now commonly built into FPGAs and ASICs, or is part of a cell library. Rarely is a JTAG circuit custom-built as part of a larger part. So, you'll probably be choosing and using JTAG circuits, not constructing new ones. Using JTAG circuitry is usually done by giving a description of your printed circuit board (PCB) and the JTAG components on each chip (in BSDL) to test generation software. The software then generates a sequence of JTAG commands and data that can be used to test the wires on the circuit board for opens and shorts.
JTAG Structure
[Figure: High-level view: a chip whose circuit under test sits between normal input pins and normal output pins, surrounded by a BSR made of BSCs, with TDI, TCK, TMS, TDO. Detailed view: TAP Controller, IR, Instruction Decoder, IRC, BR, IDCODE, chip scan registers, and control signals.]
7.4.1 Scan Instructions
This is the set of required instructions; other instructions are optional.
EXTEST   Test board-level interconnect. Drive output pins of chip with hard-coded test vector. Sample results on inputs.
SAMPLE   Sample result data
PRELOAD  Load test vector
BYPASS   Directly connect TDI to TDO. This is used when several chips are daisy chained together to skip loading data into some chips.
IDCODE   Output manufacturer and part number
7.5 Built In Self Test
7.5.1 Block Diagram

[Figure: Circuit in Normal Mode and Circuit in Test Mode. A test generator feeds d(0..3) / i_data(0..3) into the circuit under test; o_data(0..2) feed a result checker that outputs all_ok; a mode signal selects normal or test operation.]
Circuit w/ BIST in Normal Mode

[Figure: test gen LFSR (test generator) on i_data(0..3), circuit under test, and signature analyzers 0 to 2 on o_data(0..2) producing ok(0..2)]
Circuit w/ BIST in Test Mode

[Figure: the same circuit with the test-mode paths highlighted]
7.5.1.1 Components
Test Generator
- generates a pseudo-random set of test vectors
- for n output bits, generates all vectors from 1 to 2^n - 1 in a pseudo-random order
- built with a linear-feedback shift register (shift-register portion is the input flops)
Test Generator
[Figure: LFSR test generator with flops q2, q1, q0]
Question: Why not just use a counter to generate 1..2^n - 1?
Signature Analyzer
[Figure: BIST block diagram with the signature analyzers highlighted]
- checks that the output it is examining has the correct results for the complete set of tests that are run
- only has a meaningful result at the end of the entire test sequence
- built with a linear-feedback shift register
- similar to a hash function or a lossy compression function
- if there are no faults, the signature analyzer will definitely say "ok" (no false negatives)
- if there is a fault, the signature analyzer might say "ok" or might say "bad" (false positives are possible)
- design tradeoff: more accurate signature analyzers require more hardware
Result Checker
[Figure: BIST block diagram with the result checker highlighted]
- signature analyzers output ok/bad on every clock cycle, but the result is only meaningful at the end of running the complete set of test vectors
- the result checker looks at test vector inputs to detect the end of the test suite and outputs "all ok" if all signature analyzers report "ok" at that moment
- implemented as an AND gate
7.5.1.2 Linear Feedback Shift Register (LFSR)

Basically, a shift register (sequence of flip-flops) with the output of the last flip-flop fed back into some of the earlier flip-flops with XOR gates. Design parameters:
- number of flip-flops
- external or internal XOR
- feedback taps (coefficients)
- external-input or self-contained
- reset or set
[Figure: LFSR Example with reset, input i, and flops d0/q0, d1/q1, d2/q2]
Example LFSRs
[Figure: four 3-flop LFSRs: External-XOR, input, reset; External-XOR, no input, set; Internal-XOR, input, set; Internal-XOR, input, reset]
In E&CE 327, we use internal-XOR LFSRs, because the circuitry matches the mathematics of Galois fields. External-XOR LFSRs work just fine, but they are more difficult to analyze, because their behaviour can't be treated as Galois fields.
7.5.1.3 Maximal-Length LFSR
Definition (maximal-length linear feedback shift register): An LFSR that outputs a pseudo-random sequence of all representable bit-vectors except 0...00.
Definition (pseudo-random): The same elements in the same order every time, but the relationship between consecutive elements is apparently random.
Maximal-length linear feedback shift registers are used to generate test vectors for built-in self test.
Maximal-Length LFSR Circuits

The figures below illustrate the two maximal-length internal-XOR linear feedback shift registers that can be constructed with 3 flops.
[Figure: two maximal-length internal-XOR LFSRs, each with flops d0/q0, d1/q1, d2/q2 and a set input; they differ in which flop input has the XOR feedback tap]
Question: Why do maximal-length LFSRs not generate the test vector 0...00?
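A small behavioural sketch of a self-contained internal-XOR LFSR helps to explore this question. It assumes the tap convention of Section 7.5.6: exponent k below n in p(x) means the feedback from the last flop is XORed into the input of flop k. The polynomials x^3 + x^2 + 1 and x^3 + x + 1 are used here as the two 3-flop maximal-length cases; the function names are hypothetical.

```python
def lfsr_step(state, taps, i=0):
    """One clock edge of an internal-XOR LFSR.

    state: [q0, q1, ..., q(n-1)]
    taps:  exponents below n that appear in p(x); exponent 0 is the feedback into d0
    i:     external input bit (0 for a self-contained LFSR)
    """
    feedback = state[-1]
    nxt = []
    for k in range(len(state)):
        shifted_in = state[k - 1] if k > 0 else i
        nxt.append(shifted_in ^ (feedback if k in taps else 0))
    return nxt

def cycle(taps, start):
    """States visited, in order, until a state repeats."""
    seen, state = [], list(start)
    while state not in seen:
        seen.append(state)
        state = lfsr_step(state, taps)
    return seen

print(cycle({0, 2}, [1, 1, 1]))       # p(x) = x^3 + x^2 + 1: seven states
print(len(cycle({0, 1}, [1, 1, 1])))  # p(x) = x^3 + x + 1: 7
print(cycle({0, 2}, [0, 0, 0]))       # [[0, 0, 0]]
```

Starting from all 1s, each LFSR visits all seven nonzero states before repeating. Starting from all 0s, every XOR sees only zeros, so the register never leaves that state.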
Maximal Length LFSR Characteristics

Maximal-length LFSRs:
- set to all 1s initially
- self contained (no external i input)
[Figure: Timing diagram for a 3-flop maximal-length LFSR, showing reset, clk, d0, q0, d1, q1, q2, and val (starting at 7, then 6)]
7.5.2 Test Generator
[Figure: BIST block diagram with the test generator highlighted]
The test generator component is a maximal-length LFSR with multiplexors on the inputs to each flip-flop. In test mode, the data input on each flip-flop is connected to the output of the previous flip-flop. In normal mode, the input of each flip-flop is connected to the environment.
[Figure: A test generator, reset not shown, with mode, i_d(0..2), d0..d2, and q0..q2]
7.5.3 Signature Analyzer
There are four things that change between different signature analyzers:
- number of flops (more flops = more area, more accuracy)
- choice of feedback taps: a good choice can improve accuracy (more isn't necessarily better)
- bubbles on input to AND gate for ok: determined by expected result from simulating test sequence through circuit under test and LFSR of analyzer
[Figure: BIST block diagram with a signature analyzer highlighted]
Signature Analyzer
This circuit:
- Two flops; most analyzers use more (the HP boards in the 1970s used 37 flops!)
- Feedback taps on both flops. Different signature analyzers have different configurations of feedback taps.
- Also contains "ok" tester (AND gate).
- Expected output of LFSR at end of test sequence is: q0=1 and q1=1, or 01. (We know this because of bubble on AND gate. To see why this is the expected output of the signature analyzer, we would need to know the correct sequence of outputs of the circuit under test.)
[Figure: two-flop signature analyzer with reset, input i, d0/q0, d1/q1, and an AND gate producing ok]
Signature Analyzer
[Figure: blank timing diagram for reset, clk, i, d0, q0, d1, q1, with q0 and q1 starting at 0 and the input sequence i6 i5 i4 i3 i2 i1 i0]
Signature Analyzer Timing

cycle  i    q0      q1      d0      d1
1      i6   0       0       i6      0
2      i5   i6      0       i5      i6
3      i4   i5      i6      i4⊕i6   i5⊕i6
4      i3   i4⊕i6   i5⊕i6   356     i4⊕i5
5      i2   356     i4⊕i5   245     346
6      i1   245     346     1346    2356
7      i0   1346    2356    02356   1245
8      -    02356   1245    -       -

356 = i3⊕i5⊕i6, 2356 = i2⊕i3⊕i5⊕i6, etc.
7.5.4 Result Checker
[Figure: BIST block diagram with the result checker highlighted]
The purpose of the result checker is to check the ok circuit at the end of the test sequence.
[Figure: result checker with inputs reset, q0, q1, q2, ok and output all_ok]
7.5.5 Arithmetic over Binary Fields
- Galois Fields!
- Two operations: + and ×
- Two values: 0 and 1
- Bit vectors and shift-registers are written as polynomials in terms of x.
Addition
+ represents XOR
expression  result
0+0         0
0+1         1
1+0         1
1+1         0
x+x         0
Multiplication
× represents concatenating shift registers
expression   result
x^4 × 1      x^4
x^2 × x^3    x^5
Example
Calculate (x^3 + x^2 + 1) × (x^2 + x)
  x^2 × (x^3 + x^2 + 1) = x^5 + x^4       + x^2
+ x   × (x^3 + x^2 + 1) =       x^4 + x^3       + x
                          -------------------------
                          x^5       + x^3 + x^2 + x
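The same multiplication can be checked mechanically. In the sketch below a polynomial is stored as a Python integer whose bit k is the coefficient of x^k (an encoding chosen for this example), so addition is XOR and multiplying by x is a left shift. The function name `gf2_mul` is hypothetical.

```python
def gf2_mul(a, b):
    """Multiply two polynomials over GF(2).

    Polynomials are ints: bit k is the coefficient of x^k.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a   # add (XOR) the current shifted copy of a
        a <<= 1           # multiply a by x
        b >>= 1
    return result

p1 = 0b1101   # x^3 + x^2 + 1
p2 = 0b0110   # x^2 + x
product = gf2_mul(p1, p2)
print(bin(product))  # 0b101110, i.e. x^5 + x^3 + x^2 + x
```

The two x^4 terms cancel because 1 + 1 = 0 in this arithmetic.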
7.5.6 Shift Registers and Characteristic Polynomials

Each linear feedback shift register has a corresponding characteristic polynomial. From polynomials to hardware:
- The maximum exponent denotes the number of flops
- The other exponents denote the flops that tap off of the feedback line from the last flop
From the characteristic polynomial, we cannot determine whether the shift register has an external input. Stated another way, two shift registers that are identical except that one has an external input and the other does not will have the same characteristic polynomial.
Shift Regs and Polynomials

[Figures: internal-XOR shift registers, with flop inputs labelled x^0, x^1, x^2, ... and the last output labelled with the highest exponent, for each of the following characteristic polynomials:]
p(x) = x^3
p(x) = x^3 + x
p(x) = x^3 + 1
p(x) = x^3 + x + 1
p(x) = x^3 + x^2 + x + 1
p(x) = x^4 + x^3 + x + 1
7.5.6.1 Circuit Multiplication
Redoing the multiplication example (x^2 + x) × (x^3 + x^2 + 1) as circuits:
[Figure: shift-register circuits for x^2 + x and for x^3 + x^2 + 1, and their product built as x × (x^3 + x^2 + 1) + x^2 × (x^3 + x^2 + 1) = x^5 + x^3 + x^2 + x]
7.5.7 Bit Streams and Characteristic Polynomials
A bit stream, or bit sequence, can be represented as a polynomial. The oldest (first) bit in a sequence of n bits is represented by x^(n-1) and the youngest (last) bit is x^0. The bit sequence 1010011 can be represented as x^6 + x^4 + x + 1:
1 0 1 0 0 1 1 = 1x^6 + 0x^5 + 1x^4 + 0x^3 + 0x^2 + 1x^1 + 1x^0
              = x^6 + x^4 + x + 1
7.5.8 Division
With rules for multiplication and addition, we can define division. A fundamental theorem of division defines q and r to be the quotient and remainder, respectively, of m ÷ p iff:
m(x) = q(x) × p(x) + r(x)
Long Division
In Galois fields, we do division just as with long division in elementary school.
Given: m(x) = x^6 + x^4 + x^3 and p(x) = x^4 + x
Calculate the quotient, q(x), and remainder, r(x), for m(x) ÷ p(x):

                                  x^2                + 1
          ------------------------------------------------
x^4 + x ) x^6 + 0x^5 + 1x^4 + 1x^3 + 0x^2 + 0x^1 + 0x^0
          x^6               + 1x^3
          ------------------------
                       1x^4
                       1x^4                + x
                       -----------------------
                                             x

Quotient q(x) = x^2 + 1
Remainder r(x) = x
Long Division (Check)

Check result:
m(x) = q(x) × p(x) + r(x)
     = (x^2 + 1) × (x^4 + x) + x
     = x^6 + x^3 + x^4 + x + x
     = x^6 + x^4 + x^3
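The long division can be reproduced with a short routine that repeatedly cancels the leading term of the dividend. As in a typical software treatment (this encoding is an assumption of the sketch, not something the slides prescribe), polynomials are integers whose bit k is the coefficient of x^k. The function name `gf2_divmod` is hypothetical.

```python
def gf2_divmod(m, p):
    """Divide polynomial m by p over GF(2); return (quotient, remainder).

    Polynomials are ints: bit k is the coefficient of x^k.
    """
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    q = 0
    while m and m.bit_length() >= p.bit_length():
        shift = m.bit_length() - p.bit_length()
        q ^= 1 << shift    # next quotient term: x^shift
        m ^= p << shift    # subtract (XOR) x^shift * p(x)
    return q, m

m = 0b1011000   # x^6 + x^4 + x^3
p = 0b0010010   # x^4 + x
q, r = gf2_divmod(m, p)
print(bin(q), bin(r))  # 0b101 0b10, i.e. q(x) = x^2 + 1 and r(x) = x
```

Subtraction and addition are the same operation here, which is why each step of the long division is an XOR.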
7.5.9 Signature Analysis: Math and Circuits

The input to the signature analyzer is a "message", m(x), which is a sequence of n bits represented as a polynomial. After n shifts through an LFSR with l flops:
- The sequence of output bits forms a quotient, q(x), of length n - l
- The flops in the analyzer form a remainder, r(x), of length l
m(x) = q(x) × p(x) + r(x)
The remainder is the signature.
Test Generation: Math and Circuits

The mathematics for an LFSR without an input i:
same polynomial as if the circuit had an input input sequence is all 0s
653
Input Streams and Error Polynomials

An input stream with an error can be represented as m(x) + e(x)
- e(x) is the error polynomial
- bits in the message that are flipped have a coefficient of 1 in e(x)
m(x) + e(x) = q(x) p(x) + r(x)

The error e(x) will be detected if it results in a different signature (remainder). m(x) and m(x) + e(x) will have the same remainder iff
e(x) mod p(x) = 0
That is, e(x) must be a multiple of p(x). The higher the degree of p(x), the smaller the chance that e(x) will be a multiple of p(x).
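A short Python sketch illustrates the condition. It assumes polynomials are stored as integer bit masks (bit i is the coefficient of x^i); the helper gf2_mod and the two example error polynomials are made up for illustration.

```python
def gf2_mod(m, p):
    """Remainder of m(x) divided by p(x) over GF(2), both stored as bit masks."""
    if p == 0:
        raise ZeroDivisionError("division by the zero polynomial")
    deg_p = p.bit_length() - 1
    while m.bit_length() - 1 >= deg_p:
        m ^= p << (m.bit_length() - 1 - deg_p)
    return m


p = 0b10010          # p(x) = x^4 + x
m = 0b1011000        # m(x) = x^6 + x^4 + x^3
e_missed = p << 1    # e(x) = x p(x), a multiple of p(x)
e_caught = 0b100     # e(x) = x^2, a single flipped bit

# Adding e(x) flips bits, which is XOR on the bit masks.
print(gf2_mod(m ^ e_missed, p) == gf2_mod(m, p))  # True: same signature, error missed
print(gf2_mod(m ^ e_caught, p) == gf2_mod(m, p))  # False: signature changes, error caught
```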
BIST for a Simple Circuit

Outline of steps to see if a fault will be detected by BIST:
1. Output sequence from test generator
2. Output sequence from correct circuit
3. Remainder for signature analyzer with correct output sequence
4. Output sequence from faulty circuit
5. Remainder for signature analyzer with faulty output sequence
6. Compare correct and faulty remainder, if different then fault detected
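Components
[Figure: circuit under test with inputs a, b, c, output z, and fault locations L1 to L8; test generator with flops t0, t1, t2 driving a, b, c; signature analyzers with flops r0, r1, r2 for the correct and faulty circuits]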
Question: Determine if L2@1 will be detected
Equation for correct circuit: ab + bc
Equation for faulty circuit: a + c
Output sequences from circuits:
t0 (a)   t1 (b)   t2 (c)   z correct   z faulty
  1        1        1          1          1
  1        1        0          1          1
  0        1        1          1          1
  1        0        0          0          1
  0        1        0          0          0
  0        0        1          0          1
  1        0        1          0          1
Test Generation Sequence

Technique is to shift; then compute result of XORs
Vectors from test generation sequence (the XOR column is the value computed for the next t2):
t0   t1   XOR   t2
 1    1    0     1    (initial values = 1)
 1    1    1     0
 0    1    0     1
 1    0    0     0
 0    1    1     0
 0    0    1     1
 1    0    1     1
 1    1          1    (final values are repeat of initial values)
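The test vectors and the two output sequences can be regenerated with a short Python sketch. The flop update rule used here (t0 takes t2, t1 takes t0, t2 takes t1 XOR t2) is inferred from the tabulated sequence, since the circuit figure is not reproduced; the function names are hypothetical.

```python
def generate_vectors(steps=7):
    """Test generator LFSR; update rule inferred from the tabulated sequence."""
    t0 = t1 = t2 = 1                      # initial values = 1
    vectors = []
    for _ in range(steps):
        vectors.append((t0, t1, t2))
        t0, t1, t2 = t2, t0, t1 ^ t2      # shift, then apply the XOR
    return vectors


def correct_circuit(a, b, c):
    return (a & b) | (b & c)              # ab + bc


def faulty_circuit(a, b, c):
    return a | c                          # a + c, the circuit with fault L2@1


vectors = generate_vectors()
z_correct = [correct_circuit(*v) for v in vectors]
z_faulty = [faulty_circuit(*v) for v in vectors]
print(z_correct)  # [1, 1, 1, 0, 0, 0, 0]
print(z_faulty)   # [1, 1, 1, 1, 0, 1, 1]
```

The generator visits all seven non-zero states before repeating, so every input combination except 000 is applied to the circuit.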
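Signature analyzer sequence for correct circuit (z is the output sequence from the correct circuit; each XOR column is the value computed for the flop to its right):
z   XOR  r0   XOR  r1   XOR  r2
1    1   0     0   0     0   0    (initial values = 0)
1    1   1     1   0     0   0
1    1   1     1   1     1   0
0    1   1     0   1     0   1
0    0   1     1   0     0   0
0    0   0     0   1     1   0
0    1   0     1   0     1   1
         1         1         1    (remainder)

Signature analyzer sequence for faulty circuit (z is the output sequence from the faulty circuit):
z   XOR  r0   XOR  r1   XOR  r2
1    1   0     0   0     0   0    (initial values = 0)
1    1   1     1   0     0   0
1    1   1     1   1     1   0
1    0   1     0   1     0   1
0    0   0     0   0     0   0
1    1   0     0   0     0   0
1    1   1     1   0     0   0
         1         1         0    (remainder)

The remainders differ (r2 is 1 for the correct circuit and 0 for the faulty circuit), so the fault is detected.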
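The two signature computations can be replayed with a minimal Python sketch. The flop update rule (r0 takes z XOR r2, r1 takes r0 XOR r2, r2 takes r1 XOR r2) is inferred from the tabulated sequences rather than taken from the circuit figure, and the function name signature is hypothetical.

```python
def signature(z_bits):
    """Signature analyzer with three flops; update rule inferred from the tables."""
    r0 = r1 = r2 = 0                              # initial values = 0
    for z in z_bits:
        r0, r1, r2 = z ^ r2, r0 ^ r2, r1 ^ r2     # shift, then apply the XORs
    return (r0, r1, r2)


correct_signature = signature([1, 1, 1, 0, 0, 0, 0])
faulty_signature = signature([1, 1, 1, 1, 0, 1, 1])
print(correct_signature)                      # (1, 1, 1)
print(faulty_signature)                       # (1, 1, 0)
print(correct_signature != faulty_signature)  # True: the fault is detected
```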
7.6 Scan vs Self Test
Scan                               Self Test
less hardware                      more hardware
slower                             faster
well defined coverage              ill defined coverage
test vectors are easy to modify    test vectors are hard to modify
Chapter 8 Review
This chapter lists the major topics of the term. The Topics List section for each major area is meant to be relatively complete.
8.1 Overview of the Term
The purely digital world
- VHDL
- design and optimization methods
- functional verification
- performance analysis
Analog effects in the digital world
- timing analysis
- power
- faults and testing
8.2 VHDL
8.2.1 VHDL Topics
- simple syntax and semantics
  - things that you should know simply by having done the labs and project
- behavioural semantics of VHDL
- synthesis semantics of VHDL
- synthesizable and unsynthesizable code
8.2.2 VHDL Example Problems
- identify whether a particular signal will be the output of combinational circuitry or a flop
- identify whether a particular process is combinational or clocked
- legal, synthesizable, and good code
- perform delta-cycle simulation of VHDL
- perform RTL simulation of VHDL
- identify whether two VHDL fragments have same behaviour
- match VHDL code with waveforms
- match VHDL code with hardware
- choose the VHDL fragment that generates smaller or faster hardware
8.3 RTL Design Techniques
8.3.1 Design Topics
- coding guidelines
- generic FPGA hardware
- area estimation
- finite state machines
  - implicit
  - explicit-current
  - explicit-current+next
- from algorithm to hardware
  - dependency graph
  - dataflow diagram
  - scheduling
  - input/output allocation
  - register allocation
  - datapath allocation
  - hardware block diagram
  - state machine
- memory dependencies
- memory arrays and dataflow diagrams
8.3.2 Design Example Problems
- choose design guidelines to follow in different situations
- estimate area to implement a circuit in an FPGA
- calculate resource usage for a dataflow diagram
- calculate performance data for a dataflow diagram
- given an algorithm, design a dataflow diagram
- given a dataflow diagram, design the datapath and finite state machine
- optimize a dataflow diagram to improve performance or reduce resource usage
- given a dataflow diagram, calculate the clock period that will result in the maximum performance
8.4 Functional Verification
8.4.1 Verification Topics
- test cases
- measuring coverage
- time for verification
- test benches
- assertions
- coverage monitors
- relational specification
- functional specification
- boundary conditions / corner cases
8.4.2 Verification Example Problems
- choose first cases to test
- identify corner cases
- choose technique to detect bug (test case, assertion/test bench)
- determine whether a code change will cause a bug
- identify a test case and either assertion or test bench to catch a bug
8.5 Performance Analysis and Optimization
8.5.1 Performance Topics
- time to execute a program
- definition of performance
- speedup
- n% faster
- calculating performance of different tasks and of average task
- choosing which task to optimize to best improve overall performance
- CPI calculations
- performance increase over time
- design tradeoffs (CPI vs NumInsts vs ClockSpeed vs time-to-market)
- MIPS calculations
- clock speed vs. performance
- optimality
- performance / area tradeoffs
8.5.2 Performance Example Problems
- calculate performance / area tradeoffs
- calculate performance / time tradeoffs
- compare performance data between products
- evaluate performance criteria
8.6 Timing Analysis
8.6.1 Timing Topics
- timing analysis of master-slave flip-flop
- timing analysis of hierarchical storage device
- critical path and false path
  - algorithm to find critical path
  - algorithm to determine if path is false or critical
  - signal assignment to exercise critical path
- circuit parameters that affect delay
- clock period
- clock skew
- clock jitter
- propagation delay
- load delay
- setup time
- hold time
- clock-to-Q time
- Elmore timing model
- derating factors
- timing analysis of latch
8.6.2 Timing Example Problems
- timing parameters for minimum clock period
- timing parameters for hold constraint
- find the critical path and assignment to exercise it
- compute Elmore delay constant
- compare accuracy of different timing models
- determine if a storage device will work correctly
- compute timing parameters of storage device
- identify timing violation, suggest remedy
- suggest design change to increase clock speed
8.7 Power
8.7.1 Power Topics
- leakage current
- threshold voltage
- supply voltage
- power vs energy
- equations for power
- dynamic power
- static power
- switching power
- short circuit power
- leakage power
- activity factor
- analog power reduction techniques
- RTL power reduction techniques
  - data encoding
  - clock gating
8.7.2 Power Example Problems
- predict effect of new fabrication process on power
- predict effect of environment change (temp, supply voltage, etc) on power consumption
- predict effect of design change on power consumption (capacitance, activity factor)
- design data-encoding scheme for a circuit, predict effect on power consumption
- design clock gating scheme for a circuit, predict effect on power consumption
- assess validity of various power- or energy-consumption metrics
8.8 Testing
8.8.1 Testing Topics
- causes of faults
- locations of faults
- physical faults
- single stuck-at fault model
- testable / untestable fault
- economics of testing
- fault coverage
- test vector generation
- order test vectors to reduce test time
- behaviour of a scan chain
- time to run a scan test
- JTAG
- built-in self-test
- linear feedback shift register
- signature analyzer
- Galois fields
- process and time to run a BIST test
8.8.2 Testing Example Problems
- compute optimal amount of testing to maximize profits
- compute coverage for a given set of test vectors
- find test vectors to catch a set of faults, choose order to run test vectors
- determine if a fault is detectable
- choose an LFSR to use for BIST test generation
- choose an LFSR to use for BIST signature analysis
- determine if a given BIST will catch a given fault
- determine probability that a given BIST technique will report that a faulty circuit is correct
- determine if a given fault-testing scheme will detect a physical fault
- match LFSR to characteristic polynomial
- match BIST hardware to Galois mathematics
- perform Galois field mathematics, compare to waveforms
8.9 Formulas to be Given on Final Exam
T = (Ins \times C) / F
Pf = W / T
S = T_1 / T_2
M = (F / 10^6) / \sum_{i=0}^{n} (PI_i \times C_i)
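A worked example may help when practising with these formulas. The Python sketch below assumes the usual reading of the symbols, which the formula sheet itself does not spell out: T is execution time, Ins the number of instructions, C the clocks per instruction, F the clock frequency, S the speedup, M the MIPS rating, and PI_i the fraction of instructions of type i. All numbers and function names are hypothetical.

```python
def exec_time(num_insts, cpi, clock_hz):
    """T = Ins * C / F (assumed reading of the formula sheet)."""
    return num_insts * cpi / clock_hz


def average_cpi(mix):
    """Sum of PI_i * C_i over (fraction, cpi) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


def mips(clock_hz, cpi):
    """M = (F / 10^6) / average CPI."""
    return clock_hz / 1e6 / cpi


# Hypothetical instruction mix: 50% at 1 CPI, 30% at 2 CPI, 20% at 4 CPI.
mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = average_cpi(mix)                  # about 1.9 clocks per instruction

t1 = exec_time(1_000_000, cpi, 100e6)   # hypothetical 100 MHz design
t2 = exec_time(1_000_000, cpi, 200e6)   # same program on a 200 MHz design
speedup = t1 / t2                       # S = T_1 / T_2, about 2

print(cpi, t1, t2, speedup, mips(100e6, cpi))
```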
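Formulas II
P = (1/2 \times A \times C_L \times V^2 \times F) + (\tau \times A \times V \times I_Sh \times F) + (V \times I_L)
F \propto (V - V_Th)^2 / V
I_L \propto e^{-q V_Th / (k T)}
q = 1.60218 \times 10^{-19} C
k = 1.38066 \times 10^{-23} J/K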
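The power equation can be exercised with a small Python sketch. It assumes the three terms are switching power, short-circuit power, and leakage power, and that the short-circuit term is scaled by a short-circuit duration tau; tau, the function name, and every numeric value below are assumptions made for illustration only.

```python
def power_terms(a, c_load, v, f, tau, i_short, i_leak):
    """Return (switching, short_circuit, leakage) power in watts.

    a: activity factor, c_load: load capacitance (F), v: supply voltage (V),
    f: clock frequency (Hz), tau: assumed short-circuit duration (s),
    i_short: short-circuit current (A), i_leak: leakage current (A).
    """
    switching = 0.5 * a * c_load * v ** 2 * f
    short_circuit = tau * a * v * i_short * f
    leakage = v * i_leak
    return switching, short_circuit, leakage


# Hypothetical operating point.
switching, short_circuit, leakage = power_terms(
    a=0.1, c_load=1e-9, v=1.2, f=100e6, tau=1e-10, i_short=1e-3, i_leak=1e-3
)
total = switching + short_circuit + leakage
print(switching, short_circuit, leakage, total)

# Lowering the supply voltage reduces switching power quadratically.
switching_low, _, _ = power_terms(0.1, 1e-9, 1.0, 100e6, 1e-10, 1e-3, 1e-3)
print(switching_low / switching)  # about (1.0 / 1.2) ** 2
```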
