Contents

Timing optimization

Area optimization

Additional readings

Budapest University of Technology and Economics

RTL Optimization Techniques


Péter Horváth
Department of Electron Devices

August 7, 2014

Contents

timing optimization concepts and design techniques
    throughput, latency, local datapath delay
    loop unrolling, removing pipeline registers, register balancing

area optimization concepts and design techniques
    resource requirement metrics in standard cell ASIC and FPGA
    control-based logic reuse, priority encoders, considering technology primitives

additional readings

Timing optimization

Computation performance concepts

There are three important concepts related to computation
performance.
throughput: The amount of data processed per clock cycle
(bits per cycle; multiplied by the clock frequency it gives
bits per second).
latency: The time elapsed between data input and processed data
output (clock cycles).
local datapath delay: The delay of the logic between storage elements
(nanoseconds). The slowest local datapath determines the maximum
clock frequency.
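
The three quantities are connected by simple arithmetic. The relations below are a sketch under stated assumptions: the symbols W (bits accepted per input read), N (clock cycles between two input reads), L (latency in cycles), f_clk (clock frequency) and t_max (delay of the slowest local datapath) are introduced here for illustration only, register setup time and clock skew are neglected, and the 10 ns delay is a made-up example value.

```latex
\begin{align*}
  \text{throughput} &= \frac{W}{N}\ \text{bits/cycle}
                     = \frac{W}{N}\, f_{clk}\ \text{bits/s} \\
  \text{latency}    &= L\ \text{cycles} = \frac{L}{f_{clk}}\ \text{s} \\
  f_{max}           &\approx \frac{1}{t_{max}} \\[4pt]
  t_{max} = 10\,\text{ns} &\;\Rightarrow\; f_{max} \approx 100\,\text{MHz} \\
  W = 8,\ N = 1     &\;\Rightarrow\; 8 \cdot 100 \cdot 10^{6}\ \text{bits/s} = 800\,\text{Mbit/s} \\
  L = 3             &\;\Rightarrow\; 3 \cdot 10\,\text{ns} = 30\,\text{ns}
\end{align*}
```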

Timing optimization techniques

High throughput: loop unrolling (pipeline)

In high-throughput optimization the time required to process a
single data item is irrelevant; the time elapsed between two
consecutive input reads is minimized.
Data n+1 is read while data n is still being processed.

architecture iterative of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (start = '1') then
        -- load a new input; two more multiplications remain
        count <= 2;
        pow <= x;
      elsif (stop = '0') then
        -- one multiplication per clock cycle
        count <= count - 1;
        pow <= pow * x;
      end if;
    end if;
  end process;
  stop <= '1' when count = 0 else '0';
end architecture;

throughput: 8/3 = 2.7 bits/cycle; latency: 3 cycles

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;

throughput: 8/1 = 8 bits/cycle; latency: 3 cycles

[Figure: block diagrams of the iterative and the pipelined pow3 datapaths (signals x, x1, x2, pow1, pow). Iterative: throughput 8/3 = 2.7 bits/cycle, latency 3 cycles. Pipelined: throughput 8/1 = 8 bits/cycle, latency 3 cycles.]

Low latency: removing pipeline registers

The objective of low-latency optimization is to pass the data
from the input to the output with minimal internal processing
delay.
A low-latency design uses parallelism and removes pipeline registers.

architecture async of pow3 is
begin
  -- no clocked process: the whole datapath is combinational
  process (x)
  begin
    x1 <= x;
  end process;
  process (x1)
  begin
    x2 <= x1;
    pow1 <= x1 * x1;
  end process;
  pow <= pow1 * x2;
end architecture;

latency: 1 cycle (with an additional output register)

architecture pipelined of pow3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- stage 1
      x1 <= x;
      -- stage 2
      x2 <= x1;
      pow1 <= x1 * x1;
      -- stage 3
      pow <= pow1 * x2;
    end if;
  end process;
end architecture;

latency: 3 cycles

[Figure: block diagrams of the asynchronous and the pipelined pow3 datapaths (signals x, x1, x2, pow1, pow). Asynchronous: latency 1 cycle. Pipelined: latency 3 cycles.]

Minimizing logic delay: register layers

The logic between two sequential elements is called a local datapath.
The delay of the slowest local datapath determines the maximum
clock frequency.
The local datapath delay can be reduced by inserting additional
register layers.

architecture single_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        -- multipliers and adders share one local datapath
        y <= A*x + B*x1 + C*x2;
      end if;
    end if;
  end process;
end architecture;

architecture multi_cycle of fir is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (valid = '1') then
        x1 <= x;
        x2 <= x1;
        -- the products are registered before they are summed
        prod1 <= A * x;
        prod2 <= B * x1;
        prod3 <= C * x2;
      end if;
    end if;
  end process;
  y <= prod1 + prod2 + prod3;
end architecture;

[Figure: block diagrams of the two FIR implementations (signals x, x1, x2, A, B, C, prod1, prod2, prod3, y). Single-cycle: the local datapath contains 1 adder and 1 multiplier. Multi-cycle: the local datapath contains 1 adder or 1 multiplier.]

Minimizing logic delay: register balancing

During register balancing the logic between registers is redistributed
in order to minimize the worst-case delay between any pair of registers.

architecture not_balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- first register layer: no logic in front of it
      reg_a <= in_a;
      reg_b <= in_b;
      reg_c <= in_c;
      -- second register layer: both additions in one local datapath
      sum <= reg_a + reg_b + reg_c;
    end if;
  end process;
end architecture;

architecture balanced of add3 is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- one addition is moved in front of the first register layer
      reg_ab_sum <= in_a + in_b;
      reg_c <= in_c;
      sum <= reg_ab_sum + reg_c;
    end if;
  end process;
end architecture;

[Figure: block diagrams of the two add3 implementations (signals in_a, in_b, in_c, reg_a, reg_b, reg_c, reg_ab_sum, sum). Not balanced: the local datapath contains 2 adders. Balanced: the local datapath contains 1 adder.]

Area optimization

Area concepts

The resource requirement is the number of basic functional
primitives required to implement the described functionality.
The basic functional primitives of standard cell ASICs are the
standard cells, which can be simple logic gates and flip-flops, but
also more complex arithmetic-logic functions or memories.
The basic logic element (BLE) of an FPGA consists of a logic
function (the number of inputs depends on the vendor and the
device family), a flip-flop and a multiplexer. There are special-purpose
resources as well, such as memory blocks and signal processing
elements (multipliers).

Area optimization techniques

Minimizing area: control-based logic reuse

Control-based logic reuse can be considered the opposite of
loop unrolling. A pipeline requires internal data storage
resources and additional logic to implement parallel
operation. These resources can be reused at the cost of
reduced throughput.

[Figure: block diagrams of the two implementations that sum in1..in4 (registers plr1, plr2 and acc; FSM with control signals sel_input, zero and ce_acc).]

Control-based logic reuse requires an FSM to generate control signals.
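
A minimal VHDL sketch of the idea, not the circuit from the slide: four inputs are summed with a single adder and an accumulator register, while a small FSM selects the operand in each cycle. The entity name add4_shared, the start and done ports, the state names and the 32-bit unsigned type are assumptions made for this illustration.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: sums four inputs with one shared adder.
entity add4_shared is
  port (
    clk, reset         : in  std_logic;
    start              : in  std_logic;  -- assumed: requests a new sum
    in1, in2, in3, in4 : in  unsigned(31 downto 0);
    acc                : out unsigned(31 downto 0);
    done               : out std_logic   -- assumed: '1' while no addition is in progress
  );
end entity;

architecture reuse of add4_shared is
  type state_t is (idle, add1, add2, add3, add4);
  signal state   : state_t := idle;
  signal acc_reg : unsigned(31 downto 0) := (others => '0');
  signal operand : unsigned(31 downto 0);
begin
  -- Input multiplexer: the FSM state acts as the select signal.
  -- The inputs are assumed to stay stable during the four additions.
  with state select operand <=
    in1             when add1,
    in2             when add2,
    in3             when add3,
    in4             when add4,
    (others => '0') when others;

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state   <= idle;
        acc_reg <= (others => '0');
      else
        case state is
          when idle =>
            if start = '1' then
              acc_reg <= (others => '0');  -- clear the accumulator
              state   <= add1;
            end if;
          -- the same adder is used in all four states
          when add1 => acc_reg <= acc_reg + operand; state <= add2;
          when add2 => acc_reg <= acc_reg + operand; state <= add3;
          when add3 => acc_reg <= acc_reg + operand; state <= add4;
          when add4 => acc_reg <= acc_reg + operand; state <= idle;
        end case;
      end if;
    end if;
  end process;

  acc  <= acc_reg;
  done <= '1' when state = idle else '0';
end architecture;
```

In this sketch a new sum can be started every five clock cycles (one idle cycle plus four additions), whereas a fully pipelined version could accept a new set of inputs every cycle. That loss of throughput is the price of the saved adders and pipeline registers.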

Minimizing area: priority encoders

Resource usage can be reduced by exploiting the mutual exclusion of
the conditions. The elsif statement should be used only if a priority
encoder is required, that is, when the conditions are not mutually
exclusive.

architecture not_priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- independent conditions: each output depends on one ctrl bit only
      if (ctrl(0) = '1') then
        output(0) <= input;
      end if;
      if (ctrl(1) = '1') then
        output(1) <= input;
      end if;
      if (ctrl(2) = '1') then
        output(2) <= input;
      end if;
      if (ctrl(3) = '1') then
        output(3) <= input;
      end if;
    end if;
  end process;
end architecture;

architecture priority of logic is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      -- elsif chain: each branch also depends on all earlier ctrl bits
      if (ctrl(0) = '1') then
        output(0) <= input;
      elsif (ctrl(1) = '1') then
        output(1) <= input;
      elsif (ctrl(2) = '1') then
        output(2) <= input;
      elsif (ctrl(3) = '1') then
        output(3) <= input;
      end if;
    end if;
  end process;
end architecture;

[Figure: enable logic of the registers output_a to output_d. Without exploiting mutual exclusion, the enable of each register depends on its own ctrl bit and on all lower-indexed ctrl bits (ctrl[0] up to ctrl[0..3]). With exploiting mutual exclusion, each register is enabled by a single ctrl bit (ctrl[0], ctrl[1], ctrl[2], ctrl[3]).]

Minimizing area: considering technology primitives

With an appropriate HDL coding style a more efficient logic
synthesis can be achieved. Synthesis tool vendors usually
provide coding recommendations to improve the resource
requirement or the timing parameters of the design. The recommended
coding style takes the unique characteristics of the technology
primitives into consideration.
utilizing block RAM modules in FPGAs: Block RAM modules do
not provide a reset for the stored contents, and their outputs are
synchronous to a clock signal. Only HDL models with these
properties can be implemented in block RAMs.
utilizing high quality DSP units: The DSP slices of FPGAs have
synchronous outputs. This restriction has to be taken into account
when writing the HDL model.
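
A minimal sketch of a multiply-add with registered results, the kind of description the DSP remark above asks for. The entity name mac_sync, the port names and the 18-bit operand width are assumptions for illustration; real DSP operand widths depend on the device family, and whether a DSP block is actually inferred depends on the synthesis tool and its settings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical entity: p = a * b + c with two register stages.
entity mac_sync is
  port (
    clk  : in  std_logic;
    a, b : in  signed(17 downto 0);  -- assumed operand width
    c    : in  signed(35 downto 0);
    p    : out signed(35 downto 0)
  );
end entity;

architecture rtl of mac_sync is
  signal m_reg : signed(35 downto 0) := (others => '0');
  signal c_reg : signed(35 downto 0) := (others => '0');
  signal p_reg : signed(35 downto 0) := (others => '0');
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: registered product; c is delayed to stay aligned with it
      m_reg <= a * b;          -- 18 x 18 bits gives a 36-bit product
      c_reg <= c;
      -- stage 2: registered sum, no asynchronous reset or output
      p_reg <= m_reg + c_reg;
    end if;
  end process;

  p <= p_reg;
end architecture;
```

The result appears two clock cycles after the operands, and a new set of operands can be applied in every cycle.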


architecture FFS of RAM is
begin
  -- reset is in the sensitivity list because it acts asynchronously
  process (clk, reset)
  begin
    if (reset = '1') then
      content <= (others => (others => '0'));
    elsif (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
    end if;
  end process;
  -- asynchronous (combinational) read
  data_out <= content(address);
end architecture;

Because of the asynchronous output this model cannot be
implemented in block RAM. The reset function hinders the
LUT RAM implementation as well.

architecture BRAM of RAM is
begin
  process (clk)
  begin
    if (rising_edge(clk)) then
      if (write = '1') then
        content(address) <= data_in;
      end if;
      -- synchronous (registered) read
      data_out <= content(address);
    end if;
  end process;
end architecture;

This model can be implemented as flip-flops, LUT RAM or
block RAM.

Additional readings

Steve Kilts, Advanced FPGA Design: Architecture, Implementation, and Optimization
David Money Harris, Sarah L. Harris, Digital Design and Computer Architecture
Peter J. Ashenden, Digital Design: An Embedded Systems Approach Using VHDL
M. Morris Mano, Charles R. Kime, Logic and Computer Design Fundamentals
Pong P. Chu, RTL Hardware Design Using VHDL
Peter Wilson, Design Recipes for FPGAs
