
PERFORMANCE

HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power
These tips and techniques can lead to better Virtex-5 designs.

by Brian Philofsky
Staff Software Technical Marketing Manager
Xilinx, Inc.
brian.philofsky@xilinx.com

FPGAs have been very flexible in accommodating any HDL coding or design style for digital logic; Xilinx® Virtex™-5 devices are no exception. Although Virtex-5 FPGAs can accommodate many different types of designs written in many different methods, certain recommended constructs and manners can achieve improved optimization in terms of area, performance, and power.

Know Your Target Architecture and Synthesis Tool
Before beginning any project, you should understand the device architecture you are targeting. For Virtex-5 FPGAs, I recommend reading the Virtex-5 Users Guide (http://direct.xilinx.com/bvdocs/userguides/ug190.pdf) before starting your first line of code. Once you have a better understanding and vision as to how your code will ultimately result in the base hardware, you can make both large and small design and coding decisions confidently.

For example, if you know of and use Bitslip technology within the ISERDES, you could save time, effort, and resources by capturing input data rather than attempting to describe and build similar circuitry.

In another example, if you know the structure and capability of the DSP48E, you can make better choices as to when and where to place pipeline registers. Dedicated features like the wider multiplier or post adder can also help you achieve better area, performance, and power.

Similarly, knowing the capabilities and current limitations of your synthesis tool can not only help when choosing coding styles to properly infer primitives but can also give you greater insight as to when to instantiate a component or use inference. Review synthesis manuals, application notes, or other relevant materials before starting so that you know the recommended coding styles for the synthesis tool you are using.

You should also update and use the latest versions of synthesis and ISE™ tools before beginning a project. Although initial synthesis support for the Virtex-5 architecture is strong, many improvements in optimization and inference support are still to come with new releases. One easy way to ensure more optimal designs in terms of area, performance, and power is to install the latest version of the software.

Control Signal Polarity
The Virtex-5 architecture can support different control signal polarity (clock enables, resets, or sets). However, to have the most optimal design, I recommend consistent use of active high control signals in your design. The Virtex-5 slice control logic is active high, and when described in this same manner in the code should never require additional LUT resources for a simple signal inversion.

If the signal comes from an external pin and needs an active low polarity, I suggest inverting the signal in the top-level code and using a positive polarity in all processes and sub-modules requiring that signal. This is critical for designs that have several cores, use bottom-up synthesis techniques, have KEEP_HIERARCHY constraints, or employ the use of partitions (Figure 2).

Designs that fall into these categories are more susceptible to the use of additional LUTs per core/netlist/hierarchy/partition for the sole purpose of inverting these control signals, which not only consume extra LUT resources but may also have negative effects on performance and slice

Fourth Quarter 2006 Xcell Journal 19



packing. As a general rule, always code sets, resets, and enables with an active high (logic 1 activates) polarity.

Figure 2 – How clock enable polarity affects LUT utilization in a design (diagram not reproduced; its labels are Top, Clock Enable, Partition, Old Netlist, KEEP_HIERARCHY, Core, LUT6, and Flip-Flop CE)

Use of Resets
It is common practice to use a global asynchronous reset in the source HDL code to initialize the design; however, in many cases this consumes additional resources. Instead, think synchronous and local. I suggest describing synchronous set/reset logic for the portions of the design that do need periodic resets. For those portions of the design that do not, you can initialize the signals defined to be registered in the HDL code at the time they are declared (for example, when defining a reg in Verilog or a signal in VHDL). This methodology allows for improved packing density, enhances timing analysis and performance, and can improve area resources.

In terms of FPGA behavior, without a global reset described in the code, a GSR (global set/reset) will occur upon completion of the configuration cycle, initializing all registers to known specified values. This same cycle is also simulated in the gate-level simulation netlist, giving the same known starting point as in the FPGA.

In terms of RTL simulation, having the registers initialized in the code allows for proper RTL or behavioral simulation; this same initialization will be picked up by the synthesis tool and applied to the implemented design. Therefore, for simulation at any stage, a global reset is redundant and unnecessary.

Using a synchronous reset instead of an asynchronous reset also allows for more predictable behavior upon the assertion and release of the reset, because synchronous signals are automatically analyzed for timing and their behavior is more deterministic when all timing constraints are met. It also allows for the possibility of greater logic optimization and performance because it is not global.

When using synchronous control signals, you can move portions of the logic function to the synchronous set or reset of the flip-flop; this is not possible with asynchronous signals. By only describing a reset where necessary, the synthesis tool can use alternative resource choices like SRLs (shift register LUTs); distributed RAM (LUT-based RAM) memory; or block RAM for the implementation, which would otherwise be neither possible nor optimal. The synthesis tool has maximum flexibility to choose the best resource for the described code.

Pipelining
As with previous FPGA generations, properly pipelining your design is necessary to achieve top performance and improved power characteristics. With the introduction of the Virtex-5 architecture, a new logic structure dictates slightly different rules regarding when and how to pipeline.

The Virtex-5 device departs from the traditional four-input LUT in previous FPGA families and has an enhanced six-input LUT (6-LUT), allowing for wider logic functions between pipeline registers while maintaining top performance. You should keep this in mind: ideally, each logic function coded between registers should have six inputs to get the best pipelining and LUT resource usage.

In cases where it is not practical or possible to have exactly six inputs in a given logic function, the wider input 6-LUT still allows for good performance by


Verilog Coding Example

`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company: Xilinx
// Engineer: Brian Philofsky
//
// Create Date: 07:42:58 08/12/2006
// Design Name: good_design
// Module Name: good_code
// Project Name: HDL Coding and Design Practices for Improving Virtex 5
//               Utilization, Performance and Power
// Target Devices: Virtex 5
// Tool versions: ISE 8.2i
// Description: This is example code employing some good coding practices
//              when targeting a Virtex 5 device.
//
// Revision 0.01 - File Created
//
//////////////////////////////////////////////////////////////////////////////////
module good_code #(
    parameter data_width = 16,
              parity_width = 2)
   (input [data_width-1:0] DATA_IN,
    input DATA_STORE,
    input CLK, RST,
    input READ_DATA,
    output [data_width+parity_width-1:0] DATA_OUT,
    output reg RW_ERROR = 1'b0,
    output DATA_VALID, FULL
   );

   // Always initialize registers to known values
   reg [data_width-1:0] data_in_reg = {data_width{1'b0}};
   reg [data_width-1:0] data_in_reg2 = {data_width{1'b0}};
   reg [2:0] data_store_delay = 3'b000;
   reg [2:0] data_valid_delay = 3'b000;
   reg [parity_width-1:0] parity = {parity_width{1'b0}};

   wire read_error, write_error;

   // Assumed: DATA_VALID is the delayed read strobe (the port is otherwise undriven)
   assign DATA_VALID = data_valid_delay[2];

   // Use resets only where necessary and make them synchronous
   // Make resets and clock enables active high
   always @(posedge CLK)
      if (RST)
         data_in_reg <= {data_width{1'b0}};
      else if (DATA_STORE)
         data_in_reg <= DATA_IN;

   // Do not use resets where not necessary
   // In this case an SRL can be used due to the fact no reset is described.
   always @(posedge CLK) begin
      data_store_delay <= {data_store_delay[1:0], DATA_STORE};
      data_in_reg2 <= data_in_reg;
      data_valid_delay <= {data_valid_delay[1:0], READ_DATA};
      RW_ERROR <= read_error | write_error;
      parity[1] <= ^data_in_reg[15:8];
      parity[0] <= ^data_in_reg[7:0];
   end

   // In general, RAMs should be inferred; however, in this case a FIFO is needed
   // and synthesis cannot yet infer the dedicated Virtex 5 FIFO.

   // FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO
   // Virtex-5
   // Xilinx HDL Language Template, version 8.2.2i

   FIFO18 #(
      .ALMOST_FULL_OFFSET(12'h080),     // Sets almost full threshold
      .ALMOST_EMPTY_OFFSET(12'h080),    // Sets the almost empty threshold
      .DATA_WIDTH(18),                  // Sets data width to 4, 9 or 18
      .DO_REG(1),                       // Enable output register (0 or 1)
                                        // Must be 1 if EN_SYN = "FALSE"
      .EN_SYN("TRUE"),                  // Specifies FIFO as Asynchronous ("FALSE")
                                        // or Synchronous ("TRUE")
      .FIRST_WORD_FALL_THROUGH("FALSE") // Sets the FIFO FWFT to "TRUE" or "FALSE"
   ) FIFO18_inst (
      .ALMOSTEMPTY(),             // 1-bit almost empty output flag
      .ALMOSTFULL(),              // 1-bit almost full output flag
      .DO(DATA_OUT[15:0]),        // 16-bit data output
      .DOP(DATA_OUT[17:16]),      // 2-bit parity data output
      .EMPTY(),                   // 1-bit empty output flag
      .FULL(FULL),                // 1-bit full output flag
      .RDCOUNT(),                 // 12-bit read count output
      .RDERR(read_error),         // 1-bit read error output
      .WRCOUNT(),                 // 12-bit write count output
      .WRERR(write_error),        // 1-bit write error
      .DI(data_in_reg2),          // 16-bit data input
      .DIP(parity[1:0]),          // 2-bit parity input
      .RDCLK(CLK),                // 1-bit read clock input
      .RDEN(READ_DATA),           // 1-bit read enable input
      .RST(RST),                  // 1-bit reset input
      .WRCLK(CLK),                // 1-bit write clock input
      .WREN(data_store_delay[2])  // 1-bit write enable input
   );

   // End of FIFO18_inst instantiation

endmodule

VHDL Coding Example

----------------------------------------------------------------------------------
-- Company: Xilinx
-- Engineer: Brian Philofsky
--
-- Create Date: 07:42:58 08/12/2006
-- Design Name: good_design
-- Module Name: good_code2
-- Project Name: HDL Coding Practices for Improving Virtex 5 Utilization,
--               Performance and Power
-- Target Devices: Virtex 5
-- Tool versions: ISE 8.2i
-- Description: This is example code employing some good coding practices
--              when targeting a Virtex 5 device.
--
-- Revision 0.01 - File Created
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
library UNISIM;
use UNISIM.Vcomponents.all;

entity good_code2 is
   generic (
      data_width : integer := 16;
      parity_width : integer := 2
   );
   port (
      DATA_IN : in std_logic_vector(data_width-1 downto 0);
      DATA_STORE: in std_logic;
      CLK, RST: in std_logic;
      READ_DATA: in std_logic;
      DATA_OUT : out std_logic_vector(data_width+parity_width-1 downto 0);
      RW_ERROR : out std_logic := '0';
      DATA_VALID, FULL : out std_logic
   );
end good_code2;

architecture XILINX of good_code2 is

   -- Always initialize registers to known values
   signal data_in_reg: std_logic_vector(data_width-1 downto 0) := (others => '0');
   signal data_in_reg2: std_logic_vector(data_width-1 downto 0) := (others => '0');
   signal data_store_delay: std_logic_vector(2 downto 0) := "000";
   signal data_valid_delay: std_logic_vector(2 downto 0) := "000";
   signal parity: std_logic_vector(parity_width-1 downto 0) := (others => '0');

   signal read_error, write_error: std_logic;

begin

   -- Assumed: DATA_VALID is the delayed read strobe (the port is otherwise undriven)
   DATA_VALID <= data_valid_delay(2);

   -- Use resets only where necessary and make them synchronous
   -- Make resets and clock enables active high
   process (CLK)
   begin
      if (CLK'event and CLK='1') then
         if RST='1' then
            data_in_reg <= (others => '0');
         elsif (DATA_STORE='1') then
            data_in_reg <= DATA_IN;
         end if;
      end if;
   end process;

   -- Do not use resets where not necessary
   -- In this case an SRL can be used due to the fact no reset is described.
   process (CLK)
   begin
      if (CLK'event and CLK='1') then
         data_store_delay <= (data_store_delay(1 downto 0) & DATA_STORE);
         data_in_reg2 <= data_in_reg;
         data_valid_delay <= (data_valid_delay(1 downto 0) & READ_DATA);
         RW_ERROR <= read_error OR write_error;
         parity(1) <= (data_in_reg(15) XOR data_in_reg(14) XOR data_in_reg(13) XOR
                       data_in_reg(12) XOR data_in_reg(11) XOR data_in_reg(10) XOR
                       data_in_reg(9) XOR data_in_reg(8));
         parity(0) <= (data_in_reg(7) XOR data_in_reg(6) XOR data_in_reg(5) XOR
                       data_in_reg(4) XOR data_in_reg(3) XOR data_in_reg(2) XOR
                       data_in_reg(1) XOR data_in_reg(0));
      end if;
   end process;

   -- In general, RAMs should be inferred; however, in this case a FIFO is needed
   -- and synthesis cannot yet infer the dedicated Virtex 5 FIFO.

   -- FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO
   -- Virtex-5
   -- Xilinx HDL Language Template version 8.2.2i

   FIFO18_inst : FIFO18
   generic map (
      ALMOST_FULL_OFFSET => X"080",      -- Sets almost full threshold
      ALMOST_EMPTY_OFFSET => X"080",     -- Sets the almost empty threshold
      DATA_WIDTH => 18,                  -- Sets data width to 4, 9, or 18
      DO_REG => 1,                       -- Enable output register (0 or 1)
                                         -- Must be 1 if EN_SYN = FALSE
      EN_SYN => TRUE,                    -- Specifies FIFO as Asynchronous (FALSE) or
                                         -- Synchronous (TRUE)
      FIRST_WORD_FALL_THROUGH => FALSE)  -- Sets the FIFO FWFT to TRUE or FALSE
   port map (
      ALMOSTEMPTY => open,             -- 1-bit almost empty output flag
      ALMOSTFULL => open,              -- 1-bit almost full output flag
      DO => DATA_OUT(15 downto 0),     -- 16-bit data output
      DOP => DATA_OUT(17 downto 16),   -- 2-bit parity data output
      EMPTY => open,                   -- 1-bit empty output flag
      FULL => FULL,                    -- 1-bit full output flag
      RDCOUNT => open,                 -- 12-bit read count output
      RDERR => read_error,             -- 1-bit read error output
      WRCOUNT => open,                 -- 12-bit write count output
      WRERR => write_error,            -- 1-bit write error
      DI => data_in_reg2,              -- 16-bit data input
      DIP => parity,                   -- 2-bit parity input
      RDCLK => CLK,                    -- 1-bit read clock input
      RDEN => READ_DATA,               -- 1-bit read enable input
      RST => RST,                      -- 1-bit reset input
      WRCLK => CLK,                    -- 1-bit write clock input
      WREN => data_store_delay(2)      -- 1-bit write enable input
   );

   -- End of FIFO18_inst instantiation

end XILINX;

Figure 3 – Sound FPGA coding styles
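
The control-signal polarity advice can also be sketched at the top level. The following is a minimal illustration and not part of the original listing: the module names counter_core and top_level and the pin name RST_N are hypothetical, and the board is assumed to supply an active-low reset. The pin is inverted once in the top-level code, so every sub-module sees an active-high, synchronous reset and an active-high clock enable. Synchronizing the external pin to CLK is omitted for brevity.

```verilog
// Hypothetical sub-module: every control signal is active high.
module counter_core #(
    parameter WIDTH = 8
) (
    input                  clk,
    input                  rst,  // synchronous reset, active high
    input                  ce,   // clock enable, active high
    output reg [WIDTH-1:0] count = {WIDTH{1'b0}}  // initialized at declaration
);
    always @(posedge clk)
        if (rst)
            count <= {WIDTH{1'b0}};
        else if (ce)
            count <= count + 1'b1;
endmodule

// Hypothetical top level: the reset pin is assumed to be active low.
module top_level (
    input        CLK,
    input        RST_N,   // active low at the pin (assumption)
    input        ENABLE,
    output [7:0] COUNT_A,
    output [7:0] COUNT_B
);
    // Invert exactly once, here, instead of inside each core or partition.
    wire rst = ~RST_N;

    counter_core #(.WIDTH(8)) core_a (
        .clk(CLK), .rst(rst), .ce(ENABLE), .count(COUNT_A));
    counter_core #(.WIDTH(8)) core_b (
        .clk(CLK), .rst(rst), .ce(ENABLE), .count(COUNT_B));
endmodule
```

Because both cores receive the already-inverted signal, neither hierarchy should need a LUT of its own just to flip polarity, which is the situation Figure 2 warns about.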


reducing the number of logic levels, thus requiring fewer pipeline stages to achieve the same as or better performance than previous FPGA architectures.

A good goal is to aim for fewer than 10 inputs to a given logic function between I/Os, registers, or synchronous blocks (like block RAM or DSP48Es), which generally would represent two logic levels. When you need a significantly higher number of inputs for the design path to meet latency or other requirements, you can attempt to reduce the fan-in to that logic function (when possible) if high performance or low power are your design objectives.

Coding Memories
Among other innovations within the Virtex-5 architecture, Xilinx has enhanced both block RAM and distributed RAM memories with greater capacity and capability. You must make different decisions early in the design process and while coding to get the most from these valuable resources.

General guidelines call for inferring RAMs when possible for easier code changes, faster simulation, and more portable code. However, even when behaviorally describing the RAM, you should keep some important things in mind. The first and most obvious thought is RAM capacity. In terms of block RAMs, the base memory block increased in Virtex-5 devices to 36 Kb of memory storage space. You can configure this block to the wider but shallower 512 x 72 configuration, the deeper single-bit width 32 Kb x 1, or several configurations in between. It is also possible to cascade two 36-Kb RAMs to form a 64-Kb x 1 configuration or break up the 36-Kb RAMs into two separate 18-Kb RAMs capable of 512 x 36 to 16-Kb x 1 configurations.

Distributed RAMs have benefited from the larger LUT structure and can now efficiently accommodate 64-bit depths without any area or performance penalties. This is the most optimal size for this type of RAM in the Virtex-5 device, although other sizes can be accommodated. The base RAM sizes are important to remember during memory selection and coding to most efficiently use the limited RAM resources in the device and achieve the best performance.

Both block RAM and distributed RAM memories also have additional capabilities that require different coding and design considerations. For performance, perhaps the most important is the proper use of output registers. For block RAMs, this means enabling the output registers to the block RAM whenever possible. By enabling the output registers, a reduced clock-to-out is realized from the RAM, thus improving timing for the data leaving the RAM. However, an extra clock cycle of latency is added during reads, for which you must account.

Similarly, when using distributed RAM, the output of the RAM can be asynchronous; however, coding it synchronously will allow the use of the register within the slice, providing better timing characteristics and reducing the chance of the RAM being part of the timing bottleneck.

There are more advanced features of the block RAMs, such as FIFO and ECC (error correction circuitry) capabilities. The distributed RAM also has new capabilities such as a quad-port configuration. In some cases, these features cannot be realized by inference within synthesis and instantiation is necessary. If you need such functionality, I suggest instantiating the RAMs either by generating cores within Xilinx CORE Generator™ software or by instantiating the base primitive. Taking advantage of these advanced features can save RAM and logic resources as well as improve area, performance, and power.

Some General Guidelines
A few other general recommendations do not fall into any specific categories but can result in better coding and design choices. First, you should make wise choices in terms of your design hierarchy right from the start. Your choice of hierarchy can have effects on the synthesis and implementation tools’ ability to optimize the logic paths.

In general, do not allow timing paths to cross multiple boundaries of hierarchy. This not only limits the tool’s ability to optimize logic but may also limit your options for design implementation and design debugging. For instance, you may not be able to use partitions or KEEP_HIERARCHY on certain hierarchies with this practice.

For designs in which some or most of the code was created for an architecture other than Virtex-5 FPGAs, I suggest that you review the code to ensure that it is well suited for implementation into the new architecture. A few minutes of time spent here can save several hours later if you identify and correct suboptimal code.

If your design contains cores or pre-compiled netlists (EDIF or NGC files) from a previous architecture, you should regenerate those targeting Virtex-5 devices. Unless regenerated, netlists optimized for a previous architecture are likely to be far from optimal when targeting Virtex-5 architectures.

One last suggestion is to use the HDL language templates within the ISE tools. They not only help with accelerating the generation of VHDL or Verilog code, but also provide assistance in creating more optimal code for FPGAs. They also cut down on the possibility of creating syntax or other simple but common mistakes that can hold up the testing and verifying of HDL code.

Figure 3 shows both Verilog and VHDL code following the guidelines discussed here.

Conclusion
Coding styles are very individual; however, following these suggestions makes it more likely that you will achieve a more optimal result. These guidelines do not represent absolutely everything you need to know to achieve the best Virtex-5 design possible, but I have provided some common strategies that can help in achieving more optimal designs.

Almost any set of valid HDL code likely will result in a functioning design, but following a few simple guidelines can help in terms of improved density, performance, and power, and many times may reduce the amount of time it takes to ultimately complete a design.

For more information, see the Synthesis and Simulation Design Guide at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/sim/sim.pdf or White Paper 231, “HDL Coding Practices to Accelerate Design Performance,” at http://direct.xilinx.com/bvdocs/whitepapers/wp231.pdf.
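
To make the memory guidance under "Coding Memories" more concrete, here is a hedged sketch of two inferred memories with registered outputs. The module and signal names are hypothetical. Whether a given synthesis tool maps the second register stage into the block RAM's output register, or honors the ram_style attribute, depends on the tool and version, so treat both as assumptions to check against the synthesis report.

```verilog
// Sketch: 512 x 36 memory sized to fit one 18-Kb block RAM.
// Reads take two clock cycles because of the extra output register.
module bram_512x36 (
    input             clk,
    input             we,
    input      [8:0]  addr,
    input      [35:0] din,
    output reg [35:0] dout = 36'd0
);
    reg [35:0] mem [0:511];
    reg [35:0] ram_q = 36'd0;

    always @(posedge clk) begin
        if (we)
            mem[addr] <= din;
        ram_q <= mem[addr];  // synchronous read: first cycle of latency
        dout  <= ram_q;      // output register: second cycle, better clock-to-out
    end
endmodule

// Sketch: 64-deep distributed RAM with a synchronous (registered) read.
module dist_ram_64x8 (
    input            clk,
    input            we,
    input      [5:0] waddr,
    input      [5:0] raddr,
    input      [7:0] din,
    output reg [7:0] dout = 8'd0
);
    // Tool-specific hint (assumption): ask for LUT-based RAM rather than block RAM.
    (* ram_style = "distributed" *) reg [7:0] mem [0:63];

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= din;
        dout <= mem[raddr];  // registering the read lets the slice flip-flop be used
    end
endmodule
```

Neither module describes a reset on the memory or its output registers, which keeps the description compatible with the dedicated RAM resources; the surrounding logic must account for the added read latency.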

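
As a further illustration of the pipelining guidance, the sketch below splits a 36-input parity function into two registered stages so that no function between registers has more than six inputs. The module name parity36_pipelined is hypothetical, and the mapping of each stage onto a single 6-LUT level is an expectation, not something verified with any particular tool.

```verilog
// Sketch: 36-bit parity computed in two pipeline stages of 6-input XORs.
module parity36_pipelined (
    input        clk,
    input [35:0] din,
    output reg   parity = 1'b0
);
    reg [5:0] stage1 = 6'b000000;
    integer i;

    always @(posedge clk) begin
        // Stage 1: six independent 6-input XOR functions.
        for (i = 0; i < 6; i = i + 1)
            stage1[i] <= ^din[6*i +: 6];
        // Stage 2: one 6-input XOR over the previous cycle's stage-1 results.
        parity <= ^stage1;
    end
endmodule
```

The result appears two clock cycles after the input. No reset is described because the registers are initialized at declaration, consistent with the reset guidance above.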