HLS
Example 6-1 serial addition
Serial addition is implemented as a plain for loop. Because the loop is already sequential, the
code needs no modification.
example6-1.h:
#include <ap_int.h>
#include <stdio.h>
#define WIDTH 8
#define IN_NUMBER 8
typedef ap_uint<WIDTH> int8;
As shown above, the arbitrary-precision type ap_uint is used to define a type called int8 with a
width of 8 bits. The header also declares the function prototype.
example6-1.cpp:
#include "example6-1.h"
This code represents our intended HW: a simple for loop that calculates the sum of the array
contents and returns the result.
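A minimal sketch of what such a serial sum function could look like. To keep it compilable without the Vivado HLS headers, std::uint8_t stands in for ap_uint<WIDTH>. The result type sum_t and the exact signature are assumptions: a wider result type is used here because eight values below 50 can add up to 392, which does not fit in 8 bits.

```cpp
#include <cstdint>

#define WIDTH 8
#define IN_NUMBER 8

// Stand-in for ap_uint<WIDTH> so the sketch builds with a plain C++ compiler.
typedef std::uint8_t int8;
// Hypothetical result type, wide enough for 8 values below 50 (max 392).
typedef std::uint16_t sum_t;

// Serial accumulation: one addition per loop iteration.
sum_t sum(const int8 r[IN_NUMBER]) {
    sum_t result = 0;
    for (int k = 0; k < IN_NUMBER; k++) {
        result = static_cast<sum_t>(result + r[k]);
    }
    return result;
}
```

In the real project the typedef would come from example6-1.h rather than being redefined here.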
Tb-6-1.cpp:
#include "example6-1.h"
#include <iostream>
#include <stdlib.h>
// fill r with random numbers < 50, print them on screen,
// and calculate the golden reference
sw_result =0;
for (int k =0 ; k<IN_NUMBER; k++)
{
r[k]= rand()% 50;
cout << "r[" << k << "] = " << r[k] << endl;
sw_result += r[k];
}
// DUT
hw_result = sum (r);
This is the test bench for simulation and co-simulation. If the HW result equals the SW result,
main returns 0 and the simulation passes. Otherwise, the simulation fails.
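The pass/fail convention can be sketched as follows. This is an illustration only: run_testbench is a hypothetical helper whose body would normally be main, the sum function is a plain C++ stand-in for the design under test, and std::uint8_t replaces ap_uint<8>.

```cpp
#include <cstdint>
#include <cstdlib>
#include <iostream>

#define IN_NUMBER 8
typedef std::uint8_t int8;    // stand-in for ap_uint<8>
typedef std::uint16_t sum_t;  // hypothetical result type

// Stand-in for the design under test.
sum_t sum(const int8 r[IN_NUMBER]) {
    sum_t result = 0;
    for (int k = 0; k < IN_NUMBER; k++) {
        result = static_cast<sum_t>(result + r[k]);
    }
    return result;
}

// Hypothetical helper; in the actual test bench this body would be main().
// Returns 0 when HW and SW results match, non-zero otherwise.
int run_testbench() {
    int8 r[IN_NUMBER];
    sum_t sw_result = 0;

    // Fill r with random numbers < 50 and accumulate the golden reference.
    for (int k = 0; k < IN_NUMBER; k++) {
        r[k] = static_cast<int8>(std::rand() % 50);
        std::cout << "r[" << k << "] = " << static_cast<int>(r[k]) << std::endl;
        sw_result = static_cast<sum_t>(sw_result + r[k]);
    }

    // DUT
    sum_t hw_result = sum(r);

    std::cout << "sw_result = " << sw_result
              << ", hw_result = " << hw_result << std::endl;
    return (hw_result == sw_result) ? 0 : 1;
}
```

The simulation tools treat a non-zero return value from main as a failure, which is why the comparison result is returned rather than only printed.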
We set sum as the top-level function. After a successful synthesis, the report shows:
1. General Report
2. Performance Estimates
a. Timing
b. Latency
i. Latency: the number of clock cycles needed to output the results.
ii. Interval: the number of clock cycles needed before the next set of inputs can be read.
3. Utilization Estimates
4. Interface
Note: To reduce the latency, the design should be made concurrent, which means unrolling the loop.
By default, Vivado HLS adds clk and rst ports along with handshaking ports. The array is treated
as a block memory. Note that a BRAM has only 2 ports, which can be a bottleneck for some
designs; those designs need array partitioning to be optimized.
The analysis perspective shows how the HW works. Here c0, c1, and c2 are the control states,
which are similar to FSM states, although there is no one-to-one mapping between them. The
read operation takes 2 clock cycles and the addition takes one. The orange cell refers to the
loop.
Example 6-2: concurrent design using an adder tree
Adder tree
To make the design concurrent, the for loop must be unrolled, so this directive is added:
#pragma HLS UNROLL
HLS treats arrays as memories, and a memory can have only 2 read ports. To perform 8 memory
reads at the same time, we therefore need at least 4 memories of length 2. The array is
partitioned using this directive:
#pragma HLS ARRAY_PARTITION variable=r factor=4 dim=1
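As a rough sketch, the two directives might sit in the function as shown here. The placement of the pragmas, the function name sum_unrolled, and the result type are assumptions, and std::uint8_t stands in for ap_uint<8>. A standard C++ compiler ignores the HLS pragmas (possibly with a warning), so the function still behaves like the serial version in software.

```cpp
#include <cstdint>

#define IN_NUMBER 8
typedef std::uint8_t int8;    // stand-in for ap_uint<8>
typedef std::uint16_t sum_t;  // hypothetical result type

// Hypothetical name; same behavior as the serial sum, with HLS directives added.
sum_t sum_unrolled(const int8 r[IN_NUMBER]) {
// Split r across several memories so more elements can be read per cycle.
#pragma HLS ARRAY_PARTITION variable=r factor=4 dim=1
    sum_t result = 0;
    for (int k = 0; k < IN_NUMBER; k++) {
// Replicate the loop body so all additions can be scheduled concurrently.
#pragma HLS UNROLL
        result = static_cast<sum_t>(result + r[k]);
    }
    return result;
}
```

The directives change only the generated hardware, not the C++ result, so the same test bench can be reused unchanged.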
As shown in the figure, the latency and interval have dropped significantly, at the cost of
increased LUT usage.
Figure 8: Adder tree Interfaces, notice that the array was completely partitioned
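Since the array is completely partitioned, all of its elements can be read at the same time.
The complete partitioning is most likely because the directive names no partition type, and
complete is the Vivado HLS default. Without partitioning, the memory ports would be a
bottleneck and the design would not perform as expected.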
Figure 9: the analysis perspective of adder tree
As shown in the analysis perspective, all the functionality occurs in one clock cycle. We also
see that there are 7 adder units, which make up the adder tree.
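The count of 7 adders follows from the tree structure for 8 inputs: 4 adders in the first level, 2 in the second, and 1 in the third. The sketch below writes that tree out by hand to make the structure visible. It is an illustration, not the code the tool generates, and the name adder_tree and the result type are assumptions.

```cpp
#include <cstdint>

typedef std::uint8_t int8;    // stand-in for ap_uint<8>
typedef std::uint16_t sum_t;  // hypothetical result type

// Hand-written adder tree for 8 inputs: 4 + 2 + 1 = 7 adders in 3 levels.
sum_t adder_tree(const int8 r[8]) {
    // Level 1: four independent additions
    sum_t s0 = static_cast<sum_t>(r[0] + r[1]);
    sum_t s1 = static_cast<sum_t>(r[2] + r[3]);
    sum_t s2 = static_cast<sum_t>(r[4] + r[5]);
    sum_t s3 = static_cast<sum_t>(r[6] + r[7]);

    // Level 2: two independent additions
    sum_t t0 = static_cast<sum_t>(s0 + s1);
    sum_t t1 = static_cast<sum_t>(s2 + s3);

    // Level 3: final addition
    return static_cast<sum_t>(t0 + t1);
}
```

Additions within a level do not depend on each other, which is what allows them to execute in parallel once all eight elements are readable at once.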
When minimum area is required, a serial architecture should be used, as in the following code:
Header.h:
#include <stdio.h>
return result;
}
sw_result =0;
for (int k =0 ; k<100; k++)
{
a= rand()% 50;
sw_result = pow (a,4);
// DUT
hw_result = powerof4 (a);
cout << "a = " << a;
cout << " sw_result = " << sw_result ;
cout << " hw_result = " << hw_result << endl;
if (hw_result != sw_result) err_cnt++;
}
return err_cnt;
}
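For orientation, serial and concurrent power-of-4 functions could take roughly the following form. Both function names, the types, and the exact structure are assumptions made for illustration; the actual implementations may differ.

```cpp
#include <cstdint>

// Hypothetical result type: 49^4 = 5764801 fits comfortably in 32 bits.
typedef std::uint32_t result_t;

// Serial form (assumed): a rolled loop that can reuse a single multiplier,
// performing one multiplication per iteration.
result_t powerof4_serial(std::uint8_t a) {
    result_t result = 1;
    for (int i = 0; i < 4; i++) {
        result *= a;
    }
    return result;
}

// Concurrent form (assumed): the multiplications are written out so the
// tool is free to instantiate separate multipliers.
result_t powerof4_concurrent(std::uint8_t a) {
    result_t sq = static_cast<result_t>(a) * a;  // a^2
    return sq * sq;                              // (a^2)^2 = a^4
}
```

Both forms return the same value for any input below 50, so either can be checked against pow(a, 4) in the test bench.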
Comparing the latency and resource utilization of the serial and concurrent architectures for
the power of 4 gives this table:
Figure 10: using the serial architecture, we used half the number of DSPs provided
return result;