
HW/SW co-design for SoC with Vivado HLS

Example 6-1 serial addition

The serial addition is implemented as a for loop. Because the loop is already sequential, the code needs no modification.

Figure 1: Serial adder DFG

The code is as follows:

example6-1.h:
#include <ap_int.h>
#include <stdio.h>

#define WIDTH 8
#define IN_NUMBER 8
typedef ap_uint<WIDTH> int8;

int8 sum ( int8 r[IN_NUMBER]);

As shown above, the arbitrary-precision type ap_uint is used to define a type called int8 that is 8 bits wide (and, despite its name, unsigned).
The function prototype is declared as well.
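
Because int8 is only 8 bits wide, the accumulated sum wraps modulo 256. With eight inputs below 50 the true sum can reach 392, so the result may wrap. A minimal sketch of this behaviour in plain C++ follows. It uses std::uint8_t as a stand-in for ap_uint<8> (an assumption, made so the snippet compiles without the Vivado HLS headers), and the helper name wrapped_sum is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for ap_uint<8> (assumption): std::uint8_t also wraps modulo 256.
typedef std::uint8_t u8;

// Hypothetical helper: adds n values into an 8-bit accumulator.
u8 wrapped_sum(const u8 *r, std::size_t n) {
    u8 acc = 0;
    for (std::size_t k = 0; k < n; ++k) {
        acc = static_cast<u8>(acc + r[k]);  // result is truncated to 8 bits
    }
    return acc;
}
```

For example, eight inputs of 49 add up to 392, which an 8-bit accumulator stores as 136. A test bench that accumulates its reference in the same 8-bit type wraps in the same way, so the comparison still holds.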

example6-1.cpp:
#include "example6-1.h"

int8 sum(int8 r[IN_NUMBER])
{
    int8 sum = 0;
    for (int k = 0; k < IN_NUMBER; k++)
    {
        sum += r[k];
    }
    return sum;
}

This code represents our intended HW: a simple for loop that calculates the sum of the array
contents and returns the result.

Tb-6-1.cpp:
#include "example6-1.h"
#include <iostream>
#include <stdlib.h>

using namespace std;

int main()
{
    int8 sw_result, hw_result;
    int8 r[IN_NUMBER];

    // Fill r with random numbers < 50, print them on screen and
    // calculate the golden reference.
    sw_result = 0;
    for (int k = 0; k < IN_NUMBER; k++)
    {
        r[k] = rand() % 50;
        cout << "r[" << k << "] = " << r[k] << endl;
        sw_result += r[k];
    }

    // DUT
    hw_result = sum(r);

    cout << "sw_result = " << sw_result << endl;
    cout << "hw_result = " << hw_result << endl;

    // Return 0 on a match (pass) and 1 otherwise (fail).
    if (hw_result == sw_result) return 0;
    else return 1;
}

This is the test bench used for both simulation and co-simulation. If the HW result equals the SW
result, main returns 0 and the simulation passes. Otherwise, the simulation fails.

A successful simulation will output the following:


INFO: [SIM 211-2] *************** CSIM start ***************
WARNING: [SIM 211-51] HLS only supports CLANG compiler in Linux.
INFO: [SIM 211-4] CSIM will launch GCC as the compiler.
Compiling ../../../example6-1.cpp in debug mode
Generating csim.exe
r[0] = 41
r[1] = 17
r[2] = 34
r[3] = 0
r[4] = 19
r[5] = 24
r[6] = 28
r[7] = 8
sw_result = 171
hw_result = 171
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
Finished C simulation.

We defined sum as the top function. After a successful synthesis, the report shows:

1. General Report
2. Performance Estimates
   a. Timing
   b. Latency
      i. Latency: the number of clock cycles needed to output the results.
      ii. Interval: the number of clock cycles needed before the next set of inputs can be read.
3. Utilization Estimates
4. Interface
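
As a rough guide to reading the latency figure: for a loop that is neither pipelined nor unrolled, the loop latency is approximately the trip count multiplied by the latency of one iteration. The sketch below expresses that relation. The function name and the cycle counts in the note after it are illustrative assumptions, not values taken from the report.

```cpp
// Rough model (assumption): latency in clock cycles of a rolled, non-pipelined loop.
// trip_count        : number of loop iterations
// iteration_latency : clock cycles taken by one iteration
unsigned rolled_loop_latency(unsigned trip_count, unsigned iteration_latency) {
    return trip_count * iteration_latency;
}
```

With 8 iterations and a hypothetical 2 cycles per iteration, the loop alone would take 16 cycles. The latency reported for the whole function is typically slightly higher because of entry and exit states.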

Figure 2: Serial adder performance estimates

Note: To reduce the latency, the design should be made concurrent, which is done by unrolling the loop.

Figure 3: the utilization estimates of serial adder


Figure 4: serial adder interface

By default, Vivado HLS adds clk and rst ports together with handshaking ports. The array is
implemented as a block memory (BRAM). Note that a BRAM has only 2 ports, so it can become a
bottleneck in some designs, which then require array partitioning to be optimized.

Figure 5: analysis of serial adder RTL

From the analysis perspective we can see how the HW works. c0, c1 and c2 are the control
states; they are similar to FSM states, but there is no one-to-one mapping between them. The
read operation takes 2 clock cycles and the addition requires one. The orange cell refers to the
loop.

Example 6-2 concurrent design using adder tree

Adder tree

Figure 6: adder tree

To make the design concurrent, the for loop must be unrolled, so this directive is added:
#pragma HLS UNROLL

HLS treats arrays as memories, and a memory has at most 2 read ports. To perform 8 memory
reads at the same time we therefore need 4 memories of length 2, so the array is partitioned
with this directive:
#pragma HLS ARRAY_PARTITION variable=r factor=4 dim=1

Note that the directive names no partition type (block or cyclic). The default type is complete,
for which the factor is not used. This is consistent with the completely partitioned array seen
in Figure 8.
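
To make the idea of 4 memories of length 2 concrete, the sketch below shows how a cyclic partition with factor 4 would distribute the eight elements: element k lands in bank k % 4 at offset k / 4. This is plain C++ for illustration only. The struct and function names are hypothetical, and the cyclic scheme is an assumption chosen for the example.

```cpp
// Location of one array element after partitioning (hypothetical helper type).
struct BankSlot {
    unsigned bank;    // which of the smaller memories holds the element
    unsigned offset;  // address inside that memory
};

// Cyclic partitioning: consecutive elements are dealt out to consecutive banks.
BankSlot cyclic_slot(unsigned k, unsigned factor) {
    BankSlot s;
    s.bank = k % factor;
    s.offset = k / factor;
    return s;
}
```

With factor 4, r[0] and r[4] share bank 0, r[1] and r[5] share bank 1, and so on. Each two-port memory then holds exactly two elements and can deliver both in the same cycle.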

After synthesis the result will be an adder tree. The code:


#include "example6-1.h"

int8 sum(int8 r[IN_NUMBER])
{
#pragma HLS ARRAY_PARTITION variable=r factor=4 dim=1
    int8 sum = 0;
    sum_label1: for (int k = 0; k < IN_NUMBER; k++)
    {
#pragma HLS UNROLL
        sum += r[k];
    }
    return sum;
}

Figure 7: comparison between Serial and concurrent adders

As shown in the figure, the latency and interval have dropped significantly, at the cost of
increased LUT usage.

Figure 8: Adder tree Interfaces, notice that the array was completely partitioned

Since the array is completely partitioned, all array elements can be read at the same time.
Without this, the memory ports would be a bottleneck and the design would not perform as expected.

Figure 9: the analysis perspective of adder tree

As shown in the analysis perspective, all the functionality occurs in one clock cycle. We can also
see 7 adder units, which together compose the adder tree.
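
The seven adders can be pictured as three levels of pairwise additions (4 + 2 + 1). The following plain C++ sketch writes that tree out explicitly. It uses std::uint8_t in place of ap_uint<8> so that it compiles without the HLS headers, and the function name is hypothetical. It illustrates the structure only and is not the generated RTL.

```cpp
#include <cstdint>

typedef std::uint8_t u8;  // stand-in for ap_uint<8> (assumption)

// Hypothetical adder tree for 8 inputs: 4 + 2 + 1 = 7 additions, 3 levels deep.
u8 adder_tree8(const u8 r[8]) {
    // Level 1: four independent additions
    u8 s0 = static_cast<u8>(r[0] + r[1]);
    u8 s1 = static_cast<u8>(r[2] + r[3]);
    u8 s2 = static_cast<u8>(r[4] + r[5]);
    u8 s3 = static_cast<u8>(r[6] + r[7]);
    // Level 2: two independent additions
    u8 t0 = static_cast<u8>(s0 + s1);
    u8 t1 = static_cast<u8>(s2 + s3);
    // Level 3: final addition
    return static_cast<u8>(t0 + t1);
}
```

The additions within one level do not depend on each other, which is what allows them to be evaluated side by side in hardware.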

Example 6-3 minimum area design

When minimum area is required, a serial architecture is used, as in the following code:

header.h:
#include <stdio.h>

int powerof4 ( int a);

main.cpp for serial Datapath design


#include "header.h"

int powerof4(int a)
{
    int result = 1;
    for (int i = 0; i < 4; i++)
        result *= a;

    return result;
}

The test bench used to test the design is:


#include "header.h"
#include <cmath>
#include <iostream>
#include <stdlib.h>

using namespace std;

int main()
{
    int a, sw_result, hw_result;
    unsigned int err_cnt = 0;

    sw_result = 0;
    for (int k = 0; k < 100; k++)
    {
        a = rand() % 50;
        // Golden reference: pow() returns a double, converted back to int here.
        sw_result = static_cast<int>(pow(static_cast<double>(a), 4.0));
        // DUT
        hw_result = powerof4(a);
        cout << "a = " << a;
        cout << " sw_result = " << sw_result;
        cout << " hw_result = " << hw_result << endl;
        if (hw_result != sw_result) err_cnt++;
    }

    // A non-zero return value marks the simulation as failed.
    return err_cnt;
}
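
Because pow works in floating point, an integer-only golden model is a common alternative for a reference like this one. A minimal sketch, with a hypothetical helper name that is not part of the original test bench:

```cpp
// Hypothetical integer-only golden model for a^4 (no floating point involved).
int powerof4_ref(int a) {
    int sq = a * a;  // a^2
    return sq * sq;  // a^4
}
```

For the inputs used here (a < 50) the largest result is 49^4 = 5764801, which fits comfortably in a 32-bit int.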

Comparing the latency and resource utilization of the serial and concurrent architectures for
the power of 4 gives this table:

Figure 10: with the serial architecture, only half the number of DSPs is used
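
The concurrent variant is not listed here. One plausible form, stated as an assumption, is the same loop with an UNROLL directive applied to it. Fully unrolling the four iterations is equivalent to the straight-line code below, in which every multiplication can be given its own hardware multiplier. The function name is hypothetical.

```cpp
// Straight-line equivalent of the four-iteration loop after full unrolling (assumed form).
int powerof4_unrolled(int a) {
    int result = 1;
    result *= a;  // iteration 0
    result *= a;  // iteration 1
    result *= a;  // iteration 2
    result *= a;  // iteration 3
    return result;
}
```

In the rolled loop a single multiplier can be reused on every iteration, which is why the serial design is the one that saves area.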

Example 6-3 serial power of 4


It is a simple for loop; the argument type int16 is assumed to be defined in the header:
int powerof4(int16 a)
{
    int result = 1;
    power_loop: for (int i = 0; i < 4; i++)
        result *= a;

    return result;
}
