
Scheduling and Resource Sharing

Digital Systems Design
José M. Leitão, Marcos Antunes, Sérgio Paiágua
November 2012

1 Introduction
Much of the complexity involved in designing a digital system stems from the usually very stringent constraints involved. These typically include limited resources, a target throughput, or even limits on power dissipation. Each of these parameters must then be taken into account during the design process, so that the circuit is tailored to these requirements right from the early stages. In this second project for the course Digital Systems Design, an image processing algorithm is implemented with very limited hardware resources, so that the concepts of scheduling and resource sharing in the context of digital circuit design are explored. In particular, a Sobel-like operator is to be implemented. This operator computes an approximation of the gradient of an image's intensity function. The two components of the gradient vector are obtained by convolving the source image with two 3x3 convolution kernels with parametrizable entries.

This report is organized as follows. Section 2 presents the data flow graph corresponding to one inner-loop iteration of the algorithm; within that section, ASAP and ALAP scheduling are applied, in addition to list scheduling. Section 3 describes in detail the circuit that implements the algorithm, focusing on its two main elements, the datapath and the control unit. The following section discusses the obtained circuit based on the resource utilization and the obtained performance, i.e., maximum operating frequency and throughput. Finally, section 5 concludes the report by discussing to what extent the objectives set forth in the beginning were accomplished and how the circuit could be improved in terms of throughput, with or without increasing the available resources.

2 Scheduling
As described in the introduction, the Sobel operator relies on the convolution of an image with two 3x3 convolution kernels, which are presented below:

Gx = [ -a  0  +a ]        Gy = [ -a  -b  -c ]
     [ -b  0  +b ]             [  0   0   0 ]
     [ -c  0  +c ]             [ +a  +b  +c ]

The constants a, b and c are positive numbers that can be chosen to obtain different filtering effects. For simplicity, the image to process is composed of 10 x 20 pixels with values ranging from 0 to 255, i.e., the image is in grayscale format. When the kernel is being applied to a given pixel S(0,0), the following notation is employed to describe the adjacent pixels:

S(-1,-1)  S(-1,0)  S(-1,+1)
S(0,-1)   S(0,0)   S(0,+1)
S(+1,-1)  S(+1,0)  S(+1,+1)

In the following subsections, the scheduling of the hardware utilization is devised only for one inner-loop iteration of the algorithm. As such, the expression to implement is the following:

gx = -a·S(-1,-1) - b·S(0,-1) - c·S(+1,-1) + a·S(-1,+1) + b·S(0,+1) + c·S(+1,+1)
gy = -a·S(-1,-1) - b·S(-1,0) - c·S(-1,+1) + a·S(+1,-1) + b·S(+1,0) + c·S(+1,+1)
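To make the expression concrete, the following short behavioral model computes gx and gy for every interior pixel of a small grayscale image. It is a Python sketch written for illustration, separate from the VHDL implementation described later in this report; the function name and test image are ours.

# Behavioral sketch of the Sobel-like operator defined above (illustrative only,
# not the VHDL implementation described later in this report).
# S is indexed as S[row][column]; a, b and c are the parametrizable constants.

def sobel_like(S, a, b, c):
    """Return (gx, gy) computed for every interior pixel of the grayscale image S."""
    rows, cols = len(S), len(S[0])
    gx = [[0] * cols for _ in range(rows)]
    gy = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for k in range(1, cols - 1):
            # gx: right column of the window minus the left column, weighted by a, b, c
            gx[r][k] = (-a * S[r-1][k-1] - b * S[r][k-1] - c * S[r+1][k-1]
                        + a * S[r-1][k+1] + b * S[r][k+1] + c * S[r+1][k+1])
            # gy: bottom row of the window minus the top row, weighted by a, b, c
            gy[r][k] = (-a * S[r-1][k-1] - b * S[r-1][k] - c * S[r-1][k+1]
                        + a * S[r+1][k-1] + b * S[r+1][k] + c * S[r+1][k+1])
    return gx, gy

if __name__ == "__main__":
    # 10-pixel-wide test image with values in 0..255, as in the project statement
    img = [[(31 * r + 7 * k) % 256 for k in range(10)] for r in range(3)]
    gx, gy = sobel_like(img, a=1, b=2, c=1)
    print(gx[1][1:9])
    print(gy[1][1:9])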

2.1 Scheduling with no area restrictions


From the equations above, and considering that each operation takes one clock cycle, it is straightforward to obtain the corresponding data flow graph (DFG). This type of graph exposes the data dependencies that exist between the various operations and the mobility associated with each operation, i.e., how large the temporal window is within which a node can be positioned without increasing the latency of the circuit. When all nodes, or operations, have zero mobility, ASAP (as soon as possible) and ALAP (as late as possible) scheduling produce the same schedule, leaving no freedom in placing the operations. Figure 2.1 shows the result of applying both techniques, considering that no hardware restrictions exist, meaning that as long as there are no data dependencies, all operations can run in parallel. It should be noted that the DFG shown already includes an optimization, which arises from the observation that S(-1,-1) and S(+1,+1) are multiplied by the same constants, a and c respectively, in both gx and gy. From the analysis of the diagram, it is clear that only two of the nodes have nonzero mobility and, as such, are scheduled differently depending on the approach taken. However, as none of these operations is part of the critical path, their position within the schedule does not impact the latency of the complete circuit. In either case, whether an ALAP or an ASAP approach is taken, the values of gx and gy for each processed pixel are obtained within 4 clock cycles.
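As an illustration of how the ASAP and ALAP start times and the resulting mobility are obtained, the following sketch computes both schedules for single-cycle operations on a small illustrative DFG (not the exact graph of figure 2.1; the node numbering and example expression are ours).

def asap(nodes, preds):
    start = {}
    for n in nodes:                               # nodes listed in topological order
        start[n] = 1 + max((start[p] for p in preds[n]), default=0)
    return start

def alap(nodes, succs, latency):
    start = {}
    for n in reversed(nodes):
        start[n] = min((start[s] for s in succs[n]), default=latency + 1) - 1
    return start

# Illustrative DFG for a*(x1 - x2) + b*x3: node 1 = subtraction, nodes 2 and 3 =
# multiplications, node 4 = addition; every operation takes one clock cycle.
nodes = [1, 2, 3, 4]
preds = {1: [], 2: [1], 3: [], 4: [2, 3]}
succs = {1: [2], 2: [4], 3: [4], 4: []}

t_asap = asap(nodes, preds)
latency = max(t_asap.values())                    # 3 cycles along the critical path
t_alap = alap(nodes, succs, latency)
mobility = {n: t_alap[n] - t_asap[n] for n in nodes}
print(mobility)   # {1: 0, 2: 0, 3: 1, 4: 0}: only node 3 can be moved without
                  # increasing the latency; the others lie on the critical path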

2.2 Scheduling with area restrictions


Since the previous analysis did not take into account the restriction on the number of occupied resources, the obtained scheduling is far from optimal. The specified maximum number of units for each type of processing element is listed in table 2.1. In order

Figure 2.1: ASAP and ALAP scheduling diagram, ignoring the hardware restrictions and considering single cycle operations.

to meet the project requirements with a minimum impact on the overall throughput, the mathematical expression for computing a result (gx , gy ) was rearranged, using the distributive property of the multiplication, as follows:

gx = -a·S(-1,-1) - b·S(0,-1) - c·S(+1,-1) + a·S(-1,+1) + b·S(0,+1) + c·S(+1,+1)
gy = -a·S(-1,-1) - b·S(-1,0) - c·S(-1,+1) + a·S(+1,-1) + b·S(+1,0) + c·S(+1,+1)

gx = a·[S(-1,+1) - S(-1,-1)] + b·[S(0,+1) - S(0,-1)] + c·[S(+1,+1) - S(+1,-1)]
gy = a·[S(+1,-1) - S(-1,-1)] + b·[S(+1,0) - S(-1,0)] + c·[S(+1,+1) - S(-1,+1)]

The new expression is advantageous for two main reasons. Firstly, it requires fewer arithmetic operations, which by itself decreases the significance of the scheduling limitations imposed by the project's hardware requirements. Secondly, it provides a better balance between the number of occurrences of each arithmetic operation, which happens to be in tune with the number of available resources, i.e., the scarcer arithmetic units correspond to the least frequent operations. Figure 2.2 depicts the ASAP and ALAP scheduling diagrams that were obtained for this new formula, ignoring both the memory access delays and the number of available

Processing Element    Maximum number of units
Multiplier            2
Subtractor            2
Adder                 1

Table 2.1: Area restrictions.

resources. Taking those diagrams as a starting point made it trivial to extract a scheduling solution which accounts for the hardware limitations.
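Before moving on to the schedule itself, the rearranged expression can be checked numerically against the direct form. The sketch below is ours (an arbitrary 3x3 window indexed w[row][column]); it also makes the operation count explicit: 6 multiplications, 6 subtractions and 4 additions per output pair.

import random

def direct(w, a, b, c):
    gx = (-a*w[0][0] - b*w[1][0] - c*w[2][0] + a*w[0][2] + b*w[1][2] + c*w[2][2])
    gy = (-a*w[0][0] - b*w[0][1] - c*w[0][2] + a*w[2][0] + b*w[2][1] + c*w[2][2])
    return gx, gy

def factored(w, a, b, c):
    # 6 subtractions, 6 multiplications and 4 additions in total
    gx = a*(w[0][2] - w[0][0]) + b*(w[1][2] - w[1][0]) + c*(w[2][2] - w[2][0])
    gy = a*(w[2][0] - w[0][0]) + b*(w[2][1] - w[0][1]) + c*(w[2][2] - w[0][2])
    return gx, gy

random.seed(0)
for _ in range(1000):
    w = [[random.randrange(256) for _ in range(3)] for _ in range(3)]
    assert direct(w, 1, 2, 1) == factored(w, 1, 2, 1)
print("direct and factored forms agree on 1000 random windows")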

Figure 2.2: ASAP and ALAP scheduling diagrams for the rearranged formula, ignoring the hardware restrictions and considering single cycle operations.

Although the assumption of single-cycle operations does not represent a real constraint, it remains useful for scheduling purposes, since it provides a clear visualization of the presumable critical path. The scheduling presented in table 2.2 was created using the critical-path priority principle and taking advantage of the mobility of nodes 11 and 12. Note that the final schedule itself does not assume single-cycle operations, since this project was not supposed to make use of a pipelined datapath. Considering that the only restriction is on the number of arithmetic units populating the circuit, a maximum throughput of 0.25 pixels per cycle (ppc) could be achieved using the proposed scheduling. The restriction of using a single shared adder was found to be the limiting factor, because the four addition operations then necessarily require four clock cycles to be computed.

Cycle   Nodes               Mult (max 2)   Sub (max 2)   Add (max 1)
1       1, 3, 7, 9, 13      2              2             1
2       2, 4, 8, 10, 14     2              2             1
3       5, 6, 11, 12, 15    2              2             1
4       16                  0              0             1

Table 2.2: Scheduling list solution limited by the available resources.
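The list-scheduling step described above (critical-path priority under limited resources) can be sketched as follows. The graph below is a smaller illustrative DFG, the gx half of the rearranged expression, rather than the full 16-node graph of figure 2.2, and the priority of a node is taken as the length of the longest path from it to a sink; both the node numbering and the helper names are ours.

from collections import defaultdict

def list_schedule(nodes, ops, preds, succs, limits):
    """Resource-constrained list scheduling with single-cycle operations."""
    prio = {}
    for n in reversed(nodes):                      # nodes listed in topological order
        prio[n] = 1 + max((prio[s] for s in succs[n]), default=0)
    finished, cycle, schedule = {}, 0, defaultdict(list)
    while len(finished) < len(nodes):
        cycle += 1
        used = defaultdict(int)
        # a node is ready when all of its predecessors finished in an earlier cycle
        ready = [n for n in nodes if n not in finished
                 and all(finished.get(p, cycle) < cycle for p in preds[n])]
        for n in sorted(ready, key=lambda n: -prio[n]):    # most critical first
            if used[ops[n]] < limits[ops[n]]:
                used[ops[n]] += 1
                finished[n] = cycle
                schedule[cycle].append(n)
    return dict(schedule)

# Illustrative DFG for a*(x1-x2) + b*(x3-x4) + c*(x5-x6):
# nodes 1-3 are subtractions, 4-6 multiplications, 7-8 additions.
ops = {1: 'sub', 2: 'sub', 3: 'sub', 4: 'mul', 5: 'mul', 6: 'mul', 7: 'add', 8: 'add'}
preds = {1: [], 2: [], 3: [], 4: [1], 5: [2], 6: [3], 7: [4, 5], 8: [7, 6]}
succs = {1: [4], 2: [5], 3: [6], 4: [7], 5: [7], 6: [8], 7: [8], 8: []}
limits = {'sub': 2, 'mul': 2, 'add': 1}
print(list_schedule(list(ops), ops, preds, succs, limits))
# -> {1: [1, 2], 2: [3, 4, 5], 3: [6, 7], 4: [8]}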

3 Hardware Design
As specified by the project guidelines, the hardware was structured using the FSMD paradigm. In this way, a Control Unit was designed to control the flow of the data that passes through the Datapath, where all the arithmetic operations related to the image filter convolution take place. Given the strict requirements in terms of the number of processing elements available, the bottleneck for the processor throughput is more likely to be in the Datapath than in the Control Unit, which is why the hardware design effort was mostly focused on devising a satisfactory architecture for the former.

3.1 Datapath
In section 2.2, a scheduling scheme was derived for a single-pixel computation considering the specified area restrictions. However, some additional constraints need to be taken into account before mapping it into a valid hardware implementation, namely regarding the input memory accesses. As stated in the project guidelines, the developed circuit can access three in-line pixels per cycle, thus resulting in the memory reading sequence represented in figure 3.1. Pi(j) corresponds to the pixel read at the i-th memory output j cycles ago.

Figure 3.1: Input memory reading sequence.

If the pixels read in each cycle are stored for further processing, the previous scheduling solution shown in table 2.2 turns out to be easily adaptable. For instance, in cycle 2 the available pixels can be matched with the previous terminology:

S(-1,-1) = P1(2)    S(-1,0) = P2(2)    S(-1,+1) = P3(2)
S(0,-1)  = P1(1)    S(0,0)  = P2(1)    S(0,+1)  = P3(1)
S(+1,-1) = P1       S(+1,0) = P2       S(+1,+1) = P3

Replacing the variable names in the filter formula yields the following expression:

gx = a·[P3(2) - P1(2)] + b·[P3(1) - P1(1)] + c·[P3 - P1]
gy = a·[P1 - P1(2)] + b·[P2 - P2(2)] + c·[P3 - P3(2)]

In spite of the correctness of the previous expression, it requires some rescheduling in order to distribute the operations over the several cycles and thus comply with the imposed area restrictions. Applying a suitable time-shift to each pixel enables the necessary computation to be spread over four clock cycles, noting that there is a first hold cycle (0) where no operation is performed, which still adds to the overall datapath latency. Referencing the P taps to the cycle in which each term is computed, the shifted expression becomes:

gx = a·[P3(1) - P1(1)] + b·[P3 - P1]   (cycle 1)   + c·[P3 - P1]   (cycle 2)
gy = a·[P1 - P1(2)]   (cycle 2)   + b·[P2(1) - P2(3)] + c·[P3(1) - P3(3)]   (cycle 3)

with the final addition for gx taking place in cycle 2 and the one for gy in cycle 4.

The rescheduled algorithm can then be summarized in the following table:

Cycle   Mult (max 2)   Sub (max 2)   Add (max 1)   Computed terms
0       0              0             0             (hold, no operation)
1       2              2             1             gx1 = a·[S(-1,+1) - S(-1,-1)] + b·[S(0,+1) - S(0,-1)]
2       2              2             1             gx = gx1 + c·[S(+1,+1) - S(+1,-1)] ;  gy1 = a·[S(+1,-1) - S(-1,-1)]
3       2              2             1             gy2 = b·[S(+1,0) - S(-1,0)] + c·[S(+1,+1) - S(-1,+1)]
4       0              0             1             gy = gy1 + gy2

Table 3.1: Scheduling list for the rescheduled solution considering both area and memory constraints.
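The table above can be sanity-checked with a cycle-by-cycle behavioral model such as the one below. This is a sketch of the scheduling only, not of the actual datapath or its VHDL; the rows of the 3x3 window enter a small delay line one per cycle, mimicking the taps P1(2)..P3 defined earlier, and the helper names are ours.

def convolve_window(rows, a, b, c):
    """rows = [row -1, row 0, row +1] of one window, each a list [col -1, col 0, col +1]."""
    # tap[d][i] models memory output P(i+1) as read d cycles ago (d = 0 is the newest read)
    tap = [[None] * 3 for _ in range(4)]

    def read(row):                                 # shift the delay line, read a new row
        tap[3], tap[2], tap[1] = tap[2], tap[1], tap[0]
        tap[0] = list(row)

    read(rows[0])                                  # cycle 0: hold, row -1 enters the taps
    read(rows[1])                                  # cycle 1: row 0 arrives
    gx1 = a * (tap[1][2] - tap[1][0]) + b * (tap[0][2] - tap[0][0])
    read(rows[2])                                  # cycle 2: row +1 arrives
    gx = gx1 + c * (tap[0][2] - tap[0][0])
    gy1 = a * (tap[0][0] - tap[2][0])
    read([None] * 3)                               # cycle 3: the next window's data would arrive here
    gy2 = b * (tap[1][1] - tap[3][1]) + c * (tap[1][2] - tap[3][2])
    gy = gy1 + gy2                                 # cycle 4: final addition for gy
    return gx, gy

if __name__ == "__main__":
    w = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
    a, b, c = 1, 2, 1
    gx_ref = a*(w[0][2]-w[0][0]) + b*(w[1][2]-w[1][0]) + c*(w[2][2]-w[2][0])
    gy_ref = a*(w[2][0]-w[0][0]) + b*(w[2][1]-w[0][1]) + c*(w[2][2]-w[0][2])
    assert convolve_window(w, a, b, c) == (gx_ref, gy_ref)
    print("cycle-by-cycle model matches the direct formula")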

The algorithm was then translated into a hardware implementation of the datapath, as depicted in figure 3.2. The devised circuit depends on eleven control signals to ensure

the desired functionality is achieved. A description of these signals can be found in section 3.2, where the inner-workings of the Control Unit are detailed.

Figure 3.2: Datapath block diagram.

Despite the first hold cycle, the obtained circuit is able to reach the maximum throughput of 0.25 ppc if Cycle 4 is overlapped with Cycle 0 of the next iteration. Assuming that is the case, only the first and last occurrences of these cycles will contribute to the datapath latency, since each one of the remaining pixels is computed in four cycles (TConv). Considering one extra cycle for reading the filter constants (a, b, c) from the input memory, and another one to activate the done flag after the last result is written to memory, the total overhead (TOH) for the convolution of one complete line will be four clock cycles. Hence, the general formula for the number of cycles required to compute one line of the filtered image, which can be seen as a single instruction, is given by:

CPI = TOH + TConv · (N - 2)

In this particular case, considering images 10 pixels wide, CPI = 4 + 4 · (10 - 2) = 36, which was positively checked against the simulation.
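For reference, the cycle count can be reproduced with a trivial script (the constant names T_OH and T_CONV are ours):

# Cycles needed to filter one N-pixel-wide image line ("one instruction"), as derived above.
T_OH, T_CONV = 4, 4            # overhead cycles and cycles per interior pixel

def cpi(width):
    return T_OH + T_CONV * (width - 2)

print(cpi(10))                 # 36, matching the value checked against the simulation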

3.2 Control Unit


The limited resources available to implement the convolution operation dictate that the functional units present in the Datapath must be shared and time-multiplexed. Thus, it is the task of the Control Unit to effectively coordinate the access to these resources along the multiple cycles required to perform a convolution centered on a given pixel. As made clear in the previous section, the devised Datapath relies on the timely activation and deactivation of 11 control signals, each of them controlling either multiplexers or register write enables. To properly compute the values of gx and gy, the sequence of signals shown in figure 3.3 is required. The signal names presented there are based on those of figure 3.2, although the actual signal names used within the hardware description of the unit are short-hand versions of these.

Figure 3.3: Sequence of control signals needed to compute gx and gy .

For simplicity, the 11 control signals were combined into a 13-bit control word, where the two extra bits are due to the 2-bit width of the signals rsub2 and gy_sel. This control word is then routed to the Datapath block, where it is separated into the necessary control signals. Although the Datapath also requires an abc_we signal to trigger the storage of the three constants, this signal was not included in the control word, as it is only used once after the dedicated processor is reset, given that the a, b and c coefficients do not change during the processing of an image. In addition, the Control Unit is also responsible for managing the input and output memories, generating their read and write addresses, respectively, as well as the write-enable signal for the latter. Finally, the done signal is also generated within this unit, after a full line of the image has been processed.

The signal sequence of figure 3.3 was implemented directly as an FSM, leaving all possible optimizations to the synthesis process, so that cross-checking between the table and the respective hardware implementation could be done trivially, simply by inspecting the VHDL source. This proved to be a good decision, as any effort put towards hand-optimizing the generation of these signals would be made redundant by the synthesis tool. The complete finite-state machine is composed of 9 states and follows a Moore approach, as the output of each state is solely a function of the current state of the machine and not a combination of the state with the external inputs, as is the case for Mealy machines. The diagram representing the FSM is depicted in figure 3.4; it should be noted that all the output signals default to logic value 0 in each state, unless otherwise specified. While the rst signal is high, the state machine is held in the HOLD state, which generates the address corresponding to the position where the three constants are kept in the input memory; in addition, the address of the output

Figure 3.4: Finite-state machine that generates the control signals for the datapath.

memory is set to zero. As soon as the reset is released, the FSM iterates through the following states:

GET ABC: In this state, the abc_we signal is asserted, in order to store the values of the three coefficients. The address of the input memory is also set to one, to initiate the reading of the image, three pixels at a time, which starts taking place in the next state.

STATE 0 to STATE 4: In this sequence of states, the input memory address is incremented by 10 three consecutive times, after which it is decreased by 29, which corresponds to shifting the convolution window one position to the right. During these cycles, the outputs are those presented in the control table shown at the beginning of this section. Once STATE 1 is reached, the write address is incremented by one unit and the write operation is enabled.

STATE 1 Init: This state, similarly to STATE 0, is part of the transient phase of the processing and, therefore, only occurs once after every reset of the dedicated processor, due to the necessity of filling the array of input registers of the datapath before the processing can start occurring at a constant pace. After STATE 1 is reached for the first time, the sequence of states becomes STATE 1, STATE 2, STATE 3, STATE 4, thus generating a value for gx and gy every four clock cycles.

STATE Done: This state is reached after a full line of pixels has been written to the output memory, i.e., the output memory address has reached position 7. End of operation is signalled by asserting the Done signal, and the next state is set to STATE Hold, as the task of the processor is to operate on a single image line.
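The state sequencing described above can be summarized by a next-state function along the following lines. This is a simplified behavioral sketch in Python, not the actual VHDL: the state names follow the text, while the exact position of STATE 1 Init within the transient sequence and the end-of-line condition are paraphrased assumptions rather than excerpts from the source.

# Simplified next-state sketch of the 9-state Moore control FSM (outputs omitted).
# 'line_done' stands for "the output memory address has reached position 7".

def next_state(state, rst, line_done):
    if rst:
        return "HOLD"                  # held while the reset is asserted
    transitions = {
        "HOLD": "GET_ABC",             # fetch the filter constants a, b, c
        "GET_ABC": "STATE_0",          # transient phase: start filling the input registers
        "STATE_0": "STATE_1_INIT",     # transient phase: occurs once after each reset
        "STATE_1_INIT": "STATE_1",
        "STATE_1": "STATE_2",          # steady-state loop: one (gx, gy) pair per 4 cycles
        "STATE_2": "STATE_3",
        "STATE_3": "STATE_4",
        "STATE_4": "DONE" if line_done else "STATE_1",
        "DONE": "HOLD",                # the processor operates on a single image line
    }
    return transitions[state]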

4 Resource utilization and Performance Results


Complete documentation of a digital circuit must always include the number of occupied resources, as well as the maximum clock frequency at which the circuit functions correctly. The latter, together with the Cycles per Instruction (CPI), provides a precise measurement of the overall performance. Table 4.1 displays the absolute and relative occupation of the main types of cells featured in the targeted FPGA device. The reported Post-Place and Route results were obtained separately for the Datapath and the Control Unit, in addition to the values for the final Image Processor.

Circuit                    Slice Flip-Flops   Slices      LUTs       Multipliers (DSP)
Datapath                   125 (6%)           120 (12%)   146 (7%)   2 (50%)
Control Unit               29 (1%)            28 (1%)     32 (2%)    0
Image Processor (D+CU)     154 (8%)           153 (15%)   186 (9%)   2 (50%)

Table 4.1: Resource utilization of the Image Processor on the targeted XC3S100E device.

The synthesis report confirmed compliance with the constraints, as 3 Adders/Subtractors and 2 Multipliers were inferred by the tool for the Datapath. As expected, the core components of the Control Unit were also detected, such as the Finite-State Machine and the pair of Adders used to compute the addresses for the input and output memories. Regarding performance, a maximum operating frequency of 65.4 MHz was achieved, which, combined with the CPI derived in section 3.1, leads to an average throughput of (Npixels / CPI) · f = (8 / 36) · 65.4 MHz, about 14.5 Mpixels/s or 1.81 M instructions/s. As expected, the critical path is the one that traverses the serial connection of the subtractor, multiplier and adder. Given the provided specifications, the developed circuit stands as an efficient solution, since it provides a relatively high throughput using a fair amount of resources. This means there is still margin for improvement in terms of performance, if one revisits the classical area/throughput trade-off.
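The quoted figures follow directly from the CPI and the clock frequency; a quick back-of-the-envelope check (ours):

# Throughput check for one 10-pixel-wide line: CPI = 36 cycles, 8 filtered pixels per line.
f_max = 65.4e6                         # Hz, post place-and-route estimate
cpi = 36
pixels_per_line = 8

print(f_max / cpi)                     # about 1.8e6 lines (instructions) per second
print(pixels_per_line * f_max / cpi)   # about 14.5e6 filtered pixels per second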

5 Conclusion and Future Work


The dedicated processor described within this report managed to achieve all the required specications in terms of hardware resources utilization while obtaining a fair throughput,


producing one pixel every four clock cycles, neglecting the initial overheads, which quickly become irrelevant when computing the convolution operation over a full image. However, there is still room for improvement, both in the pixel production rate and in the maximum operating frequency. The latter can be achieved by employing pipelining techniques. In particular, given that the critical path was determined to go through the branch where the subtractor, multiplier and adder lie in series, introducing a register at the output of both multipliers (allowing the synthesis tool to place them inside the multipliers themselves) would effectively reduce the length of this path, therefore allowing an increase in the operating frequency. The downside to this approach would be the increase in the latency of the circuit, which would only have an impact on the initial overhead. Again, considering the processing of a full image, this increased time needed to fill the pipeline would be negligible.

The rate at which new pixels are made available to the datapath is determined by the input memories which, in this case, provide 3 pixels per cycle. This immediately sets a maximum pixel production rate of one pixel every three clock cycles, the time it takes to read in the 9 pixels needed to perform a local convolution. By inspecting table 3.1 once again, it is clear that the throughput of the dedicated processor is being limited by the existence of a single adder. By relaxing this hardware restriction and adding a second adder, it is possible to perform two additions in cycle 3, instead of waiting one extra cycle to perform the final addition. The updated scheduling is represented in table 5.1.

Cycle   Mult (max 2)   Sub (max 2)   Add (max 2)   Computed terms
0       0              0             0             (hold, no operation)
1       2              2             1             gx1 = a·[S(-1,+1) - S(-1,-1)] + b·[S(0,+1) - S(0,-1)]
2       2              2             1             gx = gx1 + c·[S(+1,+1) - S(+1,-1)] ;  gy1 = a·[S(+1,-1) - S(-1,-1)]
3       2              2             2             gy = gy1 + b·[S(+1,0) - S(-1,0)] + c·[S(+1,+1) - S(-1,+1)]

Table 5.1: Result of a list scheduling limited to 2 multipliers, 2 subtractors and 2 adders.

If, once again, cycle 3 is overlapped with cycle 0 after the first local convolution, the result is a throughput of 1 pixel every 3 clock cycles, which corresponds to the maximum achievable performance for an input of 3 pixels per cycle. Finally, the revised datapath reflecting this enhancement is presented in figure 5.1.


Figure 5.1: Revised Datapath block diagram to include pipelining and an additional adder.
