TSINGHUA SCIENCE AND TECHNOLOGY ISSN 1007-0214 08/16 pp278-284 Volume 16, Number 3, June 2011
Automatic Identification of Customized Instruction Based on Multiple Attribute Decision-Making for Multi-Issue Architectures*
TAN Honghe, SUN Yihe**
Tsinghua National Laboratory of Information and Technology, Institute of Microelectronics, Tsinghua University, Beijing 100084, China

Abstract: This paper illustrates the importance of the function unit configuration and of changes in an application's critical path when using instruction set extension (ISE) with multi-issue architectures. It also presents an automatic identification approach for customized instructions, without constraints on the number of inputs and outputs, for multi-issue architectures. The approach identifies customized instructions using multiple attribute decision-making based on the analysis of several attributes for each candidate node. Tests indicate that the approach achieves higher speedup ratios than previous approaches with lower area cost. In addition, the approach provides designers with multiple candidate designs.

Key words: instruction set extension (ISE); multi-issue architecture; customized instruction (CI)
Introduction
Instruction set extension (ISE) and instruction level parallelism (ILP) are two important technologies for improving processor performance. Some methods automatically customize instructions for single-issue processors, while multi-issue processors have been extended with specific instructions by hand. However, automatically extending customized instructions (CI) for multi-issue architectures has not been well studied.

One ISE approach is to completely explore the search space to find the best solution from all possible cuts based on given constraints[1]. Galuzzi et al.[2] identified application-specific function units using the MAXMISO cut. Atasu and Oskar[3] used an integer linear programming approach to identify the CI, while Biswas et al.[4] proposed a heuristic method based on the Kernighan-Lin algorithm. Note, however, that the target processors for these approaches were single-issue RISC processors and that they identified the CI under a constraint on the input/output (IO) number.

As processors develop, more are using multi-issue technology, with very long instruction word (VLIW) processors already used in many fields[5]. ISE was introduced into VLIW processor design to obtain higher performance. Some researchers have proposed methods for the designer to customize VLIW processors by manually adding CI[6-8]. Jain et al.[9] extended a VLIW architecture with CIs that were automatically identified using the exact algorithm proposed by Pozzi et al.[1] However, whether the CIs in these studies were designed by hand or identified automatically, the issue architecture was not considered at all. The result could destroy the operation parallelism and limit the performance improvement for the multi-issue processor. Wu I-Wei et al.[10] presented an ISE exploration algorithm for multi-issue architectures, which derives from ant colony optimization (ACO) and list scheduling. However, this approach has several parameters, which make the ISE result uncertain.

If the target processor is a single-issue processor, the application latency is determined mainly by the total number of operations and the execution cycle of each operation. However, the application latency becomes more complicated when the target processor has a multi-issue architecture. Several other aspects affect the final performance, such as the number, type, and interconnections of the function units, the critical path of the application, and even the scheduling algorithm. All these aspects influence the benefit of ISE on a multi-issue target processor. This paper illustrates the influence of the multi-issue architecture on the ISE and presents a CI identification algorithm for multi-issue architectures.

Received: 2010-05-18; revised: 2011-04-05
* Supported by the Basic Research Fund of Tsinghua University
** To whom correspondence should be addressed. E-mail: sunyh@tsinghua.edu.cn; Tel: 86-10-62788905
1 ISE for Multi-Issue Architecture

The function unit configuration in the multi-issue architecture and changes in the critical path significantly influence the benefits of implementing ISE. The benefits are also affected by the use of connected or unconnected cuts.

1.1 Function unit configuration

The function unit configuration, including the number and type of units in the architecture, is an essential part of the processor design and significantly influences the ISE. In a data flow graph such as Fig. 1a, each node stands for a basic operation that takes one execution cycle in software; its normalized execution time in hardware is labeled with a decimal. Nodes 1 and 2 are memory access operations, whose functions cannot be implemented within a CI. The other nodes denote arithmetic operations. As demonstrated in Fig. 1b, after the node set {3, 4, 5, 6, 7} is customized as a CI, the graph scheduling result on a single-issue processor is 5 cycles. Scheduling the same graph on a multi-issue processor that can only execute one load operation per cycle gives the same result. However, if the multi-issue processor can execute more than one load operation per cycle, the scheduling result becomes 4 cycles, as exhibited in Fig. 1c. With the different customized operation in Fig. 1d, nodes 2 and 3 can be executed in parallel; both the execution cycles and the hardware cost of the CI are less than in Fig. 1b. Unlike in Fig. 1c, only one function unit is needed for the load operations, so the hardware cost is reduced.

Fig. 1 Influence of function unit configuration. (a) Original data flow graph; (b) Scheduling result for single-issue architecture and multi-issue architecture that can only execute one load operation per cycle; (c) Scheduling result for a multi-issue architecture that can execute two load operations per cycle; (d) Scheduling result for a multi-issue architecture using a different customized operation.

1.2 Critical path

After a group of operations is customized as an application-specific customized instruction, the topology of the graph that represents the application changes. The path with the longest latency, i.e., the critical path, may then also change, which may negatively affect the ISE result. For instance, the graph in Fig. 2a has five nodes, which take 5 cycles for a single-issue processor to execute. If the numbers of inputs and outputs are both limited to less than 2, the best cut is nodes {1, 3, 5}. For a single-issue processor, the scheduling result after customization shown in Fig. 2b needs 4 cycles, 1 cycle less than with no customization. If the target processor can execute more than one arithmetic operation per cycle, the result after the same cut {1, 3, 5} is customized is also 4 cycles, the same as in Fig. 2b. However, the scheduling result without CI shown in Fig. 2a needs only 3 cycles, so the CI does not improve the performance. The reason is that the critical path latency increases from 3 cycles to 4 cycles after the cut {1, 3, 5} is customized as a CI.

Fig. 2 Influence of critical path. (a) Scheduling result for a multi-issue architecture without CI; (b) Scheduling result with CI.

1.3 Connected and unconnected cuts
In most ISE algorithms for single-issue architectures, an unconnected cut is usually preferred over a connected cut when mapping to a CI, since unconnected cuts may accommodate more nodes. However, for multi-issue architectures, unconnected cuts may change not only the instruction parallelism but also the critical path, so they have no obvious advantage.
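The critical-path effect described in Section 1.2 can be reproduced on a small example. The sketch below is only an illustration under stated assumptions: the five-node graph and its edges are hypothetical (they are not the graph of Fig. 2), the 2-cycle hardware latency of the CI is assumed, and the multi-issue machine is assumed wide enough that its latency equals the critical path latency.

```python
from functools import lru_cache

def critical_path_latency(latency, edges):
    """Longest latency-weighted path in a DAG; latency maps node -> cycles."""
    succs = {v: [] for v in latency}
    for u, v in edges:
        succs[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(v):
        return latency[v] + max((longest_from(s) for s in succs[v]), default=0)

    return max(longest_from(v) for v in latency)

def collapse(latency, edges, cut, name, cut_latency):
    """Merge the nodes of `cut` into one node `name` taking `cut_latency` cycles."""
    new_latency = {v: c for v, c in latency.items() if v not in cut}
    new_latency[name] = cut_latency

    def rename(v):
        return name if v in cut else v

    new_edges = {(rename(u), rename(v)) for u, v in edges}
    new_edges.discard((name, name))  # edges internal to the cut disappear
    return new_latency, list(new_edges)

# Hypothetical graph: 2 -> 1 -> 5 and 3 -> 4, one software cycle per node.
latency = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}
edges = [(2, 1), (1, 5), (3, 4)]
# Assumption: the unconnected cut {1, 3, 5} needs 2 cycles as a CI.
ci_latency, ci_edges = collapse(latency, edges, {1, 3, 5}, "CI", 2)

single_before = sum(latency.values())      # single-issue: operations serialize
single_after = sum(ci_latency.values())
multi_before = critical_path_latency(latency, edges)
multi_after = critical_path_latency(ci_latency, ci_edges)
print(single_before, single_after, multi_before, multi_after)
```

In this hypothetical case the single-issue latency drops from 5 to 4 cycles, while the critical path grows from 3 to 4 cycles because the collapsed node chains the previously independent paths 2 -> 1 -> 5 and 3 -> 4 into 2 -> CI -> 4.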
2 Current Approach
This section describes the details of the current approach: the algorithm flow, the four attributes used to estimate the benefits of a cut, the method used to compare all candidate cuts, and the time complexity of the algorithm.

2.1 Algorithm description

Each basic block of an application is represented by a directed acyclic graph (DAG) $G(V, E)$, where nodes $V$ represent basic operations and edges $E$ represent data dependencies between two operations. Symbol $V_f$ denotes the set of forbidden nodes, such as memory access nodes, which cannot be included in a CI. In addition, the specific function unit that is used to implement a CI is called a customized function unit (CFU). The ISE problem can be formally stated as follows: given a basic block $G$, find a cut $C$ to be customized as a CI to improve the performance of executing $G$ on a certain processor under the following constraints: (1) $C$ contains no node in $V_f$; (2) $C$ is convex. A cut $C(V_C, E_C)$ is convex if there is no path between two nodes $v_i, v_j \in V_C$ that involves a node $v_k \in V \setminus V_C$. When customized as a CI, the cut $C(V_C, E_C)$ must be convex.

The algorithm given in Fig. 3 first divides the DAG into $N$ sub-graphs (Step 1). Each sub-graph has data dependencies only with nodes $v \in V_f$. Then the application execution cycle is recorded in variable Min_Cycles (Step 2). The following three loops are the main part of the algorithm. Each iteration of loop 1 generates a new CI. Loop 2 identifies a candidate cut in each unmarked sub-graph. The cut that brings the best performance is then selected (Step 13) and collapsed to a new node (Step 14). The necessary parameters are then updated (Steps 15 and 16), and the iterative process continues until any constraint is broken or there is no more source node. Within loop 2, a source node of an unmarked sub-graph $SG_n$ is selected to initialize the last best cut $C_n$ (Step 3) and the current best cut best_C_n (Step 4). When there is no convex candidate cut, the program exits loop 3 (Step 9), and the final best cut is recorded in $C_n$. Then the algorithm continues to identify the next best cut in another sub-graph.

    Input: DAG; function unit configuration
    Step 1: Divide DAG into sub-graphs SG_1 - SG_N
    Step 2: Min_Cycles = execution cycles of DAG
    Loop 1 until constraints are broken or no source node do
        Loop 2 for every unmarked SG_n do
            Step 3: Select a source node in SG_n as the initial cut C_n
            Step 4: MC_n = Min_Cycles; best_C_n = C_n
            Loop 3 until jumped out do
                Step 5: ASAP and ALAP schedule
                Step 6: forall v_i in Pred_Cn ∪ Succ_Cn do
                    Step 7: C_n^i = v_i + best_C_n
                    if C_n^i is convex do
                        Step 8: Calculate attribute values of C_n^i
                    endif
                endfor
                if there is no convex C_n^i do
                    Step 9: jump out of loop 3
                endif
                Step 10: best_C_n = the C_n^i with best attribute
                Step 11: Cycles = execution cycles with best_C_n as a CI
                if Cycles < MC_n do
                    Step 12: MC_n = Cycles; C_n = best_C_n
                endif
            endloop 3
        endloop 2
        Step 13: selected_C = C_n with min(MC_n) in all SG
        Step 14: collapse selected_C to a new node
        Step 15: mark SG that provided selected_C
        Step 16: update DAG and Min_Cycles
    endloop 1

Fig. 3 Algorithm description

Loop 3 is composed of several steps in which each node in the sets $Pred_{C_n}$ and $Succ_{C_n}$ is considered as a candidate node. The predecessor set $Pred_C$ and the successor set $Succ_C$ of a cut $C$ are defined as

$$Pred_C = \{ v_j \mid (v_j, v_i) \in E,\ v_i \in V_C,\ v_j \in V \setminus V_f \setminus V_C \}$$

and

$$Succ_C = \{ v_j \mid (v_i, v_j) \in E,\ v_i \in V_C,\ v_j \in V \setminus V_f \setminus V_C \},$$

respectively. In Step 7, each candidate node is chosen and combined with $C_n$ to form a candidate cut $C_n^i$. Then the attribute values of each convex $C_n^i$ are calculated (Step 8). Step 10 compares all the convex $C_n^i$ to obtain the best cut in best_C_n. Only if better performance is achieved by customizing best_C_n as a CI, i.e., Cycles < $MC_n$ (which is initialized to Min_Cycles), will $C_n$ be replaced by best_C_n (Steps 11 and 12). Otherwise, the growth will continue. The execution cycle is estimated using a list schedule.
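The convexity test and the predecessor/successor sets used in Steps 6 and 7 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the edge-list representation, and the example graph are assumptions. Convexity is checked with the equivalent condition that no node outside the cut is both reachable from the cut and able to reach the cut.

```python
def reachable(start, adjacency):
    """All nodes reachable from any node in `start` by following `adjacency`."""
    seen, stack = set(), list(start)
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def analyse_cut(edges, cut, forbidden):
    """Return (Pred_C, Succ_C, is_convex) for `cut` in the DAG given by `edges`."""
    succs, preds = {}, {}
    for u, v in edges:
        succs.setdefault(u, set()).add(v)
        preds.setdefault(v, set()).add(u)

    def candidate(v):
        # Candidate nodes lie outside the cut and outside the forbidden set V_f.
        return v not in cut and v not in forbidden

    pred_c = {u for u, v in edges if v in cut and candidate(u)}
    succ_c = {v for u, v in edges if u in cut and candidate(v)}
    # A path between two cut nodes through an outside node exists exactly when
    # some outside node is both downstream and upstream of the cut.
    between = (reachable(cut, succs) & reachable(cut, preds)) - set(cut)
    return pred_c, succ_c, not between

# Hypothetical DAG: ld -> a -> b -> c -> d plus the shortcut a -> c.
edges = [("ld", "a"), ("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
forbidden = {"ld"}  # memory access node, may not join a CI

print(analyse_cut(edges, {"a", "c"}, forbidden))       # b lies between a and c
print(analyse_cut(edges, {"a", "b", "c"}, forbidden))  # convex once b is included
```

In the first call the cut {a, c} is rejected because the path a -> b -> c leaves and re-enters the cut; adding the candidate node b from the predecessor/successor sets makes the cut convex.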
2.2 Cut attributes

2.2.1 Execution cycle

For node $v_i$, $t_{SW}(v_i)$ represents the execution cycle of $v_i$ in software, while $t_{HW}(v_i)$ denotes the execution time of $v_i$ in hardware, normalized to a clock cycle. Symbol $L_{SW}(C)$ is the latency of executing cut $C$ in software:

$$L_{SW}(C) = \sum_{v_i \in V_C} t_{SW}(v_i) \tag{1}$$

Symbol $L_{HW}(C)$ is the latency of $C$ in hardware, which is calculated by Eq. (2), where $CP_C$ represents the set of nodes in the critical path of cut $C$:

$$L_{HW}(C) = \sum_{v_i \in CP_C} t_{HW}(v_i) \tag{2}$$

The attribute $M_{lat}(C)$, defined as the difference between $L_{SW}(C)$ and $L_{HW}(C)$, represents the cycles saved by cut $C$:

$$M_{lat}(C) = L_{SW}(C) - L_{HW}(C) \tag{3}$$

2.2.2 Input and output

The number of read/write ports of the register file and the length of the encoded instructions limit the number of IO for normal function units. Consequently, CIs are either generated to satisfy the constraints on the number of IO[1] or spend more latency[3]. To relax the IO constraints without extra latency, the connections between the CFU and the register file need to change as shown in Fig. 4. The CFU input/output data is directly connected to specific registers, so the CI can take data from and write results to these registers. Moreover, in this case, the CI can be encoded without information about the registers used. Although the IO constraint is relaxed, fewer IO are better for reducing the design complexity. The attribute $N_{io}(C)$ is defined as the total number of inputs and outputs of cut $C$.

Fig. 4 Special connections for customized function unit

2.2.3 Critical path

The critical path of a graph $G$ may change after a cut $C$ is customized as a CI. Therefore, the attribute $CPL_C(G)$ is used to represent the critical path latency of $G$ after the customization of $C$.

2.2.4 Function units

There exists a many-to-many mapping between the node set $V$ and the function unit set $F$: $V \to F$. If a node $v_i \in V$ is mapped to a function unit $f \in F$, i.e., $v_i \mapsto f$, operation $v_i$ can be executed on function unit $f$. The space mobility $ms_i$ and the time mobility $mt_i$ of a node $v_i$ are

$$ms_i = \sum_{f} \delta(v_i, f), \qquad \delta(v_i, f) = \begin{cases} 1, & v_i \mapsto f; \\ 0, & \text{others} \end{cases} \tag{4}$$

$$mt_i = ALAP(v_i) - ASAP(v_i) + 1 \tag{5}$$

In Eq. (5), $ALAP(v_i)$ is the position of $v_i$ in as-late-as-possible scheduling, and $ASAP(v_i)$ is its position in as-soon-as-possible scheduling. The probability that a node $v_i$ occurs at space $f$ and the probability that $v_i$ occurs at time $l$ can then be obtained as

$$ps_i^f = \frac{1}{ms_i}\,\delta(v_i, f) \tag{6}$$

$$pt_i^l = \begin{cases} \dfrac{1}{mt_i}, & ASAP(v_i) \le l \le ALAP(v_i); \\ 0, & \text{others} \end{cases} \tag{7}$$

The attribute $M_{fu}(C^i)$, based on the time probability and the space probability, is defined as

$$M_{fu}(C^i) = \sum_{l} \sum_{f} \sum_{v_j} pt_j^l\, ps_j^f\, pt_i^l\, ps_i^f - \sum_{l} \sum_{f} pt_i^l\, ps_i^f = \sum_{l} \sum_{f} \sum_{v_j} pt_j^l\, ps_j^f\, pt_i^l\, ps_i^f - 1, \quad v_j \in V \setminus V_C \tag{8}$$

which is used to estimate the influence of the number and type of function units when $v_i$ is combined with $C$. Equation (8) can be separated into two parts. The first part, before the minus sign, is the repulsion effect of the other nodes on node $v_i$. If $v_j$ and $v_i$ occur at the same time $l$ and in the same space $f$, there is a resource conflict between them. The conflict probability is $ps_j^f\, pt_j^l\, ps_i^f\, pt_i^l$, and the sum of the conflict probabilities between all the other nodes and node $v_i$ is $\sum_{v_j} pt_j^l\, ps_j^f\, pt_i^l\, ps_i^f$.
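The time and space probabilities of Eqs. (5)-(7) and the conflict (repulsion) sum just described can be evaluated directly. The sketch below makes several assumptions: node names, ASAP/ALAP positions, and function unit names are hypothetical, every listed node is taken to lie outside the cut, each node can run on at least one unit, and the conflict sum is accumulated over all time steps and function units as in the first part of Eq. (8). Exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def time_prob(asap, alap, l):
    """Eq. (7): uniform probability over the mobility window [asap, alap]."""
    mobility = alap - asap + 1  # Eq. (5), time mobility
    return Fraction(1, mobility) if asap <= l <= alap else Fraction(0)

def space_prob(units, f):
    """Eq. (6): uniform probability over the units able to execute the node."""
    return Fraction(1, len(units)) if f in units else Fraction(0)

def conflict_sum(nodes, i):
    """Sum of conflict probabilities between node i and every other node."""
    asap_i, alap_i, units_i = nodes[i]
    total = Fraction(0)
    for j, (asap_j, alap_j, units_j) in nodes.items():
        if j == i:
            continue
        # Only time steps and units where node i can occur contribute.
        for l in range(asap_i, alap_i + 1):
            for f in units_i:
                total += (space_prob(units_j, f) * time_prob(asap_j, alap_j, l)
                          * space_prob(units_i, f) * time_prob(asap_i, alap_i, l))
    return total

# Hypothetical nodes: name -> (ASAP position, ALAP position, usable units).
nodes = {
    "add3": (1, 2, {"AL0", "AL1"}),
    "add4": (1, 1, {"AL0"}),
    "ld1": (1, 1, {"LS0"}),
}
repulsion = conflict_sum(nodes, "add3")
print(repulsion)
```

Here only add4 can collide with add3 (on unit AL0 at time 1), so the sum is (1/2)(1/2) = 1/4; the load shares no unit with add3 and contributes nothing.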
The second part, after the minus sign, is the attraction of the function units towards $v_i$. The probability that $v_i$ is executed by function unit $f$ at time $l$ is $ps_i^f\, pt_i^l$.

2.3 Comparison of candidates

The comprehensive attribute value of each candidate $C^i$ based on these four attributes is

$$M(C^i) = \omega_1 M_{lat}(C^i) + \omega_2 N_{io}(C^i) + \omega_3 CPL_C(G) + \omega_4 M_{fu}(C^i),$$

where $\omega_1$, $\omega_2$, $\omega_3$, and $\omega_4$ are weighting factors. $M_{lat}(C)$ and $M_{fu}(C^i)$ are profit attributes while $N_{io}(C)$ and $CPL_C(G)$ are cost attributes. The four weighting factors are not constants but are determined using the multiple attribute decision-making method based on maximizing deviations[11]. Thus, they change with each comparison.

2.4 Time complexity

For a sub-graph $SG(V_{SG}, E_{SG})$ in $G(V, E)$ for the algorithm shown in Fig. 3, $|V|$ and $|V_{SG}|$ are the numbers of nodes in $G$ and $SG$. Step 6 takes $O(|V|)$ time to judge whether a cut is convex and to calculate its attributes. There are $|V_{SG}|$ candidates for comparison in the worst case. Therefore, the time complexity of Step 6 is $O(|V_{SG}||V|)$. Since a list schedule is employed, the worst-case time complexity of Step 11 is $O(|V|^2 \log|V|)$. The complexity of the other steps in loop 3 is $O(|V|)$ or $O(|V_{SG}|)$, which is negligible. Hence, the complexity of one iteration of loop 3 is $O(|V_{SG}||V| + |V|^2 \log|V|)$. In the worst case, loop 3 is iterated $|V_{SG}|$ times in loop 2. The number of iterations of loop 2 in loop 1 is equal to the number of unmarked sub-graphs, which makes the time complexity of one iteration of loop 1 equal to

$$O\Big(\sum_{\text{all unmarked } SG} |V_{SG}|^2 |V| + \sum_{\text{all unmarked } SG} |V_{SG}||V|^2 \log|V|\Big).$$

Since $\sum_{\text{all unmarked } SG} |V_{SG}| \le |V|$ and $\sum_{\text{all unmarked } SG} |V_{SG}|^2 \le |V|^2$, this complexity is not higher than $O(|V|^3 + |V|^3 \log|V|)$. Furthermore, the number of iterations of loop 1 is smaller than the number of unmarked sub-graphs, which is at most $|V|$. The total time complexity of the algorithm is then $O(|V|^4 + |V|^4 \log|V|)$.

3 Test Results

Tests were conducted using some benchmarks in MediaBench[12] and UTDSP[13], compiled using the OPEN64 compiler, to compare the algorithm with previous ones. The list schedule algorithm was used to estimate the performance before and after the ISE. The target architecture has three types of function units: memory access units (LS), arithmetic and logic units (AL), and multiplier units (MP). All function units use one global register file. The execution cycle of each basic operation was assumed to be one clock cycle, i.e., the value of $t_{SW}(v_i)$ is 1 for all nodes. The basic operations were modeled in Verilog and synthesized using 0.18 μm CMOS technology to obtain the hardware execution time, which was then normalized to the execution time of a 32-bit multiplier. $t_{HW}(v_i)$ was calculated from the normalized result. The area cost of each type of basic operation was estimated at the same time.

Figure 5 shows the results with only one CI in the multi-issue architecture. The speedup is defined as $EC_{SingleIssue}/EC_{MultiIssue\&CI}$, where $EC_{SingleIssue}$ is the number of execution cycles on a single-issue architecture without CI and $EC_{MultiIssue\&CI}$ is the number of execution cycles on a multi-issue architecture with one CI. The search tree algorithm is abbreviated as ST, and the numbers in square brackets are the numbers of read/write ports of the register file. For instance, [6,3] means that the register file has six read ports and three write ports. The symbols in curly braces are the numbers of each type of function unit in the architecture, e.g., {2:2:2} means that there are two AL units, two MP units, and two LS units, so the issue width is six. The results show that the current algorithm exhibits better performance than the search tree algorithm in most cases. For these three multi-issue architecture configurations, the average speedup ratios with the current algorithm are 2.14, 3.90, and 5.68, with best speedup ratios of 2.47, 4.75, and 9.00. For the six benchmarks, the numbers of inputs/outputs of the CI are [3,1], [7,2], [5,3], [2,1], [4,2], and [7,2], which are less than [8,4].

Fig. 5 Speedup with one CI. Speedup = $EC_{SingleIssue}/EC_{MultiIssue\&CI}$

The current approach can also provide multiple customized results for designers to select from according to cost constraints and performance demands. Taking the G721 decoder application as an example, each dot in Fig. 6 represents a different solution. The speedup is defined as $EC_{MultiIssue}/EC_{MultiIssue\&CIs}$, where $EC_{MultiIssue}$ is the number of execution cycles on the multi-issue architecture without CI, and $EC_{MultiIssue\&CIs}$ is the number of execution cycles on the multi-issue architecture with multiple CIs.

Fig. 6 Speedup with CIs. Speedup = $EC_{MultiIssue}/EC_{MultiIssue\&CIs}$

Figure 7 shows 19 different configurations for the multi-issue architecture, including the average best speedup and the average area efficiency for all the benchmarks. The best speedup is the highest value of $EC_{MultiIssue}/EC_{MultiIssue\&CIs}$ that can be obtained with multiple CIs. The area efficiency is defined as the average area cost per saved execution cycle. ACO represents the approach presented by Wu et al.[10], and [8,4] denotes the numbers of read/write ports of the register file. As shown in the figure, the algorithm gives results with better performance speedup and better area efficiency.

Fig. 7 Average speedup and area efficiency for different configurations. Speedup = $EC_{MultiIssue}/EC_{MultiIssue\&CIs}$

Figure 8 shows the actual execution time of the proposed algorithm for applications of different sizes. The x-axis represents the number of nodes in the applications. Each + symbol is the worst execution time among the 19 configurations for each application. The solid line represents the time complexity derived in Section 2.4. The results show that the actual execution times are much lower than predicted.

Fig. 8 Execution times of the proposed algorithm. The + symbols represent actual execution times while the solid line represents the analytical result.
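The two metrics reported in this section, speedup and area efficiency, reduce to simple ratios of cycle counts and area. The helper functions below are a sketch of those definitions only; the function names, the cycle counts, and the area value are hypothetical and are not taken from the measurements above.

```python
def speedup(cycles_baseline, cycles_with_ci):
    """Ratio of execution cycles without CI to execution cycles with CI."""
    return cycles_baseline / cycles_with_ci

def area_efficiency(area_cost, cycles_baseline, cycles_with_ci):
    """Area cost per saved execution cycle (lower means cheaper savings)."""
    saved = cycles_baseline - cycles_with_ci
    if saved <= 0:
        raise ValueError("the customized instructions save no execution cycles")
    return area_cost / saved

# Hypothetical example: 900 cycles without CI, 100 cycles with CIs,
# and 1600 area units spent on the customized function units.
print(speedup(900, 100))
print(area_efficiency(1600.0, 900, 100))
```

With this definition a smaller area efficiency value indicates that each saved cycle costs less hardware.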
4 Conclusions

This paper shows that the number and type of function units and the application's critical path are very important for ISE on multi-issue architectures, and the examples show that these aspects should be taken into account when the target processor is multi-issue. Tests indicate that for some multi-issue architecture configurations, the performance becomes worse if the identification algorithm ignores the issue architecture. A specific connection between the customized function unit and the register file is used to relax the constraints on the number of inputs and outputs. The multiple attribute decision-making method is employed to select the best candidate node for customized instruction growth. Tests show that the algorithm has better performance than previous algorithms.

References

[1] Pozzi L, Atasu K, Ienne P. Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2006, 25(7): 1209-1229.
[2] Galuzzi C, Bertels K, Vassiliadis S. A linear complexity algorithm for the generation of multiple input single output instructions of variable size. In: Proc. of 7th International Workshop, SAMOS 2007. Berlin/Heidelberg: Springer, 2007: 283-293.
[3] Atasu K, Oskar M. CHIPS: Custom hardware instruction processor synthesis. IEEE Trans. on CAD, 2008, 27(3): 528-541.
[4] Biswas P, Banerjee S, Dutt N D, et al. ISEGEN: An iterative improvement-based ISE generation technique for fast customization of processors. IEEE Transactions on VLSI, 2006, 14(7): 754-762.
[5] Shen Zheng, He Hu, Zhang Yanjun, et al. CERCIS: A video codec system-on-chip design and implementation. Journal of Tsinghua University (Science and Technology), 2009, 49(8): 1219-1223. (in Chinese)
[6] Zhang Yanjun, He Hu, Shen Zheng, et al. ASIP approach for multimedia applications based on a scalable VLIW DSP architecture. Tsinghua Science and Technology, 2009, 14(1): 126-132.
[7] Middha B, Raj V, Gangwar A, et al. A Trimaran based framework for exploring the design space of VLIW ASIPs with coarse grain functional units. In: Proc. of the 15th International Symposium on System Synthesis. New York, NY: ACM, 2002: 2-7.
[8] Saghir M A R, El-Majzoub M, Akl P. Datapath and ISA customization for soft VLIW processors. In: Proc. of IEEE International Conference on Reconfigurable Computing and FPGAs (ReConFig 2006). Piscataway, NJ: IEEE Press, 2006: 1-10.
[9] Jain D, Kumar A, Pozzi L, et al. Automatically customising VLIW architectures with coarse-grained application-specific functional units. In: Proc. of the Eighth International Workshop on Software and Compilers for Embedded Systems (SCOPES 2004). Berlin/Heidelberg: Springer, 2004: 17-32.
[10] Wu I-Wei, Chen Zhi-Yuan, Shann Jyh-Jiun, et al. Instruction set extension exploration in multiple-issue architecture. In: Proc. of the Conference on Design, Automation and Test in Europe (DATE 2008). New York, NY: ACM, 2008: 764-769.
[11] Xu Zeshui. Uncertain Multiple Attribute Decision Making: Methods and Applications. Beijing: Tsinghua University, 2004. (in Chinese)
[12] Lee C, Potkonjak M, Mangione-Smith W H. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In: Proc. of the 30th Annual ACM/IEEE International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 1997: 330-335.
[13] Lee C, Stoodley M. UTDSP BenchMark Suite. http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html. 2010.
