
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 27, NO. 7, JULY 1992

VLSI System Design for Automotive Control


Andreas Laudenbach and Manfred Glesner
Abstract-This paper presents a novel VLSI approach for combustion engine control. The approach is based on a real-time solution of a thermodynamical differential equation. The control system calculates an optimum ignition point by fast measurement and real-time processing of signals such as temperature, pressure, and volume of the combustion chamber. The required computational power cannot be met with standard signal processors. We present the design of a mechatronic system that is based on an application-specific vector architecture. Each design step is presented: the analysis of the heat release algorithm, the optimization of the algorithm and the data format, the mapping onto an architecture, the physical design of the test chip set and the single chip, the chip test, and the system integration. Finally, the application at an engine test-stand and results are shown.

Manuscript received December 2, 1991; revised February 28, 1992. This work was supported by the German National Science Foundation (Deutsche Forschungsgemeinschaft) under Project SFB 241 IMES (Integrated Mechano-Electronic Systems). The authors are with the Institute of Microelectronic Systems, Darmstadt University of Technology, D-6100 Darmstadt, Germany. IEEE Log Number 9200153.

I. INTRODUCTION

In this long-term research project, activities are focused on a novel approach for combustion engine control, which is based on the measurement of the pressure in the combustion chamber and on the real-time calculation of the heat release. The first realization of the microelectronic subsystem for thermodynamical analysis was implemented with a commercial DSP. Because of the limited computational power of this solution, a heat release algorithm with many simplifications had to be used. The physical behavior of the combustion is modeled more accurately if energy losses over the chamber wall are incorporated and if the combustion-dependent gas constant is described more exactly. In this case the heat release is defined implicitly and the calculation is based on the solution of a nonlinear differential equation. The real-time solution is not possible with a commercial processor. As a consequence, a parallel VLSI architecture tuned to this specific application was developed. For working with the computer system as an evaluation tool for mechanical engineering, some requirements had to be met:

• high computational power is provided by parallel processing arithmetic units;
• high data throughput is ensured with a multiplexed and pipelined data path with three data memory modules on the chip;
• optimized program code is generated with a macroassembler that supports arithmetic operations on scalar and vector data;
• full flexibility is provided with off-chip program memory. The actual program code is stored in fast static RAM.

II. MATHEMATICAL MODEL OF A COMBUSTION ENGINE

State-of-the-art combustion engine control systems [1] determine the ignition point either in open-loop control, depending on the load of the engine and the revolutions per minute, or in closed-loop control by knock detection. Thermodynamical parameters of the combustion are not taken into account. In a novel approach for engine control [2], each combustion is analyzed thermodynamically. The fundamental parameter of the control algorithm is the heat release $dQ_B/d\varphi$ as a function of the crank shaft angle $\varphi$. The heat release has a great influence on the engine's economy and on the exhaust. The heat release can be calculated with the pressure $p(\varphi)$ in the combustion chamber, the volume $V(\varphi)$, and the differentials of these functions $dp(\varphi)$ and $dV(\varphi)$. With the law of conservation of energy, the heat release becomes

$$dQ_B = dU + p \cdot dV + dQ_{wall} \qquad (1)$$

where $U$ is the internal energy of the air/fuel mixture, $Q_{wall}$ is the heat due to energy losses over the cylinder walls, and $p\,dV$ is the mechanical work on the piston. From the ideal gas equation follows:

$$T = \frac{p \cdot V}{R \cdot m} \qquad (2)$$

where $T$ is the gas temperature, $R$ is the gas constant, and $m$ is the mass of the air/fuel mixture. With an empirical power series expansion [3], the internal energy becomes

$$U = k_0 + k_1 \cdot T + k_2 \cdot T^2 + k_3 \cdot T^3. \qquad (3)$$

The energy loss over the chamber walls, which is induced by convection, is calculated as

(4)
where $A_{wall}$ is the combustion chamber surface, $T_{wall}$ is the cylinder wall temperature, which is set constant, and $\omega$ is the angular speed. With the average piston speed $c_m$, the coefficient $\alpha$ of heat conductivity [4] is

$$\alpha = 130 \cdot V^{-0.06} \cdot p^{0.8} \cdot T^{-0.4} \cdot (c_m + 1.4)^{0.8}. \qquad (5)$$
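As a minimal sketch of how (2), (3), and (5) turn into per-sample quantities, the following Python fragment evaluates the gas temperature, the internal-energy polynomial, and the heat-transfer coefficient for one crank-angle sample. The function names, the coefficient tuple `K`, and the example operating point are hypothetical placeholders, not values from the paper. The pressure unit used for the correlation (bar) is likewise an assumption.

```python
def gas_temperature(p, V, R, m):
    """Ideal-gas temperature, eq. (2): T = p*V / (R*m)."""
    return p * V / (R * m)


def internal_energy(T, k):
    """Cubic power series of eq. (3), evaluated with Horner's scheme."""
    k0, k1, k2, k3 = k
    return k0 + T * (k1 + T * (k2 + T * k3))


def heat_transfer_coefficient(V, p, T, c_m):
    """Heat-transfer coefficient of eq. (5)."""
    return 130.0 * V**-0.06 * p**0.8 * T**-0.4 * (c_m + 1.4)**0.8


# Hypothetical coefficients and operating point (illustration only).
K = (0.0, 700.0, 0.1, -1.0e-5)
T = gas_temperature(p=2.0e6, V=1.0e-4, R=287.0, m=5.0e-4)   # SI units assumed
U = internal_energy(T, K)
# Assumption: the correlation takes the pressure in bar (2.0e6 Pa = 20 bar).
alpha = heat_transfer_coefficient(V=1.0e-4, p=20.0, T=T, c_m=10.0)
```

Horner's scheme is used for the polynomial because it needs only three multiplications and three additions per sample, which matches the multiply-add style of operation the paper favors.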
0018-9200/92$03.00 © 1992 IEEE


At this point the heat release could directly be calculated. But in the burning fuel mixture the ideal gas constant $R$ is not constant; it is a function of the combustion rate. The resulting differential equation is of first order and nonlinear. The differential equation is solved with an iterative method which can be processed in parallel. The iterative algorithm calculates a start solution $dQ_B^{(0)}$ with constant $R$ and the heat due to energy losses over the cylinder walls set to zero ($dQ_{wall} = 0$). In the $i$th iteration we use the old integral heat release

$$Q_B^{(i-1)}(\varphi) = \int \frac{dQ_B^{(i-1)}}{d\varphi'}\, d\varphi' \qquad (6)$$

and the old maximum $Q_{B,max}^{(i-1)}$ to calculate the new air/exhaust ratio $r(\varphi)^{(i)}$:

The parameters $r_{end}$ and $r_{start}$ are calculated in the initial phase before the combustion starts. Now the gas constant $R = R(p(\varphi), r(\varphi), T(\varphi))$ can be calculated as

29.0 + A($\varphi$) + p($\varphi$) ...

$A$, $B$, and $C$ are power functions of $r$ with a fractional exponent between zero and one. With $R(\varphi)$ and (1)-(5) a new heat release $dQ_B(\varphi)^{(i)}$ can be calculated. The analysis is done over a range of 128° crank shaft angle with a step size of 1°. The iteration is repeated until the variation of the integral heat release is less than 0.1%:

(9)

III. DEVELOPMENT OF ALGORITHM AND ARCHITECTURE

A. Computation Time Requirements

Heat release calculations with sampled input data demonstrate that with two runs through the iterative loop the variation of the integral heat release is always less than 0.1%. This is the iteration stopping criterion. Generating the start solution and running through the iteration loop twice requires 218 additions, subtractions, and multiplications and 14 divisions. Since the analysis is done over a range of 128° crank shaft angle with a resolution of 1°, each operation has to be performed 128 times. A sequential calculation of the heat release must not take longer than the time for sampling one pressure signal. The signals are sampled with a resolution of 1° crank shaft angle. If the engine is running at the maximum revolution speed of 6000 rpm, the resulting computation time will be about 20 μs, which is not sufficient for the sequential calculation of 232 mathematical operations including 14 divisions. If the algorithm is processed after the sampling period, real-time capability will be reached if the computation has finished after 50° crank shaft angle. At a maximum revolution speed of 6000 rpm the 50° angular interval corresponds to 1.35 ms. The required computational power can be reached only with parallel processing, either with a multiprocessor board or with a single chip that has parallel processing units. Since the aim of the project is an integrated solution, the development of a single-chip parallel processor is the only acceptable way.

B. Data Format

Simulations on the behavioral level with different sampled input signals demonstrate that the intermediate results can vary over 12 orders of magnitude. This dynamic range is caused by variations of the input signals, which are amplified by arithmetic operations. For example, the pressure input signal has a range from 0.01 to 10 MPa. The internal data format must be able to represent each value with sufficient accuracy. The simulations have shown that a fixed-point representation requires 52 b, whereas a floating-point format with 20 b provides sufficient accuracy and dynamic range for the heat release algorithm. The floating-point format was preferred because a 52-b bus would result in a large area and power consumption. The optimized floating-point data format has 12 mantissa bits, a sign bit, and seven exponent bits. The dynamic range of the number representation is 38 orders of magnitude.

C. Architecture

Due to the use of a floating-point data format, the arithmetic units get quite complex because mantissa and exponent must be processed in different ways and costly error handling becomes necessary. This influences the architecture because it is not possible to integrate many arithmetic units on a single chip. The heat release algorithm could easily be mapped on a parallel single-instruction multiple-data (SIMD) architecture with next-neighbor connections. But since the SIMD architecture works efficiently only if a great number of arithmetic units are realized, the integration on a single chip is not possible.

Another parallel processing concept onto which the algorithm maps easily is vector processing. Here the data are processed very efficiently with the principle of an assembly line [5], [6]. The implementation of the heat release algorithm on a vector architecture, as shown in Fig. 1, implies that the input signals and intermediate results are processed as vectors. Each degree of crank shaft angle represents one vector component. The vector size is 128. All algorithmic operations except the integration are vector operations. The long vectors make the design of arithmetic units (AU's) with deep pipelines of four to ten stages very efficient. The first AU is able to perform a multiplication or division alternatively, while the second one can alternatively add or subtract two floating-point numbers. The most frequent operations are of the multiply-add, multiply-subtract, divide-add, or divide-subtract type.
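Because each degree of crank shaft angle is one vector component, the secant differentiation and trapezoidal integration named in Section III-D reduce to simple operations on 128-element arrays. The sketch below is a plain Python illustration under stated assumptions, not the processor's program code: the function names are made up for this example, and the one-sided differences at the two end points are an assumption, since the paper does not say how the vector boundaries are handled.

```python
def secant_derivative(f, step=1.0):
    """Approximate df at each sample by the secant through its two neighbors."""
    n = len(f)
    d = [0.0] * n
    for i in range(1, n - 1):
        d[i] = (f[i + 1] - f[i - 1]) / (2.0 * step)
    # Assumed boundary handling: one-sided differences at both ends.
    d[0] = (f[1] - f[0]) / step
    d[n - 1] = (f[n - 1] - f[n - 2]) / step
    return d


def trapezoid_integral(g, step=1.0):
    """Running integral of g with the trapezoidal rule (sequential scan)."""
    total = 0.0
    out = [0.0]
    for i in range(1, len(g)):
        total += 0.5 * (g[i - 1] + g[i]) * step
        out.append(total)
    return out


# Example on a 128-component vector, one component per degree of crank angle.
angles = [float(i) for i in range(128)]
f = [a * a for a in angles]
df = secant_derivative(f)      # interior values equal 2 * angle
F = trapezoid_integral(df)     # running sum with the same length as the input
```

The derivative touches each component independently and therefore vectorizes directly, whereas the running integral carries a dependency from one component to the next, which is why integration is the one operation the paper singles out as not being a vector operation.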


This is the reason why the output of the multiplier/divider unit is connected to the input of the adder/subtractor unit. The advantage of this connection is that only three instead of four data memory accesses per clock cycle are necessary. If the pipeline is filled, two floating-point operations per clock period will be processed. The memory is divided into on-chip memory modules and an off-chip RAM. The need to feed three data words into the AU's requires either one high-speed on-chip RAM with a high-speed bus or several vector memories with a crossbar network. The crossbar switch was realized with bus multiplexers; the vector memories were realized with shift registers of 128 words times 20-b size. The shift registers can store three signal vectors or result vectors. The advantage of this solution is a relatively moderate clock rate, while in each clock period two floating-point operations are processed.

Fig. 1. Architecture and partitioning of the vector processor.

D. Algorithm

The differentiation $df(\varphi)$ is approximated by the secant through the points $f(\varphi + 1)$ and $f(\varphi - 1)$. The integration is performed with the trapezoidal method. The approximation of power functions with fractional exponent and of exponential functions is performed with power series expansion. A special tool for the determination of the approximation constants with a least-squares optimization technique was developed. The user has to specify the functions that will be approximated, the range of possible arguments, and the allowed error tolerance. The global approximations are based on orthogonal polynomials that meet the required accuracy over the whole argument range. Therefore the approximations are calculated easily in parallel because for each argument the approximation constants are identical. Compared to spline interpolation methods, which would also be possible, the processor has to calculate polynomials of higher order. But this is not a drawback because time-consuming memory accesses for fetching spline coefficients, which are different for each vector component, are avoided.

IV. PHYSICAL DESIGN AND TEST

Due to the complexity of the different processor blocks, which mainly resulted from the floating-point data format, no commercial library was available. As a consequence, a library of macrocells has been designed. The library contains module-generator-based cells such as floating-point multiplier/divider and adder/subtractor, vector memory modules, static and dynamic registers, and bus multiplexers. A set of input, output, tristate, bidirectional, and supply pads has been designed with the layout editor and extractor of the Cadence design framework. Each macrocell must be parametrizable with regard to the floating-point format in order to permit the optimization of computation accuracy, floorplanning, and the data format of the vector architecture in parallel to the macrocell design process. For that purpose, and also to be independent of design rule variations, the macrocells were designed by using a symbolic design environment [7]. The layout structure of the different blocks was described with a symbolic layout language on a virtual grid. The language offers procedural statements to permit an efficient parametrization. The transformation of this description to the final coordinates is performed by two one-dimensional compaction steps which are based on a modified most recent layer algorithm. A local two-dimensional compaction step follows to improve the overall result. A compacted layout and netlists for SPICE and switch-level simulation are generated.

A. Design of Complex Modules

1) Multiplier/Divider Unit: The multiplication and division unit contains a fully pipelined circuit which is dedicated to floating-point computations. The implemented algorithms are a space-expanded add-shift algorithm for multiplication and a restoring division algorithm. For monitoring the operations, a set of five error flags is provided: overflow, underflow, division by zero, illegal operation, and operation aborted. The multiplication and division unit is designed to generate one result per clock cycle. There are no restrictions on the interlacing of multiplications or divisions in consecutive clock cycles. Fig. 2 shows that the multiplier/divider unit is subdivided into the mantissa data path and the exponent data path, including sign bit processing and error flag generation.

a) Mantissa data path: The mantissa data path contains an initialization stage, the adder array, and a postnormalization circuit. The structure of the adder array is basically restricted by the restoring division algorithm: the quotient bits have to be calculated in a most significant bit to least significant bit order. Based on this circuit


Fig. 2. Multiplier/divider block structure. (Inputs: mantissa, exponent, sign. Blocks: mantissa initialization, mantissa adder array, mantissa postnormalization; exponent calculation, exponent adjustment, error flags.)

structure the multiplier bits also have to be evaluated in the same order. As a result of this strategy the carry-out bit of each adder row must be taken into account by a final addition creating the most significant half of the product word. For division, the sign of each generated partial remainder must be accessible from the adders for use in the control circuit. This prohibits the use of carry-save adders. Instead, carry-select adders are used to avoid both the high wiring cost of carry-lookahead techniques and the low speed of ripple-carry adders.

b) Exponent data path: The exponent data path consists mainly of two carry-select adder rows (for exponent calculation and adjustment) and a large amount of random logic for the generation of the error flags. The first adder row executes the addition or subtraction of the exponents that is required for multiplication or division of the two operands, respectively. The second is used to generate a correctly biased exponent (bias = $2^{n-1} - 1$ for $n$ exponent bits) and it takes a possible mantissa postnormalization into account. The random logic circuit is able to detect overflow or underflow of the result as well as operations with illegal operands (e.g., $\infty \cdot 0$, $0/0$, $\infty/\infty$). On overflow, the corresponding flag is set and the result is set to $\infty$ (represented by an exponent $e = 2^{n} - 1$); underflow causes the result to be zero (represented by an exponent $e = 0$). If an illegal operation is detected, the illegal flag is set and the result outputs are undefined.

2) Adder/Subtractor Unit: The addition/subtraction unit is divided into three sections, separated by pipeline register stages, which represent the basic operations of floating-point addition or subtraction: 1) denormalization, 2) kernel calculate operation, 3) normalization. The inherent equality of addition and subtraction is used to reduce the complexity of the arithmetic unit. The equations of the sum and the difference bit are identical. The equations of the carry and the borrow bit differ only by the value of $X$:

addition:
$$S = X \oplus Y \oplus C_{in}, \qquad C_{out} = X (Y \vee C_{in}) \vee Y C_{in}$$

subtraction:
$$D = X \oplus Y \oplus B_{in}, \qquad B_{out} = \neg X (Y \vee B_{in}) \vee Y B_{in}.$$

The same circuit can be used for addition and subtraction if the negation of $X$ is available for the calculation of the borrow bit. By multiplexing the absolute input values it is always guaranteed that the smaller value is subtracted from (or added to) the greater value [8]. The result of the operation cannot be negative. Logic for complementation and an additional adder are saved. The kernel of the adder/subtractor unit is a carry-select adder. The carry-select adder combines the advantages of high calculation performance and simple parametrization. The basic adder, which is designed in pass-transistor logic, is modified with three 2:1 multiplexers to allow addition and subtraction with the same circuit. The multiplier/divider unit and the adder/subtractor unit can operate in test mode. Then all the pipeline registers are sequentially linked to form several scan paths.

3) Vector Memory Module: The vector memory is realized with shift registers that provide a sequential memory organization. It is not possible to fetch one arbitrary vector component in one cycle. But this is not a drawback because all operations of the algorithm are vector operations. The sequential memory organization instead of RAM has the advantage that no address calculation is necessary. This saves either area for on-chip address calculation logic or additional pads and program memory if the addresses are generated by a compiler. The basic shift-register cell for storing 1 b is a semistatic flip-flop. The information is shifted into the master if $\phi_1$ and a shift signal are at logic high level. If the shift signal is at logic low level, the information is stored statically in the master. The master is able to hold information statically over a long period of time, because two inverters are connected in a ring and form a positive feedback loop. The slave stores information only dynamically. The logic value is refreshed from the signal in the master in each cycle with $\phi_2$ at logic high level. Several vector variables are used more than once in the algorithm. This is the reason why a feedback mode and a multiplexer at the input of the shift register are provided. If the feedback signal and the shift signal are set to logic high level, the output of the shift register is connected to its input. After 128 shift operations in the feedback mode the vector variable is in the original state again. Since a vector in the combustion rate algorithm has 128 components and the optimized data format has 20 b, the shift registers have a 128-word × 20-b organization. The layout of the generated shift register is extremely regular. The transistor density of the macrocell is more than 7000 transistors/mm² in a 1.2-μm CMOS process.

B. Multichip Project

For the first realization of the vector processor, the whole architecture has been partitioned into different modules, as shown with the dotted lines in Fig. 1. The fabrication of the partitioned processor blocks has the advantage that each module can be tested independently of the others. Together with test chips from other projects a multichip wafer with eight test chips was processed


in a 1.2-μm CMOS technology. Fig. 3 shows a microphotograph of the multichip project. The multichip wafer contains three modules of the vector processor: floating-point multiplier/divider (right side at the bottom), floating-point adder/subtractor with some circuitry for fast integration and search of a maximum vector component (left side at the bottom), vector memory (right side in the middle).

Fig. 3. Microphotograph of the multichip wafer.

TABLE I
Chip Type            Macrocell Size        Chip Size             Transistors   Clock
vector memory        4098 μm × 814 μm      5040 μm × 1584 μm     26 000        36 MHz
adder/subtractor     1703 μm × 1846 μm     5264 μm × 4168 μm     moo           15 MHz
multiplier/divider   5212 μm × 3462 μm     5928 μm × 4160 μm     31 000        36 MHz

In the test phase the correct function of each test chip was verified successfully. The sizes of the test chips and the maximum clock frequencies are shown in Table I. The maximum clock frequency of the adder/subtractor is low compared to the other modules. This is caused by power/ground bounce induced by the buffers in the output pads. The multiplier/divider unit operates with a clock of 36 MHz. This is the highest possible two-phase nonoverlapping clock that can be generated with the tester.
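The claim in Section IV-A that one circuit can serve both addition and subtraction can be checked exhaustively. The sketch below encodes the sum/carry and difference/borrow equations as Python functions (the names are illustrative only) and compares them against integer arithmetic for all eight input combinations. It also confirms that the borrow equals the carry expression evaluated with X negated.

```python
from itertools import product


def add_bit(x, y, c_in):
    """One-bit addition: sum and carry-out."""
    s = x ^ y ^ c_in
    c_out = (x & (y | c_in)) | (y & c_in)
    return s, c_out


def sub_bit(x, y, b_in):
    """One-bit subtraction x - y - b_in: difference and borrow-out."""
    d = x ^ y ^ b_in
    not_x = 1 - x
    b_out = (not_x & (y | b_in)) | (y & b_in)
    return d, b_out


def check_all():
    """Compare both bit slices with integer arithmetic for every input."""
    for x, y, c in product((0, 1), repeat=3):
        s, c_out = add_bit(x, y, c)
        assert x + y + c == s + 2 * c_out
        d, b_out = sub_bit(x, y, c)
        assert x - y - c == d - 2 * b_out
        # Sum and difference bits are identical.
        assert d == s
        # Borrow is the carry expression with X negated.
        assert b_out == add_bit(1 - x, y, c)[1]
    return True
```

Calling `check_all()` runs through the full truth table and raises an `AssertionError` if either identity fails.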
C. Single-Chip Vector Processor

The different blocks of the whole vector architecture, shown in Fig. 1, are placed and the signal lines between them are routed by a floorplanning tool that is based on simulated annealing [7], [9]. The underlying layout model used in this tool is the well-known slicing structure. The floorplanner and the macrocell generator are strongly linked to permit a fast design cycle. The floorplanner generates a layout in which all the signal lines of the core cells as well as the signal lines to the pad cells are routed. The supply rails of the macrocells must be inserted by hand. Design improvements over the test-chip set include a physical separation between the power supply of the pad cells and the power supply of the core cells. In the adder/subtractor macrocell two buffers that drive critical nodes are designed with lower impedance. The layout of the single-chip vector processor is shown in Fig. 4. The layout of the whole vector processor as a single chip in a 1.2-μm CMOS process has a size of 7.2 mm × 7.3 mm and contains 120 000 transistors. The chip has 68 pins, including 14 power pins and 5 test pins. The single-chip vector processor has been fabricated and has been tested successfully up to the maximum clock frequency of 33 MHz.

Fig. 4. Microphotograph of the single-chip vector processor.
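The 20-b floating-point format chosen in Section III-B (one sign bit, seven exponent bits, 12 mantissa bits) and the exponent conventions of Section IV-A (biased exponent, exponent 0 for zero, all-ones exponent for infinity) can be explored with a small software model. The following Python sketch is an illustration under several assumptions and not a description of the chip's arithmetic: it assumes a normalized mantissa with an implicit leading one, round-to-nearest, and a bias of 63, and all names are hypothetical.

```python
import math

MANT_BITS = 12
EXP_BITS = 7
BIAS = 2 ** (EXP_BITS - 1) - 1   # 63 (assumed reading of the bias formula)
E_MAX = 2 ** EXP_BITS - 1        # 127, reserved for infinity


def quantize(x):
    """Round x to the nearest value representable in the assumed format."""
    if x == 0.0 or not math.isfinite(x):
        return x
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e with 0.5 <= m < 1
    exp = e - 1                          # exponent of the 1.f significand
    frac = round((2.0 * m - 1.0) * 2 ** MANT_BITS)
    if frac == 2 ** MANT_BITS:           # rounding carried into the next binade
        frac, exp = 0, exp + 1
    biased = exp + BIAS
    if biased <= 0:                      # underflow: result is zero
        return 0.0
    if biased >= E_MAX:                  # overflow: result is infinity
        return math.copysign(math.inf, x)
    return math.copysign((1.0 + frac / 2 ** MANT_BITS) * 2.0 ** exp, x)


def dynamic_range_decades():
    """Orders of magnitude between smallest and largest normalized value."""
    smallest = 2.0 ** (1 - BIAS)
    largest = (2.0 - 2.0 ** -MANT_BITS) * 2.0 ** (E_MAX - 1 - BIAS)
    return math.log10(largest / smallest)
```

Under these assumptions `dynamic_range_decades()` evaluates to just under 38, in line with the 38 orders of magnitude quoted for the number representation, and the relative rounding error of `quantize` stays within 2^-13.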


V. SYSTEM INTEGRATION AND APPLICATION

Fig. 5 shows the prototype vector processor board. With this board the single-chip vector processor is emulated and the whole system is tested. The partitioned processor blocks, shown with the dotted lines in Fig. 1, are realized by the test chips. Two programmable logic devices are used for the implementation of the crossbar network, and off-the-shelf registers are used for the buffering of fetched instructions and data. The programmable logic devices limit the system clock to 5 MHz. The heat release computer contains a commercial microcontroller and the vector processor board with program and data memory, as shown in Fig. 6. The microcontroller is used for controlling the analog-to-digital converter board and the vector processor board, synchronization with the combustion engine, preprocessing of the input signals, download of the program code for the vector processor into the program RAM, and interfacing to an HP 9000 Series 300 host computer. The vector processor board is mounted in a PC rack together with the A/D converter board, the microcontroller board, and the data and program memory board. Optimized program code for the vector processor is generated with a macroassembler. The macroassembler supports simple and composed arithmetic instructions for scalar and vector data types and special arithmetic functions such as search of a maximum vector component, integration, and differentiation of a vector. The heat release algorithm is described very efficiently with the assembler language, which contains symbolic variables and constants, as well as loop statements. The assembler generates up to 32K lines of linear machine code. The instruction word contains 35 b. The heat release computer is a part of a mechatronic system for ignition control which is installed at an engine test-stand in the Department of Mechanical Engineering at Darmstadt University.

Fig. 5. Vector processor emulation board.

Fig. 6. Heat release computer. (Diagram labels: Intel 80960 microcontroller with reset, count, crank shaft angle, and top dead center inputs; control code; link to the host; 12-bit address bus; vector processor; two-port data memory; 20-bit data buses.)

VI. RESULTS

A set of test chips has been fabricated successfully. A board with the test chips for the emulation of the single-chip vector processor was built. The system clock of the emulation board is limited to 5 MHz because of the use of programmable logic devices (PLDs). The heat release algorithm is calculated with the emulation board in 4.6 ms with an actual performance of 6.5 MFLOPS. This is nearly the same time as used by the implementation on a standard 32-b digital signal processor with integrated floating-point unit. The heat release computer is successfully working at an engine test-stand. A single-chip version of the vector processor has been fabricated successfully. If the algorithm is processed on the single-chip vector processor, the computation time will be 1.15 ms, so that the real-time requirement of 1.35 ms will be met. The implemented heat release algorithm will run with an actual 25.8 MFLOPS, so that the maximum performance of 40 MFLOPS of the vector processor is utilized to 64%. The vector processor contains 120 000 transistors, while a 32-b digital signal processor has at least about 400 000 transistors. The application-specific architecture brings a significant reduction in transistor count and computation time. With the presented system it will be possible for the first time to implement control algorithms for combustion engines which are based on a thermodynamical evaluation of each combustion in real time.

ACKNOWLEDGMENT

The VLSI test chip set and the single-chip vector processor have been fabricated by Intermetall, Deutsche ITT Industries, in Freiburg on their 1.2-μm CMOS process. The authors would like to express their thanks to Intermetall for the support.

REFERENCES

[1] U. Adler, Kombiniertes Zuend- und Benzineinspritzsystem mit Lambda-Regelung: Motronic, Bosch GmbH: Reihe Technische Unterrichtung, Stuttgart, Germany, 1985.
[2] A. Laudenbach, M. Glesner, G. Hohenberg, E. Nitzschke, and D. Koehler, "Real time heat release calculation for combustion engines," presented at the ISATA Conf. Mechatronics, Florence, Italy, 1991.
[3] A. Pischinger, Thermodynamik der Verbrennungskraftmaschine. Berlin: Springer Verlag, 1989.
[4] G. Hohenberg, "Experimentelle Erfassung der Wandwaerme von Kolbenmotoren," Habilitation thesis, Tech. Univ. Graz, Graz, Austria, 1980.
[5] K. Hwang, Computer Architecture and Parallel Processing. New York: McGraw-Hill, 1985.
[6] H. S. Stone, High Performance Computer Architecture. Reading, MA: Addison-Wesley, 1990.
[7] N. Wehn, "Efficient methodologies for the physical design of MOS VLSI circuits," Ph.D. dissertation, Darmstadt Univ., Darmstadt, Germany, July 1989.
[8] K. Hwang, Computer Arithmetic. New York: Wiley, 1979.
[9] J. Schuck, N. Wehn, M. Glesner, and G. Kamp, "The ALGIC silicon compiler system: Implementation, design experience and results," presented at the 24th Design Automation Conf., Miami, FL, June 1987.

Andreas Laudenbach received the diploma in physics from the Darmstadt University of Technology, Darmstadt, Germany, in 1987. He is currently working towards the Ph.D. degree at the Institute of Microelectronic Systems at the same university. His research interests cover CMOS VLSI circuit design and the design of application-specific integrated processors for the development of microelectronic subsystems in mechatronic applications.
Mr. Laudenbach is a member of the Deutsche Physikalische Gesellschaft.

Manfred Glesner graduated from the Saarland University, Saarbruecken, Germany, in applied physics and electrical engineering in 1969. In 1975 he received the Ph.D. degree from the same university. Since 1981 he has been a Professor of Electrical Engineering at Darmstadt University of Technology, Darmstadt, Germany, where he is engaged in research on CAD tool development and VLSI circuit design. He has been active in the field of CAD for electronic circuits for 20 years and has published more than 50 papers. Current work covers silicon compilation, digital signal processing, and innovative system applications of microelectronics.
Dr. Glesner is a member of several technical societies.
