2008 FinalExam SoCN Final Master Solution

Institute for Integrated Systems, Technische Universität München
Prof. Dr. sc.techn. Andreas Herkersdorf
Final Exam System on Chip Solutions in Networking SS 2008

Room: N1189 Munich, July 28, 2008 Time: 9:00 am
I hereby confirm that I have been informed prior to the beginning of the examination that I have to notify the examination supervisors immediately if sudden illness occurs during the examination. This will be noted in the examination protocol. An application for exam withdrawal has to be filed immediately with the board of examiners in charge. A medical certificate from one of the physicians acknowledged by the Technische Universität München, issued on the same day as the examination, must be forwarded without delay. If the examination is completed regularly despite illness, a subsequent withdrawal due to illness cannot be accepted.
Name: ..........
Matriculation #: ..........
Signature: ..........
Seat: ..........
Examination Time: 75 minutes
First, please fill in the title page with your name, matriculation number and the number of your seat. Do not forget to sign the exam! If you hand in any extra sheets of paper, they must also contain your name and matriculation number. We will check the student ID and passport during the exam. The numbers in parentheses indicate the number of credits you can earn for a correct answer to the respective question. The maximum number of credits to be earned is 70. Subquestions which may be answered independently of other parts are marked with an asterisk (*). Please note that in multiple-choice questions false answers lead to negative credits.
No materials allowed in the exam except: pen, non-programmable pocket calculator, one sheet A4 with your personal notes.
Good Luck!
Final Exam SoCN SS 2008 July 28, 2008
1) Right or wrong? (8)
Remark: correct answer +1, wrong answer -1, no answer 0 points, min.: 0

Yes/No  Statement
X       VLIW processors and superscalar processors both have more than one execution unit.
X       DSP processors are usually RISC cores extended with HW multiply-accumulate units for signal processing tasks.
X       Sonet/SDH frame synchronization may be achieved by analyzing inter-frame gaps.
X       Sonet/SDH networks are typically installed in a star topology.
X       In a shared medium LAN there are usually more collisions between communication partners than in switched LANs.
X       The FIFOs in input buffered switches need a higher memory access bandwidth than those in shared output buffered switches.
X       High-speed differential serial I/O is a prerequisite for implementing the L2 interface of single STM-4 framer devices.
X       Thermal conduction dominates thermal radiation with respect to heat dissipation from the active area to the environment.

2) Explain two reasons for payload scrambling in a Sonet/SDH transmission system! (2)
- Avoid mimicking of the start-of-frame (SoF) pattern by payload data
- Guarantee sufficient signal transitions for clock recovery
3) Microprocessor Architecture: CISC vs. RISC
a) * Explain the conceptual difference between a RISC and a CISC processor! (2)
RISC: all operands come from the register file, and memory is accessed only through dedicated load/store instructions. CISC: memory accesses are also possible in arithmetic/logic instructions.
b) * Assume a pipelined implementation of a RISC architecture and a CISC architecture; both implementations have five pipeline stages. Which architecture would achieve the higher clock frequency? Explain why you think so! (2)
The RISC will be faster, because there is less functionality / complexity in each stage.
4) Name the two fundamentally different switch architectures with respect to port contention resolution strategy. Sketch their respective offered load / delay behavior into the two diagrams provided below! (4)
1) Input Buffered Switch architecture
2) Output Buffered Switch architecture
[Two empty diagrams, one per architecture: Normalized Delay (axis ticks 1, 2, 4, 8, 16, 32, 64) over Offered Load (20% to 120%).]
The diagrams are given in the Switch chapter, slides No. 8 and 9.
5) Chip Package Heat Dissipation (7)
[Figure: cross-section of die, glue and case; the junction is the active area of the die.]
h_case-air = 7 W/(m²K), k_case = 237 W/(mK), k_glue = 0.8 W/(mK), k_silicon = 149 W/(mK)
The die is 0.45 mm thick, the glue 0.05 mm and the case cover 0.7 mm. The die area is 125 mm² and the case surface (for convection only!) is 1200 mm².
Formulas: R_cond = t / (k × A), R_conv = 1 / (h × A).
a) Derive the conductive thermal resistance between the active area of the die and the top of the case. Further derive the convective thermal resistance of the case and the total thermal resistance of the die-package combination! (4)
R_Si = 0.00045 / (149 × 0.000125) K/W = 0.024 K/W (1)
R_glue = 0.00005 / (0.8 × 0.000125) K/W = 0.5 K/W
R_case = 0.0007 / (237 × 0.000125) K/W = 0.024 K/W (1)
R_conv = 1 / (7 × 0.0012) K/W = 119 K/W (1)
R_total = R_conv + R_cond = 119.548 K/W (1)
b) * Calculate the junction temperature for an ambient air temperature of 27 °C, when the chip dissipates 800 mW! If you could not solve a) assume a thermal resistance of 125 K/W! (3)
ΔT = P × R (1)
ΔT = 0.8 W × 119.548 K/W = 96 K (1)
T_junction = 27 °C + 96 K = 123 °C (1)
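The thermal chain of question 5 can be recomputed with a short script. This is a minimal sketch: the function and variable names are illustrative assumptions, and the script works with unrounded intermediate values, so it gives about 119.6 K/W and 122.7 °C, consistent with the rounded 119.548 K/W and 123 °C above.

```python
# Thermal resistance chain for question 5; all input values come from the problem statement.

def r_cond(thickness_m, k_w_per_mk, area_m2):
    """Conductive thermal resistance R = t / (k * A) in K/W."""
    return thickness_m / (k_w_per_mk * area_m2)


def r_conv(h_w_per_m2k, area_m2):
    """Convective thermal resistance R = 1 / (h * A) in K/W."""
    return 1.0 / (h_w_per_m2k * area_m2)


DIE_AREA_M2 = 125e-6     # 125 mm^2
CASE_AREA_M2 = 1200e-6   # 1200 mm^2, convection only

r_si = r_cond(0.45e-3, 149.0, DIE_AREA_M2)
r_glue = r_cond(0.05e-3, 0.8, DIE_AREA_M2)
r_case = r_cond(0.7e-3, 237.0, DIE_AREA_M2)
r_convection = r_conv(7.0, CASE_AREA_M2)

# All four resistances are in series between junction and ambient air.
r_total = r_si + r_glue + r_case + r_convection

delta_t = 0.8 * r_total        # 800 mW dissipated
t_junction = 27.0 + delta_t    # ambient air at 27 degC

print(f"R_total = {r_total:.2f} K/W, T_junction = {t_junction:.1f} degC")
```

The convective term dominates by more than two orders of magnitude, which is why the rounding of the conductive terms hardly matters.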
6) Network Processor System (25)
[Block diagram with the components: Instruction Bus; CPU 1 to CPU 4, f = 2 GHz each; SRAM (2 ns, 128 b); Data Bus; DMA; Crypto HW; SDRAM Ctrl.; ...; SDRAM module (32 b, 250 MHz, 7-1).]
Given is an NP-SoC with a run-to-completion architecture as depicted above. It is designed to encrypt and forward IP packets (no L2 protocol shall be regarded here!) at a line rate of 2 Gbit/s. The IP packets are variable in length with a fixed header size of 20 bytes and a payload section between 26 bytes and 1460 bytes. Before encryption an additional header of 20 bytes is added to each packet as shown in the following diagram:
[Packet format diagram. Before: IP header (20 B), payload (26 B to 1460 B). After: IP header (20 B), crypto header (20 B), payload (26 B to 1460 B).]
The CPUs are single-issue pipelined RISC cores that run at 2 GHz and have a single level of cache. The SRAM, which serves as instruction memory, has a minimum access time of 2 ns and an internal memory width of 128 bits. The instruction bus runs at 250 MHz and is 256 bits wide. The packet memory consists of several SDRAM modules with a 32 bit data bus, a 250 MHz clock (single data rate) and a 7-1 access pattern.
a) * What is the optimum CPI_CPU of each of the given cores? Ignore pipeline hazards! (1)
CPI = 1, for a single-issue pipelined CPU.
b) * What is the L1 miss penalty for the instruction accesses to the SRAM, given a cache line size of 512 bit? Assume that you will have to wait for an additional half cache line read by another CPU on average. The cache miss is resolved after the entire cache line has been read. (4)
Cache-line size: 16 × 32 bit = 512 bit (1)
=> 512 b / 128 b = 4 cycles on the 500 MHz memory interface (1)
=> Instruction bus: same bandwidth as the memory (2× as wide, 1/2 the speed) (1)
=> Waiting for the other CPU adds half a cache line (2 cycles) => 6 cycles @ 500 MHz => 24 cycles @ 2 GHz (1)
The miss penalty is 24 cycles.
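The miss penalty derivation can be checked step by step. The following is a sketch under the assumptions stated in the question (2 ns per 128-bit SRAM read, half a cache line of waiting on average); the variable names are illustrative.

```python
import math

CPU_GHZ = 2                  # core clock: 2 cycles per ns
SRAM_ACCESS_NS = 2           # one 128-bit SRAM read every 2 ns (500 MHz)
SRAM_WIDTH_BITS = 128
BUS_MHZ, BUS_WIDTH_BITS = 250, 256
LINE_BITS = 512

# Number of SRAM reads needed to fetch one cache line.
own_reads = math.ceil(LINE_BITS / SRAM_WIDTH_BITS)

# On average, half a cache line of another CPU has to be waited for.
wait_reads = own_reads / 2

# The instruction bus must not be the bottleneck: compare bandwidths in Mbit/s.
sram_mhz = 1000 // SRAM_ACCESS_NS
bus_matches_memory = BUS_MHZ * BUS_WIDTH_BITS == sram_mhz * SRAM_WIDTH_BITS

penalty_ns = (own_reads + wait_reads) * SRAM_ACCESS_NS
penalty_cycles = penalty_ns * CPU_GHZ

print(f"{own_reads} own reads + {wait_reads:.0f} waiting reads "
      f"= {penalty_ns:.0f} ns = {penalty_cycles:.0f} CPU cycles")
```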
For the following questions (c) to e)) assume software-only processing without the HW crypto core!
c) * How large is the time budget for each CPU to complete packet processing in a work-conserving operation if you assume (I) a continuous flow of shortest-size packets and (II) a continuous flow of longest-size packets? (4)
(I) Packet size: 46 B, line speed: 2 Gbit/s => 184 ns inter-arrival time (1)
=> Spread among 4 CPUs (round robin) => 736 ns per CPU (1)
(II) Packet size: 1480 B, line speed: 2 Gbit/s => 5,920 ns inter-arrival time (1)
=> Spread among 4 CPUs (round robin) => 23,680 ns per CPU (1)
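The time budgets follow directly from packet size and line rate. A minimal sketch, assuming the incoming packet consists of the 20-byte IP header plus payload (the crypto header is only added during processing); the helper name `time_budget_ns` is hypothetical.

```python
LINE_RATE_BITS_PER_NS = 2.0   # 2 Gbit/s equals 2 bits per nanosecond
NUM_CPUS = 4                  # packets are distributed round robin
IP_HEADER_BYTES = 20


def time_budget_ns(payload_bytes):
    """Return (inter-arrival time, per-CPU budget) in ns for back-to-back packets."""
    packet_bits = (IP_HEADER_BYTES + payload_bytes) * 8
    inter_arrival_ns = packet_bits / LINE_RATE_BITS_PER_NS
    return inter_arrival_ns, inter_arrival_ns * NUM_CPUS


for label, payload in (("I", 26), ("II", 1460)):
    iat, budget = time_budget_ns(payload)
    print(f"({label}) inter-arrival {iat:.0f} ns, budget per CPU {budget:.0f} ns")
```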
d) * Derive the number of CPU instructions that need to be executed in the two above-mentioned cases! Forwarding and header insertion require 120 instructions and software encryption costs 24 instructions per payload byte. (2)
(I) 120 instructions + 26 × 24 instructions = 744 instructions (1)
(II) 120 instructions + 1460 × 24 instructions = 35,160 instructions (1)
e) Is the given clock frequency of 2 GHz sufficient to process the packets in the two scenarios (I) and (II)? Assume an instruction miss rate of 4% and a data cache miss rate of 0%! If you could not solve b) assume the miss penalty to be 25 cycles, and time budgets (from c)) of 750 ns (I) and 24,000 ns (II) per CPU. (6)
CPI = CPI_CPU + CPI_MEM = 1 + 0.04 × 24 = 1.96 (2)
(I) 744 instructions × 1.96 CPI = 1,458.2 cycles @ 2 GHz => 729.1 ns => It is just sufficient here! (2)
(II) 35,160 instructions × 1.96 CPI = 68,913.6 cycles @ 2 GHz => 34,456.8 ns => It is not sufficient here! (2)
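Parts d) and e) combine into a small feasibility check. This sketch takes the 24-cycle miss penalty from b) and the per-CPU budgets from c) as given constants; the dictionary layout and names are illustrative assumptions.

```python
CPU_GHZ = 2.0                 # cycles per ns
MISS_RATE = 0.04              # instruction cache miss rate
MISS_PENALTY_CYCLES = 24      # result of part b)
BUDGET_NS = {"I": 736.0, "II": 23680.0}   # per-CPU budgets from part c)
PAYLOAD_BYTES = {"I": 26, "II": 1460}


def instruction_count(payload_bytes):
    """120 instructions for forwarding/header insertion plus 24 per payload byte."""
    return 120 + 24 * payload_bytes


# Effective CPI: ideal pipeline CPI of 1 plus the average memory stall per instruction.
cpi = 1.0 + MISS_RATE * MISS_PENALTY_CYCLES

results = {}
for case, payload in PAYLOAD_BYTES.items():
    n = instruction_count(payload)
    t_ns = n * cpi / CPU_GHZ
    results[case] = (n, t_ns, t_ns <= BUDGET_NS[case])
    print(f"({case}) {n} instructions, {t_ns:.1f} ns, fits budget: {results[case][2]}")
```

Case (I) fits with a margin of only about 7 ns, so any additional stall source not modeled here (the question sets the data cache miss rate to 0%) would break it.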
Now, we want to offload the encryption function from the CPU cores to a HW crypto core.
f) * The crypto core has an internal data path width of 16 bits and the encryption of each (16 bit) word is achieved in a single clock cycle. The longest logic path has 55 gate levels and you know the following parameters from the CMOS library:
Register: t_SU = 350 ps, t_pd = t_hold = 200 ps
Logic: t_gate = 90 ps
What is the maximum frequency that the crypto core may run at? (2)
Critical path: t_pd + 55 × t_gate + t_SU = 5.5 ns
Frequency: 181.8 MHz
g) * Is a single crypto core sufficient to process the entire traffic, if you assume that the data transfers into and out of the core are achieved with at least the same speed as the encryption itself? Derive the processing times for the two scenarios (I) and (II) and compare them to the figures obtained in c)! If you could not solve f) assume a frequency of 180 MHz! (3)
(I) 26 B = 13 words => 13 cycles @ 5.5 ns = 71.5 ns (1)
(II) 1460 B = 730 words => 730 cycles @ 5.5 ns = 4,015 ns (1)
=> Yes, it can be achieved easily! (1)
h) * As we offload the encryption to hardware, we gain a lot of headroom on our CPU cores. Could a pipelined operation of "single CPU + HW accelerator" perform the task of the NP? Think about the worst-case traffic from the perspective of the CPU now, and assume that the HW accelerator is in any case work-conserving! (3)
The CPU has to execute 120 instructions, regardless of packet length. (1)
120 × 1.96 CPI = 235.2 cycles = 117.6 ns (1)
Worst case: shortest packets with an inter-arrival time (IAT) of 184 ns (from c)) => one CPU is sufficient! (1)
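The hardware offload figures of f) to h) can be reproduced as follows. A minimal sketch: it takes the inter-arrival times (184 ns and 5,920 ns) and the CPI of 1.96 from the earlier parts as constants, and the helper name `crypto_time_ns` is hypothetical.

```python
import math

T_PD_PS, T_SU_PS, T_GATE_PS = 200, 350, 90   # CMOS library timing in ps
GATE_LEVELS = 55
WORD_BITS = 16                               # one word encrypted per clock cycle

# f) Register-to-register critical path: clock-to-output + logic + setup.
period_ps = T_PD_PS + GATE_LEVELS * T_GATE_PS + T_SU_PS
f_max_mhz = 1e6 / period_ps


def crypto_time_ns(payload_bytes):
    """g) Encryption time for one payload at the maximum clock frequency."""
    words = math.ceil(payload_bytes * 8 / WORD_BITS)
    return words * period_ps / 1000


# h) CPU share per packet once encryption is offloaded: 120 instructions at CPI 1.96, 2 GHz.
cpu_time_ns = 120 * 1.96 / 2.0

print(f"f_max = {f_max_mhz:.1f} MHz")
print(f"(I)  crypto {crypto_time_ns(26):.1f} ns vs. 184 ns inter-arrival time")
print(f"(II) crypto {crypto_time_ns(1460):.1f} ns vs. 5920 ns inter-arrival time")
print(f"CPU  {cpu_time_ns:.1f} ns vs. 184 ns worst-case inter-arrival time")
```

Both pipeline stages stay below the inter-arrival time in every case, which is the condition for a single CPU plus one accelerator to keep up with the line rate.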
7) ATM Switch System (20)
In the following we consider a 4x4 switch for ATM cells, which originate from STM-4c SDH connections. There is one receive FIFO per port and we use an on-chip bus system as switch element. The cells from the FIFO are read out with the bus width and frequency and written into an output shift register that may hold an entire cell. The shift register adapts the data to the output port width and frequency. You have a CMOS library that allows you to implement all switch-internal components at a maximum frequency of 300 MHz, and your I/Os support a maximum of 200 MHz.
[Block diagram: ATM cells from STM-4c, FIFO, Bus = switch element, Shift Reg., ATM cells to STM-4c.]
a) * Sketch the frame format (including dimensions) of STM-4c and derive the payload data rate! (6)
Rectangle + structure (2)
Correct sizes (numbers) (2)
125 µs transmission + payload calculation (2)
4 × 260 × 9 × 8 × 8000 bit/s = 599.04 Mbit/s
b) * What is the data path width and operating frequency for the switch I/O ports? If you could not solve a) assume 620 Mbit/s! (3)
w = R / f = 2.99 bit (1)
=> we choose 4 bit (1)
=> f = R / w = 149.76 MHz (1)
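The payload rate and the I/O port dimensioning can be sketched in a few lines. Rounding the width up to the next power of two is an assumption made here to explain why the solution picks 4 bit rather than 3 bit; the variable names are illustrative.

```python
import math

ROWS = 9                   # rows per SDH frame
PAYLOAD_COLS_STM1 = 260    # payload columns (bytes per row) per STM-1
FRAMES_PER_S = 8000        # one frame every 125 microseconds
IO_MAX_HZ = 200e6          # I/O cells support at most 200 MHz

# a) STM-4c carries four times the STM-1 payload columns.
payload_bps = 4 * PAYLOAD_COLS_STM1 * ROWS * 8 * FRAMES_PER_S

# b) Minimum width at the maximum I/O frequency, then rounded up
#    to a power of two (assumption, see lead-in).
min_width_bits = payload_bps / IO_MAX_HZ
io_width_bits = 2 ** math.ceil(math.log2(min_width_bits))
io_freq_mhz = payload_bps / io_width_bits / 1e6

print(f"payload rate {payload_bps / 1e6:.2f} Mbit/s, "
      f"minimum width {min_width_bits:.2f} bit, "
      f"chosen {io_width_bits} bit @ {io_freq_mhz:.2f} MHz")
```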
c) * What effect with respect to output port contention can you expect to see in such a switch type? (1) Head-of-line blocking
d) * In order to mitigate the negative effect on the system performance, you want to run the bus with a speedup of four, i.e. 4x the speed necessary for the typical implementation. Calculate the bus width and operating frequency! (4)
Aggregate throughput: 4 × 599.04 Mbit/s = 2.396 Gbit/s (1)
4x speedup: 9.585 Gbit/s (1)
w = R / f = 31.95 => 32 bit (1)
f = R / w = 299.5 MHz (1)
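The bus dimensioning with speedup follows the same pattern as the I/O ports, only with the 300 MHz core limit. A minimal sketch with illustrative names; the width is simply rounded up to the next integer, which already yields 32 bit.

```python
import math

PORTS = 4
SPEEDUP = 4
PORT_RATE_MBPS = 599.04    # STM-4c payload rate per port
CORE_MAX_MHZ = 300         # limit for switch-internal components

aggregate_mbps = PORTS * PORT_RATE_MBPS
bus_rate_mbps = SPEEDUP * aggregate_mbps

# Minimum bus width at the maximum core frequency, then the frequency
# actually needed at the chosen width.
min_width_bits = bus_rate_mbps / CORE_MAX_MHZ
bus_width_bits = math.ceil(min_width_bits)
bus_freq_mhz = bus_rate_mbps / bus_width_bits

print(f"bus rate {bus_rate_mbps / 1000:.3f} Gbit/s, "
      f"{bus_width_bits} bit @ {bus_freq_mhz:.2f} MHz")
```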
e) Assume that only a single cell is switched from an input to an output. Derive the minimum cell latency considering that the cell must be fully received by the FIFO, is then transmitted over the bus and retransmitted after the entire cell sits in the output shift register. (3)
Reception: 53 B cell => 106 clock cycles @ 149.76 MHz = 707.8 ns (1)
Bus transfer: 53 B / 32 b = 13.25 => 14 clock cycles @ 299.5 MHz = 46.7 ns (1)
=> Minimum latency: 754.5 ns, because the first byte appears on the output directly (otherwise transmission would take the same time as on input) (1)
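The latency of a single cell is the sum of the store-and-forward reception and the bus transfer. This sketch uses the port and bus parameters from b) and d) as constants; the names are illustrative.

```python
import math

CELL_BITS = 53 * 8                     # one ATM cell
IO_WIDTH_BITS, IO_MHZ = 4, 149.76      # I/O port from part b)
BUS_WIDTH_BITS, BUS_MHZ = 32, 299.52   # bus from part d)

# The cell must be completely in the FIFO before the bus transfer starts.
rx_cycles = math.ceil(CELL_BITS / IO_WIDTH_BITS)
rx_ns = rx_cycles / IO_MHZ * 1000

# The last bus word is only partially filled, so the cycle count is rounded up.
bus_cycles = math.ceil(CELL_BITS / BUS_WIDTH_BITS)
bus_ns = bus_cycles / BUS_MHZ * 1000

min_latency_ns = rx_ns + bus_ns
print(f"reception {rx_ns:.1f} ns + bus {bus_ns:.1f} ns = {min_latency_ns:.1f} ns")
```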
f) Calculate the total power consumption of the chip, given a switch core logic (i.e. FIFOs, bus, registers, ...) power dissipation of 200 mW. Assume an I/O switching factor of 0.35 and a 50 pF load capacitance. The I/Os of the switch operate on 3.3 V. Assume that you need a clock line and a data valid line in addition to the I/O data lines derived in b)! (3)
Number of outputs needed (1): 4 × 6 = 24 (only outputs count!)
I/O power consumption: P = N × a × f × C × V² (1) = 684 mW (1)
Total consumption: 884 mW
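The I/O power estimate can be recomputed from the dynamic power formula. A minimal sketch with illustrative names; evaluated without intermediate rounding it gives about 685 mW for the I/Os, which the solution above quotes as 684 mW.

```python
PORTS = 4
LINES_PER_PORT = 4 + 2         # 4 data lines plus clock and data valid
SWITCHING_FACTOR = 0.35
IO_FREQ_HZ = 149.76e6          # I/O frequency from part b)
LOAD_CAPACITANCE_F = 50e-12
IO_VOLTAGE_V = 3.3
CORE_POWER_W = 0.2             # FIFOs, bus, registers

# Only output pins drive the external load capacitance.
n_outputs = PORTS * LINES_PER_PORT

# Dynamic switching power: P = N * a * f * C * V^2
p_io_w = (n_outputs * SWITCHING_FACTOR * IO_FREQ_HZ
          * LOAD_CAPACITANCE_F * IO_VOLTAGE_V ** 2)
p_total_w = CORE_POWER_W + p_io_w

print(f"I/O power {p_io_w * 1000:.0f} mW, total {p_total_w * 1000:.0f} mW")
```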
