
Simulation of an Integrated Architecture for IP-over-ATM Frame Processing


Peter M. Ewert
Access and Switching Group, Intel Corporation, Hillsboro, Oregon, USA 97124
mark.ewert@intel.com

Naraig Manjikian
Dept. of Elec./Comp. Eng., Queen's University, Kingston, Ontario, Canada K7L 3N6
nmanjiki@ee.queensu.ca

The performance of an integrated architecture for full-duplex IP-over-ATM processing is evaluated through detailed simulation. The architecture combines processing, memory, and multiple direct-memory-access engines for single-chip implementation. The simulation models the segmentation and reassembly operations needed to translate IP frames to and from a fixed ATM cell size. A key operation is the insertion of a virtual path and virtual channel identifier (VPI/VCI) into the outgoing ATM cells. Software-based VPI/VCI insertion provides flexibility but requires the on-chip processor to perform this function. Hardware-based VPI/VCI insertion is an optimization that requires one of the direct-memory-access engines to perform this task. The two approaches are evaluated through simulated execution of representative control software with detailed modeling of all on-chip components. Results indicate that software-based VPI/VCI insertion supports full-duplex traffic at 475 Mbps on a 500-MHz processor and that hardware-based VPI/VCI insertion supports full-duplex traffic at 560 Mbps on a 500-MHz processor.

Keywords: IP-over-ATM, computer networks, processor-memory integration, computer systems, discrete-event simulation

SIMULATION, Vol. 78, Issue 4, April 2002, 249-257. DOI: 10.1177/0037549702078004543. Submission Date: May 2001. Accepted Date: December 2001. © 2002 The Society for Modeling and Simulation International.

1. Introduction

The asynchronous transfer mode (ATM) has emerged as a prominent backbone network system to support the widely used Internet protocol (IP). Many commercial vendors offer components and complete systems designed for ATM networking. The speed and capacity requirements of ATM have normally required application-specific integrated circuits (ASICs) for processing ATM cells in switching and routing devices. The potential of increased integration and speed with improvements in microelectronics technology is creating new opportunities for applying low-cost, programmable processing solutions that are integrated onto a single chip, rather than complex ASIC devices. In this manner, both cost and complexity can be reduced. This paper describes an integrated architecture for IP-over-ATM processing and the results of a performance
study conducted through detailed simulation. The architecture combines processing, memory, and multiple direct-memory-access engines in a single chip in order to process both ingress and egress IP-over-ATM traffic in duplex mode. Larger IP frames must be segmented and reassembled to conform to the fixed ATM cell size, and the simulated architecture and its representative control software model these operations appropriately. An important consideration is the insertion of a virtual path and virtual channel identifier (VPI/VCI) into outgoing ATM cells. Two different approaches are evaluated: software-based insertion and hardware-based insertion. The software approach requires the on-chip processor to perform VPI/VCI insertion and reflects the provision of maximum implementation flexibility through the generality of software. The hardware approach requires one of the direct-memory-access engines to perform VPI/VCI insertion and reflects an optimized application-specific approach. The two alternatives are evaluated through simulated execution of the representative control software with detailed modeling of the on-chip cache, bus, and memory.

The results indicate that the architecture with software-based VPI/VCI insertion supports full-duplex traffic at 475 Mbps on a 500-MHz processor and that hardware-based VPI/VCI insertion supports full-duplex traffic at 560 Mbps on a 500-MHz processor.

The remainder of this paper is organized as follows. Section 2 presents background and a summary of related work. Section 3 describes the hardware architecture considered in this paper. Section 4 outlines the functionality of the representative control software. Section 5 describes the simulation environment and the experimental results. Finally, Section 6 presents conclusions and directions for future work.

2. Background and Related Work

The broadband integrated digital services network (B-ISDN) standard defines two layers that characterize the ATM protocol: the ATM layer and the ATM adaptation layer (AAL) [1]. The ATM layer defines the transmission of fixed-size cells (53 bytes) through logical connections in a network. The AAL maps higher-level applications such as voice and data to ATM cells. AAL Type 5 has become widely accepted as the standard for ATM traffic. It defines a cell with a 5-byte header and a 48-byte data payload. A higher-level packet data unit (PDU), such as an IP frame with a logical link control (LLC) header, is segmented into 48-byte units that will be the payloads for a sequence of ATM cells. The final payload in the sequence contains the remaining bytes of the PDU, any necessary padding, and an 8-byte trailer used by the AAL. The trailer includes a 4-byte cyclic redundancy code (CRC). Each 48-byte data payload is appended to a 5-byte header that defines the logical route that the cell will take through the network using virtual path and channel identifiers (VPIs/VCIs). At each point in the network, the sequence of ATM cells for a PDU may be reassembled to verify its integrity using the embedded CRC and to determine the remainder of the route that the PDU will take toward its destination.

There has been considerable research and commercial development for architectures and devices related to network processing in general, and to ATM segmentation and reassembly in particular. Some of these developments reflect the trend toward greater integration of processing, storage, and other logic in the same package [2]. An example of the potential for integration is the M32R/D from Mitsubishi that combines 2 Mbytes of DRAM with a 32-bit microprocessor in a single chip for embedded applications [3]. Continuing improvements in microelectronics fabrication technology make it feasible to consider higher levels of integration for future devices. Cranor, Gopalakrishnan, and Onufryk [4] propose an architecture called UNUM for integrating ATM-related functionality into a specialized multithreaded processor. Multiple register files are provided to reduce the overhead of processing interrupts. A bus interface unit is provided for separate memory and I/O buses.

Simulation results are reported for a 200-MHz processor that can receive and reassemble ATM cells with a throughput of between 100 Mbps and 570 Mbps. There is, however, only limited discussion of the control software. The IXP1200 Network Processor [5] for IP packet processing embodies another approach toward increased on-chip integration. This product integrates a general processor with six microengines that each support up to four threads for improving the throughput of IP address table lookup in external main memory by overlapping both processing and memory accesses. It does, however, require careful programming with specialized tools. Shimonishi and Murase [6] describe a parallel architecture with multiple processor cores for IP packet forwarding and ATM segmentation and reassembly. Although they mention the use of on-chip memory, they do not adequately explain its role with respect to integration. Elkateeb and Elbeshti [7,8] consider the requirements for embedded processing in an ATM user-network interface. Their approach, however, requires a direct-memory-access engine operating much faster than the processor, and they do not appear to include CRC-related processing. Hobson and Wong [9] propose a parallel architecture for ATM cell reassembly for a user-network interface. By using 32-bit datapaths, it is reported that rates of up to 700 Mbps are possible. O'Connor and Gomez [10] describe the iFlow processor that contains embedded DRAM for packet routing lookups and some on-chip processing. This product further demonstrates the feasibility of processor-memory integration, albeit for a different network-related application.

3. Integrated Hardware Architecture

This section describes an integrated architecture dedicated to segmentation and reassembly of duplex IP-over-ATM traffic. The architecture is modeled as an integrated processor-memory package on a single chip that performs all of the processing and buffering required for an IP-over-ATM line card that is typically connected to an internal switching fabric in a commercial ATM networking device [11]. Earlier work by the authors documents a receiver-only architecture and its modeling in simulation [12]. This paper considers the issues for a full-duplex architecture.

Figure 1 shows the key components of the duplex architecture. The processor executes code for packet reassembly, buffer management, error checking, and other functions. The on-chip memory is intended for buffering cells and packets, as well as storing code and data for the control software. A cache line size of 64 bytes is used throughout the architecture to match the 53-byte ATM cell size; the additional bytes per cache line provide space for linked-list pointers and other information (a sketch of one possible 64-byte node layout is given below). The interface-receive direct-memory-access (IRX DMA) engine accepts incoming cells, and the fabric-transmit DMA (FTX DMA) engine forwards cells to the internal switching fabric.
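To make this buffering scheme concrete, the structure below sketches how a 53-byte cell and its bookkeeping fields might share one 64-byte node. The field names and the choice of metadata are illustrative assumptions, not the layout of the actual implementation.

    #include <cstdint>

    // Hypothetical layout of one 64-byte buffer node holding a single ATM cell.
    // The 53-byte cell leaves 11 bytes per cache line, assumed here to hold a
    // linked-list pointer and small per-cell bookkeeping fields.
    struct CellNode {
        uint8_t  header[5];    // 5-byte ATM cell header (VPI/VCI, PT, CLP, HEC)
        uint8_t  payload[48];  // 48-byte AAL5 payload
        uint8_t  flags;        // e.g., last-cell-of-PDU marker (illustrative)
        uint16_t vc_index;     // e.g., reassembly hash-table index (illustrative)
        uint32_t next;         // link to the next 64-byte node, stored as a node index
    };

    static_assert(sizeof(CellNode) <= 64, "one node must fit in a 64-byte cache line");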


Figure 1. Duplex architecture: an integrated package containing the processor, cache, SDRAM memory, and the IRX, FTX, FRX, and ITX DMA engines on an internal bus, connected externally to the optical receiver/transmitter (ingress and egress ATM cells) and to the switch fabric.

The FTX DMA also performs CRC validation in hardware, using interrupts to synchronize with the processor. The fabric-receive (FRX) DMA engine accepts cells from the internal switching fabric, and the interface-transmit (ITX) DMA engine forwards cells to their next network destination.

The IRX DMA engine writes received cells to the on-chip memory. The control software provides information to the IRX on where to write cells in memory. The IRX, in turn, informs the processor through an interrupt to examine received cells. It is assumed that the error-correcting code in the ATM header for each cell is checked externally, as this is part of the transmission convergence layer [1]. Similarly, generation of the header error-correcting code is assumed to be done externally. After received cells are reassembled and processed to form a PDU, the FTX DMA engine sends buffered PDUs to the internal fabric. Each reassembled PDU is stored in memory by the control software using a linked list of 64-byte nodes that is traversed by the FTX. The FRX DMA engine receives PDUs from the switch fabric and segments them into linked lists of 64-byte nodes in memory. The ITX DMA engine traverses the linked list for each PDU and transmits the ATM cells to their next network destination.

The internal bus provides the interconnection for the processor, memory, and DMA engines. The bus is 128 bits wide and runs at 100 MHz; this arrangement is feasible for on-chip integration, as current microprocessors have similar widths [13]. The bus request/response protocol is atomic. A basic round-robin arbitration mechanism is provided for fairness, but to improve traffic processing performance (i.e., reduce dropped packets), the receive DMA engines are given higher priority in arbitration, while the processor and transmit DMA engines receive lower priority. The bus can transfer 1.6 GB per second, which is 10 times higher than the minimum input/output bandwidth for the standard OC-12 622-Mbps rate. The 100-MHz clock rate implies that the startup latency will be at least 10 ns for any transfer across the bus.

For the purpose of this research, a single-issue, pipelined processor is modeled with the assumption that instructions take one cycle. Memory reference instructions add delay corresponding to the depth of the memory reference miss (L1 cache, L2 cache, or memory). For an ATM cell rate f_c, a processor with a clock frequency f_p, and instruction issue width W, the maximum number of instructions that can be executed between cell arrivals is given by (f_p / f_c) · W. This measure could be compared against the software instruction count, but cache and memory access latencies also have to be considered, hence detailed simulation is appropriate (a short worked evaluation of this bound is given below).

A two-level, direct-mapped cache is modeled. The primary (L1) cache has a capacity of 8 kbytes and the secondary (L2) cache has a capacity of 256 kbytes. Writebacks are used for cache-to-memory coherence, and invalidations using the MESI protocol [14] are used for cache-to-DMA coherence. In this study, the MESI protocol is optimized to exploit the fact that the FTX DMA engine does not modify cache lines. Any modified data in the cache that are read by the FTX do not cause invalidations or writes to memory, thereby improving performance considerably. The on-chip memory has a 128-bit width to match the bus and is assumed to be similar to synchronous DRAM with burst timing of 50 ns/10 ns/10 ns/10 ns. The ideal available memory bandwidth would be 800 Mbytes/s, and the combination of a 16-byte memory bus and a 64-byte cache line allows ATM cells to be read from memory in 100 ns and written to memory in 60 ns. Both the bus and memory speeds are conservative for on-chip integration, hence any performance results can be interpreted as lower bounds.
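As a rough check on this bound, the short calculation below evaluates (f_p / f_c) · W for line rates and processor speeds used in this paper, assuming a single-issue processor (W = 1); it deliberately ignores cache and memory latencies, which is why detailed simulation is still required.

    #include <cstdio>

    // Maximum instructions between cell arrivals: (f_p / f_c) * W, where the cell
    // rate f_c follows from the line rate and the 53-byte (424-bit) ATM cell.
    double instruction_budget(double line_rate_bps, double proc_hz, double issue_width) {
        double cell_rate = line_rate_bps / (53.0 * 8.0);   // cells per second
        return (proc_hz / cell_rate) * issue_width;
    }

    int main() {
        // OC-3 (155 Mbps) with a 100-MHz single-issue processor, and
        // OC-12 (622 Mbps) with a 500-MHz single-issue processor.
        std::printf("OC-3,  100 MHz: about %.0f instructions per cell\n",
                    instruction_budget(155e6, 100e6, 1.0));   // prints about 274
        std::printf("OC-12, 500 MHz: about %.0f instructions per cell\n",
                    instruction_budget(622e6, 500e6, 1.0));   // prints about 341
        return 0;
    }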

4. Control Software Organization

This study uses simulation to model the execution of representative control software on the proposed integrated hardware architecture. The control software is divided into receive (interface-to-fabric) and transmit (fabric-to-interface) processing, corresponding to the upper and lower parts, respectively, of Figure 1. In the absence of an appropriate benchmark, the aim of the software implementation is to provide a realistic workload for instruction-level simulation of the hardware architecture discussed in Section 3, without addressing all errors and exceptions as in a commercial product. The implementation, however, does initiate CRC verification in a hardware module, and its contribution to any hardware processing delay (no worse than 10 ns/byte, as in our earlier work [12]) is modeled in the simulation. The software explicitly performs all of the necessary data structure manipulations with actual instructions for execution-driven simulation. Interrupt service routines are also included in the software; the simulator models the interrupt latency and the instruction overhead for servicing each interrupt. The control software is structured as a main loop that determines, in each iteration, whether receiver-related or transmitter-related processing is necessary. Flags and counters set by the interrupt service routines affect whether such processing is performed. When no immediate action is needed, the processor will busy-wait or spin (a simplified sketch of this loop is given below).

4.1 Receiver Portion of the Control Software

The control software executing on the integrated processor must receive cells for a PDU from the IRX in Figure 1. A list of free cells in memory is managed by the software for this purpose. The control software indicates the start of the freelist to the IRX, which writes cells into memory and signals the processor through an interrupt that newly received cells are available. The control software responds to this interrupt, examines the received cells, and groups them by virtual connection (VC) identifier in a hash table to support fast lookup for reassembly. Once all of the cells in a PDU have arrived, the reassembled PDU is placed by reference in an outgoing buffer. The processor then signals the FTX in Figure 1 (with an uncached write operation) that a reassembled PDU is available for CRC validation and transmission through the switch fabric to its next destination. Once the FTX has completed validation and transmission of a PDU, it signals the processor through an interrupt that the cells in memory are available for reuse.

4.2 Transmitter Portion of the Control Software

The transmitter software only requires one separate linked list as part of its functions for fabric processing and interface processing. When the control software first starts, it initializes memory and gives the head of the linked list to the FRX in Figure 1 for fabric processing. As the FRX receives PDUs from the fabric, it segments the cells and writes them to memory by traversing the linked list. After a number of cells have been written, the FRX interrupts the processor for subsequent action on the received cells. The processor, in turn, informs the ITX in Figure 1 that cells are ready for transmission for interface processing. The ITX reads and transmits each cell through the interface, interrupting the processor to signal successful transmission and thereby allow the memory of the transmitted cells to be reclaimed.
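The organization described in this section can be summarized by the sketch below. It is a simplification for illustration only: the flag names and the division of work into four handler routines are assumptions that mirror the description above, not the actual control software.

    #include <cstdint>

    // Flags and counters set by the interrupt service routines (ISRs); they are
    // volatile because the ISRs update them asynchronously to the main loop.
    volatile uint32_t irx_cells_received = 0; // IRX wrote new cells to memory
    volatile uint32_t ftx_pdus_sent      = 0; // FTX validated and forwarded PDUs to the fabric
    volatile uint32_t frx_cells_written  = 0; // FRX segmented fabric PDUs into cells
    volatile uint32_t itx_cells_sent     = 0; // ITX transmitted cells on the line

    // Handler routines; their bodies are omitted here and simply clear the flags.
    static void reassemble_received_cells() { irx_cells_received = 0; } // group by VC, hand PDUs to FTX
    static void reclaim_receive_buffers()   { ftx_pdus_sent = 0; }      // return cells to the freelist
    static void prepare_cells_for_itx()     { frx_cells_written = 0; }  // (software scheme) insert VPI/VCI, notify ITX
    static void reclaim_transmit_buffers()  { itx_cells_sent = 0; }     // reclaim transmitted cells

    int main() {
        for (;;) {
            if (irx_cells_received) reassemble_received_cells();   // receive path
            if (ftx_pdus_sent)      reclaim_receive_buffers();
            if (frx_cells_written)  prepare_cells_for_itx();       // transmit path
            if (itx_cells_sent)     reclaim_transmit_buffers();
            // Otherwise the processor spins (busy-waits) until an ISR raises a flag.
        }
    }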

4.2.1 Software and Hardware Schemes for VPI/VCI Insertion

An important step that must be performed immediately before cells are transmitted through the interface by the ITX is to insert the appropriate VPI/VCI information into the outgoing cells. This step is necessary for the ATM protocol because switches along the route for a given cell use different logical path and circuit identifiers for each link. In its simplest form, the FRX receives a PDU or frame from the fabric and simply segments it into 48-byte units that are placed in a linked list in memory for final transmission. The processor is then responsible for inserting the correct VPI/VCI information (which could be embedded in the beginning of the frame coming from the fabric) into each subsequent cell until the last cell in the frame is processed. The processor then informs the ITX that ATM cells are ready for transmission. This software-based approach retains flexibility and simplifies the hardware at the expense of software overhead to read and manipulate cells (a sketch of this per-cell header update is given below). An alternative is to introduce additional hardware in the FRX component to repeatedly insert the necessary VPI/VCI information for each cell during segmentation. With this approach, the processor does not have to read the cells. Instead, all it has to do is service interrupts from the FRX and then inform the ITX that cells are ready to transfer. The processor could, however, read and modify cells after the FRX has completed its task.

5. Experimental Results

This section describes the simulation environment in which the execution of the control software on the proposed architecture was modeled. A front-end simulator was used to trace the instructions in the control software, and the resulting memory reference trace was used to drive a discrete-event, back-end simulator for the system architecture. Results of the simulation experiments are then described, focusing on the ability to sustain cell arrival rates and the corresponding device utilizations.

5.1 Instruction Tracing

The SimpleScalar [15] functional simulator is used as a front-end instruction tracing tool for the control software described in Section 4. The SimpleScalar instruction set is modeled directly on the MIPS architecture using a version of the GNU compiler and assembler that permits the generation of executable code for simulation. The trace of execution generates a memory reference stream that drives the back-end simulator for the system. A custom interface provides the means for the front end to pass information about each memory reference and the time separation between memory references to the back end. There is also feedback from the back end to the front end in order to model interrupts and change the flow of simulated execution in the code.
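To illustrate the software-based scheme of Section 4.2.1, the sketch below walks a segmented PDU and rewrites the VPI/VCI fields of each cell header. The simplified node layout mirrors the sketch in Section 3 and the function names are assumptions; the bit packing follows the standard UNI cell header format (4-bit GFC, 8-bit VPI, 16-bit VCI).

    #include <cstdint>

    // Simplified stand-in for the 64-byte buffer node sketched in Section 3.
    struct CellNode {
        uint8_t   header[5];    // ATM cell header
        uint8_t   payload[48];  // AAL5 payload
        CellNode* next;         // linked-list pointer
    };

    // Overwrite the VPI/VCI fields of one UNI-format cell header: byte 0 carries
    // the GFC and VPI[7:4], byte 1 carries VPI[3:0] and VCI[15:12], byte 2 carries
    // VCI[11:4], and byte 3 carries VCI[3:0] plus the PT and CLP bits.  The HEC
    // (byte 4) is assumed to be generated externally, as stated in Section 3.
    void insert_vpi_vci(uint8_t hdr[5], uint8_t vpi, uint16_t vci) {
        hdr[0] = uint8_t((hdr[0] & 0xF0) | (vpi >> 4));
        hdr[1] = uint8_t((vpi << 4) | ((vci >> 12) & 0x0F));
        hdr[2] = uint8_t(vci >> 4);
        hdr[3] = uint8_t((vci << 4) | (hdr[3] & 0x0F));   // preserve PT and CLP
    }

    // Software scheme: the processor walks the segmented PDU and touches every
    // cell header before informing the ITX that the list is ready to transmit.
    void stamp_outgoing_pdu(CellNode* first, uint8_t vpi, uint16_t vci) {
        for (CellNode* n = first; n != nullptr; n = n->next)
            insert_vpi_vci(n->header, vpi, vci);
    }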


5.2 System Simulation

A model of the duplex architecture discussed in Section 3 was implemented for simulation in Quasar, the Queen's University Architectural Simulation Archive [16]. Quasar is a back-end, discrete event-driven simulator that provides a flexible, object-oriented framework in C++ for modeling various architectures. It interfaces with a front-end tool for instruction tracing. Quasar uses logical connections between the objects representing a system model to pass time-stamped messages for scheduling future events. Ports within an object define the endpoints of these logical connections, which may be one-to-one, one-to-many, or many-to-one. One send operation on a port schedules a future event for all objects that are connected to that port. The objects instantiated within Quasar encapsulate the functionality of the architectural components such as the processor, caches, bus, and main memory. The processor object, in particular, serves as the source of events arising from the memory reference stream produced by the front-end simulator. Quasar defines a common interface between simulation objects and the event-driven core in order to guide the development of new classes of objects. With this capability, additional objects are introduced to model the ATM components that are relevant to this study, such as the specialized DMA engines and the external cell traffic sources and sinks.

Figure 2 shows the system model implemented in Quasar for this study, with components in the duplex architecture represented by ellipses. The logical connections are used for scheduling future events, as explained above (an illustrative sketch of this port mechanism follows Figure 2). Object connections, on the other hand, are used in this study to connect utility objects outside the event-driven core of Quasar in order to add additional functionality to the component objects. The ATMnode object and the VCLLCGenerator object generate ATM traffic with different PDU sizes and VC/LLC identifiers. The Processor object (with a Cache object for modeling cache accesses and a TimeProfiling object for utilization measurements) models the processor. The Bus object models round-robin arbitration, bus requests, and bus transfer delays. The Memory object models the memory access delays. The Fabric object models fabric requests and fabric transfer delay. The RXDMAengine object can model either the IRX or FRX DMA engine. When a simulation is initialized, two RXDMAengine objects are instantiated, with one configured to receive ATM cells from the simulated external interface and the other configured to receive ATM cells from the simulated internal switching fabric. When used for the FRX, the RXDMAengine utilizes a BufferMgt object to allow for PDU buffering and segmentation. In a similar manner, the TXDMAengine object can model either the ITX or FTX DMA engine. In Figure 2, the ATM cell sink for the ITX is not shown because the sink is actually modeled within the TXDMAengine, which adds appropriate delays according to the ATM line rate being used. Finally, the Monitoring and TimeMonitoring utility objects are used for monitoring simulated memory usage within the architecture and recording simulation time, respectively, thereby providing a mechanism to measure system performance.

In the simulation model, cells are generated to constitute PDUs and passed to the IRX DMA engine, from where they are written to memory for subsequent reassembly of PDUs. The cell traffic is constant bit rate (CBR) in order to determine the rate that the architecture can sustain for various processor speeds. The IRX interacts with the processor through a simulated interrupt that alters the flow of execution in the control software. The processor reassembles PDUs from the received cells, then signals the FTX, which performs CRC validation and transmits the PDUs to the internal switching fabric. The fabric loops the PDUs back to the FRX for simulation of full-duplex traffic. The fabric bandwidth is modeled as four times larger than the incoming line rate. The FRX segments the PDUs from the fabric into cells and writes the cells to memory. The FRX also inserts the VPI/VCI for each cell if hardware VPI/VCI insertion is being modeled. Finally, the ITX reads the cells from memory and transmits them by simulating the ATM line rate egress delay.

5.3 Simulation Results

This section presents the simulation results for the duplex architecture. Both the receiver and transmitter components of the software have independent 16,384-cell freelists for the results reported here. The performance of the software-based and hardware-based approaches for VPI/VCI insertion is compared and explained.

5.3.1 Software-Based VPI/VCI Insertion

Figure 3 shows the results for software-based VPI/VCI insertion in a duplex simulation at an OC-3 (155 Mbps) line rate using constant-bit-rate traffic consisting of single-cell PDUs. The smallest PDU size is the worst case because the overhead for PDU processing cannot be amortized across many cells. Figure 3(a) shows the memory usage for the transmitter and receiver components on 76-MHz and 100-MHz systems. The vertical axis represents the freelist size, and the horizontal axis represents the number of cells received at the IRX. This graph provides a means of determining system stability, which in this context reflects the ability to sustain the peak arrival rate. The 100-MHz system is stable because the free memory plot is flat. The 76-MHz system, however, is unstable; the receiver component is rapidly running out of memory because it cannot match the rate of incoming cells. Figure 3(b) shows similar results, but instead compares the number of incoming cells at the IRX with the number of outgoing cells at the ITX. When inflow does not equal outflow, the system is not stable.


Figure 2. Relationships between objects in Quasar for modeling the duplex architecture (Processor.cc with Cache.cc and TimeProfiling.cc, Bus.cc, Memory.cc, Fabric.cc, two RXDMAengine.cc and two TXDMAengine.cc instances, BufferMgt.cc, ATMnode.cc with VCLLCGenerator.cc, and the Monitoring.cc and TimeMonitoring.cc utility objects; the legend distinguishes logical connections from object connections).
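To make the port mechanism concrete, the fragment below sketches how a time-stamped message could be delivered to every object attached to a port and then dispatched in time order by an event-driven core. The class and method names are invented for illustration and are not taken from the actual Quasar source.

    #include <cstdio>
    #include <cstdint>
    #include <queue>
    #include <vector>

    // Hypothetical miniature of Quasar-style scheduling: one send on a Port
    // schedules a future, time-stamped event for every connected object.
    class SimObject {
    public:
        virtual void handle(uint64_t now_ns, int payload) = 0;
        virtual ~SimObject() = default;
    };

    struct Event { uint64_t time_ns; SimObject* target; int payload; };
    struct LaterFirst {
        bool operator()(const Event& a, const Event& b) const { return a.time_ns > b.time_ns; }
    };
    using EventQueue = std::priority_queue<Event, std::vector<Event>, LaterFirst>;

    class Port {                                     // endpoint of a logical connection
    public:
        void connect(SimObject* obj) { targets_.push_back(obj); }   // one-to-many allowed
        void send(EventQueue& q, uint64_t now_ns, uint64_t delay_ns, int payload) {
            for (SimObject* t : targets_)            // one future event per connected object
                q.push(Event{now_ns + delay_ns, t, payload});
        }
    private:
        std::vector<SimObject*> targets_;
    };

    struct Printer : SimObject {
        void handle(uint64_t now_ns, int payload) override {
            std::printf("t=%llu ns, payload=%d\n", (unsigned long long)now_ns, payload);
        }
    };

    int main() {
        EventQueue q;
        Port port;
        Printer a, b;
        port.connect(&a);
        port.connect(&b);
        port.send(q, 0, 10, 42);       // schedules two events, both at t = 10 ns
        while (!q.empty()) {           // event-driven core: dispatch in time order
            Event e = q.top(); q.pop();
            e.target->handle(e.time_ns, e.payload);
        }
        return 0;
    }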

Figure 3. Results for software-based VPI/VCI insertion for OC-3 (single-cell PDUs). (a) Memory usage: free memory (cells) versus cell count for the Rx and Tx components at 76 MHz and 100 MHz (20,000 cells). (b) In versus out performance: cell count at the IRX (in) and ITX (out) versus simulation time (ns) at 76 MHz and 100 MHz (10,000 cells).

Figure 4 examines the performance results for a system with the highest possible throughput in the software-based VPI/VCI insertion duplex system for worst-case, single-cell PDU traffic, which has been experimentally determined to be 475 Mbps. Increased contention for the bus and memory (whose speeds remain constant) plays a role in limiting the throughput. Figure 4(a) shows that for 475 Mbps, a processor speed of 500 MHz is required, while at 400 MHz the receiver component rapidly runs out of memory. The reduced vertical scale in Figure 4(b), however, shows that with an increase in PDU size (in this case, 800 cells/PDU), 400 MHz is stable. An additional interesting observation is that there is a discernible delay gap between the receiver and transmitter components; this is because of the extra processing needed for the software to reassemble PDUs before they are passed to the FTX.

Figure 5 shows the device utilizations for worst-case, single-cell PDU traffic for different line rates. For the 155-Mbps and 475-Mbps rates, stable processor frequencies of 100 MHz and 500 MHz, respectively, were used. The processor rates were chosen for the corresponding line rates because the systems are at the threshold of stability. The bus and memory speeds are constant, and the intent is to highlight bus and memory utilization. The graph represents average device utilizations while processing 100,000 cells. P.Spin represents processor utilization, including any spin or busy-waiting time, but excluding the time for servicing cache misses. P.Busy represents processor utilization for useful work, that is, excluding spin time and cache miss service time. Bus utilization reflects time when the bus carries address or data information, or when it is held for the atomic request/response protocol.
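These utilization categories can be thought of as accumulated-time counters maintained per device. The sketch below shows one hypothetical way such counters could be kept for the processor; it mirrors the definitions above but is not a description of the actual TimeProfiling object.

    #include <cstdint>

    // Hypothetical accumulation of the processor utilization categories of Figure 5.
    // Each simulated interval is attributed to exactly one category.
    struct ProcessorUtilization {
        uint64_t busy_ns  = 0;   // useful work
        uint64_t spin_ns  = 0;   // busy-waiting in the main loop
        uint64_t miss_ns  = 0;   // servicing cache misses (bus/memory waits)
        uint64_t total_ns = 0;

        void add_busy(uint64_t ns) { busy_ns += ns; total_ns += ns; }
        void add_spin(uint64_t ns) { spin_ns += ns; total_ns += ns; }
        void add_miss(uint64_t ns) { miss_ns += ns; total_ns += ns; }

        // P.Busy counts useful work only; P.Spin adds spin time.  Both exclude
        // cache miss service time, as described in the text.
        double p_busy() const { return total_ns ? double(busy_ns) / total_ns : 0.0; }
        double p_spin() const { return total_ns ? double(busy_ns + spin_ns) / total_ns : 0.0; }
    };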


Figure 4. Results for software-based VPI/VCI insertion for 475 Mbps. (a) Memory usage, single-cell PDUs: free memory (cells) versus cell count for the Rx and Tx components at 500 MHz and 400 MHz (20,000 cells). (b) Memory usage, 400 MHz, 800 cells/PDU: free memory (cells) versus cell count (10,000 cells).

Figure 5. Utilizations for the duplex architecture: P.Spin, P.Busy, bus, and memory utilization at the 155-Mbps and 475-Mbps line rates.

Figure 6. Duplex code, P.Busy fraction: percentage of total time spent on IRX-, FTX-, FRX-, and ITX-related processing and interrupt service routines (ISRs), with and without cache miss time.

Memory utilization reflects the time that data are either being read from or written to the DRAM. For both line rates shown in Figure 5, P.Busy is close to P.Spin, thereby showing that the processor does not have a significant amount of idle spin time. For the 155-Mbps simulation, the P.Spin utilization is close to 100%, indicating that the processor spends a small proportion of time waiting to access the bus. Conversely, the 475-Mbps simulation has a much lower P.Spin utilization. This implies that the processor spends a larger proportion of its time waiting for bus arbitration and memory accesses. The high bus and memory utilizations for the 475-Mbps rate validate this hypothesis.

Figure 6 shows the breakdown of the processor utilization for the P.Busy fraction from Figure 5 in a 500-MHz system with single-cell PDU traffic at 475 Mbps. The utilization is divided into the processing related to each of the DMA engines as well as the interrupt service routines. The leftmost bar in Figure 6 excludes spin time but includes the time for cache misses. Conversely, the rightmost bar in Figure 6 excludes cache misses. The processing for the reception of ATM cells (IRX) and the transmission of PDUs onto the fabric (FTX) composes the majority of the time. The processing related to transmitting cells through the interface (ITX) is a very small proportion. For the software-based insertion of the VPI/VCI into each cell (categorized as time for the FRX), the difference between the two bars in Figure 6 clearly shows that this processing is dominated by waiting for memory accesses. This is because FRX-related processing in this case is quite simple: the software reads each cell from memory and inserts the VPI/VCI.

5.3.2 Hardware-Based VPI/VCI Insertion

With hardware-based VPI/VCI insertion, the FRX DMA engine inserts the VPI/VCI into each cell. Therefore, the processor does not have to read and process each cell. It only has to monitor the cell count and synchronize with the FRX and ITX DMA engines.
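For comparison with the software scheme, the sketch below gives a behavioral picture of hardware-based insertion: during segmentation, the FRX stamps the VPI/VCI into each header as it writes the payloads to memory, so the processor never reads the cells. The function and field names are assumptions, and the code describes behavior as it might be modeled in simulation rather than a hardware design.

    #include <cstdint>
    #include <cstring>
    #include <cstddef>

    struct CellNode { uint8_t header[5]; uint8_t payload[48]; CellNode* next; };

    // Behavioral sketch of FRX segmentation with hardware VPI/VCI insertion:
    // the engine splits a fabric PDU (assumed to already carry its AAL5 trailer
    // and padding, so its length is a multiple of 48) into 48-byte payloads and
    // stamps the VPI/VCI into each header as the nodes are written to memory.
    void frx_segment_with_insertion(const uint8_t* pdu, std::size_t len,
                                    CellNode* freelist, uint8_t vpi, uint16_t vci) {
        CellNode* n = freelist;
        for (std::size_t off = 0; off < len && n != nullptr; off += 48, n = n->next) {
            // VPI/VCI packed as in the earlier software sketch (UNI header format);
            // PT/CLP are left zero and the HEC is assumed to be generated externally.
            n->header[0] = uint8_t(vpi >> 4);
            n->header[1] = uint8_t((vpi << 4) | ((vci >> 12) & 0x0F));
            n->header[2] = uint8_t(vci >> 4);
            n->header[3] = uint8_t(vci << 4);
            n->header[4] = 0;
            std::memcpy(n->payload, pdu + off, 48);
        }
        // Only now is the processor interrupted; it never has to touch the cells,
        // which is what distinguishes this scheme from software-based insertion.
    }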

Figure 7. Throughputs for hardware and software VPI/VCI insertion: maximum sustainable rate (Mbps) versus number of cells per PDU for hardware insertion at 500 MHz and 300 MHz and software insertion at 500 MHz (100,000 cells).

Figure 8. Utilizations for hardware and software VPI/VCI insertion, 500 Mbps: P.Spin, P.Busy, bus, and memory utilization for software insertion at 500 MHz and hardware insertion at 500 MHz and 300 MHz.

Figure 7 compares the maximum throughput for a range of PDU sizes for systems with software-based and hardware-based VPI/VCI insertion. Hardware-based insertion increases the worst-case, single-cell PDU performance to 560 Mbps from the value of 475 Mbps achieved with the software-based approach. The maximum throughput for larger PDU sizes is also increased, from approximately 505 Mbps to 615 Mbps. Furthermore, the processor speed can be reduced to 300 MHz for the hardware-based approach without affecting the maximum throughput for larger PDUs. Due to PDU processing overhead, however, the slower processor lowers the throughput for small PDUs.

Figure 8 compares the utilization for hardware-based and software-based VPI/VCI insertion with 500-Mbps, 100-cell PDU traffic. As before, P.Spin includes any spin or busy-waiting time, but both P.Spin and P.Busy exclude cache miss service time for the processor. P.Spin is low for the software-based VPI/VCI insertion due to bus and memory contention. Hardware-based insertion reduces the need for the processor to use the bus and memory, and as a result, P.Spin is increased significantly. P.Busy does not change significantly because approximately the same amount of processing work is being done. However, with a faster processor in hardware insertion, the processor has more idle time, hence P.Busy is reduced somewhat. The last important observation to be made from Figure 8 is that the bus and memory utilizations are quite high for hardware insertion. In fact, it has been measured that with 610-Mbps traffic the bus and memory utilizations are approximately 100% and 90%, respectively. This shows that regardless of the processor speed, the maximum rate that can be supported by the duplex system for the choice of bus and memory parameters is 610 Mbps.

6. Summary and Conclusions

This paper has presented an integrated architecture for duplex IP-over-ATM processing and the simulation results used to evaluate its performance. An important consideration was the insertion of virtual path and virtual channel identifiers (VPI/VCI) into outgoing cells. Both software-based and hardware-based approaches for VPI/VCI insertion were evaluated. These two alternatives represent two extremes: flexibility through software generality and performance through application-specific hardware. For software-based VPI/VCI insertion, simulation results indicated that a full-duplex 475-Mbps line rate can be supported with a 500-MHz processor. For the hardware-based approach, the achievable rate was 560 Mbps with a 500-MHz processor. For lower line rates, such as 155-Mbps OC-3, the processor speed can be reduced to 100 MHz for the software-based approach.

For future work, methods of improving the performance of the duplex architecture can be investigated. As discussed in the experimental results, performance at high line rates is limited by the bus and memory, which approach 100% utilization. Improvements such as a faster bus, a split-transaction protocol, a wider path, or multibank memory can be considered. Another enhancement is to assign more of the necessary processing to the DMA engines for greater parallelism and to reduce the load on the main processor. A natural extension of this idea would be to add one or more general processors, provided that the bus and memory bottleneck was also addressed so that processor utilization is adequate. The feasibility of integrating all of the components for the duplex architecture on a single chip was investigated in earlier work [17]. For a commercially available fabrication process from IBM, it was determined that it would be possible to include at least 16 Mb of memory along with the processor, cache, and DMA engines. A final direction for future work would be to investigate implementation-related issues in more detail.

7. Acknowledgements

This research was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Graduate Scholarship Program (OGS), Communications and Information Technology Ontario (CITO), and Queen's University. Contributors to the development of the base Quasar simulation framework that was extended and used in this work include G. Lewis, P. McHardy, T. Chong, A. Chow, and L. Wang.

8. References
[1] McDysan, D. E., and D. L. Spohn. 1994. ATM theory and application. New York: McGraw-Hill.
[2] Patterson, D., T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. 1997. A case for intelligent RAM. IEEE Micro 17 (2): 34-44.
[3] Nunomura, Y., T. Shimizu, and O. Tomisawa. 1997. M32R/D: Integrating DRAM and microprocessor. IEEE Micro 17 (6): 40-47.
[4] Cranor, C. D., R. Gopalakrishnan, and P. Z. Onufryk. 2000. Architecture considerations for CPU and network interface integration. IEEE Micro 20 (1): 18-26.
[5] Intel Corporation. 2001. IXP1200 network processor datasheet. Available from http://developer.intel.com.
[6] Shimonishi, H., and T. Murase. 2001. A network processor architecture for very high speed line interfaces. Journal of Communications and Networks 3 (1): 88-95.
[7] Elkateeb, A., and M. Elbeshti. 1999. An evaluation of the AAL and ATM protocols processing requirements for the network interfaces design. In Proceedings of the 1999 Symposium on Performance Evaluation of Computer and Telecommunication Systems, Chicago, IL, July, pp. 13-16.
[8] Elkateeb, A., and M. Elbeshti. 2000. Evaluating of an embedded RISC core performance for the ATM network interface processing. In Proceedings of the 2000 Symposium on Performance Evaluation of Computer and Telecommunication Systems, Vancouver, BC, July, pp. 363-69.
[9] Hobson, R., and P. Wong. 1999. A parallel embedded-processor architecture for ATM reassembly. IEEE/ACM Transactions on Networking 7 (1): 23-37.
[10] O'Connor, M., and C. Gomez. 2001. The iFlow address processor. IEEE Micro 21 (2): 16-23.
[11] Cisco Corporation. 2001. Product overview: Cisco 12008 router. Available from http://www.cisco.com.
[12] Ewert, P. M., and N. Manjikian. 2001. Hardware/software tradeoffs for IP-over-ATM frame reassembly in an integrated architecture. Computer Communications 24 (9): 768-80.
[13] Motorola Corporation. 1999. MPC7400 RISC microprocessor technical summary. Document MPC7400TS/D, rev. 0, number 2 edition, August.
[14] Culler, D. E., J. P. Singh, and A. Gupta. 1999. Parallel computer architecture: A hardware/software approach. San Francisco, CA: Morgan Kaufmann.
[15] Burger, D., and T. M. Austin. 1997. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News 25 (3): 13-25.
[16] Manjikian, N., and P. R. McHardy. 1999. An object-oriented framework for execution-driven architectural simulation. In Proceedings of the 1999 Symposium on Performance Evaluation of Computer and Telecommunication Systems, Chicago, IL, July, pp. 227-31.
[17] Ewert, P. M. 2000. Hardware/software considerations for IP-over-ATM processing in an integrated architecture. Department of Electrical and Computer Engineering, Queen's University, Kingston, Ontario, Canada, September.

Peter M. Ewert received his BEng (computer engineering) in 1998 from the University of Victoria and MASc (computer engineering) in 2000 from Queen's University. His MASc research concentrated on integrated architectures for IP-over-ATM processing. Since 2000, he has been working with the IXP line of network processors at Intel. His research interests include computer architectures, network protocols, parallel and distributed processing, and high-performance network processing.

Naraig Manjikian received his BASc (computer engineering) in 1991 and MASc (electrical engineering) in 1992, both from the University of Waterloo. He received his PhD (electrical engineering) from the University of Toronto in 1997. Between 1992 and 1997, he was also a participant in the NUMAchine Multiprocessor Project at the University of Toronto. Since 1997, he has been an assistant professor in the Department of Electrical and Computer Engineering at Queen's University, Kingston, Ontario. Dr. Manjikian's research interests include computer architecture and multiprocessing, compilers for parallel systems, and applications of parallel processing. His current research projects are examining processor-memory integration and application-specific computing architectures for telecommunications.
