
The XML Chip

Michael Leventhal, Eric Lemoine


LSI Corporation 10908 Technology Place San Diego, CA 92127 U.S.A.
firstname.lastname@lsi.com

Abstract: The history, design, operation and performance characteristics of a purpose-built XML chip are described. Our long-term implementation and commercialization experience has demonstrated that a purpose-built co-processor chip can significantly increase the overall performance of data-intensive applications which use XML for information transactions. Benchmark data is presented showing large gains in message throughput, reduced message latency, and reduced power consumption for XML operations. Processing of XML in the gigabit-per-second range is obtained, showing that some complex operations on XML data can be done at line rate in networking devices. Acceleration is also obtained using the XML chip with enterprise applications, where load is well below the gigabit range and XML processing may not constitute the largest portion of the CPU workload, due to the synergistic effects of processor offload. The workload efficiency of XML co-processing has been demonstrated to increase with the parallel capacity of multicore platforms.

As increases in processor performance through scaling processor frequency become increasingly expensive, attention has turned to other approaches to raising computer system performance. Chief among these is multiplying the number of processor cores and better utilizing them through efficient parallel algorithms. A special case of multicore parallelism is to employ special-purpose cores designed to process certain kinds of workloads more efficiently than the general-purpose instruction set processor. Extensible Markup Language (XML) has emerged as the dominant representation used in information, knowledge and transactional systems. The Web infrastructure (Web Services, REST, and SOA), adopted for much of the world's transactional framework for machine-to-machine interapplication communication, is based on XML. This means that servers in datacenters today spend a significant proportion of their time processing XML messages, and that widely-applied improvements in the efficiency of XML processing could yield global-scale reductions in datacenter costs and also free up computing capacity for greater volume and tasks of greater complexity. The XML chip, as part of a high-performance multicore architecture, processes the XML workload more efficiently than the general-purpose instruction set cores.

I. A BRIEF HISTORY OF THE XML CHIP

Within the commercial sector several companies over the last six years have announced that they were working on a purpose-built chip designed to process XML at high speeds, including Conformative Systems, DataPower, Dajeil, Tarari, Xambala, and Ximpleware. To the best knowledge of the authors, only Tarari publicly announced design wins where the processor was incorporated into OEM products, with the roster of customers including Alcatel-Lucent, Cisco, Layer 7 Technologies, and Reactivity. The Tarari XML processor has been used in web services gateways, application-oriented networking devices, and security appliances. Tarari was bought by LSI Corporation in October of 2007. LSI continues to sell an XML processor and to develop the technology. In May of 2009 HP announced an appliance product integrating the LSI Tarari XML chip and associated software with SAP's integration platform software NetWeaver PI [1]. This product is the first use of the XML chip directly integrated with a major enterprise application software package, as opposed to prior experience with special-purpose XML acceleration appliances and network devices. This development may be an important step in the mainstreaming of XML acceleration hardware. The authors led the XML chip program at Tarari and continue in the same capacity for LSI. As the Tarari XML processor is the only example known to them of a commercially viable product of its kind, this paper discusses the design and characteristics of this chip exclusively. Tarari was spun out from Intel's Network Equipment Division in 2002, bringing IP for an XML chip design from Intel. This technology can be traced further back to the acquisition of IPivot, a pioneer in the development of XML networking equipment, by Intel in 1999.

II. ALTERNATIVES TO AN XML CHIP

The two main alternatives to the XML chip designed to increase the workload-specific efficiency of servers are increasing the number of general-purpose cores to increase parallelism, and improvements to, or better exploitation of, existing features of the instruction set of the general-purpose processor to process XML more efficiently. Letz used the nine-core Cell BE processor to parse 8 documents in parallel, achieving an estimated 3X improvement in SAX parsing over libxml running on a Pentium M [2]. Intel's Streaming SIMD Extensions 4 (Intel SSE4) includes four instructions which "can accelerate a critical [XML parsing] performance bottleneck, since many operations such as sub-string searching and regular expression checking need many complex operations" [3]. Intel reported an overall 25% performance improvement and some gains up to 70%.

III. IMPLEMENTATION

The implementation of the XML chip has been, to date, exclusively in FPGA. The choice of reconfigurable logic has permitted the designers to pursue an evolutionary development path of approximately two major design iterations per year and countless minor revisions over the lifetime of the product. This has been very advantageous in an area where there was little hard science and engineering history to guide us. It also permitted us to closely track the rapid evolution of microprocessor architecture and technology instead of following one to two years in the wake of new general-purpose processor introductions. FPGA has also been the most cost-effective choice at the relatively low production volumes characteristic of the early years of the introduction of a new infrastructure technology. Author Lemoine was a student of the great pioneer of reconfigurable logic, Jean Vuillemin, who demonstrated in the late 1980s and early 1990s that the programmable active memory could be used to implement any computing function and serve as a universal hardware co-processor coupled with the host computer [4]. XML processing, a byte-oriented symbolic computing problem, is, in the application domain, not an obvious choice for FPGA technology. Symbolic computing is difficult to accelerate due to the lack of known acceleration algorithms in this area and the absence of a readily reducible problem space with identified bottlenecks. The strategy, therefore, had to be evolutionary, based on iterative and modular design as experience was accumulated, and also as the price-performance and capacity of FPGAs improved, expanding potential capabilities. It became evident that the development program would only have been feasible with the use of reconfigurable logic; having now spanned six years, it has consistently yielded progressively better results and a wider functional footprint.

One or more XML chips communicate with the host processor through the PCIe bus. A PCIe card fits into a standard slot in the server. The card also contains a controller which manages jobs between the XML chip and the host processor, and may optionally contain other processors such as cryptographic accelerators. A PCIe XML accelerator card is shown in the image below. This card has two chips on it, but different implementations have been done with a single chip and some cards have had as many as three.

Fig. 1: XML Chip on PCIe Card

There are software components which enable use of the card. An OS-specific driver effects communication to the card from the host as a device, and an API enables the programmer to make use of the accelerated operations of the board transparently from the application level.

IV. HARDWARE ARCHITECTURE

The logical design of the hardware is illustrated in the figure below. The design is capable of processing from 1 to 4 bytes per cycle; the highest-capacity design would attempt 4 bytes per cycle, but it is also possible to do a smaller (less silicon) implementation with reduced parallelism. The chip is a fully modular design with 4 engines performing different computations in parallel. The earliest version of the chip had only a single module and did additional computation as post-processing in software. As additional modules were added, more functionality was carved out of software and put into hardware. Several more modules are planned for the future, which will further move functionality now done in software into the hardware. A major accomplishment in this design is to have developed techniques for deeply integrating software and hardware processing, enabling the progressive substitution of hardware functionality for software in an evolutionary design cycle spanning several years.

Fig. 2: XML Chip Architecture

The function of each module is summarized in the table below.


TABLE I
XML HARDWARE MODULES

Module | Function
Semantic | Recognition of XML symbols
Structure | Recognition of document structure and hierarchy
Grammar | Inference against the document grammar
Security | Detection of threats


In a nutshell, the hardware modules, in a single pass, analyze the XML document and produce information which greatly accelerates subsequent software processes. The semantic and structural modules perform parsing, producing an in-memory representation of the XML document which can be consumed efficiently by subsequent software processes. The in-memory representation is richer than SAX and also different from DOM: poorer in navigational facilities, but facilitating random access to content through the efficient resolution of XPath queries. The grammar module is mainly related to schema validation. The security module is a unique feature of a modern implementation of XML processing in an internet context: it is built in at the lowest level of hardware processing and detects abnormal and anomalous data as the document is streamed through the chip, greatly reducing the vulnerability of the parsing step to attack. The output from each module is DMA'd to the host memory where software processes can operate on it.

The XML chip is streaming, meaning that it is capable of processing XML documents in fragments of arbitrary size. The chip passes a context block into DDR memory each time a fragment is completed. This context block is retrieved and passed back into the processor when the next fragment of the document arrives. With this approach the XML chip does not maintain state, which renders it less vulnerable to certain types of processing errors and to attacks. There are many advantages to the streaming capabilities of the design. It allows for better management of memory; particularly in Java environments, unevenness in memory usage may wreak havoc with garbage collection and drastically degrade system performance. In processing scenarios where the occasional large document may clog the pipeline and cause many documents to be queued, streaming can be leveraged to implement fairness with more dependable response times. Very large documents can be processed which otherwise could not be handled in a single chunk. Many XML processes such as XSLT have been designed in such a way as to render them intrinsically unstreamable. For these types of operations the document may still be parsed in chunks, but the resulting document must be buffered and XSLT processing completed on the entire document. Yet, even in this case, there can be substantial benefit to streaming, deriving from the lower memory overhead needed to produce the in-memory representation of the document.

V. THE NOBLE EIGHTFOLD PATH OF XML ACCELERATION

Eight principles were rigorously adhered to in the development of the XML chip which have proven successful in averting suffering on the part of the designers.

1. Modular design. New functions can be added without redesigning the previous modules.
2. Stateless engine. Operating in a network environment, a stateless design ensures that harmful XML message traffic will not slow down or alter in any way the processing of good XML messages.
3. Processing time is a function of the length of the message and not of its content and/or structure. This is another feature essential in a network environment in order to shield the design from denial-of-service attacks.
4. Wide pipeline architecture. 4 bytes are processed in parallel, but this can be reduced to two bytes or one byte during synthesis.
5. Hashing is cryptographically secure to avoid reverse engineering from the outside and to thwart probing and semantic attacks.
6. Streaming support. This capacity is required to process network traffic on a per-packet basis, but also enables the engine to process messages too large to be buffered, up to 4 Gbytes (see the host-side sketch after this list).
7. Choose the right metrics to measure your results. Our core measurement is cycles per byte (cpb), the number of host cycles per byte of message data processed.
8. Ultimately, only latency matters.
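A short host-side sketch makes the stateless streaming of principles 2 and 6 concrete. It is purely illustrative: the type and function names (xml_ctx_t, xml_chip_submit) are hypothetical stand-ins invented for this example, not the actual LSI driver API, and the stub body exists only so the sketch compiles and runs. The point is the division of labor: the host carries the opaque context block between fragments, so the engine itself holds no state between submissions.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical stand-in for the per-document context block which the
       real chip writes out to DDR after each fragment; opaque here. */
    typedef struct { uint64_t opaque[8]; } xml_ctx_t;

    /* Stub "driver" call so the sketch compiles and runs; a real
       submission would DMA the fragment plus the context block to the
       accelerator card and collect the parse output from host memory. */
    static int xml_chip_submit(xml_ctx_t *ctx, const uint8_t *frag,
                               size_t len, int is_last)
    {
        (void)ctx; (void)frag;
        printf("submitted %zu bytes%s\n", len, is_last ? " (last)" : "");
        return 0;
    }

    /* Feed one document to the chip in arbitrarily sized fragments. All
       resume state travels in ctx on the host side, so the engine stays
       stateless between submissions (principles 2 and 6). */
    static int process_stream(const uint8_t *doc, size_t len, size_t frag_sz)
    {
        xml_ctx_t ctx = {{0}};
        int rc = 0;
        for (size_t off = 0; off < len && rc == 0; off += frag_sz) {
            size_t n = (len - off < frag_sz) ? (len - off) : frag_sz;
            rc = xml_chip_submit(&ctx, doc + off, n, off + n == len);
        }
        return rc;
    }

    int main(void)
    {
        const char *doc = "<order><item><sku>A-1</sku></item></order>";
        return process_stream((const uint8_t *)doc, strlen(doc), 16);
    }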

Enlightenment has come in the form of an engine that runs at 166 MHz once synthesized for a Xilinx V4 LX60-10 and delivers a sustained performance better than 4.5 Gbps (peak is 166 MHz processing 4 bytes in parallel, yielding approximately 5.3 Gbps). In the 4-byte-wide FPGA implementation, processing time is almost entirely a function of the length, with only a small performance cost from particular characteristics of individual messages such as the alignment of certain data types. In any case, the engine always processes not less than 2 bytes per cycle, yielding a worst-case performance of 2.6 Gbps. For many XML applications only three things matter in the final count: latency, latency, and latency. It is exceptional to have a huge volume of traffic to be processed, but it is almost always the case that when the data shows up on the server it needs to be processed with the least possible delay, as the operation will be one of many in a processing stack or transaction infrastructure, part of a larger transaction.

VI. SOFTWARE ARCHITECTURE

The XML chip produces only limited acceleration for many of the established approaches to constructing XML processing software, and produces remarkable levels of acceleration for some established and some new approaches that have been little exploited in the past due to their inefficiency without purpose-built XML hardware. The LSI XML chip has an extensive API which enables easy construction of applications using approaches enabled by the performance characteristics of its special features. The table below summarizes the most important features that can be used to build very high-performance XML applications with the LSI API and the XML chip.
TABLE III
SOFTWARE FEATURES ENABLED BY THE XML CHIP

Feature | Description
Threat Detection | Recognizes malformed XML at the first byte
Characterization | Statistically profiles documents and can classify by policy group
Anomaly Detection | Statistically profiles documents and can detect anomalies falling outside of policy groups
Streaming | Chunks XML documents into arbitrarily-sized subdocuments, enabling continuous document streams, super-large documents, record-oriented processing, stop-on-condition processing and network-oriented packet processing
Random Access XML (RAX) | Based on declarative XPath set processing; simultaneously applies large groups of individual XPaths incrementally to XML documents, producing match results and offsets in the document. The offset table can be used for random access into the document at points of interest
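To make the characterization and anomaly detection features concrete, the following conceptual sketch is offered under our own assumptions: the chip's actual statistical model is not disclosed here, and the feature set and thresholding below are invented for illustration. The idea is that a document is reduced to a small feature vector, a policy group is an envelope of learned norms, and anomaly detection flags vectors that fall outside that envelope.

    #include <math.h>
    #include <stdio.h>

    #define NFEAT 3  /* illustrative features: max depth, element count,
                        mean text-node length */

    struct profile { double mean[NFEAT], stddev[NFEAT]; };

    /* Flag a document whose feature vector strays more than k standard
       deviations from the policy group's learned norm on any feature. */
    static int is_anomalous(const double feat[NFEAT],
                            const struct profile *p, double k)
    {
        for (int i = 0; i < NFEAT; i++)
            if (fabs(feat[i] - p->mean[i]) > k * p->stddev[i])
                return 1;
        return 0;
    }

    int main(void)
    {
        /* norms supposedly learned from a "purchase order" policy group */
        struct profile orders = { { 12.0, 340.0, 28.0 },
                                  {  2.0,  60.0,  9.0 } };
        double doc[NFEAT] = { 31.0, 350.0, 30.0 };  /* suspiciously deep */
        printf("anomalous: %d\n", is_anomalous(doc, &orders, 3.0)); /* 1 */
        return 0;
    }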

The most important established XML processing approach which yields only limited acceleration with the XML chip is use of the Document Object Model (DOM). The DOM is memory-intensive, constructing a tree-structure model of the XML document prior to navigating the document to extract the desired data. The cost of construction of the tree may be amortized for a long-lived document which is scanned repeatedly and deeply, but the DOM is generally a poor performer in typical transactional applications. The vast majority of XML applications are nevertheless implemented using the DOM due to the number of robust free tools. In some cases an underlying DOM representation can be replaced by our Random Access XML (RAX) API. This is the strategy we employed in creating our own version of an XSLT engine, RAX-XSLT.
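The RAX usage pattern is easiest to see in code. The sketch below is illustrative only: rax_eval and rax_match are hypothetical names invented here rather than the real LSI RAX API, and the evaluator body fakes the chip with a naive substring search so the example runs. What it shows is the shape of the interface: a set of XPaths is applied to the document in one pass, and the result is a table of matches with byte offsets which software then uses for random access into the raw document.

    #include <stdio.h>
    #include <string.h>

    /* One entry of the match/offset table RAX-style processing returns. */
    struct rax_match { int xpath_id; long offset; };

    /* Stand-in evaluator: the real single-pass evaluation of the whole
       XPath set happens on the accelerator, not in host software. */
    static int rax_eval(const char *doc, struct rax_match *out, int max)
    {
        const char *needles[] = { "<sku>", "<total>" };  /* stands in for
            a declared XPath set, e.g. /order/item/sku, /order/total */
        int n = 0;
        for (int i = 0; i < 2 && n < max; i++) {
            const char *p = strstr(doc, needles[i]);
            if (p) { out[n].xpath_id = i; out[n].offset = p - doc; n++; }
        }
        return n;
    }

    int main(void)
    {
        const char *doc =
            "<order><item><sku>A-1</sku></item><total>9.99</total></order>";
        struct rax_match m[8];
        int n = rax_eval(doc, m, 8);
        for (int i = 0; i < n; i++)  /* random access at points of interest */
            printf("xpath %d matched at byte %ld: %.12s...\n",
                   m[i].xpath_id, m[i].offset, doc + m[i].offset);
        return 0;
    }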

VII. PERFORMANCE

The table below provides a summary of averaged cycle-per-byte calculations over a range of tests selected as representative of typical XML processing for information, knowledge and transactional systems. The column Chip shows the cycles per byte when the XML chip is used, and SW shows the cycles per byte when the same process is run entirely in software. In some cases the XML chip provides functionality which does not have an equivalent process entirely in software. For some XML operations there is a very wide range of performance results which are highly data- and process-dependent; in these cases the cycles per byte are given as a range. This is more often the case with software, which does not share the hardware's strongly linear performance independent of the composition of the data, but even with operations offloaded by the XML chip there may be wide variation with processes such as XSLT, where the time may be influenced more by the script than by the intrinsic nature of the data.
A. Base Metric: Cycles Per Byte

Cycles per byte provides a first-order, processor-independent measure of the offload efficiency of the XML chip. Cycles are the host CPU cycles needed to perform a given operation, and cycles per byte is the number of host CPU cycles needed to perform that operation on one byte of XML data. The equation for calculating cycles per byte, given the host processor frequency f (cycles per second), the CPU utilization U (expressed as a fraction), and the throughput T of the target process (bits per second), is shown below.

cpb = (f × U) / (T / 8)

Fig. 3: Cycles Per Byte Performance Calculation

For example, row 2 of Table V shows that schema validation was benchmarked at 2.6 Gbps with 290% CPU utilization. Let's suppose that this test was run on a server with four 3 GHz cores (all cores fully utilized would be 400% CPU utilization). The cycles per byte for this test is: 3 GHz × 290/100 ÷ (2.6 Gbps ÷ 8) = 26.8 cpb.
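As a numeric sanity check, the short program below (our own helper, not part of the LSI API) reproduces this calculation and the peak-rate arithmetic quoted in section V.

    #include <stdio.h>

    /* cpb = (f * U) / (T / 8): f in Hz, U is utilization as a
       percentage, T is throughput in bits per second */
    static double cycles_per_byte(double hz, double util_pct, double gbps)
    {
        double bytes_per_sec = gbps * 1e9 / 8.0;
        return (hz * util_pct / 100.0) / bytes_per_sec;
    }

    int main(void)
    {
        /* Table V, row 2: schema validation at 2.6 Gbps, 290% CPU,
           3 GHz cores */
        printf("%.1f cpb\n", cycles_per_byte(3e9, 290.0, 2.6)); /* 26.8 */

        /* Peak chip rate quoted in section V: 166 MHz x 4 bytes wide */
        printf("%.1f Gbps peak\n", 166e6 * 4.0 * 8.0 / 1e9);    /* 5.3  */
        return 0;
    }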

B. XML Chip Offload Per Operation

With 100% offload the cycles per byte would be zero, since no host cycles are needed to perform the operation. The XML chip always requires some minimum number of host CPU cycles, because even if the XML chip fully computes the result the host processor will at least be implicated in communication with the accelerator card and transfer of the XML data to the card. In fact, the architecture of the system is a judicious blend of hardware and software, and a full range of levels of offload is seen depending on the nature of the XML operation.

TABLE IV
XML CHIP OFFLOAD

XML Operation | Description | Chip (host cycles/byte) | SW (host cycles/byte)
Parser Attack Checks | Parallel to parsing; detects buffer overflow and resource exhaustion attacks against the parser | 10 | N/A
Wellformed Checks | Parser-based wellformed check | 10 | 80-400
Message-based anomaly detection | Detects messages which deviate from the statistical norm; adaptive; stronger than schema validation | 10 | N/A
Parsing | Token-based parsing with sequential and random access to all XML objects | 20 | SAX 80 / DOM 400
Content-based Routing | Routing decisions based on large XPath sets | 30 | 500-1600
Schema Validation | XML Schema validation | 40 | 600-2000
XSLT | XSLT transformation | 400-900 | 1200-10000
XML Security | XML-based authentication based on XML and WS-Security specifications | 450-800 | 1400-2600

C. System Overhead and Performance

Communication overhead between the XML accelerator and the host may impact performance. For example, for small messages the quantity of resources needed for signaling and data transfer may be a large portion of the overall processing, especially when synchronous calls are used. The table below shows the relatively low throughput obtained for schema validation on a message of 2 KB and the high CPU utilization required. The throughput increases considerably when the message size is large enough to render the fixed overhead costs small, and the performance improves further on a very large message.
TABLE V
SCHEMA VALIDATION, WHOLE MESSAGES

Message Size (KB) | Throughput (Gbps) | CPU Utilization (%)
2 | 0.7 | 360
9 | 2.6 | 290
280 | 3.4 | 220

This raises the question of performance when the streaming feature is used and documents are passed in fragments to the XML chip. Typically a system that does this will balance a number of factors (level of interface into the network stack, management of connections, threading, memory management, and use of asynchronous interfaces, to name a few), so fragment size and performance may not be in as simple a relationship as described above. There is, intrinsically, some penalty for managing the context required for streaming. Even so, the result below shows that very good results may still be obtained on fragments as small as 1.5 KB when scanning XML documents for threats (XML Threat Management, XTM).
TABLE VI
XTM, STREAMING

Message Size (KB) | Throughput (Gbps) | CPU Utilization (%)
1.5 | 2.2 | 120
3.0 | 2.5 | 100

D. Application-Level Performance: Throughput and Latency

Great acceleration of specific algorithms or workloads does not always yield significant acceleration at the application level. Amdahl's Law must always be considered: the gain from accelerating one part of an application is bounded by that part's contribution to overall application-level execution time. Accelerate by 1000X an algorithm that is only 10% of the application and you will not get more than about 1.1X overall acceleration. This observation was a dominant concern in our design of the acceleration strategy employed in HP's appliance for SAP NetWeaver. Bearing the challenge carefully in mind, experience also shows that queueing principles may apply: eliminating one problem may eliminate a whole series of problems that would otherwise propagate and drag down performance throughout the pipeline. The results in the table below are from a study conducted during development of the HP NetWeaver appliance, where the operation offloaded by the XML chip accounted for only 1/3 of the total end-to-end application processing time. The theoretical limit of acceleration, by Amdahl's Law, should have been 1.5X. In fact, a 4X increase in throughput and a 7X decrease in per-message latency were obtained, due to the elimination of cascading bottlenecks that had impacted performance in every other stage of the pipeline.

TABLE VII
APPLICATION LEVEL PERFORMANCE
(throughput in messages per fixed interval; latency as average per message)

Message Size (MB) | Throughput: Chip | SW | Chip/SW | Latency: Chip | SW | SW/Chip
8 | 701 | 177 | 4.0 | 38 | 253 | 6.7
11 | 552 | 135 | 4.1 | 38 | 285 | 7.5
13 | 455 | 110 | 4.1 | 39 | 334 | 8.6
14 | 413 | 93 | 4.4 | 47 | 342 | 7.3

VIII. POWER CONSUMPTION

The XML chip is extremely power efficient, drawing far less power to process the same workload than an equivalent software process running on a general-purpose CPU. The chart below shows this as the difference in the areas under the curves of two runs, the first in software and the second with offload to the XML chip. With the XML chip the application runs at 3 Gbps, consuming 76 W above the system idle of 260 W; in software the same application runs at 0.552 Gbps and consumes 94 W above the same system idle. The wattage per gigabit is 25.3 with HW acceleration and 170.3 in SW; HW acceleration brings almost a 7X improvement in power efficiency.

Fig. 4: Power Consumption with the XML Chip
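The watts-per-gigabit arithmetic is easy to verify; the snippet below, included purely as a worked check, reproduces the figures above.

    #include <stdio.h>

    int main(void)
    {
        /* power above the 260 W system idle, divided by throughput */
        double hw = 76.0 / 3.0;    /* XML chip run: 76 W at 3 Gbps     */
        double sw = 94.0 / 0.552;  /* software run: 94 W at 0.552 Gbps */
        printf("HW: %.1f W/Gbps  SW: %.1f W/Gbps  ratio: %.1fx\n",
               hw, sw, sw / hw);   /* -> 25.3, 170.3, 6.7x */
        return 0;
    }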

IX. KEY LEARNINGS

Round-trip time, host to the hardware and back to the host, is more important than the overall throughput of the accelerator. Latency rules!

Think about accelerating the software as a whole, not some subpart known to be slow. A relatively small number of gates is available in an FPGA: there is no hope of having enough gates to put the entire process into reconfigurable logic, and there is seldom a small part which so dominates processing time that accelerating it accelerates the entire software process. Amdahl's Law would seem to pose a great challenge to the idea of accelerating generalized software applications but, in fact, it is possible to factor the performance benefits of hardware acceleration throughout the software process (a numeric illustration appears at the end of this section).

Modular design simplifies development and testing: admittedly not a new idea, but certainly one that pays when it is scrupulously honored. Huge repercussions will be felt over the whole design lifecycle, enjoyed when this is remembered and suffered when it is ignored. It has to be one of the first design decisions taken. It is especially crucial when combined with formulating a strategy for architecting the partition between software and hardware: the modular approach allows you to keep "carving out" new acceleration opportunities for the hardware as more gates become available in later FPGAs.

Building an accelerator around the processing of XML, which is nearly ubiquitous, is a tractable acceleration problem which could profoundly improve the general performance of a very large class of software applications. And while tractable, the challenge with XML was, and still is, that the number and complexity of the specifications that define what XML is and how it is processed is very large. A classical performance analysis using profiling to look for bottlenecks would not have gotten very far. It is critical to pick the functions to be accelerated by a more holistic analysis of where the hardware could be employed to enable better performance throughout the chain of software processes. One simple example, to illustrate the concept, was to provide the depth of the XML message (XML is fundamentally a tree structure) to the software, enabling it to statically allocate the queues required for processing, safely removing the need for costly boundary checking. With this approach the problem became more approachable: first, there was no need to support the entire specification to get acceleration benefits; second, based on the gate availability in the targeted FPGA, it was easier to choose functions that would fit.
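As a worked check of the Amdahl arithmetic discussed above and in section VII.D, the small program below (illustrative only) computes the bound for both cases: the 10%-at-1000X example and the NetWeaver study's one-third offload.

    #include <stdio.h>

    /* Amdahl's Law: overall speedup when a fraction p of total time is
       accelerated by a factor s. */
    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* 10% of the application accelerated 1000X: barely 1.11X overall */
        printf("%.2fX\n", amdahl(0.10, 1000.0));

        /* NetWeaver study: 1/3 of end-to-end time fully offloaded gives a
           1.50X cap, versus the 4X actually measured once cascading
           bottlenecks disappeared (Table VII). */
        printf("%.2fX\n", amdahl(1.0 / 3.0, 1e9));
        return 0;
    }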
X. REFERENCES

[1] News Release, "HP Delivers Up to 400 Percent Improvement in XML Messaging Performance for SAP NetWeaver PI," Palo Alto, May 12, 2009, http://www.hp.com/hpinfo/newsroom/press/2009/090512xb.html.
[2] S. Letz, "Cell Processor-Based Workstation for XML Offload: System Architecture and Design," University of Leipzig, Department of Computer Science, Leipzig, Germany, May 2005.
[3] Intel Software Network, "XML Parsing Accelerator with Intel Streaming SIMD Extensions 4 (Intel SSE4)," http://software.intel.com/en-us/articles/xml-parsing-accelerator-with-intel-streaming-simd-extensions-4-intel-sse4.
[4] P. Bertin, D. Roncin, and J. Vuillemin, "Programmable Active Memories: a Performance Assessment," PRL Research Report 24, Digital Equipment Corp., Paris Research Lab., March 1993.
