
2010 18th IEEE Symposium on High Performance Interconnects

Intel QuickPath Interconnect

Architectural Features Supporting Scalable System Architectures

Dimitrios Ziakas, Allen Baum, Robert A. Maddox, Robert J. Safranek

Intel Corporation, Santa Clara, USA.

Abstract: Single processor performance has exhibited substantial growth over the last three decades [1], as shown in Figure 1. Also desirable are techniques for connecting multiple processors together to create scalable, modular and resilient multiprocessor systems. Beginning with the production of the Intel Xeon processor 5500 series (previously codenamed Nehalem-EP), the Intel Xeon processor 7500 series (previously codenamed Nehalem-EX), and the Intel Itanium processor 9300 series (previously codenamed Tukwila-MC), Intel Corporation has introduced a series of multi-core processors that can be easily interconnected to create server systems scaling from 2 to 8 sockets. In addition, OEM platforms are currently available that extend this up to 256-socket server designs¹. This scalable system architecture is built upon the foundation of the Intel QuickPath Interconnect (Intel QPI). These Intel micro-architectures provide multiple high-speed (currently up to 25.6 GB/s), point-to-point connections between processors, I/O hubs and third-party node controllers. The interconnect features, as well as the capabilities built into the processors' system interconnect logic (also known as the "uncore"), work together to deliver the performance, scalability, and reliability demanded in larger scale systems.

Keywords: Intel, QuickPath, QPI, scalable, Xeon, Nehalem

The Intel QuickPath Interconnect is the high-speed, low-latency, packetized, point-to-point, coherent system interconnect currently used in Intel's high-end server processors.

Intel QPI has a layered model [2], as shown in Figure 2. The functions performed by Intel QPI are logically grouped into four layers:

- The Physical Layer: This layer deals with the operation of signals on a link between two agents. The layer manages data transfer on the signal wires, including the electrical levels, timing aspects and logical issues involved in sending and receiving each bit of information across the parallel lanes. The unit of transfer at the Physical layer is 20 bits and is called a Phit (physical unit).
- The Link Layer: This layer manages data transfers given to and from the physical layer, including flow control and error correction. The unit of transfer at the link layer is 80 bits and is called a Flit (flow control unit).
- The Routing Layer: This layer ensures data is directed to its intended destination across Intel QPI. If a message handed up from the link layer is destined for an agent in another device, this layer forwards it to the proper link to send it on. All messages destined for agents on the local device are passed up to the protocol layer.
- The Protocol Layer: This layer has multiple functions, defining the packet exchange mechanisms across devices. A protocol layer packet is comprised of a number of flits and, whilst lower layers deal only with two directly connected devices, the protocol layer deals with messaging across multiple links, in multiple devices. The layer manages cache coherence for the interface using the write-back protocol and also has a set of rules for managing non-coherent messaging. The protocol layer typically connects to the cache coherence state machine in the caching agents, and to the home agent logic in memory controllers. It is also responsible for system level functions such as interrupts, memory mapped I/O and locks.
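The relationship between the two link-level units of transfer, and the CRC protection the paper later describes (8 CRC bits accompanying every 72 bits of data + control in a flit), can be sketched as follows. This is an illustrative model only: the CRC-8 polynomial used here (0x07) is a generic stand-in, since the paper does not state which polynomial QPI actually employs.

```python
# Illustrative QPI framing model: an 80-bit flit = 72 bits of
# data + control plus 8 CRC bits, carried as four 20-bit phits.
# The CRC-8 polynomial (0x07) is a stand-in, not Intel's actual one.

FLIT_BITS = 80
PHIT_BITS = 20
PAYLOAD_BITS = 72  # data + control bits protected by the CRC

def crc8(value: int, nbits: int, poly: int = 0x07) -> int:
    """Bitwise CRC-8 over the top `nbits` bits of `value`, MSB first."""
    crc = 0
    for i in reversed(range(nbits)):
        crc ^= ((value >> i) & 1) << 7
        crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def build_flit(payload: int) -> int:
    """Append 8 CRC bits to a 72-bit payload, forming an 80-bit flit."""
    assert payload < (1 << PAYLOAD_BITS)
    return (payload << 8) | crc8(payload, PAYLOAD_BITS)

def to_phits(flit: int) -> list[int]:
    """Split an 80-bit flit into four 20-bit phits, most significant first."""
    return [(flit >> (PHIT_BITS * i)) & ((1 << PHIT_BITS) - 1)
            for i in reversed(range(FLIT_BITS // PHIT_BITS))]

def from_phits(phits: list[int]) -> int:
    """Physical-layer receive side: reassemble phits into a flit."""
    flit = 0
    for p in phits:
        flit = (flit << PHIT_BITS) | p
    return flit

def check_flit(flit: int) -> bool:
    """Link-layer receive side: recompute the CRC over the payload bits."""
    payload, crc = flit >> 8, flit & 0xFF
    return crc8(payload, PAYLOAD_BITS) == crc

flit = build_flit(0x123456789ABCDEF012)
assert len(to_phits(flit)) == 4          # one flit = four phits
assert from_phits(to_phits(flit)) == flit
assert check_flit(flit)
assert not check_flit(flit ^ (1 << 40))  # any single-bit error is detected
```

A failed `check_flit` would, in the real interconnect, trigger the link-layer retry described later in the paper rather than forward the corrupted flit upward.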

Figure 1. Processor performance growth since the VAX-11/780 in 1978

¹ For one example, see the Ultraviolet platform description at

Figure 2. Intel QuickPath Interconnect layers

978-0-7695-4208-9/10 $26.00 © 2010 IEEE

DOI 10.1109/HOTI.2010.24
The Intel QuickPath Interconnect has been defined for processor-to-processor communications, with its current implementations operating at signalling speeds of up to 6.4 GT/s. An Intel QPI full-width link pair has 21 lanes in each direction (20 differential pairs and a clock lane), for a total of 42 lanes for both directions. Each lane is physically a differential pair, therefore amounting to a total of 84 signals for a full Intel QPI link pair [3].

The protocol layer's distributed agents form the way in which a device's various interconnected functional blocks communicate. At the link layer, there are six message classes, as shown in Table 1. These provide independent transmission channels to the protocol layer agents, allowing sharing of the physical channel while ensuring that temporary blocking of one message class does not prevent traffic of another from flowing.

Message Class          Abbreviation  Data
Snoop                  SNP           No
Home                   HOM           No
Data Response          DRS           Yes
Non-Data Response      NDR           No
Non-Coherent Standard  NCS           No
Non-Coherent Bypass    NCB           Yes

Table 1. QPI link layer message classes

Virtual networks provide yet another level of virtual channel support, which facilitates scalable network topologies and routing. The link layer's three supported virtual networks (VN0, VN1 and VNA) provide each of the six message classes with three logically separate paths to the single physical medium. This results in a total of up to 18 virtual channels distributed across the three available virtual networks, as illustrated in Figure 3. These 18 ways in which the link layer manages traffic through to the single physical connection are complemented by a credit/debit flow control scheme.

The protocol layer has the concept of agent types. The defined agent types include:

- Caching agent: generates requests, controls caches that contain copies of memory data, handles snoops
- Home agent: controls DRAM and manages coherency
- IO agent: proxy for I/O subsystem(s)
- Firmware agent: indicates that the agent has a path to boot code
- Configuration agent: contains configuration registers for the device
- Routing agent: this agent has a routing layer

Each end of a link will identify itself as one or more of these agent types to the other end of the link during the initialization process. In addition, it will identify its NodeID (node identification number) and link number, enabling firmware to self-discover the system topology and configure the system accordingly. The flexibility of the routing layer allows this to be done even in the presence of failed nodes or links, so only working components will be enabled.

The different agent types are a way of classifying protocol flows at the protocol layer. For example, caching agents generate all coherent requests and the requests themselves are directed to home agents. Snoop requests are always directed at caching agents, while snoop responses are returned to the home for a given cacheline address. For each agent, the assigned NodeID is used to uniquely identify the agent and therefore allows message exchange and proper routing and communication. The coherence of data and the entire system architecture is coordinated by the protocol layer, which also provides the functions for non-coherent messaging.
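The combination of six message classes and three virtual networks, each channel governed by credit/debit flow control, can be sketched in a few lines. The credit counts below are arbitrary illustration values; real receiver buffer sizes are implementation-specific and not given in the paper.

```python
# Illustrative sketch of QPI link-layer flow control: six message
# classes across three virtual networks give up to 18 virtual
# channels, each with its own credit/debit accounting.  Credit
# counts here are invented; real buffer depths are not specified.

MESSAGE_CLASSES = ["SNP", "HOM", "DRS", "NDR", "NCS", "NCB"]
VIRTUAL_NETWORKS = ["VN0", "VN1", "VNA"]

class VirtualChannel:
    def __init__(self, credits: int):
        self.credits = credits  # receiver buffer slots we may still consume

    def send(self) -> bool:
        """Debit one credit to transmit; fail if the channel is blocked."""
        if self.credits == 0:
            return False        # temporarily blocked; other channels unaffected
        self.credits -= 1
        return True

    def credit_return(self) -> None:
        """Receiver freed a buffer slot and returned the credit."""
        self.credits += 1

# One independent channel per (message class, virtual network) pair.
channels = {(mc, vn): VirtualChannel(credits=2)
            for mc in MESSAGE_CLASSES for vn in VIRTUAL_NETWORKS}
assert len(channels) == 18

# Exhausting SNP/VN0 blocks only that channel; DRS/VNA still flows,
# which is exactly the non-blocking property described above.
assert channels[("SNP", "VN0")].send()
assert channels[("SNP", "VN0")].send()
assert not channels[("SNP", "VN0")].send()   # out of credits
assert channels[("DRS", "VNA")].send()       # independent channel unaffected
channels[("SNP", "VN0")].credit_return()
assert channels[("SNP", "VN0")].send()       # unblocked after credit return
```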
Typical 2, 4, and 8-socket topologies can be created from the above-mentioned processors and I/O Hub devices; some examples are shown in Figures 4, 5 and 6. Note that this system architecture allows for multiple distributed memory controllers as part of the CPU functionality. Figure 10 depicts the extendable nature of this architecture, showing expansion into multiple nodes, each of which uses an OEM-developed Node Controller to manage system traffic within and between nodes.
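The routing layer's job in such topologies (forwarding each message toward its destination NodeID, and allowing firmware to reprogram paths around failures) can be sketched as follows. The 4-socket full-mesh link map and the failure scenario are hypothetical examples, not the specific topologies of Figures 4 through 6.

```python
# Illustrative routing-layer sketch: each socket holds a routing table
# mapping a destination NodeID to the first outgoing link (next hop).
# The 4-socket full mesh and the failed-link scenario are invented
# examples; real topologies are programmed by firmware.
from collections import deque

# Hypothetical point-to-point links between NodeIDs 0..3 (full mesh).
links = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}

def build_routing_table(src: int) -> dict[int, int]:
    """BFS from `src`: for every reachable destination, record the first hop."""
    table, seen = {}, {src}
    frontier = deque((n, n) for n in links[src])
    while frontier:
        node, first_hop = frontier.popleft()
        if node in seen:
            continue
        seen.add(node)
        table[node] = first_hop
        frontier.extend((nxt, first_hop) for nxt in links[node])
    return table

def route(src: int, dst: int) -> list[int]:
    """Follow per-socket routing tables hop by hop until arrival."""
    path = [src]
    while path[-1] != dst:
        path.append(build_routing_table(path[-1])[dst])
    return path

assert route(0, 3) == [0, 3]      # full mesh: every destination is one hop

# If the 0-3 link fails, rebuilding the tables routes around the
# failure through an intermediate socket, as the paper describes.
links = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
assert route(0, 3) in ([0, 1, 3], [0, 2, 3])
```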

Figure 3. Intel QPI virtual networks, message classes and virtual channels

Figure 4. Two socket system employing Intel QuickPath Technology

The challenge of meeting the demands of scalable systems, such as those using the Intel Xeon processor 7500 series shown in Figure 8, is significant. Link-based systems partially address the bandwidth issues, but large systems must also contend with latency and increased error rates.

Figure 5. Four socket system employing Intel QuickPath Technology

Several issues arise when creating these large scale systems. The distributed nature of the memory and CPU caches is vital to keeping the multiple cores running at their peak rates, but this distribution requires careful consideration of the messaging between all the coherent agents in the system. A robust system-wide memory coherency protocol and efficient management of the coherency traffic are key to providing a scalable system. The network between the elements must provide for a deadlock-free method of delivering the messages, and it must account for agents that may not physically be present. Allowing for reconfiguration of the system topology from a logical viewpoint is essential for supporting multiple partitions on these larger systems, thereby allowing the resources to be most effectively used. Finally, the interconnect fabric must be reliable, and remain highly available even in the presence of certain error events.

Intel QPI addresses bandwidth, latency, and reliability in multiple ways:

- Messages are variable length, conserving bandwidth by ensuring that common messages are short.
- Intel QPI implements a source snoopy protocol to provide a performance benefit for workloads with a high degree of sharing (e.g. transaction processing, where modified data is often found in the cache of another socket) because data can be found and returned from the remote socket without any intermediate processing. And, as long as snoop responses can be returned faster than DRAM can be accessed, it does so without performance penalty even when the data is not found in any other cache. For systems in the 2-8 socket range, this snoop traffic can be broadcast without overwhelming fabric bandwidth.
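The source-snoop trade-off described above can be illustrated with a toy timing model: snoops to peer caches proceed in parallel with the home agent's DRAM access, so a remote cache hit returns data directly, and a miss costs no more than the DRAM access itself as long as snoop responses are faster. The latency numbers below are invented solely to show the two cases; they are not QPI measurements.

```python
# Toy timing model of a source-snooped read: snoops to peer caches
# race the home agent's DRAM access.  Latencies are made-up numbers
# chosen only to illustrate the hit and miss cases.

SNOOP_LATENCY = 60   # hypothetical: snoop out + response back
DRAM_LATENCY = 100   # hypothetical: home agent memory access

def resolve_read(peer_caches: dict[str, bool]) -> tuple[str, int]:
    """Return (data source, latency) for a source-snooped read.

    `peer_caches` maps socket name -> whether it holds a modified copy.
    """
    hit = next((s for s, has_line in peer_caches.items() if has_line), None)
    if hit is not None:
        # Modified data is forwarded cache-to-cache, with no
        # intermediate processing at the home agent.
        return hit, SNOOP_LATENCY
    # Miss everywhere: snoop responses arrived before DRAM completed,
    # so broadcasting the snoops added no latency penalty.
    return "DRAM", max(DRAM_LATENCY, SNOOP_LATENCY)

src, lat = resolve_read({"socket1": False, "socket2": True})
assert (src, lat) == ("socket2", 60)   # sharing workload: remote cache wins
src, lat = resolve_read({"socket1": False, "socket2": False})
assert (src, lat) == ("DRAM", 100)     # miss everywhere: no added latency
```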

Figure 8. Intel Xeon processor 7500 series die

The Intel QuickPath Architecture is the foundation for supporting these scalable systems. The interconnect and related protocols are only part of the solution, however: the CPU designs must provide the proper uncore logic design to implement the capabilities required at the system level.

Figure 6. Eight socket system employing Intel QuickPath Technology

Still, Intel QPI provides several methods of conserving link bandwidth:

- Instead of sending individual snoops to each remote cache, snoops can be broadcast. As the snoops travel through links to remote sockets, they are forked off to both the local socket and any number of outgoing links.
- A snoop message doesn't need to be sent to the socket where the home agent is located. Instead, the snoop can be locally generated at the remote socket from the memory request message itself.
- Caching agents can either be source snoopy or be under directory control. Additionally, systems may implement both simultaneously; caching agents that are unlikely to be sharing (e.g. IO agents,

which have smaller caches that are less likely to be hit, and generally cache address ranges which are not permitted to be accessed until an entire DMA is complete) are under directory control. In this way, these agents are snooped only when they actually own the cache line being accessed.
- A single socket may contain more than one caching agent, which can be accessed simultaneously to increase bandwidth. This is implemented in conjunction with address interleaving of the caching agents, so that only one ever needs to be snooped for any access.
- Likewise, a single socket may contain more than one home agent that controls memory. These are also address interleaved, and when the caching agent interleave matches the home agent interleave, the caching agents and home agents are paired. In a highly NUMA-optimized system, the CPUs can take advantage of this to create a fast path from a caching agent to local memory, decreasing memory latency. This also has the effect of allowing more transactions to be active in the system simultaneously, as it increases utilization of transaction buffers.
- An Intel QPI routing table is accessed at every hop through the link. In addition to specifying whether to stop at the local socket or route through to another remote socket, the routing table can specify one of two Virtual Networks. This enables complex topologies to be configured without danger of deadlocks. It can also enable doubling links between sockets for increased bandwidth.

The resulting performance benefits on products like the Intel Xeon processor 7500 series [4] are shown in Figure 9.

Large scalable systems have a larger number of components than typical systems, with an inevitable increase in failure rates. At the same time, large scalable systems require higher reliability than typical smaller systems. Intel QPI provides many features to increase system reliability, availability, and serviceability [5]:

- The Intel QuickPath Interconnect has a routing layer that enables large systems with flexible topologies. This can be used to partition a large system into smaller clusters, isolating applications for increased reliability. It can also be used to route around failing nodes or links, allowing graceful degradation under certain failure scenarios, and even field replacement without bringing systems down.
- Intel QPI provides methods for reconfiguring the system without bringing the system down. This can be used to implement hot addition of CPU sockets, IO hubs, or memory, or removal, or re-partitioning of the system (with OS and firmware assistance).
- Every message flit sent over Intel QPI links includes 8 bits of CRC for every 72 bits of data + control. If a CRC error is detected at the receiver, the transmitter is notified, and the message flit (and all subsequent flits) is retried.
- For increased error detection, Intel QPI implements an optional 16-bit rolling CRC. In this mode, 16 bits of CRC are calculated and sent across two consecutive flits, with the additional 8 bits combining with the next flit's CRC, therefore increasing the error detection capability without transmitting any more information.
- If retry fails too many times, the link can undergo an in-band reset. When this occurs, the physical layer of the link is retrained and brought back to the normal operating state. There is no loss of any in-flight data and there is no interaction with software other than notification of the event.
- If the link cannot be retrained because of a failure of a data lane, then the retraining process determines which of the quartiles are still working, and reconfigures the link to use only a working half (sending each half of the data in succession) or a quarter of them. In this case the physical layer takes two or four times as long to send the flit, but there is no loss of any information or error detection abilities.
- Likewise, if the clock fails, retraining will commandeer one of the data lanes to substitute for the clock. The retraining will then select one or two of the remaining quartiles for data. Again, there is no loss of any information or error detection ability.
- Intel QPI allows memory writes to be mirrored, so that data is written to memory on two different home agents. There is also support for read failover, so that memory reads can be forwarded to the second copy when the first copy has an uncorrected memory subsystem error. Again, no information or error detection capability is lost. The mirror write capability can be used to safely migrate the contents of memory from one home agent to another while it is in use, when memory is determined to be unreliable and needs replacing. Together with the reconfiguration flows, memory can be migrated, the failing memory replaced, and memory migrated back without system interruption.
- Memory controllers can detect data errors in memory even when the data is not being used (e.g. if there is data corruption on a write, or when memory is patrol scrubbed). It is unnecessary to stop the system in this case because it is possible that memory will not be read again before being written again. Intel QPI can return data with a

poison indication that a CPU can use to cause error handling only if the data is consumed. The error can thus be isolated to just those tasks that use the failing data.
- If a message control or address is corrupted in the system, then it is not possible to poison the data and isolate the failure to a particular task (since if the address is corrupted, it isn't possible to know which address to poison). If this occurs, then the agent that detected it goes into viral mode in order to contain the error. Every packet that is sent to or from the failing agent is marked as viral, and any agent that receives a viral packet also goes into viral mode. Simultaneously, the error is reported to the system so that a (somewhat) orderly shutdown can occur. Otherwise, system operation is unaffected, with one exception: I/O agents stop forwarding requests and data while they are in viral mode, to prevent any possibly corrupt data from being written to permanent storage.
- Intel QPI implements a hierarchical timeout system to enable failure analysis. Timeouts are configured in increasing intervals down message dependence chains, so that if a failure of one message results in a cascade of timeout errors, the original failure will time out first and the source of the error can be identified.

The protocol layer has the concept of a Source Address Decoder (SAD) and a Target Address Decoder (TAD), in addition to the routing layer's Route Table (RT). All coherent requests use the SAD in the requesting agent to determine the NodeID of the target, taking into consideration the message type and address (if applicable). The routing layer uses the destination NodeID to select the path to the target agent. The target agent uses the TAD to determine where to send the request within the agent (e.g. to which memory channel). In other words, the SAD determines where to go, the RT determines how to get there, and the TAD determines who sees it.

In large scalable system topologies such as those of Figure 6 and Figure 10, full configuration of the system's SADs and TADs is typically done through firmware during system initialization. This can bring unique advantages to system configuration, allowing for example a UMA (uniform memory access) or NUMA (non-uniform memory access) optimized platform configuration by programming the SADs and TADs of the system to adhere to the software stack characteristics and memory access patterns. SAD structures allow socket interleaving and can be optimized for memory accesses across disparate sockets. TAD structures can be used to route incoming transactions to a particular memory channel, thus achieving channel interleaving. The combination of SAD and TAD structures gives a great deal of flexibility in balancing memory transactions across all of the available memory.
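The three-stage decode just described (SAD for where to go, RT for how to get there, TAD for who sees it) can be sketched as below. The interleave granularities, NodeIDs, and link numbers are invented for the example; real systems program these structures through firmware, as the paper notes.

```python
# Illustrative sketch of the SAD / RT / TAD decode flow.  All
# interleave granularities, NodeIDs, and link numbers are invented
# example values, not actual firmware programming.

NODE_INTERLEAVE = 64        # hypothetical: alternate home agents per 64 B line
CHANNEL_INTERLEAVE = 256    # hypothetical: alternate channels per 256 B block
HOME_AGENTS = [0, 1]        # NodeIDs of two home agents
CHANNELS_PER_AGENT = 2      # memory channels behind each home agent

def sad(addr: int) -> int:
    """Source Address Decoder: NodeID of the home agent owning `addr`
    ("where to go"), here via simple socket interleaving."""
    return HOME_AGENTS[(addr // NODE_INTERLEAVE) % len(HOME_AGENTS)]

# Route Table for requester NodeID 2: destination NodeID -> outgoing link.
ROUTE_TABLE = {2: {0: 0, 1: 1}}

def rt(src: int, dst: int) -> int:
    """Routing layer: pick the link toward the destination ("how to get there")."""
    return ROUTE_TABLE[src][dst]

def tad(addr: int) -> int:
    """Target Address Decoder: memory channel within the home agent
    ("who sees it"), achieving channel interleaving."""
    return (addr // CHANNEL_INTERLEAVE) % CHANNELS_PER_AGENT

addr = 0x1040
target = sad(addr)      # where to go: home agent NodeID 1
link = rt(2, target)    # how to get there: link 1 out of requester NodeID 2
channel = tad(addr)     # who sees it: channel 0 within the home agent
assert (target, link, channel) == (1, 1, 0)

# Consecutive cache lines interleave across the two home agents:
assert sad(0x1000) == 0 and sad(0x1040) == 1
```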
Intel QPI implements a modified MESI coherence protocol [6] with a new forward (F) state added to enable cache-to-cache clean line forwarding. The features of the home agents, which are closely tied to the integrated memory controllers and which are responsible for managing coherency for their portion of the memory, are critical to achieving effective scaling. Only one agent can have a line in this F-state at any time, with all states of the new MESIF coherence protocol summarised in Table 2. The "May Transition To" column indicates what a caching agent is allowed to do internally to its own cache structure without sending any messages to other system agents. Any other transition (such as invalidating a Modified line) requires appropriate coherency messages to be exchanged. A notable result of the MESIF implementation on Intel QPI is that the extended coherence protocol supplies data cached in other nodes in a single round-trip delay for all common operations.

State        Clean/Dirty  May Write?  May Forward?  May Transition To?
M Modified   Dirty        Yes         Yes           -
E Exclusive  Clean        Yes         Yes           MSIF
S Shared     Clean        No          No            I
I Invalid    -            No          No            -
F Forward    Clean        No          Yes           SI

Table 2. MESIF cache states

Intel QPI has been architected to provide a framework capable of meeting future system requirements. As Figure 1 showed, the growth in processor performance has been relentless and is expected to continue. Intel QPI will be adapted as needed to provide the foundation for those future platforms.

One area under consideration is to increase the signalling speeds on the link. Intel's past history with speed bumps on previous generation FSB-based platforms showed remarkable growth in the data transfer rates over the life of that interconnect. Increasing the transfer rate will help address increasing system bandwidth requirements without resorting to increased pin counts.

As transfer rates go up, and as overall energy efficiency demands increase, power management features at both the device and the link level will also be an important aspect of the system interconnect. Finally, the interaction between agents, in terms of the number and content of the messages they exchange, will always come under scrutiny in order to maximize the efficiency of the traffic on the links.

Because of the layered architecture of Intel QPI, these types of enhancements can be worked into future designs while impacting only the layer responsible for that function; layers above or below need not be affected. Intel QPI has therefore established a very capable and adaptable system interconnect architecture, providing a foundation for Intel server platforms well into the future.

References
[1] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Fourth Edition, Morgan Kaufmann, San Francisco, 2007.
[2] G. Singh, R. A. Maddox, and R. J. Safranek, Weaving High Performance Multiprocessor Fabric: Architectural Insights into the Intel QuickPath Interconnect, Intel Press, 2009.
[3] Intel, "An Introduction to the Intel QuickPath Interconnect," Intel White Paper, January 2009.
[4] Intel, "A Quantum Leap in Enterprise Computing: Unprecedented Reliability and Scalability in a Multi-Processor Server," Product Brief, Intel, 2010.
[5] M. Demshki, M. Kumar and R. Shiveley, "Advanced Reliability for Intel Xeon Processor-based Servers," Intel White Paper, Version 1.0, March 2010.
[6] R. J. Chevance, Server Architectures: Multiprocessors, Clusters, Parallel Systems, Web Servers, and Storage Solutions, Elsevier Digital Press, December 2004.
[7] R. J. Safranek, "Intel QuickPath Interconnect Overview," Proc. 21st Hot Chips Symposium on High Performance Chips, August 2009.

Figure 9. Common 4 socket enterprise benchmarks with Intel Xeon processor 7500 series [4]

Figure 10. Scalable system topology employing Intel QuickPath Interconnect