Академический Документы
Профессиональный Документы
Культура Документы
NETWORKED SYSTEMS
DEVELOPED SPECIFICALLY TO ENABLE RESEARCH IN TCP/IP NETWORKING, THE
M5 SIMULATOR PROVIDES FEATURES NECESSARY FOR SIMULATING
NETWORKED HOSTS, INCLUDING FULL-SYSTEM CAPABILITY, A DETAILED I/O
SUBSYSTEM, AND THE ABILITY TO SIMULATE MULTIPLE NETWORKED
SYSTEMS DETERMINISTICALLY. M5S USEFULNESS AS A GENERAL-PURPOSE
ARCHITECTURE SIMULATOR AND ITS LIBERAL OPEN-SOURCE LICENSE HAVE
LED TO ITS ADOPTION BY SEVERAL ACADEMIC AND COMMERCIAL GROUPS.
Nathan L. Binkert
Ronald G. Dreslinski
Lisa R. Hsu
Kevin T. Lim
Ali G. Saidi
Steven K. Reinhardt
University of Michigan
52
Network I/O is an increasingly significant component of modern computer systems. Software trends toward net-centric
applications drive the commercial importance
of TCP/IP networking, while hardware trends
toward ever higher Ethernet bandwidths continually raise the level of technical challenge
involved. Nevertheless, this aspect of system
design has received relatively little attention,
particularly in academia, because of the complexity of analyzing all the hardware and software components involved in even the
simplest TCP/IP networking environments.
The key requirements for performance
modeling of networked systems are
capability. In addition, they both rely on Simics,5 a commercial functional simulator, to provide much of the I/O and privileged-mode
modeling. This approach reduces development
effort, but the resulting black-box nature of
the functional model restricts flexibility, and
licensing restrictions raise barriers to widespread use, particularly for nonacademics.
Because we use M5 to model complex multisystem networks with various components,
our design focused on providing modularity
and flexibility to manage this complexity.
M5s relatively clean software architecture and
advanced feature set have made it a powerful,
flexible simulation tool even for architects
who are not focused on network I/O.
Simulation core
At the core of M5 is a generic, object-oriented, discrete-event simulation infrastructure.
This foundation includes scaffolding for defining, parameterizing, instantiating, and checkpointing simulation objects; an interface for
building system models from complex collections of objects; a powerful statistics collection
package; and significant debugging support.
In M5, a simulation consists of a collection
of objects that interact directly through
method calls. Each object is responsible for
managing its own timing by scheduling its
own events on the global event queue. In contrast to other simulators such as ASIM6 that
integrate timing with interobject communication, M5s approach lets individual object
models make local trade-offs between performance and detail by determining the granularity with which they schedule their internal
events. The user can easily add external interobject delays when desired by interposing bus
or delay buffer objects.
M5s use of object-oriented programming
techniques provides a clear interface between
a simulation object and the rest of the system.
The benefits are threefold: First, researchers
can modify a components behavior using only
localized code changes with a higher likelihood
of not breaking seemingly unrelated parts of
the simulator. Second, object-oriented techniques encourage realistic modeling because
violations of software modularity often indicate violations of hardware modularity (and
thus feasibility). Third, these techniques allow
easy substitution of models for a particular
component, such as a CPU, within a particular configuration. These interchangeable models could differ in level of detail (to make
simulation speed versus accuracy trade-offs),
in the components functional behavior (to
study design alternatives), or both.
Another benefit of object-oriented modeling is that an object encapsulates all of a components state, eliminating nearly all global
variables. Users can thus easily replicate both
individual object models and entire collections of objects. Simple multiprocessors can
be constructed by instantiating multiple CPU
objects and connecting them to a shared bus.
A simulated system is merely a collection of
objects (CPUs, memory, devices, and so on),
so its possible to model a client-server network by instantiating two of these collections
and interconnecting them with a network link
object. All of these objects are in the same simulation process and share the same global
event queue, satisfying our requirement for
deterministic multisystem simulation.
M5 is implemented using two object-oriented languages: Python for high-level object
configuration and simulation scripting (when
flexibility and ease of programming are important) and C++ for low-level object implementation (when performance is key). M5
represents all simulation objects (CPUs, buses,
caches, and so on) as objects in both Python
and C++. Using Python objects for configuration allows flexible script-based object composition to describe complex simulation
targets. Once the configuration is constructed in Python, M5 instantiates the corresponding C++ objects, which provide good
runtime performance for detailed modeling.
Scripting M5
The M5 executable is invoked as follows:
m5 [m5 options] Python script
[script options]
The script, written in standard Python, controls the progress of the simulation. Figure 1
provides an example. The scripts first action
is to import the m5 package, which contains
the interface for orchestrating the simulation.
This step is followed by importing all of the
objects. These two statements are all that is
necessary to access the M5 framework in a
script. The next block of code parses the scripts
options. The following block of code defines
JULYAUGUST 2006
53
Object models
M5 includes a variety of object models implemented upon the core simulation engine. These
models include CPUs, caches, buses, and I/O
deviceseverything necessary for modeling
networks of complete systems. These models
also define the interfaces used for interobject
communicationfor example, transferring
requests between objects in the memory hier-
54
IEEE MICRO
archy. Researchers who find the supplied models inadequate can develop new objects that
implement the same interfaces, and incorporate
them transparently into existing configurations.
CPU models
M5 contains two primary CPU models,
SimpleCPU and O3CPU. Both models
derive from a base CPU class and export the
same interface, so they are interchangeable.
M5 can also switch between CPU models
during runtime, allowing the use of SimpleCPU for fast-forwarding and warm-up
phases, and O3CPU for taking statistics.
SimpleCPU is an in-order, nonpipelined
functional model that can be configured to execute one or more instructions per cycle, but can
only have one outstanding memory operation.
In addition to fast-forwarding and warm-up,
SimpleCPU is useful for modeling network
client systems whose only goal is to generate
traffic for a detailed server system under test.
O3CPU is an out-of-order, superscalar,
pipelined, simultaneous multithreading
(SMT) model. The O3CPU model simulates
an out-of-order pipeline in detail. Individual
stages such as fetch, decode, and so on are configurable in their width and latency. Both forward and backward communication between
stages is modeled explicitly using time buffers
(which we describe later). The model includes
detailed branch predictors, instruction queues,
load/store queues, functional units, and memory dependence predictors. The O3CPU uses
C++ templates to allow easy switching of individual pipeline stages or structures, avoiding
the performance overhead of virtual functions.
We developed the O3CPU model with a
strong focus on ensuring timing accuracy. To
promote accuracy, we integrated both timing
and functional modeling into a single executein-execute pipeline, in which functional instruction execution occurs in the execute stage of
the timing pipeline. Most other models, such
as our previous detailed CPU model, are execute-in-fetch models, which functionally execute instructions as they are fetched and then
do the timing modeling afterwards. Our execute-in-execute design provides accurate modeling of timing-dependent instructions, such
as synchronization operations and I/O device
accesses. An execute-in-execute model also provides higher confidence in the timing models
Backward communication
Fetch
Decode
Rename
Issue
Execute
Writeback
Commit
Memory system
M5s memory system includes two main
types of objects: devices and interconnects.
Devices are components such as caches, memories, and I/O devices. Interconnects encapsulate communication mechanisms such as buses
and networks. Although M5 currently provides
only a bus model, we have designed the deviceto-interconnect interface to easily allow extension to point-to-point networks as well.
M5 supports configurable caches with parameters for size, latency, associativity, replacement policy, coherence protocol, and so on.
JULYAUGUST 2006
55
I/O devices
Device (L1)
CPU
CPU
Port API
Port API
L1 cache
L1 cache
Interconnect (bus)
Bus
L2 cache
System bus
PCI bridge
DRAM
PCI bus
Network
interface controller
56
IEEE MICRO
Ethernet
M5 connects the NICs of individual systems using an Ethernet link object, which we
model as a lossless, full-duplex channel of configurable bandwidth and delay. M5 calculates
the time on the link by dividing the packet
size by the link bandwidth. If the result is less
than the wire delay, only one packet at a time
is allowed on the link. If the delay is large, then
the link can have multiple packets in flight. If
insufficient buffer space is available at the
receiver when a packet exits the link, the packet is dropped.
Outstanding
packets allowed = 6
Seq = 1
Seq = 2
Seq = 3
Timeout
resend 4
Ack 1
Seq = 4
Ack 2
Seq = 5
Ack 3
Seq = 6
Seq = 7
Seq = 8
Timeout
resend 4
Seq = 9
ACK
processing is
instantaneous
CPU switch
ACK
processing
time is
denoted by
arc
Dup seq 4
Dup seq 4
Time
Outstanding
packets
allowed =1
Ack 4
Timeout
resend 5
Ack 5
Dup seq 5
Ack 6
Ack 7
Sender
Receiver
Figure 4. TCP flow disruption due to CPU model change. The arrows
between the vertical lines represent packet transmissions.
JULYAUGUST 2006
57
Zero delay
400-s delay
Gbps
80
15
0
22
0
29
0
36
0
43
0
50
0
57
0
64
0
71
0
78
0
85
0
92
0
99
0
1,
06
0
10
Millions of cycles
Validating M5
Figure 5. A Comparison of TCP tuning times with respect to round trip times.
600
Bandwidth (Mbps)
500
400
300
200
100
0
Transmit
Receive
Web
58
IEEE MICRO
JULYAUGUST 2006
59
4. N. Hardavellas et al., SimFlex: A Fast, Accurate, Flexible Full-System Simulation Framework for Performance Evaluation of Server
Architecture, ACM Sigmetrics Performance Evaluation Rev., vol. 31, no. 4, Mar.
2004, pp. 31-35.
5. P.S. Magnusson et al., A Full System Simulation Platform, Computer, vol. 35, no. 2,
Feb. 2002, pp. 50-58.
6. J. Emer et al., ASIM: A Performance Model
Framework, Computer, vol. 35, no. 2, Feb.
2002, pp. 68-76.
7. L.R. Hsu et al., Sampling and Stability in
TCP/IP Workloads, Proc. First Ann. Workshop Modeling, Benchmarking, and Simulation, 2005, pp. 68-77; http://www.arctic.
umn.edu/~jjyi/MoBS/2005/program/MoBS_
2005_Proceedings.pdf.
8. A.G. Saidi et al., Performance Validation of
Network-Intensive Workloads on a Full-System Simulator, Proc. Workshop Interaction
between Operating System and Computer
Architecture (IOSCA 05), 2005, pp. 33-38;
http://www.ideal.ece.ufl.edu/workshops/
iosca05/paper6.pdf.
60
IEEE MICRO