
Transmeta Crusoe
CS433 Processor Presentation Series Prof. Luddy Harrison
CS433 Prof. Luddy Harrison
Copyright 2005 University of Illinois
Note on this presentation series

These slide presentations were prepared by students of CS433 at the University of Illinois at Urbana-Champaign. All the drawings and figures in these slides were drawn by the students. Some drawings are based on figures in the manufacturer's documentation for the processor, but none are electronic copies of such drawings. You are free to use these slides provided that you leave the credits and copyright notices intact.
Outline
Transmeta Innovation Timeline
Crusoe Processor Family and Overview
Architecture showing Data paths, Registers, ALU, etc.
Instruction Set
Pipelining
Code Morphing
LongRun Power Management
Memory Map and Support for Internal and External Memories
Clocks and Timing
Processor Pin Layout
Target Applications, Specific Use, and Sample Assembly Language Code for application kernels
Transmeta Innovation Timeline
Dave Ditzel, of RISC fame and formerly of SPARC, founded Transmeta as its CEO in 1995.
The first patent (5958061) was applied for on July 24, 1996 and granted on September 28, 1999.
On January 19, 2000 the Crusoe processor was publicly announced.
Crusoe became known as an x86-compatible family of solutions that combines strong performance with remarkably low power consumption.
Crusoe Processor Family

TM5400 -- 500-700 MHz, 256 KB L2 cache
TM5500 -- 0.13 µm, 667-800 MHz, 256 KB L2 cache
TM5600 -- 500-700 MHz, 512 KB L2 cache
TM5800 -- 0.13 µm, 667-800 MHz, 512 KB L2 cache
SE TM55E/TM58E -- 0.13 µm, 800 MHz, 256 KB L2 cache; embedded version of the Crusoe processor
This presentation focuses on the TM5800 as the representative processor in this family.
Crusoe Processor Model TM5500
667 MHz
128 KB L1 cache (64 KB instruction, 64 KB data)
256 KB L2 write-back cache
Integrated Northbridge
  64-bit, 133 MHz DDR memory controller
  64-bit, 133 MHz SDR memory controller
  32-bit, 33 MHz, 3.3 V PCI bus
MMX instruction support
0.13 µm process
Compact 474-pin ceramic BGA package
Max TDP: 5.1 W

Crusoe Processor Model TM5800
800 MHz - 1 GHz
128 KB L1 cache (64 KB instruction, 64 KB data)
512 KB L2 write-back cache
Integrated Northbridge
MMX instruction support
0.13 µm process
Compact 474-pin ceramic BGA package
Max TDP: 5.1 W

Crusoe SE Processor Model TM55E
800 MHz - 1 GHz
128 KB L1 cache (64 KB instruction, 64 KB data)
256 KB L2 write-back cache
Integrated Northbridge
MMX instruction support
0.13 µm process
Compact 474-pin ceramic BGA package
Max TDP: 5.1 W
Supports junction temperatures (Tj) of 100 °C
Rated for 24/7 operation for 10 years

Crusoe SE Processor Model TM58E
800 MHz - 1 GHz
128 KB L1 cache (64 KB instruction, 64 KB data)
256 KB L2 write-back cache
Integrated Northbridge
MMX instruction support
0.13 µm process
Compact 474-pin ceramic BGA package
Max TDP: 5.1 W
Supports junction temperatures (Tj) of 100 °C
Rated for 24/7 operation for 10 years
Characteristics of Crusoe
4 Instruction Issue, 128-Bit VLIW Engine
  Fully Pentium 4-ISA compatible
  Up to four instructions issued per clock cycle
  MMX multimedia extensions
  512 KB L2 cache
Advanced Code Morphing Software (CMS)
  Unique software-based architecture is key to reducing power consumption and enabling future scalability and flexibility
Integrated Northbridge Core Logic
  On-chip SDR and DDR-266 memory interfaces
  On-chip 32-bit, 33 MHz PCI bus controller
Characteristics of Crusoe contd.

LongRun Dynamic Power Management
  Enables low-power operation by dynamically adjusting operating frequency and voltage to match the performance requirements of application workloads
  Provides higher performance within smaller, thermally constrained environments
  Enables fanless designs for quieter and more reliable systems
Architecture
[Figure: Crusoe processor block diagram]
High Performance VLIW Engine
L1 Instruction Cache
L1 Data Cache
L2 Cache
SDR Memory Interface Controller
DDR Memory Interface Controller
Serial ROM Interface Controller
32-bit, 33 MHz PCI Bus Interface Controller
Transmeta LongRun Power/Thermal Management

x86 CPU vs. Crusoe

[Figure: side-by-side block diagrams of a modern x86 CPU and Transmeta's Crusoe. Blocks shown: x86 Instruction Translation, Instruction Decode, Branch Predict, Register Rename, Instruction Reorder, L1 Cache, Execution Units.]
Crusoe Hardware/Software Partitioning
Purple portions are implemented in hardware
  Much smaller than traditional microprocessors
Orange portions are implemented in software
  x86 to native VLIW translation, branch prediction, and out-of-order execution (OOO) logic
Processor Details
Fabricated in 0.13 µm process technology
High-performance 4-issue, 128-bit VLIW engine with Code Morphing Software to provide x86 compatibility
L1 data cache: 64 KB
L1 instruction cache: 64 KB
L2 write-back cache: 512 KB
DDR memory support: DDR SDRAM, 100-133 MHz
SDR memory support: SDR SDRAM, 66-133 MHz
PCI bus controller (PCI 2.1 compliant) with 33 MHz, 3.3 V interface
Standard product speeds of 733, 800, 867, 933, and 1000 MHz
Power: 0.5-1.5 W at 300-1000 MHz and 0.8-1.3 V running typical multimedia applications; 150 mW typical in deep sleep
Processor package: compact 474-pin ceramic BGA
Processor Block Diagram

[Figure: processor block diagram]
CPU core: Integer Unit, Floating Point Unit, MMU, Multimedia Instructions, Bus Interface
L1 Instruction Cache: 64 KB, 8-way set associative
L1 Data Cache: 64 KB, 16-way set associative
Unified TLB: 256 entries, 4-way set associative
L2 Write-Back Cache: 512 KB, 4-way set associative
DDR SDRAM Controller
SDR SDRAM Controller
Serial ROM Interface
PCI Controller & Southbridge Interface
Architecture Block Diagram

[Figure: architecture block diagram]
Register files: 64 General Purpose Registers and 32 Floating Point Registers, each with Shadow Registers; Debug Registers
Functional units: ALU0, ALU1, Load/Store, Branch, FPU
Other blocks: Alias Hardware, TLB, T-Bit Buffer, Gated Store Buffer
Data side: Local Data Memory (8 KB), Data Cache (64 KB), Data Flow & Data Cache Control
Instruction side: Local Program Memory (8 KB), Instruction Cache (64 KB), Instruction Cache Control
Secondary Instruction/Data Cache (256 KB)
Bus Interface Unit

Architecture Details
Five functional units: 2 x ALU, FP/MMX, MEM, and BR
Each instruction (called a molecule) holds two or four RISC-like operations (called atoms)
Shadowed register sets: 64 GPRs, 32 FPRs
Gated store buffer in the Load/Store unit
Alias hardware
Very few hardware interlocks in the pipeline
  Correct execution is guaranteed by CMS scheduling and the compiler
Micro-architecture is hidden from the x86 programmer
  Processor timing issues can be worked around by CMS
Crusoe Processor Hierarchy
[Figure: Crusoe processor hierarchy, top to bottom]
x86 Software
  x86 Applications
  x86 Operating System (Windows XP, Linux, etc.)
  x86 BIOS
x86 Compatible Crusoe Processor Solution
  Code Morphing Software (resides in ROM)
  VLIW Processor
Crusoe: A Native VLIW Processor

Crusoe is a VLIW (Very Long Instruction Word) processor
  Multiple functional units (FUs), each explicitly programmed on each instruction
A Very Long Instruction Word is called a molecule
  Each molecule contains up to 4 atoms
  Each atom is an instruction destined for an FU
  A molecule is either 128 bits or 64 bits wide
Crusoe's compilation and scheduling is a hybrid between dynamic superscalar instruction scheduling and VLIW instruction scheduling
  Code Morphing Software (CMS) takes a compiled x86 program and recompiles it, on the fly, to Crusoe's native VLIW instruction format
  Recompilation uses sophisticated compiler algorithms to extract parallelism from the code, find dependencies, and perform the other analyses of a state-of-the-art VLIW compiler
Instruction Word - VLIW
[Figure: a 128-bit molecule and the functional unit that executes each of its atoms]
FADD -> Floating Point Unit
ADD  -> Integer ALU #0
LD   -> Load/Store Unit
BRCC -> Branch Unit
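The molecule shown above can be modelled as a small data structure. The following is only a sketch under stated assumptions: the class name Atom, the function make_molecule, and the unit labels are hypothetical names chosen for illustration, and the real slot encoding is not modelled.

```python
from dataclasses import dataclass

# Unit labels follow the slide; the one-atom-per-unit rule is an assumption
# made for this sketch, not a statement about the real encoding.
UNITS = ("FPU", "ALU0", "LSU", "BRU")


@dataclass(frozen=True)
class Atom:
    unit: str  # functional unit that executes the atom
    text: str  # assembly mnemonic, for display only


def make_molecule(atoms):
    """Place each atom in the slot of its functional unit (one atom per unit)."""
    slots = {}
    for atom in atoms:
        if atom.unit not in UNITS:
            raise ValueError(f"unknown unit {atom.unit}")
        if atom.unit in slots:
            raise ValueError(f"unit {atom.unit} already used in this molecule")
        slots[atom.unit] = atom
    return slots


# The four atoms from the slide, one per functional unit.
molecule = make_molecule([
    Atom("FPU", "fadd"),
    Atom("ALU0", "add"),
    Atom("LSU", "ld"),
    Atom("BRU", "brcc"),
])
```

A second atom aimed at an already occupied unit is rejected, which mirrors the idea that each functional unit is explicitly programmed once per molecule.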
Code Morphing Software

x86 instructions are converted to the Crusoe instruction set through a software layer (translation)
During instruction translation, optimizations and scheduling tricks can be performed
  Instruction scheduling
  Register renaming
  Speculation
The Crusoe processor architecture is decoupled from application software
Efficeon TM8800 Processor

Next-generation Crusoe processor, designed for higher performance
8 Instruction Issue, 256-Bit VLIW Engine
  Fully Pentium 4-ISA compatible
  Up to eight instructions issued per clock cycle
  Up to 50% improvement in integer applications
SSE and SSE2 multimedia extensions enable multimedia applications to run up to 80% faster per clock cycle than on previous-generation Crusoe processors
Large 1 MB L2 cache improves processor performance
Instruction Set
Registers
The processor has 64 GPRs, with the following specialized semantics:
  %r63 (%zero) always reads 0 when used as a source operand
  %r62 (%sink) is a discarded destination (e.g., for compares); it is never read
  %r59 (%from) saved return address
  %r58 (%link) return address
  %r47 (%sp) is the current stack pointer
  %r0 (%eax) for current x86 machine state
  %r1 (%ecx) for current x86 machine state
  %r2 (%edx) for current x86 machine state
  %r3 (%ebx) for current x86 machine state
Registers contd.
The lower 48 of these GPRs are backed by shadowed GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the 'known good' shadow GPRs. The processor also includes 32 80-bit floating point registers and 16 FP shadow registers. There are also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.
Instruction Encoding
[Figure: general bundle format. A bundle is built from 32-bit slots at %PC + 0, %PC + 4, %PC + 8, and %PC + 12. Each bundle carries a C bit, a two-bit format code (00, 01, 10, or 11), and per-slot type bits. Slot contents shown: ALU0, ALU1, LSU, imm32, and Branch (BRU).]
Instruction Encoding Continued.

Instructions are encoded in little-endian byte and word order.
All instructions (except branches) have a 9-bit opcode field.
All opcodes share a common mapping into this 9-bit space, even though not all instructions can execute on all functional units.
The hardware is designed to interlock all operations through scoreboarding; however, design flaws sometimes prevent the microprocessor from taking full advantage of these features.
Format of ALUs
[Figure: 32-bit ALU0/ALU1 slot format (bits 31..0), in three variants]
ALU with register operands: C, 10, op, rd, ra, rb
ALU with 8-bit immediate: C, 10, op, rd, ra, imm8
ALU with 32-bit immediate: C, 10, opcode pattern 11xxxx011, rd, ra
ALU1 executes a superset of the operations available on ALU0. Both ALUs may have an 8-bit signed immediate instead of register rb. ALU1 may optionally use a 32-bit immediate, but only in appropriate bundle types.
Format of ALUs Continued.

The ALU0|imm32 and ALU0|ALU1 bundle types share the same format code (10), but the ALU1 slot is interpreted as an imm32 depending on the opcode:
  11xxxx011: 32-bit immediate in place of ALU1
  If the 11xxxx011 pattern appears in an ALU1 slot, an 8-bit immediate is used instead
Condition codes exactly mirror the x86 semantics
  Any instruction in either ALU0 or ALU1 can optionally latch the resulting condition codes if specified by the opcode
  However, only one of the two ALU0|ALU1 slots per bundle can write the condition codes in a single cycle
The ALU1 slot is also used for all floating point and MMX operations, as indicated by ALU1's type select bits being something other than '00'.
Format of Load/Store Unit
[Figure: 32-bit Load/Store Unit slot format (bits 31..0)]
Loads: C, 01, op, rd, raddr
Stores: C, 01, op, rs, raddr
Fields marked ? on the slide are unknown.
Single load/store unit performs all loads and stores, alias operations and various other memory related tasks. All LSU operations take a fully calculated address in register ra. No ra+offset or ra+rb addressing modes are provided.
Format of Load/Store Unit Continued.

Two kinds of loads and stores are possible:
  Operations on physical CMS space addresses (as used in CMS itself)
  Operations as user code sees memory; i.e., addresses are translated by the TLB and can never access the protected CMS space
The processor has two special 8 KB SRAMs:
  Local program memory (LPM): holds often-executed assist code for x86 page table lookups, alignment fixups, low-level exception handling, interrupt handling, etc. This avoids having to bring such critical code into the L1 instruction cache on demand.
  Local data memory (LDM): contains data used by the LPM functions; i.e., copies of key x86 MSRs, the native code stack, etc.
Format of Branch Unit

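[Figure: Branch Unit slot formats. Four variants are shown: Unconditional Branch, Unconditional Branch via Register, Conditional Branch, and Conditional Branch via Register. Fields: a three-bit format code (100, 101, or 000), cond, the L bit, and either abstarget >> 3 (branch within CMS) or ra (branch via register). Fields marked ? on the slide are unknown.]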
Branches (both conditional and unconditional) within CMS use a 23-bit absolute target address aligned to a 64-bit boundary. The encoded field holds abstarget >> 3, so the field is shifted left by 3 bits to form the target address.
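The shift-by-three encoding of branch targets can be illustrated with a short sketch. It assumes that the 23 bits are the width of the stored field (which would give a reach of 2^26 bytes); the slide does not state this explicitly, and the function names encode_target and decode_target are hypothetical.

```python
TARGET_BITS = 23  # assumed width of the encoded field
ALIGN_SHIFT = 3   # targets are aligned to 64 bits, i.e. 8 bytes


def encode_target(address: int) -> int:
    """Return the branch field for an 8-byte-aligned absolute address."""
    if address % (1 << ALIGN_SHIFT) != 0:
        raise ValueError("branch target must be aligned to a 64-bit boundary")
    field = address >> ALIGN_SHIFT
    if field >= (1 << TARGET_BITS):
        raise ValueError("branch target out of range for the encoded field")
    return field


def decode_target(field: int) -> int:
    """Recover the absolute address by shifting the field left by 3 bits."""
    return (field & ((1 << TARGET_BITS) - 1)) << ALIGN_SHIFT
```

Because every molecule starts on an 8-byte boundary, dropping the three low bits loses no information and extends the reach of the field by a factor of eight.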
Format of Branch Unit Continued.

CMS address space is the only region from which code can be executed; the processor is physically incapable of executing code directly from user space.
  This is the reason why all x86 code must be translated (and thus copied to CMS space) before native execution.
Conditional branches use the same condition code set (cc bits) as the x86 encoding in jump instructions.
Unconditional branches can optionally write the return address to the %link register (%r58) if the L bit (bit 0 of the cc field) is set.
Indirect branches occur through a general purpose register.
  Special instructions are provided to prepare for an indirect branch when the target address is known in advance; this avoids the three-cycle branch penalty.
  In addition, special instructions may provide a branch with link functionality.
Pipelining
Pipelining
[Figure: pipeline diagram]
ALU:              Fetch0  Fetch1  Regs  ALU     Except    Write  Commit
Load/Store Unit:                        Cache0  Cache1
Branch Unit:                            (Wait)  Redirect
The top row of the diagram indicates the pipeline for an ALU instruction, with the other rows representing the two other types of logical units.
Pipeline Stages
Fetch0: The first 64 bits of a 64-bit or 128-bit bundle are fetched.
Fetch1: The second 64 bits are fetched (for 128-bit bundles only).
Regs: Read source registers and decode/disperse instructions.
ALU: Execute single-cycle operations in ALU0 and ALU1.
Except: Complete two-cycle ALU0/ALU1 operations and detect exceptions.
Cache0: Initiate L1 data cache access based on register address.
Cache1: Complete L1 data cache access, TLB access, and alias checks.
Write: Write results back to GPRs or store buffer.
Commit: Optionally latch the lower 48 GPRs into the shadow registers.
Decoding and Scheduling

Conventional x86 superscalar processors fetch binary instructions and decode them into separate micro-operations, which are then reordered by the hardware and executed in parallel.
Code morphing translates an entire group of x86 instructions at once and stores the translation in a translation cache for future reference.
The Code Morphing System (CMS) is described later.
Crusoe's VLIW Instruction Scheduling

[Figure: software stack. Applications and Operating System Software run on top of the Code Morphing Software, which runs on the CPU functional units.]
Decoding and Scheduling Continued.

The translation step introduces many opportunities:
  Because code is executed repeatedly, translations are frequently reused from the translation cache, which reduces overhead
  Much more sophisticated scheduling algorithms can be used
  Power consumption is much lower because translation is done entirely in software
  Generated code can be optimised, and by learning which parts are executed often, CMS can change levels of optimisation dynamically
Code Morphing
Code Morphing System (CMS)

Dynamic binary translation of the x86 ISA into Crusoe's internal native VLIW format, using software at run time
  Translations are cached to avoid translator overhead on repeated execution
  Optimizes across x86 instruction boundaries to improve performance
  Selects a region, produces native code, and stores the translation in the translation cache
Completely invisible to the operating system: it looks like an x86 hardware processor
Runtime system
  Handles devices, interrupts and exceptions, power management, and garbage collection
Code Morphing
[Figure: division of work between Code Morphing Software and silicon]
Hardware: high megahertz, small die size, non-x86 VLIW processor
  Silicon microchip: Integer Units, Floating Point Unit, Multimedia Unit, Data Cache, Instruction Cache
  Executes VLIW binary code
Code Morphing Software technology: accepts x86 binary code and handles
  Superscalar out-of-order execution
  Variable-length instructions
  Segmentation
  Trigonometric functions
  Microcode
  Complex addressing modes
  Instruction prefixes
  Real and protected mode
  ASCII arithmetic

[Figure: CMS control flow flowchart]
Interpreter: Start; Interpret Next Instruction; Exceed Translation Threshold? (Yes / No)
Translator: Translate Region; Store in TCache
Execution: Find Next Instruction in TCache? (found / not found); Execute Translation from TCache; a fault leads to Rollback; Chain / No Chain
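The interplay of interpreter, translation threshold, and translation cache in the flowchart can be sketched as a small dispatch loop. This is a simplified illustration, not the actual CMS logic: the class CodeMorpher, its method names, and the threshold value are assumptions, and chaining, faults, and rollback are left out.

```python
class CodeMorpher:
    """Toy dispatch loop: interpret cold code, translate hot code, reuse translations."""

    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical translation threshold
        self.exec_counts = {}       # x86 address -> times interpreted
        self.tcache = {}            # x86 address -> cached translation
        self.log = []               # (action, address) pairs, for inspection

    def translate(self, addr):
        # Placeholder for "Translate Region"; a real translator emits VLIW code.
        return f"translation@{addr:#x}"

    def step(self, addr):
        # "Find Next Instruction in TCache?"
        if addr in self.tcache:
            self.log.append(("execute", addr))
            return
        # Not found: interpret and count the execution.
        count = self.exec_counts.get(addr, 0) + 1
        self.exec_counts[addr] = count
        self.log.append(("interpret", addr))
        # "Exceed Translation Threshold?" Yes: translate and store in the TCache.
        if count > self.threshold:
            self.tcache[addr] = self.translate(addr)
            self.log.append(("translate", addr))


cms = CodeMorpher(threshold=2)
for _ in range(5):
    cms.step(0x1000)
```

With a threshold of 2, the address is interpreted three times, translated once, and then served from the translation cache on the remaining two executions.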
Code Morphing Benefits

Molecules explicitly encode the instruction-level parallelism, hence they can be executed by a simple VLIW engine.
  Hardware does not need to perform complex instruction reordering.
  Simplicity means a fast and low-power design.
Processor upgrades are simplified.
  The software layer means that software developers don't have to recompile programs.
  A new hardware architecture only needs new code morphing software.
  Code morphing software can be upgraded independently in flash ROM.
Code Morphing Benefits contd..

The software layer helps the debugging process.
  There are different ways to perform the same function, so software can be changed during debug.
The software layer increases performance.
  Timing of critical paths is improved.
  Optimization is applied to remove unnecessary instructions.
  Software reordering can do much better than hardware by looking at a bigger window of instructions and applying more complicated algorithms.
Dynamic Translation using Chaining
[Figure: code cache and code cache tags during runtime execution, before and after chaining]
Pre-chained:
  add %r5, %r6, %r7
  li %next_addr_reg, next_addr    # load address of next block
  j dispatch_loop
Chained:
  add %r5, %r6, %r7
  j <physical location of translated code for next_block>
Performing a Translation
The frontend decodes x86 instructions into a simple sequence of atoms.
The optimizer applies well-known compiler optimizations (including elimination of unnecessary atoms from the instruction stream).
The scheduler reorders the atoms and groups them into molecules.
Translation Example
Original x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5
After the frontend:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
After the optimizer:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
After the scheduler:
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
Translation Step 1
Original x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5
Translation by code morphing software
Native VLIW code:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
Translation Step 2
Native VLIW code:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
Optimisation: elimination of redundant atoms and of extra condition code operations
Optimised native VLIW code:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
Translation Step 3
Optimised native VLIW code:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
Scheduling: the remaining atoms are grouped into molecules using a large window
Scheduled native VLIW code:
  1. ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  2. ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
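The scheduling step can be approximated by a greedy list scheduler. The sketch below is not the CMS scheduler; it assumes two ALUs and one load/store unit per molecule, treats register dependencies conservatively, and ignores memory ordering and condition codes. The names Atom, CAPACITY, depends, and schedule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    text: str    # assembly text of the atom
    unit: str    # "ALU" or "LSU"
    dest: str    # destination register
    srcs: tuple  # source registers, including address registers


# Assumed per-molecule capacity: two ALUs and one load/store unit.
CAPACITY = {"ALU": 2, "LSU": 1}


def depends(later, earlier):
    """True if `later` must be placed in a later molecule than `earlier`."""
    return (earlier.dest in later.srcs       # read after write
            or earlier.dest == later.dest    # write after write
            or later.dest in earlier.srcs)   # write after read (conservative)


def schedule(atoms):
    """Greedily pack atoms into molecules, respecting dependencies and unit counts."""
    remaining = list(atoms)
    molecules = []
    while remaining:
        free = dict(CAPACITY)
        chosen = []
        for i, atom in enumerate(remaining):
            # An atom is blocked by any earlier atom not yet in a previous molecule.
            blocked = any(depends(atom, prev) for prev in remaining[:i])
            if not blocked and free[atom.unit] > 0:
                free[atom.unit] -= 1
                chosen.append(i)
        molecules.append([remaining[i] for i in chosen])
        remaining = [a for i, a in enumerate(remaining) if i not in chosen]
    return molecules


# The optimised atoms from the slide.
atoms = [
    Atom("ld %r30, [%esp]", "LSU", "r30", ("esp",)),
    Atom("add %eax, %eax, %r30", "ALU", "eax", ("eax", "r30")),
    Atom("add %ebx, %ebx, %r30", "ALU", "ebx", ("ebx", "r30")),
    Atom("ld %esi, [%ebp]", "LSU", "esi", ("ebp",)),
    Atom("sub.c %ecx, %ecx, 5", "ALU", "ecx", ("ecx",)),
]
molecules = schedule(atoms)
```

On this input the sketch produces two molecules with the same contents as the scheduled code on the slide: the first holds the load of %r30 and the sub.c, the second holds the load of %esi and both adds.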

CMS Translation Cache

Translation cache purposes (trace cache)
  A place to keep translations of x86 code
  A way to build longer code sequences that are better suited for optimization
  Reduces the fetch bottleneck (the original reason for trace caches, which also exist in the Pentium 4)
Successive executions of the translation invoke only the optimizer, not the translator
  The cost of translation is amortized over successive executions
Computed gotos, trace linking, and inlined function calls add significant overhead in a software-based trace optimizer
  Simple chaining is not sufficient; special hardware exists for locating traces
Code Optimization
The optimizer examines whole translations at a time
Several levels of optimization: from interpretation up to highly optimized
  High levels of optimization add run-time overhead
  Only worth doing for frequently executed code
Code morphing instruments generated code to help determine usage patterns (a count of the number of times executed)
  The optimization level to apply is chosen through heuristics based on usage patterns
  The more a translation is executed, the more optimized it becomes
If monitoring indicates that optimizations were too aggressive, the trace is partially de-optimized
CMS Memory layout

[Figure: CMS memory layout]
CMS compressed flash ROM image: expands to 2 MB
Main memory, up to the maximum RAM address:
  Memory available to applications
  Memory used by CMS (16 MB): CMS code in the first 2 MB (0 MB to 2 MB), followed by the translation cache
Hardware Support for Code Morphing

Explicit setting of condition codes
  Crusoe uses specific registers to emulate the setting of condition codes by the processor (a .c suffix on an instruction shows that condition codes need to be set)
All registers holding x86 state are shadowed
  The commit operation copies active state to the shadow registers
Gated store buffers for memory writes
Alias hardware allows load instructions to be ordered ahead of store instructions
Memory-mapped I/O
Translated bit in the page table to detect self-modifying code and provide protection
Shadow Registers
Two copies of each register: a working copy and a shadow copy
If execution reaches the end of a translation block, a commit operation is performed
  Copies all working registers into the shadow registers
If any exceptional condition occurs inside the translation block, a rollback operation is performed
  Copies the shadow register values back into the working registers
Gated Store Buffers

Store data is held in the gated store buffer
  It is released to the memory system at the time of a commit
  On a rollback, it is simply dropped from the store buffer
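Commit and rollback across shadow registers and the gated store buffer can be sketched together. This is a behavioural model only: the class SpeculativeState and its methods are hypothetical, memory is a plain dictionary, and forwarding of buffered stores to later loads is an assumption added so that the model is self-consistent.

```python
class SpeculativeState:
    """Working/shadow registers plus a gated store buffer (behavioural sketch)."""

    def __init__(self, registers, memory):
        self.working = dict(registers)  # working copy, updated speculatively
        self.shadow = dict(registers)   # last committed ("known good") copy
        self.memory = memory            # committed memory contents
        self.store_buffer = []          # gated stores as (address, value)

    def write_reg(self, name, value):
        self.working[name] = value

    def store(self, address, value):
        # Stores are held back until the next commit.
        self.store_buffer.append((address, value))

    def load(self, address):
        # Assumption: the youngest buffered store to the address is forwarded.
        for addr, value in reversed(self.store_buffer):
            if addr == address:
                return value
        return self.memory.get(address, 0)

    def commit(self):
        # Latch working registers into the shadow copy and release the stores.
        self.shadow = dict(self.working)
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()

    def rollback(self):
        # Restore the last committed registers and drop uncommitted stores.
        self.working = dict(self.shadow)
        self.store_buffer.clear()


state = SpeculativeState({"eax": 1}, {0x100: 7})
state.write_reg("eax", 2)
state.store(0x100, 99)
state.rollback()                      # a fault inside the translation
after_rollback = (state.working["eax"], state.memory[0x100])
state.write_reg("eax", 3)
state.store(0x100, 42)
state.commit()                        # end of the translation reached
after_commit = (state.shadow["eax"], state.memory[0x100])
```

After the rollback the register and memory hold their committed values again; after the commit both the register and the buffered store become architecturally visible.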
Exceptions
Original x86 code:
  addl %eax, (%esp)    # load data from stack, add to eax
  addl %ebx, (%esp)    # load data from stack, add to ebx
  movl %esi, (%ebp)    # load esi from memory
  subl %ecx, 5         # sub 5 from ecx
Scheduled VLIW code:
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
x86 instructions are executed out of order with respect to the original program flow.
State must be restored for precise traps.
Exception Handling
x86 exceptions are precise
  This is problematic for out-of-order execution of instructions
On an exception, processor state is rolled back to the most recent commit
  Execution proceeds in in-order mode until the fault location is found
  Memory updates are rolled back through the gated store buffer (which holds x86 stores until a commit)
Alias Hardware for Data Speculation

The translator cannot always prove that load and store addresses do not overlap.
Crusoe provides simple alias hardware support
  It allows CMS to reorder selected memory references
  It takes on the burden of verifying at run time that the reordered references did not overlap
When it detects a violation
  It raises an exception
  CMS may invoke rollback and conservative re-execution in the interpreter
Moving Loads Ahead of Stores

When the translator moves a load operation ahead of a store operation:
  Load => load-and-protect (ldp): load and record the address and size of the data loaded
  Store => store-under-alias-mask (stam): check for protected regions and raise an exception on overlap
Original order:
  st %data, [%x]
  ld %r31, [%y]
  use %r31
Reordered:
  ldp %r31, [%y]
  stam %data, [%x]
  use %r31
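The load-and-protect and store-under-alias-mask pair can be modelled in a few lines. The sketch is only a behavioural approximation: the class AliasHardware, the exception AliasFault, and the dictionary-based memory are assumptions, and the real hardware tracks a small fixed number of protected regions.

```python
class AliasFault(Exception):
    """Raised when a store overlaps a region protected by an earlier load."""


class AliasHardware:
    def __init__(self):
        self.protected = []  # recorded (address, size) ranges

    def load_and_protect(self, memory, address, size):
        # ldp: perform the load and remember which bytes were read.
        self.protected.append((address, size))
        return memory.get(address, 0)

    def store_under_alias_mask(self, memory, address, size, value):
        # stam: fault if the store touches any protected range.
        for p_addr, p_size in self.protected:
            if address < p_addr + p_size and p_addr < address + size:
                raise AliasFault(f"store to {address:#x} overlaps a protected load")
        memory[address] = value

    def clear(self):
        # Protection is dropped once the translation commits or rolls back.
        self.protected.clear()


memory = {0x200: 5}
alias = AliasHardware()
value = alias.load_and_protect(memory, 0x200, 4)       # load hoisted above the store
alias.store_under_alias_mask(memory, 0x300, 4, 11)     # disjoint address: no fault
try:
    alias.store_under_alias_mask(memory, 0x202, 4, 12)  # overlaps bytes 0x200..0x203
    faulted = False
except AliasFault:
    faulted = True
```

When the fault is raised, CMS would roll back to the last commit and re-execute the region conservatively, as described on the previous slide.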
Memory-mapped I/O
Memory-mapped I/O cannot be distinguished at translation time from regular memory accesses.
Load and store atoms specify whether they have been reordered.
When such a speculative memory atom accesses a memory page that is mapped to I/O space, an exception is raised.
CMS performs a rollback.
Handling Self-Modifying Code

When a translation is made, the associated x86 code page is marked as translated in the page table
A store to a translated code page causes a trap, and the associated translations are invalidated
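The translated-bit mechanism can be sketched as a map from pages to the translations made from them. The class TranslationTracker, its methods, and the 4 KB page size are assumptions for illustration; in the real system the bit lives in the page table and the trap is raised by hardware.

```python
PAGE_SIZE = 4096  # assumed x86 page size


class TranslationTracker:
    """Tracks which x86 pages have translations, standing in for the translated bit."""

    def __init__(self):
        self.translated_pages = {}  # page number -> x86 addresses with translations

    def record_translation(self, x86_addr):
        # Mark the page containing the translated code.
        page = x86_addr // PAGE_SIZE
        self.translated_pages.setdefault(page, set()).add(x86_addr)

    def store(self, addr):
        """Model a store; return the translations invalidated by it, if any."""
        return self.translated_pages.pop(addr // PAGE_SIZE, set())


tracker = TranslationTracker()
tracker.record_translation(0x1000)
tracker.record_translation(0x1040)
tracker.record_translation(0x5000)
invalidated = tracker.store(0x1ff0)   # store into the page holding 0x1000 and 0x1040
untouched = tracker.store(0x9000)     # store into a page with no translations
```

Invalidating at page granularity is simple but coarse: a data write that merely shares a page with code also discards the translations for that page.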
Other Remarks about Code Morphing

Unconditional jumps are completely eliminated; both paths of conditional jumps are speculatively executed, with the proper results being selected later
The scheduler can reorder instructions in the generated code
Registers are renamed aggressively (no hardware register allocation needed)
Speculation and failure recovery
  CMS monitors recurring failures and generates a more conservative translation (e.g., retranslation of smaller regions)
Crusoe Performance Disadvantages

Comparison with conventional out-of-order execution processors from the CMS perspective
  Operations whose latency is not always predictable, such as loads, can stall the static schedule when they are delayed
  Operations from several traces cannot be executed concurrently
Code optimization doesn't start until a block of code has been executed more than a few times
  Application startup time can be long
  It may take a while for an application to be translated and optimized
Some applications have a large working set of code that will not fit entirely into the fixed-size translation cache
Crusoe Performance Disadvantages contd.

Code morphing software is loaded into system memory at boot-up
  It takes up 2 MB of system memory plus an additional 6-14 MB for caching (8-16 MB in total)
Dynamic translation can take up to 6 times more system memory than running the same code on a native x86-based CPU
  Thus 64 MB of system RAM is effectively ~8-9 MB: (64 MB - 16 MB) / 6 = 8 MB and (64 MB - 8 MB) / 6 ≈ 9.3 MB
  This results in a need for more frequent access to disk-based virtual memory
Power Management
LongRun Dynamic Power Management

Conventional power-saving approaches
  Switch off the processor quickly to save power
  Change the clock rate by suspending and restarting the processor
Crusoe LongRun approach
  Adjust the clock rate dynamically, without suspension
  Adjust the voltage level
LongRun achieves power reductions of up to 30%
LongRun Overview
Adaptive power management
  Dynamically reduces core processor power consumption to near-optimal levels in response to application workload requirements
Thermal management
  Intelligently adapts processor operation to system thermal environments
Cooperation between LongRun and the Code Morphing System (CMS)
Power Management States

Crusoe maps industry-standard ACPI power management modes to six processor states
  ACPI global system states: Working, Auto Halt, Quick Start, Deep Sleep, Sleeping, Suspend-to-RAM, Suspend-to-Disk, Soft Off, Mechanical Off
  Crusoe power management states: Normal, Auto Halt, Quick Start, Deep Sleep, DSX, Off
The mapping of states is shown in the power management states table (next slide)
Power Management States

ACPI System State                 Processor State    SDRAM                 Clock Generator
G0 / S0 / C0   Working            Normal             Normal                Running
G0 / S0 / C1   Auto Halt          Auto Halt          Normal/Self Refresh   Running
G0 / S0 / C2   Quick Start        Quick Start        Self Refresh          Running
G0 / S0 / C3   Deep Sleep         Deep Sleep         Self Refresh          Clocks Stopped
G1 / S1        Sleeping           Deep Sleep, DSX    Self Refresh          PLL Shut Down
G1 / S3        Suspend-to-RAM     Off                Self Refresh          PLL Shut Down
G1 / S4        Suspend-to-Disk    Off                Off                   Off
G2 / S5        Soft Off           Off                Off                   Off
G3             Mechanical Off     Off                Off                   Off
Power Management State Diagram

[Figure: power management state diagram]
States: Normal (H = false), Auto Halt (H = true), Quick Start & Stop Grant, Deep Sleep, DSX, Snoop Service
Transition labels:
  HLT (note 2) and Halt bus cycle (note 4)
  Halt break (note 3)
  STPCLK# asserted and Stop Grant bus cycle (note 5)
  STPCLK# negated and H = false (note 1)
  STPCLK# negated and H = true (note 1)
  SLEEP# asserted & CLKIN stopped
  SLEEP# negated & CLKIN running
  Reduce CVDD
  Increase CVDD
  Snoop event
  Snoop serviced and H = true (note 1)
  Snoop serviced and H = false (note 1)
Notes:
1. H = processor halt state
2. HLT = x86 HLT instruction executed
3. Halt break = INTR, NMI, SMI#, INIT#, or RESET#
4. Halt bus cycle = PCI special cycle
5. Stop Grant bus cycle = PCI special cycle
LongRun Power Management

LongRun provides CMS with the ability to adjust:
  Processor core operating voltage (V)
  Clock frequency (f)
V and f are adjusted dynamically depending on the current application load on the processor
  Produces cubic reductions in power consumption (Power ∝ V²f), because a lower f also permits a lower V
  Conventional processors can only scale down power linearly by reducing f
LongRun power management policies are implemented within CMS
  Runtime performance information is used to detect different workload scenarios
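The Power ∝ V²f relation can be checked numerically against the frequency/voltage pairs listed on the LongRun operating curve slide. The sketch below computes only relative dynamic power and ignores leakage and other static power; the function name relative_power is hypothetical.

```python
# (frequency in MHz, core voltage in V), from the LongRun operating curve slide.
OPERATING_POINTS = [
    (300, 0.80), (433, 0.875), (533, 0.95), (667, 1.05),
    (800, 1.15), (900, 1.25), (1000, 1.30),
]


def relative_power(freq_mhz, volts, reference=OPERATING_POINTS[-1]):
    """Dynamic power relative to the reference point, using P proportional to V^2 * f."""
    ref_freq, ref_volts = reference
    return (volts / ref_volts) ** 2 * (freq_mhz / ref_freq)


# Lowest operating point compared with 1 GHz at 1.30 V.
with_voltage_scaling = relative_power(300, 0.80)
# Frequency scaling alone keeps the voltage at 1.30 V.
frequency_only = relative_power(300, 1.30)
```

Dropping from 1 GHz at 1.30 V to 300 MHz at 0.80 V leaves roughly 11% of the dynamic power, whereas lowering the frequency alone would leave 30%.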
LongRun Power Adjustments

Power adjustments are transparent to the operating system, power management controller, and user
Uses a number of core frequency/voltage operating points
  Allows LongRun to optimize the processor for the lowest power and maximum performance along the operating curve
The processor transparently switches over to traditional power models when frequency and voltage scaling reaches the minimum operating point on the curve
  Allows ACPI-like policies to handle power management at very low-power operating points
LongRun Power Management Operating Curve
[Figure: LongRun operating curve, power versus frequency (300 to 1000 MHz), with a Typical Operating Region and a Peak Performance Region]
Operating points:
Frequency    Core voltage
300 MHz      0.80 V
433 MHz      0.875 V
533 MHz      0.95 V
667 MHz      1.05 V
800 MHz      1.15 V
900 MHz      1.25 V
1 GHz        1.30 V
Typical Operating Power per State

Application Workload    ACPI State    Typical Processor Power
DVD Playback            C0-C3         1.0-1.5 W
MP3 Playback            C0-C3         0.50 W
Auto Halt               C1            0.35 W
Quick Start             C2            0.30 W
Deep Sleep              C3            0.15 W
DSX                     C3            0.10 W
Notes:
1. All power supplies at their nominal operating values. Full system power management enabled, including LongRun power management.
2. Typical DVD power is measured while running the Win DVD 2000 player under Windows 2000.
3. Typical MP3 power is measured while running MMJukebox under Windows 2000.
Power Comparison
Processor                   Clock          L1                  L2       Vcc           Power         Process
Efficeon TM8600             1.0-1.2 GHz    I 128 KB, D 64 KB   1 MB     Unspecified   Unspecified   0.13 µm
Crusoe TM5400               667-800 MHz    128 KB              256 KB   0.9-1.3 V     0.4-1.0 W     0.13 µm
Crusoe TM5800               667-800 MHz    128 KB              512 KB   0.9-1.3 V     0.4-1.0 W     0.13 µm
Intel Mobile Pentium III    600-750 MHz    I 16 KB, D 16 KB    256 KB   1.1-1.35 V    12.2 W        0.18 µm
Intel Pentium M             900 MHz        I 32 KB, D 32 KB    1 MB     0.84-1.0 V    7 W           0.13 µm
Note: It makes sense to compare processors in the same time frame only. Hence, Efficeon should be compared with Pentium M, while Crusoe should be compared with Mobile Pentium III.
LongRun Thermal Management

Thermal management is integrated into the dynamic power management operating point policies
  Manages the processor thermal environment by using frequency/voltage operating point shifts as a substitute for thermal throttling
  Delivers higher performance at the same die temperature, or the same performance at a lower die temperature
Crusoe provides an integrated on-die thermal diode
  It can be connected to an external temperature sensor so that the processor temperature can be monitored by the system BIOS and application software
Memory
DDR Memory Interface

TM5800 processors include an integrated high-performance DDR (double data rate) SDRAM controller and interface.
The DDR controller supports only DDR SDRAM and transfers data at a rate that is twice the clock frequency of the interface.
The DDR SDRAM controller supports the equivalent of two DIMMs (up to four ranks) of DDR SDRAM using a 64-bit wide interface.
The DDR SDRAM interface does not support parity bits.
Ranks (not banks) terminology:
  The grouping of sections of memory on system boards, commonly referred to by designers as sides or banks, is referred to here by the proper name of rank, as this is the term recognized by the memory industry.
  A rank describes the memory chips connected to a common chip select signal.
  (Use of the term bank to describe memory organization can be confusing because modern memory chips use bank select signals in addition to row and column address signals.)
  It is possible for single-rank memory modules to contain memory chips on both sides of the module.
DDR Memory Interface

Supported DDR Memory Types
  TM5800 processors support only non-buffered/non-registered/non-ECC DDR SDRAM memory on this interface. SDR (single data rate) SDRAM memory, buffered/registered memory, and ECC memory are not supported.
  TM5800 memory subsystems can be populated with 64-Mbit (4M x 16 or 8M x 8), 128-Mbit (8M x 16 or 16M x 8), 256-Mbit (16M x 16 or 32M x 8), or 512-Mbit (32M x 16 or 64M x 8) devices. Note that only x8 and x16 memory devices are supported.
DDR Memory Speed (Frequency)
  For memory configurations with up to 8 loads per interface signal, TM5800 processors support DDR interface frequencies up to 133 MHz (DDR-266).
  With memory configurations having more than 8 loads per signal, the DDR interface frequency must be reduced below 133 MHz.
DDR Memory Interface Constraints

TM5800 processors have a quad-rank memory controller implementation.
Only one register is available to describe and store all of the parameters of the memory types used in the system. Therefore every rank of memory must be identical.
Memory is assumed to be contiguous, so sequential placement of memory in each rank is required.
Memory must be placed in ranks in equal capacities (sizes).
Memory must be populated sequentially and contiguously so that no rank is skipped and left open.
Memory must be of the same density and organization (geometry).
All memory devices must be the same speed.
Memory interface operating frequency must be set for the lowest-speed memory in the system.
DDR Memory Interface Constraints

The DDR interface cannot drive more than eight unbuffered loads (devices) at the maximum interface frequency of 133 MHz.
For memory configurations that exceed eight loads, the interface frequency must be reduced below 133 MHz.
Placing the DDR devices down on the motherboard is recommended for best high-speed signal integrity.
Memory configurations must be chosen so that the number of loads is minimized.
There are strict constraints for user-installed DDR expansion memory.
It is strongly recommended that SDR memory (and not DDR memory) be used for user-installed expansion memory.
DDR Memory Rank and Chip Select Examples

Each of the four memory ranks (once populated) is controlled through one of the four DDR interface chip select signals (C_CS0#, C_CS1#, C_CS2#, C_CS3#).
These chip selects are used by the memory controller to manage accesses between the connected memories.
What if different sizes/geometries of memory are paired up? The organization of the memory addressing becomes incompatible and the system becomes non-functional.
What if identical memory of the same size and geometry is populated on a system board so that a rank is skipped (e.g., memory in Rank 0 and Rank 2 but not in Rank 1)? Only the memory located in the first rank is recognized and available to the system; the second rank of memory is not recognized.
The same memory interface constraints apply to memory modules (e.g., SODIMMs) as well as soldered-down memory.
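The population rules can be expressed as a small checker. This is only a sketch of the rules as stated on these slides, not Transmeta's configuration logic: the function name, the list-of-tuples representation, and the choice to count every device as one load are assumptions made for illustration.

```python
# Sketch of the DDR population rules. Rank slots are listed in order
# (Rank 0..3); an empty slot is None. All names here are illustrative.
# Assumption: every device on the 64-bit bus counts as one load.

MAX_LOADS_AT_133_MHZ = 8
BUS_WIDTH_BITS = 64


def check_ddr_population(ranks):
    """Return a list of rule violations for a 4-entry rank list.

    Each populated entry is a (depth, width) geometry such as ("16M", 16).
    The load rule assumes the interface is meant to run at 133 MHz.
    """
    problems = []
    populated = [r for r in ranks if r is not None]

    # Rule: ranks must be filled sequentially, starting at Rank 0, no gaps.
    if ranks[:len(populated)] != populated:
        problems.append("ranks are not populated contiguously from Rank 0")

    # Rule: every populated rank must use the same density and organization.
    if len(set(populated)) > 1:
        problems.append("ranks use different geometries")

    # Rule: more than eight loads cannot be driven at 133 MHz.
    loads = sum(BUS_WIDTH_BITS // width for _, width in populated)
    if loads > MAX_LOADS_AT_133_MHZ:
        problems.append("more than 8 loads: run the interface below 133 MHz")

    return problems


if __name__ == "__main__":
    print(check_ddr_population([("16M", 16), ("16M", 16), None, None]))
    print(check_ddr_population([("16M", 16), None, ("16M", 16), None]))
    print(check_ddr_population([("16M", 16), ("32M", 8), None, None]))
```

The first call describes two identical, adjacent ranks and reports no problems; the second skips Rank 1; the third mixes geometries and also exceeds eight loads.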
DDR Memory Rank and Chip Select Examples

[Figure: DDR rank population examples across Rank 0 (C_CS0#), Rank 1 (C_CS1#), Rank 2 (C_CS2#), and Rank 3 (C_CS3#). Populations that fill consecutive ranks with identical 16M x 16 devices are marked Supported. A non-contiguous population and a population that mixes 16M x 16 with 32M x 8 devices (different geometries) are marked Not Supported. Dual-rank module arrangements, including modules on C_CS0#/C_CS1# and C_CS2#/C_CS3#, are marked Supported.]
SDR Memory Interface

TM5800 processors include an integrated high-performance SDR (single data rate) SDRAM controller and interface.
The SDR controller supports only SDR SDRAM and transfers data at a rate equal to the clock frequency of the interface.
The SDR SDRAM controller supports up to two 64-bit DIMMs (up to four ranks) of single data rate SDRAM.
The SDR SDRAM interface does not support parity bits.
Supported SDR Memory Types
SDR DIMMs can be populated with 64-Mbit, 128-Mbit, 256-Mbit, or 512-Mbit devices.
All DIMMs must use the same frequency SDRAMs, but there are no restrictions on mixing different DIMM configurations in the two DIMM slots.
SDR Memory Interface

SDR Memory Speed (Frequency)
The frequency setting for the SDR SDRAM interface is initialized during the boot sequence from data stored in the configuration ROM.
Although the processor can be configured for an SDR interface frequency in the range of 1/2 to 1/15 of the core frequency, the supported interface frequency is restricted to a minimum of 66 MHz and a maximum of 133 MHz.
SDR Memory Speed Adjustment by LongRun
Memory frequency settings vary at each power management step.
When the processor core frequency is changed by LongRun, the SDR interface frequency is recalculated to match the new core frequency setting.
For example, a 1000 MHz device with a 125 MHz memory interface (core frequency divided by 8) may have a LongRun setting of 667 MHz with a 133 MHz memory interface (core frequency divided by 5).
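The recalculation of the SDR interface frequency at each LongRun step can be sketched as a search over the allowed dividers. The selection policy below (pick the fastest legal memory clock) is an assumption, as are the function name and the use of an exact 200/3 MHz input clock so that the nominal 667 MHz and 133 MHz values divide evenly.

```python
from fractions import Fraction

# Assumptions: CLKIN is a nominal 200/3 MHz (66.67 MHz), the divider is an
# integer from 2 to 15, and the policy is "fastest legal memory clock".
CLKIN_MHZ = Fraction(200, 3)
SDR_MIN_MHZ = Fraction(66)
SDR_MAX_MHZ = Fraction(400, 3)  # 133.33 MHz


def sdr_divider(core_multiplier):
    """Pick the smallest divider that keeps the SDR clock in range.

    Returns (divider, interface frequency in MHz), or None when no
    divider from 2 to 15 gives a supported frequency.
    """
    core_mhz = core_multiplier * CLKIN_MHZ
    for divider in range(2, 16):
        sdr_mhz = core_mhz / divider
        if SDR_MIN_MHZ <= sdr_mhz <= SDR_MAX_MHZ:
            return divider, float(sdr_mhz)
    return None


if __name__ == "__main__":
    print(sdr_divider(15))  # 1000 MHz core
    print(sdr_divider(10))  # 667 MHz core
```

With a multiplier of 15 (1000 MHz core) the sketch selects a divider of 8 and a 125 MHz interface; with a multiplier of 10 (667 MHz core) it selects a divider of 5 and a 133 MHz interface, matching the example above.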
SDR Memory Interface Constraints

SDR SDRAM interface cannot drive more than sixteen unbuffered loads (devices).
Different SDR memory ranks can have different size or geometry devices. SDR memory interface will automatically be adjusted to run at the speed of the slowest installed SDR SDRAM memory.
All SDR memory installed in the system should be the same speed (recommended).
Maximum unbuffered SDR SDRAM interface operating frequency is 133 MHz.
SDR SDRAM configurations requiring more than sixteen loads must use buffered SDR memory.
Maximum industry-standard buffered SDR SDRAM operating frequency is 66 MHz. Maximum processor SDR SDRAM interface operating frequency is 133 MHz. Hence, when buffered SDR memory is used, the processor SDR SDRAM interface operating frequency must be set below the standard LongRun power management settings.
SDR memory can be user expandable. SDR SDRAM is the preferred user-installed expansion memory option for TM5800 processor-based systems.
Clocks And Timing
Clocks
The TM5800 processor input clock (CLKIN) is multiplied by the processor clock multiplier to generate the processor core clock.
For currently defined TM5800 SKUs, CLKIN is assumed to be 66.6 MHz.
The processor core clock is divided down by the DDR and SDR clock dividers to generate the DDR SDRAM and SDR SDRAM interface clocks.
There is also a clock divider that must be initialized for the PCI interface. The PCI interface operates at 33.3 MHz.
Clock multiplier and divider values are programmed into the TM5800 processor during initialization from data stored in the configuration ROM.
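The relationship between CLKIN, the multiplier, and the interface dividers can be written down directly. The class below is a hypothetical container for illustration: its field names do not come from Transmeta documentation, the input clock is taken as an exact 200/3 MHz, and deriving the PCI clock by dividing the core clock is an assumption based on the wording above.

```python
from dataclasses import dataclass

CLKIN_MHZ = 200 / 3  # nominal 66.67 MHz input clock (assumed exact value)


@dataclass
class ClockConfig:
    """Hypothetical view of the values held in the configuration ROM."""
    multiplier: int
    ddr_divider: int
    sdr_divider: int
    pci_divider: int  # assumed to divide the core clock

    @property
    def core_mhz(self):
        return CLKIN_MHZ * self.multiplier

    @property
    def ddr_mhz(self):
        return self.core_mhz / self.ddr_divider

    @property
    def sdr_mhz(self):
        return self.core_mhz / self.sdr_divider

    @property
    def pci_mhz(self):
        return self.core_mhz / self.pci_divider


if __name__ == "__main__":
    cfg = ClockConfig(multiplier=12, ddr_divider=6, sdr_divider=6,
                      pci_divider=24)
    print(round(cfg.core_mhz, 1), round(cfg.ddr_mhz, 1),
          round(cfg.sdr_mhz, 1), round(cfg.pci_mhz, 1))
```

In this example a multiplier of 12 gives an 800 MHz core clock, a divider of 6 gives 133.3 MHz memory clocks, and a divider of 24 gives the 33.3 MHz PCI clock.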
Timing Specification for Input Clocks

Parameter                                Signal     Minimum      Maximum
fclk (clock frequency)                   CLKIN      60.0 MHz     66.67 MHz
                                         P_PCLK     30.0 MHz     33.33 MHz
                                         S_CLKIN    -            133.33 MHz
tcycle (clock period)                    CLKIN      15.0 ns      16.67 ns
                                         P_PCLK     30 ns        -
                                         S_CLKIN    7.5 ns       -
thigh (clock high time)                  CLKIN      5.2 ns       -
                                         P_PCLK     11 ns        -
                                         S_CLKIN    3.375 ns     -
tlow (clock low time)                    CLKIN      5.0 ns       -
                                         P_PCLK     11 ns        -
                                         S_CLKIN    3.375 ns     -
tjitter (clock jitter)                   CLKIN      -            250 ps
                                         P_PCLK     -            500 ps
trise/fall (clock rise and fall time)    CLKIN      0.4 ns       1.6 ns
                                         P_PCLK     1.0 V/ns     4.0 V/ns
toffset (CLKIN to P_PCLK offset)                    1.5 ns       4.0 ns
tpll_lock (PLL relock time)                         -            20 µs
Timing Diagram for Input Clocks

[Figure: input clock waveform for CLKIN, P_PCLK, and S_CLKIN, annotated with tcycle, thigh, tlow, trise, and tfall against the 2.0 V, Vm, and 0.8 V levels.]
*Vm = 1.2 V for CLKIN, 0.4 * IOVDD for P_PCLK, and 1.4 V for S_CLKIN.
Timing Specification for DDR SDRAM Interface

Parameter              Description                                    Minimum           Maximum
fclk                   C_CLK frequency                                -                 133 MHz
tcycle                 C_CLK period                                   7.5 ns            -
tlow, thigh            C_CLK low time, high time                      0.45 bus clks     0.55 bus clks
tjitter                C_CLK jitter                                   -                 150 ps
Vx                     Differential cross pt voltage                  1.1 V             1.4 V
tvalid_dqs             C_DQ, C_DQMB valid from C_DQS (writes)         -                 3.05 ns
tohold_dqs             C_DQ, C_DQMB hold from C_DQS (writes)          0.76 ns           -
tdqs_skew              C_DQS to C_DQ, C_DQMB skew                     -0.76 ns          +0.76 ns
tdqs_low, tdqs_high    C_DQS input low time, C_DQS input high time    0.45 bus clks     0.55 bus clks
tdqs_preamble          C_DQS preamble valid time                      0.9 bus clks      1.1 bus clks
off                    Active to float delay (C_DQ : C_DQS)           0 ns : -0.5 ns    2.5 ns : +0.5 ns
nras_cas               C_RAS# to C_CAS# latency                       1 bus clock       16 bus clocks
ncas_read              C_CAS# to read latency                         1 bus clock       16 bus clocks
nread_pchg             Read precharge delay                           1 bus clock       16 bus clocks
nwr_pchg               Write precharge delay                          1 bus clock       16 bus clocks
nrow_pchg              Row precharge time                             1 bus clock       16 bus clocks
nidmrs                 Idle cycles after Mode Register Set            2 bus clocks      17 bus clocks
nras_ras               Row cycle time                                 2 bus clocks      17 bus clocks
nburst                 Burst length                                   4 transfers       4 transfers
nrefresh               Refresh rate                                   128 bus clocks    16K bus clocks
Timing Specification for DDR SDRAM Interface

Clock specifications apply to C_CLKA, C_CLKA#, C_CLKB, and C_CLKB#. C_CLKA and C_CLKA# are 180° out of phase. C_CLKB and C_CLKB# are 180° out of phase. C_CLKA and C_CLKB are copies of each other.
The data parameters are specified relative to the DQS signals, and the CMD parameters are specified relative to the C_CLK/C_CLK# differential cross point voltage.
CMD signals are: C_A[12:0], C_BA[1:0], C_CAS#, C_CKE[1:0], C_CS#[3:0], C_RAS#, C_WE#.
Assumes 80 pF maximum load on each CMD signal and 10 pF maximum load on each of C_DQ[63:0].
These parameters are programmable within the processor.
Row precharge time is the number of bus clocks between the power-on precharge and the next time RAS can be asserted.
Row cycle time is the number of bus clocks between refresh and the next time RAS can be asserted for other SDRAM operations. This is also the number of cycles the DDR SDRAM controller waits before starting any SDRAM access after it exits clock-off mode.
The DDR SDRAM controller always performs burst operations.
Timing Diagram for DDR SDRAM Interface Read Cycle

[Figure: DDR SDRAM read cycle over bus clocks 0 to 10, showing C_CLK#, C_CLK, C_CKE[1:0], C_RAS#, C_CS#[3:0], C_A[12:0], C_BA[1:0], C_CAS#, C_DQ[63:0] (IN0 to IN3), and C_DQS[7:0], annotated with tcycle, nras_cas, nras_ras, ncas_read, nread_pchg, tvalid, tdqs_preamble, tdqs_high, tdqs_low, and tdqs_skew.]
Note: CAS Latency = 2 in this diagram.

Timing Diagram for DDR SDRAM Interface Write Cycle

[Figure: DDR SDRAM write cycle over bus clocks 0 to 10, showing C_CLK#, C_CLK, C_CKE[1:0], C_RAS#, C_CS#[3:0], C_A[12:0], C_BA[1:0], C_CAS#, C_WE#, C_DQMB[7:0] (Val0 to Val3), C_DQ[63:0] (Out0 to Out3), and C_DQS[7:0], annotated with nras_cas, nras_ras, nwr_pchg, tvalid_dqs, and tohold_dqs.]
Timing Specification for SDR SDRAM Interface

Parameter      Description                             Minimum           Maximum
fclk           S_CLKIN, S_CLKOUT, S_CLK frequency      -                 133 MHz
tsetup         Input setup time                        1.7 ns            -
tihold         Input hold time                         1.9 ns            -
nras_cas       S_RAS# to S_CAS# latency                1 bus clock       16 bus clocks
ncas_read      S_CAS# to read latency                  1 bus clock       16 bus clocks
nread_pchg     Read precharge delay                    1 bus clock       16 bus clocks
nwr_pchg       Write precharge delay                   1 bus clock       16 bus clocks
nrow_pchg      Row precharge time                      1 bus clock       16 bus clocks
nidmrs         Idle cycles after MRS                   2 bus clocks      17 bus clocks
nras_ras       Row cycle time                          2 bus clocks      17 bus clocks
nburst         Burst length                            4 transfers       4 transfers
nrefresh       Refresh rate                            128 bus clocks    16K bus clocks
tvalid: Output valid delay
tohold: Output hold time

S_CLK[3:0] are copies of S_CLKOUT.
These parameters are specified relative to the S_CLKIN rising edge at the 1.4 V level.
Input signals are: S_DQ[63:0].
Output signals are: Data = S_DQ[63:0], S_DQMB[7:0]; Address = S_A[12:0], S_BA[1:0], S_CAS#, S_RAS#, S_WE#; Enables = S_CKE[1:0], S_CS#[3:0].
Assumes a 50 pF load for output signals. For every 10 pF above a 50 pF load, add 170 ps for the data and enable signals, and 90 ps for the address signals. For every 10 pF below a 50 pF load, subtract 170 ps for the data and enable signals, and 90 ps for the address signals.
These parameters are programmable within the processor.
Row precharge time is the number of bus clocks between the power-on precharge and the next time RAS can be asserted.
MRS stands for Mode Register Set operation.
Row cycle time is the number of bus clocks between refresh and the next time RAS can be asserted for other SDRAM operations. This is also the number of cycles the SDR SDRAM controller waits before starting any SDRAM access after it exits clock-off mode.
The SDR SDRAM controller always performs burst operations.
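The load adjustment for output timing is a simple linear correction around the 50 pF reference. The sketch below applies the 170 ps and 90 ps per 10 pF figures from the note above; the function name and group labels are illustrative, and interpolating linearly between 10 pF steps is an assumption.

```python
# Output-delay adjustment for capacitive load on the SDR interface:
# 170 ps per 10 pF for data and enable signals, 90 ps per 10 pF for
# address signals, relative to a 50 pF reference load.
# Linear interpolation between 10 pF steps is an assumption.

REFERENCE_LOAD_PF = 50.0
PS_PER_10PF = {"data": 170.0, "enable": 170.0, "address": 90.0}


def delay_adjustment_ps(signal_group, load_pf):
    """Picoseconds to add to the spec (negative means subtract)."""
    step_ps = PS_PER_10PF[signal_group]
    return (load_pf - REFERENCE_LOAD_PF) / 10.0 * step_ps


if __name__ == "__main__":
    print(delay_adjustment_ps("data", 70))     # heavier load, slower output
    print(delay_adjustment_ps("address", 30))  # lighter load, faster output
```

A 70 pF load on a data signal adds 340 ps, while a 30 pF load on an address signal subtracts 180 ps.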
Timing Diagram for SDR SDRAM Interface Read Cycle

[Figure: SDR SDRAM read cycle over bus clocks 0 to 10, showing S_CLKIN, S_CKE[1:0], S_RAS#, S_CS#[3:0], S_A[12:0], S_BA[1:0], S_CAS#, S_DQMB[7:0] (VALID0 to VALID3), and S_DQ[63:0] (IN0 to IN3), annotated with nras_cas, nras_ras, ncas_read, and nread_pchg.]
Timing Diagram for SDR SDRAM Interface Write Cycle

[Figure: SDR SDRAM write cycle over bus clocks 0 to 10, showing S_CLKIN, S_CKE[1:0], S_RAS#, S_CS#[3:0], S_A[12:0], S_BA[1:0], S_CAS#, S_DQMB[7:0] (VALID0 to VALID3), and S_DQ[63:0] (IN0 to IN3), annotated with nras_cas, nras_ras, and nwr_pchg.]
Processor Pin Layout
[Figure: Crusoe processor system connection diagram. Labels shown around the processor:
SDR SDRAM: SDR interface signals, delay loop, high-speed bidirectional level translator isolation, SUSPEND#.
DDR SDRAM: DDR interface signals, DDR_CKE[1:0], high-speed bidirectional level translator isolation, SUSPEND#.
Temperature sensor: DIODE_CATHODE, DIODE_ANODE, temp alert to southbridge.
Serial ROMs: 8-Mbit serial flash for CMS and 2-kbit serial mode-bit ROM on SROM_SCLK, SROM_SOUT, SROM_SIN, SROM_CS[1:0].
Transmeta Debug Connector A (30-pin): serial debug bus interface (SRCLK, SRDATA), TDM_TCK, TDO, TDI, TMS, TRST#, DEBUG_INIT, NM_DEBUG_INIT.
Power: CPU core power supply and memory power supply, VRDA[4:0], PWRGOOD (all system POWERGOOD signals).
PCI and CPU clock/suspend/reset pins: SUSPEND#, SUSB#, SUSC#, CPU_RST#, SYS_RST# (connect to system-level reset), to the rest of the motherboard.]
Target Applications and Sample Code
Target Applications
Suitable for portable and embedded systems
No need for active cooling or external CPU fans
Provides embedded devices with a performance-per-watt ratio that is unmatched by any other x86-based processor in its class
Runs a mobile Linux kernel
Capable of running Internet applications: web browsers, email applications, streaming video
Used in notebooks and Tablet PCs
Offers advantages over standard hardware-only processors for making ultra-light and thin notebooks with lower power consumption
The only microprocessor able to accept software upgrades to the processor that offer additional performance and power savings
Target Applications
Thin client for server-based computing
The thin client, differentiated from desktop computers by a smaller form factor and the removal of all moving parts (disk drive, fan, etc.), is designed specifically for server-based computing.
This centralized approach makes it easy to deploy applications to thousands of thin-client users at the same time.
It increases resource efficiency and drastically reduces the overhead associated with application installation and upgrades.
Because all data is centrally located on a server, data is better protected from the catastrophes, viruses, and data theft that plague non-centralized operations.
Low-power, high-density Crusoe-based servers have a high profitability matrix value (performance per watt per cubic foot).
Target Applications
Ultra-Personal Computer (UPC)
New computing category enabled by Transmeta processors.
High-performance, full-featured PC that delivers the functionality of a desktop computer and the features of a laptop computer in the size of a handheld PDA.
Designed to run full x86 desktop operating systems and applications, giving users application independence and the freedom to use their data with the software of their choice.
Simplifies or even eliminates synchronization between multiple devices, which increases productivity and reduces IT maintenance overhead while giving users seamless portability of their data and multimedia content.
Target Applications
Cluster workstations
Provide the processing power needed to solve complex computational problems.
Self-contained, with simple and easy-to-use features that reduce server cluster complexity and assembly time.
Built around industry standards, use standard software libraries, and can be configured to user needs.
Example Code Fragment

# Handler for reading MSR registers from x86 operating system code.
# Parallel VLIW instruction words shown by || and terminated by ;
read_msr:
#=> target 0x000a6798:
   addi      %r38,%r47,-12;            # %r38 = %sp - 12
|| addi      %r47,%r47,-32;            # %sp = %sp - 32
   nop.lsu;
|| addi      %r35,%r38,4;              # %r35 = %r38 + 4
|| oril      %r34,%zero,0x80000000;    # %r34 = 0x80000000
   st        [%r47],%r25;              # Save %r25
|| add       %r60,%r0,%r34;            # %r60 = %r0 (%eax) + 0x80000000
                                       # (to bring MSR offset down to zero base)
   st        [%r35],%r26;              # Save %r26
|| slli      %r37,%r60,4;              # Shift MSR number for indexing into table
|| addil     %r25,%zero,0x0018aae0;    # 0x18aae0 -> cpuid data for 0x80000000-6
# NOTE: shifts (slli, etc.) appear to only be available on ALU1, at least
# according to the opcode map. This is a bizarre (but low power) design.
   st        [%r38],%r27;              # [%sp-12] = %r27 (callee saved)
|| 001100000 %r27,%r58,%r0;            #
|| cmpil.c   %sink,%r0,0x80000006 ;    # compare %eax == 0x80000006?
   nop.lsu;
|| addi      %r59,%r25,-64;            # 0x18aaa0 -> cpuid data returned for 0x0-0x3
|| addil     %r36,%r37,0x0018aae0;     # %r36 = 0x18aae0 + (%r60 << 4)
   nop.lsu;
|| cmp.c     %sink,%r0,%r34;           # compare %eax == 0x80000000
|| or        %r26,%zero,%zero;         # Move %r26 = 0
|| br.gt     0x000a6860;               # branch if %eax > 0x80000006
                                       # (to handle 0x80860000 functions)
   cmp.c     %sink,%r0,3;              # Compare %eax == 3
|| br.ge     0x000a68b8;               # branch if (%eax >= 0x80000000)
                                       # (branch to load_cpuid_data_to_regs)
   or        %r60,%r0,%zero;           # %r60 = %eax
|| slli      %r37,%r0,4;               # %r37 = %eax << 4 (index cpuid table)
   nop.lsu;
|| add       %r36,%r37,%r59;           # %r36 = %r37 + (%r0<<4) + 0x18aaa0
#=> target 0x000a6820:
   nop.lsu;
|| addi      %r35,%r47,20;             # %r35 = %r47 + 20
|| or        %r2,%zero,%zero;          # %edx = 0
   move      %r1,%zero,%zero;          # %ecx = 0
|| move      %r3,%zero,%zero;          # %ebx = 0
   move      %r0,%zero,%zero;          # Set %eax = 0
|| addi      %r34,%r47,24;             # %r34 = %sp + 24
#=> target 0x000a6840:
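The table addressing performed by this fragment can be paraphrased in a few lines of Python. This is a reading of the comments in the listing, not a statement about CMS internals: the two table addresses and the 16-byte entry size are taken from the comments and the shift by 4, while the function name and the unsigned treatment of the comparisons are assumptions.

```python
# Python paraphrase of the table addressing in the fragment above.
# Table addresses and the 16-byte entry size are read off the comments;
# treating the comparisons as unsigned is an assumption.

STANDARD_TABLE = 0x0018AAA0   # entries for CPUID functions 0x0-0x3
EXTENDED_TABLE = 0x0018AAE0   # entries for 0x80000000-0x80000006
ENTRY_SHIFT = 4               # each entry is 16 bytes


def cpuid_entry_address(eax):
    """Address of the table entry for a CPUID function, or None."""
    if 0x80000000 <= eax <= 0x80000006:
        # Adding 0x80000000 with 32-bit wraparound rebases the index to 0.
        index = (eax + 0x80000000) & 0xFFFFFFFF
        return EXTENDED_TABLE + (index << ENTRY_SHIFT)
    if 0 <= eax <= 3:
        return STANDARD_TABLE + (eax << ENTRY_SHIFT)
    return None  # handled on other paths (e.g. the 0x80860000 functions)


if __name__ == "__main__":
    for function in (0x0, 0x3, 0x80000000, 0x80000006, 0x4):
        address = cpuid_entry_address(function)
        print(hex(function), hex(address) if address is not None else None)
```

The four standard entries occupy 64 bytes, which is why the listing derives the standard table base as 0x18aae0 - 64 = 0x18aaa0.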