
Transmeta Crusoe

CS433 Processor Presentation Series Prof. Luddy Harrison

CS433 Prof. Luddy Harrison

Copyright 2005 University of Illinois

Note on this presentation series


These slide presentations were prepared by students of CS433 at the University of Illinois at Urbana-Champaign.
All the drawings and figures in these slides were drawn by the students. Some drawings are based on figures in the manufacturer's documentation for the processor, but none are electronic copies of such drawings.
You are free to use these slides provided that you leave the credits and copyright notices intact.

Outline
Transmeta Innovation Timeline
Crusoe Processor Family and Overview
Architecture showing Data paths, Registers, ALU, etc.
Instruction Set
Pipelining
Code Morphing
LongRun Power Management
Memory Map and Support for Internal and External Memories
Clocks and Timing
Processor Pin Layout
Target Applications, Specific Use, and Sample Assembly Language Code for application kernels

Transmeta Innovation Timeline

Dave Ditzel, of RISC fame and formerly of Sun's SPARC group, founded Transmeta in 1995 and served as its CEO.
The first patent (US 5,958,061) was filed on July 24, 1996 and granted on September 28, 1999.
On January 19, 2000, the Crusoe processor was publicly announced.
Crusoe became known as an x86-compatible family of processors that combines strong performance with remarkably low power consumption.

Crusoe Processor Family


TM5400: 500-700 MHz, 256 KB L2 cache
TM5500: 0.13 µm, 667-800 MHz, 256 KB L2 cache
TM5600: 500-700 MHz, 512 KB L2 cache
TM5800: 0.13 µm, 667-800 MHz, 512 KB L2 cache
Crusoe SE TM55E/TM58E: 0.13 µm, 800 MHz, 256 KB L2 cache; embedded versions of the Crusoe processor
This presentation focuses on the TM5800 as the representative processor of the family.

Models compared: Crusoe TM5500 and TM5800, Crusoe SE TM55E and TM58E

Clock speed
  TM5500: 667 MHz
  TM5800, TM55E, TM58E: 800 MHz - 1 GHz

L2 write-back cache
  TM5500, TM55E, TM58E: 256 KB
  TM5800: 512 KB

Features common to all four models
  128 KB L1 cache (64 KB instruction cache, 64 KB data cache)
  Integrated Northbridge
    64-bit, 133 MHz DDR memory controller
    64-bit, 133 MHz SDR memory controller
    32-bit, 33 MHz, 3.3 V PCI bus
  MMX instruction support
  0.13 µm process
  Compact 474-pin ceramic BGA package
  Max TDP: 5.1 W

Additional features of the Crusoe SE models (TM55E, TM58E)
  Supports junction temperatures of 100 °C
  Rated for 24/7 operation for 10 years

Characteristics of Crusoe
4-instruction-issue, 128-bit VLIW engine
  Fully x86 compatible
  Up to four instructions issued per clock cycle
  MMX multimedia extensions
  512 KB L2 cache
Advanced Code Morphing Software (CMS)
  Unique software-based architecture is key to reducing power consumption and enabling future scalability and flexibility
Integrated Northbridge core logic
  On-chip SDR and DDR-266 memory interfaces
  On-chip 32-bit, 33 MHz PCI bus controller

Characteristics of Crusoe contd.


LongRun Dynamic Power Management
  Enables low-power operation by dynamically adjusting operating frequency and voltage to match the performance requirements of application workloads
  Provides higher performance within smaller, thermally constrained environments
  Enables fanless designs for quieter and more reliable systems

Architecture


Crusoe Processor
[Figure: Crusoe processor block diagram. Blocks: high-performance VLIW engine; L1 instruction cache; L1 data cache; L2 cache; SDR memory interface controller; DDR memory interface controller; serial ROM interface controller; 32-bit, 33 MHz PCI bus interface controller; Transmeta LongRun power/thermal management.]

x86 CPU vs. Crusoe


[Figure: side-by-side comparison of the functional blocks of a modern x86 CPU and Transmeta's Crusoe. Block labels: x86 Instruction Translation, Instruction Decode, Branch Predict, Register Rename, Instruction Reorder, L1 Cache, Execution Units.]

Crusoe Hardware/Software Partitioning

Purple portions are implemented in hardware
  Much smaller than traditional microprocessors
Orange portions are implemented in software
  x86 to native VLIW translation, branch prediction, and out-of-order execution (OOO) logic

Processor Details
Fabricated in 0.13 µm process technology
High-performance 4-issue, 128-bit VLIW engine with Code Morphing Software to provide x86 compatibility
L1 data cache: 64 KB
L1 instruction cache: 64 KB
L2 write-back cache: 512 KB
DDR memory support: DDR-SDRAM, 100-133 MHz
SDR memory support: SDR-SDRAM, 66-133 MHz
PCI bus controller (PCI 2.1 compliant) with 33 MHz, 3.3 V interface
Standard product speeds of 733, 800, 867, 933, and 1000 MHz
Power: 0.5-1.5 W at 300-1000 MHz and 0.8-1.3 V running typical multimedia applications; 150 mW typical in deep sleep
Processor package: compact 474-pin ceramic BGA

Processor Block Diagram


[Figure: processor block diagram. The CPU core (integer unit, floating point unit, MMU, multimedia instructions, bus interface) is surrounded by the following blocks.]
  L1 instruction cache: 64 KB, 8-way set associative
  L1 data cache: 64 KB, 16-way set associative
  Unified TLB: 256 entries, 4-way set associative
  L2 write-back cache: 512 KB, 4-way set associative
  DDR SDRAM controller
  SDR SDRAM controller
  Serial ROM interface
  PCI controller and Southbridge interface

Architecture Block Diagram


[Figure: architecture block diagram. Blocks and labels as drawn.]
  64 general purpose registers with shadow registers
  32 floating point registers with shadow registers
  Debug registers, alias hardware, TLB, T-bit buffer
  Functional units: ALU0, ALU1, Load/Store, Branch, FPU
  Gated store buffer
  Local data memory (8 KB) and data cache (64 KB), with data flow and data cache control
  Local program memory (8 KB) and instruction cache (64 KB), with instruction cache control
  Secondary instruction/data cache (256 KB)
  Bus interface unit

Architecture Details
Five functional units: 2 x ALU, FP/MMX, MEM (load/store) and BR (branch)
Each instruction (called a molecule) contains two or four RISC-like operations (called atoms)
Shadowed register sets: 64 GPRs, 32 FPRs
Gated store buffer in the load/store unit
Alias hardware
Very few hardware interlocks in the pipeline
  Correct execution is guaranteed by CMS scheduling and the compiler
Micro-architecture is hidden from the x86 programmer
  Processor timing issues can be worked around by CMS

Crusoe Processor Hierarchy

[Figure: layered hierarchy, top to bottom.]
  x86 software
    x86 applications
    x86 operating system (Windows XP, Linux, etc.)
    x86 BIOS
  x86-compatible Crusoe processor solution
    Code Morphing Software (resides in ROM)
    VLIW processor

Crusoe: A Native VLIW Processor


Crusoe is a VLIW (Very Long Instruction Word) processor
  Multiple functional units (FUs), each explicitly programmed in every instruction
  A very long instruction word is called a molecule
  Each molecule contains up to 4 atoms
  Each atom is an instruction destined for one FU
  A molecule is either 128 bits or 64 bits wide
Crusoe's compilation and scheduling is a hybrid of dynamic superscalar instruction scheduling and VLIW instruction scheduling
  Code Morphing Software (CMS) takes a compiled x86 program and recompiles it, on the fly, into Crusoe's native VLIW instruction format.
  Recompilation uses sophisticated compiler algorithms to extract parallelism from the code, analyze dependencies, and perform the other tasks of a state-of-the-art VLIW compiler.

Instruction Word - VLIW

[Figure: a 128-bit molecule holding four atoms, each routed to its functional unit.]
  FADD -> Floating Point Unit
  ADD  -> Integer ALU #0
  LD   -> Load/Store Unit
  BRCC -> Branch Unit

Code Morphing Software


x86 instructions are converted to the Crusoe instruction set through a software layer (translation)
During instruction translation, optimizations and scheduling tricks can be performed
  Instruction scheduling
  Register renaming
  Speculation
The Crusoe processor architecture is decoupled from application software

Efficeon TM8800 Processor


Next-generation Crusoe processor
Designed for higher performance
8-instruction-issue, 256-bit VLIW engine
  Fully Pentium 4-ISA compatible
  Up to eight instructions issued per clock cycle
  Up to 50% improvement in integer applications
  SSE and SSE2 multimedia extensions enable multimedia applications to run up to 80% faster per clock cycle than on previous-generation Crusoe processors
  Large 1 MB L2 cache improves processor performance

Instruction Set


Registers
The processor has 64 GPRs, with the following specialized semantics:
  %r63 (%zero) always reads 0 when used as a source operand
  %r62 (%sink) is a discarded destination (e.g., for compares); it is never read
  %r59 (%from) saved return address
  %r58 (%link) return address
  %r47 (%sp) is the current stack pointer
  %r0 (%eax) for current x86 machine state
  %r1 (%ecx) for current x86 machine state
  %r2 (%edx) for current x86 machine state
  %r3 (%ebx) for current x86 machine state

Registers contd.
The lower 48 of these GPRs are backed by shadow GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the 'known good' shadow GPRs.
The processor also includes 32 80-bit floating point registers and 16 FP shadow registers.
There is also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.

Instruction Encoding
General Format:
[Figure: bundle formats. A bundle is drawn as 32-bit words at %PC + 0, %PC + 4, %PC + 8 and %PC + 12. Each format carries a C bit and a 2-bit format code (00, 01, 10 or 11). The slots hold ALU0, ALU1, LSU and BRU/Branch atoms or a 32-bit immediate (imm32), together with type select bits.]

Instruction Encoding Continued.


Instructions are encoded in little-endian byte and word order.
All instructions (except branches) have a 9-bit opcode field.
All opcodes share a common mapping into this 9-bit space, even though not all instructions can execute on all functional units.
The hardware is designed to interlock all operations through scoreboarding; however, design flaws sometimes prevent the microprocessor from taking full advantage of this feature.

Format of ALUs

Bit layout of an ALU0/ALU1 slot (bit 31 on the left, bit 0 on the right):

  C  10  op         rd  ra  rb     ALU with register operands
  C  10  op         rd  ra  imm8   ALU with 8-bit immediate
  C  10  11xxxx011  rd  ra  --     ALU with 32-bit immediate

ALU1 executes a superset of the operations available on ALU0.
Both ALUs may have an 8-bit signed immediate instead of register rb.
ALU1 may optionally use a 32-bit immediate, but only in appropriate bundle types.

Format of ALUs Continued.


The ALU0|imm32 and ALU0|ALU1 bundle types share the same format code (10), but the ALU1 slot is interpreted as an imm32 depending on the opcode:
  11xxxx011: 32-bit immediate in place of ALU1.
  If the 11xxxx011 pattern appears in an ALU1 slot, an 8-bit immediate is used instead.
Condition codes exactly mirror the x86 semantics.
  Any instruction in either ALU0 or ALU1 can optionally latch the resulting condition codes if specified by the opcode.
  However, only one of the two ALU0|ALU1 slots per bundle can write the condition codes in a single cycle.
The ALU1 slot is also used for all floating point and MMX operations, as indicated by ALU1's type select bits being something other than '00'.

Format of Load/Store Unit

Bit layout of a Load/Store Unit slot (bit 31 on the left, bit 0 on the right):

  C  01  op  rd  ?   raddr   Loads
  C  01  op  ?   rs  raddr   Stores

A single load/store unit performs all loads and stores, alias operations and various other memory-related tasks.
All LSU operations take a fully calculated address in register ra.
No ra+offset or ra+rb addressing modes are provided.

Format of Load/Store Unit Continued.


Two kinds of loads and stores are possible:
  Operations on physical CMS space addresses (as used in CMS itself).
  Operations as user code sees memory; i.e., addresses are translated by the TLB and can never access the protected CMS space.
The processor has two special 8 KB SRAMs:
  Local program memory (LPM): holds frequently executed assist code for x86 page table lookups, alignment fixups, low-level exception handling, interrupt handling, etc. This avoids having to bring such critical code into the L1 instruction cache on demand.
  Local data memory (LDM): contains data used by the LPM functions; i.e., copies of key x86 MSRs, the native code stack, etc.

Format of Branch Unit


[Figure: bit layouts of the four branch formats within CMS: unconditional branch, unconditional branch via register, conditional branch, and conditional branch via register. The fields shown include a 3-bit code (100 or 101), a condition field (000 or cond), the L bit, and either abstarget >> 3 or a register ra.]

Branches (both conditional and unconditional) within CMS use a 23-bit absolute target address aligned to a 64-bit boundary (i.e., the stored field is abstarget >> 3 and is shifted left 3 bits to recover the target).
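
The target encoding described above can be sketched in Python. This is only an illustration: it assumes that the 23-bit figure is the width of the stored field, and the function names and constants are hypothetical, not part of any Transmeta tool.

```python
FIELD_BITS = 23  # assumption: the stored branch-target field is 23 bits wide
ALIGNMENT = 8    # a 64-bit boundary is 8 bytes


def encode_branch_target(abstarget: int) -> int:
    """Return the field value stored in the branch atom (abstarget >> 3)."""
    if abstarget < 0 or abstarget % ALIGNMENT != 0:
        raise ValueError("target must be a non-negative, 8-byte-aligned address")
    field = abstarget >> 3
    if field >= (1 << FIELD_BITS):
        raise ValueError("target does not fit in the branch field")
    return field


def decode_branch_target(field: int) -> int:
    """Recover the absolute target by shifting the field left 3 bits."""
    return field << 3
```

Under this assumption the field can address 2^26 bytes (64 MB) of CMS space, and a misaligned target is rejected at encode time rather than silently truncated.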

Format of Branch Unit Continued.


CMS address space is the only region from which code can be executed; the processor is physically incapable of executing code directly from user space.
  This is why all x86 code must be translated (and thus copied to CMS space) before native execution.
Conditional branches use the same condition code set (cc bits) as the x86 encoding of jump instructions.
Unconditional branches can optionally write the return address to the %link register (%r58) if the L bit (bit 0 of the cc field) is set.
Indirect branches go through a general purpose register.
  Special instructions are provided to prepare for an indirect branch when the target address is known in advance; this avoids the three-cycle branch penalty.
  In addition, special instructions may provide branch-with-link functionality.

Pipelining


Pipelining

[Figure: pipeline diagram with three rows.]
  ALU instruction (top row): Fetch0, Fetch1, Regs, ALU, Except, Write, Commit
  Load/Store Unit row: Cache0, Cache1
  Branch Unit row: (Wait), Redirect

The top row of the diagram shows the pipeline for an ALU instruction; the other rows show the stages that differ for the two other types of functional unit.

Pipeline Stages
Fetch0: The first 64 bits of a 64-bit or 128-bit bundle are fetched.
Fetch1: The second 64 bits are fetched (for 128-bit bundles only).
Regs: Read source registers and decode/disperse instructions.
ALU: Execute single-cycle operations in ALU0 and ALU1.
Except: Complete two-cycle ALU0/ALU1 operations and detect exceptions.
Cache0: Initiate L1 data cache access based on register address.
Cache1: Complete L1 data cache access, TLB access and alias checks.
Write: Write results back to GPRs or the store buffer.
Commit: Optionally latch the lower 48 GPRs into the shadow registers.

Decoding and Scheduling


Conventional x86 superscalar processors fetch binary instructions and decode them into separate micro-operations, which the hardware then reorders and executes in parallel.
Code morphing translates an entire group of x86 instructions at once and stores the translation in a translation cache for future reference.
  The Code Morphing System (CMS) is described later.

Crusoe's VLIW Instruction Scheduling

[Figure: software and hardware layers. Labels: Applications, Operating System Software, Code Morphing Software, CPU, Functional Units.]

Decoding and Scheduling Continued.


The translation step introduces many opportunities.
  Due to high repeat rates, the translation cache is used frequently, which reduces overhead.
  Much more sophisticated scheduling algorithms can be used.
  Power consumption is much lower because translation is done entirely in software.
  Generated code can be optimised, and by learning which parts are executed often, CMS can change the level of optimisation dynamically.

Code Morphing


Code Morphing System (CMS)


Dynamic binary translation of the x86 ISA into Crusoe's internal native VLIW format, performed in software at run time
  Translations are cached to avoid translator overhead on repeated execution
  Optimizes across x86 instruction boundaries to improve performance
  Selects a region, produces native code and stores the translation in the translation cache
  Completely invisible to the operating system: looks like an x86 hardware processor
Runtime system
  Handles devices, interrupts and exceptions, power management, and garbage collection

Code Morphing
Hardware: high megahertz, small die size, non-x86 VLIW processor.
Code Morphing Software technology: x86 features handled in software rather than in silicon.

[Figure: x86 binary code passes through the Code Morphing Software and becomes VLIW binary code, which runs on the silicon microchip (integer units, floating point unit, multimedia unit, data cache, instruction cache).]
x86 features absorbed by the Code Morphing Software:
  Microcode
  Superscalar out-of-order execution
  Variable-length instructions
  Segmentation
  Trigonometric functions
  Complex addressing modes
  Instruction prefixes
  Real and protected mode
  ASCII arithmetic

Code Morphing Software


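[Figure: CMS control flow between the interpreter and the translator.]
  Start: enter the interpreter.
  Interpreter: if the translation threshold has not been exceeded, interpret the next instruction.
  Translator: once the translation threshold is exceeded, translate the region and store it in the TCache.
  Execute the translation from the TCache.
    On a fault: roll back and return to the interpreter.
    If no chain exists to the next translation: look up the next instruction in the TCache. If it is found, chain to it; if it is not found, return to the interpreter.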
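
The interpret/translate decision in the flow above can be sketched as follows. This is a simplified illustration, not CMS itself: the class name, the threshold value and the string stored as a "translation" are all hypothetical, and faults, rollback and chaining are left out.

```python
class CodeMorphingSketch:
    """Toy model of the interpret-then-translate policy (hypothetical names)."""

    def __init__(self, translation_threshold: int = 2):
        self.translation_threshold = translation_threshold
        self.exec_counts = {}  # x86 address -> times interpreted
        self.tcache = {}       # x86 address -> cached "translation"

    def execute_block(self, x86_addr: int) -> str:
        """Run one block and report which path handled it."""
        if x86_addr in self.tcache:
            return "tcache"  # translation already cached: run native code
        count = self.exec_counts.get(x86_addr, 0) + 1
        self.exec_counts[x86_addr] = count
        if count > self.translation_threshold:
            # Hot enough: translate the region and store it in the TCache.
            self.tcache[x86_addr] = f"native code for {x86_addr:#x}"
            return "translate"
        return "interpret"  # still cold: interpret one more time


cms = CodeMorphingSketch(translation_threshold=2)
paths = [cms.execute_block(0x400) for _ in range(5)]
```

With a threshold of 2, the block is interpreted twice, translated on the third execution, and served from the translation cache afterwards, which is how the translation cost is amortized.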

Code Morphing Benefits


Molecules explicitly encode the instruction-level parallelism, so they can be executed by a simple VLIW engine.
  Hardware does not need to perform complex instruction reordering.
  Simplicity means a fast, low-power design.
Processor upgrades are simplified.
  Because of the software layer, software developers don't have to recompile programs.
  A new hardware architecture only needs new code morphing software.
  Code morphing software can be upgraded independently in flash ROM.

Code Morphing Benefits contd..


The software layer helps the debugging process.
  There are different ways to perform the same function, so the software can be changed during debugging.
The software layer increases performance.
  Timing of critical paths is improved.
  Optimization removes unnecessary instructions.
  Software can reorder instructions much better than hardware because it looks at a bigger window of instructions and applies more complicated algorithms.

Dynamic Translation using Chaining

[Figure: code cache and code cache tags during run-time execution, showing a translated block before and after chaining.]

Pre-chained:
  add %r5, %r6, %r7
  li %next_addr_reg, next_addr    # load address of next block
  j dispatch loop

Chained:
  add %r5, %r6, %r7
  j physical location of translated code for next_block

Performing a Translation
The frontend decodes x86 instructions into a simple sequence of atoms.
The optimizer applies well-known compiler optimizations (including elimination of unnecessary atoms from the instruction stream).
The scheduler reorders the atoms and groups them into molecules.

Translation Example
Original x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5

After the frontend:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

After the optimizer:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

After the scheduler (one molecule per line; only one load per molecule, since there is a single load/store unit):
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30

Translation Step 1
Original x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5

Translation by code morphing software

Native VLIW code:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

Translation Step 2
Native VLIW code:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

Optimisation: elimination of atoms, plus removal of unneeded condition code updates.

Optimised native VLIW code:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5
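
One part of this optimisation step, dropping condition code updates that nothing reads, can be sketched as a backward pass over the atoms. This is a hedged illustration: the tuple representation, the function name and the `brcc` reader opcode are assumptions made for the example, and the elimination of the redundant second load is not modelled here.

```python
# Hypothetical set of opcodes that read the condition codes.
FLAG_READERS = {"brcc"}


def drop_dead_flag_writes(atoms, flags_live_at_exit=True):
    """Strip the '.c' suffix from atoms whose condition codes are never read.

    atoms is a list of (opcode, operands) tuples in program order.
    """
    live = flags_live_at_exit  # are the flags needed after this point?
    kept = []
    for opcode, operands in reversed(atoms):
        if opcode.endswith(".c"):
            if live:
                live = False           # this write satisfies the later reader
            else:
                opcode = opcode[:-2]   # overwritten before any read: dead
        elif opcode in FLAG_READERS:
            live = True
        kept.append((opcode, operands))
    kept.reverse()
    return kept


frontend_output = [
    ("ld", "%r30, [%esp]"),
    ("add.c", "%eax, %eax, %r30"),
    ("ld", "%r31, [%esp]"),
    ("add.c", "%ebx, %ebx, %r31"),
    ("ld", "%esi, [%ebp]"),
    ("sub.c", "%ecx, %ecx, 5"),
]
optimised = drop_dead_flag_writes(frontend_output)
```

Only the final `sub.c` keeps its suffix, because it is the last writer before the end of the region, where the x86 flags must be architecturally correct.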

Translation Step 3
Optimised native VLIW code:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

Scheduling: the remaining atoms are grouped into molecules using a large window.

Scheduled native VLIW code:
  1. ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  2. ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
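
The grouping of atoms into molecules can be approximated by a greedy list scheduler. The sketch below is an assumption-laden illustration rather than the CMS scheduler: it uses the unit counts given in these slides (two ALUs, one load/store unit, one FP unit, one branch unit), assumes a result is usable in the molecule after the one that produces it, assumes reads happen before writes within a molecule, and ignores memory dependencies. The `Atom` type and `schedule` function are hypothetical names.

```python
from dataclasses import dataclass

UNIT_SLOTS = {"ALU": 2, "LSU": 1, "FPU": 1, "BRU": 1}
MAX_ATOMS_PER_MOLECULE = 4


@dataclass(frozen=True)
class Atom:
    text: str    # assembly text, kept for display
    unit: str    # functional unit class: "ALU", "LSU", "FPU" or "BRU"
    dst: str     # destination register
    srcs: tuple  # source registers


def schedule(atoms):
    """Greedily place each atom in the earliest molecule that is legal."""
    molecules = []
    last_write = {}  # register -> molecule index of its latest writer
    last_read = {}   # register -> molecule index of its latest reader
    for atom in atoms:
        earliest = 0
        for reg in atom.srcs:                  # read-after-write
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + 1)
        if atom.dst in last_write:             # write-after-write
            earliest = max(earliest, last_write[atom.dst] + 1)
        if atom.dst in last_read:              # write-after-read
            earliest = max(earliest, last_read[atom.dst])
        index = earliest
        while True:
            while len(molecules) <= index:
                molecules.append([])
            molecule = molecules[index]
            used = sum(1 for a in molecule if a.unit == atom.unit)
            if len(molecule) < MAX_ATOMS_PER_MOLECULE and used < UNIT_SLOTS[atom.unit]:
                break
            index += 1                         # unit busy: try next molecule
        molecule.append(atom)
        last_write[atom.dst] = index
        for reg in atom.srcs:
            last_read[reg] = max(last_read.get(reg, 0), index)
    return [[a.text for a in m] for m in molecules]


optimised_atoms = [
    Atom("ld %r30, [%esp]", "LSU", "%r30", ("%esp",)),
    Atom("add %eax, %eax, %r30", "ALU", "%eax", ("%eax", "%r30")),
    Atom("add %ebx, %ebx, %r30", "ALU", "%ebx", ("%ebx", "%r30")),
    Atom("ld %esi, [%ebp]", "LSU", "%esi", ("%ebp",)),
    Atom("sub.c %ecx, %ecx, 5", "ALU", "%ecx", ("%ecx",)),
]
molecules = schedule(optimised_atoms)
```

On the five optimised atoms this yields two molecules with the same grouping as the slide: the first load paired with `sub.c`, then the second load with both adds. The order of atoms inside a molecule may differ from the slide, which does not matter because all atoms of a molecule issue together.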

CMS Translation Cache


Purposes of the translation cache (a trace cache)
  A place to keep translations of x86 code
  A way to build longer code sequences that are better suited to optimization
  Reduces the fetch bottleneck (the original reason for trace caches, which also exist in the Pentium 4)
Successive executions of a translation invoke only the optimizer, not the translator
  The cost of translation is amortized over successive executions
Computed gotos, trace linking and inlined function calls add significant overhead in a software-based trace optimizer
  Simple chaining is not sufficient; special hardware exists for locating traces.

Code Optimization
The optimizer examines whole translations at a time
Several levels of optimization: from interpretation up to highly optimized
  High levels of optimization add run-time overhead
  Only worth doing for frequently executed code
Code morphing instruments the generated code to help determine usage patterns (a count of the number of times it is executed)
The optimization level to apply is chosen through heuristics based on usage patterns
  The more a translation is executed, the more optimized it becomes
  If monitoring indicates that optimizations were too aggressive, the trace is partially de-optimized.

CMS Memory layout


[Figure: CMS memory layout. The compressed CMS image in flash ROM expands to 2 MB in main memory. Main memory (axis labels: 0 MB, 2 MB, 16 MB, maximum RAM address) is divided into memory available to applications and memory used by CMS, which includes the translation cache.]

Hardware Support for Code Morphing


Explicit setting of condition codes
  Crusoe uses specific registers to emulate the setting of condition codes by the processor (a .c suffix on an instruction shows that the condition codes need to be set).
All registers holding x86 state are shadowed
  The commit operation copies active state to the shadow registers.
Gated store buffer for memory writes
Alias hardware allows load instructions to be ordered ahead of store instructions
Memory-mapped I/O
Translated bit in the page table to detect self-modifying code and provide protection

Shadow Registers
Two copies of each register: a working copy and a shadow copy
If execution reaches the end of a translation block, a commit operation is performed
  All working registers are copied into the shadow registers
If any exceptional condition occurs inside the translation block, a rollback operation is performed
  The shadow register values are copied back into the working registers

Gated Store Buffers


Store data is held in the gated store buffer
  It is released to the memory system at the time of a commit
  On a rollback, it is simply dropped from the store buffer
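
The interplay of shadow registers and the gated store buffer can be modelled in a few lines. This is a behavioural sketch only: the class and method names are hypothetical, memory is a plain dictionary, and the real hardware performs commit and rollback in hardware rather than by copying.

```python
class SpeculativeState:
    """Toy model of working/shadow registers plus a gated store buffer."""

    def __init__(self, registers):
        self.working = dict(registers)  # registers updated speculatively
        self.shadow = dict(registers)   # last committed ("known good") copy
        self.store_buffer = []          # gated stores: (address, value)
        self.memory = {}                # committed memory contents

    def write_reg(self, name, value):
        self.working[name] = value

    def store(self, address, value):
        # Stores are held back until the next commit.
        self.store_buffer.append((address, value))

    def commit(self):
        # Working registers become the new known-good state and the gated
        # stores are released to memory in program order.
        self.shadow = dict(self.working)
        for address, value in self.store_buffer:
            self.memory[address] = value
        self.store_buffer.clear()

    def rollback(self):
        # Restore the last committed registers and drop pending stores.
        self.working = dict(self.shadow)
        self.store_buffer.clear()


state = SpeculativeState({"%eax": 1})
state.write_reg("%eax", 2)
state.store(0x10, 99)
state.rollback()  # as if an exception occurred inside the translation
```

After the rollback in this example, `%eax` holds its committed value again and memory is untouched, which is what allows CMS to re-execute the region conservatively.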

Exceptions
Original x86 code:
  addl %eax, (%esp)    # load data from stack, add to eax
  addl %ebx, (%esp)    # load data from stack, add to ebx
  movl %esi, (%ebp)    # load esi from memory
  subl %ecx, 5         # sub 5 from ecx

Scheduled VLIW code:
  ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30

x86 instructions are executed out of order with respect to the original program flow.
The state must be restored for precise traps.

Exception Handling
x86 exceptions are precise
  This is problematic for out-of-order execution of instructions
On an exception, processor state is rolled back to the most recent commit.
  Execution proceeds in in-order mode until the fault location is found
  Memory updates are rolled back through the gated store buffer (which holds x86 stores until a commit).

Alias Hardware for Data Speculation


The translator often cannot prove that load and store addresses do not overlap.
Crusoe provides simple alias hardware support
  It allows CMS to reorder selected memory references
  It takes on the burden of verifying at run time that the reordered references did not overlap
When it detects a violation
  It raises an exception.
  CMS may invoke rollback and conservative re-execution in the interpreter.

Moving Loads Ahead of Stores


When the translator moves a load operation ahead of a store operation
  Load => load-and-protect (ldp)
    Loads and records the address and size of the data loaded
  Store => store-under-alias-mask (stam)
    Checks for protected regions
    Raises an exception if the store overlaps one

Original order:
  st %data, [%x]
  ld %r31, [%y]
  use %r31

Reordered:
  ldp %r31, [%y]
  stam %data, [%x]
  use %r31
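
The load-and-protect / store-under-alias-mask pair can be modelled as a range-overlap check. The sketch below is an assumption: the real hardware selects which protected entries a store checks through an alias mask, whereas this toy version checks every recorded range. All names are hypothetical.

```python
class AliasFault(Exception):
    """Raised when a store overlaps a region protected by an earlier load."""


class AliasHardware:
    def __init__(self):
        self.protected = []  # (start address, size in bytes) of hoisted loads

    def load_and_protect(self, memory, address, size):
        """ldp: perform the load and remember the range that was read."""
        self.protected.append((address, size))
        return memory.get(address, 0)

    def store_under_alias_mask(self, memory, address, size, value):
        """stam: fault if the store overlaps any protected range."""
        for start, length in self.protected:
            if address < start + length and start < address + size:
                raise AliasFault(f"store to {address:#x} overlaps {start:#x}")
        memory[address] = value

    def clear(self):
        """Forget all protected ranges, e.g. at a commit."""
        self.protected.clear()


hw = AliasHardware()
mem = {0x100: 5}
value = hw.load_and_protect(mem, 0x100, 4)    # load hoisted above the store
hw.store_under_alias_mask(mem, 0x200, 4, 9)   # disjoint address: no fault
```

A later store to `0x102` would overlap the protected bytes `0x100`-`0x103` and raise `AliasFault`, at which point CMS would roll back and re-execute conservatively.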

Memory-mapped I/O
Memory-mapped I/O cannot be distinguished at translation time from regular memory accesses.
Load and store atoms specify whether they have been reordered.
When such a speculative memory atom accesses a memory page that is mapped to I/O space, an exception is raised.
CMS then performs a rollback.

Handling Self-Modifying Code


When a translation is made, the associated x86 code page is marked as translated in the page table
A store to a translated code page causes a trap, and the associated translations are invalidated
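
The bookkeeping behind this mechanism can be sketched as a map from x86 code pages to the translations built from them. This is a software-only illustration with hypothetical names; it assumes the standard 4 KB x86 page size and invalidates at page granularity.

```python
PAGE_SIZE = 4096  # standard x86 page size


class TranslatedPageTracker:
    """Toy model of per-page 'translated' marks and invalidation."""

    def __init__(self):
        self.translations_by_page = {}  # page number -> set of translation ids

    def add_translation(self, translation_id, x86_start, x86_length):
        """Mark every page covered by the translated x86 code."""
        first = x86_start // PAGE_SIZE
        last = (x86_start + x86_length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.translations_by_page.setdefault(page, set()).add(translation_id)

    def store(self, address):
        """Model a store; return the set of translations it invalidates."""
        page = address // PAGE_SIZE
        invalidated = self.translations_by_page.pop(page, set())
        # Drop the invalidated translations from any other page they span,
        # and unmark pages that no longer back any translation.
        self.translations_by_page = {
            p: ids - invalidated
            for p, ids in self.translations_by_page.items()
            if ids - invalidated
        }
        return invalidated


tracker = TranslatedPageTracker()
tracker.add_translation("T1", 0x1FF0, 0x20)  # spans pages 1 and 2
tracker.add_translation("T2", 0x2100, 0x10)  # page 2 only
```

A store to an unmarked page returns an empty set and costs nothing extra, while a store anywhere in page 1 invalidates `T1` even though part of `T1` lives in page 2.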

Other Remarks about Code Morphing


Unconditional jumps are completely eliminated; both paths of conditional jumps are speculatively executed, with the proper results being selected later
The scheduler can reorder instructions in the generated code
Registers are renamed aggressively (no hardware register allocation is needed)
Speculation and failure recovery
  CMS monitors recurring failures and generates a more conservative translation (e.g., retranslation of smaller regions).

Crusoe Performance Disadvantages


Comparison with conventional out-of-order execution processors, from the CMS perspective
  Operations whose latency is not always predictable, such as loads, can stall the static schedule when they are delayed
  Operations from several traces cannot be executed concurrently
Code optimization doesn't start until a block of code has been executed more than a few times.
  Application startup time can be long.
  It may take a while for an application to be translated and optimized
Some applications have a large working set of code that will not fit entirely into the fixed-size translation cache

Crusoe Performance Disadvantages


Code morphing software is loaded into system memory at boot-up
  It takes up 2 MB of system memory plus an additional 6-14 MB for the translation cache.
Dynamic translation can take up to 6 times more system memory than running the same code on a native x86 CPU.
  Thus 64 MB of system RAM is effectively only about 8-9 MB: (64 - 2 - 14) / 6 = 8 MB and (64 - 2 - 6) / 6 ≈ 9.3 MB. The result is more frequent access to disk-based virtual memory.

Power Management


LongRun Dynamic Power Management


Conventional power-saving approaches
  Switch off the processor quickly to save power
  Change the clock rate by suspending the processor and restarting it
Crusoe LongRun approach
  Adjust the clock rate dynamically, without suspension
  Adjust the voltage level
LongRun achieves power reductions of up to 30%

LongRun Overview
Adaptive power management
  Dynamically reduces core processor power consumption to near-optimal levels in response to application workload requirements
Thermal management
  Intelligently adapts processor operation to the system's thermal environment
Cooperation between LongRun and the Code Morphing System (CMS)

Power Management States


Crusoe maps the industry-standard ACPI power management modes to six processor states
  ACPI global system states: Working, Auto Halt, Quick Start, Deep Sleep, Sleeping, Suspend-to-RAM, Suspend-to-Disk, Soft Off, Mechanical Off
  Crusoe power management states: Normal, Auto Halt, Quick Start, Deep Sleep, DSX, Off
The mapping of states is shown in the power management states table (next slide)

Power Management States


ACPI System State                  Processor State   SDRAM                 Clock Generator
G0 / S0 / C0   Working             Normal            Normal                Running
G0 / S0 / C1   Auto Halt           Auto Halt         Normal/Self Refresh   Running
G0 / S0 / C2   Quick Start         Quick Start       Self Refresh          Running
G0 / S0 / C3   Deep Sleep          Deep Sleep        Self Refresh          Clocks Stopped
G1 / S1        Sleeping            Deep Sleep DSX    Self Refresh          PLL Shut Down
G1 / S3        Suspend-to-RAM      Off               Self Refresh          PLL Shut Down
G1 / S4        Suspend-to-Disk     Off               Off                   Off
G2 / S5        Soft Off            Off               Off                   Off
G3             Mechanical Off      Off               Off                   Off
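The power management states table can be read as a simple lookup from ACPI state to processor, SDRAM, and clock generator state. The sketch below transcribes it into a dictionary; the names `ACPI_TO_CRUSOE` and `processor_state` are illustrative only and are not part of any Transmeta interface.

```python
# ACPI state -> (processor state, SDRAM state, clock generator state),
# transcribed from the power management states table.
ACPI_TO_CRUSOE = {
    "G0/S0/C0": ("Normal", "Normal", "Running"),
    "G0/S0/C1": ("Auto Halt", "Normal/Self Refresh", "Running"),
    "G0/S0/C2": ("Quick Start", "Self Refresh", "Running"),
    "G0/S0/C3": ("Deep Sleep", "Self Refresh", "Clocks Stopped"),
    "G1/S1": ("DSX", "Self Refresh", "PLL Shut Down"),  # table entry: "Deep Sleep DSX"
    "G1/S3": ("Off", "Self Refresh", "PLL Shut Down"),
    "G1/S4": ("Off", "Off", "Off"),
    "G2/S5": ("Off", "Off", "Off"),
    "G3": ("Off", "Off", "Off"),
}

def processor_state(acpi_state):
    """Return the Crusoe processor state for an ACPI state string."""
    return ACPI_TO_CRUSOE[acpi_state][0]

# Nine ACPI states collapse onto six distinct processor states.
distinct_states = {entry[0] for entry in ACPI_TO_CRUSOE.values()}
print(sorted(distinct_states))
```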

Power Management State Diagram


[Figure: power management state diagram. States: Normal (H = false), Auto Halt (H = true), Quick Start, Deep Sleep, DSX, and Snoop Service.]

Transition conditions shown in the diagram:
  HLT (note 2) and Halt bus cycle (note 4)
  Halt break (note 3)
  STPCLK# asserted and Stop Grant bus cycle (note 5)
  STPCLK# negated and H = false (note 1)
  STPCLK# negated and H = true (note 1)
  SLEEP# asserted and CLKIN stopped
  SLEEP# negated and CLKIN running
  Reduce CVDD
  Increase CVDD
  Snoop event
  Snoop serviced and H = true (note 1)
  Snoop serviced and H = false (note 1)

Notes:
1. H = processor halt state
2. HLT = x86 HLT instruction executed
3. Halt break = INTR, NMI, SMI#, INIT#, or RESET#
4. Halt bus cycle = PCI special cycle
5. Stop Grant bus cycle = PCI special cycle

LongRun Power Management


LongRun provides CMS with the ability to adjust:
  Processor core operating voltage (V).
  Clock frequency (f).
Dynamically adjusts V and f depending on the current application load on the processor.
Produces cubic reductions in power consumption (Power ∝ V²f).
Conventional processors can only scale down power linearly by reducing f.
LongRun power management policies are implemented within CMS.
Runtime performance information is used to detect different workload scenarios.
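The cubic claim follows if the achievable clock frequency is taken to scale roughly in proportion to the supply voltage. That proportionality is an idealization added here for illustration; the slides state only the V²f relationship. C denotes the switched capacitance, a symbol not used in the slides.

```latex
\[
P_{\mathrm{dyn}} \propto C\,V^{2} f,
\qquad
f \propto V \;\Longrightarrow\; P_{\mathrm{dyn}} \propto f^{3}
\]
\[
\left.\frac{P(f/2)}{P(f)}\right|_{V\ \mathrm{fixed}} = \frac{1}{2},
\qquad
\left.\frac{P(f/2)}{P(f)}\right|_{V \propto f} = \left(\frac{1}{2}\right)^{3} = \frac{1}{8}
\]
```

Under this simplification, halving the clock with voltage held fixed halves dynamic power, while halving both clock and voltage cuts it to one eighth.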

LongRun Power Adjustments


Power adjustments are transparent to the operating system, power management controller, and user.
Uses a number of core frequency/voltage operating points
  Allows LongRun to optimize the processor for the lowest power and maximum performance along the operating curve
Processor transparently switches over to traditional power models when frequency and voltage scaling reaches the minimum operating point on the curve
  Allows ACPI-like policies to handle power management at very low-power operating points

LongRun Power Management Operating Curve

[Figure: LongRun operating curve. Power (vertical axis) versus core frequency from 300 to 1000 MHz (horizontal axis), with a Typical Operating Region and a Peak Performance Region marked along the curve.]

Operating points on the curve:
  Frequency   Core Voltage
  300 MHz     0.80 V
  433 MHz     0.875 V
  533 MHz     0.95 V
  667 MHz     1.05 V
  800 MHz     1.15 V
  900 MHz     1.25 V
  1 GHz       1.30 V
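To get a feel for what the operating points on the curve imply, the sketch below applies the Power ∝ V²f relationship to each frequency/voltage pair, normalized to the 1 GHz point. The `pick_point` policy is a hypothetical stand-in: the actual LongRun policy lives inside CMS and is not described in these slides.

```python
# (frequency in MHz, core voltage in V), from the operating curve slide.
OPERATING_POINTS = [
    (300, 0.80), (433, 0.875), (533, 0.95), (667, 1.05),
    (800, 1.15), (900, 1.25), (1000, 1.30),
]

def relative_power(freq_mhz, volts, ref=OPERATING_POINTS[-1]):
    """Dynamic power relative to the reference point, using P ~ V^2 * f."""
    ref_f, ref_v = ref
    return (volts / ref_v) ** 2 * (freq_mhz / ref_f)

def pick_point(required_mhz):
    """Hypothetical policy: the slowest operating point that meets the demand."""
    for freq, volts in OPERATING_POINTS:
        if freq >= required_mhz:
            return freq, volts
    return OPERATING_POINTS[-1]

for freq, volts in OPERATING_POINTS:
    print(f"{freq:5d} MHz  {volts:.3f} V  {relative_power(freq, volts):.2f} of peak")
```

Under this model the 300 MHz point draws roughly 0.11 of peak dynamic power, compared with 0.30 if only the frequency were reduced. Leakage and non-core power are ignored.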

Typical Operating Power per State


Application Workload   ACPI State   Typical Processor Power
DVD Playback           C0-C3        1.0-1.5 W
MP3 Playback           C0-C3        0.50 W
Auto Halt              C1           0.35 W
Quick Start            C2           0.30 W
Deep Sleep             C3           0.15 W
DSX                    C3           0.10 W

1. All power supplies at their nominal operating values. Full system power management enabled, including LongRun power management.
2. Typical DVD power is measured while running the WinDVD 2000 player under Windows 2000.
3. Typical MP3 power is measured while running MMJukebox under Windows 2000.

Power Comparison
Processor                  Clock         L1                  L2       Vcc           Power         Process
Efficeon TM8600            1.0-1.2 GHz   I 128 KB, D 64 KB   1 MB     Unspecified   Unspecified   0.13 um
Crusoe TM5400              667-800 MHz   128 KB              256 KB   0.9-1.3 V     0.4-1.0 W     0.13 um
Crusoe TM5800              667-800 MHz   128 KB              512 KB   0.9-1.3 V     0.4-1.0 W     0.13 um
Intel Mobile Pentium III   600-750 MHz   I 16 KB, D 16 KB    256 KB   1.1-1.35 V    12.2 W        0.18 um
Intel Pentium M            900 MHz       I 32 KB, D 32 KB    1 MB     0.84-1.0 V    7 W           0.13 um

Note: It makes sense to compare only processors from the same time frame. Hence, Efficeon should be compared with Pentium M, while Crusoe should be compared with Mobile Pentium III.

LongRun Thermal Management


Thermal management is integrated into the dynamic power management operating point policies.
  Manages the processor thermal environment by using frequency/voltage operating point shifts as a substitute for thermal throttling
  Delivers higher performance at the same die temperature, or the same performance at a lower die temperature
Crusoe provides an integrated on-die thermal diode.
  Can be connected to an external temperature sensor, so that processor temperature can be monitored by system BIOS and application software

Memory


CMS Memory layout


[Figure: CMS memory layout. The compressed CMS FlashROM image expands to 2 MB in main memory. Main memory is shown divided into memory used by CMS (including the translation cache) and memory available to applications, with address markers at 0 MB, 2 MB, 16 MB, and the maximum RAM address.]

DDR Memory Interface


TM5800 processors include an integrated high-performance DDR (double data-rate) SDRAM controller and interface.
The DDR controller supports only DDR SDRAM and transfers data at a rate that is twice the clock frequency of the interface.
The DDR SDRAM controller supports the equivalent of two DIMMs (up to four ranks) of DDR SDRAM using a 64-bit wide interface.
The DDR SDRAM interface does not support parity bits.
Ranks (Not Banks) Terminology
  The grouping of sections of memory on system boards, commonly referred to by designers as sides or banks, is referred to here by the proper name of rank, as this is the term recognized by the memory industry.
  A rank describes the memory chips connected to a common chip select signal.
  (Use of the term bank to describe memory organization can be confusing because modern memory chips use bank select signals in addition to row and column address signals.)
  It is possible for single-rank memory modules to contain memory chips on both sides of the module.

DDR Memory Interface


Supported DDR Memory Types
  TM5800 processors support only non-buffered/non-registered/non-ECC DDR SDRAM memory on this interface. SDR (single data rate) SDRAM memory, buffered/registered memory, and ECC memory are not supported.
  TM5800 memory subsystems can be populated with 64-Mbit (4M x 16 or 8M x 8), 128-Mbit (8M x 16 or 16M x 8), 256-Mbit (16M x 16 or 32M x 8), or 512-Mbit (32M x 16 or 64M x 8) devices.
  Only x8 and x16 memory devices are supported.
DDR Memory Speed (Frequency)
  For memory configurations with up to 8 loads per interface signal, TM5800 processors support DDR interface frequencies up to 133 MHz (DDR-266).
  With memory configurations having more than 8 loads per signal, the DDR interface frequency must be reduced below 133 MHz.
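The interface width, transfer rate, and device organizations above are enough to work out peak bandwidth and how many devices make up one rank. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the device count per rank is derived here from the 64-bit bus width rather than quoted from the slides.

```python
def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock, bus_bits=64):
    """Peak transfer rate in MB/s (10^6 bytes/s) for a given interface clock."""
    return clock_mhz * transfers_per_clock * (bus_bits // 8)

def rank_layout(device_mbit, device_width, bus_bits=64):
    """Devices needed to fill one 64-bit rank, and the rank capacity in MB."""
    devices = bus_bits // device_width
    capacity_mb = devices * device_mbit // 8
    return devices, capacity_mb

print(peak_bandwidth_mb_s(133, transfers_per_clock=2))  # DDR-266 on a 64-bit bus
print(peak_bandwidth_mb_s(133, transfers_per_clock=1))  # SDR at the same clock
print(rank_layout(256, 16))  # 256-Mbit devices organized 16M x 16
print(rank_layout(256, 8))   # 256-Mbit devices organized 32M x 8
```

A rank built from x16 devices needs four devices, while a rank built from x8 devices needs eight, which is relevant to the load limits on the interface.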

DDR Memory Interface Constraints


TM5800 processors have a quad-rank memory controller implementation.
Only one register is available to describe and store all of the parameters of the memory types used in the system. Therefore every rank of memory must be identical:
  Memory is assumed to be contiguous, so sequential placement of memory in each rank is required.
  Memory must be placed in ranks in equal capacities (sizes).
  Memory must be populated sequentially and contiguously so that no rank is skipped and left open.
  Memory must be of the same density and organization (geometry).
  All memory devices must be the same speed.
  Memory interface operating frequency must be set for the lowest-speed memory in the system.

DDR Memory Interface Constraints


The DDR interface cannot drive more than eight unbuffered loads (devices) at the maximum interface frequency of 133 MHz.
  For memory configurations that exceed eight loads, the interface frequency must be reduced below 133 MHz.
  Placing the DDR devices directly on the motherboard is recommended for best high-speed signal integrity.
  Memory configurations must be chosen so that the number of loads is minimized.
There are strict constraints for user-installed DDR expansion memory.
  It is strongly recommended that SDR memory (and not DDR memory) be used for user-installed expansion memory.

DDR Memory Rank and Chip Select Examples


Each of the four memory ranks (once populated) is controlled through one of the four DDR interface chip select signals (C_CS0#, C_CS1#, C_CS2#, C_CS3#). These chip selects are used by the memory controller to manage accesses between the connected memories.
What if different sizes/geometries of memory are paired up?
  The organization of the memory addressing will become incompatible and the system will become non-functional.
What if identical memory of the same size and geometry is populated on a system board so that a rank is skipped (e.g., placing memory in Rank 0 and Rank 2 and not in Rank 1)?
  Only the memory located in the first rank will be recognized and available to the system, while the second rank of memory will not be recognized.
The same memory interface constraints apply to memory modules (e.g., SODIMMs) as well as to soldered-down memory.

DDR Memory Rank and Chip Select Examples


RANK 0 (C_CS0#)   RANK 1 (C_CS1#)   RANK 2 (C_CS2#)   RANK 3 (C_CS3#)   Result
16M x 16          16M x 16          (empty)           (empty)           Supported
16M x 16          32M x 8           (empty)           (empty)           Not Supported (different geometries)
16M x 16          (empty)           16M x 16          (empty)           Not Supported (non contiguous)
16M x 16          16M x 16          16M x 16          (empty)           Supported
16M x 16          16M x 16          16M x 16          16M x 16          Supported

[Figure: dual-rank module examples, both marked Supported. In the first, one dual-rank module uses C_CS0#/C_CS1# and a second dual-rank module uses C_CS2#/C_CS3#. The second example shows two dual-rank modules labeled with C_CS0# and with C_CS1#/C_CS2#.]
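The population rules in these examples (ranks filled contiguously from Rank 0, all with the same geometry) can be expressed as a short check. This is a sketch of the stated rules only, not of how the memory controller or BIOS actually validates memory; the function name and the tuple representation of a rank are assumptions made for illustration.

```python
def check_ddr_population(ranks):
    """Check a rank population against the DDR constraints on the slides.

    ranks: list of four entries for Rank 0..3, each a (depth, width) tuple
    such as ("16M", 16), or None for an empty rank.
    Returns (supported, reason).
    """
    populated = [rank for rank in ranks if rank is not None]
    if not populated:
        return False, "no memory installed"

    # Ranks must be filled sequentially: nothing may follow an empty rank.
    seen_empty = False
    for rank in ranks:
        if rank is None:
            seen_empty = True
        elif seen_empty:
            return False, "non contiguous"

    # A single register describes all ranks, so geometries must match.
    if any(rank != populated[0] for rank in populated):
        return False, "different geometries"
    return True, "supported"

A = ("16M", 16)
print(check_ddr_population([A, A, None, None]))
print(check_ddr_population([A, ("32M", 8), None, None]))
print(check_ddr_population([A, None, A, None]))
```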

SDR Memory Interface


TM5800 processors include an integrated high-performance SDR (single data-rate) SDRAM controller and interface.
The SDR controller supports only SDR SDRAM and transfers data at a rate equal to the clock frequency of the interface.
The SDR SDRAM controller supports up to two 64-bit DIMMs (up to four ranks) of single data rate SDRAM.
The SDR SDRAM interface does not support parity bits.
Supported SDR Memory Types
  SDR DIMMs can be populated with 64-Mbit, 128-Mbit, 256-Mbit, or 512-Mbit devices.
  All DIMMs must use the same frequency SDRAMs, but there are no restrictions on mixing different DIMM configurations in the two DIMM slots.

SDR Memory Interface


SDR Memory Speed (Frequency)
  The frequency setting for the SDR SDRAM interface is initialized during the boot sequence from data stored in the configuration ROM.
  Although the processor can be configured for an SDR interface frequency in the range of 1/2 to 1/15 of the core frequency, the supported interface frequency is restricted to a minimum of 66 MHz and a maximum of 133 MHz.
SDR Memory Speed Adjustment by LongRun
  Memory frequency settings vary at each power management step.
  When the processor core frequency is changed by LongRun, the SDR interface frequency is recalculated to match the new core frequency setting.
  For example, a 1000 MHz device with a 125 MHz memory interface may have a LongRun setting of 667 MHz with a 133 MHz memory interface.
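The 1000 MHz / 125 MHz and 667 MHz / 133 MHz pairing in the example is what results from picking the smallest divider in the 1/2 to 1/15 range that keeps the interface within 66-133 MHz. The sketch below shows that selection; it assumes an exact 66.67 MHz input clock (200/3 MHz), integer core multipliers, and a "fastest allowed interface" policy, none of which is stated as the actual CMS behavior.

```python
from fractions import Fraction

CLKIN_MHZ = Fraction(200, 3)      # nominal 66.67 MHz input clock (assumed exact)
SDR_MIN_MHZ = Fraction(66)        # minimum supported interface frequency
SDR_MAX_MHZ = Fraction(400, 3)    # 133.33 MHz maximum

def sdr_divider(core_multiplier):
    """Smallest divider (fastest interface) within the supported range.

    Returns (divider, interface frequency in MHz).
    """
    core_mhz = CLKIN_MHZ * core_multiplier
    for divider in range(2, 16):  # dividers 1/2 .. 1/15
        sdr_mhz = core_mhz / divider
        if SDR_MIN_MHZ <= sdr_mhz <= SDR_MAX_MHZ:
            return divider, float(sdr_mhz)
    raise ValueError("no divider gives a supported SDR frequency")

print(sdr_divider(15))  # 1000 MHz core
print(sdr_divider(10))  # 667 MHz core
```

Exact fractions are used so that the 133.33 MHz boundary case compares cleanly instead of depending on floating-point rounding.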

SDR Memory Interface Constraints


The SDR SDRAM interface cannot drive more than sixteen unbuffered loads (devices).
  Different SDR memory ranks can have different size or geometry devices.
  The SDR memory interface will automatically be adjusted to run at the speed of the slowest installed SDR SDRAM memory.
It is recommended that all SDR memory installed in the system be the same speed.
  Maximum unbuffered SDR SDRAM interface operating frequency is 133 MHz.
SDR SDRAM configurations requiring more than sixteen loads must use buffered SDR memory.
  Maximum industry-standard buffered SDR SDRAM operating frequency is 66 MHz, while the maximum processor SDR SDRAM interface operating frequency is 133 MHz.
  Hence, with buffered memory, the processor SDR SDRAM interface operating frequency must be set below the standard LongRun power management settings.
SDR memory can be user expandable. SDR SDRAM is the preferred user-installed expansion memory option for TM5800 processor-based systems.

Clocks And Timing


Clocks
The TM5800 processor input clock (CLKIN) is multiplied by the processor clock multiplier to generate the processor core clock.
  For currently defined TM5800 SKUs, CLKIN is assumed to be 66.6 MHz.
The processor core clock is divided down by the DDR and SDR clock dividers to generate the DDR SDRAM and SDR SDRAM interface clocks.
There is also a clock divider that must be initialized for the PCI interface.
  The PCI interface operates at 33.3 MHz.
Clock multiplier and divider values are programmed into the TM5800 processor during initialization from data stored in the configuration ROM.

Timing Specification for Input Clocks


Parameter                               Clock     Minimum     Maximum
fclk (clock frequency)                  CLKIN     60.0 MHz    66.67 MHz
                                        P_PCLK    30.0 MHz    33.33 MHz
                                        S_CLKIN   -           133.33 MHz
tcycle (clock period)                   CLKIN     15.0 nS     16.67 nS
                                        P_PCLK    30 nS       -
                                        S_CLKIN   7.5 nS      -
thigh (clock high time)                 CLKIN     5.2 nS      -
                                        P_PCLK    11 nS       -
                                        S_CLKIN   3.375 nS    -
tlow (clock low time)                   CLKIN     5.0 nS      -
                                        P_PCLK    11 nS       -
                                        S_CLKIN   3.375 nS    -
tjitter (clock jitter)                  CLKIN     -           250 pS
                                        P_PCLK    -           500 pS
trise/fall (clock rise and fall time)   CLKIN     0.4 nS      1.6 nS
                                        P_PCLK    1.0 V/nS    4.0 V/nS
toffset (CLKIN to P_PCLK offset)                  1.5 nS      4.0 nS
tpll_lock (PLL relock time)                       -           20 µS
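The frequency and period rows of the input clock specification are two views of the same limits, since the period is the reciprocal of the frequency. The sketch below checks that correspondence for the values in the table; the pairing of each period with a clock is an interpretation of the table layout, and the helper name is hypothetical.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency in MHz."""
    return 1000.0 / freq_mhz

# (clock, frequency limit in MHz, matching period limit in nS)
LIMITS = [
    ("CLKIN max frequency / min period", 66.67, 15.0),
    ("CLKIN min frequency / max period", 60.0, 16.67),
    ("P_PCLK max frequency / min period", 33.33, 30.0),
    ("S_CLKIN max frequency / min period", 133.33, 7.5),
]

results = []
for label, freq_mhz, listed_ns in LIMITS:
    computed_ns = period_ns(freq_mhz)
    # Allow for the rounding used in the specification values.
    consistent = abs(computed_ns - listed_ns) < 0.01
    results.append(consistent)
    print(f"{label}: {computed_ns:.3f} nS vs {listed_ns} nS, consistent={consistent}")
```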

Timing Diagram for Input Clocks


[Figure: input clock timing diagram for CLKIN, P_PCLK, and S_CLKIN, showing tcycle, thigh, tlow, trise, and tfall measured against the 2.0 V, Vm, and 0.8 V levels.]

Vm = 1.2 V for CLKIN
   = 0.4 * IOVDD for P_PCLK
   = 1.4 V for S_CLKIN

Timing Specification for DDR SDRAM Interface


Parameter             Description                                     Minimum          Maximum
fclk                  C_CLK frequency                                 -                133 MHz
tcycle                C_CLK period                                    7.5 nS           -
tlow, thigh           C_CLK low time, high time                       0.45 bus clks    0.55 bus clks
tjitter               C_CLK jitter                                    -                150 pS
Vx                    Differential cross point voltage                1.1 V            1.4 V
tvalid_dqs            C_DQ, C_DQMB valid from C_DQS (writes)          -                3.05 nS
tohold_dqs            C_DQ, C_DQMB hold from C_DQS (writes)           0.76 nS          -
tdqs_skew             C_DQS to C_DQ, C_DQMB skew                      -0.76 nS         +0.76 nS
tdqs_low, tdqs_high   C_DQS input low time, C_DQS input high time     0.45 bus clks    0.55 bus clks
tdqs_preamble         C_DQS preamble valid time                       0.9 bus clks     1.1 bus clks
toff                  Active to float delay, C_DQ                     0 nS             2.5 nS
                      Active to float delay, C_DQS                    -0.5 nS          +0.5 nS
nras_cas              C_RAS# to C_CAS# latency                        1 bus clock      16 bus clocks
ncas_read             C_CAS# to read latency                          1 bus clock      16 bus clocks
nread_pchg            Read precharge delay                            1 bus clock      16 bus clocks
nwr_pchg              Write precharge delay                           1 bus clock      16 bus clocks
nrow_pchg             Row precharge time                              1 bus clock      16 bus clocks
nidmrs                Idle cycles after Mode Register Set             2 bus clocks     17 bus clocks
nras_ras              Row cycle time                                  2 bus clocks     17 bus clocks
nburst                Burst length                                    4 transfers      4 transfers
nrefresh              Refresh rate                                    128 bus clocks   16K bus clocks

Timing Specification for DDR SDRAM Interface


Clock specifications apply to C_CLKA, C_CLKA#, C_CLKB, C_CLKB#. C_CLKA and C_CLKA# are 180° out of phase. C_CLKB and C_CLKB# are 180° out of phase. C_CLKA and C_CLKB are copies of each other.
The data parameters are specified relative to DQS signals, and CMD parameters are specified relative to the C_CLK/C_CLK# differential cross point voltage.
CMD signals are: C_A[12:0], C_BA[1:0], C_CAS#, C_CKE[1:0], C_CS#[3:0], C_RAS#, C_WE#.
Assumes 80 pF maximum load on each CMD signal and 10 pF maximum load on each of C_DQ[63:0].
These parameters are programmable within the processor.
Row precharge time is the number of bus clocks between the power-on precharge and the next time RAS can be asserted.
Row cycle time is the number of bus clocks between refresh and the next time RAS can be asserted for other SDRAM operations. This is also the number of cycles the DDR SDRAM controller waits before starting any SDRAM access after it exits clock-off mode.
The DDR SDRAM controller always performs burst operations.

Timing Diagram for DDR SDRAM Interface Read Cycle


[Figure: DDR SDRAM read cycle timing diagram over bus clocks 0-10. Signals: C_CLK#, C_CLK, C_CKE[1:0], C_RAS#, C_CS#[3:0], C_A[12:0], C_BA[1:0], C_CAS#, C_DQ[63:0] (IN0-IN3), C_DQS[7:0]. Annotated intervals: tcycle, nras_cas, nras_ras, ncas_read, nread_pchg, tvalid, tdqs_preamble, tdqs_high, tdqs_low, tdqs_skew.]

Note: CAS Latency = 2 in this diagram.

Timing Diagram for DDR SDRAM Interface Write Cycle


[Figure: DDR SDRAM write cycle timing diagram over bus clocks 0-10. Signals: C_CLK#, C_CLK, C_CKE[1:0], C_RAS#, C_CS#[3:0], C_A[12:0], C_BA[1:0], C_CAS#, C_WE#, C_DQMB[7:0] (Val0-Val3), C_DQ[63:0] (Out0-Out3), C_DQS[7:0]. Annotated intervals: nras_cas, nras_ras, nwr_pchg, tvalid_dqs, tohold_dqs.]

Timing Specification for SDR SDRAM Interface


Parameter    Description                            Minimum          Maximum
fclk         S_CLKIN, S_CLKOUT, S_CLK frequency     -                133 MHz
tsetup       Input setup time                       1.7 nS           -
tihold       Input hold time                        1.9 nS           -
nras_cas     S_RAS# to S_CAS# latency               1 bus clock      16 bus clocks
ncas_read    S_CAS# to read latency                 1 bus clock      16 bus clocks
nread_pchg   Read precharge delay                   1 bus clock      16 bus clocks
nwr_pchg     Write precharge delay                  1 bus clock      16 bus clocks
nrow_pchg    Row precharge time                     1 bus clock      16 bus clocks
nidmrs       Idle cycles after MRS                  2 bus clocks     17 bus clocks
nras_ras     Row cycle time                         2 bus clocks     17 bus clocks
nburst       Burst length                           4 transfers      4 transfers
nrefresh     Refresh rate                           128 bus clocks   16K bus clocks

tvalid: Output valid delay
tohold: Output hold time

S_CLK[3:0] are copies of S_CLKOUT.
These parameters are specified relative to the S_CLKIN rising edge at the 1.4 V level.
Input signals are: S_DQ[63:0].
Output signals are:
  Data = S_DQ[63:0], S_DQMB[7:0]
  Address = S_A[12:0], S_BA[1:0], S_CAS#, S_RAS#, S_WE#
  Enables = S_CKE[1:0], S_CS#[3:0]
Assumes a 50 pF load for output signals. For every 10 pF above a 50 pF load, add 170 pS for the data and enable signals, and 90 pS for the address signals. For every 10 pF below a 50 pF load, subtract 170 pS for the data and enable signals, and 90 pS for the address signals.
These parameters are programmable within the processor.
Row precharge time is the number of bus clocks between the power-on precharge and the next time RAS can be asserted.
MRS stands for Mode Register Set operation.
Row cycle time is the number of bus clocks between refresh and the next time RAS can be asserted for other SDRAM operations. This is also the number of cycles the SDR SDRAM controller waits before starting any SDRAM access after it exits clock-off mode.
The SDR SDRAM controller always performs burst operations.
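The load-dependent adjustment for SDR output timing (170 pS per 10 pF for data and enable signals, 90 pS per 10 pF for address signals, around a 50 pF nominal load) is a linear derating rule. The sketch below applies it; the function name is hypothetical, and treating loads that are not multiples of 10 pF as a linear interpolation is an assumption, since the slides only give the per-10-pF steps.

```python
# Adjustment in pS per 10 pF of load deviation from the nominal 50 pF.
PS_PER_10PF = {"data": 170, "enable": 170, "address": 90}
NOMINAL_LOAD_PF = 50

def load_adjust_ps(load_pf, signal_group):
    """Output timing adjustment in pS; positive above 50 pF, negative below."""
    step = PS_PER_10PF[signal_group]
    return (load_pf - NOMINAL_LOAD_PF) / 10 * step

print(load_adjust_ps(70, "data"))      # 20 pF over nominal
print(load_adjust_ps(30, "address"))   # 20 pF under nominal
```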

Timing Diagram for SDR SDRAM Interface Read Cycle


[Figure: SDR SDRAM read cycle timing diagram over bus clocks 0-10. Signals: S_CLKIN, S_CKE[1:0], S_RAS#, S_CS#[3:0], S_A[12:0], S_BA[1:0], S_CAS#, S_DQMB[7:0] (VALID0-VALID3), S_DQ[63:0] (IN0-IN3). Annotated intervals: nras_cas, nras_ras, ncas_read, nread_pchg.]

Timing Diagram for SDR SDRAM Interface Write Cycle


[Figure: SDR SDRAM write cycle timing diagram over bus clocks 0-10. Signals: S_CLKIN, S_CKE[1:0], S_RAS#, S_CS#[3:0], S_A[12:0], S_BA[1:0], S_CAS#, S_DQMB[7:0] (VALID0-VALID3), S_DQ[63:0] (IN0-IN3). Annotated intervals: nras_cas, nras_ras, nwr_pchg.]

Processor Pin Layout


[Figure: Crusoe processor pin-level system diagram.]

Blocks shown: Crusoe processor; SDR SDRAM (with delay loop); DDR SDRAM; high-speed bidirectional level translator/isolation on each memory interface; temperature sensor; 8 Mbit serial flash for CMS; 2 kbit serial mode-bit ROM; CPU core power supply; memory power supply; Transmeta debug connector A (30-pin); rest of motherboard.

Signals shown:
  SDR interface signals; DDR interface signals; DDR_CKE[1:0]; SUSPEND#
  DIODE_CATHODE, DIODE_ANODE; temp alert to southbridge
  SROM_SCLK, SROM_SOUT, SROM_SIN, SROM_CS[1:0]
  SRCLK, SRDATA
  TDM_TCK, TDO, TDI, TMS, TRST#, DEBUG_INIT, NM_DEBUG_INIT
  Serial debug bus interface
  VRDA[4:0], PWRGOOD (all system POWERGOOD signals)
  CPU_RST#, SYS_RST# (connect to system-level reset)
  SUSB#, SUSC#
  PCI + CPU clock/suspend/reset pins

Target Applications and Sample Code


Target Applications
Suitable for portable and embedded systems
  No need for active cooling and external CPU fans
  Provides embedded devices with a performance-per-watt ratio that is unmatched by any other x86-based processor in its class
  Runs a mobile Linux kernel
  Capable of running Internet applications
    Web browsers
    Email applications
    Streaming video
Used in notebooks and tablet PCs
  Offers advantages over standard hardware-only processors for making ultra-light and thin notebooks with lower power consumption.
  Only microprocessor that is able to provide software upgrades to the processor that offer additional performance and power savings.

Target Applications
Thin client for server-based computing
  The thin client, differentiated from desktop computers by a smaller form factor and the removal of all moving parts (disk drive, fan, etc.), is designed specifically for server-based computing.
  This centralized approach makes it easy to deploy applications to thousands of thin client users at the same time.
  Increases resource efficiency and drastically reduces the overhead associated with application installation and upgrades.
  Because all data is centrally located on a server, data is better protected from the catastrophes, viruses, and data theft that plague non-centralized operations.
Low power and high density
  Crusoe-based servers have a high profitability metric value (performance per watt per cubic foot).

Target Applications
Ultra-Personal Computer (UPC)
  New computing category enabled by Transmeta processors
  High-performance, full-featured PC that delivers the functionality of a desktop computer and the features of a laptop computer in the size of a handheld PDA.
  Designed to run full x86 desktop operating systems and applications, giving users application independence and the freedom to use their data with the software of their choice.
  Simplifies or even eliminates synchronization between multiple devices, which results in increased productivity and reduced IT maintenance overhead while providing users with seamless portability of their data and multimedia content.

Target Applications
Cluster workstations
  Provide the necessary processing power to solve complex computational problems.
  Self-contained; provide simple and easy-to-use features by reducing server cluster complexity and assembly time.
  Built around industry standards, use standard software libraries, and can be configured to user needs.

Example Code Fragment


# Handler for reading MSR registers from x86 operating system code.
# Parallel VLIW instruction words shown by || and terminated by ;
read_msr:                                #=> target 0x000a6798:
   addi %r38,%r47,-12;                   # %r38 = %sp - 12
|| addi %r47,%r47,-32;                   # %sp = %sp - 32
   nop.lsu;                              #
|| addi %r35,%r38,4;                     # %r35 = %r38 + 4
|| oril %r34,%zero,0x80000000;           # %r34 = 0x80000000
   st [%r47],%r25;                       # Save %r25
|| add %r60,%r0,%r34;                    # %r60 = %r0 (%eax) + 0x80000000
                                         # (to bring MSR offset down to zero base)
   st [%r35],%r26;                       # Save %r26
|| slli %r37,%r60,4;                     # Shift MSR number for indexing into table
|| addil %r25,%zero,0x0018aae0;          # 0x18aae0 -> cpuid data for 0x80000000-6

# NOTE: shifts (slli, etc.) appear to only be available on ALU1, at least
# according to the opcode map. This is a bizarre (but low power) design.
   st [%r38],%r27;                       # [%sp-12] = %r27 (callee saved)
|| 001100000 %r27,%r58,%r0;              #
|| cmpil.c %sink,%r0,0x80000006;         # compare %eax == 0x80000006?

   nop.lsu;                              #
|| addi %r59,%r25,-64;                   # 0x18aaa0 -> cpuid data returned for 0x0-0x3
|| addil %r36,%r37,0x0018aae0;           # %r36 = 0x18aae0 + (%r60 << 4)
   nop.lsu;                              #
|| cmp.c %sink,%r0,%r34;                 # compare %eax == 0x80000000
|| or %r26,%zero,%zero;                  # Move %r26 = 0
|| br.gt 0x000a6860;                     # branch if %eax > 0x80000006
                                         # (to handle 0x80860000 functions)
   cmp.c %sink,%r0,3;                    # Compare %eax == 3
|| br.ge 0x000a68b8;                     # branch if (%eax >= 0x80000000)
                                         # (branch to load_cpuid_data_to_regs)
   or %r60,%r0,%zero;                    # %r60 = %eax
|| slli %r37,%r0,4;                      # %r37 = %eax << 4 (index cpuid table)
   nop.lsu;                              #
|| add %r36,%r37,%r59;                   # %r36 = %r37 + (%r0<<4) + 0x18aaa0
                                         #=> target 0x000a6820:
   nop.lsu;                              #
|| addi %r35,%r47,20;                    # %r35 = %r47 + 20
|| or %r2,%zero,%zero;                   # %edx = 0
   move %r1,%zero,%zero;                 # %ecx = 0
|| move %r3,%zero,%zero;                 # %ebx = 0
   move %r0,%zero,%zero;                 # Set %eax = 0
|| addi %r34,%r47,24;                    # %r34 = %sp + 24
                                         #=> target 0x000a6840:
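The address arithmetic in the fragment's comments can be restated in a few lines: the function number in %eax is shifted left by 4 (16 bytes per entry) and added to one of two table bases, with the 0x80000000-range functions first wrapped down to a zero-based index by a 32-bit add. The sketch below models only that index calculation, using the addresses from the comments; the function and constant names are made up for illustration, and range checks handled by the branches in the original are left out.

```python
MASK32 = 0xFFFFFFFF
STD_TABLE = 0x18AAA0   # cpuid data for functions 0x0-0x3 (from the comments)
EXT_TABLE = 0x18AAE0   # cpuid data for functions 0x80000000-0x80000006
ENTRY_SHIFT = 4        # slli by 4: 16 bytes per table entry

def cpuid_entry_address(eax):
    """Table address for a function number, mirroring the slli/add sequence."""
    if eax >= 0x80000000:
        # add %r60,%r0,%r34: adding 0x80000000 wraps modulo 2^32,
        # which brings the extended function number down to a zero base.
        index = (eax + 0x80000000) & MASK32
        return EXT_TABLE + (index << ENTRY_SHIFT)
    return STD_TABLE + (eax << ENTRY_SHIFT)

for function in (0x0, 0x3, 0x80000000, 0x80000006):
    print(hex(function), hex(cpuid_entry_address(function)))
```

The two bases are 64 bytes apart, matching the `addi %r59,%r25,-64` step: four 16-byte entries for functions 0 through 3 sit immediately before the extended table.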

