Outline
- Transmeta Innovation Timeline
- Crusoe Processor Family and Overview
- Architecture (data paths, registers, ALUs, etc.)
- Instruction Set
- Pipelining
- Code Morphing
- LongRun Power Management
- Memory Map and Support for Internal and External Memories
- Clocks and Timing
- Processor Pin Layout
- Target Applications, Specific Use, and Sample Assembly Language Code for Application Kernels
Dave Ditzel, of RISC fame and formerly of Sun's SPARC group, started Transmeta as its CEO in 1995. The first patent (5958061) was filed on July 24, 1996 and granted on September 28, 1999. The Crusoe processor was announced on January 19, 2000. Crusoe became known as an x86-compatible family of processors that combines strong performance with remarkably low power consumption.
Model              Clock             L1 cache                                  L2 write-back cache
Crusoe TM5500      667 MHz           128 KB (64 KB instruction + 64 KB data)   256 KB
Crusoe TM5800      800 MHz - 1 GHz   128 KB (64 KB instruction + 64 KB data)   512 KB
Crusoe SE TM55E    800 MHz - 1 GHz   128 KB (64 KB instruction + 64 KB data)   256 KB
Crusoe SE TM58E    800 MHz - 1 GHz   128 KB (64 KB instruction + 64 KB data)   256 KB

Common to all four models:
- Integrated Northbridge
- 64-bit, 133 MHz DDR memory controller
- 64-bit, 133 MHz SDR memory controller
- 32-bit, 33 MHz, 3.3 V PCI bus
- MMX instruction support
- 0.13 µm process
- Compact 474-pin ceramic BGA package
- Max TDP: 5.1 W

Crusoe SE models (TM55E, TM58E) only:
- Support junction temperatures (Tj) of 100 °C
- Rated for 24/7 operation for 10 years
Characteristics of Crusoe
- 4-instruction-issue, 128-bit VLIW engine
  - Fully x86-compatible
  - Up to four instructions issued per clock cycle
  - MMX multimedia extensions
  - 512 KB L2 cache
- Advanced Code Morphing Software (CMS)
  - Unique software-based architecture is key to reducing power consumption and enabling future scalability and flexibility
- Integrated Northbridge core logic
  - On-chip SDR and DDR-266 memory interfaces
  - On-chip 32-bit, 33 MHz PCI bus controller
Architecture
Crusoe Processor (block diagram components)
- SDR Memory Interface Controller
- DDR Memory Interface Controller
- Serial ROM Interface Controller
- Transmeta LongRun Power/Thermal Management
- 32-bit, 33 MHz PCI Bus Interface Controller
- L1 Instruction Cache
- L1 Data Cache
- L2 Cache
Transmeta's Crusoe
[Figure: color-coded block diagram; labeled blocks include Instruction Decode, Branch Predict, Register Rename and Instruction Reorder.]
- Purple portions: implemented in hardware
  - Much smaller than traditional microprocessors
- Orange portions: implemented in software
  - x86 to native VLIW translation, branch prediction, and out-of-order execution (OOO) logic
Processor Details
- Fabricated in 0.13 µm process technology
- High-performance 4-issue, 128-bit VLIW engine with Code Morphing Software to provide x86 compatibility
- L1 Data Cache: 64 KB
- L1 Instruction Cache: 64 KB
- L2 Write-Back Cache: 512 KB
- DDR memory support: DDR-SDRAM, 100-133 MHz
- SDR memory support: SDR-SDRAM, 66-133 MHz
- PCI bus controller (PCI 2.1 compliant) with 33 MHz, 3.3 V interface
- Standard product speeds of 733, 800, 867, 933, and 1000 MHz
- Power: 0.5-1.5 W @ 300-1000 MHz, 0.8-1.3 V running typical multimedia applications; 150 mW typical in deep sleep
- Processor package: compact 474-pin ceramic BGA
CS433 Prof. Luddy Harrison Copyright 2005 University of Illinois 13
[Figure: processor block diagram. Labels: CPU core (Integer Unit, Floating Point Unit, MMU, Multimedia Instructions, Bus Interface); L2 write-back cache, 512 KB; 4-way set associative.]
Architecture Details
- Five functional units: 2 x ALU, FP/MMX, MEM and BR
- Each instruction (called a molecule) contains 2 or 4 RISC-like operations (called atoms)
- Shadowed register sets: 64 GPRs, 32 FPRs
- Gated store buffer in the Load/Store unit
- Alias hardware
- Very few hardware interlocks in the pipeline
  - Correct execution is guaranteed by CMS scheduling and the compiler
- Micro-architecture is hidden from the x86 programmer
  - Processor timing issues can be worked around by CMS
[Figures: functional unit diagrams for Integer ALU #0, the Load/Store Unit and the Branch Unit.]
Instruction Set
Registers
The processor has 64 GPRs, with the following specialized semantics:
- %r63 (%zero): always reads 0 when used as a source operand
- %r62 (%sink): a discarded destination (e.g., for compares); it is never read
- %r59 (%from): saved return address
- %r58 (%link): return address
- %r47 (%sp): the current stack pointer
- %r0 (%eax), %r1 (%ecx), %r2 (%edx), %r3 (%ebx): hold the current x86 machine state
Registers contd.
The lower 48 of these GPRs are backed by shadowed GPRs: whenever a bundle has its commit bit set, the Commit stage latches the current values of the GPRs into the 'known good' shadow GPRs. The processor also includes 32 80-bit floating point registers and 16 FP shadow registers. There are also a wide variety of special purpose registers (SPRs), including the condition codes, profiling registers, power control settings and so on.
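The register semantics above can be sketched in a few lines of Python. This is only an illustrative model, not Transmeta's implementation: the class and method names are invented here, and the choice to silently ignore writes to %zero is an assumption (the slides only say it reads as 0).

```python
NUM_GPRS = 64
NUM_SHADOWED = 48      # the lower 48 GPRs have shadow copies
ZERO, SINK = 63, 62    # %zero and %sink


class RegisterFile:
    """Toy model of the GPR file with working and shadow copies."""

    def __init__(self):
        self.working = [0] * NUM_GPRS
        self.shadow = [0] * NUM_SHADOWED

    def read(self, r):
        # %zero always reads as 0 when used as a source operand.
        return 0 if r == ZERO else self.working[r]

    def write(self, r, value):
        # %sink discards its result; ignoring writes to %zero is an assumption.
        if r in (ZERO, SINK):
            return
        self.working[r] = value

    def commit(self):
        # Latch the lower 48 working GPRs into the "known good" shadow copies.
        self.shadow = list(self.working[:NUM_SHADOWED])

    def rollback(self):
        # Restore the lower 48 working GPRs from the shadow copies.
        self.working[:NUM_SHADOWED] = self.shadow


rf = RegisterFile()
rf.write(0, 5)        # %eax = 5
rf.commit()
rf.write(0, 99)       # speculative update to a shadowed register
rf.write(50, 7)       # %r50 is above the shadowed range
rf.write(SINK, 123)   # discarded
rf.rollback()
```

After the rollback, %r0 is back to its committed value, while %r50 keeps its new value because only the lower 48 registers are shadowed.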
Instruction Encoding
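General Format:
[Figure: bundle layout. A bundle is built from 32-bit operation slots at %PC + 0, %PC + 4, %PC + 8 and %PC + 12. Each slot begins with a C bit and a 2-bit type field (00, 01, 10 or 11) and holds an ALU0, ALU1 or LSU operation; the exact slot assignments per bundle type are not recoverable from the extracted text.]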
Format of ALUs
[Figure: ALU0/ALU1 operation formats, bits 31 down to 0.]
- ALU with register operands: C, type 10, op, rd, ra, rb
- ALU with 8-bit immediate: C, type 10, op, rd, ra, imm8
- ALU with 32-bit immediate: C, type 10, op (11xxxx011), rd, ra
ALU1 executes a superset of the operations available on ALU0. Both ALUs may have an 8-bit signed immediate instead of register rb. ALU1 may optionally use a 32-bit immediate, but only in appropriate bundle types.
Load/Store Unit
[Figure: LSU operation formats, bits 31 down to 0.]
- Loads: C, type 01, op, rd, raddr
- Stores: C, type 01, op, rs, raddr
Single load/store unit performs all loads and stores, alias operations and various other memory related tasks. All LSU operations take a fully calculated address in register ra. No ra+offset or ra+rb addressing modes are provided.
[Figure: branch operation formats for unconditional branch, unconditional branch via register, conditional branch, and conditional branch via register.]
Branches (both conditional and unconditional) within CMS use a 23 bit absolute target address aligned to a 64-bit boundary (i.e., abstarget is shifted left 3 bits).
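As a worked example of the target arithmetic just described, the sketch below converts between a 23-bit abstarget field and a byte address. The function names are invented for illustration; only the 23-bit width and the 3-bit shift come from the slide.

```python
ABSTARGET_BITS = 23   # width of the absolute target field
ALIGN_SHIFT = 3       # targets are aligned to 64 bits (8 bytes)


def branch_target(abstarget):
    """Return the byte address encoded by a 23-bit abstarget field."""
    if not 0 <= abstarget < (1 << ABSTARGET_BITS):
        raise ValueError("abstarget does not fit in 23 bits")
    return abstarget << ALIGN_SHIFT


def encode_target(address):
    """Return the abstarget field for an 8-byte-aligned byte address."""
    if address % (1 << ALIGN_SHIFT) != 0:
        raise ValueError("branch target must be 64-bit aligned")
    abstarget = address >> ALIGN_SHIFT
    if abstarget >= (1 << ABSTARGET_BITS):
        raise ValueError("address out of range for a 23-bit field")
    return abstarget


# Highest address reachable with this encoding.
MAX_TARGET = branch_target((1 << ABSTARGET_BITS) - 1)
```

With these definitions, the address 0x000a6860 encodes as 0x14D0C, and the encoding spans a 64 MB range (2^23 targets of 8 bytes each).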
Pipelining
[Figure: pipeline diagram with three rows.]
- ALU: Fetch0, Fetch1, Regs, ALU, Except, Write, Commit
- Load/Store Unit: Cache0, Cache1
- Branch Unit: (Wait), Redirect
The top row of the diagram indicates the pipeline for an ALU instruction, with the other rows representing the two other types of logical units.
Pipeline Stages
- Fetch0: The first 64 bits of a 64-bit or 128-bit bundle are fetched.
- Fetch1: The second 64 bits are fetched (for 128-bit bundles only).
- Regs: Read source registers and decode/disperse instructions.
- ALU: Execute single-cycle operations in ALU0 and ALU1.
- Except: Complete two-cycle ALU0/ALU1 operations and detect exceptions.
- Cache0: Initiate L1 data cache access based on register address.
- Cache1: Complete L1 data cache access, TLB access and alias checks.
- Write: Write results back to GPRs or store buffer.
- Commit: Optionally latch the lower 48 GPRs into the shadow registers.
[Figure: CPU functional units.]
Code Morphing
Hardware: a high-clock-rate, small-die, non-x86 VLIW processor.
[Figure: layered view of a processor. Surviving labels: Microcode; Silicon Microchip (Integer Units, Floating Point Unit, Multimedia Unit, Data Cache, Instruction Cache); Instruction Prefixes; ASCII arithmetic.]
Code Cache (run-time execution)
- Before chaining, a translation ends by jumping back to the dispatch loop: j dispatch loop
- Chained: add %r5, %r6, %r7; j <physical location of translated code for next_block>
Performing a Translation
The frontend decodes x86 instructions into a simple sequence of atoms. The optimizer applies well-known compiler optimizations (including elimination of unnecessary atoms from the instruction stream). The scheduler reorders the atoms and groups them into molecules.
Translation Example

x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5

After the frontend:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

After the optimizer:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

After the scheduler (two molecules):
  1. ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  2. ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
Translation Step 1: translation by Code Morphing Software

x86 code:
  addl %eax, (%esp)
  addl %ebx, (%esp)
  movl %esi, (%ebp)
  subl %ecx, 5

Native VLIW code:
  ld %r30, [%esp]
  add.c %eax, %eax, %r30
  ld %r31, [%esp]
  add.c %ebx, %ebx, %r31
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

Translation Step 2: optimisation (elimination of redundant atoms and of unneeded condition-code updates)

Optimised native VLIW code:
  ld %r30, [%esp]
  add %eax, %eax, %r30
  add %ebx, %ebx, %r30
  ld %esi, [%ebp]
  sub.c %ecx, %ecx, 5

Translation Step 3: scheduling the remaining atoms into molecules using a large window

  1. ld %r30, [%esp]; sub.c %ecx, %ecx, 5
  2. ld %esi, [%ebp]; add %eax, %eax, %r30; add %ebx, %ebx, %r30
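The scheduling step can be illustrated with a small greedy list scheduler. This is a sketch under simplifying assumptions, not the CMS algorithm: it models only two ALU slots and one load/store slot per molecule, treats every result as available to the next molecule, orders atoms conservatively on RAW, WAR and WAW hazards, and uses invented names (Atom, schedule) with hand-written read and write sets.

```python
from dataclasses import dataclass

# Issue slots per molecule in this sketch (FP/MMX and branch units omitted).
CAPACITY = {"ALU": 2, "LSU": 1}
MAX_ATOMS = 4


@dataclass(frozen=True)
class Atom:
    text: str
    unit: str
    reads: frozenset
    writes: frozenset


def atom(text, unit, reads, writes):
    return Atom(text, unit, frozenset(reads), frozenset(writes))


def depends(later, earlier):
    """True if `later` must issue in a molecule after `earlier`."""
    return bool((earlier.writes & later.reads)      # read after write
                or (earlier.reads & later.writes)   # write after read
                or (earlier.writes & later.writes)) # write after write


def schedule(atoms):
    """Greedily pack atoms (given in program order) into molecules."""
    remaining = list(range(len(atoms)))
    done = set()          # atoms issued in earlier molecules
    molecules = []
    while remaining:
        used = {unit: 0 for unit in CAPACITY}
        picked = []
        for i in remaining:
            a = atoms[i]
            ready = all(j in done for j in range(i) if depends(a, atoms[j]))
            if ready and used[a.unit] < CAPACITY[a.unit] and len(picked) < MAX_ATOMS:
                used[a.unit] += 1
                picked.append(i)
        done.update(picked)
        remaining = [i for i in remaining if i not in picked]
        molecules.append([atoms[i].text for i in picked])
    return molecules


optimized = [
    atom("ld %r30, [%esp]", "LSU", {"esp"}, {"r30"}),
    atom("add %eax, %eax, %r30", "ALU", {"eax", "r30"}, {"eax"}),
    atom("add %ebx, %ebx, %r30", "ALU", {"ebx", "r30"}, {"ebx"}),
    atom("ld %esi, [%ebp]", "LSU", {"ebp"}, {"esi"}),
    atom("sub.c %ecx, %ecx, 5", "ALU", {"ecx"}, {"ecx", "cc"}),
]
molecules = schedule(optimized)
```

On the optimised atoms from the example this yields two molecules containing the same atoms as the slide: the first load together with the sub.c, then the second load with both adds. A three-atom molecule would presumably be padded with a no-op to reach four atoms.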
- Successive executions of a translation invoke only the optimizer, not the translator
- Cost of translation is amortized over successive executions
- Computed gotos, trace linking and inlined function calls add significant overhead in a software-based trace optimizer
- Simple chaining is not sufficient; special hardware exists for locating traces
Code Optimization
- Optimizer examines whole translations at a time
- Several levels of optimization, from interpretation up to highly optimized
- High levels of optimization add run-time overhead
  - Only worth doing for frequently executed code
- Code morphing instruments the generated code to help determine usage patterns (count of the number of times executed)
- The optimization level to apply is chosen through heuristics based on usage patterns
  - The more a translation is executed, the more optimized it becomes
- If monitoring indicates that optimizations were too aggressive, the trace is partially de-optimized
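A minimal sketch of such a usage-driven heuristic follows. The level names, the execution-count thresholds and the TranslationProfile class are all hypothetical; the slides do not give the actual CMS levels or thresholds.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical optimization levels and the execution counts that unlock them.
LEVELS = ["interpret", "quick translation",
          "optimized translation", "highly optimized translation"]
THRESHOLDS = [0, 16, 256, 4096]


class TranslationProfile:
    """Tracks execution counts per code block and picks a level for each."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.ceiling = {}   # per-block cap set after a de-optimization

    def _level_index(self, block):
        wanted = bisect_right(THRESHOLDS, self.counts[block]) - 1
        return min(wanted, self.ceiling.get(block, len(LEVELS) - 1))

    def record_execution(self, block):
        """Count one execution and return the level the block should run at."""
        self.counts[block] += 1
        return LEVELS[self._level_index(block)]

    def deoptimize(self, block):
        """Back off one level after an over-aggressive optimization."""
        self.ceiling[block] = max(0, self._level_index(block) - 1)


profile = TranslationProfile()
history = [profile.record_execution(0x1000) for _ in range(300)]
profile.deoptimize(0x1000)
after_backoff = profile.record_execution(0x1000)
```

Here a block starts out interpreted, is promoted as its execution count crosses each threshold, and is capped one level lower once monitoring reports that an optimization was too aggressive.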
[Figure: Translation Cache memory map, drawn from 0 MB to 2 MB; the cache expands to 2 MB.]
Shadow Registers
- Two copies of each register: a working copy and a shadow copy
- If execution reaches the end of a translation block, a commit operation is performed
  - Copies all working registers into the shadow registers
- If any exceptional condition occurs inside the translation block, a rollback operation is performed
  - Copies the shadow register values back into the working registers
Exceptions
Original x86 code:
  addl %eax, (%esp)    # load data from stack, add to eax
  addl %ebx, (%esp)    # load data from stack, add to ebx
  movl %esi, (%ebp)    # load esi from memory
  subl %ecx, 5         # sub 5 from ecx
- x86 instructions are executed out of order with respect to the original program flow.
- State must be restored for precise traps.
Exception Handling
- x86 exceptions are precise
  - Problematic for out-of-order execution of instructions
- On an exception, processor state is rolled back to the most recent commit
- Execution proceeds in in-order mode until the fault location is found
- Memory updates are rolled back through the gated store buffer (which holds x86 stores until a commit)
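The role of the gated store buffer can be sketched as follows. This is a toy model with invented names; forwarding pending stores to later loads is an assumption made so that the example behaves sensibly, not something stated on the slide.

```python
class GatedStoreBuffer:
    """Holds stores until commit so that a rollback can discard them."""

    def __init__(self, memory):
        self.memory = memory    # dict: address -> value (committed state)
        self.pending = []       # stores issued since the last commit

    def store(self, address, value):
        # Stores are buffered, not written to memory.
        self.pending.append((address, value))

    def load(self, address):
        # Assumption: the newest pending store to this address is forwarded.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value
        return self.memory.get(address, 0)

    def commit(self):
        # Release all buffered stores to memory, in program order.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

    def rollback(self):
        # Drop the buffered stores; committed memory is untouched.
        self.pending.clear()


memory = {0x100: 1}
buf = GatedStoreBuffer(memory)
buf.store(0x100, 2)
seen_before_rollback = buf.load(0x100)
buf.rollback()                 # e.g. an exception inside the translation
seen_after_rollback = buf.load(0x100)
buf.store(0x104, 7)
buf.commit()                   # end of translation reached
```

Because nothing reaches memory before the commit, rolling back a translation only requires discarding the pending entries.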
Memory-mapped I/O
Memory-mapped I/O cannot be distinguished at translation time from regular memory accesses. Load and store atoms specify whether they have been reordered. When such a speculative memory atom accesses a memory page that is mapped to I/O space, raise an exception. CMS performs a rollback.
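The check described above might look like the following sketch. The page size, the set of I/O pages and all names are hypothetical, and re-running the accesses in program order after the rollback is an assumption about how execution continues.

```python
PAGE_SHIFT = 12        # assume 4 KB pages
IO_PAGES = {0xA0}      # hypothetical page numbers mapped to I/O space


class SpeculativeIOFault(Exception):
    """A reordered memory atom touched a page mapped to I/O."""


def access(address, reordered):
    # Only atoms flagged as reordered are checked against I/O space.
    if reordered and (address >> PAGE_SHIFT) in IO_PAGES:
        raise SpeculativeIOFault(hex(address))


def run_translation(accesses):
    """accesses: list of (address, reordered) pairs for one translation."""
    try:
        for address, reordered in accesses:
            access(address, reordered)
        return "committed"
    except SpeculativeIOFault:
        # Roll back, then redo the accesses in original program order.
        for address, _ in accesses:
            access(address, reordered=False)
        return "rolled back and re-executed in order"


normal = run_translation([(0x1000, True), (0x2000, False)])
io_hit = run_translation([(0x1000, True), (0xA0004, True)])
```

An ordinary translation commits directly, while one whose reordered atom lands on an I/O page is rolled back and retried without reordering.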
Power Management
LongRun Overview
- Adaptive power management
  - Dynamically reduces core processor power consumption to near-optimal levels in response to application workload requirements
- Thermal management
  - Intelligently adapts processor operation to system thermal environments
- Cooperation between LongRun and the Code Morphing Software (CMS)
[Figure: processor power-state diagram. Surviving labels: ACPI states G1/S1, G1/S3, G1/S4, G2/S5 and G3; Sleeping; Self Refresh; Auto Halt; Quick Start and Stop Grant; Deep Sleep; DSX; Snoop Service. Labeled transitions: halt break, SLEEP# asserted, STPCLK# asserted and CLKIN stopped, snoop event.]
Power
[Table: power measurements at 1 GHz, 1.30 V; the measured values did not survive extraction.]
1. All power supplies at their nominal operating values. Full system power management enabled, including LongRun power management.
2. Typical DVD power is measured while running the WinDVD 2000 player under Windows 2000.
3. Typical MP3 power is measured while running MMJukebox under Windows 2000.
Power Comparison

Processor                  Clock          L1                  L2       Vcc           Power         Process
Efficeon TM8600            1.0-1.2 GHz    I 128 KB, D 64 KB   1 MB     Unspecified   Unspecified   0.13 um
Crusoe TM5400              667-800 MHz    128 KB              256 KB   0.9-1.3 V     0.4-1.0 W     0.13 um
Crusoe TM5800              667-800 MHz    128 KB              512 KB   0.9-1.3 V     0.4-1.0 W     0.13 um
Intel Mobile Pentium III   600-750 MHz    I 16 KB, D 16 KB    256 KB   1.1-1.35 V    12.2 W        0.18 um
Intel Pentium M            900 MHz        I 32 KB, D 32 KB    1 MB     0.84-1.0 V    7 W           0.13 um
Note: It makes sense to compare processors in the same time frame only. Hence, Efficeon should be compared with Pentium M, while Crusoe should be compared with Mobile Pentium III.
Memory
It is possible for single-rank memory modules to contain memory chips on both sides of the module.
[Figure: chip-select connections for dual-rank modules (signals C_CS1#, C_CS2#, C_CS3#); the configurations shown are marked "Supported".]
All SDR memory installed in the system should be the same speed (recommended).
Maximum unbuffered SDR SDRAM interface operating frequency is 133 MHz.
SDR SDRAM configurations requiring more than sixteen loads must use buffered SDR memory.
Maximum industry-standard buffered SDR SDRAM operating frequency is 66 MHz. Maximum processor SDR SDRAM interface operating frequency is 133 MHz. Hence, with buffered SDR memory, the processor SDR SDRAM interface operating frequency must be set below its 133 MHz maximum.
SDR memory can be user expandable. SDR SDRAM is the preferred user-installed expansion memory option for TM5800 processor-based systems.
Clocks
TM5800 processor input clock (CLKIN) is multiplied by the processor clock multiplier to generate the processor core clock. For currently defined TM5800 SKUs, CLKIN is assumed to be 66.6 MHz. Processor core clock is divided down by the DDR and SDR clock dividers to generate the DDR SDRAM and SDR SDRAM interface clocks. There is also a clock divider that must be initialized for the PCI interface. The PCI interface operates at 33.3 MHz. Clock multiplier and divider values are programmed into the TM5800 processor during initialization from data stored in the configuration ROM.
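The clock relationships above reduce to simple arithmetic, sketched below. The function names are invented, the multiplier of 12 is only an example value, and the use of whole-number dividers is an assumption; the real values come from the configuration ROM.

```python
import math

CLKIN_MHZ = 66.6   # CLKIN assumed for currently defined TM5800 SKUs


def core_clock_mhz(multiplier):
    """Core clock = CLKIN x processor clock multiplier."""
    return CLKIN_MHZ * multiplier


def interface_clock_mhz(core_mhz, divider):
    """Memory interface clock = core clock / clock divider."""
    return core_mhz / divider


def smallest_divider(core_mhz, max_interface_mhz):
    """Smallest whole divider that keeps the interface at or below its limit."""
    return math.ceil(core_mhz / max_interface_mhz)


core = core_clock_mhz(12)                         # roughly 800 MHz
ddr_divider = smallest_divider(core, 133.33)      # DDR interface limit
ddr_clock = interface_clock_mhz(core, ddr_divider)
buffered_sdr_divider = smallest_divider(core, 66.0)
```

With a multiplier of 12 the core runs at about 799.2 MHz, a divider of 6 gives a memory clock of about 133.2 MHz, and a 66 MHz limit (as for buffered SDR memory) would need a divider of 13.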
[Table: clock timing limits; the parameter names did not survive extraction.]
Minimum: 60.0 MHz, 30.0 MHz, 15.0 ns, 30 ns, 7.5 ns, 5.2 ns, 11 ns, 3.375 ns, 5.0 ns, 11 ns, 3.375 ns, 0.4 ns, 1.0 V/ns, 1.5 ns
Maximum: 66.67 MHz, 33.33 MHz, 133.33 MHz, 16.67 ns, 250 ps, 500 ps, 1.6 ns, 4.0 V/ns, 4.0 ns, 20 S
[Figure: clock waveform showing tlow, thigh, trise and tfall, with 0.8 V reference levels.]
* Vm = 1.2 V for CLKIN; 0.4 * IOVDD for P_PCLK; 1.4 V for S_CLKIN
[Table: programmable DDR SDRAM timing parameters; the parameter names did not survive extraction.]
Minimum: 1, 1, 1, 1, 1, 2 and 2 bus clocks; 4 transfers; 128 bus clocks
Maximum: 16, 16, 16, 16, 16, 17 and 17 bus clocks; 4 transfers; 16k bus clocks
[Figure: DDR SDRAM timing diagrams. Surviving labels: nras_ras, nras_cas, nwr_pchg, tvalid, tohold_dqs, C_DQMB[7:0], output data Out0 to Out3.]
- S_CLK[3:0] are copies of S_CLKOUT.
- These parameters are specified relative to the S_CLKIN rising edge at the 1.4 V level.
- Input signals are: S_DQ[63:0].
- Output signals are:
  - Data = S_DQ[63:0], S_DQMB[7:0]
  - Address = S_A[12:0], S_BA[1:0], S_CAS#, S_RAS#, S_WE#
  - Enables = S_CKE[1:0], S_CS#[3:0]
- Assumes a 50 pF load for output signals. For every 10 pF above a 50 pF load, add 170 ps for the data and enable signals, and 90 ps for the address signals. For every 10 pF below a 50 pF load, subtract 170 ps for the data and enable signals, and 90 ps for the address signals.
- These parameters are programmable within the processor.
- Row precharge time is the number of bus clocks between the power-on precharge and the next time RAS can be asserted.
- MRS stands for Mode Register Set operation.
- Row cycle time is the number of bus clocks between refresh and the next time RAS can be asserted for other SDRAM operations. This is also the number of cycles the SDR SDRAM controller waits before starting any SDRAM access after it exits clock-off mode.
- The SDR SDRAM controller always performs burst operations.
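The load-dependent correction in these notes can be written as a one-line formula. The sketch below uses invented names and assumes the adjustment scales linearly for loads that are not a whole multiple of 10 pF from the 50 pF reference.

```python
REFERENCE_LOAD_PF = 50.0

# Adjustment in picoseconds per 10 pF of deviation from the reference load.
PS_PER_10PF = {"data": 170.0, "enable": 170.0, "address": 90.0}


def output_timing_adjust_ps(load_pf, signal_group):
    """Delay to add to (positive) or subtract from (negative) the
    specified output timing, for the given capacitive load."""
    per_10pf = PS_PER_10PF[signal_group]
    return (load_pf - REFERENCE_LOAD_PF) / 10.0 * per_10pf


heavier_data = output_timing_adjust_ps(70, "data")        # 20 pF above
lighter_address = output_timing_adjust_ps(30, "address")  # 20 pF below
```

A 70 pF load on a data signal adds 340 ps to the specified timing, while a 30 pF load on an address signal subtracts 180 ps.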
[Figure: SDR SDRAM read and write timing diagrams. Surviving labels: ncas_read, nread_pchg, nras_cas, nras_ras, nwr_pchg, S_DQMB[7:0], S_DQ[63:0] with data beats IN0 to IN3.]
[Figure: system connections of the Crusoe processor. Connected blocks: SDR SDRAM, DDR SDRAM (DDR_CKE[1:0]), temperature sensor (DIODE_CATHODE, DIODE_ANODE). Signals to the rest of the motherboard: SUSPEND#, VRDA[4:0], PWRGOOD, CPU_RST#, SUSB#, SUSC#, SYS_RST#.]
Target Applications
- Suitable for portable and embedded systems
  - No need for active cooling or external CPU fans
  - Provides embedded devices with a performance-per-watt ratio that is unmatched by any other x86-based processor in its class
- Runs a mobile Linux kernel
- Capable of running Internet applications
  - Web browsers
  - Email applications
  - Streaming video
- Used in notebooks and Tablet PCs
  - Offers advantages over standard hardware-only processors for making ultra-light, thin notebooks with lower power consumption
- Only microprocessor able to receive software upgrades that offer additional performance and power savings
Thin client for server-based computing
- The thin client, differentiated from desktop computers by a smaller form factor and the removal of all moving parts (disk drive, fan, etc.), is designed specifically for server-based computing.
- This centralized approach makes it easy to deploy applications to thousands of thin-client users at the same time.
- Increases resource efficiency and drastically reduces the overhead associated with application installation and upgrades.
- Because all data is centrally located on a server, data is better protected from the catastrophes, viruses, and data theft that plague non-centralized operations.
- Low-power, high-density Crusoe-based servers score highly on a profitability metric of performance per watt per cubic foot.

Ultra-Personal Computer (UPC)
- New computing category enabled by Transmeta processors
- High-performance, full-featured PC that delivers the functionality of a desktop computer and the features of a laptop computer in the size of a handheld PDA
- Designed to run full x86 desktop operating systems and applications, giving users application independence and the freedom to use their data with the software of their choice
- Simplifies or even eliminates synchronization between multiple devices, which increases productivity and reduces IT maintenance overhead while giving users seamless portability of their data and multimedia content

Cluster workstations
- Provide the processing power needed to solve complex computational problems
- Self-contained; provide simple, easy-to-use features by reducing server cluster complexity and assembly time
- Built around industry standards, use standard software libraries, and can be configured to user needs
# NOTE: shifts (slli, etc.) appear to only be available on ALU1, at least
# according to the opcode map. This is a bizarre (but low power) design.

st [%r38],%r27;                      # [%sp-12] = %r27 (callee saved)
001100000 %r27,%r58,%r0;             #
cmpil.c %sink,%r0,0x80000006;        # compare %eax == 0x80000006?

# 0x18aaa0 -> cpuid data returned for 0x0-0x3
nop.lsu;
|| addi %r59,%r25,-64;
|| addil %r36,%r37,0x0018aae0;       # %r36 = 0x18aae0 + (%r60 << 4)

nop.lsu;
|| cmp.c %sink,%r0,%r34;             # compare %eax == 0x80000000
|| or %r26,%zero,%zero;              # Move %r26 = 0
|| br.gt 0x000a6860;                 # branch if %eax > 0x80000006
                                     # (to handle 0x80860000 functions)

cmp.c %sink,%r0,3;                   # Compare %eax == 3
br.ge 0x000a68b8;                    # branch if (%eax >= 0x80000000)
                                     # (branch to load_cpuid_data_to_regs)

or %r60,%r0,%zero;                   # %r60 = %eax
|| slli %r37,%r0,4;                  # %r37 = %eax << 4 (index cpuid table)

nop.lsu;
|| add %r36,%r37,%r59;               # %r36 = %r37 + (%r0<<4) + 0x18aaa0

#=> target 0x000a6820:
nop.lsu;
addi %r35,%r47,20;                   # %r35 = %r47 + 20
or %r2,%zero,%zero;                  # %edx = 0
move %r1,%zero,%zero;                # %ecx = 0
move %r3,%zero,%zero;                # %ebx = 0
move %r0,%zero,%zero;                # Set %eax = 0
addi %r34,%r47,24;                   # %r34 = %sp + 24
#=> target 0x000a6840: