Weidong Shi, Hsien-Hsin (Sean) Lee, Mrinmoy Ghosh, Chenghuai Lu, Alexandra Boldyreva
School of Electrical and Computer Engineering, Georgia Institute of Technology
Contents

- Motivation
- Related Work
- Counter/Decryption Pad Prediction
- Prediction Profile
- Prediction Failures
- Two-Level Prediction
- Context-Based Prediction
- Conclusions
Different Solutions

[Figure: secure micro-controller with a crypto engine and flash]

- Create a little secure world, but with limited application scenarios (code signing, BIOS signature verification).
- SoC with on-chip memory: applies only to limited platforms such as small embedded systems (cell phones).
Related Work
- Use a dedicated cache (sequence number cache) to reduce the latency overhead of memory decryption (MICRO 2003).
- This work: counter prediction for counter-mode memory encryption. Use wasted idle crypto-engine pipeline stages for prediction and pre-computation. Less area overhead than caching, and less memory pressure than prefetch-based pre-decryption.
[Figure: the crypto engine takes the key and counter inputs (VAddr, VAddr+2) and produces encryption pads]
Each memory line has its own counter. Each time memory line is updated, increment the counter.
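The counter-mode scheme behind this can be sketched in a few lines: the pad is a keyed function of the line address and its counter, and ciphertext is plaintext XOR pad. A minimal Python sketch, with SHA-256 standing in for AES pad generation and all names hypothetical:

```python
import hashlib

KEY = b"demo-key"  # hypothetical key; in hardware it lives inside the crypto engine

def pad(addr: int, counter: int) -> bytes:
    # Stand-in PRF for AES: pad = PRF_KEY(addr || counter). Sketch only; the
    # real design encrypts the (address, counter) pair with AES.
    msg = KEY + addr.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hashlib.sha256(msg).digest()[:16]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

counters = {}  # per-line counters, incremented on every update

def write_line(addr: int, plaintext: bytes) -> bytes:
    # Each write bumps the line's counter, so a pad is never reused.
    counters[addr] = counters.get(addr, 0) + 1
    return xor(plaintext, pad(addr, counters[addr]))

def read_line(addr: int, ciphertext: bytes) -> bytes:
    # Decryption regenerates the identical pad from (addr, counter).
    return xor(ciphertext, pad(addr, counters[addr]))
```

Because the pad depends only on (address, counter), it can be computed before the encrypted data arrives, which is exactly what the prediction scheme exploits.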
[Figure: processor core and L2 cache lines; the crypto engine combines the key with per-line counters to generate the encryption pads used to decrypt the encrypted 16B blocks fetched from memory]

The counter has to be fetched for a memory line missing L2.
Counter Prediction
Counters exhibit both spatial and temporal coherence. To exploit spatial coherence, memory blocks from the same page start counting from the same initial value (the page root counter).
Example: page base address 0x0000ff00 with page root counter (64 bits) 0xabcddcba12344321. Memory line counters: 0xabcddcba123443f1, 0xabcddcba12344e0a, ..., 0xabcddcba12344325, 0xabcddcba12344321 (static data, still at the root value).
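The prediction mechanism can be sketched as follows: while the encrypted line is in flight, idle engine stages precompute pads for a few counter guesses starting at the page root counter; when the line arrives, the real counter either matches a guess (hit) or forces a serial pad computation (miss). A hedged Python sketch, with SHA-256 standing in for AES and all names illustrative:

```python
import hashlib

def prf_pad(key: bytes, addr: int, counter: int) -> bytes:
    # Stand-in PRF for the AES pad: pad = PRF_key(addr || counter). Sketch only.
    msg = key + addr.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hashlib.sha256(msg).digest()[:16]

def speculative_pads(key: bytes, page_root: int, addr: int, depth: int = 5) -> dict:
    # While the encrypted line is in flight, idle pipeline stages precompute
    # pads for counter guesses page_root .. page_root + depth - 1.
    return {page_root + i: prf_pad(key, addr, page_root + i) for i in range(depth)}

def decrypt_on_arrival(key, addr, ciphertext, actual_counter, pads):
    # Prediction hit: the pad is already waiting, decryption is a single XOR.
    p = pads.get(actual_counter)
    hit = p is not None
    if not hit:
        # Prediction miss: compute the pad serially, paying full AES latency.
        p = prf_pad(key, addr, actual_counter)
    return bytes(c ^ x for c, x in zip(ciphertext, p)), hit
```

On a hit the decryption latency collapses to one XOR; on a miss the behavior degrades to the baseline serial pad computation, so mispredictions cost nothing beyond the wasted idle cycles.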
Memory Pipeline

[Figure: memory pipeline delivering a decrypted line]

Unrolled and pipelined AES decryption logic often stays idle for tens to hundreds of cycles when data misses L2.
Window-based dynamic tracking of the prediction rate for each page. For frequently updated memory blocks, reset the page root counter number according to the prediction history vector; all future write-backs will count from the new number.
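One way the windowed tracking and root reset could look, as an illustrative sketch (the window size, threshold, and exact reset policy are assumptions, not from the slides):

```python
class PagePredictionProfile:
    """Sketch: keep the last WINDOW prediction outcomes for a page; when the
    hit rate in that window drops below a threshold, reset the page root
    counter to the most recently observed counter so that future write-backs
    count from the new base."""
    WINDOW = 8        # hypothetical history length
    THRESHOLD = 0.5   # hypothetical reset threshold

    def __init__(self, root: int):
        self.root = root
        self.history = []  # 1 = hit, 0 = miss, most recent last

    def record(self, hit: bool, observed_counter: int) -> None:
        self.history.append(1 if hit else 0)
        if len(self.history) > self.WINDOW:
            self.history.pop(0)
        if (len(self.history) == self.WINDOW
                and sum(self.history) / self.WINDOW < self.THRESHOLD):
            self.root = observed_counter  # reset the page root counter
            self.history.clear()
```

After a reset, lines whose counters trail the new root become unpredictable, which is exactly the failure mode the later slides address.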
Experiment Setup

Parameters:
- L1 I/D Cache: DM, 8KB
- L2 Cache
- Memory Bus
- CPU Clock
- AES Latency (256-bit)
- Prediction Depth

Simplescalar 3.0; SPEC2000 INT/FP benchmarks with high L2 miss rates. Prediction hit-rate study over 8 billion instructions; IPC performance over 400 million instructions on a representative window.
Prediction Rate
[Chart: prediction hit rate for SPEC2000 benchmarks (ammp, applu, art, bzip2, gcc, gzip, mcf, mgrid, parser, swim, twolf, vortex, vpr, wupwise) and their average, comparing 128K counter cache, 512K counter cache, and prediction]

- Prediction hit rate measured over 8 billion instructions
- No counter number cache when using prediction
- Prediction depth = 5
- Average prediction hit rate about 82-83%
IPC
[Chart: normalized IPC for SPEC2000 benchmarks and their average, comparing 4K, 128K, and 512K counter caches with prediction]

- IPC normalized to the scenario without decryption
- In general, prediction outperforms the 128K counter cache
- On average, prediction is on par with the 512K counter cache
Prediction Miss
Reasons for prediction misses:
- Prediction depth is too small.
- Reset of the page root counter: memory lines whose counter values are based on the old page root counter cannot be predicted correctly using the new page root counter.

Solutions (details in the next few slides):
- Two-level prediction: divide the prediction depth into sub-ranges, increasing the effective prediction depth without adding more predictions.
- Page root counter history memorization: predict using both the current and the previous page root counter (only a marginal improvement).
- Context-based prediction: exploit the temporal coherence of accesses to memory locations with coherent update frequency.
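The context-based idea can be illustrated: lines updated at a coherent frequency tend to carry nearby counter values, so the counter most recently observed in the same context can seed the guesses for the next access. A hypothetical sketch (the context keying and guess range are assumptions):

```python
class ContextPredictor:
    """Sketch of context-based prediction: remember the counter most recently
    observed per context (a group of lines with coherent update frequency)
    and center the next guesses on it instead of on the page root counter."""
    def __init__(self, depth: int = 5):
        self.depth = depth
        self.last_counter = {}  # context id -> most recently observed counter

    def guesses(self, context, page_root: int) -> list:
        # Fall back to the page root counter when the context is cold.
        base = self.last_counter.get(context, page_root)
        return [base + i for i in range(self.depth)]

    def update(self, context, observed_counter: int) -> None:
        self.last_counter[context] = observed_counter
```

Seeding from the temporally nearest observation is what lets this scheme track counters that have drifted far from the page root.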
Two-level Prediction
[Figure: per-line 2-bit range selectors (00, 01, 10, 11) over the prediction window]

- Divide the prediction window into sub-ranges (a power of 2).
- With 2 bits per line, the effective prediction depth is quadrupled.
- Overhead is about 2KB of on-chip memory for a 64-entry TLB.
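The two-level scheme can be sketched in one function: the per-line 2-bit selector names one of four sub-ranges of the enlarged window, so the same number of speculative pad computations covers four times as many counter values. An illustrative sketch:

```python
def two_level_guesses(page_root: int, range_bits: int, depth: int = 4) -> list:
    # The 2-bit selector (0..3) picks a sub-range of the enlarged window;
    # depth guesses then cover counters within that sub-range, so 4 * depth
    # counter values are reachable with only depth speculative computations.
    base = page_root + range_bits * depth
    return [base + i for i in range(depth)]
```

The selector bits are updated on write-back, when the line's actual counter (and hence its sub-range) is known, so reads need only the stored 2 bits to pick the right guesses.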
Prediction Window
Example loop (pseudocode):

while (1) {
    for all lines of the page:
        write to the line;
    for all lines of the page:
        read the line;
}

Regular prediction (prediction depth = 4): the prediction miss rate of memory reads is 20% (for each line, every 5 reads, 1 miss).
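The 20% figure can be reproduced with a small simulation of this loop for a single line. Two details are assumptions, since the slides leave them implicit: a read's prediction tries counters root+1 through root+depth (a write precedes each read), and a miss resets the root to the observed counter:

```python
def simulate(iterations: int = 1000, depth: int = 4) -> float:
    """Simulate the write-then-read loop for one line of the page.
    Assumptions (not stated on the slides): prediction tries counters
    root+1 .. root+depth, and a prediction miss resets the root to the
    observed counter value."""
    counter, root, misses = 0, 0, 0
    for _ in range(iterations):
        counter += 1                       # the write increments the counter
        hit = root < counter <= root + depth
        if not hit:
            misses += 1
            root = counter                 # profile resets the root on a miss
    return misses / iterations
```

With depth 4 the line hits on four consecutive reads and misses on the fifth, giving exactly a 20% read miss rate.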
Prediction Rate
[Chart: prediction hit rate for SPEC2000 benchmarks and their average, comparing regular, two-level, and context + regular prediction]

- 8 billion instruction window
- Two-level prediction: about 93% prediction hit rate
- Context-based + regular prediction: almost 99% prediction hit rate
IPC
[Chart: normalized IPC for SPEC2000 benchmarks and their average, comparing regular, two-level, and context + regular prediction]

- IPC normalized to the scenario with no decryption
- 1-3% performance loss using the best prediction
Conclusions
- Counter value prediction allows pads to be precomputed speculatively, without counter value caching.
- It uses idle cycles of the pipelined decryption engine.
- Counter prediction achieves better performance than some of the large counter cache settings.
Questions