
A NOR Emulation Strategy over NAND Flash Memory

Jian-Hong Lin, Yuan-Hao Chang, and Tei-Wei Kuo
Graduate Institute of Networking and Multimedia /
Department of Computer Science and Information Engineering
National Taiwan University, Taipei, Taiwan 106, R.O.C.
{r94944003, d93944006, ktw}@csie.ntu.edu.tw

Jen-Wei Hsieh
Department of Computer Science and Information Engineering
National Chiayi University, Chiayi, Taiwan 60004, R.O.C.
jenwei@mail.ncyu.edu.tw

Cheng-Chih Yang
Product Development Firmware Engineering Group
Genesys Logic, Inc., Taipei, Taiwan 231, R.O.C.
Mikey.Yang@genesyslogic.com.tw
Abstract
This work is motivated by a strong market demand for the replacement of NOR flash memory with NAND flash memory to cut down cost in many embedded-system designs, such as mobile phones. Different from LRU-related caching or buffering studies, we are interested in prediction-based prefetching based on given execution traces of applications. An implementation strategy is proposed for the storage of the prefetching information with limited SRAM and run-time overheads. An efficient prediction procedure is presented based on information extracted from application executions, so as to reduce the read performance gap between NAND flash memory and NOR flash memory. With the behavior of a target application extracted from a set of collected traces, we show that data access over NOR flash memory can be served effectively from SRAM that acts as a cache for NAND flash memory.
Keywords: NAND, NOR, flash memory, data caching
1. INTRODUCTION
While flash memory remains one of the most popular storage media in embedded systems because of its non-volatility, shock resistance, small size, and low energy consumption, its application has grown far beyond its original design. NOR flash memory (referred to as NOR for short) was designed to store binary program code, because NOR supports XIP (eXecute-In-Place) and high read performance, while NAND flash memory (referred to as NAND for short) is used for data storage, because NAND has a lower price and higher write/erase performance, compared to NOR [22, 23, 27, 30]. In recent years, the price of NAND has dropped much faster than that of NOR; e.g., the price of 8Gbit NAND was lower than one-fifth of that of 8Gbit NOR in the first quarter of 2007 [14]. In order to reduce the hardware cost, using NAND to replace NOR (motivated by a strong market demand) becomes a new trend in embedded-system designs, especially for mobile phones and arcade games, even though read performance is an essential and critical issue in program execution. These observations motivate the objective of this work: the exploration of how to fill up the read performance gap between NAND and NOR with limited overhead.

(Supported by the National Science Council of Taiwan, R.O.C., under Grants NSC-95R0062-AE00-07 and NSC-95-2221-E-002-094-MY3.)
The management of flash memory is carried out by either software on a host system (as a raw medium) or hardware circuits/firmware inside its devices. In the past decade, there has been a good deal of research and implementation work on the management of flash-memory storage systems, e.g., [2, 3, 5, 6, 8, 9, 16, 31, 33, 34]. Some exploited efficient management schemes for large-scale storage systems and different architecture designs, e.g., [8, 9, 16, 31, 33, 34]. To improve the performance of hard disks, Intel proposed Robson, which uses NAND as a non-volatile cache for hard disks [1, 4, 7, 32], and Microsoft proposed fast booting mechanisms (called ReadyBoost and ReadyDrive) in Windows Vista to enhance system performance by caching data in flash memory. To replace hard disks with flash memory and to improve the performance of flash memory, some proposed battery-backed SRAM to cache recently used data from flash memory with a copy-on-write scheme [12, 13, 18, 32]. For the new research trend of using NAND to store both code and data (in order to replace NOR), C. Park et al. developed a cost-efficient memory architecture that integrates NAND flash memory into the existing memory hierarchy for XIP, and also proposed an application-specific demand-paging mechanism for NAND flash memory with compiler assistance [20, 21]. However, the proposed mechanisms need the source code of applications and specific compilers to lay out the compiled code at fixed memory locations. The imposed constraints make the mechanisms hard to implement.
Although researchers have proposed many excellent methods for improving the performance of flash memory, very few of them address the performance of program execution when NAND is used to store binary code. Only some research proposed caching mechanisms, e.g., [15, 17, 19, 25, 26], to improve the performance of program execution (where code is stored in NAND), without considering access patterns. Without prefetching data according to the access patterns of the installed programs, the cache miss rate cannot be reduced effectively, and programs therefore cannot be executed efficiently; note that most embedded systems only perform some specific functions, especially arcade-game stations. In this paper, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation. The prediction mechanism collects the access patterns of program execution to construct a prediction graph by adopting the working-set concept [10, 11]. According to the prediction graph, the prediction mechanism prefetches data (/code) into the SRAM cache, so as to reduce the cache miss rate. As a result, the performance of program execution is improved, and the read performance gap between NAND and NOR is filled up effectively. A series of experiments was then conducted based on realistic traces collected from three different types of popular games: Age of Empires II (AOE II), The Typing of the Death (TTD), and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR by 24%, 216%, and 298% for AOE II, TTD, and Raiden, respectively. Furthermore, the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively, and the percentage of redundant prefetched data was lower than 10% in most cases.
The rest of this paper is organized as follows: Section 2 describes the characteristics of flash memory and the research motivation. In Section 3, an efficient prediction mechanism is proposed. Section 4 summarizes the experimental results on read performance, cache miss rate, and extra overheads. Section 5 is the conclusion.
2. Flash-Memory Characteristics and
Research Motivation
There are two types of flash memory: NAND and NOR. Each NAND flash memory chip consists of many blocks, and each block contains a fixed number of pages. A block is the smallest unit for erase operations, while reads and writes are done in pages. A page contains a user area and a spare area, where the user area is for the data storage of a logical block, and the spare area stores ECC and other housekeeping information (e.g., the LBA). Because flash memory is write-once, we do not overwrite data on each update. Instead, data are written to free space, and the old versions of data are invalidated (or considered as dead). This update strategy is called out-place update. In other words, any existing data on flash memory cannot be overwritten (updated) unless its corresponding block is erased. The pages that store live data and dead data are called valid pages and invalid pages, respectively.
Depending on the design, blocks have different bounds on the number of erases. For example, the typical erase bounds of SLC and MLC-2 NAND flash memory are 10,000 and 1,000, respectively. (There are two major NAND flash memory designs: SLC (Single Level Cell) flash memory and MLC (Multiple Level Cell) flash memory. Each cell of SLC flash memory contains one bit of information, while each cell of MLC-n flash memory contains n bits of information.) Each page of small-block (/large-block) SLC NAND can store 512B (/2KB) of data, and there are 32 (/64) pages per block. The spare area of a small-block (/large-block) SLC NAND page is 16B (/64B). On the other hand, each page of MLC-2 NAND can store 2KB, and there are 128 pages per block. Different from NAND flash memory, a byte is the unit for reads and writes over NOR flash memory.
                                   SLC NOR [28]   SLC NAND [24]
                                                  (large-block, 2KB page)
Price (US$/GB) [14]                34.65          6.79
Read (random access of 8 bits)     40 ns          25 µs
Write (random access of 8 bits)    14 µs          300 µs
Read (sequential access)           23.842 MB/s    15.33 MB/s
Write (sequential access)          0.068 MB/s     4.57 MB/s
Erase                              0.217 MB/s     6.25 MB/s

Table 1. The Typical Characteristics of NOR and NAND.
[Figure 1. An Architecture for the Performance Improvement of NAND Flash Memory: the host interface connects through address and data buses to a Control Logic containing a Converter and a Prefetch Procedure; an SRAM cache (byte access toward the host, 512-byte access toward the flash) sits in front of the NAND flash memory.]
NAND has been widely adopted in the implementation of storage systems because of its advantages in cost and write throughput (for block-oriented access), compared to NOR. 1GB of NOR typically costs US$34.65 on the market, compared to US$6.79 per GB for NAND, and the price gap between NAND and NOR will grow even wider in the near future. However, because of the high read performance of NOR, as shown in Table 1, and its eXecute-In-Place (XIP) characteristic, NOR is adopted in various embedded-system designs, such as mobile phones and Personal Multimedia Players (PMP). The characteristics of NAND and NOR are summarized in Table 1.
This research is motivated by a strong market demand for the replacement of NOR with NAND in many embedded-system designs. In order to fill up the performance gap between NAND and NOR, SRAM is a natural choice for data caching, as in the simple but effective hardware architecture adopted by OneNAND [15, 25, 26]. (Please see Figure 1.) However, the most critical technical problem behind a successful replacement of NOR with NAND lies in the prediction scheme and its implementation design. Such an observation underlies the objective of this research: the design and implementation of an effective prediction mechanism for applications, with flash-memory characteristics taken into consideration. Because of the stringent resource constraints of embedded systems, the proposed mechanism must also face challenges in restricted SRAM usage and limited computing power.
3. An Efficient Prediction Mechanism
3.1 Overview
In order to fill up the performance gap between NAND and NOR, SRAM can serve as a cache layer for data access over NAND. As shown in Figure 1, the Host Interface is responsible for communication with the host system via the address and data buses. The Control Logic manages the caching activity and provides the service emulation of NOR with NAND and SRAM; it should have an intelligent prediction mechanism implemented to improve system performance. Different from popular caching ideas adopted in the memory hierarchy, this research aims at an application-oriented caching mechanism. Instead of adopting an LRU-like policy, we are interested in prediction-based prefetching based on given execution traces of applications. We consider the designs of embedded systems with a limited set of applications, such as a set of selected system programs in mobile phones or arcade games of amusement-park machines. The design and implementation should also consider the resource constraints of a controller in SRAM capacity and computing power.
There are two major components in the Control Logic: the Converter and the Prefetch Procedure. The Converter emulates NOR access over NAND with an SRAM cache, where address translation must be done from byte addressing (for NOR) to Logical Block Address (LBA) addressing (for NAND). Note that each 512B/2KB NAND page corresponds to one/four LBAs, respectively [29]. The Prefetch Procedure tries to prefetch data from NAND to SRAM so that the hit rate of NOR access over SRAM is high. The procedure should parse and extract the behavior of the target application via a set of collected traces. According to the access patterns extracted from the collected traces, the procedure generates prediction information, referred to as a prediction graph. In Section 3.2, we shall define a prediction graph and present its implementation strategy over NAND. An algorithm design for the Prefetch Procedure will then be presented in Section 3.3.
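As an illustration of the Converter's byte-to-LBA translation, the following C sketch maps a NOR byte address to an LBA, the NAND page holding it, and a byte offset. This is our illustration under the 512B-per-LBA convention cited above; the names and constants are assumptions, not the authors' exact implementation.

#include <stdint.h>

#define LBA_SIZE      512u  /* bytes per LBA (the NOR-emulation granularity) */
#define LBAS_PER_PAGE 4u    /* one 2KB large-block NAND page holds 4 LBAs    */

typedef struct {
    uint32_t lba;     /* logical block address presented to the cache */
    uint32_t page;    /* NAND page that stores this LBA                */
    uint32_t offset;  /* byte offset of the request within the LBA     */
} nor_addr_t;

/* Translate a NOR byte address into (LBA, NAND page, byte offset). */
static nor_addr_t convert_nor_address(uint32_t byte_addr)
{
    nor_addr_t a;
    a.lba    = byte_addr / LBA_SIZE;
    a.offset = byte_addr % LBA_SIZE;
    a.page   = a.lba / LBAS_PER_PAGE;  /* four consecutive LBAs per page */
    return a;
}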
3.2 A Prediction Graph and Implementation
The access pattern of an application execution over NOR (or NAND) consists of a sequence of LBAs, where some LBAs are for instructions and the others are for data. As an application runs multiple times, a virtually complete picture of its possible access patterns emerges, as shown in Figure 2. Since most application executions are input-dependent or data-driven, there can be more than one subsequent LBA following a given LBA. Each LBA corresponds to one node in the graph. A node with more than one subsequent LBA is called a branch node (such as the shaded nodes in Figure 2), and the other nodes are called regular nodes. The graph that corresponds to the access patterns is referred to as the prediction graph of the patterns. If pages in NAND could be prefetched in an on-time fashion, and there were enough SRAM space for caching, then all data accesses could be served from SRAM.

[Figure 2. An example of a prediction graph: each node is an LBA; shaded nodes (branch nodes) have more than one subsequent LBA.]
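To make the graph construction concrete, the following toy C sketch merges collected traces into an in-memory successor structure; a node that ends up with more than one successor is a branch node. This is our illustration under assumed bounds (a small LBA space and a fixed per-node fan-out), not the authors' implementation, which stores successors in spare areas and a branch table (Section 3.2).

#include <stdint.h>
#include <stddef.h>

#define MAX_LBA    1024u  /* illustrative bound on the LBA address space  */
#define MAX_FANOUT 8u     /* illustrative bound on a node's successors    */

typedef struct {
    uint32_t succ[MAX_FANOUT];
    uint32_t n_succ;      /* n_succ > 1 means this node is a branch node  */
} node_t;

static node_t graph[MAX_LBA];

/* Merge one collected trace (a sequence of LBAs) into the graph. */
static void add_trace(const uint32_t *trace, size_t len)
{
    for (size_t i = 0; i + 1 < len; i++) {
        node_t *n = &graph[trace[i]];
        uint32_t next = trace[i + 1];
        uint32_t k;
        for (k = 0; k < n->n_succ; k++)        /* successor already known? */
            if (n->succ[k] == next) break;
        if (k == n->n_succ && k < MAX_FANOUT)  /* record a new successor   */
            n->succ[n->n_succ++] = next;
    }
}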

The technical problems are how to save the prediction graph on flash memory with minimized overheads, and how to prefetch pages based on the graph in a simple but effective way. We propose to save the subsequent-LBA information of each regular node in the spare area of the corresponding page. This is because the spare area of a page in current implementations and specifications has unused space, and the reading of a page usually retrieves its data and spare areas simultaneously. In this way, accessing the subsequent-LBA information of a regular node comes at no extra cost. Since a branch node has more than one subsequent LBA, the spare area of the corresponding page might not have enough free space to store the information. We propose to maintain a branch table to save the subsequent-LBA information of all branch nodes. The starting entry address of the branch table that corresponds to a branch node can be saved in the spare area of the corresponding page, as shown in Figure 3(a). The starting entry records the number of subsequent LBAs of the branch node, and the subsequent LBAs are stored in the entries following the starting entry (please see Figure 3(b)). The branch table can be saved on flash memory. During run time, the entire table can be loaded into SRAM for better performance. If there is not enough SRAM space, parts of the table can be loaded in an on-demand fashion.

[Figure 3. The Storage of a Prediction Graph: (a) prediction information, where a regular node's page stores its subsequent LBA in the spare area and a branch node's page stores a branch-table entry address; (b) a branch table entry, e.g., a successor count of 3 followed by addr(b1), addr(b2), and addr(b3).]
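The following C fragment sketches how the spare-area information and the branch table might be consulted. The struct layout and helper names are our assumptions; nand_read_page() stands in for whatever page-read primitive the controller firmware provides.

#include <stdint.h>

#define LBA_INVALID 0xFFFFFFFFu

/* Successor information carried in a page's spare area (assumed layout). */
typedef struct {
    uint32_t next_lba;      /* subsequent LBA of a regular node            */
    uint32_t branch_entry;  /* branch-table entry address for a branch     */
                            /* node; LBA_INVALID for regular nodes         */
} spare_info_t;

/* Assumed firmware primitive: one page read fills both the data buffer and
 * the spare-area info, so the successor comes at no extra NAND access. */
extern void nand_read_page(uint32_t lba, uint8_t *data, spare_info_t *spare);

/* Branch table loaded into SRAM: table[e] is the successor count of the
 * branch node whose entry address is e, and table[e+1 .. e+count] are the
 * successor LBAs, as in Figure 3(b). */
extern uint32_t branch_table[];

static uint32_t branch_count(uint32_t entry)            { return branch_table[entry]; }
static uint32_t branch_succ(uint32_t entry, uint32_t i) { return branch_table[entry + 1 + i]; }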
3.3 A Prefetch Procedure
[Figure 4. A Snapshot of the Cache: a cyclic buffer in which current points at the buffer of the page being accessed (Node 2) and next points at the buffer of the page most recently prefetched (Node 6).]
The objective of the prefetch procedure is to prefetch data from NAND based on a given prediction graph such that most data accesses occur over SRAM. The basic idea is to prefetch data by following the LBA order in the graph. In order to look up a selected page in the cache efficiently, we propose to adopt a cyclic buffer for the cache management, and let two indices, current and next, denote the pages currently accessed and most recently prefetched, respectively. When current = next, the caching buffer is empty. When current = (next + 1) mod SIZE, the buffer is full, where SIZE is the number of buffers for page caching. Consider the prediction graph shown in Figure 2: the page that corresponds to Node 2 is currently accessed, and the page that corresponds to Node 6 has just been prefetched (please see Figure 4).
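In C, the empty/full tests reduce to the following index arithmetic; SIZE and the index names follow the text, while the value of SIZE is only illustrative.

#include <stdbool.h>
#include <stdint.h>

#define SIZE 8u                /* number of page buffers in the SRAM cache */

static uint32_t current_idx;   /* buffer of the page currently accessed    */
static uint32_t next_idx;      /* buffer of the page most recently fetched */

static bool cache_empty(void) { return current_idx == next_idx; }
static bool cache_full(void)  { return current_idx == (next_idx + 1u) % SIZE; }

/* Advance next to the following free buffer before a prefetch. */
static void advance_next(void) { next_idx = (next_idx + 1u) % SIZE; }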
The prefetch procedure works in a greedy way. Let P1 be the last prefetched page. If P1 corresponds to a regular node, then the page that corresponds to the subsequent LBA is prefetched. If P1 corresponds to a branch node, then the procedure should prefetch pages by following all possible next-LBA links on an equal basis and in an alternating way; that is, the prefetch procedure follows each LBA link in turn. For example, the pages corresponding to Nodes 4 and 5 are prefetched after the page that corresponds to Node 3 is prefetched, as shown in Figure 4. The next pages to be prefetched are the pages corresponding to Nodes 1 and 6. In order to manage the prefetching cost properly, the prefetch procedure stops following an LBA link when next reaches a branch node again along a link, or when next and current might point to the same page (both referred to as Stop Conditions). When the caching buffer is full (also referred to as a Stop Condition), the prefetch procedure should also stop temporarily. Take the prediction graph shown in Figure 2 as an example: the prefetch procedure should not prefetch the pages corresponding to Nodes 8 and 9 when the page corresponding to Node 7 is prefetched. When current reaches a page that corresponds to a branch node, the next page to be accessed (referred to as the target page) determines which branch the application execution will follow. The prefetch procedure should then start prefetching the page that corresponds to the subsequent LBA of the target page (or the pages that correspond to the subsequent LBAs of the target page, if the target page itself corresponds to a branch node). The above prefetch procedure resumes whenever it has stopped temporarily because of any Stop Condition. Note that all pages cached in the SRAM cache between current and next stay in the cache after the target page (in the following of a branch) is accessed, because some of the cached pages might be accessed shortly, even though the access of the target page has determined which branch the application execution will follow. Note that cache misses are still possible, e.g., when current = next. In such a case, data are accessed from NAND and loaded into the SRAM cache in an on-demand fashion.
The pseudo code of the prefetch procedure is shown in Algorithm 1. Two flags, stop and start_bch, are used to track the prefetching state: stop and start_bch denote the satisfaction of any Stop Condition and the reaching of a branch node, respectively. Initially, stop and start_bch are set to FALSE. If any Stop Condition is satisfied when the procedure is invoked, the procedure simply returns (Step 1). The procedure prefetches one page in each iteration (Steps 2-19) until the cache is full (i.e., a Stop Condition) or a branch node is reached for the first time. First, next is checked: if it would point to the same page as current does, the prefetch procedure stops and returns (Steps 3-6). Otherwise, in each iteration, the procedure advances next, i.e., the location of the next free cache buffer (Step 7). The next LBA is obtained by looking up the latest prefetched LBA (Step 8), and the page of that LBA is then prefetched (Step 9). After the prefetching of a page, the procedure checks whether the prefetched page corresponds to a branch node (Steps 10-11). If so, the procedure loads the corresponding branch-table entries (Step 12) and saves the subsequent LBA of each branch of the branch node (Steps 13-17). Because the prefetched page corresponds to a branch node, the procedure should then prefetch pages by following each branch in an alternating way (Steps 20-36). The loop stops when the cache is full (Step 20), when every next-LBA link of the branch node reaches the next branch node (Steps 31-35), or when next and current might point to the same page (Steps 22-25). In each iteration of the loop, if the LBA link indexed by idx_bch has not yet reached the next branch node (Step 21), the next LBA along that link is prefetched (Steps 26-28). Pages are prefetched by following all possible next-LBA links on an equal basis and in an alternating way (Step 30).

Note that stop should be reset to FALSE when the cache is no longer full or when next and current no longer point to the same page. (Performance enhancement is possible by deploying more complicated condition settings and actions.) Moreover, stop and start_bch should both be reset to FALSE when current passes a branch node and meets the target page, or when a cache miss occurs (i.e., current = next). Once stop is reset to FALSE, the prefetch procedure is invoked. When start_bch is FALSE in such an invocation, the prefetch procedure starts prefetching from the first loop between Steps 2 and 19. Otherwise, the prefetch procedure continues its previous prefetching job by following the next-LBA links of the visited branch node in an alternating way (Steps 20-36).
Algorithm 1: Prefetch Procedure
Input: stop, next, current, lba, idx_bch, N_bch, lba_bch[], and start_bch
Output: null
 1: if stop = TRUE then return;
 2: while start_bch = FALSE and (next + 1) mod SIZE ≠ current do
 3:     if ChkNxLBA(lba) = cache(current) then
 4:         stop ← TRUE;
 5:         return;
 6:     end
 7:     next ← (next + 1) mod SIZE;
 8:     lba ← GetNxLBA(lba);
 9:     Read(next, lba);
10:     start_bch ← IsBchStart();
11:     if start_bch = TRUE then
12:         LdBchTable(GetNxLBA(lba));
13:         idx_bch ← 0;
14:         N_bch ← GetBchNum();
15:         for i = 0; i < N_bch; i = i + 1 do
16:             lba_bch[i] ← GetBchLBA(i);
17:         end
18:     end
19: end
20: while start_bch = TRUE and (next + 1) mod SIZE ≠ current do
21:     if IsBchCplt(idx_bch) = FALSE then
22:         if ChkNxLBA(lba_bch[idx_bch]) = cache(current) then
23:             stop ← TRUE;
24:             return;
25:         end
26:         next ← (next + 1) mod SIZE;
27:         lba_bch[idx_bch] ← GetNxLBA(lba_bch[idx_bch]);
28:         Read(next, lba_bch[idx_bch]);
29:     end
30:     idx_bch ← (idx_bch + 1) mod N_bch;
31:     if IsBchStop() = TRUE then
32:         stop ← TRUE;
33:         start_bch ← FALSE;
34:         return;
35:     end
36: end
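For concreteness, here is a direct C transcription of Algorithm 1. It is a sketch: the extern helpers mirror the pseudo-code primitives and are assumed to be supplied by the controller firmware, and SIZE/MAX_BCH are illustrative constants.

#include <stdbool.h>
#include <stdint.h>

#define SIZE    8u   /* page buffers in the SRAM cache (illustrative)     */
#define MAX_BCH 16u  /* illustrative bound on a branch node's fan-out     */

extern uint32_t ChkNxLBA(uint32_t lba);           /* peek at successor     */
extern uint32_t GetNxLBA(uint32_t lba);           /* consume successor     */
extern void     Read(uint32_t buf, uint32_t lba); /* NAND page -> cache    */
extern bool     IsBchStart(void);                 /* just hit a branch?    */
extern void     LdBchTable(uint32_t entry);       /* load branch entries   */
extern uint32_t GetBchNum(void);                  /* branch fan-out        */
extern uint32_t GetBchLBA(uint32_t i);            /* i-th successor LBA    */
extern bool     IsBchCplt(uint32_t i);            /* link i at a branch?   */
extern bool     IsBchStop(void);                  /* all links at branch?  */
extern uint32_t cache(uint32_t buf);              /* LBA cached in buffer  */

static bool stop, start_bch;
static uint32_t next_idx, current_idx, lba, idx_bch, N_bch, lba_bch[MAX_BCH];

void PrefetchProcedure(void)
{
    if (stop) return;                                           /* Step 1  */
    while (!start_bch && (next_idx + 1) % SIZE != current_idx) {/* 2-19    */
        if (ChkNxLBA(lba) == cache(current_idx)) { stop = true; return; }
        next_idx = (next_idx + 1) % SIZE;
        lba = GetNxLBA(lba);
        Read(next_idx, lba);
        start_bch = IsBchStart();
        if (start_bch) {                       /* entered a branch node    */
            LdBchTable(GetNxLBA(lba));
            idx_bch = 0;
            N_bch = GetBchNum();
            for (uint32_t i = 0; i < N_bch; i++) lba_bch[i] = GetBchLBA(i);
        }
    }
    while (start_bch && (next_idx + 1) % SIZE != current_idx) { /* 20-36   */
        if (!IsBchCplt(idx_bch)) {             /* link not yet at a branch */
            if (ChkNxLBA(lba_bch[idx_bch]) == cache(current_idx)) {
                stop = true; return;
            }
            next_idx = (next_idx + 1) % SIZE;
            lba_bch[idx_bch] = GetNxLBA(lba_bch[idx_bch]);
            Read(next_idx, lba_bch[idx_bch]);
        }
        idx_bch = (idx_bch + 1) % N_bch;       /* alternate among links    */
        if (IsBchStop()) { stop = true; start_bch = false; return; }
    }
}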
4. PERFORMANCE EVALUATION
4.1 Performance Metrics and Experiment Setup
The purpose of this section is to evaluate the capability of the proposed prefetch mechanism in terms of read performance and cache miss rate (Section 4.2), main-memory requirements (Section 4.3), and prefetch overhead (Section 4.4). The read performance and cache miss rate were measured against the cache size and the number of game traces collected for creating the prediction graph; the main-memory requirements were measured by the size of the branch table of the prediction graph; and the prefetch overhead was measured by the percentage of redundant prefetched data.
The performance of the proposed prefetch mechanism was evaluated with a trace-driven simulation. The experimental traces were collected over a mobile PC in units of one sector (512B), which is consistent with the unit of data prefetching in the proposed prefetch mechanism. Because the applications that use NAND to replace NOR mainly store the program code of mobile phones and arcade games, we collected traces from three types of popular games, i.e., Age of Empires II (referred to as AOE II for short), The Typing of the Death (referred to as TTD for short), and Raiden, where each game was played ten times during the trace collection and the size of each game was carefully selected. The characteristics of the games are listed in Table 2. AOE II is a real-time strategy game in which all players control their roles at the same time, so the access pattern is full of diversity and hard to predict, as suggested in Figure 5(a). TTD is an English typing game in which players can choose any stage to play; once players clear a stage, an animation corresponding to that stage is displayed. Therefore, the size of the game is comparatively large, and its accesses are predictable, as suggested in Figure 5(b). Raiden is a 3D vertical-shooter game in which players clear stages one by one, and each enemy appears at a specific time and place. Once one stage is cleared, the data of the next stage are loaded completely into the system. Thus, this game is highly predictable and loads data in bursts.
                                  AOE II      TTD         Raiden
Size                              small       large       small
                                  (438 MB)    (812 MB)    (467 MB)
Average number of branches        high        medium      low
Burst in reads                    low         medium      high
Temporal locality in data access  low         low         low
Randomness in data access         high        medium      low
Branch table size                 large       large       small
                                  (35.14 KB)  (39.83 KB)  (0.43 KB)

Table 2. The characteristics of the games under investigation.
[Figure 5. Composite analysis of the investigated games: (a) Age of Empires II, (b) The Typing of the Death, (c) Raiden.]
In the experiments, one large-block NAND flash memory device (128 pages per block and 2KB per page) and one NOR flash memory device were under investigation, where the response time of reading one byte from NAND was 50 ns and that from NOR was 40 ns. Meanwhile, the set-up time for NAND to read data from a page was 25 µs, while NOR had no set-up time. In addition to the flash memory, one SRAM was used to store the branch tables and to serve as the cache, with a response time of 10 ns. In the experiments, the branch tables were stored in NAND flash memory and were loaded into SRAM on demand, so as to prevent the branch tables from growing too large to fit in SRAM.
As for the relationship between the number of traces and the average number of branches of each branch node in the prediction graph: the more traces were collected for each game, the higher the average number of branches per branch node. As shown in Figure 6, the average number of branches per branch node was less than four, and it grew slowly, except for AOE II, because of the randomness of data access in the real-time strategy game AOE II.

[Figure 6. Increment of the average branch number with the number of traces.]
4.2 Read Performance and Cache Miss Rate
Figure 7 shows the read performance of each game with different cache sizes, where the prediction graph was derived from ten traces of each game. Our approach achieved the best performance for Raiden, due to its regular access patterns, while the worst performance was observed for AOE II. When the cache size was 2KB, the average read performance of AOE II, TTD, and Raiden with the prefetch mechanism was 27.74 MB/s, 68.68 MB/s, and 94.98 MB/s, respectively. We must point out that all of them were better than the read performance of NOR (23.84 MB/s). Note that the lower the cache miss rate in prefetching, the higher the read performance. To resolve a cache miss, data accesses to the emulated NOR had to be redirected to NAND so that the missed data could be loaded from NAND into the cache. It was also shown that a 4KB cache was sufficient for the games under consideration, because the read performance became saturated when the cache size was no less than 4KB.
[Figure 7. The read performance with different cache sizes (10 traces).]

Figure 8 shows the read performance of the proposed approach for the three games with respect to different numbers of traces, where the cache size was 4KB. The read performance of each game was better than that of NOR even when only two traces were used to generate a prediction graph. For example, the improvement ratios over NOR for AOE II, TTD, and Raiden were 24%, 216%, and 298%, respectively, when the number of traces for each game was 10 and the cache size was 4KB. When there were more than two traces, the read performance of Raiden showed almost no further improvement, because its cache miss rate was already almost zero.
For AOE II, the read performance improved only slowly as the number of collected traces increased, because the access pattern of AOE II was highly random; increasing the number of collected traces for the prediction graph could not reduce the cache miss rate significantly. For TTD, good improvement was observed with the inclusion of two more traces, because the last two traces were, in fact, collected as the players advanced in the game by clearing more stages. Furthermore, we summarize the read performance of the proposed scheme and existing products in Table 3; it shows that the read performance of some specific applications with regular access patterns is even better than that of OneNAND. On the other hand, without our prediction mechanism (i.e., in the worst case of a 100% miss rate), requested data have to be read from NAND flash memory on each read request. Thus, it is impractical to use NAND to replace NOR without any prediction mechanism, because the read performance gap between the emulated NOR and real NOR is too large.
[Figure 8. The read performance with different numbers of traces (4KB cache).]
Figure 9 shows the cache miss rates. The miss rate was lower when more traces were used to construct the prediction graph. In the figure, when ten traces were used to generate the prediction graph, the cache miss rate of Raiden was almost zero and that of TTD was lower than 5%, but that of AOE II could not be reduced effectively because of its unpredictable access patterns. Compared with the read performance shown in Figure 8, the read performance of a game was higher when its cache miss rate was lower.

              AOE II   TTD     Raiden   Worst case   NOR     OneNAND [25]
Read (MB/s)   29.57    75.24   94.44    8.76         23.84   68

Table 3. Comparison of the read performance (10 traces and a 4KB cache in our approach).
[Figure 9. Cache miss rate with different numbers of traces (4KB cache).]
4.3 Main-memory Requirements
The major memory overhead of the prediction mechanism is the maintenance of the branch table. The more traces were used to create the prediction graph, the bigger the branch table was, because more access patterns were learned. As shown in Figure 10, the table sizes of AOE II, TTD, and Raiden were only 39.83KB, 35.14KB, and 0.43KB, respectively, when ten traces were used for each game. For most embedded systems, the branch table of each game is still small enough to be stored in RAM; nevertheless, in this experiment, the branch tables were stored in NAND flash memory and loaded into SRAM on demand. Figure 10 shows that the table size of Raiden stayed low as the number of traces increased, but the table sizes of AOE II and TTD kept growing, because ten traces still could not cover all the access patterns of AOE II and TTD. However, as shown in Figure 9, the cache miss rate of TTD was already very low and did not need new traces to improve the cache hit ratio, while the cache miss rate of AOE II still could not be lowered even if more traces were collected.

[Figure 10. The size of the branch table with different numbers of traces.]
4.4 Cache Pollution Rate
The cache pollution rate is the fraction of prefetched data that is never referenced during program execution. The prefetching of unnecessary data represents overhead and might even decrease the read performance, because the prefetching activities
of unnecessary data might delay the prefetching of useful data.
In addition, unnecessary data transfer leads to extra power consumption, which is critical to embedded-system designs. Let N_SRAM2host be the amount of data accessed by the host, and N_flash2SRAM the amount of data transferred from NAND flash memory to SRAM. The cache pollution rate is defined as follows:

    Cache pollution rate = 1 - N_SRAM2host / N_flash2SRAM
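For example (an illustrative figure, not one from the experiments), if 1.0 MB is transferred from NAND to SRAM but the host actually reads only 0.9 MB of it, the cache pollution rate is 1 - 0.9/1.0 = 10%.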
As shown in Figure 11, the cache pollution rate increased
as the number of traces for each game increased. That was
because more traces led to a larger number of branches per
branch node, and only one of the LBA links that follow a
given branch node was actually referenced by the program.
In summary, there was a trade-off between the prefetching
accuracy and the prefetching overhead, even though the cache
pollution rates were still lower than 10% in most cases.
[Figure 11. The cache pollution rate with different numbers of traces (4KB cache).]
5. Conclusions
This paper addresses the replacement of NOR with NAND, motivated by a strong market demand. Different from the on-demand cache mechanisms proposed in previous work, we propose an efficient prediction mechanism with limited memory-space requirements and an efficient implementation to improve the performance of programs stored in NAND. Binary code of programs is prefetched from NAND into the SRAM cache precisely and efficiently, according to a prediction graph that is constructed from the collected access patterns of program execution. A series of experiments was conducted based on realistic traces collected from three different types of popular games: AOE II, TTD, and Raiden. We show that the average read performance of NAND with the proposed prediction mechanism could be better than that of NOR in most cases; the cache miss rate was 35.27%, 4.21%, and 0.06% for AOE II, TTD, and Raiden, respectively; and the percentage of redundant prefetched data was lower than 10% in most cases.

For future research, we shall extend the proposed mechanism to adjust the prediction graph on-line, so as to make the prediction mechanism adaptive to special and temporal changes of program executions. We shall also explore the predictability of data prefetching for programs with highly random access patterns.
References
[1] Flash Cache Memory Puts Robson in the Middle. Intel Corporation.
[2] Flash File System. US Patent 540,448. Intel Corporation.
[3] FTL Logger Exchanging Data with FTL Systems. Technical report, Intel Corporation.
[4] Software Concerns of Implementing a Resident Flash Disk. Intel Corporation.
[5] Flash-memory Translation Layer for NAND Flash (NFTL). M-Systems, 1998.
[6] Understanding the Flash Translation Layer (FTL) Specification, http://developer.intel.com/. Technical report, Intel Corporation, Dec 1998.
[7] Windows ReadyDrive and Hybrid Hard Disk Drives, http://www.microsoft.com/whdc/device/storage/hybrid.mspx. Technical report, Microsoft, May 2006.
[8] L.-P. Chang and T.-W. Kuo. An Adaptive Striping Architecture for Flash Memory Storage Systems of Embedded Systems. In IEEE Real-Time and Embedded Technology and Applications Symposium, pages 187-196, 2002.
[9] L.-P. Chang and T.-W. Kuo. An Efficient Management Scheme for Large-Scale Flash-Memory Storage Systems. In ACM Symposium on Applied Computing (SAC), pages 862-868, Mar 2004.
[10] P. J. Denning. The Working Set Model for Program Behavior. Communications of the ACM, 11(5):323-333, 1968.
[11] P. J. Denning and S. C. Schwartz. Properties of the Working-Set Model. Communications of the ACM, 15(3):191-198, 1972.
[12] F. Douglis, R. Caceres, F. Kaashoek, K. Li, B. Marsh, and J. Tauber. Storage Alternatives for Mobile Computers. In Proceedings of the USENIX Operating System Design and Implementation, pages 25-37, 1994.
[13] F. Douglis, P. Krishnan, and B. Marsh. Thwarting the Power-Hungry Disk. In Proceedings of the 1994 Winter USENIX Conference, pages 292-306, 1994.
[14] DRAMeXchange. NAND Flash Contract Price, http://www.dramexchange.com/, Mar 2007.
[15] Y. Joo, Y. Choi, C. Park, S. W. Chung, E.-Y. Chung, and N. Chang. Demand Paging for OneNAND Flash eXecute-In-Place. CODES+ISSS, October 2006.
[16] A. Kawaguchi, S. Nishioka, and H. Motoda. A Flash-Memory Based File System. In Proceedings of the 1995 USENIX Technical Conference, pages 155-164, Jan 1995.
[17] J.-H. Lee, G.-H. Park, and S.-D. Kim. A New NAND-Type Flash Memory Package with Smart Buffer System for Spatial and Temporal Localities. Journal of Systems Architecture, 51:111-123, 2004.
[18] B. Marsh, F. Douglis, and P. Krishnan. Flash Memory File Caching for Mobile Computers. In Proceedings of the Twenty-Seventh Annual Hawaii International Conference on System Sciences, pages 451-460, 1994.
[19] C. Park, J.-U. Kang, S.-Y. Park, and J.-S. Kim. Energy-Aware Demand Paging on NAND Flash-Based Embedded Storages. ISLPED, August 2004.
[20] C. Park, J. Lim, K. Kwon, J. Lee, and S. L. Min. Compiler-Assisted Demand Paging for Embedded Systems with Flash Memory. EMSOFT, September 2004.
[21] C. Park, J. Seo, D. Seo, S. Kim, and B. Kim. Cost-Efficient Memory Architecture Design of NAND Flash Memory Embedded Systems. ICCD, 2003.
[22] Z. Paz. Alternatives to Using NAND Flash White Paper. Technical report, M-Systems, August 2003.
[23] R. A. Quinnell. Meet Different Needs with NAND and NOR. Technical report, TOSHIBA, September 2005.
[24] Samsung Electronics. K9F1G08Q0M 128M x 8bit NAND Flash Memory Data Sheet, 2003.
[25] Samsung Electronics. OneNAND Features and Performance, Nov 2005.
[26] Samsung Electronics. KFW8G16Q2M-DEBx 512M x 16bit OneNAND Flash Memory Data Sheet, Sep 2006.
[27] M. Santarini. NAND versus NOR. Technical report, EDN, October 2005.
[28] Silicon Storage Technology (SST). SST39LF040 4K x 8bit SST Flash Memory Data Sheet, 2005.
[29] STMicroelectronics. NAND08Gx3C2A 8Gbit Multi-level NAND Flash Memory, 2005.
[30] A. Tal. Two Technologies Compared: NOR vs. NAND White Paper. Technical report, M-Systems, July 2003.
[31] C.-H. Wu and T.-W. Kuo. An Adaptive Two-Level Management for the Flash Translation Layer in Embedded Systems. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), November 2006.
[32] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile Main Memory Storage System. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 86-97, 1994.
[33] Q. Xin, E. L. Miller, T. Schwarz, D. D. Long, S. A. Brandt, and W. Litwin. Reliability Mechanisms for Very Large Storage Systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS '03), pages 146-156, Apr 2003.
[34] K. S. Yim, H. Bahn, and K. Koh. A Flash Compression Layer for SmartMedia Card Systems. IEEE Transactions on Consumer Electronics, 50(1):192-197, February 2004.
