
Design and implementation of the memory management unit (MMU) of a 32-bit
micro-controller: split cache of 32/32 KByte, 4-way set-associative, LFU,
write-through / write-allocate, based on an ARM926EJ-S core with a 1 GHz clock
and unlimited main memory clocked at 10 MHz.

Produced by:
SALEEM, Muhammad Umair
Masters of Electrical Engineering
Matriculation Number 25279
HS-Weingarten

Supervising Professor:
Prof. Dr.-Ing. Andreas Siggelkow
Professor of Electrical Engineering and Electronics
Hochschule Ravensburg-Weingarten

Submission date: 23rd JUNE 2014


Faculty of Electrical Engineering
Master of International Engineering Program
Computer Architecture Project, Summer Semester 2014 (SS2014)
Abstract
The design and specification of the ARM9xx family is not new in itself. However, to support the learning objectives of the Computer Architecture course for SS14, a slightly modified and tweaked design of the memory management unit (MMU) and cache is developed. Along with the MMU design, the cache organization and the programmer's model for register association are worked out. Further block diagrams for the DMA controller and the memory management unit are discussed as they become relevant. The cache block replacement algorithm used here is LFU (least frequently used) [6], and the write/read policy for the cache is write-through with write-allocate. As the title suggests, minor changes are made to the standard ARM9xx cache design and implementation for a pipelined processor architecture. A 4-way set-associative mapping scheme is employed; this low associativity is discussed in more detail later [5,6,7]. The programmer's model for the register specification is derived from the split-cache organization of 32 KByte/32 KByte. Finally, with the help of SPEC2009 results, the findings are evaluated in tabular form with a view to better system design in terms of reduced latency and fewer miss cycles for higher overall throughput.

Introduction
As a first step in developing the architecture of the ARM9xx family further, it must be determined which existing system serves as the baseline and what can be done differently to modify it. Statistical results available online and in the cited literature are used where necessary to back up the findings. Most of the detailed organization, however, is designed from scratch, since small changes in the algorithms cumulatively play a much larger part as the system specification for the MMU develops. The ARM core chosen for this project is the ARM926EJ-S integer core [3,4]. This processor was selected because it is readily available and already shares many of the properties our design requires. One fact that can be taken for granted in this scheme is that the ARM926EJ-S uses a Harvard organization with bus arbitration [1], although changes take effect when it comes to the actual design and algorithms of the specific project discussed here. This paper proceeds step by step: bus arbitration for interfacing, derived from the AMBA specification [1]; an overview of the DMAC (Direct Memory Access Controller); then the actual cache implementation, which leads to the design of the MMU itself, complete with the paging techniques for virtual addresses [4] and the mapping, write, and replacement algorithms for the processor. Register specifications are made for the 32-bit ARM instruction set (not the 16-bit Thumb instruction set). Statistical data based on formulae cited from the literature is also provided where relevant.

Cache and its organization


Caches can be transparent, meaning that they work independently of the client making the request (e.g., the
application software executing on the computer) and therefore have built-in heuristics that determine what data to
retain. Caches can also be managed explicitly by the client making the request (these are often called scratch-pads)
or can represent some hybrid of the two extremes. Cache memory is intended to give memory speed approaching
that of the fastest memories available, and at the same time provide a large memory size at the price of less
expensive types of semiconductor memories.
For the design of the cache memory management unit, the ARM926EJ-S family uses the Harvard architecture. Detail will not be given on when bus arbitration is needed in terms of the AHB (Advanced High-performance Bus), the ASB (Advanced System Bus), and the bridging to the lower layer of memory interaction through the APB (Advanced Peripheral Bus), because these architectural functions are kept as they are in the ARM manuals [3,4]. At this point, however, it is worth noting the feature that makes this architecture a suitable basis for our MMU design. The data and instruction paths of this architecture can be viewed as separate, and there is a good reason for this: varying the on-chip area allocated to data or instructions allows different storage sizes for each. A closer look at the architecture suggests a split cache of different shapes [6,7]: the data part of the on-chip area can be long and narrow, whereas the instruction part can be short and wide. This makes sense because a user typically wants more data, even if accessed in short bursts, while the reverse holds for instructions. For our design we use a split cache between the data and instruction parts, divided equally as 32/32 KBytes. This split can be hardware oriented, in terms of the actual component hierarchy, or it can be provided by a software client, which introduces the distinction between transparent and non-transparent caches.
Caches can thus be depicted as hardware oriented or software oriented. The figure below shows some current cache design schemes; our interest is the split-cache design, part (b) of the hardware caches. [6]
Figure: Examples of caches. The caches are divided into two main groups: solid-state caches (top), and those implemented by software mechanisms, typically storing the cached data in main memory (e.g., DRAM) or disk. Part (b) shows a configuration of a general-purpose cache in which the cache is logically and/or physically split into two separate partitions: one that holds only instructions and one that holds only data. Partitioning caches can improve behavior. For instance, providing a separate bus to each partition potentially doubles the bandwidth to the cache, and devoting one partition to fixed-length instructions can simplify read-out logic by obviating support for variable-sized reads.
A much more detailed view of the design choices can be taken from the following diagram, which categorizes how a cache should be designed and which parameters or elements complement it. These additional design parameters ultimately turn the cache design into a complete memory management unit.

Details of all the selection parameters will be provided where relevant. For now it is enough to say that our design employs LFU replacement with a write-through, write-allocate policy on a split-level cache of 4-way (low) associativity for mapping data blocks to lines. (Figure caption: The orthogonal design choices that make up a cache. Any cache design, whether software based or entirely solid state, comprises a logical storage organization, one or more content-management heuristics, and one or more consistency-management heuristics.)

The phenomenon of exhibiting predictable, non-random memory-access behavior is called locality of reference. The behavior is so named because a program's memory references tend to be localized in time and space. If the program references a datum once, it is likely to reference that datum again in the near future; it is also likely to reference nearby data in the near future. The first of these phenomena is called temporal locality, and the second spatial locality. The form of locality most relevant to what follows is temporal locality.

For our design we configure a single-level cache; a write buffer is included between the split-level parts of the cache. This write buffer results from the write policy that will be defined later, and it represents the hardware bifurcation of the split-level cache design [4]. The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine whether the word is in the cache. If so, the word is delivered to the processor. If not, a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor. Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.

The figure below depicts the structure of a cache/main-memory system. Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. For mapping purposes, this memory is considered to consist of a number of fixed-length blocks of K words each. That is, there are M = 2^n/K blocks in main memory. The cache consists of m blocks, called lines. Each line contains K words, plus a tag of a few bits. Each line also includes control bits (not shown), such as a bit to indicate whether the line has been modified since being loaded into the cache. The length of a line, not including tag and control bits, is the line size. The line size may be as small as 32 bits, with each word being a single byte; in this case the line size is 4 bytes. The number of lines is considerably less than the number of main memory blocks (m << M).

At any time, some subset of the blocks of memory resides in lines in the cache. If a word in a block of memory is read, that block is transferred to one of the lines of the cache. Because there are more blocks than lines, an individual line cannot be uniquely and permanently dedicated to a particular block. Thus, each line includes a tag that identifies which particular block is currently being stored. The tag is usually a portion of the main memory address, as described later.
The figure below illustrates the read operation. The processor generates the read address (RA) of a word to be read. Although the figure is taken from the literature [5,6,7], it closely resembles what needs to be implemented in our system as well.
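
As a minimal sketch of this read flow (hypothetical names, and a direct-mapped placement chosen only for brevity; the 4-way lookup used in our design is shown later), the check-then-fetch logic can be written as:

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_BYTES 16u            /* K words per line, here 16 bytes          */
#define NUM_LINES  4096u          /* m lines in the cache                     */

typedef struct {
    bool     valid;
    uint32_t tag;                 /* identifies which memory block is held    */
    uint8_t  data[LINE_BYTES];
} cache_line_t;

static cache_line_t cache[NUM_LINES];
extern uint8_t main_memory[];     /* backing store (assumed to exist)         */

/* Read one 32-bit word; on a miss the whole block is fetched first. */
uint32_t cache_read_word(uint32_t addr)
{
    uint32_t offset = addr % LINE_BYTES;
    uint32_t block  = addr / LINE_BYTES;    /* main-memory block number       */
    uint32_t index  = block % NUM_LINES;    /* direct-mapped for simplicity   */
    uint32_t tag    = block / NUM_LINES;

    cache_line_t *line = &cache[index];
    if (!line->valid || line->tag != tag) { /* read miss: fetch whole block   */
        memcpy(line->data, &main_memory[block * LINE_BYTES], LINE_BYTES);
        line->tag   = tag;
        line->valid = true;
    }
    uint32_t word;                          /* deliver the requested word     */
    memcpy(&word, &line->data[offset & ~3u], sizeof word);
    return word;
}
```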

Cache and MMU Organization


When virtual memory is used, the address fields of machine instructions contain virtual addresses. For reads to and writes from main memory, a hardware memory management unit (MMU) translates each virtual address into a physical address in main memory. When virtual addresses are used, the system designer may choose to place the cache between the processor and the MMU or between the MMU and main memory. A logical cache, also known as a virtual cache, stores data using virtual addresses; the processor accesses the cache directly, without going through the MMU. A physical cache stores data using main memory physical addresses.

It goes without saying that in the Harvard model the cache is a split unit, with different design specifications for the instruction and the data part. For our design the instruction and data parts are divided equally into a split of 32/32 KBytes. This may also include split divisions between units of cache from the top level to the lower level. The write buffer is included between the two levels and is used by the write policy on a write miss. More on this design topology is discussed later in this paper. [6,7]
The figure to the right depicts the use of multiple levels of cache. The L2 cache is slower and typically larger than the L1 cache, and the L3 cache is slower and typically larger than the L2 cache. The accompanying depiction of the cache/main-memory structure is the one already described above: up to 2^n addressable words of main memory, organized for mapping purposes into M = 2^n/K fixed-length blocks of K words, and a cache of m lines, each holding K words plus a tag and control bits (such as a bit indicating whether the line has been modified since being loaded).

Basic cache structure: a cache is composed of two main parts, the cache metadata (or, simply, tags) and the cache data. Each data entry is termed a cache line or cache block. The tag entries identify the contents of their corresponding data entries. Status information includes a valid bit.

Because a cache is typically much smaller than the backing store, there is a good possibility that any particular
requested datum is not in the cache. Therefore, some mechanism must indicate whether any particular datum is
present in the cache or not. The cache tags fill this purpose. The tags, a set of block IDs, comprise a list of valid
entries in the cache, with one tag per data entry. The basic structure is illustrated in Figure above which shows a
cache as a set of cache entries [Smith 1982 reference provided from 2]. This can take any form: a solid-state array
(e.g., an SRAM implementation) or, if the cache is implemented in software, a binary search tree or even a short
linked list.

Content Management: To Cache or Not to Cache


The purpose of a content-management solution is to determine which references are the best candidates for caching
and to ensure that their corresponding data values are in the cache at the time the application requests them.
Simultaneously, it must consider the currently cached items and decide which should be evicted to make room for
more important data not yet cached. The optimal algorithm is one that can look into the future and determine what
will be needed and ensure that all important data words, up to the limits of the cache's capacity and organization,
are in the cache when requested.

Consistency Management: Its Responsibilities


Consistency management comprises three main charges:
1. Keep the cache consistent with itself
2. Keep the cache consistent with the backing store
3. Keep the cache consistent with other caches

1. Keep the Cache Consistent with Itself:


There should never be two copies of a single item in different places of the cache, unless it is guaranteed that they
will always have the same value. In general, this tends not to be a problem, provided the cache is physically
indexed and physically tagged. Many operating systems allow two completely unrelated virtual addresses to map
to the same physical address, creating a synonym. This creates a problem by allowing the physical datum to lie in two
different virtually indexed cache sets at the same time, immediately creating an inconsistency as soon as one of
those copies is written. If the cache's sets are referenced using a physical address (i.e., an address corresponding to
the backing store, which uniquely identifies the datum), then there is no problem.

2. Keep the Cache Consistent with the Backing Store:


The value kept in the backing store should be up to date with any changes made to the version stored in the cache,
but only to the point that it is possible to ensure that any read requests to the version in the backing store (e.g., from
another processor) return the most recently written value stored in the cache. Two typical mechanisms used in
general-purpose caches to fulfill this responsibility are write-back and write-through policies. The policy that
matches our design proposition is write-through with write-allocate; its design specifications are given a little
further on as our proposed design takes shape.

3. Keep the Cache Consistent with Other Caches


Other caches in the system should all be treated much like the backing store in terms of keeping the various copies
of a datum up to date. The proposed design consistent with this analogy is the split-cache system with different
cache levels for processor access.

The design parameter of writing into the cache (mapping technique for 4-way set associativity)
The design that keeps the cache consistent with itself is a compromise between two extremes: a set-associative cache lies between direct-mapped and fully associative designs and often reaps the benefits of both, fast lookup and lower contention. [6]
As the associativity of a cache controller goes up, the probability of thrashing goes down. The ideal goal would be to maximize the set associativity of a cache by designing it so that any main memory location maps to any cache line. A cache that does this is known as a fully associative cache.
However, as the associativity increases, so does the complexity of the hardware that supports it. One method used by hardware designers to increase the set associativity of a cache is a content addressable memory (CAM). A CAM uses a set of comparators to compare the input tag address with the cache-tag stored in each valid cache line. A CAM works in the opposite way to a RAM: where a RAM produces data when given an address value, a CAM produces an address if a given data value exists in the memory. Using a CAM allows many more cache-tags to be compared simultaneously, thereby increasing the number of cache lines that can be included in a set. Because our design uses low associativity, the content-addressable-memory scheme remains unused, which keeps the hardware simple. [2]
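
A software sketch of the lookup our 4-way design needs (no CAM, just four tag comparisons per set; the names and sizes are illustrative, with the set count taken from the design calculations later on):

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4u
#define SETS 1024u                /* derived later in the design calculations */

typedef struct {
    bool     valid;
    uint32_t tag;
} tag_entry_t;

static tag_entry_t tags[SETS][WAYS];

/* Probe the four ways of one set (done in parallel comparators in hardware). */
bool cache_lookup(uint32_t set, uint32_t tag, uint32_t *hit_way)
{
    for (uint32_t way = 0; way < WAYS; way++) {
        if (tags[set][way].valid && tags[set][way].tag == tag) {
            *hit_way = way;       /* hit: only k tags were compared, no CAM   */
            return true;
        }
    }
    return false;                 /* miss: replacement policy selects a victim */
}
```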

The figure here conveys the overall idea of the topology of 4-way set-associative mapping for the cache. A little more detail is given next on the specific parameters that come into play while designing the system. In this case, the cache consists of a number of sets, each of which consists of a number of lines. [2,4]

The relationships are:

m = v * k and i = j mod v

where
i = cache set number, j = main memory block number, m = number of lines in the cache, v = number of sets, and k = number of lines in each set.

For set-associative mapping, the cache control logic interprets a memory address as three fields: Tag, Set, and Word. The d set bits specify one of v = 2^d sets. The s bits of the Tag and Set fields together specify one of the 2^s blocks of main memory. The figure below illustrates the cache control logic. With fully associative mapping, the tag in a memory address is quite large and must be compared to the tag of every line in the cache. With k-way set-associative mapping, the tag in a memory address is much smaller and is only compared to the k tags within a single set. [6,7] To summarize:

Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^s
Number of lines in set = k
Number of sets = v = 2^d
Number of lines in cache: m = k * v = k * 2^d
Size of cache = k * 2^(d+w) words or bytes
Size of tag = (s - d) bits

Design Calculations
For the main memory to have 2^n addressable words, n address bits are required, which leads to M = 2^n/K blocks in main memory; as already noted, m is the number of lines in the cache.
For our design the address space is 32 bits. The split cache of 32/32 KBytes (D/I respectively) gives 64 KB in total:
2^16 = 65536 bytes, i.e. 64 KBytes.
The 4-way set associativity aids in the derivation of the block size, i.e. 2^4 = 16 bytes, which is just another way of saying that 2^(k-way set associativity) equals the required block size of the set. Describing everything in powers of two:
block size 2^4 and associativity 2^2.
From the block size, the width of the offset field is log2(block size) = log2(16) = 4 bits.
For the width of the set field we need the number of sets, and for that the number of blocks. Since we use 4-way set associativity, each set holds 4 blocks.
Number of blocks = capacity / block size = 2^16 / 2^4 = 2^12
In other words, the cache holds 2^12 = 4096 lines, four per set.
Similarly, the number of bits in the set field follows from
2^12 / 2^2 = 2^10
so the set field of the cache design is 10 bits wide.
If we now observe the diagram given on the next page [6], the tag field, set field, and offset field are justified by the cache design requirements. For the 32-bit ARM design, as seen in the diagram [Stallings, 6], the word or offset field (w) is 4 bits long, the set field (d) is 10 bits wide, and consequently the tag field (s - d), which can be rewritten as 32 - 10 - 4 using the values derived above, comes out to be 18 bits wide. The extra storage required for tag addressing in the cache is again taken from the design itself, amounting to 18 bits per line.
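
The field widths derived above can be checked with a small, self-contained C program (the parameter names are ours; it merely reproduces the arithmetic):

```c
#include <stdio.h>

/* Integer log2 for powers of two. */
static unsigned log2u(unsigned x) { unsigned n = 0; while (x >>= 1) n++; return n; }

int main(void)
{
    const unsigned addr_bits   = 32;          /* ARM address width              */
    const unsigned capacity    = 64 * 1024;   /* 32/32 KB split, 64 KB total    */
    const unsigned block_bytes = 16;          /* 2^4-byte lines                 */
    const unsigned ways        = 4;           /* 4-way set associative          */

    unsigned offset_bits = log2u(block_bytes);                /* = 4            */
    unsigned num_blocks  = capacity / block_bytes;             /* = 2^12 = 4096  */
    unsigned num_sets    = num_blocks / ways;                   /* = 2^10 = 1024  */
    unsigned set_bits    = log2u(num_sets);                     /* = 10           */
    unsigned tag_bits    = addr_bits - set_bits - offset_bits;  /* = 18           */

    printf("offset=%u  set=%u  tag=%u bits\n", offset_bits, set_bits, tag_bits);
    return 0;
}
```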

The figure below shows a representation of the k-way set-associative mapping we need to establish for the cache MMU. Although the figure provided is merely an extension of the two-way comparison for a hit or miss (the figure is taken from Stallings [6]), the expressions have been modified to fit our required design features and our 4-way, low-associativity mapping.
Before cache-coherence management is addressed, we first establish the context of virtual vs. physical addressing for the MMU.

MVA formulation:
Whys and Hows
Of the four configurations of the cache MMU for determining which addresses prevail and which instructions follow for execution, the options are: physically indexed, physically tagged; physically indexed, virtually tagged; virtually indexed, physically tagged; and finally virtually indexed, virtually tagged. The details of these configurations can be taken from sources such as [2, 6, 7]. As information for the reader, the system chosen for our design belongs to the physically indexed, physically tagged category; its benefits will become apparent shortly.
Our design rests on the premise of physically indexed and physically tagged caches. Based on the specifications in the ARM926EJ-S technical manual, very little has been changed; the only changes concern the page-referencing bits on coarse, small, and tiny pages. The design algorithm is given below, and a detailed working is presented hereafter.
The cache is indexed and tagged by its physical address. Therefore, the virtual address must be translated before the cache can be accessed. The advantage of this design is that, since the cache uses the same namespace as physical memory, it can be controlled entirely by hardware, and the operating system need not concern itself with managing the cache. In this case the cache size can grow only by increasing the associativity of the cache itself.

The Memory Management Unit


The ARM926EJ-S MMU is an ARM architecture v5
MMU. It provides virtual memory features required by
systems operating on platforms such as Symbian OS,
Windows-CE, and Linux. A single set of two-level page
tables stored in main memory is used to control the address
translation, permission checks, and memory region
attributes for both data and instruction accesses. The MMU
uses a single unified Translation Look-aside Buffer (TLB)
to cache the information held in the page tables. This
unified TLB is responsible for the integration of the
physically indexed physically tagged topology of the
virtual memory page translator for enhanced memory
storage. [2,3]
To support both sections and pages, there are two levels of
address translation. The MMU puts the translated physical
addresses into the MMU Translation Look-aside Buffer.
The MMU TLB has two parts:
1. The main TLB
2. The lockdown TLB
The main TLB is a 4-way, set-associative cache for page table information. It has 32 entries per way, for a total of 128 entries. The lockdown TLB is an eight-entry, fully associative cache that contains locked TLB entries. Locking TLB entries can ensure that a memory access to a given region never incurs the penalty of a page table walk. Whether an entry is placed in the set-associative or the lockdown part of the TLB depends on the state of the TLB Lockdown Register c10. Although the control scheme of register c10 is avoided for design simplicity, a complete description of the CP15 register set is given, since it provides the core control register and opcode combinations used to control and manipulate the MMU. [2,3,4]
When an entry has been written into the lockdown part of the TLB, it can only be removed by being overwritten
explicitly, or by a Modified Virtual Address - based TLB invalidate operation, where the MVA matches the locked
down entry. The structure of the set-associative part of the TLB does not form part of the programmer's model for
the ARM926EJ-S processor [3]. No assumptions must be made about the structure, replacement algorithm, or
persistence of entries in the set-associative part. Specifically, any entry written into the set-associative part of the
TLB can be removed at any time. The set-associative part of the TLB must be considered as a temporary cache of
translation/page table information. No reliance must be placed on an entry either residing or not residing in the set-
associative TLB, unless that entry already exists in the lockdown TLB. The set-associative part of the TLB can
contain entries that are defined in the page tables but do not correspond to address values that have been accessed
since the TLB was invalidated. The set-associative part of the TLB must be considered as a cache of the underlying
page table, where memory coherency must be maintained at all times.
The MMU features are:
1. Standard ARM architecture v4 and v5 MMU mapping sizes, domains, and access protection scheme
2. Mapping sizes are 1MB (sections), 64KB (large pages), 4KB (small pages), and 1KB (tiny pages)
3. Access permissions for large pages and small pages can be specified separately for each quarter of the page
(subpage permissions)
4. Hardware page table walks
5. Invalidate entire TLB using CP15 c8
6. Invalidate TLB entry selected by MVA, using CP15 c8
7. Lockdown of TLB entries using CP15 c10.

Translated Entries
The main TLB caches 128 translated entries. If, during a memory access, the main TLB contains a translated entry
for the MVA, the MMU reads the protection data to determine if the access is permitted:
If access is permitted and an off-chip access is required, the MMU outputs the appropriate physical address
corresponding to the MVA
If access is permitted and an off-chip access is not required, the cache or TCM services the access
If access is not permitted, the MMU signals the CPU core to abort.

If the TLB misses (it does not contain an entry for the MVA) the translation table walk hardware is invoked to
retrieve the translation information from a translation table in physical memory. When retrieved, the translation
information is written into the TLB, possibly overwriting an existing value. To enable use of TLB locking features,
the location to be written can be specified using CP15 c10 TLB Lockdown Register.
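
The access flow just described can be summarized in the following hedged C outline; the helper functions are assumptions standing in for the hardware blocks (TLB lookup, table-walk hardware, permission checker), not part of the ARM926EJ-S programmer's model:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    uint32_t mva;        /* modified virtual address of the page            */
    uint32_t pa;         /* physical page base address                      */
    uint32_t perms;      /* access permission bits from the page table      */
    bool     valid;
} tlb_entry_t;

/* Assumed helpers standing in for the hardware described above. */
tlb_entry_t *tlb_lookup(uint32_t mva);
tlb_entry_t *hardware_table_walk(uint32_t mva);  /* two-level walk in memory */
bool access_permitted(uint32_t perms, bool is_write);
void signal_abort(void);

/* Outline of the translated-entry flow: TLB hit -> permission check -> PA. */
bool mmu_translate(uint32_t mva, bool is_write, uint32_t *pa_out)
{
    tlb_entry_t *e = tlb_lookup(mva);
    if (e == NULL) {
        /* TLB miss: the table-walk hardware fetches the translation and
           writes it into the TLB (placement can be steered by CP15 c10).   */
        e = hardware_table_walk(mva);
        if (e == NULL)
            return false;                        /* translation fault        */
    }
    if (!access_permitted(e->perms, is_write)) {
        signal_abort();                          /* MMU aborts the core      */
        return false;
    }
    *pa_out = e->pa | (mva & 0xFFFu);            /* 4KB small-page offset    */
    return true;
}
```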

Programmer's model [2,4]


At this point we transition to the programmer's model, which defines the control of the processor's controllable components. The material that follows has been adapted from the ARM technical reference manual [3,4] with several changes. The system control coprocessor (CP15) is used to configure and control the ARM926EJ-S processor. The caches, Tightly-Coupled Memories (TCMs), Memory Management Unit (MMU), and most other system options are controlled using CP15 registers. CP15 defines 16 registers, as below:

Register   Reads                      Writes
0          ID code                    Unpredictable
0          Cache type                 Unpredictable
0          TCM status                 Unpredictable
1          Control                    Control
2          Translation table base     Translation table base
3          Domain access control      Domain access control
4          Reserved                   Reserved
5          Data fault status          Data fault status
5          Instruction fault status   Instruction fault status
6          Fault address              Fault address
7          Cache operations           Cache operations
8          Unpredictable              TLB operations
9          Cache lockdown             Cache lockdown
9          TCM region                 TCM region
10         TLB lockdown               TLB lockdown
11, 12     Reserved                   Reserved
13         FCSE PID                   FCSE PID
13         Context ID                 Context ID
14         Reserved                   Reserved
15         Test configuration         Test configuration

CP15 can be accessed via the MRC (move to ARM register from coprocessor) and MCR (move to coprocessor from ARM register) instructions, used in complementary pairs. The figure for configuring the opcode is shown below:
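
For illustration, the following GCC-style inline assembly sketch shows typical MRC/MCR accesses to CP15 (Cache Type register in c0 with Opcode_2 = 1, and Control register c1); it assumes a privileged mode on an ARM toolchain and is our own sketch, not a figure from the manual:

```c
#include <stdint.h>

/* Read the Cache Type register: MRC p15, 0, Rd, c0, c0, 1 */
static inline uint32_t cp15_read_cache_type(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 1" : "=r"(val));
    return val;
}

/* Read the Control register c1: MRC p15, 0, Rd, c1, c0, 0 */
static inline uint32_t cp15_read_control(void)
{
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(val));
    return val;
}

/* Write the Control register c1: MCR p15, 0, Rd, c1, c0, 0 */
static inline void cp15_write_control(uint32_t val)
{
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(val));
}
```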

Register descriptions
The following registers are described in this section:
ID Code, Cache Type, and TCM Status Registers c0
Control Register c1
Translation Table Base Register c2
Domain Access Control Register c3
Register c4
Fault Status Registers c5
Fault Address Register c6
Cache Operations Register c7
TLB Operations Register c8
Cache Lockdown and TCM Region Registers c9
TLB Lockdown Register c10
Registers c11 and c12
Process ID Register c13
Register c14
Test and Debug Register c15

A brief working description of the registers, in line with our design specifications, is given below:

ID Code, Cache Type, and TCM Status Registers, c0


The ID Code is a read-only register that returns the 32-bit device ID code. The Cache Type Register is a read-only register that contains information about the size and architecture of the instruction cache (ICache) and data cache (DCache), enabling operating systems to establish how to perform operations such as cache cleaning and lockdown. The Ctype field determines the cache type. The S bit specifies whether the cache is a unified cache (S=0) or separate ICache and DCache (S=1). If S=0, the Isize and Dsize fields both describe the unified cache and must be identical. In the ARM926EJ-S processor this bit is set to 1 to denote separate caches. Dsize specifies the size, line length, and associativity of the DCache, or of the unified cache if the S bit is 0. Isize specifies the size, line length, and associativity of the ICache, or of the unified cache if the S bit is 0.

The D and I fields have a similar structure, as given below:

The Assoc field determines the cache associativity in conjunction with the M bit. The M bit is a multiplier bit that determines the cache size and cache associativity values in conjunction with the Size and Assoc fields. If the cache is present, M must be set to 0; if the cache is absent, M must be set to 1. For the ARM926EJ-S processor, M is always set to 0. The Len field determines the line length of the cache. The size of the cache is determined by the Size field and the M bit. The M bit is 0 for the DCache and ICache. The Size field is bits [21:18] for the DCache and bits [9:6] for the ICache. The minimum size of each cache is 4KB, and the maximum size is 128KB. For our configuration the Size field takes the value b0110, corresponding to 32KB. The associativity of the cache is determined by the Assoc field and the M bit. The M bit is 0 for the DCache and ICache. The Assoc field is bits [17:15] for the DCache and bits [5:3] for the ICache. For 4-way associative mapping the Assoc field takes the value b010 in the cache associativity encoding (M=0).

The line length of the cache is determined by the Len field. The Len field is bits [13:12] for the DCache and bits
[1:0] for the I Cache. The cache type register values for an ARM926EJ-S processor with the following
configuration are shown below:

Function             Register bits   Value
Reserved             [31:29]         b000
Ctype                [28:25]         b1110
S                    [24]            b1 = Harvard (separate caches)
Dsize: Reserved      [23:22]         b00
Dsize: Size          [21:18]         b0110 = 32KB
Dsize: Assoc         [17:15]         b010 = 4-way
Dsize: M             [14]            b0
Dsize: Len           [13:12]         b10 = 8 words per line
Isize: Reserved      [11:10]         b00
Isize: Size          [9:6]           b0110 = 32KB
Isize: Assoc         [5:3]           b010 = 4-way
Isize: M             [2]             b0
Isize: Len           [1:0]           b10 = 8 words per line
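
A small decoding sketch for the Cache Type register, using the bit positions from the table above (the field layout is simply restated from that table, not independently verified):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the S, Dsize, and Isize fields of a Cache Type register value. */
void decode_cache_type(uint32_t ctr)
{
    unsigned split  = (ctr >> 24) & 0x1;   /* S bit: 1 = separate I/D (Harvard) */
    unsigned dsize  = (ctr >> 18) & 0xF;   /* b0110 -> 32KB                     */
    unsigned dassoc = (ctr >> 15) & 0x7;   /* b010  -> 4-way                    */
    unsigned dlen   = (ctr >> 12) & 0x3;   /* b10   -> 8 words per line         */
    unsigned isize  = (ctr >> 6)  & 0xF;
    unsigned iassoc = (ctr >> 3)  & 0x7;
    unsigned ilen   = (ctr >> 0)  & 0x3;

    printf("split=%u  D: size=%#x assoc=%#x len=%#x  I: size=%#x assoc=%#x len=%#x\n",
           split, dsize, dassoc, dlen, isize, iassoc, ilen);
}
```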

Control Register c1
Register c1 is the Control Register for the ARM926EJ-S processor. This register specifies the configuration used to
enable and disable the caches and MMU. All defined control bits are set to zero on reset except the V bit and the B
bit. [3,4]

Bit Name Function


[31:19] Reserved.
When read returns an Unpredictable value.
When written Should Be Zero, or a value read from bits [31:19] on the same processor.
Using a read-modify-write sequence when modifying this register provides the greatest future
compatibility.
[18] Reserved, SBO. Read = 1, write = 1.
[17] Reserved, SBZ. Read = 0, write = 0.
[16] Reserved, SBO. Read = 1, write = 1.
[15] L4 bit Determines if the T bit is set when load instructions change the PC:
0 = loads to PC set the T bit
1 = loads to PC do not set T bit (ARMv4 behavior).
[14] LFU bit Replacement strategy for I Cache and DCache:
0 = LRU
1 = LFU
[13] V bit Location of exception vectors:
0 = Normal exception vectors selected, address range = 0x0000 0000 to
0x0000 001C
1 = High exception vectors selected, address range = 0xFFFF 0000 to
0xFFFF 001C.
Set to the value of VINITHI on reset.
[12] I bit I Cache enable/disable:
0 = I Cache disabled
1 = I Cache enabled.
[11:10] SBZ
[9] R bit ROM protection.
This bit modifies the ROM protection system.
[8] S bit System protection.
This bit modifies the MMU protection system.
[7] B bit Endianness: 0 = Little-endian operation 1 = Big-endian operation. Set to the value of BIGENDINIT on
reset.
[6:3] Reserved SBO
[2] C bit DCache enable/disable:
0 = Cache disabled
1 = Cache enabled.
[1] A bit Alignment fault enable/disable:
0 = Data address alignment fault checking disabled
1 = Data address alignment fault checking enabled.
[0] M bit MMU enable/disable:
0 = disabled
1 = enabled.
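
Putting the c1 bits above to use, a hedged read-modify-write sequence enabling the MMU, both caches, and the project-specific LFU replacement bit might look as follows (the CP15 helpers are the ones sketched earlier; the LFU bit position is taken from the table above, not from the standard ARM926 manual):

```c
#include <stdint.h>

/* CP15 c1 helpers from the earlier inline-assembly sketch. */
uint32_t cp15_read_control(void);
void     cp15_write_control(uint32_t val);

#define C1_M_BIT   (1u << 0)     /* MMU enable                                */
#define C1_C_BIT   (1u << 2)     /* DCache enable                             */
#define C1_I_BIT   (1u << 12)    /* ICache enable                             */
#define C1_LFU_BIT (1u << 14)    /* replacement strategy (project-specific:
                                    1 = LFU, per the bit table above)          */

/* Read-modify-write of c1, as recommended above for future compatibility. */
void enable_mmu_and_caches(void)
{
    uint32_t c1 = cp15_read_control();
    c1 |= C1_M_BIT | C1_C_BIT | C1_I_BIT | C1_LFU_BIT;
    cp15_write_control(c1);
}
```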

The behavior of the D/I caches, depending on whether they and the MMU are enabled, directly influences the behavior on the combined AMBA bus, as summarized below.

Cache and MMU behavior exhibited


ICache disabled, MMU enabled or disabled: all instruction fetches are from external memory (AHB).
ICache enabled, MMU disabled: all instruction fetches are cachable, with no protection checks; all addresses are flat mapped, that is VA = MVA = PA.
ICache enabled, MMU enabled: instruction fetches are cachable or noncachable, and protection checks are performed; all addresses are remapped from VA to PA, depending on the MMU page table entry, that is, VA translated to MVA, MVA remapped to PA.
DCache disabled, MMU enabled or disabled: all data accesses are to external memory (AHB).
DCache enabled, MMU disabled: all data accesses are noncachable and nonbufferable; all addresses are flat mapped, that is VA = MVA = PA.
DCache enabled, MMU enabled: all data accesses are cachable or noncachable, and protection checks are performed; all addresses are remapped from VA to PA, depending on the MMU page table entry, that is, VA translated to MVA, MVA remapped to PA.

Cache Operations Register c7


Register c7 controls the caches and the write buffer. The function of each cache operation is selected by the Opcode_2 and CRm fields in the MCR instruction used to write to CP15 c7. The cache design for our specification becomes evident in this register. [3,4]
The format description provides the MVA and set/way address formats of the c7 cache operations. For a 32 KByte, 4-way set-associative cache with 8-word lines, the figure below shows the set/way format in accordance with the user configuration.
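
As an illustrative sketch of a set/way operation through c7 (the MCR encoding for "clean DCache entry by set/way" is the standard ARMv5 one; the exact bit positions of the set and way fields below are an assumption for the 32 KB, 4-way, 8-word-line configuration and should be checked against the TRM):

```c
#include <stdint.h>

/* Clean one DCache line by set/way: MCR p15, 0, Rd, c7, c10, 2.
   Assumed set/way word layout: way in bits [31:30], set index from bit 5. */
static inline void dcache_clean_set_way(uint32_t set, uint32_t way)
{
    uint32_t sw = (way << 30) | (set << 5);
    __asm__ volatile("mcr p15, 0, %0, c7, c10, 2" : : "r"(sw));
}

/* Clean the whole 32 KB DCache: 256 sets x 4 ways for 32-byte lines. */
void dcache_clean_all(void)
{
    for (uint32_t set = 0; set < 256; set++)
        for (uint32_t way = 0; way < 4; way++)
            dcache_clean_set_way(set, way);
}
```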

TLB Operations Register c8


This is a write-only register used to control the Translation Lookaside Buffer (TLB). There is a single TLB used to hold entries for both data and instructions. The TLB is divided into two parts: a set-associative part and a fully associative part.
The fully-associative part (also referred to as the lockdown part of the TLB) is used to store entries to be locked
down. Entries held in the lockdown part of the TLB are preserved during an invalidate TLB operation. Entries can
be removed from the lockdown TLB using an invalidate TLB single entry operation. [3,4]

The Cache Lockdown Register c9, the Tightly Coupled Memory (TCM) Region Register c9, and the TLB lockdown registers have no changes to them other than the field selection with respect to the required associativity fields and the size of the split cache. [3,4]

The complete description of the entire register control scheme of the ARM926EJ-S is provided in the ARM926 technical reference manual. Much of it does not affect this part of the design; if a complete reading is necessary, the reader should refer to the source material [3,4]. This now leads us towards the paging scheme of the virtual pages for data access and storage.

Minimal to no change is made to registers c2 to c6 and c9 to c15. A detailed description of these registers is given in the architectural reference manual. [3,4]

VMA Descriptor
A finer description of the VMA and its large, small, and tiny pages, in order of occurrence for the level of memory accessed, is provided in depth within the ARM reference manual. A diagram of the breakdown of the page descriptors for the memory access levels is provided here, but since this specification has no significant impact on our MMU design requirement, the entire page segmentation and addressing can be referenced as-is from the manual [3,4].

Level Descriptors (the first, the second, and the third) [3,4]


Bits [31:14] of the TTBR are concatenated with bits [31:20] of the MVA to produce a 30-bit address.

A section descriptor provides the base address of a 1MB block of memory. The page table descriptors provide the base address of a page table that contains second-level descriptors. There are two sizes of page table:
Coarse page tables have 256 entries, splitting the 1MB that the table describes into 4KB blocks.
Fine page tables have 1024 entries, splitting the 1MB that the table describes into 1KB blocks.

Bits (Section / Coarse / Fine)   Description
[31:20] / [31:10] / [31:12]      These bits form the corresponding bits of the physical address.
[19:12] / - / -                  SBZ (should be zero).
[11:10] / - / -                  Access permission bits. Access permissions and the fault address and fault status registers show how to interpret the access permission bits.
[9] / [9] / [11:9]               SBZ.
[8:5] / [8:5] / [8:5]            Domain control bits.
[4] / [4] / [4]                  Must be 1.
[3:2] / - / -                    Bits C and B indicate whether the area of memory mapped by this page is treated as write-through cacheable.
- / [3:2] / [3:2]                Should be zero.
[1:0] / [1:0] / [1:0]            These bits indicate the page size and validity.

Although in the opcode syntax the [3:2] fields of section, coarse, and fine pages for the VMA could be configured as write-back or as non-cached (the other possible formats), our design requirement configures them as write-through with write-allocate, which is just another way of saying that on a write miss the second-level write buffer (cacheable) enters the system flow.

For a detailed reference to the translation table entry values, the reader should refer to the ARM architecture manual itself. Similarly, the second-level descriptor is the part of the translation that ultimately provides the physical address of the page being referenced to access the memory component. A detailed description is not given here; the specifications remain the same as in the ARM926EJ-S architecture, although the system flow diagram representing the bit distribution for address calculation is provided as an observation. This is followed lastly by the third-level descriptor, more detail about which is provided in the actual ARM926 technical reference manual [3,4].

Write Policies (replacement algorithms) Taxonomy equations [2,4]


The cache write policy governs how writes propagate and which algorithm replaces lines within the set-associative organization. When the processor core writes to memory, the cache controller has two alternatives for its write policy: it can write to both the cache and main memory, updating the values in both locations, an approach known as write-through, or it can update only the cache and defer the memory update (write-back). For our design requirement, the write policy used here is write-through with write-allocate. The complete system flow for the write policy is shown below, along with the inclusion of the write buffer for accessing the lower cache level. With a write-through, write-allocate cache, the controller writes to both cache and main memory on a cache write hit, ensuring that the cache and main memory stay coherent at all times. Under this policy, the cache controller performs a write to main memory for each write to cache memory; because of this extra write, a write-through policy is slower than a write-back policy.
On a cache miss, the cache controller must select a cache line from the available set in cache memory to store the new information from main memory. The cache line selected for replacement is known as a victim. If the victim contains valid data that is dirty (as in a write-back policy, or under a purely temporal LRU scheme), the controller must write the dirty data from the cache memory to main memory before it copies new data into the victim cache line. The process of selecting and replacing a victim cache line is known as eviction.
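
A minimal sketch of LFU victim selection within one 4-way set, matching the replacement policy chosen for this design (the counter handling and tie-breaking are assumptions):

```c
#include <stdint.h>
#include <stdbool.h>

#define WAYS 4u

typedef struct {
    bool     valid;
    uint32_t tag;
    uint32_t use_count;   /* incremented on every hit to this line */
} lfu_line_t;

/* Pick the victim in one set under LFU: prefer an invalid (free) line,
   otherwise the line with the smallest use count.                     */
uint32_t lfu_select_victim(const lfu_line_t set[WAYS])
{
    uint32_t victim = 0;
    for (uint32_t way = 0; way < WAYS; way++) {
        if (!set[way].valid)
            return way;                       /* free line: no eviction needed */
        if (set[way].use_count < set[victim].use_count)
            victim = way;                     /* least frequently used so far  */
    }
    return victim;
}
```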
On a read hit, let RH be the time needed to get the data from this cache to one level higher; the size of the data transferred is whatever the higher level needs. On a read miss, we pay RH plus some time RLL to read one block (block size of this cache) from the lower level, which includes the time for the lower level to find that block.
On a write hit, we pay some time WH to get the data from one level higher into this cache, plus some time WLL to get the same piece of data from this cache to one level lower (which includes the time for the lower level to deal with the write); the size of the data transferred is whatever the higher level needs to write. On a write miss, we pay WH + WLL plus some time RLL (the same price paid on a read miss) to read one block (block size of this cache) from the lower level.

Read time for this cache: R = RH + miss_rate * RLL

Write time for this cache: W = WH + WLL + miss_rate * RLL

Now a small variation is added for the write buffer. Assume the write buffer can be used in a fraction Wb of the transfers between the cache we are considering and the lower level. We write to the lower level on every write hit and every write miss, but we pay the lower-level access latency only when the write buffer cannot be used.

Read time for this cache: R = RH + miss_rate * RLL

Write time for this cache: W = WH + (1 - Wb) * WLL + miss_rate * RLL
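
These relations translate directly into code; the following helper evaluates them for arbitrary parameters (the names are ours, not from the course script):

```c
/* Average read/write access times for one cache level, per the equations
   above.  All times are in cycles; wb_coverage is the fraction Wb of
   lower-level writes absorbed by the write buffer.                       */
typedef struct {
    double rh;           /* read-hit time                                  */
    double wh;           /* write-hit time (data from the level above)     */
    double rll;          /* time to fetch one block from the lower level   */
    double wll;          /* time to push the write to the lower level      */
    double miss_rate;    /* fraction of accesses that miss in this cache   */
    double wb_coverage;  /* Wb: fraction of writes covered by the buffer   */
} cache_timing_t;

double read_time(const cache_timing_t *c)
{
    return c->rh + c->miss_rate * c->rll;
}

double write_time(const cache_timing_t *c)
{
    /* write-through / write-allocate: the lower level is written on every
       write, but its latency is hidden whenever the buffer can take it.   */
    return c->wh + (1.0 - c->wb_coverage) * c->wll + c->miss_rate * c->rll;
}
```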

Taxonomy of design requirement


The taxonomy calculations must be carried out in order to evaluate the design specifications for the required hardware solution of the ARM MMU design. As much of it is based on the ARM926EJ-S core family, few liberties were taken with the design specifics. Since the taxonomy has to be carried out from the available statistical data, the SPEC2009 benchmark in our case study, certain assumptions about the cache design are fixed before going any further.
For the split-level 32/32 KByte D/I cache design, the instruction access percentage is assumed to be 75% and the data access percentage 25%. Similarly, from the SPEC2009 table for split cache designs, the instruction and data miss rates are taken to be 0.39% and 4.82% respectively. Moreover, a hit is assumed to take 1 clock cycle, and the corresponding miss penalty is assumed to be 25 instruction cycles.
The overall miss rate for our design can be derived from the formula taken from the course script:

P_miss.all = (Instr_frac * P_I-miss) + (Data_frac * P_D-miss)

P_miss.all = (75% * 0.39%) + (25% * 4.82%)
P_miss.all = 1.5%, which is the miss rate of the split-level caches.

To provide a point of comparison, the same taxonomy is carried out for a similar split-level cache of 16/16 KBytes with a miss penalty of 50 instruction cycles per miss.

Using the same relation, the miss rate of that cache comes out to 1.7%, while a unified 32 KByte cache gives a miss rate of 2.10%. To make sense of these figures we derive one more measure, the average access time, expressed as

average access time = instruction access fraction * (hit time + instruction miss rate * miss penalty) + data access fraction * (hit time + data miss rate * miss penalty)

For our design requirement this gives
(75%) * (1 + 0.39% * 25) + (25%) * (1 + 4.82% * 25) = 1.374 cycles.

The same calculation for the split-level 16/16 KByte design gives 1.7487 cycles, and for the unified 32 KByte design 2.24 cycles (these two are computed with the 50-cycle miss penalty).
The results show that the split organization decreases the access latency, helped by the write-through policy with buffers to the lower level; with this addition, stalls are nearly eliminated and bubbles need not be included in the write algorithm design to begin with.
From this point on, the design taxonomy is provided only for our design, the 32/32 KByte split-level cache.
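
The two calculations above can be reproduced with a short program (it simply re-evaluates the figures used in this section: 75/25 access mix, 0.39%/4.82% miss rates, 25-cycle miss penalty):

```c
#include <stdio.h>

int main(void)
{
    const double instr_frac = 0.75, data_frac = 0.25;    /* access mix        */
    const double i_miss = 0.0039, d_miss = 0.0482;       /* SPEC miss rates   */
    const double hit_time = 1.0, penalty = 25.0;         /* cycles            */

    double overall_miss = instr_frac * i_miss + data_frac * d_miss;
    double avg_access   = instr_frac * (hit_time + i_miss * penalty)
                        + data_frac  * (hit_time + d_miss * penalty);

    printf("overall miss rate = %.2f%%\n", overall_miss * 100.0);  /* ~1.50%  */
    printf("average access    = %.3f cycles\n", avg_access);       /* ~1.374  */
    return 0;
}
```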

Now we come to the memory stall cycle calculation. The formulae, taken from the lecture script, give a complete sense of how the cache design has to be implemented and how it should behave. From the previous calculations we know that the miss rate of the separate caches is 1.5% and that the memory reference latency is 1.37 cycles. Performance including the cache misses is expressed through the CPU time:
Total CPU time = (CPU execution clock cycles + memory stall clock cycles) * clock cycle time
CPU time = instruction count * (CPI_execution + memory stall cycles per instruction) * clock cycle time

Expanding this relation in tabular form gives the result:


CPU time = Ic * 2.5 * clock cycle time

The last step in the cache performance evaluation is the actual evaluation of the designated write algorithm, write-through with write-allocate, together with the 4-way set-associative organization of our cache design. Starting with the cache performance related to the write algorithm, the read and write miss cycle percentages are worked out using the formulae stated earlier in the write algorithm section. Before going any further, assume that the ideal CPI, as taken from the lecture script, is 1.1 cycles per instruction; this is instrumental in determining the speed-up factor for the final performance evaluation. Again using the same miss rate, the read miss penalty was calculated to be 1.03 cycles.
Using this data, the read miss share for our write algorithm comes out to 79.23%, which consequently makes the write miss share 20.76%.
Nscma = (79.2% * 1.5% * 25) + (20.76% * 25)
Nscma = 4.51 cycles
Nscai = 1.3 references per instruction * 4.51 cycles = 5.863 cycles per instruction
CPI recalculated = 1.1 + 5.863 = 6.963 cycles per instruction
So the finalized cache performance, compared with the ideal CPI of 1.1 cycles, is calculated to be
6.963 / 1.1 = 6.33; this number represents the form factor, i.e., the ratio of the achieved CPI to the ideal CPI, for our design with the write algorithm used.
Since the miss rate is 1.5% for our design of separate caches, and assuming a 10 MHz memory access speed (0.1 us) and a processor clock of 1 GHz (1 ns), the CPU time for the 4-way set-associative design is calculated as follows:
T_CPU = Ic * (1 ns * 1.1 * 4 * 6.963 + 1.3% * 1.5% * 0.1 us)
T_CPU = Ic * 29.80 ns per instruction

Bibliography
1- AMBA specification for bus arbitration: AHB, ASB, and APB combinations with the DMA controller.
2- Andrew N. Sloss, Dominic Symes, and Chris Wright, with a contribution by John Rayfield: ARM System Developer's Guide: Designing and Optimizing System Software.
3- ARM926EJ-S (r0p4/r0p5) Technical Reference Manual (courtesy of atmel.com).
4- ARM926EJ-S Architectural Reference Manual (courtesy of arm.com).
5- Behrooz Parhami: Introduction to Parallel Processing: Algorithms and Architectures. University of California at Santa Barbara.
6- William Stallings: Computer Organization and Architecture: Designing for Performance, eighth edition.
7- Hennessy and Patterson: Computer Organization and Design, and Computer Architecture: A Quantitative Approach.
