Produced by:
SALEEM, Muhammad Umair
Masters of Electrical Engineering
Matriculation Number 25279
HS-Weingarten
Supervising Professor:
Prof. Dr.-Ing. Andreas Siggelkow
Professor für Elektrotechnik und Elektronik
Hochschule Ravensburg-Weingarten
Introduction
As a sequential step towards developing the architecture of the ARM9XX family, it must be determined which existing system is used as the baseline for development and what can be done differently to modify it. Statistical results presented online and in book citations will be used where necessary to back up the findings. However, most of the detailed design is developed from scratch, because small changes in algorithm cumulatively play a much larger part as the system specifications advance during development of the MMU. The core chosen for this project is the ARM926EJ-S integer core [3,4]. This architecture was selected as the basis for development because the processor is readily available and shares certain similarities required by our design. One point that can be taken for granted in this scheme is that the ARM926EJ-S uses a Harvard architecture with bus arbitration [1], although changes will be made in the actual design and algorithm for the specific project discussed here. This paper proceeds step by step: first bus arbitration for interfacing, derived from the AMBA specifications [1]; then an overview of the DMAC (Direct Memory Access Controller); followed by the actual cache implementation, which leads to the design of the MMU in question, complete with the paging techniques for virtual addresses [4] along with the mapping, writing, and replacement algorithms for the processor. In the context of this technical data, register specifications will also be given, based on the 32-bit ARM instruction set (not the 16-bit THUMB instruction set). Statistical data based on formulae cited from the literature will also be provided where relevant.
Details of all the selection parameters will be provided in due time. It suffices to say here that our design employs LFU replacement with write-through and write-allocate on a split-level cache of 4-way (low) associativity for mapping data blocks to lines. These are the orthogonal design choices that make up a cache: any cache design, whether software-based or entirely solid state, comprises a logical storage organization, one or more content-management heuristics, and one or more consistency-management heuristics.
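To make these three orthogonal choices concrete, here is a minimal C sketch (all identifiers are hypothetical, not taken from the source design) that records one value along each axis, using the choices named above for our design:

```c
#include <stddef.h>

enum replacement_policy { REPL_LFU, REPL_LRU, REPL_RANDOM };  /* content management  */
enum write_policy { WRITE_THROUGH_ALLOCATE, WRITE_BACK };     /* consistency management */

/* One choice per orthogonal axis; values mirror the design stated above. */
struct cache_config {
    size_t capacity_bytes;          /* logical storage organization ...   */
    size_t line_bytes;              /* ... down to the line granularity   */
    unsigned ways;                  /* 4 in our design (low associativity) */
    enum replacement_policy repl;   /* LFU in our design                  */
    enum write_policy write;        /* write-through with write-allocate  */
};
```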
The phenomenon of exhibiting predictable, non-random memory-access behavior is called locality of reference. The behavior is so named because a program's memory references tend to be localized in time and space. If the program references a datum once, it is likely to reference that datum again in the near future; it is also likely to reference nearby data in the near future. The first of these phenomena is called temporal locality, and the second is called spatial locality. The locality argument that follows is based on the idea of temporal locality.
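As a simple, hypothetical illustration of the two kinds of locality:

```c
#include <stddef.h>

long sum_array(const long *a, size_t n) {
    long sum = 0;          /* 'sum' is touched every iteration: temporal locality */
    for (size_t i = 0; i < n; i++)
        sum += a[i];       /* a[i], a[i+1], ... are adjacent in memory: spatial locality */
    return sum;
}
```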
For our design we configure a single-level cache; a write buffer is included between the split-level halves of the cache. This write buffer results from the writing algorithm that will be defined later on, and it represents the hardware bifurcation of the cache-level design into split-level caches [4]. The cache contains a copy of portions of main memory. When the processor attempts to read a word of memory, a check is made to determine whether the word is in the cache. If so, the word is delivered to the processor. If not, a block of main memory, consisting of some fixed number of words, is read into the cache and then the word is delivered to the processor. Because of the phenomenon of locality of reference, when a block of data is fetched into the cache to satisfy a single memory reference, it is likely that there will be future references to that same memory location or to other words in the block.
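A minimal sketch of this read flow, with the cache and memory helpers left as hypothetical prototypes (assumed provided elsewhere):

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the original design. */
struct cache_line;
struct cache_line *cache_lookup(uint32_t address);
struct cache_line *fetch_block_from_memory(uint32_t address);
uint32_t word_from_line(const struct cache_line *line, uint32_t address);

/* The read flow from the text: check the cache; on a miss, bring the
 * whole block in first, then deliver the requested word. */
uint32_t cache_read(uint32_t address) {
    struct cache_line *line = cache_lookup(address);
    if (line == NULL)
        line = fetch_block_from_memory(address);
    return word_from_line(line, address);
}
```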
The figure below depicts the structure of a cache/main-memory system. Main memory consists of up to 2^n addressable words, with each word having a unique n-bit address. For mapping purposes, this memory is considered to consist of a number of fixed-length blocks of K words each. That is, there are M = 2^n / K blocks in main memory. The cache consists of m blocks, called lines. Each line contains K words, plus a tag of a few bits. Each line also includes control bits (not shown), such as a bit to indicate whether the line has been modified since being loaded into the cache. The length of a line, not including tag and control bits, is the line size. The line size may be as small as 32 bits, with each word being a single byte; in this case the line size is 4 bytes. The number of lines is considerably less than the number of main-memory blocks (m << M).
The figure below illustrates the read operation. The processor generates the read address (RA) of a word to be read. Although the figure is taken from a book [5,6,7], it closely resembles what needs to be implemented in our system as well.
Because a cache is typically much smaller than the backing store, there is a good possibility that any particular requested datum is not in the cache. Therefore, some mechanism must indicate whether any particular datum is present in the cache. The cache tags fulfill this purpose. The tags, a set of block IDs, comprise a list of valid entries in the cache, with one tag per data entry. The basic structure is illustrated in the figure above, which shows a cache as a set of cache entries [Smith 1982, cited in 2]. This can take any form: a solid-state array (e.g., an SRAM implementation) or, if the cache is implemented in software, a binary search tree or even a short linked list.
The design parameter of writing to the cache (mapping technique for 4-way set associativity)
The design for keeping the cache consistent with itself is a compromise between two extremes: a set-associative cache lies between direct-mapped and fully associative designs and often reaps the benefits of both, namely fast lookup and lower contention [6].
As the associativity of a cache controller goes up, the probability of thrashing goes down. The ideal goal would be
to maximize the set associativity of a cache by designing it so any main memory location maps to any cache line.
A cache that does this is known as a fully associative cache.
However, as the associativity increases, so does the complexity of the hardware that supports it. One method used by hardware designers to increase the set associativity of a cache is a content-addressable memory (CAM). A CAM uses a set of comparators to compare the input tag address with a cache-tag stored in each valid cache line. A CAM works in the opposite way a RAM works: where a RAM produces data when given an address value, a CAM produces an address if a given data value exists in the memory. Using a CAM allows many more cache-tags to be compared simultaneously, thereby increasing the number of cache lines that can be included in a set. Our solution makes use of the fact that, since we employ low associativity, the content-addressable memory scheme can remain unused for hardware simplicity. [2]
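The point about avoiding a CAM at low associativity can be sketched as follows: a 4-way lookup compares the request tag against at most four stored tags in one set, which hardware can do with four parallel comparators rather than a full CAM. This is an illustrative sketch, not the actual ARM926EJ-S implementation; the set count anticipates the derivation in the next section.

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4     /* low associativity: only 4 comparators per set */
#define SETS 1024  /* 2^10 sets, derived in the Design Calculations below */

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

/* Compare the request tag against the (at most) WAYS tags in one set;
 * hardware performs these comparisons in parallel. */
static bool is_hit(uint32_t tag, uint32_t set, unsigned *way_out) {
    for (unsigned w = 0; w < WAYS; w++) {
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way_out = w;
            return true;
        }
    }
    return false;
}
```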
Design Calculations
For the main memory to have 2^n addressable words, each word needs a unique n-bit address. With a block size of 2^4 bytes, this arrives at the formulation of M = 2^n / 2^4 blocks in main memory; as we already know, m is the number of lines in the cache.
For our design, the addressable space is allocated to be 32 bits, which for the split cache implies 32/32 KB (64 KB total) D-cache/I-cache respectively: 2^16 = 65536 bytes = 64 KB.
The 4-way set associativity aids in deriving the block size, i.e. 2^4 = 16 bytes; in other words, 2^(k) for k-way set associativity gives the required block size of the set. Describing everything in powers of 2:
Block size 2^(4) and associativity 2^(2)
From the block size, the width of the offset field can be determined as log2(block size), which gives log2(16) = 4 bits.
For the width of the set field, we need the number of sets, and in turn for that, we need the number of blocks. Since our design uses 4-way set associativity, each set has 4 blocks.
Number of blocks = capacity / block size = 2^(16) / 2^(4) = 2^(12)
In other words, the cache holds 2^12 = 4096 lines in total. Dividing by the associativity gives the number of sets:
2^(12) / 2^(2) = 2^(10)
So the set field for the cache design can be taken to be 10 bits wide.
Now if we observe the diagram given on the next page [6], the tag, set, and offset fields are justified by the cache design requirements. For the design of the 32-bit ARM core, as we see from the diagram [Stallings, 6], the word or offset field (w) has a length of 4 bits and the set field (d) is taken as 10 bits wide; consequently the tag field (s - d), which can be rewritten as (32 - 10 - 4) using the values derived earlier, comes out to be 18 bits wide. The extra storage required for tag addressing in the cache is therefore 18 bits per line.
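The derivation can be checked with a small, hedged C program; nothing here comes from the ARM manuals, it simply re-evaluates the arithmetic above and splits an arbitrary sample address into its fields:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    unsigned addr_bits = 32;
    unsigned capacity  = 1u << 16;  /* 64 KB total (32/32 KB split)   */
    unsigned block     = 1u << 4;   /* 16-byte lines                  */
    unsigned ways      = 1u << 2;   /* 4-way set associative          */

    unsigned lines = capacity / block;  /* 2^12 = 4096 lines          */
    unsigned sets  = lines / ways;      /* 2^10 = 1024 sets           */

    unsigned offset_bits = 4;           /* log2(block size)           */
    unsigned set_bits    = 10;          /* log2(number of sets)       */
    unsigned tag_bits    = addr_bits - set_bits - offset_bits;  /* 18 */

    uint32_t addr   = 0x80001234u;      /* arbitrary sample address   */
    uint32_t offset = addr & (block - 1);
    uint32_t set    = (addr >> offset_bits) & (sets - 1);
    uint32_t tag    = addr >> (offset_bits + set_bits);

    printf("lines=%u sets=%u offset=%u set=%u tag=%u bits\n",
           lines, sets, offset_bits, set_bits, tag_bits);
    printf("addr=0x%08x -> tag=0x%x set=%u offset=%u\n",
           addr, tag, set, offset);
    return 0;
}
```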
The figure below shows a representation of the k-way set-associative mapping we need to establish for the cache MMU. Although the figure provided is merely an extension of the two-block hit-or-miss comparison (the figure is taken from Stallings [6]), the expressions have been modified to fit our required design features and our 4-way, low-associativity mapping.
Before the management of cache coherence, we first establish the context of virtual vs. physical addressing for the MMU.
MVA formulation: Whys and Hows
There are four configurations of the cache MMU for determining the prevailing addresses and the instructions to follow for execution: physically indexed, physically tagged; physically indexed, virtually tagged; virtually indexed, physically tagged; and finally virtually indexed, virtually tagged. The details of these configurations can be found in sources such as [2, 6, 7]. As background for the reader, our chosen design falls in the physically indexed, physically tagged category. Its benefits will become apparent shortly.
Our design rests on the premise of physically indexed, physically tagged addressing. Relative to the specifications in the ARM926EJ-S technical manual, very little has been changed; the only changes are in the page-referencing bits for coarse, small, and tiny pages. The design algorithm is given below, and a detailed working will be presented thereafter.
The cache is indexed and tagged by its physical address. Therefore, the virtual address must be translated before the cache can be accessed. The advantage of this design is that since the cache uses the same namespace as physical memory, it can be entirely controlled by hardware, and the operating system need not concern itself with managing the cache. In this case the cache size can grow only by increasing the associativity of the cache itself.
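The resulting lookup order can be sketched as follows (hypothetical helper names; only the ordering matters): translation must complete before the physically addressed cache can even be indexed.

```c
#include <stdint.h>

/* Hypothetical helpers standing in for the TLB and the physical cache. */
uint32_t tlb_translate(uint32_t mva);   /* MVA -> physical address        */
uint32_t phys_cache_read(uint32_t pa);  /* indexed AND tagged by PA       */

/* Physically indexed, physically tagged: translate first, then look up. */
uint32_t load_word(uint32_t mva) {
    uint32_t pa = tlb_translate(mva);   /* translation first ...          */
    return phys_cache_read(pa);         /* ... then the cache access, by PA */
}
```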
Translated Entries
The main TLB caches 128 translated entries. If, during a memory access, the main TLB contains a translated entry for the MVA, the MMU reads the protection data to determine if the access is permitted:
If access is permitted and an off-chip access is required, the MMU outputs the appropriate physical address corresponding to the MVA.
If access is permitted and an off-chip access is not required, the cache or TCM services the access.
If access is not permitted, the MMU signals the CPU core to abort.
If the TLB misses (it does not contain an entry for the MVA), the translation table walk hardware is invoked to
retrieve the translation information from a translation table in physical memory. When retrieved, the translation
information is written into the TLB, possibly overwriting an existing value. To enable use of TLB locking features,
the location to be written can be specified using CP15 c10 TLB Lockdown Register.
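The three outcomes plus the miss path can be summarized in a hedged sketch; the entry layout and predicate names below are illustrative, not the actual ARM926EJ-S TLB format.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TLB_ENTRIES 128  /* the main TLB caches 128 translated entries */

struct tlb_entry {
    bool valid;
    uint32_t mva_tag;    /* which MVA this entry translates            */
    uint32_t pa_base;    /* translated physical base address           */
    unsigned perms;      /* protection data checked on every access    */
};

static struct tlb_entry tlb[TLB_ENTRIES];

enum outcome { USE_PA_OFFCHIP, SERVICED_BY_CACHE_OR_TCM, ABORT, TLB_MISS_WALK };

/* Hypothetical predicates, assumed provided elsewhere. */
bool access_permitted(unsigned perms, bool is_write);
bool needs_offchip_access(uint32_t pa);

/* Decision flow as described in the text above. */
enum outcome mmu_access(const struct tlb_entry *e, bool is_write) {
    if (e == NULL || !e->valid)
        return TLB_MISS_WALK;                /* invoke table-walk hardware   */
    if (!access_permitted(e->perms, is_write))
        return ABORT;                        /* MMU signals the core to abort */
    return needs_offchip_access(e->pa_base) ? USE_PA_OFFCHIP
                                            : SERVICED_BY_CACHE_OR_TCM;
}
```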
CP15 can be accessed via the MRC (move to ARM register from coprocessor) and MCR (move to coprocessor from ARM register) instructions, used in complementary pairs. The figure for configuring the opcode is shown below.
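As a concrete companion to that figure, the following sketch uses GCC-style ARM inline assembly; the coprocessor register numbers follow the TRM [3,4], but this snippet is an assumption-laden illustration for privileged code, not part of the original design.

```c
#include <stdint.h>

/* Read the c0 Cache Type Register with MRC (opcode2 = 1 selects cache type),
 * and, for contrast, write the c1 Control Register with MCR. */
static inline uint32_t read_cache_type(void) {
    uint32_t val;
    __asm__ volatile("mrc p15, 0, %0, c0, c0, 1" : "=r"(val));
    return val;
}

static inline void write_control_reg(uint32_t val) {
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" :: "r"(val));
}
```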
Register descriptions
The following registers are described in this section:
ID Code, Cache Type, and TCM Status Registers c0
Control Register c1
Translation Table Base Register c2
Domain Access Control Register c3
Register c4
Fault Status Registers c5
Fault Address Register c6
Cache Operations Register c7
TLB Operations Register c8
Cache Lockdown and TCM Region Registers c9
TLB Lockdown Register c10
Registers c11 and c12
Process ID Register c13
Register c14
Test and Debug Register c15
A brief working description of the registers, in line with our design specifications, is given below.
The Assoc field determines the cache associativity in conjunction with the M bit. The M bit is the multiplier bit; it determines the cache size and cache associativity values in conjunction with the Size and Assoc fields. If the cache is present, M must be set to 0; if the cache is absent, M must be set to 1. For the ARM926EJ-S processor, M is always set to 0. The Len field determines the line length of the cache. The size of the cache is determined by the Size field and the M bit. The M bit is 0 for the DCache and ICache. The Size field is bits [21:18] for the DCache and bits [9:6] for the ICache. The minimum size of each cache is 4 KB, and the maximum size is 128 KB. For our 32 KB caches, the Size field takes the corresponding value b0110. The associativity of the cache is determined by the Assoc field and the M bit. The Assoc field is bits [17:15] for the DCache and bits [5:3] for the ICache; for 4-way associative mapping it takes the value b010 in the cache-associativity encoding (M = 0). The line length of the cache is determined by the Len field, which is bits [13:12] for the DCache and bits [1:0] for the ICache. The cache type register values for an ARM926EJ-S processor with this configuration are shown below.
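A hedged decode helper, built only from the bit positions quoted in this section (the encodings b0110 and b010 correspond to 32 KB and 4-way, as stated above):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the cache type register fields at the bit positions quoted above. */
void decode_cache_type(uint32_t ctype) {
    unsigned dsize  = (ctype >> 18) & 0xFu; /* DCache Size,  b0110 = 32 KB   */
    unsigned dassoc = (ctype >> 15) & 0x7u; /* DCache Assoc, b010  = 4-way   */
    unsigned dm     = (ctype >> 14) & 0x1u; /* DCache M bit, 0 on ARM926EJ-S */
    unsigned dlen   = (ctype >> 12) & 0x3u; /* DCache Len                    */
    unsigned isize  = (ctype >> 6)  & 0xFu; /* ICache Size                   */
    unsigned iassoc = (ctype >> 3)  & 0x7u; /* ICache Assoc                  */
    unsigned ilen   =  ctype        & 0x3u; /* ICache Len                    */

    printf("D: size=%u assoc=%u M=%u len=%u | I: size=%u assoc=%u len=%u\n",
           dsize, dassoc, dm, dlen, isize, iassoc, ilen);
}
```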
Control Register c1
Register c1 is the Control Register for the ARM926EJ-S processor. This register specifies the configuration used to
enable and disable the caches and MMU. All defined control bits are set to zero on reset except the V bit and the B
bit. [3,4]
The behavior of the different D/I caches will directly influence the behavior of the combined AMBA bus specification. The Cache Lockdown Register c9, the Tightly Coupled Memory (TCM) Region Register c9, and the TLB lockdown registers have no changes other than the field selection with respect to the required associativity fields and the size of the split cache. [3,4]
The complete description of the entire register control scheme for the ARM926EJ-S is available in the ARM926 technical reference manual. Much of the MMU design requirement does not affect this part of the design, but if a complete reading is necessary, the reader should refer to the source material [3,4]. This now leads us towards the paging scheme of the virtual pages for data access and storage. Minimal to no change is made in registers c2 to c6 and, respectively, c9 to c15; a detailed description of these registers is given in the architecture reference manual. [3,4]
MVA Descriptor
A finer description of the MVA and its large, small, and tiny pages, described in order of occurrence for the level of memory accessed, is provided in depth within the ARM reference manual. A diagram of the bifurcation of the page description for the level access of memory is provided here; but since this design specification has no significant impact on our MMU design requirement, the entire page segmentation and addressing can be referenced as-is from the manual [3,4].
Bits and description:
Section   Coarse    Fine
[31:20]   [31:10]   [31:12]   These bits form the corresponding bits of the physical address.
[19:12]   -         -         Should be zero (SBZ).
[11:10]   -         -         Access permission bits. See Access permissions and Fault address and fault status registers for how to interpret the access permission bits.
[9]       [9]       [11:9]    Should be zero (SBZ).
[8:5]     [8:5]     [8:5]     Domain control bits.
[4]       [4]       [4]       Must be 1.
[3:2]     -         -         Bits C and B indicate whether the area of memory mapped by this page is treated as write-through cacheable.
-         [3:2]     [3:2]     Should be zero.
[1:0]     [1:0]     [1:0]     These bits indicate the page size and validity.
Although, in the opcode syntax, the [3:2] fields for section, coarse, and fine pages of the MVA can also be selected as write-back or non-cached (the other possible formats), our design requirement configures them as write-through with write-allocate, which is just another way of saying that on a write miss the second-level write buffer (cacheable) comes into the system flow.
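In code form, the policy reads as follows; the helpers are assumed names, and only the ordering of the steps is the point of the sketch:

```c
#include <stdint.h>

/* Hypothetical helpers for the write path. */
int  cache_has_line(uint32_t pa);
void cache_fill_line(uint32_t pa);                   /* allocate on write miss   */
void cache_update_word(uint32_t pa, uint32_t w);
void write_buffer_enqueue(uint32_t pa, uint32_t w);  /* drains to the lower level */

/* Write-through with write-allocate, as configured via bits [3:2]:
 * every write also goes to the lower level (through the write buffer),
 * and a write miss first allocates the line in the cache. */
void cache_write(uint32_t pa, uint32_t w) {
    if (!cache_has_line(pa))
        cache_fill_line(pa);          /* write-allocate on a miss      */
    cache_update_word(pa, w);         /* keep the cache copy current   */
    write_buffer_enqueue(pa, w);      /* write-through via the buffer  */
}
```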
For a detailed reference to the values of the translation table entries, refer to the ARM architecture manual itself. Similarly, the second-level descriptor is the portion of the TLB entry that ultimately provides the physical address of the page being referenced; a detailed description will not be provided here. The specifications remain the same as in the ARM926EJ-S architecture; however, the system flow diagram representing the bit distribution for address calculation is provided as an observation. This is followed lastly by the third-level descriptor, about which more detail is provided in the ARM926 technical reference manual [3,4].
Now a small variation is added for the write buffer, assuming that the write buffer can be used in Wb cases between the cache we are considering and the lower level. We write to the lower level every time we have a write hit or a write miss, and we pay for this (again, the access latency) only when we cannot use the write buffer.
However, to provide a justification for our result, a similar analysis is carried out for a design specification of a similar split-level cache but with a size of 16/16 KB and a miss penalty of 50 instruction cycles per miss. Using the same relation, the miss penalty percentage of this cache comes to 1.7%, whereas a unified 32 KB cache gives a miss penalty percentage of 2.10%. We will derive another analysis of the design to make some sense of these figures. Next comes the average access time, which is expressed as follows:
Average access time = (instruction access percentage) * (hit time + instruction miss rate * miss penalty) + (data access percentage) * (hit time + data miss rate * miss penalty)
The design calculations are shown here only for our design requirement:
(75%) * (1 + 0.39% * 25) + (25%) * (1 + 4.82% * 25), which comes out to be 1.374 cycles.
The same calculation done for the split-level 16/16 KB cache gives 1.7487 cycles, and for the unified 32 KB design it comes out to be 2.24 cycles (remember that these two calculations are done with a 50-cycle miss penalty).
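The 1.374-cycle figure can be checked with a few lines of C, using exactly the inputs stated above:

```c
#include <stdio.h>

int main(void) {
    double hit_time = 1.0, penalty = 25.0;   /* cycles */
    /* 75%/25% instruction/data split; 0.39% and 4.82% miss rates. */
    double amat = 0.75 * (hit_time + 0.0039 * penalty)
                + 0.25 * (hit_time + 0.0482 * penalty);
    printf("average access time = %.3f cycles\n", amat);  /* prints 1.374 */
    return 0;
}
```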
The results show that increasing the split orientation decreases the access latency, owing to the use of write-through with buffers to the lower level. With this addition, stalls are nearly eliminated and bubbles need not be included in the writing algorithm design to begin with.
From this point on, the design analysis is provided only for our design of the 32/32 KB split-level cache.
Now we come to the memory stall cycle calculation. Referring to the lecture script, the formulae give a complete sense of how the cache design should be implemented and how it should behave. From the previous calculations we know that the miss penalty percentage of the separate caches is 1.5% and, similarly, that the memory reference latency is 1.37 cycles. Performance including the cache misses is the CPU time:
Total CPU time = (CPU execution clock cycles + memory stall clock cycle) * clock cycle time
CPU time = instruction count * (CPI_execution + Memory stall cycles per Instruction) * clock cycle time
The last step in the cache performance evaluation is the actual performance evaluation via the designated write algorithm (write-through with write-allocate) and the 4-way set-associative degree of our cache design. Starting with the cache performance relating to the write algorithm, the read and write miss cycle percentages are derived using the formulae stated before in the write algorithm section. Before going any further, let us first assume the perfect CPI, as taken from the lecture script, to be 1.1 cycles per instruction. This will be instrumental in determining the speed-up, or form factor ratio, for the final performance evaluation. Again using the same miss rate penalty percentage, the read miss penalty was calculated to be 1.03 cycles.
Using this data, the read miss percentage for our write algorithm comes out to be 79.23%, which consequently makes the write miss percentage 20.76%.
Nscma = (79.2% * 1.5% * 25) + (20.76% * 25)
Nscma = 4.51 cycles
Nscai = 1.3 memory accesses per instruction * 4.51 cycles = 5.863 cycles per instruction
CPI calculated again = 1.1 + 5.863 = 6.963 cycles per instruction
So the finalized cache performance analysis, in light of comparison with the perfect CPI access time of 1.1 cycles, is calculated to be
6.963 / 1.1 = 6.33. This number represents the form factor, which determines the unique speed-up factor for our design in accordance with the write algorithm utilized.
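The CPI and form factor above can be re-evaluated the same way, taking the document's stated values (perfect CPI of 1.1 and 4.51 stall cycles per memory access) as given:

```c
#include <stdio.h>

int main(void) {
    double cpi_perfect      = 1.1;          /* from the lecture script      */
    double stalls_per_instr = 1.3 * 4.51;   /* Nscai = 5.863 cycles/instr   */
    double cpi = cpi_perfect + stalls_per_instr;  /* 6.963 cycles/instr     */
    printf("CPI = %.3f, form factor = %.2f\n",
           cpi, cpi / cpi_perfect);               /* prints 6.963 and 6.33  */
    return 0;
}
```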
Since we already know that the miss penalty is 1.5% for our design of separate caches, and assuming a memory access speed of 10 MHz (0.1 us) and a processor clock of 1 GHz (1 ns), the TCPU is calculated as follows for a 4-way set-associative design:
TCPU = Ic * (1 ns * 1.1 * 4 * 6.963 + 1.3% * 1.5% * 0.1 us)
TCPU = Ic * 29.80 ns per instruction
Bibliography
1- AMBA Specification: bus arbitration; AHB, ASB, and APB combinations with the DMA controller.
2- Sloss, A. N.; Symes, D.; Wright, C.; with a contribution by Rayfield, J.: ARM System Developer's Guide: Designing and Optimizing System Software.
3- ARM926EJ-S (r0p4/r0p5) Technical Reference Manual (courtesy of atmel.com).
4- ARM926EJ-S Architecture Reference Manual (courtesy of arm.com).
5- Parhami, B.: Introduction to Parallel Processing: Algorithms and Architectures. University of California at Santa Barbara.
6- Stallings, W.: Computer Organization and Architecture: Designing for Performance, eighth edition.
7- Hennessy, J. and Patterson, D.: Computer Organization and Design, and Computer Architecture: A Quantitative Approach.