
Advanced Computer Architecture

Project Report

CSCI 5593
Spring Semester-2017

Simulating and Evaluating Shared Cache Replacement Algorithms for


Multi-Core Processors

Group Members

Alanoud Alslman
alanoud.alsalman@ucdenver.edu

Arwa Almalki
arwa.almalki@ucdenver.edu

Samaher Alghamdi
samaher.alghamdi@ucdenver.edu

Norah Almaayouf
norah.almaayouf@ucdenver.edu
Table of Contents

I. INTRODUCTION
II. DESIGN
III. IMPLEMENTATION
IV. TESTING ENVIRONMENT
V. EXPERIMENT
VI. RESULTS AND ANALYSIS
    I. First Experiment
    II. Second Experiment
VII. CONCLUSION
VIII. FUTURE WORK

Table of Figures

Figure 1: EW LRU pseudo code: evictEW procedure
Figure 2: EW LRU: updateEW procedure
Figure 3: EW SRRIP: evictEW procedure
Figure 4: EW flowchart
Figure 5: MRU-T pseudo code
Figure 6: One-to-many relationship between the CacheBase and CacheSet classes
Figure 7: CacheSetEWLRU class diagram
Figure 8: CacheSetEWSRRIP class diagram
Figure 9: CacheSetMRUT class diagram
Figure 10: CPI stacks generated using Sniper
Figure 11: Baseline average miss rate in shared LLC
Figure 12: Average miss rate for LRU and EW LRU in shared LLC
Figure 13: Baseline average miss rate for LRU in shared LLC
Figure 14: Execution time for LRU and EW LRU in shared LLC
Figure 15: Average miss rate for LRU, SRRIP, EW SRRIP in shared LLC
Figure 16: Execution time for LRU, SRRIP, and EW SRRIP
Figure 17: Average miss rate for LRU and EW LRU in Hydra's shared LLC
Figure 18: Execution time for LRU and EW LRU on Hydra
Figure 19: Average miss rate for SRRIP and EW SRRIP in Hydra's shared LLC
Figure 20: Execution time for SRRIP and EW SRRIP on Hydra
Figure 21: Average miss rate for MRU and MRU-T in shared LLC
Figure 22: Execution time for MRU and MRU-T
Figure 23: Implemented policies' speedup over LRU
Figure 24: Average miss rate of LRU and EWLRU with 2 workloads (each is allocated 2 cores) running concurrently
Figure 25: Execution time of LRU and EWLRU with 2 workloads (each is allocated 2 cores) running concurrently
Figure 26: Average miss rate of LRU and EWLRU with 4 workloads (each is allocated 1 core) running concurrently
Figure 27: Execution time of LRU and EWLRU with 4 workloads (each is allocated 1 core) running concurrently

Table of Tables

Table 1: Selected workload list
Table 2: Some of the simulated hardware configuration
I. INTRODUCTION

Programs and applications today depend on the efficiency of the underlying computer
architecture. Many studies have been conducted to narrow the gap between CPU and memory
speeds. One direction of optimization is parallel architectures with multi-core processors, which
aim to reduce capacity and conflict misses and thus the penalty of accessing main memory. These
misses occur in the cache when handling large working sets. To avoid the penalty of accessing
main memory, computer architectures are nowadays usually built with two or three levels of
caches: on a cache miss, instead of fetching the required block from main memory, the block is
first sought in the lower cache levels. Furthermore, since multi-core architectures rely on running
programs concurrently using multiple threads and cores, a shared level of cache is needed to share
and synchronize data between the physical cores. This led to the emergence of shared last-level
caches (LLCs). However, shared LLCs are huge compared to the higher levels of the cache
hierarchy, which makes it difficult to apply a cache replacement algorithm that exploits temporal
and/or spatial locality well enough to sufficiently reduce the number of misses in this type of
cache. The most widely used cache replacement algorithm is the Least Recently Used (LRU)
policy. While it is the favored replacement policy in the higher cache levels, it results in high miss
rates in the shared LLC due to the LLC's size, and high miss rates in shared LLCs lower the
performance of parallel applications.

To address this problem, we implemented two recently introduced cache replacement
policies for the shared last-level cache, using the Sniper Multi-Core Simulator [6] as a testing tool
to emulate the cache configurations of the Gainestown and Istanbul processors. These new
algorithms are the Evict-Write (EW) eviction policy [1] and the Most Recently Used Tour
(MRU-Tour) cache replacement algorithm [2]. The EW strategy was implemented on top of both
the LRU and the Static Re-Reference Interval Prediction (SRRIP) [7] policies. Our goal is to
reduce the execution time and the shared LLC miss rate relative to the LRU policy, our baseline
algorithm, in order to achieve better performance on several multi-threaded workloads.

II. DESIGN

Any cache replacement policy consists of three components:

Insertion policy: after evicting a victim block (capacity miss), or when filling an empty
line in the cache (compulsory miss), the new block is inserted into the cache according to some
chosen policy.
Eviction decision: when there are no empty lines in the cache, a victim block is chosen
by some eviction decision (capacity miss).
Promotion rules: when a referenced block is found in the cache (cache hit), that block's
position in the cache is updated according to certain rules.

Based on these facts, we describe our algorithms below using pseudo code and a flow
chart. Then we briefly explain how we employed the Sniper Multi-Core Simulator
infrastructure for our project implementation.

1. Evict-Write Eviction Strategy:

The Evict-Write strategy modifies the eviction decision only, prioritizing read blocks
over written blocks [1]. That means a block that was only written is the most likely victim for
eviction; if no written-only block is found, a block that was both read and written is the next
eviction candidate; otherwise, the strategy falls back to the LRU eviction decision over the read
blocks. Once the candidate has been chosen, it is evicted and replaced by the new block, which is
then moved to the MRU position.

Evict-Write is integrated with two well-known algorithms, LRU and SRRIP. So the
eviction decision does not just choose any block that was previously written: in the case of EW
LRU, it must also be the least recently used block among those that were written. For SRRIP,
there are two options, discussed later.

a. EW LRU:

The LRU insertion policy inserts a new block at the MRU position. As for the eviction
decision, it evicts the block located in the LRU position. The LRU promotion rule updates a
block's position after a hit by moving it to the MRU position. To convert the original LRU
algorithm into the new EW LRU, all we need to change is the eviction decision: instead of
evicting the block at the least recently used position without any regard to the type of memory
access operation, the Evict-Write decision takes the access type into account. However, we also
need to update each block's recorded access type on every insertion, eviction, and promotion.

We show the Evict-Write LRU eviction procedure through the pseudo code in figure (1).

Figure 1: EW LRU pseudo code: evictEW procedure.

The algorithm uses three variables: one to hold the index of the LRU written-only (W) block,
another for the LRU read-and-written (RW) block, and a last one for the LRU read-only (R)
block. The algorithm loops over all the blocks present in the cache set in order to pick the least
recently used block of each of these types and store its index in the corresponding variable. After
the iteration, control moves to the if statements at the bottom of the pseudo code. These
statements first check whether a written-only block was recorded in the W variable; if so, the W
index is returned. If not, the next condition checks whether a read-and-written block was found;
if yes, the RW index is returned, and otherwise the R index is returned.

The update procedure of the EW strategy is shown in the pseudo code in figure (2).

Figure 2: EW LRU: updateEW procedure

This procedure has a single purpose: to update the access-type record of the incoming block.
If the incoming access is a read, the block is marked as having been read; otherwise, it is
marked as having been written.
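A corresponding sketch of the update step, in the same simplified free-function style (AccessType is a stand-in for the simulator's memory-operation type, an assumption for illustration only):

// Minimal sketch of updateEW (figure 2): record the type of the access that touched the block.
enum class AccessType { READ, WRITE };   // stand-in for the simulator's memory-operation type

void updateEW(int index, AccessType op, bool* was_read, bool* was_written)
{
    if (op == AccessType::WRITE)
        was_written[index] = true;   // block has now been written at least once
    else
        was_read[index] = true;      // block has now been read at least once
}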

b. EW SRRIP:

The SRRIP algorithm uses M-bit Re-Reference Prediction Values (RRPVs), giving 2^M
possible intermediate re-reference interval predictions. The SRRIP eviction decision finds the
block with the maximum RRPV and decrements its RRPV by one, and the insertion policy
inserts the new block in the position of the evicted block. For the promotion rule, two options
can be adopted. One is Hit Priority (HP), which sets the RRPV to zero when a block gets a
cache hit. The other is Frequency Priority (FP), which is the one used in this implementation;
FP decrements the RRPV when the block receives a hit. The Evict-Write integration affects the
algorithm's eviction decision only.

Figure (3) below shows the implementation of EW SRRIP:

Figure 3: EW SRRIP: evictEW procedure

The EW-SRRIP eviction decision finds the block that has the maximum RRPV and was used
only by write operations. If there is no write-only block, it chooses a block that was both read
and written. If no such block is found either, the strategy falls back to the SRRIP eviction over
the read blocks, which evicts the block with the maximum RRPV. Once the victim is found, its
RRPV is decremented by one.
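In the same simplified free-function style as the EW-LRU sketch above, the eviction decision described here could be rendered as follows (rrip_bits and the read/write flags are assumed per-set arrays, not the actual Sniper members):

#include <cstdint>

// Minimal sketch of the EW-SRRIP eviction decision of figure 3 (not the exact Sniper code).
// rrip_bits[i] is block i's RRPV; a larger value predicts a longer re-reference interval.
int evictEWSRRIP(uint8_t* rrip_bits, const bool* was_read, const bool* was_written, int associativity)
{
    int w_index = -1, rw_index = -1, r_index = -1;

    for (int i = 0; i < associativity; ++i)
    {
        if (was_written[i] && !was_read[i]) {
            if (w_index < 0 || rrip_bits[i] > rrip_bits[w_index]) w_index = i;      // max-RRPV write-only block
        } else if (was_written[i] && was_read[i]) {
            if (rw_index < 0 || rrip_bits[i] > rrip_bits[rw_index]) rw_index = i;   // max-RRPV read-and-written block
        } else {
            if (r_index < 0 || rrip_bits[i] > rrip_bits[r_index]) r_index = i;      // max-RRPV read block
        }
    }

    // Written-only blocks first, then read-and-written, then plain SRRIP over the rest.
    int victim = (w_index >= 0) ? w_index : (rw_index >= 0) ? rw_index : r_index;
    if (rrip_bits[victim] > 0)
        rrip_bits[victim]--;   // as described above, the chosen block's RRPV is decremented
    return victim;
}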

The flowchart in figure (4) illustrates the design of the EW LRU cache replacement
algorithm in the Sniper simulator. The same flowchart can be read as the design of EW
SRRIP, except that instead of using the LRU bits to move the inserted block to the most
recently used position, it uses the SRRIP bits, which order the blocks by their Re-Reference
Prediction Values.

Figure 4: EW flowchart

2. MRU-Tour Cache Replacement Algorithm:

The basic idea of the MRU-Tour algorithm is to track the number of times a block
occupies the MRU position while it is stored in the cache [2]. When a block is fetched, its first
tour begins. If the block is never referenced again, it is a candidate for eviction. If the block is
referenced again, a second tour begins, meaning the block has run multiple MRU tours.

To record whether a block has run more than one tour, an MRUT bit is used. The MRUT bit
is 1 when the block has received a hit, and 0 when the block was fetched into the cache and has
never been referenced again during its lifetime there. To summarize: the MRUT eviction
decision evicts a block that was only fetched, i.e., one whose bit is 0. The insertion policy moves
the new line to the MRU position and sets its MRUT bit to 0. For the promotion rule, the
algorithm updates a hit block by moving it to the MRU position and setting its MRUT bit to 1.

Figure (5) below shows the implementation of MRU-T:

Figure 5: MRU-T pseudo code
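As a rough C++ rendering of the same logic (a sketch under the simplified conventions used above, not the exact Sniper code):

#include <cstdint>

// Minimal sketch of the MRU-Tour decisions described above.
// mru_bits[i] == 0 means block i has never been re-referenced since it was fetched.

int getVictimMRUT(uint8_t* mru_bits, int associativity)
{
    // Prefer any block that completed only a single MRU tour (bit still 0).
    for (int i = 0; i < associativity; ++i)
        if (mru_bits[i] == 0)
            return i;
    return 0;   // every block was re-referenced; fall back to an arbitrary block
}

void insertMRUT(uint8_t* mru_bits, int index)
{
    mru_bits[index] = 0;   // a freshly inserted block starts its first tour
}

void updateMRUT(uint8_t* mru_bits, int index)
{
    mru_bits[index] = 1;   // a hit means the block has started another tour
}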

Figure: MRU-T flowchart.

III. IMPLEMENTATION

In this section, we give a detailed account of our implementation approach. We
start by explaining the classes, methods, and files of the Sniper simulator that we used and
modified for our project. Then we describe the implementation and integration of the three
proposed cache replacement policies with the Sniper simulator classes. Finally, we clarify how
the new algorithms' files were separated from the original algorithms' files.

1. Understanding The Sniper Multi-Core Simulator:


In the Sniper Multi-Core Simulator, the code files are well defined and organized, but they
are poorly documented, so understanding the system and implementing the two newly
introduced replacement algorithms, Evict-Write (EW) and MRU-Tour, was quite challenging.
Most of the code for implementing a replacement algorithm is localized and can be found
under the cache folder. Each replacement algorithm is implemented in a class whose name
corresponds to the algorithm, with an associated header and source file (for example,
CacheSetMRUT is the MRU-Tour class, and cache_set_MRUT.h and cache_set_MRUT.c are
its header and source files).

Implementing these three replacement algorithms requires a good understanding of
how the Sniper Multi-Core Simulator classes are related, in terms of dependency and inheritance.
The Sniper classes that are involved and modified in our implementation are the following:
CacheCntlr, Cache, CacheBase, CacheSet, CacheSetEWLRU, CacheSetEWSRRIP, and
CacheSetMRUT, which are briefly described next:

a. Classes:
The CacheCntlr class is the main class for managing cache-related information; both the
private caches and the shared cache are handled through it. In order to obtain the type of the
operation, and whether the block gets a cache hit, a sequence of methods in this class is invoked
as follows:

The method CacheCntlr::processMemOpFromCore is called first to check whether
the first private cache (L1) holds the block required by the CPU.

To get this information along with the permission to access or insert,
CacheCntlr::processMemOpFromCore calls
CacheCntlr::operationPermissibleinCache, which collects the block information
from CacheCntlr::getCacheBlockInfo. The address of the block is passed along with
these calls.

In the CacheCntlr::getCacheBlockInfo method, a method from the Cache class is
invoked, Cache::peekSingleLine, which splits the address passed along these methods,
resolves the set index (as shown in the flowchart), and checks the block's availability in the
cache using a method from the CacheSet class, CacheSet::find.

In the CacheSet class, the cache is divided into sets, with as many blocks per set as the
associativity; CacheSet::find therefore scans m_cache_block_info_array (the array
that represents one set) for a valid block whose tag matches the requested address.

When all the necessary information has been collected, control returns to
CacheCntlr::operationPermissibleinCache. The method then checks whether the
returned block info allows the operation. The mem_op_type variable gives the operation type,
which is either a read or a write, and the CacheSet methods CacheSet::readable() and
CacheSet::writable() are consulted; if the appropriate one returns true, the block is
available for this operation, which is a hit in the L1.

However, when there is a miss in the L1 cache,
CacheCntlr::processShmemReqFromPrevCache is invoked to find out at which
level we get a cache hit instead. The same methods are invoked again, starting with
CacheCntlr::operationPermissibleinCache(). If it is an L2 miss, the L3 cache
blocks are checked, and if it is a miss in L3 as well, the block is fetched from main memory
using CacheCntlr::accessDRAM.

b. Methods:
For this project we traced and edited only the methods related to the shared cache, for
both block hits and misses.

On a hit in the LLC, the return path from
CacheCntlr::processShmemReqFromPrevCache() to
CacheCntlr::processMemOpFromCore() reports the type of operation at a given
cache level, which in our case is the LLC.

To access this block and perform the operation, CacheCntlr::accessCache() is
called with the following parameters: the cache level, the address, a true value for the
update_replacement variable to indicate that the replacement state of the block in the
LLC needs to be updated, and finally the mem_op_type, which is the type of operation that
will be performed on the block.
At the same time, write-through and write-back traffic is handled using the
CacheCntlr::writeCacheBlock() and CacheCntlr::updateCacheBlock() methods.

In the Cache class, an array of CacheSet instances is created, one per set, and each set
holds as many blocks as the m_associativity variable specifies. For each index in the
array, an m_cache_block_info_array of CacheBlockInfo objects is initialized so that
the set can use that class's methods, as shown in figure 6.

Figure 6: One-to-many relationship between the CacheBase and CacheSet classes
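As a rough sketch of the containment that figure 6 depicts (simplified, hypothetical names with a "Sketch" suffix; the real Sniper classes carry many more members):

#include <vector>

// Simplified illustration of the one-to-many relationship described above.
// One cache owns one set object per set, and each set owns one block-info entry per way.
struct CacheBlockInfoSketch { /* tag, state, ... */ };

struct CacheSetSketch
{
    int associativity;
    std::vector<CacheBlockInfoSketch> block_info;   // plays the role of m_cache_block_info_array

    explicit CacheSetSketch(int assoc) : associativity(assoc), block_info(assoc) {}
};

struct CacheSketch
{
    std::vector<CacheSetSketch> sets;               // plays the role of m_sets

    CacheSketch(int num_sets, int assoc) : sets(num_sets, CacheSetSketch(assoc)) {}
};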

Therefore, if the operation is Cache::LOAD, the Cache::accessSingleLine() method
accesses CacheSet::read_line(), and if the operation is Cache::STORE,
Cache::accessSingleLine() accesses CacheSet::write_line().

In the case of a miss in the LLC, or an update for the previous level, the same sequence of
methods is traversed, but with insert methods instead of access methods. Thus, the
CacheCntlr::insertCacheBlock() method is invoked, and it in turn calls
Cache::insertSingleLine(). Cache::insertSingleLine() passes all the information
to CacheSet::insert(). Inside CacheSet::insert(), the block to evict is obtained from
the replacement algorithm through CacheSet::getReplacementIndex(), which is
implemented by every replacement policy class. The replacement policy itself is selected in the
CacheSet::createCacheSet() method.

2. Implementing And Integrating Our New Algorithms With Sniper:

The actual implementation of the three replacement policies starts when execution
reaches the CacheSet::read_line() and CacheSet::write_line() methods. In both
methods, we added an m_coming_EW_type variable and an m_block_op array that capture
whether the operation is a READ or a WRITE. In order to know at which cache level the
methods were called, we checked whether the cache_type variable in the CacheSet
constructor equals CacheBase::SHARED_CACHE. However, we later eliminated the need for
this conditional, since we separated our new replacement files, as discussed below.
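As an illustration of the kind of bookkeeping this adds (a sketch only; the helper names and the standalone form are assumptions, since the real code lives inside the CacheSet access methods):

// Sketch of the bookkeeping added to the access paths. In the real code this sits inside
// CacheSet::read_line() / CacheSet::write_line(); here it is shown as two standalone helpers
// over the per-set state described above.
enum EWType { EW_READ, EW_WRITE };

void onReadLine(int way, EWType& coming_EW_type, bool* block_was_read)
{
    coming_EW_type = EW_READ;       // current operation is a read
    block_was_read[way] = true;     // this block has now been read at least once
}

void onWriteLine(int way, EWType& coming_EW_type, bool* block_was_written)
{
    coming_EW_type = EW_WRITE;      // current operation is a write
    block_was_written[way] = true;  // this block has now been written at least once
}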

A. If the policy is EW-LRU, the utilized class is CacheSetEWLRU:

Figure 7: CacheSetEWLRU class diagram

a. On a cache hit, the block is accessed and the
CacheSetEWLRU::updateReplacementIndex() method is invoked
from either CacheSet::read_line() or CacheSet::write_line().
b. The CacheSetEWLRU::updateReplacementIndex() method applies
the promotion rule of the EW-LRU policy by calling two methods.
The first is the CacheSetEWLRU::moveToMRU() method, which simply
reorders the m_lru_bits values after giving the accessed index the value 0,
indicating that it is now in the MRU position. The second is the
CacheSetEWLRU::updateEW() method, described in the design section,
which updates the access type in the m_stored_EW_type array of each block by
either setting the was_read variable to true if the access type was a READ, or
setting was_written to true if it was a WRITE.
c. If CacheSet::insert() is invoked, a compulsory miss or a capacity miss has
happened and a block needs to be evicted. The
CacheSetEWLRU::getReplacementIndex() method returns the victim block
index.

d. The CacheSetEWLRU::getReplacementIndex() method first checks for an
empty cache line. If one is found, that index is returned to
CacheSet::insert() after updating the m_lru_bits using the
moveToMRU() method, and the was_read and was_written variables using
the updateEW() method. Otherwise, a victim must be selected and evicted.

e. Victim selection is done by calling the CacheSetEWLRU::evictEW()
method. This method, as described in the design section, performs the EW
eviction policy and returns the selected victim block index. The returned index is
then used for the update process, again through the moveToMRU() and
updateEW() methods. A rough sketch of this overall flow is given below.
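The sketch reuses the evictEW() and updateEW() helpers and the AccessType stand-in sketched in the design section; valid[] and moveToMRU() are additional assumed stand-ins for the block-validity check and LRU reordering, not the actual Sniper members:

#include <cstdint>

// Assumed helper: age every younger block and put 'index' in the MRU position (lru_bits value 0).
void moveToMRU(uint8_t* lru_bits, int index, int associativity)
{
    for (int i = 0; i < associativity; ++i)
        if (lru_bits[i] < lru_bits[index])
            ++lru_bits[i];
    lru_bits[index] = 0;
}

// Rough sketch of the getReplacementIndex() flow described in steps c-e (free-function form).
// valid[] stands in for checking each block's validity in m_cache_block_info_array.
int getReplacementIndexSketch(const bool* valid, uint8_t* lru_bits, bool* was_read, bool* was_written,
                              int associativity, AccessType op)
{
    // Step d: prefer an invalid (empty) way if one exists.
    for (int i = 0; i < associativity; ++i)
    {
        if (!valid[i])
        {
            moveToMRU(lru_bits, i, associativity);
            updateEW(i, op, was_read, was_written);
            return i;
        }
    }

    // Step e: otherwise run the EW eviction policy and update the chosen way.
    int victim = evictEW(lru_bits, was_read, was_written, associativity);
    moveToMRU(lru_bits, victim, associativity);
    updateEW(victim, op, was_read, was_written);
    return victim;
}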

B. If the policy is EW-SRRIP, the utilized class is CacheSetEWSRRIP:

Figure 8: CacheSetEWSRRIP class diagram

a. On a block access, CacheSetEWSRRIP::updateReplacementIndex() is
invoked from either CacheSet::read_line() or CacheSet::write_line(). This
method applies the SRRIP promotion rule mentioned above, using the
m_rrip_bits array to decrement the block's RRPV.
b. If CacheSet::insert() is invoked, it is a miss and a block needs to be
evicted. CacheSetEWSRRIP::getReplacementIndex() returns the victim
block: using the m_rrip_bits and m_block_op arrays, the method finds the block
with the maximum RRPV, preferring blocks that were only written, then blocks
that were both read and written, then the remaining blocks.

C. If the policy is MRUT, the utilized class is CacheSetMRUT:

Figure 9: CacheSetMRUT class diagram

a. For this policy, it is important to check whether the bool update_replacement in
CacheSet::read_line() or CacheSet::write_line() is true, which again means a
hit for a block that is already in the cache;
CacheSetMRUT::updateReplacementIndex() is then invoked, and in it the
block's entry in the m_mru_bits array is set to 1.
b. If CacheSet::insert() is invoked, it is a miss and a block needs to be
evicted. CacheSetMRUT::getReplacementIndex() returns the block chosen for
eviction, using the m_mru_bits array to check which blocks have their bit at 0 or 1.
When the victim is chosen, its m_mru_bits entry is set back to 0.

3. Separating The New Algorithms' Files:


In order to integrate the new algorithms' classes and files, some of the mentioned classes had to
be altered, namely CacheSet and CacheBase.

a. In the CacheBase class, the three policies were named "EW_LRU", "EW_SRRIP", and
"MRUT", respectively, and were added to the ReplacementPolicy enum type.

b. In the Cache class, the cache is divided into sets, with each index of the m_sets array
representing a CacheSet object; each set is created through the
CacheSet::createCacheSet() method, which constructs a set for the chosen
replacement policy. CacheSet::parsePolicyType() is used to parse the
name of the cache replacement policy coming from either the command-line options or the
configuration file; therefore, the new policies' names were also added to the
parsePolicyType() method.

c. The CacheSet::createCacheSet() method itself was also altered: three switch cases
were added, corresponding to the three replacement algorithms' names in the CacheBase
class, with the initialization values needed to create the new cache sets.
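For illustration, the registration amounts to something like the following sketch (not the exact Sniper signatures; the real createCacheSet() takes more parameters, so the constructor calls are only indicated as comments):

#include <string>

// Simplified sketch of how the new policies were registered.
enum ReplacementPolicy { LRU, SRRIP, EW_LRU, EW_SRRIP, MRUT };   // names added to the CacheBase enum

// parsePolicyType(): map the name from the config file / command line onto the enum.
ReplacementPolicy parsePolicyType(const std::string& name)
{
    if (name == "ew_lru")   return EW_LRU;
    if (name == "ew_srrip") return EW_SRRIP;
    if (name == "mrut")     return MRUT;
    if (name == "srrip")    return SRRIP;
    return LRU;
}

// createCacheSet(): the added switch cases construct the matching CacheSet subclass.
void createCacheSetCases(ReplacementPolicy policy)
{
    switch (policy)
    {
        case EW_LRU:   /* return new CacheSetEWLRU(...);   */ break;
        case EW_SRRIP: /* return new CacheSetEWSRRIP(...); */ break;
        case MRUT:     /* return new CacheSetMRUT(...);    */ break;
        default:       /* existing policies handled as before */ break;
    }
}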

IV. TESTING ENVIRONMENT

Our implementation was carried out on the Sniper simulator source code, so for the
testing phase we used Sniper together with multi-threaded benchmark suites to evaluate our
replacement policy implementations. Since we focus on improving the performance of shared
LLC replacement policies in chip multiprocessors (CMPs), we used the two most common
benchmark suites in CMP studies: SPLASH-2 [3] and PARSEC [4].

The Stanford ParalleL Applications for SHared memory (SPLASH-2) suite was introduced in
1996 as a scalability-oriented update of SPLASH; both target shared-memory processors.
SPLASH-2 consists of a mixture of applications and kernels representing a variety of computations
in scientific, engineering, and graphics computing. However, SPLASH-2 is skewed toward High
Performance Computing (HPC) applications, which would bias a general judgment of our
implementation's performance on multicore systems if we based our evaluation on it alone.

The Princeton Application Repository for Shared-Memory Computers (PARSEC) benchmark
suite was developed and introduced in 2008 by Princeton University and Intel Corporation. Its
objective is to provide a large and diverse collection of applications that is sufficiently
representative for scientific studies of CMPs. PARSEC applications are considered state of the
art in their areas; the algorithms these programs implement are useful, but their computational
demands are very high on current platforms.

Although PARSEC is more diverse than SPLASH-2, the two suites' workloads differ
in important features such as data locality, number of shared cache lines, and working set sizes
[5]. To make our evaluation appropriate, it is important to take these differences into
consideration, as they influence cache performance in CMPs. In our evaluation we therefore
selected workloads from both PARSEC and SPLASH-2.

Since PARSEC has 12 workloads and SPLASH-2 has 11, we needed to choose which of
them to use in our tests and analysis. We chose 11 workloads, 4 from PARSEC and 7 from
SPLASH-2, based on two criteria:

1. We gave priority to the benchmarks that use the shared LLC intensively. To determine
those, we generated CPI (Cycles Per Instruction) stacks with the Sniper simulator for our
baseline configuration (Gainestown) with the LRU replacement policy applied to the shared
LLC. CPI stacks show how many processor cycles, normalized by the number of instructions
or by execution time, were spent in each component, such as the cache levels, DRAM, the
instruction fetcher, and synchronization. Using CPI stacks, we were able to determine which
benchmarks are memory intensive, i.e., spend a large amount of time in memory instructions
that reach the shared LLC. Figure (10) shows CPI stacks for two benchmarks: on the left, one
workload (canneal) whose CPI stack shows significant time spent in the LLC; on the right, a
workload that does not show any significant time spent in the shared cache.

Figure 10: CPI stacks generated using Sniper

2. The other criterion used to select suitable benchmarks was the average miss rate. We ran all
22 workloads using the baseline LRU replacement policy and then chose the benchmarks that
show a high average miss rate. Figure (11) shows the baseline average miss rate in the shared LLC
for all benchmarks.

Figure 11: Baseline average miss rate in shared LLC

These criteria led us to choose from the PARSEC benchmark suite the workloads
Bodytrack, Streamcluster, Dedup, and Canneal, and from SPLASH-2 the workloads FFT, FMM,
Lu.cont, Lu.ncont, Water.nsq, Ocean, and Cholesky. Table (1) lists all selected workloads. Since
each benchmark comes with several input data set sizes, we restricted the input size in all
experiments to simsmall for the PARSEC benchmarks and small for the SPLASH-2 benchmarks.
This restriction follows [1], and its purpose was to obtain a reasonable trade-off between accuracy,
number of experiments, and simulation speed.

Suite      Workload        Type     Domain               Input size
Parsec     Bodytrack       App.     Computer Vision      simsmall
Parsec     Canneal         Kernel   Engineering          simsmall
Parsec     Dedup           Kernel   Enterprise Storage   simsmall
Parsec     Streamcluster   Kernel   Data Mining          simsmall
Splash-2   Cholesky        Kernel   HPC                  small
Splash-2   FFT             Kernel   Signal Processing    small
Splash-2   FMM             App.     HPC                  small
Splash-2   Lu.cont         Kernel   HPC                  small
Splash-2   Lu.ncont        Kernel   HPC                  small
Splash-2   Ocean.cont      App.     HPC                  small
Splash-2   Water.nsq       App.     HPC                  small

Table 1: Selected workload list

V. EXPERIMENT

To test the performance of our algorithms, we used two main metrics: execution time and
average miss rate. The execution time is reported by the simulator as an output and is measured
in nanoseconds. The average miss rate we calculated as the ratio between the total number of
misses from all cores and the total number of accesses from all cores. The base hardware setup
for all experiments was the default configuration of the Sniper simulator, which models an Intel
Nehalem Xeon processor: the quad-core Gainestown configuration.
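Written as a formula, with $\text{misses}_i$ and $\text{accesses}_i$ the shared-LLC counts reported by the simulator for core $i$ and $N$ the number of cores:

\[ \text{average miss rate} = \frac{\sum_{i=1}^{N} \text{misses}_i}{\sum_{i=1}^{N} \text{accesses}_i} \]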

To show the scalability of the EW algorithm, we used a second hardware setup. Since we
wanted to test our algorithm on a machine that is known and commonly used, we chose Hydra
from the PDS lab machines. To obtain Hydra's hardware specifications we used the UC Denver
PDS lab web site, which provides the main architecture [8], in addition to [9] for more insight
into the Istanbul processor used in Hydra. However, due to some Sniper limitations, we had to
change some of the Istanbul processor's parameters. For example, the L1 data TLB has 48 entries,
but Sniper accepts only power-of-two sizes, so we set it to 32 entries. Another change was the
number of cores: Sniper requires a power-of-two number of threads for some of the tested
workloads, so to keep the testing environment uniform across all workloads, we set the number
of cores to 8. Table (2) shows the main cache parameters of the Gainestown multi-core processor
and the Hydra multi-core machine.

Parameter            Gainestown    Hydra (Istanbul)
Num. cores           4             2 x 6
L1-D size            32 KB         2 x 6 x 64 KB
L1-D associativity   8             2
L1-I size            32 KB         2 x 6 x 64 KB
L1-I associativity   4             2
L2 size (per core)   4 x 256 KB    2 x 6 x 512 KB
L2 associativity     8             16
L3 total size        8 MB          6 MB
L3 associativity     16            48

Table 2: Some of the simulated hardware configuration

In terms of test types, we considered two kinds of experiments. The first is running a single
parallel workload on multiple cores. With the Gainestown configuration, we ran each workload
on four cores, so the work is divided among the four cores; similarly, with the Hydra
configuration, each benchmark uses eight cores to complete its work.

The second kind of experiment is running multiple workloads in parallel using the available
cores. In this case, the number of cores is divided equally between the workloads. We ran this
test using the Gainestown configuration only and performed two testing scenarios: the first with
parallel workloads running concurrently, and the second with sequential workloads running
concurrently. The goal of this test is to check the efficiency of multicore cache replacement
algorithms when the cache is shared between several workloads, each running independently,
and not only between the cores that run a single workload.

VI. RESULTS AND ANALYSIS

Running the simulator generates several output files. The one we benefit from the most is the
sim.out file, which contains detailed information about the simulation, such as the number of
instructions, the execution time, core idle times, and the important cache metrics for each cache
level, such as the number of accesses, the miss rate, and the number of misses.

To simplify collecting the results, given the large number of tests (over 100), we adopted a
fixed directory naming format that holds all necessary information about each run. For every
simulation run, we set the output directory option to run-
results/simulated_policy/benchmark_name-workload_name-on-simulated_configuration-
simulated_policy (e.g., single_run/ewlru/parsec-canneal-on-gainestown-ewlru). For multiple-
workload simulations, we also appended the core distribution (how many cores are assigned to
each workload) and preserved the order of the running workloads in the output folder name. So,
basically, we used the directory name as a simple reference to each simulated run.

An IPython notebook was then built to go through each run's output folder, scan the sim.out
file in the collected results directory, and read and record all the cache metrics we need from it.
Using the same notebook we calculated any additional metrics we wanted to examine,
classified our simulations, and generated our analysis charts. Since we are testing two different
experiments, two .ipynb files were built, one for each.

Here we present our results and analysis. As the goal of the implementation is to reduce
cache load misses, the main metric we evaluate is the L3 cache miss rate, in addition to
execution time and speedup over LRU. We also evaluate the scalability of the EW algorithm by
examining its results on the Hydra machine configuration.

I. First Experiment

1- First algorithm: EW algorithm

For the EW LRU algorithm, the average miss rates of LRU and EW LRU are shown
in figure (12). The average miss rate was reduced in three benchmarks: Dedup, FFT, and Cholesky,
by about 1.3% in Dedup, 1.7% in Cholesky, and 5.3% in FFT. This reduction in these three
benchmarks can be explained by looking at figure (13), which shows the baseline average miss
rates for the selected workloads: the workloads that benefit from the EW algorithm are those with
the highest average miss rates. The remaining workloads, which did not see a reduction in average
miss rate, lead us to the observation that the overall performance improvement depends not only
on the total number of misses, but also on the type of misses generated, i.e., read versus write
misses. Since the EW algorithm reduces the number of read misses, we do not expect any
improvement when the number of write misses is larger than the number of read misses.

According to [4], canneal and streamcluster show trivial amounts of sharing, which could explain
why we did not get any noticeable improvement in them. Canneal has a very large working set of
56 MB and more (classified by [4] as unbounded), and its need for cache capacity grows as the
amount of data it processes grows; this need for a large data set is caused by the algorithm it
runs. In canneal, most of the working set is shared by all threads; however, due to its unbounded
size, only a tiny fraction of it fits in the cache, and the probability of a line being accessed by a
different thread before eviction is small.

Figure 12: Average miss rate for LRU and EW LRU in shared LLC

Figure 13: Baseline average miss rate for LRU in shared LLC

The execution time of LRU compared with EW LRU is shown in figure (14). The
execution times of both algorithms are similar, except for the Dedup workload, which had about
a 1.3% reduction in execution time. Since Dedup also had a reduction in average miss rate under
the EW algorithm, this reduction in execution time is expected.

Figure 14: Execution time for LRU and EW LRU in shared LLC

For the EW SRRIP algorithm, figure (15) shows a comparison between LRU, SRRIP, and EW
SRRIP. As the figure shows, EW SRRIP reduced the average miss rate of the fft workload by
12.9%. In the remaining workloads, comparing the original SRRIP to LRU shows that LRU was
actually performing better. This could be due to the nature of SRRIP, which inserts new blocks
with a long re-reference prediction, or to the applied promotion rule, FP, which only decrements
the RRPV on a block hit and therefore gives no strong indication of whether the block will be
reused soon. Since the miss rate has not improved, the execution time is not expected to improve
either; figure (16) shows the execution time when applying EW SRRIP.

Figure 15: Average miss rate for LRU, SRRIP, EW SRRIP in shared LLC.

Figure 16: Execution time for LRU, SRRIP, and EW SRRIP.

Next, we examine the performance of the implemented EW algorithms on Hydra to test their
scalability. The workloads were run in the same manner as before, but with Hydra's
configuration, which uses twice the number of cores and a smaller shared L3 cache; table (2)
compares the most important cache specifications.

The average miss rate in L3 when using EW LRU on Hydra is shown in figure (17). We
note that Bodytrack, FFT, and Ocean show an average miss rate reduction of 1.3%. This could
be due to the type of sharing these workloads have: for example, in Bodytrack the threads process
the same data (the input data), so it has a substantial amount of sharing, and when the number of
cores increases, the amount of sharing needed increases as well. On the other hand, some of the
workloads show an average degradation of 1.2%. This can be explained as follows: most of the
workloads that experienced a degradation have very large working sets whose growth rate is
proportional to the number of cores. Hydra uses more cores and a smaller L3 cache than
Gainestown, so the working set grows and only a small fraction of it fits in the cache, which
reflects badly on performance. Figure (18) shows the execution time for EW LRU on Hydra,
which mirrors the miss rates.

Figure 17: Average miss rate for LRU and EW LRU in Hydra's shared LLC

Figure 18: Execution time for LRU and EW LRU on Hydra.

Figure (19) shows the average miss rate in L3 when using EW SRRIP on Hydra. The results
are very close to those on Gainestown. The execution time can be seen in figure (20).

Figure 19: Average miss rate for SRRIP and EW SRRIP in Hydra's shared LLC.

Figure 20: Execution time for SRRIP and EW SRRIP on Hydra.

2- Second algorithm: MRU-T algorithm

The MRU-T algorithm was tested on the Gainestown configuration; a comparison of the average
miss rates of MRU and MRU-T is shown in figure (21). All workloads saw a reduction in
average miss rate with MRU-T except FFT, which increased by only 0.3%. Excluding FFT,
MRU-T reduced the miss rate by 33% on average.

Figure 21: Average miss rate for MRU and MRU-T in shared LLC.

The execution time of MRU compared with MRU-T is shown in figure (22). The reduction
in average miss rates clearly translated into a reduction in execution time, which decreased for
almost all benchmarks: Canneal had about a 47% reduction in execution time, and Dedup about
a 22% reduction.

Figure 22: Execution time for MRU and MRU-T.

Since the motivation of this project was to evaluate different LLC replacement policies, we
also examine their performance relative to LRU. For a general evaluation of the three
implemented replacement policies, figure (23) gives their speedup over LRU.

Figure 23: Implemented policies' speedup over LRU.

We note that, on average, MRU-T performed best of the three algorithms, with an average
speedup of 1.02 over LRU.
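We read the speedup over LRU in the usual way, as the ratio of the baseline execution time to the execution time under the evaluated policy (values above 1 mean the policy runs faster than LRU):

\[ \text{speedup}_{\text{policy}} = \frac{T_{\text{LRU}}}{T_{\text{policy}}} \]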

II. Second Experiment

For the last part of the analysis we examine the performance of the second test. In this
test we examined two scenarios:

1. Running two workloads concurrently, where each of them runs on half of the
cores and they share the LLC.
In this test, we randomly selected pairs of workloads and ran them in parallel,
with each workload getting 2 of the 4 cores.

2. Running four workloads concurrently, where each of them runs sequentially
(using one core) and independently of the others, but all share the LLC.
In this test, we randomly selected combinations of four workloads and
assigned one core to each workload to force it to run sequentially; the 4 workloads
run concurrently.

The goal of these scenarios is to explore EW LRU's performance under different degrees of
LLC sharing. All previous tests focused on sharing the LLC between cores running the same
program. Here, we look at how the performance of EW LRU is influenced by decreasing both
the amount of sharing between cores (of the same program) and the LLC space allocated to
each program.

1- First scenario:

In figure (24), we can see that for most of the workload pairs, EW LRU's performance is close
to regular LRU's. We can also notice that for each pair, the overall miss rate is dominated by the
workload generating the higher miss rate. For instance, the pairs containing the Canneal workload
show a high degradation. This ties in with our previous results and analysis: in the single-workload
test Canneal kept its performance, and on Hydra, where the number of assigned cores was higher,
there was a small improvement. In this scenario, Canneal is not only sharing the LLC with another
program but is also assigned fewer cores. Figure (25) shows the corresponding execution time.

Figure 24: Average miss rate of LRU and EWLRU with 2 workloads (each is allocated 2 cores) running concurrently.

Figure 25: Execution Time of LRU and EWLRU with 2 workloads (each is allocated 2 cores) running concurrently.

2- Second scenario:

In figure (26), we can see that the runs that contain Dedup as one of the 4 workloads show an
LLC miss rate reduction even when the workloads run sequentially. Most of the workloads,
however, degraded when EW LRU was managing lines shared by a single core only. This behavior
was expected, as the EW LRU design targets multi-core sharing. Figure (27) shows the
corresponding execution time.

Figure 26: Average miss rate of LRU and EWLRU with 4 workloads (each is allocated 1 core) running concurrently.

Figure 27: Execution time of LRU and EWLRU with 4 workloads (each is allocated 1 core) running concurrently.

VII. CONCLUSION

Cache replacement algorithms such as LRU and MRU are widely used in multi-core
architectures, where the level of sharing in the memory hierarchy keeps increasing. However,
there is room for improvement toward higher reductions in miss rates. Based on that, our project
focused on improving the shared last-level cache (LLC), a crucial component of multiprocessor
performance, by implementing two recently introduced replacement algorithms, the Evict-Write
strategy and the MRU-Tour algorithm. Using the Sniper Multi-Core Simulator as our testing tool,
we emulated the cache specifications of the Gainestown processor and of the Istanbul AMD
processor used in the Hydra cluster, one of the PDS Lab clusters of the University of Colorado
Denver, Department of Computer Science.

After running the algorithms on more than 10 benchmark applications, our results show an
average improvement in shared LLC miss rate of about 1.5% for the EW LRU algorithm over
LRU and about 30% for the MRU-Tour algorithm over MRU, while EW SRRIP showed an
average degradation of about 41% over SRRIP.

VIII. FUTURE WORK

Given that no single replacement algorithm works best for all workloads, our future work
idea is to implement Set Dueling [10]. This approach divides the cache into three parts: two parts
are dedicated to the replacement algorithms being chosen between dynamically, one running one
of the EW algorithms and the other MRU-T, and the third part, called the follower sets, follows
whichever of the first two is currently winning. Set Dueling works efficiently with the Dynamic
Insertion Policy (DIP), since DIP provides low hardware overhead, low complexity, and high
performance. Thus, we would like to test the performance of DIP implemented together with Set
Dueling.

In addition, we would like to update our code to allow tracing the number of read and write
misses in the LLC, so that we can generate a more accurate measurement of the performance of
the EW algorithms. This would also help in characterizing the behavior of the test workloads, so
we can select the ones that are expected to show improvement.

ACKNOWLEDGMENT

This work would not have been possible without the great insight and experience of Team
Two, which greatly assisted the research and the project. We, Alanoud Alsalman, Arwa Almalki,
Samaher Alghamdi, and Norah Almaayouf, have had the pleasure of working together during this
project.

We would also like to show our gratitude to God for his guidance. We are especially
indebted to Professor Alaghband, University of Colorado Denver, for sharing her pearls of
wisdom with us during the course, which gave the project a prominent coherence in its results and
manuscript. We would also like to thank Manh for his support and patience with us.

References
[1] M. Geanta, L. Ghica, and N. Tapus, "Leverage Cache Replacement Policy In Multicore
Processors," in 2016 IEEE 12th International Conference on Intelligent Computer
Communication and Processing (ICCP), Cluj-Napoca, 2016, pp. 417-424.
doi: 10.1109/ICCP.2016.7737182

[2] A. Valero et al., "MRU-Tour-based Replacement Algorithms for Last-Level Caches," in
Proceedings of the 23rd International Symposium on Computer Architecture and
High Performance Computing, October 2011, pp. 112-119. doi: 10.1109/SBAC-PAD.2011.13

[3] PARSEC Group, "A Memo on Exploration of SPLASH-2 Input Sets," Princeton University,
2011.

[4] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC Benchmark Suite: Characterization
and Architectural Implications," Princeton University, Intel Labs, 2008.

[5] C. Bienia, S. Kumar, and K. Li, "PARSEC vs. SPLASH-2: A quantitative comparison of two
multithreaded benchmark suites on Chip-Multiprocessors," in Workload Characterization,
2008 (IISWC 2008), IEEE International Symposium on, pp. 47-56, Sept. 2008.

[6] The Sniper Multi-Core Simulator [Online].
Available: http://snipersim.org/w/The_Sniper_Multi-Core_Simulator

[7] A. Jaleel, K. Theobald, S. Steely, and J. Emer, "High performance cache replacement using
re-reference interval prediction (RRIP)," ACM SIGARCH Computer Architecture News, vol. 38,
no. 3, pp. 60, 2010.

[8] University of Colorado at Denver. Parallel Distributed Systems Lab - PDS Lab [Online].
Available: http://pds.ucdenver.edu/webclass/index.html

[9] CPU-World (2017, Feb). AMD Opteron 2427 specifications [Online]. Available:
http://www.cpu-world.com/CPUs/K10/AMD-Six-Core%20Opteron%202427%20-%20OS2427WJS6DGN%20%28OS2427WJS6DGNWOF%29.html

[10] M. Qureshi, A. Jaleel, Y. Patt, S. Steely Jr., and J. Emer, "Set-Dueling-Controlled Adaptive
Insertion for High-Performance Caching," IEEE Micro, vol. 28, no. 1, pp. 91-98, 2008.

