
NUDA: Non Uniform Disk Access

Programming assignment #2 for the Info-0940 course on Operating Systems

Tom Barbette

March 11, 2015
Abstract
Modern computers often have multiple storage devices, being either fast (SSD)
or big (HDD). Currently, it is up to the user to choose where to place his data.
Previous work allows the SSD to be used as a cache for the HDD, but loses all the
SSD space for that purpose.
Your work is to build a new RAID module, called NUDA. It will create a
virtual disk having nearly the same total space as the sum of all drives of the
NUDA RAID array. The virtual drive will always try to read and write from/to the
SSD, by swapping the less-used blocks of the SSD with the often-accessed blocks
of the HDD. An indirection table will be used to keep track of each block's position
inside the array, so that the data can be read back in order.
Students will work in teams of two, keeping the groups of Assignment #1.

1 Introduction

Recently, new storage devices appeared, called Solid State Drives (SSDs). They are
based on a well-known technology that uses silicon to store data, instead of the magnetic
disks found in traditional hard drives (HDDs). They offer a higher bandwidth (in terms
of bytes per second), but also a very low seek time. Seek time is the time taken by the
mechanical head of an HDD to move over the disks to access different cylinders. SSDs
have no mechanical parts and can directly access any of their sectors.
But SSDs have one major drawback: their cost per byte is much higher than that of
HDDs. Therefore, in modern computers it is not uncommon to have two
storage devices: one SSD with the operating system, recent documents and some
often-accessed programs, and one HDD with media files, games, archives and less-used
programs. The same situation can be seen in embedded devices and smartphones,
where there is often a pure NAND-based memory (like most SSDs) and a slower storage,
like an SD card.
The problem in these two scenarios is that the placement of data is left to the user, who
has to move the currently used data from the SSD to the HDD and vice versa by hand.
Existing solutions have appeared in the mainline kernel since version 3.9,
but all these solutions use the SSD as a pure cache, losing its entire space only for
backing some part of the HDD. While this can be acceptable in servers or desktop PCs
with a lot of HDD space, losing 128GB of SSD space when we only have another
500GB on a laptop is too much. It is even impossible in the case of embedded devices,
as all the OS data needs to remain accessible.
Your work is to build a new MD RAID level, called NUDA. It will take the form
of a kernel loadable module which, at the end of the third phase, will allow combining
multiple disks with different characteristics. The virtual drive will have nearly the
same space as the sum of the space of each drive, and will silently arrange reads
and writes to always give the user the fastest possible experience. In an
optimal scenario, all reads and writes will be done from and to the SSD, while, when
the computer is considered idle, NUDA will silently move the old, unused data to the
HDD, leaving room for new data or data accessed more often.

2 Assignment

You will provide a report answering the following questions and explaining your
work on the project. You also have to explain the contribution of each member of
your group. You are advised to answer the questions before writing any line of code
for the project section... The report will be sent in PDF format, named
"g{groupnr}-report.pdf", along with a patch containing all your modifications to the
kernel source code. The patch will follow the same guidelines as in the first assignment
(see the slides at http://bit.ly/1eAgt51) and will be named "g{groupnr}-source.patch".
Please write {groupnr} with a leading 0 if you're in groups 1 to 9 (e.g. 01, 02, ...).
You will pack the two files in a .tar.gz compressed archive and send it using the RUN
submission platform (http://submit.run.montefiore.ulg.ac.be) by Tuesday,
March 24th at 23h59.
Further information will be available online at http://bit.ly/1eAgt51. Stay tuned!
Strictly follow these requirements!

3 Questions

From now on, we always consider kernel version 3.2.66.

3.1 Kernel DM and MD

3.1.1 What is the difference between the kernel Device Mapper (DM) and the kernel Multiple Device (MD)? Also briefly explain how they work.

3.1.2 What are the most used user-space tools to manipulate both of them? Give the user-space command(s) to create, with each of them:

1. A linear array which combines /dev/sdb1 and /dev/sdc1 into one big drive.

2. A mirror array which combines /dev/sdb1 and /dev/sdc1 into one redundant drive.

3. A striping array which combines /dev/sdb1 and /dev/sdc1 into one faster drive.

You can create the other devices (/dev/sdb, /dev/sdc, ...) using VirtualBox. Use parted to
create an msdos label on them, and one big partition which will use all of the device
space and will appear as /dev/sdb1, /dev/sdc1, ...
3.1.3 What is the most accepted, standard way of having separate little partitions (for storing /, /var or /home separately), which you can grow if needed, on top of a RAID5 array? Explain the advantages and the drawbacks of this approach.

From now on, we only consider MD!


3.2 MD

3.2.1 What is the C structure which defines an MD RAID mode, and what are the functions that the MD driver should call to use it? How is it registered?

3.2.2 In this structure, explain the purpose of each variable and function, including their arguments.

3.2.3 What are the structures mddev, md_rdev, request and bio? Explain their purpose, and the link between these structs and the one explained in 3.2.1.

3.3 Block I/O

3.3.1 How is a block operation intended for in-kernel use represented? How do you "submit" it for completion?

3.3.2 In a block I/O, where will the data be written to (for a read operation) or read from (for a write operation)? Does this space need to be contiguous? How does it work? You can draw a sketch to explain it.

4 Project

You have to implement a new RAID module for the Linux kernel 3.2.66, in
/drivers/md/nuda.c, which will be built using the Linux MD subsystem. It will be named
"nuda" and will have the level number -2. It will be a kind of combination of level
-1 (linear) and level 0 (striping). Looking at their related source code may help
you. Note that the manipulation tool you found in the questions will also have to be
modified to allow the creation of arrays using NUDA.
The NUDA level only needs to support two devices for this assignment. The first
device will always be considered as the SSD and the second as the HDD.

4.1 Chunks

Figure 1: Chunks mapping (the physical chunks of the SSD and the HDD are mapped onto the virtual chunks of the MD disk)


As shown in figure 1, the physical drives will be divided into chunks. The chunk size can
already be set using the tool to manipulate MD RAID arrays, and is a multiple of 4k. If the
first device is of 100MB and the second of 200MB (no need to test with big drives!), with
the default chunk size of 512K, it means you'll have 100M/512K + 200M/512K = 200 + 400
chunks, give or take a little space lost at the beginning and the end of each drive.
The NUDA module will create a virtual drive combining the two physical drives, by
combining all the chunks as if they all belonged to one single drive, implemented
by your module.
In our example, the 600 chunks will be merged into a 300MB drive. At this point, the
module will behave like a linear array, and you are strongly encouraged to test your module
with only those features and check that everything already works.
A simple "chunk" mapping for our example can be done like this: when you receive
a BIO request, you compute the index of the chunk where the data it requests can be found.
The BIO system sends requests with an address in sectors; you simply have to divide it
by the chunk size, which is also given in sectors, and you'll end up with a chunk index.
If the BIO request falls inside the first 200 chunks, you simply forward the request to the
first drive, by changing the destination block device and recomputing the target sector as
the chunk index multiplied by the chunk size (plus the offset inside the chunk). If it is for
the next 400 chunks, you subtract 200 from the chunk index and forward the request to the
second drive using the same technique. Look at other RAID modules to see how it is done
in practice.
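
As an illustration, a minimal make_request sketch for this linear step could look like the
following. It is written against the 3.2-era bio fields (bi_sector, bi_bdev); all the
NUDA-specific names (nuda_conf, nuda_make_request, chunks_on_ssd) are hypothetical, and it
assumes the request does not cross a chunk boundary (see the remark about RAID 0 below).

#include <linux/blkdev.h>
#include "md.h"

/* Hypothetical per-array private data, kept in mddev->private. */
struct nuda_conf {
        struct md_rdev *dev[2];     /* dev[0] = the SSD, dev[1] = the HDD */
        sector_t chunks_on_ssd;     /* 200 in the 100MB + 200MB example   */
};

/* Linear-step make_request: in kernel 3.2 a personality's make_request
 * takes (mddev, bio), returns void, and resubmits the retargeted bio itself. */
static void nuda_make_request(struct mddev *mddev, struct bio *bio)
{
        struct nuda_conf *conf = mddev->private;
        unsigned int chunk_sects = mddev->chunk_sectors;  /* chunk size, in sectors */
        sector_t chunk = bio->bi_sector;
        sector_t offset = sector_div(chunk, chunk_sects); /* chunk := index, offset := position inside it */
        int drive = 0;

        if (chunk >= conf->chunks_on_ssd) {
                /* Chunks 200..599 live on the second drive: shift the index back. */
                chunk -= conf->chunks_on_ssd;
                drive = 1;
        }

        /* Retarget the bio; a real module would also add rdev->data_offset. */
        bio->bi_bdev = conf->dev[drive]->bdev;
        bio->bi_sector = chunk * chunk_sects + offset;

        generic_make_request(bio);
}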

Pay attention to the fact that requests could be done on a segment of data which
overlaps two chunks. The RAID 0 module also uses chunks; you should look at how it
avoids nearly all the complicated cases.
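
For instance, RAID 0 installs a merge_bvec callback so that the block layer almost never
builds a bio that crosses a chunk boundary, leaving only a rare case to handle in
make_request. A sketch of the same idea for NUDA, assuming a power-of-two chunk size
(the name nuda_mergeable_bvec is made up):

#include <linux/blkdev.h>
#include "md.h"

/* Tell the block layer how many bytes may still be added to a bio so that
 * it does not spill over the end of the chunk it starts in. */
static int nuda_mergeable_bvec(struct request_queue *q,
                               struct bvec_merge_data *bvm,
                               struct bio_vec *biovec)
{
        struct mddev *mddev = q->queuedata;
        sector_t sector = bvm->bi_sector + get_start_sect(bvm->bi_bdev);
        unsigned int chunk_sectors = mddev->chunk_sectors;   /* assumed power of two */
        unsigned int bio_sectors = bvm->bi_size >> 9;
        int max;

        /* Bytes remaining before the end of the current chunk. */
        max = (chunk_sectors - ((sector & (chunk_sectors - 1)) + bio_sectors)) << 9;
        if (max < 0)
                max = 0;
        if (max <= biovec->bv_len && bio_sectors == 0)
                return biovec->bv_len;   /* accept at least one segment in an empty bio */
        return max;
}

/* Registered once at array start-up, e.g. in the run() callback:
 *   blk_queue_merge_bvec(mddev->queue, nuda_mergeable_bvec);           */
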
At this "linear array" step, you should be able to install a filesystem on your device
using mkfs.ext4 /dev/md0, mount it using sudo mkdir -p /mnt/target && sudo mount
/dev/md0 /mnt/target, copy files onto it, and check that they are not corrupted using
md5sum /path/to/file.

4.2 Indirection table

In a second step, you'll make every virtual chunk map to any physical chunk.
When a request to a chunk arrives, you'll have to find the corresponding 1-to-1
mapped physical chunk using an indirection table, as shown in figure 2. The indirection
table contains 3 fields (but you may add some if you need to): a drive index, an
unsigned char telling on which drive the real physical chunk is; a chunk index,
telling which physical chunk inside that drive will contain/contains the data for this
request; and a third field which counts the total number of reads and writes addressed
to that virtual chunk.
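
For instance, the table could simply be an in-memory array with one entry per virtual
chunk. The names and field widths below are only an assumption, added as conf->table
and conf->total_chunks to the nuda_conf sketched earlier:

#include <linux/types.h>
#include <linux/atomic.h>

struct nuda_map_entry {
        unsigned char drive;        /* 0 = first device (SSD), 1 = second device (HDD)      */
        sector_t      chunk_idx;    /* index of the physical chunk inside that drive        */
        atomic_t      access_count; /* total reads + writes addressed to this virtual chunk */
};

/* e.g. conf->table = vmalloc(total_chunks * sizeof(struct nuda_map_entry)); */
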
The final goal of NUDA is to try to make all often-used virtual chunks map
only to chunks of the SSD. Again, after having built a simple indirection table and
before modifying it, you should check that your translation mechanism works correctly
before going to the next step. When it works, randomize the mapping at
initialization time; the filesystem should still work without any problem.
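
If it helps, here is one sketch of such a randomization at initialization time, built on the
hypothetical nuda_map_entry array: it first writes the identity (linear) mapping and then
applies a Fisher-Yates shuffle using get_random_bytes():

#include <linux/random.h>

static void nuda_randomize_table(struct nuda_map_entry *table,
                                 sector_t total_chunks, sector_t chunks_on_ssd)
{
        sector_t i;

        /* Start from the identity mapping of the linear step. */
        for (i = 0; i < total_chunks; i++) {
                table[i].drive = (i < chunks_on_ssd) ? 0 : 1;
                table[i].chunk_idx = (i < chunks_on_ssd) ? i : i - chunks_on_ssd;
                atomic_set(&table[i].access_count, 0);
        }

        /* Fisher-Yates: swap every entry with a randomly chosen earlier one. */
        for (i = total_chunks - 1; i > 0; i--) {
                struct nuda_map_entry tmp;
                sector_t j;
                u32 r;

                get_random_bytes(&r, sizeof(r));
                j = r % (u32)(i + 1);   /* chunk counts easily fit in 32 bits */

                tmp = table[i];
                table[i] = table[j];
                table[j] = tmp;
        }
}
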
Figure 2: Indirection table (one entry per virtual chunk, with its drive, its index of chunk in drive, and its access count)

4.3 Swapping

The final step for this second assignment is to handle the write requests so that they
always end up writing to the SSD.
When a read request is sent to your module, you will simply update the access
count and do the translation process.
When a write request is done to a chunk currently backed by the first drive (the
SSD), you'll do the same.
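
Under the same assumptions as the earlier sketches (hypothetical nuda_conf and
nuda_map_entry), that translation step could look like this:

/* Look up the indirection table, bump the access count and retarget the bio. */
static void nuda_remap(struct mddev *mddev, struct bio *bio)
{
        struct nuda_conf *conf = mddev->private;
        unsigned int chunk_sects = mddev->chunk_sectors;
        sector_t chunk = bio->bi_sector;
        sector_t offset = sector_div(chunk, chunk_sects);
        struct nuda_map_entry *e = &conf->table[chunk];

        atomic_inc(&e->access_count);
        bio->bi_bdev = conf->dev[e->drive]->bdev;
        bio->bi_sector = e->chunk_idx * chunk_sects + offset;
}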

But when a write request is addressed to a chunk currently on the HDD, NUDA will
exchange that physical HDD chunk with another physical chunk on the SSD. The goal
is that all writes will always be done on the SSD, and therefore, as soon as a write request
is done, its data will be on the SSD and accessible in a fast way, assuming
that if you write some data, you will access it again soon.
When that last case happens, you'll add the chunk to a "swap pending" list and
flag an MD thread as runnable, so that it will do the exchange later by reading the list.
You can't do any other processing regarding the swapping operation inside the
"make_request" function. The whole swapping work has to be done inside that other
thread.
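
A possible shape for that deferred work, assuming the conf structure also carries a list and
a spinlock initialised at array start-up, and an MD thread registered with
md_register_thread() in the run() callback (all NUDA-specific names are hypothetical):

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include "md.h"

/* One deferred swap request; conf->pending_swaps / conf->pending_lock are
 * assumed to be a list_head and a spinlock_t of the nuda_conf. */
struct nuda_pending_swap {
        struct list_head list;
        sector_t virtual_chunk;     /* the virtual chunk that must move to the SSD */
};

/* Called from make_request when a write hits a chunk currently on the HDD.
 * It only queues the work and wakes the thread: no swapping happens here. */
static void nuda_queue_swap(struct mddev *mddev, struct nuda_conf *conf,
                            sector_t virtual_chunk)
{
        struct nuda_pending_swap *p = kmalloc(sizeof(*p), GFP_ATOMIC);

        if (!p)
                return;             /* no swap this time, the write still goes through */
        p->virtual_chunk = virtual_chunk;

        spin_lock(&conf->pending_lock);
        list_add_tail(&p->list, &conf->pending_swaps);
        spin_unlock(&conf->pending_lock);

        /* mddev->thread was registered with md_register_thread();
         * md_wakeup_thread() flags it as runnable. */
        md_wakeup_thread(mddev->thread);
}

/* Skeleton of the thread body: it drains the pending list, picks the
 * least-accessed SSD-backed chunks and exchanges the two table entries. */
static void nuda_thread(struct mddev *mddev)
{
        /* walk conf->pending_swaps under conf->pending_lock,            */
        /* find the SSD entry with the smallest access_count,            */
        /* swap the two struct nuda_map_entry values (no data copy yet). */
}
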
That thread will find enough virtual chunks currently backed by the SSD to do the
swapping with all the pending HDD chunks. As they will move to the HDD, you've got
to find the least-accessed chunks using the access count field of the indirection table. In
practice, you should send new requests to the two drives to read and copy the two chunks
into memory, and then copy their data to exchange the chunks. However, for this
assignment, the real data swapping doesn't need to be done. You will just exchange
the mappings inside the indirection table. Of course, that means that the data will be
corrupted, and you can't use a filesystem on NUDA at this point. But you can still
access /dev/md0 to check the content.
To be able to read the state of the indirection table from userspace, you'll create a
new sysfs folder /sys/block/md0/chunks (if md0 is the name of the NUDA array
created by the user). This folder will contain one sub-folder per chunk, named after the
virtual chunk index, and each of these sub-folders will contain a "drive",
a "chunk_idx" and an "access_count" sysfs file which will allow reading (and only
reading) the current values of those fields inside the indirection table.
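
Below is one possible way to expose those read-only entries with plain kobjects, again built
on the hypothetical nuda_map_entry table; it recovers the chunk index from the name of the
per-chunk directory, and global_conf is an assumed module-level pointer to the array's
private data (error cleanup omitted). This is only a sketch of the approach, not the only
valid layout:

#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/string.h>

static struct nuda_conf *global_conf;   /* assumed to be set when the array starts */
static struct kobject *chunks_kobj;     /* /sys/block/mdX/chunks */

static ssize_t nuda_chunk_show(struct kobject *kobj,
                               struct kobj_attribute *attr, char *buf)
{
        unsigned long v;
        struct nuda_map_entry *e;

        /* The per-chunk directory is named after the virtual chunk index. */
        if (kstrtoul(kobject_name(kobj), 10, &v) || v >= global_conf->total_chunks)
                return -EINVAL;
        e = &global_conf->table[v];

        if (!strcmp(attr->attr.name, "drive"))
                return sprintf(buf, "%u\n", e->drive);
        if (!strcmp(attr->attr.name, "chunk_idx"))
                return sprintf(buf, "%llu\n", (unsigned long long)e->chunk_idx);
        return sprintf(buf, "%d\n", atomic_read(&e->access_count));
}

static struct kobj_attribute drive_attr = __ATTR(drive, 0444, nuda_chunk_show, NULL);
static struct kobj_attribute idx_attr   = __ATTR(chunk_idx, 0444, nuda_chunk_show, NULL);
static struct kobj_attribute cnt_attr   = __ATTR(access_count, 0444, nuda_chunk_show, NULL);

/* Called once at array start-up: creates /sys/block/mdX/chunks/<i>/{drive,chunk_idx,access_count}. */
static int nuda_create_sysfs(struct mddev *mddev, sector_t total_chunks)
{
        sector_t i;
        char name[16];

        chunks_kobj = kobject_create_and_add("chunks",
                                             &disk_to_dev(mddev->gendisk)->kobj);
        if (!chunks_kobj)
                return -ENOMEM;

        for (i = 0; i < total_chunks; i++) {
                struct kobject *k;

                snprintf(name, sizeof(name), "%llu", (unsigned long long)i);
                k = kobject_create_and_add(name, chunks_kobj);
                if (!k)
                        return -ENOMEM;
                if (sysfs_create_file(k, &drive_attr.attr) ||
                    sysfs_create_file(k, &idx_attr.attr) ||
                    sysfs_create_file(k, &cnt_attr.attr))
                        return -ENOMEM;
        }
        return 0;
}
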
The indirection table doesn't need to be physically stored on the disks at this point,
so when your computer restarts or when the array is stopped, the indirection table is
lost, and therefore the data too, as you won't be able to put the physical chunks back in
order.
As with the linear array or RAID level 0, there is no notion of a "degraded" array;
therefore you don't need to use the bitmap system or all the flags, fields and functions
related to the "resync" of a normal Redundant Array of Independent Disks, as there is no
redundancy.

4.4 Testing

You may or may not start from the code of Assignment #1. It's up to you, but
the seqgen repeater could be useful for testing purposes. You can for example set the
sequence to "1", completely fill /dev/sdb1 with it using dd if=/dev/seqgen of=/dev/sdb1,
do the same with /dev/sdc1 but filling it with "2", and try to write some "3" at a specific
location using dd if=/dev/seqgen of=/dev/md0 bs=512 count=1 seek=100k, which will
write 512 "3"s starting at the 102400*512th byte (with dd, 100k means 100*1024 = 102400).
If you use the default chunk size, it means you'll write in the 100th chunk, and therefore,
if this chunk is on the HDD, it should create a swap in the indirection table and write the
"3"s over "1"s (instead of over "2"s, which is what you would see if the swap had really
been done). You can use dd if=/dev/md0 bs=512 count=2 skip=100k to check the sector you
wrote and the sector after.

4.5 Final words

Just to be crystal clear: only the real data swapping can be missing for this assignment #2,
but the chunk system, the /sys entries and the indirection table MUST be implemented.
Make all your kernel messages start with "[NUDA]". I always filter your log messages
with that string. If I get no information about what is wrong, I may consider that one of
my tests makes your code fail completely while it is just a little bug.
You will use the proper kernel functions whenever possible. Bad hooking,
re-writing existing kernel code, ugly things, bad programming style, inappropriate task
or drive states, or kernel instabilities will result in a loss of points.

Good luck!
