
Design Considerations for

Developing a Disk File System

A thesis submitted in partial fulfilment of the


requirement for the degree of
Doctor of Philosophy (Ph.D)

by

Wasim Ahmad Bhat

P.G. Department of Computer Sciences


Faculty of Applied Sciences & Technology
University of Kashmir

under the supervision of

Dr. S.M.K. Quadri


in
Computer Science
March, 2012
Declaration

This is to certify that the thesis entitled “Design Considerations for Developing
a Disk File System”, submitted by Mr. Wasim Ahmad Bhat in the Department
of Computer Sciences, University of Kashmir, Srinagar for the award of the degree
of Doctor of Philosophy in Computer Science, is a record of an original research
work carried out by him under my supervision and guidance. The thesis has
fulfilled all the requirements as per the regulations of the University and in my
opinion has reached the standards required for the submission. The results em-
bodied in this thesis have not been submitted to any other University or Institute
for the award of any degree or diploma.

Supervisor & Head

(Dr. S.M.K. Quadri)


Department of Computer Sciences
University of Kashmir
Srinagar, 190 006

Dated: 27-March-2012
To my genius father & loving mother for everything,
&
to my little angels; Zeenat, Falak, Dawood & Farhan, for their smile.
Acknowledgements

Allah, the almighty, the lord of all worlds, the most merciful, the most gracious,
be witness, I acknowledge your every minute and greatest blessing on me, apparent
and hidden favours, from the time before, to the time after this life, in all the
quarters, including my research and writing of this Ph.D thesis, just for my belief
in you.

I would like to express my sincere gratitude to my advisor Dr. S. M. K. Quadri for


his continuous support during my Ph.D study and research. I acknowledge his pa-
tience, motivation, enthusiasm, and immense knowledge. His guidance helped me
throughout my Ph.D study and writing of this thesis. I could not have imagined
having a better advisor and mentor for my Ph.D study. Thank you sir.

Besides my advisor, I owe a great debt of gratitude to the teaching faculty of


Department of Computer Sciences, UoK for teaching me in my Masters degree.

I am also indebted to Dr. Rana Hashmy for having discussions regarding my


research, sharing general ideas, and boosting my morale. Thank you ma'am.

I also appreciate the support of non-teaching faculty of Department of Computer


Sciences, UoK in aspects that facilitated smooth working of my study.

As a Ph.D student in UoK, I spent most of my time in research lab. I thank my


fellow lab-mates for discussions, for working together late hours before deadlines,
for staying with me for 2 months in the department during winter of 2010 and for
their suggestions and criticism. Furthermore, I thank them, and my friends and
students, for listening to my arguments, reading my papers, pinpointing where I
went wrong and their valuable suggestions.

Besides research activities, I also enjoyed my stay in UoK. I thank my fellow lab-
mates, colleagues and friends for sharing tea in canteen, for playing counter-strike
in lab during leisure hours, for cooking food during our 2 month long stay in the
department during the winter of 2010, and everything else.
This thesis would not have been possible without the efforts of people who
don’t know me. They include researchers of file system community, in particular
Dr. Mendel Rosenblum, Dr. D.S.H. Rosenthal, Dr. Margo Seltzer, Dr. Erez Zadok
and many more. I thank you for your contributions. Furthermore, I thank Dennis
Ritchie for C language, Linus Torvalds for Linux kernel, Donald Knuth for TeX and
Leslie Lamport for LaTeX, Pascal Brachet for TexMaker, authors of InkScape, Scilab
Consortium for Scilab, developers and maintainers of Fedora OS, and designers,
developers and maintainers of hundreds of file systems.

Also, this work would not have been possible without my experiments. I acknowl-
edge my hp8100 Elite desktop for running my experiments and my hp nx7300 lap-
top for drafting my papers and this thesis. I also acknowledge the suffering of
ENTER Key. You did well.

Last but not least, no words are ever sufficient to express my everlasting love,
gratitude, appreciation and thanks to my beloved father and mother for being
the light of my life. I am forever indebted to them and my brother who, understanding
the importance of this work, suffered my hectic working hours.

Wasim Ahmad Bhat


Abstract

File Systems manage the most important entity of a digital computer system,
namely digital data. They are responsible for the storage and retrieval of digi-
tal data residing on the storage sub-system. However, this job of file systems is
made difficult by the advancements and enhancements in hardware technology,
and the dynamic and intense user requirements. The current hardware technology
trends and workloads challenge the performance, scalability and extensibility of
file systems. As such, file systems need to evolve to cope with this requirement.
Though file systems have been evolving since their inception, the way they have
evolved has unknowingly led to other problems. These include objective-specific
disk file systems, incompatible variants of the same basic design, reluctance on
the part of a file system to adopt new features, and so on. Furthermore, the
benchmarking methodology followed to evaluate the performance of a file system
is problematic.

The motivation of this research is to enhance file systems to overcome and mit-
igate the challenges put forth by current hardware technology trends and user
requirements such that the effort can be applied to all file systems, or at least a
class of file systems in order to avoid the problems inherent in existing evolution.
The central theme in this thesis is to overcome the challenges of performance, scal-
ability and extensibility of file systems without modifying the design or source of
the file system. In this thesis, we investigate various existing approaches and
new concepts to achieve this goal. These include exploiting the concept of vir-
tual unification to break file size limitation, the hybrid storage for performance
and the block level behaviour for reliable and efficient secure data deletion. Fur-
thermore, this thesis also investigates a new benchmarking methodology for file
system benchmarks.

To validate the concepts, we have designed four file systems, namely suvFS, hFAT,
restFS and OneSec, and have implemented three of them and simulated the
fourth. In addition, we have empirically evaluated the claims made by these file
systems.

Our evaluations show that suvFS can break the file size limitation of any file system
without modifying the design or source of that file system. Although suvFS adds
performance overhead, the overhead is mainly due to the FUSE framework. Fur-
thermore, the normalised statistics revealed that the average performance over-
head added by suvFS algorithms during reading is as low as 0.59 % while during
writing and deleting it is as high as 11.22 % and 14.77 % respectively. However,
the average growth of performance overhead with doubling the file size is 0.51 %
for reading, 6.73 % for writing and 9.54 % for deletion.

Furthermore, the simulation of the hFAT file system indicates that it can reduce
the latency incurred by FAT32 file system operations by a minimum of 25 %,
10 % and 90 % during writing, reading and deleting a large number of small files,
respectively, if a solid state storage device having latency less than or equal to
10 % of that of the magnetic disk is used in addition. Moreover, this performance
enhancement does not demand any design or source modification of the file system.

Similarly, the evaluation of restFS showed that it can save a minimum of 28 %


of data blocks that needed to be overwritten and can reduce the number of disk
writes issued by 88 % as compared to other existing overwriting techniques when
mounted on top of the ext2 file system. Furthermore, this feature does not demand
any design or source modification of the file system.

Finally, OneSec pointed out that the Linux OS file system framework adds large
costs to file system operations, amounting to a minimum of 76.93 % in OneSec.
Furthermore, on evaluating the ext2 file system against OneSec, the evaluation
pointed out that the read operation of ext2 comes within 96.86 % of the efficient
design, while deletion comes within 95.62 % and writing within 95.39 %. Also,
the comparison of normalised results showed that the ext2 file system should
reduce its disk I/Os in order to yield better performance.
Contents

List of Figures xi

List of Tables xii

List of Algorithms xiii

Glossary xiv

1 Introduction 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 File System Scalability 9


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 Impact of H/W Technology Trends on Growth & Proliferation of Digital Data 12
2.2.1 Micro-Processor Trends: Domain and depth of technology . . . . . . . 12
2.2.2 Network Trends: Fast and affordable sharing . . . . . . . . . . . . . . 13
2.2.3 Magnetic Disk Drive Trends: Large and affordable storage . . . . . . . 15
2.3 Comparison of Popular File Systems for Scalability limitations . . . . . . . . 15
2.3.1 Volume Size Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.2 File Count Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 File Size Limitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 suvFS: A Virtual File System to Break File Size Limitation . . . . . . . . . . 19
2.4.1 Design of suvFS File System . . . . . . . . . . . . . . . . . . . . . . . . 22


2.4.2 Implementation of suvFS File System . . . . . . . . . . . . . . . . . . 23


2.4.2.1 suvfs write() call . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2.2 suvfs read() call . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.2.3 suvfs getattr() call . . . . . . . . . . . . . . . . . . . . . . 28
2.4.2.4 suvfs readdir() call . . . . . . . . . . . . . . . . . . . . . . 28
2.4.2.5 suvfs unlink() & other related calls . . . . . . . . . . . . . 29
2.4.2.6 suvfs open() & other related calls . . . . . . . . . . . . . . 30
2.4.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4.4 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4.6 Limitations & Future Scope . . . . . . . . . . . . . . . . . . . . . . . . 36

3 File System Performance 37


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Factors Affecting Performance of a Disk File System . . . . . . . . . . . . . . 40
3.2.1 Magnetic Disk Drive Performance Limitations . . . . . . . . . . . . . . 40
3.2.2 Conventional File System Design Constraints . . . . . . . . . . . . . . 41
3.2.3 Workload Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Notable High Performance Disk File System Designs . . . . . . . . . . . . . . 44
3.3.1 Fast File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.2 Log Structured File System . . . . . . . . . . . . . . . . . . . . . . . . 47
3.4 Other Proposals to Reduce Head Positioning Latencies . . . . . . . . . . . . . 49
3.4.1 Adaptive Disk Layout Re-arrangement . . . . . . . . . . . . . . . . . . 49
3.4.2 Caching & Automatic Pre-fetching . . . . . . . . . . . . . . . . . . . . 50
3.4.3 Exploiting Hybrid File System Designs . . . . . . . . . . . . . . . . . . 51
3.4.4 Exploiting Hybrid Storage Devices . . . . . . . . . . . . . . . . . . . . 52
3.5 hFAT: A High Performance Hybrid FAT32 File System . . . . . . . . . . . . . 53
3.5.1 History of FAT File Systems . . . . . . . . . . . . . . . . . . . . . . . 54
3.5.2 Performance Problems in FAT32 file system . . . . . . . . . . . . . . . 56
3.5.3 Design of hFAT File System . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5.4 Working of hFAT Stackable Device Driver . . . . . . . . . . . . . . . . 60
3.5.5 Simulation of hFAT Stackable Device Driver . . . . . . . . . . . . . . . 62
3.5.6 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.5.7 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64


3.5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5.9 Limitations & Future Scope . . . . . . . . . . . . . . . . . . . . . . . . 67

4 File System Extensibility 68


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.2 File System Layering: Stacking Models, Support in OSes, and Alternatives . 71
4.2.1 vnode Stacking Models . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1.1 Rosenthal’s Stacking Model . . . . . . . . . . . . . . . . . . . 71
4.2.1.2 UCLA Stacking Model . . . . . . . . . . . . . . . . . . . . . 73
4.2.2 Layering Support in Popular OSes . . . . . . . . . . . . . . . . . . . . 73
4.2.2.1 Solaris . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2.2 FreeBSD Unix . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2.3 Linux . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2.2.4 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.3 Alternatives to File System Layering . . . . . . . . . . . . . . . . . . . 75
4.2.3.1 Micro-Kernel Architecture . . . . . . . . . . . . . . . . . . . 76
4.2.3.2 NFS Loopback Server . . . . . . . . . . . . . . . . . . . . . . 76
4.2.3.3 Trap & Pass Frameworks . . . . . . . . . . . . . . . . . . . . 76
4.3 Application of File System Layering . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.1 Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3.2 Data Transforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.3 Size Changing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.4 Operation Transformation . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3.5 Fan-Out File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 restFS: Secure Data Deletion using Reliable & Efficient Stackable File System 79
4.4.1 Background of Secure Data Deletion using Over-writing . . . . . . . . 81
4.4.2 Design of restFS File System . . . . . . . . . . . . . . . . . . . . . . . 83
4.4.3 Implementation of restFS File System . . . . . . . . . . . . . . . . . . 86
4.4.4 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.4.5 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.7 Limitations & Future Scope . . . . . . . . . . . . . . . . . . . . . . . . 93


5 File System Benchmarking 94


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.2 Proposed Guidelines for File System Benchmarking . . . . . . . . . . . . . . . 97
5.2.1 Evaluate performance of both metadata & userdata operations . . . . 97
5.2.2 Application specific benchmarking . . . . . . . . . . . . . . . . . . . . 97
5.2.3 Don’t reduce vector result to scalar . . . . . . . . . . . . . . . . . . . . 97
5.2.4 Ageing affects file system performance . . . . . . . . . . . . . . . . . . 98
5.2.5 Caching affects file system performance . . . . . . . . . . . . . . . . . 98
5.2.6 How to Validate the Results? . . . . . . . . . . . . . . . . . . . . . . . 98
5.2.7 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.3 OneSec: A Modelled Hypothetical File System for Benchmarking Comparison 100
5.3.1 Design of OneSec File System . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.2 syscalls supported and their modelling . . . . . . . . . . . . . . . . . 105
5.3.3 Implementation of OneSec File System . . . . . . . . . . . . . . . . . . 106
5.3.3.1 Callback functions of OneSec file system module . . . . . . . 108
5.3.3.2 Calling sequence of OneSec callback functions . . . . . . . . 109
5.3.4 The Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.3.4.1 Benchmarking Methodology . . . . . . . . . . . . . . . . . . 111
5.3.5 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.6 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.3.8 Limitations & Future Scope . . . . . . . . . . . . . . . . . . . . . . . . 116

6 Conclusions & Future Scope 117


6.1 Conclusions Drawn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Lessons Learned . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Future Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

Publications 122

References 125

List of Figures

2.1 Both speed & mobility in wireless networks is enhancing . . . . . . . . . . . . 14


2.2 Capacity of magnetic disk doubles every year while cost halves . . . . . . . . 16
2.3 Design of suvFS using FUSE framework . . . . . . . . . . . . . . . . . . . . . 23
2.4 Call path of write() call to a file system via suvFS . . . . . . . . . . . . . . . 24
2.5 Screen-shot of suvFS in action when mounted on top of FAT32 file system . . 31
2.6 suvFS: Sprite LFS large-file micro-benchmark results for Write phase . . . . . . 33
2.7 suvFS: Sprite LFS large-file micro-benchmark results for Read phase . . . . . . 34
2.8 suvFS: Sprite LFS large-file micro-benchmark results for Delete phase . . . . . . 34

3.1 Logical on-disk layout of a conventional disk file system . . . . . . . . . . . . 42


3.2 On-disk layout of Old Unix File System and FFS . . . . . . . . . . . . . . . . . 46
3.3 Structural changes in LFS layout during update operation . . . . . . . . . . . 48
3.4 Relationship between various on-disk structures of FAT32 file system . . . . . 57
3.5 Effect of large volume size on seek distance between FAT Table & Clusters . . 58
3.6 Logical on-disk layout of hFAT file system . . . . . . . . . . . . . . . . . . . . 60
3.7 Design of hFAT file system using driver stacking . . . . . . . . . . . . . . . . . 61
3.8 Effect of various latencies on total latency of operations of hFAT . . . . . . . 66

4.1 File system layering types; a) Linear b) Fan-out and c) Fan-in . . . . . . . . . 72


4.2 Possibilities exploited by restFS for efficiency . . . . . . . . . . . . . . . . . . 84
4.3 Working of user level daemon of restFS . . . . . . . . . . . . . . . . . . . . . . 87
4.4 Behaviour of restFS during unlink() operation . . . . . . . . . . . . . . . . . 89

5.1 Logical on-disk layout of OneSec file system . . . . . . . . . . . . . . . . . . . 103


5.2 OneSec interaction with kernel components of Linux OS . . . . . . . . . . . . 107

List of Tables

2.1 Transistor count and speed in leading microprocessors . . . . . . . . . . . . . 13


2.2 Latency lags behind Bandwidth in networks . . . . . . . . . . . . . . . . . . . 14
2.3 Amount of digital data generated annually over a period of 5 years . . . . . . 17
2.4 Effect of resolution on size of individual multimedia files and on total volume 19
2.5 Performance overhead added by suvFS after normalisation . . . . . . . . . . . 35

3.1 Characteristics of common file system workloads . . . . . . . . . . . . . . . . 43


3.2 hFAT simulation report showing distribution of read blocks . . . . . . . . . . 64
3.3 hFAT simulation report showing distribution of written blocks . . . . . . . . . 64

4.1 restFS: Postmark benchmark report . . . . . . . . . . . . . . . . . . . . . . . . 91


4.2 restFS: efficiency to save multiple overwrites of same block . . . . . . . . . . . 92
4.3 restFS: efficiency to save number of disk writes issued . . . . . . . . . . . . . . 92

5.1 OneSec: syscalls supported & their modelling . . . . . . . . . . . . . . . . . 105


5.2 OneSec: Disk I/O and OS costs during Sprite LFS small-file benchmark execution . 114
5.3 Percentage of overheads added by disk I/Os and OS for OneSec . . . . . . . . 114
5.4 Comparison of ext2 file system with OneSec file system . . . . . . . . . . . . . 115
5.5 Disk I/O costs, OS costs and their overhead percentage in ext2 file system . . 115

List of Algorithms

2.1 Algorithm for suvfs write() of suvFS . . . . . . . . . . . . . . . . . . . . . . 26


2.2 Algorithm for suvfs read() of suvFS . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Algorithm for suvfs getattr() of suvFS . . . . . . . . . . . . . . . . . . . . 28
2.4 Algorithm for suvfs readdir() of suvFS . . . . . . . . . . . . . . . . . . . . 29
2.5 Algorithm for suvfs unlink() of suvFS . . . . . . . . . . . . . . . . . . . . . 29
2.6 Algorithm for suvfs open() of suvFS . . . . . . . . . . . . . . . . . . . . . . 30

3.1 Algorithm used during simulation of hFAT stackable device driver . . . . . . . 62

4.1 Algorithm for restfs unlink() call of restFS . . . . . . . . . . . . . . . . . . 85


4.2 Algorithm for restfs setattr() call of restFS . . . . . . . . . . . . . . . . . 85
4.3 Algorithm for restfs write() call of restFS . . . . . . . . . . . . . . . . . . 86

5.1 Algorithm for onesec fill super() of OneSec . . . . . . . . . . . . . . . . . 109


5.2 Algorithm for onesec creat() of OneSec . . . . . . . . . . . . . . . . . . . . 110
5.3 Algorithm for onesec unlink() of OneSec . . . . . . . . . . . . . . . . . . . 110
5.4 Algorithm for onesec open() of OneSec . . . . . . . . . . . . . . . . . . . . . 110
5.5 Algorithm for onesec read() of OneSec . . . . . . . . . . . . . . . . . . . . . 110
5.6 Algorithm for onesec write() of OneSec . . . . . . . . . . . . . . . . . . . . 110

Glossary

Areal Density  A measurement of the density of written information on the disk's surface in bits per square inch.

Block  A fixed set of consecutive disk sectors which are allocated in their entirety to a file or directory, like a Cluster.

Buffer cache  Reading the information from disk only once and then keeping it in memory until no longer needed speeds up all but the first read. The memory used for this purpose is called the buffer cache.

CGR  A measure of how much something grew on average, per year, over a multiple-year period, after considering the effects of compounding.

Cluster  A fixed set of consecutive disk sectors which are allocated in their entirety to a file or directory, like a Block.

Extent  A contiguous range of sectors allocated to a file, represented by a pair of numbers: the first sector and the number of sectors.

inode  A data structure on traditional Unix-style file systems such as UFS, which stores all the information about a regular file, directory, or other file system object, except its data and name.

Kernel Mode  Also referred to as system mode; one of the two distinct modes of an OS, which can execute any instructions, reference any memory addresses, and has complete control over everything.

Kernel Module  An object file that contains code to extend the running kernel, or so-called base kernel, of an operating system.

libfs  A set of routines in the Linux 2.6 kernel which are designed to make the task of writing a virtual filesystem easier.

rpm  Revolutions per minute; a measure of the frequency of rotation of magnetic disk platters.

superblock  A record of the characteristics of a filesystem, including its size, the block size, the size and location of the inode tables, the disk block map and usage information, and the size of the block groups.

syscall  System calls provide the interface between a process and the operating system, meant to request a service from an operating system's kernel.

User Mode  One of the two distinct modes of an OS, which can execute only limited instructions and reference limited memory addresses.

VFS  An abstraction layer on top of a more concrete file system whose purpose is to allow client applications to access different types of concrete file systems in a uniform way.

vnode  A kernel object that provides a common interface to any kind of open file: a regular or directory file, a block or character device file, a pipe, etc.

ZCAV  A scheme in which the disk is divided into a series of zones, each of which has a different number of sectors and therefore different performance characteristics.

Chapter 1

Introduction


1.1 Introduction

A file system is an important part of an operating system as it provides a way for information
to be stored, navigated, processed and retrieved in the form of files and directories from the
storage subsystem. It is generally a kernel module which consists of algorithms to manage
the logical data structures residing on the storage subsystem. The basic functions that
every file system provides include operations like file creation, file reading, file writing, file
deletion and so on. Apart from these basic functions, some file systems also provide additional
functionalities such as transparent compression and encryption, and features like alternate
data streams, journalling, versioning and so on. Earlier, file systems did not have names; they
were considered an implicit part of an operating system kernel. The first file system to have
a name was DECTape, named after the company that made it, DEC. In 1973, UNIX Time
Sharing System V5 named its file system V5FS while, at the same time, the CP/M file system was
simply called CP/M. This nomenclature continued until 1981 when MS-DOS introduced the
FAT12 file system. Since then, file systems have had their much-required and well-deserved identity.
File systems are used on data storage devices such as magnetic tape drives, magnetic disk
drives, floppy disks, optical discs, solid state storage devices, and so on. They may also be used
to access data residing on a file server by acting as clients for a network protocol (e.g. NFS,
SMB, or 9P clients). Moreover, they may be virtual and exist only as an interface to access
I/O devices or to access non-persistent data of an operating system (e.g. procfs). Accordingly,
file systems can be classified into disk file systems, tape file systems, network file systems,
device file systems, special purpose file systems and so on. Due to a central role played by file
systems to store and retrieve persistent data, file system design and development is a crucial
task. However, due to diversity in types of storage locations, the task has become complex.
As an example, a file system designed for magnetic disk media has to take into consideration
the access latencies incurred due to the mechanical nature of the disk. In contrast, for flash media
such latencies do not exist. However, flash media tends to wear out when a single block
is repeatedly overwritten which needs to be considered while designing a file system for it.
Nevertheless, a large number of considerations are common to all file systems designed for
different types of storage locations. Moreover, magnetic disk file systems are far more common
and widely used than any other file system. Furthermore, most of the existing
file systems have been basically designed for magnetic disks and were later ported to other
storage devices.


In general, disk file systems were designed by individual researchers and the software
industry. From time to time, these designs were modified to accommodate certain changes
required due to the changes in hardware technology and user requirements. In addition, new
designs and novel concepts surfaced to mitigate challenges put forth by same factors. As an
example, the basic design of Windows FAT file system was modified several times to support
large files, a large number of files and large volumes when it couldn't cope with the growing
capacity of magnetic disks. This resulted in scalable variants of the same FAT design in the form
of FAT12, FAT16, FAT32 and now exFAT. Similarly, the Old Unix file system (UFS) was modified
to perform better when it couldn't address the growing demands of workloads. This resulted
in many superior performance file systems in the form of FFS (Fast File System), C-FFS
(Co-Locating FFS) and so on. Furthermore, the Linux Extended file systems were also modified to
be more scalable and support advanced features, which resulted in the ext2, ext3 and ext4 file
systems. In general, all the popular and influential disk file system designs have been modified
from time to time to mitigate the challenges put forth by changes in hardware technology
and/or user requirements. Also, designs like LFS, ZFS, etc. and many other novel concepts
have emerged from time to time to address such challenges.
Although this incremental development in the basic design of disk file systems has
successfully solved problems, it has at the same time given rise to other problems. These
include objective-specific disk file systems, incompatible variants of the same basic design,
reluctance on the part of a file system to adopt new features, and many more.
This work aims to overcome the problems in the basic design parameters of a disk file
system in such a way that the effort is applicable to most disk file systems without any
design or source change.
The rest of this chapter is organised as follows: Section 1.2 unveils the motivation behind
this work while Section 1.3 sets the goals and objectives. Section 1.4 presents the
contributions made in this work and, finally, Section 1.5 gives an outline of this thesis.

1.2 Motivation

Solid state electronics is advancing and because of that microprocessors, primary memory,
bus architectures, networks and other such components are becoming more capable and
faster. The only component that lags behind is the magnetic disk drive; all because of being
mechanical at heart. As a consequence, disk file system designs need to be readdressed in
order to cope with this speed mismatch. File System Performance is that design


metric which exploits various data structures, layout policies and algorithms to deliver high
performance.
Furthermore, digital technologies are becoming affordable, and as such they penetrate
deeper and deeper into every aspect of our day to day life. This leads to the voluminous
creation, growth and proliferation of digital data. Therefore, file system designs need to be
readdressed in order to support large files, large number of files, large volumes and so on. File
System Scalability is that design metric which imposes these size and count limitations.
Similarly, due to widespread usage of computers, the significance of digital data has
increased to a point where survival of the information stock can matter more than the tem-
porary loss of an office or a factory. This necessitates incremental evolution of file systems
to support new and advanced features such as security, reliability, usability and so on. File
System Extensibility is that design metric which dictates the flexibility of a file system to
support new and advanced features.
Though file systems have, from time to time, been modified to overcome Performance,
Scalability and Extensibility problems, the approach used has several limitations. First,
it requires design and source modification of file systems. Second, it results in incompatible
variants of the same basic design. Third, the effort has to be replicated for all file systems. Finally,
it takes time to make the new file system stable. Therefore, another approach needs to be
investigated so that overcoming such problems does not require design or source modification.
This has several benefits. First, the effort is applicable to a wide variety of file systems. Second,
no design or source modification is required, which otherwise leads to incompatible variants.
Third, the development time is reduced, and the stability and reliability of the existing file
system are not compromised. Finally, as the actual design and source are kept intact, the flexibility
of the design to support new features and overcome future challenges is not compromised. Also,
as the most common file systems were designed 10 to 20 years ago, with different concerns
and performance trade-offs than today, there is a potential for improvement.
Finally, the refinements and new designs to overcome performance, scalability and exten-
sibility problems need to be validated for their contribution. File System Benchmarking
does that; however, the prevailing benchmarking mentality is problematic, as two or more
practical file systems are compared against each other. This has led to the absence of a
standard file system benchmark, meaningless or incomplete results, and so on.


1.3 Objectives

The recurring theme of this thesis is to mitigate the challenges put forth by changes in hard-
ware technology and user requirements in a way such that no design modification or source
modification of disk file systems is required. As mentioned in section 1.2, the design met-
rics of file systems which are affected by these factors include Performance, Scalability and
Extensibility. Therefore, the main objectives of this thesis are:

1. Analyse various existing techniques for their approach, effectiveness and portability
to overcome the challenges put forth by the changes in hardware technology and user
requirements,

2. Identify various possible alternatives that can overcome these problems without design
and source modification of the file systems,

3. Design and implement/simulate the proposal in a way such that it does not ask for
design or source modification and is applicable to at least a class of file systems, and

4. Evaluate the proposal for its claim; both keeping design and source intact and over-
coming the problem.

Furthermore, a new file system benchmarking methodology is investigated such that:

1. A standard benchmark for at least a class of file systems is designed, and

2. The results are complete, i.e., they point designers to areas that have scope for im-
provement.

1.4 Contributions

The work presented in this thesis involves the mitigation of challenges put forth by the changes
in hardware technology and user requirements, by analysing existing policies and devising
new techniques that can be adopted by any existing file system. Furthermore, the thesis is
aimed to explore such new ideas and concepts, implement them and evaluate them for their
claim. There are four major contributions of this thesis. These are as follows:

1. This thesis discusses the complete design, implementation details and evaluation of
a new file system called suvFS. suvFS is a scalable userspace virtual file system that
breaks the file size limitation of any existing file system without any design or source


modification. suvFS achieves this by exploiting the concept of virtual unification and
has been implemented using FUSE framework.

2. This thesis discusses the complete design, implementation and simulation details, and
evaluation of a new file system called hFAT. hFAT is a high performance hybrid FAT32
file system that overcomes the performance problem of FAT file systems by exploiting
hybrid storage for the storage of metadata and userdata of a file system. hFAT does not
demand design or source modification and has been simulated for FAT32 file system as
a stackable device driver.

3. This thesis discusses the complete design, implementation details and evaluation of a
new file system called restFS. restFS is a reliable and efficient vnode stackable file system
that performs secure data deletion when mounted on top of those file systems which
export block allocation map of a file to VFS. restFS does not demand any design or
source modification and has been implemented for ext2 file system as a vnode stackable
file system.

4. This thesis discusses the complete design and implementation details of a modelled hy-
pothetical file system called OneSec. OneSec introduces a new benchmarking method-
ology in which a real life file system is compared against an efficient hypothetical file
system for achieving benchmarking standardisation and meaningful results. OneSec de-
sign has been optimised for handling small files and has been implemented from scratch
as a Linux Loadable Kernel Module.

In addition, this thesis makes six other contributions. These include:

1. This thesis presents an in-depth analysis of H/W technology trends and their impact
on growth & proliferation of digital data. The goal is to argue that intrusion of digital
technologies into our daily life has led to creation and propagation of voluminous amount
of digital data that poses scalability challenges to file systems.

2. This thesis presents a comparison of four popular disk file systems which include ext4,
NTFS, HFS+ and ZFS, across four popular platforms such as Linux, Windows, Mac
and Solaris, for scalability limitations. The goal is to highlight their size and count
limitations. Most importantly, this comparison highlights the inherent problems in
conventional file system design that imposes scalability limitations.


3. This thesis presents a comprehensive knowledge of the factors that affect the perfor-
mance of a disk file system. The goal is to bring out the most basic and the most
influential parameter(s) that affect the performance.

4. This thesis presents a review of two contemporary high performance disk file system
designs; FFS and LFS, and various other proposals to achieve performance gains. The
goal is to bring out the policies and techniques used by existing file systems and other
proposals to overcome performance related problems.

5. This thesis presents a detailed survey of file system layering including stacking models,
layering support in popular OSes such as Solaris, FreeBSD Unix, Linux and Windows,
alternatives to stacking and its applications. The goal is to realise the benefits and
applications of layering.

6. This thesis presents a refined and fresh set of guidelines for file system benchmarking.
The goal is to provide a comprehensive list of these guidelines and understand various
intricacies of file system benchmarking.

1.5 Outline

Chapter 2 discusses the File System Scalability. It begins with an in-depth analysis
of impact of H/W technology trends on growth & proliferation of digital data followed by
comparison of four popular disk file systems for scalability limitations. Finally, it discusses
the design, implementation details and evaluation of a scalable userspace virtual file system
called suvFS.
Chapter 3 discusses the File System Performance. It begins with discussing factors
that affect performance of disk file systems followed by the discussion of various techniques
used by FFS and LFS designs and other proposals to overcome them. Finally, it discusses
the design, simulation details and evaluation of a high performance hybrid FAT file system
called hFAT.
Chapter 4 discusses the File System Extensibility. It begins with a detailed survey
of file system layering including stacking models, layering support in popular OSes such as
Solaris, FreeBSD Unix, Linux and Windows, alternatives to stacking and its applications. Fi-
nally, it discusses the design, implementation details and evaluation of a reliable and efficient
vnode stackable file system for secure data deletion called restFS.


Chapter 5 discusses the File System Benchmarking. It begins with discussing the
various guidelines proposed for file system benchmarking. After this, it highlights the prob-
lem in current benchmarking mentality and introduces a new methodology. Based on this
new mentality, the chapter discusses the design, implementation details and evaluation of a
modelled hypothetical file system against ext2 file system, namely OneSec.
Chapter 6 discusses the conclusions drawn, lessons learned and future scope of the work.
Some of the material presented in this thesis has been published previously. The complete
list of these published articles immediately follows Chapter 6.

End

Chapter 2

File System Scalability


2.1 Introduction

A File System is a piece of software, preferably a kernel module, that converts the higher
level abstraction of a hierarchy of directories and files into the lower level persistent physical
data structures residing on a storage sub-system. It is an important part of an operating
system as it provides a way to store, organise, navigate and retrieve the persistent data
residing on the storage sub-system in the form of files and directories. To accomplish this task, a
file system processes these physical data structures after bringing them into primary memory
and stores them back on the storage sub-system in order to make changes persist. These data
structures residing on the storage sub-system can be broadly classified into two categories:
1) metadata and 2) userdata. The userdata accounts for the actual information that is stored
within the files and directories [1], while metadata is that data which qualifies and quantifies
the userdata. In other words, metadata contains the information necessary to store and
manage files and directories such as filename, ownership, date and timestamps, permissions,
security information, file size, allocation map, and so on. In addition, metadata also stores
information about the file system itself, such as type of file system, current and maximum
volume size, maximum file size, maximum file count, blocks allocated and free, and so on.
Logically, every file system contains these two types of data structures and performs
the same function. However, there are many more types of file systems than there are types of
functionality supported by them. The fact is that file systems vary greatly in various aspects
of metadata structure, such as 1) the number of different types of metadata structures, 2) the
on-disk layout of metadata structures, 3) the length and breadth of each type of metadata
structure, 4) the algorithms, and much more. This gives rise to various file system designs.
In spite of these subtle differences, almost every file system has some fixed length fields within
their metadata structures to identify various attributes of a file and the file system such as
maximum file size, maximum volume size, and so on. This policy is adopted to reduce the
complexity of a file system design. Moreover, it reduces the performance overhead incurred
during most of the file system operations. However, these fixed length fields of metadata
structures also pose scalability challenges to file systems.
File System Scalability is defined as the ability of a file system design to support large
volume sizes, large file sizes, large directories and large number of files while still providing
I/O performance [Sweeney et al., 1996]. The scalability of a file system depends in part on
how it stores information about files, directories and the file system itself. As an example,
[1] On UNIX-like and UNIX-based OSes, directories are simply files.


if file size is stored as a 32-bit integer, then no file in the file system can exceed 2^32 bytes
(4 GB). Scalability also depends on the methods used to organize and access data within the
file system. As an example, if directories are stored as a linear list of file names, then to find
a particular file, each entry must be searched one by one until the desired entry is found.
This works fine for small directories but not so well for large ones. Thus, Scalability of a file
system is a two-part term: Scalable to Capacity and Scalable to Performance. This chapter
only discusses file system scalability to capacity and, from here onwards, scalability will implicitly
mean only that; exceptions will be explicitly mentioned. File system performance will be
discussed in the next chapter.
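To make the two examples above concrete, the following toy C program (a hypothetical sketch written for this discussion, not code taken from any real file system) shows how a fixed 32-bit size field caps every file at 2^32 - 1 bytes and how a directory kept as a linear list forces an entry-by-entry scan whose cost grows with the directory size:

    /*
     * Toy illustration of two scalability limits: a fixed-width 32-bit
     * file-size field and a directory stored as a linear list of names.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    struct toy_inode {
        uint32_t size;              /* caps any file at 2^32 - 1 bytes (~4 GB) */
    };

    struct toy_dirent {
        char     name[28];
        uint32_t inode_no;
    };

    /* Linear lookup: the cost grows with the number of entries. */
    static const struct toy_dirent *lookup(const struct toy_dirent *dir,
                                           size_t count, const char *name)
    {
        for (size_t i = 0; i < count; i++)
            if (strcmp(dir[i].name, name) == 0)
                return &dir[i];
        return NULL;
    }

    int main(void)
    {
        printf("largest file size a 32-bit field can record: %llu bytes\n",
               (unsigned long long)UINT32_MAX);

        struct toy_dirent dir[] = { {"a.txt", 11}, {"b.txt", 12}, {"c.txt", 13} };
        const struct toy_dirent *e = lookup(dir, 3, "c.txt");
        if (e != NULL)
            printf("found %s at inode %u after scanning the list\n",
                   e->name, e->inode_no);
        return 0;
    }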
Digital technologies are getting advanced, affordable and are becoming part and parcel of
our daily life. This has resulted in voluminous growth and proliferation of digital data, and
file systems need to scale with this growth [Dahlin, 1996]. However, as mentioned earlier,
the metadata of file systems is not scalable. The fixed length fields within the metadata
structures which store information about various size and count limits of the file system limit
the scalability of a file system. As a solution to this problem, if the said static fields are made
dynamic, the problem will be solved from the root. In fact, the limitation on filename length
in the ext2 file system was overcome by replacing the static filename field with a dynamic one, in
which one field identifies the beginning of the filename while another field identifies the filename
length [Card et al., 1995]. However, in general there are three problems associated with this
approach: 1) it hinders performance, as extra computations have to be performed, 2) an
extra metadata structure is added which demands space and management, and 3) because
very little information is stored about the metadata structures, they can’t grow dynamically
after a certain limit.
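For reference, the variable-length directory entry that ext2 adopted looks roughly like the sketch below. The field names follow the widely documented ext2 on-disk layout, but the structure is simplified and the demo values are invented for illustration; it is not code lifted from the kernel. Each entry carries its own record length and name length, which is precisely the extra bookkeeping just mentioned:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*
     * Simplified sketch of an ext2-style variable-length directory entry.
     * rec_len gives the total length of this entry (and hence the offset of
     * the next one), while name_len gives the length of the stored name, so
     * the name field no longer has to be of fixed width.
     */
    struct dirent_sketch {
        uint32_t inode;      /* inode number of the entry            */
        uint16_t rec_len;    /* total length of this entry in bytes  */
        uint8_t  name_len;   /* length of the name in bytes          */
        uint8_t  file_type;  /* regular file, directory, ...         */
        char     name[];     /* name, not null-terminated on disk    */
    };

    int main(void)
    {
        const char *fname = "a_fairly_long_file_name.dat";
        size_t need = sizeof(struct dirent_sketch) + strlen(fname);

        struct dirent_sketch *de = calloc(1, need);
        de->inode    = 42;                            /* arbitrary inode number */
        de->rec_len  = (uint16_t)((need + 3) & ~3u);  /* 4-byte aligned record  */
        de->name_len = (uint8_t)strlen(fname);
        memcpy(de->name, fname, de->name_len);

        printf("inode=%u rec_len=%u name=%.*s\n",
               de->inode, de->rec_len, de->name_len, de->name);
        free(de);
        return 0;
    }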
Another possible solution would be to keep a static field so large that its limit could never
plausibly be exceeded. This idea has four limitations: 1) the memory requirements are high,
2) a lot of space is occupied by metadata structures, 3) a lot of space is wasted, and 4) there
is still an upper limit.
Currently, file systems solve this problem by approximating the optimal length require-
ments of such size and count fields and setting them accordingly. As such, when demand crosses
the limit, developers increase these limits by releasing a new version. This approach
has worked but has given birth to new problems. First, the scalable variants of the same design
are not mount-compatible with each other and, as such, OS updates are required. Second,
much effort is needed as every file system needs to update its limits. Third, the effort is for


expansion instead of innovation. Finally, the process is to be repeated as demand crosses the
limit again.
The rest of this chapter is organised as follows. Section 2.2 discusses current hardware
technology trends and their effect on the growth and proliferation of digital data. Section 2.3
compares some popular file systems for scalability limitations to understand the reasons
for such limitations, identify the currently most scalable file system(s) and analyse the effect of
digital data growth and proliferation on scalability. Finally, Section 2.4 introduces the design,
implementation and performance evaluation of suvFS, a scalable user-space virtual file system
to overcome the file size limitation of file systems without modifying the design and source
of the file systems.

2.2 Impact of H/W Technology Trends on Growth & Prolif-


eration of Digital Data

Digital technologies have penetrated into every aspect of our day to day life. This intrusion
has led to creation and propagation of voluminous amount of digital data. Current hardware
technology trends exaggerate this scenario by increasing the domain and depth of applications
of digital technologies along with affordable sharing and storage of digital data so created.
In this section, we will discuss three main hardware technology trends which affect growth
and proliferation of digital data.

2.2.1 Micro-Processor Trends: Domain and depth of technology

The performance and cost of a microprocessor are primarily dictated by Moore's law [Schaller,
1997]. First, a new processor doubles the number of transistors on chip, compared with the previous
processor, in about 2 years. Second, the transistors in new chips are faster than the previous ones.
Finally, this enhancement does not ask for any significant additional cost. This fact is evident
from Table 2.1, which shows that in leading microprocessors the transistor count has doubled
every 2 years while speed has doubled roughly every 18 months over the past forty years
[Intel, 2007]. Furthermore, nowadays a processor has multiple CPU cores on the same die
with each core capable of providing hardware support for multi-threading [Borkar, 2007].
This is augmented by Pollack's rule, which states that the performance increase is roughly
proportional to the square root of the increase in complexity. In other words, if we double the
logic in a processor core, it delivers only 40% more performance. However, a multi-core
architecture has potential to provide near linear performance improvement with complexity


Table 2.1: Transistor count and speed in leading microprocessors

Year Processor Transistor Count Initial Clock Speed


1971 4004 2,300 108 KHz
1972 8008 3,500 500-800 KHz
1974 8080 4,500 2 MHz
1978 8086 29,000 5 MHz
1979 8088 29,000 5 MHz
1982 286 134,000 6 MHz
1985 386 275,000 16 MHz
1989 486 1,200,000 25 MHz
1993 Pentium 3,100,000 66 MHz
1995 Pentium Pro 5,500,000 200 MHz
2000 Pentium 4 42,000,000 1.5 GHz
2006 Core 2 Duo 291,000,000 2.93 GHz

and power. As such, two smaller processor cores, instead of a large monolithic processor
core, can potentially provide 70-80% more performance, as compared to only 40% from a
large monolithic core.
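The figures above follow directly from the square-root relationship; the short back-of-the-envelope program below (an illustration of the arithmetic only, not a model from the cited sources) compares one core whose logic is doubled against two unchanged cores under ideal parallel scaling:

    #include <math.h>
    #include <stdio.h>

    /* Pollack's rule: single-core performance scales roughly with the
     * square root of the core's complexity (logic area). */
    int main(void)
    {
        double big_core  = sqrt(2.0);       /* one core with its logic doubled    */
        double two_cores = 2.0 * sqrt(1.0); /* two unchanged cores, ideal scaling */

        printf("doubled monolithic core: %.0f%% more performance\n",
               (big_core - 1.0) * 100.0);   /* about 41%, the ~40% quoted above   */
        printf("two smaller cores: up to %.0f%% more performance\n",
               (two_cores - 1.0) * 100.0);  /* 100% ideal; ~70-80% in practice    */
        return 0;
    }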
In general, the microprocessor size, cost and power consumption is continuously decreasing
while its capability and applicability is increasing. This trend has increased the domain and
depth of applications of digital devices as microprocessors realise new applications of digital
equipments at affordable cost. As a result of this, digital technologies are penetrating deeper
and deeper into every aspect of our day to day life leading to the growth of voluminous
amount of digital data.

2.2.2 Network Trends: Fast and affordable sharing

The performance of a network is governed by two factors: its bandwidth and latency. Over the
past 30-40 years, the bandwidth of a LAN network has improved from 3 Mbps to 10 Gbps.
This improvement in bandwidth has come from two sources: faster links and better topologies.
In the same period of time, the latency of the network has not improved in this manner
[Patterson, 2004]. From Table 2.2 it is clear that in the time bandwidth doubles, latency
improves by no more than a factor of 1.2 to 1.4. However, to overcome this latency lag over
bandwidth, optimized protocols to reduce latency over optic fibre have been used. Moreover,


Table 2.2: Latency lags behind Bandwidth in networks

Year Technology Bandwidth (Mbps) Latency (msec)


1978 IEEE 802.3 Ethernet 10 3000
1995 IEEE 802.3u Fast Ethernet 100 500
1999 IEEE 802.3ab Gigabit Ethernet 1000 340
2003 IEEE 802.3ae 10 Gigabit Ethernet 10000 190
2007 IEEE 802.3ba 100 Gigabit Ethernet 100000 70

Butters’ law [Robinson, 2000] states that the amount of data coming out of an optical fibre
doubles every 9 months. Thus, the cost of transmitting a bit over an optical network decreases
by half every 9 months. As such, the current trend in wired networks not only increases
bandwidth but also decreases latency and the cost of transmission.
Furthermore, wireless networks too are getting faster and cheaper in addition to covering
more and more geographical area. Figure 2.1 shows two classes of wireless networks: Wi-Fi
and Mobile [Rao and Angelov, 2005]. In both classes, in addition to transmission speed,
the mobility supported by the network is also improving. Also, Nielsen's law of Internet
bandwidth states that a high-end user's connection speed grows by 50% per year [1] [Nielsen,
1998].
In general, both wired and wireless networks are getting faster and cheaper. This trend
has made sharing of digital data fast and affordable, leading to both growth and proliferation
of digital data.
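As a rough illustration of how these two growth laws compound (the rates are the ones quoted above, while the year-0 values are arbitrary reference points), the sketch below projects a high-end user's relative connection speed under Nielsen's law and the relative per-bit transmission cost under Butters' law:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double bandwidth = 1.0;   /* relative high-end connection speed, year 0 */
        double bit_cost  = 1.0;   /* relative cost of transmitting one bit      */

        for (int year = 1; year <= 10; year++) {
            bandwidth *= 1.5;                   /* Nielsen: +50% per year         */
            bit_cost  *= pow(0.5, 12.0 / 9.0);  /* Butters: halves every 9 months */
            printf("year %2d: bandwidth x%8.1f, per-bit cost x%.5f\n",
                   year, bandwidth, bit_cost);
        }
        return 0;
    }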

[Figure 2.1 appears here: a chart of transmission speed versus mobility for the Wi-Fi class (2.4 GHz and 5 GHz WLAN, W-Broadband, WLL) and the Mobile class (2G/2.5G, 3G, 3.5G, 4G), 1995-2010.]

Figure 2.1: Both speed & mobility in wireless networks is enhancing


[1] Annual growth of processing power is about 60% while that of Internet bandwidth is about 50%.


2.2.3 Magnetic Disk Drive Trends: Large and affordable storage

Since the inception of the magnetic disk drive, there have been continuous design and performance
improvements. In spite of this, due to the mechanical motions involved, performance has seen
less improvement as compared to capacity. However, the continuous drop in cost per unit bit
and a steady increase in storage capacity have made the magnetic disk drive everybody's choice.
The storage capacity of a magnetic disk drive is directly proportional to its areal density
which is computed as the product of two other density measures; Track Density and Linear
Density. The former measures how tightly the concentric tracks on the disk are packed
while the latter measures how tightly the bits are packed within a length of track. Kryder’s
law states that magnetic disk areal storage density doubles approximately every 18 months
analogous to the doubling of transistor count every 24 months in Moore's law [Walter, 2005].
This density progress has been the result of a series of laboratory investigations of new head
designs and technologies. The evolution of head design involves a reduction in sensor length,
which increases the Track density, and a reduction in film thickness, which determines the
Linear density. Each progressive reduction results in higher areal density and a corresponding
higher disk capacity and this trend is expected to continue [Grochowski, 1999]. In 2003,
[Morris and Truskowski, 2003] observed that the areal density of magnetic disk drive has
increased by seven orders of magnitude while the cost has declined by five orders since 1980.
[Grochowski and Halem, 2003] argued that with a CGR (compound growth rate) of 60 %
to 100 % per year in areal density, the expected price declines would average between 37 %
to 50 % per year. Furthermore, [Kryder and Kim, 2009] argued that if the areal density of
magnetic disks were to continue to progress at the current CGR of 40 %, a 14 TB drive would
cost about $40 in the year 2020. Figure 2.2 shows that, annually, the cost of a magnetic
disk drive is reduced by half while its capacity doubles.
In general, the capacity of magnetic disk drives is increasing day by day while cost per
unit bit is decreasing. This trend is allowing the citizens of the digital world to store large amounts
of digital data at an affordable price.
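The compounding behind these trends can be projected with a few lines of code; in the sketch below the 40 % capacity CGR and an assumed 40 % annual decline in price per GB are taken from the figures quoted in this section, while the year-0 drive capacity and price are arbitrary starting points:

    #include <stdio.h>

    /*
     * Compound-growth sketch of the disk trends discussed above.  The 40% CGR
     * in capacity and ~40% annual decline in price per GB are the rates quoted
     * in the literature; the year-0 values are assumptions for illustration.
     */
    int main(void)
    {
        double capacity_tb  = 1.0;    /* assumed drive capacity in year 0 (TB) */
        double price_per_gb = 0.10;   /* assumed price per GB in year 0 ($)    */

        for (int year = 1; year <= 10; year++) {
            capacity_tb  *= 1.40;     /* areal density CGR of ~40% per year */
            price_per_gb *= 0.60;     /* ~40% annual decline in $/GB        */
            printf("year %2d: ~%5.1f TB drive, ~$%.4f per GB\n",
                   year, capacity_tb, price_per_gb);
        }
        return 0;
    }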

2.3 Comparison of Popular File Systems for Scalability limi-


tations

A scalable file system should support large volume sizes, large file sizes and large number of
files. This section discusses in detail these size and count limitations in flagship file systems


[Figure 2.2 appears here: a log-scale plot of drive capacity (GB) and cost per GB against year, 1980-2010.]

Figure 2.2: Capacity of magnetic disk doubles every year while cost halves

of some popular OSes. The goals of this section are: 1) to understand the reason(s) behind
these limitations, 2) to identify the most scalable file system in each category, and 3) to discuss
the impact of digital data proliferation on these size and count limitations. The file systems
discussed include ext4 [Mathur et al., 2007] from Linux, HFS+ [Wikipedia] from Mac, NTFS
[Nagar, 1997] from Windows and ZFS [Bonwick, 2006] from Solaris.

2.3.1 Volume Size Limitation

The smallest read/write unit of a file system is a block/cluster. Every file system uniquely
identifies each and every block using some n-bit integer. This number primarily dictates
the maximum volume size, as the file system is only able to address 2^n blocks, and thus limits
the maximum volume size to (2^n * blockSize) bytes. Although increasing the block size
can increase the maximum volume size, it can lead to wastage of space due to internal
fragmentation. Moreover, among these addressable blocks some are occupied by metadata
structures which further reduce the maximum volume size. As an example, NTFS stores
volume metadata in several system files and the design is such that every sector of volume
belongs to some file. Besides that, every file system needs to keep track of free and allocated
blocks. Most file systems prefer a bitmap for performance. The size of this bitmap also puts
the limit on maximum volume size.
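As a rough, hedged illustration of this relationship (a sketch only; the 32-bit address width and
4 KiB block size below are hypothetical example values, not parameters of any particular file
system), the raw upper bound can be computed as:

#include <stdint.h>
#include <stdio.h>

/* Raw upper bound on volume size for an n-bit block address and a given
 * block size: 2^n * blockSize bytes (metadata and bitmap overheads ignored). */
static uint64_t max_volume_bytes(unsigned addr_bits, uint64_t block_size)
{
    if (addr_bits >= 64)
        return UINT64_MAX;                 /* result would not fit in 64 bits */
    return ((uint64_t)1 << addr_bits) * block_size;
}

int main(void)
{
    /* 32-bit block addresses with 4 KiB blocks give 2^44 bytes = 16 TB. */
    printf("%llu bytes\n", (unsigned long long)max_volume_bytes(32, 4096));
    return 0;
}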
The maximum volume size of ext4 is 1 EB (1 EB = 2^60 bytes; 1 ZB = 2^70 bytes), that of
HFS+ is slightly less than 8 EB, and that of NTFS and ZFS is 8 EB. This indicates that among
current popular file systems, both NTFS and ZFS are highly scalable to large volume sizes.
However, keeping in view the digital data
growth and proliferation, this is still a limit. Data storage needs at many sites are doubling
every year. A study by the University of California at Berkeley estimated that the amount
of original digital content creation each year is about 1.6 million terabytes [Seagate, 2001]
whereas another recently conducted survey showed that the digital data generation rate is
increasing every year in a fashion as shown in Table 2.3 [Makarenko, 2011]. The mathematical
model developed in this survey showed that in future the rate of production of digital data will
stabilize at 13.2 ZB per year. Another study has estimated that the amount of digital
information created annually will grow about 44-fold from 2009 to 2020 [Moore, 2009]. Also,
due to the proliferation of the World Wide Web, this information is being shared and downloaded
by millions of users worldwide. Content sharing on the Internet multiplies the storage required
for the original copy of the content by the number of users who download it. This certainly indicates
that in future, file systems will be under high pressure to scale with the requirement of large
file system volumes due to growth and proliferation of digital data.

Table 2.3: Amount of digital data generated annually over a period of 5 years

Year            2006     2007     2008     2009     2010
Volume in EB    185.62   319.71   486.52   762.89   1143.23

2.3.2 File Count Limitation

Any file in a file system is exactly represented by one inode. As such, the maximum number
of files per volume supported by any file system depends upon the maximum number of
inodes that file system has. Earlier, file systems used to pre-allocate on-disk inode structures
statically during the creation of a file system. In fact, the system administrator had to guess
the number of files that will be created on any given file system and had to recreate the file
system when that guess went wrong. With inodes being allocated dynamically, this limitation
no longer exists. However, it is necessary to use some kind of meta-metadata structure for
keeping track of where the dynamically allocated inodes are located and very few file systems
like XFS [Sweeney et al., 1996] support it without having any significant performance hit.
The maximum number of files that can be created on ext4, HFS+ and NTFS is one less
than 2^32, whereas on ZFS it is 2^48 files. This indicates that among current file systems, ZFS
is highly scalable to file count. In spite of the huge number of files supported by ZFS, it
is still a limit which does not exist in XFS and JFS. Furthermore, digital technologies are
continuously producing large number and volume of files. As an example, a medium-size
hospital typically performs about 120,000 radiological imaging studies each year producing
more than 2 TB of imaging data per year with both the resolution and quantity of images
increasing each year [Riedel et al., 2001]. Moreover, people prefer to digitise and archive
every possible piece of information for secure, cost-effective and easy management. As
such, the citizens of the digital world are continuously creating voluminous amounts of digital
data packed into a large number of files. Hence, current and future file systems need to scale
with this growth in number of files.

2.3.3 File Size Limitation

Logically, there should be no limit on the size of a file provided there is free space to house
the data. But file systems impose limits here as well. The file size limitation within
file systems comes from two integers: one which sets the working block size and another
which sets the maximum number of blocks that can be accommodated within the allocation
map of a file. As such, the maximum file size supported by a file system is the product
of these two integers. Moreover, file systems also record the size of the file in
bytes or in the number of blocks allocated to it. This n-bit static field also sets a limit
on the file size. As such, the maximum file size limitation of any file system equals
min(blockSize ∗ blocksAllocated, maxRecordableSize).
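As a short worked example of this formula (the 4 KiB block size and the 32-bit per-file block
index are assumptions made here for illustration; they happen to reproduce the ext4 figure
quoted below, provided the recorded-size field is not the smaller term):

\[
\min(2^{32} \times 4\,\mathrm{KiB},\ \mathrm{maxRecordableSize})
= \min(2^{44}\,\mathrm{bytes},\ \mathrm{maxRecordableSize})
= 16\,\mathrm{TB}
\]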
The maximum file size on ext4 file system is 16 TB, on HFS+ it is slightly less than 8
EB, and on both NTFS and ZFS it is 16 EB. This indicates that among current file systems,
both ZFS and NTFS are highly scalable to large file sizes. However, like the other scalability
limitations, this limit is also being stressed by digital data growth and proliferation. New
data sources and novel applications for the data so collected have resulted in the development of
a new class of data-intensive applications which create and process large data files. As an
example, every nation nowadays maintains large databases which contain information gathered
from surveillance cameras, satellite images, ISP logs, historical call data and so on. Such
databases are not only increasing in size but are also growing in number. Also, multimedia
content is both growing and getting complex. Table 2.4 shows the effect of resolution on the
individual and on total storage required by multimedia files.
As digital devices are getting more affordable, they increase the count and total volume of
multimedia content they generate. Moreover, as they get more capable, they also increase the
individual size of the files in addition to the total volume.


Table 2.4: Effect of resolution on size of individual multimedia files and on total volume

File Type                 4MP      8MP      MP3      HiD      SD       HD
                          Photos   Photos   Songs    Songs    Video    Video
Approx. File Size (MB)    1        2        4        150      700      4000
File Count                               Size in GBs
1000                      1        2        4        150      700      4,000
2000                      2        4        8        300      1,400    8,000
4000                      4        8        16       600      2,800    16,000

2.4 suvFS: A Virtual File System to Break File Size Limitation

Designing a file system is a complex task. As such, to reduce the complexity of design and
development, the file system design has always been objective specific. The design objectives
include performance, scalability, advanced features and so on. There is generally a trade-off
between design objectives while designing a file system. A file system may not be scalable
or may lack some advanced features like journalling, versioning and so on, in a compromise
to deliver high performance. Similarly, other permutations and combinations are adopted as
required. Thus, today we have many objective-specific file systems which do justice
to their goals but at the same time lack whatever was left out of their design. However, these
objective-specific file systems suffer due to the current trend in growth and proliferation
of digital data, which is exaggerated by current hardware trends. Big data is posing
scalability challenges to file systems and unfortunately file systems are inherently prone to
such problems as there has been a compromise in their design. Although the limitation on
the number of files supported by a file system can be avoided by making the allocation of inodes
dynamic rather than static (XFS and JFS use dynamic allocation of inodes), no such
mechanism directly applies for eliminating the volume size and file size limitations.
However, there exists a technique that has potential to avoid volume size and file size
limitation, namely Virtual Unification. Virtual Unification provides a merged view of several
directories, without physically merging them. The benefit is that the files remain physically
separate, but appear as if they reside in one location. This satisfies the needs of both
administrators as well as users. Many proposals for unification have surfaced from time to
time, however the following text discusses the most influential and popular designs.


Plan-9 developed by Bell Labs can connect different machines, file servers, and networks,
and offers a binding service that enables multiple directories to be grouped under a common
namespace [Bell-Labs, 1995]. Similarly, 3DFS, also developed by AT&T Bell Labs for source
code management, maintains a per-process table that contains directories and a location in
the file system that the directories overlay [Korn and Krell, 1990]. This technique is called
viewpathing, and it presents a view of directories stacked over one another. The
Translucent File System (TFS), released in SunOS 4.1 in 1989 [Hendricks, 1990], also provides
a viewpathing solution like 3DFS, but improves over 3DFS as it better adheres to
UNIX semantics when deleting a file. Moreover, Union Mounts, implemented on 4.4BSD-Lite,
merge directories and their trees to provide a unified view [Pendry and McKusick, 1995].
This structure, called the union stack, permits directories to be dynamically added either to
the top or the bottom of the view. Finally, Unionfs proposed by [Wright et al., 2006] is the
most recent and improved union file system as it maintains UNIX semantics while offering
advanced unification features such as dynamic insertion and removal of namespaces at any
point in the unified view and so on. Unionfs allows users to specify a series of directories and
a mount point which presents the union of specified directories.
Virtual Unification presents a large virtual directory to users for easy management of files
and directories which physically reside at different locations. Although, originally intended for
name-space unification, union file systems can also be used for snapshotting, by marking some
data sources read-only and then utilizing copy-on-write for the read-only sources [Wright and
Zadok, 2004]. Nevertheless, Virtual Unification unintentionally presents a virtual file system
that is scalable to store and process large number of files and directories even though they
actually reside at different locations within the file system or may even belong to different
file systems. Similarly, as unification can merge directories from different file systems into one
virtual directory, these file systems can be thought of as a virtual file system that is scalable to
large volume sizes if the specified directories are mount points of different file systems. This concept
has been exploited by mhddfs to some extent. mhddfs is another union file system, developed
by Oboukhov, that allows several mount points to be united into a single one [Oboukhov, 2008].
mhddfs simulates one big file system by combining several hard drives or network file systems.
As such, mhddfs can be argued to be a file system scalable to large volume size and file
count in the same way as other union file systems can be. However, mhddfs goes a step further
by supporting load balancing. In mhddfs, if an overflow arises while writing to some unified
mount point, the file content already written is transferred to another unified mount
point containing enough free space for the file. The transfer is processed on the fly,
fully transparently to the application that is writing. It is to be noted that no such proposal
exists that directly makes an effort to present a virtual file system that is scalable to large
volume sizes and file counts by exploiting the concept of virtual unification. Nevertheless,
this section exploits this idea explicitly to break the file size limitation of file systems.
As mentioned in section 2.3.3, the file size limitation within file systems comes from two
static fields; one that records the block size and another that records the blocks allocated to
the file. The former field is almost the same in all file systems except those which have variable-
sized blocks. However, even in such file systems there is an upper limit on the block size. In
spite of this similarity, there is diversity in how file systems maintain the latter field. As an
example, FAT file systems use global FAT Table to keep track of allocated and free clusters in
addition to keep record of chain of clusters allocated to a file [Microsoft, a]. As such, logically
a file on such a file system can be as large as the volume itself. However, in FAT file systems
the maximum file size limitation is imposed by a field which records the size of file in bytes.
Therefore, the maximum file size limitation on FAT volume equals to min((FATsz/4) ∗ b, 232 )
bytes where FATsz is the size of FAT Table in bytes and b is the cluster size. This 32-bit field
imposes a 4 GB file size limitation on FAT volume. In contrast to this, ext2 file system uses an
array of 32-bit pointers within the table-of-contents field of inode to identify the blocks
allocated to a file [Card et al., 1995]. Logically, the maximum file size in such file systems
can be less or more than the maximum volume size depending upon the maximum number of
addressable blocks in it. However, the maximum number of blocks that can be allocated to a file
is also limited by a 32-bit field which records the file size in number of blocks allocated. As such,
the maximum file size in the ext2 file system is limited to min(((b/4)^3 + (b/4)^2 + b/4 + 12) ∗ b, 2^32 ∗ b)
bytes, where b is the block size.
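The two expressions above can be evaluated directly; the short sketch below does so for
illustrative parameters (a 32 KiB cluster with a 1 MiB FAT, and 4 KiB ext2 blocks; these are
example values, not measurements of real volumes):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* FAT: min((FATsz/4) * b, 2^32) with FATsz = 1 MiB and b = 32 KiB.      */
    uint64_t fat_sz = 1 << 20;                 /* size of the FAT in bytes    */
    uint64_t b = 32 * 1024;                    /* cluster size                */
    uint64_t chain_limit = (fat_sz / 4) * b;   /* limit from the FAT chain    */
    uint64_t size_field  = (uint64_t)1 << 32;  /* limit from 32-bit size field*/
    uint64_t fat_max = chain_limit < size_field ? chain_limit : size_field;
    printf("FAT32 max file size: %llu bytes\n", (unsigned long long)fat_max);

    /* ext2: min(((b/4)^3 + (b/4)^2 + b/4 + 12) * b, 2^32 * b) with b = 4 KiB */
    b = 4096;
    uint64_t p = b / 4;                        /* block pointers per block    */
    uint64_t mapped  = (p * p * p + p * p + p + 12) * b;
    uint64_t counted = ((uint64_t)1 << 32) * b;
    uint64_t ext2_max = mapped < counted ? mapped : counted;
    printf("ext2 max file size: %llu bytes\n", (unsigned long long)ext2_max);
    return 0;
}

With these numbers the FAT expression is dominated by the 32-bit size field (4 GB), while the
ext2 expression is dominated by the indirect-block map (roughly 4 TB).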
Because of this diversity in the way file systems impose file size limitations, mitigating the
problem in design is not a feasible solution to be applied to all file systems. However, using
the concept of virtual unification, we can build a virtual file system which when mounted on
top of any existing file system extends it to support files which are not logically limited in
size.
The rest of this section is organised as follows. Section 2.4.1 introduces the design goals
of suvFS while section 2.4.2 discusses the implementation details of suvFS. Section 2.4.3
discusses the experiment carried out to evaluate the performance of suvFS. Section 2.4.4
discusses the results while section 2.4.5 presents the conclusion. Finally, section 2.4.6 points
towards limitations and future scope of the work.


2.4.1 Design of suvFS File System

suvFS is a scalable user-space virtual file system that breaks the maximum file size limitation
of any file system when mounted on top of it. suvFS can be mounted on top of any existing
file system to extend its capability to handle large file sizes without modifying the design and
source of the file system. Also, with suvFS on top, there is logically no upper limit on the file
size. The design goals of suvFS are as follows:

1. To extend the capability of a file system to handle large sized files with logically no
upper limit on the file size.

2. To break this file size limitation without modifying the design and source of the file
system.

3. To add this capability to every existing file system.

4. To layer this extension in user-space so that the kernel stability and reliability is not
compromised.

suvFS meets its core design goal (enumerated above at 1) by first splitting a large file,
which cannot be created, stored and processed in its entirety on the native file system, into a
number of legitimately sized files called fragments. This way, the contents of a large file whose
size crosses the file size limitation are stored in the form of a number of legally sized fragments.
Unfortunately, splitting alone only stores the large information but does not let us process it;
to process it, we would need to join the fragments, which takes us back to the original problem.
Second, in order to overcome this problem, suvFS presents a large virtual file as the
representative of the associated fragments (a virtual unification of files), instead of a large
physical file, to the user applications for processing. The operations performed on this large
virtual file are reflected in the associated fragments. Third, to avoid individual access to
fragments and to exclude them from directory listings, suvFS filters fragments from
simple files.
Furthermore, to avoid having to split every file manually, the split should be
transparent to user applications. However, no design or source modification of applications,
libraries, system calls and even file systems should be made. This design decision has two
benefits; 1) it allows easy applicability and high portability of the solution, and 2) it allows
the solution to be applied to all types of file systems. In order to be able to extend any file
system to support new features without modifying source of the kernel or of the file system,
the extension should be a layered file system. Layering allows trapping, pre-processing and
post-processing of file system syscalls targeted to below mounted native file system. As such,
layering is transparent both to user applications on one side and to native file systems on
other side. This meets the design goals of suvFS enumerated above at 2 and 3. Also,
layering can be done both at kernel level and user level. The benefit of layering at user level
is that the reliability and stability of kernel is not compromised; however the downside is
that it does add some performance overhead. In contrast, layering at kernel level adds very
minimal performance overhead but can break the stability and reliability of the kernel. suvFS
is implemented as a file system layer in user-space using the FUSE framework [Szeredi, 2005] as
shown in Figure 2.3. This design decision meets the design goal enumerated above at 4.

Figure 2.3: Design of suvFS using FUSE framework (user mode: file system operations enter suvFS,
which maps a large virtual file onto its associated fragments; kernel mode: VFS and the FUSE
file system)

2.4.2 Implementation of suvFS File System

suvFS is implemented using the FUSE framework. FUSE is an acronym for Filesystem in Userspace
and is used to develop full-fledged file systems and to extend existing file systems. The file
systems so created run in the user space. As such, along with high reliability of kernel
comes high ease of development as user space allows access to facilities (like C library) which
are lacking in kernel development. The FUSE framework contains a null-pass virtual file
system, fusexmp, which passes all the file system operations to below mounted file system
without any modification. suvFS is implemented by overriding the various procedures of
fusexmp. However, fusexmp doesn’t support overlay mounting and as such any mount point
passed to fusexmp during mounting, mounts the root directory (/) on that mount point.
Having said that, a little modification to fusexmp gets it overlaid on the specified mount
point, which can be a simple directory or a mount point of some other file system. This has two
benefits for suvFS: 1) it adds to the transparency of suvFS as no path change is required
by applications accessing that volume, and 2) it leaves no path to access the native file system
without the intervention of suvFS.
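A minimal structural sketch of such an overlay, assuming the FUSE 2.x high-level API: the
handler names mirror the calls discussed in the following subsections, and the bodies here are
only placeholders (returning ENOSYS) rather than the actual suvFS logic.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <sys/stat.h>

/* Placeholder handlers; the real versions implement Algorithms 2.1-2.6. */
static int suvfs_getattr(const char *path, struct stat *st)
{ return -ENOSYS; }
static int suvfs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{ return -ENOSYS; }
static int suvfs_open(const char *path, struct fuse_file_info *fi)
{ return -ENOSYS; }
static int suvfs_read(const char *path, char *buf, size_t size, off_t offset,
                      struct fuse_file_info *fi)
{ return -ENOSYS; }
static int suvfs_write(const char *path, const char *buf, size_t size,
                       off_t offset, struct fuse_file_info *fi)
{ return -ENOSYS; }
static int suvfs_unlink(const char *path)
{ return -ENOSYS; }

/* Every intercepted file system call is dispatched through this table. */
static struct fuse_operations suvfs_oper = {
    .getattr = suvfs_getattr,
    .readdir = suvfs_readdir,
    .open    = suvfs_open,
    .read    = suvfs_read,
    .write   = suvfs_write,
    .unlink  = suvfs_unlink,
};

int main(int argc, char *argv[])
{
    /* e.g. ./suvfs -o nonempty /mnt/fat32 overlays an already mounted volume */
    return fuse_main(argc, argv, &suvfs_oper, NULL);
}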
suvFS in particular, and FUSE file systems in general, incur performance overhead as the kernel
boundary is crossed to process a call. In addition to multiple context switches, multiple
process switches and data copying during call processing also add overhead [Rajgarhia and
Gehani, 2010]. However, the benefits of development ease, a reliable environment and a portable
file system outweigh the drawbacks. Figure 2.4 shows the call path of write() file system
call targeted to a file system with suvFS mounted on top.

Figure 2.4: Call path of write() call to a file system via suvFS (the application's write() enters the
kernel-mode VFS, is forwarded through /dev/fuse to the user-mode suvfs_write() for pre- and
post-processing, and is then re-issued as a write() to the below mounted file system)

Figure 2.4 clearly depicts the overhead incurred during the write() call due to the
framework. The figure also shows that every file system call can be pre- and post-processed in
user space to reflect the desired operation. Although the implementation of suvFS demands
processing many syscalls, for most of these calls suvFS does only the minimal work of
restricting access to fragments. Besides, suvfs_write() and suvfs_read() are primarily
responsible for creating, writing and reading a large file. Also, suvfs_readdir() implements
the logic to restrict listing of fragments and suvfs_getattr() presents a large virtual file
which is the virtual unification of the associated fragments. In addition, calls like suvfs_unlink()
reflect the same operation on the associated fragments while calls like suvfs_open() restrict direct
access to fragments.


This section discusses the implementation details of the suvfs_write(), suvfs_read(),
suvfs_readdir(), suvfs_getattr(), suvfs_unlink() and suvfs_open() calls. All other
calls either fall into one of these categories or are implemented accordingly.

2.4.2.1 suvfs_write() call

suvfs_write() intercepts the write() call issued by user applications targeted to the file
system on top of which suvFS is layered. suvfs_write() splits a large file at offset values that
are integral multiples of the maximum file size supported by the below mounted file system. This
maximum file size value is different for different file systems. The value is detected on the fly
by suvFS and stored in a global variable MAX_FS. Moreover, the value of MAX_FS is always a
byte less than the maximum file size. As an example, in the case of the FAT32 file system, the
value is 4294967295 (4 GB - 1). The reason for this is that the last byte of every file should be
EOF and fragments are just ordinary files to the native file system; so, physically, there are many
EOFs in a large virtual file. The call works by first calculating the fragment number wherein the
supplied offset should lie. All fragments belonging to the large file get a suffix appended to their
filename when they are created. The suffix contains a magic string (say ".frag.") to identify
it as a fragment, followed by a numeric value. This numeric value identifies the fragment's position
(offset) in the chain of fragments belonging to the same large virtual file. suvFS doesn't add
any suffix to the first fragment (at offset = 0) for two reasons. First, while creating a file suvFS
doesn't know whether it will cross the maximum file size limitation or not. Second, suvFS
has to maintain the actual filename for future references to the file and for consistency in usability.
This way, no extra data structure is needed for this purpose.
After calculating the fragment number, the call creates the fragment name (or filename) if
the offset crosses the limitation. In this case, it normalises the value of the offset to lie within the
MAX_FS range. This keeps the EOFs of preceding fragments from being counted as
data bytes and sets the offset value within the legitimate range of the fragment size.
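A minimal sketch of this mapping, assuming the FAT32 value of MAX_FS quoted above; the
helper name frag_for_offset() and the example file name are hypothetical, introduced only
for illustration.

#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <sys/types.h>

static const off_t MAX_FS = 4294967295LL;   /* FAT32: 4 GB - 1 */

/* Map an application-visible (file, offset) pair to the fragment holding
 * that byte and to the normalised offset inside that fragment. Fragment 0
 * keeps the original name; fragment n is named "<filename>.frag.<n>". */
static void frag_for_offset(const char *file, off_t offset,
                            char *frag, size_t frag_len, off_t *frag_off)
{
    long frag_num = (long)(offset / MAX_FS);

    if (frag_num == 0)
        snprintf(frag, frag_len, "%s", file);
    else
        snprintf(frag, frag_len, "%s.frag.%ld", file, frag_num);

    *frag_off = offset % MAX_FS;
}

int main(void)
{
    char frag[4096];
    off_t off;

    /* A 6 GB logical offset lands in the second fragment (".frag.1"). */
    frag_for_offset("big.iso", (off_t)6 << 30, frag, sizeof frag, &off);
    printf("%s @ %lld\n", frag, (long long)off);
    return 0;
}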
Finally, the appropriate file (fragment) is opened at the appropriate (normalised) offset and
the input buffer is copied to its full length into the fragment. In case the normalised offset
and the input buffer size are such that the buffer spans two or more consecutive
fragments, the call copies only as many bytes from the buffer as fill the fragment to its
full length and returns that value from the function. This way, the suvfs_write() function
will be called again from the upper layers with an updated offset value and the left-over data in
the buffer. Hence, the supplied offset will again be normalised (this time to the value 0) and the
appropriate fragment number will be selected (one greater than the previous fragment number).
Algorithm 2.1 shows how the suvfs_write() call is implemented.

Algorithm 2.1 Algorithm for suvfs_write() of suvFS

Input: file, inBuf, offset, bufSz
Output: bytWrt
1: fragNum ← offset / MAX_FS
2: if fragNum > 0 then
3:   offset ← offset mod MAX_FS
4:   frag ← file + ".frag." + fragNum
5: else
6:   frag ← file
7: end if
8: if frag does not exist then
9:   createFile(frag)
10: end if
11: fileDesc ← openFile(frag)
12: spcAvail ← MAX_FS − offset
13: if bufSz > spcAvail then
14:   bufSz ← spcAvail
15: end if
16: bytWrt ← writeFile(fileDesc, inBuf, bufSz, offset)
17: closeFile(fileDesc)
18: return bytWrt
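A hedged C sketch of Algorithm 2.1 as a FUSE write handler is shown below. It reuses the
frag_for_offset() helper, the MAX_FS definition and the FUSE headers from the sketches above,
assumes the path has already been mapped to the underlying file system, and trims error
handling to the essentials; it is an illustration, not the thesis implementation itself.

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <limits.h>

static int suvfs_write(const char *path, const char *buf, size_t size,
                       off_t offset, struct fuse_file_info *fi)
{
    char frag[PATH_MAX];
    off_t frag_off;
    (void)fi;

    /* Choose the fragment that holds 'offset' and normalise the offset. */
    frag_for_offset(path, offset, frag, sizeof frag, &frag_off);

    int fd = open(frag, O_WRONLY | O_CREAT, 0644);   /* create if missing */
    if (fd < 0)
        return -errno;

    /* Never let one write cross the fragment boundary; the upper layers
     * re-issue suvfs_write() with the remaining bytes and a new offset.  */
    off_t avail = MAX_FS - frag_off;
    if ((off_t)size > avail)
        size = (size_t)avail;

    ssize_t written = pwrite(fd, buf, size, frag_off);
    close(fd);
    return written < 0 ? -errno : (int)written;
}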

2.4.2.2 suvfs_read() call

In order to process the information contained in the associated fragments, suvfs_read() inter-
cepts the read() call executed by user applications. Logically, suvfs_read() does
the reverse of suvfs_write(). Based on the supplied offset and the global MAX_FS value, it
locates the appropriate fragment and opens it at the appropriate offset for reading. However,
special care is required to deal with the situation wherein the normalised offset and the buffer
size are such that the request spans two or more consecutive fragments. In this case, the
next fragment(s) need to be read immediately to fully populate the input buffer. This is
because of a technical issue related to the read() call, which is not called again by the upper
layers if the number of bytes actually read is less than the number of bytes supposed to be
read; the read() call flags this condition as EOF or an I/O error. Algorithm 2.2 takes
care of this for the suvfs_read() call.


Algorithm 2.2 Algorithm for suvfs_read() of suvFS

Input: file, outBuf, offset, bufSz
Output: bytRead
1: fragNum ← offset / MAX_FS
2: if fragNum > 0 then
3:   frag ← file + ".frag." + fragNum
4: else
5:   frag ← file
6: end if
7: fragSz ← fileSize(frag)
8: offset ← offset mod MAX_FS
9: fileDesc ← openFile(frag)
10: if bufSz ≤ (fragSz − offset) then
11:   bytRead ← readFile(fileDesc, outBuf, bufSz, offset)
12:   closeFile(fileDesc)
13: else
14:   byt2bRead ← bufSz
15:   readCnt ← readFile(fileDesc, outBuf, (fragSz − offset), offset)
16:   byt2bRead ← byt2bRead − readCnt
17:   closeFile(fileDesc)
18:   bytRead ← readCnt
19:   while byt2bRead ≠ 0 do
20:     fragNum ← fragNum + 1
21:     offset ← 0
22:     frag ← file + ".frag." + fragNum
23:     fragSz ← fileSize(frag)
24:     fileDesc ← openFile(frag)
25:     if fileDesc is null then
26:       return bytRead
27:     end if
28:     readCnt ← readFile(fileDesc, (outBuf + bytRead), byt2bRead, offset)
29:     byt2bRead ← byt2bRead − readCnt
30:     bytRead ← bytRead + readCnt
31:     closeFile(fileDesc)
32:   end while
33: end if
34: return bytRead
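Likewise, a hedged sketch of Algorithm 2.2 as a FUSE read handler, again reusing
frag_for_offset(), MAX_FS and the headers from the earlier sketches: the loop keeps reading
from successive fragments until the caller's buffer is full or the fragment chain ends, so that a
fragment boundary is never mistaken for end-of-file.

static int suvfs_read(const char *path, char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    char frag[PATH_MAX];
    off_t frag_off;
    size_t total = 0;
    (void)fi;

    while (total < size) {
        /* Recompute the fragment for the current logical position; once a
         * full fragment is exhausted this naturally moves to the next one. */
        frag_for_offset(path, offset + (off_t)total, frag, sizeof frag, &frag_off);

        int fd = open(frag, O_RDONLY);
        if (fd < 0)
            break;                        /* no such fragment: end of chain  */

        ssize_t n = pread(fd, buf + total, size - total, frag_off);
        close(fd);
        if (n <= 0)
            break;                        /* end of the last, short fragment */
        total += (size_t)n;
    }
    return (int)total;
}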


2.4.2.3 suvfs_getattr() call

Now that suvFS is able to store, retrieve and process a large file, it still needs to intercept
the getattr() call to present a large virtual file to user applications. To do so, suvfs_getattr()
identifies all fragments associated with the first fragment and then manipulates its at-
tributes (like size) in memory, based on the attributes of the fragments, in order to present a large
virtual file. The algorithm for suvfs_getattr() is shown in Algorithm 2.3.

Algorithm 2.3 Algorithm for suvfs_getattr() of suvFS

Input: file, statBuf
Output: status
1: if file contains ".frag." then
2:   return false
3: else
4:   statBuf ← statFile(file)
5:   fragNum ← 1
6:   loop
7:     frag ← file + ".frag." + fragNum
8:     if frag exists then
9:       tmpStBuf ← statFile(frag)
10:      statBuf.Size ← statBuf.Size + tmpStBuf.Size
11:      fragNum ← fragNum + 1
12:    else
13:      break
14:    end if
15:  end loop
16:  return true
17: end if
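A corresponding sketch of Algorithm 2.3 (same hedged assumptions as the earlier sketches;
<string.h>, <sys/stat.h> and <errno.h> are additionally required): the handler hides fragments,
stats the first fragment and then accumulates the sizes of its ".frag.<n>" siblings so applications
see one large virtual file.

static int suvfs_getattr(const char *path, struct stat *st)
{
    char frag[PATH_MAX];
    struct stat fst;
    long frag_num = 1;

    if (strstr(path, ".frag."))           /* never expose a fragment itself  */
        return -ENOENT;
    if (lstat(path, st) < 0)
        return -errno;

    for (;;) {
        snprintf(frag, sizeof frag, "%s.frag.%ld", path, frag_num);
        if (lstat(frag, &fst) < 0)
            break;                        /* no more fragments in the chain  */
        st->st_size += fst.st_size;       /* grow the virtual file's size    */
        frag_num++;
    }
    return 0;
}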

2.4.2.4 suvfs_readdir() call

suvFS needs to exclude fragments from being listed during directory listing. In order to
accomplish this task, suvFS intercepts the readdir() call. suvfs_readdir() filters the
fragments by parsing filenames for the magic string ".frag." and excludes them from the listing.
The algorithm for suvfs_readdir() is shown in Algorithm 2.4.


Algorithm 2.4 Algorithm for suvfs_readdir() of suvFS

Input: dir
Output: status
1: for each filename in dir do
2:   if filename contains ".frag." then
3:     continue
4:   end if
5:   {Continue normal processing}
6: end for
7: return true
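And a sketch of Algorithm 2.4 as a FUSE readdir handler (same assumptions; <dirent.h> is
additionally required): every directory entry whose name contains the magic string is simply
skipped before it reaches the caller.

static int suvfs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    DIR *dp = opendir(path);
    struct dirent *de;
    (void)offset;
    (void)fi;

    if (dp == NULL)
        return -errno;

    while ((de = readdir(dp)) != NULL) {
        if (strstr(de->d_name, ".frag."))
            continue;                     /* hide fragments from the listing */
        if (filler(buf, de->d_name, NULL, 0))
            break;                        /* caller's buffer is full         */
    }
    closedir(dp);
    return 0;
}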

2.4.2.5 suvfs_unlink() & other related calls

Calls like unlink(), or any other metadata operation like rename, change of permissions, and
so on, need to be intercepted to maintain the semantics of the operation by executing the
same operation on all associated fragments of a large virtual file. suvfs_unlink() does so by
finding all fragments belonging to the file (if it has any) and executing the same operation with
the same arguments over them. The algorithm is shown in Algorithm 2.5.

Algorithm 2.5 Algorithm for suvfs_unlink() of suvFS

Input: file
Output: status
1: if file contains ".frag." then
2:   return false
3: end if
4: unlink(file)
5: fragNum ← 1
6: loop
7:   frag ← file + ".frag." + fragNum
8:   if frag does not exist then
9:     break
10:  else
11:    unlink(frag)
12:    fragNum ← fragNum + 1
13:  end if
14: end loop
15: return true


2.4.2.6 suvfs_open() & other related calls

Every file system call that can directly access a fragment needs to be intercepted. suvfs_open()
denies direct access to fragments by parsing the filename for the magic string ".frag.". This
is also done by suvfs_getattr(). Algorithm 2.6 shows the implementation.

Algorithm 2.6 Algorithm for suvfs_open() of suvFS

Input: file
Output: status
1: if file contains ".frag." then
2:   return false
3: end if
4: {Continue normal processing}

It is worth mentioning here that every procedure of suvFS compares the size of the input
file with MAX_FS before assuming it to have associated fragments, or in other words, to be
virtually large. This decision is based on the fact that if the size of a file is less than MAX_FS
it can't have associated fragments; otherwise it probably does. This avoids the
performance overhead that would otherwise be incurred in processing normal-sized files.
Furthermore, to reduce the performance overhead even more, no special database or meta-
data structure is maintained to identify the fragments associated with a large file. Rather, a
fragment is identified if it has the following filename signature: <filename>.frag.<offset>.
However, this design decision creates the limitation that if any fragment in a chain is missing
or deleted, all the fragments following that fragment are lost too. The reason behind this is
that suvFS procedures traverse the chain of fragments sequentially from the seed point fragment
(the seed point is calculated using the supplied offset) to fragments at higher offsets. Normally,
when the last fragment in the chain has been traversed, the next higher fragment is not found by
suvFS. As such, it stops further traversal by considering it the last in the chain. This simplifies
the identification, traversal and management of
fragments. However, if a fragment at any offset in the chain (except the last) is accidentally
lost or deleted, the fragment before it in the chain will be considered the last in the chain. As a
consequence, the deleted fragment, along with the rest of the fragments following it, is logically
lost.
Such a situation is not possible as long as suvFS is overlaid; otherwise, information loss
followed by space leakage can be the consequence. However, this is also true with other
file systems if the file is lost or deleted accidentally in which case there is a complete loss of
the information.
Having said that, suvFS accomplishes its intended task. Figure 2.5 shows suvFS in action.
The screen-shot shows that after suvFS was overlaid on a FAT32 volume, a 16 GB file was created
on that volume, which is otherwise not possible. Furthermore, it shows that on listing the
directory a large virtual file was presented while no fragments were visible. Also, as soon as
suvFS was unmounted, the fragments were displayed as simple files.

Figure 2.5: Screen-shot of suvFS in action when mounted on top of FAT32 file system

2.4.3 Experiment

The experiment was aimed at evaluating the effect of suvFS on the performance of the native
file system over which it is overlaid. We evaluated and compared the following three configura-
tions: vanilla FAT32 file system (FAT32); FAT32 file system with a null-pass FUSE file system
(FAT32-np); and FAT32 file system with suvFS mounted on top (FAT32-suv). We chose the FAT32
file system for three reasons. First, it is the most compatible and the most popular file
system used on a wide variety of storage devices. Second, the maximum file size possible on a
FAT32 volume is just 4 GB, and hence it is the most eligible file system to receive this scala-
bility extension. Third, the performance of the FAT32 file system is not very good, and hence it
can give us a fairly accurate measure of the performance overhead added by suvFS. Furthermore,
the FAT32-np configuration is evaluated and compared in order to investigate whether any
possible performance hit is due to the suvFS algorithms or the FUSE framework.


The experiment was conducted on an Intel based PC with Intel Core i3 CPU 550 @
3.20 GHz with total 4 MB cache, and 2 GB DDR3 1333 MHz SDRAM. The magnetic disk drive
used was a 7200 rpm 320 GB SATA drive with an on-board cache of 8 MB. The drive was parti-
tioned into a 20 GB primary partition to host the Fedora Core 14 operating system having
kernel version 2.6.35.14-95.fc.14.i686. Another 160 GB primary partition was used
to mount the FAT32 file system, which was extended to support large file sizes by mounting suvFS
on top of it. Furthermore, the evaluation was done with Linux running at runlevel 1 to
reduce the random effect of other applications and daemons. In addition, the experiment was
repeated at least five times and an average of each individual phase was considered. Moreover,
in each phase, the standard deviation was less than 4 % of the average.
The Sprite LFS large-file microbenchmark could have been used to exercise all the configurations,
but due to its fixed file size it does not scale to our requirements. Therefore, we modified the
Sprite LFS large-file microbenchmark to support large file sizes and to run three phases: the
first phase creates and writes files sequentially, the second phase reads them sequentially, and
the final phase deletes them. On each configuration we ran this modified benchmark with 4 GB,
8 GB and 16 GB file sizes. The benchmark was implemented as a simple shell script using
bash version 4.1.7(1). Each test created, read and deleted 10 files and the results were
averaged for writing, reading and deleting a single file.
In addition, the caches were flushed between each phase by rebooting the machine.
The random-access phases of the Sprite LFS large-file micro-benchmark were discarded because
large files are generally read and written sequentially, as has been reported from time to time;
the deletion phase was introduced instead. As mentioned earlier, the suvFS algorithms are not
executed if the file size is less than the maximum file size. This saves suvFS from being
benchmarked for small files, for which any performance overhead would be added entirely by
the FUSE framework.

2.4.4 Results & Discussion

As the FAT32 and FAT32-np configurations can't be evaluated for 8 GB and 16 GB files, we evalu-
ated these configurations for reading, writing and deleting files of size 512 MB, 1 GB, 2 GB, 3
GB and 4 GB in order to statistically extrapolate the results for 8 GB and 16 GB files on these
configurations. In both the FAT32 and FAT32-np configurations, we found that the time for
reading and writing a file sequentially, and for deleting a file, doubles as we double
the file size. Based on this observation we calculated the expected time for all phases of the Sprite
LFS large-file microbenchmark for file sizes of 8 GB and 16 GB on these configurations.
Figure 2.6 shows the result of Sprite LFS large-file micro-benchmark executed on all con-
figurations for sequentially writing 4 GB, 8 GB and 16 GB files. The results point out the
inherent performance overhead of FUSE framework present in FAT32-np configuration when
compared with FAT32 configuration. Also, this performance overhead doubles as the file size
doubles. In contrast, when FAT32-np and FAT32-suv configuration is compared, the perfor-
mance overhead is not significant but indeed it grows with growth in file size. This clearly
indicates that although the overall overhead is large when FAT32 and FAT32-suv is compared
but the overhead is mainly due to FUSE framework rather than suvFS algorithms. In our
opinion, this will be amortized by faster CPUs and lesser process and context switches as
argued by [Rajgarhia and Gehani, 2010].
Figure 2.6: suvFS: Sprite LFS large-file micro-benchmark results for Write phase (time elapsed in
seconds for 4 GB, 8 GB and 16 GB files on the FAT32, FAT32-np and FAT32-suv configurations)

Figure 2.7 shows the result of Sprite LFS large-file micro-benchmark executed on all con-
figurations for sequentially reading 4 GB, 8 GB and 16 GB files. These results also report the
existence of performance overhead due to the FUSE framework. Furthermore, when the FAT32-np
and FAT32-suv configurations are compared, the performance overhead is negligible. Moreover,
it grows negligibly with growth in file size. This is a good sign, as large files are mostly
read sequentially. Furthermore, this behaviour is because of the reduced number of context
and process switches in the suvfs_read() call, as the user buffer is filled in its entirety as long as
there exist fragments containing data. This is not the case with the suvfs_write() call, wherein
the user buffer may be only partially consumed if the rest of the data falls in the next fragment(s).

Figure 2.7: suvFS: Sprite LFS large-file micro-benchmark results for Read phase (time elapsed in
seconds for 4 GB, 8 GB and 16 GB files on the FAT32, FAT32-np and FAT32-suv configurations)

Figure 2.8: suvFS: Sprite LFS large-file micro-benchmark results for Delete phase (time elapsed in
seconds for 4 GB, 8 GB and 16 GB files on the FAT32, FAT32-np and FAT32-suv configurations)

Figure 2.8 shows the result of Sprite LFS large-file micro-benchmark executed on all config-
urations for deleting 4 GB, 8 GB and 16 GB files. The results indicate that the performance
overhead added by the FUSE framework exists in both the FAT32-np and FAT32-suv configurations.
However, the overhead is mainly due to the number of fragments to be deleted and thus in-
creases as the file size increases. This means the fragmentation approach of suvFS degrades
the performance of the deletion operation by demanding the deletion of more files than the
application actually requested.
In order to further investigate the overhead added by suvFS algorithms, we eliminated
the performance overhead added by FUSE framework. Table 2.5 shows the percentage of
performance overhead added by suvFS algorithms for all phases of Sprite LFS large-file micro-
benchmark for different file sizes after eliminating the overhead of FUSE framework. The
results were normalised by using the results of FAT32-np configuration for these phases. Also,
the table shows the average growth of performance overhead when the file size is doubled.
Table 2.5 clearly indicates that the suvFS algorithms do not add any performance overhead

Table 2.5: Performance overhead added by suvFS after normalisation

Phases             4 GB File    8 GB File    16 GB File    Average      Average Overhead
                                                           Overhead     Growth
Sequential Write   0 %          8.99 %       13.46 %       11.22 %      6.73 %
Sequential Read    0 %          0.16 %       1.03 %        0.59 %       0.51 %
Deletion           0 %          10.46 %      19.08 %       14.77 %      9.54 %

for any operation when file sizes are less than or equal to the maximum file size supported by
the FAT32 file system. Furthermore, the average overhead added by the suvfs_read() call is
negligible for every file size (0.59 % average overhead) and grows negligibly, with an average
of 0.51 %, when the file size is doubled. However, this is not true for the suvfs_write() call. The
average overhead added for every file size is significant (11.22 % average overhead), though
the average growth of overhead is about 6.73 % when the file size is doubled. In the case of the
suvfs_unlink() call, the average growth of overhead is significant (9.54 %), like its average
overhead (14.77 %).

2.4.5 Conclusion

We argued that virtual unification can be exploited to overcome the maximum file size lim-
itation of a file system. To validate, we designed and implemented a layered file system,
suvFS. suvFS works in user-space and extends any below mounted file system to support
large file sizes. It does so by splitting a large file into legitimately sized fragments. The
splitting is transparent to user applications and does not demand any design or source mod-
ification of library, file systems or kernel. suvFS virtually unifies the fragments into a large
file which supports all the file operations like reading, writing, deletion, and so on, by pre-
and post-processing file system syscalls in suvFS.
Furthermore, we evaluated the performance hit of the FAT32 file system with suvFS mounted
on top of it. The results indicate that suvFS adds almost no performance overhead for
sequentially reading large files. For sequentially writing large files, suvFS deteriorates the
performance of the below mounted file system, but the performance hit is largely due to the FUSE
framework used. Also, the approach of fragmenting large files demands their individual dele-
tion during the deletion operation, and thus significantly hinders its performance.
malised statistics reveal that the average performance overhead added by suvFS algorithms
during reading is as low as 0.59 % while during writing and deleting it is as high as 11.22 %
and 14.77 %. However, the average growth of performance overhead with doubling the file
size that is added by suvFS algorithms is 0.51 % for reading, 6.73 % for writing and 9.54 %
for deletion.
We can conclude that the benefits of large file size scalability, which can be added to any
file system across all popular platforms without modifying any user-level programs or kernel-
level modules and with a minimal performance hit for sequentially reading a large file, which is
the most common operation performed on large files, overshadow the drawbacks.

2.4.6 Limitations & Future Scope

The file size supported by suvFS is logically unlimited, but in practice the size primarily
depends on the bit length of the offset data type of the VFS (in Linux, off_t is 32 bits wide
while off64_t is 64 bits wide). However, this limitation also constrains even highly scalable file
systems. In addition to this, the number of files that can be created on a volume and within a
directory also imposes a file size limitation on suvFS when it is mounted on top of a file system.
Furthermore, suvFS does not allow the creation of a file whose name contains the signature of a
fragment.
The suvFS file system can be further optimised by avoiding the lookups necessary to
identify and locate the fragments. In addition, the performance of suvFS can be evaluated
over a range of file systems. Also, based on the concept of the suvFS file system, a system to break
the file count limitation of file systems can be designed and developed, which can be combined
with suvFS and mhddfs to overcome all size and count limitations of a file system.


Chapter 3

File System Performance


3.1 Introduction

File Systems interact with the diverse physical components of a system such as CPU, mem-
ory, I/O buses, storage subsystem, and so on. These components of a system greatly vary
in their underlying build technology and therefore, operate at different speeds. Furthermore,
because different technologies advance at different paces, this speed gap which ultimately
leads to performance gap, is widening. As a consequence, file systems are put under high
pressure to handle this speed mismatch in order to deliver performance. Though this speed
mismatch is not only faced by the file system component of an OS, however file systems,
specifically disk file systems, face more severe performance challenges than any other compo-
nent of an OS. In fact, disk file systems have become the major performance bottleneck of the
overall system performance. The performance of a disk file system is limited by three factors;
1) the performance limitation of magnetic disk, 2) the diverse file system workloads, and 3)
the traditional file system designs. Although magnetic disk technology has improved since its
inception in 1956, but the improvements have been primarily in the areas of cost and capacity
rather than performance. The fact is that disk drives at heart are mechanical devices and as
such can not improve as quickly as solid state devices. As an example, the CPU performance
has increased 16,800 times between 1998 and 2008, but disk performance has increased by
just 11 times [Klein, 2008] (since 1956, the capacity of the magnetic disk has increased by 50
million fold). On the other hand, file systems are often subjected to heavier
and more intense workloads than they were originally designed for. These workloads are
getting heavier and more intense as digital technologies penetrate deeper and deeper
into our lives. Therefore, in order to deliver performance, file systems should adapt to any
major change in hardware and workload trends. However, due to the complexity involved in
designing, implementing, and even maintaining a file system, file systems do not always cope
with these changes in a timely manner. For example, traditional file systems such as FFS
(Fast File System) [McKusick et al., 1984] and similar other file systems were designed with
different assumptions about the underlying hardware and the workloads to which these file
systems would be subjected. Over the years, they have received limited upgrades particularly
in the area of the on-disk data structures in order to maintain compatibility. Unfortunately,
on-disk data structures are often the limiting factor in performance. Although these file sys-
tems have served their time well, they have become less appealing in dealing with current
hardware and workload trends.
One solution to this problem would be to rework the traditional file system design to better
exploit the current hardware characteristics in an effort to satisfy the demands of workloads.
Unfortunately, the sheer degree of complexity involved in designing and implementing a
complete file system makes it infeasible. Furthermore, they would need to be reworked in a way
that lets them adapt to the changing environment in the future, failing which we are brought
back to the original problem. Moreover, reworking also has to deal with compatibility
issues. Another possible solution would be to run only those workloads on a file system for
which it has been optimised. However, real life workloads keep on fluctuating and changing
dramatically.
Nevertheless, the problem can be solved at the root level. The single most important
factor in determining the performance of a file system is the number of seeks and rotational
delays that it incurs [Smith and Seltzer, 1994]. A file system that makes many seeks will
spend a disproportionate amount of time waiting for the disk head to reach its destination
and correspondingly little time transferring data to or from the disk. This results in poor
file system performance. Furthermore, the scenario is exaggerated by the length of seeks and
the rotational distance covered. Similarly, a file system that performs few seeks will spend a
larger portion of its time transferring data to and from the disk and will have correspondingly
higher performance. Although the cost associated with the length of a disk seek and with the
rotational distance can be amortised by high speed magnetic disks but the total cost is still
unaffordable with a huge number of disk seeks and rotational delays. Moreover, magnetic
disks are mechanical at heart and hence the cost associated with a disk seek and a rotational
delay can’t be reduced significantly. However, if the number of disk seeks and rotational
latencies are minimised at a level higher than that of disk but a level lower than that of file
system design, the performance of a file system will increase without the need to rework the
file system design and without the need to narrow down workloads.
The rest of this chapter is organised as follows. Section 3.2 discusses the various factors
that affect the performance of a disk file system. Section 3.3 discusses the on-disk layout,
the performance challenges and their mitigation in two theoretically high performance file sys-
tems: FFS and LFS. Section 3.4 discusses various proposals that tend to enhance the
performance of disk file systems without modifying their design. Finally, section 3.5 discusses
an approach to reduce the seek and rotational latencies in FAT file systems. To validate this
approach, the section discusses the design, simulation and evaluation of a high performance
hybrid FAT32 file system called hFAT.


3.2 Factors Affecting Performance of a Disk File System

The performance of a disk file system is affected by three main factors as mentioned in
previous section. This section discusses these factors in detail. The goal of this section is
to understand how these factors affect the performance of a disk file system. Furthermore,
for every factor, this section also discusses possible solution(s) to eliminate or reduce its
effect.

3.2.1 Magnetic Disk Drive Performance Limitations

Due to their mechanical nature, magnetic disk drives don't advance at the same rate
as solid state devices do, and have thus become the main component of the file system to
create a performance bottleneck. Therefore, the challenge in building a high performance disk
file system is in using the disk subsystem efficiently [Seltzer et al., 1995]. Although there
are many factors associated with the performance of a disk drive, it is basically made up
of two components: Bandwidth and Access time [Ng, 1998]. The bandwidth of a disk drive
is the maximum amount of data that can be read (or written) from it per unit time. This
bandwidth of a disk grows proportionally with increase in areal density of the disk and its
rotational speed. Areal density dictates the number of sectors that can be put on a track
while rotational speed dictates the time taken to read a complete track. Thus, for same areal
density, the disk with higher rotational speed has higher bandwidth. On the other hand, the
access time for a disk I/O request can be broken into two parts; 1) one that is dependent on
the amount of data that is being transferred, and 2) the one that is not. The former usually
consists of just the media transfer time. However, most of the access time components which
include command processing overhead, seek time and rotational latency are independent of
request size.
Areal density of magnetic disk drive is progressing faster than that of semi-conductor
technology and hence also improves the media transfer time. Furthermore, the command
processing overhead improves with improvement in semi-conductor technology. However,
seek time and rotational latency (rotational speed) are purely dependent on the mechanical
motions and hence lag behind. Rotational latency is inversely proportional to the rotational
speed of the disk spindle. Therefore, by increasing rpm (rotations per minute) of disk, the
average rotational latency (which is time taken by a platter to complete half rotation) is
decreased. Experiments have shown that increasing the spindle speed to 7,200 rpm while
keeping areal density constant, can improve the data rate by 71 % over 4,200 rpm drives
and by 33 % over 5,400 rpm drives [Blount, 2007]. Furthermore, the latency was improved
by 42 % and 25 % respectively.
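To make the rotational numbers concrete, the average rotational latency is the time of half a
revolution; a simple worked example (consistent with the improvements quoted above from
[Blount, 2007]) is:

\[
t_{avg} = \frac{1}{2}\cdot\frac{60}{\mathrm{rpm}}\ \mathrm{s}
\quad\Rightarrow\quad
t_{4200} \approx 7.14\,\mathrm{ms},\quad
t_{5400} \approx 5.56\,\mathrm{ms},\quad
t_{7200} \approx 4.17\,\mathrm{ms}
\]

so that (7.14 − 4.17)/7.14 ≈ 42 % and (5.56 − 4.17)/5.56 ≈ 25 %, matching the latency
improvements quoted above.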
Although, as of this writing, 15,000 rpm and 20,000 rpm disks are commercially available,
there is a limit to which the rpm of a disk spindle can be increased without crashing the
disk. Similarly, seeks are mechanical motions and cannot improve quickly.
Now, we can safely argue that the performance of a modern magnetic disk drive is limited
by two parameters which are totally mechanical in nature; Rotational Latency and Seek Time.
Furthermore, the performance gap between magnetic disk drive and other solid state devices
is widening all because of these two parameters. Even within the drive, other parameters
like areal density are progressing faster than those of solid state devices, but this progress
only increases capacity. As such, the key to a high performance disk file system
is to minimize the number of disk seeks and rotational latencies. In addition to this, the
mandatory seeks and rotational latencies should be smaller in length.

3.2.2 Conventional File System Design Constraints

In a conventional file system design, a file is represented by two components; 1) File Metadata
and 2) File Userdata. File metadata gives a file its identity within the file system which
includes its name, size and so on, and locates all the blocks associated with the file containing
its userdata. Generally, a file system’s on-disk layout consists of data structures to hold the
metadata about the whole file system (which identifies volume size, free and allocated blocks,
etc.) and the metadata about the individual files. However, no specific data structure is
maintained to hold the userdata of a file. Although the notions of block, cluster and extent
are usually used, they simply identify the minimum amount of physically contiguous sectors
which can be allocated to a file. In other words, metadata alone has an identity while userdata
doesn’t and is dependent upon metadata for its identity. These on-disk metadata structures
uniquely identify a file system design and support the policies and protocols employed by
the file system operations for that file system. Depending upon the design of the file system,
these on-disk metadata structures have a fixed starting location. In case any of these data
structures has a dynamic location, the location is either identified by some other fixed-
location metadata structure or is hard-coded in the file system code. However, the userdata
of the files they identify never has a fixed location, as the locations are set during
runtime and are identified by metadata. This means that by just examining a data block's
content one cannot tell to which file the data belongs (say, to file1.txt or file2.txt),
not even whether it belongs to an existing file or a deleted one or, worse, whether it was ever
allocated to any file at all. This is because the information needed to qualify a data block
is held in metadata structure of the file system. Figure 3.1 shows the logical on-disk layout
of a conventional file system.

Figure 3.1: Logical on-disk layout of a conventional disk file system (the metadata and userdata
blocks of file 1 and file 2 are scattered across the disk and connected only by metadata and
userdata links)

This conventional file system design allows disk seek and rotational latency to creep in
during the file system operations and hence degrade the performance of a disk file system.
Consider a workload where a large number of small files are being accessed. There is a proba-
bility that the data blocks belonging to these files lie on the same cylinder. As such, accessing
the data of these files does not in itself require any seek. However, to access each file there is
a mandatory head seek from the current location to the block holding its metadata, followed by
another seek to the identified userdata block. Moreover, there is very little chance that the
block holding the metadata structure of a file and its corresponding userdata block are vertically
aligned, i.e. at a rotationally optimal position. This gives rotational latency a chance to creep
in. Furthermore, if the file data is not small enough to fit within a single data block and
spans multiple blocks, intra-file seeks and rotational latencies to access the userdata are
expected when those blocks don't lie on the same cylinder. Figure 3.1 shows a scenario wherein,
to access file1 (or file2), a seek and rotational latency are required between the sectors hold-
ing metadata and userdata, and similar positional latencies occur between the various userdata
sectors of the file. Moreover, depending upon the workload, inter-file latencies are possible between
file1 and file2.


This indicates that, for a high performance file system, the separation of metadata and
userdata should be kept as small as possible to avoid seeks and rotational latencies. In
addition to this, the intra- and inter-file seek and rotational latencies should be kept low.

3.2.3 Workload Implications

As digital technologies become more capable and affordable, they penetrate deeper into
every aspect of daily life, leading to the growth and proliferation of digital data. As
a consequence, several different types of file system workload are possible. However, all
of them lie between two extremes: 1) workloads dominated by accesses to small files, and
2) workloads dominated by accesses to large files. There are several studies on file system
workload characterization, and from time to time these studies echo the findings of previous
ones. All of them report the existence of a large number of small files and a large number
of accesses to small files. One such study, conducted by [Baker et al., 1991], reported that 80%
of file accesses are to files smaller than 10 KB. In addition, [Ganger and Kaashoek, 1997]
reported that 79% of files are smaller than 8 KB. Similarly, [Roselli et al., 2000] reported that
small files account for a large fraction of file accesses and that many files tend to be read-mostly
while a small number are write-mostly. Furthermore, in a more recent study, [Tanenbaum
et al., 2006] reported that 70% of the files are smaller than 8 KB and that these are read 60% of
the time and written 80% of the time. Also, large files (greater than 54 MB) account for 10% of
the total number of files and are read and written less than 10% of the time. In another recent
study, [Agrawal et al., 2007] reported that most of the storage space is occupied by large files
while most files are 4 KB or smaller. They also argued that the absolute count of files per file
system has grown significantly over time, but that the general shape of the distribution has not
changed significantly. Table 3.1 lists the characteristics of common file system workloads based
on these studies.

Table 3.1: Characteristics of common file system workloads

Parameter       Small Files                Large Files
Storage Space   small fraction             large fraction
Count           large number               small number
Accessed        mostly & frequently        occasionally
Operations      both read and written      usually read only
Pattern         sequential & random        generally sequential

We can conclude that the number of accesses made to metadata during such a workload
is large, because a large number of files are frequently accessed. As a consequence, in a
conventional file system design, a large number of head seeks and rotational latencies are
incurred to access metadata in order to locate the userdata of a file. Because these files are
small, one access to metadata is usually enough to fully identify and locate the file's userdata.
However, as these files are also written, updating metadata is necessary; consequently, for
every write operation a further head seek and rotational latency is required to update the
metadata. The situation is exacerbated by the intra- and inter-file userdata block head seeks
and rotational latencies. Therefore, the key to a high performance disk file system is to reduce
the frequent head seeks and rotational latencies incurred during access to metadata. In
addition, the length of the mandatory seeks and rotational latencies incurred during intra-
and inter-file userdata block accesses should be reduced.

3.3 Notable High Performance Disk File System Designs

All the factors mentioned previously in section 3.2 suggest that the performance of a disk file
system depends largely upon its ability to reduce seeks and rotational latencies, which in turn
relies on its on-disk layout. Since the on-disk data layout (both userdata and metadata) of
a file system is affected by its update policy, file systems can accordingly be classified into
two types: 1) update-in-place file systems, and 2) update-out-of-place file systems. In an
update-in-place file system the location of all data (be it metadata or userdata) pertaining
to a file is fixed after the initial allocation. Thus, updating existing data requires a possible
seek to its original location. Also, deallocated blocks are freed explicitly by updating the
fixed-location bitmap. Update-in-place file systems have been used widely; examples include
FFS, JFS, XFS, ReiserFS, ext3, etc.
In contrast, an update-out-of-place file system commits every update to a new location using a
no-overwrite policy. This eliminates the need to seek back to the original position when writing
an existing block, but entails a garbage collector to reclaim the old disk space. Update-out-of-
place file systems have not enjoyed much success; examples include LFS & WAFL. This section
discusses the FFS and LFS file system designs, which are contemporary examples of update-in-
place and update-out-of-place file system designs respectively. The goal of this discussion is
to unveil the design policy each file system uses to deliver high performance.

3.3.1 Fast File System

The Old Unix file system (UFS) was designed to provide hierarchical access to file data, with little
attention paid to capacity and performance. With time, as the average size of magnetic
disk drives and of file system workloads increased, the Old Unix file system became the bottleneck
of overall system performance. At that time, a 150 MB disk with the Old Unix file system
mounted on top consisted of 4 MB of inodes followed by 146 MB of user data. This organization
put the inode information far away from the data. Thus accessing a file normally incurred
a long seek and a possible rotational latency from the file's inode to its data. Furthermore,
files in a single directory were not explicitly allocated consecutive inodes. This resulted in
many head seeks and possible rotational latencies to access non-consecutive blocks of inodes
belonging to files contained in the same directory. Also, the block size was 512 bytes, and
often the next logically sequential data block was not physically sequential, forcing head seeks
and rotational latencies between 512 byte transfers.
In 1984, [McKusick et al., 1984] refined this Old Unix file system by proposing Fast File
System (FFS). In FFS, the disk is statically partitioned into cylinder groups, typically between
16 and 32 cylinders to a group. Each group contains a redundant copy of Super block, a
fixed number of inodes, 4096 bytes long data blocks and bitmaps to record inodes and data
blocks available for allocation. The inodes in a cylinder group reside at fixed disk addresses,
so that disk addresses may be computed from inode numbers. To optimise sequential file
access, new blocks are allocated such that no seek is required between two consecutive
accesses. Also, FFS tries to place the data blocks of a file in the same cylinder group, preferably
at rotationally optimal positions. The fundamental idea behind this transformation was to
achieve logical locality by placing all related data of a file (metadata and userdata), and all
files belonging to the same directory, near each other. As a consequence, the seek time
is minimized significantly and the performance is increased dramatically. This idea was later
adopted by the Linux Extended file systems [Card et al., 1995]. Figure 3.2 shows the on-disk layout
of the Old Unix file system and FFS.
FFS was further refined in 1991 by [McVoy and Kleiman, 1991], without changing the on-disk
layout of the file system, by introducing file system clustering in order to run disks at their
full bandwidth. McVoy et al. retained the on-disk layout of FFS but modified its code
to allocate files contiguously and to perform sequential I/O in units of clusters. This modification
produced performance advantages comparable to those of extent-based file systems by removing
rotational delays.

Figure 3.2: On-disk layout of Old Unix File System and FFS (left: the Old Unix FS, with a superblock followed by inode blocks 1..N and data blocks 1..N; right: FFS, with the disk divided into cylinder groups, each containing cylinder group metadata, inode blocks and data blocks)

In 1997, [Ganger and Kaashoek, 1997] further increased the logical locality of semantically
related data. They proposed Co-locating FFS (C-FFS), which introduced two new techniques
for exploiting disk bandwidth for small files: 1) embedded inodes and 2) explicit grouping.
Embedding inodes in the directory that names them, rather than storing them in separate
blocks, removes a physical on-disk level of indirection without sacrificing the logical level of
indirection. This technique halves the number of blocks that must be accessed to open a file and
allows the inodes pertaining to the files contained in a directory to be accessed without requesting
additional blocks. On the other hand, explicit grouping places the data blocks of multiple
files at adjacent disk locations and accesses them as a single unit most of the time. C-FFS
groups files whose inodes are embedded in the same directory. As such, embedding inodes
both simplifies the implementation and enhances the efficiency of explicit grouping. Explicit
grouping was used because accessing several blocks rather than just one involved a fairly
small additional cost. As an example, even assuming minimal seek distances, accessing 64
KB data sequentially requires less than twice as long as accessing a 512 byte sector.
The remaining factors limiting FFS performance were synchronous file creation and deletion;
although synchronous metadata updates ensured FFS reliability, they hindered its performance.
Ganger et al. proposed Soft Updates as a solution [Ganger and Patt, 1994][Ganger et al., 2000].
Soft Updates has two advantages: 1) it provides the same reliability as synchronous updates, and 2)
it eliminates the performance overhead incurred due to synchronous updates.


The FFS design has served its time well. It exploited various techniques to reduce the number
and length of head seeks and rotational latencies in order to deliver performance. However,
in order to maintain compatibility its on-disk layout has not changed much over the years.
This is the main reason why FFS is not able to cope with current hardware and workload
trends. Furthermore, FFS does not handle small files as efficiently as it does large files. Also,
for metadata and small files the disk bandwidth is always underutilised, as clustered I/O does
not help here; metadata and small files are therefore more susceptible to disk seeks. In
addition, the assumptions made by FFS about the physical geometry of magnetic disk drives
are no longer valid, since modern disk drives have a variable number of sectors per cylinder.

3.3.2 Log Structured File System

Log Structured File system (LFS) was proposed by Mendel Rosenblum and John K.
Ousterhout in 1992 [Rosenblum, 1992][Rosenblum and Ousterhout, 1992]. LFS design
was based on the assumption that files are cached in main memory and with increasing mem-
ory sizes, the caches will become more and more effective at satisfying read requests. As a
consequence, disk traffic will be dominated by writes and hence a file system design should be
optimised for writes [Ousterhout et al., 1985]. The fundamental idea of a log-structured file
system is to improve write performance by buffering a sequence of file system changes in the
cache and then writing all the changes to disk sequentially in a single disk write operation.
The information written to disk in the write operation includes file data blocks, attributes,
index blocks, directories, and almost all the other information used to manage the file sys-
tem. This approach increases the write performance dramatically by eliminating almost all
seeks. LFS is also optimized for reading files which are written in their entirety over a brief
period of time. Furthermore, it provides temporal locality; in other words it is optimized
for accessing files that were created or modified at approximately the same time. In LFS, the log
is the only on-disk data structure and all file system updates are written sequentially to
it. Although logging had been used before LFS, it served only as an auxiliary structure to hold
updates until they were committed to their corresponding individual files; such an auxiliary log
speeds up writes and ensures crash recovery. In LFS, however, the log is a permanent data store
to which every update is written sequentially.
[Rosenblum and Ousterhout, 1992] also implemented a prototype of LFS called Sprite-
LFS for Sprite Network OS [Ousterhout et al., 1988]. Sprite-LFS is a hybrid design between
sequential database log and FFS. It performs all writes sequentially but incorporates the FFS
index structure into this log to support efficient random retrieval. In Sprite-LFS, the disk is
statically divided into large fixed-size extents called segments. The logical ordering of these
segments creates a single, continuous log. When writing, Sprite-LFS gathers many dirty
pages and prepares to write them to disk sequentially in the next available segment. Since
Sprite-LFS writes dirty data blocks into the next available segment, the modified blocks are
written to the disk in different locations than the original blocks. This space reallocation
is called a “no-overwrite” policy, and it necessitates a mechanism to reclaim space resulting
from deleted or overwritten blocks. Figure 3.3 shows the structural changes in LFS layout
during update operation.

Figure 3.3: Structural changes in LFS layout during update operation (legend: data blocks associated to a file before the update, data blocks associated to the file after the update, metadata of the file, dead blocks to be reclaimed, end of log; after blocks 1 and 3 are updated, their new copies are appended at the end of the log and the old copies become dead blocks to be reclaimed)
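The no-overwrite, append-to-segment behaviour described above can be illustrated with a minimal sketch; the segment size, the in-memory structures and the Python representation below are illustrative assumptions and do not reflect the actual Sprite-LFS data structures.

# Minimal sketch of log-structured writing: dirty blocks are buffered in memory
# and appended sequentially to the next free segment; an index (here a simple
# map) records each block's newest location, leaving old copies as garbage for
# the cleaner to reclaim.
SEGMENT_BLOCKS = 4          # illustrative segment size (blocks per segment)

log = []                    # the on-disk log, modelled as a list of segments
index = {}                  # (file, block_no) -> (segment_no, offset)
dirty = []                  # buffered updates: (file, block_no, data)

def write(file, block_no, data):
    dirty.append((file, block_no, data))
    if len(dirty) == SEGMENT_BLOCKS:
        flush()

def flush():
    """Write all buffered blocks sequentially into a new segment."""
    seg_no = len(log)
    log.append(list(dirty))
    for offset, (file, block_no, _) in enumerate(dirty):
        index[(file, block_no)] = (seg_no, offset)   # old location becomes dead
    dirty.clear()

def read(file, block_no):
    seg_no, offset = index[(file, block_no)]
    return log[seg_no][offset][2]

# Updating blocks 1 and 3 of a file relocates them to the tail of the log,
# as in Figure 3.3; their previous copies become dead blocks.
for b in range(4):
    write("f", b, f"v0-{b}")                       # fills and flushes segment 0
write("f", 1, "v1-1"); write("f", 3, "v1-3")
write("f", 0, "v1-0"); write("f", 2, "v1-2")       # fills and flushes segment 1
print(read("f", 1), read("f", 3))                  # -> v1-1 v1-3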

[Seltzer et al., 1993] implemented a prototype of LFS on BSD Unix called BSD-LFS. This
variant was designed to provide the recoverability of FFS, performance equal to or
better than that of FFS, and integration into UNIX systems. However, this implementation unveiled
some problems in the overall LFS design. First, the garbage collector has a severe impact under certain
workloads because non-empty segments must first be read from the disk in order to single
out and relocate the live blocks in them. Second, the memory management is more complicated
than in FFS. Third, the delayed allocation in BSD-LFS makes accounting of available free space
more complex than in a pre-allocating file system like FFS. As an example, in Sprite-LFS the
free space available is the sum of the disk space available to the file system and the buffer pool. As a
result, data may be written to the buffer pool for which there is no free space on disk. Fourth,
the LFS design uses temporal locality to lay out data, which means that data written at
about the same time is stored in the same segment. As a result, read performance
can suffer if disk blocks are read in an order different from the one in which they were written.
Finally, the authors of LFS assumed that read traffic could be absorbed effectively by a large
cache. However, a later study does not seem to support this assumption [Blackwell et al.,
1995]. It is also doubtful that write traffic is more important than read traffic [Roselli et al.,
2000].

3.4 Other Proposals to Reduce Head Positioning Latencies

Although the design policies of FFS and LFS for reducing head positioning latencies have
proven valuable, the debate about which policy is superior still continues [Seltzer et al., 1995].
Though these policies can be adopted by other file systems, they demand significant
modification of the file system design. Nevertheless, other techniques exist to enhance
the performance of a disk file system. These techniques are design independent and thus can
be adopted by most file systems without modifying the design or source significantly.
This section discusses these techniques to understand their effectiveness and compatibility.

3.4.1 Adaptive Disk Layout Re-arrangement

Most components of an operating system adapt to the changes in system state in order
to efficiently utilise available resources by modifying their policy at runtime. The most
prominent among them is a scheduler. However, the disk file systems often operate in the
same way throughout their lifetime. Studies have shown that the file system access patterns
do not necessarily correspond to the usage patterns anticipated by the system’s designers.
Therefore, file systems should have some runtime component which will allow them to detect
and compensate for any poorly placed data blocks. Based on this observation, many adaptive
disk layout re-arrangement proposals have surfaced.
In similar techniques proposed by [Akyürek and Salem, 1995] and [Vongsathorn and Car-
son, 1990], the frequently referenced blocks are copied from their original locations to reserved
space near the middle of the disk. The basis of this idea is to reduce the average length of
the seek distances to the most frequently accessed blocks of the file system in order to deliver
performance. The arriving requests are monitored and the references are calculated at run
time. Therefore, no prior knowledge about the reference frequencies is required. This was
implemented by modifying the UNIX device driver and hence no modifications are required to
the file system that uses the driver. Furthermore, in one of these techniques it was shown,
using trace-driven simulations, that seek times can be cut substantially by copying only a small
number of blocks. In another similar technique, called disk shuffling [Ruemmler and Wilkes,
1991], it was shown that by moving the frequently accessed data to the center of a disk, the
mean seek distances can be substantially reduced. However, as the frequently accessed data
may change over time, this would require re-shuffling the layout every time the disk access
pattern changes, which makes these techniques less attractive.
To overcome this, [Huang et al., 2005] proposed dynamically placing copies of data in
the file system’s free blocks according to the disk access patterns observed at runtime. As
one or more replicas can now be accessed in addition to their original data block, choosing
the “nearest” replica that provides fastest access can significantly improve the performance
for disk I/O operations. As an example, if every data block is duplicated half a rotation and
half of the maximum seek distance away from its original location, both the seek and the
rotational delay can be reduced by 50%. However, the replication overhead incurred and the
file system modifications required make this approach less appealing.
The more recent proposal called I/O Deduplication [Koller and Rangaswami, 2010] utilizes
content similarity for improving I/O performance by eliminating I/O operations and reducing
the mechanical delays during I/O operations. However, as it is based on the assumption that
duplication of data in storage systems is becoming increasingly common, the hypothesis needs
further investigation.

3.4.2 Caching & Automatic Pre-fetching

The benefit of caching in general needs no introduction: for disk file systems it substantially
reduces the amount of file system I/O. However, caching only reduces the frequency of disk
accesses and only works well for data with moderate locality of reference. Even with caching,
file systems remain a bottleneck of overall system performance because of those disk accesses
that are not satisfied from the cache. Unfortunately, increasing the cache size beyond a certain
point results in only minor performance improvements. Experience shows that the relative
benefit of caching decreases as cache size (and thus cache cost) increases.
Similarly, the concept of pre-fetching which has been used in a variety of environments
including microprocessor designs, virtual memory paging, databases, and file read ahead can
be used to reduce the frequency of disk accesses. A simple, straightforward method of pre-
fetching is to have each application inform the operating system of its future requirements.
However, using this approach, applications must be rewritten to inform the operating system
of future file requirements. An approach called automatic pre-fetching was proposed in which
the operating system rather than the application predicts future file requirements [Griffioen
and Appleton, 1994]. Automatic pre-fetching takes a heuristic-based approach using knowl-
edge of past accesses to predict future access without user or application intervention. As a
result, the applications automatically receive reduced perceived latencies, better use of avail-
able bandwidth via batched file system requests, and improved cache utilization. However,
the parameters that significantly affect the predictions can hinder the performance of the file
system in an unanticipated workload. Moreover, depending upon the cache state, data nec-
essary to the current computation may be forced out of the cache and replaced by (useless)
data needed far in the future.

3.4.3 Exploiting Hybrid File System Designs

Every file system design has pros and cons. If the strengths of all designs are combined
into a single file system, the probability of obtaining a high performance file system is
higher. yFS [Zhang and Ghose, 2003] is one such file system, developed from scratch, that is
based on this concept. The goal of yFS is to reduce disk seeks during file accesses, and it achieves
this by integrating proven techniques and new ideas into a mature operating system.
The experimental results showed that yFS performs better than FFS (with Soft Updates) even
without a dedicated logging device. However, the yFS design is heavily influenced by FFS and as
such lacks the benefits of the LFS design. Moreover, this approach requires that all the "selected
good things" be compatible with each other. As an example, sub-blocking or tail
packing is efficient for storage but hinders I/O throughput. Similarly, either temporal
locality or logical locality can be achieved, but not both. Nevertheless, techniques like logging,
B+ trees, delayed allocation and so on can all be exploited in one design.
Similarly, a hybrid file system design is possible in which two or more file systems can
be merged such that they only do what they are good at doing. As an example, if a file
system f1 is good at handling small files and another file system f2 is good at handling
large files, then f3 can be a file system that incorporates f1 to handle small files and f2
to handle large files. The requirement of this approach is that the two designs should be
complementary so that their benefits are mutually exclusive. hFS [Zhang and Ghose, 2007] is
one such design that exploits the mutually exclusive capabilities of FFS and LFS file systems.
In hFS, one partition has LFS design to support small file read and write operations and file
system wide metadata while another partition has FFS design to support large file read and
write operations. hFS delivers equal or better performance when compared to FFS with Soft
Updates, however the partitioning between the log partition and the data partition in hFS is
static. Furthermore, yFS and hFS although promising, are still under research and have not
been benchmarked under realistic workloads.

3.4.4 Exploiting Hybrid Storage Devices

The performance problem of a disk file system can be solved if in addition to the magnetic disk,
a solid state storage device is used. As such, the hybrid system combines the advantages of the
solid state device’s inexpensive random seek with the inexpensive sequential access and large
storage capacity of the magnetic disk drive to produce significantly improved performance in
a variety of situations. Based on this idea many proposals have been made.
[Baker et al., 1992] proposed using a small amount of battery-backed RAM to act as a
small write buffer to reduce disk accesses. The motive was to prevent losing recent updates
to file caches without having to continuously write data back to the disks as soon as updates
occur. Similarly, [Miller et al., 2001] designed a system called HeRMES which uses a form of
non volatile RAM called Magnetic RAM (MRAM) to act as a persistent cache for a magnetic
disk drive. They used MRAM to cache the file system metadata and also buffer writes to the
magnetic disk drive. Furthermore, [Wang et al., 2006] proposed Conquest which uses a simple
partitioning approach in which they place all small files and metadata (e.g. directories and
file attributes) in NVRAM (Non-Volatile Random Access Memory) while the remaining large
files and their associated metadata are assigned to the magnetic disk drive. Though these
approaches significantly improve the performance of a file system, they demand some
file system source modification in addition to a physical overhaul of the machine to support
the new pieces of hardware.
Moreover, Microsoft Windows 7 utilises hybrid drives with an option known as ReadyBoost
[Matthews et al., 2008]. The idea is to use flash memory as a persistent buffer or cache to
absorb all read and write requests. This was primarily developed to improve the power usage
of laptop systems by allowing the magnetic disk drive to spin down during low workload times.
Similarly, [Soundararajan et al., 2010] proposed a hybrid solid state device and magnetic disk
drive system that accumulates a log of changes on the magnetic disk drive before writing
them in bulk onto the solid state device at a later time. Finally, [Fisher et al., 2012] proposed
to optimise the I/O performance of a system by using a large magnetic disk drive and limited
size solid state drive in tandem to store data. They proposed a drive assignment algorithm
which determines which device to place data on in order to take advantage of their desirable
characteristics while trying to overcome some of their undesirable characteristics. Though
these proposals can be added to any file system on the fly without any hardware overhaul,
they only use the other device as an auxiliary log until the updates are committed to
the actual device. Hence, this procedure demands housekeeping work to be done at some
later time.

3.5 hFAT: A High Performance Hybrid FAT32 File System

The FAT file system is the primary file system of various operating systems including DR-DOS,
FreeDOS, MS-DOS, OS/2 (v1.1) and Microsoft Windows (up to Windows Me). Being simple in
design, chronologically old and primarily supported by Microsoft OSes has made FAT the most
compatible and widely used file system; almost all operating systems support FAT file systems.
Furthermore, digital devices like mp3 players, digital cameras and so on all need to store and
process persistent data, and they exchange this data frequently with desktop computers. This
is only possible if the file system used in the device is supported by the PC's operating system,
and no file system fulfills this requirement as well as FAT file systems do. Similarly, among solid
state storage devices such as pen drives and removable magnetic disk drives which communicate
with PCs via the USB interface, FAT file systems are common for compatibility reasons. However,
because of the same simple and chronologically old design, FAT file systems face performance
challenges. Compared to other file systems, the performance of FAT is poor in situations where
a large number of small files are being processed; it also does not perform well for large files.
Logically, the design of the FAT file system is similar to that of the Old Unix file system (UFS). In
FAT file systems, the metadata necessary to locate the contents of every file and directory
is placed at the beginning of the volume. As a consequence, the seek distance between this
metadata and the actual contents of files and directories is large. Moreover, no effort is made
to place the contents of a file or directory at rotationally optimal positions. Furthermore,
the directory entries are arranged as an unordered linear list, so finding a particular
file within a directory requires a linear search with a complexity of O(n). Worse,
the metadata necessary to locate all the clusters belonging to a file (or directory) is scattered
in a long FAT Table. As a consequence, this table needs to be traversed from head to tail
until sufficient entries have been read to locate the contents of that file (or directory). Though
it is an update-in-place file system, the FAT file system does not achieve logical locality. And
although, depending upon the state of the file system, temporal locality may exist for userdata,
it cannot be exploited for performance.
Since the inception of the FAT file system in 1981, this performance problem has been present in
its design. Although UFS received many performance patches to address such problems, no
such effort was made for the FAT file system. The concepts of FFS and other related proven
techniques which enhanced the performance of UFS could be applied to FAT file systems;
however, this would create a high performance but incompatible version of the FAT file system,
which is not desirable for a file system whose main strength is its compatibility. Nevertheless, other
techniques which do not demand modification of the file system design, or which need only little
and affordable source modification, as described in section 3.4, are feasible. This section describes
the design, simulation and evaluation of a FAT file system, namely hFAT, which stores the most
frequently accessed metadata of all files on a flash drive while the actual contents remain on the
magnetic disk drive. Specifically, the idea is to store the contents of directories and the small,
most frequently accessed FAT Table of the FAT file system on the solid state drive to eliminate
the head positioning latencies incurred by file system operations in FAT file systems.
The rest of this section is organised as follows. Section 3.5.1 discusses the history of
FAT file systems. Section 3.5.2 discusses the performance problems in FAT32 file system.
Section 3.5.3 discusses the design of a high performance hybrid FAT32 file system called
hFAT file system. Section 3.5.4 presents the working of hFAT stackable device driver while
section 3.5.5 discusses the simulation of hFAT file system for performance evaluation. Section
3.5.6 discusses the experiment, section 3.5.7 discusses the results, section 3.5.8 presents the
conclusion and section 3.5.9 discusses the limitation and future scope of the work.

3.5.1 History of FAT File Systems

FAT is a proprietary computer file system originally developed by Bill Gates and Marc
McDonald in 1976/1977 for managing disks in Microsoft’s BASIC program. In 1980, IBM
approached Bill Gates to discuss the state of home computers and what Microsoft products
could do for IBM. Since Microsoft had never written an operating system before, Gates
suggested that IBM investigate an OS called CP/M (Control Program for Microcomputers),
written by Gary Kildall of Digital Research. Kildall had his Ph.D. in computers and had
written the most successful operating system of the time, selling over 600,000 copies of CP/M.
IBM contacted Gary Kildall but he refused. IBM soon returned to Bill Gates and gave
Microsoft the contract to write a new operating system, one that eventually wiped Gary
Kildall’s CP/M out of common use [Hamm and Greene, 2004].
In 1981, Microsoft bought QDOS from Tim Patterson of Seattle Computer Products to
develop MS-DOS 1.0 for IBM and IBM compatible PCs. QDOS was a 16-bit Quick and Dirty
clone of Gary Kildall's 8-bit CP/M. Tim Patterson had bought a CP/M manual and used
it as the basis to write his operating system in six weeks. QDOS was different enough from
CP/M to be considered legally a different product. In addition to other parts, QDOS had a
slightly different file system than CP/M which was based on the file system of Microsoft Disk
BASIC program that used an organisation method called FAT (File Allocation Table) to write
its files to floppy disks. In fact, FAT was the main difference between CP/M and MS-DOS
when it was released, and it was definitely the key to success of the new OS [Bellis, 2006].
The first FAT version, the FAT12 file system, was released with MS-DOS 1.0 in 1981. As it was
primarily designed for floppy disks, it had limited features. A hierarchical file structure (directories
and sub-directories) was added to FAT12 with the introduction of MS-DOS 2.0 in 1983,
when IBM released the PC-XT with a built-in hard disk. In 1984, MS-DOS 3.0 was released
along with IBM's new PC-AT with an updated FAT16 file system, which allowed much larger
file sizes and supported 1.2 MB 5.25" floppy disks. In January 1986, MS-DOS 3.2 introduced
the extended partition. In 1987 and 1988, the FAT16 file system received more enhancement patches
and was finally released with MS-DOS 4.0 and OS/2 1.1 [Kirps, 2008].
One of the major drawbacks of the FAT file system was its 8 character filename limit (plus
a 3 character extension). Filenames up to 255 characters long were made possible using VFAT (or
Virtual FAT), which stored the long filenames as meta-information visible to the user while
the OSes actually operated with 8 character filenames. VFAT was introduced with the more
user friendly Windows 95 in 1995. Although Microsoft had also designed a very efficient file
system, NTFS v1.0, in 1993, it was released only with the NT line of OSes, starting with
Windows NT 3.1.
In order to overcome the volume size limitation of the FAT16 file system, Microsoft decided
to implement a newer generation of FAT, known as FAT32, with cluster counts held in a 32
bit field, of which 28 bits are currently used. In theory, the FAT32 file system supports a total
of approximately 268,435,456 (2^28) clusters, allowing for drive sizes in the range of 8 TB with
32 KB clusters. However, the boot sector uses a 32 bit field that limits the volume size to
2^32 sectors (2 TB on a hard disk with 512 byte sectors). The maximum possible size for a
file on a FAT32 volume is 4 GB minus 1 byte. FAT32 was introduced with Windows 95 OSR2
[Tomov, 2006].
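The limits quoted above follow directly from the widths of the on-disk fields; the short calculation below simply reproduces them from the figures stated in the text.

# FAT32 limits derived from the field widths quoted above.
usable_entry_bits = 28                        # 28 of the 32 bits of a FAT entry are used
max_clusters = 2 ** usable_entry_bits
print(max_clusters)                           # 268435456 clusters

cluster_size = 32 * 1024                      # 32 KB clusters
print(max_clusters * cluster_size / 2**40)    # 8.0 -> roughly 8 TB in theory

total_sectors = 2 ** 32                       # 32 bit sector-count field in the boot sector
print(total_sectors * 512 / 2**40)            # 2.0 -> 2 TB with 512 byte sectors

print(2**32 - 1)                              # 4294967295 -> the 4 GB minus 1 byte file size limit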
Microsoft has continued its support of FAT file systems even though it has developed
other high performance file systems. It also designed a specialised FAT file system, namely
FATX, for the XBOX in 2002. Recently, the FAT32 file system was reworked to support large
volume sizes, large file sizes and file system reliability; this FAT file system, namely exFAT,
was released in 2006 with Windows Vista. In addition, KFAT [Kwon et al., 2005], TFAT
[Microsoft, b] and FATTY [Alei et al., 2007] are all reliability enhancements of the original
FAT file system design by Microsoft and other individual researchers. Also, standards such
as ECMA-107 and ISO/IEC 9293 for the FAT12 and FAT16 file systems (without long filename
support) have been drafted.

3.5.2 Performance Problems in FAT32 file system

The FAT file system design is very simple. Although there are many flavours of FAT file
systems, all of them share the same logical design. This section and the following sections refer
to the most compatible and widely used FAT file system, namely the FAT32 file system
[Microsoft, a].
The performance of the FAT32 file system as compared to other file systems is very poor.
Specifically, FAT32 file systems cannot handle a workload which comprises accesses to a large
number of small files. Some of the main reasons for this are as follows:

1. The complete metadata and userdata of a file (or directory), even in the best case, is
scattered across three different places, i.e. the directory entry, the FAT Table and the userdata
clusters. This allows head positioning latencies to creep in at various junctures of a file system
operation.

2. There is a large seek distance between the frequently accessed metadata store, i.e. FAT
Table, which is required to locate clusters associated with every file and directory, and
the contents of those files and directories. Furthermore, this distance is covered twice
as FAT Table is located at one extreme end of the volume as shown in Figure 3.4.

3. The FAT32 file system design makes no effort to place all clusters belonging to a file or
directory contiguously so as to avoid intra-file seeks. It also means that locating the
chain-of-clusters associated with the file or directory in the FAT Table may itself incur
head positioning latencies, as the entries may span multiple sectors.

Figure 3.4: Relationship between various on-disk structures of FAT32 file system (directory entries: file abc with first cluster 0x00000029, file def with first cluster 0x0000002A; FAT Table: entry 0x29 holds 0x0000002B, entries 0x2A and 0x2B hold the end-of-chain value 0x0FFFFFFF, entry 0x2C holds 0x00000000 marking a free cluster; clusters: 0x29 is the first and 0x2B the second and last cluster of file abc, 0x2A is the first and last cluster of file def, and 0x2C is a free cluster)

4. Similarly, no design effort is made to place the clusters belonging to all files and directories
residing in the same directory contiguously. This leads to inter-file latency while
accessing the contents of these directories and files, and again means that locating the
chain-of-clusters associated with each such directory and file in the FAT Table may incur
head positioning latencies.

5. In a workload where files are created and deleted frequently or their lengths are often
changed, the volume becomes increasingly fragmented over time. This exaggerates the
presence of head positioning latencies between directory entry and FAT Table, between
various entries of FAT Table and between data clusters allocated to a file or directory.

6. Moreover, FAT32 file system design neither achieves logical locality nor temporal locality,
gives no regard to rotational latency, uses simple data structures and so on.

3.5.3 Design of hFAT File System

The design of the FAT32 file system is logically similar to that of the Old Unix file system (UFS). As an
analogy, FAT32 keeps the FAT Table at the beginning of the volume just as UFS keeps the inode store
there, and in both file systems directory entries hold filenames. However, inodes only maintain
information regarding a particular file or directory, while the allocation/deallocation list is
maintained separately by UFS within the Superblock. In contrast, the FAT Table records both
the free and the allocated clusters within the volume along with the chain-of-clusters associated with
a particular file or directory, while directory entries, in addition to the filename, contain other
information such as attributes, size, the first cluster in the chain and so on [Heybruck, 2003]. So,
logically, the design of the FAT32 file system is a bit better than that of UFS. In spite of this, the FAT32 file
system suffers from a performance problem mainly because of the large seek distance between
the FAT Table and the clusters. Furthermore, the percentage of storage space occupied by the FAT Table
is very small compared to the clusters (32 bits per cluster). Therefore, as the size of the volume
increases, the maximum seek distance increases significantly. Furthermore, this seek distance
is covered twice, as the FAT Table is located at the beginning of the volume. Figure 3.5 shows
the impact of a large volume size on the maximum seek distance between the FAT Table and the clusters.

Figure 3.5: Effect of large volume size on seek distance between FAT Table & Clusters (example specification: 4 KB cluster size and 32 bits of FAT Table per cluster; when the size of the FAT32 volume is doubled, the maximum seek distance between the FAT Table and the clusters increases by more than two folds)

Also, because directory entries hold the actual key to locating the clusters allocated to a
file (or directory), by naming it and identifying the first cluster in its chain, access to this
metadata is mandatory. As the clusters allocated to a directory can themselves be scattered, latency is
inevitable. In a workload wherein a large number of small files, whose size is less than the cluster
size of the FAT32 volume, are accessed, there are more frequent accesses to directory entries than
to the FAT Table; in fact, all the requests can be satisfied from directory entries, although deletion
operations still require access to the FAT Table. In contrast, in a workload wherein large files (at least
2-3 clusters long) are accessed, there are more frequent requests to the FAT Table than to directory
entries. But in both cases, the latencies incurred during access to the FAT Table and directory
entries deteriorate the performance of the FAT32 file system.
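The dependence of every operation on the directory entry and on the FAT Table can be seen in the following minimal sketch of cluster-chain traversal; the in-memory dictionaries are an illustrative stand-in for the on-disk structures, with values mirroring Figure 3.4.

# Minimal sketch of locating a file's clusters in FAT32: the directory entry
# supplies the first cluster, and the FAT Table is then followed entry by
# entry until an end-of-chain marker is reached (values mirror Figure 3.4).
EOC_MIN = 0x0FFFFFF8        # FAT32 values >= this mark the end of a chain
FREE = 0x00000000

fat_table = {0x29: 0x0000002B, 0x2A: 0x0FFFFFFF, 0x2B: 0x0FFFFFFF, 0x2C: FREE}
directory = {"abc": 0x29, "def": 0x2A}      # name -> first cluster

def chain_of_clusters(name):
    """Follow the FAT chain starting at the file's first cluster."""
    cluster = directory[name]               # one access to the directory entry
    chain = []
    while True:
        chain.append(cluster)
        nxt = fat_table[cluster]            # one access to the FAT Table per cluster
        if nxt >= EOC_MIN:
            return chain
        cluster = nxt

print([hex(c) for c in chain_of_clusters("abc")])   # ['0x29', '0x2b']
print([hex(c) for c in chain_of_clusters("def")])   # ['0x2a']

Each iteration of the loop is one more FAT Table entry that must be fetched; when those entries lie in different sectors, each fetch can cost an additional head positioning latency.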

One possible solution would be to modify the FAT32 file system the same way UFS
was modified. In other words, re-engineering the FAT32 file system using the concepts of FFS
and C-FFS could yield significant performance gains. However, this is quite complex, as the FAT
Table cannot be placed near every file and directory. Furthermore, it would rigorously modify the
FAT32 file system design, and a FAT file system so modified would lose its identity. Moreover,
as mentioned earlier, FFS overcomes latency problems by making certain assumptions about
magnetic disks which no longer hold true.
Another possible solution to reduce these latencies would be to pre-cache and delay-write
the whole FAT Table. This is beneficial because the FAT Table is a central metadata store
that is needed by every individual file and directory residing on the volume. However, the
size of the FAT Table can impose serious limitations; as an example, for a 2 TB FAT32 volume
with an 8 KB cluster size, the cache memory required for the FAT Table is 1 GB. Furthermore, FAT32 is
not reliable and thus any unanticipated crash can result in data loss and possible file system
corruption.
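The 1 GB figure follows from the 32 bit entry width; a quick check:

# Size of the FAT Table that would need to be cached for a 2 TB FAT32
# volume with 8 KB clusters (4 bytes of FAT Table per cluster).
volume_bytes = 2 * 2**40        # 2 TB
cluster_bytes = 8 * 1024        # 8 KB clusters
fat_entry_bytes = 4             # 32 bits per cluster

clusters = volume_bytes // cluster_bytes
fat_table_bytes = clusters * fat_entry_bytes
print(clusters)                       # 268435456 clusters
print(fat_table_bytes / 2**30)        # 1.0 -> 1 GB of FAT Table to cache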
In contrast, directory clusters can be placed near the contents of files and directories they
name by modifying the source of FAT32 file system. However, the algorithmic complexity
involved and assumptions to be made make it less appealing.
hFAT is a high performance hybrid FAT32 file system design that overcomes the perfor-
mance problems faced by plain vanilla FAT32 file systems. The hFAT file system intends to
exploit the advantages of the solid state device’s small and flat random access time with the
large sequential access speed and storage capacity of the magnetic disk drive to improve the
performance of FAT32 file system. In other words, hFAT intends to distribute the workload of
FAT32 file system between solid state and magnetic storage devices by placing the small hot
zone of the file system i.e. FAT Table and directory entries, on the solid state storage device
while the large cold zone is placed on the magnetic disk drive. Although proposals exist to use
hybrid storage for performance gains, all of them use the solid state storage device either as a
cache or as an auxiliary log until the changes are committed to the actual storage; in addition,
they demand design and/or source modification. In contrast, hFAT proposes using the
solid state storage device as a persistent store for the most frequently used FAT Table and
directory entries of a FAT32 file system. The motive of this design is to satisfy all accesses
to the FAT Table and directory entries from a small, low latency solid state storage device
without modifying the design or the source. It therefore totally eliminates the head positioning
latencies incurred during access to the FAT Table and directory entries. Figure 3.6 shows the
logical layout of the hFAT file system.

Figure 3.6: Logical on-disk layout of hFAT file system (the Boot Sector, Reserved Sectors, FAT Copy 1, FAT Copy 2 and the Directory Clusters are stored on the solid state storage device, while the File Clusters are stored on the magnetic disk storage device)

In order to accomplish this, hFAT has two choices; 1) modify the source of FAT32 file
system, or 2) modify the block device driver. In fact, modifying the source of either FAT32
file system or block device driver has several problems and thus is not feasible. However,
adding a layer of abstraction just below the FAT32 file system and above the block device
driver can do the job. The technique called driver stacking enables one driver to be stacked
on top of another driver just like file system stacking, in which one file system module can be
stacked on top of another file system module. File system stacking is not feasible here because it is
the requests made by the FAT32 file system that must be pre-processed, rather than the
requests made to the FAT32 file system. The stacked block device driver in hFAT acts like a fan-out
file system by overlaying two block device drivers: one responsible for I/O with
the solid state storage drive and another responsible for I/O with the magnetic disk drive. The
stacked driver forwards each request made by the FAT32 file system to the appropriate block
device driver and returns the results to the upper layers.

3.5.4 Working of hFAT Stackable Device Driver

When a volume is formatted with the FAT32 file system with hFAT slipped in, all the read
requests are satisfied from the magnetic disk. This is done so as to allow the FAT32 file system code to set
appropriate values for the size of the FAT Table, the cluster size and so on. However, all the write
requests at this stage are for metadata, and hence hFAT redirects them to the solid state storage device to zero-
fill the FAT Table, initialise the BPB and the root directory, and so on. In addition, hFAT reserves some
space on the solid state storage device to act as a log that maintains the list of clusters allocated
to directories and their corresponding mapping onto the solid state storage device. This mapping
serves two functions: 1) it allows hFAT to utilise the space on the solid state storage device effectively,
and 2) it answers the question of how the stackable device driver identifies whether a
request is for data clusters or for metadata clusters. In the case of the FAT Table this is quite simple, as there
is a clear boundary between the FAT Table and the rest of the FAT32 volume. However, identifying
requests for clusters containing directory entries is difficult, as directory clusters may be
scattered among the data clusters of files. To overcome this, hFAT uses the log. The
log is populated by hFAT in the following situations: 1) When a newly created FAT32 volume
is mounted for the first time along with hFAT, the first request made to hFAT that crosses the
FAT Table boundary pertains to the root directory. Thus, hFAT obtains the first cluster number of
the root directory and can locate the rest of its clusters from the FAT Table. As further requests
arrive for any cluster of the root directory, hFAT can analyse the contents to identify the
allocation or deallocation of any new subdirectory; if hFAT finds one, it records its cluster
number(s) in the log along with the mapping onto the solid state storage device. 2) To further validate
that a cluster is a directory cluster, hFAT can analyse the contents of that cluster for directory
entry signatures. This procedure applies not only to the root directory but to all clusters
that have an entry in the log. Figure 3.7 shows the logical design of the hFAT file system
with a driver stacked on top of two block device drivers.

Figure 3.7: Design of hFAT file system using driver stacking (file system operations pass through the VFS and VFAT layers to the hFAT stackable device driver, which directs accesses to the FAT Table and Directory Clusters to the solid state storage device driver and accesses to File Clusters to the magnetic disk device driver)
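A minimal sketch of the routing decision made by the stacked driver is given below; the request representation, the block-number boundary, the directory-cluster log and the signature check are simplified assumptions made for illustration and are not the actual driver interface.

# Minimal sketch of the hFAT routing decision: requests that fall inside the
# FAT Table region, or that target a cluster recorded in the directory-cluster
# log, go to the solid state device; everything else goes to the magnetic disk.
# The boundary, block numbers and log contents below are illustrative assumptions.
FAT_REGION_END = 4096            # assumed last block number of the FAT Table region
directory_log = {4100, 5321}     # assumed block numbers known to hold directory entries

def route(block_num):
    """Choose the lower driver that should service a request for block_num."""
    if block_num <= FAT_REGION_END or block_num in directory_log:
        return "ssd_driver"
    return "hdd_driver"

def looks_like_directory(cluster_bytes):
    """Crude directory-entry signature check (illustration only): FAT32
    directory entries are 32 bytes long and the attribute byte at offset 11
    has its two uppermost bits reserved (zero)."""
    if not cluster_bytes or len(cluster_bytes) % 32 != 0:
        return False
    entries = [cluster_bytes[i:i + 32] for i in range(0, len(cluster_bytes), 32)]
    return all((e[11] & 0xC0) == 0 for e in entries)

print(route(100))                        # ssd_driver  (FAT Table region)
print(route(4100))                       # ssd_driver  (known directory cluster)
print(route(9000))                       # hdd_driver  (file data cluster)
print(looks_like_directory(bytes(64)))   # True: an all-zero (empty) directory cluster passes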

One may argue that this adds performance overhead to the working of the FAT32 file
system. However, because most of the processing is simple and does not require any I/O to the magnetic
disk, the performance overhead is expected to be low and will be amortised by faster hardware.
Furthermore, it is clear that the space originally meant for the FAT Table and directory clusters
on the magnetic disk is lost. However, in the general scenario, the space lost is less than 1% of the total
space of the magnetic disk and is thus affordable. Moreover, hFAT requires that the
FAT32 file system be freshly created.

3.5.5 Simulation of hFAT Stackable Device Driver

Device driver stacking is possible in Linux, but unfortunately there is no well defined interface
to perform it. This can lead to many performance and technical problems during the operation
of the stacked driver. In contrast, the Windows Driver Model introduced in Windows 98 and 2000, and
the File System Filter Manager introduced in Windows XP Service Pack 2, provide clean, flexible
and efficient driver stacking support. However, Windows OSes have their own limitations.
Nevertheless, the behaviour of the hFAT stackable device driver can be simulated because of the
minimal functionality embedded in it. The hFAT stackable device driver has to perform only one
main task: forward each request of the upper layer (i.e. the FAT32 file system) to one of the two
underlying device drivers depending upon the type of request. This behaviour can be simulated
if a FAT32 file system block trace of some workload or benchmark is fed to the simulator along
with the information necessary to qualify each request as metadata or userdata. Algorithm
3.1 shows the simplicity of the algorithm used during the simulation of the hFAT stackable device
driver.

Algorithm 3.1 Algorithm used during simulation of hFAT stackable device driver
Input: blockTrace, Information
Output: ssdBlockRead, ssdBlockWrite, hddBlockRead, hddBlockWrite
1: while blockTrace contains more blocks do
2:   if blockTrace.blkNum is metadata then
3:     if blockTrace.blkNum is read then
4:       ssdBlockRead ← ssdBlockRead + 1
5:     else
6:       ssdBlockWrite ← ssdBlockWrite + 1
7:     end if
8:   else
9:     if blockTrace.blkNum is read then
10:      hddBlockRead ← hddBlockRead + 1
11:    else
12:      hddBlockWrite ← hddBlockWrite + 1
13:    end if
14:  end if
15: end while
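The simulator itself was a bash script; the sketch below restates the counting logic of Algorithm 3.1 in Python for clarity. The trace representation (a list of (block number, is_read) pairs plus a set of metadata block numbers) is an assumption of this illustration, not the format used by the original script.

# Illustrative restatement of Algorithm 3.1: count how many block reads and
# writes of a FAT32 trace would be absorbed by the solid state device
# (metadata blocks) versus the magnetic disk (userdata blocks).
def simulate(block_trace, metadata_blocks):
    counts = {"ssd_read": 0, "ssd_write": 0, "hdd_read": 0, "hdd_write": 0}
    for blk_num, is_read in block_trace:
        device = "ssd" if blk_num in metadata_blocks else "hdd"
        op = "read" if is_read else "write"
        counts[f"{device}_{op}"] += 1
    return counts

# Example: three metadata accesses and one userdata write.
trace = [(12, True), (12, False), (4100, True), (9000, False)]
print(simulate(trace, metadata_blocks={12, 4100}))
# -> {'ssd_read': 2, 'ssd_write': 1, 'hdd_read': 0, 'hdd_write': 1}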

As the motive of this simulation is to identify the number of requests satisfied from the solid
state storage device and from the magnetic disk drive, we provide both the block trace and the necessary
information. Also, as hFAT makes no changes to the FAT32 file system design and works below the
level at which the FAT32 file system operates, efficiency is achieved if hFAT sends more
requests to the solid state storage device than to the magnetic disk drive. Indeed, this information
is reported by our simulator. However, there is a need to further evaluate the efficiency
of hFAT by assigning each block access to the solid state device and to the magnetic disk drive some
latency value, in order to identify the reduction in the latency of operations.

3.5.6 Experiment

In order to illuminate the specific operations which are improved by hFAT, the FAT32 file system
was exercised using the Sprite LFS small-file microbenchmark to obtain a block trace. However, instead
of writing, reading and deleting 1 KB files, the benchmark wrote, read and deleted 4 KB
files. This is in accordance with the repeatedly reported observation that file system workloads
are dominated by accesses to small files, typically 4 KB or less in size [Agrawal et al., 2007].
The Sprite LFS small-file microbenchmark was implemented as a simple shell script
using bash version 4.1.7(1). The caches were flushed after every phase of the Sprite LFS
small-file benchmark. The block trace was captured via the /proc/sys/vm/block_dump interface
of the Linux kernel, which logs only those blocks whose request was not satisfied by the buffer cache.
However, the size of the Linux ring buffer was changed from the default 4096 bytes to 8192 bytes
to avoid loss of any block trace due to overflow. Furthermore, because of the simplicity of
the simulation, the simulator was implemented in-house as a shell script using bash version
4.1.7(1).
The experiment was conducted on an Intel based PC with an Intel Core i3 550 CPU @
3.20 GHz, a total of 4 MB cache and 2 GB of DDR3 1333 MHz SDRAM. The hard drive
used was a 7200 rpm 320 GB SATA drive with an on-board cache of 8 MB. The drive was
partitioned into a 20 GB primary partition to host the Fedora Core 14 operating system (kernel
version 2.6.35.14-95.fc.14.i686) and another 5 GB partition to mount the FAT32 file system.
The partition was large enough to hold all the files and small enough to fit in one zone
of the disk. During the execution of the Sprite LFS small-file microbenchmark, Linux was set
to run at run-level 1 to reduce the random effects of other applications and daemons. Also,
the experiment was repeated at least 5 times and all the results were averaged; the standard
deviation was less than 3 % of the average in all cases.

3.5.7 Results & Discussion

The results of the simulation are shown in Tables 3.2 and 3.3. Table 3.2 shows the number of
block reads satisfied from the non-HDD device and from the HDD device for each operation of
the Sprite LFS small-file microbenchmark. Similarly, Table 3.3 reports the same statistics for
blocks written. Moreover, in each category, the percentage of block accesses absorbed by the
non-HDD device is calculated. There are several things worth noticing here.

Table 3.2: hFAT simulation report showing distribution of read blocks

Operation                   Blocks Read (non-HDD)   Blocks Read (HDD)   Access %age of non-HDD
Create 10,000 4 KB Files    1380                    0                   100 %
Read 10,000 4 KB Files      643                     10,000              6.04 %
Delete 10,000 4 KB Files    721                     0                   100 %

Table 3.3: hFAT simulation report showing distribution of written blocks

Operation                   Blocks Written (non-HDD)   Blocks Written (HDD)   Access %age of non-HDD
Create 10,000 4 KB Files    2063                       10,000                 17.10 %
Read 10,000 4 KB Files      626                        0                      100 %
Delete 10,000 4 KB Files    785                        0                      100 %

First, in the case of blocks being read, in two phases 100 % of the block reads are satisfied from
the non-HDD device, while in one phase the percentage is as low as 6.04 %. This means that 66
% of the operations of this workload can fully exploit the benefits of the low and flat
latency of the non-HDD device for reading.
Second, in the case of blocks being written, in two phases 100 % of the block writes are satisfied
from the non-HDD device, while in one phase the percentage is as low as 17.10 %. This means
that 66 % of the operations of this workload can fully exploit the benefits of the low
and flat latency of the non-HDD device for writing.
Third, the phases in which block reads or writes are not 100 % satisfied from the non-HDD
device correspond to reading and writing the 10,000 4 KB files. This is expected, as in these
two phases the ratio of metadata to userdata is very low. However, these cases are mutually
exclusive, and within each phase such an operation benefits from the other operation of the
phase that exploits the 100 % benefits of the non-HDD device. As an example, in the first phase, wherein
10,000 4 KB files are created, the percentage of blocks read from the non-HDD device is 100 %
while the percentage of blocks written to the non-HDD device is 17.10 % (expected, as the ratio of
metadata to userdata to be written is low). However, the 100 % block reads from the non-HDD
device augment the 17.10 % block writes in two ways: 1) the reads within the operations experience
the low and flat latency of the non-HDD device, and 2) the repositioning of the R/W head of the HDD device
during such reads is eliminated, which would otherwise have created latency for the blocks to be
written. Similar benefits are exploited in the second phase.
Finally, in the third phase 100 % of the block reads and writes are satisfied from the non-HDD device.
This is expected, as this phase only deals with metadata.
Furthermore, we evaluated the performance of hFAT by assigning various access latencies
to the non-HDD device. To evaluate over a range of access latencies, we assigned weights in
terms of a percentage of the access latency of the HDD device, with a granularity of 1 %. This means
100 possible latency values were assigned to the non-HDD device, where each value corresponds
to some percentage of the latency of the HDD. As an example, if the HDD device has an access latency of 200
ms, then the latencies assigned to the non-HDD device ranged from 2 ms to 200 ms with
a step size of 2 ms.
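The per-phase curves of Figure 3.8 can be reproduced by weighting the block counts of Tables 3.2 and 3.3 with these latency values; the sketch below shows the calculation for the Create phase, under the simple weighting described in the next paragraph (one average access latency per block, identical for reads and writes).

# Sketch of the latency evaluation: for an assumed non-HDD access latency,
# expressed as a percentage of the HDD access latency, compute the phase
# latency relative to an HDD-only configuration. Block counts are taken from
# Tables 3.2 and 3.3; every block is charged one average access latency.
def relative_latency(non_hdd_blocks, hdd_blocks, non_hdd_pct):
    """Phase latency as a percentage of the all-HDD latency."""
    hybrid = non_hdd_blocks * (non_hdd_pct / 100.0) + hdd_blocks
    return 100.0 * hybrid / (non_hdd_blocks + hdd_blocks)

# Create phase: 1380 reads + 2063 writes on the non-HDD device, 10,000 writes on the HDD.
create_non_hdd = 1380 + 2063
create_hdd = 10000
for pct in (2, 10, 50, 100):
    print(pct, round(relative_latency(create_non_hdd, create_hdd, pct), 1))
# At 10 % of the HDD latency the Create phase costs roughly 77 % of the
# HDD-only latency under this equal-weight model.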
The read/write latency is not symmetric in either a solid state storage device or a magnetic
disk drive. Although less expensive solid state storage devices typically have write speeds
significantly lower than their read speeds, high performance devices have similar read and
write speeds. Similarly, HDDs generally have slightly lower write speeds than read
speeds. Nevertheless, the same latency value was assigned to both types of operation on each
type of device. Furthermore, the average access latency of the HDD was assigned to each block
accessed on the HDD. Unfortunately, this means that the benefit gained by eliminating repositioning
of the head, which reduces the inter-userdata-block latency, cannot be captured. The evaluation
therefore yields an upper bound on the latency incurred by the operations, and the actual latency
can be expected to be lower. Figure 3.8 plots, for the three phases of the benchmark, the latency
incurred by the operations (as a percentage of the latency when only the HDD is used, on the y-axis)
against the average access latency of the non-HDD device (as a percentage of the access latency of
the HDD device, on the x-axis).
This graph indicates that phases with a steep slope are highly affected by the latency of the HDD device, while the others are less affected. In the figure, the phase corresponding to the deletion of files is the most affected by the HDD latency, followed by the phase that creates the files and then the phase that reads them.


[Figure 3.8 plot: the x-axis shows the average access latency of the non-HDD device as a percentage of the HDD access latency (0-100 %); the y-axis shows the latency incurred by operations as a percentage of the operation latency when only the HDD is used (0-100 %); one curve each is plotted for creating, reading and deleting 10,000 4 KB files.]

Figure 3.8: Effect of various non-HDD access latencies on the total latency of hFAT operations

The graph also shows that, using hFAT with a non-HDD device whose access latency is 10 % of that of the HDD device, the latency of write operations can be reduced by a minimum of 25 %, of read operations by a minimum of 10 %, and of delete operations by 90 %. It is to be noted that this reduction is a lower bound and can be expected to increase, because the average access time of the HDD was charged to every HDD block access.
Furthermore, the typical access latency of a non-HDD device is about 0.1 ms, while for HDDs the average access time ranges between 5 and 10 ms. The ratio of 0.1 ms to 5 ms corresponds to 2 %, while the ratio of 0.1 ms to 10 ms corresponds to 1 %. Both ratios are well below the 10 % figure considered above and thus fit our observation.

3.5.8 Conclusion

We can conclude that using hybrid storage to enhance the performance of disk file systems is valuable. The most frequently accessed structures of a file system constitute a small amount of file-system-wide metadata. Storing this metadata on a solid state storage device not only eliminates the seeks and rotational delays incurred during its access but also significantly reduces the total latency incurred by operations.
Based on this idea, we designed and simulated the behaviour of a high performance hybrid
FAT32 file system called hFAT. hFAT uses a solid state storage device to hold the most


frequently used metadata of FAT file systems, namely the FAT Table and directory clusters. The FAT Table accounts for a small fraction of the file system space but is used by every operation that reads, writes or deletes files; however, being located at the beginning of the volume, it creates a performance bottleneck in overall file system performance. Furthermore, directory clusters must necessarily be accessed during file system operations and are scattered around the volume of the disk. hFAT places these structures on a solid state storage device in order to eliminate the seeks and rotational delays incurred by operations during their access. However, in order to keep the design of the FAT32 file system intact, we propose that hFAT be slipped in as a stackable device driver. Furthermore, we simulated the behaviour of hFAT as a stackable device driver and evaluated the performance gains by running the block trace collected by exercising a FAT32 file system using the Sprite LFS small-file benchmark. The results indicate that hFAT can reduce the latency incurred by FAT32 file system operations by a minimum of 25 %, 10 % and 90 % during writing, reading and deleting a large number of small files respectively, if a solid state storage device having latency less than or equal to 10 % of that of the magnetic disk is used in addition.

3.5.9 Limitations & Future Scope

hFAT has currently been simulated and evaluated using the Sprite LFS small-file benchmark; however, it needs to be evaluated against other microbenchmarks and macrobenchmarks. Also, hFAT needs to be implemented to further validate the claims made in simulation and to analyse the effect of the self-evolving nature of hFAT in recognising metadata and userdata cluster requests. Furthermore, solid state storage devices have a limited number of write cycles; however, regular backups can be made automatically by the hFAT driver to avoid data loss.
Nevertheless, the idea behind hFAT is valuable and can be extended to other file systems. The strong point of this technique is that it does not require any design or source modification of file systems. The hFAT implementation in Linux can be investigated by developing a clean interface for driver stacking. Furthermore, the self-evolving feature of the hFAT module can also be optimised and generalised.

Chapter 4

File System Extensibility


4.1 Introduction

File systems are the source and sink of persistent information at the lowest level within an OS. The significance of file systems has made them crucial, large and complex, and they represent one of the most frequently exercised parts of an OS. Like other OS parts, file systems also need to evolve. This is mainly because, as hardware becomes more advanced and affordable, digital technologies penetrate further into every aspect of our life. This creates voluminous amounts of digital data; data which needs to be managed efficiently, reliably, securely, and so on. Therefore, file systems require many changes to support new features to meet such challenges. However, enhancing a file system by modifying and recompiling its source is a challenging task. First, this approach requires the developer to understand and deal with complicated kernel data structures. Thus, a deep understanding of OS (kernel) internals is required even to make a small change in the existing code or to add some new code. The situation is worse than it seems, as OSes vary greatly in their kernel architectures and even similar architectures vary in major aspects of file system management. All this leads to a time consuming and exhausting effort by a developer to understand the underlying OS before getting his hands on the actual file system [Zadok, 2003]. Furthermore, such developers constitute a small fraction of programmers.
Second, even if this is accomplished successfully, the effort yields little gratification. As an example, suppose we want to add a simple encryption extension to the ext2 file system in which 'a' is replaced by 'b', 'b' by 'c' and so on. The C code for such a cipher is approximately 2-3 lines long (a sketch is shown after the list below), and at minimum all that is needed is to modify those ext2 file system routines that write and read data to and from the storage subsystem. So, the obvious approach that tempts the developer would be:

1. Study the design and implementation of ext2 file system,

2. Copy 5,000+ lines of code for ext2 and identify various routines that need to be modified,
and

3. Add the cipher code, recompile the file system and distribute it.
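
For illustration, the cipher itself really is only a few lines of C; the snippet below is a standalone sketch and not code taken from ext2.

    #include <stddef.h>

    /* Shift every lowercase letter forward by one: 'a'->'b', ..., 'z'->'a'. */
    static void rot1(char *buf, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            if (buf[i] >= 'a' && buf[i] <= 'z')
                buf[i] = 'a' + (buf[i] - 'a' + 1) % 26;
    }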

This approach has several limitations. First, the developer will spend a lot of time studying a specific file system only to add 2-3 lines of code. Second, the developer has now only added an encryption extension to a single file system; what if an encryption extension for the FAT file system is now required? Third, all this is possible only if the developer has the source code at his

disposal. Finally, there is a good probability that the design of the file system in question may need a significant change to accommodate such a small addition.
Third, even if this is all that is required and it is done successfully, the modified file system needs to be tested for stability and reliability, the same way it was tested 10-20 years ago. This is because the kernel is a complex environment to master, and small mistakes (which would have no significance in user-level programs) can cause severe data corruption. As such, it takes years of effort to make a file system stable, and once it is stable and working, it is not a good idea to break it by throwing in new features. Besides, OS vendors and maintainers of file systems are reluctant to accept feature enhancement patches to their stable file systems, in order to maintain OS stability and avoid compromising file system reliability. Lastly, a fully functional in-kernel file system is difficult to port, as it is tightly coupled with the kernel.
Although extending file system functionality in an incremental fashion is necessary and valuable, for the aforementioned reasons enhancement patches should not be incorporated into the file system code; rather, they should be layered on top of the existing file system as a module [Zadok et al., 1999]. Because of that, firstly, the reliability and stability of an existing file system are not compromised, as it is not touched directly. Secondly, the enhancement layer can be used for multiple file systems without requiring an understanding of the design and source of each individual file system; hence, development time is reduced and the portability of the enhancement layer is increased. Thirdly, the modular approach increases the ease of debugging by reducing the domain of bug induction. Fourthly, enhancements can be made available to proprietary file systems whose source is not available. Lastly, this makes it possible for third-party developers to release a file system improvement without deploying a whole modified file system.
Layering has been widely used to add new functionalities rapidly and portably to existing
file systems. Several OSes have been designed to support layered file systems, including So-
laris, FreeBSD and Windows. Several layered file systems are available for Linux, even though
it was not originally designed to support them. Many users use layered file systems un-
knowingly as part of Anti-virus solutions [Symantec, 2004] and Windows XP’s system restore
feature [Harder, 2001]. The key advantage of layered file systems is that they can change the
functionality of a commodity OS at runtime; so hard-to-develop lower-level file systems do
not need to be changed.
The rest of this chapter is organised as follows. Section 4.2 discusses the various stacking
models for file system layering and their support in popular OSes. It also discusses various


alternatives to file system layering to extend file systems. Section 4.3 discusses the classi-
fication of layered file systems based on their application. Finally, section 4.4 introduces a
reliable and efficient vnode stackable file system called restFS to perform secure data deletion.

4.2 File System Layering: Stacking Models, Support in OSes, and Alternatives

4.2.1 vnode Stacking Models

Earlier, there was no distinct boundary between the file system implementation and the rest of the monolithic kernel; system calls directly invoked file system methods. In 1986, [Kleiman, 1986] proposed an architecture for accommodating multiple file systems within the SUN UNIX kernel. The idea was to split the kernel's functionality into a file-system-independent part and a file-system-dependent part, with a well defined interface between the two. To accomplish this, Kleiman introduced the virtual node, or vnode, which provides a layer of abstraction separating the core of an OS from the file systems. In this architecture, each file is represented in memory by a vnode. A vnode has an operations vector that defines the operations an OS can call, thereby allowing the OS to add and remove many types of file systems at runtime. This architecture eventually matured into the VFS of UNIX-like and UNIX-based OSes. Other OSes use something similar to the vnode interface.
A Layered, or Stackable, file system creates a vnode with its own operations vector to
be interposed on another vnode. Each time one of the layered file system’s operations is
invoked, the layered file system maps its own vnode to a lower-level vnode, and then calls
the lower-level vnode’s operation. To add functionality, the layered file system can perform
additional operations before or after the lower-level operation.
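The following self-contained C program mimics this interposition in user space; the structure and function names (toy_vnode, toy_vnode_ops and so on) are invented purely for illustration and do not correspond to any real VFS.

    #include <stdio.h>
    #include <string.h>

    /* A toy "operations vector": each layer provides its own read operation. */
    struct toy_vnode;
    struct toy_vnode_ops { int (*read)(struct toy_vnode *vn, char *buf, int len); };
    struct toy_vnode {
        const struct toy_vnode_ops *ops;
        struct toy_vnode *lower;     /* the vnode this layer is stacked on, if any */
        const char *data;            /* backing data for the lowest layer          */
    };

    /* Lowest-level file system: just copies its backing data. */
    static int lower_read(struct toy_vnode *vn, char *buf, int len)
    {
        strncpy(buf, vn->data, len - 1);
        buf[len - 1] = '\0';
        return 0;
    }

    /* Layered file system: forwards to the lower vnode, then post-processes
     * the result (here, uppercasing it), the essence of linear stacking.    */
    static int layer_read(struct toy_vnode *vn, char *buf, int len)
    {
        int err = vn->lower->ops->read(vn->lower, buf, len);
        for (int i = 0; !err && buf[i]; i++)
            if (buf[i] >= 'a' && buf[i] <= 'z')
                buf[i] -= 'a' - 'A';
        return err;
    }

    static const struct toy_vnode_ops lower_ops = { lower_read };
    static const struct toy_vnode_ops layer_ops = { layer_read };

    int main(void)
    {
        char buf[32];
        struct toy_vnode lower = { &lower_ops, NULL, "hello from the lower layer" };
        struct toy_vnode upper = { &layer_ops, &lower, NULL };
        upper.ops->read(&upper, buf, sizeof(buf));   /* call via the top vnode */
        puts(buf);
        return 0;
    }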

4.2.1.1 Rosenthal’s Stacking Model

Rosenthal was one of the pioneers of vnode stacking as a means of extending file systems. In 1990, Rosenthal identified a steady growth in the size of the vnode and in the number of operations in the operations vector, an evolution devoted to adding new file system functionalities and accommodating future enhancements. In order to reduce the performance degradation implicit in this evolution and to support easy implementation of new functionalities, Rosenthal proposed vnode stacking [Rosenthal, 1992]. Rosenthal identified two requirements in the design of the VFS interface to support stacking. The first is interposition, in which
a higher-level vnode is called before the lower-level vnode, and can modify the lower-level
vnode’s arguments, operation, and results as shown in Figure 4.1(a). This is commonly
called linear layering or linear stacking. The second is composition in which a higher-level
vnode performs operations on several lower-level vnodes as shown in Figure 4.1(b). This is
commonly called fan-out. Unfortunately, Rosenthal’s model does not support fan-in access,
in which a user process accesses the lower-level file system directly as shown in Figure 4.1(c).
Rosenthal modified the VFS of SunOS to support vnode stacking [Rosenthal, 1990]. The
modifications were made in the vnode structure by replacing its public fields with methods
so that layered file systems could intercept them. Also, two new pointer fields were added to
the structure: v_top and v_above. The v_above pointer points to the vnode that is directly
above this one in the stack. The v_top pointer points to the highest level vnode in the stack.
All vnode operations go through the highest-level vnode. Hence, every vnode in the stack
essentially becomes an alias to the highest-level vnode. Rosenthal also suggested several
layered file system building blocks, but did not implement any of them for his prototype.
Later, a vnode interface for SunOS based on Rosenthal’s interposition and several example
file systems were developed [Skinner and Wong, 1993].

[Figure 4.1 diagram: in (a) a system call enters vnode A, which is stacked linearly on vnode B; in (b) vnode A fans out to vnodes B and C; in (c) a system call may reach vnode B either through vnode A or directly (fan-in).]

Figure 4.1: File system layering types; a) Linear b) Fan-out and c) Fan-in


4.2.1.2 UCLA Stacking Model

Heidemann and Popek, researchers at UCLA during the 1990s, also developed an infrastructure for layered file system development. The UCLA model [Heidemann and Popek, 1991, 1994] identified fan-in stacking in addition to linear and fan-out stacking. Fan-in access is necessary when applications need to access unmodified data; for example, a backup program should write the encrypted data (stored on the lower-level file system) to tape, not the unencrypted data (accessible through the top layer). Besides, in the UCLA model a new file system layer may add new operations, and existing file system layers can adapt to support these operations. If a layer does not recognise an encountered operation, the operation can be forwarded to the lower layer in the stack for processing; if the lowest-level file system does not support a given operation, a simple value (say, "operation not supported") may be returned. To support this, the UCLA model suggests inter-layer interface symmetry and extensibility. As such, in the UCLA model the vnode does not have a fixed set of operations; instead, each file system layer provides its own set of operations, and the total set of operations is the union of all file systems' operations. Also, to support operations with arbitrary arguments, the UCLA interface packs all of an operation's arguments into a single argument structure (see the sketch below).
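The following self-contained C sketch illustrates the packed-argument and bypass ideas; all names are invented for illustration and are not the UCLA interface itself.

    #include <stdio.h>

    /* All arguments of an operation are packed into one tagged structure. */
    enum toy_op { TOY_READ, TOY_WRITE, TOY_FROBNICATE /* unknown to old layers */ };

    struct toy_args {
        enum toy_op op;
        const char *path;
        char       *buf;
        int         len;
    };

    /* Lowest layer: understands only read and write. */
    static int lowest_call(struct toy_args *a)
    {
        switch (a->op) {
        case TOY_READ:  printf("lowest: read  %s\n", a->path); return 0;
        case TOY_WRITE: printf("lowest: write %s\n", a->path); return 0;
        default:        return -1;   /* "operation not supported" */
        }
    }

    /* Middle layer: only interested in writes; everything else is bypassed
     * unchanged to the layer below, which keeps the interface extensible.  */
    static int middle_call(struct toy_args *a)
    {
        if (a->op == TOY_WRITE)
            printf("middle: logging write to %s\n", a->path);
        return lowest_call(a);       /* generic bypass of the packed arguments */
    }

    int main(void)
    {
        struct toy_args r = { TOY_READ,       "/tmp/x", NULL, 0 };
        struct toy_args f = { TOY_FROBNICATE, "/tmp/x", NULL, 0 };
        int rr = middle_call(&r);    /* bypassed to the lowest layer             */
        int rf = middle_call(&f);    /* unknown everywhere: "not supported" (-1) */
        printf("read -> %d, frobnicate -> %d\n", rr, rf);
        return 0;
    }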
The UCLA model emphasized that a file system should be developed as a composition of lightweight layers. To achieve this, the UCLA model suggests that each separate service should be its own layer. For example, UFS should be divided into at least three different layers: 1) a layer managing disk partitions, 2) a file storage layer providing arbitrary-length files referenced by a unique identifier (i.e., inode number), and 3) a hierarchical directory component. Breaking up a file system in this way allows interposition at several points. The Ficus replicated file system and several course projects (e.g., encryption and compression layers) were developed using the UCLA model [Guy et al., 1990].

4.2.2 Layering Support in Popular OSes

File system layering is used to rapidly develop file system enhancements and is seldom used to create basic building-block file systems. Therefore, OSes need not necessarily support layering for the basic working of file systems. However, several OSes have been designed to support layered file systems, and many layered file systems are available for a variety of OSes.


4.2.2.1 Solaris

The Solaris VFS architecture is almost identical to the classic vnode architecture. Each vnode
has a fixed operations vector. Each operation must be implemented by every file system.
Generic operations for returning “operation not supported” or in some cases “success” are
available. Mutable attributes such as size, access time, etc. are not part of vnode fields;
instead, the vnode operations include functions for managing such attributes. The Solaris
loopback file system, lofs, passes all vnode and VFS operations to the lower layer, but it
only stacks on directory vnodes [SMCC, 1991].

4.2.2.2 FreeBSD Unix

The FreeBSD vnode interface has extensibility based on the UCLA model. FreeBSD allows
dynamic addition of vnode operations which is missing in both Solaris and Linux. While
activating a file system, the kernel registers a set of vnode operations that the file system
supports, and builds an operation vector for each file system that contains the union of all
operations supported by any file system. Like the UCLA model, FreeBSD Unix lets file systems 1) provide default routines for operations that they do not support, 2) use packed arguments, and 3) bypass operations that they do not need to intercept. FreeBSD's version of the loopback
file system is called nullfs [Pendry and McKusick, 1995]. It is a simple file system layer
that makes no transformations on its arguments and just passes the requests it receives to
the lower file system.

4.2.2.3 Linux

The Linux VFS does not natively support layered file systems. The Linux VFS is large, as it attempts to provide a close generic match for every functionality that is not implemented by the file system. This makes writing file systems for Linux easy and fast, as a lot of generic code is provided by the VFS. However, it makes it difficult to write a layered file system for Linux: because a layered file system must appear just like the VFS to the lower file system, a Linux layered file system has to replicate the whole VFS. Moreover, as the Linux VFS keeps changing and adding more and more generic code, layered file systems have to keep up.
On Linux, the VFS operations vector is fixed: operations cannot be added, and the prototype of existing operations cannot be changed. wrapfs is a pass-through file system that intercepts name and data operations; it has been ported to Linux, FreeBSD, and Solaris [Zadok and Badulescu, 1999].


4.2.2.4 Windows

Each logical volume in Windows has an associated driver stack, which is a set of drivers
layered one on top of another. The Windows Driver Model introduced in Windows 98 and
2000 has two types of drivers: function drivers and filter drivers [Oney, 2003]. A function
driver is a kernel module that performs the device’s intended functionality. A filter driver is
a kernel module that can view or modify a function driver’s behaviour, and can exist above
other function drivers. Furthermore, the I/O manager of Windows Executive handles all I/O
requests within the kernel using a message-passing architecture in which a data structure
called the I/O Request Packet (IRP) describes an I/O request.
A file system filter driver can view or modify the file system driver’s IRPs by attaching
itself to the corresponding driver stack. When a request is made to a device with multiple drivers on its stack, the I/O manager first sends an IRP to the highest-level driver in the stack. After pre-processing the IRP, the highest-level driver can either pass it down to the next lower-level driver or ask the I/O manager to finish processing. Furthermore, a higher-level driver can optionally register a completion routine that is called when the lower-level driver finishes processing the IRP. Using this callback, the higher-level driver can post-process the data returned by the lower-level driver. For example, an encryption layer registers a completion routine that decrypts the data after a read request.
Windows XP Service Pack 2 introduced a new file system filter driver architecture called
the File System Filter Manager [Microsoft, 2004]. In this architecture, file system filters are
written as mini-filter drivers that are managed by a Microsoft-supplied filter manager driver.
Mini-filters register with the filter manager to receive only operations of interest. This means
that they do not need to implement filtering code for all I/O requests, most of which would
be passed along without any changes. For example, an encryption mini-filter could intercept
only read and write requests.

4.2.3 Alternatives to File System Layering

Using Layering, file systems can be extended in an incremental fashion without modifying
the design and source. However, there exist many alternatives to file system layering. These
alternatives mostly attempt to extend the functionality of a file system in user-space. As
such, these alternatives suffer from performance issues.


4.2.3.1 Micro-Kernel Architecture

The fundamental design consideration of micro-kernel implementations such as Mach and the MIT exo-kernel is to reduce the complexity of the kernel [Engler et al., 1995]. Both approaches remove all but the most basic operating system services from the kernel, moving them to programs residing in user space. Layering fits naturally with this approach. However, the performance overhead of micro-kernels has been a contentious issue, and they have not been deployed widely. Although exo-kernels overcome the performance issue, they are still ongoing research efforts and their feasibility for widespread adoption is not clear.

4.2.3.2 NFS Loopback Server

The SFS tool-kit simplifies UNIX file system extensibility by allowing file systems to be developed at user level using NFS loopback servers [Mazières, 2001]. The NFS protocol is designed for remote file servers; however, an NFS client can also mount an NFS server over the loopback network interface. Therefore, one can implement a simple user-level NFS server and redirect local file system operations into the user-level implementation. Although user space NFS loopback servers can be both beneficial and portable, they are limited by NFS's weak cache consistency and the overhead incurred in the network stack.

4.2.3.3 Trap & Pass Frameworks

In order to develop or enhance file systems in user space, a framework is required which traps the file system calls in the kernel and passes them to user space for processing. The framework
should also provide a simple and powerful set of APIs in user space that are common amongst
most operating systems. Various projects have aimed to develop such a framework.
UserFS [Fitzhardinge, 1993] consisted of a kernel module that registered a userfs file system
type with the VFS and all the requests to userfs were communicated to a user space library
through a file descriptor. Similarly, the Coda [Satyanarayanan et al., 1990] distributed file sys-
tem contains a Coda kernel module which communicates with the user space cache manager,
namely Venus, through a character device /dev/cfs0. Later, UserVFS, which was developed
as a replacement for UserFS, used this Coda character device for communication between the
kernel module and the user space library [Machek, 2000]. Similarly, Arla is an AFS client
that consists of a kernel module, namely xfs, which communicates with the arlad user space


daemon to serve file system requests [Westerlund and Danielsson, 1998]. The ptrace() sys-
tem call can also be used to build an infrastructure for developing file systems in user space
[Spillane et al., 2007].
Among such frameworks, there is one commonly used and well deployed system called
FUSE, part of the Linux kernel since version 2.6.14 [Szeredi, 2005]. FUSE is a three-
part system; 1) a kernel module FUSE, which hooks into the VFS code, registers fusefs file
system type with VFS and implements a special-purpose device /dev/fuse, 2) a user space
library libfuse, which manages communications with the kernel module via /dev/fuse and
translates the file system requests into a set of function calls which look similar (but not
identical) to the kernel’s VFS interface, and 3) finally, a user-supplied component which
actually implements the file system of interest. The two main disadvantages of a FUSE file
system are, (1) performance is limited by crossing the user-kernel boundary, and (2) the file
system can only use FUSE’s API, which closely matches the VFS API, whereas kernel file
systems may access a richer kernel API.
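To make the structure of such a user-supplied component concrete, the classic single-file "hello" example below implements a read-only file system that exposes one file. It is written against the FUSE 2.x API (FUSE_USE_VERSION 26), which was current around the kernel versions mentioned above; the exact compile command depends on the installed libfuse.

    #define FUSE_USE_VERSION 26
    #include <fuse.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/stat.h>

    static const char *hello_str  = "Hello from user space!\n";
    static const char *hello_path = "/hello";

    /* Report a root directory containing one read-only regular file. */
    static int hello_getattr(const char *path, struct stat *stbuf)
    {
        memset(stbuf, 0, sizeof(struct stat));
        if (strcmp(path, "/") == 0) {
            stbuf->st_mode = S_IFDIR | 0755;
            stbuf->st_nlink = 2;
        } else if (strcmp(path, hello_path) == 0) {
            stbuf->st_mode = S_IFREG | 0444;
            stbuf->st_nlink = 1;
            stbuf->st_size = strlen(hello_str);
        } else
            return -ENOENT;
        return 0;
    }

    static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                             off_t offset, struct fuse_file_info *fi)
    {
        if (strcmp(path, "/") != 0)
            return -ENOENT;
        filler(buf, ".", NULL, 0);
        filler(buf, "..", NULL, 0);
        filler(buf, hello_path + 1, NULL, 0);
        return 0;
    }

    static int hello_open(const char *path, struct fuse_file_info *fi)
    {
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((fi->flags & 3) != O_RDONLY)
            return -EACCES;
        return 0;
    }

    static int hello_read(const char *path, char *buf, size_t size, off_t offset,
                          struct fuse_file_info *fi)
    {
        size_t len = strlen(hello_str);
        if (strcmp(path, hello_path) != 0)
            return -ENOENT;
        if ((size_t)offset >= len)
            return 0;
        if (offset + size > len)
            size = len - offset;
        memcpy(buf, hello_str + offset, size);
        return size;
    }

    static struct fuse_operations hello_oper = {
        .getattr = hello_getattr,
        .readdir = hello_readdir,
        .open    = hello_open,
        .read    = hello_read,
    };

    int main(int argc, char *argv[])
    {
        return fuse_main(argc, argv, &hello_oper, NULL);
    }

On a typical Linux system this can be built with something like gcc hello_fuse.c -o hello_fuse $(pkg-config fuse --cflags --libs) and mounted on an empty directory; the point is simply that the whole file system lives in an ordinary user process.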

4.3 Application of File System Layering

As mentioned earlier, layered file systems have been used to add new functionalities to ex-
isting file systems in a rapid and portable manner. The applications of layered file systems
vary from simple monitoring of operations to operation transformations [Zadok et al., 2006].
The following text discusses various applications of layered file systems, and identifies the
appropriate stacking model and supported OS(es) for each individual category.

4.3.1 Monitoring

A monitoring file system layer intercepts operations and examines their arguments, but does not modify them. Examples of monitoring layers are tracing, intrusion detection systems (IDSs) and anti-virus. Monitoring file systems should use linear stacking and avoid fan-in access, as the monitoring layer should not be bypassed.
Both tracing and IDSs need to examine a large subset of file system operations; nevertheless, both the Rosenthal and the UCLA model are suitable, and all OSes (except Linux) support them easily. In contrast, anti-virus file systems need to examine only a small subset of operations. The UCLA model, and the FreeBSD and Windows OSes, support this well because the rest of the operations can be serviced by a generic bypass routine. However, Rosenthal's model, and the Linux and Solaris OSes, require implementations for each routine within the layer.


4.3.2 Data Transforming

A data transforming file system layer (such as an encryption layer) should also use linear stacking, but for efficient backup it should additionally support fan-in access.
The Rosenthal and UCLA models, and the Linux and Windows OSes, easily support modifying the data on the read and write paths. However, neither FreeBSD nor Solaris includes layered file systems that modify the data1. Further, in the case of sparse files, when data from a hole is read, the file system returns zeros. For an encryption layer, it is possible for zeros to be a valid encoding. An encryption layer should therefore either avoid writing sparse files or fill in the holes with encrypted zeros. Filling in holes can, though, significantly reduce performance for applications that rely on sparse files.

4.3.3 Size Changing

A size changing file system layer (such as a compression layer) should also use linear stacking, but for efficient backup it should additionally support fan-in access.
Such file systems require transformation algorithms that take a certain number of bits as input and produce a different number of bits as output. Supporting size changing algorithms at the layered level is complicated because they change the file size and layout, thereby requiring maintenance of additional mapping information between the offsets of each file in the different layers of the file system.

4.3.4 Operation Transformation

An operation transforming file system layer usually implements advanced features such as secure data deletion, versioning and so on. Operation transforming file systems should use linear stacking and avoid fan-in access, as the operation transforming layer should not be bypassed.
When an operation is transformed, it is important that the semantics of the OS are maintained. For example, if unlink() is called in UNIX and returns successfully, then the file should not be accessible by that name any more. One way to maintain the OS semantics, while not performing exactly the intended operation, is to transform one operation into a combination of operations (a user-space sketch of this idea is shown below). In FreeBSD and Solaris, almost all operations take a vnode as an argument, so transforming operations is simpler. Similarly, in Windows, all operations are implemented in terms of path names and file objects, so converting between objects is
1 FiST [Zadok and Nieh, 2000] templates for FreeBSD & Solaris can modify data.


also simpler. However, having many different types of VFS objects in Linux (file, dentry,
inode, etc.) makes operation transformation at the layered level more difficult.
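As a purely illustrative user-space sketch of such a transformation (the .purge directory name and the renaming scheme are assumptions made here, not part of any of the layered file systems discussed), an unlink request can be turned into a rename into a purge directory, which preserves the visible semantics (the old name disappears) while keeping the data available to a background purger:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <sys/stat.h>

    /* Transform "delete <path>" into "rename <path> into .purge/<unique-name>".
     * The caller still observes unlink() semantics: the old name is gone.
     * Note that rename() only works within a single file system.            */
    static int purge_unlink(const char *path)
    {
        char newpath[4096];
        const char *base = strrchr(path, '/');

        mkdir(".purge", 0700);                 /* ignore EEXIST for brevity */
        snprintf(newpath, sizeof(newpath), ".purge/%ld-%s",
                 (long)time(NULL), base ? base + 1 : path);
        return rename(path, newpath);
    }

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }
        if (purge_unlink(argv[1]) != 0) {
            perror("purge_unlink");
            return 1;
        }
        return 0;
    }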

4.3.5 Fan-Out File Systems

A fan-out file system layer stacks on top of several lower-level file systems; an object in a fan-out file system therefore represents objects on several lower-level file systems. There are two main problems associated with this. First, as each operation at the fan-out level may result in multiple lower-level operations, the atomicity of each operation is not guaranteed. Rosenthal proposed that transactions could solve this, but did not implement them, and adding transaction support to the OS would require significant and pervasive changes to many subsystems, not just the VFS. Second, the VFS often uses fixed-length fields that must be unique, which complicates mapping between the upper-level and lower-level VFS objects and fields. This is true of the Rosenthal and UCLA models, and of the Solaris, FreeBSD, and Linux OSes. Furthermore, the Windows OS was designed for linear stacking only, and hence does not support fan-out file systems.

4.4 restFS: Secure Data Deletion using a Reliable & Efficient Stackable File System

Due to the widespread usage of computer systems, digital data security has become a focus of current computer science research. One of the most common data security problems relates to after-deletion data recovery. The end user thinks that files have been permanently removed when he deletes a file or empties the Trash Bin. In fact, operating systems give an illusion of file deletion by just invalidating the filename and stripping it of the allocated data blocks. As such, the contents of the data blocks associated with a file remain on disk even after its deletion, unless and until these blocks get reallocated to some other file and finally get overwritten with new data. This policy is adopted as a trade-off between performance and security. Sometimes this time gap allows users to recover files deleted accidentally. Unfortunately, the same time gap also allows malicious users and hackers to recover deleted files. As an example, a local privileged user can access a low level device via the /dev interface and use one of the several available software products to recover the whole file or portions of it. Also, laptops and portable storage devices can be discarded, lost or stolen. Sensitive and confidential information, which was deleted in the belief that it had been physically erased, can then be recovered even by novice users. Due to the excessive use of digital content in our day to day life, most users do not even know that their disk contains confidential information in the form of deleted files; worse, the users who know often ignore the fact. Therefore, after-deletion data recovery is in part an education problem, and educating users is difficult. On the other side, it is a psychological problem induced by current operating systems, and redesigning and recoding operating systems is not feasible [Rosenbaum, 2000].
There are generally two methods to ensure secure deletion of data; 1) overwrite the data,
and 2) encrypt the data. In addition to these methods, destroying the physical media is a
fast alternative when the media is abandoned. Also, in case of whole drive data erasure,
magnetic media can be degaussed to wipe out magnetic information. Unfortunately, the
strong magnetic fields generated by degaussers can destroy the physical media by bending
the platters and thus making the rotation impossible.
Secure deletion using encryption can employ various encryption techniques to encrypt the data before it is stored on disk and to decrypt it on retrieval. This solution protects both deleted and non-deleted data. However, it suffers from several problems: 1) all encryption systems suffer from cumbersome and costly management of keys, 2) encryption adds CPU overhead to most file system operations, 3) keys can be lost or broken, and a compromised key allows recovery of both live and deleted data, 4) using per-file keys adds more overhead and key management cost, and 5) key revocation is problematic even for non-deleted files because it sometimes requires re-encrypting the data [Wright et al., 2003]. As an example, if one writes a file encrypted with a new key directly over the old file whose key was compromised, there is still a chance that the old data can be recovered using hardware tools; this suggests that encryption must be combined with overwriting for optimal security. 6) Lastly, strong encryption is not allowed in some countries.
Secure deletion using overwriting works by overwriting the meta-data and user-data per-
taining to a file when it is deleted. In its simplest form, the file system or the storage media
can be overwritten in its entirety and the process can be accomplished by user applications
or assisted at the hardware level. Unfortunately, this process is inconvenient as it also erases live data, and is thus applicable only when whole-disk or whole-file-system sanitisation is required. The most applicable and desired level of data overwriting is transparent per-file overwriting. This transparent per-file overwriting can be performed at two levels of an operating
system: 1) User-mode level, & 2) File System level. User-mode transparent per-file over-
writing can be implemented by modifying the library or adding extensions to it to support
data overwriting on deletion. However, this solution demands library modification, does not


work with statically linked binaries, can’t overwrite all the meta-data belonging to the file
and can be bypassed easily. As such, it is not a feasible solution to transparent per-file data
overwriting. In contrast, at file system level, all the file system operations required for data
overwriting can be intercepted and thus complete overwriting can be guaranteed.
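For reference, the core of user-level overwriting (essentially what a shred-like tool does for the user data of a single file) can be sketched as follows. This is an illustrative standalone program, not the restFS mechanism, and as noted above it cannot reach all the metadata or any old copies the file system may keep elsewhere.

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    /* Overwrite the contents of a file with zeros, flush it to disk, then unlink it. */
    static int overwrite_and_unlink(const char *path)
    {
        char zeros[4096];
        struct stat st;
        int fd = open(path, O_WRONLY);

        if (fd < 0)
            return -1;
        if (fstat(fd, &st) < 0) { close(fd); return -1; }
        memset(zeros, 0, sizeof(zeros));
        for (off_t done = 0; done < st.st_size; ) {
            size_t chunk = (st.st_size - done) < (off_t)sizeof(zeros)
                         ? (size_t)(st.st_size - done) : sizeof(zeros);
            ssize_t n = write(fd, zeros, chunk);
            if (n <= 0) { close(fd); return -1; }
            done += n;
        }
        fsync(fd);                  /* force the overwrite out of the page cache */
        close(fd);
        return unlink(path);
    }

    int main(int argc, char *argv[])
    {
        return (argc == 2 && overwrite_and_unlink(argv[1]) == 0) ? 0 : 1;
    }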
This section discusses the design, implementation and evaluation of a reliable and efficient vnode stackable file system, called restFS, for secure deletion of data. restFS works at block level to achieve the reliability and efficiency that are missing in existing secure data deletion file system extensions, which work at file level. restFS is a vnode stackable file system that can be mounted between the ext2 file system and the VFS to enhance the capability of ext2 to perform reliable and efficient secure deletion of data without modifying the design or source of ext2. Although the implementation of restFS is very specific to the ext2 file system, the design can be implemented for all file systems that export the block allocation map of a file to upper layers (VFS).
The rest of this section is organised as follows. Section 4.4.1 discusses the background and
work related to secure deletion using data overwriting. Section 4.4.2 introduces the design
goals of restFS while section 4.4.3 discusses the implementation details of restFS. Section
4.4.4 discusses the experiment carried out to evaluate the efficiency of restFS. Section 4.4.5
evaluates and discusses the results while section 4.4.6 presents the conclusion. Finally, section
4.4.7 points towards limitations and future scope of the work.

4.4.1 Background of Secure Data Deletion using Overwriting

In 1992, it was reported that a Magnetic Force Microscope (MFM) can be used to reconstruct the magnetic structure of a sample magnetic surface [Gomez et al., 1992]. In 2000, a technique was introduced that uses a spin-stand to collect several concentric and radial magnetic surface images, which can be processed to form a single surface image [Mayergoyz et al., 2000]. These techniques make it possible to recover overwritten data if we accept the proposition that the head positioning system is not accurate enough to write new data exactly over the precise location of the original data. This track misalignment has been argued to make it possible to identify traces of data from earlier magnetic patterns alongside the current track. The basis of this belief is that the actual value stored is closer to 0.95 when a zero (0) is overwritten with a one (1), and closer to 1.05 when a one (1) is overwritten with a one (1). As such, using these techniques the data overwritten once or twice may be recovered by subtracting what is expected to be read from a storage location from what is actually read. In 1996, based on this proposition, Peter Gutmann proposed a 35-pass data overwriting scheme for secure data deletion [Gutmann, 1996]. The basic idea was to flip each magnetic domain on the disk back and forth as much as possible, without writing the same pattern twice in a row, and to saturate the disk surface to the greatest depth possible. NIST recommends that magnetic media be degaussed or overwritten at least three times [Grance et al., 2003]. The Department of Defense document DoD 5220.22-M suggests overwriting with a character, its complement, then a random character, as well as other software-based overwrite methods [Diesburg and Wang, 2010].
In 2008, a study demonstrated that correctly wiped data cannot be reasonably retrieved
even if it is of a small size or found only over small parts of the hard drive due to the
enhancements in magnetic drive technologies such as advanced encoding schemes (PRML
and EPRML), Zone Bit Recording (ZBR) and so on [Wright et al., 2008]. The purpose of that paper was to categorically settle the controversy surrounding the misconception that data can be recovered following a wipe procedure. In general, a single wipe is enough to make any reasonable data recovery using software or hardware techniques impossible.
In order to achieve transparent per-file data overwriting, the file system level is the most appropriate, as it is able to intercept every file system operation and does not require any modification of user-level libraries or applications. In 2001, [Bauer and Priyantha, 2001] modified the ext2 file system to asynchronously overwrite data on unlink() and truncate() operations. However, this method has several drawbacks: source code modification can break the stability and reliability of a file system, the modification must be made in every file system, and the purging cannot be sustained across crashes. In 2005, automatic instrumentation of file system source using FiST [Zadok and Nieh, 2000] to add purging support was demonstrated by [Joukov and Zadok, 2005], saving the manual work of source code modification. In case the source code is not available, the purging extension, called purgefs, instruments a null-pass vnode stackable file system, called base0fs, to provide the purging extension as a stackable file system. purgefs supports many overwrite policies and can overwrite both synchronously and asynchronously. In asynchronous mode, purgefs can remap the data pages to a temporary file and overwrite them using a kernel thread. However, purgefs also suffers from a reliability problem, as purging cannot be sustained across system crashes. In 2006, [Joukov et al., 2006] proposed another FiST extension, called FoSgen, which is similar to purgefs in instrumentation, i.e., if the source code of the file system to be instrumented is not available, FiST creates a stackable file system. FoSgen differs from purgefs in operation, as it moves the files to be deleted (or truncated) to a special directory called ForSecureDeletion and invokes the user-mode shred tool that overwrites the files. In case of truncation, FoSgen creates a new file with the same name as the original file and copies a portion of the original file to the new one while deleting the original. Due to the Trash Bin like functionality in FoSgen, purging is sustained across system crashes, but the window of insecurity also increases, as FoSgen provides a clean file system interface for data recovery.
purgefs and FoSgen overwrite at file level and are not able to exploit the behaviour of file systems at block level for reliability and efficiency. Although [Bauer and Priyantha, 2001] does work at block level, it makes no effort to exploit it.

4.4.2 Design of restFS File System

restFS (Reliable & Efficient Stackable File System) is a vnode stackable file system that is designed to overcome the problems found in existing secure data deletion extensions. The design goals of restFS are reliability and efficiency in overwriting the meta-data and user data of a file when the unlink() or truncate() operation is called; all the purging extensions discussed in section 4.4.1 fall short of these goals.
In order to achieve its reliability and efficiency, restFS exploits the behaviour of file systems at block level. It can safely be argued that many of the files to be purged may be fragmented. As such, there is a good chance that fragments of two or more files to be purged are placed consecutively on disk. Overwriting can be done efficiently if all the individual overwrites of consecutive fragments are merged into a single overwrite, thereby reducing the number of disk write commands issued. Even when two or more non-fragmented files are to be purged, there is a possibility that their content is placed next to each other on disk, and again the individual overwrites can be merged into a single overwrite [Ganger and Kaashoek, 1997]. This efficiency optimisation is not possible in purgefs and FoSgen, as they overwrite at file level, and Bauer et al. make no effort to exploit it. Furthermore, in a file system under heavy workload there is a good probability that de-allocated blocks will be allocated again to some other file and finally get overwritten with new data. A purging extension can exploit this to save some block overwrites. Again, this optimisation is either not possible or not considered in current purging extensions. restFS considers these scenarios and operates at block level to achieve efficiency during secure data deletion. Figure 4.2 shows the two possibilities exploited by restFS for efficiency.


[Figure 4.2 diagram: two interleaved files, file1 and file2, occupying blocks 1-9.
Top: after file1 and file2 are deleted, current purging techniques overwrite blocks in the sequence 2, 3, 5, 7, 8 followed by 1, 4, 6, 9. restFS overwrites more efficiently by sorting the de-allocated blocks and creating a small number of long sequences; in this case 1, 2, 3, 5, 6, 7, 8 and 9.
Bottom: when file1 is deleted (de-allocating blocks 2, 3, 5, 7, 8) and file2 is then created or appended (re-allocating blocks 5, 6, 7, 8), current purging techniques overwrite blocks 2, 3, 5, 7 and 8, after which the re-allocated blocks 5, 7 and 8 are written again with new data. restFS saves the overwrites of the re-allocated blocks, so only blocks 2 and 3 are overwritten.]

Figure 4.2: Possibilities exploited by restFS for efficiency
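
The merging step itself can be sketched in a few lines of C; the block numbers below are similar to those in the top half of Figure 4.2, and the program is illustrative rather than the actual restFS demon code.

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_long(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Sort the de-allocated block numbers and print one write per run of
     * consecutive blocks instead of one write per block.                  */
    static void emit_runs(long *blocks, int n)
    {
        qsort(blocks, n, sizeof(blocks[0]), cmp_long);
        for (int i = 0; i < n; ) {
            int j = i;
            while (j + 1 < n && blocks[j + 1] == blocks[j] + 1)
                j++;
            printf("overwrite blocks %ld-%ld (%d blocks, 1 write)\n",
                   blocks[i], blocks[j], j - i + 1);
            i = j + 1;
        }
    }

    int main(void)
    {
        /* file1 freed blocks 2,3,5,7,8; a second deleted file freed blocks 1,6,9 */
        long blocks[] = { 2, 3, 5, 7, 8, 1, 6, 9 };
        emit_runs(blocks, sizeof(blocks) / sizeof(blocks[0]));
        return 0;
    }

For the blocks above, the program issues two writes, covering blocks 1-3 and 5-9, instead of eight individual block writes.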

Overwriting is guaranteed across system crashes (only in FoSgen) by moving files to a special directory and shredding them there. However, this opens a window of security vulnerability before the data is overwritten, a problem present in all purging extensions that overwrite at file level. Purging extensions should make purging reliable across system crashes, but at the same time they should minimise the vulnerability of the data to be overwritten. restFS ensures the reliability of purging across crashes by logging the block numbers that identify the blocks to be purged, and at the same time reduces the security vulnerability found in FoSgen.
To accomplish its intended task, restFS, during the unlink() operation, exports the block allocation map of the file to user space to be processed by a user level demon, and then synchronously overwrites the meta-data. The demon sorts the block numbers and tries to reduce the number of overwrites by creating a number of long sequences of consecutive blocks for overwriting. This design decision helps in reducing the frequency and number of write commands issued to the disk. In addition, restFS exports information about the blocks newly allocated to a file during the write() call. Using this information, the demon skips the overwriting of those de-allocated blocks which are present on the overwrite list but have since been re-allocated. This design decision helps in reducing the number of blocks to be overwritten. Thus, the efficiency of restFS is achieved by reducing 1) the number of data blocks to be written, 2) the number of write commands issued to the disk, and 3) the frequency of disk writes. Furthermore, during the truncation operation only the information about the blocks which are deallocated (or allocated, in case the size increases) is exported, which saves the copying of data that may be required during truncation in FoSgen. Algorithm 4.1 shows the minimal amount of work done by restFS during the restfs_unlink() operation. Similarly, algorithm 4.2 shows the operation of the restfs_setattr() call executed during the truncate operation, while algorithm 4.3 shows the work done by restFS during the restfs_write() call.

Algorithm 4.1 Algorithm for restfs_unlink() call of restFS


Input: inode, dentry
Output: status
1: export the list of blocks deallocated to user space via /proc interface
2: overwrite the contents of inode and dentry
3: continue normal processing
4: return status

Algorithm 4.2 Algorithm for restfs_setattr() call of restFS


Input: dentry, attribute
Output: status
1: if attribute.Size > dentry.Size then
2: {New blocks have been allocated}
3: export the list of allocated blocks to user space via /proc interface
4: else
5: {Some blocks have been deallocated}
6: export the list of blocks deallocated to user space via /proc interface
7: end if
8: continue normal processing
9: return status

Finally, the reliability of purging across system crashes is accomplished by creating a per-file-system persistent log of the de-allocated blocks that are to be overwritten by the user level demon. The log is created only for the deallocated blocks and not for the newly allocated blocks. The newly allocated blocks are read by the demon, and the corresponding blocks are searched for in the log and removed, to save some block overwrites. The implementation decision to let

Algorithm 4.3 Algorithm for restfs_write() call of restFS


Input: dentry, attribute
Output: status
1: if attribute.Size > dentry.Size then
2: {New blocks have been allocated}
3: export the list of allocated blocks to user space via /proc interface
4: end if
5: continue normal processing
6: return status

user level demon create the log simplifies the implementation of the restFS kernel module and achieves faster execution of file system operations. Hence, the unlink() call executes fast, and as soon as it returns, the reliability (as well as efficiency) of purging is ensured by the user level demon. Furthermore, by logging the blocks (instead of the files) to be overwritten, the vulnerability of data in deleted files is reduced. Although the de-allocated blocks may not yet have been overwritten by the demon, recovery becomes more difficult than in purgefs and FoSgen, where a clean file system interface lets an attacker know which files are being purged.
The overwriting policy is left up to the user level demon for flexibility. The demon can overwrite instantly as soon as the information is passed to it (preferable in the case of removable devices) or delay the process. The demon can also be configured to overwrite small files instantly. Likewise, it can be configured to instantly overwrite some first part of large files to further reduce the window of insecurity. This makes the overwriting policy re-configurable without any need to remount the restFS module. Furthermore, the user level demon does not remove the entry of a data block meant to be overwritten from the log until it makes sure that the data block has actually been written to the disk. The demon accomplishes this task by using the /proc/sys/vm/block_dump interface of the Linux kernel to grep for the block numbers that were supposed to be overwritten. This policy avoids the reliability problem that the buffer cache could otherwise cause during a system crash. The flowchart in Figure 4.3 depicts the working of the user level demon when information about de-allocated and newly allocated blocks is passed to it.
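
A heavily simplified skeleton of this demon loop is sketched below. The /proc file names follow the layout used in Figure 4.3, with "mnt" standing in for the actual mount point; persistence of the log, the overwrite policy and the block-device writes are reduced to comments, so this illustrates the control flow rather than the real demon.

    #include <stdio.h>
    #include <unistd.h>

    #define MAXBLK 65536

    static long log_blk[MAXBLK];     /* in-memory copy of the persistent log */
    static int  log_n;

    /* Read whitespace-separated block numbers from a /proc file into an array. */
    static int read_blocks(const char *path, long *out, int max)
    {
        FILE *f = fopen(path, "r");
        int n = 0;
        if (!f)
            return 0;
        while (n < max && fscanf(f, "%ld", &out[n]) == 1)
            n++;
        fclose(f);
        return n;
    }

    static void remove_from_log(long blk)
    {
        for (int i = 0; i < log_n; i++)
            if (log_blk[i] == blk)
                log_blk[i] = log_blk[--log_n];
    }

    int main(void)
    {
        long dealloc[MAXBLK], alloc[MAXBLK];

        for (;;) {
            /* 1. Newly de-allocated blocks are appended to the (persistent) log. */
            int nd = read_blocks("/proc/restfs/mnt/dealloc", dealloc, MAXBLK);
            for (int i = 0; i < nd && log_n < MAXBLK; i++)
                log_blk[log_n++] = dealloc[i];

            /* 2. Blocks that were re-allocated no longer need to be overwritten. */
            int na = read_blocks("/proc/restfs/mnt/alloc", alloc, MAXBLK);
            for (int i = 0; i < na; i++)
                remove_from_log(alloc[i]);

            /* 3. Depending on the overwrite policy: sort the log, merge the blocks
             *    into runs of consecutive blocks and issue the overwrites; entries
             *    are dropped from the log only after the writes are known to have
             *    reached the disk (the thesis uses /proc/sys/vm/block_dump here). */

            sleep(1);
        }
        return 0;
    }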

4.4.3 Implementation of restFS File System

The design of restFS relies on the capability of the below-mounted file system to export the list of data blocks associated with a file. Unfortunately, very few file systems export this

[Figure 4.3 flowchart:
1. Read /proc/restfs/<mount-point>/config and create the per-file-system persistent log.
2. Read /proc/restfs/<mount-point>/alloc and /proc/restfs/<mount-point>/dealloc, append the de-allocated blocks to the log, and clear the dealloc file.
3. If any block listed in /proc/restfs/<mount-point>/alloc is present in the log, remove that block from the log; then clear the alloc file.
4. Sort the blocks in the log in ascending order, create long sequences of consecutive blocks, and issue writes for these run-lengths.
During every phase of this algorithm, /proc/sys/vm/block_dump is monitored; blocks for which writes have been issued are removed from the log only when they appear on block_dump.]

Figure 4.3: Working of user level demon of restFS


kind of information. Worse, those that do, do it in their own specific way, and the VFS does not provide a generic routine for such file systems to present the information in a consistent manner. Alternatively, restFS could be built using the allocation/de-allocation bitmaps of file systems, but this bitmap information is also not exported by all file systems; those file systems that do export it do not use a generic format, and the VFS does not help here either. In addition, there is no VFS routine for the truncate operation that restFS could hook. However, restFS can accomplish this task by hooking the VFS routine that sets the attributes of a file and checking for a possible size change. This limited capability of the VFS has been argued before [Zadok et al., 2006]. We believe that the file system research community will come up with solutions and remedies. Having said that, we implemented restFS for the ext2 file system, as it exports the information needed by restFS and is widely used in secure deletion research papers. It is worth mentioning that the design of restFS is applicable to all file systems which export the allocation map of a file to the VFS, but the implementation is very specific to the ext2 file system.
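For illustration, the kind of per-file block map information involved can be obtained from user space on ext2 with the FIBMAP ioctl (which requires root privileges); this is only meant to show what the exported allocation map looks like, not the in-kernel path that restFS actually uses.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/stat.h>
    #include <linux/fs.h>      /* FIBMAP, FIGETBSZ */

    int main(int argc, char *argv[])
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <file-on-ext2>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        int blksz;
        struct stat st;
        if (ioctl(fd, FIGETBSZ, &blksz) < 0 || fstat(fd, &st) < 0) {
            perror("FIGETBSZ/fstat");
            return 1;
        }

        long nblocks = (st.st_size + blksz - 1) / blksz;
        for (long i = 0; i < nblocks; i++) {
            int blk = (int)i;          /* logical block in, physical block out */
            if (ioctl(fd, FIBMAP, &blk) < 0) { perror("FIBMAP"); return 1; }
            printf("logical %ld -> physical %d\n", i, blk);
        }
        close(fd);
        return 0;
    }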
restFS is implemented as a Linux Loadable Kernel Module (LKM) by instrumenting a null-pass vnode stackable file system called base0fs. restFS performs reliable and efficient secure data deletion when inserted between the VFS and the below-mounted ext2 file system. There are actually only two file system operations that need to be intercepted for secure deletion of data: unlink (called to delete a file) and truncate (called to change the file size). However, because restFS also exports the newly allocated blocks, we need to intercept the write() file system operation as well. Figure 4.4 depicts the behaviour of restFS during the unlink() operation when mounted on top of the ext2 file system.
When restFS is loaded, it creates a directory entry named restfs in the /proc file system. After this, when restFS is mounted on top of some native file system, it checks for the ext2 file system and then creates a directory entry in /proc/restfs/ having the same name as the mount point of the ext2 file system. Under this directory, three files, namely config, dealloc and alloc, are created. The config file exports the configuration data of the ext2 file system, such as device name, block size and so on; the dealloc file exports the list of blocks de-allocated; and the alloc file exports the list of blocks newly allocated. The /proc/restfs/<mount-point>/alloc file is populated in the restfs_write() call, as this is the main function that may add new data blocks to an existing file. Similarly, /proc/restfs/<mount-point>/dealloc is populated in restfs_unlink(), which corresponds to the unlink() operation. As mentioned earlier, there is no corresponding operation provided by the VFS for the truncate operation. Therefore, restFS hooks the setattr() call, which is called immediately after the truncation operation to reflect metadata

[Figure 4.4 diagram: a user process issues unlink(); in kernel mode, vfs_unlink() in the VFS invokes restfs_unlink() in restFS, which 1) exports the block allocation map of the file to the user-space demon via the /proc interface, 2) overwrites the metadata of the file synchronously, and 3) calls ext2_unlink() of the underlying ext2 file system.]
Figure 4.4: Behaviour of restFS during unlink() operation

changes. restFS correspondingly populates the dealloc or alloc file depending on whether the new size is less than or greater than the previous size. All this information is provided by the restfs kernel module to the user level demon for processing.
The demon is a Linux service that reads the /proc/restfs/<mount-point>/config file to learn the device file and block size of the ext2 file system on top of which restFS is mounted. After this, the demon keeps reading both the alloc and dealloc files and creates a persistent per-file-system log file containing the blocks to be overwritten, in order to sustain purging across crashes. Depending upon its overwriting policy, the demon can then overwrite instantly or delay the process.
Finally, the demon removes an entry from the log after it validates that the block overwrite has been committed to disk, by checking the /proc/sys/vm/block_dump interface for the specified block. In all cases, it sorts the log and creates a number of longest possible runs of consecutive blocks. As soon as some new block is added to the /proc/restfs/<mount-point>/alloc list, the demon reads it, checks for its existence in the persistent log of blocks to be overwritten and, if found, removes it from the log. Moreover, the clean-up of /proc/restfs/<mount-point>/alloc and /proc/restfs/<mount-point>/dealloc is done by the user level demon, as shown in Figure 4.3.
One may ask how the design of restFS differs from that of [Bauer and Priyantha, 2001]. The design of [Bauer and Priyantha, 2001] mainly differs from restFS in five aspects: 1) it modifies the source code of the ext2 file system, 2) it does not exploit the behaviour of the file system at block level for efficiency, 3) it does nothing to make purging reliable, 4) it uses a kernel demon, and 5) the design is not compatible with other file systems.

4.4.4 Experiment

The reliability of restFS follows logically from its design and, in our opinion, does not need
empirical proof. Furthermore, restFS only pre-processes the unlink(), truncate() and write()
file system calls and does the minimal work of exporting the allocation map of a file to user
space. As such, the performance overhead of restFS for CPU-bound workloads is naturally
negligible. Also, because restFS does not overwrite data, either synchronously or asynchronously,
within the unlink() and truncate() calls, it adds no performance overhead to I/O-bound
workloads. It does overwrite metadata synchronously within the unlink() and truncate()
calls, but this overhead is negligible and affordable. Therefore, we did not evaluate restFS for the
performance overhead incurred during CPU-bound and I/O-bound workloads; rather, the
experiment was aimed at validating the efficiency of restFS.
The experiment was conducted on an Intel based PC with an Intel Core i3 CPU 550 @
3.20 GHz with 4 MB cache in total, and 2 GB DDR3 1333 MHz SDRAM. The magnetic disk drive
used was a 320 GB SATA drive with an on-board cache of 8 MB and 15.3 ms of reported average
access time. The drive was partitioned into a 20 GB primary partition to host the Fedora Core
14 operating system running kernel version 2.6.35.14-95.fc.14.i686. Another 20 GB
primary partition was used to mount the ext2 file system with restFS on top. Furthermore, the
evaluation was done with Linux running at run-level 1 to reduce the random effect of other
applications and demons. Also, the experiment was repeated at least 5 times and the average
of the results was considered, with standard deviation less than 4% of the average.
We evaluated restFS for its efficiency to save multiple overwrites of the same block and to
reduce the frequency and number of writes to disk. [Agrawal et al., 2007] found that large
files account for a large fraction of space but most files are 4 KB or smaller. They also argued
that although the absolute count of files per file system will grow significantly over time,
this general shape of the distribution will not change significantly. As such, we evaluated
restFS for a large number of random read/write and create/delete operations on a large number
of small files. To accomplish this, we used an I/O-intensive workload simulator called Postmark
[Katcher, 1997]. It performs a series of file appends, reads, creations and deletions. We
ran Postmark in its default configuration in which it initially created a pool of 500 files


between 500 bytes and 10 KB in size and then performed 500 transactions. In each transaction there
is an equal probability of create/delete and read/write operations. Finally, it deleted all the
files. Furthermore, the restFS demon was configured in delay-overwrite mode. Table 4.1 shows
the statistics gathered from Postmark.

Table 4.1: restFS: Postmark benchmark report

                   Initial Pool   During Transactions Only   After Transactions Only   Total
Files Created      500            264                        –                         764
Files Read         –              243                        –                         243
Files Appended     –              257                        –                         257
Files Deleted      –              236                        528                       764

4.4.5 Results & Discussion

We intercepted and analysed the information exported by the restFS kernel module and the
processing done by the user-level demon during the execution of the Postmark benchmark. First, the
results indicate that most of the blocks de-allocated during transactions got re-allocated;
the percentage of de-allocated blocks that got re-allocated was 98%. This means that if, during
transactions, all de-allocated blocks had been overwritten for secure deletion, almost
all of them would anyway have been overwritten with new data. Further, in total,
at the completion of the Postmark benchmark (which deletes all the files that were created), the
results indicate that 28% of the total de-allocated blocks were re-allocated. This 71% relative drop
in gained efficiency is due to the fact that at the completion of Postmark no new blocks are
allocated. This indicates that, using restFS, a minimum of 28% of block overwrites for secure
deletion can be saved. Both results do not consider the blocks allocated during the creation of
the initial pool of 500 files. Table 4.2 shows the efficiency evaluation of restFS to save multiple
overwrites of the same block.
Second, the blocks which were not re-allocated (or were only recently de-allocated) and needed to
be overwritten were sorted. The sorted list was used to find a number of run lengths of
consecutive blocks. These run lengths were overwritten one at a time and hence reduced the
number and frequency of disk writes issued. The results indicate that there was no gain at all
during transactions, as very few de-allocated blocks were left unallocated. However, in total,
at the completion of Postmark, the saving increased to 88%. This huge growth in efficiency is


Table 4.2: restFS: efficiency to save multiple overwrites of same block

                                      During Transactions Only   Total (including Transactions)
Total Blocks unlinked                 446                        1547
Total unlinked Blocks re-allocated    437                        437
%age of Block Overwrites Saved        98%                        28%

due to the fact that at the completion of Postmark a large number of blocks were de-allocated.
Table 4.3 shows the efficiency evaluation of restFS to save the number of disk writes issued. The table
indicates that without using restFS, 530 commands are issued to the disk, wherein each command
writes to a small number of consecutive blocks, whereas using restFS, just 62 commands are
issued, wherein each command writes to a large number of consecutive blocks. In both cases,
the total amount of data and the number of blocks written is the same. The gain lies in the
reduction of the number and frequency of disk write commands issued, in addition to fewer random
disk-arm seeks.
Table 4.3: restFS: efficiency to save number of disk writes issued

                                               During Transactions Only   Total (including Transactions)
unlinks issued                                 236                        764
Non-sequential disk writes before clubbing     9                          530
Non-sequential disk writes after clubbing      9                          62
%age of issued disk write commands saved       0%                         88%

It is worth mentioning here that these two efficiency optimisations are complementary. This
means that if there are many de-allocated blocks and new blocks are being allocated, a large
number of block overwrites are saved as the de-allocated blocks get re-allocated; however,
because very few de-allocated blocks are left unallocated, there is no saving in the number of
disk writes. Similarly, if there are many de-allocated blocks and no new blocks are allocated,
there is no saving in block overwrites but a large number of disk writes are saved. This
fact is depicted by Table 4.2 and Table 4.3.
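For illustration, the sorting and "clubbing" of de-allocated blocks into runs of consecutive blocks, as described above, can be sketched as follows. This is a minimal user-space sketch with assumed data structures; the actual demon may organise its log differently.

#include <stdio.h>
#include <stdlib.h>

static int cmp_blocks(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Sort the pending block list and issue one overwrite per run of
 * consecutive block numbers instead of one overwrite per block. */
void overwrite_in_runs(unsigned long *blocks, size_t n)
{
    size_t i = 0;

    qsort(blocks, n, sizeof(blocks[0]), cmp_blocks);
    while (i < n) {
        size_t j = i;
        while (j + 1 < n && blocks[j + 1] == blocks[j] + 1)
            j++;
        /* One disk write command covers blocks[i] .. blocks[j]. */
        printf("overwrite run: %lu-%lu (%zu blocks)\n",
               blocks[i], blocks[j], j - i + 1);
        i = j + 1;
    }
}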


4.4.6 Conclusion

We reviewed proposals for secure data deletion, highlighting their strengths & limitations. We
can conclude that for secure data deletion, transparently overwriting the data belonging to
a file during the deletion and truncation operations is better and safer than the encryption technique.
Furthermore, we argued that reliability and efficiency can be achieved in secure data deletion
using overwriting if, instead of at the file level, overwriting is done at the block level. To validate the
concept, we designed and implemented a vnode stackable file system called restFS. restFS is
design-compatible with any file system that exports the block allocation map of a file to the VFS and
is currently implemented for the ext2 file system. The reliability mechanism of restFS makes purging reliable
across crashes and reduces the window of insecurity, while the efficiency mechanism saves a
minimum of 28% of the data blocks that need to be overwritten in existing purging extensions
and reduces the number of disk writes issued by 88% when the Postmark benchmark was used to
simulate an I/O-intensive workload. Moreover, this was achieved without modifying the design or
source of the ext2 file system. We can conclude that for secure data deletion, overwriting at the block
level is both reliable and efficient as compared to overwriting at the file level.

4.4.7 Limitations & Future Scope

restFS is limited by the inability of native file systems to export the allocation map of a file
to the VFS, and hence the design cannot be applied to all file systems. Furthermore, though the
design is compatible with all file systems that export the file allocation map to the VFS, the
current implementation is specific to the ext2 file system; implementations for other such file
systems can, however, be achieved without modifying their design or source.
Nevertheless, block-level overwriting is very promising for journalling, log-structured and
versioning file systems, where file-level overwriting completely fails.

End

Chapter 5

File System Benchmarking


5.1 Introduction

File systems need to evolve, and whenever a new file system is designed or an existing file system
is refined, the augmentation needs to be validated for the amount of performance gained
or the extent of the problem mitigated. Comparing file systems solely on technical merits
and specifications is rarely useful for predicting actual performance. Therefore, the most
common approach is to empirically evaluate the performance of a file system. File system
benchmarks are used to gather performance data of file systems by running some workload
and a performance-data gathering tool. In simple words, file system benchmarks exercise every
possible corner of a file system and at the same time record data that point towards
its performance. File system benchmarking is almost as old as file system design, and
today there exist a large number of file system benchmarks. These benchmarks are currently
being used to evaluate the performance of different file systems for different workloads. They
are also being used to test any refinement or new idea in file systems. Though the motive of
each benchmark is the same, they do differ in their approaches, methodologies, granularity
and so on. These benchmarks can be classified in two ways:

1. A Synthetic or an Application Benchmark, and

2. A Macro or a Micro Benchmark

Synthetic benchmarks model a workload by executing various file system operations in a mix
consistent with the target workload, whereas application benchmarks consist of programs
and utilities that a user can actually use. For example, the Andrew benchmark, an application
benchmark, consists of five phases including creating directories, copying files, stat-ing files,
reading files and compiling. The operations chosen for the benchmark were intended to be
representative of an average user workload [Howard et al., 1988]. Similarly, iostone, a synthetic
benchmark, allows the user to measure file system performance on a string of file system
requests that is representative of measured system loads [Park et al., 1990]. Though
synthetic benchmarks are more flexible than application benchmarks, as they can scale better
with technology by providing a large number of parameters and modelled workloads,
their results are questionable as they don't measure any real work. Similarly, the
results of application benchmarks are inadequate as the workload may not actually represent
the real-world workload.
On the other hand, macro benchmarks measure the entire system by running some workload
which is either artificial in nature or is representative of a real-world workload, such as


Postmark [Katcher, 1997]. So, they can be either synthetic or application benchmarks, but in
both cases they are designed to answer a common question: "How will my workload perform on this
system?". Micro benchmarks, however, measure a specific part of the system. Though
artificial in nature, they are not designed to answer the same question; rather, they are
designed to give detailed insight into file system performance, such as the Sprite LFS benchmarks
[Rosenblum and Ousterhout, 1992].
Although there exist a large number of file system benchmarks, the quality of file system
benchmarking has not improved much. This is because benchmarking file and
storage systems is a complex case of benchmarking. Every file system has a single motive:
mediate access to data residing on the storage sub-system via a uniform notion of files
and directories. However, file systems differ in many ways, such as the type of underlying media (e.g.,
magnetic disk, optical disk, solid state memory, network storage, volatile and non-volatile
RAM, and so on), the storage environment (e.g., RAID, LVM, virtualization, and so on), the
workloads for which the file system is optimized (e.g., Web server, media server and so on),
their features (e.g., journalling, encryption, versioning, compression, and so on), and so
on. In addition, complex interactions exist between file systems, I/O devices, specialized
caches (e.g., buffer cache, disk cache), kernel demons (e.g., kflushd in Linux), and other
OS components. This complexity in the operation of file systems makes measuring file system
performance significantly more complicated than measuring that of the underlying storage subsystem
[Ruwart, 2001]. Therefore, current file system benchmarks suffer from several problems.
First, there is no standard file system benchmark. Though the closest to a standard
is the Andrew benchmark, even then some researchers use the original version while
others use a modified version. Second, existing benchmarks used to measure file systems are
inadequate as the results yielded are not useful: they don't point researchers towards
the areas which have potential for improvement and/or which need improvement. Third,
the motive of these benchmarks has become confined to winning wars by adding performance stars to
a file system rather than highlighting its deficiencies and/or merits.
The rest of this chapter is organised as follows. Section 5.2 discusses the benchmarking
guidelines proposed in various research papers to be considered while designing and/or running
file system benchmarks. Section 5.3 highlights the problem in current file system
benchmarking science and introduces a new benchmarking methodology to overcome it. Based on
the proposed benchmarking methodology, the section discusses the design, modelling and
implementation of a hypothetical file system called OneSec. Further, it empirically evaluates
the performance of the ext2 file system using the new benchmarking mentality introduced.


5.2 Proposed Guidelines for File System Benchmarking

This section discusses various diverse aspects that need to be considered while designing
and/or running a file system benchmark. These guidelines have been either explicitly pro-
posed by researchers or are implicitly present in their proposed benchmarks [Traeger et al.,
2008]. The goal of this section is to provide a comprehensive list of these guidelines and
understand various intricacies of file system benchmarking.

5.2.1 Evaluate performance of both metadata & userdata operations

[Tang, 1995] found that several file system benchmarks evaluate only a subset of file system
functionalities. These benchmarks usually evaluate the performance of userdata operations
and ignore the metadata operations. However, the performance of a file system relies on the
performance of both types of operations. Moreover, the metadata operations constitute a
large percentage of requests made to a file system. D. Tang recommended that a file system
benchmark should evaluate the performance of both types of operations.

5.2.2 Application specific benchmarking

[Seltzer et al., 1999] argued that most benchmarks do not provide useful information as they
are not designed to describe the performance of a particular application. They argued for
an application-directed approach to benchmarking using performance metrics that reflect
the expected behaviour of a particular application across a range of hardware or software
platforms.

5.2.3 Don’t reduce vector result to scalar

[Seltzer et al., 1999] also found that most of the macro-benchmarks report only a single value
that represents the overall performance of the file system. Although this way the benchmark
settles down the argument and wins the war, this is somewhat like reducing a vector quan-
tity to a scalar quantity. As a consequence, during this translation a significant amount of
important detail may be lost which could point designers towards possible improvements. In
order to get an accurate assessment of the performance of a file system, they suggested that
a benchmark should be performed over a range of performance parameters and the vector
results so obtained should be reported.


5.2.4 Ageing affects file system performance

[Smith and Seltzer, 1997] found that file systems under a workload comprising a consistent
mix of creation and deletion of files get highly fragmented. They argued that ageing of a file
system in this manner affects its performance. For small-sized file systems, the problem can
be solved by de-fragmentation, in which the on-disk data structures are rearranged sufficiently
to decrease the fragmentation. However, for large-sized file systems this is practically not
feasible. They suggested that file system benchmarks should not benchmark only newly
created file systems; rather, a file system should also be made to age and then benchmarked to
reveal the performance degradation due to ageing. Although ageing large file systems is time
consuming, there exist faster methods to age a file system, which include
TBBT [Zhu et al., 2005], running a long-term workload, copying an existing raw image, or
replaying a trace before running the benchmark.

5.2.5 Caching affects file system performance

Caching is used in systems to solve performance issues related to the I/O sub-system. [Traeger
et al., 2008] argued that it is not always clear whether benchmarks should be run with
"warm" or "cold" caches, because real systems do not generally run with completely cold
caches; on the other hand, the results of a benchmark that accesses too much cached data
may be unrealistic. However, they suggested that if cold-cache results are desired then caches
should be cleared before each run. This can be done by allocating and freeing large amounts
of memory, remounting the file system, reloading the storage driver, or rebooting. [Wright
et al., 2005] found that rebooting is more effective than the other methods. Furthermore, if
warm-cache results are desired then the experiment should be run N + 1 times, and the first
run's result should be discarded.

5.2.6 How to Validate the Results?

[Traeger et al., 2008] stressed that the results of a file system benchmark should be repro-
ducible which will allow others to validate the results. In order to accomplish this task file
system benchmarks should follow these two guidelines:

1. Every benchmark should explain what was done in as much detail as possible. This
allows others to understand the configuration, perform the experiment and reproduce the
results. This may include detailed hardware and software specifications. If possible, the
software should be made available to other researchers; releasing the source is preferred.


2. In addition to saying what was done, every benchmark should explain why it was
done that way. This allows others to understand the methodology and objectives of
benchmarks, the questions asked by the benchmark and the possible conclusions to be
drawn.

5.2.7 Miscellaneous

In addition to what has been mentioned earlier, there do exist some other well established
guidelines. They include:

1. File system benchmarks should scale with technology which means that they should
exercise each machine the same amount, independent of hardware and software speed.

2. File system benchmarks mostly measure the performance of a file system in terms of
throughput; however, users care more about latency.

3. File system benchmarks should generally be I/O bound, but a CPU-bound benchmark
may also be run for systems that exercise the CPU.

4. Each test should be run several times to ensure accuracy, and standard deviations or
confidence levels should be computed to determine the appropriate number of runs.
Also, tests should be run for a period of time sufficient for the system to reach steady
state for the majority of the run.

5. Macrobenchmarks and traces are intended to give an overall idea of how the system
might perform under some workload. In contrast, microbenchmarks are used to high-
light interesting features about the system. Therefore, an accurate assessment of a
file system’s performance can be achieved by running at least one macrobenchmark or
trace, as well as several microbenchmarks.

6. A trace definitely recreates the workload that was traced, but only if it is captured and
played back correctly. Furthermore, one must also ensure that the captured workload
is really the representative of the system’s intended real-world workload, and

7. All non-essential services and processes should be stopped before running the bench-
mark. These processes can cause anomalous results or higher-than-normal standard
deviations for a set of runs.


5.3 OneSec: A Modelled Hypothetical File System for Benchmarking Comparison

File system design has seen a lot of advancement in recent years. However, the benchmarks
designed to evaluate the performance of these file systems still lag behind. [Lucas, 1971] stated
some important reasons to perform benchmarking, which include: 1) to know which system
is better, 2) how to improve its performance, and 3) how well it will perform. Therefore, the
results obtained from benchmarking can have a significant impact, as they may
be used by customers in purchasing decisions, or by researchers to help determine a system's
worth and pinpoint possible areas for improvement. Unfortunately, file system benchmarks
have been used to win wars, as benchmarks can make any file system look good or bad by
benchmarking it according to the specification of that file system. That is why the quality
of file system benchmarks has not improved much [Tarasov et al., 2011]. Although a lot
of considerations for designing and running a benchmark have been suggested, as mentioned
earlier in section 5.2, they are not followed in spirit. This is one of the reasons why
benchmarks have not evolved, but we believe that the problem lies basically in the mentality
of file system benchmarking science.
Current file system benchmarking science follows a simple methodology: benchmark
the practical file system under observation and compare the performance data so gathered
with that of a reference file system (an older version of the same file system or some
other practical file system). This way, numbers decide the winner. However, there are many
problems with this mentality of benchmarking. First, the motive of benchmarking
drifts towards winning an argument rather than unveiling a better file system or pinpointing the
areas for improvement. Therefore, no standard benchmark is possible in this scenario, as
every benchmark exercises that part of the file system which is of interest to it. Due to this lack of
standardization in file system benchmarks, comparing results from different papers becomes
difficult. Second, the benchmark so drafted is executed on all file systems. The fact that
some file systems may be good at handling small files while others may be better at accessing
large files is completely ignored while selecting these file systems, leading to inadequate and
meaningless results. Third, even if the benchmark exercises all the aspects of a file system and
chooses unbiased and appropriate reference file systems for comparison, this mentality is still
problematic. Every practical file system has good and bad corners and, as such, imperfect file
systems are compared against each other. For a single class of file systems (say, file systems
good at accessing small files) benchmarking can result in confusing conclusions for the same


workload. For example, for two file systems F1 and F2 under the same intended workload, F1
may outperform F2 in some metric while F2 may outperform F1 in some other metric. This
makes result interpretation and decision making a difficult task. Fourth, even if the
reference file system is such that it outperforms in every possible metric, the benchmarking
is still limited: the better performing file system merely sets a higher limit on the
performance. Therefore, we are not able to decide how much actual improvement in some
metric is possible in a badly performing file system. Fifth, due to this mentality the results
of benchmarks are not portable; that is, the figures change if the hardware configuration
is changed (say, a 5400 rpm disk is replaced by a 7200 rpm one), although the relation between the
results may remain the same. As such, if a third file system is put into the comparison, it has to
be benchmarked using the same configuration. Finally, current benchmarks strive hard
to bring forth a better file system while less attention is paid to which areas of a file
system need improvement or have scope for improvement.
To deal with these problems, we propose benchmarking a practical file system against a
hypothetical file system: a file system that outperforms every practical file system of its class
in every metric. This change in the mentality of benchmarking eradicates the inherent problems
of current benchmarking science. First, the motive of benchmarking will drift from
winning an argument to refining the file system so as to approximate the hypothetical file system.
Second, for every class of file systems a standard hypothetical file system can be designed;
therefore, a standard will be achieved in benchmarking. Third, the practical file system will
always lag behind the hypothetical file system in every metric; hence, there are no confusing results.
Fourth, the upper limit on the improvement possible in the practical file system is set by the
hypothetical file system, which will be numerically higher than that set by any other practical
file system. Fifth, portability of results can be achieved if the hypothetical file system
operations are modelled as raw disk I/O operations. Finally, the results can be represented
and interpreted as figures signifying how close to, or far from, the better design the practical
design is for various operations of a file system. As such, this can point to specific areas
of the practical file system's design which need improvement or have scope for improvement.
To validate the worth of this new benchmarking mentality, this section explains the design,
modelling and implementation of a hypothetical file system, called OneSec, for benchmarking
practical file systems optimised for accessing small files.
The rest of this section is organised as follows. Section 5.3.1 discusses the design of
OneSec file system while section 5.3.2 explains the modelling of the system calls supported
by OneSec. Section 5.3.3 explains the implementation details of OneSec file system, section


5.3.4 discusses the benchmark, and section 5.3.5 discusses the experiment. Finally, section
5.3.6 discusses the results, section 5.3.7 presents the conclusion and section 5.3.8 discusses
the limitations and future scope of the work.

5.3.1 Design of OneSec File System

OneSec is a hypothetical file system whose only purpose is to serve as a benchmarking com-
parison for real fully-featured file systems. OneSec file system outperforms every practical
file system good at handling small files in every file system operation. The idea is to devise
an optimal lower-bound on the cost of various file system operations, and then compare the
performance of real-life file systems using those bounds. Being hypothetical allows us to
make any necessary but valid assumptions to make every operation of OneSec file system
outperform every corresponding operation in the real life file systems which are optimised for
reading and writing small files. However, the assumptions made should be such that these
operations can be modelled later for benchmarking purpose. OneSec file system takes into
consideration the algorithmic complexity of file system operations and the I/O bottleneck
of storage subsystems to make each operation represent a best-case performance baseline
against which other file systems can be compared. The design goals of OneSec hypothetical
file system are as follows:

1. Minimise the algorithmic complexity of the file system operations to reduce the number
of computations. This way the operations become less CPU bound and execute fast.

2. Minimise the number of disk I/O operations needed to be performed against any
file system operation. This way the operations become less I/O bound and execute fast.

3. Minimise the seek incurred by the file system operations while accessing userdata per-
taining to any individual file. This can be achieved by placing all the userdata belonging
to any individual file nearer to each other.

4. Minimise the seek distance between metadata and userdata of a file. This can be
achieved if the metadata belonging to any individual file is placed near its user data.

5. The file system operations should be modelled as raw disk I/O in order to be bench-
marked.

Keeping the enumerated design goals in view, many things were assumed in the design
of the OneSec file system. However, care has been taken to ensure that the assumptions made
for the hypothetical file system are necessary, valid and realistic. In other words, for every
assumption it is ensured that there exists a possibility of its implementation. This guideline
prevents the hypothetical file system from performing too unrealistically. Having said that,
the design of the OneSec file system is as follows:

1. All the information pertaining to a file, like attributes, permissions, size of the file, and so
on, except its userdata, is stored in not more than one sector of the disk, the smallest
readable and writeable unit of a disk. This design decision ensures that a minimum
number of sectors, which is one, is read or written in order to read or write all the metadata
pertaining to the file. Moreover, this feature of OneSec has given it its name: One
Sector.

2. The on-disk layout of the OneSec file system tries to place the metadata and userdata
belonging to a file as close as possible. It also tries to place all the userdata pertaining
to a file consecutively on disk. All this is done in order to reduce the seek involved while
accessing the metadata and userdata of a file. OneSec achieves this by having
only one type of data structure in which the first part holds the metadata while
the second part holds the userdata pertaining to the file; let us call them meta-user-
structures. These meta-user-structures are fixed-size structures, preferably integral
multiples of 4 KB in size. Furthermore, they run from the beginning to the end of the
volume. Figure 5.1 shows the logical on-disk layout of the OneSec file system.

Figure 5.1: Logical on-disk layout of OneSec file system. (Meta-user structures n, n+1, n+2, ... are placed consecutively on the volume; within each structure, the metadata sector of the file is followed by the first, second, ..., last userdata sectors of the file.)

It can be argued that this design decision puts limits on the number of files that can be
stored on the volume and on the amount of data that can be stored within a file. However,
it must be remembered that OneSec will be used only as a benchmarking comparison for
those file systems which are good at handling small files. Thus, choosing an appropriate
meta-user-structure size solves the file size limitation problem. Moreover, OneSec is
designed to evaluate only the performance of real-life file systems and not their size and
count limitations, advanced features, and so on. In spite of this, the file count limitation
can be solved by choosing an appropriately sized volume.

3. Having fixed-size meta-user-structures eliminates the need to maintain a metadata
structure to locate all the blocks associated with a file. Moreover, it eliminates the disk
I/Os needed to bring such a structure into memory, the algorithmic complexity involved in
processing it and, finally, the disk I/Os needed to reflect the changes.

4. Having fixed-size meta-user-structures also eliminates the need to maintain a metadata
structure for the list of free and allocated blocks within the file system.
Moreover, it eliminates the disk I/Os needed to bring such a structure into memory, the
algorithmic complexity involved in processing it and, finally, the disk I/Os needed to reflect
the changes.

5. Although consecutively placed fixed-length data structures (an array of fixed-size
elements) always come with scalability limitations, they at the same time reduce the
cost of management and processing of data structures. The same is true of meta-user-
structures in the OneSec file system. As every individual meta-user-structure uniquely
identifies a file's metadata and userdata, the file can be located just by knowing its
meta-user-structure's offset in the volume. This feature of OneSec is exploited during
the mapping of a filename to a unique meta-user-structure. A hash function hash() is
used to resolve every possible unique filename to a unique meta-user-structure. Proposals
like HashCache [Badam et al., 2009] and hashFS [Lensing et al., 2010] have shown
that this is possible for small files. This design decision eliminates the intermediate
disk I/Os needed to bring other metadata structures into memory and process them to
identify the location of both the metadata and userdata of any file. Furthermore, it reduces
the algorithmic complexity of identifying the location of both the metadata and userdata
pertaining to any file.

6. The conventional file system design in which there is a distinction between metadata
and userdata is maintained in the OneSec file system. Therefore, the efficiency of metadata
and userdata operations can be individually evaluated. Furthermore, optimizations
like placing all related data of a file nearby, suggested by [Ganger and Kaashoek, 1997],
and placing metadata and userdata on the same track, suggested by [Jun et al., 2002],
and so on, are also achieved.

5.3.2 syscalls supported and their modelling

A hypothetical file system like OneSec, which is meant to be used as a benchmarking
comparison for real-life file systems, needs to be modelled in order to be benchmarked. The
design of OneSec dictates the number and type of file system syscalls supported by it. In
fact, OneSec was designed with a view to supporting all possible file operations. Moreover, the
syscalls supported by OneSec can be modelled as raw disk I/O operations, since all these
syscalls demand reading and/or writing a predictable number of sectors. The following text
discusses these syscalls and their modelling.
As per the design of the OneSec file system, we have identified eight different types of syscalls
supported by it which are commonly used by benchmarks to evaluate the performance
of a file system. It must be noted that the rest of the syscalls can be modelled in the same way;
however, for the sake of benchmarking these syscalls are enough. Also, as the OneSec
file system design does not support the concept of directories, these syscalls are specifically
meant for file operations only. Table 5.1 lists these eight syscalls supported by the OneSec file
system along with the type and number of times a function is executed and the number of sectors
read and/or written.

Table 5.1: OneSec: syscalls supported & their modelling

syscall Supported   Functions Executed   Number of Disk Sectors Read/Written
creat()             1 hash()             1 R & 1 W
stat()              1 hash()             1 R
chmod()             1 hash()             1 R & 1 W
unlink()            1 hash()             1 R & 1 W
open()              1 hash()             1 R
read()              –                    n R
write()             –                    n W
close()             –                    1 W

1. creat(): The creat() system call, which creates an empty file, will have to execute the
hash() function to get the unique meta-user-structure's offset within the volume depending
upon the filename passed. If that file already exists, then the status field within
the metadata sector of the meta-user-structure will contain an appropriate value. If it
contains an insane value, indicating that the file does not exist, then that metadata sector will be
overwritten with appropriate values and an empty file will be created. So, to create an
empty file we need to execute the hash() function once, and read and write a single
sector of the disk.

2. stat(): The stat() system call, which reads the metadata information of a file, needs to
execute the hash() function once and read a single metadata sector.

3. chmod(): The chmod() system call and other related syscalls which update the metadata
of a file need to execute the hash() function once, read a single metadata sector to make
sure that the file exists and finally update it by writing it back to disk.

4. unlink(): The unlink() system call, which deletes a file, executes the hash() function once,
reads a single metadata sector to make sure that the file exists and finally updates its
status field by writing it back to disk.

5. open(): The open() system call, which opens a file for reading or writing, will execute
the hash() function once and read a single metadata sector.

6. read(): The read() system call, which reads a file, will have the file already opened and
its metadata sector read and buffered, and thus will need to read sequentially as many data
sectors as are required by the call.

7. write(): The write() system call, which writes a file, will have the file already opened and
its metadata sector read and buffered, and thus will need to write sequentially as many data
sectors as are required by the call.

8. close(): The close() system call, which closes a file, will have the file already opened and
its metadata sector read and buffered, and thus will need to write a single metadata sector back
to the disk to reflect any changes made to the file's metadata, like the size of the file, and
so on.
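The modelling in Table 5.1 can be captured directly in code. The following sketch encodes, per syscall, the number of hash() executions and the sectors read and written; the names and the structure are illustrative and not part of the OneSec implementation.

#include <string.h>

/* Per-syscall cost model of OneSec, following Table 5.1.
 * n is the number of userdata sectors touched by read()/write(). */
struct onesec_cost {
    const char *syscall;
    int hash_calls;
    int sectors_read;
    int sectors_written;
};

static struct onesec_cost model(const char *syscall, int n)
{
    if (!strcmp(syscall, "creat") || !strcmp(syscall, "chmod") ||
        !strcmp(syscall, "unlink"))
        return (struct onesec_cost){ syscall, 1, 1, 1 };
    if (!strcmp(syscall, "stat") || !strcmp(syscall, "open"))
        return (struct onesec_cost){ syscall, 1, 1, 0 };
    if (!strcmp(syscall, "read"))
        return (struct onesec_cost){ syscall, 0, n, 0 };
    if (!strcmp(syscall, "write"))
        return (struct onesec_cost){ syscall, 0, 0, n };
    /* close(): only the buffered metadata sector is written back. */
    return (struct onesec_cost){ syscall, 0, 0, 1 };
}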

5.3.3 Implementation of OneSec File System

The OneSec file system is implemented on the Linux OS as a Linux Loadable Kernel Module (LKM).
The module registers the file system type onesec with the kernel as a dev file system and, when it
is mounted on top of some block device, the OneSec file system module supports all the syscalls
mentioned in section 5.3.2.
The meta-user-structure size selected for OneSec is 4 KB. This 4 KB size is common among
various data structures needed by OneSec for processing, like the page size of virtual memory,
the block size of the disk driver, the file size in common file system workloads and so on. This creates
compatibility among the various OS components needed for efficiency and reduces the complexity
of the code. Because the metadata sector of each meta-user-structure of OneSec is 512 bytes
long, the maximum amount of userdata that can be stored in any individual file on this file
system is, therefore, 3.5 KB. Furthermore, to simplify the implementation of the hash() function,
only filenames containing numeric strings are recognised (i.e. names like '0', '1', and so
on). The hash() function simply converts the string filename into an integer which points to
the offset of the meta-user-structure for the specified filename. Figure 5.2 shows the interaction
of the OneSec file system with other kernel components of the Linux OS.

Figure 5.2: OneSec interaction with kernel components of the Linux OS. (File system operations enter the VFS, which uses the inode cache and dentry cache; the VFS calls into OneSec, which accesses the magnetic disk through the buffer cache.)
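As a concrete illustration of the mapping just described, the following user-space sketch shows how a purely numeric filename might be resolved to the byte offset of its meta-user-structure. The function name and error handling are assumptions made for illustration; the in-kernel hash() works on the same principle.

#include <stdlib.h>

#define META_USER_STRUCT_SIZE 4096ULL  /* 4 KB per meta-user-structure */

/* Resolve a numeric filename (e.g. "0", "1", "4711") to the byte offset
 * of its meta-user-structure on the volume. Returns -1 for a bad name. */
long long onesec_hash(const char *filename)
{
    char *end;
    unsigned long long index = strtoull(filename, &end, 10);

    if (end == filename || *end != '\0')
        return -1;  /* not a purely numeric filename */

    return (long long)(index * META_USER_STRUCT_SIZE);
}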


5.3.3.1 Callback functions of OneSec file system module

The OneSec file system module registers many callback functions. First, it registers a callback
function onesec_get_super(). This function is called when OneSec is mounted on
top of some block device. It calls the generic get_sb_single() function provided
by libfs and passes it the address of the onesec_fill_super() function provided by the OneSec
module. onesec_fill_super() performs two main operations for the OneSec file system. First, it creates
a dummy SuperBlock for the OneSec file system, as it does not have a real one. During this, the
function sets various parameters of the OneSec file system, like block size, magic number and
so on, required by a Linux file system. However, no disk I/O is performed. Moreover, the
superblock-operations vector is set to the default VFS operations provided by libfs.
Second, it creates a dummy root directory for the OneSec file system. This is mandatory
for hierarchical file systems in Linux, and as the OneSec design does not support the concept of
directories, it creates a dummy one. This is achieved by creating a VFS-inode and a dentry,
and setting it as the root directory in the super_block struct of OneSec. Finally, the dentry
of the root is added to the dcache of the VFS for faster lookup. Note that all newly created
VFS inodes are automatically added to the inode cache; however, dentries need to be added
explicitly to the dcache. Furthermore, the file-operations vector of the root inode is set to
simple_dir_operations provided by libfs, whereas its inode-operations vector is set to
onesec_dir_inode_ops, explicitly provided by the OneSec module. Also, no disk I/O is performed
within this function.
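For readers unfamiliar with how such a module hooks into the kernel, a minimal registration skeleton for a 2.6.3x kernel might look roughly as follows. This is a hedged sketch assembled from the callbacks named in the text (onesec_get_super(), onesec_fill_super()); it is not the thesis's actual source.

#include <linux/module.h>
#include <linux/fs.h>

/* fill_super callback described above: builds the dummy SuperBlock and
 * dummy root directory (see Algorithm 5.1); no disk I/O is performed here. */
static int onesec_fill_super(struct super_block *sb, void *data, int silent);

/* get_sb callback: delegate to the generic single-superblock helper. */
static int onesec_get_super(struct file_system_type *fst, int flags,
                            const char *dev_name, void *data,
                            struct vfsmount *mnt)
{
        return get_sb_single(fst, flags, data, onesec_fill_super, mnt);
}

static struct file_system_type onesec_type = {
        .owner   = THIS_MODULE,
        .name    = "onesec",
        .get_sb  = onesec_get_super,
        .kill_sb = kill_litter_super,
};

static int __init onesec_init(void)
{
        return register_filesystem(&onesec_type);
}

static void __exit onesec_exit(void)
{
        unregister_filesystem(&onesec_type);
}

module_init(onesec_init);
module_exit(onesec_exit);
MODULE_LICENSE("GPL");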
The OneSec module registers two inode-operations in the onesec_dir_inode_ops vector,
namely onesec_creat() and onesec_unlink(), used to create a new entry and delete
an existing entry in a directory, respectively.
struct inode_operations onesec_dir_inode_ops = {
        /* Operations explicitly provided by OneSec */
        .create = onesec_creat,   /* VFS file-creation callback */
        .unlink = onesec_unlink,
        /* Default operations provided by libfs */
        .lookup = simple_lookup,
        .rename = simple_rename,
        ...
};

The OneSec module also registers three file-operations in the onesec_file_ops vector, namely
onesec_open(), called whenever a file is read or written, onesec_read(), called whenever a file
is read, and onesec_write(), called whenever a file is written.


struct file_operations onesec_file_ops = {
        /* Operations explicitly provided by OneSec */
        .open  = onesec_open,
        .read  = onesec_read,
        .write = onesec_write,
};

5.3.3.2 Calling sequence of OneSec callback functions

The calling sequence in the OneSec file system module for operations explicitly provided by the
module is as follows. After mounting, i.e., after the dummy SuperBlock and root directory
are created, whenever a new file is created in the root directory, the onesec_creat() function
is called. This function creates a new VFS-inode, sets its file-operations vector to
onesec_file_ops, creates a dentry for it and adds it to the dcache, in addition to setting other
necessary parameters for the VFS-inode, such as its size and so on. Note that, in the OneSec file
system, if a process wants to read from a file, the dentry of that file should exist in the dcache;
this is because the lookup function is not overridden in the OneSec module. If the dentry exists,
the onesec_open() function will be called, followed by the onesec_read() function. On the contrary,
whenever a process wants to write a file, if the file does not exist onesec_creat() will be
called; if an entry exists in the dcache, then the onesec_open() function will be called,
followed by the onesec_write() function. Finally, when a file is deleted, the onesec_unlink()
callback function will be called. Naturally, this will be called only when the dentry for the file
exists in the dcache.
Algorithm 5.1 shows the implementation of the onesec_fill_super() function.

Algorithm 5.1 Algorithm for onesec_fill_super() of OneSec


1: superBlock → s_magic ← ONESEC_MAGIC
2: {...}
3: root ← make_inode()
4: root → i_op ← &onesec_dir_inode_ops
5: root → i_fop ← &simple_dir_operations
6: dentry ← make_dentry()
7: add_dentry_dcache(dentry)
8: superBlock → s_root ← dentry

Algorithm 5.2 shows the implementation of the onesec_creat() callback function.


Algorithm 5.2 Algorithm for onesec_creat() of OneSec


1: inode ← make_inode()
2: inode → i_fop ← &onesec_file_ops
3: dentry ← make_dentry()
4: add_dentry_dcache(dentry)
5: {...}

Algorithm 5.3 shows the implementation of the onesec_unlink() callback function.

Algorithm 5.3 Algorithm for onesec_unlink() of OneSec


1: {Identify meta-user-structure offset on disk}
2: {Overwrite the first sector, i.e. metadata}
3: remove_dentry_dcache(dentry)
4: remove_inode_icache(inode)
5: {...}

Algorithm 5.4 shows the implementation of the onesec_open() callback function.

Algorithm 5.4 Algorithm for onesec_open() of OneSec


1: {Identify meta-user-structure offset on disk}
2: {Read the first sector i.e. metadata}
3: {...}

Algorithm 5.5 shows the implementation of the onesec_read() callback function.

Algorithm 5.5 Algorithm for onesec_read() of OneSec


1: {Identify meta-user-structure offset on disk}
2: {Read the whole 4 KB structure}
3: {...}

Algorithm 5.6 shows the implementation of the onesec_write() callback function.

Algorithm 5.6 Algorithm for onesec_write() of OneSec


1: {Identify meta-user-structure offset on disk}
2: {Write the whole 4 KB structure}
3: {...}


It must be noted here that all reads and writes in the OneSec file system are done synchronously.
Though this adds performance overhead, it gives accurate benchmarking results. However, all
reads and writes go through the buffer cache.
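To illustrate what synchronous I/O through the buffer cache looks like at the block layer, the following kernel-side sketch writes one file-system block and waits for it to reach the disk. The helper name and the surrounding error handling are assumptions; OneSec's actual read and write paths are the ones described by Algorithms 5.4-5.6.

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/errno.h>
#include <linux/string.h>

/* Synchronously write one file-system block through the buffer cache.
 * 'blk' is the block number on the volume; 'data' holds sb->s_blocksize bytes. */
static int onesec_write_block_sync(struct super_block *sb, sector_t blk,
                                   const void *data)
{
        struct buffer_head *bh;

        bh = sb_bread(sb, blk);          /* bring the block into the buffer cache */
        if (!bh)
                return -EIO;

        memcpy(bh->b_data, data, sb->s_blocksize);
        mark_buffer_dirty(bh);           /* mark the cached block dirty           */
        sync_dirty_buffer(bh);           /* write it out and wait for completion  */
        brelse(bh);
        return 0;
}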

5.3.4 The Benchmark

The new benchmarking methodology introduced here does not directly answer questions such
as "how does my system compare to current similar systems?", "how does my system behave
under its expected workload?" and "what are the causes of my performance improvements
or overheads?". Rather, it directly answers only one question: "how near or far is my
design from the better design for a particular class of file systems?". However, the answers to
the former questions are derived from the answer to this question. The process begins by designing
a hypothetical file system for a class of file systems (say, file systems good at handling small
files) which outperforms every file system within the class for every performance metric. After
this, an appropriate benchmark is chosen, preferably a micro-benchmark, which individually
highlights various aspects of a file system. Furthermore, the real-life file systems belonging
to that class are selected, and they, along with the hypothetical file system, are benchmarked
using this benchmark. As such, the results for any individual file system primarily and
directly indicate its nearness to the ideal file system design. Moreover, they highlight the areas
which are very efficient or may need improvement. Furthermore, when compared with the
results of other real-life file systems under evaluation, they give us a comparative analysis.

5.3.4.1 Benchmarking Methodology

The OneSec file system is one such hypothetical file system, designed to be the golden yardstick
for those file systems which are supposed to handle a large number of small files. The micro-
benchmark chosen to evaluate the various aspects of the OneSec file system is the Sprite LFS small-file
micro-benchmark. Sprite LFS small-file writes, reads and finally deletes a large number of small
files.
The benchmarking methodology exercises the OneSec file system using the Sprite LFS small-file
benchmark in two configurations. In the first configuration, the OneSec file system does not perform
any disk I/O within the onesec_creat(), onesec_unlink(), onesec_open(), onesec_read()
and onesec_write() calls. During these calls the buffer is filled with some bogus data or
emptied accordingly. This is done to extract the aggregate performance overhead added by
the file system framework of the OS, i.e. the VFS of Linux, and the costs added by the OS
architecture. It should be noted that OneSec implements the minimal bare-bone functionality of
creating a file, writing data to it, reading data from it and deleting it. When such a file
system is benchmarked without letting it do any disk I/O, the results point to the
minimal overhead that has to be paid by every file system.
The OneSec file system design is straightforward, and a lot of compromises have been
made to make its operations efficient. As an example, OneSec does not support any advanced
features of a file system like journalling, nor even basic functionalities like mmap(); worse, it
does not support a superblock, directories, allocation/deallocation tables and much more.
One may argue that this design is not realistic; indeed it is not. The motive of this design is
to put lower bounds on the algorithmic complexity of operations, and on the number and pattern of
disk accesses, in order to yield an optimised file system for a specific set of operations.
In the second configuration, the OneSec file system performs the predefined number of disk I/Os
required for each file system operation within the various OneSec module functions, as per
Table 5.1. This is done to yield the total cost involved during the execution of the Sprite LFS
small-file benchmark. Now, using these two statistics, we can obtain the costs of performing
disk I/Os.
Therefore, at the end of this process we know the minimal costs added by the OS, the
minimal disk I/O costs and the lower bound on the performance of a file system which is good
at handling small files.
After this, a real-life file system is exercised using the Sprite LFS small-file benchmark. As
expected, its results will lag behind the results of the OneSec file system benchmarked in the
second configuration. When these results are compared with those of OneSec, the
comparison highlights those areas of the file system under evaluation whose efficiency is
nearer to that of the efficient one, i.e. OneSec, and also those areas which are far away from
OneSec and hence need improvement.
Moreover, when the results of the file system under evaluation are normalised by
subtracting the minimal OS costs yielded by benchmarking OneSec in the first configuration, the
costs of disk I/Os in this file system surface. By calculating the percentages of OS costs and
disk I/O costs for this file system we can decide the area in which a small improvement can
yield a large improvement in total cost.
Similarly, such statistics can be calculated for other file systems and a comparative analysis
can be made. However, the focus of this section is to yield the performance analysis of a
single file system.


5.3.5 Experiment

The experiment was conducted on an Intel based PC with an Intel Core i3 CPU 550 @ 3.20
GHz with 4 MB cache in total and 2 GB DDR3 1333 MHz SDRAM. The hard drive used was a
7200 rpm 320 GB SATA drive with an on-board cache of 8 MB. The drive was partitioned into a
20 GB primary partition to host the Fedora Core 14 operating system with kernel version
2.6.35.14-95.fc.14.i686 and another 20 GB partition to mount the OneSec and ext2 file
systems to be benchmarked, one at a time. The experiment considered the ext2 file system to be
benchmarked against the OneSec file system, as ext2 is the most notable high-performance
file system for handling large numbers of small files.
The Sprite LFS small-file benchmark creates 10,000 1 KB files, reads them and finally
deletes them. However, the benchmark was modified to write and read 3.5 KB instead of
1 KB. This modification has two reasons: 1) file system workloads comprise a large number
of small files, typically less than 4 KB in size, which are frequently accessed [Agrawal et al.,
2007][Tanenbaum et al., 2006], and 2) the meta-user-structure of OneSec can hold a maximum
of 3.5 KB of user data. The modified Sprite LFS small-file benchmark was implemented
as a shell script using bash version 4.1.7(1). Furthermore, the data was written and
read in its entirety. Moreover, the caches were not flushed between the phases of the Sprite LFS
small-file benchmark, as real-life file systems generally operate with warm caches.
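For illustration, the workload of the modified benchmark can be approximated by the following C sketch (the thesis's actual implementation was a bash script; the file naming, buffer contents and omitted error handling here are assumptions).

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define NFILES   10000
#define FILESIZE 3584  /* 3.5 KB, the maximum userdata per OneSec file */

int main(void)
{
    char name[32], buf[FILESIZE];
    int i, fd;

    memset(buf, 'x', sizeof(buf));

    /* Phase 1: create and write 10,000 3.5 KB files (numeric names suit OneSec). */
    for (i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "%d", i);
        fd = open(name, O_CREAT | O_WRONLY, 0644);
        write(fd, buf, sizeof(buf));
        close(fd);
    }

    /* Phase 2: read them back in their entirety. */
    for (i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "%d", i);
        fd = open(name, O_RDONLY);
        read(fd, buf, sizeof(buf));
        close(fd);
    }

    /* Phase 3: delete them in the same sequence. */
    for (i = 0; i < NFILES; i++) {
        snprintf(name, sizeof(name), "%d", i);
        unlink(name);
    }
    return 0;
}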
It is worth mentioning here that, as the OneSec file system does not support the concept of
directories, the benchmark was run in the root directory of the ext2 file system. Furthermore,
both file systems were mounted on the same 20 GB partition. This was done to approximate
the disk latency and the effect of ZCAV on the performance of both file systems. In
addition, the ext2 file system was newly created using default parameters. Furthermore, the
experiment was carried out with Linux running at run-level 1 to reduce the random effect of
other applications and demons. Also, the experiment was repeated 10 times and the results
for each phase were averaged; the standard deviation was less than 5 % of the average.

5.3.6 Results & Discussion

Table 5.2 presents the results of the Sprite LFS small-file benchmark executed on the OneSec file
system in both configurations (with and without disk I/Os). This table clearly validates the
claim made by the OneSec design, as the costs of disk I/O are very small as compared to the
costs imposed by the OS in every phase. Moreover, Table 5.3 presents these overheads as
percentages of the total cost in each phase of the benchmark. It is evident from this table that
the performance overhead added by the OS in all phases of the benchmark is at a minimum of
76.93 %. This indicates that file systems in Linux cannot improve alone; rather, evolution of
the file system framework in Linux is necessary to bring down the total cost significantly. For
example, in the OneSec file system a 50 % improvement in the costs imposed by the OS can reduce the
total cost by a minimum of 38.46 %, whereas the same improvement in the costs of disk I/Os can
reduce the total cost by a maximum of 11.53 %.

Table 5.2: OneSec: Disk I/O and OS costs during Sprite LFS small-file benchmark execution

                              OneSec (without   OneSec (with   Disk I/O    OS Costs
                              disk I/O)         disk I/O)      Costs
Create 10,000 3.5 KB files    25.9278 s         33.7018 s      7.7740 s    25.9278 s
Read 10,000 3.5 KB files      25.8784 s         33.3780 s      7.4996 s    25.8784 s
Delete 10,000 3.5 KB files    24.9630 s         28.3270 s      3.3640 s    24.9630 s

Table 5.3: Percentage of overheads added by disk I/Os and OS for OneSec

                              Overhead by I/Os   Overhead by OS
Create 10,000 3.5 KB files    23.07 %            76.93 %
Read 10,000 3.5 KB files      22.47 %            77.53 %
Delete 10,000 3.5 KB files    11.88 %            88.12 %

Table 5.4 compares the efficiency of the ext2 file system with that of the OneSec file system.
The results clearly show the lead of the OneSec file system in all phases of the Sprite LFS small-file
benchmark, which further validates our claim. Moreover, the results point out that all
operations in the ext2 file system are quite near to the efficient design. However, the read
operation of ext2 is more efficient than its write operation. Furthermore, the comparison
pointed out that the read operation of the ext2 file system is 96.86 % nearer to the efficient
design, while the deletion operation is 95.62 % and the writing operation is 95.39 % nearer.
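(These efficiency figures appear to be the ratio of the OneSec time to the ext2 time for each phase; for example, 33.3780 s / 34.4588 s ≈ 96.86 % for the read phase.)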
Furthermore, Table 5.5 presents the results of the Sprite LFS small-file benchmark execution
on the ext2 file system. The table also reports the disk I/O costs of the ext2 file system, obtained
by normalising the results against the minimal OS costs yielded by benchmarking the OneSec file system.
Also, the percentages of overhead added by disk I/Os and by the OS to the total cost are calculated.


Table 5.4: Comparison of ext2 file system with OneSec file system

                              OneSec      ext2        Efficiency of ext2
Create 10,000 3.5 KB files    33.7018 s   35.3298 s   95.39 %
Read 10,000 3.5 KB files      33.3780 s   34.4588 s   96.86 %
Delete 10,000 3.5 KB files    28.3270 s   29.6260 s   95.62 %

Table 5.5: Disk I/O costs, OS costs and their overhead percentage in ext2 file system

                              ext2        OS costs    Disk I/O    Overhead by   Overhead
                                                      costs       disk I/Os     by OS
Create 10,000 3.5 KB files    35.3298 s   25.9278 s   12.7522 s   32.97 %       67.03 %
Read 10,000 3.5 KB files      34.4588 s   25.8784 s   10.5056 s   28.87 %       71.13 %
Delete 10,000 3.5 KB files    29.6260 s   24.9630 s   6.9730 s    21.83 %       78.17 %

Compared to the OneSec file system, the disk I/O costs of ext2 are numerically much higher.
Furthermore, the percentage overhead of disk I/Os is also comparatively large. This clearly
indicates that the ext2 file system can improve significantly by reducing its disk I/Os, as the same
task can be accomplished with lower disk I/O costs by OneSec.

5.3.7 Conclusion

We argued that current benchmarking science is problematic as it compares practical file
systems against each other, which leads to incomplete and confusing results. As a solution,
we proposed benchmarking a class of practical file systems against a hypothetical file
system which outperforms every file system from that class in every performance metric.
To validate our concept, we introduced an efficient hypothetical file system, namely
OneSec, which outperforms every file system good at handling small files in every performance
metric. The design motive is to set a lower bound on the number of disk I/Os to be
performed for any file system operation and on the algorithmic complexity of these operations.
We implemented the OneSec file system on Linux as a Linux Loadable Kernel Module and
benchmarked it in two configurations using the Sprite LFS small-file benchmark. In one
configuration, no disk I/O was performed for operations, while in the other a pre-defined number
of disk I/Os was performed for each operation. We also benchmarked the ext2 file system using
the same benchmark and discussed the results.
Several conclusions can be drawn from this discussion. First, we found that in the OneSec file
system the disk I/O overhead is at most 23.07 % of the total cost, whereas the OS overhead is at
least 76.93 %. This highlights an inefficiency on the part of the Linux OS, which has the larger
share in deteriorating file system performance, and points to the need for the Linux OS to evolve
in order to deliver significant performance gains to file systems.
Second, we found that the OneSec file system outperforms the ext2 file system in all phases of
the Sprite LFS small-file benchmark, which validates the claim made by the OneSec design.
Furthermore, all operations of the ext2 file system are reasonably close to the ideal design: the
comparison against OneSec showed that the read operation of ext2 is 96.86 % as efficient as the
efficient design, while deletion and writing are 95.62 % and 95.39 % as efficient, respectively.
Finally, after comparing the normalised disk I/O costs of the ext2 file system with those of
OneSec, we found that the overhead added by disk I/Os in ext2 is significant compared to OneSec.
Thus, the ext2 file system should consider reducing these costs in order to deliver better
performance.

5.3.8 Limitations & Future Scope

The OneSec file system scales only as far as its design matches the benchmark. For example, the
Sprite LFS small-file benchmark writes all files sequentially, then reads them sequentially and
finally deletes them in the same sequence. This suits the OneSec design perfectly: no operation of
this benchmark incurs a seek or rotational delay as long as filenames start from 0 and consecutive
filenames differ by 1. However, if a benchmark picks random filenames, the OneSec design will not
scale and has to be tweaked for that benchmark.
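
To illustrate the property exploited here, a file-to-location mapping of the following kind makes
consecutive numeric filenames land on consecutive disk locations, so a sequential benchmark never
seeks. This is only a sketch of the idea; the constants and helper shown are illustrative, not
OneSec's actual values or code.

    /* Illustrative sketch of the sequential-filename property: if file "n" is
     * laid out at a fixed location derived from its name, the Sprite LFS
     * small-file benchmark (names 0, 1, 2, ...) touches consecutive sectors.
     * DATA_START_SECTOR and SECTORS_PER_FILE are hypothetical constants. */
    #include <stdlib.h>

    #define DATA_START_SECTOR 2048UL  /* hypothetical start of the data area    */
    #define SECTORS_PER_FILE     8UL  /* hypothetical fixed allocation per file */

    static unsigned long name_to_sector(const char *filename)
    {
        unsigned long n = strtoul(filename, NULL, 10);  /* "0", "1", "2", ... */
        return DATA_START_SECTOR + n * SECTORS_PER_FILE;
    }
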
Nevertheless, the idea behind OneSec can be extended to design and develop similar hypothetical
file systems for other classes of file systems for benchmarking comparison. In addition, within
each class, the design can be optimised for diverse benchmarks.

End

Chapter 6

Conclusions & Future Scope


6.1 Conclusions Drawn

File systems, like other components of an operating system, need to evolve. In this thesis, we
argued that this is because hardware technology keeps advancing and, as a consequence, user
requirements keep changing. To cope with these changes, file systems need modifications. However,
modifying the design of a file system to overcome these challenges has its limitations. This work
aimed to show that file systems can still evolve while keeping their design and source intact. To
support this argument, we worked on three basic metrics of file system design and showed that each
can be enhanced without modifying the design or even the source. The following text summarises the
conclusions drawn for each metric.
In File System Scalability, a scalable user-space virtual file system called suvFS was
designed, implemented and evaluated. suvFS breaks the file size limitation of any existing file
system, without modifying the design or source of that file system, by exploiting the concept of
virtual unification. suvFS works by splitting a large file, which cannot be created in its
entirety, into small fragments. The associated fragments are virtually unified to present a single
large virtual file to user applications. This large virtual file supports all the file system
operations that other physical files do. The virtualisation and other housekeeping functionalities
are provided by suvFS and are totally independent of the file system in question. Furthermore,
suvFS was implemented as a FUSE file system to achieve portability. The experiment to validate the
claim of suvFS and to evaluate the possible performance overhead was carried out on the FAT32 file
system. The results showed that suvFS accomplishes its goals successfully, although it adds a
performance overhead. Nevertheless, the overhead is mainly due to the FUSE framework, and we
believe it will be amortised by faster hardware. Furthermore, the performance overhead is
negligible during reading, while during writing and deletion it is affordable. The normalised
statistics revealed that the average performance overhead added by the suvFS algorithms is as low
as 0.59 % during reading, while during writing and deleting it is as high as 11.22 % and 14.77 %
respectively. However, the average growth of the performance overhead on doubling the file size is
0.51 % for reading, 6.73 % for writing and 9.54 % for deletion. From this, we conclude that the
file size scalability limitation of any file system can be overcome without touching its design or
source.
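
To make the fragment-unification idea concrete, the sketch below shows the kind of offset
arithmetic a unification layer such as suvFS has to perform on every read or write. The fragment
size and the ".fragNNN" naming used here are illustrative assumptions, not suvFS's actual
convention.

    /* Sketch: map an offset within a large virtual file onto the fragment that
     * physically stores it.  FRAG_SIZE and the ".fragNNN" naming scheme are
     * illustrative assumptions only. */
    #include <stdio.h>

    #define FRAG_SIZE (1ULL << 30)              /* assume 1 GiB fragments */

    struct frag_pos {
        unsigned long      index;               /* which fragment holds the byte */
        unsigned long long offset;              /* offset inside that fragment   */
    };

    static struct frag_pos locate(unsigned long long virt_off)
    {
        struct frag_pos p;
        p.index  = (unsigned long)(virt_off / FRAG_SIZE);
        p.offset = virt_off % FRAG_SIZE;
        return p;
    }

    static void frag_name(char *buf, size_t len, const char *virt_name,
                          unsigned long index)
    {
        /* e.g. "movie.iso" + fragment 2 -> "movie.iso.frag002" */
        snprintf(buf, len, "%s.frag%03lu", virt_name, index);
    }
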
In File System Performance, a high-performance hybrid FAT32 file system called hFAT was
designed, simulated and evaluated. hFAT works by placing the central metadata store (the FAT
Table) and the hot zone of the file system (the directory clusters) on a solid state storage
device, while the rest of the user data resides on the magnetic disk drive. The idea is to
eliminate the access latencies incurred by FAT file systems during accesses to the FAT Table and
directory clusters. However, in order to keep the design of the FAT32 file system intact, we
proposed hFAT to be slipped in as a stackable device driver. The stacked driver forwards each
request made by the FAT32 file system to the appropriate block device driver and returns the
results to the upper layers. Furthermore, we simulated the behaviour of the hFAT stackable device
driver and evaluated the performance gains by replaying a block trace collected while exercising a
FAT32 file system with the Sprite LFS small-file benchmark. The results indicate that hFAT can
reduce the latency incurred by FAT32 file system operations by at least 25 %, 10 % and 90 % during
writing, reading and deleting a large number of small files respectively, provided a solid state
storage device with a latency less than or equal to 10 % of that of the magnetic disk is used as
an addition. From this, we conclude that the performance challenges of any file system can be
mitigated without touching its design or source.
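
In essence, the stacked driver makes one routing decision per block request. The sketch below
shows that decision in isolation; the region description is a deliberate simplification of the
real FAT32 volume layout, and the structure and function names are ours, not hFAT's.

    /* Sketch: per-request routing of an hFAT-style stackable block driver.
     * Requests falling in the FAT-table region or a directory cluster go to
     * the solid state device; everything else goes to the magnetic disk. */
    #include <stdbool.h>

    enum target { TO_SSD, TO_HDD };

    struct fat_layout {
        unsigned long fat_start, fat_end;        /* sectors holding the FAT table */
        bool (*is_dir_sector)(unsigned long);    /* lookup into a directory map   */
    };

    static enum target route(const struct fat_layout *l, unsigned long sector)
    {
        if (sector >= l->fat_start && sector <= l->fat_end)
            return TO_SSD;                       /* FAT table: metadata hot zone  */
        if (l->is_dir_sector(sector))
            return TO_SSD;                       /* directory cluster             */
        return TO_HDD;                           /* ordinary user data            */
    }
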
In File System Extensibility, we found that there exists a plethora of techniques to extend
file systems (at both kernel and user level), that almost all OSes support file system layering,
and that there is a wide variety of applications of layering. We picked the most important but
much neglected aspect of file systems: after-deletion data recovery. A reliable and efficient
vnode-stackable file system for secure data deletion, called restFS, was designed, implemented and
evaluated. restFS performs secure data deletion by overwriting, and works at the block level
rather than the file level. restFS exploits two possibilities that arise in file systems at the
block level: 1) recently deallocated blocks may get re-allocated, and 2) the blocks to be
overwritten, even when they belong to different files, may be placed sequentially on disk. restFS
harvests these possibilities to perform a few large I/Os instead of many small I/Os, reducing the
number and frequency of disk writes issued, in addition to saving the overwrites of those
deallocated blocks which get reallocated. restFS is design-compatible with all file systems that
export their file allocation map to the VFS; however, it is currently implemented for the ext2
file system. The experimental results showed that these two possibilities are complementary and
that using restFS on top of ext2 can save between 28 % and 98 % of block overwrites while reducing
the number of disk write commands issued by up to 88 % compared to existing techniques, when
exercised with the Postmark benchmark. From this, we conclude that new features can be added to
any file system without touching its design or source.
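
The efficiency argument here reduces to coalescing the queued block overwrites into contiguous
extents so that one large write replaces many small ones. A minimal user-space sketch of that
coalescing step is shown below; block numbers are assumed to be sorted, and submission of the
actual overwrite I/Os is omitted.

    /* Sketch: coalesce a sorted list of deallocated block numbers into
     * contiguous extents so that each extent can be overwritten with one
     * large write instead of one write per block. */
    #include <stdio.h>

    struct extent { unsigned long start, count; };

    static size_t coalesce(const unsigned long *blocks, size_t n, struct extent *out)
    {
        size_t used = 0;
        for (size_t i = 0; i < n; i++) {
            if (used && out[used - 1].start + out[used - 1].count == blocks[i]) {
                out[used - 1].count++;           /* extend the current extent */
            } else {
                out[used].start = blocks[i];     /* start a new extent        */
                out[used].count = 1;
                used++;
            }
        }
        return used;                             /* number of extents produced */
    }

    int main(void)
    {
        unsigned long blks[] = { 100, 101, 102, 200, 201, 500 };
        struct extent ext[6];
        size_t n = coalesce(blks, 6, ext);
        for (size_t i = 0; i < n; i++)
            printf("overwrite %lu block(s) starting at block %lu\n",
                   ext[i].count, ext[i].start);
        return 0;
    }
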
Finally, in File System Benchmarking, a modelled hypothetical file system called OneSec was
designed, implemented and evaluated. OneSec is a dummy file system used as a benchmarking
yardstick for other file systems. OneSec demonstrated the benefits of the new file system
benchmarking methodology, in which a real file system is evaluated against an efficient
hypothetical file system. OneSec pointed out that the Linux OS file system framework adds a large
cost to file system operations, amounting to at least 76.93 % even in a bare-bone file system like
OneSec. Furthermore, evaluating the ext2 file system against OneSec showed that the read operation
of ext2 is 96.86 % as efficient as the efficient design, while deletion and writing are 95.62 %
and 95.39 % as efficient. Also, the comparison of normalised results showed that the ext2 file
system should reduce its disk I/Os to yield better performance.

6.2 Lessons Learned

For file system research, it is necessary, and indeed a prime requisite, that the researcher
understands the internals of an operating system, which is a hard pill to swallow. Furthermore,
OSes keep advancing, specifically the VFS of Linux, and thus a file system researcher has to stay
up to date with these changes. Designing a file system in the first place is a difficult task
because of the many considerations, parameters and components involved, while implementing a file
system demands determination, consistency and hard work.
During the implementation of the suvFS file system, one of the early prototypes was not able to
retrieve data properly from the unified fragments. The problem was with the read() call, whose
buffer, when not filled to its full length, signalled EOF or an I/O error. Though it was a small,
overlooked bug, it cost us approximately a month to find. The lesson it taught us was that one
should not ignore even a small aspect of any loosely related component in systems research.
Moreover, implementing the hFAT file system on Linux was not feasible due to the performance costs
added by the non-clean interface for driver stacking in Linux. Furthermore, the implementation
would later have demanded evaluation over solid state storage devices with a range of latencies,
which would not have been financially feasible. The lesson it taught us was that simulation is
your friend when feasibility imposes limitations on the implementation.
One of the earlier prototypes of the restFS file system had performance limitations, as the
algorithm used to identify the blocks deallocated from, and newly allocated to, a file in the ext2
file system was both CPU bound and I/O bound. Specifically, the algorithm performed badly during
the truncate() operation when identifying the deallocated blocks given the old and new size. It
took us more than a month to realise that the problem was not heterogeneous. The lesson it taught
us was that systems research problems can be effectively solved using mathematics.
Finally, the earlier implementation and benchmarking procedure of the OneSec file system nullified
the costs of executing the file system procedures and only took the cost of disk I/Os into
consideration. This made the implementation of OneSec easy and the benchmarking even easier;
however, it yielded unrealistic results. So, we decided to implement OneSec from scratch in two
configurations to obtain near-realistic results: in the first configuration no disk I/O was
performed, in order to measure the execution costs, while in the other a pre-defined number of
disk I/Os was performed for each operation. Their implementation taught us valuable lessons, such
as that the VFS of Linux changes more frequently than any other component of Linux and that the
costs it imposes on file systems are very high.

6.3 Future Scope

The work presented in this thesis leaves substantial room for further investigation. Some of the
apparent directions for future work are:

1. The suvFS file system can be further optimised by avoiding the lookups necessary
to identify and locate the fragments. In addition, the performance of suvFS can be
evaluated over a range of file systems. Also, based on the concept of suvFS file system,
a system to break file count limitation of file systems can be designed and developed.
It can be combined with suvFS and mhddfs to overcome all size and count limitations.

2. The hFAT file system implementation in Linux can be investigated by developing a clean
   interface for driver stacking. The self-evolving feature of the hFAT module can be optimised
   and generalised further. In addition, the hFAT module can be simulated and implemented for
   other file systems to determine their performance gains.

3. The block-level overwriting approach of restFS file system can be investigated for its
importance in log-structured and journalled file systems. Also, its efficiency can be
evaluated for other file systems.

4. The idea of OneSec file system can be extended for designing and developing such
hypothetical file systems for other classes of file systems for benchmarking comparisons.
In addition, within each class, the design can be optimised for diverse benchmarks.

End

Publications


Refereed Journal Papers

2012 Breaking File Size Limitation of File Systems (Wasim Ahmad Bhat, S.M.K. Quadri), Linux
Journal, ACM. (Accepted for publication)
2012 After-deletion data recovery: myths and solutions (Wasim Ahmad Bhat, S.M.K. Quadri), Com-
puter Fraud & Security, Elsevier, Volume 2012, Issue 4, Pages 17-20, ISSN: 1361-3723, April 2012.
2011 Open Source Code Doesn’t Help Always: Case of File System Development (Wasim Ahmad
Bhat, S.M.K. Quadri), Trends in Information Management, Volume 7, Issue 2, Pages 160-169, ISSN:
0973-4163, July-December 2011.
2011 Design Considerations for Developing Disk File System (Wasim Ahmad Bhat, S.M.K. Quadri),
Journal of Emerging Trends in Computing and Information Science, Volume 2, Issue 12, Pages 733-739,
eISSN: 2079-8407, December 2011.
2011 A Quick Review of On-Disk Layout of Some Popular Disk File Systems (Wasim Ahmad
Bhat, S.M.K. Quadri), Global Journal of Computer Science & Technology, Volume 11, Issue 6, Pages
1-18, pISSN: 0975-4350, eISSN: 0975-4172, April 2011.
2011 Benchmarking Criteria for File System Benchmarks (Wasim Ahmad Bhat, S.M.K. Quadri),
International Journal of Engineering Science & Technology, Volume 3, Issue 1, Pages 665-670, eISSN:
0975-5462, February 2011.
2011 IO Bound Property: A System Perspective Evaluation & Behaviour Trace of File System
(Wasim Ahmad Bhat, S.M.K. Quadri), Global Journal of Computer Science & Technology, Volume 11,
Issue 5, Pages 57-70, pISSN: 0975-4350, eISSN: 0975-4172, February 2011.
2010 Some Notable Reliability Techniques For Disk File Systems (Wasim Ahmad Bhat, S.M.K.
Quadri), Oriental Journal of Computer Science & Technology, Volume 3, Issue 2, Pages 269-271, pISSN:
0974-6471, December 2010.
2010 Review of FAT Datastructure of FAT32 File System (Wasim Ahmad Bhat, S.M.K. Quadri),
Oriental Journal of Computer Science & Technology, Volume 3, Issue 1, Pages 161-164, pISSN: 0974-
6471, June 2010.

Refereed Conference Papers

2012 restFS: Secure Data Deletion using Reliable & Efficient STackable File System (Wasim
Ahmad Bhat, S.M.K. Quadri), In Proceedings of 10th IEEE Jubilee International Symposium on Ap-
plied Machine Intelligence and Informatics, Pages 457-462, pISBN: 978-1-4577-0197-9, January 26-28,
Herlany, Slovakia, 2012.
2011 A Brief Summary of File System Forensic Techniques (S.M.K. Quadri, Wasim Ahmad Bhat), In
Proceedings of 5th National Conference on Computing for Nation Development, Bharati Vidyapeeth’s
Institute of Computer Applications and Management, Pages 499-502, pISSN:0973-7529, pISBN:978-93-
80544-00-7, March 10-11, New Delhi, India, 2011.
2011 Choosing Between Windows and Linux File Systems for a Novice User (S.M.K. Quadri,
Wasim Ahmad Bhat), In Proceedings of 5th National Conference on Computing for Nation Development,
Bharati Vidyapeeth’s Institute of Computer Applications and Management, Pages 457-462, pISSN:0973-
7529, pISBN:978-93-80544-00-7, March 10-11, New Delhi, India, 2011.


2011 Efficient Handling of Large Storage: A Comparative Study of Some Disk File Systems
(Wasim Ahmad Bhat, S.M.K. Quadri), In Proceedings of 5th National Conference on Computing for
Nation Development, Bharati Vidyapeeth’s Institute of Computer Applications and Management, Pages
475-480, pISSN:0973-7529, pISBN:978-93-80544-00-7, March 10-11, New Delhi, India, 2011.

Presentations & Talks

2011 Open Source Code Doesn’t Help Always: Case of File System Development, Presented at
National Seminar on Open Source Softwares: Challenges & Opportunities, June 20-22, University of
Kashmir, India, 2011.
2010 Comparison of Heterogeneous File Systems: Advantages & Limitations, Presented at 6th JK
Science Congress, December, University of Kashmir, India, 2010.
2010 How is File System Forensic Analysis Possible?, Presented at 6th JK Science Congress, Decem-
ber, University of Kashmir, India, 2010.
2010 Performance Evaluation of Common Disk File Systems under Linux, Presented at 6th JK
Science Congress, December, University of Kashmir, India, 2010.

References

Agrawal, N., Bolosky, W. J., Douceur, J. R., and Lorch, J. R. A five-year study of file-system
metadata. ACM Transactions on Storage, 3(3):338–346, October 2007. ISSN 1553-3077. URL http:
//doi.acm.org/10.1145/1288783.1288788. 43, 63, 90, 113

Akyürek, S. and Salem, K. Adaptive block rearrangement. ACM Transactions on Computer Systems, 13
(2):89–121, May 1995. ISSN 0734-2071. URL http://doi.acm.org/10.1145/201045.201046. 49

Alei, L., Kejia, L., Xiaoyong, L., and Haibing, G. Fatty: A reliable fat file system. In Proceedings of
the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, DSD 2007,
pages 390–395, Washington, DC, USA, 2007. IEEE Computer Society. ISBN 978-0-7695-2978-3. URL
http://dx.doi.org/10.1109/DSD.2007.4341497. 56

Badam, A., Park, K., Pai, V. S., and Peterson, L. L. Hashcache: cache storage for the next billion.
In Proceedings of the 6th USENIX symposium on Networked systems design and implementation, NSDI’09,
pages 123–136, Berkeley, CA, USA, 2009. USENIX Association. URL http://dl.acm.org/citation.cfm?
id=1558977.1558986. 104

Baker, M., Asami, S., Deprit, E., Ouseterhout, J., and Seltzer, M. Non-volatile memory for fast,
reliable file systems. In Proceedings of the fifth international conference on Architectural support for pro-
gramming languages and operating systems, ASPLOS-V, pages 10–22, New York, NY, USA, 1992. ACM.
ISBN 0-89791-534-8. URL http://doi.acm.org/10.1145/143365.143380. 52

Baker, M. G., Hartman, J. H., Kupfer, M. D., Shirriff, K. W., and Ousterhout, J. K. Mea-
surements of a distributed file system. In Proceedings of the thirteenth ACM symposium on Operating
systems principles, SOSP ’91, pages 198–212, New York, NY, USA, 1991. ACM. ISBN 0-89791-447-3. URL
http://doi.acm.org/10.1145/121132.121164. 43

Bauer, S. and Priyantha, N. B. Secure data deletion for linux file systems. In Proceedings of the 10th
conference on USENIX Security Symposium - Volume 10, SSYM’01, pages 153–164, Berkeley, CA, USA,
2001. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267612.1267624. 82, 83, 89

Bell-Labs, A. Plan 9 - Programmer’s Manual. AT&T Bell Laboratories, March 1995. 20

Bellis, M. History of the ms-dos operating systems, ibm & microsoft, 2006. URL http://inventors.about.
com/od/computersoftware/a/Putting-Microsoft-On-The-Map.htm. [Online; accessed 2010]. 55

Blackwell, T., Harris, J., and Seltzer, M. Heuristic cleaning algorithms in log-structured file systems.
In Proceedings of the USENIX 1995 Technical Conference, TCON’95, pages 23–23, Berkeley, CA, USA,
1995. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267411.1267434. 49


Blount, W. C. Why 7200 rpm mobile hard disk drives?, 2007. URL http://www.hitachigst.com. [Online;
accessed 2011]. 41

Bonwick, J. Zfs: The last word in file systems, 2006. URL http://www.opensolaris.org/os/community/
zfs/docs/zfs_last.pdf. [Online; accessed 2010]. 16

Borkar, S. Thousand core chips: a technology perspective. In Proceedings of the 44th annual Design
Automation Conference, DAC ’07, pages 746–749, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-
627-1. URL http://doi.acm.org/10.1145/1278480.1278667. 12

Card, R., T’so, T., and Tweedie, S. Design and implementation of the second extended filesystem. In
Proceedings of the First Dutch International Symposium on Linux, pages 1–6, State University of Groningen,
1995. ISBN 90-367-0385-9. URL http://web.mit.edu/tytso/www/linux/ext2intro.html. 11, 21, 45

Dahlin, M. D. The impact of trends in technology on file system design. Technical report, University of
California, Berkeley, 1996. URL www.cs.utexas.edu/~dahlin/techTrends/trends.ps. 11

Diesburg, S. M. and Wang, A.-I. A. A survey of confidential data storage and deletion methods. ACM
Computing Surveys, 43(1):1–37, December 2010. ISSN 0360-0300. URL http://doi.acm.org/10.1145/
1824795.1824797. 82

Engler, D. R., Kaashoek, M. F., and O’Toole, Jr., J. Exokernel: an operating system architecture
for application-level resource management. SIGOPS Operating Systems Review, 29(5):251–266, December
1995. ISSN 0163-5980. URL http://doi.acm.org/10.1145/224057.224076. 76

Fisher, N., He, Z., and McCarthy, M. A hybrid filesystem for hard disk drives in tandem with flash
memory. Computing (Springer), 94(1):21–68, January 2012. ISSN 0010-485X. URL http://dx.doi.org/
10.1007/s00607-011-0163-y. 53

Fitzhardinge, J. Userfs, 1993. URL http://www.goop.org/~jeremy/userfs. [Online; accessed 2011]. 76

Ganger, G. R. and Kaashoek, M. F. Embedded inodes and explicit grouping: exploiting disk bandwidth
for small files. In Proceedings of the annual conference on USENIX Annual Technical Conference, ATEC
’97, pages 1–18, Berkeley, CA, USA, 1997. USENIX Association. URL http://dl.acm.org/citation.cfm?
id=1268680.1268681. 43, 46, 83, 104

Ganger, G. R. and Patt, Y. N. Metadata update performance in file systems. In Proceedings of the 1st
USENIX conference on Operating Systems Design and Implementation, OSDI ’94, page 5, Berkeley, CA,
USA, 1994. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267638.1267643. 46

Ganger, G. R., McKusick, M. K., Soules, C. A. N., and Patt, Y. N. Soft updates: a solution to the
metadata update problem in file systems. ACM Transactions on Computer Systems, 18(2):127–153, May
2000. ISSN 0734-2071. URL http://doi.acm.org/10.1145/350853.350863. 46

Gomez, R., Adly, A., Mayergoyz, I., and Burke, E. Magnetic force scanning tunneling microscope
imaging of overwritten data. IEEE Transactions on Magnetics, 28(5):3141–3143, September 1992. ISSN
0018-9464. URL http://dx.doi.org/10.1109/20.179738. 81

Grance, T., Stevens, M., and Myers, M. Guide to Selecting Information Security Products. National
Institute of Standards and Technology (NIST), second edition, 2003. 82


Griffioen, J. and Appleton, R. Reducing file system latency using a predictive approach. Technical Report
CS247-94, Department of Computer Science, University of Kentucky, Lexington, 1994. 51

Grochowski, E. Emerging trends in data storage on magnetic hard disk drives, September 1999. URL http:
//classes.soe.ucsc.edu/cmps129/Winter03/papers/grochowski-trends.pdf. [Online; accessed 2010].
15

Grochowski, E. and Halem, R. D. Technological impact of magnetic hard disk drives on storage systems.
IBM Systems Journal, 42(2):338–346, February 2003. ISSN 0018-8670. URL http://dx.doi.org/10.1147/
sj.422.0338. 15

Gutmann, P. Secure deletion of data from magnetic and solid-state memory. In Proceedings of the 6th
conference on USENIX Security Symposium, Focusing on Applications of Cryptography - Volume 6, pages 8–
8, Berkeley, CA, USA, 1996. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267569.
1267577. 82

Guy, R. G., Heidemann, J. S., Mak, W., Page Jr., T. W., Popek, G. J., and Rothmeier, D. Imple-
mentation of the ficus replicated file system. In Proceedings of the Summer USENIX Technical Conference,
pages 63–71, Anaheim, CA, 1990. USENIX Association. 73

Hamm, S. and Greene, J. The man who could have been bill gates, October 2004. URL http://www.
businessweek.com/magazine/content/04_43/b3905109_mz063.htm. [Online; accessed 2010]. 55

Harder, B. Microsoft windows xp system restore, 2001. URL http://msdn.microsoft.com/library/
default.asp?url=/library/enus/dnwxp/html/windowsxpsystemrestore.asp. [Online; accessed 2011]. 70

Heidemann, J. S. and Popek, G. J. A layered approach to file system development. Technical Report
CSD-910007, UCLA, 1991. URL http://www.isi.edu/~johnh/PAPERS/Heidemann91b.pdf. 73

Heidemann, J. S. and Popek, G. J. File-system development with stackable layers. ACM Transactions
on Computer Systems, 12(1):58–89, February 1994. ISSN 0734-2071. URL http://doi.acm.org/10.1145/
174613.174616. 73

Hendricks, D. A filesystem for software development. In Proceedings of the USENIX Summer Conference,
pages 333–340, Anaheim, CA, June 1990. USENIX Association. 20

Heybruck, W. F. An introduction to fat 16/fat 32 file systems, August 2003. URL http://www.hitachigst.
com/. [Online; accessed 2010]. 58

Howard, J. H., Kazar, M. L., Menees, S. G., Nichols, D. A., Satyanarayanan, M., Sidebotham,
R. N., and West, M. J. Scale and performance in a distributed file system. ACM Transactions on
Computer Systems, 6(1):51–81, February 1988. ISSN 0734-2071. URL http://doi.acm.org/10.1145/
35037.35059. 95

Huang, H., Hung, W., and Shin, K. G. Fs2: dynamic data replication in free disk space for improving disk
performance and energy consumption. SIGOPS Operating Systems Review, 39(5):263–276, October 2005.
ISSN 0163-5980. URL http://doi.acm.org/10.1145/1095809.1095836. 50

Intel. The evolution of a revolution, 2007. URL http://download.intel.com/pressroom/kits/
IntelProcessorHistory.pdf. [Online; accessed 2010]. 12


Joukov, N. and Zadok, E. Adding secure deletion to your favorite file system. In Proceedings of the
Third IEEE Security in Storage Workshop, SISW ’05, pages 12–12, San Francisco, CA, 2005. IEEE. ISBN
0-7695-2537-7. URL http://dx.doi.org/10.1109/SISW.2005.1. 82

Joukov, N., Papaxenopoulos, H., and Zadok, E. Secure deletion myths, issues, and solutions. In Proceed-
ings of the second ACM workshop on Storage security and survivability, StorageSS ’06, pages 61–66, New
York, NY, USA, 2006. ACM. ISBN 1-59593-552-5. URL http://doi.acm.org/10.1145/1179559.1179571.
82

Jun, L., Xianliang, L., Guangchun, L., Hong, H., and Xu, Z. Stfs: a novel file system for efficient
small writes. SIGOPS Operating Systems Review, 36(4):50–54, October 2002. ISSN 0163-5980. URL
http://doi.acm.org/10.1145/583800.583806. 105

Katcher, J. Postmark: A new filesystem benchmark. Technical Report TR3022, Network Appliance,
1997. URL https://koala.cs.pub.ro/redmine/attachments/download/605/
Katcher97-postmark-netapp-tr3022.pdf. 90, 96

Kirps, J. A short history of the windows fat file system, March 2008. URL http://www.kirps.com/cgi-bin/
web.pl?blog_record=84. [Online; accessed 2010]. 55

Kleiman, S. R. Vnodes: An architecture for multiple file system types in sun unix. In Proceedings of the
Summer USENIX Technical Conference, pages 238–247, Atlanta, GA, USA, 1986. USENIX Association. 71

Klein, D. History of digital storage, December 2008. URL http://www.micron.com. [Online; accessed 2011].
38

Koller, R. and Rangaswami, R. I/o deduplication: Utilizing content similarity to improve i/o performance.
ACM Transactions on Storage, 6(3):1–26, September 2010. ISSN 1553-3077. URL http://doi.acm.org/
10.1145/1837915.1837921. 50

Korn, D. G. and Krell, E. A new dimension for the unix file system. Software: Practice and Experience,
20(S1):19–34, 1990. ISSN 1097-024X. URL http://dx.doi.org/10.1002/spe.4380201304. 20

Kryder, M. and Kim, C. S. After hard drives - what comes next? IEEE Transactions on Magnetics, 45(10):
3406–3413, October 2009. ISSN 0018-9464. URL http://dx.doi.org/10.1109/TMAG.2009.2024163. 15

Kwon, M., Bae, S., Jung, S., Seo, D., and Kim, C. Kfat: Log-based transactional fat filesystem for
embedded mobile systems. In Proceedings of 2005 US-Korea Conference, ICTS-142, 2005. 56

Lensing, P., Meister, D., and Brinkmann, A. hashfs: Applying hashing to optimize file systems for
small file reads. In Proceedings of the 2010 International Workshop on Storage Network Architecture and
Parallel I/Os, SNAPI ’10, pages 33–42, Washington, DC, USA, 2010. IEEE Computer Society. ISBN
978-0-7695-4025-2. URL http://dx.doi.org/10.1109/SNAPI.2010.12. 104

Lucas, Jr., H. Performance evaluation and monitoring. ACM Computing Surveys, 3(3):79–91, September
1971. ISSN 0360-0300. URL http://doi.acm.org/10.1145/356589.356590. 100

Machek, P. Uservfs, 2000. URL http://sourceforge.net/projects/uservfs. [Online; accessed 2011]. 76

Makarenko, A. V. Phenomenological model for growth of volumes of digital data, 2011. URL http:
//www.seagate.com/docs/pdf/whitepaper/disc_capacity_performance.pdf. [Online; accessed 2011]. 17


Mathur, A., Cao, M., Bhattacharya, S., Dilger, A., Tomas, A., and Vivier, L. The new ext4
filesystem: Current status and future plans. In Proceedings of the Ottowa Linux Symposium, pages 21–34,
Ontario, Canada, 2007. URL http://kernel.org/doc/ols/2007/ols2007v2-pages-21-34.pdf. 16

Matthews, J., Trika, S., Hensgen, D., Coulson, R., and Grimsrud, K. Intel turbo memory: Nonvolatile
disk caches in the storage hierarchy of mainstream computer systems. ACM Transactions on Storage, 4(2):
1–24, May 2008. ISSN 1553-3077. URL http://doi.acm.org/10.1145/1367829.1367830. 52

Mayergoyz, I. D., Serpico, C., Krafft, C., and Tse, C. Magnetic imaging on a spin-stand. Journal of
Applied Physics, 87(9):6824–6826, 2000. URL http://link.aip.org/link/?JAP/87/6824/1. 81

Mazières, D. A toolkit for user-level file systems. In Proceedings of the General Track: 2002 USENIX Annual
Technical Conference, pages 261–274, Berkeley, CA, USA, 2001. USENIX Association. ISBN 1-880446-09-X.
URL http://dl.acm.org/citation.cfm?id=647055.759949. 76

McKusick, M. K., Joy, W. N., Leffler, S. J., and Fabry, R. S. A fast file system for unix. ACM
Transactions on Computer Systems, 2(3):181–197, August 1984. ISSN 0734-2071. URL http://doi.acm.
org/10.1145/989.990. 38, 45

McVoy, L. and Kleiman, S. Extent-like performance from a unix file system. In Proceedings of the 1991
winter USENIX conference, pages 33–43. USENIX Association, 1991. URL http://www.sunhelp.org/
history/pdf/unix_filesys_extent_like_perf.pdf. 45

Microsoft. Fat32 file system specification, a. URL http://microsoft.com/whdc/system/platform/
firmware/fatgen.mspx. [Online; accessed 2010]. 21, 56

Microsoft. Transaction-safe fat file system, b. URL http://msdn2.microsoft.com/en-us/library/
aa911939.aspx. [Online; accessed 2010]. 56

Microsoft. File system filter manager: Filter driver development guide, 2004. URL www.microsoft.com/
whdc/driver/filterdrv/default.mspx. [Online; accessed 2011]. 75

Miller, E. L., Brandt, S. A., and Long, D. D. E. Hermes: High-performance reliable mram-enabled
storage. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems, HOTOS ’01, pages
95–99, Washington, DC, USA, 2001. IEEE Computer Society. ISBN 0-7695-1040-X. URL http://dl.acm.
org/citation.cfm?id=874075.876403. 52

Moore, F. Storage facts, figures, best practices, and estimates, September 2009. URL www.horison.com/
StorageFactsAndFigures10.doc. [Online; accessed 2010]. 17

Morris, R. J. T. and Truskowski, B. J. The evolution of storage systems. IBM Systems Journal, 42(2):
205–217, February 2003. ISSN 0018-8670. URL http://dx.doi.org/10.1147/sj.422.0205. 15

Nagar, R. Windows NT File System Internals: A Developer’s Guide. O’Reilly Media, first edition, 1997.
ISBN 978-1-56592-249-5. 16

Ng, S. W. Advances in disk technology: Performance issues. IEEE Computer, 31(5):75–81, May 1998. ISSN
0018-9162. URL http://dl.acm.org/citation.cfm?id=619029.620987. 40

Nielsen, J. Nielsen's law of internet bandwidth, April 1998. URL http://www.useit.com/alertbox/980405.
html. [Online; accessed 2010]. 14


Oboukhov, D. E. mhddfs, 2008. URL http://svn.uvw.ru/mhddfs/trunk/README. [Online; accessed 2010]. 20

Oney, W. Programming the Microsoft Windows Driver Model. Microsoft Press, second edition, 2003. 75

Ousterhout, J. K., Da Costa, H., Harrison, D., Kunze, J. A., Kupfer, M., and Thompson, J. G.
A trace-driven analysis of the unix 4.2 bsd file system. SIGOPS Operating Systems Review, 19(5):15–24,
December 1985. ISSN 0163-5980. URL http://doi.acm.org/10.1145/323627.323631. 47

Ousterhout, J. K., Cherenson, A. R., Douglis, F., Nelson, M. N., and Welch, B. B. The sprite
network operating system. IEEE Computer, 21(2):23–36, February 1988. ISSN 0018-9162. URL http:
//dx.doi.org/10.1109/2.16. 48

Park, A., Becker, J. C., and Lipton, R. J. Iostone: a synthetic file system benchmark. SIGARCH
Computer Architecture News, 18(2):45–52, May 1990. ISSN 0163-5964. URL http://doi.acm.org/10.
1145/88237.88242. 95

Patterson, D. A. Latency lags bandwith. ACM Communications, 47(10):71–75, October 2004. ISSN
0001-0782. URL http://doi.acm.org/10.1145/1022594.1022596. 13

Pendry, J.-S. and McKusick, M. K. Union mounts in 4.4bsd-lite. In Proceedings of the USENIX 1995
Technical Conference, TCON’95, pages 3–3, Berkeley, CA, USA, 1995. USENIX Association. URL http:
//dl.acm.org/citation.cfm?id=1267411.1267414. 20, 74

Rajgarhia, A. and Gehani, A. Performance and extension of user space file systems. In Proceedings of the
2010 ACM Symposium on Applied Computing, SAC ’10, pages 206–213, New York, NY, USA, 2010. ACM.
ISBN 978-1-60558-639-7. URL http://doi.acm.org/10.1145/1774088.1774130. 24, 33

Rao, B. and Angelov, B. Bandwidth intensive applications: Demand trends, usage forecasts, and com-
parative costs, July 2005. URL http://faculty.poly.edu/~brao/2005.BandwidthApps.NSF.pdf. [Online;
accessed 2010]. 14

Riedel, E., Faloutsos, C., Gibson, G., and Nagle, D. Active disks for large-scale data processing. IEEE
Computer, 34(6):68–74, June 2001. ISSN 0018-9162. URL http://dx.doi.org/10.1109/2.928624. 18

Robinson, G. Speeding net traffic with tiny mirrors, September 2000. URL http://eetimes.com/
electronics-news/4040776/SPEEDING-NET-TRAFFIC-WITH-TINY-MIRRORS. [Online; accessed 2010]. 14

Roselli, D., Lorch, J. R., and Anderson, T. E. A comparison of file system workloads. In Proceedings
of the annual conference on USENIX Annual Technical Conference, ATEC ’00, pages 4–4, Berkeley, CA,
USA, 2000. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267724.1267728. 43, 49

Rosenbaum, J. In defense of the delete key, 2000. URL http://simson.net/ref/2005/csci_e-170/ref/
rosenbaum_deletekey.pdf. [Online; accessed 2011]. 80

Rosenblum, M. The design and implementation of a log-structured file system. Technical Report CSD-92-696,
University of California at Berkeley, 1992. URL http://digitalassets.lib.berkeley.edu/techreports/
ucb/text/CSD-92-696.pdf. 47


Rosenblum, M. and Ousterhout, J. K. The design and implementation of a log-structured file system.
ACM Transactions on Computer Systems, 10(1):26–52, February 1992. ISSN 0734-2071. URL http:
//doi.acm.org/10.1145/146941.146943. 47, 48, 96

Rosenthal, D. S. H. Evolving the vnode interface. In Proceedings of the Summer USENIX Technical Con-
ference, pages 5–5, Berkeley, CA, USA, 1990. USENIX Association. URL http://dl.acm.org/citation.
cfm?id=1268708.1268713. 72

Rosenthal, D. S. H. Requirements for a stacking vnode/vfs interface. Technical Report SD-01-02-N014,
UNIX International, 1992. 72

Ruemmler, C. and Wilkes, J. Disk shuffling, October 1991. URL http://www.e-wilkes.com/john/papers/
HPL-91-156.pdf. [Online; accessed 2010]. 50

Ruwart, T. M. File system benchmarks, then, now, and tomorrow. In Proceedings of the Eighteenth IEEE
Symposium on Mass Storage Systems and Technologies, MSS ’01, pages 117–117, Washington, DC, USA,
2001. IEEE Computer Society. ISBN 0-7695-9173-9. URL http://dl.acm.org/citation.cfm?id=824466.
824967. 96

Satyanarayanan, M., Kistler, J., Kumar, P., Okasaki, M., Siegel, E., and Steere, D. Coda: a highly
available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):
447–459, April 1990. ISSN 0018-9340. URL http://dx.doi.org/10.1109/12.54838. 76

Schaller, R. Moore’s law: past, present and future. IEEE Spectrum, 34(6):52–59, June 1997. ISSN 0018-
9235. URL http://dx.doi.org/10.1109/6.591665. 12

Seagate. Disc drive capacity and performance, November 2001. URL http://www.seagate.com/docs/pdf/
whitepaper/disc_capacity_performance.pdf. [Online; accessed 2010]. 17

Seltzer, M., Bostic, K., Mckusick, M. K., and Staelin, C. An implementation of a log-structured file
system for unix. In Proceedings of the USENIX Winter 1993 Conference, pages 3–3, Berkeley, CA, USA,
1993. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1267303.1267306. 48

Seltzer, M., Smith, K. A., Balakrishnan, H., Chang, J., McMains, S., and Padmanabhan, V. File
system logging versus clustering: a performance comparison. In Proceedings of the USENIX 1995 Technical
Conference, TCON’95, pages 21–21, Berkeley, CA, USA, 1995. USENIX Association. URL http://dl.
acm.org/citation.cfm?id=1267411.1267432. 40, 49

Seltzer, M., Krinsky, D., Smith, K., and Zhang, X. The case for application-specific benchmarking. In
Proceedings of the The Seventh Workshop on Hot Topics in Operating Systems, HOTOS ’99, pages 102–107,
Washington, DC, USA, 1999. IEEE Computer Society. ISBN 0-7695-0237-7. URL http://dl.acm.org/
citation.cfm?id=822076.822466. 97

Skinner, G. C. and Wong, T. K. “stacking” vnodes: a progress report. In Proceedings of the USENIX
Summer 1993 Technical Conference - Volume 1, pages 1–27, Berkeley, CA, USA, 1993. USENIX Association.
ISBN 987-654-3333-22-1. URL http://dl.acm.org/citation.cfm?id=1361453.1361465. 72

SMCC. lofs: loopback virtual file system. Sun Microsystems, Inc, 1991. 74


Smith, K. and Seltzer, M. File layout and file system performance. Technical Report TR3594, Computer
Science Department, Harvard University, 1994. URL http://www.eecs.harvard.edu/~keith/papers/
tr94.ps.gz. 39

Smith, K. A. and Seltzer, M. I. File system aging increasing the relevance of file system benchmarks.
In Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of
computer systems, SIGMETRICS ’97, pages 203–213, New York, NY, USA, 1997. ACM. ISBN 0-89791-
909-2. URL http://doi.acm.org/10.1145/258612.258689. 98

Soundararajan, G., Prabhakaran, V., Balakrishnan, M., and Wobber, T. Extending ssd lifetimes
with disk-based write caches. In Proceedings of the 8th USENIX conference on File and storage technologies,
FAST’10, pages 8–8, Berkeley, CA, USA, 2010. USENIX Association. URL http://dl.acm.org/citation.
cfm?id=1855511.1855519. 52

Spillane, R. P., Wright, C. P., Sivathanu, G., and Zadok, E. Rapid file system development using
ptrace. In Proceedings of the 2007 workshop on Experimental computer science, ExpCS ’07, New York, NY,
USA, 2007. ACM. ISBN 978-1-59593-751-3. URL http://doi.acm.org/10.1145/1281700.1281722. 77

Sweeney, A., Doucette, D., Hu, W., Anderson, C., Nishimoto, M., and Peck, G. Scalability in the
xfs file system. In Proceedings of the 1996 annual conference on USENIX Annual Technical Conference,
ATEC ’96, Berkeley, CA, USA, 1996. USENIX Association. URL http://dl.acm.org/citation.cfm?id=
1268299.1268300. 10, 17

Symantec. Norton anti-virus, 2004. URL http://www.symantec.com. [Online; accessed 2011]. 70

Szeredi, M. Filesystem in userspace, 2005. URL http://fuse.sourceforge.net. [Online; accessed 2010]. 23, 77

Tanenbaum, A. S., Herder, J. N., and Bos, H. File size distribution on unix systems: then and now.
SIGOPS Operating Systems Review, 40(1):100–104, January 2006. ISSN 0163-5980. URL http://doi.acm.
org/10.1145/1113361.1113364. 43, 113

Tang, D. Benchmarking filesystems. Technical Report TR1995, Harvard College, Cambridge, Massachusetts,
1995. URL ftp://ftp.deas.harvard.edu/techreports/tr-19-95.ps.gz. 97

Tarasov, V., Bhanage, S., Zadok, E., and Seltzer, M. Benchmarking file system benchmarking: it *is*
rocket science. In Proceedings of the 13th USENIX conference on Hot topics in operating systems, HotOS’13,
pages 9–9, Berkeley, CA, USA, 2011. USENIX Association. URL http://dl.acm.org/citation.cfm?id=
1991596.1991609. 100

Tomov, A. A brief introduction to fat (file allocation table) formats, August 2006. URL http://
www.wizcode.com/articles/comments/a-brief-introduction-to-fat-file-allocation-table/. [On-
line; accessed 2010]. 56

Traeger, A., Zadok, E., Joukov, N., and Wright, C. P. A nine year study of file system and storage
benchmarking. ACM Transactions on Storage, 4(2):1–56, May 2008. ISSN 1553-3077. URL http://doi.
acm.org/10.1145/1367829.1367831. 97, 98

Vongsathorn, P. and Carson, S. D. A system for adaptive disk rearrangement. Software: Practice and
Experience, 20(3):225–242, 1990. ISSN 1097-024X. URL http://dx.doi.org/10.1002/spe.4380200302.
49


Walter, C. Kryder's law, July 2005. URL http://www.scientificamerican.com/article.cfm?id=
kryders-law. [Online; accessed 2010]. 15

Wang, A.-I. A., Kuenning, G., Reiher, P., and Popek, G. The conquest file system: Better performance
through a disk/persistent-ram hybrid design. ACM Transactions on Storage, 2(3):309–348, August 2006.
ISSN 1553-3077. URL http://doi.acm.org/10.1145/1168910.1168914. 52

Westerlund, A. and Danielsson, J. Arla: a free afs client. In Proceedings of the annual conference
on USENIX Annual Technical Conference, ATEC ’98, pages 32–32, Berkeley, CA, USA, 1998. USENIX
Association. URL http://dl.acm.org/citation.cfm?id=1268256.1268288. 77

Wikipedia. Hfs plus. URL http://en.wikipedia.org/wiki/HFS_Plus. [Online; accessed 2010]. 16

Wright, C. P., Dave, J., and Zadok, E. Cryptographic file systems performance: What you don’t know
can hurt you. In Proceedings of the Second IEEE International Security in Storage Workshop, SISW ’03,
pages 47–47. IEEE Computer Society, 2003. ISBN 0-7695-2059-6. URL http://dx.doi.org/10.1109/
SISW.2003.10005. 80

Wright, C. P. and Zadok, E. Kernel korner: unionfs: bringing filesystems together. Linux Journal,
2004(128):8–8, December 2004. ISSN 1075-3583. URL http://dl.acm.org/citation.cfm?id=1044970.
1044978. 20

Wright, C. P., Joukov, N., Kulkarni, D., Miretskiy, Y., and Zadok, E. Auto-pilot: a platform for
system software benchmarking. In Proceedings of the annual conference on USENIX Annual Technical
Conference, ATEC ’05, pages 53–53, Berkeley, CA, USA, 2005. USENIX Association. URL http://dl.
acm.org/citation.cfm?id=1247360.1247413. 98

Wright, C. P., Dave, J., Gupta, P., Krishnan, H., Quigley, D. P., Zadok, E., and Zubair, M. N. Ver-
satility and unix semantics in namespace unification. ACM Transactions on Storage, 2(1):74–105, February
2006. ISSN 1553-3077. URL http://doi.acm.org/10.1145/1138041.1138045. 20

Wright, C., Kleiman, D., and Sundhar R.S., S. Overwriting hard drive data: The great wip-
ing controversy. In Proceedings of the Information Systems Security, Lecture Notes in Computer Sci-
ence, pages 243–257, Berkeley, CA, USA, 2008. Springer Berlin. ISBN 978-3-540-89861-0. URL http:
//dx.doi.org/10.1007/978-3-540-89862-7_21. 82

Zadok, E. Kernel korner: writing stackable filesystems. Linux Journal, 2003(109):8–8, May 2003. ISSN
1075-3583. URL http://dl.acm.org/citation.cfm?id=770650.770658. 69

Zadok, E. and Badulescu, I. A stackable file system interface for linux. In Proceedings of LinuxExpo
Conference, pages 141–151, Raleigh, NC, 1999. 74

Zadok, E. and Nieh, J. Fist: a language for stackable file systems. In Proceedings of the annual conference
on USENIX Annual Technical Conference, ATEC ’00, pages 5–5, Berkeley, CA, USA, 2000. USENIX
Association. URL http://dl.acm.org/citation.cfm?id=1267724.1267729. 78, 82

Zadok, E., Badulescu, I., and Shender, A. Extending file systems using stackable templates. In Proceedings
of the annual conference on USENIX Annual Technical Conference, pages 5–5, Berkeley, CA, USA, 1999.
USENIX Association. URL http://dl.acm.org/citation.cfm?id=1268708.1268713. 70


Zadok, E., Iyer, R., Joukov, N., Sivathanu, G., and Wright, C. P. On incremental file sys-
tem development. ACM Transactions on Storage, 2(2):161–196, May 2006. ISSN 1553-3077. URL
http://doi.acm.org/10.1145/1149976.1149979. 77, 88

Zhang, Z. and Ghose, K. yfs: a journaling file system design for handling large data sets with reduced
seeking. In Proceedings of the 2nd USENIX conference on File and storage technologies, FAST’03, pages 5–
5, Berkeley, CA, USA, 2003. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1973355.
1973360. 51

Zhang, Z. and Ghose, K. hfs: a hybrid file system prototype for improving small file and metadata
performance. SIGOPS Operating Systems Review, 41(3):175–187, March 2007. ISSN 0163-5980. URL
http://doi.acm.org/10.1145/1272998.1273016. 51

Zhu, N., Chen, J., and Chiueh, T.-C. Tbbt: scalable and accurate trace replay for file server evaluation.
In Proceedings of the 4th conference on USENIX Conference on File and Storage Technologies - Volume 4,
pages 24–24, Berkeley, CA, USA, 2005. USENIX Association. URL http://dl.acm.org/citation.cfm?
id=1251028.1251052. 98
