
Proceeding of the 3rd International Conference on Informatics and Technology, 2009

An Integrated High-Performance Distributed File System Implementation on an Existing Local Network

K. N. Honwadkar1, Dr. T. R. Sontakke2

1 D.Y.Patil College of Engineering, Sector 29, PCNTDC, Akurdi, Pune 411 044, Maharashtra, India : knhonwadkar@yahoo.co.in
2 Former Director, S.G.G.S. Institute of Engineering and Technology, Vishnupuri, Nanded 431 606, Maharashtra, India : trsontakke@yahoo.com

Abstract
KINDFS is a distributed file system designed to provide cost-effective storage service by utilizing idle disk space on workstation clusters. System responsibilities are evenly distributed across a group of collaborating workstations; the proposed architecture provides improved performance, reliability, and scalability. Workstation uptime data varies from system to system. The feasibility of deploying KINDFS on an existing desktop computing infrastructure is discussed, and a KINDFS prototype implementation and measurements of its performance are suggested. Preliminary results indicate that KINDFS performance is comparable to that of commonly used distributed file systems, such as NFS, Samba, and Windows 2000 Server.

Key words: Distributed file system, metadata, SAN, NAS, NFS

I. Introduction

Many organizations have deployed desktop personal computers connected in a network for their regular operations and resource sharing. At the same time, with the Internet exploding in size and reaching into every walk of life, digital data stored on-line are growing at an unprecedented rate. As a result, many organizations are under continuous pressure to expand their storage systems as demands for their services grow and their data sets swell relentlessly. Recent technological innovations, ranging from faster peripheral channels to dedicated storage area networks (SANs) and, finally, to aggressively specialized storage systems built on dedicated hardware and software, have addressed the problem of providing large storage space. However, the high cost of these specialized storage systems more often than not makes it difficult for many organizations to budget for them. Meanwhile, the computer industry has made significant advances in magnetic recording technology, with the cost of disk drives reduced by mass production. The standard disk capacity on mainstream computers was about 20-30 GB as of mid-2001 and is growing continuously over time. Nevertheless, most users prefer network storage for several reasons:

a) Mobility - users want to access their data through a consistent interface from any place;
b) Quality of Service - network storage is normally provided by high-performance, highly available storage systems with built-in redundancy and regular backup schedules;
c) Security - system security is much easier to maintain on a centralized storage system managed by professional administrators than on a decentralized system managed by individual users.

As a result, most of the local disk space on client workstations is used only for the operating system, application programs, and temporary files, which in total take up only 2 to 4 GB of disk space. Douceur [1] and Bolosky [2] measured and analyzed a large set of client machines in a large commercial environment with more than 60,000 desktop personal computers. The measurements include disk usage and content, and show that about 53% and 50% of the overall disk space in the studied environment was in use in September 1998 and August 1999, respectively. The disparity between space utilization ratios on storage servers and on local machines is expected to widen further over time as the average disk size grows rapidly. This has led to the design and deployment of various implementations that utilize idle disk space on workstation clusters. Using the network-attached storage (NAS) approach [3], NFS by Sun Microsystems provides one way of sharing files among networked workstations. The KINDFS architecture is designed to offer a file interface similar to modern storage appliances, with built-in fault tolerance to achieve reliability and high availability. All members of the network share responsibilities evenly distributed among them, so KINDFS can sustain balanced and scalable performance.


The rest of this paper is organized as follows. Section 2 describes the KINDFS architecture and presents a feasibility study of deploying KINDFS on an existing client computing infrastructure. Section 3 gives methods for analyzing the performance of the system, and Section 4 briefly reviews related work. Section 5 concludes the paper and discusses future work.

II. KINDFS: System Architecture

A. Environment description and assumptions


The subjects in this study are front-end desktop workstations, as opposed to back-end time-sharing servers. Local disks
on these workstations are primarily used to store files for operating system and application programs. We assume that in
such an environment there exists a central administration with login authentication servers and storage servers with a
unified namespace. All these servers and client workstations are connected in a secure local area network behind a
firewall. Therefore, the servers can trust the operating systems on these workstations. The foremost characteristic of this
environment is heterogeneity in terms of hardware platforms. For the initial study and deployment, the same operating system is assumed to be available on all member computers. Generally, these workstations are well equipped with fast processors, large amounts of memory, and high-bandwidth local I/O and network interfaces. Another noticeable fact is that these
workstations are susceptible to occasional unexpected reboots, due to software failure, user choice, or system-sharing
policy.

The primary objective of this study is to transform the idle disk space into usable storage service with satisfactory
reliability and efficiency. A good solution should require little administrative effort; otherwise, the prohibitive human
cost may overshadow the gain from recovered resources. KINDFS is not intended to replace main storage servers.
Instead KINDFS attempts to provide extra storage space supplementary to main storage servers with little or no further
investment. Potentially the recovered storage space can be used for archiving purposes for both end-users and system
administrators, such as large multimedia data or system snapshot files for online backup. Other possible uses include web page caching, website replication, or data buffering for search engines. It is possible to address the idle disk space problem using existing software installed on current systems. Specifically, on UNIX workstations disk space can be shared via the NFS or Samba service; on Windows machines disk space can be shared through the CIFS protocol [4]. On the UNIX platform, automount can be used to construct a single unified file system namespace.
service on Windows 2000 Server has the same function to coalesce multiple file resources on Windows 2000 Server or
Samba (2.2.0 or later) into one single namespace. However, this is a manual process and may be cost-prohibitive
because of the extensive human work involved in the setup. Normally there is little data fault-tolerance in this ad hoc
solution.
The principal construct of our proposed system is the global file store, which is a logically monolithic file system
indexed by a hierarchical directory tree. Although logically monolithic, the global file store is physically distributed
among the disks of the client machines participating in the distributed file system.

We call this the K(c)luster Integrated NFS-based Distributed File System, or KINDFS for short.

KINDFS is a hyper file system: it stores its metadata and file data as files on a standard NFS server and on member nodes. The unused storage capacity of the member nodes is configured to add to the total storage space of the file system. This yields a virtual storage volume available for file storage, with storage redundancy spread over the entire network. The system is best suited for intranet applications requiring highly available data. Figure 1 shows a global view of the server and clients.
The configuration shown in the figure uses only one server and an even number of client nodes. Every member node has a mirrored counterpart in the same network. File data written to a node are also written to its mirror, with the same attributes. When a node boots up and joins the network, it compares its KINDFS space with that of its mirror node and modifies the contents if required, keeping both images of the file data consistent.
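The sketch below illustrates one plausible boot-time reconciliation under stated assumptions: each node's KINDFS space is a local directory, the mirror node's space is reachable over an NFS mount, and a newer modification time wins. The paths and the helper name are hypothetical, not from the paper, and deletions are not handled.

```python
# Hypothetical boot-time mirror reconciliation: copy whichever side of
# each file is newer, so both images converge. Paths are assumptions.
import os
import shutil

LOCAL = "/var/kindfs/space"        # this node's KINDFS space (assumed)
MIRROR = "/mnt/mirror/kindfs"      # mirror node's space, mounted over NFS

def reconcile() -> None:
    for root, _, names in os.walk(MIRROR):
        for name in names:
            remote = os.path.join(root, name)
            local = os.path.join(LOCAL, os.path.relpath(remote, MIRROR))
            os.makedirs(os.path.dirname(local), exist_ok=True)
            if not os.path.exists(local) or \
               os.path.getmtime(remote) > os.path.getmtime(local):
                shutil.copy2(remote, local)   # newer mirror copy wins
            elif os.path.getmtime(local) > os.path.getmtime(remote):
                shutil.copy2(local, remote)   # newer local copy wins
```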


Figure 1. Global view of the server and clients in KINDFS

Since file availability and reliability are drastically affected by the file replication factor, it is important to maintain a
minimum number of replicas for each file. To ensure sufficient disk space for these replicas, our proposed system enforces a single replica of each file stored by a user. Rather than per-user quotas or allotments of specific space, the system allots space in a round-robin manner. This leads to a highly available, reliable, global, distributed directory service, much like that provided by xFS [10]. Each machine can use this service to locate a replica of a requested file. This
directory must be maintained in a strongly consistent fashion, to ensure that clients see the latest version of each file.

Figure 3. KINDFS has a directory structure similar to that of its meta file system.

If a client makes modifications in its own KINDFS directory, the changes are reflected as changes on the meta file server and can be seen by all other KINDFS clients. Thus the single global namespace of KINDFS is maintained. The attributes of files in KINDFS, such as file name, creation and modification time, uid, gid, etc., are stored as the attributes of the corresponding files in the meta file system. A typical metadata operation is explained by the example of file creation: when a file in KINDFS is created, a file with the same name is also created on the meta KINDFS server. A minimal sketch of this process follows.
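The sketch below illustrates this create path under stated assumptions: the meta file system is an NFS mount, the local KINDFS space is a local directory, and both paths and the helper name are hypothetical rather than the paper's implementation.

```python
# A minimal sketch of KINDFS file creation: write the data locally and
# mirror the name and attributes into the meta file system. Paths are
# assumptions, not from the paper.
import os
import shutil

META_ROOT = "/mnt/kindfs-meta"     # NFS-mounted meta file system (assumed)
KINDFS_ROOT = "/var/kindfs/space"  # local contribution to the global store

def kindfs_create(rel_path: str, data: bytes) -> None:
    """Create a file in local KINDFS space and mirror its name and
    attributes into the meta file system so all clients see it."""
    local = os.path.join(KINDFS_ROOT, rel_path)
    meta = os.path.join(META_ROOT, rel_path)
    os.makedirs(os.path.dirname(local), exist_ok=True)
    os.makedirs(os.path.dirname(meta), exist_ok=True)
    with open(local, "wb") as f:     # write the actual file data locally
        f.write(data)
    open(meta, "wb").close()         # same-named entry on the meta server
    shutil.copystat(local, meta)     # propagate timestamps and permissions
```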

B. Efficiency Considerations
Our proposed system can increase the usable space in the global file store through two techniques. First, it can coalesce distinct files that happen to have identical contents [11]; for example, many users may store copies of a common application program in the global file store, and these can be coalesced into a single instance prior to replicating that instance among multiple machines. Second, it can compress files before storing them and decompress them on the fly when they are read [12]. For clarity, logically distinct files with identical content are referred to as duplicates, and file copies the system generates as replicas. The system is planned to employ a lazy-update strategy, meaning that it waits for a short time after a file is written before updating the file's replicas. Since a large fraction of written files are deleted or overwritten within a few seconds [13], lazy update can significantly reduce file-update traffic. Lazy update also allows a file to be written without requiring all (or even a quorum) of the replicas to be immediately available. The disadvantage of this approach is that the content of a newly written file will briefly reside on only one machine, so loss of that machine will result in loss of the update. The directory service must keep track of which replica contains up-to-date data, so users will not accidentally access out-of-date versions of files. The impact on the sending machine can also be reduced by performing non-cached reads and writes, to prevent buffer cache pollution. If remote reads of very popular files are targeted at a small number of machines, those machines could become overloaded. Our system can avoid creating such hotspots by allowing machines to copy files from other machines' on-disk local caches.
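As a minimal illustration of the two space-saving techniques above, the sketch below coalesces duplicates by whole-file content hash and compresses stored instances, decompressing on read. The store path, the choice of SHA-1, and the function names are assumptions made for illustration, not the paper's implementation.

```python
# Content-addressed store: identical files coalesce to one compressed
# instance; reads decompress on the fly. Names are illustrative only.
import hashlib
import os
import zlib

STORE = "/var/kindfs/store"  # content-addressed instance store (assumed)

def store_file(data: bytes) -> str:
    """Store one logical file; duplicates share a single instance."""
    digest = hashlib.sha1(data).hexdigest()
    path = os.path.join(STORE, digest)
    if not os.path.exists(path):          # first copy: keep it compressed
        os.makedirs(STORE, exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(data))
    return digest                         # later duplicates reuse this key

def read_file(digest: str) -> bytes:
    """Decompress on the fly when the file is read."""
    with open(os.path.join(STORE, digest), "rb") as f:
        return zlib.decompress(f.read())
```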

C. Replica Management
In the proposed system, the selection of machines for storing file replicas is driven by machine availability.
The system measures machine uptimes and distributes replicas so as to maximize the minimum file availability. In
selecting locations for replicas of a given file, the system could select a set of machines whose uptimes are negatively
correlated, thus reducing the likelihood that all machines containing a replica will be down at the same time. However,
measurements suggest that this would provide little marginal benefit. The system can improve the availability of sets of
related files by storing them in the same locations. If a user needs a given set of files to accomplish a task, then if any of
those files is inaccessible, the task cannot be completed; therefore, it is beneficial to store related files together, so that
either all or none of the files are available. Since files accessed together temporally tend to be grouped together spatially,
replicas for all files in a given directory are stored on the same set of machines, and entire directories are cached
together. When a new machine joins the system, its files are replicated to the global storage areas on other machines;
space for those replicas is made by relocating replicas of other files onto the new machine. Similarly, when a machine is
decommissioned, the system creates and distributes additional replicas of the files stored on that machine. Machines join
the system by explicitly announcing their presence; however, machines can leave without notice, particularly if they
leave due to permanent failure, such as a disk-head crash. If the system notices that a machine has been down for an
extended period of time, it must assume that the machine has been decommissioned and accordingly generate new
replicas; otherwise, the possibility of another failure can jeopardize the reliability of files with replicas on that machine.

Replica management – placement algorithm

Ideally, replicas should be assigned to machines so as to maximize the minimum availability of any file, while also
maximizing the minimum reliability of any file. Measured logarithmically, the availability of a file equals the sum of the availabilities of all machines that store a replica of the file, assuming uncorrelated machine uptimes. The following
heuristic algorithm yields a low availability variance:

i. Set the provisional availability of all files to zero; then,
ii. iteratively select machines in order of increasing availability;
iii. for each selected machine, identify the files with the highest provisional availability from among those with the lowest replica count, and assign them to the selected machine;
iv. update the provisional availability of those files;
v. repeat until all machines are full.

Algorithm for selection of a machine to store replica of file
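A hedged Python sketch of this heuristic follows. The data structures, the per-machine capacity parameter, and the logarithmic availability convention are assumptions made for illustration, not the paper's code.

```python
# Placement heuristic sketch: least-available machines are filled first;
# each slot goes to the file with the fewest replicas, ties broken by
# highest provisional availability, per steps i-v above.
import math

def place_replicas(machines: dict, files: list, capacity: int) -> dict:
    """machines: {name: availability in (0, 1)}; files: file identifiers;
    capacity: number of replicas each machine can hold."""
    placement = {f: [] for f in files}   # file -> machines holding it
    avail = {f: 0.0 for f in files}      # i. provisional availability = 0
    # ii. iterate machines in order of increasing availability
    for m in sorted(machines, key=machines.get):
        for _ in range(capacity):
            # iii. lowest replica count first, ties broken by highest
            # provisional availability; skip files already on this machine
            candidates = [f for f in files if m not in placement[f]]
            if not candidates:
                break
            f = min(candidates, key=lambda x: (len(placement[x]), -avail[x]))
            placement[f].append(m)
            # iv. availability adds logarithmically (uncorrelated uptimes)
            avail[f] += -math.log10(1.0 - machines[m])
    return placement

# usage with illustrative values
print(place_replicas({"a": 0.9, "b": 0.5, "c": 0.99}, ["f1", "f2"], 2))
```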

D. Data Security
Distributing multiple replicas of a file protects not only against accidental failure but also against malicious attack. To
destroy a file, an adversary must compromise all machines that hold replicas of that file. To prevent an adversary from
coercing the system into placing all replicas of a given file on a small set of machines under the adversary’s control, the
system must use secure methods for selecting machines to house file replicas. File-update messages are digitally signed by the writer of the file. Before applying an update to a replica it stores, each machine verifies the digital signature on
the update and compares the writer’s identity to a list of authorized writers associated with the file. This prevents an
adversary from forging and distributing file updates.
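A minimal sketch of this signed-update check is shown below, assuming Ed25519 signatures via the Python 'cryptography' package; the update serialization and the representation of the authorized-writer list are illustrative assumptions.

```python
# Signed file updates: a replica holder verifies the signature and the
# writer's authorization before applying the update.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

def apply_update(update: bytes, sig: bytes, writer_pub: bytes,
                 authorized_writers: set) -> bool:
    """Verify writer identity and signature before applying an update."""
    if writer_pub not in authorized_writers:   # per-file writer list
        return False
    try:
        Ed25519PublicKey.from_public_bytes(writer_pub).verify(sig, update)
        return True
    except InvalidSignature:
        return False

# usage: the writer signs the serialized update before distributing it
writer = Ed25519PrivateKey.generate()
pub = writer.public_key().public_bytes(Encoding.Raw, PublicFormat.Raw)
update = b"example serialized file update"    # illustrative payload
assert apply_update(update, writer.sign(update), pub, {pub})
```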

To prevent files from being read by unauthorized users, the contents of files are encrypted before they are replicated.
However, encryption could interfere with the automatic detection and coalescing of duplicate files, since different users
may encrypt identical plaintext files with different keys, which would normally produce different ciphertext files. Since the implementation is suggested on Linux, the security features inherent to the operating system are used, without any additional security algorithm.

III. Performance Analysis Methodology

To analyze the feasibility of deploying the proposed system on an existing computing infrastructure, usage data from
desktop personal computers connected in a network must be collected. The contents of file systems, the availability of machines, and the load on machines are measured. Roughly half of all machines belong to technical developers; the other half are
distributed approximately equally among the remaining categories.

A. File System Measurement


To determine the amount of disk space that is free, and the amount that would be free if all duplicate files were eliminated, the file systems should be analyzed; the rate of cache misses, the size of local system caches, and the rate of file writes should also be estimated.

a) Disk usage measurement


In September 1998, Microsoft employees ran a scanning program on their Windows and Windows NT computers that collected directory information (file names, sizes, and timestamps) from their file systems [7]. By this means, measurements of 10,568 file systems were obtained [8]. In August 1999, a study was conducted to remotely read the performance counters (free disk space, total disk space, and logon name of the primary user) of every Windows NT computer that could be accessed. By this means, measurements of 8,669 file systems were obtained.

b) File access rate estimation


The timestamps in the data set can be used to estimate the rate at which files are read and written. Each file's last read time was taken as the most recent timestamp (create, update, or access) on the file, and its last write time as the most recent of the create and update timestamps. There are three aspects of file access rate that are relevant to our design: the miss rate of the local file cache, the size of the local file cache, and the amount of write traffic sent to file replicas on other machines. Only files that would be stored in the global file store were considered, excluding inherently local files such as the system paging file and files in temporary directories or the Internet browser cache, which together account for 2% of the bytes in the data set. Write traffic rate is a function of the lazy-update propagation lag. We
estimate write traffic rate as follows: For a propagation lag of n hours, files last written between 2n hours ago and n
hours ago will have been propagated during the past n hours.
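The estimate above is directly computable; a hedged sketch follows, where the layout of the timestamp data set is an assumption made for illustration.

```python
# Write-traffic estimate: with propagation lag of n hours, files last
# written between 2n and n hours ago are taken to have been propagated
# during the past n hours. Input layout is an assumption.
import time

def write_traffic_bytes_per_hour(files, n_hours: float,
                                 now: float = None) -> float:
    """files: iterable of (last_write_epoch_seconds, size_bytes) pairs.
    Returns estimated bytes propagated per hour for lag n_hours."""
    now = time.time() if now is None else now
    lo = now - 2 * n_hours * 3600   # written at most 2n hours ago...
    hi = now - n_hours * 3600       # ...and at least n hours ago
    total = sum(size for ts, size in files if lo <= ts < hi)
    return total / n_hours          # spread over the past n hours
```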

B. Machine Availability Measurement


To determine file availability in the proposed system, we need to know machine uptimes, whether these uptimes are
consistent over time, and whether the uptimes of different machines are correlated. When our proposed system sees a
machine go down, it needs to predict how long the machine will stay down. Our reliability calculations require knowing
machine lifetimes.
a) Machine uptime measurement

Patterson et al. analyzed the reliability of RAID disk arrays using queuing theory [15]. Similarly, the reliability of the proposed architecture depends on the reliability of each individual system whose idle disk space is utilized. We define the following notation for further discussion. For disks, the Mean Time To Failure is denoted MTTF_disk and the Mean Time To Repair MTTR_disk; MTTR_disk is far smaller than MTTF_disk. For disk arrays, the Mean Time To Failure is MTTF_group. G is the number of data disks in a RAID group, and C is the number of check disks in a group, where C is 1 in RAID levels 1/4/5. Therefore,

MTTF_group = MTTF_disk^2 / ((G + C) * (G + C - 1) * MTTR_disk)

With the latest disk manufacturing technologies, modern disk drives on client machines are very reliable. The main failing factor in this study is therefore the system availability of desktop workstations. Due to operating system failures, or a user's choice to reboot the system for whatever reason, the mean uptime of a client machine (MTTF_sys) is very small compared to MTTF_disk. Thus, we consider the MTTF_disk of an individual disk drive to be infinitely large.
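As a small worked check of the group MTTF formula above, the sketch below evaluates it for illustrative parameter values; the numbers are not measurements from the paper.

```python
# Group MTTF per Patterson et al. [15]; parameter values are illustrative.
def mttf_group(mttf_disk_h: float, mttr_disk_h: float, g: int, c: int) -> float:
    """Mean time to data loss for one RAID group, in hours."""
    return mttf_disk_h ** 2 / ((g + c) * (g + c - 1) * mttr_disk_h)

# e.g. 500,000-hour disks, 24-hour repair, 8 data + 1 check disk:
# 500000**2 / (9 * 8 * 24) ~= 1.45e8 hours
print(mttf_group(500_000, 24, 8, 1))
```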


However, when the system reboots, it normally restores to working condition in 2 or 3 minutes, including grace time for all services to start. This reboot period is very small compared to the mean system uptime. We define MTTF_sys as the mean system uptime and MTTR_sys as the mean time each system reboot needs to restore to fully working status; MTTR_sys is far smaller than MTTF_sys.

b) Machine uptime correlation calculation


To determine whether the uptimes of different machines are correlated, we computed, for every pair of machines, a
temporal correlation value by adding one for every ping snapshot that the machines were both up or both down, and
subtracting one for every snapshot that one machine was up and the other was down. We normalized the result by
dividing by the count of snapshots.
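The sketch below computes this pairwise correlation exactly as described; the layout of the ping-snapshot data is an assumption made for illustration.

```python
# Pairwise uptime correlation: +1 per snapshot where both machines agree
# (both up or both down), -1 where they differ, normalized by count.
from itertools import combinations

def uptime_correlations(snapshots: dict) -> dict:
    """snapshots: {machine: [bool, ...]} up/down per ping round (assumed).
    Returns {(m1, m2): correlation in [-1, 1]}."""
    corr = {}
    for m1, m2 in combinations(sorted(snapshots), 2):
        pairs = list(zip(snapshots[m1], snapshots[m2]))
        score = sum(1 if a == b else -1 for a, b in pairs)
        corr[(m1, m2)] = score / len(pairs)   # normalize by snapshot count
    return corr
```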

c) Machine downtime prediction calculation


From regular observation of the complete system, the administrator can find how long a machine is off within a given span of time; this helps estimate the machine's downtime for that period as well as for any other period.

d) Machine lifetime measurement


Machines in the system can be periodically pinged and responses can be recorded. Scanning backwards through the data
it can be determined the last time each machine responded to a ping. If machines have deterministic lifetimes, then the
rate of attrition is constant, and the count of remaining machines decays linearly. The expected machine lifetime
(meaning the lifetime of the machine name, not the physical hardware) is the time until this count reaches zero.
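A hedged sketch of this linear-decay estimate follows: fit a line to the remaining-machine counts and take its zero crossing as the expected lifetime. The input layout is an assumption made for illustration.

```python
# Lifetime estimate: under constant attrition the count of remaining
# machines decays linearly; the expected lifetime is the fitted line's
# x-intercept (time until the count reaches zero).
def expected_lifetime_days(counts: list, days: list) -> float:
    """counts[i] machines still responding at days[i] (assumed layout)."""
    n = len(days)
    mx, my = sum(days) / n, sum(counts) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(days, counts)) / \
            sum((x - mx) ** 2 for x in days)
    return mx - my / slope        # x-intercept of the least-squares line

# e.g. 100 machines at day 0 decaying by 1/day -> lifetime of 100 days
print(expected_lifetime_days([100, 90, 80], [0, 10, 20]))
```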

C. Machine Load Measurement

CPU and disk load are measured by running a program on data-collection computers that iteratively selects a random machine and remotely reads the performance counters (CPU load and disk load) of that machine. CPU load is measured as the fraction of cycles expended in processes other than the idle process. Disk load is measured as the number of disk operations performed per second.

D. Measurement Errors and Biases

The data-collection techniques discussed above are vulnerable to many sources of experimental error and measurement
bias. In general, the effects of these errors on the results cannot be precisely quantified, but one can at least enumerate them, so
their presence will not be ignored. Since each machine’s owner determined whether to run the file system scanning
program, these results could be affected by self-selection bias. The file timestamps collected and analyzed can be reset
by user-level software; therefore, they may be unreliable. In addition, it takes several minutes to perform the scan of
each file system, so this limits the granularity with which one can assess the elapsed time since file accesses. The use of
static timestamps to determine dynamic access rates prevents one from observing burstiness in file accesses. The file
access times derived from the file timestamps are biased towards integral multiples of one week, for the following reason: individual users ran the scanning program at times of their own choosing, so they were disproportionately likely to do so on a weekday. Since files are more likely to be accessed on a weekday, the time since last access reflects this
congruence. The remote reads of performance counters may be restricted to a subset of the Windows NT/2000
machines on the network, since performance counters are not present in Windows 95/98, a machine’s owner can restrict
access to them in Windows NT, and they are restricted by default in Windows 2000. Windows 2000 is most likely to be
running on newer machines, which are likely to have larger-than-average disks. The operating system dynamically loads
the performance counter libraries when interrogated, and it unloads them after a period of non-use that could be smaller
than the mean time between same-machine samples. Therefore, sampled machines commonly loaded some amount of
monitoring code before responding to the query, thus increasing their workload. The exact amount of code loaded
depends on the particulars of each machine and on whether any other application or system component is already using
some part of the monitoring subsystem.

E. Availability Analysis
Redundancy is provided to the files stored in the distributed storage space under KINDFS by a mirrored image of each file on another member node. In short, the total available storage space is divided into two parts, viz. virtual space and mirror image, as shown in Figure 1. This provides for the availability of a file even in the event that a node fails or is not powered on.


F. Reliability Analysis
If machines always notify the system before being permanently decommissioned, overall file reliability is governed by
the disk failure rate [16]. In particular, the likelihood of permanently losing a file is exponentiated by the replication
factor. However, if machines do not always notify the system, reliability is substantially degraded. To determine a lower bound on reliability, it is assumed that machines always disappear without warning. The mean time between losses of one machine full of replicas is equal to the mean machine lifetime, l. A directory full of files will be permanently lost if all machines containing the other replicas of its files are decommissioned before the system recognizes the loss and creates new replicas. A lower bound can then be estimated from d directories per machine, r replicas, and the lag τ between a machine's turnoff and the creation of replacement replicas, as sketched below.
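The following back-of-envelope sketch is one plausible way to combine these quantities; it is an assumption-laden illustration, not the paper's formula. It assumes machines disappear independently with mean lifetime l, so a given machine dies within the replacement lag τ with probability roughly τ/l when τ is much smaller than l.

```python
# Hedged lower-bound sketch: expected permanently lost directories per
# day, assuming independent machine losses (an assumption, not the
# paper's derivation).
def directory_loss_rate(l_days: float, tau_days: float, r: int,
                        d: int, machines: int) -> float:
    p_die_in_lag = tau_days / l_days          # one replica holder lost in lag
    p_all_others = p_die_in_lag ** (r - 1)    # remaining r-1 replicas lost too
    machine_loss_per_day = machines / l_days  # machines disappearing per day
    return machine_loss_per_day * d * p_all_others

# e.g. 300-day lifetimes, 1-day lag, 3 replicas, 100 dirs/machine, 1000 machines
print(directory_loss_rate(300, 1, 3, 100, 1000))
```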

IV. Related Work


The most relevant work is Microsoft Research's Farsite project [14], which builds a serverless distributed file system from a large group of collaborating desktop computers. Fault tolerance in Farsite is achieved through replication of files assisted by a Byzantine-fault-tolerant algorithm. The distributed RAID approach used in xFS [18] and Petal [17] was applied to build a virtual disk with a block interface. On top of the virtual block devices, higher-level file systems and distributed file systems were built, like the metadata manager in xFS and Frangipani. KINDFS evenly distributes metadata and its management across all storage devices in the cluster. Metadata in KINDFS are stored at fixed locations while data can be stored anywhere, in contrast to the "Anything, Anywhere" file layout in xFS. The KINDFS architecture bears a close resemblance to Network-Attached Secure Disks (NASD) [19], which targets high-performance storage systems based on directly network-attached secure disks. The main difference is that in KINDFS file namespace consistency is maintained collaboratively by participating cluster members, whereas in NASD a central file manager maintains namespace consistency for the whole system.

V. Conclusions and Future Work


KINDFS is a novel solution for building reliable, efficient storage service that utilizes idle disk space on desktop workstations. The enabling factor for system efficiency is the flexible "Data Anywhere, Metadata at Client Locations" file system layout. The design eliminates many central-server bottlenecks and can provide improved performance, reliability, and scalability. The viability of constructing distributed file system clusters from the idle disk space of a desktop computing infrastructure should be tested by deploying the system on machines with diverse configurations. Future work includes implementation of the system for heterogeneous operating system environments.

References
[1] J. R. Douceur and W. J. Bolosky, "A Large-Scale Study of File System Contents," in Proceedings of the 1999 Conference
on Measurement and Modeling of Computer Systems (SIGMETRICS), May 1999.

[2] W. J. Bolosky, J. R. Douceur, et al., "Feasibility of a Serverless Distributed File System Deployed on an Existing Set of
Desktop PCs," in Proceedings of the 2000 International Conference on Measurement and Modeling of Computer Systems
(SIGMETRICS), June 2000.

[3] G. Gibson and R. Van Meter, "Network Attached Storage Architecture," Communications of the ACM, vol. 43, Nov. 2000.

[4] P. Leach and D. Naik, "Common Internet File System (CIFS/1.0) Protocol Preliminary Draft," Internet-Draft, Dec. 1997.

[5] B. Callaghan, NFS Illustrated, Addison-Wesley, 2000.

[6] B. Callaghan, B. Pawlowski, and P. Staubach, "NFS Version 3 Protocol Specification," June 1995.
[7] D. Hitz, J. Lau, and M. Malcolm, "File systems design for an NFS file server appliance," in USENIX Winter 1994
Technical Conference Proceedings, Jan. 1994.


[8] R. Card, T. Ts'o, and S. Tweedie, "Design and Implementation of the Second Extended Filesystem," in Proceedings of the First Dutch International Symposium on Linux, 1994.

[9] M. McKusick, et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems (TOCS), Aug. 1984.

[10] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, R. Wang. Serverless Network File Systems. 15th SOSP, p. 109-126,
Dec 1995.

[11] W. J. Bolosky, S. Corbin, D. Goebel, J. R. Douceur. Single Instance Storage in Windows 2000. To appear in 4th Usenix
Windows System Symposium, Aug 2000.

[12] D. Solomon. Inside Windows NT Second Edition. Microsoft Press, 1998.

[13] W. Vogels. File system usage in Windows NT 4.0. 17th SOSP, p. 93-109, Dec 1999.

[14] Farsite project website: http://www.research.microsoft.com/research/sn/Farsite/

[15] D. A. Patterson, G. Gibson, and R. H. Katz, "A Case for Redundant Array of Inexpensive Disks (RAID)," in
Proceedings of the 1988 ACM Conference on Management of Data (SIGMOD), Chicago, IL, June 1988.

[16] Seagate Technology, Inc. http://seagate.com

[17] E. Lee and C. Thekkath, "Petal: Distributed Virtual Disks," in Proceedings of the 7th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.

[18] T. Anderson, M. Dahlin, et al., "Serverless Network File Systems," ACM Transactions on Computer Systems (TOCS), Feb. 1995.

[19] G. A. Gibson, et al., "A Case for Network-Attached Secure Disks," TR-CMU-CS-96-142, Sept. 1996.
