Abstract
KINDFS is a distributed file system designed to provide cost-effective storage service utilizing idle disk space on workstation
clusters. The system responsibilities are evenly distributed across a group of collaborating workstations; the proposed
architecture provides improved performance, reliability, and scalability. Workstation uptime data varies from system to
system. The feasibility of deploying KINDFS on an existing desktop computing infrastructure is discussed. A KINDFS prototype implementation and measurement of its performance are suggested. Preliminary results indicate that KINDFS
performance is comparable to that of commonly used distributed file systems, such as NFS, Samba, and Windows 2000
Server.
I. Introduction
Many organizations have deployed desktop personal computers connected in a network for their regular operations and
sharing of resources. Similarly, with the Internet exploding in size and reaching into every walk of life, digital data
stored on-line are growing at an unprecedented rate. As a result, many organizations are under continuous pressure to
expand their storage systems as demands for their services grow and their data sets swell relentlessly. Recent technological innovations, ranging from faster peripheral channels to dedicated storage area networks (SANs) to aggressively specialized storage systems built on dedicated hardware and software, have provided solutions to the problem of supplying large storage space. However, the high cost of these specialized storage systems often makes it difficult for many organizations to budget for them. At the same time, the computer industry has
made significant advances in magnetic recording technology, with the cost of disk drives reduced due to mass
production of disk drives. The standard disk capacity on mainstream computers is about 20-30GB as of mid-2001 and
is growing continuously over time. However, most users prefer the network storage for various reasons:
a). Mobility - users want to access their data through a consistent interface from any place;
b). Quality of Service - network storage is normally provided by high-performance, highly available storage systems with built-in redundancy and regular backup schedules;
c). Security - system security is much easier to maintain on a centralized storage system managed by professional administrators than on a decentralized system managed by individual users.
As a result of this, most of the local disk space on client workstations is only used for operating systems, application
programs and temporary files, which in total take up only 2 to 4GB of disk space. Douceur [1] and Bolosky [2] measured
and analyzed a large set of client machines in a large commercial environment with more than 60,000 desktop personal
computers. The measurement covers both disk usage and content. The results show that about 53% and 50% of the overall disk space in the studied environment was in use in September 1998 and August 1999, respectively. The disparity in space utilization between storage servers and local machines is expected to widen further over time as average disk sizes grow rapidly. This has motivated the design and deployment of various systems that utilize idle disk space on workstation clusters. Using the network-attached storage (NAS) approach [3], NFS by Sun Microsystems provides one way of sharing files on networked workstations. The KINDFS architecture is designed to offer a file interface, similar to modern storage appliances, with built-in fault tolerance to achieve reliability and high availability. System responsibilities are evenly distributed among all members of the network, so KINDFS can sustain balanced and scalable performance.
The rest of this paper is organized as follows. Section 2 describes the KINDFS architecture and presents a feasibility
study of deploying KINDFS on existing client computing infrastructure. Section 3 gives methods for the analysis of the
performance of the system, while Section 4 briefly reviews related work. Section 5 concludes this paper and discusses
future work.
The primary objective of this study is to transform the idle disk space into usable storage service with satisfactory
reliability and efficiency. A good solution should require little administrative effort; otherwise, the prohibitive human
cost may overshadow the gain from recovered resources. KINDFS is not intended to replace main storage servers.
Instead KINDFS attempts to provide extra storage space supplementary to main storage servers with little or no further
investment. Potentially the recovered storage space can be used for archiving purposes for both end-users and system
administrators, such as large multimedia data or system snapshot files for online backup. Other possible usages include
web page caching, website replication, or data buffering for search engines. It is possible to solve the idling disk space
problem using existing software installed on current systems. Specifically, on UNIX workstations disk space can be
shared via NFS or Samba service; on Windows machines disk space can be shared through CIFS protocol [4]. On UNIX
platform, automount can be used to construct one unique file system namespace. The Distributed File System (DFS) service on Windows 2000 Server serves the same purpose, coalescing multiple file resources on Windows 2000 Server or Samba (2.2.0 or later) into a single namespace. However, this is a manual process and may be cost-prohibitive because of the extensive human effort involved in the setup, and such an ad hoc solution normally provides little data fault tolerance.
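As context for the namespace coalescing that automount and DFS perform, the following is a minimal Python sketch of longest-prefix resolution over a stitched namespace. The share names and the `MOUNT_TABLE`/`resolve` interface are hypothetical, invented only for illustration.

```python
# Sketch (hypothetical names): mapping of global-namespace prefixes
# to independently exported shares, in the spirit of automount / DFS.
MOUNT_TABLE = {
    "/global/homes": "nfs://serverA/export/homes",
    "/global/archive": "nfs://serverB/export/archive",
    "/global/scratch": "smb://serverC/scratch",
}

def resolve(path):
    """Translate a global-namespace path to (backing share, relative path)."""
    # Longest-prefix match so nested mount points behave sensibly.
    best = None
    for prefix, share in MOUNT_TABLE.items():
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, share)
    if best is None:
        raise FileNotFoundError(path)
    prefix, share = best
    return share, path[len(prefix):] or "/"
```

The point of the sketch is that the coalescing itself is simple; the cost-prohibitive part noted above is maintaining such a table by hand across many machines.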
The principal construct of our proposed system is the global file store, which is a logically monolithic file system
indexed by a hierarchical directory tree. Although logically monolithic, the global file store is physically distributed
among the disks of the client machines participating in the distributed file system.
We call this the K(c)luster Integrated NFS based Distributed File System, or KINDFS for short.
KINDFS is a kind of hyper file system: it stores its metadata and file data as files on a standard NFS server and on member nodes. The unused storage capacity of the member nodes is configured to add to the total storage space of the file system. This yields a virtual storage volume available for file storage, with redundancy of the storage over the entire network. The system is best suited for intranet applications on highly available data. Figure 1 shows a global view of the server and clients.
The configuration shown in the figure uses only one server and an even number of client nodes. Every member node has a mirrored counterpart in the same network; file data written to both nodes is identical, with identical attributes. When a node boots up and joins the network, it compares its KINDFS space with that of its mirror node and modifies the contents if required. This keeps both images of the file data consistent.
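The boot-time comparison between a node and its mirror described above can be sketched as follows. The `(mtime, data)` map representation and the newer-copy-wins rule are illustrative assumptions of this sketch, not the actual KINDFS protocol.

```python
def reconcile(local, mirror):
    """Bring two mirrored file maps into agreement; the newer copy wins.
    Each map is {path: (mtime, data)}. Returns the merged view that both
    nodes should converge to. Illustrative only."""
    merged = dict(local)
    for path, (mtime, data) in mirror.items():
        if path not in merged or mtime > merged[path][0]:
            merged[path] = (mtime, data)
    return merged
```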
Since file availability and reliability are drastically affected by the file replication factor, it is important to maintain a
minimum number of replicas for each file. To ensure sufficient disk space for these replicas, our proposed system
enforces only one replica of each file stored by a user. The system imposes no quotas or per-user space allotments; instead, space is allotted in a round-robin manner. This leads to a highly available, reliable, global, distributed directory service, much like that provided by xFS [10]. Each machine can use this service to locate a replica of a requested file. This
directory must be maintained in a strongly consistent fashion, to ensure that clients see the latest version of each file.
If a client makes modifications in its own KINDFS directory, the changes are propagated to the meta file server and become visible to all other KINDFS clients; thus the single global namespace of KINDFS is maintained. File attributes in KINDFS, such as file name, creation and modification times, uid, and gid, are stored as the attributes of the corresponding files in the meta file system. A typical metadata operation is illustrated by the example of file creation: when a file is created in KINDFS, a file with the same name is also created on the meta KINDFS server.
B. Efficiency Considerations
Our proposed system can increase the usable space in the global file store through two techniques. First, it can coalesce distinct files that happen to have identical contents [11]; for example, many users may store copies of a common application program in the global file store, and these can be coalesced into a single instance before that instance is replicated among multiple machines. Second, it can compress files before storing them and decompress them on the fly when they are read [12]. For clarity, logically distinct files with identical content are referred to as duplicates, and file copies the system generates as replicas. The system is planned to employ a lazy-update strategy, meaning
that the system waits for a short time after a file is written before updating the file’s replicas. Since a large fraction of
written files are deleted or overwritten within a few seconds [13], lazy update can significantly reduce file-update traffic.
Lazy update also allows a file to be written without requiring all (or even a quorum) of the replicas to be immediately
available. The disadvantage of this approach is that the content of newly written files will briefly reside on only one
machine, so loss of that machine will result in loss of the update. The directory service must keep track of which replica
contains up-to-date data, so users will not accidentally access out-of-date versions of files. The impact on the sending
machine can also be reduced by performing non-cached reads and writes, to prevent buffer cache pollution. If remote
reads of very popular files are targeted at a small number of machines, those machines could become overloaded. Our
system can avoid creating these hotspots by allowing machines to copy files from other machines' on-disk local caches.
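The duplicate-coalescing technique can be illustrated with a small content-addressed store. Keying by SHA-256 content hash is an assumption made for this sketch, not necessarily what the system would use.

```python
import hashlib

class CoalescedStore:
    """Sketch of duplicate coalescing: logically distinct files with
    identical contents share one stored instance, keyed by content hash.
    Names and layout are illustrative, not the KINDFS implementation."""
    def __init__(self):
        self.blobs = {}   # content hash -> data, stored once
        self.index = {}   # file path -> content hash

    def store(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)   # keep content only if unseen
        self.index[path] = digest

    def stored_instances(self):
        return len(self.blobs)
```

In this scheme, the common-application-program example above reduces to many index entries pointing at one blob, which is then what gets replicated.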
C. Replica Management
In this proposed system, the selection of machines for storing file replicas is driven by the availability of the machines.
The system measures machine uptimes and distributes replicas so as to maximize the minimum file availability. In
selecting locations for replicas of a given file, the system could select a set of machines whose uptimes are negatively
correlated, thus reducing the likelihood that all machines containing a replica will be down at the same time. However,
measurements suggest that this would provide little marginal benefit. The system can improve the availability of sets of
related files by storing them in the same locations. If a user needs a given set of files to accomplish a task, then if any of
those files is inaccessible, the task cannot be completed; therefore, it is beneficial to store related files together, so that
either all or none of the files are available. Since files accessed together temporally tend to be grouped together spatially,
replicas for all files in a given directory are stored on the same set of machines, and entire directories are cached
together. When a new machine joins the system, its files are replicated to the global storage areas on other machines;
space for those replicas is made by relocating replicas of other files onto the new machine. Similarly, when a machine is
decommissioned, the system creates and distributes additional replicas of the files stored on that machine. Machines join
the system by explicitly announcing their presence; however, machines can leave without notice, particularly if they
leave due to permanent failure, such as a disk-head crash. If the system notices that a machine has been down for an
extended period of time, it must assume that the machine has been decommissioned and accordingly generate new
replicas; otherwise, the possibility of another failure can jeopardize the reliability of files with replicas on that machine.
Ideally, replicas should be assigned to machines so as to maximize the minimum availability of any file, while also
maximizing the minimum reliability of any file. Measured logarithmically, the availability of a file equals the sum of the availabilities of all machines that store a replica of the file, assuming uncorrelated machine uptimes. The following
heuristic algorithm yields a low availability variance:
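The heuristic itself is not reproduced in this text; the following is a speculative greedy sketch consistent with the description, in which, on each round, the file with the lowest log-measured availability receives its next replica on the most-available machine not already holding one. The function names and the uptime map are hypothetical, and this is a reconstruction, not the authors' algorithm.

```python
import math

def place_replicas(files, machines, r):
    """Greedy sketch: give the worst-off file its next replica first.
    `machines` maps machine name -> fractional uptime in (0, 1).
    Returns {file: [machines holding a replica]}."""
    placement = {f: [] for f in files}

    def avail(f):
        # Log-measured availability: sum over replica holders of
        # -log(1 - uptime), i.e. larger is better.
        return sum(-math.log(1.0 - machines[m]) for m in placement[f])

    for _ in range(r):
        for f in sorted(files, key=avail):          # worst-off file first
            candidates = [m for m in machines if m not in placement[f]]
            best = max(candidates, key=lambda m: machines[m])
            placement[f].append(best)
    return placement
```

Because every file is brought up before any file is pushed further ahead, the spread (variance) of file availabilities stays low, which is the property the text attributes to the heuristic.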
D. Data Security
Distributing multiple replicas of a file protects not only against accidental failure but also against malicious attack. To
destroy a file, an adversary must compromise all machines that hold replicas of that file. To prevent an adversary from
coercing the system into placing all replicas of a given file on a small set of machines under the adversary’s control, the
system must use secure methods for selecting machines to house file replicas. File-update messages are digitally signed
by the writer to the file. Before applying an update to a replica it stores, each machine verifies the digital signature on
the update and compares the writer’s identity to a list of authorized writers associated with the file. This prevents an
adversary from forging and distributing file updates.
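The verification step described above can be sketched as follows. A real deployment would use public-key digital signatures; a keyed MAC (`hmac`) stands in here purely to keep the sketch self-contained and runnable, and the writer names, keys, and function interfaces are hypothetical.

```python
import hmac
import hashlib

# Hypothetical per-file authorized-writer list and writer keys.
AUTHORIZED_WRITERS = {"alice", "bob"}
WRITER_KEYS = {"alice": b"alice-secret", "bob": b"bob-secret"}

def sign_update(writer, payload):
    """Writer 'signs' the file update (MAC stands in for a signature)."""
    tag = hmac.new(WRITER_KEYS[writer], payload, hashlib.sha256).digest()
    return writer, payload, tag

def apply_update(replica, update):
    """Verify the tag and the writer's authorization before applying,
    as each replica-holding machine does in the scheme described."""
    writer, payload, tag = update
    expected = hmac.new(WRITER_KEYS[writer], payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise PermissionError("bad signature")
    if writer not in AUTHORIZED_WRITERS:
        raise PermissionError("writer not authorized for this file")
    replica["data"] = payload
```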
To prevent files from being read by unauthorized users, the contents of files are encrypted before they are replicated.
However, encryption could interfere with the automatic detection and coalescing of duplicate files, since different users
may encrypt identical plaintext files with different keys, which would normally produce different ciphertext files. Since the implementation is suggested on Linux, the security features inherent to the operating system are used, without any additional security algorithm.
To analyze the feasibility of deploying the proposed system on an existing computing infrastructure, usage data from
desktop personal computers connected in a network must be collected. The contents of file systems, the availability of machines, and the load on machines are measured. Roughly half of all machines belong to technical developers; the other half are
distributed approximately equally among the remaining categories.
Patterson et al. analyzed the reliability of RAID disk arrays using queuing theory [15]. Similarly, the proposed architecture's reliability depends on the reliability of each individual system whose idle disk space is utilized. We define
the following notation for further discussion. For disks, the Mean Time To Failure is denoted MTTF_disk and the Mean Time To Repair MTTR_disk; MTTR_disk is far smaller than MTTF_disk. For disk arrays, the Mean Time To Failure is MTTF_group. G is the number of data disks in a RAID group, and C is the number of check disks in a group, where C is 1 in RAID levels 1/4/5. Therefore, MTTF_group is given by

MTTF_group = MTTF_disk^2 / ((G + C) * (G + C - 1) * MTTR_disk)
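The group MTTF formula of Patterson et al. [15] can be evaluated numerically; the disk figures used below are illustrative assumptions, not values from this paper.

```python
def mttf_group(mttf_disk_hours, mttr_disk_hours, g, c):
    """RAID group mean time to failure (Patterson et al. [15]):
    MTTF_group = MTTF_disk^2 / ((G + C) * (G + C - 1) * MTTR_disk)."""
    n = g + c
    return mttf_disk_hours ** 2 / (n * (n - 1) * mttr_disk_hours)

# Illustrative figures (assumed, not from the paper): 500,000-hour
# disks, 24-hour repair time, 4 data disks plus 1 check disk.
hours = mttf_group(500_000, 24, 4, 1)
```

The quadratic MTTF_disk term is why a group with one check disk survives far longer than any single disk, provided repair is fast relative to disk lifetime.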
With the latest disk manufacturing technologies, modern disk drives on client machines are very reliable. The main failure factor in this study is the system availability of desktop workstations: due to operating system failures, or the user's choice to reboot the system for whatever reason, the mean uptime of a client machine is very small compared to MTTF_disk. Thus, we consider the MTTF_disk of an individual disk drive to be infinitely large.
However, when a system reboots, it can normally restore itself to working condition in 2 or 3 minutes, including grace time for all services to start. This reboot period is very small compared to the mean system uptime. We define MTTF_sys as the mean system uptime and MTTR_sys as the mean time each system reboot needs to restore the system to fully working status; MTTR_sys is far smaller than MTTF_sys.
CPU and disk load are measured by running a program on data-collection computers that iteratively selects a random machine and remotely reads the performance counters (CPU load and disk load) of that machine. CPU load was
measured as the fraction of cycles expended in processes other than the idle process. Disk load was measured as the
number of disk operations performed per second.
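The sampling loop just described can be sketched as follows; `read_counters` is a stand-in for the remote performance-counter read, whose real interface is not specified in the text.

```python
import random

def sample_loads(machine_names, read_counters, n_samples, seed=0):
    """Sketch of the described measurement loop: repeatedly pick a
    random machine and record its CPU load (fraction of non-idle
    cycles) and disk load (operations per second). `read_counters`
    is a hypothetical callable: machine -> (cpu_load, disk_ops)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        machine = rng.choice(machine_names)
        cpu_load, disk_ops_per_sec = read_counters(machine)
        samples.append((machine, cpu_load, disk_ops_per_sec))
    return samples
```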
The data-collection techniques discussed above are vulnerable to many sources of experimental error and measurement bias. In general, the effects of these errors on the results cannot be precisely quantified, but one can at least enumerate them so that their presence will not be ignored. Since each machine's owner determined whether to run the file-system scanning
program, these results could be affected by self-selection bias. The file timestamps collected and analyzed can be reset
by user-level software; therefore, they may be unreliable. In addition, it takes several minutes to perform the scan of
each file system, so this limits the granularity with which one can assess the elapsed time since file accesses. The use of
static timestamps to determine dynamic access rates prevents one from observing burstiness in file accesses. The file
access times derived from the file timestamps are biased towards integral multiples of one week, for the following reason: individual users run the scanning program at times of their own choosing, and they are disproportionately likely to do so on a weekday. Since files are more likely to be accessed on a weekday, the time since last access reflects this
congruence. The remote reads of performance counters may be restricted to a subset of the Windows NT/2000
machines on the network, since performance counters are not present in Windows 95/98, a machine’s owner can restrict
access to them in Windows NT, and they are restricted by default in Windows 2000. Windows 2000 is most likely to be
running on newer machines, which are likely to have larger-than-average disks. The operating system dynamically loads
the performance counter libraries when interrogated, and it unloads them after a period of non-use that could be smaller
than the mean time between same-machine samples. Therefore, sampled machines commonly loaded some amount of
monitoring code before responding to the query, thus increasing their workload. The exact amount of code loaded
depends on the particulars of each machine and on whether any other application or system component is already using
some part of the monitoring subsystem.
E. Availability Analysis
Redundancy is provided for files stored in the distributed storage space under KINDFS by a mirrored image of the same file on one other member node. In short, the total available storage space is divided into two parts, viz. virtual space and mirror image, as shown in Figure 1. This provides for the availability of a file even in the event that a node fails or is not powered on.
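Under the simplifying assumption of independent node failures (an assumption of this sketch, not a claim from the text), the availability of such a mirrored pair is easy to compute:

```python
def pair_availability(a_node, a_mirror):
    """Availability of a mirrored pair: the file is reachable unless the
    node and its mirror are down simultaneously. Assumes independent
    failures, which is a simplification."""
    return 1.0 - (1.0 - a_node) * (1.0 - a_mirror)
```

For example, two nodes that are each up 90% of the time yield a pair availability of 99%.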
F. Reliability Analysis
If machines always notify the system before being permanently decommissioned, overall file reliability is governed by
the disk failure rate [16]. In particular, the likelihood of permanently losing a file is exponentiated by the replication
factor. However, if machines do not always notify the system, reliability is substantially degraded. To determine a lower
bound on reliability, it is assumed that machines always disappear without warning. The mean time between losses of one machine full of replicas equals the mean machine lifetime, l. Given d directories per machine, r replicas, and a lag τ between a machine's turnoff and the creation of replacement replicas, a directory full of files will be permanently lost if all machines containing the other replicas of those files are decommissioned within τ, before the system recognizes the loss and creates new replicas.
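One way to make this lower bound concrete, under the additional simplifying assumption of independent, exponentially distributed machine lifetimes (an assumption of this sketch, not the paper's derivation):

```python
import math

def directory_loss_probability(mean_lifetime, r, lag):
    """Sketch of the loss model described: a directory is lost when the
    other r-1 machines holding its replicas all disappear within the
    lag τ before replacement replicas are created. Assumes independent
    exponential lifetimes with mean `mean_lifetime` (same units as
    `lag`); this is a simplification, not the paper's exact formula."""
    p_one = 1.0 - math.exp(-lag / mean_lifetime)   # one machine dies in τ
    return p_one ** (r - 1)
```

As expected, each additional replica multiplies the loss probability by another small factor, which is the sense in which reliability is "exponentiated by the replication factor".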
References
[1] J. R. Douceur and W. J. Bolosky, "A Large-Scale Study of File System Contents," in Proceedings of the 1999 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), May 1999.
[2] W. J. Bolosky, J. R. Douceur, et al., "Feasibility of a Serverless Distributed File System Deployed on an Existing Set of Desktop PCs," in Proceedings of the 2000 International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), June 2000.
[3] G. Gibson and R. Van Meter, "Network Attached Storage Architecture," Communications of the ACM, vol. 43, no. 11, Nov. 2000.
[4] P. Leach and D. Naik, "Common Internet File System (CIFS/1.0) Protocol Preliminary Draft," Internet-Draft, Dec. 1997.
[6] B. Callaghan, B. Pawlowski, and P. Staubach, "NFS Version 3 Protocol Specification," June 1995.
[7] D. Hitz, J. Lau, and M. Malcolm, "File System Design for an NFS File Server Appliance," in USENIX Winter 1994 Technical Conference Proceedings, Jan. 1994.
[8] R. Card, T. Ts'o, and S. Tweedie, "Design and Implementation of the Second Extended File System," in Proceedings of the First Dutch International Symposium on Linux, 1994.
[9] M. McKusick, et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems (TOCS), Aug. 1984.
[10] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, "Serverless Network File Systems," in Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP), pp. 109-126, Dec. 1995.
[11] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur, "Single Instance Storage in Windows 2000," in Proceedings of the 4th USENIX Windows Systems Symposium, Aug. 2000.
[13] W. Vogels, "File System Usage in Windows NT 4.0," in Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), pp. 93-109, Dec. 1999.
[15] D. A. Patterson, G. Gibson, and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proceedings of the 1988 ACM Conference on Management of Data (SIGMOD), Chicago, IL, June 1988.
[17] E. Lee and C. Thekkath, "Petal: Distributed Virtual Disks," in Proceedings of the 7th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996.
[18] T. Anderson, M. Dahlin, et al., "Serverless Network File Systems," ACM Transactions on Computer Systems (TOCS), Feb. 1996.
[19] G. A. Gibson, et al., "A Case for Network-Attached Secure Disks," Technical Report CMU-CS-96-142, Sept. 1996.