The purpose of a distributed file system (DFS) is to allow users of physically distributed
computers to share data and storage resources by using a common file system. A typical
configuration for a DFS is a collection of workstations and mainframes connected by a
local area network (LAN). A DFS is implemented as part of the operating system of each
of the connected computers. This paper establishes a viewpoint that emphasizes the
dispersed structure and decentralization of both data and control in the design of such
systems. It defines the concepts of transparency, fault tolerance, and scalability and
discusses them in the context of DFSs. The paper claims that the principle of distributed
operation is fundamental for a fault tolerant and scalable DFS design. It also presents
alternatives for the semantics of sharing and methods for providing access to remote files.
A survey of contemporary UNIX®-based systems, namely, UNIX United, Locus, Sprite,
Sun’s Network File System, and ITC’s Andrew, illustrates the concepts and demonstrates
various implementations and design alternatives. Based on the assessment of these
systems, the paper makes the point that a departure from the approach of extending
centralized file systems over a communication network is necessary to accomplish sound
distributed file system design.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or
distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its
date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To
copy otherwise, or to republish, requires a fee and/or specific permission.
© 1990 ACM 0360-0300/90/1200-0321 $01.50
A DFS is a file system whose clients, servers, and storage devices are dispersed among the machines of a distributed system. Accordingly, service activity has to be carried out across the network, and instead of a single centralized data repository there are multiple and independent storage devices. As will become evident, the concrete configuration and implementation of a DFS may vary. There are configurations where servers run on dedicated machines, as well as configurations where a machine can be both a server and a client. A DFS can be implemented as part of a distributed operating system or, alternatively, by a software layer whose task is to manage the communication between conventional operating systems and file systems. The distinctive features of a DFS are the multiplicity and autonomy of clients and servers in the system.

The paper is divided into two parts. In the first part, which includes Sections 1 to 6, the basic concepts underlying the design of a DFS are discussed. In particular, alternatives and trade-offs regarding the design of a DFS are pointed out. The second part surveys five DFSs: UNIX United [Brownbridge et al. 1982; Randell 1983], Locus [Popek and Walker 1985; Walker et al. 1983], Sun’s Network File System (NFS) [Sandberg et al. 1985; Sun Microsystems Inc. 1988], Sprite [Nelson et al. 1988; Ousterhout et al. 1988], and Andrew [Howard et al. 1988; Morris et al. 1986; Satyanarayanan et al. 1985]. These systems exemplify the concepts and observations mentioned in the first part and demonstrate various implementations. A point in the first part is often illustrated by referring to a later section covering one of the surveyed systems.

The fundamental concepts of a DFS can be studied without paying significant attention to the actual operating system of which it is a component. The first part of the paper adopts this approach. The second part reviews actual DFS architectures that serve to demonstrate approaches to integration of a DFS with an operating system and a communication network. To complement our discussion, we refer the reader to the survey paper by Tanenbaum and Van Renesse [1985], where the broader context of distributed operating systems and communication primitives are discussed.

In light of the profusion of UNIX-based DFSs and the dominance of the UNIX file system model, five UNIX-based systems are surveyed. The first part of the paper is independent of this choice as much as possible. Since a vast majority of the actual DFSs (and all systems surveyed and mentioned in this paper) have some relation to UNIX, however, it is inevitable that the concepts are understood best in the UNIX context. The choice of the five systems and the order of their presentation demonstrate the evolution of DFSs in the last decade.

Section 1 presents the terminology and concepts of transparency, fault tolerance, and scalability. Section 2 discusses transparency and how it is expressed in naming schemes in greater detail. Section 3 introduces notions that are important for the semantics of sharing files, and Section 4 compares methods of caching and remote service. Sections 5 and 6 discuss issues related to fault tolerance and scalability, respectively, pointing out observations based on the designs of the surveyed systems. Sections 7-11 describe each of the five systems mentioned above, including distinctive features of a system not related to the issues presented in the first part. Each description is followed by a summary of the prominent features of the corresponding system. A table compares the five systems and concludes the survey. Many important aspects of DFSs and systems are omitted from this paper; thus, Section 12 reviews related work not emphasized in our discussion. Finally, Section 13 provides conclusions and a bibliography provides related literature not directly referenced.

1. TRENDS AND TERMINOLOGY

Ideally, a DFS should look to its clients like a conventional, centralized file system. That is, the multiplicity and dispersion of servers and storage devices should be transparent to clients. As will become evident, transparency has many dimensions and degrees. A fundamental property, called network transparency, implies that clients should be able to access remote files using the same set of file operations applicable to local files. That is, the client interface of a DFS should not distinguish between local and remote files. It is up to the DFS to locate the files and arrange for the transport of the data.

Another aspect of transparency is user mobility, which implies that users can log in to any machine in the system; that is, they are not forced to use a specific machine. A transparent DFS facilitates user mobility by bringing the user’s environment (e.g., home directory) to wherever he or she logs in.

The most important performance measurement of a DFS is the amount of time needed to satisfy service requests. In conventional systems, this time consists of disk access time and a small amount of CPU processing time. In a DFS, a remote access has the additional overhead attributed to the distributed structure. This overhead includes the time needed to deliver the request to a server, as well as the time needed to get the response across the network back to the client. For each direction, in addition to the actual transfer of the information, there is the CPU overhead of running the communication protocol software. The performance of a DFS can be viewed as another dimension of its transparency; that is, the performance of a DFS should be comparable to that of a conventional file system.

We use the term fault tolerance in a broad sense. Communication faults, machine failures (of type fail stop), storage device crashes, and decays of storage media are all considered to be faults that should be tolerated to some extent. A fault-tolerant system should continue functioning, perhaps in a degraded form, in the face of these failures. The degradation can be in performance, functionality, or both but should be proportional, in some sense, to the failures causing it. A system that grinds to a halt when a small number of its components fail is not fault tolerant.

The capability of a system to adapt to increased service load is called scalability. Systems have bounded resources and can become completely saturated under increased load. Regarding a file system, saturation occurs, for example, when a server’s CPU runs at very high utilization rate or when disks are almost full. As for a DFS in particular, server saturation is even a bigger threat because of the communication overhead associated with processing remote requests. Scalability is a relative property; a scalable system should react more gracefully to increased load than a nonscalable one will. First, its performance should degrade more moderately than that of a nonscalable system. Second, its resources should reach a saturated state later, when compared with a nonscalable system.

Even a perfect design cannot accommodate an ever-growing load. Adding new resources might solve the problem, but it might generate additional indirect load on other resources (e.g., adding machines to a distributed system can clog the network and increase service loads). Even worse, expanding the system can incur expensive design modifications. A scalable system should have the potential to grow without the above problems. In a distributed system, the ability to scale up gracefully is of special importance, since expanding the network by adding new machines or interconnecting two networks together is commonplace. In short, a scalable design should withstand high-service load, accommodate growth of the user community, and enable simple integration of added resources.

Fault tolerance and scalability are mutually related to each other. A heavily loaded component can become paralyzed and behave like a faulty component. Also, shifting a load from a faulty component to its backup can saturate the latter. Generally, having spare resources is essential for reliability, as well as for handling peak loads gracefully.

An advantage of distributed systems over centralized systems is the potential for fault tolerance and scalability because of the multiplicity of resources. Inappropriate design can, however, obscure this potential and, worse, hinder the system’s scalability and make it failure prone. Fault tolerance and scalability considerations call for a design demonstrating distribution of control
ACM Computing Surveys, Vol. 22, No. 4, December 1990
Distributed File Systems • 325
Location Table
entire parent directory of b. In the former, it is the server (machine2 in the example) that performs the lookup, whereas in the latter it is the client that initiates the lookup that actually searches the directory. In case the server’s CPU is loaded, this choice is of consequence. In Andrew and Locus, clients perform the lookups; in NFS and Sprite the servers perform them.

2.3.2 Structured Identifiers

Implementing transparent naming requires the provision of the mapping of a file name to its location. Keeping this mapping manageable calls for aggregating sets of files into component units and providing the mapping on a component unit basis rather than on a single file basis. Typically, structured identifiers are used for this aggregation. These are bit strings that usually have two parts. The first part identifies the component unit to which the file belongs; the second identifies the particular file within the unit. Variants with more parts are possible. The invariant of structured names is, however, that individual parts of the name are unique for all times only within the context of the rest of the parts. Uniqueness at all times can be obtained by not reusing a name that is still used, or by allocating a sufficient number of bits for the names (this method is used in Andrew), or by using a time stamp as one of the parts of the name (as done in Apollo Domain [Leach et al. 1982]).

To enhance the availability of the crucial name to location mapping information, methods such as replicating it or caching parts of it locally by clients are used. As was noted, location independence means that the mapping changes in time and, hence, replicating the mapping makes updating the information consistently a complicated matter. Structured identifiers are location independent; they do not mention
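
The two-part structured identifier and the per-unit location mapping described in Section 2.3.2 can be sketched as follows. This is a minimal illustration and not the scheme of any surveyed system: the names `StructuredId` and `LocationTable`, the integer encoding of the parts, and the machine name `machine5` are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StructuredId:
    """Hypothetical two-part identifier."""
    unit: int     # identifies the component unit the file belongs to
    file_no: int  # identifies the file; unique only within that unit


class LocationTable:
    """Hypothetical mapping kept per component unit, not per file."""

    def __init__(self):
        self._unit_to_server = {}

    def place_unit(self, unit, server):
        # Record (or change) the machine that stores a whole component unit.
        self._unit_to_server[unit] = server

    def locate(self, ident):
        # Only the unit part is consulted; the identifier names no machine.
        return self._unit_to_server[ident.unit]


table = LocationTable()
table.place_unit(7, "machine2")
report = StructuredId(unit=7, file_no=42)
print(table.locate(report))

# Migrating the unit changes one table entry; the identifier is unchanged.
table.place_unit(7, "machine5")
print(table.locate(report))
```

Because the identifier carries no machine name, moving a component unit touches a single table entry, which is what keeps the mapping manageable.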
their output are equivalent to the effect and output of executing the same sessions in some serial order. Locking a file for the duration of a session implements these semantics. Refer to the rich literature on database management systems to understand the concepts of transactions and locking [Bernstein et al. 1987]. In the Cambridge File Server, the beginning and end of a transaction are implicit in the Open file, Close file operations, and transactions can involve only one file [Needham and Herbert 1982]. Thus, a file session in that system is actually a transaction.

Variants of UNIX and (to a lesser degree) session semantics are the most commonly used policies. An important trade-off emerges when evaluating these two extremes of sharing semantics. Simplicity of a distributed implementation is traded for the strength of the semantics’ guarantee. UNIX semantics guarantee the strong effect of making all accesses see the same version of the file, thereby ensuring that every access is affected by all previous ones. On the other hand, session semantics do not guarantee much when a file is accessed concurrently, since accesses at different machines may observe different versions of the accessed file. The ramifications on the ease of implementation are discussed in the next section.

4. REMOTE-ACCESS METHODS

Consider a client process that requests to access (i.e., Read or Write) a remote file. Assuming the server storing the file was located by the naming scheme, the actual data transfer to satisfy the client’s request for the remote access should take place. There are two complementary methods for handling this type of data transfer.

• Remote Service. Requests for accesses are delivered to the server. The server machine performs the accesses, and their results are forwarded back to the client. There is a direct correspondence between accesses and traffic to and from the server. Access requests are translated to messages for the servers, and server replies are packed as messages sent back to the clients. Every access is handled by the server and results in network traffic. For example, a Read corresponds to a request message sent to the server and a reply to the client with the requested data. A similar notion called Remote Open is defined in Howard et al. [1988].

• Caching. If the data needed to satisfy the access request are not present locally, a copy of those data is brought from the server to the client. Usually the amount of data brought over is much larger than the data actually requested (e.g., whole files or pages versus a few blocks). Accesses are performed on the cached copy in the client side. The idea is to retain recently accessed disk blocks in cache so repeated accesses to the same information can be handled locally, without additional network traffic. Caching performs best when the stream of file accesses exhibits locality of reference. A replacement policy (e.g., Least Recently Used) is used to keep the cache size bounded. There is no direct correspondence between accesses and traffic to the server. Files are still identified, with one master copy residing at the server machine, but copies of (parts of) the file are scattered in different caches. When a cached copy is modified, the changes need to be reflected on the master copy and, depending on the relevant sharing semantics, on any other cached copies. Therefore, Write accesses may incur substantial overhead. The problem of keeping the cached copies consistent with the master file is referred to as the cache consistency problem [Smith 1982].

It should be realized that there is a direct analogy between disk access methods in conventional file systems and remote access methods in DFSs. A pure remote service method is analogous to performing a disk access for each and every access request. Similarly, a caching scheme in a DFS is an extension of caching or buffering techniques in conventional file systems (e.g., buffering block I/O in UNIX [McKusick et al. 1984]). In conventional file systems, the rationale behind caching is to reduce disk I/O, whereas in DFSs the goal is to reduce
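
The difference between the two remote-access methods can be made concrete with a small read-only sketch that counts server requests for the same access trace. The classes `Server`, `RemoteServiceClient`, and `CachingClient`, the block-level granularity, and the trace itself are assumptions made for illustration; Writes and cache consistency are deliberately left out.

```python
from collections import OrderedDict


class Server:
    """Hypothetical server holding the master copy of each block."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.requests = 0  # request messages received over the network

    def read(self, block_no):
        self.requests += 1
        return self.blocks[block_no]


class RemoteServiceClient:
    """Every Read becomes one request message to the server."""

    def __init__(self, server):
        self.server = server

    def read(self, block_no):
        return self.server.read(block_no)


class CachingClient:
    """Reads are served from a bounded local cache with LRU replacement."""

    def __init__(self, server, capacity):
        self.server = server
        self.capacity = capacity
        self.cache = OrderedDict()  # oldest entry first

    def read(self, block_no):
        if block_no in self.cache:
            self.cache.move_to_end(block_no)  # mark as most recently used
            return self.cache[block_no]
        data = self.server.read(block_no)     # miss: fetch from the server
        self.cache[block_no] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return data


def count_requests(make_client, trace):
    server = Server({n: f"block-{n}" for n in set(trace)})
    client = make_client(server)
    for n in trace:
        client.read(n)
    return server.requests


trace = [1, 2, 1, 1, 3, 1, 2]
print(count_requests(RemoteServiceClient, trace))
print(count_requests(lambda s: CachingClient(s, capacity=2), trace))
```

With remote service the number of server requests equals the number of accesses; with caching it depends on the locality of the trace and on the cache size.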
icantly reduce network traffic. In addition, the write-on-close policy requires the closing process to delay while the file is written through, which reduces the performance advantages of delayed-writes. The performance advantages of this policy over delayed-write with more frequent flushing are apparent for files that are both open for long periods and modified frequently.

As a reference, we present data regarding the utility of caching in UNIX 4.2 BSD. UNIX 4.2 BSD uses a cache of about 400Kb holding different size blocks (the most common size is 4Kb). A delayed-write policy with 30-second intervals is used. A miss ratio (ratio of the number of real disk I/O to logical disk accesses) of 15 percent is reported in McKusick et al. [1984], and of 50 percent in Ousterhout et al. [1985]. The latter paper also provides the following statistics, which were obtained by simulations on UNIX: A 4Mb cache of 4Kb blocks eliminates between 65 and 90 percent of all disk accesses for file data. A write-through policy resulted in the highest miss ratio. Delayed-write policy with flushing when the block is ejected from cache had the lowest miss ratio.

There is a tight relation between the modification policy and semantics sharing. Write-on-close is suitable for session semantics. By contrast, using any delayed-write policy, when situations of files that are updated concurrently occur frequently in conjunction with UNIX semantics, is not reasonable and will result in long delays and complex mechanisms. A write-through policy is more suitable for UNIX semantics under such circumstances.

4.1.4 Cache Validation

A client is faced with the problem of deciding whether or not its locally cached copy of the data is consistent with the master copy. If the client determines that its cached data is out of date, accesses can no longer be served by that cached data. An up-to-date copy of the data must be brought over. There are basically two approaches to verifying the validity of cached data:

• Client-initiated approach. The client initiates a validity check in which it contacts the server and checks whether the local data are consistent with the master copy. The frequency of the validity check is the crux of this approach and determines the resulting sharing semantics. It can range from a check before every single access to a check only on first access to a file (on file Open). Every access that is coupled with a validity check is delayed, compared with an access served immediately by the cache. Alternatively, a check can be initiated every fixed interval of time. Usually the validity check involves comparing file header information (e.g., time stamp of the last update maintained as i-node information in UNIX). Depending on its frequency, this kind of validity check can cause severe network traffic, as well as consume precious server CPU time. This phenomenon was the cause for Andrew designers to withdraw from this approach (Howard et al. [1988] provide detailed performance data on this issue).

• Server-initiated approach. The server records for each client the (parts of) files the client caches. Maintaining information on clients has significant fault tolerance implications (see Section 5.1). When the server detects a potential for inconsistency, it must now react. A potential for inconsistency occurs when a file is cached in conflicting modes by two different clients (i.e., at least one of the clients specified a Write mode). If session semantics are implemented, whenever a server receives a request to close a file that has been modified, it should react by notifying the clients to discard their cached data and consider it invalid. Clients having this file open at that time, discard their copy when the current session is over. Other clients discard their copy at once. Under session semantics, the server need not be informed about Opens of already cached files. The server is informed about the Close of a writing session, however. On the other hand, if a more restrictive sharing semantics is implemented, like UNIX semantics, the server must be more involved. The server must be notified whenever a file is opened, and the intended mode (Read or Write) must be indicated. Assuming such notification, the server can act when it detects a file that is opened simultaneously in conflicting modes by disabling caching for that particular file (as done in Sprite). Disabling caching results in switching to a remote service mode of operation.

A problem with the server-initiated approach is that it violates the traditional client-server model, where clients initiate activities by requesting service. Such violation can result in irregular and complex code for both clients and servers.

In summary, the choice is longer accesses and greater server load using the former method versus the fact that the server maintains information on its clients using the latter.

4.2 Cache Consistency

Before delving into the evaluation and comparison of remote service and caching, we relate these remote access methods to the examples of sharing semantics introduced in Section 3.

• Session semantics are a perfect match for caching entire files. Read and Write accesses within a session can be handled by the cached copy, since the file can be associated with different images according to the semantics. The cache consistency problem diminishes to propagating the modifications performed in a session to the master copy at the end of a session. This model is quite attractive since it has simple implementation. Observe that coupling these semantics with caching parts of files may complicate matters, since a session is supposed to read the image of the entire file that corresponds to the time it was opened.

• A distributed implementation of UNIX semantics using caching has serious consequences. The implementation must guarantee that at all times only one client is allowed to write to any of the cached copies of the same file. A distributed conflict resolution scheme must be used in order to arbitrate among clients wishing to access the same file in conflicting modes. In addition, once a cached copy is modified, the changes need to be propagated immediately to the rest of the cached copies. Frequent Writes can generate tremendous network traffic and cause long delays before requests are satisfied. This is why implementations (e.g., Sprite) disable caching altogether and resort to remote service once a file is concurrently open in conflicting modes. Observe that such an approach implies some form of a server-initiated validation scheme, where the server makes a note of all Open calls. As was stated, UNIX semantics lend themselves to an implementation where a file is associated with a single physical image. A remote service approach, where all requests are directed and served by a single server, fits nicely with these semantics.

• The immutable shared files semantics were invented for a whole file caching scheme [Schroeder et al. 1985]. With these semantics, the cache consistency problem vanishes totally.

• Transactions-like semantics can be implemented in a straightforward manner using locking, when all the requests for the same file are served by the same server on the same machine as done in remote service.

4.3 Comparison of Caching and Remote Service

Essentially, the choice between caching and remote service is a choice between potential for improved performance and simplicity. We evaluate the trade-off by listing the merits and demerits of the two methods.

• When caching is used, a substantial amount of the remote accesses can be handled efficiently by the local cache. Capitalizing on locality in file access patterns makes caching even more attractive. Ramifications can be performance transparency: Most of the remote accesses will be served as fast as local ones. Consequently, server load and network traffic are reduced, and the potential for scalability is enhanced. By contrast, when using the remote service method,
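
The server-side bookkeeping that Sections 4.1.4 and 4.2 describe for UNIX semantics (the server notes every Open with its mode and disables caching for a file that is open in conflicting modes) can be sketched as below. This is an illustrative model only, not Sprite's actual implementation; the class `OpenTracker`, its method names, and the `"r"`/`"w"` mode strings are assumptions.

```python
from collections import defaultdict


class OpenTracker:
    """Hypothetical server-side record of current Opens per file."""

    def __init__(self):
        self._opens = defaultdict(list)  # path -> list of (client, mode)

    def open(self, path, client, mode):
        """Record an Open; return True if clients may cache the file."""
        if mode not in ("r", "w"):
            raise ValueError("mode must be 'r' or 'w'")
        self._opens[path].append((client, mode))
        return self.cacheable(path)

    def close(self, path, client, mode):
        """Forget one matching Open."""
        self._opens[path].remove((client, mode))

    def cacheable(self, path):
        """Caching is allowed unless the file is open in conflicting modes.

        Conflict: at least two different clients have the file open and
        at least one of them opened it for writing.
        """
        opens = self._opens[path]
        clients = {c for c, _ in opens}
        writers = {c for c, m in opens if m == "w"}
        return not (writers and len(clients) > 1)


tracker = OpenTracker()
print(tracker.open("/doc", "A", "r"))  # one reader
print(tracker.open("/doc", "B", "r"))  # two readers, no writer
print(tracker.open("/doc", "C", "w"))  # a writer joins: conflict
tracker.close("/doc", "C", "w")
print(tracker.cacheable("/doc"))       # conflict is gone
```

When `cacheable` turns false, a real system would also have to tell the clients already caching the file to stop, which is the server-initiated step the text says departs from the traditional client-server model.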
guarantees better availability. (For complete description of the prefix table mechanism refer to Section 10.2.)

5.3 File Replication

Replication of files is a useful redundancy for improving availability. We focus on replication of files on different machines rather than replication on different media on the same machine (such as mirrored disks [Lampson 1981]). Multimachine replication can benefit performance too, since selecting a nearby replica to serve an access request results in shorter service time.

The basic requirement from a replication scheme is that different replicas of the same file reside on failure-independent machines. That is, the availability of one replica is not affected by the availability of the rest of the replicas. This obvious requirement implies that replication management is inherently a location-dependent activity. Provisions for placing a replica on a particular machine must be available.

It is desirable to hide the details of replication from users. It is the task of the naming scheme to map a replicated file name to a particular replica. The existence of replicas should be invisible to higher levels. At some level, however, the replicas must be distinguished from one another by having different lower level names. This can be accomplished by first mapping a file name to an entity that is able to differentiate the replicas (as done in Locus). Another transparency issue is providing replication control at higher levels. Replication control includes determining the degree of replication and placement of replicas. Under certain circumstances, it is desirable to expose these details to users. Locus, for instance, provides users and system administrators with mechanism to control the replication scheme.

The main problem associated with replicas is their update. From a user’s point of view, replicas of a file denote the same logical entity; thus, an update to any replica must be reflected on all other replicas. More precisely, the relevant sharing semantics must be preserved when accesses to replicas are viewed as virtual accesses to their logical files. The analogous database term is One-Copy Serializability [Bernstein et al. 1987]. Davidson et al. [1985] survey approaches to replication for database systems, where consistency considerations are of major importance. If consistency is not of primary importance, it can be sacrificed for availability and performance. This is an incarnation of a fundamental trade-off in the area of fault tolerance. The choice is between preserving consistency at all costs, thereby creating a potential for indefinite blocking, or sacrificing consistency under some (we hope rare) circumstance of catastrophic failures for the sake of guaranteed progress. We illustrate this trade-off by considering (in a conceptual manner) the problem of updating a set of replicas of the same file. The atomicity of such an update is a desirable property; that is, a situation in which both updated and not updated replicas serve accesses should be prevented. The only way to guarantee the atomicity of such an update is by using a commit protocol (e.g., Two-phase commit), which can lead to indefinite blocking in the face of machine and network failures [Bernstein et al. 1987]. On the other hand, if only the available replicas are updated, progress is guaranteed; stale replicas, however, are present.

In most cases, the consistency of file data cannot be compromised, and hence the price paid for increased availability by replication is a complicated update propagation protocol. One case in which consistency can be traded for performance, as well as availability, is replication of the location hints discussed in Section 2.3.2. Since hints are validated upon use, their replication does not require maintaining their consistency. When a location hint is correct, it results in quick location of the corresponding file without relying on a location server. Among the surveyed systems, Locus uses replication extensively and sacrifices consistency in a partitioned environment for the sake of availability of files for both Read and Write accesses (see Section 8.5 for details).
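
The update trade-off discussed above can be illustrated with a conceptual sketch that contrasts the two choices. It is not a commit protocol: where a real two-phase commit could block indefinitely, the first function below simply refuses the update. The `Replica` class, the version counter, and both function names are assumptions made for the example.

```python
class Replica:
    """Hypothetical replica of one file on one machine."""

    def __init__(self, name):
        self.name = name
        self.up = True     # whether the machine is currently reachable
        self.version = 0
        self.data = b""


def update_all_or_nothing(replicas, data):
    """Consistency first: update only if every replica is reachable.

    Returning False stands in for the point at which a real commit
    protocol would block until the failed machines recover.
    """
    if not all(r.up for r in replicas):
        return False
    new_version = max(r.version for r in replicas) + 1
    for r in replicas:
        r.data, r.version = data, new_version
    return True


def update_available(replicas, data):
    """Progress first: update reachable replicas, report the stale ones."""
    live = [r for r in replicas if r.up]
    if not live:
        raise RuntimeError("no replica is reachable")
    new_version = max(r.version for r in replicas) + 1
    for r in live:
        r.data, r.version = data, new_version
    return [r.name for r in replicas if not r.up]


replicas = [Replica("a"), Replica("b"), Replica("c")]
replicas[2].up = False                           # machine c has failed
print(update_all_or_nothing(replicas, b"new"))   # no progress is made
print(update_available(replicas, b"new"))        # progress, c is stale
print([(r.name, r.version) for r in replicas])
```

The stale list returned by `update_available` is exactly the state that an update propagation protocol would later have to repair.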
[Barak and Litman 1985; Barak and Paradise 1986].

The approach taken in MOS is garbage collection. It is the client’s responsibility to set, and later reset, an expiration date on state information the servers maintain for it. Clients reset this date whenever they access the server or by special, infrequent messages. If this date has expired, a periodic garbage collector reclaims that storage. This way, the server need not detect clients’ crashes. By contrast, Locus invokes a clean-up procedure whenever a server machine determines that a particular client machine is unavailable. Among other things, this procedure releases space occupied by the state of clients from the crashed machine. Detecting crashes can be very expensive, since it is based on polling and time-out mechanisms that incur substantial network overhead. The scheme MOS uses requires tolerable and scalable overhead, where every client signals a bounded number of objects (the object it owns), whereas a failure detection mechanism is not scalable since it depends on the size of the system.

Network congestion and latency are major obstacles to large-scale systems. A guideline worth pursuing is to minimize cross-machine interactions by means of caching, hints, and enforcement of relaxed sharing semantics. There is, however, a trade-off between the strictness of the sharing semantics in a DFS and the network and server loads (and hence necessarily the scalability potential). The more stringent the semantics, the harder it is to scale the system up.

Central control schemes and central resources should not be used to build scalable (and fault-tolerant) systems. Examples of centralized entities are central authentication server, central naming server, and central file server. Centralization is a form of functional asymmetry among the machines comprising the system. The ideal alternative is a configuration that is functionally symmetric; that is, all the component machines have an equal role in the operation of the system, and hence each machine has some degree of autonomy. Practically, it is impossible to comply with such a principle. For instance, incorporating diskless machines violates functional symmetry. Autonomy and symmetry are, however, important goals to which to aspire.

An important aspect of decentralization is system administration. Administrative responsibilities should be delegated to encourage autonomy and symmetry, without disturbing the coherence and uniformity of the distributed system. Andrew and Apollo Domain support decentralized system management [Leach et al. 1985].

The practical approximation to symmetric and autonomous configuration is clustering, where a system is partitioned into a collection of semiautonomous clusters. A cluster consists of a set of machines and a dedicated cluster server. To make cross-cluster file references relatively infrequent, most of the time, each machine’s requests should be satisfied by its own cluster server. Such a requirement depends on the ability to localize file references and the appropriate placement of component units. If the cluster is well balanced, that is, the server in charge suffices to satisfy a majority of the cluster demands, it can be used as a modular building block to scale up the system. Observe that clustering complies with the Bounded Resources Principle. In essence, clustering attempts to associate a server with a fixed set of clients and a set of files they access frequently, not just with an arbitrary set of files. Andrew’s use of clusters, coupled with read-only replication of key files, is a good example for a scalable clustering scheme.

UNIX United emphasizes the concept of autonomy. There, UNIX systems are joined together in a recursive manner to create a larger global system [Randell 1983]. Each component system is a complex UNIX system that can operate and be administered independently. Again, modular and autonomous components are combined to create a large-scale system. The emphasis on autonomy results in some negative effects, however, since component boundaries are visible to users.

6.2 Lightweight Processes

A major problem in the design of any service is the process structure of the server. Servers are supposed to operate efficiently
in peak periods when hundreds of active clients need to be served simultaneously. A
single server process is certainly not a good choice, since whenever a request
necessitates disk I/O the whole service is delayed until the I/O is completed.
Assigning a process for each client is a better choice; however, the overhead of
multiplexing the CPU among the processes (i.e., the context switches) is an expensive
price that must be paid.
A related problem has to do with the fact that all the server processes need to share
information, such as file headers and service tables. In UNIX 4.2 BSD processes are not
permitted to share address spaces, hence sharing must be done externally by using files
and other unnatural mechanisms.
It appears that one of the best solutions for the server architecture is the use of
Lightweight Processes (LWPs) or Threads. A thread is a process that has very little
nonshared state. A group of peer threads share code, address space, and operating
system resources. An individual thread has at least its own register state. The
extensive sharing makes context switches among peer threads and thread creation
inexpensive, compared with context switches among traditional, heavy-weight processes.
Thus, blocking a thread and switching to another thread is a reasonable solution to the
problem of a server handling many requests. The abstraction presented by a group of
LWPs is that of multiple threads of control associated with some shared resources.
There are many alternatives regarding threads; we mention a few of them briefly.
Threads can be supported above the kernel, at the user level (as done in Andrew) or by
the kernel (as in Mach [Tevanian et al. 1987]). Usually, a lightweight process is not
bound to a particular client. Instead, it serves single requests of different clients.
Scheduling threads can be preemptive or nonpreemptive. If threads are allowed to run to
completion, their shared data need not be explicitly protected. Otherwise, some
explicit locking mechanism must be used to synchronize the accesses to the shared data.
Typically, when LWPs are used to implement a service, client requests accumulate in a
common queue and threads are assigned to requests from the queue. The advantages of
using an LWP scheme to implement the service are twofold. First, an I/O request delays
a single thread, not the entire service. Second, sharing common data structures (e.g.,
the request queue) among the threads is easily facilitated.
It is clear that some form of LWP scheme is essential for servers to be scalable.
Locus, Sprite, and Andrew use such schemes; in the future NFS will too. Detailed
studies of thread implementations can be found in Kepecs [1985] and Tevanian et al.
[1987].

7. UNIX UNITED

The UNIX United project from the University of Newcastle upon Tyne, England, is one of
the earliest attempts to extend the UNIX file system to a distributed one without
modifying the UNIX kernel. In UNIX United, a software subsystem is added to each of a
set of interconnected UNIX systems (referred to as component or constituent systems),
so as to construct a distributed system that is functionally indistinguishable from a
conventional centralized UNIX system.
Originally, the component systems were perceived as mainframes functioning as
time-sharing UNIX systems, and indeed the original implementation was based on a set of
PDP-11’s connected by a Cambridge Ring.
The system is presented in two levels of detail: First, an overview of UNIX United is
given. Then the implementation, the Newcastle Connection layer, is described.

7.1 Overview

Any number of interlinked UNIX systems can be joined to compose a UNIX United system.
Their naming structures (for files, devices, directories, and commands) are joined
together into a single naming structure, in which each component system is to all
intents and purposes just a directory. Ignoring for the moment questions regarding
accreditation and access control, the resulting system is one in which each user can
read or write any file, use any device,
execute any command, or inspect any directory, regardless of the system to which it
belongs. That is, network transparency is supported.
The component unit is a complete UNIX tree belonging to a certain machine. The position
of these component units in the naming hierarchy is arbitrary. They can appear in the
naming structure in positions subservient to other component units (directly or via
intermediary directories). It is often convenient to set the naming structure to
reflect the organizational hierarchy of the environment in which the system exists.
In conventional UNIX the root of a file hierarchy is its own parent and is the only
directory not assigned a string name. In UNIX United, each component’s root is still
referred to as ‘/’ and still serves as the starting point of all pathnames starting
with a ‘/’. Roots of component units, however, are assigned names so that they become
accessible and distinguishable externally. Also, a subservient component can access its
superior system by referring to its own root’s parent (i.e., ‘/..’). Therefore, there
is only one root that is its own parent and that is not assigned a string name; namely,
the root of the composite name structure, which is just a virtual node needed to make
the whole structure a single tree. Under this convention, there is no notion of
absolute pathname. Each pathname is relative to some context, either the current
working directory or the current component unit.
In Figure 2, the directories unix1, unix2, unix3, and unix4 are component units (i.e.,
complete UNIX hierarchies) belonging to machines by the same names. For instance, all
the files descending from unix2, except files that descend from unix4, are stored on
the machine unix2. The tree rooted at unix4 descends from the directory dir, which is
an ordinary (local) directory of unix2. To illustrate the relative pathnames, note that
/../unix2/f2 is the name of the file f2 on the system unix2 from within the unix1
system. From the unix3 system, the same file is referred to as /../../unix2/f2. Now,
suppose the current root (‘/’) is as shown by the arrow. Then file f3 can be referenced
as /f3, file f1 is referred to as /../f1, file f2 is referred to as /../../unix2/f2,
and finally file f4 is referred to as /../../unix2/dir/unix4/f4.
Observe that users are aware of the upward boundaries of the current component unit
since they must use the ‘/..’ syntax whenever they wish to ascend outside of their
current machine. Hence, UNIX United fails to provide complete location transparency.
The traditional root directories (e.g., /dev, /bin) are maintained for each machine
separately. Because of the relative naming scheme, they are named, from within a
component system, in exactly the same way as in conventional UNIX (e.g., just /dev).
Each component system has its own set of named users and its own administrator
(superuser). The latter is responsible for the accreditation for users of his or her
own system as well as remote users. For uniqueness, remote users’ identifiers are
prefixed with the name of their original system. Accesses are governed by the standard
UNIX file protection mechanisms, even if they cross component boundaries. That is,
there is no need for users to log in separately or provide passwords when they access
remote files if they are properly accredited. Accreditation for remote users must be
arranged with the system administrator separately.
UNIX United is well suited for a diverse internetwork topology, spanning LANs, as well
as direct links and even WANs. The logical name space needs to be properly mapped onto
routing information in such a complex internetwork. An important design principle is
that the naming hierarchy need bear no relationship to the network topology.

7.2 Implementation-Newcastle Connection

The Newcastle Connection is a (user-level) software layer incorporated in each
component system. This layer separates the UNIX kernel on one hand from applications,
command programs, and the shell on the other hand. It intercepts all system calls
concerning files and filters out those that have to be redirected to remote systems.
Also, the Connection layer accepts
Figure 2. UNIX United hierarchy.
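The relative naming that Figure 2 illustrates can be sketched in a few lines of Python. This is only an illustrative model, not the Newcastle Connection code: the `Node` class, the `resolve` function, and the tree shape built in `build_figure2_tree` are assumptions, with the tree reconstructed from the pathname examples given in the text (the figure itself is not reproduced here).

```python
class Node:
    """A directory or file in the composite name structure (hypothetical model)."""

    def __init__(self, name, parent=None, is_file=False):
        self.name = name
        # Only the virtual root of the whole structure is its own parent.
        self.parent = parent if parent is not None else self
        self.is_file = is_file
        self.children = {}

    def add(self, name, is_file=False):
        child = Node(name, self, is_file)
        self.children[name] = child
        return child


def resolve(path, component_root, cwd=None):
    """Resolve a pathname relative to a context.

    A leading '/' means the root of the current component unit, not a
    global root, so '/..' climbs out of the current machine.
    """
    if path.startswith("/") or cwd is None:
        node = component_root
    else:
        node = cwd
    for part in path.split("/"):
        if part in ("", "."):
            continue
        if part == "..":
            node = node.parent
        else:
            node = node.children[part]  # raises KeyError for unknown names
    return node


def build_figure2_tree():
    """Tree shape assumed from the pathname examples in the text."""
    root = Node("<virtual root>")
    unix1 = root.add("unix1")
    unix2 = root.add("unix2")
    unix1.add("f1", is_file=True)
    unix3 = unix1.add("unix3")
    unix3.add("f3", is_file=True)
    unix2.add("f2", is_file=True)
    unix4 = unix2.add("dir").add("unix4")
    unix4.add("f4", is_file=True)
    return {"unix1": unix1, "unix2": unix2, "unix3": unix3, "unix4": unix4}
```

With this model, `/../unix2/f2` resolved from unix1 and `/../../unix2/f2` resolved from unix3 reach the same node, which shows why a name is meaningful only together with its context.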
system calls that have been directed to it from other systems. Remote layers manage
communication by the means of a RPC protocol. Figure 3 is a schematic view of the
software architecture just described.
Incorporating the Connection layer preserves both the same UNIX system call interface
and the UNIX kernel, in spite of the extensive remote activity carried out by the
system. The penalty for preserving the kernel intact is the fact that the service is
implemented as user-level daemon processes (as opposed to a kernel implementation),
which slow down remote operation.
Each Connection layer stores a partial skeleton of the overall naming structure. Each
system stores its own file system locally. In addition, each system maintains
information on the fragments of the overall name structure that relate it to its
neighboring systems in the naming structure (i.e., systems that can be reached via
traversal of the naming tree without passing through another system). For instance,
refer to Figure 2. System unix2 is aware of the position of systems unix1, unix2, and
unix4 in the global tree. Figure 4 shows the relative positioning of the component
units of the global name space that system unix2 knows about.
The fragments maintained by different systems overlap and hence must remain consistent,
a requirement that makes changing the overall structure a very expensive (and hence
infrequent) event. Some leaves of the partial structure stored locally correspond to
remote roots of other parts of the global file system. These leaves are specially
marked and contain addresses of the appropriate storage sites of the descending file
systems. Pathname traversals have to be continued remotely when encountering such
marked leaves and, in fact, can span more than two systems until the target file is
located. Therefore, a strict client-server pair model is not preserved. Once a name is
resolved and the file is opened, it is accessed using file descriptors. The Connection
layer marks descriptors that refer to remote files and keeps network addresses and
routing information for them in a per-process table.
The actual remote file accesses are carried out by a set of file server processes on
the target system. Each client has its own file server process with which it
communicates directly. The initial connection is established with the aid of a spawner
process that has a standard fixed name that makes it callable from any external
process. This spawner process performs the remote access rights checks according to a
machine-user identification pair. It also converts this identification to a valid local
name. For the sake of preserving UNIX semantics, once a user process forks, its file
service process
forks as well. File descriptors (not lower level means such as i-nodes) are used to
identify files between a user and its file server. This is a stateful service scheme
and hence does not excel in terms of robustness.

7.3 Summary

The overall profile of the UNIX United system can be characterized by the following
prominent features:

• Logical Name Structure. The UNIX United name structure is a hierarchy composed of
  component UNIX subtrees. There is an explicitly visible correspondence between a
  machine and a subtree in the structure; hence, machine boundaries are noticeable.
  Users must use the ‘/..’ trap to get out of the current component unit. There are no
  absolute pathnames; all pathnames are relative to some context.
• Recursive Structure. Structuring a UNIX United system out of a set of component
  systems is a recursive process akin to a recursive definition of a tree. In theory,
  such a system can be indefinitely extensible. The building block of this recursive
  scheme is an autonomous and complete UNIX system.
• Connection Layer. Conceptually, the Connection layer implementation is elegant and
  simple. It is a modular subsystem interfacing two existing layers without modifying
  either of them or their original semantics and still extending their capabilities
  considerably. The implementation strategy is to relink application programs with the
  Connection layer library routines. These routines intercept file system calls and
  forward the remote ones to user-level daemons at the remote sites.

Even though UNIX United is outdated, it serves our purposes well in demonstrating
network transparency without location transparency, a simple implementation technique,
and the issue of autonomy of component systems.

8. LOCUS

Locus is an ambitious project aimed at building a full-scale distributed operating
system. The system is upward compatible with UNIX, but unlike NFS, UNIX United, and
other UNIX-based distributed systems, the extensions are major ones and necessitate a
new kernel rather than a modified one. Locus stands out among other
the new version number (in order to prevent attempts to read the old version). It is
the responsibility of these additional SSs to bring their version up to date by
propagating the entire file or just the changes. A queue of propagation requests is
kept within the kernel at each site, and a kernel process services the queue
efficiently by issuing appropriate Read requests. This propagation procedure uses the
standard commit mechanism. Thus, if contact with the file containing the newer version
is lost, the local file is left with a coherent copy, albeit still out of date. Given
this commit mechanism, one is always left with either the original file or a completely
changed file, but never with a partially made change, even in the face of site
failures.

8.4 Synchronizing Accesses to Files

The default synchronization policy in Locus is to emulate UNIX semantics on file
accesses in a distributed environment. UNIX semantics can be implemented fairly easily
by having the processes share the same operating system data structures and caches and
by using locks on data structures to serialize requests. In Locus the sharing processes
may not co-reside on the same machine, and hence the implementation is more
complicated.
Recall that UNIX semantics allow several processes descending from the same ancestor
process to share the same current position (offset) in a file. A single token scheme is
devised to preserve this special mode of sharing. Only when the token is present can a
site proceed with executing system calls needing the offset.
In UNIX, the same in-core i-node for a file can be shared by several processes. In
Locus, the situation is much more complicated since the i-node of the file, as well as
the parts of the file itself, can be cached at several sites. Token schemes are used to
synchronize sharing of a file’s i-node and data. An exclusive-writer-multiple-readers
policy is enforced. Only a site with a write token for a file may modify the file; any
site with a read token can read it. The token schemes are coordinated by token managers
operating at the corresponding storage sites.
The cached data pages are guaranteed to contain valid data only when the file’s data
token is present. When the write data token is taken from that site, the i-node, as
well as all modified pages, is copied back to the SS. Since arbitrary changes
(initiated by remote clients) may have occurred when the token was not present, all
cached buffers are invalidated when the token is released. When a data token is granted
to a site, both the i-node and data pages need to be fetched from the SS. There are
some exceptions to enforcing this policy. Some attribute reading and writing calls
(e.g., stat) as well as directory reading and modifying (e.g., lookup) calls are not
subject to the synchronization constraints. These calls are sent directly to the SS,
where the changes are made, committed, and propagated to all storage and using sites.
Alternatively to the default UNIX semantics, Locus offers facilities for locking entire
files or parts of them. Locking can be advisory (only checked as a result of a locking
attempt) or enforced (checked on all reads and writes). A process can choose to either
fail if it cannot immediately get a lock or wait for it to be released.

8.5 Operation in a Faulty Environment

The basic approach in Locus is to maintain, within a single partition, consistency
among copies of a file. The policy is to allow updates only in a partition that has the
primary copy. It is guaranteed that the most recent version of a file in a partition is
read. The latter guarantee applies to all partitions.
A central point addressed in this section is the reconciliation of replicated
filegroups residing at partitioned sites. During normal operation, the commit protocol
ascertains proper propagation of updates as described earlier. A more elaborate scheme
has to be used by recovering sites wishing to bring their packs up to date. To this
end, the system maintains a commit count for each filegroup, enumerating each commit of
every file in the filegroup. Each pack has a
lower-water-mark (lwm) that is a commit count value up to which the system guarantees
that all prior commits are reflected in the pack. Also, the primary copy pack (usually
stored at the CSS) keeps a list enumerating the files in the filegroup and the
corresponding commit counts of all the recent commits in secondary storage. When a pack
joins a partition it attempts to contact the CSS and checks whether its lwm is within
the recent commit list range. If this is the case, the pack site schedules a kernel
process that brings the pack to a consistent state by copying only the files that
reflect commits later than that of the site’s lwm. If the CSS is not available, writing
is disallowed in this partition, but reading is possible after a new CSS is chosen. The
new CSS communicates with the partition members so it will be informed of the most
recent available (in the partition) version of each file in the filegroup. Once the new
CSS accomplishes the objective, other pack sites can reconcile themselves with it. As a
result, all communicating sites see the same view of the filegroup, and this view is as
complete as possible, given a particular partition. Note that since updates are allowed
within the partition with the primary copy and Reads are allowed in the rest of the
partitions, it is possible to Read out-of-date replicas of a file. Thus, Locus
sacrifices consistency for the ability to continue to both update and read files in a
partitioned environment.
When a pack is too far out of date (i.e., its lwm is lower than the earliest commit
count value in the primary copy commit list), the system invokes an application-level
process to bring the filegroup up to date. At this point, the system lacks sufficient
knowledge of the most recent commits to identify the missing updates. Instead, the site
must inspect the entire i-node space to determine which files in its pack are out of
date.
When a site is lost from an operational Locus network, a clean-up procedure is
necessary. Essentially, once site a has decided that site b is unavailable, site a must
invoke failure handling for remote resources that processes local to a were using at
site b, and for all local resources being used by processes local to site b. This
substantial cleaning procedure is the penalty of the state information kept by all
three sites participating in file access.
Since directory updates are not restricted to be applied to the primary copy, conflicts
among updates in different partitions may arise [Walker et al. 1983]. Because of the
simple nature of directory modification, however, an automatic reconciliation procedure
is devised. This procedure is based on comparing the i-nodes and string name pairs of
replicas of the same directory. The most extreme action is taken when the same name
string corresponds to two different i-nodes (i.e., the same name is used for creating
two different files); it amounts to altering the file names slightly and notifying the
files’ owners by electronic mail.

8.8 Summary

An overall profile and evaluation of Locus is summarized by pointing out the following
issues:

• Distributed operating system. Because of the multiple dimensions of transparency in
  Locus, it comes close to the definition of a truly distributed operating system in
  contrast to a collection of network services [Tanenbaum and Van Renesse 1985].
• Implementation strategy. Essentially, kernel augmentation is the implementation
  strategy in Locus. The common pattern in Locus is kernel-to-kernel communication via
  specialized, high-performance protocols. This strategy is needed to support the
  philosophy of a distributed operating system.
• Replication. A primary copy replication scheme is used in Locus. The main merit of
  this kind of replication scheme is increased availability of directories that exhibit
  a high read-write ratio. Availability for modifying files is not increased by the
  primary copy approach. Handling replication transparently is one of the reasons for
  introducing the CSS entity, which is
  a third entity taking part in a remote access. In this context, the CSS functions as
  the mapping from an abstract file to a physical replica.
• Access synchronization. UNIX semantics are emulated to the last detail, in spite of
  caching at multiple USs. Alternatively, locking facilities are provided.
• Fault tolerance. Substantial effort has been devoted to designing mechanisms for
  fault tolerance. A few are an atomic update facility, merging replicated packs after
  recovery, and a degree of independent operation of partitions. The effects can be
  characterized as follows:
  - Within a partition, the most recent, available version of a file is read. The
    primary copy must be available for write operations.
  - The primary copy of a file is always up to date with the most recent committed
    version. Other copies may have either the same version or an older version, but
    never a partially modified one.
  - A CSS function introduces an additional point of failure. For a file to be
    available for opening, both the CSS for the filegroup and an SS must be available.
  - Every pathname component must be available for the corresponding file to be
    available for opening.
  A basic questionable decision regarding fault tolerance is the extensive use of
  in-core information by the CSS and SS functions. Supporting the synchronization
  policy is a partial cause for maintaining this information; however, the price paid
  during recovery is enormous. Besides, explicit deallocation is needed to reclaim this
  in-core space, resulting in a pure overhead of message traffic.
• Scalability. Locus does not lend itself to very large distributed system environment,
  mainly because of the following reasons:
  - One CSS per file group can easily become a bottleneck for heavily accessed
    filegroups.
  - A logical mount table replicated at all sites is clearly not a scalable mechanism.
  - Extensive message traffic and server load caused by the complex synchronization of
    accesses needed to provide UNIX semantics.
• UNIX compatibility. The way Locus handles remote operation is geared to emulation of
  standard UNIX. The implementation is merely an extension of UNIX implementation
  across a network. Whenever buffering is used in UNIX, it is used in Locus as well.
  UNIX compatibility is indeed retained; however, this approach has some inherent
  flaws. First, it is not clear whether UNIX semantics are appropriate. For instance,
  the mechanism for supporting shared file offset by remote processes is complex and
  expensive. It is unclear whether this peculiar mode of sharing justifies this price.
  Second, using caching and buffering as done in UNIX in a distributed system has some
  ramifications on the robustness and recoverability of the system. Compatibility with
  UNIX is indeed an important design goal, but sometimes it obscures the development of
  an advanced distributed and robust system.

9. SUN NETWORK FILE SYSTEM

The Network File System (NFS) is a name for both an implementation and a specification
of a software system for accessing remote files across LANs. The implementation is part
of the SunOS operating system, which is a flavor of UNIX running on Sun workstations
using an unreliable datagram protocol (UDP/IP protocol [Postel 1980]) and Ethernet. The
specification and implementation are intertwined in the following description; whenever
a level of detail is needed we refer to the SunOS implementation, and whenever the
description is general enough it also applies to the specification.
The system is presented in three levels of detail. First (in Section 9.1), an overview
Figure 5. NFS joins independent file systems (a), by mounts (b), and cascading mounts (c).
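The mounts of Figure 5 can be modeled with a small per-machine mount table. The sketch below is an illustration under stated assumptions, not Sun's implementation: the `Machine` class and its `mount` and `locate` methods are hypothetical names, and longest-prefix matching on pathnames stands in for the component-by-component traversal a real kernel performs. The machine names and directories follow the example in the text; `server3` and `/export` in the usage note are invented for illustration.

```python
class Machine:
    """A machine with its own mount table (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        # local mount point -> (server name, remote directory)
        self.mounts = {}

    def mount(self, server, remote_dir, mount_point):
        """Record that server:remote_dir is mounted over a local directory."""
        self.mounts[mount_point.rstrip("/")] = (server, remote_dir.rstrip("/"))

    def locate(self, path):
        """Map a local pathname to (machine name, pathname on that machine).

        Only this machine's own mount table is consulted, so mounts made
        by the server's superuser are never followed: the mechanism is not
        transitive.
        """
        best = None
        for point in self.mounts:
            if path == point or path.startswith(point + "/"):
                if best is None or len(point) > len(best):
                    best = point  # the deepest mount point wins (cascading mounts)
        if best is None:
            return (self.name, path)
        server, remote_dir = self.mounts[best]
        return (server, remote_dir + path[len(best):])
```

For example, after `client.mount("server1", "/usr/shared", "/usr/local")` and `client.mount("server2", "/dir2/dir3", "/usr/local/dir1")`, the call `client.locate("/usr/local/dir1/a")` yields a name on server2, while other names under /usr/local still map to server1. If server1 itself had mounted a hypothetical `server3:/export` over `/usr/shared/dir2`, the client would still be directed to server1's own directory.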
example. In Figure 5b, the effects of mounting server1:/usr/shared over
client:/usr/local are shown. This figure depicts the view users on client have of their
file system. Observe that any file within the dir1 directory, for instance, can be
accessed using the prefix /usr/local/dir1 in client after the mount is complete. The
original directory /usr/local on that machine is not visible any more.
Cascading mounts are also permitted. That is, a file system can be mounted over another
file system that is not a local one, but rather a remotely mounted one. A machine’s
name space, however, is affected only by those mounts the machine’s own superuser has
invoked. By mounting a remote file system, access is not gained for other file systems
that were, by chance, mounted over the former file system. Thus, the mount mechanism
does not exhibit a transitivity property. In Figure 5c we illustrate cascading mounts
by continuing our example. The figure shows the result of mounting server2:/dir2/dir3
over client:/usr/local/dir1, which is already remotely mounted from server1. Files
within dir3 can be accessed in client using the prefix /usr/local/dir1.
The mount protocol is used to establish the initial connection between a server and a
client. The server maintains an export list (the /etc/exports in UNIX) that specifies
the local file systems it exports for mounting, along with names of machines permitted
to mount them. Any directory within an exported file system can be remotely mounted by
an accredited machine. Hence, a component unit is such a directory. When the server
receives a mount
request that conforms to its export list, it returns to the client a file handle that
is the key for further accesses to files within the mounted file system. The file
handle contains all the information the server needs to distinguish individual files it
stores. In UNIX terms, the file handle consists of a file system identifier and an
i-node number to identify the exact mounted directory within the exported file system.
The server also maintains a list of the client machines and the corresponding currently
mounted directories. This list is mainly for administrative purposes, such as for
notifying all clients that the server is going down. Adding and deleting an entry in
this list is the only way the server state is affected by the mount protocol.
Usually a system has some static mounting preconfiguration that is established at boot
time; however, this layout can be modified (/etc/fstab in UNIX).

9.2.2 NFS Protocol

The NFS protocol provides a set of remote procedure calls for remote file operations.
The procedures support the following operations:

• Searching for a file within a directory (i.e., lookup).
• Reading a set of directory entries.
• Manipulating links and directories.
• Accessing file attributes.
• Reading and writing files.

These procedures can be invoked only after having a file handle for the remotely
mounted directory. Recall that the mount operation supplies this file handle.
The omission of Open and Close operations is intentional. A prominent feature of NFS
servers is that they are stateless. There are no parallels to UNIX’s open files table
or file structures on the server side. Maintaining the clients list mentioned in
Section 9.2.1 seems to violate the statelessness of the server. The client list,
however, is not essential in any manner for the correct operation of the client or the
server and hence need not be restored after a server crash. Consequently, this list
might include inconsistent data and should be treated only as a hint.
A further implication of the stateless server philosophy and a result of the synchrony
of an RPC is that modified data (including indirection and status blocks) must be
committed to the server’s disk before the call returns results to the client. The NFS
protocol does not provide concurrency control mechanisms. The claim is that since locks
management is inherently stateful, a service outside the NFS should provide locking. It
is advised that users would coordinate access to shared files using mechanisms outside
the scope of NFS (e.g., by means provided in a database management system).

9.3 Implementation

In general, Sun’s implementation of NFS is integrated with the SunOS kernel for reasons
of efficiency (although such integration is not strictly necessary). In this section we
outline this implementation.

9.3.1 Architecture

The NFS architecture is schematically depicted in Figure 6. The user interface is the
UNIX system calls interface based on the Open, Read, Write, Close calls, and file
descriptors. This interface is on top of a middle layer called the Virtual File System
(VFS) layer. The bottom layer is the one that implements the NFS protocol and is called
the NFS layer. These layers comprise the NFS software architecture. The figure also
shows the RPC/XDR software layer, local file systems, and the network and thus can
serve to illustrate the integration of a DFS with all these components. The VFS serves
two important functions:

• It separates file system generic operations from their implementation by defining a
  clean interface. Several implementations for the VFS interface may coexist on the
  same machine, allowing transparent access to a variety of types of file systems
  mounted locally (e.g., 4.2 BSD or MS-DOS).
Figure 6. Schematic view of the NFS architecture.
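The layering of Figure 6 can be approximated by a toy dispatch model. Everything below is a simplified sketch under stated assumptions rather than the SunOS code: the class names, the `vfs_read` function, and the `rpc_read` method are hypothetical, the "RPC" is an ordinary method call standing in for the RPC/XDR layer and the network, and a file handle is reduced to a (file system identifier, i-node number) pair.

```python
class LocalFS:
    """A toy local file system: i-node number -> file contents."""

    def __init__(self, fsid, files):
        self.fsid = fsid
        self.files = files

    def read(self, inode, offset, count):
        return self.files[inode][offset:offset + count]


class NFSServer:
    """Stateless service: every request carries a full file handle and offset."""

    def __init__(self, filesystems):
        self.by_fsid = {fs.fsid: fs for fs in filesystems}

    def rpc_read(self, handle, offset, count):
        fsid, inode = handle
        # The request is reinjected into the server's own VFS layer,
        # which finds that the file is local.
        return vfs_read(Vnode(self.by_fsid[fsid], inode), offset, count)


class NFSClientFS:
    """Client-side NFS layer; the server reference stands in for the network."""

    def __init__(self, server):
        self.server = server

    def read(self, handle, offset, count):
        return self.server.rpc_read(handle, offset, count)


class Vnode:
    """Pairs a file with the file system implementation that owns it."""

    def __init__(self, fs, ident):
        self.fs = fs
        self.ident = ident  # i-node number (local) or file handle (remote)


def vfs_read(vnode, offset, count):
    """VFS layer: dispatch to whichever file system owns the vnode."""
    return vnode.fs.read(vnode.ident, offset, count)
```

A client-side vnode built as `Vnode(NFSClientFS(server), (fsid, inode))` and a server-side vnode built as `Vnode(disk, inode)` are read through the same `vfs_read` call, which is the sense in which the same layering serves both client and server. Because the server keeps no open-file state, the offset travels with every request.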
• The VFS is based on a file representation structure called a vnode, which contains a numerical designator for a file that is networkwide unique. (Recall that UNIX i-nodes are unique only within a single file system.) The kernel maintains one vnode structure for each active node (file or directory). Essentially, for every file the vnode structures complemented by the mount table provide a pointer to its parent file system, as well as to the file system over which it is mounted.

Thus, the VFS distinguishes local files from remote ones, and local files are further distinguished according to their file system types. The VFS activates file system specific operations to handle local requests according to their file system types and calls the NFS protocol procedures for remote requests. File handles are constructed from the relevant vnodes and passed as arguments to these procedures.
As an illustration of the architecture, let us trace how an operation on an already open remote file is handled (follow the example in Figure 6). The client initiates the operation by a regular system call. The operating system layer maps this call to a VFS operation on the appropriate vnode. The VFS layer identifies the file as a remote one and invokes the appropriate NFS procedure. An RPC call is made to the NFS service layer at the remote server. This call is reinjected into the VFS layer, which finds that it is local and invokes the appropriate file system operation. This path is retraced to return the result. An advantage of this architecture is that the client and the server are identical; thus, it is possible for a machine to be a client, or a server, or both.
The actual service on each server is performed by several kernel processes, which provide a temporary substitute to a LWP facility.

9.3.2 Pathname Translation

Pathname translation is done by breaking the path into component names and doing a separate NFS lookup call for every pair of component name and directory vnode. Thus, lookups are performed remotely by the server. Once a mount point is crossed,
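The component-by-component translation described in Section 9.3.2 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the NFS implementation: the names `ToyServer`, `lookup`, `read`, and `translate` are hypothetical, file handles are modeled as small integers, and mount points are not handled. The sketch only shows that the client issues one lookup call per pathname component and that each call carries the directory handle it needs, so the server keeps no per-client state.

```python
class ToyServer:
    """Hypothetical stateless file server (an assumption for illustration)."""

    def __init__(self, tree):
        # handle -> {name: child_handle} for directories, or str for file data
        self._objects = {}
        self.root_handle = self._register(tree)

    def _register(self, node):
        handle = len(self._objects) + 1
        self._objects[handle] = None  # reserve the handle before recursing
        if isinstance(node, dict):
            self._objects[handle] = {
                name: self._register(child) for name, child in node.items()
            }
        else:
            self._objects[handle] = node
        return handle

    def lookup(self, dir_handle, name):
        """One remote call: resolve a single name inside one directory."""
        entries = self._objects[dir_handle]
        if not isinstance(entries, dict):
            raise NotADirectoryError(name)
        if name not in entries:
            raise FileNotFoundError(name)
        return entries[name]

    def read(self, handle):
        """Return the object stored under a handle (file data in this toy)."""
        return self._objects[handle]


def translate(server, start_handle, path):
    """Client side: split the path and issue one lookup per component."""
    handle = start_handle
    calls = 0
    for component in path.split("/"):
        if not component:  # skip empty parts such as a leading "/"
            continue
        handle = server.lookup(handle, component)
        calls += 1
    return handle, calls


server = ToyServer({"usr": {"local": {"notes.txt": "hello"}}})
handle, calls = translate(server, server.root_handle, "usr/local/notes.txt")
print(calls, server.read(handle))
```

A three-component path costs three calls in this sketch, which is the overhead that a client-side lookup cache (mentioned for NFS in Table 1) is meant to reduce.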
is used for modification, and client-initiated approach is used for validation of cached data.

13. CONCLUSIONS

In this paper we presented the basic concepts underlying the design of a distributed file system and surveyed five of the most prominent systems. A comparison of the systems is presented in Table 1. A crucial observation, based on the assessment of contemporary DFSs, is that the design of a DFS must depart from approaches developed for conventional file systems. Basing a DFS on emulation of a conventional file system might be a transparency goal, but it certainly should not be an implementation strategy. Extending mechanisms developed for conventional file systems over a network is a strategy that disregards the unique characteristics of a DFS.
Supporting this claim is the observation that a loose notion of sharing semantics is more appropriate for a DFS than conventional UNIX semantics. Restrictive semantics incur a complex design and intolerable overhead. A provision to facilitate restrictive semantics for database applications may be offered as an option. Consequently, UNIX compatibility should be sacrificed for the sake of a good DFS design. In this respect, the approach to the semantics of sharing used in Andrew proves superior to those used in Locus and NFS.
Another area in which a fresh approach is essential is the server process architecture. There is a wide consensus that some form of LWPs is more suitable than traditional processes for efficiently handling high loads of service requests.
It is difficult to present concrete guidelines in the context of fault tolerance and scalability, mainly because there is not enough experience in these areas. It is clear, however, that distribution of control and data as presented in this paper is a key concept. User convenience calls for hiding the distributed nature of such a system. As we pointed out in Section 2, the additional flexibility gained by mobile files is the next step in the spirit of distribution and transparency. Based on the Andrew experience, off-loading work from servers to clients and structuring a system as a collection of clusters are two sound scalability strategies. Clusters should be as autonomous as possible and should serve as a modular building block for an expandable system. A challenging aspect of scale that might be of interest for future designs is the extension of the DFS paradigm over WANs. Such an extended DFS would be characterized by larger latencies and higher failure probabilities.
A factor that is certain to be prominent in the design of future DFSs is the available technology. It is important to follow technological trends and exploit their potential. Some imminent possibilities are as follows:

Large main memories. As main memories become larger and less expensive, main-memory caching (as exemplified in Sprite) becomes more attractive. The rewards in terms of performance can be exceptional.

Optical disks. Optical storage technology has an impact on file systems in general and hence on DFSs in particular, too. Write-once optical disks are already available [Fujitani 1984]. Their key features are very large density, slow access time, high reliability, and nonerasable writing. This medium is bound to become on-line tertiary storage and replace tape devices. Rewritable optical disks are becoming available and might replace magnetic disks altogether.

Optical fiber networks. A change in the entire approach to the remote access problem can be justified by the existence of these remarkably fast communication networks. The concept of "local disk is faster" may be rendered obsolete.

Nonvolatile RAMs. Battery-backed memories can survive power outage, thereby enhancing the reliability of main-memory caches. A large and reliable memory can cause a revolution in storage techniques. Still, it is questionable whether this technology is sufficient to make main memories as reliable as disks because of the unpredictable consequences of an operating system crash
Naming scheme
  UNIX United: Single pseudo-UNIX tree. Noticeable machine boundaries. All pathnames are relative (by the '..' syntax). Independence of component systems. Recursive structuring.
  Locus: Single UNIX tree, hiding both replication and location.

Client-server
  UNIX United: Each machine can be both.
  Locus: A triple: US, SS, CSS. Every file group has a CSS that selects SS and synchronizes accesses. Once a file is open, direct US-SS protocol.

Pathname traversal
  UNIX United: The pathname translation request is forwarded from machine to machine.
  Locus: US reads each directory and performs lookup itself. Given a file group number, the CSS is found replicated on all machines in the logical mount table. The CSS picks SS.

Reconfiguration, file mobility
  UNIX United: Impossible to move a file without changing its name. No dynamic reconfiguration.
  Locus: Because of replication, servers can be taken off-line or fail without disturbance. Directory hierarchy can be changed by mounting/unmounting.
Table 1-Continued (NFS, Sprite, Andrew)

  NFS: A network service so that independent workstations would be able to share remote files transparently.
  Sprite: Designed for an environment consisting of diskless workstations with huge main memories interconnected by a LAN.
  Andrew: Designed as the sharing mechanism of a large-scale system for a university campus.

Naming scheme
  NFS: Each machine has its own view of the global name space.
  Sprite: Single UNIX tree; hiding location.
  Andrew: Private name spaces and one UNIX tree for the shared name space. The shared tree descends from each local name space.

  NFS: A directory within an exported file system can be remotely mounted.
  Sprite: A domain (UNIX file system).
  Andrew: A volume (typically, all files of a single user).

Client-server
  NFS: Every machine can be both. Direct client-server relationship is enforced.
  Sprite: Typically, clients are diskless and servers are machines with disks.
  Andrew: Clustering: Dedicated servers per cluster.

  NFS: Remote service mixed with block caching for service.
  Sprite: Block caching in main memory. In case of concurrent writes, switch to Remote Service.
  Andrew: Whole file caching in local disks.

  NFS: Block caching similar to UNIX buffering. Client checks validity of cached data on each open. Delayed-write policy.
  Sprite: Block caching similar to UNIX buffering. Delayed-write policy. Client checks validity of cached data on each open. Server disables caching when a file is opened in conflicting modes.
  Andrew: Read and Write are served directly by the cache without server involvement. Write-on-close policy. Server-initiated approach for cache validation (callback), hence no need to check on each open.

Pathname traversal
  NFS: Lookups are done remotely for each pathname component, but all are initiated from the client. A lookup cache for speedup.
  Sprite: Prefix tables mechanism. Inside a domain, lookup is done by server.
  Andrew: Client caches each directory and performs lookup itself. Given a volume number, the server is found in a volume location database replicated on each server. Parts of this database are cached on each machine.

  NFS: In case of cascading mount, each server along the mount chain has to be available for a file to be available.
  Sprite: If a server of a file is available, the file is available regardless of the state of other servers (along the pathname).
  Andrew: A client has to have a connection to a server, and each pathname component must be available.
Implementation strategy, architecture
  UNIX United: UNIX kernel kept intact. Connection layer intercepts remote calls. User-level daemons forward and service remote operations. A spawner process creates a server process per user that accesses files using file descriptors.
  Locus: Extensive UNIX kernel modification. Kernel is pushed into the network. Some kernel LWP for remote services. Structured, low-level, location-independent file identifiers are used.

Main disadvantage
  UNIX United: Not fully transparent naming.
  Locus: Complicated design and large kernel. Unscalable features. Complex recovery due to maintained state.
  NFS: Not intended for very large-scale systems.
  Sprite: Because broadcast is relied upon and because of server involvement in operations, there might be a problem.
  Andrew: Reducing server load and clustering are the main strategy. Replicated location database might be a problem.

Implementation strategy, architecture
  NFS: Three layers: UNIX system call interface, VFS interface to separate file system implementation from operations, and NFS layer. Independent specifications for mount and NFS protocols. The current implementation is kernel based.
  Sprite: New kernel based on multithreading, intended for multiprocessor workstation.
  Andrew: Augmenting UNIX kernel with user-level processes: Venus at each client, and a single server process on each server using nonpreemptable LWPs. Structured, low-level, location-independent file identifiers are used.

  NFS: RPC and XDR on top of UDP/IP (unreliable datagram).
  Sprite: RPC on top of special-purpose network protocol.
  Andrew: RPC on top of datagram protocol. Whole file transfer as a side effect.

  NFS: Fault tolerance, because of stateless protocol. Implementation-independent protocols, ideal for heterogeneous environment.
  Sprite: Performance due to main memory caching.
  Andrew: Ability to scale up gracefully. Clear and simple consistency semantics.

Main disadvantage
  NFS: Unclear semantics. Performance improvements obscure clean design.
  Sprite: Questionable scalability. Not much in terms of fault tolerance.
  Andrew: Fault tolerance issues due to maintained state.
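The two cache validation approaches that Table 1 contrasts, a client checking validity on each open versus a server-initiated callback, can be illustrated with a small sketch. The code below is a toy model under stated assumptions and not the implementation of any surveyed system: the class names `Server`, `CheckOnOpenClient`, and `CallbackClient`, the per-file version counter, and the `validation_requests` counter are all hypothetical and exist only to make the difference in server traffic visible.

```python
class Server:
    """Toy server holding versioned files (hypothetical model)."""

    def __init__(self):
        self.files = {}               # name -> (version, data)
        self.callbacks = {}           # name -> set of clients promised a callback
        self.validation_requests = 0  # counts client-initiated validity checks

    def store(self, name, data):
        version = self.files.get(name, (0, None))[0] + 1
        self.files[name] = (version, data)
        # Server-initiated approach: notify every client caching this file.
        for client in self.callbacks.pop(name, set()):
            client.break_callback(name)

    def fetch(self, name, client=None):
        if client is not None:
            self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def get_version(self, name):
        self.validation_requests += 1
        return self.files[name][0]


class CheckOnOpenClient:
    """Client-initiated validation: ask the server on every open."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, name):
        cached = self.cache.get(name)
        if cached is None or cached[0] != self.server.get_version(name):
            self.cache[name] = self.server.fetch(name)
        return self.cache[name][1]


class CallbackClient:
    """Server-initiated validation: trust the cache until a callback is broken."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def open(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.fetch(name, client=self)
        return self.cache[name][1]

    def break_callback(self, name):
        self.cache.pop(name, None)


server = Server()
server.store("f", "v1")
checker, holder = CheckOnOpenClient(server), CallbackClient(server)
for _ in range(3):
    checker.open("f")
    holder.open("f")
server.store("f", "v2")
print(checker.open("f"), holder.open("f"), server.validation_requests)
```

In this model only the check-on-open client generates validation traffic, while the callback client contacts the server again only after its cached copy has been invalidated. The price, as the table notes for Andrew, is that the server must maintain state about which clients hold callbacks.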
BARAK, A., MALKI, D., AND WHEELER, R. 1986. AFS, BFS, CFS ... or Distributed File Systems for UNIX. In European UNIX Users Group Conference Proceedings (Sept. 22-24, Manchester, U.K.). EUUG, pp. 461-472.
BARAK, A., AND PARADISE, O. G. 1986. MOS: Scaling up UNIX. In Proceedings of USENIX 1986 Summer Conference. USENIX Association, Berkeley, California, pp. 414-418.
BERNSTEIN, P. A., HADZILACOS, V., AND GOODMAN, N. 1987. Concurrency Control and Recovery in Database Systems. Addison-Wesley, Reading, Mass.
BIRRELL, A. D., LEVIN, R., NEEDHAM, R. M., AND SCHROEDER, M. D. 1982. Grapevine: An exercise in distributed computing. Commun. ACM 25, 4 (Apr.), 260-274.
BIRRELL, A. D., AND NELSON, B. J. 1984. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb.), 39-59.
BLACK, A. P. 1985. Supporting distributed applications: Experience with Eden. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 181-193.
BROWNBRIDGE, D. R., MARSHALL, L. F., AND RANDELL, B. 1982. The Newcastle connection or UNIXes of the world unite! Softw. Pract. Exper. 12, 12 (Dec.), 1147-1162.