

Distributed File Systems
Distributed Systems
Introduction
File service architecture
Sun Network File System (NFS)
Andrew File System (AFS)
Recent advances
Summary

Learning objectives

✓ Understand the requirements that affect the design of distributed services
✓ NFS: understand how a relatively simple, widely-used service is designed
  o Obtain a knowledge of file systems, both local and networked
  o Caching as an essential design technique
  o Remote interfaces are not the same as APIs
  o Security requires special consideration
✓ Recent advances: appreciate the ongoing research that often leads to major advances

Revision: Failure model

Class of failure       | Affects            | Description
Fail-stop              | Process            | Process halts and remains halted. Other processes may detect this state.
Crash                  | Process            | Process halts and remains halted. Other processes may not be able to detect this state.
Omission               | Channel            | A message inserted in an outgoing message buffer never arrives at the other end's incoming message buffer.
Send-omission          | Process            | A process completes a send but the message is not put in its outgoing message buffer.
Receive-omission       | Process            | A message is put in a process's incoming message buffer, but that process does not receive it.
Arbitrary (Byzantine)  | Process or channel | Process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times, commit omissions; a process may stop or take an incorrect step.


Storage systems and their properties

✓ In the first generation of distributed systems (1974-95), file systems (e.g. NFS) were the only networked storage systems.
✓ With the advent of distributed object systems (CORBA, Java) and the web, the picture has become more complex.


Storage systems and their properties

Types of consistency between copies:
  1 - strict one-copy consistency
  ~ - approximate consistency
  X - no automatic consistency

Type                                | Sharing | Persistence | Distributed cache/replicas | Consistency maintenance | Example
Main memory                         |   no    |     no      |            no              |           1             | RAM
File system                         |   no    |     yes     |            no              |           1             | UNIX file system
Distributed file system             |   yes   |     yes     |            yes             |           ~             | Sun NFS
Web                                 |   yes   |     yes     |            yes             |           X             | Web server
Distributed shared memory           |   yes   |     no      |            yes             |           ~             | Ivy
Remote objects (RMI/ORB)            |   yes   |     no      |            no              |           1             | CORBA
Persistent object store             |   yes   |     yes     |            no              |           1             | CORBA Persistent Object Service
Persistent distributed object store |   yes   |     yes     |            yes             |           ~             | PerDiS, Khazana

What is a file system?

✓ Persistent stored data sets
✓ Hierarchic name space visible to all processes
✓ API with the following characteristics:
  o access and update operations on persistently stored data sets
  o sequential access model (with additional random-access facilities)
✓ Sharing of data between users, with access control
✓ Concurrent access:
  o certainly for read-only access
  o what about updates?
✓ Other features:
  o mountable file stores
  o more? ...


What is a file system?

UNIX file system operations:

filedes = open(name, mode)           | Opens an existing file with the given name.
filedes = creat(name, mode)          | Creates a new file with the given name. Both operations deliver a file descriptor referencing the open file. The mode is read, write or both.
status = close(filedes)              | Closes the open file filedes.
count = read(filedes, buffer, n)     | Transfers n bytes from the file referenced by filedes to buffer.
count = write(filedes, buffer, n)    | Transfers n bytes to the file referenced by filedes from buffer. Both operations deliver the number of bytes actually transferred and advance the read-write pointer.
pos = lseek(filedes, offset, whence) | Moves the read-write pointer to offset (relative or absolute, depending on whence).
status = unlink(name)                | Removes the file name from the directory structure. If the file has no other names, it is deleted.
status = link(name1, name2)          | Adds a new name (name2) for a file (name1).
status = stat(name, buffer)          | Gets the file attributes for file name into buffer.
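
As a concrete illustration of this API, here is a minimal C program exercising the calls above (the path /tmp/example.txt is arbitrary, chosen only for the example):

#include <fcntl.h>     /* open, creat */
#include <unistd.h>    /* read, write, lseek, close, unlink */
#include <stdio.h>

int main(void) {
    /* Create a file, write to it, reposition, read back, delete. */
    int fd = creat("/tmp/example.txt", 0644);
    if (fd < 0) { perror("creat"); return 1; }

    write(fd, "hello, world\n", 13);        /* advances the read-write pointer */
    close(fd);

    fd = open("/tmp/example.txt", O_RDONLY);
    lseek(fd, 7, SEEK_SET);                 /* absolute offset: points at "world" */
    char buf[16];
    ssize_t n = read(fd, buf, sizeof buf - 1);
    buf[n] = '\0';
    printf("read %zd bytes: %s", n, buf);   /* prints: read 6 bytes: world */
    close(fd);

    unlink("/tmp/example.txt");             /* remove the name; file is deleted */
    return 0;
}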


What is a file system?


File system modules

Directory module: relates file names to file IDs

File module: relates file IDs to particular files

Access control module: checks permission for operation requested

File access module: reads or writes file data or attributes

Block module: accesses and allocates disk blocks

Device module: disk I/O and buffering


What is a file system?

File attribute record structure:

Updated by system:
  File length
  Creation timestamp
  Read timestamp
  Write timestamp
  Attribute timestamp
  Reference count

Updated by owner:
  Owner
  File type
  Access control list (e.g. for UNIX: rw-rw-r--)


File service requirements

✓ Transparency
✓ Concurrency
✓ Replication
✓ Heterogeneity
✓ Fault tolerance
✓ Consistency
✓ Security
✓ Efficiency


File service requirements

Transparencies:
  Access: Same operations
  Location: Same name space after relocation of files or processes
  Mobility: Automatic relocation of files is possible
  Performance: Satisfactory performance across a specified range of system loads
  Scaling: Service can be expanded to meet additional loads

File service requirements

Concurrency properties:
  Isolation
  File-level or record-level locking
  Other forms of concurrency control to minimise contention


File service requirements

Replication properties:
  File service maintains multiple identical copies of files
    o load-sharing between servers makes the service more scalable
    o local access has better response (lower latency)
    o fault tolerance
  Full replication is difficult to implement.
  Caching (of all or part of a file) gives most of the benefits (except fault tolerance).

File service requirements

Heterogeneity properties:
  Service can be accessed by clients running on (almost) any OS or hardware platform.
  Design must be compatible with the file systems of different OSes.
  Service interfaces must be open: precise specifications of APIs are published.


File service requirements

Fault tolerance:
  Service must continue to operate even when clients make errors or crash.
    o at-most-once semantics
    o at-least-once semantics: requires idempotent operations (see the sketch below)
  Service must resume after a server machine crashes.
  If the service is replicated, it can continue to operate even during a server crash.
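
To see why at-least-once semantics demand idempotent operations, here is a minimal, self-contained C sketch (read_next and read_at are illustrative names, not part of any real file service): a duplicated stateful request returns the wrong data, while a request carrying an absolute offset can be repeated harmlessly, which is why the model Read(FileId, i, n) operation takes an explicit position.

#include <stdio.h>

/* Toy sketch: why at-least-once retries require idempotent operations. */

static const char data[] = "abcdefgh";
static int cursor = 0;                      /* server-side state */

/* NOT idempotent: depends on server state, so a duplicate skips data */
static char read_next(void) { return data[cursor++]; }

/* Idempotent: the result depends only on the request's arguments */
static char read_at(int offset) { return data[offset]; }

int main(void) {
    /* The client sends one request, but the network duplicates it: */
    read_next();                            /* original request */
    printf("stateful retry sees:   %c\n", read_next());  /* 'b': wrong */

    read_at(0);                             /* original request */
    printf("idempotent retry sees: %c\n", read_at(0));   /* 'a': correct */
    return 0;
}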


File service requirements

Consistency:
  UNIX offers one-copy update semantics for operations on local files: caching is completely transparent.
  It is difficult to achieve the same for distributed file systems while maintaining good performance and scalability.


File service requirements

Security:
  Must maintain access control and privacy as for local files.
    o based on the identity of the user making the request
    o identities of remote users must be authenticated
    o privacy requires secure communication
  Service interfaces are open to all processes not excluded by a firewall.
    o vulnerable to impersonation and other attacks

File service requirements

Efficiency:
  The goal for distributed file systems is usually performance comparable to the local file system.


Model file service architecture

[Figure: a client computer runs application programs over a client module; a server computer runs a directory service (operations: Lookup, AddName, UnName, GetNames) layered over a flat file service (operations: Read, Write, Create, Delete, GetAttributes, SetAttributes).]

Server operations for the model file service

Flat file service:
  Read(FileId, i, n) -> Data       (i is the position of the first byte)
  Write(FileId, i, Data)           (i is the position of the first byte)
  Create() -> FileId
  Delete(FileId)
  GetAttributes(FileId) -> Attr
  SetAttributes(FileId, Attr)

Directory service:
  Lookup(Dir, Name) -> FileId
  AddName(Dir, Name, FileId)
  UnName(Dir, Name)
  GetNames(Dir, Pattern) -> NameSeq

FileId: a unique identifier for files anywhere in the network.

Pathname lookup: pathnames such as '/usr/bin/tar' are resolved by iterative calls to Lookup(), one call for each component of the path, starting with the ID of the root directory '/', which is known in every client.
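
A minimal runnable sketch of this iterative resolution (the in-memory directory table is a toy stand-in so the example runs; Lookup() mirrors the abstract interface above, not a real API):

#include <stdio.h>
#include <string.h>

typedef int FileId;

/* toy directory "database": (directory, name) -> file */
static struct { FileId dir; const char *name; FileId id; } entries[] = {
    {0, "usr", 1}, {1, "bin", 2}, {2, "tar", 3},
};

static FileId Lookup(FileId dir, const char *name) {
    for (unsigned i = 0; i < sizeof entries / sizeof *entries; i++)
        if (entries[i].dir == dir && strcmp(entries[i].name, name) == 0)
            return entries[i].id;
    return -1;                               /* not found */
}

/* one Lookup() call per path component, starting at the root ('/', id 0) */
static FileId resolve(const char *pathname) {
    char path[256];
    strncpy(path, pathname, sizeof path - 1);
    path[sizeof path - 1] = '\0';

    FileId current = 0;                      /* root directory ID */
    for (char *part = strtok(path, "/"); part && current >= 0;
         part = strtok(NULL, "/"))
        current = Lookup(current, part);
    return current;
}

int main(void) {
    printf("'/usr/bin/tar' -> FileId %d\n", resolve("/usr/bin/tar")); /* 3 */
    return 0;
}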


File Group

A collection of files that can be located on any server or moved between servers while maintaining the same names.
  o Similar to a UNIX filesystem
  o Helps with distributing the load of file serving between several servers.
  o File groups have identifiers which are unique throughout the system (and hence, for an open system, they must be globally unique).

To construct a globally unique ID we use some unique attribute of the machine on which it is created, e.g. its IP address, even though the file group may move subsequently.

File Group ID (used to refer to file groups and files):
  32 bits    | 16 bits
  IP address | date
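
A small C sketch of packing such an ID from the two fields shown above (make_group_id() is an invented helper name; the layout simply follows the 32-bit + 16-bit split):

#include <stdint.h>
#include <stdio.h>

/* Pack 32 bits of the creating machine's IP address and a 16-bit date
 * field into one 48-bit file group identifier. */
static uint64_t make_group_id(uint32_t ip_addr, uint16_t date) {
    return ((uint64_t)ip_addr << 16) | date;   /* 48 significant bits */
}

int main(void) {
    /* 192.168.1.7 is purely illustrative */
    uint32_t ip = (192u << 24) | (168u << 16) | (1u << 8) | 7u;
    uint64_t gid = make_group_id(ip, 12345);
    printf("file group ID: 0x%012llx\n", (unsigned long long)gid);
    return 0;
}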


Case Study: Sun NFS

✓ An industry standard for file sharing on local networks since the 1980s
✓ An open standard with clear and simple interfaces
✓ Closely follows the abstract file service model defined above
✓ Supports many of the design requirements already mentioned:
  o transparency
  o heterogeneity
  o efficiency
  o fault tolerance
✓ Limited achievement of:
  o concurrency
  o replication
  o consistency
  o security

NFS architecture

[Figure: in each client computer, application programs issue UNIX system calls to the kernel; a virtual file system layer routes operations on local files to the UNIX file system (or other local file systems) and operations on remote files to the NFS client module. The NFS client communicates with the NFS server module in the server computer's kernel via the NFS protocol (remote operations); the NFS server accesses files through the server's virtual file system and UNIX file system.]

NFS architecture: does the implementation have to be in the system kernel?

No:
  o there are examples of NFS clients and servers that run at application level, as libraries or processes (e.g. early Windows and MacOS implementations, current PocketPC, etc.)

But for a UNIX implementation there are advantages:
  o Binary code compatibility: no need to recompile applications, since standard system calls that access remote files can be routed through the NFS client module by the kernel
  o Shared cache of recently-used blocks at the client
  o A kernel-level server can access i-nodes and file blocks directly (though a privileged (root) application program could do almost the same)
  o Security of the encryption key used for authentication

NFS server operations (simplified)

NFS protocol operations:
  read(fh, offset, count) -> attr, data
  write(fh, offset, count, data) -> attr
  create(dirfh, name, attr) -> newfh, attr
  remove(dirfh, name) -> status
  getattr(fh) -> attr
  setattr(fh, attr) -> attr
  lookup(dirfh, name) -> fh, attr
  rename(dirfh, name, todirfh, toname)
  link(newdirfh, newname, dirfh, name)
  readdir(dirfh, cookie, count) -> entries
  symlink(newdirfh, newname, string) -> status
  readlink(fh) -> string
  mkdir(dirfh, name, attr) -> newfh, attr
  rmdir(dirfh, name) -> status
  statfs(fh) -> fsstats

For comparison, the model flat file service:
  Read(FileId, i, n) -> Data
  Write(FileId, i, Data)
  Create() -> FileId
  Delete(FileId)
  GetAttributes(FileId) -> Attr
  SetAttributes(FileId, Attr)

and the model directory service:
  Lookup(Dir, Name) -> FileId
  AddName(Dir, Name, FileId)
  UnName(Dir, Name)
  GetNames(Dir, Pattern) -> NameSeq


NFS access control and authentication

✓ Stateless server, so the user's identity and access rights must be checked by the server on each request.
  o In the local file system they are checked only on open()
✓ Every client request is accompanied by the userID and groupID
  o not shown in the operations above because they are inserted by the RPC system
✓ Server is exposed to imposter attacks unless the userID and groupID are protected by encryption
✓ Kerberos has been integrated with NFS to provide a stronger and more comprehensive security solution

Mount service

✓ Mount operation:
    mount(remotehost, remotedirectory, localdirectory)
✓ Server maintains a table of clients who have mounted filesystems at that server
✓ Each client maintains a table of mounted file systems holding (see the sketch below):
    <IP address, port number, file handle>
✓ Hard versus soft mounts
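
A rough C sketch of what such a client-side mount table entry might hold; field names and sizes are assumptions for illustration, not the real NFS client structures:

#include <stdint.h>

#define FH_SIZE 32                    /* illustrative; real handles vary */

struct mount_entry {
    char     local_dir[256];          /* e.g. "/usr/students" */
    uint32_t server_ip;               /* remote host */
    uint16_t server_port;             /* port of the remote service */
    uint8_t  root_fh[FH_SIZE];        /* file handle of the mounted directory */
    int      hard;                    /* 1 = hard mount (retry indefinitely),
                                         0 = soft mount (fail after N retries) */
};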


Local and remote file systems accessible on an NFS client

[Figure: the client's name space, with local directories and two remote mounts. Server 1's sub-tree /export/people (containing big, jon, bob, ...) is remote-mounted at /usr/students; Server 2's sub-tree /nfs/users (containing jim, ann, jane, joe) is remote-mounted at /usr/staff.]

Note: The file system mounted at /usr/students in the client is actually the sub-tree located at /export/people in Server 1; the file system mounted at /usr/staff in the client is actually the sub-tree located at /nfs/users in Server 2.

Automounter

The NFS client catches attempts to access 'empty' mount points and routes them to the Automounter
  o the Automounter has a table of mount points and multiple candidate servers for each
  o it sends a probe message to each candidate server and then uses the mount service to mount the filesystem at the first server to respond (see the sketch below)
✓ Keeps the mount table small
✓ Provides a simple form of replication for read-only filesystems
  o e.g. if there are several servers with identical copies of /usr/lib, then each server will have a chance of being mounted at some clients.
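
The probe-and-mount step can be sketched as follows; all helper names and stubs are made up for illustration, and a real automounter probes candidates in parallel rather than sequentially:

#include <stdio.h>
#include <string.h>

static int probe(const char *server) {          /* stub: pretend only "fs2" responds */
    return strcmp(server, "fs2") == 0;
}
static int mount_remote(const char *server, const char *rdir, const char *ldir) {
    printf("mount %s:%s at %s\n", server, rdir, ldir);  /* via mount service */
    return 0;
}

static int automount(const char *ldir, const char *rdir,
                     const char **candidates, int n) {
    for (int i = 0; i < n; i++)                 /* pick a responding server */
        if (probe(candidates[i]))
            return mount_remote(candidates[i], rdir, ldir);
    return -1;                                  /* no candidate responded */
}

int main(void) {
    const char *servers[] = { "fs1", "fs2", "fs3" };
    automount("/usr/lib", "/export/lib", servers, 3);   /* mounts from fs2 */
    return 0;
}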


Kerberized NFS

✓ The Kerberos protocol is too costly to apply on each file access request
✓ Kerberos is used in the mount service:
  o to authenticate the user's identity
  o the user's UserID and GroupID are stored at the server with the client's IP address
✓ For each file request (see the sketch below):
  o the UserID and GroupID sent must match those stored at the server
  o IP addresses must also match
✓ This approach has some problems
  o can't accommodate multiple users sharing the same client computer
  o all remote filestores must be mounted each time a user logs in
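
A schematic C sketch of the per-request check just described; structure and field names are assumptions for illustration, not Kerberos or NFS code:

#include <stdint.h>
#include <stdio.h>

struct mount_record { uint32_t uid, gid, client_ip; };  /* stored at mount time */
struct nfs_request  { uint32_t uid, gid, sender_ip; };  /* carried by each request */

/* credentials recorded at mount time must match every subsequent request */
static int request_allowed(const struct mount_record *m,
                           const struct nfs_request *r) {
    return m->uid == r->uid && m->gid == r->gid && m->client_ip == r->sender_ip;
}

int main(void) {
    struct mount_record m  = { 501, 20, 0xC0A80107 };
    struct nfs_request ok  = { 501, 20, 0xC0A80107 };
    struct nfs_request bad = { 502, 20, 0xC0A80107 };   /* different user */
    printf("%d %d\n", request_allowed(&m, &ok), request_allowed(&m, &bad)); /* 1 0 */
    return 0;
}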

NFS optimization - server caching

✓ Similar to UNIX file caching for local files:
  o pages (blocks) from disk are held in a main-memory buffer cache until the space is required for newer pages; read-ahead and delayed-write optimizations are used
  o for local files, writes are deferred to the next sync event (30-second intervals)
  o this works well in the local context, where files are always accessed through the local cache, but in the remote case it doesn't offer the necessary synchronization guarantees to clients
✓ NFS v3 servers offer two strategies for updating the disk:
  o write-through: altered pages are written to disk as soon as they are received at the server. When a write() RPC returns, the NFS client knows that the page is on the disk.
  o delayed commit: pages are held only in the cache until a commit() call is received for the relevant file. This is the default mode used by NFS v3 clients. A commit() is issued by the client whenever a file is closed.

NFS optimization - client caching

✓ Server caching does nothing to reduce RPC traffic between client and server
  o further optimization is essential to reduce server load in large networks
  o the NFS client module caches the results of read, write, getattr, lookup and readdir operations
  o synchronization of file contents (one-copy semantics) is not guaranteed when two or more clients are sharing the same file
✓ Timestamp-based validity check (see the sketch below)
  o reduces inconsistency, but doesn't eliminate it
  o validity condition for a cache entry at the client:

      (T - Tc < t) ∨ (Tm_client = Tm_server)

    where:
      t  : freshness guarantee, configurable per file; typically set to
           3 seconds for files and 30 seconds for directories
      Tc : time when the cache entry was last validated
      Tm : time when the block was last updated at the server
      T  : current time
  o it remains difficult to write distributed applications that share files with NFS
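
The validity condition translates into roughly the following client-side logic; this is a sketch with assumed names, where fetch_tm_from_server() stands in for the getattr() RPC:

#include <stdio.h>
#include <time.h>

struct cache_entry {
    time_t tc;         /* Tc: when this entry was last validated */
    time_t tm_client;  /* Tm as seen at the client */
    double t;          /* freshness interval: ~3 s files, ~30 s directories */
};

/* stand-in for a getattr() RPC returning the server's Tm */
static time_t fetch_tm_from_server(void) { return 1000; }

static int entry_is_valid(struct cache_entry *e) {
    time_t now = time(NULL);
    if (difftime(now, e->tc) < e->t)
        return 1;                      /* (T - Tc < t): within freshness guarantee */
    if (e->tm_client == fetch_tm_from_server()) {
        e->tc = now;                   /* (Tm_client = Tm_server): revalidated */
        return 1;
    }
    return 0;                          /* stale: the block must be refetched */
}

int main(void) {
    struct cache_entry e = { time(NULL), 1000, 3.0 };
    printf("valid: %d\n", entry_is_valid(&e));   /* 1: validated moments ago */
    return 0;
}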

Other NFS optimizations

✓ Sun RPC runs over UDP by default (can use TCP if required)
✓ Uses the UNIX BSD Fast File System with 8-kbyte blocks
✓ read() and write() calls can be of any size (negotiated between client and server)
✓ The guaranteed freshness interval t is set adaptively for individual files to reduce the getattr() calls needed to update Tm
✓ File attribute information (including Tm) is piggybacked in replies to all file requests


NFS performance

✓ Early measurements (1987) established that:
  o write() operations are responsible for only 5% of server calls in typical UNIX environments, hence write-through at the server is acceptable
  o lookup() accounts for 50% of operations, due to the step-by-step pathname resolution necessitated by the naming and mounting semantics
✓ More recent measurements (1993) show high performance:
  o 1 x 450 MHz Pentium III: > 5,000 server ops/sec, < 4 ms average latency
  o 24 x 450 MHz IBM RS64: > 29,000 server ops/sec, < 4 ms average latency
  o see www.spec.org for more recent measurements
✓ Provides a good solution for many environments, including:
  o large networks of UNIX and PC clients
  o multiple web server installations sharing a single file store

NFS summary

✓ An excellent example of a simple, robust, high-performance distributed service.
✓ Achievement of transparencies:
  Access: Excellent; the API is the UNIX system call interface for both local and remote files.
  Location: Not guaranteed but normally achieved; naming of filesystems is controlled by client mount operations, but transparency can be ensured by an appropriate system configuration.
  Concurrency: Limited but adequate for most purposes; when read-write files are shared concurrently between clients, consistency is not perfect.
  Replication: Limited to read-only file systems; for writable files, the Sun Network Information Service (NIS) runs over NFS and is used to replicate essential system files.

NFS summary

Achievement of transparencies (continued):
  Failure: Limited but effective; service is suspended if a server fails. Recovery from failures is aided by the simple stateless design.
  Mobility: Hardly achieved; relocation of files is not possible; relocation of filesystems is possible, but requires updates to client configurations.
  Performance: Good; multiprocessor servers achieve very high performance, but for a single filesystem it's not possible to go beyond the throughput of a multiprocessor server.
  Scaling: Good; filesystems (file groups) may be subdivided and allocated to separate servers. Ultimately, the performance limit is determined by the load on the server holding the most heavily-used filesystem (file group).

Distribution of processes in the Andrew File System

[Figure: several workstations, each running user programs and a Venus process over the UNIX kernel, communicate across a network with server machines, each running a Vice process over the UNIX kernel.]

File name space seen by clients of AFS

[Figure: the local root '/' contains tmp, bin, ..., vmunix and cmu; the sub-tree under cmu is the shared name space, and local directories such as bin are symbolic links into the shared space.]

System call interception in AFS

[Figure: on a workstation, user programs issue UNIX file system calls to the UNIX kernel. The kernel satisfies local file operations from the UNIX file system on the local disk and passes non-local file operations to the Venus process.]

Implementation of filesystem calls in AFS

open(FileName, mode):
  UNIX kernel: if FileName refers to a file in shared file space, pass the request to Venus.
  Venus: check the list of files in the local cache. If the file is not present or there is no valid callback promise, send a request for the file to the Vice server that is the custodian of the volume containing the file.
  Vice: transfer a copy of the file and a callback promise to the workstation. Log the callback promise.
  Venus: place the copy of the file in the local file system, enter its local name in the local cache list and return the local name to UNIX.
  UNIX kernel: open the local file and return the file descriptor to the application.

read(FileDescriptor, Buffer, length):
  UNIX kernel: perform a normal UNIX read operation on the local copy.

write(FileDescriptor, Buffer, length):
  UNIX kernel: perform a normal UNIX write operation on the local copy.

close(FileDescriptor):
  UNIX kernel: close the local copy and notify Venus that the file has been closed.
  Venus: if the local copy has been changed, send a copy to the Vice server that is the custodian of the file.
  Vice: replace the file contents and send a callback to all other clients holding callback promises on the file.
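
The open() path in the table can be condensed into the following schematic C sketch; the helper names and one-file cache are invented for illustration, and this is not AFS source code:

#include <stdio.h>

enum promise { NONE, VALID, CANCELLED };

static enum promise promise_state = NONE;       /* toy one-file cache */

static void fetch_from_vice(const char *file) { /* stand-in for the Fetch() RPC */
    printf("fetching %s from Vice; callback promise recorded\n", file);
    promise_state = VALID;
}

/* serve from the local cache when a valid callback promise exists,
 * otherwise fetch the file from its Vice custodian first */
static const char *venus_open(const char *file) {
    if (promise_state != VALID)
        fetch_from_vice(file);
    return "/cache/0001";                       /* local name of the cached copy */
}

int main(void) {
    venus_open("/cmu/doc/paper.tex");           /* first open: fetches */
    venus_open("/cmu/doc/paper.tex");           /* second open: cache hit */
    return 0;
}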

The main components of the Vice service interface

Fetch(fid) -> attr, data | Returns the attributes (status) and, optionally, the contents of the file identified by fid, and records a callback promise on it.
Store(fid, attr, data)   | Updates the attributes and (optionally) the contents of a specified file.
Create() -> fid          | Creates a new file and records a callback promise on it.
Remove(fid)              | Deletes the specified file.
SetLock(fid, mode)       | Sets a lock on the specified file or directory. The mode of the lock may be shared or exclusive. Locks that are not removed expire after 30 minutes.
ReleaseLock(fid)         | Unlocks the specified file or directory.
RemoveCallback(fid)      | Informs the server that a Venus process has flushed a file from its cache.
BreakCallback(fid)       | This call is made by a Vice server to a Venus process. It cancels the callback promise on the relevant file.

Recent advances in file services

NFS enhancements:
  WebNFS: the NFS server implements a web-like service on a well-known port. Requests use a 'public file handle' and a pathname-capable variant of lookup(). This enables applications to access NFS servers directly, e.g. to read a portion of a large file.
  One-copy update semantics (Spritely NFS, NQNFS): include an open() operation and maintain tables of open files at servers, which are used to prevent multiple writers and to generate callbacks to clients notifying them of updates. Performance was improved by a reduction in getattr() traffic.

Improvements in disk storage organisation:
  RAID: improves performance and reliability by striping data redundantly across several disk drives.
  Log-structured file storage: updated pages are stored contiguously in memory and committed to disk in large contiguous blocks (~1 Mbyte). File maps are modified whenever an update occurs. Garbage collection recovers disk space. (A toy sketch follows.)
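
An in-memory toy illustration of the log-structured idea, not a real file system; the names and sizes are arbitrary. Each update appends the new page to the log and redirects the file map, leaving the old copy as garbage:

#include <stdio.h>
#include <string.h>

#define PAGE 4
static char log_store[64];        /* the append-only log */
static int  log_end = 0;
static int  file_map[8];          /* page number -> offset in the log */

static void write_page(int page, const char *data) {
    memcpy(log_store + log_end, data, PAGE);  /* append, never overwrite */
    file_map[page] = log_end;                 /* redirect the file map */
    log_end += PAGE;              /* old copy becomes garbage to collect */
}

int main(void) {
    write_page(0, "aaaa");
    write_page(0, "bbbb");        /* rewrite of page 0 lands at offset 4 */
    printf("page 0 lives at log offset %d\n", file_map[0]);   /* 4 */
    return 0;
}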


New design approaches

Distribute file data across several servers:
  o exploits high-speed networks (ATM, Gigabit Ethernet)
  o layered approach; the lowest level is like a 'distributed virtual disk'
  o achieves scalability even for a single heavily-used file

'Serverless' architecture:
  o exploits processing and disk resources in all available network nodes
  o service is distributed at the level of individual files

Examples:
  xFS: experimental implementation demonstrated a substantial performance gain over NFS and AFS
  Frangipani: performance similar to local UNIX file access
  Tiger Video File System
  Peer-to-peer systems: Napster, OceanStore (UCB), Farsite (MSR), Publius (AT&T research); see the web for documentation on these very recent systems

New design approaches

✓ Replicated read-write files
  o high availability
  o disconnected working: re-integration after disconnection is a major problem if conflicting updates have occurred
  o examples: the Bayou system, the Coda system

Summary

✓ Sun NFS is an excellent example of a distributed service designed to meet many important design requirements
✓ Effective client caching can produce file service performance equal to or better than local file systems
✓ Consistency versus update semantics versus fault tolerance remains an issue
✓ Most client and server failures can be masked
✓ Superior scalability can be achieved with whole-file serving (Andrew FS) or the distributed virtual disk approach

Future requirements:
  o support for mobile users, disconnected operation, automatic re-integration
  o support for data streaming and quality of service (cf. the Tiger file system)
