
Distributed File System

 File system spread over multiple, autonomous computers.


 A distributed file system should provide:
 Network transparency: hide the details of where a file is
located.
 High availability: ease of accessibility irrespective of the
physical location of the file.
 This objective is difficult to achieve because a distributed file
system is vulnerable to failures in the underlying network as well
as to crashes of the systems that are the “file sources”.
 Replication / mirroring can be used to alleviate the above
problem.
 However, replication/mirroring introduces additional issues such
as consistency.

DFS: Architecture
 In general, files in a DFS can be located on “any” system. We call
the “source(s)” of files servers and those accessing them clients.
 Potentially, a server for one file can be a client for another file.
 However, most distributed systems distinguish between clients and
servers in a stricter way:
 Clients simply access files and do not have/share local files.
 Even if clients have disks, they (disks) are used for swapping, caching,
loading the OS, etc.
 Servers are the actual sources of files.
 In most cases, servers are more powerful machines (in terms of CPU,
physical memory, disk bandwidth, etc.)

DFS: Architecture …
(Figure: multiple servers and multiple clients connected through a computer network.)

DFS Data Access
Flow of a request to access data:
 Check the client cache; if the data is present, return it to the client.
 Data not present: check the local disk (if any); if the data is present, load it into the client cache and return it to the client.
 Data not present: send the request across the network to the file server and check the server cache; if the data is present, return it to the client and load the client cache.
 Data not present: issue a disk read at the server, load the server cache, return the data to the client, and load the client cache.
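A minimal sketch of this access path (Python-style; the cache, local_disk, and disk objects are hypothetical placeholders, not part of any real DFS interface):

    def read_block(block_id, client, server):
        # 1. Check the client cache (in memory and/or on the client's disk).
        data = client.cache.get(block_id)
        if data is not None:
            return data
        if client.local_disk is not None:
            data = client.local_disk.get(block_id)
            if data is not None:
                client.cache[block_id] = data       # load client cache
                return data
        # 2. Miss at the client: send the request to the file server.
        data = server.cache.get(block_id)
        if data is None:
            data = server.disk.read(block_id)       # issue disk read
            server.cache[block_id] = data           # load server cache
        client.cache[block_id] = data               # load client cache
        return data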
Mechanisms for DFS
 Mounting: to help in combining files/directories in different
systems and form a single file system structure.
 Caching: to reduce the response time in bringing data from
remote machines.
 Hints: a modified form of caching in which cached data is treated as possibly out of date.
 Bulk data transfer: helps in reducing the delay due to transfer
of files over the network. Bulk:
 Obtain multiple blocks with a single seek.
 Format and transfer a large number of packets in a single context
switch.
 Reduce the number of acknowledgements to be sent.
 (e.g.,) useful when downloading the OS onto a diskless client.
 Encryption: Establish a key for encryption with the help of
an authentication server.

Mounting
 Mounting helps to build a hierarchy of file directories.
 A collection of files can be mounted at an internal node of
the hierarchy.
 Node at which this collection of files is mounted: mount
point.
 The operating system kernel maintains a structure called the
mount table, which maps mount points to the appropriate storage
devices.
 The mount table can be maintained at (see the sketch below):
 Each client: employed in the Sun Network File System (NFS).
 The servers: all clients see the same file system structure; employed
in the Sprite file system.
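A minimal sketch of such a mount table, assuming a simple per-client dictionary that maps mount points to (server, exported path) pairs; all names are illustrative:

    # Hypothetical per-client mount table: mount point -> (server, exported path).
    mount_table = {
        "/home":      ("serverY", "/export/home"),
        "/usr/local": ("serverZ", "/export/local"),
    }

    def resolve(path):
        """Map a local path name to the server and remote path that hold it."""
        # Choose the longest mount point that is a prefix of the path.
        best = max((mp for mp in mount_table if path.startswith(mp)),
                   key=len, default=None)
        if best is None:
            return ("local", path)                        # not under a remote mount
        server, remote_root = mount_table[best]
        return (server, remote_root + path[len(best):])   # rewrite the prefix

    # resolve("/home/students/jack") -> ("serverY", "/export/home/students/jack")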

Name Space Hierarchy
(Figure: name space hierarchy. Server X holds the root (/) with directories a, b, and c; collections of files from Server Y (d, e, f) and Server Z (g, h, i) are attached to the hierarchy at mount points.)
Caching
 Performance of distributed file system, in terms of response time,
depends on the ability to “get” the files to the user.
 When files are in different servers, caching might be needed to
improve the response time.
 A copy of data (in files) is brought to the client (when referenced).
Subsequent data accesses are made on the client cache.
 Client cache can be on disk or main memory.
 The cached data may also include blocks that are likely to be referenced in the near future.
 Caching implies DFS needs to guarantee consistency of data.

Hints
 Hints can be used when the cached data need not be
completely accurate.
 Example: Mapping of the name of a file/directory to the
actual physical device. The address/name of device can be
stored as a hint.
 If this address fails to access the requested file, the cached
data can be purged.
 The file server can refer to a name server, determine the
actual location of file/directory, and update the cache.
 In hints, a cache is neither updated nor invalidated when a
change occurs to the content.
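A minimal sketch of using a location hint, with hypothetical try_access() and name_server.locate() calls: the hint is tried first, and only a failure triggers the authoritative (and more expensive) lookup:

    hint_cache = {}     # file name -> hinted (device or server) address

    def open_with_hint(name, name_server):
        addr = hint_cache.get(name)
        if addr is not None:
            handle = try_access(addr, name)     # hypothetical call
            if handle is not None:
                return handle
            del hint_cache[name]                # hint was wrong: purge it
        addr = name_server.locate(name)         # authoritative lookup
        hint_cache[name] = addr                 # refresh the hint
        return try_access(addr, name)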

Design Issues
 Naming: Locating the file/directory in a DFS based on
name.
 Location of cache: disk, main memory, both.
 Writing policy: Updating original data source when cache
content gets modified.
 Cache consistency: Modifying cache when data source gets
modified.
 Availability: More copies of files/resources.
 Scalability: Ability to handle more clients/users.
 Semantics: Meaning of different operations (read, write,…)

Naming
 Name space: (e.g.,) /home/students/jack, /home/staff/jill.
 Name space is a collection of names.
 Location transparency: file names do not indicate their
physical locations.
 Name resolution: mapping name space to an
object/device/file/directory.
 Naming approaches:
 Simple Concatenation: add the hostname to file names.
 Guarantees unique names.
 No transparency: moving a file to another host involves a file name change.

Naming: Approaches ...
 Naming approaches:
 .....
 Mounting: mount remote directories to local ones. Location transparent after mounting (followed in Sun NFS).
 Example: /students is mounted at /home.
 Remember: different clients in the system can mount in different ways. (e.g.,) In client 1: mount /students at /, i.e., /students/jack, /students/jill. In client 2: mount /students at /usr, i.e., /usr/students/jack, /usr/students/jill.
 Single Global Directory: all files in the system belong to a single name space (followed in the Sprite OS).
 System-wide unique names, i.e., all clients mount the same way.
 Difficult to enforce this restriction. Can work only among (highly) cooperating systems (or system administrators!)

Naming: Context
 Context: identifying the name space within which name
resolution is to be done.
 Example: context using ~ (tilde).
 ~jill/t: /home/staff/jill/t
 ~john/t: /home/students/john/t
 ~name: represents the directory structure associated with a
person or a project.
 Whenever file “t” is accessed, it is interpreted with reference
to ~’s environment.
 ~ helps when different clients mount in different ways while still
sharing the same set of users and their home directories.
 (e.g.,) ~john may be mapped to /home/students/john in client
1 and to /usr/students/john in client 2.
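A small sketch of this kind of context-based resolution, assuming each client keeps its own table mapping ~name to a home directory (the client-1 paths follow the examples above; the client-2 path for jill is an assumption):

    # Per-client context: how ~name expands on that client.
    context_client1 = {"john": "/home/students/john", "jill": "/home/staff/jill"}
    context_client2 = {"john": "/usr/students/john",  "jill": "/usr/staff/jill"}

    def expand(path, context):
        """Expand a leading ~name against the client's own context."""
        if path.startswith("~"):
            name, _, rest = path[1:].partition("/")
            return context[name] + ("/" + rest if rest else "")
        return path

    # expand("~john/t", context_client1) -> "/home/students/john/t"
    # expand("~john/t", context_client2) -> "/usr/students/john/t"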

Name Resolution
 Done by name servers that map file names to actual files.
 Centralized name server: send names to the server and get
the path of servers+devices that lead to the requested file.
 Name server becomes a bottleneck.
 Distributed name server: (e.g.,) consider access to a file
/a/b/c/d/e
 Local name server identifies the remote server that handles the
part /b/c/d/e
 This procedure may be recursively done till ../e is resolved.
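A minimal sketch of that iterative resolution, assuming each name server offers a hypothetical lookup() that either resolves the name or returns a referral to the next server with the remaining path:

    def resolve_name(path, local_name_server):
        """Resolve a path such as /a/b/c/d/e by following referrals."""
        server, remaining = local_name_server, path
        while True:
            result = server.lookup(remaining)        # hypothetical call
            if result.is_resolved:
                return result.target                 # .../e has been resolved
            # Referral: another name server handles the rest of the path.
            server, remaining = result.next_server, result.remaining_path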

Caching
 In main memory:
 Faster than disks.
 Diskless workstations can also cache.
 Server-cache is in main memory -> same design can be used in
clients also.
 Disadvantage: clients need main memory for virtual memory
management too.
 In disks:
 Large files can be cached.
 Virtual memory management is straightforward.
 After caching the necessary files, the client can get disconnected
from network (if needed, for instance, to help its mobility).

Writing Policy
 When should a modified cache content be transferred to
the server? (A sketch of both policies appears after this list.)
 Write-through policy:
 Immediate writing at server when cache content is modified.
 Advantage: reliability, crash of cache (client) does not mean loss
of data.
 Disadvantage: Several writes for each small change.
 Delayed writing policy:
 Write at the server, after a delay.
 Advantage: small/frequent changes do not increase network
traffic.
 Disadvantage: less reliable, susceptible to client crashes.
 Write at the time of file closing.
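A sketch contrasting the two policies, assuming a hypothetical server.write() call; write-through pushes every change to the server immediately, while delayed writing collects dirty blocks and flushes them after a delay or when the file is closed:

    import time

    def write_through(cache, server, block_id, data):
        cache[block_id] = data
        server.write(block_id, data)        # every small change hits the server

    class DelayedWriter:
        def __init__(self, server, delay=30.0):
            self.server, self.delay = server, delay
            self.dirty = {}                 # block_id -> (data, time when dirtied)

        def write(self, cache, block_id, data):
            cache[block_id] = data
            self.dirty[block_id] = (data, time.time())

        def flush(self, closing=False):
            """Push blocks that are old enough, or everything on close."""
            now = time.time()
            for block_id, (data, dirtied) in list(self.dirty.items()):
                if closing or now - dirtied >= self.delay:
                    self.server.write(block_id, data)
                    del self.dirty[block_id]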

Cache Consistency
 When should a modified source content be transferred to the
cache?
 Server-initiated policy:
 Server cache manager informs the client cache managers, which can then
retrieve the updated data.
 Client-initiated policy:
 Client cache manager checks the freshness of data before delivering to
users. Overhead for every data access.
 Concurrent-write sharing policy:
 Multiple clients open the file, at least one client is writing.
 File server asks other clients to purge/remove the cached data for the
file, to maintain consistency.
 Sequential-write sharing policy: a client opens a file that was
recently closed after writing.

Cache Consistency ...
 Sequential-write sharing policy: a client opens a file that
was recently closed after writing.
 This client may have outdated cache blocks of the file (since the
other client might have modified the file contents).
 Use time stamps for both cache and files. Compare the time
stamps to know the freshness of blocks.
 The other client (which was writing previously) may still have
modified data in its cache that has not yet been updated on
server. (e.g.,) due to delayed writing.
 Server can force the previous client to flush its cache
whenever a new client opens the file.
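A minimal sketch of timestamp-based validation along these lines, with a hypothetical server.getattr() returning the file's last-modification time and server.read() fetching a block:

    def read_with_validation(client_cache, server, file_id, block_no):
        entry = client_cache.get((file_id, block_no))
        current_mtime = server.getattr(file_id).mtime     # hypothetical call
        if entry is not None:
            data, cached_mtime = entry
            if cached_mtime == current_mtime:
                return data                               # cached block is fresh
            del client_cache[(file_id, block_no)]         # stale: purge it
        data = server.read(file_id, block_no)
        client_cache[(file_id, block_no)] = (data, current_mtime)
        return data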

Availability
 Intention: overcome the failure of servers or network
links.
 Solution: replication, i.e., maintain copies of files at
different servers.
 Issues:
 Maintaining consistency
 Detecting inconsistencies, if they happen despite best efforts.
Possible reasons for such inconsistencies:
 Replica is not updated due to a server failure or a broken
network link.
 Inconsistency problems and their recovery may reduce the
benefit of replication.

Availability: Replication
 Unit of replication: mostly a file.
 Replicas of a file in a directory may be handled by different
servers, requiring extra name resolutions to locate the replicas.
 Replication unit: group of files:
 Advantage: process of name resolution, etc., to locate replicas
can be done for a set of files and not for individual files.
 Disadvantage: wasteful of disk space if users often need only a
few of the files in the group.

Replica Management
 Two-phase commit protocols can be used to update all
replicas.
 Other schemes:
 Weighted votes (see the sketch after this list):
 A certain number of votes, r or w, must be obtained before
reading or writing.
 Current synchronization site (CSS):
 Designate a process/site to control the modifications.

 File open/close are done through CSS.

 CSS can become a bottleneck.
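A small sketch of the weighted-vote check, assuming the usual quorum constraints r + w > N and 2w > N (so a read quorum always overlaps a write quorum, and two writes cannot proceed concurrently); the vote assignment is illustrative:

    # Illustrative vote assignment: three replicas, one vote each (N = 3).
    votes = {"replicaA": 1, "replicaB": 1, "replicaC": 1}
    N = sum(votes.values())
    r, w = 2, 2                        # satisfies r + w > N and 2*w > N

    def quorum_reached(replicas_heard, quorum):
        """True if the votes collected from the responding replicas suffice."""
        return sum(votes[x] for x in replicas_heard) >= quorum

    # A read can proceed once any two replicas respond:
    print(quorum_reached(["replicaA", "replicaC"], r))    # True
    # A write that reached only one replica must wait or abort:
    print(quorum_reached(["replicaB"], w))                # False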

Scalability
 Ease of adding more servers and clients with respect to the
problems / design issues discussed before such as caching,
replication management, etc.
 Server-initiated cache invalidation scales up better.
 Using the clients’ caches:
 A server serves only X clients.
 New clients (after the first X) are informed of the X clients from
whom they can get the data (sort of chaining/hierarchy).
 Cache misses & invalidations are propagated up and down this
hierarchy, i.e., each node serves as a mini-file server for its children.
 Structure of a server:
 I/O operations through threads (lightweight processes) can help in
handling more clients.

Semantics
 What is the effect / meaning of an operation?
 (e.g.,) read returns the data due to the latest write operation.
 Guaranteeing the above semantics in the presence of
caching can be difficult.
 We saw techniques for these under caching.

Case Study: Sun NFS
 Major goal: keep the distributed file system independent of
underlying hardware and operating system.
 NFS (Network File System): uses the Remote Procedure
Call (RPC) for remote file operations.
 Virtual file system (VFS) interface: provides uniform,
virtual file operations that are mapped to the actual file
system. (e.g.,) VFS can be mapped to DOS, so NFS can
work with PCs.
 VFS uses a structure called the vnode (virtual node) that is
unique in an NFS.
 Each vnode has a mount table that provides a pointer to its
parent file system and to the system over which it is
mounted.

Sun NFS...
 A vnode can be a mount point.
 Using mount tables, VFS interface can distinguish
between local and remote file systems.
 Requests to remote files are routed to the NFS by the VFS
interface.
 RPCs are used to reach remote VFS interface.
 Remote VFS invokes appropriate local file operation.

Sun NFS Architecture
(Figure: Sun NFS architecture. On the client, the OS interface sits above the VFS interface in the kernel; the VFS interface routes requests to the local Unix file system (local disk), to other file systems, or to NFS, which uses RPC/XDR across the network. On the server, RPC/XDR hands the request to the server's VFS interface and file system routines, which access the server's disks.)
NFS: Naming & Location
 Each client can configure its file system independently of
others, i.e., different clients can see different name spaces.
 Name resolution example:
 Look up for a/b/c. a corresponds to vnode1 (assume).
 Look up on vnode1/b returns vnode2 that might say the object is
on server X.
 Look up on vnode2/c is sent to X. X returns a file handle (if the
file exists, permissions match, etc.).
 File handle is used for subsequent file operations.
 Name resolution in NFS is an iterative process (slow).
 Name space information is not maintained at each server as
the servers in NFS are stateless (to be discussed later).

NFS: Caching
 NFS Client Cache:
 File blocks: cached on demand.
 Employs read ahead. Large block sizes (8 Kbytes) for data transfer to
improve the sequential read performance.
 Entire files cached, if they are small. Timestamps of files are also cached.

 Cached blocks are valid for certain period after which validation is
needed from server. Validation done by comparing time stamps of file at
server.
 Delayed writing policy used. Modified files are flushed after closing to
handle sequential-write sharing.
 File name to vnode translations: directory name lookup cache holds the
vnodes for remote directory names.
 Cache updated when lookup fails (cache acts as hints).

 Attributes of files & directories:

NFS: Caching
 NFS Client Cache: ....
 Attributes of files & directories:
 Attribute inquiries form 90% of the calls made to servers.
 Cache entries are updated every time new attributes are received from the server.
 File attributes are discarded after 3 seconds and directory attributes after 30 seconds.
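A sketch of such an attribute cache with the 3-second and 30-second windows described above; server.getattr() stands in for the attribute request and is hypothetical:

    import time

    FILE_TTL, DIR_TTL = 3, 30          # seconds, as described above

    class AttrCache:
        def __init__(self, server):
            self.server = server
            self.entries = {}          # name -> (attributes, time cached)

        def get(self, name, is_dir=False):
            ttl = DIR_TTL if is_dir else FILE_TTL
            entry = self.entries.get(name)
            if entry is not None and time.time() - entry[1] < ttl:
                return entry[0]                     # still valid, no RPC needed
            attrs = self.server.getattr(name)       # hypothetical RPC
            self.entries[name] = (attrs, time.time())
            return attrs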

NFS: Stateless Server
 NFS servers are stateless to help crash recovery.
 Stateless: no record of past requests (e.g., whether file is open,
position of file pointer, etc.,).
 Client requests contain all the needed information. If there is no
response, the client simply re-sends the request.
 After a crash, a stateless server simply restarts. No need to:
 Restore previous transaction records.
 Update clients or negotiate with clients on file status.
 Disadvantages:
 Client message sizes are larger.
 Server cache management is difficult since the server has no idea which
files have been opened/closed.
 Server can provide little information for file sharing.
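A sketch of a self-contained (stateless) read request, with a hypothetical transport; because the server keeps no open-file or file-pointer state, the request names the file handle, offset, and count explicitly, and the client can simply re-send it if there is no response:

    def nfs_read(server, file_handle, offset, count, retries=3):
        """Stateless read: the request carries everything the server needs."""
        request = {"op": "READ", "fh": file_handle,
                   "offset": offset, "count": count}
        for _ in range(retries):
            reply = server.send(request)     # hypothetical RPC transport
            if reply is not None:
                return reply["data"]
            # No response: re-send the identical, idempotent request.
        raise TimeoutError("file server did not respond")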

Un/mounting in NFS
 Mounting of files in Unix is done by using a mount table
stored in a file: /etc/mnttab.
 mnttab is read by programs using procedures such as
getmntent.
 mount command adds an entry in mnttab, i.e., every time a
file system is mounted in the system.
 umount command removes an entry in mnttab, i.e., every
time a file system is unmounted from the system.

Un/Mounting
 First entry in mnttab: file system that was mounted first.
 Usually, file systems get mounted at boot time.
 Mount: the term is probably borrowed from mounting tapes onto systems.
 Each entry is a line of fields separated by spaces in the form:
<special> <mount_point> <fstype> <options> <time>
 <special>: The name of the resource to be mounted.
 <mount_point> : pathname of the directory on which the filesystem is
mounted.
 <fstype> : file system type of the mounted file system.
 <options> : mount options.
 <time> : time at which the file system was mounted.
 Entries for <special>: path-name of a block-special device (e.g.,
/dev/fd0), the name of a remote filesystem (casa:/export/home, i.e.,
host:pathname), or the name of a swap file.
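An illustrative mnttab entry in this format (the host, path, options, and mount time are hypothetical; the time is typically recorded in seconds since the epoch):

    casa:/export/home   /home   nfs   rw,suid   969054890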

Sharing Filesystems
 In SunOS, share command is used to specify the file systems that can be
mounted by other systems.
 (e.g.), share [ -F FSType ] [ -o specific_options ] [-d description
] [ pathname ]
 The share command makes a resource available to remote systems, through a
file system of type FSType.
 <specific_options> : control access of the shared resource.
 rw: pathname is shared read/write to all clients. This is also the default
behavior.
 rw=client[:client]...: pathname is shared read/write only to the listed clients.
No other systems can access pathname.
 ro: pathname is shared read-only to all clients.
 ro=client[:client]...: pathname is shared read-only, and only to the listed
clients. No other systems can access pathname.

Sharing Filesystems…
 <-d description>: -d flag may be used to provide a description of the
resource being shared.
 Example: to share the /disk file system read-only (e.g., at boot time):
 share -F nfs -o ro /disk
 To share /somefs read/write to clients usera and userb:
 share -F nfs -o rw=usera:userb /somefs
 Multiple share commands on the same file system? The last command
supersedes the earlier ones.
 Try:
 /etc/dfs/dfstab: list of share commands to be executed at boot time
 /etc/dfs/fstypes: list of file system types, NFS by default
 /etc/dfs/sharetab: system record of shared file systems.

Automounting
 Mount a remote file system only when it is accessed, perhaps only for a
guessed duration of time.
 automount utility: installs autofs mount points and associates an
automount map with each mount point.
 autofs file system monitors attempts to access directories within it and
notifies the automountd daemon.
 automountd uses the map to locate the file system, then mounts it at the
point of reference within the autofs file system.
 A map can be assigned to an autofs mount using an entry in the
/etc/auto_master map or a direct map.
 If the file system is not accessed within an appropriate interval (10
minutes by default), the automountd daemon unmounts the file
system.
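Illustrative map entries, assuming the usual two-level layout (the host and path names are hypothetical): a line in /etc/auto_master ties a mount point to a map, and the map entry then names the remote file system to mount on demand:

    # /etc/auto_master:   mount-point   map-name
    /home     auto_home

    # auto_home (indirect map):   key   location
    jack      casa:/export/home/jack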

Cluster File System
 System Model: a set of storage devices that can be
accessed by a set of workstations.

(Figure: systems 1 through n and a pool of storage devices (RAIDs, tapes/CDs) attached to a very high speed network. RAID: Redundant Array of Inexpensive Disks.)
Cluster File System
 Storage devices can be viewed as a “pool of centralized
resources”.
 Storage devices are shared by a set of workstations/systems, called a
cluster.
 Both the pool of storage and the cluster are attached to very high
speed networks (typically optical networks).
 Devices can be mounted to different systems: e.g., RAID 1 to
system n, RAID 2 to system 1, etc.
 Features:
 Mirroring: replication of entire disks
 Striping: data (e.g., multimedia) spread over multiple disks
 Online reconfiguration: add/delete storage devices dynamically
 Assign/remove devices to applications/systems dynamically

Storage Virtualization
 Storage virtualization is a logical representation of the physical resources:
storage devices & workstations.
 Virtualization specifies details such as which devices are
meant for which host, how they can be shared, etc.
 Possible places for virtualization: (each choice has its own
advantages and disadvantages)
 Workstations or hosts
 Volume managers (software) are run on hosts,
providing control over how data is stored and
accessed over the different devices.

Storage Virtualization...
 Possible places for virtualization:
 In storage subsystem
 Associated with large-scale RAID subsystems
(many terabytes). Virtualization services embedded on
storage controllers.
 In special appliances: “in-band” or “out-of-band”
 Special, intelligent appliances are used to provide
virtualization
 Appliance name: NAS (Network Attached Storage)

 In-band: NAS is part of storage pool

 Out-of-band: NAS not a part of storage pool

Veritas Volume Manager
 Works on both Unix and Windows
 Builds a diskgroup spanning multiple devices.
 Dynamic diskgroups management
 Striping of data on multiple RAIDs.
 Striping distributes data on multiple disks and hence
increases the disk bandwidth for retrieval. Suitable for
multimedia data.
 Cluster Volume Manager:
 Allows a volume to be simultaneously mounted for use across
multiple servers for both reads and writes.

Veritas Cluster Server
 Cluster server handles up to 32 systems.
 It monitors, controls and restarts applications in response
to a variety of faults.
 (e.g.,) application A1 may be started on system n if system
1 fails. Disk group D1 will be automatically assigned to
system n.
 (e.g.) Disk group D2 may be assigned to system 1 if D1
fails and the application A1 will continue.

(Figure: systems S1 and Sn sharing disk groups D1 and D2.)
Service Groups
 A set of resources working together to provide application
services to clients.
 Service group example:
 Disk groups having data
 Volume built using disk group
 File system (directories) using the volume
 Servers/systems providing the application
 Application program + libraries
 Types of Service Groups:
 Failover Groups: runs on 1 system in a cluster at a time. Used for
applications that are not designed to maintain data consistency on
multiple copies.
 The cluster server monitors the heartbeat of the system. If it fails, the backup
is brought on-line.

Service Groups...
 Types of Service Groups...:
 Parallel groups: run concurrently on more than 1 system.
 Time-to-recovery:
 On a failure, an application service is moved to another server in
the cluster.
 Disk groups are de-imported from the crashed server and
imported by the back-up server.
 Volume manager helps to manage the disk group ownership and
accelerate recovery process of the cluster.
 New ownership properties are broadcast to the cluster to ensure
data security.
 The time taken to bring the back-up online.

Disaster Tolerance
 More than 1 cluster connected by very high speed networks
over a wide area network.
 Cluster 1 and 2 geographically distributed.

(Figure: Cluster 1 and Cluster 2 connected by a very high speed link over a wide area network.)

Veritas Volume Replicator
 A redundant copy of the application (and its data) in another cluster must be kept
up-to-date.
 Volume Replicator allows a disk group to be replicated at 1 or
more remote clusters.
 Initialization of replication: entire disk group is replicated.
 Runtime: only modifications to data are communicated.
Conserves network bandwidth.
 Disk groups at the remote cluster are not usually active.
 Identical instance of application is run on the remote cluster in
idle mode.
 A disaster is identified by the volume replicator using heartbeats.
 It then puts the remote cluster on-line for the applications.
 Time-to-recovery: less than 1 minute.

