Don't Give Up on Distributed File Systems
Jeremy Stribling, Emil Sit, M. Frans Kaashoek, Jinyang Li†, and Robert Morris
MIT CSAIL        †New York University
2.3 Write locally / Read globally

WheelFS adopts a policy of initially storing written data on or near the writing node, to get disk or LAN throughput for output, and of searching for nearby cached copies when reading. This policy effectively results in lazy, as-needed transfers, and is likely to perform well for many workloads, such as intermediate result files in Grid computations.

When an application creates a new file, WheelFS picks a semi-random new file ID that will cause the application's node to be the file's responsible server. WheelFS will buffer writes to the file until the application calls close(), at which point it will create the first version of the file locally. This local transfer will be efficient for large files, though for small files the expense may be dominated by communicating with the directory server. Creating a new version of an existing file whose responsible server is far away will not be as efficient, so applications should store output in unique new files where possible.
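For instance (a sketch only: the /wfs/grid path and the compute_stage command are hypothetical, not part of WheelFS), a Grid job can exploit this policy by giving each node's intermediate output a fresh, node-specific name:

#!/bin/sh
# Writing under a unique new name makes this node the file's
# responsible server, so the first version is created locally at
# disk/LAN speed rather than pushed to a distant server.
OUT=/wfs/grid/results/`hostname`-$$.out
./compute_stage > $OUT    # writes are buffered until close()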
In order to open and read the latest version of a file, the application's node first translates the path name by iteratively looking up each path component in the relevant directory to find the next ID, and using consistent hashing to turn the ID into the IP address of the node responsible for the next directory in the path name. This process ends with the application node contacting the node responsible for the file's ID, asking it for the latest version number and a list of IP addresses of nodes believed to be caching or replicating the file. If WheelFS finds more than one copy of the latest version, it fetches different pieces of the file from different copies using a BitTorrent-inspired algorithm and caches it locally on disk to assist other nodes. This mechanism will make reading popular files efficient, as when a distributed computation first starts.
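For intuition, the following is a minimal, self-contained sketch of consistent hashing in the style of [18]: hash the ID and every node address onto one ring, and pick the node whose hash most closely follows the ID's hash, wrapping around if necessary. The node addresses, the 8-hex-digit hash width, and the standalone script form are illustrative assumptions, not WheelFS's actual lookup code.

#!/bin/sh
# Usage: sh successor.sh <id>
# Prints the address of the node responsible for <id> under
# consistent hashing: the node whose hashed address is the successor
# of the ID's hash on the ring. Node addresses are made-up examples.
ID=$1
NODES="10.0.0.1 10.0.0.2 10.0.0.3"
IDHASH=`echo $ID | sha1sum | cut -c1-8`
for N in $NODES; do
    echo "`echo $N | sha1sum | cut -c1-8` $N"
done | sort | awk -v id=$IDHASH '
    NR == 1            { first = $2 }   # smallest hash, for wrap-around
    $1 >= id && !found { print $2; found = 1 }
    END                { if (!found) print first }'

Because only hash order matters, adding or removing a node moves responsibility only for the IDs between it and its ring neighbors, the standard property [18] that makes this attractive for a wide-area system.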
3 Example Application

WheelFS is designed to make it easier to construct and run distributed applications. As an example, this section presents a simplified application design for a cooperative web cache: a web proxy (similar to CoralCDN [14]) that serves each requested page from a cache in WheelFS when possible, and otherwise fetches the page from the origin server.

WheelFS provides a good platform for a cooperative cache system because the application can use location-independent file names to find copies of web pages stored by other nodes. We use DNS redirection [14] to forward each client's request to a random node in our system. The node's inetd server will start the script shown in Figure 1 upon each incoming connection. First, the script parses the incoming HTTP request to extract the URL and translates it to a file name in WheelFS.
#!/bin/sh
# Extract the URL from the HTTP GET request on stdin.
URL=`awk '$1 ~ /GET/ { print $2 }'`
# Turn the URL into a file name (e.g., abc.com/foo/bar.html becomes 9/f/9fxxxxxx).
FILE=`echo $URL | sha1sum | sed "s/\(.\)\(.\)/\1\/\2\/\1\2/"`
FILE_RD=/wfs/cwc/.maxtime=5,bestversion/${FILE}
FILE_WR=/wfs/cwc/.writemany,lax/${FILE}
if [ -f $FILE_RD ]; then
    # Cache hit: serve the stored page if it has not expired.
    EXPIRE_DATE=`head -n 100 $FILE_RD | awk '$1 ~ /Expires:/ {print $2}'`
    if notexpired $EXPIRE_DATE `date`; then
        cat $FILE_RD; exit
    else
        # Expired: remember its Date header for a conditional refetch.
        DATE=`head -n 100 $FILE_RD | grep Date: | cut -f 2- -d " "`
    fi
else
    mkdir -p `dirname $FILE_WR`
fi
wget --header="If-Modified-Since: $DATE" --save-headers -T 2 -O $FILE_WR $URL
cat $FILE_RD
Figure 1: A simplified implementation of a cooperative web cache atop WheelFS. We omit the details of comparing
two dates using a fake command notexpired.
For example, the URL http://abc.com/foo/bar.html will become /wfs/cwc/9/f/9fxxxxxx, where 9fxxxxxx is the content hash of the URL. If the file exists, the script opens it with the cue MaxTime=5,BestVersion, attempting to retrieve the latest version of the file when there is no failure and using any cached copy otherwise. Lastly, the script checks whether the web page has expired according to the saved HTTP header. If any step fails, the script fetches the requested URL from the original web site and saves it in WheelFS.
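To make the name transformation concrete, here is the sed expression from Figure 1 applied in isolation to a short placeholder digest (a real digest would be 40 hex characters; 9f3a7c is made up):

# The expression lifts the first two hex digits of the digest out as
# two directory levels: a digest beginning "9f" lands under 9/f/.
$ printf '9f3a7c\n' | sed "s/\(.\)\(.\)/\1\/\2\/\1\2/"
9/f/9f3a7c

Presumably this two-level fan-out spreads the cached pages over many WheelFS directories, and hence over many responsible nodes, instead of concentrating them in one.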
The cooperative web cache benefits from WheelFS's relaxed consistency semantics. The application can use a cached copy as long as it has not yet expired, and can give up quickly if there are failures. Copies of the application on different nodes may sometimes write the same file name simultaneously; the Lax cue allows this to result in versions with the same number, or files with the same name, if there are failures. Since any copy of the fetched page is adequate, WheelFS's conflict-resolution strategy of eventually picking one version arbitrarily is sufficient.
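As a sketch of that race (reusing the placeholder path from the example above), two nodes that miss in their caches at the same moment may both execute the same store:

# Node A and node B may both run this concurrently for the same URL.
# Under the .writemany,lax cues, WheelFS may briefly hold two versions
# with the same version number and eventually keep one arbitrarily;
# since either fetched copy of the page is usable, nothing is lost.
wget --save-headers -T 2 -O /wfs/cwc/.writemany,lax/9/f/9fxxxxxx $URL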
One reason for the script's simplicity is that it inherits one of CoralCDN's main performance-optimizing techniques from WheelFS: reading from the closest cached copy. Both CoralCDN and our application write downloaded copies of web pages locally. Our system incurs at most 3 round-trips to look up the directory entries in a file's pathname, while CoralCDN requires at most O(log n) round-trips, though the average number of such round-trips will decrease through caching in both systems. WheelFS contacts a file's responsible node to obtain a list of candidate nodes with cached copies, in order to pick the closest copy and to avoid swamping any single node, while CoralCDN uses a more scalable multi-hop lookup protocol to find a nearby cached page. Part of our future work is to investigate how much performance and scalability the system actually achieves in the real world as compared to CoralCDN. Figure 1 is only a prototype, not a complete final system, since it does not yet address other important issues such as cache resource allocation and security policies [34, 35]. However, it does demonstrate how WheelFS can dramatically simplify the distributed storage management aspect of such cooperative cache systems.

4 Related Work

WheelFS adopts the idea of a configuration service from Rosebud [29] and Chubby [8]; distributed directories from xFS [4], Ceph [36], Farsite [1, 12], and GFS [16]; on-disk caches and whole-file operations from AFS [31]; consistent hashing from CFS [11]; cooperative reading from Shark [5], OceanStore/Pond [28], CoBlitz [25], and BitTorrent [10]; trading performance versus consistency from JetFile [17] and TACT [38]; replication techniques from DHTs [9]; finding the closest copy from cooperative Web caches [14, 34]; and proximity prediction from location services [24].

Central-server file systems that store and serve files from a single location [4, 16, 22, 30, 31, 36] create an unnecessary bottleneck when there is interaction among client hosts.

Symmetric file systems increase performance and availability by storing files on many hosts, often writing files locally and fetching data directly from peers [1, 5, 11, 12, 17, 28]. Particularly close to WheelFS are JetFile and Farsite. JetFile [17] is designed for the wide area, stores data on all hosts, provides a POSIX interface, trades some consistency for performance, and uses file versioning, but relies on IP multicast. Farsite [12] is designed for a high-bandwidth, low-latency network. This assumption allows Farsite to provide strong semantics and exact POSIX compatibility with good performance. Farsite handles Byzantine failures, while WheelFS does not.

Storage systems for computational Grids [2, 7, 13, 20, 26, 27, 32, 33, 37] federate single-site storage servers, with a job's scheduler moving data among sites to keep it close to the computations. IBP [26] and GridFTP [2] are low-level mechanisms for staging files, but don't provide the convenience of a file system. LegionFS [37] provides a file system interface built on an object store.

Compared to previous file systems, WheelFS's main technical advantages are its semantic cues, inexpensive (though weak) default consistency, write-local/read-global data movement, and BitTorrent-style cooperative reading. These differences reflect that WheelFS is designed as a reusable component for large-scale distributed applications.

5 Discussion

This paper has argued that a file system designed for distributed applications and experiments running across the Internet would simplify those systems' infrastructure, and it has presented the design of such a file system, WheelFS. The design has many open questions that will need to be explored. For example, WheelFS cannot easily handle cross-directory renames atomically; it is not clear whether this would be a significant problem for real applications. Nodes may need a way to offload storage burden if they are low on disk space. Files may become unreferenced if all replicas of a directory are lost due to permanent failures; perhaps a repair program will be needed. While the system is not intended for long-term durability, some attention may be needed to ensure it is durable enough for typical distributed applications. The semantic cues and default behaviors listed above are a starting point; experience will guide us towards the most useful design.

Acknowledgments

Thanks to Jeremy Kepner, Michael Freedman, and the IPTPS reviewers for their useful feedback.
References

[1] Adya, A., Bolosky, W. J., Castro, M., Cermak, G., Chaiken, R., Douceur, J. R., Howell, J., Lorch, J. R., Theimer, M., and Wattenhofer, R. P. FARSITE: Federated, available, and reliable storage for an incompletely trusted environment. In Proceedings of the 5th OSDI (Dec. 2002).
[2] Allcock, W., Bresnahan, J., Kettimuthu, R., Link, M., Dumitrescu, C., Raicu, I., and Foster, I. The Globus striped GridFTP framework and server. In Proceedings of the 2005 Super Computing (Nov. 2005).
[3] Anderson, T., and Roscoe, T. Learning from PlanetLab. In Proceedings of the 3rd WORLDS (Nov. 2006).
[4] Anderson, T. E., Dahlin, M. D., Neefe, J. M., Patterson, D. A., Roselli, D. S., and Wang, R. Y. Serverless network file systems. In Proceedings of the 15th SOSP (Dec. 1995).
[5] Annapureddy, S., Freedman, M. J., and Mazières, D. Shark: Scaling file servers via cooperative caching. In Proceedings of the 2nd NSDI (May 2005).
[6] Bavier, A., Bowman, M., Chun, B., Culler, D., Karlin, S., Muir, S., Peterson, L., Roscoe, T., Spalink, T., and Wawrzoniak, M. Operating systems support for planetary-scale network services. In Proceedings of the 1st NSDI (Mar. 2004).
[7] Bent, J., Venkataramani, V., LeRoy, N., Roy, A., Stanley, J., Arpaci-Dusseau, A., Arpaci-Dusseau, R., and Livny, M. NeST—a grid enabled storage appliance. Grid Resource Management (Sept. 2003).
[8] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th OSDI (Nov. 2006).
[9] Chun, B.-G., Dabek, F., Haeberlen, A., Sit, E., Weatherspoon, H., Kaashoek, M. F., Kubiatowicz, J., and Morris, R. Efficient replica maintenance for distributed storage systems. In Proceedings of the 3rd NSDI (May 2006).
[10] Cohen, B. Incentives build robustness in BitTorrent. In Proceedings of the Workshop on Economics of Peer-to-Peer Systems (June 2003).
[11] Dabek, F., Kaashoek, M. F., Karger, D., Morris, R., and Stoica, I. Wide-area cooperative storage with CFS. In Proceedings of the 18th SOSP (Oct. 2001).
[12] Douceur, J. R., and Howell, J. Distributed directory service in the Farsite file system. In Proceedings of the 7th OSDI (Nov. 2006).
[13] Foster, I., and Kesselman, C. Globus: A metacomputing infrastructure toolkit. The International Journal of Supercomputer Applications and High Performance Computing 11, 2 (1997), 115–128.
[14] Freedman, M. J., Freudenthal, E., and Mazières, D. Democratizing content publication with Coral. In Proceedings of the 1st NSDI (Mar. 2004).
[15] GENI: Global environment for network innovations. http://www.geni.net.
[16] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In Proceedings of the 19th SOSP (Dec. 2003).
[17] Grönvall, B., Westerlund, A., and Pink, S. The design of a multicast-based distributed file system. In Proceedings of the 3rd OSDI (Feb. 1999).
[18] Karger, D., Lehman, E., Leighton, F., Levine, M., Lewin, D., and Panigrahy, R. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In Proceedings of the 29th Symposium on Theory of Computing (May 1997).
[19] Kola, G., Kosar, T., Frey, J., Livny, M., Brunner, R. J., and Remijan, M. DISC: A system for distributed data intensive scientific computing. In Proceedings of the 1st WORLDS (Dec. 2004).
[20] Kosar, T., and Livny, M. Stork: Making data placement a first class citizen in the grid. In Proceedings of the 24th ICDCS (Mar. 2004).
[21] Lamport, L. The part-time parliament. ACM Transactions on Computer Systems 16, 2 (1998), 133–169.
[22] Mazières, D., Kaminsky, M., Kaashoek, M. F., and Witchel, E. Separating key management from file system security. In Proceedings of the 17th SOSP (Dec. 1999).
[23] Litzkow, M. J., Livny, M., and Mutka, M. W. Condor: A hunter of idle workstations. In Proceedings of the 8th International Conference on Distributed Computing Systems (June 1988).
[24] Ng, T. S. E., and Zhang, H. Predicting Internet network distance with coordinates-based approaches. In Proceedings of the 2002 Infocom (June 2002).
[25] Park, K., and Pai, V. S. Scale and performance in the CoBlitz large-file distribution service. In Proceedings of the 3rd NSDI (May 2006).
[26] Plank, J. S., Beck, M., Elwasif, W., Moore, T., Swany, M., and Wolski, R. The Internet Backplane Protocol: Storage in the network. In Proceedings of the Network Storage Symposium (Oct. 1999).
[27] Rajasekar, A., Wan, M., Moore, R., Schroeder, W., Kremenek, G., Jagatheesan, A., Cowart, C., Zhu, B., Chen, S.-Y., and Olschanowsky, R. Storage resource broker—managing distributed data in a grid. Computer Society of India Journal, Special Issue on SAN 33, 4 (Oct. 2003).
[28] Rhea, S., Eaton, P., Geels, D., Weatherspoon, H., Zhao, B., and Kubiatowicz, J. Pond: The OceanStore prototype. In Proceedings of the 2nd FAST (Mar. 2003).
[29] Rodrigues, R., and Liskov, B. Rosebud: A scalable byzantine-fault-tolerant storage architecture. MIT LCS Technical Report TR/932 (Dec. 2003).
[30] Sandberg, R., Goldberg, D., Kleiman, S., Walsh, D., and Lyon, B. Design and implementation of the Sun network filesystem. In Proceedings of the Summer 1985 USENIX (June 1985).
[31] Satyanarayanan, M., Howard, J. H., Nichols, D. A., Sidebotham, R. N., Spector, A. Z., and West, M. J. The ITC distributed file system: Principles and design. In Proceedings of the 10th SOSP (Dec. 1985).
[32] Tatebe, O., Soda, N., Morita, Y., Matsuoka, S., and Sekiguchi, S. Gfarm v2: A Grid file system that supports high-performance distributed and parallel data computing. In Proceedings of the 2004 Computing in High Energy and Nuclear Physics (Sept. 2004).
[33] Thain, D., Basney, J., Son, S.-C., and Livny, M. The Kangaroo approach to data movement on the grid. In Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10) (Aug. 2001).
[34] Wang, L., Pai, V., and Peterson, L. The effectiveness of request redirection on CDN robustness. In Proceedings of the 5th OSDI (Dec. 2002).
[35] Wang, L., Park, K., Pang, R., Pai, V. S., and Peterson, L. Reliability and security in the CoDeeN content distribution network. In Proceedings of the USENIX 2004 Annual Technical Conference (June 2004).
[36] Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E., and Maltzahn, C. Ceph: A scalable, high-performance distributed file system. In Proceedings of the 7th OSDI (Nov. 2006).
[37] White, B. S., Walker, M., Humphrey, M., and Grimshaw, A. S. LegionFS: A secure and scalable file system supporting cross-domain high-performance applications. In Proceedings of the 2001 Super Computing (Nov. 2001).
[38] Yu, H., and Vahdat, A. Design and evaluation of a conit-based continuous consistency model for replicated services. ACM TOCS 20, 3 (Aug. 2002), 239–282.