Introduction
CloudStore (formerly the Kosmos filesystem, KFS) was initially designed and implemented at Kosmix in 2006 by two developers, Sriram Rao and Blake Lewis. KFS was released as an open-source project in September 2007. Quantcast is now the primary sponsor of the project.
Web-scale applications require a scalable storage infrastructure to process vast amounts of data. CloudStore is an open-source, high-performance distributed file system designed to meet such an infrastructure need.
Web search engines must process large volumes of data. This entails a scalable backend storage infrastructure built on commodity hardware (such as a cluster of PCs running Linux). To address this infrastructure need, Kosmix released KFS as an open-source project under the terms of the Apache 2.0 license. The initial release was KFS version 0.1, and it is currently in "alpha".
KFS was designed and implemented for a growing class of applications that process large volumes of data: web search, web log analysis, Web 2.0 apps, and grid computing. The key requirement these applications share is a cost-efficient, scalable compute/storage infrastructure, and KFS is focused on the storage side of that requirement.
The short version: KFS is a fairly complete clone of GFS.
ABOUT CLOUDSTORE
CloudStore builds upon ideas from Google's well-known filesystem project, GFS. The system consists of three components:
● Meta server: provides the global namespace for the filesystem.
● Chunkserver (block server): files in KFS are split into chunks, each 64MB in size. Chunks are replicated and striped across chunkservers, which store them as files in the underlying file system (such as ZFS on Solaris or XFS on Linux).
● Client library: linked with applications, the client library provides the file system API that allows them to read/write files stored in KFS. To use CloudStore, applications need to be modified and relinked with the CloudStore client library.
A file consists of a set of chunks, but applications are oblivious to chunks: I/O is done on files, and the translation from a file offset to a chunk/offset is transparent to the application. Each chunk is fixed in size at 64MB. While CloudStore can be accessed natively from C++ applications, support is also provided for Java and Python: JNI (Java Native Interface) glue code and a Python module are included in the release so that those applications can access CloudStore via the client library APIs.
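The offset-to-chunk translation the client library performs transparently can be sketched in a few lines (the function name is illustrative, not the actual KFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # KFS chunks are fixed at 64 MB

def file_offset_to_chunk(offset):
    """Translate a logical file offset into (chunk index, offset within chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE
```

For example, a read at offset 64MB + 5 lands at byte 5 of chunk 1; the application only ever sees the file offset.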
Release history:
● KFS 0.1: the first release of the software. KFS is a scalable distributed file system designed for applications with large data processing needs (such as web search, text mining, and grid computing).
● The next release updated the initial one with additional features and stability fixes.
● A subsequent bug-fix release rolled in a fix to the client library code from kfs-0.1.1.
● KFS 0.1.3: the primary change from the previous release is support for 32-bit platforms.
● KFS 0.2.1: updates kfs-0.2.0 with minor fixes and support for Mac OS X (Leopard).
● KFS 0.2.2: stability improvements over the previous release, plus a Web UI for monitoring KFS servers.
● KFS 0.2.3: adds (1) JBOD support on the chunkservers (chunks are placed round-robin across drives, and the free space on a given drive is also used to determine placement), and (2) kfsfsck, a tool for checking KFS health. This release also contains stability fixes as well as fixes to FUSE.
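One plausible reading of the 0.2.3 JBOD placement rule (round-robin across drives, tempered by available space) can be sketched as follows. The function name and the 10% free-space cutoff are assumptions for illustration, not the actual KFS policy:

```python
def pick_drive(drives, last_index):
    """Choose a drive for the next chunk: round-robin order, but skip
    drives that are low on free space.
    drives: list of (name, free_bytes); last_index: index picked last time.
    Returns the index of the chosen drive."""
    n = len(drives)
    order = [(last_index + 1 + i) % n for i in range(n)]  # round-robin order
    threshold = max(free for _, free in drives) * 0.1     # assumed cutoff
    for idx in order:
        if drives[idx][1] >= threshold:
            return idx
    return order[0]  # every drive is low: fall back to plain round-robin
```

Round-robin alone spreads I/O load evenly; factoring in free space prevents a nearly full drive from receiving chunks it cannot hold.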
CloudStore Features
Incremental scalability: New chunkserver nodes can be added as storage needs increase; the
system automatically adapts to the new nodes.
Availability: Replication is used to preserve availability in the face of chunkserver failures. Typically, files are replicated 3-way.
Per file degree of replication: The degree of replication is configurable on a per file basis, with a
max. limit of 64.
Re-replication: Whenever the degree of replication for a file drops below the configured level (for example, due to an extended chunkserver outage), the metaserver forces the affected chunks to be re-replicated on the remaining chunkservers. Re-replication is done in the background without overwhelming the system.
Rack-aware data placement: The chunk placement algorithm is rack-aware. Wherever possible, it
places chunks on different racks.
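One way to realize "wherever possible, place chunks on different racks" is a two-pass selection over candidate servers. This is a sketch under assumed data structures, not the actual metaserver code:

```python
def place_replicas(servers, degree):
    """Pick `degree` chunkservers for a chunk, preferring distinct racks.
    servers: list of (server_id, rack_id) candidates."""
    chosen, racks_used = [], set()
    # First pass: at most one replica per rack.
    for sid, rack in servers:
        if len(chosen) == degree:
            break
        if rack not in racks_used:
            chosen.append(sid)
            racks_used.add(rack)
    # Second pass: if there are fewer racks than replicas, fill the
    # remaining slots even though racks now repeat.
    for sid, rack in servers:
        if len(chosen) == degree:
            break
        if sid not in chosen:
            chosen.append(sid)
    return chosen
```

Spreading replicas across racks means the loss of a whole rack (e.g., a switch failure) still leaves at least one live copy of each chunk.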
Re-balancing: Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is
done to help with balancing disk space utilization amongst nodes.
Data integrity: To handle disk corruption, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
File writes: The system follows the standard model. When an application creates a file, the filename
becomes part of the filesystem namespace. For performance, writes are cached at the CloudStore
client library. Periodically, the cache is flushed and data is pushed out to the chunkservers. Also,
applications can force data to be flushed to the chunkservers. In either case, once data is flushed to
the server, it is available for reading.
Leases: CloudStore client library uses caching to improve performance. Leases are used to support
cache consistency.
Client side fail-over: The client library is resilient to chunkserver failures. During reads, if the client library determines that the chunkserver it is communicating with is unreachable, it will fail over to another chunkserver and continue the read. This fail-over is transparent to the application.
Language support: The CloudStore client library can be accessed from C++, Java, and Python.
FUSE support on Linux: Mounting CloudStore via FUSE allows existing Linux utilities (such as ls) to interface with CloudStore.
Tools: A shell binary is included in the set of tools, allowing users to navigate the filesystem tree using utilities such as cp, ls, mkdir, rmdir, rm, and mv. Tools to monitor the chunk/meta-servers are also provided.
Job placement support: The CloudStore client library exports an API to determine the location of a
byte range of a file. Job placement systems built on top of CloudStore can leverage this API to
schedule jobs appropriately.
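A job placement system using such an API essentially maps a byte range to chunks, then to the hosts holding those chunks. A sketch, with chunk_hosts standing in for the answers the client library would fetch from the metaserver:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed KFS chunk size

def chunks_for_range(start, length):
    """Indices of the chunks covering the byte range [start, start + length)."""
    if length <= 0:
        return []
    return list(range(start // CHUNK_SIZE, (start + length - 1) // CHUNK_SIZE + 1))

def hosts_for_range(start, length, chunk_hosts):
    """Set of hosts storing any chunk of the range. chunk_hosts maps a
    chunk index to the hosts holding its replicas, as a scheduler would
    learn from the metaserver."""
    hosts = set()
    for c in chunks_for_range(start, length):
        hosts.update(chunk_hosts.get(c, []))
    return hosts
```

A scheduler can then prefer running a task on one of the returned hosts, so that the task reads its input locally.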
Local read optimization: When applications are run on the same nodes as chunkservers, the
CloudStore client library contains an optimization for reading data locally. That is, if the chunk is stored
on the same node as the one on which the application is executing, data is read from the local node.
Design summary:
● Construct a global namespace by decoupling storage from the filesystem namespace.
● Build a ''disk'' by aggregating the storage from individual nodes in the cluster.
● Stripe a file across multiple nodes in the cluster to improve performance.
● Use replication to tolerate failures.
● Simplify storage management: the system automagically balances storage utilization across all nodes.
● Any file can be accessed from any machine in the network.
System Architecture
A single meta-data server maintains the global namespace, and multiple chunkservers provide access to data; a client library linked with applications is used to access files in KFS. The system is implemented in C++.
Inter-process communication is via non-blocking TCP sockets. The communication protocol is text-based, patterned after HTTP. Connections between the meta-server and chunkservers are persistent, and the failure model is simple:
● A connection break implies failure.
● This works for LAN settings.
Meta Server
The meta server is the repository for all file meta-data. It stores the file meta-data (directory information, the blocks of a file, etc.) in memory in a B-tree. Operations that mutate the tree are logged to a log file. Periodically, an off-line process compacts the log files to create a checkpoint file. Whenever the metaserver is restarted, it rebuilds the B-tree from the latest checkpoint; it then applies mutations to the tree from the log files to recover system state.
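The recovery sequence can be sketched as follows; the dict stands in for the in-memory B-tree, and the operation names are illustrative rather than the actual KFS log format:

```python
def recover(checkpoint, logs):
    """Rebuild metadata state after a restart: start from the latest
    checkpoint, then replay logged mutations in order."""
    state = dict(checkpoint)          # checkpoint: path -> metadata
    for op, path, value in logs:      # logs: ordered mutations
        if op == "create":
            state[path] = value
        elif op == "remove":
            state.pop(path, None)
    return state
```

Checkpoint-plus-log is the classic trade-off: the checkpoint bounds replay time, while the log captures everything that happened after the checkpoint was cut.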
For fault-tolerance, the meta server's logs and checkpoint files should be backed up to a remote node.
The source code contains scripts that use rsync to backup system meta-data.
Chunkserver
Chunkservers store chunks, which are blocks of a file. Each chunk is 64MB in size. The chunkserver stores chunks as files in the underlying filesystem. To protect against data corruption, Adler-32 checksums are used:
● On writes, checksums are computed on 64KB block boundaries and saved in the chunk meta-data.
● On reads, checksum verification is done using the saved checksums.
Internally, each chunk file is named [file-id].[chunk-id].[version]. Each chunk file has a 16K header that contains the chunk checksum information; the checksum information is updated during writes.
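Python's zlib exposes Adler-32, so the per-64KB-block scheme can be sketched directly (the list of checksums here stands in for the data kept in the 16K chunk header):

```python
import zlib

BLOCK = 64 * 1024  # checksums are computed on 64 KB block boundaries

def block_checksums(data):
    """Adler-32 checksum for each 64 KB block of a chunk's data."""
    return [zlib.adler32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data, saved_checksums):
    """Read-side verification: recompute and compare against the saved
    checksums; any mismatch signals disk corruption."""
    return block_checksums(data) == saved_checksums
```

Checksumming per 64KB block rather than per 64MB chunk means a read of a small range only needs to verify the blocks it actually touches.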
When a chunkserver is restarted, it scans the directory containing the chunks to determine the chunks it has, and sends that information to the meta server. The meta server validates the blocks and notifies the chunkserver of any stale blocks, i.e., those that are not owned by any file in the system. Whenever a chunkserver receives a stale chunk notification, it deletes those chunks.
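The stale-chunk determination is essentially a set difference between what the chunkserver reports and what the metaserver's namespace owns (a sketch with assumed data shapes):

```python
def find_stale(reported_chunks, owned_chunks):
    """Chunks a restarted chunkserver reported that no file in the
    namespace owns; the metaserver tells the chunkserver to delete them."""
    return sorted(set(reported_chunks) - set(owned_chunks))
```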
Client library
The client library enables applications to access files stored in KFS. For file operations, the client
library interfaces with the meta-data server to get file meta-data:
● On reads, the client library interfaces with the meta server to determine chunk locations; it then
downloads the block from one of the replicas
● On writes, the client library interfaces with the meta server to determine where to write the
chunk; the client library then forwards the data to the first replica; the first replica forwards the
data to the next replica and so on.
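The chained write path above can be sketched as follows; the dict stands in for each replica's disk, and the function illustrates only the forwarding pattern, not the actual wire protocol:

```python
def pipeline_write(data, replicas, store):
    """Forward data along the replica chain: the first replica stores its
    copy, then forwards the data to the next replica, and so on.
    store: dict mapping replica id -> stored bytes (stands in for disk)."""
    if not replicas:
        return
    first, rest = replicas[0], replicas[1:]
    store[first] = data                # persist locally
    pipeline_write(data, rest, store)  # forward downstream
```

Chaining the replicas, rather than having the client send to all of them, keeps the client's outbound bandwidth constant regardless of the replication degree.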
KFS virtualizes disk storage on a cluster of machines providing a global namespace. Files are striped
across nodes in the cluster and are replicated for fault tolerance/availability. KFS consists of a client
library that enables user applications to read/write files stored in KFS. KFS supports the familiar
filesystem interfaces/programming model. The functionality of the KFS API is similar to the model
exposed by operating systems such as Linux. To illustrate,
• When a file is created, the filename is visible in the global namespace.
• As data is written to a block of a file, it gets flushed out to the set of servers storing that block. Data
written to servers can now be read by other processes.
• For writing/reading, a process can seek to any point in the file and read/write from there.
• Files can be opened for writing multiple times.
• Data can be appended to existing files by opening the file for writing in append mode.
When blocks of a file are striped across nodes in the cluster, KFS stores individual blocks of file as
files in the underlying file system (such as, XFS on Linux). To guard against disk corruption,
checksums are computed on the blocks and verified on each read. If disk corruption is detected by
checksum mismatch, the system discards the corrupted block and uses re-replication to recover lost
data. Each file stored in KFS is typically replicated 3-way; depending on application needs, the degree of replication for files can be changed on the fly. KFS also contains rudimentary support for block rebalancing: to help with better disk utilization across nodes, the system may periodically migrate data from over-utilized to under-utilized nodes. The KFS client library provides support for job placement systems. For instance, a job scheduler can determine the location(s) of a byte range within a file and schedule jobs appropriately. KFS is implemented in C++; in addition to C++ applications, KFS also contains support for Java (via JNI) and Python applications. To enable a large class of applications to evaluate KFS, KFS has been integrated as the backing store for other open-source projects:
• Hypertable: Hypertable is an open source project (being developed at Zvents Inc.) that provides a
Big-Table interface. KFS is integrated with Hypertable as the backing store.
• Hadoop: KFS can also serve as a backing store for Hadoop. In the Hadoop configuration, the following properties point Hadoop at the KFS meta server:
<property>
<name>fs.kfs.metaServerHost</name>
<value><server></value>
<description>The location of the meta server.</description>
</property>
<property>
<name>fs.kfs.metaServerPort</name>
<value><port></value>
<description>The port on which the meta server listens.</description>
</property>
This enables path URIs of the form:
<property>
<name>fs.default.name</name>
<value>kfs://<server:port></value>
</property>
The Hadoop distribution contains a KFS jar file. To use the latest files, build the jar file (see HowToCompile). After that, a command such as

bin/hadoop fs -fs kfs://<server:port> -ls /

can be used to list the contents of the root directory.
There are a few bug fixes that have been checked into KFS+Hadoop glue code (the code is in:
src/.../org/apache/hadoop/fs/kfs/). These fixes will be part of Hadoop-0.18x. To enable KFS to work
properly with prior Hadoop releases, the bug fixes need to be backported. Code with the backport is
included in the kfs-0.2.0 release. For example, to use with Hadoop-0.17x (similarly for other releases):
cp ~/code/kfs/kfs-hadoop/0.17x/org/apache/hadoop/fs/kfs/* <hadoop-dir>/src/java/org/apache/hadoop/fs/kfs
rm <hadoop-dir>/lib/kfs-0.1*.jar
cp ~/code/kfs/build/kfs-0.2.0.jar <hadoop-dir>/lib
ant jar
This will build the Hadoop jar files to include the new "glue" code. After the build finishes, restart the Map/Reduce job trackers.
Limitations in CLOUDSTORE
Currently the single point of failure is the lone meta-data server. The GFS authors argued that a single master is a great feature, but they keep live backup masters ready to go. I'm sure this will be fixed soon in KFS.
From the GFS paper:
Having a single master vastly simplifies our design and enables the master to make sophisticated
chunk placement and replication decisions using global knowledge. The master state is replicated for
reliability. Its operation log and checkpoints are replicated on multiple machines.
Most complaints so far are along the lines of "It doesn't work with Hadoop", which is silly, because it does. However, KFS is not really attractive until it has a MapReduce-alike; you have to be able to do something with all that data, after all. Maybe KFS can be integrated with Hadoop's MapReduce to make up for the current lack of such from Kosmix?