ZFS

CHAPTER
1
INTRODUCTION
Anyone who has ever lost important files, run out of space on a partition, spent weekends
adding new storage to servers, tried to grow or shrink a file system, or experienced data
corruption knows that there is room for improvement in file systems and volume
managers. Solaris ZFS is designed from the ground up to meet the emerging needs of a
general purpose local file system that spans the desktop to the data center. Solaris ZFS
offers a dramatic advance in data management with an innovative approach to data
integrity, near zero administration, and a welcome integration of file system and volume
management capabilities. The centerpiece of this new architecture is the concept of a
virtual storage pool which decouples the file system from physical storage in the same
way that virtual memory abstracts the address space from physical memory, allowing for
much more efficient use of storage devices. In Solaris ZFS, space is shared dynamically
between multiple file systems from a single storage pool, and is parceled out of the pool
as file systems request it. Physical storage can be added to or removed from storage pools
dynamically, without interrupting services, providing new levels of flexibility,
availability, and performance. In terms of scalability, Solaris ZFS is a 128-bit file
system. Its theoretical limits are truly mind-boggling: 2^128 bytes of storage, and 2^64
for everything else, such as file systems, snapshots, directory entries, devices, and more.
ZFS also implements RAID-Z, an improvement on RAID-5 that uses parity, striping,
and atomic operations to ensure reconstruction of corrupted data. It is ideally suited for
managing industry standard storage servers like the Sun Fire 450.
CHAPTER
2
FEATURES
ZFS is more than just a file system. In addition to the traditional role of data storage, ZFS
also includes advanced volume management that provides pooled storage through a
collection of one or more devices. These pooled storage areas may be used for ZFS file
systems or exported through a ZFS Emulated Volume (ZVOL) device to support
traditional file systems such as UFS. ZFS uses the pooled storage concept which
completely eliminates the antique notion of volumes. According to Sun, this feature does
for storage what the VM did for the memory subsystem. In ZFS everything is
transactional: this keeps the data always consistent on disk, removes almost all
constraints on I/O order, and allows for huge performance gains. The main features of
ZFS are given in this chapter.
2.1 STORAGE POOLS
Fig (1).Pooled Storage in ZFS
Unlike traditional file systems, which reside on single devices and thus require a volume
manager to use more than one device, ZFS file systems are built on top of virtual storage
pools called zpools. A zpool is constructed of virtual devices (vdevs), which are
themselves constructed of block devices: files, hard drive partitions, or entire drives, with
the last being the recommended usage. Block devices within a vdev may be configured in
different ways, depending on needs and space available: non-redundantly (similar to
RAID 0), as a mirror (RAID 1) of two or more devices, as a RAID-Z group of three or
more devices, or as a RAID-Z2 group of four or more devices. Besides standard storage,
devices can be designated as a volatile read cache (L2ARC, the second level of the ARC), a nonvolatile write cache, or a
spare disk for use only in the case of a failure. Finally, when mirroring, block devices can
be grouped according to physical chassis, so that the file system can continue in the face
of the failure of an entire chassis.
Storage pool composition is not limited to similar devices but can consist of ad-hoc,
heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently
doling out space to diverse file systems as needed. Arbitrary storage device types can be
added to existing pools to expand their size at any time. If high-speed solid-state drives
(SSDs) are included in a pool, ZFS will transparently utilize the SSDs as cache within the
pool, directing frequently used data to the fast SSDs and less-frequently used data to
slower, less expensive mechanical disks. The storage capacity of all vdevs is available to
all of the file system instances in the zpool. A quota can be set to limit the amount of
space a file system instance can occupy, and a reservation can be set to guarantee that
space will be available to a file system instance.
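The interaction between shared pool space, quotas, and reservations can be illustrated with a small model. This is only a sketch under simplifying assumptions: the class `Pool`, its method names, and the byte counts are hypothetical and are not part of any ZFS interface. A quota caps what one file system may use, while a reservation sets space aside so that other file systems cannot consume it.

```python
class Pool:
    """Toy model of one storage pool shared by several file systems (hypothetical API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.filesystems = {}

    def unreserved_free(self):
        # Each file system holds back max(used, reservation) bytes of the pool.
        committed = sum(max(fs["used"], fs["reservation"])
                        for fs in self.filesystems.values())
        return self.capacity - committed

    def create_fs(self, name, quota=None, reservation=0):
        # A reservation is only granted if the pool can still honour it.
        if reservation > self.unreserved_free():
            raise OSError("not enough free space for reservation")
        self.filesystems[name] = {"used": 0, "quota": quota,
                                  "reservation": reservation}

    def write(self, name, nbytes):
        fs = self.filesystems[name]
        new_used = fs["used"] + nbytes
        if fs["quota"] is not None and new_used > fs["quota"]:
            raise OSError("quota exceeded")
        # Growth inside the file system's own reservation costs the pool nothing extra.
        extra = (max(new_used, fs["reservation"])
                 - max(fs["used"], fs["reservation"]))
        if extra > self.unreserved_free():
            raise OSError("pool out of space")
        fs["used"] = new_used


pool = Pool(capacity=100)
pool.create_fs("home", quota=30)        # may never use more than 30
pool.create_fs("db", reservation=50)    # always has 50 available to it
pool.write("home", 30)
pool.write("db", 40)
```

In this model no space is assigned to a file system up front; space leaves the pool only when data is written or when a reservation is declared.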
To eliminate bottlenecks and increase the speed of reads and
writes, Solaris ZFS stripes data across all available storage devices, balancing I/O and
maximizing throughput. And, as disks are added to the storage pool, Solaris ZFS
immediately begins to allocate blocks from those devices, increasing effective bandwidth
as each device is added. This means system administrators no longer need to monitor
storage devices to see if they are causing I/O bottlenecks.
2.2 COPY-ON-WRITE TRANSACTION MODEL

Blocks containing active data are never overwritten in place; instead, a new block is
allocated, modified data is written to it, and then any metadata blocks referencing it are
similarly read, reallocated, and written. To reduce the overhead of this process, multiple
updates are grouped into transaction groups, and an intent log is used when synchronous
write semantics are required.
Fig (2). Copy-on-Write transaction model (panels: 1. Initial block tree; 2. COW some blocks; 3. COW indirect pointer; 4. Rewrite main pointer)
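The four steps of the copy-on-write update can be sketched with an immutable tree. This is a minimal illustration, not the ZFS on-disk format: the `Node` class and the `cow_update` function are hypothetical names, and transaction groups and the intent log are left out. A modified leaf is written to a new block, each indirect block on the path to it is reallocated, and the root pointer is switched last.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """An immutable block: a leaf holding data, or an indirect block holding children."""
    children: Tuple["Node", ...] = ()
    data: bytes = b""


def cow_update(node, path, new_data):
    """Return a new root in which the leaf reached by `path` holds `new_data`.

    Only the blocks on the path from the root to the leaf are reallocated;
    every other block is shared with the old tree.
    """
    if not path:
        return Node(data=new_data)            # new leaf; the old leaf is untouched
    index = path[0]
    new_child = cow_update(node.children[index], path[1:], new_data)
    children = node.children[:index] + (new_child,) + node.children[index + 1:]
    return Node(children=children)            # new indirect block pointing at new child


# 1. Initial block tree: a root, two indirect blocks, four leaves A to D.
leaves = tuple(Node(data=bytes([65 + i])) for i in range(4))
old_root = Node(children=(Node(children=leaves[:2]), Node(children=leaves[2:])))

# 2 and 3. Copy the modified leaf and the indirect blocks above it.
new_root = cow_update(old_root, (1, 0), b"X")

# 4. Rewrite the main pointer: one assignment makes the new tree current.
root = new_root
```

Until the final assignment, a reader following the old root sees a fully consistent old tree, which is why no block of active data is ever overwritten in place.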
2.3 Snapshots and clones

An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing
the old data can be retained, allowing a snapshot version of the file system to be
maintained. ZFS snapshots are created very quickly, since all the data composing the
snapshot is already stored. They are also space efficient, since any unchanged data is
shared among the file system and its snapshots.
Writeable snapshots ("clones") can also be created, resulting in two independent file
systems that share a set of blocks. As changes are made to any of the clone file
systems, new data blocks are created to reflect those changes, but any unchanged
blocks continue to be shared, no matter how many clones exist. This is an
implementation of the Copy-on-write principle.
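Block sharing between a file system, its snapshots, and its clones can be sketched as follows. The classes `BlockStore` and `FileSystem` are hypothetical, and the model is simplified: a real snapshot keeps the old root of the block tree rather than copying a table of pointers. The point illustrated is that a snapshot or clone copies pointers only, and that new blocks appear only where data actually changes.

```python
from types import MappingProxyType


class BlockStore:
    """Append-only store: blocks are allocated, never overwritten."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.next_id = 0

    def allocate(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        return block_id


class FileSystem:
    def __init__(self, store, table=None):
        self.store = store
        self.table = dict(table or {})   # file name -> block id

    def write(self, name, data):
        # Copy-on-write: point the name at a freshly allocated block.
        self.table[name] = self.store.allocate(data)

    def read(self, name):
        return self.store.blocks[self.table[name]]

    def snapshot(self):
        # Read-only view of the current pointers; no data is copied.
        return MappingProxyType(dict(self.table))

    def clone(self):
        # Writable file system that starts out sharing every block.
        return FileSystem(self.store, self.table)


store = BlockStore()
fs = FileSystem(store)
fs.write("a", b"1")
fs.write("b", b"2")
snap = fs.snapshot()
clone = fs.clone()
clone.write("a", b"9")    # only this change allocates a new block
```

After the last line the store holds three blocks in total: the two originals, still shared by the file system, the snapshot, and the clone, plus the one block the clone changed.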
2.4 Encryption
With Oracle Solaris, the encryption capability in ZFS[48] is embedded into the I/O
pipeline. During writes, a block may be compressed, encrypted, checksummed and
then deduplicated, in that order. The policy for encryption is set at the dataset level
when datasets (file systems or ZVOLs) are created. The wrapping keys provided by
the user/administrator can be changed at any time without taking the file system
offline. The default behaviour is for the wrapping key to be inherited by any child
datasets. The data encryption keys are randomly generated at dataset creation time.
Only descendant datasets (snapshots and clones) share data encryption keys.[49] A
command is provided to switch to a new data encryption key, for a clone or at any
other time. This does not re-encrypt already existing data; instead it relies on an
encrypted master-key mechanism.
2.5 Data integrity

One major feature that
distinguishes ZFS from other file systems is that ZFS is designed with a focus on data
integrity. That is, it is designed to protect the user's data on disk against silent data
corruption caused by bit rot, current spikes, bugs in disk firmware, phantom writes (the
previous write did not make it to disk), misdirected reads/writes (the disk accesses the
wrong block), DMA parity errors between the array and server memory or from the
driver (since the checksum validates data inside the array), driver errors (data winds up in
the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file
system), etc.
Data integrity is a high priority in ZFS because recent research shows that neither the
currently widespread file systems (such as UFS, Ext, XFS, JFS, or NTFS) nor hardware
RAID provide sufficient protection against such problems (hardware RAID has some
issues with data integrity). Initial research indicates that ZFS protects data better than
earlier efforts. While it is also faster than UFS, it can be seen as the successor to UFS.
2.6 ZFS data integrity

For ZFS, data integrity is achieved by using a (Fletcher-based) checksum or a (SHA-256)
hash throughout the file system tree. Each block of data is checksummed and the
checksum value is then saved in the pointer to that block, rather than at the actual block
itself. Next, the block pointer is checksummed, with the value being saved at its pointer.
This checksumming continues all the way up the file system's data hierarchy to the root
node, which is also checksummed, thus creating a Merkle tree. In-flight data corruption
or phantom reads/writes (the data written/read checksums correctly but is actually wrong)
are undetectable by most filesystems as they store the checksum with the data. ZFS stores
the checksum of each block in its parent block pointer so the entire pool self-validates.
When a block is accessed, regardless of whether it is data or meta-data, its checksum is
calculated and compared with the stored checksum value of what it "should" be. If the
checksums match, the data are passed up the programming stack to the process that asked
for it; if the values do not match, then ZFS can heal the data if the storage pool provides
data redundancy (such as with internal mirroring), assuming that the copy of data is
undamaged and with matching checksums.[19] If the storage pool consists of a single
disk, it is possible to provide such redundancy by specifying copies=2 (or copies=3),
which means that data will be stored twice (or three times) on the disk, effectively
halving (or, for copies=3, reducing to one third) the storage capacity of the disk.[20] If
redundancy exists, ZFS will fetch a copy of the data (or recreate it via a RAID recovery
mechanism), and recalculate the checksum, ideally resulting in the reproduction of the
originally expected value. If the data passes this integrity check, the system can then
update the faulty copy with known-good data so that redundancy can be restored.
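The read path described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ZFS implementation: the `Mirror` class, the block addresses, and the 8-byte pointer layout are hypothetical, and SHA-256 stands in for whichever checksum is configured. The checksum travels in the parent's pointer rather than with the block, and a copy that fails verification is repaired from one that passes.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


class Mirror:
    """Two copies of every block, addressed by block number (toy model)."""

    def __init__(self):
        self.sides = [{}, {}]

    def write(self, addr, data):
        for side in self.sides:
            side[addr] = data

    def read_verified(self, addr, expected):
        """Return data matching `expected`, rewriting any copy that does not."""
        good = None
        for side in self.sides:
            if checksum(side[addr]) == expected:
                good = side[addr]
                break
        if good is None:
            raise OSError("all copies fail the checksum")
        for side in self.sides:
            if checksum(side[addr]) != expected:
                side[addr] = good           # self-healing: restore redundancy
        return good


mirror = Mirror()

# Leaf block at address 0; its checksum lives in the parent's pointer.
mirror.write(0, b"file data")
child_ptr = (0, checksum(b"file data"))

# Parent block at address 1 contains the child pointer (address + checksum).
parent_block = child_ptr[0].to_bytes(8, "big") + child_ptr[1]
mirror.write(1, parent_block)
root_ptr = (1, checksum(parent_block))

# Silent corruption on one side of the mirror.
mirror.sides[0][0] = b"file dat4"

# Walk down from the root, verifying each block against its parent's pointer.
parent = mirror.read_verified(*root_ptr)
addr, expected = int.from_bytes(parent[:8], "big"), parent[8:]
data = mirror.read_verified(addr, expected)
```

Because the parent block is itself covered by the checksum in the root pointer, a corrupted pointer would be caught one level higher, which is the Merkle-tree property the section describes.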
2.7 Dynamic striping

Dynamic striping across all devices to maximize throughput means that as additional
devices are added to the zpool, the stripe width automatically expands to include them;
thus, all disks in a pool are used, which balances the write load across them.
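How a newly added device starts taking writes immediately can be shown with a toy allocator. The placement policy used here, always choosing the device with the most free space, is an assumption made for illustration and is not claimed to be the policy ZFS uses; the function name `allocate` and the device names are likewise hypothetical.

```python
def allocate(free, nblocks):
    """Place `nblocks` blocks one at a time on the device with the most free space.

    `free` maps device name -> free blocks and is updated in place.
    Returns how many blocks each device received.
    """
    placement = {dev: 0 for dev in free}
    for _ in range(nblocks):
        dev = max(free, key=lambda d: free[d])
        if free[dev] == 0:
            raise RuntimeError("pool is full")
        free[dev] -= 1
        placement[dev] += 1
    return placement


free = {"disk0": 10, "disk1": 10}
first = allocate(free, 8)      # writes are spread over both disks

free["disk2"] = 10             # a disk is added to the pool
second = allocate(free, 6)     # the new disk is used straight away
```

In this model the first batch of writes is split evenly, and after the third disk is added the allocator favours it until its free space matches the others, so all three devices end up sharing the write load.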
Variable block sizes
ZFS uses variable-sized blocks, with 128 KB as the default size. Available features allow
the administrator to tune the maximum block size which is used, as certain workloads do
not perform well with large blocks. If data compression is enabled, variable block sizes
are used. If a block can be compressed to fit into a smaller block size, the smaller size is
used on the disk to use less storage and improve I/O throughput (though at the cost of
increased CPU use for the compression and decompression operations).
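The effect of compression on the stored block size can be sketched as below. Several details are assumptions: `zlib` stands in for whatever compression algorithm is enabled, the 512-byte allocation granularity (`SECTOR`) is chosen for illustration, and `on_disk_size` is a hypothetical helper. The 128 KB maximum is the default block size given in the text.

```python
import os
import zlib

SECTOR = 512               # assumed allocation granularity
MAX_BLOCK = 128 * 1024     # default block size from the text


def on_disk_size(block: bytes) -> int:
    """Bytes allocated for one logical block, rounded up to whole sectors."""
    if len(block) > MAX_BLOCK:
        raise ValueError("block larger than the maximum block size")
    compressed = zlib.compress(block)
    # Keep the raw block when compression does not make it smaller.
    stored = min(len(compressed), len(block))
    return -(-stored // SECTOR) * SECTOR      # ceiling to a sector multiple


zeros = on_disk_size(b"\0" * MAX_BLOCK)        # highly compressible
noise = on_disk_size(os.urandom(MAX_BLOCK))    # effectively incompressible
```

A block of zeros shrinks to a single sector in this model, while random data gains nothing and is stored at its full 128 KB, which is the trade-off between storage saved and CPU spent that the paragraph describes.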
2.8 Lightweight filesystem creation

In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation
within a traditional filesystem; the time and effort required to create or expand a ZFS
filesystem is closer to that of making a new directory than it is to volume manipulation in
some other systems.
2.9 Resilvering and scrub

ZFS has no equivalent of fsck, the tool common on Unix filesystems that performs file
system validation and repair. Instead, ZFS has a repair tool called "scrub"
which examines and repairs silent corruption and other problems. Some differences are:
fsck must be run on an offline filesystem, which means the filesystem must be
unmounted and is not usable while being repaired. scrub does not need the ZFS
filesystem to be taken offline; scrub is designed to be used on a mounted, live filesystem.
fsck usually only checks metadata (such as the journal log) but never checks the data
itself. This means, after an fsck, the data might still be corrupt. scrub checks everything,
including metadata and the data. The effect can be observed by comparing fsck to scrub
times: sometimes a fsck on a large RAID completes in a few minutes, which means only
the metadata was checked. Traversing all metadata and data on a large RAID takes many
hours, which is exactly what scrub does.
The official recommendation from Sun/Oracle is to scrub enterprise-level disks once a
month, and cheaper commodity disks once a week.
2.10 Cache management

ZFS also uses the Adaptive Replacement Cache (ARC), a new method for read cache
management, instead of the traditional Solaris virtual memory page cache. For write
caching, ZFS employs the ZFS Intent Log (ZIL). ZFS makes allowances for both of these
methods to incorporate separate virtual devices to improve the total IOPS. For read
operations it is the "cache" vdev and for write operations it is the "log" vdev.
2.11 Adaptive endianness

Pools and their associated ZFS file systems can be moved between different platform
architectures, including systems implementing different byte orders. The ZFS block
pointer format stores filesystem metadata in an endian-adaptive way; individual metadata
blocks are written with the native byte order of the system writing the block. When
reading, if the stored endianness does not match the endianness of the system, the
metadata is byte-swapped in memory.
This does not affect the stored data; as is usual in POSIX systems, files appear to
applications as simple arrays of bytes, so applications creating and reading data remain
responsible for doing so in a way independent of the underlying system's endianness.
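The write-native, swap-on-read approach can be sketched with Python's `struct` module. The one-byte order flag and the single 64-bit word used here are a hypothetical layout chosen for illustration, not the real ZFS block pointer format, and `write_block` and `read_block` are illustrative names.

```python
import struct
import sys


def write_block(value: int, byteorder: str = sys.byteorder) -> bytes:
    """Serialize one 64-bit metadata word in the writer's byte order, plus a flag byte."""
    flag = b"\x01" if byteorder == "little" else b"\x00"
    fmt = "<Q" if byteorder == "little" else ">Q"
    return flag + struct.pack(fmt, value)


def read_block(raw: bytes, host_order: str = sys.byteorder) -> int:
    """Read the word as the host would, then byte-swap only if the orders differ."""
    stored_order = "little" if raw[0] == 1 else "big"
    fmt = "<Q" if host_order == "little" else ">Q"
    (value,) = struct.unpack(fmt, raw[1:9])          # plain native-order read
    if stored_order != host_order:
        # Byte swap in memory; the stored block is left unchanged.
        value = int.from_bytes(value.to_bytes(8, "little"), "big")
    return value


word = 0x0123456789ABCDEF
written_on_big_endian = write_block(word, "big")
seen_on_little_endian = read_block(written_on_big_endian, "little")
```

The writer never pays a conversion cost, and a reader with the same byte order does not either; only a reader on the other kind of machine performs the swap.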
CHAPTER
3
CAPACITY LIMIT
ZFS is a 128-bit file system, so it can address 1.84 × 10^19 times more data than 64-bit
systems such as Btrfs. The limitations of ZFS are designed to be so large that they should
not be encountered in the foreseeable future.
Some theoretical limits in ZFS are:
2^48: number of entries in any individual directory
16 exbibytes (2^64 bytes): maximum size of a single file
16 exbibytes: maximum size of any attribute
256 zebibytes (2^78 bytes): maximum size of any zpool
2^56: number of attributes of a file (actually constrained to 2^48 for the number of files in a ZFS file system)
2^64: number of devices in any zpool
2^64: number of zpools in a system
2^64: number of file systems in a zpool
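The figures in this chapter follow directly from powers of two, and the arithmetic can be checked with a few lines. The variable names are illustrative; the unit definitions (1 exbibyte = 2^60 bytes, 1 zebibyte = 2^70 bytes) are the standard binary prefixes.

```python
EIB = 2**60     # bytes in one exbibyte
ZIB = 2**70     # bytes in one zebibyte

# How much more a 128-bit address space covers than a 64-bit one.
ratio = 2**128 // 2**64
ratio_text = f"{ratio:.2e}"

max_file_eib = 2**64 // EIB     # maximum file size in exbibytes
max_pool_zib = 2**78 // ZIB     # maximum zpool size in zebibytes

print(ratio_text, max_file_eib, max_pool_zib)
```

The ratio is 2^64, or about 1.84 × 10^19, and the two size limits come out at 16 exbibytes and 256 zebibytes, matching the list above.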
CHAPTER
4
PLATFORMS
4.1 Solaris
Solaris 10 update 2 and later
ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC
and x86-based systems.
Solaris 11
After Oracle's Solaris 11 Express release, the OS/Net consolidation (the main OS code)
was made proprietary and closed-source, and further ZFS upgrades and implementations
inside Solaris (such as encryption) are not compatible with other non-proprietary
implementations which use previous versions of ZFS.
When creating a new ZFS pool, to retain the ability to access the pool from other
non-proprietary Solaris-based distributions, it is recommended to upgrade to Solaris 11
Express from OpenSolaris (snv_134b), and thereby stay at ZFS version 28.
OpenSolaris
OpenSolaris 2008.05 and 2009.06 use ZFS as their default file system. There are over a
dozen 3rd-party distributions. (OpenIndiana
and illumos are two new distributions not included on the OpenSolaris distribution
reference page.)
OpenIndiana
OpenIndiana 148 and 151 use ZFS version 28, as implemented in illumos.
By upgrading from OpenSolaris snv_134 to both OpenIndiana and Solaris 11 Express,
one also has the ability to upgrade and separately boot Solaris 11 Express on the same
ZFS pool, but one should not install Solaris 11 Express first because of ZFS
incompatibilities introduced by Oracle past ZFS version 28.
4.2 BSD
OS X
OpenZFS on OSX (abbreviated to O3X) is an implementation of ZFS for OS X.
O3X is under active development, with close relation to ZFS on Linux and illumos' ZFS
implementation, while maintaining feature flag compatibility with ZFS on Linux. O3X
implements zpool version 5000, and includes the Solaris Porting Layer (SPL) originally
authored for MacZFS, which has been further enhanced to include a memory
management layer based on the illumos kmem and vmem allocators. O3X is fully
featured, supporting LZ4 compression, deduplication, ARC, L2ARC, and SLOG.
MacZFS is free software providing support for ZFS on OS X. The stable legacy branch
provides up to ZFS pool version 8 and ZFS filesystem version 2. The development
branch, based on ZFS on Linux and OpenZFS, provides updated ZFS functionality, such
as up to ZFS zpool version 5000 and feature flags.
A proprietary implementation of ZFS (Zevo) was available at no cost from GreenBytes,
Inc., implementing up to ZFS file system version 5 and ZFS pool version 28. Zevo
offered a limited ZFS feature set, pending further commercial development; it was sold to
Oracle in 2014, with unknown future plans.
DragonFlyBSD
Edward O'Callaghan started the initial port of ZFS to DragonFlyBSD.
NetBSD
The NetBSD ZFS port was started as a part of the 2007 Google Summer of Code and in
August 2009, the code was merged into NetBSD's source tree.
FreeBSD
Paweł Jakub Dawidek ported ZFS to FreeBSD, and it has been part of FreeBSD since
version 7.0. This includes zfsboot, which allows booting FreeBSD directly from a ZFS
volume.
FreeBSD's ZFS implementation is fully functional; the only missing features are kernel
CIFS server and iSCSI, but at least the latter can be added using externally available
packages. Samba can be used to provide a userspace CIFS server.
FreeBSD 7-STABLE (where updates to the series of versions 7.x are committed to) uses
zpool version 6. FreeBSD 8 includes a much-updated implementation of ZFS, and zpool
version 13 is supported. zpool version 14 support was added to the 8-STABLE branch on
January 11, 2010, and is included in FreeBSD release 8.1. zpool version 15 is supported
in release 8.2. The 8-STABLE branch gained support for zpool version v28 and zfs
version 5 in early June 2011. These changes were released mid-April 2012 with FreeBSD
8.3. FreeBSD 9.0-RELEASE uses ZFS Pool version 28.
FreeBSD 9.2-RELEASE is the first FreeBSD version to use the new "feature flags" based
implementation, and thus pool version 5000.
MidnightBSD
MidnightBSD, a desktop operating system derived from FreeBSD, supports ZFS storage
pool version 6 as of 0.3-RELEASE. This was derived from code included in FreeBSD
7.0-RELEASE. An update to storage pool 28 is in progress in 0.4-CURRENT and based
on 9-STABLE sources around FreeBSD 9.1-RELEASE code.
PC-BSD
PC-BSD is a desktop version of FreeBSD, which inherits FreeBSD's ZFS support,
similarly to FreeNAS. It also allows installation with disk encryption using geli. Its
graphical installer can even handle / (root) on ZFS, RAID-Z pool, and Gnome installs
right from the start in an easy, convenient way (GUI). The current PC-BSD 10.0+ "Joule
Edition" has ZFS filesystem version 5 and ZFS storage pool version 5000.
FreeNAS
FreeNAS, an embedded open source network-attached storage (NAS) distribution based
on FreeBSD, has the same ZFS support as FreeBSD and PC-BSD.
ZFS Guru
ZFS Guru is an embedded open source network-attached storage (NAS) distribution based
on FreeBSD, under active development.
NAS4Free
NAS4Free, an embedded open source network-attached storage (NAS) distribution based
on FreeBSD 9.3, has the same ZFS support as FreeBSD 9.3, with ZFS storage pool version
5000. This project is a continuation of the FreeNAS 7 series project.
Debian GNU/kFreeBSD
Being based on the FreeBSD kernel, Debian GNU/kFreeBSD has ZFS support from the
kernel. However, additional userland tools are required. It is possible to have ZFS
as the root or /boot file system, in which case the required GRUB configuration is performed by
the Debian installer since the Wheezy release.
As of 31 January 2013, the ZPool version available is 14 for the Squeeze release, and 28
for the Wheezy-9 release.
4.3 Linux
ZFS has several Linux implementations despite the fact that the GNU General Public
License (GPL), under which the Linux kernel is licensed, is incompatible with the
Common Development and Distribution License (CDDL) under which ZFS is
distributed. Under these licensing models, a single derived work of both
projects cannot be legally distributed, as it is not possible to simultaneously meet both
licenses' requirements. To include ZFS in the Linux kernel, ZFS would have to be cleanly
reimplemented, and patents may hamper this.
This problem is being worked around by providing the kernel facilities through a separate
kernel module, a technical solution for a legal problem that is also being employed by
vendors and distributors of proprietary hardware drivers.
Native ZFS on Linux
A native port of ZFS for Linux produced by the Lawrence Livermore National
Laboratory (LLNL) was released in March 2013, with the following key events:
2008: prototype to determine viability
2009: initial ZVOL and Lustre support
2010: development moved to GitHub
2011: POSIX layer added
2011: community of early adopters
2012: production usage of ZFS
2013: stable GA release
Of the major distributions, Ubuntu and Gentoo have very good support for ZFS on Linux,
meaning that required packages can be installed from their own package repositories, and
configuring a ZFS root filesystem is well documented. Slackware also provides
documentation on supporting ZFS, both as a kernel module and when built into the
kernel. The current zpool version supported by ZFS on Linux is 5000.
Linux FUSE
Another solution to the license incompatibility issue was to port ZFS to Linux's
FUSE system, so the filesystem runs entirely in userspace instead of being part of the
Linux kernel, in which case it is not considered a derived work of the kernel. A project to
do this was sponsored by Google's Summer of Code program in 2006.
KQ InfoTech
Another native port for Linux was developed by KQ InfoTech in 2010.[98][99]
This port used the zvol implementation from the Lawrence Livermore National
Laboratory as a starting point. A release supporting zpool v28 was announced in January
2011. In April 2011, KQ InfoTech was acquired by sTec, Inc., and their work on ZFS
ceased. Source code of this port can be found on GitHub.
The work of KQ InfoTech was pulled back into the native port of ZFS for Linux,
produced by the Lawrence Livermore National Laboratory.
CHAPTER
5
LIMITATIONS
Capacity expansion is normally achieved by adding groups of disks as a top-level vdev:
simple device, RAID-Z, RAID-Z2, RAID-Z3, or mirrored. Newly written data will
dynamically start to use all available vdevs. It is also possible to expand the array by
iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to
heal itself; the heal time will depend on the amount of stored information, not the disk
size.
As of Solaris 10 Update 11 and Solaris 11.2, it is possible neither to reduce the
number of top-level vdevs in a pool, nor to otherwise reduce pool capacity. This
functionality was said to be in development already in 2007.
It is not possible to add a disk as a column to a RAID-Z, RAID-Z2, or RAID-Z3 vdev.
This feature depends on the block pointer rewrite functionality due to be added soon.
One can, however, create a new RAID-Z vdev and add it to the zpool.
Some traditional nested RAID configurations, such as RAID 51 (a mirror of RAID 5
groups), are not configurable in ZFS. Vdevs can only be composed of raw disks or files,
not other vdevs. However, a ZFS pool effectively creates a stripe (RAID 0) across its
vdevs, so the equivalent of a RAID 50 or RAID 60 is common.
Reconfiguring the number of devices in a top-level vdev requires copying data
offline, destroying the pool, and recreating the pool with the new top-level vdev
configuration. There are two exceptions. Extra redundancy can be added to an existing
mirror at any time. And if all top-level vdevs are mirrors with sufficient redundancy, the
zpool split command can be used to remove a vdev from each top-level vdev in the pool,
creating a second pool with identical data.
Resilver (repair) of a crashed disk in a ZFS RAID takes a long time. This applies to all
types of RAID, in one way or another. Future large disks, say 5 TB or 6 TB, can
therefore take several days to repair. For this reason raidz1 (similar to RAID 5) should be
avoided: repairing a RAID puts additional stress on the other disks, which might cause
them to crash, losing all data in the storage pool if it is configured as raidz1. With large
disks, one should use raidz2 (which allows two disks to crash) or raidz3 (which allows
three disks to crash). It should be noted, however, that ZFS RAID differs from
conventional RAID by reconstructing only live data and metadata when replacing a disk,
not the entirety of the disk including blank and garbage blocks. Replacing a member disk
on a ZFS pool that is only partially full therefore takes proportionately less time than
with conventional RAID.
IOPS performance of a ZFS storage pool can suffer if the ZFS RAID is not appropriately
configured. This applies to all types of RAID, in one way or another. If the zpool consists
of only one group of disks configured as, say, eight disks in raidz2, then the write IOPS
performance will be that of a single disk. However, read IOPS will be the sum of eight
individual disks. This means that, to get high write IOPS performance, the zpool should
consist of several vdevs, because one vdev gives the write IOPS of a single disk.
There are ways to mitigate this IOPS performance problem, for instance adding
SSDs as a ZIL cache, which can boost IOPS into the 100,000s.[61] In short, a zpool should
consist of several groups of vdevs, each vdev consisting of 8 to 12 disks. It is not
recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS
performance will be that of a single disk, which also means that resilver time will be very
long (possibly weeks with future large drives).
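The advice on vdev layout can be turned into a small back-of-the-envelope calculation. It applies only the rule of thumb stated in the text, that each RAID-Z vdev delivers roughly the write IOPS of a single disk; the figure of 100 write IOPS per disk and the function names are assumptions made for the example, not measurements.

```python
def pool_write_iops(vdev_count: int, disk_write_iops: int) -> int:
    """Rule of thumb from the text: one RAID-Z vdev gives the write IOPS of one disk."""
    return vdev_count * disk_write_iops


def usable_disks(vdev_count: int, disks_per_vdev: int, parity: int) -> int:
    """Disks' worth of capacity left after parity (raidz1=1, raidz2=2, raidz3=3)."""
    return vdev_count * (disks_per_vdev - parity)


DISK_WRITE_IOPS = 100   # assumed figure for a mechanical disk

# Twenty disks as one wide raidz2 vdev.
wide_iops = pool_write_iops(1, DISK_WRITE_IOPS)
wide_capacity = usable_disks(1, 20, parity=2)

# The same twenty disks as two raidz2 vdevs of ten disks each.
split_iops = pool_write_iops(2, DISK_WRITE_IOPS)
split_capacity = usable_disks(2, 10, parity=2)
```

Under these assumptions, splitting the disks into two vdevs doubles the estimated write IOPS at the cost of two more disks of parity, which is the trade-off behind the recommendation against a single large vdev.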
Conclusion
ZFS is very simple, in the sense that it concisely expresses the user's intent. It is very
powerful, as it introduces pooled storage, snapshots, clones, compression,
scrubbing and RAID-Z. It is safe, as it detects and corrects silent data corruption. It
is very fast thanks to dynamic striping, intelligent prefetch, and pipelined I/O. By
offering data security and integrity, virtually unlimited scalability, as well as easy and
automated manageability, Solaris ZFS simplifies storage and data management for
demanding applications today, and well into the future.
REFERENCE
Solaris ZFS Administration Guide

http://opensolaris.org/os/community/zfs/docs/zfsadmin.pdf
ZFS - FreeBSD Wiki
http://wiki.freebsd.org/ZFS
FreeBSD/ZFS - last word in operating/file systems
http://people.freebsd.org/~pjd/pubs/eurobsdcon07_zfs.pdf
ZFS: A last word in file system
http://www.opensolaris.org/os/community/zfs/Zfs_last.pdf
http://blogs.sun.com/main/tags/zfs
www.seminarsonly.com