
OpenZFS

Basics
George Wilson gwilson@delphix.com
Matt Ahrens mahrens@delphix.com

© 2017 Delphix. All Rights Reserved.


ZFS Guiding Principles
● Pooled storage
  ● Completely eliminates the antique notion of volumes
  ● Does for storage what VM did for memory
● Transactional object system
  ● Always consistent on disk – no fsck, ever
● Provable end-to-end data integrity
  ● Detects and corrects silent data corruption
● Simple administration
  ● Concisely express your intent
13 Years Later...
FS/Volume Model vs. Pooled Storage
Traditional Volumes
● Abstraction: virtual disk
● Partition/volume for each FS
● Grow/shrink by hand
● Each FS has limited bandwidth
● Storage is fragmented, stranded

ZFS Pooled Storage
● Abstraction: malloc/free
● No partitions to manage
● Grow/shrink automatically
● All bandwidth always available
● All storage in the pool is shared

[Diagram: each FS sits on its own Volume vs. all ZFS filesystems sharing one Storage Pool]

[Diagram: traditional stack vs. ZFS stack]
● Traditional: local files / NFS / SMB sit on VFS, which sits on a filesystem (e.g. UFS, ext3), which sits on a volume manager (e.g. LVM, SVM) speaking a block interface.
● ZFS: local files / NFS / SMB go through VFS to the ZPL (ZFS POSIX Layer), a file interface; iSCSI / FC go through the SCSI target to ZVOL (ZFS Volume), a block interface.
● Both ZPL and ZVOL sit on the DMU (Data Management Unit), which provides atomic transactions on objects.
● The DMU sits on the SPA (Storage Pool Allocator), which provides block allocate+write, read, and free.
Copy-On-Write Transactions
1. Initial block tree
2. COW some blocks
3. COW indirect blocks
4. Rewrite uberblock (atomic)

128-byte Block Pointers
● Three DVAs (vdev / offset / ASIZE):
  ● First copy of data
  ● Second copy of data (for metadata)
  ● Third copy of data (pool-wide metadata)
● Block properties: B/D/X flags, lvl, type, cksum, E, comp, PSIZE, LSIZE
● Padding
● Physical birth txg and logical birth txg (when the block was written)
● Fill count
● 256-bit checksum of the data this block points to
End-to-End Data Integrity in ZFS
Disk Block Checksums
● Checksum stored with data block
● Any self-consistent block will pass
● Can't detect stray writes
● Inherent FS/volume interface limitation

ZFS Data Authentication
● Checksum stored in parent block pointer
● Fault isolation between data and checksum
● Checksum hierarchy forms a self-validating Merkle tree

Disk checksum only validates media:
✓ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite

ZFS validates the entire I/O path:
✓ Bit rot
✓ Phantom writes
✓ Misdirected reads and writes
✓ DMA parity errors
✓ Driver bugs
✓ Accidental overwrite
Self-Healing Data in ZFS
1. Application issues a read. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the next disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.

[Diagram: application reading through a ZFS mirror at each of the three steps]
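The same checksum-driven repair can be triggered pool-wide. As a minimal sketch (the pool name "tank" is illustrative), a scrub reads every block, verifies it against the checksum in its parent, and rewrites any bad copy from a good replica:

# zpool scrub tank
# zpool status -v tank
(status reports per-device READ / WRITE / CKSUM error counters and scrub progress)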


ZFS Administration
● Pooled storage – no more volumes!
  ● Up to 2⁴⁸ datasets per pool – filesystems, iSCSI targets, swap, etc.
  ● Nothing to provision!
● Filesystems become administrative control points
  ● Hierarchical, with inherited properties
  ● Per-dataset policy: snapshots, compression, backups, quotas, etc.
  ● Who's using all the space? du(1) takes forever, but df(1M) is instant
  ● Manage logically related filesystems as a group
  ● Inheritance makes large-scale administration a snap
  ● Policy follows the data (mounts, shares, properties, etc.)
  ● Delegated administration lets users manage their own data (see the sketch after this list)
  ● ZFS filesystems are cheap
● Online everything
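As a hedged sketch of these control points (dataset names, property values, and the user name are illustrative, not taken from the slides):

# zfs create tank/home
# zfs set compression=on tank/home
(children of tank/home inherit the property)
# zfs create tank/home/alice
# zfs set quota=10G tank/home/alice
# zfs allow alice snapshot,mount tank/home/alice
(delegated administration: alice can now snapshot and mount her own dataset)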
Simple Administration
● Create a pool named “tank”

# zpool create tank mirror sdc sdd

● Create a new filesystem named “fs”

# zfs create tank/fs


Increase pool capacity
● Space is automatically available to all filesystems
● Writes utilize all bandwidth
● New writes will favor fastest devices

# zpool add tank mirror sde sdf

[Diagram: adding mirror 5 to a pool of mirrors 1–4; every ZFS filesystem sees the larger Storage Pool]
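To confirm the result, the usual status commands apply (pool name as in the example above):

# zpool list tank
(capacity increases immediately)
# zpool status tank
(both mirrors now appear as top-level vdevs)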
ZFS Scalability
● Immense capacity (128-bit)
● Moore's Law: need 65th bit in 10-15 years
● ZFS capacity: 256 quadrillion ZB (1ZB = 1 billion TB)
● Exceeds quantum limit of Earth-based storage
● Seth Lloyd, "Ultimate physical limits to computation."
Nature 406, 1047-1054 (2000)
● 100% dynamic metadata
● No limits on files, directory entries, etc.
● No wacky knobs (e.g. inodes/cg)
● Concurrent everything
● Byte-range locking: parallel read/write without violating POSIX
● Parallel, constant-time directory operations
Built-in Compression
● Block-level compression in SPA
● Transparent to all other layers
● Each block compressed independently
● All-zero blocks converted into file holes

DMU translations: all 128k

SPA block allocations:


128k 37k 69k
vary with compression
● LZJB, LZ4, ZSTD, Zero Block, and GZIP available today;
Pluggable architecture
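A hedged example of enabling compression per dataset (dataset name and algorithm choice are illustrative):

# zfs set compression=lz4 tank/fs
# zfs get compression,compressratio tank/fs
(compressratio reports the achieved ratio for data written so far)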
Variable Block Size
● No single block size is optimal for everything
  ● Large blocks: less metadata, higher bandwidth
  ● Small blocks: more space-efficient for small objects
  ● Record-structured files (e.g. databases) have natural granularity; the filesystem must match it to avoid read/modify/write (see the sketch below)
● Per-object granularity
  ● A 37k file consumes 37k – no wasted space
● Enables transparent block-based compression
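For a record-structured workload the block size can be matched per dataset; a minimal sketch assuming an 8K database page size (names and sizes are illustrative):

# zfs create -o recordsize=8K tank/db
# zfs get recordsize tank/db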


Adaptive Replacement Cache
● Scan-resistant filesystem cache
● Divided into two LRU (least recently used) caches:
  ● Most Recently Used (MRU)
  ● Most Frequently Used (MFU)
● Adapts to memory pressure
● Primary consumer of RAM
● Caches filesystem blocks
● Compressed blocks are cached compressed; uncompressed blocks are cached uncompressed
● Compressed blocks leverage the on-disk checksum
● LRU caches are sized based on workload

[Diagram: a read such as "# cat /tank/foo/file" enters via zfs_read; on an ARC cache miss, ZFS issues I/O for the compressed blocks and decompresses them into the cache]
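On Linux builds of OpenZFS, ARC behaviour can be observed through kstats; a minimal sketch assuming the ZFS-on-Linux kstat layout:

# grep -E '^(size|c_max|mru_size|mfu_size) ' /proc/spl/kstat/zfs/arcstats
(current ARC size, maximum ARC size, and the split between the MRU and MFU lists)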
Traditional RAID (4/5/6)

● Stripe is physically defined
● Partial-stripe writes are awful
  ● 1 write -> 4 I/Os (read & write of data & parity)
● Not crash-consistent
  ● “RAID-5 write hole”
  ● Entire stripe left unprotected (including unmodified blocks)
  ● Fix: expensive NVRAM + complicated logic
RAID-Z

• Single, double, or triple parity


• Eliminates the “RAID-5 write hole”
• No special hardware required for best performance
• How? No partial-stripe writes.
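Creating a RAID-Z pool uses the same one-liner style as the mirror example earlier; a hedged sketch with illustrative device names:

# zpool create tank raidz2 sdc sdd sde sdf sdg
(raidz, raidz2, and raidz3 select single, double, or triple parity)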
RAID-Z: no partial-stripe writes

● Always consistent!
● Each block has its own parity
● Odd-size blocks use slightly more space
● Single-block reads access all disks :-(
ZFS Snapshots
● How do we create a snapshot?
  ● Save the root block
● When a block is removed, can we free it?
  ● Use the BP's birth time
  ● If birth > prevsnap, free it
● When we delete a snapshot, what do we free?
  ● Find unique blocks – tricky!

[Diagram: block trees for snapshots taken at txg 19 and 25 and the live tree at txg 37, with per-block birth times showing which blocks are shared and which are unique to each tree]
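Administratively this is one command per operation; a hedged sketch with illustrative names:

# zfs snapshot tank/fs@monday
# zfs list -t snapshot -r tank/fs
# zfs rollback tank/fs@monday
(revert the live filesystem to the snapshot)
# zfs destroy tank/fs@monday
(frees only the blocks unique to that snapshot)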
128-byte Block Pointers (recap)
[The block-pointer layout slide from earlier is shown again: three DVAs, block properties, physical and logical birth txgs, fill count, and the 256-bit checksum. The birth txgs are what send/receive uses to locate changed blocks.]
Send and Receive
• “zfs send”
– serializes the contents of snapshots
– creates a “send stream” on stdout
• “zfs receive”
– recreates a snapshot from its serialized form
• Incrementals between snapshots

• Remote Replication
– Disaster recovery / Failover
– Data distribution
• Backup
How send/recv works: Design Principles
• Locate changed blocks via block birth time
– Read minimum number of blocks
• Prefetching issues I/O in parallel
– Uses full bandwidth & IOPS of all disks
• Unidirectional
– Insensitive to network latency
– Resumable (consumer provides token)
• DMU consumer
– Insensitive to ZPL complexity
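Resumability works through a token saved on the receiving side; a hedged sketch of the flow (dataset names are illustrative, and the receiver must use -s to keep partial state):

# zfs send -i @monday pool/fs@tuesday | ssh host zfs receive -s tank/recvd/fs
(if the transfer is interrupted, the partially received state is retained)
# ssh host zfs get -H -o value receive_resume_token tank/recvd/fs
# zfs send -t <token> | ssh host zfs receive -s tank/recvd/fs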
Design: locating incremental changes
FromSnap's birth time is TXG 5. A block pointer tells us that everything below it was born in TXG 3 or earlier; therefore we do not need to examine any of those (greyed-out) blocks.

[Diagram: a block tree storing "hello_AsiaBSDcon", with per-block birth TXGs (3, 8, 2, 6, 4, ...); only subtrees whose birth TXG is greater than 5 need to be traversed]
Examples

zfs send pool/fs@monday | \
    ssh host \
    zfs receive tank/recvd/fs

zfs send -i @monday \
    pool/fs@tuesday | ssh ...
(“@monday” is the FromSnap, “pool/fs@tuesday” is the ToSnap)
Encryption

• Parameter storage (within the block pointer)
  • MAC in last 128 bits of checksum
  • Salt in first 64 bits of DVA[2]
  • First 64 bits of IV in last 64 bits of DVA[2]
  • Last 32 bits of IV in upper 32 bits of blk_fill
• Explanation
  • MAC effectively functions as a checksum
  • MAC has stronger guarantees
  • Checksum can always be calculated
  • copies=2 limitation allows usage of DVA[2]
  • Fill count will never be > 32 bits for L0 blocks

[Diagram: the block pointer with DVA[0], DVA[1], DVA[2] (salt + IV), properties, padding, physical birth txg, birth txg, IV / fill count, checksum, and MAC fields highlighted]
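From the administrator's side, encryption is a dataset property set at creation time; a hedged example (dataset name, cipher, and key format are illustrative):

# zfs create -o encryption=aes-256-gcm -o keyformat=passphrase tank/secure
# zfs get encryption,keystatus tank/secure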
On-disk Dedup
● Block-level deduplication (example below)
  ● Leverages variable blocksizes
● Synchronous (aka in-line)
● Scalable
  ● No capacity limits, reference counts, etc.
● Works with existing features
  ● Compression amplifies data reduction ratios
● Automatic self-healing
  ● Eliminates redundancy concerns with automatic-ditto data protection
● Performance
  ● Giant on-disk hash table
  ● Need to access it for every (dedup) write and free
  ● You really want this to fit in RAM
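A hedged example of turning it on and inspecting the dedup table (names are illustrative):

# zfs set dedup=on tank/fs
# zpool status -D tank
(prints the DDT histogram and its on-disk / in-core size)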
Synchronous Writes
[Diagram: write() calls accumulate in the ZFS ARC (memory); fsync() forces the pending data out to the storage pool]

The fsync(3C) call won't return until all previous writes to this file have been flushed out to disk.
Synchronous Writes w/log device
[Diagram: the same write()/fsync() path, but synchronous data goes to a dedicated Log device instead of the main pool disks]

A separate ZIL (ZFS Intent Log) device:
- eliminates I/O contention with the rest of the pool
- reduces latency (write-mostly – read at pool import)
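Adding a dedicated log device uses the same zpool add syntax as growing the pool; a hedged example (device names are illustrative; mirroring the log is optional but common):

# zpool add tank log mirror nvme0n1 nvme1n1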
Device removal

[Three diagram slides illustrating device removal]
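In current OpenZFS releases a top-level vdev can be evacuated and removed; a hedged example (pool and vdev names are illustrative):

# zpool remove tank mirror-1
# zpool status tank
(reports the evacuation progress and, afterwards, the remapped space)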


Other Cool Features

• Pool Checkpoint
• Resumable send/recv
• Allocation Throttle
• Fast Clone deletion
• Channel Programs
• ZIL enhancements
• Device Removal
• RAID-Z Expansion
• Sequential Scrub
• Encryption
• Device initialization
• Redacted send/recv
• Compressed L2ARC
• Persistent L2ARC
• DRAID
• Compressed send/recv
• Allocation Classes
• Bookmarks
• Large block support
• New checksums
• I/O scheduler
• ABD support
• Trim Support
• LUN Expansion
• Enhanced pool recovery
Last 12 months:
600+ commits
100+ committers
50+ companies contributing
5 Operating Systems

Annual OpenZFS Developer Summit (x5)


2nd Annual ZFS User Conference
April 19-20, Norwalk CT
Further reading: overview
● Design of FreeBSD book - Kirk McKusick
● Read/Write code tour video - Matt Ahrens
● Overview video (slides) - Kirk McKusick
● ZFS On-disk format pdf - Tabriz Leman / Sun Micro
Specific Features
● Space allocation video (slides) - Matt Ahrens
● Replication w/ send/receive video (slides)
○ Dan Kimmel & Paul Dagnelie
● Caching with compressed ARC video (slides) - George Wilson
● Write throttle blog 1 2 3 - Adam Leventhal
● Channel programs video (slides)
○ Sara Hartse & Chris Williamson
● Encryption video (slides) - Tom Caputi
● Device initialization video (slides) - Joe Stein
● Device removal video (slides) - Alex Reece & Matt Ahrens
Community / Development
● History of ZFS features video - Matt Ahrens
● Birth of ZFS video - Jeff Bonwick
● OpenZFS founding paper - Matt Ahrens

Learn more at:


http://www.openzfs.org
© 2017 Delphix. All Rights Reserved.
