ZFS Basics
George Wilson <gwilson@delphix.com>
Matt Ahrens <mahrens@delphix.com>
● Simple administration
● Concisely express your intent
13 Years Later...
FS/Volume Model vs. Pooled Storage
Traditional Volumes
● Abstraction: virtual disk
● Partition/volume for each FS
● Grow/shrink by hand
● Each FS has limited bandwidth
● Storage is fragmented, stranded

ZFS Pooled Storage
● Abstraction: malloc/free
● No partitions to manage
● Grow/shrink automatically
● All bandwidth always available
● All storage in the pool is shared
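A minimal sketch of the pooled model, assuming a pool named "tank" already exists and using hypothetical filesystem and device names: every filesystem draws on the same free space, and the pool grows online.

    zfs create tank/home            # no partition: space comes from the pool
    zfs create tank/projects        # shares the same free space
    zpool add tank mirror sdc sdd   # grow the pool online
    zfs list                        # both filesystems see the new space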
[Diagram: FS/volume model vs. ZFS. Traditionally, a filesystem (e.g. UFS, ext3) speaks a block-device interface to a volume. In ZFS, the ZPL (ZFS POSIX Layer) and ZVOL (ZFS Volume) present file and block interfaces on top of the DMU (Data Management Unit), which provides atomic transactions on objects; the layer below it allocates+writes, reads, and frees blocks.]
[Diagram: block pointer layout. Up to three DVAs, each a (vdev, offset, ASIZE) triple: a second copy of the data is kept for metadata and a third copy for pool-wide metadata. Remaining fields: flags (B, D, X, E), level, type, checksum and compression functions, PSIZE/LSIZE, padding, fill count, and a 256-bit checksum of the data this block points to.]
End-to-End Data Integrity in ZFS
Disk Block Checksums
● Checksum stored with data block
● Any self-consistent block will pass
● Can't detect stray writes
● Inherent FS/volume interface limitation
Disk checksum only validates media:
✓ Bit rot
✗ Phantom writes
✗ Misdirected reads and writes
✗ DMA parity errors
✗ Driver bugs
✗ Accidental overwrite

ZFS Data Authentication
● Checksum stored in parent block pointer
● Fault isolation between data and checksum
● Checksum hierarchy forms a self-validating Merkle tree
ZFS validates the entire I/O path:
✓ Bit rot
✓ Phantom writes
✓ Misdirected reads and writes
✓ DMA parity errors
✓ Driver bugs
✓ Accidental overwrite

[Diagram: each parent block pointer stores the address and checksum of its child blocks, so the tree of checksums validates itself from the root down.]
Self-Healing Data in ZFS
1. Application issues a read. Checksum reveals that the block is corrupt on disk.
2. ZFS tries the next disk. Checksum indicates that the block is good.
3. ZFS returns good data to the application and repairs the damaged block.
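A minimal sketch of exercising this from the shell (pool name assumed): a scrub reads every block in the pool, verifying checksums and repairing bad copies from good replicas.

    zpool scrub tank       # verify every block against its checksum
    zpool status -v tank   # per-device READ/WRITE/CKSUM error counts
    zpool clear tank       # reset the error counters after repair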
Simple Administration
● Online everything
● Create a pool named "tank" (see the sketch below)
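A minimal sketch, with hypothetical device names: one command creates the pool and mounts its root filesystem; everything stays online from then on.

    zpool create tank mirror sda sdb   # pool backed by a two-way mirror
    zfs create tank/home               # new filesystem, automatically
                                       # mounted at /tank/home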
ZFS Scalability
● Immense capacity (128-bit)
● Moore's Law: need 65th bit in 10-15 years
● ZFS capacity: 256 quadrillion ZB (1ZB = 1 billion TB)
● Exceeds quantum limit of Earth-based storage
● Seth Lloyd, "Ultimate physical limits to computation," Nature 406, 1047–1054 (2000)
● 100% dynamic metadata
● No limits on files, directory entries, etc.
● No wacky knobs (e.g. inodes/cg)
● Concurrent everything
● Byte-range locking: parallel read/write without violating POSIX
● Parallel, constant-time directory operations
Built-in Compression
● Block-level compression in SPA
● Transparent to all other layers
● Each block compressed independently
● All-zero blocks converted into file holes
● Per-object granularity
● A 37k file consumes 37k – no wasted space
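A minimal sketch of enabling it per dataset (dataset name hypothetical; lz4 is one of several supported algorithms):

    zfs set compression=lz4 tank/home   # new writes compressed block by block
    zfs get compressratio tank/home     # observed compression ratio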
ARC
● Primary consumer of RAM
● Caches filesystem blocks
● Compressed blocks cached compressed, and vice-versa
● Compressed blocks leverage the on-disk checksum
● LRU caches are sized based on workload
[Diagram: ZFS I/O fills the ARC with compressed blocks; a decompress step sits on the read path, and a cache miss goes to disk.]
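A hedged sketch of observing and steering the ARC (dataset name hypothetical):

    arcstat 1                              # sample ARC size and hit rate each second
    zfs set primarycache=metadata tank/db  # cache only metadata for this dataset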
RAID-Z (vs. traditional RAID-4/5/6)
● Always consistent!
● Each block has its own
parity
● Odd-size blocks use
slightly more space
● Single-block reads
access all disks :-(
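A minimal sketch of creating a RAID-Z pool (hypothetical device names); raidz2 keeps double parity, so any two disks may fail:

    zpool create tank raidz2 sda sdb sdc sdd sde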
ZFS Snapshots
● How do we create a snapshot?
● Save the root block
● When a block is removed, can we free it?
● Use the BP's birth time
● If birth > prevsnap, free it
[Diagram: block trees with birth times. Snapshots were taken at TXG 19 and TXG 25; the live tree's root was born at TXG 37. A block born after the most recent snapshot (birth > 25) is freed when removed from the live tree; blocks born at TXG 25 or earlier are still referenced by a snapshot.]
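A minimal sketch of the commands (dataset and snapshot names hypothetical):

    zfs snapshot tank/home@monday   # constant-time: save the root block
    zfs list -t snapshot            # snapshots and the space they hold
    zfs rollback tank/home@monday   # revert the live tree to the snapshot
    zfs destroy tank/home@monday    # frees only blocks unique to this snapshot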
[Diagram: block pointer layout again, highlighting birth times. Three DVAs, each (vdev, offset, ASIZE): the first copy of the data, a second copy for metadata, a third for pool-wide metadata. Alongside the flags, level, type, checksum/compression functions, PSIZE/LSIZE, padding, fill count, and the 256-bit checksum of the data this block points to, the pointer records when the block was written: a physical birth TXG and a logical birth TXG.]
Send and Receive
• “zfs send”
– serializes the contents of snapshots
– creates a “send stream” on stdout
• “zfs receive”
– recreates a snapshot from its serialized form
• Incrementals between snapshots
• Remote Replication
– Disaster recovery / Failover
– Data distribution
• Backup
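A hedged sketch (pool, dataset, and host names hypothetical) of a full send followed by an incremental:

    # Full send: recreate tank/home@monday on a backup host
    zfs send tank/home@monday | ssh backup zfs receive pool/home

    # Incremental: only blocks changed between the two snapshots
    zfs send -i @monday tank/home@tuesday | ssh backup zfs receive pool/home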
How send/recv works: Design Principles
• Locate changed blocks via block birth time
– Read minimum number of blocks
• Prefetching issues I/Os in parallel
– Uses full bandwidth & IOPS of all disks
• Unidirectional
– Insensitive to network latency
– Resumable (consumer provides token)
• DMU consumer
– Insensitive to ZPL complexity
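As a sketch of the resumable property (names hypothetical): if an interrupted receive was started with -s, the consumer fetches a token from the receiver and hands it back to zfs send:

    zfs send tank/home@tuesday | ssh backup zfs receive -s pool/home
    ssh backup zfs get -H -o value receive_resume_token pool/home
    zfs send -t <token> | ssh backup zfs receive -s pool/home   # resume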
Design: locating incremental changes
[Diagram: ToSnap's block tree, whose leaf blocks spell "hello_AsiaBSDcon", labeled with birth TXGs. FromSnap's birth time is TXG 5. A block pointer born in TXG 3 tells us that everything below it was born in TXG 3 or earlier, so none of those (greyed-out) blocks need to be examined; only subtrees whose birth TXG exceeds 5 are traversed.]
Encryption
[Diagram: block pointer changes for encryption: the DVA[2] slot is repurposed, and a MAC is stored in the block pointer.]
• Fill count will never be > 32 bits for L0 blocks
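Native encryption builds on these block-pointer changes; as a hedged sketch of the user-facing commands (dataset name hypothetical):

    zfs create -o encryption=on -o keyformat=passphrase tank/secret
    zfs unmount tank/secret && zfs unload-key tank/secret   # unreadable without key
    zfs load-key tank/secret && zfs mount tank/secret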
On-disk Dedup
● Block-level deduplication
● Leverages variable blocksizes
● Performance
● Giant on-disk hash table
● Need to access for every (dedup) write and free
● You really want this to fit in RAM
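A hedged sketch (pool and dataset names hypothetical): zdb can estimate the payoff before committing, since the on-disk table is expensive:

    zdb -S tank              # simulate: report the dedup ratio this pool would get
    zfs set dedup=on tank/vms   # dedup new writes on this dataset
    zpool status -D tank        # size and histogram of the dedup table (DDT)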
Synchronous Writes
[Diagram: write() calls land in the ZFS ARC in memory; fsync() forces the buffered data to stable storage in the pool before returning.]
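A hedged sketch of the knobs on this path (device and dataset names hypothetical): a dedicated log device absorbs synchronous writes, and the sync property overrides application behavior.

    zpool add tank log nvme0n1       # dedicated log device for synchronous writes
    zfs set sync=always tank/db      # treat every write as synchronous
    zfs set sync=standard tank/db    # default: honor only fsync()/O_SYNC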