
BTREE FILE SYSTEM (BTRFS)

What is a file system?

It can be defined in different ways:
A method of organizing blocks on a storage device into files and directories.
A data structure that translates the physical view of a disk into a logical structure.
The way in which files are named and where they are placed logically for storage and retrieval.
NEED FOR DIFFERENT FILESYSTEMS
This depends on
Requirements of the OS
Required level of security and efficiency
Minimizing access time (e.g. cdfs)
Exploiting device features (e.g. a flash filesystem):
No seek time
Avoid continuous writes
Block-wise erasing
FILE SYSTEM TERMINOLOGY
Inode: a data structure holding information about a file in a Unix file system.
Blocks: uniformly sized units of data storage for a filesystem.
Extents: runs of contiguous physical blocks.
FILE SYSTEM TERMINOLOGY contd
Super block: the block that describes the entire file system.
Directory: a file that lists the files and directories contained within it, and their associated inodes.
Journaling: logging of filesystem transactions.
WHAT IS BTRFS?
Next-generation file system for Linux
Also called BUTTER FS or BETTER FS
Inspired by ZFS on Solaris
Supports copy on write
Unlike ZFS, which is licensed under the CDDL (Common Development and Distribution License), Btrfs is GPL licensed
Only test versions have been released (accepted into Linux 2.6.29)
It has many more features than the ext series
DATA AND METADATA STRUCTURE
Uses COW (copy on write) trees, i.e. Rodeh's B-trees
Advantages of COW B-trees
The overall depth increases only infrequently
Minimal rearrangement of nodes
Search can be done in O(log N)
The entire tree need not be in memory
The COW facility speeds up operations
BTRFS DATA STRUCTURES
Btrfs internally only knows about three data
structures
Block header - btrfs_header
Key - btrfs_key
Item - btrfs_item
btrfs_header
The block header contains
checksum for the block contents
uuid of the filesystem
level of the block in the tree
block number where this block is supposed to live
struct btrfs_header
{
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 generation;
    __le64 owner;
    __le16 nritems;
    __le16 flags;
    u8 level;
};
btrfs_key
Key contains
A unique object id, analogous to the inode number in the ext series
The object id occupies the most significant bits of the key, which groups together all items associated with a particular object id
struct btrfs_key
{
    u64 objectid;
    u32 type;
    u64 offset;
};
btrfs_key contd
The offset field of the key indicates the byte offset for a particular item. An inode item has offset value 0.
The type field gives the type of the item: inode, file data, etc.
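Because the object id is the most significant part of the key, sorting by key keeps all items of one object next to each other. A minimal sketch of such an ordering follows, mirroring the field widths of the struct shown above. The names `demo_key` and `demo_key_cmp`, and the type values used in any example calls, are illustrative assumptions and not Btrfs's actual identifiers.

```c
#include <stdint.h>

/* Hypothetical CPU-side copy of the key from the slide. */
struct demo_key {
    uint64_t objectid;
    uint32_t type;
    uint64_t offset;
};

/* Returns <0, 0 or >0, like strcmp().
 * objectid is compared first, then type, then offset, so every item
 * belonging to one object sorts into one contiguous run. */
static int demo_key_cmp(const struct demo_key *a, const struct demo_key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}
```

With this ordering, an inode item (offset 0) for a given object id sorts before a file data item of the same object, and both sort before any item of the next object id.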
btrfs_item
key: the key describing the item
offset: the offset of the item's data
size: the size of the item's data
* The design has undergone massive changes. Fields may have been changed.
struct btrfs_item
{
    struct btrfs_key key;
    __le32 offset;
    __le16 size;
};
[Figure: example of initialized data structures inside the filesystem tree]
BTRFS NONLEAF NODES
Contain [ key, block header ] pairs
The key tells you where to look for the item you want
The header tells you where the next node or leaf in the btree is located on disk, given by the blocknr field of the header
BTRFS LEAF NODES
Broken up into two sections, item headers and data, that grow towards each other
Data is variably sized
The offset and size fields give the location of the data associated with an item
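The leaf layout described above can be sketched as a single buffer in which fixed-size item headers are appended from the front and variable-size data is placed from the back. This is only an illustration: the 4096-byte leaf size, the simplified header (an object id instead of a full key) and all `demo_` names are assumptions, not the on-disk format.

```c
#include <stdint.h>
#include <string.h>

#define DEMO_LEAF_SIZE 4096u       /* assumed leaf size */

struct demo_item {                 /* simplified fixed-size item header */
    uint64_t objectid;
    uint32_t offset;               /* where the item's data starts in the leaf */
    uint32_t size;                 /* length of the item's data in bytes */
};

struct demo_leaf {
    uint32_t nritems;              /* headers occupy the front of bytes[] */
    uint32_t data_start;           /* lowest byte currently used by data */
    uint8_t  bytes[DEMO_LEAF_SIZE];
};

static void demo_leaf_init(struct demo_leaf *leaf)
{
    leaf->nritems = 0;
    leaf->data_start = DEMO_LEAF_SIZE;
}

/* Free space is the gap between the end of the headers and the data. */
static uint32_t demo_leaf_free_space(const struct demo_leaf *leaf)
{
    uint32_t headers_end = leaf->nritems * (uint32_t)sizeof(struct demo_item);
    return leaf->data_start - headers_end;
}

/* Adds one item; returns 0 on success, -1 if the two regions would collide. */
static int demo_leaf_add(struct demo_leaf *leaf, uint64_t objectid,
                         const void *data, uint32_t size)
{
    struct demo_item item;

    if (size > DEMO_LEAF_SIZE ||
        (uint32_t)sizeof(item) + size > demo_leaf_free_space(leaf))
        return -1;

    leaf->data_start -= size;                       /* data grows downwards */
    memcpy(leaf->bytes + leaf->data_start, data, size);

    item.objectid = objectid;
    item.offset = leaf->data_start;
    item.size = size;
    memcpy(leaf->bytes + leaf->nritems * sizeof(item), &item, sizeof(item));
    leaf->nritems++;                                /* headers grow upwards */
    return 0;
}
```

Keeping the headers fixed-size and contiguous is what would let a search step through keys without touching the variably sized data.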
FILES
Inline extents are used for small files
Large files are stored in extents; [ disk block, disk num blocks ] pairs record the area of disk used for the file
Extents store the logical offset, which allows writes in the middle of an extent
[Figure: a write into the middle of a 64 MB extent. Old data | 1 MB new data | rest of old data]
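A write into the middle of an extent can be pictured as replacing one extent record with up to three: the untouched front of the old extent, the newly written extent, and the untouched tail of the old extent at an adjusted offset. The sketch below illustrates that bookkeeping only. The record layout and every `demo_` name are assumptions made for this example, not the real Btrfs item format.

```c
#include <stdint.h>

/* Hypothetical file extent record: a range of the file backed by part of
 * an on-disk extent. */
struct demo_file_extent {
    uint64_t file_offset;    /* logical offset within the file */
    uint64_t disk_start;     /* where the on-disk extent begins */
    uint64_t extent_offset;  /* offset into that on-disk extent */
    uint64_t length;         /* bytes covered by this record */
};

/* Splits `old` around a write of wr_len bytes at file offset wr_off whose
 * data was placed at new_disk_start. Writes up to three records to out[]
 * and returns how many; returns -1 if the write is not inside `old`. */
static int demo_split_extent(const struct demo_file_extent *old,
                             uint64_t wr_off, uint64_t wr_len,
                             uint64_t new_disk_start,
                             struct demo_file_extent out[3])
{
    uint64_t old_end = old->file_offset + old->length;
    uint64_t wr_end = wr_off + wr_len;
    int n = 0;

    if (wr_len == 0 || wr_off < old->file_offset || wr_end > old_end)
        return -1;

    if (wr_off > old->file_offset) {          /* front part: old data */
        out[n] = *old;
        out[n].length = wr_off - old->file_offset;
        n++;
    }

    out[n].file_offset = wr_off;              /* middle part: new data */
    out[n].disk_start = new_disk_start;
    out[n].extent_offset = 0;
    out[n].length = wr_len;
    n++;

    if (wr_end < old_end) {                   /* tail part: old data */
        out[n] = *old;
        out[n].file_offset = wr_end;
        out[n].extent_offset = old->extent_offset + (wr_end - old->file_offset);
        out[n].length = old_end - wr_end;
        n++;
    }
    return n;
}
```

For a 1 MB write at the 16 MB mark of a 64 MB extent, this yields a 16 MB front record, a 1 MB new record, and a 47 MB tail record that still points into the original on-disk extent.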
DIRECTORY
Directories are indexed in two ways
The first index uses hash keys for filename lookup; the directory item key is composed as follows: [key layout not shown]
The second index is used by the readdir function to retrieve information in inode order
Extent Block Groups
The disk is broken up into chunks of 256MB or more
For each chunk, information about the number of blocks available is recorded
Flags indicate whether the extent is for data or metadata
At creation time the disk is broken up into alternating regions, 33% for metadata and 66% for data
[Figure: Btrfs organization of files, inodes, directories, pointers, etc. Panels: Ext file systems; Btrfs file systems]
Metadata Structure Of Btrfs
Btrfs uses 6 types of COW trees to keep track of data and metadata
They are sorted using a 136-bit key value
The key is formed from three components
The most significant 64 bits are the object id
The next 8 bits are the type field
The remaining 64 bits are for the object's own use
FILE SYSTEM TREE
User-visible files and directories live in the file system tree
Files and directories have a back-reference to their parent
There is one file system tree per subvolume
Small files are kept in an inline extent data item
Otherwise data is kept outside the tree in extents, and extent data items in the tree are used to track them
EXTENT ALLOCATION TREE
Used to track space usage by extents
An in-memory tree of page-sized bitmaps speeds up allocations
Items in the extent allocation tree do not have object ids
Extent items contain a back-reference to the tree node or file occupying that extent
The extent tree is itself COW, so writes may result in changes to the extent tree
CHECKSUM TREE
Checksums for data and metadata are calculated and stored in the checksum tree as checksum items
There is one checksum item per contiguous run of allocated blocks
CRC-32C checksumming is used
CRC-32C polynomial: $x^{32} + x^{28} + x^{27} + x^{26} + x^{25} + x^{23} + x^{22} + x^{20} + x^{19} + x^{18} + x^{14} + x^{13} + x^{11} + x^{10} + x^{9} + x^{8} + x^{6} + 1$
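To make the polynomial concrete, here is a minimal bit-at-a-time CRC-32C sketch. It uses the bit-reflected form of the polynomial above (0x1EDC6F41 reversed is 0x82F63B78) together with the common convention of an all-ones initial value and a final inversion. Whether Btrfs applies exactly these initial and final values on disk is not stated in the slides, so treat that part as an assumption; the function name `demo_crc32c` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-reflected form of the CRC-32C (Castagnoli) polynomial. */
#define DEMO_CRC32C_POLY_REFLECTED 0x82F63B78u

/* Slow but simple: processes one bit per inner-loop iteration. */
static uint32_t demo_crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;            /* assumed initial value */

    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* mask is all ones when the low bit is set, else zero */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (DEMO_CRC32C_POLY_REFLECTED & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;              /* assumed final inversion */
}
```

With these conventions the standard check string "123456789" gives 0xE3069283. Production code would normally use a table-driven or hardware-accelerated variant instead of this bitwise loop.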
LOG TREE
A log tree is created for each subvolume
Used to avoid the redundant I/O that fsync() would otherwise cause under heavy workloads
The log tree journals fsync-initiated copy on writes
Log tree items are replayed and deleted at the next full tree commit, or at remount if the file system crashes
CHUNKS AND DEVICE TREES
Chunks may be mirrored or striped across multiple devices
But the file system sees only the logical address space that chunks are mapped into
The mapping is stored as device and chunk map items in the chunk tree
The device tree is the inverse of the chunk tree: it maps byte ranges of block devices back to individual chunks
The chunks containing the chunk tree, the root tree, the device tree and the extent tree are always mirrored
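The logical-to-physical translation that the chunk tree provides can be sketched as a lookup over a list of mappings. This is a simplified illustration for unmirrored, unstriped chunks only; the `demo_chunk_map` record and `demo_map_logical` function are assumptions for this example, not the real chunk item format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk map entry: a logical range backed by one device. */
struct demo_chunk_map {
    uint64_t logical_start;   /* start of the range in logical space */
    uint64_t length;          /* size of the chunk in bytes */
    uint32_t devid;           /* device holding the chunk */
    uint64_t physical_start;  /* byte offset of the chunk on that device */
};

/* Translates a logical address; returns 0 and fills *devid and *physical
 * on success, or -1 if no chunk covers the address. */
static int demo_map_logical(const struct demo_chunk_map *maps, size_t n,
                            uint64_t logical,
                            uint32_t *devid, uint64_t *physical)
{
    for (size_t i = 0; i < n; i++) {
        const struct demo_chunk_map *m = &maps[i];

        if (logical >= m->logical_start &&
            logical - m->logical_start < m->length) {
            *devid = m->devid;
            *physical = m->physical_start + (logical - m->logical_start);
            return 0;
        }
    }
    return -1;
}
```

The device tree described above would answer the opposite question: given a device and a physical byte range, which chunk owns it.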
ROOT TREE
Records the root block for the extent tree, and the root blocks and names for each subvolume and snapshot tree
At transaction commit the root block pointers are updated
The root tree assigns each of them an object id
This object id can be used for back-referencing
KEY FEATURES
Key features of btrfs are
Dynamic Inode Allocation
Writable Snapshots And Subvolumes
Copy On Write
Compression
Online Resizing
Online File System Check And Defragmentation
Inplace Ext3 Conversion
Multiple Device Support
Space Efficient Packing Of Small Files
DYNAMIC INODE ALLOCATION
Inodes are created only when they are needed
The ext series uses static inode tables, whose only advantage is corruption detection
A static inode table wastes space, slows down fsck, and needs inode blocks to be at fixed locations
With dynamic allocation an inode can be kept close to its file
Corruption detection can be implemented by checksumming inode blocks
WRITABLE SNAPSHOT AND SUBVOLUME
Allows creation of writable snapshots
They can be used for backups and for testing transactions
Occupied space increases only when data is written to the snapshot
Subvolumes are named btrees
They have inodes inside the tree of tree roots
Snapshots are subvolumes whose root is shared with another subvolume
They can be mounted as a unique root to prevent access to anything outside them
COPY ON WRITE
An optimization strategy
Each process requesting a resource is given a pointer to the same resource
Implemented by marking the resource as read-only, so that any modification traps into the kernel
When performing a COW operation on a btree node, the reference count of every node it points to is incremented
When a transaction commits, a new tree pointer is inserted into the root tree, which keeps progress records to keep track of the transaction
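The reference-counting step can be illustrated with a deliberately tiny two-level tree: updating one leaf produces a new root and a new leaf, while the remaining leaves are shared between the old and new roots and have their reference counts incremented. This is a toy model under stated assumptions (fixed fan-out, in-memory nodes, all `demo_` names hypothetical), not the Btrfs implementation.

```c
#include <stdlib.h>

#define DEMO_FANOUT 4

struct demo_leaf_node {
    int refs;    /* how many roots point at this leaf */
    int value;
};

struct demo_root {
    struct demo_leaf_node *child[DEMO_FANOUT];
};

/* Returns a new root in which child[idx] holds new_value.
 * The old root is left untouched. Unchanged leaves are shared and their
 * reference counts are incremented. Returns NULL on a bad index or on
 * allocation failure. */
static struct demo_root *demo_cow_update(const struct demo_root *old,
                                         int idx, int new_value)
{
    struct demo_root *new_root;
    struct demo_leaf_node *new_leaf;

    if (idx < 0 || idx >= DEMO_FANOUT)
        return NULL;

    new_root = malloc(sizeof(*new_root));
    new_leaf = malloc(sizeof(*new_leaf));
    if (!new_root || !new_leaf) {
        free(new_root);
        free(new_leaf);
        return NULL;
    }

    new_leaf->refs = 1;
    new_leaf->value = new_value;

    for (int i = 0; i < DEMO_FANOUT; i++) {
        if (i == idx) {
            new_root->child[i] = new_leaf;       /* private copy */
        } else {
            new_root->child[i] = old->child[i];  /* shared with old root */
            if (old->child[i])
                old->child[i]->refs++;
        }
    }
    return new_root;
}
```

Because the old root still describes a consistent tree, keeping it around amounts to a snapshot, and a leaf can be freed only when its reference count drops to zero.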
COMPRESSION
Uses zlib compression to improve performance
A few bits of the type field could be used to indicate compression
When data is compressed, the uncompressed size is also stored
If compression fails for some reason, the data is flagged to avoid further compression requests
If compression is slow, background compression can be done when the fs is idle
ONLINE RESIZING
Allows the file system to be resized while it is mounted
It is implemented in btrfs-progs, a tool set similar to e2fsprogs
Online resizing is still under development
Online FSCK & Defragmentation
btrfsck can be run while the file system is mounted; in the ext series only ext4 allows that
In ext it is difficult to get the fs back online after something has gone wrong
Btrfs aims to be tolerant of invalid metadata
Offline btrfsck exists to ensure safe mounting
btrfsck verifies extent allocation maps and reference counts
The cost paid for this efficiency is more RAM: it requires about 3x the RAM of e2fsck
INPLACE EXT3 CONVERSION
Features that make conversion possible are
Btrfs metadata need not be in fixed locations
COW allows an unmodified copy to be maintained, which allows the conversion to be undone
libe2fs is used to read the ext3 metadata
Uses the free blocks in the Ext3 filesystem to hold the new Btrfs filesystem
Btrfs files point to the same blocks used by the Ext3 files
MULTIPLE DEVICE SUPPORT
In current filesystems, creating a RAID or adding a new RAID level requires using LVM to create a new volume and then a H/W RAID card or S/W RAID to combine volumes
But btrfs can do RAID as part of the filesystem
By default
Metadata is mirrored across all devices
Data is striped across all available devices
Currently metadata can be laid out in the following ways
RAID-0 - metadata is striped across all devices present
RAID-1 - metadata is mirrored across all devices present
RAID-10 - metadata is striped and mirrored across all devices present
SINGLE - metadata is mirrored on a single device
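Striping across devices, as in RAID-0, comes down to simple arithmetic on the byte offset. The sketch below shows the general technique; the 64 KiB stripe length used in the example calls and the `demo_` names are assumptions for illustration and are not taken from the slides.

```c
#include <stdint.h>

/* Where a byte of a striped chunk ends up. */
struct demo_stripe_loc {
    uint32_t dev_index;   /* which device, counted from 0 */
    uint64_t dev_offset;  /* byte offset within that device's share */
};

/* Maps a byte offset inside a striped chunk to a device and an offset.
 * num_devs and stripe_len must both be non-zero. */
static struct demo_stripe_loc demo_raid0_map(uint64_t offset,
                                             uint32_t num_devs,
                                             uint64_t stripe_len)
{
    uint64_t stripe_nr = offset / stripe_len;   /* global stripe number */
    struct demo_stripe_loc loc;

    /* Stripes are dealt out to the devices round-robin. */
    loc.dev_index = (uint32_t)(stripe_nr % num_devs);
    loc.dev_offset = (stripe_nr / num_devs) * stripe_len
                     + offset % stripe_len;
    return loc;
}
```

For example, with two devices and a 64 KiB stripe length, offset 65536 lands at the start of the second device, and offset 131082 lands on the first device at offset 65546. Mirroring (RAID-1) would instead write the same bytes to the same offset on every device.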
SPACE EFFICIENT PACKING OF SMALL FILES
Small files are packed inside inline extents
If compression is used, data that is too large for an inline extent but too small to justify an external extent can, once compressed, be placed in an inline extent
