BTREE FILE SYSTEM (Btrfs) is a next-generation file system for Linux. It is inspired by ZFS from Solaris and is built on copy-on-write (COW) B-trees (Rodeh's B-trees). It offers many features compared to the extended (ext) series of file systems.
A file system can be defined in different ways:
- A method of organizing blocks on a storage device into files and directories.
- A data structure that translates the physical view of a disk into a logical structure.
- The way in which files are named, and where they are placed logically for storage and retrieval.

NEED FOR DIFFERENT FILE SYSTEMS
The choice of file system depends on:
- Requirements of the OS
- Required level of security and efficiency
- Minimizing access time (e.g. CDFS)
- Exploiting device features (e.g. flash file systems: no seek time, avoiding continuous writes, block-wise erasing)

FILE SYSTEM TERMINOLOGY
- Inode: a data structure holding information about a file in a Unix file system.
- Block: a uniformly sized unit of data storage for a file system.
- Extent: a run of contiguous physical blocks.
- Superblock: a block that describes the entire file system.
- Directory: a file that lists the files and directories contained within it, along with their associated inodes.
- Journaling: logging of file system transactions.

WHAT IS BTRFS?
A next-generation
file system for Linux, also called "Butter FS" or "Better FS". It is motivated by ZFS from Solaris and supports copy on write. Unlike ZFS, which is licensed under the CDDL (Common Development and Distribution License), Btrfs is GPL-licensed. Only test versions have been released (merged in Linux 2.6.29). It offers many features compared to the extended (ext) series.

DATA AND METADATA STRUCTURE
Btrfs uses COW (copy-on-write) B-trees (Rodeh's B-trees). Advantages of COW B-trees:
- Increases in the overall depth are infrequent
- A minimal number of rearrangements
- Search can be done in O(log2 N)
- The entire tree need not be in memory
- The COW facility speeds up operations

BTRFS DATA STRUCTURES
Internally, Btrfs knows about only three data structures:
- Block header: btrfs_header
- Key: btrfs_key
- Item: btrfs_item

btrfs_header
The block header contains:
- a checksum for the block contents
- the UUID of the file system
- the level of the block in the tree
- the block number where this block is supposed to live

struct btrfs_header {
    u8 csum[32];
    u8 fsid[16];
    __le64 blocknr;
    __le64 generation;
    __le64 owner;
    __le16 nritems;
    __le16 flags;
    u8 level;
};

btrfs_key
A key contains a unique object id, analogous to an inode number in the ext series. The object id forms the most significant bits of the key, which groups together all information associated with a particular object id.

struct btrfs_key {
    u64 objectid;
    u32 type;
    u64 offset;
};

The offset field of the key indicates the byte offset of a particular item; an inode has an offset value of 0. The type field gives the type of the item, which can be an inode, file data, etc.

btrfs_item
- key: a key describing the item
- offset: the offset of the item's data
- size: the size of the item's data
(The design has undergone massive changes; fields may have changed.)

struct btrfs_item {
    struct btrfs_key key;
    __le32 offset;
    __le16 size;
};

[Figure: example of initialized data structures inside the file system tree.]
BTRFS NONLEAF NODES
Non-leaf nodes contain [key, block header] pairs. The key tells you where to look for the item you want; the header tells where the next node or leaf in the B-tree is located on disk, given by the blocknr field of the header.

BTRFS LEAF NODES
Leaf nodes are broken up into two sections, item headers and data, which grow toward each other. Data is variably sized; the offset and size fields give the location of the data associated with an item.

FILES
Inline extents are used for small files. Large files are stored in extents: [disk block, number of disk blocks] pairs that record the area of disk holding the file. Extents store a logical offset, which allows writes into the middle of an extent.

[Figure: writing 1 MB of new data into the middle of a 64 MB extent; the rest of the old data is untouched.]

DIRECTORY
Directories are indexed in two ways. The first uses hash keys for filename lookup; the directory item key is composed from a hash of the filename. The second index is used by the readdir function to retrieve information in inode order.

EXTENT BLOCK GROUPS
The disk is divided up into chunks of 256 MB or more. For each chunk, information about the number of blocks available is recorded, and flags indicate whether the extent is for data or metadata. At creation time the disk is broken up into alternating chunks of roughly 33% for metadata and 66% for data.

[Figure: organization of files, inodes, directories, pointers, etc. in the ext file systems vs. Btrfs.]

METADATA STRUCTURE OF BTRFS
Btrfs uses six types of COW trees to keep track of data and metadata. Items are sorted using a 136-bit key value formed from three components: the most significant 64 bits are the object id, the next 8 bits come from the type field, and the remaining 64 bits are for the object's own use.

FILE SYSTEM TREE
User-visible files and directories live in the file system tree. Files and directories have a back-reference to their parent. There is one file system tree per subvolume. Small files are kept in an inline extent data item; otherwise the data is kept outside the tree in extents, and an extent data item in the tree is used to track them.

EXTENT ALLOCATION TREE
Used to track space usage by extents. An in-memory tree of page-sized bitmaps speeds up allocations. Items in the extent allocation tree do not have object ids. Extent items contain a back-reference to the tree node or file occupying that extent. The extent tree is itself COW, so writes may result in changes to the extent tree.

CHECKSUM TREE
Checksums for data and metadata are calculated and stored in the checksum tree as checksum items. There is one checksum item per contiguous run of allocated blocks. CRC-32C checksumming is used (polynomial x^32 + x^28 + x^27 + x^26 + x^25 + x^23 + x^22 + x^20 + x^19 + x^18 + x^14 + x^13 + x^11 + x^10 + x^9 + x^8 + x^6 + 1).

LOG TREE
A log tree is created for each subvolume. It is used to prevent the redundant I/O operations that heavy workloads generate through
fsync(). The log tree journals fsync-initiated copy-on-writes. Log tree items are replayed and deleted at the next full tree commit, or at remount if the file system crashes.

CHUNKS AND DEVICE TREES
Chunks may be mirrored or striped across multiple devices, but the file system sees only the logical address space that chunks are mapped into. The mapping is stored as device and chunk map items in the chunk tree. The device tree is the inverse of the chunk tree: it maps byte ranges of block devices back to individual chunks. The chunks containing the chunk tree, the root tree, the device tree, and the extent tree are always mirrored.

ROOT TREE
Records the root block of the extent tree, and the root blocks and names of each subvolume and snapshot tree. At transaction commit the root block pointers are updated. The root tree assigns them object ids, which can be used for back-referencing.

KEY FEATURES
The key features of Btrfs are:
- Dynamic inode allocation
- Writable snapshots and subvolumes
- Copy on write
- Compression
- Online resizing
- Online file system check and defragmentation
- In-place ext3 conversion
- Multiple device support
- Space-efficient packing of small files

DYNAMIC INODE ALLOCATION
Inodes are created only when they are needed. The ext series uses static inode tables, whose only advantage is corruption detection. A static inode table wastes space, slows down fsck, and requires inode blocks to be at fixed locations. With dynamic allocation an inode can be kept close to its file, and corruption detection can be implemented by checksumming inode blocks.

WRITABLE SNAPSHOTS AND SUBVOLUMES
Btrfs allows the creation of writable snapshots, which can be used for backups and for testing transactions. Occupied space increases only when data is actually written to a snapshot. Subvolumes are named B-trees; they have inodes inside the tree of tree roots. Snapshots are subvolumes whose root is shared with another. They can be mounted as a unique root to prevent access.

COPY ON WRITE
Copy on write is an optimization strategy: each process requesting a resource is given a pointer to the same resource.
It is implemented by marking the resource read-only, so that any modification generates a kernel interrupt. When performing a COW operation on a B-tree node, the reference counts of all nodes it points to are incremented. When the transaction commits, the new tree pointer is inserted into the root tree; progress records are kept to track the transaction.

COMPRESSION
Btrfs uses zlib compression to improve performance. A few bits of the type field can be used to indicate compression. When compression is used, the uncompressed size is also stored. If compression fails for some reason, the file is flagged to avoid further compression attempts. Background compression can be done when the file system is idle, in case compression is slow.

ONLINE RESIZING
Allows the file system to be resized while it is mounted. It is implemented in btrfs-progs, a tool set similar to e2fsprogs. Online resizing is still under development.

ONLINE FSCK AND DEFRAGMENTATION
btrfsck can be run while the file system is mounted; in the ext series only ext4 allows that. With ext it is difficult to get the file system back online after something has gone wrong. Btrfs aims to be tolerant of invalid metadata, and an offline btrfsck is provided to ensure safe mounting. btrfsck verifies extent allocation maps and reference counts. The cost paid for this efficiency is more RAM.
It requires roughly 3x the RAM of e2fsck.

IN-PLACE EXT3 CONVERSION
The features that make conversion possible are:
- Btrfs metadata need not be in fixed locations
- COW allows an unmodified copy to be maintained, which allows undoing the conversion
The libext2fs library is used to read the ext3 metadata. The conversion uses the free blocks in the ext3 file system to hold the new Btrfs metadata, and the Btrfs files point to the same blocks used by the ext3 files.

MULTIPLE DEVICE SUPPORT
With current file systems, creating a RAID or adding a new RAID level requires using LVM to create a new volume and then a hardware RAID card or software RAID to combine volumes. Btrfs can do RAID as part of the file system. By default, metadata is mirrored across all devices and data is striped across all available devices. Currently metadata can be laid out in the following ways:
- RAID-0: metadata is striped across all devices present
- RAID-1: metadata is mirrored across all devices present
- RAID-10: metadata is striped and mirrored across all devices present
- SINGLE: metadata is kept on a single device

SPACE-EFFICIENT PACKING OF SMALL FILES
Small files are packed inside inline extents. If compression is used, data that is too large to keep in an inline extent when uncompressed, but small enough when compressed, can be placed in an inline extent.