
The Implementation of the Linux Buffer Cache
CSCI780: Linux Kernel Internals
Fall 2001
Joy Schoenberger

Linux Buffer Cache

Outline
We will first define the buffer cache, and what role it plays in the Linux kernel. Next we will examine the data structures used to implement the buffer cache, including buffer heads, and the lists they belong to. Then we will trace a simple call to getblk(), the main service routine for the buffer cache. We will take a look at the bdflush daemon that handles writing dirty buffers to disk when we exceed certain thresholds. Finally, we will trace a simple read and how it gets down to the buffer cache.


What is the buffer cache?

The buffer cache deals with blocks, which are groups of adjacent sectors involved in an I/O operation requested by a block device driver. The size of a block can be no larger than a page frame. Thus, for PC architecture, a block can be of size 512, 1024, 2048, or 4096 bytes. A buffer is a RAM memory area used by the kernel to store a block's contents for asynchronous reads and writes by the kernel to some block of a physical device. Thus, a buffer's size equals the size of its corresponding block. The buffer cache is the software mechanism that manages data in RAM, namely, these block device buffers.


Buffer Head
Buffer heads are the data structures used by the buffer cache to maintain information about its associated buffers. There are four types of buffers that the buffer heads need to track:

Cached - contains data stored in memory, associated with a block on a block device
Free - available for caching data
Asynchronous - buffer used for segmented page I/O operations
Unused - not yet allocated, but has a buffer head


Buffer Head Struct

In /linux/include/linux/fs.h
struct buffer_head {
    /* First cache line: */
    struct buffer_head *b_next;   /* Hash queue list */
    unsigned long b_blocknr;      /* Virtual block number */
    unsigned short b_size;        /* Block size */
    unsigned short b_list;        /* List to which this buffer belongs */
    kdev_t b_dev;                 /* Virtual device (B_FREE = free) */

These are the most commonly used fields of the buffer head, stored in a single cache line (16 bytes) to improve performance. They are used for searching block hash lists by getblk(). Every cached buffer belongs to one of three doubly-linked lists, BUF_CLEAN, BUF_DIRTY, and BUF_LOCKED, which are macros for indices that identify these lists. Data member b_list identifies to which list this buffer belongs. The block to which this buffer maps is at b_blocknr on b_dev, which are the virtual block number and virtual device (perhaps one partition of a hard drive).

Some state maintenance


    atomic_t b_count;             /* Users using this block */
    kdev_t b_rdev;                /* Real device */
    unsigned long b_state;        /* buffer state bitmap */
    unsigned long b_flushtime;    /* Time when (dirty) buffer should be written */

The second 16 bytes are used for LRU buffer scans, as used by sync_buffers() and refill_freelist(). The usage counter b_count is mainly a safety lock, since the kernel never destroys a buffer if it has a non-zero usage count. (The cached buffers are examined by the kernel, either periodically, or when memory becomes scarce.) b_state mask:
#define BH_Uptodate   0   /* 1 if the buffer contains valid data */
#define BH_Dirty      1   /* 1 if the buffer is dirty */
#define BH_Lock       2   /* 1 if the buffer is locked */
#define BH_Req        3   /* 0 if the buffer has been invalidated */
#define BH_Mapped     4   /* 1 if the buffer has a disk mapping */
#define BH_New        5   /* 1 if the buffer is new and not yet written out */
#define BH_Protected  6   /* 1 if the buffer is protected */
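Each BH_* value is a bit number within b_state, not a mask. The following is a minimal userspace sketch of how such bit numbers are set, cleared, and tested. The helpers bh_set(), bh_clear(), and bh_test() are hypothetical names used only for illustration, and they use plain shifts and masks, whereas the kernel uses atomic bit operations such as set_bit().

```c
#include <assert.h>

/* Bit numbers copied from the b_state definitions above. */
enum { BH_Uptodate, BH_Dirty, BH_Lock, BH_Req, BH_Mapped, BH_New, BH_Protected };

/* Hypothetical, non-atomic helpers for illustration only. */
static void bh_set(unsigned long *state, int bit)
{
    assert(bit >= BH_Uptodate && bit <= BH_Protected);
    *state |= 1UL << bit;
}

static void bh_clear(unsigned long *state, int bit)
{
    assert(bit >= BH_Uptodate && bit <= BH_Protected);
    *state &= ~(1UL << bit);
}

static int bh_test(unsigned long state, int bit)
{
    assert(bit >= BH_Uptodate && bit <= BH_Protected);
    return (int)((state >> bit) & 1UL);
}
```

For example, a state word initialized to 1 << BH_Mapped has only bit 4 set; calling bh_set() with BH_Dirty then sets bit 1 as well.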

Some list references


    struct buffer_head *b_next_free;   /* lru/free list linkage */
    struct buffer_head *b_prev_free;   /* doubly linked list of buffers */
    struct buffer_head *b_this_page;   /* circular list of buffers in one page */
    struct buffer_head *b_reqnext;     /* request queue */

The buffer cache maintains seven lists of free buffers (one for each size: 512, 1024, ... 32768), one list of unused buffers, and three lists of cached buffers. All of these lists are maintained by data members b_next_free and b_prev_free. The variables pointing to the heads of these lists distinguish one from another. When the buffer cache is used to transfer whole pages to or from a block device, the asynchronous buffers that make up that page are maintained in a circular linked list. These buffer heads, however, are discarded as soon as the I/O operation completes, to be referenced as a whole by its page descriptor. The b_this_page member points to this circular list. The request queue, maintained by b_reqnext, tracks the progress of such a block-segmented page I/O.


The rest of the buffer head...

    struct buffer_head **b_pprev;      /* doubly linked list of hash-queue */
    char * b_data;                     /* pointer to data block */
    struct page *b_page;               /* the page this bh is mapped to */
    void (*b_end_io)(struct buffer_head *bh, int uptodate);   /* I/O completion */
    void *b_private;                   /* reserved for b_end_io */

    unsigned long b_rsector;           /* Real buffer location on disk */
    wait_queue_head_t b_wait;

    struct inode * b_inode;
    struct list_head b_inode_buffers;  /* doubly linked list of inode dirty buffers */
};

The b_data field for the head of a cached buffer points to the buffer stored in memory. If the buffer is asynchronous, it points to the temporary buffer used to implement page I/O operations. If a buffer head's b_dev field has the value B_FREE, its b_data points to a free buffer. A buffer head on the unused list has no buffer associated with it, and its b_data field is meaningless.

Lists of buffer heads

In /fs/buffer.c

static struct buffer_head **hash_table;
This holds references to doubly-linked, linear lists of cached block buffers linked by buffer_head **b_pprev and buffer_head *b_next.

static struct buffer_head *lru_list[NR_LIST];
This table contains three doubly-linked lists of those same cached buffers, referenced by b_next_free and b_prev_free, with each list indexed by the previously-mentioned macros BUF_CLEAN, BUF_DIRTY, and BUF_LOCKED.
NOTE: while the BUF_CLEAN and BUF_LOCKED lists may have been used in past versions of Linux, a comment in the buffer cache code reads: "As we never browse LOCKED and CLEAN lru lists they are in fact completely useless."

static struct buffer_head *unused_list;
This linear list contains unused buffers (who would have guessed?), also referenced by b_next_free. (b_prev_free is meaningless.)

free_list
struct bh_free_head {
    struct buffer_head *list;
    spinlock_t lock;
};
static struct bh_free_head free_list[NR_SIZES];

The buffer cache keeps one list of free buffers for each possible buffer size, with its corresponding lock. Again, these lists are referenced by b_next_free and b_prev_free. I'll show you the initialization of the possible buffer sizes just because it's geekily cool:
#define NR_SIZES 7
static char buffersize_index[65] =
{-1,  0,  1, -1,  2, -1, -1, -1,  3, -1, -1, -1, -1, -1, -1, -1,
  4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
  5, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
 -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
  6};

#define BUFSIZE_INDEX(X) ((int) buffersize_index[(X)>>9])
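The lookup works because shifting a size right by 9 divides it by 512, so the seven legal sizes (512 through 32768) land on table slots 1, 2, 4, 8, 16, 32, and 64, and every other slot holds -1. Here is a small userspace sketch that builds the same table at run time and checks a few sizes. The helper init_buffersize_index() is hypothetical and exists only for this illustration (the kernel uses the static initializer), and the table is declared signed char here because plain char is not guaranteed to hold -1 on every platform.

```c
#include <assert.h>

#define NR_SIZES 7

/* Same mapping as the kernel table, filled in at run time for readability. */
static signed char buffersize_index[65];

/* Hypothetical helper: mark every slot invalid, then number the powers of two. */
static void init_buffersize_index(void)
{
    int i, idx = 0;

    for (i = 0; i < 65; i++)
        buffersize_index[i] = -1;
    for (i = 1; i <= 64; i <<= 1)      /* slots 1, 2, 4, ..., 64 */
        buffersize_index[i] = (signed char)idx++;
    assert(idx == NR_SIZES);
}

#define BUFSIZE_INDEX(X) ((int) buffersize_index[(X) >> 9])
```

After calling init_buffersize_index(), BUFSIZE_INDEX(512) yields 0 and BUFSIZE_INDEX(32768) yields 6, while a size such as 1536 that is not a power of two yields -1.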



getblk()
Now let's trace through the getblk() function in /fs/buffer.c, by which the kernel accesses buffers in the cache.
struct buffer_head * getblk(kdev_t dev, int block, int size)
{
    struct buffer_head * bh;
    int isize;

repeat:
    spin_lock(&lru_list_lock);                 /* Lock every member of LRU list */
    write_lock(&hash_table_lock);              /* Lock every member of hash table */
    bh = __get_hash_table(dev, block, size);   /* Returns buffer with incremented usage counter */
    if (bh)
        goto out;

    isize = BUFSIZE_INDEX(size);               /* Convert size to free list index isize */
    spin_lock(&free_list[isize].lock);         /* Lock every member of appropriate free list */
    bh = free_list[isize].list;                /* Retrieve head of appropriate free list */
    if (bh) {                                  /* If not empty, behead */
        __remove_from_free_list(bh, isize);
        atomic_set(&bh->b_count, 1);
    }
    spin_unlock(&free_list[isize].lock);       /* Unlock the free list */

    if (bh) {
        init_buffer(bh, NULL, NULL);           /* Initialize this free buffer that we will return; sets b_list to BUF_CLEAN */
        bh->b_dev = dev;
        bh->b_blocknr = block;
        bh->b_state = 1 << BH_Mapped;
        __insert_into_queues(bh);              /* Insert buffer into all cached buffer lists */
out:
        write_unlock(&hash_table_lock);        /* Unlock hash table */
        spin_unlock(&lru_list_lock);           /* Unlock LRU list */
        touch_buffer(bh);                      /* Sets a referenced bit of bh->b_page */
        return bh;
    }

    // We were unable to find the buffer in the cache,
    // nor were we able to find a free buffer to return.
    // We need to allocate some new buffers, then try again.
    write_unlock(&hash_table_lock);
    spin_unlock(&lru_list_lock);
    refill_freelist(size);
    goto repeat;
} // end getblk()
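The control flow of getblk() (look in the hash table, otherwise take a free buffer, otherwise refill and retry) can be illustrated with a small single-threaded userspace sketch. Everything in it is hypothetical: struct buf, toy_getblk(), lookup(), and refill() are illustrative names, there is no locking, buffers are not sized, and refill() simply calls calloc() rather than carving up a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct buf {
    int dev, block;
    int count;              /* usage counter, in the spirit of b_count */
    struct buf *next;       /* hash chain link or free-list link */
};

#define NHASH 8
static struct buf *hash_tbl[NHASH];
static struct buf *free_bufs;

static unsigned hash_of(int dev, int block)
{
    return (unsigned)(dev ^ block) % NHASH;
}

/* Search the hash chain; on a hit, bump the usage counter. */
static struct buf *lookup(int dev, int block)
{
    struct buf *b;

    for (b = hash_tbl[hash_of(dev, block)]; b; b = b->next)
        if (b->dev == dev && b->block == block) {
            b->count++;
            return b;
        }
    return NULL;
}

/* Stand-in for refill_freelist(): add one fresh buffer to the free list. */
static void refill(void)
{
    struct buf *b = calloc(1, sizeof *b);

    assert(b != NULL);
    b->next = free_bufs;
    free_bufs = b;
}

static struct buf *toy_getblk(int dev, int block)
{
    struct buf *b;
    unsigned h = hash_of(dev, block);

    for (;;) {
        b = lookup(dev, block);
        if (b)
            return b;               /* cache hit */
        b = free_bufs;
        if (b) {                    /* take the head of the free list */
            free_bufs = b->next;
            b->dev = dev;
            b->block = block;
            b->count = 1;
            b->next = hash_tbl[h];  /* insert into the hash chain */
            hash_tbl[h] = b;
            return b;
        }
        refill();                   /* nothing free: allocate, then retry */
    }
}
```

Calling toy_getblk() twice with the same device and block returns the same buffer with its counter raised to 2, while a different block number triggers a refill and returns a new buffer.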


refill_freelist()
Also in /fs/buffer.c, the refill_freelist() function allocates a page-full of buffers, and takes the buffer size as a parameter.
static void refill_freelist(int size)
{
    balance_dirty(NODEV);

balance_dirty() calls balance_dirty_state(), which checks the number of dirty buffers, and returns 0 if adding a new dirty page will exceed the maximum allowed. (The dev parameter is not actually used, and NODEV is #defined as 0. Code comments indicate that this is for later implementation of device-specific pressure indicators.) A 0 return from balance_dirty_state() forces a call to wakeup_bdflush() to write dirty buffers to disk. We will talk about this function later.


    if (free_shortage())
        page_launder(GFP_BUFFER, 0);
    if (!grow_buffers(size)) {
        wakeup_bdflush(1);
        current->policy |= SCHED_YIELD;
        __set_current_state(TASK_RUNNING);
        schedule();
    }
}

The free_shortage() function checks if there are zones with a severe shortage of free pages, or if all zones have a minor shortage. If so, page_launder() is called to free up some pages. The next function we will discuss is grow_buffers(), which does the actual buffer allocation. If this fails, we wake up the bdflush kernel thread and relinquish the CPU to let it run.


grow_buffers()
static int grow_buffers(int size)
{
    struct page * page;
    struct buffer_head *bh, *tmp;
    struct buffer_head * insert_point;
    int isize;

    /* A little error trap to make sure this buffer size is valid for this architecture */
    if ((size & 511) || (size > PAGE_SIZE)) {
        printk("VFS: grow_buffers: size = %d\n",size);
        return 0;
    }

    page = alloc_page(GFP_BUFFER);        /* Allocate a new page frame */
    if (!page)
        goto out;                         /* This just returns 0 */
    LockPage(page);
    bh = create_buffers(page, size, 0);   /* Assign buffer heads to buffers of newly-allocated page */
    if (!bh)
        goto no_buffer_head;              /* Unlock the page, and release back to memory */

Ok, now we have a list of buffer heads (bh) for the buffers in this newly-allocated page. Let's insert our buffers into the appropriate free list.

    isize = BUFSIZE_INDEX(size);
    spin_lock(&free_list[isize].lock);
    insert_point = free_list[isize].list;
    tmp = bh;
    while (1) {
        /* This is a bunch of list insertion stuff */
        ................
        /* There is a break in here :-) */
    }
    tmp->b_this_page = bh;
    free_list[isize].list = bh;
    spin_unlock(&free_list[isize].lock);

    page->buffers = bh;                   /* Set some buffer-related flags on this page */
    page->flags &= ~(1 << PG_referenced);
    lru_cache_add(page);
    UnlockPage(page);
    atomic_inc(&buffermem_pages);
    return 1;                             /* Success! */

no_buffer_head:
    UnlockPage(page);
    page_cache_release(page);
out:
    return 0;
}


create_buffers()
static struct buffer_head * create_buffers(struct page * page, unsigned long size, int async)
{
    struct buffer_head *bh, *head;
    long offset;

try_again:
    head = NULL;
    offset = PAGE_SIZE;
    while ((offset -= size) >= 0) {           /* Try to collect a page-full of buffer heads */
        bh = get_unused_buffer_head(async);   /* from the unused buffer head list */
        if (!bh)                              /* If we ever fail, */
            goto no_grow;                     /* free heads already gathered: see below */
        bh->b_dev = B_FREE;                   /* Else ... */
        bh->b_this_page = head;               /* Buffers collected in list referenced by b_this_page */
        head = bh;                            /* Local variable head points to list */
        bh->b_state = 0;                      /* Set appropriate buffer head fields... */
        ....................
    }
    return head;                              /* We got a page-full, return the list */

no_grow:
    if (head) {                               /* If any were collected, put back in unused list */
        spin_lock(&unused_list_lock);
        do {
            bh = head;
            head = head->b_this_page;
            __put_unused_buffer_head(bh);
        } while (head);
        spin_unlock(&unused_list_lock);
        wake_up(&buffer_wait);                // Wake up any waiters....
    }
    if (!async)                               /* If this is for a single buffer request, FAIL */
        return NULL;
    run_task_queue(&tq_disk);                 /* Tell device driver to flush to disk */
    // Set our state for sleeping, then check again for buffer heads.
    // This ensures we won't miss a wake_up from an interrupt.
    wait_event(buffer_wait, nr_unused_buffer_heads >= MAX_BUF_PER_PAGE);
    goto try_again;
}

The unused list maintains a minimum of NR_RESERVED buffer heads for asynchronous requests. If there are none left, another asynchronous request must be using them, so we know they will be freed soon. Therefore, if I am an asynchronous request, I will wait until there are enough unused buffer heads available.
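The loop in create_buffers() counts offset down from the end of the page and pushes each new head onto the front of the list, so the finished list runs from offset 0 upward. The sketch below reproduces just that loop in userspace. It is a sketch under assumptions: struct toy_bh, its offset field, the fixed pool, and toy_create_buffers() are all hypothetical, and a 4096-byte page is assumed. The offset field merely records where each buffer would sit and is not meant to reproduce the fields the kernel sets in the elided lines.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096L     /* assumed page size */
#define TOY_MAX_BUFS  8         /* 4096 / 512, the smallest buffer size */

struct toy_bh {
    long offset;                /* hypothetical: byte offset within the page */
    struct toy_bh *b_this_page; /* next buffer head of the same page */
};

static struct toy_bh pool[TOY_MAX_BUFS];

/* Build the per-page list for buffers of `size` bytes and return its head. */
static struct toy_bh *toy_create_buffers(long size, int *count)
{
    struct toy_bh *bh, *head = NULL;
    long offset = TOY_PAGE_SIZE;
    int n = 0;

    assert(size >= 512 && size <= TOY_PAGE_SIZE && (size & 511) == 0);
    while ((offset -= size) >= 0) {
        bh = &pool[n++];
        bh->offset = offset;
        bh->b_this_page = head; /* push onto the front of the list */
        head = bh;
    }
    *count = n;
    return head;
}
```

With a size of 1024 the loop visits offsets 3072, 2048, 1024, and 0, so the returned head is the buffer at offset 0 and the list ends with a NULL link at offset 3072. In grow_buffers(), the statement tmp->b_this_page = bh is what later closes this list into a circle.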

bdflush daemon
The bdflush kernel thread is created during initialization. It selects some dirty buffers and forces them to be written to their corresponding blocks on the physical device. Some system parameters stored in the b_un field of the bdf_prm table control the behavior of this daemon, and are accessible by means of the bdflush() system call. (A tick corresponds to about 10 milliseconds.) Those parameters are as follows:

Parameter     Description
age_buffer    Time-out in ticks of a normal dirty buffer for being written to disk
age_super     Time-out in ticks of a superblock dirty buffer for being written to disk
interval      Delay in ticks between kupdate activations
ndirty        Maximum number of dirty buffers written to disk during an activation of bdflush
nfract        Threshold percentage of dirty buffers for waking up bdflush

In order to wake up bdflush, the kernel invokes the wakeup_bdflush() function. We have already seen some instances of when we need to call this function. NOTE: There is another daemon, kupdate, that periodically writes old dirty buffers to disk. It is very similar to bdflush, except that it only flushes those buffers whose b_flushtime threshold has expired.

bdflush()
int bdflush(void *sem)
{
    struct task_struct *tsk = current;
    int flushed;

    tsk->session = 1;                     /* Set up a task struct for this daemon */
    tsk->pgrp = 1;
    strcpy(tsk->comm, "bdflush");

    /* avoid getting signals */
    spin_lock_irq(&tsk->sigmask_lock);
    flush_signals(tsk);
    sigfillset(&tsk->blocked);
    recalc_sigpending(tsk);
    spin_unlock_irq(&tsk->sigmask_lock);

    up((struct semaphore *)sem);

    for (;;) {
        CHECK_EMERGENCY_SYNC
        flushed = flush_dirty_buffers(0); /* This does all the work */
        if (!flushed || balance_dirty_state(NODEV) < 0) {
            run_task_queue(&tq_disk);
            interruptible_sleep_on(&bdflush_wait);
        }
    }
}

flush_dirty_buffers()


static int flush_dirty_buffers(int check_flushtime)
{
    struct buffer_head * bh, *next;
    int flushed = 0, i;

restart:
    spin_lock(&lru_list_lock);
    bh = lru_list[BUF_DIRTY];
    if (!bh)                              /* If no dirty pages, */
        goto out_unlock;                  /* unlock list and return */
    for (i = nr_buffers_type[BUF_DIRTY]; i-- > 0; bh = next) {
        next = bh->b_next_free;           /* Find next buffer to be written */
        if (!buffer_dirty(bh)) {          /* This buffer doesn't belong */
            __refile_buffer(bh);          /* Refile in appropriate list */
            continue;
        }
        if (buffer_locked(bh))            /* This buffer is being accessed */
            continue;                     /* Skip it. */
        if (check_flushtime) {
            /* The dirty lru list is chronologically ordered so
               if the current bh is not yet timed out,
               then all the following bhs will also be too young. */
            if (time_before(jiffies, bh->b_flushtime))
                goto out_unlock;
        } else {
            if (++flushed > bdf_prm.b_un.ndirty)   /* We've flushed enough */
                goto out_unlock;
        }
        /* OK, now we are committed to write it out. */
        atomic_inc(&bh->b_count);         /* Increment usage counter - effective safety lock */
        spin_unlock(&lru_list_lock);      /* Release dirty LRU list */
        ll_rw_block(WRITE, 1, &bh);       /* Forced synchronous write, and unsets DIRTY bit of buffer state */
        atomic_dec(&bh->b_count);         /* Decrement usage counter */
        if (current->need_resched)
            schedule();
        goto restart;                     /* Start all over again */
    }
out_unlock:
    spin_unlock(&lru_list_lock);
    return flushed;
}
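The early exit on time_before(jiffies, bh->b_flushtime) relies on two things: a comparison that keeps working when the tick counter wraps around, and a list sorted by flush time. The following userspace sketch illustrates both. It is an approximation under assumptions: toy_time_before() uses the common signed-difference idiom rather than copying the kernel macro, and toy_count_expired() is a hypothetical helper that walks a plain array instead of the dirty LRU list.

```c
#include <assert.h>

/* Wraparound-safe "a is earlier than b" for a free-running tick counter. */
static int toy_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Count how many entries of a chronologically ordered array have expired.
 * The scan stops at the first entry that is still too young, because all
 * later entries are younger still.
 */
static int toy_count_expired(const unsigned long *flushtime, int n,
                             unsigned long now)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (toy_time_before(now, flushtime[i]))
            break;
    return i;
}
```

With flush times of 100, 200, and 300 ticks and a current time of 250, the scan stops at the third entry and reports two expired buffers. The signed difference also handles a counter that is about to wrap: a time just below the maximum value compares as earlier than a small value just past zero.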


Doing a read
In user program:
    read(fd, &buffer, numBytes);
This results in a syscall, which passes control to the kernel function sys_read in fs/read_write.c.


asmlinkage ssize_t sys_read(unsigned int fd, char * buf, size_t count)
{
    ssize_t ret;
    struct file * file;

    ret = -EBADF;
    file = fget(fd);                          /* Retrieve the file associated with fd */
    if (file) {
        if (file->f_mode & FMODE_READ) {      /* Check permissions */
            ret = locks_verify_area(FLOCK_VERIFY_READ, file->f_dentry->d_inode,
                                    file, file->f_pos, count);
            if (!ret) {
                ssize_t (*read)(struct file *, char *, size_t, loff_t *);   /* Declare read function */
                ret = -EINVAL;
                if (file->f_op && (read = file->f_op->read) != NULL)
                    ret = read(file, buf, count, &file->f_pos);
            }
            /* the minix read file operation is set to generic_file_read */
        }
        if (ret > 0)
            inode_dir_notify(file->f_dentry->d_parent->d_inode, DN_ACCESS);
        fput(file);
    }
    return ret;
}

/mm/filemap.c
ssize_t generic_file_read(struct file * filp, char * buf, size_t count, loff_t *ppos)
{
    /* This is the "read()" routine for all filesystems that can use the page cache directly. */
    ssize_t retval;

    retval = -EFAULT;
    if (access_ok(VERIFY_WRITE, buf, count)) {   /* Verify that the buffer is writeable */
        retval = 0;
        if (count) {
            read_descriptor_t desc;   /* In fs.h: descriptor for what we're up to with a read */
            desc.written = 0;
            desc.count = count;
            desc.buf = buf;
            desc.error = 0;
            /* parameter file_read_actor is a function that writes the data to the buffer */
            do_generic_file_read(filp, ppos, &desc, file_read_actor);
            retval = desc.written;
            if (!retval)
                retval = desc.error;
        }
    }
    return retval;
}

The ugly part: in /mm/filemap.c

void do_generic_file_read(struct file * filp, loff_t *ppos, read_descriptor_t * desc, read_actor
    struct inode *inode = filp->f_dentry->d_inode;
    struct address_space *mapping = inode->i_mapping;
    unsigned long index, offset;
    struct page *cached_page;
    int reada_ok;
    int error;
    int max_readahead = get_max_readahead(inode);

    cached_page = NULL;
    index = *ppos >> PAGE_CACHE_SHIFT;
    offset = *ppos & ~PAGE_CACHE_MASK;

    if (index > filp->f_raend || index + filp->f_rawin < filp->f_raend) {
        /* If current position outside previous read-ahead window, */
        /* reset current read-ahead context & set read ahead max to 0 */
        /* (will be set to just needed value later) */
        reada_ok = 0;
        filp->f_raend = 0;
        filp->f_ralen = 0;
        filp->f_ramax = 0;
        filp->f_rawin = 0;
    } else {
        /* Otherwise, assume file accesses are sequential enough */
        reada_ok = 1;
    }

    /* Adjust the current value of read-ahead max. */
    if (!index && offset + desc->count <= (PAGE_CACHE_SIZE >> 1)) {
        filp->f_ramax = 0;    /* Read stays in first half of page - no readahead */
    } else {
        /* Try to increase readahead max just enough to do the read request */
        unsigned long needed;
        needed = ((offset + desc->count) >> PAGE_CACHE_SHIFT) + 1;
        if (filp->f_ramax < needed)
            filp->f_ramax = needed;
        if (reada_ok && filp->f_ramax < MIN_READAHEAD)
            filp->f_ramax = MIN_READAHEAD;
        if (filp->f_ramax > max_readahead)
            filp->f_ramax = max_readahead;
    }

    for (;;) {
        struct page *page, **hash;
        unsigned long end_index, nr, ret;
        ....................    /* A bunch of ugly shifting and masking of above variables */

        /*
         * Try to find the data in the page cache..
         */
        hash = page_hash(mapping, index);
        spin_lock(&pagecache_lock);
        page = __find_page_nolock(mapping, index, *hash);

        if (!page)
            goto no_cached_page;
found_page:
        page_cache_get(page);             /* Macro to increment usage counter */
        spin_unlock(&pagecache_lock);
        if (!Page_Uptodate(page))
            goto page_not_up_to_date;
        generic_file_readahead(reada_ok, filp, inode, page);
page_ok:
        /* If users can be writing to this page using arbitrary
         * virtual addresses, take care about potential aliasing
         * before reading the page on the kernel side.
         */
        if (mapping->i_mmap_shared != NULL)
            flush_dcache_page(page);

        /* Ok, we have the page, and it's up-to-date, so now we can copy it to user space... */
        ret = actor(desc, page, offset, nr);
        ....................    /* More bit shifting */
        page_cache_release(page);
        if (ret == nr && desc->count)
            continue;
        break;

page_not_up_to_date:
        /* Page not immediately readable, so let's read ahead while we're at it */
        generic_file_readahead(reada_ok, filp, inode, page);   /* Semi-ugly function */
        if (Page_Uptodate(page))
            goto page_ok;
        lock_page(page);                  /* Get exclusive access to the page */
        if (!page->mapping) {             /* Did it get unhashed before we got the lock? */
            UnlockPage(page);
            page_cache_release(page);
            continue;
        }
        if (Page_Uptodate(page)) {        /* Did somebody else fill it already? */
            UnlockPage(page);
            goto page_ok;
        }
readpage:
        /* ... and start the actual read. The read will unlock the page. */
        error = mapping->a_ops->readpage(filp, page);
        /* Recall that address_space *mapping = inode->i_mapping; */
        /* Thus, readpage is a pointer to a function... */
        if (!error) {
            if (Page_Uptodate(page))
                goto page_ok;
            /* Again, try some read-ahead while waiting for the page to finish.. */
            generic_file_readahead(reada_ok, filp, inode, page);

            wait_on_page(page);
            if (Page_Uptodate(page))
                goto page_ok;
            error = -EIO;
        }
        /* UHHUH! A synchronous read error occurred. Report it */
        desc->error = error;
        page_cache_release(page);
        break;
no_cached_page:
        /* Ok, it wasn't cached, so we need to create a new page.. */
        /* We get here with the page cache lock held. */
        if (!cached_page) {
            spin_unlock(&pagecache_lock);
            cached_page = page_cache_alloc(mapping);
            if (!cached_page) {
                desc->error = -ENOMEM;
                break;
            }
            /* Somebody may have added the page while we dropped the page cache lock. Check for that. */
            spin_lock(&pagecache_lock);
            page = __find_page_nolock(mapping, index, *hash);
            if (page)

                goto found_page;
        }
        /* Ok, add the new page to the hash-queues... */
        page = cached_page;
        __add_to_page_cache(page, mapping, index, hash);
        spin_unlock(&pagecache_lock);
        cached_page = NULL;
        goto readpage;
    }

    *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
    filp->f_reada = 1;
    if (cached_page)
        page_cache_free(cached_page);
    UPDATE_ATIME(inode);
}
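The position arithmetic at the top and bottom of do_generic_file_read() splits the file position into a page index and an offset within that page, and later recombines them. A minimal userspace sketch of that arithmetic follows, together with the "needed" read-ahead estimate. The names toy_split(), toy_join(), and toy_pages_needed() are hypothetical, and 4096-byte pages (a shift of 12) are assumed, whereas the kernel takes the shift from PAGE_CACHE_SHIFT.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12                       /* assumes 4096-byte pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)

struct toy_pos {
    unsigned long index;    /* which page of the file */
    unsigned long offset;   /* byte offset inside that page */
};

/* Split a file position, like index = pos >> SHIFT and offset = pos & ~MASK. */
static struct toy_pos toy_split(unsigned long long pos)
{
    struct toy_pos p;

    p.index  = (unsigned long)(pos >> TOY_PAGE_SHIFT);
    p.offset = (unsigned long)(pos & (TOY_PAGE_SIZE - 1));
    assert(p.offset < TOY_PAGE_SIZE);
    return p;
}

/* Recombine, like *ppos = (index << SHIFT) + offset. */
static unsigned long long toy_join(struct toy_pos p)
{
    return ((unsigned long long)p.index << TOY_PAGE_SHIFT) + p.offset;
}

/* Pages touched by a read of `count` bytes starting at `offset` in a page. */
static unsigned long toy_pages_needed(unsigned long offset, unsigned long count)
{
    return ((offset + count) >> TOY_PAGE_SHIFT) + 1;
}
```

For a file position of 10000, the split yields page index 2 and offset 1808, since 2 * 4096 + 1808 = 10000. A 5000-byte read starting at that offset gives a "needed" value of 2 pages.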


Follow those pointers to functions...


Here's the minix version of readpage in fs/minix/inode.c:
static int minix_readpage(struct file *file, struct page *page)
{
    return block_read_full_page(page,minix_get_block);
}
This leads us to block_read_full_page in... (drum-roll please).... buffer.c! Yes, we're back to the buffer cache!
We can also trace the minix_get_block function pointer as follows: minix_get_block in fs/minix/inode.c calls V1_minix_get_block in fs/minix/itree_v1.c, which calls get_block in buffer.c.


In buffer.c
int block_read_full_page(struct page *page, get_block_t *get_block)
{
    struct inode *inode = page->mapping->host;
    unsigned long iblock, lblock;
    struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
    unsigned int blocksize, blocks;
    int nr, i;

    if (!PageLocked(page))    /* Error trap - I should never call this on an unlocked page */
        PAGE_BUG(page);       /* Macro for a function that just reports the bug */
    blocksize = inode->i_sb->s_blocksize;
    if (!page->buffers)       /* Remember, every page struct has a list of its member buffers */
        create_empty_buffers(page, inode->i_dev, blocksize);   /* calls create_buffers and sets page fields */
    head = page->buffers;

    blocks = PAGE_CACHE_SIZE >> inode->i_sb->s_blocksize_bits;
    iblock = page->index << (PAGE_CACHE_SHIFT - inode->i_sb->s_blocksize_bits);
    lblock = (inode->i_size+blocksize-1) >> inode->i_sb->s_blocksize_bits;
    bh = head;
    nr = 0;
    i = 0;


    do {
        if (buffer_uptodate(bh))
            continue;
        if (!buffer_mapped(bh)) {
            if (iblock < lblock) {
                if (get_block(inode, iblock, bh, 0))
                    continue;
            }
            if (!buffer_mapped(bh)) {
                memset(kmap(page) + i*blocksize, 0, blocksize);
                flush_dcache_page(page);
                kunmap(page);
                set_bit(BH_Uptodate, &bh->b_state);
                continue;
            }
            /* get_block() might have updated the buffer synchronously */
            if (buffer_uptodate(bh))
                continue;
        }
        arr[nr] = bh;    /* Add the mapped, but non-uptodate buffer to the buffer head array */
        nr++;
    } while (i++, iblock++, (bh = bh->b_this_page) != head);   /* Circularly-linked list */

    if (!nr) {    /* All buffers are uptodate */
        SetPageUptodate(page);
        UnlockPage(page);
        return 0;
    }

    /* Stage two: lock the buffers */
    for (i = 0; i < nr; i++) {
        struct buffer_head * bh = arr[i];
        lock_buffer(bh);
        bh->b_end_io = end_buffer_io_async;
        atomic_inc(&bh->b_count);
    }

    /* Stage 3: start the IO */
    for (i = 0; i < nr; i++)
        submit_bh(READ, arr[i]);    /* Low-level device read */
    return 0;
}
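The three shift expressions near the top of block_read_full_page() decide which file blocks a page covers and where the file ends. The sketch below works through them in userspace. It is only an illustration under assumptions: the function names are hypothetical, pages are assumed to be 4096 bytes, and the block size is passed as a bit count in place of s_blocksize_bits.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4096-byte pages */

/* Blocks per page, like blocks = PAGE_CACHE_SIZE >> s_blocksize_bits. */
static unsigned long toy_blocks_per_page(unsigned bits)
{
    assert(bits >= 9 && bits <= TOY_PAGE_SHIFT);
    return (1UL << TOY_PAGE_SHIFT) >> bits;
}

/* First file block covered by page `index`, like iblock. */
static unsigned long toy_first_block(unsigned long index, unsigned bits)
{
    assert(bits >= 9 && bits <= TOY_PAGE_SHIFT);
    return index << (TOY_PAGE_SHIFT - bits);
}

/* Number of blocks needed to hold `size` bytes, rounded up, like lblock. */
static unsigned long toy_block_count(unsigned long size, unsigned bits)
{
    unsigned long blocksize = 1UL << bits;

    assert(bits >= 9 && bits <= TOY_PAGE_SHIFT);
    return (size + blocksize - 1) >> bits;
}
```

Take 1024-byte blocks (10 bits) and a 5000-byte file. Each page holds 4 blocks, page 1 starts at block 4, and the file occupies 5 blocks. Of the blocks 4 through 7 covered by page 1, only block 4 satisfies iblock < lblock, so the loop asks get_block() about that one and zero-fills any buffer that is still unmapped.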


References
Bovet and Cesati. Understanding the Linux Kernel. O'Reilly, 2001.
Chaffee. File System 2: The Buffer Cache. College of William and Mary, 2000.
Torvalds, et al. Linux version 2.4.5 kernel source code. 2000.
