Prepared by Richard A. Sevenich, rsevenic@netscape.net
Chapter 4. A Selection of Topics from the Kernel Internals

General References:
- Love, Linux Kernel Development, Sams (2004).
- Bovet & Cesati, Understanding the Linux Kernel, 2nd Edition, O'Reilly (2003).
- Beck et al., Linux Kernel Programming, 3rd Edition, Addison-Wesley (2002).
- Rubini & Corbet, Linux Device Drivers, 2nd Edition, O'Reilly (2001).
- Linux Kernel Version 2.4 Source Code.

Note: At this writing the Rubini & Corbet book is freely available for download from http://www.xml.com/ldd/chapter/book/

4.1 Introduction

The topic areas we'll skim are:
- system calls
- signals
- wait queues
- task queues
- kernel timers and other timely topics
- interrupt handling
- the process scheduler

The kernel continues to change in each of these areas, particularly in response to the need for greater scalability.

4.2 The System Call Dispatcher

Some of you will recall the MS-DOS function dispatcher. It provided functionality such as:
- send a character to the screen or printer
- receive a character from the keyboard
- read/write a disk drive
- get/set the time or date

It was implemented in assembly language and used the following interface:
- put the number of the desired function in the ah register
- perform any other initialization needed by the function (using other registers)
- invoke interrupt 0x21

The corresponding interrupt handler was the function dispatcher. The Linux system call dispatcher may be more complex, but it is essentially similar to the MS-DOS function dispatcher. Such a jump table is not a new idea. Let's look at an example:

    int main()
    {
        int result;
        result = write(1, "hello\n", 6);
        ...
    }

The code above makes a library call, write, which is a wrapper around the sys_write system call. Some authors refer to the library call as a stub. The arguments of write are passed via the stack to the library function, which does some setup and then invokes (assuming IA-32) int 0x80, the system call dispatcher. R.A. 
Sevenich 2004 Introduction to Linux Device Driver Development 4 - 1

In our example, the library call would do this setup for the IA-32:
- put the system call number for write (4) in the eax register
- put the first argument (stdout = 1) into the ebx register
- put the second argument (a pointer to the string "hello\n") into the ecx register
- put the third argument (the length of the string = 6) into the edx register
- invoke int 0x80

The interrupt handler switches to kernel mode, performs the task, and returns a result in eax to the library function.

Our goal in this chapter is to get familiar with some details of the underlying implementation, then implement our own system call and use it. Integrating our system call into the kernel will necessitate recompiling the kernel as you have done before. However, you have a good .config file, right - and it's backed up, right?

4.2.1 Implementation Details

An important header file to look at is <asm/unistd.h>. It starts with a table of #define's containing the system call numbers. The number for the write library call we considered earlier appears as

    #define __NR_write 4

Further investigation of this header file suggests that the library wrapper (or stub) around the system call can be generated by a macro call of this form (see 'man 2 intro'):

    _syscallX(type, name, type1, arg1, type2, arg2, ...)

where
- X is the number of arguments taken by the stub (range is 0 through 5)
- type is the return type of the system call
- name is the name of the system call
- typeN is the type of the Nth argument
- argN is the name of the Nth argument

These macros can be seen in <linux/unistd.h>.
For example, in that header file we find:

    #define _syscall3(type, name, type1, arg1, type2, arg2, type3, arg3) \
    type name(type1 arg1, type2 arg2, type3 arg3) \
    { \
        long __res; \
        __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name), "b" ((long)(arg1)), "c" ((long)(arg2)), \
              "d" ((long)(arg3))); \
        __syscall_return(type, __res); \
    }

and later:

    static inline _syscall3(int, write, int, fd, const char *, buf, off_t, count)

Hence we know how to build the expansion for write, i.e.

    int write(int fd, const char *buf, off_t count)
    {
        long __res;
        __asm__ volatile ("int $0x80"
            : "=a" (__res)
            : "0" (__NR_write), "b" ((long)(fd)), "c" ((long)(buf)),
              "d" ((long)(count)));
        __syscall_return(int, __res);
    }

Here we see in which registers the parameters are passed, how the value 4 identifying the write system call is determined, etc. It is left for you to expand the __syscall_return macro.

Note that we have explained the scenario from making the library call in the user code to having the library call subsequently invoke int 0x80. Now, we've claimed that write is a wrapper around the actual kernel-level system call, sys_write. How is sys_write called, and where in the source code is it? If we knew those answers we'd be on our way to doing our own implementation. We note that int 0x80 ultimately results in executing the code in <linux/arch/i386/kernel/entry.S>. Look particularly at the code starting from ENTRY(system_call), noting that it soon does a call to a reference in the sys_call_table. That table is at the end of the entry.S file, where we find, for example, that entry number 4 (__NR_write) is

    .long SYMBOL_NAME(sys_write)

So that's how it finds the call to sys_write. Where in the source code is sys_write? Using http://lxr.linux.no/source/, we find that it is in /usr/src/linux/fs/read_write.c.
Check to see what other directories are at this level and see what other system calls you might locate (e.g. sys_fork, sys_chmod, sys_alarm). We have enough details; putting the picture together will allow us to create our own system call.

4.2.2 Implementing our own System Call

We'll just lay out the recipe, based on what we discovered in the previous section. Further, we'll pick a specific example so everything is concrete. Here is the recipe:
1. Put your new system call, sys_my_new_call, in a file it_b_mine.c. As root, copy the file to /usr/src/linux/kernel/.
2. Modify the Makefile in /usr/src/linux/kernel/.
3. Edit /usr/src/linux/arch/i386/kernel/entry.S and /usr/src/linux/include/asm/unistd.h, in that order (described in Section 4.2.5).
4. Recompile the kernel via 'make bzImage' while in /usr/src/linux, copy bzImage to the appropriate vmlinuz in /boot, run lilo, and reboot (cf. Chapter 2 of the course notes).
5. Write a user program which exercises your new system call.

You might tar and zip your current /usr/src/linux, because we're going to make some changes which you'll want to remove subsequently.

4.2.3 The new system call

Here's the file it_b_mine.c:

    #include <linux/kernel.h>

    asmlinkage int sys_my_new_call(void)
    {
        printk(KERN_ALERT "sys_my_new_call at your service\n");
        return 0;
    }

As root, copy it into /usr/src/linux/kernel. Double-check that the ownership and permissions are consistent with other files in that directory.

4.2.4 Modify the Makefile

Modify the Makefile in /usr/src/linux/kernel to add it_b_mine.o to the entries for obj-y.

4.2.5 As root, edit unistd.h and entry.S

Near the end of the file /usr/src/linux/arch/i386/kernel/entry.S, you'll find the jump table. At the very end of that table, add

    .long SYMBOL_NAME(sys_my_new_call)

and note the position. In my case, it was 226. Save the new entry.S. Next, near the beginning of the file /usr/src/linux/include/asm/unistd.h, you'll find the table of system call numbers.
Add the appropriate entry, i.e. at the end I added

    #define __NR_my_new_call 226

where, in your case, the number might be different than 226, but it must match that from the entry.S file. Save this unistd.h.

4.2.6 Recompile and reboot

Unless you are also doing some reconfiguration, you need not do all the steps seen earlier in Section 1.4 of Chapter 1. In particular, you can start with Step 5 of that section and then something along the lines of Steps 8 and 9. Essentially all you need to do then is
- compile via 'make bzImage'
- copy the new kernel to /boot
- revise lilo.conf, if necessary, and rerun lilo ... or modify /boot/grub/menu.lst, if necessary
- reboot

4.2.7 A user program using our new system call

Let's continue to assume the source hierarchy in which we are working is /usr/src/linux. Now, gcc expects the include files to be at /usr/include/, but ours are at /usr/src/linux/include. Sometimes there are symbolic links from the former to the latter, in particular from

    /usr/include/asm to /usr/src/linux/include/asm

and from

    /usr/include/linux to /usr/src/linux/include/linux

So that we don't need to modify linkages for our particular example, we'll just tell gcc where those files are when we compile the user program, i.e.

    gcc -I /usr/src/linux/include ...

and so on. Here is a user program:

    /* Use my_new_call */
    #include <sys/types.h>
    #include <linux/unistd.h>

    static inline _syscall0(int, my_new_call);

    int main()
    {
        int result;
        result = my_new_call();
    }

Compile and run this program. It should print to some log file, e.g. to /var/log/messages:

    sys_my_new_call at your service

which you can verify via something like

    tail -f /var/log/messages

If it's printing to some other log file, you can do some detective work looking at time stamps via 'ls -l /var/log/' and see which log files have been written recently.
4.2.8 Return to normalcy

If desired, back out all the changes you made in this chapter and return your system to its original state.

4.2.9 Adding a bit more substance to our system call

User programs, of course, cannot be allowed access to kernel space. Yet we may need to pass information back and forth under tight control, e.g. via the system call mechanism and appropriate kernel functions. Linux provides various ways to do this. Here we'll introduce two macros:
- get_user() - can be called by kernel code to get a single datum from the user's memory space
- put_user() - can be called by kernel code to put a single datum into the user's memory space

Here is the necessary information for get_user():

    #include <asm/uaccess.h>
    get_user(datum, ptr)

This will read the datum from user space, where ptr is the user space address. The size of the datum transferred depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success, otherwise an error.

Here is the necessary information for put_user():

    #include <asm/uaccess.h>
    put_user(datum, ptr)

This will write the datum to user space, where ptr is the user space address. The size of the datum transferred depends on the type of the ptr argument and is determined by gcc at compile time. The macro returns 0 on success, otherwise an error.

As an example, we'll invent two new system calls:
- sys_new_sys1 - will use get_user()
- sys_new_sys2 - will use put_user()

We'll package them together in the same file and put that file in /usr/src/linux/kernel. We also must modify the Makefile in that directory and put two new entries in both /usr/src/linux/arch/i386/kernel/entry.S and /usr/src/linux/include/asm/unistd.h. So we are essentially just following the recipe at the start of Section 4.2.2.
Here is the new kernel program:

    /* new_sysen.c */
    #include <linux/kernel.h>
    #include <asm/uaccess.h>
    #include <asm/errno.h>

    static int shared_int = 0;

    asmlinkage int sys_new_sys2(unsigned long arg)
    {
        shared_int = 5 * shared_int;
        printk(KERN_ALERT "sys_new_sys2 will call put_user()\n");
        if (put_user(shared_int, (int *)arg) != 0)
            return -EFAULT;
        return 0;
    }

    asmlinkage int sys_new_sys1(unsigned long arg)
    {
        shared_int = 0;
        printk(KERN_ALERT "sys_new_sys1 will call get_user()\n");
        if (get_user(shared_int, (int *)arg) != 0)
            return -EFAULT;
        return 0;
    }

Here is an example user program which makes use of the two new system calls:

    #include <stdio.h>
    #include <stdlib.h>
    #include <linux/unistd.h>
    #include <sys/types.h>

    static inline _syscall1(int, new_sys1, int *, foo1)
    static inline _syscall1(int, new_sys2, int *, foo2)

    int main()
    {
        int user_space_int;
        user_space_int = 16;
        printf("user_space_int starts with value %d\n", user_space_int);
        if (new_sys1(&user_space_int) != 0) {
            printf("new_sys1 failed.\n");
            exit(-1);
        }
        if (new_sys2(&user_space_int) != 0) {
            printf("new_sys2 failed.\n");
            exit(-1);
        }
        printf("user_space_int finishes with value %d\n", user_space_int);
        exit(0);
    }

4.3 Signals
We will see that there is a variety of available signals and that there are various ways a program can be set up to respond to them - giving the signal mechanism both power and flexibility. More specifically, a signal can have these possible effects on a program (note the similarity to the hardware interrupt mechanism):
- The signal is 'caught' by the program: execution is transferred to a signal handler and, upon its completion, control is returned to the signaled program.
- There is no signal handler, so the appropriate default is exercised:
  - STOP: The program is put into a stopped state, but can be returned to a runnable state later.
  - EXIT: The program is forced to exit.
  - CORE: The program is forced to exit, and a core dump is generated and filed in the program's directory.
  - IGNORE: The signal is ignored.

The SIGKILL and SIGSTOP signals are distinct in that they can neither be caught nor ignored. A program's response to a signal is consistent throughout the process, so that all threads within a process respond the same way.

Signals have names (all starting with 'SIG'), values, and default actions. These are listed in the man page, i.e. enter 'man 7 signal'. You'll note from the man page that there is a POSIX signal API and a legacy API. The referenced book by Johnson and Troan has a very nice chapter on signals which moves through the legacy signal mechanisms, which were in some cases incompatible with each other. It also discusses the unreliability of the ANSI C standardization of the signal() function. It is recommended that the well-defined and reliable POSIX signal API be used.

4.3.1 The kernel's use of signals

Of course, the kernel already uses signals to conduct its everyday business. Here are some examples from the man page:
- If a program makes an invalid memory reference (e.g. a wild pointer), the kernel sends the offending process a SIGSEGV, with default action CORE.
- If a child process has stopped or terminated, the kernel sends the parent a SIGCHLD, with default action IGNORE.
- If the suspend keystroke combination (often Ctrl-Z) is pressed, the kernel sends SIGTSTP to any foreground process, with default action STOP.
- If a program writes to a pipe which has no readers, the kernel sends that process a SIGPIPE, with default action EXIT.

In general, the kernel uses signals for various reasons, not merely on error conditions. A categorization of such reasons might include:
- program termination
- program stopping and subsequent continuing
- dealing with errant programs
- terminal handling
- program notification (e.g. a timeout alarm, death of a child)

Again, note that some signals originate in response to a hardware interrupt, i.e. the interrupt handler causes a signal to be sent.

4.3.2 Signals in user programs

As expected, user programs' use of signals is more restricted. They cannot, for example, just send signals to anyone. They can, however, set themselves up to catch a variety of kernel-generated signals - often having to do with signals sent in connection with terminal activity. Furthermore, the POSIX signals include a pair of user-defined signals, SIGUSR1 and SIGUSR2, whereby two user programs with the same uid can communicate.

4.3.3 Signal handlers in user programs

Although there may be instances where we want the default response to the signal, it is alternatively possible that we will want to catch and handle the signal - that will be the focus of this section. POSIX signals are organized in sets, represented by a data type sigset_t.
Linux provides us with a group of functions for safely manipulating signal sets:
- empty the referenced set of all signals:
      int sigemptyset(sigset_t *set);
- fill the referenced set with all signals:
      int sigfillset(sigset_t *set);
- add a specified signal to the referenced set:
      int sigaddset(sigset_t *set, int signo);
- remove a specified signal from the referenced set:
      int sigdelset(sigset_t *set, int signo);
- test whether a specified signal is a member of the referenced set:
      int sigismember(const sigset_t *set, int signo);

The program that wishes to catch the signal will also declare the signal handler. The prototype for a signal handler is

    typedef void (*__sighandler_t)(int signo);

The reference to your signal handler is placed in the struct sigaction, which specifies how the kernel should deliver signals to your program. The struct looks like this:

    struct sigaction {
        sighandler_t sa_handler;
        unsigned long sa_flags;
        void (*sa_restorer)(void);
        sigset_t sa_mask;
    };

Now we'll describe the items in this struct:
- sa_handler is a pointer to your signal handler; alternatively it can be
  - SIG_IGN - tells the kernel to ignore the signal
  - SIG_DFL - tells the kernel to use the default response
- sa_flags is a bitmask that controls kernel behavior when the signal is received and ORs together various possibilities. Our subsequent example sets this to zero. You might investigate other options.
- sa_restorer is not used by Linux
- sa_mask specifies the signals to be blocked while the signal handler is executing

Once the sigaction struct is declared, the sigaction() system call can be invoked to deliver the information to the kernel detailing how the signal should be handled. The following user space program provides an example.
    #include <signal.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>

    #define true 1
    #define false 0

    /* volatile sig_atomic_t so the busy-wait loop rereads the flag */
    volatile sig_atomic_t caught = false;

    /* here's a trivial signal handler */
    void mysig_handler(int sig)
    {
        printf("mysig_handler got SIGALRM.\n");
        caught = true;
    }

    int main(void)
    {
        /* declare the sigaction struct */
        struct sigaction mysig_action;

        /* fill in the necessary fields in the prior struct */
        mysig_action.sa_handler = mysig_handler;
        sigemptyset(&mysig_action.sa_mask);
        mysig_action.sa_flags = 0;

        /* pass the signal and related struct to the kernel */
        sigaction(SIGALRM, &mysig_action, NULL);

        printf("Now calling alarm(5)\n");
        /* set up a SIGALRM at 5 seconds from now */
        alarm(5);

        /* let's hang around until the signal is caught */
        while (!caught);

        printf("Resumed program upon signal handler completion.\n");
        return 0;
    }

4.4 Wait Queues

It routinely happens, in a wide variety of circumstances, that a kernel process needs to wait for a particular event to happen. Although there are instances where the process may then do a busy-waiting loop (e.g. spinlocks in a multiprocessor environment), it is often more appropriate that the process block, so other processes can continue to keep the CPU busy doing useful work. This capability is supported by wait queues. The wait queue struct forms a cyclic linked list:

    struct wait_queue {
        struct task_struct *task;
        struct wait_queue *next;
    };

The supporting functions and macros include those that
- put the process to sleep
- awaken the process
- add and delete wait queue members

We'll examine these next.

Putting a process to sleep on a wait queue

These include the following:

    void sleep_on(struct wait_queue **p);

This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling the scheduler. The process must be awakened by some other process which does a wake-up call (discussed under the next bold subheading) for this queue.
    void interruptible_sleep_on(struct wait_queue **p);

This sets the process state to TASK_INTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling the scheduler. The process must be awakened by some other process which does a wake-up call for this queue, but it can also be awakened by a signal.

    void sleep_on_timeout(struct wait_queue **p, long timeout);

This sets the process state to TASK_UNINTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the timeout argument, rather than by requiring some other process to do a wake-up call for this queue.

    void interruptible_sleep_on_timeout(struct wait_queue **p, long timeout);

This sets the process state to TASK_INTERRUPTIBLE, enters the process in the designated wait queue, and relinquishes control by calling schedule_timeout. The process is awakened at the time specified by the timeout argument, rather than by requiring some other process to do a wake-up call for this queue. However, the process can also be awakened by a signal.

Awakening a process on a wait queue

These include the following:

    void wake_up(struct wait_queue **p);

This will wake up both interruptible and noninterruptible sleepers on the designated queue.

    void wake_up_interruptible(struct wait_queue **p);

This will wake up only interruptible sleepers on the designated queue. Note that the wake-up calls will not awaken processes which were explicitly stopped.

Adding/deleting wait queue members

To safely add and remove members of wait queues we have:

    void add_wait_queue(struct wait_queue **queue, struct wait_queue *entry);
    void remove_wait_queue(struct wait_queue **queue, struct wait_queue *entry);

In both cases, the first argument refers to the queue of interest, while the second refers to the entry to be added or removed, respectively.
4.4.1 Race Conditions

Let's say we put some process to sleep until some condition is true, maybe using a construction like this:

    while (wake_condition == false) {
        interruptible_sleep_on(&my_wait_queue);
        ...
    }

With the demise of the big kernel lock, this may be subject to race conditions. A race occurs if the wake condition evaluates as false in the first line and becomes true before the second line executes. In the worst case, the process will experience deadlock. This can be avoided with some clever programming, but the cleverness has been encapsulated in the kernel - so we don't even need to be clever. The appropriate replacement for the prior code snippet is

    wait_event_interruptible(my_wait_queue, wake_condition == true);

There is also the expected

    wait_event(my_wait_queue, wake_condition == true);

4.5 Task Queues

Task queues hold tasks to be executed at a later time. The kernel provides predefined task queues in which you can register your task. The scheduler then decides just when the tasks in such a queue will be executed. Alternatively, you can define your own task queue and specify when it should execute. A queue element is a tq_struct, as defined by:

    #include <linux/tqueue.h>

    struct tq_struct {
        struct tq_struct *next;    /* linked list of queued tasks */
        unsigned long sync;        /* must be initialized to zero */
        void (*routine)(void *);   /* function to call */
        void *data;                /* argument to function */
    };

Once you have declared an element, you should
- clear the next and sync fields
- enter appropriate items in the routine and data fields

Then you may queue the task with the queue_task function, whose prototype is

    void queue_task(struct tq_struct *task, task_queue *list);

Note: For the predefined tq_scheduler queue, the related code must use schedule_task to put the task on the tq_scheduler queue, not queue_task. We'll see an example shortly.
To run a queue of tasks, the function used is run_task_queue, with prototype

    void run_task_queue(task_queue *list);

which the kernel invokes for its predefined task queues and which you must call for any task queue you define yourself.

4.5.1 Queues Predefined by the Kernel

The four queues predefined by the kernel are:
- tq_scheduler - queued tasks in here execute whenever the scheduler runs (not executed at interrupt time)
- tq_timer - execution of these tasks is triggered by the timer tick (executed at interrupt time)
- tq_immediate - these tasks are run as soon as possible, either on return from a system call or when the scheduler is run (executed at interrupt time)
- tq_disk - not available to modules; used internally by memory management

This essentially leaves the first three for us.

4.5.2 The tq_timer and tq_immediate queues

Note that tasks in the tq_timer and tq_immediate queues are executed at interrupt time. This has important consequences. First, in interrupt mode there is no process context, so that
- the queued task cannot access user space
- the current pointer is not meaningful

Second, if the task attempts to sleep or calls a function which can sleep, the queued task may hang. Note that functions which attempt to reserve system resources are quite likely to have a need to sleep (e.g. kmalloc).

An example of usage of tq_timer or tq_immediate:

    #include <linux/tqueue.h>

    static struct tq_struct my_task;

    void my_own_task(unsigned long ptr)
    {
        ... some valid code ...
    }

    void init_and_enqueue_my_task()
    {
        my_task.routine = (void *)&my_own_task;
        my_task.data = (void *)&some_data;
        queue_task(&my_task, &tq_immediate);
    }

4.5.3 The tq_scheduler queue

Tasks in the tq_scheduler queue are not executed at interrupt time, so the constraints mentioned at the start of Section 4.5.2 do not apply.
A further difference from tq_immediate and tq_timer emerged in the 2.4 kernel series - the related code must use schedule_task to put the task on the tq_scheduler queue, not queue_task. An example of usage of tq_scheduler follows:

    #include <linux/tqueue.h>

    static struct tq_struct my_task;
    static char my_msg[] = "<1>\nmy_special_task has executed.\n";

    DECLARE_WAIT_QUEUE_HEAD(my_wait);

    void my_special_task(unsigned long ptr)
    {
        printk((char *)ptr);
        wake_up_interruptible(&my_wait);
    }

    void init_and_enqueue_my_task()
    {
        my_task.routine = (void *)&my_special_task;
        my_task.data = (void *)&my_msg;
        schedule_task(&my_task);
        interruptible_sleep_on(&my_wait);
    }

4.5.4 Your own Task Queues

In this case, since the queue is not predefined, the queue is declared by a macro in this style:

    DECLARE_TASK_QUEUE(my_tq);

The fields would be filled in as before, and then the task would be enqueued by:

    queue_task(&my_task, &my_tq);

Unlike the predefined queues, this queue would need to be executed overtly by

    run_task_queue(&my_tq);

This leaves the question of how the task queue execution would be triggered. This is done by registering the prior function in one of the predefined queues.

4.6 Time Related Functionality

4.6.1 Current Time

The kernel keeps track of time via the timer interrupt, which on my IA-32 machine occurs 100 times per second (defined by HZ in /usr/src/linux/include/asm/param.h). The timer interrupt handler updates the value in jiffies, which is defined as an unsigned long volatile in /usr/src/linux/include/linux/sched.h. This 32-bit quantity is zeroed when your machine is powered up. The value in the variable jiffies is one method to measure time intervals in kernel code. If your driver needs the current time, the do_gettimeofday function is provided. It gives near-microsecond resolution for most architectures.
A usage example is shown in this fragment:

    struct timeval tv;

    do_gettimeofday(&tv);
    printk(KERN_ALERT "Current seconds = %08u.%06u\n",
           (int)(tv.tv_sec % 100000000), (int)(tv.tv_usec));

In addition to the timer-interrupt-driven jiffies value, most modern processors have acknowledged the need for a much finer time resolution. This will be based on the processor clock speed and made available in a special register. This is architecture dependent, and we will describe the situation on the more recent and ubiquitous IA-32 (Pentium and later). The IA-32 has a 64-bit register called the time stamp counter (TSC), available via the assembly language instruction rdtsc.
The TSC is also accessible via the C macros rdtsc and rdtscl, described by:

    #include <asm/msr.h>

    rdtsc(low, high) - here low and high are each 32-bit variables holding the two parts of the 64-bit TSC
    rdtscl(low) - here low is just the low part of the 64-bit TSC

4.6.2 Delays

Long Delays

For pedagogical reasons, we'll start with a poor solution for creating a delay and move toward better ones. Each example will rely on this information:

    /* resolution on the order of jiffies */
    unsigned long my_delay = desired_seconds * HZ;
    unsigned long target_time = jiffies + my_delay;

Since jiffies will eventually roll over, and since Linux machines are relatively stable, target_time could roll over and be less than jiffies. Hence, a set of macros that accommodates rollover properly is provided in <linux/timer.h>. These are as follows:
- time_before(jiffies, target_time) - rollover corrected; evaluates as true if jiffies < target_time
- time_after(jiffies, target_time) - rollover corrected; evaluates as true if jiffies > target_time
- time_before_eq(jiffies, target_time) - rollover corrected; evaluates as true if jiffies <= target_time
- time_after_eq(jiffies, target_time) - rollover corrected; evaluates as true if jiffies >= target_time

Let's examine some delay possibilities. The first example is known as "busy waiting" and should be avoided. It is simply:

    while (time_before(jiffies, target_time));
    /* the CPU stays busy in this loop, stalling any other work */

The fact that jiffies is declared as volatile forces it to be reread each time it is accessed in your code - so you won't be haunted by a cached value. However, jiffies is changed by the timer interrupt, so using this busy-waiting loop while hardware interrupts were disabled would hang the machine. Our second example removes both problems:

    while (time_before(jiffies, target_time))
        schedule();

This process calls the scheduler, so other tasks can run.
However, this task remains in the run queue, which creates a subtle problem. If this is the only task, it will keep getting turns to run, and it will keep calling the scheduler - but it's really doing nothing useful. On the other hand, if there are no tasks to run, the scheduler runs the 'idle' process, which provides these benefits:
- it reduces the CPU's workload, reducing temperature and increasing lifetime (e.g. a laptop will go longer before needing its battery recharged)
- the time used by the process is accountable (maybe a non-issue)

Our third example removes the prior problem as follows:

    current->state = TASK_INTERRUPTIBLE;
    schedule_timeout(my_delay);

Here, current points to the task_struct of the executing process. The scheduler will avoid the task until the timeout has been reached.

Short Delays

The prior delays have resolution in the jiffies range. To get delays in the microsecond range, you can use the udelay function, based on the processor's BogoMIPS measurement. Its prototype is

    #include <linux/delay.h>

    void udelay(unsigned long usecs);

For example, udelay(50); would be a busy-waiting loop that lasts for 50 microseconds. It is recommended that the argument passed to udelay not exceed 1000, because fast machines (i.e. with high BogoMIPS) may encounter an overflow. A wrapper iterating around udelay is provided by mdelay; e.g. mdelay(70); would provide a delay of 70 milliseconds.

4.6.3 Kernel Timers

Like task queues, kernel timers provide a way to defer execution of a task until a later time. The kernel timers are kept in a doubly linked list.
The data structure for a timer is given in /usr/src/linux/include/linux/timer.h as:

    struct timer_list {
        struct timer_list *next;  /* MUST be first element */
        struct timer_list *prev;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
    };

where expires (the 3rd element) is the time in jiffies at which timeout occurs, and function (the 5th element) denotes the function to call at timeout. There are three important functions provided for manipulating timers:
- init_timer() - initializes the timer structure by zeroing the next and prev pointers
- add_timer() - inserts a timer structure into the global list of active timers
- del_timer() - for removing a timer from the list before its timeout has transpired

Note that when a timer times out, it is automatically removed from the list. Here are the elements of a trivial example:

    #include <linux/time.h>
    #include <linux/timer.h>
    #include <linux/wait.h>
    #include <linux/param.h>

    static struct timer_list my_timer;
    static char msg[] = "<1>\nmy_timer has timed out.\n";

    DECLARE_WAIT_QUEUE_HEAD(my_wait);

    void upon_my_timeout(unsigned long ptr)
    {
        printk((char *)ptr);
        wake_up_interruptible(&my_wait);
    }

    void wait_four()
    {
        init_timer(&my_timer);
        my_timer.function = upon_my_timeout;
        my_timer.data = (unsigned long)&msg;
        my_timer.expires = jiffies + (4 * HZ);
        add_timer(&my_timer);
        interruptible_sleep_on(&my_wait);
    }

The timeouts provided by such timers are unlike task queues in that the timer specifies precisely when the timeout function is to be executed, whereas with a task queue all you know is that the queued task will be performed at some later time. Occasionally the need for such functionality arises in a driver.

4.7 Interrupt Handling

We'll have a short discussion here on the Linux approach to IA-32 style hardware interrupts, with the assumption that the reader is familiar with the 'traditional' irq -> PIC/APIC <-> CPU interrupt mechanism.
The interrupt handler does not run within the context of a process and cannot transfer data to/from user space. The interrupt handler starts executing with hardware interrupts disabled, but it can reenable them if it so wishes, masking its own irq appropriately before the sti. Other than that, the interrupt handler is normal C code. The writer of that code needs to understand how the handler must interact with the hardware. For example, some devices will not issue another interrupt until the interrupt handler has acknowledged its response to the current irq signal, perhaps by clearing a specified I/O port bit.

4.7.1 The Bottom Half Mechanism

The handler needs to do its work quickly and efficiently. If there are subtasks that require significant time but are not urgent, they can be deferred until later. This is the so-called 'bottom half' mechanism provided by Linux. There are, in fact, only 32 'genuine' bottom halves available, and the average device driver writer won't have one assigned to his/her use. However, a driver without a genuine bottom half can employ the immediate queue to provide bottom half functionality. What one does is declare a task queue, initialize its routine field with the bottom half code you wrote, initialize its data field as needed, and then add the initialized task struct to the immediate queue. Finally, mark_bh(IMMEDIATE_BH) is called to schedule the function which will later execute all the functions in the immediate queue.

4.7.2 An Example Bottom Half

Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half, say,

    void some_bottom_half(void *data);

We then take these steps:

- declare a task struct, e.g.

    #include <linux/tqueue.h>
    static struct tq_struct some_bh;

- initialize the struct somewhere appropriate, such as in init_module, e.g.
    some_bh.routine = some_bottom_half;
    some_bh.data = NULL;
    some_bh.sync = 0;

- add code to my_irq_handler to enqueue and mark the bottom half, e.g.

    queue_task(&some_bh, &tq_immediate);
    mark_bh(IMMEDIATE_BH);

We note that bottom halves are actually implemented via the tasklet mechanism in the 2.4 series kernel.

4.7.3 The Tasklet Alternative

The tasklet is quite similar to a task in a predefined task queue. Further, it runs at interrupt time, so the constraints of section 4.5.2 apply. Other important properties of tasklets include these, copied from interrupt.h:

- If tasklet_schedule() is called, then tasklet is guaranteed to be executed on some cpu at least once after this.
- If the tasklet is already scheduled, but its execution is still not started, it will be executed only once.
- If this tasklet is already running on another CPU (or schedule is called from tasklet itself), it is rescheduled for later.
- Tasklet is strictly serialized wrt itself, but not wrt another tasklets. If client needs some intertask synchronization he makes it with spinlocks.

The tasklet_struct follows:

    struct tasklet_struct {
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
    };

4.7.4 A Tasklet Example

Let's say we have an interrupt handler, my_irq_handler, to which we want to add a bottom half via the tasklet mechanism, say,

    void some_bottom_half(unsigned long data);

We then take these steps:

- ensure you have the needed header

    #include <linux/interrupt.h>

- declare and initialize the tasklet_struct:

    DECLARE_TASKLET(some_bh, some_bottom_half, 0);

- add code to my_irq_handler to schedule the bottom half, e.g.

    tasklet_schedule(&some_bh);

Note that you do not need to separately declare

    struct tasklet_struct some_bh;

DECLARE_TASKLET takes care of that.
4.8 The Process Scheduler

4.8.1 Introduction to the scheduler

The Linux kernel is currently not preemptive; kernel mode code lies outside the realm of the scheduler, whose main job is to pick the next process to run. More specifically, we can state that:

- There is no mechanism by which a 'higher priority' process can preempt a kernel mode process, but the latter can decide to relinquish control.
- A kernel process can be interrupted by an interrupt/exception handler. Upon completion of the handler, control returns to the interrupted kernel process.
- The interrupt/exception handler itself runs in kernel mode and can in turn be interrupted by another interrupt/exception handler.
- Kernel mode processes can 'turn off' external hardware interrupts as appropriate.

The scheduler for the current Linux 2.6 series has likely changed; further, the 2.6 kernel can be configured as preemptive. We focus on the 2.4 series here. In any case, it makes a good first exposure to scheduling. The excellent O'Reilly book, Understanding the Linux Kernel by Bovet & Cesati, has a good chapter on this topic and, if you go to the O'Reilly web site (http://www.oreilly.com), you will find that the description of this book includes the chapter on the scheduler as a downloadable sample.

Recall that a process can exist in one of a possible set of states. For Linux, these are:

    TASK_RUNNING
    TASK_INTERRUPTIBLE
    TASK_UNINTERRUPTIBLE
    TASK_STOPPED
    TASK_ZOMBIE

To determine the next process to run, the scheduler chooses from among processes in the TASK_RUNNING state. It is assumed here that the reader has had some exposure to the concepts used in schedulers, so no time will be spent on general background. Further, we will not discuss scheduling for SMP machines. In this chapter, we will discuss:

- scheduling policies and preemption
- when does the scheduler execute?
- process goodness and priorities
- the epoch
- the scheduling algorithm
4.8.2 Scheduling Policies and Preemption

In <linux/sched.h>, we find the three Linux scheduling policies:

    #define SCHED_OTHER 0
    #define SCHED_FIFO  1
    #define SCHED_RR    2

SCHED_OTHER

Normal user tasks run under the SCHED_OTHER policy. As such they are preemptible and run in a time-sliced environment involving dynamic priorities, to be described later.

SCHED_FIFO

This is a (soft) real-time policy. A SCHED_FIFO process is not time sliced and will execute until one of the following conditions becomes true:

- it completes
- it blocks for I/O
- it relinquishes the CPU by calling sched_yield()
- a higher priority process enters the TASK_RUNNING state

SCHED_RR

This also is a (soft) real-time policy. However, SCHED_RR processes are subject to a time slice. A set of SCHED_RR processes having the same priority would be scheduled in classic round robin fashion with respect to each other. Such a process will complete its time slice unless one of the following occurs:

- it completes
- it blocks for I/O
- it relinquishes the CPU by calling sched_yield()
- a higher priority process enters the TASK_RUNNING state

If it is preempted, it is placed at the head of its queue; the next time it runs, it completes its preempted time slice. On the other hand, if the SCHED_RR process completes its time quantum, it is placed at the tail of its queue in the traditional round robin fashion.

4.8.3 When does the scheduler execute?

There are several ways that scheduler execution is triggered. These can be categorized as direct and indirect.

Direct - a call to schedule()

A process running in kernel mode can make a call to schedule. If you look for references to schedule via the Linux cross reference web site, you'll see that it is called in many places, such as:

- file system code
- memory management code
- network management code
- many drivers

A typical scenario is this: A piece of code needs to block. It puts itself on the appropriate wait queue.
It changes its state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. It calls the scheduler.

Indirect - via need_resched = 1

The task struct has a field, need_resched, which is checked when returning to user mode from an interrupt or exception. If this field equals 1, schedule() is called. Hence any time a process sets need_resched to 1, it ensures that schedule() will be called in the near future. Setting need_resched to 1 occurs in the following cases:

- when sched_setscheduler() or sched_yield() is called
- when a process is awakened and has higher goodness than the current process
- when the current process exhausts its time quantum

4.8.4 The Epoch

From the scheduler's viewpoint, CPU time is divided into epochs as a means of encapsulating a group of runnable processes and their respective time quanta. A pseudocode overview follows:

    epoch_init:
        set quantum value for every process, except TASK_ZOMBIE processes
    start_epoch:
        choose the highest goodness TASK_RUNNING process to run
            (goodness is discussed in Section 4.8.5)
        run that process until it blocks, is preempted, relinquishes
            the CPU voluntarily, or finishes its time quantum
        if all runnable processes have exhausted their quanta,
            go to epoch_init
        else go to start_epoch

4.8.5 Process Goodness and Priorities

To make a scheduling decision, Linux calculates what is called the 'goodness' of each process currently in the TASK_RUNNING state and then chooses the process having the highest value of goodness to run next. Linux uses other parameters called priorities as constituents of goodness and therefore was forced to invent the new term 'goodness' rather than overloading the word 'priority'.

The goodness of SCHED_FIFO and SCHED_RR processes

The goodness values of SCHED_FIFO and SCHED_RR processes lie in a range well above the goodness of any SCHED_OTHER process. Hence, a SCHED_OTHER process will never be chosen if there is a runnable (soft) real-time process.
Let's consider how the goodness of a process is calculated. For a SCHED_FIFO or SCHED_RR process,

    goodness = 1000 + rt_priority

where 1 <= rt_priority <= 99. Note that rt_priority is a field in the task structure. The scheduler does not change rt_priority, so it is called a 'static' priority. However, under certain conditions, the rt_priority of a real-time process can be changed by system calls not discussed here.
The goodness of SCHED_OTHER processes

The SCHED_OTHER goodness is somewhat more complex, is dynamic, and (as expected) does not depend on rt_priority. In this case, the goodness depends on two other fields from the task structure:

- priority - both the base time quantum and the base priority for the process
- counter - the number of timer ticks (via irq0) left to the process before its time quantum expires

The goodness is given by

    goodness = priority + counter

The counter is decremented each timer tick, and when it reaches zero the process has exhausted its time quantum. At that point, the formula above is replaced by setting counter = 0 and goodness = 0. The base time quantum is initialized to DEF_PRIORITY for process 0, where currently

    #define DEF_PRIORITY (20*HZ/100)

At the start of a new epoch, the new value of counter for each process is given by

    counter = priority + counter/2

Hence if the process is one that has just exhausted its quantum (counter = 0), it gets a new counter value equal to its base quantum. However, if the process is, for example, in the TASK_INTERRUPTIBLE state, its counter will be enhanced at the start of every epoch. This gives some preference to I/O bound processes.

At a fork, the child always inherits the base time quantum of its parent. It is possible, albeit rare, for a process to change its base time quantum. As a result, most processes in the system have the same base time quantum, DEF_PRIORITY. Also at a fork, the counter of the parent is split in two, half going to the parent and half to the child.
4.8.6 The Scheduling Algorithm

Starting with a very high level, coarse viewpoint, the scheduler does this:

- does some general housekeeping, such as executing all interrupt handler bottom halves and deferred processes on task queues
- calculates the goodness of the processes in the TASK_RUNNING state to determine the next process to run
- turns the CPU over to the chosen process

In this section, we'll look more closely at this scenario; it will perhaps take several readings to assimilate. This is a somewhat more detailed look at the scheduling algorithm. After understanding this, you might go to the source code itself.

1. Run any deferred tasks in the queue tq_scheduler.
2. Run any pending bottom halves.
3. Save current in a local variable, prev.
4. If (prev is a SCHED_RR process that has exhausted its quantum), then assign it a new quantum and put it at the end of the run queue.
5. If (prev is in state TASK_INTERRUPTIBLE and has nonblocked, pending signals), then make its state TASK_RUNNING.
6. If (prev is not in the TASK_RUNNING state), then remove it from the run queue.
7. If the run queue is empty, point next at the idle_task. Otherwise, find the process in the run queue which has the highest goodness and reference that process with next. If there is a tie for highest nonzero goodness between prev and some other process, prev is chosen to save a context switch. If all the runnable processes have zero goodness, this is the end of an epoch and a new quantum is assigned to all processes except TASK_ZOMBIE processes.
8. If (prev != next), then update the context switch statistics and perform a context switch from prev to next.