Anatomy of a system call, part 1

[LWN.net] https://lwn.net/Articles/604287/

July 9, 2014
This article was contributed by David Drysdale

System calls are the primary mechanism by which user-space programs interact with the Linux kernel. Given their
importance, it's not surprising to discover that the kernel includes a wide variety of mechanisms to ensure that
system calls can be implemented generically across architectures, and can be made available to user space in an
efficient and consistent way.
I've been working on getting FreeBSD's Capsicum security framework onto Linux and, as this involves the addition of
several new system calls (including the slightly unusual execveat() system call), I found myself investigating the details of
their implementation. As a result, this is the first of a pair of articles that explore the details of the kernel's implementation
of system calls (or syscalls). In this article we'll focus on the mainstream case: the mechanics of a normal syscall (read()),
together with the machinery that allows x86_64 user programs to invoke it. The second article will move off the
mainstream case to cover more unusual syscalls, and other syscall invocation mechanisms.
System calls differ from regular function calls because the code being called is in the kernel. Special instructions are
needed to make the processor perform a transition to ring 0 (privileged mode). In addition, the kernel code being invoked
is identified by a syscall number, rather than by a function address.
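As a rough user-space illustration of that difference (a sketch, not taken from the kernel sources), the snippet below reads from a pipe twice: once through the ordinary libc read() wrapper, and once by passing the syscall number to glibc's generic syscall() function. The names read_via_wrapper(), read_via_number() and demo_read_paths() are made up for this example, and it assumes a Linux system whose <sys/syscall.h> defines SYS_read.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>  /* SYS_read: this architecture's number for read() */
#include <unistd.h>       /* syscall(), pipe(), read(), write(), close() */

/* Ordinary route: call the libc wrapper like any other function. */
ssize_t read_via_wrapper(int fd, void *buf, size_t count)
{
    return read(fd, buf, count);
}

/* Explicit route: the kernel code is identified by number, not by address. */
ssize_t read_via_number(int fd, void *buf, size_t count)
{
    return syscall(SYS_read, fd, buf, count);
}

/* Hypothetical demo: returns 1 if both routes deliver the expected bytes. */
int demo_read_paths(void)
{
    int fds[2];
    char a[3], b[3];

    int rc = pipe(fds);
    assert(rc == 0);                      /* setup must succeed */
    ssize_t written = write(fds[1], "abcdef", 6);
    assert(written == 6);

    ssize_t n1 = read_via_wrapper(fds[0], a, sizeof(a));
    ssize_t n2 = read_via_number(fds[0], b, sizeof(b));

    close(fds[0]);
    close(fds[1]);
    return n1 == 3 && n2 == 3 &&
           memcmp(a, "abc", 3) == 0 && memcmp(b, "def", 3) == 0;
}
```

Both helpers end up in the same kernel code; the second one simply makes the syscall number visible at the call site.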
Defining a syscall with SYSCALL_DEFINEn()
The read() system call provides a good initial example to explore the kernel's syscall machinery. It's implemented in
fs/read_write.c, as a short function that passes most of the work to vfs_read(). From an invocation standpoint the most
interesting aspect of this code is the way the function is defined using the SYSCALL_DEFINE3() macro. Indeed, from the code, it's
not even immediately clear what the function is called.
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;
    /* ... */
These SYSCALL_DEFINEn() macros are the standard way for kernel code to define a system call, where the n suffix indicates the
argument count. The definition of these macros (in include/linux/syscalls.h) gives two distinct outputs for each system call.
SYSCALL_METADATA(_read, 3, unsigned int, fd, char __user *, buf, size_t, count)
__SYSCALL_DEFINEx(3, _read, unsigned int, fd, char __user *, buf, size_t, count)
{
    /* ... */
The first of these, SYSCALL_METADATA(), builds a collection of metadata about the system call for tracing purposes. It's only
expanded when CONFIG_FTRACE_SYSCALLS is defined for the kernel build, and its expansion gives boilerplate definitions of data
that describes the syscall and its parameters. (A separate page describes these definitions in more detail.)
The __SYSCALL_DEFINEx() part is more interesting, as it holds the system call implementation. Once the various layers of
macros and GCC type extensions are expanded, the resulting code includes some interesting features:
asmlinkage long sys_read(unsigned int fd, char __user * buf, size_t count)
    __attribute__((alias(__stringify(SyS_read))));

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count);
asmlinkage long SyS_read(long int fd, long int buf, long int count);

asmlinkage long SyS_read(long int fd, long int buf, long int count)
{
    long ret = SYSC_read((unsigned int) fd, (char __user *) buf, (size_t) count);
    asmlinkage_protect(3, ret, fd, buf, count);
    return ret;
}

static inline long SYSC_read(unsigned int fd, char __user * buf, size_t count)
{
    /* ... */
First, we notice that the system call implementation actually has the name SYSC_read(), but is static and so is inaccessible
outside this module. Instead, a wrapper function, called SyS_read() and aliased as sys_read(), is visible externally. Looking
closely at those aliases, we notice a difference in their parameter types — sys_read() expects the explicitly declared types
(e.g. char __user * for the second argument), whereas SyS_read() just expects a bunch of (long) integers. Digging into the
history of this, it turns out that the long version ensures that 32-bit values are correctly sign-extended for some 64-bit
kernel platforms, preventing a historical vulnerability.
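The shape of that wrapper can be imitated in ordinary user-space C. The following is only a toy sketch under stated assumptions: TOY_SYSCALL_DEFINE2(), ToyS_add_offset(), TOYC_add_offset() and demo_wrapper_casts() are invented names, the macro is far simpler than the kernel's, and the demo assumes an LP64 platform with GCC or Clang (where converting an out-of-range long to int wraps modulo 2^32). It shows an outer function that accepts register-sized longs and casts each one down to its declared type before calling the typed inner function.

```c
#include <assert.h>

/*
 * Hypothetical toy macro: declares a typed inner function TOYC_<name>() and
 * an outer wrapper ToyS_<name>() that takes every argument as a long.
 */
#define TOY_SYSCALL_DEFINE2(name, t1, a1, t2, a2)          \
    static inline long TOYC_##name(t1 a1, t2 a2);          \
    long ToyS_##name(long a1, long a2)                     \
    {                                                      \
        return TOYC_##name((t1)a1, (t2)a2);                \
    }                                                      \
    static inline long TOYC_##name(t1 a1, t2 a2)

/* A made-up two-argument "syscall" with 32-bit signed parameters. */
TOY_SYSCALL_DEFINE2(add_offset, int, base, int, delta)
{
    return (long)base + delta;
}

/* Hypothetical check: returns 1 if the wrapper narrowed its argument. */
int demo_wrapper_casts(void)
{
    assert(sizeof(long) == 8);                  /* LP64 assumption */

    /* The 32-bit pattern for -1 sitting in a 64-bit value, upper bits clear. */
    long raw = 0xFFFFFFFFL;

    long narrowed = ToyS_add_offset(raw, 1);    /* (int)raw is -1, so -1 + 1 */
    long untouched = raw + 1;                   /* plain 64-bit arithmetic */

    return narrowed == 0 && untouched == 0x100000000L;
}
```

Because the cast happens inside the wrapper, the typed implementation always sees a properly converted 32-bit value, whatever the caller left in the upper half of the register.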
The last things we notice with the SyS_read() wrapper are the asmlinkage directive and asmlinkage_protect() call. The Kernel
Newbies FAQ helpfully explains that asmlinkage means the function should expect its arguments on the stack rather than in
registers, and the generic definition of asmlinkage_protect() explains that it's used to prevent the compiler from assuming
that it can safely reuse those areas of the stack.
To accompany the definition of sys_read() (the variant with accurate types), there's also a declaration in
include/linux/syscalls.h, and this allows other kernel code to call into the system call implementation directly (which
happens in half a dozen places). Calling system calls directly from elsewhere in the kernel is generally discouraged and
is not often seen.
Syscall table entries
Hunting for callers of sys_read() also points the way toward how user space reaches this function. For "generic"
architectures that don't provide an override of their own, the include/uapi/asm-generic/unistd.h file includes an entry
referencing sys_read:
#define __NR_read 63
__SYSCALL(__NR_read, sys_read)
This defines the generic syscall number __NR_read (63) for read(), and uses the __SYSCALL() macro to associate that number
with sys_read(), in an architecture-specific way. For example, arm64 uses the asm-generic/unistd.h header file to fill out a table
that maps syscall numbers to implementation function pointers.
However, we're going to concentrate on the x86_64 architecture, which does not use this generic table. Instead, x86_64
defines its own mappings in arch/x86/syscalls/syscall_64.tbl, which has an entry for sys_read():
0 common read sys_read
This indicates that read() on x86_64 has syscall number 0 (not 63), and has a common implementation for both of the ABIs for
x86_64, namely sys_read(). (The different ABIs will be discussed in the second part of this series.) The syscalltbl.sh script
generates arch/x86/include/generated/asm/syscalls_64.h from the syscall_64.tbl table, specifically generating an invocation of the
__SYSCALL_COMMON() macro for sys_read(). This header file is used, in turn, to populate the syscall table, sys_call_table, which is
the key data structure that maps syscall numbers to sys_name() functions.
x86_64 syscall invocation
Now we will look at how user-space programs invoke the system call. This is inherently architecture-specific, so for the
rest of this article we'll concentrate on the x86_64 architecture (other x86 architectures will be examined in the second
article of the series). The invocation process also involves a few steps, so a clickable diagram, seen at left, may help with
the navigation.
In the previous section, we discovered a table of system call function pointers; the table
for x86_64 looks something like the following (using a GCC extension for array
initialization that ensures any missing entries point to sys_ni_syscall()):
asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
    [0 ... __NR_syscall_max] = &sys_ni_syscall,
    [0] = sys_read,
    [1] = sys_write,
    /* ... */
};
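The same initialization pattern can be tried out in user space. This is a minimal sketch, assuming GCC or Clang (the `[first ... last]` range designator is a GNU extension, and later designators overriding the range may draw a compiler warning); toy_call_table, toy_dispatch() and the toy_* handlers are hypothetical stand-ins, not kernel symbols.

```c
#include <assert.h>
#include <errno.h>

#define TOY_NR_MAX 7   /* hypothetical highest "syscall" number */

typedef long (*toy_call_ptr_t)(long, long, long);

/* Default handler for every slot that has no real implementation. */
static long toy_ni_syscall(long a, long b, long c)
{
    (void)a; (void)b; (void)c;
    return -ENOSYS;
}

/* Stand-in handlers that just return recognisable values. */
static long toy_read(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return 1000 + count;
}

static long toy_write(long fd, long buf, long count)
{
    (void)fd; (void)buf;
    return 2000 + count;
}

static const toy_call_ptr_t toy_call_table[TOY_NR_MAX + 1] = {
    [0 ... TOY_NR_MAX] = toy_ni_syscall,   /* fill every slot first */
    [0] = toy_read,                        /* then override the known ones */
    [1] = toy_write,
};

/* Pick the handler by number, rejecting out-of-range numbers. */
long toy_dispatch(unsigned long nr, long a, long b, long c)
{
    if (nr > TOY_NR_MAX)
        return -ENOSYS;
    return toy_call_table[nr](a, b, c);
}

/* Hypothetical demo: returns 1 if dispatch behaves as described. */
int demo_dispatch(void)
{
    return toy_dispatch(0, 0, 0, 5) == 1005 &&
           toy_dispatch(1, 0, 0, 5) == 2005 &&
           toy_dispatch(5, 0, 0, 0) == -ENOSYS &&   /* unfilled slot */
           toy_dispatch(99, 0, 0, 0) == -ENOSYS;    /* beyond the table */
}
```

Filling the whole range first means a gap in the numbering can never leave a null pointer in the table.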
For 64-bit code, this table is accessed from arch/x86/kernel/entry_64.S, from the system_call
assembly entry point; it uses the RAX register to pick the relevant entry in the array
and then calls it. Earlier in the function, the SAVE_ARGS macro pushes various registers
onto the stack, to match the asmlinkage directive we saw earlier.
Moving outwards, the system_call entry point is itself referenced in syscall_init(), a
function that is called early in the kernel's startup sequence:
void syscall_init(void)
{
    /*
     * LSTAR and STAR live in a bit strange symbiosis.
     * They both write to the same internal register. STAR allows to
     * set CS/DS but only a 32bit target. LSTAR sets the 64bit rip.
     */
    wrmsrl(MSR_STAR, ((u64)__USER32_CS)<<48 | ((u64)__KERNEL_CS)<<32);
    wrmsrl(MSR_LSTAR, system_call);
    wrmsrl(MSR_CSTAR, ignore_sysret);
    /* ... */
The wrmsrl() helper (a wrapper around the WRMSR instruction) writes a value to a model-specific register; in this case, the address of the general system_call syscall
handling function is written to register MSR_LSTAR (0xc0000082), which is the x86_64 model-specific register for handling the
SYSCALL instruction.
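The value written to MSR_STAR in the first wrmsrl() call is plain bit arithmetic, which can be reproduced outside the kernel. The sketch below is only an illustration: the selector values 0x23 and 0x10 are assumptions standing in for __USER32_CS and __KERNEL_CS (the article does not give them), and toy_star_value() is a made-up helper.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed selector values; not taken from the article. */
#define TOY_USER32_CS 0x23u
#define TOY_KERNEL_CS 0x10u

/* Pack the two selectors the same way as the MSR_STAR expression above. */
uint64_t toy_star_value(uint16_t user32_cs, uint16_t kernel_cs)
{
    return ((uint64_t)user32_cs << 48) | ((uint64_t)kernel_cs << 32);
}

/* Hypothetical demo: returns 1 if the packed value has the expected layout. */
int demo_star_value(void)
{
    uint64_t star = toy_star_value(TOY_USER32_CS, TOY_KERNEL_CS);

    assert((star & 0xFFFFFFFFu) == 0);   /* low 32 bits stay clear here */
    return star == 0x0023001000000000ULL;
}
```

With those assumed selectors, the user selector lands in bits 48-63 and the kernel selector in bits 32-47.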
And this gives us all we need to join the dots from user space to the kernel code. The standard ABI for how x86_64 user
programs invoke a system call is to put the system call number (0 for read) into the RAX register, and the other parameters
into specific registers (RDI, RSI, RDX for the first 3 parameters), then issue the SYSCALL instruction. This instruction causes
the processor to transition to ring 0 and invoke the code referenced by the MSR_LSTAR model-specific register — namely
system_call. The system_call code pushes the registers onto the kernel stack, and calls the function pointer at entry RAX in
the sys_call_table table — namely sys_read(), which is a thin, asmlinkage wrapper for the real implementation in SYSC_read().
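The user-space half of that sequence can be sketched with GCC-style inline assembly. This is a hedged example rather than how any particular C library does it: raw_syscall3() and demo_raw_read() are invented names, the register constraints follow the convention described above (number in RAX, arguments in RDI, RSI and RDX), and RCX and R11 are listed as clobbered because the SYSCALL instruction overwrites them. On anything other than x86_64 Linux the sketch falls back to libc's syscall(), which reports errors through errno instead of returning a negative error code.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>  /* SYS_read (0 on x86_64) */
#include <unistd.h>       /* pipe(), write(), close(), syscall() */

#if defined(__x86_64__) && defined(__linux__)
/* Load the registers and issue SYSCALL; the result comes back in RAX. */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    long ret;

    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                     : "rcx", "r11", "memory");
    return ret;
}
#else
/* Fallback for other platforms: let libc issue the system call. */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
    return syscall(nr, a1, a2, a3);
}
#endif

/* Hypothetical demo: returns the byte count read (5), or -1 on mismatch. */
long demo_raw_read(void)
{
    int fds[2];
    char buf[8] = {0};

    int rc = pipe(fds);
    assert(rc == 0);                      /* setup must succeed */
    ssize_t written = write(fds[1], "hello", 5);
    assert(written == 5);

    long n = raw_syscall3(SYS_read, fds[0], (long)buf, (long)sizeof(buf));

    close(fds[0]);
    close(fds[1]);
    return (n == 5 && memcmp(buf, "hello", 5) == 0) ? n : -1;
}
```

On the x86_64 path a failing call returns a small negative number (the negated errno value) directly, since no libc wrapper is there to translate it.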
Now that we've seen the standard implementation of system calls on the most common platform, we're in a better position
to understand what's going on with other architectures, and with less-common cases. That will be the subject of the
second article in the series.
Bravo
Posted Jul 10, 2014 12:56 UTC (Thu) by smitty_one_each (subscriber, #28989) [Link]
Bravo, thank you, and may Fortune smile on your day, Drysdale.
This is precisely the sort of well-written intermediate material that is needed to help
(a) the n00bs like me, and
(b) the ninjas who haven't reviewed it lately.
More of this!
Cheers,
Chris
Calling syscalls from the kernel

Posted Jul 10, 2014 22:21 UTC (Thu) by bourbaki (guest, #84259) [Link]
Hello David,
Thanks for this article, it looks like the start of a very promising series. I have a question though, that you or other
kernel hackers may be able to answer:
> Calling system calls directly from elsewhere in the kernel is generally discouraged and is not often seen.
I have searched for this a bit but couldn't find a specific reason why it's "generally discouraged". Could
someone enlighten me?
Thanks

Posted Jul 11, 2014 11:33 UTC (Fri) by roblucid (subscriber, #48964) [Link]
One general reason would be that lower-layer software breaking out and calling higher-layer software
turns a "calling tree" into a "network graph", which is much harder to analyse and makes its performance harder to reason about.
More simply, a system call is designed for user-space interaction; in general, something in the kernel ought to be using
an in-kernel service rather than repurposing a user-space interface.
[ As no one has answered this question yet ]

Posted Jul 14, 2014 9:33 UTC (Mon) by drysdale (subscriber, #95971) [Link]
That's roughly my understanding too. System calls typically check all their (user-provided) arguments carefully, and
also copy any needed chunks of memory from __user * pointers -- all of which is inefficient for internal usage. So it's
more common to have an inner function that the rest of the kernel calls, and the syscall wraps.

Posted Jul 23, 2014 14:32 UTC (Wed) by YogeshC (subscriber, #49966) [Link]
In that case, why has this particular syscall been used in half a dozen places inside the kernel? Is there any specific
(or general) reason(s) for read/write syscalls to be called from within the kernel space on so many occasions?

Posted Aug 11, 2014 12:45 UTC (Mon) by rwmj (guest, #5474) [Link]
No one answered your question so I'll have a go.
The main user of sys_read at the moment is the code for unpacking the initramfs. You could imagine this "should"
be done by a userspace process, because it's doing a lot of userspace-type stuff, like uncompressing a cpio file and
then creating a tree of directories and files from it. It's nearly the equivalent of running:
zcat initramfs | cpio -id
However because the initramfs is needed before userspace is up -- because creating the initramfs is creating the
initial userspace -- they have to do this userspace-type stuff in the kernel instead.
So it's a layering violation, but for understandable reasons.

Posted Jul 11, 2014 9:01 UTC (Fri) by geuder (subscriber, #62854) [Link]
Thank you for the very detailed description. Occasionally I have had to track down (parts of) this from the source while
debugging some issue, and it can be very tedious.
A little detail, but a nuisance for humans, especially if you debug an embedded target and compare it to your desktop,
which is known to work correctly: what is the technical reason for having partially architecture-specific syscall numbers?
Passing a number in a register does not look like anything inherently architecture-specific.
If you quickly need to see the syscall numbers for a given architecture, I have seen the hint to look at the strace source.
I'm just on my phone now, so I can't compare the sources, but I vaguely remember it was indeed easier to
navigate than the kernel source proper. Is it that strace distributes ready-made lists, while for the kernel they need to
be built for each architecture?

Posted Jul 11, 2014 9:53 UTC (Fri) by sasha (subscriber, #16070) [Link]
> What is the technical reason to have partially architecture specific syscall number?
For "compatibility" with other, older OSes used on the architecture. No, I do not understand what kind of
"compatibility" you can get in such a way, but it is the main reason.
Posted Jul 11, 2014 16:55 UTC (Fri) by alonz (subscriber, #815) [Link]
I believe by now the reason is mainly historical.
Originally (in the days of Linux 1.x / 2.0) Linux attempted to be binary compatible to existing Unices on common
hardware (the personality(2) system call is also part of this). As time passed, compatibility with other Unices became
mostly a non-issue – but now we do need to maintain binary compatibility with older Linux binaries…

Posted Jul 11, 2014 22:12 UTC (Fri) by geuder (subscriber, #62854) [Link]
> Originally (in the days of Linux 1.x / 2.0) Linux attempted to be binary compatible to existing Unices on common
hardware
That sounds like a reasonable explanation. Besides that back in those days what Unices were running on x86_64 and
ARM? And these 2 differ. Yeah well, maybe the dependencies are not direct, but somehow indirect over other
platforms?

Posted Jul 11, 2014 22:22 UTC (Fri) by sfeam (subscriber, #2841) [Link]
As I recall, linux binaries for alpha would run under Digital Unix on alpha.

Posted Jul 13, 2014 0:00 UTC (Sun) by gb (subscriber, #58328) [Link]
Why was such a strange implementation chosen: put values into registers, switch to ring 0, then put these values onto
the stack? Wouldn't it make more sense to keep the values in registers until the real syscall function, which may, if it
wants, push them onto the stack?

Posted Jul 14, 2014 9:45 UTC (Mon) by drysdale (subscriber, #95971) [Link]
Having the arguments in registers for the ring transition means that there's no need for fancy footwork to get at the
userspace stack memory (compare the innards of copy_from_user()).
Storing the registers on the kernel stack allows the state of the registers to be restored on the return to userspace.
But once the parameters are available on the stack, there's no need to preserve them in the registers too – the
syscall can get its arguments from the stack (i.e. be asmlinkage) and can immediately use (and clobber) the registers.

Posted Jul 16, 2014 17:03 UTC (Wed) by nix (subscriber, #2304) [Link]
Quite. Keeping the args on the stack is a non-starter: userspace stacks are swappable, and you *don't* want to have
to go checking to see if the args have been swapped out in the instant of ring transition: it's the sort of terribly
narrow race that leads to code that rots and then silently breaks in almost-impossible-to-debug ways, and for almost
no gain.
But obviously the args have to end up on the stack -- or, rather, have to end up wherever the C ABI for the platform
says they should (possibly optimized by asmlinkage, but still, something the compiler supports).
Thanks for this article: I too have wasted entirely too much time tracking this down in pieces now and then: it's nice
to have a reference here for next time. Looking forward to the next one.
Copyright © 2014, Eklektix, Inc.

Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds