Figure 9-11: Return of ext2_get_branch.

Now that the position in the indirection chain in which no further blocks are allocated is clear, the Second Extended Filesystem must find out where there is space in the partition to add one or more new blocks to the file. This is not a trivial task because ideally the blocks of a file should be contiguous or, if this is not feasible, at least as close together as possible. This ensures that fragmentation is minimized and results not only in better utilization of hard disk capacity but in faster read/write operations because read/write head travel is reduced.
Several steps are involved in searching for a new block. A search is first made for a goal block; from the perspective of the filesystem, this block is the ideal candidate for allocation. The search for a goal block is based on general principles only and does not take account of the actual situation in the filesystem. The ext2_find_goal function is invoked to search for the best new block. When searching is performed, it is necessary to distinguish between two situations:

❑ When the block to be allocated logically immediately follows the block last allocated in the file (in other words, data are to be written contiguously), the filesystem tries to write to the next physical block on the hard disk. This is obvious — if data are stored sequentially in a file, they should if at all possible be stored contiguously on the hard disk.

❑ If the position of the new logical block is not immediately after the last allocated block, the ext2_find_near function is invoked to find the most suitable new block. Depending on the specific situation, it finds a block close to the indirection block or at least in the same cylinder group. I won't bother with the details here.
Once it has these two pieces of information (the position in the indirection chain in which there are no further data blocks, and the desired address of the new block), the kernel sets about reserving a block on the hard disk. Of course, there is no guarantee that the desired address is really free, so the kernel may have to be satisfied with poorer alternatives, which unavoidably entails data fragmentation. Not only might new data blocks be required — it can also be the case that some blocks are required to hold indirection information. ext2_blks_to_allocate computes the total number of new blocks, that is, the sum of data and (single, double, and triple) indirection blocks. The allocation proper is then done by ext2_alloc_branch. The parameters passed to this function include the desired address of the new block, information on the last incomplete part of the indirection chain, and the number of indirection levels still missing up to the new data block. Expressed differently, the function returns a linked list of indirection and data blocks that can be added to the existing indirection list of the filesystem. Last but not least, ext2_splice_branch adds the resulting hierarchy (or, in the simplest case, the new data block) to the existing network and performs several relatively unimportant updates on Ext2 data structures.
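The sequence just described can be condensed into a short sketch. This is an illustration only: argument lists are elided, error handling is omitted, and the real kernel code that drives these helpers contains considerably more detail.

        partial = ext2_get_branch(...);     /* where does the existing indirection chain end? */
        if (partial) {                      /* chain incomplete: new blocks must be allocated */
                goal  = ext2_find_goal(...);        /* preferred on-disk position              */
                count = ext2_blks_to_allocate(...); /* data blocks plus missing indirection    */
                err   = ext2_alloc_branch(...);     /* reserve blocks, build the new sub-chain */
                ext2_splice_branch(...);            /* hook the sub-chain into the inode       */
        }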
Block Allocation

ext2_alloc_branch is responsible for allocating the required blocks for a given new path and setting up the chain that connects them. At first glance, this seems an easy task, as the code flow diagram in Figure 9-12 might suggest.
The function calls ext2_alloc_blocks, which, in turn, relies on ext2_new_blocks to reserve the required new blocks. Since the function always allocates consecutive blocks, one single call might not be sufficient to obtain the total number of required blocks: if the filesystem becomes fragmented, it can happen that no sufficiently large consecutive region is available. However, this is no problem: ext2_new_blocks is called multiple times until at least the number of blocks that is required for the indirection mechanism has been allocated. The surplus blocks can be used as data blocks. Finally, ext2_alloc_branch need only set up the Indirect instances for the indirection blocks, and it is done. Obviously, the hard work is hidden in ext2_new_blocks. The code flow diagram in Figure 9-13 proves that this is really the case!
Figure 9-12: Code flow diagram for ext2_alloc_branch.
Figure 9-13: Code flow diagram for ext2_new_blocks.

Recall that Ext2 supports pre-allocation, and this needs to be partly handled in ext2_new_blocks. Since the mechanism is already complicated enough without the details of pre-allocation, let's first consider it without this extra complexity. We will come back to how pre-allocation works exactly afterward. Consider the prototype of ext2_new_blocks (note that ext2_fsblk_t is typedef'd to unsigned long and represents a block number).
fs/ext2/balloc.c
ext2_fsblk_t ext2_new_blocks(struct inode *inode, ext2_fsblk_t goal,
                             unsigned long *count, int *errp)
{
...
inode represents the inode for which the allocation is performed, while count designates the desired number of blocks. Since the function returns the number of the first block in the allocated block sequence, possible error codes cannot be passed as a function result; the pointer errp is used instead. Finally, the goal parameter allows for specifying a goal block. This provides a hint to the allocation code about which block would be preferred. This is only a suggestion: Should this block not be available, then any other block can be selected. First of all, the function decides if the pre-allocation mechanism should be used, that is, whether a reserved but not yet allocated area should be created. The choice is simple: If the inode is equipped with information for pre-allocation, then use it; otherwise, not. Allocations only make sense if the filesystem contains at least one free block, and ext2_has_free_blocks checks this. If the condition is not fulfilled, the allocation can immediately be canceled. In a world where all wishes come true, the goal block will be free, but in reality, this need not be the case. In fact, the goal block need not even be a valid block at all, and the kernel needs to check this (es is the ext2_super_block instance for the filesystem under consideration).
fs/ext2/balloc.c
        if (goal < le32_to_cpu(es->s_first_data_block) ||
            goal >= le32_to_cpu(es->s_blocks_count))
                goal = le32_to_cpu(es->s_first_data_block);
        group_no = (goal - le32_to_cpu(es->s_first_data_block)) /
                        EXT2_BLOCKS_PER_GROUP(sb);
        goal_group = group_no;
retry_alloc:
        gdp = ext2_get_group_desc(sb, group_no, &gdp_bh);
If the goal block is not within a valid range, the first data block of the filesystem is picked as the new goal. In any case, the block group of the goal block is computed. ext2_get_group_desc provides the corresponding group descriptor. Afterward, a little bookkeeping for the pre-allocation mechanism is again necessary: If reservations are enabled but there is not enough free space left to back them, the mechanism is turned off. By calling ext2_try_to_allocate_with_rsv, the kernel then tries to actually reserve the desired data blocks — possibly using the reservation mechanism. As promised, this function is discussed below. For now, let us just observe the two possible outcomes:
1. The allocation was successful. In this case, ext2_new_blocks needs to update the statistical information, but is otherwise done and can return to the caller.

2. If the request could not be satisfied in the current block group, then all other block groups are tried. If this still fails, the whole allocation is restarted without the pre-allocation mechanism in case it was still turned on at this point — recall that it might have been turned off by default or in the previous course of action. A sketch of this retry logic follows below.
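The retry logic summarized in the list above can be reduced to the following control-flow sketch. This is an illustration only: argument lists are abbreviated, the labels and local variable names are illustrative, and the statistics updates, locking, and error paths of the real function are omitted.

        if (!ext2_has_free_blocks(...))                /* nothing left to allocate at all   */
                goto out;

retry_alloc:
        gdp = ext2_get_group_desc(sb, group_no, &gdp_bh);
        ...                                            /* disable reservations if the group */
                                                       /* has too little space to back them */
        ret = ext2_try_to_allocate_with_rsv(sb, group_no, bitmap_bh,
                                            grp_goal, my_rsv, &num);
        if (ret >= 0)
                goto allocated;                        /* outcome 1: update statistics      */

        ...                                            /* outcome 2: try all other groups   */

        if (my_rsv) {                                  /* still nothing: retry once more    */
                my_rsv = NULL;                         /* without the reservation mechanism */
                goto retry_alloc;
        }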
Pre-allocation Handling

In the hierarchy of ext2 allocation functions, we've come as deep down as ext2_try_to_allocate_with_rsv. However, there's good news: The kernel source code cheers us up by remarking that this is ‘‘the
main function used to allocate a new block and its reservation window.’’ We're almost done! Note that now might also be a good opportunity to remember the pre-allocation data structures introduced in Section 9.2.2 since they form the core of the reservation window mechanism. The code flow diagram for ext2_try_to_allocate_with_rsv is shown in Figure 9-14. Basically, the function handles some reservation window issues and passes the proper allocation down to ext2_try_to_allocate, the last link in the chain. ext2_try_to_allocate_with_rsv has no direct connection with the inode for which the allocation is performed, but the reservation window is passed as a parameter. If a NULL pointer is given instead, this means that the reservation mechanism is not supposed to be used.
Figure 9-14: Code flow diagram for ext2_try_to_allocate_with_rsv.

Thus the first check is to determine whether using pre-allocation is desired or possible at all. Should this not be the case, then ext2_try_to_allocate can be called immediately. Likewise, the function also has a parameter for the reservation information, and if a NULL pointer is passed instead, no pre-allocation will be used. If a reservation window exists, the kernel checks if the pre-allocation information needs to be updated, and does so if necessary. In this case, ext2_try_to_allocate is called with the order to use the reservation window. After calling ext2_try_to_allocate, the reservation hit statistics need to be updated by ext2_try_to_allocate_with_rsv in case an allocation could be performed in the allocation window. If the required number of blocks could be allocated, we are finished. Otherwise, the reservation window settings are readapted, and ext2_try_to_allocate is called again.
By what criteria does the kernel update the reservation window? Observe the allocation loop:
fs/ext2/balloc.c
static ext2_grpblk_t
ext2_try_to_allocate_with_rsv(struct super_block *sb, unsigned int group,
                              struct buffer_head *bitmap_bh,
                              ext2_grpblk_t grp_goal,
                              struct ext2_reserve_window_node *my_rsv,
                              unsigned long *count)
{
...
        group_first_block = ext2_group_first_block_no(sb, group);
        group_last_block = group_first_block + (EXT2_BLOCKS_PER_GROUP(sb) - 1);
...
        while (1) {
                if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) ||
                    !goal_in_my_reservation(&my_rsv->rsv_window,
                                            grp_goal, group, sb)) {
                        if (my_rsv->rsv_goal_size < *count)
                                my_rsv->rsv_goal_size = *count;
                        ret = alloc_new_reservation(my_rsv, grp_goal, sb,
                                                    group, bitmap_bh);
                        if (!goal_in_my_reservation(&my_rsv->rsv_window,
                                                    grp_goal, group, sb))
                                grp_goal = -1;
                } else if (grp_goal >= 0) {
                        int curr = my_rsv->rsv_end -
                                   (grp_goal + group_first_block) + 1;

                        if (curr < *count)
                                try_to_extend_reservation(my_rsv, sb,
                                                          *count - curr);
                }
...
                ret = ext2_try_to_allocate(sb, group, bitmap_bh, grp_goal,
                                           &num, &my_rsv->rsv_window);
                if (ret >= 0) {
                        my_rsv->rsv_alloc_hit += num;
                        *count = num;
                        break;                          /* succeed */
                }
                num = *count;
        }
        return ret;
}
If either there is no reservation associated with the file (checked by rsv_is_empty) or the desired goal block is not within the current reservation window (checked by goal_in_my_reservation), the kernel needs to create a new reservation window. This task is delegated to alloc_new_reservation, which contains the goal block. A more detailed discussion of the function follows below. Although alloc_new_reservation will try to find a region that contains the goal block, this might not be possible. In this case, grp_goal is set to −1, which signifies that no desired goal should be used.
Figure 9-15: Check if a desired allocation can be fulfilled with a given reservation window.
If the file is equipped with a reservation window and a goal block is specified (as checked by the condition grp_goal >= 0), the kernel has to check if the desired allocation will fit into the existing reservation. Starting from the desired allocation goal, which specifies a block number relative to the beginning of the group, the code computes the number of blocks that remain until the end of the reservation window. The calculation is illustrated in Figure 9-15. If the desired allocation as given by count is larger than the possible region, the reservation window is increased with try_to_extend_reservation. The function simply queries the pre-allocation data structures to see if no other reservation window prevents the current window from growing, and extends it if possible. Finally, the kernel can pass the allocation request together with the (possibly modified) reservation window to ext2_try_to_allocate. While the function guarantees that a consecutive number of blocks is allocated if some free space can be found, it cannot guarantee that the desired number of blocks is available. This has some implications on the returned values. While the first allocated block is returned as the direct function result, the number of allocated blocks must be passed upward via the pointer num. If some space could be allocated, ret is larger than or equal to zero. The kernel then needs to update the allocation hit counter rsv_alloc_hit and return the number of allocated blocks via the count pointer. If the allocation has failed, the loop needs to start again. Since ret is negative in this case, the kernel allocates a new reservation window in the next run as guaranteed by the condition ret < 0 in the initial if conditional. Otherwise, everything runs again as described. Finally, ext2_try_to_allocate is responsible for the low-level allocation that directly interacts with the block bitmaps. Recall that the function can work with or without a reservation window. The kernel now needs to search through the block bitmap, and thus an interval for the search needs to be determined. Note that the boundaries are specified relative to the current block group. This means that numbering starts from zero. A number of scenarios are distinguished, and Figure 9-16 illustrates various cases.
Figure 9-16: Search interval selection for block allocation in ext2_try_to_allocate.
❑ If a reservation window is available and the reservation starts within the block group, the absolute block number needs to be converted into a relative start position. For instance, if the block group starts at block 100 and the reservation window at block 120, the relative start block within the group is block 20. If the reservation window starts before the block group, block number 0 is used as the starting point. If the reservation window goes beyond the current block group, the search interval is restricted to the last block of the block group.

❑ If no reservation window is present, but a goal block is given, the goal can be directly used as the start block. If no reservation window is available and no goal block is specified, the search starts from block 0. In both cases, the end of the block group is used as the end block of the search. A sketch of this interval computation follows below.
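The case distinction above translates into a computation of the search boundaries. The following is an illustrative sketch only; the relative block numbers and variable names mirror the kernel listing that follows, but the actual code differs in detail.

        if (my_rsv) {
                /* the window may reach beyond the group on either side: clamp it */
                if (my_rsv->rsv_start <= group_first_block)
                        start = 0;
                else
                        start = my_rsv->rsv_start - group_first_block;
                if (my_rsv->rsv_end >= group_last_block)
                        end = EXT2_BLOCKS_PER_GROUP(sb) - 1;
                else
                        end = my_rsv->rsv_end - group_first_block;
        } else {
                start = (grp_goal > 0) ? grp_goal : 0;  /* goal block or start of group   */
                end = EXT2_BLOCKS_PER_GROUP(sb) - 1;    /* always search to the group end */
        }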
ext2_try_to_allocate then proceeds as follows: fs/ext2/balloc.c
static int
ext2_try_to_allocate(struct super_block *sb, int group,
                     struct buffer_head *bitmap_bh, ext2_grpblk_t grp_goal,
                     unsigned long *count,
                     struct ext2_reserve_window *my_rsv)
{
...
        ext2_grpblk_t start, end;
...
        /* Determine start and end */
...
repeat:
        if (grp_goal < 0) {
                grp_goal = find_next_usable_block(start, bitmap_bh, end);
...
                if (!my_rsv) {
                        int i;

                        for (i = 0; i < 7 && grp_goal > start &&
                                    !ext2_test_bit(grp_goal - 1,
                                                   bitmap_bh->b_data);
                             i++, grp_goal--)
                                ;
                }
        }
        start = grp_goal;
...
If no goal block was given (grp_goal < 0), the kernel uses find_next_usable_block to find the first free bit in the previously selected interval in the block allocation bitmap. find_next_usable_block first performs a bitwise search up to the next 64-bit boundary.17 This tries to
find a free block near the allocation goal. If one is available, the function returns the bit position. 17 If the starting block is zero, then find_next_usable_block assumes that no goal block was given and does not perform the near-goal search. Instead, it starts immediately with the next search step.
If no free bit is found near the desired goal, the search is not performed bitwise, but bytewise to increase performance. A free byte corresponds to eight successive zeros or eight free file blocks. If a free byte is found, the address of the first bit is returned. As a last resort, a bitwise scan over the whole range is performed. This equates to searching for a single, isolated free block and is, of course, the worst-case scenario, which, unfortunately, cannot always be avoided. Let us go back to ext2_try_to_allocate. Since the bit might originate from a bytewise search, the last seven preceding bits are scanned for a free area. (A larger number of preceding bits is not possible because the kernel would then have found a free byte in the previous step.) The newly allocated block is always shifted as far to the left as possible to ensure that the free area to its right is as large as possible. What now remains to be done is a simple bitwise traversal of the block bitmap. In each step, a block is added to the allocation if the bit is not set. Recall that allocating a block is equivalent to setting the corresponding bit in the block bitmap to one. The traversal stops when either an occupied block is encountered or a sufficient number of blocks has been allocated.
fs/ext2/balloc.c
        if (ext2_set_bit_atomic(sb_bgl_lock(EXT2_SB(sb), group), grp_goal,
                                bitmap_bh->b_data)) {
                /*
                 * The block was allocated by another thread, or it was
                 * allocated and then freed by another thread
                 */
                start++;
                grp_goal++;
                if (start >= end)
                        goto fail_access;
                goto repeat;
        }
        num++;
        grp_goal++;
        while (num < *count && grp_goal < end &&
               !ext2_set_bit_atomic(sb_bgl_lock(EXT2_SB(sb), group),
                                    grp_goal, bitmap_bh->b_data)) {
                num++;
                grp_goal++;
        }
        *count = num;
        return grp_goal - num;

fail_access:
        *count = num;
        return -1;
}
The only complication stems from the fact that the initial bit might have been allocated by another process between the time it was chosen and when the kernel tries to allocate it. In this case, both the starting position and group goal are increased by 1, and the search is started again.
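To sum up the three-stage strategy of find_next_usable_block described above, here is a self-contained model. It operates on a plain byte array with simple loops instead of the buffer heads and optimized bit helpers used by the kernel, so it illustrates the search order rather than the actual implementation.

        /* Return the number of the first free bit in [start, end), or -1.
         * A zero bit means a free block; parameter roles mirror the kernel function. */
        static int find_free_bit(const unsigned char *bitmap, int start, int end)
        {
                int i, boundary;

                /* 1. bitwise search near the goal, only up to the next 64-bit boundary */
                if (start > 0) {
                        boundary = (start | 63) + 1;
                        for (i = start; i < end && i < boundary; i++)
                                if (!(bitmap[i / 8] & (1 << (i % 8))))
                                        return i;
                }

                /* 2. bytewise search: a zero byte means eight free blocks in a row */
                for (i = (start + 7) / 8; i < end / 8; i++)
                        if (bitmap[i] == 0)
                                return i * 8;

                /* 3. last resort: bitwise scan of the whole remaining range */
                for (i = start; i < end; i++)
                        if (!(bitmap[i / 8] & (1 << (i % 8))))
                                return i;

                return -1;
        }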
Creating New Reservations

Above, it was mentioned that alloc_new_reservation is employed to create new reservation windows. This is an important task now discussed in detail. An overview of the function is presented in Figure 9-17.
Figure 9-17: Code flow diagram for alloc_new_reservation.

First, alloc_new_reservation determines the block from which the search for a reservation window starts.
fs/ext2/balloc.c
static int alloc_new_reservation(struct ext2_reserve_window_node *my_rsv,
                                 ext2_grpblk_t grp_goal, struct super_block *sb,
                                 unsigned int group, struct buffer_head *bitmap_bh)
{
        struct ext2_reserve_window_node *search_head;
        ext2_fsblk_t group_first_block, group_end_block, start_block;
        ext2_grpblk_t first_free_block;
        struct rb_root *fs_rsv_root = &EXT2_SB(sb)->s_rsv_window_root;
        unsigned long size;
        int ret;

        group_first_block = ext2_group_first_block_no(sb, group);
        group_end_block = group_first_block + (EXT2_BLOCKS_PER_GROUP(sb) - 1);

        if (grp_goal < 0)
                start_block = group_first_block;
        else
                start_block = grp_goal + group_first_block;

        size = my_rsv->rsv_goal_size;
...
If the inode is already equipped with a reservation window, the allocation hit counter is evaluated and the window resized accordingly: fs/ext2/balloc.c
        if (!rsv_is_empty(&my_rsv->rsv_window)) {
                /*
                 * if the old reservation is cross group boundary
                 * and if the goal is inside the old reservation window,
                 * we will come here when we just failed to allocate from
                 * the first part of the window. We still have another part
                 * that belongs to the next group. In this case, there is no
                 * point to discard our window and try to allocate a new one
                 * in this group(which will fail). we should
                 * keep the reservation window, just simply move on.
                 */
                if ((my_rsv->rsv_start <= group_end_block) &&
                    (my_rsv->rsv_end > group_end_block) &&
                    (start_block >= my_rsv->rsv_start))
                        return -1;

                if ((my_rsv->rsv_alloc_hit >
                     (my_rsv->rsv_end - my_rsv->rsv_start + 1) / 2)) {
                        /*
                         * if the previously allocation hit ratio is
                         * greater than 1/2, then we double the size of
                         * the reservation window the next time,
                         * otherwise we keep the same size window
                         */
                        size = size * 2;
                        if (size > EXT2_MAX_RESERVE_BLOCKS)
                                size = EXT2_MAX_RESERVE_BLOCKS;
                        my_rsv->rsv_goal_size = size;
                }
        }
...
The kernel code precisely states what is going on (and especially why this is going on), and for a change, there’s nothing further to add. If new boundaries for the window have been computed (or if there has not been a reservation window before), search_reserve_window checks if a reserve window that contains the allocation goal is already present. If this is not the case, the window before the allocation goal is returned. The selected window is used as a starting point for find_next_reservable_window, which tries to find a suitable new reservation window. Finally, the kernel checks if the window contains at least a single free bit. If not, it does not make any sense to pre-allocate space, so the window is discarded. Otherwise, the function returns successfully.
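The window search described in this paragraph boils down to the following control-flow sketch. Only functions named in the text appear, their arguments are elided, and window_has_free_bit() is a hypothetical shorthand for the free-bit check performed at the end of each iteration.

        search_head = search_reserve_window(...);    /* window at or before the goal     */

        while (1) {
                if (find_next_reservable_window(...) < 0)
                        return -1;                    /* no further window: give up       */
                if (window_has_free_bit(...))         /* at least one free block inside?  */
                        return 0;                     /* keep this reservation window     */
                search_head = my_rsv;                 /* otherwise retry from the next    */
        }                                             /* reservable space                 */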
Creating and Deleting Inodes

Inodes must also be created and deleted by low-level functions of the Ext2 filesystem. This is necessary when a file or directory is created (or deleted) — the core code of the two variants hardly differs. Let's begin with the creation of a file or directory. As explained in Chapter 8, the open and mkdir system calls are available for this purpose. They work through the various functions of the virtual filesystem to arrive at the create and mkdir functions, each of which is pointed to by a function pointer in the file-specific instance of inode_operations. The ext2_create and ext2_mkdir functions are inserted as described in Section 9.2.4. Both functions are located in fs/ext2/namei.c. The flow of both actions is shown in the code flow diagrams in Figures 9-18 and 9-19.
Figure 9-18: Code flow diagram for ext2_mkdir.
Figure 9-19: Code flow diagram for ext2_create.

Let us first examine how new directories are created using mkdir. The kernel passes via the VFS function vfs_mkdir to the ext2_mkdir low-level function with the following signature.
fs/ext2/namei.c
static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)

dir is the directory in which the new subdirectory is to be created, and dentry specifies the pathname of the new directory. mode specifies the access mode of the new directory.
Once ext2_new_inode has reserved a new inode at a suitable place on the hard disk (the section below describes how the kernel finds the most suitable location with the help of the Orlov allocator), it is provided with the appropriate file, inode, and address space operations. fs/ext2/namei.c
static int ext2_mkdir(struct inode * dir, struct dentry * dentry, int mode)
{
...
        inode->i_op = &ext2_dir_inode_operations;
        inode->i_fop = &ext2_dir_operations;
        if (test_opt(inode->i_sb, NOBH))
                inode->i_mapping->a_ops = &ext2_nobh_aops;
        else
                inode->i_mapping->a_ops = &ext2_aops;
...
}
ext2_make_empty fills the new directory with the default . and .. entries by generating the corresponding directory entry structures and writing them to the data block. Then ext2_add_link adds the new directory to the existing
directory data of the initial inode in the format described in Section 9.2.2. New files are created in a similar way. The sys_open system call arrives at vfs_create, which again invokes the ext2_create low-level function of the Ext2 filesystem. Once it has allocated a new inode on the hard disk by means of ext2_new_inode, the appropriate file, inode, and address space structures are added, this time using the variants for regular files, that is, ext2_file_inode_operations and ext2_file_operations. There is no difference between the address space operations for directory inodes and file inodes. Responsibility for adding the new file to the directory hierarchy is assumed by ext2_add_nondir, which immediately invokes the familiar ext2_add_link function.
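For comparison with the ext2_mkdir excerpt shown earlier, ext2_create from fs/ext2/namei.c looks roughly as follows. This is a condensed sketch: error handling is abbreviated, and details such as the exact signature vary between kernel versions.

static int ext2_create(struct inode *dir, struct dentry *dentry, int mode,
                       struct nameidata *nd)
{
        struct inode *inode = ext2_new_inode(dir, mode);   /* reserve an inode on disk   */
        int err = PTR_ERR(inode);

        if (!IS_ERR(inode)) {
                inode->i_op = &ext2_file_inode_operations;  /* regular-file variants      */
                inode->i_fop = &ext2_file_operations;
                if (test_opt(inode->i_sb, NOBH))
                        inode->i_mapping->a_ops = &ext2_nobh_aops;
                else
                        inode->i_mapping->a_ops = &ext2_aops;
                mark_inode_dirty(inode);
                err = ext2_add_nondir(dentry, inode);       /* link into the directory    */
        }
        return err;
}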
Registering Inodes

When directories and files are created, the ext2_new_inode function is used to find a free inode for the new filesystem entry. However, the search strategy varies according to situation — this can be distinguished by the mode argument (S_IFDIR is set for directories but not for regular files). The search itself is not performance-critical, but it is very important for filesystem performance that the inode be optimally positioned to permit rapid access to data. For this reason, this section is devoted to an examination of the inode distribution strategy adopted by the kernel. The kernel applies three different strategies:
1. Orlov allocation for directory inodes.

2. Classic allocation for directory inodes. This is only used if the oldalloc option is passed to the kernel, which disables Orlov allocation. Normally, Orlov allocation is the default strategy.

3. Inode allocation for regular files.

The three options are investigated below.
Orlov Allocation

A standard scheme proposed and implemented for the OpenBSD kernel by Grigory Orlov is used to find a directory inode. The Linux version was developed later. The goal of the allocator is to ensure that directory inodes of child directories are in the same block group as the parent directory so that they are physically closer to each other and costly hard disk seek operations are minimized. Of course, not all directory inodes should end up in the same block group because they would then be too far away from their associated data. The scheme distinguishes whether a new directory is to be created directly in the (global) root directory or at another point in the filesystem, as the code flow diagram for find_group_orlov in Figure 9-20 shows.
While entries for subdirectories should be as close to the parent directory as possible, subdirectories of the filesystem root should be spread out as widely as possible. Otherwise, directories would again accumulate in a distinguished block group.
Figure 9-20: Code flow diagram for find_group_orlov.
Let’s first take a look at the standard situation in which a new subdirectory is to be created at some point in the directory tree (and not in the root directory). This corresponds to the right-hand branch in Figure 9-20. The kernel computes several variables used as criteria to establish the suitability of a block group to accommodate the desired directory node (I took the liberty of rearranging the code a little to make it easier to understand): fs/ext2/ialloc.c
        int ngroups = sbi->s_groups_count;
        int inodes_per_group = EXT2_INODES_PER_GROUP(sb);

        freei = percpu_counter_read_positive(&sbi->s_freeinodes_counter);
        avefreei = freei / ngroups;
        free_blocks = percpu_counter_read_positive(&sbi->s_freeblocks_counter);
        avefreeb = free_blocks / ngroups;
        ndirs = percpu_counter_read_positive(&sbi->s_dirs_counter);

        blocks_per_dir = (le32_to_cpu(es->s_blocks_count) - free_blocks) / ndirs;

        max_dirs = ndirs / ngroups + inodes_per_group / 16;
        min_inodes = avefreei - inodes_per_group / 4;
        min_blocks = avefreeb - EXT2_BLOCKS_PER_GROUP(sb) / 4;

        max_debt = EXT2_BLOCKS_PER_GROUP(sb) / max(blocks_per_dir, BLOCK_COST);
        if (max_debt * INODE_COST > inodes_per_group)
                max_debt = inodes_per_group / INODE_COST;
        if (max_debt > 255)
                max_debt = 255;
        if (max_debt == 0)
                max_debt = 1;
avefreei and avefreeb denote the number of free inodes and blocks (which can be read from the approximate per-CPU counters associated with the superblock) divided by the number of groups. The values thus specify the average number of free inodes and blocks per group. This explains the prefix ave. max_dirs specifies the absolute upper limit for the number of directory inodes in a block group. min_inodes and min_blocks define the minimum number of free inodes or blocks in a group before a new directory may be created. debt is a numeric value between 0 and 255. It is saved for each block group in the ext2_sb_info filesystem instance that makes the s_debts array available (ext2_sb_info is defined in Section 9.2.2). The value is incremented by 1 (in ext2_new_inode) each time a new directory inode is created, and is decremented by 1 when the inode is required for a different purpose — usually for a regular file. The value of debt is therefore an indication of the ratio between the number of directories and inodes in a block group. Starting at the block group of the parent entry, the kernel iterates over all block groups until the following criteria are met:
❑ There are no more than max_dirs directories.

❑ No fewer than min_inodes inodes and min_blocks data blocks are free.

❑ The debt value does not exceed max_debt; that is, the number of directories does not get out of hand.
If just one of these criteria is not satisfied, the kernel skips the current block group and checks the next: fs/ext2/ialloc.c
        for (i = 0; i < ngroups; i++) {
                group = (parent_group + i) % ngroups;
                desc = ext2_get_group_desc (sb, group, NULL);
                if (!desc || !desc->bg_free_inodes_count)
                        continue;
                if (sbi->s_debts[group] >= max_debt)
                        continue;
                if (le16_to_cpu(desc->bg_used_dirs_count) >= max_dirs)
                        continue;
                if (le16_to_cpu(desc->bg_free_inodes_count) < min_inodes)
                        continue;
                if (le16_to_cpu(desc->bg_free_blocks_count) < min_blocks)
                        continue;
                goto found;
        }
The modulo operation (%) at the beginning of the loop ensures that the search wraps around to the first block group once the last block group of the partition is reached. Once a suitable group is found (which is automatically as close as possible to the parent group unless the inode there has been removed), the kernel need only update the corresponding statistics counters and return the group number. If no group matches the requirements, the search is repeated with the help of a less demanding ‘‘fallback‘‘ algorithm:
fs/ext2/ialloc.c
fallback:
        for (i = 0; i < ngroups; i++) {
                group = (parent_group + i) % ngroups;
                desc = ext2_get_group_desc (sb, group, &bh);
                if (!desc || !desc->bg_free_inodes_count)
                        continue;
                if (le16_to_cpu(desc->bg_free_inodes_count) >= avefreei)
                        goto found;
        }
...
        return -1;
Again, the kernel starts at the parent group. The block groups are scanned one after the other. However, this time the kernel accepts the first group that contains more free inodes than the average (specified by avefreei). This method is modified slightly when a new subdirectory is created in the root directory of the system, as illustrated by the left-hand branch of the code flow diagram in Figure 9-20 above. To spread the directory inodes across the filesystem as uniformly as possible, the immediate subdirectories of the root directory are distributed statistically over the block groups. The kernel uses get_random_bytes to select a random number that is trimmed to the range of existing block groups by taking the remainder of division by ngroups. The kernel then iterates as follows over the randomly selected group and the subsequent groups:
fs/ext2/ialloc.c
        get_random_bytes(&group, sizeof(group));
        parent_group = (unsigned)group % ngroups;
        for (i = 0; i < ngroups; i++) {
                group = (parent_group + i) % ngroups;
                desc = ext2_get_group_desc (sb, group, &bh);
                if (!desc || !desc->bg_free_inodes_count)
                        continue;
                if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
                        continue;
                if (le16_to_cpu(desc->bg_free_inodes_count) < avefreei)
                        continue;
                if (le16_to_cpu(desc->bg_free_blocks_count) < avefreeb)
                        continue;
                best_group = group;
                best_ndir = le16_to_cpu(desc->bg_used_dirs_count);
                best_desc = desc;
                best_bh = bh;
        }
While, again, the minimum number of free inodes or blocks must not be below the limit set by avefreei and avefreeb, the kernel also ensures that the number of directories already present (bg_used_dirs_count) is less than best_ndir. The value is initially set to the value of inodes_per_group but is always updated to the lowest value encountered by the kernel during its search. The winner is the block group that has the fewest directory entries and that also satisfies the other two conditions. If a suitable group is found, the kernel updates the statistics and returns the group number selected. Otherwise, the fallback mechanism comes into effect to find a less qualified block group.
Classic Directory Allocation

Kernel versions up to and including 2.4 did not use Orlov allocation, but the technique described below, called classic allocation. Ext2 filesystems can be mounted using the oldalloc option, which sets the EXT2_MOUNT_OLDALLOC bit in the s_mount_opt field of the superblock. The kernel then no longer uses the Orlov allocator but resorts to the classic scheme of allocation.18 How does the classic scheme work? The block groups of the system are scanned in a forward search, and particular attention is paid to two conditions:

1. Free space should still be available in the block group.

2. The number of directory inodes should be as small as possible compared to other inodes in the block group.
In this scheme, directory inodes are typically spread as uniformly as possible across the entire filesystem. If none of the block groups satisfies requirements, the kernel restricts selection to groups with above average amounts of free space and from these chooses those with the fewest directory inodes.
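The two conditions can be illustrated with a hypothetical helper. This is not the kernel's actual implementation (cf. find_group_dir in fs/ext2/ialloc.c), but it shows the kind of forward scan the classic scheme performs: accept only groups that still have free inodes and, among those, prefer the one with the fewest directories.

static int find_group_classic(struct super_block *sb)
{
        int ngroups = EXT2_SB(sb)->s_groups_count;
        struct ext2_group_desc *desc;
        int group, best_group = -1;
        unsigned int best_dirs = ~0u;

        for (group = 0; group < ngroups; group++) {
                desc = ext2_get_group_desc(sb, group, NULL);
                if (!desc || !desc->bg_free_inodes_count)
                        continue;                          /* condition 1: free space   */
                if (le16_to_cpu(desc->bg_used_dirs_count) < best_dirs) {
                        best_dirs = le16_to_cpu(desc->bg_used_dirs_count);
                        best_group = group;                /* condition 2: fewest dirs  */
                }
        }
        return best_group;
}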
Inode Allocation for Other Files

A simpler scheme known as quadratic hashing is applied when searching for an inode for regular files, links, and all file types other than directories. It is based on a forward search starting at the block group of the directory inode of the directory in which the new file is to be created. The first block group found with a free inode is reserved. The block group in which the directory inode is located is searched first. Let's assume its group ID is start. If it does not have a free inode, the kernel scans the block group with the number start + 2⁰, then start + 2⁰ + 2¹, start + 2⁰ + 2¹ + 2², and so on. A higher power of 2 is added to the group number in each step, which results in the sequence 1, 1 + 2, 1 + 2 + 4, 1 + 2 + 4 + 8, ... = 1, 3, 7, 15, .... Usually, this method quickly finds a free inode. However, if no free inode is returned on an (almost hopelessly) overfilled filesystem, the kernel scans all block groups in succession to ensure that every effort is made to find a free inode. Again, the first block group with a free inode is selected. If absolutely no inodes are free, the action is aborted with a corresponding error code.19
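The probe sequence can be illustrated with a small, self-contained helper. It is purely hypothetical (the kernel folds this logic into a larger function, cf. find_group_other in fs/ext2/ialloc.c); group_has_free_inode stands in for the per-group check.

static int quadratic_probe(int start, int ngroups,
                           int (*group_has_free_inode)(int group))
{
        int group = start, step, i;

        if (group_has_free_inode(group))          /* the parent's group is tried first  */
                return group;
        for (step = 1; step < ngroups; step <<= 1) {
                group = (group + step) % ngroups;  /* start+1, start+3, start+7, ...     */
                if (group_has_free_inode(group))
                        return group;
        }
        for (i = 0; i < ngroups; i++)              /* last resort: scan every group      */
                if (group_has_free_inode((start + i) % ngroups))
                        return (start + i) % ngroups;
        return -1;                                 /* no free inode in the filesystem    */
}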
Deleting Inodes

Both directory inodes and file inodes can be deleted, and, from the perspective of the filesystem, both actions are much simpler than allocating inodes. Let us first look at how directories are deleted. After the appropriate system call (rmdir) has been invoked, the code meanders through the kernel and finally arrives at the rmdir function pointer of the inode_operations structure, which, for the Ext2 filesystem, contains ext2_rmdir from fs/ext2/namei.c.

18 In terms of compatibility with old kernel versions, it makes no difference whether directory inodes are reserved with the Orlov
allocator or not because the format of the filesystem remains unchanged. 19 In practice, this situation hardly ever occurs because the hard disk would have to contain a gigantic number of small files, and this
is very rarely the case on standard systems. A more realistic situation (often encountered in practice) is that all data blocks are full, but a large number of inodes are still free.
Two main actions are needed to delete a directory:

1. First, the entry in the directory inode of the parent directory is deleted.

2. Then the data blocks assigned on the hard disk (an inode and the associated data blocks with the directory entries) are released.
As the code flow diagram in Figure 9-21 shows, this is done in a few steps.
Figure 9-21: Code flow diagram for ext2_rmdir.

To ensure that the directory to be deleted no longer contains any files, the contents of its data block are checked using the ext2_empty_dir function. If the kernel finds only the entries for . and .., the directory is released for deletion. Otherwise, the action is aborted, and an error code (-ENOTEMPTY) is returned. Removal of the directory entry from the parent directory is delegated to the ext2_unlink function. This entry is found in the directory table using the ext2_find_entry function, which scans the individual directory entries one after the other (the scheme adopted for storing entries is described in Section 9.2.2). If a matching entry is found, the function returns an instance of ext2_dir_entry_2 to identify it uniquely. ext2_delete_entry removes the entry from the directory table. As described in Section 9.2.2, the data are not physically deleted from the table. Instead, the rec_len field of the ext2_dir_entry_2 structure is set in such a way that the entry is skipped when the table is traversed. As already noted, this approach yields substantial benefits in terms of speed, as actual deletion would necessitate rewriting a large amount of data.
This has both advantages and disadvantages. By inspecting the filesystem structures on the hard disk (assuming the corresponding permissions to read and write raw data on the partition) it is possible to recover a deleted file by reactivating the directory entry by resetting the rec_len field of its predecessor — if, of course, the allocated blocks have not been overwritten with other data in the meantime. If sensitive data are deleted, this can prove to be a final lifeline and, of course, a source of danger because a little technical know-how is all that is needed to access the data if the data blocks have not yet been overwritten.20 The kernel has now removed the directory entry from the filesystem, but the data blocks for the inode and directory contents are still marked as occupied. When are they released? 20 Explicitly overwriting the file with null bytes before deletion is a remedy.
In this context, care should be exercised because of the structure of Unix filesystems, as explained in Chapter 8. If hard links are used, users have access to inodes (and therefore to the associated data blocks) under several names in the system. However, the nlink counter in the inode structure keeps a record of how many hard links point to an inode. The filesystem code decrements this counter by 1 each time a link to the inode is deleted. When the counter value reaches 0, there are no remaining hard links to the inode, and it can therefore be finally released. Once again it should be noted that only the corresponding bit in the inode bitmap is set to 0; the associated data are still present in the block and can potentially be used to reconstruct the file contents. The data blocks associated with the inode have not yet been released. This is not done until all references to the inode data structure have been returned with iput. What is the difference between deleting a regular file and deleting a directory? Most of the above actions (with the exception of ext2_empty_dir) do not specifically relate to directories and can be used for general inode types. In fact, the procedure used to delete non-directories is very similar to the one described above. Starting with the unlink system call, the VFS vfs_unlink function is invoked to initiate the file-specific inode_operations->unlink operation. For the Second Extended Filesystem, this operation is ext2_unlink, which is described above. Everything said there also applies for deleting regular files, links, and so on.
Removing Data Blocks

In the delete operations described above, the data blocks remain untouched, partly because of the hard link problem. Removal of data blocks is closely associated with the reference counting of inode objects because two conditions must be satisfied before the data blocks can actually be deleted:
1. The link counter nlink must be zero to ensure that there are no references to the data in the filesystem.

2. The usage counter (i_count) of the inode structure must have dropped to zero; that is, the in-memory inode object must no longer be in use and can be flushed from memory.
The kernel uses the iput function to decrement the reference counter for the memory object. It therefore makes sense to introduce a check at this point to establish whether the inode is still needed and to remove it if not. This is a standard function of the virtual filesystem not discussed in detail here because the only aspect of interest is that the kernel invokes the ext2_delete_inode function to release the data associated with the inode on the hard disk (iput also returns memory data structures and memory pages reserved for data). This function builds primarily on two other functions — ext2_truncate, which releases the data blocks associated with the inode (regardless of whether the inode represents a directory or a regular file); and ext2_free_inode, which releases the memory space occupied by the inode itself.
Neither function deletes the space occupied on the hard disk or overwrites it with null bytes. They simply release the corresponding positions in the block or inode bitmap.
Since both functions reverse the technique used to create files, their implementation need not be discussed here.
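How the two functions interact can be seen from a condensed sketch of ext2_delete_inode (simplified from fs/ext2/inode.c; quota handling, dirty marking, and the error path for already-deleted inodes are omitted):

void ext2_delete_inode(struct inode *inode)
{
        ...
        inode->i_size = 0;
        if (inode->i_blocks)
                ext2_truncate(inode);    /* release the data blocks of the inode        */
        ext2_free_inode(inode);          /* clear the inode's bit in the inode bitmap   */
        ...
}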
Address Space Operations

In Section 9.2.4, the address space operations associated with the Ext2 filesystem are discussed. For the most part, functions whose names are prefixed with ext2_ are assigned to the individual function pointers. At first glance, it could therefore be assumed that they are all special implementations for the Second Extended Filesystem. However, this is not the case. Most of the functions make use of standard implementations of the virtual filesystem, which uses the function discussed in Section 9.2.4 as an interface to the low-level code. For example, the implementation of ext2_readpage is as follows:
fs/ext2/inode.c
static int ext2_readpage(struct file *file, struct page *page)
{
        return mpage_readpage(page, ext2_get_block);
}
This is simply a transparent front end for the mpage_readpage standard function (introduced in Chapter 16) whose parameters are a pointer to ext2_get_block and the memory page to be processed. ext2_writepage is used to write memory pages and is similar in terms of its implementation: fs/ext2/inode.c
static int ext2_writepage(struct page *page, struct writeback_control *wbc)
{
        return block_write_full_page(page, ext2_get_block, wbc);
}
Again, a standard function described in Chapter 16 is used. This function is associated with the low-level implementation of the Ext2 filesystem using ext2_get_block. Most other address space functions provided by the Ext2 filesystem are implemented via similar front ends that use ext2_get_block as a go-between. It is therefore not necessary to discuss additional Ext2-specific implementations because the functions described in Chapter 8 together with the information on ext2_get_block in Section 9.2.4 are all we need to know about address space operations.
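As one further example of the pattern, writing back several pages of an inode takes the same route; the following reflects the usual form of this front end (simplified; cf. ext2_writepages in fs/ext2/inode.c):

static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
        return mpage_writepages(mapping, wbc, ext2_get_block);
}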
9.3 Third Extended Filesystem
The third extension to the Ext filesystem, logically known as Ext3, features a journal in which actions performed on the filesystem data are saved. This helps to considerably shorten the run time of fsck following a system crash.21 Since the underlying filesystem concepts not related to the new journal mechanism have remained unchanged in the third version of the filesystem, I will discuss only the new Ext3 capabilities. However, for reasons of space, I will not delve too deeply into their technical implementation. The transaction concept originates from the database sector, where it helps guarantee data consistency if operations are not completed. The same consistency problem (which is not specific to Ext) arises in filesystems. How can the correctness and consistency of metadata be ensured if filesystem operations are interrupted unintentionally — for example, in the event of a power outage or if a user switches a computer off without shutting it down first?

21 On filesystems with several hundred gigabytes, consistency checks may take a few hours depending on system speed. This downtime is not acceptable on servers. But even PC users appreciate the fact that consistency checks take just a few seconds rather than several minutes.
9.3.1 Concepts

The basic idea of Ext3 is to regard each operation on the filesystem metadata as a transaction that is saved in a journal before it is performed. Once the transaction has terminated (i.e., when the desired modifications to the metadata have been made), the associated information is removed from the journal. If a system error occurs after transaction data have been written to the journal — but before (or during) performance of the actual operations — the pending operations are carried out in their entirety the next time the filesystem is mounted. The filesystem is then automatically in a consistent state. If the interruption occurs before the transaction is written to the journal, the operation itself is not performed because the information on it is lost when the system is restarted, but at least filesystem consistency is retained. However, Ext3 cannot perform miracles. It is still possible to lose data because of a system crash. Nevertheless, the filesystem can always be restored to a consistent state very quickly afterward. The additional overhead needed to log transactions is, of course, reflected in the performance of Ext3, which does not quite match that of Ext2. The kernel is able to access the Ext3 filesystem in three different ways in order to strike a suitable balance between performance and data integrity in all situations:
1. In writeback mode, only changes to the metadata are logged to the journal. Operations on useful data bypass the journal. This mode guarantees highest performance but lowest data protection.

2. In ordered mode, only changes to the metadata are logged to the journal. However, changes to useful data are grouped and are always made before operations are performed on the metadata. This mode is therefore slightly slower than writeback mode.

3. In journal mode, changes not only to metadata but also to useful data are written to the journal. This guarantees the highest level of data protection but is by far the slowest mode (except in a few pathological situations). The chance of losing data is minimized.
The desired mode is specified in the data parameter when the filesystem is mounted. The default is ordered. As already stated, the Ext3 filesystem is designed to be fully compatible with Ext2 — not only downward but also (as far as possible) upward. The journal therefore resides in a special file with (as usual) its own inode. This enables Ext3 filesystems to be mounted on systems that support only Ext2. Even existing Ext2 partitions can be converted to Ext3 quickly and, above all, without the need for complicated data copying operations — a major consideration on server systems. The journal can be held not only in a special file but also on a separate partition, but the details are not discussed here.
The kernel includes a layer called a journaling block device (JBD) layer to handle journals and associated operations. Although this layer can be used on different filesystems, currently it is used only by Ext3. All other journaling filesystems such as ReiserFS, XFS, and JFS have their own mechanisms. In the sections below, therefore, JBD and Ext3 are regarded as a single unit.
Log Records, Handles, and Transactions

The transaction concept is not implemented by means of a single monolithic structure. Owing to the structure of filesystems (and also for performance reasons), it is necessary to break a transaction down into smaller units, as shown in Figure 9-22.
Figure 9-22: Interaction of transactions, log records, and handles.

❑ Log records are the smallest units that can be logged. Each represents an individual update to a block.

❑ (Atomic) handles group together several log records on the system level. For example, if a write request is issued using the write system call, all log records involved in this operation are grouped into a handle.

❑ Transactions are groupings of several handles that ensure better performance.
9.3.2 Data Structures

Whereas transactions include data with system-wide validity, each handle is always associated with a specific process. For this reason, the familiar task structure discussed in Chapter 2 includes an element that points to the current process handle:
<sched.h>
struct task_struct {
...
/* journaling file system info */
        void *journal_info;
...
}
The JBD layer automatically assumes responsibility for converting the void pointer to a pointer to handle_t. The journal_current_handle auxiliary function is used to read the active handle of the current process.
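journal_current_handle amounts to little more than handing back this pointer; a minimal sketch along the lines of its definition in the JBD headers looks as follows:

static inline handle_t *journal_current_handle(void)
{
        return current->journal_info;
}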
handle_t is a typedef to the struct handle_s data type used to define a handle (a simplified version is shown):

<jbd.h>
typedef struct handle_s         handle_t;       /* Atomic operation type */

<jbd.h>
struct handle_s {
        /* Which compound transaction is this update a part of? */
        transaction_t           *h_transaction;

        /* Number of remaining buffers we are allowed to dirty: */
        int                     h_buffer_credits;
        ...
};

h_transaction is a pointer to the transaction data structure with which the handle is associated, and h_buffer_credits specifies how many free buffers are still available for journal operations (discussed shortly). The kernel provides the journal_start and journal_stop functions that are used in pairs to label a code section whose operation is to be regarded as atomic by the journal layer:

handle_t *handle = journal_start(journal, nblocks);
/* Perform operations to be regarded as atomic */
journal_stop(handle);
The functions can be nested, but it must be ensured that journal_stop is invoked the same number of times as journal_start. The kernel provides the wrapper function ext3_journal_start, which takes a pointer to the inode in question as a parameter to infer the associated journal. With this information, journal_start is called. While journal_start is usually not used directly, ext3_journal_start is used all over the Ext3 code. Each handle consists of various log operations, each of which has its own buffer head (see Chapter 16) to save the change — even if only a single bit is modified in the underlying filesystem. What appears at first glance to be a massive waste of memory is compensated by higher performance because buffers are processed very efficiently. The data structure is defined (in greatly simplified form) as follows: <journal_head.h>
struct journal_head {
        struct buffer_head *b_bh;
        transaction_t *b_transaction;
        struct journal_head *b_tnext, *b_tprev;
};
❑ b_bh points to the buffer head that contains the operation data.

❑ b_transaction references the transaction to which the log entry is assigned.

❑ b_tnext and b_tprev help implement a doubly linked list of all logs associated with an atomic operation.
The JBD layer provides journal_dirty_metadata to write modified metadata to the journal:
fs/jbd/transaction.c
int journal_dirty_metadata(handle_t *handle, struct buffer_head *bh)
The matching journal_dirty_data function writes useful data to the journal and is used in data mode. Transactions are represented by a dedicated data structure; again a much simplified version is shown: <jbd.h>
typedef struct transaction_s transaction_t;

struct transaction_s {
        journal_t               *t_journal;
        tid_t                   t_tid;

        enum {
                T_RUNNING,
                ...
                T_FLUSH,
                T_COMMIT,
                T_FINISHED
        }                       t_state;

        struct journal_head     *t_buffers;
        unsigned long           t_expires;
        int                     t_handle_count;
};
❑ t_journal is a pointer to the journal to which the transaction data are written (for the sake of simplicity, the data structure used is not discussed because it is overburdened with technical details).

❑ Each transaction can have different states that are held in t_state:

        ❑ T_RUNNING indicates that new atomic handles can be added to the journal.

        ❑ T_FLUSH indicates that log entries are being written at the moment.

        ❑ T_COMMIT indicates when all data have been written to disk, but the metadata still need to be processed.

        ❑ T_FINISHED indicates that all log entries have been written safely to disk.

❑ t_buffers points to the buffers associated with the transaction.

❑ t_expires specifies the time by which the transaction data must have been physically written to the journal. The kernel uses a timer that expires by default 5 seconds after the transaction has been generated.

❑ t_handle_count indicates the number of handles associated with the transaction.
The Ext3 code uses ‘‘checkpoints‘‘ at which a check is made to ascertain whether the changes in the journal have been written to the filesystem. If they have, the data in the journal are no longer needed
and can be removed. During normal operation, the contents of the journal play no active role. Only if a system crash occurs are the journal data used to reconstruct changes to the filesystem and return it to a consistent state.

As compared to the original definition in Ext2, several elements have been added to the superblock data structure of Ext3 to support the journal functions:

<ext3_fs_sb.h>
struct ext3_sb_info {
	...
	/* Journaling */
	struct inode		*s_journal_inode;
	struct journal_s	*s_journal;
	unsigned long		s_commit_interval;
	struct block_device	*journal_bdev;
};
As noted, the journal can be held both in a file and on its own partition. Depending on the option chosen, either s_journal_inode or journal_bdev is used to reference its location. s_commit_interval specifies the frequency with which data are transferred from memory into the journal, and s_journal points to the journal data structure.
9.4 Summary
Filesystems are used to organize file data on physical block devices like hard disks to store information persistently across reboots. The second and third extended filesystems have been the standard workhorses of Linux for many years, and you have seen their implementation and how they represent data on disk in detail.

After describing the basic challenges that filesystems have to face, you have seen the on-disk and in-kernel structures of the second extended filesystem. You have learned how filesystem objects are managed by inodes, and how data blocks that provide storage space for files are handled. Various important filesystem operations like creating new directories were also discussed in detail. Finally, you have been introduced to the journaling mechanisms of Ext3, the evolutionary successor of Ext2.
Filesystems without Persistent Storage

Traditionally, filesystems are used to store data persistently on block devices. However, it is also possible to use filesystems to organize, present, and exchange information that is not stored on block devices, but dynamically generated by the kernel. This chapter examines some of them:
❑ The proc filesystem enables the kernel to generate information on the state and configuration of the system. This information can be read from normal files by users and system programs without the need for special tools for communication with the kernel; in some cases, a simple cat is sufficient. Data can not only be read from the kernel, but also sent to it by writing character strings to a file of the proc filesystem: echo "value" > /proc/file — there's no easier way of transferring information from userspace to the kernel. This approach makes use of a virtual filesystem that generates file information ‘‘on the fly,’’ in other words, only when requested to do so by read operations. A dedicated hard disk partition or some other block storage device is not needed with filesystems of this type. In addition to the proc filesystem, the kernel provides many other virtual filesystems for various purposes, for example, for the management of all devices and system resources cataloged in the form of files in hierarchically structured directories. Even device drivers can make status information available in virtual filesystems, the USB subsystem being one such example.
❑ Sysfs is one particularly important example of another virtual filesystem that serves a similar purpose to procfs on the one hand, but is rather different on the other hand. Sysfs is, per convention, always mounted at /sys, but there is nothing that would prevent including it in other places. It was designed to export information from the kernel into userland at a highly structured level. In contrast to procfs, it was not designed for direct human use because the information is deeply and hierarchically nested. Additionally, the files do not always contain information in ASCII text form, but may well use unreadable binary
strings. The filesystem is, however, very useful for tools that want to gather detailed information about the hardware present in a system and the topological connection between the devices. It is also possible to create sysfs entries for kernel objects that use kobjects (see Chapter 1 for more information) with little effort. This gives userland easy access to important core kernel data structures.
❑ Small filesystems that serve a specific purpose can be constructed from standard functions supplied by the kernel. The in-kernel library that provides the required functions is called libfs. Additionally, the kernel provides means to implement sequential files with ease. Both techniques are put together in the debugging filesystem debugfs, which allows kernel developers to quickly export values to and import values from userland without the hassle of having to create custom interfaces or special-purpose filesystems.

10.1 The proc Filesystem
As mentioned at the beginning of this chapter, the proc filesystem is a virtual filesystem whose information is not read from a block device. Information is generated dynamically only when the contents of a file are read. Using the proc filesystem, information can be obtained on the kernel subsystems (e.g., memory utilization, peripherals attached, etc.) and kernel behavior can be modified without the need to recompile the sources, load modules, or reboot the system. Closely related to this filesystem is the system control mechanism — sysctl for short — which has been frequently referenced in previous chapters. The proc filesystem provides an interface to all options exported using this mechanism, thus allowing parameters to be modified with little effort. No special communication programs need be developed — all that is required is a shell and the standard cat and echo programs.

Usually, the process data filesystem (its full name) is mounted in /proc, from which it obviously derives its more frequently used abbreviated name (procFS). Nevertheless, it is worth noting that the filesystem — like any other filesystem — can be mounted at any other point in the file tree, although this would be unusual.

The section below describes the layout and contents of the proc filesystem to illustrate its functions and options before we move on to examine its implementation details.
10.1.1 Contents of /proc

Although the size of the proc filesystem varies from system to system (different data are exported depending on hardware configuration, and different architectures affect its contents) it nevertheless contains a large number of deeply nested directories, files, and links. However, this wealth of information can be grouped into a few larger categories:
❑ Memory management
❑ Characteristic data of system processes
❑ Filesystems
❑ Device drivers
❑ System buses
❑ Power management
❑ Terminals
❑ System control parameters
Some of these categories are very different in nature (and the above list is by no means comprehensive) and share few common features. In the past, this information overload was a latent but ever-present source of criticism (which occasionally erupted violently) of the proc filesystem concept. It may well be useful to provide data by means of a virtual filesystem, but a more structured approach would have been appreciated...

The trend in kernel development is away from the provision of information by the proc filesystem and toward the exporting of data by a problem-specific but likewise virtual filesystem. A good example of this is the USB filesystem, which is used to export many types of status information on the USB subsystem into userspace without ‘‘overburdening‘‘ /proc with new entries. Additionally, the Sysfs filesystem allows for presenting a hierarchical view not only of the device tree (by device, I mean system buses, PCI devices, CPUs, etc.), but also of important kernel objects. Sysfs is discussed in Section 10.3.

On the kernel mailing list, the addition of new entries to /proc is viewed with deep suspicion and is the subject of controversial discussion. New code has a far better chance of finding its way into the sources if it does not use /proc. Of course, this does not mean that the proc filesystem will gradually become superfluous. In fact, the opposite is true. Today, /proc is as important as ever not only when installing new distributions, but also to support (automated) system administration.

The following sections give a brief overview of the various files in /proc and the information they contain. Again, I lay no claim to completeness and discuss only the most important elements found on all supported architectures.
Process-Specific Data

Each system process, regardless of its current state, has a subdirectory (with the same name as the PID) that contains information on the process. As the name suggests, the original intention of the ‘‘process data system‘‘ (proc for short) was to deliver process data.

What information is held in the process-specific directories? A simple ls -l command paints an initial picture:

wolfgang@meitner> cd /proc/7748
wolfgang@meitner> ls -l
total 0
dr-xr-xr-x 2 wolfgang users 0 2008-02-15 04:22 attr
-r-------- 1 wolfgang users 0 2008-02-15 04:22 auxv
--w------- 1 wolfgang users 0 2008-02-15 04:22 clear_refs
-r--r--r-- 1 wolfgang users 0 2008-02-15 00:37 cmdline
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 cpuset
lrwxrwxrwx 1 wolfgang users 0 2008-02-15 04:22 cwd -> /home/wolfgang/wiley_kbook
-r-------- 1 wolfgang users 0 2008-02-15 04:22 environ
lrwxrwxrwx 1 wolfgang users 0 2008-02-15 01:30 exe -> /usr/bin/emacs
dr-x------ 2 wolfgang users 0 2008-02-15 00:56 fd
dr-x------ 2 wolfgang users 0 2008-02-15 04:22 fdinfo
-rw-r--r-- 1 wolfgang users 0 2008-02-15 04:22 loginuid
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 maps
-rw------- 1 wolfgang users 0 2008-02-15 04:22 mem
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 mounts
-r-------- 1 wolfgang users 0 2008-02-15 04:22 mountstats
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 numa_maps
-rw-r--r-- 1 wolfgang users 0 2008-02-15 04:22 oom_adj
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 oom_score
lrwxrwxrwx 1 wolfgang users 0 2008-02-15 04:22 root -> /
-rw------- 1 wolfgang users 0 2008-02-15 04:22 seccomp
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 smaps
-r--r--r-- 1 wolfgang users 0 2008-02-15 00:56 stat
-r--r--r-- 1 wolfgang users 0 2008-02-15 01:30 statm
-r--r--r-- 1 wolfgang users 0 2008-02-15 00:56 status
dr-xr-xr-x 3 wolfgang users 0 2008-02-15 04:22 task
-r--r--r-- 1 wolfgang users 0 2008-02-15 04:22 wchan
Our example shows the data for an emacs process with PID 7748 as used to edit the LaTeX sources of this book. The meanings of most entries are evident from the filename. For instance, cmdline is the command line used to start the process — that is, the name of the program including all parameters as a string:
wolfgang@meitner> cat cmdline
emacsfs.tex

The kernel does not use normal blanks to separate the elements but NUL bytes as used in C to indicate the end of a string.
The od tool can be used to convert the data to a readable format:

wolfgang@meitner> od -t a /proc/7748/cmdline
0000000   e   m   a   c   s nul   f   s   .   t   e   x nul
0000015

The above output makes it clear that the process was called by emacs fs.tex.
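The same splitting can also be done programmatically. The following userspace program is only a sketch (it is not part of the kernel sources or of the examples discussed here; error handling is minimal) that prints each element of the command line of the current process on a line of its own:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char buf[4096];
	size_t len, pos = 0;
	FILE *f = fopen("/proc/self/cmdline", "r");

	if (!f)
		return 1;
	len = fread(buf, 1, sizeof(buf), f);
	fclose(f);

	/* Each element is terminated by a NUL byte. */
	while (pos < len) {
		printf("%s\n", &buf[pos]);
		pos += strlen(&buf[pos]) + 1;
	}
	return 0;
}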
The other files contain the following data:
❑ environ indicates all environment variables set for the program; again, NUL characters are used as separators instead of blanks.
❑ All memory mappings to libraries (and to the binary file itself) used by the process are listed in text form in maps. In the case of emacs, an excerpt from this file would look like this (I use a regular text format without NUL characters):

wolfgang@meitner> cat maps
00400000-005a4000 r-xp 00000000 08:05 283752    /usr/bin/emacs
007a3000-00e8c000 rw-p 001a3000 08:05 283752    /usr/bin/emacs
00e8c000-018a1000 rw-p 00e8c000 00:00 0         [heap]
2af4b085d000-2af4b0879000 r-xp 00000000 08:05 1743619
/lib64/ld-2.6.1.so
...
4003a000-40086000 r-xp 00000000 03:02 131108    /usr/lib/libcanna.so.1.2
40086000-4008b000 rwxp 0004b000 03:02 131108    /usr/lib/libcanna.so.1.2
4008b000-40090000 rwxp 4008b000 00:00 0
40090000-400a0000 r-xp 00000000 03:02 131102    /usr/lib/libRKC.so.1.2
400a0000-400a1000 rwxp 00010000 03:02 131102    /usr/lib/libRKC.so.1.2
400a1000-400a3000 rwxp 400a1000 00:00 0
400a3000-400e6000 r-xp 00000000 03:02 133514    /usr/X11R6/lib/libXaw3d.so.8.0
400e6000-400ec000 rwxp 00043000 03:02 133514    /usr/X11R6/lib/libXaw3d.so.8.0
400ec000-400fe000 rwxp 400ec000 00:00 0
400fe000-4014f000 r-xp 00000000 03:02 13104     /usr/lib/libtiff.so.3.7.3
4014f000-40151000 rwxp 00051000 03:02 13104     /usr/lib/libtiff.so.3.7.3
40151000-4018f000 r-xp 00000000 03:02 13010     /usr/lib/libpng.so.3.1.2.8
4018f000-40190000 rwxp 0003d000 03:02 13010     /usr/lib/libpng.so.3.1.2.8
40190000-401af000 r-xp 00000000 03:02 9011      /usr/lib/libjpeg.so.62.0.0
401af000-401b0000 rwxp 0001e000 03:02 9011      /usr/lib/libjpeg.so.62.0.0
401b0000-401c2000 r-xp 00000000 03:02 12590     /lib/libz.so.1.2.3
401c2000-401c3000 rwxp 00011000 03:02 12590     /lib/libz.so.1.2.3
...
2af4b7dc1000-2af4b7dc3000 rw-p 00001000 08:05 490436
/usr/lib64/pango/1.6.0/modules/pango-basic-fc.so
2af4b7dc3000-2af4b7e07000 r--p 00000000 08:05 1222118
/usr/share/fonts/truetype/arial.ttf
2af4b7e4d000-2af4b7e53000 r--p 00000000 08:05 211780
/usr/share/locale-bundle/en_GB/LC_MESSAGES/glib20.mo
2af4b7e53000-2af4b7e9c000 rw-p 2af4b7e07000 00:00 0
7ffffa218000-7ffffa24d000 rw-p 7ffffa218000 00:00 0      [stack]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vdso]
❑ status returns general information on process status in text form.

wolfgang@meitner> cat status
Name:   emacs
State:  S (sleeping)
SleepAVG:       98%
Tgid:   7748
Pid:    7748
PPid:   4891
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    100     100     100     100
FDSize: 256
Groups: 16 33 100
VmPeak:   140352 kB
VmSize:   139888 kB
VmLck:         0 kB
VmHWM:     28144 kB
VmRSS:     27860 kB
VmData:    10772 kB
VmStk:       212 kB
VmExe:      1680 kB
VmLib:     13256 kB
VmPTE:       284 kB
Threads:        1
SigQ:   0/38912
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 00000001d1817efd
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
Cpus_allowed:   00000000,00000000,00000000,0000000f
Mems_allowed:   00000000,00000001
Information is provided not only on UID/GID and other process numbers but also on memory allocation, process capabilities, and the state of the individual signal masks (pending, blocked, etc.).
❑ stat and statm contain — as a consecutive sequence of numbers — more status information on the process and its memory consumption.

The fd subdirectory contains files with numbers as names; these represent the individual file descriptors of the process. A symbolic link points to the position in the filesystem that is associated with the file descriptor, assuming it is a file in the proper sense. Other elements such as pipes that are also addressed via file descriptors are given a link target in the form pipe:[1434]. Similarly, symbolic links point to files and directories associated with the process:
❑ cwd points to the current working directory of the process. If users have the appropriate rights, they can switch to this directory using cd cwd without needing to know which directory it is.
❑ exe points to the binary file with the application code. In our example, it would point to /usr/bin/emacs (see the short sketch after this list).
❑ root points to the root directory of the process. This need not necessarily be the global root directory (see the chroot mechanism discussed in Chapter 8).
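To illustrate the exe link, a userspace program can resolve it with readlink. This is only a sketch with minimal error handling, not code from the kernel sources:

#include <stdio.h>
#include <unistd.h>
#include <limits.h>

int main(void)
{
	char target[PATH_MAX];
	ssize_t len = readlink("/proc/self/exe", target, sizeof(target) - 1);

	if (len < 0)
		return 1;
	target[len] = '\0';	/* readlink does not NUL-terminate the result */
	printf("running binary: %s\n", target);
	return 0;
}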
General System Information

Not only the subdirectories of /proc contain information but also the directory itself. General information relating to no specific kernel subsystem (or shared by several subsystems) resides in files in /proc. Some of these files were mentioned in earlier chapters. For example, iomem and ioports provide information on memory addresses and ports used to communicate with devices, as discussed in Chapter 6. Both files contain lists in text form:

wolfgang@meitner> cat /proc/iomem
00000000-0009dbff : System RAM
  00000000-00000000 : Crash kernel
0009dc00-0009ffff : reserved
000c0000-000cffff : pnp 00:0d
000e4000-000fffff : reserved
00100000-cff7ffff : System RAM
  00200000-004017a4 : Kernel code
  004017a5-004ffdef : Kernel data
cff80000-cff8dfff : ACPI Tables
cff8e000-cffdffff : ACPI Non-volatile Storage
cffe0000-cfffffff : reserved
d0000000-dfffffff : PCI Bus #01
  d0000000-dfffffff : 0000:01:00.0
    d0000000-d0ffffff : vesafb
...
fee00000-fee00fff : Local APIC
ffa00000-ffafffff : pnp 00:07
fff00000-ffffffff : reserved
100000000-12fffffff : System RAM
wolfgang@meitner> cat /proc/ioports
0000-001f : dma1
0020-0021 : pic1
0040-0043 : timer0
0050-0053 : timer1
0060-006f : keyboard
0070-0077 : rtc
0080-008f : dma page reg
00a0-00a1 : pic2
...
e000-efff : PCI Bus #03
  e400-e40f : 0000:03:00.0
    e400-e40f : libata
  e480-e483 : 0000:03:00.0
    e480-e483 : libata
  e800-e807 : 0000:03:00.0
    e800-e807 : libata
  e880-e883 : 0000:03:00.0
    e880-e883 : libata
  ec00-ec07 : 0000:03:00.0
    ec00-ec07 : libata
Similarly, some files provide a rough overview of the current memory management situation. buddyinfo and slabinfo supply data on current utilization of the buddy system and slab allocator, and meminfo gives an overview of general memory usage — broken down into high and low memory, free, allocated and shared areas, swap and writeback memory, and so on. vmstat yields further memory management characteristics including the number of pages currently in each memory management subsystem.

The kallsyms and kcore entries support kernel code debugging. The former holds a table with all global kernel variables and procedures including their addresses in memory:

wolfgang@meitner> cat /proc/kallsyms
...
ffffffff80395ce8 T skb_abort_seq_read
ffffffff80395cff t skb_ts_finish
ffffffff80395d08 T skb_find_text
ffffffff80395d76 T skb_to_sgvec
ffffffff80395f6d T skb_truesize_bug
ffffffff80395f89 T skb_under_panic
ffffffff80395fe4 T skb_over_panic
ffffffff8039603f t copy_skb_header
ffffffff80396273 T skb_pull_rcsum
ffffffff803962da T skb_seq_read
ffffffff80396468 t skb_ts_get_next_block
...
kcore is a dynamic core file that ‘‘contains‘‘ all data of the running kernel — that is, the entire contents of main memory. It is no different from the normal core files that are saved for debugging purposes when a fatal error in user applications generates a core dump. The current state of a running system can be inspected using a debugger together with the binary file. Many of the figures in this book illustrating the interplay among the kernel data structures were prepared using this method. Appendix 2 takes a closer look at how available capabilities can be used with the help of the GNU gdb debugger and the ddd graphical user interface.

interrupts saves the number of interrupts raised during the current operation (the underlying mechanism is described in Chapter 14). On an IA-32 quad-core server, the file could look like this:

wolfgang@meitner> cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:    1383211    1407764    1292884    1364817   IO-APIC-edge      timer
  1:          0          1          1          0   IO-APIC-edge      i8042
  8:          0          1          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          1          3          0          0   IO-APIC-edge      i8042
 16:       8327       4251     290975     114077   IO-APIC-fasteoi   libata, uhci_hcd:usb1
 18:          0          1          0          0   IO-APIC-fasteoi   ehci_hcd:usb2, uhci_hcd:usb4, uhci_hcd:usb7
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb6
 21:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 22:     267439      93114      10575       5018   IO-APIC-fasteoi   libata, libata, HDA Intel
 23:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb5, ehci_hcd:usb8
4347:        12         17          7      77445   PCI-MSI-edge      eth0
NMI:          0          0          0          0
LOC:    5443482    5443174    5446374    5446306
ERR:          0
Not only the number of interrupts but also the name of the device or driver responsible for the interrupt are given for each interrupt number. Last but not least, I must mention loadavg and uptime, which display, respectively, the average system loading (i.e., the length of the run queue) during the last 60 seconds, 5 minutes, and 15 minutes; and the system uptime — the time elapsed since system boot.
Network Information

The /proc/net subdirectory supplies data on the various network options of the kernel. The information held there is a colorful mix of protocol and device data and includes several interesting entries as follows:
❑ Statistics on UDP and TCP sockets are available for IPv4 in udp and tcp; the equivalent data for IPv6 are held in udp6 and tcp6. Unix sockets are logged in unix.
❑ The ARP table for backward resolution of addresses can be viewed in the arp file.
❑ dev holds statistics on the volume of data transferred via the network interfaces of the system (including the loopback software interface). This information can be used to check the
transmission quality of the network because it also includes incorrectly transmitted and rejected packets as well as collision data.

Some network card drivers (e.g., for the popular Intel PRO/100 chipset) create additional subdirectories in /proc/net with more detailed hardware-specific information.
System Control Parameters

The system control parameters used to check and modify the behavior of the kernel dynamically make up the lion's share of entries in the proc filesystem. However, this interface is not the only way of manipulating data — this can also be done using the sysctl system call. This requires more effort because it is first necessary to write a program to support communication with the kernel via the system call interface. As a result, the numeric sysctl mechanism was tagged as being obsolete during development of 2.5 (the kernel outputs a warning message to this effect each time sysctl is invoked) and was planned to be dropped at some point. Removing the system call has, however, created a controversial discussion, and up to 2.6.25, the call is still in the kernel — although a message warns the user that it is deprecated. The sysctl system call is not really needed because the /proc interface is a kernel data manipulation option of unrivaled simplicity.

The sysctl parameters are managed in a separate subdirectory named /proc/sys, which is split into further subdirectories in line with the various kernel subsystems:

wolfgang@meitner> ls -l /proc/sys
total 0
dr-xr-xr-x 0 root root 0 2008-02-15 04:29 abi
dr-xr-xr-x 0 root root 0 2008-02-15 04:29 debug
dr-xr-xr-x 0 root root 0 2008-02-14 22:26 dev
dr-xr-xr-x 0 root root 0 2008-02-14 22:22 fs
dr-xr-xr-x 0 root root 0 2008-02-14 22:22 kernel
dr-xr-xr-x 0 root root 0 2008-02-14 22:22 net
dr-xr-xr-x 0 root root 0 2008-02-14 22:26 vm
The subdirectories contain a series of files that reflect the characteristic data of the associated kernel subsystems. For example, /proc/sys/vm includes the following entries:

wolfgang@meitner> ls -l /proc/sys/vm
total 0
-rw-r--r-- 1 root root 0 2008-02-17 01:32 block_dump
-rw-r--r-- 1 root root 0 2008-02-16 20:55 dirty_background_ratio
-rw-r--r-- 1 root root 0 2008-02-16 20:55 dirty_expire_centisecs
-rw-r--r-- 1 root root 0 2008-02-16 20:55 dirty_ratio
-rw-r--r-- 1 root root 0 2008-02-16 20:55 dirty_writeback_centisecs
...
-rw-r--r-- 1 root root 0 2008-02-17 01:32 swappiness
-rw-r--r-- 1 root root 0 2008-02-17 01:32 vfs_cache_pressure
-rw-r--r-- 1 root root 0 2008-02-17 01:32 zone_reclaim_mode
Unlike the files discussed earlier, the contents of the files in these directories can not only be read, but also supplied with new values by means of normal file operations. For instance, the vm subdirectory includes a swappiness file to indicate how ‘‘aggressively‘‘ the swapping algorithm goes about its job of swapping out pages. The default value is 60, as shown when the file contents are output using cat:

wolfgang@meitner> cat /proc/sys/vm/swappiness
60
However, this value can be modified by issuing the following command (as root user):

wolfgang@meitner> echo "80" > /proc/sys/vm/swappiness
wolfgang@meitner> cat /proc/sys/vm/swappiness
80
As discussed in Chapter 18, the higher the swappiness value, the more aggressively the kernel will swap out pages; this can lead to better performance at certain system load levels. Section 10.1.8 describes in detail the implementation used by the kernel to manipulate parameters in the proc filesystem.
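The same adjustment can be made from a program by treating the sysctl entry as an ordinary file. The following userspace fragment is only a sketch equivalent to the echo command above (minimal error handling; root privileges are assumed):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/swappiness", "w");

	if (!f)
		return 1;
	fprintf(f, "80\n");	/* same effect as echo "80" > /proc/sys/vm/swappiness */
	fclose(f);
	return 0;
}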
10.1.2 Data Structures

Once again there are a number of central data structures around which the code used to implement the process data filesystem is built. These include the structures of the virtual filesystem discussed in Chapter 8. proc makes generous use of these, simply because, as a filesystem itself, it must be integrated into the VFS layer of the kernel. There are also proc-specific data structures to organize the data provided in the kernel. An interface to the subsystems of the kernel must also be made available to enable the kernel to extract required information from its structures before it is supplied to userspace by means of /proc.
Representation of proc Entries

Each entry in the proc filesystem is described by an instance of proc_dir_entry whose (abbreviated) definition is as follows:

<proc_fs.h>
struct proc_dir_entry {
	unsigned int low_ino;
	unsigned short namelen;
	const char *name;
	mode_t mode;
	nlink_t nlink;
	uid_t uid;
	gid_t gid;
	loff_t size;
	struct inode_operations * proc_iops;
	const struct file_operations * proc_fops;
	get_info_t *get_info;
	struct module *owner;
	struct proc_dir_entry *next, *parent, *subdir;
	void *data;
	read_proc_t *read_proc;
	write_proc_t *write_proc;
	...
};
Because each entry is given a filename, the kernel uses two elements of the structure to store the corresponding information: name is a pointer to the string in which the name is held, and namelen specifies
the length of the name. Also adopted from the classic filesystem concept is the numbering of all inodes using low_ino. The meaning of mode is the same as in normal filesystems because the element reflects the type of the entry (file, directory, etc.), and the assignment of access rights in accordance with the classic ‘‘owner, group, others‘‘ scheme by means of the appropriate constants in <stat.h>. uid and gid specify the user ID and group ID to which the file belongs. Both are usually set to 0, which means that the root user is the owner of almost all proc files.

The usage counter common to most data structures is implemented by count, which indicates the number of points at which the instance of a data structure is used in the kernel to ensure that the structure is not freed inadvertently.

proc_iops and proc_fops are pointers to instances of types inode_operations and file_operations
discussed in Chapter 8. They hold operations that can be performed on an inode or file and act as an interface to the virtual filesystem that relies on their presence. The operations used depend on the particular file type and are discussed in more detail below. The file size in bytes is saved in the size element. Because proc entries are generated dynamically, the length of a file is not usually known in advance; in this case, the value 0 is used.

If a proc entry is generated by a dynamically loaded module, owner contains a reference to the associated module data structure in memory (if the entry was generated by compiled-in code, owner holds a null pointer).

The following three elements are available to control the exchange of information between the virtual filesystem (and ultimately userspace) and the various proc entries or individual kernel subsystems:
❑ get_info is a function pointer to the relevant subsystem routine that returns the desired data. As with normal file access, the offset and length of the desired range can be specified so that it is not necessary to read the full data set. This is useful, for example, for the automated analysis of proc entries.
❑ read_proc and write_proc point to functions to support the reading of data from and the writing of data to the kernel. The parameters and return values of the two functions are specified by the following type definition:

<proc_fs.h>
typedef int (read_proc_t)(char *page, char **start, off_t off,
			  int count, int *eof, void *data);
typedef int (write_proc_t)(struct file *file, const char __user *buffer,
			   unsigned long count, void *data);
Whereas data are read on the basis of memory pages (of course, an offset and the length of the data to be read can also be specified), the writing of data is based on a file instance. Both routines have an additional data argument that is defined when a new proc entry is registered and is passed as a parameter each time the routine is invoked (the data element of proc_dir_entry holds the data argument). This means that a single function can be registered as the read/write routine for several proc entries; the code can then distinguish the various cases by reference to the data argument (this is not possible with get_info because no data argument is passed). This tactic has already been adopted in preceding chapters to prevent the unnecessary duplication of code.
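To make the mechanism concrete, the following is a sketch of what such a read routine could look like. The device_stats structure and the entry it serves are hypothetical, and the code is not taken from the kernel sources:

struct device_stats {
	unsigned long rx_packets;
	unsigned long tx_packets;
};

/* One read routine shared by several proc entries; the data argument
 * supplied at registration time tells the entries apart. */
static int stats_read_proc(char *page, char **start, off_t off,
			   int count, int *eof, void *data)
{
	struct device_stats *stats = data;	/* taken from proc_dir_entry->data */
	int len;

	len = sprintf(page, "rx/tx packets: %lu/%lu\n",
		      stats->rx_packets, stats->tx_packets);

	*eof = 1;	/* everything fits into a single page */
	return len;
}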
Recall that there is a separate instance of proc_dir_entry for each entry in the proc filesystem. They are used by the kernel to represent the hierarchical structure of the filesystem by means of the following elements:
❑ nlink specifies the number of subdirectories and symbolic links in a directory. (The number of files of other types is irrelevant.)
❑ parent is a pointer to the directory containing a file (or subdirectory) represented by the current proc_dir_entry instance.
❑ subdir and next support the hierarchical arrangement of files and directories. subdir points to the first entry of a directory (which, in spite of the name of the element, can be either a file or a directory), and next groups all common entries of a directory in a singly linked list.
proc inodes

The kernel provides a data structure called proc_inode to support an inode-oriented view of the proc filesystem entries. This structure is defined as follows:

<proc_fs.h>
union proc_op {
	int (*proc_get_link)(struct inode *, struct dentry **,
			     struct vfsmount **);
	int (*proc_read)(struct task_struct *task, char *page);
};

struct proc_inode {
	struct pid *pid;
	int fd;
	union proc_op op;
	struct proc_dir_entry *pde;
	struct inode vfs_inode;
};
The purpose of the structure is to link the proc-specific data with the inode data of the VFS layer. pde contains a pointer to the proc_dir_entry instance associated with each entry; the meaning of the instance was discussed in the previous section. At the end of the structure there is an instance of inode.
This is the actual data, not a pointer to an instance of the structure.
This is exactly the same data used by the VFS layer for inode management. In other words, directly before each instance of an inode structure linked with the proc filesystem, there are additional data in memory that can be extracted from a given instance of proc_inode using the container mechanism. Because the kernel frequently needs to access this information, it defines the following auxiliary procedure:

<proc_fs.h>
static inline struct proc_inode *PROC_I(const struct inode *inode)
{
	return container_of(inode, struct proc_inode, vfs_inode);
}
This returns the inode-specific data associated with a VFS inode. Figure 10-1 illustrates the situation in memory.
Figure 10-1: Connection between struct proc_inode and struct inode.

The remaining elements of the structure are only used if the inode represents a process-specific entry (which is therefore located in the proc/pid directory). Their meanings are as follows:
❑ pid is a pointer to the pid instance of a process. Because it is possible to access a large amount of process-specific information this way, it is clear why a process-specific inode should be directly associated with this data.
❑ proc_get_link and proc_read (which are collected in a union because only one at a time makes sense) are used to get process-specific information or to generate links to process-specific data in the Virtual Filesystem.
❑ fd holds the file descriptor for which a file in proc/pid/fd stands.
The meanings and use of these elements are discussed in detail in Section 10.1.7.
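Putting the two structures together, the path from a VFS inode back to the registered entry is short. The following lines are only a sketch of the navigation, not a specific kernel function:

	struct proc_inode *pi = PROC_I(inode);	/* container_of step shown above */
	struct proc_dir_entry *pde = pi->pde;	/* the entry the inode represents */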
10.1.3 Initialization

Before the proc filesystem can be used, it must be mounted with mount, and the kernel must set up and initialize several data structures to describe the filesystem structure in kernel memory. Unfortunately, the appearance and contents of /proc differ substantially from platform to platform and from architecture to architecture, and the code is crammed with #ifdef pre-processor statements that select code sections according to the particular situation. Although this practice is frowned upon, it simply cannot be avoided. Because initialization differences relate primarily to creation of the subdirectories that subsequently appear in /proc, they are not evident in Figure 10-2, which shows a code flow diagram of proc_root_init in fs/proc/root.c.

Figure 10-2: Code flow diagram for proc_root_init (proc_init_inodecache, register_filesystem, kern_mount_data, proc_misc_init, proc_net_init, and creating directories with proc_mkdir).
proc_root_init first creates a slab cache for proc_inode objects using proc_init_inodecache; these objects are the backbone of the proc filesystem and often need to be generated and destroyed as quickly as possible. Then the filesystem is officially registered with the kernel using the register_filesystem routine described in Chapter 8. And finally, mount is invoked to mount the filesystem. kern_mount_data is a wrapper function for do_kern_mount, also discussed in Chapter 8. It returns a pointer to a vfsmount instance. The pointer is saved in the global variable proc_mnt for later use by the
kernel. proc_misc_init generates various file entries in the proc main directory; these are linked using special procedures to read information from the kernel data structures. Examples of these procedures are:
❑ loadavg (loadavg_read_proc)
❑ meminfo (meminfo_read_proc)
❑ filesystems (filesystems_read_proc)
❑ version (version_read_proc)
create_proc_read_entry is invoked for each name on this list (and for a few more, as the kernel sources show). The function creates a new instance of the familiar proc_dir_entry data structure whose read_proc entry is set to the procedure associated with each name. The implementation of most of these procedures is extremely simple, as exemplified by the version_read_proc procedure used to get the
kernel version:

init/version.c
const char linux_proc_banner[] =
	"%s version %s"
	" (" LINUX_COMPILE_BY "@" LINUX_COMPILE_HOST ")"
	" (" LINUX_COMPILER ") %s\n";

fs/proc/proc_misc.c
static int version_read_proc(char *page, char **start, off_t off,
				 int count, int *eof, void *data)
{
	int len;

	len = snprintf(page, PAGE_SIZE, linux_proc_banner,
		utsname()->sysname,
		utsname()->release,
		utsname()->version);
	return proc_calc_metrics(page, start, off, count, eof, len);
}
The kernel string linux_proc_banner is formatted into the supplied memory page using snprintf; the page resides in kernel space and its contents are copied to userspace by the higher proc layers. When this is done, the proc_calc_metrics auxiliary function determines the length of the data returned.

Once proc_misc_init has completed, the kernel uses proc_net_init to install a large number of networking-related files in /proc/net. Since the mechanism is similar to the previous case, it is not discussed here.

Finally, the kernel invokes proc_mkdir to create a number of /proc subdirectories; these are required later but do not contain files at the moment. As for proc_mkdir, all we need to know is that the function
registers a new subdirectory and returns the associated proc_dir_entry instance; its implementation is of no further interest. The kernel saves these instances in global variables because these data are needed later when filling the directories with files (i.e., when supplying the real information).

fs/proc_root.c
struct proc_dir_entry *proc_net, *proc_bus, *proc_root_fs, *proc_root_driver;

void __init proc_root_init(void)
{
	...
	proc_net = proc_mkdir("sysvipc", NULL);
	...
	proc_root_fs = proc_mkdir("fs", NULL);
	proc_root_driver = proc_mkdir("driver", NULL);
	...
	proc_bus = proc_mkdir("bus", NULL);
}
Further directory initialization is no longer carried out by the proc layer itself but is performed by other parts of the kernel where the required information is made available. This makes it clear why the kernel uses global variables to save the proc_dir_entry instances of these subdirectories. The files in proc/net are filled, for example, by the network layer, which inserts files at many different points in the code of card drivers and protocols. Because new files are created when new cards or protocols are initialized, this can be done during the boot operation (in the case of compiled-in drivers) or while the system is running (when modules are loaded) — in any case, after initialization of the proc filesystem by proc_root_init has completed. If the kernel did not use global variables, it would have to provide functions to register subsystem-specific entries, and this is neither as clean nor as elegant as using global variables. The system control mechanism fills proc_sys_root with files that are always generated when a new sysctl is defined in the kernel. Repeated reference was made to this facility in earlier chapters. A detailed description of the associated mechanism is provided in Section 10.1.8.
10.1.4 Mounting the Filesystem

Once all kernel-internal data that describe the structure and contents of the proc filesystem have been initialized, the next step is to mount the filesystem in the directory tree. In the view of the system administrator in userspace, mounting /proc is almost the same as mounting a non-virtual filesystem. The only difference is that an arbitrary keyword (usually proc or none) is specified as the source instead of a device file:

root@meitner # mount -t proc proc /proc
The VFS-internal processes involved in mounting a new filesystem are described in detail in Chapter 8, but as a reminder are summarized below. When it adds a new filesystem, the kernel uses a linked list that is scanned to find an instance of file_system_type associated with the filesystem. This instance provides information on how to read in the filesystem superblock. For proc, the structure is initialized as follows:

fs/proc/root.c
static struct file_system_type proc_fs_type = {
	.name		= "proc",
	.get_sb		= proc_get_sb,
	.kill_sb	= kill_anon_super,
};
The filesystem-specific superblock data are used to fill a vfsmount structure so that the new filesystem can be incorporated in the VFS tree. As the source code extract above shows, the superblock of the proc filesystem is supplied by proc_get_sb. The function builds on a further kernel auxiliary routine (get_sb_single) that enlists the help of proc_fill_super to fill a new instance of super_block. proc_fill_super is not very complex and is mainly responsible for filling the super_block elements with defined values that never change:

fs/proc/inode.c
int proc_fill_super(struct super_block *s, void *data, int silent)
{
	struct inode * root_inode;

	...
	s->s_blocksize = 1024;
	s->s_blocksize_bits = 10;
	s->s_magic = PROC_SUPER_MAGIC;
	s->s_op = &proc_sops;
	...
	root_inode = proc_get_inode(s, PROC_ROOT_INO, &proc_root);
	s->s_root = d_alloc_root(root_inode);
	...
	return 0;
}
The block size cannot be set and is always 1,024; as a result, s_blocksize_bits must always be 10 because 2^10 equals 1,024. With the help of the pre-processor, the magic number used to recognize the filesystem is defined as 0x9fa0. (This number is not actually needed in the case of proc because data do not reside on a storage medium but are generated dynamically.) More interesting is the assignment of the proc_sops superblock operations that group together the functions needed by the kernel to manage the filesystem:

fs/proc/inode.c
static struct super_operations proc_sops = {
	.alloc_inode	= proc_alloc_inode,
	.destroy_inode	= proc_destroy_inode,
	.read_inode	= proc_read_inode,
	.drop_inode	= generic_delete_inode,
	.delete_inode	= proc_delete_inode,
	.statfs		= simple_statfs,
	.remount_fs	= proc_remount,
};
The next two lines of proc_fill_super create an inode for the root directory and use d_alloc_root to convert it into a dentry that is assigned to the superblock instance; here it is used as the starting point for lookup operations in the mounted filesystem, as described in Chapter 8. In the main, the proc_get_inode function used to create the root inode fills several inode structure values to define, for example, the owner and the access mode. Of greater interest is the static proc_dir_entry instance called proc_root; when it is initialized, it gives rise to data structures with relevant function pointers:

fs/proc/root.c
struct proc_dir_entry proc_root = {
	.low_ino	= PROC_ROOT_INO,
	.namelen	= 5,
	.name		= "/proc",
	.mode		= S_IFDIR | S_IRUGO | S_IXUGO,
	.nlink		= 2,
	.count		= ATOMIC_INIT(1),
	.proc_iops	= &proc_root_inode_operations,
	.proc_fops	= &proc_root_operations,
	.parent		= &proc_root,
};
The root inode differs from all other inodes of the proc file system in that it not only contains ‘‘normal‘‘ files and directories (even though they are generated dynamically), but also manages the process-specific PID directories that contain detailed information on the individual system processes, as mentioned above. The root inode therefore has its own inode and file operations, which are defined as follows:

fs/proc/root.c
/*
 * The root /proc directory is special, as it has the
 * <pid> directories. Thus we don't use the generic
 * directory handling functions for that..
 */
10.1.5 Managing /proc Entries

Before the proc filesystem can be put to meaningful use, it must be filled with entries containing data. Several auxiliary routines are provided to add files, create directories, and so on, in order to make this job as easy as possible for the remaining kernel sections. These routines are discussed below.
The fact that new proc entries can be easily generated should not disguise the fact that adding new entries is no longer accepted practice for new code. Nevertheless, the simple, lean interface can be very useful for opening up a communication channel for test purposes between kernel and userspace with minimum effort.
I also discuss methods used by the kernel to scan the tree of all registered proc entries to find required information.
Creating and Registering Entries

New entries are added to the proc filesystem in two steps. First, a new instance of proc_dir_entry is created together with all information needed to describe the entry. This instance is then registered in the data structures of proc so that it is visible to the outside. Because the two steps are never carried out independently of each other, the kernel makes auxiliary functions available to combine both actions so that new entries can be generated quickly and easily. The most frequently used function is called create_proc_entry and requires three arguments:

<proc_fs.h>
extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode, struct proc_dir_entry *parent);
❑ name specifies the filename.
❑ mode specifies the access mode in the conventional Unix scheme (user/group/others).
❑ parent is a pointer to the proc_dir_entry instance of the directory where the file is to be inserted.
Caution: The function fills only the essential elements of the proc_dir_entry structure. It is therefore necessary to make a few brief ‘‘manual‘‘ corrections to the structure generated. This is illustrated by the following sample code, which generates the proc/net/hyperCard entry to supply information on a (unbelievably good) network card:

struct proc_dir_entry *entry = NULL;

entry = create_proc_entry("hyperCard", S_IFREG|S_IRUGO|S_IWUSR, &proc_net);

if (!entry) {
	printk(KERN_ERR "unable to create /proc/net/hyperCard\n");
	return -EIO;
} else {
	entry->read_proc = hypercard_proc_read;
	entry->write_proc = hypercard_proc_write;
}
Once the entry has been created, it is registered with the proc filesystem using proc_register in fs/proc/generic.c. The task is divided into three steps:
1. A unique proc-internal number is generated to give the entry its own identity. get_inode_number is used to return an unused number for dynamically generated entries.
2. The next and parent elements of the proc_dir_entry instance must be set appropriately to incorporate the new entry into the hierarchy.
3. Depending on the file type, the pointers must be set appropriately to file and inode operations if the corresponding elements of proc_dir_entry, proc_iops and proc_fops, previously contained a null pointer. Otherwise, the value held there is retained.
Which file and inode operations are used for proc files? The corresponding pointers are set as follows:

fs/proc/generic.c
static int proc_register(struct proc_dir_entry * dir, struct proc_dir_entry * dp)
{
	if (S_ISDIR(dp->mode)) {
		if (dp->proc_iops == NULL) {
			dp->proc_fops = &proc_dir_operations;
			dp->proc_iops = &proc_dir_inode_operations;
		}
		dir->nlink++;
	} else if (S_ISLNK(dp->mode)) {
		if (dp->proc_iops == NULL)
			dp->proc_iops = &proc_link_inode_operations;
	} else if (S_ISREG(dp->mode)) {
		if (dp->proc_fops == NULL)
			dp->proc_fops = &proc_file_operations;
		if (dp->proc_iops == NULL)
			dp->proc_iops = &proc_file_inode_operations;
	}
	...
}
For regular files, the kernel uses proc_file_operations and proc_file_inode_operations to define the file and inode operation methods:

fs/proc/generic.c
static struct inode_operations proc_file_inode_operations = {
	.setattr	= proc_notify_change,
};

fs/proc/generic.c
static struct file_operations proc_file_operations = {
	.llseek		= proc_file_lseek,
	.read		= proc_file_read,
	.write		= proc_file_write,
};
Directories use the following structures:

fs/proc/generic.c
static struct file_operations proc_dir_operations = {
	.read		= generic_read_dir,
	.readdir	= proc_readdir,
};

fs/proc/generic.c
/* proc directories can do almost nothing... */
static struct inode_operations proc_dir_inode_operations = {
	.lookup		= proc_lookup,
	.getattr	= proc_getattr,
	.setattr	= proc_notify_change,
};
Symbolic links require inode operations but not file operations:

fs/proc/generic.c
static struct inode_operations proc_link_inode_operations = {
	.readlink	= generic_readlink,
	.follow_link	= proc_follow_link,
};
Later in this section, I take a closer look at the implementation of some of the routines in the above data structures.

In addition to create_proc_entry, the kernel provides two further auxiliary functions for creating new proc entries. Both are short wrapper routines around create_proc_entry and are defined with the following parameter lists:

<proc_fs.h>
static inline struct proc_dir_entry *create_proc_read_entry(const char *name,
	mode_t mode, struct proc_dir_entry *base,
	read_proc_t *read_proc, void * data)
{
	...
}

static inline struct proc_dir_entry *create_proc_info_entry(const char *name,
	mode_t mode, struct proc_dir_entry *base, get_info_t *get_info)
{
	...
}

create_proc_read_entry and create_proc_info_entry are used to create a new read entry. Because this can be done in two different ways (as discussed in Section 10.1.2), there must also be two routines. Whereas create_proc_info_entry requires a procedure pointer of type get_info_t that is added to the get_info element of proc_dir_entry, create_proc_read_entry expects not only a procedure pointer of type read_proc_t, but also a data pointer that enables the same function to be used as a read routine for various proc entries distinguished by reference to their data argument.
Although we are not interested in their implementation, I include below a list of other auxiliary functions used to manage proc entries:
❑ proc_mkdir creates a new directory.
❑ proc_mkdir_mode creates a new directory whose access mode can be explicitly specified.
❑ proc_symlink generates a symbolic link.
❑ remove_proc_entry deletes a dynamically generated entry from the proc directory.
The kernel sources include a sample file in Documentation/DocBook/procfs_example.c. This demonstrates the options described here and can be used as a template for writing proc routines. Section 10.1.6 includes some sample kernel source routines that are responsible for interaction between the read/write routines of the proc filesystem and the kernel subsystems.
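As a counterpart to the registration example above, tearing such an entry down again is a one-liner. The fragment below is only a sketch; parent stands for whatever proc_dir_entry pointer was passed as the parent directory when the entry was created:

	/* Remove the dynamically created entry again, for instance when the
	 * driver module is unloaded. */
	remove_proc_entry("hyperCard", parent);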
Finding Entries

Userspace applications access proc files as if they were normal files in regular filesystems; in other words, they follow the same path as the VFS routines described in Chapter 8 when searching for entries. As discussed there, the lookup process (e.g., of the open system call) duly arrives at real_lookup, which invokes the function saved in the lookup function pointer of inode_operations to resolve the filename by reference to its individual path components. In this section, we take a look at the steps performed by the kernel to find files in the proc filesystem.

The search for entries starts at the mount point of the proc filesystem, usually /proc. In Section 10.1.2 you saw that the lookup pointer of the inode_operations instance for the root directory of the process filesystem points to the proc_root_lookup function. Figure 10-3 shows the associated code flow diagram.
Figure 10-3: Code flow diagram for proc_root_lookup (proc_lookup and proc_pid_lookup).
The kernel uses this routine simply to distinguish between two different types of proc entries before delegating the real work to specialized routines. Entries may be files in a process-specific directory, as with /proc/1/maps. Alternatively, entries may be files registered dynamically by a driver or subsystem (e.g., /proc/cpuinfo or /proc/net/dev). It is up to the kernel to distinguish between the two.

The kernel first invokes proc_lookup to find regular entries. If the function finds the file it is looking for (by sequentially scanning the components of the specified path), everything is OK, and the lookup operation is terminated. If proc_lookup fails to find an entry, the kernel invokes proc_pid_lookup to look in the list of process-specific entries. These functions are not examined in detail here. All we need to know is that an appropriate inode type is returned (proc_pid_lookup is discussed again in Section 10.1.7, where the creation and structure of process-specific inodes are discussed).
10.1.6 Reading and Writing Information

As noted in Section 10.1.5, the kernel uses the operations stored in proc_file_operations to read and write the contents of regular proc entries. The contents of the function pointers in this structure are as follows:

fs/proc/generic.c
static struct file_operations proc_file_operations = {
	.llseek		= proc_file_lseek,
	.read		= proc_file_read,
	.write		= proc_file_write,
};
The sections below examine the read and write operations implemented by means of proc_file_read and proc_file_write.
Implementation of proc_file_read

Data are read from a proc file in three steps:
1. A kernel memory page is allocated into which data are generated.
2. A file-specific function is invoked to fill the kernel memory page with data.
3. The data are copied from kernel space to userspace.
Obviously, the second step is the most important because the subsystem data and kernel data structures must be specially prepared. The other two steps are simple routine tasks. Section 10.1.2 noted that the proc_dir_entry structure provides the two function pointers get_info and read_proc; these functions are used to read data, and the kernel must select the one that matches:

fs/proc/generic.c
proc_file_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos)
{
	...
	if (dp->get_info) {
		/* Handle old net routines */
		n = dp->get_info(page, &start, *ppos, count);
		if (n < count)
			eof = 1;
	} else if (dp->read_proc) {
		n = dp->read_proc(page, &start, *ppos, count,
				  &eof, dp->data);
	} else
		break;
	...
}

page is a pointer to the memory page allocated to hold the data in the first step.
Since a sample implementation of read_proc is included in Section 10.1.5, it need not be repeated here.
Implementation of proc_file_write

Writing to proc files is also a simple matter — at least from the perspective of the filesystem. The code of proc_file_write is very compact and thus is reproduced in full below.

fs/proc/generic.c
static ssize_t proc_file_write(struct file *file, const char __user *buffer,
			       size_t count, loff_t *ppos)
{
	struct inode *inode = file->f_dentry->d_inode;
	struct proc_dir_entry * dp;

	dp = PDE(inode);

	if (!dp->write_proc)
		return -EIO;

	return dp->write_proc(file, buffer, count, dp->data);
}
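The PDE helper used above is only a thin wrapper around the PROC_I conversion introduced in Section 10.1.2. In simplified form it amounts to the following sketch (not a verbatim copy of the definition in <proc_fs.h>):

static inline struct proc_dir_entry *PDE(const struct inode *inode)
{
	return PROC_I(inode)->pde;	/* proc entry behind a VFS inode */
}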
The PDE function needed to obtain the required proc_dir_entry instance from the VFS inode using the container mechanism is very simple. All it does is execute PROC_I(inode)->pde. As discussed in Section 10.1.2, PROC_I finds the proc_inode instance associated with an inode (in the case of proc inodes, the inode data always immediately precede the VFS inode). Once the proc_dir_entry instance has been found, the routine registered for write purposes must be invoked with suitable parameters — assuming, of course, that the routine exists and is not assigned a null pointer.

How does the kernel implement a write routine for proc entries? This question is answered using proc_write_foobar, which is included as an example for a write handler in the kernel sources:

kernel/Documentation/DocBook/procfs_example.c
static int proc_write_foobar(struct file *file, const char *buffer,
			     unsigned long count, void *data)
{
	int len;
	struct fb_data_t *fb_data = (struct fb_data_t *)data;

	if (count > FOOBAR_LEN)
		len = FOOBAR_LEN;
	else
		len = count;

	if (copy_from_user(fb_data->value, buffer, len))
		return -EFAULT;

	fb_data->value[len] = '\0';
	/* Parse the data and perform actions in the subsystem */

	return len;
}
Usually, a proc_write implementation performs the following actions:

1. First, the length of the user input (it can be determined using the count parameter) must be checked to ensure that it is not longer than the reserved area.
2. The data are copied from userspace into the reserved kernel space area.
3. Information is extracted from the string. This operation is known as parsing, a term borrowed from compiler design. In the above example, this task is delegated to the cpufreq_parse_policy function (a small parsing sketch follows this list).
4. Manipulations are then performed on the (sub)system in accordance with the user information received.
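A very common form of parsing is converting the user string into an integer. The following fragment is only a sketch of how the buffer copied in the example above could be evaluated; the surrounding code is assumed to be the write handler shown before:

	/* Interpret the NUL-terminated user input as a decimal number and
	 * hand it to the subsystem afterward. */
	unsigned long val;

	val = simple_strtoul(fb_data->value, NULL, 10);
	/* ... act on the new value in the (sub)system ... */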
10.1.7 Task-Related Information

Outputting detailed information on system processes was one of the prime tasks for which the proc filesystem was originally designed, and this still holds true today. As demonstrated in Section 10.1.5, proc_pid_lookup is responsible for opening PID-specific files in /proc/<pid>.
[Figure 10-4 depicts the flow of proc_pid_lookup: if the name is self, the self inode is created; otherwise name_to_int, find_task_by_pid_ns, proc_pid_instantiate, and proc_pid_make_inode are invoked in turn, and the file and inode operations are filled in.]
Figure 10-4: Code flow diagram for proc_pid_lookup.
The goal of the routine is to create an inode that acts as the first object for further PID-specific operations; this is because the inode represents the /proc/pid directory containing all files with process-specific information. Two cases, analyzed below, must be distinguished.
The self directory

Processes can be selected by explicit reference to their PIDs, but the data of the currently running process can be accessed without knowing its PID by selecting the /proc/self directory — the kernel then
automatically determines which process is currently running. For example, outputting the contents of /proc/self/cmdline with cat produces the following result:

wolfgang@meitner> cat /proc/self/cmdline
cat/proc/self/cmdline
If a Perl script is used to read the file, the following information is obtained:

wolfgang@meitner> perl -e 'open(DAT, "< /proc/self/cmdline"); print(<DAT>);'
Because the script was passed to the Perl interpreter as a command-line parameter, it reproduces itself — in fact, it is almost a self-printing Perl script.1 The self case is handled first in proc_pid_lookup, as the code flow diagram in Figure 10-4 shows. When a new inode instance is generated, only a few uninteresting standard fields need to be filled. Of prime importance is the fact that the statically defined proc_self_inode_operations instance is used for the inode operations: fs/proc/base.c
static struct inode_operations proc_self_inode_operations = {
        .readlink       = proc_self_readlink,
        .follow_link    = proc_self_follow_link,
};
The self directory is implemented as a link to a PID-specific directory. As a result, the associated inode always has the same structure and does not contain any information about the process to which it refers. This information is obtained dynamically when the link target is read (this is necessary when following or reading a link, e.g., when listing the entries of /proc). This is precisely the purpose of the two functions in proc_self_inode_operations, whose implementations require just a few lines:

fs/proc/base.c
static int proc_self_readlink(struct dentry *dentry, char *buffer, int buflen)
{
        char tmp[30];
        sprintf(tmp, "%d", current->tgid);
        return vfs_readlink(dentry,buffer,buflen,tmp);
}

static void *proc_self_follow_link(struct dentry *dentry, struct nameidata *nd)
{
        char tmp[PROC_NUMBUF];
        sprintf(tmp, "%d", task_tgid_vnr(current));
        return ERR_PTR(vfs_follow_link(nd,tmp));
}
Both functions generate a string into tmp. For proc_self_readlink, it holds the thread group ID of the currently running process, which is read using current->tgid. For proc_self_follow_link, the PID that the current namespace associates with the task is used. Recall from Chapter 2 that PIDs are not unique across the system because of namespaces. Also remember that the thread group ID is identical with the classic PID for single-threaded processes. The sprintf function, with which we are familiar from C programming of userspace applications, converts the integer number into a string. The remaining work is then delegated to standard virtual filesystem functions that are responsible for directing the lookup operation to the right places.

1 Writing programs that print themselves is an old hacker's delight. A collection of such programs in a wide variety of high-level languages is available at www.nyx.net/~gthompso/quine.htm.
Selection According to PID

Let us turn our attention to how the process-specific information is selected by PID.
Creating the Directory Inode

If a PID is passed to proc_pid_lookup instead of "self", the course of the lookup operation is as shown in the code flow diagram in Figure 10-4. Because filenames are always processed in the form of strings but PIDs are integer numbers, the former must be converted accordingly. The kernel provides the name_to_int auxiliary function to convert strings consisting of digits into an integer. The information obtained is used to find the task_struct instance of the desired process by means of the find_task_by_pid_ns function described in Chapter 2. However, the kernel cannot make the assumption that the desired process actually exists. After all, it is not unknown for programs to try to process a nonexistent PID, in which case a corresponding error (-ENOENT) is reported.

Once the desired task_struct is found, the kernel delegates the rest of the work mostly to proc_pid_instantiate implemented in fs/proc/base.c, which itself relies on proc_pid_make_inode. First, a new inode is created by the new_inode standard function of VFS; this basically boils down to the same proc-specific proc_alloc_inode routine mentioned above that makes use of its own slab cache.
The routine not only generates a new struct inode instance, but also reserves memory needed by struct proc_inode; the reserved memory holds a normal VFS inode as a "subobject," as noted in Section 10.1.2. The elements of the object generated are then filled with standard values.
After calling proc_pid_make_inode, all the remaining code in proc_pid_instantiate has to do is perform a couple of administrative tasks. Most important, the inode->i_op inode operations are set to the proc_tgid_base_inode_operations static structure whose contents are examined below.
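Condensed into code, the PID branch of the lookup just described looks roughly as follows. This is only an illustrative sketch (the function name is invented; the self case, locking, and task reference counting are omitted), not the verbatim kernel implementation:

static struct dentry *pid_lookup_sketch(struct inode *dir, struct dentry *dentry)
{
        struct task_struct *task;
        unsigned tgid;

        /* Convert the filename ("1234") into an integer PID;
         * name_to_int signals failure by returning ~0U */
        tgid = name_to_int(dentry);
        if (tgid == ~0U)
                return ERR_PTR(-ENOENT);

        /* Resolve the PID in the namespace associated with this proc instance */
        task = find_task_by_pid_ns(tgid, dentry->d_sb->s_fs_info);
        if (!task)
                return ERR_PTR(-ENOENT);        /* no process with this PID */

        /* Create the inode for /proc/pid and install the inode operations */
        return proc_pid_instantiate(dir, dentry, task, NULL);
}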
Processing Files

When a file (or directory) in the PID-specific /proc/pid directory is processed, this is done using the inode operations of the directory, as noted in Chapter 8 when discussing the virtual filesystem mechanisms. The kernel uses the statically defined proc_tgid_base_inode_operations structure as the inode operations of PID inodes. This structure is defined as follows:

fs/proc/base.c
static const struct inode_operations proc_tgid_base_inode_operations = {
        .lookup         = proc_tgid_base_lookup,
        .getattr        = pid_getattr,
        .setattr        = proc_setattr,
};
In addition to attribute handling, the directory supports just one more operation — subentry lookup.2 The task of proc_tgid_base_lookup is to return an inode instance with suitable inode operations by reference to a given name (cmdline, maps, etc.). The extended inode operations (proc_inode) must also include a function to output the desired data. Figure 10-5 shows the code flow diagram.

2 A special readdir method is also implemented for proc_tgid_base_operations (an instance of struct file_operations) to read a list of all files in the directory. It's not discussed here simply because every PID-specific directory always contains the same files, and therefore the same data would always be returned.

[Figure 10-5 shows the flow: proc_tgid_base_lookup calls proc_pident_lookup, which checks whether the name exists in tgid_base_stuff and then uses proc_pident_instantiate and proc_pid_make_inode to fill in the inode and file operations.]
Figure 10-5: Code flow diagram for proc_tgid_base_lookup.

The work is delegated to proc_pident_lookup, which works not only for TGID files, but is a generic method for other ID types. The first step is to find out whether the desired entry exists at all. Because the contents of the PID-specific directory are always the same, a static list of all files together with a few other bits of information is defined in the kernel sources. The list is called tgid_base_stuff and is used to find out easily whether a desired directory entry exists or not. The array contains elements of type pid_entry, which is defined as follows:

fs/proc/base.c
struct pid_entry {
        char *name;
        int len;
        mode_t mode;
        const struct inode_operations *iop;
        const struct file_operations *fop;
        union proc_op op;
};

name and len specify the filename and the string length of the name, while mode denotes the mode bits.
Additionally, there are fields for the inode and file operations associated with the entry, and an instance of union proc_op. Recall that this contains a pointer to the proc_get_link or proc_read operation, depending on the file type. Some macros are provided to ease the construction of static pid_entry instances:

fs/proc/base.c
#define DIR(NAME, MODE, OTYPE) \
        NOD(NAME, (S_IFDIR|(MODE)), \
                &proc_##OTYPE##_inode_operations, &proc_##OTYPE##_operations, \
                {} )
#define LNK(NAME, OTYPE) \
        NOD(NAME, (S_IFLNK|S_IRWXUGO), \
                &proc_pid_link_inode_operations, NULL, \
                { .proc_get_link = &proc_##OTYPE##_link } )
#define REG(NAME, MODE, OTYPE) \
        NOD(NAME, (S_IFREG|(MODE)), NULL, \
                &proc_##OTYPE##_operations, {})
#define INF(NAME, MODE, OTYPE) \
        NOD(NAME, (S_IFREG|(MODE)), \
                NULL, &proc_info_file_operations, \
                { .proc_read = &proc_##OTYPE } )
As the names indicate, the macros generate directories, links, and regular files. INF also generates regular files, but in contrast to REG files, they do not need to provide specialized file operations; they need only fill in proc_read from pid_entry->op. Observe how

REG("environ", S_IRUSR, environ)
/*********************************/
INF("auxv", S_IRUSR, pid_auxv)

are expanded to see how the two types differ:

{
        .name = ("environ"),
        .len  = sizeof("environ") - 1,
        .mode = (S_IFREG|(S_IRUSR)),
        .iop  = NULL,
        .fop  = &proc_environ_operations,
        .op   = {},
}
/*********************************/
{
        .name = ("auxv"),
        .len  = sizeof("auxv") - 1,
        .mode = (S_IFREG|(S_IRUSR)),
        .iop  = NULL,
        .fop  = &proc_info_file_operations,
        .op   = { .proc_read = &proc_pid_auxv },
}
The macros are used to construct the TGID-specific directory entries in tgid_base_stuff: fs/proc/base.c
static const struct pid_entry tgid_base_stuff[] = {
        DIR("task",       S_IRUGO|S_IXUGO, task),
        DIR("fd",         S_IRUSR|S_IXUSR, fd),
        DIR("fdinfo",     S_IRUSR|S_IXUSR, fdinfo),
        REG("environ",    S_IRUSR, environ),
        INF("auxv",       S_IRUSR, pid_auxv),
        INF("status",     S_IRUGO, pid_status),
        INF("limits",     S_IRUSR, pid_limits),
        ...
        INF("oom_score",  S_IRUGO, oom_score),
        REG("oom_adj",    S_IRUGO|S_IWUSR, oom_adjust),
#ifdef CONFIG_AUDITSYSCALL
        REG("loginuid",   S_IWUSR|S_IRUGO, loginuid),
#endif
#ifdef CONFIG_FAULT_INJECTION
        REG("make-it-fail", S_IRUGO|S_IWUSR, fault_inject),
#endif
#if defined(USE_ELF_CORE_DUMP) && defined(CONFIG_ELF_CORE)
        REG("coredump_filter", S_IRUGO|S_IWUSR, coredump_filter),
#endif
#ifdef CONFIG_TASK_IO_ACCOUNTING
        INF("io",         S_IRUGO, pid_io_accounting),
#endif
};
The structure describes each entry by type, name, and access rights. The latter are defined using the usual VFS constants with which we are familiar from Chapter 8. To summarize, various types of entry can be distinguished:

❑ INF-style files use a separate proc_read function to obtain the desired data. The proc_info_file_operations standard instance is used as the file_operations structure. The methods it defines represent the VFS interface that passes the data returned by proc_read upward.

❑ LNK generates symbolic links that point to another VFS file. A type-specific function in proc_get_link specifies the link target, and proc_pid_link_inode_operations forwards the data to the virtual filesystem in suitable form.

❑ REG creates regular files that use specialized file operations responsible for gathering data and forwarding them to the VFS layer. This is necessary if the data source does not fit into the framework provided by proc_info_file_operations.
Let us return to proc_pident_lookup. To check whether the desired name is present, all the kernel does is iterate over the array elements and compare the names stored there with the required name until it strikes lucky — or perhaps not. After it has ensured that the name exists in tgid_base_stuff, the function generates a new inode using proc_pident_instantiate, which, in turn, uses the already known proc_pid_make_inode function.
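A sketch of this scan conveys the idea; the function name and structure are modeled on proc_pident_lookup, but the code is simplified and not identical to the kernel source:

static struct dentry *pident_lookup_sketch(struct inode *dir, struct dentry *dentry,
                                           const struct pid_entry *ents, unsigned int nents)
{
        struct task_struct *task = get_proc_task(dir);
        const struct pid_entry *p;
        struct dentry *result;

        if (!task)
                return ERR_PTR(-ENOENT);

        /* Scan the static table for an entry whose name matches the
         * requested filename; the length is compared first. */
        for (p = ents; p < ents + nents; p++) {
                if (p->len != dentry->d_name.len)
                        continue;
                if (!memcmp(dentry->d_name.name, p->name, p->len))
                        break;
        }

        if (p == ents + nents) {
                put_task_struct(task);
                return ERR_PTR(-ENOENT);        /* no such entry */
        }

        /* Found: construct an inode with the operations recorded in *p */
        result = proc_pident_instantiate(dir, dentry, task, p);
        put_task_struct(task);
        return result;
}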
10.1.8 System Control Mechanism

Kernel behavior can be modified at run time by means of system controls. Parameters can be transferred from userspace into the kernel without having to reboot. The classic method of manipulating the kernel is the sysctl system call. However, for a variety of reasons, this is not always the most elegant option — one reason being that it is necessary to write a program to read arguments and pass them to the kernel using sysctl.

Unfortunately, this method does not allow users to obtain a quick overview of which kernel control options are available; unlike with system calls, there is no POSIX or, indeed, any other standard that defines a standard set of sysctls to be implemented by all compatible systems. Consequently, the sysctl implementation is now regarded as outmoded and will, in the short or the long term, sink into oblivion.

To resolve this situation, Linux resorts to the proc filesystem. It exports to /proc/sys a directory structure that arranges all sysctls hierarchically and also allows parameters to be read and manipulated using simple userspace tools; cat and echo are sufficient to modify kernel run-time behavior.
This section not only examines the proc interface of the sysctl mechanism, but also discusses how sysctls are registered and managed in the kernel, particularly as these two aspects are closely related.
Using Sysctls

To paint a general picture of system control options and usage, I have chosen a short example to illustrate how userspace programs call on sysctl resources with the help of the sysctl system call. The example also shows how difficult things would be without the proc filesystem.

The many sysctls in every Unix look-alike are organized into a clear hierarchical structure that mirrors the familiar tree structure used in filesystems, and it is thanks to this feature that sysctls can be exported with such ease by a virtual filesystem. However, in contrast to filesystems, sysctls do not use strings to represent path components. Instead, they use integer numbers packed in symbolic constants. These are easier for the kernel to parse than pathnames in strings. The kernel provides several "base categories" including CTL_DEV (information on peripherals), CTL_KERN (information on the kernel itself), and CTL_VM (memory management information and parameters). CTL_DEV includes a subcategory named DEV_CDROM that supplies information on the CD-ROM drive(s) of the system (CD-ROM drives are obviously peripherals). In CTL_DEV/DEV_CDROM there are several "end points" representing the actual sysctls. For example, there is a sysctl called DEV_CDROM_INFO which supplies general information on the capabilities of the drive. Applications wishing to access this sysctl must specify the pathname CTL_DEV/DEV_CDROM/DEV_CDROM_INFO to identify it uniquely. The numeric values of the required constants are defined in <sysctl.h>, which the standard library also uses (via /usr/include/sys/sysctl.h). Figure 10-6 shows a graphic excerpt from the sysctl hierarchy that also includes the path described above.
[Figure 10-6 shows the base categories CTL_KERN, CTL_VM, CTL_NET, and CTL_DEV together with children such as KERN_OSTYPE, KERN_OSRELEASE, KERN_SYSRQ, VM_SWAPPINESS, NET_CORE, NET_IPV4, DEV_PARPORT, and DEV_CDROM; below DEV_CDROM sit DEV_CDROM_INFO, DEV_CDROM_AUTOCLOSE, and DEV_CDROM_CHECK_MEDIA.]
Figure 10-6: Hierarchy of sysctl entries.
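A minimal userspace program along the lines described might look as follows; this is a sketch for illustration (error handling is omitted), using the constants declared via <sys/sysctl.h>:

/* Read the CD-ROM information sysctl via the sysctl(2) system call */
#include <stdio.h>
#include <string.h>
#include <sys/sysctl.h>

int main(void)
{
        /* Path CTL_DEV/DEV_CDROM/DEV_CDROM_INFO, one constant per component */
        int names[] = { CTL_DEV, DEV_CDROM, DEV_CDROM_INFO };
        char buf[4096];
        size_t buflen = sizeof(buf);

        memset(buf, 0, sizeof(buf));

        /* Read access: newval is a null pointer, newlen is zero */
        if (sysctl(names, 3, buf, &buflen, NULL, 0) == 0)
                printf("%s\n", buf);

        return 0;
}

The same information is, of course, available much more conveniently as /proc/sys/dev/cdrom/info.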
The core of the code is the sysctl function defined by the C standard library in /usr/include/sys/sysctl.h:

int sysctl (int *names, int nlen, void *oldval, size_t *oldlenp,
            void *newval, size_t newlen)
The path to the sysctl is given as an integer array in which each array element represents a path component. In our example, the path is defined statically in names. The kernel does not know how many path components there are and must therefore be informed explicitly by means of nlen; there are three components in our example. oldval is a pointer to a memory area of undefined type, and oldlenp specifies the size of the reserved area in bytes. The kernel uses the oldval pointer to return the old value represented by sysctl. If this
information can be read but not manipulated, its value is the same both before and after the sysctl call. In this case, oldval is used to read its value. Once the system call has been executed, the length of the output data is given in oldlenp; for this reason, the variable must be passed by reference and not by value. newval and newlen also form a pair consisting of a pointer and a length specification. They are used when a sysctl allows a kernel parameter to be modified. The newval pointer points to the memory area where the new information is held in userspace, and newlen specifies its length. A null pointer is passed for newval and a zero for newlen in the case of read access, as in our example.
How does the sample code work? Once all parameters have been generated for the sysctl call (pathname and memory location to return the desired information), sysctl is invoked and returns an integer number as its result. 0 means that the call was successful (I skip error handling for the sake of simplicity). The data obtained are held in oldval and can be printed out like any normal C string using printf.
Data Structures

The kernel defines several data structures for managing sysctls. As usual, let's take a closer look at them before examining their implementation. Because sysctls are arranged hierarchically (each larger kernel subsystem defines its own sysctl list with its various subsections), the data structure must not only hold information on the individual sysctls and their read and write operations, it must also provide ways of mapping the hierarchy between the individual entries. Each sysctl entry has its own ctl_table instance:

<sysctl.h>
struct ctl_table {
        int ctl_name;                   /* Binary ID */
        const char *procname;           /* Text ID for /proc/sys, or zero */
        void *data;
        int maxlen;
        mode_t mode;
        struct ctl_table *child;
        struct ctl_table *parent;       /* Automatically set */
        proc_handler *proc_handler;     /* Callback for text formatting */
        ctl_handler *strategy;          /* Callback function for all r/w */
        struct proc_dir_entry *de;      /* /proc control block */
        void *extra1;
        void *extra2;
};
The name of the structure is misleading: A sysctl table is an array of ctl_table instances, whereas a single instance of the structure is called a sysctl entry — despite the word table in its name. The meanings of the structure elements are as follows:
❑ ctl_name is an ID that must be unique only on the given hierarchy level of the entry but not in the entire table. <sysctl.h> contains countless enums that define sysctl identifiers for various purposes. The identifiers for the base categories are defined by the following enumeration:

<sysctl.h>
enum
{
        CTL_KERN=1,     /* General kernel info and control */
        CTL_VM=2,       /* VM management */
        CTL_NET=3,      /* Networking */
        CTL_PROC=4,     /* Process info */
        CTL_FS=5,       /* File Systems */
        CTL_DEBUG=6,    /* Debugging */
        CTL_DEV=7,      /* Devices */
        CTL_BUS=8,      /* Busses */
        CTL_ABI=9,      /* Binary emulation */
        CTL_CPU=10,     /* CPU stuff (speed scaling, etc) */
        ...
};
Below CTL_DEV, there are entries for various device types: <sysctl.h>
/* CTL_DEV names: */
enum {
        DEV_CDROM=1,
        DEV_HWMON=2,
        DEV_PARPORT=3,
        DEV_RAID=4,
        DEV_MAC_HID=5,
        DEV_SCSI=6,
        DEV_IPMI=7,
};
The constant 1 (and others) occurs more than once in the enumerations shown — in both CTL_KERN and DEV_CDROM. This is not a problem because the two entries are on different hierarchy levels, as shown in Figure 10-6.

❑ procname is a string containing a human-readable description of the entry in /proc/sys. The names of all root entries appear as directory names in /proc/sys.

wolfgang@meitner> ls -l /proc/sys
total 0
dr-xr-xr-x 2 root root 0 2006-08-11 00:09 debug
dr-xr-xr-x 8 root root 0 2006-08-11 00:09 dev
dr-xr-xr-x 7 root root 0 2006-08-11 00:09 fs
dr-xr-xr-x 4 root root 0 2006-08-11 00:09 kernel
dr-xr-xr-x 8 root root 0 2006-08-11 00:09 net
dr-xr-xr-x 2 root root 0 2006-08-11 00:09 proc
dr-xr-xr-x 2 root root 0 2006-08-11 00:09 sunrpc
dr-xr-xr-x 2 root root 0 2006-08-11 00:09 vm
If the entry is not to be exported to the proc filesystem (and is therefore only accessible using the sysctl system call), procname can also be assigned a null pointer, although this is extremely unusual.
❑ data may be assigned any value — usually a function pointer or a string — that is processed by sysctl-specific functions. The generic code leaves this element untouched.
❑ maxlen specifies the maximum length (in bytes) of data accepted or output by a sysctl.
❑ mode controls the access rights to the data and determines whether and by whom data may be read or written. Rights are assigned using the virtual filesystem constants with which you are familiar from Chapter 8.
❑ child is a pointer to an array of additional ctl_table elements regarded as children of the current element. For example, in the CTL_KERN sysctl entry, child points to a table containing entries such as KERN_OSTYPE (operating system type), KERN_OSRELEASE (kernel version number), and KERN_HOSTNAME (name of the host on which the kernel is running) because these are hierarchically subordinate to the CTL_KERN sysctl. Because the length of the ctl_table arrays is not stored explicitly anywhere, the last entry must always be an instance of ctl_table whose entries consist of null pointers.
❑ proc_handler is invoked when data are output via the proc interface. The kernel can output the data stored in the kernel directly, but also has the option of translating them into a more readable form (e.g., converting numeric constants into more meaningful strings).
❑ strategy is used by the kernel to read or write the value of a sysctl via the system call interface discussed above (note that proc uses different functions of its own for this purpose). ctl_handler is a typedef for a function pointer defined as follows:

<sysctl.h>
typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
                         void __user *oldval, size_t __user *oldlenp,
                         void __user *newval, size_t newlen);
In addition to the complete set of arguments used when the sysctl system call is invoked, the function also expects a pointer to the ctl_table instance where the current sysctl is located. It also needs a context-dependent void* pointer that is currently unused and to which a null pointer is therefore always assigned.
❑ The interface to the proc data is set up by de.
❑ extra1 and extra2 can be filled with proc-specific data that are not manipulated by the generic sysctl code. They are often used to define upper and lower limits for numeric arguments.
The kernel provides the ctl_table_header data structure to enable several sysctl tables to be maintained in a linked list that can be traversed and manipulated using the familiar standard functions. The structure is prefixed to a sysctl table in order to insert the elements needed for list management:

<sysctl.h>
struct ctl_table_header {
        ctl_table *ctl_table;
        struct list_head ctl_entry;
        ...
};

ctl_table is a pointer to a sysctl array (consisting of ctl_table elements). ctl_entry holds the elements required to manage the list. Figure 10-7 clearly illustrates the relationship between ctl_table_header and ctl_table.3
Figure 10-7: Relationship between ctl_table_header and ctl_table.

The hierarchical relationship between the various sysctl tables of the system is established by the child element of ctl_table and by the linked list implemented using ctl_table_header. The linkage via child enables a direct connection to be made between the various tables that map the sysctl hierarchy. In the kernel it is possible to define various hierarchies in which sysctl tables are interlinked by means of child pointers. However, because there may be just one overall hierarchy, the individual hierarchies must be "overlaid" to form a single hierarchy.

This situation is illustrated in Figure 10-7, in which there are two independent hierarchies. One is the standard kernel hierarchy containing sysctls to query, for example, the name of the host or the network status. This hierarchy also includes a container to supply information on system peripherals. The CD-ROM driver wants to export sysctls to output information on the CD-ROM drive of the system. What is needed is a sysctl (in /proc/sys/dev/cdrom/info in the proc filesystem) that is a child of CTL_DEV and provides, for example, general data to describe the drive. How does the driver go about this?

3 The list elements are actually below the data elements, but, for reasons of presentability, I have turned this situation "on its head" in the figure.
❑ First, a four-level hierarchy is created with the help of sysctl tables. CTL_DEV is the base level and has a child called DEV_CDROM. This also has several child elements, one of which is called DEV_CDROM_INFO.

❑ The new hierarchy is associated with the existing standard hierarchy in a linked list. This has the effect of "overlaying" the two hierarchies. Seen from userspace, it is impossible to distinguish between the hierarchies because they appear as a single overall hierarchy.
The sample program above used the sysctl described without having to know how the hierarchy is represented in the kernel. All it needs to know to access the required information is the path CTL_DEV->DEV_CDROM->DEV_CDROM_INFO. Of course, the contents of the /proc/sys directory in the proc filesystem are also constructed in such a way that the internal composition of the hierarchy is not visible.
Static Sysctl Tables

Static sysctl tables are defined for all sysctls, regardless of the system configuration.4 The base element is the table named root_table, which acts as the root of the statically defined data:

kernel/sysctl.c
static ctl_table root_table[];
static struct ctl_table_header root_table_header =
        { root_table, LIST_HEAD_INIT(root_table_header.ctl_entry) };
The table is given a header element so that additional hierarchies can be maintained in a linked list as described above; these can be overlaid with the hierarchy defined by root_table. The root_table table defines the framework into which the various sysctls are sorted: kernel/sysctl.c
static ctl_table root_table[] = {
        {
                .ctl_name       = CTL_KERN,
                .procname       = "kernel",
                .mode           = 0555,
                .child          = kern_table,
        },
        {
                .ctl_name       = CTL_VM,
                .procname       = "vm",
                .mode           = 0555,
                .child          = vm_table,
        },
#ifdef CONFIG_NET
        {
                .ctl_name       = CTL_NET,
                .procname       = "net",
                .mode           = 0555,
                .child          = net_table,
        },
#endif
        ...
        {
                .ctl_name       = CTL_DEV,
                .procname       = "dev",
                .mode           = 0555,
                .child          = dev_table,
        },
        { .ctl_name = 0 }
};

4 Even though sysctls of this kind are implemented on all architectures, their effect may differ from architecture to architecture.
Of course, further top-level categories can be added using the overlay mechanism described above. The kernel also selects this option, for example, for all sysctls that are assigned to the ABI (application binary interface) and belong to the CTL_ABI category. The tables referenced in the definition of root_table — kern_table, net_table, and so on — are likewise defined as static arrays. Because they hold a wealth of sysctls, we ignore their lengthy definitions here, particularly as they offer little of interest besides further static ctl_table instances. Their contents can be viewed in the kernel sources, and their definitions are included in kernel/sysctl.c.
Registering Sysctls

In addition to statically defined sysctls, the kernel features an interface for dynamically registering and unregistering new system control functions. register_sysctl_table is used to register controls and its counterpart, unregister_sysctl_table, to remove sysctl tables, typically when modules are unloaded.

The register_sysctl_table function requires one parameter — a pointer to an array of ctl_table entries in which the new sysctl hierarchy is defined. The function also comprises just a few steps. First, a new ctl_table_header is instantiated and associated with the sysctl table. The resulting construct is then added to the existing list of sysctl hierarchies. The auxiliary function sysctl_check_table is used to check that the new entry contains proper information. Basically, it ensures that no nonsense combinations are specified (e.g., directories that contain data, or directories that are writable) and that regular files have a valid strategy routine.

Registering a sysctl entry does not automatically create inode instances that connect the sysctl entries with proc entries. Since most sysctls are never used via proc, this would waste memory. Instead, the connection with proc files is created dynamically. Only the directory /proc/sys is created when procfs is initialized:

fs/proc/proc_sysctl.c
int proc_sys_init(void)
{
        proc_sys_root = proc_mkdir("sys", NULL);
        proc_sys_root->proc_iops = &proc_sys_inode_operations;
        proc_sys_root->proc_fops = &proc_sys_file_operations;
        proc_sys_root->nlink = 0;
        return 0;
}
The inode operations specified in proc_sys_inode_operations ensure that files and directories below /proc/sys are dynamically generated when they are needed. The contents of the structure are as follows:
fs/proc/proc_sysctl.c
static struct inode_operations proc_sys_inode_operations = {
        .lookup         = proc_sys_lookup,
        .permission     = proc_sys_permission,
        .setattr        = proc_sys_setattr,
};
Lookup operations are handled by proc_sys_lookup. The following approach is used to dynamically construct inodes for proc entries:
❑ do_proc_sys_lookup takes the parent dentry and the name of the file or directory to find the desired sysctl table entry. This involves mainly iterating over the data structures presented before.
❑ Given the inode of the parent directory and the sysctl table, proc_sys_make_inode is employed to construct the required inode instance. Since the new inode's inode operations are also implemented by proc_sys_inode_operations, it is ensured that the described method also works for new subdirectories.
The file operations for /proc/sys entries are given as follows: kernel/sysctl.c
static const struct file_operations proc_sys_file_operations = {
        .read           = proc_sys_read,
        .write          = proc_sys_write,
        .readdir        = proc_sys_readdir,
};
Read and write file operations for all entries are implemented by means of standard operations.
/proc/sys File Operations

The implementations for proc_sys_read and proc_sys_write are very similar. Both require three easy steps:
1. do_proc_sys_lookup finds the sysctl table entry that is associated with the file in /proc/sys.

2. It is not guaranteed that all rights on sysctl entries are granted even to the root user. Some entries can, for instance, be only read, but are not allowed to be changed, that is, written to. Thus an extra permission check with sysctl_perm is required. While proc_sys_read needs read permission, write permission is necessary for proc_sys_write.

3. Calling the proc handler stored in the sysctl table completes the action.
proc_handler is assigned a function pointer when the sysctl tables are defined. Because the various sysctls are spread over several standard categories (in terms of their parameter and return values), the kernel provides standard implementations that are normally used in place of the specific function implementations. Most frequently, the following functions are used:
❑ proc_dointvec reads or writes integer values from or to the kernel [the exact number of values is specified by table->maxlen/sizeof(unsigned int)]. Only a single integer may be involved (and not a vector) if maxlen indicates the byte number of a single unsigned int.

❑ proc_dointvec_minmax works in the same way as proc_dointvec, but ensures that each number is within a minimum and maximum value range specified by table->extra1 (minimum value) and table->extra2 (maximum value). All values outside the range are ignored. proc_doulongvec_minmax serves the same purpose, but uses values with type unsigned long instead of int.
❑ proc_dointvec_jiffies reads an integer table. The values are converted to jiffies. A nearly identical variant is proc_dointvec_ms_jiffies, where the values are interpreted as milliseconds.
❑ proc_dostring transfers strings between kernel and userspace. Strings that are longer than the internal buffer of an entry are automatically truncated. When data are copied into userspace, a newline character (\n) is appended automatically so that a line break is added after the information is output (e.g., using cat).
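To see how registration and the standard proc handlers fit together, consider the following sketch of a hypothetical subsystem exporting a single integer parameter with range checking. All names apart from the kernel interfaces are invented; CTL_UNNUMBERED is assumed to be available to mark entries that are reachable only via /proc/sys:

#include <linux/sysctl.h>
#include <linux/init.h>

static int my_param = 42;
static int my_param_min = 0;
static int my_param_max = 100;

static struct ctl_table my_param_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "my_param",
                .data           = &my_param,
                .maxlen         = sizeof(int),
                .mode           = 0644,
                .proc_handler   = &proc_dointvec_minmax,
                .extra1         = &my_param_min,        /* lower limit */
                .extra2         = &my_param_max,        /* upper limit */
        },
        { .ctl_name = 0 }       /* sentinel */
};

static struct ctl_table my_dir_table[] = {
        {
                .ctl_name       = CTL_UNNUMBERED,
                .procname       = "my_subsys",          /* /proc/sys/my_subsys */
                .mode           = 0555,
                .child          = my_param_table,
        },
        { .ctl_name = 0 }
};

static struct ctl_table_header *my_sysctl_header;

static int __init my_sysctl_init(void)
{
        my_sysctl_header = register_sysctl_table(my_dir_table);
        return my_sysctl_header ? 0 : -ENOMEM;
}

static void __exit my_sysctl_exit(void)
{
        unregister_sysctl_table(my_sysctl_header);
}

After registration, the parameter appears as /proc/sys/my_subsys/my_param and can be inspected and changed with cat and echo; values outside the interval [0, 100] are ignored by proc_dointvec_minmax.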
10.2 Simple Filesystems
Full-featured filesystems are hard to write and require a considerable amount of effort until they reach a usable, efficient, and correct state. This is reasonable if the filesystem is really supposed to store data on disk. However, filesystems — especially virtual ones — serve many purposes that differ from storing proper files on a block device. Such filesystems still run in the kernel, and their code is thus subjected to the rigorous quality requirements imposed by the kernel developers. However, various standard methods make this aspect of life much easier. A small filesystem library — libfs — contains nearly all ingredients required to implement a filesystem. Developers only need to provide an interface to their data, and they are done. Additionally, some more standard routines — in the form of the seq_file mechanism — are available to handle sequential files with little effort. Finally, developers might want to just export a value or two into userspace without messing with the existing filesystems like procfs. The kernel also provides a cure for this need: The debugfs filesystem allows for implementing a bidirectional debugging interface with only a few function calls.
10.2.1 Sequential Files

Before discussing any filesystem library, we need to have a look at the sequential file interface. Files in small filesystems will usually be read sequentially from start to end from userland, and their contents are created by iterating over several items. These could, for instance, be array elements. The kernel traverses the whole array from start to end and creates a textual representation for each element. Put into kernel nomenclature, one could also call this making synthetic files from sequences of records. The routines in fs/seq_file.c allow implementing such files with minimal effort. Despite their name, seeking is possible for sequential files, but the implementation is not very efficient. Sequential access — where one item is read after another — is clearly the preferred mode of access; simplicity in one aspect often comes with a price in other regards. The kprobe mechanism contains an interface to the aforementioned debug filesystem. A sequential file presents all registered probes to userland. I use this implementation to illustrate the idea of sequential files.
Writing Sequential File Handlers

Basically, an instance of struct file_operations that provides pointers to some seq_ routines must be implemented to benefit from the sequential file standard implementation. The kprobes subsystem does this as follows:

kernel/kprobes.c
static struct file_operations debugfs_kprobes_operations = {
        .open           = kprobes_open,
        .read           = seq_read,
        .llseek         = seq_lseek,
        .release        = seq_release,
};
This instance of file_operations can be associated with a file by the methods discussed in Chapter 8. In the case of kprobes, the file will be created in the debugging filesystem; see Section 10.2.3. The only method that needs to be implemented is open. Not much effort is required for the function, though: A simple one-liner connects the file with the sequential file interface: kernel/kprobes.c
static struct seq_operations kprobes_seq_ops = {
        .start = kprobe_seq_start,
        .next  = kprobe_seq_next,
        .stop  = kprobe_seq_stop,
        .show  = show_kprobe_addr
};

static int __kprobes kprobes_open(struct inode *inode, struct file *filp)
{
        return seq_open(filp, &kprobes_seq_ops);
}
Figure 10-8: Data structures for sequential files.

seq_open sets up the data structures required by the sequential file mechanism. The result is shown in Figure 10-8. Recall from Chapter 8 that the private_data element of struct file can point to arbitrary data that are private to the file and not touched by the generic VFS functions. In this case, seq_open uses the pointer to establish a connection with an instance of struct seq_file that contains status information
about the sequential file: <seq_file.h>
struct seq_file {
        char *buf;
        size_t size;
        size_t from;
        size_t count;
        loff_t index;
        ...
        const struct seq_operations *op;
        ...
};

buf points to a memory buffer that allows for constructing data that go out to userland. count specifies
the number of bytes remaining to be written to userland. The start position of the copy operation is denoted by from, and size gives the total number of bytes in the buffer. index is another index into the buffer. It marks the start position for the next new record that is written into the buffer by the kernel. Note that index and from can evolve differently since writing data into the buffer is different from copying these data to userspace. The most important element from a filesystem implementor’s point of view is the pointer op to an instance of seq_operations. This connects the generic sequential file implementation with routines providing file-specific contents. Four methods are required by the kernel and need to be implemented by the file provider: <seq_file.h>
struct seq_operations {
        void * (*start) (struct seq_file *m, loff_t *pos);
        void (*stop) (struct seq_file *m, void *v);
        void * (*next) (struct seq_file *m, void *v, loff_t *pos);
        int (*show) (struct seq_file *m, void *v);
};
The first argument to the functions is always the seq_file instance in question. The start method is called whenever an operation on a sequential file is started. The position argument pos is a cursor in the file. The interpretation is left to the implementation. It could be taken as a byte offset, but can also be interpreted as an array index. The kprobes example implements all these routines as shown above, so they are discussed now. Let us first, however, briefly describe which type of information is passed to userland — we need to know what goes out before we can discuss how it goes out. The kprobes mechanism allows for attaching probes to certain points in the kernel. All registered probes are hashed on the array kprobe_table, and the size of the array is statically defined to KPROBE_TABLE_SIZE. The file cursor for sequential files is interpreted as an index into the array, and the debug file is supposed to show information about all registered probes that must be constructed from the contents of the hash table. The start method is simple: It just needs to check if the current cursor is beyond the array bounds. kernel/kprobes.c
static void __kprobes *kprobe_seq_start(struct seq_file *f, loff_t *pos)
{
        return (*pos < KPROBE_TABLE_SIZE) ? pos : NULL;
}
This is simple, but closing a sequential file is even simpler: In almost all cases, nothing needs to be done!
kernel/kprobes.c
static void __kprobes kprobe_seq_stop(struct seq_file *f, void *v)
{
        /* Nothing to do */
}
The next function is called when the cursor must be updated to the next position. Besides incrementing the array index, the function must check that it does not go out of bounds: kernel/kprobes.c
static void __kprobes *kprobe_seq_next(struct seq_file *f, void *v, loff_t *pos)
{
        (*pos)++;
        if (*pos >= KPROBE_TABLE_SIZE)
                return NULL;
        return pos;
}
A NULL pointer indicates that the end of the file is reached. The most interesting function is show since the actual contents of the sequential file are generated here. For the sake of illustration, I present a slightly simplified version that abstracts some of the difficulties associated with kprobes that would detract from the seq_file issues: kernel/kprobes.c
static int show_kprobe_addr(struct seq_file *pi, void *v)
{
        struct hlist_head *head;
        struct hlist_node *node;
        struct kprobe *p;
        const char *sym = NULL;
        unsigned int i = *(loff_t *) v;
        unsigned long offset = 0;
        char *modname, namebuf[128];

        head = &kprobe_table[i];
        hlist_for_each_entry_rcu(p, node, head, hlist) {
                sym = kallsyms_lookup((unsigned long)p->addr, NULL,
                                      &offset, &modname, namebuf);
                if (sym)
                        seq_printf(pi, "%p %s+0x%x %s\n", p->addr, sym,
                                   offset, (modname ? modname : " "));
                else
                        seq_printf(pi, "%p\n", p->addr);
        }
        return 0;
}
The current value of the file cursor is in the argument v, and the function converts it into the array index i. Data generation is done by iterating over all elements hashed on this array index. An output line is constructed for each element. Information about the probe point and the symbol that is possibly associated with the point is generated, but this is not really relevant for the example. What does matter is that
instead of printk, seq_printf is used to format the information. In fact, the kernel provides some auxiliary functions that must be used for this purpose. All take a pointer to the seq_file instance in question as first parameter:

❑ seq_printf works like printk and can be used to format arbitrary C strings.

❑ seq_putc and seq_puts, respectively, write out a single character and a string without any formatting.

❑ seq_escape takes two strings: Every character of the first string that also occurs in the second string is replaced by its value in octal.

The special function seq_path allows for constructing the filename associated with a given instance of struct dentry. It is used by filesystem- or namespace-specific code.
Connection with the Virtual Filesystem

Up to now, I have presented everything that is required from a sequential file user. The rest, that is, connecting the operations with the virtual filesystem, is left to the kernel. To establish the connection, it is necessary to use seq_read as the read method of file_operations, as shown above in the case of debugfs_kprobes_operations.

The method bridges VFS and sequential files. First of all, the function needs to obtain the seq_file instance from the VFS layer's struct file. Recall that seq_open has established a connection via private_data. If some data are waiting to be written out — as indicated by a positive count element of struct seq_file — they are copied to userland with copy_to_user. Additionally, updating the various status elements of seq_file is required. In the next step, new data are generated. After calling start, the kernel calls show and next one after another until the available buffer is filled. Finally, stop is employed, and the generated data are copied to userspace using copy_to_user.
10.2.2 Writing Filesystems with Libfs

Libfs is a library that provides several very generic standard routines that can be used to create small filesystems that serve one specific purpose. The routines are well suited for in-memory files without a backing store. Obviously the code cannot provide means to interact with specific on-disk formats; this needs to be handled properly by full filesystem implementations. The library code is contained in a single file, fs/libfs.c, and the prototypes are defined in <fs.h>.
The file and directory hierarchy of virtual filesystems that use libfs is generated and traversed using the dentry tree. This implies that during the lifetime of the filesystem, all dentries must be pinned into memory. They must not go away unless they are explicitly removed via unlink or rmdir. However, this is simple to achieve: The code only needs to ensure that all dentries always have a positive use count.
To understand the idea of libfs better, let's discuss the way directory handling is implemented. Boilerplate instances of inode and file operations for directories are provided that can immediately be reused for any virtual filesystem implemented along the lines of libfs:

fs/libfs.c
const struct file_operations simple_dir_operations = {
        .open           = dcache_dir_open,
        .release        = dcache_dir_close,
        .llseek         = dcache_dir_lseek,
        .read           = generic_read_dir,
        .readdir        = dcache_readdir,
        .fsync          = simple_sync_file,
};

const struct inode_operations simple_dir_inode_operations = {
        .lookup         = simple_lookup,
};
In contrast to the convention introduced above, the names of the routines that make up simple_dir_operations do not start with simple_. Nevertheless, they are defined in fs/libfs.c. The nomenclature reflects that the operations solely operate on objects from the dentry cache. If a virtual filesystem sets up a proper dentry tree, it suffices to install simple_dir_operations and simple_dir_inode_operations as file or inode operations, respectively, for directories. The libfs functions then ensure that the information contained on the tree is exported to userland via the standard system calls like getdents. Since constructing one representation from another is basically a mechanical task, the source code is not discussed in detail. Instead, it is more interesting to observe how new files are added to a virtual filesystem. Debugfs (discussed below) is one filesystem that employs libfs. New files (and thus new inodes) are created with the following routine: fs/debugfs/inode.c
static struct inode *debugfs_get_inode(struct super_block *sb, int mode, dev_t dev)
{
        struct inode *inode = new_inode(sb);

        if (inode) {
                inode->i_mode = mode;
                inode->i_uid = 0;
                inode->i_gid = 0;
                inode->i_blocks = 0;
                inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
                switch (mode & S_IFMT) {
                default:
                        init_special_inode(inode, mode, dev);
                        break;
                case S_IFREG:
                        inode->i_fop = &debugfs_file_operations;
                        break;
                ...
                case S_IFDIR:
                        inode->i_op = &simple_dir_inode_operations;
                        inode->i_fop = &simple_dir_operations;

                        /* directory inodes start off with i_nlink == 2
                         * (for "." entry) */
                        inc_nlink(inode);
                        break;
                }
        }
        return inode;
}
Besides allocating a new instance of struct inode, the kernel needs to decide which file and inode operations are to be associated with the file depending on the information in the access mode. For device special files, the standard routine init_special_inode (not connected with libfs) is used. The more interesting cases, however, are regular files and directories.

Directories require the standard file and inode operations as discussed above; this ensures with no further effort that the new directory is correctly handled. Regular files cannot be provided with boilerplate file operations. It is at least necessary to manually specify the read, write, and open methods. read is supposed to prepare data from kernel memory and copy them into userspace, while write can be used to read input from the user and apply it somehow. This is all that is required to implement custom files!

A filesystem also requires a superblock. Thankfully for lazy programmers, libfs provides the method simple_fill_super, which can be used to fill in a given superblock:
int simple_fill_super(struct super_block *s, int magic, struct tree_descr *files);

s is the superblock in question, and magic specifies a unique magic number which can be used to identify the filesystem. The files parameter provides a very convenient method to populate the virtual filesystem
with what it is supposed to contain: files! Unfortunately, only files in a single directory can be specified with this method, but this is not a real limitation for virtual filesystems. More content can still be added later dynamically. An array with struct tree_descr elements is used to describe the initial set of files. The structure is defined as follows:
struct tree_descr {
        char *name;
        const struct file_operations *ops;
        int mode;
};

name denotes the filename, ops points to the associated file operations, and mode specifies the access bits.
The last entry in the list must be of the form { "", NULL, 0 }.
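To illustrate how the pieces fit together, the following sketch shows a fill_super routine for a hypothetical libfs-based filesystem. All names are invented; myfs_counter_fops and myfs_reset_fops stand for ordinary file_operations instances that would have to supply the read, write, and open methods mentioned above:

#include <linux/fs.h>
#include <linux/module.h>

#define MYFS_MAGIC 0x6d796673           /* arbitrary magic number ("myfs") */

static int myfs_fill_super(struct super_block *sb, void *data, int silent)
{
        static struct tree_descr myfs_files[] = {
                /* slots 0 and 1 are left empty; inode 1 is used for the
                 * root directory created by simple_fill_super */
                [2] = { "counter", &myfs_counter_fops, S_IRUGO },
                [3] = { "reset",   &myfs_reset_fops,   S_IWUSR },
                { "", NULL, 0 }         /* terminating entry */
        };

        return simple_fill_super(sb, MYFS_MAGIC, myfs_files);
}

static int myfs_get_sb(struct file_system_type *fs_type, int flags,
                       const char *dev_name, void *data, struct vfsmount *mnt)
{
        return get_sb_single(fs_type, flags, data, myfs_fill_super, mnt);
}

static struct file_system_type myfs_type = {
        .owner   = THIS_MODULE,
        .name    = "myfs",
        .get_sb  = myfs_get_sb,
        .kill_sb = kill_litter_super,
};

Registering myfs_type with register_filesystem and mounting the filesystem makes the two files appear in its root directory.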
10.2.3 The Debug Filesystem

One particular filesystem using functions from libfs is the debug filesystem debugfs. It gives kernel developers a means of providing information to userland. The information is not supposed to be compiled into production kernels; quite in contrast, it is only an aid for developing new features. Support for debugfs is only activated if the kernel is compiled with the DEBUG_FS configuration option, so code that registers files in debugfs needs to be enclosed in C preprocessor conditionals checking for CONFIG_DEBUG_FS.
Example

Recall the kprobes example discussed earlier in the chapter as an example for the sequential file mechanism. The resulting file is exported via debugfs in only a couple of lines — as simple as can be!

kernel/kprobes.c
#ifdef CONFIG_DEBUG_FS
...
static int __kprobes debugfs_kprobe_init(void)
{
        struct dentry *dir, *file;
        unsigned int value = 1;

        dir = debugfs_create_dir("kprobes", NULL);
        ...
        file = debugfs_create_file("list", 0444, dir, NULL,
                                   &debugfs_kprobes_operations);
        ...
        return 0;
}
...
#endif /* CONFIG_DEBUG_FS */

debugfs_create_dir is used to create a new directory, and debugfs_create_file establishes a new file in this directory. debugfs_kprobes_operations was discussed above as an example for the sequential file mechanism.
Programming Interface

Since the debugfs code is very clean, simple, and well documented, it is not necessary to add remarks about the implementation. It suffices to discuss the programming interface. However, have a look at the source code, which is a very nice application of the libfs routines. Three functions are available to create new filesystem objects:

<debugfs.h>
struct dentry *debugfs_create_file(const char *name, mode_t mode,
                                   struct dentry *parent, void *data,
                                   const struct file_operations *fops);
struct dentry *debugfs_create_dir(const char *name, struct dentry *parent);
struct dentry *debugfs_create_symlink(const char *name, struct dentry *parent,
                                      const char *dest);
Unsurprisingly, a filesystem object can either be a regular file, a directory, or a symbolic link. Two additional operations allow for renaming and removing files: <debugfs.h>
void debugfs_remove(struct dentry *dentry);
struct dentry *debugfs_rename(struct dentry *old_dir, struct dentry *old_dentry,
                              struct dentry *new_dir, const char *new_name);
When kernel code is being debugged, the need to export and manipulate a single elementary value like an int or a long often arises. Debugfs also provides several functions that create a new file that allows for reading the value from userspace and passing a new value into the kernel. They all share a common prototype: <debugfs.h>
struct dentry *debugfs_create_XX(const char *name, mode_t mode,
                                 struct dentry *parent, XX *value);

name and mode denote the filename and access mode, while parent points to the dentry instance of the parent directory. value is most important: It points to the value that is exported and can be modified by
writing into the file. The function is available for several data types. If XX is replaced by any of the standard kernel data types u8, u16, u32, or u64, a file that allows for reading but forbids changing the value is created. If x8, x16, or x32 is used, the value can also be changed from userspace. A file that presents a Boolean value can be created by debugfs_create_bool: <debugfs.h>
struct dentry *debugfs_create_bool(const char *name, mode_t mode,
                                   struct dentry *parent, u32 *value)
Finally, it is also possible to exchange short portions of binary data (conventionally called binary blobs) with userspace. The following function is provided for this purpose: <debugfs.h>
struct dentry *debugfs_create_blob(const char *name, mode_t mode,
                                   struct dentry *parent,
                                   struct debugfs_blob_wrapper *blob);
The binary data are represented by a special data structure containing a pointer to the memory location that holds the data and the data length: <debugfs.h>
struct debugfs_blob_wrapper {
        void *data;
        unsigned long size;
};
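Putting the helpers together, a hypothetical driver could export a counter and a Boolean switch as follows; all names are invented for illustration:

#include <linux/debugfs.h>
#include <linux/init.h>

static u32 my_counter;
static u32 my_verbose;          /* treated as a Boolean value */

static struct dentry *my_dir, *my_counter_file, *my_verbose_file;

static int __init my_debug_init(void)
{
        my_dir = debugfs_create_dir("mydriver", NULL);
        if (!my_dir)
                return -ENOMEM;

        /* read-only elementary value */
        my_counter_file = debugfs_create_u32("counter", 0444, my_dir, &my_counter);
        /* Boolean that may also be changed from userspace */
        my_verbose_file = debugfs_create_bool("verbose", 0644, my_dir, &my_verbose);

        return 0;
}

static void __exit my_debug_exit(void)
{
        debugfs_remove(my_verbose_file);
        debugfs_remove(my_counter_file);
        debugfs_remove(my_dir);
}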
10.2.4 Pseudo Filesystems

Recall from Section 8.4.1 that the kernel supports pseudo-filesystems that collect related inodes, but cannot be mounted and are thus not visible in userland. Libfs also provides an auxiliary function to implement this specialized type of filesystem. The kernel employs a pseudo-filesystem to keep track of all inodes that represent block devices:

fs/block_dev.c
static int bd_get_sb(struct file_system_type *fs_type,
        int flags, const char *dev_name, void *data, struct vfsmount *mnt)
{
        return get_sb_pseudo(fs_type, "bdev:", &bdev_sops, 0x62646576, mnt);
}

static struct file_system_type bd_type = {
        .name           = "bdev",
        .get_sb         = bd_get_sb,
        .kill_sb        = kill_anon_super,
};
The code looks like that of any regular filesystem, but libfs provides the method get_sb_pseudo, which ensures that the filesystem cannot be mounted from userspace. This is simple: It just needs to set the flag MS_NOUSER as discussed in Chapter 8. In addition, an instance of struct super_block is filled in, and the root inode for the pseudo-filesystem is allocated. To use a pseudo-filesystem, the kernel needs to mount it using kern_mount or kern_mount_data. It can be used to collect inodes without the hassle of writing a specialized data structure to do so. For bdev, all inodes that represent block devices are grouped together. The collection, however, will only be visible to the kernel and not to userspace.
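The way the kernel puts the bdev pseudo-filesystem into service can be condensed into the following sketch (cache setup and error handling are omitted, and the function name is not the kernel's):

static struct vfsmount *bd_mnt;

void __init bdev_cache_init_sketch(void)
{
        register_filesystem(&bd_type);

        /* Mounting from the kernel side; because get_sb_pseudo sets
         * MS_NOUSER, a mount from userspace is not possible. */
        bd_mnt = kern_mount(&bd_type);
}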
10.3 Sysfs
Sysfs is a filesystem for exporting kernel objects to userspace, providing the ability to not only observe properties of kernel-internal data structures, but also to modify them. Especially important is the highly hierarchical organization of the filesystem layout: The entries of sysfs originate from kernel objects (kobjects) as introduced in Chapter 1, and the hierarchical order of these is directly reflected in the directory layout of sysfs.5 Since all devices and buses of the system are organized via kobjects, sysfs provides a representation of the system’s hardware topology. In many cases, short, human readable text strings are used to export object properties, but passing binary data to and from the kernel via sysfs is also frequently employed. Sysfs has become an alternative to the more old-fashioned IOCTL mechanism. Instead of sending cryptic ioctls into the kernel, which usually requires a C program, it is much simpler to read from or write a value to a sysfs file. A simple shell command is sufficient. Another advantage is that a simple directory listing provides a quick overview on what options can be set. As for many virtual filesystems, sysfs was initially based on ramfs; thus, the implementation uses many techniques known from other places in the kernel. Note that sysfs is always compiled into the kernel 5 The large number of extensively interconnected data structures known from the kobject mechanism is thus also directly transferred
to sysfs, at least when a kobject is exported to the filesystem.
689
Page 689
Mauerer
runc10.tex
V2 - 09/04/2008
Note that sysfs is always compiled into the kernel if it is configured to be active; generating it as a module is not possible. The canonical mount point for sysfs is /sys. The kernel sources contain some documentation on sysfs, its relation to the driver model with respect to the kobject framework, and so on. It can be found in Documentation/filesystems/sysfs.txt and Documentation/filesystems/sysfs-pci.txt. An overview article by the author of sysfs himself is available in the proceedings of the Ottawa Linux Symposium 2005 on www.linuxsymposium.org/2005/linuxsymposium_procv1.pdf. Finally, note that the connection between kobjects and sysfs is not automatically set up. Standalone kobject instances are by default not integrated into sysfs. You need to call kobject_add to make an object visible in the filesystem. If the kobject is a member of a kernel subsystem, the registration is performed automatically, though.
10.3.1 Overview

struct kobject, the related data structures, and their usage are described in Chapter 1; thus, here our discussion is restricted to a recap of the most essential points. In particular, it is important to remember that
❑ kobjects are included in a hierarchic organization; most important, they can have a parent and can be included in a kset. This determines where the kobject appears in the sysfs hierarchy: If a parent exists, a new entry in the directory of the parent is created. Otherwise, it is placed in the directory of the kobject that belongs to the kset the object is contained in (if both of these possibilities fail, the entry for the kobject is located directly in the top level of the sysfs hierarchy, but this is obviously a rare case).

❑ Every kobject is represented as a directory within sysfs. The files that appear in this directory are the attributes of the object. The operations used to export and set attribute values are provided by the subsystem (class, driver, etc.) to which the kobject belongs.

❑ Buses, devices, drivers, and classes are the main kernel objects using the kobject mechanism; they thus account for nearly all entries of sysfs.
10.3.2 Data Structures

As usual, let's first discuss the data structures used by the sysfs implementation.
Directory Entries

Directory entries are represented by struct sysfs_dirent as defined in <sysfs.h>. It is the main data structure of sysfs; the whole implementation is centered around it. Each sysfs node is represented by a single instance of sysfs_dirent. The definition is as follows:

<sysfs.h>
struct sysfs_dirent {
        atomic_t                s_count;
        atomic_t                s_active;
        struct sysfs_dirent     *s_parent;
        struct sysfs_dirent     *s_sibling;
        const char              *s_name;

        union {
                struct sysfs_elem_dir           s_dir;
                struct sysfs_elem_symlink       s_symlink;
                struct sysfs_elem_attr          s_attr;
                struct sysfs_elem_bin_attr      s_bin_attr;
        };

        unsigned int            s_flags;
        ino_t                   s_ino;
        umode_t                 s_mode;
        struct iattr            *s_iattr;
};
❑ s_sibling and s_dir.children are used to capture the parent/child relationship between sysfs entries: s_sibling connects all children of a parent among each other, and s_dir.children is used by the parent element to serve as the list head.

❑ The kernel uses s_flags with a twofold purpose: First, it is used to set the type of the sysfs entry. Second, it can set a number of flags. The lower 8 bits are used for the type; they can be accessed with the auxiliary function sysfs_type. The type can be any of SYSFS_DIR, SYSFS_KOBJ_ATTR, SYSFS_KOBJ_BIN_ATTR, or SYSFS_KOBJ_LINK, depending on whether the instance is a directory, a regular or binary attribute, or a symbolic link. The remaining bits are reserved for flags. Currently, only SYSFS_FLAG_REMOVED is defined, which is set when a sysfs entry is in the process of being removed.

❑ Information about the access mode of the file associated with the sysfs_dirent instance is stored in s_mode. Attributes are described by an iattr instance pointed at by s_iattr; if this is a NULL pointer, a default set of attributes is used.

❑ s_name points to the filename for the file, directory, or link represented by the object.

❑ Depending on the type of the sysfs entry, different types of data are associated with it. Since an entry can only represent a single type at a time, the data structures that encapsulate the entry's payload are collected in an anonymous union. The members are defined as follows:

fs/sysfs/sysfs.h
struct sysfs_elem_dir {
        struct kobject          *kobj;
        /* children list starts here and goes through sd->s_sibling */
        struct sysfs_dirent     *children;
};

struct sysfs_elem_symlink {
        struct sysfs_dirent     *target_sd;
};

struct sysfs_elem_attr {
        struct attribute        *attr;
        struct sysfs_open_dirent *open;
};

struct sysfs_elem_bin_attr {
        struct bin_attribute    *bin_attr;
};
sysfs_elem_attr and sysfs_elem_bin_attr contain pointers to data structures that represent attributes; these are discussed in the following section. sysfs_elem_symlink implements a symbolic link. All it needs to do is provide a pointer to the target sysfs_dirent instance.
Directories are implemented with the aid of sysfs_elem_dir. children is the head of a singly linked list connecting all children via s_sibling. Note that the elements on the sibling list are sorted by s_ino in decreasing order. The relationship is illustrated in Figure 10-9. As in any other filesystem, sysfs entries are also represented by instances of struct dentry. The connection between both layers is given by dentry->d_fsdata, which points to the sysfs_dirent instance associated with the dentry element.
Figure 10-9: Sysfs directory hierarchy based on struct sysfs_dirent.
Reference counting for struct sysfs_dirent is unconventional because two reference counters are provided: s_count and s_active. The first one is a standard reference counter that needs to be incremented when the sysfs_dirent instance under consideration is required by some part of the kernel and decremented when it is not required anymore. A problem arises, though, because whenever a sysfs node is opened, the associated kobject is also referenced. Userland applications could thus prevent the kernel from deleting kobject instances by simply keeping a sysfs file open. To circumvent this, the kernel requires that an active reference on a sysfs_dirent is held whenever the associated internal objects (available via sysfs_elem_*) are accessed. Unsurprisingly, the active reference counter is implemented with s_active. When a sysfs file is supposed to be deleted, access to the internal objects associated with it can be deactivated by setting the active reference counter to a negative value; the auxiliary function sysfs_deactivate is provided for this. Once the value is negative, operations on the associated kobject cannot be performed anymore. When all users of the kobject have disappeared, it can safely be deleted by the kernel. The sysfs file and thus the sysfs_dirent instance, however, can still exist, even if they do not make much sense anymore! Active references can be obtained by sysfs_get_active or sysfs_get_active_two (for a given sysfs_dirent instance as well as its parent). They must immediately be released with
sysfs_put_active (respectively, sysfs_put_active_two) as soon as the operation with the associated internal data is finished.
Attributes

Let us turn our attention to the data structures that represent attributes and the mechanisms used to declare new attributes.
Data Structures

Attributes are defined by the following data structure:

<sysfs.h>
struct attribute {
        const char              *name;
        struct module           *owner;
        mode_t                  mode;
};
name provides a name for the attribute that is used as a filename in sysfs (thus attributes that belong to the same object need to have unique names), while mode specifies the access mode. owner points to the module instance to which the owner of the attribute belongs.
It is also possible to define a group of attributes with the aid of the following data structure: <sysfs.h>
struct attribute_group {
        const char              *name;
        struct attribute        **attrs;
};
name is a name for the group, and attrs points to an array of attribute instances terminated by a NULL entry.

Note that these data structures only provide a means to represent attributes, but do not specify how to read or modify them. This is covered in Section 10.3.4. The separation of representation and access method was chosen because all attributes belonging to a certain entity (e.g., a driver, a device class, etc.) are modified in a similar way, so it makes sense to transfer this group property to the export/import mechanism. Note, though, that it is customary for the show and store operations of the subsystem to rely on attribute-specific show and store methods that are internally connected with the attribute and that differ on a per-attribute basis. The implementation details are left to the respective subsystem; sysfs is unconcerned about this.

For a read/write attribute, two methods denoted as show and store need to be available; the kernel provides the following data structure to keep them together:

<sysfs.h>
struct sysfs_ops {
        ssize_t (*show)(struct kobject *, struct attribute *, char *);
        ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t);
};
It is the responsibility of the code that declares a new attribute type to provide a suitable set of show and store operations. The situation is different for binary attributes: Here, the methods used to read and modify the data are usually different for each attribute. This is reflected in the data structure, where methods for reading, writing, and memory mapping are specified explicitly:

<sysfs.h>
struct bin_attribute {
        struct attribute        attr;
        size_t                  size;
        void                    *private;
        ssize_t (*read)(struct kobject *, struct bin_attribute *,
                        char *, loff_t, size_t);
        ssize_t (*write)(struct kobject *, struct bin_attribute *,
                         char *, loff_t, size_t);
        int (*mmap)(struct kobject *, struct bin_attribute *attr,
                    struct vm_area_struct *vma);
};

size denotes the size of the binary data associated with the attribute, and private is (usually) used to point to the place where the data are actually stored.
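To make the declaration side more concrete, the following hypothetical snippet (all names invented) declares two plain attributes and collects them in an attribute group; passing the group to sysfs_create_group for some kobject would create a subdirectory named demo containing the files foo and bar:

static struct attribute demo_foo_attr = {
        .name = "foo",
        .mode = S_IRUGO | S_IWUSR,      /* world-readable, owner-writable */
};

static struct attribute demo_bar_attr = {
        .name = "bar",
        .mode = S_IRUGO,                /* read-only */
};

static struct attribute *demo_attrs[] = {
        &demo_foo_attr,
        &demo_bar_attr,
        NULL,                           /* the array must be NULL-terminated */
};

static struct attribute_group demo_attr_group = {
        .name  = "demo",                /* optional subdirectory name */
        .attrs = demo_attrs,
};

How the values behind foo and bar are actually produced and consumed is determined by the show and store methods of the sysfs_ops instance provided by the owning subsystem, as discussed next.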
Declaring New Attributes

Many possibilities for declaring subsystem-specific attributes are spread around the kernel, but since they all share a basic structure with regard to their implementation, it is sufficient to consider one implementation as an example for the underlying mechanism. Consider, for instance, how the generic hard disk code defines a structure that unites an attribute and the associated methods to read and modify the attribute:
struct disk_attribute {
        struct attribute attr;
        ssize_t (*show)(struct gendisk *, char *);
        ssize_t (*store)(struct gendisk *, const char *, size_t);
};
The attr member is nothing other than an attribute as introduced before; this can be fed to sysfs whenever an instance of attribute is required. But note that the show and store function pointers have a different prototype from that required for sysfs! How do the subsystem-specific attribute functions get called by the sysfs layer? The connection is made by the following struct:

block/genhd.c
static struct sysfs_ops disk_sysfs_ops = {
        .show   = &disk_attr_show,
        .store  = &disk_attr_store,
};
The show and store methods of sysfs_ops are called when a process wants to read from (or write to) a sysfs file as will be shown below in more detail.
When a sysfs file related to generic hard disk attributes is accessed, the kernel uses the methods disk_attr_show and disk_attr_store to read and modify the attribute values. The disk_attr_show function is called whenever the value of an attribute of this type needs to be read from the kernel; the code acts as the glue between sysfs and the genhd implementation:

block/genhd.c
static ssize_t disk_attr_show(struct kobject *kobj, struct attribute *attr,
                              char *page)
{
        struct gendisk *disk = to_disk(kobj);
        struct disk_attribute *disk_attr =
                container_of(attr, struct disk_attribute, attr);
        ssize_t ret = -EIO;

        if (disk_attr->show)
                ret = disk_attr->show(disk, page);
        return ret;
}
The attribute connected to the sysfs file can be used to infer the containing disk_attribute instance by using the container_of-mechanism; after the kernel has made sure that the attribute possesses a show method, it is called to transfer data from the kernel to userspace and thus from the internal data structures to the sysfs file. Similar methods are implemented by many other subsystems, but since their code is basically identical to the example shown above, it is unnecessary to consider them in greater detail here. Instead, I will cover the steps leading to a call of the sysfs-specific show and store methods; the connection between subsystem and sysfs is left to the subsystem-specific code.
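For the write direction, the store glue routine follows the same pattern in reverse. The following sketch mirrors the show variant above; the actual disk_attr_store in block/genhd.c is structured the same way, although details may differ:

static ssize_t disk_attr_store(struct kobject *kobj, struct attribute *attr,
                               const char *page, size_t count)
{
        struct gendisk *disk = to_disk(kobj);
        struct disk_attribute *disk_attr =
                container_of(attr, struct disk_attribute, attr);
        ssize_t ret = 0;

        /* delegate to the attribute-specific store method, if one exists */
        if (disk_attr->store)
                ret = disk_attr->store(disk, page, count);
        return ret;
}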
10.3.3 Mounting the Filesystem

As usual, let's start the discussion of the implementation by considering how the filesystem is mounted. The system call ends up delegating the work of filling a superblock to sysfs_fill_super; the associated code flow diagram can be found in Figure 10-10.
Figure 10-10: Code flow diagram for sysfs_fill_super (sysfs_get_inode, sysfs_init_inode, d_alloc_root, root->d_fsdata = &sysfs_root).
There is not too much to do for sysfs_fill_super: Some uninteresting initialization work needs to be performed first. sysfs_get_inode is then used to create a new instance of struct inode as the
starting point for the whole sysfs tree. The routine can not only be used to obtain the root inode, but is a generic function that works for any sysfs entry. This routine first checks if the inode is already present in the inode hash. Because the filesystem has not been mounted before, this check will fail in our case, so sysfs_init_inode is used to construct a new inode instance from scratch. I will come back to this function in a moment. The final steps are again performed in sysfs_fill_super. After allocating a root dentry with d_alloc_root, the connection between the sysfs data and the filesystem entry is established:

sysfs/mount.c
static int sysfs_fill_super(struct super_block *sb, void *data, int silent)
{
        struct inode *inode;
        struct dentry *root;
...
        root->d_fsdata = &sysfs_root;
        sb->s_root = root;
...
}
Recall that dentry->d_fsdata is a pointer reserved for filesystem-internal use, so sysfs is allowed to create a connection between sysfs_dirents and dentry instances this way. sysfs_root is a static instance of struct sysfs_dirent that represents the root entry of sysfs. It is defined as follows:

sysfs/mount.c
struct sysfs_dirent sysfs_root = {
        .s_name         = "",
        .s_count        = ATOMIC_INIT(1),
        .s_flags        = SYSFS_DIR,
        .s_mode         = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO,
        .s_ino          = 1,
};
Note that d_fsdata always points to the associated instance of struct sysfs_dirent; the scheme is not only used for the root entry, but also for all other entries of sysfs. This connection allows the kernel to derive the sysfs-specific data from the generic VFS data structures. I will now consider inode initialization in sysfs_init_inode in more detail as promised above. The code flow diagram for the function is depicted in Figure 10-11.
Figure 10-11: Code flow diagram for sysfs_init_inode.

sysfs_init_inode sets the inode operations such that only setattr is implemented by a filesystem-specific function, namely, sysfs_setattr. Following this, the kernel takes care of assigning the inode
attributes. These can either be specified explicitly via sysfs_dirent->s_iattr or can be left to the default values if the field contains a NULL pointer. In this case, the following auxiliary function is used to set the default attributes:

fs/sysfs/inode.c
static inline void set_default_inode_attr(struct inode *inode, mode_t mode)
{
        inode->i_mode = mode;
        inode->i_uid = 0;
        inode->i_gid = 0;
        inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
}
While the access mode of the file can be arbitrarily chosen by the caller, the ownership of the file belongs to root.root in the default case. Finally, the inode needs to be initialized according to the type of the sysfs entry:

fs/sysfs/inode.c
static void sysfs_init_inode(struct sysfs_dirent *sd, struct inode *inode)
{
...
        /* initialize inode according to type */
        switch (sysfs_type(sd)) {
        case SYSFS_DIR:
                inode->i_op = &sysfs_dir_inode_operations;
                inode->i_fop = &sysfs_dir_operations;
                inode->i_nlink = sysfs_count_nlink(sd);
                break;
        case SYSFS_KOBJ_ATTR:
                inode->i_size = PAGE_SIZE;
                inode->i_fop = &sysfs_file_operations;
                break;
        case SYSFS_KOBJ_BIN_ATTR:
                bin_attr = sd->s_bin_attr.bin_attr;
                inode->i_size = bin_attr->size;
                inode->i_fop = &bin_fops;
                break;
        case SYSFS_KOBJ_LINK:
                inode->i_op = &sysfs_symlink_inode_operations;
                break;
        default:
                BUG();
        }
...
Different types are distinguished by different inode and file operations.
10.3.4 File and Directory Operations

Since sysfs exposes its data structures in a filesystem, most interesting operations can be triggered with standard filesystem operations. The functions that implement the filesystem operations thus serve as a glue layer between sysfs and the internal data structures. As for every filesystem, the methods used
for operations on files are collected in an instance of struct file_operations. For sysfs, the following selection is available:

fs/sysfs/file.c
const struct file_operations sysfs_file_operations = {
        .read           = sysfs_read_file,
        .write          = sysfs_write_file,
        .llseek         = generic_file_llseek,
        .open           = sysfs_open_file,
        .release        = sysfs_release,
        .poll           = sysfs_poll,
};
In the following, not only are the functions responsible for reading and writing data (sysfs_{read,write}_file) described, but also the method for opening files (sysfs_open_file) since the connection between sysfs internals and the virtual filesystem is set up there. A rather small number of directory inode operations need to be specifically provided by sysfs:

fs/sysfs/dir.c
struct inode_operations sysfs_dir_inode_operations = {
        .lookup         = sysfs_lookup,
        .setattr        = sysfs_setattr,
};
Most operations can be handled by standard VFS operations; only directory lookup and attribute manipulation need to be taken care of explicitly. These methods are discussed in the following sections. The picture is even simpler for inode operations for regular files; only attribute manipulation needs to be specifically taken care of:

fs/sysfs/inode.c
static struct inode_operations sysfs_inode_operations = {
        .setattr        = sysfs_setattr,
};
Opening Files

Opening a file is a rather boring operation for a regular filesystem. In the case of sysfs, it becomes more interesting because the sysfs internal data needs to be connected with the user-visible representation in the filesystem.
Data Structures

In order to facilitate the exchange of data between userland and the sysfs implementation, some buffer space needs to be available. It is provided by the following slightly simplified data structure:

fs/sysfs/file.c
struct sysfs_buffer {
        size_t                  count;
        loff_t                  pos;
        char                    *page;
        struct sysfs_ops        *ops;
        int                     needs_read_fill;
        struct list_head        list;
};
The contents of the structure are as follows: count specifies the length of the data in the buffer, pos denotes the present position within the data for partial reads and seeking, and page points to a single page used to store the data.6 The sysfs_ops instance belonging to the sysfs entry is connected with an open file via the ops pointer of the buffer. needs_read_fill specifies if the contents of the buffer need to be filled or not (filling the data is performed on the first read and need not be repeated for any successive reads if no write operation was performed in the meantime).
Figure 10-12: Connection between struct sysfs_dirent, struct file, and struct sysfs_buffer.
To understand the meaning of list, observe Figure 10-12, which shows how sysfs_buffers are connected with struct file and struct sysfs_dirent. Each open file as represented by an instance of struct file is connected with one instance of sysfs_buffer via file->private_data. A sysfs entry can be referenced via multiple open files, so more than one sysfs_buffer can be associated with one instance of struct sysfs_dirent. All these buffers are collected in a list that uses sysfs_buffer->list as list element. The list is headed by an instance of sysfs_open_dirent. For the sake of simplicity, this structure is not discussed in great detail. Suffice it to say that it is connected with sysfs_dirent and heads the list of sysfs_buffers.
Implementation

Recall that sysfs_file_operations provides sysfs_open_file to be called when a file is opened. The associated code flow diagram is shown in Figure 10-13.

6 The restriction to a single page is intentional because sysfs is supposed to export only one simple attribute per file; this will not require more space than a single page.
Figure 10-13: Code flow diagram for sysfs_open_file (sysfs_get_active_two, set sysfs_ops, check read and write support, allocate sysfs_buffer, sysfs_get_open_dirent, sysfs_put_active_two).
The first task is to find the sysfs_ops operations that belong to the opened file. Recall that struct kobj_type provides a pointer to an instance of sysfs_ops:
struct kobj_type {
...
        struct sysfs_ops        *sysfs_ops;
...
};
However, the kernel needs to obtain an active reference on the kobject instance that is associated with the sysfs file before the proper instance of sysfs_ops can be found. The function sysfs_get_active_two grabs the active reference as discussed above. If the kobject is a member of a kset, then the pointer is read from the kset instance. Otherwise, the kobject itself is used as the source. If neither provides a pointer to an instance of sysfs_ops, a generic set of operations given by subsys_sysfs_ops is used. However, this is only necessary for direct kernel attributes found in /sys/kernel:

fs/sysfs/file.c
static int sysfs_open_file(struct inode *inode, struct file *file)
{
        struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata;
        struct kobject *kobj = attr_sd->s_parent->s_dir.kobj;
        struct sysfs_buffer *buffer;
        struct sysfs_ops *ops = NULL;
...
        /* need attr_sd for attr and ops, its parent for kobj */
        if (!sysfs_get_active_two(attr_sd))
                return -ENODEV;

        /* if the kobject has no ktype, then we assume that it is a subsystem
         * itself, and use ops for it.
         */
        if (kobj->kset && kobj->kset->ktype)
                ops = kobj->kset->ktype->sysfs_ops;
        else if (kobj->ktype)
                ops = kobj->ktype->sysfs_ops;
        else
                ops = &subsys_sysfs_ops;
...
Since all members of kernel subsystems are collected in a kset, this allows for connecting attributes at a subsystem-specific level because the same access functions are used for all elements. If the kobject under consideration is not contained in a kset, then it is still possible that it has a ktype from which the sysfs_ops can be taken. It is up to the subsystem how to implement the sysfs_ops, but the methods used are quite similar, as shown in Section 10.3.5. If something is supposed to be written into the file, it is not sufficient to just check if the access mode bits allow this. Additionally, the entry is required to provide a store operation in the sysfs_ops: It does not make sense to grant write access if there is no function that can actually pass the data on to the kernel object. A similar condition holds for read access, where a show method is required that can present data to userspace:

fs/sysfs/file.c
        /* File needs write support.
         * The inode's perms must say it's ok,
         * and we must have a store method.
         */
        if (file->f_mode & FMODE_WRITE) {
                if (!(inode->i_mode & S_IWUGO) || !ops->store)
                        goto err_out;
        }

        /* File needs read support.
         * The inode's perms must say it's ok, and we there
         * must be a show method for it.
         */
        if (file->f_mode & FMODE_READ) {
                if (!(inode->i_mode & S_IRUGO) || !ops->show)
                        goto err_out;
        }
...
After the kernel has chosen to allow the access, an instance of sysfs_buffer is allocated, filled in with the appropriate elements, and connected to the file via file->private_data as shown below:

fs/sysfs/file.c
        buffer = kzalloc(sizeof(struct sysfs_buffer), GFP_KERNEL);
...
        mutex_init(&buffer->mutex);
        buffer->needs_read_fill = 1;
        buffer->ops = ops;
        file->private_data = buffer;

        /* make sure we have open dirent struct */
        error = sysfs_get_open_dirent(attr_sd, buffer);
...
        /* open succeeded, put active references */
        sysfs_put_active_two(attr_sd);
        return 0;
}
Finally, sysfs_get_open_dirent connects the freshly allocated buffer with the sysfs data structures via sysfs_open_dirent as shown in Figure 10-12. Note that since no further access to the kobjects associated with the sysfs entry is required anymore, the active references can (and must!) be dropped using sysfs_put_active_two.
Reading and Writing File Contents

Recall that sysfs_file_operations specifies the methods used by the VFS to access the content of files in sysfs. After having introduced all necessary data structures for reading and writing data, it is now time to discuss these operations.
Reading

Reading data is delegated to sysfs_read_file; the associated code flow diagram can be found in Figure 10-14.
Figure 10-14: Code flow diagram for sysfs_read_file (fill_read_buffer if a buffer refill is needed, then simple_read_from_buffer).
The implementation is comparatively simple: If the data buffer is not yet filled in because it is accessed for the first time or has been modified by a write operation (both indicated by buffer->needs_read_fill), fill_read_buffer needs to be called to fill the buffer first. This function is responsible for two things:
1. Allocate a (zero-filled) page frame to hold the data.
2. Call the show method of the struct sysfs_ops instance to provide the buffer contents, that is, fill in data to the page frame allocated above.
Once the buffer is filled with data, the remaining work is delegated to simple_read_from_buffer. As you might have guessed from the name, the task is simple and requires only some bounds checking and a memory copy operation from kernel to userspace.
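As a concrete illustration of the second step, a typical show routine simply formats some internal state into the single page provided by the buffer. The following sketch uses invented names (example_show, example_value) and is not part of the kernel sources:

static int example_value = 42;          /* stand-in for subsystem-internal state */

static ssize_t example_show(struct kobject *kobj, struct attribute *attr,
                            char *page)
{
        /* at most one page may be filled; return the number of bytes used */
        return snprintf(page, PAGE_SIZE, "%d\n", example_value);
}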
Writing

To allow the reverse process, namely, writing data from user to kernel space, sysfs_write_file is provided. As for its read companion, the implementation is quite simple, as the code flow diagram in Figure 10-15 shows. First, fill_write_buffer allocates a page frame into which the data given from userspace are copied. This sets buffer->needs_read_fill because the content of the buffer needs to be refreshed if a read request takes place after the write. The remaining work is delegated to flush_write_buffer; its main job is to call the store method provided by the sysfs_ops instance specific to the file.
Figure 10-15: Code flow diagram for sysfs_write_file (fill_write_buffer, then flush_write_buffer if there is something to write).
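The store side can be sketched along the same lines as the example_show routine above (again with invented names; flush_write_buffer hands over the page filled from userspace together with the number of bytes written):

static ssize_t example_store(struct kobject *kobj, struct attribute *attr,
                             const char *page, size_t count)
{
        /* parse the textual input and update the invented subsystem state */
        example_value = simple_strtol(page, NULL, 10);
        return count;                   /* consume the complete input */
}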
Directory Traversal

The lookup method of sysfs_dir_inode_operations is the basic building block for directory traversal. We therefore need to have a closer look at sysfs_lookup. Figure 10-16 provides the code flow diagram.
Figure 10-16: Code flow diagram for sysfs_lookup (sysfs_find_dirent, sysfs_get_inode, update and rehash the dentry).

Attributes constitute the entries of a directory, and the function tries to find an attribute with a specific name that belongs to an instance of struct sysfs_dirent. By iterating over them and comparing names, the desired entry can be found. Recall that all attributes associated with a kobject are stored in a linked list whose head is sysfs_dirent.s_dir.children. This data structure is now brought to good use:

fs/sysfs/dir.c
struct sysfs_dirent *sysfs_find_dirent(struct sysfs_dirent *parent_sd,
                                       const unsigned char *name)
{
        struct sysfs_dirent *sd;

        for (sd = parent_sd->s_dir.children; sd; sd = sd->s_sibling)
                if (!strcmp(sd->s_name, name))
                        return sd;
        return NULL;
}

sysfs_find_dirent is used by sysfs_lookup to find the desired sysfs_dirent instance for a given
filename. With this in hand, the kernel then needs to establish the connection between sysfs, kernel subsystem, and the filesystem representation by attaching the sysfs_dirent instance of the attribute with the dentry instance of the attribute file. Dentry and inode are then connected with sysfs_get_inode. The method resorts to sysfs_init_inode; this function is discussed in Section 10.3.3. The final steps are not sysfs-specific: The inode information is filled into the dentry. This also requires rehashing the dentry on the global dentry hash.
10.3.5 Populating Sysfs

Since sysfs is an interface to export data from the kernel, only the kernel itself can populate sysfs with file and directory entries. This can be triggered from places all over the kernel, and, indeed, such operations are ubiquitous within the whole tree, which renders it impossible to cover all appearances in detail. Thus only the general methods used to connect sysfs with the internal data structures of the diverse subsystems are demonstrated; the methods used for this purpose are quite similar everywhere.
Registering Subsystems

Once more I use the generic hard disk code as an example for a subsystem that uses kobjects that are represented in sysfs. Observe that the directory /sys/block is used to represent this subsystem. For every block device available in the system, a subdirectory contains several attribute files:

root@meitner # ls -l /sys/block
total 0
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop0
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop1
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop2
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop3
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop4
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop5
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop6
drwxr-xr-x  4 root root 0 2008-02-09 23:26 loop7
drwxr-xr-x 10 root root 0 2008-02-09 23:26 sda
drwxr-xr-x  5 root root 0 2008-02-09 23:26 sdb
drwxr-xr-x  5 root root 0 2008-02-09 23:26 sr0
root@meitner # ls -l /sys/block/sda
total 0
-r--r--r--  1 root root 4096 2008-02-09 23:26 capability
-r--r--r--  1 root root 4096 2008-02-09 23:26 dev
lrwxrwxrwx  1 root root    0 2008-02-09 23:26 device -> ../../devices/pci0000:00/0000:00:1f.2/host0/target0:0:0/0:0:0:0
drwxr-xr-x  2 root root    0 2008-02-09 23:26 holders
drwxr-xr-x  3 root root    0 2008-02-09 23:26 queue
-r--r--r--  1 root root 4096 2008-02-09 23:26 range
-r--r--r--  1 root root 4096 2008-02-09 23:26 removable
drwxr-xr-x  3 root root    0 2008-02-09 23:26 sda1
drwxr-xr-x  3 root root    0 2008-02-09 23:26 sda2
drwxr-xr-x  3 root root    0 2008-02-09 23:26 sda5
drwxr-xr-x  3 root root    0 2008-02-09 23:26 sda6
drwxr-xr-x  3 root root    0 2008-02-09 23:26 sda7
-r--r--r--  1 root root 4096 2008-02-09 23:26 size
drwxr-xr-x  2 root root    0 2008-02-09 23:26 slaves
-r--r--r--  1 root root 4096 2008-02-09 23:26 stat
lrwxrwxrwx  1 root root    0 2008-02-09 23:26 subsystem -> ../../block
--w-------  1 root root 4096 2008-02-09 23:26 uevent
One of the central elements behind this output is the following data structure, which connects a sysfs-specific attribute structure with genhd-specific store and show methods. Note that these methods do not have the signature that sysfs requires for its show/store methods; the glue functions bridging the two are shown below:
struct disk_attribute {
        struct attribute attr;
        ssize_t (*show)(struct gendisk *, char *);
        ssize_t (*store)(struct gendisk *, const char *, size_t);
};
Some attributes are attached to all objects represented by the genhd subsystem, so the kernel creates a collection of instances of disk_attribute as follows:

block/genhd.c
static struct disk_attribute disk_attr_uevent = {
        .attr = {.name = "uevent", .mode = S_IWUSR },
        .store  = disk_uevent_store
};
static struct disk_attribute disk_attr_dev = {
        .attr = {.name = "dev", .mode = S_IRUGO },
        .show   = disk_dev_read
};
...
static struct disk_attribute disk_attr_stat = {
        .attr = {.name = "stat", .mode = S_IRUGO },
        .show   = disk_stats_read
};

static struct attribute *default_attrs[] = {
        &disk_attr_uevent.attr,
        &disk_attr_dev.attr,
        &disk_attr_range.attr,
        ...
        &disk_attr_stat.attr,
        ...
        NULL,
};
The connection between the attribute-specific show/store methods and the show/store methods in sysfs_ops is made by the following structure:

block/genhd.c
static struct sysfs_ops disk_sysfs_ops = {
        .show   = &disk_attr_show,
        .store  = &disk_attr_store,
};
Without getting into any details about their implementation, note that both methods are provided with an attribute instance when called by sysfs, transform this instance into a disk_attribute, and call the show/store method associated with the specific attribute that does the low-level, subsystem-specific work. Finally, the only thing that needs to be considered is how the set of default attributes is connected with all kobjects belonging to the genhd subsystem. For this, a kobj_type is used:

block/genhd.c
static struct kobj_type ktype_block = {
        .release        = disk_release,
        .sysfs_ops      = &disk_sysfs_ops,
        .default_attrs  = default_attrs,
};
Two further steps are necessary to connect this data structure with sysfs:

1. Create a kset that corresponds to the kobj_type by using decl_subsys.
2. Register the kset with subsystem_register; this function ends up calling kset_add which, in turn, calls kobject_add to create an appropriate directory with create_dir. Once more, this function calls populate_dir, which iterates over all default attributes and creates a sysfs file for each of them.

A sketch of these two steps is shown below.
Because subelements of generic hard disks (i.e., partitions) are connected with the kset introduced above, they automatically inherit all default attributes by virtue of the kobject model.
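The following sketch condenses these two steps following the pattern of block/genhd.c for the kernel version discussed here (the decl_subsys macro and subsystem_register were replaced by plain kset calls in later kernels; the real genhd_device_init performs additional work that is omitted, and block_uevent_ops stands for the subsystem's kset_uevent_ops instance):

/* declare the kset for the block subsystem; its name becomes /sys/block */
static decl_subsys(block, &ktype_block, &block_uevent_ops);

static int __init genhd_device_init(void)
{
        /* registering the kset creates the sysfs directory; every kobject
         * added to the kset later inherits the default_attrs of ktype_block */
        return subsystem_register(&block_subsys);
}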
10.4 Summary
Filesystems do not necessarily need to be backed by a physical block device, but their contents can also be generated dynamically. This allows for passing information from the kernel to userland (and vice versa), which can be easily obtained by regular file I/O operations. The /proc filesystem was one of the first virtual filesystems used by Linux, and a more recent addition is sysfs, which presents a hierarchically structured representation of (nearly) all objects known to the kernel. This chapter also discussed some generic routines to implement virtual filesystems and additionally considered how pseudo-filesystems that are not visible to userland carry information important for the kernel itself.
Extended Attributes and Access Control Lists

Many filesystems provide features that extend the standard functionality offered by the VFS layer. It is impossible for the virtual filesystem to provide specific data structures for every feature that can be imagined — fortunately, there's lots of room in our imagination, and developers are not exactly short of new ideas. Additional features that go beyond the standard Unix file model often require an extended set of attributes associated with every filesystem object. What the kernel can provide, however, is a framework that allows filesystem-specific extensions.

Extended attributes (xattrs) are (more or less) arbitrary attributes that can be associated with a file. Since usually every file will possess only a subset of all possible extended attributes, the attributes are stored outside the regular inode data structure to avoid increasing its size in memory and wasting disk space. This allows a really generic set of attributes without any significant impact on filesystem performance or disk space requirements.

One use of extended attributes is the implementation of access control lists that extend the Unix-style permission model: They allow implementation of finer-grained access rights by not only using the concept of the classes user, group, and others, but also by associating an explicit list of users and their allowed operations on the file. Such lists fit naturally into the extended attribute model. Another use of extended attributes is to provide labeling information for SE-Linux.
11.1 Extended Attributes
From the filesystem user’s point of view, an extended attribute is a name/value pair associated with objects in the filesystem. While the name is given by a regular string, the kernel imposes no restrictions on the contents of the value. It can be a text string, but may contain arbitrary binary data as well. An attribute may be defined or not (this is the case if no attribute was associated with a file). If it is defined, it may or may not have a value. No one can blame the kernel for not being liberal in this respect.
Attribute names are subdivided into namespaces. This implies that addressing an attribute requires specifying the namespace as well. As per notational convention, a dot is used to separate the namespace and the attribute name (e.g., user.mime_type). Only the basic details are covered here — it is assumed that you are familiar with the manual page attr(5), where further information about the fine points is given. The kernel uses macros to define the list of valid top-level namespaces. They are of the form XATTR_*_PREFIX. A set of accompanying macros XATTR_*_PREFIX_LEN is useful when a name string passed from userspace needs to be compared with the namespace prefixes:

<xattr.h>
/* Namespaces */
#define XATTR_OS2_PREFIX "os2."
#define XATTR_OS2_PREFIX_LEN (sizeof (XATTR_OS2_PREFIX) - 1)

#define XATTR_SECURITY_PREFIX   "security."
#define XATTR_SECURITY_PREFIX_LEN (sizeof (XATTR_SECURITY_PREFIX) - 1)

#define XATTR_SYSTEM_PREFIX "system."
#define XATTR_SYSTEM_PREFIX_LEN (sizeof (XATTR_SYSTEM_PREFIX) - 1)

#define XATTR_TRUSTED_PREFIX "trusted."
#define XATTR_TRUSTED_PREFIX_LEN (sizeof (XATTR_TRUSTED_PREFIX) - 1)

#define XATTR_USER_PREFIX "user."
#define XATTR_USER_PREFIX_LEN (sizeof (XATTR_USER_PREFIX) - 1)
The kernel provides several system calls to read and manipulate extended attributes:

❑ setxattr is used to set or replace the value of an extended attribute or to create a new one.

❑ getxattr retrieves the value of an extended attribute.

❑ removexattr removes an extended attribute.

❑ listxattr provides a list of all extended attributes associated with a given filesystem object.
Note that all calls are also available with the prefix l; this variant does not follow symbolic links by resolving them but operates on the extended attributes of the link itself. Prefixing the calls with f does not work on a filename given by a string, but uses a file descriptor as the argument. As usual, the manual pages provide more information about how these system calls must be used and provide the exact calling convention.
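As a quick userspace illustration of how these calls are employed, the following minimal program (file name and attribute value invented) stores and reads back an attribute in the user namespace via the glibc wrappers declared in <sys/xattr.h>:

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
        const char *path = "notes.txt";         /* example file, must exist */
        const char *value = "text/plain";
        char buf[64];
        ssize_t len;

        /* create or replace the attribute user.mime_type */
        if (setxattr(path, "user.mime_type", value, strlen(value), 0) < 0) {
                perror("setxattr");
                return 1;
        }

        /* read the value back */
        len = getxattr(path, "user.mime_type", buf, sizeof(buf) - 1);
        if (len < 0) {
                perror("getxattr");
                return 1;
        }
        buf[len] = '\0';
        printf("user.mime_type = %s\n", buf);
        return 0;
}

Note that the filesystem must support extended attributes in the user namespace (for Ext3, the user_xattr mount option) for this to succeed.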
11.1.1 Interface to the Virtual Filesystem

The virtual filesystem provides an abstraction layer to the userspace such that all applications can use extended attributes regardless of how the underlying filesystem implementations store the information on disk. The following sections discuss the required data structures and system calls. Note that although the VFS provides an abstraction layer for extended attributes, this does not mean that they have to be implemented by every filesystem. In fact, quite the contrary is the case. Most filesystems in the kernel do not support extended attributes. However, it should also be noted that all filesystems that are used as Linux workhorses (ext3, reiserfs, xfs, etc.) support extended attributes.
Data Structures

Since the structure of an extended attribute is very simple, the kernel does not provide a specific data structure to encapsulate the name/value pairs; instead, a simple string is used to represent the name, while a void pointer denotes the area in memory where the value resides. Nevertheless, there need to be methods that set, retrieve, remove, and list the extended attributes. Since these operations are inode-specific, they are integrated into struct inode_operations:
struct inode_operations {
...
        int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
        ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
        ssize_t (*listxattr) (struct dentry *, char *, size_t);
        int (*removexattr) (struct dentry *, const char *);
...
}
Naturally, a filesystem can provide custom implementations for these operations, but the kernel also offers a set of generic handler functions. They are, for instance, used by the third extended filesystem, as discussed below in the chapter. Before the implementation is presented, I need to introduce the fundamental data structures. For every class of extended attributes, functions that transfer the information to and from the block device are required. They are encapsulated in the following structure:

<xattr.h>
struct xattr_handler {
        char *prefix;
        size_t (*list)(struct inode *inode, char *list, size_t list_size,
                       const char *name, size_t name_len);
        int (*get)(struct inode *inode, const char *name, void *buffer,
                   size_t size);
        int (*set)(struct inode *inode, const char *name, const void *buffer,
                   size_t size, int flags);
};

prefix denotes the namespace to whose attributes the operations apply: It can be any of the values introduced by XATTR_*_PREFIX as discussed above in the chapter. The get and set methods read and write extended attributes to the underlying block device, while list provides a list of all extended attributes associated with a file. The superblock provides a link to an array of all supported handlers for the respective filesystem:
struct super_block {
...
        struct xattr_handler    **s_xattr;
...
}
There is no fixed order in which the handlers need to appear in the array. The kernel can find the proper one by comparing the handler’s prefix element with the namespace prefix of the extended attribute name in question. Figure 11-1 presents a graphical summary.
Figure 11-1: Data structures used for the generic xattr implementation (the superblock's s_xattr field points to an array of xattr_handler pointers).
System Calls

Recall that there are three system calls for each extended attribute operation (get, set, remove, and list), which differ in how the destination object is specified. To avoid code duplication, the system calls are structured into two parts:
1. Find the instance of dentry associated with the target object.
2. Delegate further processing to a function common to all three calls.
Looking up the dentry instance is performed by user_path_walk, by user_path_walk_link, or by reading the dentry pointer contained in the file instance, depending on which system call was used. After this, a common basis for all three system call variants has been established. In the case of setxattr, the common function used for further processing is setxattr; the associated code flow diagram is shown in Figure 11-2.

Figure 11-2: Code flow diagram for setxattr (copy name and value from userspace, then vfs_setxattr).

First, the routine copies both the name and the attribute value from userspace to kernel space. Since the value of the extended attribute can have arbitrary contents, the size is not predetermined. The system call has an explicit size parameter to indicate how many bytes are supposed to be read in. To avoid abuse of kernel memory, it is ensured that the size of name and value does not exceed the limits imposed by the following quantities:

limits.h
#define XATTR_NAME_MAX   255    /* # chars in an extended attribute name */
#define XATTR_SIZE_MAX 65536    /* size of an extended attribute value (64k) */
After this preparation step, further processing is delegated to vfs_setxattr. The associated code flow diagram is shown in Figure 11-3.
Figure 11-3: Code flow diagram for vfs_setxattr (xattr_permission; if a setxattr method is available, setxattr followed by fsnotify_xattr; otherwise, for the security namespace, the decision is delegated to the security module).

At first, the kernel needs to make sure that the user is privileged to perform the desired operation; the choice is made by xattr_permission. For read-only or immutable inodes, the operation fails immediately; otherwise, the following checks are performed:

fs/xattr.c
static int xattr_permission(struct inode *inode, const char *name, int mask)
{
...
        /*
         * No restriction for security.* and system.* from the VFS.  Decision
         * on these is left to the underlying file system / security module.
         */
        if (!strncmp(name, XATTR_SECURITY_PREFIX, XATTR_SECURITY_PREFIX_LEN) ||
            !strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
                return 0;

        /*
         * The trusted.* namespace can only be accessed by a privileged user.
         */
        if (!strncmp(name, XATTR_TRUSTED_PREFIX, XATTR_TRUSTED_PREFIX_LEN))
                return (capable(CAP_SYS_ADMIN) ? 0 : -EPERM);

        /* In user.* namespace, only regular files and directories can have
         * extended attributes. For sticky directories, only the owner and
         * privileged user can write attributes.
         */
        if (!strncmp(name, XATTR_USER_PREFIX, XATTR_USER_PREFIX_LEN)) {
                if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
                        return -EPERM;
                if (S_ISDIR(inode->i_mode) && (inode->i_mode & S_ISVTX) &&
                    (mask & MAY_WRITE) && !is_owner_or_cap(inode))
                        return -EPERM;
        }

        return permission(inode, mask, NULL);
}
The VFS layer does not care about attributes that live in the security or system namespace. Note that the request is granted if 0 is returned as result of xattr_permission! The kernel ignores these
namespaces and delegates the choice to security modules that are included via the numerous security-related calls, security_*, found everywhere in the kernel, or to the underlying filesystem. However, the VFS layer is concerned about the trusted namespace: Only a sufficiently privileged user (i.e., root or a user with appropriate capabilities) is allowed to perform operations on such attributes. For a change, the comments in the source code state precisely how the kernel thinks that attributes from the user namespace should be taken care of, so I need not add anything further. Any decision for attributes from a namespace different from those processed until now is deferred to the generic permission function as discussed in Section 8.5.3. Note that this includes ACL checks that are implemented with the aid of extended attributes; how these checks are implemented is discussed in Section 11.2.2.

If the inode passed the permission check, vfs_setxattr continues as follows:
1. If a filesystem-specific setxattr method is available in the inode operations, it is called to perform the low-level interaction with the filesystem. After this, fsnotify_xattr uses the inotify mechanism to inform the userland about the extended attribute change.
2. If no setxattr method is available (i.e., if the underlying filesystem does not support extended attributes), but the extended attribute in question belongs to the security namespace, then the kernel tries to use a function that can be provided by security frameworks like SELinux. If no such framework is registered, the operation is denied. This allows security labels on files that reside on filesystems without extended attribute support. It is the task of the security subsystem to store the information in a reasonable way.
Note that some more hook functions of the security framework are called during the extended attribute system calls. They are omitted here since, if no extra security framework like SELinux is present, they will have no effect.

Since the implementation for the system calls getxattr and removexattr nearly completely follows the scheme presented for setxattr, it is not necessary to discuss them in greater depth. The differences are as follows:

❑ getxattr does not need to use fsnotify because nothing is modified.

❑ removexattr need not copy an attribute value, but only the name from userspace. No special casing for the security handler is required.

The code for listing all extended attributes associated with a file differs more from this scheme, particularly because no function vfs_listxattr is used. All work is performed in listxattr. The implementation proceeds in three easy steps:
1. Adapt the maximum size of the list as given by the userspace program such that it is not higher than the maximal size of an extended attribute list as allowed by the kernel with XATTR_LIST_MAX, and allocate the required memory.
2. Call listxattr from inode_operations to fill the allocated space with name/value pairs.
3. Copy the result back to userspace.
Generic Handler Functions

Security is an important business. If wrong decisions are made, then the best security mechanisms are worth nothing. Since duplicating code increases the possibility of getting details wrong, the kernel provides generic implementations of the inode_operation methods for extended attribute handling on which filesystem writers can rely. As an additional benefit, this allows the filesystem people to be lazy — and concentrate their talents on things that matter much more to them than getting each and every security corner case right. The following examples look at these default implementations. As before, the code for different types of access is very similar, so the implementation of generic_setxattr is discussed first and the differences of the other methods afterward. Let's get right down into the code:

fs/xattr.c

int generic_setxattr(struct dentry *dentry, const char *name,
                     const void *value, size_t size, int flags)
{
        struct xattr_handler *handler;
        struct inode *inode = dentry->d_inode;

        if (size == 0)
                value = "";  /* empty EA, do not remove */
        handler = xattr_resolve_name(inode->i_sb->s_xattr, &name);
        if (!handler)
                return -EOPNOTSUPP;
        return handler->set(inode, name, value, size, flags);
}
First, xattr_resolve_name finds the instance of xattr_handler that is apt for the namespace of the extended attribute in question. If a handler exists, the set method is called to perform the desired set operation. Obviously, there cannot be any further generic step; handler->set must be a filesystem-specific method (the implementation of these methods for Ext3 is discussed in Section 11.1.2). It is also not difficult to find the proper handler:

fs/xattr.c
static struct xattr_handler *
xattr_resolve_name(struct xattr_handler **handlers, const char **name)
{
...
        for_each_xattr_handler(handlers, handler) {
                const char *n = strcmp_prefix(*name, handler->prefix);
                if (n) {
                        *name = n;
                        break;
                }
        }
        return handler;
}
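The prefix comparison relies on a small helper in fs/xattr.c; it essentially looks like the following sketch and returns a pointer to the part of the name behind the matched prefix, or NULL if the prefix does not match:

static const char *strcmp_prefix(const char *a, const char *a_prefix)
{
        /* advance both strings while they agree and the prefix is not done */
        while (*a_prefix && *a == *a_prefix) {
                a++;
                a_prefix++;
        }
        /* whole prefix consumed: return the remainder of a, otherwise NULL */
        return *a_prefix ? NULL : a;
}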
for_each_xattr_handler is a macro that iterates over all entries in handlers until it encounters a NULL entry. For every array element, the kernel compares the handler prefix with the namespace part of the attribute name. If there is a match, the appropriate handler has been found.

The generic implementations for the other extended attribute operations differ only slightly from the code for generic_setxattr:

❑ generic_getxattr calls handler->get instead of the handler->set.

❑ generic_removexattr calls handler->set but specifies NULL for the value and a size of 0. This triggers, per convention, removing the attribute.1
generic_listxattr can operate in two modes: If a NULL pointer instead of a buffer was passed to the function to hold the result, the code iterates over all handlers registered in the superblock and calls the list method for the inode in question; since list returns the number of bytes required to hold the result, they can be summed up to provide predictions about how much memory is required in total. If a buffer for the results was specified, generic_listxattr again iterates over all handlers, but this time uses the buffer to actually store the results.
11.1.2 Implementation in Ext3

Among the citizens in filesystem land, Ext3 is one of the most prominent members, so it goes without saying that its support for extended attributes is available and well developed. Examine the following source code to learn more about the filesystem side of extended attribute implementations. This also addresses a question that has not been touched on so far: namely, how extended attributes are permanently stored on disk.
Data Structures

As an exemplary citizen, Ext3 follows some good advice on coding efficiency and employs the generic implementation presented above. A number of handler functions are provided, and the following map makes it possible to access handler functions by their identification number and not by their string identifier; this simplifies many operations and allows a more efficient use of disk space because rather than the prefix string, only a simple number needs to be stored:

fs/ext3/xattr.c
static struct xattr_handler *ext3_xattr_handler_map[] = {
        [EXT3_XATTR_INDEX_USER]                 = &ext3_xattr_user_handler,
#ifdef CONFIG_EXT3_FS_POSIX_ACL
        [EXT3_XATTR_INDEX_POSIX_ACL_ACCESS]     = &ext3_xattr_acl_access_handler,
        [EXT3_XATTR_INDEX_POSIX_ACL_DEFAULT]    = &ext3_xattr_acl_default_handler,
#endif
        [EXT3_XATTR_INDEX_TRUSTED]              = &ext3_xattr_trusted_handler,
#ifdef CONFIG_EXT3_FS_SECURITY
        [EXT3_XATTR_INDEX_SECURITY]             = &ext3_xattr_security_handler,
#endif
};

1 Note that both a NULL value and a size of 0 must be specified, for it is possible to have empty attributes with size 0 and an empty value string (which differs from a NULL value).
Figure 11-4 presents an overview of the on-disk layout of Ext3 extended attributes.
Figure 11-4: Overview of the on-disk format for extended attributes in the Ext3 filesystem.

The space consumed by the extended attributes starts with a short identification header followed by a list of entry elements. Each holds the attribute name and a pointer to the region where the associated value is stored. The list grows downward when new extended attributes are added to the file. The values are stored at the end of the extended attribute data space; the value table grows in the opposite direction of the attribute name table. The values will, in general, not be sorted in the same order as the names, but can be in any arbitrary order. A structure of this kind can be found in two places:

❑ The unused space at the end of the inode.

❑ A separate data block somewhere on the disk.
The first alternative is only possible if the new filesystem format with dynamic inode sizes is used (i.e., EXT3_DYNAMIC_REV); the amount of free space is stored in ext3_inode_info->i_extra_isize. Both alternatives can be used together, but the total size of all extended attribute headers and values is still limited to the sum of the space of a single block and the free space in the inode. It is not possible to use more than one additional block to store extended attributes. In practice, the space required will usually be much less than a complete disk block. Note that it is possible for two files with identical sets of extended attributes to share the on-disk representation; this helps to save some disk space.

How do the data structures that implement this layout look? The header is defined as follows:

fs/ext3/xattr.h
struct ext3_xattr_header {
        __le32  h_magic;        /* magic number for identification */
        __le32  h_refcount;     /* reference count */
        __le32  h_blocks;       /* number of disk blocks used */
        __le32  h_hash;         /* hash value of all attributes */
        __u32   h_reserved[4];  /* zero right now */
};
The comments in the code precisely describe the meaning of the elements, and nothing more needs to be added. The only exception is h_blocks: Although this element suggests that multiple blocks can be used to store extended attribute data, it is at the moment always set to 1. Any other value is treated as an error. Every entry is represented by the following data structure: fs/ext3/xattr.h
struct ext3_xattr_entry {
        __u8    e_name_len;     /* length of name */
        __u8    e_name_index;   /* attribute name index */
        __le16  e_value_offs;   /* offset in disk block of value */
        __le32  e_value_block;  /* disk block attribute is stored on (n/i) */
        __le32  e_value_size;   /* size of attribute value */
        __le32  e_hash;         /* hash value of name and value */
        char    e_name[0];      /* attribute name */
};
Note that the entries are not of a uniform size because the length of the attribute names is variable; this is why the name is stored at the end of the structure; e_name_len is available to determine the name length and thus compute the size of each entry. e_value_block, together with e_value_offs, determines the location of the attribute value associated with the extended attribute name (if the extended attribute is stored within the inode, e_value_offs is used as an offset that starts at the first entry). e_name_index is used as an index into the table ext3_xattr_handler_map defined above.
Implementation
Since the handler implementation is quite similar for different attribute namespaces, the following discussion is restricted to the implementation for the user namespace; the handler functions for the other namespaces differ only slightly or not at all. ext3_xattr_user_handler is defined as follows: fs/ext3/xattr_user.c
struct xattr_handler ext3_xattr_user_handler = {
        .prefix = XATTR_USER_PREFIX,
        .list   = ext3_xattr_user_list,
        .get    = ext3_xattr_user_get,
        .set    = ext3_xattr_user_set,
};
Retrieving Extended Attributes Consider ext3_xattr_user_get first. The code is just a wrapper for a standard routine that works independently of the attribute type. Only the identification number of the type is necessary to choose the correct attributes from the set of all attributes: fs/ext3/xattr_user.c
static int
ext3_xattr_user_get(struct inode *inode, const char *name,
                    void *buffer, size_t size)
{
        ...
        if (!test_opt(inode->i_sb, XATTR_USER))
                return -EOPNOTSUPP;
        return ext3_xattr_get(inode, EXT3_XATTR_INDEX_USER, name, buffer, size);
}
The test for XATTR_USER ensures that the filesystem supports extended attributes in the user namespace. It is possible to enable or disable this support at mount time. Note that all get-type functions can be used for two purposes. If a buffer is allocated, the result is copied into it, but if a NULL pointer is given instead of a proper buffer, only the required size for the attribute value is computed and returned. This allows the calling code to first identify the size of the required allocation for the buffer. After the buffer has been allocated, a second call fills in the data. Figure 11-5 shows the code flow diagram for ext3_xattr_get. The function is a dispatcher that first tries to find the required attribute directly in the free space of the inode with ext3_xattr_ibody_get; if this fails, ext3_xattr_block_get is used to read the value from an external attribute data block.
Figure 11-5: Code flow diagram for ext3_xattr_get.
Consider the direct search in the free inode space first. The associated code flow diagram is depicted in Figure 11-6.
Figure 11-6: Code flow diagram for ext3_xattr_ibody_get. After the location of the inode is determined and access to the raw data is ascertained, ext3_xattr_check_names performs several sanity checks that ensure that the entry table is located within the free space of the inode. The real work is delegated to ext3_xattr_find_entry. Since the routine will be used on several more occasions further below, we need to discuss it in more detail. fs/ext3/xattr.c
static int
ext3_xattr_find_entry(struct ext3_xattr_entry **pentry, int name_index,
                      const char *name, size_t size, int sorted)
{
        struct ext3_xattr_entry *entry;
        size_t name_len;
        int cmp = 1;

        if (name == NULL)
                return -EINVAL;
        name_len = strlen(name);
        entry = *pentry;
        for (; !IS_LAST_ENTRY(entry); entry = EXT3_XATTR_NEXT(entry)) {
                cmp = name_index - entry->e_name_index;
                if (!cmp)
                        cmp = name_len - entry->e_name_len;
                if (!cmp)
                        cmp = memcmp(name, entry->e_name, name_len);
                if (cmp <= 0 && (sorted || cmp == 0))
                        break;
        }
        *pentry = entry;
        ...
        return cmp ? -ENODATA : 0;
}

pentry points to the start of the extended attribute entry table. The code loops over all entries and compares the desired name with the entry name if the entry has the correct type (as indicated by cmp == 0,
which results from subtracting the namespace index of the entry under consideration from the index of the queried entry — a slightly unconventional but nevertheless valid way to check this). Since the entries do not have a uniform size, the kernel uses EXT3_XATTR_NEXT to compute the address of the next entry in the table by adding the length of the actual attribute name (plus some padding that is handled by EXT3_XATTR_LEN) to the size of the entry data structure: fs/ext3/xattr.h
#define EXT3_XATTR_NEXT(entry) \
        ( (struct ext3_xattr_entry *)( \
          (char *)(entry) + EXT3_XATTR_LEN((entry)->e_name_len)) )
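The padding macro used here is not reproduced above; its definition in fs/ext3/xattr.h is essentially the following: an entry occupies its structure plus the name, rounded up to a multiple of 4 bytes.

#define EXT3_XATTR_PAD_BITS     2
#define EXT3_XATTR_PAD          (1<<EXT3_XATTR_PAD_BITS)
#define EXT3_XATTR_ROUND        (EXT3_XATTR_PAD-1)
#define EXT3_XATTR_LEN(name_len) \
        (((name_len) + EXT3_XATTR_ROUND + \
        sizeof(struct ext3_xattr_entry)) & ~EXT3_XATTR_ROUND)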
The end of the list is marked by a zero that IS_LAST_ENTRY checks for. After ext3_xattr_find_entry returns with the data of the desired entry, ext3_xattr_ibody_get needs to copy the value to the buffer given in the function arguments if it is not a NULL pointer; otherwise, only the size of the entry is returned. If the desired extended attribute cannot be found within the inode, the kernel uses ext3_xattr_block_get to search for the entry. The associated code flow diagram is presented in Figure 11-7.
Figure 11-7: Code flow diagram for ext3_xattr_block_get.
The course of action is basically identical to the previously considered case where the data were located in the inode, but two modifications need to be made:
❑ The kernel needs to read the extended attribute block; the address is stored in the i_file_acl element of struct ext3_inode_info.
❑ Metadata blocks are cached by calling ext3_xattr_cache_insert. The kernel uses the so-called filesystem metadata block cache implemented in fs/mbcache.c for this.2 Since nothing really unexpected happens there, it is not necessary to discuss the code in more detail.
Setting Extended Attributes
Setting extended attributes for the user namespace is handled by — you guessed it — ext3_xattr_user_set. As for the get operation, the function is just a wrapper for the generic helper ext3_xattr_set. The code flow diagram in Figure 11-8 shows that this is yet another wrapper function that is responsible for handling the interaction with the journal. The real work is delegated to ext3_xattr_set_handle; the associated code flow diagram can be seen in Figure 11-9.
Figure 11-8: Code flow diagram for ext3_xattr_set.
Figure 11-9: Code flow diagram for ext3_xattr_set_handle.
2 Although the structure of this cache is generic, it is currently only used by the extended filesystem family.
The following calling convention is used:
❑ If the data buffer passed to the function is NULL, then an existing extended attribute is removed.
❑ If the data buffer contains a value, an existing extended attribute is replaced or a new one is created. The flags XATTR_REPLACE and XATTR_CREATE can be used to indicate that the attribute must or must not exist before the call, as per the documentation in the man page setxattr(2).
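From userspace, these cases correspond to the setxattr(2) and removexattr(2) system calls. The following sketch uses a hypothetical file and attribute; error handling is omitted.

#include <string.h>
#include <sys/xattr.h>

int main(void)
{
        const char *path  = "/tmp/somefile";     /* hypothetical example file */
        const char *value = "backup-2008";

        /* create the attribute; fails with EEXIST if it already exists */
        setxattr(path, "user.comment", value, strlen(value), XATTR_CREATE);

        /* overwrite it; fails with ENODATA if it does not exist yet */
        setxattr(path, "user.comment", value, strlen(value), XATTR_REPLACE);

        /* remove it again; the counterpart of passing no value */
        removexattr(path, "user.comment");
        return 0;
}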
ext3_xattr_set_handle implements these requirements by utilizing the previously introduced framework as follows:
1. Find the location of the inode.
2. Use ext3_xattr_ibody_find to find the data of the extended attribute. If this fails, search in the external data block with ext3_xattr_block_find.
3. If no value is given, delete the attribute with ext3_xattr_ibody_set or ext3_xattr_block_set depending on whether the entry is contained in the inode or in a separate data block.
4. If a value was given, use ext3_xattr_*_set to modify the value or create a new value either within the inode or on the external data block depending on where enough space is left.
The functions ext3_xattr_ibody_set and ext3_xattr_block_set handle the low-level work on the data structures described in Section 11.1.2: If no value is given, they remove the entry; if a value is given, they update it or create a new one. This is primarily a matter of data structure manipulation and will not be discussed in detail here.
Listing Extended Attributes Although the kernel includes a generic function (generic_listxattr) for listing all extended attributes associated with a file, it is not among the filesystem favorites: Only the shared memory implementation makes use of it. So let’s step back a little farther to discuss the operation for Ext3. The inode_operations instance for Ext3 lists ext3_listxattr as the handler function for listxattr. The method is just a one-line wrapper for ext3_xattr_list. This routine calls, in turn, ext3_xattr_ibody_list and ext3_xattr_block_list, depending on where extended attributes are stored. Both functions compute the location of the extended attributes and read the data, but then delegate the work to ext3_xattr_list_entries, which finally does the real work — after all, someone has to do it! It uses the previously introduced macros to iterate over all extended attributes defined for the inode, calls handler->list to retrieve the name of the attribute for each entry, and collects the results in a buffer: fs/ext3/xattr.c
static int
ext3_xattr_list_entries(struct inode *inode, struct ext3_xattr_entry *entry,
                        char *buffer, size_t buffer_size)
{
        size_t rest = buffer_size;

        for (; !IS_LAST_ENTRY(entry); entry = EXT3_XATTR_NEXT(entry)) {
                struct xattr_handler *handler =
                        ext3_xattr_handler(entry->e_name_index);

                if (handler) {
                        size_t size = handler->list(inode, buffer, rest,
                                                    entry->e_name,
                                                    entry->e_name_len);
                        if (buffer) {
                                if (size > rest)
                                        return -ERANGE;
                                buffer += size;
                        }
                        rest -= size;
                }
        }
        return buffer_size - rest;
}
Since the list handler implementation is quite similar for the various attribute types, it suffices to consider the variant for the user namespace. Observe the following code: fs/ext3/xattr_user.c
static size_t
ext3_xattr_user_list(struct inode *inode, char *list, size_t list_size,
                     const char *name, size_t name_len)
{
        const size_t prefix_len = sizeof(XATTR_USER_PREFIX)-1;
        const size_t total_len = prefix_len + name_len + 1;

        if (!test_opt(inode->i_sb, XATTR_USER))
                return 0;

        if (list && total_len <= list_size) {
                memcpy(list, XATTR_USER_PREFIX, prefix_len);
                memcpy(list+prefix_len, name, name_len);
                list[prefix_len + name_len] = '\0';
        }
        return total_len;
}
The routine copies the prefix "user." followed by the attribute name and a null byte into the buffer list and returns the number of copied bytes as result.
11.1.3 Implementation in Ext2
The implementation of extended attributes in Ext2 is quite similar to the implementation in Ext3 presented above. This is not surprising since Ext3 is a direct descendant of Ext2, but nevertheless, some features present in Ext3 that are not available in Ext2 are the source of some differences in the xattr implementation:
❑ Since Ext2 does not support dynamic inode sizes, there is not sufficient space left in the on-disk inode to store the data of extended attributes. Thus, xattrs are always stored on a separate data block. This simplifies some functions because no distinction between different locations of the extended attribute data is necessary.
❑ Ext2 does not use journaling, so all journaling-related function calls are not necessary. This also eliminates the need for some wrapper functions that are just dealing with handle operations.
Otherwise, both implementations are nearly identical; for most functions described above, a variant with the prefix ext3_ replaced with ext2_ is available.
11.2 Access Control Lists
POSIX access control lists (ACLs) are an extension specified in a POSIX standard to make the DAC model of Linux finer grained. As usual, I assume that you have some familiarity with the concept, but a very good overview is provided in the manual page acl(5).3 ACLs are implemented on top of extended attributes and modified with the same methods as other extended attributes are. In contrast to other xattrs, whose contents are of no interest to the kernel, ACL xattrs are integrated into the inode permission checks. Although filesystems are free to choose a physical format to represent extended attributes, the kernel nevertheless defines a conversion structure to represent an access control list. The following namespaces must be used for extended attributes that carry access control lists: <posix_acl_xattr.h>
#define POSIX_ACL_XATTR_ACCESS  "system.posix_acl_access"
#define POSIX_ACL_XATTR_DEFAULT "system.posix_acl_default"
The userland programs getfacl, setfacl, and chacl are used to get, set, and change the contents of an ACL. They use the standard system calls to manipulate extended attributes and do not require any non-standard interaction with the kernel. Many other utilities, for instance, ls, also have built-in support for dealing with access control lists.
11.2.1 Generic Implementation The generic code for the implementation of ACLs is contained in two files: fs/posix_acl.c contains code to allocate new ACLs, clone ACLs, perform extended permission checks, and so on; while fs/xattr_acl.c holds functions to convert between extended attributes and the generic representation of ACLs, and vice versa. All generic data structures are defined in include/linux/posix_acl.h and include/linux/posix_acl_xattr.h.
Data Structures The central data structure for in-memory representation that holds all data associated with an ACL is defined as follows: <posix_acl.h>
struct posix_acl_entry {
        short                   e_tag;
        unsigned short          e_perm;
        unsigned int            e_id;
};

struct posix_acl {
        atomic_t                a_refcount;
        unsigned int            a_count;
        struct posix_acl_entry  a_entries[0];
};

3 Note that another good overview about ACLs in general and the status of the implementation in various filesystems supported by Linux is given in the Usenix paper of Andreas Grünbacher [Grü03], one of the principal authors of ACL support for the Ext2 and Ext3 filesystems.
Each entry contains a tag, a permission, and a (user or group) ID to which the ACL refers. All ACLs belonging to a given inode are collected by struct posix_acl. The number of ACL entries is given by a_count; since the array that contains all entries is located at the bottom of the structure, there is no limit on the number of entries except for the maximal size of an extended attribute. a_refcount is a standard reference counter. Symbolic constants for the ACL type, the tag, and the permissions are given by the following preprocessor definitions: <posix_acl.h>
/* a_type field in acl_user_posix_entry_t */
#define ACL_TYPE_ACCESS         (0x8000)
#define ACL_TYPE_DEFAULT        (0x4000)

/* e_tag entry in struct posix_acl_entry */
#define ACL_USER_OBJ            (0x01)
#define ACL_USER                (0x02)
#define ACL_GROUP_OBJ           (0x04)
#define ACL_GROUP               (0x08)
#define ACL_MASK                (0x10)
#define ACL_OTHER               (0x20)

/* permissions in the e_perm field */
#define ACL_READ                (0x04)
#define ACL_WRITE               (0x02)
#define ACL_EXECUTE             (0x01)
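As a standalone illustration (not kernel code), an ACL equivalent to the classic mode 0640 consists of three base entries built from these constants; the structure below is a hypothetical stand-in for posix_acl_entry.

#include <stdio.h>

struct acl_entry_demo {
        unsigned short  tag;
        unsigned short  perm;
        unsigned int    id;
};

static const struct acl_entry_demo mode_0640_acl[] = {
        { 0x01 /* ACL_USER_OBJ  */, 0x04 | 0x02 /* rw- */, 0 },
        { 0x04 /* ACL_GROUP_OBJ */, 0x04        /* r-- */, 0 },
        { 0x20 /* ACL_OTHER     */, 0x00        /* --- */, 0 },
};

int main(void)
{
        printf("%zu base entries\n",
               sizeof(mode_0640_acl) / sizeof(mode_0640_acl[0]));
        return 0;
}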
The kernel defines another set of data structures similar to the ones presented above for xattr representation of ACLs. However, this time they are supposed to be used for external interaction with userland: <posix_acl_xattr.h>
typedef struct {
        __le16                  e_tag;
        __le16                  e_perm;
        __le32                  e_id;
} posix_acl_xattr_entry;

typedef struct {
        __le32                  a_version;
        posix_acl_xattr_entry   a_entries[0];
} posix_acl_xattr_header;
The structures used for internal and external representation are quite similar except that types with defined endianness (see Appendix A.8) and explicit bit length are used for the latter purpose; additionally, no reference counting is necessary for the on-disk representation. Two functions to convert back and forth between the representations are available: posix_acl_from_xattr and posix_acl_to_xattr. Since the translation is purely mechanical, it is not necessary to discuss
it in more detail. It is, however, important to observe that they work independently of the underlying filesystem.
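The size of the external representation follows directly from these definitions: a 4-byte version header followed by 8 bytes per entry. The standalone sketch below performs the arithmetic; the kernel contains an equivalent helper for this computation.

#include <stdio.h>
#include <stdint.h>

/* Bytes needed for the xattr blob of an ACL with 'count' entries:
 * one 32-bit version header plus count entries of 2+2+4 bytes each. */
static size_t acl_xattr_size(int count)
{
        return sizeof(uint32_t) + count * (2 * sizeof(uint16_t) + sizeof(uint32_t));
}

int main(void)
{
        /* e.g., user::, group::, mask::, other:: plus one named user entry */
        printf("%zu bytes\n", acl_xattr_size(5));       /* prints 44 */
        return 0;
}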
Permission Checks For permission checks that involve access control lists, the kernel usually needs support from the underlying filesystems: Either they implement all permission checks by themselves (via the permission function of struct inode_operations), or they provide a callback method for generic_permission. The latter method is preferred by most filesystems in the kernel. The callback is used in generic_permission as follows (note that check_acl denotes the callback function): fs/namei.c
int generic_permission(struct inode *inode, int mask,
                       int (*check_acl)(struct inode *inode, int mask))
{
        ...
        if (IS_POSIXACL(inode) && (mode & S_IRWXG) && check_acl) {
                int error = check_acl(inode, mask);
                if (error == -EACCES)
                        goto check_capabilities;
                else if (error != -EAGAIN)
                        return error;
        }
        ...
}

IS_POSIXACL checks if the (mount-time) flag MS_POSIXACL is set, signaling that ACLs need to be used.
Even if a filesystem provides a specialized function to perform the ACL permission check, the individual routines usually boil down to some technical work like obtaining the ACL data. The real permission checks are again delegated to the standard function posix_acl_permission provided by the kernel. Accordingly, posix_acl_permission needs to be discussed in more detail. Given a pointer to an inode, a pointer to (the in-memory representation of) an access control list, and the right to check for (MAY_READ, MAY_WRITE, or MAY_EXEC in want), the function returns 0 if access is granted or an appropriate error code otherwise. The implementation is as follows: fs/posix_acl.c
int posix_acl_permission(struct inode *inode, const struct posix_acl *acl, int want)
{
        const struct posix_acl_entry *pa, *pe, *mask_obj;
        int found = 0;

        FOREACH_ACL_ENTRY(pa, acl, pe) {
                switch(pa->e_tag) {
                        case ACL_USER_OBJ:
                                /* (May have been checked already) */
                                if (inode->i_uid == current->fsuid)
                                        goto check_perm;
                                break;
                        case ACL_USER:
                                if (pa->e_id == current->fsuid)
                                        goto mask;
                                break;
                        case ACL_GROUP_OBJ:
                                if (in_group_p(inode->i_gid)) {
                                        found = 1;
                                        if ((pa->e_perm & want) == want)
                                                goto mask;
                                }
                                break;
                        case ACL_GROUP:
                                if (in_group_p(pa->e_id)) {
                                        found = 1;
                                        if ((pa->e_perm & want) == want)
                                                goto mask;
                                }
                                break;
                        case ACL_MASK:
                                break;
                        case ACL_OTHER:
                                if (found)
                                        return -EACCES;
                                else
                                        goto check_perm;
                        default:
                                return -EIO;
                }
        }
        return -EIO;
        ...
}
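The iteration macro used above is not reproduced in this excerpt; in <posix_acl.h> it is essentially a plain loop over the a_entries array:

#define FOREACH_ACL_ENTRY(pa, acl, pe) \
        for (pa = (acl)->a_entries, pe = pa + (acl)->a_count; pa < pe; pa++)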
The code uses the macro FOREACH_ACL_ENTRY to iterate over all ACL entries. For each entry, a suitable comparison is made between the credentials of the current process (its filesystem UID, or FSUID, respectively its group membership) and the appropriate reference ID: the UID/GID of the inode for _OBJ type entries and the ID specified in the ACL entry for other types. Obviously, the logic needs to be exactly as defined in the manual page acl(5). The code involves two jump labels that are located behind the loop. The code flow ends up at mask once access has basically been granted. It still needs to be ensured, however, that no declaration of ACL_MASK follows the granting entry and denies the access right: fs/posix_acl.c
...
mask:
        for (mask_obj = pa+1; mask_obj != pe; mask_obj++) {
                if (mask_obj->e_tag == ACL_MASK) {
                        if ((pa->e_perm & mask_obj->e_perm & want) == want)
                                return 0;
                        return -EACCES;
                }
        }
...
Victory can seem to be beguilingly close when a granting entry has been found, but the hopes are quickly annihilated when an ACL_MASK entry denies the access. The following code snippet ensures that not only the rights are valid because of a proper UID or GID, but also that the desired access (read, write, or execute) is allowed by the granting entry: fs/posix_acl.c
...
check_perm:
        if ((pa->e_perm & want) == want)
                return 0;
        return -EACCES;
}
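To see the effect of the mask in isolation, the following standalone sketch (not kernel code) mimics the intersection performed above for a hypothetical entry user:alice:rw- combined with mask::r--.

/* Standalone sketch: how an ACL_MASK entry narrows the rights of a
 * granting ACL_USER/ACL_GROUP entry. Permission bits as in <posix_acl.h>. */
#include <stdio.h>

#define ACL_READ    0x04
#define ACL_WRITE   0x02

/* Returns 1 if the requested rights are granted, 0 otherwise. */
static int effective_check(unsigned int entry_perm, unsigned int mask_perm,
                           unsigned int want)
{
        return (entry_perm & mask_perm & want) == want;
}

int main(void)
{
        unsigned int alice = ACL_READ | ACL_WRITE;      /* user:alice:rw- */
        unsigned int mask  = ACL_READ;                  /* mask::r--      */

        printf("read granted:  %d\n", effective_check(alice, mask, ACL_READ));  /* 1 */
        printf("write granted: %d\n", effective_check(alice, mask, ACL_WRITE)); /* 0 */
        return 0;
}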
11.2.2 Implementation in Ext3 Since ACLs are implemented on top of extended attributes and with the aid of many generic helper routines as discussed above, the implementation in Ext3 is quite concise.
Data Structures The on-disk representation format for an ACL is similar to the in-memory representation required by the generic POSIX helper functions: fs/ext3/acl.h
typedef struct {
        __le16          e_tag;
        __le16          e_perm;
        __le32          e_id;
} ext3_acl_entry;
The meaning of the struct members is identical to the meaning discussed above for the in-memory variant. To save disk space, a version without the e_id field is also defined. It is used for the first four entries of an ACL list because no specific UID/GID is required for them: fs/ext3/acl.h
typedef struct {
        __le16          e_tag;
        __le16          e_perm;
} ext3_acl_entry_short;
A list of ACL entries is always led by a header element, which is defined as follows: fs/ext3/acl.h
typedef struct {
        __le32          a_version;
} ext3_acl_header;
The a_version field would allow for distinguishing between different versions of the ACL implementation. Fortunately, the current implementation has not yet shown any weaknesses that would require introducing a new version, so revision EXT3_ACL_VERSION (0x0001) is still perfectly fine. Although
the field is not relevant right now, it will become important should an incompatible future version be developed. The in-memory representation of every Ext3 inode is augmented with two fields that are relevant for the ACL implementation: <ext3_fs_i.h>
struct ext3_inode_info {
...
#ifdef CONFIG_EXT3_FS_POSIX_ACL
        struct posix_acl        *i_acl;
        struct posix_acl        *i_default_acl;
#endif
...
}
While i_acl points to the posix_acl instance for a regular ACL list associated with an inode, i_default_acl points to the default ACL that may be associated with a directory and is inherited by subdirectories. Since all information is stored in extended attributes on disk, no extension of the disk-based struct ext3_inode is necessary. Note that the kernel does not automatically construct the ACL information for every inode; if the information is not present in memory, the fields are set to EXT3_ACL_NOT_CACHED [defined as (void*)-1].
Conversion between On-Disk and In-Memory Representation Two conversion functions are available to switch between the on-disk and the in-memory representation: ext3_acl_to_disk and ext3_acl_from_disk. Both are implemented in fs/ext3/acl.c. The latter one takes the raw data as read from the information contained in the extended inode, strips off the header, and converts the data from little endian format into a format suitable for the system’s CPU for every entry in the list of ACLs. The counterpart ext3_acl_to_disk works similarly: It iterates over all entries of a given instance of posix_acl and converts the contained data from the CPU-specific format to little endian numbers with appropriate lengths.
Inode Initialization When a new inode is created with ext3_new_inode, the initialization of the ACLs is delegated to ext3_init_acl. In addition to the transaction handle and the instance of struct inode for the new inode, the function also expects a pointer to the inode of the directory in which the new entry is created: fs/ext3/acl.c
int ext3_init_acl(handle_t *handle, struct inode *inode, struct inode *dir)
{
        struct posix_acl *acl = NULL;
        int error = 0;

        if (!S_ISLNK(inode->i_mode)) {
                if (test_opt(dir->i_sb, POSIX_ACL)) {
                        acl = ext3_get_acl(dir, ACL_TYPE_DEFAULT);
                        if (IS_ERR(acl))
                                return PTR_ERR(acl);
                }
                if (!acl)
                        inode->i_mode &= ~current->fs->umask;
        }
        ...
}
The inode parameter points to the new inode, and dir denotes the inode of the directory containing the file. The directory information is required because if the directory has a default ACL, its contents must also be applied to the new file. If the superblock of the directory does not support ACLs or no default ACL is associated with it, the kernel simply applies the current umask setting of the process. A more interesting case is when the inode's filesystem supports ACLs and a default ACL is associated with the parent directory. If the new entry is a directory, the default ACL is passed on to it: fs/ext3/acl.c
...
        if (test_opt(inode->i_sb, POSIX_ACL) && acl) {
                struct posix_acl *clone;
                mode_t mode;

                if (S_ISDIR(inode->i_mode)) {
                        error = ext3_set_acl(handle, inode,
                                             ACL_TYPE_DEFAULT, acl);
                        if (error)
                                goto cleanup;
                }
                ...
        }

ext3_set_acl is used to set the ACL contents of a specific inode; this function is discussed below in this
chapter. For all file types and not just directories, the following code remains to be executed: fs/ext3/acl.c
...
                clone = posix_acl_clone(acl, GFP_KERNEL);
                error = -ENOMEM;
                if (!clone)
                        goto cleanup;

                mode = inode->i_mode;
                error = posix_acl_create_masq(clone, &mode);
                if (error >= 0) {
                        inode->i_mode = mode;
                        if (error > 0) {
                                /* This is an extended ACL */
                                error = ext3_set_acl(handle, inode,
                                                     ACL_TYPE_ACCESS, clone);
                        }
                }
                posix_acl_release(clone);
        }
cleanup:
        posix_acl_release(acl);
        return error;
}
First, a working copy of the in-memory representation of the ACL is created with posix_acl_clone. Afterward, posix_acl_create_masq is called to remove all permissions given by the mode specification of the inode creation process that are not granted by the default ACL. This can result in two scenarios:
1. The access mode can remain unchanged or some elements of it must be removed in order to comply with the ACL's requirements. In this case, the i_mode field of the new inode is set to the mode as computed by posix_acl_create_masq.
2. In addition to the necessity of trimming the mode, the default ACL can contain elements that cannot be represented in the regular user/group/other scheme. In this case, an ACL with extended information that provides the extra information is created for the new inode.
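A small numeric sketch (standalone, not the kernel routine) makes the masking arithmetic of the first scenario concrete: a file created with mode 0666 under a directory whose default ACL grants user::rw-, a mask of r--, and other::--- ends up with mode 0640.

/* Rough illustration of how posix_acl_create_masq() narrows a requested
 * creation mode: every part of the classic mode is intersected with what
 * the default ACL grants (the mask entry governs the group class). */
#include <stdio.h>

int main(void)
{
        unsigned int requested = 0666;  /* mode passed by the creating process */
        unsigned int acl_user  = 06;    /* user::rw- */
        unsigned int acl_mask  = 04;    /* mask::r-- */
        unsigned int acl_other = 00;    /* other::--- */

        unsigned int mode = (((requested >> 6) & 7) & acl_user)  << 6 |
                            (((requested >> 3) & 7) & acl_mask)  << 3 |
                            (((requested)      & 7) & acl_other);

        printf("resulting mode: %o\n", mode);   /* prints 640 */
        return 0;
}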
Retrieving ACLs
Given an instance of struct inode, ext3_get_acl can be used to retrieve an in-memory representation of the ACL. Note that another parameter (type) specifies if the default or the access ACL is supposed to be retrieved. The cases are distinguished with ACL_TYPE_ACCESS and ACL_TYPE_DEFAULT. The code flow diagram for the function is shown in Figure 11-10.
Figure 11-10: Code flow diagram for ext3_get_acl. At first, the kernel uses the helper function ext3_iget_acl to check if the in-memory representation of the ACL is already cached in ext3_inode_info->i_acl (or, respectively, i_default_acl if the default ACL is requested). Should this be the case, the function creates a copy of the representation that can be returned as the result of ext3_get_acl.
If the ACL is not yet cached, then first ext3_xattr_get is called to retrieve the raw data from the extended attribute subsystem4; the conversion from the on-disk to the in-memory representation is performed with the aid of ext3_acl_from_disk. Before a pointer to this representation can be returned, the cache field in question of ext3_inode_info is updated so that subsequent requests can directly get the in-memory representation.
Modifying ACLs
The function ext3_acl_chmod is responsible for keeping ACLs up to date and consistent when the (generic) attributes of a file are changed via ext3_setattr, which is, in turn, called by the VFS layer and thus triggered by the respective system calls from userspace. Since ext3_acl_chmod is called at the very end of ext3_setattr, the new desired mode has already been set for the classical access control part of the inode. A pointer to the instance of struct inode in question is thus sufficient as input data. The operational logic of ext3_acl_chmod is depicted in the code flow diagram in Figure 11-11.
Figure 11-11: Code flow diagram for ext3_acl_chmod.
After retrieving a pointer to the in-memory representation of the ACL data, a clone as working copy is created using the helper function posix_acl_clone. The main work is delegated to posix_acl_chmod_masq covered below. The remaining work for the Ext3 code deals with technical issues: After a handle for the transaction has been obtained, ext3_set_acl is used to write back the modified ACL data. Finally, the end of the operation is announced to the journal, and the clone is released. The generic work of updating the ACL data is performed in posix_acl_chmod_masq by iterating over all ACL entries. The relevant entries for the owning user and group as well as the generic entry for "other" and mask entries are updated to reflect the new situation: fs/posix_acl.c
int posix_acl_chmod_masq(struct posix_acl *acl, mode_t mode)
{
        struct posix_acl_entry *group_obj = NULL, *mask_obj = NULL;
        struct posix_acl_entry *pa, *pe;

        /* assert(atomic_read(acl->a_refcount) == 1); */

        FOREACH_ACL_ENTRY(pa, acl, pe) {
                switch(pa->e_tag) {
                        case ACL_USER_OBJ:
                                pa->e_perm = (mode & S_IRWXU) >> 6;
                                break;
                        case ACL_USER:
                        case ACL_GROUP:
                                break;
                        case ACL_GROUP_OBJ:
                                group_obj = pa;
                                break;
                        case ACL_MASK:
                                mask_obj = pa;
                                break;
                        case ACL_OTHER:
                                pa->e_perm = (mode & S_IRWXO);
                                break;
                        default:
                                return -EIO;
                }
        }

        if (mask_obj) {
                mask_obj->e_perm = (mode & S_IRWXG) >> 3;
        } else {
                if (!group_obj)
                        return -EIO;
                group_obj->e_perm = (mode & S_IRWXG) >> 3;
        }
        return 0;
}

4 Note that there are actually two calls to ext3_xattr_get: The first computes how much memory is needed to hold the data, then the appropriate amount is allocated with vmalloc, and the second call of ext3_xattr_get actually transfers the desired data.
Permission Checks Recall that the kernel provides the generic permission checking function generic_permission, which allows for integration of a filesystem-specific handler for ACL checks. Indeed, Ext3 makes use of this option: The function ext3_permission (which is, in turn, called by the VFS layer when a permission check is requested) instructs generic_permission to use ext3_check_acl for the ACL-related work: fs/ext3/acl.c
int ext3_permission(struct inode *inode, int mask, struct nameidata *nd)
{
        return generic_permission(inode, mask, ext3_check_acl);
}
Figure 11-12: Code flow diagram for ext3_check_acl. The code flow diagram in Figure 11-12 shows that there is little to do for ext3_check_acl. After the ACL data have been read in by ext3_get_acl, all policy work is delegated to posix_acl_permission, which was introduced in Section 11.2.1.
11.2.3 Implementation in Ext2
The implementation of ACLs for Ext2 is nearly identical to the implementation for Ext3. The differences are even smaller than for extended attributes because for ACLs, the handle-related parts are not split into separate functions. Thus, by replacing ext3_ with ext2_ in all functions and data structures, the comments about ACLs in this chapter apply equally well to Ext2 as to Ext3.
11.3 Summary
Traditionally, the discretionary access control model is used by Unix and Linux to decide which user may access a given resource as represented by a file in a filesystem. Although these methods work quite well for average installations, it is a very coarse-grained approach to security, and can be inappropriate in certain circumstances. In this chapter, you have seen how ACLs provide more fine-grained means to access control for filesystem objects by attaching an explicit list of access control rules to each object. You have also seen that ACLs are implemented on top of extended attributes, which allow augmenting filesystem objects with additional and more complex attributes than in the traditional Unix model inherited by Linux.
Networks
That Linux is a child of the Internet is beyond contention. Thanks, above all, to Internet communication, the development of Linux has demonstrated the absurdity of the widely held opinion that project management by globally dispersed groups of programmers is not possible. Since the first kernel sources were made available on an ftp server more than a decade ago, networks have always been the central backbone for data exchange, for the development of concepts and code, and for the elimination of kernel errors. The kernel mailing list is a living example that nothing has changed. Everybody is able to read the latest contributions and add their own opinions to promote Linux development — assuming, of course, that the opinions expressed are reasonable.
Linux has a very cozy relationship with networks of all kinds — understandably as it came of age with the Internet. Computers running Linux account for a large proportion of the servers that build the Internet. Unsurprisingly, network implementation is a key kernel component to which more and more attention is being paid. In fact, there are very few network options that are not supported by Linux. Implementation of network functionality is one of the most complex and extensive parts of the kernel. In addition to classic Internet protocols such as TCP, UDP, and the associated IP transport mechanism, Linux also supports many other interconnection options so that all conceivable types of computers and operating systems are able to interoperate. The work of the kernel is not made any simpler by the fact that Linux also supports a gigantic hardware spectrum dedicated to data transfer — ranging from Ethernet cards and token ring adapters to ISDN cards and modems. Nevertheless, Linux developers have been able to come up with a surprisingly well-structured model to unify very different approaches.
Even though this chapter is one of the longest in the book, it makes no claim to cover every detail of network implementation. Even an outline description of all drivers and protocols is beyond the scope of a single book — many would be needed owing to the volume of information. Not counting device drivers for network cards, the C implementation of the network layer occupies 15 MiB in the kernel sources, and this equates to more than 6,000 printed pages of code. The sheer number of header files that relate to networking has motivated the kernel developers to store them not in the standard location include/linux, but devote the special directory include/net to them. Embedded in this code are many concepts that form the logical
backbone of the network subsystem, and it is these that interest us in this chapter. Our discussion is restricted mainly to the TCP/IP implementation because it is by far the most widely used network protocol. Of course, development of the network layer did not start with a clean sheet. Standards and conventions for exchanging data between computers had already existed for decades and were well known and well established. Linux also implements these standards to link to other computers.
12.1 Linked Computers
Communication between computers is a complex topic that raises many questions such as:
❑ How is the physical connection established? Which cables are used? Which restrictions and special requirements apply in terms of the media?
❑ How are transmission errors handled?
❑ How are individual computers identified in a network?
❑ How are data exchanged between computers connected to each other via intervening computers? And how is the best route found?
❑ How are data packaged so that they are not reliant on special features of individual computers?
❑ If there are several network services on a computer, how are they identified?
This catalog of questions could be extended at will. Unfortunately, the number of answers as well as the number of questions is almost unlimited, so that over time many suggestions have been put forward as to how to deal with specific problems. The most "reasonable" systems are those that classify problems into categories and create various layers to resolve clearly defined issues and communicate with the other layers by means of set mechanisms. This approach dramatically simplifies implementation, maintenance, and, above all, troubleshooting.
12.2 ISO/OSI and TCP/IP Reference Model
The International Organization for Standardization — better known as ISO — has devised a reference model that defines the various layers that make up a network. This model comprises the seven layers shown in Figure 12-1 and is called the Open Systems Interconnection (OSI) model. However, the division into seven layers is too detailed for some issues. Therefore, in practice, use is often made of a second reference model in which some layers of the ISO/OSI model are combined into new layers. This model has only four layers so that its structure is simpler. It is known as the TCP/IP reference model, where IP stands for Internet Protocol and TCP for Transmission Control Protocol. Most of today’s communication across the Internet is based on this model. Figure 12-1 compares the layers of the two models. Each layer may speak only to the layer immediately above or below. For instance, the transport layer in the TCP/IP model may communicate only with the Internet and application layer but is totally independent of the host-to-network layer (ideally, it does not even know that such a layer exists).
Figure 12-1 maps the OSI application, presentation, and session layers onto the TCP/IP application layer (HTTP, FTP, etc.), the OSI transport layer onto the TCP/IP transport layer (TCP, UDP), the OSI network layer onto the Internet layer (IP), and the OSI data link and physical layers onto the host-to-network layer.
Figure 12-1: TCP/IP and ISO/OSI reference models.
The various layers perform the following tasks:
❑ The host-to-network layer is responsible for transferring information from one computer to a distant computer. It deals with the physical properties of the transmission medium1 and with dividing the data stream into frames of a certain size to permit retransmission of data chunks if transmission errors occur. If several computers are sharing a transmission line, the network adapters must have a unique ID number known as a MAC address that is usually burned into the hardware. An agreement between manufacturers ensures that this number is globally unique. An example of a MAC address is 08:00:46:2B:FE:E8. In the view of the kernel, this layer is implemented by device drivers for network cards.
❑ The network layer of the OSI model is called the Internet layer in the TCP/IP model, but both refer basically to the same task of exchanging data between any computers in a network, not necessarily computers that are directly connected, as shown in Figure 12-2. A direct transmission link between computers A and B is not possible because they are not physically connected to each other. The task of the network layer is therefore to find a route via which the computers can talk to each other; for example, A–E–B or A–E–C–B.
Figure 12-2: Network-linked computers.
1 Predominantly coaxial cable, twisted-pair cable, and fiber optic links are used, but there is an increasing trend toward wireless transmission.
The network layer is also responsible for additional connection details such as splitting the data to be transported into packets of a specific size. This is necessary because the computers along the route may have different maximum limits to the size of the data packets they can accept. When data are sent, the data stream is split into packets that are reassembled upon receipt. This is done so that higher-level protocols can operate transparently with data units of a guaranteed size without having to bother with the specific properties of the Internet or network layer. The network layer also assigns unique addresses within the network so that computers can talk to each other (these are not the same as the abovementioned hardware addresses because networks are usually made up of physical subnets). In the Internet, the network layer is implemented by means of the Internet Protocol (IP), which comes in two versions (v4 and v6). At the moment, most connections are handled by IPv4, but IPv6 will replace it in the future.2 When I speak of IP connections below, I always mean IPv4 connections. IP uses addresses formatted like this — 192.168.1.8 or 62.26.212.10 — to address computers. These addresses are assigned by official registration authorities or providers (sometimes dynamically) or can be freely selected (within defined private ranges). IP allows networks to be divided flexibly into subnets on the address level by supporting various address categories, which, depending on requirements, hold tens of millions of computers and more. However, it is not my intention to deal with this topic in detail. See the wealth of literature on network and system administration, for example, [Ste00] and [Fri02].
❑ In both models, the fourth layer is the transport layer. Its task is to regulate data transport between applications running on two linked computers. It is not sufficient to establish communication between the computers themselves; it is also necessary to set up a connection between the client and the server application, and this presupposes, of course, that there is an existing link between the computers. In the Internet, TCP (Transmission Control Protocol) or UDP (User Datagram Protocol) is used for this purpose. Each application interested in data in the IP layer uses a unique port number that uniquely identifies it on the target system. Typically, port 80 is used for web servers. Browser clients must send requests to this address to obtain the desired data. (Naturally, the client must also have a unique port number so that the web server can respond to the request, but this port number is generated dynamically.) To fully define a port address, the port number is usually appended to the IP address after a colon; for example, a web server on the computer with the address 192.168.1.8 is uniquely identifiable by the address 192.168.1.8:80. An additional task of this layer can (but need not) be the provision of a reliable connection over which data are transmitted in a given sequence. The above feature and the TCP protocol are discussed in Section 12.9.2.
❑ The application layer in the TCP/IP reference model is represented by layers 5 to 7 (session layer, presentation layer, and application layer) of the OSI model. As the name suggests, this layer represents the application view of a network connection. Once a communication connection has been established between two applications, this layer is responsible for the actual contents to be transferred. After all, web servers communicate with their clients differently than mail servers.
2 The move to IPv6 should already have taken place, but this is very slow in happening, particularly in the academic and commercial sectors. Perhaps the impending exhaustion of IPv4 address space will act as a spur.
A very large number of standard protocols are defined for the Internet. Usually, they are defined in Request for Comments (RFC) documents and must be implemented by applications wishing to use or offer a particular service. Most protocols can be tested with the telnet tool because they operate with simple text commands. A typical example of the communication flow between a browser and web server is shown below.
wolfgang@meitner> telnet 192.168.1.20 80
Trying 192.168.1.20...
Connected to 192.168.1.20.
Escape character is '^]'.
GET /index.html HTTP/1.1
Host: www.sample.org
Connection: close
HTTP/1.1 200 OK
Date: Wed, 09 Jan 2002 15:24:15 GMT
Server: Apache/1.3.22 (Unix)
Content-Location: index.html.en
Vary: negotiate,accept-language,accept-charset
TCN: choice
Last-Modified: Fri, 04 May 2001 00:00:38 GMT
ETag: "83617-5b0-3af1f126;3bf57446"
Accept-Ranges: bytes
Content-Length: 1456
Connection: close
Content-Type: text/html
Content-Language: en
...
telnet is used to set up a TCP connection on port 80 of computer 192.168.1.20. All user input
is forwarded via the network connection to the process associated with this address (which is uniquely identified by the IP address and the port number). A response is sent once the request has been received. The contents of the desired HTML page are output together with a header with information on the document and other stuff. Web browsers use exactly the same procedure to access data transparently to users. As a result of the systematic division of network functionality into layers, applications wishing to communicate with other computers need concern themselves with only a very few details. The actual link between the computers is implemented by lower layers, and all the application has to do is read and generate text strings — regardless of whether the two computers are sitting side by side in the same room or are located on different continents. The layer structure of the network is reflected in the kernel by the fact that the individual levels are implemented in separate sections of code that communicate with each other via clearly defined interfaces to exchange data or forward commands.
12.3 Communication via Sockets
From the programmer’s view, external devices are simply regular files under Linux (and Unix) that are accessed by normal read and write operations, as described in Chapter 8. This simplifies access to resources because only a single, universal interface is needed. The situation is a bit more complicated with network cards because the above scheme either cannot be adopted at all or only with great difficulty. Network cards function in a totally different way from normal block and character devices so that the typical Unix motto that "everything is a file" no longer fully applies.3 One reason is that many different communication protocols are used (in all layers) where many options need to be specified in order to establish a connection — and this cannot be done when device files are opened. Consequently, there are no entries for network cards in the /dev directory.4
Of course, the kernel must provide an interface that is as universal as possible to allow access to its network functions. This problem is not Linux-specific and gave BSD Unix programmers headaches in the 1980s. The solution they adopted — special structures called sockets that are used as an interface for network implementation — has now established itself as an industry standard. Sockets are defined in the POSIX standard and are therefore also implemented by Linux.
Sockets are now used to define and set up network connections so that they can be accessed (particularly by read and write operations) using the normal means of an inode. In the view of programmers, the ultimate result of socket creation is a file descriptor that provides not only the whole range of standard functions but also several enhanced functions. The interface used for the actual exchange of data is the same for all protocols and address families. When a socket is created, a distinction is made not only between address and protocol families but also between stream-based and datagram-based communication. What is also important (with stream-oriented sockets) is whether a socket is generated for a client or for a server program.
To illustrate the function of a socket from a user point of view, I include a short sample program to demonstrate just a few of the network programming options. Detailed descriptions are provided in numerous specialized publications, [Ste00], for example.
12.3.1 Creating a Socket Sockets can be used not only for IP connections with different transport protocols, but also for all other address and protocol types supported by the kernel (e.g., IPX, Appletalk, local Unix sockets, DECNet, and many other listed in <socket.h>). For this reason, it is essential to specify the desired combination when generating a socket. Although, as a relic of the past, it is possible to select any combination of partners from the address and protocol families, now only one protocol family is usually supported for each address family, and it is only possible to differentiate between stream- and datagram-oriented 3 There are, however, several Unix variants that implement network connections directly by means of device files, /dev/tcp, for example (see [Vah96]). From the application programmer’s point of view and from that of the kernel itself, this is far less elegant than the socket method. Because the differences between network devices and normal devices are particularly evident when a connection is opened, network operations in Linux are only implemented by means of file descriptors (that can be processed with normal file methods) once a connection has been set up using the socket mechanism. 4 One exception is the TUN/TAP driver, which simulates a virtual network card in userspace and is therefore very useful for debugging, for simulating network cards, or for setting up virtual tunnel connections. Because it does not communicate with any real device in order to send or receive data, this job is done by a program that communicates with the kernel via /dev/tunX or dev/tapX.
communication. For example, only TCP (for streams) or UDP (for datagram services) can be used as the transport protocol for a socket to which an Internet address such as 192.168.1.20 has been assigned. Sockets are generated using the socket library function, which communicates with the kernel via a system call discussed in Section 12.10.3. A third argument could be used in addition to address family and communication type (stream or datagram) in order to select a protocol; however, as already stated, this is not necessary because the protocol is uniquely defined by the first two parameters. Specifying 0 for the third argument instructs the function to use the appropriate default. Once the socket function has been invoked, it is clear what the format of the socket address must be (or in which address family it resides), but no local address has yet been assigned to it. The bind function to which a sockaddr_type structure must be passed as an argument is used for this purpose. The structure then defines the address. Because address types differ from address family to address family, there is a different version of the structure for each family so that various requirements can be satisfied. type specifies the desired address type. Internet addresses are uniquely identified by IP number and port number, which is why sockaddr_in is defined as follows:
struct sockaddr_in {
        sa_family_t     sin_family;     /* Address family   */
        __be16          sin_port;       /* Port number      */
        struct in_addr  sin_addr;       /* Internet address */
        ...
}
An IP address and a port number are also needed in addition to the address family (here, AF_INET). The IP address is not expected in the usual dotted decimal notation (four numbers separated by dots, i.e., 192.168.1.10), but must be specified as a number. The inet_aton library function converts an ASCII string into the format required by the kernel (and by the C library). For example, the numeric representation of the address 192.168.1.20 is 335653056. It is generated by writing the 1-byte-long sections of the IP address successively into a 4-byte data type that is then interpreted as a number. This permits the unique conversion of both representations. As stated in Chapter 1, CPUs apply two popular conventions for storing numeric values — little and big endian. An explicit network byte order corresponding to the big endian format has been defined to ensure that machines with different byte arrangements are able to communicate with each other easily. Numeric values appearing in protocol headers must therefore always be specified in this format. The fact that both the IP address and the port number consist only of numbers must be taken into account when defining the values in the sockaddr_in structure. The C library features numerous functions for converting numbers from the native format of the CPU to the network byte order (if the CPU and the network have the same byte order, the functions leave it unchanged). Good network applications always use these functions even if they are developed on big endian machines to ensure that they can be ported to different machine types. To represent little and big endian types explicitly, the kernel provides several data types. __be16, __be32, and __be64 represent big endian numbers with 16, 32, and 64 bits, while the variants with prefix __le are
analogs for little endian values. They are all defined in <linux/types.h>.
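A short userspace sketch of the conversions described above, using the address from the example; the printed values are those seen on a little endian host, and error handling is omitted.

#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
        struct in_addr addr;

        inet_aton("192.168.1.20", &addr);
        /* 335653056 on little endian hosts such as x86 */
        printf("numeric: %u\n", (unsigned int)addr.s_addr);
        /* 0x0700 on little endian hosts (the bytes on the wire are the same everywhere) */
        printf("port 7 in network byte order: 0x%04x\n", htons(7));
        return 0;
}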
12.3.2 Using Sockets It is assumed that you are familiar with the userland side of network programming. However, to briefly illustrate how sockets represent an interface to the network layer of the kernel, I discuss two very brief sample programs, one that acts as a client for echo requests, the other as a server. A text string is sent from the client to the server and is returned unchanged. The TCP/IP protocol is used.
Echo Client The source code for the echo client is as follows5 : #include<stdio.h> #include
        printf("\nBytes received: %u\n", bytes);
        printf("Text: '%s'\n", buf);

        /* End communication (i.e. close socket) */
        close(sockfd);
}
The Internet superdaemon (inetd, xinetd, or similar) normally provides a built-in echo server. Consequently, the source code can be tested immediately after compilation.
wolfgang@meitner> ./echo_client
Connect to 192.168.1.20
Numeric: 335653056
Send: 'Hello World'
Bytes received: 11
Text: 'Hello World'
The following steps are performed by the client:
1. An instance of the sockaddr_in structure is generated to define the address of the server to be contacted. AF_INET indicates that it is an Internet address, and the target server is precisely defined by its IP address (192.168.1.20) and port number (7). Also, the data from the host are converted to the network byte order. htons is used for the port number, and the inet_addr auxiliary function performs the conversion implicitly by translating the text string with a dotted decimal address into a number.

2. A socket is created in the kernel by means of the socket function, which (as shown below) is based on the socketcall system call of the kernel. The result returned is an integer number that is interpreted as a file descriptor — and can therefore be processed by all functions available for regular files, as described in Chapter 8. In addition to these operations, there are other network-specific ways of handling the file descriptor; these permit exact setting of various transmission parameters not discussed here.

3. A connection is set up by invoking the connect function in conjunction with the file descriptor and the server variable that stores the server connection data (this function is also based on the socketcall system call).

4. Actual communication is initiated by sending a text string (‘‘Hello World’’ — how could it be anything else?) to the server by means of write. Writing data to a socket file descriptor is the equivalent of sending data. This step is totally independent of the server location and the protocol used to set up the connection. The network implementation ensures that the character string reaches its destination — no matter how this is done.

5. The server response is read by read, but a buffer must first be allocated to hold the data received. As a precaution, 1,000 bytes are reserved in memory, although we only expect the original string to be returned. read blocks until the server supplies a response and then returns the number of bytes received as an integer number. In this case, this is 11, which is exactly the number of characters in ‘‘Hello World’’.
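Putting the steps together, a minimal client could look like the following sketch. It is not the original listing of this chapter; error handling is omitted, and the buffer size and message are chosen to match the output shown above:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in server;
    char buf[1000];
    int sockfd, bytes;

    /* Step 1: define the server address (192.168.1.20, port 7) */
    server.sin_family = AF_INET;
    server.sin_port = htons(7);
    server.sin_addr.s_addr = inet_addr("192.168.1.20");

    /* Step 2: create the socket */
    sockfd = socket(AF_INET, SOCK_STREAM, 0);

    /* Step 3: connect to the server */
    connect(sockfd, (struct sockaddr *)&server, sizeof(server));

    /* Step 4: send the text string */
    write(sockfd, "Hello World", strlen("Hello World"));

    /* Step 5: read the echoed reply */
    bytes = read(sockfd, buf, sizeof(buf) - 1);
    buf[bytes] = '\0';          /* terminate the string before printing */

    printf("\nBytes received: %d\n", bytes);
    printf("Text: '%s'\n", buf);

    /* End communication (i.e. close socket) */
    close(sockfd);
    return 0;
}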
Echo Server

How sockets are used for server processes differs slightly from how they are used in clients. The following sample program demonstrates how a simple echo server can be implemented; it starts with the usual includes, and the passive-socket setup it performs is sketched after the list below:

#include <stdio.h>
...
        printf("Bytes received: %u\n", bytes);
        printf("Text: '%s'\n", buf);

        /* Send response */
        write(clientfd, buf, bytes);
    }
}
The first section is almost the same as the client code. An instance of the sockaddr_in structure is created to hold an Internet address, but for a different reason: while the client code specifies the address of the server it wishes to connect to, the server specifies the address on which it waits for connections. The socket is generated in exactly the same way as for the client. In contrast to the client, the server does not actively attempt to set up a connection to another program but simply waits passively until it receives a connection request. Three library functions (again based on the universal socketcall system call) are required to set up a passive connection; a sketch combining them follows the list:

❑ bind binds the socket to an address (192.168.1.20:7777 in our example).6

❑ listen instructs the socket to wait passively for an incoming connection request from a client. The function creates a wait queue on which all processes wishing to establish a connection are placed. The length of the queue is defined by the second parameter (SOMAXCONN specifies that the system-internal maximum must be used so as not to arbitrarily restrict the number of waiting processes).

❑ The accept function accepts the connection request of the first client on the wait queue. When the queue is empty, the function blocks until a client wishing to connect is available.
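The following sketch (not the original listing; error handling is omitted, and the names are chosen freely) shows how these calls combine into the main loop of a simple echo server on port 7777:

#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in addr, client;
    socklen_t client_len = sizeof(client);
    char buf[1000];
    int sockfd, clientfd, bytes;

    /* Address the server listens on: 192.168.1.20, port 7777 */
    addr.sin_family = AF_INET;
    addr.sin_port = htons(7777);
    addr.sin_addr.s_addr = inet_addr("192.168.1.20");

    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    bind(sockfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(sockfd, SOMAXCONN);

    /* Accept one client and echo everything it sends */
    clientfd = accept(sockfd, (struct sockaddr *)&client, &client_len);
    while ((bytes = read(clientfd, buf, sizeof(buf))) > 0) {
        printf("Bytes received: %d\n", bytes);
        write(clientfd, buf, bytes);    /* send response */
    }
    printf("Connection closed.\n");

    close(clientfd);
    close(sockfd);
    return 0;
}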
Again, actual communication is performed by read and write, which use the file descriptor returned by accept. The client connection data (supplied by accept and consisting of the IP address and port number) are output for information purposes. While the client IP address for a specific computer is fixed, the port number is selected dynamically by the client's kernel when the connection is established. The function of the echo server is easily implemented by reading all client input with read and writing it back with write in an endless loop. When the client closes the connection, read returns a data stream that is 0 bytes long, so the server then also terminates.

Client:
wolfgang@meitner> ./stream_client
Connect to 192.168.1.20
Numeric: 335653056
Send: 'Hello World'
Bytes received: 11
Text: 'Hello World'

Server:
wolfgang@meitner> ./stream_server
Wait for connection on port 7777
Client: 192.168.1.10:3505
Numeric: 3232235786
Bytes received: 11
Text: 'Hello World'
Connection closed.
6 Under Linux (and all other Unix flavors), all ports below 1,024 are referred to as reserved ports and may be used only by processes with root rights. For this reason, we use the free port number 7,777.
A 4-tuple notation (192.168.1.20:7777, 192.168.1.10:3506) is used to uniquely identify a connection. The first element specifies the address and port of the local system, the second the address and port of the client. An asterisk (*) is substituted if one of the elements is still undefined. A server process listening on a passive socket but not yet connected to a client is therefore denoted by 192.168.1.20:7777, *.*. Two socket pairs are registered in the kernel once a server has duplicated itself with fork to handle a connection:

Listen:       192.168.1.20:7777, *.*
Established:  192.168.1.20:7777, 192.168.1.10:3506
Although the sockets of both server processes have the same IP address/port number combination, they are differentiated by the 4-tuple. Consequently, the kernel must note all four connection parameters when distributing incoming and outgoing TCP/IP packets to ensure that assignments are made correctly. This task is known as multiplexing. The netstat tool displays and checks the state of all TCP/IP connections on the system. The following sample output is produced if two clients are connected to the server:

wolfgang@meitner> netstat -na
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address          Foreign Address        State
tcp        0      0 192.168.1.20:7777      0.0.0.0:*              LISTEN
tcp        0      0 192.168.1.20:7777      192.168.1.10:3506      ESTABLISHED
tcp        0      0 192.168.1.20:7777      192.168.1.10:3505      ESTABLISHED
12.3.3 Datagram Sockets

UDP is a second, widely used transport protocol that builds on IP. UDP stands for User Datagram Protocol and differs from TCP in several basic areas:

❑ UDP is packet-oriented. No explicit connection setup is required before data are sent.

❑ Packets can be lost during transmission. There is no guarantee that data will actually reach their destination.

❑ Packets are not necessarily received in the same order in which they were sent.
UDP is commonly used for video conferencing, audio streaming, and similar services. Here it doesn’t matter if a few packets go missing — all that would be noticed would be brief dropouts in multimedia sequences. However, like IP, UDP guarantees that the contents of packets are unchanged when they arrive at their destinations. An IP address and port number can be used by a TCP and a UDP process at the same time. In multiplexing, the kernel ensures that only packets of the correct transport protocol are forwarded to the appropriate process.
Comparing TCP and UDP is like comparing the telephone network with the postal service. TCP corresponds to a telephone call. The calling party must set up a connection (which must be accepted by the person called) before information can be passed. During the call, all information sent is received in the same order in which it was sent. UDP can be likened to the postal service. Packets (or letters in this analogy) can be sent to recipients without contacting them in advance for permission to do so. There is no guarantee that letters will be delivered (although both the postal service and the network will do their best). Similarly, there is no guarantee that letters will be sent or received in a particular sequence. Those interested in further examples of the use of UDP sockets are referred to the many textbooks on network and system programming.
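Nevertheless, a minimal sketch conveys the flavor of the API. Since no connection setup is required, a single sendto call suffices to transmit a datagram (the address and port are chosen to match the examples above; this is an illustration, not one of the chapter's listings, and delivery of the datagram is not guaranteed):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
    struct sockaddr_in dest;
    int sockfd = socket(AF_INET, SOCK_DGRAM, 0);    /* datagram socket */

    dest.sin_family = AF_INET;
    dest.sin_port = htons(7777);
    dest.sin_addr.s_addr = inet_addr("192.168.1.20");

    /* Send one datagram; no connect is needed beforehand. */
    sendto(sockfd, "Hello World", strlen("Hello World"), 0,
           (struct sockaddr *)&dest, sizeof(dest));

    close(sockfd);
    return 0;
}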
12.4 The Layer Model of Network Implementation

The kernel implements the network layer very similarly to the TCP/IP reference model introduced at the beginning of this chapter. The C code is split into levels with clearly defined tasks, and each level is able to communicate only with the level immediately above and below via clearly defined interfaces. This has the advantage that various devices, transmission mechanisms, and protocols can be combined. For example, normal Ethernet cards can be used not only to set up Internet (IP) connections but also to transmit other protocols such as Appletalk or IPX without the need for any kind of modification to the device driver of the card. Figure 12-3 illustrates the implementation of the layer model in the kernel.

Figure 12-3: Implementation of the layer model in the kernel. (The application layer resides in userspace and uses the C standard library; the transport layer is represented by struct socket, struct sock, and struct proto; the network layer by protocol-specific packet types; the host-to-host layer by dev.c and struct net_device; below that sit hardware-specific driver code and the physical transmission.)

The network subsystem is one of the most comprehensive and demanding parts of the kernel. Why is this so? The answer is that it deals with a very large number of protocol-specific details and subtleties, and the code path through the layer is riddled with excessive function pointers in place of direct function calls. This is unavoidable because of the numerous ways in which the layers can be combined — but this does not make the code path any clearer or easier to follow. In addition, the data structures involved are generally very closely linked with each other. To reduce complexity, the information below relates primarily to the Internet protocols.

The layer model is mirrored not only in the design of the network layer, but also in the way data are transmitted (or, to be more precise, the way in which the data generated and transmitted by the individual layers are packaged). In general, the data of each layer are made up of a header section and a data section, as shown in Figure 12-4.
Figure 12-4: Division into header and data sections.

Whereas the header contains metadata (destination address, length, transport protocol type, etc.) on the data section, the data section itself consists of the useful data (or payload).

The base unit of transmission is the (Ethernet) frame used by the network card to transmit data. The main entry in the frame header is the hardware address of the destination system to which the data are to be transmitted and which is needed for transmission via cable. The data of the higher-level protocol are packaged in the Ethernet frame by including the header and data tuple generated by the protocol in the data section of the frame. This is the IP layer data in Internet networks. Because not only IP packets but also, for example, Appletalk or IPX packets can be transmitted via Ethernet, the receiving system must be able to distinguish between protocol types in order to forward the data to the correct routines for further processing. Analyzing data to find out which transport protocol is used is very time-consuming. As a result, the Ethernet header (and the headers of all other modern protocols) includes an identifier to uniquely identify the protocol type in the data section. The identifiers (for Ethernet) are assigned by an international organization (IEEE). This division is continued for all protocols in the protocol stack. For this reason, each frame transmitted starts with a series of headers followed by the data of the application layer, as shown in Figure 12-5.7

Figure 12-5: Transporting HTTP data via TCP/IP in an Ethernet frame. (The frame consists of a MAC header followed by the IP header, TCP header, HTTP header, and HTML data; the HTTP header plus HTML data form the payload of TCP, which is the payload of IP, which in turn is the payload of the Ethernet frame.)

7 The boundary between the HTTP header and the data section is indicated by a change of shading because this distinction is made in userspace and not in the kernel.
The figure clearly illustrates that part of the bandwidth is inevitably sacrificed to accommodate control information.
12.5 Networking Namespaces

Recall from Chapter 1 that many parts of the kernel are contained in namespaces. These allow for building multiple virtual viewpoints of the system that are separated and segregated from each other. Every instance looks like a single machine running Linux, but, in fact, many such instances can operate simultaneously on a single physical machine. During the development of 2.6.24, the kernel started to adopt namespaces also for the networking subsystem. This adds some extra complexity to the networking layer because all properties of the subsystem that used to be ‘‘global’’ in former versions — for instance, the available network cards — need to be managed on a per-namespace basis now. If a particular networking device is visible in one namespace, it need not be available in another one.

As usual, a central structure is used to keep track of all available namespaces. The definition is as follows:

include/net/net_namespace.h
struct net {
    atomic_t                count;          /* To decided when the network
                                             *  namespace should be freed.
                                             */
    ...
    struct list_head        list;           /* list of network namespaces */
    ...
    struct proc_dir_entry   *proc_net;
    struct proc_dir_entry   *proc_net_stat;
    struct proc_dir_entry   *proc_net_root;

    struct net_device       *loopback_dev;  /* The loopback */

    struct list_head        dev_base_head;
    struct hlist_head       *dev_name_head;
    struct hlist_head       *dev_index_head;
};
Work has only begun to make the networking subsystem fully aware of namespaces. What you see now — the situation in kernel 2.6.24 — still represents a comparatively early stage of development. Therefore, struct net will grow in size in the future as more and more networking components are transferred from a global management to a namespace-aware implementation. For now, the basic infrastructure is in place. Network devices are kept track of under consideration of namespaces, and support for the most important protocols is available. Since I have not yet discussed any specific points of the networking implementation, the structures referenced in struct net are naturally still unknown (however, I promise that this will certainly change in the course of this chapter). For now, it suffices to present a broad overview about what is handled in a namespace-aware fashion:

❑ count is a standard usage counter, and the auxiliary functions get_net and put_net are provided to obtain and release permission to use a specific net instance. When count drops to zero, the namespace is deallocated and removed from the system.

❑ All available namespaces are kept on a doubly linked list that is headed by net_namespace_list; list is used as the list element. The function copy_net_ns adds a new namespace to the list. It is automatically called when a set of new namespaces is created with create_new_namespaces.
❑ Since each namespace can contain different network devices, this must also be reflected in the contents of Procfs (see Chapter 10.1). Three entries require per-namespace handling: /proc/net is represented by proc_net, while /proc/net/stat is represented by proc_net_stat. proc_net_root points to the root element of the Procfs instance for the current namespace, that is, /proc.
❑ Each namespace may have a different loopback device, and loopback_dev points to the (virtual) network device that fulfills this role.
❑ Network devices are represented by struct net_device. All devices associated with a specific namespace are kept on a doubly linked list headed by dev_base_head. The devices are kept on two additional hash tables: One uses the device name as hash key (dev_name_head), and one uses the interface index (dev_index_head). Note that there is a slight difference in terminology between devices and interfaces. While devices represent hardware devices that provide physical transmission capabilities, interfaces can be purely virtual entities, possibly implemented on top of real devices. For example, a network card could provide two interfaces. Since the distinction between these terms is not relevant for our purposes, I use both terms interchangeably in the following.
Many components still require substantial rework to make them handle namespaces correctly, and there is still a considerable way to go until a fully namespace-aware networking subsystem will be available. For instance, kernel 2.6.25 (which was still under development when this chapter was written) will introduce initial preparations to make specific protocols aware of namespaces:

include/net/net_namespace.h
struct net {
    ...
    struct netns_packet     packet;
    struct netns_unix       unx;
    struct netns_ipv4       ipv4;
#if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
    struct netns_ipv6       ipv6;
#endif
};
The new members like ipv4 will store (formerly global) protocol parameters, and protocol-specific structures are introduced for this purpose. The approach proceeds step by step: first, the basic framework is set in place, with the structures initially empty; subsequent steps will then move global properties into the per-namespace representation. More work along these lines is expected to be accepted into future kernel versions.

Each network namespace consists of several components, for example, the representation in Procfs. Whenever a new networking namespace is created, these components must be initialized. Likewise, some cleanups are necessary when a namespace is deleted. The kernel employs the following structure to keep track of all required initialization/cleanup tuples:

include/net/net_namespace.h
struct pernet_operations {
    struct list_head list;

    int (*init)(struct net *net);
    void (*exit)(struct net *net);
};
The structure does not present any surprises: init stores an initialization function, while clean-up work is handled by exit. All available pernet_operations instances are kept on a list headed by pernet_list; list is used as the list element. The auxiliary functions register_pernet_subsys and unregister_pernet_subsys add and remove elements to and from the list, respectively. Whenever a new networking namespace is created, the kernel iterates over the list of pernet_operations and calls the initialization function with the net instance that represents the new namespace as parameter. Cleaning up when a networking namespace is deleted is handled similarly.

Most computers will typically require only a single networking namespace. The global variable init_net (and in this case, the variable is really global and not contained in another namespace!) contains the net instance for this namespace. In the following, I mostly neglect namespaces to simplify matters. It suffices to keep in mind that all global functions of the network layer require a network namespace as parameter, and that any global properties of the networking subsystem may only be referenced by a detour through the namespace under consideration.
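To see how these tuples are used in practice, consider the following sketch of a subsystem registering per-namespace initialization and cleanup callbacks. The names prefixed with foo_ are invented for illustration; only struct pernet_operations and register_pernet_subsys are part of the kernel interface described above:

#include <linux/init.h>
#include <net/net_namespace.h>

static int foo_net_init(struct net *net)
{
    /* allocate and initialize the per-namespace state of this subsystem */
    return 0;
}

static void foo_net_exit(struct net *net)
{
    /* release the per-namespace state again */
}

static struct pernet_operations foo_pernet_ops = {
    .init = foo_net_init,
    .exit = foo_net_exit,
};

static int __init foo_init(void)
{
    /* called once when the subsystem is initialized; from now on,
     * foo_net_init/foo_net_exit run for every namespace created/deleted */
    return register_pernet_subsys(&foo_pernet_ops);
}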
12.6 Socket Buffers

When network packets are analyzed in the kernel, the data of lower-level protocols are passed to higher-level layers. The reverse sequence applies when data are sent. The data (header and payload) generated by the various protocols are successively passed to lower layers until they are finally transmitted. As the speed of these operations is crucial to network layer performance, the kernel makes use of a special structure known as a socket buffer, which is defined as follows:

<skbuff.h>
struct sk_buff {
    /* These two members must be first. */
    struct sk_buff      *next;
    struct sk_buff      *prev;

    struct sock         *sk;
    ktime_t             tstamp;
    struct net_device   *dev;

    struct dst_entry    *dst;

    char                cb[48];

    unsigned int        len,
                        data_len;
    __u16               mac_len,
                        hdr_len;
    union {
        __wsum          csum;
        struct {
            __u16       csum_start;
            __u16       csum_offset;
        };
    };
    __u32               priority;
    __u8                local_df:1,
                        cloned:1,
                        ip_summed:2,
                        nohdr:1,
                        nfctinfo:3;
    __u8                pkt_type:3,
                        fclone:2,
                        ipvs_property:1;
    __u8                nf_trace:1;
    __be16              protocol;

    void                (*destructor)(struct sk_buff *skb);
    ...
    int                 iif;
    ...
    sk_buff_data_t      transport_header;
    sk_buff_data_t      network_header;
    sk_buff_data_t      mac_header;
    ...
    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t      tail;
    sk_buff_data_t      end;
    unsigned char       *head,
                        *data;
    unsigned int        truesize;
    atomic_t            users;
};
Socket buffers are used to exchange data between the network implementation levels without having to copy packet data to and fro — this delivers considerable speed gains. The socket buffer structure is one of the cornerstones of the network layer because it is processed on all levels both when packets are analyzed and generated.
12.6.1 Data Management Using Socket Buffers

Socket buffers are linked by means of the various pointers they contain with an area in memory where the data of a network packet reside, as shown in Figure 12-6. The figure assumes that we are working on a 32-bit system (the organization of a socket buffer is slightly different on a 64-bit machine, as you will see in a moment). The basic idea of a socket buffer is to add and remove protocol headers by manipulating pointers.

❑ head and end point to the start and end of the area in memory where the data reside. This area may be larger than actually needed because it is not clear how big packets will be when they are synthesized.

❑ data and tail point to the start and end of the protocol data area.
Figure 12-6: Link between socket buffer and network packet data. (The pointers head, data, tail, and end delimit the packet memory, while mac_header, network_header, and transport_header locate the MAC, IP, and TCP headers within it.)

❑ mac_header points to the start of the MAC header, while network_header and transport_header point to the header data of the network and transport layer, respectively.

On systems with 32-bit word length, the data type sk_buff_data_t that is used for the various data components is a simple pointer:

<skbuff.h>
typedef unsigned char *sk_buff_data_t;
This enables the kernel to use socket buffers for all protocol types. Simple type conversions are necessary to interpret the data correctly, and several auxiliary functions are provided for this purpose. A socket buffer can, for example, contain a TCP or UDP packet. The corresponding information from the transport header can be extracted with tcp_hdr or udp_hdr, respectively. Both functions convert the raw pointer into an appropriate data type. Other transport layer protocols also provide helper functions of the type XXX_hdr that require a pointer to struct sk_buff and return the reinterpreted transport header data. Observe, for example, how a TCP header can be obtained from a socket buffer:
static inline struct tcphdr *tcp_hdr(const struct sk_buff *skb)
{
    return (struct tcphdr *)skb_transport_header(skb);
}

struct tcphdr is a structure that collects all fields contained in a TCP header; the exact layout is discussed in Section 12.9.2. Similar conversion functions are also available for the network layer. For our purposes, ip_hdr is most important: It is used to interpret the contents of an IP packet.

data and tail enable data to be passed between protocol levels without requiring explicit copy operations, as shown in Figure 12-7, which demonstrates how packets are synthesized. When a new packet is generated, the TCP layer first allocates memory in userspace to hold the packet data (header and payload). The space reserved is larger than needed for the data so that lower-level layers can add further headers. A socket buffer is allocated so that head and end point to the start and end of the space reserved in memory, while the TCP data are located between data and tail. A new layer must be added when the socket buffer is passed to the IP layer. This can simply be written into the reserved space, and all pointers remain unchanged with the exception of data, which now points
to the start of the IP header. The same operations are repeated for the layers below until a finished packet is ready to be sent across the network.
Figure 12-7: Manipulation of the socket buffer in the transition between protocol levels. (Before the transition, data and tail delimit the TCP data; afterward, data points to the IP header placed in front of it, while head, tail, and end remain unchanged.)

The procedure adopted to analyze packets is similar. The packet data are copied into a reserved memory area in the kernel and remain there for the duration of the analysis phase. The socket buffer associated with the packet is passed on from layer to layer, and the various pointers are successively supplied with the correct values. The kernel provides the standard functions listed in Table 12-1 for manipulating socket buffers.
Table 12-1: Operations on Socket Buffers

Function               Meaning
alloc_skb              Allocates a new sk_buff instance.
skb_copy               Creates a copy of the socket buffer and associated data.
skb_clone              Duplicates a socket buffer but uses the same packet data for the original and the copy.
skb_tailroom           Returns the size of the free space at the end of the data.
skb_headroom           Returns the size of the free space at the start of the data.
skb_realloc_headroom   Creates more free space at the start of the data. The existing data are retained.
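As an illustration of how these pointer manipulations look in practice, the following sketch builds a buffer by hand. It is not taken from the kernel sources, and the sizes are arbitrary; besides alloc_skb from the table above, it uses the standard helpers skb_reserve, skb_put, and skb_push, which move the data and tail pointers without copying any payload:

#include <linux/skbuff.h>
#include <linux/string.h>

/* Sketch: reserve headroom for lower-layer headers, append payload,
 * then prepend a header by moving the data pointer backward. */
static struct sk_buff *build_example_skb(void)
{
    struct sk_buff *skb;
    unsigned char *payload, *header;

    skb = alloc_skb(1500, GFP_KERNEL);   /* head/end delimit 1,500 bytes     */
    if (!skb)
        return NULL;

    skb_reserve(skb, 128);               /* leave headroom: data/tail move    */
    payload = skb_put(skb, 100);         /* append 100 bytes: tail advances   */
    memset(payload, 0, 100);

    header = skb_push(skb, 20);          /* prepend a 20-byte header: data
                                            moves back toward head            */
    memset(header, 0, 20);

    return skb;
}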
Socket buffers require numerous pointers to represent the different components of the buffer’s contents. Since low memory footprint and high processing speed are essential for the network layer and thus for struct sk_buff, it is desirable to make the structure as small as possible. On 64-bit CPUs, a little trick can be used to save some space. The definition of sk_buff_data_t is changed to an integer variable:

<skbuff.h>
typedef unsigned int sk_buff_data_t;
Since integer variables require only half the memory of pointers (4 instead of 8 bytes) on such architectures, the structure shrinks by 20 bytes.8 The information contained in a socket buffer is still the same, 8 Since integers and pointers use an identical number of bits on 32-bit systems, the trick does not work for them.
752
5:30pm
Page 752
Mauerer
runc12.tex
V2 - 09/04/2008
5:30pm
Chapter 12: Networks though. data and head remain regular pointers, and all sk_buff_data_t elements are now interpreted as offsets relative to these pointers. A pointer to the start of the transport header is now computed as follows: <skbuff.h>
static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
    return skb->head + skb->transport_header;
}
It is valid to use this approach since 4 bytes are sufficient to describe memory regions of up to 4 GiB, and a socket buffer that exceeds this size will never be encountered. Since the internal representation of socket buffers is not supposed to be visible to the generic networking code, several auxiliary functions as shown above are provided to access the elements of struct sk_buff. They are all defined in <skbuff.h>, and the proper variant is automatically chosen at compile time.

❑ skb_transport_header(const struct sk_buff *skb) obtains the address of the transport header for a given socket buffer.

❑ skb_reset_transport_header(struct sk_buff *skb) resets the start of the transport header to the start of the data section.

❑ skb_set_transport_header(struct sk_buff *skb, const int offset) sets the start of the transport header given the offset to the data pointer.

The same set of functions is available for the MAC and network headers by replacing transport with mac or network, respectively.
12.6.2 Management Data of Socket Buffers

The socket buffer structure contains not only the above pointers, but also other elements that are used to handle the associated data and to manage the socket buffer itself. The less common elements are discussed in this chapter when they are needed. The most important elements are listed below.

❑ tstamp stores the time the packet arrived.

❑ dev specifies the network device on which the packet is processed. dev may change in the course of processing the packet — for instance, when it will leave the computer on another device at some point.

❑ The interface index number of the input device is always preserved in iif. Section 12.7.1 explains how to use this number.

❑ sk is a link to the socket instance (see Section 12.10.1) of the socket used to process the packet.

❑ dst indicates the further route of the packet through the network implementation. A special format is used (this is discussed in Section 12.8.5).

❑ next and prev hold socket buffers in a doubly linked list. The standard list implementation of the kernel is not used here but is replaced by a manual version.
A list head is used to implement wait queues with socket buffers. Its structure is defined as follows:

<skbuff.h>
struct sk_buff_head {
    /* These two members must be first. */
    struct sk_buff   *next;
    struct sk_buff   *prev;

    __u32            qlen;
    spinlock_t       lock;
};
socket buffer points back to the list head, as illustrated in Figure 12-8.
Figure 12-8: Managing socket buffers in a doubly linked list.

Packets are often placed on wait queues, for example, when they are awaiting processing or when packets that have been fragmented are reassembled.
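A minimal sketch of such a queue is shown below (this is an illustration, not kernel code; it uses the standard helpers skb_queue_head_init, skb_queue_tail, and skb_dequeue, which maintain qlen and take the embedded lock internally):

#include <linux/skbuff.h>

/* Sketch: a private wait queue for socket buffers. */
static struct sk_buff_head my_queue;

static void my_queue_demo(struct sk_buff *skb)
{
    skb_queue_head_init(&my_queue);   /* initialize next/prev, qlen and lock */
    skb_queue_tail(&my_queue, skb);   /* enqueue at the tail; qlen becomes 1 */

    skb = skb_dequeue(&my_queue);     /* remove the first buffer again       */
}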
12.7 Network Access Layer

Now that we have examined the structure of the network subsystem in the Linux kernel, we turn our attention to the first layer of the network implementation — the network access layer. This layer is primarily responsible for transferring information between computers and collaborates directly with the device drivers of network cards. It is not my intention to discuss the implementation of the card drivers and the associated problems9 because the techniques employed are only slightly different from those described in Chapter 6. I am much more interested in the interface made available by each card driver and used by the network code to provide an abstract view of the hardware. By reference to Ethernet frames, I explain how data are represented ‘‘on the cable’’ and describe the steps taken between receiving a packet and passing it on to a higher layer. I also describe the steps in the reverse direction when generated packets leave the computer via a network interface.

9 Even though this may be quite interesting — unfortunately not for technical reasons but for product policy reasons.
12.7.1 Representation of Network Devices

In the kernel, each network device is represented by an instance of the net_device structure. Once a structure instance has been allocated and filled, it must be registered with the kernel using register_netdev from net/core/dev.c. This function performs some initialization tasks and registers the device within the generic device mechanism. This creates a sysfs entry (see Chapter 10.3), /sys/class/net/<device>, which links to the device’s directory. A system with one PCI network card and the loopback device has two entries in /sys/class/net:

root@meitner # ls -l /sys/class/net
total 0
lrwxrwxrwx 1 root root 0 2008-03-09 09:43 eth0 -> ../../devices/pci0000:00/0000:00:1c.5/0000:02:00.0/net/eth0
lrwxrwxrwx 1 root root 0 2008-03-09 09:42 lo -> ../../devices/virtual/net/lo
Data Structure

Before discussing the contents of struct net_device in detail, let us address the question of how the kernel keeps track of the available network devices, and how a particular network device can be found. As usual, the devices are not arranged globally, but on a per-namespace basis. Recall that three mechanisms are available for each namespace net:

❑ All network devices are stored in a singly linked list with the list head dev_base.

❑ Hashing by device name. The auxiliary function dev_get_by_name(struct net *net, const char *name) finds a network device on this hash.

❑ Hashing by interface index. The auxiliary function dev_get_by_index(struct net *net, int ifindex) finds the net_device instance given the interface index.
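As a brief sketch of how such a lookup is used (the surrounding function is invented; only dev_get_by_name, dev_put, and init_net are taken from the kernel interfaces mentioned in this chapter), note that dev_get_by_name takes a reference on the device that must be released with dev_put:

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <net/net_namespace.h>

static void print_ifindex_of_eth0(void)
{
    /* look up the device by name in the initial namespace */
    struct net_device *dev = dev_get_by_name(&init_net, "eth0");

    if (dev) {
        printk(KERN_INFO "eth0 has interface index %d\n", dev->ifindex);
        dev_put(dev);   /* release the reference taken by the lookup */
    }
}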
The net_device structure holds all conceivable information on the device. It spans more than 200 lines and is one of the most voluminous structures in the kernel. As the structure is overburdened with details, a much simplified — but still quite long — version is reproduced below.10 Here’s the code:
<netdevice.h>
struct net_device {
    char                name[IFNAMSIZ];
    ...
    unsigned long       mem_end;        /* shared mem end       */
    unsigned long       mem_start;      /* shared mem start     */
    unsigned long       base_addr;      /* device I/O address   */
    unsigned int        irq;            /* device IRQ number    */

    unsigned long       state;
    struct list_head    dev_list;
    int                 (*init)(struct net_device *dev);
10 The kernel developers are not quite satisfied with the current state of the structure either. The source code states that ‘‘Actually, this whole structure is a big mistake.’’
    /* Interface index. Unique device identifier */
    int                 ifindex;
    struct net_device_stats* (*get_stats)(struct net_device *dev);

    /* Hardware header description */
    const struct header_ops *header_ops;

    unsigned short      flags;              /* interface flags (a la BSD)   */
    unsigned            mtu;                /* interface MTU value          */
    unsigned short      type;               /* interface hardware type      */
    unsigned short      hard_header_len;    /* hardware hdr length          */

    /* Interface address info. */
    unsigned char       perm_addr[MAX_ADDR_LEN];    /* permanent hw address    */
    unsigned char       addr_len;                   /* hardware address length */

    int                 promiscuity;

    /* Protocol specific pointers */
    void                *atalk_ptr;     /* AppleTalk link       */
    void                *ip_ptr;        /* IPv4 specific data   */
    void                *dn_ptr;        /* DECnet specific data */
    void                *ip6_ptr;       /* IPv6 specific data   */
    void                *ec_ptr;        /* Econet specific data */

    unsigned long       last_rx;        /* Time of last Rx              */
    unsigned long       trans_start;    /* Time (in jiffies) of last Tx */

    /* Interface address info used in eth_type_trans() */
    unsigned char       dev_addr[MAX_ADDR_LEN];     /* hw address, (before bcast
                                                       because most packets are unicast) */
    unsigned char       broadcast[MAX_ADDR_LEN];    /* hw bcast add */

    int                 (*hard_start_xmit) (struct sk_buff *skb,
                                            struct net_device *dev);

    /* Called after device is detached from network. */
    void                (*uninit)(struct net_device *dev);
    /* Called after last user reference disappears. */
    void                (*destructor)(struct net_device *dev);

    /* Pointers to interface service routines. */
    int                 (*open)(struct net_device *dev);
    int                 (*stop)(struct net_device *dev);

    void                (*set_multicast_list)(struct net_device *dev);
    int                 (*set_mac_address)(struct net_device *dev, void *addr);
    int                 (*do_ioctl)(struct net_device *dev, struct ifreq *ifr, int cmd);
    int                 (*set_config)(struct net_device *dev, struct ifmap *map);
    int                 (*change_mtu)(struct net_device *dev, int new_mtu);
    void                (*tx_timeout) (struct net_device *dev);

    int                 (*neigh_setup)(struct net_device *dev, struct neigh_parms *);

    /* Network namespace this network device is inside */
    struct net          *nd_net;

    /* class/net/name entry */
    struct device       dev;
    ...
};
The abbreviations Rx and Tx that appear in the structure are often also used in function names, variable names, and comments. They stand for Receive and Transmit, respectively, and crop up a few times in the following sections. The name of the network device is stored in name. It consists of a string followed by a number to differentiate between multiple adapters of the same type (if, e.g., the system has two Ethernet cards). Table 12-2 lists the most common device classes.
Table 12-2: Designations for Network Devices

Name     Device class
ethX     Ethernet adapter, regardless of cable type and transmission speed
pppX     PPP connection via modem
isdnX    ISDN cards
atmX     Asynchronous transfer mode, interface to high-speed network cards
lo       Loopback device for communication with the local computer
Symbolic names for network cards are used, for example, when parameters are set using the ifconfig tool. In the kernel, network cards have a unique index number that is assigned dynamically when they are registered and is held in the ifindex element. Recall that the kernel provides the dev_get_by_name and dev_get_by_index functions to find the net_device instance of a network card by reference to its name or index number.

Some structure elements define device properties that are relevant for the network layer and the network access layer:

❑ mtu (maximum transfer unit) specifies the maximum length of a transfer frame. Protocols of the network layer must observe this value and may need to split packets into smaller units.

❑ type holds the hardware type of the device and uses constants from <if_arp.h>.

❑ dev_addr stores the hardware address of the device (e.g., the MAC address for Ethernet cards), and addr_len specifies the address length. broadcast is the broadcast address used to send messages to attached stations.

❑ ip_ptr, ip6_ptr, atalk_ptr, and so on are pointers to protocol-specific data not manipulated by the generic code. Several of these pointers may have a non-null value because a network device can be used with several network protocols at the same time.

Most elements of the net_device structure are function pointers to perform network card-typical tasks. Although the implementation differs from adapter to adapter, the call syntax (and the task performed) is always the same. These elements therefore represent the abstraction interface to the next protocol level. They enable the kernel to address all network cards by means of a uniform set of functions, while the low-level drivers are responsible for implementing the details:
❑ open and stop initialize and terminate network cards. These actions are usually triggered from outside the kernel by calling the ifconfig command. open is responsible for initializing the hardware registers and registering system resources such as interrupts, DMA, IO ports, and so on. stop releases these resources and halts transmission.

❑ hard_start_xmit is called to remove finished packets from the wait queue and send them.

❑ header_ops contains a pointer to a structure that provides more function pointers to operations on the hardware header. Most important are header_ops->create, which creates a new hardware header, and header_ops->parse, which analyzes a given hardware header.

❑ get_stats queries statistical data that are returned in a structure of type net_device_stats. This structure consists of more than 20 members, all of which are numeric values to indicate, for example, the number of packets sent, received, with errors, discarded, and so on. (Lovers of statistics can query these data using ifconfig and netstat -i.) Because the net_device structure provides no specific field to store the net_device_stats object, the individual device drivers must keep it in their private data area.

❑ tx_timeout is called to resolve the problem of packet transmission failure.

❑ do_ioctl forwards device-specific commands to the network card.

❑ nd_net is a pointer to the networking namespace (represented by an instance of struct net) to which the device belongs.

Some functions are not normally implemented by driver-specific code but are identical for all Ethernet cards. The kernel therefore makes default implementations available (in net/ethernet/eth.c):

❑ change_mtu is implemented by eth_change_mtu and modifies the maximum transfer unit. The default for Ethernet is 1.5 KiB; other transmission techniques have different defaults. In some situations, it can be useful to increase or decrease this value. However, many cards do not allow this and support only the default hardware setting.
❑ The default implementation of header_ops->create is in eth_header. This function is used to generate the network access layer header for the existing packet data.

❑ header_ops->parse (usually implemented by eth_header_parse) obtains the source hardware address of a given packet.
An ioctl (see Chapter 8) is applied to the file descriptor of a socket to modify the configuration of a network device from userspace. One of the symbolic constants defined in <sockios.h> must be specified to indicate which part of the configuration is to be changed. For example, SIOCSIFHWADDR is responsible for setting the hardware address of a network card, but the kernel ultimately delegates this task to the set_mac_address function of the net_device instance. Device-specific constants are passed to the do_ioctl function. The implementation is very lengthy because of the many adjustment options but is not interesting enough for us to discuss it here.

Network devices work in two directions — they send and they receive (these directions are often referred to as downstream and upstream). The kernel sources include two driver skeletons (isa-skeleton.c and pci-skeleton.c in drivers/net) for use as network driver templates. Below, occasional reference is made to these drivers when we are primarily interested in their interaction with the hardware but do not want to restrict ourselves to a specific proprietary card type. More interesting than the programming of the hardware is the interfaces used by the kernel for communication purposes, which is why I focus on them below. First, we only need to introduce how network devices are registered within the kernel.
Registering Network Devices

Each network device is registered in a two-step process:

1. alloc_netdev allocates a new instance of struct net_device, and a protocol-specific function fills the structure with typical values. For Ethernet devices, this function is ether_setup. Other protocols (not considered in detail) use XXX_setup, where possible values for XXX include fddi (fiber distributed data), tr (token ring), ltalk (localtalk), hippi (high-performance parallel interface), or fc (fiber channel). Some in-kernel pseudo-devices implementing specific ‘‘interfaces’’ without being bound to particular hardware also use the net_device framework. ppp_setup initializes devices for the PPP protocol, for example. Several more XXX_setup functions can be found across the kernel sources.

2. Once struct net_device is completely filled in, it needs to be registered with register_netdev or register_netdevice. The difference between the two functions is that register_netdev allows for working with (limited) format strings for interface names. The name given in net_device->name can contain the format specifier %d. When the device is registered, the kernel selects a unique number that is substituted for %d. Ethernet devices specify eth%d, for instance, and the kernel subsequently creates the devices eth0, eth1, and so on.

The convenience function alloc_etherdev(sizeof_priv) allocates an instance of struct net_device together with sizeof_priv bytes for private use — recall that net_device->priv is a pointer to driver-specific data associated with the device. Additionally, ether_setup mentioned above is called to set Ethernet-specific standard values.
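As a brief illustration (a sketch only — the names prefixed with mydrv_ are invented, and error handling is reduced to the bare minimum), a typical Ethernet driver combines these steps roughly as follows:

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Hypothetical per-device private data of the driver */
struct mydrv_priv {
    int some_state;
};

static int mydrv_probe(void)
{
    struct net_device *netdev;
    int err;

    /* Step 1: allocate net_device plus private data; ether_setup is
     * called internally to fill in Ethernet-specific defaults. */
    netdev = alloc_etherdev(sizeof(struct mydrv_priv));
    if (!netdev)
        return -ENOMEM;

    /* The driver fills in its operations here, e.g. netdev->open,
     * netdev->stop, netdev->hard_start_xmit, ... */

    /* Step 2: register the device; the eth%d name template chosen by
     * ether_setup yields eth0, eth1, ... */
    err = register_netdev(netdev);
    if (err) {
        free_netdev(netdev);
        return err;
    }
    return 0;
}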
The steps taken by register_netdevice are summarized in the code flow diagram in Figure 12-9.

Figure 12-9: Code flow diagram for register_netdevice (call the initialization function net_device->init if one is available; generate an interface index with dev_new_index; check the name and the device features; netdev_register_kobject; insert the device into the namespace-specific list and hashes).

Should a device-specific initialization function be provided by net_device->init, the kernel calls it before proceeding any further. A unique interface index that identifies the device unambiguously within its namespace is generated by dev_new_index. The index is stored in net_device->ifindex. After ensuring that the chosen name is not already in use and checking the device features (see NETIF_F_* in <netdevice.h>), netdev_register_kobject registers the device in the generic device model, and the device is finally inserted into the namespace-specific list and hash tables.
12.7.2 Receiving Packets

Packets arrive at the kernel at unpredictable times. All modern device drivers use interrupts (discussed in Chapter 14) to inform the kernel (or the system) of the arrival of a packet. The network driver installs a handler routine for the device-specific interrupt so that each time an interrupt is raised — whenever a packet arrives — the kernel invokes the handler function to transfer the data from the network card into RAM, or to notify the kernel to do this some time later. Nearly all cards support DMA mode and are able to transfer data to RAM autonomously. However, these data still need to be interpreted and processed, and this is only performed later.
Traditional Method

Currently the kernel provides two frameworks for packet reception. One of them has been in the kernel for a long time, and thus is referred to as the traditional method. This API suffers from problems with very high-speed network adapters, though, and thus a new API (which is commonly referred to as NAPI11) has been devised by the network developers. Let us first start with the traditional method since it is easier to understand. Besides, more adapters use the old variant than the new one. This is fine since their physical transmission speed is not so high as to require the new methods. NAPI is discussed afterward.
Since NNAPI seems rather out of question, it remains interesting to see how the next new revision will be named. However, it might take a while until this problem becomes pressing since the current state of the art does not expose any severe problems that would justify the creation of another API.
760
5:30pm
Page 760
Mauerer
runc12.tex
V2 - 09/04/2008
5:30pm
Chapter 12: Networks Figure 12-10 shows an overview of the path followed by a packet through the kernel to the network layer functions after it arrives at the network adapter. net_rx_action Soft-IRQ
do_softirq
dev.c per-CPU wait queue
netif_rx Interrupt
Driver specific code net_interrupt, net_rx dev.c
Figure 12-10: Path of an incoming packet through the kernel. Because packets are received in the interrupt context, the handler routine may perform only essential tasks so that the system (or the current CPU) is not delayed in performing its other activities for too long. In the interrupt context, data are processed by three short functions12 that carry out the following tasks:
1. net_interrupt is the interrupt handler installed by the device driver. It determines whether the interrupt was really raised by an incoming packet (other possibilities are, e.g., signaling of an error or confirmation of a transmission as performed by some adapters). If it was, control is passed to net_rx.

2. The net_rx function, which is also card-specific, first creates a new socket buffer. The packet contents are then transferred from the network card into the buffer and therefore into RAM, where the header data are analyzed using library functions available in the kernel sources for each transmission type. This analysis determines the network layer protocol used by the packet data — IP, for instance.

3. Unlike the methods mentioned above, netif_rx is not a network driver-specific function but resides in net/core/dev.c. Its call marks the transition between the card-specific part and the universal interface of the network layer. The purpose of this function is to place the received packet on a CPU-specific wait queue and to exit the interrupt context so that the CPU can perform other activities.
The kernel manages the wait queues of incoming and outgoing packets in the globally defined softnet_data array, which contains entries of type softnet_data. To boost performance on multiprocessor systems, wait queues are created per CPU to support parallel processing of packets. Explicit locking to protect the wait queues against concurrent access is not necessary because each CPU modifies only its own queue and cannot therefore interfere with the work of the other CPUs. Below, I ignore the multiprocessor aspect and refer only to a single ‘‘softnet_data wait queue’’ so as not to overcomplicate matters. Only one element of the data structure is of interest for our purposes right now:

12 net_interrupt and net_rx are names taken from the driver skeleton isa-skeleton.c. They have different names in other drivers.
struct softnet_data {
    ...
    struct sk_buff_head     input_pkt_queue;
    ...
};
input_pkt_queue uses the sk_buff_head list head mentioned above to build a linked list of all incoming
packets. netif_rx marks the soft interrupt NET_RX_SOFTIRQ for execution (refer to Chapter 14 for more information) before it finishes its work and exits the interrupt context. net_rx_action is used as the handler function of the softIRQ. Its code flow diagram is shown in Figure 12-11. Keep in mind that a simplified version is described here. The full story — which includes the new methods introduced for high-speed network adapters — follows below.
Figure 12-11: Code flow diagram for net_rx_action.
After a few preparatory tasks, work is passed to process_backlog, which performs the following steps in a loop. To simplify matters, assume that the loop iterates until all pending packets have been processed and is not interrupted by any other condition.
1. __skb_dequeue removes a socket buffer that is managing a received packet from the wait queue.

2. The packet type is analyzed by the netif_receive_skb function so that it can be delivered to the receive function of the network layer (i.e., to a higher layer of the network system). For this, it iterates over all network layer functions that feel responsible for the current type and calls deliver_skb for each of them. In turn, the function uses a type-specific handler func that assumes further processing of the packet in the higher layers like IP.

netif_receive_skb also handles specialties like bridging, but it is not necessary to discuss
these corner cases — at least they are corner cases on average systems — any further. All network layer functions used to receive data from the underlying network access layer are registered in a hash table implemented by the global array ptype_base.13 New protocols are added by means of dev_add_pack. The entries are structures of type packet_type whose definition is as follows:
struct packet_type {
    __be16               type;   /* This is really htons(ether_type). */
    struct net_device    *dev;   /* NULL is wildcarded here           */
    int                  (*func) (struct sk_buff *, struct net_device *,
                                  struct packet_type *, struct net_device *);
    ...
    void                 *af_packet_priv;
    struct list_head     list;
}; type specifies the identifier of the protocol for the handler. dev binds a protocol handler to a specific network card (a null pointer means that the handler is valid for all network devices of the system). func is the central element of the structure. It is a pointer to the network layer function to which the packet is passed if it has the appropriate type. ip_rcv, discussed below, is used for IPv4-based protocols. netif_receive_skb finds the appropriate handler element for a given socket buffer, invokes its func
function, and delegates responsibility for the packet to the network layer — the next higher level of the network implementation.
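To make the registration mechanism concrete, the following sketch shows how a protocol handler could hook itself into the receive path. The handler name my_proto_rcv and its trivial body are invented for illustration; only struct packet_type, dev_add_pack, and the ETH_P_IP identifier are taken from the kernel interfaces described here:

#include <linux/init.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/if_ether.h>

/* Hypothetical receive handler: invoked via func by netif_receive_skb
 * for every packet of the registered type. */
static int my_proto_rcv(struct sk_buff *skb, struct net_device *dev,
                        struct packet_type *pt, struct net_device *orig_dev)
{
    /* ... analyze the packet here ... */
    kfree_skb(skb);
    return 0;
}

static struct packet_type my_proto_packet_type = {
    .type = __constant_htons(ETH_P_IP),  /* protocol identifier, here IPv4 */
    .dev  = NULL,                        /* NULL: accept from all devices  */
    .func = my_proto_rcv,
};

static int __init my_proto_init(void)
{
    dev_add_pack(&my_proto_packet_type);
    return 0;
}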
Support for High-Speed Interfaces

The previously discussed old approach to transferring packets from the network device into higher layers of the kernel works well if the devices do not support too high transmission rates. Each time a frame arrives, an IRQ is used to signal this to the kernel. This implies a notion of ‘‘fast’’ and ‘‘slow.’’ For slow devices, servicing the IRQ is usually finished before the next packet arrives. Since the next packet is also signaled by an IRQ, failing to fulfill this condition — as is often the case for ‘‘fast’’ devices — leads to problems. Modern Ethernet network cards operate at speeds of 10,000 MBit/s, and this would cause true interrupt storms if the old methods were used to drive them. However, if a new IRQ is received while packets are still waiting to be processed, no new information is conveyed to the kernel: It was known before that packets are waiting to be processed, and it is known afterward that packets are supposed to be processed — which is not really any news. To solve this problem, NAPI uses a combination of IRQs and polling.
13 Actually, another list with packet handlers is available: ptype_all contains packet handlers that are called for all packet types.
Assume that no packets have arrived on a network adapter yet, but start to come in at high frequency now. This is what happens with NAPI devices:
1. The first packet causes the network adapter to issue an IRQ. To prevent further packets from causing more IRQs, the driver turns off Rx IRQs for the adapter. Additionally, the adapter is placed on a poll list.

2. The kernel then polls the device on the poll list as long as further packets wait to be processed on the adapter.

3. Rx interrupts are re-enabled again.
If new packets arrive while old packets are still waiting to be processed, the work is not slowed down by additional interrupts. While polling is usually a very bad technique for a device driver (and for kernel code in general), it does not have any drawbacks here: Polling is stopped when no packets need to be processed anymore, and the device returns to the normal IRQ mode of operation. No unnecessary time is wasted with polling empty receive queues as would be the case if polling without support by interrupts were used all the time. Another advantage of NAPI is that packets can be dropped efficiently. If the kernel is sure that processing any new packets is beyond all question because too much other work needs to be performed, then packets can be directly dropped in the network adapter without being copied into the kernel at all. The NAPI method can only be implemented if the device fulfills two conditions:
1.
The device must be able to preserve multiple received packets, for instance, in a DMA ring buffer. I refer to this buffer as an Rx buffer in the following discussion.
2.
It must be possible to disable IRQs for packet reception. However, sending packets and other management functions that possibly also operate via IRQs must remain enabled.
What happens if more than one device is present on the system? This is accounted for by a round robin method employed to poll the devices. Figure 12-12 provides an overview of the situation. IRQ signals
disable IRQs 10
20
Round robin 20
packet reception
Poll devices 10
Higher network layers
Remove device if all packets have been processed
poll list Re-enable IRQs
Figure 12-12: Overview of the NAPI mechanism and the round robin poll list. Recall that it was mentioned above that a device is placed on a poll list when the initial packet arrives into an empty Rx buffer. As is the very nature of a list, the poll list can also contain more than one device.
764
5:30pm
Page 764
Mauerer
runc12.tex
V2 - 09/04/2008
5:30pm
Chapter 12: Networks The kernel handles all devices on the list in a round robin fashion: One device is polled after another, and when a certain amount of time has elapsed in processing one device, the next device is selected and processed. Additionally, each device carries a relative weight that denotes the importance in contrast to other devices on the poll list. Large weights are used for faster devices, while slower devices get lower weights. Since the weight specifies how many packets are processed in one polling round, this ensures that faster devices receive more attention than slower ones. Now that the basic principle of NAPI is clear, let’s discuss the details of implementation. The key change in contrast to the old API is that a network device that supports NAPI must provide a poll function. The device-specific method is specified when the network card is registered with netif_napi_add. Calling this function also indicates that the devices can and must be handled with the new methods.
static inline void netif_napi_add(struct net_device *dev, struct napi_struct *napi,
                                  int (*poll)(struct napi_struct *, int), int weight);

dev points to the net_device instance for the device in question, poll specifies which function is used to poll the device with IRQs disabled, and weight does what you expect it to do: It specifies a relative weight for the interface. In principle, an arbitrary integer value can be specified. Usually 10- and 100-MBit drivers specify 16, while 1,000- and 10,000-MBit drivers use 64. In any case, the weight must not exceed the number of packets that can be stored by the device in the Rx buffer. netif_napi_add requires one more parameter, a pointer to an instance of struct napi_struct. The structure is used to manage the device on the poll list. It is defined as follows:
struct napi_struct {
        struct list_head        poll_list;
        unsigned long           state;
        int                     weight;
        int                     (*poll)(struct napi_struct *, int);
};
The poll list is implemented by means of a standard doubly linked kernel list, and poll_list is used as the list element. weight and poll have the same meaning as described above. state can either be NAPI_STATE_SCHED when the device has to be polled next time the kernel comes around to doing so, or NAPI_STATE_DISABLE once polling is finished and no more packets are waiting to be processed, but the device has not yet been taken off the poll list. Note that struct napi_struct is often embedded inside a bigger structure containing driver-specific information about the network card. This allows for using the container_of mechanism to obtain the information when the kernel polls the card with the poll function.
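Putting these pieces together, a driver typically embeds the napi_struct in its private data and registers the poll method during device setup. The following sketch uses the same invented names as the poll function example below (struct nic, hyper_card_poll); HYPER_NAPI_WEIGHT is likewise a made-up constant, not a kernel symbol:

static int hyper_card_poll(struct napi_struct *napi, int budget);

/* Hypothetical driver-private structure */
struct nic {
        struct net_device *netdev;
        struct napi_struct napi;        /* embedded so that container_of() works */
        /* ... further hardware-specific fields ... */
};

#define HYPER_NAPI_WEIGHT 64            /* typical choice for a gigabit adapter */

static void hyper_card_setup_napi(struct nic *nic)
{
        /* Register the poll method; the device is handled via NAPI from now on */
        netif_napi_add(nic->netdev, &nic->napi, hyper_card_poll,
                       HYPER_NAPI_WEIGHT);
}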
Implementing Poll Functions
The poll function requires two arguments: a pointer to the napi_struct instance and an integer that specifies the budget, that is, how many packets the kernel allows to be processed by the driver. Since we
do not want to deal with the peculiarities of any real networking card, let us discuss a pseudo-function for a very, very fast adapter that needs NAPI:

static int hyper_card_poll(struct napi_struct *napi, int budget)
{
        struct nic *nic = container_of(napi, struct nic, napi);
        struct net_device *netdev = nic->netdev;
        int work_done;

        work_done = hyper_do_poll(nic, budget);

        if (work_done < budget) {
                netif_rx_complete(netdev, napi);
                hcard_reenable_irq(nic);
        }

        return work_done;
}
After obtaining device-specific information from the container of napi_struct, a hardware-specific poll method — in this case, hyper_do_poll — is called to perform the required low-level actions to obtain the packets from the network adapter and pass them to the higher networking layers using netif_receive_skb as before. hyper_do_poll allows processing up to budget packets. The function returns how many packets have actually been processed. Two cases must be distinguished:
❑ If the number of processed packets is less than the granted budget, then no more packets are available and the Rx buffer is empty — otherwise, the remaining packets would have been processed. As a consequence, netif_rx_complete signals this condition to the kernel, and the kernel removes the device from the poll list. In turn, the driver has to re-enable IRQs by means of a suitable hardware-specific method.
❑ If the budget has been completely used up, more packets may still be waiting to be processed. The device is left on the poll list, and interrupts are not enabled again.
Implementing IRQ Handlers
NAPI also requires some changes in the IRQ handlers of network devices. Again, I will not resort to any specific piece of hardware, but present code for an imaginary device:

static irqreturn_t e100_intr(int irq, void *dev_id)
{
        struct net_device *netdev = dev_id;
        struct nic *nic = netdev_priv(netdev);

        if (likely(netif_rx_schedule_prep(netdev, &nic->napi))) {
                hcard_disable_irq(nic);
                __netif_rx_schedule(netdev, &nic->napi);
        }

        return IRQ_HANDLED;
}
Assume that interface-specific data are contained in net_device->private; this is the method used by most network card drivers. The auxiliary function netdev_priv is provided to access it. Now the kernel needs to be informed that a new packet is available. A two-stage approach is required:
1. netif_rx_schedule_prep prepares the device to be put on the poll list. Essentially, this sets the NAPI_STATE_SCHED flag in napi_struct->state.
2. If setting this flag succeeds (it fails only if NAPI is already active), the driver must disable IRQs with a suitable device-specific method. Invoking __netif_rx_schedule adds the device's napi_struct to the poll list and raises the softIRQ NET_RX_SOFTIRQ. This notifies the kernel to start polling in net_rx_action.
Handling the Rx SoftIRQ
After having discussed what individual device drivers are required to do for NAPI, the kernel's responsibilities remain to be investigated. net_rx_action is as before the handler for the softIRQ NET_RX_SOFTIRQ. Recall that a simplified version was shown in the preceding section. With more details about NAPI in place, we are now prepared to discuss all the details. Figure 12-13 shows the code flow diagram.
Figure 12-13: Code flow diagram for net_rx_action.
Essentially, the kernel processes all devices that are currently on the poll list by calling the device-specific poll methods one after another. The device's weight is used as the local budget, that is, the number of packets that may be processed in a single poll step. It must be ensured that not too much time is spent in the softIRQ handler. Processing is aborted on two conditions:
1. More than one jiffie has been spent in the handler.
2. The total number of processed packets is larger than a total budget specified by netdev_budget. Usually, this is set to 300, but the value can be changed via /proc/sys/net/core/netdev_budget.
This budget must not be confused with the local budget for each network device! After each poll step, the number of processed packets is subtracted from the global budget, and if the value drops below zero, the softIRQ handler is aborted.
After an individual device has been polled, the kernel checks whether the number of processed packets is identical with the allowed local budget. If this is the case, the driver could not process all waiting packets within its budget, represented by work == weight in the code flow diagram. The kernel moves the device to the end of the poll list and will continue to poll it after all other devices on the list have been processed. Clearly, this implements round robin scheduling between the network devices.
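The following heavily simplified sketch condenses this logic; it is not the literal kernel source (locking, per-CPU handling, and several corner cases are omitted), but it shows how the global budget, the per-device weight, and the round robin rotation interact:

static void net_rx_action_sketch(struct list_head *poll_list)
{
        int budget = netdev_budget;             /* global budget, default 300 */
        unsigned long start_time = jiffies;

        while (!list_empty(poll_list)) {
                struct napi_struct *n;
                int work;

                if (budget <= 0 || time_after(jiffies, start_time + 1)) {
                        /* Too much work for now: defer to the next softIRQ run */
                        raise_softirq(NET_RX_SOFTIRQ);
                        break;
                }

                n = list_entry(poll_list->next, struct napi_struct, poll_list);
                work = n->poll(n, n->weight);   /* local budget = weight */
                budget -= work;

                if (work == n->weight)
                        /* Device used its full budget: move it to the back */
                        list_move_tail(&n->poll_list, poll_list);
                /* otherwise the poll function has already taken the device
                 * off the list via netif_rx_complete() */
        }
}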
Implementation of the Old API on Top of NAPI
Finally, note how the old API is implemented on top of NAPI. The normal behavior of the kernel is controlled by a dummy network device linked with the softnet queue; the process_backlog standard function in net/core/dev.c is used as the poll method. If no network adapters add themselves to the poll list of the queue, it contains only the dummy adapter, and the behavior of net_rx_action therefore corresponds to a single call of process_backlog in which the packets in the queue are processed regardless of the device from which they originate.
12.7.3 Sending Packets
A finished packet is sent when a protocol-specific function of the network layer instructs the network access layer to process a packet defined by a socket buffer. What must be noted when messages are sent from the computer? In addition to complete headers and the checksums required by the particular protocol and already generated by the higher instances, the route to be taken by the packet is of prime importance. (Even if the computer has only one network card, the kernel still has to distinguish between packets for external destinations and for the loopback link.) Because this question can only be clarified by higher protocol instances (particularly if there is a choice of routes to the desired destination), the device driver assumes that the decision has already been made. Before a packet can be sent to the next correct computer (normally not the same as the target computer because IP packets are usually sent via gateways unless there is a direct hardware connection), it is necessary to establish the hardware address of the receiving network card. This is a complicated process looked at more closely in Section 12.8.5. At this point, simply assume that the receiving MAC address is known. A further header for the network access layer is normally generated by protocol-specific functions. dev_queue_xmit from net/core/dev.c is used to place the packet on the queue for outgoing packets. I ignore the implementation of the device-specific queue mechanism because it reveals little of interest on how the network layer functions. It is sufficient to know that the packet is sent a certain length of time after it has been placed on the wait queue. This is done by the adapter-specific hard_start_xmit function that is present as a function pointer in each net_device structure and is implemented by the hardware device drivers.
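A minimal sketch of such an adapter-specific transmit routine is shown below. It follows the driver model of this kernel generation; struct nic, hyper_hw_tx, and the hyper_* names are invented placeholders, and a real driver would additionally deal with DMA mapping, Tx ring management, and flow control:

static int hyper_card_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct nic *nic = netdev_priv(dev);

        hyper_hw_tx(nic, skb->data, skb->len);  /* hand the data to the hardware */
        dev->trans_start = jiffies;             /* record transmission time */
        dev_kfree_skb(skb);                     /* buffer no longer needed by the kernel */

        return NETDEV_TX_OK;
}

static void hyper_card_setup(struct net_device *dev)
{
        /* During device setup, publish the transmit routine */
        dev->hard_start_xmit = hyper_card_start_xmit;
}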
12.8 Network Layer
The network access layer is still quite strongly influenced by the properties of the transmission medium and the device drivers of the associated adapters. The network layer (and therefore specifically the IP Internet protocol) is almost totally divorced from the hardware properties of the network adapters. Why only almost? As you will see shortly, the layer is responsible not only for sending and receiving data, but
also for forwarding and routing packets between systems not directly connected with each other. Finding the best route and selecting a suitable network device to send the packet also involves handling lower-level address families (such as hardware-specific MAC addresses), which accounts for why the layer is at least loosely associated with network cards. The assignment between the addresses of the network layer and the network access layer is made in this layer — another reason why the IP layer is not fully divorced from the hardware. Fragmentation of larger data packets into smaller units cannot be performed without taking the underlying hardware into account (in fact, the properties of the hardware are what make this necessary in the first place). Because each transmission technique supports a maximum packet size, the IP protocol must offer ways of splitting larger packets into smaller units that can be reassembled by the receiver — unnoticed by the higher layers. The size of the fragmented packets depends on the capabilities of the particular transmission protocol. IP was formally defined in 1981 (in RFC 791) and is therefore of ripe old age.14 Even though the situation on the ground is not as represented in the usual company press releases that praise, for example, each new version of a spreadsheet as the greatest invention since the beginning of mankind, the last two decades have left their mark on today's technology. Deficiencies and unforeseen problems occasioned by the strong growth of the Internet are now more and more evident. This is why the IPv6 standard has been developed as the successor to the present IPv4. Unfortunately, this future standard is only slowly being adopted owing to the lack of a central control authority. In this chapter, our interest focuses on the implementation of the algorithms for Version 4, but we also take a cursory look at future practicable techniques and their implementation in the Linux kernel. To understand how the IP protocol is implemented in the kernel, it is necessary to briefly examine how it works. Naturally, we can only touch on the relevant topics in this huge area. For detailed descriptions, see the many specialized publications, particularly [Ste00] and [Ste94].
12.8.1 IPv4
IP packets use a protocol header as shown in Figure 12-14.
Figure 12-14: Structure of an IP header.
The meanings of the individual components of the structure are explained below.
14 Even though the marketing departments of some companies suggest the opposite, the Internet is older than most of its users.
❑ version specifies the IP protocol version used. Currently, this field accepts the value 4 or 6. On hosts that support both versions, the version used is indicated by the transmission protocol identifier discussed in the previous chapter; this identifier also holds different values for the two versions of the protocol.
❑ IHL defines the header length, which is not always the same owing to the variable number of options.
❑ Codepoint or Type of Service is required for more complex protocol options that need not concern us here.
❑ Length specifies the total length of the packet, in other words, the length of the header plus data.
❑ The fragment ID identifies the individual parts of a fragmented IP packet. The fragmenting system assigns the same fragment ID to all parts of an original packet so that they can be identified as members of the same group. The relative arrangement of the parts is defined in the fragment offset field. The offset is specified in units of 64 bits.
❑ Three status bits (flags) enable and disable specific characteristics; only two of them are used.
❑ DF stands for don't fragment and specifies that the packet must not be split into smaller units.
❑ MF indicates that the present packet is a fragment of a larger packet and is followed by other fragments (the bit is set for all fragments but the last).
The third field is "reserved for future use," which is very unlikely in view of the presence of IPv6.
❑ TTL stands for Time to Live and specifies the number of intermediate stations (or hops) along the route to the receiver.15
❑ Protocol identifies the higher-layer protocol (transport layer) carried in the IP datagram. For example, there are unique values for TCP and UDP.
❑ Checksum contains a checksum calculated on the basis of the contents of the header and the data. If the specified checksum does not match the figure calculated upon receipt, the packet is discarded because a transmission error has occurred.
❑ src and dest specify the 32-bit IP address of the source and destination.
❑ options is used for extended IP options, not discussed here.
❑ data holds the packet data (payload).
All numeric values in the IP header must be in network byte order (big endian). In the kernel sources the header is implemented in the iphdr data structure:
struct iphdr {
#if defined(__LITTLE_ENDIAN_BITFIELD)
        __u8    ihl:4,
                version:4;
#elif defined (__BIG_ENDIAN_BITFIELD)
        __u8    version:4,
                ihl:4;
#endif
        __u8    tos;
        __u16   tot_len;
        __u16   id;
        __u16   frag_off;
        __u8    ttl;
        __u8    protocol;
        __u16   check;
        __u32   saddr;
        __u32   daddr;
        /* The options start here. */
};

15 In the past, this value was interpreted as the maximum lifetime in seconds.
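As a brief illustration of how these fields are accessed in practice (the function name inspect_ip_header is invented; ip_hdr() returns the IP header within a socket buffer), note that all multi-byte fields have to be converted from network to host byte order before use:

static void inspect_ip_header(struct sk_buff *skb)
{
        struct iphdr *iph = ip_hdr(skb);
        unsigned int header_len = iph->ihl * 4;              /* IHL counts 32-bit words */
        unsigned int total_len  = ntohs(iph->tot_len);       /* header plus payload */
        unsigned int frag_off   = ntohs(iph->frag_off) & IP_OFFSET;

        printk(KERN_DEBUG "IPv%u packet: %u header bytes, %u total, fragment offset %u bytes\n",
               iph->version, header_len, total_len, frag_off * 8);
}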
The ip_rcv function is the point of entry into the network layer. The onward route of a packet through the kernel is illustrated in Figure 12-15.
Figure 12-15: Route of a packet through the IP layer.
The program flow for send and receive operations is not always separate and may be interleaved if packets are only forwarded via the computer. The packets are not passed to higher protocol layers (or to an application) but immediately leave the computer bound for a new destination.
12.8.2 Receiving Packets
Once a packet (respectively, the corresponding socket buffer with appropriately set pointers) has been forwarded to ip_rcv, the information received must be checked to ensure that it is correct. The main check is that the checksum calculated matches that stored in the header. Other checks include, for example, whether the packet has at least the size of an IP header and whether the packet is actually IP Version 4 (IPv6 employs its own receive routine). After these checks have been made, the kernel does not immediately continue with packet processing but allows a netfilter hook to be invoked so that the packet data can be manipulated in userspace. A netfilter hook is a kind of "hook" inserted at defined points in the kernel code to enable packets to be manipulated
dynamically. Hooks are present at various points in the network subsystem, and each one has a special name (label) — for example, NF_IP_POST_ROUTING.16 When the kernel arrives at a hook, the routines registered for the label are invoked in userspace. Kernel-side processing (possibly with a modified packet) is then continued in a further kernel function. Section 12.8.6 below discusses the implementation of the netfilter mechanism. In the next step, the received IP packets arrive at a crossroads where a decision is made as to whether they are intended for the local system or for a remote computer. Depending on the answer, they must either be forwarded to one of the higher layers or transferred to the output path of the IP level (I don't bother with the third option — delivery of packets to a group of computers by means of multicast). ip_route_input is responsible for choosing the route. This relatively complex decision is discussed in
detail in Section 12.8.5. The result of the routing decision is that a function for further packet processing is chosen. Available functions are ip_local_deliver and ip_forward. Which is selected depends on whether the packet is to be delivered to local routines of the next higher protocol layer or is to be forwarded to another computer in the network.
12.8.3 Local Delivery to the Transport Layer
If the packet is intended for the local computer, ip_local_deliver must try to find a suitable transport layer function to which the data can be forwarded. IP packets typically use TCP or UDP as the transport layer.
Defragmentation
This is made difficult by the fact that IP packets may be fragmented. There is no certainty that a full packet is available. The first task of the function is therefore to reassemble a fragmented packet from its constituent parts by means of ip_defrag.17 The corresponding code flow diagram is shown in Figure 12-16.
Figure 12-16: Code flow diagram for ip_defrag.
16 Note that kernel 2.6.25 (which was still under development when this book was written) will change the names from NF_IP_* to NF_INET_*. This change unifies the names for IPv4 and IPv6.
17 The kernel recognizes that a packet is fragmented either by the set more fragments bit or by a non-zero value in the offset field; a cleared more fragments bit indicates that a fragment is the last one in the packet.
The kernel manages the fragments of an originally composite packet in a separate cache known as a fragment cache. In the cache, fragments that belong together are held in a separate wait queue until all fragments are present. The ip_find function is then invoked. It uses a hashing procedure involving the fragment ID, source and destination address, and packet protocol identifier to check whether a wait queue has already been created for the packet. If not, a new queue is created and the packet is placed on it. Otherwise, the address of the existing queue is returned so that ip_frag_queue can place the packet on it.18 When all fragments of the packet are in the cache (i.e., the first and last fragment are present and the data in all the fragments equal the expected total length of the packet), the individual fragments are reassembled by ip_frag_reasm. The socket buffer is then released for further processing. If not all fragments of a packet have arrived, ip_defrag returns a null pointer that terminates packet processing in the IP layer. Processing is resumed when all fragments are present.
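Conceptually, the lookup key used by ip_find is the tuple sketched below. The structure frag_key is invented for illustration only; the real kernel bundles these values, together with timers and the list of already received fragments, in struct ipq:

/* Hypothetical lookup key for a fragment queue (illustrative only) */
struct frag_key {
        __be16  id;             /* fragment ID from the IP header */
        __be32  saddr;          /* source address */
        __be32  daddr;          /* destination address */
        __u8    protocol;       /* transport protocol identifier */
};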
Delivery to the Transport Layer
Let us go back to ip_local_deliver. After packet defragmentation, the netfilter hook NF_IP_LOCAL_IN is called to resume processing in ip_local_deliver_finish. There the packet is passed to a transport layer function that must first be determined by reference to the protocol identifier. All protocols based on the IP layer have an instance of the structure net_protocol that is defined as follows:
include/net/protocol.h
struct net_protocol {
        int     (*handler)(struct sk_buff *skb);
        void    (*err_handler)(struct sk_buff *skb, u32 info);
        ...
};
❑ handler is the protocol routine to which the packets are passed (in the form of socket buffers) for further processing.
❑ err_handler is invoked when an ICMP error message is received and needs to be passed to higher levels.
The inet_add_protocol standard function is used to store each instance in the inet_protos array that maps the protocols onto the individual list positions using a hashing method. Once the IP header has been "removed" by means of the usual pointer manipulations in the socket buffer, all that remains to be done is to invoke the corresponding receive routine of the transport layer stored in the handler field of net_protocol, for example, the tcp_v4_rcv routine to receive TCP packets and udp_rcv to receive UDP packets. Section 12.9 examines the implementation of these functions.
18 The fragment cache uses a timer mechanism to remove fragments from the cache. When it expires, fragments in the cache are deleted if not all fragments have arrived by then.
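To make the registration step concrete, this is roughly how TCP announces itself to the IP layer (condensed from net/ipv4/af_inet.c; the exact set of initialized fields varies between kernel versions, and the wrapper function register_tcp_with_ip is invented for illustration):

static struct net_protocol tcp_protocol = {
        .handler        = tcp_v4_rcv,
        .err_handler    = tcp_v4_err,
};

static int __init register_tcp_with_ip(void)
{
        if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
                printk(KERN_CRIT "Unable to register TCP with the IP layer\n");
        return 0;
}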
12.8.4 Packet Forwarding
IP packets may be delivered locally as described above, or they may leave the IP layer for forwarding to another computer without having come into local contact with the higher protocol instances. There are two categories of packet destinations:
1. Target computers in one of the local networks to which the sending computer is attached.
2. Geographically remote computers not attached to the local network and accessible only via gateways.
The second scenario is rather more complicated. The first station to which the packet is forwarded along the remaining route must be found in order to move one step closer to the final destination. Information is therefore required not only on the structure of the network in which the computer resides but also on the structure of the "adjacent" networks and associated outgoing paths. This information is provided by routing tables managed by the kernel in a variety of data structures discussed in Section 12.8.5. The ip_route_input function invoked when a packet is received acts as the interface to the routing implementation, not only because it is able to recognize whether a packet is to be delivered locally or forwarded, but also because it finds the route to the destination. The destination is stored in the dst field of the socket buffer. This makes the work of ip_forward very easy, as the code flow diagram in Figure 12-17 shows.
Figure 12-17: Code flow diagram for ip_forward.
First, the function refers to the TTL field to check whether the packet is allowed to pass through another hop. If the TTL value is less than or equal to 1, the packet is discarded; otherwise, the counter is decremented by 1. ip_decrease_ttl does this because changing the TTL field also means that the packet checksum must be altered. Once the netfilter hook NF_IP_FORWARD has been called, the kernel resumes processing in ip_forward_finish. This function delegates its work to two other functions:
❑ If the packet includes additional options (not normally the case), they are processed in ip_forward_options.
❑ dst_output passes the packet to the send function selected during routing and held in skb->dst->output. Normally, ip_output, which passes the packet to the network adapter that matches the destination, is used for this purpose.19 ip_output is part of the send operation for IP packets described in the next section.
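The TTL handling mentioned above can be made concrete with a small sketch modeled on the kernel's ip_decrease_ttl helper (simplified here; the real code uses the kernel's checksum type annotations): instead of recomputing the whole header checksum after the TTL byte changes, the checksum is patched incrementally.

static inline int decrease_ttl_sketch(struct iphdr *iph)
{
        u32 check = (u32)iph->check;

        check += (u32)htons(0x0100);                    /* TTL occupies the high byte */
        iph->check = (u16)(check + (check >= 0xFFFF));  /* fold the carry back in */
        return --iph->ttl;
}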
12.8.5 Sending Packets
The kernel provides several functions that are used by higher protocol layers to send data via IP. ip_queue_xmit, whose code flow diagram is shown in Figure 12-18, is the one most frequently used.
Figure 12-18: Code flow diagram for ip_queue_xmit.
The first task is to find a route for the packet. The kernel exploits the fact that all packets originating from a socket have the same destination address so that the route doesn’t have to be determined afresh each time. A pointer to the corresponding data structure discussed below is linked with the socket data structure. When the first packet of a socket is sent, the kernel is required to find a new route (discussed below). Once ip_send_check has generated the checksum for the packet,20 the kernel calls the netfilter hook NF_IP_LOCAL_OUT. The dst_output function is then invoked; it is based on the destination-specific skb->dst->output function of the socket buffer found during routing. Normally, this is ip_output, which is the point where locally generated and forwarded packets are brought together.
Transition to the Network Access Layer
Figure 12-19 shows the code flow diagram of the ip_output function that splits the route into two parts, depending on whether a packet needs to be fragmented or not.
19 A different output routine is used when, for example, IP packets are tunneled inside IP packets. This is a very special application that is rarely needed.
20 Generation of IP checksums is time-critical and can be highly optimized by modern processors. For this reason, the various architectures provide fast assembly language implementations of their own in ip_fast_csum.
Figure 12-19: Code flow diagram for ip_output.
First of all, the netfilter hook NF_IP_POST_ROUTING is called, followed by ip_finish_output. I first examine the situation in which the packet fits into the MTU of the transmission medium and need not be fragmented. In this case, ip_finish_output2 is directly invoked. The function checks whether the socket buffer still has enough space for the hardware header to be generated. If necessary, skb_realloc_headroom adds extra space. To complete transition to the network access layer, the dst->neighbour->output function set by the routing layer is invoked, normally using dev_queue_xmit.21
Packet Fragmenting
IP packets are fragmented into smaller units by ip_fragment, as shown in Figure 12-20.
Figure 12-20: Fragmenting of an IP packet.
IP fragmenting is very straightforward if we ignore the subtleties documented in RFC 791. A data fragment, whose size is compatible with the corresponding MTU, is extracted from the packet in each cycle of a loop. A new socket buffer, whose old IP header can be reused with a few modifications, is created to hold the extracted data fragment. A common fragment ID is assigned to all fragments to support reassembly in the destination system. The sequence of the fragments is established on the basis of the fragment offset, which is also set appropriately. The more fragments bit must also be set. Only in the last packet of the series must this bit be set to 0. Each fragment is sent using ip_output after ip_send_check has generated a checksum.22
21 The kernel also uses a hard header cache. This holds frequently needed hardware headers that are copied to the start of a packet. If the cache contains a required entry, it is output using a cache function that is slightly faster than dst->neighbour->output.
22 ip_output is invoked via a function pointer passed to ip_fragment as a parameter. This means, of course, that other send functions can be selected. The bridging subsystem is the only user of this possibility, and is not discussed in more detail.
Routing
Routing is an important part of any IP implementation and is required not only to forward external packets, but also to deliver data generated locally in the computer. The problem of finding the correct path for data "out" of the computer is encountered not only with non-local addresses, but also if there are several network interfaces. This is the case even if there is only one physical network adapter — because there are also virtual interfaces such as the loopback device. Each packet received belongs to one of the following three categories:
1. It is intended for the local host.
2. It is intended for a computer connected directly to the current host.
3. It is intended for a remote computer that can only be reached by way of intermediate systems.
The previous section discussed packets of the first category; these are passed to the higher protocol layers for further processing (this type is discussed below because all arriving packets are passed to the routing subsystem). If the destination system of a packet is connected directly to the local host, routing is usually restricted to finding the corresponding network card. Otherwise, reference must be made to the routing information to find a gateway system (and the network card associated with the gateway) via which the packet can be sent. The routing implementation has gradually become more and more comprehensive from kernel version to kernel version and now accounts for a large part of the networking source code. Caches and lengthy hash tables are used to speed up work because many routing tasks are time-critical. This is reflected in the profusion of data structures. For reasons of space, we won't worry about what the mechanisms for finding the correct routes in the kernel data structures look like. We look only at the data structures used by the kernel to communicate the results. The starting point of routing is the ip_route_input function, which first tries to find the route in the routing cache (this topic is not discussed here, nor what happens in the case of multicast routing). ip_route_input_slow is invoked to build a new route from the data structures of the kernel. Basically, the routine relies on fib_lookup, whose implicit return value (via a pointer used as a function argument) is an instance of the fib_result structure containing the information we want. fib stands for forwarding information base and is a table used to manage the routing information held by the kernel. The routing results are linked with a socket buffer by means of its dst element that points to an instance of the dst_entry structure that is filled during lookup. The (very simplified) definition of the data structure is as follows:
include/net/dst.h
struct dst_entry {
        struct net_device       *dev;
        int                     (*input)(struct sk_buff*);
        int                     (*output)(struct sk_buff*);
        struct neighbour        *neighbour;
};
❑ input and output are invoked to process incoming and outgoing packets as described above.
❑ dev specifies the network device used to process the packets.
input and output are assigned different functions depending on packet type.
❑ input is set to ip_local_deliver for local delivery and output to ip_rt_bug (the latter function simply outputs an error message to the kernel logs because invoking output for a local packet in the kernel code is an error condition that should not occur).
❑ input is set to ip_forward for packets to be forwarded, and a pointer to the ip_output function is used for output.
The neighbour element stores the IP and hardware addresses of the computer in the local network, which can be reached directly via the network access layer. For our purposes, it is sufficient to look at just a few elements of the structure: include/net/neighbour.h
struct neighbour {
        struct net_device       *dev;
        unsigned char           ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
        int                     (*output)(struct sk_buff *skb);
};
While dev holds the network device data structure and ha the hardware address of the device, output is a pointer to the appropriate kernel function that must be invoked to transmit a packet via the network adapter. neighbour instances are created by the ARP layer of the kernel that implements the address resolution protocol — a protocol that translates IP addresses into hardware addresses. Because the dst_entry structure has a pointer to neighbour instances, the code of the network access layer can invoke the output function when a packet leaves the system via the network adapter.
12.8.6 Netfilter
Netfilter is a Linux kernel framework that enables packets to be filtered and manipulated in accordance with dynamically defined criteria. This dramatically increases the number of conceivable network options — from a simple firewall through detailed analyses of network traffic to complex state-dependent filters. Because of the sophisticated netfilter design, only a few sections of network code are needed to achieve the above goals.
Extending Network Functionality
In brief, the netfilter framework adds the following capabilities to the kernel:
❑ Packet filtering for different flow directions (incoming, outgoing, forwarded) depending on state and other criteria.
❑ Network address translation (NAT) to convert source and destination addresses in accordance with certain rules. NAT can be used, for example, to implement shared Internet connections where several computers that are not attached directly to the Internet share an Internet access (this is often referred to as masquerading or transparent proxy).
❑ Packet mangling and manipulation, the splitting and modification of packets according to specific rules.
Netfilter functionality can be enhanced by modules loaded into the kernel at run time. A defined rule set informs the kernel when to use the code from the individual modules. The interface between the kernel and netfilter is kept very small to separate the two areas from each other as well as possible (and as little as necessary) in order to prevent mutual interference and improve the network code stability. As frequently mentioned in the preceding sections, netfilter hooks are located at various points in the kernel to support the execution of netfilter code. These are provided not only for IPv4 but also for IPv6 and the DECNET protocol. Only IPv4 is discussed here, but the concepts apply equally to the other two protocols. Netfilter implementation is divided into two areas:
❑ Hooks in the kernel code are used to call netfilter code and are at the heart of the network implementation.
❑ Netfilter modules whose code is called from within the hooks but that are otherwise independent of the remaining network code. A set of standard modules provides frequently needed functions, but user-specific functions can be defined in extension modules.
Iptables used by administrators to configure firewall, packet filter, and similar functions are simply modules that build on the netfilter framework and provide a comprehensive, well-defined set of library functions to facilitate packet handling. I won’t bother describing how the rules are activated and managed from within userspace; see the abundance of literature on network administration.
Calling Hook Functions
Functions of the network layer are interrupted by hooks at which netfilter code is executed. An important feature of hooks is that they split a function into two parts — the first part runs before the netfilter code is called, the second after. Why are two separate functions used instead of calling a specific netfilter function that executes all relevant netfilter modules and then returns to the calling function? This approach, which at first may appear to be somewhat complicated, can be explained as follows. It enables users (or administrators) to decide not to compile the netfilter functionality into the kernel, in which case, the network functions can be executed without any loss of speed. It also dispenses with the need to riddle the network implementation with pre-processor statements that, depending on the particular configuration option (netfilter enabled or disabled), select the appropriate code sections at compilation time. Netfilter hooks are called by the NF_HOOK macro from <linux/netfilter.h>:
static inline int nf_hook_thresh(int pf, unsigned int hook,
                                 struct sk_buff **pskb,
                                 struct net_device *indev,
                                 struct net_device *outdev,
                                 int (*okfn)(struct sk_buff *),
                                 int thresh, int cond)
{
        if (!cond)
                return 1;
        return nf_hook_slow(pf, hook, pskb, indev, outdev, okfn, thresh);
}
#define NF_HOOK_THRESH(pf, hook, skb, indev, outdev, okfn, thresh)                     \
({int __ret;                                                                            \
if ((__ret=nf_hook_thresh(pf, hook, &(skb), indev, outdev, okfn, thresh, 1)) == 1)      \
        __ret = (okfn)(skb);                                                            \
__ret;})

#define NF_HOOK(pf, hook, skb, indev, outdev, okfn) \
        NF_HOOK_THRESH(pf, hook, skb, indev, outdev, okfn, INT_MIN)
The macro arguments have the following meanings:
❑ pf refers to the protocol family from which the called netfilter hook should originate. All calls in the IPv4 layer use PF_INET.
❑ hook is the hook number; possible values are defined in <linux/netfilter_ipv4.h>.
❑ skb is the socket buffer being processed.
❑ indev and outdev are pointers to net_device instances of the network devices via which the packet enters and leaves the kernel. Null pointers can be assigned to these values because this information is not known for all hooks (e.g., before routing is performed, the kernel does not know via which device a packet will leave the kernel).
❑ okfn is a pointer to a function with prototype int (*okfn)(struct sk_buff *). It is executed when the netfilter hook terminates.
The macro expansion makes a detour over NF_HOOK_THRESH and nf_hook_thresh before nf_hook_slow will take care of processing the netfilter hook and calling the continuation function. This seemingly complicated way is necessary because the kernel also provides the possibility to consider only netfilter hooks whose priority is above a certain threshold and skip all others. In the case of NF_HOOK, the threshold is set to the smallest possible integer value so every hook function is considered. Nevertheless, it is possible to use NF_HOOK_THRESH directly to set a specific threshold. Since only the bridging implementation and connection tracking for IPv6 make use of this currently, I will not discuss it any further. Consider the implementation of NF_HOOK_THRESH. First, nf_hook_thresh is called. The function checks if the condition given in cond is true. If that is not so, then 1 is directly passed to the caller. Otherwise, nf_hook_slow is called. The function iterates over all registered netfilter hooks and calls them. If the packet is accepted, 1 is returned, and otherwise some other value. If nf_hook_thresh returned 1, that is, if the netfilter verdict was to accept the packet, then control is passed to the continuation function specified in okfn. The IP forwarding code includes a typical NF_HOOK macro call, which we will consider as an example:
net/ipv4/ip_forward.c
int ip_forward(struct sk_buff *skb)
{
...
        return NF_HOOK(PF_INET, NF_IP_FORWARD, skb, skb->dev, rt->u.dst.dev,
                       ip_forward_finish);
}
The okfn specified is ip_forward_finish. Control is passed directly to this function if the above test establishes that no netfilter hooks are registered for the combination of PF_INET and NF_IP_FORWARD. Otherwise, the relevant netfilter code is executed and control is then transferred to ip_forward_finish (assuming the packet is not discarded or removed from kernel control). If no hooks are installed, code flow is the same as if ip_forward and ip_forward_finish were implemented as a single, uninterrupted procedure. The kernel makes use of the optimization options of the C compiler to prevent speed loss if netfilter is disabled. Kernel versions before 2.6.24 required that the okfn was defined as an inline function:
net/ipv4/ip_forward.c
static inline int ip_forward_finish(struct sk_buff *skb)
{
...
}
This means that it looks like a normal function, but the compiler does not invoke it by means of a classic function call (pass parameters, set the instruction pointer to the function code, read arguments, etc.). Instead, the entire C code is copied to the point at which the function is invoked. Although this results in a longer executable (particularly for larger functions), it is compensated by speed gains. The GNU C compiler guarantees that inline functions are as fast as macros if this approach is adopted. However, starting with kernel 2.6.24, the inline definition could be removed in nearly all cases!
net/ipv4/ip_forward.c
static int ip_forward_finish(struct sk_buff *skb)
{
...
}
This is possible because the GNU C compiler has become able to perform an additional optimization technique: procedure tail calls. They originate from functional languages and are, for instance, mandatory for implementations of the Scheme language. When a function is called as the last statement of another function, it is not necessary that the callee returns to the caller after it has finished its work — there is nothing left to do in the caller. This allows for performing some simplifications of the call mechanism that lead to an execution that is as fast as with the old inline mechanism, without the need to duplicate code by inlining, and thus without increasing the size of the kernel. However, this optimization is not performed by gcc for all hook functions, and a small number of them still remain inlined. If netfilter support is not compiled into the kernel, scanning of the nf_hooks array makes no sense, and the NF_HOOK macro is then defined differently:
include/linux/netfilter.h
#define NF_HOOK(pf, hook, skb, indev, outdev, okfn) (okfn)(skb)
Invocation of the hook function is simply replaced with a call to the function defined in okfn; the macro expands directly to this call. The original two functions have now merged into one, and there is no need for an intervening function call.
Scanning the Hook Table
nf_hook_slow is called if at least one hook function is registered and needs to be invoked. All hooks are stored in the nf_hooks two-dimensional array:
net/netfilter/core.c
struct list_head nf_hooks[NPROTO][NF_MAX_HOOKS] __read_mostly;
NPROTO specifies the maximum number of protocol families supported by the system (currently 34). Symbolic constants for the individual families, such as PF_INET and PF_DECnet, are stored in include/linux/socket.h. It is possible to define NF_MAX_HOOKS lists with hooks for each protocol; the default is 8.
The list_head elements of the table are used as list heads for a doubly linked list that accepts nf_hook_ops instances:
struct nf_hook_ops {
        struct list_head list;

        /* User fills in from here down. */
        nf_hookfn       *hook;
        struct module   *owner;
        int             pf;
        int             hooknum;
        /* Hooks are ordered in ascending priority. */
        int             priority;
};
In addition to the standard elements (list for linking the structure in a doubly linked list, and owner as a pointer to the module data structure of the owner module if the hook is implemented modularly), there are other elements with the following meanings:
❑ hook is a pointer to the hook function that requires the same arguments as the NF_HOOK macro:

typedef unsigned int nf_hookfn(unsigned int hooknum,
                               struct sk_buff **skb,
                               const struct net_device *in,
                               const struct net_device *out,
                               int (*okfn)(struct sk_buff *));

❑ pf and hooknum specify the protocol family and the number associated with the hook. This information could also be derived from the position of the hook list in nf_hooks.
❑ The hooks in a list are sorted in ascending priority (indicated by priority). The full signed int range can be used to indicate the priority, but a number of preferred defaults are defined:
enum nf_ip_hook_priorities {
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_CONNTRACK_DEFRAG = -400,
        NF_IP_PRI_RAW = -300,
        NF_IP_PRI_SELINUX_FIRST = -225,
        NF_IP_PRI_CONNTRACK = -200,
        NF_IP_PRI_MANGLE = -150,
        NF_IP_PRI_NAT_DST = -100,
        NF_IP_PRI_FILTER = 0,
        NF_IP_PRI_NAT_SRC = 100,
        NF_IP_PRI_SELINUX_LAST = 225,
        NF_IP_PRI_CONNTRACK_HELPER = INT_MAX - 2,
        NF_IP_PRI_NAT_SEQ_ADJUST = INT_MAX - 1,
        NF_IP_PRI_CONNTRACK_CONFIRM = INT_MAX,
        NF_IP_PRI_LAST = INT_MAX,
};
This ensures, for example, that mangling of packet data is always performed before any filter operations. The appropriate list can be selected from the nf_hooks array by reference to the protocol family and hook number. Work is then delegated to nf_iterate, which traverses the list elements and invokes the hook functions.
Activating the Hook Functions
Each hook function returns one of the following values:
❑ NF_ACCEPT accepts a packet. This means that the routine in question has made no changes to the data. The kernel continues to use the unmodified packet and lets it run through the remaining layers of the network implementation (or through subsequent hooks).
❑ NF_STOLEN specifies that the hook function has "stolen" a packet and will deal with it. As of this point, the packet no longer concerns the kernel, and it is not necessary to call any further hooks. Further processing by other protocol layers must also be suppressed.
❑ NF_DROP instructs the kernel to discard the packet. As with NF_STOLEN, no further processing by other hooks or in the network layer takes place. Memory space occupied by the socket buffer (and therefore by the packet) is released because the data it contains can be discarded — for example, packets regarded as corrupted by a hook.
❑ NF_QUEUE places the packet on a wait queue so that its data can be processed by userspace code. No other hook functions are executed.
❑ NF_REPEAT calls the hook again.
Ultimately, packets are not further processed in the network layer unless all hook functions return NF_ACCEPT (NF_REPEAT is never the final result). All other packets are either discarded or processed by the netfilter subsystem itself. The kernel provides a collection of hook functions so that separate hook functions need not be defined for every occasion. These are known as iptables and are used for the high-level processing of packets. They are configured using the iptables userspace tool, which is not discussed here.
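For illustration, a minimal kernel module that registers a hook of its own might look as follows. The module is purely hypothetical (it drops every forwarded packet); the hook function follows the nf_hookfn prototype quoted above, and nf_register_hook/nf_unregister_hook are the standard registration interfaces of this kernel generation (later kernels pass the sk_buff directly rather than by reference and register hooks per network namespace):

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

static unsigned int drop_forwarded(unsigned int hooknum, struct sk_buff **skb,
                                   const struct net_device *in,
                                   const struct net_device *out,
                                   int (*okfn)(struct sk_buff *))
{
        return NF_DROP;                 /* discard every packet seen by this hook */
}

static struct nf_hook_ops drop_ops = {
        .hook     = drop_forwarded,
        .pf       = PF_INET,
        .hooknum  = NF_IP_FORWARD,
        .priority = NF_IP_PRI_FILTER,
};

static int __init drop_init(void)
{
        return nf_register_hook(&drop_ops);
}

static void __exit drop_exit(void)
{
        nf_unregister_hook(&drop_ops);
}

module_init(drop_init);
module_exit(drop_exit);
MODULE_LICENSE("GPL");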
12.8.7 IPv6
Even though widespread use of the Internet is a recent phenomenon, its technical foundations have been in place for some time. Today's Internet protocol was introduced in 1981. Although the underlying standard is well thought out and forward-looking, it is showing signs of age. The explosive growth of the Internet over the past few years has thrown up a problem relating to the available address space of IPv4 — 32-bit addresses allow a maximum of 2^32 hosts to be addressed (if subnetting and the like are ignored). Although earlier thought to be inexhaustible, this address space will no longer be sufficient in the foreseeable future because more and more devices — ranging from PDAs and laser printers to coffee machines and refrigerators — require IP addresses.
Overview and Innovations
In 1998 a new standard named IPv6 was defined23 and is now supported by the Linux kernel in production quality. A full implementation of the protocol is located in the net/ipv6 directory. The modular and open structure of the network layer means that IPv6 can make use of the existing, mature infrastructure. As many aspects of IPv6 are similar to IPv4, a brief overview will suffice at this point. A key change in IPv6 is a completely new packet format that uses 128-bit IP addresses, and is therefore easier and faster to process. The structure of an IPv6 packet is shown in Figure 12-21.
Figure 12-21: Structure of an IPv6 packet.
The structure is much simpler than that in IPv4. There are only eight header fields instead of 14. Of particular note is the absence of the fragmentation field. Although IPv6 also supports the splitting of packet data into smaller units, the corresponding information is held in an extension header pointed to by the next header field. Support for a variable number of extension headers makes it easier to introduce new features. The changes between IPv4 and IPv6 have also necessitated modification of the interface via which connections are programmed. Although sockets are still used, many old and familiar functions appear under a new name to support the new options. However, this is a problem faced by userspace and C libraries and will be ignored here. The notation of IP addresses has also changed because of the increase in address length from 32 to 128 bits. Retaining the former notation (tuples of bytes) would have resulted in extremely long addresses. Preference was therefore given to hexadecimal notation for IPv6 addresses, for example, FEDC:BA98:7654:3210:FEDC:BA98:7654:3210 and 1080:0:0:0:8:800:200C:417A. A mixture of IPv4 and IPv6 formats resulting in addresses such as 0:0:0:0:0:FFFF:129.144.52.38 is also permitted.
Implementation
What route does an IPv6 packet take when it traverses the network layer? On the lower layers, there is no change as compared with IPv4 because the mechanisms used are independent of the higher-level protocols. Changes are apparent, however, when data are passed to the IP layer. Figure 12-22 shows a (coarse-grained) code flow diagram for IPv6 implementation.
23 It couldn't be called IPv5 because the name had already been used to designate the STP protocol, which was defined in an RFC but never filtered through to a wide public.
Figure 12-22: Code flow diagram for IPv6 implementation.
As the diagram shows, the structural changes between version 4 and version 6 are minor. Although the function names are different, the code follows more or less the same path through the kernel. For reasons of space, the implementation details are not discussed.24
12.9 Transport Layer
Two main IP-based transport protocols are used — UDP to send datagrams, and TCP to set up secure, connection-oriented services. Whereas UDP is a simple, easily implemented protocol, TCP has several well-concealed (but nevertheless well-known) booby traps and stumbling blocks that make implementation all the more complex.
12.9.1 UDP
As explained in the previous section, ip_local_deliver distributes the transport data contents of IP packets. udp_rcv from net/ipv4/udp.c is used to further process UDP datagram packets. The associated code flow diagram is shown in Figure 12-23. udp_rcv is just a wrapper function for __udp4_lib_rcv since the code is shared with the implementation of the UDP-lite protocol as defined in RFC 3828.
As usual, the input parameter passed to the function is a socket buffer. Once it has been established that the packet data are intact, it is necessary to find a listening socket using __udp4_lib_lookup. The connection parameters can be derived from the UDP header, whose structure is shown in Figure 12-24.
24 Note that the names of the netfilter hooks will be changed in the same manner as noted for IPv4 in kernel 2.6.25, which was still under development when this book was written. The constants will not be prefixed NF_IP6_ anymore, but instead by NF_INET_. The same set of constants is thus used for IPv4 and IPv6.
Figure 12-23: Code flow diagram for udp_rcv.
Figure 12-24: Structure of a UDP packet.
In Figure 12-24, "Source" and "Destination Port" specify the port number of the source and destination system and accept values from 0 to 65,535 because each uses 16 bits.25 "Length" is the total length of the packet (header and data) in bytes, and "Checksum" holds an optional checksum. The header of a UDP packet is represented in the kernel by the following data structure:
struct udphdr {
        __be16  source;
        __be16  dest;
        __be16  len;
        __be16  check;
};
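As a small illustration (the helper name udp_payload_bytes is invented), all fields of this structure are stored in network byte order, so ntohs is needed before doing arithmetic on them:

static unsigned int udp_payload_bytes(const struct udphdr *uh)
{
        unsigned int total = ntohs(uh->len);    /* header plus data, in bytes */

        if (total < sizeof(struct udphdr))
                return 0;                       /* malformed packet */
        return total - sizeof(struct udphdr);
}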
__udp4_lib_lookup from net/ipv4/udp.c is used to find a kernel-internal socket to which the packet is sent. It employs a hashing method to find and return an instance of the sock structure in the udphash global array when a listening process is interested in the packet. If no socket can be found, a destination unreachable message is sent to the original system, and the contents of the packet are discarded. Although I have not yet discussed the sock structure, it inevitably brings the term socket to mind, exactly as is intended. As we are on the borderline of the application layer, the data must be passed to userspace at some time or other using sockets as described in the sample programs at the beginning of the chapter.
25 The IP address need not be specified because it is already in the IP header.
Note, however, that two data structures are used to represent sockets in the kernel. sock is the interface to the network access layer, and socket is the link to userspace. These rather lengthy structures are discussed in detail in the next section, which examines the part of the application layer anchored in the kernel. At the moment, we are interested only in the methods of the sock structure needed to forward data to the next higher layer. These must allow data received to be placed on a socket-specific wait queue and must also inform the receiving process that new data have arrived. Currently, the sock structure can be reduced to the following abbreviated version:
include/net/sock.h
/* Short version */
struct sock {
        wait_queue_head_t       *sk_sleep;
        struct sk_buff_head     sk_receive_queue;
        /* Callback */
        void                    (*sk_data_ready)(struct sock *sk, int bytes);
}
Control is transferred to udp_queue_rcv_skb once udp_rcv has found the appropriate sock instance and immediately afterward to sock_queue_rcv_skb, where the following important actions are performed to complete data delivery to the application layer (see the sketch below):
❑ Processes waiting for data delivery via the socket sleep on the sk_sleep wait queue.
❑ Invoking skb_queue_tail inserts the socket buffer with the packet data at the end of the sk_receive_queue list whose head is held in the socket-specific sock structure.
❑ The function pointed to by sk_data_ready (typically, sock_def_readable if the sock instance is initialized with the standard function sock_init_data) is invoked to inform the socket that new data has arrived. It wakes all processes sleeping on sk_sleep while waiting for data to arrive.
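Reduced to its essence, the delivery step looks like the sketch below (the real sock_queue_rcv_skb additionally checks the receive buffer limits and performs memory accounting):

static int queue_rcv_sketch(struct sock *sk, struct sk_buff *skb)
{
        skb_queue_tail(&sk->sk_receive_queue, skb);     /* append the packet data */
        sk->sk_data_ready(sk, skb->len);                /* wake up waiting readers */
        return 0;
}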
12.9.2 TCP
TCP provides many more functions than UDP. Consequently, its implementation in the kernel is much more difficult and comprehensive, and a whole book could easily be dedicated to the specific problems involved. The connection-oriented communication model used by TCP to support the secure transmission of data streams not only requires greater administrative overhead in the kernel, but also calls for further operations such as explicit connection setup following from negotiations between computers. The handling (and prevention) of specific scenarios as well as optimization to boost transmission performance account for a large part of TCP implementation in the kernel; all their subtleties and oddities are not discussed here. Let's look at the three major components of the TCP protocol (connection establishment, connection termination, and the orderly transmission of data streams) by first describing the procedure required by the standard before going on to examine the implementation. A TCP connection is always in a clearly defined state. These include the listen and established states mentioned above. There are also other states and clearly defined rules for the possible transitions between them, as shown in Figure 12-25. At first glance, the diagram is a little confusing, not to say off-putting. However, the information it contains almost fully describes the behavior of a TCP implementation. Basically, the kernel could distinguish
Figure 12-25: TCP state-transition diagram (states: closed, listen, syn_sent, syn_recv, established, fin_wait_1, fin_wait_2, close_wait, closing, last_ack, time_wait; sending actions in black, received packets in gray, protocol packets underlined).
Basically, the kernel could distinguish between the individual states and implement the transitions between them (using a tool known as a finite state machine). This is neither particularly efficient nor fast, so the kernel adopts a different approach. Nevertheless, when describing the individual TCP actions, I make repeated reference to this diagram and use it as a basis for our examination.
TCP Headers
TCP packets have a header that contains state data and other connection information. The header structure is shown in Figure 12-26.
Figure 12-26: Structure of a TCP packet (source port, destination port, sequence number, acknowledgment number, data offset, reserved bits, the URG/ACK/PSH/RST/SYN/FIN flags, window, checksum, urgent pointer, options with padding, and the payload). ❑
source and dest specify the port numbers used. As with UDP, they consist of 2 bytes.
❑
seq is a sequence number. It specifies the position of a TCP packet within the data stream and is
important when lost data need to be retransmitted.
❑
ack_seq holds a sequence number used when acknowledging receipt of TCP packets.
❑
doff stands for data offset and specifies the length of the TCP header structure, which is not always the same owing to the variable nature of some of the options.
❑
reserved is not available (and should therefore always be set to 0).
❑
urg (urgent), ack (acknowledgment), psh (push), rst (reset), syn (synchronize), and fin are control flags used to check, establish, and terminate connections.
❑
window tells the connection partner how many bytes it can send before the receiver buffer will be full. This prevents backlog when fast senders communicate with slow receivers.
❑
checksum is the packet checksum.
❑
options is a variable-length list of additional connection options.
❑
The actual data (or payload) follows the header. The options field may be padded because the data entry must always start at a 32-bit position (to simplify handling).
The header is implemented in the tcphdr data structure. Note that the bit order of the system matters because the control flags are implemented as a split bitfield:

struct tcphdr {
        __be16  source;
        __be16  dest;
        __be32  seq;
        __be32  ack_seq;
#if defined(__LITTLE_ENDIAN_BITFIELD)
        __u16   res1:4,
                doff:4,
                fin:1,
                syn:1,
                rst:1,
                psh:1,
                ack:1,
                urg:1,
                ece:1,
                cwr:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
        __u16   doff:4,
                res1:4,
                cwr:1,
                ece:1,
                urg:1,
                ack:1,
                psh:1,
                rst:1,
                syn:1,
                fin:1;
#else
#error  "Adjust your <asm/byteorder.h> defines"
#endif
        __be16  window;
        __sum16 check;
        __be16  urg_ptr;
};
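As a brief illustration of how the bitfields are used (a hypothetical helper, not a function from the kernel sources): the header length in bytes follows from doff, and a pure acknowledgment is an ACK segment that carries no payload.

static int tcp_is_pure_ack(const struct tcphdr *th, unsigned int segment_len)
{
        unsigned int header_len = th->doff * 4;   /* doff counts 32-bit words */

        return th->ack && !th->syn && !th->fin && !th->rst &&
               segment_len == header_len;         /* header only, no payload */
}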
Receiving TCP Data
All TCP actions (connection setup and shutdown, data transmission) are performed by sending data packets with specific properties and various flags. Before discussing state transitions, I must first establish how the TCP data are passed to the transport layer and at what point the information in the header is analyzed. tcp_v4_rcv is the entry point into the TCP layer once a packet has been processed by the IP layer. The code flow diagram for tcp_v4_rcv is shown in Figure 12-27.
Figure 12-27: Code flow diagram for tcp_v4_rcv (__inet_lookup first tries __inet_lookup_established; if no socket is found, inet_lookup_listener is consulted; tcp_v4_do_rcv then realizes the TCP state automaton).
Each TCP socket of the system is included in one of three hash tables that accept sockets in the following states: ❑
Sockets that are fully connected.
❑
Sockets that are waiting for a connection (in the listen state).
❑
Sockets that are in the process of establishing a connection (using the three-way handshake discussed below).
After performing various checks on the packet data and copying information from the header into the control block of the socket buffer, the kernel delegates the work of finding a socket that is waiting for the packet to the __inet_lookup function. The only task of this function is to invoke two further functions to scan various hash tables. __inet_lookup_established attempts to return a connected socket. If no appropriate structure is found, the inet_lookup_listener function is invoked to check all listening sockets. In both cases, the functions combine different elements of the respective connection (IP addresses of the client and server, port addresses and the kernel-internal index of the network interface) by means of hash functions to find an instance of the abovementioned sock type. When searching for a listening socket, a score method is applied to find the best candidate among several sockets working with wildcards. This topic is not discussed because the results simply reflect what would intuitively be regarded as the best candidate. In contrast to UDP, work does not end but begins when the appropriate sock structure for the connection is found. Depending on connection state, it is necessary to perform a state transition as shown in
Figure 12-25. tcp_v4_do_rcv is a multiplexer that splits the code flow into different branches on the basis of the socket state. The sections below deal with the individual options and associated actions but do not cover all of the sometimes tricky and seldom used oddities of the TCP protocol. For this, see specialized publications such as [WPR+ 01], [Ben05], and [Ste94].
Three-Way Handshake
A connection must be established explicitly between a client and a server before a TCP link can be used. As already noted, a distinction is made between active and passive connection setup. The kernel (i.e., the kernel of both machines involved in the connection) sees the following situation immediately prior to connection establishment — the state of the client process socket is CLOSED, that of the server socket is LISTEN. A TCP connection is set up by means of a procedure that involves the exchange of three TCP packets and is therefore known as a three-way handshake. As the state diagram in Figure 12-25 shows, the following actions take place: ❑
The client sends SYN to the server26 to signal a connection request. The socket state of the client changes from CLOSED to SYN_SENT.
❑
The server receives the connection request on a listening socket and returns SYN and ACK.27 The state of the server socket changes from LISTEN to SYN_RECV.
❑
The client socket receives the SYN/ACK packet and switches to the ESTABLISHED state, indicating that a connection has been set up. An ACK packet is sent to the server.
❑
The server receives the ACK packet and also switches to the ESTABLISHED state. This concludes connection setup on both sides, and data exchange can begin.
In principle, a connection could be established using only one or two packets. However, there is then a risk of faulty connections as a result of leftover packets of old connections between the same addresses (IP addresses and port numbers). The purpose of the three-way handshake is to prevent this. A special characteristic of TCP links immediately becomes apparent when connections are established. Each packet sent is given a sequence number, and receipt of each packet must be acknowledged by the TCP instance at the receiving end. Let us take a look at the log of a connection request to a web server28:
1 192.168.0.143 192.168.1.10 TCP 1025 > http [SYN] Seq=2895263889 Ack=0
2 192.168.1.10 192.168.0.143 TCP http > 1025 [SYN, ACK] Seq=2882478813 Ack=2895263890
3 192.168.0.143 192.168.1.10 TCP 1025 > http [ACK] Seq=2895263890 Ack=2882478814
The client generates random sequence number 2895263889 for the first packet; it is stored in the SEQ field of the TCP header. The server responds to the arrival of this packet with a combined SYN/ACK packet with a new sequence number (in our example, 2882478813). What we are interested in here is the contents of the SEQ/ACK field (the numeric field, not the flag bit). The server fills this field by adding the number of bytes received +1 to the sequence number received (the underlying principle is discussed below). 26 This is the name given to an empty packet with a set SYN flag. 27 This step could be split into two parts by sending one packet with ACK and a second with SYN, but this is not done in practice. 28 Network connection data can be captured with tools such as tcpdump and wireshark.
Together with the set ACK flag of the packet, this indicates to the client that the first packet has been received. No extra packet need be generated to acknowledge receipt of a data packet. Acknowledgment can be given in any packet in which the ACK flag is set and the ack field is filled. Packets sent to establish the connection do not contain data; only the TCP header is relevant, so the payload length is 0. The mechanisms described are not specific to the Linux kernel but must be implemented by all operating systems wishing to communicate via TCP. The sections below deal more extensively with the kernel-specific implementation of the operations described.
Passive Connection Establishment
Passive connection setup is not initiated by the kernel itself but is triggered by the receipt of a SYN packet containing a connection request. The starting point is therefore the tcp_v4_rcv function, which, as described above, finds a listening socket and transfers control to tcp_v4_do_rcv, whose code flow diagram (for this specific scenario) is shown in Figure 12-28.
Figure 12-28: Code flow diagram for passive connection establishment (tcp_v4_rcv → tcp_v4_do_rcv → tcp_v4_hnd_req → tcp_rcv_state_process).
tcp_v4_hnd_req is invoked to perform the various initialization tasks required in the network layer to establish a new connection. The actual state transition takes place in tcp_rcv_state_process, which consists of a long case statement that differentiates between the possible socket states and invokes the appropriate transition function.
Possible socket states are defined in an enum list: include/net/tcp_states.h
enum {
        TCP_ESTABLISHED = 1,
        TCP_SYN_SENT,
        TCP_SYN_RECV,
        TCP_FIN_WAIT1,
        TCP_FIN_WAIT2,
        TCP_TIME_WAIT,
        TCP_CLOSE,
        TCP_CLOSE_WAIT,
        TCP_LAST_ACK,
        TCP_LISTEN,
        TCP_CLOSING,    /* Now a valid state */

        TCP_MAX_STATES  /* Leave at the end! */
};
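The dispatch principle can be pictured with the following structural sketch; the real tcp_rcv_state_process handles far more cases and details than shown here:

static int tcp_state_dispatch_sketch(struct sock *sk, struct sk_buff *skb)
{
        switch (sk->sk_state) {
        case TCP_LISTEN:
                /* incoming connection request (SYN) */
                break;
        case TCP_SYN_RECV:
                /* final ACK of the three-way handshake:
                 * switch the socket to TCP_ESTABLISHED */
                break;
        case TCP_FIN_WAIT1:
                /* ACK of our FIN: proceed to TCP_FIN_WAIT2 */
                break;
        /* ... further states ... */
        default:
                break;
        }
        return 0;
}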
tcp_v4_conn_request is invoked if the socket state is TCP_LISTEN.29 The function concerns itself with
many details and subtleties of TCP that are not described here. What is important is the acknowledgment packet sent at the end of the function. It contains not only the set ACK flag and the sequence number of the received packet but also a newly generated sequence number and a SYN flag as required by the three-way handshake procedure. This concludes the first phase of connection setup. The next step at the server is reception of the client's ACK packet, which arrives at the tcp_rcv_state_process function via the usual path. The socket state is now TCP_SYN_RECV, which is handled by a separate branch of the case differentiation. The main task of the kernel is to change the socket state to TCP_ESTABLISHED to indicate that a connection has now been set up.
Active Connection Establishment
Active connection setup is initiated by a userspace application invoking the connect library function, which issues the socketcall system call and arrives at the kernel function tcp_v4_connect, whose code flow diagram is shown in the upper part of Figure 12-29.
Figure 12-29: Code flow diagram for active connection establishment (tcp_v4_connect → ip_route_connect → set socket state to SYN_SENT → tcp_connect → tcp_transmit_skb → inet_csk_reset_xmit_timer; on receipt of the reply: tcp_rcv_state_process → tcp_rcv_synsent_state_process → set socket state to ESTABLISHED → tcp_send_ack).
The function starts by looking for an IP route to the destination host using the framework described above. After the TCP header has been generated and the relevant values have been set in a socket buffer, the socket state changes from CLOSED to SYN_SENT. tcp_connect then sends a SYN packet to the IP layer and thus on to the server. In addition, a timer is created in the kernel to ensure that packet sending is repeated if no acknowledgment is received within a certain period. 29 A function pointer to an address family-specific data structure is used because the dispatcher supports both IPv4 and IPv6. As the
implementation of the finite-state machine is the same for IPv4 and IPv6, a large amount of code can be saved.
Now the client must wait for the server to acknowledge its SYN packet and to send its own SYN acknowledging the connection request; this SYN/ACK packet is received by means of the normal TCP mechanisms (lower part of Figure 12-29). This leads to the tcp_rcv_state_process dispatcher, which, in this case, directs the flow of control to tcp_rcv_synsent_state_process. The socket state is set to ESTABLISHED, and tcp_send_ack returns another ACK packet to the server to conclude connection setup.
Transmission of Data Packets
Data are transferred between computers once a connection has been set up as described above. This process is sometimes quite tricky because TCP has several features that call for comprehensive control and security procedures between the communicating hosts: ❑
Byte streams are transmitted in a guaranteed order.
❑
Lost packets are retransmitted by automated mechanisms.
❑
Data flow is controlled separately in each direction and is matched to the speeds of the hosts.
Even though initially these requirements may not appear to be very complex, a relatively large number of procedures and tricks are needed to satisfy them. Because most connections are TCP-based, the speed and efficiency of the implementation are crucial. The Linux kernel therefore resorts to tricks and optimizations, and unfortunately these don’t necessarily make the implementation any easier to understand. Before turning our attention to how data transmission is implemented over an established connection, it is necessary to discuss some of the underlying principles. We are particularly interested in the mechanisms that come into play when packets are lost. The concept of packet acknowledgment based on sequence numbers is also adopted for normal data packets. However, sequence numbers reveal more about data transmission than mentioned above. According to which scheme are sequence numbers assigned? When a connection is set up, a random number is generated (by the kernel using secure_tcp_sequence_number from drivers/char/random.c). Thereafter a system supporting the strict acknowledgment of all incoming data packets is used. A unique sequence number that builds on the number initially sent is assigned to each byte of a TCP transmission. Let us assume, for example, that the initial random number of the TCP system is 100. The first 16 bytes sent therefore have the sequence numbers 100, 101, . . . , 115. TCP uses a cumulative acknowledgment scheme. This means that an acknowledgment covers a contiguous range of bytes. The number sent in the ack field acknowledges all bytes between the last and the current ACK number of a data stream. (The initial sequence number is used as the starting point if an acknowledgment has not yet been sent and there is therefore no last number.) The ACK number confirms receipt of all data up to and including the byte that is 1 less than the number and therefore indicates which byte is expected next. For instance, ACK number 166 acknowledges all bytes up to and including 165 and expects bytes from 166 upward in the next packet. This mechanism is used to trace lost packets. Note that TCP does not feature an explicit re-request mechanism; in other words, the receiver cannot request the sender to retransmit lost packets. The onus is on the sender to retransmit the missing segment automatically if it does not receive an acknowledgment within a certain time-out period. How are these procedures implemented in the kernel? We assume that the connection was established as described above so that the two sockets (on the different systems) both have the ESTABLISHED state.
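Because sequence numbers live in a 32-bit space that wraps around, they must not be compared with plain < and >. The kernel provides small helpers for this in include/net/tcp.h; the essential trick is the signed difference:

static inline int before(__u32 seq1, __u32 seq2)
{
        return (__s32)(seq1 - seq2) < 0;
}
#define after(seq2, seq1)       before(seq1, seq2)

For example, an incoming acknowledgment number ack only acknowledges new data if after(ack, snd_una) holds, where snd_una denotes the oldest unacknowledged sequence number of the connection.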
Receiving Packets
The code flow diagram in Figure 12-30 shows the path taken — starting from the familiar tcp_v4_rcv function — when packets are received.
Figure 12-30: Receiving packets in TCP connections (tcp_v4_rcv → tcp_v4_do_rcv → tcp_rcv_established; packets that are easy to process take the fast path, all others take the slow path, and in both cases sk->sk_data_ready is invoked at the end).
After control has been passed to tcp_v4_do_rcv, a fast path is selected (if a connection already exists) rather than entering the central dispatcher function — this is in contrast to other socket states but is logical because the transmission of data packets accounts for the lion’s share of work in any TCP connection and should therefore be performed as quickly as possible. Once it has been established that the state of the destination socket is TCP_ESTABLISHED, the tcp_rcv_established function is invoked to split the control flow again. Packets that are easy to analyze are handled in the fast path and those with unusual options in the slow path. Packets must fulfill one of the following criteria to be classified as easy to analyze: ❑
The packet must contain only an acknowledgment for the data last sent.
❑
The packet must contain the data expected next.
In addition, none of the SYN, URG, RST, or FIN flags may be set. This description of the ‘‘best case scenario’’ for packets is not Linux-specific but is also found in many other Unix variants.30 Almost all packets fall within these categories,31 which is why it makes sense to differentiate between a fast and a slow path. Which operations are performed in the fast path? A few packet checks are carried out to find more complex packets and return them to the slow path. Thereafter the packet length is analyzed to ascertain whether the packet is an acknowledgment or a data packet. This is not difficult because pure ACK packets do not contain data and must therefore be of exactly the same length as a TCP packet header. 30 This approach was developed by Van Jacobson, a well-known network researcher, and is often referred to as the VJ mechanism. 31 Today’s transmission techniques are so sophisticated that very few errors occur. This was not the case in the early days of TCP.
Although more faults arise on global Internet connections than in local networks, most packets can still be handled in the fast path owing to the low error rates.
The fast path code doesn’t bother with processing ACK segments but delegates this task to tcp_ack. Here, obsolete packets and packets sent too early owing to faulty TCP implementations at the receiving end or to unfortunate combinations of transmission errors and time-outs are filtered out. The most important tasks of this function are not only to analyze new information on the connection (e.g., on the receiving window) and on other subtleties of the TCP protocol, but also to delete acknowledged data from the retransmission queue (discussed below). This queue holds all sent packets and retransmits them if they are not acknowledged by means of an ACK within a certain time period. Because it has been established during selection of the packet for fast path handling that the data received immediately follow the previous segment, the data can be acknowledged by means of an ACK to the sender without the need for any further checks. Finally, the sk_data_ready function pointer stored in the socket is invoked to inform the user process that new data are available. What is the difference between the slow path and the fast path? Owing to the many TCP options, the code in the slow path is more extensive. For this reason, I won’t go into the many special situations that can arise because they are less of a kernel problem and more of a general problem of TCP connections (detailed descriptions are available in, e.g., [Ste94] and [WPR+ 01]). In the slow path, data cannot be forwarded directly to the socket because complicated packet option checks are necessary, and these may be followed by potential TCP subsystem responses. Data arriving out of sequence are placed on a special wait queue, where they remain until a contiguous data segment is complete. Only then can the complete data be forwarded to the socket.
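The fast-path criteria just described can be condensed into the following sketch. The parameters expected_seq and last_sent_seq are stand-ins for the per-connection state the kernel keeps (conceptually, the sequence number expected next and the highest sequence number sent so far); the actual implementation in tcp_rcv_established uses a precomputed header-prediction word instead of explicit flag tests:

static int fast_path_candidate(const struct tcphdr *th, __u32 expected_seq,
                               __u32 last_sent_seq)
{
        /* None of SYN, URG, RST, or FIN may be set ... */
        if (th->syn || th->urg || th->rst || th->fin)
                return 0;

        /* ... and the segment must either acknowledge the data sent last
         * or carry exactly the data expected next. */
        return ntohl(th->ack_seq) == last_sent_seq ||
               ntohl(th->seq) == expected_seq;
}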
Sending Packets
As seen from the TCP layer, the sending of TCP packets begins with the invocation of the tcp_sendmsg function by higher network instances. Figure 12-31 shows the associated code flow diagram.
Figure 12-31: Code flow diagram for tcp_sendmsg (if no connection exists yet: sk_stream_wait_connect; then copy the data into a socket buffer and call tcp_push_one, which uses tcp_snd_test, tcp_transmit_skb → af_specific->queue_xmit, and update_send_head, setting the resend timer if necessary).
Naturally, the state of the socket used must be TCP_ESTABLISHED before data transmission can begin. If this is not the case, the kernel waits (with the help of sk_stream_wait_connect) until a connection has
been established. The data are then copied from the address space of the userspace process into the kernel, where they are used to build a TCP packet. I do not intend to discuss this complicated operation because it involves a large number of procedures, all of which are targeted at satisfying the complex requirements of the TCP protocol. Unfortunately, sending a TCP packet is not limited simply to construction of a packet header and transfer to the IP layer. It is also necessary to comply with the following (by no means exhaustive) list of demands: ❑
Sufficient space for the data must be available in the wait queue of the receiver.
❑
The ECN mechanism must be implemented to prevent connection congestion.
❑
Possible stalemate situations must be detected as otherwise communication comes to a halt.
❑
The TCP slow-start mechanism requires a gradual increase in packet size at the start of communication.
❑
Packets sent but not acknowledged must be retransmitted repeatedly after a certain timeout period until they are finally acknowledged by the receiver.
As the retransmission queue is a key element of reliable data transmission via a TCP connection, let’s take a look here at how it actually works. After a packet has been assembled, the kernel arrives at tcp_push_one, which performs the following three tasks: ❑
tcp_snd_test checks whether the data can be sent at the present time. This may not be possible because of backlogs caused by an overloaded receiver.
❑
tcp_transmit_skb forwards the data to the IP layer using the address family-specific af_specific->queue_xmit function (ip_queue_xmit is used for IPv4).
❑
update_send_head takes care of updating some statistics. More important, it initializes the
retransmit timer of the TCP segment sent. This is not necessary for every sent packet, but only for the first packet that follows after an acknowledged data region. inet_csk_reset_xmit_timer is responsible for resetting the retransmit timer. The timer is the basis
for resending data packets that have not been acknowledged and acts as a kind of TCP transmission guarantee certificate. If the receiver does not acknowledge data receipt within a certain period (time-out), the data are retransmitted. The kernel timer used is described in Chapter 15. The sock instance associated with the particular socket holds a list of retransmit timers for each packet sent. The time-out function used by the kernel is tcp_write_timer, which invokes the function tcp_retransmit_timer if an ACK is not received. The following must be noted when retransmitting segments: ❑
The connection may have been closed in the meantime. In this case, the stored packet and the timer entry are removed from kernel memory.
❑
Retransmission is aborted when more retransmit attempts have been made than specified in the sysctl_tcp_retries2 variable.32
As mentioned above, the retransmit timer is deleted once an ACK has been received for a packet.
32 The default for this variable is 15, but it can be modified using
/proc/sys/net/ipv4/tcp_retries2.
Connection Termination
Like connection setup, shutdown of TCP connections is also brought about by a multistage exchange of packets, as shown in Figure 12-25. A connection can be closed in one of two ways:
1.
A graceful close terminates the connection at the explicit request of one of the participating systems (in rare cases, both systems issue a request at the same time).
2.
Termination or abort can be brought about by a higher protocol (because, e.g., programs have crashed).
Fortunately, since the first situation is by far the more usual, we discuss it and ignore the second. TCP partners must exchange four packets to close a connection gracefully. The sequence of steps is described below.
1.
The standard library function close is invoked in computer A to send a TCP packet whose FIN flag is set in the header. The socket of A switches to the FIN_WAIT_1 state.
2.
B receives the FIN packet and returns an ACK packet. Its socket state changes from ESTABLISHED to CLOSE_WAIT. The socket is informed of receipt of the FIN by means of an ‘‘end of file.’’
3.
After receipt of the ACK packet, the socket state of computer A changes from FIN_WAIT_1 to FIN_WAIT_2.
4.
The application associated with the socket on computer B also executes close to send a FIN segment from B to A. The state of the socket of computer B then changes to LAST_ACK.
5.
Computer A confirms receipt of the FIN with an ACK packet and first goes into the TIME_WAIT state before automatically switching to the CLOSED state after a certain period.
6.
Computer B receives the ACK packet, which causes its socket also to switch to the CLOSED state.
The status transitions are performed in the central dispatcher function (tcp_rcv_state_process), in the path for existing connections (tcp_rcv_established), and in the tcp_close function not yet discussed. The latter is invoked when the user process decides to call the close library function to close a connection. If the state of the socket is LISTEN (i.e., there is no connection to another computer), the approach is simpler because no external parties need be informed of the end of the connection. This situation is checked at the beginning of the procedure, and, if it applies, the response is a change of socket state to CLOSED. If not, tcp_send_fin sends a FIN packet to the other party once the socket state has been set to FIN_WAIT_1 by the tcp_close_state and tcp_set_state call chain.33
33 The approach is not fully compatible with the TCP standard because the socket is not actually allowed to change its state until
after the FIN packet has been sent. However, the Linux alternative is simpler to implement and does not give rise to any problems in practice. This is why kernel developers have gone down this path as noted in a comment to this effect in tcp_close.
The transition from FIN_WAIT_1 to FIN_WAIT_2 is performed by the central dispatcher function tcp_rcv_state_process because there is no longer any need to take the fast path for existing connections. In the familiar case differentiation, a packet received with a set ACK flag triggers the transition to FIN_WAIT_2 by tcp_set_state. All that is now required to place the TCP connection in the TIME_WAIT state (followed automatically by the CLOSED state) is a FIN packet from the other party. The status transitions of the other party that performs a passive close upon receipt of the first FIN packet follow a similar pattern. Because the first FIN packet is received when the state is ESTABLISHED, handling takes place in the slow path of tcp_rcv_established and involves sending an ACK to the other party and changing the socket state to TCP_CLOSE_WAIT. The next state transition (to LAST_ACK) is performed by calling the close library function to invoke the tcp_close_state function of the kernel. Only a further ACK packet from the other party is then needed to terminate the connection. This packet is also handled by the tcp_rcv_state_process function, which changes the socket state to CLOSED (by means of tcp_done), releases the memory space occupied by the socket, and thus finally terminates the connection. Only the possible transition from the FIN_WAIT_1 state is described above. As the TCP finite-state machine illustrated in Figure 12-25 shows, two other alternatives are implemented by the kernel but are far less frequently used than the path I describe, reason enough not to bother with them here.
12.10 Application Layer
Sockets are used to apply the Unix metaphor that ‘‘everything is a file’’ to network connections. The interfaces between kernel and userspace sockets are implemented in the C standard library using the socketcall system call. socketcall acts as a multiplexer for various tasks performed by various procedures, for instance, opening
a socket or binding or sending data. Linux adopts the concept of kernel sockets to make communication with sockets in userspace as simple as possible. There is an instance of the socket structure and the sock structure for every socket used by a program. These serve as an interface downward (to the kernel) and upward (to userspace). Both structures were referenced in the previous sections without defining them in detail, which is done now.
12.10.1 Socket Data Structures
The socket structure, slightly simplified, is defined as follows:
struct socket {
        socket_state            state;
        unsigned long           flags;
        const struct proto_ops  *ops;
        struct file             *file;
        struct sock             *sk;
        short                   type;
};
❑
type specifies the numeric identifier of the protocol type.
❑
state indicates the connection state of the socket by means of the following values (SS stands for socket state):
typedef enum {
        SS_FREE = 0,            /* not allocated                */
        SS_UNCONNECTED,         /* unconnected to any socket    */
        SS_CONNECTING,          /* in process of connecting     */
        SS_CONNECTED,           /* connected to socket          */
        SS_DISCONNECTING        /* in process of disconnecting  */
} socket_state;
The values listed here have nothing in common with the state values used by the protocols of the transport layer when connections are set up and closed. They denote general states relevant to the outside world (i.e., to user programs). ❑
file is a pointer to the file instance of a pseudo-file for communication with the socket (as discussed earlier, user applications use normal file descriptors to perform network operations).
The definition of socket is not tied to a specific protocol. This explains why proto_ops is used as a pointer to a data structure that, in turn, holds pointers to protocol-specific functions to handle the socket:
struct proto_ops {
        int             family;
        struct module   *owner;
        int             (*release)   (struct socket *sock);
        int             (*bind)      (struct socket *sock, struct sockaddr *myaddr,
                                      int sockaddr_len);
        int             (*connect)   (struct socket *sock, struct sockaddr *vaddr,
                                      int sockaddr_len, int flags);
        int             (*socketpair)(struct socket *sock1, struct socket *sock2);
        int             (*accept)    (struct socket *sock, struct socket *newsock,
                                      int flags);
        int             (*getname)   (struct socket *sock, struct sockaddr *addr,
                                      int *sockaddr_len, int peer);
        unsigned int    (*poll)      (struct file *file, struct socket *sock,
                                      struct poll_table_struct *wait);
        int             (*ioctl)     (struct socket *sock, unsigned int cmd,
                                      unsigned long arg);
        int             (*compat_ioctl) (struct socket *sock, unsigned int cmd,
                                      unsigned long arg);
        int             (*listen)    (struct socket *sock, int len);
        int             (*shutdown)  (struct socket *sock, int flags);
        int             (*setsockopt)(struct socket *sock, int level, int optname,
                                      char __user *optval, int optlen);
        int             (*getsockopt)(struct socket *sock, int level, int optname,
                                      char __user *optval, int __user *optlen);
        int             (*compat_setsockopt)(struct socket *sock, int level, int optname,
                                      char __user *optval, int optlen);
        int             (*compat_getsockopt)(struct socket *sock, int level, int optname,
                                      char __user *optval, int __user *optlen);
        int             (*sendmsg)   (struct kiocb *iocb, struct socket *sock,
                                      struct msghdr *m, size_t total_len);
        int             (*recvmsg)   (struct kiocb *iocb, struct socket *sock,
                                      struct msghdr *m, size_t total_len, int flags);
        int             (*mmap)      (struct file *file, struct socket *sock,
                                      struct vm_area_struct * vma);
        ssize_t         (*sendpage)  (struct socket *sock, struct page *page,
                                      int offset, size_t size, int flags);
};
Many function pointers have the same name as the corresponding functions in the C standard library. This is not a coincidence: calls to those library functions are directed to the functions stored in these pointers by means of the socketcall system call. The sock pointer also included in the structure points to a much lengthier structure that holds additional socket management data of significance to the kernel. The structure consists of a horrendous number of elements used for sometimes very subtle or seldom required features (the original definition is almost 100 lines long). Here I make do with a much shorter and simplified version. Note that the kernel itself places the most important elements in the structure sock_common that is embedded into struct sock right at the beginning. The following code excerpt shows both structures: include/net/sock.h
struct sock_common {
        unsigned short          skc_family;
        volatile unsigned char  skc_state;
        struct hlist_node       skc_node;
        unsigned int            skc_hash;
        atomic_t                skc_refcnt;
        struct proto            *skc_prot;
};

struct sock {
        struct sock_common      __sk_common;

        struct sk_buff_head     sk_receive_queue;
        struct sk_buff_head     sk_write_queue;

        struct timer_list       sk_timer;
        void                    (*sk_data_ready)(struct sock *sk, int bytes);
        ...
};
The sock structures of the system are organized in a protocol-specific hash table. skc_node is the hash linkage element, while skc_hash denotes the hash value.
Data are sent and received by placing them on wait queues (sk_receive_queue and sk_write_queue) that contain socket buffers. In addition, a list of callback functions is associated with each sock structure used by the kernel to draw attention to special events or bring about state changes. Our simplified version shows only one function pointer called sk_data_ready because it is the most significant and its name has already been mentioned several times in the last few chapters. The function it contains is invoked when data arrive for handling by the user process. Typically, the value of the pointer is sock_def_readable. There is a great danger of confusion between the ops element of type struct proto_ops in the socket structure and the prot entry of type struct proto in sock. The latter is defined as follows: include/net/sock.h
struct proto {
        void            (*close)(struct sock *sk, long timeout);
        int             (*connect)(struct sock *sk, struct sockaddr *uaddr,
                                   int addr_len);
        int             (*disconnect)(struct sock *sk, int flags);

        struct sock *   (*accept)(struct sock *sk, int flags, int *err);

        int             (*ioctl)(struct sock *sk, int cmd, unsigned long arg);
        int             (*init)(struct sock *sk);
        int             (*destroy)(struct sock *sk);
        void            (*shutdown)(struct sock *sk, int how);
        int             (*setsockopt)(struct sock *sk, int level, int optname,
                                      char __user *optval, int optlen);
        int             (*getsockopt)(struct sock *sk, int level, int optname,
                                      char __user *optval, int __user *option);
        ...
        int             (*sendmsg)(struct kiocb *iocb, struct sock *sk,
                                   struct msghdr *msg, size_t len);
        int             (*recvmsg)(struct kiocb *iocb, struct sock *sk,
                                   struct msghdr *msg, size_t len, int noblock,
                                   int flags, int *addr_len);
        int             (*sendpage)(struct sock *sk, struct page *page,
                                    int offset, size_t size, int flags);
        int             (*bind)(struct sock *sk, struct sockaddr *uaddr,
                                int addr_len);
        ...
};
Both structures have member elements with similar (and often identical) names although they represent different functions. Whereas the operations shown here are used for communication between the (kernel-side) socket layer and the transport layer, the functions held in the function pointer block of the socket
structure are designed to communicate with system calls. In other words, they form the link between user-side and kernel-side sockets.
12.10.2 Sockets and Files
Userspace processes access sockets using normal file operations once a connection has been established. How is this implemented in the kernel? Owing to the open structure of the VFS layer (discussed in Chapter 8), very few actions are needed. Each socket is assigned a VFS inode, which is, in turn, linked with the other structures associated with normal files. The functions for manipulating files are stored in a separate pointer table:
struct inode {
        ...
        struct file_operations  *i_fop; /* former ->i_op->default_file_ops */
        ...
}
As a result, file access to the file descriptor of a socket can be redirected transparently to the code of the network layer. Sockets use the following file operations:
net/socket.c
struct file_operations socket_file_ops = {
        .owner =        THIS_MODULE,
        .llseek =       no_llseek,
        .aio_read =     sock_aio_read,
        .aio_write =    sock_aio_write,
        .poll =         sock_poll,
        .unlocked_ioctl = sock_ioctl,
        .compat_ioctl = compat_sock_ioctl,
        .mmap =         sock_mmap,
        .open =         sock_no_open,   /* special open code to disallow open via /proc */
        .release =      sock_close,
        .fasync =       sock_fasync,
        .sendpage =     sock_sendpage,
        .splice_write = generic_splice_sendpage,
};
The sock_ functions are simple wrapper routines that invoke the corresponding proto_ops routine of the socket, as shown in the following example of sock_mmap: net/socket.c
static int sock_mmap(struct file * file, struct vm_area_struct * vma)
{
        struct socket *sock = file->private_data;

        return sock->ops->mmap(file, sock, vma);
}
Inode and socket are linked by allocating one directly after the other in memory by means of the following auxiliary structure: include/net/sock.h
struct socket_alloc {
        struct socket socket;
        struct inode  vfs_inode;
};
The kernel provides two macros that perform the necessary pointer arithmetic to move from an inode to the associated socket instance (SOCKET_I) and vice versa (SOCK_INODE). To simplify the situation, whenever a socket is attached to a file, sock_attach_fd sets the private_data element of struct file so that it points to the socket instance. The sock_mmap example shown above makes use of this.
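Both helpers are essentially container_of conversions between the two members of socket_alloc; the following reflects their definition in include/net/sock.h:

static inline struct socket *SOCKET_I(struct inode *inode)
{
        return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}

static inline struct inode *SOCK_INODE(struct socket *socket)
{
        return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
}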
12.10.3 The socketcall System Call
In addition to the read and write operations of the file functions that enter the kernel by means of the system calls of the virtual filesystem where they are redirected to function pointers of the socket_file_ops structure, it is also necessary to carry out other tasks with sockets that cannot be forced into the file scheme. These include, for example, creating a socket and bind and listen calls. For this purpose, Linux provides the socketcall system call, which is implemented in sys_socketcall and to which I have made frequent reference. It is remarkable that there is just one system call for all 17 socket operations. This results in very different lists of arguments depending on the task in hand. The first parameter of the system call is therefore a numeric constant to select the desired call. Possible values are, for example, SYS_SOCKET, SYS_BIND, SYS_ACCEPT, and SYS_RECV. The routines of the standard library use the same names but are all redirected internally to socketcall with the corresponding constant. The fact that there is only a single system call is primarily for historical reasons. The task of sys_socketcall is not especially difficult — it simply acts as a dispatcher to forward the system call to other functions, each of which implements a ‘‘small’’ system call to which the parameters are passed: net/socket.c
asmlinkage long sys_socketcall(int call, unsigned long __user *args)
{
        unsigned long a[6];
        unsigned long a0,a1;
        int err;

        if(call<1||call>SYS_RECVMSG)
                return -EINVAL;

        /* copy_from_user should be SMP safe. */
        if (copy_from_user(a, args, nargs[call]))
                return -EFAULT;
        ...
        a0=a[0];
        a1=a[1];

        switch(call)
        {
                case SYS_SOCKET:
                        err = sys_socket(a0,a1,a[2]);
                        break;
                case SYS_BIND:
                        err = sys_bind(a0,(struct sockaddr __user *)a1, a[2]);
                        break;
                ...
                case SYS_SENDMSG:
                        err = sys_sendmsg(a0, (struct msghdr __user *) a1, a[2]);
                        break;
                case SYS_RECVMSG:
                        err = sys_recvmsg(a0, (struct msghdr __user *) a1, a[2]);
                        break;
                default:
                        err = -EINVAL;
                        break;
        }
        return err;
}
Even though the target functions comply with the same naming conventions as system calls, they can be invoked only via the socketcall call and not by any other system call.
Table 12-3 shows which ‘‘subcalls‘‘ of socketcall are available.
12.10.4 Creating Sockets
sys_socket is the starting point for creating a new socket. The associated code flow diagram is shown in Figure 12-32.
Figure 12-32: Code flow diagram for sys_socket (sys_socket → sock_create → __sock_create → sock_alloc and net_families[family]->create, followed by sock_map_fd).
First, a new socket data structure is created using sock_create, which directly calls __sock_create. The task of reserving the required memory is delegated to sock_alloc, which not only reserves space for an instance of struct socket, but also allocates memory for an inode instance directly below. This enables the two objects to be combined as discussed above.
All transport protocols of the kernel are grouped into the array static struct net_proto_family *net_families[NPROTO] defined in net/socket.c. (sock_register is used to add new entries to the database.) The individual members provide a protocol-specific initialization function.
struct net_proto_family {
        int             family;
        int             (*create)(struct socket *sock, int protocol);
        struct module   *owner;
};
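To illustrate how such an entry is filled in, the following is roughly what the IPv4 (inet) family registers during its initialization in net/ipv4/af_inet.c; sock_register simply stores the structure in the net_families array:

static struct net_proto_family inet_family_ops = {
        .family = PF_INET,
        .create = inet_create,
        .owner  = THIS_MODULE,
};

/* during initialization of the inet subsystem */
(void)sock_register(&inet_family_ops);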
Table 12-3: Network-Related System Calls for Which sys_socketcall Acts as a Multiplexer

sys_socket: Creates a new socket.
sys_bind: Binds an address to a socket.
sys_connect: Connects a socket with a server.
sys_listen: Opens a passive connection to listen on the socket.
sys_accept: Accepts an incoming connection request.
sys_getsockname: Returns the address of the socket.
sys_getpeername: Returns the address of the communication partner.
sys_socketpair: Creates a socket pair that can be used immediately for bidirectional communication (both sockets are on the same system).
sys_send: Sends data via an existing connection.
sys_sendto: Sends data to an explicitly specified destination address (for UDP connections).
sys_recv: Receives data.
sys_recvfrom: Receives data from a datagram socket and returns the source address at the same time.
sys_shutdown: Closes the connection.
sys_setsockopt: Sets socket options.
sys_getsockopt: Returns information on the socket settings.
sys_sendmsg: Sends messages in BSD style.
sys_recvmsg: Receives messages in BSD style.
It is exactly this function (create) that is invoked after memory has been reserved for the socket. inet_create is used for Internet connections (both TCP and UDP). It creates a new instance of a kernel-internal sock socket, initializes it as far as possible, and inserts it in the kernel data structures. sock_map_fd generates a pseudo-file for the socket (the file operations are specified by socket_file_ops). A
file descriptor is also allocated so that it can be returned as the result of the system call.
12.10.5 Receiving Data
Data are received using the recvfrom and recv system calls and the file-related readv and read functions. Because the code of each of these functions is very similar and merges at an early point, only sys_recvfrom, whose code flow diagram is shown in Figure 12-33, is discussed.
Figure 12-33: Code flow diagram for sys_recvfrom (fget_light → sock_from_file → sock_recvmsg → sock->ops->recvmsg, followed by move_addr_to_user).
A file descriptor to identify the desired socket is passed to the system call. Consequently, the first task is to find the relevant socket. First, fget_light references the descriptor table of the task structure to find the corresponding file instance. sock_from_file determines the associated inode and ultimately the associated socket by using SOCKET_I. After a few preparations (not discussed here) sock_recvmsg invokes the protocol-specific receive routine sock->ops->recvmsg. For example, TCP uses tcp_recvmsg to do this. The UDP equivalent is udp_recvmsg. The implementation for UDP is not particularly complicated: ❑
If there is at least one packet on the receive queue (implemented by the receive_queue element of the sock structure), it is removed and returned.
❑
If the receive queue is empty, it is obvious that no data can be passed to the user process. In this case, the process uses wait_for_packet to put itself to sleep until data arrive. As the data_ready function of the sock structure is always invoked when new data arrive, the process can be woken at this point.
move_addr_to_user copies the address of the data source from kernel space to userspace using the copy_to_user functions described in Chapter 2; the packet data themselves have already been copied into the user buffer by the protocol-specific recvmsg routine.
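For reference, the userspace counterpart of this path is an ordinary blocking receiver. The port number below is an arbitrary example, and error handling is omitted for brevity:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        char buf[1500];
        struct sockaddr_in addr, peer;
        socklen_t peer_len = sizeof(peer);
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        ssize_t len;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(12345);

        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* blocks in udp_recvmsg until a datagram arrives */
        len = recvfrom(fd, buf, sizeof(buf), 0,
                       (struct sockaddr *)&peer, &peer_len);
        printf("received %zd bytes from %s\n", len, inet_ntoa(peer.sin_addr));
        close(fd);
        return 0;
}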
The implementation for TCP follows a similar pattern but is made a little more complicated by the many details and protocol oddities.
12.10.6 Sending Data
Userspace programs also have several alternative ways of sending data. They can use two network-related system calls (sendto and send) or the write and writev functions of the file layer. Because, once again, the code in the kernel merges at a certain point, it is sufficient to examine the implementation of the first of the above calls (in the sys_sendto procedure in the kernel sources). The associated code flow diagram is shown in Figure 12-34.34
Figure 12-34: Code flow diagram for sys_sendto (fget_light → sock_from_file → move_addr_to_kernel → sock_sendmsg → sock->ops->sendmsg).
fget_light and sock_from_file find the relevant socket by reference to the file descriptor. The destination address is copied from userspace into the kernel using move_addr_to_kernel before sock_sendmsg invokes the protocol-specific send routine sock->ops->sendmsg. This routine generates a packet in the required format and forwards it to the lower layers.
12.11 Networking from within the Kernel
Not only userland applications have the desire and need to communicate with other hosts. The kernel could likewise be required to communicate with other computers — without explicit requests from userland to do so. This is not only useful for oddities like the in-kernel web server that used to be included with a number of releases. Network filesystems like CIFS or NCPFS depend on network communication support from within the kernel. This, however, does not yet fulfill all communication needs of the kernel. One more piece is missing: communication between kernel components and communication between userland and kernel. The netlink mechanism provides the required framework.
12.11.1 Communication Functions
First, let us turn our attention to the in-kernel networking API. The definitions are nearly identical to the userland case:
34 The sources contain some code that has to deal with the case that __sock_sendmsg can use an asynchronous request. I omit this on purpose in the code flow diagram. If the request is not directly completed in __sock_sendmsg, then wait_on_sync_kiocb is called immediately after __sock_sendmsg, and the synchronous behavior is restored.
int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
                   struct kvec *vec, size_t num, size_t len);
int kernel_recvmsg(struct socket *sock, struct msghdr *msg,
                   struct kvec *vec, size_t num, size_t len, int flags);
int kernel_bind(struct socket *sock, struct sockaddr *addr, int addrlen);
int kernel_listen(struct socket *sock, int backlog);
int kernel_accept(struct socket *sock, struct socket **newsock, int flags);
int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
                   int flags);
int kernel_getsockname(struct socket *sock, struct sockaddr *addr, int *addrlen);
int kernel_getpeername(struct socket *sock, struct sockaddr *addr, int *addrlen);
int kernel_getsockopt(struct socket *sock, int level, int optname,
                      char *optval, int *optlen);
int kernel_setsockopt(struct socket *sock, int level, int optname,
                      char *optval, int optlen);
int kernel_sendpage(struct socket *sock, struct page *page, int offset,
                    size_t size, int flags);
int kernel_sock_ioctl(struct socket *sock, int cmd, unsigned long arg);
int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how);
With the exception of kernel_sendmsg and kernel_recvmsg, the parameters are more or less identical with the userland API, except that sockets are not specified by socket file descriptors, but directly by a pointer to an instance of struct socket. The implementation is simple since the functions work as simple wrapper routines around the pointers stored in the protocol operations proto_ops of struct socket: net/socket.c
int kernel_connect(struct socket *sock, struct sockaddr *addr, int addrlen,
                   int flags)
{
        return sock->ops->connect(sock, addr, addrlen, flags);
}
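As an example of how a kernel subsystem might use this API, the following sketch opens a TCP socket and connects it to a given IPv4 address. The function name is made up for illustration, and error handling is kept minimal:

static int demo_kernel_connect(__be32 daddr, __be16 dport,
                               struct socket **res)
{
        struct sockaddr_in sin = {
                .sin_family = AF_INET,
                .sin_port   = dport,
                .sin_addr   = { .s_addr = daddr },
        };
        struct socket *sock;
        int err;

        err = sock_create_kern(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
        if (err < 0)
                return err;

        err = kernel_connect(sock, (struct sockaddr *)&sin, sizeof(sin), 0);
        if (err < 0) {
                sock_release(sock);
                return err;
        }

        *res = sock;
        return 0;
}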
A little care is required when specifying the buffer space that receives incoming data or holds the data to be sent. kernel_sendmsg and kernel_recvmsg do not access the data region directly via struct msghdr as in userland, but employ struct kvec. However, the kernel automatically provides a conversion between both representations as kernel_sendmsg shows. net/socket.c
int kernel_sendmsg(struct socket *sock, struct msghdr *msg,
                   struct kvec *vec, size_t num, size_t size)
{
        ...
        int result;
        ...
        msg->msg_iov = (struct iovec *)vec;
        msg->msg_iovlen = num;
        result = sock_sendmsg(sock, msg, size);
        ...
        return result;
}
12.11.2 The Netlink Mechanism
Netlink is a networking-based mechanism that allows for communication within the kernel as well as between kernel and userland. The formal definition can be found in RFC 3549. The idea to use the networking framework to communicate between kernel and userland stems from BSD’s networking sockets. Netlink sockets, however, extend the possible uses much further. The mechanism is not only used for networking purposes. By now, one of the most important users is the generic object model, which uses netlink sockets to pass all kinds of status information about what is going on inside the kernel to userland. This includes registration and removal of new devices, special events that have happened on the hardware side, and much more. While netlink used to be compilable as a module in former kernel versions, it is nowadays automatically integrated if the kernel has support for networking. This emphasizes the importance of the mechanism. There are some alternative methods in the kernel that implement similar functionality — just think of files in procfs or sysfs. However, the netlink mechanism provides some distinct advantages compared to these approaches: ❑
No polling is required on any side. If status information were passed via a file, then the userland side would constantly need to check if any new messages have arrived.
❑
System calls and ioctls that also allow passing information from userland to the kernel are harder to implement than a simple netlink connection. Besides, there is no problem with modules using netlink services, while modules and system calls clearly do not fit together very well.
❑
The kernel can initiate sending information to userland without being requested to do so from there. This is also possible with files, but impossible with system calls or ioctls.
❑
Userspace applications do not need to use anything else than standard sockets to interact with the kernel.
Netlink supports only datagram messages, but provides bidirectional communication. Additionally, not only unicast but also multicast messages are possible. Like any other socket-based mechanism, netlink works asynchronously. Two manual pages document the netlink mechanism: netlink(3) contains information about in-kernel macros that can be used to manipulate, access, and create netlink datagrams. The manual page netlink(7) contains generic information about netlink sockets and documents the data structures used in this context. Also note that /proc/net/netlink contains some information about the currently active netlink connections. On the userspace side, two libraries simplify the creation of applications employing netlink sockets: ❑
libnetlink is bundled with the iproute2 package. The library has specifically been written with routing sockets in mind. Additionally, it does not come as standalone code, but must be extracted from the package if it is to be used separately.
❑
libnl is a standalone library that has not been optimized for a particular use case. Instead, it
provides support for all types of netlink connections, including routing sockets.
Data Structures Specifying Addresses
As for every networking protocol, an address needs to be assigned to a netlink socket. The following variant of struct sockaddr represents netlink addresses:
struct sockaddr_nl {
        sa_family_t     nl_family;      /* AF_NETLINK */
        unsigned short  nl_pad;         /* zero */
        __u32           nl_pid;         /* port ID */
        __u32           nl_groups;      /* multicast groups mask */
};
To distinguish between different netlink channels used by different parts of the kernel, nl_family is employed. Several different families are specified in <netlink.h>; the most important ones are the following: ❑
NETLINK_ROUTE represents the initial purpose of netlink sockets, namely, changing routing
information. ❑
NETLINK_INET_DIAG allows for monitoring IP sockets; see net/ipv4/inet_diag.c for more
details. ❑
NETLINK_XFRM is used to send and receive messages related to IPSec (or, more generally, to any XFRM transformations).
❑
NETLINK_KOBJECT_UEVENT specifies the protocol for kernel to userland messages that originate from the generic object model (the reverse direction, userland to kernel, is not possible for this type of message). The channel provides the basis of the hotplugging mechanism as discussed in Section 7.4.2.
A unique identifier for the socket is provided in nl_pid. While this is always zero for the kernel itself, userspace applications conventionally use their thread group ID. Note that nl_pid explicitly does not represent a process ID, but can be any unique value — the thread group ID is just one particularly convenient choice.35 nl_pid is a unicast address. Each address family can also specify different multicast groups, and nl_groups is a bitmap that denotes to which multicast addresses the socket belongs. If multicast is not supposed to be used, the field is 0. To simplify matters, I consider only unicast transmissions in the following.
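From the userspace side, such an address is simply filled in and passed to bind on an ordinary socket. The following sketch opens a unicast NETLINK_ROUTE socket of the kind used by routing utilities (error handling omitted):

#include <linux/netlink.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int open_route_socket(void)
{
        struct sockaddr_nl nls;
        int fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

        memset(&nls, 0, sizeof(nls));
        nls.nl_family = AF_NETLINK;
        nls.nl_pid    = getpid();   /* conventional choice for the unique port ID */
        nls.nl_groups = 0;          /* unicast only */

        if (bind(fd, (struct sockaddr *)&nls, sizeof(nls)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}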
Netlink Protocol Family
Recall from Section 12.10.4 that each protocol family needs to register an instance of net_proto_family within the kernel. The structure contains a function pointer that is called when a new socket is created for
35 See the manual page netlink(7) on how to proceed if a userspace process wants to hold more than one netlink socket and thus requires more than one unique identifier.
the protocol family. Netlink uses netlink_create for this purpose.36 The function allocates an instance of struct sock that is connected with the socket via socket->sk. However, space is not only reserved for struct sock but for a larger structure that is (simplified) defined as follows: net/netlink/af_netlink.c
struct netlink_sock {
        /* struct sock has to be the first member of netlink_sock */
        struct sock     sk;
        u32             pid;
        u32             dst_pid;
        ...
        void            (*netlink_rcv)(struct sk_buff *skb);
        ...
};
In reality, there are many more netlink-specific elements, and the above code is a selection of the most essential ones. The sock instance is directly embedded into netlink_sock. Given an instance of struct sock for netlink sockets, the associated netlink-specific structure netlink_sock can be obtained using the auxiliary function nlk_sk. The port IDs of both ends of the connection are kept in pid and dst_pid. netlink_rcv points to a function that is called to receive data.
Message Format

Netlink messages need to obey a certain format as depicted in Figure 12-35.

Figure 12-35: Format of a netlink message (a stream of messages, each consisting of a struct nlmsghdr header and a payload; header and payload are aligned on NLMSG_ALIGNTO, with padding inserted as required).

Each message consists of two components: the header and the payload. While the header is required to be represented by struct nlmsghdr, the payload can be arbitrary.37 The required contents of the header are given by the following data structure:
struct nlmsghdr {
    __u32 nlmsg_len;   /* Length of message including header */
    __u16 nlmsg_type;  /* Message content */
    __u16 nlmsg_flags; /* Additional flags */
    __u32 nlmsg_seq;   /* Sequence number */
    __u32 nlmsg_pid;   /* Sending process port ID */
};

36 The protocol family operations netlink_family_ops point to this function. Recall from Section 12.10.4 that the creation function is automatically called when a new socket is created.
37 The kernel offers the standard data structure struct nlattr if netlink is used to transport attributes. This possibility is not discussed in detail, but note that all attribute definitions and a useful set of auxiliary helper functions can be found in include/net/netlink.h.
❑ The length of the total message — including header and any required padding — is stored in nlmsg_len.
❑ The message type is denoted by nlmsg_type. The value is private to the family and not inspected or modified by generic netlink code.
❑ Various flags can be stored in nlmsg_flags. All possible values are defined in <linux/netlink.h>.
❑ nlmsg_seq holds a sequence number that induces a temporal relationship amongst a series of messages.
❑ The unique port ID that identifies the sender is stored in nlmsg_pid.
Note that the constituents of netlink messages are always aligned to NLMSG_ALIGNTO (usually set to 4) byte boundaries as indicated in the figure. Since the size of struct nlmsghdr is currently a multiple of NLMSG_ALIGNTO, the alignment criterion is automatically fulfilled for the header. Padding might, however, be required behind the payload. To ensure that the padding requirements are fulfilled, the kernel introduces several macros in <linux/netlink.h>; NLMSG_ALIGN and NLMSG_SPACE, which appear below, are among them.
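As an illustration of the alignment helpers, the following userspace sketch (illustrative only; the message type and payload are made up) builds a single message whose header and payload obey the NLMSG_ALIGNTO rules:

#include <linux/netlink.h>
#include <stdlib.h>
#include <string.h>

/* Allocate and fill one netlink message carrying a string payload. */
static struct nlmsghdr *build_msg(const char *payload, __u16 type, __u32 pid)
{
    size_t plen = strlen(payload) + 1;
    struct nlmsghdr *nlh = calloc(1, NLMSG_SPACE(plen)); /* header + padded payload */

    if (!nlh)
        return NULL;

    nlh->nlmsg_len   = NLMSG_LENGTH(plen); /* aligned header plus payload */
    nlh->nlmsg_type  = type;
    nlh->nlmsg_flags = NLM_F_REQUEST;
    nlh->nlmsg_seq   = 1;
    nlh->nlmsg_pid   = pid;
    memcpy(NLMSG_DATA(nlh), payload, plen);

    return nlh;
}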
Keeping Track of Netlink Connections

The kernel keeps track of all netlink connections as represented by sock instances using several hash tables. They are implemented around the global array nl_table, which contains pointers to instances of struct netlink_table. The actual definition of this structure need not concern us in detail because the hashing method follows a rather straightforward path:
1. Each array element of nl_table provides a separate hash for each protocol family member. Recall that each family member is identified by one of the constants defined by NETLINK_XXX, where XXX includes ROUTE or KOBJECT_UEVENT, for instance.
2. The hash chain number is determined using nl_pid_hashfn based on the port ID and a (unique) random number associated with the hash chain.38
38 Actually, the situation is more complicated because the kernel rehashes the elements on the hash table when there are too many entries, but this extra complexity is ignored here.
netlink_insert is used to insert new entries into the hash table, while netlink_lookup allows for finding sock instances: net/netlink/af_netlink.c
static int netlink_insert(struct sock *sk, struct net *net, u32 pid);
static __inline__ struct sock *netlink_lookup(struct net *net, int protocol,
                                              u32 pid);
Note that the hashing data structures are not designed to operate on a per-namespace basis since there is only one global structure for the whole system. Nevertheless, the code is networking-namespace-aware: When a sock is looked up, the code ensures that the result lives in the proper namespace. Connections with identical port IDs that originate from different namespaces can exist on the same hash chain simultaneously without problems.
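Conceptually, a namespace-aware lookup therefore walks the hash chain selected by the protocol and port ID and accepts only a socket from the requested namespace. The following is a condensed sketch, not the literal kernel code; the helper names follow the kernel's, but details such as locking and reference counting are omitted:

/* Condensed sketch of a hash lookup (locking and refcounting omitted). */
static struct sock *lookup_sketch(struct net *net, int protocol, u32 pid)
{
    struct nl_pid_hash *hash = &nl_table[protocol].hash;
    struct hlist_head *head = nl_pid_hashfn(hash, pid);
    struct hlist_node *node;
    struct sock *sk;

    sk_for_each(sk, node, head) {
        /* accept only a match that lives in the proper networking namespace */
        if (nlk_sk(sk)->pid == pid && sk->sk_net == net)
            return sk;
    }
    return NULL;
}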
Protocol-Specific Operations

Since userland applications use the standard socket interface to deal with netlink connections, the kernel must provide a set of protocol operations. They are defined as follows: net/netlink/af_netlink.c
static const struct proto_ops netlink_ops = {
    .family     = PF_NETLINK,
    .owner      = THIS_MODULE,
    .release    = netlink_release,
    .bind       = netlink_bind,
    .connect    = netlink_connect,
    .socketpair = sock_no_socketpair,
    .accept     = sock_no_accept,
    .getname    = netlink_getname,
    .poll       = datagram_poll,
    .ioctl      = sock_no_ioctl,
    .listen     = sock_no_listen,
    .shutdown   = sock_no_shutdown,
    .setsockopt = netlink_setsockopt,
    .getsockopt = netlink_getsockopt,
    .sendmsg    = netlink_sendmsg,
    .recvmsg    = netlink_recvmsg,
    .mmap       = sock_no_mmap,
    .sendpage   = sock_no_sendpage,
};
Programming Interface

The generic socket implementation provides most of the basic functionality required for netlink. Netlink sockets can be opened both from the kernel and from userland. In the first case, netlink_kernel_create is employed, while in the second case, the bind method of netlink_ops is triggered via the standard networking paths. For reasons of space, I do not want to discuss the implementation of the userland protocol handlers in detail, but focus on how connections are initialized from the kernel. The function requires various parameters: net/netlink/af_netlink.c
struct sock *netlink_kernel_create(struct net *net, int unit, unsigned int groups,
                                   void (*input)(struct sk_buff *skb),
                                   struct mutex *cb_mutex,
                                   struct module *module);

net denotes the networking namespace, unit specifies the protocol family member, and input is a callback function that is activated when data arrives for the socket.39 If a NULL pointer is passed for input, the socket will only be able to transport data from kernel to userland, but not vice versa. The tasks performed in netlink_kernel_create are summarized by the code flow diagram in Figure 12-36.

Figure 12-36: Code flow diagram for netlink_kernel_create (multicast handling is omitted): netlink_kernel_create relies on sock_create_lite and __netlink_create, stores the input function, and finally calls netlink_insert.
1. All required data structures need to be allocated, especially an instance of struct socket and struct netlink_sock. sock_create_lite handles the first requirement, and allocating netlink_sock is delegated to the auxiliary function __netlink_create.
2. If an input function is specified, it is stored in netlink_sock->netlink_rcv.
3. The new sock instance is inserted into the netlink hash via netlink_insert.
Consider, for instance, how the generic object model creates a netlink socket for the uevent mechanism (refer to Section 7.4.2 on how to use this connection): lib/kobject_uevent.c
static int __init kobject_uevent_init(void)
{
    uevent_sock = netlink_kernel_create(&init_net, NETLINK_KOBJECT_UEVENT, 1,
                                        NULL, NULL, THIS_MODULE);
    ...
    return 0;
}
Since uevent messages do not require any input from userland, it is not necessary to specify an input function. After the socket is created, the kernel can construct sk_buff instances and send them off with either netlink_unicast or netlink_broadcast.

39 There are some more parameters that are not necessary to consider in detail. groups gives the number of multicast groups, but I will not discuss the associated possibilities any further. It is also possible to specify a locking mutex (cb_mutex) that protects a netlink callback, but since I have also omitted to discuss this mechanism, you can likewise ignore this parameter. Usually, a NULL pointer is specified as mutex argument, and the kernel falls back to a default locking solution.
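Constructing and sending such a message from the kernel typically proceeds via the helpers from <net/netlink.h>. The following sketch (the message type, payload, and destination port ID are invented for illustration) sends a small unicast message over a previously created socket:

#include <net/netlink.h>
#include <net/sock.h>

/* Send a short string as a unicast netlink message to the given port ID.
 * nl_sock is assumed to be the result of netlink_kernel_create. */
static int send_example(struct sock *nl_sock, u32 dst_pid)
{
    static const char payload[] = "hello";
    struct sk_buff *skb;
    struct nlmsghdr *nlh;

    skb = nlmsg_new(sizeof(payload), GFP_KERNEL);
    if (!skb)
        return -ENOMEM;

    /* type 0 and sequence number 0 are placeholders for this example */
    nlh = nlmsg_put(skb, 0, 0, 0, sizeof(payload), 0);
    if (!nlh) {
        kfree_skb(skb);
        return -EMSGSIZE;
    }
    memcpy(nlmsg_data(nlh), payload, sizeof(payload));

    return netlink_unicast(nl_sock, skb, dst_pid, 0);
}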
Naturally, things get more involved when bidirectional communication is allowed. Take, for example, the audit subsystem, which can not only send messages to userspace, but also receive some in the inverse direction. First of all, an input function is required when netlink_kernel_create is called: kernel/audit.c
audit_sock = netlink_kernel_create(&init_net, NETLINK_AUDIT, 0,
                                   audit_receive, NULL, THIS_MODULE);

audit_receive is responsible for handling received messages stored in socket buffers. audit_receive is just a wrapper that ensures correct locking and dispatches the real work to audit_receive_skb. Since all
receive functions follow a similar pattern, it is instructive to observe the code of this function: kernel/audit.c
static void audit_receive_skb(struct sk_buff *skb)
{
    int err;
    struct nlmsghdr *nlh;
    u32 rlen;

    while (skb->len >= NLMSG_SPACE(0)) {
        nlh = nlmsg_hdr(skb);
        ...
        rlen = NLMSG_ALIGN(nlh->nlmsg_len);
        ...
        if ((err = audit_receive_msg(skb, nlh))) {
            netlink_ack(skb, nlh, err);
        } else if (nlh->nlmsg_flags & NLM_F_ACK)
            netlink_ack(skb, nlh, 0);
        skb_pull(skb, rlen);
    }
}
Multiple netlink messages can be contained in a single socket buffer, so the kernel needs to iterate over all of them until no more payload is left. This is the purpose of the while loop. The general structure is to process one message, remove the processed data with skb_pull,40 and process the next message. Since NLMSG_SPACE(0) specifies the space required for the netlink header without any payload, the kernel can easily check if more messages wait to be processed by comparing the remaining length of the socket buffer with this quantity. For each message, the header is extracted with nlmsg_hdr, and the total length including padding is computed with NLMSG_ALIGN. audit_receive_msg is then responsible for analyzing the audit-specific contents of the message, which does not concern us any further here. Once the data have been parsed, two alternatives are possible:
1. An error has occurred during parsing. netlink_ack is used to send an acknowledgment response that contains the erroneous message and the error code.
2. If the message requested an acknowledgment by setting the NLM_F_ACK flag, the kernel sends the desired acknowledgment, again by netlink_ack. This time the input message is not contained in the reply because the error argument of netlink_ack is set to 0.
40 To be precise, the function does not remove the data, but just sets the data pointer of the socket buffer accordingly. The effect is, however, identical.
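The userspace counterpart of such a loop looks quite similar and relies on the NLMSG_OK and NLMSG_NEXT macros to iterate over all messages contained in one receive buffer. A sketch (buffer size and handling are illustrative):

#include <linux/netlink.h>
#include <sys/socket.h>
#include <stdio.h>

/* Read one datagram from a netlink socket and walk all messages in it. */
static void receive_all(int fd)
{
    char buf[4096];
    struct nlmsghdr *nlh;
    int len = recv(fd, buf, sizeof(buf), 0);

    if (len <= 0)
        return;

    for (nlh = (struct nlmsghdr *)buf; NLMSG_OK(nlh, len);
         nlh = NLMSG_NEXT(nlh, len)) {
        printf("message type %u, length %u\n",
               nlh->nlmsg_type, nlh->nlmsg_len);
        /* NLMSG_DATA(nlh) points to the payload of this message */
    }
}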
12.12 Summary
Linux is often used to operate network servers, and consequently, its networking implementation is powerful, comprehensive, and complex. This chapter discussed the general layered structure of the networking subsystem that allows for accommodating a large number of different protocols and provides a rich set of services.

After introducing the idea of sockets that establish the link between the networking layer and userland, we discussed socket buffers, the fundamental in-kernel data structure for representing and processing packets obtained and sent via networks. We then discussed how network devices are operated and also explained how NAPI helps to ensure that they reach their full possible speed. You have then seen how an IP packet travels through the network layer and how the transport layer processes TCP and UDP packets. Ultimately, the packets end up in or originate from the application layer, and we have also explored the mechanisms behind this.

The chapter closed with a discussion of how networking can be initiated from within the kernel and how the netlink mechanism allows for installing a high-speed communication link between kernel and userland.
System Calls

In the view of user programs, the kernel is a transparent system layer — it is always present but never really noticed. Processes don’t know whether the kernel is running or not. Neither do they know which virtual memory contents are currently in RAM or which contents have been swapped out or perhaps not even read in.

Nevertheless, processes are engaged in permanent interaction with the kernel to request system resources, access peripherals, communicate with other processes, read in files, and much more. For these purposes, they use standard library routines that, in turn, invoke kernel functions — ultimately, the kernel is responsible for sharing resources and services fairly and, above all, smoothly between requesting processes. Applications therefore see the kernel as a large collection of routines that perform a wide variety of system functions. The standard library is an intermediate layer to standardize and simplify the management of kernel routines across different architectures and systems.

In the view of the kernel, the situation is, of course, a bit more complicated, especially as there are several major differences between user and kernel mode, some of which were discussed in earlier chapters. Of particular note are the different virtual address spaces of the two modes and the different ways of exploiting various processor features. Also of interest is how control is transferred backward and forward between applications and the kernel, and how parameters and return values are passed. This chapter discusses such questions.

As described in previous chapters, system calls are used to invoke kernel routines from within user applications in order to exploit the special capabilities of the kernel. We have already examined the implementation of a number of system calls from a wide range of kernel subsystems. First, let’s take a brief look at system programming to distinguish clearly between library routines of the standard library and the corresponding system calls. We then closely examine the kernel sources in order to describe the mechanism for switching from userspace to kernel space. The infrastructure used to implement system calls is described, and special implementation features are discussed.
13.1 Basics of System Programming
Principally, system programming involves work with the standard library that provides a wide range of essential functions for developing applications. No matter what kind of applications they write, programmers have to know the basics of system programming. A simple program such as the classic hello.c routine, which displays ‘‘Hello, world!‘‘ or a similar text on screen, makes indirect use of system routines to output the necessary characters.

Of course, system programming need not always be done in C. There are other programming languages — such as C++, Pascal, Java, or even the dreadful FORTRAN — which also support the more or less direct use of routines from external libraries and are therefore also able to invoke standard library functions. Nevertheless, it is usual to write system programs in C simply because this fits best into the Unix concept — all Unix kernels are written in C, and Linux is no exception.

The standard library is not only a collection of interfaces to implement the kernel system calls; it also features many other functions that are implemented fully in userspace. This simplifies the work of programmers, who are spared the effort of constantly reinventing the wheel. And the approximately 100 MiB of code in the GNU C library must be good for something.

Because the general programming language trend is toward higher and higher levels of abstraction, the real meaning of system programming is slowly being eroded. Why bother with system details when successful programs can be built effortlessly with a few mouse clicks? A middle course is required. A short Perl script that scans a text file for a certain string will hardly want to bother with the mechanisms that open and read the text file. In this situation, a pragmatic view that somehow the data will be coaxed out of the file is sufficient. On the other hand, databases with gigabytes or terabytes of data will certainly want to know which underlying operating system mechanisms are used to access their files and raw data so that the database code can be tuned to deliver maximum performance.

Supplying a giant matrix in memory with specific values is a classic example of how program performance can be significantly boosted by observing the internal structures of the operating system. The order in which values are supplied is crucial if the matrix data are spread over several memory pages. Unnecessary paging can be avoided and system caches and buffers can be put to best use depending on how the memory management subsystem manages memory.

This chapter discusses techniques that are not (or at least only to a minor extent) abstracted from the functions of the kernel — all the more so as we want to examine the internal structure of the kernel and the architectural principles used, including the interfaces to the outside world.
13.1.1 Tracing System Calls

The following example illustrates how system calls are made using the wrapper routines of the standard library:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
    int handle, bytes;
    void *ptr;
    handle = open("/tmp/test.txt", O_RDONLY);
    ptr = (void*)malloc(150);
    bytes = read(handle, ptr, 150);
    printf("%s", ptr);
    close(handle);
    return 0;
}
The sample program opens /tmp/test.txt, reads the first 150 bytes, and writes them to standard output — a very simple version of the standard Unix head command. How many system calls does the program use? The only ones that are immediately visible are open, read, and close (their implementation is discussed in Chapter 8). However, the printf function is also implemented by system calls in the standard library. It would, of course, be possible to find out which system calls are used by reading the source code of the standard library, but this would be tedious. A simpler option is to use the strace tool, which logs all system calls issued by an application and makes this information available to programmers — this tool is indispensable when debugging programs. Naturally, the kernel must provide special support for logging system calls as discussed in Section 13.3.3 (not surprisingly, support is also provided in the form of a system call (ptrace); our only interest is in its output). The following strace invocation writes a list of all issued system calls to the file log.txt:1

wolfgang@meitner> strace -o log.txt ./shead
The contents of log.txt are more voluminous than you might have expected:

execve("./shead", ["./shead"], [/* 27 vars */]) = 0
uname(sys="Linux", node="jupiter", ...) = 0
brk(0) = 0x8049750
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, ..., -1, 0) = 0x40017000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, st_mode=S_IFREG|0644, st_size=85268, ...) = 0
old_mmap(NULL, 85268, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40018000
close(3) = 0
open("/lib/i686/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\200\302"..., 1024) = 1024
fstat64(3, st_mode=S_IFREG|0755, st_size=5634864, ...) = 0
old_mmap(NULL, 1242920, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4002d000
mprotect(0x40153000, 38696, PROT_NONE) = 0
old_mmap(0x40153000, 24576, PROT_READ|PROT_WRITE, ..., 3, 0x125000) = 0x40153000
old_mmap(0x40159000, 14120, PROT_READ|PROT_WRITE, ..., -1, 0) = 0x40159000
close(3) = 0
munmap(0x40018000, 85268) = 0
getpid() = 10604
open("/tmp/test.txt", O_RDONLY) = 3
brk(0) = 0x8049750

1 strace has other options to specify exactly which data are saved; they are documented in the strace(1) manual page.
brk(0x8049800) = 0x8049800
brk(0x804a000) = 0x804a000
read(3, "A black cat crossing your path s"..., 150) = 109
fstat64(1, st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40018000
ioctl(1, TCGETS, B38400 opost isig icanon echo ...) = 0
write(1, "A black cat crossing your path s"..., 77) = 77
write(1, " -- Groucho Marx\n", 32) = 32
munmap(0x40018000, 4096) = 0
_exit(0) = ?
The trace log shows that the application makes a large number of system calls not explicitly listed in the source code. Consequently, the output of strace is not easy to read. For this reason, all lines with a direct equivalent in the C sources of the example are in italics. All other entries are generated by code added automatically at program compilation time.

The additional system calls are generated by code that is needed as a framework for launching and running the application — for example, the C standard library is dynamically mapped into the process memory area. Other calls — old_mmap and munmap — are responsible for managing the dynamic memory used by the application.

The three system calls used directly — open, read, and close — are translated into calls of the corresponding kernel functions.2 Two further routines of the standard library make internal use of system calls with different names to achieve the desired effect:

❑ malloc is the standard function for reserving memory in the process heap area. As mentioned in Chapter 3, the malloc variant of the GNU library features an additional memory management facility to make effective use of the memory space allocated by the kernel. Internally, malloc executes the brk system call whose implementation is described in Chapter 3. The system call log shows that malloc executes the call three times as a result of its internal algorithms — but each time with different arguments.
❑ printf first processes the passed arguments — in this case, a dynamic string — and displays the results with the write system call.
Using the strace tool has a further advantage — the source code of the application being traced need not be present to learn about its internal structure and how it functions. Our small sample program shows clearly that there are strong dependencies between the application and the kernel, as indicated by the repeated use of system calls. Even scientific programs that spend most of their time number-crunching and rarely invoke kernel functions cannot manage without system calls. On the other hand, interactive applications such as emacs and mozilla make frequent use of system calls. The size of the log file for emacs is approximately 170 KiB for program launch alone (i.e., up to the end of program initialization).
2 The GNU standard library also includes a general routine that allows system calls to be executed by reference to their numbers if no wrapper implementation is available.
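The routine referred to in the footnote is the syscall function of the GNU C library, which issues an arbitrary system call by its number. A minimal usage example:

#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>

int main(void)
{
    /* Invoke getpid by its number instead of using the getpid() wrapper. */
    long pid = syscall(SYS_getpid);

    printf("PID obtained via syscall(): %ld\n", pid);
    return 0;
}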
13.1.2 Supported Standards

System calls are of special significance in all Unix look-alikes. Their scope, speed, and efficient implementation play a major role in system performance. System calls are implemented extremely efficiently in Linux, as demonstrated in Section 13.3. Of equal importance are the versatility and choice of available routines to make the lives of programmers (of applications and of standard library functions) easier and to facilitate program portability between the various Unix derivatives on source text level. In the more than 25-year history of Unix, this has contributed to the emergence of standards and de facto standards governing the uniformity of interfaces between the various systems.

The POSIX standard (whose acronym — Portable Operating System Interface for Unix — reveals its purpose) has emerged as the dominant standard. Linux and the C standard library also make every effort to comply with POSIX, which is why it is worthy of brief discussion here. Since publication of the first documents at the end of the 1980s, the standard has expanded drastically in scope (the current version fills four volumes3) and is now — in the opinion of many programmers — overlong and too complex. The Linux kernel is largely compatible with the POSIX-1003.1 standard. Naturally, new developments in the standard take some time before they filter through into kernel code.

In addition to POSIX, there are other standards that are not based on the work of committees but are rooted in the development of Unix and its look-alikes. In the history of Unix, two major lines of development have produced two independent and autonomous systems — System V (which derives directly from the original sources of AT&T) and BSD (Berkeley Software Distribution, developed at the University of California and now strongly represented in the marketplace under the names of NetBSD, FreeBSD, and OpenBSD and the commercial offshoots, BSDI and MacOS X). Linux features system calls from all three of the above sources — in a separate implementation, of course. The code of competing systems is not used for legal and licensing reasons alone. For example, the three well-known system calls listed below originate from the three different camps:
❑ flock locks a file to prevent parallel access by several processes and to ensure file consistency. This call is prescribed by the POSIX standard.
❑ BSD Unix provides the truncate call to shorten a file by a specified number of bytes; Linux also implements this function under the same name.
❑ sysfs gathers information on the filesystems known to the kernel and was introduced in System V Release 4. Linux has also adopted this system call. However, the Linux developers might not entirely agree with the System V designers about the true value of the call — at least, the source code comment says Whee.. Weird sysv syscall. Nowadays, the information is obtained much more easily by reading /proc/filesystems.

Some system calls are required by all three standards. For example, time, gettimeofday and settimeofday exist in identical form in System V, POSIX, and 4.3BSD — and consequently in the Linux kernel.

3 The standard is available in electronic form at www.opengroup.org/onlinepubs/007904975/.
Similarly, some system calls were developed specifically for Linux and either don’t exist at all in other look-alikes or have a different name. One example is the vm86 system call, which is of fundamental importance in implementing the DOS emulator on IA-32 processors. More general calls, such as nanosleep to suspend process execution for very short periods of time, are also part of the Linux-specific repertoire.

In some cases, two system calls are implemented to resolve the same problem in different ways. Prime examples are the poll and select system calls; the first was introduced in System V, the latter in 4.3BSD. Ultimately, both perform the same function.

In conclusion, it’s worth noting that — in spite of the name — simply implementing the POSIX standard does not produce a full Unix system. POSIX is nothing more than a collection of interfaces whose concrete implementations are not mandated and need not necessarily be included in the kernel. Some operating systems therefore fully implement the POSIX standard in a normal library to facilitate Unix application porting despite their non-Unix design.4

4 More recent Windows versions include a library of this kind.

13.1.3 Restarting System Calls

An interesting problem arises when system calls clash with signals. How are priorities assigned when it is imperative to send a signal to a process while a system call is being executed? Should the signal wait until the system call terminates, or should the call be interrupted so that the signal can be delivered as quickly as possible? The first option obviously causes fewer problems and is the simpler solution. Unfortunately, it only functions properly if all system calls terminate quickly and don’t make the process wait for too long (as mentioned in Chapter 5, signals are always delivered when the process returns to user mode after a system call). This is not always the case. System calls not only need a certain period of time to execute, but, in the worst case, they also go to sleep (when, e.g., no data are available to read). This seriously delays delivery of any signals that may have occurred in the meantime. Consequently, such situations must be prevented at all costs.

If an executing system call is interrupted, which value does the kernel return to the application? In normal circumstances, there are only two situations: Either the call is successful or it fails — in which case an error code is returned so that the user process can determine the cause of the error and react appropriately. In the event of an interruption, a third situation arises: The application must be informed that the system call would have terminated successfully, had it not been interrupted by a signal during execution. In such situations, the -EINTR constant is used under Linux (and under other System V derivatives).

The downside of this procedure is immediately apparent. Although it is simple to implement, it forces programmers of userspace applications to explicitly check the return value of all interruptible system calls for -EINTR and, where this value is true, to restart the call repeatedly until it is no longer interrupted by a signal. System calls restarted in this way are called restartable system calls, and the technique itself is known as restarting.

This behavior was introduced for the first time in System V Unix. However, it is not the only way of combining the rapid delivery of new signals and the interruption of system calls, as the approach adopted in the BSD world confirms. Let us examine what happens in the BSD kernel when a system call is interrupted by a signal. The BSD kernel interrupts execution of the system call and switches to signal execution in user mode. When this happens, the call does not issue a return value but is restarted automatically by
the kernel once the signal handler has terminated. Because this behavior is transparent to the user application and also dispenses with repeated implementation of checks for the -EINTR return value and call restarting, this alternative is much more popular with programmers than the System V approach. Linux supports the BSD variant by means of the SA_RESTART flag, which can be specified on a per-signal basis when handler routines are installed (see Chapter 5). The mechanism proposed by System V is used by default because the BSD mechanism also occasionally gives rise to difficulties, as the following example taken from [ME02], page 229, shows.

#include <signal.h>
#include <stdio.h>
#include <unistd.h>
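/*
 * The remainder of the listing is reconstructed here because it is
 * truncated in this copy; the identifier names are illustrative, but
 * the structure matches the description that follows.
 */
static volatile sig_atomic_t signaled = 0;

static void handler(int sig)
{
    printf("Received SIGINT\n");
    signaled = 1;
}

int main(void)
{
    struct sigaction sigact;
    char ch;

    sigact.sa_handler = handler;
    sigemptyset(&sigact.sa_mask);
    sigact.sa_flags = SA_RESTART;
    sigaction(SIGINT, &sigact, NULL);

    while (read(STDIN_FILENO, &ch, 1) != 1 && !signaled)
        ;

    return 0;
}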
This short C program waits in a while loop until the user enters a character via standard input or until the program is interrupted by the SIGINT signal (which can be sent using kill -INT or by pressing CTRL-C). Let us examine the code flow. If the user hits a normal key that does not cause SIGINT to be sent, read yields a positive return code, namely, the number of characters read. The argument of the while loop must return a logically false value to terminate execution. This happens if one of the two logical queries linked by && (and) is false — which is the case when

❑ A single key was pressed; read then returns 1 and the test to check that the return value is not equal to 1 returns a logically false value.
❑ The signaled variable is set to 1 because the negation of the variable (!signaled) also returns a logically false value.

These conditions simply mean that the program waits either for keyboard input or the arrival of the SIGINT signal in order to terminate. To apply System V behavior for the code as implemented by default under Linux, it is necessary to suppress setting of the SA_RESTART flag; in other words, the line sigact.sa_flags = SA_RESTART must be deleted or commented out. Once this has been done, the program runs as described and can be terminated either by pressing a key or sending SIGINT.
The situation is more interesting if read is interrupted by the SIGINT signal and BSD behavior is activated by means of SA_RESTART, as in the sample program. In this case, the signal handler is invoked, signaled is set to 1, and a message is output to indicate that SIGINT was received — but the program is not terminated. Why? After running the handler, the BSD mechanism restarts the read call and again waits for entry of a character. The !signaled condition of the while loop does not apply and is not evaluated. The program can therefore no longer be terminated by sending the SIGINT signal, although the code suggests that this is so.
13.2 Available System Calls
Before going into the technical details of system call implementation by the kernel (and by the userspace library), it is useful to take a brief look at the actual functions made available by the kernel in the form of system calls. Each system call is identified by means of a symbolic constant whose platform-dependent definition is specified in <asm/unistd.h>. The calls can be grouped by purpose, as the following overview shows.
Process Management

Processes are at the center of the system, so it’s not surprising that a large number of system calls are devoted to process management. The functions provided by the calls range from querying simple information to starting new processes:

❑ fork and vfork split an existing process into two new processes as described in Chapter 2. clone is an enhanced version of fork that supports, among other things, the generation of threads.
❑ exit ends a process and frees its resources.
❑ A whole host of system calls exist to query (and set) process properties such as PID, UID, and so on; most of these calls simply read or modify a field in the task structure. The following can be read: PID, GID, PPID, SID, UID, EUID, PGID, EGID, and PGRP. The following can be set: UID, GID, REUID, REGID, SID, SUID, and FSGID. System calls are named in accordance with a logical scheme that uses designations such as setgid, setuid, and geteuid.
❑ personality defines the execution environment of an application and is used, for instance, in the implementation of binary emulations.
❑ ptrace enables system call tracing and is the platform on which the above strace tool builds.
❑ nice sets the priority of normal processes by assigning a number between −20 and 19 in descending order of importance. Only root processes (or processes with the CAP_SYS_NICE permission) are allowed to specify negative values.
❑ setrlimit is used to set certain resource limits, for example, CPU time or the maximum permitted number of child processes. getrlimit queries the current limits (i.e., maximum permitted values), and getrusage queries current resource usage to check whether the process is still within the defined resource limits.

Time Operations

Time operations are critical, not only to query and set the current system time, but also to give processes the opportunity to perform time-based operations, as described in Chapter 15:

❑ adjtimex reads and sets time-based kernel variables to control kernel time behavior.
❑ alarm and setitimer set up alarms and interval timers to defer actions to a later time. getitimer reads settings.
❑ gettimeofday and settimeofday get and set the current system time, respectively. Unlike times, they also take account of the current time zone and daylight saving time.
❑ sleep and nanosleep suspend process execution for a defined interval; nanosleep defines high-precision intervals.
❑ time returns the number of seconds since midnight on January 1, 1970 (this date is the classic time base for Unix systems). stime sets this value and therefore changes the current system date.
Signal Handling

Signals are the simplest (and oldest) way of exchanging limited information between processes and of facilitating interprocess communication. Linux supports not only classic signals common to all Unix look-alikes but also real-time signals in line with the POSIX standard. Chapter 5 deals with the implementation of the signal mechanism.

❑ signal installs signal handler functions. sigaction is a modern, enhanced version that supports additional options and provides greater flexibility.
❑ sigpending checks whether signals are pending for the process but are currently blocked.
❑ sigsuspend places the process on the wait queue until a specific signal (from a set of signals) arrives.
❑ setmask enables signal blocking, while getmask returns a list of all currently blocked signals.
❑ kill is used to send any signals to a process.
❑ The same system calls are available to handle real-time signals. However, their function names are prefixed with rt_. For example, rt_sigaction installs a real-time signal handler, and rt_sigsuspend puts the process in a wait state until a specific signal (from a set of signals) arrives. In contrast to classic signals, 64 different real-time signals can be handled on all architectures — even on 32-bit CPUs. Additional information can be associated with real-time signals, and this makes the work of (application) programmers a little easier.

Scheduling

Scheduling-related system calls could be grouped into the process management category because all such calls logically relate to system tasks. However, they merit a category of their own due simply to the sheer number of manipulation options provided by Linux to parameterize process behavior.
❑ setpriority and getpriority set and get the priority of a process and are therefore key system calls for scheduling purposes.
❑ Linux is noted not only for supporting different process priorities, but also for providing a wide variety of scheduling classes to suit the specific time behavior and time requirements of applications. sched_setscheduler and sched_getscheduler set and query scheduling classes. sched_setparam and sched_getparam set and query additional scheduling parameters of processes (currently, only the parameter for real-time priority is used).
❑ sched_yield voluntarily relinquishes control even when CPU time is still available to the process.

Modules

System calls are also used to add and remove modules to and from the kernel, as described in Chapter 7.

❑ init_module adds a new module.
❑ delete_module removes a module from the kernel.

Filesystem

All system calls relating to the filesystem apply to the routines of the VFS layer discussed in Chapter 8. From there, the individual calls are forwarded to the filesystem implementations that usually access the block layer. System calls of this kind are very costly in terms of resources and execution time.

❑ Some system calls are used as a direct basis for userspace utilities of the same name that create and modify the directory structure: chdir, mkdir, rmdir, rename, symlink, getcwd, chroot, umask, and mknod.
❑ File and directory attributes can be modified using chown and chmod.
❑ The following utilities for processing file contents are implemented in the standard library and have the same names as the system calls: open, close, read and readv, write and writev, truncate and llseek.
❑ readdir and getdents read directory structures.
❑ link, symlink, and unlink create and delete links (or files if they are the last element in a hard link); readlink reads the contents of a link.
❑ mount and umount are used to attach and detach filesystems.
❑ poll and select are used to wait for some event.
❑ execve loads a new process in place of an old process. It starts new programs when used in conjunction with fork.

Memory Management

Under normal circumstances, user applications rarely or never come into contact with memory management system calls because this area is completely shielded from the standard library — by the malloc, calloc, and realloc functions in the case of C. Implementation is usually programming language-specific because each language has different dynamic memory management needs and often provides features like garbage collection that require sophisticated allocation of the memory available to the kernel.
❑ In terms of dynamic memory management, the most important call is brk, which modifies the size of the process data segment. Programs that invoke malloc or similar functions (almost all nontrivial code) make frequent use of this system call.
❑ mmap, mmap2, munmap, and mremap perform mapping, unmapping, and remapping operations, while mprotect and madvise control access to and give advice about specific regions of virtual memory. mmap and mmap2 differ slightly by their parameters; refer to the manual pages for more details. The GNU C library uses mmap2 by default; mmap is just a userland wrapper function by now. Depending on the malloc implementation, it can also be that mmap or mmap2 is used internally. This works because anonymous mappings allow installing mappings that are not backed by a file. This approach allows for achieving more flexibility than by using brk (see the sketch after this list).
❑ swapon and swapoff enable and disable (additional) swap space on external storage devices.
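The following userspace sketch illustrates the anonymous-mapping alternative mentioned in the mmap item above; the mapping size is arbitrary:

#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 1 << 20; /* 1 MiB */

    /* An anonymous mapping is not backed by a file; the kernel supplies
     * zero-filled pages that can be used like heap memory. */
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... use mem ... */

    munmap(mem, len);
    return 0;
}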
Interprocess Communication and Network Functions

Because ‘‘IPC and networks‘‘ are complex issues, it would be easy to assume that a rich selection of system calls is available. As Chapters 12 and 5 show, however, the opposite is true. Only two system calls are provided to handle all possible tasks. However, a very large number of parameters is involved. The C standard library spreads them over many different functions with just a few parameters so that they are easier for programmers to handle. Ultimately, the functions are always based on the two system calls:

❑ socketcall deals with network questions and is used to implement socket abstraction. It manages connections and protocols of all kinds and implements a total of 17 different functions differentiated by means of constants such as SYS_ACCEPT, SYS_SENDTO, and so on. The arguments themselves must be passed as a pointer that, depending on function type, points to a userspace structure holding the required data.
❑ ipc is the counterpart to socketcall and is used for process connections local to the computer and not for connections established via networks. Because this system call need implement ‘‘only‘‘ 11 different functions, it uses a fixed number of arguments — five in all — to transfer data from userspace to kernel space.

System Information and Settings

It is often necessary to query information on the running kernel and its configuration and on the system configuration. Similarly, kernel parameters need to be set and information must be saved to system log files. The kernel provides three further system calls to perform such tasks:

❑ syslog writes messages to the system logs and permits the assignment of different priorities (depending on message priority, userspace tools send the messages either to a permanent log file or directly to the console to inform users of critical situations).
❑ sysinfo returns information on the state of the system, particularly statistics on memory usage (RAM, buffer, swap space).
❑ sysctl is used to ‘‘fine-tune‘‘ kernel parameters. The kernel now supports an immense number of dynamically configurable options that can be read and modified using the proc filesystem, as described in Chapter 10.

System Security and Capabilities

The traditional Unix security model — based on users, groups, and an ‘‘omnipotent‘‘ root user — is not flexible enough for modern needs. This has led to the introduction of the capabilities system, which enables non-root processes to be furnished with additional privileges and capabilities according to a fine-grained scheme.
In addition, the Linux security modules subsystem (LSM) provides a general interface to support modules whose functions are invoked at various hooks in the kernel to perform security checks:

❑ capset and capget are responsible for setting and querying process capabilities.
❑ security is a system call multiplexer for implementing LSM.

13.3 Implementation of System Calls
In the implementation of system calls, not only the kernel source code that provides the required functions is relevant but also the way in which the functions are invoked. Functions are not called in the same way as normal C functions because the boundary between user and kernel mode is crossed. This raises various problems that are handled by platform-specific assembly language code. This code establishes a processor-independent state as quickly as possible to enable system calls to be implemented independently of the underlying architecture. How parameters are passed between userspace and kernel space must also be considered.
13.3.1 Structure of System Calls

Kernel code for implementing system calls is divided into two very different parts. The actual task to be performed by the system call is implemented as a C routine that is virtually no different from the remaining kernel code. The mechanism for calling the routine is packed with platform-specific features and must take numerous details into consideration — so that ultimately implementation in assembly language code is a must.
Implementation of Handler Functions

Let us first take a close look at what’s behind C implementation of the actual handler functions. These functions are spread across the kernel because they are embedded in code sections to which they are most closely related in terms of their purpose. For example, all file-related system calls reside in the fs/ kernel subdirectory because they interact directly with the virtual filesystem. Likewise, all memory management calls reside in the files of the mm/ subdirectory.

The handler functions for implementing system calls share several formal features:

❑ The name of each function is prefixed with sys_ to uniquely identify the function as a system call — or to be more accurate, as a handler function for a system call. Generally, it is not necessary to distinguish between handler function and system call. In the sections below, I do so only where necessary.
❑ All handler functions accept a maximum of five parameters; these are specified in a parameter list as in normal C functions (how parameters are supplied with values differs slightly from the classic approach, as you will see shortly).
❑ All system calls are executed in kernel mode. Consequently, the restrictions discussed in Chapter 2 apply, primarily that direct access to user mode memory is not permitted. Recall that copy_from_user, copy_to_user, or other functions from this family must ensure that the desired memory region is available to the kernel before doing the actual read/write operation.
Once the kernel has transferred control to the handler routine, it returns to completely neutral code that is not dependent on a particular CPU or architecture. However, there are exceptions — for various reasons,
a small number of handler functions are implemented separately for each platform. When results are returned, the handler function need take no special action; a simple return followed by a return value is sufficient. Switching between kernel and user mode is performed by platform-specific kernel code with which the handler does not come into contact. Figure 13-1 illustrates the chronological sequence.

Figure 13-1: Chronological sequence of a system call (the application calls a libc wrapper in userspace, control passes to the kernel-side handler, and the result returns through libc to the application).

The above approach greatly simplifies the work of programmers because handler functions are implemented in practically the same way as normal kernel code. Some system calls are so simple that they can be implemented by a single line of C code. For example, the getuid system call to return the UID of the current process is implemented as follows: kernel/timer.c
asmlinkage long sys_getuid(void)
{
    /* Only we change this so SMP safe */
    return current->uid;
}

current is a pointer to the current instance of task_struct and is set automatically by the kernel. The above code returns the uid element (current user ID) of task_struct. It couldn’t be simpler!
Of course, there are much more complicated system calls, some of which were discussed in preceding chapters. Implementation of the handler function itself is always short and compact. It is usual to transfer control to a more general kernel auxiliary function as soon as possible, as, for example, in the case of read. fs/read_write.c
asmlinkage ssize_t sys_read(unsigned int fd, char __user * buf, size_t count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_read(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }
    return ret;
}
Here, the bulk of the work is done by vfs_read, as described in Chapter 8.
A third ‘‘type‘‘ of system call acts as a multiplexer. Multiplexers use constants to delegate system calls to functions that perform very different tasks. A prime example is socketcall (discussed in Chapter 12), which groups together all network-related calls. net/socket.c
asmlinkage long sys_socketcall(int call, unsigned long __user *args)
{
    unsigned long a[6];
    unsigned long a0, a1;
    int err;
    ...
    switch (call) {
    case SYS_SOCKET:
        err = sys_socket(a0, a1, a[2]);
        break;
    case SYS_BIND:
        err = sys_bind(a0, (struct sockaddr __user *)a1, a[2]);
        break;
    case SYS_CONNECT:
        err = sys_connect(a0, (struct sockaddr __user *)a1, a[2]);
        break;
    case SYS_LISTEN:
        err = sys_listen(a0, a1);
        break;
    ...
    case SYS_RECVMSG:
        err = sys_recvmsg(a0, (struct msghdr __user *)a1, a[2]);
        break;
    default:
        err = -EINVAL;
        break;
    }
    return err;
}
Formally, only one void pointer is passed because the number of system call arguments varies according to the multiplexing constant. The first task is therefore to determine the required number of arguments and to fill the individual elements of the a[] array (this involves manipulating pointers and arrays and is not discussed here). The call parameter is then referenced to decide which kernel function will be responsible for further processing.

Regardless of their complexity, all handler functions have one thing in common. Each function declaration includes the additional asmlinkage qualifier, which is not a standard element of C syntax. asmlinkage is a macro defined in <linux/linkage.h>; on IA-32, it instructs the compiler to expect all arguments of the function on the stack rather than in registers.
Dispatching and Parameter Passing

System calls are uniquely identified by a number assigned by the kernel. This is done for practical reasons that become clear when system calls are activated. All calls are handled by a single central piece of code that uses the number to dispatch a specific function by reference to a static table. The parameters passed are also handled by the central code so that parameter passing is implemented independently of the actual system call. Switching from user to kernel mode — and therefore to dispatching and parameter passing — is implemented in assembly language code to cater for many platform-specific features. Owing to the very large number of architectures supported, every detail cannot be covered, and our description is therefore restricted to the widespread IA-32 architectures. The implementation approach is much the same on other processors, even though assembler details may differ.

To permit switching between user and kernel mode, the user process must first draw attention to itself by means of a special machine instruction; this requires the assistance of the C standard library. The kernel must also provide a routine that satisfies the switch request and looks after the technical details. This routine cannot be implemented in userspace because commands are needed that normal applications are not permitted to execute.
Parameter Passing

Different platforms use different assembler methods to execute system calls.5 System call parameters are passed directly in registers on all platforms — which handler function parameter is held in which register is precisely defined. A further register is needed to define the system call number used during subsequent dispatching to find the matching handler function. The following overview shows the methods used by a few popular architectures to make system calls:

❑ On IA-32 systems, the assembly language instruction int $0x80 raises software interrupt 128. This is a call gate to which a specific function is assigned to continue system call processing. The system call number is passed in register eax, while parameters are passed in registers ebx, ecx, edx, esi, and edi.6 On more modern processors of the IA-32 series (Pentium II and higher), two assembly language instructions (sysenter and sysexit) are used to enter and exit kernel mode quickly. The way in which parameters are passed and returned is the same, but switching between privilege levels is faster. To enable sysenter calls to be made faster without losing downward compatibility with older processors, the kernel maps a memory page into the top end of address space (at 0xffffe000). Depending on processor type, the system call code on this page includes either int 0x80 or sysenter. Calling the code stored there (with call 0xffffe000) allows the standard library to automatically select the method that matches the processor used.
❑ Alpha processors provide a privileged system mode (PAL, privileged architecture level) in which various system kernel routines can be stored. The kernel employs this mechanism by including in the PAL code a function that must be activated in order to execute system calls. call_pal PAL_callsys transfers control flow to the desired routine. v0 is used to pass the system call number, and the five possible arguments are held in a0 to a4 (note that register naming is more systematic in recent architectures than in earlier architectures such as IA-32 . . . ).
❑ PowerPC processors feature an elegant assembly language instruction called sc (system call). This is used specifically to implement system calls. Register r3 holds the system call number, while parameters are held in registers r4 to r8 inclusive.
❑ The AMD64 architecture also has its own assembly language instruction with the revealing name of syscall to implement system calls. The system call number is held in the rax register, parameters in rdi, rsi, rdx, r10, r8, and r9.

5 The details are easy to find in the sources of the GNU standard library by referring to the file named sysdeps/unix/sysv/linux/arch/syscall.S. The assembly language code required for the particular platform can be found under the syscall label; this code provides a general interface for invoking system calls for the rest of the library.
6 In addition to the 0x80 call gate, kernel implementation on IA-32 processors features two other ways of entering kernel mode and executing system calls — the lcall7 and lcall27 call gates. These are used to perform binary emulation for BSD and Solaris because these systems make system calls in native mode. They differ only slightly from the standard Linux method and offer little in the way of new insight — which is why I do not bother to discuss them here.

Once the application program has switched to kernel mode with the help of the standard library, the kernel is faced with the task of finding the matching handler function for the system call and supplying it with the passed parameters. A table named sys_call_table, which holds a set of function pointers to handler routines, is available for this purpose on all (!) platforms. Because the table is generated with assembly language instructions in the data segment of the kernel, its contents differ from platform to platform. The principle, however, is always the same: by reference to the system call number, the kernel finds the appropriate position in the table at which a pointer points to the desired handler function.
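The IA-32 convention can be demonstrated with a few lines of inline assembly; the sketch below invokes getpid (system call number 20 on IA-32) directly via int $0x80 and must be compiled as 32-bit code:

#include <stdio.h>

int main(void)
{
    long pid;

    /* System call number in eax, result returned in eax;
     * getpid takes no arguments. Compile with gcc -m32. */
    __asm__ volatile("int $0x80"
                     : "=a" (pid)   /* output: eax */
                     : "0" (20));   /* input: eax = __NR_getpid */

    printf("PID obtained via int $0x80: %ld\n", pid);
    return 0;
}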
System Call Table

Let us take a look at the sys_call_table of a Sparc64 system as defined in arch/sparc64/kernel/systbls.S. (System call tables for other systems can be found in a file often called entry.S in the corresponding directory for the processor type.)

arch/sparc64/kernel/systbls.S
sys_call_table64:
sys_call_table:
/*0*/   .word sys_restart_syscall, sparc_exit, sys_fork, sys_read, sys_write
/*5*/   .word sys_open, sys_close, sys_wait4, sys_creat, sys_link
/*10*/  .word sys_unlink, sys_nis_syscall, sys_chdir, sys_chown, sys_mknod
/*15*/  .word sys_chmod, sys_lchown, sparc_brk, sys_perfctr, sys_lseek
/*20*/  .word sys_getpid, sys_capget, sys_capset, sys_setuid, sys_getuid
/*25*/  .word sys_vmsplice, sys_ptrace, sys_alarm, sys_sigaltstack, sys_nis_syscall
/*30*/  .word sys_utime, sys_nis_syscall, sys_nis_syscall, sys_access, sys_nice
        .word sys_nis_syscall, sys_sync, sys_kill, sys_newstat, sys_sendfile64
/*40*/  .word sys_newlstat, sys_dup, sys_pipe, sys_times, sys_nis_syscall
        .word sys_umount, sys_setgid, sys_getgid, sys_signal, sys_geteuid
/*50*/  .word sys_getegid, sys_acct, sys_memory_ordering, sys_nis_syscall, sys_ioctl
        .word sys_reboot, sys_nis_syscall, sys_symlink, sys_readlink, sys_execve
/*60*/  .word sys_umask, sys_chroot, sys_newfstat, sys_fstat64, sys_getpagesize
...
/*280*/ .word sys_tee, sys_add_key, sys_request_key, sys_keyctl, sys_openat
        .word sys_mkdirat, sys_mknodat, sys_fchownat, sys_futimesat, sys_fstatat64
/*290*/ .word sys_unlinkat, sys_renameat, sys_linkat, sys_symlinkat, sys_readlinkat
        .word sys_fchmodat, sys_faccessat, sys_pselect6, sys_ppoll, sys_unshare
/*300*/ .word sys_set_robust_list, sys_get_robust_list, sys_migrate_pages, sys_mbind, sys_get_mempolicy
        .word sys_set_mempolicy, sys_kexec_load, sys_move_pages, sys_getcpu, sys_epoll_pwait
/*310*/ .word sys_utimensat, sys_signalfd, sys_timerfd, sys_eventfd, sys_fallocate
The table definition is similar on IA-32 processors.

arch/x86/kernel/syscall_table_32.S
ENTRY(sys_call_table)
        .long sys_restart_syscall       /* 0 - old "setup()" system call, used for restarting */
        .long sys_exit
        .long sys_fork
        .long sys_read
        .long sys_write
        .long sys_open                  /* 5 */
        .long sys_close
...
        .long sys_utimensat             /* 320 */
        .long sys_signalfd
        .long sys_timerfd
        .long sys_eventfd
        .long sys_fallocate
The purpose of the .long statements is to align the table entries in memory. The tables defined in this way have the properties of a C array and can therefore be processed using pointer arithmetic. sys_call_table is the base pointer and points to the start of the array, that is, to the zero entry in C terms. If a userspace program invokes the open system call, the number passed is 5. The dispatcher routine adds this number to the sys_call_table base and arrives at the fifth entry that holds the address of sys_open — this is the processor-independent handler function. Once the parameter values still held in registers have been copied onto the stack, the kernel calls the handler routine and switches to the processor-independent part of system call handling.
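To make the dispatch step more tangible, the following user-space sketch models a table of function pointers indexed by the system call number. All names in it (demo_table, dispatch, and so on) are invented for illustration; the real dispatcher is implemented in architecture-specific assembly language code, as described above.

#include <stdio.h>

/* Minimal model of dispatching via a table of function pointers. */
typedef long (*syscall_fn_t)(long, long, long);

static long demo_read(long fd, long buf, long count)  { return 0; }
static long demo_write(long fd, long buf, long count) { return count; }

static syscall_fn_t demo_table[] = { demo_read, demo_write };
#define DEMO_NR (sizeof(demo_table) / sizeof(demo_table[0]))

static long dispatch(unsigned long nr, long a0, long a1, long a2)
{
        if (nr >= DEMO_NR)
                return -38;     /* -ENOSYS: no such system call */
        /* Pointer arithmetic: start at the table base and advance nr entries. */
        return demo_table[nr](a0, a1, a2);
}

int main(void)
{
        printf("result: %ld\n", dispatch(1, 1, 0, 42));
        return 0;
}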
Because the kernel mode and user mode use two different stacks, as described in Chapter 3, system call parameters cannot be passed on the stack as would normally be the case. Switching between the stacks is performed either in architecture-specific assembly language code that is called when kernel mode is entered, or is carried out automatically by the processor when the protection level is switched from user to kernel mode.
Return to User Mode
Each system call must inform the user application whether its routine was executed and with which result. It does this by means of its return code. From the perspective of the application, the result is read like a normal variable using standard C means. However, the kernel, in conjunction with libc, must expend a little more effort to make things just as simple for the user process.
Meaning of Return Values
Generally, the following applies for system call return values. Negative values indicate an error, and positive values (and 0) denote successful termination.
Of course, neither programs nor the kernel itself operates with raw numbers but with symbolic constants defined with the help of the pre-processor in include/asm-generic/errno-base.h and include/asm-generic/errno.h.7 The file named <errno.h> contains several additional error codes, but these are kernel-specific and are never visible to the user application. Error codes up to and including 511 are reserved for general errors; kernel-specific constants use the values above 512. Because (not surprisingly) there are a very large number of potential errors, only a few constants are listed below:
include/asm-generic/errno-base.h
#define EPERM            1      /* Operation not permitted */
#define ENOENT           2      /* No such file or directory */
#define ESRCH            3      /* No such process */
#define EINTR            4      /* Interrupted system call */
#define EIO              5      /* I/O error */
#define ENXIO            6      /* No such device or address */
#define E2BIG            7      /* Argument list too long */
#define ENOEXEC          8      /* Exec format error */
#define EBADF            9      /* Bad file number */
#define ECHILD          10      /* No child processes */
...
#define EMLINK          31      /* Too many links */
#define EPIPE           32      /* Broken pipe */
#define EDOM            33      /* Math argument out of domain of func */
#define ERANGE          34      /* Math result not representable */
The ‘‘classic‘‘ errors that occur when working with Unix system calls are listed in errno-base.h. On the other hand, errno.h contains more unusual error codes whose meanings are not immediately obvious even to seasoned programmers. Examples such as EOPNOTSUPP — which stands for ‘‘Operation not supported on transport endpoint’’ — and ELNRNG — which means ‘‘Link number out of range’’ — are not what might be classified as common knowledge. Some more examples:
include/asm-generic/errno.h
#define EDEADLK         35      /* Resource deadlock would occur */
#define ENAMETOOLONG    36      /* File name too long */
#define ENOLCK          37      /* No record locks available */
#define ENOSYS          38      /* Function not implemented */
...
#define ENOKEY          126     /* Required key not available */
#define EKEYEXPIRED     127     /* Key has expired */
#define EKEYREVOKED     128     /* Key has been revoked */
#define EKEYREJECTED    129     /* Key was rejected by service */

/* for robust mutexes */
#define EOWNERDEAD      130     /* Owner died */
#define ENOTRECOVERABLE 131     /* State not recoverable */
Although I just mentioned that error codes are always returned with a negative number, all codes shown here are positive. It is a kernel convention that the numbers are defined as positive but are returned as a negative value by adding a sign.

7 SPARC, Alpha, PA-RISC, and MIPS architectures define their own versions of these files because they use different numeric error codes from the remaining Linux ports. This is because binary specifications for different platforms do not always use the same magic constants.
For example, if an operation were not permitted, a handler routine would execute return -EPERM to yield the error code −1. Let us examine the open system call with a particular focus on its return values (the sys_open implementation is discussed in Chapter 8). What can go wrong when a file is opened? Not much, you would think. But the kernel finds nine ways of causing problems. For the individual sources of error, see the standard library documentation (and, of course, the kernel sources). The most frequent system call error codes are as follows: ❑
EACCES indicates that a file cannot be processed in the desired access mode — for example, a file cannot be opened for write access if the write bit is not set in its mode string.
❑
EEXIST is returned if an attempt is made to create a file that already exists.
❑
ENOENT means that the desired file does not exist, and the flag to allow files that do not exist to be
created is not specified.
A positive number is returned if the system call terminates successfully. As discussed in Chapter 8, this is a file handle that is used to represent the file in all subsequent operations as well as in the internal data structures of the kernel. Linux uses the long data type to transfer results from kernel space to userspace; this is either 32 or 64 bits wide depending on processor type. One bit is used as the sign bit.8 This causes no problems for most system calls, such as open. The positive values returned are usually so small that they fit into the range provided by long. Unfortunately, the situation is more complicated when returning large numbers that occupy the full unsigned long range. This is the case, for example, with system calls that return memory addresses (such as mmap) when addresses at the top of virtual memory space are allocated. The returned pointer would then be interpreted as a negative number because it overruns the positive range of signed long, and the result would be reported as an error even though the system call terminated successfully.
How can the kernel prevent such mishaps? As noted above, the symbolic constants for error codes that reach userspace extend only up to 511 — in other words, error codes are returned in the range from −1 to −511. Consequently, all lower values are excluded as error codes and are interpreted correctly — as very high return values of successful system calls.
All that now needs to be done to successfully terminate the system call is to switch back from kernel mode to user mode. The result value is returned using a mechanism that functions similarly in the opposite direction. The C function in which the system call handler is implemented uses return to place the return code on the kernel stack. This value is copied into a specific processor register (eax on IA-32 systems, a3 on Alpha systems, etc.), where it is processed by the standard library and transferred to user applications.
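The following user-space sketch makes the convention concrete: it shows how a library wrapper might translate the raw value delivered in the result register into the familiar errno/−1 interface. The function name is invented, and the −511 bound is the one stated above.

#include <errno.h>

/* Invented helper: translate a raw kernel return value into the errno
 * convention described in the text. */
static long demo_translate_result(long raw)
{
        /* Error codes arrive as small negative numbers (-1 .. -511);
         * anything else - including very large values such as memory
         * addresses that merely look negative - is a successful result. */
        if (raw < 0 && raw >= -511) {
                errno = (int)-raw;      /* store the positive error code */
                return -1;              /* what the application sees */
        }
        return raw;
}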
13.3.2 Access to Userspace
Even though the kernel does its best to keep kernel space and userspace separate, there are situations in which kernel code has to access the virtual memory of user applications. Of course, this only makes sense when the kernel is performing a synchronous action initiated by a user application — write and read access by arbitrary processes not only serves no purpose, but may also produce risky results in the code currently executing.

8 Of course, 2’s complement notation is used to prevent errors where there are two zeros with different signs. See http://en.wikipedia.org/wiki/Two%27s_complement for more information about this format.
The processing of system calls is, of course, a classic situation in which the kernel is busy with the synchronous execution of a task assigned to it by an application. There are two reasons why the kernel has to access the address space of user applications: ❑
If a system call requires more than six different arguments, they can be passed only with the help of C structures that reside in process memory space. A pointer to the structures is passed to the system call by means of registers.
❑
Larger amounts of data generated as a side effect of a system call cannot be passed to the user process using the normal return mechanism. Instead, the data must be exchanged in defined memory areas. These must, of course, be located in userspace so that the user application is able to access them.
When the kernel accesses its own memory area, it can always be sure that there is a mapping between the virtual address and a physical memory page. The situation in userspace is different, as described in Chapter 3. Here, pages might be swapped out or not even be allocated. The kernel may therefore not simply de-reference userspace pointers but must employ specific functions to ensure that the desired area resides in RAM. To make sure that the kernel complies with this convention, userspace pointers are labeled with the __user attribute to support automated checking by C check tools.9 Chapter 3 discusses the functions used to copy data between userspace and kernel space. In most cases, these will be copy_to_user and copy_from_user, but more variants are available.
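As an illustration, the following sketch shows how a system call handler typically fetches an argument block from process memory. The system call and its argument structure (sys_demo, demo_args) are invented for this example; copy_from_user and the __user annotation are the interfaces mentioned above.

#include <linux/uaccess.h>
#include <linux/errno.h>
#include <linux/linkage.h>

/* Invented argument structure passed by pointer from userspace. */
struct demo_args {
        unsigned long flags;
        unsigned long len;
};

asmlinkage long sys_demo(struct demo_args __user *uargs)
{
        struct demo_args args;

        /* copy_from_user returns the number of bytes that could not be
         * copied; a non-zero result means the userspace area was invalid. */
        if (copy_from_user(&args, uargs, sizeof(args)))
                return -EFAULT;

        /* ... work with the local copy in args ... */
        return 0;
}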
13.3.3 System Call Tracing
The strace tool developed to trace the system calls of processes using the ptrace system call is described in Section 13.1.1. Implementation of the sys_ptrace handler routine is architecture-specific and is defined in arch/arch/kernel/ptrace.c. Fortunately, there are only minor differences between the code of the individual versions. I therefore provide a generalized description of how the routine works without going into architecture-specific details. Before examining the flow of the system call in detail, it should be noted that this call is needed because ptrace — essentially a tool for reading and modifying values in process address space — cannot be used directly to trace system calls. Only by extracting the desired information at the right places can trace processes draw conclusions on which system calls have been made. Even debuggers such as gdb are totally reliant on ptrace for their implementation. ptrace offers more options than are really needed to simply trace system calls. ptrace requires four arguments, as the definition in the kernel sources shows10:

<syscalls.h>
asmlinkage long sys_ptrace(long request, long pid, long addr, long data);

9 Linus Torvalds himself designed the sparse tool to find direct userspace pointer de-referencings in the kernel.
10 <syscalls.h> contains the prototypes for all architecture-independent system calls whose arguments are identical on all architectures.
❑
pid identifies the target process. The process identifier is interpreted with respect to the namespace of the caller. Even though the way in which strace is handled suggests that process
tracing must be enabled right from the start, this is not true. The tracer program must ‘‘attach‘‘ itself to the target process by means of ptrace — and this can be done while the process is already running (not only when the process starts). strace is responsible for attaching to the process, usually immediately after the target program is started with fork and exec.
❑
addr and data pass a memory address and additional information to the kernel. Their meanings differ according to the operation selected.
❑
With the help of symbolic constants, request selects an operation to be performed by ptrace. A list of all possible values is given on the ptrace(2) manual page and in the corresponding header files; the most important options are described below.
PTRACE_ATTACH issues a request to attach to a process and initiates tracing. PTRACE_DETACH
detaches from the process and terminates tracing. A traced process is always stopped when a signal is pending. The options below enable the traced process to be stopped during a system call or after a single assembly language instruction. When a traced process is stopped, the tracer program is informed by means of a SIGCHLD signal, and it can wait for this event using the wait function discussed in Chapter 2. When tracing is installed, the SIGSTOP signal is sent to the traced process — this causes the tracer process to be interrupted for the first time. This is essential when system calls are traced, as demonstrated below by means of an example. ❑
PEEKTEXT, PEEKDATA, and PEEKUSR read data from the process address space. PEEKUSR reads the normal CPU registers and any other debug registers used11 (of course, only the contents of a single register selected on the basis of its identifier are read — not the contents of the entire register set). PEEKTEXT and PEEKDATA read any words from the text or data segment of the process.
❑
POKETEXT, POKEDATA, and POKEUSR write values to the three specified areas of the monitored process and therefore manipulate the process address space contents; this can be very important when debugging programs interactively. Because PTRACE_POKEUSR manipulates the debug registers of the CPU, this option supports the use of advanced debugging techniques; for example, monitoring of events that halt program execution at a particular point when certain conditions are satisfied. ❑
PTRACE_SETREGS and PTRACE_GETREGS set and read values in the privileged register set of
the CPU. ❑
PTRACE_SETFPREGS and PTRACE_GETFPREGS set and read registers used for floating-point computations. These operations are also very useful when testing and debugging applications interactively.
❑
System call tracing is based on PTRACE_SYSCALL. If ptrace is activated with this option, the kernel starts process execution until a system call is invoked. Once the traced process has been stopped, wait informs the tracer process, which then analyzes the process address
11 Because a process other than the traced process is running when the ptrace system call is invoked, the physical registers of the CPU naturally hold the values of the tracer program and not those of the traced process. This is why the data of the pt_regs instance discussed in Chapter 14 are used; these data are copied into the register set when the process is activated after a task switch. Manipulating the data of this structure is tantamount to manipulating the registers themselves.
space using the above ptrace operations to gather relevant information on the system call. The traced process is stopped for a second time after completion of the system call to allow the tracer process to check whether the call was successful. Because the system call mechanism differs according to platform, trace programs such as strace must implement the reading of data separately for each architecture; this is a tedious task that quickly renders source code for portable programs unreadable (the strace sources are overburdened with pre-processor conditionals and are no pleasure to read). ❑
PTRACE_SINGLESTEP places the processor in single-step mode during execution of the
traced process. In this mode, the tracer process is able to access the traced process after each assembly language instruction. Again, this is a very popular application debugging technique, particularly when attempting to track down compiler errors or other such subtleties. Implementation of the single-step function is strongly dependent on the CPU used — after all, the kernel is operating on a machine-oriented level at this point. Nevertheless, a uniform interface is available to the tracer process on all platforms. After execution of the assembler instruction, a SIGCHLD signal is sent to the tracer, which gathers detailed information on the process state using further ptrace options. This cycle is constantly repeated — the next assembler instruction is executed after invoking ptrace with the PTRACE_SINGLESTEP argument, the process is put to sleep, the tracer is informed accordingly by means of SIGCHLD, and so on. ❑
PTRACE_KILL closes the traced process by sending a KILL signal.
❑
PTRACE_TRACEME starts tracing the current process. The parent of the current process auto-
matically assumes the role of tracer and must be prepared to receive tracing information from its child. ❑
PTRACE_CONT resumes execution of a traced process without specifying special conditions for stopping the process — the traced application next stops when it receives a signal.
System Call Tracing
The following short sample program illustrates the use of ptrace. It attaches itself to a process and checks system call usage; as such, it is a minimal replacement for strace.

/* Simple replacement for strace(1) */
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <asm/ptrace.h>         /* for ORIG_EAX */

static long pid;

int upeek(int pid, long off, long *res) {
    long val;

    val = ptrace(PTRACE_PEEKUSER, pid, off, 0);
    if (val == -1) {
        return -1;
    }
    *res = val;
    return 0;
}

void trace_syscall() {
    long res;

    res = ptrace(PTRACE_SYSCALL, pid, (char*) 1, 0);
    if (res < 0) {
        printf("Failed to execute until next syscall: %d\n", res);
    }
}

void sigchld_handler (int signum) {
    long scno;

    /* Find out the system call (system-dependent)...*/
    if (upeek(pid, 4*ORIG_EAX, &scno) < 0) {
        return;
    }

    /* ... and output the information */
    if (scno != 0) {
        printf("System call: %u\n", scno);
    }

    /* Activate tracing until the next system call */
    trace_syscall();
}

int main(int argc, char** argv) {
    int res;

    /* Check the number of arguments */
    if (argc != 2) {
        printf("Usage: ptrace <pid>\n");
        exit(-1);
    }

    /* Read the PID of the process to be traced from the command line */
    pid = atoi(argv[1]);

    /* Install the handler for the CHLD signal */
    struct sigaction sigact;
    sigact.sa_handler = sigchld_handler;
    sigaction(SIGCHLD, &sigact, NULL);

    /* Attach to the desired process */
    res = ptrace(PTRACE_ATTACH, pid, 0, 0);
    if (res < 0) {
        printf("Failed to attach: %d\n", res);
        exit(-1);
    } else {
        printf("Attached to %u\n", pid);
    }

    for (;;) {
        wait(&res);
        if (res == 0) {
            exit(1);
        }
    }
}
The program structure is roughly as follows: ❑
The PID of the traced program is read from the command line, and the usual checks are carried out.
❑
A handler for the CHLD signal is installed because the kernel sends this signal to the tracer process each time the traced program is interrupted.
❑
The tracer process attaches itself to the target application by means of the ptrace request PTRACE_ATTACH.
❑
The main part of the tracer program consists of a simple endless loop that repeatedly invokes the wait command to wait for the arrival of new CHLD signals.
This structure is not dependent on a particular processor type and can be used for all systems supported by Linux. However, the method by which the number of the system call invoked is determined is very architecture-specific. The method shown works only on IA-32 systems because they keep the number at a specific offset in the saved register set. This offset is held in the ORIG_EAX constant defined in asm/ptrace.h. Its value can be read using PTRACE_PEEKUSER and must be multiplied by the factor of 4 because the registers on this architecture are 4 bytes wide. Of course, the above would be implemented differently on other architectures. For details, see the system call-relevant code in the kernel sources and the sources of the standard strace tool. Our main goal is to illustrate how ptrace calls are used to check monitored processes. Once process tracing has been started by means of PTRACE_ATTACH, the bulk of the work is delegated to the handler function of the CHLD signal implemented in sigchld_handler. This function is responsible for performing the following tasks: ❑
Helping to find the number of the system call invoked using platform-dependent means. The information found is output if the result is a system call number not equal to 0. Testing for 0 is necessary to ensure that only requests for system calls are logged but not the signals sent to the traced process.
❑
Helping to resume program flow. The kernel must, of course, be informed that execution will be stopped at the next system call; this is done using the ptrace request PTRACE_SYSCALL.
Program flow is obvious once the ball is rolling. A system call requested by the traced process triggers the ptrace mechanism in the kernel, which sends a CHLD signal to the tracer process. The handler of the tracer process reads the required information — the number of the system call — and outputs it, again using the ptrace mechanism. Execution of the traced process is resumed and interrupted again when a system call is invoked. But how is the ball set rolling? Somehow or other the handler function must be invoked for the first time in order to log system call tracing. As noted above, the kernel also sends SIGCHLD signals to the tracer process when a signal is sent to the traced process — in doing so, it invokes the same handler function activated when a system call occurs. The fact that the kernel automatically sends a STOP signal to the traced process when tracing is initiated ensures that the handler function is invoked when tracing starts — even if the process receives no other signals. This sets the ball — that is, system call tracing — rolling.
Kernel-Side Implementation
As expected, the handler function for the ptrace system call is called sys_ptrace. The architecture-independent part of the implementation that is used for all except a handful of architectures can be found in kernel/ptrace.c. The architecture-dependent part, that is, the function arch_ptrace, is located in arch/arch/kernel/ptrace.c. Figure 13-2 shows the code flow diagram.
Figure 13-2: Code flow diagram for sys_ptrace. (The flow: sys_ptrace calls ptrace_get_task_struct; if PTRACE_ATTACH is requested, ptrace_attach is invoked; otherwise ptrace_check_attach is followed by arch_ptrace, which performs the request-specific operation.)

The ptrace system call is dominated by its request parameter — this is immediately apparent in the structure of its code. Preliminary work is carried out, primarily to determine the task_struct instance of the passed PID using ptrace_get_task_struct. This basically uses find_task_by_vpid to find the required instance of task_struct, but also prevents tracing of the init process — the ptrace operation is aborted if a value of 1 is passed for pid.
Starting Tracing
Process task structures include several ptrace-specific elements that are needed below.

<sched.h>
struct task_struct {
...
        unsigned int ptrace;
...
        /* ptrace_list/ptrace_children forms the list of my children
         * that were stolen by a ptracer. */
        struct list_head ptrace_children;
        struct list_head ptrace_list;
...
        struct task_struct *real_parent; /* real parent process (when being debugged) */
...
};
If PTRACE_ATTACH is set, ptrace_attach establishes a link between the tracer process and the target process. When this is done, ❑
The ptrace element of the target process is set to PT_TRACED.
❑
The tracer process becomes the parent process of the target process (the real parent process is held in real_parent).
❑
The traced process is added to the ptrace_children list of the tracer using the ptrace_list task structure element.
❑
A STOP signal is sent to the traced process.
If a different action from PTRACE_ATTACH was requested, ptrace_check_attach first checks whether a tracer is attached to the process, and the code splits depending on the particular ptrace operation. This is handled in arch_ptrace; the function is defined by every architecture and cannot be provided by the generic code. However, this is not entirely true: Some requests can, in fact, be handled by architecture-independent code, and they are handled in ptrace_request (from kernel/ptrace.c) called by arch_ptrace. Only very simple requests are processed by this function. For example, PTRACE_DETACH to detach a tracer from a process is one of them. Usually, a large case structure that deals separately with each case (depending on the request parameter) is employed for this purpose. I discuss only some important cases: PTRACE_ATTACH and PTRACE_DETACH, PTRACE_SYSCALL, PTRACE_CONT as well as PTRACE_PEEKDATA and PTRACE_POKEDATA. The implementation of the remaining requests follows a similar pattern. All further tracing actions performed by the kernel are present in the signal handler code discussed in Chapter 5. When a signal is delivered, the kernel checks whether the PT_TRACED flag is set in the ptrace field of task_struct. If it is, the state of the process is set to TASK_STOPPED (in get_signal_to_deliver in kernel/signal.c) in order to interrupt execution. notify_parent with the CHLD signal is then used to inform the tracer process. (The tracer process is woken up if it happens to be sleeping.) The tracer process then performs the desired checks on the target process as specified by the remaining ptrace options.
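A heavily simplified sketch of this check — not the literal code from get_signal_to_deliver, with locking and bookkeeping omitted, and using the names given in the text — could look as follows:

#include <linux/sched.h>
#include <linux/signal.h>

/* Invented helper illustrating the check described above. */
static void demo_stop_if_traced(void)
{
        if (current->ptrace & PT_TRACED) {
                set_current_state(TASK_STOPPED);        /* interrupt execution */
                notify_parent(current, SIGCHLD);        /* inform the tracer */
                schedule();                             /* sleep until resumed */
        }
}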
Implementation of PTRACE_CONT and _SYSCALL
PTRACE_CONT resumes a traced process after it was suspended owing to delivery of a signal. The kernel-side implementation of this function is strongly associated with PTRACE_SYSCALL (which suspends a traced process not only after the arrival of a signal but also before and after system calls are invoked).
Both are discussed in the same section because their code differs only slightly: ❑
When PTRACE_SYSCALL is used, the TIF_SYSCALL_TRACE flag is set in the task structure of the monitored process.
❑
When PTRACE_CONT is used, the flag is removed using clear_tsk_thread_flag.
Both routines manipulate the corresponding bit in the flags field of the thread_info instance of the process. Once the flag has been set or removed, the kernel need only wake the traced process using wake_up_process before resuming its normal work.
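A sketch of how the two requests could be handled — using the thread-flag helpers named here (set_tsk_thread_flag being the natural counterpart of clear_tsk_thread_flag) and omitting all error handling — might look like this; it is an illustration, not the literal kernel code:

#include <linux/sched.h>
#include <linux/ptrace.h>

/* Invented wrapper around the flag manipulation described in the text. */
static void demo_resume(struct task_struct *child, long request)
{
        if (request == PTRACE_SYSCALL)
                set_tsk_thread_flag(child, TIF_SYSCALL_TRACE);
        else    /* PTRACE_CONT */
                clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE);

        wake_up_process(child);         /* let the traced process run again */
}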
What are the effects of the TIF_SYSCALL_TRACE flag? Because invoking system calls is very hardware-related, the effects of the flag extend into the assembly language source code of entry.S. If the flag is set, the C function do_syscall_trace is invoked on system call completion — but only on IA-32, PPC, and PPC64 systems. Other architectures use other mechanisms not described here. Nevertheless, the effects of the flag are the same on all supported platforms. Before and after the execution of a system call by the monitored process, the process state is set to TASK_STOPPED, and the tracer is informed accordingly by means of a CHLD signal. Required information can then be extracted from the contents of registers or specific memory areas.
Stopping Tracing
Tracing is disabled using PTRACE_DETACH, which causes the central ptrace handler to delegate this task to the ptrace_detach function in kernel/ptrace.c. The task itself comprises the following steps:
1. The architecture-specific hook ptrace_disable allows for performing any required low-level operations to stop tracing.
2. The flag TIF_SYSCALL_TRACE is removed from the child’s thread flags.
3. The ptrace element of the task_struct instance is reset to 0, and the target process is removed from the ptrace_children list of the tracer process. The parent process is reset to the original task by overwriting task_struct->parent with the value stored in real_parent.
4. The traced process is woken up with wake_up_process so that it can resume its work.
Reading and Modifying Target Process Data
PTRACE_PEEKDATA reads information from the data segment.12 The ptrace call requires two parameters
for the request: ❑
addr specifies the address to be read in the data segment.
❑
data accepts the associated result.
12 Because memory management does not differentiate between text and data segments — both begin at different addresses but are accessed in the same way — the information provided applies equally for PTRACE_PEEKTEXT.
The read operation is delegated to the access_process_vm function that is implemented in mm/memory.c. (It used to be located in kernel/ptrace.c, but the new location is clearly a better choice.) This function uses get_user_pages to find the page matching the desired address in userspace memory. A temporary memory location in the kernel is used to buffer the required data. After some clean-up work, control is returned to the dispatcher. Because the required data are still in kernel space, put_user must be used to copy the result to the userspace location specified by the data parameter. The traced process is manipulated in a similar way by PTRACE_POKEDATA. (PTRACE_POKETEXT is used in exactly the same way because again there is no difference between the two segments of virtual address space.) access_process_vm finds the memory page with the required address. access_process_vm is directly responsible for replacing existing data with the new values passed in the system call.13
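Put together, the PEEKDATA path can be pictured with the following sketch (the function name is invented; access_process_vm and put_user are the interfaces named above, and error handling is reduced to the bare minimum):

#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/uaccess.h>
#include <linux/errno.h>

/* Invented illustration of the read path described in the text. */
static int demo_peekdata(struct task_struct *child, unsigned long addr,
                         unsigned long __user *data)
{
        unsigned long tmp;
        int copied;

        /* Read one word from the traced process (last argument 0 = no write). */
        copied = access_process_vm(child, addr, &tmp, sizeof(tmp), 0);
        if (copied != sizeof(tmp))
                return -EIO;

        /* The value is still in kernel space; hand it to the tracer. */
        return put_user(tmp, data);
}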
13.4 Summary
One possible way to view the kernel is as a comprehensive library of things it can do for userland applications. System calls are the interface between an application and this library. By invoking a system call, an application can request a service that the kernel then fulfills. This chapter first introduced you to the basics of system programming, which led to how system calls are implemented within the kernel. In contrast to regular functions, invoking system calls requires more effort because a switch between the kernel and user modes of the CPU must be performed. Since the kernel lives in a different portion of the virtual address space from userland, you have also seen that some care is required when the kernel transfers data from or to an application. Finally, you have seen how system call tracing allows for tracking the behavior of programs and serves as an indispensable debugging tool in userspace. System calls are a synchronous mechanism to change from user into kernel mode. The next chapter introduces you to interrupts that require asynchronously changing between the modes.
13 A Boolean parameter can be selected to specify whether data are only read or are to be replaced with a new value en passant (PTRACE_POKETEXT or PTRACE_POKEDATA).
Kernel Activities
Chapter 13 demonstrated that system execution time can be split into two large and different parts: kernel mode and user mode. In this chapter, we investigate the various kernel activities and reach the conclusion that a finer-grained differentiation is required. System calls are not the only way of switching between user and system mode. As is evident from the preceding chapters, all platforms supported by Linux employ the concept of interrupts to introduce periodic interruptions for a variety of reasons. Two types of interrupt are distinguished: ❑
Hardware Interrupts — Are produced automatically by the system and connected peripherals. They support more efficient implementation of device drivers, but are also needed by the processor itself to draw attention to exceptions or errors that require interaction with the kernel code.
❑
SoftIRQs — Are used to effectively implement deferred activities in the kernel itself.
In contrast to other parts of the kernel, the code for handling interrupts and system call-specific segments contains very strong interweaving between assembly language and C code to resolve several subtle problems that C could not reasonably handle on its own. This is not a Linux-specific problem. Regardless of their individual approach, most operating system developers try to hide the low-level handling of such points as deeply as possible in the kernel sources to make them invisible to the remaining code. Because of technical circumstances, this is not always possible, but the interrupt handling layer has evolved over time to a state where high-level code and low-level hardware interaction are separated as well and cleanly as possible. Frequently, the kernel needs mechanisms to defer activities until a certain time in the future or to place them in a queue for later processing when time is available. You have come across a number of uses for such mechanisms in earlier chapters. In this section, we take a closer look at their implementation.
14.1 Interrupts
Until kernel 2.4, the only commonality in the implementation of interrupts on the diverse platforms supported by the Linux kernel used to be that they exist at all — but that’s where the similarity came to an end. Lots of code (and lots of duplicated functionality) was spread across architecture-specific components. The situation was improved considerably during the development of kernel 2.6 because a generic framework for interrupts and IRQs was introduced. Individual platforms are now only responsible to interact with the hardware on the lowest levels. Everything else is provided by generic code. Let’s start our discussion by introducing the most common types of system interrupts as our starting point before focusing on how they function, what they do, and what problems they cause.
14.1.1 Interrupt Types
Generally, interrupt types can be grouped into two categories: ❑
Synchronous Interrupts and Exceptions — Are produced by the CPU itself and are directed at the program currently executing. Exceptions may be triggered for a variety of reasons: because of a programming error that occurred at run time (a classical example is division by zero), or because — as the name suggests — an exceptional situation or an anomalous condition has arisen and the processor needs ‘‘external‘‘ help to deal with it. In the first case, the kernel must inform the application that an exception has arisen. It can use, for example, the signaling mechanism described in Chapter 5. This gives the application an opportunity to correct the error, issue an appropriate error message, or simply terminate. An anomalous condition may not necessarily be caused directly by the process but must be repaired with the help of the kernel. A possible example of this is a page fault that always occurs when a process attempts to access a page of virtual address space that is not held in RAM. As discussed in Chapter 4, the kernel must then interact with the CPU to ensure that the desired data are fetched into RAM. The process can then resume at the point at which the exception occurred. It does not even notice that there has been a page error because the kernel recovered the situation automatically.
❑
Asynchronous interrupts — Are the classical interrupt type generated by peripheral devices and occur at arbitrary times. Unlike synchronous interrupts, asynchronous interrupts are not associated with a particular process. They can happen at any time, regardless of the activities the system is currently performing.1 Network cards report the arrival of new packets by issuing an associated interrupt. Because the data reach the system at an arbitrary moment in time, it is highly likely that some process or other that has nothing to do with the data is currently executing. So as not to disadvantage this process, the kernel must ensure that the interrupt is processed as quickly as possible by ‘‘buffering’’ data so that CPU time can be returned to the process. This is why the kernel needs mechanisms to defer activities; these are also discussed in this chapter.
1 Because, as you will learn shortly, interrupts can be disabled, this statement is not totally correct. The system can at least influence when interrupts do not occur.
What are the common features of the two types of interrupt? If the CPU is not already in kernel mode, it initiates a switch from user to kernel mode. There it executes a special routine called an interrupt service routine (ISR for short) or an interrupt handler. The purpose of this routine is to handle exception conditions or anomalous situations — after all, the specific goal of an interrupt is to draw the attention of the kernel to such changes.
A simple distinction between synchronous and asynchronous interrupts is not sufficient to fully describe the features of these two types of interrupt. A further aspect needs to be considered. Many interrupts can be disabled, but a few cannot. The latter category includes, for example, interrupts issued as a result of hardware faults or other system-critical events. Wherever possible, the kernel tries to avoid disabling interrupts because they are obviously detrimental to system performance. However, there are occasions when it is essential to disable them to prevent the kernel itself from getting into serious trouble. As you will see when we take a closer look at interrupt handlers, major problems may arise in the kernel if a second interrupt occurs while a first interrupt is being handled. If the kernel is interrupted while processing what is already critical code, the synchronization problems discussed in Chapter 5 may arise. In the worst case scenario, this can provoke a kernel deadlock that renders the entire system unusable. If the kernel allows itself too much time to process an ISR when interrupts are disabled, it can (and will) happen that interrupts are lost although they are essential for correct system operation. The kernel resolves this problem by enabling interrupt handlers to be divided into two parts — a performance-critical top half that executes with disabled interrupts, and a less important bottom half used later to perform all less important actions. Earlier kernel versions included a mechanism of the same name for deferring activities to a later time. However, this has been replaced by more efficient mechanisms, discussed below.
Each interrupt has a number. If interrupt number n is assigned to a network card and m ≠ n is assigned to the SCSI controller, the kernel is able to differentiate between the two devices and call the corresponding ISR to perform a device-specific action. Of course, the same principle also applies for exceptions where different numbers designate different exceptions. Unfortunately, owing to specific (and usually historical) design ‘‘features’’ (the IA-32 architecture is a particular case in point), the situation is not always as simple as just described. Because only very few numbers are available for hardware interrupts, they must be shared by several devices. On IA-32 processors, the maximum number is usually 15, not a particularly generous figure — especially considering that some interrupts are already permanently assigned to standard system components (keyboard, timers, etc.), thus restricting still further the number available for other peripheral devices. This procedure is known as interrupt sharing.2 However, both hardware support and kernel support are needed to use this technique because it is necessary to identify the device from which an interrupt originates. This is covered in greater detail in this chapter.
14.1.2 Hardware IRQs
The term interrupt has been used carelessly in the past to denote interrupts issued by the CPU as well as by external hardware. Savvy readers will certainly have noticed that this is not quite correct. Interrupts cannot be raised directly by processor-external peripherals but must be requested with the help of a standard component known as an interrupt controller that is present in every system.

2 Naturally, bus systems with a sophisticated overall design are able to dispense with this option. They provide so many interrupts for hardware devices that there is no need for sharing.
From the peripheral devices (or their slots) electronic lines lead to the component used to send interrupt requests to the interrupt controller. After performing various electro-technical tasks, which are of no further interest to us here, the controller forwards such requests to the interrupt inputs of the CPU. Because peripheral devices cannot directly force interrupts but must request them via the above component, such requests are known more correctly as IRQs, or interrupt requests. Because, in terms of software, the difference between IRQs and interrupts is not all that great, the two terms are often used interchangeably. This is not a problem as long as it is clear what is meant. However, one important point concerning the numbering of IRQs and interrupts should not be overlooked as it has an impact on software. Most CPUs make only a small extract from the whole range of available interrupt numbers available for processing hardware interrupts. This range is usually in the middle of the number sequence; for example, IA-32 CPUs provide a total of 16 numbers from 32 to 47. As any reader who has configured an I/O card on an IA-32 system or has studied the contents of /proc/interrupts knows, IRQ numbering of expansion cards starts at 0 and finishes at 15, provided the
classical interrupt controller 8259A is used. This means that there are also 16 different options but with different numerical values. As well as being responsible for the electrical handling of the IRQ signals, the interrupt controller also performs a ‘‘conversion’’ between IRQ number and interrupt number; with the IA-32 system, this is the equivalent of simply adding 32. If a device issues IRQ 9, the CPU produces interrupt 41; this must be taken into account when installing interrupt handlers. Other architectures use other mappings between interrupt and IRQ numbers, but I will not deal with these in detail.
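Expressed in code, the conversion on IA-32 boils down to adding a fixed offset. The following fragment is only an illustration — the macro and function names are invented; the offset of 32 is the one stated above:

/* Invented names; the offset corresponds to the description in the text. */
#define IRQ0_VECTOR 32

static inline unsigned int irq_to_vector(unsigned int irq)
{
        return IRQ0_VECTOR + irq;       /* e.g., IRQ 9 -> interrupt 41 */
}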
14.1.3 Processing Interrupts
Once the CPU has been informed of an interrupt, it delegates further handling to a software routine that corrects the fault, provides special handling, or informs a user process of an external event. Because each interrupt and each exception has a unique number, the kernel uses an array containing pointers to handler functions. The associated interrupt number is found by referring to the array position, as shown in Figure 14-1.

Figure 14-1: Managing interrupt handlers. (The array entries for numbers n, n+1, ..., n+7 point to handler functions such as handle_page_fault and handle_whatever.)
Entry and Exit Tasks
As Figure 14-2 shows, interrupt handling is divided into three parts. First, a suitable environment in which the handler function can execute must be set up; then the handler itself is called, and finally the system is restored (in the view of the current program) to its exact state prior to the interrupt. The parts that precede and follow invocation of the interrupt handler are known as the entry and exit path. The entry and exit tasks are also responsible for ensuring that the processor switches from user mode to kernel mode. A key task of the entry path is to make the switch from the user mode stack to the kernel
mode stack. However, this alone is not sufficient. Because the kernel also uses CPU resources to execute its code, the entry path must save the current register status of the user application in order to restore it upon termination of interrupt activities. This is the same mechanism used for context switching during scheduling. When kernel mode is entered, only part of the complete register set is saved. The kernel does not use all available registers. Because, for example, no floating point operations are used in kernel code (only integer calculations are made), there is no need to save the floating point registers.3 Their value does not change when kernel code is executed. The platform-specific data structure pt_regs that lists all registers modified in kernel mode is defined to take account of the differences between the various CPUs (Section 14.1.7 takes a closer look at this). Low-level routines coded in assembly language are responsible for filling the structure.
Figure 14-2: Handling an interrupt. In the exit path the kernel checks whether ❑
the scheduler should select a new process to replace the old process.
❑
there are signals that must be delivered to the process.
Only when these two questions have been answered can the kernel devote itself to completing its regular tasks after returning from an interrupt; that is, restoring the register set, switching to the user mode stack, switching to an appropriate processor mode for user applications, or switching to a different protection ring.4 Because interaction between C and assembly language code is required, particular care must be taken to correctly design data exchange between the assembly language level and C level. The corresponding code is located in arch/arch/kernel/entry.S and makes thorough use of the specific characteristics of the individual processors. For this reason, the contents of this file should be modified as seldom as possible — and then only with great care.

3 Some architectures (e.g., IA-64) do not adhere to this rule but use a few registers from the floating-point set and save them each time kernel mode is entered. The bulk of the floating point registers remain ‘‘untouched’’ by the kernel, and no explicit floating point operations are used.
4 Some processors make this switch automatically without being requested explicitly to do so by the kernel.
Work in the entry and exit path of an interrupt is made even more difficult by the fact that the processor may be in either user mode or kernel mode when an interrupt arrives. This requires several additional technical modifications that, for reasons of clarity, are not shown in Figure 14-2. (There is no need to switch between kernel mode stack and user mode stack, and there is no need to check whether it is necessary to call the scheduler or deliver signals.)
The term interrupt handler is used ambiguously. It is used to designate invocation of an ISR call by the CPU, and combines the entry/exit path and the ISR itself. Of course, it would be more correct to refer only to the routine that is executed between the entry path and the exit path and that is implemented in C.
Interrupt Handlers
Interrupt handlers can encounter difficulties particularly when further interrupts occur while they are executing. Although this can be prevented by disabling interrupts during processing by a handler, this creates other problems such as missing important interrupts. Masking (the term used to denote the selective disabling of one or more interrupts) can therefore only be used for short periods. ISRs must therefore satisfy two requirements:
1.
Implementation (above all, when other interrupts are disabled) must consist of as little code as possible to support rapid processing.
2.
Interrupt handler routines that can be invoked during the processing of other ISRs must not interfere with each other.
Whereas the latter requirement can be satisfied by intelligent programming and clever ISR design, it is rather more difficult to fulfill the former. Depending on the specific interrupt, a fixed program must be run to satisfy the minimum requirements for remedying the situation. Code size cannot therefore be reduced arbitrarily. How does the kernel resolve this dilemma? Not every part of an ISR is equally important. Generally, each handler routine can be divided into three parts of differing significance:
1.
Critical actions must be executed immediately following an interrupt. Otherwise, system stability or correct operation of the computer cannot be maintained. Other interrupts must be disabled when such actions are performed.
2.
Noncritical actions should also be performed as quickly as possible but with enabled interrupts (they may therefore be interrupted by other system events).
3.
Deferrable actions are not particularly important and need not be implemented in the interrupt handler. The kernel can delay these actions and perform them when it has nothing better to do.
The kernel makes tasklets available to perform deferrable actions at a later time. I deal with tasklets in more detail in Section 14.3.
14.1.4 Data Structures
There are two facets to the technical implementation of interrupts — assembly language code, which is highly processor-dependent and is used to process the relevant lower-level details on the particular platform; and an abstracted interface, which is required by device drivers and other kernel code to install and manage IRQ handlers. I focus on the second aspect. The countless details needed to describe the functioning of the assembly language part are best left to books and manuals on processor architecture.
To respond to the IRQs of peripheral devices, the kernel must provide a function for each potential IRQ. This function must be able to register and de-register itself dynamically. A static table organization is not sufficient because modules may also be written for devices that interact with the rest of the system by means of interrupts.
The central point at which information on IRQs is managed is a global array with an entry for each IRQ number. Because array position and interrupt number are identical, it is easy to locate the entry associated with a specific IRQ: IRQ 0 is at position 0, IRQ 15 at position 15, and so on; to which processor interrupt the IRQs are ultimately mapped is of no relevance here. The array is defined as follows:

kernel/irq/handle.c
struct irq_desc irq_desc[NR_IRQS] __cacheline_aligned_in_smp = {
        [0 ... NR_IRQS-1] = {
                .status = IRQ_DISABLED,
                .chip = &no_irq_chip,
                .handle_irq = handle_bad_irq,
                .depth = 1,
                ...
        }
};
Although an architecture-independent data type is used for the individual entries, the maximum possible number of IRQs is specified by a platform-dependent constant: NR_IRQS. This constant is for most architectures defined in the processor-specific header file include/asm-arch/irq.h.5 Its value varies widely not only between the different processors but also within processor families depending on which auxiliary chip is used to help the CPU manage IRQs. Alpha computers support between 32 interrupts on ‘‘smaller’’ systems and a fabulous 2,048 interrupts on Wildfire boards; IA-64 processors always have 256 interrupts. IA-32 systems, in conjunction with the classical 8259A controller, provide a meager 16 IRQs. This number can be increased to 224 using the IO-APIC (advanced programmable interrupt controller) expansion that is found on all multiprocessor systems but that can also be deployed on UP machines.
Initially, all interrupt slots use handle_bad_irq as a handler function that just acknowledges interrupts for which no specific handler function is installed. More interesting than the maximum number of IRQs is the data type used for the array entries (in contrast to the simple example above, it is not merely a pointer to a function). Before I get into the technical details, I need to present an overview of the kernel’s IRQ-handling subsystem.

5 The IA-32 architecture, however, uses /include/asm-x86/mach-type/irq_vectors_limits.h.
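Locating the descriptor that belongs to a given IRQ number is therefore plain array indexing. The following fragment is only a sketch (the wrapper name is invented; irq_desc and handle_irq are the elements introduced here):

#include <linux/irq.h>

/* Invented wrapper illustrating how an IRQ number is mapped to its
 * descriptor and how the installed flow handler is invoked. */
static void demo_dispatch_irq(unsigned int irq)
{
        struct irq_desc *desc = &irq_desc[irq]; /* array position == IRQ number */

        desc->handle_irq(irq, desc);            /* e.g., handle_bad_irq initially */
}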
The early versions of kernel 2.6 contained much platform-specific code to handle IRQs that was identical in many points. Thus, a new, generic IRQ subsystem was introduced during further development of kernel 2.6. It is able to handle different interrupt controllers and different types of interrupts in a unified way. Basically, it consists of three abstraction layers as visualized in Figure 14-3:
1.
High-Level Interrupt Service Routines (ISRs) — Perform all necessary work caused by the interrupt on the device driver’s (or some other kernel component’s) side. If, for instance, a device uses an interrupt to signal that some data have arrived, then the job of the high-level ISR could be to copy the data to an appropriate place.
2.
Interrupt Flow Handling — Takes care of handling the various differences between different interrupt flow types like edge- and level triggering. Edge-triggering means that hardware detects an interrupt by sensing a difference in potential on the line. In level-triggered systems, interrupts are detected when the potential has a specific value — the change in potential is not relevant. From the kernel viewpoint, level-triggering is more complicated because, after each interrupt, the line must be explicitly set to the potential that indicates ‘‘no interrupt.’’
3.
Chip-Level Hardware Encapsulation — Needs to communicate directly with the underlying hardware that is responsible to generate interrupts at the electronic level. This layer can be seen as some sort of ‘‘device driver‘‘ for interrupt controllers.
Figure 14-3: Various types of interrupt handlers and how they are connected. Let’s return to the technical side of the problem. The structure used to represent an IRQ descriptor is (slightly simplified) defined as follows6 :
struct irq_desc {
        irq_flow_handler_t      handle_irq;
        struct irq_chip         *chip;
        void                    *handler_data;
        void                    *chip_data;
        struct irqaction        *action;        /* IRQ action list */
        unsigned int            status;         /* IRQ status */

        unsigned int            depth;          /* nested irq disables */
        unsigned int            irq_count;      /* For detecting broken IRQs */
        unsigned int            irqs_unhandled;
        ...
        const char              *name;
} ____cacheline_internodealigned_in_smp;

6 Among some technical elements, support for message signaled interrupts (MSIs) has also been omitted. MSIs are an optional extension to the PCI standard and a required component of PCI express. They allow for sending an interrupt without using a physical pin on some piece of hardware, but via a ‘‘message’’ on the PCI bus. Because the number of available pins on modern processors is not unlimited, but pins are required for many purposes, they are a scarce resource. Hardware designers are thus looking for alternative methods to send interrupts, and the MSI mechanism is one of them. It will gain increased importance in the future. Documentation/MSI-HOWTO.txt in the kernel source tree contains some more information about this mechanism.
In the view of the high-level code in the kernel, each IRQ is fully described by this structure. The three abstraction layers introduced above are represented in the structure as follows: ❑
The flow-level ISR is provided by handle_irq. handler_data may point to some arbitrary, IRQ, and handler function-specific data. handle_irq is called by the architecture-specific code whenever an interrupt occurs. The function is then responsible to use the controller-specific methods provided in chip to perform the necessary low-level actions required to process the interrupt. Default functions for various interrupt types are provided by the kernel. Examples for such handler functions are discussed in Section 14.1.5.
❑
action provides a chain of actions that need to be executed when the interrupt occurs. This is
the place where device drivers for devices that are notified by the interrupt can place their specific handler functions. A special data structure is used to represent these actions, discussed in Section 14.1.4. ❑
Flow handling and chip-specific operations are encapsulated in chip. A special data structure is introduced for this purpose, covered in a moment. chip_data points to arbitrary data that may be associated with chip.
❑
name specifies a name for the flow handler that is displayed in /proc/interrupts. This is usually either ‘‘edge’’ for edge-triggered, or ‘‘level’’ for level-triggered interrupts.
There are some more elements in the structure that need to be described. depth has two tasks. It can be used to determine whether an IRQ line is enabled or disabled. A positive value indicates that the line is disabled, whereas 0 indicates an enabled line. Why are positive values used for disabled IRQs? Because this allows the kernel not only to differentiate between enabled and disabled lines, but also to disable one and the same interrupt repeatedly. Each time code from the remaining part of the kernel disables an interrupt, the counter is incremented by 1; each time the interrupt is enabled again, the counter is decremented accordingly. Only when depth has returned to 0 may the IRQ actually be re-enabled in the hardware. This approach supports the correct handling of nested disabling of interrupts. An IRQ can change its status not only during handler installation but also at run time: status describes the current state with the help of the following flags. The
IRQ_DISABLED is used for an IRQ line disabled by a device driver. It instructs the kernel not to
enter the handler. ❑
During execution of an IRQ handler the state is set to IRQ_INPROGRESS. As with IRQ_DISABLED, this prohibits the remaining kernel code from executing the handler.
❑
IRQ_PENDING is active when the CPU has noticed an interrupt but has not yet executed the corresponding handler.
❑
IRQ_MASKED is required to properly handle interrupts that occur during interrupt processing; see
Section 14.1.4. ❑
IRQ_PER_CPU is set when an IRQ can occur on a single CPU only. (On SMP systems this renders
several protection mechanisms against concurrent accesses superfluous.) ❑
IRQ_LEVEL is used on Alpha and PowerPC to differentiate level-triggered and edge-triggered
IRQs. ❑
IRQ_REPLAY means that the IRQ has been disabled but a previous interrupt has not yet been acknowledged.
❑
IRQ_AUTODETECT and IRQ_WAITING are used for the automatic detection and configuration of
IRQs. I will not discuss this in more detail, but mention that the respective code is located in kernel/irq/autoprobe.c.
❑
IRQ_NOREQUEST is set if the IRQ can be shared between devices and must thus not be exclusively requested by a single device.
Using the current contents of status, it is easy for the kernel to query the state of a certain IRQ without having to know the hardware-specific features of the underlying implementation. Of course, just setting the corresponding flags does not produce the desired effect: it is not possible to disable an interrupt simply by setting the IRQ_DISABLED flag, for instance. The underlying hardware must also be informed of the new state. Consequently, the flags may be set only by controller-specific functions that are simultaneously responsible for making the required low-level hardware settings. In many cases, this mandates the use of assembly language code or the writing of magic numbers to magic addresses by means of out commands. Finally, the fields irq_count and irqs_unhandled of irq_desc provide some statistics that can be used to detect stalled and unhandled, but permanently occurring interrupts. The latter ones are usually called spurious interrupts. I will not discuss how this is done in more detail.7

7 If you are interested in how this detection is performed, see the function note_interrupt in kernel/irq/spurious.c.
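The nesting behavior of depth can be summarized in a short sketch. It is a simplified illustration of the semantics just described, not the kernel's actual disable_irq/enable_irq implementation (which additionally takes locking and synchronization into account); the _sketch suffix on the function names makes this clear.

/* Simplified sketch of how depth makes disabling IRQs nestable.
 * Locking and synchronization are omitted; this is not the real
 * disable_irq()/enable_irq() implementation. */
void disable_irq_sketch(unsigned int irq)
{
        struct irq_desc *desc = irq_desc + irq;

        if (desc->depth++ == 0) {               /* first disable request */
                desc->status |= IRQ_DISABLED;
                desc->chip->disable(irq);       /* tell the hardware */
        }
}

void enable_irq_sketch(unsigned int irq)
{
        struct irq_desc *desc = irq_desc + irq;

        if (desc->depth > 0 && --desc->depth == 0) {    /* last enable request */
                desc->status &= ~IRQ_DISABLED;
                desc->chip->enable(irq);
        }
}

Two nested disable calls therefore require two enable calls before the line becomes active again.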
IRQ Controller Abstraction chip is an instance of the irq_chip data type (older kernel versions used the name hw_irq_controller for the same purpose) that abstracts the specific characteristics of an IRQ controller for the architecture-independent part of the kernel. The functions it provides are used to change the state of an IRQ, which is why they are also responsible for setting the status flags discussed above:
struct irq_chip {
        const char      *name;
        unsigned int    (*startup)(unsigned int irq);
        void            (*shutdown)(unsigned int irq);
        void            (*enable)(unsigned int irq);
        void            (*disable)(unsigned int irq);

        void            (*ack)(unsigned int irq);
        void            (*mask)(unsigned int irq);
        void            (*mask_ack)(unsigned int irq);
        void            (*unmask)(unsigned int irq);
        void            (*eoi)(unsigned int irq);

        void            (*end)(unsigned int irq);
        void            (*set_affinity)(unsigned int irq, cpumask_t dest);
        int             (*set_type)(unsigned int irq, unsigned int flow_type);
        ...
};
The structure needs to account for all peculiarities of the different IRQ implementations that appear in the kernel. Thus, a particular instance of the structure usually only defines a subset of all possible methods. name holds a short string to identify the hardware controller. Possible values on IA-32 systems are ‘‘XT-PIC’’ and ‘‘IO-APIC,’’ and the latter one is also used for most interrupts on AMD64 systems. On other systems there is a colorful mix of values because many different controller types are available and in widespread use.
The function pointers have the following meaning: ❑
startup refers to a function for the first-time initialization of an IRQ. In most cases, initialization is limited to enabling the IRQ. As a result, the startup function is just a means of forwarding to enable.
❑
enable activates an IRQ; in other words, it performs a transition from the disabled to the enabled state. To this end, hardware-specific numbers must be written to hardware-specific points in I/O memory or in the I/O ports.
❑
disable is the counterpart to enable and is used to deactivate an IRQ. shutdown completely closes down an interrupt source. If this is not explicitly possible, the function is an alias for disable.
❑
ack is closely linked with the hardware of the interrupt controller. In some models, the arrival
of an IRQ request (and therefore of the corresponding interrupt at the processor) must be explicitly acknowledged so that subsequent requests can be serviced. If a chipset does not issue this request, the pointer can be supplied with a dummy function or a null pointer. ack_and_mask acknowledges an interrupt, but masks it in addition afterward. ❑
end is called to mark the end of interrupt processing at the flow level. If an interrupt was disabled during interrupt processing, it is the responsibility of this handler to re-enable it again.
❑
Modern interrupt controllers do not need much flow control from the kernel, but manage nearly everything themselves out of the box. A single callback to the hardware is required when interrupts are processed, and this callback is provided in eoi — end of interrupt.
❑
In multiprocessor systems, set_affinity can be used to declare the affinity of a CPU for specified IRQs. This allows IRQs to be distributed to certain CPUs (typically, IRQs on SMP systems are spread evenly across all processors). This method has no relevance on single-processor systems and is therefore supplied with a null pointer.
❑
set_type allows for setting the IRQ flow type. This is mostly required on ARM, PowerPC, and SuperH machines; other systems can do without and set set_type to NULL.
The auxiliary function set_irq_type(irq, type) is a convenience function to set the IRQ type for irq. The types IRQ_TYPE_EDGE_RISING and IRQ_TYPE_EDGE_FALLING specify edge-triggered interrupts that use the rising or falling edge, while IRQ_TYPE_EDGE_BOTH works for both edge types. Level-triggered interrupts are denoted by IRQ_TYPE_LEVEL_HIGH and IRQ_TYPE_LEVEL_LOW — you will have guessed that low and high signal levels are distinguished. IRQ_TYPE_NONE, finally, sets an unspecified type.
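As a brief illustration, the following sketch shows how board-setup code might use set_irq_type. The IRQ number and the function name are invented for the example; only set_irq_type and the IRQ_TYPE_* constants are part of the kernel interface described here.

#define MY_BOARD_GPIO_IRQ 42    /* hypothetical IRQ number, for illustration only */

static void __init my_board_init_irqs(void)
{
        /* let the line trigger on the rising edge of the signal */
        set_irq_type(MY_BOARD_GPIO_IRQ, IRQ_TYPE_EDGE_RISING);
}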
One particular example for an interrupt controller chip implementation is the IO-APIC on AMD64 systems. It is given by the following definition: arch/x86/kernel/io_apic_64.c
static struct irq_chip ioapic_chip __read_mostly = {
        .name           = "IO-APIC",
        .startup        = startup_ioapic_irq,
        .mask           = mask_IO_APIC_irq,
        .unmask         = unmask_IO_APIC_irq,
        .ack            = ack_apic_edge,
        .eoi            = ack_apic_level,
#ifdef CONFIG_SMP
        .set_affinity   = set_ioapic_affinity_irq,
#endif
};
Note that the kernel defines the alias hw_interrupt_type for irq_chip; this is for compatibility with previous versions of the IRQ subsystem. The name is, for instance, still in use on Alpha systems that define the chip-level operations for the i8259A standard interrupt controller as follows8: arch/alpha/kernel/i8259.c
struct hw_interrupt_type i8259a_irq_type = {
        .typename       = "XT-PIC",
        .startup        = i8259a_startup_irq,
        .shutdown       = i8259a_disable_irq,
        .enable         = i8259a_enable_irq,
        .disable        = i8259a_disable_irq,
        .ack            = i8259a_mask_and_ack_irq,
        .end            = i8259a_end_irq,
};

8 Using typename instead of name is also obsolete by now, but still supported for compatibility reasons.
As the code shows, only a subset of all possible handler functions is necessary to operate the device. i8259A chips are also still present in many IA-32 systems. Support for this chipset has, however, already been converted to the more modern irq_chip representation. The interrupt controller type used (and the allocation of all system IRQs) can be seen in /proc/interrupts. The following example is from a (rather unchallenged) quad-core AMD64 box:

wolfgang@meitner> cat /proc/interrupts
            CPU0      CPU1      CPU2      CPU3
  0:          48         1         0         0   IO-APIC-edge      timer
  1:           1         0         1         0   IO-APIC-edge      i8042
  4:           3         0         0         3   IO-APIC-edge
  8:           0         0         0         1   IO-APIC-edge      rtc
  9:           0         0         0         0   IO-APIC-fasteoi   acpi
 16:          48        48     96720     50082   IO-APIC-fasteoi   libata, uhci_hcd:usb1
 18:           1         0         2         0   IO-APIC-fasteoi   uhci_hcd:usb3, uhci_hcd:usb6, ehci_hcd:usb7
 19:           0         0         0         0   IO-APIC-fasteoi   uhci_hcd:usb5
 21:           0         0         0         0   IO-APIC-fasteoi   uhci_hcd:usb2
 22:      407287    370858      1164      1166   IO-APIC-fasteoi   libata, libata, HDA Intel
 23:           0         0         0         0   IO-APIC-fasteoi   uhci_hcd:usb4, ehci_hcd:usb8
NMI:           0         0         0         0   Non-maskable interrupts
LOC:     2307075   2266433   2220704   2208597   Local timer interrupts
RES:       22037     18253     33530     35156   Rescheduling interrupts
CAL:         363       373       394       184   function call interrupts
TLB:        3355      3729      1919      1630   TLB shootdowns
TRM:           0         0         0         0   Thermal event interrupts
THR:           0         0         0         0   Threshold APIC interrupts
SPU:           0         0         0         0   Spurious interrupts
ERR:           0
Note that the chip name is concatenated with the flow handler name, which results, for instance, in ‘‘IO-APIC-edge.’’ Besides listing all registered IRQs, the file also provides some statistics at the bottom.
Handler Function Representation An instance of the irqaction structure defined as follows exists for each handler function:
struct irqaction {
        irq_handler_t handler;
        unsigned long flags;
        const char *name;
        void *dev_id;
        struct irqaction *next;
};
The most important element in the structure is the handler function itself, which takes the form of the handler pointer and is located at the beginning of the structure. The handler function is invoked by the kernel when a device has requested a system interrupt and the interrupt controller has forwarded this to the processor by raising an interrupt. We will look more closely at the meaning of the arguments when we consider how to register handler functions. Note, however, that the type irq_handler_t clearly distinguishes this handler type from flow handlers that are of type irq_flow_handler_t! name and dev_id uniquely identify an interrupt handler. While name is a short string used to identify the device (e.g., ‘‘e100,’’ ‘‘ncr53c8xx,’’ etc.), dev_id is a pointer to any data structure that uniquely identifies the device among all kernel data structures — for example, the net_device instance of a network card. This information is needed when removing a handler function if several devices share an IRQ and the IRQ number alone is not sufficient to identify the device. flags is a flag variable that describes some features of the IRQ (and associated interrupt) with the help of a bitmap whose individual elements can, as usual, be accessed via predefined constants. The following constants are defined in <linux/interrupt.h>:
❑
IRQF_SHARED is set for shared IRQs and signals that more than one device is using an IRQ line.
❑
IRQF_SAMPLE_RANDOM is set when the IRQ contributes to the kernel entropy pool.9
❑
IRQF_DISABLED indicates that the IRQ handler must be executed with interrupts disabled.
❑
IRQF_TIMER denotes a timer interrupt.
9 This information is used to generate relatively secure random numbers for /dev/random and /dev/urandom.
next is used to implement shared IRQ handlers. Several irqaction instances are grouped into a linked list. All elements of a linked list must handle the same IRQ number (instances for different numbers are located at various positions in the irq_desc array). As discussed in Section 14.1.7, the kernel scans the list when a shared interrupt is issued to find out for which device the interrupt is actually intended. Particularly on laptops that integrate many different devices (network, USB, FireWire, sound card, etc.) on a single chip (with just one interrupt), handler chains of this kind can consist of about five elements. However, the desirable situation is that only a single device is registered for each IRQ.
Figure 14-4 shows an overview of the data structures described to illustrate how they interact. Because one type of interrupt controller normally dominates on a system (there is nothing preventing the coexistence of multiple controllers, though), the chip elements of all irq_desc entries point to the same instance of irq_chip.
Figure 14-4: Data structures in IRQ management.
14.1.5 Interrupt Flow Handling Now let’s examine how flow handling is implemented. The situation in this area was quite painful before the interrupt rework in 2.6, and architecture-specific code was heavily involved in flow handling. Thankfully, the situation is now much improved, and a generic framework that accounts for nearly all available hardware with only very few exceptions is available.
Setting Controller Hardware First of all, I need to mention some standard functions that are provided by the kernel to register irq_chips and set flow handlers:
int set_irq_chip(unsigned int irq, struct irq_chip *chip);
void set_irq_handler(unsigned int irq, irq_flow_handler_t handle);
void set_irq_chained_handler(unsigned int irq, irq_flow_handler_t handle);
void set_irq_chip_and_handler(unsigned int irq, struct irq_chip *chip,
                              irq_flow_handler_t handle);
void set_irq_chip_and_handler_name(unsigned int irq, struct irq_chip *chip,
                                   irq_flow_handler_t handle, const char *name);
❑
set_irq_chip associates an IRQ chip in the form of an irq_chip instance with a specific interrupt. Besides picking the proper element from irq_desc and setting the chip pointer, the function also inserts default handler functions if no chip-specific implementation is supplied.
If a NULL pointer is given for the chip, then the generic ‘‘no controller’’ variant no_irq_chip, which provides only no-op operations, is used. ❑
set_irq_handler and set_irq_chained_handler set the flow handler function for a given IRQ number. The second variant is required to signal that the handler must deal with shared interrupts. This enables the flags IRQ_NOREQUEST and IRQ_NOPROBE in irq_desc[irq]->status: the first one because shared interrupts cannot be reserved for exclusive use, and the second one because it is obviously a bad idea to use interrupt probing on lines where multiple devices are present.
Both functions use __set_irq_handler internally, which performs some sanity checks and sets irq_desc[irq]->handle_irq. ❑
set_irq_chip_and_handler is a convenient shortcut used instead of calling the functions discussed above one after another. The _name variant works identically, but allows for specifying a name for the flow handler that is stored in irq_desc[irq]->name.
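To make the division of labor concrete, the following sketch shows how the initialization code of a hypothetical interrupt controller driver could wire up its interrupt range with these helpers. my_chip, MY_FIRST_IRQ, and MY_NR_IRQS are invented names; set_irq_chip_and_handler and handle_level_irq are the kernel interfaces described in this section.

static struct irq_chip my_chip;         /* callbacks filled in elsewhere (hypothetical) */

#define MY_FIRST_IRQ    64              /* hypothetical IRQ range */
#define MY_NR_IRQS      16

static void __init my_controller_init(void)
{
        unsigned int irq;

        for (irq = MY_FIRST_IRQ; irq < MY_FIRST_IRQ + MY_NR_IRQS; irq++)
                /* associate the chip and use the default level-type flow handler */
                set_irq_chip_and_handler(irq, &my_chip, handle_level_irq);
}

Using set_irq_chip_and_handler_name instead would additionally record a string such as ‘‘level’’ that later shows up in /proc/interrupts.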
Flow Handling Before discussing how flow handlers are implemented, we need to introduce the type used for them. irq_flow_handler_t specifies the signature of IRQ flow handler functions:
typedef void fastcall (*irq_flow_handler_t)(unsigned int irq, struct irq_desc *desc);
Flow handlers get both the IRQ number and a pointer to the irq_desc instance that is responsible for the interrupt. This information can then be used to implement proper flow handling. Recall that different hardware requires different approaches to flow handling — edge- and level-triggering need to be dealt with differently, for instance. The kernel provides several default flow handlers for various types. They have one thing in common: Every flow handler is responsible to call the high-level ISRs once its work is finished. handle_IRQ_event is responsible to activate the high-level handlers; this is discussed in Section 14.1.7. For now, let us examine how flow handling is performed.
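Before looking at the default implementations, the following sketch illustrates the general shape of a flow handler. It roughly corresponds to what handle_simple_irq does for interrupts that need no hardware interaction; locking of the descriptor and statistics are omitted, so it illustrates the calling convention rather than reproducing the kernel code.

static void my_simple_flow_handler(unsigned int irq, struct irq_desc *desc)
{
        struct irqaction *action = desc->action;

        /* nothing to do if no ISR is registered or the line is disabled */
        if (!action || (desc->status & IRQ_DISABLED))
                return;

        desc->status |= IRQ_INPROGRESS;
        handle_IRQ_event(irq, action);          /* run the high-level ISRs */
        desc->status &= ~IRQ_INPROGRESS;
}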
Edge-Triggered Interrupts Edge-triggered interrupts are the most common type on today’s hardware, so I consider this type first. The default handler is implemented in handle_edge_irq. The code flow diagram is shown in Figure 14-5. Edge-triggered IRQs are not masked when they are processed — in contrast to level-triggered IRQs, there is no need to do so. This has one important implication for SMP systems: When an IRQ is handled on one CPU, another IRQ with the same number can appear on another CPU that we denote as the second CPU. This implies that the flow handler will be called once more while it is still running on the CPU that handles the first IRQ. But why should two CPUs be engaged with running the same IRQ handler simultaneously? The kernel wants to avoid this situation: The handler should only be processed on a single CPU. The initial portion of handle_edge_irq has to deal with this case. If the IRQ_INPROGRESS flag is set, the IRQ is already being processed on another CPU. By setting the IRQ_PENDING flag, the
kernel remembers that another IRQ needs to be served later. After masking the IRQ and sending an acknowledgment to the controller via mask_ack_irq, processing can be aborted. The second CPU can thus go back to work as usual, while the first CPU will handle the IRQ later.
Figure 14-5: Code flow diagram for handle_edge_irq.
Note that processing is also aborted if no ISR handler is available for the IRQ or if it is disabled. (Faulty hardware might nevertheless generate the IRQ, so this case needs to be taken into account by the kernel.) Now the proper work to handle the IRQ starts. After sending an acknowledgment to the interrupt controller with the chip-specific function chip->ack, the kernel sets the IRQ_INPROGRESS flag. This signals that the IRQ is being processed and can be used to avoid the same handler executing on multiple CPUs. Let us assume that only a single IRQ needs to be processed. In this case, the high-level ISR handlers are activated by calling handle_IRQ_event, and the IRQ_INPROGRESS flag can be removed afterward. However, the situation is more complicated in reality, as the source code shows: kernel/irq/chip.c
void fastcall handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
        ...
        desc->status |= IRQ_INPROGRESS;

        do {
                struct irqaction *action = desc->action;
                irqreturn_t action_ret;
                ...
                /*
                 * When another irq arrived while we were handling
                 * one, we could have masked the irq.
                 * Renable it, if it was not disabled in meantime.
                 */
                if (unlikely((desc->status &
                                (IRQ_PENDING | IRQ_MASKED | IRQ_DISABLED)) ==
                                (IRQ_PENDING | IRQ_MASKED))) {
                        desc->chip->unmask(irq);
                        desc->status &= ~IRQ_MASKED;
                }

                desc->status &= ~IRQ_PENDING;
                action_ret = handle_IRQ_event(irq, action);
        } while ((desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING);
Processing the IRQ runs in a loop. Suppose we are at the point right beneath the call to handle_IRQ_event. While the ISR handlers for the first IRQ were running, a second IRQ could have appeared as shown before. This is indicated by IRQ_PENDING. If the flag is set (and the IRQ has not been disabled in the meantime), another IRQ is waiting to be processed, and the loop is started again from the beginning. In this case, however, the IRQ will have been masked. The IRQ must thus be unmasked with chip->unmask and the IRQ_MASKED flag be removed. This guarantees that only one interrupt can occur during the execution of handle_IRQ_event.
After removing the IRQ_PENDING flag — technically, one IRQ is still pending right now, but it is going to be processed immediately — handle_IRQ_event can also serve the second IRQ.
Level-Triggered Interrupts Level-triggered interrupts are a little easier to process than their edge-triggered relatives. This is also reflected in the code flow diagram of the flow handler handle_level_irq, which is depicted in Figure 14-6.
Figure 14-6: Code flow diagram for handle_level_irq.

Note that level-triggered interrupts must be masked when they are processed, so the first thing that needs to be done is to call mask_ack_irq. This auxiliary function masks and acknowledges the IRQ by
either calling chip->mask_ack or, if this is not available, chip->mask and chip->ack consecutively. On multiprocessor systems, a race condition might occur such that handle_level_irq is called although the IRQ is already processed on another CPU. This can be detected by checking for the IRQ_INPROGRESS flag, and the routine can immediately be left — the IRQ is already being processed on another CPU, in this case. If no handler is registered for the IRQ, processing can also be aborted — there is nothing to do. One more reason to abort processing is when IRQ_DISABLED is set. Despite being disabled, broken hardware could nevertheless issue the IRQ, but it can be ignored. Then the proper processing starts. IRQ_INPROGRESS is set to signal that the IRQ is being processed, and the actual work is delegated to handle_IRQ_event. This triggers the high-level ISRs, as discussed below. The IRQ_INPROGRESS flag can be removed after the ISRs are finished. Finally, the IRQ needs to be unmasked. However, the kernel needs to consider that an ISR could have disabled the interrupt, and in this case, it needs to remain masked. Otherwise, the chip-specific unmask function chip->unmask is used.
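The steps just described can be condensed into a short sketch. It is a simplified rendering of handle_level_irq for illustration only; the real implementation additionally takes the descriptor lock and records statistics.

static void level_flow_sketch(unsigned int irq, struct irq_desc *desc)
{
        struct irqaction *action = desc->action;

        mask_ack_irq(desc, irq);                        /* mask and acknowledge the line */

        if (desc->status & IRQ_INPROGRESS)              /* already handled on another CPU */
                return;
        if (!action || (desc->status & IRQ_DISABLED))   /* no ISR, or line disabled */
                return;

        desc->status |= IRQ_INPROGRESS;
        handle_IRQ_event(irq, action);                  /* run the high-level ISRs */
        desc->status &= ~IRQ_INPROGRESS;

        /* unmask again unless an ISR disabled the interrupt in the meantime */
        if (!(desc->status & IRQ_DISABLED) && desc->chip->unmask)
                desc->chip->unmask(irq);
}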
Other Types of Interrupts Besides edge- and level-triggered IRQs, some less common flow types are also possible. The kernel provides default handlers for them as well. ❑
Modern IRQ hardware requires only very little flow handling. Only one chip-specific function needs to be called after IRQ processing is finished: chip->eoi. The default handler for this type is handle_fasteoi_irq. It is basically identical with handle_level_irq, except that interaction with the controller chip is only required at the very end.
❑
Really simple interrupts that require no flow control at all are managed by handle_simple_irq. The function can also be used if a caller wants to handle the flow itself.
❑
Per-CPU IRQs, that is, IRQs that can only happen on one specific CPU of a multiprocessor system, are handled by handle_percpu_irq. The function acknowledges the IRQ after reception and calls the EOI routine after processing. The implementation is very simple because no locking is required — the code can by definition only run on a single CPU.
14.1.6 Initializing and Reserving IRQs In this section, we will turn our attention to how IRQs are registered and initialized.
Registering IRQs Dynamic registration of an ISR by a device driver can be performed very simply using the data structures described. The function had been implemented by platform-specific code before the interrupt rework in 2.6. Naturally, the prototype was identical on all architectures as this is an absolute prerequisite for programming platform-independent drivers. Nowadays, the function is implemented by common code: kernel/irq/manage.c
int request_irq(unsigned int irq, irq_handler_t handler,
                unsigned long irqflags, const char *devname, void *dev_id)
Figure 14-7 shows the code flow diagram for request_irq.

Figure 14-7: Code flow diagram for request_irq.

The kernel first generates a new instance of irqaction that is then supplied with the function parameters. Of special importance is, of course, the handler function handler. All further work is delegated to the setup_irq function that performs the following steps:
1.
If IRQF_SAMPLE_RANDOM is set, the interrupt contributes to the kernel entropy source used for the random number generator in /dev/random. rand_initialize_irq adds the IRQ to the corresponding data structures.
2.
The irqaction instance generated by request_irq is added to the end of the list of routines for a specific IRQ number; this list is headed by irq_desc[NUM]->action. This is how the kernel ensures that — in the case of shared interrupts — handlers are invoked in the same sequence in which they were registered when an interrupt occurs.
3.
If the installed handler is the first in the list for the IRQ number, the chip->startup initialization function is invoked.10 This is not necessary if handlers for the IRQ have already been installed.
4.
register_irq_proc generates the directory /proc/irq/NUM in the proc filesystem. register_handler_proc generates /proc/irq/NUM/name. The system is then able to see that
the corresponding IRQ channel is in use.
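As a usage illustration, the following sketch shows how a device driver typically registers and later releases a (shared) interrupt handler. The device structure, the ISR name, the status check, and MY_DEV_IRQ are invented for the example; request_irq, free_irq, and IRQF_SHARED are the interfaces described in this section.

#define MY_DEV_IRQ 19                    /* hypothetical IRQ number */

static irqreturn_t my_dev_isr(int irq, void *dev_id)
{
        struct my_dev *dev = dev_id;

        if (!my_dev_irq_pending(dev))    /* hypothetical check of a status register */
                return IRQ_NONE;         /* some other device on the shared line */

        /* ... service the device ... */
        return IRQ_HANDLED;
}

static int my_dev_setup(struct my_dev *dev)
{
        int ret;

        /* dev doubles as dev_id so the handler can be identified later */
        ret = request_irq(MY_DEV_IRQ, my_dev_isr, IRQF_SHARED, "my_dev", dev);
        return ret;
}

static void my_dev_teardown(struct my_dev *dev)
{
        free_irq(MY_DEV_IRQ, dev);       /* dev_id selects the right entry in the chain */
}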
Freeing IRQs The reverse scheme is adopted in order to free interrupts. First, the interrupt controller is informed that the IRQ has been removed by means of a hardware-specific (chip->shutdown) function,11 and then the relevant entries are removed from the general data structures of the kernel. The auxiliary function free_irq assumes these tasks. While it was an architecture-dependent function before the genirq rework, it can today be found in kernel/irq/manage.c. When the handler of a shared interrupt is to be removed, the IRQ number alone is not sufficient to identify it. In this case, it is necessary to also use the dev_id discussed above for purposes of

10 If no explicit startup function is available, the IRQ is simply enabled by calling chip->enable instead.
11 If no explicit shutdown function is available, the interrupt is simply disabled by chip->disable instead.
unique identification. The kernel scans the list of all registered handlers until it finds a matching element (with a matching dev_id). Only then can the entry be removed.
Registering Interrupts The mechanisms discussed above are effective only for interrupts raised by an interrupt request from a system peripheral. But the kernel must also take care of interrupts raised either by the processor itself or by software mechanisms in an executing user process. In contrast to IRQs, the kernel need not provide an interface for this kind of interrupt in order to dynamically register handlers. This is because the numbers used are made known at initialization time and do not change thereafter. Registering of interrupts, exceptions, and traps is performed at kernel initialization time, and their reservations do not change at run time. The platform-specific kernel sources have very few commonalities, not surprising in view of the sometimes large technical differences. Even though the concepts behind some variants may be similar, their concrete implementation differs strongly from platform to platform. This is because implementation must walk a fine line between C and assembly language code in order to do justice to the specific features of a system. The greatest similarity between the various platforms is a filename. arch/arch/kernel/traps.c contains the system-specific implementation for registering interrupt handlers. The outcome of all implementations is that a handler function is invoked automatically when an interrupt occurs. Because interrupt sharing is not supported for system interrupts, all that need be done is to establish a link between the interrupt number and function pointer. Generally, the kernel responds to interrupts in one of two ways. ❑
A signal is sent to the current user process to inform it that an error has occurred. On IA-32 and AMD64 systems, for example, a division by 0 is signaled by interrupt 0. The automatically invoked assembly language routine divide_error sends the SIGFPE signal to the user process.
❑
The kernel corrects the error situation invisibly to the user process. This is the case on, for example, IA-32 systems, where interrupt 14 is used to signal a page fault, which the kernel can then correct by employing the methods described in Chapter 18.
14.1.7 Servicing IRQs Once an IRQ handler has been registered, the handler routine is executed each time an interrupt occurs. The problem again arises as to how to reconcile the differences between the various platforms. Owing to the nature of things, the differences are not restricted to various C functions with platform-specific implementations but start deep down in the domain of the manually optimized assembly language code used for low-level processing. Fortunately, several structural similarities between the individual platforms can be identified. For example, the interrupt action on each platform comprises three parts, as discussed earlier. The entry path switches from user mode to kernel mode, then the actual handler routine executes, and finally the kernel switches back to user mode. Even though much assembly language code is involved, there are at least some C fragments that are similar on all platforms. These are discussed below.
Switching to Kernel Mode The switch to kernel mode is based on assembly language code executed by the processor automatically after every interrupt. The tasks of this code are described above. Its implementation can be found in arch/arch/kernel/entry.S,12 which usually defines various entry points at which the processor sets the flow of control when an interrupt occurs. Only the most necessary actions are executed directly in assembly language code. The kernel attempts to return to regular C code as quickly as possible because it is far easier to handle. To this end, an environment must be created that is compatible with the expectations of the C compiler. Functions are called in C by placing the required data — return address and parameters — on the stack in a certain order. When switching between user mode and kernel mode, it is also necessary to save the most important registers on the stack so that they can be restored later. These two actions are performed by platform-dependent assembly language code. On most platforms, control flow is then passed to the C function do_IRQ,13 whose implementation is also platform-dependent, but which greatly simplifies the situation. Depending on the platform, the function receives as its parameter either the processor register set arch/arch/kernel/irq.c
fastcall unsigned int do_IRQ(struct pt_regs regs)
or the number of the interrupt together with a pointer to the processor register arch/arch/kernel/irq.c
unsigned int do_IRQ(int irq, struct pt_regs *regs)

pt_regs is used to save the registers used by the kernel. The values are pushed one after another onto
the stack (by assembly language code) and are left there before the C function is invoked. pt_regs is defined to ensure that the register entries on the stack coincide with the elements of the
structure. The values are not only saved for later, but can also be read by the C code. Figure 14-8 illustrates this.

Figure 14-8: Stack layout after entry into kernel mode.

12 The unified x86 architecture distinguishes between entry_32 for IA-32 and entry_64 for AMD64 systems.
13 Exceptions are Sparc, Sparc64, and Alpha.
Alternatively, the registers can also be copied to a location in address space that is not identical to the stack. In this case, do_IRQ receives as its parameter a pointer to pt_regs, which does not change the fact that the register contents have been saved and can be read by the C code. The definition of struct pt_regs is platform-dependent because different processors provide different register sets. The registers used by the kernel are held in pt_regs. Registers not listed here may be used by user mode applications only. On IA-32 systems, pt_regs is typically defined as follows: include/asm-x86/ptrace.h
struct pt_regs {
        long ebx;
        long ecx;
        long edx;
        long esi;
        long edi;
        long ebp;
        long eax;
        int  xds;
        int  xes;
        long orig_eax;
        long eip;
        int  xcs;
        long eflags;
        long esp;
        int  xss;
};
PA-Risc processors, for instance, use a totally different set of registers: include/asm-parisc/ptrace.h
struct pt_regs {
        unsigned long gr[32];    /* PSW is in gr[0] */
        __u64         fr[32];
        unsigned long sr[8];
        unsigned long iasq[2];
        unsigned long iaoq[2];
        unsigned long cr27;
        unsigned long pad0;      /* available for other uses */
        unsigned long orig_r28;
        unsigned long ksp;
        unsigned long kpc;
        unsigned long sar;       /* CR11 */
        unsigned long iir;       /* CR19 */
        unsigned long isr;       /* CR20 */
        unsigned long ior;       /* CR21 */
        unsigned long ipsw;      /* CR22 */
};
The general trend in 64-bit architectures is to provide more and more registers, with the result that pt_regs definitions are becoming larger and larger. IA-64 has, for example, almost 50 entries in pt_regs, reason enough not to include the definition here.
On IA-32 systems, the number of the raised interrupt is saved in the most significant 8 bits of orig_eax. Other architectures use other locations. As mentioned above, some platforms even adopt the approach of placing the interrupt number on the stack as a direct argument.
IRQ Stacks The situation described above is only valid if the kernel uses the kernel stack to process IRQs. This need not always be the case. The IA-32 architecture provides the configuration option CONFIG_4KSTACKS.14 If it is activated, the size of the kernel stack is reduced from 8 KiB to 4 KiB. Since the page size is 4 KiB on this machine, the number of pages necessary to implement the kernel stack is reduced from two to one. This makes life easier for the VM subsystem when a huge number of processes (or threads) is active on the system because single pages are easier to find than two consecutive ones as required before. Unfortunately, 4 KiB might not always be enough for the regular kernel work and the space required by IRQ processing routines, so two more stacks come into play: ❑
A stack for hardware IRQ processing.
❑
A stack for software IRQ processing.
In contrast to the regular kernel stack that is allocated per process, the two additional stacks are allocated per CPU. Whenever a hardware interrupt occurs (or a softIRQ is processed), the kernel needs to switch to the appropriate stack. Pointers to the additional stacks are provided in the following array: arch/x86/kernel/irq_32.c
static union irq_ctx *hardirq_ctx[NR_CPUS] __read_mostly;
static union irq_ctx *softirq_ctx[NR_CPUS] __read_mostly;
Note that the attribute __read_mostly does not refer to the stack itself, but to the pointer that points to the appropriate place in memory. This is only manipulated when the stacks are initially allocated, but no more during the system’s lifetime. The data structure used for the stacks is not too complicated: arch/x86/kernel/irq_32.c
union irq_ctx {
        struct thread_info tinfo;
        u32                stack[THREAD_SIZE/sizeof(u32)];
};
tinfo is used to store information about the thread that was running before the interruption occurred (see Chapter 2 for more details). stack provides the stack space itself. THREAD_SIZE is defined to 4,096 if 4-KiB stacks are enabled, so this guarantees the desired stack size. Note that since a union is used to combine tinfo and stack[], the data structure fits into exactly one page frame. This also implies that the thread information contained in tinfo is always available on the stack.

14 The PowerPC and SuperH architectures provide the configuration option CONFIG_IRQSTACKS to enable separate stacks for IRQ processing. Since the mechanism used there is similar, these cases are not discussed separately.
Calling the Flow Handler Routine How the flow handler routines are called differs from architecture to architecture; in the following, how this is done is discussed for AMD64 and IA-32. Additionally, we also examine the old handler mechanism that was the default before the IRQ subsystem rewrite, and is still used in some places.
Processing on AMD64 Systems Let us first turn our attention to how do_IRQ is implemented on AMD64 systems. This variant is simpler as compared to IA-32, and many other modern architectures use a similar approach. The code flow diagram is shown in Figure 14-9.

Figure 14-9: Code flow diagram for do_IRQ on AMD64 systems.

The prototype of the function is as follows: arch/x86/kernel/irq_64.c
asmlinkage unsigned int do_IRQ(struct pt_regs *regs)
The low-level assembler code is responsible to pass the current state of the register set to the function, and the first task of do_IRQ is to save a pointer to them in a global per-CPU variable using set_irq_regs (the old pointer that was active before the interrupt occurred is preserved for later). Interrupt handlers that require access to the register set can access them from there. irq_enter is then responsible to update some statistics; for systems with dynamic ticks, the global jiffies time-keeping variable is updated if the system has been in a tickless state for some time
(more about dynamic ticks follows in Section 15.5). Calling the ISRs registered for the IRQ in question is then delegated to the architecture-independent function generic_handle_irq, which calls irq_desc[irq]->handle_irq to activate the flow control handler. irq_exit is then responsible for some statistics bookkeeping, but also calls (assuming the kernel is not still in interrupt mode because it is processing a nested interrupt) do_softirq to service any pending software IRQs. This mechanism is discussed in more detail in Section 14.2. Finally, another call to set_irq_regs restores the pointer to struct pt_regs to the setting that was active before the call. This ensures that nested handlers work correctly.
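The sequence described above can be summarized in a compact sketch. For simplicity, the sketch takes the IRQ number as a parameter; the real do_IRQ derives it from the saved register state. It illustrates the calling sequence and is not the verbatim arch/x86/kernel/irq_64.c code.

static void do_IRQ_sketch(unsigned int irq, struct pt_regs *regs)
{
        struct pt_regs *old_regs = set_irq_regs(regs);  /* publish the register state */

        irq_enter();                    /* statistics, dynamic-tick bookkeeping */
        generic_handle_irq(irq);        /* flow handler, which runs the high-level ISRs */
        irq_exit();                     /* statistics; may start softIRQ processing */

        set_irq_regs(old_regs);         /* restore the previous pointer (nesting) */
}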
Processing on IA-32 Systems IA-32 requires slightly more work in do_IRQ, as the code flow diagram in Figure 14-10 shows. We first suppose that a single page frame is used for the kernel stack, that is, 4 KiB are available per process for the kernel. This is configured if CONFIG_4KSTACKS is set. Recall that in this case a separate stack is used to handle IRQ processing.

Figure 14-10: Code flow diagram for do_IRQ on IA-32 systems.

As in the AMD64 case, the functions set_irq_regs and irq_enter are called with the same purpose as before. The kernel must switch to the IRQ stack. The current stack can be obtained by calling the auxiliary function current_thread_info, which delivers a pointer to the thread_info instance currently in use. Recall from above that this information is in a union with the current stack. A pointer to the appropriate IRQ-stack can be obtained from hardirq_ctx as discussed above. Two cases are possible:
1.
The process is already using the IRQ stack because nested IRQs are processed. In this case, the kernel can be lazy — nothing needs to be done because everything is already set up. irq_desc[irq]->handle_irq can be called to activate the ISR stored in the IRQ database.
2.
The current stack is not the IRQ stack (curctx != irqctx), and a switch between both is required. In this case, the kernel performs the required low-level assembler operations to switch between the stacks, calls irq_desc[irq]->handle_irq, and switches the stacks back.
Note that in both cases the ISR is called directly and not via a detour over generic_handle_irq as on AMD64 systems. The remaining work is done in the same way as on AMD64 systems. irq_exit handles some accounting and activates SoftIRQs, and set_irq_regs restores the register pointer to the state before the IRQ happened.
When stacks with 8-KiB size, that is, two page frames, are used, IRQ handling is simplified because a potential stack switch does not need to be taken into account and irq_desc[irq]->handle_irq can be called immediately in any case.
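The stack decision described above amounts to the following sketch. The helper call_on_irq_stack stands in for the inline-assembly sequence that the real code uses to switch the stack pointer; it is a made-up name used purely for illustration.

static void do_IRQ_ia32_sketch(unsigned int irq, struct irq_desc *desc)
{
        union irq_ctx *curctx = (union irq_ctx *)current_thread_info();
        union irq_ctx *irqctx = hardirq_ctx[smp_processor_id()];

        if (curctx == irqctx) {
                /* nested IRQ: already on the IRQ stack, call the handler directly */
                desc->handle_irq(irq, desc);
        } else {
                /* switch to irqctx->stack, run the handler there, switch back;
                 * call_on_irq_stack is a hypothetical stand-in for the
                 * inline-assembly sequence used by the real kernel */
                call_on_irq_stack(irqctx, desc->handle_irq, irq, desc);
        }
}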
Old-Style Processing In the discussion of how AMD64 calls the flow control handler, it was mentioned that the code ends up in generic_handle_irq, which selects and activates the proper handle_irq function from the IRQ database irq_desc. However, generic_handle_irq is a little more complicated in practice:
static inline void generic_handle_irq(unsigned int irq)
{
        struct irq_desc *desc = irq_desc + irq;

#ifdef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
        desc->handle_irq(irq, desc);
#else
        if (likely(desc->handle_irq))
                desc->handle_irq(irq, desc);
        else
                __do_IRQ(irq);
#endif
}
Before the generic IRQ rework, the kernel used a colorful mixture of architecture-dependent approaches to IRQ handling. Most important, there was no separation between flow handling and ISR handling: Both tasks were performed simultaneously in a single architecture-specific routine usually called __do_IRQ. Modern code should activate the configuration option GENERIC_HARDIRQS_NO__DO_IRQ and implement flow handling as shown in the preceding discussions. In this case, generic_handle_irq really boils down to just calling irq_desc[irq]->handle_irq. What if this option is not set? The kernel provides a default implementation of __do_IRQ that combines flow handling for all interrupt types, and also calls the required ISRs.15 Basically, there are three possibilities of how to use this function and implement flow handling:
1.
Use generic flow handlers for some IRQs, and leave the handlers for others undefined. For these, __do_IRQ is employed to handle both flow and high-level processing. It is required to call generic_handle_irq from do_IRQ in this case.
2.
Call __do_IRQ directly from do_IRQ. This bypasses the flow separation completely. Some off-mainstream architectures like M32R, H8300, SuperH, and Cris still use this approach.
3.
Handle IRQs in a completely architecture-dependent way without reusing any of the existing frameworks. Clearly, this is not the brightest idea — to say the least.
Needless to say, the long-term goal for all architectures is to convert to the generic IRQ framework, so __do_IRQ is not discussed in detail.

15 The implementation is based on the version used on IA-32 systems before the generic IRQ framework was introduced.
Calling the High-level ISR Recall from above that the various flow handler routines all have one thing in common: They employ handle_IRQ_event to activate the high-level ISRs associated with a particular IRQ. The time has come to examine this function a little more closely. The function requires the IRQ number and the action chain to be passed as parameters: kernel/irq/handle.c
irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action);

handle_IRQ_event performs various actions:
❑
If IRQF_DISABLED was not set in the first handler function, the interrupts (for the current CPU) are enabled with local_irq_enable_in_hardirq; in other words, the handlers can be interrupted by other IRQs. However, depending on the flow type, it is possible that the IRQ just processed is always masked out.
❑
The action functions of the registered IRQ handlers are invoked one after the other.
❑
If IRQF_SAMPLE_RANDOM is set for the IRQ, add_interrupt_randomness is called in order to use the time of the event as a source for the entropy pool (interrupts are an ideal source if they occur randomly).
❑
local_irq_disable disables the interrupts. Because enabling and disabling of interrupts
is not nested, it is irrelevant whether they were enabled or not at the start of processing. handle_IRQ_event was called with interrupts disabled, and is also expected to leave again with
interrupts disabled. With shared IRQs the kernel has no way of finding out which device raised the request. This is left entirely to the handler routines that use device-specific registers or other hardware characteristics to find the source. Routines not affected also recognize that the interrupt was not intended for them and return control as quickly as possible. Neither is there any way for a handler routine to report to higher-level code whether the interrupt was intended for it in a way that would stop the iteration: The kernel always executes all handler routines in turn, regardless of whether the first or the last leads to success. Nevertheless, the kernel can check whether any handler was found to be responsible for the IRQ. irqreturn_t is defined as the return type of handler functions and boils down to a simple integer variable. It accepts the value IRQ_NONE or IRQ_HANDLED, depending on whether the IRQ was serviced by the handler routine or not. During servicing of all handler routines, the kernel combines the results with a logical ‘‘or’’ operation. This is how it is finally able to determine whether the IRQ was serviced or not. kernel/irq/handle.c
irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
{
        ...
        do {
                ret = action->handler(irq, action->dev_id);
                if (ret == IRQ_HANDLED)
                        status |= action->flags;
                retval |= ret;
                action = action->next;
        } while (action);
        ...
        return retval;
}
Implementing Handler Routines Some important points must be noted when implementing handler routines. These greatly influence not only the speed but also the stability of the system.
Restrictions The main problem when implementing ISRs is that they execute in what is known as the interrupt context. Kernel code can sometimes run both in the regular context and in the interrupt context. To distinguish between these two variants and to design code accordingly, the kernel provides the in_interrupt function to indicate whether or not an interrupt is currently being serviced. The interrupt context differs in three important points from the normal context in which the kernel otherwise executes:
1.
Interrupts are executed asynchronously; in other words, they can occur at any time. As a result, the handler routine is not executed in a clearly defined process context, so no particular userspace address space can be assumed to be mapped. This prohibits access to userspace and prevents, above all, the copying of memory contents into and out of userspace addresses. For network drivers, for example, it is therefore not possible to forward received data directly to the waiting application. After all, it is not certain that the application waiting for the data is running at the time (this is, in fact, extremely unlikely).
2.
The scheduler may not be invoked in the interrupt context. It is therefore impossible to surrender control voluntarily.
3.
The handler routine may not go to sleep. Sleep states can only be broken when an external event causes a state change and wakes the process. However, because interrupts are not allowed in the interrupt context, the sleeping process would wait forever for the relieving news. As the scheduler may also not be invoked, no other process can be selected to replace the current sleeping process. It is not, of course, enough simply to make sure that only the direct code of a handler routine is free of possible sleeping instructions. All invoked procedures and functions (and procedures and functions invoked by these, in turn) must be free of expressions that could go to sleep. Checking that this is the case is not always trivial and must be done very carefully, particularly if control paths have numerous branches.
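A common consequence of these restrictions is that shared helper code must check which context it runs in before doing anything that might sleep. The following sketch shows one typical pattern; the helper name is invented, while in_interrupt, kmalloc, and the GFP constants are standard kernel interfaces.

/* Allocate a buffer from code that may run in either process or
 * interrupt context. GFP_ATOMIC never sleeps, GFP_KERNEL may. */
static void *my_alloc_buffer(size_t len)
{
        gfp_t gfp = in_interrupt() ? GFP_ATOMIC : GFP_KERNEL;

        return kmalloc(len, gfp);
}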
Implementing Handlers Recall that the prototype of ISR functions is specified by irq_handler_t. I have not shown the actual definition of this typedef, but do so now:
typedef irqreturn_t (*irq_handler_t)(int, void *);

irq specifies the IRQ number, and dev_id is the device ID passed when the handler is registered. irqreturn_t is another typedef to a simple integer.
Note that the prototype of ISRs was changed during the development of 2.6.19! Before, the arguments of the handler routine also included a pointer to the saved registers:
irqreturn_t (*handler)(int irq, void *dev_id, struct pt_regs *regs);
Interrupt handlers are obviously hot code paths, and time is very critical. Although most handlers do not need the register state, time and stack space are required to pass a pointer to it to every ISR. Removing this pointer from the prototype is thus a good idea.16 Handlers that need the register set can still access it. The kernel defines a global per-CPU array that stores the registers, and get_irq_regs from include/asm-generic/irq_regs.h can be used to retrieve a pointer to the pt_regs instance. This instance contains the register setting that was active when the switch to kernel mode was made. The information is not used by normal device drivers but sometimes comes in useful when debugging kernel problems. Again we emphasize that interrupt handlers can only use two return values: IRQ_HANDLED if the IRQ was handled correctly, or IRQ_NONE if the ISR did not feel responsible for the IRQ. What are the tasks of a handler routine? To service a shared interrupt, the routine must first check whether the IRQ is intended for it. If the peripheral device is of a more modern design, the hardware offers a simple method of performing this check, usually by means of a special device register. If the device has caused an interrupt, the register value is set to 1. In this case, the handler routine must restore the value to its default (usually 0) and then start normal servicing of the interrupt. If it finds the value 0, it can be sure that the managed device is not the source of the interrupt, and control can be returned to the higher-level code. If a device does not have a state register of this kind, the option of manual polling still remains. Each time an interrupt occurs, the handler checks whether data are available for the device. If so, the data are processed. If not, the routine is terminated. A handler routine can, of course, be responsible for several devices at the same time, for example, two network cards of the same type. If an IRQ is received, the same code is executed on both cards because both handler functions point to the same position in the kernel code. If the two devices use different IRQ numbers, the handler routine can differentiate between them. If they share a common IRQ, reference can still be made to the device-specific dev_id field to uniquely identify each card.
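The checks just described translate into an ISR along the following lines. The device structure, register names, and helper functions are invented for the example; only the irq_handler_t signature and the IRQ_HANDLED/IRQ_NONE return values are dictated by the kernel.

static irqreturn_t my_card_interrupt(int irq, void *dev_id)
{
        struct my_card *card = dev_id;                       /* identifies the card */
        u32 status = readl(card->mmio + MY_CARD_IRQ_STATUS); /* hypothetical register */

        if (!status)
                return IRQ_NONE;        /* not our device (shared IRQ line) */

        writel(status, card->mmio + MY_CARD_IRQ_STATUS);     /* reset to default */

        if (status & MY_CARD_RX)
                my_card_receive(card);  /* hypothetical helper */
        if (status & MY_CARD_TX_DONE)
                my_card_tx_complete(card);

        return IRQ_HANDLED;
}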
14.2 Software Interrupts
Software interrupts enable the kernel to defer tasks. Because they function in a similar way to the interrupts described above but are implemented fully in the software, they are logically enough known as software interrupts or softIRQs. The kernel is informed of an anomalous condition by means of a software interrupt, and the situation is resolved at some later time by special handler routines. As already noted, the kernel services all pending software interrupts at the end of do_IRQ so that regular activation is ensured.

16 Since the patch that introduced the change had to change every ISR, it might well be the one to touch most files at a single blow in the kernel history.
From a more abstract view, software interrupts can therefore be described as a form of kernel activity that is deferred to a later point in time. However, despite the clear similarities between hardware and software interrupts, they are not always comparable. The central component of the softIRQ mechanism is a table with 32 entries to hold elements of the softirq_action type. This data type has a very simple structure and consists of two elements only:
struct softirq_action {
        void (*action)(struct softirq_action *);
        void *data;
};
Whereas action is a pointer to the handler routine executed by the kernel when a software interrupt occurs, data accepts a nonspecified pointer to private data of the handler function. The definition of the data structure is architecture-independent, as is the complete implementation of the softIRQ mechanism. With the exception of processing activation, no processor-specific functions or features are deployed; this is in clear contrast to normal interrupts. Software interrupts must be registered before the kernel can execute them. The open_softirq function is provided for this purpose. It writes the new softIRQ at the desired position in the softirq_vec table: kernel/softirq.c
void open_softirq(int nr, void (*action)(struct softirq_action*), void *data)
{
        softirq_vec[nr].data = data;
        softirq_vec[nr].action = action;
}

data is used as a parameter each time the action softIRQ handler is called.
The fact that each softIRQ has a unique number immediately suggests that softIRQs are relatively scarce resources that may not be used randomly by all manner of device drivers and kernel parts but must be used judiciously. By default, only 32 softIRQs may be used on a system. However, this limit is not too restrictive because softIRQs act as a basis for implementing other mechanisms that also defer work and are better adapted to the needs of device drivers. The corresponding techniques (tasklets, work queues, and kernel timers) are discussed below. Only the central kernel code uses software interrupts. SoftIRQs are used at a few points only, but these are all the more important:
enum {
        HI_SOFTIRQ=0,
        TIMER_SOFTIRQ,
        NET_TX_SOFTIRQ,
        NET_RX_SOFTIRQ,
        BLOCK_SOFTIRQ,
        TASKLET_SOFTIRQ,
        SCHED_SOFTIRQ,
#ifdef CONFIG_HIGH_RES_TIMERS
        HRTIMER_SOFTIRQ,
#endif
};
Two serve to implement tasklets (HI_SOFTIRQ and TASKLET_SOFTIRQ), two are used for send and receive operations in networks (NET_TX_SOFTIRQ and NET_RX_SOFTIRQ, the source of the softIRQ mechanism and its most important application), one is used by the block layer to implement asynchronous request completions (BLOCK_SOFTIRQ), and one is used by the scheduler (SCHED_SOFTIRQ) to implement periodic load balancing on SMP systems. When high-resolution timers are enabled, they also require a softIRQ (HRTIMER_SOFTIRQ). Numbering of the softIRQs produces a priority sequence, which does not affect the frequency of execution of individual handler routines or their priority with respect to other system activities, but does define the sequence in which the routines are executed if several are marked as active or pending at the same time. raise_softirq(int nr) is used to raise a software interrupt (similarly to a normal interrupt). The num-
ber of the desired softIRQ is passed as a parameter. This function sets the corresponding bit in the per-CPU variable irq_stat[smp_processor_id].__softirq_pending. This marks the softIRQ for execution but defers execution. By using a processor-specific bitmap, the kernel ensures that several softIRQs — even identical ones — can be executed on different CPUs at the same time. Provided raise_softirq was not called in the interrupt context, wakeup_softirqd is called to wake the softIRQ daemon; this is one of the two alternative ways of launching the processing of softIRQs. The daemon is discussed in more detail in Section 14.2.2.
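Internally, raising a softIRQ boils down to a few bitmap operations. The following is a simplified sketch of what raise_softirq roughly does; the real kernel code splits this across raise_softirq, raise_softirq_irqoff, and __raise_softirq_irqoff and omits nothing essential beyond annotations:

/* Simplified sketch; not the literal kernel implementation. */
void raise_softirq_sketch(unsigned int nr)
{
        unsigned long flags;

        local_irq_save(flags);                  /* protect the per-CPU bitmap */
        or_softirq_pending(1UL << nr);          /* mark softIRQ nr as pending */

        /*
         * If we are not in interrupt context, do_IRQ will not run the
         * pending softIRQs for us, so wake the per-CPU daemon instead.
         */
        if (!in_interrupt())
                wakeup_softirqd();

        local_irq_restore(flags);
}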
14.2.1 Starting SoftIRQ Processing There are several ways of starting softIRQ processing, but all come down to invoking the do_softirq function. For this reason, let’s take a closer look at this function. Figure 14-11 shows the corresponding code flow diagram that presents the essential steps. The function first ensures that it is not in the interrupt context (meaning, of course, that a hardware interrupt is involved). If it is, it terminates immediately. Because softIRQs are used to execute time-uncritical parts of ISRs, the code itself must not be called within an interrupt handler. With the help of local_softirq_pending, the bitmap of all softIRQs set on the current CPU is determined. If any softIRQ is waiting to be processed, then __do_softirq is called. This function resets the original bitmap to 0; in other words, all softIRQs are deleted. Both actions take place (on the current processor) with disabled interrupts to prevent modification of the bitmap as a result of interference by other processes. Subsequent code, on the other hand, executes with interrupts
enabled. This allows the original bitmap to be modified at any time during processing of the softIRQ handlers. The action functions in softirq_vec are invoked in a while loop for each enabled softIRQ.
Figure 14-11: Code flow diagram for do_softirq.
Once all marked softIRQs have been serviced, the kernel checks whether new softIRQs have been marked in the original bitmap in the meantime. Processing is restarted if at least one softIRQ that was not serviced in the previous cycle is pending and the number of restarts does not exceed MAX_SOFTIRQ_RESTART (usually set to 10). In this case, the marked softIRQs are again processed in sequence, and this operation is repeated until no unprocessed softIRQs remain after execution of all handlers. Should softIRQs still be pending after MAX_SOFTIRQ_RESTART restarts of the processing, wakeup_softirqd is called to wake up the softIRQ daemon.
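For reference, the processing loop just described can be condensed into the following sketch of __do_softirq. It is heavily simplified (accounting, tracing, and preemption details are omitted), but it shows the reset of the bitmap, the handler loop, and the restart logic:

/* Condensed sketch of __do_softirq; not the literal kernel code. */
static void __do_softirq_sketch(void)
{
        struct softirq_action *h;
        __u32 pending = local_softirq_pending();
        int max_restart = MAX_SOFTIRQ_RESTART;

restart:
        set_softirq_pending(0);         /* clear the bitmap; IRQs are disabled */
        local_irq_enable();

        h = softirq_vec;
        do {
                if (pending & 1)
                        h->action(h);   /* invoke the registered handler */
                h++;
                pending >>= 1;
        } while (pending);

        local_irq_disable();
        pending = local_softirq_pending();
        if (pending && --max_restart)
                goto restart;           /* new softIRQs were raised meanwhile */

        if (pending)
                wakeup_softirqd();      /* defer the rest to ksoftirqd */
}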
14.2.2 The SoftIRQ Daemon The task of the softIRQ daemon is to execute softIRQs asynchronously to remaining kernel code. To this end, each system processor is allocated its own daemon named ksoftirqd. wakeup_softirqd is invoked at two points in the kernel to wake up the daemon:
❑
In do_softirq, as just mentioned.
❑
At the end of raise_softirq_irqoff. This function is called by raise_softirq internally, and can also be used directly if the kernel has interrupts turned off at the moment.
The wake-up function itself can be dealt with in a few lines. A pointer to the task_struct of the softIRQ daemon is read from a per-CPU variable by means of a few macros. If the current task state is not already TASK_RUNNING, it is put back in the list of processes ready to run by means of wake_up_process (see Chapter 2). Although this does not immediately start servicing of all pending software interrupts, the daemon (which runs with priority 19) is selected as soon as the scheduler has nothing better to do.
The softIRQ daemons of the system are generated shortly after init is called at system startup using the initcall mechanism described in Appendix D. After initialization, each daemon executes the following endless loop17 : kernel/softirq.c
static int ksoftirqd(void * __bind_cpu)
{
        ...
        while (!kthread_should_stop()) {
                if (!local_softirq_pending()) {
                        schedule();
                }

                __set_current_state(TASK_RUNNING);

                while (local_softirq_pending()) {
                        do_softirq();
                        cond_resched();
                }

                set_current_state(TASK_INTERRUPTIBLE);
        }
        ...
}
Each time it is awakened, the daemon first checks whether marked softIRQs are pending, as otherwise control can be passed to another process by explicitly invoking the scheduler. If there are marked softIRQs, the daemon gets on with servicing them. In a while loop the two functions do_softirq and cond_resched are invoked repeatedly until no marked softIRQs remain. cond_resched ensures that the scheduler is called if the TIF_NEED_RESCHED flag was set for the current process (see Chapter 2). This is possible because all functions execute with enabled hardware interrupts.
14.3 Tasklets
Software interrupts are the most effective way of deferring the performance of activities to a future point in time. However, this deferral mechanism is very complicated to handle. Because softIRQs can be serviced simultaneously and independently on several processors, the handler routine of one and the same softIRQ can run on several CPUs at the same time. This represents a key contribution to the effectiveness of the concept — network implementation is a clear winner on multiprocessor systems. However, the handler routines must be designed to be fully reentrant and thread-safe. Alternatively, critical areas must be protected with spinlocks (or with other IPC mechanisms; see Chapter 5), and this requires a great deal of careful thought. Tasklets and work queues are mechanisms for the deferred execution of work; their implementation is based on softIRQs, but they are easier to use and therefore more suitable for device drivers (and also for other general kernel code). 17 kthread_should_stop() returns a true value if the softIRQ daemon is stopped explicitly. Since this happens only when a CPU is removed from the system, I will not discuss this case. I also omit preemption handling for the sake of clarity.
Before going into the technical details, a word of caution on the terminology used: For historical reasons, the term bottom half is often used to mean two different things; first, it refers to the lower half of the code of an ISR that performs no time-critical actions. Unfortunately, the mechanism used in earlier kernel versions to defer the execution of actions was also referred to as the bottom half, with the result that the term is often used ambiguously. In the meantime, bottom halves no longer exist as a kernel mechanism. They were discarded during the development of 2.5 and replaced with tasklets, a far better substitute. Tasklets are "small tasks" that perform mini jobs that would be wasted on full processes.
14.3.1 Generating Tasklets Not surprisingly, the central data structure of each tasklet is called tasklet_struct and is defined as follows:
struct tasklet_struct {
        struct tasklet_struct *next;
        unsigned long state;
        atomic_t count;
        void (*func)(unsigned long);
        unsigned long data;
};
From the perspective of a device driver, the most important element is func. It points to the address of a function whose execution is to be deferred. data is passed as a parameter when the function is executed. next is a pointer used to build a linked list of tasklet_struct instances. This allows several tasks to be
queued for execution. state indicates the current state of the task — as for a genuine task. However, only two options are available, each represented by a separate bit in state, which is why they can be set and removed inde-
pendently of each other: ❑
TASKLET_STATE_SCHED is set when the tasklet is registered in the kernel and scheduled for execu-
tion. ❑
TASKLET_STATE_RUN indicates that a tasklet is currently being executed.
The second state is only of relevance on SMP systems. It is used to protect tasklets against concurrent execution on several processors. The atomic counter count is used to disable tasklets already scheduled. If its value is not equal to 0, the corresponding tasklet is simply ignored when all pending tasklets are next executed.
14.3.2 Registering Tasklets tasklet_schedule registers a tasklet in the system:
static inline void tasklet_schedule(struct tasklet_struct *t);
If the TASKLET_STATE_SCHED bit is set, registration is terminated because the tasklet is already registered. Otherwise, the tasklet is placed at the start of a list whose list header is the CPU-specific variable tasklet_vec. This list contains all registered tasklets and uses the next element for linking purposes. The tasklet list is marked for processing once a tasklet has been registered.
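As a usage example, the following sketch shows how a hypothetical driver defers the time-uncritical part of its interrupt processing into a tasklet; all names except the kernel API are invented, and the initial data value 42 is arbitrary:

#include <linux/interrupt.h>
#include <linux/kernel.h>

static void my_tasklet_func(unsigned long data)
{
        /* time-uncritical part of the interrupt processing */
        printk(KERN_DEBUG "tasklet executed with data %lu\n", data);
}

/* statically declared and initially enabled tasklet */
static DECLARE_TASKLET(my_tasklet, my_tasklet_func, 42);

static irqreturn_t my_isr(int irq, void *dev_id)
{
        /* time-critical part: acknowledge the hardware, then defer the rest */
        tasklet_schedule(&my_tasklet);
        return IRQ_HANDLED;
}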
14.3.3 Executing Tasklets The most important part in the life of a tasklet is its execution. Because tasklets are implemented on top of softIRQs, they are always executed when software interrupts are handled. Tasklets are linked with the TASKLET_SOFTIRQ softIRQ. Consequently, it is sufficient to invoke raise_softirq(TASKLET_SOFTIRQ) to execute the tasklets of the current processor at the next opportunity. The kernel uses tasklet_action as the action function of the softIRQ. The function first determines the CPU-specific list in which the tasklets marked for execution are linked. It then redirects the list header to a local element, and thus removes all entries from the public list. They are then processed one after the other in the following loop:
kernel/softirq.c
static void tasklet_action(struct softirq_action *a)
{
        ...
        while (list) {
                struct tasklet_struct *t = list;

                list = list->next;

                if (tasklet_trylock(t)) {
                        if (!atomic_read(&t->count)) {
                                if (!test_and_clear_bit(TASKLET_STATE_SCHED, &t->state))
                                        BUG();
                                t->func(t->data);
                                tasklet_unlock(t);
                                continue;
                        }
                        tasklet_unlock(t);
                }
                ...
        }
        ...
}
Executing tasklets in a while loop is similar to the mechanism used when handling softIRQs. Because a tasklet can be executed on only one processor at a time, but other tasklets may run in parallel, tasklet-specific locking is required. The state element is used as the locking variable. Before the handler function of a tasklet is executed, the kernel uses tasklet_trylock to check whether the state of the tasklet is TASKLET_STATE_RUN; in other words, whether it is already running on another processor of the system:
static inline int tasklet_trylock(struct tasklet_struct *t)
{
        return !test_and_set_bit(TASKLET_STATE_RUN, &(t)->state);
}
If the corresponding bit has not yet been set, it is set now. If the count element is not equal to 0, the tasklet is regarded as deactivated. In this case, the code is not executed. Once both checks have been passed successfully, the kernel executes the handler function of the tasklet with the corresponding function parameters by invoking t->func(t->data). Finally, the TASKLET_STATE_RUN bit of the tasklet is deleted using tasklet_unlock. If new tasklets were queued for the current processor during execution of the tasklets, the softIRQ TASKLET_SOFTIRQ is raised to execute the new tasklets as soon as possible. (Because the code needed
to do this is not particularly interesting, it is not included above.) In addition to normal tasklets, the kernel uses a second kind of tasklet with a "higher" priority. Its implementation is absolutely identical to that of normal tasklets except for the following modifications: ❑
HI_SOFTIRQ is used as a softIRQ instead of TASKLET_SOFTIRQ; its associated action function is tasklet_hi_action.
❑
The registered tasklets are queued in the CPU-specific variable tasklet_hi_vec. This is done using tasklet_hi_schedule.
In this context, "higher priority" means that the softIRQ handler HI_SOFTIRQ is executed before all other handlers — particularly before network handlers that account for the main part of software interrupt activity. Currently, mostly sound card drivers make use of this alternative because deferring actions too long can impair the sound quality of audio output. Network cards for high-speed transmission lines can also profit from this mechanism.
14.4 Wait Queues and Completions
Wait queues are used to enable processes to wait for a particular event to occur without the need for constant polling. Processes sleep during wait time and are woken up automatically by the kernel when the event takes place. Completions are mechanisms that build on wait queues and are used by the kernel to wait for the end of an action. Both mechanisms are frequently used, primarily by device drivers, as shown in Chapter 6.
14.4.1 Wait Queues Data Structures Each wait queue has a head represented by the following data structure: <wait.h>
struct __wait_queue_head {
        spinlock_t lock;
        struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;
Because wait queues can also be modified in interrupts, a spinlock named lock must be acquired before the queue is manipulated (see Chapter 5). task_list is a doubly linked list used to implement what such lists are best at: queues. The elements in the queue are instances of the following data structure: <wait.h>
struct __wait_queue {
        unsigned int flags;
        void *private;
        wait_queue_func_t func;
        struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
❑
flags has the value WQ_FLAG_EXCLUSIVE or it does not — other flags are not defined at the moment. A set WQ_FLAG_EXCLUSIVE flag indicates that the waiting process would like to be woken up exclusively (this is discussed in more detail shortly).
❑
private is a pointer to the task structure of the waiting process. The variable can basically point
to some arbitrary private data, but this is only seldom used in the kernel, so I will not discuss these cases any further. ❑
func is invoked to wake the element.
❑
task_list is used as a list element to position wait_queue_t instances in a wait queue.
Wait queue use is divided into two parts:
1.
To put the current process to sleep in a wait queue, it is necessary to invoke the wait_event function (or one of its equivalents, discussed below). The process goes to sleep and relinquishes control to the scheduler. The kernel invokes this function typically after it has issued a request to a block device to transfer data. Because transfer does not take place immediately and there is nothing else to do in the meantime, the process can sleep and therefore make CPU time available to other processes in the system.
2.
At another point in the kernel — in our example, after data have arrived from the block device — the wake_up function (or one of its equivalents, discussed below) must be invoked to wake the sleeping processes in the wait queue.
When processes are put to sleep using wait_event, you must always ensure that there is a corresponding wake_up call at another point in the kernel.
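The typical pairing looks as follows. This is a hedged, minimal sketch: my_wq, my_condition, wait_for_data, and data_arrived are invented names, and a real driver would usually embed the wait queue head in its device structure.

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);
static int my_condition;                /* the event the sleeper waits for */

/* consumer side: sleeps until my_condition becomes true */
static void wait_for_data(void)
{
        wait_event(my_wq, my_condition != 0);
        /* ... the data can be processed now ... */
}

/* producer side, for example an interrupt handler: fulfills the condition */
static void data_arrived(void)
{
        my_condition = 1;
        wake_up(&my_wq);                /* wakes the processes sleeping on my_wq */
}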
Putting Processes to Sleep The add_wait_queue function is used to add a task to a wait queue; this function delegates its work to __add_wait_queue once the necessary spinlock has been acquired: <wait.h>
static inline void __add_wait_queue(wait_queue_head_t *head, wait_queue_t *new)
{
        list_add(&new->task_list, &head->task_list);
}
Nothing more need be done than to add the new task to the wait list using the standard list_add list function. add_wait_queue_exclusive is also available. It works in the same way as add_wait_queue but inserts the process at the queue tail and also sets its flag to WQ_FLAG_EXCLUSIVE (what is behind this flag is discussed
below). Another method to put a process to sleep on a wait queue is prepare_to_wait. In addition to the parameters required by add_wait_queue, a task state is required as well: kernel/wait.c
void fastcall prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
{
        unsigned long flags;

        wait->flags &= ~WQ_FLAG_EXCLUSIVE;
        spin_lock_irqsave(&q->lock, flags);
        if (list_empty(&wait->task_list))
                __add_wait_queue(q, wait);
        ...
        set_current_state(state);
        spin_unlock_irqrestore(&q->lock, flags);
}
After calling __add_wait_queue as discussed above, the kernel sets the current state of the process to the state passed to prepare_to_wait. prepare_to_wait_exclusive is a variant that sets the WQ_FLAG_EXCLUSIVE flag and appends the wait queue element to the queue tail.
Two standard methods are available to initialize a wait queue entry:
1.
init_waitqueue_entry initializes a dynamically allocated instance of wait_queue_t: <wait.h>
static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
        q->flags = 0;
        q->private = p;
        q->func = default_wake_function;
}

default_wake_function is just a parameter conversion front end that attempts to wake the process using the try_to_wake_up function described in Chapter 2.
2.
DEFINE_WAIT allows for creating a static instance of wait_queue_t that is automatically initialized:
<wait.h>
#define DEFINE_WAIT(name)                                               \
        wait_queue_t name = {                                           \
                .private        = current,                              \
                .func           = autoremove_wake_function,             \
                .task_list      = LIST_HEAD_INIT((name).task_list),     \
        }
autoremove_wake_function is now used to wake the process. The function not only calls default_wake_function, but also removes the wait queue element from the wait queue.
add_wait_queue is normally not used directly. It is more common to use wait_event. This is a macro that requires two parameters:
1. A wait queue to wait on.
2. A condition in the form of a C expression of the event to wait for.
All the macro needs to do is to ensure that the condition is not already fulfilled; in this case, processing can be immediately stopped because there is nothing to wait for. The hard work is delegated to __wait_event: <wait.h>
#define __wait_event(wq, condition)                                     \
do {                                                                    \
        DEFINE_WAIT(__wait);                                            \
                                                                        \
        for (;;) {                                                      \
                prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);    \
                if (condition)                                          \
                        break;                                          \
                schedule();                                             \
        }                                                               \
        finish_wait(&wq, &__wait);                                      \
} while (0)
After setting up the wait queue element with DEFINE_WAIT, the macro produces an endless loop. The process is put to sleep on the wait queue using prepare_to_wait. Every time it is woken up, the kernel checks if the specified condition is fulfilled, and exits the endless loop if this is so. Otherwise, control is given to the scheduler, and the task is put to sleep again. It is essential that both wait_event and __wait_event are implemented as macros — this allows for specifying conditions given by standard C expressions! Since C does not support any nifty features like higher-order functions, this behavior would be impossible (or at least very clumsy) to achieve using regular procedures.
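When more control over the sleep loop is needed, the same idiom can be spelled out manually with DEFINE_WAIT, prepare_to_wait, and finish_wait. The sketch below is hypothetical, reuses my_wq and my_condition from the earlier example, and mirrors what the macro expands to:

static void wait_for_condition_manually(void)
{
        DEFINE_WAIT(wait);

        for (;;) {
                /* queue ourselves and change the task state */
                prepare_to_wait(&my_wq, &wait, TASK_UNINTERRUPTIBLE);
                if (my_condition)
                        break;          /* condition fulfilled, stop sleeping */
                schedule();             /* give up the CPU until woken */
        }
        finish_wait(&my_wq, &wait);     /* back to TASK_RUNNING, dequeue */
}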
When the condition is fulfilled, finish_wait sets the task state back to TASK_RUNNING and removes the entry from the wait queue list.18 In addition to wait_event, the kernel defines several other functions to place the current process in a wait queue. Their implementation is practically identical to that of wait_event: <wait.h>
#define wait_event_interruptible(wq, condition) { ... }
#define wait_event_timeout(wq, condition, timeout) { ... }
#define wait_event_interruptible_timeout(wq, condition, timeout) { ... }
❑
wait_event_interruptible uses the TASK_INTERRUPTIBLE task state. The sleeping process can therefore be woken up by receiving a signal.
❑
wait_event_timeout waits for the specified condition to be fulfilled, but stops waiting after a time-out specified in jiffies. This prevents a process from sleeping for ever.
❑
wait_event_interruptible_timeout puts the process to sleep so that it can be woken up by
receiving a signal. It also registers a time-out. Kernel nomenclature is usually not a place for surprises! Additionally, the kernel defines a number of deprecated functions (sleep_on, sleep_on_timeout, interruptible_sleep_on, and interruptible_sleep_on_timeout) that are not supposed to be used in new code anymore. They still sit around for compatibility purposes.
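Two short, hedged examples illustrate the variants. The error codes and the one-second time-out are arbitrary choices, and my_wq and my_condition are the invented names used before:

/* Sleep interruptibly: a signal aborts the wait. */
static int wait_for_data_interruptible(void)
{
        if (wait_event_interruptible(my_wq, my_condition != 0))
                return -ERESTARTSYS;    /* interrupted by a signal */
        return 0;
}

/* Sleep for at most one second (HZ jiffies). */
static int wait_for_data_with_timeout(void)
{
        if (!wait_event_timeout(my_wq, my_condition != 0, HZ))
                return -ETIMEDOUT;      /* 0 means the time-out elapsed */
        return 0;
}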
Waking Processes The kernel defines a series of macros that are used to wake the processes in a wait queue. They are all based on the same function:
<wait.h>
#define wake_up(x)                      __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, NULL)
#define wake_up_nr(x, nr)               __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, nr, NULL)
#define wake_up_all(x)                  __wake_up(x, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 0, NULL)
#define wake_up_interruptible(x)        __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
#define wake_up_interruptible_nr(x, nr) __wake_up(x, TASK_INTERRUPTIBLE, nr, NULL)
#define wake_up_interruptible_all(x)    __wake_up(x, TASK_INTERRUPTIBLE, 0, NULL)
__wake_up delegates work to __wake_up_common after acquiring the necessary lock of the wait queue
head. kernel/sched.c
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int sync, void *key)
{
        wait_queue_t *curr, *next;
        ...
18 However, some care is required when doing this because finish_wait is invoked from many places and the task could have been removed by the wake-up function. However, the kernel manages to get everything right by careful manipulation of the list elements.
q selects the desired wait queue and mode specifies what state processes may have in order to be woken up. nr_exclusive indicates how many tasks with a set WQ_FLAG_EXCLUSIVE are to be woken up.
The kernel then iterates through the sleeping tasks and invokes their wake-up function func: kernel/sched.c
        list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
                unsigned flags = curr->flags;

                if (curr->func(curr, mode, sync, key) &&
                                (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                        break;
        }
}
The list is scanned repeatedly until there are either no further tasks or until the number of exclusive tasks specified by nr_exclusive has been woken up. This restriction is used to avoid a problem known as the thundering herd. If several processes are waiting for exclusive access to a resource, it makes no sense to wake all waiting processes because all but one will have to be put back to sleep. nr_exclusive generalizes this restriction. The most frequently used wake_up function sets nr_exclusive to 1 and thus makes sure that only one exclusive task is woken up. Recall from above that WQ_FLAG_EXCLUSIVE tasks are added to the end of the wait queue. This implementation ensures that in mixed queues all normal tasks are woken up first, and only then is the restriction for exclusive tasks taken into consideration. It is useful to wake all processes in a wait queue if the processes are waiting for a data transfer to terminate. This is because the data of several processes can be read at the same time without mutual interference.
14.4.2 Completions Completions resemble the semaphores discussed in Chapter 5 but are implemented on the basis of wait queues. What interests us is the completions interface. Two actors are present on the stage: One is waiting for something to be completed, and the other declares when this completion has happened. Actually, this is a simplification: An arbitrary number of processes can wait for a completion. To represent the ‘‘something’’ that the processes wait for to be completed, the kernel uses the following data structure:
struct completion {
        unsigned int done;
        wait_queue_head_t wait;
};

done allows for handling the situation in which an event is completed before some other process waits for its completion. This is discussed below. wait is a standard wait queue on which waiting processes are
put to sleep.
init_completion initializes a completion instance that was dynamically allocated, while DECLARE_COMPLETION is the macro of choice to set up a static instance of the data structure.
Processes can be added to the list using wait_for_completion, where they wait (in exclusive sleep state) until their request is processed by some part of the kernel. The function requires a completion instance as a parameter:
void wait_for_completion(struct completion *);
int wait_for_completion_interruptible(struct completion *x);
unsigned long wait_for_completion_timeout(struct completion *x,
                                          unsigned long timeout);
unsigned long wait_for_completion_interruptible_timeout(struct completion *x,
                                                         unsigned long timeout);
Several refined variants are additionally available: ❑
Normally processes that wait for completion of an event are in an uninterruptible state, but this can be changed if wait_for_completion_interruptible is used. The function returns -ERESTARTSYS if the process was interrupted, and 0 otherwise.
❑
wait_for_completion_timeout waits for a completion event to occur, but provides an additional time-out in jiffies that cancels waiting after a defined time. This helps to prevent waiting for an event indefinitely. If the completion is finished before the time-out occurs, then the remaining time is returned as result, otherwise 0.
❑
wait_for_completion_interruptible_timeout is a combination of both variants.
Once the request has been processed by another part of the kernel, either complete or complete_all must be invoked from there to wake the waiting processes. Because only one process can be removed from the completions list at each invocation, the function must be invoked exactly n times for n waiting processes. complete_all, on the other hand, wakes up all processes waiting for the completion. complete_and_exit is a small wrapper that first applies complete and then calls do_exit to finish the kernel thread.
void complete(struct completion *);
void complete_all(struct completion *);
kernel/exit.c
NORET_TYPE void complete_and_exit(struct completion *comp, long code);

complete, complete_all, and complete_and_exit require a pointer to an instance of struct completion
as a parameter that identifies the completion in question. Now what is the meaning of done in struct completion? Each time complete is called, the counter is incremented by 1, and the wait_for functions only put the caller to sleep if done is equal to 0. Effectively, this means that processes do not wait for events that are already completed. complete_all works similarly, but sets the counter to the largest possible value (UINT_MAX/2 — half of the maximal value of an unsigned integer because the counter can also assume negative values) such that processes that call one of the wait_for functions after the event has completed will never go to sleep.
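A minimal, hedged usage sketch: one context submits a request and sleeps on the completion, while another context (for instance an interrupt handler) declares the operation finished. All names except the kernel API are invented.

#include <linux/completion.h>

static DECLARE_COMPLETION(my_op_done);

/* waits until the operation has been completed elsewhere */
static void start_and_wait(void)
{
        /* submit a request to hardware or to another thread here ... */
        wait_for_completion(&my_op_done);
}

/* called from the context that finishes the operation */
static void my_op_finished(void)
{
        complete(&my_op_done);          /* wake exactly one waiter */
}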
14.4.3 Work Queues Work queues are a further means of deferring actions until later. Because they are executed in process context by means of kernel daemons, the functions can sleep as long as they like — it does not matter at all to the kernel. During the development of 2.5, work queues were designed as a replacement for the keventd mechanism formerly used. Each work queue has an array with as many entries as there are processors in the system. Each entry lists tasks to be performed at a later time. For each work queue, the kernel generates a new kernel daemon in whose context the deferred tasks are performed using the wait queue mechanism just described. A new work queue is created by invoking one of the functions create_workqueue or create_singlethread_workqueue. While the first one creates a worker thread on all CPUs, the latter one just creates a single thread on the first CPU of the system. Both functions use __create_workqueue_key internally19 : kernel/workqueue.c
struct workqueue_struct *__create_workqueue(const char *name, int singlethread)
The name argument indicates the name under which the generated daemon is shown in the process list. If singlethread is set to 0, a thread is created on every CPU of the system, otherwise just on the first one. All tasks pushed onto work queues must be packed into instances of the work_struct structure, in which the following elements are important from the point of view of the work queue user: <workqueue.h>
struct work_struct;
typedef void (*work_func_t)(struct work_struct *work);

struct work_struct {
        atomic_long_t data;
        struct list_head entry;
        work_func_t func;
};

entry is used as usual to group several work_struct instances in a linked list. func is a pointer to the function to be deferred. It is supplied with a pointer to the instance of work_struct that was used to submit the work. This allows the worker function to obtain the data element that can point to arbitrary data associated with the work_struct.
19 Another variant, create_freezable_workqueue, is available to create work queues that are friendly toward system hibernation. Since I do not discuss any mechanisms related to power management, I will also not discuss this alternative any further. Also note that the prototype of __create_workqueue is simplified and does not contain parameters related to lock depth management and power management.
Why does the kernel use atomic_long_t as the data type for a pointer to some arbitrary data, and not void * as usual? In fact, former kernel versions defined work_struct as follows: <workqueue.h>
struct work_struct {
        ...
        void (*func)(void *);
        void *data;
        ...
};

data was represented by a pointer as expected. However, the kernel does use a little trick — which is
fairly on the edge of being dirty — to squeeze more information into the structure without spending more memory. Because pointers are aligned on 4-byte boundaries on all supported architectures, the first 2 bits are guaranteed to be zero. They are therefore abused to contain flag bits. The remaining bits hold the pointer information as usual. The following macros allow masking out the flag bits: <workqueue.h>
#define WORK_STRUCT_FLAG_MASK (3UL)
#define WORK_STRUCT_WQ_DATA_MASK (~WORK_STRUCT_FLAG_MASK)
Currently only a single flag is defined: WORK_STRUCT_PENDING allows for finding out whether a delayable work item is currently pending (if the bit is set) or not. The auxiliary macro work_pending(work) allows for checking for the bit. Note that the atomic data type of data ensures that the bit can be modified without concurrency problems. To simplify declaring and filling a static instance of this structure, the kernel provides the INIT_WORK(work, func) macro, which supplies an existing instance of work_struct with a delayed function. If a data argument is required, it must be set afterward. There are two ways of adding a work_struct instance to a work queue — queue_work and queue_delayed_work. The first alternative has the following prototype: kernel/workqueue.c
int fastcall queue_work(struct workqueue_struct *wq, struct work_struct *work)
It adds work to the work queue wq; the work itself is performed at an undefined time (when the scheduler selects the daemon). To ensure that work queued will be executed after a specified time interval has passed since submission, the work_struct needs to be extended with a timer. The solution is as obvious as can be: <workqueue.h>
struct delayed_work {
        struct work_struct work;
        struct timer_list timer;
};

queue_delayed_work is used to submit instances of delayed_work to a work queue. It ensures that at least one time interval specified (in jiffies) by delay elapses before the deferred work is performed.
kernel/workqueue.c
int fastcall queue_delayed_work(struct workqueue_struct *wq,
                                struct delayed_work *dwork, unsigned long delay)
This function first generates a kernel timer whose time-out occurs after delay jiffies. The associated handler function then uses queue_work to add the work to the work queue in the normal way. The kernel generates a standard work queue named events. This queue can be used by all parts of the kernel for which it is not worthwhile creating a separate work queue. The two functions below, whose implementation I need not discuss in detail, must be used to place new work in this standard queue: kernel/workqueue.c
int schedule_work(struct work_struct *work)
int schedule_delayed_work(struct delayed_work *dwork, unsigned long delay)
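A hedged usage sketch for the standard events queue follows; my_deferred_func, my_work, and the one-second delay are invented for illustration. Because the function runs in the worker thread's process context, it is allowed to sleep:

#include <linux/workqueue.h>

static void my_deferred_func(struct work_struct *work)
{
        /* runs in process context and may therefore sleep */
}

static DECLARE_WORK(my_work, my_deferred_func);
static DECLARE_DELAYED_WORK(my_delayed_work, my_deferred_func);

static void submit_examples(void)
{
        schedule_work(&my_work);                        /* run as soon as possible */
        schedule_delayed_work(&my_delayed_work, HZ);    /* run after about one second */
}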
14.5 Summary
The kernel can be activated synchronously or asynchronously. While the preceding chapter discussed how system calls are employed for synchronous activation, you have seen in this chapter that there is a second, asynchronous activation method triggered from the hardware using interrupts. Interrupts are used when the hardware wants to notify the kernel of some condition, and there are various ways that interrupts can be implemented physically. After discussing the different possibilities, we have analyzed the generic data structures of the kernel that are employed to manage interrupts, and have seen how to implement flow handling for various IRQ types. The kernel has to provide service routines for IRQs, and some care is required to implement them properly. Most important, it is necessary to make these handlers as fast as possible, and the work is therefore often distributed into a quick top half and a slower bottom half that runs outside the interrupt context. The kernel offers some means to defer actions until a later point in time, and I have discussed the corresponding possibilities in this chapter: SoftIRQs are the software equivalent to hardware IRQs, and tasklets are built on this mechanism. While they enable the kernel to postpone work until later, they are not allowed to go to sleep. This is, however, possible with wait queues and work queues, also examined in this chapter.
Time Management
All the methods of deferring work to a future point in time discussed in this book so far do not cover one specific area — the time-based deferral of tasks. The different variants that have been discussed do, of course, give some indication of when a deferred task will be executed (e.g., tasklets when handling softIRQs), but it is not possible to specify an exact time or a time interval after which a deferred activity will be performed by the kernel. The simplest kind of usage in this respect is obviously the implementation of time-outs where the kernel on behalf of a userland process waits a specific period of time for the arrival of an event — for example, 10 seconds for a user to press a key as a last opportunity to cancel before an important operation is carried out. Other usages are widespread in user applications. The kernel itself also uses timers for various tasks, for example, when devices communicate with associated hardware, often using protocols with chronologically defined sequences. A large number of timers are used to specify wait timeouts in the TCP implementation. Depending on the job that needs to be performed, timers need to provide different characteristics, especially with respect to the maximal possible resolution. This chapter discusses the alternatives provided by the Linux kernel.
15.1 Overview
First of all, an overview of the subsystem that we are about to inspect in detail is presented.
15.1.1 Types of Timers The timing subsystem of the kernel has grown tremendously during the development of 2.6. For the initial releases, the timer subsystem consisted solely of what are now known as low-resolution timers. Essentially, low-resolution timers are centered around a periodic tick which happens at regular intervals. Events can be scheduled to be activated at one of these intervals. Pressure to extend this comparatively simple framework came predominantly from two sources:
❑
Devices with limited power (i.e., laptops, embedded systems, etc.) need to use as little energy as possible when there is nothing to do. If a periodic clock is running, there is, however, nearly always something to do — the tick must be provided. But if no users for the tick are present, it would basically not need to run. Nevertheless, the system needs to be brought from a low-power state into a state with higher power consumption just to implement the periodic tick.
❑
Multimedia-oriented applications need very precise timekeeping, for instance, to avoid frame skips in videos, or jumps during audio playback. This necessitated increasing the available resolution.
Finding a good solution agreeable to all developers (and users!) who come into contact with time management — and there is quite a large number of them — took many years and a good many proposed patches. The current state is rather unusual because two rather distinct types of timers are supported by the kernel: ❑
Classical timers have been available since the initial versions of the kernel. Their implementation is located in kernel/timer.c. A resolution of typically 4 milliseconds is provided, but the value depends on the frequency with which the machine’s timer interrupt is operated. These classical timers are called low-resolution or timer wheel timers.
❑
For many applications, especially media-oriented ones, a timer resolution of several milliseconds is not good enough. Indeed, recent hardware provides means of much more precise timing, which can, at least formally, achieve resolutions in the nanosecond range. During the development of kernel 2.6, an additional timer subsystem was added allowing the use of such timer sources. The timers provided by the new subsystem are conventionally referred to as high-resolution timers. Some code for high-resolution timers is always compiled into the kernel, but the implementation will only perform better than low-resolution timers if the configuration option HIGH_RES_TIMERS is set. The framework introduced by high-resolution timers is reused by low-resolution timers (in fact, low-resolution timers are implemented on top of the high-resolution mechanism).
Classical timers are bound by a fixed raster, while high-resolution clock events can essentially happen at arbitrary times; see Figure 15-1. Unless the dynamic ticks feature is active, it can also happen that ticks occur when no event expires. High-resolution events, in contrast, only occur when some event is due.
Figure 15-1: Comparison between low- and high-resolution timers. Why did the developers not choose the seemingly obvious path and improve the already existing timer subsystem, but instead added a completely new one? Indeed, some people tried to pursue this strategy, but the mature and robust structure of the old timer subsystem did not make it particularly easy to improve while still being efficient — and without creating new problems. Some more thoughts on this problem can be found in Documentation/hrtimers.txt.
Independent of the resolution, the kernel nomenclature distinguishes two types of timers: ❑
Time-outs — Represent events that are bound to happen after some time, but can and usually will be canceled before they occur. For example, consider that the network subsystem waits for an incoming packet that is bound to arrive within a certain period of time. To handle this situation, a timer is set that will expire after the time is over. Since packets usually arrive on time, chances are that the timer will be removed before it will actually go off. Besides, resolution is not very critical for these types of timers. When the kernel allows an acknowledgment to a packet to be sent within 10 seconds, it does not really matter if the time-out occurs after 10 or 10.001 seconds.
❑
Timers — Are used to implement temporal sequences. For instance, a sound card driver could want to issue some data to a sound card in small, periodic time intervals. Timers of this sort will usually expire and require much better resolution than time-outs.
An overview of the building blocks employed to implement the timing subsystem is given in Figure 15-2. Owing to the nature of an overview, it is not too precise, but gives a quick glance at what is involved in timekeeping, and how the components interact with each other. Many details are left to the following discussion.
Figure 15-2: Overview of the components that build up the timing subsystem. The raw hardware sits at the very bottom. Every typical system has several devices, usually implemented by clock chips, that provide timing functionality and can serve as clocks. Which hardware is available depends on the particular architecture. IA-32 and AMD64 systems, for instance, have a programmable interrupt timer (PIT, implemented by the 8253 chip) as a classical clock source that has only a very modest resolution and stability. CPU-local APICs (advanced programmable interrupt controllers), which were already mentioned in the context of IRQ handling, provide much better resolution and stability. They are suitable as high-resolution time sources, whereas the PIT is only good enough for low-resolution timers. Hardware naturally needs to be programmed by architecture-specific code, but the clock source abstraction provides a generic interface to all hardware clock chips. Essentially, read access to the current value of the running counter provided by a clock chip is granted.
Periodic events do not comply with a free running counter very well, thus another abstraction is required. Clock events are the foundation of periodic events. Clock events can, however, be more powerful. Some time devices can provide events at arbitrary, irregular points in time. In contrast to periodic event devices, they are called one-shot devices. The high-resolution timer mechanism is based on clock events, whereas the low-resolution timer mechanism utilizes periodic events that can either come directly from a low-resolution clock or from the highresolution subsystem. Two important tasks for which low-resolution timers assume responsibility are
1.
Handle the global jiffies counter. The value is incremented periodically (or at least it looks periodical to most parts of the kernel) and represents a particularly simple form of time reference.1
2.
Perform per-process accounting. This also includes handling classical low-resolution timers, which can be associated with any process.
15.1.2 Configuration Options Not only are there two distinct (but nevertheless related) timing subsystems in the kernel, but the situation is additionally complicated by the dynamic ticks feature. Traditionally, the periodic tick is active during the entire lifetime of the kernel. This can be wasteful in systems where power is scarce, with laptops and portable machines being prime examples. If a periodic event is active, the system will never be able to go into power-saving modes for long intervals of time. The kernel thus allows dynamic ticks to be configured,2 which do not require a periodic signal. Since this complicates timer handling, assume for now that this feature is not enabled. Four different timekeeping scenarios can be realized by the kernel. While the number may not sound too large, understanding the time-related code is not exactly simplified when many tasks can be implemented in four different ways depending on the chosen configuration. Figure 15-3 summarizes the possible choices.
Figure 15-3: Possible timekeeping configurations that arise because of high- and low-resolution timers and dynamic/periodic ticks.
Computing all four possible combinations from two sets with two elements is certainly not complicated. Nevertheless, it is important to realize that all combinations of low/high res and dynamic/periodic ticks are valid and need to be accounted for by the kernel. 1 Updating the jiffies value is not easy to categorize between low- and high-resolution frameworks because it can be performed by
both, depending on the kernel configuration. The fine details of jiffie updating are discussed in the course of this chapter. 2 It is also customary to refer to a system with this configuration option enabled as a tickless system.
15.2 Implementation of Low-Resolution Timers Since low-resolution timers have been around in the kernel for many years and are used in hundreds of places, their implementation is covered first. In the following, assume that the kernel is defined to work with periodic ticks. The situation is more involved if dynamic ticks are in use, but that case is discussed in Section 15.5.
15.2.1 Timer Activation and Process Accounting As the time base for timers, the kernel uses the timer interrupt of the processor or any other suitable periodic source. On IA-32 and AMD64 systems, the programmable interrupt timer (PIT) or the High Precision Event Timer (HPET) can be employed for this purpose. Nearly all modestly modern systems of this type are equipped with an HPET, and if one is available, it is preferred to the PIT.3 The interrupt occurs at regular intervals — exactly HZ times per second. HZ is defined by an architecture-specific preprocessor symbol.
The HZ frequency is also defined (and used) when dynamic ticks are enabled because it is the fundamental quantity for many timekeeping tasks. On a busy system where something nontrivial (unlike the idle task) is always available to be done, there is superficially no difference between dynamic and periodic ticks. Differences only arise when there is little to do and some timer interrupts can be skipped.
Higher HZ values will, in general, lead to better interactivity and responsiveness of the system, particularly because the scheduler is called at each timer tick. As a drawback, more system work needs to be done because the timer routines are called more often; thus the general kernel overhead will increase with increasing HZ settings. This makes large HZ values preferable for desktop and multimedia systems, whereas lower HZ values are better for servers and batch machines where interactivity is not much of a concern. Early kernels in the 2.6 series directly hooked into the timer interrupt to start timer activation and process accounting, but this has been somewhat complicated by the introduction of the generic clock framework. Figure 15-4 provides an overview of the situation on IA-32 and AMD64 machines. The details differ for other architectures, but the principle is nevertheless the same. (How a particular architecture proceeds is usually set up in time_init which is called at boot time to initialize the fundamental low-resolution timekeeping.) The periodic clock is set up to operate at HZ ticks per second. IA-32 registers timer_interrupt as the interrupt handler, whereas AMD64 uses timer_event_interrupt. Both functions notify the generic, architecture-independent time processing layers of the kernel by calling the event handler of the so-called global clock (see Section 15.3).
3 Using the HPET can be disabled with the kernel command-line option hpet=disable, though.
Different handler functions are employed
depending on which timekeeping model is used. In any case, the handler will set the ball rolling for periodic low-resolution timekeeping by calling the following two functions: ❑
do_timer is responsible for system-wide, global tasks: Update the jiffies value, and handle process
accounting. On a multiprocessor system, one particular CPU is selected to perform both tasks, and all other CPUs are not concerned with them. ❑
update_process_times needs to be performed by every CPU on SMP systems. Besides process accounting, it activates and expires all registered classical low-resolution timers and provides the scheduler with a sense of time. Since these topics merit a discussion of their own (and are not so much related to the rest of this section), they are inspected in detail in Section 15.8. Here we are only concerned with timer activation and expiration, which is triggered by calling run_local_timers. The function, in turn, raises the softIRQ TIMER_SOFTIRQ, and the handler function is responsible for running the low-resolution timers.
Figure 15-4: Overview of periodic low-resolution timer interrupts on IA-32 and AMD64 machines.
First, consider do_timer. The function performs as shown in Figure 15-5. The global variable jiffies_64 (an integer variable with 64 bits on all architectures)4 is incremented by 1. All that this means is that jiffies_64 specifies the exact number of timer interrupts since the system started. Its value is increased with constant regularity when dynamic ticks are disabled. If dynamic ticks are active, more than one tick period can have passed since the last update.
4 This is achieved on 32-bit processors by combining two 32-bit variables.
Figure 15-5: Code flow diagram for do_timer.
For historical reasons, the kernel sources also include another time base. jiffies is a variable of the unsigned long type and is therefore only 4 bytes long on 32-bit processors, and this corresponds to 32 and not 64 bits. This causes a problem. After a longer system uptime, the counter reaches its maximum value and must be reset to 0. Given a timer frequency of 100 Hz, this situation would arise after just less than 500 days, and correspondingly earlier for higher HZ settings.5 When a 64-bit data type is used, the problem never occurs because uptimes of 10^12 days are a little utopian, even for a very stable kernel such as Linux. The kernel uses a trick to prevent efficiency losses when converting between the two different time bases. jiffies and jiffies_64 match in their less significant bits and therefore point to the same memory
location or the same register. To achieve this, the two variables are declared separately, but the linker script used to bind the final kernel binary specifies that jiffies equates to the 4 less significant bytes of jiffies_64, where either the first or last 4 bytes must be used depending on the endianness of the underlying architecture. The two variables are synonymous on 64-bit machines. Caution: Times specified by jiffies and the jiffies variable itself require some special attention. The peculiarities are discussed in Section 15.2.2 immediately below. The remaining actions that must be performed at each timer interrupt are delegated by update_times: ❑
update_wall_time updates the wall time that specifies how long the system has already been
up and running. While this information is also roughly provided by the jiffies mechanism, the wall clock reads the time from the current time source and updates the wall clock accordingly. In contrast to the jiffies mechanism, the wall clock uses a human readable format (nanoseconds) to represent the current time. ❑
calc_load updates the system load statistics that specify how many tasks have on average been waiting on the run queue in a ready-to-run state during the last 1, 5, and, 15 minutes. This status can be output using, for example, the w command.
5 Most computers do not, of course, run uninterruptedly for so long, which is why the problem might appear to be somewhat
marginal at first glance. However, there are some applications — for instance, servers in embedded systems — in which uptimes of this magnitude can easily be achieved. In such situations it must be ensured that the time base functions reliably. During the development of 2.5, a patch was integrated to cause the jiffies value to wrap around 5 minutes after system boot. Potential problems can, therefore, be found quickly without waiting for years for wraparound to occur.
15.2.2 Working with Jiffies Jiffies provide a simple form of low-resolution time management in the kernel. Although the concept is simple, some caveats apply when the variable is read or when times specified in jiffies need to be compared. Since jiffies_64 can be a composed variable on 32-bit systems, it must not be read directly, but may only be accessed with the auxiliary function get_jiffies_64. This ensures that the correct value is returned on all systems.
Comparing Times To compare the temporal relation of events, the kernel provides several auxiliary functions that prevent off-by-one errors if they are used instead of home-grown comparisons (a, b, and c denote jiffies time values for some events): ❑
time_after(a,b) returns true if time a is after time b. time_before(a,b) will be true if time a is before time b, as you will have guessed.
❑
time_after_eq(a,b) works like time_after, but also returns true if both times are identical. time_before_eq(a,b) is the inverse variant.
❑
time_in_range(a,b,c) checks if time a is contained in the time interval denoted by [b, c]. The boundaries are included in the range, so a may be identical to b or c.
Using these functions ensures that wraparounds of the jiffies counter are handled correctly. As a general rule, kernel code should therefore never compare time values directly, but always use these functions. Although there are fewer problems when 64-bit times as given by jiffies_64 are compared, the kernel also provides the functions shown above for 64-bit times. Save for time_in_range, just append _64 to the respective function name to obtain a variant that works with 64-bit time values.
Time Conversion When it comes to time intervals, jiffies might not be the unit of choice in the minds of most programmers. It is more conventional to think in milliseconds or microseconds for short time intervals. The kernel thus provides some auxiliary functions to convert back and forth between these units and jiffies: <jiffies.h>
unsigned int jiffies_to_msecs(const unsigned long j);
unsigned int jiffies_to_usecs(const unsigned long j);
unsigned long msecs_to_jiffies(const unsigned int m);
unsigned long usecs_to_jiffies(const unsigned int u);
The functions are self-explanatory. However, Section 15.2.3 shows that conversion functions between jiffies and struct timeval and struct timespec, respectively, are also available.
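Putting the comparison and conversion helpers together, a typical polling loop with a 100-millisecond time-out could look like the following hedged sketch; device_ready is an invented hardware query used only for illustration, and the time-out length is arbitrary:

#include <linux/jiffies.h>
#include <linux/errno.h>

extern int device_ready(void);          /* hypothetical hardware query */

static int wait_until_ready(void)
{
        unsigned long timeout = jiffies + msecs_to_jiffies(100);

        while (!device_ready()) {
                if (time_after(jiffies, timeout))
                        return -ETIMEDOUT;      /* wraparound-safe comparison */
                cpu_relax();                    /* busy-wait politely */
        }
        return 0;
}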
15.2.3 Data Structures Let us now turn our attention to how low-resolution timers are implemented. You have already seen that processing is initiated by run_local_timers, but before this function is discussed, some prerequisites in the form of data structures must be introduced.
Timers are organized on lists, and the following data structure represents a timer on a list:
struct timer_list {
        struct list_head entry;
        unsigned long expires;

        void (*function)(unsigned long);
        unsigned long data;

        struct tvec_t_base_s *base;
};
As usual, a doubly linked list is used to link registered timers with each other. entry is the list head. The other structure items have the following meaning: ❑
function saves a pointer to the callback function invoked upon time-out.
❑
data is an argument for the callback function.
❑
expires specifies the time, in jiffies, at which the timer expires.
❑
base is a pointer to a base element in which the timers are sorted on their expiry time (discussed
in more detail shortly). There is a base element for each processor of the system; consequently, the CPU upon which the timer runs can be determined using base. The macro DEFINE_TIMER(_name, _function, _expires, _data) is provided to declare a static timer_list instance. Times are given in two formats in the kernel — as offsets or as absolute values. Both make use of jiffies. While offsets are used when a new timer is installed, all kernel data structures use absolute values because they can easily be compared with the current jiffies time. The expires element of timer_list also uses absolute times and not offsets. Because programmers tend to think in seconds rather than in HZ units when defining time intervals, the kernel provides a matching data structure plus the option of converting into jiffies (and, of course, vice versa):
struct timeval {
        time_t      tv_sec;   /* seconds */
        suseconds_t tv_usec;  /* microseconds */
};
The elements are self-explanatory. The complete time interval is calculated by adding the specified second and microsecond values. The timeval_to_jiffies and jiffies_to_timeval functions are used to convert between this representation and a jiffies value. A second structure represents times with nanosecond resolution:
struct timespec {
        time_t tv_sec;   /* seconds */
        long   tv_nsec;  /* nanoseconds */
};
Again, auxiliary functions convert back and forth between jiffies and timespecs: timespec_to_jiffies and jiffies_to_timespec.
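As a short example (the function below is hypothetical and only demonstrates the calling conventions), 1.5 seconds can be converted to jiffies and back as follows:

#include <linux/jiffies.h>
#include <linux/time.h>

static void timespec_roundtrip_example(void)
{
        struct timespec ts = { .tv_sec = 1, .tv_nsec = 500 * 1000 * 1000 };
        struct timespec back;
        unsigned long j;

        j = timespec_to_jiffies(&ts);    /* 1.5 s expressed in ticks */
        jiffies_to_timespec(j, &back);   /* and back, at tick granularity */
}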
15.2.4 Dynamic Timers

The kernel needs data structures to manage all timers registered in the system (these may be assigned to a process or to the kernel itself). The structures must permit rapid and efficient checking for expired timers so as not to consume too much CPU time. After all, such checks must be performed at each timer interrupt.6
Mode of Operation

Before taking a closer look at the existing data structures and the implementation of the algorithms, let's illustrate the principle of timer management by reference to a simplified example, since the algorithm used by the kernel is more complicated than might be expected at first glance. (This complexity brings its rewards in the form of greater performance that could not be achieved with simpler algorithms and structures.) Not only must the data structure hold all the information needed to manage timers,7 but it must also be capable of being scanned easily at periodic intervals so that expired timers can execute and then be removed. Figure 15-6 shows how timers are managed by the kernel.
Figure 15-6: Data structures for managing timers.

The main difficulty lies in scanning the list for timers that are about to expire and that have just expired. Because simply stringing together all timer_list instances is not satisfactory, the kernel creates different groups into which timers are classified according to their expiry time. The basis for grouping is the main array with five entries whose elements are again made up of arrays. The five positions of the main array sort the existing timers roughly according to expiry times. The first group is a collection of all timers whose expiry time is between 0 and 255 (or 2^8) ticks. The second group includes all timers with an expiry time between 256 and 2^(8+6) − 1 = 2^14 − 1 ticks. The range for the third group is from 2^14 to 2^(8+2×6) − 1, and so on. The entries in the main table are known as groups and are sometimes referred to as buckets. Table 15-1 lists the intervals of the individual timer groups. I have used the bucket sizes for regular systems as the basis of our calculations; the intervals differ on small systems with little memory.

Each group itself comprises an array in which the timers are sorted again. The array of the first group consists of 256 elements, each position standing for a possible expires value between 0 and 255.

6 Although the chosen data structure is well suited for the intended purpose, it is nevertheless too inefficient for high-resolution timers that require even better organization.
7 For the moment, ignore the additional data required for process-specific interval timers.
If there are several timers in the system with the same expires value, they are linked by means of a doubly linked standard list (and via the entry element of timer_list).
Table 15-1: Interval Lengths for Timers

Group    Interval
tv1      0 to 255
tv2      2^8 = 256 to 2^14 − 1
tv3      2^14 to 2^20 − 1
tv4      2^20 to 2^26 − 1
tv5      2^26 to 2^32 − 1
The remaining groups also consist of arrays, but with fewer entries, namely, 64. The array entries also accept timer_list instances linked in a doubly linked list. However, each array entry no longer holds just one possible value of expires but an entire interval. The length of the interval depends on the group. While the second group permits 256 = 2^8 consecutive time values per array element, this figure is 2^14 in the third group, 2^20 in the fourth, and 2^26 in the fifth and final group. Why these interval sizes make sense will become clear when we consider how timers are executed in the course of time and how the associated data structure is changed.

How are timers executed? The kernel is responsible primarily for looking after the first of the above groups because this includes all timers due to expire shortly. For simplicity's sake, let us assume that each group has a counter that stores the number of an array position (the actual kernel implementation is the same in functional terms but is far less clearly structured, as you will see shortly). The index entry of the first group points to the array element that holds the timer_list instances of the timers shortly due to be executed. The kernel scans this list every time there is a timer interrupt, executes all timer functions, and increments the index position by 1. The timers just executed are removed from the data structure. The next time a timer interrupt occurs, the timers at the new array position are executed and deleted from the data structure, the index is again incremented by 1, and so on. Once all entries have been processed, the value of the index is 255. Because addition is modulo 256, the index reverts to its initial position (position 0).

Because the contents of the first group are exhausted after at most 256 ticks, timers of the higher groups must be pushed forward successively in order to replenish the first group. Once the index position of the first group has reverted to its initial position, the group is replenished with all timers of a single array entry of the second group. This explains the interval size selection in the individual groups. Because 256 different expiry times per array element are possible in the first group, the data of a single entry in the second group are sufficient to replenish the complete array of the first group. The same applies for higher groups: the data in an array element of the third group are sufficient to replenish the entire second group; an element of the fourth group is sufficient for the entire third group, and an element of the fifth group is sufficient for the entire fourth group.
Chapter 15: Time Management The array positions of the higher groups are not, of course, selected randomly — the index entry again has a role to play. However, the index entry value is no longer incremented by 1 after each timer tick but only after each 256i−1 tick, where i stands for the number of the group. Let’s examine this behavior by reference to a concrete example: 256 jiffies have expired since processing of the first group was started, which is why the index is reset to 0. At the same time, the contents of the first array element of the second group are used to replenish the data of the first group. Let us assume that the jiffies system timer has the value 10,000 at the time of reset. In the first element of the second group, there is a linked list of timers due to expire at 10,001, 10,015, 10,015, and 10,254 ticks. These are distributed over array positions 1, 15, and 254 of the first group, and a linked list made up of two pointers is created at position 15 — after all, both expire at the same time. Once copying is complete, the index position of the second group is incremented by 1. The cycle then starts afresh. The timers of the first group are processed one after the other until index position 255 is reached. All timers in the second array element of the second group are used to replenish the first group. When the index position of the second group has reached 63 (from the second group onward the groups contain only 64 entries), the contents of the first element of the third group are used to replenish the data of the second group. Finally, when the index of the third group has reached its maximum value, data are fetched from the fourth group; the same applies for the transfer of data between the fifth and the fourth groups. To determine which timers have expired, the kernel need not scan through an enormous list of timers but can limit itself to checking a single array position in the first group. Because this position is usually empty or contains only a single timer, this check can be performed very quickly. Even the occasional copying of timers from the higher groups requires little CPU time, because copying can be carried out efficiently by means of pointer manipulation (the kernel is not required to copy memory blocks but need only supply pointers with new values as is usually the case in standard list functions).
Data Structures

The contents of the above groups are generated by two simple data structures that differ minimally:
kernel/timer.c
typedef struct tvec_s {
        struct list_head vec[TVN_SIZE];
} tvec_t;

typedef struct tvec_root_s {
        struct list_head vec[TVR_SIZE];
} tvec_root_t;
While tvec_root_t corresponds to the first group, tvec_t represents the higher groups. The two structures differ only in the number of array entries: for the first group, TVR_SIZE is defined as 256, while all other groups use TVN_SIZE entries with a default value of 64. Systems where memory is scarce set the configuration option BASE_SMALL; in this case, 64 entries are reserved for the first group and 16 for all other groups.
Each processor in the system has its own data structures for managing timers that run on it. A per-CPU instance of the following data structure is used as the root element:
kernel/timer.c
struct tvec_t_base_s {
        ...
        unsigned long timer_jiffies;
        tvec_root_t tv1;
        tvec_t tv2;
        tvec_t tv3;
        tvec_t tv4;
        tvec_t tv5;
} ____cacheline_aligned_in_smp;
The elements tv1 to tv5 represent the individual groups; their function should be clear from the above description. Of particular interest is the timer_jiffies element. It records the time (in jiffies) by which all timers of the structure were executed. If, for example, the value of this variable is 10,500, the kernel knows that all timers up to the jiffies value 10,499 have been executed. Usually, timer_jiffies is equal to or 1 less than jiffies. The difference may be a little greater (with very high loading) if the kernel is not able to execute timers for a certain period.
Implementing Timer Handling

Handling of all timers is initiated in update_process_times by invoking the run_local_timers function. This limits itself to using raise_softirq(TIMER_SOFTIRQ) to activate the timer management softIRQ, which is executed at the next opportunity.8 run_timer_softirq is used as the handler function of the softIRQ; it selects the CPU-specific instance of struct tvec_t_base_s and invokes __run_timers.

__run_timers implements the algorithm described above. However, nowhere in the data structures shown is the urgently required index position for the individual rough categories to be found! The kernel does not require an explicit variable because all necessary information is contained in the timer_jiffies member of base. The following macros are defined for this purpose:
kernel/timer.c
#define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
#define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
#define TVN_SIZE (1 << TVN_BITS)
#define TVR_SIZE (1 << TVR_BITS)
#define TVN_MASK (TVN_SIZE - 1)
#define TVR_MASK (TVR_SIZE - 1)
kernel/timer.c
#define INDEX(N) ((base->timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)
The configuration option BASE_SMALL can be defined on small, usually embedded systems to save some space by using a smaller number of slots than in the regular case. The timer implementation is otherwise unaffected by this choice.

8 Because softIRQs cannot be handled directly, it can happen that the kernel does not perform any timer handling for a few jiffies. Timers can, therefore, sometimes be activated too late but can never be activated too early.
The index position of the first group can be computed by masking the value of base->timer_jiffies with TVR_MASK:

int index = base->timer_jiffies & TVR_MASK;
Generally, the following macro can be used to compute the current index position in group N:

#define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK
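To see how base->timer_jiffies encodes all five index positions, the macros can be evaluated for a sample value in a small user-space program (a sketch only; the constants mirror the regular configuration):

#include <stdio.h>

#define TVN_BITS 6
#define TVR_BITS 8
#define TVN_MASK ((1UL << TVN_BITS) - 1)
#define TVR_MASK ((1UL << TVR_BITS) - 1)
#define INDEX(N) ((timer_jiffies >> (TVR_BITS + (N) * TVN_BITS)) & TVN_MASK)

int main(void)
{
        unsigned long timer_jiffies = 0x12345678UL;

        printf("tv1 index: %lu\n", timer_jiffies & TVR_MASK);  /* lowest 8 bits */
        printf("tv2 index: %lu\n", INDEX(0));                  /* next 6 bits   */
        printf("tv3 index: %lu\n", INDEX(1));
        printf("tv4 index: %lu\n", INDEX(2));
        printf("tv5 index: %lu\n", INDEX(3));
        return 0;
}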
Doubting Thomases can easily convince themselves of the correctness of the bit operations by means of a short Perl script. The implementation produces exactly the results described above using the following code (__run_timers is called by the abovementioned run_timer_softirq):
kernel/timer.c
static inline void __run_timers(tvec_base_t *base)
{
        while (time_after_eq(jiffies, base->timer_jiffies)) {
                struct list_head work_list;
                struct list_head *head = &work_list;
                int index = base->timer_jiffies & TVR_MASK;
                ...
If the kernel has missed a number of timers in the past, they are dealt with now by processing all timers that expired between the last execution point (base->timer_jiffies) and the current time (jiffies):
kernel/timer.c
                if (!index &&
                        (!cascade(base, &base->tv2, INDEX(0))) &&
                        (!cascade(base, &base->tv3, INDEX(1))) &&
                                !cascade(base, &base->tv4, INDEX(2)))
                        cascade(base, &base->tv5, INDEX(3));
                ...
The cascade function is used to replenish the timer lists with timers from higher groups (although its implementation is not discussed here, suffice it to say that it uses the mechanism described above).
kernel/timer.c
                ++base->timer_jiffies;
                list_replace_init(base->tv1.vec + index, &work_list);
                ...
All timers located in the first group at the corresponding position for the timer_jiffies value (which is incremented by 1 for the next cycle) are copied into a temporary list and therefore removed from the original data structures. All that need then be done is to execute the individual handler routines:
kernel/timer.c
                while (!list_empty(head)) {
                        void (*fn)(unsigned long);
                        unsigned long data;

                        timer = list_entry(head->next, struct timer_list, entry);
                        fn = timer->function;
                        data = timer->data;
                        detach_timer(timer, 1);
                        fn(data);
                }
        }
        ...
}
Activating Timers

When new timers are installed, a distinction must be made as to whether they are required by the kernel itself or by applications in userspace. First, let's discuss the mechanism for kernel timers because user timers also build on this mechanism. add_timer is used to insert a fully supplied instance of timer_list into the structures just described above:

static inline void add_timer(struct timer_list *timer);
After checking several safety conditions (e.g., the same timer may not be added twice), work is delegated to the internal_add_timer function whose task is to place the new timer at the right position in the data structures. The kernel must first compute the number of ticks after which time-out of the new timer will occur because an absolute time-out value is specified in the data structure of new timers. To compensate for any missed timer handling calls, expires - base->timer_jiffies is used to compute the offset. The group and the position within the group can be determined on the basis of this value. All that now need be done is to add the new timer to the linked list. Because it is placed at the end of the list and because the list is processed from the beginning when timers are run, a first-in, first-out mechanism is implemented.
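The following sketch shows the typical way a kernel component installs such a timer. The callback, the variable names, and the two-second interval are hypothetical; only init_timer, add_timer, and the timer_list fields correspond to the interfaces described above:

#include <linux/timer.h>
#include <linux/jiffies.h>

static void my_timer_fn(unsigned long data)
{
        /* runs in softIRQ context once the timer has expired */
}

static struct timer_list my_timer;

static void install_my_timer(void)
{
        init_timer(&my_timer);
        my_timer.function = my_timer_fn;
        my_timer.data     = 0;
        my_timer.expires  = jiffies + 2 * HZ;   /* absolute expiry time */
        add_timer(&my_timer);                   /* ends up in internal_add_timer */
}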
15.3 Generic Time Subsystem

Low-resolution timers are useful for a wide range of situations and deal well with many possible use cases. This broadness, however, complicates support for timers with high resolution. Years of development have shown that it is very hard to integrate them into the existing framework. The kernel therefore supports a second timing mechanism. While low-resolution timers are based on jiffies as fundamental units of time, high-resolution timers use human time units, namely, nanoseconds. This is reasonable because high-precision timers are mostly required for userland applications, and the natural way for programmers to think about time is in human units. And, most important, 1 nanosecond is a precisely defined time interval, whereas the length of one jiffy tick depends on the kernel configuration.
High-resolution timers place more requirements on the architecture-specific code of the individual architectures than classical timers. The generic time framework provides the foundations for high-resolution timers. Before getting into the details of high-resolution timers, let's take a look at how high-precision timekeeping is achieved in the kernel.

The core of the second timer subsystem of the kernel can be found in kernel/hrtimer.c. The generic timekeeping code that forms the basis for high-resolution timers is located in several files in kernel/time. After providing an overview of the mechanisms used, the new API that comes with high-resolution timers is introduced, and then their implementation is examined in detail.
15.3.1 Overview

Figure 15-7 provides an overview of the generic time system that provides the foundation of high-resolution timers.
Figure 15-7: Overview of the generic time subsystem.
First, let’s discuss the available components and data structures, the details of which will be covered in the course of this chapter. Three mechanisms form the foundation of any time-related task in the kernel:
1. Clock Sources (defined by struct clocksource) — Form the backbone of time management. Essentially, each clock source provides a monotonically increasing counter with read-only access for the generic kernel parts. The accuracy of different clock sources varies depending on the capabilities of the underlying hardware.

2. Clock event devices (defined by struct clock_event_device) — Add the possibility of equipping clocks with events that occur at a certain time in the future. Note that it is also common to refer to such devices as clock event sources for historical reasons.

3. Tick devices (defined by struct tick_device) — Extend clock event sources to provide a continuous stream of tick events that happen at regular time intervals. The dynamic tick mechanism allows for stopping the periodic tick during certain time intervals, though.
The kernel distinguishes between two types of clocks:

1. A global clock is responsible for providing the periodic tick that is mainly used to update the jiffies values. In former versions of the kernel, this type of clock was realized by the programmable interrupt timer (PIT) on IA-32 systems, and by similar chips on other architectures.

2. One local clock per CPU allows for performing process accounting, profiling, and, last but not least, high-resolution timers.
The role of the global clock is assumed by one specifically selected local clock. Note that high-resolution timers only work on systems that provide per-CPU clock sources. The extensive communication required between processors would otherwise degrade system performance too much as compared to the benefit of having high-resolution timers.

The overall concept is complicated by problems that unfortunately arise on the two most widespread platforms: AMD64 and IA-32 (the MIPS platform is also affected). Local clocks on SMP systems are based on APIC chips. Unfortunately, whether these clocks work properly depends on the power-saving mode they are in. In low-power modes (ACPI mode C3, to be precise), the local APIC timers are stopped, and thus become useless as clock sources. A system-global clock that still works in this power-management state is then used to periodically generate signals that look as if they originated from the original clock sources. The workaround is known as the broadcasting mechanism; more about this follows in Section 15.6. Since broadcasting requires communication between the CPUs, the solution is slower and less accurate than proper local time sources; the kernel will automatically switch back from high-resolution to low-resolution mode.
15.3.2 Configuration Options

Timer implementation is influenced by several configuration symbols. Two choices are possible at compile time:

1. The kernel can be built with or without support for dynamic ticks. If dynamic ticks are enabled, the pre-processor constant CONFIG_NO_HZ is set.

2. High-resolution support can be enabled or disabled. The pre-processor symbol CONFIG_HIGH_RES_TIMERS is enabled if support for them is compiled in.
Both are important in the following discussion of timer implementation. Recall that both choices are independent of each other; this leads to four different configurations of the time and timer subsystems. Additionally, each architecture is required to make some configuration choices; these cannot be influenced by the user:

❑ GENERIC_TIME signals that the architecture supports the generic time framework. GENERIC_CLOCKEVENTS states that the same holds for generic clock events. Since both are necessary requirements for dynamic ticks and high-resolution timers, only architectures that
provide both are considered.9 Actually, most widespread architectures have been updated to support both options, even if some (for instance, SuperH) do this only for certain time models.

❑ CONFIG_TICK_ONESHOT builds support for the one-shot mode of clock event devices. This is automatically selected if high-resolution timers or dynamic ticks are enabled.

❑ GENERIC_CLOCKEVENTS_BROADCAST must be defined if the architecture suffers from problems that require broadcasting. Currently only IA-32, AMD64, and MIPS are affected.
15.3.3 Time Representation

The generic time framework uses the data type ktime_t to represent time values. Regardless of the underlying architecture, the type always resolves to a 64-bit quantity. This makes the structure convenient to work with on 64-bit architectures as only simple integer operations are required for time-related operations. To reduce the effort on 32-bit machines, the definition ensures that the two 32-bit values are ordered such that they can be directly interpreted as a 64-bit quantity without further ado — clearly this requires sorting the fields differently depending on the processor's endianness:
typedef union {
        s64 tv64;
#if BITS_PER_LONG != 64 && !defined(CONFIG_KTIME_SCALAR)
        struct {
# ifdef __BIG_ENDIAN
                s32 sec, nsec;
# else
                s32 nsec, sec;
# endif
        } tv;
#endif
} ktime_t;
If a 32-bit architecture provides functions that handle 64-bit quantities efficiently, it can set the configuration option KTIME_SCALAR — IA-32 is the only architecture that makes use of this possibility at the moment. A separation into two 32-bit values is not performed in this case, but the representation of kernel times as direct 64-bit quantities is used.

Several auxiliary functions to handle ktime_t objects are defined by the kernel. Among them are the following:

❑ ktime_sub and ktime_add are used to subtract and add ktime_ts, respectively.

❑ ktime_add_ns adds a given number of nanoseconds to a ktime_t. ktime_add_us is another variant for microseconds. ktime_sub_ns and ktime_sub_us are also available.

❑ ktime_set produces a ktime_t from a given number of seconds and nanoseconds.

❑ Various functions of the type x_to_y convert between representation x and y, where the types ktime_t, timeval, clock_t, and timespec are possible.
9 Architectures that are currently migrating to the generic clock event framework can set GENERIC_CLOCKEVENTS_MIGR. This will build the code, but not use it at run time.
Note that a direct interpretation of a ktime_t as a number of nanoseconds would be possible on 64-bit machines, but can lead to problems on 32-bit machines. Thus, the function ktime_to_ns is provided to perform the conversion properly. The auxiliary function ktime_equal is provided to decide if two ktime_ts are identical. To provide exchangeability with other time formats used in the kernel, some conversion functions are available:
ktime_t timespec_to_ktime(const struct timespec ts)
ktime_t timeval_to_ktime(const struct timeval tv)
struct timespec ktime_to_timespec(const ktime_t kt)
struct timeval ktime_to_timeval(const ktime_t kt)
s64 ktime_to_ns(const ktime_t kt)
s64 ktime_to_us(const ktime_t kt)
The function names specify which quantity is converted into which, so there’s no need to add anything further.
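The following fragment sketches how the helpers fit together (the function is hypothetical and exists only for illustration):

#include <linux/ktime.h>

static void ktime_example(void)
{
        ktime_t a = ktime_set(1, 500000000);        /* 1.5 seconds            */
        ktime_t b = ktime_add_ns(a, 1000);          /* plus one microsecond   */
        ktime_t d = ktime_sub(b, a);                /* difference: 1000 ns    */
        s64 ns = ktime_to_ns(d);                    /* 1000                   */
        struct timespec ts = ktime_to_timespec(a);  /* 1 s, 500000000 ns      */
}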
15.3.4 Objects for Time Management

Recall from the overview that three objects manage timekeeping in the kernel: clock sources, clock event devices, and tick devices. Each of them is represented by a special data structure discussed in the following.
Clock Sources

First of all, consider how time values are acquired from the various sources present in a machine. The kernel defines the abstraction of a clock source for this purpose:
struct clocksource {
        char *name;
        struct list_head list;
        int rating;
        cycle_t (*read)(void);
        cycle_t mask;
        u32 mult;
        u32 shift;
        unsigned long flags;
        ...
};
A human-readable name for the source is given in name, and list is a standard list element that connects all available clock sources on a standard kernel list. Not all clocks are of the same quality, and the kernel obviously wants to select the best possible one. Thus, every clock has to (honestly) specify its own quality in rating. The following intervals are possible:

❑ A rating between 1 and 99 denotes a very bad clock that can only be used as a last resort or during boot-up, that is, when no better clock is available.

❑ The range 100–199 describes a clock that is fit for real use, but not really desirable if something better can be found.
❑ Clocks with a rating between 300 and 399 are reasonably fast and accurate.

❑ Perfect clocks that are the ideal source get a rating between 400 and 499.
The best clock sources can currently be found on the PowerPC architecture where two clocks with a rating of 400 are available. The time stamp counter (TSC) on IA-32 and AMD64 machines — usually the most accurate device on these architectures — has a rating of 300. The best clocks on most architectures have similar ratings. The developers do not exaggerate the performance of the devices and leave plenty of space for improvement on the hardware side. It does not come as a surprise that read is used to read the current cycle value of the clock. Note that the value returned does not use any fixed timing basis for all clocks, but needs to be converted into a nanosecond value individually. For this purpose, the field members mult and shift are used to multiply or divide, respectively, the cycles value as follows:
static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles)
{
        u64 ret = (u64)cycles;
        ret = (ret * cs->mult) >> cs->shift;
        return ret;
}
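A small worked example helps to see how mult and shift fit into this formula. Assume a hypothetical clock running at 1 MHz, so one cycle corresponds to 1,000 ns; the values below follow the same arithmetic as cyc2ns (this is a user-space sketch, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        const uint64_t nsec_per_sec = 1000000000ULL;
        const uint64_t rate = 1000000;            /* 1 MHz */
        const unsigned int shift = 10;
        const uint64_t mult = (nsec_per_sec << shift) / rate;   /* 1024000 */

        uint64_t cycles = 5;
        uint64_t ns = (cycles * mult) >> shift;   /* same math as cyc2ns() */

        printf("mult = %llu, 5 cycles = %llu ns\n",
               (unsigned long long)mult, (unsigned long long)ns);  /* 5000 ns */
        return 0;
}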
Note that cycle_t is defined as an unsigned integer with 64 bits independent of the underlying platform. If a clock does not provide time values with 64 bits, then mask specifies a bitmask to select the appropriate bits. The macro CLOCKSOURCE_MASK(bits) constructs the proper mask for a given number of bits.

Finally, the field flags of struct clocksource specifies — you will have guessed it — a number of flags. Only one flag is relevant for our purposes. CLOCK_SOURCE_IS_CONTINUOUS represents a continuous clock, although the meaning is not quite the mathematical sense of ‘‘continuous.’’ Instead, it describes that the clock is free-running if set to 1 and thus cannot skip. If it is set to 0, then some cycles might be lost; that is, if the last cycle value was n, then the next value does not necessarily need to be n + 1 even if it was read at the next possible moment. A clock must exhibit this flag to be usable for high-resolution timers.

For booting purposes and if nothing really better is available on the machine (which should never be the case after bootup), the kernel provides a jiffies-based clock10:
kernel/time/jiffies.c
#define NSEC_PER_JIFFY ((u32)((((u64)NSEC_PER_SEC)<<8)/ACTHZ))

struct clocksource clocksource_jiffies = {
        .name = "jiffies",
        .rating = 1, /* lowest valid rating*/
        .read = jiffies_read,
        .mask = 0xffffffff, /*32bits*/
        .mult = NSEC_PER_JIFFY << JIFFIES_SHIFT, /* details above */
        .shift = JIFFIES_SHIFT,
};

10 Note that if the jiffy clock were used as the main clock source, then the kernel would be responsible to update the jiffies value by some apt means, for instance, directly from the timer interrupt. Usually, architectures don't do this. It, therefore, does not really make sense to use this clock for tickless systems that emulate the jiffies layer via clock sources. In fact, using the jiffies clock source is a nice way to crash dynamic tick systems, at least on kernel 2.6.24 . . .
At first glance, it might not make much sense to first multiply by JIFFIES_SHIFT and then again divide by the same value. Nevertheless, this bogosity is required because the NTP code does not work with zero shifts.11 Also note that the jiffies clock has a rating of 1, which makes it definitely the worst clock in the whole system. The read routine for the jiffies clock is particularly simple: No hardware interaction is required. It suffices to return the current jiffies value.

The time-stamp counter usually provides the best clock found on IA-32 and AMD64 machines:
arch/x86/kernel/tsc_64.c
static struct clocksource clocksource_tsc = {
        .name = "tsc",
        .rating = 300,
        .read = read_tsc,
        .mask = CLOCKSOURCE_MASK(64),
        .shift = 22,
        .flags = CLOCK_SOURCE_IS_CONTINUOUS |
                 CLOCK_SOURCE_MUST_VERIFY,
};

read_tsc uses some assembler code to read out the current counter value from hardware.
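For illustration, a driver for a fictitious counter chip might register its own clock source roughly as follows. The device, its rating, its frequency, and the read routine are pure assumptions; only the structure fields and the helpers clocksource_hz2mult and clocksource_register are existing interfaces:

#include <linux/clocksource.h>
#include <linux/init.h>

static cycle_t my_read_counter(void)
{
        return (cycle_t)0;   /* would read the hardware counter register here */
}

static struct clocksource my_clocksource = {
        .name   = "myclock",
        .rating = 200,
        .read   = my_read_counter,
        .mask   = CLOCKSOURCE_MASK(32),
        .shift  = 20,
        .flags  = CLOCK_SOURCE_IS_CONTINUOUS,
};

static int __init my_clock_init(void)
{
        /* assumed counter frequency: 1 MHz */
        my_clocksource.mult = clocksource_hz2mult(1000000, my_clocksource.shift);
        return clocksource_register(&my_clocksource);
}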
Working with Clock Sources

How can a clock be used? First of all, it must be registered with the kernel. The function clocksource_register is responsible for this. The source is only added to the global clocksource_list (defined in kernel/time/clocksource.c), which sorts all available clock sources by their rating. select_clocksource is called to select the best clock source. Normally this will pick the clock with the best rating, but it is also possible to specify a preference from userland via /sys/devices/system/clocksource/clocksource0/current_clocksource, which is used by the kernel instead. Two global variables are provided for this purpose:
1. current_clocksource points to the clock source that is currently the best one.

2. next_clocksource points to an instance of struct clocksource that is better than the one used at the moment. The kernel automatically switches to the best clock source when a new best clock source is registered.

To read the clock, the kernel provides the following functions:
❑ __get_realtime_clock_ts takes a pointer to an instance of struct timespec as argument, reads the current clock, converts the result, and stores it in the timespec instance.

❑ getnstimeofday is a front-end for __get_realtime_clock_ts, but also works if no high-resolution clocks are available in the system. In this case, getnstimeofday as defined in kernel/time.c (instead of kernel/time/timekeeping.c) is used to provide a timespec that fulfills only low-resolution requirements.
11 The definition of NSEC_PER_JIFFY contains the pre-processor symbol ACTHZ. While HZ denotes the base low-resolution frequency that can be selected at compile time, the frequency that the system actually provides will differ slightly because of hardware limitations. ACTHZ stores the frequency at which the clock is actually running.
Clock Event Devices

Clock event devices are defined by the following data structure:
struct clock_event_device {
        const char *name;
        unsigned int features;
        unsigned long max_delta_ns;
        unsigned long min_delta_ns;
        unsigned long mult;
        int shift;
        int rating;
        int irq;
        cpumask_t cpumask;
        int (*set_next_event)(unsigned long evt, struct clock_event_device *);
        void (*set_mode)(enum clock_event_mode mode, struct clock_event_device *);
        void (*event_handler)(struct clock_event_device *);
        void (*broadcast)(cpumask_t mask);
        struct list_head list;
        enum clock_event_mode mode;
        ktime_t next_event;
};
Recall that clock event devices allow for registering an event that is going to happen at a defined point of time in the future. In comparison to a full-blown timer implementation, however, only a single event can be stored. The key elements of every clock_event_device are set_next_event, because it allows for setting the time at which the event is going to take place, and event_handler, which is called when the event actually happens. Besides, the elements of clock_event_device have the following purpose:

❑ name is a human-readable representation for the event device. It shows up in /proc/timer_list.

❑ max_delta_ns and min_delta_ns specify the maximum or minimum, respectively, difference between the current time and the time for the next event. Clocks work with individual frequencies at which device cycles occur, but the generic time subsystem expects a nanosecond value when the event shall take place. The auxiliary function clockevent_delta2ns helps to convert one representation into the other. Consider, for instance, that the current time is 20, min_delta_ns is 2, and max_delta_ns is 40 (of course, the exemplary values do not represent any situation possible in reality). Then the next event can take place during the time interval [22, 60] where the boundaries are included.
❑ mult and shift are a multiplier and a divider, respectively, used to convert between clock cycles and nanosecond values.

❑ The function pointed to by event_handler is called by the hardware interface code (which usually is architecture-specific) to pass clock events on to the generic layers.
❑ irq specifies the number of the IRQ that is used by the event device. Note that this is only required for global devices. Per-CPU local devices use different hardware mechanisms to emit signals and set irq to −1.

❑ cpumask specifies for which CPUs the event device works. A simple bitmask is employed for this purpose. Local devices are usually only responsible for a single CPU.

❑ broadcast is required for the broadcasting implementation that provides a workaround for nonfunctional local APICs on IA-32 and AMD64 in power-saving mode. See Section 15.6 for more details.

❑ rating allows — in analogy to the mechanism described for clock sources — comparison of clock event devices by explicitly rating their accuracy.

❑ All instances of struct clock_event_device are kept on the global list clockevent_devices, and list is the list head required for this purpose. The auxiliary function clockevents_register_device is used to register a new clock event device. This places the device on the global list.

❑ next_event stores the absolute time of the next event.
Each event device is characterized by several features stored as a bit string in features. A number of pre-defined constants describe the possible features:

❑ Clock event devices that support periodic events (i.e., events that are repeated over and over again without the need to explicitly activate them by reprogramming the device) are identified by CLOCK_EVT_FEAT_PERIODIC.

❑ CLOCK_EVT_FEAT_ONESHOT marks a clock capable of issuing one-shot events that happen exactly once. Basically, this is the opposite of periodic events.

set_mode points to a function that allows for toggling the desired mode of operation between periodic and one-shot mode. mode designates the current mode of operation. A clock can only be in either periodic or one-shot mode at a time, but it can nevertheless provide the ability to work in both modes — actually, most clocks allow both possibilities.

Generic code does not need to call set_next_event directly because the kernel provides the following auxiliary function for this task:
kernel/time/clockevents.c
int clockevents_program_event(struct clock_event_device *dev, ktime_t expires, ktime_t now)
The (absolute) expiration time for the device dev is given in expires, while now denotes the current time. Usually the caller will directly pass the result of ktime_get() for this parameter.

On IA-32 and AMD64 systems, the role of the global clock event device is initially assumed by the PIT. The HPET takes over this duty once it has been initialized.

12 Recall that local APICs on IA-32 and AMD64 systems expose a problem: They stop working at certain power-save levels. This problem is reported to the kernel by setting the ‘‘feature’’ CLOCK_EVT_FEAT_C3STOP, which should rather be named a mis-feature.
To keep track of which device is used to handle global clock events on x86 systems, the global variable global_clock_event as defined in arch/x86/kernel/i8253.c is employed. It points to the clock_event_device instance for the global clock device that is currently in use.

Clock sources and clock event devices are formally unconnected at the data structure level. However, one particular hardware chip in the system often provides capabilities that fulfill the requirements of both interfaces, so the kernel usually registers a clock source and a clock event device per time hardware chip. Consider, for instance, the HPET device on IA-32 and AMD64 systems. The capabilities as clock source are collected in clocksource_hpet, while hpet_clockevent is an instance of clock_event_device. Both are defined in arch/x86/kernel/hpet.c. hpet_init first registers the clock source and then the clock event device. This adds two time-management objects to the kernel, but only a single piece of hardware is required.
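To round off the discussion of clock event devices, the following sketch programs such a device so that its event_handler is invoked roughly one millisecond from now. The wrapper function is hypothetical; clockevents_program_event and ktime_get are the interfaces introduced above:

#include <linux/clockchips.h>
#include <linux/hrtimer.h>   /* ktime_get() */

static int fire_in_one_ms(struct clock_event_device *dev)
{
        ktime_t now = ktime_get();
        ktime_t expires = ktime_add_ns(now, 1000000);   /* now + 1 ms */

        /* 0 on success; fails if the delta violates min_delta_ns/max_delta_ns */
        return clockevents_program_event(dev, expires, now);
}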
Tick Devices

One particularly important use of clock event devices is to provide periodic ticks — recall from Section 15.2 that ticks are, for instance, required to operate the classical timer wheel. A tick device is an extension of a clock event device:
struct tick_device {
        struct clock_event_device *evtdev;
        enum tick_device_mode mode;
};

enum tick_device_mode {
        TICKDEV_MODE_PERIODIC,
        TICKDEV_MODE_ONESHOT,
};
A tick_device is just a wrapper around struct clock_event_device with an additional field that specifies which mode the device is in. This can either be periodic or one-shot. The distinction will be important when tickless systems are considered; this is discussed further in Section 15.5. For now, it suffices to see a tick device as a mechanism that provides a continuous stream of tick events. These form the basis for the scheduler, the classical timer wheel, and related components of the kernel.

Again, the kernel distinguishes global and local (per-CPU) tick devices. The local devices are collected in tick_cpu_device (defined in kernel/time/tick-internal.h). Note that the kernel automatically creates a tick device when a new clock event device is registered. Several global variables are additionally defined in kernel/time/tick-internal.h:
❑ tick_cpu_device is a per-CPU list containing one instance of struct tick_device for each CPU in the system.

❑ tick_next_period specifies the time (in nanoseconds) when the next global tick event will happen.

❑ tick_do_timer_cpu contains the CPU number whose tick device assumes the role of the global tick device.

❑ tick_period stores the interval between ticks in nanoseconds. It is the counterpart to HZ, which denotes the frequency at which ticks occur.
To set up a tick device, the kernel provides the function tick_setup_device. The prototype is as follows, and the code flow diagram is depicted in Figure 15-8.13
kernel/time/tick-common.c

static void tick_setup_device(struct tick_device *td,
                              struct clock_event_device *newdev,
                              int cpu, cpumask_t cpumask);
Figure 15-8: Code flow diagram for tick_setup_device.

The parameter td specifies the tick_device instance that is going to be set up. It is about to be equipped with the clock event device newdev. cpu denotes the processor to which the device is associated, and cpumask is a bitmask that allows for restricting the tick device to specific CPUs.

When the device is set up for the first time (i.e., if no clock event device is associated with the tick device), the kernel performs two actions:
1. If no tick device has been chosen to assume the role of the global tick device yet, then the current device is selected, and tick_do_timer_cpu is set to the processor number to which the current device belongs. tick_period, that is, the interval between ticks in nanoseconds, is computed based on the value of HZ.

2. The tick device is set to work in periodic mode.
After assigning the event device to the tick device, the function is finished if broadcasting mode is active (recall that this is used if the system is in a power-saving state where the local clocks don't work; see Section 15.6 for more details). Otherwise, the kernel needs to establish a periodic tick. How this is done depends on whether the tick device runs in periodic or one-shot mode, and the work is correspondingly delegated either to tick_setup_periodic or tick_setup_oneshot.
The fact that the tick device is in one-shot mode does not automatically mean that dynamic ticks are enabled! Ticks in high-resolution mode are, for instance, always implemented on top of one-shot timers.
13 The function is automatically called if a new clock event device is registered that allows for creating a better tick device than the previously available ones. Devices with a higher quality are favored, but not if the new and more accurate device does not support one-shot mode, while the old device does provide this support.
Before discussing these functions, let us therefore consider which situations are faced by the kernel depending on the selected configuration:

❑ A low-resolution system without dynamic ticks always uses a periodic tick. Support for one-shot operations is not included in the kernel at all.

❑ Low-resolution systems with dynamic ticks use the tick device in one-shot mode.

❑ High-resolution systems always use one-shot mode independent of whether they work with dynamic ticks or not.
All systems initially work in low-resolution mode and without dynamic ticks; they switch to a different combination only later when the required hardware is initialized. I therefore focus on the low-resolution, periodic tick case here. The more advanced options are discussed in Sections 15.4.5 (high-resolution timers) and 15.5 (dynamic ticks). Some corrections are also required for broadcast mode; Section 15.6 covers them in more detail. Before examining the low-resolution case without dynamic ticks, I would like to point out that Figure 15-9 provides an overview of the tick handler functions that are used for the various possible combinations. Note that which broadcast function is chosen for a system without dynamic ticks depends on the mode of the underlying tick device. The details are given below.
Figure 15-9: Tick event and broadcast handler functions for all possible combinations of low- and high-resolution mode, and with/without dynamic ticks.

Let us finally turn our attention to tick_setup_periodic. The code flow diagram is shown in Figure 15-10.
Figure 15-10: Code flow diagram for tick_setup_periodic.
Actually, the task is quite simple if the clock event device supports periodic events. In this case, tick_set_periodic_handler installs tick_handle_periodic as handler function, and clockevents_set_mode ensures that the clock event device runs in periodic mode. If the event device does not support periodic events, then the kernel must make do with one-shot events. clockevents_set_mode sets the event device to this mode, but additionally, the next event needs to be programmed manually using clockevents_program_event. In both cases, the handler function tick_handle_periodic is called on the next event of the tick device. (Recall that we focus on the low-res case without dynamic ticks here; other settings will use different handler functions!)

Before discussing the handler function, I need to introduce the auxiliary function tick_periodic. It is responsible for handling the periodic tick on the CPU that is passed as an argument:
kernel/time/tick-common.c
static void tick_periodic(int cpu);
Figure 15-11 shows what is going on inside the function.
Figure 15-11: Code flow diagram for tick_periodic.

If the current tick device is responsible for the global tick, then do_timer is called. Recall that this function is discussed in Section 15.2.1. Nevertheless, remember that do_timer is responsible for updating the global jiffies value that is used as the coarse-grained time base in many parts of the kernel. update_process_times is called by every tick handler, as well as profile_tick. The first function is discussed in Section 15.2.1. profile_tick is responsible for profiling, but the details are not discussed here.

Let's go back to the handler function. Things are again easier here if periodic events are in use:
kernel/time/tick-common.c
void tick_handle_periodic(struct clock_event_device *dev)
{
        int cpu = smp_processor_id();
        ktime_t next;

        tick_periodic(cpu);
        if (dev->mode != CLOCK_EVT_MODE_ONESHOT)
                return;
        ...
All the kernel needs to do is call tick_periodic. If the clock event device operates in one-shot mode, the next tick event needs to be programmed:
kernel/time/tick-common.c
        ...
        /*
         * Setup the next period for devices, which do not have
         * periodic mode:
         */
        next = ktime_add(dev->next_event, tick_period);
        for (;;) {
                if (!clockevents_program_event(dev, next, ktime_get()))
                        return;
                tick_periodic(cpu);
                next = ktime_add(next, tick_period);
        }
}
Since dev->next_event contains the time of the current tick event, the time for the next event can easily be computed by incrementing the value with the length of the interval as specified in tick_period. Programming this event is then usually just a matter of calling clockevents_program_event. Should this fail14 because the time for the next clock event already lies in the past, then the kernel calls tick_periodic manually and tries again to reprogram the event until it succeeds.
15.4 High-Resolution Timers

After having discussed the generic time framework, we are now ready to take the next step and dive into the implementation of high-resolution timers. Two fundamental differences distinguish these timers from low-resolution timers:
1. High-resolution (high-res) timers are time-ordered on a red-black tree.

2. They are independent of periodic ticks. They do not use a time specification based on jiffies, but employ nanosecond time stamps.
Merging the high-resolution timer mechanism into the kernel was an interesting process in itself. After the usual development and testing phase, kernel 2.6.16 contained the basic framework that provided most of the implementation except one thing: support for high-resolution timers . . . . The classical implementation of low-resolution timers had, however, been replaced with a new foundation in this release. It was based on the high-resolution timer framework, although the supported resolution was not any better than before. Following kernel releases then added support for another class of timers that did actually provide high-resolution capabilities.

This merge strategy is not only of historical interest: Since low-resolution timers are implemented on top of the high-resolution mechanism, (partial) support for high-resolution timers will also be built into the kernel even if support for them is not explicitly enabled! Nevertheless, the system will only be able to provide timers with low-resolution capabilities.

14 Note that 0 is returned on success, so !clockevents_program_event(...) checks for failure.
Components of the high-resolution timer framework that are not universally applicable, but do really provide actual high-resolution capabilities are bracketed by the pre-processor symbol CONFIG_HIGH_RES_TIMERS, and are only compiled in if high-resolution support is selected at compile time. The generic part of the framework is always added to the kernel.
This means that even kernels that only support low resolution contain parts of the high-resolution framework, which can sometimes lead to confusion.
15.4.1 Data Structures

High-resolution timers can be based on two different types of clocks (which are referred to as clock bases). The monotonic clock starts at 0 when the system is booted (CLOCK_MONOTONIC). The other clock (CLOCK_REALTIME) represents the real time of the system. The latter clock may exhibit skips if, for instance, the system time is changed, but the monotonic clock runs, well, monotonously all the time.

For each CPU in the system, a data structure with both clock bases is available. Each clock base is equipped with a red-black tree that sorts all pending high-resolution timers. Figure 15-12 summarizes the situation graphically. Two clock bases (monotonic and real time) are available per CPU. All timers are sorted by expiration time on a red-black tree, and expired timers whose callback handlers still need to be executed are moved from the red-black tree to a linked list.
Figure 15-12: Overview of the data structures used to implement high-resolution timers.

A clock base is given by the following data structure:
struct hrtimer_clock_base {
        struct hrtimer_cpu_base *cpu_base;
        clockid_t               index;
        struct rb_root          active;
        struct rb_node          *first;
        ktime_t                 resolution;
        ktime_t                 (*get_time)(void);
        ktime_t                 (*get_softirq_time)(void);
        ktime_t                 softirq_time;
#ifdef CONFIG_HIGH_RES_TIMERS
        ktime_t                 offset;
        int                     (*reprogram)(struct hrtimer *t,
                                             struct hrtimer_clock_base *b,
                                             ktime_t n);
#endif
};
The meaning of the fields is as follows:

❑ cpu_base points to the per-CPU base to which the clock base belongs.

❑ index distinguishes between CLOCK_MONOTONIC and CLOCK_REALTIME.

❑ active is the root of a red-black tree on which all active timers are sorted.

❑ first points to the timer that will expire first.

❑ Processing high-res timers is initiated from the high-resolution timer softIRQ HRTIMER_SOFTIRQ as described in the next section. softirq_time stores the time at which the softIRQ was issued, and get_softirq_time is a function to obtain this time. If high-resolution mode is not active, then the stored time will be coarse-grained.

❑ get_time reads the fine-grained time. This is simple for the monotonic clock (the value delivered by the current clock source can be directly used), but some straightforward arithmetic is required to convert the value into the real system time.

❑ resolution denotes the resolution of the timer in nanoseconds.

❑ When the real-time clock is adjusted, a discrepancy between the expiration values of timers stored on the CLOCK_REALTIME clock base and the current real time will arise. The offset field helps to fix the situation by denoting an offset by which the timers need to be corrected. Since this is only a temporary effect that happens only seldom, the complications need not be discussed in more detail.

❑ reprogram is a function that allows for reprogramming a given timer event, that is, changing the expiration time.
Two clock bases are established for each CPU using the following data structure:
struct hrtimer_cpu_base {
        struct hrtimer_clock_base clock_base[HRTIMER_MAX_CLOCK_BASES];
#ifdef CONFIG_HIGH_RES_TIMERS
        ktime_t                   expires_next;
        int                       hres_active;
        struct list_head          cb_pending;
        unsigned long             nr_events;
#endif
};
HRTIMER_MAX_CLOCK_BASES is currently set to 2 because a monotonic and a real-time clock exist as discussed above. Note that the clock bases are directly embedded into hrtimer_cpu_base and not referenced via pointers! The remaining fields of the structure are used as follows:
❑ expires_next contains the absolute time of the next event that is due for expiration.

❑ hres_active is used as a Boolean variable to signal if high-resolution mode is active, or if only low resolution is available.

❑ When a timer expires, it is moved from the red-black tree to a list headed by cb_pending.15 Note that the timers on this list still need to be processed. This will take place in the softIRQ handler.

❑ nr_events keeps track of the total number of timer interrupts.
The global per-CPU variable hrtimer_bases contains an instance of struct hrtimer_cpu_base for each processor in the system. Initially it is equipped with the following contents:
kernel/hrtimer.c
DEFINE_PER_CPU(struct hrtimer_cpu_base, hrtimer_bases) =
{
        .clock_base =
        {
                {
                        .index = CLOCK_REALTIME,
                        .get_time = &ktime_get_real,
                        .resolution = KTIME_LOW_RES,
                },
                {
                        .index = CLOCK_MONOTONIC,
                        .get_time = &ktime_get,
                        .resolution = KTIME_LOW_RES,
                },
        }
};
Since the system is initialized in low-resolution mode, the achievable resolution is only KTIME_LOW_RES. The pre-processor constant denotes the timer interval between periodic ticks with frequency HZ in nanoseconds. ktime_get and ktime_get_real both obtain the current time by using getnstimeofday, discussed in Section 15.3. A very important component is still missing. How is a timer itself specified? The kernel provides the following data structure for this purpose:
struct hrtimer {
        struct rb_node       node;
        ktime_t              expires;
        int                  (*function)(struct hrtimer *);
        struct hrtimer_base  *base;
        unsigned long        state;
#ifdef CONFIG_HIGH_RES_TIMERS
        enum hrtimer_cb_mode cb_mode;
        struct list_head     cb_entry;
#endif
};
15 This requires that the timer is allowed to be executed in softIRQ context. Alternatively, timers are expired directly in the clock hardware IRQ without involving the detour via the expiration list.
node is used to keep the timer on the red-black tree as mentioned above, and base points to the timer base. The fields that are interesting for the timer's user are function and expires. While the latter denotes the expiration time, function is the callback employed when the timer expires. cb_entry is the list element that allows for keeping the timer on the callback list headed by hrtimer_cpu_base->cb_pending.

Each timer may specify conditions under which it may or must be run. The following choices are possible:
/*
 * hrtimer callback modes:
 *
 *      HRTIMER_CB_SOFTIRQ:             Callback must run in softirq context
 *      HRTIMER_CB_IRQSAFE:             Callback may run in hardirq context
 *      HRTIMER_CB_IRQSAFE_NO_RESTART:  Callback may run in hardirq context and
 *                                      does not restart the timer
 *      HRTIMER_CB_IRQSAFE_NO_SOFTIRQ:  Callback must run in hardirq context
 *                                      Special mode for tick emulation
 */
enum hrtimer_cb_mode {
        HRTIMER_CB_SOFTIRQ,
        HRTIMER_CB_IRQSAFE,
        HRTIMER_CB_IRQSAFE_NO_RESTART,
        HRTIMER_CB_IRQSAFE_NO_SOFTIRQ,
};
The comment explains the meaning of the individual constants well, and nothing need be added. The current state of a timer is kept in state. The following values are possible:16
❑ HRTIMER_STATE_INACTIVE denotes an inactive timer.
❑ A timer that is enqueued on a clock base and waiting for expiration is in state HRTIMER_STATE_ENQUEUED.
❑ HRTIMER_STATE_CALLBACK states that the callback is currently executing.
❑ When the timer has expired and is waiting on the callback list to be executed, the state is HRTIMER_STATE_PENDING.
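For reference, the states are represented as bit values in <hrtimer.h>; the definitions look essentially as follows (quoted from memory of the kernel version discussed here, so treat the exact values as illustrative rather than authoritative):

#define HRTIMER_STATE_INACTIVE	0x00	/* timer is not active at all */
#define HRTIMER_STATE_ENQUEUED	0x01	/* waiting on a clock base */
#define HRTIMER_STATE_CALLBACK	0x02	/* callback is currently running */
#define HRTIMER_STATE_PENDING	0x04	/* waiting on the callback list */

Since these are bit values, the rare combined state mentioned in the footnote can be expressed by simply OR-ing the individual constants.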
The callback function deserves some special consideration. Two return values are possible:
enum hrtimer_restart {
    HRTIMER_NORESTART,	/* Timer is not restarted */
    HRTIMER_RESTART,	/* Timer must be restarted */
};
Usually, the callback will return HRTIMER_NORESTART when it has finished executing. In this case, the timer will simply disappear from the system. However, the timer can also choose to be restarted. This requires two steps from the callback:
1. The result of the callback must be HRTIMER_RESTART.
16 In a rare corner case, it is also possible that a timer is both in the states
HRTIMER_STATE_ENQUEUED and HRTIMER_STATE_CALLBACK. See the commentary in
2. The expiration of the timer must be set to a future point in time. The callback function can perform this manipulation because it gets a pointer to the hrtimer instance for the currently running timer as function parameter. To simplify matters, the kernel provides an auxiliary function to forward the expiration time of a timer:
unsigned long hrtimer_forward(struct hrtimer *timer, ktime_t now, ktime_t interval);
This resets the timer so that it expires after now [usually now is set to the value returned by hrtimer_clock_base->get_time()]. The exact expiration time is determined by taking the old expiration time of the timer and adding interval so often that the new expiration time lies past now. The function returns the number of times that interval had to be added to the expiration time to exceed now. Let us illustrate the behavior by an example. If the old expiration time is 5, now is 12, and interval is 2, then the new expiration time will be 13. The return value is 4 because 13 = 5 + 4 × 2. A common application for high-resolution timers is to put a task to sleep for a specified, short amount of time. The kernel provides another data structure for this purpose:
struct hrtimer_sleeper {
    struct hrtimer timer;
    struct task_struct *task;
};
An hrtimer instance is bundled with a pointer to the task in question. The kernel uses hrtimer_wakeup as the expiration function for sleepers. When the timer expires, the hrtimer_sleeper can be derived from the hrtimer using the container_of mechanism (note that the timer is embedded in struct hrtimer_sleeper), and the associated task can be woken up.
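The expiration function is short; the following is a sketch closely modeled on the kernel's hrtimer_wakeup, lightly abridged to show only the container_of technique just described:

static enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer)
{
    /* Recover the surrounding hrtimer_sleeper from the embedded timer */
    struct hrtimer_sleeper *t =
        container_of(timer, struct hrtimer_sleeper, timer);
    struct task_struct *task = t->task;

    t->task = NULL;
    if (task)
        wake_up_process(task);	/* put the sleeper back on the run queue */

    return HRTIMER_NORESTART;	/* one-shot: the timer is not re-armed */
}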
15.4.2 Setting Timers
Setting a new timer is a two-step process:
1. hrtimer_init is used to initialize an hrtimer instance.

void hrtimer_init(struct hrtimer *timer, clockid_t which_clock, enum hrtimer_mode mode);

timer denotes the affected high-resolution timer, which_clock is the clock to bind the timer to, and mode specifies if absolute or relative time values (relative to the current time) are used. Two constants are available for selection:
enum hrtimer_mode {
    HRTIMER_MODE_ABS,	/* Time value is absolute */
    HRTIMER_MODE_REL,	/* Time value is relative to now */
};

2. hrtimer_start sets the expiration time of a timer and starts it.
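To illustrate how the pieces fit together, the following sketch arms a timer on the monotonic clock that fires 100 ms from now and then re-arms itself from its callback with hrtimer_forward. It is not taken from the kernel sources; the names periodic_timer, period, and periodic_handler are invented for the example, and error handling is omitted.

#include <linux/hrtimer.h>
#include <linux/ktime.h>

static struct hrtimer periodic_timer;	/* hypothetical example timer */
static ktime_t period;			/* re-arm interval */

/* Callback: push the expiration time forward and request a restart. */
static enum hrtimer_restart periodic_handler(struct hrtimer *timer)
{
    /* now is obtained via the clock base, as described above */
    hrtimer_forward(timer, timer->base->get_time(), period);
    return HRTIMER_RESTART;
}

static void start_periodic_timer(void)
{
    period = ktime_set(0, 100 * 1000 * 1000);	/* 100 ms in nanoseconds */

    /* Step 1: bind the timer to the monotonic clock in relative mode */
    hrtimer_init(&periodic_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
    periodic_timer.function = periodic_handler;

    /* Step 2: set the expiration time and start the timer */
    hrtimer_start(&periodic_timer, period, HRTIMER_MODE_REL);
}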
The implementation of both functions is purely technical and not very interesting; their code need not be discussed in detail. To cancel a scheduled timer, the kernel offers hrtimer_cancel and hrtimer_try_to_cancel. The difference between both functions is that hrtimer_try_to_cancel provides the extra return value −1 if the timer is currently executing and thus cannot be stopped anymore. hrtimer_cancel waits until the handler has executed in this case. Besides, both functions return 0 if the timer was not active, and 1 if it was active, that is, if its status is either HRTIMER_STATE_ENQUEUED or HRTIMER_STATE_PENDING. Restarting a canceled timer is done with hrtimer_restart:
int hrtimer_cancel(struct hrtimer *timer)
int hrtimer_try_to_cancel(struct hrtimer *timer)
int hrtimer_restart(struct hrtimer *timer)
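Continuing the hypothetical example from above, the timer would typically be torn down with hrtimer_cancel before the surrounding code goes away:

static void stop_periodic_timer(void)
{
    /* Waits until a possibly running callback has finished. */
    hrtimer_cancel(&periodic_timer);
}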
15.4.3 Implementation
After having introduced all required data structures and components, let’s fill in the last missing pieces by discussing the mechanisms of how high-resolution timers are expired and their callback function run. Recall that parts of the high-resolution timer framework are also compiled into the kernel even if explicit support for them is disabled. Expiring high-resolution timers is in this case driven by a clock with low resolution. This avoids code duplication because users of high-resolution timers need not supply an extra version of their timing-related code for systems that do not have high-resolution capabilities. The high-resolution framework is employed as usual, but operates with only low resolution. Even if high-resolution support is compiled into the kernel, only low resolution will be available at boot time, so the situation is identical to the one described above. Therefore, we need to take two possibilities into account for how high-resolution timers are run: based on a proper clock with high-resolution capabilities, and based on a low-resolution clock.
High-Resolution Timers in High-Resolution Mode
Let us first assume that a high-resolution clock is up and running, and that the transition to high-resolution mode is completely finished. The general situation is depicted in Figure 15-13. When the clock event device responsible for high-resolution timers raises an interrupt, hrtimer_interrupt is called as event handler. The function is responsible for selecting all timers that have expired and either moving them to the expiration list (if they may be processed in softIRQ context) or calling the handler function directly. After reprogramming the clock event device so that an interrupt is raised when the next pending timer expires, the softIRQ HRTIMER_SOFTIRQ is raised. When the softIRQ executes, run_hrtimer_softirq takes care of executing the handler functions of all timers on the expiration list. Let’s discuss the code responsible for implementing all this. First, consider the interrupt handler hrtimer_interrupt. Some initialization work is necessary in the beginning: kernel/hrtimer.c
void hrtimer_interrupt(struct clock_event_device *dev)
{
    struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
    struct hrtimer_clock_base *base;
    ktime_t expires_next, now;
    ...
 retry:
    now = ktime_get();

    expires_next.tv64 = KTIME_MAX;

    base = cpu_base->clock_base;
    ...
Figure 15-13: Overview of expiration of high-resolution timers with high-resolution clocks.
The expiration time of the timer that is due next is stored in expires_next. Setting this to KTIME_MAX initially is another way of saying that no next timer is available. The main work is to iterate over all clock bases (monotonic and real-time). kernel/hrtimer.c
    for (i = 0; i < HRTIMER_MAX_CLOCK_BASES; i++) {
        ktime_t basenow;
        struct rb_node *node;

        basenow = ktime_add(now, base->offset);
Essentially, basenow denotes the current time. base->offset is only non-zero when the real-time clock has been readjusted, so this will never affect the monotonic clock base. Starting from base->first, the expired nodes of the red-black tree can be obtained: kernel/hrtimer.c
        while ((node = base->first)) {
            struct hrtimer *timer;

            timer = rb_entry(node, struct hrtimer, node);

            if (basenow.tv64 < timer->expires.tv64) {
                ktime_t expires;

                expires = ktime_sub(timer->expires, base->offset);
                if (expires.tv64 < expires_next.tv64)
                    expires_next = expires;
                break;
            }
If the next timer’s expiration time lies in the future, processing can be stopped by leaving the while loop. The time of expiration is, however, remembered because it is later required to reprogram the clock event device. If the current timer has expired, it is moved to the callback list for later processing in the softIRQ if this is allowed, that is, if HRTIMER_CB_SOFTIRQ is set. continue ensures that the code moves to the next timer. Erasing the timer with __remove_hrtimer also selects the next expiration candidate by updating base->first. Additionally, this sets the timer state to HRTIMER_STATE_PENDING: kernel/hrtimer.c
            if (timer->cb_mode == HRTIMER_CB_SOFTIRQ) {
                __remove_hrtimer(timer, base, HRTIMER_STATE_PENDING, 0);
                list_add_tail(&timer->cb_entry,
                              &base->cpu_base->cb_pending);
                raise = 1;
                continue;
            }
Otherwise, the timer callback is directly executed in hard interrupt context. Note that this time __remove_hrtimer sets the timer state to HRTIMER_STATE_CALLBACK because the callback handler is executed immediately afterward: kernel/hrtimer.c
            __remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
            ...
            if (timer->function(timer) != HRTIMER_NORESTART) {
                enqueue_hrtimer(timer, base, 0);
            }
            timer->state &= ~HRTIMER_STATE_CALLBACK;
        }
        base++;
    }
The callback handler is executed by timer->function(timer). If the handler requests to be restarted by returning HRTIMER_RESTART, then enqueue_hrtimer fulfills this request. The HRTIMER_STATE_CALLBACK flag can be removed once the handler has been executed. When the pending timers of all clock bases have been selected, the kernel needs to reprogram the event device to raise an interrupt when the next timer is due. Additionally, the HRTIMER_SOFTIRQ must be raised if any timers are waiting on the callback list: kernel/hrtimer.c
    cpu_base->expires_next = expires_next;

    /* Reprogramming necessary ? */
    if (expires_next.tv64 != KTIME_MAX) {
        if (tick_program_event(expires_next, 0))
            goto retry;
    }

    /* Raise softirq ? */
    if (raise)
        raise_softirq(HRTIMER_SOFTIRQ);
}
Note that reprogramming fails if the next expiration date is already in the past — this can happen if timer processing took too long. In this case, the whole processing sequence is restarted by jumping to the retry label at the beginning of the function. One more final step is necessary to complete one round of high-resolution timer handling: Run the softIRQ to execute the pending callbacks. The softIRQ handler is run_hrtimer_softirq, and Figure 15-14 shows the code flow diagram.17
Figure 15-14: Code flow diagram for run_hrtimer_softirq. Essentially, the function iterates over the list of all pending timers. For each timer, the callback handler is executed. If the timer requests to be restarted, then enqueue_hrtimer does the required mechanics.
High-Resolution Timers in Low-Resolution Mode
What if no high-resolution clocks are available? In this case, expiring high-resolution timers is initiated from hrtimer_run_queues, which is called by the timer softIRQ TIMER_SOFTIRQ (since softIRQ processing is based on low-resolution timers in this case, the mechanism naturally does not provide any high-resolution capabilities). The code flow diagram is depicted in Figure 15-15. Note that this is a simplified version. In reality, the function is more involved because switching from low- to high-resolution mode is started from this place. However, these problems will not bother us now; the required extensions are discussed in Section 15.4.5.
Figure 15-15: Code flow diagram for hrtimer_run_queues. 17 The corner case that a timer is rearmed on another CPU after the callback has been executed is omitted. This possibly requires
reprogramming the clock event device to the new expiration time if the timer is the first on the tree to expire.
The mechanism is not particularly complicated: After the coarse time is stored in the timer base by hrtimer_get_softirq_time, the code loops over all clock bases (the monotonic and real-time clocks) and processes the entries in each queue with run_hrtimer_queue. First of all, the function checks if any timers must be processed (if base->first is a NULL pointer, then no timer is enqueued, and thus nothing needs to be done): kernel/hrtimer.c
static inline void run_hrtimer_queue(struct hrtimer_cpu_base *cpu_base, int index)
{
    struct rb_node *node;
    struct hrtimer_clock_base *base = &cpu_base->clock_base[index];

    if (!base->first)
        return;

    if (base->get_softirq_time)
        base->softirq_time = base->get_softirq_time();
    ...
Now the kernel has to find all timers that have expired and must be activated: kernel/hrtimer.c
    while ((node = base->first)) {
        struct hrtimer *timer;
        enum hrtimer_restart (*fn)(struct hrtimer *);
        int restart;

        timer = rb_entry(node, struct hrtimer, node);
        if (base->softirq_time.tv64 <= timer->expires.tv64)
            break;
        ...
        fn = timer->function;
        __remove_hrtimer(timer, base, HRTIMER_STATE_CALLBACK, 0);
        ...
Starting from the timer that is the first expiration candidate (base->first), the kernel checks if the timer has already expired and calls the timer’s expiration function if this is the case. Recall that erasing the timer with __remove_hrtimer also selects the next expiration candidate by updating base->first. Additionally, the flag HRTIMER_STATE_CALLBACK is set in the timer because the callback function is about to be executed: kernel/hrtimer.c
        restart = fn(timer);

        timer->state &= ~HRTIMER_STATE_CALLBACK;

        if (restart != HRTIMER_NORESTART) {
            enqueue_hrtimer(timer, base, 0);
        }
    }
}
When the handler has finished, the HRTIMER_STATE_CALLBACK flag can be removed again. If the timer requested to be put back into the queue, then enqueue_hrtimer fulfills this request.
15.4.4 Periodic Tick Emulation
The clock event handler in high-resolution mode is hrtimer_interrupt. This implies that tick_handle_periodic does not provide the periodic tick anymore. An equivalent functionality thus needs to be made available based on high-resolution timers. The implementation is (nearly) identical between the situations with and without dynamic ticks. The generic framework for dynamic ticks is discussed in Section 15.5; the required components are covered here only cursorily. Essentially, tick_sched is a special data structure to manage all relevant information about periodic ticks, and one instance per CPU is provided by the global variable tick_cpu_sched. tick_setup_sched_timer is called to activate the tick emulation layer when the kernel switches to high-resolution mode. One high-resolution timer is installed per CPU. The required instance of struct hrtimer is kept in the per-CPU variable tick_cpu_sched:
struct tick_sched {
    struct hrtimer sched_timer;
    ...
}
The function tick_sched_timer is used as the callback handler. To avoid a situation in which all CPUs are engaged in running the periodic tick handlers at the same time, the kernel distributes the activation times as shown in Figure 15-16. Recall that the length of a tick period (in nanoseconds) is tick_period. The ticks are spread across the first half of this period. Assume that the first tick starts at time 0. If the system contains N CPUs, the remaining periodic ticks are started at times Δ, 2Δ, 3Δ, and so on, where the offset Δ is given by tick_period/(2N).
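As a concrete illustration (the numbers are chosen purely for the example), consider a system with HZ = 250, that is, tick_period = 4,000,000 ns, and N = 4 CPUs. The offset then amounts to Δ = 4,000,000/(2 × 4) = 500,000 ns, so the per-CPU tick timers fire at 0 ms, 0.5 ms, 1 ms, and 1.5 ms within each 4 ms period, and all of them fall into the first half of the period as required.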
Figure 15-16: Distributing periodic tick handlers in high-resolution mode. The tick timer is registered like every other regular high-resolution timer. The function displays some similarities to tick_periodic, but is slightly more complicated. The code flow diagram is shown in Figure 15-17. If the CPU that is currently executing the timer is responsible to provide the global tick (recall that this duty has already been distributed in low-resolution mode at boot time), then tick_do_update_jiffies64 computes the number of jiffies that have passed since the last update — in our case, this will always be
1 because I do not consider dynamic tick mode for now. The previously discussed function do_timer is used to handle all duties of the global timer. Recall that this includes updating the global jiffies64 variable.
Figure 15-17: Code flow diagram for tick_sched_timer. When the per-CPU periodic tick tasks have been performed in update_process_times (see Section 15.8) and profile_tick, the time for the next event is computed, and hrtimer_forward programs the timer accordingly. By returning HRTIMER_RESTART, the timer is automatically re-queued and activated when the next tick is due.
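In schematic form, the handler is structured roughly as follows. This is a simplified sketch that merely follows the code flow diagram; it is not the verbatim kernel source, and details such as the NULL check on the register set are omitted.

static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
{
    struct tick_sched *ts = container_of(timer, struct tick_sched, sched_timer);
    ktime_t now = ktime_get();

    /* Global tick duties, but only on the responsible CPU */
    if (smp_processor_id() == tick_do_timer_cpu)
        tick_do_update_jiffies64(now);

    /* Per-CPU tick duties */
    update_process_times(user_mode(get_irq_regs()));
    profile_tick(CPU_PROFILING);

    /* Program the next tick and ask to be re-queued */
    hrtimer_forward(timer, now, tick_period);
    return HRTIMER_RESTART;
}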
15.4.5 Switching to High-Resolution Timers
High-resolution timers are not enabled from the very beginning, but can only be activated when suitable high-resolution clock sources have been initialized and added to the generic clock framework. Low-resolution ticks, however, are provided (nearly) from the very beginning. In the following, I discuss how the kernel switches from low- to high-resolution mode. The high-resolution queue is processed by hrtimer_run_queues when low-resolution timers are active. Before the queues are run, the function checks if a clock event device suitable for high-resolution timers is present in the system. In this case, the switch to high-resolution mode is performed: kernel/hrtimer.c
void hrtimer_run_queues(void)
{
    ...
    if (tick_check_oneshot_change(!hrtimer_is_hres_enabled()))
        if (hrtimer_switch_to_hres())
            return;
    ...
}

tick_check_oneshot_change signals that high-resolution timers can be used if a clock is available that supports one-shot mode and fulfills the resolution requirements for high-resolution timers, that is, if the flag CLOCK_SOURCE_VALID_FOR_HRES is set. hrtimer_switch_to_hres performs the actual switch. The required steps are summarized in Figure 15-18.
Figure 15-18: Code flow diagram for hrtimer_switch_to_hres.
tick_init_highres is a wrapper function using tick_switch_to_oneshot to set the clock event device to one-shot mode. Additionally, hrtimer_interrupt is installed as event handler. Afterward the periodic tick emulation is activated with tick_setup_sched_timer as discussed above. Since the
resolution is now improved, this also needs to be reflected in the data structures: kernel/hrtimer.c
static int hrtimer_switch_to_hres(void)
{
    ...
    base->hres_active = 1;
    base->clock_base[CLOCK_REALTIME].resolution = KTIME_HIGH_RES;
    base->clock_base[CLOCK_MONOTONIC].resolution = KTIME_HIGH_RES;
    ...
}
Finally, retrigger_next_event reprograms the clock event device to set the ball rolling. High-resolution support is now active!
15.5 Dynamic Ticks
Periodic ticks have provided a notion of time to the Linux kernel for many years. The approach is simple and effective, but shows one particular deficiency on systems where power consumption does matter: The periodic tick requires that the system is in an active state at a certain frequency. Longer periods of rest are impossible because of this. Dynamic ticks remedy this problem. The periodic tick is only activated when some tasks actually do need to be performed. Otherwise, it is temporarily disabled. Support for this technique can be selected at compile time, and the resulting system is also referred to as a tickless system. However, this name is not entirely accurate because the fundamental frequency HZ at which the periodic tick operates when it is functional still provides a raster for time flow. Since the tick can be activated and deactivated according to the current needs, the term dynamic ticks fits very well. How can the kernel decide if the system has nothing to do? Recall from Chapter 2 that if no active tasks are on the run queue, the kernel picks a special task — the idle task — to run. At this point, the dynamic tick mechanism enters the game. Whenever the idle task is selected to run, the periodic tick is disabled
until the next timer will expire. The tick is re-enabled again after this time span, or when an interrupt occurs. In the meantime, the CPU can enjoy a well-deserved sleep. Note that only classical timers need to be considered for this purpose. High-resolution timers are not bound by the tick frequency, and are also not implemented on top of periodic ticks. Before discussing the dynamic tick implementation, let us note that one-shot clocks are a prerequisite for them. Since a key feature of dynamic ticks is that the tick mechanism can be stopped and restarted as necessary, purely periodic timers will fundamentally not suit the mechanism. In the following, periodic ticks mean a tick implementation that does not use dynamic ticks. This must not be confused with clock event devices that work in periodic mode.
15.5.1 Data Structures
Dynamic ticks need to be implemented differently depending on whether high- or low-resolution timers are used. In both cases, the implementation is centered around the following data structure:
struct tick_sched {
    struct hrtimer        sched_timer;
    enum tick_nohz_mode   nohz_mode;
    ktime_t               idle_tick;
    int                   tick_stopped;
    unsigned long         idle_jiffies;
    unsigned long         idle_calls;
    unsigned long         idle_sleeps;
    ktime_t               idle_entrytime;
    ktime_t               idle_sleeptime;
    ktime_t               sleep_length;
    unsigned long         last_jiffies;
    unsigned long         next_jiffies;
    ktime_t               idle_expires;
};
The individual elements are used as follows:
❑ sched_timer represents the timer used to implement the ticks.
❑ The current mode of operation is stored in nohz_mode. There are three possibilities:

enum tick_nohz_mode {
    NOHZ_MODE_INACTIVE,
    NOHZ_MODE_LOWRES,
    NOHZ_MODE_HIGHRES,
};

NOHZ_MODE_INACTIVE is used if periodic ticks are active, while the other two constants indicate that dynamic ticks are used based on low- and high-resolution timers, respectively.
❑ idle_tick stores the expiration time of the last tick before ticks are disabled. This is important to know when ticks are enabled again because the next tick must appear at exactly the same time as if ticks had never been disabled. The proper point in time can be computed by using the value stored in idle_tick as basis. A sufficient number of tick intervals are added to obtain the expiration time for the next tick.
❑ tick_stopped is 1 if periodic ticks are stopped, that is, if there is nothing tick-based currently to do. Otherwise, the value is 0.

The remaining fields are used for bookkeeping:
❑ idle_jiffies stores the value of jiffies when periodic ticks were disabled.
❑ idle_calls counts how often the kernel has tried to deactivate periodic ticks. idle_sleeps counts how often this actually succeeded. The values differ because the kernel does not deactivate ticks if the next tick is only one jiffy away.
❑ idle_entrytime stores the exact time (with the best current resolution) when periodic ticks were last disabled.
❑ sleep_length stores how long the periodic tick will remain disabled, that is, the difference between the time the tick was disabled and when the next tick is scheduled to happen.
❑ idle_sleeptime accumulates the total time spent with ticks deactivated.
❑ next_jiffies stores the jiffy value at which the next timer will expire.
❑ idle_expires stores when the next classical timer is due to expire. In contrast to the value above, the resolution of the value is as good as possible and not in jiffies.

The statistical information gathered in tick_sched is exported to userland via /proc/timer_list. tick_cpu_sched is a global per-CPU variable that provides an instance of struct tick_sched. This is required because disabling ticks naturally works per CPU, not globally for the whole system.
15.5.2 Dynamic Ticks for Low-Resolution Systems
Consider the situation in which the kernel does not use high-resolution timers and provides only low resolution. How are dynamic ticks implemented in this scenario? Recall from above that the timer softIRQ calls hrtimer_run_queues to process the high-resolution timer queue, even if only low resolution is available in the underlying clock event device. Again, I emphasize that this does not provide better resolution for timers, but makes it possible to use the existing framework independent of the clock resolution.
Switching to Dynamic Ticks
hrtimer_run_queues calls tick_check_oneshot_change to decide if high-resolution timers can be activated. Additionally, the function checks if dynamic ticks can be enabled on low-resolution systems. This is possible under two conditions:
1. A clock event device that supports one-shot mode is available.
2. High-resolution mode is not enabled.
If both are fulfilled, then tick_nohz_switch_to_nohz is called to activate dynamic ticks. However, this does not ultimately enable dynamic ticks. If support for tickless systems was disabled at compile time, the function is just an empty dummy function, and the kernel will remain in periodic tick mode. Otherwise, the kernel proceeds as shown in Figure 15-19.
Figure 15-19: Code flow diagram for tick_nohz_switch_to_nohz. The most important change required for the transition to dynamic ticks is to set the clock event device to one-shot mode, and to install an appropriate tick handler. This is done by calling tick_switch_to_oneshot. The new handler is tick_nohz_handler, examined below. Since the dynamic tick mode is now active, the nohz_mode field of the per-CPU instance of struct tick_sched is changed to NOHZ_MODE_LOWRES. To get things going, the kernel finally needs to activate the first periodic tick by setting the timer to expire at the point in time when the next periodic tick would have been due.
The Dynamic Tick Handler
The new tick handler tick_nohz_handler needs to assume two responsibilities:
1. Perform all actions required for the tick mechanism.
2. Reprogram the tick device such that the next tick expires at the right time.
The code to satisfy these requirements looks as follows. Some initialization work is required to obtain the per-CPU instance of struct tick_sched and the current time: kernel/time/tick-sched.c
static void tick_nohz_handler(struct clock_event_device *dev)
{
    struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);
    struct pt_regs *regs = get_irq_regs();
    int cpu = smp_processor_id();
    ktime_t now = ktime_get();

    dev->next_event.tv64 = KTIME_MAX;
The role of the global tick device is as before assumed by one particular CPU, and the handler needs to check if the current CPU is the responsible one. However, the situation is a bit more complicated with dynamic ticks. If a CPU goes into a long sleep, then it cannot be responsible for the global tick anymore, and drops the duty. If this is the case, the next CPU whose tick handler is called must assume the duty18 : kernel/time/tick-sched.c
    if (unlikely(tick_do_timer_cpu == -1))
        tick_do_timer_cpu = cpu;

18 The case in which all processors sleep for longer than one jiffy is also possible. The kernel needs to consider this case as the discussion of tick_do_update_jiffies64 shows below.
    /* Check, if the jiffies need an update */
    if (tick_do_timer_cpu == cpu)
        tick_do_update_jiffies64(now);

    update_process_times(user_mode(regs));
    profile_tick(CPU_PROFILING);
If the CPU is responsible to provide the global tick, it is sufficient to call tick_do_update_jiffies64, which takes care of everything required — details will follow in a moment. update_process_times and profile_tick take over the duties of the local tick as you have seen several times before. The crucial part is to reprogram the tick device. If the tick mechanism is stopped on the current CPU, this is not necessary, and the CPU will go into a complete sleep. (Note that setting next_event.tv64 = KTIME_MAX ensures that the event device will not expire anytime soon, or never for practical purposes.) If ticks are active, then tick_nohz_reprogram sets the tick timer to expire at the next jiffy. The while loop ensures that reprogramming is repeated until it succeeds if the processing should have taken too long and the next tick lies already in the past: kernel/time/tick-sched.c
    /* Do not restart, when we are in the idle loop */
    if (ts->tick_stopped)
        return;

    while (tick_nohz_reprogram(ts, now)) {
        now = ktime_get();
        tick_do_update_jiffies64(now);
    }
}
Updating Jiffies
The global tick device calls tick_do_update_jiffies64 to update the global jiffies_64 variable, the basis of low-resolution timer handling. When periodic ticks are in use, this is comparatively simple because the function is called whenever a jiffy has passed. When dynamic ticks are enabled, the situation can arise in which all CPUs of the system are idle and none provides global ticks. This needs to be taken into account by tick_do_update_jiffies64. Let’s go directly to the code to see how: kernel/time/tick-sched.c
static void tick_do_update_jiffies64(ktime_t now)
{
    unsigned long ticks = 0;
    ktime_t delta;

    delta = ktime_sub(now, last_jiffies_update);
Since the function needs to decide if more than a single jiffy has passed since the last update, the difference between the current time and last_jiffies_update must be computed.
Updating the jiffies value is naturally only required if the last update is more than one tick period ago: kernel/time/tick-sched.c
    if (delta.tv64 >= tick_period.tv64) {
        delta = ktime_sub(delta, tick_period);
        last_jiffies_update = ktime_add(last_jiffies_update, tick_period);
The most common case is that one tick period has passed since the last jiffy update, and the code shown above handles this situation by increasing last_jiffies_update correspondingly. This accounts for the present tick. However, it is also possible that the last update was more than one jiffy ago. Some more effort is required in this case: kernel/time/tick-sched.c
        /* Slow path for long timeouts */
        if (unlikely(delta.tv64 >= tick_period.tv64)) {
            s64 incr = ktime_to_ns(tick_period);

            ticks = ktime_divns(delta, incr);

            last_jiffies_update = ktime_add_ns(last_jiffies_update,
                                               incr * ticks);
        }
The computation of ticks computes one tick less than the number of ticks that have been skipped, and last_jiffies_update is updated accordingly. Note that the offset by one is necessary because one tick period was already added to last_jiffies_update at the very beginning. This way, the usual case (i.e., one tick period since the last update) runs fast, while more effort is required for the unusual case where more than one tick period has passed since the last update. Finally, do_timer is called to update the global jiffies value as discussed in Section 15.2.1: kernel/time/tick-sched.c
        do_timer(++ticks);
    }
}
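To make the bookkeeping concrete, suppose (purely as an example) that five tick periods have passed since the last update: the initial subtraction and addition account for one of them, the slow path then computes ticks = 4 and advances last_jiffies_update by four more periods, and do_timer(++ticks) finally increments jiffies_64 by 5, which is exactly the number of periods that have elapsed.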
15.5.3 Dynamic Ticks for High-Resolution Systems
Since clock event devices run in one-shot mode anyway if the kernel uses high timer resolution, support for dynamic ticks is much easier to implement than in the low-resolution case. Recall that the periodic tick is emulated by tick_sched_timer as discussed above. The function is also used to implement dynamic ticks. In the discussion in Section 15.4.4, I omitted two elements required for dynamic ticks:
1. Since CPUs can drop global tick duties, the handler needs to check if this has been the case, and assume the duties: kernel/time/tick-sched.c
#ifdef CONFIG_NO_HZ
    if (unlikely(tick_do_timer_cpu == -1))
        tick_do_timer_cpu = cpu;
#endif
This code is run at the very beginning of tick_sched_timer.
2. When the handler is finished, it is usually required to reprogram the tick device such that the next tick will happen at the right time. If ticks are stopped, this is not necessary: kernel/time/tick-sched.c
    /* Do not restart, when we are in the idle loop */
    if (ts->tick_stopped)
        return HRTIMER_NORESTART;
Only a single change to the existing code is required to initialize dynamic tick mode in a high-resolution regime. Recall that tick_setup_sched_timer is used to initialize the tick emulation layer for high-resolution systems. If dynamic ticks are enabled at compile time, a short piece of code is added to the function: kernel/time/tick-sched.c
void tick_setup_sched_timer(void)
{
    ...
#ifdef CONFIG_NO_HZ
    if (tick_nohz_enabled)
        ts->nohz_mode = NOHZ_MODE_HIGHRES;
#endif
}
This announces officially that dynamic ticks are in use with high-resolution timers.
15.5.4 Stopping and Starting Periodic Ticks Dynamic ticks provide the framework to defer periodic ticks for a while. What the kernel still needs to decide is when ticks are supposed to be stopped and restarted. A natural possibility to stop ticks is when the idle task is scheduled: This proves that a processor really does not have anything better to do. tick_nohz_stop_sched_tick is provided by the dynamic tick framework to stop ticks. Note that the same function is used independent of low and high resolution. If dynamic ticks are disabled at compile time, the function is replaced by an empty dummy. The idle task is implemented in an architecture-specific way, and not all architectures have been updated to support disabling the periodic tick yet. At the time of writing, ARM, MIPS, PowerPC, SuperH, Sparc64, IA-32, and AMD6419 turn off ticks in the idle task. Integrating tick_nohz_stop_sched_tick is rather straightforward. Consider, for instance, the implementation of cpu_idle (which is run in the idle task) on ARM systems: arch/arm/kernel/process.c
void cpu_idle(void)
{
    ...
    /* endless idle loop with no priority at all */
    while (1) {
        ...
        tick_nohz_stop_sched_tick();
        while (!need_resched())
            idle();
        ...
        tick_nohz_restart_sched_tick();
        ...
    }
}

19 And user-mode Linux if you want to count that as a separate architecture.
Other architectures differ in some details, but the general principle is the same. After calling tick_nohz_stop_sched_tick to turn off ticks, the system goes into an endless loop that ends when a process is available to be scheduled on the processor. Ticks are then necessary again, and are reactivated by tick_nohz_restart_sched_tick. Recall that a sleeping process waits for some condition to be fulfilled such that it switches into a runnable state. A change of this condition is signaled by an interrupt — just suppose that the process has been waiting for some data to arrive, and the interrupt notifies the system that the data are now available. Since interrupts occur at random times from the kernel’s point of view, it can well happen that one is raised during an idle period with ticks turned off. Two conditions can thus require restarting ticks:
1. An external interrupt makes a process runnable, which requires the tick mechanism to work.20 In this case, ticks need to be resumed earlier than initially planned.
2. The next tick event is due, and the clock interrupt signals that the time for this has come. In this case, the tick mechanism is resumed as planned before.
Stopping Ticks
Essentially, tick_nohz_stop_sched_tick needs to perform three tasks:
1. Check if the next timer wheel event is more than one tick away.
2. If this is the case, reprogram the tick device to omit the next tick only when it is necessary again. This automatically omits all ticks that are not required.
3. Update the statistical information in tick_sched.
Since many details require much attention to corner cases, the actual implementation of tick_nohz_stop_sched_tick is rather bulky, so I consider a simplified version below.
First of all, the kernel needs to obtain the tick device and the tick_sched instance for the current CPU: kernel/time/tick-sched.c
void tick_nohz_stop_sched_tick(void)
{
    unsigned long seq, last_jiffies, next_jiffies, delta_jiffies, flags;
    struct tick_sched *ts;
    ktime_t last_update, expires, now, delta;
    struct clock_event_device *dev = __get_cpu_var(tick_cpu_device).evtdev;
    int cpu;

    cpu = smp_processor_id();
    ts = &per_cpu(tick_cpu_sched, cpu);

20 To simplify matters, I ignore that tick_nohz_stop_sched_tick is also called from irq_exit if an interrupt has disturbed a tickless interval, but did not change the state of the system such that any process became runnable. This also simplifies the discussion of tick_nohz_stop_sched_tick because multiple subsequent invocations of the function need not be taken into account. Additionally, I do not discuss that the jiffies value needs to be updated in irq_enter because interrupt handlers would otherwise assume a wrong value. The function in charge of this is tick_nohz_update_jiffies.
Some statistical information is updated. Recall that the meaning of these fields has already been described in Section 15.5.1. The last jiffy update and the current jiffy value are stored in local variables: kernel/time/tick-sched.c
    now = ktime_get();
    ts->idle_entrytime = now;
    ts->idle_calls++;

    last_update = last_jiffies_update;
    last_jiffies = jiffies;
It only makes sense to deactivate ticks if the next periodic event is more than one tick away. The auxiliary function get_next_timer_interrupt analyzes the timer wheel and discovers the jiffy value at which the next event is due. delta_jiffies then denotes how many jiffies away the next event is: kernel/time/tick-sched.c
    /* Get the next timer wheel timer */
    next_jiffies = get_next_timer_interrupt(last_jiffies);
    delta_jiffies = next_jiffies - last_jiffies;
If the next tick is at least one jiffy away (note that it can also be possible that some event is due in the current jiffy), the tick device needs to be reprogrammed accordingly: kernel/time/tick-sched.c
    /* Schedule the tick, if we are at least one jiffie off */
    if ((long)delta_jiffies >= 1) {
        ts->idle_tick = ts->sched_timer.expires;
        ts->tick_stopped = 1;
        ts->idle_jiffies = last_jiffies;
The meaning of the modified tick_sched fields has been discussed before. If the current CPU had to provide the global tick, the task must be handed to another CPU. This is simply achieved by setting tick_do_timer_cpu to −1. The next tick handler that will be activated on another CPU then automatically takes the duties of the global tick source: kernel/time/tick-sched.c
        if (cpu == tick_do_timer_cpu)
            tick_do_timer_cpu = -1;

        ts->idle_sleeps++;
Finally, the tick device is reprogrammed to provide the next event at the proper point in time. While the method to set the timer differs between high- and low-resolution mode, the code jumps to the label out if programming is successful in both cases: kernel/time/tick-sched.c
        expires = ktime_add_ns(last_update,
                               tick_period.tv64 * delta_jiffies);
        ts->idle_expires = expires;

        if (ts->nohz_mode == NOHZ_MODE_HIGHRES) {
            hrtimer_start(&ts->sched_timer, expires,
                          HRTIMER_MODE_ABS);
            /* Check, if the timer was already in the past */
            if (hrtimer_active(&ts->sched_timer))
                goto out;
        } else if (!tick_program_event(expires, 0))
            goto out;

        tick_do_update_jiffies64(ktime_get());
    }
    raise_softirq_irqoff(TIMER_SOFTIRQ);
out:
    ts->next_jiffies = next_jiffies;
    ts->last_jiffies = last_jiffies;
    ts->sleep_length = ktime_sub(dev->next_event, now);
}
If reprogramming failed, then too much time was spent in processing, and the expiration date already lies in the past. In this case, tick_do_update_jiffies64 updates jiffies to the correct value, and the timer softIRQ TIMER_SOFTIRQ is raised to process any pending timer-wheel timers. Note that the softIRQ is also raised if some events are due in the current jiffy period.
Restarting Ticks
tick_nohz_restart_sched_tick is used to restart ticks. The code flow diagram is given by Figure 15-20.
Figure 15-20: Code flow diagram for tick_nohz_restart_sched_tick.
Again, the implementation is complicated by various technical details, but the general principle is rather simple. Our old acquaintance tick_do_update_jiffies64 is called first. After correctly accounting the
idle time, tick_sched->tick_stopped is set to 0 because the tick is now active again. Finally, the next tick event needs to be programmed. This is necessary because the idle time might have ended before the expected time because of an external interrupt.
15.6 Broadcast Mode
On some architectures, clock event devices will go to sleep when certain power-saving modes are active. Thankfully, systems do not have only a single clock event device, so another device that still works can replace the stopped devices. The global variable tick_broadcast_device defined in kernel/time/tick-broadcast.c contains the tick_device instance for the broadcast device. An overview of broadcast mode is given in Figure 15-21.
Figure 15-21: Overview of the situation when broadcasting replaces nonfunctional tick devices.
The APIC devices are not functional, but the broadcast event device still is. tick_handle_periodic_broadcast is used as the event handler. It deals with both periodic and one-shot modes of the broadcast device, so this need not concern us any further. The handler will be activated after each tick_period. The broadcast handler uses tick_do_periodic_broadcast. The code flow diagram is shown in Figure 15-22. The function invokes the event_handler method of the nonfunctional device on the current CPU. The handler cannot distinguish if it was invoked from a clock interrupt or from the broadcast device, and is thus executed as if the underlying event device were functional. If there are more nonfunctional local tick devices, then tick_do_broadcast employs the broadcast method of the first device in the list.21 For local APICs, the broadcast method is lapic_timer_broadcast. It is responsible for sending the inter-processor interrupt (IPI) LOCAL_TIMER_VECTOR to all CPUs that are associated with nonfunctional tick devices. The vector has been set up by the kernel to call apic_timer_interrupt. The result is that the clock event device cannot distinguish between IPIs and real interrupts, so the effect is the same as if the device were still functional.
21 This is possible because at the moment the same broadcast handler is installed on all devices that can become nonfunctional.
Figure 15-22: Code flow diagram for tick_do_periodic_broadcast. Inter-processor interrupts are slow, and thus the required accuracy and resolution for high-resolution timers will not be available. The kernel therefore always switches to low-resolution mode if broadcasting is required.
15.7 Implementing Timer-Related System Calls
The kernel provides several system calls that involve timers; the most important ones are considered in the following.
15.7.1 Time Bases
When timers are used, there are three options to distinguish how elapsed time is counted or in which time base22 the timer resides. The kernel features the following variants that draw attention to themselves by various signals when a time-out occurs:
❑ ITIMER_REAL measures the actual elapsed time between activation of the timer and time-out in order to trigger the signal. In this case, the timer continues to tick regardless of whether the system is in kernel mode or user mode or whether the application using the timer is currently running or not. A signal of the SIGALRM type is sent when the timer times out.
❑ ITIMER_VIRTUAL runs only during the time spent by the owner process of the timer in user mode. In this case, time spent in kernel mode (or when the processor is busy with another application) is ignored. Time-out is indicated by the SIGVTALRM signal.
❑ ITIMER_PROF calculates the time spent by the process both in user and kernel mode — time continues to elapse when a system call is executed on behalf of the task. Other processes of the system are ignored. The signal sent at time-out is SIGPROF.
22 Often also referred to as time domain.
As already suggested by its name, the primary use of this timer is in the profiling of applications in which a search is made for the most compute-intensive fragments of a program so that these can be optimized accordingly. This is an important consideration, particularly in scientific or operating system-related applications. The timer type — and the periodic interval length — must be specified when an interval timer is installed. In our example, ITIMER_REAL is used for a real-time timer. The behavior of alarm timers can be simulated with interval timers by selecting ITIMER_REAL as the timer type and deinstalling the timer after the first time-out. Interval timers are therefore a generalized form of alarm timers.
15.7.2 The alarm and setitimer System Calls
alarm installs timers of the ITIMER_REAL type (real-time timers), while setitimer is used to install not only real-time, but also virtual and profiling timers. The system calls all end up in do_setitimer. The implementation of both system calls rests on a common mechanism that is defined in kernel/itimer.c. The implementation is centered around struct hrtimer, so if high-resolution support is available, the corresponding advantages are automatically transferred into userland and not only available to the kernel. Note that since alarm uses a timer of type ITIMER_REAL, the system calls can interfere with each other.
The starting points of the system calls are, as usual, the two functions sys_alarm and sys_setitimer. Both functions use the auxiliary function do_setitimer to actually implement the timer: kernel/itimer.c
int do_setitimer(int which, struct itimerval *value, struct itimerval *ovalue)
Three parameters are required. which specifies the timer type, and can be ITIMER_REAL, ITIMER_VIRTUAL, or ITIMER_PROF. value contains all relevant information about the new timer. If the timer replaces an already existing one, then ovalue is employed to return the previously active timer description. Specifying timer properties is simple:
struct itimerval {
    struct timeval it_interval;	/* timer interval */
    struct timeval it_value;	/* current value */
};
Essentially, it_interval denotes the length of the periodic interval after which the timer expires. it_value denotes the amount of time remaining until the timer expires next. All details are documented in the manual page setitimer(2).
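From the userland perspective, arming a periodic real-time interval timer is straightforward. The following minimal (and hypothetical) example program requests a SIGALRM every 500 ms; it relies only on the documented setitimer(2) and sigaction(2) interfaces:

#include <signal.h>
#include <sys/time.h>
#include <unistd.h>

static void on_alarm(int sig)
{
    (void)sig;
    /* write is async-signal-safe, unlike printf */
    write(STDOUT_FILENO, "tick\n", 5);
}

int main(void)
{
    struct sigaction sa = { .sa_handler = on_alarm };
    struct itimerval val = {
        .it_interval = { .tv_sec = 0, .tv_usec = 500000 },	/* period */
        .it_value    = { .tv_sec = 0, .tv_usec = 500000 },	/* first expiration */
    };

    sigaction(SIGALRM, &sa, NULL);
    setitimer(ITIMER_REAL, &val, NULL);

    for (;;)
        pause();	/* wait for the signals to arrive */
}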
Extensions to the Task Structure
The task structure of each process contains a pointer to an instance of struct signal_struct that includes several elements to accommodate information required for timers: <sched.h>
struct signal_struct {
    ...
    /* ITIMER_REAL timer for the process */
    struct hrtimer real_timer;
    struct task_struct *tsk;
    ktime_t it_real_incr;

    /* ITIMER_PROF and ITIMER_VIRTUAL timers for the process */
    cputime_t it_prof_expires, it_virt_expires;
    cputime_t it_prof_incr, it_virt_incr;
    ...
}
Two fields each are reserved for the profiling and virtual timer types:
1. The time at which the next time-out is to occur (it_prof_expires and it_virt_expires).
2. The interval after which the timer is called (it_prof_incr and it_virt_incr).
real_timer is an instance of hrtimer (not a pointer to it) that is inserted in the other data structures of
the kernel and is used to implement real-time timers. The other two types of timer (virtual and profiling) manage without this entry. tsk points to the task structure of the process for which the timers are set. The interval for real timers is specified in it_real_incr. It is therefore possible to have just three different timers of different kinds per process — given the existing data structures, the kernel cannot manage more with the setitimer and alarm mechanism. For example, a process can execute a virtual and a real-time timer at the same time, but not two real-time timers. POSIX timers that are implemented in kernel/posix-timers.c provide an extension to this scheme that allow more timers, but need not be discussed any further. Virtual and profiling timers are also implemented on top of this framework.
Real-Time Timers When installing a real-time (ITIMER_REAL) timer, it is first necessary to preserve the properties of a possibly existing old timer (they will be returned to userland once the new timer has been installed) and cancel the timer with hrtimer_try_to_cancel. Installing a timer ‘‘overwrites‘‘ previous values. The timer period is stored in the task-specific signal_struct->it_real_incr field (if this field is zero, then the timer is not periodic, but only activated once), and hrtimer_start starts a timer that expires at the desired time. No handler routine is executed in userspace when a dynamic timer expires. Instead, a signal is generated that results in the invocation of a signal handler and thus indirectly to the invocation of a callback function. How does the kernel ensure that the signal is sent, and how is the timer made periodic? The kernel uses the callback handler it_real_fn, which is executed for all userspace real-time timers. This function sends the SIGALRM signal to the process that installed the timer, but does not reinstall the signal handler to make the signal periodic. Instead, the timer is reinstalled when the signal is delivered in process context (in dequeue_signal, to be precise). After forwarding the expiration time with hrtimer_forward, the timer is restarted with hrtimer_restart. What keeps the kernel from reactivating the timer immediately after it has expired? Earlier kernel versions did, in fact, choose this approach, but problems arise if high-resolution timers are active. A process
can choose a very short repetition interval that would cause timers to expire over and over — resulting in excessive time spent in the timer code. Put less politely, one could also call this a denial-of-service attack, and the current approach avoids this.
15.7.3 Getting the Current Time The current time of the system needs to be known for two reasons: First, many operations rely on time stamps — for instance, the kernel needs to record when a file was last changed or when some log information was produced. Second, the absolute time — that is, the real time of the outside world — of the system is needed to inform the user with a clock, for example. While absolute accuracy is not too important for the first purpose as long as the time flow is continuous (i.e., the time stamps of successive operations should follow their order), it is more essential for the second purpose. Hardware clocks are notorious for being either fast, slow, or a random combination of both. There are various methods to solve this problem, with the most common one in the age of networked computers being synchronization with a reliable time source (e.g., an atomic clock) via NTP. Since this is purely a userland issue, I won’t discuss it any further. Two means are provided to obtain timing information:
1. The system call adjtimex. A small utility program of the same name can be used to quickly display the exported information. The system call allows for reading the current kernel-internal time. Other possibilities are documented in the associated manual page adjtimex(2).
2. The device special file /dev/rtc. This source can be operated in various modes, but one of them delivers the current date and time to the caller.
I focus on adjtimex in the following. The entry point is as usual sys_adjtimex, but after some preparations, the real work is delegated to do_adjtimex. The function is rather lengthy, but the portion required for our purposes is quite compact: kernel/time.c
int do_adjtimex(struct timex *txc)
{
    ...
    do_gettimeofday(&txc->time);
    ...
}
The call to do_gettimeofday obtains the kernel’s internal time in the best possible resolution. The best time source that was selected by the kernel as described in Section 15.4 is used for this purpose.
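A minimal userland sketch that reads this time via the adjtimex system call (without modifying any kernel parameters) could look like this; it uses the glibc wrapper declared in <sys/timex.h>:

#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
    struct timex tx = { .modes = 0 };	/* modes == 0: read-only query */

    if (adjtimex(&tx) == -1) {
        perror("adjtimex");
        return 1;
    }

    printf("kernel time: %ld s, %ld us\n",
           (long)tx.time.tv_sec, (long)tx.time.tv_usec);
    return 0;
}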
15.8 Managing Process Times
The task structure contains two elements related to process times that are important in our context: <sched.h>
struct task_struct {
    ...
    cputime_t utime, stime;
    ...
}
update_process_times is used to manage process-specific time elements and is invoked from the
local tick. As the code flow diagram in Figure 15-23 shows, four things need to be done:
1. account_process_tick uses either account_user_time or account_system_time to update the values for user or system CPU time consumed in the task structure (utime or stime, respectively). The SIGXCPU signal is also sent at intervals of 1 second if the process has exceeded its CPU limits specified by Rlimit.
2. run_local_timers activates and expires low-resolution timers. Recall that this was discussed in detail in Section 15.2.
3. scheduler_tick is a helper for the CPU scheduler as discussed in Chapter 2.
4. run_posix_cpu_timers initiates that the currently registered POSIX timers are run. This includes running the abovementioned interval timers since their implementation is based on POSIX CPU timers. Since these timers are otherwise not very interesting, their implementation is not covered in detail.
Figure 15-23: Code flow diagram for update_process_times.
15.9 Summary
The kernel needs to keep track of time for various purposes, and there are also a good many aspects that must be considered to solve the problem. In this chapter, first you were introduced to the general concept of timekeeping and the difference between timers and time-outs. You have seen that the implementation of timers and time-outs is based on hardware that can manage the time. Typically, each system contains more than one component for this purpose, and you were introduced to the data structures that allow for representing these components and sorting them by quality. Traditionally, the kernel relied on lowresolution timers, but recent hardware progress and a rework of the timing subsystem have allowed the introduction of a new class of high-resolution timers. After a discussion of the implementation of high- and low-resolution timers, you were introduced to the concept of dynamic ticks. Traditionally, a periodic timer tick was issued with HZ frequency, but this is suboptimal for machines where power is scarce: When a system is idle and has nothing to do, the tick is superfluous and can be temporarily disabled to allow components to enter deeper sleep states without being woken up at periodic intervals. The dynamic tick mode allows for achieving exactly this. Time is also relevant for userspace processes, and thus I finally discussed various system calls that are available in this area.
Page and Buffer Cache Performance and efficiency are two factors to which great importance is attached during kernel development. The kernel relies not only on a sophisticated overall concept of interaction between its individual components, but also on an extensive framework of buffers and caches designed to boost system speed. Buffering and caching make use of parts of system RAM to ensure that the most important and the most frequently used data of block devices can be manipulated not on the slow devices themselves but in main memory. RAM memory is also used to store the data read in from block devices so that the data can be subsequently accessed directly in fast RAM when it is needed again rather than fetching it from external devices. Of course, this is done transparently so that the applications do not and cannot notice any difference as to from where the data originate. Data are not written back after each change but after a specific interval whose length depends on a variety of factors such as the free RAM capacity, the frequency of usage of the data held in RAM, and so on. Individual write requests are bundled and collectively take less time to perform. Consequently, delaying write operations improves system performance as a whole. However, caching has its downside and must be employed judiciously by the kernel: ❑
❑ Usually there is far less RAM capacity than block device capacity so that only carefully selected data may be cached.

❑ The memory areas used for caching are not exclusively reserved for "normal" application data. This reduces the RAM capacity that is effectively available.

❑ If the system crashes (owing to a power outage, e.g.), the caches may contain data that have not been written back to the underlying block device. Such data are irretrievably lost.
However, the advantages of caching outweigh the disadvantages to such an extent that caches are permanently integrated into the kernel structures.
Page 949
Mauerer
runc16.tex
V2 - 09/04/2008
Caching is a kind of "reverse" swapping or paging operation (the latter are discussed in Chapter 18). Whereas fast RAM is sacrificed for caching (so that there is no need for slow operations on block devices), RAM memory is replaced virtually with slow block devices to implement swapping. The kernel must therefore do its best to cater to both mechanisms to ensure that the advantages of the one method are not canceled out by the disadvantages of the other — no easy feat.

Previous chapters discussed some of the means provided by the kernel for caching specific structures. The slab cache is a memory-to-memory cache whose purpose is not to accelerate operations on slower devices but to make simpler and more effective use of existing resources. The dentry cache is also used to dispense with the need to access slow block devices but cannot be put to general use since it is specialized to handle a single data type.

The kernel features two general caching options for block devices:
1. The page cache is intended for all operations in units of a page — and takes into account the page size on the specific architecture. A prime example is the memory-mapping technique discussed in many chapters. As other types of file access are also implemented on the basis of this technique in the kernel, the page cache is responsible for most caching work for block devices.

2. The buffer cache operates with blocks. When I/O operations are performed, the access units used are the individual blocks of a device and not whole pages. Whereas the page size is the same with all filesystems, the block size varies depending on the particular filesystem or its settings. The buffer cache must therefore be able to handle blocks of different sizes.

While buffers used to be the traditional method to perform I/O operations with block devices, they are nowadays only supported in this area for very small read operations where the advanced methods are too bulky. The standard data structure used for block transfers has become struct bio, which is discussed in Chapter 6. It is much more efficient to perform block transfers this way because it allows subsequent blocks of a request to be merged, which speeds things up. Nevertheless, buffers are still the method of choice to represent I/O operations on individual blocks, even if the underlying I/O is performed with bios. Filesystems, especially, often have to read metadata blockwise, and buffers are much easier to handle for this task than other, more powerful structures. All in all, buffers still have their own identity and are not around solely for compatibility reasons.
In many scenarios, page and buffer caches are used in combination. For example, a cached page is divided into various buffers during write operations so that modifications to the page can be tracked at a finer granularity. This has advantages when the data are written back because only the modified part of the page and not the whole page need be transferred back to the underlying block device.
16.1 Structure of the Page Cache
As its name suggests, the page cache deals with memory pages that divide virtual memory and RAM memory into small segments. This not only makes it easier for the kernel to manipulate the large address space, but also supports a whole series of functions such as paging, demand loading, memory mapping, and the like. The task of the page cache is to obtain some of the available physical page frames to speed up the operations performed on block devices on a page basis. Of course, the way the page cache behaves
is transparent to user applications as they do not know whether they are interacting directly with a block device or with a copy of their data held in memory — the read and write system calls return identical results in both cases.

Naturally, the situation is somewhat different for the kernel. In order to support the use of cached pages, anchors must be positioned at the various points in the code that interact with the page cache. The operation required by the user process must always be performed regardless of whether the desired page resides in the cache or not. When a cache hit occurs, the appropriate action is performed quickly (this is the very purpose of the cache). In the event of a cache miss, the required page must first be read from the underlying block device, and this takes longer. Once the page has been read, it is inserted in the cache and is, therefore, quickly available for subsequent access.

The time spent searching for a page in the page cache must be minimized to ensure that cache misses are as cheap as possible — if a miss occurs, the compute time needed to perform the search is (more or less) wasted. The efficient organization of the cached pages is, therefore, a key aspect of page cache design.
16.1.1 Managing and Finding Cached Pages

The problem of quickly fetching individual elements (pages) from a large data set (page cache) is not specific to the Linux kernel. It has long been common to all areas of information technology and has spawned many sophisticated data structures that have stood the test of time. Tree data structures of various kinds are very popular, and Linux also opts for such a structure — known as a radix tree — to manage the pages held in page caches. Appendix C provides a more detailed description of this data structure. This chapter gives a brief overview of how the individual pages are organized in the structure. Figure 16-1 shows a radix tree in which various instances of a data structure (represented by squares) are interlinked.1

Figure 16-1: Example of a radix tree.
1 The structure shown is simplified because the kernel makes use of additional tags in each node to hold specific information on the pages organized in the node. This has no effect on the basic architecture of the tree.
The structure does not correspond to that of the binary or ternary search trees in general use. Neither are radix trees balanced; in other words, there may be any number of height differences between the branches of the tree.

The tree itself consists of two different data structures and a further data structure to represent the leaves and hold the useful data. Because memory pages must be organized, the leaves are instances of the page structure in this case, a fact that is of no further importance in the implementation of the tree. (The kernel sources do not define a particular data type but use a void pointer; this means that radix trees could also be used for other purposes, although this is not done at present.)

The root of the tree is represented by a simple data structure that holds the height of the tree (the maximum number of levels to accommodate the nodes) and a pointer to the first node data structure of which the tree is comprised. The nodes are basically arrays. For the sake of simplicity, the nodes are shown with four elements in the figure, but in the kernel sources, they actually have 2^RADIX_TREE_MAP_SHIFT entries. Since RADIX_TREE_MAP_SHIFT is typically defined as 6, each array has 64 elements — considerably more than are shown in the figure. Small systems use a RADIX_TREE_MAP_SHIFT setting of 4 to save precious memory.

The elements of the tree are addressed by means of a unique key consisting of a simple integer. The details of the algorithm used to find elements by reference to their key are not discussed here. A description of the relevant code is given in Appendix C. Enlarging the tree and deleting tree elements are kernel operations that require little effort, so minimum time is lost in performing cache management operations. Their implementation is also described in greater detail in Appendix C.

Observe from Figure 16-1 that the tree is equipped with two search tags. They allow for specifying if a given page is dirty (i.e., the page contents are not identical with the data in the backing store) or if it is currently being written back to the underlying block device. It is important that the tags are not only set in the leaf elements, but also all the way up to the root element. If at least one pointer in level n + 1 has a tag set, then the pointer on level n will also acquire the tag. This allows the kernel to decide that one or more pages in a range have a tag bit set. The figure provides an example: Since the dirty tag bit on the leftmost pointer in the first level is set, the kernel knows that one or more of the pages associated with the corresponding second-level node have the dirty tag bit set. If, on the other hand, a tag is not set for a pointer in the higher levels, then the kernel can be sure that none of the pages in the lower levels has the tag.

Recall from Chapter 3 that each page as represented by an instance of struct page is equipped with a set of flags. These also include dirty and writeback flags. The information in the radix tree tags therefore only augments kernel knowledge. Page cache tags are useful to quickly determine if at least one page in a region is dirty or under writeback without scanning all pages in the whole region. They are, however, no replacement for the direct page flags.
16.1.2 Writing Back Modified Data

Thanks to the page cache, write operations are not performed directly on the underlying block device but are carried out in memory where the modified data are first collected for subsequent transfer to the lower kernel layer, where the write operations can be further optimized — as discussed in Chapter 6 — to fully exploit the specific capabilities of the individual devices. Here we are interested
only in the situation as seen by the page cache, which is primarily concerned with one specific question: at which point in time should the data be written back? This automatically includes the question as to how often writeback should take place.

Understandably, there is no universally valid answer to this question as different systems with different load conditions give rise to very different scenarios. For example, a server running overnight receives very few requests to modify data so that the services of the kernel are seldom required. The same scenario applies on personal computers when users take a break from work. However, the situation can change suddenly when the server launches a huge FTP transfer or the PC user starts a lengthy compiler run to process and produce large volumes of data. In both scenarios, the caches initially have very little to write back, but then, from one moment to the next, they are required to frequently synchronize with the underlying storage medium. For these reasons, the kernel provides several parallel synchronization alternatives:
❑ Several special kernel daemons called pdflush run in the background and are activated periodically — regardless of the current situation in the page cache. They scan the pages in the cache and write back the data that have not been synchronized with the underlying block device for a specific period. Earlier kernel versions employed a userspace daemon named kupdated for this purpose, and this name is still commonly used to describe this mechanism.
❑ A second operating mode of pdflush is activated by the kernel if the number of modified data items in a cache has increased substantially within a short period.
❑ System calls are available to users and applications to instruct the kernel to write back all non-synchronized data. The best known is the sync call because there is also a userspace tool of the same name that builds on it.
The various mechanisms used to write back dirty data from the caches are discussed in Chapter 17.

To manage the various target objects that can be processed and cached in whole pages, the kernel uses an abstraction of the "address space" that associates the pages in memory with a specific block device (or any other system unit or part of a system unit).
This type of address space must not be confused with the virtual and physical address spaces provided by the system or processor. It is a separate abstraction of the Linux kernel that unfortunately bears the same name.
Initially, we are interested in only one aspect. Each address space has a "host" from which it obtains its data. In most cases, these are inodes that represent just one file.2 Because all existing inodes are linked with their superblock (as discussed in Chapter 8), all the kernel need do is scan a list of all superblocks and follow their associated inodes to obtain a list of cached pages.

2 Since the majority of cached pages result from file accesses, most host objects, indeed, represent a regular file. It is, however, also possible that an inode host object stems from the pseudo-block device filesystem. In this case, the address space is not associated with a single file, but with a whole block device or a partition thereof.

Usually, modifications to files or other objects cached in pages change only part and not the whole of the page contents. This gives rise to a problem when data are synchronized; it doesn't make sense to write
the entire page back to the block device because most of the page data in memory are still synchronized with the data on the block device. To save time, the kernel divides each page in the cache into smaller units known as buffers during write operations. When data are synchronized, the kernel is able to restrict writeback to the smaller units that have actually been modified. As a result, the basically sound idea of page caching is not compromised in any way.
16.2 Structure of the Buffer Cache
A page-oriented method has not always been used in the Linux kernel to bear the main caching burden. Earlier versions included only the buffer cache to speed file operations and to enhance system performance. This was a legacy of other Unix look-alikes with the same structure. Blocks from the underlying block devices were kept in main memory buffers to make read and write operations faster. The implementation is contained in fs/buffer.c.

In contrast to pages in memory, blocks are not only (mostly) smaller but vary in size depending on the block device in use (or on the filesystem, as demonstrated in Chapter 9). As a result of the ever increasing trend toward generic file access methods implemented by means of page-based operations, the buffer cache has lost much of its importance as a central system cache, and the main caching burden is now placed firmly on the page cache. Additionally, the standard data structure for block-based I/O is not a buffer anymore, but struct bio as discussed in Chapter 6.

Buffers are kept for small I/O transfers with block size granularity. This is often required by filesystems to handle their metadata. Transfer of raw data is done in a page-centric fashion, and the implementation of buffers is also on top of the page cache.3

The buffer cache consists of two structural units:
1. A buffer head holds all management data relating to the state of the buffer including information on block number, block size, access counter, and so on, discussed below. These data are not stored directly after the buffer head but in a separate area of RAM memory indicated by a corresponding pointer in the buffer head structure.
2. The useful data are held in specially reserved pages that may also reside in the page cache. This further subdivides the page cache as illustrated in Figure 16-2; in our example, the page is split into four identically sized parts, each of which is described by its own buffer head. The buffer heads are held in memory areas unrelated to the areas where the useful data are stored. This enables the page to be subdivided into smaller sections because no gaps arise as a result of prefixing the buffer data with header data. As a buffer consists of at least 512 bytes, there may be up to a maximum of MAX_BUF_PER_PAGE buffers per page; the constant is defined as a function of the page size:
#define MAX_BUF_PER_PAGE (PAGE_CACHE_SIZE / 512)
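As a quick worked example (assuming the common case of 4 KiB pages, so that PAGE_CACHE_SIZE is 4,096 bytes), the constant evaluates to 4096 / 512 = 8; a single page frame can therefore host at most eight buffers.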
3 This contrasts with kernels before and including the 2.2 series that used separate caches for buffers and pages. Having two distinct caching possibilities requires enormous efforts to synchronize both, so the kernel developers chose to unify the caching scheme many years ago.
If one of the buffers is modified, this has an immediate effect on the contents of the page (and vice versa) so that there is no need for explicit synchronization of the two caches — after all, both share identical data.
Figure 16-2: Link between pages and buffers.
There are, of course, applications that access block devices using blocks rather than pages — reading the superblock of a filesystem is one such example. A separate buffer cache is used to speed access of this kind. The buffer cache operates independently of the page cache, not in addition to it. To this end, buffer heads — the data structure is the same in buffer caches and page caches — are grouped together in an array of constant size whose individual entries are managed on a least recently used basis. After an entry has been used, it is placed at position 0 and the other entries are moved down accordingly; this means that the entries most frequently used are located at the beginning of the array and those less frequently used are pushed further back until they finally "drop" off the array if they have not been used for a lengthy period.

As the size of the array and therefore the number of entries in the LRU list are restricted to a fixed value that does not change during kernel run time, the kernel need not execute separate threads to trim the cache size to reasonable values. Instead, all it need do is remove the associated buffer from the cache when an entry drops off the array in order to release memory for other purposes.

Section 16.5 discusses the technical details of the buffer implementation. Before this, it is necessary to discuss the concept of address spaces because these are key to the implementation of cache functionality.
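The move-to-front behavior just described can be illustrated with a minimal, self-contained sketch. This is purely illustrative and deliberately simplified: the array size, the function name, and the bookkeeping are made up, and struct buffer_head is assumed to be available from <linux/buffer_head.h>; the kernel's actual buffer LRU code differs in naming and detail.

    #define LRU_SIZE 8

    static struct buffer_head *lru[LRU_SIZE];

    /* Move bh to position 0; the entries in front of its old slot shift down.
     * If bh was not present, the entry at the end of the array drops off and
     * would have to be released by the caller in a real implementation. */
    static void lru_touch(struct buffer_head *bh)
    {
            int i;

            for (i = 0; i < LRU_SIZE; i++)
                    if (lru[i] == bh)
                            break;
            if (i == LRU_SIZE)
                    i = LRU_SIZE - 1;

            for (; i > 0; i--)
                    lru[i] = lru[i - 1];
            lru[0] = bh;
    }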
16.3 Address Spaces
Not only have caches progressed from a buffer orientation to a page orientation during the course of Linux development, but also the way in which cached data are linked with their sources has been replaced with a more general schema as compared to previous Linux versions. Whereas in the early days of Linux and other Unix derivatives, inodes were the only objects that acted as the starting point for obtaining data from cache contents, the kernel now uses much more general address spaces that establish the link between cached data and the objects and devices required to obtain the data. Although file contents still account for much of the data in caches, the interfaces are so generalized that the caches are also able to hold data from other sources in order to speed access.
How do address spaces fit into the structures of the page cache? They implement a translation mechanism between two units:
1. Pages in main memory are allocated to each address space. The contents of these pages can be manipulated by user processes or by the kernel itself using a variety of methods. These data represent the contents of the cache.

2. The backing store specifies the sources from which the address space pages are filled. Address spaces relate to the virtual address space of the processor and are a mapping between the segment managed by the processor in virtual memory and the corresponding positions on a source device (a block device, for instance). If a position in virtual memory that is not associated with a physical page in memory is accessed, the kernel can refer to the address space structure to discover from where the data must be read.
To support data transfer, each address space provides a set of operations (in the form of function pointers) to permit interaction between the two sides of the address space — for instance, to read a page from a block device or filesystem, or to write back a modified page. The following section takes a close look at the data structures used before examining the implementation of address space operations.

Address spaces are one of the most crucial data structures in the kernel, and their management has evolved into one of the central issues faced by the kernel. Numerous subsystems (filesystems, swapping, synchronization, caching) are centered around the concept of an address space. They can therefore be regarded as one of the fundamental abstraction mechanisms of the kernel, and they rank in importance alongside traditional abstractions like processes and files.
16.3.1 Data Structures

The basis of an address space is the address_space structure, which is in slightly simplified form defined as follows:
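(The listing that belongs here did not survive extraction. The following reconstruction is based on <fs.h> of the kernel version covered in this book, with locking and some housekeeping members omitted; treat it as an approximation rather than a verbatim copy.)

    struct address_space {
            struct inode            *host;             /* owner: inode, block_device */
            struct radix_tree_root  page_tree;         /* radix tree of all pages */
            unsigned int            i_mmap_writable;   /* count VM_SHARED mappings */
            struct prio_tree_root   i_mmap;            /* tree of private and shared mappings */
            struct list_head        i_mmap_nonlinear;  /* list of VM_NONLINEAR mappings */
            unsigned long           nrpages;           /* number of total pages */
            const struct address_space_operations *a_ops;  /* methods */
            unsigned long           flags;             /* error bits/gfp mask */
            struct backing_dev_info *backing_dev_info; /* device readahead, etc. */
            struct list_head        private_list;
            struct address_space    *assoc_mapping;
    };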
❑ The link with the areas managed by an address space is established by means of a pointer to an inode instance (of type struct inode) to specify the backing store and a root radix tree (page_tree) with a list of all physical memory pages in the address space.
❑ The total number of cached pages is held in the nrpages counter variable.
❑ address_space_operations is a pointer to a structure that contains a list of function pointers to specific operations for handling address spaces. Its definition is discussed below.

❑ i_mmap is the root element of a tree that holds all normal memory mappings of the inode (normal in the sense that they were not created using the nonlinear mapping mechanism). The task of the tree is to support finding all memory regions that include at least one page in a given interval, and the auxiliary macro vma_prio_tree_foreach is provided for this purpose. Recall that the purpose of the tree is discussed in Section 4.4.3. The details of the tree implementation are of no relevance to us at the moment — it is sufficient to know that all pages of the mapping can be found on the tree and that the structure can be manipulated easily.

❑ Two further elements are concerned with the management of memory mappings: i_mmap_writable counts all mappings created with a set VM_SHARED attribute so that they can be shared by several users at the same time. i_mmap_nonlinear is used to set up a list of all pages included in nonlinear mappings (reminder: nonlinear mappings are generated by skillful manipulation of the page tables under the control of the remap_file_pages system call).

❑ backing_dev_info is a pointer to a further structure that holds information on the associated backing store. Backing store is the name used for the peripheral device that serves as a "backbone" for the information present in the address space. It is typically a block device:
struct backing_dev_info {
        unsigned long ra_pages;     /* max readahead in PAGE_CACHE_SIZE units */
        unsigned long state;        /* Always use atomic bitops on this */
        unsigned int capabilities;  /* Device capabilities */
        ...
};

ra_pages specifies the maximum number of pages to be read in anticipation (readahead). The state of the backing store is stored in state. capabilities holds information on the backing store — for example, whether the data in the store can be executed directly as is necessary in ROM-based filesystems. However, the most important information in capabilities is whether pages can be written back. This can always be done with genuine block devices but is not possible with memory-based devices such as RAM disks because there would be little point in writing back data from memory to memory.
If BDI_CAP_NO_WRITEBACK is set, then synchronization is not required; otherwise, it is. Chapter 17 discusses the mechanisms used for this purpose in detail; a minimal sketch of such a capability check is shown after this list.

❑ private_list is used to interlink buffer_head instances which hold filesystem metadata (usually indirection blocks). assoc_mapping is a pointer to the associated address space.
❑ The flag set in flags is used primarily to hold information on the GFP memory area from which the mapped pages originate. It can also hold errors that occur during asynchronous input/output and that cannot therefore be propagated directly. AS_EIO stands for a general I/O error, and AS_ENOSPC indicates that there is no longer sufficient space for an asynchronous write operation.
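As announced above, here is a minimal sketch of how a caller could test whether the backing store of a mapping needs writeback at all. The helper name is invented for illustration; the kernel provides similar small inline helpers around the same bit test.

    /* Sketch: does dirty data of this mapping ever have to be written back? */
    static inline int mapping_needs_writeback(struct address_space *mapping)
    {
            struct backing_dev_info *bdi = mapping->backing_dev_info;

            return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
    }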
Figure 16-3 sketches how address spaces are connected with various other parts of the kernel. Only the most important links are shown in this overview; more details will be discussed in the remainder of this chapter.
Figure 16-3: Address spaces and their connection with central kernel data structures and subsystems.
16.3.2 Page Trees

The kernel uses radix trees to manage all pages associated with an address space at the least possible cost. A general overview of trees of this kind was provided above; now the corresponding data structures in the kernel are focused on. As is clear from the layout of address_space, the radix_tree_root structure is the root element of every radix tree:
struct radix_tree_root {
        unsigned int            height;
        gfp_t                   gfp_mask;
        struct radix_tree_node  *rnode;
};

❑ height specifies the height of the tree, that is, the number of levels below the root. On the basis of this information and the number of entries per node, the kernel is able to quickly calculate the maximum number of elements in a given tree and to expand the tree accordingly if there is insufficient capacity to accept new data (a short capacity sketch follows this list).
❑ gfp_mask specifies the zone from which memory is to be allocated.

❑ rnode is a pointer to the first node element of the tree. The radix_tree_node data type discussed below is used for this node.
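To make the relation between height and capacity concrete: each additional level multiplies the number of addressable slots by 2^RADIX_TREE_MAP_SHIFT, so with the default shift of 6 a tree of height h can hold up to 2^(6h) elements. A minimal sketch of this computation follows; the function name is illustrative, and the kernel keeps precomputed per-height maximum indices instead of recomputing them.

    static inline unsigned long radix_tree_capacity(unsigned int height)
    {
            /* one slot at height 0, 64 at height 1, 4096 at height 2, ... */
            return 1UL << (height * RADIX_TREE_MAP_SHIFT);
    }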
Implementation

The nodes of a radix tree are essentially represented by the following data structure:
#define RADIX_TREE_TAGS         2
#define RADIX_TREE_MAP_SHIFT    (CONFIG_BASE_SMALL ? 4 : 6)
#define RADIX_TREE_MAP_SIZE     (1UL << RADIX_TREE_MAP_SHIFT)
#define RADIX_TREE_TAG_LONGS    \
        ((RADIX_TREE_MAP_SIZE + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct radix_tree_node {
        unsigned int    height;         /* Height from the bottom */
        unsigned int    count;
        struct rcu_head rcu_head;
        void            *slots[RADIX_TREE_MAP_SIZE];
        unsigned long   tags[RADIX_TREE_TAGS][RADIX_TREE_TAG_LONGS];
};
The layout of this data structure is also very simple. slots is an array of void pointers that — depending on the level in which the node is located — point to either data elements or further nodes. count holds the number of used array entries in the node. The array is filled with entries starting at the top, and unused entries have null pointers. Each tree node can point to 64 further nodes (or leaves) as indicated in the definition of the slot array in radix_tree_node. The direct consequence of this definition is that each node may have only an array size that is a power of two. Also, the size of all radix elements may only be defined at compilation time (of course, the maximum number of elements in a tree can change at run time). This behavior is rewarded by speed gains.
Tagging

The information discussed so far — the address space and the page tree — does not, however, allow the kernel to make a direct distinction between the clean and dirty pages of a mapping. This distinction is essential when, for example, pages are to be written back to store changes permanently on the underlying block device. Earlier kernel versions provided additional lists of dirty and clean pages in address_space.

In principle, the kernel could, of course, scan the entire tree and filter out the pages with the appropriate state, but this is obviously very time-consuming. For this reason, each node of the radix tree includes additional tagging information that specifies whether each page in the node has the property specified in the tag. For example, the kernel uses a tag to label nodes with dirty pages. Nodes without this tag can therefore be skipped during a scan for dirty pages. This approach is a compromise between simple, unified data structures (no explicit lists are needed to hold pages with different states) and the option of performing a quick search for pages with specific properties. Currently, two tags are supported:
1. PAGECACHE_TAG_DIRTY specifies whether a page is dirty.

2. PAGECACHE_TAG_WRITEBACK indicates that the page is being written back at the moment.
The tagging information is stored in a two-dimensional array (tags) that is a part of radix_tree_node. The first array dimension distinguishes between the possible tags, and the second contains a sufficient number of elements of unsigned longs so that there is a bit for each page that can be organized in the node. radix_tree_tag_set is used to set a flag for a specific page:
void *radix_tree_tag_set(struct radix_tree_root *root, unsigned long index, unsigned int tag);
The kernel searches for the corresponding positions in the bit list and sets the bit to 1. When this is done, the tree is scanned from top to bottom to update the information in all nodes. In order to find all pages with a certain tag, the kernel still has to scan the entire tree, but this operation can be accelerated by first filtering out all subtrees that contain at least one page for which the flag is set. Again, this can be speeded up because the kernel does not check each bit one after the other but simply checks whether at least one of the unsigned long variables in which the bits are stored is nonzero:

lib/radix-tree.c
int radix_tree_tagged(struct radix_tree_root *root, int tag)
{
        int idx;

        if (!root->rnode)
                return 0;
        for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
                if (root->rnode->tags[tag][idx])
                        return 1;
        }
        return 0;
}
Accessing Radix Tree Elements

The kernel also provides the following functions to process radix trees (they are all implemented in lib/radix-tree.c):
int   radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
void *radix_tree_delete(struct radix_tree_root *, unsigned long);
int   radix_tree_tag_get(struct radix_tree_root *root,
                         unsigned long index, unsigned int tag);
void *radix_tree_tag_clear(struct radix_tree_root *root,
                           unsigned long index, unsigned int tag);
❑ radix_tree_insert adds a new element to a radix tree by means of a void pointer. The tree is automatically expanded if too little capacity is available.

❑ radix_tree_lookup finds a radix tree element whose key — an integer — was passed to the function as argument. The value returned is a void pointer that must be converted to the appropriate target data type.
❑ radix_tree_delete removes a tree element selected by means of its integer key. A pointer to the deleted object is returned if deletion was successful.

❑ radix_tree_tag_get checks if a tag is present on a radix tree node. If the tag is set, the function returns 1; otherwise, 0.

❑ radix_tree_tag_clear deletes a tag in a radix tree node. The change is propagated upward in the tree; that is, if all elements on one level have no tags, then the bit is also removed in the next higher level, and so on. The address of the tagged item is returned upon success. (A brief usage sketch of this API follows.)
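As a minimal illustration of how these functions fit together, the following sketch stores a page pointer under an integer key, tags it, and looks it up again. It is not taken from the kernel sources; the tree and function names are made up, and error handling is reduced to the bare minimum.

    #include <linux/radix-tree.h>

    static RADIX_TREE(example_tree, GFP_KERNEL);  /* declares and initializes a tree root */

    static struct page *example_cache_page(struct page *page, unsigned long index)
    {
            /* insert the page under the integer key; may expand the tree */
            if (radix_tree_insert(&example_tree, index, page))
                    return NULL;

            /* mark the new entry with tag 0 (e.g., "dirty") */
            radix_tree_tag_set(&example_tree, index, 0);

            /* lookups return a void pointer that is cast back to the stored type */
            return radix_tree_lookup(&example_tree, index);
    }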
These functions are implemented largely by shifting numbers as described in Appendix C. To ensure that radix trees are manipulated very quickly, the kernel uses a separate slab cache that holds instances of radix_tree_node for rapid allocation.
Caution: The slab cache stores only the data structures needed to create the tree. This has nothing to do with the memory used for the cached pages, which is allocated and managed independently.
The radix tree implementation also keeps a per-CPU pool of pre-allocated node elements to further speed the insertion of new elements into the tree. radix_tree_preload ensures that at least one element resides in this cache. The function is always invoked before an individual element is added to the radix tree using radix_tree_insert (this is ignored in the following sections).4
Locking

Radix trees do not provide any form of protection against concurrent access in general. As usual in the kernel, it is the responsibility of each subsystem that deploys radix trees to care for correct locking or any other synchronization primitive, as discussed in Chapter 5. However, an exception is made for several important read functions. This includes radix_tree_lookup to perform a lookup operation, radix_tree_tag_get to obtain a tag on a radix tree node, and radix_tree_tagged to test whether any items in the tree are tagged. The first two functions can be called without subsystem-specific locking if they are embraced by rcu_read_lock() . . . rcu_read_unlock(), while the third function does not require any lock at all. rcu_head provides the required connection between radix tree nodes and the RCU implementation. The finer points of how RCU is employed are internal to the implementation of radix trees, so I will not discuss the problem in more detail here.
4 To be more accurate, the insert operations are embedded between radix_tree_preload() . . . radix_tree_preload_end(). The use of per-CPU variables means that kernel preemption (see Chapter 2) must be disabled and then enabled again upon completion of the operation. This is currently the only task of radix_tree_preload_end.

16.3.3 Operations on Address Spaces

Address spaces connect backing stores with memory segments. Not only data structures but also functions are needed to perform the transfer operations between the two. Because address spaces can be used in various combinations, the requisite functions are not defined statically but must be determined according to the particular mapping with the help of a special structure that holds function pointers to the appropriate implementation.
As demonstrated when discussing struct address_space, each address space contains a pointer to an address_space_operations instance that holds the above function list:
struct address_space_operations {
        int (*writepage)(struct page *page, struct writeback_control *wbc);
        int (*readpage)(struct file *, struct page *);
        int (*sync_page)(struct page *);

        /* Write back some dirty pages from this mapping. */
        int (*writepages)(struct address_space *, struct writeback_control *);

        /* Set a page dirty */
        int (*set_page_dirty)(struct page *page);

        int (*readpages)(struct file *filp, struct address_space *mapping,
                        struct list_head *pages, unsigned nr_pages);

        /*
         * ext3 requires that a successful prepare_write() call be followed
         * by a commit_write() call - they must be balanced
         */
        int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
        int (*commit_write)(struct file *, struct page *, unsigned, unsigned);

        int (*write_begin)(struct file *, struct address_space *mapping,
                        loff_t pos, unsigned len, unsigned flags,
                        struct page **pagep, void **fsdata);
        int (*write_end)(struct file *, struct address_space *mapping,
                        loff_t pos, unsigned len, unsigned copied,
                        struct page *page, void *fsdata);

        /* Unfortunately this kludge is needed for FIBMAP. Don't use it */
        sector_t (*bmap)(struct address_space *, sector_t);
        int (*invalidatepage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, gfp_t);
        ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov,
                        loff_t offset, unsigned long nr_segs);
        struct page* (*get_xip_page)(struct address_space *, sector_t, int);
        int (*migratepage) (struct address_space *, struct page *, struct page *);
        int (*launder_page) (struct page *);
};
❑ writepage and writepages write one or more pages of the address space back to the underlying block device. This is done by delegating a corresponding request to the block layer. The kernel makes a number of standard functions available for this purpose [block_write_full_page and mpage_writepage(s)]; these are typically used instead of a manual implementation. Section 16.4.4 discusses the functions of the mpage_ family.
962
readpage and readpages read one or more consecutive pages from the backing store into a page frame. readpage and readpages are likewise not usually implemented manually but are
executed by standard functions of the kernel (mpage_readpage and mpage_readpages) that can be used for most purposes. Notice that the file argument of readpage is not required if the standard functions are used to implement the desired functionality because the inode associated with the desired page can be determined via page->mapping->host.
❑ sync_page performs synchronization of data that have not yet been written back to the backing store. Unlike writepage, the function operates on the block layer level and attempts to perform pending write requests still held in buffers in this layer. In contrast, writepage operates on the address space layer and simply forwards the data to the block layer without bothering about active buffering there. The kernel provides the standard function block_sync_page, which obtains the address space mapping that belongs to the page in question and unplugs the block device queue to start I/O.
❑ set_page_dirty allows an address space to provide a specific method of marking a page as dirty. However, this option is rarely used. In this case, the kernel automatically uses __set_page_dirty_buffers to simultaneously mark the page as dirty on the buffer level and to add it to the dirty_pages list of the current mapping.
❑ prepare_write and commit_write perform write operations triggered by the write system call. To cater to the special features of journaling filesystems, this operation must be split into two parts: prepare_write stores the transaction data in the journal, and commit_write performs the actual write operation by sending the appropriate commands to the block layer. When data are written, the kernel must ensure that the two functions are always invoked in pairs and in the correct sequence as otherwise the journal mechanism serves no purpose. It has by now become common practice that even non-journaling filesystems (like Ext2) split writing into two parts.

Unlike writepage, prepare_ and commit_write do not directly initiate I/O operations (in other words, they do not forward corresponding commands to the block layer) but, in the standard implementation, make do with marking whole pages or parts thereof as dirty; the write operation itself is triggered by a kernel daemon that is provided for this purpose and that periodically checks the existing pages.
❑ write_begin and write_end are replacements for prepare_write and commit_write. While the intention of the functions is identical, the required parameters and especially the way in which locking of involved objects is handled have changed. Since Documentation/filesystems/vfs.txt provides a detailed description of how the functions operate, nothing more needs to be added here.
❑ bmap maps a logical block offset within an address space to a physical block number. This is usually straightforward for block devices, but since files are in general not represented by a linear number of blocks on a device, the required information cannot be determined otherwise. bmap is required by the swap code (see Section 18.3.3), the FIBMAP file ioctl, and internally by some filesystems.
❑ releasepage prepares page release in journaling filesystems.
❑ invalidatepage is called if a page is going to be removed from the address space and buffers are associated with it, as signaled by the PG_Private flag.
❑ direct_IO is used to implement direct read and write access. This bypasses buffering in the block layer and allows an application to communicate very directly with a block device. Large databases make frequent use of this feature as they are better able to forecast future input and output than the generic mechanisms of the kernel and can therefore achieve better results by implementing their own caching mechanisms.
❑ get_xip_page is used for the execute-in-place mechanism that can launch executable code without having to first load it into the page cache. This is useful on, for example, memory-based filesystems such as a RAM disk or on small systems with little memory that can address ROM areas containing filesystems directly via the CPU. As this mechanism is seldom used, it need not be discussed at length.
❑ migratepage is used if the kernel wants to relocate a page, that is, move the contents of one page onto another page. Since pages are often equipped with private data, it is not sufficient to simply copy the raw information from the old to the new page. Moving pages is, for instance, required to support memory hotplugging.
❑ launder_page offers a last chance to write back a dirty page before it is freed.
Most address spaces do not implement all functions and therefore assign null pointers to some. In many cases, the kernel's default routines are invoked instead of the specific implementation of the individual address spaces. Below, a few of the kernel's address_space_operations are examined to give an overview of the options available.

The Third Extended Filesystem defines the ext3_writeback_aops global variable, which is a filled instance of address_space_operations. It contains the functions used in writeback mode:

fs/ext3/inode.c
static const struct address_space_operations ext3_writeback_aops = {
        .readpage       = ext3_readpage,
        .readpages      = ext3_readpages,
        .writepage      = ext3_writeback_writepage,
        .sync_page      = block_sync_page,
        .write_begin    = ext3_write_begin,
        .write_end      = ext3_writeback_write_end,
        .bmap           = ext3_bmap,
        .invalidatepage = ext3_invalidatepage,
        .releasepage    = ext3_releasepage,
        .direct_IO      = ext3_direct_IO,
        .migratepage    = buffer_migrate_page,
};
The pointers that are not explicitly set are automatically initialized with NULL by the compiler. At first sight, Ext3 appears to set a rather large number of function pointers to use its own implementations. However, this supposition is quickly disproved by looking at the definitions of the ext3_... functions in the kernel sources. Many of them consist of only a few lines and delegate work to the generic helper functions of the kernel:
Function                        Standard implementation
ext3_readpage                   mpage_readpage
ext3_readpages                  mpage_readpages
ext3_writeback_writepage        block_write_full_page
ext3_write_begin                block_write_begin
ext3_writeback_write_end        block_write_end
ext3_direct_IO                  blockdev_direct_IO
The functions of the address_space_operations structure and the generic helpers of the kernel take different arguments, so a brief wrapper function is needed for purposes of parameter conversion. Otherwise, in most cases, the pointers could point directly to the helper functions mentioned. Other filesystems likewise fill their address_space_operations instances with functions that make direct or indirect use of the kernel's standard functions.

The structure of the address_space_operations instance of the shared-memory filesystem is particularly simple since only three fields need to be filled with non-NULL pointers:

mm/shmem.c
static struct address_space_operations shmem_aops = {
        .writepage      = shmem_writepage,
        .set_page_dirty = __set_page_dirty_no_writeback,
        .migratepage    = migrate_page,
};
All that need be implemented is the marking of the page as dirty, page writeback, and page migration. The other operations are not used to provide shared memory.5

5 If tmpfs, which is implemented on top of shared memory, is enabled, then readpage, write_begin, and write_end are also implemented.

With which backing store does the kernel operate in this case? Memory from the shared-memory filesystem is totally independent of a specific block device because all files of the filesystem are generated dynamically (e.g., by copying the contents of a file from another filesystem, or by writing calculated data into a new file) and do not reside on any original block device. Memory shortage can, of course, also apply to pages that belong to this filesystem so that it is then necessary to write the pages back to the backing store. Because there is no backing store in the real sense, the swap area is used in its stead. Whereas normal files are written back to their filesystem on the hard disk (or on any other block device) in order to free the used page frame, files of the shared-memory filesystem must be stored in the swap area.

Since access to block devices need not always be made by way of filesystems but may also apply to raw devices, there are address space operations to support the direct manipulation of the contents of block devices (this kind of access is required, e.g., when creating filesystems from within userspace):

fs/block_dev.c

struct address_space_operations def_blk_aops = {
        .readpage       = blkdev_readpage,
        .writepage      = blkdev_writepage,
        .sync_page      = block_sync_page,
        .write_begin    = blkdev_write_begin,
        .write_end      = blkdev_write_end,
        .writepages     = generic_writepages,
        .direct_IO      = blkdev_direct_IO,
};
Again, it is clear that a large number of special functions are used to implement the requirements, but they quickly lead to the kernel's standard functions:

Block layer                     Standard function
blkdev_readpage                 block_read_full_page
blkdev_writepage                block_write_full_page
blkdev_write_begin              block_write_begin
blkdev_write_end                block_write_end
blkdev_direct_IO                __blockdev_direct_IO
The implementations of the address space operations for filesystems and for raw access to block devices have much in common in the kernel since both share the same helper functions.
16.4 Implementation of the Page Cache
The page cache is implemented on top of radix trees. Although the cache belongs to the most performance-critical parts of the kernel and is widely used across all subsystems, the implementation is astonishingly simple. Well-designed data structures are an essential ingredient for this.
16.4.1 Allocating Pages

page_cache_alloc is used to reserve the data structure of a new page to be added to the page cache. The variant postfixed by _cold works identically, but tries to obtain a cache-cold page:

<pagemap.h>
struct page *page_cache_alloc(struct address_space *x)
struct page *page_cache_alloc_cold(struct address_space *x)
Initially, the radix tree is left untouched because work is delegated to alloc_pages, which takes a page frame from the buddy system (described in Chapter 3). However, the address space argument is required to infer from which memory region the page must come.

Adding the new page to the page cache is a little more complicated and falls under the responsibility of add_to_page_cache. Here, radix_tree_insert inserts the page instance associated with the page into the radix tree of the address space involved:

mm/filemap.c
int add_to_page_cache(struct page *page, struct address_space *mapping,
                pgoff_t offset, gfp_t gfp_mask)
{
        ...
        error = radix_tree_insert(&mapping->page_tree, offset, page);
        if (!error) {
                page_cache_get(page);
                SetPageLocked(page);
                page->mapping = mapping;
                page->index = offset;
                mapping->nrpages++;
        }
        ...
        return error;
}
The index in the page cache and the pointer to the address space of the page are held in the corresponding elements of struct page (mapping and index). Finally, the address space page count (nrpages) is incremented by 1 because there is now one more page in the cache. An alternative function named add_to_page_cache_lru with identical prototype is also available. This first invokes add_to_page_cache to add a page to the address space-specific page cache before also adding the page to the system’s LRU cache using the lru_cache_add function.
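Putting allocation and insertion together, a simplified usage pattern looks like the following sketch. It is not a specific kernel code path; the function name is made up, and error handling is kept to the bare minimum.

    static struct page *cache_new_page(struct address_space *mapping, pgoff_t index)
    {
            struct page *page = page_cache_alloc(mapping);  /* page frame from the buddy system */

            if (!page)
                    return NULL;

            /* insert the page into the radix tree of the mapping and into the LRU cache */
            if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
                    page_cache_release(page);
                    return NULL;
            }

            /* the page is returned with PG_locked set, as done by add_to_page_cache */
            return page;
    }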
16.4.2 Finding Pages

Keeping all cached pages in a radix tree data structure is especially beneficial when the kernel needs to decide if a given page is cached or not. find_get_page is provided for this purpose:

mm/filemap.c
struct page * find_get_page(struct address_space *mapping, pgoff_t offset)
{
        struct page *page;

        page = radix_tree_lookup(&mapping->page_tree, offset);
        if (page)
                page_cache_get(page);
        return page;
}
Life is easy for the page cache because all the hard work is done by the radix tree implementation: radix_tree_lookup finds the desired page at a given offset, and page_cache_get increments the page's reference count if one was found.

However, pages will very often belong to a file. Unfortunately, positions in a file are specified as byte offsets, not as offsets within the page cache. How can a file offset be converted into a page cache offset? Currently, the granularity of the page cache is a single page; that is, the leaf elements of the page cache radix tree are single pages. Future kernels might, however, increase the granularity, so assuming a page size granularity is not valid. Instead, the macro PAGE_CACHE_SHIFT is provided. The object size for a page cache element can be computed as 2^PAGE_CACHE_SHIFT. Converting a byte offset in a file into a page cache offset is then a simple matter of shifting the offset right by PAGE_CACHE_SHIFT bits:

index = ppos >> PAGE_CACHE_SHIFT;
ppos is a byte offset into a file, and index contains the corresponding page cache offset.
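As a quick worked example (assuming the common case of 4 KiB pages and thus PAGE_CACHE_SHIFT = 12): a byte position of ppos = 10,000 yields index = 10000 >> 12 = 2, so the data reside in the third page of the mapping.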
Two auxiliary functions are provided for convenience: <pagemap.h>
struct page * find_or_create_page(struct address_space *mapping,
                pgoff_t index, gfp_t gfp_mask);
struct page * find_lock_page(struct address_space *mapping, pgoff_t index);

find_or_create_page does what the name promises — it looks up a page in the page cache and allocates a fresh one if it is not there. The page is inserted into the cache and the LRU list by calling add_to_page_cache_lru. find_lock_page works like find_get_page, but locks the page.
Caution: If the page is already locked from some other part of the kernel, the function can sleep until the page is unlocked.

It is also possible to search for more than one page. Here are the prototypes of the responsible auxiliary functions:

<pagemap.h>
unsigned find_get_pages(struct address_space *mapping, pgoff_t start,
                unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
                unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
                int tag, unsigned int nr_pages, struct page **pages);
❑ find_get_pages returns up to nr_pages pages in the mapping starting from the page cache offset start. Pointers to the pages are placed on the array pages. The function does not guarantee to return a contiguous range of pages — there can be holes for non-present pages. The return value is the number of pages that were found. (A short usage sketch follows this list.)

❑ find_get_pages_contig works similarly to find_get_pages, but the selected page range is guaranteed to be contiguous. The function stops adding pages to the page array when the first hole is discovered.

❑ find_get_pages_tag operates like find_get_pages, but only selects pages that have a specific tag set. Additionally, the index parameter points to the page cache index of the page that immediately follows the last page in the resulting page array.
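As announced above, here is a small usage sketch for find_get_pages. It merely checks whether any of the first 16 pages of a mapping are currently cached; since the function takes a reference on every page it returns, the references must be dropped again. The helper name is invented for illustration.

    static int mapping_has_cached_head(struct address_space *mapping)
    {
            struct page *pages[16];
            unsigned int found, i;

            found = find_get_pages(mapping, 0, 16, pages);
            for (i = 0; i < found; i++)
                    page_cache_release(pages[i]);   /* drop the references taken by the lookup */

            return found > 0;
    }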
16.4.3 Waiting on Pages

The kernel often needs to wait on pages until their status has changed to some desired value. The synchronization implementation, for instance, sometimes wants to ensure that writing back a page has been finished and the contents in memory are identical with the data on the underlying block device. Pages under writeback have the PG_writeback bit set.
The function wait_on_page_writeback is provided to wait until the bit disappears:

<pagemap.h>
static inline void wait_on_page_writeback(struct page *page)
{
        if (PageWriteback(page))
                wait_on_page_bit(page, PG_writeback);
}

wait_on_page_bit installs a wait queue on which the process can sleep until the PG_writeback bit is
removed from the page flags. Likewise, the need to wait for a page to become unlocked can arise; wait_on_page_locked is responsible for handling this case.
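A typical caller pattern looks like the following sketch (illustrative only, not a specific kernel path): the page is locked so that no new writeback can be started, and then any writeback that is already in flight is waited for.

    static void wait_until_writeback_done(struct page *page)
    {
            lock_page(page);                /* may sleep until the page is unlocked */
            wait_on_page_writeback(page);   /* sleeps while PG_writeback is set */

            /* at this point no writeback I/O is in flight for this page */

            unlock_page(page);
    }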
16.4.4 Operations with Whole Pages

Modern block devices can — despite their name — transfer not just individual blocks but much larger units of data in a single operation, thus boosting system performance. This is reflected by a strong kernel focus on algorithms and structures that use pages as the elementary units of transfer between block devices and memory. Buffer-by-buffer transfer acts as a substantial brake on performance when handling complete pages. In the course of the redesign of the block layer, BIOs were introduced during the development of 2.5 as a replacement for buffers to handle transfers with block devices. Four new functions were added to the kernel to support the reading and writing of one or more pages:

<mpage.h>
int mpage_readpages(struct address_space *mapping, struct list_head *pages,
                unsigned nr_pages, get_block_t get_block);
int mpage_readpage(struct page *page, get_block_t get_block);
int mpage_writepages(struct address_space *mapping,
                struct writeback_control *wbc, get_block_t get_block);
int mpage_writepage(struct page *page, get_block_t *get_block,
                struct writeback_control *wbc);
The meaning of the parameters is evident from the preceding sections, the only exception being writeback_control. As discussed in Chapter 17, this is an option for fine control of the writeback operation.

Since the implementations of the four functions share much in common (their goal is always to construct a suitable bio instance for transfer to the block layer), this discussion will be confined to examining just one specimen — mpage_readpages. The function expects nr_pages page instances as parameters passed in a linked list. mapping is the associated address space, and get_block is, as usual, invoked to find the matching block addresses. The function iterates in a loop over all page instances:

fs/mpage.c
int mpage_readpages(struct address_space *mapping, struct list_head *pages,
                unsigned nr_pages, get_block_t get_block)
{
        struct bio *bio = NULL;
        unsigned page_idx;
        sector_t last_block_in_bio = 0;
        struct buffer_head map_bh;
        struct pagevec lru_pvec;

        clear_buffer_mapped(&map_bh);
        for (page_idx = 0; page_idx < nr_pages; page_idx++) {
                struct page *page = list_entry(pages->prev, struct page, lru);
Each loop pass first adds the page to the address space-specific cache before a bio request is created to read the desired data for the block layer:

fs/mpage.c
                list_del(&page->lru);
                if (!add_to_page_cache_lru(page, mapping, page->index, GFP_KERNEL)) {
                        bio = do_mpage_readpage(bio, page,
                                        nr_pages - page_idx,
                                        &last_block_in_bio, &map_bh,
                                        &first_logical_block,
                                        get_block);
                } else {
                        page_cache_release(page);
                }
        }
The pages are installed both in the page cache and in the kernel's LRU list using add_to_page_cache_lru. When do_mpage_readpage builds the bio request, the BIO data of the preceding pages are also included so that a combined request can be constructed. If several successive pages are to be read from the block device, this can be done in a single request rather than submitting an individual request for each page. Notice that the buffer_head passed to do_mpage_readpage is usually not required. However, if an unusual situation is encountered (e.g., a page that contains buffers), then it falls back to using the old-fashioned, blockwise read routines.

If, at the end of the loop, a BIO request is left unprocessed by do_mpage_readpage, it is now submitted:

fs/mpage.c
        if (bio)
                mpage_bio_submit(READ, bio);
        return 0;
}
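To put these routines into context, the following hedged sketch shows how a filesystem typically wires them into its address_space_operations. The myfs_* names are invented for illustration; Ext2, for example, follows exactly this pattern with ext2_get_block as the get_block routine.

#include <linux/fs.h>
#include <linux/mpage.h>

/* Hypothetical get_block routine of the filesystem; defined elsewhere. */
int myfs_get_block(struct inode *inode, sector_t iblock,
                   struct buffer_head *bh_result, int create);

static int myfs_readpage(struct file *file, struct page *page)
{
        return mpage_readpage(page, myfs_get_block);
}

static int myfs_readpages(struct file *file, struct address_space *mapping,
                          struct list_head *pages, unsigned nr_pages)
{
        return mpage_readpages(mapping, pages, nr_pages, myfs_get_block);
}

static const struct address_space_operations myfs_aops = {
        .readpage       = myfs_readpage,
        .readpages      = myfs_readpages,
        /* writepage, writepages, and so on */
};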
16.4.5 Page Cache Readahead

Predicting the future is generally accepted to be a rather hard problem, but from time to time, the kernel cannot resist making a try nevertheless. Actually, there are situations where it is not too hard to say what will happen next, namely, when a process is reading data from a file. Usually pages are read sequentially — this is also an assumption made by most filesystems. Recall from Chapter 9 that the extended filesystem family makes a great effort to allocate adjacent blocks for a file such that the head of a block device needs to move as little as possible when data are read and written.
Consider the situation in which a process has read a file linearly from position A to position B. This behavior will usually continue for a while. It therefore makes sense to read ahead of B (say, until position C) such that when requests for pages between B and C are issued by the process, the data are already contained in the page cache. Naturally readahead cannot be tackled by the page cache alone; support by the VFS and memory management layers is required. In fact, the readahead mechanism was discussed in Sections 8.5.2 and 8.5.1. Recall that readahead is controlled from three places as far as the kernel is directly concerned:6
1. do_generic_mapping_read, a generic read routine in which most filesystems that rely on the standard routines of the kernel to read data end up at some point.

2. The page fault handler filemap_fault, which is responsible for reading missing pages for memory mappings.

3. __generic_file_splice_read, a routine invoked to support the splice system call, which allows for passing data between two file descriptors directly in kernel space, without the need to involve userspace.7
The temporal flow of the readahead routines on the source code level was discussed in Chapter 8, but it is also instructive to observe the behavior from a higher level. Such a viewpoint is provided in Figure 16-4. For the sake of simplicity, I restrict my consideration to do_generic_mapping_read in the following.

Figure 16-4: Overview of the readahead mechanism and the required interplay between VFS and page cache. (The figure shows the readahead window starting at file_ra_state->start with size file_ra_state->size, the page marked with PG_Readahead, the initial pages read by page_cache_sync_readahead when a page is accessed but not present, and the pages read in the background by page_cache_async_readahead when only async_size pages of the window are left.)

6 These are at least the places covered in this book. Readahead can also be influenced from userland with the madvise, fadvise, and readahead system calls, but I will not discuss them any further.
7 I do not discuss this system call anywhere in more detail, but refer you to the manual page splice(2) for more information.

Suppose that a process has opened a file and wants to read in the first page. The page is not yet contained in the page cache. Since typical users will not only read in a single page, but multiple sequential
pages, the kernel employs page_cache_sync_readahead to read in 8 pages in a row — the number is just an example that does not comply with reality. The first page is immediately available for do_generic_mapping_read.8

Pages selected to be read in before they are actually required are said to be in a readahead window. The process now continues to read in pages and behaves linearly as expected. When the sixth page is accessed (notice that the page was already contained in the page cache before the process issued a corresponding request), do_generic_mapping_read notices that the page was equipped with the PG_Readahead bit in the synchronous read pass.9 This triggers an asynchronous operation that reads in a number of pages in the background. Since two more pages are left in the page cache, there is no need to hurry, so a synchronous operation is not required. However, the I/O performed in the background will ensure that the pages are present when the process makes further progress in the file. If the kernel did not adopt this scheme, readahead could only start after a process has experienced a page fault. While the required page (and some more pages for readahead) could then be brought into the page cache synchronously, this would introduce delays, which are clearly undesired.

This scheme is now repeated further. Since page_cache_async_readahead — which is responsible for issuing the asynchronous read request — has again marked a page in the readahead window with the PG_Readahead bit, the kernel will start asynchronous readahead again when the process comes to this page, and so on.

So much for do_generic_mapping_read. The differences in how filemap_fault handles things are twofold: Asynchronous, adaptive readahead is only performed if a sequential read hint is set. If no readahead hint is given, then do_page_cache_readahead does a single-shot readahead without setting PG_Readahead, and also without updating the file's readahead state tracking information.

Several functions are used to implement the readahead mechanism. Figure 16-5 shows how they are connected with each other.
Figure 16-5: Functions used to implement readahead. Note that the figure shows the connections between the functions, but is not a proper code flow diagram. (The fadvise, madvise, and readahead system calls enter via force_page_cache_readahead and do_page_cache_readahead; the VFS and memory management layers use page_cache_sync_readahead and page_cache_async_readahead, which both end up in ondemand_readahead and, via ra_submit, in __do_page_cache_readahead, where the pages are finally brought into the page cache.)

8 Actually, the term synchronous as adopted by the kernel is a bit misleading here. No effort is made to wait on completion of the read operation submitted by page_cache_sync_readahead, so it is not synchronous in the usual sense of the word. However, since reading in one page is fast, chances are very good that the page will usually have arrived when page_cache_sync_readahead returns to the caller. Nevertheless, the caller has to make precautions for the case in which the page is not yet available.
9 Since the readahead state for each file is separately tracked, the kernel would essentially not require this special flag because the corresponding information could also be obtained otherwise. However, it is required when multiple concurrent readers act on a file.
Reading pages into the page cache before they are actually required is simple from a technical point of view and can easily be achieved with the framework introduced so far in this chapter. The challenge lies in predicting the optimal size of the readahead window. For this purpose, the kernel keeps track of the last setting for each file. The following data structure is associated with every file instance:
struct file_ra_state {
        pgoff_t start;                  /* where readahead started */
        unsigned int size;              /* # of readahead pages */
        unsigned int async_size;        /* do asynchronous readahead when
                                           there are only # of pages ahead */
        unsigned int ra_pages;          /* Maximum readahead window */
        ...
        loff_t prev_pos;                /* Cache last read() position */
};

start denotes the position in the page cache where readahead was started, and size gives the size of the readahead window. async_size represents the least number of remaining readahead pages. If only
this many pages are still available in the readahead window, then asynchronous readahead is initiated to bring more pages into the page cache. The meaning of these values is also illustrated in Figure 16-4. ra_pages denotes the maximum size of the readahead window. The kernel can decide to read in fewer pages than specified by this value, but it will never read in more. Finally, prev_pos denotes the position
that was last visited in previous reads.
The offset is given as a byte offset into the file, not as a page offset into the page cache! This allows filesystem code that does not know anything about page cache offsets to aid the readahead mechanism. The most important providers of this value are, however, do_generic_mapping_read and filemap_fault.

The routine ondemand_readahead is responsible for implementing the readahead policy, that is, for deciding how many pages will be read in before they are actually required. As Figure 16-5 shows, both page_cache_sync_readahead and page_cache_async_readahead rely on this function. After deciding on the size of the readahead window, ra_submit is called to delegate the technical aspects to __do_page_cache_readahead. Here pages are allocated in the page cache and subsequently filled from the block layer.

Before discussing ondemand_readahead, two helper functions must be introduced: get_init_ra_size determines the initial readahead window size for a file, and get_next_ra_size computes the window for subsequent reads, that is, when a previous readahead window exists. get_init_ra_size determines the window size based on the number of pages requested from the process, and get_next_ra_size bases the computation on the size of the previous readahead window. Both functions ensure that the size of the readahead window does not exceed a file-specific upper limit. While the limit can be modified with the fadvise system call, it is usually set to VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE, which equates to 32 pages on systems with a page size of 4 KiB. The results of both functions are shown in Figure 16-6. The graph shows how the size of the initial readahead scales with request size, and also demonstrates how the size of subsequent readahead operations scales depending on the size of the previous readahead
window. Mathematically speaking, the maximal readahead size is a fixed point of both functions. In practical terms, this means that the readahead window can never grow beyond the maximally allowed value, in this case, 32 pages.
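The exact formulas need not concern us here, but the growth pattern can be illustrated with a small sketch. The helpers below are invented stand-ins, not the kernel's get_init_ra_size and get_next_ra_size, which differ in detail:

static unsigned long init_window_sketch(unsigned long req_size,
                                        unsigned long max)
{
        /* Read ahead of the request; the factor is only illustrative. */
        unsigned long size = req_size * 4;

        return size < max ? size : max;
}

static unsigned long next_window_sketch(unsigned long prev_size,
                                        unsigned long max)
{
        /* Grow the window with each sequential access... */
        unsigned long size = prev_size * 2;

        /* ...but never let it exceed the per-file maximum, which is
         * therefore a fixed point of the computation. */
        return size < max ? size : max;
}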
Figure 16-6: How the kernel determines the readahead window depending on the request size. (The graph plots the resulting window size against the request size and the size of the previous readahead window, respectively, for the initial and the subsequent window-size computations.)

Let's go back to ondemand_readahead, which has to set the readahead window with the help of these auxiliary functions. Three cases are most essential (a simplified sketch follows the list):
1. The current offset is either at the end of the previous readahead window or at the end of the interval that was synchronously read in. In both cases, the kernel assumes sequential read access and uses get_next_ra_size to compute the new window size as discussed.

2. If the readahead marker was hit, but the previous readahead state does not predict this, then most likely two or more concurrent streams perform interleaved reads on the file — and invalidate each other's readahead state in the process. The kernel constructs a new readahead window that suits all readers.

3. If (among others) the first read access on a file is performed or a cache miss has happened, a new readahead window is set up with get_init_ra_size.
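As announced above, the three cases can be condensed into the following sketch. It is not the actual ondemand_readahead implementation from mm/readahead.c, which handles further corner cases; the sizing helpers are the illustrative ones from the previous sketch, and the fields are those of struct file_ra_state introduced above:

#include <linux/fs.h>

static void ondemand_readahead_sketch(struct file_ra_state *ra, pgoff_t offset,
                                      unsigned long req_size, int marker_hit)
{
        unsigned long max = ra->ra_pages;

        if (offset == ra->start + ra->size - ra->async_size ||
            offset == ra->start + ra->size) {
                /* Case 1: sequential access continues at the end of the
                 * previous window; move the window forward and grow it. */
                ra->start += ra->size;
                ra->size = next_window_sketch(ra->size, max);
        } else if (marker_hit) {
                /* Case 2: the PG_Readahead marker was hit although the saved
                 * state does not predict it; probably interleaved readers.
                 * Start a fresh window at the current position. */
                ra->start = offset;
                ra->size = init_window_sketch(req_size, max);
        } else {
                /* Case 3: first access or cache miss; set up a new initial
                 * window with the request-dependent size. */
                ra->start = offset;
                ra->size = init_window_sketch(req_size, max);
        }

        /* Simplified: request asynchronous readahead as soon as the last
         * part of the window is reached. */
        ra->async_size = ra->size;
}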
16.5 Implementation of the Buffer Cache
The buffer cache is used not only as an add-on to the page cache but also as an independent cache for objects that are not handled in pages but in blocks.
16.5.1 Data Structures

Fortunately, the data structures for both types of cache — the independent buffer cache and the elements used to support the page cache — are identical, and this greatly simplifies implementation. The principal elements of the buffer cache are the buffer heads, whose basic characteristics are discussed above. The buffer head definition in the kernel sources is as follows:
struct buffer_head {
        unsigned long b_state;           /* buffer state bitmap (see above) */
        struct buffer_head *b_this_page; /* circular list of page's buffers */
        struct page *b_page;             /* the page this bh is mapped to */

        sector_t b_blocknr;              /* start block number */
        size_t b_size;                   /* size of mapping */
        char *b_data;                    /* pointer to data within the page */

        struct block_device *b_bdev;
        bh_end_io_t *b_end_io;           /* I/O completion */
        void *b_private;                 /* reserved for b_end_io */

        atomic_t b_count;                /* users using this buffer_head */
        ...
};
Buffers, like pages, can have many states. The current state of a buffer head is held in the b_state element that accepts the following selection of values (the full list of values is available as an enum called bh_state_bits in include/linux/buffer_head.h):

❑ The state is BH_Uptodate if the current data in the buffer match the data in the backing store.

❑ Buffers are labeled as BH_Dirty if their data have been modified and no longer match the data in the backing store.

❑ BH_Lock indicates that the buffer is locked for further access. Buffers are explicitly locked during I/O operations to prevent several threads from handling the buffers concurrently and thus interfering with each other.

❑ BH_Mapped means that there is a mapping of the buffer contents on a secondary storage device, as is the case with all buffers that originate from filesystems or from direct accesses to block devices.

❑ BH_New marks newly created buffers as new.
b_state is interpreted as a bitmap. Every possible constant stands for a position in the bitmap. As a result, several values (BH_Lock and BH_Mapped, e.g.) can be active at the same time — as is also the case at many other points in the kernel.
BH_Uptodate and BH_Dirty can also be active at the same time, and this is often the case. Whereas BH_Uptodate is set after a buffer has been filled with data from the block device, the kernel uses BH_Dirty to indicate that the data in memory have been modified but not yet been written back. This may appear to be confusing but must be remembered when considering the information below.
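The state bits are not set and queried directly but through small inline helpers. Slightly simplified, the macro in <buffer_head.h> that generates them looks like this; BUFFER_FNS(Uptodate, uptodate), for instance, produces set_buffer_uptodate, clear_buffer_uptodate, and buffer_uptodate:

#define BUFFER_FNS(bit, name)                                           \
static inline void set_buffer_##name(struct buffer_head *bh)           \
{                                                                       \
        set_bit(BH_##bit, &(bh)->b_state);                              \
}                                                                       \
static inline void clear_buffer_##name(struct buffer_head *bh)         \
{                                                                       \
        clear_bit(BH_##bit, &(bh)->b_state);                            \
}                                                                       \
static inline int buffer_##name(const struct buffer_head *bh)          \
{                                                                       \
        return test_bit(BH_##bit, &(bh)->b_state);                      \
}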
Besides the above constants, a few additional values are defined in enum bh_state_bits. They are ignored here because they are either of little importance or simply no longer used. They are retained in the kernel sources for historical reasons and will disappear sooner or later. The kernel defines the set_buffer_foo and buffer_foo functions to set and read the buffer state bit for BH_Foo. The buffer_head structure also includes further elements whose meanings are given below:
❑ b_count implements the usual access counter to prevent the kernel from freeing buffer heads that are still in active use.

❑ b_page holds a pointer to a page instance with which the buffer head is associated when used in conjunction with the page cache. If the buffer is independent, b_page contains a null pointer.

❑ As discussed above, several buffers are used to split the contents of a page into smaller units. All buffer heads belonging to these units are kept on a singly linked, circular list using b_this_page (the entry for the last buffer points to the entry for the first buffer to create a circular structure); a traversal example follows this list.

❑ b_blocknr holds the number of the block on the underlying block device, and b_size specifies the size of the block. b_bdev is a pointer to the block_device instance of the block device. This information uniquely identifies the source of the data.

❑ The pointer to the data in memory is held in b_data (the end position can be calculated from b_size; there is therefore no need for an explicit pointer to this position, although a pointer was used above for the sake of simplicity).

❑ b_end_io points to a routine that is automatically invoked by the kernel when an I/O operation involving the buffer is completed (it is required by the BIO routines described in Chapter 6). This enables the kernel to postpone further buffer handling until a desired input or output operation has, in fact, been completed.

❑ b_private is a pointer reserved for private use by b_end_io. It is used primarily by journaling filesystems. It is usually set to NULL if it is not needed.
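As announced in the list, the circular b_this_page linkage is typically traversed with a loop of the following form (a hedged sketch; the caller must have verified with page_has_buffers that buffers are attached, and real code additionally holds the appropriate locks):

#include <linux/buffer_head.h>

static void inspect_page_buffers(struct page *page)
{
        struct buffer_head *head = page_buffers(page);
        struct buffer_head *bh = head;

        do {
                /* Inspect or modify the buffer here, for example by
                 * checking buffer_uptodate(bh) or buffer_dirty(bh). */
                bh = bh->b_this_page;
        } while (bh != head);
}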
16.5.2 Operations

The kernel must provide a set of operations so that the rest of the code can easily and efficiently exploit the functionality of buffers. This section describes the mechanisms for creating and managing new buffer heads. Caution: These mechanisms make no contribution to the actual caching of data in memory, discussed in later sections. Before buffers can be used, the kernel must first create an instance of the buffer_head structure on which the remaining functions act. As the generation of new buffer heads is a frequently recurring task, it should be performed as quickly as possible. This is a classical situation for the use of a slab cache as described in Chapter 3.
Caution: When a slab cache is used, memory is allocated only for the buffer head. The actual data are ignored when the buffer head is created and must be stored elsewhere.
The kernel sources do, of course, provide functions that can be used as front ends to create and destroy buffer heads. alloc_buffer_head generates a new buffer head, and free_buffer_head destroys an existing head. Both functions are defined in fs/buffer.c. As you might expect, they essentially consist of straightforward gymnastics with memory management functions and statistics accounting and need not be discussed here.
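A minimal hedged sketch of how these front ends are paired is shown below. GFP_NOFS is the allocation mode commonly used in this context to avoid recursing into filesystem code; the data area itself must be provided separately by the caller:

#include <linux/buffer_head.h>

static struct buffer_head *make_bare_buffer_head(void)
{
        struct buffer_head *bh = alloc_buffer_head(GFP_NOFS);

        if (!bh)
                return NULL;
        /* b_data, b_size, b_state, and so on must be set up before use. */
        return bh;
}

static void destroy_bare_buffer_head(struct buffer_head *bh)
{
        free_buffer_head(bh);
}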
16.5.3 Interaction of Page and Buffer Cache

Buffer heads become much more interesting when used in conjunction with the useful data that they are to hold in memory. This section examines the link between pages and buffer heads.
Linking of Pages and Buffer Heads

How are buffers and pages interlinked? Recall that this approach was briefly discussed above. A page is split into several data units (the actual number varies between architectures depending on page and block size), but the buffer heads are held in a separate memory area that has nothing to do with the actual data. The page contents are not modified by the interaction with buffers, as the latter simply provide a new view of the page data. The private element of struct page is required to support interaction between a page and buffers. It is of type unsigned long and can therefore be used as a pointer to any position in virtual address space (the exact definition of struct page is given in Chapter 3):

<mm.h>
struct page {
        ...
        unsigned long private;          /* Mapping-private opaque data */
        ...
};
The private element can also be used for various other purposes that, depending on page use, need have nothing to do with buffers.10 However, its predominant use is to link buffers and pages. In this case, private points to the first buffer head used to split the page into smaller units. The various buffer heads are linked in a cyclic list by means of b_this_page. In this list, each pointer points to the next buffer, and the b_this_page element of the last buffer head points to the first buffer. This enables the kernel to easily scan all buffer_head instances associated with the page, starting from the page structure. How is the association between the page and the buffer_head structures established? The kernel provides the create_empty_buffers and link_dev_buffers functions for this purpose, both of which are implemented in fs/buffer.c. The latter serves to associate an existing set of buffer heads with a page, whereas create_empty_buffers generates a completely new set of buffers for association with the page. For example, create_empty_buffers is invoked when reading and writing complete pages with block_read_full_page and __block_write_full_page. create_empty_buffers first invokes alloc_page_buffers to create the required number of buffer heads
(this number varies according to page and block size). It returns a pointer to the first element of a singly

10 If the page resides in the swap cache, an instance of swp_entry_t is also stored in the cache. If the page is not in use, the element holds the order in the buddy system.
linked list in which each b_this_page element points to the next buffer. The only exception is the last buffer, where b_this_page holds a null pointer:

fs/buffer.c
void create_empty_buffers(struct page *page,
                        unsigned long blocksize, unsigned long b_state)
{
        struct buffer_head *bh, *head, *tail;

        head = alloc_page_buffers(page, blocksize, 1);
        ...
The function then iterates over all buffer heads to set their state and generate a cyclic list: fs/buffer.c
        do {
                bh->b_state |= b_state;
                tail = bh;
                bh = bh->b_this_page;
        } while (bh);
        tail->b_this_page = head;
        ...
The state of the buffers depends on the state of the data in the page in memory: fs/buffer.c
        if (PageUptodate(page) || PageDirty(page)) {
                bh = head;
                do {
                        if (PageDirty(page))
                                set_buffer_dirty(bh);
                        if (PageUptodate(page))
                                set_buffer_uptodate(bh);
                        bh = bh->b_this_page;
                } while (bh != head);
        }
        attach_page_buffers(page, head);
}

set_buffer_dirty and set_buffer_uptodate set the corresponding flags BH_Dirty and BH_Uptodate,
respectively, in the buffer head. The concluding invocation of attach_page_buffers associates the buffer with the page in two separate steps:
1. The PG_private bit is set in the page flags to inform the rest of the kernel code that the private element of the page instance is in use.

2. The private element of the page is equipped with a pointer to the first buffer head in the cyclic list. (A sketch of this helper follows the list.)
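In the kernel versions discussed here, attach_page_buffers is a small inline helper from <buffer_head.h> that performs essentially these two steps and additionally takes a reference on the page; roughly:

static inline void attach_page_buffers(struct page *page,
                                       struct buffer_head *head)
{
        page_cache_get(page);                        /* hold a page reference */
        SetPagePrivate(page);                        /* step 1: mark private as in use */
        set_page_private(page, (unsigned long)head); /* step 2: point to the buffers */
}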
At first sight, setting the PG_private flag would not appear to be a far-reaching action. However, it is important because it is the only way that the kernel is able to detect whether a page has attached
buffers. Before the kernel launches any operations to modify or process buffers associated with a page, it must first check whether buffers are actually present — this is not always the case. The kernel provides page_has_buffers(page) to do this by checking whether the flag is set. This function is called at a very large number of places in the kernel sources and is therefore worthy of mention.
Interaction

Setting up a link between pages and buffers serves little purpose if there are no benefits for other parts of the kernel. As already noted, some transfer operations to and from block devices may need to be performed in units whose size depends on the block size of the underlying devices, whereas many parts of the kernel prefer to carry out I/O operations with page granularity as this makes things much easier — especially in terms of memory management.11 In this scenario, buffers act as intermediaries between the two worlds.
Reading Whole Pages in Buffers

Let us first look at the approach adopted by the kernel when it reads whole pages from a block device, as is the case in block_read_full_page. Let's discuss the sections of interest as seen by the buffer implementation. Figure 16-7 shows the buffer-related function calls that make up block_read_full_page.
Figure 16-7: Code flow diagram for the buffer-related operations of block_read_full_page. (The function iterates over all buffers of the page, calling create_empty_buffers if no buffers are attached, skipping up-to-date buffers, and invoking get_block for unmapped ones; the buffers that still need to be read are then locked, marked with mark_buffer_async_read, and handed to submit_bh.)

block_read_full_page reads a full page in three steps:
1. The buffers are set up and their state is checked.

2. The buffers are locked to rule out interference by other kernel threads in the next step.

3. The data are transferred to the buffers.
The first step involves checking whether buffers are already attached to the page as this is not always the case. If not, buffers are created using the create_empty_buffers function discussed a few sections

11 I/O operations are usually more efficient if data are read or written in pages. This was the main reason for introducing the BIO layer that has replaced the old concept based on buffer heads.
back. Thereafter, the buffers — whether just created or already in existence — are identified using page_buffers before they are handled as described below. page_buffers simply translates the private element of the page into a buffer_head pointer by means of pointer conversion because, by convention, private points to the first buffer if buffers are attached to a page. The main work of the kernel is to find out which buffers are current (their data match that on the block device or may even be more up-to-date) and therefore need not be read, and which buffers hold invalid data. To do this, the kernel makes use of the BH_Mapped and BH_Uptodate state bits, both of which may be set or unset. It iterates over all buffers attached to the page and performs the following checks:
1. If the buffer contents are up-to-date (this can be checked with buffer_uptodate), the kernel continues to process the next buffer. In this case, the data in the page cache and on the block device match, and an additional read operation is not required.

2. If there is no mapping (BH_Mapped is not set), get_block is invoked to determine the position of the block on the block storage medium. ext2_get_block and ext3_get_block, respectively, are used for this purpose on Ext2/Ext3 filesystems. Other filesystems use functions with similar names. Common to all alternatives is that the buffer_head structure is modified so that it can be used to locate the desired block in the filesystem. Essentially, this involves setting the b_bdev and b_blocknr fields because they identify the desired block. (A minimal example of such a routine follows this list.) The actual reading of data from the block device is performed not by get_block but later during the course of block_read_full_page. After execution of get_block, the state of the buffer is BH_Mapped but not BH_Uptodate.12

3. A third situation is also possible. The buffer already has a mapping but is not up-to-date. The kernel then need perform no other actions.

4. Once the individual combinations of BH_Uptodate and BH_Mapped have been distinguished, the buffer is placed in a temporary array if it has a mapping but is not up-to-date. Processing then continues with the page's next buffer until no further buffers are available.
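As referenced in step 2, a get_block routine only has to fill in the mapping information in the buffer head. The following example is hypothetical (a made-up filesystem that stores each file contiguously); it uses the real map_bh helper from <buffer_head.h>, whereas routines such as ext2_get_block look the block up in their indirection structures instead:

#include <linux/buffer_head.h>
#include <linux/fs.h>

/* Invented layout: file data start at a block number derived from the inode. */
static sector_t myfs_first_block(struct inode *inode)
{
        return (sector_t)inode->i_ino * 1024;
}

static int myfs_get_block(struct inode *inode, sector_t iblock,
                          struct buffer_head *bh_result, int create)
{
        /* map_bh sets BH_Mapped and fills b_bdev, b_blocknr, and b_size. */
        map_bh(bh_result, inode->i_sb, myfs_first_block(inode) + iblock);
        return 0;
}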
If all buffers attached to the page are up-to-date, the whole page can be set to this state using SetPageUptodate. The function then terminates because all the data on the whole page now reside in memory. However, there are usually still buffers that have a mapping but do not reflect the current contents of the block device. Reminder: Buffers of this kind are collected in an array that is used for the second and third phases of block_read_full_page. In the second phase, all buffers to be read are locked using lock_buffer. This prevents two kernel threads from reading the same buffer at the same time and therefore interfering with each other.

12 There is one other state in which a buffer is up-to-date but is not mapped. This state occurs when a file with gaps is read (as can occur with the Second Extended Filesystem, e.g.). In this case, the buffer is filled with null bytes, but I shall ignore this scenario.
mark_buffer_async_read is also invoked to set end_buffer_async_read for b_end_io — this function is invoked automatically when data transfer ends.
Actual I/O is triggered in the third phase in which submit_bh forwards all buffers to be read to the block or BIO layer where the read operation is started. The function stored in b_end_io (end_buffer_async_read in this case) is called when the read operation terminates. It iterates over all the page’s buffers, checks their state, and sets the state of the entire page to up-to-date assuming all buffers have this state. As can be seen, the advantage of block_read_full_page is that it is necessary to read only those parts of the page that are not up-to-date. However, if it is certain that the entire page is not up-to-date, mpage_readpage is the better alternative as the buffer overhead is then superfluous.
Writing Whole Pages into Buffers

Not only reading but also writing of full pages can be divided into smaller buffer units. Only those parts of a page that have actually been modified need be written back, not the whole page contents. Unfortunately, from the buffer viewpoint, the implementation of write operations is much more complicated than the read operations described above. I ignore the minor details of the (somewhat simplified) write operations and focus on the key actions required of the kernel in my discussion below. Figure 16-8 shows the code flow diagram for the error-free performance of the buffer-related operations needed to write back dirty pages in the __block_write_full_page function (to simplify matters, I also omit some seldom required corner cases that must be dealt with in reality).
Figure 16-8: Code flow diagram for the buffer-related operations of __block_write_full_page. (If no buffers are attached to the page, create_empty_buffers is called; get_block provides a mapping for unmapped dirty buffers; mapped dirty buffers are locked and marked with mark_buffer_async_write; SetPageWriteback marks the whole page, and all buffers marked for asynchronous writeback are finally handed to submit_bh.)
The writeback process is split into several parts, each of which repeatedly iterates over the singly linked list of buffers attached to a page.
As usual, it is first necessary to check that buffers are actually attached to the page — this cannot be taken for granted. As when a page is read, page_has_buffers is invoked to check whether buffers are present. If not, they are created using create_empty_buffers. The kernel then iterates a total of three times over the list of buffers, as shown in the code flow diagram:
1. The purpose of the first iteration is to create a mapping between the buffer and the block device for all unmapped but dirty buffers. The function held in the get_block function pointer is invoked to find the matching block of the block device for the buffer.

2. In the second iteration, all dirty buffers are filtered out; this can be checked by test_clear_buffer_dirty — if the flag was set, it is deleted when the function is invoked because the buffer contents are due to be written back immediately.13 mark_buffer_async_write sets the BH_Async_Write state bit and assigns end_buffer_async_write as the BIO completion handler to b_end_io. At the end of this iteration, set_page_writeback sets the PG_writeback flag for the full page.

3. In the third and final iteration, all buffers marked with BH_Async_Write in the previous pass are forwarded to the block layer, which performs the actual write operation; this is done by invoking submit_bh, which submits a corresponding request (by means of BIOs; see Chapter 6).

13 At this point, the kernel must also call buffer_mapped to ensure that there is a mapping for the buffer. This is not the case if there are holes in files, but then there is nothing to write back.
When the write operation for a buffer terminates, end_buffer_async_write is invoked automatically to check whether this also applies for all other buffers of the page. If so, all processes that are sleeping on the queue associated with the page and that are waiting for this event are woken.
16.5.4 Independent Buffers

Buffers are used not only in the context of pages. In earlier versions of the Linux kernel, all caching was implemented with buffers without resorting to page caching. The value of this approach has diminished in successive versions, and nearly all importance has been attached to full pages. However, there are still situations in which higher-level code accesses block device data on the block level rather than on the page level. To help speed up such operations, the kernel provides yet another cache known as an LRU buffer cache, discussed below. This cache for independent buffers is not totally divorced from the page cache. Since RAM memory is always managed in pages, buffered blocks must also be held in pages, with the result that there are some points of contact with the page cache. These cannot and should not be ignored — after all, access to individual blocks is still possible via the buffer cache without having to worry about the organization of the blocks into pages.
Mode of Operation

Why LRU? As we know, this abbreviation stands for least recently used and refers to a general method in which the elements of a set that are most frequently used can be managed efficiently. If an element is frequently accessed, the likelihood is that it resides in RAM (and is therefore cached). Less frequently or seldom used elements drop out of the cache automatically with time.
To make lookup operations fast, the kernel scans the cache entries from top to bottom each time an independent buffer is requested. If an element contains the required data, the instance in the cache can be used. If not, the kernel must submit a low-level request to the block device to get the desired data. The element last used is automatically placed at the first position by the kernel. If the element was already in the cache, only the positions of the individual elements change. If the element was read from the block device, the last element of the array "drops out" of the cache and can therefore be removed from memory. The algorithm is very simple but nevertheless effective. The time needed to look up frequently used elements is reduced because the element is automatically located at one of the top array positions. At the same time, less used elements automatically drop out of the cache if they are not accessed for a certain period. The only disadvantage of this approach is the fact that almost the full contents of the array need to be repositioned after each lookup operation. This is time-consuming and can be implemented for small caches only. Consequently, buffer caches have only a low capacity.
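The algorithm can be illustrated independently of the kernel data structures with the following move-to-front sketch; it is purely illustrative and not taken from the kernel sources:

#include <stddef.h>
#include <string.h>

#define CACHE_SIZE 8

static void *cache[CACHE_SIZE];

/* Insert a new entry at the front; the entry at the last position drops out. */
static void cache_insert_front(void *entry)
{
        memmove(&cache[1], &cache[0], (CACHE_SIZE - 1) * sizeof(void *));
        cache[0] = entry;
}

/* Look an entry up: on a hit it is moved to the front, on a miss the caller
 * must fetch the data from the block device and insert them. */
static void *cache_lookup(void *key)
{
        int i;

        for (i = 0; i < CACHE_SIZE; i++) {
                if (cache[i] == key) {
                        memmove(&cache[1], &cache[0], i * sizeof(void *));
                        cache[0] = key;
                        return key;
                }
        }
        return NULL;
}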
Implementation

Let us examine how the kernel implements the algorithm just described for the LRU cache.
Data Structures

As the algorithm is not complicated, it requires only relatively simple data structures. The starting point of the implementation is the bh_lru structure, which is defined as follows:

fs/buffer.c
#define BH_LRU_SIZE     8

struct bh_lru {
        struct buffer_head *bhs[BH_LRU_SIZE];
};

static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
It is defined in a C file and not in a header file — as usual, an indication for the rest of the kernel code that the cache data structures should (and, besides, can!) not be addressed directly but by means of the dedicated helper functions discussed below. bhs is an array of pointers to buffer heads and is used as a basis for implementing the LRU algorithm (eight entries are used, as the pre-processor definition shows). The kernel uses DEFINE_PER_CPU to create an instance for each CPU of the system to improve utilization of the CPU caches.
The cache is managed and utilized by two public functions provided by the kernel: lookup_bh_lru checks whether a required entry is present in the cache, and bh_lru_install adds new buffer heads to the cache. The function implementations hold no surprises since they merely implement the algorithm described above.14 All they need do is select the corresponding array for the current CPU at the start of the action using

14 Or, as aptly put by a comment in the kernel code: The LRU management algorithm is dopey-but-simple. Sorry.
fs/buffer.c
lru = &__get_cpu_var(bh_lrus);
Caution: If lookup_bh_lru fails, the desired buffer is not automatically read from the block device. This is done by the following interface functions.
Interface Functions

Normal kernel code does not generally come into contact with either lookup_bh_lru or bh_lru_install because these functions are encapsulated. The kernel provides generic routines for accessing individual blocks, and these automatically cover the buffer cache, thus rendering explicit interaction with the cache unnecessary. These routines include __getblk and __bread, which are implemented in fs/buffer.c. Before discussing their implementation, it is best to describe not only what the two functions have in common, but also how they differ. First, they both require the same parameters:

fs/buffer.c
struct buffer_head *
__getblk(struct block_device *bdev, sector_t block, int size)
{
        ...
}

struct buffer_head *
__bread(struct block_device *bdev, sector_t block, int size)
{
        ...
}
A data block is uniquely identified by the block_device instance of the desired block device, the sector number (of type sector_t), and the block size. The differences relate to the goals of the two functions. __bread guarantees that an up-to-date buffer is returned; this entails, if necessary, read access to the underlying block device. Invocations of __getblk always return a non-NULL pointer (i.e., a buffer head).15 If the data of the desired buffer already reside in memory, the data are returned, but there is no guarantee as to what their state will be — in contrast to __bread, it need not be up-to-date. In the second possible scenario, the buffer does not yet exist in memory. In this case, __getblk ensures that the memory space required for the data are reserved and that the buffer head is inserted in the LRU cache.
__getblk always returns a buffer head with the result that even senseless requests — for non-existent sector addresses — are processed.

15 There is one exception. The function returns a NULL pointer if the desired block size is less than 512 bytes, larger than a page, or not a multiple of the hardware sector size of the underlying block device. However, a stack dump is also output at the same time because an invalid block size is interpreted as a kernel bug.
The function __getblk

Figure 16-9 shows the code flow diagram for __getblk (this function is discussed first because it is invoked by __bread).

Figure 16-9: Code flow diagram for __getblk. (__getblk calls __find_get_block, which consults lookup_bh_lru and, on a cache miss, __find_get_block_slow followed by bh_lru_install and touch_buffer; if a null pointer results, __getblk_slow is used instead.)

As the code flow diagram shows, there are two possible options when __getblk executes. __find_get_block is invoked to find the desired buffer using the method described below. A buffer_head instance is returned if the search is successful. Otherwise, the task is delegated to __getblk_slow. As the name suggests, __getblk_slow yields the desired buffer but takes longer than __find_get_block. However, this function is able to guarantee that a suitable buffer_head instance will always be returned and that the space needed for the data will be reserved.
As already noted, the fact that a buffer head is returned does not mean that the contents of the data area are correct. But because the buffer head itself is correct, it is inserted in the buffer cache at the end of the function by means of bh_lru_install, and touch_buffer calls the mark_page_accessed method (see Chapter 18) for the page associated with the buffer.
The key issue is obviously the difference between __find_get_block and __getblk_slow, where the main work of __getblk takes place. The familiar lookup_bh_lru function is invoked at the start of __find_get_block to check whether the required block is already present in the LRU cache. If not, other means must be applied to continue the search. __find_get_block_slow attempts to find the data in the page cache, and this can produce two different results:

❑ A null pointer is returned if the data are not in the page cache, or if they are in the page cache but the page does not have any attached buffers.

❑ The pointer to the desired buffer head is returned if the data are in the page cache and the page also has attached buffers.
If a buffer head is found, __find_get_block invokes the bh_lru_install function to add it to the cache. The kernel returns to __getblk after touch_buffer has been invoked to mark the page associated with the buffer using mark_page_accessed (see Chapter 18). The second code path implemented in __getblk_slow must be entered if __find_get_block returns a null pointer. This path guarantees that at least the space required for the buffer head and data element is reserved. Its implementation is relatively short:

fs/buffer.c
static struct buffer_head *
__getblk_slow(struct block_device *bdev, sector_t block, int size)
{
        ...
        for (;;) {
                struct buffer_head *bh;
                int ret;

                bh = __find_get_block(bdev, block, size);
                if (bh)
                        return bh;

                ret = grow_buffers(bdev, block, size);
                if (ret < 0)
                        return NULL;
                if (ret == 0)
                        free_more_memory();
        }
}
Surprisingly, the first thing __getblk_slow does is to invoke __find_get_block — the function that has just failed. If a buffer head is found, it is returned by the function. Of course, the function only succeeds if another CPU has installed the desired buffer and created the corresponding data structures in memory in the meantime. Although this is admittedly not very likely, it still has to be checked. This rather strange behavior becomes clear when we examine the exact course of the function. It is, in fact, an endless loop that repeatedly tries to read the buffer using __find_get_block. Obviously, the code doesn’t content itself with doing nothing if the function fails. The kernel uses grow_buffers to try to reserve memory for the buffer head and buffer data and to add this space to the kernel data structures:
1. If this is successful, __find_get_block is invoked again, and this returns the desired buffer_head.

2. If the call to grow_buffers returns a negative result, this means that the block lies outside the possible maximum addressable page cache range, and the loop is aborted because the desired block does not physically exist.

3. If grow_buffers returns 0, then not enough memory was available to grow the buffers, and the subsequent call to free_more_memory tries to fix this condition by trying to release more RAM as described in Chapters 17 and 18.
This is why the functions are packed into an endless loop — the kernel tries again and again to create the data structures in memory until it finally succeeds.
The implementation of grow_buffers is not especially lengthy. A few correctness checks are carried out before work is delegated to the grow_dev_page function, whose code flow diagram is shown in Figure 16-10.

Figure 16-10: Code flow diagram for grow_dev_page. (The function calls find_or_create_page, creates buffers if required, and attaches them with link_dev_buffers before init_page_buffers fills in the management data.)

The function first invokes find_or_create_page to find a suitable page or generate a new page to hold the buffer data. Of course, this and other allocation operations will fail if insufficient memory is available. In this case, the function returns a null pointer, thus causing the complete cycle to be repeated in __getblk_slow until sufficient memory is available. This also applies for the other functions that are invoked, so there is no need to mention them explicitly. If the page is already associated with a buffer of the correct size, the remaining buffer data (b_bdev and b_blocknr) are modified by init_page_buffers. grow_dev_page then has nothing else to do and can be exited. Otherwise, alloc_page_buffers generates a new set of buffers that can be attached to the page using the familiar link_dev_buffers function. init_page_buffers is invoked to fill the state (b_state) and the management data (b_bdev, b_blocknr) of the buffer heads.
The function __bread

In contrast to the methods just described, __bread ensures that an up-to-date buffer is returned. The function is not difficult to implement as it builds on __getblk:

fs/buffer.c
struct buffer_head *
__bread(struct block_device *bdev, sector_t block, int size)
{
        struct buffer_head *bh = __getblk(bdev, block, size);

        if (likely(bh) && !buffer_uptodate(bh))
                bh = __bread_slow(bh);
        return bh;
}
The first action is to invoke the __getblk routine to make sure that memory is present for the buffer head and data contents. A pointer to the buffer is returned if the buffer is already up-to-date.
If the buffer data are not up-to-date, the rest of the work is delegated to __bread_slow — in other words, to the slow path, as the name indicates. Essentially, this submits a request to the block layer to physically read the data and waits for the operation to complete. The buffer — which is now guaranteed to be filled and current — is then returned.
Use in the Filesystem

When is it necessary to read individual blocks? There are not too many points in the kernel where this must be done, but these are nevertheless of great importance. Filesystems in particular make use of the routines described above when reading superblocks or management blocks. The kernel defines two functions to simplify the work of filesystems with individual blocks:
static inline struct buffer_head *
sb_bread(struct super_block *sb, sector_t block)
{
        return __bread(sb->s_bdev, block, sb->s_blocksize);
}

static inline struct buffer_head *
sb_getblk(struct super_block *sb, sector_t block)
{
        return __getblk(sb->s_bdev, block, sb->s_blocksize);
}
As the code shows, the routines read specific filesystem blocks identified by a superblock and a block number; the block size is taken from the superblock.
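A hedged example of the typical usage pattern in filesystem code is shown below; the myfs_ name is invented, but the sb_bread/brelse pairing is the standard idiom:

#include <linux/buffer_head.h>
#include <linux/errno.h>
#include <linux/fs.h>

static int myfs_read_metadata_block(struct super_block *sb, sector_t block_nr)
{
        struct buffer_head *bh = sb_bread(sb, block_nr);

        if (!bh)
                return -EIO;

        /* bh->b_data now holds sb->s_blocksize bytes of up-to-date data. */
        /* ... interpret the block contents here ... */

        brelse(bh);     /* drop the reference acquired by sb_bread */
        return 0;
}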
16.6 Summary
Reading data from external storage devices like hard disks is much slower than reading data from RAM, so Linux uses caching mechanisms to keep data in RAM once they have been read in, and accesses them from there. Page frames are the natural units on which the page cache operates, and I have discussed in this chapter how the kernel keeps track of which portions of a block device are cached in RAM. You have been introduced to the concept of address spaces which allow for linking cached data with their source, and how address spaces are manipulated and queried. Following that, I have examined the algorithms employed by Linux to handle the technical details of bringing content into the page cache. Traditionally, Unix caches used smaller units than complete pages, and this technique survived until today in the form of the buffer cache. While the main caching load is handled by the page cache, there are still some users of the buffer cache, and you have therefore also been introduced to the corresponding mechanisms. Using RAM to cache data read from a disk is one aspect of the interaction between RAM and disks, but there’s also another side to the story: The kernel must also take care of synchronizing modified data in RAM back to the persistent storage on disk; the next chapter will introduce you to the corresponding mechanisms.
Data Synchronization

RAM and hard disk space are mutually interchangeable to a good extent. If a large amount of RAM is free, the kernel uses part of it to buffer block device data. Conversely, disk space is used to swap data out from memory if too little RAM is available. Both have one thing in common — data are always manipulated in RAM before being written back (or flushed) to disk at some later time to make changes persistent. In this context, block storage devices are often referred to as the backing store of RAM. Linux provides a variety of caching methods as discussed extensively in Chapter 16. However, what was not discussed in that chapter is how data are written back from the caches. Again, the kernel provides several options that are grouped into two categories:
1. Background threads repeatedly check the state of system memory and write data back at periodic intervals.

2. Explicit flushing is performed when there are too many dirty pages in system caches and the kernel needs clean pages.
This chapter discusses these techniques.
17.1 Overview
There is a clear relationship between flushing, swapping, and releasing pages. Not only the state of memory pages but also the size of free memory needs checking regularly. When this is done, unused or seldom used pages are swapped out automatically, but not before the data they hold have been synchronized with the backing store to prevent data loss. In the case of dynamically generated pages, the system swap areas act as the backing stores. The backing stores for pages mapped from files are the corresponding sections in the underlying filesystems. If there is an acute scarcity of memory, flushing of dirty data must be enforced in order to obtain clean pages.
Synchronization between memory/cache and backing store is split into two conceptually different parts:

❑ Policy routines control when data are exchanged. System administrators can set various parameters to help the kernel decide when to exchange data as a function of system load.

❑ The technical implementation deals with the hardware-related details of synchronization between cache and backing store and ensures that the instructions issued by the policy routines are carried out.

Synchronization and swapping must not be confused with each other. Whereas synchronization simply aligns the data held in RAM and in the backing store, swapping results in the flushing of data from RAM to free space for higher-priority items. Before data are cleared from RAM, they are synchronized with the data in the associated backing store.
The mechanisms for flushing data are triggered for different reasons and at different times:

❑ Periodic kernel threads scan the lists of dirty pages and pick some to be written back based on the time at which they became dirty. If the system is not too busy with write operations, there is an acceptable ratio between the number of dirty pages and the load imposed on the system by the hard disk access operations needed to flush the pages.

❑ If there are too many dirty pages in the system as a result, for example, of a massive write operation, the kernel triggers further mechanisms to synchronize pages with the backing store until the number of dirty pages returns to an acceptable level. What is meant by "too many dirty pages" and "acceptable level" is a moot point, discussed below.

❑ Various components of the kernel require that data must be synchronized when a special event has happened, for instance, when a filesystem is re-mounted.
The first two mechanisms are implemented by means of the kernel thread pdflush, which executes the synchronization code, while the third alternative can be triggered from many points in the kernel. Since the implementation of data synchronization consists of an unusually large number of interconnected functions, an overview of what lies ahead of us precedes the detailed discussion. Figure 17-1 shows the dependence among the functions that constitute the implementation. The figure is not a proper code flow diagram, but just shows how the functions are related to each other and which code paths are possible. The diagram concentrates on synchronization operations originating from the pdflush thread, system calls, and explicit requests from filesystem-related kernel components. The kernel can start to synchronize data from various different places, but all paths save one end up in sync_sb_inodes. The function is responsible for synchronizing all dirty inodes belonging to a given superblock, and writeback_single_inode is used for each inode. Both the sync system call and numerous generic kernel layers (like the partition code or the block layer) make use of this possibility. On the other hand, the need to synchronize the dirty inodes of all superblocks in the system can also arise. This is especially required for periodic and forced writeback. When dirtying data in filesystem code, the kernel additionally ensures that the number of dirty pages does not get out of hand by starting synchronization before this happens.
Figure 17-1: Overview of some functions involved in data synchronization. (Flushing synchronization via pdflush, that is, wb_kupdate for periodic writeback, background_writeout for forced writeback, and balance_dirty_pages when pages were dirtied, ends up in writeback_inodes, which calls sync_sb_inodes for all superblocks. Data integrity synchronization, used by the block layer, partition handling, the fsync system call, and filesystems, goes through sys_sync, __fsync_super, __sync_inodes, sync_inodes_sb, and sync_blockdev and reaches sync_sb_inodes for a single superblock. sync_sb_inodes in turn uses writeback_single_inode and __sync_single_inode, which can place inodes on s_io and s_more_io, can wait for an inode to become unlocked or for writeback to be completed, and finally call write_inode.)
Synchronizing all dirty inodes of a superblock is often much too coarse grained for filesystems. They often require synchronizing a single dirty inode and thus use writeback_single_inode directly. Even if the synchronization implementation is centered around inodes, this does not imply that the mechanisms work only for data contained in mounted filesystems. Recall that raw block devices are represented by inodes via the bdev pseudo-filesystem as discussed in Section 10.2.4. The synchronization methods therefore also affect raw block devices in the same way as regular filesystem objects — good news for everyone who wants to access data directly.

One remark on terminology: When I talk about inode synchronization in the following, I always mean synchronization of both the inode metadata and the raw data managed by the inode. For regular files, this means that the synchronization code aims to transfer both the metadata (time stamps, attributes, and the like) and the contents of the file to the underlying block device.
17.2 The pdflush Mechanism
The pdflush mechanism is implemented in a single file: mm/pdflush.c. This contrasts with the fragmented implementation of the synchronization mechanisms in earlier versions.
pdflush is started with the usual kernel thread mechanisms:

mm/pdflush.c
static void start_one_pdflush_thread(void)
{
        kthread_run(pdflush, NULL, "pdflush");
}

start_one_pdflush_thread starts a single pdflush thread — however, the kernel generally uses several threads at the same time, as you will see below. It should be noted that a specific pdflush thread is not always responsible for the same block device. Thread allocation may vary over time simply because the number of threads is not constant and differs according to system load. In fact, the kernel starts the specific number of threads defined in MIN_PDFLUSH_THREADS when it initializes the pdflush subsystem. Typically, this number is 2 so that in a normally loaded system, two active instances of pdflush appear in the task list displayed by ps:

wolfgang@meitner> ps fax
    2 ?        S<     0:00 [kthreadd]
...
  206 ?        S      0:00  _ [pdflush]
  207 ?        S      0:00  _ [pdflush]
...
There is a lower and an upper limit to the number of threads. MAX_PDFLUSH_THREADS specifies the maximum number of pdflush instances, typically 8. The number of concurrent threads is held in the nr_pdflush_threads global variable, but no distinction is made as to whether the threads are currently active or sleeping. The current value is visible to userspace in /proc/sys/vm/nr_pdflush_threads.

The policy for when to create and destroy pdflush threads is simple. The kernel creates a new thread if no idle thread has been available for 1 second. In contrast, a thread is destroyed if it has been idle for more than 1 second. The upper and lower limits on the number of concurrent pdflush threads defined in MIN_PDFLUSH_THREADS (2) and MAX_PDFLUSH_THREADS (8) are always obeyed.

Why is more than one thread required? Modern systems are typically equipped with more than one block device. If many dirty pages exist in the system, it is the kernel's job to keep these devices as busy as possible with writing back data. Queues of different block devices are independent of each other, so data can be written in parallel. Data transfer rates are mainly limited by I/O bandwidth, not CPU power, on current hardware. The connection between pdflush threads and writeback queues is summarized in Figure 17-2. The figure shows that a dynamically varying number of pdflush threads feeds the writeback queues with data that must be synchronized with the underlying block devices. Notice that a block device may have more than one queue that can transfer data, and that a pdflush thread may either serve all queues or just a specific one.

Former kernel versions only employed a single flushing daemon (which was then called bdflush), but this led to a performance problem: If one block device queue was congested because too many writeback operations were pending, other queues for different devices could not be fed with new data anymore. They remained idle, which can be a good thing on a summer vacation, but certainly not for block devices if there is work to do. This problem is solved by the dynamic creation and destruction of pdflush kernel threads, which allows for keeping many queues busy in parallel.
Figure 17-2: Overview of the pdflush mechanism.
17.3 Starting a New Thread
The pdflush mechanism consists of two central components — a data structure to describe the work of the thread and a strategy routine to help perform the work. The data structure is defined as follows: mm/pdflush.c
struct pdflush_work {
        struct task_struct *who;        /* The thread */
        void (*fn)(unsigned long);      /* A callback function */
        unsigned long arg0;             /* An argument to the callback */
        struct list_head list;          /* On pdflush_list, when idle */
        unsigned long when_i_went_to_sleep;
};
As usual, the fact that the data structure is defined in a C file rather than in a header file indicates that it is intended to be used only by the pdflush implementation itself. Generic code uses other mechanisms, examined below, to access the kernel's synchronization capabilities:

❑ who is a pointer to the task_struct instance of the kernel thread that represents the specific pdflush instance in the process table.

❑ Several instances of pdflush_work can be grouped together in a doubly linked standard list using the list list head. The kernel uses the global variable pdflush_list (defined in mm/pdflush.c) to draw up a list of the work still to be done.

❑ The extraordinarily long when_i_went_to_sleep element stores the time in jiffies when the thread last went to sleep. This value is used to remove superfluous pdflush threads from the system (i.e., threads that are still in memory but have been idle for a longer period).

❑ The fn function pointer (in conjunction with arg0) is the backbone of the structure. It holds the function in which the actual work to be done is implemented; arg0 is passed as an argument when the function is invoked. By using different function pointers for fn, the kernel is able to incorporate a variety of synchronization routines in the pdflush framework so that the right routine can be selected for the job in hand.
17.4 Thread Initialization
pdflush is used as a work procedure for kernel threads. Once generated, pdflush threads go to sleep and wait until other parts of the kernel assign them tasks that are described in pdflush_work. Consequently, the number of pdflush threads need not match the number of tasks to be performed. The generated
threads are on call and simply wait until the kernel decides to give them work to do. The code flow diagram in Figure 17-3 shows how pdflush works.
Figure 17-3: Code flow diagram for pdflush.

The start routine for generating a new pdflush thread is pdflush, but control flow is passed immediately to __pdflush.1 In __pdflush, the worker function of the pdflush_work instance is set to NULL because the thread has not been given a particular job to do. The global counter nr_pdflush_threads must also be incremented by 1 because a new pdflush thread has now been added to the system. The thread then goes into an endless loop in which the following actions are performed:

❑ The pdflush_work instance of the thread is added to the global list pdflush_list (reminder: the kernel is able to identify the thread by means of the who element).

❑ when_i_went_to_sleep is set to the current system time in jiffies to remember when the thread started sleeping.

❑ schedule is invoked — this is the most important action. Because the status of the thread was previously set to TASK_INTERRUPTIBLE, the thread now goes to sleep until woken by an external event.

If the kernel requires a worker thread, it sets the worker function of a pdflush_work instance in the global list and wakes the corresponding thread, which resumes work immediately after schedule — but now with the fn worker function.

1 All that happens in pdflush is that an instance of pdflush_work is generated; a pointer to it is passed to __pdflush as a parameter. This is to stop the compiler from performing unfortunate optimizations on this variable. Additionally, the process priority is set to 0, and the allowed CPUs are limited to the ones granted for the parent kthreadd.
❑ The worker function is invoked with the stored argument so that it can set about its task.

❑ Upon termination of the worker function, the kernel checks whether there are too many or too few worker threads. If no idle worker thread was available for longer than 1 second,2 start_one_pdflush_thread generates a new thread. If the sleepiest thread (which is at the end of the pdflush_list list) has been asleep for more than 1 second, the current thread is removed from the system by exiting the endless loop. In this case, the only clean-up action required besides handling locking is to decrement nr_pdflush_threads — one pdflush thread less is available. A simplified sketch of the whole loop is shown below.
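The sketch omits the locking via pdflush_lock, the freezer handling, and a few robustness checks present in the real mm/pdflush.c code; it is meant to illustrate the structure of the loop, not to reproduce it verbatim:

static int __pdflush(struct pdflush_work *my_work)
{
        my_work->fn = NULL;
        my_work->who = current;
        INIT_LIST_HEAD(&my_work->list);

        nr_pdflush_threads++;
        for (;;) {
                struct pdflush_work *pdf;

                set_current_state(TASK_INTERRUPTIBLE);
                list_move(&my_work->list, &pdflush_list);  /* mark thread as idle */
                my_work->when_i_went_to_sleep = jiffies;
                schedule();                                /* sleep until woken */

                if (my_work->fn)                           /* guard against spurious wakeups */
                        (*my_work->fn)(my_work->arg0);     /* perform the assigned work */
                my_work->fn = NULL;

                /* Thread creation: has no idle thread been available for
                   more than one second? */
                if (jiffies - last_empty_jifs > 1 * HZ &&
                    list_empty(&pdflush_list) &&
                    nr_pdflush_threads < MAX_PDFLUSH_THREADS)
                        start_one_pdflush_thread();

                /* Thread destruction: has the sleepiest thread been idle
                   for more than one second? */
                if (list_empty(&pdflush_list))
                        continue;
                if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
                        continue;
                pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
                if (jiffies - pdf->when_i_went_to_sleep > 1 * HZ)
                        break;                             /* leave the endless loop */
        }
        nr_pdflush_threads--;
        return 0;
}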
17.5 Performing Actual Work
pdflush_operation assigns a worker function to a pdflush thread and wakes it up. If no thread is
available, −1 is returned; otherwise, a thread is removed from the list and woken. To simplify matters, we have omitted the required locking in the code: mm/pdflush.c
int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
{
        unsigned long flags;
        int ret = 0;

        if (list_empty(&pdflush_list)) {
                ret = -1;
        } else {
                struct pdflush_work *pdf;

                pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
                list_del_init(&pdf->list);
                if (list_empty(&pdflush_list))
                        last_empty_jifs = jiffies;
                pdf->fn = fn;
                pdf->arg0 = arg0;
                wake_up_process(pdf->who);
        }

        return ret;
}

pdflush_operation accepts two arguments that specify the worker function and its argument.
If the list pdflush_list is empty and thus no pdflush daemon can be awoken, an error code is returned. If a sleeping pdflush instance is in the queue, it is removed and is no longer available to any other part of the kernel. The values for the worker function and argument are assigned to the corresponding fields of pdflush_work, and immediately thereafter the thread is woken with wake_up_process. Thanks to the who element in pdflush_work, the kernel knows which process is meant.

To ensure that there are always enough worker threads, the kernel checks whether the pdflush_list list is empty after removing the current instance, but before waking the thread. If it is, last_empty_jifs is set to the current system time. When a thread terminates, the kernel uses this information to check the period during which no surplus threads were available — it can then start a new thread as described above.

2 The time the pdflush_list list was last empty is noted in the global variable last_empty_jifs.
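To illustrate the interplay, consider the following hypothetical caller (my_flush_fn and trigger_flush are invented names used purely for illustration; real users of the interface include wb_kupdate and background_writeout). It hands a job to the pdflush framework and falls back to doing the work itself if no idle thread is available:

static void my_flush_fn(unsigned long nr_pages)
{
        /* ... write back up to nr_pages dirty pages ... */
}

static void trigger_flush(unsigned long nr_pages)
{
        if (pdflush_operation(my_flush_fn, nr_pages) < 0)
                my_flush_fn(nr_pages);  /* no idle thread available, do the work directly */
}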
17.6 Periodic Flushing
Now that you are familiar with the framework in which the pdflush mechanism operates, let’s move on to describe the routines responsible for the actual synchronization of cache contents with the associated backing store. Recall that two alternatives are available, one periodic and one enforced. First, let’s discuss the periodic writeback mechanism. In earlier kernel versions, a user mode application was used to perform periodic write operations. This application was started at kernel initialization time and invoked a system call at regular intervals to write back dirty pages. In the meantime, this not particularly elegant procedure was replaced with a more modern alternative that does not take the long route via user mode and is therefore not only more efficient but also more aesthetic. What’s left of the earlier method is the name kupdate. The name appears as a component of some functions and is often used to describe the flushing mechanism. Two things are needed to periodically flush dirty cache data: the worker function that is executed with the help of the pdflush mechanism, and code to regularly activate the mechanism.
17.7 Associated Data Structures
The wb_kupdate function in mm/page-writeback.c is responsible for the technical aspects of flushing. It is based on the address space concept (discussed in Chapter 4) that establishes the relationship among RAM, files or inodes, and the underlying block devices.
17.7.1 Page Status

wb_kupdate is based on two data structures that control how it functions. One of these structures is the global array vm_stat, which enables the status of all system memory pages to be queried:

mm/vmstat.c
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
The array holds a comprehensive collection of statistical counters that describe the status of the system's memory pages. The entries are simple, elementary numbers and therefore indicate only how many pages have a specific status. Other means must be devised to find out which pages these are. This issue is discussed below.
The following statistics are collected in vm_stat: <mmzone.h>
enum zone_stat_item {
        /* First 128 byte cacheline (assuming 64 bit words) */
        NR_FREE_PAGES,
        NR_INACTIVE,
        NR_ACTIVE,
        NR_ANON_PAGES,          /* Mapped anonymous pages */
        NR_FILE_MAPPED,         /* pagecache pages mapped into pagetables.
                                   only modified from process context */
        NR_FILE_PAGES,
        NR_FILE_DIRTY,
        NR_WRITEBACK,
        /* Second 128 byte cacheline */
        NR_SLAB_RECLAIMABLE,
        NR_SLAB_UNRECLAIMABLE,
        NR_PAGETABLE,           /* used for pagetables */
        NR_UNSTABLE_NFS,        /* NFS unstable pages */
        NR_BOUNCE,
        NR_VMSCAN_WRITE,
#ifdef CONFIG_NUMA
        /* Omitted: NUMA-specific statistics */
#endif
        NR_VM_ZONE_STAT_ITEMS
};
The meanings of the entries are easy to guess from their names. NR_FILE_DIRTY specifies the number of file-based dirty pages, and NR_WRITEBACK indicates how many are currently being written back. NR_PAGETABLE stores the number of pages used to hold the page tables, and NR_FILE_MAPPED specifies how many pages are mapped by the page table mechanism (only the file-based pages are accounted for; direct kernel mappings are not included). Finally, NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE indicate how many pages are used for the slab cache described in Chapter 3 (despite their name, the constants work also for the slub cache). The remaining entries consider special cases that are not interesting for our purposes. Note that the kernel not only keeps a global array to collect page statistics, but also provides the same information resolved by memory zone: <mmzone.h>
struct zone {
        ...
        /* Zone statistics */
        atomic_long_t           vm_stat[NR_VM_ZONE_STAT_ITEMS];
        ...
}
It is the job of memory management to keep the global and zone-specific arrays up-to-date. Of prime interest at this point is how the information is used. To gain a status overview of the entire system, it is not necessary to combine the per-zone data manually: The kernel provides the auxiliary function global_page_state, which delivers the current value of a particular field of the global vm_stat array:
unsigned long global_page_state(enum zone_stat_item item)
Because the vm_stat arrays and their entries are not protected by a locking mechanism, it may happen that the data change while global_page_state is running. The result returned is not exact but an approximation. This is not a problem because the figures are simply a general indication of how effectively work is distributed. Minor differences between real data and returned data are acceptable.
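As an illustration, the periodic flushing code discussed below obtains an estimate of the number of dirty, file-backed pages roughly as follows:

long nr_dirty = global_page_state(NR_FILE_DIRTY) +
                global_page_state(NR_UNSTABLE_NFS);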
17.7.2 Writeback Control

A second data structure holds the various parameters that control writeback of dirty pages. Upper layers use it to pass information about how writeback is to be performed to the lower layers (top to bottom in Figure 17-1). However, the structure also allows for propagating status information in the reverse direction (bottom to top):

<writeback.h>
/* A control structure which tells the writeback code what to do. */
struct writeback_control {
        struct backing_dev_info *bdi;      /* If !NULL, only write back this queue */
        enum writeback_sync_modes sync_mode;
        unsigned long *older_than_this;    /* If !NULL, only write back inodes
                                              older than this */
        long nr_to_write;                  /* Write this many pages, and decrement
                                              this for each page written */
        long pages_skipped;                /* Pages which were not written */

        loff_t range_start;
        loff_t range_end;

        unsigned nonblocking:1;            /* Don't get stuck on request queues */
        unsigned encountered_congestion:1; /* An output: a queue is full */
        unsigned for_kupdate:1;            /* A kupdate writeback */
        unsigned for_reclaim:1;            /* Invoked from the page allocator */
        unsigned for_writepages:1;         /* This is a writepages() call */
        unsigned range_cyclic:1;           /* range_start is cyclic */
};
The meanings of the structure elements are as follows:

❑ bdi points to a structure of type backing_dev_info, which summarizes information on the underlying storage medium. This structure is discussed briefly in Chapter 16. Two things interest us here. First, the structure provides a variable to hold the status of the writeback queue (this means, e.g., that congestion can be signaled if there are too many write requests), and second, it allows RAM-based filesystems that do not have a (block device) backing store to be labeled — writeback operations to systems of this kind make no sense.

❑ sync_mode distinguishes between three different synchronization modes:

<writeback.h>
enum writeback_sync_modes {
        WB_SYNC_NONE,   /* Don't wait on anything */
        WB_SYNC_ALL,    /* Wait on every mapping */
        WB_SYNC_HOLD,   /* Hold the inode on sb_dirty for sys_sync() */
};
To synchronize data, the kernel needs to pass a corresponding write request to the underlying block device. Requests to block devices are asynchronous by nature. If the kernel wants to ensure that the data have safely reached the device, it needs to wait for completion after the request has been issued. This behavior is mandated with WB_SYNC_ALL. Waiting for writeback to complete is performed in __sync_single_inode, discussed below; recall from Figure 17-1 that it sits at the bottom of the mechanism, where it is responsible for delegating synchronization of a single inode to the filesystem-specific methods. All functions that wait on inodes because WB_SYNC_ALL is set are marked in Figure 17-1. Notice that writeback with WB_SYNC_ALL set is referred to as data integrity writeback. If a system crash happens immediately after writeback in this mode has finished, no data are lost because everything is synchronized with the underlying block devices.

If WB_SYNC_NONE is used, the kernel will send the request, but continue with the remaining synchronization work immediately afterward. This mode is also referred to as flushing writeback.

WB_SYNC_HOLD is a special form used for the sync system call that works similarly to WB_SYNC_NONE. The exact differences are subtle and are discussed in Section 17.15.
❑ When the kernel performs writeback, it must decide which dirty cache data need to be synchronized with the backing store. It uses the older_than_this and nr_to_write elements for this purpose. Data are written back if they have been dirty for longer than specified by older_than_this. older_than_this is defined as a pointer type, which is unusual for a single long value. Its numeric value, which can be obtained by appropriate de-referencing, is of interest. If the pointer is NULL, then age checking is not performed, and all objects are synchronized irrespective of when they became dirty. Setting nr_to_write to 0 likewise disables any upper limit on the number of pages that are supposed to be written back.

❑ nr_to_write can restrict the maximal number of pages that should be written back. The upper bound for this is given by MAX_WRITEBACK_PAGES, which is usually set to 1,024.
❑ If pages were selected to be written back, functions from lower layers perform the required operations. However, they can fail for various reasons, for instance, because a page is locked by some other part of the kernel. The number of skipped pages can be reported to higher layers via the counter pages_skipped.

❑ The nonblocking flag specifies whether writeback queues block or not in the event of congestion (more pending write operations than can be effectively satisfied). If they are blocked, the kernel waits until the queue is free. If not, it relinquishes control. The write operation is then resumed later.

❑ encountered_congestion is also a flag to signal to higher layers that congestion has occurred during data writeback. It is a Boolean variable and accepts the values 1 or 0.

❑ for_kupdate is set to 1 if the write request was issued by the periodic mechanism. Otherwise, its value is 0. for_reclaim and for_writepages are used in a similar manner: They are set if the writeback operation was initiated from memory reclaim or from the do_writepages function, respectively.

❑ If range_cyclic is set to 0, the writeback mechanism is restricted to operate on the range given by range_start and range_end. The limits refer to the mapping for which the writeback was initiated. If range_cyclic is set to 1, the kernel may iterate many times over the pages associated with a mapping, thus the name of the element.
17.7.3 Adjustable Parameters

The kernel supports the fine-tuning of synchronization by means of parameters. These can be set by the administrator to help the kernel assess system usage and loading. The sysctl mechanism described in Chapter 10 is used for this purpose, which means that the proc filesystem is the natural interface to manipulate the parameters — they are located in /proc/sys/vm/. Four parameters can be set, all of which are defined in mm/page-writeback.c:3

❑ dirty_background_ratio specifies the percentage of dirty pages at which pdflush starts periodic flushing in the background. The default value is 10 so that the update mechanism kicks in when more than 10 percent of the pages have changed as compared to the backing store.

❑ vm_dirty_ratio (the corresponding sysctl is dirty_ratio) specifies the percentage of dirty pages (with respect to non-HIGHMEM memory) at which data flushing will be started. The default value is 40.

Why is high memory excluded from the percentage? Older kernel versions before 2.6.20 did not, in fact, distinguish between high and normal memory. However, if the ratio between high memory and low memory is too large (i.e., if main memory is much more than 4 GiB on 32-bit processors), the default settings for dirty_background_ratio and dirty_ratio were required to be scaled back slightly when the writeback mechanism was initialized. Retaining the default values would have necessitated an excessively large number of buffer_head instances, and these would have had to be held in valuable low memory. By excluding high memory from the calculation, the kernel does not deal with scaling anymore, which simplifies matters somewhat.
❑ The interval between two invocations of the periodic flushing routine is defined in dirty_writeback_interval (the corresponding sysctl is dirty_writeback_centisecs). The interval is specified in hundredths of a second (also called centiseconds in the sources). The default is 500, which equates to an interval of 5 seconds between invocations. On systems where a very large number of write operations are performed, lowering this value can have a positive effect, but increasing the value on systems with very few write operations delivers only small performance gains.

❑ The maximum period during which a page may remain dirty is specified in dirty_expire_interval (the sysctl is dirty_expire_centisecs). Again, the period is expressed in hundredths of a second. The default value is 3,000, which means that a page may remain dirty for a maximum of 30 seconds before it is written back at the next opportunity.
17.8 Central Control
The key periodic flushing component is the wb_kupdate procedure defined in mm/page-writeback.c. It is responsible for dispatching lower-level routines to find the dirty pages in memory and synchronize them with the underlying block device. As usual, our description is based on a code flow diagram as shown in Figure 17-4. The superblocks are synchronized right at the start of the function because this is essential to ensure filesystem integrity. Incorrect superblock data result in consistency errors throughout the filesystem and, in most cases, lead to loss of at least part of the data. This is why sync_supers, whose purpose is described in more detail in Section 17.9, is invoked first.

3 The names of the sysctls differ from the variable names for historical reasons.
Figure 17-4: Code flow diagram of wb_kupdate.
Thereafter, "normal" dirty data are written back from the page cache. The kernel invokes the global_page_state function to get a picture of the current status of all system pages. The key item of information is the number of dirty pages held in the NR_FILE_DIRTY element of the vm_stat array.
The function then goes into a loop whose code is repeatedly executed until there are no dirty pages left in the system. After a writeback_control instance has been set up to initiate non-blocking writeback of at most MAX_WRITEBACK_PAGES pages (normally 1,024), writeback_inodes writes back the data that can be reached via the inodes. This is quite a lengthy function, so it is discussed separately in greater detail in Section 17.10, but a couple of salient points are listed below:

❑ Not all dirty pages are written back in one go — the number is restricted to MAX_WRITEBACK_PAGES. Because inodes are locked during writeback, smaller groups of dirty pages are processed to prevent overly long blocking of an inode, which would adversely affect system performance.

❑ The number of pages actually written back is communicated between wb_kupdate and writeback_inodes via the nr_to_write element of the writeback_control instance: after each writeback_inodes call, the number of pages written back — which are therefore no longer dirty — has been subtracted from it.
When writeback_inodes terminates, the kernel repeats the loop until there are no more dirty pages in the system. The congestion_wait function is invoked if queue congestion occurs (the kernel detects this by means of the set encountered_congestion element of the writeback_control instance). The function waits until congestion has eased and then continues the loop as normal. Section 17.11 takes a closer look at how the kernel defines congestion.

Once the loop has finished, wb_kupdate makes sure that the kernel invokes it again after the interval defined by dirty_writeback_interval in order to guarantee periodic background flushing. Low-resolution kernel timers as discussed in Chapter 15 are used for this purpose — in this particular case, the timer is implemented by means of the global timer_list instance wb_timer (defined in mm/page-writeback.c).
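Putting the pieces together, the central loop can be sketched as follows. This is a simplified rendition of wb_kupdate; the exact page accounting and the timer handling of the original are abbreviated:

static void wb_kupdate(unsigned long arg)
{
        unsigned long oldest_jif = jiffies - dirty_expire_interval;
        long nr_to_write;
        struct writeback_control wbc = {
                .bdi             = NULL,            /* consider all queues */
                .sync_mode       = WB_SYNC_NONE,    /* flushing writeback */
                .older_than_this = &oldest_jif,     /* only long-dirty data */
                .nonblocking     = 1,
                .for_kupdate     = 1,
                .range_cyclic    = 1,
        };

        sync_supers();                              /* superblocks first */

        nr_to_write = global_page_state(NR_FILE_DIRTY) +
                      global_page_state(NR_UNSTABLE_NFS);

        while (nr_to_write > 0) {
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
                if (wbc.nr_to_write > 0) {
                        if (wbc.encountered_congestion)
                                congestion_wait(WRITE, HZ/10);
                        else
                                break;              /* nothing old left to write */
                }
                nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
        }

        /* Finally, wb_timer is re-armed for the next periodic run (omitted). */
}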
Usually, the interval between two calls of the wb_kupdate function is the value specified in dirty_writeback_centisecs. However, a special situation arises if wb_kupdate takes longer than the time specified in dirty_writeback_centisecs. In this case, the next wb_kupdate call is postponed until 1 second after the end of the current wb_kupdate call. This also differs from the normal situation because the interval is then not calculated as the time between the start of two successive calls but as the time between the end of one call and the start of the next.

The ball is set rolling when the synchronization layer is initialized in page_writeback_init, where the kernel first starts the timer. Initial values for the wb_timer variable — primarily the wb_timer_fn callback function that is invoked when the timer expires — are set statically when the variable is declared in mm/page-writeback.c. Logically, the timer expiry time changes over time and is reset at the end of each wb_kupdate call, as just described.

The structure of the periodically invoked wb_timer_fn function is very simple: It consists only of a pdflush_operation call that dispatches wb_kupdate. At this point, it is not necessary to reinitialize the timer because this is done in wb_kupdate. The timer must be reset in one situation only — if no pdflush thread is available, the next wb_timer_fn call is postponed by 1 second by the function itself. This ensures that wb_kupdate is invoked regularly to synchronize cache data with block device data, even if the pdflush subsystem is heavily loaded.
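The timer callback itself is thus little more than the following (close to the actual code in mm/page-writeback.c):

static void wb_timer_fn(unsigned long unused)
{
        /* Hand the periodic flushing job to a pdflush thread; if none is
           idle, retry in one second. */
        if (pdflush_operation(wb_kupdate, 0) < 0)
                mod_timer(&wb_timer, jiffies + HZ);
}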
17.9 Superblock Synchronization
Superblock data are synchronized by a dedicated function called sync_supers to differentiate it from normal synchronization operations. This and other functions relevant to superblocks are defined in fs/super.c. Its code flow diagram is shown in Figure 17-5.
Figure 17-5: Code flow diagram for sync_supers.

Recall from Chapter 8 that the kernel provides the global list super_blocks to hold the super_block instances of all mounted filesystems. As the code flow diagram shows, the initial task of sync_supers is to iterate over all superblocks and to check whether they are dirty using the s_dirt element of the superblock structure. If they are, the superblock contents are written to the storage medium by write_super. The write_super method included in the superblock-specific super_operations structure does the actual writing. If the pointer is not set, superblock synchronization is not needed for the filesystem (this is the case with virtual and RAM-based filesystems); the proc filesystem, for instance, uses a null pointer. Normal filesystems on block devices, such as Ext3 or Reiserfs, provide appropriate methods (e.g., ext3_write_super) to communicate with the block layer and write back the relevant data.
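Stripped of locking and of the reference counting needed to walk the superblock list safely, sync_supers boils down to the following sketch:

void sync_supers(void)
{
        struct super_block *sb;

        list_for_each_entry(sb, &super_blocks, s_list) {
                if (sb->s_dirt && sb->s_op->write_super)
                        sb->s_op->write_super(sb);  /* filesystem-specific writeback */
        }
}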
17.10 Inode Synchronization
writeback_inodes writes back installed mappings by walking through the system inodes (for the sake of simplicity, this is called inode writeback, but in fact it is not the inode but the dirty data associated with it that are written back). The function shoulders the main burden of synchronization because most system data are provided in the form of address space mappings that make use of inodes. Figure 17-6 illustrates the code flow diagram for writeback_inodes. The function is slightly more complicated in reality because some more details and corner cases need to be handled properly; we consider a simplified variant that nevertheless contains everything that is essential when inodes are written back.

Figure 17-6: Code flow diagram for writeback_inodes.

The function uses the data structures discussed in Chapter 8 to establish a link among superblocks, inodes, and associated data.
17.10.1 Walking the Superblocks

When mappings are written back inode by inode, the initial path taken is via all system superblock instances that represent the mounted filesystems. sync_sb_inodes is invoked for each instance in order to write back the dirty inode data of the superblock, as shown in the code flow diagram in Figure 17-6. Walking the superblock list can be terminated by two different conditions:

1. All superblock instances have been scanned sequentially. The kernel has reached the end of the list, and its work is therefore done.

2. The maximum number of writeback pages specified by the writeback_control instance has been reached. Since writeback requires obtaining various important locks, the system should not be disturbed for too long, so that the inodes become available to other parts of the kernel again.
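Reduced to these two termination conditions, and with locking as well as several corner cases left out, the walk over the superblocks looks roughly like this:

void writeback_inodes(struct writeback_control *wbc)
{
        struct super_block *sb;

        /* Walk the list of mounted filesystems. */
        list_for_each_entry_reverse(sb, &super_blocks, s_list) {
                if (list_empty(&sb->s_dirty) && list_empty(&sb->s_io) &&
                    list_empty(&sb->s_more_io))
                        continue;                /* nothing dirty here */
                sync_sb_inodes(sb, wbc);         /* discussed below */
                if (wbc->nr_to_write <= 0)
                        break;                   /* page budget exhausted */
        }
}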
17.10.2 Examining Superblock Inodes

Once it has been established with the help of the superblock structure that the filesystem contains inodes with dirty data, the kernel hands over to sync_sb_inodes, which synchronizes the dirty superblock inodes. The code flow diagram is in Figure 17-6.
Great effort would be needed if the kernel were to run through the complete list of filesystem inodes each time in order to differentiate between clean and dirty inodes. The kernel therefore implements a far less costly option by placing all dirty inodes on the superblock-specific list super_block->s_dirty. Notice that inodes on the list are reverse time-ordered. The later an inode was dirtied, the closer it is to the tail of the list. Two more list heads are additionally required to perform the synchronization of these inodes. The relevant portion of the super_block structure is as follows:
struct super_block {
        ...
        struct list_head        s_dirty;        /* dirty inodes */
        struct list_head        s_io;           /* parked for writeback */
        struct list_head        s_more_io;      /* parked for more writeback */
        ...
}
All dirty inodes of the superblock are held in the s_dirty list — and are practically served up on a platter to the synchronization mechanism. This list is updated automatically by the relevant code of the VFS layer. s_io keeps all inodes that are currently under consideration by the synchronization code. s_more_io contains inodes that have been selected for synchronization and were placed on s_io, but could not be processed in one go. It might seem simplest for the kernel to put such inodes back on s_io, but this could starve newly dirtied inodes or lead to locking problems, so a second list is introduced. All functions that place inodes on s_io or s_more_io are indicated in Figure 17-1.
The first task of sync_sb_inodes is to fill the s_io list. Two cases must be distinguished:
1. If the synchronization request did not originate from the periodic mechanism, then all inodes on the dirty list are put onto the s_io list. If inodes are present on the s_more_io list, they are placed at the end of the s_io list. The auxiliary function queue_io is provided to perform both list operations. This behavior ensures that inodes from previous synchronization passes still get consideration, but more recently dirtied inodes are preferred. This way, large dirtied files cannot starve smaller files that were dirtied afterward.

2. If the periodic mechanism wb_kupdate has triggered synchronization, the s_io list is only replenished with additional dirty inodes if it is completely empty. Otherwise, the kernel waits until all members of s_io have been written back. There is no particular pressure for the periodic mechanism to write back as many inodes as possible in the shortest amount of time. Instead, it is more important to slowly but surely write out a constant stream of inodes.
If the writeback control parameter specifies an older_than_this criterion, only inodes that have already been dirty for at least the specified period are included in the synchronization process. If the cut-off time stored in this element lies before the time held in the dirtied_when element of the inode — that is, if the inode was dirtied more recently than the cut-off — the requisite condition is not satisfied and the kernel does not move the inode from the dirty list to the s_io list. After the members of the s_io list have been selected, the kernel starts to iterate over the individual elements.
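The age check thus amounts to the following small helper loop, which moves expired inodes from s_dirty to s_io (a slightly simplified version of the corresponding code in fs/fs-writeback.c; older_than_this is the pointer taken from the writeback_control instance):

while (!list_empty(&sb->s_dirty)) {
        struct inode *inode = list_entry(sb->s_dirty.prev,
                                         struct inode, i_list);

        if (older_than_this &&
            time_after(inode->dirtied_when, *older_than_this))
                break;          /* dirtied too recently, and so is the rest of the list */
        list_move(&inode->i_list, &sb->s_io);
}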
Some checks ascertain that the inode is suitable for synchronization before actual writeback is performed:

❑ Purely memory-based filesystems, such as RAM disks, pseudo-filesystems, and other purely virtual filesystems, do not require synchronization with an underlying block device. This is signaled by setting BDI_CAP_NO_WRITEBACK in the backing_dev_info instance that belongs to the filesystem's mapping. If an inode of this type is encountered, processing can be aborted immediately.

However, there is one filesystem whose metadata are purely memory-based and without physical backing store, but that cannot be skipped: the block device pseudo-filesystem bdev. Recall from Chapter 10 that bdev is used to handle access to raw block devices or partitions thereof. An inode is provided for each partition, and access to the raw device is handled via this inode. While the inode metadata are important in memory, it does not make sense to store them anywhere permanently since they are just used to implement a uniform abstraction mechanism. This, however, does not imply that the contents of the block device do not require synchronization: Quite the opposite is true. Modifications to the contents of a block device go through the page cache, and any changes are reflected in the radix tree data structures. The pages must therefore be synchronized with the underlying hardware from time to time like all other pages in the page cache. The block device pseudo-filesystem bdev thus does not set BDI_CAP_NO_WRITEBACK. However, no write_inode method is contained in the associated super_operations, so metadata synchronization is not performed. Data synchronization, on the other hand, runs as for any other filesystem.
❑ If the synchronization queue is congested (the BDI_write_congested bit is set in the status field of the backing_dev_info instance) and non-blocking writeback was selected in writeback_control, the congestion needs to be reported to the higher layers. This is done by setting the encountered_congestion field in the writeback_control instance to 1. If the current inode belongs to a block device, then the auxiliary function requeue_io is used to move the inode from s_io to s_more_io. It is possible that different inodes of a block device are backed by different queues, for instance, if multiple physical devices are combined into a single logical device. The kernel therefore continues to process the other inodes on the s_io list in the hope that they belong to different queues that are not congested. If the current inode, however, stems from a regular filesystem, it can be assumed that all other inodes are backed by the same queue. Since this queue is already congested, it does not make sense to synchronize the other inodes, so the loop iteration is aborted. The unprocessed inodes remain in the s_io list and are dealt with the next time sync_sb_inodes is called.
❑ pdflush can be instructed via writeback_control to focus on a single queue. If a regular filesystem inode that uses a different queue is encountered, processing can be aborted. If the inode represents a block device, processing skips forward to the next inode on the s_io list for the same reason as in the write congestion case.

❑ The current system time in jiffies is held in a local variable at the start of sync_sb_inodes. The kernel now checks whether the inode just processed was marked dirty after the start of sync_sb_inodes. If so, synchronization is aborted in its entirety. The unprocessed inodes are again left on s_io.
❑ A further situation leads to termination of sync_sb_inodes. If a pdflush thread is already in the process of writing back the queue being processed (this is indicated by the BDI_pdflush bit of the status element of backing_dev_info), the current thread lets the running pdflush thread process the queue on its own.
Inode writeback may not be initiated until the kernel has ensured that the above conditions are satisfied. As the code flow diagram in Figure 17-6 shows, the inode is written back using __writeback_single_inode, examined below.

It can happen that writing back does not succeed for all pages that should be written back, for instance, because a page might be locked by another part of the kernel, or connections for network filesystems might be unavailable. In this case, the inode is moved back to the s_dirty list again, possibly updating the dirtied_when field unless the inode has been re-dirtied while it was written out. The kernel will automatically retry to synchronize the data in one of the next synchronization runs. Additionally, the kernel needs to make sure that the inverse time ordering of all inodes on s_dirty is preserved. The auxiliary function redirty_tail takes care of this; a sketch of it is shown after the following list.

The process is repeated until one of the two conditions below is fulfilled:
1. All dirty inodes of the superblock have been written back.

2. The maximum number of page synchronizations (specified in nr_to_write) has been reached. This is necessary to support the unit-by-unit synchronization described above. The remaining inodes in s_io are processed the next time sync_sb_inodes is invoked.
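Here is the announced sketch of redirty_tail: the inode is re-queued at the most recently dirtied end of s_dirty, and its time stamp is refreshed unless keeping the old value already preserves the time ordering of the list (simplified from fs/fs-writeback.c):

static void redirty_tail(struct inode *inode)
{
        struct super_block *sb = inode->i_sb;

        if (!list_empty(&sb->s_dirty)) {
                struct inode *tail_inode;

                /* The inode that was dirtied most recently so far. */
                tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
                if (!time_after_eq(inode->dirtied_when,
                                   tail_inode->dirtied_when))
                        inode->dirtied_when = jiffies;
        }
        list_move(&inode->i_list, &sb->s_dirty);
}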
17.10.3 Writing Back Single Inodes

As noted above, the kernel delegates synchronization of the data associated with an inode to __writeback_single_inode. The corresponding code flow diagram is shown in Figure 17-7.
Figure 17-7: Code flow diagram for __writeback_single_inode.
The function is essentially a dispatcher for __sync_single_inode, but is charged with the important task of distinguishing whether a data integrity (WB_SYNC_ALL) or regular writeback is performed. This influences how locked inodes are handled.
A set I_LOCK bit in the state element of the inode data structure indicates that the element is already being synchronized by another part of the kernel — and therefore cannot be modified at the moment in the current path. If a regular writeback is active, this is not much of a problem: The kernel can simply skip the inode and place it on the s_more_io list, which guarantees that it will be reconsidered some time later. Before returning to the caller, do_writepages is used to write out some of the data associated with the inode since this can do no harm.4

The situation is more involved if a data integrity writeback is performed, though. In this case, the kernel does not skip the inode but sets up a wait queue (see Chapter 14) to wait until the inode is available again, that is, until the I_SYNC bit is cleared. Notice that it is not sufficient to know that another part of the kernel is already synchronizing the inode. This could be a regular writeback that does not guarantee that the dirty data are actually written to disk. This is not what WB_SYNC_ALL is about: When the synchronization pass completes, the kernel has to guarantee that all data have been synchronized, and waiting on the inode is therefore essential.

Once the inode is available, the job is passed on to __sync_single_inode. This extensive function writes back the data associated with the inode and also the inode metadata. Figure 17-8 shows the code flow diagram.
Figure 17-8: Code flow diagram for __sync_single_inode.
1. First of all, the inode must be locked by setting the I_LOCK bit in the inode structure status field. This prevents other kernel threads from processing the inode.

2. Synchronization of an inode consists of two parts: synchronizing the data and synchronizing the metadata.

4 Actually, the call also does not have any benefit and will be removed in kernel 2.6.25, which was still under development when this book was written. Since do_writepages is also called in __sync_single_inode, the call is superfluous.
The actual write operation for the data is initiated in do_writepages. This function invokes the writepages method of the corresponding address_space_operations structure if the method exists and is not assigned a null pointer; for example, the ext3_writepages method is invoked for the Ext3 filesystem. If no method exists, the kernel invokes the generic_writepages function, which finds all dirty pages of the mapping and sequentially writes them back using writepage from the address space operations (note that in contrast to writepages, there is no s at the end of the name) or mpage_writepage if the former does not exist.
3. write_inode writes back the metadata needed to manage the inode itself. The function is not complicated; it simply checks whether the superblock operations associated with the inode instance include the write_inode method (the block device pseudo-filesystem, e.g., does not provide one). If it exists, it is invoked to find the relevant data and write them back via the block layer. Filesystems often choose to perform no actual writes to a block device, but just submit a dirty buffer to the generic code. This needs to be dealt with in the sync system call discussed below. Note that calling write_inode is skipped if neither I_DIRTY_SYNC nor I_DIRTY_DATASYNC is set, because this signals that only data, but not metadata, need to be written back.
4. If the current synchronization aims at data integrity, that is, if WB_SYNC_ALL is set, then filemap_fdatawait is used to wait until all pending write operations (which are usually processed asynchronously) are performed. The function waits for write operations to complete on a page-by-page basis. Pages currently being written to their backing store have the PG_writeback status bit set, which is automatically removed by the responsible block layer code when the operation is complete. Therefore, the synchronization code just needs to wait until the bit goes away.
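The per-page waiting primitive used for this is essentially the following (cf. <pagemap.h>):

static inline void wait_on_page_writeback(struct page *page)
{
        /* Sleep until the block layer clears PG_writeback when the I/O
           on the page has completed. */
        if (PageWriteback(page))
                wait_on_page_bit(page, PG_writeback);
}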
The above steps complete inode synchronization, at least from the view of the filesystem (naturally, the block layer still has a few things to do if filemap_fdatawait has not been called to await the results, but the layer structure of the kernel means that this is of no further relevance to us). The inode now needs to be put back into the correct list, and the kernel must update the inode status if it has changed as a result of synchronization. There are four different lists in which the inode can be inserted:
1. If the inode data have become dirty again in the meantime (i.e., if the I_DIRTY bit is set in the status element), the inode is added to the s_dirty list of the superblock. It is also placed in this list if not all dirty data of the mapping were written back — because, for example, the number of pages specified by writeback control was too small to allow all dirty pages to be processed in one go. In this case, the inode status is set to I_DIRTY_PAGES so that synchronization of the metadata is skipped the next time __sync_single_inode is invoked — these data have just been written back and are still intact.

2. If not all data of the mapping were written back, but pdflush was called from wb_kupdate, the inode is placed on s_more_io and will be dealt with in later synchronization runs. If not all data were written back and pdflush was not called from wb_kupdate, then the inode is placed back on the dirty list. This prevents a single large dirty file that cannot be written back properly from holding up other pending files for a long time or indefinitely. redirty_tail is responsible for keeping the inverse time ordering on s_dirty intact.
3. If the inode access counter (i_count) has a value greater than 0, the kernel inserts the inode in the global inode_in_use list because it is still in use.

4. When the access counter drops to 0, the inode is placed in the global list of unused inode instances (inode_unused).
The i_list element of the inode is used as the list element in all the above situations. The final step is to invoke wake_up_inode via the dispatcher inode_sync_complete. This function wakes processes that went to sleep waiting for the inode to be written back because its I_LOCK bit was set. Because the inode is no longer needed by the current thread (and is therefore no longer locked), the scheduler selects one of these processes to handle the inode. If the data have already been fully synchronized, this process has nothing else to do. If dirty pages still need to be synchronized, the process goes ahead and synchronizes them.
17.11 Congestion
I have used the term congestion a few times without precisely defining what it means. On an intuitive level it is not difficult to understand — when a kernel block device queue is overloaded with read or write operations, it doesn’t make sense to add further requests for communication with the block device. It is best to wait until a certain number of requests have been processed and the queue is shorter before submitting new read or write requests. Below I examine how the kernel implements this definition on a technical level.
17.11.1 Data Structures

Two wait queues are needed to implement the congestion mechanism. They are defined as follows:

mm/backing-dev.c
static wait_queue_head_t congestion_wqh[2] = {
        __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
        __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
The kernel provides two queues, one for input and one for output. Two pre-processor constants (READ and WRITE) are defined in