|  | Path walking and name lookup locking | 
|  | ==================================== | 
|  |  | 
|  | Path resolution is the finding a dentry corresponding to a path name string, by | 
|  | performing a path walk. Typically, for every open(), stat() etc., the path name | 
|  | will be resolved. Paths are resolved by walking the namespace tree, starting | 
|  | with the first component of the pathname (eg. root or cwd) with a known dentry, | 
|  | then finding the child of that dentry, which is named the next component in the | 
|  | path string. Then repeating the lookup from the child dentry and finding its | 
|  | child with the next element, and so on. | 
|  |  | 
|  | Since it is a frequent operation for workloads like multiuser environments and | 
|  | web servers, it is important to optimize this code. | 
|  |  | 
|  | Path walking synchronisation history: | 
|  | Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and | 
|  | thus in every component during path look-up. Since 2.5.10 onwards, fast-walk | 
|  | algorithm changed this by holding the dcache_lock at the beginning and walking | 
|  | as many cached path component dentries as possible. This significantly | 
|  | decreases the number of acquisition of dcache_lock. However it also increases | 
|  | the lock hold time significantly and affects performance in large SMP machines. | 
|  | Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to | 
|  | make dcache look-up lock-free. | 
|  |  | 
|  | All the above algorithms required taking a lock and reference count on the | 
|  | dentry that was looked up, so that may be used as the basis for walking the | 
|  | next path element. This is inefficient and unscalable. It is inefficient | 
|  | because of the locks and atomic operations required for every dentry element | 
|  | slows things down. It is not scalable because many parallel applications that | 
|  | are path-walk intensive tend to do path lookups starting from a common dentry | 
|  | (usually, the root "/" or current working directory). So contention on these | 
|  | common path elements causes lock and cacheline queueing. | 
|  |  | 
|  | Since 2.6.38, RCU is used to make a significant part of the entire path walk | 
|  | (including dcache look-up) completely "store-free" (so, no locks, atomics, or | 
|  | even stores into cachelines of common dentries). This is known as "rcu-walk" | 
|  | path walking. | 
|  |  | 
|  | Path walking overview | 
|  | ===================== | 
|  |  | 
|  | A name string specifies a start (root directory, cwd, fd-relative) and a | 
|  | sequence of elements (directory entry names), which together refer to a path in | 
|  | the namespace. A path is represented as a (dentry, vfsmount) tuple. The name | 
|  | elements are sub-strings, separated by '/'. | 
|  |  | 
|  | Name lookups will want to find a particular path that a name string refers to | 
|  | (usually the final element, or parent of final element). This is done by taking | 
|  | the path given by the name's starting point (which we know in advance -- eg. | 
|  | current->fs->cwd or current->fs->root) as the first parent of the lookup. Then | 
|  | iteratively for each subsequent name element, look up the child of the current | 
|  | parent with the given name and if it is not the desired entry, make it the | 
|  | parent for the next lookup. | 
|  |  | 
|  | A parent, of course, must be a directory, and we must have appropriate | 
|  | permissions on the parent inode to be able to walk into it. | 
|  |  | 
|  | Turning the child into a parent for the next lookup requires more checks and | 
|  | procedures. Symlinks essentially substitute the symlink name for the target | 
|  | name in the name string, and require some recursive path walking.  Mount points | 
|  | must be followed into (thus changing the vfsmount that subsequent path elements | 
|  | refer to), switching from the mount point path to the root of the particular | 
|  | mounted vfsmount. These behaviours are variously modified depending on the | 
|  | exact path walking flags. | 
|  |  | 
|  | Path walking then must, broadly, do several particular things: | 
|  | - find the start point of the walk; | 
|  | - perform permissions and validity checks on inodes; | 
|  | - perform dcache hash name lookups on (parent, name element) tuples; | 
|  | - traverse mount points; | 
|  | - traverse symlinks; | 
|  | - lookup and create missing parts of the path on demand. | 
|  |  | 
|  | Safe store-free look-up of dcache hash table | 
|  | ============================================ | 
|  |  | 
|  | Dcache name lookup | 
|  | ------------------ | 
|  | In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple | 
|  | and use that to select a bucket in the dcache-hash table. The list of entries | 
|  | in that bucket is then walked, and we do a full comparison of each entry | 
|  | against our (parent, name) tuple. | 
|  |  | 
|  | The hash lists are RCU protected, so list walking is not serialised with | 
|  | concurrent updates (insertion, deletion from the hash). This is a standard RCU | 
|  | list application with the exception of renames, which will be covered below. | 
|  |  | 
|  | Parent and name members of a dentry, as well as its membership in the dcache | 
|  | hash, and its inode are protected by the per-dentry d_lock spinlock. A | 
|  | reference is taken on the dentry (while the fields are verified under d_lock), | 
|  | and this stabilises its d_inode pointer and actual inode. This gives a stable | 
|  | point to perform the next step of our path walk against. | 
|  |  | 
|  | These members are also protected by d_seq seqlock, although this offers | 
|  | read-only protection and no durability of results, so care must be taken when | 
|  | using d_seq for synchronisation (see seqcount based lookups, below). | 
|  |  | 
|  | Renames | 
|  | ------- | 
|  | Back to the rename case. In usual RCU protected lists, the only operations that | 
|  | will happen to an object is insertion, and then eventually removal from the | 
|  | list. The object will not be reused until an RCU grace period is complete. | 
|  | This ensures the RCU list traversal primitives can run over the object without | 
|  | problems (see RCU documentation for how this works). | 
|  |  | 
|  | However when a dentry is renamed, its hash value can change, requiring it to be | 
|  | moved to a new hash list. Allocating and inserting a new alias would be | 
|  | expensive and also problematic for directory dentries. Latency would be far to | 
|  | high to wait for a grace period after removing the dentry and before inserting | 
|  | it in the new hash bucket. So what is done is to insert the dentry into the | 
|  | new list immediately. | 
|  |  | 
|  | However, when the dentry's list pointers are updated to point to objects in the | 
|  | new list before waiting for a grace period, this can result in a concurrent RCU | 
|  | lookup of the old list veering off into the new (incorrect) list and missing | 
|  | the remaining dentries on the list. | 
|  |  | 
|  | There is no fundamental problem with walking down the wrong list, because the | 
|  | dentry comparisons will never match. However it is fatal to miss a matching | 
|  | dentry. So a seqlock is used to detect when a rename has occurred, and so the | 
|  | lookup can be retried. | 
|  |  | 
|  | 1      2      3 | 
|  | +---+  +---+  +---+ | 
|  | hlist-->| N-+->| N-+->| N-+-> | 
|  | head <--+-P |<-+-P |<-+-P | | 
|  | +---+  +---+  +---+ | 
|  |  | 
|  | Rename of dentry 2 may require it deleted from the above list, and inserted | 
|  | into a new list. Deleting 2 gives the following list. | 
|  |  | 
|  | 1             3 | 
|  | +---+         +---+     (don't worry, the longer pointers do not | 
|  | hlist-->| N-+-------->| N-+->    impose a measurable performance overhead | 
|  | head <--+-P |<--------+-P |      on modern CPUs) | 
|  | +---+         +---+ | 
|  | ^      2      ^ | 
|  | |    +---+    | | 
|  | |    | N-+----+ | 
|  | +----+-P | | 
|  | +---+ | 
|  |  | 
|  | This is a standard RCU-list deletion, which leaves the deleted object's | 
|  | pointers intact, so a concurrent list walker that is currently looking at | 
|  | object 2 will correctly continue to object 3 when it is time to traverse the | 
|  | next object. | 
|  |  | 
|  | However, when inserting object 2 onto a new list, we end up with this: | 
|  |  | 
|  | 1             3 | 
|  | +---+         +---+ | 
|  | hlist-->| N-+-------->| N-+-> | 
|  | head <--+-P |<--------+-P | | 
|  | +---+         +---+ | 
|  | 2 | 
|  | +---+ | 
|  | | N-+----> | 
|  | <----+-P | | 
|  | +---+ | 
|  |  | 
|  | Because we didn't wait for a grace period, there may be a concurrent lookup | 
|  | still at 2. Now when it follows 2's 'next' pointer, it will walk off into | 
|  | another list without ever having checked object 3. | 
|  |  | 
|  | A related, but distinctly different, issue is that of rename atomicity versus | 
|  | lookup operations. If a file is renamed from 'A' to 'B', a lookup must only | 
|  | find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup | 
|  | of 'B' must succeed (note the reverse is not true). | 
|  |  | 
|  | Between deleting the dentry from the old hash list, and inserting it on the new | 
|  | hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same | 
|  | rename seqlock is also used to cover this race in much the same way, by | 
|  | retrying a negative lookup result if a rename was in progress. | 
|  |  | 
|  | Seqcount based lookups | 
|  | ---------------------- | 
|  | In refcount based dcache lookups, d_lock is used to serialise access to | 
|  | the dentry, stabilising it while comparing its name and parent and then | 
|  | taking a reference count (the reference count then gives a stable place to | 
|  | start the next part of the path walk from). | 
|  |  | 
|  | As explained above, we would like to do path walking without taking locks or | 
|  | reference counts on intermediate dentries along the path. To do this, a per | 
|  | dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry | 
|  | looks like (its name, parent, and inode). That snapshot is then used to start | 
|  | the next part of the path walk. When loading the coherent snapshot under d_seq, | 
|  | care must be taken to load the members up-front, and use those pointers rather | 
|  | than reloading from the dentry later on (otherwise we'd have interesting things | 
|  | like d_inode going NULL underneath us, if the name was unlinked). | 
|  |  | 
|  | Also important is to avoid performing any destructive operations (pretty much: | 
|  | no non-atomic stores to shared data), and to recheck the seqcount when we are | 
|  | "done" with the operation. Retry or abort if the seqcount does not match. | 
|  | Avoiding destructive or changing operations means we can easily unwind from | 
|  | failure. | 
|  |  | 
|  | What this means is that a caller, provided they are holding RCU lock to | 
|  | protect the dentry object from disappearing, can perform a seqcount based | 
|  | lookup which does not increment the refcount on the dentry or write to | 
|  | it in any way. This returned dentry can be used for subsequent operations, | 
|  | provided that d_seq is rechecked after that operation is complete. | 
|  |  | 
|  | Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be | 
|  | queried for permissions. | 
|  |  | 
|  | With this two parts of the puzzle, we can do path lookups without taking | 
|  | locks or refcounts on dentry elements. | 
|  |  | 
|  | RCU-walk path walking design | 
|  | ============================ | 
|  |  | 
|  | Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk | 
|  | is the traditional[*] way of performing dcache lookups using d_lock to | 
|  | serialise concurrent modifications to the dentry and take a reference count on | 
|  | it. ref-walk is simple and obvious, and may sleep, take locks, etc while path | 
|  | walking is operating on each dentry. rcu-walk uses seqcount based dentry | 
|  | lookups, and can perform lookup of intermediate elements without any stores to | 
|  | shared data in the dentry or inode. rcu-walk can not be applied to all cases, | 
|  | eg. if the filesystem must sleep or perform non trivial operations, rcu-walk | 
|  | must be switched to ref-walk mode. | 
|  |  | 
|  | [*] RCU is still used for the dentry hash lookup in ref-walk, but not the full | 
|  | path walk. | 
|  |  | 
|  | Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining | 
|  | path string, rcu-walk uses a d_seq protected snapshot. When looking up a | 
|  | child of this parent snapshot, we open d_seq critical section on the child | 
|  | before closing d_seq critical section on the parent. This gives an interlocking | 
|  | ladder of snapshots to walk down. | 
|  |  | 
|  |  | 
|  | proc 101 | 
|  | /----------------\ | 
|  | / comm:    "vi"    \ | 
|  | /  fs.root: dentry0  \ | 
|  | \  fs.cwd:  dentry2  / | 
|  | \                  / | 
|  | \----------------/ | 
|  |  | 
|  | So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will | 
|  | start from current->fs->root, which is a pinned dentry. Alternatively, | 
|  | "./test.c" would start from cwd; both names refer to the same path in | 
|  | the context of proc101. | 
|  |  | 
|  | dentry 0 | 
|  | +---------------------+   rcu-walk begins here, we note d_seq, check the | 
|  | | name:    "/"        |   inode's permission, and then look up the next | 
|  | | inode:   10         |   path element which is "home"... | 
|  | | children:"home", ...| | 
|  | +---------------------+ | 
|  | | | 
|  | dentry 1 V | 
|  | +---------------------+   ... which brings us here. We find dentry1 via | 
|  | | name:    "home"     |   hash lookup, then note d_seq and compare name | 
|  | | inode:   678        |   string and parent pointer. When we have a match, | 
|  | | children:"npiggin"  |   we now recheck the d_seq of dentry0. Then we | 
|  | +---------------------+   check inode and look up the next element. | 
|  | | | 
|  | dentry2  V | 
|  | +---------------------+   Note: if dentry0 is now modified, lookup is | 
|  | | name:    "npiggin"  |   not necessarily invalid, so we need only keep a | 
|  | | inode:   543        |   parent for d_seq verification, and grandparents | 
|  | | children:"a.c", ... |   can be forgotten. | 
|  | +---------------------+ | 
|  | | | 
|  | dentry3  V | 
|  | +---------------------+   At this point we have our destination dentry. | 
|  | | name:    "a.c"      |   We now take its d_lock, verify d_seq of this | 
|  | | inode:   14221      |   dentry. If that checks out, we can increment | 
|  | | children:NULL       |   its refcount because we're holding d_lock. | 
|  | +---------------------+ | 
|  |  | 
|  | Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock, | 
|  | re-checking its d_seq, and then incrementing its refcount is called | 
|  | "dropping rcu" or dropping from rcu-walk into ref-walk mode. | 
|  |  | 
|  | It is, in some sense, a bit of a house of cards. If the seqcount check of the | 
|  | parent snapshot fails, the house comes down, because we had closed the d_seq | 
|  | section on the grandparent, so we have nothing left to stand on. In that case, | 
|  | the path walk must be fully restarted (which we do in ref-walk mode, to avoid | 
|  | live locks). It is costly to have a full restart, but fortunately they are | 
|  | quite rare. | 
|  |  | 
|  | When we reach a point where sleeping is required, or a filesystem callout | 
|  | requires ref-walk, then instead of restarting the walk, we attempt to drop rcu | 
|  | at the last known good dentry we have. Avoiding a full restart in ref-walk in | 
|  | these cases is fundamental for performance and scalability because blocking | 
|  | operations such as creates and unlinks are not uncommon. | 
|  |  | 
|  | The detailed design for rcu-walk is like this: | 
|  | * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. | 
|  | * Take the RCU lock for the entire path walk, starting with the acquiring | 
|  | of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are | 
|  | not required for dentry persistence. | 
|  | * synchronize_rcu is called when unregistering a filesystem, so we can | 
|  | access d_ops and i_ops during rcu-walk. | 
|  | * Similarly take the vfsmount lock for the entire path walk. So now mnt | 
|  | refcounts are not required for persistence. Also we are free to perform mount | 
|  | lookups, and to assume dentry mount points and mount roots are stable up and | 
|  | down the path. | 
|  | * Have a per-dentry seqlock to protect the dentry name, parent, and inode, | 
|  | so we can load this tuple atomically, and also check whether any of its | 
|  | members have changed. | 
|  | * Dentry lookups (based on parent, candidate string tuple) recheck the parent | 
|  | sequence after the child is found in case anything changed in the parent | 
|  | during the path walk. | 
|  | * inode is also RCU protected so we can load d_inode and use the inode for | 
|  | limited things. | 
|  | * i_mode, i_uid, i_gid can be tested for exec permissions during path walk. | 
|  | * i_op can be loaded. | 
|  | * When the destination dentry is reached, drop rcu there (ie. take d_lock, | 
|  | verify d_seq, increment refcount). | 
|  | * If seqlock verification fails anywhere along the path, do a full restart | 
|  | of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of | 
|  | a better errno) to signal an rcu-walk failure. | 
|  |  | 
|  | The cases where rcu-walk cannot continue are: | 
|  | * NULL dentry (ie. any uncached path element) | 
|  | * Following links | 
|  |  | 
|  | It may be possible eventually to make following links rcu-walk aware. | 
|  |  | 
|  | Uncached path elements will always require dropping to ref-walk mode, at the | 
|  | very least because i_mutex needs to be grabbed, and objects allocated. | 
|  |  | 
|  | Final note: | 
|  | "store-free" path walking is not strictly store free. We take vfsmount lock | 
|  | and refcounts (both of which can be made per-cpu), and we also store to the | 
|  | stack (which is essentially CPU-local), and we also have to take locks and | 
|  | refcount on final dentry. | 
|  |  | 
|  | The point is that shared data, where practically possible, is not locked | 
|  | or stored into. The result is massive improvements in performance and | 
|  | scalability of path resolution. | 
|  |  | 
|  |  | 
|  | Interesting statistics | 
|  | ====================== | 
|  |  | 
|  | The following table gives rcu lookup statistics for a few simple workloads | 
|  | (2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to | 
|  | drop rcu that fail due to d_seq failure and requiring the entire path lookup | 
|  | again. Other cases are successful rcu-drops that are required before the final | 
|  | element, nodentry for missing dentry, revalidate for filesystem revalidate | 
|  | routine requiring rcu drop, permission for permission check requiring drop, | 
|  | and link for symlink traversal requiring drop. | 
|  |  | 
|  | rcu-lookups     restart  nodentry          link  revalidate  permission | 
|  | bootup     47121           0      4624          1010       10283        7852 | 
|  | dbench  25386793           0   6778659(26.7%)     55         549        1156 | 
|  | kbuild   2696672          10     64442(2.3%)  108764(4.0%)     1        1590 | 
|  | git diff   39605           0        28             2           0         106 | 
|  | vfstest 24185492        4945    708725(2.9%) 1076136(4.4%)     0        2651 | 
|  |  | 
|  | What this shows is that failed rcu-walk lookups, ie. ones that are restarted | 
|  | entirely with ref-walk, are quite rare. Even the "vfstest" case which | 
|  | specifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise | 
|  | such races is not showing a huge amount of restarts. | 
|  |  | 
|  | Dropping from rcu-walk to ref-walk mean that we have encountered a dentry where | 
|  | the reference count needs to be taken for some reason. This is either because | 
|  | we have reached the target of the path walk, or because we have encountered a | 
|  | condition that can't be resolved in rcu-walk mode.  Ideally, we drop rcu-walk | 
|  | only when we have reached the target dentry, so the other statistics show where | 
|  | this does not happen. | 
|  |  | 
|  | Note that a graceful drop from rcu-walk mode due to something such as the | 
|  | dentry not existing (which can be common) is not necessarily a failure of | 
|  | rcu-walk scheme, because some elements of the path may have been walked in | 
|  | rcu-walk mode. The further we get from common path elements (such as cwd or | 
|  | root), the less contended the dentry is likely to be. The closer we are to | 
|  | common path elements, the more likely they will exist in dentry cache. | 
|  |  | 
|  |  | 
|  | Papers and other documentation on dcache locking | 
|  | ================================================ | 
|  |  | 
|  | 1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). | 
|  |  | 
|  | 2. http://lse.sourceforge.net/locking/dcache/dcache.html | 
|  |  | 
|  | 3. path-lookup.md in this directory. |