All of lore.kernel.org
 help / color / mirror / Atom feed
From: Randy Dunlap <rdunlap@infradead.org>
To: NeilBrown <neilb@suse.com>, Jonathan Corbet <corbet@lwn.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>,
	lkml <linux-kernel@vger.kernel.org>,
	"J. Bruce Fields" <bfields@fieldses.org>,
	linux-doc@vger.kernel.org
Subject: Re: [PATCH] Documentation: add new description of path-name lookup.
Date: Sun, 26 Jul 2015 20:41:52 -0700	[thread overview]
Message-ID: <55B5A880.8040500@infradead.org> (raw)
In-Reply-To: <20150725102825.335e9d9a@noble>

On 07/24/15 17:28, NeilBrown wrote:
> 

Hi Neil,

Some edits for you to consider.


> 
> diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md
> new file mode 100644
> index 000000000000..eedf31c737ed
> --- /dev/null
> +++ b/Documentation/filesystems/path-lookup.md
> @@ -0,0 +1,1297 @@
> +<head>
> +<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
> +</head>
> +
> +Pathname lookup in Linux.
> +=========================
> +
> +This write-up is based on three articles published at lwn.net:
> +
> +- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
> +- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
> +- <https://lwn.net/Articles/650786/> A walk among the symlinks
> +
> +Written by Neil Brown with help from Al Viro and Jon Corbet.
> +
> +Introduction
> +------------


> +Bringing it together with `struct nameidata`
> +--------------------------------------------
> +

> +### `struct path root` ###
> +
> +This is used to hold a reference to the effective root of the
> +filesystem.  Often that reference won't be needed, so this field is
> +only assigned the first time it is used, or when a non-standard root
> +is requested.  Keeping a reference in the `nameidata` ensures that
> +only one root is in effect for the entire path walk, even if it races
> +with a `chroot()` system call.
> +
> +The root is needed when either of two conditions holds: (1) either the
> +pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
> +component is being handled, since "`..`" from the root must always stay
> +at the root.  The value used is usually the current root directory of
> +the calling process.  An alternate root can be provided as when
> +`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
> +`mount_subtree()`.  In each case a pathname is being looked up in a very
> +specific part of the filesystem, and the lookup must not be allowed to
> +escape that subtree.  It works a bit like a local `chroot()`.
> +
> +Ignoring the handling of symbolic links, we can now describe the
> +"`link_path_walk()`" function, which handles the lookup of everything
> +except the final component as:
> +
> +>  Given a path (`name`) and a nameidata structure (`nd`), check that the
> +>  current directory has execute permission and then advance `name`
> +>  over one component while updating `last_type` and `last`.  If that
> +>  was the final component, then return, otherwise call
> +>  `walk_component()` and repeat from the top.
> +
> +`walk_component()` is even easier.  If the component is `LAST_DOTS`,
> +it calls `handle_dots()` which does the necessary locking as already
> +described.  If it finds a `LAST_NORM` component it first calls
> +"`lookup_fast()`" which only looks in the dcache, but will ask the
> +filesystem to revalidate the result if it is that sort of filesystem.
> +If that doesn't get a good result, it calls "`lookup_slow()`" which
> +takes the `i_mutex`, rechecks the cache, and then asks the filesystem
> +to find a definitive answer.  Each of these will call
> +`follow_managed()` (as described below) to handle any mount points.
> +
> +In the absence of symbolic links, `walk_component()` creates a new
> +`struct path` containing a counted reference to the new dentry and a
> +reference to the new `vfsmount` which is only counted if it is
> +different from the previous `vfsmount`.  It then calls
> +`path_to_nameidata()` to install the new `struct path` in the
> +`struct nameidate` and drop the unneeded references.

           nameidata

> +
> +This "hand-over-hand" sequencing of getting a reference to the new
> +dentry before dropping the reference to the previous dentry may
> +seem obvious, but is worth pointing out so that we will recognize its
> +analogue in the "RCU-walk" version.
> +
> +Handling the final component.
> +-----------------------------
> +
> +`link_path_walk()` only walks as far as setting `nd->last` and
> +`nd->last_type` to refer to the final component of the path.  It does
> +not call `walk_component()` that last time.  Handling that final
> +component remains for the caller to sort out. Those callers are
> +`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
> +`path_openat()` each of which handles the differing requirements of
> +different system calls.
> +
> +`path_parentat()` is clearly the simplest - it just wraps a little bit
> +of housekeeping around `link_path_walk()` and returns the parent
> +directory and final component to the caller.  The caller will be either
> +aiming to create a name (via `filename_create()`) or remove or rename
> +a name (in which case `user_path_parent()` is used).  They will use
> +`i_mutex` to exclude other changes while they validate and then
> +perform their operation.
> +
> +`path_lookupat()` is nearly as simple - it is used when an existing
> +object is wanted such as by `stat()` or `chmod()`.  It essentially just
> +calls `walk_component()` on the final component through a call to
> +`lookup_last()`.  `path_lookupat()` returns just the final dentry.
> +
> +`path_mountpoint()` handles the special case of unmounting which must
> +not try to revalidate the mounted filesystem.  It effectively
> +contains, through a call to `mountpoint_last()`, an alternate
> +implementation of `lookup_slow()` which skips that step.  This is
> +important when unmounting a filesystem that is inaccessible, such as
> +one provided by a dead NFS server.
> +
> +Finally `path_openat()` is used for the `open()` system call; it
> +contains, in support functions starting with "`do_last()`", all the
> +complexity needed to handle the different subtleties of O_CREAT (with
> +or without O_EXCL) final "`/`" characters, and trailing symbolic

missing comma?      , final

> +links.  We will revisit this in the final part of this series, which
> +focuses on those symbolic links.  "`do_last()`" will sometimes, but
> +not always, take `i_mutex`, depending on what it finds.
> +
> +Each of these, or the functions which call them, need to be alert to
> +the possibility that the final component is not `LAST_NORM`.  If the
> +goal of the lookup is to create something, then any value for
> +`last_type` other than `LAST_NORM` will result in an error.  For
> +example if `path_parentat()` reports `LAST_DOTDOT`, then the caller
> +won't try to create that name.  They also check for trailing slashes
> +by testing `last.name[last.len]`.  If there is any character beyond
> +the final component, it must be a trailing slash.
> +
> +Revalidation and automounts
> +---------------------------
> +
> +Apart from symbolic links, there are only two parts of the "REF-walk"
> +process not yet covered.  One is the handling of stale cache entries
> +and the other is automounts.
> +
> +On filesystems that require it, the lookup routines will call the
> +`->d_revalidate()` dentry method to ensure that the cached information
> +is current.  This will often confirm validity or update a few details
> +from a server.  In some cases it may find that there has been change
> +further up the path and that something that was thought to be valid
> +previously isn't really.  When this happens the lookup of the whole
> +path is aborted and retried with the "`LOOKUP_REVAL`" flag set.  This
> +forces revalidation to be more thorough.  We will see more details of
> +this retry process in the next article.
> +
> +Automount points are locations in the filesystem where an attempt to
> +lookup a name can trigger changes to how that lookup should be
> +handled, in particular by mounting a filesystem there.  These are
> +covered in greater detail in autofs4.txt in the Linux documentation
> +tree, but a few notes specifically related to path lookup are in order
> +here.
> +
> +The Linux VFS has a concept of "managed" dentries which is reflected
> +in function names such as "`follow_managed()`".  There are three
> +potentially interesting things about these dentries corresponding
> +to three different flags that might be set in `dentry->d_flags`:
> +


> +RCU-walk - faster pathname lookup in Linux
> +==========================================
> +
> +RCU-walk is another algorithm for performing pathname lookup in Linux.
> +It is in many ways similar to REF-walk and the two share quite a bit
> +of code.  The significant difference in RCU-walk is how it allows for
> +the possibility of concurrent access.
> +
> +We noted that REF-walk is complex because there are numerous details
> +and special cases.  RCU-walk reduces this complexity by simply
> +refusing to handle a number of cases -- it instead falls back to
> +REF-walk.  The difficulty with RCU-walk comes from a different
> +direction: unfamiliarity.  The locking rules when depending on RCU are
> +quite different to traditional locking, so we will spend a little extra

I would say:       from

> +time when we come to those.
> +
> +Clear demarcation of roles
> +--------------------------
> +
> +The easiest way to manage concurrency is to forcibly stop any other
> +thread from changing the data structures that a given thread is
> +looking at.  In cases where no other thread would even think of
> +changing the data and lots of different threads want to read at the
> +same time, this can be very costly.  Even when using locks that permit
> +multiple concurrent readers, the simple act of updating the count of
> +the number of current readers can impose an unwanted cost.  So the
> +goal when reading a shared data structure that no other process is
> +changing, is to avoid writing anything to memory at all.  Take no

dubious comma^

> +locks, increment no counts, leave no footprints.
> +

> +RCU and seqlocks: fast and light
> +--------------------------------
> +

> +However, there is a little bit more to seqlocks than that.  If
> +RCU-walk accesses two different fields in a seqlock-protected
> +structure, or accesses the same field twice, there is no a-priori

                                             no hyphen:      ^

> +guarantee of any consistency between those accesses.  When consistency
> +is needed - which it usually is - RCU-walk must take a copy and then
> +use `read_seqcount_retry()` to validate that copy.
> +


> +`unlazy walk()` and `complete_walk()`
> +-------------------------------------
> +

> +A pair of patterns
> +------------------
> +
> +In various places in the details of REF-walk and RCU-walk, and also in
> +the big picture, there are a couple of related patterns that are worth
> +being aware of.
> +
> +The first is "try quickly and check, if that fails try slowly".  We
> +can see that in the high-level approach of first trying RCU-walk and
> +then trying REF-walk, and in places were `unlazy_walk()` is used to

                                       where

> +switch to REF-walk for the rest of the path.  We also saw it earlier
> +in `dget_parent()` when following a "`..`" link.  It tries a quick way
> +to get a reference, then falls back to taking locks if needed.
> +

> +A walk among the symlinks
> +=========================
> +

> +
> +Storage and lifetime of cached symlinks
> +---------------------------------------
> +

> +
> +When neither of these are suitable, the next most likely scenario is

                         is

> +that the filesystem will allocate some temporary memory and copy or
> +construct the symlink content into that memory whenever it is needed.
> +


> +Updating the access time
> +------------------------
> +
> +We previously said of RCU-walk that it would "take no locks, increment
> +no counts, leave no footprints."  We have since seen that some
> +"footprints" can be needed when handling symlinks as a counted
> +reference (or even a memory allocation) may be needed.  But these
> +footprints are best kept to a minimum.
> +
> +One other place where walking down a symlink can involve leaving
> +footprints in a way that doesn't affect directories is in updating access times.
> +In Unix (and Linux) every filesystem object has a "last accessed
> +time", or "`atime`".  Passing through a directory to access a file
> +within, is not considered to be an access for the purposes of

drop     ^ comma?

> +`atime`; only listing the contents of a directory can update its `atime`.
> +Symlinks are different it seems.  Both reading a symlink (with `readlink()`)
> +and looking up a symlink on the way to some other destination can
> +update the atime on that symlink.
> +



-- 
~Randy

  reply	other threads:[~2015-07-27  3:42 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-25  0:28 [PATCH] Documentation: add new description of path-name lookup NeilBrown
2015-07-27  3:41 ` Randy Dunlap [this message]
2015-08-06  2:54   ` NeilBrown
2015-08-06  2:56     ` Randy Dunlap
2015-08-06 10:01     ` Jonathan Corbet
2015-10-26  6:35       ` [PATCHv2] " Neil Brown
2015-10-27 21:21         ` J. Bruce Fields
2015-10-27 23:12           ` Neil Brown
2015-11-03  1:19         ` Jonathan Corbet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=55B5A880.8040500@infradead.org \
    --to=rdunlap@infradead.org \
    --cc=bfields@fieldses.org \
    --cc=corbet@lwn.net \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=neilb@suse.com \
    --cc=viro@ZenIV.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.