Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Al Viro <viro@zeniv.linux.org.uk>
To: NeilBrown <neilb@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Daire Byrne <daire@dneg.com>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Chuck Lever <chuck.lever@oracle.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	linux-fsdevel@vger.kernel.org,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 01/10] VFS: support parallel updates in the one directory.
Date: Sat, 27 Aug 2022 04:43:39 +0100	[thread overview]
Message-ID: <YwmS63X3Sm4bhlcT@ZenIV> (raw)
In-Reply-To: <166147984370.25420.13019217727422217511.stgit@noble.brown>

On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:

> +/**
> + * d_lock_update_nested - lock a dentry before updating
> + * @dentry: the dentry to be locked
> + * @base:   the parent, or %NULL
> + * @name:   the name in that parent, or %NULL
> + * @subclass: lockdep locking class.
> + *
> + * Lock a dentry in a directory on which a shared-lock may be held, and
> + * on which parallel updates are permitted.
> + * If the base and name are given, then on success the dentry will still
> + * have that base and name - it will not have raced with rename.
> + * On success, a positive dentry will still be hashed, ensuring there
> + * was no race with unlink.
> + * If they are not given, then on success the dentry will be negative,
> + * which again ensures no race with rename, or unlink.

I'm not sure it's a good idea to have that in one function, TBH.
Looking at the callers, there are
	* lookup_hash_update()
		lookup_hash_update_len()
			nfsd shite
		filename_create_one()
			filename_create_one_len()
				nfsd shite
			filename_create()
				kern_path_create()
				user_path_create()
				do_mknodat()
				do_mkdirat()
				do_symlinkat()
				do_linkat()
		do_rmdir()
		do_unlinkat()
	* fuckloads of callers in lock_rename_lookup()
	* d_lock_update()
		atomic_open()
		lookup_open()

Only the last two can get NULL base or name.  Already interesting,
isn't it?  What's more, splitup between O_CREATE open() on one
side and everything else that might create, remove or rename on
the other looks really weird.

> +	rcu_read_lock(); /* for d_same_name() */
> +	if (d_unhashed(dentry) && d_is_positive(dentry)) {
> +		/* name was unlinked while we waited */
> +		ret = false;

BTW, what happens if somebody has ->lookup() returning a positive
unhashed?  Livelock on attempt to hit it with any directory-modifying
syscall?  Right now such behaviour is permitted; I don't know if
anything in the tree actually does it, but at the very least it
would need to be documented.

Note that *negative* unhashed is not just permitted, it's
actively used e.g. by autofs.  That really needs to be well
commented...

> +	} else if (base &&
> +		 (dentry->d_parent != base ||
> +		  dentry->d_name.hash != name->hash ||
> +		  !d_same_name(dentry, base, name))) {
> +		/* dentry was renamed - possibly silly-rename */
> +		ret = false;
> +	} else if (!base && d_is_positive(dentry)) {
> +		ret = false;
> +	} else {
> +		dentry->d_flags |= DCACHE_PAR_UPDATE;
> +	}

> + * Parent directory has inode locked exclusive, or possibly shared if wq
> + * is given.  In the later case the fs has explicitly allowed concurrent
> + * updates in this directory.  This is the one and only case
> + * when ->lookup() may be called on a non in-lookup dentry.

What Linus already said about wq...  To add a reason he hadn't mentioned,
the length of call chain one needs to track to figure out whether it's
NULL or not is... excessive.  And I don't mean just "greater than 0".
We have places like that, and sometimes we have to, but it's never a good
thing...

>  static struct dentry *__lookup_hash(const struct qstr *name,
> -		struct dentry *base, unsigned int flags)
> +				    struct dentry *base, unsigned int flags,
> +				    wait_queue_head_t *wq)

> -	dentry = d_alloc(base, name);
> +	if (wq)
> +		dentry = d_alloc_parallel(base, name, wq);
> +	else
> +		dentry = d_alloc(base, name);
>  	if (unlikely(!dentry))
>  		return ERR_PTR(-ENOMEM);
> +	if (IS_ERR(dentry))
> +		return dentry;

	BTW, considering the fact that we have 12 callers of d_alloc()
(including this one) and 28 callers of its wrapper (d_alloc_name()), I
would seriously consider converting d_alloc() from "NULL or new dentry"
to "ERR_PTR(-ENOMEM) or new dentry".  Especially since quite a few of
those callers will be happier that way.  Grep and you'll see...  As a
side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).

> +static struct dentry *lookup_hash_update(
> +	const struct qstr *name,
> +	struct dentry *base, unsigned int flags,
> +	wait_queue_head_t *wq)
> +{
> +	struct dentry *dentry;
> +	struct inode *dir = base->d_inode;
> +	int err;
> +
> +	if (wq && IS_PAR_UPDATE(dir))
> +		inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> +	else
> +		inode_lock_nested(dir, I_MUTEX_PARENT);
> +
> +retry:
> +	dentry = __lookup_hash(name, base, flags, wq);
> +	if (IS_ERR(dentry)) {
> +		err = PTR_ERR(dentry);
> +		goto out_err;
> +	}
> +	if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> +		/*
> +		 * Failed to get lock due to race with unlink or rename
> +		 * - try again
> +		 */
> +		d_lookup_done(dentry);

When would we get out of __lookup_hash() with in-lookup dentry?
Confused...

> +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> +				      struct path *path, unsigned int flags,

	const struct path *, damnit...

> +				      wait_queue_head_t *wq)
> +{
> +	struct qstr this;
> +	int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> +				    path->dentry, nlen, &this);
> +	if (err)
> +		return ERR_PTR(err);
> +	return lookup_hash_update(&this, path->dentry, flags, wq);
> +}
> +EXPORT_SYMBOL(lookup_hash_update_len);

Frankly, the calling conventions of the "..._one_len" family is something
I've kept regretting for a long time.  Oh, well...

> +static void done_path_update(struct path *path, struct dentry *dentry,
> +			     bool with_wq)
> +{
> +	struct inode *dir = path->dentry->d_inode;
> +
> +	d_lookup_done(dentry);
> +	d_unlock_update(dentry);
> +	if (IS_PAR_UPDATE(dir) && with_wq)
> +		inode_unlock_shared(dir);
> +	else
> +		inode_unlock(dir);
> +}

const struct path *, again...

> @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  			dentry = res;
>  		}
>  	}
> +	/*
> +	 * If dentry is negative and this is a create we need to get
> +	 * DCACHE_PAR_UPDATE.
> +	 */
> +	if ((open_flag & O_CREAT) && !dentry->d_inode)
> +		have_par_update = d_lock_update(dentry, NULL, NULL);
>  
>  	/* Negative dentry, just create the file */
>  	if (!dentry->d_inode && (open_flag & O_CREAT)) {

Fold the above here, please.  What's more, losing the flag would've
made it much easier to follow...

> @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  		error = create_error;
>  		goto out_dput;
>  	}
> +	if (have_par_update)
> +		d_unlock_update(dentry);
>  	return dentry;
>  
>  out_dput:
> +	if (have_par_update)
> +		d_unlock_update(dentry);


> @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
>  				struct path *path, unsigned int lookup_flags)

BTW, there's 9 callers of that sucker in the entire tree, along with
3 callers of user_path_create() and 14 callers of done_path_create().
Not a big deal to add the wq in those, especially since it can be easily
split into preparatory patch (with wq passed, but being unused).

> -void done_path_create(struct path *path, struct dentry *dentry)
> +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)

Why "with_wq" and not the wq itself?

> - * The caller must hold dir->i_mutex.
> + * The caller must either hold a write-lock on dir->i_rwsem, or
> + * a have atomically set DCACHE_PAR_UPDATE, or both.

???

> + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.

The latter happens unconditionally here, unless I'm misreading the code (as well
as your cover letter).  It does *NOT* happen on rename(), though, contrary to
the same.  And while your later patch adds it in lock_rename_lookup(), existing
lock_rename() callers do not get that at all.  Likely to be a problem...

> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -13,7 +13,9 @@
>  #include <linux/rcupdate.h>
>  #include <linux/lockref.h>
>  #include <linux/stringhash.h>
> +#include <linux/sched.h>

Bloody hell, man...

> +static inline void d_unlock_update(struct dentry *dentry)
> +{
> +	if (IS_ERR_OR_NULL(dentry))
> +		return;

Do explain...  When could we ever get NULL or ERR_PTR() passed to that?


BTW, I would seriously look into splitting the "let's add a helper
that combines locking parent with __lookup_hash()" into a preliminary
patch.  Would be easier to follow.

next prev parent reply	other threads:[~2022-08-27  3:43 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-26  2:10 [PATCH/RFC 00/10 v5] Improve scalability of directory operations NeilBrown
2022-08-26  2:10 ` [PATCH 03/10] VFS: move want_write checks into lookup_hash_update() NeilBrown
2022-08-27  3:48   ` Al Viro
2022-08-26  2:10 ` [PATCH 09/10] VFS: add LOOKUP_SILLY_RENAME NeilBrown
2022-08-27  1:21   ` Al Viro
2022-08-29  3:15     ` NeilBrown
2022-08-26  2:10 ` [PATCH 04/10] VFS: move dput() and mnt_drop_write() into done_path_update() NeilBrown
2022-08-26  2:10 ` [PATCH 02/10] VFS: move EEXIST and ENOENT tests into lookup_hash_update() NeilBrown
2022-08-26  2:10 ` [PATCH 08/10] NFSD: allow parallel creates from nfsd NeilBrown
2022-08-27  4:37   ` Al Viro
2022-08-29  3:12     ` NeilBrown
2022-08-26  2:10 ` [PATCH 01/10] VFS: support parallel updates in the one directory NeilBrown
2022-08-26 19:06   ` Linus Torvalds
2022-08-26 23:06     ` NeilBrown
2022-08-27  0:13       ` Linus Torvalds
2022-08-27  0:23         ` Al Viro
2022-08-27 21:14         ` Al Viro
2022-08-27  0:17     ` Al Viro
2022-09-01  0:31       ` NeilBrown
2022-09-01  3:44         ` Al Viro
2022-08-27  3:43   ` Al Viro [this message]
2022-08-29  1:59     ` NeilBrown
2022-09-03  0:06       ` Al Viro
2022-09-03  1:40         ` NeilBrown
2022-09-03  2:12           ` Al Viro
2022-09-03 17:52             ` Al Viro
2022-09-04 23:33               ` NeilBrown
2022-08-26  2:10 ` [PATCH 07/10] VFS: hold DCACHE_PAR_UPDATE lock across d_revalidate() NeilBrown
2022-08-26  2:10 ` [PATCH 05/10] VFS: export done_path_update() NeilBrown
2022-08-26  2:10 ` [PATCH 10/10] NFS: support parallel updates in the one directory NeilBrown
2022-08-26 15:31   ` John Stoffel
2022-08-26 23:13     ` NeilBrown
2022-08-26  2:10 ` [PATCH 06/10] VFS: support concurrent renames NeilBrown
2022-08-27  4:12   ` Al Viro
2022-08-29  3:08     ` NeilBrown
2022-08-26 14:42 ` [PATCH/RFC 00/10 v5] Improve scalability of directory operations John Stoffel
2022-08-26 23:30   ` NeilBrown

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YwmS63X3Sm4bhlcT@ZenIV \
    --to=viro@zeniv.linux.org.uk \
    --cc=chuck.lever@oracle.com \
    --cc=daire@dneg.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=torvalds@linux-foundation.org \
    --cc=trond.myklebust@hammerspace.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.