[PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory

* [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
@ 2025-02-06  5:42 NeilBrown
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
                   ` (21 more replies)
  0 siblings, 22 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

This is my latest attempt at removing the requirement for an exclusive
lock on a directory while performing updates in it.  This version,
inspired by Dave Chinner, goes a step further and allows async updates.

The inode operation still requires the inode lock, at least a shared
lock, but may return -EINPROGRESS and then continue asynchronously
without needing any ongoing lock on the directory.

An exclusive lock on the dentry is held across the entire operation.

This change requires various extra checks.  rmdir must ensure there is
no async creation still happening.  rename between directories must
ensure none of the relevant ancestors are undergoing async rename.  There
may be other checks that I need to consider - mounting?

One other important change since my previous posting is that I've
dropped the idea of taking a separate exclusive lock on the directory
when the fs doesn't support shared locking.  This cannot work as it
doesn't prevent lookups and filesystems don't expect a lookup while
they are changing a directory.  So instead we need to choose between
exclusive or shared for the inode on a case-by-case basis.

To make this choice we divide all ops into four groups: create, remove,
rename, open/create.  If an inode has no operations in the group that
require an exclusive lock, then a flag is set on the inode so that
various code knows that a shared lock is sufficient.  If the flag is not
set, an exclusive lock is obtained.
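
As a rough userspace illustration of the per-group choice described
above (not code from this series), the decision can be modelled as one
bit per operation group.  All names here (demo_inode, demo_mark_groups,
demo_lock_mode, the GRP_* values) are hypothetical and exist only for
this sketch; the real flags and helpers are defined by the patches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the four operation groups named above. */
enum op_group { GRP_CREATE, GRP_REMOVE, GRP_RENAME, GRP_OPEN_CREATE, GRP_COUNT };
enum lock_mode { LOCK_SHARED, LOCK_EXCLUSIVE };

/* Stand-in for an inode: one "shared lock is sufficient" bit per group. */
struct demo_inode {
	unsigned int shared_ok;
};

/*
 * Set the bit for every group in which no operation requires an
 * exclusive lock.  needs_excl[g] is an assumption of this sketch: it
 * says whether any op in group g still needs the exclusive lock.
 */
static void demo_mark_groups(struct demo_inode *inode,
			     const bool needs_excl[GRP_COUNT])
{
	inode->shared_ok = 0;
	for (int g = 0; g < GRP_COUNT; g++)
		if (!needs_excl[g])
			inode->shared_ok |= 1u << g;
}

/* Flag set: shared lock is enough.  Flag clear: take the exclusive lock. */
static enum lock_mode demo_lock_mode(const struct demo_inode *inode,
				     enum op_group g)
{
	assert(g < GRP_COUNT);
	return (inode->shared_ok & (1u << g)) ? LOCK_SHARED : LOCK_EXCLUSIVE;
}
```

The point of the model is that the choice is made per group, so a
filesystem that has converted only some of its operations would still
get shared locking for those.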

I've also added rename handling and converted NFS to use all _async ops.

The motivation for this comes from the general increase in scale of
systems.  We can support very large directories and many-core systems,
and applications that choose to use large directories can hit
unnecessary contention.

NFS can easily hit this when used over a high-latency link.
Lustre already has code to allow concurrent directory updates in the
back-end filesystem (ldiskfs - a slightly modified ext4).
Lustre developers believe this would also benefit the client-side
filesystem with large core counts.

The idea behind the async support is to eventually connect this to
io_uring so that one process can launch several concurrent directory
operations.  I have not looked deeply into io_uring and cannot be
certain that the interface I've provided will be able to be used.  I
would welcome any advice on that matter, though I hope to find time to
explore myself.  For now if any _async op returns -EINPROGRESS we simply
wait for the callback to indicate completion.
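
The "return -EINPROGRESS, then wait for the callback" fallback can be
sketched in userspace as below.  This is a single-threaded model, not
the interface from the patches: demo_op_async, demo_backend_progress,
demo_op_sync and the callback type are all hypothetical, and the loop
stands in for sleeping until the completion callback fires.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical completion callback: called once with the final result. */
typedef void (*done_fn)(void *ctx, int result);

/* At most one outstanding operation, to keep the sketch small. */
struct pending_op {
	bool active;
	done_fn done;
	void *ctx;
};
static struct pending_op pending;

/*
 * Model of an _async op: either finishes immediately (returns 0 and
 * never calls the callback) or queues the work and returns -EINPROGRESS.
 */
static int demo_op_async(bool defer, done_fn done, void *ctx)
{
	if (!defer)
		return 0;
	pending.active = true;
	pending.done = done;
	pending.ctx = ctx;
	return -EINPROGRESS;
}

/* Model of the back end finishing the queued work at some later time. */
static void demo_backend_progress(void)
{
	if (pending.active) {
		assert(pending.done != NULL);
		pending.active = false;
		pending.done(pending.ctx, 0);
	}
}

struct waiter {
	bool done;
	int result;
};

static void waiter_done(void *ctx, int result)
{
	struct waiter *w = ctx;

	w->result = result;
	w->done = true;
}

/* Synchronous wrapper: on -EINPROGRESS, wait until the callback runs. */
static int demo_op_sync(bool defer)
{
	struct waiter w = { .done = false, .result = 0 };
	int ret = demo_op_async(defer, waiter_done, &w);

	if (ret != -EINPROGRESS)
		return ret;
	while (!w.done)
		demo_backend_progress();
	return w.result;
}
```

A caller that wants real concurrency would launch several such
operations before waiting on any of them, rather than using the
synchronous wrapper.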

Test status:  only light testing.  It doesn't easily blow up, but lockdep
complains that repeated calls to d_update_wait() are bad, even though
it has balanced acquire and release calls. Weird?

Thanks,
NeilBrown

 [PATCH 01/19] VFS: introduce vfs_mkdir_return()
 [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
 [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
 [PATCH 04/19] VFS: change kern_path_locked() and
 [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
 [PATCH 06/19] VFS: repack DENTRY_ flags.
 [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
 [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
 [PATCH 09/19] VFS: add _async versions of the various directory
 [PATCH 10/19] VFS: introduce inode flags to report locking needs for
 [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
 [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
 [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
 [PATCH 14/19] VFS: Ensure no async updates happening in directory
 [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
 [PATCH 16/19] VFS: add lookup_and_lock_rename()
 [PATCH 17/19] nfsd: use lookup_and_lock_one() and
 [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
 [PATCH 19/19] nfs: switch to _async for all directory ops.

* [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 12:24   ` Christian Brauner
                     ` (2 more replies)
  2025-02-06  5:42 ` [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
                   ` (20 subsequent siblings)
  21 siblings, 3 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

vfs_mkdir() does not guarantee to make the child dentry positive on
success.  It may leave it negative and then the caller needs to perform a
lookup to find the target dentry.

This patch introduces vfs_mkdir_return() which performs the lookup if
needed so that this code is centralised.

This prepares for a new inode operation which will perform mkdir and
return the correct dentry.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/cachefiles/namei.c    |  7 +---
 fs/namei.c               | 69 ++++++++++++++++++++++++++++++++++++++++
 fs/nfsd/vfs.c            | 21 ++----------
 fs/overlayfs/dir.c       | 33 +------------------
 fs/overlayfs/overlayfs.h | 10 +++---
 fs/overlayfs/super.c     |  2 +-
 fs/smb/server/vfs.c      | 24 +++-----------
 include/linux/fs.h       |  2 ++
 8 files changed, 86 insertions(+), 82 deletions(-)

diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
index 7cf59713f0f7..3c866c3b9534 100644
--- a/fs/cachefiles/namei.c
+++ b/fs/cachefiles/namei.c
@@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
 	/* search the current directory for the element name */
 	inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
 
-retry:
 	ret = cachefiles_inject_read_error();
 	if (ret == 0)
 		subdir = lookup_one_len(dirname, dir, strlen(dirname));
@@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
 			goto mkdir_error;
 		ret = cachefiles_inject_write_error();
 		if (ret == 0)
-			ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
+			ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
 		if (ret < 0) {
 			trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
 						   cachefiles_trace_mkdir_error);
@@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
 		}
 		trace_cachefiles_mkdir(dir, subdir);
 
-		if (unlikely(d_unhashed(subdir))) {
-			cachefiles_put_directory(subdir);
-			goto retry;
-		}
 		ASSERT(d_backing_inode(subdir));
 
 		_debug("mkdir -> %pd{ino=%lu}",
diff --git a/fs/namei.c b/fs/namei.c
index 3ab9440c5b93..d98caf36e867 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 }
 EXPORT_SYMBOL(vfs_mkdir);
 
+/**
+ * vfs_mkdir_return - create directory returning correct dentry
+ * @idmap:	idmap of the mount the inode was found from
+ * @dir:	inode of the parent directory
+ * @dentryp:	pointer to dentry of the child directory
+ * @mode:	mode of the child directory
+ *
+ * Create a directory.
+ *
+ * If the inode has been found through an idmapped mount the idmap of
+ * the vfsmount must be passed through @idmap. This function will then take
+ * care to map the inode according to @idmap before checking permissions.
+ * On non-idmapped mounts or if permission checking is to be performed on the
+ * raw inode simply pass @nop_mnt_idmap.
+ *
+ * The filesystem may not use the dentry that was passed in.  In that case
+ * the passed-in dentry is put and a new one is placed in *@dentryp;
+ * So on successful return *@dentryp will always be positive.
+ */
+int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
+		     struct dentry **dentryp, umode_t mode)
+{
+	struct dentry *dentry = *dentryp;
+	int error;
+	unsigned max_links = dir->i_sb->s_max_links;
+
+	error = may_create(idmap, dir, dentry);
+	if (error)
+		return error;
+
+	if (!dir->i_op->mkdir)
+		return -EPERM;
+
+	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
+	error = security_inode_mkdir(dir, dentry, mode);
+	if (error)
+		return error;
+
+	if (max_links && dir->i_nlink >= max_links)
+		return -EMLINK;
+
+	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+	if (!error) {
+		fsnotify_mkdir(dir, dentry);
+		if (unlikely(d_unhashed(dentry))) {
+			struct dentry *d;
+			/* Need a "const" pointer.  We know d_name is const
+			 * because we hold an exclusive lock on i_rwsem
+			 * in d_parent.
+			 */
+			const struct qstr *d_name = (void*)&dentry->d_name;
+			d = lookup_dcache(d_name, dentry->d_parent, 0);
+			if (!d)
+				d = __lookup_slow(d_name, dentry->d_parent, 0);
+			if (IS_ERR(d)) {
+				error = PTR_ERR(d);
+			} else if (unlikely(d_is_negative(d))) {
+				dput(d);
+				error = -ENOENT;
+			} else {
+				dput(dentry);
+				*dentryp = d;
+			}
+		}
+	}
+	return error;
+}
+EXPORT_SYMBOL(vfs_mkdir_return);
+
 int do_mkdirat(int dfd, struct filename *name, umode_t mode)
 {
 	struct dentry *dentry;
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 29cb7b812d71..740332413138 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1488,26 +1488,11 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
 			nfsd_check_ignore_resizing(iap);
 		break;
 	case S_IFDIR:
-		host_err = vfs_mkdir(&nop_mnt_idmap, dirp, dchild, iap->ia_mode);
-		if (!host_err && unlikely(d_unhashed(dchild))) {
-			struct dentry *d;
-			d = lookup_one_len(dchild->d_name.name,
-					   dchild->d_parent,
-					   dchild->d_name.len);
-			if (IS_ERR(d)) {
-				host_err = PTR_ERR(d);
-				break;
-			}
-			if (unlikely(d_is_negative(d))) {
-				dput(d);
-				err = nfserr_serverfault;
-				goto out;
-			}
+		host_err = vfs_mkdir_return(&nop_mnt_idmap, dirp, &dchild, iap->ia_mode);
+		if (!host_err && unlikely(dchild != resfhp->fh_dentry)) {
 			dput(resfhp->fh_dentry);
-			resfhp->fh_dentry = dget(d);
+			resfhp->fh_dentry = dget(dchild);
 			err = fh_update(resfhp);
-			dput(dchild);
-			dchild = d;
 			if (err)
 				goto out;
 		}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index c9993ff66fc2..e6c54c6ef0f5 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -138,37 +138,6 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
 	goto out;
 }
 
-int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
-		   struct dentry **newdentry, umode_t mode)
-{
-	int err;
-	struct dentry *d, *dentry = *newdentry;
-
-	err = ovl_do_mkdir(ofs, dir, dentry, mode);
-	if (err)
-		return err;
-
-	if (likely(!d_unhashed(dentry)))
-		return 0;
-
-	/*
-	 * vfs_mkdir() may succeed and leave the dentry passed
-	 * to it unhashed and negative. If that happens, try to
-	 * lookup a new hashed and positive dentry.
-	 */
-	d = ovl_lookup_upper(ofs, dentry->d_name.name, dentry->d_parent,
-			     dentry->d_name.len);
-	if (IS_ERR(d)) {
-		pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
-			dentry, err);
-		return PTR_ERR(d);
-	}
-	dput(dentry);
-	*newdentry = d;
-
-	return 0;
-}
-
 struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
 			       struct dentry *newdentry, struct ovl_cattr *attr)
 {
@@ -191,7 +160,7 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
 
 		case S_IFDIR:
 			/* mkdir is special... */
-			err =  ovl_mkdir_real(ofs, dir, &newdentry, attr->mode);
+			err =  ovl_do_mkdir(ofs, dir, &newdentry, attr->mode);
 			break;
 
 		case S_IFCHR:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 0021e2025020..967870f12482 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -242,11 +242,11 @@ static inline int ovl_do_create(struct ovl_fs *ofs,
 }
 
 static inline int ovl_do_mkdir(struct ovl_fs *ofs,
-			       struct inode *dir, struct dentry *dentry,
+			       struct inode *dir, struct dentry **dentry,
 			       umode_t mode)
 {
-	int err = vfs_mkdir(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
-	pr_debug("mkdir(%pd2, 0%o) = %i\n", dentry, mode, err);
+	int err = vfs_mkdir_return(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
+	pr_debug("mkdir(%pd2, 0%o) = %i\n", *dentry, mode, err);
 	return err;
 }
 
@@ -838,8 +838,8 @@ struct ovl_cattr {
 
 #define OVL_CATTR(m) (&(struct ovl_cattr) { .mode = (m) })
 
-int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
-		   struct dentry **newdentry, umode_t mode);
+int ovl_do_mkdir(struct ovl_fs *ofs, struct inode *dir,
+	      struct dentry **newdentry, umode_t mode);
 struct dentry *ovl_create_real(struct ovl_fs *ofs,
 			       struct inode *dir, struct dentry *newdentry,
 			       struct ovl_cattr *attr);
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 86ae6f6da36b..06ca8b01c336 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -327,7 +327,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
 			goto retry;
 		}
 
-		err = ovl_mkdir_real(ofs, dir, &work, attr.ia_mode);
+		err = ovl_do_mkdir(ofs, dir, &work, attr.ia_mode);
 		if (err)
 			goto out_dput;
 
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index 6890016e1923..4e580bb7baf8 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -211,7 +211,7 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
 {
 	struct mnt_idmap *idmap;
 	struct path path;
-	struct dentry *dentry;
+	struct dentry *dentry, *d;
 	int err;
 
 	dentry = ksmbd_vfs_kern_path_create(work, name,
@@ -227,27 +227,11 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
 
 	idmap = mnt_idmap(path.mnt);
 	mode |= S_IFDIR;
-	err = vfs_mkdir(idmap, d_inode(path.dentry), dentry, mode);
-	if (!err && d_unhashed(dentry)) {
-		struct dentry *d;
-
-		d = lookup_one(idmap, dentry->d_name.name, dentry->d_parent,
-			       dentry->d_name.len);
-		if (IS_ERR(d)) {
-			err = PTR_ERR(d);
-			goto out_err;
-		}
-		if (unlikely(d_is_negative(d))) {
-			dput(d);
-			err = -ENOENT;
-			goto out_err;
-		}
-
+	d = dentry;
+	err = vfs_mkdir_return(idmap, d_inode(path.dentry), &dentry, mode);
+	if (!err && dentry != d)
 		ksmbd_vfs_inherit_owner(work, d_inode(path.dentry), d_inode(d));
-		dput(d);
-	}
 
-out_err:
 	done_path_create(&path, dentry);
 	if (err)
 		pr_err("mkdir(%s): creation failed (err:%d)\n", name, err);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index be3ad155ec9f..f81d6bc65fe4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1971,6 +1971,8 @@ int vfs_create(struct mnt_idmap *, struct inode *,
 	       struct dentry *, umode_t, bool);
 int vfs_mkdir(struct mnt_idmap *, struct inode *,
 	      struct dentry *, umode_t);
+int vfs_mkdir_return(struct mnt_idmap *, struct inode *,
+		     struct dentry **, umode_t);
 int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
               umode_t, dev_t);
 int vfs_symlink(struct mnt_idmap *, struct inode *,
-- 
2.47.1


* [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-07 19:32   ` Al Viro
  2025-02-06  5:42 ` [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it NeilBrown
                   ` (19 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

d_alloc_parallel() currently requires a wait_queue_head to be passed in.
This must have a lifetime which extends until the lookup is completed.

Future proposed patches will use d_alloc_parallel() for names being
created/unlinked etc.  Some filesystems combine lookup with create,
which lengthens the code path over which the wq must remain live.  If
it is still to be allocated on-stack this can be cumbersome.

This patch replaces the on-stack wqs with a global array of wqs which
are used as needed.  A wq is NOT allocated when a dentry is first
created but only when a second thread attempts to use the same name and
so is forced to wait.  At this moment a wq is chosen using the
least-significant bits of the task's pid and that wq is assigned to
->d_wait.  The ->d_lock is then dropped and the task waits.

When the dentry is finally moved out of "in_lookup" a wake up is only
sent if ->d_wait is not NULL.  This avoids an (uncontended) spin
lock/unlock which saves a couple of atomic operations in a common case.

The wake up passes the dentry that the wake up is for as the "key" and
the wake function will only wake processes waiting on the same key.  This
means that when these global waitqueues are shared (which is inevitable
though unlikely to be frequent), a task will not be woken prematurely.
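
The effect of the keyed wake on a shared queue can be shown with a
small userspace model.  It is only a sketch under stated assumptions:
the demo_* names are hypothetical, there is no real sleeping or
locking, and "woken" is just a flag.  What it does mirror from the
description above is the table size of 256, picking a queue from the
low bits of the pid, and waking only waiters whose key matches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_WQS 256	/* same table size as PAR_LOOKUP_WQS in the patch */

/* Hypothetical waiter: records which object (key) it is waiting for. */
struct demo_waiter {
	const void *key;
	bool woken;
	struct demo_waiter *next;
};

struct demo_wq {
	struct demo_waiter *head;
};

static struct demo_wq demo_table[DEMO_WQS];

/* Pick a shared queue from the least-significant bits of the pid. */
static struct demo_wq *demo_pick_wq(int pid)
{
	return &demo_table[(unsigned int)pid % DEMO_WQS];
}

static void demo_add_waiter(struct demo_wq *wq, struct demo_waiter *w,
			    const void *key)
{
	assert(wq != NULL && w != NULL);
	w->key = key;
	w->woken = false;
	w->next = wq->head;
	wq->head = w;
}

/*
 * Keyed wake: only waiters whose key matches are marked woken, so
 * unrelated waiters sharing the queue are left alone.  Returns the
 * number of waiters woken.
 */
static int demo_wake(struct demo_wq *wq, const void *key)
{
	int n = 0;

	for (struct demo_waiter *w = wq->head; w; w = w->next) {
		if (w->key == key && !w->woken) {
			w->woken = true;
			n++;
		}
	}
	return n;
}
```

Two tasks whose pids are equal modulo 256 land on the same queue, yet
a wake for one dentry leaves the waiter on the other dentry untouched.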

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/afs/dir_silly.c      |  4 +--
 fs/dcache.c             | 69 +++++++++++++++++++++++++++++++++--------
 fs/fuse/readdir.c       |  3 +-
 fs/namei.c              |  6 ++--
 fs/nfs/dir.c            |  6 ++--
 fs/nfs/unlink.c         |  3 +-
 fs/proc/base.c          |  3 +-
 fs/proc/proc_sysctl.c   |  3 +-
 fs/smb/client/readdir.c |  3 +-
 include/linux/dcache.h  |  3 +-
 include/linux/nfs_xdr.h |  1 -
 11 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/fs/afs/dir_silly.c b/fs/afs/dir_silly.c
index a1e581946b93..aa4363a1c6fa 100644
--- a/fs/afs/dir_silly.c
+++ b/fs/afs/dir_silly.c
@@ -239,13 +239,11 @@ int afs_silly_iput(struct dentry *dentry, struct inode *inode)
 	struct dentry *alias;
 	int ret;
 
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-
 	_enter("%p{%pd},%llx", dentry, dentry, vnode->fid.vnode);
 
 	down_read(&dvnode->rmdir_lock);
 
-	alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
+	alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name);
 	if (IS_ERR(alias)) {
 		up_read(&dvnode->rmdir_lock);
 		return 0;
diff --git a/fs/dcache.c b/fs/dcache.c
index 96b21a47312e..e49607d00d2d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2095,8 +2095,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		return found;
 	}
 	if (d_in_lookup(dentry)) {
-		found = d_alloc_parallel(dentry->d_parent, name,
-					dentry->d_wait);
+		found = d_alloc_parallel(dentry->d_parent, name);
 		if (IS_ERR(found) || !d_in_lookup(found)) {
 			iput(inode);
 			return found;
@@ -2106,7 +2105,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		if (!found) {
 			iput(inode);
 			return ERR_PTR(-ENOMEM);
-		} 
+		}
 	}
 	res = d_splice_alias(inode, found);
 	if (res) {
@@ -2476,30 +2475,70 @@ static inline unsigned start_dir_add(struct inode *dir)
 }
 
 static inline void end_dir_add(struct inode *dir, unsigned int n,
-			       wait_queue_head_t *d_wait)
+			       wait_queue_head_t *d_wait, struct dentry *de)
 {
 	smp_store_release(&dir->i_dir_seq, n + 2);
 	preempt_enable_nested();
-	wake_up_all(d_wait);
+	if (d_wait)
+		__wake_up(d_wait, TASK_NORMAL, 0, de);
+}
+
+#define	PAR_LOOKUP_WQS	256
+static wait_queue_head_t par_wait_table[PAR_LOOKUP_WQS] __cacheline_aligned;
+
+static int __init par_wait_init(void)
+{
+	int i;
+
+	for (i = 0; i < PAR_LOOKUP_WQS; i++)
+		init_waitqueue_head(&par_wait_table[i]);
+	return 0;
+}
+fs_initcall(par_wait_init);
+
+struct par_wait_key {
+	struct dentry *de;
+	struct wait_queue_entry wqe;
+};
+
+static int d_wait_wake_fn(struct wait_queue_entry *wq_entry,
+			  unsigned mode, int sync, void *key)
+{
+	struct par_wait_key *pwk = container_of(wq_entry,
+						 struct par_wait_key, wqe);
+	if (pwk->de == key)
+		return default_wake_function(wq_entry, mode, sync, key);
+	return 0;
 }
 
 static void d_wait_lookup(struct dentry *dentry)
 {
 	if (d_in_lookup(dentry)) {
-		DECLARE_WAITQUEUE(wait, current);
-		add_wait_queue(dentry->d_wait, &wait);
+		struct par_wait_key wk = {
+			.de = dentry,
+			.wqe = {
+				.private = current,
+				.func = d_wait_wake_fn,
+			},
+		};
+		struct wait_queue_head *wq;
+		if (!dentry->d_wait)
+			dentry->d_wait = &par_wait_table[current->pid %
+							 PAR_LOOKUP_WQS];
+		wq = dentry->d_wait;
+		add_wait_queue(wq, &wk.wqe);
 		do {
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			spin_unlock(&dentry->d_lock);
 			schedule();
 			spin_lock(&dentry->d_lock);
 		} while (d_in_lookup(dentry));
+		remove_wait_queue(wq, &wk.wqe);
 	}
 }
 
 struct dentry *d_alloc_parallel(struct dentry *parent,
-				const struct qstr *name,
-				wait_queue_head_t *wq)
+				const struct qstr *name)
 {
 	unsigned int hash = name->hash;
 	struct hlist_bl_head *b = in_lookup_hash(parent, hash);
@@ -2596,7 +2635,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
 	rcu_read_unlock();
 	/* we can't take ->d_lock here; it's OK, though. */
 	new->d_flags |= DCACHE_PAR_LOOKUP;
-	new->d_wait = wq;
+	new->d_wait = NULL;
 	hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
 	hlist_bl_unlock(b);
 	return new;
@@ -2633,8 +2672,12 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
 
 void __d_lookup_unhash_wake(struct dentry *dentry)
 {
+	wait_queue_head_t *d_wait;
+
 	spin_lock(&dentry->d_lock);
-	wake_up_all(__d_lookup_unhash(dentry));
+	d_wait = __d_lookup_unhash(dentry);
+	if (d_wait)
+		__wake_up(d_wait, TASK_NORMAL, 0, dentry);
 	spin_unlock(&dentry->d_lock);
 }
 EXPORT_SYMBOL(__d_lookup_unhash_wake);
@@ -2662,7 +2705,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 	}
 	__d_rehash(dentry);
 	if (dir)
-		end_dir_add(dir, n, d_wait);
+		end_dir_add(dir, n, d_wait, dentry);
 	spin_unlock(&dentry->d_lock);
 	if (inode)
 		spin_unlock(&inode->i_lock);
@@ -2874,7 +2917,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_end(&dentry->d_seq);
 
 	if (dir)
-		end_dir_add(dir, n, d_wait);
+		end_dir_add(dir, n, d_wait, target);
 
 	if (dentry->d_parent != old_parent)
 		spin_unlock(&dentry->d_parent->d_lock);
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index 17ce9636a2b1..c6b646a3f1bd 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -160,7 +160,6 @@ static int fuse_direntplus_link(struct file *file,
 	struct inode *dir = d_inode(parent);
 	struct fuse_conn *fc;
 	struct inode *inode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if (!o->nodeid) {
 		/*
@@ -195,7 +194,7 @@ static int fuse_direntplus_link(struct file *file,
 	dentry = d_lookup(parent, &name);
 	if (!dentry) {
 retry:
-		dentry = d_alloc_parallel(parent, &name, &wq);
+		dentry = d_alloc_parallel(parent, &name);
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 	}
diff --git a/fs/namei.c b/fs/namei.c
index d98caf36e867..5cdbd2eb4056 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1769,13 +1769,12 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 {
 	struct dentry *dentry, *old;
 	struct inode *inode = dir->d_inode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	/* Don't go there if it's already dead */
 	if (unlikely(IS_DEADDIR(inode)))
 		return ERR_PTR(-ENOENT);
 again:
-	dentry = d_alloc_parallel(dir, name, &wq);
+	dentry = d_alloc_parallel(dir, name);
 	if (IS_ERR(dentry))
 		return dentry;
 	if (unlikely(!d_in_lookup(dentry))) {
@@ -3561,7 +3560,6 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	struct dentry *dentry;
 	int error, create_error = 0;
 	umode_t mode = op->mode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if (unlikely(IS_DEADDIR(dir_inode)))
 		return ERR_PTR(-ENOENT);
@@ -3570,7 +3568,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	dentry = d_lookup(dir, &nd->last);
 	for (;;) {
 		if (!dentry) {
-			dentry = d_alloc_parallel(dir, &nd->last, &wq);
+			dentry = d_alloc_parallel(dir, &nd->last);
 			if (IS_ERR(dentry))
 				return dentry;
 		}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2b04038b0e40..27c7a5c4e91b 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -725,7 +725,6 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		unsigned long dir_verifier)
 {
 	struct qstr filename = QSTR_INIT(entry->name, entry->len);
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct dentry *dentry;
 	struct dentry *alias;
 	struct inode *inode;
@@ -754,7 +753,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 	dentry = d_lookup(parent, &filename);
 again:
 	if (!dentry) {
-		dentry = d_alloc_parallel(parent, &filename, &wq);
+		dentry = d_alloc_parallel(parent, &filename);
 		if (IS_ERR(dentry))
 			return;
 	}
@@ -2059,7 +2058,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		    struct file *file, unsigned open_flags,
 		    umode_t mode)
 {
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct nfs_open_context *ctx;
 	struct dentry *res;
 	struct iattr attr = { .ia_valid = ATTR_OPEN };
@@ -2115,7 +2113,7 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		d_drop(dentry);
 		switched = true;
 		dentry = d_alloc_parallel(dentry->d_parent,
-					  &dentry->d_name, &wq);
+					  &dentry->d_name);
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 		if (unlikely(!d_in_lookup(dentry)))
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index bf77399696a7..d44162d3a8f1 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -124,7 +124,7 @@ static int nfs_call_unlink(struct dentry *dentry, struct inode *inode, struct nf
 	struct dentry *alias;
 
 	down_read_non_owner(&NFS_I(dir)->rmdir_sem);
-	alias = d_alloc_parallel(dentry->d_parent, &data->args.name, &data->wq);
+	alias = d_alloc_parallel(dentry->d_parent, &data->args.name);
 	if (IS_ERR(alias)) {
 		up_read_non_owner(&NFS_I(dir)->rmdir_sem);
 		return 0;
@@ -185,7 +185,6 @@ nfs_async_unlink(struct dentry *dentry, const struct qstr *name)
 
 	data->cred = get_current_cred();
 	data->res.dir_attr = &data->dir_attr;
-	init_waitqueue_head(&data->wq);
 
 	status = -EBUSY;
 	spin_lock(&dentry->d_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index cd89e956c322..c8bcbdac87d5 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2126,8 +2126,7 @@ bool proc_fill_cache(struct file *file, struct dir_context *ctx,
 
 	child = d_hash_and_lookup(dir, &qname);
 	if (!child) {
-		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-		child = d_alloc_parallel(dir, &qname, &wq);
+		child = d_alloc_parallel(dir, &qname);
 		if (IS_ERR(child))
 			goto end_instantiate;
 		if (d_in_lookup(child)) {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..9f1088f138f4 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -693,8 +693,7 @@ static bool proc_sys_fill_cache(struct file *file,
 
 	child = d_lookup(dir, &qname);
 	if (!child) {
-		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-		child = d_alloc_parallel(dir, &qname, &wq);
+		child = d_alloc_parallel(dir, &qname);
 		if (IS_ERR(child))
 			return false;
 		if (d_in_lookup(child)) {
diff --git a/fs/smb/client/readdir.c b/fs/smb/client/readdir.c
index 50f96259d9ad..39d8a18cd443 100644
--- a/fs/smb/client/readdir.c
+++ b/fs/smb/client/readdir.c
@@ -73,7 +73,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
 	struct cifs_sb_info *cifs_sb = CIFS_SB(sb);
 	bool posix = cifs_sb_master_tcon(cifs_sb)->posix_extensions;
 	bool reparse_need_reval = false;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	int rc;
 
 	cifs_dbg(FYI, "%s: for %s\n", __func__, name->name);
@@ -105,7 +104,7 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
 		    (fattr->cf_flags & CIFS_FATTR_NEED_REVAL))
 			return;
 
-		dentry = d_alloc_parallel(parent, name, &wq);
+		dentry = d_alloc_parallel(parent, name);
 	}
 	if (IS_ERR(dentry))
 		return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4afb60365675..b03cbb0177a3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -247,8 +247,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_anon(struct super_block *);
-extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
-					wait_queue_head_t *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 9155a6ffc370..d0473e0d4aba 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1731,7 +1731,6 @@ struct nfs_unlinkdata {
 	struct nfs_removeargs args;
 	struct nfs_removeres res;
 	struct dentry *dentry;
-	wait_queue_head_t wq;
 	const struct cred *cred;
 	struct nfs_fattr dir_attr;
 	long timeout;
-- 
2.47.1


* [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
  2025-02-06  5:42 ` [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 14:30   ` Jeff Layton
  2025-02-07 20:01   ` Al Viro
  2025-02-06  5:42 ` [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
                   ` (18 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

lookup_one_qstr_excl() is used for lookups prior to directory
modifications, whether create, unlink, rename, or whatever.

To prepare for allowing modification to happen in parallel, change
lookup_one_qstr_excl() to use d_alloc_parallel().

To reflect this, the name is changed to lookup_one_qstr() - as the
directory may be locked shared.

If any of the "intent" LOOKUP flags are passed, the caller must ensure
d_lookup_done() is called at an appropriate time.  If none are passed
then we can be sure ->lookup() will do a real lookup and d_lookup_done()
is called internally.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c            | 47 +++++++++++++++++++++++++------------------
 fs/smb/server/vfs.c   |  7 ++++---
 include/linux/namei.h |  9 ++++++---
 3 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5cdbd2eb4056..d684102d873d 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1665,15 +1665,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
 }
 
 /*
- * Parent directory has inode locked exclusive.  This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked: exclusive or shared.
+ * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
+ * must be called after the intended operation is performed - or aborted.
  */
-struct dentry *lookup_one_qstr_excl(const struct qstr *name,
-				    struct dentry *base,
-				    unsigned int flags)
+struct dentry *lookup_one_qstr(const struct qstr *name,
+			       struct dentry *base,
+			       unsigned int flags)
 {
 	struct dentry *dentry = lookup_dcache(name, base, flags);
 	struct dentry *old;
@@ -1686,18 +1684,25 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 	if (unlikely(IS_DEADDIR(dir)))
 		return ERR_PTR(-ENOENT);
 
-	dentry = d_alloc(base, name);
-	if (unlikely(!dentry))
+	dentry = d_alloc_parallel(base, name);
+	if (unlikely(IS_ERR_OR_NULL(dentry)))
 		return ERR_PTR(-ENOMEM);
+	if (!d_in_lookup(dentry))
+		/* Raced with another thread which did the lookup */
+		return dentry;
 
 	old = dir->i_op->lookup(dir, dentry, flags);
 	if (unlikely(old)) {
+		d_lookup_done(dentry);
 		dput(dentry);
 		dentry = old;
 	}
+	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
+		/* ->lookup must have given final answer */
+		d_lookup_done(dentry);
 	return dentry;
 }
-EXPORT_SYMBOL(lookup_one_qstr_excl);
+EXPORT_SYMBOL(lookup_one_qstr);
 
 /**
  * lookup_fast - do fast lockless (but racy) lookup of a dentry
@@ -2739,7 +2744,7 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
 		return ERR_PTR(-EINVAL);
 	}
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	d = lookup_one_qstr_excl(&last, path->dentry, 0);
+	d = lookup_one_qstr(&last, path->dentry, 0);
 	if (IS_ERR(d)) {
 		inode_unlock(path->dentry->d_inode);
 		path_put(path);
@@ -4078,8 +4083,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	if (last.name[last.len] && !want_dir)
 		create_flags = 0;
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr_excl(&last, path->dentry,
-				      reval_flag | create_flags);
+	dentry = lookup_one_qstr(&last, path->dentry,
+				 reval_flag | create_flags);
 	if (IS_ERR(dentry))
 		goto unlock;
 
@@ -4103,6 +4108,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	}
 	return dentry;
 fail:
+	d_lookup_done(dentry);
 	dput(dentry);
 	dentry = ERR_PTR(error);
 unlock:
@@ -4508,7 +4514,7 @@ int do_rmdir(int dfd, struct filename *name)
 		goto exit2;
 
 	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
+	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit3;
@@ -4641,7 +4647,7 @@ int do_unlinkat(int dfd, struct filename *name)
 		goto exit2;
 retry_deleg:
 	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
+	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
 
@@ -5231,8 +5237,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 		goto exit_lock_rename;
 	}
 
-	old_dentry = lookup_one_qstr_excl(&old_last, old_path.dentry,
-					  lookup_flags);
+	old_dentry = lookup_one_qstr(&old_last, old_path.dentry,
+				     lookup_flags);
 	error = PTR_ERR(old_dentry);
 	if (IS_ERR(old_dentry))
 		goto exit3;
@@ -5240,8 +5246,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -ENOENT;
 	if (d_is_negative(old_dentry))
 		goto exit4;
-	new_dentry = lookup_one_qstr_excl(&new_last, new_path.dentry,
-					  lookup_flags | target_flags);
+	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
+				     lookup_flags | target_flags);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto exit4;
@@ -5292,6 +5298,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	rd.flags	   = flags;
 	error = vfs_rename(&rd);
 exit5:
+	d_lookup_done(new_dentry);
 	dput(new_dentry);
 exit4:
 	dput(old_dentry);
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index 4e580bb7baf8..89b3823f6405 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -109,7 +109,7 @@ static int ksmbd_vfs_path_lookup_locked(struct ksmbd_share_config *share_conf,
 	}
 
 	inode_lock_nested(parent_path->dentry->d_inode, I_MUTEX_PARENT);
-	d = lookup_one_qstr_excl(&last, parent_path->dentry, 0);
+	d = lookup_one_qstr(&last, parent_path->dentry, 0);
 	if (IS_ERR(d))
 		goto err_out;
 
@@ -726,8 +726,8 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 		ksmbd_fd_put(work, parent_fp);
 	}
 
-	new_dentry = lookup_one_qstr_excl(&new_last, new_path.dentry,
-					  lookup_flags | LOOKUP_RENAME_TARGET);
+	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
+				     lookup_flags | LOOKUP_RENAME_TARGET);
 	if (IS_ERR(new_dentry)) {
 		err = PTR_ERR(new_dentry);
 		goto out3;
@@ -771,6 +771,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 		ksmbd_debug(VFS, "vfs_rename failed err %d\n", err);
 
 out4:
+	d_lookup_done(new_dentry);
 	dput(new_dentry);
 out3:
 	dput(old_parent);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 8ec8fed3bce8..06bb3ea65beb 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -34,6 +34,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_EXCL		0x0400	/* ... in exclusive creation */
 #define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
 
+#define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
+				 LOOKUP_RENAME_TARGET)
+
 /* internal use only */
 #define LOOKUP_PARENT		0x0010
 
@@ -52,9 +55,9 @@ extern int path_pts(struct path *path);
 
 extern int user_path_at(int, const char __user *, unsigned, struct path *);
 
-struct dentry *lookup_one_qstr_excl(const struct qstr *name,
-				    struct dentry *base,
-				    unsigned int flags);
+struct dentry *lookup_one_qstr(const struct qstr *name,
+			       struct dentry *base,
+			       unsigned int flags);
 extern int kern_path(const char *, unsigned, struct path *);
 
 extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int);
-- 
2.47.1



* [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (2 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 12:31   ` Christian Brauner
  2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
                   ` (17 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

No callers of kern_path_locked() or user_path_locked_at() want a
negative dentry.  So change them to return -ENOENT instead.  This
simplifies callers.
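
A userspace sketch of the caller-side effect, assuming a toy dentry
type.  ERR_PTR(), PTR_ERR() and IS_ERR() here are simplified stand-ins
for the kernel helpers of the same name, and toy_path_locked() is
hypothetical.  Reference counting (the dput() in the real change) is
left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ERRNO	4095

/* Simplified userspace versions of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Toy stand-in for struct dentry: only tracks positive/negative. */
struct toy_dentry {
	bool positive;
};

/*
 * Hypothetical model of the new behaviour: a negative result is
 * turned into -ENOENT before the caller ever sees it.
 */
static struct toy_dentry *toy_path_locked(struct toy_dentry *found)
{
	if (!IS_ERR(found) && !found->positive)
		return ERR_PTR(-ENOENT);
	return found;
}
```

A caller then needs only the IS_ERR() test and no separate
positive/negative branch, which is the simplification visible in
dev_rmdir() and audit_get_nd() below.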

This results in a subtle change to bcachefs in that an ioctl will now
return -ENOENT in preference to -EXDEV.  I believe this restores the
behaviour to what it was prior to
 Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking")

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/base/devtmpfs.c | 65 +++++++++++++++++++----------------------
 fs/bcachefs/fs-ioctl.c  |  4 ---
 fs/namei.c              |  4 +++
 kernel/audit_watch.c    | 12 ++++----
 4 files changed, 40 insertions(+), 45 deletions(-)

diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
index b848764ef018..c9e34842139f 100644
--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
 	dentry = kern_path_locked(name, &parent);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
-	if (d_really_is_positive(dentry)) {
-		if (d_inode(dentry)->i_private == &thread)
-			err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
-					dentry);
-		else
-			err = -EPERM;
-	} else {
-		err = -ENOENT;
-	}
+	if (d_inode(dentry)->i_private == &thread)
+		err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
+				dentry);
+	else
+		err = -EPERM;
+
 	dput(dentry);
 	inode_unlock(d_inode(parent.dentry));
 	path_put(&parent);
@@ -310,6 +307,8 @@ static int handle_remove(const char *nodename, struct device *dev)
 {
 	struct path parent;
 	struct dentry *dentry;
+	struct kstat stat;
+	struct path p;
 	int deleted = 0;
 	int err;
 
@@ -317,32 +316,28 @@ static int handle_remove(const char *nodename, struct device *dev)
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
-	if (d_really_is_positive(dentry)) {
-		struct kstat stat;
-		struct path p = {.mnt = parent.mnt, .dentry = dentry};
-		err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
-				  AT_STATX_SYNC_AS_STAT);
-		if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
-			struct iattr newattrs;
-			/*
-			 * before unlinking this node, reset permissions
-			 * of possible references like hardlinks
-			 */
-			newattrs.ia_uid = GLOBAL_ROOT_UID;
-			newattrs.ia_gid = GLOBAL_ROOT_GID;
-			newattrs.ia_mode = stat.mode & ~0777;
-			newattrs.ia_valid =
-				ATTR_UID|ATTR_GID|ATTR_MODE;
-			inode_lock(d_inode(dentry));
-			notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
-			inode_unlock(d_inode(dentry));
-			err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
-					 dentry, NULL);
-			if (!err || err == -ENOENT)
-				deleted = 1;
-		}
-	} else {
-		err = -ENOENT;
+	p.mnt = parent.mnt;
+	p.dentry = dentry;
+	err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
+			  AT_STATX_SYNC_AS_STAT);
+	if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
+		struct iattr newattrs;
+		/*
+		 * before unlinking this node, reset permissions
+		 * of possible references like hardlinks
+		 */
+		newattrs.ia_uid = GLOBAL_ROOT_UID;
+		newattrs.ia_gid = GLOBAL_ROOT_GID;
+		newattrs.ia_mode = stat.mode & ~0777;
+		newattrs.ia_valid =
+			ATTR_UID|ATTR_GID|ATTR_MODE;
+		inode_lock(d_inode(dentry));
+		notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
+		inode_unlock(d_inode(dentry));
+		err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
+				 dentry, NULL);
+		if (!err || err == -ENOENT)
+			deleted = 1;
 	}
 	dput(dentry);
 	inode_unlock(d_inode(parent.dentry));
diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
index 15725b4ce393..595b57fabc9a 100644
--- a/fs/bcachefs/fs-ioctl.c
+++ b/fs/bcachefs/fs-ioctl.c
@@ -511,10 +511,6 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
 		ret = -EXDEV;
 		goto err;
 	}
-	if (!d_is_positive(victim)) {
-		ret = -ENOENT;
-		goto err;
-	}
 	ret = __bch2_unlink(dir, victim, true);
 	if (!ret) {
 		fsnotify_rmdir(dir, victim);
diff --git a/fs/namei.c b/fs/namei.c
index d684102d873d..1901120bcbb8 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2745,6 +2745,10 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
 	}
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
 	d = lookup_one_qstr(&last, path->dentry, 0);
+	if (!IS_ERR(d) && d_is_negative(d)) {
+		dput(d);
+		d = ERR_PTR(-ENOENT);
+	}
 	if (IS_ERR(d)) {
 		inode_unlock(path->dentry->d_inode);
 		path_put(path);
diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
index 7f358740e958..e3130675ee6b 100644
--- a/kernel/audit_watch.c
+++ b/kernel/audit_watch.c
@@ -350,11 +350,10 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
 	struct dentry *d = kern_path_locked(watch->path, parent);
 	if (IS_ERR(d))
 		return PTR_ERR(d);
-	if (d_is_positive(d)) {
-		/* update watch filter fields */
-		watch->dev = d->d_sb->s_dev;
-		watch->ino = d_backing_inode(d)->i_ino;
-	}
+	/* update watch filter fields */
+	watch->dev = d->d_sb->s_dev;
+	watch->ino = d_backing_inode(d)->i_ino;
+
 	inode_unlock(d_backing_inode(parent->dentry));
 	dput(d);
 	return 0;
@@ -419,7 +418,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
 	/* caller expects mutex locked */
 	mutex_lock(&audit_filter_mutex);
 
-	if (ret) {
+	if (ret && ret != -ENOENT) {
 		audit_put_watch(watch);
 		return ret;
 	}
@@ -438,6 +437,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
 
 	h = audit_hash_ino((u32)watch->ino);
 	*list = &audit_inode_hash[h];
+	ret = 0;
 error:
 	path_put(&parent_path);
 	audit_put_watch(watch);
-- 
2.47.1



* [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (3 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 12:33   ` Christian Brauner
                     ` (2 more replies)
  2025-02-06  5:42 ` [PATCH 06/19] VFS: repack DENTRY_ flags NeilBrown
                   ` (16 subsequent siblings)
  21 siblings, 3 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

Callers of lookup_one_qstr() often check if the result is negative or
positive.
These checks can easily be moved into lookup_one_qstr() by checking the
lookup flags:
LOOKUP_CREATE means it is NOT an error if the name doesn't exist.
LOOKUP_EXCL means it IS an error if the name DOES exist.

This patch adds these checks, then removes error checks from callers,
and ensures that appropriate flags are passed.
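
The two rules above can be sketched as a small decision function.  This
is a userspace illustration only: lookup_result_error() is a
hypothetical name, and the bool argument stands in for
d_is_positive(dentry).  Flag values are those in include/linux/namei.h
at this point in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define LOOKUP_CREATE	0x0200
#define LOOKUP_EXCL	0x0400

/*
 * Hypothetical helper: returns 0 if the lookup result is acceptable
 * for the given flags, or the negative errno that lookup_one_qstr()
 * would hand back instead of the dentry.
 */
static int lookup_result_error(bool positive, unsigned int flags)
{
	/* Name not found and the caller did not ask to create it. */
	if (!positive && !(flags & LOOKUP_CREATE))
		return -ENOENT;
	/* Name found but the caller required that it not exist. */
	if (positive && (flags & LOOKUP_EXCL))
		return -EEXIST;
	return 0;
}
```

For example, an rmdir-style lookup passes neither flag, so a missing
name gives -ENOENT, while a mkdir-style lookup passes both, so an
existing name gives -EEXIST.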

This subtly changes the meaning of LOOKUP_EXCL.  Previously it could
only accompany LOOKUP_CREATE.  Now it can accompany LOOKUP_RENAME_TARGET
as well.  A couple of small changes are needed to accommodate this.  The
NFS change is functionally a no-op but ensures nfs_is_exclusive_create() does
exactly what the name says.
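
To illustrate the difference between the two tests, here is a userspace
sketch.  The function names are hypothetical and the NFSv2 check in the
real nfs_is_exclusive_create() is omitted.  It assumes a caller that
passes LOOKUP_EXCL together with LOOKUP_RENAME_TARGET but without
LOOKUP_CREATE.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOKUP_CREATE		0x0200
#define LOOKUP_EXCL		0x0400
#define LOOKUP_RENAME_TARGET	0x0800

/* Old test: LOOKUP_EXCL alone was taken to mean "exclusive create". */
static bool old_exclusive_test(unsigned int flags)
{
	return (flags & LOOKUP_EXCL) != 0;
}

/* New test: both LOOKUP_CREATE and LOOKUP_EXCL must be set. */
static bool new_exclusive_test(unsigned int flags)
{
	return (flags & (LOOKUP_CREATE | LOOKUP_EXCL)) ==
		(LOOKUP_CREATE | LOOKUP_EXCL);
}
```

For LOOKUP_RENAME_TARGET | LOOKUP_EXCL the old test reports an exclusive
create and the new one does not.  For LOOKUP_CREATE | LOOKUP_EXCL both
agree.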

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c            | 61 ++++++++++++++-----------------------------
 fs/nfs/dir.c          |  3 ++-
 fs/smb/server/vfs.c   | 26 +++++++-----------
 include/linux/namei.h |  2 +-
 4 files changed, 33 insertions(+), 59 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1901120bcbb8..69610047f6c6 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1668,6 +1668,8 @@ static struct dentry *lookup_dcache(const struct qstr *name,
  * Parent directory has inode locked: exclusive or shared.
  * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
  * must be called after the intended operation is performed - or aborted.
+ * Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
+ * Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
  */
 struct dentry *lookup_one_qstr(const struct qstr *name,
 			       struct dentry *base,
@@ -1678,7 +1680,7 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
 	struct inode *dir = base->d_inode;
 
 	if (dentry)
-		return dentry;
+		goto found;
 
 	/* Don't create child dentry for a dead directory. */
 	if (unlikely(IS_DEADDIR(dir)))
@@ -1689,7 +1691,7 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
 		return ERR_PTR(-ENOMEM);
 	if (!d_in_lookup(dentry))
 		/* Raced with another thread which did the lookup */
-		return dentry;
+		goto found;
 
 	old = dir->i_op->lookup(dir, dentry, flags);
 	if (unlikely(old)) {
@@ -1700,6 +1702,15 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
 	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
 		/* ->lookup must have given final answer */
 		d_lookup_done(dentry);
+found:
+	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
+		dput(dentry);
+		return ERR_PTR(-ENOENT);
+	}
+	if (d_is_positive(dentry) && (flags & LOOKUP_EXCL)) {
+		dput(dentry);
+		return ERR_PTR(-EEXIST);
+	}
 	return dentry;
 }
 EXPORT_SYMBOL(lookup_one_qstr);
@@ -2745,10 +2756,6 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
 	}
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
 	d = lookup_one_qstr(&last, path->dentry, 0);
-	if (!IS_ERR(d) && d_is_negative(d)) {
-		dput(d);
-		d = ERR_PTR(-ENOENT);
-	}
 	if (IS_ERR(d)) {
 		inode_unlock(path->dentry->d_inode);
 		path_put(path);
@@ -4085,27 +4092,13 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	 * '/', and a directory wasn't requested.
 	 */
 	if (last.name[last.len] && !want_dir)
-		create_flags = 0;
+		create_flags &= ~LOOKUP_CREATE;
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
 	dentry = lookup_one_qstr(&last, path->dentry,
 				 reval_flag | create_flags);
 	if (IS_ERR(dentry))
 		goto unlock;
 
-	error = -EEXIST;
-	if (d_is_positive(dentry))
-		goto fail;
-
-	/*
-	 * Special case - lookup gave negative, but... we had foo/bar/
-	 * From the vfs_mknod() POV we just have a negative dentry -
-	 * all is fine. Let's be bastards - you had / on the end, you've
-	 * been asking for (non-existent) directory. -ENOENT for you.
-	 */
-	if (unlikely(!create_flags)) {
-		error = -ENOENT;
-		goto fail;
-	}
 	if (unlikely(err2)) {
 		error = err2;
 		goto fail;
@@ -4522,10 +4515,6 @@ int do_rmdir(int dfd, struct filename *name)
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit3;
-	if (!dentry->d_inode) {
-		error = -ENOENT;
-		goto exit4;
-	}
 	error = security_path_rmdir(&path, dentry);
 	if (error)
 		goto exit4;
@@ -4656,7 +4645,7 @@ int do_unlinkat(int dfd, struct filename *name)
 	if (!IS_ERR(dentry)) {
 
 		/* Why not before? Because we want correct error value */
-		if (last.name[last.len] || d_is_negative(dentry))
+		if (last.name[last.len])
 			goto slashes;
 		inode = dentry->d_inode;
 		ihold(inode);
@@ -4690,9 +4679,7 @@ int do_unlinkat(int dfd, struct filename *name)
 	return error;
 
 slashes:
-	if (d_is_negative(dentry))
-		error = -ENOENT;
-	else if (d_is_dir(dentry))
+	if (d_is_dir(dentry))
 		error = -EISDIR;
 	else
 		error = -ENOTDIR;
@@ -5192,7 +5179,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	struct qstr old_last, new_last;
 	int old_type, new_type;
 	struct inode *delegated_inode = NULL;
-	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
+	unsigned int lookup_flags = 0, target_flags =
+		LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
 	bool should_retry = false;
 	int error = -EINVAL;
 
@@ -5205,6 +5193,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 
 	if (flags & RENAME_EXCHANGE)
 		target_flags = 0;
+	if (flags & RENAME_NOREPLACE)
+		target_flags |= LOOKUP_EXCL;
 
 retry:
 	error = filename_parentat(olddfd, from, lookup_flags, &old_path,
@@ -5246,23 +5236,12 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = PTR_ERR(old_dentry);
 	if (IS_ERR(old_dentry))
 		goto exit3;
-	/* source must exist */
-	error = -ENOENT;
-	if (d_is_negative(old_dentry))
-		goto exit4;
 	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
 				     lookup_flags | target_flags);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto exit4;
-	error = -EEXIST;
-	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
-		goto exit5;
 	if (flags & RENAME_EXCHANGE) {
-		error = -ENOENT;
-		if (d_is_negative(new_dentry))
-			goto exit5;
-
 		if (!d_is_dir(new_dentry)) {
 			error = -ENOTDIR;
 			if (new_last.name[new_last.len])
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 27c7a5c4e91b..8cbe63f4089a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1531,7 +1531,8 @@ static int nfs_is_exclusive_create(struct inode *dir, unsigned int flags)
 {
 	if (NFS_PROTO(dir)->version == 2)
 		return 0;
-	return flags & LOOKUP_EXCL;
+	return (flags & (LOOKUP_CREATE | LOOKUP_EXCL)) ==
+		(LOOKUP_CREATE | LOOKUP_EXCL);
 }
 
 /*
diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
index 89b3823f6405..bf8ac43c39b0 100644
--- a/fs/smb/server/vfs.c
+++ b/fs/smb/server/vfs.c
@@ -113,11 +113,6 @@ static int ksmbd_vfs_path_lookup_locked(struct ksmbd_share_config *share_conf,
 	if (IS_ERR(d))
 		goto err_out;
 
-	if (d_is_negative(d)) {
-		dput(d);
-		goto err_out;
-	}
-
 	path->dentry = d;
 	path->mnt = mntget(parent_path->mnt);
 
@@ -677,6 +672,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 	struct ksmbd_file *parent_fp;
 	int new_type;
 	int err, lookup_flags = LOOKUP_NO_SYMLINKS;
+	int target_lookup_flags = LOOKUP_RENAME_TARGET;
 
 	if (ksmbd_override_fsids(work))
 		return -ENOMEM;
@@ -687,6 +683,14 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 		goto revert_fsids;
 	}
 
+	/*
+	 * explicitly handle file overwrite case, for compatibility with
+	 * filesystems that may not support rename flags (e.g: fuse)
+	 */
+	if (flags & RENAME_NOREPLACE)
+		target_lookup_flags |= LOOKUP_EXCL;
+	flags &= ~(RENAME_NOREPLACE);
+
 retry:
 	err = vfs_path_parent_lookup(to, lookup_flags | LOOKUP_BENEATH,
 				     &new_path, &new_last, &new_type,
@@ -727,7 +731,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 	}
 
 	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
-				     lookup_flags | LOOKUP_RENAME_TARGET);
+				     lookup_flags | target_lookup_flags);
 	if (IS_ERR(new_dentry)) {
 		err = PTR_ERR(new_dentry);
 		goto out3;
@@ -738,16 +742,6 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
 		goto out4;
 	}
 
-	/*
-	 * explicitly handle file overwrite case, for compatibility with
-	 * filesystems that may not support rename flags (e.g: fuse)
-	 */
-	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry)) {
-		err = -EEXIST;
-		goto out4;
-	}
-	flags &= ~(RENAME_NOREPLACE);
-
 	if (old_child == trap) {
 		err = -EINVAL;
 		goto out4;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 06bb3ea65beb..839a64d07f8c 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -31,7 +31,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 /* These tell filesystem methods that we are dealing with the final component... */
 #define LOOKUP_OPEN		0x0100	/* ... in open */
 #define LOOKUP_CREATE		0x0200	/* ... in object creation */
-#define LOOKUP_EXCL		0x0400	/* ... in exclusive creation */
+#define LOOKUP_EXCL		0x0400	/* ... in target must not exist */
 #define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
 
 #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
-- 
2.47.1



* [PATCH 06/19] VFS: repack DENTRY_ flags.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (4 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 12:34   ` (subset) " Christian Brauner
  2025-02-06  5:42 ` [PATCH 07/19] VFS: repack LOOKUP_ bit flags NeilBrown
                   ` (15 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

Bits 13, 23, 24, and 27 are not used.  Move all those holes to the end.
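
The renumbering can be checked mechanically: each used bit moves down by
the number of unused bits below it.  A small userspace sketch, where
repacked_bit() is a hypothetical helper and not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* The unused bit positions named in the commit message. */
static const unsigned int unused_bits[] = { 13, 23, 24, 27 };

/*
 * Hypothetical helper: given the old position of a used flag bit,
 * return its position once all holes are moved to the end.
 */
static unsigned int repacked_bit(unsigned int old_bit)
{
	unsigned int holes_below = 0;
	size_t i;

	for (i = 0; i < sizeof(unused_bits) / sizeof(unused_bits[0]); i++) {
		/* A hole has no new position; only used bits may be passed. */
		assert(old_bit != unused_bits[i]);
		if (unused_bits[i] < old_bit)
			holes_below++;
	}
	return old_bit - holes_below;
}
```

This agrees with the diff below: DCACHE_LRU_LIST goes from bit 19 to 18,
DCACHE_NOKEY_NAME from 25 to 22, and DCACHE_NORCU from 30 to 26.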

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/dcache.h | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index b03cbb0177a3..d5816cf19538 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -203,34 +203,34 @@ struct dentry_operations {
 #define DCACHE_NFSFS_RENAMED		BIT(12)
      /* this dentry has been "silly renamed" and has to be deleted on the last
       * dput() */
-#define DCACHE_FSNOTIFY_PARENT_WATCHED	BIT(14)
+#define DCACHE_FSNOTIFY_PARENT_WATCHED	BIT(13)
      /* Parent inode is watched by some fsnotify listener */
 
-#define DCACHE_DENTRY_KILLED		BIT(15)
+#define DCACHE_DENTRY_KILLED		BIT(14)
 
-#define DCACHE_MOUNTED			BIT(16) /* is a mountpoint */
-#define DCACHE_NEED_AUTOMOUNT		BIT(17) /* handle automount on this dir */
-#define DCACHE_MANAGE_TRANSIT		BIT(18) /* manage transit from this dirent */
+#define DCACHE_MOUNTED			BIT(15) /* is a mountpoint */
+#define DCACHE_NEED_AUTOMOUNT		BIT(16) /* handle automount on this dir */
+#define DCACHE_MANAGE_TRANSIT		BIT(17) /* manage transit from this dirent */
 #define DCACHE_MANAGED_DENTRY \
 	(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
-#define DCACHE_LRU_LIST			BIT(19)
+#define DCACHE_LRU_LIST			BIT(18)
 
-#define DCACHE_ENTRY_TYPE		(7 << 20) /* bits 20..22 are for storing type: */
-#define DCACHE_MISS_TYPE		(0 << 20) /* Negative dentry */
-#define DCACHE_WHITEOUT_TYPE		(1 << 20) /* Whiteout dentry (stop pathwalk) */
-#define DCACHE_DIRECTORY_TYPE		(2 << 20) /* Normal directory */
-#define DCACHE_AUTODIR_TYPE		(3 << 20) /* Lookupless directory (presumed automount) */
-#define DCACHE_REGULAR_TYPE		(4 << 20) /* Regular file type */
-#define DCACHE_SPECIAL_TYPE		(5 << 20) /* Other file type */
-#define DCACHE_SYMLINK_TYPE		(6 << 20) /* Symlink */
+#define DCACHE_ENTRY_TYPE		(7 << 19) /* bits 19..21 are for storing type: */
+#define DCACHE_MISS_TYPE		(0 << 19) /* Negative dentry */
+#define DCACHE_WHITEOUT_TYPE		(1 << 19) /* Whiteout dentry (stop pathwalk) */
+#define DCACHE_DIRECTORY_TYPE		(2 << 19) /* Normal directory */
+#define DCACHE_AUTODIR_TYPE		(3 << 19) /* Lookupless directory (presumed automount) */
+#define DCACHE_REGULAR_TYPE		(4 << 19) /* Regular file type */
+#define DCACHE_SPECIAL_TYPE		(5 << 19) /* Other file type */
+#define DCACHE_SYMLINK_TYPE		(6 << 19) /* Symlink */
 
-#define DCACHE_NOKEY_NAME		BIT(25) /* Encrypted name encoded without key */
-#define DCACHE_OP_REAL			BIT(26)
+#define DCACHE_NOKEY_NAME		BIT(22) /* Encrypted name encoded without key */
+#define DCACHE_OP_REAL			BIT(23)
 
-#define DCACHE_PAR_LOOKUP		BIT(28) /* being looked up (with parent locked shared) */
-#define DCACHE_DENTRY_CURSOR		BIT(29)
-#define DCACHE_NORCU			BIT(30) /* No RCU delay for freeing */
+#define DCACHE_PAR_LOOKUP		BIT(24) /* being looked up (with parent locked shared) */
+#define DCACHE_DENTRY_CURSOR		BIT(25)
+#define DCACHE_NORCU			BIT(26) /* No RCU delay for freeing */
 
 extern seqlock_t rename_lock;
 
-- 
2.47.1



* [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (5 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 06/19] VFS: repack DENTRY_ flags NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 12:44   ` Christian Brauner
  2025-02-06 12:54   ` (subset) " Christian Brauner
  2025-02-06  5:42 ` [PATCH 08/19] VFS: introduce lookup_and_lock() and friends NeilBrown
                   ` (14 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

The LOOKUP_ bits are not in order, which can make it awkward when adding
new bits.  Two bits have recently been added to the end which makes them
look like "scoping flags", but in fact they aren't.

Also LOOKUP_PARENT is described as "internal use only" but is used in
fs/nfs/.

This patch:
 - Moves these three flags into the "pathwalk mode" section
 - changes all bits to use the BIT(n) macro
 - Allocates bits in order leaving gaps between the sections,
   and documents those gaps.
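
A userspace sketch of the resulting layout, using a few of the new bit
values from this patch.  The SECTION_* masks and lookup_flag_section()
are hypothetical names for illustration; the kernel header defines no
such masks.

```c
#include <assert.h>

#define BIT(n)	(1U << (n))

/* A sample of the repacked values from this patch. */
#define LOOKUP_FOLLOW		BIT(0)
#define LOOKUP_PARENT		BIT(10)
#define LOOKUP_OPEN		BIT(16)
#define LOOKUP_RENAME_TARGET	BIT(19)
#define LOOKUP_NO_SYMLINKS	BIT(24)
#define LOOKUP_IN_ROOT		BIT(28)

/* Hypothetical masks: bits 0-15, 16-23 and 24-31 respectively. */
#define SECTION_PATHWALK	0x0000ffffU
#define SECTION_INTENT		0x00ff0000U
#define SECTION_SCOPING		0xff000000U

enum lookup_section { SEC_PATHWALK, SEC_INTENT, SEC_SCOPING };

/* Classify a single LOOKUP_ flag by the section its bit falls in. */
static enum lookup_section lookup_flag_section(unsigned int flag)
{
	/* Exactly one bit must be set. */
	assert(flag && !(flag & (flag - 1)));

	if (flag & SECTION_PATHWALK)
		return SEC_PATHWALK;
	if (flag & SECTION_INTENT)
		return SEC_INTENT;
	return SEC_SCOPING;
}
```

With 11 pathwalk flags in a 16-bit section, 4 intent flags in 8 bits and
5 scoping flags in 8 bits, the spare counts come out as 5, 4 and 3,
matching the comments added by the patch.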

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/namei.h | 46 +++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/namei.h b/include/linux/namei.h
index 839a64d07f8c..0d81e571a159 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -18,38 +18,38 @@ enum { MAX_NESTED_LINKS = 8 };
 enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 
 /* pathwalk mode */
-#define LOOKUP_FOLLOW		0x0001	/* follow links at the end */
-#define LOOKUP_DIRECTORY	0x0002	/* require a directory */
-#define LOOKUP_AUTOMOUNT	0x0004  /* force terminal automount */
-#define LOOKUP_EMPTY		0x4000	/* accept empty path [user_... only] */
-#define LOOKUP_DOWN		0x8000	/* follow mounts in the starting point */
-#define LOOKUP_MOUNTPOINT	0x0080	/* follow mounts in the end */
-
-#define LOOKUP_REVAL		0x0020	/* tell ->d_revalidate() to trust no cache */
-#define LOOKUP_RCU		0x0040	/* RCU pathwalk mode; semi-internal */
+#define LOOKUP_FOLLOW		BIT(0)	/* follow links at the end */
+#define LOOKUP_DIRECTORY	BIT(1)	/* require a directory */
+#define LOOKUP_AUTOMOUNT	BIT(2)  /* force terminal automount */
+#define LOOKUP_EMPTY		BIT(3)	/* accept empty path [user_... only] */
+#define LOOKUP_LINKAT_EMPTY	BIT(4) /* Linkat request with empty path. */
+#define LOOKUP_DOWN		BIT(5)	/* follow mounts in the starting point */
+#define LOOKUP_MOUNTPOINT	BIT(6)	/* follow mounts in the end */
+#define LOOKUP_REVAL		BIT(7)	/* tell ->d_revalidate() to trust no cache */
+#define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
+#define LOOKUP_CACHED		BIT(9) /* Only do cached lookup */
+#define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
+/* 5 spare bits for pathwalk */
 
 /* These tell filesystem methods that we are dealing with the final component... */
-#define LOOKUP_OPEN		0x0100	/* ... in open */
-#define LOOKUP_CREATE		0x0200	/* ... in object creation */
-#define LOOKUP_EXCL		0x0400	/* ... in target must not exist */
-#define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
+#define LOOKUP_OPEN		BIT(16)	/* ... in open */
+#define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
+#define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
+#define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
 
 #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
 				 LOOKUP_RENAME_TARGET)
-
-/* internal use only */
-#define LOOKUP_PARENT		0x0010
+/* 4 spare bits for intent */
 
 /* Scoping flags for lookup. */
-#define LOOKUP_NO_SYMLINKS	0x010000 /* No symlink crossing. */
-#define LOOKUP_NO_MAGICLINKS	0x020000 /* No nd_jump_link() crossing. */
-#define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
-#define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
-#define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
-#define LOOKUP_CACHED		0x200000 /* Only do cached lookup */
-#define LOOKUP_LINKAT_EMPTY	0x400000 /* Linkat request with empty path. */
+#define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
+#define LOOKUP_NO_MAGICLINKS	BIT(25) /* No nd_jump_link() crossing. */
+#define LOOKUP_NO_XDEV		BIT(26) /* No mountpoint crossing. */
+#define LOOKUP_BENEATH		BIT(27) /* No escaping from starting point. */
+#define LOOKUP_IN_ROOT		BIT(28) /* Treat dirfd as fs root. */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
+/* 3 spare bits for scoping */
 
 extern int path_pts(struct path *path);
 
-- 
2.47.1



* [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (6 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 07/19] VFS: repack LOOKUP_ bit flags NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 13:49   ` Christian Brauner
  2025-02-07 20:22   ` Al Viro
  2025-02-06  5:42 ` [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations NeilBrown
                   ` (13 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

lookup_and_lock() combines locking the directory and performing a lookup
prior to a change to the directory.
Abstracting this prepares for changing the locking requirements.

done_lookup_and_lock() provides the inverse: it puts the dentry and
unlocks the directory.

For "silly_rename" we will need to lookup_and_lock() in a directory that
is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
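
The lock pairing described above can be sketched as a small userspace
model.  This is only an illustration under stated assumptions: struct
model_dir, MODEL_PARENT_LOCKED and the model_* functions are hypothetical
stand-ins for the directory inode, LOOKUP_PARENT_LOCKED and the helpers
added by this patch, and a plain counter replaces the real inode lock.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for LOOKUP_PARENT_LOCKED. */
#define MODEL_PARENT_LOCKED 0x1u

/* Hypothetical directory: a counter models the inode lock. */
struct model_dir {
	int lock_depth;		/* 1 while the directory lock is held */
	bool lookup_fails;	/* force the lookup step to fail */
};

/*
 * Take the lock unless the caller says the parent is already locked,
 * then "look up".  On failure the lock is released again, but only if
 * this call took it.
 */
int model_lookup_and_lock(struct model_dir *dir, unsigned int flags)
{
	if (!(flags & MODEL_PARENT_LOCKED))
		dir->lock_depth++;
	if (dir->lookup_fails) {
		if (!(flags & MODEL_PARENT_LOCKED))
			dir->lock_depth--;
		return -ENOENT;
	}
	return 0;
}

/* Inverse operation: drop the lock only if the matching call took it. */
void model_done_lookup_and_lock(struct model_dir *dir, unsigned int flags)
{
	if (!(flags & MODEL_PARENT_LOCKED))
		dir->lock_depth--;
}
```

The same flags must be passed to both calls, otherwise the lock count
becomes unbalanced; that is why done_lookup_and_lock() takes
lookup_flags.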

Like lookup_len_qstr(), lookup_and_lock() returns -ENOENT if
LOOKUP_CREATE was NOT given and the name cannot be found, and returns
-EEXIST if LOOKUP_EXCL WAS given and the name CAN be found.
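
As a rough illustration of the two error rules above, here is a
userspace sketch.  MODEL_CREATE, MODEL_EXCL and model_lookup_result()
are hypothetical names, not part of the patch; they stand in for
LOOKUP_CREATE, LOOKUP_EXCL and the check made during the lookup.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for LOOKUP_CREATE and LOOKUP_EXCL. */
#define MODEL_CREATE 0x1u
#define MODEL_EXCL   0x2u

/*
 * Return 0 when the lookup result is acceptable for the given intent,
 * or the errno described in the commit message.
 */
int model_lookup_result(bool name_exists, unsigned int flags)
{
	/* Name missing and the caller did not intend to create it. */
	if (!name_exists && !(flags & MODEL_CREATE))
		return -ENOENT;
	/* Name present but the caller required that it not exist. */
	if (name_exists && (flags & MODEL_EXCL))
		return -EEXIST;
	return 0;
}
```

With both flags set (the mkdir/mknod style of use in filename_create())
a missing name is accepted and an existing name is rejected; with
neither flag set (the unlink/rmdir style) it is the other way around.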

These functions replace all uses of lookup_one_qstr() in namei.c
except for those used for rename.

The name might seem backwards as the lock happens before the lookup.
A future patch will change this so that only a shared lock is taken
before the lookup, and an exclusive lock on the dentry is taken after a
successful lookup.  So the order "lookup" then "lock" will make sense.

This functionality is exported as lookup_and_lock_one() which takes a
name and len rather than a qstr.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c            | 102 ++++++++++++++++++++++++++++--------------
 include/linux/namei.h |  15 ++++++-
 2 files changed, 83 insertions(+), 34 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 69610047f6c6..3c0feca081a2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1715,6 +1715,41 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
 }
 EXPORT_SYMBOL(lookup_one_qstr);
 
+static struct dentry *lookup_and_lock_nested(const struct qstr *last,
+					     struct dentry *base,
+					     unsigned int lookup_flags,
+					     unsigned int subclass)
+{
+	struct dentry *dentry;
+
+	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
+		inode_lock_nested(base->d_inode, subclass);
+
+	dentry = lookup_one_qstr(last, base, lookup_flags);
+	if (IS_ERR(dentry) && !(lookup_flags & LOOKUP_PARENT_LOCKED)) {
+		inode_unlock(base->d_inode);
+	}
+	return dentry;
+}
+
+static struct dentry *lookup_and_lock(const struct qstr *last,
+				      struct dentry *base,
+				      unsigned int lookup_flags)
+{
+	return lookup_and_lock_nested(last, base, lookup_flags,
+				      I_MUTEX_PARENT);
+}
+
+void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
+			  unsigned int lookup_flags)
+{
+	d_lookup_done(dentry);
+	dput(dentry);
+	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
+		inode_unlock(base->d_inode);
+}
+EXPORT_SYMBOL(done_lookup_and_lock);
+
 /**
  * lookup_fast - do fast lockless (but racy) lookup of a dentry
  * @nd: current nameidata
@@ -2754,12 +2789,9 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
 		path_put(path);
 		return ERR_PTR(-EINVAL);
 	}
-	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	d = lookup_one_qstr(&last, path->dentry, 0);
-	if (IS_ERR(d)) {
-		inode_unlock(path->dentry->d_inode);
+	d = lookup_and_lock(&last, path->dentry, 0);
+	if (IS_ERR(d))
 		path_put(path);
-	}
 	return d;
 }
 
@@ -3053,6 +3085,22 @@ struct dentry *lookup_positive_unlocked(const char *name,
 }
 EXPORT_SYMBOL(lookup_positive_unlocked);
 
+struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
+				   const char *name, int len, struct dentry *base,
+				   unsigned int lookup_flags)
+{
+	struct qstr this;
+	int err;
+
+	if (!idmap)
+		idmap = &nop_mnt_idmap;
+	err = lookup_one_common(idmap, name, base, len, &this);
+	if (err)
+		return ERR_PTR(err);
+	return lookup_and_lock(&this, base, lookup_flags);
+}
+EXPORT_SYMBOL(lookup_and_lock_one);
+
 #ifdef CONFIG_UNIX98_PTYS
 int path_pts(struct path *path)
 {
@@ -4071,7 +4119,6 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
 	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
 	int type;
-	int err2;
 	int error;
 
 	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
@@ -4083,36 +4130,30 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	 * (foo/., foo/.., /////)
 	 */
 	if (unlikely(type != LAST_NORM))
-		goto out;
+		goto put;
 
 	/* don't fail immediately if it's r/o, at least try to report other errors */
-	err2 = mnt_want_write(path->mnt);
+	error = mnt_want_write(path->mnt);
 	/*
 	 * Do the final lookup.  Suppress 'create' if there is a trailing
 	 * '/', and a directory wasn't requested.
 	 */
 	if (last.name[last.len] && !want_dir)
 		create_flags &= ~LOOKUP_CREATE;
-	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr(&last, path->dentry,
-				 reval_flag | create_flags);
+	dentry = lookup_and_lock(&last, path->dentry, reval_flag | create_flags);
 	if (IS_ERR(dentry))
-		goto unlock;
+		goto drop;
 
-	if (unlikely(err2)) {
-		error = err2;
+	if (unlikely(error))
 		goto fail;
-	}
 	return dentry;
 fail:
-	d_lookup_done(dentry);
-	dput(dentry);
+	done_lookup_and_lock(path->dentry, dentry, reval_flag | create_flags);
 	dentry = ERR_PTR(error);
-unlock:
-	inode_unlock(path->dentry->d_inode);
-	if (!err2)
+drop:
+	if (!error)
 		mnt_drop_write(path->mnt);
-out:
+put:
 	path_put(path);
 	return dentry;
 }
@@ -4130,14 +4171,13 @@ EXPORT_SYMBOL(kern_path_create);
 
 void done_path_create(struct path *path, struct dentry *dentry)
 {
-	dput(dentry);
-	inode_unlock(path->dentry->d_inode);
+	done_lookup_and_lock(path->dentry, dentry, LOOKUP_CREATE);
 	mnt_drop_write(path->mnt);
 	path_put(path);
 }
 EXPORT_SYMBOL(done_path_create);
 
-inline struct dentry *user_path_create(int dfd, const char __user *pathname,
+struct dentry *user_path_create(int dfd, const char __user *pathname,
 				struct path *path, unsigned int lookup_flags)
 {
 	struct filename *filename = getname(pathname);
@@ -4510,19 +4550,18 @@ int do_rmdir(int dfd, struct filename *name)
 	if (error)
 		goto exit2;
 
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
+	dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit3;
+
 	error = security_path_rmdir(&path, dentry);
 	if (error)
 		goto exit4;
 	error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
 exit4:
-	dput(dentry);
+	done_lookup_and_lock(path.dentry, dentry, lookup_flags);
 exit3:
-	inode_unlock(path.dentry->d_inode);
 	mnt_drop_write(path.mnt);
 exit2:
 	path_put(&path);
@@ -4639,11 +4678,9 @@ int do_unlinkat(int dfd, struct filename *name)
 	if (error)
 		goto exit2;
 retry_deleg:
-	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
-	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
+	dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
 	error = PTR_ERR(dentry);
 	if (!IS_ERR(dentry)) {
-
 		/* Why not before? Because we want correct error value */
 		if (last.name[last.len])
 			goto slashes;
@@ -4655,9 +4692,8 @@ int do_unlinkat(int dfd, struct filename *name)
 		error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
 				   dentry, &delegated_inode);
 exit3:
-		dput(dentry);
+		done_lookup_and_lock(path.dentry, dentry, lookup_flags);
 	}
-	inode_unlock(path.dentry->d_inode);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
 	inode = NULL;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 0d81e571a159..76c587a5ec3a 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -29,7 +29,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
 #define LOOKUP_CACHED		BIT(9) /* Only do cached lookup */
 #define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
-/* 5 spare bits for pathwalk */
+#define LOOKUP_PARENT_LOCKED	BIT(11)	/* filesystem sets this for nested
+					 * "lookup_and_lock_one" when it knows
+					 * parent is sufficiently locked.
+					 */
+/* 4 spare bits for pathwalk */
 
 /* These tell filesystem methods that we are dealing with the final component... */
 #define LOOKUP_OPEN		BIT(16)	/* ... in open */
@@ -82,6 +86,15 @@ struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
 struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
 					    const char *name,
 					    struct dentry *base, int len);
+struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
+				   const char *name, int len, struct dentry *base,
+				   unsigned int lookup_flags);
+struct dentry *__lookup_and_lock_one(struct mnt_idmap *idmap,
+				     const char *name, int len, struct dentry *base,
+				     unsigned int lookup_flags);
+void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
+			  unsigned int lookup_flags);
+void __done_lookup_and_lock(struct dentry *dentry);
 
 extern int follow_down_one(struct path *);
 extern int follow_down(struct path *path, unsigned int flags);
-- 
2.47.1



* [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (7 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 08/19] VFS: introduce lookup_and_lock() and friends NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 13:15   ` Christian Brauner
  2025-02-07 22:41   ` Al Viro
  2025-02-06  5:42 ` [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops NeilBrown
                   ` (12 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

These "_async" versions of various inode operations are only guaranteed
a shared lock on the directory.  If the directory isn't exclusively
locked then they are also guaranteed an exclusive lock on the dentry
within the directory (which will be implemented in a later patch).

This will allow a graceful transition from exclusive to shared locking
for directory updates, and even to async updates which can complete with
no lock on the directory - only on the dentry.
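
The completion protocol that the DO_DIROP() helper in this patch relies
on can be sketched in userspace as below.  This is a single-threaded
model under loud assumptions: the struct keeps only the int result,
model_create_async() and model_complete() are hypothetical stand-ins for
a filesystem op and its asynchronous worker, and the place where the
kernel would sleep in wait_var_event() simply runs the completion
inline.

```c
#include <errno.h>
#include <stddef.h>

/* Cut-down model of struct dirop_ret (dentry member omitted). */
struct dirop_ret {
	int err;
	void (*done_cb)(struct dirop_ret *);
};

/* Hypothetical bookkeeping for one operation still in flight. */
static struct dirop_ret *pending;
int done_calls;

/* The kernel callback calls wake_up_var(dret); the model just counts. */
void model_done_cb(struct dirop_ret *dret)
{
	(void)dret;
	done_calls++;
}

/* Hypothetical op: finish synchronously, or defer and say so. */
int model_create_async(int defer, struct dirop_ret *dret)
{
	if (!defer)
		return 0;
	pending = dret;
	return -EINPROGRESS;
}

/* Hypothetical worker: store the result first, then call done_cb(). */
void model_complete(int result)
{
	struct dirop_ret *dret = pending;

	pending = NULL;
	dret->err = result;	/* must not be -EINPROGRESS */
	dret->done_cb(dret);
}

/* Same shape as DO_DIROP(): call the op, wait only if it deferred. */
int model_do_dirop(int defer, int async_result)
{
	struct dirop_ret dret = {
		.err = -EINPROGRESS,
		.done_cb = model_done_cb,
	};
	int ret = model_create_async(defer, &dret);

	if (ret == -EINPROGRESS) {
		/* The kernel would sleep here until dret.err changes. */
		model_complete(async_result);
		ret = dret.err;
	}
	return ret;
}
```

The ordering in model_complete() matters: the result is written before
the callback runs, because the waiter tests the stored value to decide
whether the operation has finished.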

mkdir_async is a bit different as it optionally returns a new dentry
for cases when the filesystem is not able to use the original dentry.
This allows vfs_mkdir_return() to avoid the need for an extra lookup.
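
A minimal sketch of how a caller might treat the three possible outcomes
of mkdir_async, assuming a simplified model: struct model_dentry and
model_mkdir_return() are hypothetical, the error is passed as a separate
int rather than encoded with ERR_PTR(), and a plain counter stands in
for the dentry reference count.

```c
#include <stddef.h>

/* Hypothetical dentry with a bare reference count. */
struct model_dentry {
	int id;
	int refs;
};

/*
 * err != 0   : the operation failed, *dentryp is left alone.
 * alt == NULL: the filesystem used the original dentry.
 * alt != NULL: the filesystem supplied a different dentry; drop the
 *              reference on the original and hand back the new one.
 */
int model_mkdir_return(struct model_dentry **dentryp,
		       struct model_dentry *alt, int err)
{
	if (err)
		return err;
	if (alt) {
		(*dentryp)->refs--;	/* models dput() on the original */
		*dentryp = alt;
	}
	return 0;
}
```

After a successful call the caller should only use *dentryp, since the
original dentry may no longer be referenced.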

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/filesystems/locking.rst |  51 ++++++++-
 Documentation/filesystems/porting.rst |  10 ++
 Documentation/filesystems/vfs.rst     |  24 +++++
 fs/namei.c                            | 142 +++++++++++++++++++++-----
 include/linux/fs.h                    |  24 +++++
 5 files changed, 223 insertions(+), 28 deletions(-)

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d20a32b77b60..adeead366332 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -62,15 +62,24 @@ inode_operations
 prototypes::
 
 	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
+	int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool, struct dirop_ret *);
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
 	int (*unlink) (struct inode *,struct dentry *);
+	int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
 	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+	int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
 	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+	struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
 	int (*rmdir) (struct inode *,struct dentry *);
+	int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
 	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+	int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
 	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
 			struct inode *, struct dentry *, unsigned int);
+	int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
+			struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
 	int (*readlink) (struct dentry *, char __user *,int);
 	const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
 	void (*truncate) (struct inode *);
@@ -84,6 +93,9 @@ prototypes::
 	int (*atomic_open)(struct inode *, struct dentry *,
 				struct file *, unsigned open_flag,
 				umode_t create_mode);
+	int (*atomic_open_async)(struct inode *, struct dentry *,
+				struct file *, unsigned open_flag,
+				umode_t create_mode, struct dirop_ret *);
 	int (*tmpfile) (struct mnt_idmap *, struct inode *,
 			struct file *, umode_t);
 	int (*fileattr_set)(struct mnt_idmap *idmap,
@@ -95,18 +107,33 @@ prototypes::
 locking rules:
 	all may block
 
+All directory-modifying operations are called with an exclusive lock on
+the target dentry or dentries using DCACHE_PAR_LOOKUP.  This allows the
+shared lock on i_rwsem for the _async ops to be safe.  The lock on
+i_rwsem may be dropped as soon as the op returns, though if it returns
+-EINPROGRESS the lock using DCACHE_PAR_UPDATE will not be dropped until
+the callback is called.
+
 ==============	==================================================
 ops		i_rwsem(inode)
 ==============	==================================================
 lookup:		shared
 create:		exclusive
+create_async:	shared
 link:		exclusive (both)
+link_async:	exclusive on source, shared on target
 mknod:		exclusive
+mknod_async:	shared
 symlink:	exclusive
+symlink_async:	shared
 mkdir:		exclusive
+mkdir_async:	shared
 unlink:		exclusive (both)
+unlink_async:	exclusive on object, shared on directory/name
 rmdir:		exclusive (both)(see below)
+rmdir_async:	exclusive on object, shared on directory/name (see below)
 rename:		exclusive (both parents, some children)	(see below)
+rename_async:	shared (both parents) exclusive (some children)	(see below)
 readlink:	no
 get_link:	no
 setattr:	exclusive
@@ -118,6 +145,7 @@ listxattr:	no
 fiemap:		no
 update_time:	no
 atomic_open:	shared (exclusive if O_CREAT is set in open flags)
+atomic_open_async:	shared (if O_CREAT is not set, then may not have exclusive lock on name)
 tmpfile:	no
 fileattr_get:	no or exclusive
 fileattr_set:	exclusive
@@ -125,8 +153,10 @@ get_offset_ctx  no
 ==============	==================================================
 
 
-	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
-	exclusive on victim.
+	Additionally, ->rmdir(), ->unlink() and ->rename(), as well as _async
+	versions, have ->i_rwsem exclusive on victim.  This exclusive lock
+	may be dropped when the op completes even if the async operation is
+	continuing.
 	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
 	->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
 	involved.
@@ -135,6 +165,23 @@ get_offset_ctx  no
 See Documentation/filesystems/directory-locking.rst for more detailed discussion
 of the locking scheme for directory operations.
 
+The _async operations will be passed a (non-NULL) struct dirop_ret pointer::
+
+	struct dirop_ret {
+		union {
+			int err;
+			struct dentry *dentry;
+		};
+		void (*done_cb)(struct dirop_ret*);
+	};
+
+They may return -EINPROGRESS (or ERR_PTR(-EINPROGRESS)) in which case
+the op will continue asynchronously.  When it completes the result,
+which must NOT be -EINPROGRESS, is stored in err or dentry (as
+appropriate) and the done_cb() function is called.  Callers can only
+make use of the asynchrony when they determine that no lock need be held
+on i_rwsem.
+
 xattr_handler operations
 ========================
 
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 1639e78e3146..a736c9f30d9d 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1157,3 +1157,13 @@ in normal case it points into the pathname being looked up.
 NOTE: if you need something like full path from the root of filesystem,
 you are still on your own - this assists with simple cases, but it's not
 magic.
+
+---
+
+**recommended**
+
+create_async, link_async, unlink_async, rmdir_async, mknod_async,
+rename_async, atomic_open_async can be provided instead of the
+corresponding inode_operations without the "_async" suffix.  Multiple
+_async operations can be performed in a given directory concurrently,
+but never on the same name.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 31eea688609a..e18655054e6c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -491,15 +491,24 @@ As of kernel 2.6.22, the following members are defined:
 
 	struct inode_operations {
 		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
+		int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool, struct dirop_ret *);
 		struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 		int (*link) (struct dentry *,struct inode *,struct dentry *);
+		int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
 		int (*unlink) (struct inode *,struct dentry *);
+		int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
 		int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
+		int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
 		int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
+		struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
 		int (*rmdir) (struct inode *,struct dentry *);
+		int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
 		int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
+		int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
 		int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
 			       struct inode *, struct dentry *, unsigned int);
+		int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
+			       struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
 		int (*readlink) (struct dentry *, char __user *,int);
 		const char *(*get_link) (struct dentry *, struct inode *,
 					 struct delayed_call *);
@@ -511,6 +520,8 @@ As of kernel 2.6.22, the following members are defined:
 		void (*update_time)(struct inode *, struct timespec *, int);
 		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
 				   unsigned open_flag, umode_t create_mode);
+		int (*atomic_open_async)(struct inode *, struct dentry *, struct file *,
+				   unsigned open_flag, umode_t create_mode, struct dirop_ret *);
 		int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
 		struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
 	        int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
@@ -524,6 +535,7 @@ Again, all methods are called without any locks being held, unless
 otherwise noted.
 
 ``create``
+``create_async``
 	called by the open(2) and creat(2) system calls.  Only required
 	if you want to support regular files.  The dentry you get should
 	not have an inode (i.e. it should be a negative dentry).  Here
@@ -546,29 +558,39 @@ otherwise noted.
 	directory inode semaphore held
 
 ``link``
+``link_async``
 	called by the link(2) system call.  Only required if you want to
 	support hard links.  You will probably need to call
 	d_instantiate() just as you would in the create() method
 
 ``unlink``
+``unlink_async``
 	called by the unlink(2) system call.  Only required if you want
 	to support deleting inodes
 
 ``symlink``
+``symlink_async``
 	called by the symlink(2) system call.  Only required if you want
 	to support symlinks.  You will probably need to call
 	d_instantiate() just as you would in the create() method
 
 ``mkdir``
+``mkdir_async``
 	called by the mkdir(2) system call.  Only required if you want
 	to support creating subdirectories.  You will probably need to
 	call d_instantiate() just as you would in the create() method
 
+	mkdir_async can return an alternate dentry, much like lookup.
+	In this case the original dentry will still be negative and will
+	be unhashed.
+
 ``rmdir``
+``rmdir_async``
 	called by the rmdir(2) system call.  Only required if you want
 	to support deleting subdirectories
 
 ``mknod``
+``mknod_async``
 	called by the mknod(2) system call to create a device (char,
 	block) inode or a named pipe (FIFO) or socket.  Only required if
 	you want to support creating these types of inodes.  You will
@@ -576,6 +598,7 @@ otherwise noted.
 	create() method
 
 ``rename``
+``rename_async``
 	called by the rename(2) system call to rename the object to have
 	the parent and name given by the second inode and dentry.
 
@@ -647,6 +670,7 @@ otherwise noted.
 	itself and call mark_inode_dirty_sync.
 
 ``atomic_open``
+``atomic_open_async``
 	called on the last component of an open.  Using this optional
 	method the filesystem can look up, possibly create and open the
 	file in one atomic operation.  If it wants to leave actual
diff --git a/fs/namei.c b/fs/namei.c
index 3c0feca081a2..eadde9de73bf 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -123,6 +123,41 @@
  * PATH_MAX includes the nul terminator --RR.
  */
 
+static void dirop_done_cb(struct dirop_ret *dret)
+{
+	wake_up_var(dret);
+}
+
+#define DO_DIROP(dir, op, ...)						\
+	({								\
+		 struct dirop_ret dret;					\
+		 int ret;						\
+		 dret.err = -EINPROGRESS;				\
+		 dret.done_cb = dirop_done_cb;				\
+		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
+		 if (ret == -EINPROGRESS) {				\
+			 wait_var_event(&dret,				\
+					dret.err != -EINPROGRESS);	\
+			 ret = dret.err;				\
+		 }							\
+		 ret;							\
+	})
+
+#define DO_DE_DIROP(dir, op, ...)					\
+	({								\
+		 struct dirop_ret dret;					\
+		 struct dentry *ret;					\
+		 dret.dentry = ERR_PTR(-EINPROGRESS);			\
+		 dret.done_cb = dirop_done_cb;				\
+		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
+		 if (ret == ERR_PTR(-EINPROGRESS)) {			\
+			 wait_var_event(&dret,				\
+					dret.dentry != ERR_PTR(-EINPROGRESS));	\
+			 ret = dret.dentry;				\
+		 }							\
+		 ret;							\
+	})
+
 #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
 
 struct filename *
@@ -3403,14 +3438,17 @@ int vfs_create(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if (!dir->i_op->create)
+	if (!dir->i_op->create && !dir->i_op->create_async)
 		return -EACCES;	/* shouldn't it be ENOSYS? */
 
 	mode = vfs_prepare_mode(idmap, dir, mode, S_IALLUGO, S_IFREG);
 	error = security_inode_create(dir, dentry, mode);
 	if (error)
 		return error;
-	error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
+	if (dir->i_op->create_async)
+		error = DO_DIROP(dir, create_async, idmap, dir, dentry, mode, want_excl);
+	else
+		error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -3571,8 +3609,12 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
 
 	file->f_path.dentry = DENTRY_NOT_SET;
 	file->f_path.mnt = nd->path.mnt;
-	error = dir->i_op->atomic_open(dir, dentry, file,
-				       open_to_namei_flags(open_flag), mode);
+	if (dir->i_op->atomic_open_async)
+		error = DO_DIROP(dir, atomic_open_async, dir, dentry, file,
+				 open_to_namei_flags(open_flag), mode);
+	else
+		error = dir->i_op->atomic_open(dir, dentry, file,
+					       open_to_namei_flags(open_flag), mode);
 	d_lookup_done(dentry);
 	if (!error) {
 		if (file->f_mode & FMODE_OPENED) {
@@ -3680,7 +3722,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	}
 	if (create_error)
 		open_flag &= ~O_CREAT;
-	if (dir_inode->i_op->atomic_open) {
+	if (dir_inode->i_op->atomic_open || dir_inode->i_op->atomic_open_async) {
 		dentry = atomic_open(nd, dentry, file, open_flag, mode);
 		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
 			dentry = ERR_PTR(create_error);
@@ -3705,13 +3747,16 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (!dentry->d_inode && (open_flag & O_CREAT)) {
 		file->f_mode |= FMODE_CREATED;
 		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
-		if (!dir_inode->i_op->create) {
-			error = -EACCES;
-			goto out_dput;
-		}
 
-		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
-						mode, open_flag & O_EXCL);
+		if (dir_inode->i_op->create_async)
+			error = DO_DIROP(dir_inode, create_async, idmap, dir_inode,
+					 dentry, mode, open_flag & O_EXCL);
+		else if (dir_inode->i_op->create)
+			error = dir_inode->i_op->create(idmap, dir_inode,
+							dentry, mode,
+							open_flag & O_EXCL);
+		else
+			error = -EACCES;
 		if (error)
 			goto out_dput;
 	}
@@ -4217,7 +4262,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	    !capable(CAP_MKNOD))
 		return -EPERM;
 
-	if (!dir->i_op->mknod)
+	if (!dir->i_op->mknod && !dir->i_op->mknod_async)
 		return -EPERM;
 
 	mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
@@ -4229,7 +4274,10 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
+	if (dir->i_op->mknod_async)
+		error = DO_DIROP(dir, mknod_async, idmap, dir, dentry, mode, dev);
+	else
+		error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -4340,7 +4388,7 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if (!dir->i_op->mkdir)
+	if (!dir->i_op->mkdir && !dir->i_op->mkdir_async)
 		return -EPERM;
 
 	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
@@ -4351,7 +4399,16 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (max_links && dir->i_nlink >= max_links)
 		return -EMLINK;
 
-	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+	if (dir->i_op->mkdir_async) {
+		struct dentry *de;
+		de = DO_DE_DIROP(dir, mkdir_async, idmap, dir, dentry, mode);
+		if (IS_ERR(de))
+			error = PTR_ERR(de);
+		else if (de)
+			dput(de);
+	} else {
+		error = dir->i_op->mkdir(idmap, dir, dentry, mode);
+	}
 	if (!error)
 		fsnotify_mkdir(dir, dentry);
 	return error;
@@ -4399,6 +4456,20 @@ int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
 	if (max_links && dir->i_nlink >= max_links)
 		return -EMLINK;
 
+	if (dir->i_op->mkdir_async) {
+		struct dentry *de;
+
+		de = DO_DE_DIROP(dir, mkdir_async, idmap, dir, dentry, mode);
+		if (IS_ERR(de))
+			return PTR_ERR(de);
+		if (de) {
+			dput(dentry);
+			*dentryp = de;
+		}
+		fsnotify_mkdir(dir, *dentryp);
+		return 0;
+	}
+
 	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
 	if (!error) {
 		fsnotify_mkdir(dir, dentry);
@@ -4488,7 +4559,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if (!dir->i_op->rmdir)
+	if (!dir->i_op->rmdir && !dir->i_op->rmdir_async)
 		return -EPERM;
 
 	dget(dentry);
@@ -4503,7 +4574,10 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		goto out;
 
-	error = dir->i_op->rmdir(dir, dentry);
+	if (dir->i_op->rmdir_async)
+		error = DO_DIROP(dir, rmdir_async, dir, dentry);
+	else
+		error = dir->i_op->rmdir(dir, dentry);
 	if (error)
 		goto out;
 
@@ -4613,7 +4687,7 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if (!dir->i_op->unlink)
+	if (!dir->i_op->unlink && !dir->i_op->unlink_async)
 		return -EPERM;
 
 	inode_lock(target);
@@ -4627,7 +4701,10 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
 			error = try_break_deleg(target, delegated_inode);
 			if (error)
 				goto out;
-			error = dir->i_op->unlink(dir, dentry);
+			if (dir->i_op->unlink_async)
+				error = DO_DIROP(dir, unlink_async, dir, dentry);
+			else
+				error = dir->i_op->unlink(dir, dentry);
 			if (!error) {
 				dont_mount(dentry);
 				detach_mounts(dentry);
@@ -4761,14 +4838,17 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
 	if (error)
 		return error;
 
-	if (!dir->i_op->symlink)
+	if (!dir->i_op->symlink && !dir->i_op->symlink_async)
 		return -EPERM;
 
 	error = security_inode_symlink(dir, dentry, oldname);
 	if (error)
 		return error;
 
-	error = dir->i_op->symlink(idmap, dir, dentry, oldname);
+	if (dir->i_op->symlink_async)
+		error = DO_DIROP(dir, symlink_async, idmap, dir, dentry, oldname);
+	else
+		error = dir->i_op->symlink(idmap, dir, dentry, oldname);
 	if (!error)
 		fsnotify_create(dir, dentry);
 	return error;
@@ -4874,7 +4954,7 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
 	 */
 	if (HAS_UNMAPPED_ID(idmap, inode))
 		return -EPERM;
-	if (!dir->i_op->link)
+	if (!dir->i_op->link && !dir->i_op->link_async)
 		return -EPERM;
 	if (S_ISDIR(inode->i_mode))
 		return -EPERM;
@@ -4891,7 +4971,11 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
 		error = -EMLINK;
 	else {
 		error = try_break_deleg(inode, delegated_inode);
-		if (!error)
+		if (error)
+			;
+		else if (dir->i_op->link_async)
+			error = DO_DIROP(dir, link_async, old_dentry, dir, new_dentry);
+		else
 			error = dir->i_op->link(old_dentry, dir, new_dentry);
 	}
 
@@ -5083,7 +5167,7 @@ int vfs_rename(struct renamedata *rd)
 	if (error)
 		return error;
 
-	if (!old_dir->i_op->rename)
+	if (!old_dir->i_op->rename && !old_dir->i_op->rename_async)
 		return -EPERM;
 
 	/*
@@ -5166,8 +5250,14 @@ int vfs_rename(struct renamedata *rd)
 		if (error)
 			goto out;
 	}
-	error = old_dir->i_op->rename(rd->new_mnt_idmap, old_dir, old_dentry,
-				      new_dir, new_dentry, flags);
+	if (old_dir->i_op->rename_async)
+		error = DO_DIROP(old_dir, rename_async, rd->new_mnt_idmap,
+				 old_dir, old_dentry,
+				 new_dir, new_dentry, flags);
+	else
+		error = old_dir->i_op->rename(rd->new_mnt_idmap,
+					      old_dir, old_dentry,
+					      new_dir, new_dentry, flags);
 	if (error)
 		goto out;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f81d6bc65fe4..e414400c2487 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2187,6 +2187,14 @@ int wrap_directory_iterator(struct file *, struct dir_context *,
 	static int shared_##x(struct file *file , struct dir_context *ctx) \
 	{ return wrap_directory_iterator(file, ctx, x); }
 
+struct dirop_ret {
+	union {
+		int err;
+		struct dentry *dentry;
+	};
+	void (*done_cb)(struct dirop_ret*);
+};
+
 struct inode_operations {
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
@@ -2197,17 +2205,30 @@ struct inode_operations {
 
 	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
 		       umode_t, bool);
+	int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,
+		       umode_t, bool, struct dirop_ret *);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
+	int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
 	int (*unlink) (struct inode *,struct dentry *);
+	int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
 	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
 			const char *);
+	int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,
+			const char *, struct dirop_ret *);
 	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,
 		      umode_t);
+	struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,
+		      umode_t, struct dirop_ret *);
 	int (*rmdir) (struct inode *,struct dentry *);
+	int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
 	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,
 		      umode_t,dev_t);
+	int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,
+		      umode_t,dev_t, struct dirop_ret *);
 	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
 			struct inode *, struct dentry *, unsigned int);
+	int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
+			struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
 	int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
 	int (*getattr) (struct mnt_idmap *, const struct path *,
 			struct kstat *, u32, unsigned int);
@@ -2218,6 +2239,9 @@ struct inode_operations {
 	int (*atomic_open)(struct inode *, struct dentry *,
 			   struct file *, unsigned open_flag,
 			   umode_t create_mode);
+	int (*atomic_open_async)(struct inode *, struct dentry *,
+			   struct file *, unsigned open_flag,
+			   umode_t create_mode, struct dirop_ret *);
 	int (*tmpfile) (struct mnt_idmap *, struct inode *,
 			struct file *, umode_t);
 	struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (8 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 13:22   ` Christian Brauner
  2025-02-06  5:42 ` [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove operations NeilBrown
                   ` (11 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

If a filesystem supports _async ops for some directory ops, we can take a
"shared" lock on i_rwsem; otherwise we must take an "exclusive" lock.  As
the filesystem may support some async ops but not others, we need to
easily determine which.

With this patch we group the ops into 4 groups that are likely to be
supported together:

CREATE: create, link, symlink, mkdir, mknod
REMOVE: rmdir, unlink
RENAME: rename
OPEN: atomic_open, create

and set S_ASYNC_XXX for each when the inode is initialised.

We also add a LOOKUP_REMOVE intent flag which will be used by locking
interfaces to help know which group is being used.

Signed-off-by: NeilBrown <neilb@suse.de>
---
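
For illustration only (not part of the patch): a minimal userspace sketch
of the grouping rule described above.  struct ops_model, shared_ok() and
the M_ASYNC_* names are hypothetical stand-ins, not kernel symbols, and
the REMOVE group follows the commit message (unlink, rmdir).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the S_ASYNC_* bits; values are illustrative. */
enum {
	M_ASYNC_CREATE = 1 << 0,
	M_ASYNC_REMOVE = 1 << 1,
	M_ASYNC_RENAME = 1 << 2,
	M_ASYNC_OPEN   = 1 << 3,
};

/* Hypothetical model: one bool per op, true if the filesystem provides it. */
struct ops_model {
	bool create, create_async;
	bool link, link_async;
	bool symlink, symlink_async;
	bool mkdir, mkdir_async;
	bool mknod, mknod_async;
	bool unlink, unlink_async;
	bool rmdir, rmdir_async;
	bool rename_async;
	bool atomic_open, atomic_open_async;
};

/* An op never forces an exclusive lock if it is async or simply absent. */
static bool shared_ok(bool async_op, bool sync_op)
{
	return async_op || !sync_op;
}

/* Compute the group flags the way the commit message describes them. */
static unsigned int model_inode_flags(const struct ops_model *o)
{
	unsigned int flags = 0;

	assert(o);
	if (shared_ok(o->create_async, o->create) &&
	    shared_ok(o->link_async, o->link) &&
	    shared_ok(o->symlink_async, o->symlink) &&
	    shared_ok(o->mkdir_async, o->mkdir) &&
	    shared_ok(o->mknod_async, o->mknod))
		flags |= M_ASYNC_CREATE;
	if (shared_ok(o->unlink_async, o->unlink) &&
	    shared_ok(o->rmdir_async, o->rmdir))
		flags |= M_ASYNC_REMOVE;
	if (o->rename_async)
		flags |= M_ASYNC_RENAME;
	if (o->atomic_open_async ||
	    (!o->atomic_open && o->create_async))
		flags |= M_ASYNC_OPEN;
	return flags;
}
```

Note that in this model an inode with no ops at all gets the CREATE and
REMOVE bits, since nothing in those groups demands an exclusive lock,
while RENAME and OPEN are only set when an async op is actually present.
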
 fs/dcache.c           | 24 ++++++++++++++++++++++++
 include/linux/fs.h    |  5 +++++
 include/linux/namei.h |  5 +++--
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e49607d00d2d..37c0f655166d 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -384,6 +384,27 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
 	smp_store_release(&dentry->d_flags, flags);
 }
 
+static void set_inode_flags(struct inode *inode)
+{
+	const struct inode_operations *i_op = inode->i_op;
+
+	lockdep_assert_held(&inode->i_lock);
+	if ((i_op->create_async || !i_op->create) &&
+	    (i_op->link_async || !i_op->link) &&
+	    (i_op->symlink_async || !i_op->symlink) &&
+	    (i_op->mkdir_async || !i_op->mkdir) &&
+	    (i_op->mknod_async || !i_op->mknod))
+		inode->i_flags |= S_ASYNC_CREATE;
+	if ((i_op->unlink_async || !i_op->unlink) &&
+	    (i_op->rmdir_async || !i_op->rmdir))
+		inode->i_flags |= S_ASYNC_REMOVE;
+	if (i_op->rename_async)
+		inode->i_flags |= S_ASYNC_RENAME;
+	if (i_op->atomic_open_async ||
+	    (!i_op->atomic_open && i_op->create_async))
+		inode->i_flags |= S_ASYNC_OPEN;
+}
+
 static inline void __d_clear_type_and_inode(struct dentry *dentry)
 {
 	unsigned flags = READ_ONCE(dentry->d_flags);
@@ -1893,6 +1914,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 	raw_write_seqcount_begin(&dentry->d_seq);
 	__d_set_inode_and_type(dentry, inode, add_flags);
 	raw_write_seqcount_end(&dentry->d_seq);
+	set_inode_flags(inode);
 	fsnotify_update_flags(dentry);
 	spin_unlock(&dentry->d_lock);
 }
@@ -1999,6 +2021,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
 
 		spin_lock(&new->d_lock);
 		__d_set_inode_and_type(new, inode, add_flags);
+		set_inode_flags(inode);
 		hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
 		if (!disconnected) {
 			hlist_bl_lock(&sb->s_roots);
@@ -2701,6 +2724,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 		raw_write_seqcount_begin(&dentry->d_seq);
 		__d_set_inode_and_type(dentry, inode, add_flags);
 		raw_write_seqcount_end(&dentry->d_seq);
+		set_inode_flags(inode);
 		fsnotify_update_flags(dentry);
 	}
 	__d_rehash(dentry);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e414400c2487..9a9282fef347 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2361,6 +2361,11 @@ struct super_operations {
 #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
 #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
 
+#define S_ASYNC_CREATE	BIT(18)	/* create, link, symlink, mkdir, mknod all _async */
+#define S_ASYNC_REMOVE	BIT(19)	/* unlink, rmdir both _async */
+#define S_ASYNC_RENAME	BIT(20) /* rename_async supported */
+#define S_ASYNC_OPEN	BIT(21) /* atomic_open_async or create_async supported */
+
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
  * flags just means all the inodes inherit those flags by default. It might be
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 76c587a5ec3a..72e351640406 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -40,10 +40,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
 #define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
 #define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
+#define LOOKUP_REMOVE		BIT(20)	/* ... in target of object removal */
 
 #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
-				 LOOKUP_RENAME_TARGET)
-/* 4 spare bits for intent */
+				 LOOKUP_RENAME_TARGET | LOOKUP_REMOVE)
+/* 3 spare bits for intent */
 
 /* Scoping flags for lookup. */
 #define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove  operations.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (9 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-08  1:38   ` Al Viro
  2025-02-09  6:40   ` Al Viro
  2025-02-06  5:42 ` [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock updates NeilBrown
                   ` (10 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

d_update_lock(), d_update_trylock(), d_update_unlock() are added which
can be used to get an exclusive lock on a dentry in preparation for
updating it.

As contention on a name is rare this is optimised for the uncontended
case.  A bit is set under the d_lock spinlock to claim the lock, and
wait_var_event_spinlock() is used when waiting is needed.  To avoid
sending a wakeup when not needed we have a second bit flag to indicate
if there are any waiters.

This locking is used in lookup_and_lock().

Once the exclusive "update" lock is obtained on the dentry we must make
sure it wasn't unlinked or renamed while we slept.  If it was we repeat
the lookup.

We also ensure that the parent isn't similarly locked.  This will be
used to protect a directory during rmdir.

Signed-off-by: NeilBrown <neilb@suse.de>
---
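
For illustration only (not part of the patch): a single-threaded
userspace sketch of the two-bit scheme described above.  It models only
the flag handling; the real code does this under ->d_lock and sleeps in
wait_var_event_spinlock().  struct model_dentry, the M_PAR_* names and
the wakeups counter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define M_PAR_UPDATE	(1u << 0)	/* stand-in for DCACHE_PAR_UPDATE */
#define M_PAR_WAITER	(1u << 1)	/* stand-in for DCACHE_PAR_WAITER */

struct model_dentry {
	unsigned int flags;	/* protected by a spinlock in the real code */
	unsigned int wakeups;	/* counts wakeups that would be sent */
};

/*
 * Try to claim the lock.  On failure, record that someone is waiting so
 * that the holder knows a wakeup is needed when it unlocks.
 */
static bool model_lock_or_mark_waiter(struct model_dentry *d)
{
	if (d->flags & M_PAR_UPDATE) {
		d->flags |= M_PAR_WAITER;
		return false;
	}
	d->flags |= M_PAR_UPDATE;
	return true;
}

/* Release the lock; a wakeup is only sent if a waiter registered. */
static void model_unlock(struct model_dentry *d)
{
	assert(d->flags & M_PAR_UPDATE);
	if (d->flags & M_PAR_WAITER)
		d->wakeups++;
	d->flags &= ~(M_PAR_UPDATE | M_PAR_WAITER);
}
```

In the uncontended case the waiter bit is never set, so unlocking costs
one spinlock round trip and no wakeup.
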
 fs/dcache.c            |   5 +-
 fs/internal.h          |  18 +++++++
 fs/namei.c             | 110 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |   4 ++
 4 files changed, 134 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 37c0f655166d..e705696ca57e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1675,9 +1675,10 @@ EXPORT_SYMBOL(d_invalidate);
  * available. On a success the dentry is returned. The name passed in is
  * copied and the copy passed in may be reused after this call.
  */
- 
+
 static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 {
+	static struct lock_class_key __key;
 	struct dentry *dentry;
 	char *dname;
 	int err;
@@ -1735,6 +1736,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_HLIST_NODE(&dentry->d_sib);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
 
+	lockdep_init_map(&dentry->d_update_map, "DCACHE_PAR_UPDATE", &__key, 0);
+
 	if (dentry->d_op && dentry->d_op->d_init) {
 		err = dentry->d_op->d_init(dentry);
 		if (err) {
diff --git a/fs/internal.h b/fs/internal.h
index e7f02ae1e098..5cb9a34e26e8 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -225,6 +225,24 @@ extern struct dentry *__d_lookup_rcu(const struct dentry *parent,
 				const struct qstr *name, unsigned *seq);
 extern void d_genocide(struct dentry *);
 
+extern bool d_update_lock(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *last,
+			  unsigned int subclass);
+
+extern bool d_update_trylock(struct dentry *dentry,
+			     struct dentry *base,
+			     const struct qstr *last);
+
+static inline void d_update_unlock(struct dentry *dentry)
+{
+	lock_map_release(&dentry->d_update_map);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_PAR_WAITER)
+		wake_up_var_locked(&dentry->d_flags, &dentry->d_lock);
+	dentry->d_flags &= ~(DCACHE_PAR_UPDATE | DCACHE_PAR_WAITER);
+	spin_unlock(&dentry->d_lock);
+}
+
 /*
  * pipe.c
  */
diff --git a/fs/namei.c b/fs/namei.c
index eadde9de73bf..145ae07f9b8c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1750,6 +1750,110 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
 }
 EXPORT_SYMBOL(lookup_one_qstr);
 
+/*
+ * dentry locking for updates.
+ * When modifying a directory the target dentry will be locked by
+ * setting DCACHE_PAR_UPDATE under ->d_lock.  If it is already set,
+ * DCACHE_PAR_WAITER is set to ensure a wakeup is sent, and we wait
+ * using wait_var_event_spinlock().
+ * The DCACHE_PAR_UPDATE bit will only be set in a dentry if it is
+ * NOT set in the parent.  This avoids commencing a new operation in
+ * a directory that is being asynchronously deleted using ->rmdir_async.
+ * Instead of holding ->d_lock on the parent while testing the flag, we
+ * use memory ordering to ensure correctness.  Locking a child
+ * retests the parent *after* setting the bit, and deleting a directory
+ * requires testing all children *after* setting the bit in the parent.
+ */
+
+static bool check_dentry_locked(struct dentry *de)
+{
+	if (de->d_flags & DCACHE_PAR_UPDATE) {
+		de->d_flags |= DCACHE_PAR_WAITER;
+		return true;
+	}
+	return false;
+}
+
+bool d_update_lock(struct dentry *dentry,
+		   struct dentry *base, const struct qstr *last,
+		   unsigned int subclass)
+{
+	lock_acquire_exclusive(&dentry->d_update_map, subclass, 0, NULL, _THIS_IP_);
+again:
+	spin_lock(&dentry->d_lock);
+	wait_var_event_spinlock(&dentry->d_flags,
+				!check_dentry_locked(dentry),
+				&dentry->d_lock);
+	if (d_is_positive(dentry)) {
+		rcu_read_lock(); /* needed for d_same_name() */
+		if (
+			/* Was unlinked while we waited ?*/
+			d_unhashed(dentry) ||
+			/* Or was dentry renamed ?? */
+			dentry->d_parent != base ||
+			dentry->d_name.hash != last->hash ||
+			!d_same_name(dentry, base, last)
+		) {
+			rcu_read_unlock();
+			spin_unlock(&dentry->d_lock);
+			lock_map_release(&dentry->d_update_map);
+			return false;
+		}
+		rcu_read_unlock();
+	}
+	/* Must ensure DCACHE_PAR_UPDATE in child is visible before reading
+	 * from parent
+	 */
+	smp_store_mb(dentry->d_flags, dentry->d_flags | DCACHE_PAR_UPDATE);
+	if (base->d_flags & DCACHE_PAR_UPDATE) {
+		/* We cannot grant DCACHE_PAR_UPDATE on a dentry while
+		 * it is held on the parent
+		 */
+		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+		spin_unlock(&dentry->d_lock);
+		spin_lock(&base->d_lock);
+		wait_var_event_spinlock(&base->d_flags,
+					!check_dentry_locked(base),
+					&base->d_lock);
+		spin_unlock(&base->d_lock);
+		goto again;
+	}
+	spin_unlock(&dentry->d_lock);
+	return true;
+}
+
+bool d_update_trylock(struct dentry *dentry,
+		      struct dentry *base,
+		      const struct qstr *last)
+{
+	bool ret = false;
+
+	spin_lock(&dentry->d_lock);
+	rcu_read_lock(); /* needed for d_same_name() */
+	if (!(smp_load_acquire(&dentry->d_flags) & DCACHE_PAR_UPDATE) &&
+	    !(dentry->d_parent->d_flags & DCACHE_PAR_UPDATE)) {
+		if (!base || !(
+			/* Was unlinked before we got spinlock ?*/
+			d_unhashed(dentry) ||
+			/* Or was dentry renamed ?? */
+			dentry->d_parent != base ||
+			dentry->d_name.hash != last->hash ||
+			!d_same_name(dentry, base, last)
+		)) {
+			lock_map_acquire_try(&dentry->d_update_map);
+			smp_store_mb(dentry->d_flags,
+				     dentry->d_flags | DCACHE_PAR_UPDATE);
+			if (dentry->d_parent->d_flags & DCACHE_PAR_UPDATE)
+				dentry->d_flags &= ~DCACHE_PAR_UPDATE;
+			else
+				ret = true;
+		}
+	}
+	rcu_read_unlock();
+	spin_unlock(&dentry->d_lock);
+	return ret;
+}
+
 static struct dentry *lookup_and_lock_nested(const struct qstr *last,
 					     struct dentry *base,
 					     unsigned int lookup_flags,
@@ -1759,8 +1863,9 @@ static struct dentry *lookup_and_lock_nested(const struct qstr *last,
 
 	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
 		inode_lock_nested(base->d_inode, subclass);
-
-	dentry = lookup_one_qstr(last, base, lookup_flags);
+	do {
+		dentry = lookup_one_qstr(last, base, lookup_flags);
+	} while (!IS_ERR(dentry) && !d_update_lock(dentry, base, last, subclass));
 	if (IS_ERR(dentry) && !(lookup_flags & LOOKUP_PARENT_LOCKED)) {
 			inode_unlock(base->d_inode);
 	}
@@ -1779,6 +1884,7 @@ void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
 			  unsigned int lookup_flags)
 {
 	d_lookup_done(dentry);
+	d_update_unlock(dentry);
 	dput(dentry);
 	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
 		inode_unlock(base->d_inode);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index d5816cf19538..f891fb1be63b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -111,6 +111,8 @@ struct dentry {
 					 * possible!
 					 */
 
+	/* lockdep tracking of DCACHE_PAR_UPDATE locks */
+	struct lockdep_map		d_update_map;
 	union {
 		struct list_head d_lru;		/* LRU list */
 		wait_queue_head_t *d_wait;	/* in-lookup ones only */
@@ -232,6 +234,8 @@ struct dentry_operations {
 #define DCACHE_DENTRY_CURSOR		BIT(25)
 #define DCACHE_NORCU			BIT(26) /* No RCU delay for freeing */
 
+#define DCACHE_PAR_UPDATE		BIT(27) /* Locked for update */
+#define DCACHE_PAR_WAITER		BIT(28) /* someone is waiting for PAR_UPDATE */
 extern seqlock_t rename_lock;
 
 /*
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock updates
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (10 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove operations NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06  5:42 ` [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc NeilBrown
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

d_splice_alias() - via __d_unalias() - currently assumes that taking a
shared lock on the parent directory locks against any change to the
parent/name of the dentry.  This will no longer be the case with
shared-lock updates.  We also need a DCACHE_PAR_UPDATE lock on the
dentry.

This patch adds a call to d_update_trylock() to get this lock, if
possible.  This lock ensures that the test on ->d_parent and ->d_name in
d_update_lock() will not be invalidated by the __d_move() in __d_unalias().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/dcache.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e705696ca57e..fb331596f1b1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3036,13 +3036,17 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
 		goto out_err;
 	m2 = &alias->d_parent->d_inode->i_rwsem;
 out_unalias:
+	if (!d_update_trylock(dentry, NULL, NULL))
+		goto out_err;
 	if (alias->d_op && alias->d_op->d_unalias_trylock &&
 	    !alias->d_op->d_unalias_trylock(alias))
-		goto out_err;
+		goto out_err2;
 	__d_move(alias, dentry, false);
 	if (alias->d_op && alias->d_op->d_unalias_unlock)
 		alias->d_op->d_unalias_unlock(alias);
 	ret = 0;
+out_err2:
+	d_update_unlock(dentry);
 out_err:
 	if (m2)
 		up_read(m2);
@@ -3073,6 +3077,10 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
  * In that case, we know that the inode will be a regular file, and also this
  * will only occur during atomic_open. So we need to check for the dentry
  * being already hashed only in the final case.
+ *
+ * @dentry must have a valid ->d_parent and that directory must be
+ * locked (i_rwsem) either exclusively or shared.  If shared then
+ * @dentry must have %DCACHE_PAR_LOOKUP or %DCACHE_PAR_UPDATE set.
  */
 struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 {
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (11 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock updates NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-07 20:28   ` Al Viro
  2025-02-08  1:30   ` Al Viro
  2025-02-06  5:42 ` [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed NeilBrown
                   ` (8 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

When we call ->revalidate we want to be sure we are revalidating the
expected name.  As a shared lock on i_rwsem no longer prevents renames
we need to lock the dentry and ensure it still has the expected name.

So pass the parent and name to d_revalidate() and be prepared to retry
the lookup if it returns -EAGAIN.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c | 49 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 11 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 145ae07f9b8c..3a107d6098be 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -957,12 +957,24 @@ static bool try_to_unlazy_next(struct nameidata *nd, struct dentry *dentry)
 }
 
 static inline int d_revalidate(struct inode *dir, const struct qstr *name,
-			       struct dentry *dentry, unsigned int flags)
+			       struct dentry *dentry, unsigned int flags,
+			       struct dentry *base, const struct qstr *last)
 {
-	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE))
-		return dentry->d_op->d_revalidate(dir, name, dentry, flags);
-	else
+	int status;
+
+	if (likely(!(dentry->d_flags & DCACHE_OP_REVALIDATE)))
 		return 1;
+
+	if (flags & LOOKUP_RCU) {
+		if (!d_update_trylock(dentry, base, last))
+			return -ECHILD;
+	} else {
+		if (!d_update_lock(dentry, base, last, I_MUTEX_NORMAL))
+			return -EAGAIN;
+	}
+	status = dentry->d_op->d_revalidate(dir, name, dentry, flags);
+	d_update_unlock(dentry);
+	return status;
 }
 
 /**
@@ -1686,13 +1698,18 @@ static struct dentry *lookup_dcache(const struct qstr *name,
 				    struct dentry *dir,
 				    unsigned int flags)
 {
-	struct dentry *dentry = d_lookup(dir, name);
+	struct dentry *dentry;
+again:
+	dentry = d_lookup(dir, name);
 	if (dentry) {
-		int error = d_revalidate(dir->d_inode, name, dentry, flags);
+		int error = d_revalidate(dir->d_inode, name, dentry, flags, dir, name);
 		if (unlikely(error <= 0)) {
 			if (!error)
 				d_invalidate(dentry);
 			dput(dentry);
+			if (error == -EAGAIN)
+				/* raced with rename etc */
+				goto again;
 			return ERR_PTR(error);
 		}
 	}
@@ -1915,6 +1932,7 @@ static struct dentry *lookup_fast(struct nameidata *nd)
 	 * of a false negative due to a concurrent rename, the caller is
 	 * going to fall back to non-racy lookup.
 	 */
+again:
 	if (nd->flags & LOOKUP_RCU) {
 		dentry = __d_lookup_rcu(parent, &nd->last, &nd->next_seq);
 		if (unlikely(!dentry)) {
@@ -1930,7 +1948,7 @@ static struct dentry *lookup_fast(struct nameidata *nd)
 		if (read_seqcount_retry(&parent->d_seq, nd->seq))
 			return ERR_PTR(-ECHILD);
 
-		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags);
+		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags, parent, &nd->last);
 		if (likely(status > 0))
 			return dentry;
 		if (!try_to_unlazy_next(nd, dentry))
@@ -1938,17 +1956,19 @@ static struct dentry *lookup_fast(struct nameidata *nd)
 		if (status == -ECHILD)
 			/* we'd been told to redo it in non-rcu mode */
 			status = d_revalidate(nd->inode, &nd->last,
-					      dentry, nd->flags);
+					      dentry, nd->flags, parent, &nd->last);
 	} else {
 		dentry = __d_lookup(parent, &nd->last);
 		if (unlikely(!dentry))
 			return NULL;
-		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags);
+		status = d_revalidate(nd->inode, &nd->last, dentry, nd->flags, parent, &nd->last);
 	}
 	if (unlikely(status <= 0)) {
 		if (!status)
 			d_invalidate(dentry);
 		dput(dentry);
+		if (status == -EAGAIN)
+			goto again;
 		return ERR_PTR(status);
 	}
 	return dentry;
@@ -1970,7 +1990,7 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 	if (IS_ERR(dentry))
 		return dentry;
 	if (unlikely(!d_in_lookup(dentry))) {
-		int error = d_revalidate(inode, name, dentry, flags);
+		int error = d_revalidate(inode, name, dentry, flags, dir, name);
 		if (unlikely(error <= 0)) {
 			if (!error) {
 				d_invalidate(dentry);
@@ -1978,6 +1998,8 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 				goto again;
 			}
 			dput(dentry);
+			if (error == -EAGAIN)
+				goto again;
 			dentry = ERR_PTR(error);
 		}
 	} else {
@@ -3777,6 +3799,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		return ERR_PTR(-ENOENT);
 
 	file->f_mode &= ~FMODE_CREATED;
+again:
 	dentry = d_lookup(dir, &nd->last);
 	for (;;) {
 		if (!dentry) {
@@ -3787,9 +3810,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		if (d_in_lookup(dentry))
 			break;
 
-		error = d_revalidate(dir_inode, &nd->last, dentry, nd->flags);
+		error = d_revalidate(dir_inode, &nd->last, dentry, nd->flags, dir, &nd->last);
 		if (likely(error > 0))
 			break;
+		if (error == -EAGAIN) {
+			dput(dentry);
+			goto again;
+		}
 		if (error)
 			goto out_dput;
 		d_invalidate(dentry);
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (12 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06 14:06   ` Christian Brauner
  2025-02-07 21:06   ` Al Viro
  2025-02-06  5:42 ` [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when possible NeilBrown
                   ` (7 subsequent siblings)
  21 siblings, 2 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

vfs_rmdir takes an exclusive lock on the target directory to ensure
nothing new is created in it while the rmdir progresses.  With the
possibility of async updates continuing after the inode lock is dropped
we now need extra protection.

Any async updates will have DCACHE_PAR_UPDATE set on the dentry.  We
simply wait for that flag to be cleared on all children.

Signed-off-by: NeilBrown <neilb@suse.de>
---
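
For illustration only (not part of the patch): a userspace sketch of the
"wait, then rescan from the start" loop that vfs_rmdir() uses below.
Here the children are a plain array and model_wait() completes at once;
both are hypothetical simplifications of the sibling list and of
d_update_wait().

```c
#include <assert.h>
#include <stddef.h>

#define M_PAR_UPDATE	(1u << 0)	/* stand-in for DCACHE_PAR_UPDATE */

struct model_child {
	unsigned int flags;
};

/* Stand-in for d_update_wait(): the pending async op finishes at once. */
static void model_wait(struct model_child *c)
{
	c->flags &= ~M_PAR_UPDATE;
}

/* Wait for every locked child; returns how many waits were needed. */
static unsigned int model_wait_all(struct model_child *kids, size_t n)
{
	unsigned int waits = 0;
	size_t i;

	assert(kids || n == 0);
again:
	for (i = 0; i < n; i++) {
		if (kids[i].flags & M_PAR_UPDATE) {
			/*
			 * The real code drops ->d_lock before waiting, so
			 * the child list may have changed: restart the scan
			 * rather than continue from this position.
			 */
			model_wait(&kids[i]);
			waits++;
			goto again;
		}
	}
	return waits;
}
```

The loop terminates because each pass either finds no locked child or
clears one, and no new child can become locked while the directory's
inode lock is held.
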
 fs/dcache.c |  2 +-
 fs/namei.c  | 40 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index fb331596f1b1..90dee859d138 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -53,7 +53,7 @@
  *   - d_lru
  *   - d_count
  *   - d_unhashed()
- *   - d_parent and d_chilren
+ *   - d_parent and d_children
  *   - childrens' d_sib and d_parent
  *   - d_u.d_alias, d_inode
  *
diff --git a/fs/namei.c b/fs/namei.c
index 3a107d6098be..e8a85c9f431c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1839,6 +1839,27 @@ bool d_update_lock(struct dentry *dentry,
 	return true;
 }
 
+static void d_update_wait(struct dentry *dentry, unsigned int subclass)
+{
+	/* Note this may only ever be called in a context where we have
+	 * a lock preventing this dentry from becoming locked, possibly
+	 * an update lock on the parent dentry.  There must be an smp_mb()
+	 * after that lock is taken and before this is called so that
+	 * the following test is safe. d_update_lock() provides that
+	 * barrier.
+	 */
+	if (!(dentry->d_flags & DCACHE_PAR_UPDATE))
+		return;
+	lock_acquire_exclusive(&dentry->d_update_map, subclass,
+			       0, NULL, _THIS_IP_);
+	spin_lock(&dentry->d_lock);
+	wait_var_event_spinlock(&dentry->d_flags,
+				!check_dentry_locked(dentry),
+				&dentry->d_lock);
+	spin_unlock(&dentry->d_lock);
+	lock_map_release(&dentry->d_update_map);
+}
+
 bool d_update_trylock(struct dentry *dentry,
 		      struct dentry *base,
 		      const struct qstr *last)
@@ -4688,6 +4709,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 		     struct dentry *dentry)
 {
 	int error = may_delete(idmap, dir, dentry, 1);
+	struct dentry *child;
 
 	if (error)
 		return error;
@@ -4697,6 +4719,24 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 
 	dget(dentry);
 	inode_lock(dentry->d_inode);
+	/*
+	 * Some children of dentry might be active in an async update.
+	 * We need to wait for them.  New children cannot be locked
+	 * while the inode lock is held.
+	 */
+again:
+	spin_lock(&dentry->d_lock);
+	for (child = d_first_child(dentry); child;
+	     child = d_next_sibling(child)) {
+		if (child->d_flags & DCACHE_PAR_UPDATE) {
+			dget(child);
+			spin_unlock(&dentry->d_lock);
+			d_update_wait(child, I_MUTEX_CHILD);
+			dput(child);
+			goto again;
+		}
+	}
+	spin_unlock(&dentry->d_lock);
 
 	error = -EBUSY;
 	if (is_local_mountpoint(dentry) ||
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when possible.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (13 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06  5:42 ` [PATCH 16/19] VFS: add lookup_and_lock_rename() NeilBrown
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

lookup_and_lock() and done_lookup_and_lock() are now told, via LOOKUP_
intent flags, what operation is being performed, including a new
LOOKUP_REMOVE.

They use this to determine whether shared or exclusive locking is
needed.

If all filesystems eventually support all async interfaces, this locking
can be discarded.

Signed-off-by: NeilBrown <neilb@suse.de>
---
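
For illustration only (not part of the patch): a userspace sketch of how
the intent flag selects which inode flag decides between a shared and an
exclusive lock.  The M_* constants are hypothetical stand-ins for the
LOOKUP_* and S_ASYNC_* bits and do not use the kernel's values.

```c
#include <stdbool.h>

#define M_LOOKUP_CREATE	(1u << 0)	/* stand-in for LOOKUP_CREATE */
#define M_LOOKUP_REMOVE	(1u << 1)	/* stand-in for LOOKUP_REMOVE */

#define M_ASYNC_CREATE	(1u << 0)	/* stand-in for S_ASYNC_CREATE */
#define M_ASYNC_REMOVE	(1u << 1)	/* stand-in for S_ASYNC_REMOVE */

/*
 * Return true if a shared lock on the directory is sufficient for the
 * operation named by lookup_flags.  With no intent flag the mask stays
 * zero, so the caller falls back to an exclusive lock.
 */
static bool model_use_shared_lock(unsigned int lookup_flags,
				  unsigned int i_flags)
{
	unsigned int shared = 0;

	if (lookup_flags & M_LOOKUP_CREATE)
		shared = M_ASYNC_CREATE;
	if (lookup_flags & M_LOOKUP_REMOVE)
		shared = M_ASYNC_REMOVE;

	return (i_flags & shared) != 0;
}
```

The lock and unlock sides must evaluate the same test on the same flags,
which is why done_lookup_and_lock() repeats the selection rather than
remembering the earlier answer.
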
 fs/namei.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index e8a85c9f431c..c7b7445c770e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1898,13 +1898,26 @@ static struct dentry *lookup_and_lock_nested(const struct qstr *last,
 					     unsigned int subclass)
 {
 	struct dentry *dentry;
+	unsigned int shared = 0;
 
-	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
-		inode_lock_nested(base->d_inode, subclass);
+	if (!(lookup_flags & LOOKUP_PARENT_LOCKED)) {
+		if (lookup_flags & LOOKUP_CREATE)
+			shared = S_ASYNC_CREATE;
+		if (lookup_flags & LOOKUP_REMOVE)
+			shared = S_ASYNC_REMOVE;
+
+		if (base->d_inode->i_flags & shared)
+			inode_lock_shared_nested(base->d_inode, subclass);
+		else
+			inode_lock_nested(base->d_inode, subclass);
+	}
 	do {
 		dentry = lookup_one_qstr(last, base, lookup_flags);
 	} while (!IS_ERR(dentry) && !d_update_lock(dentry, base, last, subclass));
 	if (IS_ERR(dentry) && !(lookup_flags & LOOKUP_PARENT_LOCKED)) {
+		if (base->d_inode->i_flags & shared)
+			inode_unlock_shared(base->d_inode);
+		else
 			inode_unlock(base->d_inode);
 	}
 	return dentry;
@@ -1921,11 +1934,22 @@ static struct dentry *lookup_and_lock(const struct qstr *last,
 void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
 			  unsigned int lookup_flags)
 {
+	unsigned int shared = 0;
+
+	if (lookup_flags & LOOKUP_CREATE)
+		shared = S_ASYNC_CREATE;
+	if (lookup_flags & LOOKUP_REMOVE)
+		shared = S_ASYNC_REMOVE;
+
 	d_lookup_done(dentry);
 	d_update_unlock(dentry);
 	dput(dentry);
-	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
-		inode_unlock(base->d_inode);
+	if (!(lookup_flags & LOOKUP_PARENT_LOCKED)) {
+		if (base->d_inode->i_flags & shared)
+			inode_unlock_shared(base->d_inode);
+		else
+			inode_unlock(base->d_inode);
+	}
 }
 EXPORT_SYMBOL(done_lookup_and_lock);
 
@@ -4004,7 +4028,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
+	if ((open_flag & O_CREAT) && !(dir->d_inode->i_flags & S_ASYNC_OPEN))
 		inode_lock(dir->d_inode);
 	else
 		inode_lock_shared(dir->d_inode);
@@ -4015,7 +4039,7 @@ static const char *open_last_lookups(struct nameidata *nd,
 		if (file->f_mode & FMODE_OPENED)
 			fsnotify_open(file);
 	}
-	if (open_flag & O_CREAT)
+	if ((open_flag & O_CREAT) && !(dir->d_inode->i_flags & S_ASYNC_OPEN))
 		inode_unlock(dir->d_inode);
 	else
 		inode_unlock_shared(dir->d_inode);
@@ -4775,7 +4799,7 @@ int do_rmdir(int dfd, struct filename *name)
 	struct path path;
 	struct qstr last;
 	int type;
-	unsigned int lookup_flags = 0;
+	unsigned int lookup_flags = LOOKUP_REMOVE;
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
@@ -4914,7 +4938,7 @@ int do_unlinkat(int dfd, struct filename *name)
 	int type;
 	struct inode *inode = NULL;
 	struct inode *delegated_inode = NULL;
-	unsigned int lookup_flags = 0;
+	unsigned int lookup_flags = LOOKUP_REMOVE;
 retry:
 	error = filename_parentat(dfd, name, lookup_flags, &path, &last, &type);
 	if (error)
-- 
2.47.1



* [PATCH 16/19] VFS: add lookup_and_lock_rename()
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (14 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when possible NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-07 21:21   ` Al Viro
  2025-02-06  5:42 ` [PATCH 17/19] nfsd: use lookup_and_lock_one() and lookup_and_lock_rename_one() NeilBrown
                   ` (5 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

lookup_and_lock_rename() combines locking and lookup for two names.
It uses the new lock_two_directories_shared() if that is appropriate for
the filesystem.

unlock_rename_shared() does either a shared unlock or an exclusive
unlock depending on how the filesystem wants rename to be handled.
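
The pairing rule described above can be sketched in userspace C.  This
is only a toy model, not the kernel code: MODEL_ASYNC_RENAME,
struct model_inode and the model_*() helpers are hypothetical names,
and the "lock" just counts holders instead of blocking.  The point it
illustrates is that lock and unlock must derive the mode from the same
flag test.

```c
#include <assert.h>

/* Hypothetical stand-in for the S_ASYNC_RENAME inode flag. */
#define MODEL_ASYNC_RENAME (1u << 0)

/* Toy inode: counts holders instead of blocking like a real rwsem. */
struct model_inode {
	unsigned int i_flags;
	int readers;	/* number of shared holders */
	int writer;	/* 1 while held exclusively */
};

/* Single source of truth for the lock mode used by rename. */
static int model_rename_is_shared(const struct model_inode *dir)
{
	return (dir->i_flags & MODEL_ASYNC_RENAME) != 0;
}

static void model_lock_for_rename(struct model_inode *dir)
{
	if (model_rename_is_shared(dir)) {
		assert(!dir->writer);
		dir->readers++;
	} else {
		assert(!dir->writer && dir->readers == 0);
		dir->writer = 1;
	}
}

/* Must mirror model_lock_for_rename(): same flag, same mode. */
static void model_unlock_for_rename(struct model_inode *dir)
{
	if (model_rename_is_shared(dir)) {
		assert(dir->readers > 0);
		dir->readers--;
	} else {
		assert(dir->writer);
		dir->writer = 0;
	}
}
```

In this model a directory with the flag set can be "locked" by two
renames at once, while one without the flag admits a single holder.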

lookup_and_lock_rename_one() and done_lookup_and_lock_rename() are
exported for other modules to use.

As a rename can continue asynchronously after the inode lock is dropped,
lock_two_directories() and lock_two_directories_shared() must ensure
that no such rename is in progress before looking at ->d_parent.  This
requires a call to d_update_wait().  Note that if the dentry is locked
for update it must be for a rename.  It cannot be a create or a
(successful) rmdir as these dentries are not empty - except possibly the
target directory, but waiting for the rmdir there is still needed of
course.
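
The ancestor walk that these waits protect can be sketched in userspace
C.  This is a simplified model under stated assumptions: struct
model_dentry, enum model_relation and model_classify() are hypothetical
names, a root is modelled as a dentry whose d_parent points to itself,
and no locking or waiting is done.  Each step that reads ->d_parent
marks a place where the real code would first wait for any async
rename of that dentry to finish.

```c
#include <assert.h>
#include <stddef.h>

/* Toy dentry: a root is its own parent, as in the dcache. */
struct model_dentry {
	struct model_dentry *d_parent;
};

enum model_relation {
	MODEL_P1_UNDER_P2,	/* p2 is an ancestor of p1 */
	MODEL_P2_UNDER_P1,	/* p1 is an ancestor of p2 */
	MODEL_COMMON_ROOT,	/* neither is an ancestor of the other */
	MODEL_NO_COMMON_ROOT,	/* different trees (-EXDEV in the patch) */
};

/*
 * Classify two distinct directories.  *trap is set to the child of the
 * ancestor that lies on the path to the other directory, or NULL.
 */
static enum model_relation model_classify(struct model_dentry *p1,
					  struct model_dentry *p2,
					  struct model_dentry **trap)
{
	struct model_dentry *p = p1, *q = p2, *r;

	*trap = NULL;
	/* Walk up from p1 until we meet p2 or reach a root. */
	while ((r = p->d_parent) != p2 && r != p)
		p = r;
	if (r == p2) {
		*trap = p;
		return MODEL_P1_UNDER_P2;
	}
	/* p is now the root above p1; walk up from p2. */
	while ((r = q->d_parent) != p1 && r != p && r != q)
		q = r;
	if (r == p1) {
		*trap = q;
		return MODEL_P2_UNDER_P1;
	}
	if (r == p)
		return MODEL_COMMON_ROOT;
	return MODEL_NO_COMMON_ROOT;
}
```

If ->d_parent changed while either loop was running, the
classification (and so the lock order chosen from it) could be wrong,
which is why the walk must not overlap with an in-flight rename.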

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/namei.c            | 230 +++++++++++++++++++++++++++++++++++-------
 include/linux/namei.h |   7 ++
 2 files changed, 199 insertions(+), 38 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c7b7445c770e..771e9d7b620c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3451,8 +3451,14 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
 {
 	struct dentry *p = p1, *q = p2, *r;
 
-	while ((r = p->d_parent) != p2 && r != p)
+	/* Ensure d_update_wait() tests are safe - one barrier for all */
+	smp_mb();
+
+	d_update_wait(p, I_MUTEX_NORMAL);
+	while ((r = p->d_parent) != p2 && r != p) {
 		p = r;
+		d_update_wait(p, I_MUTEX_NORMAL);
+	}
 	if (r == p2) {
 		// p is a child of p2 and an ancestor of p1 or p1 itself
 		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
@@ -3461,8 +3467,11 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
 	}
 	// p is the root of connected component that contains p1
 	// p2 does not occur on the path from p to p1
-	while ((r = q->d_parent) != p1 && r != p && r != q)
+	d_update_wait(q, I_MUTEX_NORMAL);
+	while ((r = q->d_parent) != p1 && r != p && r != q) {
 		q = r;
+		d_update_wait(q, I_MUTEX_NORMAL);
+	}
 	if (r == p1) {
 		// q is a child of p1 and an ancestor of p2 or p2 itself
 		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
@@ -3479,6 +3488,46 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
 	}
 }
 
+static struct dentry *lock_two_directories_shared(struct dentry *p1, struct dentry *p2)
+{
+	struct dentry *p = p1, *q = p2, *r;
+
+	/* Ensure d_update_wait() tests are safe - one barrier for all */
+	smp_mb();
+
+	d_update_wait(p1, I_MUTEX_NORMAL);
+	while ((r = p->d_parent) != p2 && r != p) {
+		p = r;
+		d_update_wait(p, I_MUTEX_NORMAL);
+	}
+	if (r == p2) {
+		// p is a child of p2 and an ancestor of p1 or p1 itself
+		inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT);
+		inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT2);
+		return p;
+	}
+	// p is the root of connected component that contains p1
+	// p2 does not occur on the path from p to p1
+	d_update_wait(q, I_MUTEX_NORMAL);
+	while ((r = q->d_parent) != p1 && r != p && r != q) {
+		q = r;
+		d_update_wait(q, I_MUTEX_NORMAL);
+	}
+	if (r == p1) {
+		// q is a child of p1 and an ancestor of p2 or p2 itself
+		inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+		inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT2);
+		return q;
+	} else if (likely(r == p)) {
+		// both p2 and p1 are descendents of p
+		inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+		inode_lock_shared_nested(p2->d_inode, I_MUTEX_PARENT2);
+		return NULL;
+	} else { // no common ancestor at the time we'd been called
+		return ERR_PTR(-EXDEV);
+	}
+}
+
 /*
  * p1 and p2 should be directories on the same fs.
  */
@@ -3494,6 +3543,134 @@ struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
 }
 EXPORT_SYMBOL(lock_rename);
 
+static void unlock_rename_shared(struct dentry *p1, struct dentry *p2)
+{
+	if (!(p1->d_inode->i_flags & S_ASYNC_RENAME))
+		unlock_rename(p1, p2);
+	else {
+		inode_unlock_shared(p1->d_inode);
+		if (p1 != p2) {
+			inode_unlock_shared(p2->d_inode);
+			mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+		}
+	}
+}
+
+static int
+lookup_and_lock_rename(struct dentry *p1, struct dentry *p2,
+		       struct dentry **d1p, struct dentry **d2p,
+		       struct qstr *last1, struct qstr *last2,
+		       unsigned int flags1, unsigned int flags2)
+{
+	struct dentry *p = NULL;
+	struct dentry *d1, *d2;
+	bool ok1, ok2;
+
+	if (p1->d_inode->i_flags & S_ASYNC_RENAME) {
+		if (p1 == p2) {
+			/* same parent - only one parent lock needed and
+			 * no s_vfs_rename_mutex */
+			inode_lock_shared_nested(p1->d_inode, I_MUTEX_PARENT);
+		} else {
+			mutex_lock(&p1->d_sb->s_vfs_rename_mutex);
+
+			p = lock_two_directories_shared(p1, p2);
+			if (IS_ERR(p)) {
+				mutex_unlock(&p1->d_sb->s_vfs_rename_mutex);
+				return PTR_ERR(p);
+			}
+		}
+	} else
+		lock_rename(p1, p2);
+retry:
+	d1 = lookup_one_qstr(last1, p1, flags1);
+	if (IS_ERR(d1))
+		goto out_unlock_1;
+	d2 = lookup_one_qstr(last2, p2, flags2);
+	if (IS_ERR(d2))
+		goto out_unlock_2;
+
+	if (d1 == p) {
+		dput(d1); dput(d2);
+		unlock_rename_shared(p1, p2);
+		if (flags1 & LOOKUP_CREATE)
+			return -EINVAL;
+		else
+			return -ENOTEMPTY;
+	}
+
+	if (d2 == p) {
+		dput(d1); dput(d2);
+		unlock_rename_shared(p1, p2);
+		if (flags2 & LOOKUP_CREATE)
+			return -EINVAL;
+		else
+			return -ENOTEMPTY;
+	}
+
+	if (d1 < d2) {
+		ok1 = d_update_lock(d1, p1, last1, I_MUTEX_PARENT);
+		ok2 = d_update_lock(d2, p2, last2, I_MUTEX_PARENT2);
+	} else if (d1 > d2) {
+		ok2 = d_update_lock(d2, p2, last2, I_MUTEX_PARENT);
+		ok1 = d_update_lock(d1, p1, last1, I_MUTEX_PARENT2);
+	} else {
+		ok1 = ok2 = d_update_lock(d1, p1, last1, I_MUTEX_PARENT);
+	}
+	if (!ok1 || !ok2) {
+		if (ok1)
+			d_update_unlock(d1);
+		if (ok2 && d2 != d1)
+			d_update_unlock(d2);
+		dput(d1);
+		dput(d2);
+		goto retry;
+	}
+	*d1p = d1; *d2p = d2;
+	return 0;
+
+out_unlock_2:
+	dput(d1);
+	d1 = d2;
+out_unlock_1:
+	unlock_rename_shared(p1, p2);
+	return PTR_ERR(d1);
+}
+
+int lookup_and_lock_rename_one(struct dentry *p1, struct dentry *p2,
+			       struct dentry **d1p, struct dentry **d2p,
+			       const char *name1, int nlen1,
+			       const char *name2, int nlen2,
+			       unsigned int flags1, unsigned int flags2)
+{
+	struct qstr this1, this2;
+	int err;
+
+	err = lookup_one_common(&nop_mnt_idmap, name1, p1, nlen1, &this1);
+	if (err)
+		return err;
+	err = lookup_one_common(&nop_mnt_idmap, name2, p2, nlen2, &this2);
+	if (err)
+		return err;
+	return lookup_and_lock_rename(p1, p2, d1p, d2p, &this1, &this2,
+				      flags1, flags2);
+}
+EXPORT_SYMBOL(lookup_and_lock_rename_one);
+
+void done_lookup_and_lock_rename(struct dentry *p1, struct dentry *p2,
+				 struct dentry *d1, struct dentry *d2)
+{
+	d_lookup_done(d1);
+	d_lookup_done(d2);
+	d_update_unlock(d1);
+	if (d2 != d1)
+		d_update_unlock(d2);
+	unlock_rename_shared(p1, p2);
+	dput(d1);
+	dput(d2);
+}
+EXPORT_SYMBOL(done_lookup_and_lock_rename);
+
 /*
  * c1 and p2 should be on the same fs.
  */
@@ -5497,7 +5674,6 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 {
 	struct renamedata rd;
 	struct dentry *old_dentry, *new_dentry;
-	struct dentry *trap;
 	struct path old_path, new_path;
 	struct qstr old_last, new_last;
 	int old_type, new_type;
@@ -5548,51 +5724,33 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 		goto exit2;
 
 retry_deleg:
-	trap = lock_rename(new_path.dentry, old_path.dentry);
-	if (IS_ERR(trap)) {
-		error = PTR_ERR(trap);
+	error = lookup_and_lock_rename(old_path.dentry, new_path.dentry,
+				       &old_dentry, &new_dentry,
+				       &old_last, &new_last,
+				       lookup_flags, lookup_flags | target_flags);
+	if (error)
 		goto exit_lock_rename;
-	}
 
-	old_dentry = lookup_one_qstr(&old_last, old_path.dentry,
-				     lookup_flags);
-	error = PTR_ERR(old_dentry);
-	if (IS_ERR(old_dentry))
-		goto exit3;
-	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
-				     lookup_flags | target_flags);
-	error = PTR_ERR(new_dentry);
-	if (IS_ERR(new_dentry))
-		goto exit4;
 	if (flags & RENAME_EXCHANGE) {
 		if (!d_is_dir(new_dentry)) {
 			error = -ENOTDIR;
 			if (new_last.name[new_last.len])
-				goto exit5;
+				goto exit_unlock;
 		}
 	}
 	/* unless the source is a directory trailing slashes give -ENOTDIR */
 	if (!d_is_dir(old_dentry)) {
 		error = -ENOTDIR;
 		if (old_last.name[old_last.len])
-			goto exit5;
+			goto exit_unlock;
 		if (!(flags & RENAME_EXCHANGE) && new_last.name[new_last.len])
-			goto exit5;
-	}
-	/* source should not be ancestor of target */
-	error = -EINVAL;
-	if (old_dentry == trap)
-		goto exit5;
-	/* target should not be an ancestor of source */
-	if (!(flags & RENAME_EXCHANGE))
-		error = -ENOTEMPTY;
-	if (new_dentry == trap)
-		goto exit5;
+			goto exit_unlock;
+	}
 
 	error = security_path_rename(&old_path, old_dentry,
 				     &new_path, new_dentry, flags);
 	if (error)
-		goto exit5;
+		goto exit_unlock;
 
 	rd.old_dir	   = old_path.dentry->d_inode;
 	rd.old_dentry	   = old_dentry;
@@ -5603,13 +5761,9 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	rd.delegated_inode = &delegated_inode;
 	rd.flags	   = flags;
 	error = vfs_rename(&rd);
-exit5:
-	d_lookup_done(new_dentry);
-	dput(new_dentry);
-exit4:
-	dput(old_dentry);
-exit3:
-	unlock_rename(new_path.dentry, old_path.dentry);
+exit_unlock:
+	done_lookup_and_lock_rename(new_path.dentry, old_path.dentry,
+				    new_dentry, old_dentry);
 exit_lock_rename:
 	if (delegated_inode) {
 		error = break_deleg_wait(&delegated_inode);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 72e351640406..8ef7aa6ed64c 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -104,6 +104,13 @@ extern int follow_up(struct path *);
 extern struct dentry *lock_rename(struct dentry *, struct dentry *);
 extern struct dentry *lock_rename_child(struct dentry *, struct dentry *);
 extern void unlock_rename(struct dentry *, struct dentry *);
+int lookup_and_lock_rename_one(struct dentry *p1, struct dentry *p2,
+			       struct dentry **d1p, struct dentry **d2p,
+			       const char *name1, int nlen1,
+			       const char *name2, int nlen2,
+			       unsigned int flags1, unsigned int flags2);
+void done_lookup_and_lock_rename(struct dentry *p1, struct dentry *p2,
+				struct dentry *d1, struct dentry *d2);
 
 /**
  * mode_strip_umask - handle vfs umask stripping
-- 
2.47.1



* [PATCH 17/19] nfsd: use lookup_and_lock_one() and lookup_and_lock_rename_one()
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (15 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 16/19] VFS: add lookup_and_lock_rename() NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06  5:42 ` [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async NeilBrown
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

nfsd now uses lookup_and_lock_one() when creating/removing names in the
exported filesystem.
It uses lookup_and_lock_rename_one() when renaming.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfsproc.c |  12 +++---
 fs/nfsd/vfs.c     | 107 +++++++++++++---------------------------------
 2 files changed, 36 insertions(+), 83 deletions(-)

diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index 6dda081eb24c..27c2b1d5e1ac 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -311,17 +311,16 @@ nfsd_proc_create(struct svc_rqst *rqstp)
 		goto done;
 	}
 
-	inode_lock_nested(dirfhp->fh_dentry->d_inode, I_MUTEX_PARENT);
-	dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
+	dchild = lookup_and_lock_one(NULL, argp->name, argp->len,
+				     dirfhp->fh_dentry, LOOKUP_CREATE);
 	if (IS_ERR(dchild)) {
 		resp->status = nfserrno(PTR_ERR(dchild));
-		goto out_unlock;
+		goto put_write;
 	}
 	fh_init(newfhp, NFS_FHSIZE);
 	resp->status = fh_compose(newfhp, dirfhp->fh_export, dchild, dirfhp);
 	if (!resp->status && d_really_is_negative(dchild))
 		resp->status = nfserr_noent;
-	dput(dchild);
 	if (resp->status) {
 		if (resp->status != nfserr_noent)
 			goto out_unlock;
@@ -331,7 +330,7 @@ nfsd_proc_create(struct svc_rqst *rqstp)
 		 */
 		resp->status = nfserr_acces;
 		if (!newfhp->fh_dentry) {
-			printk(KERN_WARNING 
+			printk(KERN_WARNING
 				"nfsd_proc_create: file handle not verified\n");
 			goto out_unlock;
 		}
@@ -427,7 +426,8 @@ nfsd_proc_create(struct svc_rqst *rqstp)
 	}
 
 out_unlock:
-	inode_unlock(dirfhp->fh_dentry->d_inode);
+	done_lookup_and_lock(dirfhp->fh_dentry, dchild, LOOKUP_CREATE);
+put_write:
 	fh_drop_write(dirfhp);
 done:
 	fh_put(dirfhp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 740332413138..af4a7f75cca0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1551,19 +1551,13 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (host_err)
 		return nfserrno(host_err);
 
-	inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
-	dchild = lookup_one_len(fname, dentry, flen);
+	dchild = lookup_and_lock_one(NULL, fname, flen, dentry, LOOKUP_CREATE);
 	host_err = PTR_ERR(dchild);
 	if (IS_ERR(dchild)) {
 		err = nfserrno(host_err);
-		goto out_unlock;
+		goto out;
 	}
 	err = fh_compose(resfhp, fhp->fh_export, dchild, fhp);
-	/*
-	 * We unconditionally drop our ref to dchild as fh_compose will have
-	 * already grabbed its own ref for it.
-	 */
-	dput(dchild);
 	if (err)
 		goto out_unlock;
 	err = fh_fill_pre_attrs(fhp);
@@ -1572,7 +1566,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = nfsd_create_locked(rqstp, fhp, attrs, type, rdev, resfhp);
 	fh_fill_post_attrs(fhp);
 out_unlock:
-	inode_unlock(dentry->d_inode);
+	done_lookup_and_lock(dentry, dchild, LOOKUP_CREATE);
+out:
 	return err;
 }
 
@@ -1656,8 +1651,7 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	}
 
 	dentry = fhp->fh_dentry;
-	inode_lock_nested(dentry->d_inode, I_MUTEX_PARENT);
-	dnew = lookup_one_len(fname, dentry, flen);
+	dnew = lookup_and_lock_one(NULL, fname, flen, dentry, LOOKUP_CREATE);
 	if (IS_ERR(dnew)) {
 		err = nfserrno(PTR_ERR(dnew));
 		inode_unlock(dentry->d_inode);
@@ -1673,11 +1667,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		nfsd_create_setattr(rqstp, fhp, resfhp, attrs);
 	fh_fill_post_attrs(fhp);
 out_unlock:
-	inode_unlock(dentry->d_inode);
+	done_lookup_and_lock(dentry, dnew, LOOKUP_CREATE);
 	if (!err)
 		err = nfserrno(commit_metadata(fhp));
-	dput(dnew);
-	if (err==0) err = cerr;
+	if (err==0)
+		err = cerr;
 out_drop_write:
 	fh_drop_write(fhp);
 out:
@@ -1721,43 +1715,35 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 
 	ddir = ffhp->fh_dentry;
 	dirp = d_inode(ddir);
-	inode_lock_nested(dirp, I_MUTEX_PARENT);
-
-	dnew = lookup_one_len(name, ddir, len);
+	dnew = lookup_and_lock_one(NULL, name, len, ddir, LOOKUP_CREATE);
 	if (IS_ERR(dnew)) {
-		err = nfserrno(PTR_ERR(dnew));
-		goto out_unlock;
+		err = PTR_ERR(dnew);
+		goto out_drop_write;
 	}
 
 	dold = tfhp->fh_dentry;
 
 	err = nfserr_noent;
 	if (d_really_is_negative(dold))
-		goto out_dput;
+		goto out_unlock;
 	err = fh_fill_pre_attrs(ffhp);
 	if (err != nfs_ok)
-		goto out_dput;
+		goto out_unlock;
 	host_err = vfs_link(dold, &nop_mnt_idmap, dirp, dnew, NULL);
 	fh_fill_post_attrs(ffhp);
-	inode_unlock(dirp);
-	if (!host_err) {
+out_unlock:
+	done_lookup_and_lock(ddir, dnew, LOOKUP_CREATE);
+	if (!err && !host_err) {
 		err = nfserrno(commit_metadata(ffhp));
 		if (!err)
 			err = nfserrno(commit_metadata(tfhp));
-	} else {
+	} else if (!err) {
 		err = nfserrno(host_err);
 	}
-	dput(dnew);
 out_drop_write:
 	fh_drop_write(tfhp);
 out:
 	return err;
-
-out_dput:
-	dput(dnew);
-out_unlock:
-	inode_unlock(dirp);
-	goto out_drop_write;
 }
 
 static void
@@ -1788,7 +1774,7 @@ __be32
 nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 			    struct svc_fh *tfhp, char *tname, int tlen)
 {
-	struct dentry	*fdentry, *tdentry, *odentry, *ndentry, *trap;
+	struct dentry	*fdentry, *tdentry, *odentry, *ndentry;
 	struct inode	*fdir, *tdir;
 	__be32		err;
 	int		host_err;
@@ -1824,9 +1810,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 		goto out;
 	}
 
-	trap = lock_rename(tdentry, fdentry);
-	if (IS_ERR(trap)) {
-		err = nfserr_xdev;
+	host_err = lookup_and_lock_rename_one(fdentry, tdentry,
+					      &odentry, &ndentry,
+					      fname, flen, tname, tlen,
+					      0, LOOKUP_CREATE|LOOKUP_RENAME_TARGET);
+	if (host_err) {
+		err = nfserrno(host_err);
 		goto out_want_write;
 	}
 	err = fh_fill_pre_attrs(ffhp);
@@ -1836,30 +1825,10 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	if (err != nfs_ok)
 		goto out_unlock;
 
-	odentry = lookup_one_len(fname, fdentry, flen);
-	host_err = PTR_ERR(odentry);
-	if (IS_ERR(odentry))
-		goto out_nfserr;
-
-	host_err = -ENOENT;
-	if (d_really_is_negative(odentry))
-		goto out_dput_old;
-	host_err = -EINVAL;
-	if (odentry == trap)
-		goto out_dput_old;
-
-	ndentry = lookup_one_len(tname, tdentry, tlen);
-	host_err = PTR_ERR(ndentry);
-	if (IS_ERR(ndentry))
-		goto out_dput_old;
-	host_err = -ENOTEMPTY;
-	if (ndentry == trap)
-		goto out_dput_new;
-
 	if ((ndentry->d_sb->s_export_op->flags & EXPORT_OP_CLOSE_BEFORE_UNLINK) &&
 	    nfsd_has_cached_files(ndentry)) {
 		close_cached = true;
-		goto out_dput_old;
+		goto out_unlock;
 	} else {
 		struct renamedata rd = {
 			.old_mnt_idmap	= &nop_mnt_idmap,
@@ -1884,11 +1853,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 				host_err = commit_metadata(ffhp);
 		}
 	}
- out_dput_new:
-	dput(ndentry);
- out_dput_old:
-	dput(odentry);
- out_nfserr:
 	err = nfserrno(host_err);
 
 	if (!close_cached) {
@@ -1896,7 +1860,7 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 		fh_fill_post_attrs(tfhp);
 	}
 out_unlock:
-	unlock_rename(tdentry, fdentry);
+	done_lookup_and_lock_rename(fdentry, tdentry, odentry, ndentry);
 out_want_write:
 	fh_drop_write(ffhp);
 
@@ -1943,18 +1907,11 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 
 	dentry = fhp->fh_dentry;
 	dirp = d_inode(dentry);
-	inode_lock_nested(dirp, I_MUTEX_PARENT);
-
-	rdentry = lookup_one_len(fname, dentry, flen);
+	rdentry = lookup_and_lock_one(NULL, fname, flen, dentry, LOOKUP_REMOVE);
 	host_err = PTR_ERR(rdentry);
 	if (IS_ERR(rdentry))
-		goto out_unlock;
+		goto out_drop_write;
 
-	if (d_really_is_negative(rdentry)) {
-		dput(rdentry);
-		host_err = -ENOENT;
-		goto out_unlock;
-	}
 	rinode = d_inode(rdentry);
 	err = fh_fill_pre_attrs(fhp);
 	if (err != nfs_ok)
@@ -1981,11 +1938,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 		host_err = vfs_rmdir(&nop_mnt_idmap, dirp, rdentry);
 	}
 	fh_fill_post_attrs(fhp);
-
-	inode_unlock(dirp);
+out_unlock:
+	done_lookup_and_lock(dentry, rdentry, LOOKUP_REMOVE);
 	if (!host_err)
 		host_err = commit_metadata(fhp);
-	dput(rdentry);
 	iput(rinode);    /* truncate the inode here */
 
 out_drop_write:
@@ -2001,9 +1957,6 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	}
 out:
 	return err;
-out_unlock:
-	inode_unlock(dirp);
-	goto out_drop_write;
 }
 
 /*
-- 
2.47.1



* [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (16 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 17/19] nfsd: use lookup_and_lock_one() and lookup_and_lock_rename_one() NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-06  5:42 ` [PATCH 19/19] nfs: switch to _async for all directory ops NeilBrown
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

mkdir_async allows a different dentry to be returned which is sometimes
relevant for nfs.

This patch changes the nfs_rpc_ops mkdir op to return a dentry, and
passes that back to the caller using mkdir_async.
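
The return convention this relies on can be sketched in userspace C.
The model_*() helpers below are hypothetical re-implementations that
mimic the kernel's ERR_PTR(), IS_ERR() and PTR_ERR_OR_ZERO(); they are
not the kernel macros themselves.  Treating a NULL return as "the
dentry that was passed in was used" is an assumption of this sketch,
not something stated above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Same bound the kernel uses for errno values encoded in pointers. */
#define MODEL_MAX_ERRNO 4095

struct model_dentry {
	const char *name;
};

/* Encode a negative errno in a pointer value. */
static void *model_err_ptr(intptr_t err)
{
	return (void *)err;
}

/* True if the pointer lies in the top MODEL_MAX_ERRNO addresses. */
static int model_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* Errno for an error pointer, 0 for NULL or a valid pointer. */
static int model_ptr_err_or_zero(const void *p)
{
	return model_is_err(p) ? (int)(intptr_t)p : 0;
}

/* Hypothetical mkdir op: an error, an alias, or NULL on success. */
static struct model_dentry *model_mkdir(struct model_dentry *alias,
					int status)
{
	if (status)
		return model_err_ptr(status);
	return alias;
}

/* Caller side, after checking for error: pick the dentry to use. */
static struct model_dentry *model_effective(struct model_dentry *dentry,
					    struct model_dentry *ret)
{
	return ret ? ret : dentry;
}
```

A caller would first test the result with model_is_err(), and only
then decide between the returned alias and the original dentry.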

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfs/dir.c            | 17 ++++++++--------
 fs/nfs/internal.h       |  4 ++--
 fs/nfs/nfs3proc.c       |  9 +++++----
 fs/nfs/nfs4proc.c       | 45 +++++++++++++++++++++++++++++------------
 fs/nfs/proc.c           | 14 ++++++++-----
 include/linux/nfs_xdr.h |  2 +-
 6 files changed, 58 insertions(+), 33 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 8cbe63f4089a..2c69ec77d02c 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -2420,11 +2420,12 @@ EXPORT_SYMBOL_GPL(nfs_mknod);
 /*
  * See comments for nfs_proc_create regarding failed operations.
  */
-int nfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
-	      struct dentry *dentry, umode_t mode)
+struct dentry *nfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
+			 struct dentry *dentry, umode_t mode,
+			 struct dirop_ret *dret)
 {
 	struct iattr attr;
-	int error;
+	struct dentry *ret;
 
 	dfprintk(VFS, "NFS: mkdir(%s/%lu), %pd\n",
 			dir->i_sb->s_id, dir->i_ino, dentry);
@@ -2433,14 +2434,14 @@ int nfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
 	attr.ia_mode = mode | S_IFDIR;
 
 	trace_nfs_mkdir_enter(dir, dentry);
-	error = NFS_PROTO(dir)->mkdir(dir, dentry, &attr);
-	trace_nfs_mkdir_exit(dir, dentry, error);
-	if (error != 0)
+	ret = NFS_PROTO(dir)->mkdir(dir, dentry, &attr);
+	trace_nfs_mkdir_exit(dir, dentry, PTR_ERR_OR_ZERO(ret));
+	if (IS_ERR(ret))
 		goto out_err;
-	return 0;
+	return ret;
 out_err:
 	d_drop(dentry);
-	return error;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(nfs_mkdir);
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index fae2c7ae4acc..f7dea7fe5ebc 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -400,8 +400,8 @@ struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 void nfs_d_prune_case_insensitive_aliases(struct inode *inode);
 int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *,
 	       umode_t, bool);
-int nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
-	      umode_t);
+struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
+			 umode_t, struct dirop_ret *);
 int nfs_rmdir(struct inode *, struct dentry *);
 int nfs_unlink(struct inode *, struct dentry *);
 int nfs_symlink(struct mnt_idmap *, struct inode *, struct dentry *,
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 0c3bc98cd999..41797cbbb8dc 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -578,7 +578,7 @@ nfs3_proc_symlink(struct inode *dir, struct dentry *dentry, struct folio *folio,
 	return status;
 }
 
-static int
+static struct dentry *
 nfs3_proc_mkdir(struct inode *dir, struct dentry *dentry, struct iattr *sattr)
 {
 	struct posix_acl *default_acl, *acl;
@@ -613,14 +613,15 @@ nfs3_proc_mkdir(struct inode *dir, struct dentry *dentry, struct iattr *sattr)
 
 	status = nfs3_proc_setacls(d_inode(dentry), acl, default_acl);
 
-	dput(d_alias);
 out_release_acls:
 	posix_acl_release(acl);
 	posix_acl_release(default_acl);
 out:
 	nfs3_free_createdata(data);
 	dprintk("NFS reply mkdir: %d\n", status);
-	return status;
+	if (status)
+		return ERR_PTR(status);
+	return d_alias;
 }
 
 static int
@@ -1037,7 +1038,7 @@ static const struct inode_operations nfs3_dir_inode_operations = {
 	.link		= nfs_link,
 	.unlink		= nfs_unlink,
 	.symlink	= nfs_symlink,
-	.mkdir		= nfs_mkdir,
+	.mkdir_async	= nfs_mkdir,
 	.rmdir		= nfs_rmdir,
 	.mknod		= nfs_mknod,
 	.rename		= nfs_rename,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index df9669d4ded7..ef219968ed22 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -5135,9 +5135,6 @@ static int nfs4_do_create(struct inode *dir, struct dentry *dentry, struct nfs4_
 				    &data->arg.seq_args, &data->res.seq_res, 1);
 	if (status == 0) {
 		spin_lock(&dir->i_lock);
-		/* Creating a directory bumps nlink in the parent */
-		if (data->arg.ftype == NF4DIR)
-			nfs4_inc_nlink_locked(dir);
 		nfs4_update_changeattr_locked(dir, &data->res.dir_cinfo,
 					      data->res.fattr->time_start,
 					      NFS_INO_INVALID_DATA);
@@ -5147,6 +5144,25 @@ static int nfs4_do_create(struct inode *dir, struct dentry *dentry, struct nfs4_
 	return status;
 }
 
+static struct dentry *nfs4_do_mkdir(struct inode *dir, struct dentry *dentry,
+				    struct nfs4_createdata *data)
+{
+	int status = nfs4_call_sync(NFS_SERVER(dir)->client, NFS_SERVER(dir), &data->msg,
+				    &data->arg.seq_args, &data->res.seq_res, 1);
+
+	if (status)
+		return ERR_PTR(status);
+
+	spin_lock(&dir->i_lock);
+	/* Creating a directory bumps nlink in the parent */
+	nfs4_inc_nlink_locked(dir);
+	nfs4_update_changeattr_locked(dir, &data->res.dir_cinfo,
+				      data->res.fattr->time_start,
+				      NFS_INO_INVALID_DATA);
+	spin_unlock(&dir->i_lock);
+	return nfs_add_or_obtain(dentry, data->res.fh, data->res.fattr);
+}
+
 static void nfs4_free_createdata(struct nfs4_createdata *data)
 {
 	nfs4_label_free(data->fattr.label);
@@ -5203,32 +5219,34 @@ static int nfs4_proc_symlink(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
-static int _nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
-		struct iattr *sattr, struct nfs4_label *label)
+static struct dentry *_nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
+				       struct iattr *sattr,
+				       struct nfs4_label *label)
 {
 	struct nfs4_createdata *data;
-	int status = -ENOMEM;
+	struct dentry *ret = ERR_PTR(-ENOMEM);
 
 	data = nfs4_alloc_createdata(dir, &dentry->d_name, sattr, NF4DIR);
 	if (data == NULL)
 		goto out;
 
 	data->arg.label = label;
-	status = nfs4_do_create(dir, dentry, data);
+	ret = nfs4_do_mkdir(dir, dentry, data);
 
 	nfs4_free_createdata(data);
 out:
-	return status;
+	return ret;
 }
 
-static int nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
-		struct iattr *sattr)
+static struct dentry *nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
+				      struct iattr *sattr)
 {
 	struct nfs_server *server = NFS_SERVER(dir);
 	struct nfs4_exception exception = {
 		.interruptible = true,
 	};
 	struct nfs4_label l, *label;
+	struct dentry *alias;
 	int err;
 
 	label = nfs4_label_init_security(dir, dentry, sattr, &l);
@@ -5236,14 +5254,15 @@ static int nfs4_proc_mkdir(struct inode *dir, struct dentry *dentry,
 	if (!(server->attr_bitmask[2] & FATTR4_WORD2_MODE_UMASK))
 		sattr->ia_mode &= ~current_umask();
 	do {
-		err = _nfs4_proc_mkdir(dir, dentry, sattr, label);
+		alias = _nfs4_proc_mkdir(dir, dentry, sattr, label);
+		err = PTR_ERR_OR_ZERO(alias);
 		trace_nfs4_mkdir(dir, &dentry->d_name, err);
 		err = nfs4_handle_exception(NFS_SERVER(dir), err,
 				&exception);
 	} while (exception.retry);
 	nfs4_label_release_security(label);
 
-	return err;
+	return alias;
 }
 
 static int _nfs4_proc_readdir(struct nfs_readdir_arg *nr_arg,
@@ -10865,7 +10884,7 @@ static const struct inode_operations nfs4_dir_inode_operations = {
 	.link		= nfs_link,
 	.unlink		= nfs_unlink,
 	.symlink	= nfs_symlink,
-	.mkdir		= nfs_mkdir,
+	.mkdir_async	= nfs_mkdir,
 	.rmdir		= nfs_rmdir,
 	.mknod		= nfs_mknod,
 	.rename		= nfs_rename,
diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
index 77920a2e3cef..7e8f6d8f02b4 100644
--- a/fs/nfs/proc.c
+++ b/fs/nfs/proc.c
@@ -446,13 +446,14 @@ nfs_proc_symlink(struct inode *dir, struct dentry *dentry, struct folio *folio,
 	return status;
 }
 
-static int
+static struct dentry *
 nfs_proc_mkdir(struct inode *dir, struct dentry *dentry, struct iattr *sattr)
 {
 	struct nfs_createdata *data;
 	struct rpc_message msg = {
 		.rpc_proc	= &nfs_procedures[NFSPROC_MKDIR],
 	};
+	struct dentry *alias = NULL;
 	int status = -ENOMEM;
 
 	dprintk("NFS call  mkdir %pd\n", dentry);
@@ -464,12 +465,15 @@ nfs_proc_mkdir(struct inode *dir, struct dentry *dentry, struct iattr *sattr)
 
 	status = rpc_call_sync(NFS_CLIENT(dir), &msg, 0);
 	nfs_mark_for_revalidate(dir);
-	if (status == 0)
-		status = nfs_instantiate(dentry, data->res.fh, data->res.fattr);
+	if (status == 0) {
+		alias = nfs_add_or_obtain(dentry, data->res.fh, data->res.fattr);
+		status = PTR_ERR_OR_ZERO(alias);
+	} else
+		alias = ERR_PTR(status);
 	nfs_free_createdata(data);
 out:
 	dprintk("NFS reply mkdir: %d\n", status);
-	return status;
+	return alias;
 }
 
 static int
@@ -706,7 +710,7 @@ static const struct inode_operations nfs_dir_inode_operations = {
 	.link		= nfs_link,
 	.unlink		= nfs_unlink,
 	.symlink	= nfs_symlink,
-	.mkdir		= nfs_mkdir,
+	.mkdir_async	= nfs_mkdir,
 	.rmdir		= nfs_rmdir,
 	.mknod		= nfs_mknod,
 	.rename		= nfs_rename,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index d0473e0d4aba..33d7f4c8183e 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1801,7 +1801,7 @@ struct nfs_rpc_ops {
 	int	(*link)    (struct inode *, struct inode *, const struct qstr *);
 	int	(*symlink) (struct inode *, struct dentry *, struct folio *,
 			    unsigned int, struct iattr *);
-	int	(*mkdir)   (struct inode *, struct dentry *, struct iattr *);
+	struct dentry *(*mkdir)   (struct inode *, struct dentry *, struct iattr *);
 	int	(*rmdir)   (struct inode *, const struct qstr *);
 	int	(*readdir) (struct nfs_readdir_arg *, struct nfs_readdir_res *);
 	int	(*mknod)   (struct inode *, struct dentry *, struct iattr *,
-- 
2.47.1



* [PATCH 19/19] nfs: switch to _async for all directory ops.
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (17 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async NeilBrown
@ 2025-02-06  5:42 ` NeilBrown
  2025-02-13  3:51   ` Al Viro
  2025-02-06 14:36 ` [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory Christian Brauner
                   ` (2 subsequent siblings)
  21 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-06  5:42 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

nfs doesn't benefit from exclusive locking by the VFS as all directory
ops are sent to the server which does any needed locking.

The interesting part is "silly-rename" which needs to create and lock
another dentry while an unlink or rename is happening.

nfs_sillyrename() now returns that locked dentry and
nfs_sillyrename_finish() is added to unlock it when appropriate.

In order to keep all dentries locked until the operation completes,
nfs_sillyrename() now uses d_exchange() to record the silly rename in
the dcache.  d_exchange() therefore has to be exported and permitted to
work on a negative second dentry.
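
As a rough user-space illustration of the paragraph above (not kernel
code): the sketch below models an exchange that swaps only the names of
two entries, so both entries stay locked and the negative entry ends up
holding the original name.  struct toy_dentry and toy_exchange() are
hypothetical names invented for this sketch.  The real d_exchange() also
swaps parents and runs under rename_lock, which is left out here.

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a dentry: only the fields this sketch needs. */
struct toy_dentry {
	char name[32];
	bool positive;	/* true when an inode is attached */
	bool locked;	/* models the per-dentry update lock */
};

/* Swap the names of two entries; inode state and locks stay where they are. */
static void toy_exchange(struct toy_dentry *a, struct toy_dentry *b)
{
	char tmp[sizeof(a->name)];

	memcpy(tmp, a->name, sizeof(tmp));
	memcpy(a->name, b->name, sizeof(a->name));
	memcpy(b->name, tmp, sizeof(b->name));
}
```

With a positive, locked entry named "data.txt" and a negative, locked
entry named ".nfs0001", calling toy_exchange() leaves the positive entry
under the silly name and the negative entry under "data.txt", with both
still locked.  A plain move would leave nothing holding the original
name.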

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/dcache.c            |  5 +++-
 fs/nfs/dir.c           | 55 ++++++++++++++++++++++++------------------
 fs/nfs/internal.h      | 20 +++++++++------
 fs/nfs/nfs3proc.c      | 16 ++++++------
 fs/nfs/nfs4_fs.h       |  2 +-
 fs/nfs/nfs4proc.c      | 16 ++++++------
 fs/nfs/proc.c          | 16 ++++++------
 fs/nfs/unlink.c        | 48 +++++++++++++++++++++++++-----------
 include/linux/namei.h  |  1 -
 include/linux/nfs_fs.h |  3 ---
 10 files changed, 106 insertions(+), 76 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 90dee859d138..203d71eb4789 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2981,7 +2981,9 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 	write_seqlock(&rename_lock);
 
 	WARN_ON(!dentry1->d_inode);
-	WARN_ON(!dentry2->d_inode);
+	/* allow dentry2 to be negative so we can do a rename but keep
+	 * both names locked with DCACHE_PAR_UPDATE.
+	 */
 	WARN_ON(IS_ROOT(dentry1));
 	WARN_ON(IS_ROOT(dentry2));
 
@@ -2989,6 +2991,7 @@ void d_exchange(struct dentry *dentry1, struct dentry *dentry2)
 
 	write_sequnlock(&rename_lock);
 }
+EXPORT_SYMBOL(d_exchange);
 
 /**
  * d_ancestor - search for an ancestor
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2c69ec77d02c..c0116d44a6fc 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1956,10 +1956,14 @@ struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, unsigned in
 		return ERR_PTR(-ENAMETOOLONG);
 
 	/*
-	 * If we're doing an exclusive create, optimize away the lookup
-	 * but don't hash the dentry.
+	 * If we're doing an exclusive create, or if this is the target
+	 * of a rename, optimize away the lookup but don't hash the dentry.
+	 * A silly_rename is uniquely marked exclusive (REALLY? FIXME) and a rename target,
+	 * and it requests an explicit lookup.
 	 */
-	if (nfs_is_exclusive_create(dir, flags) || flags & LOOKUP_RENAME_TARGET)
+	if (nfs_is_exclusive_create(dir, flags) || (flags & LOOKUP_RENAME_TARGET &&
+	    ((flags & (LOOKUP_EXCL | LOOKUP_RENAME_TARGET)) !=
+	     (LOOKUP_EXCL | LOOKUP_RENAME_TARGET))))
 		return NULL;
 
 	res = ERR_PTR(-ENOMEM);
@@ -2057,7 +2061,7 @@ static int nfs_finish_open(struct nfs_open_context *ctx,
 
 int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
 		    struct file *file, unsigned open_flags,
-		    umode_t mode)
+		    umode_t mode, struct dirop_ret *ret)
 {
 	struct nfs_open_context *ctx;
 	struct dentry *res;
@@ -2256,7 +2260,7 @@ nfs4_lookup_revalidate(struct inode *dir, const struct qstr *name,
 
 int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
 			struct file *file, unsigned int open_flags,
-			umode_t mode)
+			umode_t mode, struct dirop_ret *ret)
 {
 
 	/* Same as look+open from lookup_open(), but with different O_TRUNC
@@ -2383,7 +2387,8 @@ static int nfs_do_create(struct inode *dir, struct dentry *dentry,
 }
 
 int nfs_create(struct mnt_idmap *idmap, struct inode *dir,
-	       struct dentry *dentry, umode_t mode, bool excl)
+	       struct dentry *dentry, umode_t mode, bool excl,
+	       struct dirop_ret *ret)
 {
 	return nfs_do_create(dir, dentry, mode, excl ? O_EXCL : 0);
 }
@@ -2394,7 +2399,8 @@ EXPORT_SYMBOL_GPL(nfs_create);
  */
 int
 nfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
-	  struct dentry *dentry, umode_t mode, dev_t rdev)
+	  struct dentry *dentry, umode_t mode, dev_t rdev,
+	  struct dirop_ret *ret)
 {
 	struct iattr attr;
 	int status;
@@ -2466,7 +2472,7 @@ static void nfs_dentry_remove_handle_error(struct inode *dir,
 	}
 }
 
-int nfs_rmdir(struct inode *dir, struct dentry *dentry)
+int nfs_rmdir(struct inode *dir, struct dentry *dentry, struct dirop_ret *ret)
 {
 	int error;
 
@@ -2535,7 +2541,7 @@ static int nfs_safe_remove(struct dentry *dentry)
  *
  *  If sillyrename() returns 0, we do nothing, otherwise we unlink.
  */
-int nfs_unlink(struct inode *dir, struct dentry *dentry)
+int nfs_unlink(struct inode *dir, struct dentry *dentry, struct dirop_ret *ret)
 {
 	int error;
 
@@ -2546,10 +2552,14 @@ int nfs_unlink(struct inode *dir, struct dentry *dentry)
 	spin_lock(&dentry->d_lock);
 	if (d_count(dentry) > 1 && !test_bit(NFS_INO_PRESERVE_UNLINKED,
 					     &NFS_I(d_inode(dentry))->flags)) {
+		struct dentry *silly;
+
 		spin_unlock(&dentry->d_lock);
 		/* Start asynchronous writeout of the inode */
 		write_inode_now(d_inode(dentry), 0);
-		error = nfs_sillyrename(dir, dentry);
+		silly = nfs_sillyrename(dir, dentry);
+		nfs_sillyrename_finish(silly);
+		error = PTR_ERR_OR_ZERO(silly);
 		goto out;
 	}
 	/* We must prevent any concurrent open until the unlink
@@ -2591,7 +2601,7 @@ EXPORT_SYMBOL_GPL(nfs_unlink);
  * and move the raw page into its mapping.
  */
 int nfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
-		struct dentry *dentry, const char *symname)
+		struct dentry *dentry, const char *symname, struct dirop_ret *ret)
 {
 	struct folio *folio;
 	char *kaddr;
@@ -2647,7 +2657,8 @@ int nfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
 EXPORT_SYMBOL_GPL(nfs_symlink);
 
 int
-nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
+nfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry,
+	 struct dirop_ret *ret)
 {
 	struct inode *inode = d_inode(old_dentry);
 	int error;
@@ -2688,7 +2699,7 @@ nfs_unblock_rename(struct rpc_task *task, struct nfs_renamedata *data)
  * file in old_dir will go away when the last process iput()s the inode.
  *
  * FIXED.
- * 
+ *
  * It actually works quite well. One needs to have the possibility for
  * at least one ".nfs..." file in each directory the file ever gets
  * moved or linked to which happens automagically with the new
@@ -2704,7 +2715,8 @@ nfs_unblock_rename(struct rpc_task *task, struct nfs_renamedata *data)
  */
 int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 	       struct dentry *old_dentry, struct inode *new_dir,
-	       struct dentry *new_dentry, unsigned int flags)
+	       struct dentry *new_dentry, unsigned int flags,
+	       struct dirop_ret *ret)
 {
 	struct inode *old_inode = d_inode(old_dentry);
 	struct inode *new_inode = d_inode(new_dentry);
@@ -2744,16 +2756,12 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 
 			spin_unlock(&new_dentry->d_lock);
 
-			/* copy the target dentry's name */
-			dentry = d_alloc(new_dentry->d_parent,
-					 &new_dentry->d_name);
-			if (!dentry)
-				goto out;
-
 			/* silly-rename the existing target ... */
-			err = nfs_sillyrename(new_dir, new_dentry);
-			if (err)
+			dentry = nfs_sillyrename(new_dir, new_dentry);
+			if (IS_ERR(dentry)) {
+				err = PTR_ERR(dentry);
 				goto out;
+			}
 
 			new_dentry = dentry;
 			new_inode = NULL;
@@ -2811,9 +2819,8 @@ int nfs_rename(struct mnt_idmap *idmap, struct inode *old_dir,
 	} else if (error == -ENOENT)
 		nfs_dentry_handle_enoent(old_dentry);
 
-	/* new dentry created? */
 	if (dentry)
-		dput(dentry);
+		nfs_sillyrename_finish(dentry);
 	return error;
 }
 EXPORT_SYMBOL_GPL(nfs_rename);
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f7dea7fe5ebc..ba00ffeb70ac 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -399,18 +399,21 @@ extern unsigned long nfs_access_cache_scan(struct shrinker *shrink,
 struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
 void nfs_d_prune_case_insensitive_aliases(struct inode *inode);
 int nfs_create(struct mnt_idmap *, struct inode *, struct dentry *,
-	       umode_t, bool);
+	       umode_t, bool, struct dirop_ret *);
 struct dentry *nfs_mkdir(struct mnt_idmap *, struct inode *, struct dentry *,
 			 umode_t, struct dirop_ret *);
-int nfs_rmdir(struct inode *, struct dentry *);
-int nfs_unlink(struct inode *, struct dentry *);
+int nfs_rmdir(struct inode *, struct dentry *, struct dirop_ret *);
+int nfs_unlink(struct inode *, struct dentry *, struct dirop_ret *);
 int nfs_symlink(struct mnt_idmap *, struct inode *, struct dentry *,
-		const char *);
-int nfs_link(struct dentry *, struct inode *, struct dentry *);
+		const char *, struct dirop_ret *);
+int nfs_link(struct dentry *, struct inode *, struct dentry *, struct dirop_ret *);
 int nfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *, umode_t,
-	      dev_t);
+	      dev_t, struct dirop_ret *);
 int nfs_rename(struct mnt_idmap *, struct inode *, struct dentry *,
-	       struct inode *, struct dentry *, unsigned int);
+	       struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
+int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
+			struct file *file, unsigned int open_flags,
+			umode_t mode, struct dirop_ret *);
 
 #ifdef CONFIG_NFS_V4_2
 static inline __u32 nfs_access_xattr_mask(const struct nfs_server *server)
@@ -707,7 +710,8 @@ extern struct rpc_task *
 nfs_async_rename(struct inode *old_dir, struct inode *new_dir,
 		 struct dentry *old_dentry, struct dentry *new_dentry,
 		 void (*complete)(struct rpc_task *, struct nfs_renamedata *));
-extern int nfs_sillyrename(struct inode *dir, struct dentry *dentry);
+extern struct dentry *nfs_sillyrename(struct inode *dir, struct dentry *dentry);
+extern void nfs_sillyrename_finish(struct dentry *dentry);
 
 /* direct.c */
 void nfs_init_cinfo_from_dreq(struct nfs_commit_info *cinfo,
diff --git a/fs/nfs/nfs3proc.c b/fs/nfs/nfs3proc.c
index 41797cbbb8dc..833e679d0a2b 100644
--- a/fs/nfs/nfs3proc.c
+++ b/fs/nfs/nfs3proc.c
@@ -1032,16 +1032,16 @@ static int nfs3_return_delegation(struct inode *inode)
 }
 
 static const struct inode_operations nfs3_dir_inode_operations = {
-	.create		= nfs_create,
-	.atomic_open	= nfs_atomic_open_v23,
+	.create_async	= nfs_create,
+	.atomic_open_async = nfs_atomic_open_v23,
 	.lookup		= nfs_lookup,
-	.link		= nfs_link,
-	.unlink		= nfs_unlink,
-	.symlink	= nfs_symlink,
+	.link_async	= nfs_link,
+	.unlink_async	= nfs_unlink,
+	.symlink_async	= nfs_symlink,
 	.mkdir_async	= nfs_mkdir,
-	.rmdir		= nfs_rmdir,
-	.mknod		= nfs_mknod,
-	.rename		= nfs_rename,
+	.rmdir_async	= nfs_rmdir,
+	.mknod_async	= nfs_mknod,
+	.rename_async	= nfs_rename,
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
diff --git a/fs/nfs/nfs4_fs.h b/fs/nfs/nfs4_fs.h
index 7d383d29a995..65fbcef5830e 100644
--- a/fs/nfs/nfs4_fs.h
+++ b/fs/nfs/nfs4_fs.h
@@ -273,7 +273,7 @@ extern const struct dentry_operations nfs4_dentry_operations;
 
 /* dir.c */
 int nfs_atomic_open(struct inode *, struct dentry *, struct file *,
-		    unsigned, umode_t);
+		    unsigned, umode_t, struct dirop_ret *);
 
 /* fs_context.c */
 extern struct file_system_type nfs4_fs_type;
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ef219968ed22..4fd312838bd3 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10878,16 +10878,16 @@ static void nfs4_disable_swap(struct inode *inode)
 }
 
 static const struct inode_operations nfs4_dir_inode_operations = {
-	.create		= nfs_create,
+	.create_async	= nfs_create,
 	.lookup		= nfs_lookup,
-	.atomic_open	= nfs_atomic_open,
-	.link		= nfs_link,
-	.unlink		= nfs_unlink,
-	.symlink	= nfs_symlink,
+	.atomic_open_async = nfs_atomic_open,
+	.link_async	= nfs_link,
+	.unlink_async	= nfs_unlink,
+	.symlink_async	= nfs_symlink,
 	.mkdir_async	= nfs_mkdir,
-	.rmdir		= nfs_rmdir,
-	.mknod		= nfs_mknod,
-	.rename		= nfs_rename,
+	.rmdir_async	= nfs_rmdir,
+	.mknod_async	= nfs_mknod,
+	.rename_async	= nfs_rename,
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
diff --git a/fs/nfs/proc.c b/fs/nfs/proc.c
index 7e8f6d8f02b4..211edd9f5115 100644
--- a/fs/nfs/proc.c
+++ b/fs/nfs/proc.c
@@ -704,16 +704,16 @@ static int nfs_return_delegation(struct inode *inode)
 }
 
 static const struct inode_operations nfs_dir_inode_operations = {
-	.create		= nfs_create,
+	.create_async	= nfs_create,
 	.lookup		= nfs_lookup,
-	.atomic_open	= nfs_atomic_open_v23,
-	.link		= nfs_link,
-	.unlink		= nfs_unlink,
-	.symlink	= nfs_symlink,
+	.atomic_open_async = nfs_atomic_open_v23,
+	.link_async	= nfs_link,
+	.unlink_async	= nfs_unlink,
+	.symlink_async	= nfs_symlink,
 	.mkdir_async	= nfs_mkdir,
-	.rmdir		= nfs_rmdir,
-	.mknod		= nfs_mknod,
-	.rename		= nfs_rename,
+	.rmdir_async	= nfs_rmdir,
+	.mknod_async	= nfs_mknod,
+	.rename_async	= nfs_rename,
 	.permission	= nfs_permission,
 	.getattr	= nfs_getattr,
 	.setattr	= nfs_setattr,
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index d44162d3a8f1..06b71ec9520c 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -430,6 +430,10 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
  *
  * The final cleanup is done during dentry_iput.
  *
+ * We exchange the original with the new (silly) dentries, and return
+ * the new dentry which will now have the original name.  This ensures that
+ * the target name remains locked until the rename completes.
+ *
  * (Note: NFSv4 is stateful, and has opens, so in theory an NFSv4 server
  * could take responsibility for keeping open files referenced.  The server
  * would also need to ensure that opened-but-deleted files were kept over
@@ -438,7 +442,7 @@ nfs_complete_sillyrename(struct rpc_task *task, struct nfs_renamedata *data)
  * use to advertise that it does this; some day we may take advantage of
  * it.))
  */
-int
+struct dentry *
 nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 {
 	static unsigned int sillycounter;
@@ -447,7 +451,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 	struct dentry *sdentry;
 	struct inode *inode = d_inode(dentry);
 	struct rpc_task *task;
-	int            error = -EBUSY;
+	struct dentry *base;
+	int error = -EBUSY;
 
 	dfprintk(VFS, "NFS: silly-rename(%pd2, ct=%d)\n",
 		dentry, d_count(dentry));
@@ -461,10 +466,11 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 
 	fileid = NFS_FILEID(d_inode(dentry));
 
+	base = d_find_alias(dir);
 	sdentry = NULL;
 	do {
 		int slen;
-		dput(sdentry);
+
 		sillycounter++;
 		slen = scnprintf(silly, sizeof(silly),
 				SILLYNAME_PREFIX "%0*llx%0*x",
@@ -474,14 +480,19 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 		dfprintk(VFS, "NFS: trying to rename %pd to %s\n",
 				dentry, silly);
 
-		sdentry = lookup_one_len(silly, dentry->d_parent, slen);
-		/*
-		 * N.B. Better to return EBUSY here ... it could be
-		 * dangerous to delete the file while it's in use.
-		 */
-		if (IS_ERR(sdentry))
-			goto out;
-	} while (d_inode(sdentry) != NULL); /* need negative lookup */
+		sdentry = lookup_and_lock_one(NULL, silly, slen,
+					      base,
+					      LOOKUP_CREATE | LOOKUP_EXCL
+					      | LOOKUP_RENAME_TARGET
+					      | LOOKUP_PARENT_LOCKED);
+	} while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST); /* need negative lookup */
+	dput(base);
+	/*
+	 * N.B. Better to return EBUSY here ... it could be
+	 * dangerous to delete the file while it's in use.
+	 */
+	if (IS_ERR(sdentry))
+		goto out;
 
 	ihold(inode);
 
@@ -515,7 +526,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 						     NFS_INO_INVALID_CTIME |
 						     NFS_INO_REVAL_FORCED);
 		spin_unlock(&inode->i_lock);
-		d_move(dentry, sdentry);
+		d_exchange(dentry, sdentry);
 		break;
 	case -ERESTARTSYS:
 		/* The result of the rename is unknown. Play it safe by
@@ -526,7 +537,16 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 	rpc_put_task(task);
 out_dput:
 	iput(inode);
-	dput(sdentry);
+	if (!error)
+		return dentry;
+	done_lookup_and_lock(NULL, sdentry, LOOKUP_PARENT_LOCKED);
+
 out:
-	return error;
+	return ERR_PTR(error);
+}
+
+void nfs_sillyrename_finish(struct dentry *dentry)
+{
+	if (!IS_ERR(dentry))
+		done_lookup_and_lock(NULL, dentry, LOOKUP_PARENT_LOCKED);
 }
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 8ef7aa6ed64c..29903e2cdf97 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -95,7 +95,6 @@ struct dentry *__lookup_and_lock_one(struct mnt_idmap *idmap,
 				     unsigned int lookup_flags);
 void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
 			  unsigned int lookup_flags);
-void __done_lookup_and_lock(struct dentry *dentry);
 
 extern int follow_down_one(struct path *);
 extern int follow_down(struct path *path, unsigned int flags);
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index 67ae2c3f41d2..6f9f4adfdf4c 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -579,9 +579,6 @@ extern int nfs_may_open(struct inode *inode, const struct cred *cred, int openfl
 extern void nfs_access_zap_cache(struct inode *inode);
 extern int nfs_access_get_cached(struct inode *inode, const struct cred *cred,
 				 u32 *mask, bool may_block);
-extern int nfs_atomic_open_v23(struct inode *dir, struct dentry *dentry,
-			       struct file *file, unsigned int open_flags,
-			       umode_t mode);
 
 /*
  * linux/fs/nfs/symlink.c
-- 
2.47.1
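
For readers following the nfs_lookup() hunk in the patch above, the
nested flag test can be restated as a small predicate.  This is a
user-space sketch only: the TOY_* bit values are placeholders rather than
the kernel's definitions, and the exclusive_create argument stands in for
whatever nfs_is_exclusive_create() returns, which is not modelled here.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; not the kernel's definitions. */
#define TOY_LOOKUP_EXCL			0x1u
#define TOY_LOOKUP_RENAME_TARGET	0x2u

/*
 * Same shape as the test in the hunk: skip the lookup for an exclusive
 * create, or for a rename target that is not also marked exclusive.
 */
static bool toy_skip_lookup(bool exclusive_create, unsigned int flags)
{
	const unsigned int both = TOY_LOOKUP_EXCL | TOY_LOOKUP_RENAME_TARGET;

	if (exclusive_create)
		return true;
	if (!(flags & TOY_LOOKUP_RENAME_TARGET))
		return false;
	return (flags & both) != both;
}
```

Under these assumptions a plain rename target skips the lookup, while a
request carrying both flags (the combination the silly-rename path
passes) does not, provided the exclusive-create test has not already
matched.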



* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
@ 2025-02-06 12:24   ` Christian Brauner
  2025-02-06 23:52     ` NeilBrown
  2025-02-06 13:52   ` Jeff Layton
  2025-02-07 19:45   ` Al Viro
  2 siblings, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:24 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote:
> vfs_mkdir() does not guarantee to make the child dentry positive on
> success.  It may leave it negative and then the caller needs to perform a
> lookup to find the target dentry.
> 
> This patch introduces vfs_mkdir_return() which performs the lookup if
> needed so that this code is centralised.
> 
> This prepares for a new inode operation which will perform mkdir and
> return the correct dentry.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/cachefiles/namei.c    |  7 +---
>  fs/namei.c               | 69 ++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.c            | 21 ++----------
>  fs/overlayfs/dir.c       | 33 +------------------
>  fs/overlayfs/overlayfs.h | 10 +++---
>  fs/overlayfs/super.c     |  2 +-
>  fs/smb/server/vfs.c      | 24 +++-----------
>  include/linux/fs.h       |  2 ++
>  8 files changed, 86 insertions(+), 82 deletions(-)
> 
> diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> index 7cf59713f0f7..3c866c3b9534 100644
> --- a/fs/cachefiles/namei.c
> +++ b/fs/cachefiles/namei.c
> @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  	/* search the current directory for the element name */
>  	inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
>  
> -retry:
>  	ret = cachefiles_inject_read_error();
>  	if (ret == 0)
>  		subdir = lookup_one_len(dirname, dir, strlen(dirname));
> @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  			goto mkdir_error;
>  		ret = cachefiles_inject_write_error();
>  		if (ret == 0)
> -			ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
> +			ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
>  		if (ret < 0) {
>  			trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
>  						   cachefiles_trace_mkdir_error);
> @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  		}
>  		trace_cachefiles_mkdir(dir, subdir);
>  
> -		if (unlikely(d_unhashed(subdir))) {
> -			cachefiles_put_directory(subdir);
> -			goto retry;
> -		}
>  		ASSERT(d_backing_inode(subdir));
>  
>  		_debug("mkdir -> %pd{ino=%lu}",
> diff --git a/fs/namei.c b/fs/namei.c
> index 3ab9440c5b93..d98caf36e867 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  }
>  EXPORT_SYMBOL(vfs_mkdir);
>  
> +/**
> + * vfs_mkdir_return - create directory returning correct dentry
> + * @idmap:	idmap of the mount the inode was found from
> + * @dir:	inode of the parent directory
> + * @dentryp:	pointer to dentry of the child directory
> + * @mode:	mode of the child directory
> + *
> + * Create a directory.
> + *
> + * If the inode has been found through an idmapped mount the idmap of
> + * the vfsmount must be passed through @idmap. This function will then take
> + * care to map the inode according to @idmap before checking permissions.
> + * On non-idmapped mounts or if permission checking is to be performed on the
> + * raw inode simply pass @nop_mnt_idmap.
> + *
> + * The filesystem may not use the dentry that was passed in.  In that case
> + * the passed-in dentry is put and a new one is placed in *@dentryp;
> + * So on successful return *@dentryp will always be positive.
> + */
> +int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
> +		     struct dentry **dentryp, umode_t mode)
> +{

I think this is misnamed. Maybe vfs_mkdir_positive() is better here.
It would also be nice to have a comment on vfs_mkdir() as well, pointing
out that the returned dentry might be negative.

And is there a particular reason to not have it return the new dentry?
That seems clearer than using the argument as a return value.
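
To make the comparison concrete, here is a user-space sketch of the two
shapes, an out-parameter versus a returned pointer that can also encode
an error.  Everything prefixed toy_ is hypothetical and invented for the
sketch.  The error-pointer helpers imitate the kernel's
ERR_PTR()/PTR_ERR()/IS_ERR() convention but are not the kernel macros,
and reference counting is left out entirely.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* User-space imitations of the kernel's error-pointer helpers. */
static inline void *toy_err_ptr(intptr_t err) { return (void *)err; }
static inline intptr_t toy_ptr_err(const void *p) { return (intptr_t)p; }
static inline bool toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_dentry {
	const char *name;
	bool positive;
};

/* Stands in for the dentry a follow-up lookup would find. */
static struct toy_dentry toy_looked_up = { "child", true };

/* Out-parameter shape: returns 0 or -errno and may replace *dp. */
static int toy_mkdir_outparam(struct toy_dentry **dp, int fs_status,
			      bool fs_hashed)
{
	if (fs_status)
		return fs_status;
	if (fs_hashed)
		(*dp)->positive = true;	/* fs instantiated the dentry */
	else
		*dp = &toy_looked_up;	/* substitute the looked-up one */
	return 0;
}

/* Return-value shape: the dentry to use, or an encoded error. */
static struct toy_dentry *toy_mkdir_ret(struct toy_dentry *d, int fs_status,
					bool fs_hashed)
{
	int err = toy_mkdir_outparam(&d, fs_status, fs_hashed);

	return err ? toy_err_ptr(err) : d;
}
```

With the second shape a caller writes d = toy_mkdir_ret(d, ...) and
checks toy_is_err(d), so there is no separate status variable that can
fall out of step with the pointer.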

> +	struct dentry *dentry = *dentryp;
> +	int error;
> +	unsigned max_links = dir->i_sb->s_max_links;
> +
> +	error = may_create(idmap, dir, dentry);
> +	if (error)
> +		return error;
> +
> +	if (!dir->i_op->mkdir)
> +		return -EPERM;
> +
> +	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
> +	error = security_inode_mkdir(dir, dentry, mode);
> +	if (error)
> +		return error;
> +
> +	if (max_links && dir->i_nlink >= max_links)
> +		return -EMLINK;
> +
> +	error = dir->i_op->mkdir(idmap, dir, dentry, mode);

Why isn't this calling vfs_mkdir() and only starting to differ afterwards?

> +	if (!error) {
> +		fsnotify_mkdir(dir, dentry);
> +		if (unlikely(d_unhashed(dentry))) {
> +			struct dentry *d;
> +			/* Need a "const" pointer.  We know d_name is const
> +			 * because we hold an exclusive lock on i_rwsem
> +			 * in d_parent.
> +			 */
> +			const struct qstr *d_name = (void*)&dentry->d_name;
> +			d = lookup_dcache(d_name, dentry->d_parent, 0);
> +			if (!d)
> +				d = __lookup_slow(d_name, dentry->d_parent, 0);

Quite a few callers use lookup_one() here, which calls
inode_permission() on @dir again. Are we guaranteed that the permission
check would always pass?

> +			if (IS_ERR(d)) {
> +				error = PTR_ERR(d);
> +			} else if (unlikely(d_is_negative(d))) {
> +				dput(d);
> +				error = -ENOENT;
> +			} else {
> +				dput(dentry);
> +				*dentryp = d;
> +			}
> +		}
> +	}
> +	return error;
> +}
> +EXPORT_SYMBOL(vfs_mkdir_return);
> +
>  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  {
>  	struct dentry *dentry;
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 29cb7b812d71..740332413138 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1488,26 +1488,11 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  			nfsd_check_ignore_resizing(iap);
>  		break;
>  	case S_IFDIR:
> -		host_err = vfs_mkdir(&nop_mnt_idmap, dirp, dchild, iap->ia_mode);
> -		if (!host_err && unlikely(d_unhashed(dchild))) {
> -			struct dentry *d;
> -			d = lookup_one_len(dchild->d_name.name,
> -					   dchild->d_parent,
> -					   dchild->d_name.len);
> -			if (IS_ERR(d)) {
> -				host_err = PTR_ERR(d);
> -				break;
> -			}
> -			if (unlikely(d_is_negative(d))) {
> -				dput(d);
> -				err = nfserr_serverfault;
> -				goto out;
> -			}
> +		host_err = vfs_mkdir_return(&nop_mnt_idmap, dirp, &dchild, iap->ia_mode);
> +		if (!host_err && unlikely(dchild != resfhp->fh_dentry)) {
>  			dput(resfhp->fh_dentry);
> -			resfhp->fh_dentry = dget(d);
> +			resfhp->fh_dentry = dget(dchild);
>  			err = fh_update(resfhp);
> -			dput(dchild);
> -			dchild = d;
>  			if (err)
>  				goto out;
>  		}
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index c9993ff66fc2..e6c54c6ef0f5 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -138,37 +138,6 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
>  	goto out;
>  }
>  
> -int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
> -		   struct dentry **newdentry, umode_t mode)
> -{
> -	int err;
> -	struct dentry *d, *dentry = *newdentry;
> -
> -	err = ovl_do_mkdir(ofs, dir, dentry, mode);
> -	if (err)
> -		return err;
> -
> -	if (likely(!d_unhashed(dentry)))
> -		return 0;
> -
> -	/*
> -	 * vfs_mkdir() may succeed and leave the dentry passed
> -	 * to it unhashed and negative. If that happens, try to
> -	 * lookup a new hashed and positive dentry.
> -	 */
> -	d = ovl_lookup_upper(ofs, dentry->d_name.name, dentry->d_parent,
> -			     dentry->d_name.len);
> -	if (IS_ERR(d)) {
> -		pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
> -			dentry, err);
> -		return PTR_ERR(d);
> -	}
> -	dput(dentry);
> -	*newdentry = d;
> -
> -	return 0;
> -}
> -
>  struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
>  			       struct dentry *newdentry, struct ovl_cattr *attr)
>  {
> @@ -191,7 +160,7 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
>  
>  		case S_IFDIR:
>  			/* mkdir is special... */
> -			err =  ovl_mkdir_real(ofs, dir, &newdentry, attr->mode);
> +			err =  ovl_do_mkdir(ofs, dir, &newdentry, attr->mode);
>  			break;
>  
>  		case S_IFCHR:
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 0021e2025020..967870f12482 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -242,11 +242,11 @@ static inline int ovl_do_create(struct ovl_fs *ofs,
>  }
>  
>  static inline int ovl_do_mkdir(struct ovl_fs *ofs,
> -			       struct inode *dir, struct dentry *dentry,
> +			       struct inode *dir, struct dentry **dentry,
>  			       umode_t mode)
>  {
> -	int err = vfs_mkdir(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
> -	pr_debug("mkdir(%pd2, 0%o) = %i\n", dentry, mode, err);
> +	int err = vfs_mkdir_return(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
> +	pr_debug("mkdir(%pd2, 0%o) = %i\n", *dentry, mode, err);
>  	return err;
>  }
>  
> @@ -838,8 +838,8 @@ struct ovl_cattr {
>  
>  #define OVL_CATTR(m) (&(struct ovl_cattr) { .mode = (m) })
>  
> -int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
> -		   struct dentry **newdentry, umode_t mode);
> +int ovl_do_mkdir(struct ovl_fs *ofs, struct inode *dir,
> +	      struct dentry **newdentry, umode_t mode);
>  struct dentry *ovl_create_real(struct ovl_fs *ofs,
>  			       struct inode *dir, struct dentry *newdentry,
>  			       struct ovl_cattr *attr);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 86ae6f6da36b..06ca8b01c336 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -327,7 +327,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
>  			goto retry;
>  		}
>  
> -		err = ovl_mkdir_real(ofs, dir, &work, attr.ia_mode);
> +		err = ovl_do_mkdir(ofs, dir, &work, attr.ia_mode);
>  		if (err)
>  			goto out_dput;
>  
> diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
> index 6890016e1923..4e580bb7baf8 100644
> --- a/fs/smb/server/vfs.c
> +++ b/fs/smb/server/vfs.c
> @@ -211,7 +211,7 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
>  {
>  	struct mnt_idmap *idmap;
>  	struct path path;
> -	struct dentry *dentry;
> +	struct dentry *dentry, *d;
>  	int err;
>  
>  	dentry = ksmbd_vfs_kern_path_create(work, name,
> @@ -227,27 +227,11 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
>  
>  	idmap = mnt_idmap(path.mnt);
>  	mode |= S_IFDIR;
> -	err = vfs_mkdir(idmap, d_inode(path.dentry), dentry, mode);
> -	if (!err && d_unhashed(dentry)) {
> -		struct dentry *d;
> -
> -		d = lookup_one(idmap, dentry->d_name.name, dentry->d_parent,
> -			       dentry->d_name.len);
> -		if (IS_ERR(d)) {
> -			err = PTR_ERR(d);
> -			goto out_err;
> -		}
> -		if (unlikely(d_is_negative(d))) {
> -			dput(d);
> -			err = -ENOENT;
> -			goto out_err;
> -		}
> -
> +	d = dentry;
> +	err = vfs_mkdir_return(idmap, d_inode(path.dentry), &dentry, mode);
> +	if (!err && dentry != d)
>  		ksmbd_vfs_inherit_owner(work, d_inode(path.dentry), d_inode(d));
> -		dput(d);
> -	}
>  
> -out_err:
>  	done_path_create(&path, dentry);
>  	if (err)
>  		pr_err("mkdir(%s): creation failed (err:%d)\n", name, err);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index be3ad155ec9f..f81d6bc65fe4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1971,6 +1971,8 @@ int vfs_create(struct mnt_idmap *, struct inode *,
>  	       struct dentry *, umode_t, bool);
>  int vfs_mkdir(struct mnt_idmap *, struct inode *,
>  	      struct dentry *, umode_t);
> +int vfs_mkdir_return(struct mnt_idmap *, struct inode *,
> +		     struct dentry **, umode_t);
>  int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
>                umode_t, dev_t);
>  int vfs_symlink(struct mnt_idmap *, struct inode *,
> -- 
> 2.47.1
> 


* Re: [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  2025-02-06  5:42 ` [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
@ 2025-02-06 12:31   ` Christian Brauner
  2025-02-06 13:09     ` Christian Brauner
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:31 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:41PM +1100, NeilBrown wrote:
> No callers of kern_path_locked() or user_path_locked_at() want a
> negative dentry.  So change them to return -ENOENT instead.  This
> simplifies callers.
> 
> This results in a subtle change to bcachefs in that an ioctl will now
> return -ENOENT in preference to -EXDEV.  I believe this restores the
> behaviour to what it was prior to
>  Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking")
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---

It would be nice if you could send this as a separate cleanup patch.
It seems unrelated to the series.

>  drivers/base/devtmpfs.c | 65 +++++++++++++++++++----------------------
>  fs/bcachefs/fs-ioctl.c  |  4 ---
>  fs/namei.c              |  4 +++
>  kernel/audit_watch.c    | 12 ++++----
>  4 files changed, 40 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> index b848764ef018..c9e34842139f 100644
> --- a/drivers/base/devtmpfs.c
> +++ b/drivers/base/devtmpfs.c
> @@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
>  	dentry = kern_path_locked(name, &parent);
>  	if (IS_ERR(dentry))
>  		return PTR_ERR(dentry);
> -	if (d_really_is_positive(dentry)) {
> -		if (d_inode(dentry)->i_private == &thread)
> -			err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> -					dentry);
> -		else
> -			err = -EPERM;
> -	} else {
> -		err = -ENOENT;
> -	}
> +	if (d_inode(dentry)->i_private == &thread)
> +		err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> +				dentry);
> +	else
> +		err = -EPERM;
> +
>  	dput(dentry);
>  	inode_unlock(d_inode(parent.dentry));
>  	path_put(&parent);
> @@ -310,6 +307,8 @@ static int handle_remove(const char *nodename, struct device *dev)
>  {
>  	struct path parent;
>  	struct dentry *dentry;
> +	struct kstat stat;
> +	struct path p;
>  	int deleted = 0;
>  	int err;
>  
> @@ -317,32 +316,28 @@ static int handle_remove(const char *nodename, struct device *dev)
>  	if (IS_ERR(dentry))
>  		return PTR_ERR(dentry);
>  
> -	if (d_really_is_positive(dentry)) {
> -		struct kstat stat;
> -		struct path p = {.mnt = parent.mnt, .dentry = dentry};
> -		err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> -				  AT_STATX_SYNC_AS_STAT);
> -		if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> -			struct iattr newattrs;
> -			/*
> -			 * before unlinking this node, reset permissions
> -			 * of possible references like hardlinks
> -			 */
> -			newattrs.ia_uid = GLOBAL_ROOT_UID;
> -			newattrs.ia_gid = GLOBAL_ROOT_GID;
> -			newattrs.ia_mode = stat.mode & ~0777;
> -			newattrs.ia_valid =
> -				ATTR_UID|ATTR_GID|ATTR_MODE;
> -			inode_lock(d_inode(dentry));
> -			notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> -			inode_unlock(d_inode(dentry));
> -			err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> -					 dentry, NULL);
> -			if (!err || err == -ENOENT)
> -				deleted = 1;
> -		}
> -	} else {
> -		err = -ENOENT;
> +	p.mnt = parent.mnt;
> +	p.dentry = dentry;
> +	err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> +			  AT_STATX_SYNC_AS_STAT);
> +	if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> +		struct iattr newattrs;
> +		/*
> +		 * before unlinking this node, reset permissions
> +		 * of possible references like hardlinks
> +		 */
> +		newattrs.ia_uid = GLOBAL_ROOT_UID;
> +		newattrs.ia_gid = GLOBAL_ROOT_GID;
> +		newattrs.ia_mode = stat.mode & ~0777;
> +		newattrs.ia_valid =
> +			ATTR_UID|ATTR_GID|ATTR_MODE;
> +		inode_lock(d_inode(dentry));
> +		notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> +		inode_unlock(d_inode(dentry));
> +		err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> +				 dentry, NULL);
> +		if (!err || err == -ENOENT)
> +			deleted = 1;
>  	}
>  	dput(dentry);
>  	inode_unlock(d_inode(parent.dentry));
> diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
> index 15725b4ce393..595b57fabc9a 100644
> --- a/fs/bcachefs/fs-ioctl.c
> +++ b/fs/bcachefs/fs-ioctl.c
> @@ -511,10 +511,6 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
>  		ret = -EXDEV;
>  		goto err;
>  	}
> -	if (!d_is_positive(victim)) {
> -		ret = -ENOENT;
> -		goto err;
> -	}
>  	ret = __bch2_unlink(dir, victim, true);
>  	if (!ret) {
>  		fsnotify_rmdir(dir, victim);
> diff --git a/fs/namei.c b/fs/namei.c
> index d684102d873d..1901120bcbb8 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2745,6 +2745,10 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
>  	}
>  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
>  	d = lookup_one_qstr(&last, path->dentry, 0);
> +	if (!IS_ERR(d) && d_is_negative(d)) {
> +		dput(d);
> +		d = ERR_PTR(-ENOENT);
> +	}
>  	if (IS_ERR(d)) {
>  		inode_unlock(path->dentry->d_inode);
>  		path_put(path);
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index 7f358740e958..e3130675ee6b 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -350,11 +350,10 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
>  	struct dentry *d = kern_path_locked(watch->path, parent);
>  	if (IS_ERR(d))
>  		return PTR_ERR(d);
> -	if (d_is_positive(d)) {
> -		/* update watch filter fields */
> -		watch->dev = d->d_sb->s_dev;
> -		watch->ino = d_backing_inode(d)->i_ino;
> -	}
> +	/* update watch filter fields */
> +	watch->dev = d->d_sb->s_dev;
> +	watch->ino = d_backing_inode(d)->i_ino;
> +
>  	inode_unlock(d_backing_inode(parent->dentry));
>  	dput(d);
>  	return 0;
> @@ -419,7 +418,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
>  	/* caller expects mutex locked */
>  	mutex_lock(&audit_filter_mutex);
>  
> -	if (ret) {
> +	if (ret && ret != -ENOENT) {
>  		audit_put_watch(watch);
>  		return ret;
>  	}
> @@ -438,6 +437,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
>  
>  	h = audit_hash_ino((u32)watch->ino);
>  	*list = &audit_inode_hash[h];
> +	ret = 0;
>  error:
>  	path_put(&parent_path);
>  	audit_put_watch(watch);
> -- 
> 2.47.1
> 
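
For readers following the __kern_path_locked() hunk quoted above, here is a minimal user-space sketch of its ordering. It is only a model under stated assumptions: struct model_dentry, model_path_locked() and parent_lock_depth are hypothetical names, and ERR_PTR()/IS_ERR()/PTR_ERR() are simplified stand-ins for the kernel helpers. It shows a negative result being converted to -ENOENT before the IS_ERR() check, so that it goes through the same cleanup branch as any other error.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's ERR_PTR helpers (simplified). */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical miniature dentry: "negative" means no inode attached. */
struct model_dentry { bool has_inode; };

/* Hypothetical counter modelling the lock on the parent directory inode. */
static int parent_lock_depth;

/*
 * Models the tail of __kern_path_locked() as changed by the hunk:
 * a negative result becomes ERR_PTR(-ENOENT) before the IS_ERR()
 * check, so it takes the same cleanup path as other errors.
 */
static struct model_dentry *model_path_locked(struct model_dentry *found)
{
	struct model_dentry *d = found;

	parent_lock_depth++;			/* inode_lock_nested() */
	if (!IS_ERR(d) && !d->has_inode)	/* d_is_negative() */
		d = ERR_PTR(-ENOENT);		/* real code calls dput() first */
	if (IS_ERR(d))
		parent_lock_depth--;		/* inode_unlock() + path_put() */
	return d;
}
```

In this model a caller only ever sees an error pointer (with the lock already dropped) or a positive dentry (with the lock still held), which is the property the simplified callers in the patch rely on.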

* Re: [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
  2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
@ 2025-02-06 12:33   ` Christian Brauner
  2025-02-07 20:14   ` Al Viro
  2025-02-09 20:23   ` Al Viro
  2 siblings, 0 replies; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:42PM +1100, NeilBrown wrote:
> Callers of lookup_one_qstr() often check if the result is negative or
> positive.
> These checks can easily be moved into lookup_one_qstr() by checking the
> lookup flags:
> LOOKUP_CREATE means it is NOT an error if the name doesn't exist.
> LOOKUP_EXCL means it IS an error if the name DOES exist.
> 
> This patch adds these checks, then removes error checks from callers,
> and ensures that appropriate flags are passed.
> 
> This subtly changes the meaning of LOOKUP_EXCL.  Previously it could
> only accompany LOOKUP_CREATE.  Now it can accompany LOOKUP_RENAME_TARGET
> as well.  A couple of small changes are needed to accommodate this.  The
> NFS change is functionally a no-op but ensures nfs_is_exclusive_create()
> does exactly what the name says.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---

This would be a worthwhile cleanup patch to lookup_one_qstr_excl()
before you've modified it to be lookup_one_qstr(). So this should also
go separately imho.

>  fs/namei.c            | 61 ++++++++++++++-----------------------------
>  fs/nfs/dir.c          |  3 ++-
>  fs/smb/server/vfs.c   | 26 +++++++-----------
>  include/linux/namei.h |  2 +-
>  4 files changed, 33 insertions(+), 59 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 1901120bcbb8..69610047f6c6 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1668,6 +1668,8 @@ static struct dentry *lookup_dcache(const struct qstr *name,
>   * Parent directory has inode locked: exclusive or shared.
>   * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
>   * must be called after the intended operation is performed - or aborted.
> + * Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
> + * Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
>   */
>  struct dentry *lookup_one_qstr(const struct qstr *name,
>  			       struct dentry *base,
> @@ -1678,7 +1680,7 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
>  	struct inode *dir = base->d_inode;
>  
>  	if (dentry)
> -		return dentry;
> +		goto found;
>  
>  	/* Don't create child dentry for a dead directory. */
>  	if (unlikely(IS_DEADDIR(dir)))
> @@ -1689,7 +1691,7 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
>  		return ERR_PTR(-ENOMEM);
>  	if (!d_in_lookup(dentry))
>  		/* Raced with another thread which did the lookup */
> -		return dentry;
> +		goto found;
>  
>  	old = dir->i_op->lookup(dir, dentry, flags);
>  	if (unlikely(old)) {
> @@ -1700,6 +1702,15 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
>  	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
>  		/* ->lookup must have given final answer */
>  		d_lookup_done(dentry);
> +found:
> +	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
> +		dput(dentry);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	if (d_is_positive(dentry) && (flags & LOOKUP_EXCL)) {
> +		dput(dentry);
> +		return ERR_PTR(-EEXIST);
> +	}
>  	return dentry;
>  }
>  EXPORT_SYMBOL(lookup_one_qstr);
> @@ -2745,10 +2756,6 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
>  	}
>  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
>  	d = lookup_one_qstr(&last, path->dentry, 0);
> -	if (!IS_ERR(d) && d_is_negative(d)) {
> -		dput(d);
> -		d = ERR_PTR(-ENOENT);
> -	}
>  	if (IS_ERR(d)) {
>  		inode_unlock(path->dentry->d_inode);
>  		path_put(path);
> @@ -4085,27 +4092,13 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  	 * '/', and a directory wasn't requested.
>  	 */
>  	if (last.name[last.len] && !want_dir)
> -		create_flags = 0;
> +		create_flags &= ~LOOKUP_CREATE;
>  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
>  	dentry = lookup_one_qstr(&last, path->dentry,
>  				 reval_flag | create_flags);
>  	if (IS_ERR(dentry))
>  		goto unlock;
>  
> -	error = -EEXIST;
> -	if (d_is_positive(dentry))
> -		goto fail;
> -
> -	/*
> -	 * Special case - lookup gave negative, but... we had foo/bar/
> -	 * From the vfs_mknod() POV we just have a negative dentry -
> -	 * all is fine. Let's be bastards - you had / on the end, you've
> -	 * been asking for (non-existent) directory. -ENOENT for you.
> -	 */
> -	if (unlikely(!create_flags)) {
> -		error = -ENOENT;
> -		goto fail;
> -	}
>  	if (unlikely(err2)) {
>  		error = err2;
>  		goto fail;
> @@ -4522,10 +4515,6 @@ int do_rmdir(int dfd, struct filename *name)
>  	error = PTR_ERR(dentry);
>  	if (IS_ERR(dentry))
>  		goto exit3;
> -	if (!dentry->d_inode) {
> -		error = -ENOENT;
> -		goto exit4;
> -	}
>  	error = security_path_rmdir(&path, dentry);
>  	if (error)
>  		goto exit4;
> @@ -4656,7 +4645,7 @@ int do_unlinkat(int dfd, struct filename *name)
>  	if (!IS_ERR(dentry)) {
>  
>  		/* Why not before? Because we want correct error value */
> -		if (last.name[last.len] || d_is_negative(dentry))
> +		if (last.name[last.len])
>  			goto slashes;
>  		inode = dentry->d_inode;
>  		ihold(inode);
> @@ -4690,9 +4679,7 @@ int do_unlinkat(int dfd, struct filename *name)
>  	return error;
>  
>  slashes:
> -	if (d_is_negative(dentry))
> -		error = -ENOENT;
> -	else if (d_is_dir(dentry))
> +	if (d_is_dir(dentry))
>  		error = -EISDIR;
>  	else
>  		error = -ENOTDIR;
> @@ -5192,7 +5179,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  	struct qstr old_last, new_last;
>  	int old_type, new_type;
>  	struct inode *delegated_inode = NULL;
> -	unsigned int lookup_flags = 0, target_flags = LOOKUP_RENAME_TARGET;
> +	unsigned int lookup_flags = 0, target_flags =
> +		LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
>  	bool should_retry = false;
>  	int error = -EINVAL;
>  
> @@ -5205,6 +5193,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  
>  	if (flags & RENAME_EXCHANGE)
>  		target_flags = 0;
> +	if (flags & RENAME_NOREPLACE)
> +		target_flags |= LOOKUP_EXCL;
>  
>  retry:
>  	error = filename_parentat(olddfd, from, lookup_flags, &old_path,
> @@ -5246,23 +5236,12 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  	error = PTR_ERR(old_dentry);
>  	if (IS_ERR(old_dentry))
>  		goto exit3;
> -	/* source must exist */
> -	error = -ENOENT;
> -	if (d_is_negative(old_dentry))
> -		goto exit4;
>  	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
>  				     lookup_flags | target_flags);
>  	error = PTR_ERR(new_dentry);
>  	if (IS_ERR(new_dentry))
>  		goto exit4;
> -	error = -EEXIST;
> -	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
> -		goto exit5;
>  	if (flags & RENAME_EXCHANGE) {
> -		error = -ENOENT;
> -		if (d_is_negative(new_dentry))
> -			goto exit5;
> -
>  		if (!d_is_dir(new_dentry)) {
>  			error = -ENOTDIR;
>  			if (new_last.name[new_last.len])
> diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
> index 27c7a5c4e91b..8cbe63f4089a 100644
> --- a/fs/nfs/dir.c
> +++ b/fs/nfs/dir.c
> @@ -1531,7 +1531,8 @@ static int nfs_is_exclusive_create(struct inode *dir, unsigned int flags)
>  {
>  	if (NFS_PROTO(dir)->version == 2)
>  		return 0;
> -	return flags & LOOKUP_EXCL;
> +	return (flags & (LOOKUP_CREATE | LOOKUP_EXCL)) ==
> +		(LOOKUP_CREATE | LOOKUP_EXCL);
>  }
>  
>  /*
> diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
> index 89b3823f6405..bf8ac43c39b0 100644
> --- a/fs/smb/server/vfs.c
> +++ b/fs/smb/server/vfs.c
> @@ -113,11 +113,6 @@ static int ksmbd_vfs_path_lookup_locked(struct ksmbd_share_config *share_conf,
>  	if (IS_ERR(d))
>  		goto err_out;
>  
> -	if (d_is_negative(d)) {
> -		dput(d);
> -		goto err_out;
> -	}
> -
>  	path->dentry = d;
>  	path->mnt = mntget(parent_path->mnt);
>  
> @@ -677,6 +672,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  	struct ksmbd_file *parent_fp;
>  	int new_type;
>  	int err, lookup_flags = LOOKUP_NO_SYMLINKS;
> +	int target_lookup_flags = LOOKUP_RENAME_TARGET;
>  
>  	if (ksmbd_override_fsids(work))
>  		return -ENOMEM;
> @@ -687,6 +683,14 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  		goto revert_fsids;
>  	}
>  
> +	/*
> +	 * explicitly handle file overwrite case, for compatibility with
> +	 * filesystems that may not support rename flags (e.g: fuse)
> +	 */
> +	if (flags & RENAME_NOREPLACE)
> +		target_lookup_flags |= LOOKUP_EXCL;
> +	flags &= ~(RENAME_NOREPLACE);
> +
>  retry:
>  	err = vfs_path_parent_lookup(to, lookup_flags | LOOKUP_BENEATH,
>  				     &new_path, &new_last, &new_type,
> @@ -727,7 +731,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  	}
>  
>  	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
> -				     lookup_flags | LOOKUP_RENAME_TARGET);
> +				     lookup_flags | target_lookup_flags);
>  	if (IS_ERR(new_dentry)) {
>  		err = PTR_ERR(new_dentry);
>  		goto out3;
> @@ -738,16 +742,6 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  		goto out4;
>  	}
>  
> -	/*
> -	 * explicitly handle file overwrite case, for compatibility with
> -	 * filesystems that may not support rename flags (e.g: fuse)
> -	 */
> -	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry)) {
> -		err = -EEXIST;
> -		goto out4;
> -	}
> -	flags &= ~(RENAME_NOREPLACE);
> -
>  	if (old_child == trap) {
>  		err = -EINVAL;
>  		goto out4;
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 06bb3ea65beb..839a64d07f8c 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -31,7 +31,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  /* These tell filesystem methods that we are dealing with the final component... */
>  #define LOOKUP_OPEN		0x0100	/* ... in open */
>  #define LOOKUP_CREATE		0x0200	/* ... in object creation */
> -#define LOOKUP_EXCL		0x0400	/* ... in exclusive creation */
> +#define LOOKUP_EXCL		0x0400	/* ... in target must not exist */
>  #define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
>  
>  #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> -- 
> 2.47.1
> 
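
The two checks added after the "found:" label in the quoted hunk can be sketched in user space as a small decision function. This is an illustration only: model_lookup_result() and model_is_exclusive_create() are hypothetical names, the flag values are the ones shown in the include/linux/namei.h hunk of this patch, and the NFSv2 special case of nfs_is_exclusive_create() is left out.

```c
#include <errno.h>
#include <stdbool.h>

/* Flag values as shown in include/linux/namei.h in this patch. */
#define LOOKUP_CREATE		0x0200
#define LOOKUP_EXCL		0x0400
#define LOOKUP_RENAME_TARGET	0x0800

/*
 * Hypothetical helper mirroring the checks added after "found:".
 * Returns 0 if the dentry would be handed back, else a negative errno.
 */
static int model_lookup_result(bool name_exists, unsigned int flags)
{
	if (!name_exists && !(flags & LOOKUP_CREATE))
		return -ENOENT;	/* missing name is only OK when creating */
	if (name_exists && (flags & LOOKUP_EXCL))
		return -EEXIST;	/* existing name is an error with EXCL */
	return 0;
}

/* Mirrors the updated nfs_is_exclusive_create() test (NFSv2 case omitted). */
static bool model_is_exclusive_create(unsigned int flags)
{
	return (flags & (LOOKUP_CREATE | LOOKUP_EXCL)) ==
		(LOOKUP_CREATE | LOOKUP_EXCL);
}
```

With flags of 0, as passed by __kern_path_locked(), a missing name yields -ENOENT. With LOOKUP_EXCL set, an existing name yields -EEXIST. LOOKUP_EXCL without LOOKUP_CREATE is not treated as an exclusive create by the second helper.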

* Re: (subset) [PATCH 06/19] VFS: repack DENTRY_ flags.
  2025-02-06  5:42 ` [PATCH 06/19] VFS: repack DENTRY_ flags NeilBrown
@ 2025-02-06 12:34   ` Christian Brauner
  0 siblings, 0 replies; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:34 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, Alexander Viro,
	Jan Kara, Linus Torvalds, Jeff Layton, Dave Chinner

On Thu, 06 Feb 2025 16:42:43 +1100, NeilBrown wrote:
> Bits 13, 23, 24, and 27 are not used.  Move all those holes to the end.
> 
> 

This is a useful cleanup independent of the rest of the series.

---

Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc

[06/19] VFS: repack DENTRY_ flags.
        https://git.kernel.org/vfs/vfs/c/893dd4ccbb7b

* Re: [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
  2025-02-06  5:42 ` [PATCH 07/19] VFS: repack LOOKUP_ bit flags NeilBrown
@ 2025-02-06 12:44   ` Christian Brauner
  2025-02-07  0:24     ` NeilBrown
  2025-02-06 12:54   ` (subset) " Christian Brauner
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:44 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:44PM +1100, NeilBrown wrote:
> The LOOKUP_ bits are not in order, which can make it awkward when adding
> new bits.  Two bits have recently been added to the end which makes them
> look like "scoping flags", but in fact they aren't.
> 
> Also LOOKUP_PARENT is described as "internal use only" but is used in
> fs/nfs/
> 
> This patch:
>  - Moves these three flags into the "pathwalk mode" section
>  - changes all bits to use the BIT(n) macro
>  - Allocates bits in order leaving gaps between the sections,
>    and documents those gaps.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---

This is also a worthwhile cleanup independent of the rest of the series.
But you've added LOOKUP_INTENT_FLAGS prior to packing the flags. Imho,
this patch should've gone before the addition of LOOKUP_INTENT_FLAGS.

And btw, what does this series apply to?
It applies neither to next-20250206 nor to current mainline.
I get the usual

Patch failed at 0012 VFS: enhance d_splice_alias to accommodate shared-lock updates
error: sha1 information is lacking or useless (fs/dcache.c).
error: could not build fake ancestor

when trying to look at this locally.

>  include/linux/namei.h | 46 +++++++++++++++++++++----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 839a64d07f8c..0d81e571a159 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -18,38 +18,38 @@ enum { MAX_NESTED_LINKS = 8 };
>  enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  
>  /* pathwalk mode */
> -#define LOOKUP_FOLLOW		0x0001	/* follow links at the end */
> -#define LOOKUP_DIRECTORY	0x0002	/* require a directory */
> -#define LOOKUP_AUTOMOUNT	0x0004  /* force terminal automount */
> -#define LOOKUP_EMPTY		0x4000	/* accept empty path [user_... only] */
> -#define LOOKUP_DOWN		0x8000	/* follow mounts in the starting point */
> -#define LOOKUP_MOUNTPOINT	0x0080	/* follow mounts in the end */
> -
> -#define LOOKUP_REVAL		0x0020	/* tell ->d_revalidate() to trust no cache */
> -#define LOOKUP_RCU		0x0040	/* RCU pathwalk mode; semi-internal */
> +#define LOOKUP_FOLLOW		BIT(0)	/* follow links at the end */
> +#define LOOKUP_DIRECTORY	BIT(1)	/* require a directory */
> +#define LOOKUP_AUTOMOUNT	BIT(2)  /* force terminal automount */
> +#define LOOKUP_EMPTY		BIT(3)	/* accept empty path [user_... only] */
> +#define LOOKUP_LINKAT_EMPTY	BIT(4) /* Linkat request with empty path. */
> +#define LOOKUP_DOWN		BIT(5)	/* follow mounts in the starting point */
> +#define LOOKUP_MOUNTPOINT	BIT(6)	/* follow mounts in the end */
> +#define LOOKUP_REVAL		BIT(7)	/* tell ->d_revalidate() to trust no cache */
> +#define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
> +#define LOOKUP_CACHED		BIT(9) /* Only do cached lookup */
> +#define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
> +/* 5 spare bits for pathwalk */
>  
>  /* These tell filesystem methods that we are dealing with the final component... */
> -#define LOOKUP_OPEN		0x0100	/* ... in open */
> -#define LOOKUP_CREATE		0x0200	/* ... in object creation */
> -#define LOOKUP_EXCL		0x0400	/* ... in target must not exist */
> -#define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
> +#define LOOKUP_OPEN		BIT(16)	/* ... in open */
> +#define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
> +#define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
> +#define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
>  
>  #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
>  				 LOOKUP_RENAME_TARGET)
> -
> -/* internal use only */
> -#define LOOKUP_PARENT		0x0010
> +/* 4 spare bits for intent */
>  
>  /* Scoping flags for lookup. */
> -#define LOOKUP_NO_SYMLINKS	0x010000 /* No symlink crossing. */
> -#define LOOKUP_NO_MAGICLINKS	0x020000 /* No nd_jump_link() crossing. */
> -#define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
> -#define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
> -#define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
> -#define LOOKUP_CACHED		0x200000 /* Only do cached lookup */
> -#define LOOKUP_LINKAT_EMPTY	0x400000 /* Linkat request with empty path. */
> +#define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
> +#define LOOKUP_NO_MAGICLINKS	BIT(25) /* No nd_jump_link() crossing. */
> +#define LOOKUP_NO_XDEV		BIT(26) /* No mountpoint crossing. */
> +#define LOOKUP_BENEATH		BIT(27) /* No escaping from starting point. */
> +#define LOOKUP_IN_ROOT		BIT(28) /* Treat dirfd as fs root. */
>  /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
>  #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
> +/* 3 spare bits for scoping */
>  
>  extern int path_pts(struct path *path);
>  
> -- 
> 2.47.1
> 
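
As a quick sanity check of the new layout, the repacked values can be reproduced in user space. This is a sketch under assumptions: BIT() here is a simplified stand-in for the kernel macro, only a few of the flags from the quoted hunk are copied, and has_intent() is a hypothetical helper, not something the patch adds.

```c
/* User-space stand-in for the kernel's BIT() macro. */
#define BIT(n)			(1UL << (n))

/* pathwalk mode: bits 0-10 are used, 11-15 are spare */
#define LOOKUP_PARENT		BIT(10)

/* intent flags: bits 16-19 are used, 20-23 are spare */
#define LOOKUP_OPEN		BIT(16)
#define LOOKUP_CREATE		BIT(17)
#define LOOKUP_EXCL		BIT(18)
#define LOOKUP_RENAME_TARGET	BIT(19)
#define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL | \
				 LOOKUP_RENAME_TARGET)

/* scoping flags: bits 24-28 are used, 29-31 are spare */
#define LOOKUP_NO_SYMLINKS	BIT(24)
#define LOOKUP_IN_ROOT		BIT(28)

/* Hypothetical helper: does @flags carry any intent bit? */
static int has_intent(unsigned long flags)
{
	return (flags & LOOKUP_INTENT_FLAGS) != 0;
}
```

After the repack the intent mask is the contiguous range 0xF0000, and each section starts on an 8-bit boundary, which matches the spare-bit counts given in the patch comments.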

* Re: (subset) [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
  2025-02-06  5:42 ` [PATCH 07/19] VFS: repack LOOKUP_ bit flags NeilBrown
  2025-02-06 12:44   ` Christian Brauner
@ 2025-02-06 12:54   ` Christian Brauner
  1 sibling, 0 replies; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 12:54 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, linux-fsdevel, linux-kernel, Alexander Viro,
	Jan Kara, Linus Torvalds, Jeff Layton, Dave Chinner

On Thu, 06 Feb 2025 16:42:44 +1100, NeilBrown wrote:
> The LOOKUP_ bits are not in order, which can make it awkward when adding
> new bits.  Two bits have recently been added to the end which makes them
> look like "scoping flags", but in fact they aren't.
> 
> Also LOOKUP_PARENT is described as "internal use only" but is used in
> fs/nfs/
> 
> [...]

Applied to the vfs-6.15.misc branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.misc

[07/19] VFS: repack LOOKUP_ bit flags.
        https://git.kernel.org/vfs/vfs/c/01db36d3f0da

* Re: [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  2025-02-06 12:31   ` Christian Brauner
@ 2025-02-06 13:09     ` Christian Brauner
  2025-02-07  0:08       ` NeilBrown
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 13:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 01:31:56PM +0100, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:41PM +1100, NeilBrown wrote:
> > No callers of kern_path_locked() or user_path_locked_at() want a
> > negative dentry.  So change them to return -ENOENT instead.  This
> > simplifies callers.
> > 
> > This results in a subtle change to bcachefs in that an ioctl will now
> > return -ENOENT in preference to -EXDEV.  I believe this restores the
> > behaviour to what it was prior to
> >  Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking")
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> 
> It would be nice if you could send this as a separate cleanup patch.
> It seems unrelated to the series.
> 
> >  drivers/base/devtmpfs.c | 65 +++++++++++++++++++----------------------
> >  fs/bcachefs/fs-ioctl.c  |  4 ---
> >  fs/namei.c              |  4 +++
> >  kernel/audit_watch.c    | 12 ++++----
> >  4 files changed, 40 insertions(+), 45 deletions(-)
> > 
> > diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> > index b848764ef018..c9e34842139f 100644
> > --- a/drivers/base/devtmpfs.c
> > +++ b/drivers/base/devtmpfs.c
> > @@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
> >  	dentry = kern_path_locked(name, &parent);
> >  	if (IS_ERR(dentry))
> >  		return PTR_ERR(dentry);
> > -	if (d_really_is_positive(dentry)) {
> > -		if (d_inode(dentry)->i_private == &thread)
> > -			err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> > -					dentry);
> > -		else
> > -			err = -EPERM;
> > -	} else {
> > -		err = -ENOENT;
> > -	}
> > +	if (d_inode(dentry)->i_private == &thread)
> > +		err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> > +				dentry);
> > +	else
> > +		err = -EPERM;
> > +
> >  	dput(dentry);
> >  	inode_unlock(d_inode(parent.dentry));
> >  	path_put(&parent);
> > @@ -310,6 +307,8 @@ static int handle_remove(const char *nodename, struct device *dev)
> >  {
> >  	struct path parent;
> >  	struct dentry *dentry;
> > +	struct kstat stat;
> > +	struct path p;
> >  	int deleted = 0;
> >  	int err;
> >  
> > @@ -317,32 +316,28 @@ static int handle_remove(const char *nodename, struct device *dev)
> >  	if (IS_ERR(dentry))
> >  		return PTR_ERR(dentry);
> >  
> > -	if (d_really_is_positive(dentry)) {
> > -		struct kstat stat;
> > -		struct path p = {.mnt = parent.mnt, .dentry = dentry};
> > -		err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> > -				  AT_STATX_SYNC_AS_STAT);
> > -		if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> > -			struct iattr newattrs;
> > -			/*
> > -			 * before unlinking this node, reset permissions
> > -			 * of possible references like hardlinks
> > -			 */
> > -			newattrs.ia_uid = GLOBAL_ROOT_UID;
> > -			newattrs.ia_gid = GLOBAL_ROOT_GID;
> > -			newattrs.ia_mode = stat.mode & ~0777;
> > -			newattrs.ia_valid =
> > -				ATTR_UID|ATTR_GID|ATTR_MODE;
> > -			inode_lock(d_inode(dentry));
> > -			notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> > -			inode_unlock(d_inode(dentry));
> > -			err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> > -					 dentry, NULL);
> > -			if (!err || err == -ENOENT)
> > -				deleted = 1;
> > -		}
> > -	} else {
> > -		err = -ENOENT;
> > +	p.mnt = parent.mnt;
> > +	p.dentry = dentry;
> > +	err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> > +			  AT_STATX_SYNC_AS_STAT);
> > +	if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> > +		struct iattr newattrs;
> > +		/*
> > +		 * before unlinking this node, reset permissions
> > +		 * of possible references like hardlinks
> > +		 */
> > +		newattrs.ia_uid = GLOBAL_ROOT_UID;
> > +		newattrs.ia_gid = GLOBAL_ROOT_GID;
> > +		newattrs.ia_mode = stat.mode & ~0777;
> > +		newattrs.ia_valid =
> > +			ATTR_UID|ATTR_GID|ATTR_MODE;
> > +		inode_lock(d_inode(dentry));
> > +		notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> > +		inode_unlock(d_inode(dentry));
> > +		err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> > +				 dentry, NULL);
> > +		if (!err || err == -ENOENT)
> > +			deleted = 1;
> >  	}
> >  	dput(dentry);
> >  	inode_unlock(d_inode(parent.dentry));
> > diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
> > index 15725b4ce393..595b57fabc9a 100644
> > --- a/fs/bcachefs/fs-ioctl.c
> > +++ b/fs/bcachefs/fs-ioctl.c
> > @@ -511,10 +511,6 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
> >  		ret = -EXDEV;
> >  		goto err;
> >  	}
> > -	if (!d_is_positive(victim)) {
> > -		ret = -ENOENT;
> > -		goto err;
> > -	}
> >  	ret = __bch2_unlink(dir, victim, true);
> >  	if (!ret) {
> >  		fsnotify_rmdir(dir, victim);
> > diff --git a/fs/namei.c b/fs/namei.c
> > index d684102d873d..1901120bcbb8 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -2745,6 +2745,10 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
> >  	}
> >  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> >  	d = lookup_one_qstr(&last, path->dentry, 0);
> > +	if (!IS_ERR(d) && d_is_negative(d)) {
> > +		dput(d);
> > +		d = ERR_PTR(-ENOENT);

This doesn't unlock, which afaict causes an issue with your devtmpfs
changes:

--- a/drivers/base/devtmpfs.c
+++ b/drivers/base/devtmpfs.c
@@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
 	dentry = kern_path_locked(name, &parent);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);

Here you fail to unlock, which means dev_rmdir() will return with the
inode lock held even though it returned an error?

-	if (d_really_is_positive(dentry)) {
-		if (d_inode(dentry)->i_private == &thread)
-			err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
-					dentry);
-		else
-			err = -EPERM;
-	} else {
-		err = -ENOENT;
-	}
+	if (d_inode(dentry)->i_private == &thread)
+		err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
+				dentry);
+	else
+		err = -EPERM;
+

> > +	}
> >  	if (IS_ERR(d)) {
> >  		inode_unlock(path->dentry->d_inode);
> >  		path_put(path);
> > diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> > index 7f358740e958..e3130675ee6b 100644
> > --- a/kernel/audit_watch.c
> > +++ b/kernel/audit_watch.c
> > @@ -350,11 +350,10 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
> >  	struct dentry *d = kern_path_locked(watch->path, parent);
> >  	if (IS_ERR(d))
> >  		return PTR_ERR(d);
> > -	if (d_is_positive(d)) {
> > -		/* update watch filter fields */
> > -		watch->dev = d->d_sb->s_dev;
> > -		watch->ino = d_backing_inode(d)->i_ino;
> > -	}
> > +	/* update watch filter fields */
> > +	watch->dev = d->d_sb->s_dev;
> > +	watch->ino = d_backing_inode(d)->i_ino;
> > +
> >  	inode_unlock(d_backing_inode(parent->dentry));
> >  	dput(d);
> >  	return 0;
> > @@ -419,7 +418,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
> >  	/* caller expects mutex locked */
> >  	mutex_lock(&audit_filter_mutex);
> >  
> > -	if (ret) {
> > +	if (ret && ret != -ENOENT) {
> >  		audit_put_watch(watch);
> >  		return ret;
> >  	}
> > @@ -438,6 +437,7 @@ int audit_add_watch(struct audit_krule *krule, struct list_head **list)
> >  
> >  	h = audit_hash_ino((u32)watch->ino);
> >  	*list = &audit_inode_hash[h];
> > +	ret = 0;
> >  error:
> >  	path_put(&parent_path);
> >  	audit_put_watch(watch);
> > -- 
> > 2.47.1
> > 

* Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-06  5:42 ` [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations NeilBrown
@ 2025-02-06 13:15   ` Christian Brauner
  2025-02-07  1:46     ` NeilBrown
  2025-02-07 22:41   ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 13:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:46PM +1100, NeilBrown wrote:
> These "_async" versions of various inode operations are only guaranteed
> a shared lock on the directory but if the directory isn't exclusively
> locked then they are guaranteed an exclusive lock on the dentry within
> the directory (which will be implemented in a later patch).
> 
> This will allow a graceful transition from exclusive to shared locking
> for directory updates, and even to async updates which can complete with
> no lock on the directory - only on the dentry.
> 
> mkdir_async is a bit different as it optionally returns a new dentry
> for cases when the filesystem is not able to use the original dentry.
> This allows vfs_mkdir_return() to avoid the need for an extra lookup.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  Documentation/filesystems/locking.rst |  51 ++++++++-
>  Documentation/filesystems/porting.rst |  10 ++
>  Documentation/filesystems/vfs.rst     |  24 +++++
>  fs/namei.c                            | 142 +++++++++++++++++++++-----
>  include/linux/fs.h                    |  24 +++++
>  5 files changed, 223 insertions(+), 28 deletions(-)
> 
> diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> index d20a32b77b60..adeead366332 100644
> --- a/Documentation/filesystems/locking.rst
> +++ b/Documentation/filesystems/locking.rst
> @@ -62,15 +62,24 @@ inode_operations
>  prototypes::
>  
>  	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
> +	int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool, struct dirop_ret *);

If we end up doing this then imho the correct thing to do would be to
extend the existing operations. Yes, that's more work, I know, as I've
done that multiple times myself, and it's a bit more annoying churn, but
we shouldn't just keep adding new methods without a good reason.

I assume that you've done that mostly so that you wouldn't be held up by
menial work for the prototype. That's obviously fine. But for the final
thing we should just fix up every implementation.
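
To make the "extend the existing operations" idea concrete, here is a
minimal userspace model, not kernel code. Only struct dirop_ret follows
the quoted patch (reduced to its int member); every toy_* name and the
simplified create signature are hypothetical. It shows one create slot
serving both a synchronous and an asynchronous filesystem, so the caller
needs a single code path:

```c
#include <errno.h>
#include <stddef.h>

/* Same shape as the struct in the quoted patch, minus the dentry member. */
struct dirop_ret {
	int err;
	void (*done_cb)(struct dirop_ret *);
};

/* Hypothetical ops table: ONE create slot with the extended signature. */
struct toy_inode_operations {
	int (*create)(const char *name, struct dirop_ret *dret);
};

/* Synchronous filesystem: ignores dret and returns the final result. */
static int toy_sync_create(const char *name, struct dirop_ret *dret)
{
	(void)dret;
	return name[0] ? 0 : -EINVAL;
}

/* Asynchronous filesystem: remembers dret and finishes later. */
static struct dirop_ret *toy_pending;

static int toy_async_create(const char *name, struct dirop_ret *dret)
{
	(void)name;
	toy_pending = dret;
	return -EINPROGRESS;
}

/* Hypothetical completion path: store the result, then notify. */
static void toy_async_complete(int result)
{
	struct dirop_ret *dret = toy_pending;

	toy_pending = NULL;
	dret->err = result;
	dret->done_cb(dret);
}

/* Counts notifications so a caller can observe completion. */
static int toy_done_calls;

static void toy_done_cb(struct dirop_ret *dret)
{
	(void)dret;
	toy_done_calls++;
}

/* Caller side: a single code path, no "if (op_async) ... else ...". */
static int toy_vfs_create(const struct toy_inode_operations *ops,
			  const char *name, struct dirop_ret *dret)
{
	if (!ops->create)
		return -EACCES;
	dret->err = -EINPROGRESS;
	dret->done_cb = toy_done_cb;
	return ops->create(name, dret);
}
```

A synchronous implementation only has to accept and ignore the extra
argument, which is the churn mentioned above. What the caller does after
seeing -EINPROGRESS is deliberately left out of this model.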

>  	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
>  	int (*link) (struct dentry *,struct inode *,struct dentry *);
> +	int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*unlink) (struct inode *,struct dentry *);
> +	int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> +	int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
>  	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> +	struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
>  	int (*rmdir) (struct inode *,struct dentry *);
> +	int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> +	int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
>  	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
>  			struct inode *, struct dentry *, unsigned int);
> +	int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> +			struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
>  	int (*readlink) (struct dentry *, char __user *,int);
>  	const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
>  	void (*truncate) (struct inode *);
> @@ -84,6 +93,9 @@ prototypes::
>  	int (*atomic_open)(struct inode *, struct dentry *,
>  				struct file *, unsigned open_flag,
>  				umode_t create_mode);
> +	int (*atomic_open_async)(struct inode *, struct dentry *,
> +				struct file *, unsigned open_flag,
> +				umode_t create_mode, struct dirop_ret *);
>  	int (*tmpfile) (struct mnt_idmap *, struct inode *,
>  			struct file *, umode_t);
>  	int (*fileattr_set)(struct mnt_idmap *idmap,
> @@ -95,18 +107,33 @@ prototypes::
>  locking rules:
>  	all may block
>  
> +All directory-modifying operations are called with an exclusive lock on
> +the target dentry or dentries using DCACHE_PAR_LOOKUP.  This allows the
> +shared lock on i_rwsem for the _async ops to be safe.  The lock on
> +i_rwsem may be dropped as soon as the op returns, though if it returns
> +-EINPROGRESS the lock using DCACHE_PAR_UPDATE will not be dropped until
> +the callback is called.
> +
>  ==============	==================================================
>  ops		i_rwsem(inode)
>  ==============	==================================================
>  lookup:		shared
>  create:		exclusive
> +create_async:	shared
>  link:		exclusive (both)
> +link_async:	exclusive on source, shared on target
>  mknod:		exclusive
> +mknod_async:	shared
>  symlink:	exclusive
> +symlink_async:	shared
>  mkdir:		exclusive
> +mkdir_async:	shared
>  unlink:		exclusive (both)
> +unlink_async:	exclusive on object, shared on directory/name
>  rmdir:		exclusive (both)(see below)
> +rmdir_async:	exclusive on object, shared on directory/name (see below)
>  rename:		exclusive (both parents, some children)	(see below)
> +rename_async:	shared (both parents) exclusive (some children)	(see below)
>  readlink:	no
>  get_link:	no
>  setattr:	exclusive
> @@ -118,6 +145,7 @@ listxattr:	no
>  fiemap:		no
>  update_time:	no
>  atomic_open:	shared (exclusive if O_CREAT is set in open flags)
> +atomic_open_async:	shared (if O_CREAT is not set, then may not have exclusive lock on name)
>  tmpfile:	no
>  fileattr_get:	no or exclusive
>  fileattr_set:	exclusive
> @@ -125,8 +153,10 @@ get_offset_ctx  no
>  ==============	==================================================
>  
>  
> -	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
> -	exclusive on victim.
> +	Additionally, ->rmdir(), ->unlink() and ->rename(), as well as _async
> +	versions, have ->i_rwsem exclusive on victim.  This exclusive lock
> +        may be dropped when the op completes even if the async operation is
> +        continuing.
>  	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
>  	->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
>  	involved.
> @@ -135,6 +165,23 @@ get_offset_ctx  no
>  See Documentation/filesystems/directory-locking.rst for more detailed discussion
>  of the locking scheme for directory operations.
>  
> +The _async operations will be passed a (non-NULL) struct dirop_ret pointer::
> +
> +	struct dirop_ret {
> +		union {
> +			int err;
> +			struct dentry *dentry;
> +		};
> +		void (*done_cb)(struct dirop_ret*);
> +	};
> +
> +They may return -EINPROGRESS (or ERR_PTR(-EINPROGRESS)) in which case
> +the op will continue asynchronously.  When it completes the result,
> +which must NOT be -EINPROGRESS, is stored in err or dentry (as
> +appropriate) and the done_cb() function is called.  Callers can only
> +make use of the asynchrony when they determine that no lock need be held
> +on i_rwsem.
> +
>  xattr_handler operations
>  ========================
>  
> diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> index 1639e78e3146..a736c9f30d9d 100644
> --- a/Documentation/filesystems/porting.rst
> +++ b/Documentation/filesystems/porting.rst
> @@ -1157,3 +1157,13 @@ in normal case it points into the pathname being looked up.
>  NOTE: if you need something like full path from the root of filesystem,
>  you are still on your own - this assists with simple cases, but it's not
>  magic.
> +
> +---
> +
> +**recommended**
> +
> +create_async, link_async, unlink_async, symlink_async, mkdir_async,
> +rmdir_async, mknod_async, rename_async and atomic_open_async can be
> +provided instead of the corresponding inode_operations without the
> +"_async" suffix.  Multiple _async operations can be performed in a
> +given directory concurrently, but never on the same name.
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 31eea688609a..e18655054e6c 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -491,15 +491,24 @@ As of kernel 2.6.22, the following members are defined:
>  
>  	struct inode_operations {
>  		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
> +		int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool, struct dirop_ret *);
>  		struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
>  		int (*link) (struct dentry *,struct inode *,struct dentry *);
> +		int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
>  		int (*unlink) (struct inode *,struct dentry *);
> +		int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  		int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> +		int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
>  		int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> +		struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
>  		int (*rmdir) (struct inode *,struct dentry *);
> +		int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  		int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> +		int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
>  		int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
>  			       struct inode *, struct dentry *, unsigned int);
> +		int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> +			       struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
>  		int (*readlink) (struct dentry *, char __user *,int);
>  		const char *(*get_link) (struct dentry *, struct inode *,
>  					 struct delayed_call *);
> @@ -511,6 +520,8 @@ As of kernel 2.6.22, the following members are defined:
>  		void (*update_time)(struct inode *, struct timespec *, int);
>  		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
>  				   unsigned open_flag, umode_t create_mode);
> +		int (*atomic_open_async)(struct inode *, struct dentry *, struct file *,
> +				   unsigned open_flag, umode_t create_mode, struct dirop_ret *);
>  		int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
>  		struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
>  	        int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
> @@ -524,6 +535,7 @@ Again, all methods are called without any locks being held, unless
>  otherwise noted.
>  
>  ``create``
> +``create_async``
>  	called by the open(2) and creat(2) system calls.  Only required
>  	if you want to support regular files.  The dentry you get should
>  	not have an inode (i.e. it should be a negative dentry).  Here
> @@ -546,29 +558,39 @@ otherwise noted.
>  	directory inode semaphore held
>  
>  ``link``
> +``link_async``
>  	called by the link(2) system call.  Only required if you want to
>  	support hard links.  You will probably need to call
>  	d_instantiate() just as you would in the create() method
>  
>  ``unlink``
> +``unlink_async``
>  	called by the unlink(2) system call.  Only required if you want
>  	to support deleting inodes
>  
>  ``symlink``
> +``symlink_async``
>  	called by the symlink(2) system call.  Only required if you want
>  	to support symlinks.  You will probably need to call
>  	d_instantiate() just as you would in the create() method
>  
>  ``mkdir``
> +``mkdir_async``
>  	called by the mkdir(2) system call.  Only required if you want
>  	to support creating subdirectories.  You will probably need to
>  	call d_instantiate() just as you would in the create() method
>  
> +	mkdir_async can return an alternate dentry, much like lookup.
> +	In this case the original dentry will still be negative and will
> +	be unhashed.
> +
>  ``rmdir``
> +``rmdir_async``
>  	called by the rmdir(2) system call.  Only required if you want
>  	to support deleting subdirectories
>  
>  ``mknod``
> +``mknod_async``
>  	called by the mknod(2) system call to create a device (char,
>  	block) inode or a named pipe (FIFO) or socket.  Only required if
>  	you want to support creating these types of inodes.  You will
> @@ -576,6 +598,7 @@ otherwise noted.
>  	create() method
>  
>  ``rename``
> +``rename_async``
>  	called by the rename(2) system call to rename the object to have
>  	the parent and name given by the second inode and dentry.
>  
> @@ -647,6 +670,7 @@ otherwise noted.
>  	itself and call mark_inode_dirty_sync.
>  
>  ``atomic_open``
> +``atomic_open_async``
>  	called on the last component of an open.  Using this optional
>  	method the filesystem can look up, possibly create and open the
>  	file in one atomic operation.  If it wants to leave actual
> diff --git a/fs/namei.c b/fs/namei.c
> index 3c0feca081a2..eadde9de73bf 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -123,6 +123,41 @@
>   * PATH_MAX includes the nul terminator --RR.
>   */
>  
> +static void dirop_done_cb(struct dirop_ret *dret)
> +{
> +	wake_up_var(dret);
> +}
> +
> +#define DO_DIROP(dir, op, ...)						\
> +	({								\
> +		 struct dirop_ret dret;					\
> +		 int ret;						\
> +		 dret.err = -EINPROGRESS;				\
> +		 dret.done_cb = dirop_done_cb;				\
> +		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
> +		 if (ret == -EINPROGRESS) {				\
> +			 wait_var_event(&dret,				\
> +					dret.err != -EINPROGRESS);	\
> +			 ret = dret.err;				\
> +		 }							\
> +		 ret;							\
> +	})
> +
> +#define DO_DE_DIROP(dir, op, ...)					\
> +	({								\
> +		 struct dirop_ret dret;					\
> +		 struct dentry *ret;					\
> +		 dret.dentry = ERR_PTR(-EINPROGRESS);			\
> +		 dret.done_cb = dirop_done_cb;				\
> +		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
> +		 if (ret == ERR_PTR(-EINPROGRESS)) {			\
> +			 wait_var_event(&dret,				\
> +					dret.dentry != ERR_PTR(-EINPROGRESS));	\
> +			 ret = dret.dentry;				\
> +		 }							\
> +		 ret;							\
> +	})

We should also try to avoid these ugly wrappers. That'll be easier if we
don't have multiple methods as well.
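
As a rough illustration of what could replace the statement-expression
macros once there is a single method per operation, the sketch below
uses plain functions. It is a single-threaded userspace model: the
toy_* names are hypothetical, and the loop around toy_run_pending()
merely stands in for wait_var_event() in the quoted macro.

```c
#include <errno.h>
#include <stddef.h>

/* Same shape as the struct in the quoted patch, minus the dentry member. */
struct dirop_ret {
	int err;
	void (*done_cb)(struct dirop_ret *);
};

/* Hypothetical single-slot queue standing in for a filesystem's
 * asynchronous completion path. */
static struct dirop_ret *toy_pending;
static int toy_pending_result;

/* Run the deferred completion, if any: store the result, then notify. */
static void toy_run_pending(void)
{
	struct dirop_ret *dret = toy_pending;

	if (!dret)
		return;
	toy_pending = NULL;
	dret->err = toy_pending_result;
	dret->done_cb(dret);
}

static void toy_done_cb(struct dirop_ret *dret)
{
	(void)dret;	/* the quoted patch calls wake_up_var(dret) here */
}

/* Prepare a dirop_ret the way DO_DIROP() does before calling the op. */
static void toy_dirop_init(struct dirop_ret *dret)
{
	dret->err = -EINPROGRESS;
	dret->done_cb = toy_done_cb;
}

/* Function replacement for the waiting half of DO_DIROP(). */
static int toy_dirop_wait(int ret, struct dirop_ret *dret)
{
	if (ret != -EINPROGRESS)
		return ret;
	/* Model of wait_var_event(dret, dret->err != -EINPROGRESS). */
	while (dret->err == -EINPROGRESS)
		toy_run_pending();
	return dret->err;
}

/* Example op that defers its result; 'result' must not be -EINPROGRESS. */
static int toy_async_unlink(int result, struct dirop_ret *dret)
{
	toy_pending = dret;
	toy_pending_result = result;
	return -EINPROGRESS;
}

/* Example caller: init, call the op, wait; no macro involved. */
static int toy_vfs_unlink(int result)
{
	struct dirop_ret dret;

	toy_dirop_init(&dret);
	return toy_dirop_wait(toy_async_unlink(result, &dret), &dret);
}
```

The mkdir case, which returns a dentry pointer rather than an int, would
still need its own variant or a common return type; this sketch does not
try to settle that.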

> +
>  #define EMBEDDED_NAME_MAX	(PATH_MAX - offsetof(struct filename, iname))
>  
>  struct filename *
> @@ -3403,14 +3438,17 @@ int vfs_create(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	if (!dir->i_op->create)
> +	if (!dir->i_op->create && !dir->i_op->create_async)
>  		return -EACCES;	/* shouldn't it be ENOSYS? */
>  
>  	mode = vfs_prepare_mode(idmap, dir, mode, S_IALLUGO, S_IFREG);
>  	error = security_inode_create(dir, dentry, mode);
>  	if (error)
>  		return error;
> -	error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
> +	if (dir->i_op->create_async)
> +		error = DO_DIROP(dir, create_async, idmap, dir, dentry, mode, want_excl);
> +	else
> +		error = dir->i_op->create(idmap, dir, dentry, mode, want_excl);
>  	if (!error)
>  		fsnotify_create(dir, dentry);
>  	return error;
> @@ -3571,8 +3609,12 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
>  
>  	file->f_path.dentry = DENTRY_NOT_SET;
>  	file->f_path.mnt = nd->path.mnt;
> -	error = dir->i_op->atomic_open(dir, dentry, file,
> -				       open_to_namei_flags(open_flag), mode);
> +	if (dir->i_op->atomic_open_async)
> +		error = DO_DIROP(dir, atomic_open_async, dir, dentry, file,
> +				 open_to_namei_flags(open_flag), mode);
> +	else
> +		error = dir->i_op->atomic_open(dir, dentry, file,
> +					       open_to_namei_flags(open_flag), mode);
>  	d_lookup_done(dentry);
>  	if (!error) {
>  		if (file->f_mode & FMODE_OPENED) {
> @@ -3680,7 +3722,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	}
>  	if (create_error)
>  		open_flag &= ~O_CREAT;
> -	if (dir_inode->i_op->atomic_open) {
> +	if (dir_inode->i_op->atomic_open || dir_inode->i_op->atomic_open_async) {
>  		dentry = atomic_open(nd, dentry, file, open_flag, mode);
>  		if (unlikely(create_error) && dentry == ERR_PTR(-ENOENT))
>  			dentry = ERR_PTR(create_error);
> @@ -3705,13 +3747,16 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
>  	if (!dentry->d_inode && (open_flag & O_CREAT)) {
>  		file->f_mode |= FMODE_CREATED;
>  		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
> -		if (!dir_inode->i_op->create) {
> -			error = -EACCES;
> -			goto out_dput;
> -		}
>  
> -		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
> -						mode, open_flag & O_EXCL);
> +		if (dir_inode->i_op->create_async)
> +			error = DO_DIROP(dir_inode, create_async, idmap, dir_inode,
> +					 dentry, mode,  open_flag & O_EXCL);
> +		else if (dir_inode->i_op->create)
> +			error = dir_inode->i_op->create(idmap, dir_inode,
> +							dentry, mode,
> +							open_flag & O_EXCL);
> +		else
> +			error = -EACCES;
>  		if (error)
>  			goto out_dput;
>  	}
> @@ -4217,7 +4262,7 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>  	    !capable(CAP_MKNOD))
>  		return -EPERM;
>  
> -	if (!dir->i_op->mknod)
> +	if (!dir->i_op->mknod && !dir->i_op->mknod_async)
>  		return -EPERM;
>  
>  	mode = vfs_prepare_mode(idmap, dir, mode, mode, mode);
> @@ -4229,7 +4274,10 @@ int vfs_mknod(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
> +	if (dir->i_op->mknod_async)
> +		error = DO_DIROP(dir, mknod_async, idmap, dir, dentry, mode, dev);
> +	else
> +		error = dir->i_op->mknod(idmap, dir, dentry, mode, dev);
>  	if (!error)
>  		fsnotify_create(dir, dentry);
>  	return error;
> @@ -4340,7 +4388,7 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	if (!dir->i_op->mkdir)
> +	if (!dir->i_op->mkdir && !dir->i_op->mkdir_async)
>  		return -EPERM;
>  
>  	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
> @@ -4351,7 +4399,16 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  	if (max_links && dir->i_nlink >= max_links)
>  		return -EMLINK;
>  
> -	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> +	if (dir->i_op->mkdir_async) {
> +		struct dentry *de;
> +		de = DO_DE_DIROP(dir, mkdir_async, idmap, dir, dentry, mode);
> +		if (IS_ERR(de))
> +			error = PTR_ERR(de);
> +		else if (de)
> +			dput(de);
> +	} else {
> +		error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> +	}
>  	if (!error)
>  		fsnotify_mkdir(dir, dentry);
>  	return error;
> @@ -4399,6 +4456,20 @@ int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
>  	if (max_links && dir->i_nlink >= max_links)
>  		return -EMLINK;
>  
> +	if (dir->i_op->mkdir_async) {
> +		struct dentry *de;
> +
> +		de = DO_DE_DIROP(dir, mkdir_async, idmap, dir, dentry, mode);
> +		if (IS_ERR(de))
> +			return PTR_ERR(de);
> +		if (de) {
> +			dput(dentry);
> +			*dentryp = de;
> +		}
> +		fsnotify_mkdir(dir, dentry);
> +		return 0;
> +	}
> +
>  	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
>  	if (!error) {
>  		fsnotify_mkdir(dir, dentry);
> @@ -4488,7 +4559,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	if (!dir->i_op->rmdir)
> +	if (!dir->i_op->rmdir && !dir->i_op->rmdir_async)
>  		return -EPERM;
>  
>  	dget(dentry);
> @@ -4503,7 +4574,10 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		goto out;
>  
> -	error = dir->i_op->rmdir(dir, dentry);
> +	if (dir->i_op->rmdir_async)
> +		error = DO_DIROP(dir, rmdir_async, dir, dentry);
> +	else
> +		error = dir->i_op->rmdir(dir, dentry);
>  	if (error)
>  		goto out;
>  
> @@ -4613,7 +4687,7 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	if (!dir->i_op->unlink)
> +	if (!dir->i_op->unlink && !dir->i_op->unlink_async)
>  		return -EPERM;
>  
>  	inode_lock(target);
> @@ -4627,7 +4701,10 @@ int vfs_unlink(struct mnt_idmap *idmap, struct inode *dir,
>  			error = try_break_deleg(target, delegated_inode);
>  			if (error)
>  				goto out;
> -			error = dir->i_op->unlink(dir, dentry);
> +			if (dir->i_op->unlink_async)
> +				error = DO_DIROP(dir, unlink_async, dir, dentry);
> +			else
> +				error = dir->i_op->unlink(dir, dentry);
>  			if (!error) {
>  				dont_mount(dentry);
>  				detach_mounts(dentry);
> @@ -4761,14 +4838,17 @@ int vfs_symlink(struct mnt_idmap *idmap, struct inode *dir,
>  	if (error)
>  		return error;
>  
> -	if (!dir->i_op->symlink)
> +	if (!dir->i_op->symlink && !dir->i_op->symlink_async)
>  		return -EPERM;
>  
>  	error = security_inode_symlink(dir, dentry, oldname);
>  	if (error)
>  		return error;
>  
> -	error = dir->i_op->symlink(idmap, dir, dentry, oldname);
> +	if (dir->i_op->symlink_async)
> +		error = DO_DIROP(dir, symlink_async, idmap, dir, dentry, oldname);
> +	else
> +		error = dir->i_op->symlink(idmap, dir, dentry, oldname);
>  	if (!error)
>  		fsnotify_create(dir, dentry);
>  	return error;
> @@ -4874,7 +4954,7 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
>  	 */
>  	if (HAS_UNMAPPED_ID(idmap, inode))
>  		return -EPERM;
> -	if (!dir->i_op->link)
> +	if (!dir->i_op->link && !dir->i_op->link_async)
>  		return -EPERM;
>  	if (S_ISDIR(inode->i_mode))
>  		return -EPERM;
> @@ -4891,7 +4971,11 @@ int vfs_link(struct dentry *old_dentry, struct mnt_idmap *idmap,
>  		error = -EMLINK;
>  	else {
>  		error = try_break_deleg(inode, delegated_inode);
> -		if (!error)
> +		if (error)
> +			;
> +		else if (dir->i_op->link_async)
> +			error = DO_DIROP(dir, link_async, old_dentry, dir, new_dentry);
> +		else
>  			error = dir->i_op->link(old_dentry, dir, new_dentry);
>  	}
>  
> @@ -5083,7 +5167,7 @@ int vfs_rename(struct renamedata *rd)
>  	if (error)
>  		return error;
>  
> -	if (!old_dir->i_op->rename)
> +	if (!old_dir->i_op->rename && !old_dir->i_op->rename_async)
>  		return -EPERM;
>  
>  	/*
> @@ -5166,8 +5250,14 @@ int vfs_rename(struct renamedata *rd)
>  		if (error)
>  			goto out;
>  	}
> -	error = old_dir->i_op->rename(rd->new_mnt_idmap, old_dir, old_dentry,
> -				      new_dir, new_dentry, flags);
> +	if (old_dir->i_op->rename_async)
> +		error = DO_DIROP(old_dir, rename_async, rd->new_mnt_idmap,
> +				 old_dir, old_dentry,
> +				 new_dir, new_dentry, flags);
> +	else
> +		error = old_dir->i_op->rename(rd->new_mnt_idmap,
> +					      old_dir, old_dentry,
> +					      new_dir, new_dentry, flags);
>  	if (error)
>  		goto out;
>  
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f81d6bc65fe4..e414400c2487 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2187,6 +2187,14 @@ int wrap_directory_iterator(struct file *, struct dir_context *,
>  	static int shared_##x(struct file *file , struct dir_context *ctx) \
>  	{ return wrap_directory_iterator(file, ctx, x); }
>  
> +struct dirop_ret {
> +	union {
> +		int err;
> +		struct dentry *dentry;
> +	};
> +	void (*done_cb)(struct dirop_ret*);
> +};
> +
>  struct inode_operations {
>  	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
>  	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
> @@ -2197,17 +2205,30 @@ struct inode_operations {
>  
>  	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,
>  		       umode_t, bool);
> +	int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,
> +		       umode_t, bool, struct dirop_ret *);
>  	int (*link) (struct dentry *,struct inode *,struct dentry *);
> +	int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*unlink) (struct inode *,struct dentry *);
> +	int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,
>  			const char *);
> +	int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,
> +			const char *, struct dirop_ret *);
>  	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,
>  		      umode_t);
> +	struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,
> +		      umode_t, struct dirop_ret *);
>  	int (*rmdir) (struct inode *,struct dentry *);
> +	int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
>  	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,
>  		      umode_t,dev_t);
> +	int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,
> +		      umode_t,dev_t, struct dirop_ret *);
>  	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
>  			struct inode *, struct dentry *, unsigned int);
> +	int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> +			struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
>  	int (*setattr) (struct mnt_idmap *, struct dentry *, struct iattr *);
>  	int (*getattr) (struct mnt_idmap *, const struct path *,
>  			struct kstat *, u32, unsigned int);
> @@ -2218,6 +2239,9 @@ struct inode_operations {
>  	int (*atomic_open)(struct inode *, struct dentry *,
>  			   struct file *, unsigned open_flag,
>  			   umode_t create_mode);
> +	int (*atomic_open_async)(struct inode *, struct dentry *,
> +			   struct file *, unsigned open_flag,
> +			   umode_t create_mode, struct dirop_ret *);
>  	int (*tmpfile) (struct mnt_idmap *, struct inode *,
>  			struct file *, umode_t);
>  	struct posix_acl *(*get_acl)(struct mnt_idmap *, struct dentry *,
> -- 
> 2.47.1
> 

* Re: [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops
  2025-02-06  5:42 ` [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops NeilBrown
@ 2025-02-06 13:22   ` Christian Brauner
  2025-02-07  2:01     ` NeilBrown
  0 siblings, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 13:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:47PM +1100, NeilBrown wrote:
> If a filesystem supports _async ops for some directory ops we can take a
> "shared" lock on i_rwsem otherwise we must take an "exclusive" lock.  As
> the filesystem may support some async ops but not others we need to
> easily determine which.
> 
> With this patch we group the ops into 4 groups that are likely to be
> supported together:
> 
> CREATE: create, link, symlink, mkdir, mknod
> REMOVE: rmdir, unlink
> RENAME: rename
> OPEN: atomic_open, create
> 
> and set S_ASYNC_XXX for each when the inode is initialised.
> 
> We also add a LOOKUP_REMOVE intent flag which will be used by locking
> interfaces to help know which group is being used.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/dcache.c           | 24 ++++++++++++++++++++++++
>  include/linux/fs.h    |  5 +++++
>  include/linux/namei.h |  5 +++--
>  3 files changed, 32 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index e49607d00d2d..37c0f655166d 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -384,6 +384,27 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
>  	smp_store_release(&dentry->d_flags, flags);
>  }
>  
> +static void set_inode_flags(struct inode *inode)
> +{
> +	const struct inode_operations *i_op = inode->i_op;
> +
> +	lockdep_assert_held(&inode->i_lock);
> +	if ((i_op->create_async || !i_op->create) &&
> +	    (i_op->link_async || !i_op->link) &&
> +	    (i_op->symlink_async || !i_op->symlink) &&
> +	    (i_op->mkdir_async || !i_op->mkdir) &&
> +	    (i_op->mknod_async || !i_op->mknod))
> +		inode->i_flags |= S_ASYNC_CREATE;
> +	if ((i_op->unlink_async || !i_op->unlink) &&
> +	    (i_op->rmdir_async || !i_op->rmdir))
> +		inode->i_flags |= S_ASYNC_REMOVE;
> +	if (i_op->rename_async)
> +		inode->i_flags |= S_ASYNC_RENAME;
> +	if (i_op->atomic_open_async ||
> +	    (!i_op->atomic_open && i_op->create_async))
> +		inode->i_flags |= S_ASYNC_OPEN;
> +}

I think this is unpleasant. As I said we should fold _async into the
normal methods. Then we can add:

diff --git a/include/linux/fs.h b/include/linux/fs.h
index be3ad155ec9f..1d19f72448fc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2186,6 +2186,7 @@ int wrap_directory_iterator(struct file *, struct dir_context *,
        { return wrap_directory_iterator(file, ctx, x); }

 struct inode_operations {
+       iop_flags_t iop_flags;
        struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
        const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
        int (*permission) (struct mnt_idmap *, struct inode *, int);

which is similar to what I did for

struct file_operations {
        struct module *owner;
        fop_flags_t fop_flags;

and introduce

IOP_ASYNC_CREATE
IOP_ASYNC_OPEN

etc and then filesystems can just do:

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index df9669d4ded7..90c7aeb49466 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -10859,6 +10859,7 @@ static void nfs4_disable_swap(struct inode *inode)
 }

 static const struct inode_operations nfs4_dir_inode_operations = {
+       .iop_flags      = IOP_ASYNC_CREATE | IOP_ASYNC_OPEN,
        .create         = nfs_create,
        .lookup         = nfs_lookup,
        .atomic_open    = nfs_atomic_open,

and then you can raise S_ASYNC_OPEN and so on based on the flags, not
the individual methods.
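
A minimal userspace sketch of that last step, deriving the inode flags
from a flags word in the ops table rather than from which method
pointers are set. The S_ASYNC_* values match the quoted patch; the
IOP_ASYNC_* values, IOP_ASYNC_REMOVE, IOP_ASYNC_RENAME and all toy_*
names are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical per-ops capability bits; the values are made up here. */
typedef uint32_t iop_flags_t;
#define IOP_ASYNC_CREATE	(1u << 0)
#define IOP_ASYNC_REMOVE	(1u << 1)	/* assumed by analogy */
#define IOP_ASYNC_RENAME	(1u << 2)	/* assumed by analogy */
#define IOP_ASYNC_OPEN		(1u << 3)

/* Inode flag bits as in the quoted patch: BIT(18) to BIT(21). */
#define S_ASYNC_CREATE	(1u << 18)
#define S_ASYNC_REMOVE	(1u << 19)
#define S_ASYNC_RENAME	(1u << 20)
#define S_ASYNC_OPEN	(1u << 21)

/* Stand-ins for struct inode_operations and struct inode. */
struct toy_inode_operations {
	iop_flags_t iop_flags;
};

struct toy_inode {
	unsigned int i_flags;
	const struct toy_inode_operations *i_op;
};

/* Raise S_ASYNC_* from the declared flags; no method pointers inspected. */
static void toy_set_inode_flags(struct toy_inode *inode)
{
	iop_flags_t f = inode->i_op->iop_flags;

	if (f & IOP_ASYNC_CREATE)
		inode->i_flags |= S_ASYNC_CREATE;
	if (f & IOP_ASYNC_REMOVE)
		inode->i_flags |= S_ASYNC_REMOVE;
	if (f & IOP_ASYNC_RENAME)
		inode->i_flags |= S_ASYNC_RENAME;
	if (f & IOP_ASYNC_OPEN)
		inode->i_flags |= S_ASYNC_OPEN;
}
```

With this shape the filesystem states its capabilities once, in the ops
table, and the group logic no longer has to be inferred from which
individual methods happen to be present.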

> +
>  static inline void __d_clear_type_and_inode(struct dentry *dentry)
>  {
>  	unsigned flags = READ_ONCE(dentry->d_flags);
> @@ -1893,6 +1914,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
>  	raw_write_seqcount_begin(&dentry->d_seq);
>  	__d_set_inode_and_type(dentry, inode, add_flags);
>  	raw_write_seqcount_end(&dentry->d_seq);
> +	set_inode_flags(inode);
>  	fsnotify_update_flags(dentry);
>  	spin_unlock(&dentry->d_lock);
>  }
> @@ -1999,6 +2021,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
>  
>  		spin_lock(&new->d_lock);
>  		__d_set_inode_and_type(new, inode, add_flags);
> +		set_inode_flags(inode);
>  		hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
>  		if (!disconnected) {
>  			hlist_bl_lock(&sb->s_roots);
> @@ -2701,6 +2724,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
>  		raw_write_seqcount_begin(&dentry->d_seq);
>  		__d_set_inode_and_type(dentry, inode, add_flags);
>  		raw_write_seqcount_end(&dentry->d_seq);
> +		set_inode_flags(inode);
>  		fsnotify_update_flags(dentry);
>  	}
>  	__d_rehash(dentry);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e414400c2487..9a9282fef347 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2361,6 +2361,11 @@ struct super_operations {
>  #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
>  #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
>  
> +#define S_ASYNC_CREATE	BIT(18)	/* create, link, symlink, mkdir, mknod all _async */
> +#define S_ASYNC_REMOVE	BIT(19)	/* unlink, rmdir both _async */
> +#define S_ASYNC_RENAME	BIT(20) /* rename_async supported */
> +#define S_ASYNC_OPEN	BIT(21) /* atomic_open_async or create_async supported */
> +
>  /*
>   * Note that nosuid etc flags are inode-specific: setting some file-system
>   * flags just means all the inodes inherit those flags by default. It might be
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 76c587a5ec3a..72e351640406 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -40,10 +40,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  #define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
>  #define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
>  #define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
> +#define LOOKUP_REMOVE		BIT(20)	/* ... in target of object removal */
>  
>  #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> -				 LOOKUP_RENAME_TARGET)
> -/* 4 spare bits for intent */
> +				 LOOKUP_RENAME_TARGET | LOOKUP_REMOVE)
> +/* 3 spare bits for intent */
>  
>  /* Scoping flags for lookup. */
>  #define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
> -- 
> 2.47.1
> 

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-06  5:42 ` [PATCH 08/19] VFS: introduce lookup_and_lock() and friends NeilBrown
@ 2025-02-06 13:49   ` Christian Brauner
  2025-02-07  1:28     ` NeilBrown
  2025-02-07 20:22   ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 13:49 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> lookup_and_lock() combines locking the directory and performing a lookup
> prior to a change to the directory.
> Abstracting this prepares for changing the locking requirements.
> 
> done_lookup_and_lock() provides the inverse of putting the dentry and
> unlocking.
> 
> For "silly_rename" we will need to lookup_and_lock() in a directory that
> is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
> 
> Like lookup_one_qstr(), lookup_and_lock() returns -ENOENT if
> LOOKUP_CREATE was NOT given and the name cannot be found, and returns
> -EEXIST if LOOKUP_EXCL WAS given and the name CAN be found.
> 
> These functions replace all uses of lookup_one_qstr() in namei.c
> except for those used for rename.
> 
> The name might seem backwards as the lock happens before the lookup.
> A future patch will change this so that only a shared lock is taken
> before the lookup, and an exclusive lock on the dentry is taken after a
> successful lookup.  So the order "lookup" then "lock" will make sense.
> 
> This functionality is exported as lookup_and_lock_one() which takes a
> name and len rather than a qstr.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/namei.c            | 102 ++++++++++++++++++++++++++++--------------
>  include/linux/namei.h |  15 ++++++-
>  2 files changed, 83 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 69610047f6c6..3c0feca081a2 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1715,6 +1715,41 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
>  }
>  EXPORT_SYMBOL(lookup_one_qstr);
>  
> +static struct dentry *lookup_and_lock_nested(const struct qstr *last,
> +					     struct dentry *base,
> +					     unsigned int lookup_flags,
> +					     unsigned int subclass)
> +{
> +	struct dentry *dentry;
> +
> +	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
> +		inode_lock_nested(base->d_inode, subclass);
> +
> +	dentry = lookup_one_qstr(last, base, lookup_flags);
> +	if (IS_ERR(dentry) && !(lookup_flags & LOOKUP_PARENT_LOCKED)) {
> +			inode_unlock(base->d_inode);

Nit: The indentation here is wrong and the {} aren't common practice.

> +	}
> +	return dentry;
> +}
> +
> +static struct dentry *lookup_and_lock(const struct qstr *last,
> +				      struct dentry *base,
> +				      unsigned int lookup_flags)
> +{
> +	return lookup_and_lock_nested(last, base, lookup_flags,
> +				      I_MUTEX_PARENT);
> +}
> +
> +void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
> +			  unsigned int lookup_flags)

Did you mean done_lookup_and_unlock()?
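
For reference, a minimal userspace sketch of the pairing the commit message describes: the "lookup and lock" side takes the directory lock unless the caller says it already holds it, and the "done" side undoes exactly that. Every name here (the sk_ prefix, SK_PARENT_LOCKED, the lock_depth counter) is a hypothetical stand-in and not a kernel API; the real code uses inode_lock_nested(), lookup_one_qstr() and dput().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not kernel interfaces. */
#define SK_PARENT_LOCKED 0x1u	/* caller already holds the directory lock */

struct sk_dir {
	int lock_depth;		/* models an exclusive lock: 0 = free, 1 = held */
	bool has_entry;		/* whether the single modelled name exists */
};

struct sk_entry {
	struct sk_dir *parent;
};

static struct sk_entry sk_the_entry;

/*
 * Take the lock (unless the caller holds it), then look up.
 * On failure the lock is dropped again here, so the caller never
 * has to unlock after an error.
 */
static struct sk_entry *sk_lookup_and_lock(struct sk_dir *dir,
					   unsigned int flags)
{
	if (!(flags & SK_PARENT_LOCKED)) {
		assert(dir->lock_depth == 0);
		dir->lock_depth++;
	}
	if (!dir->has_entry) {
		if (!(flags & SK_PARENT_LOCKED))
			dir->lock_depth--;
		return NULL;
	}
	sk_the_entry.parent = dir;
	return &sk_the_entry;
}

/* Inverse: release the entry, then unlock only if we took the lock. */
static void sk_done_lookup_and_lock(struct sk_dir *dir, struct sk_entry *e,
				    unsigned int flags)
{
	e->parent = NULL;
	if (!(flags & SK_PARENT_LOCKED))
		dir->lock_depth--;
}
```

The same flags value has to be passed to both halves, which is why the "done" helper in the patch takes lookup_flags as well.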

> +{
> +	d_lookup_done(dentry);
> +	dput(dentry);
> +	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
> +		inode_unlock(base->d_inode);
> +}
> +EXPORT_SYMBOL(done_lookup_and_lock);
> +
>  /**
>   * lookup_fast - do fast lockless (but racy) lookup of a dentry
>   * @nd: current nameidata
> @@ -2754,12 +2789,9 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
>  		path_put(path);
>  		return ERR_PTR(-EINVAL);
>  	}
> -	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> -	d = lookup_one_qstr(&last, path->dentry, 0);
> -	if (IS_ERR(d)) {
> -		inode_unlock(path->dentry->d_inode);
> +	d = lookup_and_lock(&last, path->dentry, 0);
> +	if (IS_ERR(d))
>  		path_put(path);
> -	}
>  	return d;
>  }
>  
> @@ -3053,6 +3085,22 @@ struct dentry *lookup_positive_unlocked(const char *name,
>  }
>  EXPORT_SYMBOL(lookup_positive_unlocked);
>  
> +struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
> +				   const char *name, int len, struct dentry *base,
> +				   unsigned int lookup_flags)
> +{
> +	struct qstr this;
> +	int err;
> +
> +	if (!idmap)
> +		idmap = &nop_mnt_idmap;

The callers should pass nop_mnt_idmap. That's how every function that
takes this argument works. This is a lot more explicit than magically
fixing this up in the function.

> +	err = lookup_one_common(idmap, name, base, len, &this);
> +	if (err)
> +		return ERR_PTR(err);
> +	return lookup_and_lock(&this, base, lookup_flags);
> +}
> +EXPORT_SYMBOL(lookup_and_lock_one);
> +
>  #ifdef CONFIG_UNIX98_PTYS
>  int path_pts(struct path *path)
>  {
> @@ -4071,7 +4119,6 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
>  	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
>  	int type;
> -	int err2;
>  	int error;
>  
>  	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
> @@ -4083,36 +4130,30 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  	 * (foo/., foo/.., /////)
>  	 */
>  	if (unlikely(type != LAST_NORM))
> -		goto out;
> +		goto put;
>  
>  	/* don't fail immediately if it's r/o, at least try to report other errors */
> -	err2 = mnt_want_write(path->mnt);
> +	error = mnt_want_write(path->mnt);
>  	/*
>  	 * Do the final lookup.  Suppress 'create' if there is a trailing
>  	 * '/', and a directory wasn't requested.
>  	 */
>  	if (last.name[last.len] && !want_dir)
>  		create_flags &= ~LOOKUP_CREATE;
> -	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr(&last, path->dentry,
> -				 reval_flag | create_flags);
> +	dentry = lookup_and_lock(&last, path->dentry, reval_flag | create_flags);
>  	if (IS_ERR(dentry))
> -		goto unlock;
> +		goto drop;
>  
> -	if (unlikely(err2)) {
> -		error = err2;
> +	if (unlikely(error))
>  		goto fail;
> -	}
>  	return dentry;
>  fail:
> -	d_lookup_done(dentry);
> -	dput(dentry);
> +	done_lookup_and_lock(path->dentry, dentry, reval_flag | create_flags);
>  	dentry = ERR_PTR(error);
> -unlock:
> -	inode_unlock(path->dentry->d_inode);
> -	if (!err2)
> +drop:
> +	if (!error)
>  		mnt_drop_write(path->mnt);
> -out:
> +put:
>  	path_put(path);
>  	return dentry;
>  }
> @@ -4130,14 +4171,13 @@ EXPORT_SYMBOL(kern_path_create);
>  
>  void done_path_create(struct path *path, struct dentry *dentry)
>  {
> -	dput(dentry);
> -	inode_unlock(path->dentry->d_inode);
> +	done_lookup_and_lock(path->dentry, dentry, LOOKUP_CREATE);
>  	mnt_drop_write(path->mnt);
>  	path_put(path);
>  }
>  EXPORT_SYMBOL(done_path_create);
>  
> -inline struct dentry *user_path_create(int dfd, const char __user *pathname,
> +struct dentry *user_path_create(int dfd, const char __user *pathname,
>  				struct path *path, unsigned int lookup_flags)
>  {
>  	struct filename *filename = getname(pathname);
> @@ -4510,19 +4550,18 @@ int do_rmdir(int dfd, struct filename *name)
>  	if (error)
>  		goto exit2;
>  
> -	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
> +	dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
>  	error = PTR_ERR(dentry);
>  	if (IS_ERR(dentry))
>  		goto exit3;
> +
>  	error = security_path_rmdir(&path, dentry);
>  	if (error)
>  		goto exit4;
>  	error = vfs_rmdir(mnt_idmap(path.mnt), path.dentry->d_inode, dentry);
>  exit4:
> -	dput(dentry);
> +	done_lookup_and_lock(path.dentry, dentry, lookup_flags);
>  exit3:
> -	inode_unlock(path.dentry->d_inode);
>  	mnt_drop_write(path.mnt);
>  exit2:
>  	path_put(&path);
> @@ -4639,11 +4678,9 @@ int do_unlinkat(int dfd, struct filename *name)
>  	if (error)
>  		goto exit2;
>  retry_deleg:
> -	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
> +	dentry = lookup_and_lock(&last, path.dentry, lookup_flags);
>  	error = PTR_ERR(dentry);
>  	if (!IS_ERR(dentry)) {
> -
>  		/* Why not before? Because we want correct error value */
>  		if (last.name[last.len])
>  			goto slashes;
> @@ -4655,9 +4692,8 @@ int do_unlinkat(int dfd, struct filename *name)
>  		error = vfs_unlink(mnt_idmap(path.mnt), path.dentry->d_inode,
>  				   dentry, &delegated_inode);
>  exit3:
> -		dput(dentry);
> +		done_lookup_and_lock(path.dentry, dentry, lookup_flags);
>  	}
> -	inode_unlock(path.dentry->d_inode);
>  	if (inode)
>  		iput(inode);	/* truncate the inode here */
>  	inode = NULL;
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 0d81e571a159..76c587a5ec3a 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -29,7 +29,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  #define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
>  #define LOOKUP_CACHED		BIT(9) /* Only do cached lookup */
>  #define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
> -/* 5 spare bits for pathwalk */
> +#define LOOKUP_PARENT_LOCKED	BIT(11)	/* filesystem sets this for nested
> +					 * "lookup_and_lock_one" when it knows
> +					 * parent is sufficiently locked.
> +					 */
> +/* 4 spare bits for pathwalk */
>  
>  /* These tell filesystem methods that we are dealing with the final component... */
>  #define LOOKUP_OPEN		BIT(16)	/* ... in open */
> @@ -82,6 +86,15 @@ struct dentry *lookup_one_unlocked(struct mnt_idmap *idmap,
>  struct dentry *lookup_one_positive_unlocked(struct mnt_idmap *idmap,
>  					    const char *name,
>  					    struct dentry *base, int len);
> +struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
> +				   const char *name, int len, struct dentry *base,
> +				   unsigned int lookup_flags);
> +struct dentry *__lookup_and_lock_one(struct mnt_idmap *idmap,
> +				     const char *name, int len, struct dentry *base,
> +				     unsigned int lookup_flags);
> +void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
> +			  unsigned int lookup_flags);
> +void __done_lookup_and_lock(struct dentry *dentry);
>  
>  extern int follow_down_one(struct path *);
>  extern int follow_down(struct path *path, unsigned int flags);
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
  2025-02-06 12:24   ` Christian Brauner
@ 2025-02-06 13:52   ` Jeff Layton
  2025-02-06 23:57     ` NeilBrown
  2025-02-07 19:45   ` Al Viro
  2 siblings, 1 reply; 83+ messages in thread
From: Jeff Layton @ 2025-02-06 13:52 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro, Christian Brauner, Jan Kara,
	Linus Torvalds, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

On Thu, 2025-02-06 at 16:42 +1100, NeilBrown wrote:
> vfs_mkdir() does not guarantee to make the child dentry positive on
> success.  It may leave it negative and then the caller needs to perform a
> lookup to find the target dentry.
> 
> This patch introduces vfs_mkdir_return() which performs the lookup if
> needed so that this code is centralised.
> 
> This prepares for a new inode operation which will perform mkdir and
> return the correct dentry.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/cachefiles/namei.c    |  7 +---
>  fs/namei.c               | 69 ++++++++++++++++++++++++++++++++++++++++
>  fs/nfsd/vfs.c            | 21 ++----------
>  fs/overlayfs/dir.c       | 33 +------------------
>  fs/overlayfs/overlayfs.h | 10 +++---
>  fs/overlayfs/super.c     |  2 +-
>  fs/smb/server/vfs.c      | 24 +++-----------
>  include/linux/fs.h       |  2 ++
>  8 files changed, 86 insertions(+), 82 deletions(-)
> 
> diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> index 7cf59713f0f7..3c866c3b9534 100644
> --- a/fs/cachefiles/namei.c
> +++ b/fs/cachefiles/namei.c
> @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  	/* search the current directory for the element name */
>  	inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
>  
> -retry:
>  	ret = cachefiles_inject_read_error();
>  	if (ret == 0)
>  		subdir = lookup_one_len(dirname, dir, strlen(dirname));
> @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  			goto mkdir_error;
>  		ret = cachefiles_inject_write_error();
>  		if (ret == 0)
> -			ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
> +			ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
>  		if (ret < 0) {
>  			trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
>  						   cachefiles_trace_mkdir_error);
> @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
>  		}
>  		trace_cachefiles_mkdir(dir, subdir);
>  
> -		if (unlikely(d_unhashed(subdir))) {
> -			cachefiles_put_directory(subdir);
> -			goto retry;
> -		}
>  		ASSERT(d_backing_inode(subdir));
>  
>  		_debug("mkdir -> %pd{ino=%lu}",
> diff --git a/fs/namei.c b/fs/namei.c
> index 3ab9440c5b93..d98caf36e867 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
>  }
>  EXPORT_SYMBOL(vfs_mkdir);
>  
> +/**
> + * vfs_mkdir_return - create directory returning correct dentry
> + * @idmap:	idmap of the mount the inode was found from
> + * @dir:	inode of the parent directory
> + * @dentryp:	pointer to dentry of the child directory
> + * @mode:	mode of the child directory
> + *
> + * Create a directory.
> + *
> + * If the inode has been found through an idmapped mount the idmap of
> + * the vfsmount must be passed through @idmap. This function will then take
> + * care to map the inode according to @idmap before checking permissions.
> + * On non-idmapped mounts or if permission checking is to be performed on the
> + * raw inode simply pass @nop_mnt_idmap.
> + *
> + * The filesystem may not use the dentry that was passed in.  In that case
> + * the passed-in dentry is put and a new one is placed in *@dentryp;

This sounds like the filesystem is not allowed to use the dentry that
we're passing it. Maybe something like this:

"In the event that the filesystem doesn't use *@dentryp, the dentry is
put and a new one is placed in *@dentryp;"
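
To make the contract concrete, here is a small userspace sketch of an in/out dentry pointer, assuming a simple refcounted struct. The sk_ names and the fs_used_dentry parameter are hypothetical and only model the behaviour described in the kerneldoc; they are not the kernel's dentry API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical model of a refcounted dentry; not the kernel structure. */
struct sk_dentry {
	int refcount;
	int positive;	/* non-zero once an inode is attached */
};

static struct sk_dentry *sk_dentry_new(int positive)
{
	struct sk_dentry *d = calloc(1, sizeof(*d));

	if (d) {
		d->refcount = 1;
		d->positive = positive;
	}
	return d;
}

static void sk_dput(struct sk_dentry *d)
{
	if (d && --d->refcount == 0)
		free(d);
}

/*
 * Models a mkdir where the filesystem may or may not use the dentry
 * it was given.  @fs_used_dentry is non-zero if it instantiated *dp.
 * On success *dp is positive; if it was replaced, the caller's
 * reference to the old dentry has been dropped.
 */
static int sk_mkdir_return(struct sk_dentry **dp, int fs_used_dentry)
{
	struct sk_dentry *d;

	if (fs_used_dentry) {
		(*dp)->positive = 1;
		return 0;
	}
	d = sk_dentry_new(1);	/* stands in for the follow-up lookup */
	if (!d)
		return -ENOMEM;	/* *dp is left untouched on failure */
	sk_dput(*dp);
	*dp = d;
	return 0;
}
```

A caller that cached the old pointer elsewhere has to compare before and after the call, which is what the nfsd and ksmbd hunks below do.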


> + * So on successful return *@dentryp will always be positive.
> + */
> +int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
> +		     struct dentry **dentryp, umode_t mode)
> +{
> +	struct dentry *dentry = *dentryp;
> +	int error;
> +	unsigned max_links = dir->i_sb->s_max_links;
> +
> +	error = may_create(idmap, dir, dentry);
> +	if (error)
> +		return error;
> +
> +	if (!dir->i_op->mkdir)
> +		return -EPERM;
> +
> +	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
> +	error = security_inode_mkdir(dir, dentry, mode);
> +	if (error)
> +		return error;
> +
> +	if (max_links && dir->i_nlink >= max_links)
> +		return -EMLINK;
> +
> +	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> +	if (!error) {
> +		fsnotify_mkdir(dir, dentry);
> +		if (unlikely(d_unhashed(dentry))) {
> +			struct dentry *d;
> +			/* Need a "const" pointer.  We know d_name is const
> +			 * because we hold an exclusive lock on i_rwsem
> +			 * in d_parent.
> +			 */
> +			const struct qstr *d_name = (void*)&dentry->d_name;
> +			d = lookup_dcache(d_name, dentry->d_parent, 0);
> +			if (!d)
> +				d = __lookup_slow(d_name, dentry->d_parent, 0);
> +			if (IS_ERR(d)) {
> +				error = PTR_ERR(d);
> +			} else if (unlikely(d_is_negative(d))) {
> +				dput(d);
> +				error = -ENOENT;
> +			} else {
> +				dput(dentry);
> +				*dentryp = d;
> +			}
> +		}
> +	}
> +	return error;
> +}
> +EXPORT_SYMBOL(vfs_mkdir_return);
> +
>  int do_mkdirat(int dfd, struct filename *name, umode_t mode)
>  {
>  	struct dentry *dentry;
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 29cb7b812d71..740332413138 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1488,26 +1488,11 @@ nfsd_create_locked(struct svc_rqst *rqstp, struct svc_fh *fhp,
>  			nfsd_check_ignore_resizing(iap);
>  		break;
>  	case S_IFDIR:
> -		host_err = vfs_mkdir(&nop_mnt_idmap, dirp, dchild, iap->ia_mode);
> -		if (!host_err && unlikely(d_unhashed(dchild))) {
> -			struct dentry *d;
> -			d = lookup_one_len(dchild->d_name.name,
> -					   dchild->d_parent,
> -					   dchild->d_name.len);
> -			if (IS_ERR(d)) {
> -				host_err = PTR_ERR(d);
> -				break;
> -			}
> -			if (unlikely(d_is_negative(d))) {
> -				dput(d);
> -				err = nfserr_serverfault;
> -				goto out;
> -			}
> +		host_err = vfs_mkdir_return(&nop_mnt_idmap, dirp, &dchild, iap->ia_mode);
> +		if (!host_err && unlikely(dchild != resfhp->fh_dentry)) {
>  			dput(resfhp->fh_dentry);
> -			resfhp->fh_dentry = dget(d);
> +			resfhp->fh_dentry = dget(dchild);
>  			err = fh_update(resfhp);
> -			dput(dchild);
> -			dchild = d;
>  			if (err)
>  				goto out;
>  		}
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index c9993ff66fc2..e6c54c6ef0f5 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -138,37 +138,6 @@ int ovl_cleanup_and_whiteout(struct ovl_fs *ofs, struct inode *dir,
>  	goto out;
>  }
>  
> -int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
> -		   struct dentry **newdentry, umode_t mode)
> -{
> -	int err;
> -	struct dentry *d, *dentry = *newdentry;
> -
> -	err = ovl_do_mkdir(ofs, dir, dentry, mode);
> -	if (err)
> -		return err;
> -
> -	if (likely(!d_unhashed(dentry)))
> -		return 0;
> -
> -	/*
> -	 * vfs_mkdir() may succeed and leave the dentry passed
> -	 * to it unhashed and negative. If that happens, try to
> -	 * lookup a new hashed and positive dentry.
> -	 */
> -	d = ovl_lookup_upper(ofs, dentry->d_name.name, dentry->d_parent,
> -			     dentry->d_name.len);
> -	if (IS_ERR(d)) {
> -		pr_warn("failed lookup after mkdir (%pd2, err=%i).\n",
> -			dentry, err);
> -		return PTR_ERR(d);
> -	}
> -	dput(dentry);
> -	*newdentry = d;
> -
> -	return 0;
> -}
> -
>  struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
>  			       struct dentry *newdentry, struct ovl_cattr *attr)
>  {
> @@ -191,7 +160,7 @@ struct dentry *ovl_create_real(struct ovl_fs *ofs, struct inode *dir,
>  
>  		case S_IFDIR:
>  			/* mkdir is special... */
> -			err =  ovl_mkdir_real(ofs, dir, &newdentry, attr->mode);
> +			err =  ovl_do_mkdir(ofs, dir, &newdentry, attr->mode);
>  			break;
>  
>  		case S_IFCHR:
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 0021e2025020..967870f12482 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -242,11 +242,11 @@ static inline int ovl_do_create(struct ovl_fs *ofs,
>  }
>  
>  static inline int ovl_do_mkdir(struct ovl_fs *ofs,
> -			       struct inode *dir, struct dentry *dentry,
> +			       struct inode *dir, struct dentry **dentry,
>  			       umode_t mode)
>  {
> -	int err = vfs_mkdir(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
> -	pr_debug("mkdir(%pd2, 0%o) = %i\n", dentry, mode, err);
> +	int err = vfs_mkdir_return(ovl_upper_mnt_idmap(ofs), dir, dentry, mode);
> +	pr_debug("mkdir(%pd2, 0%o) = %i\n", *dentry, mode, err);
>  	return err;
>  }
>  
> @@ -838,8 +838,8 @@ struct ovl_cattr {
>  
>  #define OVL_CATTR(m) (&(struct ovl_cattr) { .mode = (m) })
>  
> -int ovl_mkdir_real(struct ovl_fs *ofs, struct inode *dir,
> -		   struct dentry **newdentry, umode_t mode);
> +int ovl_do_mkdir(struct ovl_fs *ofs, struct inode *dir,
> +	      struct dentry **newdentry, umode_t mode);
>  struct dentry *ovl_create_real(struct ovl_fs *ofs,
>  			       struct inode *dir, struct dentry *newdentry,
>  			       struct ovl_cattr *attr);
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index 86ae6f6da36b..06ca8b01c336 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -327,7 +327,7 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs,
>  			goto retry;
>  		}
>  
> -		err = ovl_mkdir_real(ofs, dir, &work, attr.ia_mode);
> +		err = ovl_do_mkdir(ofs, dir, &work, attr.ia_mode);
>  		if (err)
>  			goto out_dput;
>  
> diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
> index 6890016e1923..4e580bb7baf8 100644
> --- a/fs/smb/server/vfs.c
> +++ b/fs/smb/server/vfs.c
> @@ -211,7 +211,7 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
>  {
>  	struct mnt_idmap *idmap;
>  	struct path path;
> -	struct dentry *dentry;
> +	struct dentry *dentry, *d;
>  	int err;
>  
>  	dentry = ksmbd_vfs_kern_path_create(work, name,
> @@ -227,27 +227,11 @@ int ksmbd_vfs_mkdir(struct ksmbd_work *work, const char *name, umode_t mode)
>  
>  	idmap = mnt_idmap(path.mnt);
>  	mode |= S_IFDIR;
> -	err = vfs_mkdir(idmap, d_inode(path.dentry), dentry, mode);
> -	if (!err && d_unhashed(dentry)) {
> -		struct dentry *d;
> -
> -		d = lookup_one(idmap, dentry->d_name.name, dentry->d_parent,
> -			       dentry->d_name.len);
> -		if (IS_ERR(d)) {
> -			err = PTR_ERR(d);
> -			goto out_err;
> -		}
> -		if (unlikely(d_is_negative(d))) {
> -			dput(d);
> -			err = -ENOENT;
> -			goto out_err;
> -		}
> -
> +	d = dentry;
> +	err = vfs_mkdir_return(idmap, d_inode(path.dentry), &dentry, mode);
> +	if (!err && dentry != d)
>  		ksmbd_vfs_inherit_owner(work, d_inode(path.dentry), d_inode(d));
> -		dput(d);
> -	}
>  
> -out_err:
>  	done_path_create(&path, dentry);
>  	if (err)
>  		pr_err("mkdir(%s): creation failed (err:%d)\n", name, err);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index be3ad155ec9f..f81d6bc65fe4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1971,6 +1971,8 @@ int vfs_create(struct mnt_idmap *, struct inode *,
>  	       struct dentry *, umode_t, bool);
>  int vfs_mkdir(struct mnt_idmap *, struct inode *,
>  	      struct dentry *, umode_t);
> +int vfs_mkdir_return(struct mnt_idmap *, struct inode *,
> +		     struct dentry **, umode_t);
>  int vfs_mknod(struct mnt_idmap *, struct inode *, struct dentry *,
>                umode_t, dev_t);
>  int vfs_symlink(struct mnt_idmap *, struct inode *,

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-06  5:42 ` [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed NeilBrown
@ 2025-02-06 14:06   ` Christian Brauner
  2025-02-07  2:17     ` NeilBrown
  2025-02-07 21:06   ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 14:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:51PM +1100, NeilBrown wrote:
> vfs_rmdir takes an exclusive lock on the target directory to ensure
> nothing new is created in it while the rmdir progresses.  With the

It also excludes concurrent mount operations.

> possibility of async updates continuing after the inode lock is dropped
> we now need extra protection.
> 
> Any async updates will have DCACHE_PAR_UPDATE set on the dentry.  We
> simply wait for that flag to be cleared on all children.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/dcache.c |  2 +-
>  fs/namei.c  | 40 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index fb331596f1b1..90dee859d138 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -53,7 +53,7 @@
>   *   - d_lru
>   *   - d_count
>   *   - d_unhashed()
> - *   - d_parent and d_chilren
> + *   - d_parent and d_children
>   *   - childrens' d_sib and d_parent
>   *   - d_u.d_alias, d_inode
>   *
> diff --git a/fs/namei.c b/fs/namei.c
> index 3a107d6098be..e8a85c9f431c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1839,6 +1839,27 @@ bool d_update_lock(struct dentry *dentry,
>  	return true;
>  }
>  
> +static void d_update_wait(struct dentry *dentry, unsigned int subclass)
> +{
> +	/* Note this may only ever be called in a context where we have
> +	 * a lock preventing this dentry from becoming locked, possibly
> +	 * an update lock on the parent dentry.  There must be an smp_mb()
> +	 * after that lock is taken and before this is called so that
> +	 * the following test is safe. d_update_lock() provides that
> +	 * barrier.
> +	 */
> +	if (!(dentry->d_flags & DCACHE_PAR_UPDATE))
> +		return;
> +	lock_acquire_exclusive(&dentry->d_update_map, subclass,
> +			       0, NULL, _THIS_IP_);
> +	spin_lock(&dentry->d_lock);
> +	wait_var_event_spinlock(&dentry->d_flags,
> +				!check_dentry_locked(dentry),
> +				&dentry->d_lock);
> +	spin_unlock(&dentry->d_lock);
> +	lock_map_release(&dentry->d_update_map);
> +}
> +
>  bool d_update_trylock(struct dentry *dentry,
>  		      struct dentry *base,
>  		      const struct qstr *last)
> @@ -4688,6 +4709,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
>  		     struct dentry *dentry)
>  {
>  	int error = may_delete(idmap, dir, dentry, 1);
> +	struct dentry *child;
>  
>  	if (error)
>  		return error;
> @@ -4697,6 +4719,24 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
>  
>  	dget(dentry);
>  	inode_lock(dentry->d_inode);
> +	/*
> +	 * Some children of dentry might be active in an async update.
> +	 * We need to wait for them.  New children cannot be locked
> +	 * while the inode lock is held.
> +	 */
> +again:
> +	spin_lock(&dentry->d_lock);
> +	for (child = d_first_child(dentry); child;
> +	     child = d_next_sibling(child)) {
> +		if (child->d_flags & DCACHE_PAR_UPDATE) {
> +			dget(child);
> +			spin_unlock(&dentry->d_lock);
> +			d_update_wait(child, I_MUTEX_CHILD);
> +			dput(child);
> +			goto again;
> +		}
> +	}
> +	spin_unlock(&dentry->d_lock);

That looks like it can cause stalls when you call rmdir on a directory
that has lots of children and a largish subset of them have pending
async updates, no?
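
A rough way to put a number on the rescanning is the userspace sketch below. It counts how many children the "wait for one, then goto again" loop visits, assuming a flag never becomes set again once cleared (the patch comment says new children cannot be locked while the inode lock is held). It ignores the time spent waiting itself, and sk_rescan_visits() is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Count child visits made by a scan that restarts from the first
 * child every time it had to wait for one pending update.
 */
static unsigned long sk_rescan_visits(bool *pending, size_t n)
{
	unsigned long visits = 0;
	size_t i;

again:
	for (i = 0; i < n; i++) {
		visits++;
		if (pending[i]) {
			pending[i] = false;	/* models the wait completing */
			goto again;
		}
	}
	return visits;
}
```

With n children all pending, this model visits n(n+1)/2 + n children, so the walk under d_lock grows roughly quadratically in the worst case.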

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it.
  2025-02-06  5:42 ` [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it NeilBrown
@ 2025-02-06 14:30   ` Jeff Layton
  2025-02-07  0:04     ` NeilBrown
  2025-02-07 20:01   ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Jeff Layton @ 2025-02-06 14:30 UTC (permalink / raw)
  To: NeilBrown, Alexander Viro, Christian Brauner, Jan Kara,
	Linus Torvalds, Dave Chinner
  Cc: linux-fsdevel, linux-kernel

On Thu, 2025-02-06 at 16:42 +1100, NeilBrown wrote:
> lookup_one_qstr_excl() is used for lookups prior to directory
> modifications, whether create, unlink, rename, or whatever.
> 
> To prepare for allowing modification to happen in parallel, change
> lookup_one_qstr_excl() to use d_alloc_parallel().
> 
> To reflect this, name is changed to lookup_one_qstr() - as the directory
> may be locked shared.
> 
> If any of the "intent" LOOKUP flags are passed, the caller must ensure
> d_lookup_done() is called at an appropriate time.  If none are passed
> then we can be sure ->lookup() will do a real lookup and d_lookup_done()
> is called internally.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/namei.c            | 47 +++++++++++++++++++++++++------------------
>  fs/smb/server/vfs.c   |  7 ++++---
>  include/linux/namei.h |  9 ++++++---
>  3 files changed, 37 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/namei.c b/fs/namei.c
> index 5cdbd2eb4056..d684102d873d 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1665,15 +1665,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
>  }
>  
>  /*
> - * Parent directory has inode locked exclusive.  This is one
> - * and only case when ->lookup() gets called on non in-lookup
> - * dentries - as the matter of fact, this only gets called
> - * when directory is guaranteed to have no in-lookup children
> - * at all.
> + * Parent directory has inode locked: exclusive or shared.
> + * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
> + * must be called after the intended operation is performed - or aborted.
>   */
> -struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> -				    struct dentry *base,
> -				    unsigned int flags)
> +struct dentry *lookup_one_qstr(const struct qstr *name,
> +			       struct dentry *base,
> +			       unsigned int flags)
>  {
>  	struct dentry *dentry = lookup_dcache(name, base, flags);
>  	struct dentry *old;
> @@ -1686,18 +1684,25 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
>  	if (unlikely(IS_DEADDIR(dir)))
>  		return ERR_PTR(-ENOENT);
>  
> -	dentry = d_alloc(base, name);
> -	if (unlikely(!dentry))
> +	dentry = d_alloc_parallel(base, name);
> +	if (unlikely(IS_ERR_OR_NULL(dentry)))
>  		return ERR_PTR(-ENOMEM);
> +	if (!d_in_lookup(dentry))
> +		/* Raced with another thread which did the lookup */
> +		return dentry;
>  
>  	old = dir->i_op->lookup(dir, dentry, flags);
>  	if (unlikely(old)) {
> +		d_lookup_done(dentry);
>  		dput(dentry);
>  		dentry = old;
>  	}
> +	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
> +		/* ->lookup must have given final answer */
> +		d_lookup_done(dentry);

This is kind of an ugly thing for the callers to get right. I think it
would be cleaner to just push the d_lookup_done() into all of the
callers that don't pass any intent flags, and do away with this.
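
Roughly, the difference between the two contracts looks like the sketch below. This is only my reading of the suggestion, modelled in userspace: the sk_ names, the two flag bits and the in_lookup field are hypothetical stand-ins for LOOKUP_INTENT_FLAGS, d_in_lookup() and d_lookup_done().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits standing in for the intent flags. */
#define SK_INTENT_CREATE 0x1u
#define SK_INTENT_EXCL   0x2u
#define SK_INTENT_MASK   (SK_INTENT_CREATE | SK_INTENT_EXCL)

struct sk_dentry {
	bool in_lookup;
};

static void sk_lookup_done(struct sk_dentry *d)
{
	d->in_lookup = false;
}

/*
 * Contract as posted: the lookup finishes the dentry itself when no
 * intent flag is passed, otherwise the caller must do it later.
 */
static void sk_lookup_conditional(struct sk_dentry *d, unsigned int flags)
{
	d->in_lookup = true;
	if ((flags & SK_INTENT_MASK) == 0)
		sk_lookup_done(d);
}

/*
 * Contract as suggested: the lookup never finishes the dentry, so
 * every caller calls sk_lookup_done() regardless of flags.
 */
static void sk_lookup_uniform(struct sk_dentry *d, unsigned int flags)
{
	(void)flags;
	d->in_lookup = true;
}
```

The uniform variant costs an extra call in the no-intent callers, but the caller no longer needs to know which flags change who owns the cleanup.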

>  	return dentry;
>  }
> -EXPORT_SYMBOL(lookup_one_qstr_excl);
> +EXPORT_SYMBOL(lookup_one_qstr);
>  
>  /**
>   * lookup_fast - do fast lockless (but racy) lookup of a dentry
> @@ -2739,7 +2744,7 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
>  		return ERR_PTR(-EINVAL);
>  	}
>  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> -	d = lookup_one_qstr_excl(&last, path->dentry, 0);
> +	d = lookup_one_qstr(&last, path->dentry, 0);
>  	if (IS_ERR(d)) {
>  		inode_unlock(path->dentry->d_inode);
>  		path_put(path);
> @@ -4078,8 +4083,8 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  	if (last.name[last.len] && !want_dir)
>  		create_flags = 0;
>  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr_excl(&last, path->dentry,
> -				      reval_flag | create_flags);
> +	dentry = lookup_one_qstr(&last, path->dentry,
> +				 reval_flag | create_flags);
>  	if (IS_ERR(dentry))
>  		goto unlock;
>  
> @@ -4103,6 +4108,7 @@ static struct dentry *filename_create(int dfd, struct filename *name,
>  	}
>  	return dentry;
>  fail:
> +	d_lookup_done(dentry);
>  	dput(dentry);
>  	dentry = ERR_PTR(error);
>  unlock:
> @@ -4508,7 +4514,7 @@ int do_rmdir(int dfd, struct filename *name)
>  		goto exit2;
>  
>  	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
> +	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
>  	error = PTR_ERR(dentry);
>  	if (IS_ERR(dentry))
>  		goto exit3;
> @@ -4641,7 +4647,7 @@ int do_unlinkat(int dfd, struct filename *name)
>  		goto exit2;
>  retry_deleg:
>  	inode_lock_nested(path.dentry->d_inode, I_MUTEX_PARENT);
> -	dentry = lookup_one_qstr_excl(&last, path.dentry, lookup_flags);
> +	dentry = lookup_one_qstr(&last, path.dentry, lookup_flags);
>  	error = PTR_ERR(dentry);
>  	if (!IS_ERR(dentry)) {
>  
> @@ -5231,8 +5237,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  		goto exit_lock_rename;
>  	}
>  
> -	old_dentry = lookup_one_qstr_excl(&old_last, old_path.dentry,
> -					  lookup_flags);
> +	old_dentry = lookup_one_qstr(&old_last, old_path.dentry,
> +				     lookup_flags);
>  	error = PTR_ERR(old_dentry);
>  	if (IS_ERR(old_dentry))
>  		goto exit3;
> @@ -5240,8 +5246,8 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  	error = -ENOENT;
>  	if (d_is_negative(old_dentry))
>  		goto exit4;
> -	new_dentry = lookup_one_qstr_excl(&new_last, new_path.dentry,
> -					  lookup_flags | target_flags);
> +	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
> +				     lookup_flags | target_flags);
>  	error = PTR_ERR(new_dentry);
>  	if (IS_ERR(new_dentry))
>  		goto exit4;
> @@ -5292,6 +5298,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>  	rd.flags	   = flags;
>  	error = vfs_rename(&rd);
>  exit5:
> +	d_lookup_done(new_dentry);
>  	dput(new_dentry);
>  exit4:
>  	dput(old_dentry);
> diff --git a/fs/smb/server/vfs.c b/fs/smb/server/vfs.c
> index 4e580bb7baf8..89b3823f6405 100644
> --- a/fs/smb/server/vfs.c
> +++ b/fs/smb/server/vfs.c
> @@ -109,7 +109,7 @@ static int ksmbd_vfs_path_lookup_locked(struct ksmbd_share_config *share_conf,
>  	}
>  
>  	inode_lock_nested(parent_path->dentry->d_inode, I_MUTEX_PARENT);
> -	d = lookup_one_qstr_excl(&last, parent_path->dentry, 0);
> +	d = lookup_one_qstr(&last, parent_path->dentry, 0);
>  	if (IS_ERR(d))
>  		goto err_out;
>  
> @@ -726,8 +726,8 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  		ksmbd_fd_put(work, parent_fp);
>  	}
>  
> -	new_dentry = lookup_one_qstr_excl(&new_last, new_path.dentry,
> -					  lookup_flags | LOOKUP_RENAME_TARGET);
> +	new_dentry = lookup_one_qstr(&new_last, new_path.dentry,
> +				     lookup_flags | LOOKUP_RENAME_TARGET);
>  	if (IS_ERR(new_dentry)) {
>  		err = PTR_ERR(new_dentry);
>  		goto out3;
> @@ -771,6 +771,7 @@ int ksmbd_vfs_rename(struct ksmbd_work *work, const struct path *old_path,
>  		ksmbd_debug(VFS, "vfs_rename failed err %d\n", err);
>  
>  out4:
> +	d_lookup_done(new_dentry);
>  	dput(new_dentry);
>  out3:
>  	dput(old_parent);
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 8ec8fed3bce8..06bb3ea65beb 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -34,6 +34,9 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  #define LOOKUP_EXCL		0x0400	/* ... in exclusive creation */
>  #define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
>  
> +#define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> +				 LOOKUP_RENAME_TARGET)
> +
>  /* internal use only */
>  #define LOOKUP_PARENT		0x0010
>  
> @@ -52,9 +55,9 @@ extern int path_pts(struct path *path);
>  
>  extern int user_path_at(int, const char __user *, unsigned, struct path *);
>  
> -struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> -				    struct dentry *base,
> -				    unsigned int flags);
> +struct dentry *lookup_one_qstr(const struct qstr *name,
> +			       struct dentry *base,
> +			       unsigned int flags);
>  extern int kern_path(const char *, unsigned, struct path *);
>  
>  extern struct dentry *kern_path_create(int, const char *, struct path *, unsigned int);

-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (18 preceding siblings ...)
  2025-02-06  5:42 ` [PATCH 19/19] nfs: switch to _async for all directory ops NeilBrown
@ 2025-02-06 14:36 ` Christian Brauner
  2025-02-06 15:36 ` John Stoffel
  2025-02-09 23:33 ` Al Viro
  21 siblings, 0 replies; 83+ messages in thread
From: Christian Brauner @ 2025-02-06 14:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:
> This is my latest attempt at removing the requirement for an exclusive
> lock on a directory while performing updates in it.  This version,
> inspired by Dave Chinner, goes a step further and allows async updates.
> 
> The inode operation still requires the inode lock, at least a shared
> lock, but may return -EINPROGRESS and then continue asynchronously
> without needing any ongoing lock on the directory.
> 
> An exclusive lock on the dentry is held across the entire operation.
> 
> This change requires various extra checks.  rmdir must ensure there is
> no async creation still happening.  rename between directories must
> ensure none of the relevant ancestors are undergoing async rename.  There
> may be other checks that I need to consider - mounting?

Mounting takes an exclusive lock on the target inode in do_lock_mount()
and finish_automount(). As long as dont_mount() can't happen
asynchronously in vfs_rmdir(), vfs_unlink() or vfs_rename() it should be
fine.

> 
> One other important change since my previous posting is that I've
> dropped the idea of taking a separate exclusive lock on the directory
> when the fs doesn't support shared locking.  This cannot work as it
> doesn't prevent lookups and filesystems don't expect a lookup while
> they are changing a directory.  So instead we need to choose between
> exclusive or shared for the inode on a case-by-case basis.

Which is possibly fine if we do it similar to what I suggested in the
series. As it stands with the separate methods it's a no-go for me. But
that's a solvable problem, I think.

> To make this choice we divide all ops into four groups: create, remove,
> rename, open/create.  If an inode has no operations in the group that
> require an exclusive lock, then a flag is set on the inode so that
> various code knows that a shared lock is sufficient.  If the flag is not
> set, an exclusive lock is obtained.
> 
> I've also added rename handling and converted NFS to use all _async ops.
> 
> The motivation for this comes from the general increase in scale of
> systems.  We can support very large directories and many-core systems
> and applications that choose to use large directories can hit
> unnecessary contention.
> 
> NFS can easily hit this when used over a high-latency link.
> Lustre already has code to allow concurrent directory updates in the
> back-end filesystem (ldiskfs - a slightly modified ext4).
> Lustre developers believe this would also benefit the client-side
> filesystem with large core counts.
> 
> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations.  I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used.  I
> would welcome any advice on that matter, though I hope to find time to
> explore myself.  For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.
> 
> Test status:  only light testing.  It doesn't easily blow up, but lockdep
> complains that repeated calls to d_update_wait() are bad, even though
> it has balanced acquire and release calls. Weird?
> 
> Thanks,
> NeilBrown
> 
>  [PATCH 01/19] VFS: introduce vfs_mkdir_return()
>  [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
>  [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
>  [PATCH 04/19] VFS: change kern_path_locked() and
>  [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
>  [PATCH 06/19] VFS: repack DENTRY_ flags.
>  [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
>  [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
>  [PATCH 09/19] VFS: add _async versions of the various directory
>  [PATCH 10/19] VFS: introduce inode flags to report locking needs for
>  [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
>  [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
>  [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
>  [PATCH 14/19] VFS: Ensure no async updates happening in directory
>  [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
>  [PATCH 16/19] VFS: add lookup_and_lock_rename()
>  [PATCH 17/19] nfsd: use lookup_and_lock_one() and
>  [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
>  [PATCH 19/19] nfs: switch to _async for all directory ops.


* Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (19 preceding siblings ...)
  2025-02-06 14:36 ` [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory Christian Brauner
@ 2025-02-06 15:36 ` John Stoffel
  2025-02-07  2:18   ` NeilBrown
  2025-02-09 23:33 ` Al Viro
  21 siblings, 1 reply; 83+ messages in thread
From: John Stoffel @ 2025-02-06 15:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner, linux-fsdevel, linux-kernel

>>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:

> This is my latest attempt at removing the requirement for an exclusive
> lock on a directory which performing updates in this.  This version,
> inspired by Dave Chinner, goes a step further and allow async updates.

This initial sentence reads poorly to me.  I think you may be
trying to say:

  This is my latest attempt at removing the requirement for writers to
  have an exclusive lock on a directory when performing updates on
  entries in that directory.  This allows for parallel updates by
  multiple processes (connections? hosts? clients?) to improve scaling
  of large filesystems.

I get what you're trying to do here, and I applaud it!  I just
struggled over the intro here.  


> The inode operation still requires the inode lock, at least a shared
> lock, but may return -EINPROGRESS and then continue asynchronously
> without needing any ongoing lock on the directory.

> An exclusive lock on the dentry is held across the entire operation.

> This change requires various extra checks.  rmdir must ensure there is
> no async creation still happening.  rename between directories must
> ensure none of the relevant ancestors are undergoing async rename.  There
> may be other checks that I need to consider - mounting?

> One other important change since my previous posting is that I've
> dropped the idea of taking a separate exclusive lock on the directory
> when the fs doesn't support shared locking.  This cannot work as it
> doesn't prevent lookups and filesystems don't expect a lookup while
> they are changing a directory.  So instead we need to choose between
> exclusive or shared for the inode on a case-by-case basis.

> To make this choice we divide all ops into four groups: create, remove,
> rename, open/create.  If an inode has no operations in the group that
> require an exclusive lock, then a flag is set on the inode so that
> various code knows that a shared lock is sufficient.  If the flag is not
> set, an exclusive lock is obtained.
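
For what it's worth, the selection rule quoted above can be modelled in
a few lines of plain C.  This is only a userspace sketch: the enums, the
TOY_SHARED_OK() bit layout and toy_pick_lock() are names made up for
illustration, not the identifiers used in the series, and it assumes one
"shared is sufficient" bit per group of operations.

```c
#include <assert.h>

/* The four groups of directory operations named in the cover letter. */
enum toy_op_group {
	TOY_OP_CREATE,
	TOY_OP_REMOVE,
	TOY_OP_RENAME,
	TOY_OP_OPEN_CREATE,
};

enum toy_lock_mode {
	TOY_LOCK_EXCLUSIVE,
	TOY_LOCK_SHARED,
};

/* Hypothetical layout: one "shared lock is sufficient" bit per group. */
#define TOY_SHARED_OK(group)	(1u << (group))

/* Toy inode: only the per-group flags matter for this sketch. */
struct toy_inode {
	unsigned int shared_ok;
};

/*
 * Pick the lock mode for an operation in @group on directory @dir.
 * A shared lock is used only if the flag for that group is set;
 * otherwise fall back to an exclusive lock.
 */
enum toy_lock_mode toy_pick_lock(const struct toy_inode *dir,
				 enum toy_op_group group)
{
	if (dir->shared_ok & TOY_SHARED_OK(group))
		return TOY_LOCK_SHARED;
	return TOY_LOCK_EXCLUSIVE;
}
```

A directory with no flags set always gets the exclusive lock, so a
filesystem that has not been converted sees no change in behaviour in
this model.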

> I've also added rename handling and converted NFS to use all _async ops.

> The motivation for this comes from the general increase in scale of
> systems.  We can support very large directories and many-core systems
> and applications that choose to use large directories can hit
> unnecessary contention.

> NFS can easily hit this when used over a high-latency link.
> Lustre already has code to allow concurrent directory updates in the
> back-end filesystem (ldiskfs - a slightly modified ext4).
> Lustre developers believe this would also benefit the client-side
> filesystem with large core counts.

> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations.  I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used.  I
> would welcome any advice on that matter, though I hope to find time to
> explore myself.  For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.

> Test status:  only light testing.  It doesn't easily blow up, but lockdep
> complains that repeated calls to d_update_wait() are bad, even though
> it has balanced acquire and release calls. Weird?

> Thanks,
> NeilBrown

>  [PATCH 01/19] VFS: introduce vfs_mkdir_return()
>  [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
>  [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
>  [PATCH 04/19] VFS: change kern_path_locked() and
>  [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
>  [PATCH 06/19] VFS: repack DENTRY_ flags.
>  [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
>  [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
>  [PATCH 09/19] VFS: add _async versions of the various directory
>  [PATCH 10/19] VFS: introduce inode flags to report locking needs for
>  [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
>  [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
>  [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
>  [PATCH 14/19] VFS: Ensure no async updates happening in directory
>  [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
>  [PATCH 16/19] VFS: add lookup_and_lock_rename()
>  [PATCH 17/19] nfsd: use lookup_and_lock_one() and
>  [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
>  [PATCH 19/19] nfs: switch to _async for all directory ops.



* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06 12:24   ` Christian Brauner
@ 2025-02-06 23:52     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06 23:52 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, 06 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote:
> > vfs_mkdir() does not guarantee to make the child dentry positive on
> > success.  It may leave it negative and then the caller needs to perform a
> > lookup to find the target dentry.
> > 
> > This patch introduces vfs_mkdir_return() which performs the lookup if
> > needed so that this code is centralised.
> > 
> > This prepares for a new inode operation which will perform mkdir and
> > return the correct dentry.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/cachefiles/namei.c    |  7 +---
> >  fs/namei.c               | 69 ++++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/vfs.c            | 21 ++----------
> >  fs/overlayfs/dir.c       | 33 +------------------
> >  fs/overlayfs/overlayfs.h | 10 +++---
> >  fs/overlayfs/super.c     |  2 +-
> >  fs/smb/server/vfs.c      | 24 +++-----------
> >  include/linux/fs.h       |  2 ++
> >  8 files changed, 86 insertions(+), 82 deletions(-)
> > 
> > diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> > index 7cf59713f0f7..3c866c3b9534 100644
> > --- a/fs/cachefiles/namei.c
> > +++ b/fs/cachefiles/namei.c
> > @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  	/* search the current directory for the element name */
> >  	inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
> >  
> > -retry:
> >  	ret = cachefiles_inject_read_error();
> >  	if (ret == 0)
> >  		subdir = lookup_one_len(dirname, dir, strlen(dirname));
> > @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  			goto mkdir_error;
> >  		ret = cachefiles_inject_write_error();
> >  		if (ret == 0)
> > -			ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
> > +			ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
> >  		if (ret < 0) {
> >  			trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
> >  						   cachefiles_trace_mkdir_error);
> > @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  		}
> >  		trace_cachefiles_mkdir(dir, subdir);
> >  
> > -		if (unlikely(d_unhashed(subdir))) {
> > -			cachefiles_put_directory(subdir);
> > -			goto retry;
> > -		}
> >  		ASSERT(d_backing_inode(subdir));
> >  
> >  		_debug("mkdir -> %pd{ino=%lu}",
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3ab9440c5b93..d98caf36e867 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> >  }
> >  EXPORT_SYMBOL(vfs_mkdir);
> >  
> > +/**
> > + * vfs_mkdir_return - create directory returning correct dentry
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @dir:	inode of the parent directory
> > + * @dentryp:	pointer to dentry of the child directory
> > + * @mode:	mode of the child directory
> > + *
> > + * Create a directory.
> > + *
> > + * If the inode has been found through an idmapped mount the idmap of
> > + * the vfsmount must be passed through @idmap. This function will then take
> > + * care to map the inode according to @idmap before checking permissions.
> > + * On non-idmapped mounts or if permission checking is to be performed on the
> > + * raw inode simply pass @nop_mnt_idmap.
> > + *
> > + * The filesystem may not use the dentry that was passed in.  In that case
> > + * the passed-in dentry is put and a new one is placed in *@dentryp;
> > + * So on successful return *@dentryp will always be positive.
> > + */
> > +int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
> > +		     struct dentry **dentryp, umode_t mode)
> > +{
> 
> I think this is misnamed. Maybe vfs_mkdir_positive() is better here.
> It would also be nice to have a comment on vfs_mkdir() as well pointing out
> that the returned dentry might be negative.

While I'm not particularly fond of vfs_mkdir_return(), I don't see that
vfs_mkdir_positive() is an improvement.  I cannot find any relevant
precedent in the kernel to guide.  Most _return and _positive functions
are for low-level counting primitives :-)

I'm tempted to add another arg to vfs_mkdir() instead of adding a new
function.  That would solve one problem by introducing another: what
arg?  Maybe pass both a 'struct dentry *' and a 'struct dentry **' and
if the latter is not NULL, it gets filled with the new dentry if there
is one.

> 
> And is there a particular reason to not have it return the new dentry?
> That seems clearer than using the argument as a return value.

If I did that then every caller would need to check if the return value
was not IS_ERR_OR_NULL() and if so, dput() the original dentry and keep
the new one - just like current callers of ->lookup need to.  It seems
cleaner to do that once in vfs_mkdir_return() rather than in all the
callers.  I guess we could *always* return the dentry on success and
dput the old one if it was different or if there were an error.  So

   dentry = vfs_mkdir_return(idmap, inode, dentry, mode)

would be the common pattern.  Would you be OK with that?
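
To make that convention concrete, here is a userspace toy model.  It is
a sketch only: struct toy_dentry, toy_dput(), enum toy_fs_behaviour and
toy_mkdir_return() are invented for illustration, the ERR_PTR helpers
are simple stand-ins for the kernel ones, and the real function would of
course call into the filesystem rather than take a "behaviour" argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO	4095
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Toy dentry: a reference count and a "positive" marker, nothing else. */
struct toy_dentry {
	int refcount;
	int positive;
};

static inline void toy_dput(struct toy_dentry *dentry)
{
	dentry->refcount--;
}

/* What the (pretend) filesystem does with the dentry it was given. */
enum toy_fs_behaviour {
	TOY_FS_USES_DENTRY,	/* instantiates the passed-in dentry */
	TOY_FS_SUBSTITUTES,	/* leaves it negative; a lookup finds another */
	TOY_FS_FAILS,		/* mkdir fails */
};

/*
 * Model of the proposed convention: the caller's reference to @dentry is
 * always consumed.  On success a positive dentry is returned, either
 * @dentry itself or @replacement with a new reference.  On failure the
 * original is dropped and an ERR_PTR() is returned.
 */
struct toy_dentry *toy_mkdir_return(struct toy_dentry *dentry,
				    struct toy_dentry *replacement,
				    enum toy_fs_behaviour how)
{
	switch (how) {
	case TOY_FS_FAILS:
		toy_dput(dentry);
		return ERR_PTR(-EIO);
	case TOY_FS_SUBSTITUTES:
		toy_dput(dentry);
		replacement->refcount++;
		replacement->positive = 1;
		return replacement;
	case TOY_FS_USES_DENTRY:
	default:
		dentry->positive = 1;
		return dentry;
	}
}
```

With this shape a caller only ever has to test IS_ERR() on the result
and carry on with whatever dentry came back; it never needs to compare
old and new or drop the original itself.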


> 
> > +	struct dentry *dentry = *dentryp;
> > +	int error;
> > +	unsigned max_links = dir->i_sb->s_max_links;
> > +
> > +	error = may_create(idmap, dir, dentry);
> > +	if (error)
> > +		return error;
> > +
> > +	if (!dir->i_op->mkdir)
> > +		return -EPERM;
> > +
> > +	mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
> > +	error = security_inode_mkdir(dir, dentry, mode);
> > +	if (error)
> > +		return error;
> > +
> > +	if (max_links && dir->i_nlink >= max_links)
> > +		return -EMLINK;
> > +
> > +	error = dir->i_op->mkdir(idmap, dir, dentry, mode);
> 
> Why isn't this calling vfs_mkdir() and only diverging afterwards?

Because once we introduce the new ->mkdir_async which returns a dentry
the two functions start to diverge more.
I could have a __vfs_mkdir() which does both and has a bool arg to tell
it if we want the return value.  That would avoid code duplication.

> 
> > +	if (!error) {
> > +		fsnotify_mkdir(dir, dentry);
> > +		if (unlikely(d_unhashed(dentry))) {
> > +			struct dentry *d;
> > +			/* Need a "const" pointer.  We know d_name is const
> > +			 * because we hold an exclusive lock on i_rwsem
> > +			 * in d_parent.
> > +			 */
> > +			const struct qstr *d_name = (void*)&dentry->d_name;
> > +			d = lookup_dcache(d_name, dentry->d_parent, 0);
> > +			if (!d)
> > +				d = __lookup_slow(d_name, dentry->d_parent, 0);
> 
> Quite a few callers use lookup_one() here which calls
> inode_permission() on @dir again. Are we guaranteed that the permission
> check would always pass?

I think they use lookup_one() because that was the easiest, not because
they need all the functionality.
If the process had permission to create a directory with a given name
but now doesn't have permission to look up that same name, then
something is weird.  Maybe a race with permission changing could do
that.  But I think the process should have the right to hold the dentry
that it has just successfully created.
The lookup is hopefully just a work-around until the new improved
interface is used by all relevant filesystems.


Thanks,
NeilBrown


* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06 13:52   ` Jeff Layton
@ 2025-02-06 23:57     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-06 23:57 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Jeff Layton wrote:
> On Thu, 2025-02-06 at 16:42 +1100, NeilBrown wrote:
> > vfs_mkdir() does not guarantee to make the child dentry positive on
> > success.  It may leave it negative and then the caller needs to perform a
> > lookup to find the target dentry.
> > 
> > This patch introduces vfs_mkdir_return() which performs the lookup if
> > needed so that this code is centralised.
> > 
> > This prepares for a new inode operation which will perform mkdir and
> > return the correct dentry.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/cachefiles/namei.c    |  7 +---
> >  fs/namei.c               | 69 ++++++++++++++++++++++++++++++++++++++++
> >  fs/nfsd/vfs.c            | 21 ++----------
> >  fs/overlayfs/dir.c       | 33 +------------------
> >  fs/overlayfs/overlayfs.h | 10 +++---
> >  fs/overlayfs/super.c     |  2 +-
> >  fs/smb/server/vfs.c      | 24 +++-----------
> >  include/linux/fs.h       |  2 ++
> >  8 files changed, 86 insertions(+), 82 deletions(-)
> > 
> > diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> > index 7cf59713f0f7..3c866c3b9534 100644
> > --- a/fs/cachefiles/namei.c
> > +++ b/fs/cachefiles/namei.c
> > @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  	/* search the current directory for the element name */
> >  	inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
> >  
> > -retry:
> >  	ret = cachefiles_inject_read_error();
> >  	if (ret == 0)
> >  		subdir = lookup_one_len(dirname, dir, strlen(dirname));
> > @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  			goto mkdir_error;
> >  		ret = cachefiles_inject_write_error();
> >  		if (ret == 0)
> > -			ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
> > +			ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
> >  		if (ret < 0) {
> >  			trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
> >  						   cachefiles_trace_mkdir_error);
> > @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> >  		}
> >  		trace_cachefiles_mkdir(dir, subdir);
> >  
> > -		if (unlikely(d_unhashed(subdir))) {
> > -			cachefiles_put_directory(subdir);
> > -			goto retry;
> > -		}
> >  		ASSERT(d_backing_inode(subdir));
> >  
> >  		_debug("mkdir -> %pd{ino=%lu}",
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3ab9440c5b93..d98caf36e867 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> >  }
> >  EXPORT_SYMBOL(vfs_mkdir);
> >  
> > +/**
> > + * vfs_mkdir_return - create directory returning correct dentry
> > + * @idmap:	idmap of the mount the inode was found from
> > + * @dir:	inode of the parent directory
> > + * @dentryp:	pointer to dentry of the child directory
> > + * @mode:	mode of the child directory
> > + *
> > + * Create a directory.
> > + *
> > + * If the inode has been found through an idmapped mount the idmap of
> > + * the vfsmount must be passed through @idmap. This function will then take
> > + * care to map the inode according to @idmap before checking permissions.
> > + * On non-idmapped mounts or if permission checking is to be performed on the
> > + * raw inode simply pass @nop_mnt_idmap.
> > + *
> > + * The filesystem may not use the dentry that was passed in.  In that case
> > + * the passed-in dentry is put and a new one is placed in *@dentryp;
> 
> This sounds like the filesystem is not allowed to use the dentry that
> we're passing it. Maybe something like this:
> 
> "In the event that the filesystem doesn't use *@dentryp, the dentry is
> put and a new one is placed in *@dentryp;"

Good catch - thanks.
I've updated my patch to use your text, except I decided on "dput()"
rather than "put".

Thanks,
NeilBrown


* Re: [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it.
  2025-02-06 14:30   ` Jeff Layton
@ 2025-02-07  0:04     ` NeilBrown
  2025-02-07  0:23       ` Jeff Layton
  0 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-07  0:04 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Jeff Layton wrote:
> On Thu, 2025-02-06 at 16:42 +1100, NeilBrown wrote:
> > lookup_one_qstr_excl() is used for lookups prior to directory
> > modifications, whether create, unlink, rename, or whatever.
> > 
> > To prepare for allowing modification to happen in parallel, change
> > lookup_one_qstr_excl() to use d_alloc_parallel().
> > 
> > To reflect this, the name is changed to lookup_one_qstr() - as the directory
> > may be locked shared.
> > 
> > If any of the "intent" LOOKUP flags are passed, the caller must ensure
> > d_lookup_done() is called at an appropriate time.  If none are passed
> > then we can be sure ->lookup() will do a real lookup and d_lookup_done()
> > is called internally.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/namei.c            | 47 +++++++++++++++++++++++++------------------
> >  fs/smb/server/vfs.c   |  7 ++++---
> >  include/linux/namei.h |  9 ++++++---
> >  3 files changed, 37 insertions(+), 26 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 5cdbd2eb4056..d684102d873d 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1665,15 +1665,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
> >  }
> >  
> >  /*
> > - * Parent directory has inode locked exclusive.  This is one
> > - * and only case when ->lookup() gets called on non in-lookup
> > - * dentries - as the matter of fact, this only gets called
> > - * when directory is guaranteed to have no in-lookup children
> > - * at all.
> > + * Parent directory has inode locked: exclusive or shared.
> > + * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
> > + * must be called after the intended operation is performed - or aborted.
> >   */
> > -struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> > -				    struct dentry *base,
> > -				    unsigned int flags)
> > +struct dentry *lookup_one_qstr(const struct qstr *name,
> > +			       struct dentry *base,
> > +			       unsigned int flags)
> >  {
> >  	struct dentry *dentry = lookup_dcache(name, base, flags);
> >  	struct dentry *old;
> > @@ -1686,18 +1684,25 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> >  	if (unlikely(IS_DEADDIR(dir)))
> >  		return ERR_PTR(-ENOENT);
> >  
> > -	dentry = d_alloc(base, name);
> > -	if (unlikely(!dentry))
> > +	dentry = d_alloc_parallel(base, name);
> > +	if (unlikely(IS_ERR_OR_NULL(dentry)))
> >  		return ERR_PTR(-ENOMEM);
> > +	if (!d_in_lookup(dentry))
> > +		/* Raced with another thread which did the lookup */
> > +		return dentry;
> >  
> >  	old = dir->i_op->lookup(dir, dentry, flags);
> >  	if (unlikely(old)) {
> > +		d_lookup_done(dentry);
> >  		dput(dentry);
> >  		dentry = old;
> >  	}
> > +	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
> > +		/* ->lookup must have given final answer */
> > +		d_lookup_done(dentry);
> 
> This is kind of an ugly thing for the callers to get right. I think it
> would be cleaner to just push the d_lookup_done() into all of the
> callers that don't pass any intent flags, and do away with this.

I don't understand your concern.  This does not impose on callers,
rather it relieves them of a burden.  d_lookup_done() is fully
idempotent so if a caller does call it, there is no harm done.

In the final result of my series there are 4 callers of this function.
1/ lookup_and_lock() which must always be balanced with
  done_lookup_and_lock(), which calls d_lookup_done()
2/ lookup_and_lock_rename() which is similarly balanced with
  done_lookup_and_lock_rename().
3/ ksmbd_vfs_path_lookup_locked() which passes zero for the flags and so
   doesn't need d_lookup_done()
4/ ksmbd_vfs_rename() which calls d_lookup_done() as required.

So if I dropped this code it would only affect one caller which would
need to add a call to d_lookup_done() probably immediately after the
successful return of lookup_one_qstr().
While that wouldn't hurt much, I don't see that it would help much
either.
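
To illustrate why an extra call is harmless, here is a small userspace
model of the rule.  It is a sketch under assumptions: struct toy_dentry,
toy_lookup_done() and toy_lookup_one() are invented names, the TOY_ flag
values are illustrative only, and the wakeup counter merely stands in
for waking any waiters on an in-lookup dentry.

```c
#include <assert.h>

/* Illustrative values only; not meant to match the real LOOKUP_ bits. */
#define TOY_LOOKUP_OPEN			0x0100
#define TOY_LOOKUP_CREATE		0x0200
#define TOY_LOOKUP_EXCL			0x0400
#define TOY_LOOKUP_RENAME_TARGET	0x0800
#define TOY_LOOKUP_INTENT_FLAGS	(TOY_LOOKUP_OPEN | TOY_LOOKUP_CREATE | \
				 TOY_LOOKUP_EXCL | TOY_LOOKUP_RENAME_TARGET)

struct toy_dentry {
	int in_lookup;	/* models the in-lookup state of a new dentry */
	int wakeups;	/* how many times waiters were actually woken */
};

/* Idempotent: only the first call on an in-lookup dentry does any work. */
void toy_lookup_done(struct toy_dentry *dentry)
{
	if (dentry->in_lookup) {
		dentry->in_lookup = 0;
		dentry->wakeups++;
	}
}

/*
 * Models the tail of the lookup: with no intent flags the answer is
 * final, so the in-lookup state is ended here.  With an intent flag the
 * caller must end it once the intended operation completes or aborts.
 */
void toy_lookup_one(struct toy_dentry *dentry, unsigned int flags)
{
	dentry->in_lookup = 1;	/* fresh dentry from a parallel allocation */
	if ((flags & TOY_LOOKUP_INTENT_FLAGS) == 0)
		toy_lookup_done(dentry);
}
```

A caller passing zero flags gets back a dentry that is already out of
the in-lookup state, and a caller that calls toy_lookup_done() anyway,
or calls it twice, changes nothing.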

Thanks,
NeilBrown


* Re: [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry
  2025-02-06 13:09     ` Christian Brauner
@ 2025-02-07  0:08       ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  0:08 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 01:31:56PM +0100, Christian Brauner wrote:
> > On Thu, Feb 06, 2025 at 04:42:41PM +1100, NeilBrown wrote:
> > > No callers of kern_path_locked() or user_path_locked_at() want a
> > > negative dentry.  So change them to return -ENOENT instead.  This
> > > simplifies callers.
> > > 
> > > This results in a subtle change to bcachefs in that an ioctl will now
> > > return -ENOENT in preference to -EXDEV.  I believe this restores the
> > > behaviour to what it was prior to
> > >  Commit bbe6a7c899e7 ("bch2_ioctl_subvolume_destroy(): fix locking")
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > 
> > It would be nice if you could send this as a separate cleanup patch.
> > It seems unrelated to the series.

I'll do that, thanks.

> > 
> > >  drivers/base/devtmpfs.c | 65 +++++++++++++++++++----------------------
> > >  fs/bcachefs/fs-ioctl.c  |  4 ---
> > >  fs/namei.c              |  4 +++
> > >  kernel/audit_watch.c    | 12 ++++----
> > >  4 files changed, 40 insertions(+), 45 deletions(-)
> > > 
> > > diff --git a/drivers/base/devtmpfs.c b/drivers/base/devtmpfs.c
> > > index b848764ef018..c9e34842139f 100644
> > > --- a/drivers/base/devtmpfs.c
> > > +++ b/drivers/base/devtmpfs.c
> > > @@ -245,15 +245,12 @@ static int dev_rmdir(const char *name)
> > >  	dentry = kern_path_locked(name, &parent);
> > >  	if (IS_ERR(dentry))
> > >  		return PTR_ERR(dentry);
> > > -	if (d_really_is_positive(dentry)) {
> > > -		if (d_inode(dentry)->i_private == &thread)
> > > -			err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> > > -					dentry);
> > > -		else
> > > -			err = -EPERM;
> > > -	} else {
> > > -		err = -ENOENT;
> > > -	}
> > > +	if (d_inode(dentry)->i_private == &thread)
> > > +		err = vfs_rmdir(&nop_mnt_idmap, d_inode(parent.dentry),
> > > +				dentry);
> > > +	else
> > > +		err = -EPERM;
> > > +
> > >  	dput(dentry);
> > >  	inode_unlock(d_inode(parent.dentry));
> > >  	path_put(&parent);
> > > @@ -310,6 +307,8 @@ static int handle_remove(const char *nodename, struct device *dev)
> > >  {
> > >  	struct path parent;
> > >  	struct dentry *dentry;
> > > +	struct kstat stat;
> > > +	struct path p;
> > >  	int deleted = 0;
> > >  	int err;
> > >  
> > > @@ -317,32 +316,28 @@ static int handle_remove(const char *nodename, struct device *dev)
> > >  	if (IS_ERR(dentry))
> > >  		return PTR_ERR(dentry);
> > >  
> > > -	if (d_really_is_positive(dentry)) {
> > > -		struct kstat stat;
> > > -		struct path p = {.mnt = parent.mnt, .dentry = dentry};
> > > -		err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> > > -				  AT_STATX_SYNC_AS_STAT);
> > > -		if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> > > -			struct iattr newattrs;
> > > -			/*
> > > -			 * before unlinking this node, reset permissions
> > > -			 * of possible references like hardlinks
> > > -			 */
> > > -			newattrs.ia_uid = GLOBAL_ROOT_UID;
> > > -			newattrs.ia_gid = GLOBAL_ROOT_GID;
> > > -			newattrs.ia_mode = stat.mode & ~0777;
> > > -			newattrs.ia_valid =
> > > -				ATTR_UID|ATTR_GID|ATTR_MODE;
> > > -			inode_lock(d_inode(dentry));
> > > -			notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> > > -			inode_unlock(d_inode(dentry));
> > > -			err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> > > -					 dentry, NULL);
> > > -			if (!err || err == -ENOENT)
> > > -				deleted = 1;
> > > -		}
> > > -	} else {
> > > -		err = -ENOENT;
> > > +	p.mnt = parent.mnt;
> > > +	p.dentry = dentry;
> > > +	err = vfs_getattr(&p, &stat, STATX_TYPE | STATX_MODE,
> > > +			  AT_STATX_SYNC_AS_STAT);
> > > +	if (!err && dev_mynode(dev, d_inode(dentry), &stat)) {
> > > +		struct iattr newattrs;
> > > +		/*
> > > +		 * before unlinking this node, reset permissions
> > > +		 * of possible references like hardlinks
> > > +		 */
> > > +		newattrs.ia_uid = GLOBAL_ROOT_UID;
> > > +		newattrs.ia_gid = GLOBAL_ROOT_GID;
> > > +		newattrs.ia_mode = stat.mode & ~0777;
> > > +		newattrs.ia_valid =
> > > +			ATTR_UID|ATTR_GID|ATTR_MODE;
> > > +		inode_lock(d_inode(dentry));
> > > +		notify_change(&nop_mnt_idmap, dentry, &newattrs, NULL);
> > > +		inode_unlock(d_inode(dentry));
> > > +		err = vfs_unlink(&nop_mnt_idmap, d_inode(parent.dentry),
> > > +				 dentry, NULL);
> > > +		if (!err || err == -ENOENT)
> > > +			deleted = 1;
> > >  	}
> > >  	dput(dentry);
> > >  	inode_unlock(d_inode(parent.dentry));
> > > diff --git a/fs/bcachefs/fs-ioctl.c b/fs/bcachefs/fs-ioctl.c
> > > index 15725b4ce393..595b57fabc9a 100644
> > > --- a/fs/bcachefs/fs-ioctl.c
> > > +++ b/fs/bcachefs/fs-ioctl.c
> > > @@ -511,10 +511,6 @@ static long bch2_ioctl_subvolume_destroy(struct bch_fs *c, struct file *filp,
> > >  		ret = -EXDEV;
> > >  		goto err;
> > >  	}
> > > -	if (!d_is_positive(victim)) {
> > > -		ret = -ENOENT;
> > > -		goto err;
> > > -	}
> > >  	ret = __bch2_unlink(dir, victim, true);
> > >  	if (!ret) {
> > >  		fsnotify_rmdir(dir, victim);
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index d684102d873d..1901120bcbb8 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -2745,6 +2745,10 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
> > >  	}
> > >  	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> > >  	d = lookup_one_qstr(&last, path->dentry, 0);
> > > +	if (!IS_ERR(d) && d_is_negative(d)) {
> > > +		dput(d);
> > > +		d = ERR_PTR(-ENOENT);
> 
> This doesn't unlock which afaict does cause issue with your devtmpfs
> changes:

It unlocks a little further down.  The above leaves 'd' as an error, and it is
followed by
	if (IS_ERR(d)) {
		inode_unlock(path->dentry->d_inode);
		path_put(path);
	}
	return d;

so I don't think there is a problem here.
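
To make the control flow above concrete, here is a minimal userspace
sketch, not the kernel code.  The toy_* types and helpers are
hypothetical stand-ins, and the ERR_PTR helpers are simplified copies of
the kernel idiom.  It only illustrates that turning a negative dentry
into -ENOENT before the existing IS_ERR() check lets the one shared
error exit drop the lock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)err; }
static inline long PTR_ERR(const void *p) { return (long)p; }
static inline bool IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy types: not the kernel's struct dentry or inode. */
struct toy_dentry { bool positive; };
struct toy_dir { int lock_depth; struct toy_dentry *child; };

static void toy_lock(struct toy_dir *dir)   { dir->lock_depth++; }
static void toy_unlock(struct toy_dir *dir) { dir->lock_depth--; }

/*
 * Same shape as __kern_path_locked() after the patch: a negative result
 * becomes -ENOENT, and the single IS_ERR() check that follows unlocks
 * for every error, including that new one.
 */
static struct toy_dentry *toy_path_locked(struct toy_dir *dir)
{
	struct toy_dentry *d;

	toy_lock(dir);
	/* A NULL child models a lookup that failed outright. */
	d = dir->child ? dir->child : ERR_PTR(-ENOMEM);
	if (!IS_ERR(d) && !d->positive)
		d = ERR_PTR(-ENOENT);	/* the real code also does dput(d) */
	if (IS_ERR(d))
		toy_unlock(dir);	/* shared error exit */
	return d;
}

/* Exercise both paths; call this from main() to run the sketch. */
void demo_error_path(void)
{
	struct toy_dentry neg = { .positive = false };
	struct toy_dentry pos = { .positive = true };
	struct toy_dir dir = { .lock_depth = 0, .child = &neg };
	struct toy_dentry *d;

	d = toy_path_locked(&dir);
	assert(IS_ERR(d) && PTR_ERR(d) == -ENOENT);
	assert(dir.lock_depth == 0);	/* lock was dropped on error */

	dir.child = &pos;
	d = toy_path_locked(&dir);
	assert(!IS_ERR(d));
	assert(dir.lock_depth == 1);	/* success returns with the lock held */
}
```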

Thanks for the review.

NeilBrown

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it.
  2025-02-07  0:04     ` NeilBrown
@ 2025-02-07  0:23       ` Jeff Layton
  0 siblings, 0 replies; 83+ messages in thread
From: Jeff Layton @ 2025-02-07  0:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 2025-02-07 at 11:04 +1100, NeilBrown wrote:
> On Fri, 07 Feb 2025, Jeff Layton wrote:
> > On Thu, 2025-02-06 at 16:42 +1100, NeilBrown wrote:
> > > lookup_one_qstr_excl() is used for lookups prior to directory
> > > modifications, whether create, unlink, rename, or whatever.
> > > 
> > > To prepare for allowing modification to happen in parallel, change
> > > lookup_one_qstr_excl() to use d_alloc_parallel().
> > > 
> > > To reflect this, name is changed to lookup_one_qstr() - as the directory
> > > may be locked shared.
> > > 
> > > If any of the "intent" LOOKUP flags are passed, the caller must ensure
> > > d_lookup_done() is called at an appropriate time.  If none are passed
> > > then we can be sure ->lookup() will do a real lookup and d_lookup_done()
> > > is called internally.
> > > 
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > >  fs/namei.c            | 47 +++++++++++++++++++++++++------------------
> > >  fs/smb/server/vfs.c   |  7 ++++---
> > >  include/linux/namei.h |  9 ++++++---
> > >  3 files changed, 37 insertions(+), 26 deletions(-)
> > > 
> > > diff --git a/fs/namei.c b/fs/namei.c
> > > index 5cdbd2eb4056..d684102d873d 100644
> > > --- a/fs/namei.c
> > > +++ b/fs/namei.c
> > > @@ -1665,15 +1665,13 @@ static struct dentry *lookup_dcache(const struct qstr *name,
> > >  }
> > >  
> > >  /*
> > > - * Parent directory has inode locked exclusive.  This is one
> > > - * and only case when ->lookup() gets called on non in-lookup
> > > - * dentries - as the matter of fact, this only gets called
> > > - * when directory is guaranteed to have no in-lookup children
> > > - * at all.
> > > + * Parent directory has inode locked: exclusive or shared.
> > > + * If @flags contains any LOOKUP_INTENT_FLAGS then d_lookup_done()
> > > + * must be called after the intended operation is performed - or aborted.
> > >   */
> > > -struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> > > -				    struct dentry *base,
> > > -				    unsigned int flags)
> > > +struct dentry *lookup_one_qstr(const struct qstr *name,
> > > +			       struct dentry *base,
> > > +			       unsigned int flags)
> > >  {
> > >  	struct dentry *dentry = lookup_dcache(name, base, flags);
> > >  	struct dentry *old;
> > > @@ -1686,18 +1684,25 @@ struct dentry *lookup_one_qstr_excl(const struct qstr *name,
> > >  	if (unlikely(IS_DEADDIR(dir)))
> > >  		return ERR_PTR(-ENOENT);
> > >  
> > > -	dentry = d_alloc(base, name);
> > > -	if (unlikely(!dentry))
> > > +	dentry = d_alloc_parallel(base, name);
> > > +	if (unlikely(IS_ERR_OR_NULL(dentry)))
> > >  		return ERR_PTR(-ENOMEM);
> > > +	if (!d_in_lookup(dentry))
> > > +		/* Raced with another thread which did the lookup */
> > > +		return dentry;
> > >  
> > >  	old = dir->i_op->lookup(dir, dentry, flags);
> > >  	if (unlikely(old)) {
> > > +		d_lookup_done(dentry);
> > >  		dput(dentry);
> > >  		dentry = old;
> > >  	}
> > > +	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
> > > +		/* ->lookup must have given final answer */
> > > +		d_lookup_done(dentry);
> > 
> > This is kind of an ugly thing for the callers to get right. I think it
> > would be cleaner to just push the d_lookup_done() into all of the
> > callers that don't pass any intent flags, and do away with this.
> 
> I don't understand your concern.  This does not impose on callers,
> rather it relieves them of a burden.  d_lookup_done() is fully
> idempotent so if a caller does call it, there is no harm done.
> 
> In the final result of my series there are 4 callers of this function.
> 1/ lookup_and_lock() which must always be balanced with
>   done_lookup_and_lock(), which calls d_lookup_done()
> 2/ lookup_and_lock_rename() which is similarly balance with
>   done_lookup_and_lock_rename(). 
> 3/ ksmbd_vfs_path_lookup_locked() which passes zero for the flags and so
>    doesn't need d_lookup_done()
> 4/ ksmbd_vfs_rename() which calls d_lookup_done() as required.
> 
> So if I dropped this code it would only affect one caller which would
> need to add a call to d_lookup_done() probably immediately after the
> successful return of lookup_one_qstr().
> While that wouldn't hurt much, I don't see that it would help much
> either.
> 

My concern is about the complex return handling. If the flags are 0,
then I don't need to call d_lookup_done(), but if they aren't 0, then I
do. That's just an easy opportunity to get it wrong if new callers are
added.

My preference would be that the caller must always call d_lookup_done()
on a successful return. If ksmbd_vfs_path_lookup_locked() has to call
it immediately afterward, then that's fine. No need for this special
handling in a generic function, just for a single caller.
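
The two conventions being compared can be sketched in userspace as
follows.  This is only an illustration under stated assumptions:
struct toy_dentry and every toy_* function are hypothetical, and the
only property taken from the thread is that d_lookup_done() is
idempotent.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy dentry: only tracks the "in lookup" state. */
struct toy_dentry {
	bool in_lookup;
	int wakeups;	/* how many times waiters were actually woken */
};

/* Models an idempotent d_lookup_done(): only the first call has an effect. */
static void toy_lookup_done(struct toy_dentry *d)
{
	if (!d->in_lookup)
		return;
	d->in_lookup = false;
	d->wakeups++;
}

/* Convention A (the patch): the helper finishes the lookup when flags are 0. */
static void toy_lookup_auto(struct toy_dentry *d, unsigned int intent_flags)
{
	d->in_lookup = true;
	if (intent_flags == 0)
		toy_lookup_done(d);
}

/* Convention B (the preference above): the caller always finishes it. */
static void toy_lookup_plain(struct toy_dentry *d)
{
	d->in_lookup = true;
}

/* Call this from main() to run the sketch. */
void demo_lookup_done(void)
{
	struct toy_dentry a = { 0 }, b = { 0 };

	toy_lookup_auto(&a, 0);
	toy_lookup_done(&a);	/* redundant, but harmless */
	assert(!a.in_lookup && a.wakeups == 1);

	toy_lookup_plain(&b);
	toy_lookup_done(&b);	/* one rule for every caller */
	assert(!b.in_lookup && b.wakeups == 1);
}
```

Both conventions end in the same state; the difference is only whether a
new caller has to know about the flags-dependent rule.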
-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
  2025-02-06 12:44   ` Christian Brauner
@ 2025-02-07  0:24     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  0:24 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, 06 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:44PM +1100, NeilBrown wrote:
> > The LOOKUP_ bits are not in order, which can make it awkward when adding
> > new bits.  Two bits have recently been added to the end which makes them
> > look like "scoping flags", but in fact they aren't.
> > 
> > Also LOOKUP_PARENT is described as "internal use only" but is used in
> > fs/nfs/
> > 
> > This patch:
> >  - Moves these three flags into the "pathwalk mode" section
> >  - changes all bits to use the BIT(n) macro
> >  - Allocates bits in order leaving gaps between the sections,
> >    and documents those gaps.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> 
> This is also a worthwhile cleanup independent of the rest of the series.
> But you've added LOOKUP_INTENT_FLAGS prior to packing the flags. Imho,
> this patch should've gone before the addition of LOOKUP_INTENT_FLAGS.

I'll fix that and submit separately - thanks.
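
For anyone checking the new layout, the following userspace sketch
copies the bit positions from the quoted patch and counts the spare
bits per section.  The SECTION_* masks and the USED_* groupings are
assumptions inferred from the "spare bits" comments, not names from the
patch, and BIT() here is a 32-bit stand-in for the kernel macro.

```c
#include <assert.h>

/* Userspace stand-in for the kernel's BIT() macro. */
#define BIT(n) (1U << (n))

/* Flag positions copied from the quoted patch. */
#define LOOKUP_FOLLOW		BIT(0)
#define LOOKUP_DIRECTORY	BIT(1)
#define LOOKUP_AUTOMOUNT	BIT(2)
#define LOOKUP_EMPTY		BIT(3)
#define LOOKUP_LINKAT_EMPTY	BIT(4)
#define LOOKUP_DOWN		BIT(5)
#define LOOKUP_MOUNTPOINT	BIT(6)
#define LOOKUP_REVAL		BIT(7)
#define LOOKUP_RCU		BIT(8)
#define LOOKUP_CACHED		BIT(9)
#define LOOKUP_PARENT		BIT(10)
#define LOOKUP_OPEN		BIT(16)
#define LOOKUP_CREATE		BIT(17)
#define LOOKUP_EXCL		BIT(18)
#define LOOKUP_RENAME_TARGET	BIT(19)
#define LOOKUP_NO_SYMLINKS	BIT(24)
#define LOOKUP_NO_MAGICLINKS	BIT(25)
#define LOOKUP_NO_XDEV		BIT(26)
#define LOOKUP_BENEATH		BIT(27)
#define LOOKUP_IN_ROOT		BIT(28)

/* Hypothetical section masks, inferred from the "spare bits" comments. */
#define SECTION_PATHWALK	0x0000ffffU	/* bits 0-15 */
#define SECTION_INTENT		0x00ff0000U	/* bits 16-23 */
#define SECTION_SCOPING		0xff000000U	/* bits 24-31 */

#define USED_PATHWALK	(LOOKUP_FOLLOW | LOOKUP_DIRECTORY | LOOKUP_AUTOMOUNT | \
			 LOOKUP_EMPTY | LOOKUP_LINKAT_EMPTY | LOOKUP_DOWN | \
			 LOOKUP_MOUNTPOINT | LOOKUP_REVAL | LOOKUP_RCU | \
			 LOOKUP_CACHED | LOOKUP_PARENT)
#define USED_INTENT	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL | \
			 LOOKUP_RENAME_TARGET)
#define USED_SCOPING	(LOOKUP_NO_SYMLINKS | LOOKUP_NO_MAGICLINKS | \
			 LOOKUP_NO_XDEV | LOOKUP_BENEATH | LOOKUP_IN_ROOT)

/* Count set bits by clearing the lowest one until none remain. */
static unsigned int count_bits(unsigned int v)
{
	unsigned int n = 0;

	while (v) {
		v &= v - 1;
		n++;
	}
	return n;
}

/* Spare bits left in a section; asserts that no flag strays outside it. */
unsigned int spare_bits(unsigned int used, unsigned int section)
{
	assert((used & ~section) == 0);
	return count_bits(section) - count_bits(used);
}
```

With these assumed masks, spare_bits() gives 5, 4 and 3 for the
pathwalk, intent and scoping sections, matching the comments in the
patch.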

> 
> And btw, what does this series apply to?

It was based on
Commit 92514ef226f5 ("Merge tag 'for-6.14-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux")

which was the current upstream at the time.

> Doesn't apply to next-20250206 nor to current mainline.
> I get the usual
> 
> Patch failed at 0012 VFS: enhance d_splice_alias to accommodate shared-lock updates
> error: sha1 information is lacking or useless (fs/dcache.c).
> error: could not build fake ancestor
> 
> when trying to look at this locally.

Probably your tree was missing
Commit 902e09c8acde ("fix braino in "9p: fix ->rename_sem exclusion"")

Thanks,
NeilBrown


> 
> >  include/linux/namei.h | 46 +++++++++++++++++++++----------------------
> >  1 file changed, 23 insertions(+), 23 deletions(-)
> > 
> > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > index 839a64d07f8c..0d81e571a159 100644
> > --- a/include/linux/namei.h
> > +++ b/include/linux/namei.h
> > @@ -18,38 +18,38 @@ enum { MAX_NESTED_LINKS = 8 };
> >  enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
> >  
> >  /* pathwalk mode */
> > -#define LOOKUP_FOLLOW		0x0001	/* follow links at the end */
> > -#define LOOKUP_DIRECTORY	0x0002	/* require a directory */
> > -#define LOOKUP_AUTOMOUNT	0x0004  /* force terminal automount */
> > -#define LOOKUP_EMPTY		0x4000	/* accept empty path [user_... only] */
> > -#define LOOKUP_DOWN		0x8000	/* follow mounts in the starting point */
> > -#define LOOKUP_MOUNTPOINT	0x0080	/* follow mounts in the end */
> > -
> > -#define LOOKUP_REVAL		0x0020	/* tell ->d_revalidate() to trust no cache */
> > -#define LOOKUP_RCU		0x0040	/* RCU pathwalk mode; semi-internal */
> > +#define LOOKUP_FOLLOW		BIT(0)	/* follow links at the end */
> > +#define LOOKUP_DIRECTORY	BIT(1)	/* require a directory */
> > +#define LOOKUP_AUTOMOUNT	BIT(2)  /* force terminal automount */
> > +#define LOOKUP_EMPTY		BIT(3)	/* accept empty path [user_... only] */
> > +#define LOOKUP_LINKAT_EMPTY	BIT(4) /* Linkat request with empty path. */
> > +#define LOOKUP_DOWN		BIT(5)	/* follow mounts in the starting point */
> > +#define LOOKUP_MOUNTPOINT	BIT(6)	/* follow mounts in the end */
> > +#define LOOKUP_REVAL		BIT(7)	/* tell ->d_revalidate() to trust no cache */
> > +#define LOOKUP_RCU		BIT(8)	/* RCU pathwalk mode; semi-internal */
> > +#define LOOKUP_CACHED		BIT(9) /* Only do cached lookup */
> > +#define LOOKUP_PARENT		BIT(10)	/* Looking up final parent in path */
> > +/* 5 spare bits for pathwalk */
> >  
> >  /* These tell filesystem methods that we are dealing with the final component... */
> > -#define LOOKUP_OPEN		0x0100	/* ... in open */
> > -#define LOOKUP_CREATE		0x0200	/* ... in object creation */
> > -#define LOOKUP_EXCL		0x0400	/* ... in target must not exist */
> > -#define LOOKUP_RENAME_TARGET	0x0800	/* ... in destination of rename() */
> > +#define LOOKUP_OPEN		BIT(16)	/* ... in open */
> > +#define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
> > +#define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
> > +#define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
> >  
> >  #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> >  				 LOOKUP_RENAME_TARGET)
> > -
> > -/* internal use only */
> > -#define LOOKUP_PARENT		0x0010
> > +/* 4 spare bits for intent */
> >  
> >  /* Scoping flags for lookup. */
> > -#define LOOKUP_NO_SYMLINKS	0x010000 /* No symlink crossing. */
> > -#define LOOKUP_NO_MAGICLINKS	0x020000 /* No nd_jump_link() crossing. */
> > -#define LOOKUP_NO_XDEV		0x040000 /* No mountpoint crossing. */
> > -#define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
> > -#define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
> > -#define LOOKUP_CACHED		0x200000 /* Only do cached lookup */
> > -#define LOOKUP_LINKAT_EMPTY	0x400000 /* Linkat request with empty path. */
> > +#define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
> > +#define LOOKUP_NO_MAGICLINKS	BIT(25) /* No nd_jump_link() crossing. */
> > +#define LOOKUP_NO_XDEV		BIT(26) /* No mountpoint crossing. */
> > +#define LOOKUP_BENEATH		BIT(27) /* No escaping from starting point. */
> > +#define LOOKUP_IN_ROOT		BIT(28) /* Treat dirfd as fs root. */
> >  /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
> >  #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
> > +/* 3 spare bits for scoping */
> >  
> >  extern int path_pts(struct path *path);
> >  
> > -- 
> > 2.47.1
> > 
> 


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-06 13:49   ` Christian Brauner
@ 2025-02-07  1:28     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  1:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> > lookup_and_lock() combines locking the directory and performing a lookup
> > prior to a change to the directory.
> > Abstracting this prepares for changing the locking requirements.
> > 
> > done_lookup_and_lock() provides the inverse of putting the dentry and
> > unlocking.
> > 
> > For "silly_rename" we will need to lookup_and_lock() in a directory that
> > is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
> > 
> > Like lookup_len_qstr(), lookup_and_lock() returns -ENOENT if
> > LOOKUP_CREATE was NOT given and the name cannot be found, and returns
> > -EEXIST if LOOKUP_EXCL WAS given and the name CAN be found.
> > 
> > These functions replace all uses of lookup_one_qstr() in namei.c
> > except for those used for rename.
> > 
> > The name might seem backwards as the lock happens before the lookup.
> > A future patch will change this so that only a shared lock is taken
> > before the lookup, and an exclusive lock on the dentry is taken after a
> > successful lookup.  So the order "lookup" then "lock" will make sense.
> > 
> > This functionality is exported as lookup_and_lock_one() which takes a
> > name and len rather than a qstr.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/namei.c            | 102 ++++++++++++++++++++++++++++--------------
> >  include/linux/namei.h |  15 ++++++-
> >  2 files changed, 83 insertions(+), 34 deletions(-)
> > 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 69610047f6c6..3c0feca081a2 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1715,6 +1715,41 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
> >  }
> >  EXPORT_SYMBOL(lookup_one_qstr);
> >  
> > +static struct dentry *lookup_and_lock_nested(const struct qstr *last,
> > +					     struct dentry *base,
> > +					     unsigned int lookup_flags,
> > +					     unsigned int subclass)
> > +{
> > +	struct dentry *dentry;
> > +
> > +	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
> > +		inode_lock_nested(base->d_inode, subclass);
> > +
> > +	dentry = lookup_one_qstr(last, base, lookup_flags);
> > +	if (IS_ERR(dentry) && !(lookup_flags & LOOKUP_PARENT_LOCKED)) {
> > +			inode_unlock(base->d_inode);
> 
> Nit: The indentation here is wrong and the {} aren't common practice.

Thanks.

> 
> > +	}
> > +	return dentry;
> > +}
> > +
> > +static struct dentry *lookup_and_lock(const struct qstr *last,
> > +				      struct dentry *base,
> > +				      unsigned int lookup_flags)
> > +{
> > +	return lookup_and_lock_nested(last, base, lookup_flags,
> > +				      I_MUTEX_PARENT);
> > +}
> > +
> > +void done_lookup_and_lock(struct dentry *base, struct dentry *dentry,
> > +			  unsigned int lookup_flags)
> 
> Did you mean done_lookup_and_unlock()?

No.  The thing that we are done with is "lookup_and_lock()".
This matches "done_path_create()" which doesn't create anything.

On the other hand we have d_lookup_done() which puts _done at the end.
Or end_name_hash().  ->write_end(), finish_automount()

I guess I could accept done_lookup_and_unlock() if you prefer that.
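
Whatever the final name, the pairing itself can be illustrated with a
small userspace model.  Everything below is hypothetical (the toy_*
names, the lock counter, and the made-up value of TOY_PARENT_LOCKED
standing in for LOOKUP_PARENT_LOCKED); it only shows that the "done"
call must be given the same flags as the lookup so that the lock stays
balanced.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for LOOKUP_PARENT_LOCKED; the value is made up. */
#define TOY_PARENT_LOCKED 0x1u

/* Hypothetical toy types: a lock counter and a refcounted child. */
struct toy_dir { int lock_depth; int has_child; };
struct toy_dentry { int refs; };

static struct toy_dentry the_child;

/* Returns the child with the directory locked, or NULL with it unlocked. */
static struct toy_dentry *toy_lookup_and_lock(struct toy_dir *dir,
					      unsigned int flags)
{
	if (!(flags & TOY_PARENT_LOCKED))
		dir->lock_depth++;
	if (!dir->has_child) {
		/* Failure: undo only the lock that was taken here. */
		if (!(flags & TOY_PARENT_LOCKED))
			dir->lock_depth--;
		return NULL;
	}
	the_child.refs++;
	return &the_child;
}

/* The inverse: drop the reference, then the lock if it was taken above. */
static void toy_done_lookup_and_lock(struct toy_dir *dir,
				     struct toy_dentry *d,
				     unsigned int flags)
{
	d->refs--;
	if (!(flags & TOY_PARENT_LOCKED))
		dir->lock_depth--;
}

/* Call this from main() to run the sketch. */
void demo_pairing(void)
{
	struct toy_dir dir = { .lock_depth = 0, .has_child = 1 };
	struct toy_dentry *d;

	/* Normal case: the helper takes and releases the lock. */
	d = toy_lookup_and_lock(&dir, 0);
	assert(d && dir.lock_depth == 1);
	toy_done_lookup_and_lock(&dir, d, 0);
	assert(dir.lock_depth == 0 && the_child.refs == 0);

	/* Caller already holds the lock: the depth must not change. */
	dir.lock_depth = 1;
	d = toy_lookup_and_lock(&dir, TOY_PARENT_LOCKED);
	assert(d && dir.lock_depth == 1);
	toy_done_lookup_and_lock(&dir, d, TOY_PARENT_LOCKED);
	assert(dir.lock_depth == 1);
}
```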

> 
> > +{
> > +	d_lookup_done(dentry);
> > +	dput(dentry);
> > +	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
> > +		inode_unlock(base->d_inode);
> > +}
> > +EXPORT_SYMBOL(done_lookup_and_lock);
> > +
> >  /**
> >   * lookup_fast - do fast lockless (but racy) lookup of a dentry
> >   * @nd: current nameidata
> > @@ -2754,12 +2789,9 @@ static struct dentry *__kern_path_locked(int dfd, struct filename *name, struct
> >  		path_put(path);
> >  		return ERR_PTR(-EINVAL);
> >  	}
> > -	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
> > -	d = lookup_one_qstr(&last, path->dentry, 0);
> > -	if (IS_ERR(d)) {
> > -		inode_unlock(path->dentry->d_inode);
> > +	d = lookup_and_lock(&last, path->dentry, 0);
> > +	if (IS_ERR(d))
> >  		path_put(path);
> > -	}
> >  	return d;
> >  }
> >  
> > @@ -3053,6 +3085,22 @@ struct dentry *lookup_positive_unlocked(const char *name,
> >  }
> >  EXPORT_SYMBOL(lookup_positive_unlocked);
> >  
> > +struct dentry *lookup_and_lock_one(struct mnt_idmap *idmap,
> > +				   const char *name, int len, struct dentry *base,
> > +				   unsigned int lookup_flags)
> > +{
> > +	struct qstr this;
> > +	int err;
> > +
> > +	if (!idmap)
> > +		idmap = &nop_mnt_idmap;
> 
> The callers should pass nop_mnt_idmap. That's how every function that
> takes this argument works. This is a lot more explicit than magically
> fixing this up in the function.

OK.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-06 13:15   ` Christian Brauner
@ 2025-02-07  1:46     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  1:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:46PM +1100, NeilBrown wrote:
> > These "_async" versions of various inode operations are only guaranteed
> > a shared lock on the directory but if the directory isn't exclusively
> > locked then they are guaranteed an exclusive lock on the dentry within
> > the directory (which will be implemented in a later patch).
> > 
> > This will allow a graceful transition from exclusive to shared locking
> > for directory updates, and even to async updates which can complete with
> > no lock on the directory - only on the dentry.
> > 
> > mkdir_async is a bit different as it optionally returns a new dentry
> > for cases when the filesystem is not able to use the original dentry.
> > This allows vfs_mkdir_return() to avoid the need for an extra lookup.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  Documentation/filesystems/locking.rst |  51 ++++++++-
> >  Documentation/filesystems/porting.rst |  10 ++
> >  Documentation/filesystems/vfs.rst     |  24 +++++
> >  fs/namei.c                            | 142 +++++++++++++++++++++-----
> >  include/linux/fs.h                    |  24 +++++
> >  5 files changed, 223 insertions(+), 28 deletions(-)
> > 
> > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > index d20a32b77b60..adeead366332 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -62,15 +62,24 @@ inode_operations
> >  prototypes::
> >  
> >  	int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
> > +	int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool, struct dirop_ret *);
> 
> If we end up doing this then imho the correct thing to do would be to
> extend the existing operations. Yes, that's more work I know as I've
> done that multiple times myself and it's a bit more annoying churn but
> we shouldn't just keep adding new methods without a good reason.
> 
> I assume that you've done that mostly so that you wouldn't be held up by
> menial work for the prototype. That's obviously fine. But for the final
> thing we should just fixup everyone.

I did it this way because it follows a pattern I've seen before.
 readdir -> iterate -> iterate_shared
 ioctl -> unlocked_ioctl

There are three changes happening here:

1/ add "struct dirop_ret *ret" to the end of each function.  That could
  certainly be done across all filesystems in one patch
2/ change mkdir to return a dentry.  This might be doable in a single
  patch if NFS is the only filesystem affected.  The change is
  sufficiently intrusive that maintainers would want to review it
  carefully and might want to land it through their own tree.  But I
  suspect there are other filesystems that would be affected and I think
  it would be prohibitive to try to land this sort of change to multiple
  filesystems in a single patch.
3/ change these functions to work with only a shared lock on the
   directory.  I could try to do something a bit like what Linus
   did in
     Commit 3e3271549670 ("vfs: get rid of old '->iterate' directory operation")
   but the circumstances are quite different and the excuses he used
   there don't apply.  Also I would need to add an i_rwsem to every
   inode which is unlikely to go down well with the maintainers.  I'm
   sure the active ones would want to manage that change themselves.

So while I hope we get to the point of discarding all the non-async
operations in a little less than the 7 years that it took to get rid of
->iterate, I don't see how to make the change without introducing new
inode_operations.
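
As an illustration of the "struct dirop_ret *" argument from point 1/,
here is a minimal userspace sketch of the -EINPROGRESS protocol as the
patch's locking.rst hunk describes it.  The struct layout is copied from
that hunk; struct toy_waiter, toy_create_async() and toy_complete() are
hypothetical and do not correspond to any function in the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dentry;	/* opaque here; only the pointer type is needed */

/* Layout copied from the locking.rst hunk in the quoted patch. */
struct dirop_ret {
	union {
		int err;
		struct dentry *dentry;
	};
	void (*done_cb)(struct dirop_ret *);
};

/* Hypothetical caller-side wrapper that records the completion. */
struct toy_waiter {
	struct dirop_ret ret;
	int completed;
	int result;
};

/* done_cb: recover the wrapper from the embedded dirop_ret. */
static void toy_done(struct dirop_ret *ret)
{
	struct toy_waiter *w = (struct toy_waiter *)
		((char *)ret - offsetof(struct toy_waiter, ret));

	w->result = ret->err;
	w->completed = 1;
}

/* The one operation this toy "filesystem" can have in flight. */
static struct dirop_ret *toy_pending;

/* Hypothetical op: queue the work and report that it continues. */
static int toy_create_async(struct dirop_ret *ret)
{
	toy_pending = ret;
	return -EINPROGRESS;
}

/* Hypothetical completion path, e.g. when the backing store replies. */
static void toy_complete(int err)
{
	struct dirop_ret *ret = toy_pending;

	toy_pending = NULL;
	ret->err = err;		/* must not be -EINPROGRESS */
	ret->done_cb(ret);
}

/* Call this from main() to run the sketch. */
void demo_async(void)
{
	struct toy_waiter w = { .ret = { .done_cb = toy_done } };

	assert(toy_create_async(&w.ret) == -EINPROGRESS);
	assert(!w.completed);	/* the directory lock could be dropped here */
	toy_complete(0);
	assert(w.completed && w.result == 0);
}
```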


> 
> >  	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
> >  	int (*link) (struct dentry *,struct inode *,struct dentry *);
> > +	int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
> >  	int (*unlink) (struct inode *,struct dentry *);
> > +	int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
> >  	int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> > +	int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *m , struct dirop_ret *);
> >  	int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> > +	struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
> >  	int (*rmdir) (struct inode *,struct dentry *);
> > +	int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
> >  	int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> > +	int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
> >  	int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
> >  			struct inode *, struct dentry *, unsigned int);
> > +	int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> > +			struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
> >  	int (*readlink) (struct dentry *, char __user *,int);
> >  	const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
> >  	void (*truncate) (struct inode *);
> > @@ -84,6 +93,9 @@ prototypes::
> >  	int (*atomic_open)(struct inode *, struct dentry *,
> >  				struct file *, unsigned open_flag,
> >  				umode_t create_mode);
> > +	int (*atomic_open_async)(struct inode *, struct dentry *,
> > +				struct file *, unsigned open_flag,
> > +				umode_t create_mode, struct dirop_ret *);
> >  	int (*tmpfile) (struct mnt_idmap *, struct inode *,
> >  			struct file *, umode_t);
> >  	int (*fileattr_set)(struct mnt_idmap *idmap,
> > @@ -95,18 +107,33 @@ prototypes::
> >  locking rules:
> >  	all may block
> >  
> > +All directory-modifying operations are called with an exclusive lock on
> > +the target dentry or dentries using DCACHE_PAR_LOOKUP.  This allows the
> > +shared lock on i_rwsem for the _async ops to be safe.  The lock on
> > +i_rwsem may be dropped as soon as the op returns, though if it returns
> > +-EINPROGRESS the lock using DCACHE_PAR_UPDATE will not be dropped until
> > +the callback is called.
> > +
> >  ==============	==================================================
> >  ops		i_rwsem(inode)
> >  ==============	==================================================
> >  lookup:		shared
> >  create:		exclusive
> > +create_async:	shared
> >  link:		exclusive (both)
> > +link_async:	exclusive on source, shared on target
> >  mknod:		exclusive
> > +mknod_async:	shared
> >  symlink:	exclusive
> > +symlink_async:	shared
> >  mkdir:		exclusive
> > +mkdir_async:	shared
> >  unlink:		exclusive (both)
> > +unlink_async:	exclusive on object, shared on directory/name
> >  rmdir:		exclusive (both)(see below)
> > +rmdir_async:	exclusive on object, shared on directory/name (see below)
> >  rename:		exclusive (both parents, some children)	(see below)
> > +rename_async:	shared (both parents) exclusive (some children)	(see below)
> >  readlink:	no
> >  get_link:	no
> >  setattr:	exclusive
> > @@ -118,6 +145,7 @@ listxattr:	no
> >  fiemap:		no
> >  update_time:	no
> >  atomic_open:	shared (exclusive if O_CREAT is set in open flags)
> > +atomic_open_async:	shared (if O_CREAT is not set, then may not have exclusive lock on name)
> >  tmpfile:	no
> >  fileattr_get:	no or exclusive
> >  fileattr_set:	exclusive
> > @@ -125,8 +153,10 @@ get_offset_ctx  no
> >  ==============	==================================================
> >  
> >  
> > -	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
> > -	exclusive on victim.
> > +	Additionally, ->rmdir(), ->unlink() and ->rename(), as well as _async
> > +	versions, have ->i_rwsem exclusive on victim.  This exclusive lock
> > +        may be dropped when the op completes even if the async operation is
> > +        continuing.
> >  	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
> >  	->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
> >  	involved.
> > @@ -135,6 +165,23 @@ get_offset_ctx  no
> >  See Documentation/filesystems/directory-locking.rst for more detailed discussion
> >  of the locking scheme for directory operations.
> >  
> > +The _async operations will be passed a (non-NULL) struct dirop_ret pointer::
> > +
> > +	struct dirop_ret {
> > +		union {
> > +			int err;
> > +			struct dentry *dentry;
> > +		};
> > +		void (*done_cb)(struct dirop_ret*);
> > +	};
> > +
> > +They may return -EINPROGRESS (or ERR_PTR(-EINPROGRESS)) in which case
> > +the op will continue asynchronously.  When it completes the result,
> > +which must NOT be -EINPROGRESS, is stored in err or dentry (as
> > +appropriate) and the done_cb() function is called.  Callers can only
> > +make use of the asynchrony when they determine that no lock need be held
> > +on i_rwsem.
> > +
> >  xattr_handler operations
> >  ========================
> >  
> > diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> > index 1639e78e3146..a736c9f30d9d 100644
> > --- a/Documentation/filesystems/porting.rst
> > +++ b/Documentation/filesystems/porting.rst
> > @@ -1157,3 +1157,13 @@ in normal case it points into the pathname being looked up.
> >  NOTE: if you need something like full path from the root of filesystem,
> >  you are still on your own - this assists with simple cases, but it's not
> >  magic.
> > +
> > +---
> > +
> > +**recommended**
> > +
> > +create_async, link_async, unlink_async, rmdir_async, mknod_async,
> > +rename_async, atomic_open_async can be provided instead of the
> > +corresponding inode_operations with the "_async" suffix.  Multiple
> > +_async operations can be performed in a given directory concurrently,
> > +but never on the same name.
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 31eea688609a..e18655054e6c 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -491,15 +491,24 @@ As of kernel 2.6.22, the following members are defined:
> >  
> >  	struct inode_operations {
> >  		int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
> > +		int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool, struct dirop_ret *);
> >  		struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
> >  		int (*link) (struct dentry *,struct inode *,struct dentry *);
> > +		int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
> >  		int (*unlink) (struct inode *,struct dentry *);
> > +		int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
> >  		int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> > +		int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
> >  		int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> > +		struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
> >  		int (*rmdir) (struct inode *,struct dentry *);
> > +		int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
> >  		int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> > +		int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
> >  		int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
> >  			       struct inode *, struct dentry *, unsigned int);
> > +		int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> > +			       struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
> >  		int (*readlink) (struct dentry *, char __user *,int);
> >  		const char *(*get_link) (struct dentry *, struct inode *,
> >  					 struct delayed_call *);
> > @@ -511,6 +520,8 @@ As of kernel 2.6.22, the following members are defined:
> >  		void (*update_time)(struct inode *, struct timespec *, int);
> >  		int (*atomic_open)(struct inode *, struct dentry *, struct file *,
> >  				   unsigned open_flag, umode_t create_mode);
> > +		int (*atomic_open_async)(struct inode *, struct dentry *, struct file *,
> > +				   unsigned open_flag, umode_t create_mode, struct dirop_ret *);
> >  		int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
> >  		struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
> >  	        int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
> > @@ -524,6 +535,7 @@ Again, all methods are called without any locks being held, unless
> >  otherwise noted.
> >  
> >  ``create``
> > +``create_async``
> >  	called by the open(2) and creat(2) system calls.  Only required
> >  	if you want to support regular files.  The dentry you get should
> >  	not have an inode (i.e. it should be a negative dentry).  Here
> > @@ -546,29 +558,39 @@ otherwise noted.
> >  	directory inode semaphore held
> >  
> >  ``link``
> > +``link_async``
> >  	called by the link(2) system call.  Only required if you want to
> >  	support hard links.  You will probably need to call
> >  	d_instantiate() just as you would in the create() method
> >  
> >  ``unlink``
> > +``unlink_async``
> >  	called by the unlink(2) system call.  Only required if you want
> >  	to support deleting inodes
> >  
> >  ``symlink``
> > +``symlink_async``
> >  	called by the symlink(2) system call.  Only required if you want
> >  	to support symlinks.  You will probably need to call
> >  	d_instantiate() just as you would in the create() method
> >  
> >  ``mkdir``
> > +``mkdir_async``
> >  	called by the mkdir(2) system call.  Only required if you want
> >  	to support creating subdirectories.  You will probably need to
> >  	call d_instantiate() just as you would in the create() method
> >  
> > +	mkdir_async can return an alternate dentry, much like lookup.
> > +	In this case the original dentry will still be negative and will
> > +	be unhashed.
> > +
> >  ``rmdir``
> > +``rmdir_async``
> >  	called by the rmdir(2) system call.  Only required if you want
> >  	to support deleting subdirectories
> >  
> >  ``mknod``
> > +``mknod_async``
> >  	called by the mknod(2) system call to create a device (char,
> >  	block) inode or a named pipe (FIFO) or socket.  Only required if
> >  	you want to support creating these types of inodes.  You will
> > @@ -576,6 +598,7 @@ otherwise noted.
> >  	create() method
> >  
> >  ``rename``
> > +``rename_async``
> >  	called by the rename(2) system call to rename the object to have
> >  	the parent and name given by the second inode and dentry.
> >  
> > @@ -647,6 +670,7 @@ otherwise noted.
> >  	itself and call mark_inode_dirty_sync.
> >  
> >  ``atomic_open``
> > +``atomic_open_async``
> >  	called on the last component of an open.  Using this optional
> >  	method the filesystem can look up, possibly create and open the
> >  	file in one atomic operation.  If it wants to leave actual
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3c0feca081a2..eadde9de73bf 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -123,6 +123,41 @@
> >   * PATH_MAX includes the nul terminator --RR.
> >   */
> >  
> > +static void dirop_done_cb(struct dirop_ret *dret)
> > +{
> > +	wake_up_var(dret);
> > +}
> > +
> > +#define DO_DIROP(dir, op, ...)						\
> > +	({								\
> > +		 struct dirop_ret dret;					\
> > +		 int ret;						\
> > +		 dret.err = -EINPROGRESS;				\
> > +		 dret.done_cb = dirop_done_cb;				\
> > +		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
> > +		 if (ret == -EINPROGRESS) {				\
> > +			 wait_var_event(&dret,				\
> > +					dret.err != -EINPROGRESS);	\
> > +			 ret = dret.err;				\
> > +		 }							\
> > +		 ret;							\
> > +	})
> > +
> > +#define DO_DE_DIROP(dir, op, ...)					\
> > +	({								\
> > +		 struct dirop_ret dret;					\
> > +		 struct dentry *ret;					\
> > +		 dret.dentry = ERR_PTR(-EINPROGRESS);			\
> > +		 dret.done_cb = dirop_done_cb;				\
> > +		 ret = (dir)->i_op->op(__VA_ARGS__, &dret);		\
> > +		 if (ret == ERR_PTR(-EINPROGRESS)) {			\
> > +			 wait_var_event(&dret,				\
> > +					dret.dentry != ERR_PTR(-EINPROGRESS));	\
> > +			 ret = dret.dentry;				\
> > +		 }							\
> > +		 ret;							\
> > +	})
> 
> We should also try to avoid these ugly wrappers. That'll be easier if we
> don't have multiple methods as well.

I don't think that the multiple methods make a whole lot of difference
here.  The same code would be in the function, it is just indented one
more level when there are multiple functions.  But if you would prefer
all the duplication of boiler-plate I can do it that way.

Thanks,
NeilBrown


* Re: [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops
  2025-02-06 13:22   ` Christian Brauner
@ 2025-02-07  2:01     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  2:01 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:47PM +1100, NeilBrown wrote:
> > If a filesystem supports _async ops for some directory ops we can take a
> > "shared" lock on i_rwsem otherwise we must take an "exclusive" lock.  As
> > the filesystem may support some async ops but not others we need to
> > easily determine which.
> > 
> > With this patch we group the ops into 4 groups that are likely be
> > supported together:
> > 
> > CREATE: create, link, mkdir, mknod
> > REMOVE: rmdir, unlink
> > RENAME: rename
> > OPEN: atomic_open, create
> > 
> > and set S_ASYNC_XXX for each when the inode in initialised.
> > 
> > We also add a LOOKUP_REMOVE intent flag which will be used by locking
> > interfaces to help know which group is being used.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/dcache.c           | 24 ++++++++++++++++++++++++
> >  include/linux/fs.h    |  5 +++++
> >  include/linux/namei.h |  5 +++--
> >  3 files changed, 32 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index e49607d00d2d..37c0f655166d 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -384,6 +384,27 @@ static inline void __d_set_inode_and_type(struct dentry *dentry,
> >  	smp_store_release(&dentry->d_flags, flags);
> >  }
> >  
> > +static void set_inode_flags(struct inode *inode)
> > +{
> > +	const struct inode_operations *i_op = inode->i_op;
> > +
> > +	lockdep_assert_held(&inode->i_lock);
> > +	if ((i_op->create_async || !i_op->create) &&
> > +	    (i_op->link_async || !i_op->link) &&
> > +	    (i_op->symlink_async || !i_op->symlink) &&
> > +	    (i_op->mkdir_async || !i_op->mkdir) &&
> > +	    (i_op->mknod_async || !i_op->mknod))
> > +		inode->i_flags |= S_ASYNC_CREATE;
> > +	if ((i_op->unlink_async || !i_op->unlink) &&
> > +	    (i_op->mkdir_async || !i_op->mkdir))
> > +		inode->i_flags |= S_ASYNC_REMOVE;
> > +	if (i_op->rename_async)
> > +		inode->i_flags |= S_ASYNC_RENAME;
> > +	if (i_op->atomic_open_async ||
> > +	    (!i_op->atomic_open && i_op->create_async))
> > +		inode->i_flags |= S_ASYNC_OPEN;
> > +}
> 
> I think this is unpleasant. As I said we should fold _async into the
> normal methods. Then we can add:
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index be3ad155ec9f..1d19f72448fc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2186,6 +2186,7 @@ int wrap_directory_iterator(struct file *, struct dir_context *,
>         { return wrap_directory_iterator(file, ctx, x); }
> 
>  struct inode_operations {
> +       iop_flags_t iop_flags;
>         struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
>         const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
>         int (*permission) (struct mnt_idmap *, struct inode *, int);
> 
> which is similar to what I did for
> 
> struct file_operations {
>         struct module *owner;
>         fop_flags_t fop_flags;
> 
> and introduce
> 
> IOP_ASYNC_CREATE
> IOP_ASYNC_OPEN
> 

Ahh - I see where you are going.  Interesting.
The iop_flags effectively provides versioning for the functions so we
don't have to embed the version in the name.  That would work.
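
As a rough user-space sketch of that idea (not kernel code): the
IOP_ASYNC_* names come from the suggestion quoted above, but their bit
values, the width of iop_flags_t and the *_sketch struct names are
assumptions made up purely for illustration.  The S_ASYNC_* values
mirror the BIT(18) and BIT(21) definitions in the patch.

```c
#include <stdint.h>

/* Assumed width; the proposal does not say what iop_flags_t is. */
typedef uint32_t iop_flags_t;

/* Hypothetical bit assignments for the proposed per-ops flags. */
#define IOP_ASYNC_CREATE	(1u << 0)
#define IOP_ASYNC_OPEN		(1u << 1)

/* Same bit positions as S_ASYNC_CREATE / S_ASYNC_OPEN in the patch. */
#define S_ASYNC_CREATE		(1u << 18)
#define S_ASYNC_OPEN		(1u << 21)

/* Minimal stand-ins for struct inode_operations and struct inode. */
struct inode_operations_sketch {
	iop_flags_t iop_flags;
};

struct inode_sketch {
	const struct inode_operations_sketch *i_op;
	unsigned int i_flags;
};

/*
 * Derive the inode flags from what the filesystem declared in
 * iop_flags, rather than from which method pointers are non-NULL.
 */
static void set_inode_flags(struct inode_sketch *inode)
{
	iop_flags_t f = inode->i_op->iop_flags;

	if (f & IOP_ASYNC_CREATE)
		inode->i_flags |= S_ASYNC_CREATE;
	if (f & IOP_ASYNC_OPEN)
		inode->i_flags |= S_ASYNC_OPEN;
}
```

With this shape the method names stay unchanged and the flag word alone
tells the VFS which calling convention a filesystem has opted in to.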

I guess we would handle the mkdir change by changing every current mkdir
to return ERR_PTR() of its current return value.  The vfs_mkdir_xx
caller would then check whether that is NULL while the original dentry
is still negative, and perform the lookup if so.

Thanks,
NeilBrown


> etc and then filesystems can just do:
> 
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index df9669d4ded7..90c7aeb49466 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -10859,6 +10859,7 @@ static void nfs4_disable_swap(struct inode *inode)
>  }
> 
>  static const struct inode_operations nfs4_dir_inode_operations = {
> +       .iop_flags      = IOP_ASYNC_CREATE | IOP_ASYNC_OPEN,
>         .create         = nfs_create,
>         .lookup         = nfs_lookup,
>         .atomic_open    = nfs_atomic_open,
> 
> and then you can raise S_ASYNC_OPEN and so on based on the flags, not
> the individual methods.
> 
> > +
> >  static inline void __d_clear_type_and_inode(struct dentry *dentry)
> >  {
> >  	unsigned flags = READ_ONCE(dentry->d_flags);
> > @@ -1893,6 +1914,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
> >  	raw_write_seqcount_begin(&dentry->d_seq);
> >  	__d_set_inode_and_type(dentry, inode, add_flags);
> >  	raw_write_seqcount_end(&dentry->d_seq);
> > +	set_inode_flags(inode);
> >  	fsnotify_update_flags(dentry);
> >  	spin_unlock(&dentry->d_lock);
> >  }
> > @@ -1999,6 +2021,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
> >  
> >  		spin_lock(&new->d_lock);
> >  		__d_set_inode_and_type(new, inode, add_flags);
> > +		set_inode_flags(inode);
> >  		hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
> >  		if (!disconnected) {
> >  			hlist_bl_lock(&sb->s_roots);
> > @@ -2701,6 +2724,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
> >  		raw_write_seqcount_begin(&dentry->d_seq);
> >  		__d_set_inode_and_type(dentry, inode, add_flags);
> >  		raw_write_seqcount_end(&dentry->d_seq);
> > +		set_inode_flags(inode);
> >  		fsnotify_update_flags(dentry);
> >  	}
> >  	__d_rehash(dentry);
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index e414400c2487..9a9282fef347 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -2361,6 +2361,11 @@ struct super_operations {
> >  #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
> >  #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
> >  
> > +#define S_ASYNC_CREATE	BIT(18)	/* create, link, symlink, mkdir, mknod all _async */
> > +#define S_ASYNC_REMOVE	BIT(19)	/* unlink, mkdir both _async */
> > +#define S_ASYNC_RENAME	BIT(20) /* rename_async supported */
> > +#define S_ASYNC_OPEN	BIT(21) /* atomic_open_async or create_async supported */
> > +
> >  /*
> >   * Note that nosuid etc flags are inode-specific: setting some file-system
> >   * flags just means all the inodes inherit those flags by default. It might be
> > diff --git a/include/linux/namei.h b/include/linux/namei.h
> > index 76c587a5ec3a..72e351640406 100644
> > --- a/include/linux/namei.h
> > +++ b/include/linux/namei.h
> > @@ -40,10 +40,11 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
> >  #define LOOKUP_CREATE		BIT(17)	/* ... in object creation */
> >  #define LOOKUP_EXCL		BIT(18)	/* ... in target must not exist */
> >  #define LOOKUP_RENAME_TARGET	BIT(19)	/* ... in destination of rename() */
> > +#define LOOKUP_REMOVE		BIT(20)	/* ... in target of object removal */
> >  
> >  #define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> > -				 LOOKUP_RENAME_TARGET)
> > -/* 4 spare bits for intent */
> > +				 LOOKUP_RENAME_TARGET | LOOKUP_REMOVE)
> > +/* 3 spare bits for intent */
> >  
> >  /* Scoping flags for lookup. */
> >  #define LOOKUP_NO_SYMLINKS	BIT(24) /* No symlink crossing. */
> > -- 
> > 2.47.1
> > 
> 



* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-06 14:06   ` Christian Brauner
@ 2025-02-07  2:17     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  2:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:51PM +1100, NeilBrown wrote:
> > vfs_rmdir takes an exclusive lock on the target directory to ensure
> > nothing new is created in it while the rmdir progresses.  With the
> 
> It also excludes concurrent mount operations.

And it excludes chown and ACL changes.  I doubt those are important.  I
do need to check mount/unmount.

> 
> > possibility of async updates continuing after the inode lock is dropped
> > we now need extra protection.
> > 
> > Any async updates will have DCACHE_PAR_UPDATE set on the dentry.  We
> > simply wait for that flag to be cleared on all children.
> > 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/dcache.c |  2 +-
> >  fs/namei.c  | 40 ++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 41 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index fb331596f1b1..90dee859d138 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -53,7 +53,7 @@
> >   *   - d_lru
> >   *   - d_count
> >   *   - d_unhashed()
> > - *   - d_parent and d_chilren
> > + *   - d_parent and d_children
> >   *   - childrens' d_sib and d_parent
> >   *   - d_u.d_alias, d_inode
> >   *
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3a107d6098be..e8a85c9f431c 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1839,6 +1839,27 @@ bool d_update_lock(struct dentry *dentry,
> >  	return true;
> >  }
> >  
> > +static void d_update_wait(struct dentry *dentry, unsigned int subclass)
> > +{
> > +	/* Note this may only ever be called in a context where we have
> > +	 * a lock preventing this dentry from becoming locked, possibly
> > +	 * an update lock on the parent dentry.  The must be a smp_mb()
> > +	 * after that lock is taken and before this is called so that
> > +	 * the following test is safe. d_update_lock() provides that
> > +	 * barrier.
> > +	 */
> > +	if (!(dentry->d_flags & DCACHE_PAR_UPDATE))
> > +		return
> > +	lock_acquire_exclusive(&dentry->d_update_map, subclass,
> > +			       0, NULL, _THIS_IP_);
> > +	spin_lock(&dentry->d_lock);
> > +	wait_var_event_spinlock(&dentry->d_flags,
> > +				!check_dentry_locked(dentry),
> > +				&dentry->d_lock);
> > +	spin_unlock(&dentry->d_lock);
> > +	lock_map_release(&dentry->d_update_map);
> > +}
> > +
> >  bool d_update_trylock(struct dentry *dentry,
> >  		      struct dentry *base,
> >  		      const struct qstr *last)
> > @@ -4688,6 +4709,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> >  		     struct dentry *dentry)
> >  {
> >  	int error = may_delete(idmap, dir, dentry, 1);
> > +	struct dentry *child;
> >  
> >  	if (error)
> >  		return error;
> > @@ -4697,6 +4719,24 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
> >  
> >  	dget(dentry);
> >  	inode_lock(dentry->d_inode);
> > +	/*
> > +	 * Some children of dentry might be active in an async update.
> > +	 * We need to wait for them.  New children cannot be locked
> > +	 * while the inode lock is held.
> > +	 */
> > +again:
> > +	spin_lock(&dentry->d_lock);
> > +	for (child = d_first_child(dentry); child;
> > +	     child = d_next_sibling(child)) {
> > +		if (child->d_flags & DCACHE_PAR_UPDATE) {
> > +			dget(child);
> > +			spin_unlock(&dentry->d_lock);
> > +			d_update_wait(child, I_MUTEX_CHILD);
> > +			dput(child);
> > +			goto again;
> > +		}
> > +	}
> > +	spin_unlock(&dentry->d_lock);
> 
> That looks like it can cause stalls when you call rmdir on a directory
> that has a lots of children and a larg-ish subset of them has pending
> async updates, no?
> 

It can certainly block waiting for other operations to complete, but
that is already the case when waiting for an exclusive lock on i_rwsem. 
Any thread that has already tried to get that lock might get it before
rmdir eventually succeeds.  So I don't think that is a behavioural
change.

I'm not concerned about walking the sibling list under a spinlock if the
list is very long.  Maybe I could periodically take a ref to the current
child, drop and reclaim the spinlock, and hopefully continue from there.
Doing that on a non-D_PAR_UPDATE dentry should be safe.  I wonder if
that complexity is worth it.
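
To put a rough number on the walking cost, here is a single-threaded
user-space model of the "goto again" loop from the patch.  It is only a
sketch: there is no locking or refcounting, struct child and both
function names are made up, and wait_for_update() simply pretends the
pending update has finished.

```c
#include <stddef.h>

/* Stand-in for a child dentry on the parent's sibling list. */
struct child {
	struct child *next;
	int par_update;		/* models DCACHE_PAR_UPDATE */
};

/* Models d_update_wait(): here the pending update just completes. */
static void wait_for_update(struct child *c)
{
	c->par_update = 0;
}

/*
 * Rescan from the head after every wait, as the patch does, and
 * return how many list entries were visited in total.
 */
static unsigned long wait_all_children(struct child *first)
{
	unsigned long visited = 0;
	struct child *c;

again:
	for (c = first; c; c = c->next) {
		visited++;
		if (c->par_update) {
			wait_for_update(c);
			goto again;
		}
	}
	return visited;
}
```

In this model each busy child forces one restart from the head, so the
number of entries visited grows with both the list length and the
number of children that had updates pending.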

Thanks,
NeilBrown


* Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
  2025-02-06 15:36 ` John Stoffel
@ 2025-02-07  2:18   ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-07  2:18 UTC (permalink / raw)
  To: John Stoffel
  Cc: Alexander Viro, Christian Brauner, Jan Kara, Linus Torvalds,
	Jeff Layton, Dave Chinner, linux-fsdevel, linux-kernel

On Fri, 07 Feb 2025, John Stoffel wrote:
> >>>>> "NeilBrown" == NeilBrown  <neilb@suse.de> writes:
> 
> > This is my latest attempt at removing the requirement for an exclusive
> > lock on a directory which performing updates in this.  This version,
> > inspired by Dave Chinner, goes a step further and allow async updates.
> 
> This initial sentence reads poorly to me.  I think you maybe are
> trying to say:
> 
>   This is my latest attempt to removing the requirement for writers to
>   have an exclusive lock on a directory when performing updates on
>   entries in that directory.  This allows for parallel updates by
>   multiple processes (connections? hosts? clients?) to improve scaling
>   of large filesystems. 
> 
> I get what you're trying to do here, and I applaud it!  I just
> struggled over the intro here.  

Yes, my intro was rather poorly worded.  I think your version is much
better.  Thanks.

NeilBrown

> 
> 
> > The inode operation still requires the inode lock, at least a shared
> > lock, but may return -EINPROGRES and then continue asynchronously
> > without needing any ongoing lock on the directory.
> 
> > An exclusive lock on the dentry is held across the entire operation.
> 
> > This change requires various extra checks.  rmdir must ensure there is
> > no async creation still happening.  rename between directories must
> > ensure non of the relevant ancestors are undergoing async rename.  There
> > may be or checks that I need to consider - mounting?
> 
> > One other important change since my previous posting is that I've
> > dropped the idea of taking a separate exclusive lock on the directory
> > when the fs doesn't support shared locking.  This cannot work as it
> > doeesn't prevent lookups and filesystems don't expect a lookup while
> > they are changing a directory.  So instead we need to choose between
> > exclusive or shared for the inode on a case-by-case basis.
> 
> > To make this choice we divide all ops into four groups: create, remove,
> > rename, open/create.  If an inode has no operations in the group that
> > require an exclusive lock, then a flag is set on the inode so that
> > various code knows that a shared lock is sufficient.  If the flag is not
> > set, an exclusive lock is obtained.
> 
> > I've also added rename handling and converted NFS to use all _async ops.
> 
> > The motivation for this comes from the general increase in scale of
> > systems.  We can support very large directories and many-core systems
> > and applications that choose to use large directories can hit
> > unnecessary contention.
> 
> > NFS can easily hit this when used over a high-latency link.
> > Lustre already has code to allow concurrent directory updates in the
> > back-end filesystem (ldiskfs - a slightly modified ext4).
> > Lustre developers believe this would also benefit the client-side
> > filesystem with large core counts.
> 
> > The idea behind the async support is to eventually connect this to
> > io_uring so that one process can launch several concurrent directory
> > operations.  I have not looked deeply into io_uring and cannot be
> > certain that the interface I've provided will be able to be used.  I
> > would welcome any advice on that matter, though I hope to find time to
> > explore myself.  For now if any _async op returns -EINPROGRESS we simply
> > wait for the callback to indicate completion.
> 
> > Test status:  only light testing.  It doesn't easily blow up, but lockdep
> > complains that repeated calls to d_update_wait() are bad, even though
> > it has balanced acquire and release calls. Weird?
> 
> > Thanks,
> > NeilBrown
> 
> >  [PATCH 01/19] VFS: introduce vfs_mkdir_return()
> >  [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
> >  [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
> >  [PATCH 04/19] VFS: change kern_path_locked() and
> >  [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
> >  [PATCH 06/19] VFS: repack DENTRY_ flags.
> >  [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
> >  [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
> >  [PATCH 09/19] VFS: add _async versions of the various directory
> >  [PATCH 10/19] VFS: introduce inode flags to report locking needs for
> >  [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
> >  [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
> >  [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
> >  [PATCH 14/19] VFS: Ensure no async updates happening in directory
> >  [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
> >  [PATCH 16/19] VFS: add lookup_and_lock_rename()
> >  [PATCH 17/19] nfsd: use lookup_and_lock_one() and
> >  [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
> >  [PATCH 19/19] nfs: switch to _async for all directory ops.
> 
> 



* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-06  5:42 ` [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
@ 2025-02-07 19:32   ` Al Viro
  2025-02-10  4:58     ` NeilBrown
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-07 19:32 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

1) what's wrong with using middle bits of dentry as index?  What the hell
is that thing about pid for?

2) part in d_add_ci() might be worth a comment re d_lookup_done() coming
for the original dentry, no matter what.

3) the dance with conditional __wake_up() is worth a helper, IMO.


* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
  2025-02-06 12:24   ` Christian Brauner
  2025-02-06 13:52   ` Jeff Layton
@ 2025-02-07 19:45   ` Al Viro
  2025-02-10  4:36     ` NeilBrown
  2 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-07 19:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote:
> vfs_mkdir() does not guarantee to make the child dentry positive on
> success.  It may leave it negative and then the caller needs to perform a
> lookup to find the target dentry.
> 
> This patch introduced vfs_mkdir_return() which performs the lookup if
> needed so that this code is centralised.
> 
> This prepares for a new inode operation which will perform mkdir and
> returns the correct dentry.

* Calling conventions stink; make it _consume_ dentry reference and
return dentry reference or ERR_PTR().  Callers will be happier that way
(check it).

* Calling conventions should be documented in commit message *and* in
D/f/porting

* devpts, nfs4recover and xfs might as well convert (not going to hit
the "need a lookup" case anyway)

* that 
+                       /* Need a "const" pointer.  We know d_name is const
+                        * because we hold an exclusive lock on i_rwsem
+                        * in d_parent.
+                        */
+                       const struct qstr *d_name = (void*)&dentry->d_name;
+                       d = lookup_dcache(d_name, dentry->d_parent, 0);
+                       if (!d)
+                               d = __lookup_slow(d_name, dentry->d_parent, 0);
doesn't need a cast.  C is perfectly fine with
	T *x = foo();
	const T *y = x;

You are not allowed to _strip_ qualifiers; adding them is fine.
Same reason why you are allowed to pass char * to strlen() without
any casts whatsoever.

Comment re stability is fine; the cast is pure WTF material.
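
A minimal user-space illustration of that rule, in case it helps.
struct name_ref is a made-up stand-in for struct qstr, and name_len()
and demo_add_const() are hypothetical helpers; nothing here is kernel
code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct qstr. */
struct name_ref {
	const char *name;
	size_t len;
};

/* Takes a pointer-to-const, much as a lookup helper would. */
static size_t name_len(const struct name_ref *n)
{
	return n->len;
}

static size_t demo_add_const(struct name_ref *writable)
{
	/* Adding a qualifier is an implicit conversion: no cast needed. */
	const struct name_ref *ro = writable;

	/* Same rule: a plain char array decays and goes to strlen(const char *). */
	char buf[] = "dir";

	return name_len(ro) + strlen(buf);
}
```

Going the other way, from const struct name_ref * to struct name_ref *,
is the direction that needs an explicit cast, which is why that one
deserves suspicion in review.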


* Re: [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it.
  2025-02-06  5:42 ` [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it NeilBrown
  2025-02-06 14:30   ` Jeff Layton
@ 2025-02-07 20:01   ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-07 20:01 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:40PM +1100, NeilBrown wrote:
> -	dentry = d_alloc(base, name);
> -	if (unlikely(!dentry))
> +	dentry = d_alloc_parallel(base, name);
> +	if (unlikely(IS_ERR_OR_NULL(dentry)))
>  		return ERR_PTR(-ENOMEM);

Huh?  When does d_alloc_parallel() return NULL and why do you
play with explicit ERR_PTR(-ENOMEM) here?
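
For comparison, the usual shape when the callee reports failure only
through an error pointer is to test IS_ERR() and hand the value back
unchanged.  A user-space sketch follows: the three helpers are
simplified re-implementations of the err.h idiom, and alloc_entry() and
lookup_entry() are hypothetical stand-ins, not d_alloc_parallel()
itself.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO	4095

/* Simplified user-space versions of the kernel's err.h helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical allocator: fails with an error pointer, never NULL. */
static void *alloc_entry(int fail_with)
{
	static int slot;

	if (fail_with)
		return ERR_PTR(-fail_with);
	return &slot;
}

static void *lookup_entry(int fail_with)
{
	void *e = alloc_entry(fail_with);

	/* Pass the callee's error through; do not rewrite it to -ENOMEM. */
	if (IS_ERR(e))
		return e;

	/* ... further work on a valid entry would go here ... */
	return e;
}
```

Written this way a failure other than -ENOMEM reaches the caller
intact, and there is no NULL case to reason about.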

> +	if ((flags & LOOKUP_INTENT_FLAGS) == 0)

Yecchh...  Thank you (from all reviewers, I suspect) for the exciting
opportunity to verify what values are possible in lookup_flags in various
callers and which are guaranteed to intersect with your LOOKUP_INTENT_FLAGS
mask.

> +#define LOOKUP_INTENT_FLAGS	(LOOKUP_OPEN | LOOKUP_CREATE | LOOKUP_EXCL |	\
> +				 LOOKUP_RENAME_TARGET)
> +

... as well as figuring out WTF do LOOKUP_OPEN and LOOKUP_EXCL fit into
that.


* Re: [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
  2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
  2025-02-06 12:33   ` Christian Brauner
@ 2025-02-07 20:14   ` Al Viro
  2025-02-09 20:23   ` Al Viro
  2 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-07 20:14 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:42PM +1100, NeilBrown wrote:

> Callers of lookup_one_qstr() often check if the result is negative or
> positive.
> These changes can easily be moved into lookup_one_qstr() by checking the
> lookup flags:
> LOOKUP_CREATE means it is NOT an error if the name doesn't exist.
> LOOKUP_EXCL means it IS an error if the name DOES exist.
> 
> This patch adds these checks, then removes error checks from callers,
> and ensures that appropriate flags are passed.
> 
> This subtly changes the meaning of LOOKUP_EXCL.  Previously it could
> only accompany LOOKUP_CREATE.  Now it can accompany LOOKUP_RENAME_TARGET
> as well.  A couple of small changes are needed to accommodate this.  The
> NFS is functionally a no-op but ensures nfs_is_exclusive_create() does
> exactly what the name says.

Where's D/f/porting chunk?  Mind you, this one needs _very_ careful
review and testing - you are touching codepaths that are convoluted
as hell and rarely tested.


* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-06  5:42 ` [PATCH 08/19] VFS: introduce lookup_and_lock() and friends NeilBrown
  2025-02-06 13:49   ` Christian Brauner
@ 2025-02-07 20:22   ` Al Viro
  2025-02-08 23:18     ` Al Viro
  2025-02-12  4:49     ` NeilBrown
  1 sibling, 2 replies; 83+ messages in thread
From: Al Viro @ 2025-02-07 20:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> lookup_and_lock() combines locking the directory and performing a lookup
> prior to a change to the directory.
> Abstracting this prepares for changing the locking requirements.
> 
> done_lookup_and_lock() provides the inverse of putting the dentry and
> unlocking.
> 
> For "silly_rename" we will need to lookup_and_lock() in a directory that
> is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.

Ewww...  I do realize that such things might appear in intermediate
stages of locking massage, but they'd better be _GONE_ by the end of it.
Conditional locking of that sort is really asking for trouble.

If nothing else, better split the function in two variants and document
the differences; that kind of stuff really does not belong in arguments.
If you need it to exist through the series, that is - if not, you should
just leave lookup_one_qstr() for the "locked" case from the very beginning.

> This functionality is exported as lookup_and_lock_one() which takes a
> name and len rather than a qstr.

... for the sake of ...?


* Re: [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-06  5:42 ` [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc NeilBrown
@ 2025-02-07 20:28   ` Al Viro
  2025-02-07 20:35     ` Al Viro
  2025-02-08  1:30   ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-07 20:28 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:50PM +1100, NeilBrown wrote:

> +	if (dentry->d_flags & LOOKUP_RCU) {

Really?


* Re: [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-07 20:28   ` Al Viro
@ 2025-02-07 20:35     ` Al Viro
  0 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-07 20:35 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Feb 07, 2025 at 08:28:39PM +0000, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:50PM +1100, NeilBrown wrote:
> 
> > +	if (dentry->d_flags & LOOKUP_RCU) {
> 
> Really?

That aside, you are *NOT* passing the parent's name here - if you
look at the callers, all of them have 'name' and 'last' arguments
identical.  What is going on here?


* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-06  5:42 ` [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed NeilBrown
  2025-02-06 14:06   ` Christian Brauner
@ 2025-02-07 21:06   ` Al Viro
  2025-02-08 22:06     ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-07 21:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:51PM +1100, NeilBrown wrote:
> vfs_rmdir takes an exclusive lock on the target directory to ensure
> nothing new is created in it while the rmdir progresses.  With the
> possibility of async updates continuing after the inode lock is dropped
> we now need extra protection.
> 
> Any async updates will have DCACHE_PAR_UPDATE set on the dentry.  We
> simply wait for that flag to be cleared on all children.

> +static void d_update_wait(struct dentry *dentry, unsigned int subclass)
> +{
> +	/* Note this may only ever be called in a context where we have
> +	 * a lock preventing this dentry from becoming locked, possibly
> +	 * an update lock on the parent dentry.  The must be a smp_mb()
> +	 * after that lock is taken and before this is called so that
> +	 * the following test is safe. d_update_lock() provides that
> +	 * barrier.
> +	 */
> +	if (!(dentry->d_flags & DCACHE_PAR_UPDATE))
> +		return
> +	lock_acquire_exclusive(&dentry->d_update_map, subclass,
> +			       0, NULL, _THIS_IP_);

What the fuck?

> +	spin_lock(&dentry->d_lock);
> +	wait_var_event_spinlock(&dentry->d_flags,
> +				!check_dentry_locked(dentry),
> +				&dentry->d_lock);
> +	spin_unlock(&dentry->d_lock);
> +	lock_map_release(&dentry->d_update_map);
> +}

OK, I realize that it compiles, but it should've raised all
kinds of red flags for anyone reading that.  return + <newline> is
already fishy, but having the next line indented *less* than that
return is firmly in the "somebody's trying to hide something nasty
here" territory, even without parsing the damn thing.


* Re: [PATCH 16/19] VFS: add lookup_and_lock_rename()
  2025-02-06  5:42 ` [PATCH 16/19] VFS: add lookup_and_lock_rename() NeilBrown
@ 2025-02-07 21:21   ` Al Viro
  0 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-07 21:21 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:53PM +1100, NeilBrown wrote:
> @@ -3451,8 +3451,14 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
>  {
>  	struct dentry *p = p1, *q = p2, *r;
>  
> -	while ((r = p->d_parent) != p2 && r != p)
> +	/* Ensure d_update_wait() tests are safe - one barrier for all */
> +	smp_mb();
> +
> +	d_update_wait(p, I_MUTEX_NORMAL);
> +	while ((r = p->d_parent) != p2 && r != p) {
>  		p = r;
> +		d_update_wait(p, I_MUTEX_NORMAL);
> +	}
>  	if (r == p2) {
>  		// p is a child of p2 and an ancestor of p1 or p1 itself
>  		inode_lock_nested(p2->d_inode, I_MUTEX_PARENT);
> @@ -3461,8 +3467,11 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
>  	}
>  	// p is the root of connected component that contains p1
>  	// p2 does not occur on the path from p to p1
> -	while ((r = q->d_parent) != p1 && r != p && r != q)
> +	d_update_wait(q, I_MUTEX_NORMAL);
> +	while ((r = q->d_parent) != p1 && r != p && r != q) {
>  		q = r;
> +		d_update_wait(q, I_MUTEX_NORMAL);
> +	}

That makes no sense whatsoever.  What are you waiting on here and _why_
are you waiting on those suckers?  Especially since there's nothing
to prevent the condition for which you wait from arising immediately
afterwards.


* Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-06  5:42 ` [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations NeilBrown
  2025-02-06 13:15   ` Christian Brauner
@ 2025-02-07 22:41   ` Al Viro
  2025-02-09  1:09     ` Al Viro
  1 sibling, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-07 22:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:46PM +1100, NeilBrown wrote:
> These "_async" versions of various inode operations are only guaranteed
> a shared lock on the directory but if the directory isn't exclusively
> locked then they are guaranteed an exclusive lock on the dentry within
> the directory (which will be implemented in a later patch).
> 
> This will allow a graceful transition from exclusive to shared locking
> for directory updates, and even to async updates which can complete with
> no lock on the directory - only on the dentry.

I'm sorry, but I don't buy the "complete with no lock on directory"
part - not without a verifiable proof of correctness of the locking
scheme.  Especially if you are putting rename into the mix.

And your method prototypes pretty much bake that in.

*IF* we intend to try going that way (and I'm not at all convinced
that it's feasible - locking aside, there's also a shitload of fun
with fsnotify, audit, etc.), let's make those new methods take
a single argument - something like struct mkdir_args, etc., with
inlines for extracting individual arguments out of that.  Yes, it's
ugly, but it allows later changes without a massive headache on
each calling convention modification.
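
A minimal userspace sketch of that single-argument convention follows.
Everything in it is hypothetical: the reduced struct inode and struct
dentry, the fields of struct mkdir_args, the accessor names and
example_mkdir() are assumptions made for illustration, not a proposed
kernel interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel types. */
struct inode { unsigned long i_ino; };
struct dentry { const char *name; };

/*
 * All arguments travel in one bundle, so adding a field later does
 * not change the prototype of any method that takes it.
 */
struct mkdir_args {
	struct inode *dir;
	struct dentry *dentry;
	unsigned short mode;
};

/* Callees use accessors rather than touching the layout directly. */
static inline struct inode *mkdir_dir(const struct mkdir_args *a)
{
	return a->dir;
}

static inline struct dentry *mkdir_dentry(const struct mkdir_args *a)
{
	return a->dentry;
}

static inline unsigned short mkdir_mode(const struct mkdir_args *a)
{
	return a->mode;
}

/* Sketch of a method body that only ever sees the bundle. */
static int example_mkdir(const struct mkdir_args *args)
{
	if (!mkdir_dir(args) || !mkdir_dentry(args))
		return -EINVAL;
	return 0;
}

/* Usage: build the bundle once, pass one pointer. */
int example_mkdir_demo(void)
{
	struct inode dir = { .i_ino = 2 };
	struct dentry child = { .name = "foo" };
	struct mkdir_args args = { .dir = &dir, .dentry = &child, .mode = 0755 };

	assert(mkdir_mode(&args) == 0755);
	return example_mkdir(&args);
}
```

If a later revision needs, say, a completion context, only the struct
and one new accessor change; every method prototype stays as it is.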

Said that, an explicit description of locking scheme and a proof of
correctness (at least on the "it can't deadlock" level) is, IMO,
a hard requirement for the entire thing, async or no async.

We *do* have such for the current locking scheme.


* Re: [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-06  5:42 ` [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc NeilBrown
  2025-02-07 20:28   ` Al Viro
@ 2025-02-08  1:30   ` Al Viro
  2025-02-08  1:35     ` Al Viro
  2025-02-12 21:22     ` Al Viro
  1 sibling, 2 replies; 83+ messages in thread
From: Al Viro @ 2025-02-08  1:30 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:50PM +1100, NeilBrown wrote:
> When we call ->revalidate we want to be sure we are revalidating the
> expected name.  As a shared lock on i_rwsem no longer prevents renames
> we need to lock the dentry and ensure it still has the expected name.

*blink*

We never had been guaranteed any lock on the parent - the most common
call chain doesn't (and didn't) have it taken.

> So pass parent name to d_revalidate() and be prepared to retry the
> lookup if it returns -EAGAIN.

I don't understand that one at all.  What's the point of those retries
on -EAGAIN?  Rename (or race with d_splice_alias(), for that matter)
can happen just as we return success from ->d_revalidate(), so we
don't get anything useful out of that check.

What's more, why do we need that exclusion in the first place?
The instance *is* given a stable parent reference and stable name,
so there's no need for it to even look at ->d_parent or ->d_name.

It looks like a bad rebase on top of ->d_revalidate() series that
had landed in -rc1, with the original variant trying to provide the
guarantees now offered by that series.

Unless there's something subtle I'm missing here, I would suggest
dropping that one.  Incidentally, d_update_trylock() would be
better off in fs/dcache.c - static and with just one argument.

HOWEVER, if you do not bother with doing that before ->d_unalias_trylock()
(and there's no reason to do that), the whole thing becomes much simpler -
you can do the check inside __d_move(), after all locks had been taken.

After
        spin_lock_nested(&dentry->d_lock, 2);
        spin_lock_nested(&target->d_lock, 3);
you have everything stable.  Just make the sucker return bool instead
of void, check that crap and have it return false if there's a problem.

Callers other than __d_unalias() would just do WARN_ON(!__d_move(...))
instead of their __d_move() calls and __d_unalias() would have
	if (__d_move(...))
		ret = 0;
and screw the d_update_trylock/d_update_unlock there.

All there is to it...
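
The shape of that change can be modelled in userspace roughly as below.
This is a toy under stated assumptions: struct toy_dentry, its
update_locked bit, the name swap and the -ESTALE default are all
hypothetical stand-ins, and assert() plays the role of WARN_ON().  The
real check would run with both ->d_lock spinlocks held.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical reduced dentry: a name plus an "update locked" bit. */
struct toy_dentry {
	const char *name;
	bool update_locked;
};

/*
 * Stand-in for __d_move() returning bool.  In the kernel both d_lock
 * spinlocks would be held here, so the state checked below is stable.
 */
static bool toy_d_move(struct toy_dentry *dentry, struct toy_dentry *target)
{
	const char *tmp;

	if (dentry->update_locked || target->update_locked)
		return false;	/* refuse rather than move a locked dentry */
	tmp = dentry->name;
	dentry->name = target->name;
	target->name = tmp;
	return true;
}

/* Stand-in for __d_unalias(): a refused move is an expected outcome. */
static int toy_d_unalias(struct toy_dentry *alias, struct toy_dentry *dentry)
{
	int ret = -ESTALE;	/* assumed default error for this sketch */

	if (toy_d_move(alias, dentry))
		ret = 0;
	return ret;
}

/* Any other caller treats a refused move as a bug. */
static void toy_d_move_must_succeed(struct toy_dentry *dentry,
				    struct toy_dentry *target)
{
	bool moved = toy_d_move(dentry, target);

	assert(moved);	/* stands in for WARN_ON(!__d_move(...)) */
	(void)moved;
}
```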


* Re: [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-08  1:30   ` Al Viro
@ 2025-02-08  1:35     ` Al Viro
  2025-02-12 21:22     ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-08  1:35 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Feb 08, 2025 at 01:30:43AM +0000, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:50PM +1100, NeilBrown wrote:
> > When we call ->revalidate we want to be sure we are revalidating the
> > expected name.  As a shared lock on i_rwsem no longer prevents renames
> > we need to lock the dentry and ensure it still has the expected name.
> 
> *blink*
> 
> We never had been guaranteed any lock on the parent - the most common
> call chain doesn't (and didn't) have it taken.
> 
> > So pass parent name to d_revalidate() and be prepared to retry the
> > lookup if it returns -EAGAIN.
> 
> I don't understand that one at all.  What's the point of those retries
> on -EAGAIN?  Rename (or race with d_splice_alias(), for that matter)
> can happen just as we return success from ->d_revalidate(), so we
> don't get anything useful out of that check.
> 
> What's more, why do we need that exclusion in the first place?
> The instance *is* given a stable parent reference and stable name,
> so there's no need for it to even look at ->d_parent or ->d_name.
> 
> It looks like a bad rebase on top of ->d_revalidate() series that
> had landed in -rc1, with the original variant trying to provide the
> guarantees now offered by that series.
> 
> Unless there's something subtle I'm missing here, I would suggest
> dropping that one.  Incidentally, d_update_trylock() would be
> better off in fs/dcache.c - static and with just one argument.

Sorry, lost a sentence here while editing:

The only remaining caller of d_update_trylock() would be the one in
__d_unalias(), just before the call of ->d_unalias_trylock() in there
and it gets NULL/NULL in the last two arguments.

> HOWEVER, if you do not bother with doing that before ->d_unalias_trylock()
> (and there's no reason to do that), the whole thing becomes much simpler -
> you can do the check inside __d_move(), after all locks had been taken.
> 
> After
>         spin_lock_nested(&dentry->d_lock, 2);
>         spin_lock_nested(&target->d_lock, 3);
> you have everything stable.  Just make the sucker return bool instead
> of void, check that crap and have it return false if there's a problem.
> 
> Callers other than __d_unalias() would just do WARN_ON(!__d_move(...))
> instead of their __d_move() calls and __d_unalias() would have
> 	if (__d_move(...))
> 		ret = 0;
> and screw the d_update_trylock/d_update_unlock there.
> 
> All there is to it...


* Re: [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove  operations.
  2025-02-06  5:42 ` [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove operations NeilBrown
@ 2025-02-08  1:38   ` Al Viro
  2025-02-09  6:40   ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-08  1:38 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:48PM +1100, NeilBrown wrote:
> d_update_lock(), d_update_trylock(), d_update_unlock() are added which
> can be used to get an exclusive lock on a dentry in preparation for
> updating it.
> 
> As contention on a name is rare this is optimised for the uncontended
> case.  A bit is set under the d_lock spinlock to claim the lock, and
> wait_var_event_spinlock() is used when waiting is needed.  To avoid
> sending a wakeup when not needed we have a second bit flag to indicate
> if there are any waiters.
> 
> This locking is used in lookup_and_lock().
> 
> Once the exclusive "update" lock is obtained on the dentry we must make
> sure it wasn't unlinked or renamed while we slept.  If it was we repeat
> the lookup.
> 
> We also ensure that the parent isn't similarly locked.  This will be
> used to protect a directory during rmdir.

What's the point re rmdir()?  Just have the victim _always_ locked exclusive,
same as e.g. for ->unlink() or overwriting ->rename().


* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-07 21:06   ` Al Viro
@ 2025-02-08 22:06     ` Al Viro
  2025-02-08 22:30       ` Linus Torvalds
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-08 22:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Feb 07, 2025 at 09:06:58PM +0000, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:51PM +1100, NeilBrown wrote:
> > vfs_rmdir takes an exclusive lock on the target directory to ensure
> > nothing new is created in it while the rmdir progresses.  With the
> > possibility of async updates continuing after the inode lock is dropped
> > we now need extra protection.
> > 
> > Any async updates will have DCACHE_PAR_UPDATE set on the dentry.  We
> > simply wait for that flag to be cleared on all children.
> 
> > +static void d_update_wait(struct dentry *dentry, unsigned int subclass)
> > +{
> > +	/* Note this may only ever be called in a context where we have
> > +	 * a lock preventing this dentry from becoming locked, possibly
> > +	 * an update lock on the parent dentry.  The must be a smp_mb()
> > +	 * after that lock is taken and before this is called so that
> > +	 * the following test is safe. d_update_lock() provides that
> > +	 * barrier.
> > +	 */
> > +	if (!(dentry->d_flags & DCACHE_PAR_UPDATE))
> > +		return
> > +	lock_acquire_exclusive(&dentry->d_update_map, subclass,
> > +			       0, NULL, _THIS_IP_);
> 
> What the fuck?
> 
> > +	spin_lock(&dentry->d_lock);
> > +	wait_var_event_spinlock(&dentry->d_flags,
> > +				!check_dentry_locked(dentry),
> > +				&dentry->d_lock);
> > +	spin_unlock(&dentry->d_lock);
> > +	lock_map_release(&dentry->d_update_map);
> > +}
> 
> OK, I realize that it compiles, but it should've raised all
> kinds of red flags for anyone reading that.  return + <newline> is
> already fishy, but having the next line indented *less* than that
> return is firmly in the "somebody's trying to hide something nasty
> here" territory, even without parsing the damn thing.

Incidentally, that's where lockdep warnings you've mentioned are
coming from...


* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-08 22:06     ` Al Viro
@ 2025-02-08 22:30       ` Linus Torvalds
  2025-02-08 22:34         ` Linus Torvalds
  2025-02-08 23:25         ` Al Viro
  0 siblings, 2 replies; 83+ messages in thread
From: Linus Torvalds @ 2025-02-08 22:30 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Christian Brauner, Jan Kara, Jeff Layton, Dave Chinner,
	linux-fsdevel, linux-kernel

On Sat, 8 Feb 2025 at 14:06, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > OK, I realize that it compiles, but it should've raised all
> > kinds of red flags for anyone reading that.

Well, it's literally just missing a ';' so, the "red flag" is "oops,
nobody noticed the typo".

> > return + <newline> is
> > already fishy, but having the next line indented *less* than that
> > return is firmly in the "somebody's trying to hide something nasty
> > here" territory, even without parsing the damn thing.

Sadly, there is probably no sane way to do semi-automated indentation checks.

> Incidentally, that's where lockdep warnings you've mentioned are
> coming from...

Yeah, so because of the missing ';', and because gcc allows a 'return
<voidfn>()' in a void function (which is actually a useful syntax
extension, so I'm not really complaining), it compiles cleanly but the
lock_acquire_exclusive() is done in *exactly* the wrong situation.
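
A small userspace reproduction of that effect, as a sketch: the names
fake_acquire(), wait_buggy() and wait_fixed() are hypothetical, and the
counter merely stands in for lock_acquire_exclusive().  It relies on
the compiler accepting "return <void expression>;" in a void function,
which gcc and clang do by default and flag only in pedantic mode.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter standing in for lock_acquire_exclusive(). */
static int acquire_calls;

static void fake_acquire(void)
{
	acquire_calls++;
}

/* Same shape as the quoted hunk: no ';' after "return". */
static void wait_buggy(bool flag_set)
{
	if (!flag_set)
		return
	fake_acquire();
}

/* Intended shape. */
static void wait_fixed(bool flag_set)
{
	if (!flag_set)
		return;
	fake_acquire();
}

void missing_semicolon_demo(void)
{
	acquire_calls = 0;
	wait_buggy(false);	/* flag clear: "return fake_acquire();" runs */
	assert(acquire_calls == 1);
	wait_buggy(true);	/* flag set: the whole statement is skipped */
	assert(acquire_calls == 1);

	acquire_calls = 0;
	wait_fixed(false);	/* flag clear: plain early return */
	assert(acquire_calls == 0);
	wait_fixed(true);	/* flag set: acquire as intended */
	assert(acquire_calls == 1);
}
```

The buggy variant acquires only when the flag is clear and never when
it is set, which is the inverse of what the indentation suggests.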

Do we have any useful indentation checkers that might have caught
things like this?

gcc does have a "-Wmisleading-indentation" option, but afaik it only
warns about a few very specific things because anything more
aggressive results in way too many false positives.

I've never used clang-format, but I do know it supports those kinds of
extensions, since I see them in the kernel config file.

                  Linus


* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-08 22:30       ` Linus Torvalds
@ 2025-02-08 22:34         ` Linus Torvalds
  2025-02-08 23:25         ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Linus Torvalds @ 2025-02-08 22:34 UTC (permalink / raw)
  To: Al Viro
  Cc: NeilBrown, Christian Brauner, Jan Kara, Jeff Layton, Dave Chinner,
	linux-fsdevel, linux-kernel

On Sat, 8 Feb 2025 at 14:30, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I've never used clang-format, but I do know it supports those kinds of
> extensions, since I see them in the kernel config file.

Bah. Over-eager editing removed the context of that sentence.

The context was supposed to be that in the kernel, we tend to have
lots of patterns that make traditional indentation checking totally
useless: things like the "list_for_each()" macro that obviously
includes a loop in it and thus has indentation expectations.

           Linus


* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-07 20:22   ` Al Viro
@ 2025-02-08 23:18     ` Al Viro
  2025-02-12  5:22       ` NeilBrown
  2025-02-12  4:49     ` NeilBrown
  1 sibling, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-08 23:18 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Feb 07, 2025 at 08:22:35PM +0000, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> > lookup_and_lock() combines locking the directory and performing a lookup
> > prior to a change to the directory.
> > Abstracting this prepares for changing the locking requirements.
> > 
> > done_lookup_and_lock() provides the inverse of putting the dentry and
> > unlocking.
> > 
> > For "silly_rename" we will need to lookup_and_lock() in a directory that
> > is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
> 
> Ewww...  I do realize that such things might appear in intermediate
> stages of locking massage, but they'd better be _GONE_ by the end of it.
> Conditional locking of that sort is really asking for trouble.
> 
> If nothing else, better split the function in two variants and document
> the differences; that kind of stuff really does not belong in arguments.
> If you need it to exist through the series, that is - if not, you should
> just leave lookup_one_qstr() for the "locked" case from the very beginning.

The same, BTW, applies to more than LOOKUP_PARENT_LOCKED part.

One general observation: if the locking behaviour of a function depends
upon the flags passed to it, it's going to cause massive headache afterwards.

If you need to bother with data flow analysis to tell what given call will
do, expect trouble.
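
As a rough illustration of the alternative, here is a toy userspace
model with one function per locking contract instead of a flag.  The
struct toy_dir type and all toy_* names are hypothetical; the boolean
only imitates a lock so that the contracts can be checked with
assert().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy directory: a lock bit and a lookup counter. */
struct toy_dir {
	bool locked;
	int lookups;
};

static void toy_dir_lock(struct toy_dir *d)
{
	assert(!d->locked);
	d->locked = true;
}

static void toy_dir_unlock(struct toy_dir *d)
{
	assert(d->locked);
	d->locked = false;
}

/* Contract: caller already holds the lock; never takes or drops it. */
static int toy_lookup_locked(struct toy_dir *d)
{
	assert(d->locked);
	return ++d->lookups;
}

/* Contract: takes the lock itself and returns with it held. */
static int toy_lookup_and_lock(struct toy_dir *d)
{
	toy_dir_lock(d);
	return toy_lookup_locked(d);
}
```

With this split, the locking effect of any call site is visible from
the function name alone; no argument has to be traced back to find out
whether the lock is taken.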

If anything, I would rather have separate lookup_for_removal(), etc., each
with its locking effects explicitly spelled out.  Incidentally, looking
through that I would say that your "VFS: filename_create(): fix incorrect
intent" is not the right solution.  If we hit that condition (no LOOKUP_DIRECTORY
and last component ending with slash), we are going to fail anyway, the only
question is which error to return.  Rules:
	* if the last component lookup fails, return the error from lookup
	* if it yields positive, return -EEXIST
	* if it yields negative, return -ENOENT
Correct?  So how about this:

diff --git a/fs/namei.c b/fs/namei.c
index 3ab9440c5b93..6189e54f767a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -4054,13 +4054,13 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct qstr last;
 	bool want_dir = lookup_flags & LOOKUP_DIRECTORY;
-	unsigned int reval_flag = lookup_flags & LOOKUP_REVAL;
-	unsigned int create_flags = LOOKUP_CREATE | LOOKUP_EXCL;
 	int type;
 	int err2;
 	int error;
 
-	error = filename_parentat(dfd, name, reval_flag, path, &last, &type);
+	lookup_flags &= LOOKUP_REVAL;
+
+	error = filename_parentat(dfd, name, lookup_flags, path, &last, &type);
 	if (error)
 		return ERR_PTR(error);
 
@@ -4070,18 +4070,28 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	 */
 	if (unlikely(type != LAST_NORM))
 		goto out;
+	/*
+	 * mkdir foo/bar/ is OK, but for anything else a slash in the end
+	 * is always an error; the only question is which one.
+	 */
+	if (unlikely(last.name[last.len] && !want_dir)) {
+		dentry = lookup_dcache(&last, path->dentry, lookup_flags);
+		if (!dentry)
+			dentry = lookup_slow(&last, path->dentry, lookup_flags);
+		if (!IS_ERR(dentry)) {
+			error = d_is_positive(dentry) ? -EEXIST : -ENOENT;
+			dput(dentry);
+			dentry = ERR_PTR(error);
+		}
+		goto out;
+	}
 
 	/* don't fail immediately if it's r/o, at least try to report other errors */
 	err2 = mnt_want_write(path->mnt);
-	/*
-	 * Do the final lookup.  Suppress 'create' if there is a trailing
-	 * '/', and a directory wasn't requested.
-	 */
-	if (last.name[last.len] && !want_dir)
-		create_flags = 0;
+	/* do the final lookup */
 	inode_lock_nested(path->dentry->d_inode, I_MUTEX_PARENT);
 	dentry = lookup_one_qstr_excl(&last, path->dentry,
-				      reval_flag | create_flags);
+				lookup_flags | LOOKUP_CREATE | LOOKUP_EXCL);
 	if (IS_ERR(dentry))
 		goto unlock;
 
@@ -4089,16 +4099,6 @@ static struct dentry *filename_create(int dfd, struct filename *name,
 	if (d_is_positive(dentry))
 		goto fail;
 
-	/*
-	 * Special case - lookup gave negative, but... we had foo/bar/
-	 * From the vfs_mknod() POV we just have a negative dentry -
-	 * all is fine. Let's be bastards - you had / on the end, you've
-	 * been asking for (non-existent) directory. -ENOENT for you.
-	 */
-	if (unlikely(!create_flags)) {
-		error = -ENOENT;
-		goto fail;
-	}
 	if (unlikely(err2)) {
 		error = err2;
 		goto fail;
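
The three error-selection rules stated before the patch can also be
written down as a standalone function.  This is only a sketch: the
enum and the name trailing_slash_error() are hypothetical and do not
appear in fs/namei.c.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome of looking up the last component. */
enum toy_lookup_result {
	TOY_LOOKUP_FAILED,	/* lookup itself returned an error */
	TOY_LOOKUP_POSITIVE,	/* something already exists there */
	TOY_LOOKUP_NEGATIVE,	/* nothing exists there */
};

/*
 * Error to report for a non-directory create whose last component
 * ends with a slash.  lookup_err is used only when the lookup failed.
 */
static int trailing_slash_error(enum toy_lookup_result res, int lookup_err)
{
	switch (res) {
	case TOY_LOOKUP_FAILED:
		assert(lookup_err < 0);	/* a failure carries a negative errno */
		return lookup_err;
	case TOY_LOOKUP_POSITIVE:
		return -EEXIST;
	case TOY_LOOKUP_NEGATIVE:
		return -ENOENT;
	}
	return -EINVAL;	/* not reached for valid enum values */
}
```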


* Re: [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed.
  2025-02-08 22:30       ` Linus Torvalds
  2025-02-08 22:34         ` Linus Torvalds
@ 2025-02-08 23:25         ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-08 23:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: NeilBrown, Christian Brauner, Jan Kara, Jeff Layton, Dave Chinner,
	linux-fsdevel, linux-kernel

On Sat, Feb 08, 2025 at 02:30:39PM -0800, Linus Torvalds wrote:
> On Sat, 8 Feb 2025 at 14:06, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > OK, I realize that it compiles, but it should've raised all
> > > kinds of red flags for anyone reading that.
> 
> Well, it's literally just missing a ';' so, the "red flag" is "oops,
> nobody noticed the typo".

Sure - what I'm saying is that this is visually wrong; "red flag" as in
"WTF am I looking at here?" when scrolling through the patch.


* Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-07 22:41   ` Al Viro
@ 2025-02-09  1:09     ` Al Viro
  2025-02-09  4:57       ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-09  1:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Fri, Feb 07, 2025 at 10:41:34PM +0000, Al Viro wrote:

> I'm sorry, but I don't buy the "complete with no lock on directory"
> part - not without a verifiable proof of correctness of the locking
> scheme.  Especially if you are putting rename into the mix.
> 
> And your method prototypes pretty much bake that in.
> 
> *IF* we intend to try going that way (and I'm not at all convinced
> that it's feasible - locking aside, there's also a shitload of fun
> with fsnotify, audit, etc.), let's make those new methods take
> a single argument - something like struct mkdir_args, etc., with
> inlines for extracting individual arguments out of that.  Yes, it's
> ugly, but it allows later changes without a massive headache on
> each calling convention modification.
> 
> Said that, an explicit description of locking scheme and a proof of
> correctness (at least on the "it can't deadlock" level) is, IMO,
> a hard requirement for the entire thing, async or no async.
> 
> We *do* have such for the current locking scheme.

While we are at it, the locking order is... interesting.  You
have
	* parent's ->i_rwsem before child's d_update_lock()
	* for a child, d_update_lock() before ->i_rwsem
and that - on top of ordering between ->i_rwsem of various
inodes.

Do you actually have a proof that it's deadlock-free?


* Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
  2025-02-09  1:09     ` Al Viro
@ 2025-02-09  4:57       ` Al Viro
  0 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-09  4:57 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sun, Feb 09, 2025 at 01:09:10AM +0000, Al Viro wrote:
> On Fri, Feb 07, 2025 at 10:41:34PM +0000, Al Viro wrote:
> 
> > I'm sorry, but I don't buy the "complete with no lock on directory"
> > part - not without a verifiable proof of correctness of the locking
> > scheme.  Especially if you are putting rename into the mix.
> > 
> > And your method prototypes pretty much bake that in.
> > 
> > *IF* we intend to try going that way (and I'm not at all convinced
> > that it's feasible - locking aside, there's also a shitload of fun
> > with fsnotify, audit, etc.), let's make those new methods take
> > a single argument - something like struct mkdir_args, etc., with
> > inlines for extracting individual arguments out of that.  Yes, it's
> > ugly, but it allows later changes without a massive headache on
> > each calling convention modification.
> > 
> > Said that, an explicit description of locking scheme and a proof of
> > correctness (at least on the "it can't deadlock" level) is, IMO,
> > a hard requirement for the entire thing, async or no async.
> > 
> > We *do* have such for the current locking scheme.
> 
> While we are at it, the locking order is... interesting.  You
> have
> 	* parent's ->i_rwsem before child's d_update_lock()
> 	* for a child, d_update_lock() before ->i_rwsem
> and that - on top of ordering between ->i_rwsem of various
> inodes.
> 
> Do you actually have a proof that it's deadlock-free?

Note that "child's d_update_lock()" might very well be sleeping
on something that is no longer the parent's child, so the
ordering by depth, with ->i_rwsem and d_update_lock interspersed
does not hold.

What am I missing here?  I'd been trying to come up with
a proof of deadlock avoidance, but... no luck so far.


* Re: [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove  operations.
  2025-02-06  5:42 ` [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove operations NeilBrown
  2025-02-08  1:38   ` Al Viro
@ 2025-02-09  6:40   ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-09  6:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:48PM +1100, NeilBrown wrote:

> +bool d_update_lock(struct dentry *dentry,
> +		   struct dentry *base, const struct qstr *last,
> +		   unsigned int subclass)
> +{
> +	lock_acquire_exclusive(&dentry->d_update_map, subclass, 0, NULL, _THIS_IP_);
> +again:
> +	spin_lock(&dentry->d_lock);
> +	wait_var_event_spinlock(&dentry->d_flags,
> +				!check_dentry_locked(dentry),
> +				&dentry->d_lock);
> +	if (d_is_positive(dentry)) {
> +		rcu_read_lock(); /* needed for d_same_name() */

It isn't.  You are holding ->d_lock there.

> +		if (
> +			/* Was unlinked while we waited ?*/
> +			d_unhashed(dentry) ||
> +			/* Or was dentry renamed ?? */
> +			dentry->d_parent != base ||
> +			dentry->d_name.hash != last->hash ||
> +			!d_same_name(dentry, base, last)

Negatives can't be moved, but they bloody well can be unhashed.  So skipping
the d_unhashed() part for negatives is wrong.

> +		) {
> +			rcu_read_unlock();
> +			spin_unlock(&dentry->d_lock);
> +			lock_map_release(&dentry->d_update_map);
> +			return false;
> +		}
> +		rcu_read_unlock();
> +	}
> +	/* Must ensure DCACHE_PAR_UPDATE in child is visible before reading
> +	 * from parent
> +	 */
> +	smp_store_mb(dentry->d_flags, dentry->d_flags | DCACHE_PAR_UPDATE);

... paired with?

> +	if (base->d_flags & DCACHE_PAR_UPDATE) {
> +		/* We cannot grant DCACHE_PAR_UPDATE on a dentry while
> +		 * it is held on the parent
> +		 */
> +		dentry->d_flags &= ~DCACHE_PAR_UPDATE;
> +		spin_unlock(&dentry->d_lock);
> +		spin_lock(&base->d_lock);
> +		wait_var_event_spinlock(&base->d_flags,
> +					!check_dentry_locked(base),
> +					&base->d_lock);

Oh?  So you might also be waiting on the parent?  That's a deadlock fodder right
there - caller might be holding ->i_rwsem on the same parent, so you have waiting
on _->d_flags nested both outside and inside _->d_inode->i_rwsem.

Just in case anyone goes "->i_rwsem will only be held shared" - that wouldn't help.
Throw fchmod() into the mix and enjoy your deadlock -
	A: holds ->i_rwsem shared, waits for C to clear DCACHE_PAR_UPDATE.
	B: blocked trying to grab ->i_rwsem exclusive
	C: has DCACHE_PAR_UPDATE set, is blocked trying to grab ->i_rwsem shared
and there you go...
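
The A/B/C scenario can be checked mechanically as a cycle in a wait-for
graph.  The sketch below is a userspace model, not kernel code: the
task indices, the waits_for[] array and has_wait_cycle() are
hypothetical, and it assumes that a queued exclusive waiter makes later
shared acquirers queue behind it, which is what lets B block C.

```c
#include <assert.h>
#include <stdbool.h>

#define NTASKS 3
enum { A, B, C };

/*
 * waits_for[t] is the task that t is blocked behind, or -1 if t can
 * run.  Returns true if following the edges from 'start' leads back
 * to 'start', i.e. 'start' is part of a deadlock cycle.
 */
static bool has_wait_cycle(const int waits_for[NTASKS], int start)
{
	int t, steps;

	assert(start >= 0 && start < NTASKS);
	t = waits_for[start];
	for (steps = 0; steps < NTASKS && t >= 0; steps++) {
		if (t == start)
			return true;
		t = waits_for[t];
	}
	return false;
}
```

With A waiting on C (the flag), C queued behind B (the exclusive
waiter) and B waiting on A (the shared holder), the walk from A returns
to A.  Take B out and C gets its shared lock, so the cycle disappears.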

> +		spin_unlock(&base->d_lock);
> +		goto again;
> +	}
> +	spin_unlock(&dentry->d_lock);
> +	return true;
> +}

The entire thing is refcount-neutral for both dentry and base.  Which makes this

> @@ -1759,8 +1863,9 @@ static struct dentry *lookup_and_lock_nested(const struct qstr *last,
>  
>  	if (!(lookup_flags & LOOKUP_PARENT_LOCKED))
>  		inode_lock_nested(base->d_inode, subclass);
> -
> -	dentry = lookup_one_qstr(last, base, lookup_flags);
> +	do {
> +		dentry = lookup_one_qstr(last, base, lookup_flags);
> +	} while (!IS_ERR(dentry) && !d_update_lock(dentry, base, last, subclass));

... a refcount leak waiting to happen.
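
A toy model of that retry loop shows where the reference goes missing.
All names here (struct toy_dentry, toy_lookup(), toy_dput(),
toy_update_lock()) are hypothetical, and dropping the reference before
retrying is just one plausible fix assumed for the sketch, not
something the patch or this review prescribes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduced dentry: only a reference count. */
struct toy_dentry {
	int refcount;
};

/* Like a lookup: returns the dentry with one extra reference. */
static struct toy_dentry *toy_lookup(struct toy_dentry *d)
{
	d->refcount++;
	return d;
}

static void toy_dput(struct toy_dentry *d)
{
	assert(d->refcount > 0);
	d->refcount--;
}

/* Refcount-neutral lock attempt: fails *failures times, then succeeds. */
static bool toy_update_lock(int *failures)
{
	if (*failures > 0) {
		(*failures)--;
		return false;
	}
	return true;
}

/* Shape of the quoted loop: every failed attempt keeps its reference. */
static struct toy_dentry *retry_leaky(struct toy_dentry *d, int failures)
{
	struct toy_dentry *found;

	do {
		found = toy_lookup(d);
	} while (!toy_update_lock(&failures));
	return found;
}

/* Assumed fix: drop the reference before looking up again. */
static struct toy_dentry *retry_balanced(struct toy_dentry *d, int failures)
{
	struct toy_dentry *found;

	for (;;) {
		found = toy_lookup(d);
		if (toy_update_lock(&failures))
			return found;
		toy_dput(found);
	}
}
```

After two failed attempts the leaky loop returns with three references
taken for a caller that owns only one; the balanced loop returns with
exactly one.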


* Re: [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
  2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
  2025-02-06 12:33   ` Christian Brauner
  2025-02-07 20:14   ` Al Viro
@ 2025-02-09 20:23   ` Al Viro
  2 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-09 20:23 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:42PM +1100, NeilBrown wrote:

> @@ -1700,6 +1702,15 @@ struct dentry *lookup_one_qstr(const struct qstr *name,
>  	if ((flags & LOOKUP_INTENT_FLAGS) == 0)
>  		/* ->lookup must have given final answer */
>  		d_lookup_done(dentry);
> +found:
> +	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
> +		dput(dentry);
> +		return ERR_PTR(-ENOENT);
> +	}
> +	if (d_is_positive(dentry) && (flags & LOOKUP_EXCL)) {
> +		dput(dentry);
> +		return ERR_PTR(-EEXIST);
> +	}

Final dput() on an in-lookup dentry would blow up.  What happens if we get
there without LOOKUP_CREATE, but with something else from LOOKUP_INTENT_FLAGS?

That, BTW, is another lovely example of the reasons why making state (in-lookup
in this case, locking elsewhere) transitions dependent upon the function arguments
is a bad idea.
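
A toy model of the question, for illustration only.  The flag values and the
SK_ names below are made up, the composition of the intent mask is an
assumption, and "the filesystem skipped the lookup" is reduced to a single
boolean; it only shows which argument combinations reach dput() with the
dentry still in-lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not guaranteed to match any kernel version. */
#define SK_LOOKUP_OPEN          0x1u
#define SK_LOOKUP_CREATE        0x2u
#define SK_LOOKUP_EXCL          0x4u
#define SK_LOOKUP_RENAME_TARGET 0x8u
#define SK_LOOKUP_INTENT_FLAGS \
	(SK_LOOKUP_OPEN | SK_LOOKUP_CREATE | SK_LOOKUP_EXCL | SK_LOOKUP_RENAME_TARGET)

/*
 * fs_skipped_lookup: ->lookup() saw an intent flag and left the dentry
 * in-lookup (and therefore negative).
 * Returns true if the new -ENOENT branch would dput() an in-lookup dentry.
 */
static bool dput_hits_in_lookup(unsigned int flags, bool fs_skipped_lookup)
{
	bool in_lookup = fs_skipped_lookup;

	if ((flags & SK_LOOKUP_INTENT_FLAGS) == 0)
		in_lookup = false;	/* d_lookup_done() was called */

	/* An in-lookup dentry is negative, so only the -ENOENT test applies. */
	return in_lookup && !(flags & SK_LOOKUP_CREATE);
}
```

In this model a rename-target lookup that the filesystem chose to skip is the
case that trips: the intent mask suppresses d_lookup_done(), and the absence of
the create flag sends the still in-lookup dentry to dput().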


* Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
  2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
                   ` (20 preceding siblings ...)
  2025-02-06 15:36 ` John Stoffel
@ 2025-02-09 23:33 ` Al Viro
  21 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-09 23:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:

> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations.  I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used.  I
> would welcome any advice on that matter, though I hope to find time to
> explore myself.  For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.

OK, after looking through that and playing around with the locking
scheme of yours:

Separating directory rwsem for reads/modifications from locking of
individual dentries may be feasible, but it needs to be a lot more
careful about the states it sleeps in.  Your current variant is rife
with deadlocks; for the "wait on dentry itself" it's probably possible
to avoid, with some care; for "wait on parent" it's really not an option.

Quite a bit of headache comes from the fact that NFS et al. are playing
silly buggers with "OK, we see that lookup is for <operation>; skip it,
the call of actual method will do the right thing".  The trouble is,
d_lookup_done() of not-really-looked-up is fine under exclusive lock on
parent, but only because there won't be d_alloc_parallel() on the same
name until we drop that exclusive lock.

Your scheme, OTOH, has hard dependency upon those suckers staying visible
to d_alloc_parallel() until the actual operation is done.  Which means
that this code, including the methods, is exposed to in-lookup dentries.

What's more, similar dependency is there for dentries getting unhashed
between the lookup and the end of operation - something which NFS
cheerfully violates.  If method's argument gets hit with d_drop() and
d_rehash(), there's a window where it won't be found in dcache, leaving
no indication that it's being operated upon.  Currently we are fine -
exclusive lock on parent means that on dcache miss we try to grab
the parent shared and repeat dcache lookup when we get that.

Your variant does not have such exclusion - parent is held shared and
child dentry involved is not there to be found during d_drop()/d_rehash()
window.

IOW, your in-update state might make sense, but not in the way it's done
at the moment - it's too brittle.

And the part about async tree topology modifications is bloody insane,
IMO.  I won't believe that to be feasible until I see the algorithm and
proof of correctness; preferably _before_ the actual code.


* Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
  2025-02-07 19:45   ` Al Viro
@ 2025-02-10  4:36     ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-10  4:36 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sat, 08 Feb 2025, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote:
> > vfs_mkdir() does not guarantee to make the child dentry positive on
> > success.  It may leave it negative and then the caller needs to perform a
> > lookup to find the target dentry.
> > 
> > This patch introduces vfs_mkdir_return() which performs the lookup if
> > needed so that this code is centralised.
> > 
> > This prepares for a new inode operation which will perform mkdir and
> > return the correct dentry.
> 
> * Calling conventions stink; make it _consume_ dentry reference and
> return dentry reference or ERR_PTR().  Callers will be happier that way
> (check it).

With later patches it will need to consume the lock on the dentry as
well, and either transfer it to the new one or (for error) unlock it.
We need to have the result dentry still locked for fsnotify_mkdir().

Transferring the dentry lock would have to be done in d_splice_alias(). 
The __d_unalias() branch should be ok because I already trylock in there
and fail if I can't get the lock.  For the IS_ROOT branch ....  I think
it is safe to fail if a trylock doesn't succeed.

So I can probably make that work - thanks.

Hmm... kernfs reportedly can leave the mkdir dentry negative and fill in
the inode later.  How does that work?  I assume it will still be hashed
so mkdir won't try the lookup.

done_path_create() will need to accept an IS_ERR() dentry.

> 
> * Calling conventions should be documented in commit message *and* in
> D/f/porting

What is the scope of "porting"?  It seems to be mostly about
_operations interfaces, but I do see other things in there.  I'll try to
remember that - thanks.

> 
> * devpts, nfs4recover and xfs might as well convert (not going to hit
> the "need a lookup" case anyway)

Good point - avoiding the lookup when not requested is a pointless
optimisation because it is hardly ever needed and should always be
cheap - we expect it to be in the dcache.

> 
> * that 
> +                       /* Need a "const" pointer.  We know d_name is const
> +                        * because we hold an exclusive lock on i_rwsem
> +                        * in d_parent.
> +                        */
> +                       const struct qstr *d_name = (void*)&dentry->d_name;
> +                       d = lookup_dcache(d_name, dentry->d_parent, 0);
> +                       if (!d)
> +                               d = __lookup_slow(d_name, dentry->d_parent, 0);
> doesn't need a cast.  C is perfectly fine with
> 	T *x = foo();
> 	const T *y = x;
> 
> You are not allowed to _strip_ qualifiers; adding them is fine.
> Same reason why you are allowed to pass char * to strlen() without
> any casts whatsoever.

hmm..  I thought I had tried that.  Maybe I didn't try hard enough.
Thanks for the guidance.
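
A minimal userspace sketch of the qualifier rule being discussed.  The struct
and function names are hypothetical stand-ins, not the kernel's struct qstr or
struct dentry; the point is only that adding const needs no cast.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for struct qstr / struct dentry. */
struct name_sketch {
	const char *name;
	size_t len;
};

struct entry_sketch {
	struct name_sketch d_name;
};

/* A consumer that promises not to modify the name. */
static size_t name_len(const struct name_sketch *n)
{
	return n->len;
}

static size_t entry_name_len(struct entry_sketch *e)
{
	/* Adding const needs no cast: T * converts implicitly to const T *. */
	const struct name_sketch *d_name = &e->d_name;

	return name_len(d_name);
}

static size_t plain_strlen(char *s)
{
	/* Same rule: char * is accepted where strlen() wants const char *. */
	return strlen(s);
}
```

Going the other way (assigning a const pointer to a non-const one) is what
would need a cast, and is what the compiler is right to complain about.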
> 
> Comment re stability is fine; the cast is pure WTF material.
> 

Thanks,
NeilBrown


* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-07 19:32   ` Al Viro
@ 2025-02-10  4:58     ` NeilBrown
  2025-02-10  5:15       ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-10  4:58 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sat, 08 Feb 2025, Al Viro wrote:
> 1) what's wrong with using middle bits of dentry as index?  What the hell
> is that thing about pid for?

What does "hell" have to do with it?

All we need here is a random number.  Preferably a cheap random number.
pid is cheap and quite random.
The dentry pointer would be even cheaper (no mem access) providing it
doesn't cost much to get the randomness out.  I considered hash_ptr()
but thought that was more code than it was worth.

Do you have a formula for selecting the "middle" bits in a way that is
expected to still give good randomness?

> 
> 2) part in d_add_ci() might be worth a comment re d_lookup_done() coming
> for the original dentry, no matter what.

I think the previous code deserved explanation more than the new, but
maybe I missed something.
In each case, d_wait_lookup() will wait for the given dentry to no
longer be d_in_lookup() which means waiting for DCACHE_PAR_LOOKUP to be
cleared.  The only place which clears DCACHE_PAR_LOOKUP is
__d_lookup_unhash_wake(), which always wakes the target.
In the previous code it would wake both the non-case-exact dentry and
the case-exact dentry waiters but they would go back to sleep if their
DCACHE_PAR_LOOKUP hadn't been cleared, so no interesting behaviour.
Reusing the wq from one to the other is a sensible simplification, but
not something we need any reminder of once it is no longer needed.

What sort of comment would you add?

> 
> 3) the dance with conditional __wake_up() is worth a helper, IMO.
> 

I tried to explain that in the commit message but I agree it deserves to
be in the code too.
I have added:

	/* ->d_wait is only set if some thread is actually waiting.
	 * If we find it is NULL - the common case - then there was no
	 * contention and there are no waiters to be woken.
	 */

and 
	/* Don't set a wait_queue until someone is actually waiting */
before
	new->d_wait = NULL;
in d_alloc_parallel().

Thanks,
NeilBrown


* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-10  4:58     ` NeilBrown
@ 2025-02-10  5:15       ` Al Viro
  2025-02-11 23:35         ` NeilBrown
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-10  5:15 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Mon, Feb 10, 2025 at 03:58:02PM +1100, NeilBrown wrote:
> On Sat, 08 Feb 2025, Al Viro wrote:
> > 1) what's wrong with using middle bits of dentry as index?  What the hell
> > is that thing about pid for?
> 
> What does "hell" have to do with it?
> 
> All we need here is a random number.  Preferably a cheap random number.
> pid is cheap and quite random.
> The dentry pointer would be even cheaper (no mem access) providing it
> doesn't cost much to get the randomness out.  I considered hash_ptr()
> but thought that was more code than it was worth.
> 
> Do you have a formula for selecting the "middle" bits in a way that is
> expected to still give good randomness?

((unsigned long) dentry / L1_CACHE_BYTES) % <table size>

Bits just over the cacheline size should have uniform distribution...

> > 2) part in d_add_ci() might be worth a comment re d_lookup_done() coming
> > for the original dentry, no matter what.
> 
> I think the previous code deserved explanation more than the new, but
> maybe I missed something.
> In each case, d_wait_lookup() will wait for the given dentry to no
> longer be d_in_lookup() which means waiting for DCACHE_PAR_LOOKUP to be
> cleared.  The only place which clears DCACHE_PAR_LOOKUP is
> __d_lookup_unhash_wake(), which always wakes the target.
> In the previous code it would wake both the non-case-exact dentry and
> the case-exact dentry waiters but they would go back to sleep if their
> DCACHE_PAR_LOOKUP hadn't been cleared, so no interesting behaviour.
> Reusing the wq from one to the other is a sensible simplification, but
> not something we need any reminder of once it is no longer needed.

It's not just about the wakeups; any in-lookup dentry should be taken
out of in-lookup hash before it gets dropped.
 
> > 3) the dance with conditional __wake_up() is worth a helper, IMO.

I mean an inlined helper function.


* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-10  5:15       ` Al Viro
@ 2025-02-11 23:35         ` NeilBrown
  2025-02-12  0:25           ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-11 23:35 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Mon, 10 Feb 2025, Al Viro wrote:
> On Mon, Feb 10, 2025 at 03:58:02PM +1100, NeilBrown wrote:
> > On Sat, 08 Feb 2025, Al Viro wrote:
> > > 1) what's wrong with using middle bits of dentry as index?  What the hell
> > > is that thing about pid for?
> > 
> > > What does "hell" have to do with it?
> > 
> > All we need here is a random number.  Preferably a cheap random number.
> > pid is cheap and quite random.
> > The dentry pointer would be even cheaper (no mem access) providing it
> > doesn't cost much to get the randomness out.  I considered hash_ptr()
> > > but thought that was more code than it was worth.
> > 
> > Do you have a formula for selecting the "middle" bits in a way that is
> > expected to still give good randomness?
> 
> ((unsigned long) dentry / L1_CACHE_BYTES) % <table size>
> 
> Bits just over the cacheline size should have uniform distribution...

I tested this, doing the calculation on each allocation and counting the
number of times each bucket was hit.
On my test kernel with lockdep enabled the dentry is 328 bytes and
L1_CACHE_BYTES is 64.  So 6 cache lines per dentry and 10 dentries per
4K slab.  The indices created by the above formula covered roughly 1 in 6
of those available.
The 256 possibilities can be divided into 4 groups of 64 and within each
group there are 10 possible values: 0 6 12 18 24 30 36 42 48 54

Without lockdep making the dentry extra large, struct dentry is 192
bytes, exactly 3 cache lines.  There are 16 entries per 4K slab.
Now exactly 1/4 of possible indices are used.
For every group of 16 possible indices, only 0, 4, 8, 12 are used.
slabinfo says the object size is 256 which explains some of the spread. 
But ultimately the problem is that addresses are not evenly distributed
inside a single slab.
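
The sparsity described above can be reproduced with a small userspace
simulation.  This is a sketch under stated assumptions: slabs are modelled as
consecutive page-aligned 4K pages, objects sit at a fixed stride from the start
of each page, the table has 256 slots, and all SK_ names and helper functions
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_L1_CACHE_BYTES 64UL
#define SK_PAGE_SIZE      4096UL
#define SK_TABLE_SIZE     256UL

/* The suggested index: the bits just above the cache-line offset. */
static unsigned long middle_bits_index(unsigned long addr)
{
	return (addr / SK_L1_CACHE_BYTES) % SK_TABLE_SIZE;
}

/*
 * Count how many distinct table slots are ever hit when objects are laid
 * out at a fixed stride from the start of each page-sized slab.
 */
static size_t buckets_used(unsigned long stride, unsigned long objs_per_slab,
			   unsigned long nr_slabs)
{
	bool seen[SK_TABLE_SIZE] = { false };
	size_t used = 0;

	for (unsigned long slab = 0; slab < nr_slabs; slab++) {
		for (unsigned long i = 0; i < objs_per_slab; i++) {
			unsigned long addr = slab * SK_PAGE_SIZE + i * stride;
			unsigned long idx = middle_bits_index(addr);

			if (!seen[idx]) {
				seen[idx] = true;
				used++;
			}
		}
	}
	return used;
}
```

With a 256-byte stride and 16 objects per slab, buckets_used() reports 64 of
256 slots (exactly 1/4).  With a 384-byte stride (6 cache lines) and 10 objects
per slab it reports 40 of 256, which is the "roughly 1 in 6" figure.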

If I divide by PAGE_SIZE instead of L1_CACHE_BYTES I get every possible
value used but it is far from uniform.
With 40000 allocations we would want about 160 in each slot.
The median I measured is 155 (good) but the range is from 16 to 330
which is nearly +/- 100% of the median.
So that isn't random - but then you weren't suggesting that exactly.

I don't think there is a good case here for selecting bits from the
middle of the dentry address.

If I use hash_ptr(dentry, 8) I get a more uniform distribution.  64000
entries would hope for 250 per bucket.  Median is 248.  Range is 186 to
324 so +/- 25%.

Maybe that is the better choice.
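
For reference, a userspace approximation of what hash_ptr(dentry, 8) computes
on a 64-bit machine: a multiplicative hash that keeps the top bits of the
product.  The function name and SK_ prefix are hypothetical; the multiplier is
the 64-bit golden-ratio constant from include/linux/hash.h, and the 32-bit
variant is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplier used by the kernel's 64-bit hash_64(). */
#define SK_GOLDEN_RATIO_64 0x61C8864680B583EBull

/*
 * Userspace approximation of hash_ptr(ptr, bits) on a 64-bit machine.
 * bits must be in the range 1..64.
 */
static unsigned int hash_ptr_sketch(const void *ptr, unsigned int bits)
{
	uint64_t val = (uint64_t)(uintptr_t)ptr;

	/* Multiply, then keep the top 'bits' bits of the 64-bit product. */
	return (unsigned int)((val * SK_GOLDEN_RATIO_64) >> (64 - bits));
}
```

Because every input bit feeds into the high bits of the product, two addresses
that differ only by a slab stride (say 0 and 256) land in unrelated slots, 0
and 0xC8 here, rather than in a fixed arithmetic progression.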

> 
> > > 2) part in d_add_ci() might be worth a comment re d_lookup_done() coming
> > > for the original dentry, no matter what.
> > 
> > I think the previous code deserved explanation more than the new, but
> > maybe I missed something.
> > In each case, d_wait_lookup() will wait for the given dentry to no
> > longer be d_in_lookup() which means waiting for DCACHE_PAR_LOOKUP to be
> > cleared.  The only place which clears DCACHE_PAR_LOOKUP is
> > > __d_lookup_unhash_wake(), which always wakes the target.
> > In the previous code it would wake both the non-case-exact dentry and
> > the case-exact dentry waiters but they would go back to sleep if their
> > DCACHE_PAR_LOOKUP hadn't been cleared, so no interesting behaviour.
> > Reusing the wq from one to the other is a sensible simplification, but
> > not something we need any reminder of once it is no longer needed.
> 
> It's not just about the wakeups; any in-lookup dentry should be taken
> out of in-lookup hash before it gets dropped.
>  
> > > 3) the dance with conditional __wake_up() is worth a helper, IMO.
> 
> I mean an inlined helper function.

Yes.. Of course...

Maybe we should put

static inline void wake_up_key(struct wait_queue_head *wq, void *key)
{
	__wake_up(wq, TASK_NORMAL, 0, key);
}

in include/linux/wait.h to avoid the __wake_up() "internal" name, and
then use
	wake_up_key(d_wait, dentry);
in the two places in dcache.c, or did you want something
dcache-specific?
I'm not good at guessing what other people are thinking.

Thanks,
NeilBrown


* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-11 23:35         ` NeilBrown
@ 2025-02-12  0:25           ` Al Viro
  2025-02-12  1:46             ` NeilBrown
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-12  0:25 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Wed, Feb 12, 2025 at 10:35:41AM +1100, NeilBrown wrote:

> Without lockdep making the dentry extra large, struct dentry is 192
> bytes, exactly 3 cache lines.  There are 16 entries per 4K slab.
> Now exactly 1/4 of possible indices are used.
> For every group of 16 possible indices, only 0, 4, 8, 12 are used.
> slabinfo says the object size is 256 which explains some of the spread. 

Interesting...

root@cannonball:~# grep -w dentry /proc/slabinfo
dentry            1370665 1410864    192   21    1 : tunables    0    0    0 : slabdata  67184  67184      0

Where does that 256 come from?  The above is on amd64, with 6.1-based debian
kernel and I see the same object size on other boxen (with local configs).

> I don't think there is a good case here for selecting bits from the
> middle of the dentry address.
> 
> If I use hash_ptr(dentry, 8) I get a more uniform distribution.  64000
> entries would hope for 250 per bucket.  Median is 248.  Range is 186 to
> 324 so +/- 25%.
> 
> Maybe that is the better choice.

That's really interesting, considering the implications for m_hash() and mp_hash()
(see fs/namespace.c)...

> > > > 3) the dance with conditional __wake_up() is worth a helper, IMO.
> > 
> > I mean an inlined helper function.
> 
> Yes.. Of course...
> 
> Maybe we should put
> 
> static inline void wake_up_key(struct wait_queue_head *wq, void *key)
> {
> 	__wake_up(wq, TASK_NORMAL, 0, key);
> }
> 
> in include/linux/wait.h to avoid the __wake_up() "internal" name, and
> then use
> 	wake_up_key(d_wait, dentry);
> in the two places in dcache.c, or did you want something
> dcache-specific?

More like
	if (wq)
		__wake_up(wq, TASK_NORMAL, 0, key);
probably...
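
A userspace mock of the shape being suggested, with the NULL test inside the
helper.  Everything here is a stand-in: struct wq_sketch only counts wakeups,
wake_up_sketch() plays the role of __wake_up(wq, TASK_NORMAL, 0, key), and the
helper name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's wait_queue_head; it only counts wakeups. */
struct wq_sketch {
	unsigned int wakeups;
	void *last_key;
};

/* Stand-in for __wake_up(wq, TASK_NORMAL, 0, key). */
static void wake_up_sketch(struct wq_sketch *wq, void *key)
{
	wq->wakeups++;
	wq->last_key = key;
}

/*
 * Hypothetical helper: the queue pointer is only set once someone waits,
 * so NULL is the common, uncontended case and must be a cheap no-op.
 */
static inline void wake_up_key_if_waiters(struct wq_sketch *wq, void *key)
{
	if (wq)
		wake_up_sketch(wq, key);
}
```

Folding the check into the helper means both call sites in dcache.c could pass
the possibly-NULL pointer straight through instead of repeating the test.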


* Re: [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
  2025-02-12  0:25           ` Al Viro
@ 2025-02-12  1:46             ` NeilBrown
  0 siblings, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-12  1:46 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Wed, 12 Feb 2025, Al Viro wrote:
> On Wed, Feb 12, 2025 at 10:35:41AM +1100, NeilBrown wrote:
> 
> > Without lockdep making the dentry extra large, struct dentry is 192
> > bytes, exactly 3 cache lines.  There are 16 entries per 4K slab.
> > Now exactly 1/4 of possible indices are used.
> > For every group of 16 possible indices, only 0, 4, 8, 12 are used.
> > slabinfo says the object size is 256 which explains some of the spread. 
> 
> Interesting...
> 
> root@cannonball:~# grep -w dentry /proc/slabinfo
> dentry            1370665 1410864    192   21    1 : tunables    0    0    0 : slabdata  67184  67184      0
> 
> Where does that 256 come from?  The above is on amd64, with 6.1-based debian
> kernel and I see the same object size on other boxen (with local configs).

I found that SLUB_DEBUG and redzoning do that.  Disabling the debug brings
it down to 192 bytes and 21 per slab, which is what you see.  That is still
only a 33% hit rate.

> 
> > I don't think there is a good case here for selecting bits from the
> > middle of the dentry address.
> > 
> > If I use hash_ptr(dentry, 8) I get a more uniform distribution.  64000
> > entries would hope for 250 per bucket.  Median is 248.  Range is 186 to
> > 324 so +/- 25%.
> > 
> > Maybe that is the better choice.
> 
> That's really interesting, considering the implications for m_hash() and mp_hash()
> (see fs/namespace.c)...

Those functions add in the next set of bits as well - effectively mixing
in more bits from the page address.  If I do that the spread is better
but there are still buckets with close to twice the median, though most
are +/- 30%.

> 
> > > > > 3) the dance with conditional __wake_up() is worth a helper, IMO.
> > > 
> > > I mean an inlined helper function.
> > 
> > Yes.. Of course...
> > 
> > Maybe we should put
> > 
> > static inline void wake_up_key(struct wait_queue_head *wq, void *key)
> > {
> > 	__wake_up(wq, TASK_NORMAL, 0, key);
> > }
> > 
> > in include/linux/wait.h to avoid the __wake_up() "internal" name, and
> > then use
> > 	wake_up_key(d_wait, dentry);
> > in the two places in dcache.c, or did you want something
> > dcache-specific?
> 
> More like
> 	if (wq)
> 		__wake_up(wq, TASK_NORMAL, 0, key);
> probably...
> 

Thanks,
NeilBrown




* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-07 20:22   ` Al Viro
  2025-02-08 23:18     ` Al Viro
@ 2025-02-12  4:49     ` NeilBrown
  1 sibling, 0 replies; 83+ messages in thread
From: NeilBrown @ 2025-02-12  4:49 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sat, 08 Feb 2025, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> > lookup_and_lock() combines locking the directory and performing a lookup
> > prior to a change to the directory.
> > Abstracting this prepares for changing the locking requirements.
> > 
> > done_lookup_and_lock() provides the inverse of putting the dentry and
> > unlocking.
> > 
> > For "silly_rename" we will need to lookup_and_lock() in a directory that
> > is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
> 
> Ewww...  I do realize that such things might appear in intermediate
> stages of locking massage, but they'd better be _GONE_ by the end of it.
> Conditional locking of that sort is really asking for trouble.
> 
> If nothing else, better split the function in two variants and document
> the differences; that kind of stuff really does not belong in arguments.
> If you need it to exist through the series, that is - if not, you should
> just leave lookup_one_qstr() for the "locked" case from the very beginning.

That's what I did at first, but then when I realised I had to pass the
lookup flags around everywhere....  Will revert.
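
One way to picture the split into two variants, as a userspace sketch only.
The directory type, its boolean "lock" and all function names are hypothetical;
the point is that each function has one fixed, documented locking effect
instead of a flag that changes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical directory with a trivial, non-recursive lock. */
struct dir_sketch {
	bool locked;
	int lookups;
};

/* Caller must already hold the directory lock; lock state is unchanged. */
static int lookup_locked_sketch(struct dir_sketch *dir)
{
	assert(dir->locked);
	return ++dir->lookups;
}

/* Takes the lock, then looks up; returns with the lock held. */
static int lookup_and_lock_sketch(struct dir_sketch *dir)
{
	assert(!dir->locked);
	dir->locked = true;
	return lookup_locked_sketch(dir);
}

/* Inverse of lookup_and_lock_sketch(). */
static void done_lookup_and_lock_sketch(struct dir_sketch *dir)
{
	assert(dir->locked);
	dir->locked = false;
}
```

A caller that already holds the lock (the silly-rename case) would call the
"locked" variant directly, so no reader has to trace a flags argument to know
whether a given call locks anything.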

> 
> > This functionality is exported as lookup_and_lock_one() which takes a
> > name and len rather than a qstr.
> 
> ... for the sake of ...?

nfsd.

Thanks,
NeilBrown


* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-08 23:18     ` Al Viro
@ 2025-02-12  5:22       ` NeilBrown
  2025-02-12 15:51         ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: NeilBrown @ 2025-02-12  5:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sun, 09 Feb 2025, Al Viro wrote:
> On Fri, Feb 07, 2025 at 08:22:35PM +0000, Al Viro wrote:
> > On Thu, Feb 06, 2025 at 04:42:45PM +1100, NeilBrown wrote:
> > > lookup_and_lock() combines locking the directory and performing a lookup
> > > prior to a change to the directory.
> > > Abstracting this prepares for changing the locking requirements.
> > > 
> > > done_lookup_and_lock() provides the inverse of putting the dentry and
> > > unlocking.
> > > 
> > > For "silly_rename" we will need to lookup_and_lock() in a directory that
> > > is already locked.  For this purpose we add LOOKUP_PARENT_LOCKED.
> > 
> > Ewww...  I do realize that such things might appear in intermediate
> > stages of locking massage, but they'd better be _GONE_ by the end of it.
> > Conditional locking of that sort is really asking for trouble.
> > 
> > If nothing else, better split the function in two variants and document
> > the differences; that kind of stuff really does not belong in arguments.
> > If you need it to exist through the series, that is - if not, you should
> > just leave lookup_one_qstr() for the "locked" case from the very beginning.
> 
> The same, BTW, applies to more than LOOKUP_PARENT_LOCKED part.
> 
> One general observation: if the locking behaviour of a function depends
> upon the flags passed to it, it's going to cause massive headache afterwards.
> 
> If you need to bother with data flow analysis to tell what given call will
> do, expect trouble.
> 
> If anything, I would rather have separate lookup_for_removal(), etc., each
> with its locking effects explicitly spelled out.  Incidentally, looking

lookup_for_removal() etc would only be temporarily needed.  Eventually
(I hope) we would get to a place where all filesystems support all
operations with only a shared lock.  When we get there,
lookup_for_removal() and lookup_for_create() would be identical again.

And the difference wouldn't be that one takes a shared lock and the
other takes an exclusive lock.  It would be that one takes a shared or
exclusive lock based on flag X stored somewhere (inode, inode_operations,
...) while the other takes a shared or exclusive lock based on flag Y.

It would be nice to be able to accelerate that and push the locking down
into the filesystems call at once as Linus suggested last time:

https://lore.kernel.org/all/CAHk-=whz69y=98udgGB5ujH6bapYuapwfHS2esWaFrKEoi9-Ow@mail.gmail.com/

That would require either adding a new rwsem to each inode, possibly in
the filesystem-private part of the inode, or changing VFS to not lock
the inode at all.  The first would be unwelcome by fs developers I
expect, the second would be a serious challenge.  I started thinking
about it and quickly decided I had enough challenges already.

So I think we need some way for the VFS to determine and select the lock
type required by the filesystem.  Christian suggested a flag in
inode_operations and I think that is a good idea.  I originally suggested
a flag in the superblock, but Linus suggested different operations might
want different locking (same email linked above).

But I don't think we can get away without having conditional locking
(like we already do in open_last_lookup() depending on O_CREAT).

Thanks,
NeilBrown


* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-12  5:22       ` NeilBrown
@ 2025-02-12 15:51         ` Al Viro
  2025-02-12 20:11           ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-12 15:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Wed, Feb 12, 2025 at 04:22:16PM +1100, NeilBrown wrote:

> lookup_for_removal() etc would only be temporarily needed.  Eventually
> (I hope) we would get to a place where all filesystems support all
> operations with only a shared lock.  When we get there,
> lookup_for_removal() and lookup_for_create() would be identical again.
> 
> And the difference wouldn't be that one takes a shared lock and the
> other takes an exclusive lock.  It would be that one takes a shared or
> exclusive lock based on flag X stored somewhere (inode, inode_operations,
> ...) while the other takes a shared or exclusive lock based on flag Y.
> 
> It would be nice to be able to accelerate that and push the locking down
> into the filesystems call at once as Linus suggested last time:
> 
> https://lore.kernel.org/all/CAHk-=whz69y=98udgGB5ujH6bapYuapwfHS2esWaFrKEoi9-Ow@mail.gmail.com/
> 
> That would require either adding a new rwsem to each inode, possibly in
> the filesystem-private part of the inode, or changing VFS to not lock
> the inode at all.  The first would be unwelcome by fs developers I
> expect, the second would be a serious challenge.  I started thinking
> about it and quickly decided I had enough challenges already.

I think it's the wrong way to go.

Your "in-update" state does make sense, but it doesn't go far enough
and it's not really about parallel anything - it's simply "this
dentry is nailed down <here> with <this> name for now".

And _that_ is really useful, provided that it's reliable.  What we
need to avoid is d_drop()/d_rehash() windows, when that "operated
upon" dentry ceases to be visible.

Currently we can do that, provided that parent is held exclusive.
Any lookup will hit dcache miss and proceed to lookup_slow()
path, which will block on attempt to get the parent shared.

As soon as you switch to holding parent shared, that pattern becomes
a source of problems.

And if we deal with that, there's not much reason to nest this
dentry lock inside ->i_rwsem.  Then ->i_rwsem would become easy
to push inside the methods.

Right now the fundamental problem with your locking is that you
get dentry locks sandwiched between ->i_rwsem on parents and that
on children.  We can try to be clever with how we acquire them
(have ->d_parent rechecked before going to sleep, etc.), but
that's rather brittle.

_IF_ we push them outside of ->i_rwsem, the role of ->i_rwsem
would shrink to protecting (1) the directory internal representation,
(2) emptiness checks and (3) link counts.

What goes away is "we are holding it exclusive, so anything that
comes here with dcache miss won't get around to doing anything
until we unlock".


* Re: [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
  2025-02-12 15:51         ` Al Viro
@ 2025-02-12 20:11           ` Al Viro
  0 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-12 20:11 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Wed, Feb 12, 2025 at 03:51:32PM +0000, Al Viro wrote:

> And _that_ is really useful, provided that it's reliable.  What we
> need to avoid is d_drop()/d_rehash() windows, when that "operated
> upon" dentry ceases to be visible.

... which is easier to do these days - NFS doesn't do it anymore
(AFS still does, though).  There's also a bit of magical mystery shite
in exfat_lookup()...

IIRC, we used to have something similar in VFAT as well, and it
had been bloody bogus...

Actually, this one is worse - this
               /*
                * Unhashed alias is able to exist because of revalidate()
                * called by lookup_fast. You can easily make this status
                * by calling create and lookup concurrently
                * In such case, we reuse an alias instead of new dentry
                */
in there is utter nonsense - exfat_d_revalidate() never tells you to
drop positive dentries, to start with.  Check for disconnected stuff
is also bogus (reasoning in "vfat: simplify checks in vfat_lookup()"
applies), d_drop(dentry) is pointless (->lookup() argument is not
hashed), for directories we don't give a rat's arse whether it's
hashed or not (d_splice_alias() will DTRT) and for non-directories
the next case in there (d_move() and return alias) will work,
hashed or unhashed.

Now, the case of alias dentry being locked is interesting (both for
exfat and vfat)...


* Re: [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc
  2025-02-08  1:30   ` Al Viro
  2025-02-08  1:35     ` Al Viro
@ 2025-02-12 21:22     ` Al Viro
  1 sibling, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-12 21:22 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Sat, Feb 08, 2025 at 01:30:43AM +0000, Al Viro wrote:
> HOWEVER, if you do not bother with doing that before ->d_unalias_trylock()
> (and there's no reason to do that), the whole thing becomes much simpler -
> you can do the check inside __d_move(), after all locks had been taken.
> 
> After
>         spin_lock_nested(&dentry->d_lock, 2);
>         spin_lock_nested(&target->d_lock, 3);
> you have everything stable.  Just make the sucker return bool instead
> of void, check that crap and have it return false if there's a problem.

... except that this requires telling __d_move() that it's an unalias -
on a normal move the dentries will have been locked by the caller.  Might
make sense to turn that bool exchange argument into an enum...

Let me play with that a bit and see what falls out...
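
A minimal userspace sketch of the enum idea above, not kernel code.  The
enum, its members, struct toy_dentry and toy_d_move() are all hypothetical
names, and the "is either dentry locked for an operation" test is only an
assumption about what the check inside __d_move() would look at:

```c
#include <stdbool.h>

/* Hypothetical replacement for the "bool exchange" argument. */
enum toy_move_mode {
	TOY_MOVE,	/* normal move: caller has already locked both dentries */
	TOY_EXCHANGE,	/* what exchange == true requests today */
	TOY_UNALIAS,	/* unalias path: nobody locked them for us, so check */
};

/* Toy stand-in for struct dentry; only what the sketch needs. */
struct toy_dentry {
	const char *name;
	bool op_locked;	/* assumed marker: "locked for a directory operation" */
};

/*
 * Returns false instead of moving when called for an unalias and either
 * dentry is locked by someone else.  In the kernel the check would sit
 * after both d_lock spinlocks are taken, where the state is stable.
 */
static bool toy_d_move(struct toy_dentry *dentry, struct toy_dentry *target,
		       enum toy_move_mode mode)
{
	const char *old_name = dentry->name;

	if (mode == TOY_UNALIAS && (dentry->op_locked || target->op_locked))
		return false;

	dentry->name = target->name;		/* dentry takes over target's name */
	if (mode == TOY_EXCHANGE)
		target->name = old_name;	/* exchange: target gets the old one */
	return true;
}
```

With a plain bool the function cannot tell "locked by my caller" (fine on a
normal move) from "locked by somebody else" (a problem on unalias); the
third enum value is what carries that distinction.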

* Re: [PATCH 19/19] nfs: switch to _async for all directory ops.
  2025-02-06  5:42 ` [PATCH 19/19] nfs: switch to _async for all directory ops NeilBrown
@ 2025-02-13  3:51   ` Al Viro
  2025-02-13  4:09     ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-13  3:51 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 06, 2025 at 04:42:56PM +1100, NeilBrown wrote:
>  nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>  {
>  	static unsigned int sillycounter;
> @@ -447,7 +451,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>  	struct dentry *sdentry;
>  	struct inode *inode = d_inode(dentry);
>  	struct rpc_task *task;
> -	int            error = -EBUSY;
> +	struct dentry *base;
> +	int error = -EBUSY;
>  
>  	dfprintk(VFS, "NFS: silly-rename(%pd2, ct=%d)\n",
>  		dentry, d_count(dentry));
> @@ -461,10 +466,11 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>  
>  	fileid = NFS_FILEID(d_inode(dentry));
>  
> +	base = d_find_alias(dir);

Huh?  That would better be dentry->d_parent and all operations are in
that directory, so you don't even need to grab a reference...

>  	sdentry = NULL;
>  	do {
>  		int slen;
> -		dput(sdentry);
> +
>  		sillycounter++;
>  		slen = scnprintf(silly, sizeof(silly),
>  				SILLYNAME_PREFIX "%0*llx%0*x",
> @@ -474,14 +480,19 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
>  		dfprintk(VFS, "NFS: trying to rename %pd to %s\n",
>  				dentry, silly);
>  
> -		sdentry = lookup_one_len(silly, dentry->d_parent, slen);
> -		/*
> -		 * N.B. Better to return EBUSY here ... it could be
> -		 * dangerous to delete the file while it's in use.
> -		 */
> -		if (IS_ERR(sdentry))
> -			goto out;
> -	} while (d_inode(sdentry) != NULL); /* need negative lookup */
> +		sdentry = lookup_and_lock_one(NULL, silly, slen,
> +					      base,
> +					      LOOKUP_CREATE | LOOKUP_EXCL
> +					      | LOOKUP_RENAME_TARGET
> +					      | LOOKUP_PARENT_LOCKED);
> +	} while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST); /* need negative lookup */

What's wrong with sdentry == ERR_PTR(-EEXIST)?
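
To illustrate why the two spellings are interchangeable, here is a small
userspace sketch.  The helpers are simplified re-implementations modelled on
include/linux/err.h, written for illustration only; retry_as_posted() and
retry_as_suggested() are hypothetical names for the two loop conditions:

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_ERRNO 4095

/* Simplified userspace copies of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/* The loop condition as posted in the patch. */
static bool retry_as_posted(const void *sdentry)
{
	return PTR_ERR_OR_ZERO(sdentry) == -EEXIST;
}

/* The direct comparison suggested in this reply. */
static bool retry_as_suggested(const void *sdentry)
{
	return sdentry == ERR_PTR(-EEXIST);
}
```

Both predicates are true only for the -EEXIST error pointer and false for
any other error pointer, for NULL, and for a valid pointer, so the shorter
comparison loses nothing.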

* Re: [PATCH 19/19] nfs: switch to _async for all directory ops.
  2025-02-13  3:51   ` Al Viro
@ 2025-02-13  4:09     ` Al Viro
  2025-02-13 18:01       ` Al Viro
  0 siblings, 1 reply; 83+ messages in thread
From: Al Viro @ 2025-02-13  4:09 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 13, 2025 at 03:51:16AM +0000, Al Viro wrote:
> On Thu, Feb 06, 2025 at 04:42:56PM +1100, NeilBrown wrote:
> >  nfs_sillyrename(struct inode *dir, struct dentry *dentry)
> >  {
> >  	static unsigned int sillycounter;
> > @@ -447,7 +451,8 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
> >  	struct dentry *sdentry;
> >  	struct inode *inode = d_inode(dentry);
> >  	struct rpc_task *task;
> > -	int            error = -EBUSY;
> > +	struct dentry *base;
> > +	int error = -EBUSY;
> >  
> >  	dfprintk(VFS, "NFS: silly-rename(%pd2, ct=%d)\n",
> >  		dentry, d_count(dentry));
> > @@ -461,10 +466,11 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
> >  
> >  	fileid = NFS_FILEID(d_inode(dentry));
> >  
> > +	base = d_find_alias(dir);
> 
> Huh?  That would better be dentry->d_parent and all operations are in
> that directory, so you don't even need to grab a reference...
> 
> >  	sdentry = NULL;
> >  	do {
> >  		int slen;
> > -		dput(sdentry);
> > +
> >  		sillycounter++;
> >  		slen = scnprintf(silly, sizeof(silly),
> >  				SILLYNAME_PREFIX "%0*llx%0*x",
> > @@ -474,14 +480,19 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
> >  		dfprintk(VFS, "NFS: trying to rename %pd to %s\n",
> >  				dentry, silly);
> >  
> > -		sdentry = lookup_one_len(silly, dentry->d_parent, slen);
> > -		/*
> > -		 * N.B. Better to return EBUSY here ... it could be
> > -		 * dangerous to delete the file while it's in use.
> > -		 */
> > -		if (IS_ERR(sdentry))
> > -			goto out;
> > -	} while (d_inode(sdentry) != NULL); /* need negative lookup */
> > +		sdentry = lookup_and_lock_one(NULL, silly, slen,
> > +					      base,
> > +					      LOOKUP_CREATE | LOOKUP_EXCL
> > +					      | LOOKUP_RENAME_TARGET
> > +					      | LOOKUP_PARENT_LOCKED);
> > +	} while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST); /* need negative lookup */
> 
> What's wrong with sdentry == ERR_PTR(-EEXIST)?

BTW, do you need to mess with NFS_DATA_BLOCKED with that thing in place?

* Re: [PATCH 19/19] nfs: switch to _async for all directory ops.
  2025-02-13  4:09     ` Al Viro
@ 2025-02-13 18:01       ` Al Viro
  0 siblings, 0 replies; 83+ messages in thread
From: Al Viro @ 2025-02-13 18:01 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christian Brauner, Jan Kara, Linus Torvalds, Jeff Layton,
	Dave Chinner, linux-fsdevel, linux-kernel

On Thu, Feb 13, 2025 at 04:09:31AM +0000, Al Viro wrote:
> > > +	} while (PTR_ERR_OR_ZERO(sdentry) == -EEXIST); /* need negative lookup */
> > 
> > What's wrong with sdentry == ERR_PTR(-EEXIST)?
> 
> BTW, do you need to mess with NFS_DATA_BLOCKED with that thing in place?

That'd be NFS_FSDATA_BLOCKED, of course, and apparently it's still needed for
the "not busy, not sillyrenaming" cases in rename and unlink...

Never mind, just looking into getting rid of d_drop/d_rehash on the AFS side
of things.

Thread overview: 83+ messages
2025-02-06  5:42 [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory NeilBrown
2025-02-06  5:42 ` [PATCH 01/19] VFS: introduce vfs_mkdir_return() NeilBrown
2025-02-06 12:24   ` Christian Brauner
2025-02-06 23:52     ` NeilBrown
2025-02-06 13:52   ` Jeff Layton
2025-02-06 23:57     ` NeilBrown
2025-02-07 19:45   ` Al Viro
2025-02-10  4:36     ` NeilBrown
2025-02-06  5:42 ` [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
2025-02-07 19:32   ` Al Viro
2025-02-10  4:58     ` NeilBrown
2025-02-10  5:15       ` Al Viro
2025-02-11 23:35         ` NeilBrown
2025-02-12  0:25           ` Al Viro
2025-02-12  1:46             ` NeilBrown
2025-02-06  5:42 ` [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() and rename it NeilBrown
2025-02-06 14:30   ` Jeff Layton
2025-02-07  0:04     ` NeilBrown
2025-02-07  0:23       ` Jeff Layton
2025-02-07 20:01   ` Al Viro
2025-02-06  5:42 ` [PATCH 04/19] VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry NeilBrown
2025-02-06 12:31   ` Christian Brauner
2025-02-06 13:09     ` Christian Brauner
2025-02-07  0:08       ` NeilBrown
2025-02-06  5:42 ` [PATCH 05/19] VFS: add common error checks to lookup_one_qstr() NeilBrown
2025-02-06 12:33   ` Christian Brauner
2025-02-07 20:14   ` Al Viro
2025-02-09 20:23   ` Al Viro
2025-02-06  5:42 ` [PATCH 06/19] VFS: repack DENTRY_ flags NeilBrown
2025-02-06 12:34   ` (subset) " Christian Brauner
2025-02-06  5:42 ` [PATCH 07/19] VFS: repack LOOKUP_ bit flags NeilBrown
2025-02-06 12:44   ` Christian Brauner
2025-02-07  0:24     ` NeilBrown
2025-02-06 12:54   ` (subset) " Christian Brauner
2025-02-06  5:42 ` [PATCH 08/19] VFS: introduce lookup_and_lock() and friends NeilBrown
2025-02-06 13:49   ` Christian Brauner
2025-02-07  1:28     ` NeilBrown
2025-02-07 20:22   ` Al Viro
2025-02-08 23:18     ` Al Viro
2025-02-12  5:22       ` NeilBrown
2025-02-12 15:51         ` Al Viro
2025-02-12 20:11           ` Al Viro
2025-02-12  4:49     ` NeilBrown
2025-02-06  5:42 ` [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations NeilBrown
2025-02-06 13:15   ` Christian Brauner
2025-02-07  1:46     ` NeilBrown
2025-02-07 22:41   ` Al Viro
2025-02-09  1:09     ` Al Viro
2025-02-09  4:57       ` Al Viro
2025-02-06  5:42 ` [PATCH 10/19] VFS: introduce inode flags to report locking needs for directory ops NeilBrown
2025-02-06 13:22   ` Christian Brauner
2025-02-07  2:01     ` NeilBrown
2025-02-06  5:42 ` [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use for create/remove operations NeilBrown
2025-02-08  1:38   ` Al Viro
2025-02-09  6:40   ` Al Viro
2025-02-06  5:42 ` [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock updates NeilBrown
2025-02-06  5:42 ` [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with rename etc NeilBrown
2025-02-07 20:28   ` Al Viro
2025-02-07 20:35     ` Al Viro
2025-02-08  1:30   ` Al Viro
2025-02-08  1:35     ` Al Viro
2025-02-12 21:22     ` Al Viro
2025-02-06  5:42 ` [PATCH 14/19] VFS: Ensure no async updates happening in directory being removed NeilBrown
2025-02-06 14:06   ` Christian Brauner
2025-02-07  2:17     ` NeilBrown
2025-02-07 21:06   ` Al Viro
2025-02-08 22:06     ` Al Viro
2025-02-08 22:30       ` Linus Torvalds
2025-02-08 22:34         ` Linus Torvalds
2025-02-08 23:25         ` Al Viro
2025-02-06  5:42 ` [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when possible NeilBrown
2025-02-06  5:42 ` [PATCH 16/19] VFS: add lookup_and_lock_rename() NeilBrown
2025-02-07 21:21   ` Al Viro
2025-02-06  5:42 ` [PATCH 17/19] nfsd: use lookup_and_lock_one() and lookup_and_lock_rename_one() NeilBrown
2025-02-06  5:42 ` [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async NeilBrown
2025-02-06  5:42 ` [PATCH 19/19] nfs: switch to _async for all directory ops NeilBrown
2025-02-13  3:51   ` Al Viro
2025-02-13  4:09     ` Al Viro
2025-02-13 18:01       ` Al Viro
2025-02-06 14:36 ` [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory Christian Brauner
2025-02-06 15:36 ` John Stoffel
2025-02-07  2:18   ` NeilBrown
2025-02-09 23:33 ` Al Viro