linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories
@ 2025-06-09  7:34 NeilBrown
  2025-06-09  7:34 ` [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
                   ` (7 more replies)
  0 siblings, 8 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

These patches are still under development; in particular, proper
documentation is needed.  They are sufficient to demonstrate my design.

They add an alternate mechanism for providing the locking that the VFS
needs for directory operations.  This includes:
 - only one operation per name at a time
 - no operations in a directory being removed
 - no concurrent cross-directory renames which might result in an
    ancestor loop

I had originally hoped to push the locking of i_rw_sem down into the
filesystems and have the new locking on top of that.  This turned out to
be impractical.  This series leaves the i_rw_sem locking where it is,
introduces new locking that happens while the directory is locked, and
gives the filesystem the option of disabling (most of) the i_rw_sem
locking.  Once all filesystems are converted the i_rw_sem locking can be
removed.

A shared lock on i_rw_sem is still used for readdir and simple lookup,
so that they can be excluded while rmdir is happening.

The problem with pushing i_rw_sem down is that I still want to use it to
exclude readdir while rmdir is happening.  Some readdir implementations
use the result to prime the dcache, which means creating d_in_lookup()
dentries in the directory.  Since this happens while holding i_rw_sem,
it is not safe to take i_rw_sem while holding a d_in_lookup() dentry.
So i_rw_sem CANNOT be taken after a lookup has been performed - it must
be taken before, or never.

Another issue is that after taking i_rw_sem in rmdir() I need to wait
for any dentries that are still locked.  Waiting for the dentry lock
while holding i_rw_sem means we cannot take i_rw_sem after getting a
dentry lock.

So we take i_rw_sem for filesystems that still require it (initially
all) but still do the other locking which will be uncontended.  This
exercises the code to help ensure it is ready when we remove the
i_rw_sem requirement for any given filesystem.

The central feature is a per-dentry lock implemented with a couple of
d_flags and wait_var_event/wake_up_var.  A single thread can take 1,
sometimes 2, occasionally 3 locks on different dentries.
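As a rough userspace model of that scheme (this is NOT the kernel code:
a pthread mutex stands in for ->d_lock, a condvar stands in for
wait_var_event()/wake_up_var(), and both flag names are made up for
illustration):

```c
#include <pthread.h>
#include <stdbool.h>

#define D_NAME_LOCKED  0x1u	/* someone holds the per-name lock */
#define D_LOCK_WAITER  0x2u	/* at least one thread is waiting  */

struct dentry_model {
	unsigned flags;
	pthread_mutex_t d_lock;	/* models the ->d_lock spinlock */
	pthread_cond_t wait_var;/* models wait_var_event/wake_up_var */
};

static void dentry_model_lock(struct dentry_model *d)
{
	pthread_mutex_lock(&d->d_lock);
	while (d->flags & D_NAME_LOCKED) {
		d->flags |= D_LOCK_WAITER;	/* ask for a wakeup */
		pthread_cond_wait(&d->wait_var, &d->d_lock);
	}
	d->flags |= D_NAME_LOCKED;
	pthread_mutex_unlock(&d->d_lock);
}

static void dentry_model_unlock(struct dentry_model *d)
{
	bool wake;

	pthread_mutex_lock(&d->d_lock);
	wake = d->flags & D_LOCK_WAITER;	/* only wake if contended */
	d->flags &= ~(D_NAME_LOCKED | D_LOCK_WAITER);
	pthread_mutex_unlock(&d->d_lock);
	if (wake)
		pthread_cond_broadcast(&d->wait_var);
}
```

A thread that finds the lock held sets the waiter bit before sleeping,
so an uncontended unlock can skip the wakeup path entirely - the same
optimisation the patches aim for.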

A second lock is needed for rename - we lock the two dentries in
address-order after confirming there is no hierarchical relationship.
It is also needed for silly-rename as part of unlink.  In this case the
plan is for the second dentry to always be a d_in_lookup() dentry so the
lock is guaranteed to be uncontended.  I'm not sure I have finished that
yet.
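The address-order rule for the two-dentry rename case can be sketched in
userspace C (a plain pthread mutex stands in for the per-dentry lock;
the struct and function names are illustrative only):

```c
#include <pthread.h>
#include <stdint.h>

struct dmodel {
	pthread_mutex_t lock;	/* stands in for the per-dentry lock */
};

/* Lock two dentries for a cross-directory rename.  The caller has
 * already confirmed there is no ancestor relationship, so taking the
 * locks in address order prevents two concurrent renames from
 * deadlocking against each other. */
static void lock_pair_in_address_order(struct dmodel *a, struct dmodel *b)
{
	if (a == b) {
		pthread_mutex_lock(&a->lock);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {
		/* normalise: lower address is always locked first */
		struct dmodel *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}
```

Whichever order the caller names the dentries in, both locks end up
taken in the same global order.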

The three-dentry case is a rename which results in a silly-rename of the
target.

For rmdir we introduce S_DYING so that marking a directory S_DEAD
becomes a two-stage process.  We mark it S_DYING, which prevents more
dentry locks from being taken, then we wait for the locks that were
already taken, then set S_DEAD.
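The two-stage idea can be modelled in a few lines of userspace C (the
names and the busy-wait are illustrative stand-ins, not the kernel
implementation, which sleeps on the dentry locks instead):

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { DIR_DYING = 1, DIR_DEAD = 2 };	/* model S_DYING / S_DEAD */

struct dir_model {
	atomic_int state;	 /* 0, DIR_DYING, or DIR_DYING|DIR_DEAD */
	atomic_int active_locks; /* per-name locks currently granted */
};

static bool dir_model_trylock_name(struct dir_model *d)
{
	if (atomic_load(&d->state) & DIR_DYING)
		return false;		/* no new locks once dying */
	atomic_fetch_add(&d->active_locks, 1);
	/* re-check: "rmdir" may have set DYING while we were counting */
	if (atomic_load(&d->state) & DIR_DYING) {
		atomic_fetch_sub(&d->active_locks, 1);
		return false;
	}
	return true;
}

static void dir_model_unlock_name(struct dir_model *d)
{
	atomic_fetch_sub(&d->active_locks, 1);
}

static void dir_model_kill(struct dir_model *d)
{
	atomic_fetch_or(&d->state, DIR_DYING);	/* stage 1: refuse new locks */
	while (atomic_load(&d->active_locks) > 0)
		;				/* drain existing holders */
	atomic_fetch_or(&d->state, DIR_DEAD);	/* stage 2: fully dead */
}
```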

For rename ...  maybe just read the patch.  I tried to explain it
thoroughly.

The goal is to perform create/remove/rename without any mutex/semaphore
held by the VFS.  This will allow concurrent operations in a directory
and prepare the way for async operation so that e.g.  io_uring could be
given a list of many names in a directory to unlink and it could unlink
them in parallel.  We probably need to make changes to the locking on
the inode being removed before this can be fully achieved - I haven't
explored that in detail yet.

Thanks,
NeilBrown




* [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel()
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-09  7:34 ` [PATCH 2/8] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

d_alloc_parallel() currently requires a wait_queue_head to be passed in.
This must have a lifetime which extends until the lookup is completed.

Future proposed patches will use d_alloc_parallel() for names being
created/unlinked etc.  Some filesystems combine lookup with create,
making a longer code path through which the wq must remain alive.  If it
is still to be allocated on-stack, this can be cumbersome.

This patch replaces the on-stack wqs with a global array of wqs which
are used as needed.  A wq is NOT allocated when a dentry is first
created but only when a second thread attempts to use the same name and
so is forced to wait.  At this moment a wq is chosen using a hash of the
dentry pointer and that wq is assigned to ->d_wait.  The ->d_lock is
then dropped and the task waits.

When the dentry is finally moved out of "in_lookup" a wake up is only
sent if ->d_wait is not NULL.  This avoids an (uncontended) spin
lock/unlock which saves a couple of atomic operations in a common case.

The wake up passes the dentry that the wake up is for as the "key", and
the wake function will only wake processes waiting on the same key.
This means that when these global waitqueues are shared (which is
inevitable though unlikely to be frequent), a task will not be woken
prematurely.
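The table-index part of that can be sketched in isolation.  The
multiplier below is a stand-in for the kernel's hash_ptr(); only the
shape of the calculation (pointer in, small table index out, same
pointer always mapping to the same slot) is the point:

```c
#include <stdint.h>

#define WQ_BITS  8			/* models PAR_LOOKUP_WQ_BITS */
#define WQ_COUNT (1u << WQ_BITS)	/* models PAR_LOOKUP_WQS     */

/* Pick a wait-queue slot for a dentry by hashing its address.
 * Fibonacci hashing: multiply by a 64-bit golden-ratio constant and
 * keep the top WQ_BITS bits, which mix well for pointer inputs. */
static unsigned wq_index_for(const void *dentry)
{
	uint64_t h = (uint64_t)(uintptr_t)dentry * 0x9E3779B97F4A7C15ull;

	return (unsigned)(h >> (64 - WQ_BITS));
}
```

Because two dentries can hash to the same slot, wakeups must carry the
dentry as a key (as d_wait_wake_fn() does in the patch) so that sharing
a slot never wakes the wrong waiter.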

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst |  6 +++
 fs/afs/dir_silly.c                    |  4 +-
 fs/dcache.c                           | 78 ++++++++++++++++++++++-----
 fs/fuse/readdir.c                     |  3 +-
 fs/namei.c                            |  6 +--
 fs/nfs/dir.c                          |  3 +-
 fs/nfs/unlink.c                       |  3 +-
 fs/proc/base.c                        |  3 +-
 fs/proc/proc_sysctl.c                 |  3 +-
 fs/proc/self.c                        |  3 +-
 fs/proc/thread_self.c                 |  4 +-
 fs/smb/client/readdir.c               |  3 +-
 include/linux/dcache.h                |  3 +-
 include/linux/nfs_xdr.h               |  1 -
 14 files changed, 82 insertions(+), 41 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 2cdd9e9ad7f9..385ca21e230e 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1265,3 +1265,9 @@ d_splice_alias() now works on a hashed dentry.  d_drop() need not, and
 must not, be called before d_splice_alias().  In general d_drop() must
 not be called in a directory-modifying operation until the operation has
 completed - typically when it completes with failure.
+---
+
+**mandatory**
+
+d_alloc_parallel() no longer requires a waitqueue_head.  It uses one
+from an internal table when needed.
diff --git a/fs/afs/dir_silly.c b/fs/afs/dir_silly.c
index b0504bd45fa2..68e38429cf49 100644
--- a/fs/afs/dir_silly.c
+++ b/fs/afs/dir_silly.c
@@ -237,13 +237,11 @@ int afs_silly_iput(struct dentry *dentry, struct inode *inode)
 	struct dentry *alias;
 	int ret;
 
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-
 	_enter("%p{%pd},%llx", dentry, dentry, vnode->fid.vnode);
 
 	down_read(&dvnode->rmdir_lock);
 
-	alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
+	alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name);
 	if (IS_ERR(alias)) {
 		up_read(&dvnode->rmdir_lock);
 		return 0;
diff --git a/fs/dcache.c b/fs/dcache.c
index 246b00d3a2fb..c21122ccea4f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2121,8 +2121,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		return found;
 	}
 	if (d_in_lookup(dentry)) {
-		found = d_alloc_parallel(dentry->d_parent, name,
-					dentry->d_wait);
+		found = d_alloc_parallel(dentry->d_parent, name);
 		if (IS_ERR(found) || !d_in_lookup(found)) {
 			iput(inode);
 			return found;
@@ -2132,7 +2131,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
 		if (!found) {
 			iput(inode);
 			return ERR_PTR(-ENOMEM);
-		} 
+		}
 	}
 	res = d_splice_alias(inode, found);
 	if (res) {
@@ -2489,6 +2488,46 @@ void d_rehash(struct dentry * entry)
 }
 EXPORT_SYMBOL(d_rehash);
 
+#define	PAR_LOOKUP_WQ_BITS	8
+#define PAR_LOOKUP_WQS (1 << PAR_LOOKUP_WQ_BITS)
+static wait_queue_head_t par_wait_table[PAR_LOOKUP_WQS] __cacheline_aligned;
+
+static int __init par_wait_init(void)
+{
+	int i;
+
+	for (i = 0; i < PAR_LOOKUP_WQS; i++)
+		init_waitqueue_head(&par_wait_table[i]);
+	return 0;
+}
+fs_initcall(par_wait_init);
+
+struct par_wait_key {
+	struct dentry *de;
+	struct wait_queue_entry wqe;
+};
+
+static int d_wait_wake_fn(struct wait_queue_entry *wq_entry,
+			  unsigned mode, int sync, void *key)
+{
+	struct par_wait_key *pwk = container_of(wq_entry,
+						 struct par_wait_key, wqe);
+	if (pwk->de == key)
+		return default_wake_function(wq_entry, mode, sync, key);
+	return 0;
+}
+
+static inline void d_wake_waiters(struct wait_queue_head *d_wait,
+				  struct dentry *dentry)
+{
+	/* ->d_wait is only set if some thread is actually waiting.
+	 * If we find it is NULL - the common case - then there was no
+	 * contention and there are no waiters to be woken.
+	 */
+	if (d_wait)
+		__wake_up(d_wait, TASK_NORMAL, 0, dentry);
+}
+
 static inline unsigned start_dir_add(struct inode *dir)
 {
 	preempt_disable_nested();
@@ -2501,31 +2540,41 @@ static inline unsigned start_dir_add(struct inode *dir)
 }
 
 static inline void end_dir_add(struct inode *dir, unsigned int n,
-			       wait_queue_head_t *d_wait)
+			       wait_queue_head_t *d_wait, struct dentry *de)
 {
 	smp_store_release(&dir->i_dir_seq, n + 2);
 	preempt_enable_nested();
-	if (wq_has_sleeper(d_wait))
-		wake_up_all(d_wait);
+	d_wake_waiters(d_wait, de);
 }
 
 static void d_wait_lookup(struct dentry *dentry)
 {
 	if (d_in_lookup(dentry)) {
-		DECLARE_WAITQUEUE(wait, current);
-		add_wait_queue(dentry->d_wait, &wait);
+		struct par_wait_key wk = {
+			.de = dentry,
+			.wqe = {
+				.private = current,
+				.func = d_wait_wake_fn,
+			},
+		};
+		struct wait_queue_head *wq;
+		if (!dentry->d_wait)
+			dentry->d_wait = &par_wait_table[hash_ptr(dentry,
+								  PAR_LOOKUP_WQ_BITS)];
+		wq = dentry->d_wait;
+		add_wait_queue(wq, &wk.wqe);
 		do {
 			set_current_state(TASK_UNINTERRUPTIBLE);
 			spin_unlock(&dentry->d_lock);
 			schedule();
 			spin_lock(&dentry->d_lock);
 		} while (d_in_lookup(dentry));
+		remove_wait_queue(wq, &wk.wqe);
 	}
 }
 
 struct dentry *d_alloc_parallel(struct dentry *parent,
-				const struct qstr *name,
-				wait_queue_head_t *wq)
+				const struct qstr *name)
 {
 	unsigned int hash = name->hash;
 	struct hlist_bl_head *b = in_lookup_hash(parent, hash);
@@ -2622,7 +2671,8 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
 	rcu_read_unlock();
 	/* we can't take ->d_lock here; it's OK, though. */
 	new->d_flags |= DCACHE_PAR_LOOKUP;
-	new->d_wait = wq;
+	/* Don't set a wait_queue until someone is actually waiting */
+	new->d_wait = NULL;
 	hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
 	hlist_bl_unlock(b);
 	return new;
@@ -2660,7 +2710,7 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
 void __d_lookup_unhash_wake(struct dentry *dentry)
 {
 	spin_lock(&dentry->d_lock);
-	wake_up_all(__d_lookup_unhash(dentry));
+	d_wake_waiters(__d_lookup_unhash(dentry), dentry);
 	spin_unlock(&dentry->d_lock);
 }
 EXPORT_SYMBOL(__d_lookup_unhash_wake);
@@ -2692,7 +2742,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
 			   (DCACHE_LRU_LIST|DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST)
 		this_cpu_dec(nr_dentry_negative);
 	if (dir)
-		end_dir_add(dir, n, d_wait);
+		end_dir_add(dir, n, d_wait, dentry);
 	spin_unlock(&dentry->d_lock);
 	if (inode)
 		spin_unlock(&inode->i_lock);
@@ -2858,7 +2908,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
 	write_seqcount_end(&dentry->d_seq);
 
 	if (dir)
-		end_dir_add(dir, n, d_wait);
+		end_dir_add(dir, n, d_wait, target);
 
 	if (dentry->d_parent != old_parent)
 		spin_unlock(&dentry->d_parent->d_lock);
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index c2aae2eef086..f588252891af 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -160,7 +160,6 @@ static int fuse_direntplus_link(struct file *file,
 	struct inode *dir = d_inode(parent);
 	struct fuse_conn *fc;
 	struct inode *inode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	int epoch;
 
 	if (!o->nodeid) {
@@ -197,7 +196,7 @@ static int fuse_direntplus_link(struct file *file,
 	dentry = d_lookup(parent, &name);
 	if (!dentry) {
 retry:
-		dentry = d_alloc_parallel(parent, &name, &wq);
+		dentry = d_alloc_parallel(parent, &name);
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 	}
diff --git a/fs/namei.c b/fs/namei.c
index afba27fcf94e..0703568339d3 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1970,13 +1970,12 @@ static struct dentry *__lookup_slow(const struct qstr *name,
 {
 	struct dentry *dentry, *old;
 	struct inode *inode = dir->d_inode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	/* Don't go there if it's already dead */
 	if (unlikely(IS_DEADDIR(inode)))
 		return ERR_PTR(-ENOENT);
 again:
-	dentry = d_alloc_parallel(dir, name, &wq);
+	dentry = d_alloc_parallel(dir, name);
 	if (IS_ERR(dentry))
 		return dentry;
 	if (unlikely(!d_in_lookup(dentry))) {
@@ -3953,7 +3952,6 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	struct dentry *dentry;
 	int error, create_error = 0;
 	umode_t mode = op->mode;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 
 	if (unlikely(IS_DEADDIR(dir_inode)))
 		return ERR_PTR(-ENOENT);
@@ -3962,7 +3960,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	dentry = d_lookup(dir, &nd->last);
 	for (;;) {
 		if (!dentry) {
-			dentry = d_alloc_parallel(dir, &nd->last, &wq);
+			dentry = d_alloc_parallel(dir, &nd->last);
 			if (IS_ERR(dentry))
 				return dentry;
 		}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 311e517f822f..b435c3b627af 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -727,7 +727,6 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 		unsigned long dir_verifier)
 {
 	struct qstr filename = QSTR_INIT(entry->name, entry->len);
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct dentry *dentry;
 	struct dentry *alias;
 	struct inode *inode;
@@ -756,7 +755,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
 	dentry = d_lookup(parent, &filename);
 again:
 	if (!dentry) {
-		dentry = d_alloc_parallel(parent, &filename, &wq);
+		dentry = d_alloc_parallel(parent, &filename);
 		if (IS_ERR(dentry))
 			return;
 	}
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index 2db8839b16ff..a67df3ae74ab 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -124,7 +124,7 @@ static int nfs_call_unlink(struct dentry *dentry, struct inode *inode, struct nf
 	struct dentry *alias;
 
 	down_read_non_owner(&NFS_I(dir)->rmdir_sem);
-	alias = d_alloc_parallel(dentry->d_parent, &data->args.name, &data->wq);
+	alias = d_alloc_parallel(dentry->d_parent, &data->args.name);
 	if (IS_ERR(alias)) {
 		up_read_non_owner(&NFS_I(dir)->rmdir_sem);
 		return 0;
@@ -185,7 +185,6 @@ nfs_async_unlink(struct dentry *dentry, const struct qstr *name)
 
 	data->cred = get_current_cred();
 	data->res.dir_attr = &data->dir_attr;
-	init_waitqueue_head(&data->wq);
 
 	status = -EBUSY;
 	spin_lock(&dentry->d_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index c667702dc69b..6a847ce8d718 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2129,8 +2129,7 @@ bool proc_fill_cache(struct file *file, struct dir_context *ctx,
 
 	child = try_lookup_noperm(&qname, dir);
 	if (!child) {
-		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-		child = d_alloc_parallel(dir, &qname, &wq);
+		child = d_alloc_parallel(dir, &qname);
 		if (IS_ERR(child))
 			goto end_instantiate;
 		if (d_in_lookup(child)) {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index cc9d74a06ff0..9f1088f138f4 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -693,8 +693,7 @@ static bool proc_sys_fill_cache(struct file *file,
 
 	child = d_lookup(dir, &qname);
 	if (!child) {
-		DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-		child = d_alloc_parallel(dir, &qname, &wq);
+		child = d_alloc_parallel(dir, &qname);
 		if (IS_ERR(child))
 			return false;
 		if (d_in_lookup(child)) {
diff --git a/fs/proc/self.c b/fs/proc/self.c
index b034f28506c9..08f7ea161eea 100644
--- a/fs/proc/self.c
+++ b/fs/proc/self.c
@@ -37,11 +37,10 @@ static unsigned self_inum __ro_after_init;
 int proc_setup_self(struct super_block *s)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(s);
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct dentry *self;
 	int ret = -ENOMEM;
 
-	self = d_alloc_parallel(s->s_root, &QSTR_HASH(s->s_root, "self"), &wq);
+	self = d_alloc_parallel(s->s_root, &QSTR_HASH(s->s_root, "self"));
 	if (self && lock_and_check_dentry(self, s->s_root)) {
 		struct inode *inode = new_inode(s);
 		if (inode) {
diff --git a/fs/proc/thread_self.c b/fs/proc/thread_self.c
index 34bc7dd07632..6b08bd83490b 100644
--- a/fs/proc/thread_self.c
+++ b/fs/proc/thread_self.c
@@ -38,13 +38,11 @@ static unsigned thread_self_inum __ro_after_init;
 int proc_setup_thread_self(struct super_block *s)
 {
 	struct proc_fs_info *fs_info = proc_sb_info(s);
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	struct dentry *thread_self;
 	int ret = -ENOMEM;
 
 	thread_self = d_alloc_parallel(s->s_root,
-				       &QSTR_HASH(s->s_root,"thread-self"),
-				       &wq);
+				       &QSTR_HASH(s->s_root,"thread-self"));
 	if (thread_self && lock_and_check_dentry(thread_self, s->s_root)) {
 		struct inode *inode = new_inode(s);
 		if (inode) {
diff --git a/fs/smb/client/readdir.c b/fs/smb/client/readdir.c
index f9f11cbf89be..ae1c60efc475 100644
--- a/fs/smb/client/readdir.c
+++ b/fs/smb/client/readdir.c
@@ -74,7 +74,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
 	struct cifs_sb_info *cifs_sb = CIFS_SB(sb);
 	bool posix = cifs_sb_master_tcon(cifs_sb)->posix_extensions;
 	bool reparse_need_reval = false;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
 	int rc;
 
 	cifs_dbg(FYI, "%s: for %s\n", __func__, name->name);
@@ -106,7 +105,7 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
 		    (fattr->cf_flags & CIFS_FATTR_NEED_REVAL))
 			return;
 
-		dentry = d_alloc_parallel(parent, name, &wq);
+		dentry = d_alloc_parallel(parent, name);
 	}
 	if (IS_ERR(dentry))
 		return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 8c9ad0da0e02..2b8f8641e1f8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -244,8 +244,7 @@ extern void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op
 /* allocate/de-allocate */
 extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
 extern struct dentry * d_alloc_anon(struct super_block *);
-extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
-					wait_queue_head_t *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
 extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
 extern struct dentry * d_add_ci(struct dentry *, struct inode *, struct qstr *);
 extern bool d_same_name(const struct dentry *dentry, const struct dentry *parent,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 67f6632f723b..d1c9c569a03e 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1732,7 +1732,6 @@ struct nfs_unlinkdata {
 	struct nfs_removeargs args;
 	struct nfs_removeres res;
 	struct dentry *dentry;
-	wait_queue_head_t wq;
 	const struct cred *cred;
 	struct nfs_fattr dir_attr;
 	long timeout;
-- 
2.49.0



* [PATCH 2/8] VFS: use d_alloc_parallel() in lookup_one_qstr_excl().
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
  2025-06-09  7:34 ` [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-09  7:34 ` [PATCH 3/8] fs/proc: take rcu_read_lock() in proc_sys_compare() NeilBrown
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

lookup_one_qstr_excl() is used for lookups prior to directory
modifications, whether create, unlink, rename, or whatever.

To prepare for allowing modification to happen in parallel, change
lookup_one_qstr_excl() to use d_alloc_parallel() and change its name to
remove the _excl requirement.

If LOOKUP_EXCL or LOOKUP_RENAME_TARGET is passed, the caller must ensure
d_lookup_done() is called at an appropriate time, and must not assume
that it can test for positive or negative dentries without confirming
that the dentry is no longer d_in_lookup() - unless it is filesystem
code acting on itself and *knows* that ->lookup() always completes the
lookup (currently true for all filesystems other than NFS).

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst | 11 +++++
 fs/namei.c                            | 58 ++++++++++++++++++---------
 2 files changed, 51 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 385ca21e230e..1feeac9e1483 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1271,3 +1271,14 @@ completed - typically when it completes with failure.
 
 d_alloc_parallel() no longer requires a waitqueue_head.  It uses one
 from an internal table when needed.
+
+---
+
+**mandatory**
+
+All lookup_and_lock* functions may return a d_in_lookup() dentry if
+passed LOOKUP_EXCL or LOOKUP_RENAME_TARGET.  dentry_unlock() calls
+the necessary d_lookup_done().  If the caller *knows* which filesystem
+is being used, it may know that this is not possible.  Otherwise it must
+be careful testing if the dentry is positive or negative as the lookup
+may not have been performed yet.
diff --git a/fs/namei.c b/fs/namei.c
index 0703568339d3..e1749fb03cb5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1666,17 +1666,18 @@ static struct dentry *lookup_dcache(const struct qstr *name,
 }
 
 /*
- * Parent directory has inode locked exclusive.  This is one
- * and only case when ->lookup() gets called on non in-lookup
- * dentries - as the matter of fact, this only gets called
- * when directory is guaranteed to have no in-lookup children
- * at all.
+ * Parent directory has inode locked.
+ * d_lookup_done() must be called before the dentry is dput()
+ * if LOOKUP_EXCL or LOOKUP_RENAME_TARGET is set.
+ * If the dentry is not d_in_lookup():
  * Will return -ENOENT if name isn't found and LOOKUP_CREATE wasn't passed.
  * Will return -EEXIST if name is found and LOOKUP_EXCL was passed.
+ * If it is d_in_lookup() then these conditions will be checked when
+ * the attempt is made to create or otherwise act on the name.
  */
-static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
-					   struct dentry *base,
-					   unsigned int flags)
+static struct dentry *lookup_one_qstr(const struct qstr *name,
+				      struct dentry *base,
+				      unsigned int flags)
 {
 	struct dentry *dentry;
 	struct dentry *old;
@@ -1691,18 +1692,27 @@ static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
 	if (unlikely(IS_DEADDIR(dir)))
 		return ERR_PTR(-ENOENT);
 
-	dentry = d_alloc(base, name);
-	if (unlikely(!dentry))
-		return ERR_PTR(-ENOMEM);
+	dentry = d_alloc_parallel(base, name);
+	if (unlikely(IS_ERR(dentry)))
+		return dentry;
+	if (unlikely(!d_in_lookup(dentry)))
+		/* Raced with another thread which did the lookup */
+		goto found;
 
 	old = dir->i_op->lookup(dir, dentry, flags);
 	if (unlikely(old)) {
+		d_lookup_done(dentry);
 		dput(dentry);
 		dentry = old;
 	}
 found:
 	if (IS_ERR(dentry))
 		return dentry;
+	if (d_in_lookup(dentry))
+		/* We cannot check for errors - the caller will have to
+		 * wait for any create-etc attempt to get relevant errors.
+		 */
+		return dentry;
 	if (d_is_negative(dentry) && !(flags & LOOKUP_CREATE)) {
 		dput(dentry);
 		return ERR_PTR(-ENOENT);
@@ -1725,7 +1735,10 @@ static struct dentry *lookup_one_qstr_excl(const struct qstr *name,
  * The "necessary locks" are currently the inode node lock on @base.
  * The name @last is expected to already have the hash calculated.
  * No permission checks are performed.
+ *
  * Returns: the dentry, suitably locked, or an ERR_PTR().
+ *    The dentry may be d_in_lookup() if LOOKUP_EXCL or LOOKUP_RENAME_TARGET
+ *    is given, depending on the filesystem.
  */
 struct dentry *lookup_and_lock_hashed(struct qstr *last,
 				      struct dentry *base,
@@ -1735,7 +1748,7 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
 
 	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 
-	dentry = lookup_one_qstr_excl(last, base, lookup_flags);
+	dentry = lookup_one_qstr(last, base, lookup_flags);
 	if (IS_ERR(dentry))
 		inode_unlock(base->d_inode);
 	return dentry;
@@ -1755,7 +1768,7 @@ struct dentry *lookup_and_lock_noperm_locked(struct qstr *last,
 	if (err < 0)
 		return ERR_PTR(err);
 
-	return lookup_one_qstr_excl(last, base, lookup_flags);
+	return lookup_one_qstr(last, base, lookup_flags);
 }
 EXPORT_SYMBOL(lookup_and_lock_noperm_locked);
 
@@ -1770,7 +1783,10 @@ EXPORT_SYMBOL(lookup_and_lock_noperm_locked);
  * The "necessary locks" are currently the inode node lock on @base.
  * The name @last is NOT expected to have the hash calculated.
  * No permission checks are performed.
+ *
  * Returns: the dentry, suitably locked, or an ERR_PTR().
+ *    The dentry may be d_in_lookup() if LOOKUP_EXCL or LOOKUP_RENAME_TARGET
+ *    is given, depending on the filesystem.
  */
 struct dentry *lookup_and_lock_noperm(struct qstr *last,
 				      struct dentry *base,
@@ -1798,7 +1814,10 @@ EXPORT_SYMBOL(lookup_and_lock_noperm);
  * The "necessary locks" are currently the inode node lock on @base.
  * The name @last is NOT expected to already have the hash calculated.
  * Permission checks are performed to ensure %MAY_EXEC access to @base.
+ *
  * Returns: the dentry, suitably locked, or an ERR_PTR().
+ *    The dentry may be d_in_lookup() if LOOKUP_EXCL or LOOKUP_RENAME_TARGET
+ *    is given, depending on the filesystem.
  */
 struct dentry *lookup_and_lock(struct mnt_idmap *idmap,
 			       struct qstr *last,
@@ -1826,7 +1845,10 @@ EXPORT_SYMBOL(lookup_and_lock);
  * If a fatal signal arrives or is already pending the operation is aborted.
  * The name @last is NOT expected to already have the hash calculated.
  * Permission checks are performed to ensure %MAY_EXEC access to @base.
+ *
  * Returns: the dentry, suitably locked, or an ERR_PTR().
+ *    The dentry may be d_in_lookup() if LOOKUP_EXCL or LOOKUP_RENAME_TARGET
+ *    is given, depending on the filesystem.
  */
 struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 					struct qstr *last,
@@ -1843,7 +1865,7 @@ struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 	if (err < 0)
 		return ERR_PTR(err);
 
-	dentry = lookup_one_qstr_excl(last, base, lookup_flags);
+	dentry = lookup_one_qstr(last, base, lookup_flags);
 	if (IS_ERR(dentry))
 		inode_unlock(base->d_inode);
 	return dentry;
@@ -1894,6 +1916,7 @@ EXPORT_SYMBOL(dentry_unlock_dir_locked);
 void dentry_unlock(struct dentry *dentry)
 {
 	if (!IS_ERR(dentry)) {
+		d_lookup_done(dentry);
 		inode_unlock(dentry->d_parent->d_inode);
 		dentry_unlock_dir_locked(dentry);
 	}
@@ -3500,8 +3523,7 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 		p = lock_rename(rd->old_dir, rd->new_dir);
 		dget(rd->old_dir);
 
-		d1 = lookup_one_qstr_excl(&rd->old_last, rd->old_dir,
-					  lookup_flags);
+		d1 = lookup_one_qstr(&rd->old_last, rd->old_dir, lookup_flags);
 		if (IS_ERR(d1))
 			goto out_unlock_1;
 	}
@@ -3513,8 +3535,8 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 		}
 		d2 = dget(rd->new_dentry);
 	} else {
-		d2 = lookup_one_qstr_excl(&rd->new_last, rd->new_dir,
-				  lookup_flags | target_flags);
+		d2 = lookup_one_qstr(&rd->new_last, rd->new_dir,
+				     lookup_flags | target_flags);
 		if (IS_ERR(d2))
 			goto out_unlock_2;
 	}
-- 
2.49.0



* [PATCH 3/8] fs/proc: take rcu_read_lock() in proc_sys_compare()
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
  2025-06-09  7:34 ` [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
  2025-06-09  7:34 ` [PATCH 2/8] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-09  7:34 ` [PATCH 4/8] VFS: Add ability to exclusively lock a dentry and use for open/create NeilBrown
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

proc_sys_compare() is the ->d_compare function for /proc/sys.
It uses rcu_dereference() which assumes the RCU read lock is held and
can complain if it isn't.

However there is no guarantee that this lock is held by d_same_name()
(the caller of ->d_compare).  In particularly d_alloc_parallel() calls
d_same_name() after rcu_read_unlock().

So this patch calls rcu_read_lock() before accessing the inode (which
seems to be the focus of RCU protection here), and drops it afterwards.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/proc/proc_sysctl.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 9f1088f138f4..e3d9f36b6699 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -916,19 +916,23 @@ static int proc_sys_compare(const struct dentry *dentry,
 {
 	struct ctl_table_header *head;
 	struct inode *inode;
+	int ret;
 
 	/* Although proc doesn't have negative dentries, rcu-walk means
 	 * that inode here can be NULL */
 	/* AV: can it, indeed? */
+	rcu_read_lock();
 	inode = d_inode_rcu(dentry);
-	if (!inode)
-		return 1;
-	if (name->len != len)
-		return 1;
-	if (memcmp(name->name, str, len))
-		return 1;
-	head = rcu_dereference(PROC_I(inode)->sysctl);
-	return !head || !sysctl_is_seen(head);
+	if (!inode ||
+	    name->len != len ||
+	    memcmp(name->name, str, len)) {
+		ret = 1;
+	} else {
+		head = rcu_dereference(PROC_I(inode)->sysctl);
+		ret = !head || !sysctl_is_seen(head);
+	}
+	rcu_read_unlock();
+	return ret;
 }
 
 static const struct dentry_operations proc_sys_dentry_operations = {
-- 
2.49.0



* [PATCH 4/8] VFS: Add ability to exclusively lock a dentry and use for open/create
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
                   ` (2 preceding siblings ...)
  2025-06-09  7:34 ` [PATCH 3/8] fs/proc: take rcu_read_lock() in proc_sys_compare() NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-09  7:34 ` [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow NeilBrown
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

dentry_lock(), dentry_trylock(), __dentry_unlock() are added which can be
used to get an exclusive lock on a dentry in preparation for a directory
operation on the name.  It is planned for this to be used to provide
name-based exclusive access to create (mkdir, mknod, symlink, link,
create) a name, to remove (unlink, rmdir) a name, to rename between two
names (rename) or to open (atomic_open) a name.  This will allow i_rwsem
to be deprecated for directories.

Note that lookup does not require this exclusion as it is only performed
on d_in_lookup() dentries which already provide exclusive access.

As contention on a name is rare, this locking is optimised for the
uncontended case.  A bit is set under the d_lock spinlock to claim the
lock, and wait_var_event_spinlock() is used when waiting is needed.  To
avoid sending a wakeup when none is needed, a second bit flag indicates
whether there are any waiters.

Once the exclusive lock is obtained on the dentry, we must make sure it
wasn't unlinked or renamed while we slept.  If it was, dentry_lock()
will fail so the (in-cache) lookup can be repeated.
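
The claim/wait/wake scheme above can be modelled in user space.  The sketch
below is illustrative only: the model_* names, the MODEL_* bits and the
wakeups counter are hypothetical stand-ins for DCACHE_LOCK,
DCACHE_LOCK_WAITER and the wait_var_event machinery, with the ->d_lock
spinlock elided since the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the two-bit dentry lock. */
#define MODEL_LOCK        (1u << 0)   /* stands in for DCACHE_LOCK */
#define MODEL_LOCK_WAITER (1u << 1)   /* stands in for DCACHE_LOCK_WAITER */

struct dentry_model {
	unsigned int d_flags;
	int wakeups;		/* counts wakeups actually sent */
};

/* Fast path: claim the lock bit if free; otherwise record a waiter
 * so the eventual unlock knows a wakeup is required. */
static bool model_trylock(struct dentry_model *d)
{
	if (d->d_flags & MODEL_LOCK) {
		d->d_flags |= MODEL_LOCK_WAITER;
		return false;	/* the real caller would now sleep */
	}
	d->d_flags |= MODEL_LOCK;
	return true;
}

/* Unlock: a wakeup is sent only when someone flagged themselves as
 * waiting, keeping the uncontended path to a single flag update. */
static void model_unlock(struct dentry_model *d)
{
	if (d->d_flags & MODEL_LOCK_WAITER)
		d->wakeups++;
	d->d_flags &= ~(MODEL_LOCK | MODEL_LOCK_WAITER);
}
```

An uncontended lock/unlock cycle never touches the waiter bit, so no
wakeup is ever sent for it.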

For this patch the lock is only taken for "open" when a positive dentry
is NOT found.  This is currently uncontended as the i_rwsem is still
held on the directory.  A future patch will make that locking optional.

The notify_create() calls are moved into the region where the dentry is
locked to ensure continuing exclusion.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/dcache.c            |  26 ++++++-
 fs/internal.h          |  20 +++++
 fs/namei.c             | 173 +++++++++++++++++++++++++++++++++++++++--
 include/linux/dcache.h |  11 +++
 4 files changed, 221 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c21122ccea4f..b5cef2963169 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1680,9 +1680,10 @@ EXPORT_SYMBOL(d_invalidate);
  * available. On a success the dentry is returned. The name passed in is
  * copied and the copy passed in may be reused after this call.
  */
- 
+
 static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 {
+	static struct lock_class_key __key;
 	struct dentry *dentry;
 	char *dname;
 	int err;
@@ -1740,6 +1741,8 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	INIT_HLIST_NODE(&dentry->d_sib);
 	d_set_d_op(dentry, dentry->d_sb->s_d_op);
 
+	lockdep_init_map(&dentry->dentry_map, "DCACHE_LOCK", &__key, 0);
+
 	if (dentry->d_op && dentry->d_op->d_init) {
 		err = dentry->d_op->d_init(dentry);
 		if (err) {
@@ -2988,6 +2991,10 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
 	struct rw_semaphore *m2 = NULL;
 	int ret = -ESTALE;
 
+	if (dentry->d_flags & DCACHE_LOCK) {
+		if (!dentry_trylock(alias, NULL, NULL))
+			return ret;
+	}
 	/* If alias and dentry share a parent, then no extra locks required */
 	if (alias->d_parent == dentry->d_parent)
 		goto out_unalias;
@@ -3012,6 +3019,12 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
 		up_read(m2);
 	if (m1)
 		mutex_unlock(m1);
+	if (dentry->d_flags & DCACHE_LOCK) {
+		if (ret)
+			__dentry_unlock(alias);
+		else
+			__dentry_unlock(dentry);
+	}
 	return ret;
 }
 
@@ -3033,6 +3046,9 @@ static int __d_unalias(struct dentry *dentry, struct dentry *alias)
  * If a dentry was found and moved, then it is returned.  Otherwise NULL
  * is returned.  This matches the expected return value of ->lookup.
  *
+ * If a different dentry is returned, any DCACHE_LOCK on the original
+ * dentry will be transferred to the new.
+ *
  * Cluster filesystems may call this function with a negative, hashed dentry.
  * In that case, we know that the inode will be a regular file, and also this
  * will only occur during atomic_open. So we need to check for the dentry
@@ -3074,7 +3090,15 @@ struct dentry *d_splice_alias(struct inode *inode, struct dentry *dentry)
 				}
 				dput(old_parent);
 			} else {
+				if (dentry->d_flags & DCACHE_LOCK)
+					/* This must succeed because IS_ROOT() dentries
+					 * are never locked - except temporarily
+					 * here while rename_lock is held.
+					 */
+					dentry_trylock(new, NULL, NULL);
 				__d_move(new, dentry, false);
+				if (dentry->d_flags & DCACHE_LOCK)
+					__dentry_unlock(dentry);
 				write_sequnlock(&rename_lock);
 			}
 			iput(inode);
diff --git a/fs/internal.h b/fs/internal.h
index 393f6c5c24f6..d40d1d479a46 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -228,6 +228,26 @@ extern struct dentry *__d_lookup_rcu(const struct dentry *parent,
 				const struct qstr *name, unsigned *seq);
 extern void d_genocide(struct dentry *);
 
+extern bool dentry_trylock(struct dentry *dentry,
+			     struct dentry *base,
+			     const struct qstr *last);
+
+/*
+ * The name "dentry_unlock()" is currently used for unlocking the parent
+ * directory.  Once we dispense with locking the dentry, that name can
+ * be used for this function.
+ */
+static inline void __dentry_unlock(struct dentry *dentry)
+{
+	WARN_ON(!(dentry->d_flags & DCACHE_LOCK));
+	lock_map_release(&dentry->dentry_map);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_flags & DCACHE_LOCK_WAITER)
+		wake_up_var_locked(&dentry->d_flags, &dentry->d_lock);
+	dentry->d_flags &= ~(DCACHE_LOCK | DCACHE_LOCK_WAITER);
+	spin_unlock(&dentry->d_lock);
+}
+
 /*
  * pipe.c
  */
diff --git a/fs/namei.c b/fs/namei.c
index e1749fb03cb5..4ad76df21677 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1723,6 +1723,142 @@ static struct dentry *lookup_one_qstr(const struct qstr *name,
 	}
 	return dentry;
 }
+/*
+ * dentry locking for updates.
+ * When modifying a directory the target dentry will be locked by
+ * setting DCACHE_LOCK under ->d_lock.  If it is already set,
+ * DCACHE_LOCK_WAITER is set to ensure a wakeup is sent, and we wait
+ * using wait_var_event_any_lock().
+ * Conceptually the name in the parent is locked, so a dentry that has
+ * no name or parent cannot be locked.  Hence an IS_ROOT() dentry is
+ * never locked.
+ */
+
+static bool check_dentry_locked(struct dentry *de)
+{
+	if (de->d_flags & DCACHE_LOCK) {
+		de->d_flags |= DCACHE_LOCK_WAITER;
+		return true;
+	}
+	return false;
+}
+
+static bool dentry_matches(struct dentry *dentry,
+			   struct dentry *base, const struct qstr *last)
+{
+	if (d_unhashed(dentry) && !d_in_lookup(dentry))
+		/* An unhashed dentry can never be locked */
+		return false;
+	if (d_is_negative(dentry))
+		/* Negative dentries are never unlinked or renamed,
+		 * there is no race we could have lost and no need to check.
+		 */
+		return true;
+	if (!base)
+		/* No matching required */
+		return true;
+	if (dentry->d_parent != base)
+		return false;
+	if (last &&
+	    (dentry->d_name.hash != last->hash ||
+	     !d_same_name(dentry, base, last)))
+		return false;
+	return true;
+}
+
+static bool __dentry_lock(struct dentry *dentry,
+			  struct dentry *base, const struct qstr *last,
+			  unsigned int subclass, int state)
+{
+	int err;
+
+	lock_acquire_exclusive(&dentry->dentry_map, subclass, 0, NULL, _THIS_IP_);
+	spin_lock(&dentry->d_lock);
+	err = wait_var_event_any_lock(&dentry->d_flags,
+				      !check_dentry_locked(dentry),
+				      &dentry->d_lock, spin, state);
+	if (err || !dentry_matches(dentry, base, last)) {
+		lock_map_release(&dentry->dentry_map);
+		spin_unlock(&dentry->d_lock);
+		return false;
+	}
+
+	dentry->d_flags |= DCACHE_LOCK;
+	spin_unlock(&dentry->d_lock);
+	return true;
+}
+
+/**
+ * dentry_lock - lock a dentry in preparation for create/remove/rename
+ * @dentry:	the dentry to be locked
+ * @base:	the parent the dentry must still have after being locked, or NULL
+ * @last:	the name the dentry must still have after being locked, or NULL
+ * @state:	the process state to wait in.
+ *
+ * This function locks a dentry in preparation for create/remove/rename.
+ * While the lock is held no other process will change the parent or name of
+ * this dentry.
+ * The only case where a process might hold locks on two dentries is when
+ * performing a rename operation.  In that case they should be taken in
+ * address order to avoid AB-BA deadlocks.
+ *
+ * Returns: %true if lock was successfully applied, or %false if the process
+ * was signalled and @state allowed the signal, or if @base and @last are given
+ * but the dentry was renamed or unlinked while waiting for the lock.
+ *
+ * If @state is TASK_UNINTERRUPTIBLE and base is NULL the lock will always be
+ * obtained.
+ *
+ * Must not be called while an inode lock is held, except for silly_rename
+ * lookups; this exception will be removed once the transition to
+ * dentry locking is complete.
+ */
+static bool dentry_lock(struct dentry *dentry,
+			struct dentry *base, const struct qstr *last,
+			int state)
+{
+	return __dentry_lock(dentry, base, last, DLOCK_NORMAL, state);
+}
+
+__maybe_unused /* will be used for rename */
+static bool dentry_lock_nested(struct dentry *dentry,
+			       struct dentry *base, const struct qstr *last,
+			       int state)
+{
+	return __dentry_lock(dentry, base, last, DLOCK_RENAME, state);
+}
+
+/**
+ * dentry_trylock - attempt to lock a dentry without waiting
+ * @dentry:	the dentry to be locked
+ * @base:	the parent the dentry must still have after being locked, or NULL
+ * @last:	the name the dentry must still have after being locked, or NULL
+ *
+ * This function locks a dentry in preparation for create/remove/rename if it
+ * is not already locked.
+ *
+ * Returns: %true if the dentry was not locked but now is.  %false if
+ * the dentry was already locked, or if the parent or name were wrong
+ * after finding that the dentry wasn't already locked.
+ */
+
+bool dentry_trylock(struct dentry *dentry,
+		      struct dentry *base,
+		      const struct qstr *last)
+{
+	int ret = false;
+
+	spin_lock(&dentry->d_lock);
+	if (!(dentry->d_flags & DCACHE_LOCK) &&
+	    dentry_matches(dentry, base, last)) {
+		lock_map_acquire_try(&dentry->dentry_map);
+		dentry->d_flags |= DCACHE_LOCK;
+		ret = true;
+	}
+	spin_unlock(&dentry->d_lock);
+
+	return ret;
+}
 
 /**
  * lookup_and_lock_hashed - lookup and lock a name prior to dir ops
@@ -3907,7 +4043,9 @@ static int may_o_create(struct mnt_idmap *idmap,
  * have been updated to point to the new dentry.  This may be negative.
  *
  * Returns an error code otherwise.
- */
+ *
+ * The dentry must be locked (DCACHE_LOCK) and it will be unlocked on return.
+ */
 static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
 				  struct file *file,
 				  int open_flag, umode_t mode)
@@ -3941,6 +4079,9 @@ static struct dentry *atomic_open(struct nameidata *nd, struct dentry *dentry,
 				error = -ENOENT;
 		}
 	}
+	if (!error && (file->f_mode & FMODE_CREATED))
+		fsnotify_create(dir, dentry);
+	__dentry_unlock(dentry);
 	if (error) {
 		dput(dentry);
 		dentry = ERR_PTR(error);
@@ -3979,6 +4120,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		return ERR_PTR(-ENOENT);
 
 	file->f_mode &= ~FMODE_CREATED;
+lookup:
 	dentry = d_lookup(dir, &nd->last);
 	for (;;) {
 		if (!dentry) {
@@ -4002,6 +4144,19 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		/* Cached positive dentry: will open in f_op->open */
 		return dentry;
 	}
+	if (!dentry_lock(dentry, dir, &nd->last,
+			 TASK_UNINTERRUPTIBLE)) {
+		/* Became positive and renamed/removed while we slept */
+		dput(dentry);
+		goto lookup;
+	}
+	if (dentry->d_inode) {
+		/* Became positive while waiting for lock - will open in
+		 * f_op->open
+		 */
+		__dentry_unlock(dentry);
+		return dentry;
+	}
 
 	if (open_flag & O_CREAT)
 		audit_inode(nd->name, dir, AUDIT_INODE_PARENT);
@@ -4021,7 +4176,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 	if (open_flag & O_CREAT) {
 		if (open_flag & O_EXCL)
 			open_flag &= ~O_TRUNC;
-		mode = vfs_prepare_mode(idmap, dir->d_inode, mode, mode, mode);
+		mode = vfs_prepare_mode(idmap, dir_inode, mode, mode, mode);
 		if (likely(got_write))
 			create_error = may_o_create(idmap, &nd->path,
 						    dentry, mode);
@@ -4044,7 +4199,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		if (unlikely(res)) {
 			if (IS_ERR(res)) {
 				error = PTR_ERR(res);
-				goto out_dput;
+				goto out_unlock;
 			}
 			dput(dentry);
 			dentry = res;
@@ -4057,20 +4212,24 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
 		audit_inode_child(dir_inode, dentry, AUDIT_TYPE_CHILD_CREATE);
 		if (!dir_inode->i_op->create) {
 			error = -EACCES;
-			goto out_dput;
+			goto out_unlock;
 		}
 
 		error = dir_inode->i_op->create(idmap, dir_inode, dentry,
 						mode, open_flag & O_EXCL);
 		if (error)
-			goto out_dput;
+			goto out_unlock;
+		fsnotify_create(dir_inode, dentry);
 	}
 	if (unlikely(create_error) && !dentry->d_inode) {
 		error = create_error;
-		goto out_dput;
+		goto out_unlock;
 	}
+	__dentry_unlock(dentry);
 	return dentry;
 
+out_unlock:
+	__dentry_unlock(dentry);
 out_dput:
 	dput(dentry);
 	return ERR_PTR(error);
@@ -4161,8 +4320,6 @@ static const char *open_last_lookups(struct nameidata *nd,
 		inode_lock_shared(dir->d_inode);
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry)) {
-		if (file->f_mode & FMODE_CREATED)
-			fsnotify_create(dir->d_inode, dentry);
 		if (file->f_mode & FMODE_OPENED)
 			fsnotify_open(file);
 	}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2b8f8641e1f8..c150d3f10a3d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -114,6 +114,8 @@ struct dentry {
 					 * possible!
 					 */
 
+	/* lockdep tracking of DCACHE_LOCK locks */
+	struct lockdep_map		dentry_map;
 	union {
 		struct list_head d_lru;		/* LRU list */
 		wait_queue_head_t *d_wait;	/* in-lookup ones only */
@@ -229,6 +231,15 @@ enum dentry_flags {
 #define DCACHE_MANAGED_DENTRY \
 	(DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
+#define DCACHE_LOCK			BIT(27) /* Locked for update */
+#define DCACHE_LOCK_WAITER		BIT(28) /* someone is waiting for LOCK */
+
+/* Nesting levels for DCACHE_LOCK */
+enum {
+	DLOCK_NORMAL,
+	DLOCK_RENAME,		/* dentry with higher address in rename */
+};
+
 extern seqlock_t rename_lock;
 
 /*
-- 
2.49.0



* [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow.
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
                   ` (3 preceding siblings ...)
  2025-06-09  7:34 ` [PATCH 4/8] VFS: Add ability to exclusively lock a dentry and use for open/create NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-10 20:57   ` Al Viro
  2025-06-09  7:34 ` [PATCH 6/8] VFS: provide alternative to s_vfs_rename_mutex NeilBrown
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

Once we support directory operations (e.g. create) without requiring the
parent to be locked, the current practice of locking a directory while
processing rmdir() or similar will not be sufficient to wait for
operations to complete and to block further operations.

This patch introduces a new inode flag, S_DYING.  It indicates that
an rmdir or similar is being processed and new directory operations must
not commence in the directory.  They should not abort either, as the
rmdir might fail; instead they should block.  They can do this by
waiting for a lock on the inode.

A new interface rmdir_lock() locks the inode, sets this flag, and waits
for any children with DCACHE_LOCK set to complete their operation, and
for any d_in_lookup() children to complete the lookup.  It should be
called before attempting to delete the directory or set S_DEAD.  Matching
rmdir_unlock() clears the flag and unlocks the inode.

dentry_lock() and d_alloc_parallel() are changed to block while this
flag is set and to fail if the parent IS_DEADDIR(), though dentry_lock()
doesn't block for d_in_lookup() dentries.
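
The gating this flag provides can be sketched in user space.  The model_*
names below are hypothetical: the real rmdir_lock() additionally takes the
inode lock and waits for locked and in-lookup children, and blocked
operations sleep on the inode lock rather than polling.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of the S_DYING / S_DEAD gating. */
#define MODEL_S_DEAD  (1u << 0)
#define MODEL_S_DYING (1u << 1)

struct dir_model {
	unsigned int i_flags;
};

/* A new directory operation may only start while neither flag is set;
 * while S_DYING is set the real code would block on the inode lock. */
static bool model_op_may_start(const struct dir_model *dir)
{
	return !(dir->i_flags & (MODEL_S_DYING | MODEL_S_DEAD));
}

/* model of rmdir_lock(): mark the directory as dying. */
static void model_rmdir_lock(struct dir_model *dir)
{
	dir->i_flags |= MODEL_S_DYING;
}

/* model of rmdir_unlock(): clear S_DYING; set S_DEAD first if the
 * removal actually succeeded. */
static void model_rmdir_unlock(struct dir_model *dir, bool removed)
{
	if (removed)
		dir->i_flags |= MODEL_S_DEAD;
	dir->i_flags &= ~MODEL_S_DYING;
}
```

If the rmdir fails, clearing S_DYING lets the blocked operations proceed;
if it succeeds, S_DEAD makes them fail permanently.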

Signed-off-by: NeilBrown <neil@brown.name>
---
 Documentation/filesystems/porting.rst |  10 ++
 fs/btrfs/ioctl.c                      |   4 +-
 fs/configfs/dir.c                     |  24 ++---
 fs/dcache.c                           |  26 ++++-
 fs/fuse/dir.c                         |   4 +-
 fs/internal.h                         |   1 +
 fs/libfs.c                            |   8 +-
 fs/namei.c                            | 139 ++++++++++++++++++++++++--
 include/linux/dcache.h                |   1 +
 include/linux/fs.h                    |   1 +
 include/linux/namei.h                 |   3 +
 11 files changed, 191 insertions(+), 30 deletions(-)

diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 1feeac9e1483..10dcdcf8be57 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1282,3 +1282,13 @@ the necessary d_lookup_done().  If the caller *knows* which filesystem
 is being used, it may know that this is not possible.  Otherwise it must
 be careful testing if the dentry is positive or negative as the lookup
 may not have been performed yet.
+
+---
+
+**mandatory**
+
+Code that might set %S_DEAD must use rmdir_lock() on the dentry rather
+than inode_lock() on the inode.  This both locks the inode and waits
+for any locked child dentries to be unlocked.  rmdir_unlock() is then
+called after %S_DEAD has been set, or after it has been decided not
+to set it.
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9a3af4049c60..159b6d3f22c1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2466,9 +2466,9 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 		goto out_unlock;
 	}
 
-	btrfs_inode_lock(BTRFS_I(inode), 0);
+	rmdir_lock(parent);
 	ret = btrfs_delete_subvolume(BTRFS_I(dir), dentry);
-	btrfs_inode_unlock(BTRFS_I(inode), 0);
+	rmdir_unlock(parent);
 	if (!ret)
 		d_delete_notify(dir, dentry);
 
diff --git a/fs/configfs/dir.c b/fs/configfs/dir.c
index ebf32822e29b..965c0e925d17 100644
--- a/fs/configfs/dir.c
+++ b/fs/configfs/dir.c
@@ -11,6 +11,7 @@
 #undef DEBUG
 
 #include <linux/fs.h>
+#include <linux/namei.h>
 #include <linux/fsnotify.h>
 #include <linux/mount.h>
 #include <linux/module.h>
@@ -660,13 +661,11 @@ static void detach_groups(struct config_group *group)
 
 		child = sd->s_dentry;
 
-		inode_lock(d_inode(child));
-
+		rmdir_lock(child);
 		configfs_detach_group(sd->s_element);
 		d_inode(child)->i_flags |= S_DEAD;
 		dont_mount(child);
-
-		inode_unlock(d_inode(child));
+		rmdir_unlock(child);
 
 		d_delete(child);
 		dput(child);
@@ -853,11 +852,11 @@ static int configfs_attach_item(struct config_item *parent_item,
 			 * the VFS may already have hit and used them. Thus,
 			 * we must lock them as rmdir() would.
 			 */
-			inode_lock(d_inode(dentry));
+			rmdir_lock(dentry);
 			configfs_remove_dir(item);
 			d_inode(dentry)->i_flags |= S_DEAD;
 			dont_mount(dentry);
-			inode_unlock(d_inode(dentry));
+			rmdir_unlock(dentry);
 			d_delete(dentry);
 		}
 	}
@@ -894,7 +893,7 @@ static int configfs_attach_group(struct config_item *parent_item,
 		 * We must also lock the inode to remove it safely in case of
 		 * error, as rmdir() would.
 		 */
-		inode_lock_nested(d_inode(dentry), I_MUTEX_CHILD);
+		rmdir_lock(dentry);
 		configfs_adjust_dir_dirent_depth_before_populate(sd);
 		ret = populate_groups(to_config_group(item), frag);
 		if (ret) {
@@ -903,7 +902,7 @@ static int configfs_attach_group(struct config_item *parent_item,
 			dont_mount(dentry);
 		}
 		configfs_adjust_dir_dirent_depth_after_populate(sd);
-		inode_unlock(d_inode(dentry));
+		rmdir_unlock(dentry);
 		if (ret)
 			d_delete(dentry);
 	}
@@ -1806,7 +1805,8 @@ void configfs_unregister_group(struct config_group *group)
 	frag->frag_dead = true;
 	up_write(&frag->frag_sem);
 
-	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
+	rmdir_lock(parent);
+
 	spin_lock(&configfs_dirent_lock);
 	configfs_detach_prep(dentry, NULL);
 	spin_unlock(&configfs_dirent_lock);
@@ -1816,7 +1816,7 @@ void configfs_unregister_group(struct config_group *group)
 	dont_mount(dentry);
 	d_drop(dentry);
 	fsnotify_rmdir(d_inode(parent), dentry);
-	inode_unlock(d_inode(parent));
+	rmdir_unlock(parent);
 
 	dput(dentry);
 
@@ -1952,7 +1952,7 @@ void configfs_unregister_subsystem(struct configfs_subsystem *subsys)
 
 	inode_lock_nested(d_inode(root),
 			  I_MUTEX_PARENT);
-	inode_lock_nested(d_inode(dentry), I_MUTEX_CHILD);
+	rmdir_lock(dentry);
 	mutex_lock(&configfs_symlink_mutex);
 	spin_lock(&configfs_dirent_lock);
 	if (configfs_detach_prep(dentry, NULL)) {
@@ -1963,7 +1963,7 @@ void configfs_unregister_subsystem(struct configfs_subsystem *subsys)
 	configfs_detach_group(&group->cg_item);
 	d_inode(dentry)->i_flags |= S_DEAD;
 	dont_mount(dentry);
-	inode_unlock(d_inode(dentry));
+	rmdir_unlock(dentry);
 
 	d_drop(dentry);
 	fsnotify_rmdir(d_inode(root), dentry);
diff --git a/fs/dcache.c b/fs/dcache.c
index b5cef2963169..78ec5ab58362 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2550,7 +2550,7 @@ static inline void end_dir_add(struct inode *dir, unsigned int n,
 	d_wake_waiters(d_wait, de);
 }
 
-static void d_wait_lookup(struct dentry *dentry)
+void d_wait_lookup(struct dentry *dentry)
 {
 	if (d_in_lookup(dentry)) {
 		struct par_wait_key wk = {
@@ -2672,10 +2672,30 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
 		return dentry;
 	}
 	rcu_read_unlock();
-	/* we can't take ->d_lock here; it's OK, though. */
-	new->d_flags |= DCACHE_PAR_LOOKUP;
+	/*
+	 * We can't take ->d_lock here; that's OK though, as there is no concurrency.
+	 * The barrier ensures rmdir_lock() will see the flag, or we will see S_DYING.
+	 */
+	smp_store_mb(new->d_flags, new->d_flags | DCACHE_PAR_LOOKUP);
 	/* Don't set a wait_queue until someone is actually waiting */
 	new->d_wait = NULL;
+	if (parent->d_inode->i_flags & (S_DYING | S_DEAD)) {
+		new->d_flags &= ~DCACHE_PAR_LOOKUP;
+		hlist_bl_del(&new->d_u.d_in_lookup_hash);
+		hlist_bl_unlock(b);
+		/* rmdir_lock() might be waiting already ! */
+		__d_lookup_unhash_wake(new);
+		if (!(parent->d_inode->i_flags & S_DEAD)) {
+			inode_lock(parent->d_inode);
+			/* S_DYING must be clear now */
+			inode_unlock(parent->d_inode);
+		}
+		if (parent->d_inode->i_flags & S_DEAD) {
+			dput(new);
+			return ERR_PTR(-ENOENT);
+		}
+		goto retry;
+	}
 	hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
 	hlist_bl_unlock(b);
 	return new;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 3118191ea290..3d379678992e 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1427,7 +1427,7 @@ int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid,
 	fuse_invalidate_entry_cache(entry);
 
 	if (child_nodeid != 0 && d_really_is_positive(entry)) {
-		inode_lock(d_inode(entry));
+		rmdir_lock(entry);
 		if (get_node_id(d_inode(entry)) != child_nodeid) {
 			err = -ENOENT;
 			goto badentry;
@@ -1448,7 +1448,7 @@ int fuse_reverse_inval_entry(struct fuse_conn *fc, u64 parent_nodeid,
 		clear_nlink(d_inode(entry));
 		err = 0;
  badentry:
-		inode_unlock(d_inode(entry));
+		rmdir_unlock(entry);
 		if (!err)
 			d_delete(entry);
 	} else {
diff --git a/fs/internal.h b/fs/internal.h
index d40d1d479a46..2462d67d84f6 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -231,6 +231,7 @@ extern void d_genocide(struct dentry *);
 extern bool dentry_trylock(struct dentry *dentry,
 			     struct dentry *base,
 			     const struct qstr *last);
+void d_wait_lookup(struct dentry *dentry);
 
 /*
  * The name "dentry_unlock()" is current used for unlocking the parent
diff --git a/fs/libfs.c b/fs/libfs.c
index 9ea0ecc325a8..cfd85fe7b8ee 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -613,9 +613,13 @@ void simple_recursive_removal(struct dentry *dentry,
 		struct dentry *victim = NULL, *child;
 		struct inode *inode = this->d_inode;
 
-		inode_lock(inode);
-		if (d_is_dir(this))
+		if (d_is_dir(this)) {
+			rmdir_lock(this);
 			inode->i_flags |= S_DEAD;
+			rmdir_unlock(this);
+		}
+
+		inode_lock(inode);
 		while ((child = find_next_child(this, victim)) == NULL) {
 			// kill and ascend
 			// update metadata while it's still locked
diff --git a/fs/namei.c b/fs/namei.c
index 4ad76df21677..c590f25d0d49 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1770,8 +1770,11 @@ static bool __dentry_lock(struct dentry *dentry,
 			  struct dentry *base, const struct qstr *last,
 			  unsigned int subclass, int state)
 {
+	struct dentry *parent;
+	struct inode *dir;
 	int err;
 
+retry:
 	lock_acquire_exclusive(&dentry->dentry_map, subclass, 0, NULL, _THIS_IP_);
 	spin_lock(&dentry->d_lock);
 	err = wait_var_event_any_lock(&dentry->d_flags,
@@ -1782,10 +1785,43 @@ static bool __dentry_lock(struct dentry *dentry,
 		spin_unlock(&dentry->d_lock);
 		return false;
 	}
-
-	dentry->d_flags |= DCACHE_LOCK;
+	parent = dentry->d_parent;
+	/*
+	 * memory barrier ensures rmdir_lock() will see the lock,
+	 * or we will subsequently see S_DYING or S_DEAD
+	 */
+	smp_store_mb(dentry->d_flags, dentry->d_flags | DCACHE_LOCK);
+	/* in-lookup dentries can bypass S_DYING tests because the test
+	 * was done in d_alloc_parallel()
+	 */
+	if (d_in_lookup(dentry) ||
+	    !(parent->d_inode->i_flags & (S_DYING | S_DEAD))) {
+		spin_unlock(&dentry->d_lock);
+		return true;
+	}
+	/* Cannot lock while parent is dying */
+	dentry->d_flags &= ~DCACHE_LOCK;
+	if (IS_DEADDIR(parent->d_inode)) {
+		lock_map_release(&dentry->dentry_map);
+		spin_unlock(&dentry->d_lock);
+		return false;
+	}
+	dir = igrab(parent->d_inode);
+	lock_map_release(&dentry->dentry_map);
 	spin_unlock(&dentry->d_lock);
-	return true;
+
+	if (state == TASK_KILLABLE) {
+		err = down_write_killable(&dir->i_rwsem);
+		if (err) {
+			iput(dir);
+			return false;
+		}
+	} else
+		inode_lock(dir);
+	/* S_DYING must be clear now */
+	inode_unlock(dir);
+	iput(dir);
+	goto retry;
 }
 
 /**
@@ -1860,6 +1896,89 @@ bool dentry_trylock(struct dentry *dentry,
 	return ret;
 }
 
+static bool dentry_wait_locked(struct dentry *dentry)
+{
+	int ret;
+
+	/*
+	 * We might be in rename holding two dentry locks already.
+	 * Here we wait on a child of one of those so use a different
+	 * nesting level.
+	 */
+	lock_acquire_exclusive(&dentry->dentry_map, DLOCK_DYING_WAIT,
+			       0, NULL, _THIS_IP_);
+	spin_lock(&dentry->d_lock);
+	ret = wait_var_event_spinlock(&dentry->d_flags,
+				      !check_dentry_locked(dentry),
+				      &dentry->d_lock);
+	spin_unlock(&dentry->d_lock);
+	lock_map_release(&dentry->dentry_map);
+	return ret == 0;
+}
+
+/** rmdir_lock - wait for all operations in directory to complete
+ * @dentry: dentry for the directory
+ *
+ * When removing a directory it is necessary to wait for pending operations
+ * (e.g. create) in the directory to complete and to block further operations.
+ * rmdir_lock() achieves this by marking the inode as dying and waiting
+ * for any locked children to unlock.
+ *
+ * If removal of the directory is successful, S_DEAD should be set.  In any case
+ * rmdir_unlock() must be called after either success or failure.
+ */
+void rmdir_lock(struct dentry *dentry)
+{
+	struct dentry *child;
+	struct inode *dir = d_inode(dentry);
+
+	inode_lock_nested(dir, I_MUTEX_CHILD);
+
+	/* memory barrier matches that in __dentry_lock() */
+	smp_store_mb(dir->i_flags, dir->i_flags | S_DYING);
+	/*
+	 * Any attempt to lock a child will now block on parent lock.
+	 * Must wait for any locked children to be unlocked
+	 */
+again:
+	spin_lock(&dentry->d_lock);
+	for (child = d_first_child(dentry); child;
+	     child = d_next_sibling(child)) {
+		if (child->d_flags & DCACHE_LOCK) {
+			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+			dget_dlock(child);
+			spin_unlock(&child->d_lock);
+			spin_unlock(&dentry->d_lock);
+			dentry_wait_locked(child);
+			dput(child);
+			goto again;
+		}
+		if (d_in_lookup(child)) {
+			spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+			dget_dlock(child);
+			spin_unlock(&child->d_lock);
+			spin_unlock(&dentry->d_lock);
+			d_wait_lookup(child);
+			dput(child);
+			goto again;
+		}
+	}
+	spin_unlock(&dentry->d_lock);
+}
+EXPORT_SYMBOL(rmdir_lock);
+
+/** rmdir_unlock - remove the block imposed by rmdir_lock()
+ * @dentry: dentry for directory
+ *
+ * Every call to rmdir_lock() must be paired with a call to rmdir_unlock().
+ */
+void rmdir_unlock(struct dentry *dentry)
+{
+	d_inode(dentry)->i_flags &= ~S_DYING;
+	inode_unlock(d_inode(dentry));
+}
+EXPORT_SYMBOL(rmdir_unlock);
+
 /**
  * lookup_and_lock_hashed - lookup and lock a name prior to dir ops
  * @last: the name in the given directory
@@ -4953,7 +5072,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 		return -EPERM;
 
 	dget(dentry);
-	inode_lock(dentry->d_inode);
+	rmdir_lock(dentry);
 
 	error = -EBUSY;
 	if (is_local_mountpoint(dentry) ||
@@ -4974,7 +5093,7 @@ int vfs_rmdir(struct mnt_idmap *idmap, struct inode *dir,
 	detach_mounts(dentry);
 
 out:
-	inode_unlock(dentry->d_inode);
+	rmdir_unlock(dentry);
 	dput(dentry);
 	if (!error)
 		d_delete_notify(dir, dentry);
@@ -5592,10 +5711,10 @@ int vfs_rename(struct renamedata *rd)
 		if (lock_old_subdir)
 			inode_lock_nested(source, I_MUTEX_CHILD);
 		if (target && (!new_is_dir || lock_new_subdir))
-			inode_lock(target);
+			rmdir_lock(new_dentry);
 	} else if (new_is_dir) {
 		if (lock_new_subdir)
-			inode_lock_nested(target, I_MUTEX_CHILD);
+			rmdir_lock(new_dentry);
 		inode_lock(source);
 	} else {
 		lock_two_nondirectories(source, target);
@@ -5647,10 +5766,12 @@ int vfs_rename(struct renamedata *rd)
 			d_exchange(old_dentry, new_dentry);
 	}
 out:
+	if (target && new_is_dir && !(flags & RENAME_EXCHANGE)) // IS THIS CORRECT?
+		rmdir_unlock(new_dentry);
+	else if (target && (!new_is_dir || lock_new_subdir))
+		inode_unlock(target);
 	if (!is_dir || lock_old_subdir)
 		inode_unlock(source);
-	if (target && (!new_is_dir || lock_new_subdir))
-		inode_unlock(target);
 	dput(new_dentry);
 	if (!error) {
 		fsnotify_move(old_dir, new_dir, &old_name.name, is_dir,
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c150d3f10a3d..2c7a2326367b 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -238,6 +238,7 @@ enum dentry_flags {
 enum {
 	DLOCK_NORMAL,
 	DLOCK_RENAME,		/* dentry with higher address in rename */
+	DLOCK_DYING_WAIT,	/* child of a locked parent that is dying */
 };
 
 extern seqlock_t rename_lock;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 24bc29efecd5..d7fa38668987 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2395,6 +2395,7 @@ struct super_operations {
 #define S_CASEFOLD	(1 << 15) /* Casefolded file */
 #define S_VERITY	(1 << 16) /* Verity file (using fs/verity/) */
 #define S_KERNEL_FILE	(1 << 17) /* File is in use by the kernel (eg. fs/cachefiles) */
+#define S_DYING		(1 << 18) /* dir is locked ready to set S_DEAD */
 #define S_ANON_INODE	(1 << 19) /* Inode is an anonymous inode */
 
 /*
diff --git a/include/linux/namei.h b/include/linux/namei.h
index aa5bcdbe705d..6c9b545c60ce 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -90,6 +90,9 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last, struct dentry *base,
 void dentry_unlock(struct dentry *dentry);
 void dentry_unlock_dir_locked(struct dentry *dentry);
 
+void rmdir_lock(struct dentry *dentry);
+void rmdir_unlock(struct dentry *dentry);
+
 extern int follow_down_one(struct path *);
 extern int follow_down(struct path *path, unsigned int flags);
 extern int follow_up(struct path *);
-- 
2.49.0



* [PATCH 6/8] VFS: provide alternative to s_vfs_rename_mutex
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
                   ` (4 preceding siblings ...)
  2025-06-09  7:34 ` [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-09  7:34 ` [PATCH 7/8] VFS: use new dentry locking for create/remove/rename NeilBrown
  2025-06-09  7:34 ` [PATCH 8/8] VFS: allow a filesystem to opt out of directory locking NeilBrown
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

s_vfs_rename_mutex plays a few different roles in managing renames.

 1/ It stabilises all ->d_parent pointers in the filesystem so that
   ancestral relationships between the source and target directories
   can be determined.
 2/ It prevents any other cross-directory rename so that if the
    source and target directories do not have an ancestral relationship
    then they can be locked in any order without risk of AB-BA deadlock
    with another rename.
3/ It prevents any other cross-directory rename so that the validation
   of ancestral relationship between the source and target which ensures
   no loop is created remains in effect until the rename has completed
   in the filesystem.

A future patch will introduce a new locking mechanism that will allow
filesystems to support concurrent directory operations where they do not
directly conflict.  The use of s_vfs_rename_mutex reduces the
opportunity for concurrency, so this patch introduces a new mechanism.
The existing mechanism will remain, but will only be used for
filesystems that haven't advertised support for concurrent operations.

The new mechanism depends on rename_lock and locks against rename only
the ancestors of the two targets below their nearest common ancestor.

A/ rename_lock is used to stabilise all ->d_parent pointers.  While it is
   held the nearest common ancestor of both targets is found.  If this
   is either the source or the destination dentry, an error is reported.
B/ While still holding rename_lock, DCACHE_RENAME_LOCK is set on all
   dentries from the two targets up to, but not including, the common
   ancestor.  If any already have DCACHE_RENAME_LOCK set, we abort
   before setting the flag anywhere, wait for that dentry to lose
   DCACHE_RENAME_LOCK, then try again.

   DCACHE_RENAME_LOCK ensures that the name and parent of the dentry
   are not changed.  Setting DCACHE_RENAME_LOCK on the whole path up to
   the common ancestor ensures that these paths don't change, so the
   ancestral relationship between the two targets doesn't change.  This
   ensures that no other rename can cause this one to create a path loop.

Waiting for DCACHE_RENAME_LOCK to be cleared uses a single
per-filesystem wait_queue_head.  This is signalled only once all
DCACHE_RENAME_LOCK flags have been cleared, which ensures waiters aren't
woken prematurely if they were only waiting on one other rename.

An important consideration when clearing the DCACHE_RENAME_LOCK flags is
that we do this from child to parent.  As we walk up we need to be sure
we have a valid reference to each dentry.  We do, because we hold
counted references to old_dir and new_dentry.  As none of these
directories is empty, they cannot be removed, and they cannot be moved
away as long as we continue to hold rename_lock.  So as long as we hold
rename_lock while clearing all the DCACHE_RENAME_LOCK flags, we have a
usable reference through the chain of ->d_parent links.  As soon as
rename_lock is dropped, the ancestor and everything below it on which we
do not hold an explicit counted reference becomes inaccessible.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/namei.c             | 160 ++++++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h |   3 +
 include/linux/fs.h     |   8 ++-
 3 files changed, 168 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c590f25d0d49..da9ba37f8a93 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3722,6 +3722,151 @@ static void unlock_rename(struct dentry *p1, struct dentry *p2)
 	}
 }
 
+static void wait_not_renaming(struct dentry *d)
+{
+	spin_lock(&d->d_lock);
+	wait_event_cmd(d->d_sb->s_vfs_rename_wq, !(d->d_flags & DCACHE_RENAME_LOCK),
+		       spin_unlock(&d->d_lock), spin_lock(&d->d_lock));
+	spin_unlock(&d->d_lock);
+}
+
+static void renaming_unlock_one(struct dentry *d)
+{
+	spin_lock(&d->d_lock);
+	WARN_ON_ONCE(!(d->d_flags & DCACHE_RENAME_LOCK));
+	d->d_flags &= ~DCACHE_RENAME_LOCK;
+	spin_unlock(&d->d_lock);
+}
+
+static void renaming_unlock(struct dentry *dir1, struct dentry *dir2,
+			    struct dentry *ancestor,
+			    struct dentry *d1, struct dentry *d2)
+{
+	struct dentry *d;
+
+	if (!ancestor)
+		return;
+
+	read_seqlock_excl(&rename_lock);
+	for (d = dir1; d != ancestor; d = d->d_parent)
+		renaming_unlock_one(d);
+	for (d = dir2; d != ancestor; d = d->d_parent)
+		renaming_unlock_one(d);
+	renaming_unlock_one(d1);
+	renaming_unlock_one(d2);
+	read_sequnlock_excl(&rename_lock);
+	/* We don't send any wakeup until all dentries have been unlocked */
+	wake_up(&dir1->d_sb->s_vfs_rename_wq);
+}
+
+/*
+ * Find the nearest common ancestor of d1 and d2, and report
+ * if any of the children of the ancestor which are in the line
+ * to d1 and d2 are locked for rename - have DCACHE_RENAME_LOCK set.
+ */
+static struct dentry *common_ancestor(struct dentry *d1, struct dentry *d2,
+				      struct dentry **locked)
+{
+	struct dentry *p, *q;
+	int depth1 = 0, depth2 = 0;
+
+	*locked = NULL;
+	/* Find depth of each and compare ancestors of equal depth */
+	for (p = d1; !IS_ROOT(p); p = p->d_parent)
+		depth1 += 1;
+	for (q = d2; !IS_ROOT(q); q = q->d_parent)
+		depth2 += 1;
+	if (p != q)
+		/* Different root! */
+		return NULL;
+	while (depth1 > depth2) {
+		if (d1->d_flags & DCACHE_RENAME_LOCK)
+			*locked = d1;
+		d1 = d1->d_parent;
+		depth1 -= 1;
+	}
+	while (depth2 > depth1) {
+		if (d2->d_flags & DCACHE_RENAME_LOCK)
+			*locked = d2;
+		d2 = d2->d_parent;
+		depth2 -= 1;
+	}
+	while (d1 != d2) {
+		if (d1->d_flags & DCACHE_RENAME_LOCK)
+			*locked = d1;
+		if (d2->d_flags & DCACHE_RENAME_LOCK)
+			*locked = d2;
+		d1 = d1->d_parent;
+		d2 = d2->d_parent;
+	}
+	return d1;
+}
+
+/*
+ * Lock all ancestors below nearest common ancestor for rename
+ * Returns:
+ *    -EXDEV if there is no common ancestor
+ *    -EINVAL if the first dentry is the common ancestor -
+ *		you cannot move a directory into a descendant -
+ *    -ENOTEMPTY if the second dentry is the common ancestor -
+ *		the target directory must usually be empty.
+ * Note that these errors might be adjusted for RENAME_EXCHANGE
+ *
+ * The ancestors will be locked against rename, and no directory between
+ * either target and the ancestor will be the ancestor of an active
+ * rename.  This ensures that the common ancestor will continue to be
+ * the common ancestor, and that there will be no concurrent rename with
+ * the same ancestor.
+ */
+static struct dentry *lock_ancestors(struct dentry *d1, struct dentry *d2)
+{
+	struct dentry *locked, *ancestor;
+
+	if (d1->d_parent == d2->d_parent)
+		/* Nothing to lock */
+		return NULL;
+again:
+	read_seqlock_excl(&rename_lock);
+	ancestor = common_ancestor(d1, d2, &locked);
+	if (!ancestor) {
+		read_sequnlock_excl(&rename_lock);
+		return ERR_PTR(-EXDEV);
+	}
+	if (ancestor == d1) {
+		read_sequnlock_excl(&rename_lock);
+		return ERR_PTR(-EINVAL);
+	}
+	if (ancestor == d2) {
+		read_sequnlock_excl(&rename_lock);
+		return ERR_PTR(-ENOTEMPTY);
+	}
+	if (locked) {
+		dget(locked);
+		read_sequnlock_excl(&rename_lock);
+		wait_not_renaming(locked);
+		dput(locked);
+		goto again;
+	}
+	/*
+	 * Nothing from d1,d2 up to ancestor can have DCACHE_RENAME_LOCK
+	 * as we hold rename_lock and nothing was reported in "locked".
+	 */
+	while (d1 != ancestor) {
+		spin_lock(&d1->d_lock);
+		d1->d_flags |= DCACHE_RENAME_LOCK;
+		spin_unlock(&d1->d_lock);
+		d1 = d1->d_parent;
+	}
+	while (d2 != ancestor) {
+		spin_lock(&d2->d_lock);
+		d2->d_flags |= DCACHE_RENAME_LOCK;
+		spin_unlock(&d2->d_lock);
+		d2 = d2->d_parent;
+	}
+	read_sequnlock_excl(&rename_lock);
+	return ancestor;
+}
+
 /**
  * lookup_and_lock_rename_hashed - lookup and lock names for rename
  * @rd:           rename data containing relevant details
@@ -3753,7 +3898,7 @@ static void unlock_rename(struct dentry *p1, struct dentry *p2)
 int
 lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 {
-	struct dentry *p;
+	struct dentry *p, *ancestor;
 	struct dentry *d1, *d2;
 	int target_flags = LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
 	int err;
@@ -3811,8 +3956,17 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 		goto out_unlock_3;
 	}
 
+	ancestor = lock_ancestors(d1, d2);
+	if (IS_ERR(ancestor)) {
+		err = PTR_ERR(ancestor);
+		if (err == -ENOTEMPTY && (rd->flags & RENAME_EXCHANGE))
+			err = -EINVAL;
+		goto out_unlock_3;
+	}
+
 	rd->old_dentry = d1;
 	rd->new_dentry = d2;
+	rd->ancestor = ancestor;
 	return 0;
 
 out_unlock_3:
@@ -3934,6 +4088,10 @@ void dentry_unlock_rename(struct renamedata *rd)
 {
 	d_lookup_done(rd->old_dentry);
 	d_lookup_done(rd->new_dentry);
+
+	renaming_unlock(rd->old_dir, rd->new_dir, rd->ancestor,
+			rd->old_dentry, rd->new_dentry);
+
 	unlock_rename(rd->old_dir, rd->new_dir);
 	dput(rd->old_dir);
 	dput(rd->old_dentry);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2c7a2326367b..27e645d290b1 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -226,6 +226,9 @@ enum dentry_flags {
 	DCACHE_PAR_LOOKUP		= BIT(24),	/* being looked up (with parent locked shared) */
 	DCACHE_DENTRY_CURSOR		= BIT(25),
 	DCACHE_NORCU			= BIT(26),	/* No RCU delay for freeing */
+	DCACHE_RENAME_LOCK		= BIT(27),	/* A directory is being renamed
+							 * into or out-of this tree
+							 */
 };
 
 #define DCACHE_MANAGED_DENTRY \
diff --git a/include/linux/fs.h b/include/linux/fs.h
index d7fa38668987..6b4a1a1f4786 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1402,10 +1402,11 @@ struct super_block {
 	unsigned int		s_max_links;
 
 	/*
-	 * The next field is for VFS *only*. No filesystems have any business
-	 * even looking at it. You had been warned.
+	 * The next two fields are for VFS *only*. No filesystems have any business
+	 * even looking at them. You had been warned.
 	 */
 	struct mutex s_vfs_rename_mutex;	/* Kludge */
+	struct wait_queue_head s_vfs_rename_wq;	/* waiting for DCACHE_RENAME_LOCK to clear */
 
 	/*
 	 * Filesystem subtype.  If non-empty the filesystem type field
@@ -2012,6 +2013,8 @@ int vfs_unlink(struct mnt_idmap *, struct inode *, struct dentry *,
  * @new_dentry:                destination
  * @new_last:          name for new_dentry in new_dir, if new_dentry not given
  * @delegated_inode:   returns an inode needing a delegation break
+ * @ancestor:          Closest common ancestor of @old_dir and @new_dir if those
+ *                     two are different.
  * @flags:             rename flags
  */
 struct renamedata {
@@ -2024,6 +2027,7 @@ struct renamedata {
 	struct dentry *new_dentry;
 	struct qstr new_last;
 	struct inode **delegated_inode;
+	struct dentry *ancestor;
 	unsigned int flags;
 } __randomize_layout;
 
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 7/8] VFS: use new dentry locking for create/remove/rename
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
                   ` (5 preceding siblings ...)
  2025-06-09  7:34 ` [PATCH 6/8] VFS: provide alternative to s_vfs_rename_mutex NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  2025-06-10 20:36   ` Al Viro
  2025-06-09  7:34 ` [PATCH 8/8] VFS: allow a filesystem to opt out of directory locking NeilBrown
  7 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

After taking the directory lock (or locks) we now lock the target
dentries.  This is pointless at present but will allow us to remove the
taking of the directory lock in a future patch.

MORE WORDS

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/afs/dir.c           |   3 +-
 fs/afs/dir_silly.c     |   2 +-
 fs/namei.c             | 140 ++++++++++++++++++++++++++++++++++++-----
 fs/nfs/unlink.c        |   2 +-
 include/linux/dcache.h |   3 +
 include/linux/namei.h  |   4 +-
 6 files changed, 133 insertions(+), 21 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index 421bd044f8c9..da30700c0106 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -943,7 +943,8 @@ static struct dentry *afs_lookup_atsys(struct inode *dir, struct dentry *dentry)
 		}
 
 		strcpy(p, name);
-		ret = lookup_and_lock_noperm_locked(&QSTR(buf), dentry->d_parent);
+		ret = lookup_and_lock_noperm_locked(&QSTR(buf), dentry->d_parent, 0,
+						    DLOCK_SUB_LOOKUP);
 		if (!IS_ERR(ret) && d_is_positive(ret))
 			dget(ret);
 		dentry_unlock(ret);
diff --git a/fs/afs/dir_silly.c b/fs/afs/dir_silly.c
index 68e38429cf49..d25353fab18c 100644
--- a/fs/afs/dir_silly.c
+++ b/fs/afs/dir_silly.c
@@ -120,7 +120,7 @@ int afs_sillyrename(struct afs_vnode *dvnode, struct afs_vnode *vnode,
 		scnprintf(silly, sizeof(silly), ".__afs%04X", sillycounter);
 		sdentry = lookup_and_lock_noperm_locked(
 			&QSTR(silly), dentry->d_parent,
-			LOOKUP_CREATE | LOOKUP_EXCL);
+			LOOKUP_CREATE | LOOKUP_EXCL, DLOCK_SUB_LOOKUP);
 
 		/* N.B. Better to return EBUSY here ... it could be dangerous
 		 * to delete the file while it's in use.
diff --git a/fs/namei.c b/fs/namei.c
index da9ba37f8a93..5c9279657b32 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1723,6 +1723,7 @@ static struct dentry *lookup_one_qstr(const struct qstr *name,
 	}
 	return dentry;
 }
+
 /*
  * dentry locking for updates.
  * When modifying a directory the target dentry will be locked by
@@ -1856,7 +1857,6 @@ static bool dentry_lock(struct dentry *dentry,
 	return __dentry_lock(dentry, base, last, DLOCK_NORMAL, state);
 }
 
-__maybe_unused /* will be used for rename */
 static bool dentry_lock_nested(struct dentry *dentry,
 			       struct dentry *base, const struct qstr *last,
 			       int state)
@@ -2003,7 +2003,14 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
 
 	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 
+retry:
 	dentry = lookup_one_qstr(last, base, lookup_flags);
+	if (!IS_ERR(dentry) &&
+	    !dentry_lock(dentry, base, last, TASK_UNINTERRUPTIBLE)) {
+		dput(dentry);
+		goto retry;
+	}
+
 	if (IS_ERR(dentry))
 		inode_unlock(base->d_inode);
 	return dentry;
@@ -2015,15 +2022,25 @@ static int lookup_one_common(struct mnt_idmap *idmap,
 			     struct qstr *qname, struct dentry *base);
 struct dentry *lookup_and_lock_noperm_locked(struct qstr *last,
 					     struct dentry *base,
-					     unsigned int lookup_flags)
+					     unsigned int lookup_flags,
+					     unsigned int class)
 {
+	struct dentry *dentry;
 	int err;
 
 	err = lookup_noperm_common(last, base);
 	if (err < 0)
 		return ERR_PTR(err);
 
-	return lookup_one_qstr(last, base, lookup_flags);
+retry:
+	dentry = lookup_one_qstr(last, base, lookup_flags);
+	if (!IS_ERR(dentry) &&
+	    !__dentry_lock(dentry, base, last, class,
+			   TASK_UNINTERRUPTIBLE)) {
+		dput(dentry);
+		goto retry;
+	}
+	return dentry;
 }
 EXPORT_SYMBOL(lookup_and_lock_noperm_locked);
 
@@ -2051,13 +2068,43 @@ struct dentry *lookup_and_lock_noperm(struct qstr *last,
 
 	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 
-	dentry = lookup_and_lock_noperm_locked(last, base, lookup_flags);
+	dentry = lookup_and_lock_noperm_locked(last, base, lookup_flags,
+					       DLOCK_NORMAL);
 	if (IS_ERR(dentry))
 		inode_unlock(base->d_inode);
 	return dentry;
 }
 EXPORT_SYMBOL(lookup_and_lock_noperm);
 
+/**
+ * lookup_and_lock_noperm_nested - lookup and lock a name prior to dir ops
+ * @last: the name in the given directory
+ * @base: the directory in which the name is to be found
+ * @lookup_flags: %LOOKUP_xxx flags
+ * @class: locking subclass for lock on dentry
+ *
+ * The name is looked up and necessary locks are taken so that
+ * the name can be created or removed.
+ * The "necessary locks" are currently the inode lock on @base.
+ * The name @last is NOT expected to have the hash calculated.
+ * No permission checks are performed.
+ * Returns: the dentry, suitably locked, or an ERR_PTR().
+ */
+struct dentry *lookup_and_lock_noperm_nested(struct qstr *last,
+					     struct dentry *base,
+					     unsigned int lookup_flags,
+					     unsigned int class)
+{
+	struct dentry *dentry;
+
+	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+	dentry = lookup_and_lock_noperm_locked(last, base, lookup_flags, class);
+	if (IS_ERR(dentry))
+		inode_unlock(base->d_inode);
+	return dentry;
+}
+EXPORT_SYMBOL(lookup_and_lock_noperm_nested);
+
 /**
  * lookup_and_lock - lookup and lock a name prior to dir ops
  * @last: the name in the given directory
@@ -2120,7 +2167,15 @@ struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 	if (err < 0)
 		return ERR_PTR(err);
 
+retry:
 	dentry = lookup_one_qstr(last, base, lookup_flags);
+	if (!IS_ERR(dentry) &&
+	    !dentry_lock(dentry, base, last, TASK_KILLABLE)) {
+		dput(dentry);
+		if (fatal_signal_pending(current))
+			return ERR_PTR(-ERESTARTSYS);
+		goto retry;
+	}
 	if (IS_ERR(dentry))
 		inode_unlock(base->d_inode);
 	return dentry;
@@ -2142,20 +2197,23 @@ EXPORT_SYMBOL(lookup_and_lock_killable);
  */
 bool lock_and_check_dentry(struct dentry *child, struct dentry *parent)
 {
-	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
-	if (child->d_parent == parent) {
-		/* get the child to balance with dentry_unlock which puts it. */
-		dget(child);
-		return true;
+	if (!dentry_lock(child, NULL, NULL, TASK_UNINTERRUPTIBLE))
+		return false;
+	if (child->d_parent != parent) {
+		__dentry_unlock(child);
+		return false;
 	}
-	inode_unlock(d_inode(parent));
-	return false;
+	/* get the child to balance with dentry_unlock() which puts it. */
+	dget(child);
+	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
+	return true;
 }
 EXPORT_SYMBOL(lock_and_check_dentry);
 
 void dentry_unlock_dir_locked(struct dentry *dentry)
 {
 	d_lookup_done(dentry);
+	__dentry_unlock(dentry);
 	dput(dentry);
 }
 EXPORT_SYMBOL(dentry_unlock_dir_locked);
@@ -3900,7 +3958,10 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 {
 	struct dentry *p, *ancestor;
 	struct dentry *d1, *d2;
+	struct dentry *old_dir;
+	struct qstr *old_last = NULL, *new_last = NULL;
 	int target_flags = LOOKUP_RENAME_TARGET | LOOKUP_CREATE;
+	bool ok1, ok2;
 	int err;
 
 	if (rd->flags & RENAME_EXCHANGE)
@@ -3908,6 +3969,7 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 	if (rd->flags & RENAME_NOREPLACE)
 		target_flags |= LOOKUP_EXCL;
 
+retry:
 	if (rd->old_dentry) {
 		/* Already have the dentry - need to be sure to lock the correct parent */
 		p = lock_rename_child(rd->old_dentry, rd->new_dir);
@@ -3926,19 +3988,17 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 		d1 = lookup_one_qstr(&rd->old_last, rd->old_dir, lookup_flags);
 		if (IS_ERR(d1))
 			goto out_unlock_1;
+		old_last = &rd->old_last;
 	}
+
 	if (rd->new_dentry) {
-		if (d_unhashed(rd->new_dentry) ||
-		    rd->new_dentry->d_parent != rd->new_dir) {
-			/* new_dentry was moved or removed! */
-			goto out_unlock_2;
-		}
 		d2 = dget(rd->new_dentry);
 	} else {
 		d2 = lookup_one_qstr(&rd->new_last, rd->new_dir,
 				     lookup_flags | target_flags);
 		if (IS_ERR(d2))
 			goto out_unlock_2;
+		new_last = &rd->new_last;
 	}
 
 	if (d1 == p) {
@@ -3964,6 +4024,44 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 		goto out_unlock_3;
 	}
 
+	if (!old_dir)
+		old_dir = dget(d1->d_parent);
+
+	/*
+	 * Time to lock the dentries.  Only validate the name if a lookup
+	 * was performed.  i.e.  if old_last or new_last is not NULL.
+	 */
+	if (d1 < d2) {
+		ok1 = dentry_lock(d1, rd->old_dir, old_last,
+				  TASK_UNINTERRUPTIBLE);
+		ok2 = dentry_lock_nested(d2, rd->new_dir, new_last,
+					 TASK_UNINTERRUPTIBLE);
+	} else if (d1 > d2) {
+		ok2 = dentry_lock(d2, rd->new_dir, new_last,
+				  TASK_UNINTERRUPTIBLE);
+		ok1 = dentry_lock_nested(d1, rd->old_dir, old_last,
+					 TASK_UNINTERRUPTIBLE);
+	} else {
+		ok1 = ok2 = dentry_lock(d1, NULL, NULL, TASK_UNINTERRUPTIBLE);
+	}
+
+	if (!ok1 || !ok2) {
+		if (ok1)
+			__dentry_unlock(d1);
+		if (ok2)
+			__dentry_unlock(d2);
+		d_lookup_done(d1);
+		d_lookup_done(d2);
+		renaming_unlock(old_dir, rd->new_dir, ancestor,
+				d1, d2);
+		unlock_rename(rd->old_dir, rd->new_dir);
+		dput(d1);
+		dput(d2);
+		dput(old_dir);
+		goto retry;
+	}
+
+	rd->old_dir = old_dir;
 	rd->old_dentry = d1;
 	rd->new_dentry = d2;
 	rd->ancestor = ancestor;
@@ -3971,15 +4069,19 @@ lookup_and_lock_rename_hashed(struct renamedata *rd, int lookup_flags)
 
 out_unlock_3:
 	d_lookup_done(d2);
+	if (d2 != d1)
+		__dentry_unlock(d2);
 	dput(d2);
 	d2 = ERR_PTR(err);
+	d_lookup_done(d1);
+	__dentry_unlock(d1);
 out_unlock_2:
 	d_lookup_done(d1);
 	dput(d1);
 	d1 = d2;
 out_unlock_1:
 	unlock_rename(rd->old_dir, rd->new_dir);
-	dput(rd->old_dir);
+	dput(old_dir);
 	return PTR_ERR(d1);
 }
 EXPORT_SYMBOL(lookup_and_lock_rename_hashed);
@@ -4093,8 +4195,12 @@ void dentry_unlock_rename(struct renamedata *rd)
 			rd->old_dentry, rd->new_dentry);
 
 	unlock_rename(rd->old_dir, rd->new_dir);
+
 	dput(rd->old_dir);
+	__dentry_unlock(rd->old_dentry);
 	dput(rd->old_dentry);
+	if (rd->old_dentry != rd->new_dentry)
+		__dentry_unlock(rd->new_dentry);
 	dput(rd->new_dentry);
 }
 EXPORT_SYMBOL(dentry_unlock_rename);
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index a67df3ae74ab..4ac3e3195a45 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -476,7 +476,7 @@ nfs_sillyrename(struct inode *dir, struct dentry *dentry)
 		 */
 		sdentry = lookup_and_lock_noperm_locked(
 			&QSTR(silly), dentry->d_parent,
-			LOOKUP_CREATE);
+			LOOKUP_CREATE, DLOCK_SUB_LOOKUP);
 		if (!IS_ERR(sdentry) && d_is_positive(dentry)) {
 			dput(sdentry);
 			sdentry = ERR_PTR(-EEXIST);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 27e645d290b1..ea1c37df7f00 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -242,6 +242,9 @@ enum {
 	DLOCK_NORMAL,
 	DLOCK_RENAME,		/* dentry with higher address in rename */
 	DLOCK_DYING_WAIT,	/* child of a locked parent that is dying */
+	DLOCK_SUB_LOOKUP,	/* lock on target for silly-rename or other
+				 * subordinate lookup
+				 */
 };
 
 extern seqlock_t rename_lock;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 6c9b545c60ce..1cbaa392fa21 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -83,8 +83,10 @@ struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 					unsigned int lookup_flags);
 struct dentry *lookup_and_lock_noperm(struct qstr *name, struct dentry *base,
 				      unsigned int lookup_flags);
+struct dentry *lookup_and_lock_noperm_nested(struct qstr *name, struct dentry *base,
+					     unsigned int lookup_flags, unsigned int class);
 struct dentry *lookup_and_lock_noperm_locked(struct qstr *name, struct dentry *base,
-					     unsigned int lookup_flags);
+					     unsigned int lookup_flags, unsigned int class);
 struct dentry *lookup_and_lock_hashed(struct qstr *last, struct dentry *base,
 				      unsigned int lookup_flags);
 void dentry_unlock(struct dentry *dentry);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 8/8] VFS: allow a filesystem to opt out of directory locking.
  2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
                   ` (6 preceding siblings ...)
  2025-06-09  7:34 ` [PATCH 7/8] VFS: use new dentry locking for create/remove/rename NeilBrown
@ 2025-06-09  7:34 ` NeilBrown
  7 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-09  7:34 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Jan Kara; +Cc: linux-fsdevel

The VFS no longer needs the directory to be locked when performing
updates in the directory (create/remove/rename).  We now lock
directories during these operations only because the filesystem might
expect it; some filesystems may not need it.  Allow a filesystem to opt
out by setting no_dir_lock in its inode_operations.

Signed-off-by: NeilBrown <neil@brown.name>
---
 fs/namei.c         | 75 ++++++++++++++++++++++++++++++++--------------
 include/linux/fs.h |  1 +
 2 files changed, 54 insertions(+), 22 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 5c9279657b32..55ea67b4f891 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2001,7 +2001,8 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
 {
 	struct dentry *dentry;
 
-	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+	if (!d_inode(base)->i_op->no_dir_lock)
+		inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 
 retry:
 	dentry = lookup_one_qstr(last, base, lookup_flags);
@@ -2011,7 +2012,8 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
 		goto retry;
 	}
 
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry) &&
+	    !d_inode(base)->i_op->no_dir_lock)
 		inode_unlock(base->d_inode);
 	return dentry;
 }
@@ -2066,11 +2068,13 @@ struct dentry *lookup_and_lock_noperm(struct qstr *last,
 {
 	struct dentry *dentry;
 
-	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+	if (!d_inode(base)->i_op->no_dir_lock)
+		inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 
 	dentry = lookup_and_lock_noperm_locked(last, base, lookup_flags,
 					       DLOCK_NORMAL);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry) &&
+	    !d_inode(base)->i_op->no_dir_lock)
 		inode_unlock(base->d_inode);
 	return dentry;
 }
@@ -2097,9 +2101,11 @@ struct dentry *lookup_and_lock_noperm_nested(struct qstr *last,
 {
 	struct dentry *dentry;
 
-	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
+	if (!d_inode(base)->i_op->no_dir_lock)
+		inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
 	dentry = lookup_and_lock_noperm_locked(last, base, lookup_flags, class);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry) &&
+	    !d_inode(base)->i_op->no_dir_lock)
 		inode_unlock(base->d_inode);
 	return dentry;
 }
@@ -2160,9 +2166,12 @@ struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 	struct dentry *dentry;
 	int err;
 
-	err = down_write_killable_nested(&base->d_inode->i_rwsem, I_MUTEX_PARENT);
-	if (err)
-		return ERR_PTR(err);
+	if (!d_inode(base)->i_op->no_dir_lock) {
+		err = down_write_killable_nested(&base->d_inode->i_rwsem,
+						 I_MUTEX_PARENT);
+		if (err)
+			return ERR_PTR(err);
+	}
 	err = lookup_one_common(idmap, last, base);
 	if (err < 0)
 		return ERR_PTR(err);
@@ -2176,7 +2185,8 @@ struct dentry *lookup_and_lock_killable(struct mnt_idmap *idmap,
 			return ERR_PTR(-ERESTARTSYS);
 		goto retry;
 	}
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry) &&
+	    !d_inode(base)->i_op->no_dir_lock)
 		inode_unlock(base->d_inode);
 	return dentry;
 }
@@ -2205,7 +2215,8 @@ bool lock_and_check_dentry(struct dentry *child, struct dentry *parent)
 	}
 	/* get the child to balance with dentry_unlock() which puts it. */
 	dget(child);
-	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
+	if (!d_inode(parent)->i_op->no_dir_lock)
+		inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
 	return true;
 }
 EXPORT_SYMBOL(lock_and_check_dentry);
@@ -2230,7 +2241,8 @@ void dentry_unlock(struct dentry *dentry)
 {
 	if (!IS_ERR(dentry)) {
 		d_lookup_done(dentry);
-		inode_unlock(dentry->d_parent->d_inode);
+		if (!dentry->d_parent->d_inode->i_op->no_dir_lock)
+			inode_unlock(dentry->d_parent->d_inode);
 		dentry_unlock_dir_locked(dentry);
 	}
 }
@@ -2342,9 +2354,11 @@ static struct dentry *lookup_slow(const struct qstr *name,
 {
 	struct inode *inode = dir->d_inode;
 	struct dentry *res;
-	inode_lock_shared(inode);
+	if (!inode->i_op->no_dir_lock)
+		inode_lock_shared(inode);
 	res = __lookup_slow(name, dir, flags);
-	inode_unlock_shared(inode);
+	if (!inode->i_op->no_dir_lock)
+		inode_unlock_shared(inode);
 	return res;
 }
 
@@ -3721,6 +3735,9 @@ static struct dentry *lock_two_directories(struct dentry *p1, struct dentry *p2)
  */
 static struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
 {
+	if (d_inode(p1)->i_op->no_dir_lock)
+		return NULL;
+
 	if (p1 == p2) {
 		inode_lock_nested(p1->d_inode, I_MUTEX_PARENT);
 		return NULL;
@@ -3735,6 +3752,9 @@ static struct dentry *lock_rename(struct dentry *p1, struct dentry *p2)
  */
 static struct dentry *lock_rename_child(struct dentry *c1, struct dentry *p2)
 {
+	if (d_inode(c1)->i_op->no_dir_lock)
+		return NULL;
+
 	if (READ_ONCE(c1->d_parent) == p2) {
 		/*
 		 * hopefully won't need to touch ->s_vfs_rename_mutex at all.
@@ -3773,6 +3793,8 @@ static struct dentry *lock_rename_child(struct dentry *c1, struct dentry *p2)
 
 static void unlock_rename(struct dentry *p1, struct dentry *p2)
 {
+	if (d_inode(p1)->i_op->no_dir_lock)
+		return;
 	inode_unlock(p1->d_inode);
 	if (p1 != p2) {
 		inode_unlock(p2->d_inode);
@@ -3880,6 +3902,10 @@ static struct dentry *lock_ancestors(struct dentry *d1, struct dentry *d2)
 {
 	struct dentry *locked, *ancestor;
 
+	if (!d_inode(d1)->i_op->no_dir_lock)
+		/* s_vfs_rename_mutex is being used, so skip this locking */
+		return NULL;
+
 	if (d1->d_parent == d2->d_parent)
 		/* Nothing to lock */
 		return NULL;
@@ -4194,6 +4220,7 @@ void dentry_unlock_rename(struct renamedata *rd)
 	renaming_unlock(rd->old_dir, rd->new_dir, rd->ancestor,
 			rd->old_dentry, rd->new_dentry);
 
+	if (!d_inode(rd->old_dir)->i_op->no_dir_lock)
 	unlock_rename(rd->old_dir, rd->new_dir);
 
 	dput(rd->old_dir);
@@ -4697,19 +4724,23 @@ static const char *open_last_lookups(struct nameidata *nd,
 		 * dropping this one anyway.
 		 */
 	}
-	if (open_flag & O_CREAT)
-		inode_lock(dir->d_inode);
-	else
-		inode_lock_shared(dir->d_inode);
+	if (!d_inode(dir)->i_op->no_dir_lock) {
+		if (open_flag & O_CREAT)
+			inode_lock(dir->d_inode);
+		else
+			inode_lock_shared(dir->d_inode);
+	}
 	dentry = lookup_open(nd, file, op, got_write);
 	if (!IS_ERR(dentry)) {
 		if (file->f_mode & FMODE_OPENED)
 			fsnotify_open(file);
 	}
-	if (open_flag & O_CREAT)
-		inode_unlock(dir->d_inode);
-	else
-		inode_unlock_shared(dir->d_inode);
+	if (!d_inode(dir)->i_op->no_dir_lock) {
+		if (open_flag & O_CREAT)
+			inode_unlock(dir->d_inode);
+		else
+			inode_unlock_shared(dir->d_inode);
+	}
 
 	if (got_write)
 		mnt_drop_write(nd->path.mnt);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b4a1a1f4786..b213993c486a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2225,6 +2225,7 @@ int wrap_directory_iterator(struct file *, struct dir_context *,
 	{ return wrap_directory_iterator(file, ctx, x); }
 
 struct inode_operations {
+	bool no_dir_lock:1;
 	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
 	const char * (*get_link) (struct dentry *, struct inode *, struct delayed_call *);
 	int (*permission) (struct mnt_idmap *, struct inode *, int);
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 7/8] VFS: use new dentry locking for create/remove/rename
  2025-06-09  7:34 ` [PATCH 7/8] VFS: use new dentry locking for create/remove/rename NeilBrown
@ 2025-06-10 20:36   ` Al Viro
  2025-06-11  0:34     ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2025-06-10 20:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Mon, Jun 09, 2025 at 05:34:12PM +1000, NeilBrown wrote:
> After taking the directory lock (or locks) we now lock the target
> dentries.  This is pointless at present but will allow us to remove the
> taking of the directory lock in a future patch.
> 
> MORE WORDS

Such as "why doesn't it deadlock?", presumably, seeing that you have

> @@ -2003,7 +2003,14 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
>  
>  	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
>  
> +retry:
>  	dentry = lookup_one_qstr(last, base, lookup_flags);
> +	if (!IS_ERR(dentry) &&
> +	    !dentry_lock(dentry, base, last, TASK_UNINTERRUPTIBLE)) {

... take dentry lock inside ->i_rwsem on parent and

>  bool lock_and_check_dentry(struct dentry *child, struct dentry *parent)
>  {
> -	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
> -	if (child->d_parent == parent) {
> -		/* get the child to balance with dentry_unlock which puts it. */
> -		dget(child);
> -		return true;
> +	if (!dentry_lock(child, NULL, NULL, TASK_UNINTERRUPTIBLE))
> +		return false;
> +	if (child->d_parent != parent) {
> +		__dentry_unlock(child);
> +		return false;
>  	}
> -	inode_unlock(d_inode(parent));
> -	return false;
> +	/* get the child to balance with dentry_unlock() which puts it. */
> +	dget(child);
> +	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);

... do the same in opposite order?

How could that possibly work?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow.
  2025-06-09  7:34 ` [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow NeilBrown
@ 2025-06-10 20:57   ` Al Viro
  2025-06-11  1:00     ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2025-06-10 20:57 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Mon, Jun 09, 2025 at 05:34:10PM +1000, NeilBrown wrote:
> Once we support directory operations (e.g. create) without requiring the
> parent to be locked, the current practice of locking a directory while
> processing rmdir() or similar will not be sufficient to wait for
> operations to complete and to block further operations.
> 
> This patch introduces a new inode flag S_DYING.  It indicates that
> a rmdir or similar is being processed and new directory operations must
> not commence in the directory.  They should not abort either as the
> rmdir might fail - instead they should block.  They can do this by
> waiting for a lock on the inode.
> 
> A new interface rmdir_lock() locks the inode, sets this flag, and waits
> for any children with DCACHE_LOCK set to complete their operation, and
> for any d_in_lookup() children to complete the lookup.  It should be
> called before attempting to delete the directory or set S_DEAD.  Matching
> rmdir_unlock() clears the flag and unlocks the inode.
> 
> dentry_lock() and d_alloc_parallel() are changed to block while this
> flag is set and to fail if the parent IS_DEADDIR(), though dentry_lock()
> doesn't block for d_in_lookup() dentries.

> diff --git a/fs/namei.c b/fs/namei.c
> index 4ad76df21677..c590f25d0d49 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1770,8 +1770,11 @@ static bool __dentry_lock(struct dentry *dentry,
>  			  struct dentry *base, const struct qstr *last,
>  			  unsigned int subclass, int state)
>  {
> +	struct dentry *parent;
> +	struct inode *dir;
>  	int err;
>  
> +retry:
>  	lock_acquire_exclusive(&dentry->dentry_map, subclass, 0, NULL, _THIS_IP_);
>  	spin_lock(&dentry->d_lock);
>  	err = wait_var_event_any_lock(&dentry->d_flags,
> @@ -1782,10 +1785,43 @@ static bool __dentry_lock(struct dentry *dentry,
>  		spin_unlock(&dentry->d_lock);
>  		return false;
>  	}
> -
> -	dentry->d_flags |= DCACHE_LOCK;
> +	parent = dentry->d_parent;

Why will it stay the parent?  Matter of fact, why will it stay positive?

> +	dir = igrab(parent->d_inode);

... and not oops right here?

> +	lock_map_release(&dentry->dentry_map);
>  	spin_unlock(&dentry->d_lock);
> -	return true;
> +
> +	if (state == TASK_KILLABLE) {
> +		err = down_write_killable(&dir->i_rwsem);
> +		if (err) {
> +			iput(dir);
> +			return false;
> +		}
> +	} else
> +		inode_lock(dir);
> +	/* S_DYING must be clear now */
> +	inode_unlock(dir);
> +	iput(dir);
> +	goto retry;

OK, I'm really confused now.  Is it allowed to call dentry_lock() while holding
->i_rwsem of the parent?

Where does your dentry lock nest wrt ->i_rwsem?  As a bonus (well, malus, I guess)
question, where does it nest wrt parent *and* child inodes' ->i_rwsem for rmdir
and rename?

Tangentially connected question: which locks are held for ->unlink() in your
scheme?  You do need *something* on the victim inode to protect ->i_nlink
modifications, and anything on dentries of victim or their parent directories
is not going to give that.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 7/8] VFS: use new dentry locking for create/remove/rename
  2025-06-10 20:36   ` Al Viro
@ 2025-06-11  0:34     ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-11  0:34 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Wed, 11 Jun 2025, Al Viro wrote:
> On Mon, Jun 09, 2025 at 05:34:12PM +1000, NeilBrown wrote:
> > After taking the directory lock (or locks) we now lock the target
> > dentries.  This is pointless at present but will allow us to remove the
> > taking of the directory lock in a future patch.
> > 
> > MORE WORDS
> 
> Such as "why doesn't it deadlock?", presumably, seeing that you have
> 
> > @@ -2003,7 +2003,14 @@ struct dentry *lookup_and_lock_hashed(struct qstr *last,
> >  
> >  	inode_lock_nested(base->d_inode, I_MUTEX_PARENT);
> >  
> > +retry:
> >  	dentry = lookup_one_qstr(last, base, lookup_flags);
> > +	if (!IS_ERR(dentry) &&
> > +	    !dentry_lock(dentry, base, last, TASK_UNINTERRUPTIBLE)) {
> 
> ... take dentry lock inside ->i_rwsem on parent and
> 
> >  bool lock_and_check_dentry(struct dentry *child, struct dentry *parent)
> >  {
> > -	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
> > -	if (child->d_parent == parent) {
> > -		/* get the child to balance with dentry_unlock which puts it. */
> > -		dget(child);
> > -		return true;
> > +	if (!dentry_lock(child, NULL, NULL, TASK_UNINTERRUPTIBLE))
> > +		return false;
> > +	if (child->d_parent != parent) {
> > +		__dentry_unlock(child);
> > +		return false;
> >  	}
> > -	inode_unlock(d_inode(parent));
> > -	return false;
> > +	/* get the child to balance with dentry_unlock() which puts it. */
> > +	dget(child);
> > +	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
> 
> ... do the same in opposite order?
> 
> How could that possibly work?
> 

Clearly it can't - thanks for finding that.
A previous iteration had the parent locking after the dentry locking,
but I found that couldn't work.  So I moved the parent locking back to
the start, but must have missed this function.

That part of the patch now reads:

@@ -2143,8 +2198,8 @@ EXPORT_SYMBOL(lookup_and_lock_killable);
 bool lock_and_check_dentry(struct dentry *child, struct dentry *parent)
 {
 	inode_lock_nested(d_inode(parent), I_MUTEX_PARENT);
-	if (child->d_parent == parent) {
-		/* get the child to balance with dentry_unlock which puts it. */
+	if (!dentry_lock(child, parent, NULL, TASK_UNINTERRUPTIBLE)) {
+		/* get the child to balance with dentry_unlock() which puts it. */
 		dget(child);
 		return true;
 	}

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow.
  2025-06-10 20:57   ` Al Viro
@ 2025-06-11  1:00     ` NeilBrown
  2025-06-11  1:13       ` Al Viro
  0 siblings, 1 reply; 15+ messages in thread
From: NeilBrown @ 2025-06-11  1:00 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Wed, 11 Jun 2025, Al Viro wrote:
> On Mon, Jun 09, 2025 at 05:34:10PM +1000, NeilBrown wrote:
> > Once we support directory operations (e.g. create) without requiring the
> > parent to be locked, the current practice of locking a directory while
> > processing rmdir() or similar will not be sufficient to wait for
> > operations to complete and to block further operations.
> > 
> > This patch introduces a new inode flag S_DYING.  It indicates that
> > a rmdir or similar is being processed and new directory operations must
> > not commence in the directory.  They should not abort either as the
> > rmdir might fail - instead they should block.  They can do this by
> > waiting for a lock on the inode.
> > 
> > A new interface rmdir_lock() locks the inode, sets this flag, and waits
> > for any children with DCACHE_LOCK set to complete their operation, and
> > for any d_in_lookup() children to complete the lookup.  It should be
> > called before attempting to delete the directory or set S_DEAD.  Matching
> > rmdir_unlock() clears the flag and unlocks the inode.
> > 
> > dentry_lock() and d_alloc_parallel() are changed to block while this
> > flag is set and to fail if the parent IS_DEADDIR(), though dentry_lock()
> > doesn't block for d_in_lookup() dentries.
> 
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 4ad76df21677..c590f25d0d49 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1770,8 +1770,11 @@ static bool __dentry_lock(struct dentry *dentry,
> >  			  struct dentry *base, const struct qstr *last,
> >  			  unsigned int subclass, int state)
> >  {
> > +	struct dentry *parent;
> > +	struct inode *dir;
> >  	int err;
> >  
> > +retry:
> >  	lock_acquire_exclusive(&dentry->dentry_map, subclass, 0, NULL, _THIS_IP_);
> >  	spin_lock(&dentry->d_lock);
> >  	err = wait_var_event_any_lock(&dentry->d_flags,
> > @@ -1782,10 +1785,43 @@ static bool __dentry_lock(struct dentry *dentry,
> >  		spin_unlock(&dentry->d_lock);
> >  		return false;
> >  	}
> > -
> > -	dentry->d_flags |= DCACHE_LOCK;
> > +	parent = dentry->d_parent;
> 
> Why will it stay the parent?  Matter of fact, why will it stay positive?

As long as we continue to hold dentry->d_lock it will stay the
parent, and so will have a reference, and so will stay positive.

> 
> > +	dir = igrab(parent->d_inode);
> 
> ... and not oops right here?

Still holding dentry->d_lock here so parent cannot have changed.

> 
> > +	lock_map_release(&dentry->dentry_map);
> >  	spin_unlock(&dentry->d_lock);
> > -	return true;
> > +
> > +	if (state == TASK_KILLABLE) {
> > +		err = down_write_killable(&dir->i_rwsem);
> > +		if (err) {
> > +			iput(dir);
> > +			return false;
> > +		}
> > +	} else
> > +		inode_lock(dir);
> > +	/* S_DYING must be clear now */
> > +	inode_unlock(dir);
> > +	iput(dir);
> > +	goto retry;
> 
> OK, I'm really confused now.  Is it allowed to call dentry_lock() while holding
> ->i_rwsem of the parent?

Yes.

> 
> Where does your dentry lock nest wrt ->i_rwsem?  As a bonus (well, malus, I guess)
> question, where does it nest wrt parent *and* child inodes' ->i_rwsem for rmdir
> and rename?

Between inode of parent of the dentry and inode of the dentry.

In this case we aren't holding the dentry lock when we lock the parent.
We might already have the parent locked when calling dentry_lock() (if
the filesystem hasn't opted out), but in that case S_DYING will not be
set.  It is only set while holding the i_rw_sem, in the case where the
filesystem doesn't lock dentries.  So if S_DYING is set here, it must
be safe to lock the parent.

> 
> Tangentially connected question: which locks are held for ->unlink() in your
> scheme?  You do need *something* on the victim inode to protect ->i_nlink
> modifications, and anything on dentries of victim or their parent directories
> is not going to give that.
> 

I haven't changed the locking on non-directories at all.  The target of
->unlink() will be locked after the dentry is locked (which is after the
parent is locked if the fs requires that).

->i_nlink for non-directories is still protected by ->i_rwsem on the
inode.
->i_nlink for directories is something the fs will have to handle
when opting out of i_rwsem on directory ops.  NFS, for example, already
takes inode->i_lock when calling inc_nlink() or drop_nlink().  The
set_nlink() in nfs_update_inode() isn't obviously protected but as the
number is informational it possibly doesn't matter.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow.
  2025-06-11  1:00     ` NeilBrown
@ 2025-06-11  1:13       ` Al Viro
  2025-06-11  2:49         ` NeilBrown
  0 siblings, 1 reply; 15+ messages in thread
From: Al Viro @ 2025-06-11  1:13 UTC (permalink / raw)
  To: NeilBrown; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Wed, Jun 11, 2025 at 11:00:06AM +1000, NeilBrown wrote:

> Yes.
> 
> > 
> > Where does your dentry lock nest wrt ->i_rwsem?  As a bonus (well, malus, I guess)
> > question, where does it nest wrt parent *and* child inodes' ->i_rwsem for rmdir
> > and rename?
> 
> Between inode of parent of the dentry and inode of the dentry.

That's... going to be fun to prove the deadlock avoidance.
Looking forward to such proof...

Look, the reason why I'm sceptical is that we had quite a few interesting
problems with directory locking schemes; fun scenarios are easy to
miss and I've fucked up more than a few times in that area.  Fixing it
afterwards can be a real bitch, especially if we get filesystem-specific
parts in the picture.

So let's sort it out _before_ we go there.  And I mean proof - verifiable
statements about the functions, etc.

Incidentally, what was the problem with having dentry locked before
the parent?  At least that way we would have a reasonable lock ordering...
It would require some preliminary work, but last time I looked at the
area (not very deeply) it felt like a plausible direction...  I wonder
which obstacle have I missed...

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow.
  2025-06-11  1:13       ` Al Viro
@ 2025-06-11  2:49         ` NeilBrown
  0 siblings, 0 replies; 15+ messages in thread
From: NeilBrown @ 2025-06-11  2:49 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel

On Wed, 11 Jun 2025, Al Viro wrote:
> On Wed, Jun 11, 2025 at 11:00:06AM +1000, NeilBrown wrote:
> 
> > Yes.
> > 
> > > 
> > > Where does your dentry lock nest wrt ->i_rwsem?  As a bonus (well, malus, I guess)
> > > question, where does it nest wrt parent *and* child inodes' ->i_rwsem for rmdir
> > > and rename?
> > 
> > Between inode of parent of the dentry and inode of the dentry.
> 
> That's... going to be fun to prove the deadlock avoidance.
> Looking forward to such proof...

I do hope to write something along those lines, but I'd rather do it
after getting feedback on the design.  It's already changed several
times, and while I have growing confidence that I understand the key
issues, there is still room for change, as discussed below.

> 
> Look, the reason why I'm sceptical is that we had quite a few interesting
> problems with directory locking schemes; fun scenarios are easy to
> miss and I've fucked up more than a few times in that area.  Fixing it
> afterwards can be a real bitch, especially if we get filesystem-specific
> parts in the picture.

That's part of why I like bringing all the locking into namei.c using
lookup_and_lock() etc.  Having everything in one place won't make it
easy but might make it more tractable.

> 
> So let's sort it out _before_ we go there.  And I mean proof - verifiable
> statements about the functions, etc.

I was hoping to get some of the refactoring - which I think is useful in
any case - in before needing a completely tidy solution.  Partly this is
because I thought the review discussion of the locking would be more
effective when the code was clean and centralised....

> 
> Incidentally, what was the problem with having dentry locked before
> the parent?  At least that way we would have a reasonable lock ordering...
> It would require some preliminary work, but last time I looked at the
> area (not very deeply) it felt like a plausible direction...  I wonder
> which obstacle have I missed...
> 

If we are to put the parent locking after the dentry locking it must
also be after the d_alloc_parallel() locking.  As some readdir
implementations (quite sensibly) use d_alloc_parallel() to prime the
dcache with readdir results, we would not be able to hold the parent
locked across readdir.  So we would need some other exclusion with
rmdir.

Put another way: if we want to push the parent locking down into the
filesystem for anything, we really need to do it for everything.
For rmdir we would need the "set S_DYING and wait for locked dentries"
to happen before taking the lock, and we cannot use parent lock for
readdir.

Maybe we could do that, but we can't use spare d_flags or i_flags for
the readdir locking - we need a counter.  Maybe i_writecount or
i_dio_count could provide the space we need as they aren't used on
directories.

I thought that going down that path added more complexity than I wanted,
but it does have the advantage of making the purpose and scope of the
different locks more explicit.

... or maybe we could change those readdir routines to do a try-lock.
i.e.  have a d_alloc_parallel() variant which returned -EWOULDBLOCK
rather than waiting for the dentry to be ready.  Maybe that is a
sensible approach anyway...

Another (unrelated) option that I've considered but not yet explored is
using the same locking scheme for in_lookup dentries and for the targets
of operations.  There would be just two DCACHE flags (locked, and
waiter-exists) and the in_lookup dentries would be recognised by having
d_inode being $RESERVED_ADDRESS.  Then dentries would always be created
locked and then unlocked when they are ready (like inodes).  I think I
like this, but I also wonder if it is worth the churn.

NeilBrown


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-06-11  2:49 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-09  7:34 [PATCH 0/8 preview] demonstrate proposed new locking strategy for directories NeilBrown
2025-06-09  7:34 ` [PATCH 1/8] VFS: use global wait-queue table for d_alloc_parallel() NeilBrown
2025-06-09  7:34 ` [PATCH 2/8] VFS: use d_alloc_parallel() in lookup_one_qstr_excl() NeilBrown
2025-06-09  7:34 ` [PATCH 3/8] fs/proc: take rcu_read_lock() in proc_sys_compare() NeilBrown
2025-06-09  7:34 ` [PATCH 4/8] VFS: Add ability to exclusively lock a dentry and use for open/create NeilBrown
2025-06-09  7:34 ` [PATCH 5/8] Introduce S_DYING which warns that S_DEAD might follow NeilBrown
2025-06-10 20:57   ` Al Viro
2025-06-11  1:00     ` NeilBrown
2025-06-11  1:13       ` Al Viro
2025-06-11  2:49         ` NeilBrown
2025-06-09  7:34 ` [PATCH 6/8] VFS: provide alternative to s_vfs_rename_mutex NeilBrown
2025-06-09  7:34 ` [PATCH 7/8] VFS: use new dentry locking for create/remove/rename NeilBrown
2025-06-10 20:36   ` Al Viro
2025-06-11  0:34     ` NeilBrown
2025-06-09  7:34 ` [PATCH 8/8] VFS: allow a filesystem to opt out of directory locking NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).