* [RFC PATCH 01/25] VFS: use wait_var_event for waiting in d_alloc_parallel()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories Al Viro
` (24 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
From: NeilBrown <neil@brown.name>
Parallel lookup starts with a call of d_alloc_parallel(). That primitive
either returns a matching hashed dentry or allocates a new one in the
in-lookup state and returns it to the caller. Once the caller is done
with the lookup, it indicates so either by a call of d_{splice_alias,add}()
or by a call of d_lookup_done(); at that point the dentry leaves the
in-lookup state.
If d_alloc_parallel() finds a matching in-lookup dentry, it must wait for
that dentry to leave the in-lookup state, one way or another. Currently
that is done by having the caller supply a wait_queue_head to
d_alloc_parallel(). If d_alloc_parallel() creates a new in-lookup dentry,
the address of that wait_queue_head is stored in ->d_wait of the new dentry
and stays there while the dentry is in the in-lookup state; a subsequent
d_alloc_parallel() will wait on the queue found in the matching in-lookup
dentry. Transition out of the in-lookup state wakes the waiters on that
queue (if any).
That works, but the calling conventions are inconvenient - the caller must
supply a wait_queue_head and make sure that it survives at least until the
new in-lookup dentry leaves the in-lookup state. That amounts to boilerplate
in the d_alloc_parallel() callers that are followed by a call of
d_lookup_done() in the same function; in cases like the nfs asynchronous
unlink it gets worse than that.
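To make the change in calling conventions concrete, here is a condensed
before/after sketch of a typical caller (cf. the __lookup_slow() hunk below;
"dir" and "name" are placeholders, error handling elided):

    /* before: caller supplies the queue and must keep it alive
     * until the dentry leaves the in-lookup state */
    DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
    dentry = d_alloc_parallel(dir, name, &wq);
    /* ... perform the lookup ... */
    d_lookup_done(dentry);      /* only now may wq be released */

    /* after: no caller-side state to keep alive */
    dentry = d_alloc_parallel(dir, name);
    /* ... perform the lookup ... */
    d_lookup_done(dentry);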
This patch changes d_alloc_parallel() to use wake_up_var_locked() to
wake up waiters, and wait_var_event_spinlock() to wait. dentry->d_lock
is used for synchronisation, as it is already held at the relevant
times.
That eliminates the need for a caller-supplied wait_queue_head, simplifying
the calling conventions. Better yet, we only need one bit of information
stored in the dentry itself: whether there are any waiters to be woken up,
and that can easily be stored in ->d_flags; ->d_wait goes away.
The reason we need that bit (DCACHE_LOOKUP_WAITERS) is that with the
wait_var machinery the queues are shared with all kinds of stuff and there's
no way to tell whether any of the waiters have anything to do with our dentry;
most of the time none of them will be relevant, so we need to avoid the
pointless wakeups.
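Condensed from the hunks below, the waiting and waking sides of the new
scheme pair up like this, both within ->d_lock scope:

    /* waiter: d_wait_lookup(), called when d_alloc_parallel() finds
     * a matching in-lookup dentry */
    dentry->d_flags |= DCACHE_LOOKUP_WAITERS;
    wait_var_event_spinlock(&dentry->d_flags,
                            !d_in_lookup(dentry), &dentry->d_lock);

    /* waker: __d_wake_in_lookup_waiters(), called once the dentry
     * has left the in-lookup state */
    if (dentry->d_flags & DCACHE_LOOKUP_WAITERS) {
            wake_up_var_locked(&dentry->d_flags, &dentry->d_lock);
            dentry->d_flags &= ~DCACHE_LOOKUP_WAITERS;
    }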
Another benefit of the new scheme comes from the fact that wakeups
have to be done outside of write-side critical areas of ->i_dir_seq;
with the old scheme we had to carry the value picked from ->d_wait from
__d_lookup_unhash() to the place where we actually wake the waiters up.
Now we can just leave DCACHE_LOOKUP_WAITERS in ->d_flags until we get
to doing the wakeups - that's done within the same ->d_lock scope, so we
are fine; the new bit is accessed only under ->d_lock and it's seen only
on dentries with DCACHE_PAR_LOOKUP in ->d_flags.
__d_lookup_unhash() no longer needs to re-init ->d_lru: that field used
to be shared (in a union) with ->d_wait, and now that ->d_wait is gone,
nothing corrupts ->d_lru anymore.
Co-developed-by: Al Viro <viro@zeniv.linux.org.uk> # saner handling of flags
Signed-off-by: NeilBrown <neil@brown.name>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
Documentation/filesystems/porting.rst | 6 ++
fs/afs/dir_silly.c | 4 +-
fs/dcache.c | 83 +++++++++++++++------------
fs/fuse/readdir.c | 3 +-
fs/namei.c | 6 +-
fs/nfs/dir.c | 6 +-
fs/nfs/unlink.c | 3 +-
fs/proc/base.c | 3 +-
fs/proc/proc_sysctl.c | 3 +-
fs/smb/client/readdir.c | 3 +-
include/linux/dcache.h | 11 ++--
include/linux/nfs_xdr.h | 1 -
12 files changed, 67 insertions(+), 65 deletions(-)
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index fdf074429cd3..36fecc7a3d97 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1385,3 +1385,9 @@ for_each_alias(dentry, inode) instead of hlist_for_each_entry; better
yet, see if any of the exported primitives could be used instead of
the entire loop. You still need to hold ->i_lock of the inode over
either form of manual loop.
+
+---
+
+**mandatory**
+
+d_alloc_parallel() no longer requires a waitqueue_head.
diff --git a/fs/afs/dir_silly.c b/fs/afs/dir_silly.c
index a748fd133faf..982bb6ec15f0 100644
--- a/fs/afs/dir_silly.c
+++ b/fs/afs/dir_silly.c
@@ -248,13 +248,11 @@ int afs_silly_iput(struct dentry *dentry, struct inode *inode)
struct dentry *alias;
int ret;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
-
_enter("%p{%pd},%llx", dentry, dentry, vnode->fid.vnode);
down_read(&dvnode->rmdir_lock);
- alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name, &wq);
+ alias = d_alloc_parallel(dentry->d_parent, &dentry->d_name);
if (IS_ERR(alias)) {
up_read(&dvnode->rmdir_lock);
return 0;
diff --git a/fs/dcache.c b/fs/dcache.c
index 2c61aeea41f4..0aff2c510beb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2250,8 +2250,7 @@ struct dentry *d_add_ci(struct dentry *dentry, struct inode *inode,
return found;
}
if (d_in_lookup(dentry)) {
- found = d_alloc_parallel(dentry->d_parent, name,
- dentry->d_wait);
+ found = d_alloc_parallel(dentry->d_parent, name);
if (IS_ERR(found) || !d_in_lookup(found)) {
iput(inode);
return found;
@@ -2638,32 +2637,24 @@ static inline unsigned start_dir_add(struct inode *dir)
}
}
-static inline void end_dir_add(struct inode *dir, unsigned int n,
- wait_queue_head_t *d_wait)
+static inline void end_dir_add(struct inode *dir, unsigned int n)
{
smp_store_release(&dir->i_dir_seq, n + 2);
preempt_enable_nested();
- if (wq_has_sleeper(d_wait))
- wake_up_all(d_wait);
}
static void d_wait_lookup(struct dentry *dentry)
{
- if (d_in_lookup(dentry)) {
- DECLARE_WAITQUEUE(wait, current);
- add_wait_queue(dentry->d_wait, &wait);
- do {
- set_current_state(TASK_UNINTERRUPTIBLE);
- spin_unlock(&dentry->d_lock);
- schedule();
- spin_lock(&dentry->d_lock);
- } while (d_in_lookup(dentry));
+ if (likely(d_in_lookup(dentry))) {
+ dentry->d_flags |= DCACHE_LOOKUP_WAITERS;
+ wait_var_event_spinlock(&dentry->d_flags,
+ !d_in_lookup(dentry),
+ &dentry->d_lock);
}
}
struct dentry *d_alloc_parallel(struct dentry *parent,
- const struct qstr *name,
- wait_queue_head_t *wq)
+ const struct qstr *name)
{
unsigned int hash = name->hash;
struct hlist_bl_head *b = in_lookup_hash(parent, hash);
@@ -2766,7 +2757,6 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
return dentry;
}
rcu_read_unlock();
- new->d_wait = wq;
hlist_bl_add_head(&new->d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
@@ -2778,13 +2768,26 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
EXPORT_SYMBOL(d_alloc_parallel);
/*
- * - Unhash the dentry
- * - Retrieve and clear the waitqueue head in dentry
- * - Return the waitqueue head
+ * Move dentry from in-lookup state to busy-negative one.
+ *
+ * From now on d_in_lookup(dentry) will return false and dentry is gone from
+ * in-lookup hash.
+ *
+ * Anyone who had been waiting on it in d_alloc_parallel() is free to
+ * proceed after that. Note that waking such waiters up is left to
+ * the callers; PREEMPT_RT kernels can't have that wakeup done while
+ * in write-side critical area for ->i_dir_seq, so it's done by calling
+ * __d_wake_in_lookup_waiters() once it's safe to do so.
+ *
+ * Both __d_lookup_unhash() and __d_wake_in_lookup_waiters() should
+ * be called within the same ->d_lock scope. PAR_LOOKUP is cleared
+ * here, while LOOKUP_WAITERS (set by somebody finding dentry in
+ * the in-lookup hash and settling down to wait) is checked and cleared
+ * in __d_wake_in_lookup_waiters(). Both are gone by the end of
+ * ->d_lock scope.
*/
-static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
+static void __d_lookup_unhash(struct dentry *dentry)
{
- wait_queue_head_t *d_wait;
struct hlist_bl_head *b;
lockdep_assert_held(&dentry->d_lock);
@@ -2793,18 +2796,23 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
hlist_bl_lock(b);
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
__hlist_bl_del(&dentry->d_in_lookup_hash);
- d_wait = dentry->d_wait;
- dentry->d_wait = NULL;
hlist_bl_unlock(b);
- INIT_LIST_HEAD(&dentry->d_lru);
- return d_wait;
+}
+
+static inline void __d_wake_in_lookup_waiters(struct dentry *dentry)
+{
+ if (dentry->d_flags & DCACHE_LOOKUP_WAITERS) {
+ wake_up_var_locked(&dentry->d_flags, &dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_LOOKUP_WAITERS;
+ }
}
void __d_lookup_unhash_wake(struct dentry *dentry)
{
spin_lock(&dentry->d_lock);
- wake_up_all(__d_lookup_unhash(dentry));
+ __d_lookup_unhash(dentry);
+ __d_wake_in_lookup_waiters(dentry);
spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(__d_lookup_unhash_wake);
@@ -2814,14 +2822,13 @@ EXPORT_SYMBOL(__d_lookup_unhash_wake);
static inline void __d_add(struct dentry *dentry, struct inode *inode,
const struct dentry_operations *ops)
{
- wait_queue_head_t *d_wait;
struct inode *dir = NULL;
unsigned n;
spin_lock(&dentry->d_lock);
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
- d_wait = __d_lookup_unhash(dentry);
+ __d_lookup_unhash(dentry);
}
if (unlikely(ops))
d_set_d_op(dentry, ops);
@@ -2834,8 +2841,10 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode,
fsnotify_update_flags(dentry);
}
__d_rehash(dentry);
- if (dir)
- end_dir_add(dir, n, d_wait);
+ if (dir) {
+ end_dir_add(dir, n);
+ __d_wake_in_lookup_waiters(dentry);
+ }
spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
@@ -2948,7 +2957,6 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
bool exchange)
{
struct dentry *old_parent, *p;
- wait_queue_head_t *d_wait;
struct inode *dir = NULL;
unsigned n;
@@ -2979,7 +2987,7 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
if (unlikely(d_in_lookup(target))) {
dir = target->d_parent->d_inode;
n = start_dir_add(dir);
- d_wait = __d_lookup_unhash(target);
+ __d_lookup_unhash(target);
}
write_seqcount_begin(&dentry->d_seq);
@@ -3018,9 +3026,10 @@ static void __d_move(struct dentry *dentry, struct dentry *target,
write_seqcount_end(&target->d_seq);
write_seqcount_end(&dentry->d_seq);
- if (dir)
- end_dir_add(dir, n, d_wait);
-
+ if (dir) {
+ end_dir_add(dir, n);
+ __d_wake_in_lookup_waiters(target);
+ }
if (dentry->d_parent != old_parent)
spin_unlock(&dentry->d_parent->d_lock);
if (dentry != old_parent)
diff --git a/fs/fuse/readdir.c b/fs/fuse/readdir.c
index db5ae8ec1030..a2361f1d9905 100644
--- a/fs/fuse/readdir.c
+++ b/fs/fuse/readdir.c
@@ -164,7 +164,6 @@ static int fuse_direntplus_link(struct file *file,
struct inode *dir = d_inode(parent);
struct fuse_conn *fc;
struct inode *inode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
int epoch;
if (!o->nodeid) {
@@ -201,7 +200,7 @@ static int fuse_direntplus_link(struct file *file,
dentry = d_lookup(parent, &name);
if (!dentry) {
retry:
- dentry = d_alloc_parallel(parent, &name, &wq);
+ dentry = d_alloc_parallel(parent, &name);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
}
diff --git a/fs/namei.c b/fs/namei.c
index c7fac83c9a85..ebde3a35746c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1891,13 +1891,12 @@ static struct dentry *__lookup_slow(const struct qstr *name,
{
struct dentry *dentry, *old;
struct inode *inode = dir->d_inode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
/* Don't go there if it's already dead */
if (unlikely(IS_DEADDIR(inode)))
return ERR_PTR(-ENOENT);
again:
- dentry = d_alloc_parallel(dir, name, &wq);
+ dentry = d_alloc_parallel(dir, name);
if (IS_ERR(dentry))
return dentry;
if (unlikely(!d_in_lookup(dentry))) {
@@ -4414,7 +4413,6 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
struct dentry *dentry;
int error, create_error = 0;
umode_t mode = op->mode;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
if (unlikely(IS_DEADDIR(dir_inode)))
return ERR_PTR(-ENOENT);
@@ -4423,7 +4421,7 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
dentry = d_lookup(dir, &nd->last);
for (;;) {
if (!dentry) {
- dentry = d_alloc_parallel(dir, &nd->last, &wq);
+ dentry = d_alloc_parallel(dir, &nd->last);
if (IS_ERR(dentry))
return dentry;
}
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index e9ce1883288c..9580af999d70 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -726,7 +726,6 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
unsigned long dir_verifier)
{
struct qstr filename = QSTR_INIT(entry->name, entry->len);
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct dentry *dentry;
struct dentry *alias;
struct inode *inode;
@@ -755,7 +754,7 @@ void nfs_prime_dcache(struct dentry *parent, struct nfs_entry *entry,
dentry = d_lookup(parent, &filename);
again:
if (!dentry) {
- dentry = d_alloc_parallel(parent, &filename, &wq);
+ dentry = d_alloc_parallel(parent, &filename);
if (IS_ERR(dentry))
return;
}
@@ -2106,7 +2105,6 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
struct file *file, unsigned open_flags,
umode_t mode)
{
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
struct nfs_open_context *ctx;
struct dentry *res;
struct iattr attr = { .ia_valid = ATTR_OPEN };
@@ -2162,7 +2160,7 @@ int nfs_atomic_open(struct inode *dir, struct dentry *dentry,
d_drop(dentry);
switched = true;
dentry = d_alloc_parallel(dentry->d_parent,
- &dentry->d_name, &wq);
+ &dentry->d_name);
if (IS_ERR(dentry))
return PTR_ERR(dentry);
if (unlikely(!d_in_lookup(dentry)))
diff --git a/fs/nfs/unlink.c b/fs/nfs/unlink.c
index df3ca4669df6..43ea897943c0 100644
--- a/fs/nfs/unlink.c
+++ b/fs/nfs/unlink.c
@@ -124,7 +124,7 @@ static int nfs_call_unlink(struct dentry *dentry, struct inode *inode, struct nf
struct dentry *alias;
down_read_non_owner(&NFS_I(dir)->rmdir_sem);
- alias = d_alloc_parallel(dentry->d_parent, &data->args.name, &data->wq);
+ alias = d_alloc_parallel(dentry->d_parent, &data->args.name);
if (IS_ERR(alias)) {
up_read_non_owner(&NFS_I(dir)->rmdir_sem);
return 0;
@@ -185,7 +185,6 @@ nfs_async_unlink(struct dentry *dentry, const struct qstr *name)
data->cred = get_current_cred();
data->res.dir_attr = &data->dir_attr;
- init_waitqueue_head(&data->wq);
status = -EBUSY;
spin_lock(&dentry->d_lock);
diff --git a/fs/proc/base.c b/fs/proc/base.c
index d9acfa89c894..d55a4b603188 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2132,8 +2132,7 @@ bool proc_fill_cache(struct file *file, struct dir_context *ctx,
goto end_instantiate;
if (!child) {
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
- child = d_alloc_parallel(dir, &qname, &wq);
+ child = d_alloc_parallel(dir, &qname);
if (IS_ERR(child))
goto end_instantiate;
if (d_in_lookup(child)) {
diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 49ab74e0bfde..04a382178c65 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -692,8 +692,7 @@ static bool proc_sys_fill_cache(struct file *file,
child = d_lookup(dir, &qname);
if (!child) {
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
- child = d_alloc_parallel(dir, &qname, &wq);
+ child = d_alloc_parallel(dir, &qname);
if (IS_ERR(child))
return false;
if (d_in_lookup(child)) {
diff --git a/fs/smb/client/readdir.c b/fs/smb/client/readdir.c
index be22bbc4a65a..a8995261831c 100644
--- a/fs/smb/client/readdir.c
+++ b/fs/smb/client/readdir.c
@@ -73,7 +73,6 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
struct cifs_sb_info *cifs_sb = CIFS_SB(sb);
bool posix = cifs_sb_master_tcon(cifs_sb)->posix_extensions;
bool reparse_need_reval = false;
- DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);
int rc;
cifs_dbg(FYI, "%s: for %s\n", __func__, name->name);
@@ -105,7 +104,7 @@ cifs_prime_dcache(struct dentry *parent, struct qstr *name,
(fattr->cf_flags & CIFS_FATTR_NEED_REVAL))
return;
- dentry = d_alloc_parallel(parent, name, &wq);
+ dentry = d_alloc_parallel(parent, name);
}
if (IS_ERR(dentry))
return;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 2577c05f84ec..97a887be150a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -116,10 +116,7 @@ struct dentry {
* possible!
*/
- union {
- struct list_head d_lru; /* LRU list */
- wait_queue_head_t *d_wait; /* in-lookup ones only */
- };
+ struct list_head d_lru; /* LRU list */
struct hlist_node d_sib; /* child of parent list */
struct hlist_head d_children; /* our children */
/*
@@ -210,6 +207,9 @@ enum dentry_flags {
DCACHE_REFERENCED = BIT(6), /* Recently used, don't discard. */
DCACHE_DONTCACHE = BIT(7), /* Purge from memory on final dput() */
DCACHE_CANT_MOUNT = BIT(8),
+ DCACHE_LOOKUP_WAITERS = BIT(9), /* A thread is waiting for
+ * PAR_LOOKUP to clear
+ */
DCACHE_SHRINK_LIST = BIT(10),
DCACHE_OP_WEAK_REVALIDATE = BIT(11),
/*
@@ -256,8 +256,7 @@ extern void d_delete(struct dentry *);
/* allocate/de-allocate */
extern struct dentry * d_alloc(struct dentry *, const struct qstr *);
extern struct dentry * d_alloc_anon(struct super_block *);
-extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *,
- wait_queue_head_t *);
+extern struct dentry * d_alloc_parallel(struct dentry *, const struct qstr *);
extern struct dentry * d_splice_alias(struct inode *, struct dentry *);
/* weird procfs mess; *NOT* exported */
extern struct dentry * d_splice_alias_ops(struct inode *, struct dentry *,
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index fcbd21b5685f..6aced49d5f00 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1743,7 +1743,6 @@ struct nfs_unlinkdata {
struct nfs_removeargs args;
struct nfs_removeres res;
struct dentry *dentry;
- wait_queue_head_t wq;
const struct cred *cred;
struct nfs_fattr dir_attr;
long timeout;
--
2.47.3
* [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
2026-05-05 5:53 ` [RFC PATCH 01/25] VFS: use wait_var_event for waiting in d_alloc_parallel() Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 8:21 ` NeilBrown
2026-05-05 5:53 ` [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries Al Viro
` (23 subsequent siblings)
25 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
A lot of places rely upon directories never having NORCU dentries;
currently that property holds, but the proof is not straightforward
and rather brittle.
It's better to have that verified in the sole caller of d_alloc_pseudo(),
so that any future bugs in that direction are caught early.
That way we can be sure that
* current directory of any process is not NORCU
* root directory of any process is not NORCU
* starting point of any LOOKUP_RCU pathwalk is not NORCU
* dget_parent() can rely upon ->d_parent not being NORCU
* d_walk() and is_subdir() can rely upon the same
* alloc_file_pseudo() won't create multiple aliases for a directory
without having to go through a convoluted audit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/file_table.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/file_table.c b/fs/file_table.c
index 16e52e7fc2ac..108ba09fb402 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -402,6 +402,8 @@ static struct file *alloc_file(const struct path *path, int flags,
static inline int alloc_path_pseudo(const char *name, struct inode *inode,
struct vfsmount *mnt, struct path *path)
{
+ if (WARN_ON_ONCE(S_ISDIR(inode->i_mode)))
+ return -EINVAL;
path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
if (!path->dentry)
return -ENOMEM;
--
2.47.3
* Re: [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories
2026-05-05 5:53 ` [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories Al Viro
@ 2026-05-05 8:21 ` NeilBrown
2026-05-05 17:48 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2026-05-05 8:21 UTC (permalink / raw)
To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara
On Tue, 05 May 2026, Al Viro wrote:
> A lot of places relies upon directories never having NORCU dentries;
> currently that property holds, but the proof is not straightforward
> and rather brittle.
This requirement is only indirectly documented in d_alloc_pseudo()
where it says "the stuff that should never be anyone's children or
parents", and they are for "lookup-less filesystems" wherein directories
obviously cannot have a role.
Would it be worth making it explicit?
"A pseudo dentry must never be used for a directory"
NeilBrown
>
> It's better to have that verified in the sole caller of d_alloc_pseudo(),
> so that any future bugs in that direction were caught early.
>
> That way we can be sure that
> * current directory of any process is not NORCU
> * root directory of any process is not NORCU
> * starting point of any LOOKUP_RCU pathwalk is not NORCU
> * dget_parent() can rely upon ->d_parent not being NORCU
> * d_walk() and is_subdir() can rely upon the same
> * alloc_file_pseudo() won't create multiple aliases for a directory
> without having to go through a convoluted audit.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> fs/file_table.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 16e52e7fc2ac..108ba09fb402 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -402,6 +402,8 @@ static struct file *alloc_file(const struct path *path, int flags,
> static inline int alloc_path_pseudo(const char *name, struct inode *inode,
> struct vfsmount *mnt, struct path *path)
> {
> + if (WARN_ON_ONCE(S_ISDIR(inode->i_mode)))
> + return -EINVAL;
> path->dentry = d_alloc_pseudo(mnt->mnt_sb, &QSTR(name));
> if (!path->dentry)
> return -ENOMEM;
> --
> 2.47.3
>
>
* Re: [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories
2026-05-05 8:21 ` NeilBrown
@ 2026-05-05 17:48 ` Al Viro
0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 17:48 UTC (permalink / raw)
To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara
On Tue, May 05, 2026 at 06:21:55PM +1000, NeilBrown wrote:
> On Tue, 05 May 2026, Al Viro wrote:
> > A lot of places relies upon directories never having NORCU dentries;
> > currently that property holds, but the proof is not straightforward
> > and rather brittle.
>
> This requirement is only indirectly documented in d_alloc_pseudo()
> where it says "the stuff that should never be anyone's children or
> parents", and they are for "lookup-less filesystems" wherein directories
> obviously cannot have a role.
> Would it be worth making it explicit?
> "A pseudo dentry must never be used for a directory"
The comment in d_alloc_pseudo() is going to be replaced; currently it's
flat-out inaccurate - memfd_create(2) will result in a NORCU dentry
on tmpfs, which is very definitely *NOT* a lookup-less filesystem.
I have half-finished documentation on the rules for NORCU dentries -
adding that is in the "yet to be stabilized" local followups to this
series. I hope to get that dealt with during this cycle - this is not
the final iteration of that patchset.
* [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
2026-05-05 5:53 ` [RFC PATCH 01/25] VFS: use wait_var_event for waiting in d_alloc_parallel() Al Viro
2026-05-05 5:53 ` [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 17:06 ` Linus Torvalds
2026-05-05 5:53 ` [RFC PATCH 04/25] find_acceptable_alias(): skip NORCU aliases with zero refcount Al Viro
` (22 subsequent siblings)
25 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Refcount of a NORCU dentry must not be incremented after having dropped
to zero. Otherwise we might end up with the following race:
CPU1: in fast_dput(d), rcu_read_lock();
CPU1: decrements refcount of d to 0
CPU1: notice that it's unhashed
CPU2: grab a reference to d
CPU2: dput(d), freeing d
CPU1: ... looks like we need to evict d, let's grab ->d_lock, recheck
the refcount, etc.
and that spin_lock(&d->d_lock) ends up being a UAF, despite still being in
an RCU read-side critical area started back when the refcount had been
positive. If not for DCACHE_NORCU in d->d_flags, freeing would've been
RCU-delayed, so we'd have grabbed ->d_lock, noticed the negative value
stored into the refcount by __dentry_kill(), dropped the locks and that would
be it. For NORCU dentries freeing is _not_ delayed, though.
Most of the non-counting references are excluded for NORCU dentries -
they are not allowed to be hashed, they never get placed on LRU, they
never get placed into anyone's list of children and while dput_to_list()
might put them into a shrink list, nobody bumps refcount of something
that had been reached that way.
However, inode's list of aliases can be a problem - it does not contribute
to dentry refcount (for obvious reasons) and we *do* have places that
grab references to something found on that list - that's precisely what
d_find_alias() is. In case of d_find_alias() we are safe - it skips
unhashed aliases, so all NORCU ones are ignored there. d_find_any_alias()
is *not* limited to hashed ones, though, and while it's usually called
for directories (which never get NORCU dentries), there are callers that
use it to get something for non-directories with no hashed aliases.
Having d_find_any_alias() hit a NORCU dentry is not impossible - it can
be easily arranged if you have CAP_DAC_READ_SEARCH (memfd_create() + mmap()
+ name_to_handle_at() for /proc/self/map_files/<...> + munmap() +
open_by_handle_at() will do that, and adding a second memfd_create() for
mount_fd makes it possible to do that without having memfd pinned).
The race window is narrow, and it's probably not feasible on bare hardware,
but...
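For the record, a rough userspace sketch of that sequence (single-threaded,
so it only demonstrates the setup - actually hitting the window would mean
racing the final close() against open_by_handle_at() from another thread;
assumes CAP_DAC_READ_SEARCH, error handling elided):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            struct file_handle *fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            int mount_id;
            int mount_fd = memfd_create("anchor", 0); /* only used as mount_fd */
            int mfd = memfd_create("victim", 0);      /* gets a NORCU dentry */
            char path[64];
            void *p;

            ftruncate(mfd, 4096);
            p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, mfd, 0);
            snprintf(path, sizeof(path), "/proc/self/map_files/%lx-%lx",
                     (unsigned long)p, (unsigned long)p + 4096);
            fh->handle_bytes = MAX_HANDLE_SZ;
            name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);
            munmap(p, 4096);
            close(mfd);     /* final dput() of the NORCU dentry */
            /* open_by_handle_at() ends up in d_find_any_alias() */
            return open_by_handle_at(mount_fd, fh, O_RDONLY) < 0;
    }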
It's not hard to fix, fortunately:
* separate __d_find_dir_alias() (== current __d_find_any_alias()) to
be used for directory inodes.
* provide __dget_alias_careful() that returns false for NORCU
dentries with zero refcount and otherwise increments the refcount and
returns true
* make __d_find_any_alias() go over the list of aliases, using
__dget_alias_careful() and returning the alias it succeeds on (normally the
first one). Any NORCU alias with zero refcount is going to be evicted by
the thread that had dropped the final reference; this makes __d_find_any_alias()
pretend it had lost the race with eviction.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 21 ++++++++++++++++++---
include/linux/dcache.h | 18 ++++++++++++++++++
2 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 0aff2c510beb..923e499ffe7e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1052,7 +1052,10 @@ struct dentry *dget_parent(struct dentry *dentry)
}
EXPORT_SYMBOL(dget_parent);
-static struct dentry * __d_find_any_alias(struct inode *inode)
+/*
+ * inode is a directory, inode->i_lock is held by the caller
+ */
+static struct dentry * __d_find_dir_alias(struct inode *inode)
{
struct dentry *alias;
@@ -1063,6 +1066,18 @@ static struct dentry * __d_find_any_alias(struct inode *inode)
return alias;
}
+static struct dentry * __d_find_any_alias(struct inode *inode)
+{
+ struct dentry *alias;
+
+ if (hlist_empty(&inode->i_dentry))
+ return NULL;
+ for_each_alias(alias, inode)
+ if (__dget_alias_careful(alias))
+ return alias;
+ return NULL;
+}
+
/**
* d_find_any_alias - find any alias for a given inode
* @inode: inode to find an alias for
@@ -1086,7 +1101,7 @@ static struct dentry *__d_find_alias(struct inode *inode)
struct dentry *alias;
if (S_ISDIR(inode->i_mode))
- return __d_find_any_alias(inode);
+ return __d_find_dir_alias(inode);
for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
@@ -3150,7 +3165,7 @@ struct dentry *d_splice_alias_ops(struct inode *inode, struct dentry *dentry,
security_d_instantiate(dentry, inode);
spin_lock(&inode->i_lock);
if (S_ISDIR(inode->i_mode)) {
- struct dentry *new = __d_find_any_alias(inode);
+ struct dentry *new = __d_find_dir_alias(inode);
if (unlikely(new)) {
/* The reference to new ensures it remains an alias */
spin_unlock(&inode->i_lock);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 97a887be150a..684aeb9e9cbe 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -365,6 +365,24 @@ static inline struct dentry *dget(struct dentry *dentry)
return dentry;
}
+/* dentry->d_inode->i_lock must be held by caller */
+static inline bool __dget_alias_careful(struct dentry *dentry)
+{
+ if (likely(!(READ_ONCE(dentry->d_flags) & DCACHE_NORCU))) {
+ lockref_get(&dentry->d_lockref);
+ return true;
+ }
+ // NORCU dentries with zero refcount MUST NOT be grabbed
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_lockref.count > 0) {
+ dget_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
+ return true;
+ }
+ spin_unlock(&dentry->d_lock);
+ return false;
+}
+
extern struct dentry *dget_parent(struct dentry *dentry);
/**
--
2.47.3
* Re: [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries
2026-05-05 5:53 ` [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries Al Viro
@ 2026-05-05 17:06 ` Linus Torvalds
2026-05-05 20:29 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2026-05-05 17:06 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Mon, 4 May 2026 at 22:55, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> +/* dentry->d_inode->i_lock must be held by caller */
> +static inline bool __dget_alias_careful(struct dentry *dentry)
I know the double-underscore pattern is the standard pattern we use
for "internal functions that have special requirements you need to
care about" and I'll take the blame for that pattern... It goes back
to the very beginning.
But maybe in cases like this, we could just instead document the rule
in the name instead of that "careful" and double underscores?
IOW, how about just calling it
static inline bool dget_alias_ilocked(struct dentry *dentry)
or something like that?
I'm being difficult today, I know.
So feel free to ignore me if you think I'm being silly.
Linus
* Re: [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries
2026-05-05 17:06 ` Linus Torvalds
@ 2026-05-05 20:29 ` Al Viro
0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 20:29 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Tue, May 05, 2026 at 10:06:06AM -0700, Linus Torvalds wrote:
> On Mon, 4 May 2026 at 22:55, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > +/* dentry->d_inode->i_lock must be held by caller */
> > +static inline bool __dget_alias_careful(struct dentry *dentry)
>
> I know the double-underscore pattern is the standard pattern we use
> for "internal functions that have special requirements you need to
> care about" and I'll take the blame for that pattern... It goes back
> to the very beginning.
>
> But maybe in cases like this, we could just instead document the rule
> in the name instead of that "careful" and double underscores?
>
> IOW, how about just calling it
>
> static inline bool dget_alias_ilocked(struct dentry *dentry)
>
> or something like that?
Makes sense, will do...
* [RFC PATCH 04/25] find_acceptable_alias(): skip NORCU aliases with zero refcount
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (2 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 05/25] select_collect(): ignore dentries on shrink lists if they have positive refcounts Al Viro
` (21 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Similar to the d_find_any_alias() situation.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/exportfs/expfs.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index fbd45e7ae706..0295bd3386cf 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -53,10 +53,10 @@ find_acceptable_alias(struct dentry *result,
inode = result->d_inode;
spin_lock(&inode->i_lock);
for_each_alias(dentry, inode) {
- dget(dentry);
+ if (!__dget_alias_careful(dentry))
+ continue;
spin_unlock(&inode->i_lock);
- if (toput)
- dput(toput);
+ dput(toput);
if (dentry != result && acceptable(context, dentry)) {
dput(result);
return dentry;
@@ -66,8 +66,7 @@ find_acceptable_alias(struct dentry *result,
}
spin_unlock(&inode->i_lock);
- if (toput)
- dput(toput);
+ dput(toput);
return NULL;
}
--
2.47.3
* [RFC PATCH 05/25] select_collect(): ignore dentries on shrink lists if they have positive refcounts
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (3 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 04/25] find_acceptable_alias(): skip NORCU aliases with zero refcount Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 06/25] make to_shrink_list() return whether it has moved dentry to list Al Viro
` (20 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
If all the dentries we find have positive refcounts and some happen
to be on shrink lists, there's no point trying to steal them in the
select_collect2() phase - we won't be able to evict any of them. Busy
on a shrink list is still busy...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 923e499ffe7e..f3e909c7aaad 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1589,9 +1589,7 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
if (data->start == dentry)
goto out;
- if (dentry->d_flags & DCACHE_SHRINK_LIST) {
- data->found++;
- } else if (!dentry->d_lockref.count) {
+ if (!dentry->d_lockref.count) {
to_shrink_list(dentry, &data->dispose);
data->found++;
} else if (dentry->d_lockref.count < 0) {
--
2.47.3
* [RFC PATCH 06/25] make to_shrink_list() return whether it has moved dentry to list
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (4 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 05/25] select_collect(): ignore dentries on shrink lists if they have positive refcounts Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 07/25] kill d_dispose_if_unused() Al Viro
` (19 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
... and make it check the refcount for being zero in addition to the
dentry not already being on a shrink list. That simplifies the callers...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f3e909c7aaad..dc91cbb568f2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -988,14 +988,17 @@ void d_make_discardable(struct dentry *dentry)
}
EXPORT_SYMBOL(d_make_discardable);
-static void to_shrink_list(struct dentry *dentry, struct list_head *list)
+static bool to_shrink_list(struct dentry *dentry, struct list_head *list)
__must_hold(&dentry->d_lock)
{
- if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
+ if (likely(!dentry->d_lockref.count &&
+ !(dentry->d_flags & DCACHE_SHRINK_LIST))) {
if (dentry->d_flags & DCACHE_LRU_LIST)
d_lru_del(dentry);
d_shrink_add(dentry, list);
+ return true;
}
+ return false;
}
void dput_to_list(struct dentry *dentry, struct list_head *list)
@@ -1180,8 +1183,7 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
void d_dispose_if_unused(struct dentry *dentry, struct list_head *dispose)
{
spin_lock(&dentry->d_lock);
- if (!dentry->d_lockref.count)
- to_shrink_list(dentry, dispose);
+ to_shrink_list(dentry, dispose);
spin_unlock(&dentry->d_lock);
}
EXPORT_SYMBOL(d_dispose_if_unused);
@@ -1589,11 +1591,9 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
if (data->start == dentry)
goto out;
- if (!dentry->d_lockref.count) {
+ if (dentry->d_lockref.count <= 0) {
to_shrink_list(dentry, &data->dispose);
data->found++;
- } else if (dentry->d_lockref.count < 0) {
- data->found++;
}
/*
* We can return to the caller if we have found some (this
@@ -1623,17 +1623,12 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
if (data->start == dentry)
goto out;
- if (!dentry->d_lockref.count) {
- if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+ if (dentry->d_lockref.count <= 0) {
+ if (!to_shrink_list(dentry, &data->dispose)) {
rcu_read_lock();
data->victim = dentry;
return D_WALK_QUIT;
}
- to_shrink_list(dentry, &data->dispose);
- } else if (dentry->d_lockref.count < 0) {
- rcu_read_lock();
- data->victim = dentry;
- return D_WALK_QUIT;
}
/*
* We can return to the caller if we have found some (this
--
2.47.3
* [RFC PATCH 07/25] kill d_dispose_if_unused()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (5 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 06/25] make to_shrink_list() return whether it has moved dentry to list Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 08/25] d_prune_aliases(): make sure to skip NORCU aliases Al Viro
` (18 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Rename to_shrink_list() into __move_to_shrink_list(), document and
export it. Switch d_dispose_if_unused() users to that and kill
d_dispose_if_unused() itself.
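The resulting calling pattern (this is what the converted d_prune_aliases()
below does) is a minimal lock/move/unlock sequence followed by disposing of
the thread-private list:

    LIST_HEAD(dispose);

    spin_lock(&dentry->d_lock);
    __move_to_shrink_list(dentry, &dispose);  /* no-op if busy or already listed */
    spin_unlock(&dentry->d_lock);

    shrink_dentry_list(&dispose);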
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
Documentation/filesystems/porting.rst | 11 ++++++
fs/dcache.c | 51 ++++++++++++++-------------
fs/fuse/dir.c | 2 +-
include/linux/dcache.h | 2 +-
4 files changed, 40 insertions(+), 26 deletions(-)
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 36fecc7a3d97..003ab084ad48 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1391,3 +1391,14 @@ either form of manual loop.
**mandatory**
d_alloc_parallel() no longer requires a waitqueue_head.
+
+---
+
+**mandatory**
+
+d_dispose_if_unused() is gone; use __move_to_shrink_list() if you really
+need that functionality, but watch out for memory safety issues - just
+as with d_dispose_if_unused() these are not trivial; with this variant
+of the API it's more explicit, since grabbing ->d_lock is caller-side, but
+d_dispose_if_unused() had all the same issues. It's a low-level primitive;
+use only if you have no alternative.
diff --git a/fs/dcache.c b/fs/dcache.c
index dc91cbb568f2..7d8c23a42409 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -988,7 +988,24 @@ void d_make_discardable(struct dentry *dentry)
}
EXPORT_SYMBOL(d_make_discardable);
-static bool to_shrink_list(struct dentry *dentry, struct list_head *list)
+/**
+ * __move_to_shrink_list - try to place a dentry into a shrink list
+ * @dentry: dentry to try putting into shrink list
+ * @list: the list to put @dentry into.
+ * Returns: true if @dentry has been placed into @list, false otherwise
+ *
+ * If @dentry is idle and not already on a shrink list, move
+ * it into @list and return %true; otherwise do nothing and return %false.
+ *
+ * Caller must be holding @dentry->d_lock. There must have been no calls of
+ * dentry_free(@dentry) prior to the beginning of the RCU read-side critical
+ * area in which __move_to_shrink_list(@dentry, @list) is called.
+ *
+ * @list should be thread-private and eventually emptied by passing it to
+ * shrink_dentry_list().
+ */
+
+bool __move_to_shrink_list(struct dentry *dentry, struct list_head *list)
__must_hold(&dentry->d_lock)
{
if (likely(!dentry->d_lockref.count &&
@@ -1000,6 +1017,7 @@ __must_hold(&dentry->d_lock)
}
return false;
}
+EXPORT_SYMBOL(__move_to_shrink_list);
void dput_to_list(struct dentry *dentry, struct list_head *list)
{
@@ -1009,7 +1027,7 @@ void dput_to_list(struct dentry *dentry, struct list_head *list)
return;
}
rcu_read_unlock();
- to_shrink_list(dentry, list);
+ __move_to_shrink_list(dentry, list);
spin_unlock(&dentry->d_lock);
}
@@ -1170,24 +1188,6 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
return de;
}
-/**
- * d_dispose_if_unused - move unreferenced dentries to shrink list
- * @dentry: dentry in question
- * @dispose: head of shrink list
- *
- * If dentry has no external references, move it to shrink list.
- *
- * NOTE!!! The caller is responsible for preventing eviction of the dentry by
- * holding dentry->d_inode->i_lock or equivalent.
- */
-void d_dispose_if_unused(struct dentry *dentry, struct list_head *dispose)
-{
- spin_lock(&dentry->d_lock);
- to_shrink_list(dentry, dispose);
- spin_unlock(&dentry->d_lock);
-}
-EXPORT_SYMBOL(d_dispose_if_unused);
-
/*
* Try to kill dentries associated with this inode.
* WARNING: you must own a reference to inode.
@@ -1198,8 +1198,11 @@ void d_prune_aliases(struct inode *inode)
struct dentry *dentry;
spin_lock(&inode->i_lock);
- for_each_alias(dentry, inode)
- d_dispose_if_unused(dentry, &dispose);
+ for_each_alias(dentry, inode) {
+ spin_lock(&dentry->d_lock);
+ __move_to_shrink_list(dentry, &dispose);
+ spin_unlock(&dentry->d_lock);
+ }
spin_unlock(&inode->i_lock);
shrink_dentry_list(&dispose);
}
@@ -1592,7 +1595,7 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
goto out;
if (dentry->d_lockref.count <= 0) {
- to_shrink_list(dentry, &data->dispose);
+ __move_to_shrink_list(dentry, &data->dispose);
data->found++;
}
/*
@@ -1624,7 +1627,7 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
goto out;
if (dentry->d_lockref.count <= 0) {
- if (!to_shrink_list(dentry, &data->dispose)) {
+ if (!__move_to_shrink_list(dentry, &data->dispose)) {
rcu_read_lock();
data->victim = dentry;
return D_WALK_QUIT;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b658b6baf72f..d8e8ea7280bc 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -177,8 +177,8 @@ static void fuse_dentry_tree_work(struct work_struct *work)
spin_lock(&fd->dentry->d_lock);
/* If dentry is still referenced, let next dput release it */
fd->dentry->d_flags |= DCACHE_OP_DELETE;
+ __move_to_shrink_list(fd->dentry, &dispose);
spin_unlock(&fd->dentry->d_lock);
- d_dispose_if_unused(fd->dentry, &dispose);
if (need_resched()) {
spin_unlock(&dentry_hash[i].lock);
cond_resched();
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 684aeb9e9cbe..0eecefe90ca7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -280,7 +280,7 @@ extern void d_tmpfile(struct file *, struct inode *);
extern struct dentry *d_find_alias(struct inode *);
extern void d_prune_aliases(struct inode *);
-extern void d_dispose_if_unused(struct dentry *, struct list_head *);
+extern bool __move_to_shrink_list(struct dentry *, struct list_head *);
extern void shrink_dentry_list(struct list_head *);
extern struct dentry *d_find_alias_rcu(struct inode *);
--
2.47.3
* [RFC PATCH 08/25] d_prune_aliases(): make sure to skip NORCU aliases
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (6 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 07/25] kill d_dispose_if_unused() Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list Al Viro
` (17 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Either they are busy (in which case they won't be moved to a shrink
list anyway) or they have a zero refcount, in which case we really
shouldn't mess with them - whoever had dropped the refcount to
zero is on the way to evicting and freeing them.
That way we are guaranteed that only the thread that has dropped the
refcount of a NORCU dentry to zero might call lock_for_kill() and
__dentry_kill() for those.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7d8c23a42409..34d57ed9d791 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1200,7 +1200,8 @@ void d_prune_aliases(struct inode *inode)
spin_lock(&inode->i_lock);
for_each_alias(dentry, inode) {
spin_lock(&dentry->d_lock);
- __move_to_shrink_list(dentry, &dispose);
+ if (likely(!(dentry->d_flags & DCACHE_NORCU)))
+ __move_to_shrink_list(dentry, &dispose);
spin_unlock(&dentry->d_lock);
}
spin_unlock(&inode->i_lock);
--
2.47.3
* [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (7 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 08/25] d_prune_aliases(): make sure to skip NORCU aliases Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-07 20:39 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 10/25] fold lock_for_kill() into shrink_kill() Al Viro
` (16 subsequent siblings)
25 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Currently we leave the dentry on the list until we are done with
lock_for_kill(). That guarantees that it won't even have been
scheduled for removal until we remove it from the list
and drop ->d_lock. We grab ->d_lock and rcu_read_lock()
and call lock_for_kill(). There are four possible cases:
1) lock_for_kill() has succeeded; dentry and its inode
(if any) are locked, dentry refcount is zero and we can
remove it from shrink list and feed it to shrink_kill().
2) lock_for_kill() fails since dentry has become busy.
Nothing to do, rcu_read_unlock(), remove from shrink list,
drop ->d_lock and move on.
3) lock_for_kill() fails since dentry is currently
being killed - already entered __dentry_kill(), but hasn't
reached dentry_unlist() yet. Nothing to do, we should just
do rcu_read_unlock(), remove from shrink list so that
whoever's executing __dentry_kill() would free it once they
are done, drop ->d_lock and move on - same actions as in
case (2).
4) lock_for_kill() fails since dentry has been killed
(reached dentry_unlist(), DCACHE_DENTRY_KILLED set in ->d_flags).
In that case whoever had been killing it had already seen it
on our shrink list and skipped freeing it. At that point it's
just a passive chunk of memory; rcu_read_unlock(), remove from
the list, drop ->d_lock and use dentry_free() to schedule
freeing.
While that works, there's a simpler way to do it:
* grab ->d_lock
* remove dentry from our shrink list
* if DCACHE_DENTRY_KILLED is already set, drop ->d_lock and move on.
* otherwise grab rcu_read_lock() and call lock_for_kill()
* if lock_for_kill() succeeds, feed dentry to shrink_kill(),
otherwise drop the locks and move on.
The end result is equivalent to the old variant. The only difference
arises if, at the time we grab ->d_lock, the dentry had refcount 0 and
lock_for_kill() had failed spin_trylock() and had to drop and regain
->d_lock. Otherwise nobody can observe at which point within the
unbroken ->d_lock scope the dentry had been removed from the shrink list -
all accesses to ->d_lru are under ->d_lock.
If ->d_lock had been dropped and regained, it is possible for another
thread to feed that dentry to __dentry_kill(); if it doesn't get to
dentry_unlist() before we regain ->d_lock, the behaviour is still identical -
it's case (3), and by the time __dentry_kill() would've gotten around
to checking whether the victim is on a shrink list, it would've already
been removed from ours.
If __dentry_kill() from another thread *does* get to dentry_unlist(),
in the old variant we would have __dentry_kill() leave calling
dentry_free() to us and in the new one __dentry_kill() would've called
dentry_free() itself. Since we are under rcu_read_lock(), we are
guaranteed that actual freeing won't happen until we get around to
rcu_read_unlock(). IOW, the new variant is still safe wrt UAF, if
not for the same reason as the old one, and the overall result is the same;
the only difference is which thread ends up scheduling the actual
freeing of the dentry.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 34d57ed9d791..ee11171a75e6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1228,19 +1228,19 @@ void shrink_dentry_list(struct list_head *list)
dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
+ d_shrink_del(dentry);
+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ dentry_free(dentry);
+ spin_unlock(&dentry->d_lock);
+ continue;
+ }
rcu_read_lock();
if (!lock_for_kill(dentry)) {
- bool can_free;
- rcu_read_unlock();
- d_shrink_del(dentry);
- can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
spin_unlock(&dentry->d_lock);
- if (can_free)
- dentry_free(dentry);
- continue;
+ rcu_read_unlock();
+ } else {
+ shrink_kill(dentry);
}
- d_shrink_del(dentry);
- shrink_kill(dentry);
}
}
EXPORT_SYMBOL(shrink_dentry_list);
--
2.47.3
* Re: [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list
2026-05-05 5:53 ` [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list Al Viro
@ 2026-05-07 20:39 ` Al Viro
0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-07 20:39 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Tue, May 05, 2026 at 06:53:56AM +0100, Al Viro wrote:
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 34d57ed9d791..ee11171a75e6 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -1228,19 +1228,19 @@ void shrink_dentry_list(struct list_head *list)
> + if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
> + dentry_free(dentry);
> + spin_unlock(&dentry->d_lock);
In the opposite order, obviously. It *is* safe in such form, but only
because NORCU dentries placed on shrink lists can't be evicted by
shrink_dentry_tree() (they won't be found by select_collect2()) or
by dput() (once on a shrink list, they have refcount zero and they
never get it incremented afterwards). So this "if killed by another
thread, with the empty husk left on the shrink list to be disposed of"
can only happen for !NORCU ones, with actual freeing not happening
until we are done with spin_unlock().
But even though that braino does not result in a UAF, we obviously should
just unlock first. Or, possibly, make dentry_free() called locked and
have it unlock the sucker as the first step... That would mean
- if (dentry->d_flags & DCACHE_SHRINK_LIST)
- can_free = false;
- spin_unlock(&dentry->d_lock);
- if (likely(can_free))
+ if (unlikely(dentry->d_flags & DCACHE_SHRINK_LIST))
+ spin_unlock(&dentry->d_lock);
+ else
dentry_free(dentry);
in dentry_kill(), and to hell with 'can_free' variable in there.
Not sure...
Anyway, for now I'm just fixing that braino; we can always revisit
the calling conventions for dentry_free() later.
* [RFC PATCH 10/25] fold lock_for_kill() into shrink_kill()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (8 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 11/25] fold lock_for_kill() and __dentry_kill() into common helper Al Viro
` (15 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Both callers have the exact same shape.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index ee11171a75e6..8fb0d924ecac 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1211,14 +1211,15 @@ EXPORT_SYMBOL(d_prune_aliases);
static inline void shrink_kill(struct dentry *victim)
{
- do {
+ while (lock_for_kill(victim)) {
rcu_read_unlock();
victim = __dentry_kill(victim);
+ if (!victim)
+ return;
rcu_read_lock();
- } while (victim && lock_for_kill(victim));
+ }
+ spin_unlock(&victim->d_lock);
rcu_read_unlock();
- if (victim)
- spin_unlock(&victim->d_lock);
}
void shrink_dentry_list(struct list_head *list)
@@ -1235,12 +1236,7 @@ void shrink_dentry_list(struct list_head *list)
continue;
}
rcu_read_lock();
- if (!lock_for_kill(dentry)) {
- spin_unlock(&dentry->d_lock);
- rcu_read_unlock();
- } else {
- shrink_kill(dentry);
- }
+ shrink_kill(dentry);
}
}
EXPORT_SYMBOL(shrink_dentry_list);
@@ -1688,12 +1684,7 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
wait_for_completion(&wait.completion);
continue;
}
- if (!lock_for_kill(v)) {
- spin_unlock(&v->d_lock);
- rcu_read_unlock();
- } else {
- shrink_kill(v);
- }
+ shrink_kill(v);
}
if (!list_empty(&data.dispose))
shrink_dentry_list(&data.dispose);
--
2.47.3
* [RFC PATCH 11/25] fold lock_for_kill() and __dentry_kill() into common helper
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (9 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 10/25] fold lock_for_kill() into shrink_kill() Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1 Al Viro
` (14 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
There are two callers of lock_for_kill() and both are followed
by the same sequence of actions:
* in case of failure, drop ->d_lock, do rcu_read_unlock() and
go away
* in case of success, do rcu_read_unlock() followed by
passing dentry to __dentry_kill(); if the latter returns NULL, go away.
All calls of __dentry_kill() are paired with lock_for_kill() now;
let's turn that sequence into a new helper (dentry_kill()) and switch
to using it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 28 +++++++++++++---------------
1 file changed, 13 insertions(+), 15 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 8fb0d924ecac..7cf965f9fb29 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -779,6 +779,17 @@ static bool lock_for_kill(struct dentry *dentry)
return false;
}
+static struct dentry *dentry_kill(struct dentry *dentry)
+{
+ if (unlikely(!lock_for_kill(dentry))) {
+ spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+ return NULL;
+ }
+ rcu_read_unlock();
+ return __dentry_kill(dentry);
+}
+
/*
* Decide if dentry is worth retaining. Usually this is called with dentry
* locked; if not locked, we are more limited and might not be able to tell
@@ -922,19 +933,13 @@ static void finish_dput(struct dentry *dentry)
__releases(dentry->d_lock)
__releases(RCU)
{
- while (lock_for_kill(dentry)) {
- rcu_read_unlock();
- dentry = __dentry_kill(dentry);
- if (!dentry)
- return;
+ while ((dentry = dentry_kill(dentry)) != NULL) {
if (retain_dentry(dentry, true)) {
spin_unlock(&dentry->d_lock);
return;
}
rcu_read_lock();
}
- rcu_read_unlock();
- spin_unlock(&dentry->d_lock);
}
/*
@@ -1211,15 +1216,8 @@ EXPORT_SYMBOL(d_prune_aliases);
static inline void shrink_kill(struct dentry *victim)
{
- while (lock_for_kill(victim)) {
- rcu_read_unlock();
- victim = __dentry_kill(victim);
- if (!victim)
- return;
+ while ((victim = dentry_kill(victim)) != NULL)
rcu_read_lock();
- }
- spin_unlock(&victim->d_lock);
- rcu_read_unlock();
}
void shrink_dentry_list(struct list_head *list)
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (10 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 11/25] fold lock_for_kill() and __dentry_kill() into common helper Al Viro
@ 2026-05-05 5:53 ` Al Viro
2026-05-05 8:55 ` NeilBrown
2026-05-05 16:47 ` Linus Torvalds
2026-05-05 5:54 ` [RFC PATCH 13/25] reducing rcu_read_lock() scopes in dput and friends, step 2 Al Viro
` (13 subsequent siblings)
25 siblings, 2 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
rcu_read_lock() scopes in dentry eviction machinery are too wide;
worse, quite a few of the functions involved are not neutral wrt
that, making them harder to reason about.
rcu_read_lock() scope is not the only thing establishing an
RCU read-side critical area - spin_lock scope does the same and
they can be mixed - the sequence
rcu_read_lock()
...
spin_lock()
...
rcu_read_unlock()
...
rcu_read_lock()
...
spin_unlock()
...
rcu_read_unlock()
is an unbroken RCU read-side critical area.
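Spelling out which primitive keeps the area alive at each point (same
sequence as above with annotations added; 'a' is a hypothetical
spinlock-protected object, not anything from this series):
	rcu_read_lock();	/* explicit read side opens */
	...
	spin_lock(&a->lock);	/* spinlock scope is a read side too */
	...
	rcu_read_unlock();	/* spinlock still held - area continues */
	...
	rcu_read_lock();	/* explicit read side again */
	...
	spin_unlock(&a->lock);	/* explicit read side still held */
	...
	rcu_read_unlock();	/* only here does the area end */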
Use of that observation allows us to shrink rcu_read_lock() scopes;
it's too delicate to do that in a single step, so we'll split that
into several commits.
This one breaks rcu_read_lock() scopes in front of shrink_kill() call
in shrink_dentry_list() and finish_dput() call in dput(); in both
places we are within ->d_lock scope, so doing that won't change
the RCU read-side critical areas.
With that done, we have all calls of shrink_kill() and finish_dput()
immediately preceded by rcu_read_lock(); next step will use that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7cf965f9fb29..d690c18f0c33 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -978,6 +978,8 @@ void dput(struct dentry *dentry)
rcu_read_unlock();
return;
}
+ rcu_read_unlock();
+ rcu_read_lock();
finish_dput(dentry);
}
EXPORT_SYMBOL(dput);
@@ -1682,6 +1684,8 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
wait_for_completion(&wait.completion);
continue;
}
+ rcu_read_unlock();
+ rcu_read_lock();
shrink_kill(v);
}
if (!list_empty(&data.dispose))
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 5:53 ` [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1 Al Viro
@ 2026-05-05 8:55 ` NeilBrown
2026-05-05 14:22 ` Al Viro
2026-05-05 16:47 ` Linus Torvalds
1 sibling, 1 reply; 53+ messages in thread
From: NeilBrown @ 2026-05-05 8:55 UTC (permalink / raw)
To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara
On Tue, 05 May 2026, Al Viro wrote:
> rcu_read_lock() scopes in dentry eviction machinery are too wide;
> worse, quite a few of the functions involved are not neutral wrt
> that, making them harder to reason about.
>
> rcu_read_lock() scope is not the only thing establishing an
> RCU read-side critical area - spin_lock scope does the same and
> they can be mixed - the sequence
Does it? I can't find any document or code which supports that claim.
Could you provide a pointer?
Thanks,
NeilBrown
> rcu_read_lock()
> ...
> spin_lock()
> ...
> rcu_read_unlock()
> ...
> rcu_read_lock()
> ...
> spin_unlock()
> ...
> rcu_read_unlock()
> is an unbroken RCU read-side critical area.
>
> Use of that observation allows us to shrink rcu_read_lock() scopes;
> it's too delicate to do that in a single step, so we'll split that
> into several commits.
>
> This one breaks rcu_read_lock() scopes in front of shrink_kill() call
> in shrink_dentry_list() and finish_dput() call in dput(); in both
> places we are within ->d_lock scope, so doing that won't change
> the RCU read-side critical areas.
>
> With that done, we have all calls of shrink_kill() and finish_dput()
> immediately preceded by rcu_read_lock(); next step will use that.
>
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> fs/dcache.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7cf965f9fb29..d690c18f0c33 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -978,6 +978,8 @@ void dput(struct dentry *dentry)
> rcu_read_unlock();
> return;
> }
> + rcu_read_unlock();
> + rcu_read_lock();
> finish_dput(dentry);
> }
> EXPORT_SYMBOL(dput);
> @@ -1682,6 +1684,8 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
> wait_for_completion(&wait.completion);
> continue;
> }
> + rcu_read_unlock();
> + rcu_read_lock();
> shrink_kill(v);
> }
> if (!list_empty(&data.dispose))
> --
> 2.47.3
>
>
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 8:55 ` NeilBrown
@ 2026-05-05 14:22 ` Al Viro
2026-05-05 21:58 ` NeilBrown
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 14:22 UTC (permalink / raw)
To: NeilBrown; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara
On Tue, May 05, 2026 at 06:55:33PM +1000, NeilBrown wrote:
> On Tue, 05 May 2026, Al Viro wrote:
> > rcu_read_lock() scopes in dentry eviction machinery are too wide;
> > worse, quite a few of the functions involved are not neutral wrt
> > that, making them harder to reason about.
> >
> > rcu_read_lock() scope is not the only thing establishing an
> > RCU read-side critical area - spin_lock scope does the same and
> > they can be mixed - the sequence
>
> Does it? I can't find any document or code which supports that claim.
> Could you provide a pointer?
Documentation/RCU/whatisRCU.rst in the section on rcu_read_unlock().
See the subthread back in April:
https://lore.kernel.org/all/CAHk-=wjRgHLvSnEY3P45hSQ0ycKxdz-xqnccAMPuGRrwsvWdig@mail.gmail.com/
including this, for example:
https://lore.kernel.org/all/4e59837d-567b-4678-8a15-e933def32fcb@paulmck-laptop/
Basically, "if ain't true, RCU implementation has a bug that needs fixing"
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 14:22 ` Al Viro
@ 2026-05-05 21:58 ` NeilBrown
0 siblings, 0 replies; 53+ messages in thread
From: NeilBrown @ 2026-05-05 21:58 UTC (permalink / raw)
To: Al Viro; +Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara
On Wed, 06 May 2026, Al Viro wrote:
> On Tue, May 05, 2026 at 06:55:33PM +1000, NeilBrown wrote:
> > On Tue, 05 May 2026, Al Viro wrote:
> > > rcu_read_lock() scopes in dentry eviction machinery are too wide;
> > > worse, quite a few of the functions involved are not neutral wrt
> > > that, making them harder to reason about.
> > >
> > > rcu_read_lock() scope is not the only thing establishing an
> > > RCU read-side critical area - spin_lock scope does the same and
> > > they can be mixed - the sequence
> >
> > Does it? I can't find any document or code which supports that claim.
> > Could you provide a pointer?
>
> Documentation/RCU/whatisRCU.rst in the section on rcu_read_unlock().
>
> See the subthread back in April:
> https://lore.kernel.org/all/CAHk-=wjRgHLvSnEY3P45hSQ0ycKxdz-xqnccAMPuGRrwsvWdig@mail.gmail.com/
>
> including this, for example:
>
> https://lore.kernel.org/all/4e59837d-567b-4678-8a15-e933def32fcb@paulmck-laptop/
>
> Basically, "if ain't true, RCU implementation has a bug that needs fixing"
>
Thanks - that helped a lot. As you note elsewhere, things have changed
over the years.
NeilBrown
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 5:53 ` [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1 Al Viro
2026-05-05 8:55 ` NeilBrown
@ 2026-05-05 16:47 ` Linus Torvalds
2026-05-05 22:42 ` Al Viro
1 sibling, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2026-05-05 16:47 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Mon, May 4, 2026 at 10:54 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> This one breaks rcu_read_lock() scopes in front of shrink_kill() call
> in shrink_dentry_list() and finish_dput() call in dput(); in both
> places we are within ->d_lock scope, so doing that won't change
> the RCU read-side critical areas.
Ugh.
Ok, so I see what this does and why it does it, but I still dislike
this one intensely. That
rcu_read_unlock();
rcu_read_lock();
pattern is just ugly, and I'd like to see actual numbers showing that it
makes a real latency difference.
> With that done, we have all calls of shrink_kill() and finish_dput()
> immediately preceded by rcu_read_lock(); next step will use that.
Yes, the next patch at least removes the ugly pattern. It still
obviously *does* the same "drop and immediately re-take", just without
making it as obvious.
Honestly, if this is then going to make some still later patch
cleaner, I'd prefer 12/13 to be just combined into one "change rcu
locking semantics to be better".
Because as-is, I think patch 12 on its own is just ugly.
It looks like patch 14 then uses the thing to clean up dentry_kill() a
bit, so that combined patch wouldn't trigger my "this is just ugly"
case, and would still have a real reason for it (outside of a
theoretical "shrink rcu region" without numbers to say why it's
needed).
Linus
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 16:47 ` Linus Torvalds
@ 2026-05-05 22:42 ` Al Viro
2026-05-07 7:35 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 22:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Tue, May 05, 2026 at 09:47:03AM -0700, Linus Torvalds wrote:
> On Mon, May 4, 2026 at 10:54 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > This one breaks rcu_read_lock() scopes in front of shrink_kill() call
> > in shrink_dentry_list() and finish_dput() call in dput(); in both
> > places we are withing ->d_lock scope, so doing that won't change
> > the RCU read-side critical areas.
>
> Ugh.
>
> Ok, so I see what this does and why it does it, but I still dislike
> this one intensely. That
>
> rcu_read_unlock();
> rcu_read_lock();
>
> pattern is just ugly, and I'd like to see any actual numbers that it
> makes a real latency difference.
It almost certainly doesn't; the only point is to split the changes into
equivalent transformations - the whole thing is tricky enough as it is.
> > With that done, we have all calls of shrink_kill() and finish_dput()
> > immediately preceded by rcu_read_lock(); next step will use that.
>
> Yes, the next patch at least removes the ugly pattern. It still
> obviously *does* the same "drop and immediately re-take", just without
> making it as obvious.
Yes, and the next two steps after that drive those down into lock_for_kill(),
where we finally get the sucker off the fast path - it's not (re)taken unless
trylock on inode->i_lock fails and we have to play games with dropping and
retaking ->d_lock.
> Honestly, if this is then going to make some still later patch
> cleaner, I'd prefer 12/13 to be just combined into one "change rcu
> locking semantics to be better".
>
> Because as-is, I think patch 12 on its own is just ugly.
>
> It looks like patch 14 then uses the thing to clean up dentry_kill() a
> bit, so that combined patch wouldn't trigger my "this is just ugly"
> case, adn would still have a real reason for it (outside of a
> theoretical "shrink rcu region" without numbers to say why it's
> needed).
The main payoff is in #15, really. The entire 12--17 started as a single
commit, which I fucked up on the first two attempts. Thus the carve-up
into chunks too small for idiot me to fumble...
FWIW, an issue with large rcu_read_lock() scopes (as opposed to RCU read-side
critical areas) is that they are harder to reason about; having them cover
smaller chunks makes it a lot more obvious what each one is for.
In fs/dcache.c after this series:
1. dentry_name_snapshot(). Protection of external name - from fetching
->d_name.name to atomic_inc_not_zero() on the external name refcount;
the generic shape of that idiom is sketched at the end of this message.
The paired RCU delay is somewhat subtle and needs to be documented.
In release_dentry_name_snapshot() and copy_name() it's the use
of kfree_rcu() when refcount hits zero; that's the obvious part.
In dentry_free() it's having __d_free_external() called via call_rcu(),
regardless of DCACHE_NORCU on dentry itself.
2. lock_for_kill(). Bridging over the spinlock switchover in case of
trylock failure, protecting dentry and inode from being freed.
Safety for !NORCU case: freeing of dentry is RCU-delayed due to !NORCU,
freeing of inode is RCU-delayed with 3 exceptions - to qualify as an
exception, the inode must have non-NULL ->destroy_inode(), NULL ->free_inode()
and a ->destroy_inode() that does prompt freeing (the last part excludes XFS):
* pipefs inodes (only on NORCU dentries, except for pipefs root,
which is never dropped)
* btrfs-test inodes. A bug, should have btrfs_test_destroy_inode()
reduced to the call of btrfs_drop_extent_map_range(), with ->free_inode
set to btrfs_free_inode, same as for btrfs proper. Not triggered due to
the very limited use of those suckers - it's never exposed to userland and
self-tests don't step into that fun.
* bpffs inodes. A bug. Another manifestation of the same bug had
been reported by folks doing fuzzing; I'd posted a completely untested fix
about a week ago, need to return to it (Alexey replied with "Feel free to
take it into your tree directly", I didn't get around to testing it yet).
Safety for NORCU case: we only get there if dentry refcount had
been dropped to zero by ourselves; no other thread will be calling
dentry_kill() on that dentry *or* making it negative. IOW, neither
dentry nor inode are going to be even scheduled for freeing until after
we return.
3. fast_dput(). Protected area from attempted decrement of refcount
to either having grabbed ->d_lock or to deciding that we are done
and returning true, protecting dentry from being freed. Safety for
!NORCU: dentry freeing is RCU-delayed; if it's already scheduled by
the time we have grabbed ->d_lock, we'll observe a negative number
in ->d_lockref.count, drop ->d_lock and that'll be our last access to
dentry. Safety for NORCU: if decrement succeeds and yields non-zero,
we are not touching dentry anymore. If it succeeds and yields zero, no
other thread will be able to increment the refcount, so no other thread
will be able to drive it back to zero. As a result, no other thread
will even attempt to call dentry_kill(), let alone free the sucker.
The comment in front is stale - will update.
4. dget_parent(). Protects candidate parent from freeing. Protected area
from fetching ->d_parent to lockref_get_not_zero() on its refcount.
Safety: NORCU dentries are never anyone's parents (and argument of
dget_parent() is held by the caller). For !NORCU dentries freeing
is RCU-delayed.
5. dget_parent(). Protects candidate parent from freeing. Protected area
from fetching ->d_parent to grabbing ->d_lock on it. Safety: same as the
previous one. FWIW, I wonder if this
rcu_read_lock();
ret = dentry->d_parent;
spin_lock(&ret->d_lock);
if (unlikely(ret != dentry->d_parent)) {
spin_unlock(&ret->d_lock);
rcu_read_unlock();
goto repeat;
}
rcu_read_unlock();
would be better off as
rcu_read_lock();
ret = dentry->d_parent;
spin_lock(&ret->d_lock);
rcu_read_unlock();
if (unlikely(ret != dentry->d_parent)) {
spin_unlock(&ret->d_lock);
goto repeat;
}
Or, for that matter, as
scoped_guard(rcu) {
// grab ->d_parent->d_lock safely
// NORCU dentries can't be anyone's parents
ret = dentry->d_parent;
spin_lock(&ret->d_lock);
}
// dentry->d_parent is not stable; however, the result of comparison is
// - the set of children is stabilized by ->d_lock
if (unlikely(ret != dentry->d_parent)) {
spin_unlock(&ret->d_lock);
goto repeat;
}
6. d_walk(). Bridging over from ->d_lock on dentry to ->d_lock on
this_parent (aka dentry->d_parent, fetched under dentry->d_lock),
protecting this_parent from being freed. NORCU dentries are never
parents, so freeing this_parent is RCU-delayed.
7. select_collect2(). Found dentry that isn't busy and can't be placed
into our shrink list; need to return it to caller's caller, protecting
it from being freed. We are in ->d_lock scope, but that'll end in the
caller, so we need to extend the RCU read-side critical area all the
way into shrink_dcache_tree() (caller's caller). The end of scope is
in shrink_dcache_tree(), after it has grabbed ->d_lock. NORCU dentries
are never passed to d_walk() callbacks, so freeing is RCU-delayed.
8. shrink_dcache_for_umount(). Bridging over from ->s_roots_lock to
->d_lock, protecting the secondary root we'd found from being freed; NORCU
dentries are never anyone's secondary roots, so freeing is RCU-delayed.
9. __d_lookup(). Protects hlist_bl_for_each_entry_rcu traversal of the hash
chain. All ->d_hash modifications are done by RCU-aware primitives (and
serialized on chain's bitlock), NORCU dentries are never hashed, so freeing
of anything we walk into is RCU-delayed.
10. d_alloc_parallel(). Scope from __d_lookup_rcu() to lockref_get_not_dead().
Protects the hash chain traversal in __d_lookup_rcu() from being disrupted
under us and its result from being freed under us; NORCU dentries are never
returned by __d_lookup_rcu().
11. d_alloc_parallel(). Bridges over from the bitlock of in-lookup
hash chain to ->d_lock, protects the found match from being freed.
->d_lock nests outside of the chain's bitlock, so we need to transfer.
Nothing on the in-lookup hash chains is NORCU, so freeing is RCU-delayed.
12. is_subdir(). Somewhat excessive - we only need it around the lockless
call of d_ancestor(), to protect the ->d_parent chain involved from freeing.
Additionally we have two functions that rely upon rcu_read_lock() held
by the caller: __d_lookup_rcu(), __d_lookup_rcu_op_compare(). Both
use it to protect the hash chain traversal (with the same iterator as
__d_lookup()); again, hash chain modifications are all by RCU-aware primitives
and no NORCU dentries are ever hashed, so nothing we walk through can be freed
under us. The latter provides RCU read-side critical area for ->d_compare(),
protecting dentry argument, the external name (if any) and whatever objects
the method itself might need (if any).
d_same_name() and dentry_cmp() are interesting - they rely either upon
being called in an RCU read-side critical area, protecting dentry and external
name (if any) from being freed, along with whatever objects are required by
->d_compare() (in case of d_same_name()), OR upon being called with dentry
having name and parent stable and an active reference to the superblock held
by the caller.
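P.S. The "grab a reference under RCU" idiom in (1), (4) and (10) has the
same generic shape everywhere (with lockref taking the place of the atomic
in (4) and (10)); a minimal sketch, using a made-up structure rather than
anything from fs/dcache.c:
	struct obj {
		atomic_t count;
		struct rcu_head rcu;	/* freeing goes through kfree_rcu() */
	};
	static struct obj *get_obj(struct obj __rcu **slot)
	{
		struct obj *p;
		rcu_read_lock();
		p = rcu_dereference(*slot);
		/* p, if non-NULL, can't be freed before rcu_read_unlock() */
		if (p && !atomic_inc_not_zero(&p->count))
			p = NULL;	/* refcount had already hit zero */
		rcu_read_unlock();
		return p;		/* NULL or a counted reference */
	}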
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-05 22:42 ` Al Viro
@ 2026-05-07 7:35 ` Al Viro
2026-05-07 15:32 ` Linus Torvalds
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-07 7:35 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Tue, May 05, 2026 at 11:42:56PM +0100, Al Viro wrote:
> > Honestly, if this is then going to make some still later patch
> > cleaner, I'd prefer 12/13 to be just combined into one "change rcu
> > locking semantics to be better".
> >
> > Because as-is, I think patch 12 on its own is just ugly.
> >
> > It looks like patch 14 then uses the thing to clean up dentry_kill() a
> > bit, so that combined patch wouldn't trigger my "this is just ugly"
> > case, adn would still have a real reason for it (outside of a
> > theoretical "shrink rcu region" without numbers to say why it's
> > needed).
>
> The main payoff is in #15, really. The entire 12--17 started as a single
> commit, which I fucked up on the first two attempts. Thus the carve-up
> into chunks too small for idiot me to fumble...
Does the following look sane to you? That's pretty much #12--#15 combined,
with a hopefully saner commit message.
commit f6c6d2194a4911619a4ab8011953fbf2ec202778
Author: Al Viro <viro@zeniv.linux.org.uk>
Date: Sat Apr 11 03:24:28 2026 -0400
simplify safety for lock_for_kill() slowpath
rcu_read_lock() scopes in dentry eviction machinery are too wide
and badly structured; we end up with too many of those, quite
a few essentially identical. Worse, quite a few of the functions
involved are not neutral wrt that, making them harder to reason about.
rcu_read_lock() scope is not the only thing establishing an
RCU read-side critical area - spin_lock scope does the same and
they can be mixed - the sequence
rcu_read_lock()
...
spin_lock()
...
rcu_read_unlock()
...
rcu_read_lock()
...
spin_unlock()
...
rcu_read_unlock()
is an unbroken RCU read-side critical area.
Use of that observation allows us to simplify things. First of all,
lock_for_kill() relies upon being in an unbroken RCU read-side
critical area. It's always called with ->d_lock held, and normally
returns without having ever dropped that spinlock. We would not
need rcu_read_lock() at all, if not for the slow path - if trylock
of inode->i_lock fails, we need to drop and retake ->d_lock.
Having all calls of lock_for_kill() inside an rcu_read_lock() scope
takes care of that, but to show that the lock_for_kill() slow path is safe,
we need to demonstrate such an rcu_read_lock() scope for every call chain
leading to lock_for_kill(). Which is not fun, seeing that there are
10 such scopes, with 5 distinct beginnings between them.
Case 1: opens in dput(), proceeds through fast_dput() grabbing ->d_lock,
returning false into dput() and from there to a call of finish_dput(), which calls
dentry_kill(), which calls lock_for_kill(); ends in dentry_kill(), either
right after lock_for_kill() success or right after dropping ->d_lock
on lock_for_kill() failure. ->d_lock is held continuously all the way
into lock_for_kill().
Case 2: opens in finish_dput(), where we proceed to the same call of
dentry_kill() as in case 1. ->d_lock is held since before the
beginning of the scope and all the way into lock_for_kill().
Case 3: opens in select_collect2(), proceeds through the return to
d_walk() and to shrink_dcache_tree() where we grab ->d_lock and
proceed to call shrink_kill(), which calls dentry_kill(), then as
in the previous scopes.
Case 4: opens in shrink_dentry_list(), followed by call of shrink_kill(),
then same as in case 3. ->d_lock is held since before the beginning
of the scope and all the way into lock_for_kill().
Case 5: opens in shrink_kill(), where it's immediately followed by
call of dentry_kill(), then same as in the previous scopes. ->d_lock
is held since before the beginning of the scope all the way into
lock_for_kill().
Note that in cases 2, 4 and 5 the slow path of lock_for_kill() is the
only part of rcu_read_lock() scope that is not covered by spinlock
scopes. In case 1 we have the area in fast_dput() as well, and in
case 3 - the return path from select_collect2() and the chunk in shrink_dcache_tree()
up to grabbing ->d_lock.
Seeing that the reasons we need rcu_read_lock() in these additional
areas are completely unrelated to the lock_for_kill() slow path, things
get much more straightforward with
* explicit rcu_read_lock() scope surrounding the area in slow path
of lock_for_kill() where ->d_lock is not held
* shrink_dentry_list() dropping rcu_read_lock() as soon as it has
grabbed ->d_lock.
* dput() dropping rcu_read_lock() just before calling finish_dput().
* rcu_read_lock() calls in finish_dput(), shrink_kill() and
shrink_dentry_list() are removed, along with rcu_read_unlock() calls in
dentry_kill().
RCU read-side critical areas are unchanged by that, safety of the lock_for_kill()
slow path is trivial to verify, and a bunch of rcu_read_lock() scopes are
either gone or easier to describe.
Incidentally, all calls of fast_dput() are immediately preceded by rcu_read_lock()
and followed by rcu_read_unlock() now, which will allow simplifying those in
the next step...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
diff --git a/fs/dcache.c b/fs/dcache.c
index 0943a17eccbc..4a657bcd5b4f 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -763,6 +763,7 @@ static bool lock_for_kill(struct dentry *dentry)
if (!inode || likely(spin_trylock(&inode->i_lock)))
return true;
+ rcu_read_lock();
do {
spin_unlock(&dentry->d_lock);
spin_lock(&inode->i_lock);
@@ -772,6 +773,7 @@ static bool lock_for_kill(struct dentry *dentry)
spin_unlock(&inode->i_lock);
inode = dentry->d_inode;
} while (inode);
+ rcu_read_unlock();
if (likely(!dentry->d_lockref.count))
return true;
if (inode)
@@ -783,10 +785,8 @@ static struct dentry *dentry_kill(struct dentry *dentry)
{
if (unlikely(!lock_for_kill(dentry))) {
spin_unlock(&dentry->d_lock);
- rcu_read_unlock();
return NULL;
}
- rcu_read_unlock();
return __dentry_kill(dentry);
}
@@ -931,14 +931,12 @@ static inline bool fast_dput(struct dentry *dentry)
static void finish_dput(struct dentry *dentry)
__releases(dentry->d_lock)
- __releases(RCU)
{
while ((dentry = dentry_kill(dentry)) != NULL) {
if (retain_dentry(dentry, true)) {
spin_unlock(&dentry->d_lock);
return;
}
- rcu_read_lock();
}
}
@@ -978,6 +976,7 @@ void dput(struct dentry *dentry)
rcu_read_unlock();
return;
}
+ rcu_read_unlock();
finish_dput(dentry);
}
EXPORT_SYMBOL(dput);
@@ -988,7 +987,6 @@ void d_make_discardable(struct dentry *dentry)
WARN_ON(!(dentry->d_flags & DCACHE_PERSISTENT));
dentry->d_flags &= ~DCACHE_PERSISTENT;
dentry->d_lockref.count--;
- rcu_read_lock();
finish_dput(dentry);
}
EXPORT_SYMBOL(d_make_discardable);
@@ -1217,7 +1215,7 @@ EXPORT_SYMBOL(d_prune_aliases);
static inline void shrink_kill(struct dentry *victim)
{
while ((victim = dentry_kill(victim)) != NULL)
- rcu_read_lock();
+ ;
}
void shrink_dentry_list(struct list_head *list)
@@ -1233,7 +1231,6 @@ void shrink_dentry_list(struct list_head *list)
spin_unlock(&dentry->d_lock);
continue;
}
- rcu_read_lock();
shrink_kill(dentry);
}
}
@@ -1669,6 +1666,7 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
struct dentry *v = data.victim;
spin_lock(&v->d_lock);
+ rcu_read_unlock();
if (v->d_lockref.count < 0 &&
!(v->d_flags & DCACHE_DENTRY_KILLED)) {
struct completion_list wait;
@@ -1676,7 +1674,6 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
// it becomes invisible to d_walk().
d_add_waiter(v, &wait);
spin_unlock(&v->d_lock);
- rcu_read_unlock();
if (!list_empty(&data.dispose))
shrink_dentry_list(&data.dispose);
wait_for_completion(&wait.completion);
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1
2026-05-07 7:35 ` Al Viro
@ 2026-05-07 15:32 ` Linus Torvalds
0 siblings, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2026-05-07 15:32 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Thu, 7 May 2026 at 00:34, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Does the following look sane to you? That's pretty much #12--#15 combined,
> with hopefully saner commit message.
This looks fine to me. I think some of the commit message might be
worth having as actual comments in the code - like that rcu_read_lock() in
lock_for_kill().
Just a "we are dropping spinlock, take rcu lock to have unbroken rcu
read side protection"
Because that one is probably one of the subtler things in there now -
you pointed it out in the commit message, but for somebody just
reading the code...
Linus
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH 13/25] reducing rcu_read_lock() scopes in dput and friends, step 2
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (11 preceding siblings ...)
2026-05-05 5:53 ` [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 14/25] reducing rcu_read_lock() scopes in dput and friends, step 3 Al Viro
` (12 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
All calls of shrink_kill() and finish_dput() are immediately preceded
by rcu_read_lock(); move it inside those functions.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index d690c18f0c33..cb1e27d8a900 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -931,8 +931,8 @@ static inline bool fast_dput(struct dentry *dentry)
static void finish_dput(struct dentry *dentry)
__releases(dentry->d_lock)
- __releases(RCU)
{
+ rcu_read_lock();
while ((dentry = dentry_kill(dentry)) != NULL) {
if (retain_dentry(dentry, true)) {
spin_unlock(&dentry->d_lock);
@@ -979,7 +979,6 @@ void dput(struct dentry *dentry)
return;
}
rcu_read_unlock();
- rcu_read_lock();
finish_dput(dentry);
}
EXPORT_SYMBOL(dput);
@@ -990,7 +989,6 @@ void d_make_discardable(struct dentry *dentry)
WARN_ON(!(dentry->d_flags & DCACHE_PERSISTENT));
dentry->d_flags &= ~DCACHE_PERSISTENT;
dentry->d_lockref.count--;
- rcu_read_lock();
finish_dput(dentry);
}
EXPORT_SYMBOL(d_make_discardable);
@@ -1218,6 +1216,7 @@ EXPORT_SYMBOL(d_prune_aliases);
static inline void shrink_kill(struct dentry *victim)
{
+ rcu_read_lock();
while ((victim = dentry_kill(victim)) != NULL)
rcu_read_lock();
}
@@ -1235,7 +1234,6 @@ void shrink_dentry_list(struct list_head *list)
spin_unlock(&dentry->d_lock);
continue;
}
- rcu_read_lock();
shrink_kill(dentry);
}
}
@@ -1685,7 +1683,6 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
continue;
}
rcu_read_unlock();
- rcu_read_lock();
shrink_kill(v);
}
if (!list_empty(&data.dispose))
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 14/25] reducing rcu_read_lock() scopes in dput and friends, step 3
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (12 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 13/25] reducing rcu_read_lock() scopes in dput and friends, step 2 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 15/25] reducing rcu_read_lock() scopes in dput and friends, step 4 Al Viro
` (11 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
All calls of dentry_kill() are immediately preceded by
calls of rcu_read_lock(); move those into dentry_kill().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index cb1e27d8a900..bcaf17840939 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -781,6 +781,7 @@ static bool lock_for_kill(struct dentry *dentry)
static struct dentry *dentry_kill(struct dentry *dentry)
{
+ rcu_read_lock();
if (unlikely(!lock_for_kill(dentry))) {
spin_unlock(&dentry->d_lock);
rcu_read_unlock();
@@ -932,13 +933,11 @@ static inline bool fast_dput(struct dentry *dentry)
static void finish_dput(struct dentry *dentry)
__releases(dentry->d_lock)
{
- rcu_read_lock();
while ((dentry = dentry_kill(dentry)) != NULL) {
if (retain_dentry(dentry, true)) {
spin_unlock(&dentry->d_lock);
return;
}
- rcu_read_lock();
}
}
@@ -1216,9 +1215,8 @@ EXPORT_SYMBOL(d_prune_aliases);
static inline void shrink_kill(struct dentry *victim)
{
- rcu_read_lock();
while ((victim = dentry_kill(victim)) != NULL)
- rcu_read_lock();
+ ;
}
void shrink_dentry_list(struct list_head *list)
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 15/25] reducing rcu_read_lock() scopes in dput and friends, step 4
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (13 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 14/25] reducing rcu_read_lock() scopes in dput and friends, step 3 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 16/25] reducing rcu_read_lock() scopes in dput and friends, step 5 Al Viro
` (10 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Shrink rcu_read_lock() scope around lock_for_kill() into
lock_for_kill() itself. Note that spin_unlock() can be
transposed with rcu_read_unlock() - RCU read-side critical
area remains the same, and even if the structure containing
the spinlock gets freed by an RCU callback run immediately
after spin_unlock(), we are OK - the callback won't run until after
the last access to the spinlock done by spin_unlock().
Inside lock_for_kill() the scope can be shrunk down to the
area where we drop and regain ->d_lock - rcu_read_lock() scope
is needed only to bridge the ->d_lock scopes together.
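Concretely (a sketch - 'obj' stands for any structure freed via an RCU
callback, not anything specific in fs/dcache.c):
	rcu_read_lock();
	spin_lock(&obj->lock);
	...
	rcu_read_unlock();	/* explicit scope ends, but the spinlock */
				/* scope is a read side of its own...	 */
	spin_unlock(&obj->lock);/* ...so the callback freeing obj can't  */
				/* run before this last access is done	 */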
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index bcaf17840939..4222d89692a8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -763,6 +763,7 @@ static bool lock_for_kill(struct dentry *dentry)
if (!inode || likely(spin_trylock(&inode->i_lock)))
return true;
+ rcu_read_lock();
do {
spin_unlock(&dentry->d_lock);
spin_lock(&inode->i_lock);
@@ -772,6 +773,7 @@ static bool lock_for_kill(struct dentry *dentry)
spin_unlock(&inode->i_lock);
inode = dentry->d_inode;
} while (inode);
+ rcu_read_unlock();
if (likely(!dentry->d_lockref.count))
return true;
if (inode)
@@ -781,13 +783,10 @@ static bool lock_for_kill(struct dentry *dentry)
static struct dentry *dentry_kill(struct dentry *dentry)
{
- rcu_read_lock();
if (unlikely(!lock_for_kill(dentry))) {
spin_unlock(&dentry->d_lock);
- rcu_read_unlock();
return NULL;
}
- rcu_read_unlock();
return __dentry_kill(dentry);
}
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 16/25] reducing rcu_read_lock() scopes in dput and friends, step 5
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (14 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 15/25] reducing rcu_read_lock() scopes in dput and friends, step 4 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 17/25] reducing rcu_read_lock() scopes in dput and friends, step 6 Al Viro
` (9 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Shrink rcu_read_lock() scopes surrounding fast_dput() calls.
Both calls are immediately preceded by rcu_read_lock() and
followed by rcu_read_unlock(). Shrink that down
into fast_dput() itself; in the case where fast_dput() ends up
grabbing ->d_lock, we can pull rcu_read_unlock() up to
right after spin_lock().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 4222d89692a8..5872f369ddaf 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -876,6 +876,7 @@ static inline bool fast_dput(struct dentry *dentry)
/*
* try to decrement the lockref optimistically.
*/
+ rcu_read_lock();
ret = lockref_put_return(&dentry->d_lockref);
/*
@@ -885,6 +886,7 @@ static inline bool fast_dput(struct dentry *dentry)
*/
if (unlikely(ret < 0)) {
spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
spin_unlock(&dentry->d_lock);
return true;
@@ -896,8 +898,10 @@ static inline bool fast_dput(struct dentry *dentry)
/*
* If we weren't the last ref, we're done.
*/
- if (ret)
+ if (ret) {
+ rcu_read_unlock();
return true;
+ }
/*
* Can we decide that decrement of refcount is all we needed without
@@ -905,8 +909,10 @@ static inline bool fast_dput(struct dentry *dentry)
* dentry looks like it ought to be retained and there's nothing else
* to do.
*/
- if (retain_dentry(dentry, false))
+ if (retain_dentry(dentry, false)) {
+ rcu_read_unlock();
return true;
+ }
/*
* Either not worth retaining or we can't tell without the lock.
@@ -914,6 +920,7 @@ static inline bool fast_dput(struct dentry *dentry)
* but we'll need to re-check the situation after getting the lock.
*/
spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
/*
* Did somebody else grab a reference to it in the meantime, and
@@ -971,12 +978,8 @@ void dput(struct dentry *dentry)
if (!dentry)
return;
might_sleep();
- rcu_read_lock();
- if (likely(fast_dput(dentry))) {
- rcu_read_unlock();
+ if (likely(fast_dput(dentry)))
return;
- }
- rcu_read_unlock();
finish_dput(dentry);
}
EXPORT_SYMBOL(dput);
@@ -1024,12 +1027,8 @@ EXPORT_SYMBOL(__move_to_shrink_list);
void dput_to_list(struct dentry *dentry, struct list_head *list)
{
- rcu_read_lock();
- if (likely(fast_dput(dentry))) {
- rcu_read_unlock();
+ if (likely(fast_dput(dentry)))
return;
- }
- rcu_read_unlock();
__move_to_shrink_list(dentry, list);
spin_unlock(&dentry->d_lock);
}
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 17/25] reducing rcu_read_lock() scopes in dput and friends, step 6
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (15 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 16/25] reducing rcu_read_lock() scopes in dput and friends, step 5 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 18/25] adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Al Viro
` (8 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
In the "got a victim" case in shrink_dcache_tree() we can
pull the rcu_read_unlock() call up to right after grabbing
->d_lock.
Document the resulting rcu_read_lock() scope - it spans from
unbalanced rcu_read_lock() in select_collect2() to unbalanced
rcu_read_unlock() in shrink_dcache_tree(), bridging two
->d_lock scopes into a single RCU read-side critical area.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 25 ++++++++++++++++++++++---
1 file changed, 22 insertions(+), 3 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 5872f369ddaf..6822e8bfc6af 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1619,6 +1619,15 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
if (dentry->d_lockref.count <= 0) {
if (!__move_to_shrink_list(dentry, &data->dispose)) {
+ /*
+ * We need to enter an RCU read-side critical area that
+ * would extend past the return from d_walk(), and
+ * we are in a ->d_lock scope that will terminate
+ * before that, so we use rcu_read_lock() to bridge
+ * over to the ->d_lock scope in d_walk()'s caller.
+ * The scope of rcu_read_lock() spans from here to
+ * paired rcu_read_unlock() in shrink_dcache_tree().
+ */
rcu_read_lock();
data->victim = dentry;
return D_WALK_QUIT;
@@ -1663,8 +1672,20 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
d_walk(parent, &data, select_collect2);
if (data.victim) {
struct dentry *v = data.victim;
-
+ /*
+ * select_collect2() has picked a dentry that was
+ * either dying or on a shrink list and arranged
+ * for it to be returned to us. We are still in
+ * the RCU read-side critical area started there
+ * (rcu_read_lock() scope opened in select_collect2()),
+ * so dentry couldn't have been freed yet, but its
+ * state might've changed since we dropped ->d_lock
+ * on the way out. Switch over to ->d_lock scope
+ * and recheck the dentry state.
+ */
spin_lock(&v->d_lock);
+ rcu_read_unlock();
+
if (v->d_lockref.count < 0 &&
!(v->d_flags & DCACHE_DENTRY_KILLED)) {
struct completion_list wait;
@@ -1672,13 +1693,11 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
// it becomes invisible to d_walk().
d_add_waiter(v, &wait);
spin_unlock(&v->d_lock);
- rcu_read_unlock();
if (!list_empty(&data.dispose))
shrink_dentry_list(&data.dispose);
wait_for_completion(&wait.completion);
continue;
}
- rcu_read_unlock();
shrink_kill(v);
}
if (!list_empty(&data.dispose))
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 18/25] adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (16 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 17/25] reducing rcu_read_lock() scopes in dput and friends, step 6 Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 19/25] document dentry_kill() Al Viro
` (7 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Pull dropping ->d_lock on lock_for_kill() failure into lock_for_kill() itself.
That reduces dentry_kill() to
if (!lock_for_kill(dentry))
return NULL;
return __dentry_kill(dentry);
at which point it's easier to move that if (...) into the beginning of __dentry_kill()
itself and rename it into dentry_kill().
Document the new calling conventions of lock_for_kill().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 99 +++++++++++++++++++++++++++--------------------------
1 file changed, 50 insertions(+), 49 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 6822e8bfc6af..125f280fe6ee 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -691,11 +691,60 @@ static inline void dentry_unlist(struct dentry *dentry)
}
}
-static struct dentry *__dentry_kill(struct dentry *dentry)
+/*
+ * Prepare locking environment for killing a dentry.
+ * Called under dentry->d_lock. To proceed with eviction of a positive
+ * dentry we need to get ->i_lock of the dentry's inode as well, and that
+ * needs to be done carefully - ->i_lock nests outside of ->d_lock,
+ * so we might need to drop and regain the latter. We use rcu_read_lock()
+ * to keep the RCU read-side critical area contiguous, so dentry and
+ * inode won't get freed under us, but dentry state might've changed
+ * while its ->d_lock was not held - it might end up getting
+ * killed, becoming busy, negative, etc.
+ *
+ * If dentry is busy (or busy dying, or already dead), unlock dentry
+ * and return false. Otherwise, return true and have that dentry's
+ * inode (if any) locked in addition to dentry itself.
+ */
+static bool lock_for_kill(struct dentry *dentry)
+{
+ struct inode *inode = dentry->d_inode;
+
+ if (unlikely(dentry->d_lockref.count)) {
+ spin_unlock(&dentry->d_lock);
+ return false;
+ }
+
+ if (!inode || likely(spin_trylock(&inode->i_lock)))
+ return true;
+
+ rcu_read_lock();
+ do {
+ spin_unlock(&dentry->d_lock);
+ spin_lock(&inode->i_lock);
+ spin_lock(&dentry->d_lock);
+ if (likely(inode == dentry->d_inode))
+ break;
+ spin_unlock(&inode->i_lock);
+ inode = dentry->d_inode;
+ } while (inode);
+ rcu_read_unlock();
+ if (likely(!dentry->d_lockref.count))
+ return true;
+ if (inode)
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&dentry->d_lock);
+ return false;
+}
+
+static struct dentry *dentry_kill(struct dentry *dentry)
{
struct dentry *parent = NULL;
bool can_free = true;
+ if (unlikely(!lock_for_kill(dentry)))
+ return NULL;
+
/*
* The dentry is now unrecoverably dead to the world.
*/
@@ -742,54 +791,6 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
return parent;
}
-/*
- * Lock a dentry for feeding it to __dentry_kill().
- * Called under rcu_read_lock() and dentry->d_lock; the former
- * guarantees that nothing we access will be freed under us.
- * Note that dentry is *not* protected from concurrent dentry_kill(),
- * d_delete(), etc.
- *
- * Return false if dentry is busy. Otherwise, return true and have
- * that dentry's inode locked.
- */
-
-static bool lock_for_kill(struct dentry *dentry)
-{
- struct inode *inode = dentry->d_inode;
-
- if (unlikely(dentry->d_lockref.count))
- return false;
-
- if (!inode || likely(spin_trylock(&inode->i_lock)))
- return true;
-
- rcu_read_lock();
- do {
- spin_unlock(&dentry->d_lock);
- spin_lock(&inode->i_lock);
- spin_lock(&dentry->d_lock);
- if (likely(inode == dentry->d_inode))
- break;
- spin_unlock(&inode->i_lock);
- inode = dentry->d_inode;
- } while (inode);
- rcu_read_unlock();
- if (likely(!dentry->d_lockref.count))
- return true;
- if (inode)
- spin_unlock(&inode->i_lock);
- return false;
-}
-
-static struct dentry *dentry_kill(struct dentry *dentry)
-{
- if (unlikely(!lock_for_kill(dentry))) {
- spin_unlock(&dentry->d_lock);
- return NULL;
- }
- return __dentry_kill(dentry);
-}
-
/*
* Decide if dentry is worth retaining. Usually this is called with dentry
* locked; if not locked, we are more limited and might not be able to tell
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 19/25] document dentry_kill()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (17 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 18/25] adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope Al Viro
` (6 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 125f280fe6ee..f283818095c0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -737,6 +737,44 @@ static bool lock_for_kill(struct dentry *dentry)
return false;
}
+/**
+ * dentry_kill - evict a dentry
+ * @dentry: dentry to be evicted
+ *
+ * All dentry evictions are done by this function. The reference we are
+ * passed does not contribute to the refcount; the caller had either
+ * already decremented the refcount or it had never held one in the
+ * first place. @dentry->d_lock is held by the caller and dropped
+ * by dentry_kill(@dentry).
+ *
+ * We are guaranteed that nobody had called dentry_free(@dentry)
+ * prior to the beginning of the RCU read-side critical area we are in.
+ *
+ * Caller must not access @dentry after the call.
+ *
+ * If eviction of @dentry drops the last reference to its parent,
+ * the reference to parent is returned to caller. In that case
+ * it is guaranteed to satisfy the requirements for dentry_kill()
+ * argument - its ->d_lock is held and we are guaranteed that nobody
+ * had passed it to dentry_free() prior to acquisition of its ->d_lock.
+ * Otherwise %NULL is returned.
+ *
+ * If @dentry is idle and remains such after we assemble the full
+ * locking environment for eviction (see lock_for_kill() for details)
+ * we mark it doomed (->d_lockref.count < 0) and proceed to detaching
+ * it from any filesystem objects. Otherwise we drop ->d_lock and
+ * return %NULL.
+ *
+ * Once @dentry is detached from the filesystem objects, we complete
+ * detaching it from dentry tree. The parent, if any, gets locked
+ * and its refcount is decremented; dentry is carefully removed from
+ * the tree (see dentry_unlist() for details) and marked killed
+ * (%DCACHE_DENTRY_KILLED set in ->d_flags). At that point it's just
+ * an inert chunk of memory, accessible only via RCU references
+ * and possibly via a shrink list. If it is not on any shrink lists,
+ * we call dentry_free(), which schedules actual freeing of memory.
+ * Otherwise freeing is left to the owner of the shrink list in question.
+ */
static struct dentry *dentry_kill(struct dentry *dentry)
{
struct dentry *parent = NULL;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (18 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 19/25] document dentry_kill() Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 17:01 ` Linus Torvalds
2026-05-05 5:54 ` [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel() Al Viro
` (5 subsequent siblings)
25 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
We only need it to bridge over from the child's ->d_lock scope to the
parent's; dropping ->d_lock at rename_retry doesn't need to be
in rcu_read_lock() scope.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f283818095c0..2a342d618303 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1477,14 +1477,15 @@ static void d_walk(struct dentry *parent, void *data,
/*
* All done at this level ... ascend and resume the search.
*/
- rcu_read_lock();
ascend:
if (this_parent != parent) {
dentry = this_parent;
this_parent = dentry->d_parent;
+ rcu_read_lock();
spin_unlock(&dentry->d_lock);
spin_lock(&this_parent->d_lock);
+ rcu_read_unlock();
/* might go back up the wrong parent if we have had a rename. */
if (need_seqretry(&rename_lock, seq))
@@ -1492,7 +1493,6 @@ static void d_walk(struct dentry *parent, void *data,
/* go into the first sibling still alive */
hlist_for_each_entry_continue(dentry, d_sib) {
if (likely(!(dentry->d_flags & DCACHE_DENTRY_KILLED))) {
- rcu_read_unlock();
goto resume;
}
}
@@ -1500,7 +1500,6 @@ static void d_walk(struct dentry *parent, void *data,
}
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
- rcu_read_unlock();
out_unlock:
spin_unlock(&this_parent->d_lock);
@@ -1509,7 +1508,6 @@ static void d_walk(struct dentry *parent, void *data,
rename_retry:
spin_unlock(&this_parent->d_lock);
- rcu_read_unlock();
BUG_ON(seq & 1);
if (!retry)
return;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 5:54 ` [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope Al Viro
@ 2026-05-05 17:01 ` Linus Torvalds
2026-05-05 20:05 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Linus Torvalds @ 2026-05-05 17:01 UTC (permalink / raw)
To: Al Viro, Paul E. McKenney, Frederic Weisbecker
Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
[ Adding Pail and Frederic explicitly not because they care about this
code, but because it's - again - about spinlock vs rcu_read_lock() ]
On Mon, 4 May 2026 at 22:54, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> @@ -1477,14 +1477,15 @@ static void d_walk(struct dentry *parent, void *data,
> /*
> * All done at this level ... ascend and resume the search.
> */
> - rcu_read_lock();
> ascend:
> if (this_parent != parent) {
> dentry = this_parent;
> this_parent = dentry->d_parent;
>
> + rcu_read_lock();
> spin_unlock(&dentry->d_lock);
> spin_lock(&this_parent->d_lock);
> + rcu_read_unlock();
Ugh. This also triggers my "this is ugly". It makes no sense to me.
It's documented as "shrink rcu_read_lock() scope", but again, it's not
clear why that matters, and more importantly it's not really *true*.
The code clearly still depends on RCU reader locking - it's just that
it mixes explicit rcu_read_lock() sections with "we hold a spinlock,
so we're rcu read-locked", and then it goes "oh, now we drop the
spinlock so we need to explicitly take the RCU read lock".
Which I don't think is an improvement. I think it was actually clearer
to just do rcu_read_lock() over the whole thing, and then essentially
ignoring that spinlocks *also* act as a RCU lock.
Now, with all that said...
If we had a function that was "transfer spinlock without dropping RCU
read lock", and we'd use that as an _optimization_ (maybe we could
avoid all the preemption count games and testing for whether there are
active RCU events etc), that would make sense to me.
At that point it would *all* become "we rely on the spinlock to hold
RCU read locking", and it wouldn't mix locking models. So I think such
an operation might make sense on its own.
And no, I doubt the "optimization" part of it matters hugely, but I
think it might actually clarify what is going on just due to the added
abstraction. IOW, even *without* actively optimizing the code
sequence, maybe a
static void transfer_spinlock_with_rcu(spinlock_t *a, spinlock_t *b)
{
rcu_read_lock();
spin_unlock(a);
spin_lock(b);
rcu_read_unlock();
}
would I think be a better thing than having this ugly sequence written
out. It would make it very explicit what we're doing (and it would
_allow_ for code generation improvements if we care about it).
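For instance, the ascend path in d_walk() quoted above would then read
(a sketch assuming that helper, not an actual patch):
        this_parent = dentry->d_parent;
        transfer_spinlock_with_rcu(&dentry->d_lock, &this_parent->d_lock);
with no bare rcu_read_lock()/rcu_read_unlock() pair left in the caller.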
I dunno. Maybe I'm just crazy, and irrational in my reaction to that
pattern. Which is partly why I added Paul and Frederic to the cc to
see what others think.
Linus
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 17:01 ` Linus Torvalds
@ 2026-05-05 20:05 ` Al Viro
2026-05-05 21:40 ` Frederic Weisbecker
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 20:05 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, linux-fsdevel,
Christian Brauner, Jan Kara, NeilBrown
On Tue, May 05, 2026 at 10:01:39AM -0700, Linus Torvalds wrote:
> If we had a function that was "transfer spinlock without dropping RCU
> read lock", and we'd use that as an _optimization_ (maybe we could
> avoid all the preemption count games and testing for whether there are
> active RCU events etc), that would make sense to me.
>
> At that point it would *all* become "we rely on the spinlock to hold
> RCU read locking", and it wouldn't mix locking models. So I think such
> an operation might make sense on its own.
Definitely; we have the same "bridge two spin_lock scopes together into
a contiguous RCU read-side critical area" elsewhere as well. Hell,
rcu_read_lock();
do {
spin_unlock(&dentry->d_lock);
spin_lock(&inode->i_lock);
spin_lock(&dentry->d_lock);
if (likely(inode == dentry->d_inode))
break;
spin_unlock(&inode->i_lock);
inode = dentry->d_inode;
} while (inode);
rcu_read_unlock();
in lock_for_kill() might be better off as
do {
transfer_spinlock_with_rcu(&dentry->d_lock, &inode->i_lock);
spin_lock(&dentry->d_lock);
if (likely(inode == dentry->d_inode))
break;
spin_unlock(&inode->i_lock);
inode = dentry->d_inode;
} while (inode);
Explicit rcu_read_lock() uses don't explain what's really going on; in this
form it's "we can't take ->i_lock under ->d_lock, but we need an unbroken
RCU read-side critical area for memory safety". The same goes for d_walk()
use ("can't take parent's ->d_lock under child's one"), as well as in
shrink_dcache_for_umount() ("can't take ->d_lock under ->s_roots_lock")
and I wouldn't be surprised to see other instances of the same pattern.
I would really prefer to have all explicit rcu_read_lock() in fs/dcache.c
clearly documented ;-/
FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
area, according to https://lwn.net/Articles/652156/; would be nice to have
the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
"once upon a time one *did* need an explicit rcu_read_lock() in addition
to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
that has changed in <...>".
I'm not familiar enough with that code for that kind of archaeology ;-/
Could somebody help with that? Paul?
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 20:05 ` Al Viro
@ 2026-05-05 21:40 ` Frederic Weisbecker
2026-05-05 22:50 ` Al Viro
2026-05-08 3:01 ` Al Viro
0 siblings, 2 replies; 53+ messages in thread
From: Frederic Weisbecker @ 2026-05-05 21:40 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Paul E. McKenney, linux-fsdevel,
Christian Brauner, Jan Kara, NeilBrown, Boqun Feng,
Joel Fernandes, Uladzislau Rezki
(Adding other RCU folks in Cc)
On Tue, May 05, 2026 at 09:05:01PM +0100, Al Viro wrote:
> On Tue, May 05, 2026 at 10:01:39AM -0700, Linus Torvalds wrote:
>
> > If we had a function that was "transfer spinlock without dropping RCU
> > read lock", and we'd use that as an _optimization_ (maybe we could
> > avoid all the preemption count games and testing for whether there are
> > active RCU events etc), that would make sense to me.
> >
> > At that point it would *all* become "we rely on the spinlock to hold
> > RCU read locking", and it wouldn't mix locking models. So I think such
> > an operation might make sense on its own.
>
> Definitely; we have the same "bridge two spin_lock scopes together into
> a contiguous RCU read-side critical area" elsewhere as well. Hell,
>
> rcu_read_lock();
> do {
> spin_unlock(&dentry->d_lock);
> spin_lock(&inode->i_lock);
> spin_lock(&dentry->d_lock);
> if (likely(inode == dentry->d_inode))
> break;
> spin_unlock(&inode->i_lock);
> inode = dentry->d_inode;
> } while (inode);
> rcu_read_unlock();
>
> in lock_for_kill() might be better off as
>
> do {
> transfer_spinlock_with_rcu(&dentry->d_lock, &inode->i_lock);
> spin_lock(&dentry->d_lock);
> if (likely(inode == dentry->d_inode))
> break;
> spin_unlock(&inode->i_lock);
> inode = dentry->d_inode;
> } while (inode);
>
> Explicit rcu_read_lock() uses don't explain what's really going on; in this
> form it's "we can't take ->i_lock under ->d_lock, but we need an unbroken
> RCU read-side critical area for memory safety". The same goes for d_walk()
> use ("can't take parent's ->d_lock under child's one"), as well as in
> shrink_dcache_for_umount() ("can't take ->d_lock under ->s_roots_lock")
> and I wouldn't be surprised to see other instances of the same pattern.
>
> I would really prefer to have all explicit rcu_read_lock() in fs/dcache.c
> clearly documented ;-/
And the calls to rcu_read_lock() / rcu_read_unlock() are _usually_ cheap enough
that making them explicit should go unnoticed next to an implicit spinlock RCU
critical section. OK, rcu_read_unlock() has its slowpath, but that would otherwise
trigger randomly when implicit read sides are used (context switch, interrupts,
etc...)
Also rcu_read_lock() has the advantage of telling clearly that we are doing an RCU read
side (and its uses can indeed be documented), while relying on spinlocks to do
that is fine but implicit, and more error-prone when there are tricky things
like spinlock transfers: better not to ask for trouble.
So I would say that we should probably avoid introducing such an API if possible.
> FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
> area, according to https://lwn.net/Articles/652156/; would be nice to have
> the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
> "once upon a time one *did* need an explicit rcu_read_lock() in addition
> to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
> that has changed in <...>".
>
> I'm not familiar enough with that code for that kind of archaeology ;-/
> Could somebody help with that? Paul?
That's quite outdated. Now preempt disabled, IRQs disabled, and spinlocks
are all implicit RCU read sides. On non RT spinlocks disable preemption and
on RT they call rcu_read_lock() internally.
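To make the "implicit" part concrete, a rough sketch (obj is made up, not
anything in fs/dcache.c):
	spin_lock(&obj->lock);	/* non-RT: preemption disabled => RCU read side;
				 * RT: spin_lock() takes rcu_read_lock() itself */
	p = rcu_dereference_check(obj->ptr, lockdep_is_held(&obj->lock));
	/* p stays valid at least until the unlock below */
	spin_unlock(&obj->lock);	/* the read side ends with the lock */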
Thanks.
--
Frederic Weisbecker
SUSE Labs
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 21:40 ` Frederic Weisbecker
@ 2026-05-05 22:50 ` Al Viro
2026-05-06 3:49 ` Paul E. McKenney
2026-05-08 3:01 ` Al Viro
1 sibling, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 22:50 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Linus Torvalds, Paul E. McKenney, linux-fsdevel,
Christian Brauner, Jan Kara, NeilBrown, Boqun Feng,
Joel Fernandes, Uladzislau Rezki
On Tue, May 05, 2026 at 11:40:44PM +0200, Frederic Weisbecker wrote:
> > FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
> > area, according to https://lwn.net/Articles/652156/; would be nice to have
> > the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
> > "once upon a time one *did* need an explicit rcu_read_lock() in addition
> > to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
> > that has changed in <...>".
> >
> > I'm not familiar enough with that code for that kind of archaeology ;-/
> > Could somebody help with that? Paul?
>
> That's quite outdated. Now preempt disabled, IRQs disabled, and spinlocks
> are all implicit RCU read sides. On non RT spinlocks disable preemption and
> on RT they call rcu_read_lock() internally.
Out of curiosity, how much headache do bitlocks cause on RT? I mean, they
are unconditionally disabling preemption, RT or no RT...
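For reference, the core of bit_spin_lock() is roughly (simplified from
include/linux/bit_spinlock.h):
	preempt_disable();
	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
		preempt_enable();
		do {
			cpu_relax();
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
i.e. preemption ends up disabled across the critical section regardless
of config.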
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 22:50 ` Al Viro
@ 2026-05-06 3:49 ` Paul E. McKenney
2026-05-07 22:39 ` NeilBrown
0 siblings, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2026-05-06 3:49 UTC (permalink / raw)
To: Al Viro
Cc: Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, NeilBrown, Boqun Feng,
Joel Fernandes, Uladzislau Rezki
On Tue, May 05, 2026 at 11:50:28PM +0100, Al Viro wrote:
> On Tue, May 05, 2026 at 11:40:44PM +0200, Frederic Weisbecker wrote:
>
> > > FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
> > > area, according to https://lwn.net/Articles/652156/; would be nice to have
> > > the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
> > > "once upon a time one *did* need an explicit rcu_read_lock() in addition
> > > to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
> > > that has changed in <...>".
> > >
> > > I'm not familiar enough with that code for that kind of archaeology ;-/
> > > Could somebody help with that? Paul?
> >
> > That's quite outdated. Now preempt disabled, IRQs disabled, and spinlocks
> > are all implicit RCU read sides. On non RT spinlocks disable preemption and
> > on RT they call rcu_read_lock() internally.
>
> Out of curiosity, how much headache do bitlocks cause on RT? I mean, they
> are unconditionally disabling preemption, RT or no RT...
Given the unconditional disabling of preemption, they would be about as
bad for RT as are raw spinlocks.
Thanx, Paul
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-06 3:49 ` Paul E. McKenney
@ 2026-05-07 22:39 ` NeilBrown
2026-05-07 23:21 ` Paul E. McKenney
0 siblings, 1 reply; 53+ messages in thread
From: NeilBrown @ 2026-05-07 22:39 UTC (permalink / raw)
To: paulmck
Cc: Al Viro, Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, Boqun Feng, Joel Fernandes,
Uladzislau Rezki
On Wed, 06 May 2026, paulmck@kernel.org wrote:
> On Tue, May 05, 2026 at 11:50:28PM +0100, Al Viro wrote:
> > On Tue, May 05, 2026 at 11:40:44PM +0200, Frederic Weisbecker wrote:
> >
> > > > FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
> > > > area, according to https://lwn.net/Articles/652156/; would be nice to have
> > > > the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
> > > > "once upon a time one *did* need an explicit rcu_read_lock() in addition
> > > > to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
> > > > that has changed in <...>".
> > > >
> > > > I'm not familiar enough with that code for that kind of archaeology ;-/
> > > > Could somebody help with that? Paul?
> > >
> > > That's quite outdated. Now preempt disabled, IRQs disabled, and spinlocks
> > > are all implicit RCU read sides. On non RT spinlocks disable preemption and
> > > on RT they call rcu_read_lock() internally.
> >
> > Out of curiosity, how much headache do bitlocks cause on RT? I mean, they
> > are unconditionally disabling preemption, RT or no RT...
>
> Given the unconditional disabling of preemption, they would be about as
> bad for RT as are raw spinlocks.
But presumably, if they ever became a measurable problem, someone could
create an RT-bitlock implementation?
NeilBrown
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-07 22:39 ` NeilBrown
@ 2026-05-07 23:21 ` Paul E. McKenney
2026-05-08 14:47 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2026-05-07 23:21 UTC (permalink / raw)
To: NeilBrown
Cc: Al Viro, Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, Boqun Feng, Joel Fernandes,
Uladzislau Rezki
On Fri, May 08, 2026 at 08:39:35AM +1000, NeilBrown wrote:
> On Wed, 06 May 2026, paulmck@kernel.org wrote:
> > On Tue, May 05, 2026 at 11:50:28PM +0100, Al Viro wrote:
> > > On Tue, May 05, 2026 at 11:40:44PM +0200, Frederic Weisbecker wrote:
> > >
> > > > > FWIW, back in 2015 spin_lock() did *not* provide an RCU read-side critical
> > > > > area, according to https://lwn.net/Articles/652156/; would be nice to have
> > > > > the history mentioned in Documentation/RCU/whatisRCU.rst - at least to the level of
> > > > > "once upon a time one *did* need an explicit rcu_read_lock() in addition
> > > > > to spin_lock() - spin_lock() or preempt_disable() alone was not enough;
> > > > > that has changed in <...>".
> > > > >
> > > > > I'm not familiar enough with that code for that kind of archaeology ;-/
> > > > > Could somebody help with that? Paul?
> > > >
> > > > That's quite outdated. Now preempt disabled, IRQs disabled, and spinlocks
> > > > are all implicit RCU read sides. On non RT spinlocks disable preemption and
> > > > on RT they call rcu_read_lock() internally.
> > >
> > > Out of curiosity, how much headache do bitlocks cause on RT? I mean, they
> > > are unconditionally disabling preemption, RT or no RT...
> >
> > Given the unconditional disabling of preemption, they would be about as
> > bad for RT as are raw spinlocks.
>
> But presumably, if they ever became a measurable problem, someone could
> create an RT-bitlock implementation?
Alternatively, if the bitlock acquisition is always a trylock-style
operation with no tight loop around the trylock, it should not be
necessary to disable preemption. (Give or take the expectations of the
code in the critical section, of course!)
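Something along these lines, give or take (made-up lock word, using the
raw bitops so as to avoid the preempt_disable() in bit_spin_lock()):
	if (test_and_set_bit_lock(0, &word))
		return false;		/* contended - back off, don't spin */
	/* ... critical section ... */
	clear_bit_unlock(0, &word);	/* release with the needed barrier */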
Thanx, Paul
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-07 23:21 ` Paul E. McKenney
@ 2026-05-08 14:47 ` Al Viro
2026-05-08 22:03 ` Paul E. McKenney
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-08 14:47 UTC (permalink / raw)
To: Paul E. McKenney
Cc: NeilBrown, Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, Boqun Feng, Joel Fernandes,
Uladzislau Rezki
On Thu, May 07, 2026 at 04:21:28PM -0700, Paul E. McKenney wrote:
> Alternatively, if the bitlock acquisition is always a trylock-style
> operation with no tight loop around the trylock, it should not be
> necessary to disable preemption. (Give or take the expectations of the
> code in the critical section, of course!)
In dcache we have two of those - hash chains in the main dentry hash
(insertion and removal only; RCU for traversals) and, more unpleasantly,
hash chains in the in-lookup hash. Those have search-and-insert-new
and removal, the former with full traversal of the chain under the
lock.
The entries are there for the duration of ->lookup(), so stuffing a lot
of them in there might be feasible with a bunch of lookups for different
names on a stuck network filesystem (each with a separate thread blocked
waiting for the server to reply)...
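For reference, the in-lookup side is roughly (condensed from
d_alloc_parallel(); matching logic and error handling omitted):
	hlist_bl_lock(b);		/* bit 0 of the chain head as lock */
	hlist_bl_for_each_entry(dentry, node, b, d_in_lookup_hash) {
		/* full search of the chain under the bitlock */
	}
	hlist_bl_add_head(&new->d_in_lookup_hash, b);
	hlist_bl_unlock(b);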
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-08 14:47 ` Al Viro
@ 2026-05-08 22:03 ` Paul E. McKenney
2026-05-08 23:03 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Paul E. McKenney @ 2026-05-08 22:03 UTC (permalink / raw)
To: Al Viro
Cc: NeilBrown, Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, Boqun Feng, Joel Fernandes,
Uladzislau Rezki
On Fri, May 08, 2026 at 03:47:54PM +0100, Al Viro wrote:
> On Thu, May 07, 2026 at 04:21:28PM -0700, Paul E. McKenney wrote:
>
> > Alternatively, if the bitlock acquisition is always a trylock-style
> > operation with no tight loop around the trylock, it should not be
> > necessary to disable preemption. (Give or take the expectations of the
> > code in the critical section, of course!)
>
> In dcache we have two of those - hash chains in the main dentry hash
> (insertion and removal only; RCU for traversals) and, more unpleasantly,
> hash chains in the in-lookup hash. Those have search-and-insert-new
> and removal, the former with full traversal of the chain under the
> lock.
>
> The entries are there for the duration of ->lookup(), so stuffing a lot
> of them in there might be feasible with a bunch of lookups for different
> names on a stuck network filesystem (each with a separate thread blocked
> waiting for the server to reply)...
On the other hand, one penalty of permitting preemption of tasks holding
bitlocks is that if some task is preempted holding such a bitlock,
all attempts to acquire that bitlock will fail until that task gets to
run again. And bitlocks don't do anything to identify who is holding
them, making debugging a bit more challenging. :-/
Thanx, Paul
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-08 22:03 ` Paul E. McKenney
@ 2026-05-08 23:03 ` Al Viro
0 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-08 23:03 UTC (permalink / raw)
To: Paul E. McKenney
Cc: NeilBrown, Frederic Weisbecker, Linus Torvalds, linux-fsdevel,
Christian Brauner, Jan Kara, Boqun Feng, Joel Fernandes,
Uladzislau Rezki
On Fri, May 08, 2026 at 03:03:10PM -0700, Paul E. McKenney wrote:
> On the other hand, one penalty of permitting preemption of tasks holding
> bitlocks is that if some task is preempted holding such a bitlock,
> all attempts to acquire that bitlock will fail until that task gets to
> run again. And bitlocks don't do anything to identify who is holding
> them, making debugging a bit more challenging. :-/
Quite. And considering that this is attempted by anything that hits the same
hash chain while trying to look a pathname component up and not finding it
already in dcache...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope
2026-05-05 21:40 ` Frederic Weisbecker
2026-05-05 22:50 ` Al Viro
@ 2026-05-08 3:01 ` Al Viro
1 sibling, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-08 3:01 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Linus Torvalds, Paul E. McKenney, linux-fsdevel,
Christian Brauner, Jan Kara, NeilBrown, Boqun Feng,
Joel Fernandes, Uladzislau Rezki
On Tue, May 05, 2026 at 11:40:44PM +0200, Frederic Weisbecker wrote:
> > I would really prefer to have all explicit rcu_read_lock() in fs/dcache.c
> > clearly documented ;-/
>
> And the calls to rcu_read_lock() / rcu_read_unlock() are _usually_ cheap enough
> that making them explicit should go unnoticed next to an implicit spinlock RCU
> critical section. OK, rcu_read_unlock() has its slowpath, but that would otherwise
> trigger randomly when implicit read sides are used (context switch, interrupts,
> etc...)
>
> Also rcu_read_lock() has the advantage of telling clearly that we are doing an RCU read
> side (and its uses can indeed be documented), while relying on spinlocks to do
> that is fine but implicit, and more error-prone when there are tricky things
> like spinlock transfers: better not to ask for trouble.
>
> So I would say that we should probably avoid introducing such an API if possible.
The problem is, for dentries such transfer may be _less_ error-prone than
a large rcu_read_lock() scope. There's such a thing as NORCU dentries, and
for those freeing is not RCU-delayed. So for d_walk() the safety of
dentry = this_parent;
this_parent = dentry->d_parent;
spin_unlock(&dentry->d_lock);
spin_lock(&this_parent->d_lock);
requires more than just "everything is in RCU read-side critical area"; it's
also "and NORCU dentries never occur as anyone's parents".
We do rely upon ->d_lock for all other transitions in that graph walk -
for descent into a subtree and for traversals into the next sibling,
including those during ascent from subtree.
Any place where we rely solely upon rcu_read_lock() needs an analysis of
the NORCU case, and it's not always "NORCU can't occur here" - e.g. fast_dput()
is hit every time we close a pipe or socket, and these are the original
reason for having NORCU in the first place.
FWIW, the rules (and I'm still not finished writing down a coherent
variant of those) are very heavily ->d_lock-based; basically, we have
3 safe cases - counting reference, RCU reference and locked reference.
We don't have that enforced by the type system, unfortunately, and I'm
not sure we *can* express that with the C type system in a way that would
be entirely compiler-enforced. I do think we can reduce most of that
down to a handful of "prove the safety manually" areas; we'll see.
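Roughly, with the surrounding code made up:
	/* counting reference - pinned by refcount */
	dentry = dget(dentry);
	/* ... */
	dput(dentry);
	/* RCU reference - valid only inside the read-side critical area, and
	 * only if freeing of the dentry is RCU-delayed (i.e. not NORCU) */
	rcu_read_lock();
	/* ... */
	rcu_read_unlock();
	/* locked reference - stable for as long as ->d_lock is held */
	spin_lock(&dentry->d_lock);
	/* ... */
	spin_unlock(&dentry->d_lock);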
Long story short, rcu_read_lock() is really too blunt a tool in that area.
Look at the list of scopes in
https://lore.kernel.org/all/20260507073510.GL3518998@ZenIV/
and tell me if you like having to rely upon that kind of analysis in
proof of correctness...
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (19 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-07 21:52 ` Jori Koolstra
2026-05-05 5:54 ` [RFC PATCH 22/25] shrink_dentry_tree(): unify the calls of shrink_dentry_list() Al Viro
` (4 subsequent siblings)
25 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
The current use of rcu_read_lock() in d_alloc_parallel()
is fairly opaque - the single large scope serves two purposes.
We start with a lookup in the normal hash, where the rcu_read_lock()
scope puts __d_lookup_rcu() and the subsequent lockref_get_not_dead() into
the same RCU read-side critical area.
If no match is found, we proceed to lock the hash chain of
in-lookup hash and scan that for a match. If we find a match, we want
to grab it and wait for lookup in progress to finish. Since the bitlock
we use for these hash chains has to nest inside ->d_lock, we need to
unlock the chain first and use lockref_get_not_dead() on the match.
That has to be done without breaking the RCU read-side critical area,
and we use the same rcu_read_lock() scope to bridge over.
The thing is, after having grabbed the reference (and it is
very unlikely to fail) we proceed to grab ->d_lock - d_wait_lookup()
and __d_lookup_unhash()/__d_wake_in_lookup_waiters() are using that for
serialization. That makes lockref_get_not_dead() pointless - trying
to avoid grabbing ->d_lock for refcount increment, only to grab it
anyway immediately after that. If we grab ->d_lock first and replace
lockref_get_not_dead() with a direct check for sign and an increment if
non-negative, we can move rcu_read_unlock() to immediately after grabbing
->d_lock. Moreover, the RCU read-side critical area doesn't need to be
contiguous all the way from the earlier __d_lookup_rcu() - we can just as
well terminate that one ASAP and call rcu_read_lock() again only after
having found a match (if any) in the in-lookup hash chain.
That makes the entire thing easier to follow and the purpose
of those rcu_read_lock() calls easier to describe - the first scope is
for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
over from the bitlock scope to the ->d_lock scope on the match found in
in-lookup hash.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 29 ++++++++++++-----------------
1 file changed, 12 insertions(+), 17 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 2a342d618303..f3c8d46867a9 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2729,38 +2729,33 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
spin_unlock(&parent->d_lock);
retry:
- rcu_read_lock();
seq = smp_load_acquire(&parent->d_inode->i_dir_seq);
r_seq = read_seqbegin(&rename_lock);
+ rcu_read_lock();
dentry = __d_lookup_rcu(parent, name, &d_seq);
if (unlikely(dentry)) {
if (!lockref_get_not_dead(&dentry->d_lockref)) {
rcu_read_unlock();
goto retry;
}
+ rcu_read_unlock();
if (read_seqcount_retry(&dentry->d_seq, d_seq)) {
- rcu_read_unlock();
dput(dentry);
goto retry;
}
- rcu_read_unlock();
dput(new);
return dentry;
}
- if (unlikely(read_seqretry(&rename_lock, r_seq))) {
- rcu_read_unlock();
+ rcu_read_unlock();
+ if (unlikely(read_seqretry(&rename_lock, r_seq)))
goto retry;
- }
- if (unlikely(seq & 1)) {
- rcu_read_unlock();
+ if (unlikely(seq & 1))
goto retry;
- }
hlist_bl_lock(b);
if (unlikely(READ_ONCE(parent->d_inode->i_dir_seq) != seq)) {
hlist_bl_unlock(b);
- rcu_read_unlock();
goto retry;
}
/*
@@ -2777,19 +2772,20 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
continue;
if (!d_same_name(dentry, parent, name))
continue;
+ rcu_read_lock();
hlist_bl_unlock(b);
+ spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
/* now we can try to grab a reference */
- if (!lockref_get_not_dead(&dentry->d_lockref)) {
- rcu_read_unlock();
+ if (unlikely(dentry->d_lockref.count < 0)) {
+ spin_unlock(&dentry->d_lock);
goto retry;
}
-
- rcu_read_unlock();
/*
* somebody is likely to be still doing lookup for it;
- * wait for them to finish
+ * pin it and wait for them to finish
*/
- spin_lock(&dentry->d_lock);
+ dget_dlock(dentry);
d_wait_lookup(dentry);
/*
* it's not in-lookup anymore; in principle we should repeat
@@ -2810,7 +2806,6 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
dput(new);
return dentry;
}
- rcu_read_unlock();
hlist_bl_add_head(&new->d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel()
2026-05-05 5:54 ` [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel() Al Viro
@ 2026-05-07 21:52 ` Jori Koolstra
2026-05-08 3:12 ` Al Viro
0 siblings, 1 reply; 53+ messages in thread
From: Jori Koolstra @ 2026-05-07 21:52 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
NeilBrown
On Tue, May 05, 2026 at 06:54:08AM +0100, Al Viro wrote:
>
> If no match is found, we proceed to lock the hash chain of
> in-lookup hash and scan that for a match. If we find a match, we want
> to grab it and wait for lookup in progress to finish. Since the bitlock
> we use for these hash chains has to nest inside ->d_lock, we need to
> unlock the chain first and use lockref_get_not_dead() on the match.
Just curious, reading this code as someone fairly new to the kernel:
how can I know what the agreed-upon lock order is between these hash
chains and the d_locks of the dentries they contain?
>
> That makes the entire thing easier to follow and the purpose
> of those rcu_read_lock() calls easier to describe - the first scope is
> for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
> over from the bitlock scope to the ->d_lock scope on the match found in
> in-lookup hash.
Because the dentry could be freed after the hash chain unlock and before
grabbing the d_lock?
I do think the hlist_bl_for_each_entry loop would read more clearly if we
pulled the code that runs after finding the initial match out of it, since
we always exit the loop afterwards.
Thanks for this patch, I've found myself puzzled by the locking here
several times, and this clears things up for me. I see no issues with
the change, for what it's worth.
Best,
Jori.
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel()
2026-05-07 21:52 ` Jori Koolstra
@ 2026-05-08 3:12 ` Al Viro
2026-05-08 9:28 ` Jori Koolstra
0 siblings, 1 reply; 53+ messages in thread
From: Al Viro @ 2026-05-08 3:12 UTC (permalink / raw)
To: Jori Koolstra, Linus Torvalds, linux-fsdevel, Christian Brauner,
Jan Kara, NeilBrown
On Thu, May 07, 2026 at 11:52:21PM +0200, Jori Koolstra wrote:
> > If no match is found, we proceed to lock the hash chain of
> > in-lookup hash and scan that for a match. If we find a match, we want
> > to grab it and wait for lookup in progress to finish. Since the bitlock
> > we use for these hash chains has to nest inside ->d_lock, we need to
> > unlock the chain first and use lockref_get_not_dead() on the match.
>
> Just curious, reading this code as someone fairly new to the kernel,
> how can I know what is the agreed-upon lock order between these hash
> chains and the d_locks of the dentries they contain?
Either wait for documentation to get there, or look at the places where
we deal with those locks; in particular, the removal from the in-lookup
hash chain comes in __d_lookup_unhash():
hlist_bl_lock(b);
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
__hlist_bl_del(&dentry->d_in_lookup_hash);
hlist_bl_unlock(b);
with callers of that thing all done under ->d_lock. And yes, we *do*
need to document that shite - no arguments here.
> > That makes the entire thing easier to follow and the purpose
> > of those rcu_read_lock() calls easier to describe - the first scope is
> > for __d_lookup_rcu() + lockref_get_not_dead(), the second one bridges
> > over from the bitlock scope to the ->d_lock scope on the match found in
> > in-lookup hash.
>
> Because the dentry could be freed after the hash chain unlock and before
> grabbing the d_lock?
Yes. It's a bit more complicated - the missing bit is "and the dentry we might
find in the in-lookup hash is not going to be NORCU, so freeing is guaranteed
to be RCU-delayed".
> I do think the hlist_bl_for_each_entry loop would read clearer if we
> pull the code after having found the initial match out of it, since we
> always exit the loop afterwards.
Umm... Not sure - we'd need to distinguish between "no match found" and "got
a match, stopped looking" cases. Would have to be something like
if (!pos)
insert into hash
else
dentry is a match, deal with it
which is more reliant upon the details of hlist_bl_for_each_entry than I'd like...
^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel()
2026-05-08 3:12 ` Al Viro
@ 2026-05-08 9:28 ` Jori Koolstra
0 siblings, 0 replies; 53+ messages in thread
From: Jori Koolstra @ 2026-05-08 9:28 UTC (permalink / raw)
To: Al Viro, Linus Torvalds, linux-fsdevel, Christian Brauner,
Jan Kara, NeilBrown
> On 08-05-2026 05:12 CEST, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Umm... Not sure - we'd need to distinguish between "no match found" and "got
> a match, stopped looking" cases. Would have to be something like
> if (!pos)
> insert into hash
> ele
> dentry is a match, deal with it
> which is more reliant upon the details of hlist_bl_for_each_entry than I'd like...
I was thinking roughly along the lines of
hlist_bl_for_each_entry(dentry, node, b, d_in_lookup_hash) {
if (dentry->d_name.hash != hash)
continue;
if (dentry->d_parent != parent)
continue;
if (!d_same_name(dentry, parent, name))
continue;
goto found;
}
/* nothing found */
hlist_bl_add_head(&new->d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
found:
[...]
/* OK, it *is* a hashed match; return it */
spin_unlock(&dentry->d_lock);
dput(new);
return dentry;
[...]
Ah, but you would still depend on the fact that dentry has the last iteration value,
which is an implementation detail. All right, I see your point. Still, I feel that
processing the thing that we are iterating for shouldn't be part of a loop, unless
that processing might fail and the loop continues looking further. But maybe there
is no clean way of separating that out here.
Sorry, enough bikeshedding from me...
I would add a reviewed-by, but I feel that is a bit too pretentious. Thanks for the
clarifications, Al, I appreciate it :)
Best,
Jori.
^ permalink raw reply [flat|nested] 53+ messages in thread
* [RFC PATCH 22/25] shrink_dentry_tree(): unify the calls of shrink_dentry_list()
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (20 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel() Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 23/25] wind ->s_roots via ->d_sib instead of ->d_hash Al Viro
` (3 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 34 +++++++++++++++++-----------------
1 file changed, 17 insertions(+), 17 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index f3c8d46867a9..9003b8cf7134 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -630,12 +630,14 @@ struct completion_list {
* Use ->waiters for a single-linked list of struct completion_list of
* waiters.
*/
-static inline void d_add_waiter(struct dentry *dentry, struct completion_list *p)
+static inline bool d_add_waiter(struct dentry *dentry, struct completion_list *p)
{
- struct completion_list *v = dentry->waiters;
+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED))
+ return false;
init_completion(&p->completion);
- p->next = v;
+ p->next = dentry->waiters;
dentry->waiters = p;
+ return true;
}
static inline void d_complete_waiters(struct dentry *dentry)
@@ -1691,7 +1693,9 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
{
for (;;) {
- struct select_data data = {.start = parent};
+ struct completion_list wait;
+ bool need_wait = false;
+ struct select_data data = { .start = parent };
INIT_LIST_HEAD(&data.dispose);
d_walk(parent, &data,
@@ -1723,22 +1727,18 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
spin_lock(&v->d_lock);
rcu_read_unlock();
- if (v->d_lockref.count < 0 &&
- !(v->d_flags & DCACHE_DENTRY_KILLED)) {
- struct completion_list wait;
- // It's busy dying; have it notify us once
- // it becomes invisible to d_walk().
- d_add_waiter(v, &wait);
+ if (unlikely(v->d_lockref.count < 0)) {
+ // It's doomed; if it isn't dead yet, notify us
+ // once it becomes invisible to d_walk().
+ need_wait = d_add_waiter(v, &wait);
spin_unlock(&v->d_lock);
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
- wait_for_completion(&wait.completion);
- continue;
+ } else {
+ shrink_kill(v);
}
- shrink_kill(v);
}
- if (!list_empty(&data.dispose))
- shrink_dentry_list(&data.dispose);
+ shrink_dentry_list(&data.dispose);
+ if (unlikely(need_wait))
+ wait_for_completion(&wait.completion);
}
}
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 23/25] wind ->s_roots via ->d_sib instead of ->d_hash
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (21 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 22/25] shrink_dentry_tree(): unify the calls of shrink_dentry_list() Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 24/25] nfs: get rid of fake root dentries Al Viro
` (2 subsequent siblings)
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads' shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of the shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.
That relies upon the dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list. Unfortunately, the secondary roots are linked
via ->d_hash rather than ->d_sib, and they get removed from
that list before their inode references are dropped.
If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and has gotten to that point in dentry_kill()
by the time shrink_dcache_for_umount() goes looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.
Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.
AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use the d_add_waiter() machinery to wait for an eviction in progress.
Changes:
* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play games for d_walk()'s sake,
since the latter is not going to look for the next sibling of those
anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 65 ++++++++++++++++++++++++----------
fs/super.c | 1 +
include/linux/fs/super_types.h | 3 +-
3 files changed, 50 insertions(+), 19 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 9003b8cf7134..12f1143d479a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -43,8 +43,8 @@
* - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
- * s_roots bl list spinlock protects:
- * - the s_roots list (see __d_drop)
+ * s_roots_lock protects:
+ * - the s_roots list (see __d_move()/dentry_unlist()/d_obtain_root())
* dentry->d_sb->s_dentry_lru_lock protects:
* - the dcache lru lists and counters
* d_lock protects:
@@ -562,16 +562,7 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
static void ___d_drop(struct dentry *dentry)
{
- struct hlist_bl_head *b;
- /*
- * Hashed dentries are normally on the dentry hashtable,
- * with the exception of those newly allocated by
- * d_obtain_root, which are always IS_ROOT:
- */
- if (unlikely(IS_ROOT(dentry)))
- b = &dentry->d_sb->s_roots;
- else
- b = d_hash(dentry->d_name.hash);
+ struct hlist_bl_head *b = d_hash(dentry->d_name.hash);
hlist_bl_lock(b);
__hlist_bl_del(&dentry->d_hash);
@@ -654,6 +645,13 @@ static inline void d_complete_waiters(struct dentry *dentry)
}
}
+static void unlink_secondary_root(struct dentry *dentry)
+{
+ spin_lock(&dentry->d_sb->s_roots_lock);
+ hlist_del_init(&dentry->d_sib);
+ spin_unlock(&dentry->d_sb->s_roots_lock);
+}
+
static inline void dentry_unlist(struct dentry *dentry)
{
struct dentry *next;
@@ -665,6 +663,10 @@ static inline void dentry_unlist(struct dentry *dentry)
d_complete_waiters(dentry);
if (unlikely(hlist_unhashed(&dentry->d_sib)))
return;
+ if (unlikely(IS_ROOT(dentry))) {
+ unlink_secondary_root(dentry); // secondary root goes away
+ return;
+ }
__hlist_del(&dentry->d_sib);
/*
* Cursors can move around the list of children. While we'd been
@@ -1791,9 +1793,30 @@ void shrink_dcache_for_umount(struct super_block *sb)
sb->s_root = NULL;
do_one_tree(dentry);
- while (!hlist_bl_empty(&sb->s_roots)) {
- dentry = dget(hlist_bl_entry(hlist_bl_first(&sb->s_roots), struct dentry, d_hash));
- do_one_tree(dentry);
+ for (;;) {
+ spin_lock(&sb->s_roots_lock);
+ dentry = hlist_entry_safe(sb->s_roots.first,
+ struct dentry, d_sib);
+ if (!dentry) {
+ spin_unlock(&sb->s_roots_lock);
+ break;
+ }
+ rcu_read_lock();
+ spin_unlock(&sb->s_roots_lock);
+ spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
+ if (unlikely(dentry->d_lockref.count < 0)) {
+ struct completion_list wait;
+ bool need_wait = d_add_waiter(dentry, &wait);
+
+ spin_unlock(&dentry->d_lock);
+ if (need_wait)
+ wait_for_completion(&wait.completion);
+ } else {
+ dget_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
+ do_one_tree(dentry);
+ }
}
}
@@ -2210,9 +2233,9 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
__d_set_inode_and_type(new, inode, add_flags);
hlist_add_head(&new->d_alias, &inode->i_dentry);
if (!disconnected) {
- hlist_bl_lock(&sb->s_roots);
- hlist_bl_add_head(&new->d_hash, &sb->s_roots);
- hlist_bl_unlock(&sb->s_roots);
+ spin_lock(&sb->s_roots_lock);
+ hlist_add_head(&new->d_sib, &sb->s_roots);
+ spin_unlock(&sb->s_roots_lock);
}
spin_unlock(&new->d_lock);
spin_unlock(&inode->i_lock);
@@ -3224,6 +3247,12 @@ struct dentry *d_splice_alias_ops(struct inode *inode, struct dentry *dentry,
}
dput(old_parent);
} else {
+ if (unlikely(!hlist_unhashed(&new->d_sib))) {
+ // secondary root getting spliced
+ spin_lock(&new->d_lock);
+ unlink_secondary_root(new);
+ spin_unlock(&new->d_lock);
+ }
__d_move(new, dentry, false);
write_sequnlock(&rename_lock);
}
diff --git a/fs/super.c b/fs/super.c
index 378e81efe643..fb44ebadda82 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -359,6 +359,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_roots);
+ spin_lock_init(&s->s_roots_lock);
mutex_init(&s->s_sync_lock);
INIT_LIST_HEAD(&s->s_inodes);
spin_lock_init(&s->s_inode_list_lock);
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..23d1c2612d0c 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -162,7 +162,8 @@ struct super_block {
struct unicode_map *s_encoding;
__u16 s_encoding_flags;
#endif
- struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
+ struct hlist_head s_roots; /* alternate root dentries for NFS */
+ spinlock_t s_roots_lock;
struct mount *s_mounts; /* list of mounts; _not_ for fs use */
struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */
struct file *s_bdev_file;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 24/25] nfs: get rid of fake root dentries
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (22 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 23/25] wind ->s_roots via ->d_sib instead of ->d_hash Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 25/25] make cursors NORCU Al Viro
2026-05-05 17:09 ` [RFC PATCH 00/25] assorted dcache cleanups and fixes Linus Torvalds
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
... just grab the reference to the (real) root we are about to return
for the first mount of this superblock and be done with that.
Once upon a time dentry tree eviction at fs shutdown used to break
if ->s_root had been spliced on top of something; that hasn't been
the case for years now, and these fake root dentries violate a bunch
of invariants. Let's get rid of them...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 8 ++------
fs/nfs/getroot.c | 35 ++---------------------------------
2 files changed, 4 insertions(+), 39 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 12f1143d479a..2b5cf6e52ed6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -455,14 +455,10 @@ static void dentry_unlink_inode(struct dentry * dentry)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
- hlist_del_init(&dentry->d_alias);
+ __hlist_del(&dentry->d_alias);
/*
* dentry becomes negative, so the space occupied by ->d_alias
- * belongs to ->waiters now; we could use __hlist_del() instead
- * of hlist_del_init(), if not for the stunt pulled by nfs
- * dummy root dentries - positive dentry *not* included into
- * the alias list of its inode. Open-coding hlist_del_init()
- * and removing zeroing would be too clumsy...
+ * belongs to ->waiters now.
*/
dentry->waiters = NULL;
raw_write_seqcount_end(&dentry->d_seq);
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index eef0736beb67..ff7424bc4bec 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -32,35 +32,6 @@
#define NFSDBG_FACILITY NFSDBG_CLIENT
-/*
- * Set the superblock root dentry.
- * Note that this function frees the inode in case of error.
- */
-static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *inode)
-{
- /* The mntroot acts as the dummy root dentry for this superblock */
- if (sb->s_root == NULL) {
- sb->s_root = d_make_root(inode);
- if (sb->s_root == NULL)
- return -ENOMEM;
- ihold(inode);
- /*
- * Ensure that this dentry is invisible to d_find_alias().
- * Otherwise, it may be spliced into the tree by
- * d_splice_alias if a parent directory from the same
- * filesystem gets mounted at a later time.
- * This again causes shrink_dcache_for_umount_subtree() to
- * Oops, since the test for IS_ROOT() will fail.
- */
- spin_lock(&d_inode(sb->s_root)->i_lock);
- spin_lock(&sb->s_root->d_lock);
- hlist_del_init(&sb->s_root->d_alias);
- spin_unlock(&sb->s_root->d_lock);
- spin_unlock(&d_inode(sb->s_root)->i_lock);
- }
- return 0;
-}
-
/*
* get a root dentry from the root filehandle
*/
@@ -99,10 +70,6 @@ int nfs_get_root(struct super_block *s, struct fs_context *fc)
goto out_fattr;
}
- error = nfs_superblock_set_dummy_root(s, inode);
- if (error != 0)
- goto out_fattr;
-
/* root dentries normally start off anonymous and get spliced in later
* if the dentry tree reaches them; however if the dentry already
* exists, we'll pick it up at this point and use it as the root
@@ -123,6 +90,8 @@ int nfs_get_root(struct super_block *s, struct fs_context *fc)
name = NULL;
}
spin_unlock(&root->d_lock);
+ if (!s->s_root)
+ s->s_root = dget(root);
fc->root = root;
if (server->caps & NFS_CAP_SECURITY_LABEL)
kflags |= SECURITY_LSM_NATIVE_LABELS;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* [RFC PATCH 25/25] make cursors NORCU
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (23 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 24/25] nfs: get rid of fake root dentries Al Viro
@ 2026-05-05 5:54 ` Al Viro
2026-05-05 17:09 ` [RFC PATCH 00/25] assorted dcache cleanups and fixes Linus Torvalds
25 siblings, 0 replies; 53+ messages in thread
From: Al Viro @ 2026-05-05 5:54 UTC (permalink / raw)
To: Linus Torvalds; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
All it requires is making sure that d_walk() will skip *all*
CURSOR dentries, even if somebody passes it one as an argument.
Cursors are negative and unhashed all along, never get added to
the LRU or to shrink lists, and no RCU references via ->d_sib are
possible for those - dentry_unlist() makes sure that no killed
dentry has ->d_sib.next left pointing to a cursor.
Seeing that a cursor is allocated every time we open a directory
on autofs, debugfs, devpts, etc., avoiding an RCU delay when such
opened files get closed is attractive...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 2b5cf6e52ed6..f64a02998184 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1429,6 +1429,8 @@ static void d_walk(struct dentry *parent, void *data,
read_seqbegin_or_lock(&rename_lock, &seq);
this_parent = parent;
spin_lock(&this_parent->d_lock);
+ if (unlikely(this_parent->d_flags & DCACHE_DENTRY_CURSOR))
+ goto out_unlock;
ret = enter(data, this_parent);
switch (ret) {
@@ -1982,7 +1984,7 @@ struct dentry *d_alloc_cursor(struct dentry * parent)
{
struct dentry *dentry = d_alloc_anon(parent->d_sb);
if (dentry) {
- dentry->d_flags |= DCACHE_DENTRY_CURSOR;
+ dentry->d_flags |= DCACHE_DENTRY_CURSOR | DCACHE_NORCU;
dentry->d_parent = dget(parent);
}
return dentry;
--
2.47.3
^ permalink raw reply related [flat|nested] 53+ messages in thread
* Re: [RFC PATCH 00/25] assorted dcache cleanups and fixes
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
` (24 preceding siblings ...)
2026-05-05 5:54 ` [RFC PATCH 25/25] make cursors NORCU Al Viro
@ 2026-05-05 17:09 ` Linus Torvalds
25 siblings, 0 replies; 53+ messages in thread
From: Linus Torvalds @ 2026-05-05 17:09 UTC (permalink / raw)
To: Al Viro; +Cc: linux-fsdevel, Christian Brauner, Jan Kara, NeilBrown
On Mon, 4 May 2026 at 22:55, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> This series has grown out of trying to prove memory safety
> of dentry handling; there are followups sitting in the local
> tree, but those are getting reshuffled.
Ok, so I had negative reactions to some of the patches that I
documented, but on the whole I like this series.
None of my negative reactions were deal killers, and I'm perfectly
happy to be ignored if people think my reactions were a bit odd. So
take them as you will.
Linus
^ permalink raw reply [flat|nested] 53+ messages in thread