* [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous Al Viro
` (3 subsequent siblings)
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Most of the places using d_alias are loops iterating through all aliases of
a given inode; introduce a helper macro (for_each_alias(dentry, inode))
and convert open-coded instances of such loops to it.
They are easier to read that way, and it reduces the noise in the next steps.
You _must_ hold inode->i_lock over that loop.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
Documentation/filesystems/porting.rst | 10 ++++++++++
fs/affs/amigaffs.c | 2 +-
fs/ceph/mds_client.c | 2 +-
fs/dcache.c | 6 +++---
fs/exportfs/expfs.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/notify/fsnotify.c | 2 +-
fs/ocfs2/dcache.c | 2 +-
fs/overlayfs/dir.c | 2 +-
fs/smb/client/inode.c | 2 +-
include/linux/dcache.h | 4 ++++
11 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 52ff1d19405b..9a9babd9ec48 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1361,3 +1361,13 @@ to match what strlen() would return if it was ran on the string.
However, if the string is freely accessible for the duration of inode's
lifetime, consider using inode_set_cached_link() instead.
+
+---
+
+**recommended**
+
+If you really need to iterate through dentries for given inode, use
+for_each_alias(dentry, inode) instead of hlist_for_each_entry; better
+yet, see if any of the exported primitives could be used instead of
+the entire loop. You still need to hold ->i_lock of the inode over
+either form of manual loop.
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index fd669daa4e7b..91966b1f41f6 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -126,7 +126,7 @@ affs_fix_dcache(struct inode *inode, u32 entry_ino)
{
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
if (entry_ino == (u32)(long)dentry->d_fsdata) {
dentry->d_fsdata = (void *)inode->i_ino;
break;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..f839109fb66f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4614,7 +4614,7 @@ static struct dentry* d_find_primary(struct inode *inode)
goto out_unlock;
}
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
if (!d_unhashed(alias) &&
(ceph_dentry(alias)->flags & CEPH_DENTRY_PRIMARY_LINK)) {
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..e069b6ec4ec0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -790,7 +790,7 @@ void d_mark_dontcache(struct inode *inode)
struct dentry *de;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(de, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(de, inode) {
spin_lock(&de->d_lock);
de->d_flags |= DCACHE_DONTCACHE;
spin_unlock(&de->d_lock);
@@ -1040,7 +1040,7 @@ static struct dentry *__d_find_alias(struct inode *inode)
if (S_ISDIR(inode->i_mode))
return __d_find_any_alias(inode);
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
if (!d_unhashed(alias)) {
dget_dlock(alias);
@@ -1133,7 +1133,7 @@ void d_prune_aliases(struct inode *inode)
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)
+ for_each_alias(dentry, inode)
d_dispose_if_unused(dentry, &dispose);
spin_unlock(&inode->i_lock);
shrink_dentry_list(&dispose);
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 6c9be60a3e48..f67b3ce672fc 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -52,7 +52,7 @@ find_acceptable_alias(struct dentry *result,
inode = result->d_inode;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
dget(dentry);
spin_unlock(&inode->i_lock);
if (toput)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2402f57c8e7d..5a0bd8113e3a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1471,7 +1471,7 @@ static void nfs_clear_verifier_file(struct inode *inode)
struct dentry *alias;
struct inode *dir;
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
dir = d_inode_rcu(alias->d_parent);
if (!dir ||
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 9995de1710e5..b7198c4744e3 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -76,7 +76,7 @@ void fsnotify_set_children_dentry_flags(struct inode *inode)
spin_lock(&inode->i_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
struct dentry *child;
/* run all of the children of the original inode and fix their
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index c4ba968e778b..e06774fd89d8 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -145,7 +145,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
spin_lock(&dentry->d_lock);
if (ocfs2_match_dentry(dentry, parent_blkno, skip_unhashed)) {
trace_ocfs2_find_local_alias(dentry->d_name.len,
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index ff3dbd1ca61f..f8dfa534b566 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -904,7 +904,7 @@ static void ovl_drop_nlink(struct dentry *dentry)
/* Try to find another, hashed alias */
spin_lock(&inode->i_lock);
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
if (alias != dentry && !d_unhashed(alias))
break;
}
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index 888f9e35f14b..e2b4ad9bd0bd 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -1595,7 +1595,7 @@ inode_has_hashed_dentries(struct inode *inode)
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
if (!d_unhashed(dentry) || IS_ROOT(dentry)) {
spin_unlock(&inode->i_lock);
return true;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 898c60d21c92..7f1dbc7121d7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -615,4 +615,8 @@ void set_default_d_op(struct super_block *, const struct dentry_operations *);
struct dentry *d_make_persistent(struct dentry *, struct inode *);
void d_make_discardable(struct dentry *dentry);
+/* inode->i_lock must be held over that */
+#define for_each_alias(dentry, inode) \
+ hlist_for_each_entry(dentry, &(inode)->i_dentry, d_u.d_alias)
+
#endif /* __LINUX_DCACHE_H */
--
2.47.3
* [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
` (2 subsequent siblings)
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Making ->d_rcu and (then) ->d_child overlapping dates back to
2006; support for anonymous unions was only added to gcc in 4.6
(2011), and the minimal gcc version wasn't bumped that far until
kernel 4.19 (2018).
These days there's no reason to keep that union named.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
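What the conversion relies on can be demonstrated with a self-contained userspace sketch (the struct names and `long` members below are stand-ins, not the real dentry fields): a named and an anonymous union produce the same layout, and dropping the name only shortens the access path.

```c
#include <stddef.h>

struct named_u {
	int d_flags;
	union {
		long d_alias;	/* stand-ins for the overlapping members */
		long d_rcu;
	} d_u;			/* accessed as p->d_u.d_alias */
};

struct anon_u {
	int d_flags;
	union {			/* anonymous: members promoted to struct scope */
		long d_alias;	/* accessed as p->d_alias */
		long d_rcu;
	};
};

/* same size, and the overlapping members sit at the same offset */
static int layout_matches(void)
{
	return sizeof(struct named_u) == sizeof(struct anon_u) &&
	       offsetof(struct named_u, d_u.d_alias) ==
	       offsetof(struct anon_u, d_alias);
}

/* union members share storage with or without the union's name */
static int members_overlap(void)
{
	struct anon_u d;

	d.d_rcu = 42;
	return d.d_alias == 42;
}
```

Anonymous unions are standard C11, so every compiler the kernel supports today accepts the unnamed form.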
---
fs/ceph/mds_client.c | 2 +-
fs/dcache.c | 48 +++++++++++++++++++++---------------------
fs/inode.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/nfs/getroot.c | 2 +-
include/linux/dcache.h | 4 ++--
6 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f839109fb66f..a5eb99c3c36b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4608,7 +4608,7 @@ static struct dentry* d_find_primary(struct inode *inode)
goto out_unlock;
if (S_ISDIR(inode->i_mode)) {
- alias = hlist_entry(inode->i_dentry.first, struct dentry, d_u.d_alias);
+ alias = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
if (!IS_ROOT(alias))
dn = dget(alias);
goto out_unlock;
diff --git a/fs/dcache.c b/fs/dcache.c
index e069b6ec4ec0..4378eb8a00bb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -40,7 +40,7 @@
/*
* Usage:
* dcache->d_inode->i_lock protects:
- * - i_dentry, d_u.d_alias, d_inode of aliases
+ * - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
* s_roots bl list spinlock protects:
@@ -55,7 +55,7 @@
* - d_unhashed()
* - d_parent and d_chilren
* - childrens' d_sib and d_parent
- * - d_u.d_alias, d_inode
+ * - d_alias, d_inode
*
* Ordering:
* dentry->d_inode->i_lock
@@ -341,14 +341,14 @@ static inline struct external_name *external_name(struct dentry *dentry)
static void __d_free(struct rcu_head *head)
{
- struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+ struct dentry *dentry = container_of(head, struct dentry, d_rcu);
kmem_cache_free(dentry_cache, dentry);
}
static void __d_free_external(struct rcu_head *head)
{
- struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+ struct dentry *dentry = container_of(head, struct dentry, d_rcu);
kfree(external_name(dentry));
kmem_cache_free(dentry_cache, dentry);
}
@@ -428,19 +428,19 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON(!hlist_unhashed(&dentry->d_alias));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
- call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+ call_rcu(&dentry->d_rcu, __d_free_external);
return;
}
}
/* if dentry was never visible to RCU, immediate free is OK */
if (dentry->d_flags & DCACHE_NORCU)
- __d_free(&dentry->d_u.d_rcu);
+ __d_free(&dentry->d_rcu);
else
- call_rcu(&dentry->d_u.d_rcu, __d_free);
+ call_rcu(&dentry->d_rcu, __d_free);
}
/*
@@ -455,7 +455,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
- hlist_del_init(&dentry->d_u.d_alias);
+ hlist_del_init(&dentry->d_alias);
raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
@@ -1010,7 +1010,7 @@ static struct dentry * __d_find_any_alias(struct inode *inode)
if (hlist_empty(&inode->i_dentry))
return NULL;
- alias = hlist_entry(inode->i_dentry.first, struct dentry, d_u.d_alias);
+ alias = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
lockref_get(&alias->d_lockref);
return alias;
}
@@ -1093,9 +1093,9 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
// used without having I_FREEING set, which means no aliases left
if (likely(!(inode_state_read(inode) & I_FREEING) && !hlist_empty(l))) {
if (S_ISDIR(inode->i_mode)) {
- de = hlist_entry(l->first, struct dentry, d_u.d_alias);
+ de = hlist_entry(l->first, struct dentry, d_alias);
} else {
- hlist_for_each_entry(de, l, d_u.d_alias)
+ hlist_for_each_entry(de, l, d_alias)
if (!d_unhashed(de))
break;
}
@@ -1787,7 +1787,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_HLIST_HEAD(&dentry->d_children);
- INIT_HLIST_NODE(&dentry->d_u.d_alias);
+ INIT_HLIST_NODE(&dentry->d_alias);
INIT_HLIST_NODE(&dentry->d_sib);
if (dentry->d_op && dentry->d_op->d_init) {
@@ -1980,7 +1980,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
if ((dentry->d_flags &
(DCACHE_LRU_LIST|DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST)
this_cpu_dec(nr_dentry_negative);
- hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&dentry->d_alias, &inode->i_dentry);
raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
raw_write_seqcount_end(&dentry->d_seq);
@@ -2004,7 +2004,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
void d_instantiate(struct dentry *entry, struct inode * inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+ BUG_ON(!hlist_unhashed(&entry->d_alias));
if (inode) {
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
@@ -2024,7 +2024,7 @@ EXPORT_SYMBOL(d_instantiate);
*/
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+ BUG_ON(!hlist_unhashed(&entry->d_alias));
BUG_ON(!inode);
lockdep_annotate_inode_mutex_key(inode);
security_d_instantiate(entry, inode);
@@ -2087,7 +2087,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
spin_lock(&new->d_lock);
__d_set_inode_and_type(new, inode, add_flags);
- hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&new->d_alias, &inode->i_dentry);
if (!disconnected) {
hlist_bl_lock(&sb->s_roots);
hlist_bl_add_head(&new->d_hash, &sb->s_roots);
@@ -2658,7 +2658,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
* we unlock the chain. All fields are stable in everything
* we encounter.
*/
- hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
+ hlist_bl_for_each_entry(dentry, node, b, d_in_lookup_hash) {
if (dentry->d_name.hash != hash)
continue;
if (dentry->d_parent != parent)
@@ -2700,7 +2700,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
}
rcu_read_unlock();
new->d_wait = wq;
- hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
+ hlist_bl_add_head(&new->d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
mismatch:
@@ -2725,11 +2725,11 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
hlist_bl_lock(b);
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
- __hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
+ __hlist_bl_del(&dentry->d_in_lookup_hash);
d_wait = dentry->d_wait;
dentry->d_wait = NULL;
hlist_bl_unlock(b);
- INIT_HLIST_NODE(&dentry->d_u.d_alias);
+ INIT_HLIST_NODE(&dentry->d_alias);
INIT_LIST_HEAD(&dentry->d_lru);
return d_wait;
}
@@ -2760,7 +2760,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode,
d_set_d_op(dentry, ops);
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
- hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&dentry->d_alias, &inode->i_dentry);
raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
raw_write_seqcount_end(&dentry->d_seq);
@@ -2795,7 +2795,7 @@ EXPORT_SYMBOL(d_add);
struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON(!hlist_unhashed(&dentry->d_alias));
WARN_ON(!inode);
security_d_instantiate(dentry, inode);
spin_lock(&inode->i_lock);
@@ -3185,7 +3185,7 @@ void d_mark_tmpfile(struct file *file, struct inode *inode)
struct dentry *dentry = file->f_path.dentry;
BUG_ON(dname_external(dentry) ||
- !hlist_unhashed(&dentry->d_u.d_alias) ||
+ !hlist_unhashed(&dentry->d_alias) ||
!d_unlinked(dentry));
spin_lock(&dentry->d_parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
diff --git a/fs/inode.c b/fs/inode.c
index cc12b68e021b..9e1ab333d382 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -754,7 +754,7 @@ void dump_mapping(const struct address_space *mapping)
return;
}
- dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
+ dentry_ptr = container_of(dentry_first, struct dentry, d_alias);
if (get_kernel_nofault(dentry, dentry_ptr) ||
!dentry.d_parent || !dentry.d_name.name) {
pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 5a0bd8113e3a..f2f1b036f2f1 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1490,7 +1490,7 @@ static void nfs_clear_verifier_directory(struct inode *dir)
if (hlist_empty(&dir->i_dentry))
return;
this_parent =
- hlist_entry(dir->i_dentry.first, struct dentry, d_u.d_alias);
+ hlist_entry(dir->i_dentry.first, struct dentry, d_alias);
spin_lock(&this_parent->d_lock);
nfs_unset_verifier_delegated(&this_parent->d_time);
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index f13d25d95b85..eef0736beb67 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -54,7 +54,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
*/
spin_lock(&d_inode(sb->s_root)->i_lock);
spin_lock(&sb->s_root->d_lock);
- hlist_del_init(&sb->s_root->d_u.d_alias);
+ hlist_del_init(&sb->s_root->d_alias);
spin_unlock(&sb->s_root->d_lock);
spin_unlock(&d_inode(sb->s_root)->i_lock);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7f1dbc7121d7..f939d2ed10a3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -128,7 +128,7 @@ struct dentry {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
- } d_u;
+ };
};
/*
@@ -617,6 +617,6 @@ void d_make_discardable(struct dentry *dentry);
/* inode->i_lock must be held over that */
#define for_each_alias(dentry, inode) \
- hlist_for_each_entry(dentry, &(inode)->i_dentry, d_u.d_alias)
+ hlist_for_each_entry(dentry, &(inode)->i_dentry, d_alias)
#endif /* __LINUX_DCACHE_H */
--
2.47.3
* [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Several functions have BUG_ON/WARN_ON sanity checks that want to verify
that a dentry is not positive, but instead of looking at ->d_inode (as we
do in all other places that check that) they look at ->d_alias.
Just use the normal helpers instead - that way we no longer even look
at ->d_alias for negative dentries.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
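The predicate itself boils down to a NULL check on ->d_inode. A minimal userspace sketch (with simplified stand-in struct definitions) of the check the patch switches to:

```c
#include <stddef.h>

struct inode { unsigned long i_ino; };	/* simplified stand-in */
struct dentry { struct inode *d_inode; };

/* mirrors the kernel helper: positive <=> ->d_inode is non-NULL */
static int d_really_is_positive(const struct dentry *dentry)
{
	return dentry->d_inode != NULL;
}
```

Checking hlist_unhashed(&dentry->d_alias) tested the same state indirectly; the direct form never dereferences ->d_alias on a negative dentry, which is what frees that slot for reuse in the next patch.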
---
fs/dcache.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 4378eb8a00bb..616a445ec720 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -428,7 +428,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_alias));
+ WARN_ON(d_really_is_positive(dentry));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
@@ -2004,7 +2004,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
void d_instantiate(struct dentry *entry, struct inode * inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_alias));
+ BUG_ON(d_really_is_positive(entry));
if (inode) {
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
@@ -2024,7 +2024,7 @@ EXPORT_SYMBOL(d_instantiate);
*/
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_alias));
+ BUG_ON(d_really_is_positive(entry));
BUG_ON(!inode);
lockdep_annotate_inode_mutex_key(inode);
security_d_instantiate(entry, inode);
@@ -2795,7 +2795,7 @@ EXPORT_SYMBOL(d_add);
struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
{
- WARN_ON(!hlist_unhashed(&dentry->d_alias));
+ WARN_ON(d_really_is_positive(dentry));
WARN_ON(!inode);
security_d_instantiate(dentry, inode);
spin_lock(&inode->i_lock);
@@ -3185,7 +3185,7 @@ void d_mark_tmpfile(struct file *file, struct inode *inode)
struct dentry *dentry = file->f_path.dentry;
BUG_ON(dname_external(dentry) ||
- !hlist_unhashed(&dentry->d_alias) ||
+ d_really_is_positive(dentry) ||
!d_unlinked(dentry));
spin_lock(&dentry->d_parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
--
2.47.3
* [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree()
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
` (2 preceding siblings ...)
2026-04-04 8:07 ` [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
If shrink_dcache_tree() runs into a potential victim that is already
dying, it must wait for that dentry to go away. To avoid busy-waiting
we need some object to wait on and a way for dentry_unlist() to see that
we need to be notified.
The obvious place for the object to wait on would be on our stack frame.
We will store a pointer to that object (struct completion_list) in victim
dentry; if there's more than one thread wanting to wait for the same
dentry to finish dying, we'll have their instances linked into a list,
with reference in dentry pointing to the head of that list.
* new object - struct completion_list. A pair of struct completion and
pointer to the next instance. That's what shrink_dcache_tree() will wait
on if needed.
* add a new member (->waiters, an opaque pointer to struct completion_list)
to struct dentry. It is defined for negative live dentries that are
not in-lookup ones, and it will remain NULL for almost all of them.
It does not conflict with ->d_rcu (defined for killed dentries), ->d_alias
(defined for positive dentries, all live) or ->d_in_lookup_hash (defined
for in-lookup dentries, all live negative). That allows all four members
to be colocated.
* make sure that all places where dentry enters the state where ->waiters
is defined (live, negative, not-in-lookup) initialize ->waiters to NULL.
* if select_collect2() runs into a dentry that is already dying, have
its caller insert a local instance of struct completion_list at the
head of the list hanging off dentry->waiters and wait for completion.
* if dentry_unlist() sees non-NULL ->waiters, have it carefully walk
through the completion_list instances in that list, calling complete()
for each.
For now struct completion_list is local to fs/dcache.c; it's obviously
dentry-agnostic, and it can be trivially lifted into linux/completion.h
if somebody finds a reason to do so...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
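The list manipulation described above can be exercised outside the kernel. The sketch below mirrors d_add_waiter()/d_complete_waiters() from this patch, with one deliberate simplification: the kernel's struct completion (sleep + wakeup) is replaced by a plain flag, since the point being illustrated is the on-stack, singly linked waiter list, not the sleeping itself.

```c
#include <stddef.h>

/* flag stand-in for the kernel's struct completion */
struct completion { int done; };

static void init_completion(struct completion *c) { c->done = 0; }
static void complete(struct completion *c) { c->done = 1; }

struct completion_list {
	struct completion_list *next;
	struct completion completion;
};

/* only the member relevant here */
struct dentry { struct completion_list *waiters; };

/* link a waiter (normally living on the waiter's stack) into the list */
static void d_add_waiter(struct dentry *dentry, struct completion_list *p)
{
	init_completion(&p->completion);
	p->next = dentry->waiters;
	dentry->waiters = p;
}

static void d_complete_waiters(struct dentry *dentry)
{
	struct completion_list *v = dentry->waiters;

	if (v) {
		dentry->waiters = NULL;
		while (v) {
			/* read ->next before complete(): in the kernel the
			 * waiter's stack frame may be gone once it is woken;
			 * the ordering is preserved here for fidelity */
			struct completion *r = &v->completion;
			v = v->next;
			complete(r);
		}
	}
}
```

In the real patch, d_add_waiter() runs under the victim's ->d_lock and the waiter then blocks in wait_for_completion(); dentry_unlist() does the walk-and-complete once the dentry is invisible to d_walk().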
---
fs/dcache.c | 79 ++++++++++++++++++++++++++++++++++++++----
include/linux/dcache.h | 18 ++++++++--
2 files changed, 88 insertions(+), 9 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 616a445ec720..0c8faeee02e2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,6 +456,15 @@ static void dentry_unlink_inode(struct dentry * dentry)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_alias);
+ /*
+ * dentry becomes negative, so the space occupied by ->d_alias
+ * belongs to ->waiters now; we could use __hlist_del() instead
+ * of hlist_del_init(), if not for the stunt pulled by nfs
+ * dummy root dentries - positive dentry *not* included into
+ * the alias list of its inode. Open-coding hlist_del_init()
+ * and removing zeroing would be too clumsy...
+ */
+ dentry->waiters = NULL;
raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
@@ -605,6 +614,44 @@ void d_drop(struct dentry *dentry)
}
EXPORT_SYMBOL(d_drop);
+struct completion_list {
+ struct completion_list *next;
+ struct completion completion;
+};
+
+/*
+ * shrink_dcache_tree() needs to be notified when dentry in process of
+ * being evicted finally gets unlisted. Such dentries are
+ * already with negative ->d_count
+ * already negative
+ * already not in in-lookup hash
+ * reachable only via ->d_sib.
+ *
+ * Use ->waiters for a single-linked list of struct completion_list of
+ * waiters.
+ */
+static inline void d_add_waiter(struct dentry *dentry, struct completion_list *p)
+{
+ struct completion_list *v = dentry->waiters;
+ init_completion(&p->completion);
+ p->next = v;
+ dentry->waiters = p;
+}
+
+static inline void d_complete_waiters(struct dentry *dentry)
+{
+ struct completion_list *v = dentry->waiters;
+ if (unlikely(v)) {
+ /* some shrink_dcache_tree() instances are waiting */
+ dentry->waiters = NULL;
+ while (v) {
+ struct completion *r = &v->completion;
+ v = v->next;
+ complete(r);
+ }
+ }
+}
+
static inline void dentry_unlist(struct dentry *dentry)
{
struct dentry *next;
@@ -613,6 +660,7 @@ static inline void dentry_unlist(struct dentry *dentry)
* attached to the dentry tree
*/
dentry->d_flags |= DCACHE_DENTRY_KILLED;
+ d_complete_waiters(dentry);
if (unlikely(hlist_unhashed(&dentry->d_sib)))
return;
__hlist_del(&dentry->d_sib);
@@ -1569,6 +1617,10 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
return D_WALK_QUIT;
}
to_shrink_list(dentry, &data->dispose);
+ } else if (dentry->d_lockref.count < 0) {
+ rcu_read_lock();
+ data->victim = dentry;
+ return D_WALK_QUIT;
}
/*
* We can return to the caller if we have found some (this
@@ -1608,12 +1660,27 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
data.victim = NULL;
d_walk(parent, &data, select_collect2);
if (data.victim) {
- spin_lock(&data.victim->d_lock);
- if (!lock_for_kill(data.victim)) {
- spin_unlock(&data.victim->d_lock);
+ struct dentry *v = data.victim;
+
+ spin_lock(&v->d_lock);
+ if (v->d_lockref.count < 0 &&
+ !(v->d_flags & DCACHE_DENTRY_KILLED)) {
+ struct completion_list wait;
+ // It's busy dying; have it notify us once
+ // it becomes invisible to d_walk().
+ d_add_waiter(v, &wait);
+ spin_unlock(&v->d_lock);
+ rcu_read_unlock();
+ if (!list_empty(&data.dispose))
+ shrink_dentry_list(&data.dispose);
+ wait_for_completion(&wait.completion);
+ continue;
+ }
+ if (!lock_for_kill(v)) {
+ spin_unlock(&v->d_lock);
rcu_read_unlock();
} else {
- shrink_kill(data.victim);
+ shrink_kill(v);
}
}
if (!list_empty(&data.dispose))
@@ -1787,7 +1854,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_HLIST_HEAD(&dentry->d_children);
- INIT_HLIST_NODE(&dentry->d_alias);
+ dentry->waiters = NULL;
INIT_HLIST_NODE(&dentry->d_sib);
if (dentry->d_op && dentry->d_op->d_init) {
@@ -2729,7 +2796,7 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
d_wait = dentry->d_wait;
dentry->d_wait = NULL;
hlist_bl_unlock(b);
- INIT_HLIST_NODE(&dentry->d_alias);
+ dentry->waiters = NULL;
INIT_LIST_HEAD(&dentry->d_lru);
return d_wait;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f939d2ed10a3..19098253f2dd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -88,6 +88,7 @@ union shortname_store {
#define d_lock d_lockref.lock
#define d_iname d_shortname.string
+struct completion_list;
struct dentry {
/* RCU lookup touched fields */
@@ -122,12 +123,23 @@ struct dentry {
struct hlist_node d_sib; /* child of parent list */
struct hlist_head d_children; /* our children */
/*
- * d_alias and d_rcu can share memory
+ * the following members can share memory - their uses are
+ * mutually exclusive.
*/
union {
- struct hlist_node d_alias; /* inode alias list */
- struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
+ /* positives: inode alias list */
+ struct hlist_node d_alias;
+ /* in-lookup ones (all negative, live): hash chain */
+ struct hlist_bl_node d_in_lookup_hash;
+ /* killed ones: (already negative) used to schedule freeing */
struct rcu_head d_rcu;
+ /*
+ * live non-in-lookup negatives: used if shrink_dcache_tree()
+ * races with eviction by another thread and needs to wait for
+ * this dentry to get killed. Remains NULL for almost all
+ * negative dentries.
+ */
+ struct completion_list *waiters;
};
};
--
2.47.3
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
` (3 preceding siblings ...)
2026-04-04 8:07 ` [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
@ 2026-04-09 16:51 ` Jeff Layton
2026-04-09 19:02 ` Al Viro
4 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 16:51 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Sat, 2026-04-04 at 09:07 +0100, Al Viro wrote:
> Changes since v2:
> * separate select_data from the completion list elements - add a separate
> struct completion_list (private to fs/dcache.c at the moment, but it's completely
> independent from struct dentry and can be lifted e.g. into linux/completion.h
> if there are other potential users).
>
> Branch (currently -rc6-based) lives in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> individual patches in followups. Please, review.
>
> Shortlog:
> Al Viro (4):
> for_each_alias(): helper macro for iterating through dentries of given inode
> struct dentry: make ->d_u anonymous
> dcache.c: more idiomatic "positives are not allowed" sanity checks
> get rid of busy-waiting in shrink_dcache_tree()
>
> Diffstat:
> Documentation/filesystems/porting.rst | 10 ++
> fs/affs/amigaffs.c | 2 +-
> fs/ceph/mds_client.c | 4 +-
> fs/dcache.c | 129 +++++++++++++++++++-------
> fs/exportfs/expfs.c | 2 +-
> fs/inode.c | 2 +-
> fs/nfs/dir.c | 4 +-
> fs/nfs/getroot.c | 2 +-
> fs/notify/fsnotify.c | 2 +-
> fs/ocfs2/dcache.c | 2 +-
> fs/overlayfs/dir.c | 2 +-
> fs/smb/client/inode.c | 2 +-
> include/linux/dcache.h | 24 ++++-
> 13 files changed, 140 insertions(+), 47 deletions(-)
>
> Series overview
>
> First a couple of cleanups to reduce the amount of noise:
>
> 1) for_each_alias(): helper macro for iterating through dentries of given inode
>
> Most of the places using d_alias are loops iterating through all aliases of
> a given inode; introduce a helper macro (for_each_alias(dentry, inode))
> and convert open-coded instances of such loops to it.
>
> They are easier to read that way, and it reduces the noise in the next steps.
>
> 2) struct dentry: make ->d_u anonymous
>
> Making ->d_rcu and (then) ->d_child overlapping dates back to 2006; support
> for anonymous unions was only added to gcc in 4.6 (2011), and the minimal gcc
> version wasn't bumped that far until kernel 4.19 (2018).
> These days there's no reason to keep that union named.
>
> Next we get rid of the aforementioned pathologies:
>
> 3) dcache.c: more idiomatic "positives are not allowed" sanity checks
>
> Several functions have BUG_ON/WARN_ON sanity checks that want to verify
> that dentry is not positive and instead of looking at ->d_inode (as we
> do in all other places that check that) they look at ->d_alias.
> Just use the normal helpers instead - that way we no longer even look
> at ->d_alias for negative dentries.
>
> Now that everything is ready for adding a new member and killing busy-wait:
>
> 4) get rid of busy-waiting in shrink_dcache_tree()
>
> If shrink_dcache_tree() runs into a potential victim that is already
> dying, it must wait for that dentry to go away. To avoid busy-waiting
> we need some object to wait on and a way for dentry_unlist() to see that
> we need to be notified.
>
> The obvious place for the object to wait on would be on our stack frame.
> We will store a pointer to that object (struct completion_list) in victim
> dentry; if there's more than one thread wanting to wait for the same
> dentry to finish dying, we'll have their instances linked into a list,
> with reference in dentry pointing to the head of that list.
>
> * new object - struct completion_list. A pair of struct completion and
> pointer to the next instance. That's what shrink_dcache_tree() will wait
> on if needed.
>
> * add a new member (->waiters, opaque pointer to struct completion_list)
> to struct dentry. It is defined for negative live dentries that are
> not in-lookup ones and it will remain NULL for almost all of them.
>
> It does not conflict with ->d_rcu (defined for killed dentries), ->d_alias
> (defined for positive dentries, all live) or ->d_in_lookup_hash (defined
> for in-lookup dentries, all live negative). That allows colocating
> all four members.
>
> * make sure that all places where dentry enters the state where ->waiters
> is defined (live, negative, not-in-lookup) initialize ->waiters to NULL.
>
> * if select_collect2() runs into a dentry that is already dying, have
> its caller insert a local instance of struct completion_list into the
> head of the list hanging off dentry->waiters and wait for completion.
>
> * if dentry_unlist() sees non-NULL ->waiters, have it carefully walk
> through the completion_list instances in that list, calling complete()
> for each.
>
> For now struct completion_list is local to fs/dcache.c; it's obviously
> dentry-agnostic, and it can be trivially lifted into linux/completion.h
> if somebody finds a reason to do so...
>
>
This all looks good, Al. Nice work. I like the new wrappers and anon
union too -- makes the code a lot more readable.
I can also confirm that this fixes the livelock we were seeing as a
precursor to this UAF [1]. I'm still not 100% sure how that livelock
leads to a UAF. Claude thinks it has to do with the livelock making the
loop outlive an RCU grace period.
I did eventually find a vmcore and was able to confirm some things:
- in the freed/reallocated dentry, most of the fields are clobbered,
but the d_lru list pointers seemed to be intact. That tells me that the
sucker was added to the shrink list after being freed and reallocated.
That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
- there was some list corruption too: it looked like there were two
different (on-stack) list heads in the list. There were also some
dentries in there that were outside the shrin
Note that this was an older (v6.13-based) kernel though, and it's
possible there are other unfixed bugs in here.
I'm currently trying to reproduce the UAF using that mdelay() based
fault injection patch on something closer to mainline.
[1]: https://lore.kernel.org/linux-fsdevel/20260406-dcache-warn-v1-1-c665efbc005f@kernel.org/
I reviewed the version in your work.dcache-busy-wait branch, but you
can add this to the lot.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
@ 2026-04-09 19:02 ` Al Viro
2026-04-09 20:10 ` Jeff Layton
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-09 19:02 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, Apr 09, 2026 at 12:51:38PM -0400, Jeff Layton wrote:
> I can also confirm that this fixes the livelock we were seeing as a
> precursor to this UAF [1]. I'm still not 100% sure how that livelock
> leads to a UAF. Claude thinks it has to do with the livelock making the
> loop outlive an RCU grace period.
Claude is full of substance likely to cause CoC complaints when mentioned,
film at 11. I agree with the "livelock opens up some window" part,
what with the stack traces there, but I very much doubt that mechanism
has anything to do with the stuff on a shrink list staying around from
one iteration to another. The thing is, *nothing* is held between the
iterations of loop in there; data.dispose is empty at the end of the loop
body in all cases and the reference in data.victim is not retained
across the iterations either. We do have a weird RCU use pattern there,
all right (unbalanced acquire in select_collect2() paired with drop in
the caller of the caller of select_collect2()), but the scope is neither
long nor extended by the livelock - in that case the function between
shrink_dcache_tree() and select_collect2() (d_walk()) is told to drop
the locks and return to its caller immediately.
What I suspect might be getting opened up is the possibility
of d_invalidate() (e.g. from pathname resolution trying
to walk into /proc/<pid>/...) hitting in the middle of
proc_invalidate_siblings_dcache().
Incidentally, one thing to check would be whether anything on affected
boxen had things mounted under /proc/<pid>/... - that would give
some idea whether detach_mounts() (and namespace_unlock() with its
handling of the list of mountpoints) might be needed in the mix.
> I did eventually find a vmcore and was able to confirm some things:
>
> - in the freed/reallocated dentry, most of the fields are clobbered,
> but the d_lru list pointers seemed to be intact. That tells me that the
> sucker was added to the shrink list after being freed and reallocated.
> That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
>
> - there was some list corruption too: it looked like there were two
> different (on-stack) list heads in the list. There were also some
> dentries in there that were outside the shrin
?
> Note that this was an older (v6.13-based) kernel though, and it's
> possible there are other unfixed bugs in here.
>
> I'm currently trying to reproduce the UAF using that mdelay() based
> fault injection patch on something closer to mainline.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 19:02 ` Al Viro
@ 2026-04-09 20:10 ` Jeff Layton
2026-04-09 21:57 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 20:10 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, 2026-04-09 at 20:02 +0100, Al Viro wrote:
> On Thu, Apr 09, 2026 at 12:51:38PM -0400, Jeff Layton wrote:
>
> > I can also confirm that this fixes the livelock we were seeing as a
> > precursor to this UAF [1]. I'm still not 100% sure how that livelock
> > leads to a UAF. Claude thinks it has to do with the livelock making the
> > loop outlive an RCU grace period.
>
> Claude is full of substance likely to cause CoC complaints when mentioned,
> film at 11. I agree with the "livelock opens up some window" part,
> what with the stack traces there, but I very much doubt that mechanism
> has anything to do with the stuff on a shrink list staying around from
> one iteration to another. The thing is, *nothing* is held between the
> iterations of loop in there; data.dispose is empty at the end of the loop
> body in all cases and the reference in data.victim is not retained
> across the iterations either. We do have a weird RCU use pattern there,
> all right (unbalanced acquire in select_collect2() paired with drop in
> the caller of the caller of select_collect2()), but the scope is neither
> long nor extended by the livelock - in that case the function between
> shrink_dcache_tree() and select_collect2() (d_walk()) is told to drop
> the locks and return to its caller immediately.
>
> What I suspect might be getting opened up is the possibility
> of d_invalidate() (e.g. from pathname resolution trying
> to walk into /proc/<pid>/...) hitting in the middle of
> proc_invalidate_siblings_dcache().
>
> Incidentally, one thing to check would be whether anything on affected
> boxen had things mounted under /proc/<pid>/... - that would give
> some idea whether detach_mounts() (and namespace_unlock() with its
> handling of the list of mountpoints) might be needed in the mix.
>
● had_submounts = false. The d_invalidate code never found submounts, so
detach_mounts() was never called for this dentry. The find_submount /
namespace_unlock path was not involved in this crash.
Summary for Al: On this crashed machine, there were no mounts under
/proc/<pid>/... in any of the 49 mount namespaces checked (from
surviving tasks). The d_invalidate stack frame shows had_submounts =
false, confirming that detach_mounts() and namespace_unlock() were not
in the mix for this crash. The code path was purely
shrink_dcache_parent → shrink_dentry_list → lock_for_kill, without any
mount detachment involved.
> > I did eventually find a vmcore and was able to confirm some things:
> >
> > - in the freed/reallocated dentry, most of the fields are clobbered,
> > but the d_lru list pointers seemed to be intact. That tells me that the
> > sucker was added to the shrink list after being freed and reallocated.
> > That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
> >
> > - there was some list corruption too: it looked like there were two
> > different (on-stack) list heads in the list. There were also some
> > dentries in there that were outside the
...directories being shrunk from processes exiting. Here's what it
found when I told it to validate the integrity of the d_lru list that the
realloc'ed entry was on.
More claude slop from the earlier vmcore analysis:
------------------8<------------------
### Dispose List: Two Merged Lists
The dispose list contains **55 unique entries** and **two stack-based
list heads** forming a single ring:
1. **head_A** at `0xffffc900c0e8fcc0` — crashed thread's `data.dispose`
2. **head_B** at `0xffffc900c0e7fcc0` — another thread's `data.dispose`
(exited, stack freed, exactly 0x10000 bytes apart)
```
head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
```
### d_walk Escaped Its Starting Dentry
`d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
confirmed in stack frame). It should only traverse descendants of 5964.
But the dispose list contains entries from:
- `/proc/4530/task/5962/*` (151 children — sibling of 5964)
- `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
- `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
- `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
- `/proc/pressure/{cpu,io,memory}` — PSI entries
- `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
------------------8<------------------
>
>
> > Note that this was an older (v6.13-based) kernel though, and it's
> > possible there are other unfixed bugs in here.
> >
> > I'm currently trying to reproduce the UAF using that mdelay() based
> > fault injection patch on something closer to mainline.
...and so far it's not trivially reproducible. I can reproduce the
stalls on a recent kernel without your patches, but whatever causes
this UAF is not easily triggerable by just stalling things out.
I think I'm going to have to hunt for more vmcores...
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 20:10 ` Jeff Layton
@ 2026-04-09 21:57 ` Al Viro
2026-04-09 22:38 ` Jeff Layton
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
0 siblings, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-09 21:57 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, Apr 09, 2026 at 04:10:41PM -0400, Jeff Layton wrote:
> head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
> head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
> stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
> ```
Wait a bloody minute; *two* UAF and two mountinfo there? Are the links
in that cyclic list consistent? ->next and ->prev, I mean.
If I understand the notation correctly... are there two different
dentries, both with "mountinfo" for name and PROC_I(_->d_inode)->pid
being PID 1? What are their ->d_parent pointing to? Are they hashed?
> ### d_walk Escaped Its Starting Dentry
> `d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
> confirmed in stack frame). It should only traverse descendants of 5964.
> But the dispose list contains entries from:
> - `/proc/4530/task/5962/*` (151 children — sibling of 5964)
> - `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
> - `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
> - `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
> - `/proc/pressure/{cpu,io,memory}` — PSI entries
> - `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
Not at all obvious; there's a list from another thread mixed into that,
and we've no idea what had the root been for that one. For that matter,
I'd check ->d_flags on the entries, just to verify that those are
from shrink lists and not from LRU - the fact that these UAF still
have pointers to plausible dentries does not mean they are from the
same moment in time; if dentry in question had been on a different
list before getting freed... That's why the question about ->prev
and ->next consistency.
And seeing that procfs has zero callers of d_splice_alias() or d_move(),
I would expect ->d_parent on all dentries in there to be constant over
the entire lifetime; would be rather hard for d_walk() to escape in
such conditions...
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 21:57 ` Al Viro
@ 2026-04-09 22:38 ` Jeff Layton
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
1 sibling, 0 replies; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 22:38 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, 2026-04-09 at 22:57 +0100, Al Viro wrote:
> On Thu, Apr 09, 2026 at 04:10:41PM -0400, Jeff Layton wrote:
>
> > head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
> > head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
> > stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
> > ```
>
I think the above is BS. Claude says:
In the earlier analysis I walked the list both forward from
head_B (the stack entry) and backward from head_A. That was misleading
— there's only one mountinfo node. The ring passes through it once. The
duplicate in my listing was an artifact of walking the ring from two
entry points and not deduplicating.
> Wait a bloody minute; *two* UAF and two mountinfo there? Are the links
> in that cyclic list consistent? ->next and ->prev, I mean.
No:
The list was NOT a proper ring:
- FWD (head.next): LOOP at entry 1 — immediately hits a self-loop (the processed "status" dentry)
- BWD (head.prev): LOOP at entry 55 — walks 55 entries then hits a cycle
> If I understand the notation correctly... are there two different
> dentries, both with "mountinfo" for name and PROC_I(_->d_inode)->pid
> being PID 1? What are their ->d_parent pointing to? Are they hashed?
>
> > ### d_walk Escaped Its Starting Dentry
> > `d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
> > confirmed in stack frame). It should only traverse descendants of 5964.
> > But the dispose list contains entries from:
> > - `/proc/4530/task/5962/*` (151 children — sibling of 5964)
> > - `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
> > - `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
> > - `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
> > - `/proc/pressure/{cpu,io,memory}` — PSI entries
> > - `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
>
> Not at all obvious; there's a list from another thread mixed into that,
> and we've no idea what had the root been for that one. For that matter,
> I'd check ->d_flags on the entries, just to verify that those are
> from shrink lists and not from LRU - the fact that these UAF still
> have pointers to plausible dentries does not mean they are from the
> same moment in time; if dentry in question had been on a different
> list before getting freed... That's why the question about ->prev
> and ->next consistency.
>
> And seeing that procfs has zero callers of d_splice_alias() or d_move(),
> I would expect ->d_parent on all dentries in there to be constant over
> the entire lifetime; would be rather hard for d_walk() to escape in
> such conditions...
Definitely not legitimately. I suspect some other sort of mem
corruption. /proc/1/mountinfo is on the list though:
mountinfo dentries on dispose list: 1
/proc/1/mountinfo (dentry 0xffff88823deef800)
d_parent: 0xffff891373c06a80 "1" -> "/" (root)
hashed: yes (d_hash.pprev=0xffff893d337264d0)
PROC_I->pid nr: 3217077
d_inode: 0xffff88ea55a44508
Claude's theory (maybe garbage):
PID 1's mountinfo being on the list is actually a strong clue — it's
a dentry that would be on the global LRU (negative or zero-count
after someone closed /proc/1/mountinfo), and drop_caches would pull
it off the LRU via prune_dcache_sb. That's likely how it ended up on
Thread B's dispose list.
Here's a cursory comparative analysis across the 3 vmcores I have so
far:
Comparative analysis across three vmcores
=========================================
Kernel               6.13.2-0_fbk5-hf2  6.13.2-0_fbk12-hf1  6.13.2-0_fbk5-hf2
Date                 Mar 31             Apr 9               Apr 2
Process              SREventBase61      tokio-runtime-w     TTLSFWorkerU151
d_invalidate target  /task/5964         /task/3555587       /task/1102
had_submounts        False              False               False
UAF slab             kmalloc-96         kmalloc-64          kmalloc-96
UAF d_flags          0x0                0x81e73ce0          0x0
UAF d_sb             0x0                0x0                 0x0
UAF d_lockref.count  0                  -14080              -771573072
Consistent findings across all three cores:
- had_submounts is False in every case. detach_mounts() and
namespace_unlock() are not involved in any of these crashes.
- No mounts under /proc/<pid>/... in any mount namespace (checked
on the od0103 core across 49 namespaces, zero proc submounts).
- All three are in the same call path:
release_task -> proc_flush_pid -> d_invalidate ->
shrink_dcache_parent -> shrink_dentry_list -> spin_lock (UAF)
- The UAF'd dentry is in a non-dentry slab in every case (kmalloc-64
or kmalloc-96), confirming the dentry was freed and its page reused
by a different slab cache before shrink_dentry_list tried to lock it.
- Different applications triggered the crash (Rust async runtime,
C++ worker, Java-like worker) -- not application-specific.
^ permalink raw reply	[flat|nested] 63+ messages in thread
* [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-09 21:57 ` Al Viro
2026-04-09 22:38 ` Jeff Layton
@ 2026-04-10 8:48 ` Al Viro
2026-04-10 11:18 ` Jeff Layton
2026-04-10 15:25 ` Linus Torvalds
1 sibling, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 8:48 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
[this may or may not be the source of UAFs caught by Jeff and by Helge]
lock_for_kill() losing a race to eviction by another thread will end up
returning false (correctly - the caller should *not* evict dentry in
that case), with ->d_lock held and rcu read-critical area still unbroken,
so the caller can safely drop the locks, provided that it's done in the
right order - ->d_lock should be dropped before rcu_read_unlock().
Unfortunately, 3 of 4 callers did it the other way round and for two of them
(finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
The third (shrink_dentry_list()) had been safe for other reasons (dentry being
on the shrink list => the other thread wouldn't even schedule its freeing
until observing dentry off the shrink list while holding ->d_lock), but it's
simpler to make it a general rule - in case of lock_for_kill() failure
->d_lock must be dropped before rcu_read_unlock().
UAF scenario was basically this:
dput() drops the last reference to foo/bar and evicts it.
The reference it held to foo happens to be the last one.
When trylock on ->i_lock of parent's inode fails, we
(under rcu_read_lock()) drop ->d_lock and take the locks in
the right order; while we'd been spinning on ->i_lock somebody
else comes and evicts the parent.
Noticing non-zero (negative) refcount we decide there's nothing
left to do. And had we only dropped parent's ->d_lock before
rcu_read_unlock(), everything would be fine. Unfortunately,
doing that the other way round allows rcu-scheduled freeing of
parent to proceed before we drop its ->d_lock.
The same applies if instead of final dput() of foo/bar it gets evicted
by shrink_dcache_parent() or memory pressure, again taking out the last
reference to that used to pin its parent.
Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 0c8faeee02e2..b1c163f20db6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
*
* Return false if dentry is busy. Otherwise, return true and have
* that dentry's inode locked.
+ *
+ * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
*/
static bool lock_for_kill(struct dentry *dentry)
@@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
}
rcu_read_lock();
}
- rcu_read_unlock();
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
}
/*
@@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
do {
rcu_read_unlock();
victim = __dentry_kill(victim);
+ if (!victim)
+ return;
rcu_read_lock();
- } while (victim && lock_for_kill(victim));
+ } while (lock_for_kill(victim));
+ spin_unlock(&victim->d_lock);
rcu_read_unlock();
- if (victim)
- spin_unlock(&victim->d_lock);
}
void shrink_dentry_list(struct list_head *list)
@@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
rcu_read_lock();
if (!lock_for_kill(dentry)) {
bool can_free;
- rcu_read_unlock();
d_shrink_del(dentry);
can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
if (can_free)
dentry_free(dentry);
continue;
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
@ 2026-04-10 11:18 ` Jeff Layton
2026-04-10 11:56 ` Jeff Layton
2026-04-10 15:25 ` Linus Torvalds
1 sibling, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 11:18 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 09:48 +0100, Al Viro wrote:
> [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> lock_for_kill() losing a race to eviction by another thread will end up
> returning false (correctly - the caller should *not* evict dentry in
> that case), with ->d_lock held and rcu read-critical area still unbroken,
> so the caller can safely drop the locks, provided that it's done in the
> right order - ->d_lock should be dropped before rcu_read_unlock().
>
> Unfortunately, 3 of 4 callers did it the other way round and for two of them
> (finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
> The third (shrink_dentry_list()) had been safe for other reasons (dentry being
> on the shrink list => the other thread wouldn't even schedule its freeing
> until observing dentry off the shrink list while holding ->d_lock), but it's
> simpler to make it a general rule - in case of lock_for_kill() failure
> ->d_lock must be dropped before rcu_read_unlock().
>
> UAF scenario was basically this:
> dput() drops the last reference to foo/bar and evicts it.
> The reference it held to foo happens to be the last one.
> When trylock on ->i_lock of parent's inode fails, we
> (under rcu_read_lock()) drop ->d_lock and take the locks in
> the right order; while we'd been spinning on ->i_lock somebody
> else comes and evicts the parent.
> Noticing non-zero (negative) refcount we decide there's nothing
> left to do. And had we only dropped parent's ->d_lock before
> rcu_read_unlock(), everything would be fine. Unfortunately,
> doing that the other way round allows rcu-scheduled freeing of
> parent to proceed before we drop its ->d_lock.
> The same applies if instead of final dput() of foo/bar it gets evicted
> by shrink_dcache_parent() or memory pressure, again taking out the last
> reference to that used to pin its parent.
>
> Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 0c8faeee02e2..b1c163f20db6 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
> *
> * Return false if dentry is busy. Otherwise, return true and have
> * that dentry's inode locked.
> + *
> + * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
> */
>
> static bool lock_for_kill(struct dentry *dentry)
> @@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
> }
> rcu_read_lock();
> }
> - rcu_read_unlock();
> spin_unlock(&dentry->d_lock);
> + rcu_read_unlock();
> }
>
> /*
> @@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
> do {
> rcu_read_unlock();
> victim = __dentry_kill(victim);
> + if (!victim)
> + return;
> rcu_read_lock();
> - } while (victim && lock_for_kill(victim));
> + } while (lock_for_kill(victim));
> + spin_unlock(&victim->d_lock);
> rcu_read_unlock();
> - if (victim)
> - spin_unlock(&victim->d_lock);
> }
>
> void shrink_dentry_list(struct list_head *list)
> @@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
> rcu_read_lock();
> if (!lock_for_kill(dentry)) {
> bool can_free;
> - rcu_read_unlock();
> d_shrink_del(dentry);
> can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
> spin_unlock(&dentry->d_lock);
> + rcu_read_unlock();
> if (can_free)
> dentry_free(dentry);
> continue;
I'm a little skeptical that this explains the UAF we've seen. It looks
like this would mostly affect the parent dentry rather than the
children on the shrink list, but parents can be children too, so we
can't rule it out.
Either way, releasing the rcu_read_lock() protecting a dentry you're
holding a lock on seems like a bad idea.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 11:18 ` Jeff Layton
@ 2026-04-10 11:56 ` Jeff Layton
0 siblings, 0 replies; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 11:56 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 07:18 -0400, Jeff Layton wrote:
> On Fri, 2026-04-10 at 09:48 +0100, Al Viro wrote:
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > lock_for_kill() losing a race to eviction by another thread will end up
> > returning false (correctly - the caller should *not* evict dentry in
> > that case), with ->d_lock held and rcu read-critical area still unbroken,
> > so the caller can safely drop the locks, provided that it's done in the
> > right order - ->d_lock should be dropped before rcu_read_unlock().
> >
> > Unfortunately, 3 of 4 callers did it the other way round and for two of them
> > (finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
> > The third (shrink_dentry_list()) had been safe for other reasons (dentry being
> > on the shrink list => the other thread wouldn't even schedule its freeing
> > until observing dentry off the shrink list while holding ->d_lock), but it's
> > simpler to make it a general rule - in case of lock_for_kill() failure
> > ->d_lock must be dropped before rcu_read_unlock().
> >
> > UAF scenario was basically this:
> > dput() drops the last reference to foo/bar and evicts it.
> > The reference it held to foo happens to be the last one.
> > When trylock on ->i_lock of parent's inode fails, we
> > (under rcu_read_lock()) drop ->d_lock and take the locks in
> > the right order; while we'd been spinning on ->i_lock somebody
> > else comes and evicts the parent.
> > Noticing non-zero (negative) refcount we decide there's nothing
> > left to do. And had we only dropped parent's ->d_lock before
> > rcu_read_unlock(), everything would be fine. Unfortunately,
> > doing that the other way round allows rcu-scheduled freeing of
> > parent to proceed before we drop its ->d_lock.
> > The same applies if instead of final dput() of foo/bar it gets evicted
> > by shrink_dcache_parent() or memory pressure, again taking out the last
> > reference to that used to pin its parent.
> >
> > Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 0c8faeee02e2..b1c163f20db6 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
> > *
> > * Return false if dentry is busy. Otherwise, return true and have
> > * that dentry's inode locked.
> > + *
> > + * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
> > */
> >
> > static bool lock_for_kill(struct dentry *dentry)
> > @@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
> > }
> > rcu_read_lock();
> > }
> > - rcu_read_unlock();
> > spin_unlock(&dentry->d_lock);
> > + rcu_read_unlock();
> > }
> >
> > /*
> > @@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
> > do {
> > rcu_read_unlock();
> > victim = __dentry_kill(victim);
> > + if (!victim)
> > + return;
> > rcu_read_lock();
> > - } while (victim && lock_for_kill(victim));
> > + } while (lock_for_kill(victim));
> > + spin_unlock(&victim->d_lock);
> > rcu_read_unlock();
> > - if (victim)
> > - spin_unlock(&victim->d_lock);
> > }
> >
> > void shrink_dentry_list(struct list_head *list)
> > @@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
> > rcu_read_lock();
> > if (!lock_for_kill(dentry)) {
> > bool can_free;
> > - rcu_read_unlock();
> > d_shrink_del(dentry);
> > can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
> > spin_unlock(&dentry->d_lock);
> > + rcu_read_unlock();
> > if (can_free)
> > dentry_free(dentry);
> > continue;
>
> I'm a little skeptical that this explains the UAF we've seen. It looks
> like this would mostly affect the parent dentry rather than the
> children on the shrink list, but parents can be children too, so we
> can't rule it out.
>
> Either way, releasing the rcu_read_lock() protecting a dentry you're
> holding a lock on seems like a bad idea.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
FWIW, here's Claude's review:
Summary
-------
The patch fixes a use-after-free caused by incorrect lock ordering
when lock_for_kill() returns false (race lost to another thread's
eviction). Three of four callers were dropping rcu_read_unlock()
before spin_unlock(&dentry->d_lock), which allowed RCU-scheduled
freeing of the dentry to proceed while d_lock was still held on
the freed object.
Changes
-------
finish_dput() (line 935-938):
Swaps the unlock order so d_lock is dropped before rcu_read_unlock().
This is reachable from dput() and d_make_discardable(). The UAF
scenario described in the commit message applies directly here:
dput() evicts foo/bar, the parent's refcount drops to zero,
lock_for_kill() fails on the parent (trylock on i_lock failed and
during the retry another thread evicted the parent), then
rcu_read_unlock() lets the parent be freed while d_lock is still
held on it.
shrink_kill() (lines 1193-1203):
Restructured to pull the !victim early return out of the while
condition. When __dentry_kill() returns NULL (parent's refcount
didn't drop to zero), the function now returns immediately without
touching RCU at all -- correct, since rcu_read_unlock() had already
been called before __dentry_kill() was entered. When lock_for_kill()
fails, 
the loop exits and d_lock is dropped before rcu_read_unlock().
This is the path most likely responsible for the production crashes
in T262643929: shrink_dentry_list() calls shrink_kill(), which
walks up to parent dentries via __dentry_kill().
shrink_dentry_list() (lines 1213-1218):
Swaps the unlock order in the lock_for_kill() failure path. As
the commit message notes, this was already safe for other reasons:
the dentry is on a shrink list (DCACHE_SHRINK_LIST set), so the
other thread's __dentry_kill() would see that flag and set
can_free = false, preventing dentry_free() from being called.
Changed for consistency with the general rule.
lock_for_kill() comment:
Documents the ordering requirement: on failure, callers must drop
d_lock before rcu_read_unlock().
Production crash relevance
--------------------------
The shrink_kill() fix directly addresses the UAF pattern observed
in three production vmcores (od0103.dkl2, od5175.prn6,
twshared90354.17.frc2). In all three, the crashed thread was in
shrink_dentry_list() -> spin_lock() on a dentry whose slab page had
been freed and reused by kmalloc-64 or kmalloc-96.
The scenario: shrink_dentry_list() kills a dentry via shrink_kill(),
__dentry_kill() returns the parent with d_lock held and refcount
decremented to zero, lock_for_kill() fails on the parent because
another thread raced in and evicted it. With the old code, the
rcu_read_unlock() in shrink_kill() allowed the parent's
call_rcu-scheduled free to complete, and the subsequent
spin_unlock(&victim->d_lock) touched freed memory.
Whether this is the full explanation for the list corruption
observed in the vmcores (LRU dentries spliced onto the dispose
list) remains an open question. The lock ordering fix prevents
the immediate UAF on the parent dentry, but the mechanism by
which entire LRU chains ended up on the dispose list may involve
additional factors.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
2026-04-10 11:18 ` Jeff Layton
@ 2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
` (3 more replies)
1 sibling, 4 replies; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 15:25 UTC (permalink / raw)
To: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki
Cc: Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
[ Adding RCU maintainers to the participants: see
https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
and the fairly long thread associated with it for context ]
On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> [this may or may not be the source of UAFs caught by Jeff and by Helge]
Hmm.
I think this patch may indeed fix the problem, and I don't mind how it looks.
But while I think the patch looks fine, I am still quite unhappy about
it if it matters - because we have very much documented that spinlocks
in themselves are also RCU read locks:
Note that anything that disables bottom halves, preemption,
or interrupts also enters an RCU read-side critical section.
Acquiring a spinlock also enters an RCU read-side critical
sections, even for spinlocks that do not disable preemption,
as is the case in kernels built with CONFIG_PREEMPT_RT=y.
Sleeplocks do *not* enter RCU read-side critical sections.
so *if* this makes a difference, I think it's actually implying that
there's something wrong with our "rcu_read_unlock()" implementation,
and that it doesn't nest properly.
Because our documentation also makes it very clear that this should
all work as-is, and your patch should be a complete no-op. Just a few
lines later in that core RCU doc, we have
Note that RCU read-side critical sections may be nested and/or
overlapping.
so the order of the spin_unlock() and the rcu_read_unlock() *should*
be entirely immaterial, and the order of unlocking simply shouldn't
matter.
So it may indeed be that relying on the ordering of RCU read unlock
matters for the case of explicit unlocks and the implicit unlocks by a
spinlock, but that's not great.
Does the rcu_read_unlock() code perhaps only check the RCU count, not
the atomicity count? That would explain it, and looking at
__rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
case (rcu_preempt_read_exit() seems to only check
rcu_read_lock_nesting).
This is most likely dependent on kernel config options, and it may all
be unavoidable because RCU might not even have a way to tell that it's
still in a critical region without preemption counts etc.
But if it's unavoidable, we need to update the docs about this gotcha,
and we need to have some tooling scan our existing code for this
documentation change.
RCU people - what do you think?
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
@ 2026-04-10 15:57 ` Al Viro
2026-04-10 16:27 ` Boqun Feng
` (2 subsequent siblings)
3 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara,
Helge Deller
[belatedly Cc'ing Helge on this one]
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
FWIW, the thread starting at https://lore.kernel.org/all/adQO8B0rkR2ftusI@p100/
might also be relevant - another UAF report in the same area, this one
with more(?) reliable reproducer, parisc-specific.
> and the fairly long thread associated with it for context ]
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
@ 2026-04-10 16:27 ` Boqun Feng
2026-04-10 17:31 ` Linus Torvalds
2026-04-10 17:32 ` Paul E. McKenney
2026-04-10 18:52 ` Al Viro
3 siblings, 1 reply; 63+ messages in thread
From: Boqun Feng @ 2026-04-10 16:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> sections, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
I agree. So there could be a lock/RCU bug here ;-)
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
Notice in rcu_read_unlock_special(), atomicity/preemption count is
checked in two ways:
a) if it's called with the preemption count being 0, then good, a quiescent
state will be reported to unblock the RCU grace period.
b) if it's called with preemption disabled, there are two cases:
1. use_softirq is true and either we are in a hard irq or we need an
expedited GP; the RCU softirq will be raised, but in it, rcu_core() will
guard the rcu_preempt_deferred_qs() with preempt count checking.
2. otherwise, RCU will just rely on __schedule() to call
rcu_note_context_switch() to report a QS, hence the preempt count
is also respected.
@Jeff, just want to confirm, this UAF can be reproduced without
PREEMPT_RT=y, right?
Regards,
Boqun
> This is most likely dependent on kernel config options, and it may all
> be unavoidable becasue RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
>
> Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 16:27 ` Boqun Feng
@ 2026-04-10 17:31 ` Linus Torvalds
2026-04-10 18:11 ` Paul E. McKenney
2026-04-10 18:21 ` Jeff Layton
0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 17:31 UTC (permalink / raw)
To: Boqun Feng
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
>
> Notice in rcu_read_unlock_special(), atomicity/preemption count is
> checked in two ways:
Yeah, I didn't go that deep ... That code is inscrutable.
Raising a softirq looks pretty strange in there, since there's
absolutely no reason to believe that the preemption-off region has
ended before the softirq is triggered. It makes sense for the "we're
in hardirq context", but it also does it for that (needs_exp &&
!irqs_were_disabled()) case
So then it depends on the softirq checking it *again*. That all seems
a bit strange.
But it looks like rcu_core_si() does do that, and then falls back to
if (rcu_preempt_need_deferred_qs())
set_need_resched_current();
and it feels like maybe the rcu_read_unlock_special() should just have
done that itself instead of playing softirq games.
Maybe it's some odd optimization, but going through the softirq code
doesn't _feel_ like it's the right thing to do.
I don't know this code well enough to judge.
Let's wait to see if Al's patch actually makes any difference. It
*shouldn't* make any difference, and maybe it doesn't. No need to
start chasing RCU bugs that may not even be there.
Jeff?
> @Jeff, just want to confirm, this UAF can be reproduced without
> PREEMPT_RT=y, right?
I haven't seen Jeff's config, but some of the syzbot reports that
*may* be related have configs that look pretty normal and didn't have
PREEMPT_RT.
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:31 ` Linus Torvalds
@ 2026-04-10 18:11 ` Paul E. McKenney
2026-04-10 18:21 ` Jeff Layton
1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 18:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Boqun Feng, Al Viro, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara,
Sebastian Andrzej Siewior
On Fri, Apr 10, 2026 at 10:31:32AM -0700, Linus Torvalds wrote:
> On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
> >
> > Notice in rcu_read_unlock_special(), atomicity/preemption count is
> > checked in two ways:
>
> Yeah, I didn't go that deep ... That code is inscrutable.
We are working to make it better, but this will be a long process.
> Raising a softirq looks pretty strangein there, since there's
> absolutely no reason to believe that the preemption-off region has
> ended before the softirq is triggered. It makes sense for the "we're
> in hardirq context", but it also does it for that (needs_exp &&
> !irqs_were_disabled()) case
>
> So then it depends on the softirq checking it *again*. That all seems
> a bit strange.
>
> But it looks like rcu_core_si() does do that, and then falls back to
>
> if (rcu_preempt_need_deferred_qs())
> set_need_resched_current();
>
> and it feels like maybe the rcu_read_unlock_special() should just have
> done that itself instead of playing softirq games.
>
> Maybe it's some off optimization, but going through the softirq code
> doesn't _feel_ like it's the right thing to do.
>
> I don't know this code well enough to judge.
We could quite easily add another leg to the inner "if" statement
in rcu_read_unlock_special() so that it does raise_softirq_irqoff()
if in_hardirq() and set_need_resched_current() otherwise, as in the
(untested, probably doesn't even build) first patch at the end of this
email. My feeling is that it is not worth it, but on the other hand,
my idea of "simplicity" might or might not serve the people reading
this code.
> Let's wait to see if Al's patch actually makes any difference. It
> *shouldn't* make any difference, and maybe it doesn't. No need to
> start chasing RCU bugs that may not even be there.
Fair enough!
> Jeff?
>
> > @Jeff, just want to confirm, this UAF can be reproduced without
> > PREEMPT_RT=y, right?
>
> I haven't seen Jeff's config, but some of the syzbot reports that
> *may* be related have configs that look pretty normal and didn't have
> PREEMPT_RT.
And I am concerned about PREEMPT_RT, please see the second patch below.
But Sebastian will no doubt tell me the error of my ways. My question is
"why is it safe to exit the RCU reader before the lock is fully released".
Sebastian's answer might well be that all the valid use cases for
combining locking and RCU have a lock acquisition on the free path.
Whether I believe that or not, I have no idea. ;-)
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 95ad967adcf3c..066b6fe6b6a3f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -749,7 +749,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
// GP in flight or a potential need to deboost.
if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
rdp->defer_qs_pending = DEFER_QS_PENDING;
- raise_softirq_irqoff(RCU_SOFTIRQ);
+ if (in_hardirq())
+ raise_softirq_irqoff(RCU_SOFTIRQ);
+ else
+ set_need_resched_current();
}
} else {
// Enabling BH or preempt does reschedule, so...
------------------------------------------------------------------------
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index db1e11b45de67..fbf4deba4029c 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -79,10 +79,10 @@ void __sched rt_spin_unlock(spinlock_t *lock) __releases(RCU)
{
spin_release(&lock->dep_map, _RET_IP_);
migrate_enable();
- rcu_read_unlock();
if (unlikely(!rt_mutex_cmpxchg_release(&lock->lock, current, NULL)))
rt_mutex_slowunlock(&lock->lock);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_spin_unlock);
@@ -262,17 +262,17 @@ void __sched rt_read_unlock(rwlock_t *rwlock) __releases(RCU)
{
rwlock_release(&rwlock->dep_map, _RET_IP_);
migrate_enable();
- rcu_read_unlock();
rwbase_read_unlock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_read_unlock);
void __sched rt_write_unlock(rwlock_t *rwlock) __releases(RCU)
{
rwlock_release(&rwlock->dep_map, _RET_IP_);
- rcu_read_unlock();
migrate_enable();
rwbase_write_unlock(&rwlock->rwbase);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_write_unlock);
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:31 ` Linus Torvalds
2026-04-10 18:11 ` Paul E. McKenney
@ 2026-04-10 18:21 ` Jeff Layton
2026-04-10 19:19 ` Al Viro
1 sibling, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 18:21 UTC (permalink / raw)
To: Linus Torvalds, Boqun Feng
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, linux-fsdevel,
Christian Brauner, Jan Kara, Nikolay Borisov, Max Kellermann,
Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 10:31 -0700, Linus Torvalds wrote:
> On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
> >
> > Notice in rcu_read_unlock_special(), atomicity/preemption count is
> > checked in two ways:
>
> Yeah, I didn't go that deep ... That code is inscrutable.
>
> Raising a softirq looks pretty strangein there, since there's
> absolutely no reason to believe that the preemption-off region has
> ended before the softirq is triggered. It makes sense for the "we're
> in hardirq context", but it also does it for that (needs_exp &&
> !irqs_were_disabled()) case
>
> So then it depends on the softirq checking it *again*. That all seems
> a bit strange.
>
> But it looks like rcu_core_si() does do that, and then falls back to
>
> if (rcu_preempt_need_deferred_qs())
> set_need_resched_current();
>
> and it feels like maybe the rcu_read_unlock_special() should just have
> done that itself instead of playing softirq games.
>
> Maybe it's some off optimization, but going through the softirq code
> doesn't _feel_ like it's the right thing to do.
>
> I don't know this code well enough to judge.
>
> Let's wait to see if Al's patch actually makes any difference. It
> *shouldn't* make any difference, and maybe it doesn't. No need to
> start chasing RCU bugs that may not even be there.
>
> Jeff?
>
> > @Jeff, just want to confirm, this UAF can be reproduced without
> > PREEMPT_RT=y, right?
>
> I haven't seen Jeff's config, but some of the syzbot reports that
> *may* be related have configs that look pretty normal and didn't have
> PREEMPT_RT.
>
>
I've not been able to reproduce the UAF at all.
It was possible to reproduce the hang that we were seeing by injecting
an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
other series seems to have fixed that.
The UAF seems to be dependent on those hangs somehow, but we were never
able to get one to fire from a reproducer attempt. They just
occasionally pop up when these stalls occur.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:21 ` Jeff Layton
@ 2026-04-10 19:19 ` Al Viro
2026-04-10 19:32 ` Jeff Layton
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 19:19 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, Boqun Feng, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 02:21:39PM -0400, Jeff Layton wrote:
> I've not been able to reproduce the UAF at all.
>
> It was possible to reproduce the hang that we were seeing by injecting
> an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
> other series seems to have fixed that.
Elimination of busy-wait is likely to hide whatever's going on, so it
shouldn't be there when hunting for reproducer.
> The UAF seems to be dependent on those hangs somehow, but we were never
> able to get one to fire from a reproducer attempt. They just
> occasionally pop up when these stalls occur.
Not even with mdelay() stuck in there? FWIW, another UAF report in the
same general area is parisc buildd running into that when building
a specific package (webkit2gtk); Debian kernels 6.16..6.19 (at least),
no idea which config it is...
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:19 ` Al Viro
@ 2026-04-10 19:32 ` Jeff Layton
2026-04-10 21:13 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 19:32 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Boqun Feng, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 20:19 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 02:21:39PM -0400, Jeff Layton wrote:
>
> > I've not been able to reproduce the UAF at all.
> >
> > It was possible to reproduce the hang that we were seeing by injecting
> > an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
> > other series seems to have fixed that.
>
> Elimination of busy-wait is likely to hide whatever's going on, so it
> shouldn't be there when hunting for reproducer.
>
This was before that. I know Gustavo also tried to reproduce it and he
was able to hit the hang but no UAF.
> > The UAF seems to be dependent on those hangs somehow, but we were never
> > able to get one to fire from a reproducer attempt. They just
> > occasionally pop up when these stalls occur.
>
> Not even with mdelay() stuck in there? FWIW, another UAF report in the
> same general area is parisc buildd running into that when building
> a specific package (webkit2gtk); Debian kernels 6.16..6.19 (at least),
> no idea which config it is...
Yep, not even with that. One thing that Claude pointed out: doing an
mdelay() might prevent it from making any progress freeing dentries.
I'm experimenting now with one that only randomly does a delay, but so
far that hasn't turned up anything. I'll let you know if that changes.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:32 ` Jeff Layton
@ 2026-04-10 21:13 ` Calvin Owens
2026-04-10 21:24 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 21:13 UTC (permalink / raw)
To: Jeff Layton
Cc: Al Viro, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> Yep, not even with that. One thing that claude pointed out: doing an
> mdelay() might prevent it from making any progress freeing dentries.
> I'm experimenting now with one that only randomly does a delay, but so
> far that hasn't turned up anything. I'll let you know if that changes.
Hi Jeff,
I've been poking at this one too. With the attached reproducer, the
waiter will get stuck in an infinite loop apparently in d_walk() with a
stock 7.0-rc7 (in my case with PREEMPT_LAZY):
d_walk
list_empty (inlined)
shrink_dcache_tree
d_invalidate
proc_invalidate_siblings_dcache
release_task
wait_task_zombie (inlined)
wait_consider_task
do_wait_thread (inlined)
__do_wait
do_wait
kernel_wait4
__do_sys_wait4 (inlined)
__se_sys_wait4 (inlined)
__x64_sys_wait4
do_syscall_64
entry_SYSCALL_64_after_hwframe
__internal_syscall_cancel
__syscall_cancel
__wait4 (inlined)
os_wait_impl (inlined)
os_wait
_Py_LeaveRecursiveCallTstate (inlined)
cfunction_vectorcall_NOARGS
_PyObject_VectorcallTstate (inlined)
PyObject_Vectorcall
It doesn't happen 100% of the time, but it's never taken more than three
attempts on two different machines. I've SIGKILL'd it and waited over an
hour, it really never comes back. Making NR_FORKS larger than 3 was
counterproductive on my 32-thread machines, but YMMV of course.
I haven't had time to dig more yet, but wanted to share the reproducer,
hopefully it yields something useful on your end.
I don't think this is Al's livelock, because only one thread is
spinning on an otherwise idle machine... but I apologize if it is and
it's going over my head.
Cheers,
Calvin
---
#!/usr/bin/python3
# sudo prlimit --nofile=10000000:10000000 ./repro.py
import os
NR_FORKS = 3
for _ in range(NR_FORKS):
    pid = os.fork()
    if pid == 0:
        break

for dirpath, _, filenames in os.walk("/proc"):
    for fn in filenames:
        try:
            os.open(os.path.join(dirpath, fn), os.O_RDONLY)
        except:
            pass

if pid == 0:
    os._exit(0)

while True:
    try:
        os.wait()
    except ChildProcessError:
        os._exit(0)
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 21:13 ` Calvin Owens
@ 2026-04-10 21:24 ` Al Viro
2026-04-10 22:15 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 21:24 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > Yep, not even with that. One thing that claude pointed out: doing an
> > mdelay() might prevent it from making any progress freeing dentries.
> > I'm experimenting now with one that only randomly does a delay, but so
> > far that hasn't turned up anything. I'll let you know if that changes.
>
> Hi Jeff,
>
> I've been poking at this one too. With the attached reproducer, the
> waiter will get stuck in an infinite loop apparently in d_walk() with a
> stock 7.0-rc7 (in my case with PREEMPT_LAZY):
Interesting... I'll try it in a bit, but just in case - could you check if
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
steps into the same thing on your setup?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 21:24 ` Al Viro
@ 2026-04-10 22:15 ` Calvin Owens
2026-04-10 23:05 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 22:15 UTC (permalink / raw)
To: Al Viro
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > Yep, not even with that. One thing that claude pointed out: doing an
> > > mdelay() might prevent it from making any progress freeing dentries.
> > > I'm experimenting now with one that only randomly does a delay, but so
> > > far that hasn't turned up anything. I'll let you know if that changes.
> >
> > Hi Jeff,
> >
> > I've been poking at this one too. With the attached reproducer, the
> > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
>
> Interesting... I'll try it in a bit, but just in case - could you check if
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> steps into the same thing on your setup?
Hi Al,
It still reproduces on that branch in the exact same way.
I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
spins impossible to trigger with my reproducer, does my explanation make
any sense to you? Apologies if I'm off in the weeds...
Thanks,
Calvin
---
From: Calvin Owens <calvin@wbinvd.org>
Subject: [PATCH] fs/proc: Implement d_weak_revalidate()
I think this is necessary because ND_JUMPED can happen in /proc when
the magic symlinks in /proc/$/fd/* point to files in /proc. If true,
this is a problem, because the default behavior for d_weak_revalidate()
is to assume the jumped-to dentries are always valid.
Fix by checking the in_use count in d_weak_revalidate() for procfs, the
same way it is checked in procfs d_revalidate() today.
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
fs/proc/generic.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 501889856461..359176e5eeb3 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -216,15 +216,26 @@ void proc_free_inum(unsigned int inum)
 	ida_free(&proc_inum_ida, inum - PROC_DYNAMIC_FIRST);
 }
 
+static int __proc_misc_revalidate(struct dentry *dentry)
+{
+	if (atomic_read(&PDE(d_inode(dentry))->in_use) < 0)
+		return 0; /* revalidate */
+	return 1;
+}
+
 static int proc_misc_d_revalidate(struct inode *dir, const struct qstr *name,
 				  struct dentry *dentry, unsigned int flags)
 {
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
-	if (atomic_read(&PDE(d_inode(dentry))->in_use) < 0)
-		return 0; /* revalidate */
-	return 1;
+	return __proc_misc_revalidate(dentry);
+}
+
+static int proc_misc_d_weak_revalidate(struct dentry *dentry,
+				       unsigned int flags)
+{
+	return __proc_misc_revalidate(dentry);
 }
 
 static int proc_misc_d_delete(const struct dentry *dentry)
@@ -233,8 +244,9 @@ static int proc_misc_d_delete(const struct dentry *dentry)
 }
 
 static const struct dentry_operations proc_misc_dentry_ops = {
-	.d_revalidate	= proc_misc_d_revalidate,
-	.d_delete	= proc_misc_d_delete,
+	.d_revalidate		= proc_misc_d_revalidate,
+	.d_weak_revalidate	= proc_misc_d_weak_revalidate,
+	.d_delete		= proc_misc_d_delete,
 };
 
 /*
--
2.47.3
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 22:15 ` Calvin Owens
@ 2026-04-10 23:05 ` Al Viro
2026-04-10 23:30 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 23:05 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 03:15:16PM -0700, Calvin Owens wrote:
> On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> > On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > > Yep, not even with that. One thing that claude pointed out: doing an
> > > > mdelay() might prevent it from making any progress freeing dentries.
> > > > I'm experimenting now with one that only randomly does a delay, but so
> > > > far that hasn't turned up anything. I'll let you know if that changes.
> > >
> > > Hi Jeff,
> > >
> > > I've been poking at this one too. With the attached reproducer, the
> > > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
> >
> > Interesting... I'll try it in a bit, but just in case - could you check if
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> > steps into the same thing on your setup?
>
> Hi Al,
>
> It still reproduces on that branch in the exact same way.
>
> I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
> symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
> spins impossible to trigger with my reproducer, does my explanation make
> any sense to you? Apologies if I'm off in the weeds...
What it's doing, AFAICS, is a weird way of opening arseloads of files, mostly
the same ones again and again - the reason it needs that prlimit (there's
way fewer files in procfs when it starts) is that these threads keep running
through each other's /proc/*/fd and opening what they'd got again and again,
adding to /proc/*/fd/* as they go.
IOW, you are opening a random mix of files, both procfs and ones some processes
had opened, then exiting. That triggers invalidation of your /proc/*, with
a _lot_ of shite that needs to be taken out. It might be triggering a livelock
or a UAF somewhere in dcache or it might be something entirely different -
no idea at that point. I'll look further into that, but I wouldn't be surprised
if it turns out to be entirely unrelated. Would be easier to deal with if the
mix had been more predictable, but we have what we have...
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 23:05 ` Al Viro
@ 2026-04-10 23:30 ` Calvin Owens
2026-04-11 0:51 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 23:30 UTC (permalink / raw)
To: Al Viro
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Saturday 04/11 at 00:05 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 03:15:16PM -0700, Calvin Owens wrote:
> > On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> > > On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > > > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > > > Yep, not even with that. One thing that claude pointed out: doing an
> > > > > mdelay() might prevent it from making any progress freeing dentries.
> > > > > I'm experimenting now with one that only randomly does a delay, but so
> > > > > far that hasn't turned up anything. I'll let you know if that changes.
> > > >
> > > > Hi Jeff,
> > > >
> > > > I've been poking at this one too. With the attached reproducer, the
> > > > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > > > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
> > >
> > > Interesting... I'll try it in a bit, but just in case - could you check if
> > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> > > steps into the same thing on your setup?
> >
> > Hi Al,
> >
> > It still reproduces on that branch in the exact same way.
> >
> > I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
> > symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
> > spins impossible to trigger with my reproducer, does my explanation make
> > any sense to you? Apologies if I'm off in the weeds...
>
> What it's doing, AFAICS, is a weird way of opening arseloads of files, mostly
> the same ones again and again - the reason it needs that prlimit (there's
> way fewer files in procfs when it starts) is that these threads keep running
> through each other's /proc/*/fd and opening what they'd got again and again,
> adding to /proc/*/fd/* as they go.
Yes, exactly. I was trying a lot of different pathological behaviors, and
that was the first pattern I found that triggered the d_walk() spin
consistently.
Initially I made the reproducer ignore itself in /proc. But it turns out
to be much more reliable if it *doesn't*. I realized it was opening its
own children's files in /proc/*/fd/*, and the children were also opening
other /proc files, which is what made me suspect the magic symlinks.
> IOW, you are opening a random mix of files, both procfs and ones some processes
> had opened, then exiting. That triggers invalidation of your /proc/*, with
> a _lot_ of shite that needs to be taken out. It might be triggering a livelock
> or a UAF somewhere in dcache or it might be something entirely different -
> no idea at that point. I'll look further into that, but I wouldn't be surprised
> if it turns out to be entirely unrelated. Would be easier to deal with if the
> mix had been more predictable, but we have what we have...
I'm sure I can narrow down the reproducer more, I'll try a bit.
I appreciate you taking a look. I'm happy to test anything else you like
if that reproducer isn't working for you, it's very consistent for me on
two different machines. Hopefully it works for Jeff.
Thanks,
Calvin
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 23:30 ` Calvin Owens
@ 2026-04-11 0:51 ` Al Viro
0 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-11 0:51 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 04:30:32PM -0700, Calvin Owens wrote:
> Yes exactly, I was trying a lot of different pathological behaviors and
> that was the first pattern I found that triggered the d_walk() spin
> consistently.
>
> Initially I made the reproducer ignore itself in /proc. But it turns out
> to be much more reliable if it *doesn't*. I realized it was opening its
> own children's files in /proc/*/fd/*, and the children were also opening
> other /proc files, which is what made me suspect the magic symlinks.
>
> > IOW, you are opening a random mix of files, both procfs and ones some processes
> > had opened, then exiting. That triggers invalidation of your /proc/*, with
> > a _lot_ of shite that needs to be taken out. It might be triggering a livelock
> > or a UAF somewhere in dcache or it might be something entirely different -
> > no idea at that point. I'll look further into that, but I wouldn't be surprised
> > if it turns out to be entirely unrelated. Would be easier to deal with if the
> > mix had been more predictable, but we have what we have...
>
> I'm sure I can narrow down the reproducer more, I'll try a bit.
No need... Just have parent and child share descriptor table, then let parent
wait for child and child do
	char buf[128], buf2[128];
	int fd = 0, n;

	do {
		n = fd;
		sprintf(buf, "/proc/self/fdinfo/%d", fd);
		fd = open(buf, 0);
	} while (fd >= 0);
	for (int i = 0; i <= n; i++) {
		sprintf(buf, "/proc/self/fd/%d", i);
		readlink(buf, buf2, sizeof(buf2));
	}
and that's it.
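[For reference, a self-contained userspace sketch of the recipe above (my
wrapping, not Al's code): a thread stands in for the CLONE_FILES child,
since threads share their creator's descriptor table, and the loop is
capped so the sketch stays cheap; fd_walker() and run_reproducer() are
made-up names.]

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* The child loop from the mail: each successful open of
 * /proc/self/fdinfo/<fd> adds yet another fd to the shared table,
 * then the magic symlinks in /proc/self/fd/* get traversed. */
static void *fd_walker(void *ret)
{
	char buf[128], buf2[128];
	int fd = 0, n = 0;

	do {
		n = fd;
		snprintf(buf, sizeof(buf), "/proc/self/fdinfo/%d", fd);
		fd = open(buf, O_RDONLY);
	} while (fd >= 0 && fd < 4096);	/* cap added for the sketch */
	for (int i = 0; i <= n; i++) {
		snprintf(buf, sizeof(buf), "/proc/self/fd/%d", i);
		readlink(buf, buf2, sizeof(buf2));
	}
	*(int *)ret = n;
	return NULL;
}

/* "Parent": spawn the walker, wait for it, report the highest fd seen. */
static int run_reproducer(void)
{
	pthread_t t;
	int n = 0;

	pthread_create(&t, NULL, fd_walker, &n);
	pthread_join(t, NULL);
	return n;
}
```

Each open leaves a live procfs dentry behind, so by the time the process
exits there is a large pile of them to invalidate.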
That's a nice demonstration of how nasty conditions can be arranged for
d_invalidate(); *plenty* of busy dentries in the tree (to the tune of
several millions), with some evictables scattered here and there.
If you get enough busy stuff to wade through until you find an eviction
candidate, you'll get select_collect() return D_WALK_QUIT as soon as you
get a single evictable sucker. Of course, then you need to restart
the whole thing - with the same pile of busy ones to get through. Repeat
for every single evictable left in there.
No memory corruption involved, just a fuckton of rescans ;-/
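[A back-of-the-envelope cost model of that rescan pattern, not kernel
code; rescan_cost() is a made-up name. Each pass pays for every busy
dentry before it can kill a single evictable one, so the total walk cost
grows as busy * evictable.]

```c
/* Toy model: select_collect() wades through all `busy` dentries, bails
 * out (D_WALK_QUIT) at the first evictable one, and the whole walk then
 * restarts from scratch for the next victim. */
static long rescan_cost(long busy, long evictable)
{
	long steps = 0;

	while (evictable > 0) {
		steps += busy + 1;	/* one full pass: busy + the victim */
		evictable--;		/* victim killed; restart the walk */
	}
	return steps;
}
```

With millions of busy dentries and even a handful of evictable ones
scattered among them, the product gets painful fast.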
It *is* recoverable - you just need to flush caches hard enough and that'll
unwedge the sucker.
IMO that's a very convincing argument for not using shrink_dcache_parent()
in d_invalidate().
That goes back to 2.1.65 - very early in dcache history. Back then (and
until about 2014, IIRC) it had been "try to evict the subtree, unhash
if they are all gone, fail otherwise"; these days it's worse, since
d_invalidate() is not allowed to fail.
"Rescan if we have found something and ran out of timeslice" is there since
2005 mingo's 116194f29f13 "[PATCH] sched: vfs: fix scheduling latencies in
prune_dcache() and select_parent()" - latency problems prior to that got
traded for this "rescan the same pile of busy stuff for each evictable"
pathological behaviour.
FWIW, it doesn't have to be on procfs - you can arrange for something similar
with fuse.
Joy...
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
2026-04-10 16:27 ` Boqun Feng
@ 2026-04-10 17:32 ` Paul E. McKenney
2026-04-10 18:26 ` Jeff Layton
2026-04-10 18:52 ` Al Viro
3 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 17:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> section, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
> This is most likely dependent on kernel config options, and it may all
> be unavoidable becasue RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
Adding to Boqun's email:
The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
has preemption disabled (right, Jeff?), in which case rcu_read_lock()
is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
All the spinlocks also disable and enable preemption, so if you have
something like this:
	rcu_read_lock();
	do_something();
	spin_lock(...);
	do_something_else();
	rcu_read_unlock();
	do_yet_another_thing();
	spin_unlock(...);
Then there should not be any calls to schedule() between the initial
rcu_read_lock() and the final spin_unlock(). But this is just
preempt_disable() doing its job, not much to do with RCU.
And repeating Boqun's email:
Now if this is a preemptible kernel, things are more involved.
However, in this case we know that the kernel has been built with
CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
preempt_count(), and preempt_bh_were_disabled is then true if either
preemption or softirq are disabled. Then irqs_were_disabled is set if
interrupts were disabled on entry. If either are set, the inner pair of
"if" statements set up to defer the end of the read-side critical section.
If this is important, which it might be, I will go through this in more
detail later.
Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
of rcu_read_unlock_special() immediately ends the RCU read-side critical
section.
And more adding to Boqun's email:
Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
should be covered via RCU nesting along with the rcu_read_unlock_special()
checks for disabled interrupts, softirq, and preemption.
And rcutorture tests random overlapping of RCU readers with all those
disabled regions. But it might still be missing something. And I
will add spinlocks to the mix in addition to the preemption disabling.
In the meantime, is there something more specific that I should be
looking at?
Thanx, Paul
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:32 ` Paul E. McKenney
@ 2026-04-10 18:26 ` Jeff Layton
2026-04-10 18:36 ` Paul E. McKenney
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 18:26 UTC (permalink / raw)
To: paulmck, Linus Torvalds
Cc: Al Viro, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, linux-fsdevel,
Christian Brauner, Jan Kara, Nikolay Borisov, Max Kellermann,
Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 10:32 -0700, Paul E. McKenney wrote:
> On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > [ Adding RCU maintainers to the participants: see
> >
> > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> >
> > and the fairly long thread associated with it for context ]
> >
> > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > Hmm.
> >
> > I think this patch may indeed fix the problem, and I don't mind how it looks.
> >
> > But while I think the patch looks fine, I am still quite unhappy about
> > it if it matters - because we have very much documented that spinlocks
> > in themselves are also RCU read locks:
> >
> > Note that anything that disables bottom halves, preemption,
> > or interrupts also enters an RCU read-side critical section.
> > Acquiring a spinlock also enters an RCU read-side critical
> > section, even for spinlocks that do not disable preemption,
> > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > Sleeplocks do *not* enter RCU read-side critical sections.
> >
> > so *if* this makes a difference, I think it's actually implying that
> > there's something wrong with our "rcu_read_unlock()" implementation,
> > and that it doesn't nest properly.
> >
> > Because our documentation also makes it very clear that this should
> > all work as-is, and your patch should be a complete no-op. Just a few
> > lines later in that core RCU doc, we have
> >
> > Note that RCU read-side critical sections may be nested and/or
> > overlapping.
> >
> > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > be entirely immaterial, and the order of unlocking simply shouldn't
> > matter.
> >
> > So it may indeed be that relying on the ordering of RCU read unlock
> > matters for the case of explicit unlocks and the implicit unlocks by a
> > spinlock, but that's not great.
> >
> > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > the atomicity count? That would explain it, and looking at
> > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > case (rcu_preempt_read_exit() seems to only check
> > rcu_read_lock_nesting).
> >
> > This is most likely dependent on kernel config options, and it may all
> > be unavoidable becasue RCU might not even have a way to tell that it's
> > still in a critical region without preemption counts etc.
> >
> > But if it's unavoidable, we need to update the docs about this gotcha,
> > and we need to have some tooling scan our existing code for this
> > documentation change.
> >
> > RCU people - what do you think?
>
> Adding to Boqun's email:
>
> The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
> has preemption disabled (right, Jeff?)
>
Not as familiar with this part of the kernel, but I think that's
correct:
# zgrep PREEMPT /proc/config.gz
CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_ARCH_HAS_PREEMPT_LAZY=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_LAZY is not set
# CONFIG_PREEMPT_RT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> , in which case rcu_read_lock()
> is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
> All the spinlocks also disable and enable preemption, so if you have
> something like this:
>
> 	rcu_read_lock();
> 	do_something();
> 	spin_lock(...);
> 	do_something_else();
> 	rcu_read_unlock();
> 	do_yet_another_thing();
> 	spin_unlock(...);
>
> Then there should not be any calls to schedule() between the initial
> rcu_read_lock() and the final spin_unlock(). But this is just
> preempt_disable() doing its job, not much to do with RCU.
>
> And repeating Boqun's email:
>
> Now if this is a preemptible kernel, things are more involved.
> However, in this case we know that the kernel has been built with
> CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
> preempt_count(), and preempt_bh_were_disabled is then true if either
> preemption or softirq are disabled. Then irqs_were_disabled is set if
> interrupts were disabled on entry. If either are set, the inner pair of
> "if" statements set up to defer the end of the read-side critical section.
> If this is important, which it might be, I will go through this in more
> detail later.
>
> Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
> of rcu_read_unlock_special() immediately ends the RCU read-side critical
> section.
>
> And more adding to Boqun's email:
>
> Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
> are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
> rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
> should be covered via RCU nesting along with the rcu_read_unlock_special()
> checks for disabled interrupts, softirq, and preemption.
>
> And rcutorture tests random overlapping of RCU readers with all those
> disabled regions. But it might still be missing something. And I
> will add spinlocks to the mix in addition to the preemption disabling.
>
> In the meantime, is there something more specific that I should be
> looking at?
>
> Thanx, Paul
--
Jeff Layton <jlayton@kernel.org>
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:26 ` Jeff Layton
@ 2026-04-10 18:36 ` Paul E. McKenney
0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 18:36 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, Al Viro, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 02:26:47PM -0400, Jeff Layton wrote:
> On Fri, 2026-04-10 at 10:32 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > > [ Adding RCU maintainers to the participants: see
> > >
> > > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> > >
> > > and the fairly long thread associated with it for context ]
> > >
> > > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> > >
> > > Hmm.
> > >
> > > I think this patch may indeed fix the problem, and I don't mind how it looks.
> > >
> > > But while I think the patch looks fine, I am still quite unhappy about
> > > it if it matters - because we have very much documented that spinlocks
> > > in themselves are also RCU read locks:
> > >
> > > Note that anything that disables bottom halves, preemption,
> > > or interrupts also enters an RCU read-side critical section.
> > > Acquiring a spinlock also enters an RCU read-side critical
> > > section, even for spinlocks that do not disable preemption,
> > > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > > Sleeplocks do *not* enter RCU read-side critical sections.
> > >
> > > so *if* this makes a difference, I think it's actually implying that
> > > there's something wrong with our "rcu_read_unlock()" implementation,
> > > and that it doesn't nest properly.
> > >
> > > Because our documentation also makes it very clear that this should
> > > all work as-is, and your patch should be a complete no-op. Just a few
> > > lines later in that core RCU doc, we have
> > >
> > > Note that RCU read-side critical sections may be nested and/or
> > > overlapping.
> > >
> > > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > > be entirely immaterial, and the order of unlocking simply shouldn't
> > > matter.
> > >
> > > So it may indeed be that relying on the ordering of RCU read unlock
> > > matters for the case of explicit unlocks and the implicit unlocks by a
> > > spinlock, but that's not great.
> > >
> > > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > > the atomicity count? That would explain it, and looking at
> > > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > > case (rcu_preempt_read_exit() seems to only check
> > > rcu_read_lock_nesting).
> > >
> > > This is most likely dependent on kernel config options, and it may all
> > > be unavoidable because RCU might not even have a way to tell that it's
> > > still in a critical region without preemption counts etc.
> > >
> > > But if it's unavoidable, we need to update the docs about this gotcha,
> > > and we need to have some tooling scan our existing code for this
> > > documentation change.
> > >
> > > RCU people - what do you think?
> >
> > Adding to Boqun's email:
> >
> > The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
> > has preemption disabled (right, Jeff?)
> >
>
> Not as familiar with this part of the kernel, but I think that's
> correct:
>
> # zgrep PREEMPT /proc/config.gz
> CONFIG_PREEMPT_NONE_BUILD=y
> CONFIG_ARCH_HAS_PREEMPT_LAZY=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> # CONFIG_PREEMPT_LAZY is not set
> # CONFIG_PREEMPT_RT is not set
> # CONFIG_PREEMPT_DYNAMIC is not set
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Thank you, Jeff!
Yes, that has preemption entirely disabled, so RCU readers do nothing but
disable and enable preemption. And since preemption is disabled across
the whole region anyway, the order of the rcu_read_unlock() and the
spin_unlock() should not matter. (Famous last words!)
Thanx, Paul
> > , in which case rcu_read_lock()
> > is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
> > All the spinlocks also disable and enable preemption, so if you have
> > something like this:
> >
> > 	rcu_read_lock();
> > 	do_something();
> > 	spin_lock(...);
> > 	do_something_else();
> > 	rcu_read_unlock();
> > 	do_yet_another_thing();
> > 	spin_unlock(...);
> >
> > Then there should not be any calls to schedule() between the initial
> > rcu_read_lock() and the final spin_unlock(). But this is just
> > preempt_disable() doing its job, not much to do with RCU.
> >
> > And repeating Boqun's email:
> >
> > Now if this is a preemptible kernel, things are more involved.
> > However, in this case we know that the kernel has been built with
> > CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
> > preempt_count(), and preempt_bh_were_disabled is then true if either
> > preemption or softirq are disabled. Then irqs_were_disabled is set if
> > interrupts were disabled on entry. If either are set, the inner pair of
> > "if" statements set up to defer the end of the read-side critical section.
> > If this is important, which it might be, I will go through this in more
> > detail later.
> >
> > Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
> > of rcu_read_unlock_special() immediately ends the RCU read-side critical
> > section.
> >
> > And more adding to Boqun's email:
> >
> > Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
> > are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
> > rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
> > should be covered via RCU nesting along with the rcu_read_unlock_special()
> > checks for disabled interrupts, softirq, and preemption.
> >
> > And rcutorture tests random overlapping of RCU readers with all those
> > disabled regions. But it might still be missing something. And I
> > will add spinlocks to the mix in addition to the preemption disabling.
> >
> > In the meantime, is there something more specific that I should be
> > looking at?
> >
> > Thanx, Paul
>
> --
> Jeff Layton <jlayton@kernel.org>
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
` (2 preceding siblings ...)
2026-04-10 17:32 ` Paul E. McKenney
@ 2026-04-10 18:52 ` Al Viro
2026-04-10 19:21 ` Paul E. McKenney
2026-04-10 19:30 ` Linus Torvalds
3 siblings, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 18:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> section, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
> This is most likely dependent on kernel config options, and it may all
> be unavoidable because RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
FWIW, I wonder if we would be better off with the following:
void shrink_dentry_list(struct list_head *list)
{
	while (!list_empty(list)) {
		struct dentry *dentry;

		dentry = list_entry(list->prev, struct dentry, d_lru);
		spin_lock(&dentry->d_lock);
		d_shrink_del(dentry);
		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
			// killed while on shrink list, freeing left to us
			spin_unlock(&dentry->d_lock);
			dentry_free(dentry);
			continue;
		}
		rcu_read_lock();
		if (!lock_for_kill(dentry)) {
			spin_unlock(&dentry->d_lock);
			rcu_read_unlock();
			continue;
		}
		shrink_kill(dentry);
	}
}
because in that case both callers of shrink_kill() have the same form - the
other one is
if (!lock_for_kill(v)) {
        spin_unlock(&v->d_lock);
        rcu_read_unlock();
} else {
        shrink_kill(v);
}
and we could fold lock_for_kill() into shrink_kill(), with a nice series of
cleanups becoming possible after that, especially if we can rely upon
spinlock acting as RCU read-side critical in all cases...
Interesting...
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:52 ` Al Viro
@ 2026-04-10 19:21 ` Paul E. McKenney
2026-04-10 19:30 ` Linus Torvalds
1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 19:21 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 07:52:43PM +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > [ Adding RCU maintainers to the participants: see
> >
> > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> >
> > and the fairly long thread associated with it for context ]
> >
> > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > Hmm.
> >
> > I think this patch may indeed fix the problem, and I don't mind how it looks.
> >
> > But while I think the patch looks fine, I am still quite unhappy about
> > it if it matters - because we have very much documented that spinlocks
> > in themselves are also RCU read locks:
> >
> > Note that anything that disables bottom halves, preemption,
> > or interrupts also enters an RCU read-side critical section.
> > Acquiring a spinlock also enters an RCU read-side critical
> > sections, even for spinlocks that do not disable preemption,
> > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > Sleeplocks do *not* enter RCU read-side critical sections.
> >
> > so *if* this makes a difference, I think it's actually implying that
> > there's something wrong with our "rcu_read_unlock()" implementation,
> > and that it doesn't nest properly.
> >
> > Because our documentation also makes it very clear that this should
> > all work as-is, and your patch should be a complete no-op. Just a few
> > lines later in that core RCU doc, we have
> >
> > Note that RCU read-side critical sections may be nested and/or
> > overlapping.
> >
> > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > be entirely immaterial, and the order of unlocking simply shouldn't
> > matter.
> >
> > So it may indeed be that relying on the ordering of RCU read unlock
> > matters for the case of explicit unlocks and the implicit unlocks by a
> > spinlock, but that's not great.
> >
> > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > the atomicity count? That would explain it, and looking at
> > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > case (rcu_preempt_read_exit() seems to only check
> > rcu_read_lock_nesting).
> >
> > This is most likely dependent on kernel config options, and it may all
> > be unavoidable because RCU might not even have a way to tell that it's
> > still in a critical region without preemption counts etc.
> >
> > But if it's unavoidable, we need to update the docs about this gotcha,
> > and we need to have some tooling scan our existing code for this
> > documentation change.
> >
> > RCU people - what do you think?
>
> FWIW, I wonder if we would be better off with the following:
>
> void shrink_dentry_list(struct list_head *list)
> {
>         while (!list_empty(list)) {
>                 struct dentry *dentry;
>
>                 dentry = list_entry(list->prev, struct dentry, d_lru);
>                 spin_lock(&dentry->d_lock);
>                 d_shrink_del(dentry);
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         // killed while on shrink list, freeing left to us
>                         spin_unlock(&dentry->d_lock);
>                         dentry_free(dentry);
>                         continue;
>                 }
>                 rcu_read_lock();
>                 if (!lock_for_kill(dentry)) {
>                         spin_unlock(&dentry->d_lock);
>                         rcu_read_unlock();
>                         continue;
>                 }
>                 shrink_kill(dentry);
>         }
> }
>
> because in that case both callers of shrink_kill() have the same form - the
> other one is
> if (!lock_for_kill(v)) {
>         spin_unlock(&v->d_lock);
>         rcu_read_unlock();
> } else {
>         shrink_kill(v);
> }
> and we could fold lock_for_kill() into shrink_kill(), with a nice series of
> cleanups becoming possible after that, especially if we can rely upon
> spinlock acting as RCU read-side critical in all cases...
If there is a case where a spinlock fails to act as an RCU read-side
critical section, then we have a bug that needs to be fixed.
Thanx, Paul
> Interesting...
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:52 ` Al Viro
2026-04-10 19:21 ` Paul E. McKenney
@ 2026-04-10 19:30 ` Linus Torvalds
2026-04-10 20:24 ` Al Viro
1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 19:30 UTC (permalink / raw)
To: Al Viro
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 10 Apr 2026 at 11:48, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I wonder if we would be better off with the following:
>
> void shrink_dentry_list(struct list_head *list)
> {
>         while (!list_empty(list)) {
>                 struct dentry *dentry;
>
>                 dentry = list_entry(list->prev, struct dentry, d_lru);
>                 spin_lock(&dentry->d_lock);
>                 d_shrink_del(dentry);
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         // killed while on shrink list, freeing left to us
>                         spin_unlock(&dentry->d_lock);
>                         dentry_free(dentry);
>                         continue;
>                 }
>                 rcu_read_lock();
>                 if (!lock_for_kill(dentry)) {
>                         spin_unlock(&dentry->d_lock);
>                         rcu_read_unlock();
>                         continue;
>                 }
Ugh. I despise that rcu_read_lock() in there, because it's typically
completely useless.
The reason it exists is because lock_for_kill() can drop ->d_lock, but
that's in the unlikely case that we can't just immediately get the
inode lock.
So honestly, I think that rcu_read_lock() should be inside
lock_for_kill(), rather than in the caller as a "just in case things
go down".
IOW, that code basically now not only does that typically unnecessary
rcu locking (and then unlocking), it does so *because* it knows about
the subtle internal behavior of lock_for_kill().
In contrast, if we put it inside lock_for_kill(), all we need is a
fairly straightforward comment along the lines of
"we're dropping the spinlock in order to take the inode lock, but
the caller may be depending on RCU behavior so we need to take the rcu
read lock first".
that would turn strange illogical code that also generates worse code
into straightforwardly explained code that also performs better.
Ok, so "performs better" is kind of exaggerated, in that obviously the
extra rcu_read_lock/unlock sequences aren't exactly expensive, but
still - I feel it's a win-win to just do this differently.
Or am I missing something else?
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:30 ` Linus Torvalds
@ 2026-04-10 20:24 ` Al Viro
2026-04-10 20:48 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 12:30:13PM -0700, Linus Torvalds wrote:
> The reason it exists is because lock_for_kill() can drop ->d_lock, but
> that's in the unlikely case that we can't just immediately get the
> inode lock.
>
> So honestly, I think that rcu_read_lock() should be inside
> lock_for_kill(), rather than in the caller as a "just in case things
> go down".
Yup, in the cascade of followups I've mentioned...
> IOW, that code basically now not only does that typically unnecessary
> rcu locking (and then unlocking), it does so *because* it knows about
> the subtle internal behavior of lock_for_kill().
>
> In contrast, if we put it inside lock_for_kill(), all we need is a
> fairly straightforward comment along the lines of
>
> "we're dropping the spinlock in order to take the inode lock, but
> the caller may be depending on RCU behavior so we need to take the rcu
> read lock first".
>
> that would turn strange illogical code that also generates worse code
> into straightforwardly explained code that also performs better.
>
> Ok, so "performs better" is kind of exaggerated, in that obviously the
> extra rcu_read_lock/unlock sequences aren't exactly expensive, but
> still - I feel it's a win-win to just do this differently.
>
> Or am I missing something else?
Mostly that fast_dput() calling conventions would need to change a bit
as well.
With the above as step 1,
Step 2: fold lock_for_kill() into shrink_kill(). Result: 2 call sites of
lock_for_kill() left, shrink_kill() becomes
static inline void shrink_kill(struct dentry *victim)
{
        while (lock_for_kill(victim)) {
                rcu_read_unlock();
                victim = __dentry_kill(victim);
                if (!victim)
                        return;
                rcu_read_lock();
        }
        spin_unlock(&victim->d_lock);
        rcu_read_unlock();
}
shrink_dentry_list() and shrink_dentry_tree() calling shrink_kill()
without bothering with lock_for_kill().
Step 3: note that both call sites of lock_for_kill() are followed by the
same sequence - drop ->d_lock + rcu_read_unlock on failure,
rcu_read_unlock + __dentry_kill() on success. And these are the
only callers of __dentry_kill(). Moreover, NULL from __dentry_kill()
means "bugger off" in both callers. So add
struct dentry *dentry_kill(struct dentry *dentry)
{
        if (!lock_for_kill(dentry)) {
                spin_unlock(&dentry->d_lock);
                rcu_read_unlock();
                return NULL;
        }
        rcu_read_unlock();
        return __dentry_kill(dentry);
}
and convert both callers to that:
static inline void shrink_kill(struct dentry *victim)
{
        while ((victim = dentry_kill(victim)) != NULL)
                rcu_read_lock();
}
and
static void finish_dput(struct dentry *dentry)
{
        while ((dentry = dentry_kill(dentry)) != NULL) {
                if (retain_dentry(dentry, true)) {
                        spin_unlock(&dentry->d_lock);
                        return;
                }
                rcu_read_lock();
        }
}
Step 4: note that callers of shrink_kill() and finish_dput()
have dentry->d_lock held. We can (assuming there are no RCU
bugs) drop the lock before calling them and grab it first
thing inside. Moreover, we can shift grabbing it into
lock_for_kill(), leaving these
static inline void shrink_kill(struct dentry *victim)
{
        while ((victim = dentry_kill(victim)) != NULL)
                ;
}
and
static void finish_dput(struct dentry *dentry)
{
        while ((dentry = dentry_kill(dentry)) != NULL) {
                if (retain_dentry(dentry, true)) {
                        spin_unlock(&dentry->d_lock);
                        return;
                }
        }
}
resp., with the callers becoming
void dput(struct dentry *dentry)
{
        if (!dentry)
                return;
        might_sleep();
        rcu_read_lock();
        if (likely(fast_dput(dentry))) {
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();
        finish_dput(dentry);
}
void d_make_discardable(struct dentry *dentry)
{
        spin_lock(&dentry->d_lock);
        WARN_ON(!(dentry->d_flags & DCACHE_PERSISTENT));
        dentry->d_flags &= ~DCACHE_PERSISTENT;
        dentry->d_lockref.count--;
        finish_dput(dentry);
}
void shrink_dentry_list(struct list_head *list)
{
        while (!list_empty(list)) {
                struct dentry *dentry;

                dentry = list_entry(list->prev, struct dentry, d_lru);
                spin_lock(&dentry->d_lock);
                d_shrink_del(dentry);
                if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
                        spin_unlock(&dentry->d_lock);
                        dentry_free(dentry);
                        continue;
                }
                shrink_kill(dentry);
        }
}
and, in shrink_dentry_list(),
spin_lock(&v->d_lock);
rcu_read_unlock();      // now held by ->d_lock
if (v->d_lockref.count < 0 &&
    !(v->d_flags & DCACHE_DENTRY_KILLED)) {
        struct completion_list wait;

        // It's busy dying; have it notify us once
        // it becomes invisible to d_walk().
        d_add_waiter(v, &wait);
        spin_unlock(&v->d_lock);
        if (!list_empty(&data.dispose))
                shrink_dentry_list(&data.dispose);
        wait_for_completion(&wait.completion);
        continue;
}
shrink_kill(v);
lock_for_kill() has grown a call of rcu_read_lock() at the very beginning;
it will be gone shortly.
Step 5: pull the followup rcu_read_unlock() (and dropping ->d_lock on
failure) into lock_for_kill(); again, that assumes no RCU bugs there:
static bool lock_for_kill(struct dentry *dentry)
{
        struct inode *inode = dentry->d_inode;

        if (unlikely(dentry->d_lockref.count)) {
                spin_unlock(&dentry->d_lock);
                return false;
        }
        if (!inode || likely(spin_trylock(&inode->i_lock)))
                return true;
        rcu_read_lock();
        do {
                spin_unlock(&dentry->d_lock);
                spin_lock(&inode->i_lock);
                spin_lock(&dentry->d_lock);
                if (likely(inode == dentry->d_inode))
                        break;
                spin_unlock(&inode->i_lock);
                inode = dentry->d_inode;
        } while (inode);
        rcu_read_unlock();
        if (likely(!dentry->d_lockref.count))
                return true;
        if (inode)
                spin_unlock(&inode->i_lock);
        spin_unlock(&dentry->d_lock);
        return false;
}
and in dentry_kill()
{
        if (!lock_for_kill(dentry))
                return NULL;
        return __dentry_kill(dentry);
}
Step 6: fold __dentry_kill() into dentry_kill().
Step 7: both callers of fast_dput() are identical:
rcu_read_lock();
if (likely(fast_dput(dentry))) {
        rcu_read_unlock();
        return;
}
rcu_read_unlock();
(in dput() and dput_to_list()). Pull rcu_read_lock() and rcu_read_unlock()
into fast_dput(); ends up with 5 lines added there. Result (with comments
skipped) is
static inline bool fast_dput(struct dentry *dentry)
{
        int ret;

        rcu_read_lock();                // added
        ret = lockref_put_return(&dentry->d_lockref);
        if (unlikely(ret < 0)) {
                spin_lock(&dentry->d_lock);
                rcu_read_unlock();      // added
                if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
                        spin_unlock(&dentry->d_lock);
                        return true;
                }
                dentry->d_lockref.count--;
                goto locked;
        }
        if (ret) {
                rcu_read_unlock();      // added
                return true;
        }
        if (retain_dentry(dentry, false)) {
                rcu_read_unlock();      // added
                return true;
        }
        spin_lock(&dentry->d_lock);
        rcu_read_unlock();              // added
locked:
        if (dentry->d_lockref.count || retain_dentry(dentry, true)) {
                spin_unlock(&dentry->d_lock);
                return true;
        }
        return false;
}
Callers become
void dput(struct dentry *dentry)
{
        if (!dentry)
                return;
        might_sleep();
        if (likely(fast_dput(dentry)))
                return;
        finish_dput(dentry);
}
and
void dput_to_list(struct dentry *dentry, struct list_head *list)
{
        if (likely(fast_dput(dentry)))
                return;
        to_shrink_list(dentry, list);
        spin_unlock(&dentry->d_lock);
}
resp.
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 20:24 ` Al Viro
@ 2026-04-10 20:48 ` Al Viro
0 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 20:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 09:24:04PM +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 12:30:13PM -0700, Linus Torvalds wrote:
>
> > The reason it exists is because lock_for_kill() can drop ->d_lock, but
> > that's in the unlikely case that we can't just immediately get the
> > inode lock.
> >
> > So honestly, I think that rcu_read_lock() should be inside
> > lock_for_kill(), rather than in the caller as a "just in case things
> > go down".
>
> Yup, in the cascade of followups I've mentioned...
>
> > IOW, that code basically now not only does that typically unnecessary
> > rcu locking (and then unlocking), it does so *because* it knows about
> > the subtle internal behavior of lock_for_kill().
> >
> > In contrast, if we put it inside lock_for_kill(), all we need is a
> > fairly straightforward comment along the lines of
> >
> > "we're dropping the spinlock in order to take the inode lock, but
> > the caller may be depending on RCU behavior so we need to take the rcu
> > read lock first".
> >
> > that would turn strange illogical code that also generates worse code
> > into straightforwardly explained code that also performs better.
> >
> > Ok, so "performs better" is kind of exaggerated, in that obviously the
> > extra rcu_read_lock/unlock sequences aren't exactly expensive, but
> > still - I feel it's a win-win to just do this differently.
> >
> > Or am I missing something else?
>
> Mostly that fast_dput() calling conventions would need to change a bit
> as well.
>
> With the above as step 1,
[snip]
FWIW, I would really like to figure out what the hell is going on with
those UAFs; livelock elimination (which is clearly needed as well -
there's an easy way to turn that livelock into a hard lock reliably eating
a CPU on an SMP box) is likely to hide whatever the bug is. And it
does look like RCU bugs are not the cause of what Jeff is occasionally
seeing - not with their config. So there's something else going on ;-/
Jeff's reports are 6.11--6.13; Helge's parisc ones are 6.16--6.19.
If they are using the stock debian parisc kernel, it's PREEMPT_NONE
config (AFAICS) and in that case RCU is not the cause there either...
Back to trying to put together a proof of correctness and see where
it breaks ;-/ If nothing else, that should give a set of assertion
checks to try and slap on top of the kernels involved...
^ permalink raw reply [flat|nested] 63+ messages in thread