From: Al Viro <viro@zeniv.linux.org.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel@vger.kernel.org,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
NeilBrown <neil@brown.name>
Subject: [RFC PATCH 23/25] wind ->s_roots via ->d_sib instead of ->d_hash
Date: Tue, 5 May 2026 06:54:10 +0100 [thread overview]
Message-ID: <20260505055412.1261144-24-viro@zeniv.linux.org.uk> (raw)
In-Reply-To: <20260505055412.1261144-1-viro@zeniv.linux.org.uk>
shrink_dcache_for_umount() is supposed to handle the possibility of
some of the dentries to be evicted being in other threads shrink
lists; it either kills them, leaving an empty husk to be freed by
the owner of shrink list whenever it gets around to that, or it
waits for the eviction in progress to get completed.
That relies upon dentry remaining attached to the tree until the
eviction reaches dentry_unlist() and its ->d_sib gets removed
from the list. Unfortunately, the secondary roots are linked
via ->d_hash, rather than ->d_sib and they become removed from
that list before their inode references are dropped.
If shrink_dentry_list() from another thread ends up evicting
one of the secondary roots and gets to that point in dentry_kill()
when shrink_dcache_for_umount() is looking for secondary roots,
the latter will *not* notice anything, possibly leading to
warnings about busy inodes at umount time and all kinds of breakage
after that.
Moreover, shrink_dcache_for_umount() walks the list of secondary
roots with no protection whatsoever, so it might end up calling
dget() on a dentry that already passed through
lockref_mark_dead(&dentry->d_lockref);
ending up with corrupted refcount and possible UAF.
AFAICS, the most straightforward way to deal with that would be
to have secondary roots linked via ->d_sib rather than ->d_hash;
then they would remain on the list until killed, and we could
use d_add_waiter() machinery to wait for eviction in progress.
Changes:
* secondary roots look the same as ->s_root from d_unhashed()
and d_unlinked() POV now.
* secondary roots are represented as "no parent, but on ->d_sib"
instead of "no parent, but on ->d_hash".
* since ->d_sib is a plain hlist, we protect it with per-superblock
spinlock (sb->s_roots_lock) instead of the LSB of the head pointer (for
non-root dentries it would be protected by ->d_lock of parent).
* __d_obtain_alias() uses ->d_sib for linkage when allocating
a secondary root.
* d_splice_alias_ops() detects splicing of a secondary root and
removes it from the list before calling __d_move().
* dentry_unlist() detects eviction of a secondary root and
removes it from the list; no need to play the games for d_walk() sake,
since the latter is not going to look for the next sibling of those
anyway.
* ___d_drop() doesn't care about ->s_roots anymore.
* shrink_dcache_for_umount() uses proper locking for access to
the list of secondary roots and if it runs into one that is in the middle
of eviction waits for that to finish.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
fs/dcache.c | 65 ++++++++++++++++++++++++----------
fs/super.c | 1 +
include/linux/fs/super_types.h | 3 +-
3 files changed, 50 insertions(+), 19 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 9003b8cf7134..12f1143d479a 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -43,8 +43,8 @@
* - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
- * s_roots bl list spinlock protects:
- * - the s_roots list (see __d_drop)
+ * s_roots_lock protects:
+ * - the s_roots list (see __d_move()/dentry_unlist()/d_obtain_root())
* dentry->d_sb->s_dentry_lru_lock protects:
* - the dcache lru lists and counters
* d_lock protects:
@@ -562,16 +562,7 @@ static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
static void ___d_drop(struct dentry *dentry)
{
- struct hlist_bl_head *b;
- /*
- * Hashed dentries are normally on the dentry hashtable,
- * with the exception of those newly allocated by
- * d_obtain_root, which are always IS_ROOT:
- */
- if (unlikely(IS_ROOT(dentry)))
- b = &dentry->d_sb->s_roots;
- else
- b = d_hash(dentry->d_name.hash);
+ struct hlist_bl_head *b = d_hash(dentry->d_name.hash);
hlist_bl_lock(b);
__hlist_bl_del(&dentry->d_hash);
@@ -654,6 +645,13 @@ static inline void d_complete_waiters(struct dentry *dentry)
}
}
+static void unlink_secondary_root(struct dentry *dentry)
+{
+ spin_lock(&dentry->d_sb->s_roots_lock);
+ hlist_del_init(&dentry->d_sib);
+ spin_unlock(&dentry->d_sb->s_roots_lock);
+}
+
static inline void dentry_unlist(struct dentry *dentry)
{
struct dentry *next;
@@ -665,6 +663,10 @@ static inline void dentry_unlist(struct dentry *dentry)
d_complete_waiters(dentry);
if (unlikely(hlist_unhashed(&dentry->d_sib)))
return;
+ if (unlikely(IS_ROOT(dentry))) {
+ unlink_secondary_root(dentry); // secondary root goes away
+ return;
+ }
__hlist_del(&dentry->d_sib);
/*
* Cursors can move around the list of children. While we'd been
@@ -1791,9 +1793,30 @@ void shrink_dcache_for_umount(struct super_block *sb)
sb->s_root = NULL;
do_one_tree(dentry);
- while (!hlist_bl_empty(&sb->s_roots)) {
- dentry = dget(hlist_bl_entry(hlist_bl_first(&sb->s_roots), struct dentry, d_hash));
- do_one_tree(dentry);
+ for (;;) {
+ spin_lock(&sb->s_roots_lock);
+ dentry = hlist_entry_safe(sb->s_roots.first,
+ struct dentry, d_sib);
+ if (!dentry) {
+ spin_unlock(&sb->s_roots_lock);
+ break;
+ }
+ rcu_read_lock();
+ spin_unlock(&sb->s_roots_lock);
+ spin_lock(&dentry->d_lock);
+ rcu_read_unlock();
+ if (unlikely(dentry->d_lockref.count < 0)) {
+ struct completion_list wait;
+ bool need_wait = d_add_waiter(dentry, &wait);
+
+ spin_unlock(&dentry->d_lock);
+ if (need_wait)
+ wait_for_completion(&wait.completion);
+ } else {
+ dget_dlock(dentry);
+ spin_unlock(&dentry->d_lock);
+ do_one_tree(dentry);
+ }
}
}
@@ -2210,9 +2233,9 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
__d_set_inode_and_type(new, inode, add_flags);
hlist_add_head(&new->d_alias, &inode->i_dentry);
if (!disconnected) {
- hlist_bl_lock(&sb->s_roots);
- hlist_bl_add_head(&new->d_hash, &sb->s_roots);
- hlist_bl_unlock(&sb->s_roots);
+ spin_lock(&sb->s_roots_lock);
+ hlist_add_head(&new->d_sib, &sb->s_roots);
+ spin_unlock(&sb->s_roots_lock);
}
spin_unlock(&new->d_lock);
spin_unlock(&inode->i_lock);
@@ -3224,6 +3247,12 @@ struct dentry *d_splice_alias_ops(struct inode *inode, struct dentry *dentry,
}
dput(old_parent);
} else {
+ if (unlikely(!hlist_unhashed(&new->d_sib))) {
+ // secondary root getting spliced
+ spin_lock(&new->d_lock);
+ unlink_secondary_root(new);
+ spin_unlock(&new->d_lock);
+ }
__d_move(new, dentry, false);
write_sequnlock(&rename_lock);
}
diff --git a/fs/super.c b/fs/super.c
index 378e81efe643..fb44ebadda82 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -359,6 +359,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_iflags |= SB_I_NODEV;
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_roots);
+ spin_lock_init(&s->s_roots_lock);
mutex_init(&s->s_sync_lock);
INIT_LIST_HEAD(&s->s_inodes);
spin_lock_init(&s->s_inode_list_lock);
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..23d1c2612d0c 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -162,7 +162,8 @@ struct super_block {
struct unicode_map *s_encoding;
__u16 s_encoding_flags;
#endif
- struct hlist_bl_head s_roots; /* alternate root dentries for NFS */
+ struct hlist_head s_roots; /* alternate root dentries for NFS */
+ spinlock_t s_roots_lock;
struct mount *s_mounts; /* list of mounts; _not_ for fs use */
struct block_device *s_bdev; /* can go away once we use an accessor for @s_bdev_file */
struct file *s_bdev_file;
--
2.47.3
next prev parent reply other threads:[~2026-05-05 5:54 UTC|newest]
Thread overview: 53+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-05 5:53 [RFC PATCH 00/25] assorted dcache cleanups and fixes Al Viro
2026-05-05 5:53 ` [RFC PATCH 01/25] VFS: use wait_var_event for waiting in d_alloc_parallel() Al Viro
2026-05-05 5:53 ` [RFC PATCH 02/25] alloc_path_pseudo(): make sure we don't end up with NORCU dentries for directories Al Viro
2026-05-05 8:21 ` NeilBrown
2026-05-05 17:48 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 03/25] fix a race between d_find_any_alias() and final dput() of NORCU dentries Al Viro
2026-05-05 17:06 ` Linus Torvalds
2026-05-05 20:29 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 04/25] find_acceptable_alias(): skip NORCU aliases with zero refcount Al Viro
2026-05-05 5:53 ` [RFC PATCH 05/25] select_collect(): ignore dentries on shrink lists if they have positive refcounts Al Viro
2026-05-05 5:53 ` [RFC PATCH 06/25] make to_shrink_list() return whether it has moved dentry to list Al Viro
2026-05-05 5:53 ` [RFC PATCH 07/25] kill d_dispose_if_unused() Al Viro
2026-05-05 5:53 ` [RFC PATCH 08/25] d_prune_aliases(): make sure to skip NORCU aliases Al Viro
2026-05-05 5:53 ` [RFC PATCH 09/25] shrink_dentry_list(): start with removing from shrink list Al Viro
2026-05-07 20:39 ` Al Viro
2026-05-05 5:53 ` [RFC PATCH 10/25] fold lock_for_kill() into shrink_kill() Al Viro
2026-05-05 5:53 ` [RFC PATCH 11/25] fold lock_for_kill() and __dentry_kill() into common helper Al Viro
2026-05-05 5:53 ` [RFC PATCH 12/25] reducing rcu_read_lock() scopes in dput and friends, step 1 Al Viro
2026-05-05 8:55 ` NeilBrown
2026-05-05 14:22 ` Al Viro
2026-05-05 21:58 ` NeilBrown
2026-05-05 16:47 ` Linus Torvalds
2026-05-05 22:42 ` Al Viro
2026-05-07 7:35 ` Al Viro
2026-05-07 15:32 ` Linus Torvalds
2026-05-05 5:54 ` [RFC PATCH 13/25] reducing rcu_read_lock() scopes in dput and friends, step 2 Al Viro
2026-05-05 5:54 ` [RFC PATCH 14/25] reducing rcu_read_lock() scopes in dput and friends, step 3 Al Viro
2026-05-05 5:54 ` [RFC PATCH 15/25] reducing rcu_read_lock() scopes in dput and friends, step 4 Al Viro
2026-05-05 5:54 ` [RFC PATCH 16/25] reducing rcu_read_lock() scopes in dput and friends, step 5 Al Viro
2026-05-05 5:54 ` [RFC PATCH 17/25] reducing rcu_read_lock() scopes in dput and friends, step 6 Al Viro
2026-05-05 5:54 ` [RFC PATCH 18/25] adjust calling conventions of lock_for_kill(), fold __dentry_kill() into dentry_kill() Al Viro
2026-05-05 5:54 ` [RFC PATCH 19/25] document dentry_kill() Al Viro
2026-05-05 5:54 ` [RFC PATCH 20/25] d_walk(): shrink rcu_read_lock() scope Al Viro
2026-05-05 17:01 ` Linus Torvalds
2026-05-05 20:05 ` Al Viro
2026-05-05 21:40 ` Frederic Weisbecker
2026-05-05 22:50 ` Al Viro
2026-05-06 3:49 ` Paul E. McKenney
2026-05-07 22:39 ` NeilBrown
2026-05-07 23:21 ` Paul E. McKenney
2026-05-08 14:47 ` Al Viro
2026-05-08 22:03 ` Paul E. McKenney
2026-05-08 23:03 ` Al Viro
2026-05-08 3:01 ` Al Viro
2026-05-05 5:54 ` [RFC PATCH 21/25] shrinking rcu_read_lock() scope in d_alloc_parallel() Al Viro
2026-05-07 21:52 ` Jori Koolstra
2026-05-08 3:12 ` Al Viro
2026-05-08 9:28 ` Jori Koolstra
2026-05-05 5:54 ` [RFC PATCH 22/25] shrink_dentry_tree(): unify the calls of shrink_dentry_list() Al Viro
2026-05-05 5:54 ` Al Viro [this message]
2026-05-05 5:54 ` [RFC PATCH 24/25] nfs: get rid of fake root dentries Al Viro
2026-05-05 5:54 ` [RFC PATCH 25/25] make cursors NORCU Al Viro
2026-05-05 17:09 ` [RFC PATCH 00/25] assorted dcache cleanups and fixes Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260505055412.1261144-24-viro@zeniv.linux.org.uk \
--to=viro@zeniv.linux.org.uk \
--cc=brauner@kernel.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=neil@brown.name \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox