* [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
@ 2026-04-06 16:44 Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08 6:42 ` Al Viro
0 siblings, 2 replies; 8+ messages in thread
From: Jeff Layton @ 2026-04-06 16:44 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara
Cc: linux-fsdevel, linux-kernel, Jeff Layton
We've had a number of panics that seem to occur on hosts with heavy
process churn. The symptoms are a panic when invalidating /proc entries
as a task is exiting:
queued_spin_lock_slowpath+0x153/0x270
shrink_dentry_list+0x11d/0x220
shrink_dcache_parent+0x68/0x110
d_invalidate+0x90/0x170
proc_invalidate_siblings_dcache+0xc8/0x140
release_task+0x41b/0x510
do_exit+0x3d8/0x9d0
do_group_exit+0x7d/0xa0
get_signal+0x2a9/0x6a0
arch_do_signal_or_restart+0x1a/0x1c0
syscall_exit_to_user_mode+0xe6/0x1c0
do_syscall_64+0x74/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The problem appears to be a UAF. It's freeing a shrink list of
dentries, but one of the dentries on it has already been freed.
The d_lru field is always list_del_init()'ed, and so should be empty
whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
We've had some of these panics internally for a while. Additionally,
Claude also noted that these syzbot reports may be related:
https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
https://syzbot.org/bug?extid=e8b3520b53e78e90034e
https://syzbot.org/bug?extid=ad14fd37e76c579511d0
So far, I've been unable to spot the bug. Hoping this will make it
easier.
---
fs/dcache.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..c6f475d940e3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON_ONCE(!list_empty(&dentry->d_lru));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
---
base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
change-id: 20260403-dcache-warn-a493b0e3c877
Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
@ 2026-04-07 10:51 ` Jan Kara
  2026-04-08  6:42 ` Al Viro
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2026-04-07 10:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Mon 06-04-26 12:44:13, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
>   queued_spin_lock_slowpath+0x153/0x270
>   shrink_dentry_list+0x11d/0x220
>   shrink_dcache_parent+0x68/0x110
>   d_invalidate+0x90/0x170
>   proc_invalidate_siblings_dcache+0xc8/0x140
>   release_task+0x41b/0x510
>   do_exit+0x3d8/0x9d0
>   do_group_exit+0x7d/0xa0
>   get_signal+0x2a9/0x6a0
>   arch_do_signal_or_restart+0x1a/0x1c0
>   syscall_exit_to_user_mode+0xe6/0x1c0
>   do_syscall_64+0x74/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.
>
> The d_lru field is always list_del_init()'ed, and so should be empty
> whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Yes, looks like a sensible assert. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> We've had some of these panics internally for a while. Additionally,
> Claude also noted that these syzbot reports may be related:
>
> https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
> https://syzbot.org/bug?extid=e8b3520b53e78e90034e
> https://syzbot.org/bug?extid=ad14fd37e76c579511d0
>
> So far, I've been unable to spot the bug. Hoping this will make it
> easier.
> ---
> fs/dcache.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7ba1801d8132..c6f475d940e3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
>  static void dentry_free(struct dentry *dentry)
>  {
>  	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> +	WARN_ON_ONCE(!list_empty(&dentry->d_lru));
>  	if (unlikely(dname_external(dentry))) {
>  		struct external_name *p = external_name(dentry);
>  		if (likely(atomic_dec_and_test(&p->count))) {
>
> ---
> base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
> change-id: 20260403-dcache-warn-a493b0e3c877
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>

--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
  2026-04-07 10:51 ` Jan Kara
@ 2026-04-08  6:42 ` Al Viro
  2026-04-08 11:10   ` Jeff Layton
  2026-04-08 18:28   ` Jeff Layton
  1 sibling, 2 replies; 8+ messages in thread
From: Al Viro @ 2026-04-08 6:42 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
>   queued_spin_lock_slowpath+0x153/0x270
>   shrink_dentry_list+0x11d/0x220
>   shrink_dcache_parent+0x68/0x110
>   d_invalidate+0x90/0x170
>   proc_invalidate_siblings_dcache+0xc8/0x140
>   release_task+0x41b/0x510
>   do_exit+0x3d8/0x9d0
>   do_group_exit+0x7d/0xa0
>   get_signal+0x2a9/0x6a0
>   arch_do_signal_or_restart+0x1a/0x1c0
>   syscall_exit_to_user_mode+0xe6/0x1c0
>   do_syscall_64+0x74/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.

That, or the dentry pointer passed to shrink_dcache_parent() is complete
garbage - e.g. due to the struct pid having already been freed. Might
make sense to try and get a crash dump and poke around...

Which kernels have you seen it on?
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08  6:42 ` Al Viro
@ 2026-04-08 11:10   ` Jeff Layton
  2026-04-08 18:28   ` Jeff Layton
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 11:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> >
> >   queued_spin_lock_slowpath+0x153/0x270
> >   shrink_dentry_list+0x11d/0x220
> >   shrink_dcache_parent+0x68/0x110
> >   d_invalidate+0x90/0x170
> >   proc_invalidate_siblings_dcache+0xc8/0x140
> >   release_task+0x41b/0x510
> >   do_exit+0x3d8/0x9d0
> >   do_group_exit+0x7d/0xa0
> >   get_signal+0x2a9/0x6a0
> >   arch_do_signal_or_restart+0x1a/0x1c0
> >   syscall_exit_to_user_mode+0xe6/0x1c0
> >   do_syscall_64+0x74/0x130
> >   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
>
> That, or the dentry pointer passed to shrink_dcache_parent() is complete
> garbage - e.g. due to the struct pid having already been freed. Might
> make sense to try and get a crash dump and poke around...
>

I'm trying to get one. We had an issue that prevented the machines that
were crashing this way from getting a coredump. Hoping that'll be
resolved soon and we can get it.

> Which kernels have you seen it on?

v6.11 and v6.13 so far. The crash seems to be pretty workload-dependent
(a lot of processes rapidly starting and exiting). I'm not sure this
workload is running on later kernels yet, so I don't know if this is
something already fixed.

Thanks,
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08  6:42 ` Al Viro
@ 2026-04-08 18:28   ` Jeff Layton
  2026-04-08 19:26     ` Al Viro
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 18:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> That, or the dentry pointer passed to shrink_dcache_parent() is complete
> garbage - e.g. due to the struct pid having already been freed. Might
> make sense to try and get a crash dump and poke around...
>

Chris was able to track down a vmcore for me. No, it actually does seem
to be what we thought originally. The parent is fine, but one of the
dentries under it has been freed and reallocated:

    >>> stack
    #0  queued_spin_lock_slowpath (kernel/locking/qspinlock.c:471:3)
    #1  spin_lock (./include/linux/spinlock.h:351:2)
    #2  lock_for_kill (fs/dcache.c:675:3)
    #3  shrink_dentry_list (fs/dcache.c:1086:8)
    #4  shrink_dcache_parent (fs/dcache.c:0)
    #5  d_invalidate (fs/dcache.c:1614:2)
    #6  proc_invalidate_siblings_dcache (fs/proc/inode.c:142:5)
    #7  proc_flush_pid (fs/proc/base.c:3478:2)
    #8  release_task (kernel/exit.c:279:2)
    #9  exit_notify (kernel/exit.c:775:3)
    #10 do_exit (kernel/exit.c:958:2)
    #11 do_group_exit (kernel/exit.c:1087:2)
    #12 get_signal (kernel/signal.c:3036:3)
    #13 arch_do_signal_or_restart (arch/x86/kernel/signal.c:337:6)
    #14 exit_to_user_mode_loop (kernel/entry/common.c:111:4)
    #15 exit_to_user_mode_prepare (./include/linux/entry-common.h:329:13)
    #16 __syscall_exit_to_user_mode_work (kernel/entry/common.c:207:2)
    #17 syscall_exit_to_user_mode (kernel/entry/common.c:218:2)
    #18 do_syscall_64 (arch/x86/entry/common.c:89:2)
    #19 entry_SYSCALL_64+0x6c/0xaa (arch/x86/entry/entry_64.S:121)
    #20 0x7f49ead2c482
    >>> identify_address(stack[3]["dentry"])
    'slab object: kmalloc-96+0x48'
    >>> identify_address(stack[4]["parent"])
    'slab object: dentry+0x0'

...it turns out that Gustavo had been chasing this independently of me,
and had Claude do a bit more analysis. I included it below, but here's a
link that may be more readable. Any thoughts?

https://markdownpastebin.com/?id=7c258413493b4144ab27d5cdcb8ae5b4

-------------8<----------------------

## dcache: `shrink_dcache_parent()` livelock leading to use-after-free

### Summary

A race between concurrent proc dentry invalidation (`proc_flush_pid` →
`d_invalidate` → `shrink_dcache_parent`) and the global dentry shrinker
(`drop_caches` / memory pressure → `prune_dcache_sb`) causes
`shrink_dcache_parent()` to loop indefinitely. This livelock is the
root cause of the use-after-free crash observed in production (see
P2260313060 for the original crash analysis).

### How the bug manifests

**In production** (narrow race window): The livelock occasionally
resolves through specific timing that allows a parent dentry to be
freed and its slab page reused. When a sibling's `__dentry_kill` then
tries `spin_lock(&parent->d_lock)` on the reused memory → page fault in
`queued_spin_lock_slowpath` (Oops).

**With `CONFIG_DCACHE_SHRINK_RACE_DEBUG`** (5ms delay in
`__dentry_kill`): The race is deterministic. `shrink_dcache_parent()`
livelocks on the first iteration and never completes.

### Root cause

In `select_collect()` (the `d_walk` callback used by
`shrink_dcache_parent`), two types of dentries are incorrectly counted
as "found":

1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
   `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
   `dentry_unlist()` to remove the dentry from the parent's children
   list. With the debug delay, the dentry stays dead-but-visible for
   5ms.

2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
   shrinker path (e.g., the global LRU shrinker from `drop_caches`) to
   its own dispose list. These are being processed by that other path
   but slowly (5ms per proc dentry with the debug delay).

When `select_collect` counts these as `found++`,
`shrink_dcache_parent()` sees `data.found > 0` and loops again. But
these dentries can never be collected onto `data.dispose` (dead ones
have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
set), so the loop never makes progress → **infinite loop**.

```
shrink_dcache_parent() loop:

    for (;;) {
        d_walk(parent, &data, select_collect);
        if (!list_empty(&data.dispose)) {
            shrink_dentry_list(&data.dispose);  /* never reached */
            continue;
        }
        if (!data.found)
            break;      /* never reached because found > 0 */
        /* ... loops forever */
    }
```

### Reproducer

**Requirements:**

- `CONFIG_DCACHE_SHRINK_RACE_DEBUG=y` (injects 5ms `mdelay()` in
  `__dentry_kill` for proc dentries)
- `CONFIG_KASAN=y` (optional, for UAF detection)
- `CONFIG_DEBUG_KERNEL=y`

**Debug patch** (apply to `fs/Kconfig` and `fs/dcache.c`):

```diff
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,15 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
 	bool

+config DCACHE_SHRINK_RACE_DEBUG
+	bool "Debug: inject delay in __dentry_kill to widen race window"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  Inject a delay in __dentry_kill() between releasing d_lock and
+	  re-acquiring it, to make the shrink_dentry_list race reproducible
+	  in test environments. Only enable for testing.
+
 config VALIDATE_FS_PARSER

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"

@@ -630,6 +631,16 @@ static struct dentry *__dentry_kill(...)
 	cond_resched();
+#ifdef CONFIG_DCACHE_SHRINK_RACE_DEBUG
+	/*
+	 * Delay proc dentry kills to keep dead dentries in the tree
+	 * longer. With the bug (count < 0 counted as "found" in
+	 * select_collect), d_walk keeps re-finding dead dentries and
+	 * shrink_dcache_parent() loops forever.
+	 */
+	if (dentry->d_sb->s_magic == 0x9fa0 /* PROC_SUPER_MAGIC */)
+		mdelay(5);
+#endif
 	/* now that it's negative, ->d_parent is stable */
```

**Test program** (`test_dcache_race.sh`): The reproducer creates
multi-threaded processes, populates their `/proc/<pid>/task/<tid>/...`
dcache entries, then SIGKILLs them while simultaneously running
`drop_caches` in tight loops. This creates concurrent `proc_flush_pid`
(from dying threads) and `prune_dcache_sb` (from `drop_caches`) paths
competing on the same proc dentries.

```c
/* Key structure:
 * - Fork child with N threads (creates /proc/<pid>/task/<tid>/... entries)
 * - Parent reads all /proc entries to populate dcache
 * - Background threads continuously do: echo 2 > /proc/sys/vm/drop_caches
 * - SIGKILL child -> all threads exit -> concurrent proc_flush_pid
 * - drop_caches shrinker races with proc_flush_pid on same dentries
 */
```

Parameters used: 50 threads/process, 200 iterations, 4 shrinker
threads, 4 reader threads.

**vmtest.toml:**

```toml
[[target]]
name = "dcache-shrink-race"
kernel = "arch/x86/boot/bzImage"
kernel_args = "hung_task_panic=0 softlockup_panic=0 rcupdate.rcu_cpu_stall_suppress=1"
command = "/mnt/vmtest/test_dcache_race.sh"

[target.vm]
memory = "16G"  # KASAN needs extra memory
num_cpus = 8
timeout = 1200
```

### Reproduction results

| Kernel | Result |
|---|---|
| Unfixed + debug delay + KASAN | **FAIL**: livelock on iteration 1, test timed out at 750s |
| Fixed + debug delay + KASAN | **PASS**: all 200 iterations completed, no KASAN/warnings |

### Fix

The fix is in `select_collect()` — stop counting dead dentries and
`DCACHE_SHRINK_LIST` dentries as "found":

```diff
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1448,13 +1459,27 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	if (data->start == dentry)
 		goto out;

-	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-		data->found++;
+	if (dentry->d_lockref.count < 0) {
+		/*
+		 * Dead dentry (lockref_mark_dead sets count negative).
+		 * Another CPU is in the middle of __dentry_kill() and
+		 * will shortly unlink it from the tree. Do not count
+		 * it as "found" --- that causes shrink_dcache_parent()
+		 * to loop indefinitely.
+		 */
+	} else if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/*
+		 * Already on a shrink list, being processed by another
+		 * path (e.g., the global LRU shrinker). Do not count
+		 * it as "found" --- if the other path is slow (e.g.,
+		 * contention on d_lock or filesystem callbacks),
+		 * shrink_dcache_parent() would spin forever waiting for
+		 * them to finish. The other shrinker will handle these
+		 * dentries.
+		 */
 	} else if (!dentry->d_lockref.count) {
 		to_shrink_list(dentry, &data->dispose);
 		data->found++;
-	} else if (dentry->d_lockref.count < 0) {
-		data->found++;
 	}
```

**Why this is correct:**

- **Dead dentries (`count < 0`)**: These are being killed by another
  CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to
  remove them from the parent's children list.
  `shrink_dcache_parent()` doesn't need to wait for them — they'll
  disappear from the tree on their own.

- **`DCACHE_SHRINK_LIST` dentries**: These are already on another
  shrinker's dispose list and will be processed by that path. Counting
  them as "found" forces `shrink_dcache_parent()` to wait for the
  other shrinker to finish, which can take arbitrarily long
  (especially with filesystem callbacks or the debug delay).

- **The `select_collect2` path** (used when `data.found > 0` but
  `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
  separately by setting `data->victim` and processing them directly.
  With this fix, `select_collect2` is only reached when there are
  genuinely unprocessable dentries (count > 0, not dead, not on shrink
  list), not when there are merely in-flight kills or concurrent
  shrinkers.

### Relationship to the production UAF crash

The livelock is the **precursor** to the use-after-free crash seen in
production (P2260313060):

1. Without the debug delay, the `__dentry_kill` race window is
   nanoseconds (just `cond_resched()`).
2. Most of the time, the dead dentry is unlinked before
   `select_collect` finds it → no issue.
3. Occasionally, `select_collect` finds dead dentries and spins
   briefly. During this spinning, the specific timing allows a parent
   dentry to be fully freed (via `dentry_free` → `call_rcu` → slab
   reclaim) and its slab page reused for `kmalloc-96`.
4. When the spinning `shrink_dcache_parent` or a concurrent
   `__dentry_kill` then accesses the freed parent → UAF crash.

The fix prevents the spinning entirely, eliminating both the livelock
and the UAF.
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08 18:28   ` Jeff Layton
@ 2026-04-08 19:26     ` Al Viro
  2026-04-08 21:05       ` Jeff Layton
  0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2026-04-08 19:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:

> ...it turns out that Gustavo had been chasing this independently of me,
> and had Claude do a bit more analysis. I included it below, but here's
> a link that may be more readable. Any thoughts?

Other than rather uncharitable ones about the usefulness of the Turing
Test, you mean?

> **In production** (narrow race window): The livelock occasionally
> resolves through specific timing that allows a parent dentry to be
> freed and its slab page reused.

Livelock is real and known, all right, but do explain what "resolves
through specific timing that allows a parent dentry to be freed" means.
Especially since the reference to the parent is *not* dropped until
after the child has been detached from the tree and DCACHE_DENTRY_KILLED
set on it, with ->d_lock on the child held over that. So
select_collect2() seeing the victim still locked and attached to the
tree has to happen before the grace period for the parent has a chance
to begin. And the rcu_read_lock() grabbed there prevents that grace
period from completing until we do the matching rcu_read_unlock() in
shrink_dcache_parent().

> In `select_collect()` (the `d_walk` callback used by
> `shrink_dcache_parent`), two types of dentries are incorrectly counted

really?

> as "found":
>
> 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> `dentry_unlist()` to remove the dentry from the parent's children list.
> With the debug delay, the dentry stays dead-but-visible for 5ms.

Yes. And? That's the livelock, all right, and it needs fixing, but how
does busy-wait here lead to UAF on anything?

> 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> own dispose list. These are being processed by that other path but
> slowly (5ms per proc dentry with the debug delay).
>
> When `select_collect` counts these as `found++`,
> `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> these dentries can never be collected onto `data.dispose` (dead ones
> have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> set), so the loop never makes progress → **infinite loop**.

They have no business going into data.dispose; for fuck's sake, dentries
on somebody else's shrink list are explicitly fed to shrink_kill().

> **Why this is correct:**

It is not.

> - **Dead dentries (`count < 0`)**: These are being killed by another
> CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> them from the parent's children list. `shrink_dcache_parent()` doesn't
> need to wait for them — they'll disappear from the tree on their own.

... and since they keep their parents busy, we should not leave until
they are gone. For real fun, consider calls from
shrink_dcache_for_umount() - and yes, it *is* possible for another
thread's shrink list to contain dentries from a filesystem being shut
down. Legitimately so.

> - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> shrinker's dispose list and will be processed by that path. Counting
> them as "found" forces `shrink_dcache_parent()` to wait for the other
> shrinker to finish, which can take arbitrarily long (especially with
> filesystem callbacks or the debug delay).

Ditto.

> - **The `select_collect2` path** (used when `data.found > 0` but
> `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> separately by setting `data->victim` and processing them directly. With
> this fix, `select_collect2` is only reached when there are genuinely
> unprocessable dentries (count > 0, not dead, not on shrink list), not
> when there are merely in-flight kills or concurrent shrinkers.

Bollocks, due to the above.

> ### Relationship to the production UAF crash
>
> The livelock is the **precursor** to the use-after-free crash seen in
> production (P2260313060):

... and it still offers zero explanation of the path from livelock to
UAF. It may or may not be real, but there's nothing in all that verbiage
even suggesting what it might be. And the proposed analysis is flat-out
wrong.

As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
as of today).
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08 19:26     ` Al Viro
@ 2026-04-08 21:05       ` Jeff Layton
  2026-04-08 22:43         ` Al Viro
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 21:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, 2026-04-08 at 20:26 +0100, Al Viro wrote:
> On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:
>
> > ...it turns out that Gustavo had been chasing this independently of me,
> > and had Claude do a bit more analysis. I included it below, but here's
> > a link that may be more readable. Any thoughts?
>
> Other than rather uncharitable ones about the usefulness of the Turing
> Test, you mean?
>
> > **In production** (narrow race window): The livelock occasionally
> > resolves through specific timing that allows a parent dentry to be
> > freed and its slab page reused.
>
> Livelock is real and known, all right, but do explain what "resolves
> through specific timing that allows a parent dentry to be freed" means.
> Especially since the reference to the parent is *not* dropped until
> after the child has been detached from the tree and DCACHE_DENTRY_KILLED
> set on it, with ->d_lock on the child held over that. So
> select_collect2() seeing the victim still locked and attached to the
> tree has to happen before the grace period for the parent has a chance
> to begin. And the rcu_read_lock() grabbed there prevents that grace
> period from completing until we do the matching rcu_read_unlock() in
> shrink_dcache_parent().
>
> > In `select_collect()` (the `d_walk` callback used by
> > `shrink_dcache_parent`), two types of dentries are incorrectly counted
>
> really?
>
> > as "found":
> >
> > 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> > `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> > `dentry_unlist()` to remove the dentry from the parent's children list.
> > With the debug delay, the dentry stays dead-but-visible for 5ms.
>
> Yes. And? That's the livelock, all right, and it needs fixing, but how
> does busy-wait here lead to UAF on anything?
>
> > 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> > shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> > own dispose list. These are being processed by that other path but
> > slowly (5ms per proc dentry with the debug delay).
> >
> > When `select_collect` counts these as `found++`,
> > `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> > these dentries can never be collected onto `data.dispose` (dead ones
> > have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> > set), so the loop never makes progress → **infinite loop**.
>
> They have no business going into data.dispose; for fuck's sake, dentries
> on somebody else's shrink list are explicitly fed to shrink_kill().
>
> > **Why this is correct:**
>
> It is not.
>
> > - **Dead dentries (`count < 0`)**: These are being killed by another
> > CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> > them from the parent's children list. `shrink_dcache_parent()` doesn't
> > need to wait for them — they'll disappear from the tree on their own.
>
> ... and since they keep their parents busy, we should not leave until
> they are gone. For real fun, consider calls from
> shrink_dcache_for_umount() - and yes, it *is* possible for another
> thread's shrink list to contain dentries from a filesystem being shut
> down. Legitimately so.
>
> > - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> > shrinker's dispose list and will be processed by that path. Counting
> > them as "found" forces `shrink_dcache_parent()` to wait for the other
> > shrinker to finish, which can take arbitrarily long (especially with
> > filesystem callbacks or the debug delay).
>
> Ditto.
>
> > - **The `select_collect2` path** (used when `data.found > 0` but
> > `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> > separately by setting `data->victim` and processing them directly. With
> > this fix, `select_collect2` is only reached when there are genuinely
> > unprocessable dentries (count > 0, not dead, not on shrink list), not
> > when there are merely in-flight kills or concurrent shrinkers.
>
> Bollocks, due to the above.
>
> > ### Relationship to the production UAF crash
> >
> > The livelock is the **precursor** to the use-after-free crash seen in
> > production (P2260313060):
>
> ... and it still offers zero explanation of the path from livelock to
> UAF. It may or may not be real, but there's nothing in all that verbiage
> even suggesting what it might be. And the proposed analysis is flat-out
> wrong.
>
> As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
> as of today).

Thanks for taking a look, Al. We'll keep looking at this thing and see
if we can collect more info, and come up with a better theory of the
crash. I'll dig deeper into the vmcore tomorrow.

FWIW, I've been running a bpftrace script across a swath of machines
that emulates the WARN_ON_ONCE() in the original patch I proposed. We've
had a few crashes while it's running and haven't yet collected any stack
traces. The script only runs about 80% of the time though, so it's
possible we're just getting unlucky, but I sort of doubt it. It seems
more likely that these dentries are racing onto the list after we've
passed the check.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru 2026-04-08 21:05 ` Jeff Layton @ 2026-04-08 22:43 ` Al Viro 0 siblings, 0 replies; 8+ messages in thread From: Al Viro @ 2026-04-08 22:43 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold On Wed, Apr 08, 2026 at 05:05:41PM -0400, Jeff Layton wrote: > > ... and it still offers zero explanation of the path from livelock to > > UAF. It may or may not be real, but there's nothing in all that verbiage > > even suggesting what it might be. And proposed analysis is flat-out > > wrong. > > > > As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next > > as of today). > > Thanks for taking a look, Al. We'll keep looking at this thing and see > if we can collect more info, and come up with a better theory of the > crash. I'll dig deeper into the vmcore tomorrow. OK... FWIW, on the trees up to 7.0-rc7: * dentry passed to select_collect2() must be still attached to the tree (no DCACHE_DENTRY_KILLED on it) * caller holds ->d_lock on that dentry. * in case when its ->d_lockref.count is positive, it's busy (at least at the moment) and there's nothing to be done; move on. * in case when its ->d_lockref.count is negative, it's getting killed right now. That's a busy-wait case - we ignore it on this pass, but remember that we'll need to rescan. * in case when its ->d_lockref.count is 0 and it's not on a shrink list: it's evictable, move it to data.dispose, no problem. * in case when its ->d_lockref.count is 0 and it *is* on a shrink list: try to steal it. We can't do that from the d_walk() callback itself - it's a non-blocking environment, to start with, so no evictions are possible there. 
What we can do is to have it returned to shrink_dcache_tree() - grab rcu_read_lock() to make sure it won't get freed (it hasn't reached dentry_unlist(), let alone dentry_free(), so the grace period for freeing hasn't started yet) and tell d_walk() to stop and return to caller immediately. Note that we deliberately return without rcu_read_unlock() - that's what protects the victim from getting freed under us. Other thread might or might not get around to starting the eviction of the victim, but whether it does that or not, it won't get around to freeing it. In shrink_dcache_tree() we grab ->d_lock on the victim again (it had been dropped by d_walk() when the callback returned). Then we call lock_for_kill(), which starts with checking that ->d_lockref.count is 0. If it isn't, there's nothing to be done to that sucker; it either went busy (in which case we should ignore it and move on) or somebody (likely the owner of the shrink list it had been on) got around to evicting it; in the latter case we need to wait until the damn thing is killed. *IF* refcount is still 0, we carefully acquire the ->i_lock on its inode (if any). Note that we are still holding rcu_read_lock(), so it can't get freed under us even if we have to drop and regain ->d_lock. We do need to recheck that refcount is still zero if we do that, obviously. Failing lock_for_kill() is either due to dentry becoming busy (and thus to be skipped) or due to dentry having been passed to __dentry_kill(). In the latter case we busy-wait, same as we'd do if we saw negative ->d_lockref.count in select_collect2(). Successful lock_for_kill() is followed by evicting the sucker; we do *not* remove it (or its ancestors, if it had been the only thing holding them busy) from whatever shrink list they'd been on. 
Anything on shrink lists gets reduced to the state of a passive chunk
of memory no longer connected to filesystem objects, marked with
DCACHE_DENTRY_KILLED and left for the owner of the shrink list to free
once it gets around to that. There's no urgency anymore -

	rcu_read_lock();
	if (!lock_for_kill(dentry)) {
		bool can_free;

		rcu_read_unlock();
		d_shrink_del(dentry);
		can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
		spin_unlock(&dentry->d_lock);
		if (can_free)
			dentry_free(dentry);
		continue;
	}

on the shrink_dentry_list() side will have lock_for_kill() return false
(->d_lockref.count already negative), remove the sucker from the list,
see DCACHE_DENTRY_KILLED on it, unlock the sucker and free it. Note
that nothing in that sequence touches any fs objects - it's just the
disposal of an inert chunk of memory now.

Note that we can't be stealing from ourselves - if anything had been
added to data.dispose, d_walk() sees D_WALK_QUIT or D_WALK_NORETRY and
either buggers off immediately or sets retry to false. It won't restart
walking the tree in either case, so there's no way for the same dentry
to be revisited by it. Having one shrink_dcache_tree() steal from
another is OK - see the conditions under which shrink_dentry_list() and
__dentry_kill() call dentry_free() (called 'can_free' in both).

Getting rid of the busy-wait is handled with a fairly small
modification to that: select_collect2() treats a negative refcount the
same as the 0-and-on-shrink-list case, and shrink_dcache_tree() checks
whether the victim has a negative refcount and no DCACHE_DENTRY_KILLED
yet. In that case it adds a local object (a struct completion_list
node) to a list hanging off dentry->waiters, evicts whatever else it
might've collected, then waits for dentry_unlist() on the victim to
have called complete() on the node.completion for everything on
->waiters. The rest of the logic is unchanged - if the victim is not in
the middle of eviction, we try to steal it, etc., same as in the
mainline.
end of thread, other threads:[~2026-04-08 22:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08  6:42 ` Al Viro
2026-04-08 11:10 ` Jeff Layton
2026-04-08 18:28 ` Jeff Layton
2026-04-08 19:26 ` Al Viro
2026-04-08 21:05 ` Jeff Layton
2026-04-08 22:43 ` Al Viro