* [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
@ 2026-04-06 16:44 Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08 6:42 ` Al Viro
0 siblings, 2 replies; 7+ messages in thread
From: Jeff Layton @ 2026-04-06 16:44 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara
Cc: linux-fsdevel, linux-kernel, Jeff Layton
We've had a number of panics that seem to occur on hosts with heavy
process churn. The symptom is a panic when invalidating /proc entries
as a task exits:
queued_spin_lock_slowpath+0x153/0x270
shrink_dentry_list+0x11d/0x220
shrink_dcache_parent+0x68/0x110
d_invalidate+0x90/0x170
proc_invalidate_siblings_dcache+0xc8/0x140
release_task+0x41b/0x510
do_exit+0x3d8/0x9d0
do_group_exit+0x7d/0xa0
get_signal+0x2a9/0x6a0
arch_do_signal_or_restart+0x1a/0x1c0
syscall_exit_to_user_mode+0xe6/0x1c0
do_syscall_64+0x74/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The problem appears to be a UAF. It's freeing a shrink list of
dentries, but one of the dentries on it has already been freed.
The d_lru field is always list_del_init()'ed, and so should be empty
whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
We've had some of these panics internally for a while. Claude also
noted that these syzbot reports may be related:
https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
https://syzbot.org/bug?extid=e8b3520b53e78e90034e
https://syzbot.org/bug?extid=ad14fd37e76c579511d0
So far, I've been unable to spot the bug. Hoping this will make it
easier.
---
fs/dcache.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..c6f475d940e3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON_ONCE(!list_empty(&dentry->d_lru));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
---
base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
change-id: 20260403-dcache-warn-a493b0e3c877
Best regards,
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Jan Kara @ 2026-04-07 10:51 UTC (permalink / raw)
To: Jeff Layton
Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel,
linux-kernel
On Mon 06-04-26 12:44:13, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
> queued_spin_lock_slowpath+0x153/0x270
> shrink_dentry_list+0x11d/0x220
> shrink_dcache_parent+0x68/0x110
> d_invalidate+0x90/0x170
> proc_invalidate_siblings_dcache+0xc8/0x140
> release_task+0x41b/0x510
> do_exit+0x3d8/0x9d0
> do_group_exit+0x7d/0xa0
> get_signal+0x2a9/0x6a0
> arch_do_signal_or_restart+0x1a/0x1c0
> syscall_exit_to_user_mode+0xe6/0x1c0
> do_syscall_64+0x74/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.
>
> The d_lru field is always list_del_init()'ed, and so should be empty
> whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
Yes, looks like a sensible assert. Feel free to add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> We've had some of these panics internally for a while. Additionally,
> Claude also noted that these syzbot reports may be related:
>
> https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
> https://syzbot.org/bug?extid=e8b3520b53e78e90034e
> https://syzbot.org/bug?extid=ad14fd37e76c579511d0
>
> So far, I've been unable to spot the bug. Hoping this will make it
> easier.
> ---
> fs/dcache.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7ba1801d8132..c6f475d940e3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
> static void dentry_free(struct dentry *dentry)
> {
> WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> + WARN_ON_ONCE(!list_empty(&dentry->d_lru));
> if (unlikely(dname_external(dentry))) {
> struct external_name *p = external_name(dentry);
> if (likely(atomic_dec_and_test(&p->count))) {
>
> ---
> base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
> change-id: 20260403-dcache-warn-a493b0e3c877
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>
>
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Al Viro @ 2026-04-08 6:42 UTC (permalink / raw)
To: Jeff Layton; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel
On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
> queued_spin_lock_slowpath+0x153/0x270
> shrink_dentry_list+0x11d/0x220
> shrink_dcache_parent+0x68/0x110
> d_invalidate+0x90/0x170
> proc_invalidate_siblings_dcache+0xc8/0x140
> release_task+0x41b/0x510
> do_exit+0x3d8/0x9d0
> do_group_exit+0x7d/0xa0
> get_signal+0x2a9/0x6a0
> arch_do_signal_or_restart+0x1a/0x1c0
> syscall_exit_to_user_mode+0xe6/0x1c0
> do_syscall_64+0x74/0x130
> entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.
That, or dentry pointer passed to shrink_dcache_parent() is a
complete garbage - e.g. due to struct pid having already been
freed. Might make sense to try and get a crash dump and poke
around...
Which kernels have you seen it on?
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Jeff Layton @ 2026-04-08 11:10 UTC (permalink / raw)
To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel
On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> >
> > queued_spin_lock_slowpath+0x153/0x270
> > shrink_dentry_list+0x11d/0x220
> > shrink_dcache_parent+0x68/0x110
> > d_invalidate+0x90/0x170
> > proc_invalidate_siblings_dcache+0xc8/0x140
> > release_task+0x41b/0x510
> > do_exit+0x3d8/0x9d0
> > do_group_exit+0x7d/0xa0
> > get_signal+0x2a9/0x6a0
> > arch_do_signal_or_restart+0x1a/0x1c0
> > syscall_exit_to_user_mode+0xe6/0x1c0
> > do_syscall_64+0x74/0x130
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
>
> That, or dentry pointer passed to shrink_dcache_parent() is a
> complete garbage - e.g. due to struct pid having already been
> freed. Might make sense to try and get a crash dump and poke
> around...
>
I'm trying to get one. We had an issue that prevented the machines that
were crashing this way from getting a coredump. Hoping that'll be
resolved soon and we can get it.
> Which kernels have you seen it on?
v6.11 and v6.13 so far. The crash seems to be pretty workload-dependent
(a lot of processes rapidly starting and exiting). I'm not sure this
workload is running on later kernels yet, so I don't know whether this
is something that has already been fixed.
Thanks,
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Jeff Layton @ 2026-04-08 18:28 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm,
gustavold
On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> >
> > queued_spin_lock_slowpath+0x153/0x270
> > shrink_dentry_list+0x11d/0x220
> > shrink_dcache_parent+0x68/0x110
> > d_invalidate+0x90/0x170
> > proc_invalidate_siblings_dcache+0xc8/0x140
> > release_task+0x41b/0x510
> > do_exit+0x3d8/0x9d0
> > do_group_exit+0x7d/0xa0
> > get_signal+0x2a9/0x6a0
> > arch_do_signal_or_restart+0x1a/0x1c0
> > syscall_exit_to_user_mode+0xe6/0x1c0
> > do_syscall_64+0x74/0x130
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
>
> That, or dentry pointer passed to shrink_dcache_parent() is a
> complete garbage - e.g. due to struct pid having already been
> freed. Might make sense to try and get a crash dump and poke
> around...
>
Chris was able to track down a vmcore for me.
No, it actually does seem to be what we thought originally. The parent
is fine, but one of the dentries under it has been freed and
reallocated:
>>> stack
#0 queued_spin_lock_slowpath (kernel/locking/qspinlock.c:471:3)
#1 spin_lock (./include/linux/spinlock.h:351:2)
#2 lock_for_kill (fs/dcache.c:675:3)
#3 shrink_dentry_list (fs/dcache.c:1086:8)
#4 shrink_dcache_parent (fs/dcache.c:0)
#5 d_invalidate (fs/dcache.c:1614:2)
#6 proc_invalidate_siblings_dcache (fs/proc/inode.c:142:5)
#7 proc_flush_pid (fs/proc/base.c:3478:2)
#8 release_task (kernel/exit.c:279:2)
#9 exit_notify (kernel/exit.c:775:3)
#10 do_exit (kernel/exit.c:958:2)
#11 do_group_exit (kernel/exit.c:1087:2)
#12 get_signal (kernel/signal.c:3036:3)
#13 arch_do_signal_or_restart (arch/x86/kernel/signal.c:337:6)
#14 exit_to_user_mode_loop (kernel/entry/common.c:111:4)
#15 exit_to_user_mode_prepare (./include/linux/entry-common.h:329:13)
#16 __syscall_exit_to_user_mode_work (kernel/entry/common.c:207:2)
#17 syscall_exit_to_user_mode (kernel/entry/common.c:218:2)
#18 do_syscall_64 (arch/x86/entry/common.c:89:2)
#19 entry_SYSCALL_64+0x6c/0xaa (arch/x86/entry/entry_64.S:121)
#20 0x7f49ead2c482
>>> identify_address(stack[3]["dentry"])
'slab object: kmalloc-96+0x48'
>>> identify_address(stack[4]["parent"])
'slab object: dentry+0x0'
...it turns out that Gustavo had been chasing this independently of me,
and had Claude do a bit more analysis. I've included it below, but here's
a link that may be more readable. Any thoughts?
https://markdownpastebin.com/?id=7c258413493b4144ab27d5cdcb8ae5b4
-------------8<----------------------
## dcache: `shrink_dcache_parent()` livelock leading to use-after-free
### Summary
A race between concurrent proc dentry invalidation (`proc_flush_pid` →
`d_invalidate` → `shrink_dcache_parent`) and the global dentry shrinker
(`drop_caches` / memory pressure → `prune_dcache_sb`) causes
`shrink_dcache_parent()` to loop indefinitely. This livelock is the
root cause of the use-after-free crash observed in production (see
P2260313060 for the original crash analysis).
### How the bug manifests
**In production** (narrow race window): The livelock occasionally
resolves through specific timing that allows a parent dentry to be
freed and its slab page reused. When a sibling's `__dentry_kill` then
tries `spin_lock(&parent->d_lock)` on the reused memory, the result is
a page fault in `queued_spin_lock_slowpath` (an Oops).
**With `CONFIG_DCACHE_SHRINK_RACE_DEBUG`** (5ms delay in
`__dentry_kill`): The race is deterministic. `shrink_dcache_parent()`
livelocks on the first iteration and never completes.
### Root cause
In `select_collect()` (the `d_walk` callback used by
`shrink_dcache_parent`), two types of dentries are incorrectly counted
as "found":
1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
`lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
`dentry_unlist()` to remove the dentry from the parent's children list.
With the debug delay, the dentry stays dead-but-visible for 5ms.
2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
own dispose list. These are being processed by that other path but
slowly (5ms per proc dentry with the debug delay).
When `select_collect` counts these as `found++`,
`shrink_dcache_parent()` sees `data.found > 0` and loops again. But
these dentries can never be collected onto `data.dispose` (dead ones
have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
set), so the loop never makes progress → **infinite loop**.
```
shrink_dcache_parent() loop:

	for (;;) {
		d_walk(parent, &data, select_collect);
		if (!list_empty(&data.dispose)) {
			shrink_dentry_list(&data.dispose); // never reached
			continue;
		}
		if (!data.found)
			break; // never reached because found > 0
		// ... loops forever
	}
```
### Reproducer
**Requirements:**
- `CONFIG_DCACHE_SHRINK_RACE_DEBUG=y` (injects 5ms `mdelay()` in
`__dentry_kill` for proc dentries)
- `CONFIG_KASAN=y` (optional, for UAF detection)
- `CONFIG_DEBUG_KERNEL=y`
**Debug patch** (apply to `fs/Kconfig` and `fs/dcache.c`):
```diff
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,15 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
 	bool
 
+config DCACHE_SHRINK_RACE_DEBUG
+	bool "Debug: inject delay in __dentry_kill to widen race window"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  Inject a delay in __dentry_kill() between releasing d_lock and
+	  re-acquiring it, to make the shrink_dentry_list race reproducible
+	  in test environments. Only enable for testing.
+
 config VALIDATE_FS_PARSER
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"
@@ -630,6 +631,16 @@ static struct dentry *__dentry_kill(...)
 	cond_resched();
+#ifdef CONFIG_DCACHE_SHRINK_RACE_DEBUG
+	/*
+	 * Delay proc dentry kills to keep dead dentries in the tree
+	 * longer. With the bug (count < 0 counted as "found" in
+	 * select_collect), d_walk keeps re-finding dead dentries and
+	 * shrink_dcache_parent() loops forever.
+	 */
+	if (dentry->d_sb->s_magic == 0x9fa0 /* PROC_SUPER_MAGIC */)
+		mdelay(5);
+#endif
 	/* now that it's negative, ->d_parent is stable */
```
**Test program** (`test_dcache_race.sh`):
The reproducer creates multi-threaded processes, populates their
`/proc/<pid>/task/<tid>/...` dcache entries, then SIGKILLs them while
simultaneously running `drop_caches` in tight loops. This creates
concurrent `proc_flush_pid` (from dying threads) and `prune_dcache_sb`
(from `drop_caches`) paths competing on the same proc dentries.
```c
/* Key structure:
 * - Fork child with N threads (creates /proc/<pid>/task/<tid>/... entries)
 * - Parent reads all /proc entries to populate dcache
 * - Background threads continuously do: echo 2 > /proc/sys/vm/drop_caches
 * - SIGKILL child -> all threads exit -> concurrent proc_flush_pid
 * - drop_caches shrinker races with proc_flush_pid on same dentries
 */
```
Parameters used: 50 threads/process, 200 iterations, 4 shrinker
threads, 4 reader threads.
**vmtest.toml:**
```toml
[[target]]
name = "dcache-shrink-race"
kernel = "arch/x86/boot/bzImage"
kernel_args = "hung_task_panic=0 softlockup_panic=0 rcupdate.rcu_cpu_stall_suppress=1"
command = "/mnt/vmtest/test_dcache_race.sh"
[target.vm]
memory = "16G" # KASAN needs extra memory
num_cpus = 8
timeout = 1200
```
### Reproduction results
| Kernel | Result |
|---|---|
| Unfixed + debug delay + KASAN | **FAIL**: livelock on iteration 1, test timed out at 750s |
| Fixed + debug delay + KASAN | **PASS**: all 200 iterations completed, no KASAN reports or warnings |
### Fix
The fix is in `select_collect()` — stop counting dead dentries and
`DCACHE_SHRINK_LIST` dentries as "found":
```diff
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1448,13 +1459,27 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	if (data->start == dentry)
 		goto out;
 
-	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-		data->found++;
+	if (dentry->d_lockref.count < 0) {
+		/*
+		 * Dead dentry (lockref_mark_dead sets count negative).
+		 * Another CPU is in the middle of __dentry_kill() and
+		 * will shortly unlink it from the tree. Do not count
+		 * it as "found" --- that causes shrink_dcache_parent()
+		 * to loop indefinitely.
+		 */
+	} else if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/*
+		 * Already on a shrink list, being processed by another
+		 * path (e.g., the global LRU shrinker). Do not count
+		 * it as "found" --- if the other path is slow (e.g.,
+		 * contention on d_lock or filesystem callbacks),
+		 * shrink_dcache_parent() would spin forever waiting for
+		 * them to finish. The other shrinker will handle these
+		 * dentries.
+		 */
 	} else if (!dentry->d_lockref.count) {
 		to_shrink_list(dentry, &data->dispose);
 		data->found++;
-	} else if (dentry->d_lockref.count < 0) {
-		data->found++;
 	}
```
**Why this is correct:**
- **Dead dentries (`count < 0`)**: These are being killed by another
CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
them from the parent's children list. `shrink_dcache_parent()` doesn't
need to wait for them — they'll disappear from the tree on their own.
- **`DCACHE_SHRINK_LIST` dentries**: These are already on another
shrinker's dispose list and will be processed by that path. Counting
them as "found" forces `shrink_dcache_parent()` to wait for the other
shrinker to finish, which can take arbitrarily long (especially with
filesystem callbacks or the debug delay).
- **The `select_collect2` path** (used when `data.found > 0` but
`data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
separately by setting `data->victim` and processing them directly. With
this fix, `select_collect2` is only reached when there are genuinely
unprocessable dentries (count > 0, not dead, not on shrink list), not
when there are merely in-flight kills or concurrent shrinkers.
### Relationship to the production UAF crash
The livelock is the **precursor** to the use-after-free crash seen in
production (P2260313060):
1. Without the debug delay, the `__dentry_kill` race window is
nanoseconds (just `cond_resched()`).
2. Most of the time, the dead dentry is unlinked before
`select_collect` finds it → no issue.
3. Occasionally, `select_collect` finds dead dentries and spins
briefly. During this spinning, the specific timing allows a parent
dentry to be fully freed (via `dentry_free` → `call_rcu` → slab
reclaim) and its slab page reused for `kmalloc-96`.
4. When the spinning `shrink_dcache_parent` or a concurrent
`__dentry_kill` then accesses the freed parent → UAF crash.
The fix prevents the spinning entirely, eliminating both the livelock
and the UAF.
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Al Viro @ 2026-04-08 19:26 UTC (permalink / raw)
To: Jeff Layton
Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm,
gustavold
On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:
> ...it turns out that Gustavo had been chasing this independently to me,
> and had Claude do a bit more analysis. I included it below, but here's
> a link that may be more readable. Any thoughts?
Other than rather uncharitable ones about the usefulness of the Turing Test,
you mean?
> **In production** (narrow race window): The livelock occasionally
> resolves through specific timing that allows a parent dentry to be
> freed and its slab page reused.
Livelock is real and known, all right, but do explain what does "resolves
through specific timing that allows a parent dentry to be freed" mean.
Especially since the reference to parent is *not* dropped until after
having child detached from the tree and DCACHE_DENTRY_KILLED on it,
with ->d_lock on child held over that. So select_collect2() seeing
the victim still locked and attached to the tree has to happen before
the grace period for parent has a chance to begin. And rcu_read_lock()
grabbed there prevents that grace period from completing until we
do the matching rcu_read_unlock() in shrink_dcache_parent().
> In `select_collect()` (the `d_walk` callback used by
> `shrink_dcache_parent`), two types of dentries are incorrectly counted
really?
> as "found":
> 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> `dentry_unlist()` to remove the dentry from the parent's children list.
> With the debug delay, the dentry stays dead-but-visible for 5ms.
Yes. And? That's the livelock, all right, and it needs fixing,
but how does busy-wait here lead to UAF on anything?
> 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> own dispose list. These are being processed by that other path but
> slowly (5ms per proc dentry with the debug delay).
>
> When `select_collect` counts these as `found++`,
> `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> these dentries can never be collected onto `data.dispose` (dead ones
> have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> set), so the loop never makes progress → **infinite loop**.
They have no business going into data.dispose; for fuck sake, dentries
on somebody else's shrink list are explicitly fed to shrink_kill().
> **Why this is correct:**
It is not.
> - **Dead dentries (`count < 0`)**: These are being killed by another
> CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> them from the parent's children list. `shrink_dcache_parent()` doesn't
> need to wait for them — they'll disappear from the tree on their own.
... and since they keep their parents busy, we should not leave until
they are gone. For real fun, consider calls from shrink_dcache_for_umount() -
and yes, it *is* possible for another thread's shrink list to contain
dentries from filesystem being shut down. Legitimately so.
> - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> shrinker's dispose list and will be processed by that path. Counting
> them as "found" forces `shrink_dcache_parent()` to wait for the other
> shrinker to finish, which can take arbitrarily long (especially with
> filesystem callbacks or the debug delay).
Ditto.
> - **The `select_collect2` path** (used when `data.found > 0` but
> `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> separately by setting `data->victim` and processing them directly. With
> this fix, `select_collect2` is only reached when there are genuinely
> unprocessable dentries (count > 0, not dead, not on shrink list), not
> when there are merely in-flight kills or concurrent shrinkers.
Bollocks, due to above.
> ### Relationship to the production UAF crash
>
> The livelock is the **precursor** to the use-after-free crash seen in
> production (P2260313060):
... and it still offers zero explanation of the path from livelock to
UAF. It may or may not be real, but there's nothing in all that verbiage
even suggesting what it might be. And proposed analysis is flat-out
wrong.
As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
as of today).
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
From: Jeff Layton @ 2026-04-08 21:05 UTC (permalink / raw)
To: Al Viro
Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm,
gustavold
On Wed, 2026-04-08 at 20:26 +0100, Al Viro wrote:
> On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:
>
> > ...it turns out that Gustavo had been chasing this independently to me,
> > and had Claude do a bit more analysis. I included it below, but here's
> > a link that may be more readable. Any thoughts?
>
> Other than rather uncharitable ones about the usefulness of the Turing Test,
> you mean?
>
> > **In production** (narrow race window): The livelock occasionally
> > resolves through specific timing that allows a parent dentry to be
> > freed and its slab page reused.
>
> Livelock is real and known, all right, but do explain what does "resolves
> through specific timing that allows a parent dentry to be freed" mean.
> Especially since the reference to parent is *not* dropped until after
> having child detached from the tree and DCACHE_DENTRY_KILLED on it,
> with ->d_lock on child held over that. So select_collect2() seeing
> the victim still locked and attached to the tree has to happen before
> the grace period for parent has a chance to begin. And rcu_read_lock()
> grabbed there prevents that grace period from completing until we
> do the matching rcu_read_unlock() in shrink_dcache_parent().
>
> > In `select_collect()` (the `d_walk` callback used by
> > `shrink_dcache_parent`), two types of dentries are incorrectly counted
> really?
> > as "found":
>
> > 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> > `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> > `dentry_unlist()` to remove the dentry from the parent's children list.
> > With the debug delay, the dentry stays dead-but-visible for 5ms.
>
> Yes. And? That's the livelock, all right, and it needs fixing,
> but how does busy-wait here lead to UAF on anything?
>
> > 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> > shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> > own dispose list. These are being processed by that other path but
> > slowly (5ms per proc dentry with the debug delay).
> >
> > When `select_collect` counts these as `found++`,
> > `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> > these dentries can never be collected onto `data.dispose` (dead ones
> > have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> > set), so the loop never makes progress → **infinite loop**.
>
> They have no business going into data.dispose; for fuck sake, dentries
> on somebody else's shrink list are explicitly fed to shrink_kill().
>
> > **Why this is correct:**
>
> It is not.
>
> > - **Dead dentries (`count < 0`)**: These are being killed by another
> > CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> > them from the parent's children list. `shrink_dcache_parent()` doesn't
> > need to wait for them — they'll disappear from the tree on their own.
>
> ... and since they keep their parents busy, we should not leave until
> they are gone. For real fun, consider calls from shrink_dcache_for_umount() -
> and yes, it *is* possible for another thread's shrink list to contain
> dentries from filesystem being shut down. Legitimately so.
>
> > - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> > shrinker's dispose list and will be processed by that path. Counting
> > them as "found" forces `shrink_dcache_parent()` to wait for the other
> > shrinker to finish, which can take arbitrarily long (especially with
> > filesystem callbacks or the debug delay).
>
> Ditto.
>
> > - **The `select_collect2` path** (used when `data.found > 0` but
> > `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> > separately by setting `data->victim` and processing them directly. With
> > this fix, `select_collect2` is only reached when there are genuinely
> > unprocessable dentries (count > 0, not dead, not on shrink list), not
> > when there are merely in-flight kills or concurrent shrinkers.
>
> Bollocks, due to above.
>
> > ### Relationship to the production UAF crash
> >
> > The livelock is the **precursor** to the use-after-free crash seen in
> > production (P2260313060):
>
> ... and it still offers zero explanation of the path from livelock to
> UAF. It may or may not be real, but there's nothing in all that verbiage
> even suggesting what it might be. And proposed analysis is flat-out
> wrong.
>
> As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
> as of today).
Thanks for taking a look, Al. We'll keep investigating and see if we can
collect more info and come up with a better theory of the crash. I'll
dig deeper into the vmcore tomorrow.
FWIW, I've been running a bpftrace script across a swath of machines
that emulates the WARN_ON_ONCE() in the original patch I proposed.
We've had a few crashes while it's running and haven't yet collected
any stack traces.
The script only runs about 80% of the time though, so it's possible
we're just getting unlucky, but I sort of doubt it. It seems more
likely that these dentries are racing onto the list after we've passed
the check.
--
Jeff Layton <jlayton@kernel.org>