* [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
@ 2026-04-06 16:44 Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08 6:42 ` Al Viro
0 siblings, 2 replies; 8+ messages in thread
From: Jeff Layton @ 2026-04-06 16:44 UTC (permalink / raw)
To: Alexander Viro, Christian Brauner, Jan Kara
Cc: linux-fsdevel, linux-kernel, Jeff Layton
We've had a number of panics that seem to occur on hosts with heavy
process churn. The symptoms are a panic when invalidating /proc entries
as a task is exiting:
queued_spin_lock_slowpath+0x153/0x270
shrink_dentry_list+0x11d/0x220
shrink_dcache_parent+0x68/0x110
d_invalidate+0x90/0x170
proc_invalidate_siblings_dcache+0xc8/0x140
release_task+0x41b/0x510
do_exit+0x3d8/0x9d0
do_group_exit+0x7d/0xa0
get_signal+0x2a9/0x6a0
arch_do_signal_or_restart+0x1a/0x1c0
syscall_exit_to_user_mode+0xe6/0x1c0
do_syscall_64+0x74/0x130
entry_SYSCALL_64_after_hwframe+0x4b/0x53
The problem appears to be a UAF. It's freeing a shrink list of
dentries, but one of the dentries on it has already been freed.
The d_lru field is always list_del_init()'ed, and so should be empty
whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
We've had some of these panics internally for a while. Additionally,
Claude also noted that these syzbot reports may be related:
https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
https://syzbot.org/bug?extid=e8b3520b53e78e90034e
https://syzbot.org/bug?extid=ad14fd37e76c579511d0
So far, I've been unable to spot the bug. Hoping this will make it
easier.
---
fs/dcache.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..c6f475d940e3 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON_ONCE(!list_empty(&dentry->d_lru));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
---
base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
change-id: 20260403-dcache-warn-a493b0e3c877
Best regards,
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
@ 2026-04-07 10:51 ` Jan Kara
  2026-04-08  6:42 ` Al Viro
  1 sibling, 0 replies; 8+ messages in thread
From: Jan Kara @ 2026-04-07 10:51 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Alexander Viro, Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Mon 06-04-26 12:44:13, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
>   queued_spin_lock_slowpath+0x153/0x270
>   shrink_dentry_list+0x11d/0x220
>   shrink_dcache_parent+0x68/0x110
>   d_invalidate+0x90/0x170
>   proc_invalidate_siblings_dcache+0xc8/0x140
>   release_task+0x41b/0x510
>   do_exit+0x3d8/0x9d0
>   do_group_exit+0x7d/0xa0
>   get_signal+0x2a9/0x6a0
>   arch_do_signal_or_restart+0x1a/0x1c0
>   syscall_exit_to_user_mode+0xe6/0x1c0
>   do_syscall_64+0x74/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.
>
> The d_lru field is always list_del_init()'ed, and so should be empty
> whenever a dentry is freed. Add a WARN_ON_ONCE() whenever it isn't.
>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Yes, looks like a sensible assert. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> We've had some of these panics internally for a while. Additionally,
> Claude also noted that these syzbot reports may be related:
>
> https://syzbot.org/bug?extid=0aee5e8066eddbbe7397
> https://syzbot.org/bug?extid=e8b3520b53e78e90034e
> https://syzbot.org/bug?extid=ad14fd37e76c579511d0
>
> So far, I've been unable to spot the bug. Hoping this will make it
> easier.
> ---
> fs/dcache.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 7ba1801d8132..c6f475d940e3 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -429,6 +429,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
>  static void dentry_free(struct dentry *dentry)
>  {
>  	WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> +	WARN_ON_ONCE(!list_empty(&dentry->d_lru));
>  	if (unlikely(dname_external(dentry))) {
>  		struct external_name *p = external_name(dentry);
>  		if (likely(atomic_dec_and_test(&p->count))) {
>
> ---
> base-commit: d8a9a4b11a137909e306e50346148fc5c3b63f9d
> change-id: 20260403-dcache-warn-a493b0e3c877
>
> Best regards,
> --
> Jeff Layton <jlayton@kernel.org>

--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
  2026-04-07 10:51 ` Jan Kara
@ 2026-04-08  6:42 ` Al Viro
  2026-04-08 11:10   ` Jeff Layton
  2026-04-08 18:28   ` Jeff Layton
  1 sibling, 2 replies; 8+ messages in thread
From: Al Viro @ 2026-04-08 6:42 UTC (permalink / raw)
  To: Jeff Layton; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> We've had a number of panics that seem to occur on hosts with heavy
> process churn. The symptoms are a panic when invalidating /proc entries
> as a task is exiting:
>
>   queued_spin_lock_slowpath+0x153/0x270
>   shrink_dentry_list+0x11d/0x220
>   shrink_dcache_parent+0x68/0x110
>   d_invalidate+0x90/0x170
>   proc_invalidate_siblings_dcache+0xc8/0x140
>   release_task+0x41b/0x510
>   do_exit+0x3d8/0x9d0
>   do_group_exit+0x7d/0xa0
>   get_signal+0x2a9/0x6a0
>   arch_do_signal_or_restart+0x1a/0x1c0
>   syscall_exit_to_user_mode+0xe6/0x1c0
>   do_syscall_64+0x74/0x130
>   entry_SYSCALL_64_after_hwframe+0x4b/0x53
>
> The problem appears to be a UAF. It's freeing a shrink list of
> dentries, but one of the dentries on it has already been freed.

That, or the dentry pointer passed to shrink_dcache_parent() is complete
garbage - e.g. due to the struct pid having already been freed. Might
make sense to try and get a crash dump and poke around...

Which kernels have you seen it on?
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08  6:42 ` Al Viro
@ 2026-04-08 11:10   ` Jeff Layton
  2026-04-08 18:28   ` Jeff Layton
  1 sibling, 0 replies; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 11:10 UTC (permalink / raw)
  To: Al Viro; +Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel

On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> >
> >   queued_spin_lock_slowpath+0x153/0x270
> >   shrink_dentry_list+0x11d/0x220
> >   shrink_dcache_parent+0x68/0x110
> >   d_invalidate+0x90/0x170
> >   proc_invalidate_siblings_dcache+0xc8/0x140
> >   release_task+0x41b/0x510
> >   do_exit+0x3d8/0x9d0
> >   do_group_exit+0x7d/0xa0
> >   get_signal+0x2a9/0x6a0
> >   arch_do_signal_or_restart+0x1a/0x1c0
> >   syscall_exit_to_user_mode+0xe6/0x1c0
> >   do_syscall_64+0x74/0x130
> >   entry_SYSCALL_64_after_hwframe+0x4b/0x53
> >
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
>
> That, or the dentry pointer passed to shrink_dcache_parent() is complete
> garbage - e.g. due to the struct pid having already been freed. Might
> make sense to try and get a crash dump and poke around...
>

I'm trying to get one. We had an issue that prevented the machines that
were crashing this way from getting a coredump. Hoping that'll be
resolved soon and we can get it.

> Which kernels have you seen it on?

v6.11 and v6.13 so far. The crash seems to be pretty workload-dependent
(a lot of processes rapidly starting and exiting). I'm not sure this
workload is running on later kernels yet, so I don't know if this is
something already fixed.

Thanks,
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08  6:42 ` Al Viro
@ 2026-04-08 18:28   ` Jeff Layton
  2026-04-08 19:26     ` Al Viro
  1 sibling, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 18:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> That, or the dentry pointer passed to shrink_dcache_parent() is complete
> garbage - e.g. due to the struct pid having already been freed. Might
> make sense to try and get a crash dump and poke around...
>

Chris was able to track down a vmcore for me. No, it actually does seem
to be what we thought originally. The parent is fine, but one of the
dentries under it has been freed and reallocated:

    >>> stack
    #0  queued_spin_lock_slowpath (kernel/locking/qspinlock.c:471:3)
    #1  spin_lock (./include/linux/spinlock.h:351:2)
    #2  lock_for_kill (fs/dcache.c:675:3)
    #3  shrink_dentry_list (fs/dcache.c:1086:8)
    #4  shrink_dcache_parent (fs/dcache.c:0)
    #5  d_invalidate (fs/dcache.c:1614:2)
    #6  proc_invalidate_siblings_dcache (fs/proc/inode.c:142:5)
    #7  proc_flush_pid (fs/proc/base.c:3478:2)
    #8  release_task (kernel/exit.c:279:2)
    #9  exit_notify (kernel/exit.c:775:3)
    #10 do_exit (kernel/exit.c:958:2)
    #11 do_group_exit (kernel/exit.c:1087:2)
    #12 get_signal (kernel/signal.c:3036:3)
    #13 arch_do_signal_or_restart (arch/x86/kernel/signal.c:337:6)
    #14 exit_to_user_mode_loop (kernel/entry/common.c:111:4)
    #15 exit_to_user_mode_prepare (./include/linux/entry-common.h:329:13)
    #16 __syscall_exit_to_user_mode_work (kernel/entry/common.c:207:2)
    #17 syscall_exit_to_user_mode (kernel/entry/common.c:218:2)
    #18 do_syscall_64 (arch/x86/entry/common.c:89:2)
    #19 entry_SYSCALL_64+0x6c/0xaa (arch/x86/entry/entry_64.S:121)
    #20 0x7f49ead2c482
    >>> identify_address(stack[3]["dentry"])
    'slab object: kmalloc-96+0x48'
    >>> identify_address(stack[4]["parent"])
    'slab object: dentry+0x0'

...it turns out that Gustavo had been chasing this independently of me,
and had Claude do a bit more analysis. I included it below, but here's a
link that may be more readable. Any thoughts?

https://markdownpastebin.com/?id=7c258413493b4144ab27d5cdcb8ae5b4

-------------8<----------------------

## dcache: `shrink_dcache_parent()` livelock leading to use-after-free

### Summary

A race between concurrent proc dentry invalidation (`proc_flush_pid` →
`d_invalidate` → `shrink_dcache_parent`) and the global dentry shrinker
(`drop_caches` / memory pressure → `prune_dcache_sb`) causes
`shrink_dcache_parent()` to loop indefinitely. This livelock is the
root cause of the use-after-free crash observed in production (see
P2260313060 for the original crash analysis).

### How the bug manifests

**In production** (narrow race window): The livelock occasionally
resolves through specific timing that allows a parent dentry to be
freed and its slab page reused. When a sibling's `__dentry_kill` then
tries `spin_lock(&parent->d_lock)` on the reused memory → page fault in
`queued_spin_lock_slowpath` (Oops).

**With `CONFIG_DCACHE_SHRINK_RACE_DEBUG`** (5ms delay in
`__dentry_kill`): The race is deterministic. `shrink_dcache_parent()`
livelocks on the first iteration and never completes.

### Root cause

In `select_collect()` (the `d_walk` callback used by
`shrink_dcache_parent`), two types of dentries are incorrectly counted
as "found":

1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
   `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
   `dentry_unlist()` to remove the dentry from the parent's children
   list. With the debug delay, the dentry stays dead-but-visible for
   5ms.

2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
   shrinker path (e.g., the global LRU shrinker from `drop_caches`) to
   its own dispose list. These are being processed by that other path
   but slowly (5ms per proc dentry with the debug delay).

When `select_collect` counts these as `found++`,
`shrink_dcache_parent()` sees `data.found > 0` and loops again. But
these dentries can never be collected onto `data.dispose` (dead ones
have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
set), so the loop never makes progress → **infinite loop**.

```
shrink_dcache_parent() loop:

    for (;;) {
        d_walk(parent, &data, select_collect);
        if (!list_empty(&data.dispose)) {
            shrink_dentry_list(&data.dispose);  /* never reached */
            continue;
        }
        if (!data.found)
            break;      /* never reached because found > 0 */
        /* ... loops forever */
    }
```

### Reproducer

**Requirements:**

- `CONFIG_DCACHE_SHRINK_RACE_DEBUG=y` (injects 5ms `mdelay()` in
  `__dentry_kill` for proc dentries)
- `CONFIG_KASAN=y` (optional, for UAF detection)
- `CONFIG_DEBUG_KERNEL=y`

**Debug patch** (apply to `fs/Kconfig` and `fs/dcache.c`):

```diff
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,15 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
 	bool

+config DCACHE_SHRINK_RACE_DEBUG
+	bool "Debug: inject delay in __dentry_kill to widen race window"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  Inject a delay in __dentry_kill() between releasing d_lock and
+	  re-acquiring it, to make the shrink_dentry_list race reproducible
+	  in test environments. Only enable for testing.
+
 config VALIDATE_FS_PARSER

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"

@@ -630,6 +631,16 @@ static struct dentry *__dentry_kill(...)
 	cond_resched();
+#ifdef CONFIG_DCACHE_SHRINK_RACE_DEBUG
+	/*
+	 * Delay proc dentry kills to keep dead dentries in the tree
+	 * longer. With the bug (count < 0 counted as "found" in
+	 * select_collect), d_walk keeps re-finding dead dentries and
+	 * shrink_dcache_parent() loops forever.
+	 */
+	if (dentry->d_sb->s_magic == 0x9fa0 /* PROC_SUPER_MAGIC */)
+		mdelay(5);
+#endif
 	/* now that it's negative, ->d_parent is stable */
```

**Test program** (`test_dcache_race.sh`): The reproducer creates
multi-threaded processes, populates their `/proc/<pid>/task/<tid>/...`
dcache entries, then SIGKILLs them while simultaneously running
`drop_caches` in tight loops. This creates concurrent `proc_flush_pid`
(from dying threads) and `prune_dcache_sb` (from `drop_caches`) paths
competing on the same proc dentries.

```c
/* Key structure:
 * - Fork child with N threads (creates /proc/<pid>/task/<tid>/... entries)
 * - Parent reads all /proc entries to populate dcache
 * - Background threads continuously do: echo 2 > /proc/sys/vm/drop_caches
 * - SIGKILL child -> all threads exit -> concurrent proc_flush_pid
 * - drop_caches shrinker races with proc_flush_pid on same dentries
 */
```

Parameters used: 50 threads/process, 200 iterations, 4 shrinker
threads, 4 reader threads.

**vmtest.toml:**

```toml
[[target]]
name = "dcache-shrink-race"
kernel = "arch/x86/boot/bzImage"
kernel_args = "hung_task_panic=0 softlockup_panic=0 rcupdate.rcu_cpu_stall_suppress=1"
command = "/mnt/vmtest/test_dcache_race.sh"

[target.vm]
memory = "16G"  # KASAN needs extra memory
num_cpus = 8
timeout = 1200
```

### Reproduction results

| Kernel | Result |
|---|---|
| Unfixed + debug delay + KASAN | **FAIL**: livelock on iteration 1, test timed out at 750s |
| Fixed + debug delay + KASAN | **PASS**: all 200 iterations completed, no KASAN/warnings |

### Fix

The fix is in `select_collect()` — stop counting dead dentries and
`DCACHE_SHRINK_LIST` dentries as "found":

```diff
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1448,13 +1459,27 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	if (data->start == dentry)
 		goto out;

-	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-		data->found++;
+	if (dentry->d_lockref.count < 0) {
+		/*
+		 * Dead dentry (lockref_mark_dead sets count negative).
+		 * Another CPU is in the middle of __dentry_kill() and
+		 * will shortly unlink it from the tree. Do not count
+		 * it as "found" --- that causes shrink_dcache_parent()
+		 * to loop indefinitely.
+		 */
+	} else if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/*
+		 * Already on a shrink list, being processed by another
+		 * path (e.g., the global LRU shrinker). Do not count
+		 * it as "found" --- if the other path is slow (e.g.,
+		 * contention on d_lock or filesystem callbacks),
+		 * shrink_dcache_parent() would spin forever waiting for
+		 * them to finish. The other shrinker will handle these
+		 * dentries.
+		 */
 	} else if (!dentry->d_lockref.count) {
 		to_shrink_list(dentry, &data->dispose);
 		data->found++;
-	} else if (dentry->d_lockref.count < 0) {
-		data->found++;
 	}
```

**Why this is correct:**

- **Dead dentries (`count < 0`)**: These are being killed by another
  CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to
  remove them from the parent's children list.
  `shrink_dcache_parent()` doesn't need to wait for them — they'll
  disappear from the tree on their own.

- **`DCACHE_SHRINK_LIST` dentries**: These are already on another
  shrinker's dispose list and will be processed by that path. Counting
  them as "found" forces `shrink_dcache_parent()` to wait for the
  other shrinker to finish, which can take arbitrarily long
  (especially with filesystem callbacks or the debug delay).

- **The `select_collect2` path** (used when `data.found > 0` but
  `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
  separately by setting `data->victim` and processing them directly.
  With this fix, `select_collect2` is only reached when there are
  genuinely unprocessable dentries (count > 0, not dead, not on shrink
  list), not when there are merely in-flight kills or concurrent
  shrinkers.

### Relationship to the production UAF crash

The livelock is the **precursor** to the use-after-free crash seen in
production (P2260313060):

1. Without the debug delay, the `__dentry_kill` race window is
   nanoseconds (just `cond_resched()`).
2. Most of the time, the dead dentry is unlinked before
   `select_collect` finds it → no issue.
3. Occasionally, `select_collect` finds dead dentries and spins
   briefly. During this spinning, the specific timing allows a parent
   dentry to be fully freed (via `dentry_free` → `call_rcu` → slab
   reclaim) and its slab page reused for `kmalloc-96`.
4. When the spinning `shrink_dcache_parent` or a concurrent
   `__dentry_kill` then accesses the freed parent → UAF crash.

The fix prevents the spinning entirely, eliminating both the livelock
and the UAF.
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08 18:28   ` Jeff Layton
@ 2026-04-08 19:26     ` Al Viro
  2026-04-08 21:05       ` Jeff Layton
  0 siblings, 1 reply; 8+ messages in thread
From: Al Viro @ 2026-04-08 19:26 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:

> ...it turns out that Gustavo had been chasing this independently of me,
> and had Claude do a bit more analysis. I included it below, but here's
> a link that may be more readable. Any thoughts?

Other than rather uncharitable ones about the usefulness of the Turing
Test, you mean?

> **In production** (narrow race window): The livelock occasionally
> resolves through specific timing that allows a parent dentry to be
> freed and its slab page reused.

Livelock is real and known, all right, but do explain what "resolves
through specific timing that allows a parent dentry to be freed" means.
Especially since the reference to the parent is *not* dropped until
after the child has been detached from the tree and DCACHE_DENTRY_KILLED
set on it, with ->d_lock on the child held over that. So
select_collect2() seeing the victim still locked and attached to the
tree has to happen before the grace period for the parent has a chance
to begin. And the rcu_read_lock() grabbed there prevents that grace
period from completing until we do the matching rcu_read_unlock() in
shrink_dcache_parent().

> In `select_collect()` (the `d_walk` callback used by
> `shrink_dcache_parent`), two types of dentries are incorrectly counted

really?

> as "found":
>
> 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> `dentry_unlist()` to remove the dentry from the parent's children list.
> With the debug delay, the dentry stays dead-but-visible for 5ms.

Yes. And? That's the livelock, all right, and it needs fixing, but how
does busy-wait here lead to UAF on anything?

> 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> own dispose list. These are being processed by that other path but
> slowly (5ms per proc dentry with the debug delay).
>
> When `select_collect` counts these as `found++`,
> `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> these dentries can never be collected onto `data.dispose` (dead ones
> have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> set), so the loop never makes progress → **infinite loop**.

They have no business going into data.dispose; for fuck's sake, dentries
on somebody else's shrink list are explicitly fed to shrink_kill().

> **Why this is correct:**

It is not.

> - **Dead dentries (`count < 0`)**: These are being killed by another
> CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> them from the parent's children list. `shrink_dcache_parent()` doesn't
> need to wait for them — they'll disappear from the tree on their own.

... and since they keep their parents busy, we should not leave until
they are gone. For real fun, consider calls from
shrink_dcache_for_umount() - and yes, it *is* possible for another
thread's shrink list to contain dentries from a filesystem being shut
down. Legitimately so.

> - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> shrinker's dispose list and will be processed by that path. Counting
> them as "found" forces `shrink_dcache_parent()` to wait for the other
> shrinker to finish, which can take arbitrarily long (especially with
> filesystem callbacks or the debug delay).

Ditto.

> - **The `select_collect2` path** (used when `data.found > 0` but
> `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> separately by setting `data->victim` and processing them directly. With
> this fix, `select_collect2` is only reached when there are genuinely
> unprocessable dentries (count > 0, not dead, not on shrink list), not
> when there are merely in-flight kills or concurrent shrinkers.

Bollocks, due to the above.

> ### Relationship to the production UAF crash
>
> The livelock is the **precursor** to the use-after-free crash seen in
> production (P2260313060):

... and it still offers zero explanation of the path from livelock to
UAF. It may or may not be real, but there's nothing in all that verbiage
even suggesting what it might be. And the proposed analysis is flat-out
wrong.

As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
as of today).
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
  2026-04-08 19:26     ` Al Viro
@ 2026-04-08 21:05       ` Jeff Layton
  2026-04-08 22:43         ` Al Viro
  0 siblings, 1 reply; 8+ messages in thread
From: Jeff Layton @ 2026-04-08 21:05 UTC (permalink / raw)
  To: Al Viro
  Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold

On Wed, 2026-04-08 at 20:26 +0100, Al Viro wrote:
> On Wed, Apr 08, 2026 at 02:28:20PM -0400, Jeff Layton wrote:
>
> > ...it turns out that Gustavo had been chasing this independently of me,
> > and had Claude do a bit more analysis. I included it below, but here's
> > a link that may be more readable. Any thoughts?
>
> Other than rather uncharitable ones about the usefulness of the Turing
> Test, you mean?
>
> > **In production** (narrow race window): The livelock occasionally
> > resolves through specific timing that allows a parent dentry to be
> > freed and its slab page reused.
>
> Livelock is real and known, all right, but do explain what "resolves
> through specific timing that allows a parent dentry to be freed" means.
> Especially since the reference to the parent is *not* dropped until
> after the child has been detached from the tree and DCACHE_DENTRY_KILLED
> set on it, with ->d_lock on the child held over that. So
> select_collect2() seeing the victim still locked and attached to the
> tree has to happen before the grace period for the parent has a chance
> to begin. And the rcu_read_lock() grabbed there prevents that grace
> period from completing until we do the matching rcu_read_unlock() in
> shrink_dcache_parent().
>
> > In `select_collect()` (the `d_walk` callback used by
> > `shrink_dcache_parent`), two types of dentries are incorrectly counted
>
> really?
>
> > as "found":
> >
> > 1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
> > `lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
> > `dentry_unlist()` to remove the dentry from the parent's children list.
> > With the debug delay, the dentry stays dead-but-visible for 5ms.
>
> Yes. And? That's the livelock, all right, and it needs fixing, but how
> does busy-wait here lead to UAF on anything?
>
> > 2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
> > shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
> > own dispose list. These are being processed by that other path but
> > slowly (5ms per proc dentry with the debug delay).
> >
> > When `select_collect` counts these as `found++`,
> > `shrink_dcache_parent()` sees `data.found > 0` and loops again. But
> > these dentries can never be collected onto `data.dispose` (dead ones
> > have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
> > set), so the loop never makes progress → **infinite loop**.
>
> They have no business going into data.dispose; for fuck's sake, dentries
> on somebody else's shrink list are explicitly fed to shrink_kill().
>
> > **Why this is correct:**
>
> It is not.
>
> > - **Dead dentries (`count < 0`)**: These are being killed by another
> > CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
> > them from the parent's children list. `shrink_dcache_parent()` doesn't
> > need to wait for them — they'll disappear from the tree on their own.
>
> ... and since they keep their parents busy, we should not leave until
> they are gone. For real fun, consider calls from
> shrink_dcache_for_umount() - and yes, it *is* possible for another
> thread's shrink list to contain dentries from a filesystem being shut
> down. Legitimately so.
>
> > - **`DCACHE_SHRINK_LIST` dentries**: These are already on another
> > shrinker's dispose list and will be processed by that path. Counting
> > them as "found" forces `shrink_dcache_parent()` to wait for the other
> > shrinker to finish, which can take arbitrarily long (especially with
> > filesystem callbacks or the debug delay).
>
> Ditto.
>
> > - **The `select_collect2` path** (used when `data.found > 0` but
> > `data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
> > separately by setting `data->victim` and processing them directly. With
> > this fix, `select_collect2` is only reached when there are genuinely
> > unprocessable dentries (count > 0, not dead, not on shrink list), not
> > when there are merely in-flight kills or concurrent shrinkers.
>
> Bollocks, due to the above.
>
> > ### Relationship to the production UAF crash
> >
> > The livelock is the **precursor** to the use-after-free crash seen in
> > production (P2260313060):
>
> ... and it still offers zero explanation of the path from livelock to
> UAF. It may or may not be real, but there's nothing in all that verbiage
> even suggesting what it might be. And the proposed analysis is flat-out
> wrong.
>
> As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next
> as of today).

Thanks for taking a look, Al. We'll keep looking at this thing and see
if we can collect more info, and come up with a better theory of the
crash. I'll dig deeper into the vmcore tomorrow.

FWIW, I've been running a bpftrace script across a swath of machines
that emulates the WARN_ON_ONCE() in the original patch I proposed. We've
had a few crashes while it's running and haven't yet collected any stack
traces. The script only runs about 80% of the time though, so it's
possible we're just getting unlucky, but I sort of doubt it. It seems
more likely that these dentries are racing onto the list after we've
passed the check.
--
Jeff Layton <jlayton@kernel.org>
* Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru 2026-04-08 21:05 ` Jeff Layton @ 2026-04-08 22:43 ` Al Viro 0 siblings, 0 replies; 8+ messages in thread From: Al Viro @ 2026-04-08 22:43 UTC (permalink / raw) To: Jeff Layton Cc: Christian Brauner, Jan Kara, linux-fsdevel, linux-kernel, clm, gustavold On Wed, Apr 08, 2026 at 05:05:41PM -0400, Jeff Layton wrote: > > ... and it still offers zero explanation of the path from livelock to > > UAF. It may or may not be real, but there's nothing in all that verbiage > > even suggesting what it might be. And proposed analysis is flat-out > > wrong. > > > > As for the livelock, see viro/vfs.git #work.dcache-busy-wait (in -next > > as of today). > > Thanks for taking a look, Al. We'll keep looking at this thing and see > if we can collect more info, and come up with a better theory of the > crash. I'll dig deeper into the vmcore tomorrow. OK... FWIW, on the trees up to 7.0-rc7: * dentry passed to select_collect2() must be still attached to the tree (no DCACHE_DENTRY_KILLED on it) * caller holds ->d_lock on that dentry. * in case when its ->d_lockref.count is positive, it's busy (at least at the moment) and there's nothing to be done; move on. * in case when its ->d_lockref.count is negative, it's getting killed right now. That's a busy-wait case - we ignore it on this pass, but remember that we'll need to rescan. * in case when its ->d_lockref.count is 0 and it's not on a shrink list: it's evictable, move it to data.dispose, no problem. * in case when its ->d_lockref.count is 0 and it *is* on a shrink list: try to steal it. We can't do that from the d_walk() callback itself - it's a non-blocking environment, to start with, so no evictions are possible there. 
What we can do is to have it returned to shrink_dcache_tree() - grab rcu_read_lock() to make sure it won't get freed (it hasn't reached dentry_unlist(), let alone dentry_free(), so the grace period for freeing hasn't started yet) and tell d_walk() to stop and return to caller immediately. Note that we deliberately return without rcu_read_unlock() - that's what protects the victim from getting freed under us. Other thread might or might not get around to starting the eviction of the victim, but whether it does that or not, it won't get around to freeing it. In shrink_dcache_tree() we grab ->d_lock on the victim again (it had been dropped by d_walk() when the callback returned). Then we call lock_for_kill(), which starts with checking that ->d_lockref.count is 0. If it isn't, there's nothing to be done to that sucker; it either went busy (in which case we should ignore it and move on) or somebody (likely the owner of the shrink list it had been on) got around to evicting it; in the latter case we need to wait until the damn thing is killed. *IF* refcount is still 0, we carefully acquire the ->i_lock on its inode (if any). Note that we are still holding rcu_read_lock(), so it can't get freed under us even if we have to drop and regain ->d_lock. We do need to recheck that refcount is still zero if we do that, obviously. Failing lock_for_kill() is either due to dentry becoming busy (and thus to be skipped) or due to dentry having been passed to __dentry_kill(). In the latter case we busy-wait, same as we'd do if we saw negative ->d_lockref.count in select_collect2(). Successful lock_for_kill() is followed by evicting the sucker; we do *not* remove it (or its ancestors, if it had been the only thing holding them busy) from whatever shrink list they'd been on. 
Anything on shrink lists gets reduced to the state of a passive chunk
of memory no longer connected to filesystem objects, marked with
DCACHE_DENTRY_KILLED and left for the owner of the shrink list to free
once it gets around to that. There's no urgency anymore -

	rcu_read_lock();
	if (!lock_for_kill(dentry)) {
		bool can_free;

		rcu_read_unlock();
		d_shrink_del(dentry);
		can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
		spin_unlock(&dentry->d_lock);
		if (can_free)
			dentry_free(dentry);
		continue;
	}

on the shrink_dentry_list() side will have lock_for_kill() return false
(->d_lockref.count already negative), remove the sucker from the list,
see DCACHE_DENTRY_KILLED on it, unlock the sucker and free it. Note
that nothing in that sequence touches any fs objects - it's just the
disposal of an inert chunk of memory now.

Note that we can't be stealing from ourselves - if anything had been
added to data.dispose, d_walk() sees D_WALK_QUIT or D_WALK_NORETRY and
either buggers off immediately or sets retry to false. It won't restart
walking the tree in either case, so there's no way for the same dentry
to be revisited by it. Having one shrink_dcache_tree() steal from
another is OK - see the conditions under which shrink_dentry_list() and
__dentry_kill() call dentry_free() (called 'can_free' in both).

Getting rid of the busy-wait is handled with a fairly small
modification to that: select_collect2() treats a negative refcount the
same as the 0-and-on-shrink-list case, and shrink_dcache_tree() checks
whether the victim has a negative refcount and no DCACHE_DENTRY_KILLED
yet. In that case it adds a local object (a struct completion_list
node) to a list hanging off dentry->waiters, evicts whatever else it
might've collected, then waits for dentry_unlist() on the victim to
have called complete() on the node.completion for everything on
->waiters. The rest of the logic is unchanged - if the victim is not in
the middle of eviction, we try to steal it, etc., same as in the
mainline.
end of thread, other threads:[~2026-04-08 22:39 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08  6:42 ` Al Viro
2026-04-08 11:10 ` Jeff Layton
2026-04-08 18:28 ` Jeff Layton
2026-04-08 19:26 ` Al Viro
2026-04-08 21:05 ` Jeff Layton
2026-04-08 22:43 ` Al Viro