Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru


From: Jeff Layton <jlayton@kernel.org>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
	 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	clm@meta.com,  gustavold@meta.com
Subject: Re: [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru
Date: Wed, 08 Apr 2026 14:28:20 -0400
Message-ID: <f44d1047296ede580c69bdad8e527ac8294746ec.camel@kernel.org>
In-Reply-To: <20260408064251.GE3836593@ZenIV>

On Wed, 2026-04-08 at 07:42 +0100, Al Viro wrote:
> On Mon, Apr 06, 2026 at 12:44:13PM -0400, Jeff Layton wrote:
> > We've had a number of panics that seem to occur on hosts with heavy
> > process churn. The symptoms are a panic when invalidating /proc entries
> > as a task is exiting:
> > 
> > queued_spin_lock_slowpath+0x153/0x270
> > shrink_dentry_list+0x11d/0x220
> > shrink_dcache_parent+0x68/0x110
> > d_invalidate+0x90/0x170
> > proc_invalidate_siblings_dcache+0xc8/0x140
> > release_task+0x41b/0x510
> > do_exit+0x3d8/0x9d0
> > do_group_exit+0x7d/0xa0
> > get_signal+0x2a9/0x6a0
> > arch_do_signal_or_restart+0x1a/0x1c0
> > syscall_exit_to_user_mode+0xe6/0x1c0
> > do_syscall_64+0x74/0x130
> > entry_SYSCALL_64_after_hwframe+0x4b/0x53
> > 
> > The problem appears to be a UAF. It's freeing a shrink list of
> > dentries, but one of the dentries on it has already been freed.
> 
> That, or dentry pointer passed to shrink_dcache_parent() is a
> complete garbage - e.g. due to struct pid having already been
> freed.  Might make sense to try and get a crash dump and poke
> around...
> 

Chris was able to track down a vmcore for me.

No, it actually does seem to be what we thought originally. The parent
is fine, but one of the dentries under it has been freed and
reallocated:

>>> stack
#0  queued_spin_lock_slowpath (kernel/locking/qspinlock.c:471:3)
#1  spin_lock (./include/linux/spinlock.h:351:2)
#2  lock_for_kill (fs/dcache.c:675:3)
#3  shrink_dentry_list (fs/dcache.c:1086:8)
#4  shrink_dcache_parent (fs/dcache.c:0)
#5  d_invalidate (fs/dcache.c:1614:2)
#6  proc_invalidate_siblings_dcache (fs/proc/inode.c:142:5)
#7  proc_flush_pid (fs/proc/base.c:3478:2)
#8  release_task (kernel/exit.c:279:2)
#9  exit_notify (kernel/exit.c:775:3)
#10 do_exit (kernel/exit.c:958:2)
#11 do_group_exit (kernel/exit.c:1087:2)
#12 get_signal (kernel/signal.c:3036:3)
#13 arch_do_signal_or_restart (arch/x86/kernel/signal.c:337:6)
#14 exit_to_user_mode_loop (kernel/entry/common.c:111:4)
#15 exit_to_user_mode_prepare (./include/linux/entry-common.h:329:13)
#16 __syscall_exit_to_user_mode_work (kernel/entry/common.c:207:2)
#17 syscall_exit_to_user_mode (kernel/entry/common.c:218:2)
#18 do_syscall_64 (arch/x86/entry/common.c:89:2)
#19 entry_SYSCALL_64+0x6c/0xaa (arch/x86/entry/entry_64.S:121)
#20 0x7f49ead2c482
>>> identify_address(stack[3]["dentry"])
'slab object: kmalloc-96+0x48'
>>> identify_address(stack[4]["parent"])
'slab object: dentry+0x0'

...it turns out that Gustavo had been chasing this independently of me,
and had Claude do a bit more analysis. I've included it below, but here's
a link that may be more readable. Any thoughts?

https://markdownpastebin.com/?id=7c258413493b4144ab27d5cdcb8ae5b4

-------------8<----------------------

## dcache: `shrink_dcache_parent()` livelock leading to use-after-free

### Summary

A race between concurrent proc dentry invalidation (`proc_flush_pid` →
`d_invalidate` → `shrink_dcache_parent`) and the global dentry shrinker
(`drop_caches` / memory pressure → `prune_dcache_sb`) causes
`shrink_dcache_parent()` to loop indefinitely. This livelock is the
root cause of the use-after-free crash observed in production (see
P2260313060 for the original crash analysis).

### How the bug manifests

**In production** (narrow race window): The livelock occasionally
resolves through specific timing that allows a parent dentry to be
freed and its slab page reused. When a sibling's `__dentry_kill` then
tries `spin_lock(&parent->d_lock)` on the reused memory, it takes a
page fault in `queued_spin_lock_slowpath` (Oops).

**With `CONFIG_DCACHE_SHRINK_RACE_DEBUG`** (5ms delay in
`__dentry_kill`): The race is deterministic. `shrink_dcache_parent()`
livelocks on the first iteration and never completes.

### Root cause

In `select_collect()` (the `d_walk` callback used by
`shrink_dcache_parent`), two types of dentries are incorrectly counted
as "found":

1. **Dead dentries** (`d_lockref.count < 0`): Another CPU called
`lockref_mark_dead()` in `__dentry_kill()` but hasn't yet called
`dentry_unlist()` to remove the dentry from the parent's children list.
With the debug delay, the dentry stays dead-but-visible for 5ms.

2. **`DCACHE_SHRINK_LIST` dentries**: Already isolated by another
shrinker path (e.g., the global LRU shrinker from `drop_caches`) to its
own dispose list. These are being processed by that other path but
slowly (5ms per proc dentry with the debug delay).

When `select_collect` counts these as `found++`,
`shrink_dcache_parent()` sees `data.found > 0` and loops again. But
these dentries can never be collected onto `data.dispose` (dead ones
have count < 0, shrink-list ones already have `DCACHE_SHRINK_LIST`
set), so the loop never makes progress → **infinite loop**.

```
shrink_dcache_parent() loop:
  for (;;) {
      d_walk(parent, &data, select_collect);
      if (!list_empty(&data.dispose)) {
          shrink_dentry_list(&data.dispose);  // never reached
          continue;
      }
      if (!data.found)
          break;            // never reached because found > 0
      // ... loops forever
  }
```
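The no-progress condition described above can be sketched as a small userspace model. This is a toy under stated assumptions, not kernel code: `toy_dentry`, `toy_select_collect()` and `toy_shrink_parent()` are hypothetical names, the `collected` field stands in for a dentry having been killed and unlinked, and the model follows the simplified loop shown above (it ignores the `select_collect2` pass). The value -128 is only used as "some negative count"; any negative value behaves the same here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for DCACHE_SHRINK_LIST. */
#define TOY_SHRINK_LIST 0x1u

/* Hypothetical, heavily reduced model of a child dentry. */
struct toy_dentry {
    int count;       /* models d_lockref.count; negative means dead */
    unsigned flags;  /* TOY_SHRINK_LIST if on someone else's list */
    bool collected;  /* true once our own pass has disposed of it */
};

struct toy_select_data {
    int found;     /* models data.found */
    int disposed;  /* number of entries moved to data.dispose */
};

/* Models the unfixed select_collect() branches quoted in the analysis. */
static void toy_select_collect(struct toy_select_data *data,
                               struct toy_dentry *d)
{
    if (d->collected)
        return;                    /* already gone from the tree */
    if (d->flags & TOY_SHRINK_LIST) {
        data->found++;             /* counted, but not collectable */
    } else if (d->count == 0) {
        d->collected = true;       /* moved to our dispose list */
        data->disposed++;
        data->found++;
    } else if (d->count < 0) {
        data->found++;             /* dead: counted, not collectable */
    }
}

/*
 * Models the simplified loop. Returns the pass on which the loop exits,
 * or -1 if it is still looping after max_passes (the model's stand-in
 * for "never terminates").
 */
static int toy_shrink_parent(struct toy_dentry *kids, size_t n,
                             int max_passes)
{
    assert(max_passes > 0);
    for (int pass = 1; pass <= max_passes; pass++) {
        struct toy_select_data data = { 0, 0 };

        for (size_t i = 0; i < n; i++)
            toy_select_collect(&data, &kids[i]);
        if (data.disposed > 0)
            continue;              /* shrink_dentry_list() would run here */
        if (!data.found)
            return pass;           /* nothing left to wait for */
    }
    return -1;
}
```

In this model, a child set containing one dead entry (`count = -128`) or one `TOY_SHRINK_LIST` entry never lets `toy_shrink_parent()` return a pass number, because nothing in the model ever removes those entries. In the kernel the other CPU eventually does remove them, so the model only illustrates why each pass makes no progress while they remain visible.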

### Reproducer

**Requirements:**
- `CONFIG_DCACHE_SHRINK_RACE_DEBUG=y` (injects 5ms `mdelay()` in
`__dentry_kill` for proc dentries)
- `CONFIG_KASAN=y` (optional, for UAF detection)
- `CONFIG_DEBUG_KERNEL=y`

**Debug patch** (apply to `fs/Kconfig` and `fs/dcache.c`):

```diff
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -9,6 +9,15 @@ menu "File systems"
 config DCACHE_WORD_ACCESS
        bool
 
+config DCACHE_SHRINK_RACE_DEBUG
+	bool "Debug: inject delay in __dentry_kill to widen race window"
+	depends on DEBUG_KERNEL
+	default n
+	help
+	  Inject a delay in __dentry_kill() between releasing d_lock and
+	  re-acquiring it, to make the shrink_dentry_list race reproducible
+	  in test environments. Only enable for testing.
+
 config VALIDATE_FS_PARSER

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -32,6 +32,7 @@
 #include <linux/list_lru.h>
+#include <linux/delay.h>
 #include "internal.h"
 
@@ -630,6 +631,16 @@ static struct dentry *__dentry_kill(...)
 	cond_resched();
+#ifdef CONFIG_DCACHE_SHRINK_RACE_DEBUG
+	/*
+	 * Delay proc dentry kills to keep dead dentries in the tree
+	 * longer. With the bug (count < 0 counted as "found" in
+	 * select_collect), d_walk keeps re-finding dead dentries and
+	 * shrink_dcache_parent() loops forever.
+	 */
+	if (dentry->d_sb->s_magic == 0x9fa0 /* PROC_SUPER_MAGIC */)
+		mdelay(5);
+#endif
 	/* now that it's negative, ->d_parent is stable */
```

**Test program** (`test_dcache_race.sh`):

The reproducer creates multi-threaded processes, populates their
`/proc/<pid>/task/<tid>/...` dcache entries, then SIGKILLs them while
simultaneously running `drop_caches` in tight loops. This creates
concurrent `proc_flush_pid` (from dying threads) and `prune_dcache_sb`
(from `drop_caches`) paths competing on the same proc dentries.

```c
/* Key structure:
 * - Fork child with N threads (creates /proc/<pid>/task/<tid>/... entries)
 * - Parent reads all /proc entries to populate dcache
 * - Background threads continuously do: echo 2 > /proc/sys/vm/drop_caches
 * - SIGKILL child -> all threads exit -> concurrent proc_flush_pid
 * - drop_caches shrinker races with proc_flush_pid on same dentries
 */
```

Parameters used: 50 threads/process, 200 iterations, 4 shrinker
threads, 4 reader threads.

**vmtest.toml:**
```toml
[[target]]
name = "dcache-shrink-race"
kernel = "arch/x86/boot/bzImage"
kernel_args = "hung_task_panic=0 softlockup_panic=0 rcupdate.rcu_cpu_stall_suppress=1"
command = "/mnt/vmtest/test_dcache_race.sh"

  [target.vm]
  memory = "16G"   # KASAN needs extra memory
  num_cpus = 8
  timeout = 1200
```

### Reproduction results

| Kernel | Result |
|---|---|
| Unfixed + debug delay + KASAN | **FAIL**: livelock on iteration 1, test timed out at 750s |
| Fixed + debug delay + KASAN | **PASS**: all 200 iterations completed, no KASAN/warnings |

### Fix

The fix is in `select_collect()` — stop counting dead dentries and
`DCACHE_SHRINK_LIST` dentries as "found":

```diff
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1448,13 +1459,27 @@ static enum d_walk_ret select_collect(void *_data, struct dentry *dentry)
 	if (data->start == dentry)
 		goto out;
 
-	if (dentry->d_flags & DCACHE_SHRINK_LIST) {
-		data->found++;
+	if (dentry->d_lockref.count < 0) {
+		/*
+		 * Dead dentry (lockref_mark_dead sets count negative).
+		 * Another CPU is in the middle of __dentry_kill() and
+		 * will shortly unlink it from the tree.  Do not count
+		 * it as "found" --- that causes shrink_dcache_parent()
+		 * to loop indefinitely.
+		 */
+	} else if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+		/*
+		 * Already on a shrink list, being processed by another
+		 * path (e.g., the global LRU shrinker).  Do not count
+		 * it as "found" --- if the other path is slow (e.g.,
+		 * contention on d_lock or filesystem callbacks),
+		 * shrink_dcache_parent() would spin forever waiting for
+		 * them to finish.  The other shrinker will handle these
+		 * dentries.
+		 */
 	} else if (!dentry->d_lockref.count) {
 		to_shrink_list(dentry, &data->dispose);
 		data->found++;
-	} else if (dentry->d_lockref.count < 0) {
-		data->found++;
 	}
```

**Why this is correct:**

- **Dead dentries (`count < 0`)**: These are being killed by another
CPU's `__dentry_kill()`. That CPU will call `dentry_unlist()` to remove
them from the parent's children list. `shrink_dcache_parent()` doesn't
need to wait for them — they'll disappear from the tree on their own.

- **`DCACHE_SHRINK_LIST` dentries**: These are already on another
shrinker's dispose list and will be processed by that path. Counting
them as "found" forces `shrink_dcache_parent()` to wait for the other
shrinker to finish, which can take arbitrarily long (especially with
filesystem callbacks or the debug delay).

- **The `select_collect2` path** (used when `data.found > 0` but
`data.dispose` is empty) handles `DCACHE_SHRINK_LIST` dentries
separately by setting `data->victim` and processing them directly. With
this fix, `select_collect2` is only reached when there are genuinely
unprocessable dentries (count > 0, not dead, not on shrink list), not
when there are merely in-flight kills or concurrent shrinkers.
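As a compact way to compare the two versions of `select_collect()` discussed above, the sketch below tabulates how each one treats the four dentry states the analysis distinguishes. It is a userspace illustration only: `toy_verdict`, `toy_classify_unfixed()`, `toy_classify_fixed()` and `toy_can_stall()` are hypothetical names, `TOY_SHRINK_LIST` stands in for `DCACHE_SHRINK_LIST`, and the branch order is copied from the diff rather than from a verified kernel tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for DCACHE_SHRINK_LIST. */
#define TOY_SHRINK_LIST 0x1u

enum toy_verdict {
    TOY_IGNORE,           /* neither counted nor collected */
    TOY_FOUND_ONLY,       /* found++ without being collectable */
    TOY_FOUND_COLLECTED,  /* found++ and moved to the dispose list */
};

/* Branch order of the removed ("-") side of the diff. */
static enum toy_verdict toy_classify_unfixed(int count, unsigned flags)
{
    if (flags & TOY_SHRINK_LIST)
        return TOY_FOUND_ONLY;
    if (count == 0)
        return TOY_FOUND_COLLECTED;
    if (count < 0)
        return TOY_FOUND_ONLY;
    return TOY_IGNORE;            /* count > 0: still in use */
}

/* Branch order of the added ("+") side of the diff. */
static enum toy_verdict toy_classify_fixed(int count, unsigned flags)
{
    if (count < 0)
        return TOY_IGNORE;        /* dead: another CPU will unlink it */
    if (flags & TOY_SHRINK_LIST)
        return TOY_IGNORE;        /* owned by another shrinker */
    if (count == 0)
        return TOY_FOUND_COLLECTED;
    return TOY_IGNORE;
}

/*
 * Returns true if some state is counted as found without being
 * collected, which is the condition the analysis blames for the
 * caller looping without progress.
 */
static bool toy_can_stall(enum toy_verdict (*classify)(int, unsigned))
{
    static const struct { int count; unsigned flags; } states[] = {
        { -128, 0 },              /* dead */
        { 0, TOY_SHRINK_LIST },   /* on another shrink list */
        { 0, 0 },                 /* unused, collectable */
        { 1, 0 },                 /* in use */
    };

    assert(classify != NULL);
    for (size_t i = 0; i < sizeof(states) / sizeof(states[0]); i++)
        if (classify(states[i].count, states[i].flags) == TOY_FOUND_ONLY)
            return true;
    return false;
}
```

Under these assumptions `toy_can_stall(toy_classify_unfixed)` is true and `toy_can_stall(toy_classify_fixed)` is false. The model says nothing about whether ignoring those dentries is safe for callers that expect the subtree to be empty afterwards; that question is outside what the sketch covers.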

### Relationship to the production UAF crash

The livelock is the **precursor** to the use-after-free crash seen in
production (P2260313060):

1. Without the debug delay, the `__dentry_kill` race window is
nanoseconds (just `cond_resched()`).
2. Most of the time, the dead dentry is unlinked before
`select_collect` finds it → no issue.
3. Occasionally, `select_collect` finds dead dentries and spins
briefly. During this spinning, the specific timing allows a parent
dentry to be fully freed (via `dentry_free` → `call_rcu` → slab
reclaim) and its slab page reused for `kmalloc-96`.
4. When the spinning `shrink_dcache_parent` or a concurrent
`__dentry_kill` then accesses the freed parent → UAF crash.

The fix prevents the spinning entirely, eliminating both the livelock
and the UAF.


Thread overview: 8+ messages
2026-04-06 16:44 [PATCH] dcache: warn when a dentry is freed with a non-empty ->d_lru Jeff Layton
2026-04-07 10:51 ` Jan Kara
2026-04-08  6:42 ` Al Viro
2026-04-08 11:10   ` Jeff Layton
2026-04-08 18:28   ` Jeff Layton [this message]
2026-04-08 19:26     ` Al Viro
2026-04-08 21:05       ` Jeff Layton
2026-04-08 22:43         ` Al Viro
