* Re: [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression
@ 2026-06-18 6:36 kernel test robot
0 siblings, 0 replies; 3+ messages in thread
From: kernel test robot @ 2026-06-18 6:36 UTC (permalink / raw)
To: A. Sterling
Cc: oe-lkp, lkp, linux-fsdevel, Christian Brauner, Alexander Viro,
Jan Kara, linux-kernel, oliver.sang
Hello,
kernel test robot noticed "WARNING:possible_irq_lock_inversion_dependency_detected" on:
commit: 028b94b023efe53f26decb0497e3ffe5b9e16ee9 ("[PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression")
url: https://github.com/intel-lab-lkp/linux/commits/A-Sterling/pidfs-pin-stashed-dentry-for-pid-lifetime-to-fix-repeat-open-regression/20260612-190630
base: https://git.kernel.org/cgit/linux/kernel/git/vfs/vfs.git vfs.all
patch link: https://lore.kernel.org/all/20260612110529.80055-1-adam.sterling@gmail.com/
patch subject: [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression
in testcase: boot
config: x86_64-rhel-9.4-bpf
compiler: gcc-14
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 32G
(please refer to attached dmesg/kmsg for entire log/backtrace)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202606180805.1b2fdd03-lkp@intel.com
[ 92.416502][ C0] WARNING: possible irq lock inversion dependency detected
[ 92.417060][ C0] 7.1.0-rc4+ #1 Not tainted
[ 92.417436][ C0] --------------------------------------------------------
[ 92.418046][ C0] sed/219 just changed the state of lock:
[ 92.418527][ C0] ffff8881bbb865d8 (&sighand->siglock){-...}-{3:3}, at: lock_task_sighand (signal.c:1379 (discriminator 1))
[ 92.420518][ C0] but this lock took another, HARDIRQ-unsafe lock in the past:
[ 92.421115][ C0] (&lockref->lock){+.+.}-{3:3}
[ 92.421123][ C0]
[ 92.421123][ C0] and interrupts could create inverse lock ordering between them.
[ 92.421123][ C0]
[ 92.426274][ C0]
[ 92.426274][ C0] other info that might help us debug this:
[ 92.430176][ C0] Chain exists of:
[ 92.430176][ C0] &sighand->siglock --> &(&sig->stats_lock)->lock --> &lockref->lock
[ 92.430176][ C0]
[ 92.435925][ C0] Possible interrupt unsafe locking scenario:
[ 92.435925][ C0]
[ 92.439638][ C0] CPU0 CPU1
[ 92.441400][ C0] ---- ----
[ 92.443137][ C0] lock(&lockref->lock);
[ 92.444756][ C0] local_irq_disable();
[ 92.446587][ C0] lock(&sighand->siglock);
[ 92.448366][ C0] lock(&(&sig->stats_lock)->lock);
[ 92.450174][ C0] <Interrupt>
[ 92.451642][ C0] lock(&sighand->siglock);
[ 92.453200][ C0]
[ 92.453200][ C0] *** DEADLOCK ***
[ 92.453200][ C0]
[ 92.457265][ C0] 2 locks held by sed/219:
[ 92.458724][ C0] #0: ffffffffb322ad60 (rcu_read_lock){....}-{1:3}, at: kill_pid_info_type (linux/rcupdate.h:300 linux/rcupdate.h:838 signal.c:1455)
[ 92.460612][ C0] #1: ffffffffb322ad60 (rcu_read_lock){....}-{1:3}, at: lock_task_sighand (linux/rcupdate.h:300 linux/rcupdate.h:838 signal.c:1362)
[ 92.462494][ C0]
[ 92.462494][ C0] the shortest dependencies between 2nd lock and 1st lock:
[ 92.465475][ C0] -> (&lockref->lock){+.+.}-{3:3} {
[ 92.467032][ C0] HARDIRQ-ON-W at:
[ 92.468491][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.470147][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.471793][ C0] _raw_spin_lock (linux/spinlock_api_smp.h:158 (discriminator 1) locking/spinlock.c:158 (discriminator 1))
[ 92.473420][ C0] d_instantiate (linux/spinlock.h:342 dcache.c:2084 dcache.c:2078)
[ 92.475015][ C0] d_make_root (dcache.c:2126)
[ 92.476597][ C0] shmem_fill_super (shmem.c:5103)
[ 92.478192][ C0] get_tree_nodev (super.c:1266 super.c:1285)
[ 92.479724][ C0] vfs_get_tree (super.c:1693)
[ 92.481253][ C0] fc_mount (namespace.c:1198)
[ 92.482730][ C0] vfs_kern_mount (namespace.c:6298)
[ 92.484296][ C0] kern_mount (namespace.c:6289 namespace.c:6289)
[ 92.485750][ C0] shmem_init (shmem.c:5420)
[ 92.487245][ C0] mnt_init (namespace.c:6271)
[ 92.488676][ C0] vfs_caches_init (dcache.c:3418)
[ 92.490238][ C0] start_kernel (main.c:1202)
[ 92.491707][ C0] __pfx_clear_bss (x86/kernel/head64.c:310)
[ 92.493159][ C0] x86_64_start_kernel (x86/kernel/head64.c:291)
[ 92.494609][ C0] common_startup_64 (x86/kernel/head_64.S:418)
[ 92.496025][ C0] SOFTIRQ-ON-W at:
[ 92.497295][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.498764][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.500198][ C0] _raw_spin_lock (linux/spinlock_api_smp.h:158 (discriminator 1) locking/spinlock.c:158 (discriminator 1))
[ 92.501576][ C0] d_instantiate (linux/spinlock.h:342 dcache.c:2084 dcache.c:2078)
[ 92.502953][ C0] d_make_root (dcache.c:2126)
[ 92.504319][ C0] shmem_fill_super (shmem.c:5103)
[ 92.505708][ C0] get_tree_nodev (super.c:1266 super.c:1285)
[ 92.507160][ C0] vfs_get_tree (super.c:1693)
[ 92.508507][ C0] fc_mount (namespace.c:1198)
[ 92.509809][ C0] vfs_kern_mount (namespace.c:6298)
[ 92.511231][ C0] kern_mount (namespace.c:6289 namespace.c:6289)
[ 92.512525][ C0] shmem_init (shmem.c:5420)
[ 92.513817][ C0] mnt_init (namespace.c:6271)
[ 92.516019][ C0] vfs_caches_init (dcache.c:3418)
[ 92.517541][ C0] start_kernel (main.c:1202)
[ 92.519055][ C0] __pfx_clear_bss (x86/kernel/head64.c:310)
[ 92.520442][ C0] x86_64_start_kernel (x86/kernel/head64.c:291)
[ 92.521878][ C0] common_startup_64 (x86/kernel/head_64.S:418)
[ 92.523342][ C0] INITIAL USE at:
[ 92.524518][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.525930][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.527403][ C0] _raw_spin_lock (linux/spinlock_api_smp.h:158 (discriminator 1) locking/spinlock.c:158 (discriminator 1))
[ 92.528797][ C0] d_instantiate (linux/spinlock.h:342 dcache.c:2084 dcache.c:2078)
[ 92.530252][ C0] d_make_root (dcache.c:2126)
[ 92.531629][ C0] shmem_fill_super (shmem.c:5103)
[ 92.533035][ C0] get_tree_nodev (super.c:1266 super.c:1285)
[ 92.534518][ C0] vfs_get_tree (super.c:1693)
[ 92.535844][ C0] fc_mount (namespace.c:1198)
[ 92.537152][ C0] vfs_kern_mount (namespace.c:6298)
[ 92.538524][ C0] kern_mount (namespace.c:6289 namespace.c:6289)
[ 92.539835][ C0] shmem_init (shmem.c:5420)
[ 92.541165][ C0] mnt_init (namespace.c:6271)
[ 92.542506][ C0] vfs_caches_init (dcache.c:3418)
[ 92.543881][ C0] start_kernel (main.c:1202)
[ 92.545279][ C0] __pfx_clear_bss (x86/kernel/head64.c:310)
[ 92.546706][ C0] x86_64_start_kernel (x86/kernel/head64.c:291)
[ 92.548218][ C0] common_startup_64 (x86/kernel/head_64.S:418)
[ 92.549692][ C0] }
[ 92.550763][ C0] ... key at: __key.4+0x0/0x40
[ 92.552113][ C0] ... acquired at:
[ 92.553212][ C0] __lock_acquire (locking/lockdep.c:5237)
[ 92.554371][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.555540][ C0] _raw_spin_lock (linux/spinlock_api_smp.h:158 (discriminator 1) locking/spinlock.c:158 (discriminator 1))
[ 92.556673][ C0] lockref_get (linux/spinlock.h:342 lockref.c:50)
[ 92.557781][ C0] pidfs_alloc_file (linux/dcache.h:365 pidfs.c:1168)
[ 92.558971][ C0] pidfd_prepare (fork.c:1918)
[ 92.560119][ C0] __x64_sys_pidfd_open (pid.c:674 pid.c:710 pid.c:695 pid.c:695)
[ 92.561296][ C0] do_syscall_64 (x86/entry/syscall_64.c:63 x86/entry/syscall_64.c:94)
[ 92.562456][ C0] entry_SYSCALL_64_after_hwframe (x86/entry/entry_64.S:121)
[ 92.563689][ C0]
[ 92.564649][ C0] -> (&pid->wait_pidfd){....}-{3:3} {
[ 92.565868][ C0] INITIAL USE at:
[ 92.566983][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.568286][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.569603][ C0] _raw_spin_lock_irqsave (linux/spinlock_api_smp.h:132 (discriminator 1) locking/spinlock.c:166 (discriminator 1))
[ 92.570965][ C0] __wake_up (sched/wait.c:124 (discriminator 1) sched/wait.c:146 (discriminator 1))
[ 92.572231][ C0] do_notify_parent (signal.c:2156 signal.c:2184)
[ 92.573541][ C0] exit_notify (exit.c:757)
[ 92.574834][ C0] do_exit (exit.c:987)
[ 92.576106][ C0] kthread (kthread.c:438)
[ 92.577494][ C0] ret_from_fork (x86/kernel/process.c:158)
[ 92.578795][ C0] ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 92.580104][ C0] }
[ 92.581115][ C0] ... key at: __key.3+0x0/0x40
[ 92.582473][ C0] ... acquired at:
[ 92.583582][ C0] __lock_acquire (locking/lockdep.c:5237)
[ 92.584748][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.585949][ C0] _raw_spin_lock_irqsave (linux/spinlock_api_smp.h:132 (discriminator 1) locking/spinlock.c:166 (discriminator 1))
[ 92.587170][ C0] __wake_up (sched/wait.c:124 (discriminator 1) sched/wait.c:146 (discriminator 1))
[ 92.588284][ C0] __exit_signal (exit.c:142 exit.c:212)
[ 92.589434][ C0] release_task (exit.c:265)
[ 92.590586][ C0] exit_notify (exit.c:776)
[ 92.591725][ C0] do_exit (exit.c:987)
[ 92.592834][ C0] kthread (kthread.c:438)
[ 92.593966][ C0] ret_from_fork (x86/kernel/process.c:158)
[ 92.595127][ C0] ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 92.596277][ C0]
[ 92.597241][ C0] -> (&____s->seqcount#3){....}-{0:0} {
[ 92.598469][ C0] INITIAL USE at:
[ 92.599564][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.600850][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.602201][ C0] __exit_signal (linux/seqlock.h:478 (discriminator 1) linux/seqlock.h:504 (discriminator 1) linux/seqlock.h:881 (discriminator 1) exit.c:199 (discriminator 1))
[ 92.603491][ C0] release_task (exit.c:265)
[ 92.604770][ C0] exit_notify (exit.c:776)
[ 92.606066][ C0] do_exit (exit.c:987)
[ 92.607334][ C0] kthread (kthread.c:438)
[ 92.608581][ C0] ret_from_fork (x86/kernel/process.c:158)
[ 92.609873][ C0] ret_from_fork_asm (x86/entry/entry_64.S:245)
[ 92.611166][ C0] INITIAL READ USE at:
[ 92.612286][ C0] __lock_acquire (locking/lockdep.c:5191)
[ 92.613596][ C0] lock_acquire (trace/events/lock.h:24 (discriminator 15) trace/events/lock.h:24 (discriminator 15) locking/lockdep.c:5831 (discriminator 15))
[ 92.615780][ C0] thread_group_cputime (linux/seqlock.h:73 (discriminator 1) linux/seqlock.h:838 (discriminator 1) sched/cputime.c:345 (discriminator 1))
[ 92.617425][ C0] thread_group_cputime_adjusted (sched/cputime.c:644)
[ 92.619121][ C0] wait_task_zombie (exit.c:1234)
[ 92.620533][ C0] __do_wait (exit.c:1646 exit.c:1681)
[ 92.621885][ C0] do_wait (exit.c:1722)
[ 92.623341][ C0] kernel_wait (exit.c:1898)
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20260618/202606180805.1b2fdd03-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 3+ messages in thread* [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression
@ 2026-06-12 11:05 A. Sterling
2026-06-17 13:13 ` Christian Brauner
0 siblings, 1 reply; 3+ messages in thread
From: A. Sterling @ 2026-06-12 11:05 UTC (permalink / raw)
To: Christian Brauner; +Cc: Alexander Viro, Jan Kara, linux-fsdevel, linux-kernel
Commit cb12fd8e0dab ("pidfd: add pidfs") moved pidfds from anonymous
inodes to a pseudo filesystem, with the intent that pid->stashed would
cache the dentry+inode across multiple pidfd_open() calls for the same
struct pid. Commit b28ddcc32d8f ("pidfs: convert to path_from_stashed()
helper") realized that intent via the stashed_dentry_get() fast-path.
In practice the stash cache never hits for serial open/close patterns.
pidfs dentries are anonymous (d_alloc_anon) and the superblock raises
DCACHE_DONTCACHE, so dput() at close() drops the refcount to zero and
immediately destroys the dentry. __dentry_kill -> d_prune ->
stashed_dentry_prune cmpxchg()es pid->stashed back to NULL, and the
last iput() runs pidfs_evict_inode(). The next pidfd_open() finds an
empty stash slot and reallocates both the inode (new_inode_pseudo +
simple_inode_init_ts + sops->init_inode) and the dentry (d_alloc_anon
+ d_instantiate + cmpxchg into the stash slot).
Measured with a small benchmark that times a single pidfd_open()
syscall on a live process, best-of-k per run (written for this
investigation as part of an extended fork of the LEBench suite,
Ren et al. SOSP'19; the test is not part of upstream LEBench).
QEMU/KVM x86_64 guest, 5 runs per kernel, RSD <1%, kernels v5.19
to v7.1-rc2:
v5.19 v6.0 v6.9 (pidfs) v7.1-rc2
pidfd_open kbest (us) 0.40 0.39 [step] 0.50
i.e. ~25% slower than the pre-pidfs anon_inode baseline, persistent
through current. The cost is a kmem_cache_alloc on inode_cachep + a
clock read (simple_inode_init_ts) + a kmem_cache_alloc on dentry_cache
per call, all of which can be elided when the stash cache hits.
With this patch applied to v7.1-rc4 (same guest, median kbest
across 5 boots, kernels built with the pidfd selftest config
requirements enabled):
v7.1-rc4 stock v7.1-rc4 + patch
pidfd_open kbest (us) 0.489 0.349 (-28.6%)
All eight buildable pidfd kselftest suites (41 tests, including the
file_handle suite that exercises the stashed-dentry path with
CLONE_NEWUSER children) pass identically on the stock and patched
kernels.
The patched kernel is faster than the pre-pidfs anon_inode baseline
(0.40 us at v5.19/v6.0): a stash hit reuses the cached dentry and
inode outright, where the anon_inode path still performed per-call
file setup against the shared inode.
Fix by holding an extra reference to the stashed dentry for as long as
the task is alive. pidfs_alloc_file() takes a dget() on the first
successful path_from_stashed() per pid, recorded by a new
PIDFS_ATTR_BIT_STASH_PINNED bit in pid->attr->attr_mask. pidfs_exit()
-- which already runs in sleepable context from release_task() --
drops the reference.
pidfs_alloc_file() is reachable after pidfs_exit() has already run:
SO_PEERPIDFD and SCM_PIDFD hold struct pid references on sockets and
can allocate pidfds for already-reaped peers (the PIDFD_STALE path).
A pin taken on that path could never be dropped, since pidfs_exit()
will not run again. Both the pin and the unpin therefore take
pid->wait_pidfd.lock -- the lock pidfs_exit() already uses to
synchronize with pidfs_register_pid() -- and the pin is skipped once
PIDFS_ATTR_BIT_EXIT is set:
- pin before exit's unpin section: the unpin sees the bit, drops
the reference. Balanced.
- exit's unpin section before the pin attempt: PIDFS_ATTR_BIT_EXIT
is set before the unpin section, so the later allocation observes
it under the lock and does not pin. Stale allocations behave as
they do today (per-call inode+dentry), which is acceptable: the
cache targets live processes.
The unpin's dput() runs outside the lock (pidfs_exit() is sleepable;
dput must not run under a spinlock). While pinned, the reference
keeps the dentry alive, so stashed_dentry_prune() cannot run and
pid->stashed still points at the pinned dentry when pidfs_exit()
reads it.
Unpinning at pidfs_free_pid() instead is not possible: the pinned
dentry holds the inode, the inode holds a pid reference
(pidfs_evict_inode() does put_pid()), so the pid refcount cannot
reach zero while the pin is held and pidfs_free_pid() would never
run.
Memory cost is one inode (~800 B) + one dentry (~200 B) per pid that
has ever had a pidfd allocated, held until release_task() instead of
until the last fd close. Bounded by pid_max. Tasks that never have
a pidfd allocated are unaffected.
The existing VFS_WARN_ON_ONCE(pid->stashed) assertion in
pidfs_free_pid() is preserved: the exit-time dput triggers the
existing teardown chain (__dentry_kill -> stashed_dentry_prune ->
pidfs_evict_inode) before pidfs_free_pid runs.
Fixes: cb12fd8e0dab ("pidfd: add pidfs")
Assisted-by: Claude:claude-fable-5
Signed-off-by: A. Sterling <adam.sterling@gmail.com>
---
fs/pidfs.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 47 insertions(+), 2 deletions(-)
diff --git a/fs/pidfs.c b/fs/pidfs.c
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -44,8 +44,9 @@
}
enum pidfs_attr_mask_bits {
- PIDFS_ATTR_BIT_EXIT = 0,
- PIDFS_ATTR_BIT_COREDUMP = 1,
+ PIDFS_ATTR_BIT_EXIT = 0,
+ PIDFS_ATTR_BIT_COREDUMP = 1,
+ PIDFS_ATTR_BIT_STASH_PINNED = 2,
};
struct pidfs_anon_attr {
@@ -725,6 +726,7 @@
{
struct pid *pid = task_pid(tsk);
struct pidfs_attr *attr;
+ struct dentry *unpin = NULL;
#ifdef CONFIG_CGROUPS
struct cgroup *cgrp;
#endif
@@ -765,6 +767,23 @@
/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
smp_wmb();
set_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask);
+
+ /*
+ * Drop the per-pid stash reference taken by pidfs_alloc_file()
+ * (see PIDFS_ATTR_BIT_STASH_PINNED). The wait_pidfd.lock guard
+ * orders this against the pin in pidfs_alloc_file(): once
+ * PIDFS_ATTR_BIT_EXIT is visible under the lock, no new pin can
+ * be taken. dput() runs outside the lock; on the final
+ * reference it triggers __dentry_kill -> stashed_dentry_prune
+ * -> pidfs_evict_inode, the existing teardown path.
+ */
+ scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+ if (test_and_clear_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+ &attr->attr_mask))
+ unpin = READ_ONCE(pid->stashed);
+ }
+ if (unpin)
+ dput(unpin);
}
#ifdef CONFIG_COREDUMP
@@ -1151,6 +1170,32 @@
VFS_WARN_ON_ONCE(!pid->attr);
+ /*
+ * Pin the stashed dentry for the lifetime of the task so that
+ * repeat pidfd_open() on the same pid reuses the cached dentry
+ * + inode instead of reallocating both each time. Anonymous
+ * pidfs dentries don't enter the dcache LRU (DCACHE_DONTCACHE),
+ * so without an explicit reference the dput() at close()
+ * immediately destroys the dentry and, via
+ * stashed_dentry_prune, the inode.
+ *
+ * Skip the pin once PIDFS_ATTR_BIT_EXIT is set: pidfs_exit()
+ * has run (or is running) and will not run again, so a
+ * reference taken now could never be dropped (stale
+ * allocations via SO_PEERPIDFD reach here after
+ * release_task()). The wait_pidfd.lock guard makes the EXIT
+ * check and the pin atomic against the unpin in pidfs_exit().
+ */
+ scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+ struct pidfs_attr *attr = pid->attr;
+
+ if (!IS_ERR_OR_NULL(attr) &&
+ !test_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask) &&
+ !test_and_set_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+ &attr->attr_mask))
+ dget(path.dentry);
+ }
+
flags &= ~PIDFD_STALE;
flags |= O_RDWR;
pidfd_file = dentry_open(&path, flags, current_cred());
--
2.45.1
^ permalink raw reply [flat|nested] 3+ messages in thread* Re: [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression
2026-06-12 11:05 A. Sterling
@ 2026-06-17 13:13 ` Christian Brauner
0 siblings, 0 replies; 3+ messages in thread
From: Christian Brauner @ 2026-06-17 13:13 UTC (permalink / raw)
To: A. Sterling
Cc: Christian Brauner, Alexander Viro, Jan Kara, linux-fsdevel,
linux-kernel
> Commit cb12fd8e0dab ("pidfd: add pidfs") moved pidfds from anonymous
> inodes to a pseudo filesystem, with the intent that pid->stashed would
> cache the dentry+inode across multiple pidfd_open() calls for the same
> struct pid. Commit b28ddcc32d8f ("pidfs: convert to path_from_stashed()
> helper") realized that intent via the stashed_dentry_get() fast-path.
>
> In practice the stash cache never hits for serial open/close patterns.
> pidfs dentries are anonymous (d_alloc_anon) and the superblock raises
> DCACHE_DONTCACHE, so dput() at close() drops the refcount to zero and
> immediately destroys the dentry. __dentry_kill -> d_prune ->
> stashed_dentry_prune cmpxchg()es pid->stashed back to NULL, and the
> last iput() runs pidfs_evict_inode(). The next pidfd_open() finds an
> empty stash slot and reallocates both the inode (new_inode_pseudo +
> simple_inode_init_ts + sops->init_inode) and the dentry (d_alloc_anon
> + d_instantiate + cmpxchg into the stash slot).
>
> Measured with a small benchmark that times a single pidfd_open()
> syscall on a live process, best-of-k per run (written for this
> investigation as part of an extended fork of the LEBench suite,
> Ren et al. SOSP'19; the test is not part of upstream LEBench).
> QEMU/KVM x86_64 guest, 5 runs per kernel, RSD <1%, kernels v5.19
> to v7.1-rc2:
>
> v5.19 v6.0 v6.9 (pidfs) v7.1-rc2
> pidfd_open kbest (us) 0.40 0.39 [step] 0.50
>
> i.e. ~25% slower than the pre-pidfs anon_inode baseline, persistent
> through current. The cost is a kmem_cache_alloc on inode_cachep + a
> clock read (simple_inode_init_ts) + a kmem_cache_alloc on dentry_cache
> per call, all of which can be elided when the stash cache hits.
>
> With this patch applied to v7.1-rc4 (same guest, median kbest
> across 5 boots, kernels built with the pidfd selftest config
> requirements enabled):
>
> v7.1-rc4 stock v7.1-rc4 + patch
> pidfd_open kbest (us) 0.489 0.349 (-28.6%)
>
> All eight buildable pidfd kselftest suites (41 tests, including the
> file_handle suite that exercises the stashed-dentry path with
> CLONE_NEWUSER children) pass identically on the stock and patched
> kernels.
>
> The patched kernel is faster than the pre-pidfs anon_inode baseline
> (0.40 us at v5.19/v6.0): a stash hit reuses the cached dentry and
> inode outright, where the anon_inode path still performed per-call
> file setup against the shared inode.
>
> Fix by holding an extra reference to the stashed dentry for as long as
> the task is alive. pidfs_alloc_file() takes a dget() on the first
> successful path_from_stashed() per pid, recorded by a new
> PIDFS_ATTR_BIT_STASH_PINNED bit in pid->attr->attr_mask. pidfs_exit()
> -- which already runs in sleepable context from release_task() --
> drops the reference.
>
> pidfs_alloc_file() is reachable after pidfs_exit() has already run:
> SO_PEERPIDFD and SCM_PIDFD hold struct pid references on sockets and
> can allocate pidfds for already-reaped peers (the PIDFD_STALE path).
> A pin taken on that path could never be dropped, since pidfs_exit()
> will not run again. Both the pin and the unpin therefore take
> pid->wait_pidfd.lock -- the lock pidfs_exit() already uses to
> synchronize with pidfs_register_pid() -- and the pin is skipped once
> PIDFS_ATTR_BIT_EXIT is set:
>
> - pin before exit's unpin section: the unpin sees the bit, drops
> the reference. Balanced.
> - exit's unpin section before the pin attempt: PIDFS_ATTR_BIT_EXIT
> is set before the unpin section, so the later allocation observes
> it under the lock and does not pin. Stale allocations behave as
> they do today (per-call inode+dentry), which is acceptable: the
> cache targets live processes.
>
> The unpin's dput() runs outside the lock (pidfs_exit() is sleepable;
> dput must not run under a spinlock). While pinned, the reference
> keeps the dentry alive, so stashed_dentry_prune() cannot run and
> pid->stashed still points at the pinned dentry when pidfs_exit()
> reads it.
>
> Unpinning at pidfs_free_pid() instead is not possible: the pinned
> dentry holds the inode, the inode holds a pid reference
> (pidfs_evict_inode() does put_pid()), so the pid refcount cannot
> reach zero while the pin is held and pidfs_free_pid() would never
> run.
>
> Memory cost is one inode (~800 B) + one dentry (~200 B) per pid that
> has ever had a pidfd allocated, held until release_task() instead of
> until the last fd close. Bounded by pid_max. Tasks that never have
> a pidfd allocated are unaffected.
>
> The existing VFS_WARN_ON_ONCE(pid->stashed) assertion in
> pidfs_free_pid() is preserved: the exit-time dput triggers the
> existing teardown chain (__dentry_kill -> stashed_dentry_prune ->
> pidfs_evict_inode) before pidfs_free_pid runs.
>
> Fixes: cb12fd8e0dab ("pidfd: add pidfs")
> Assisted-by: Claude:claude-fable-5
> Signed-off-by: A. Sterling <adam.sterling@gmail.com>
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index aaa609ddab04..49585e763ccb 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
> @@ -46,8 +46,9 @@ void pidfs_get_root(struct path *path)
> }
>
> enum pidfs_attr_mask_bits {
> - PIDFS_ATTR_BIT_EXIT = 0,
> - PIDFS_ATTR_BIT_COREDUMP = 1,
> + PIDFS_ATTR_BIT_EXIT = 0,
> + PIDFS_ATTR_BIT_COREDUMP = 1,
> + PIDFS_ATTR_BIT_STASH_PINNED = 2,
> };
>
> struct pidfs_anon_attr {
> @@ -717,6 +718,7 @@ void pidfs_exit(struct task_struct *tsk)
> {
> struct pid *pid = task_pid(tsk);
> struct pidfs_attr *attr;
> + struct dentry *unpin = NULL;
> #ifdef CONFIG_CGROUPS
> struct cgroup *cgrp;
> #endif
> @@ -757,6 +759,23 @@ void pidfs_exit(struct task_struct *tsk)
> /* Ensure that PIDFD_GET_INFO sees either all or nothing. */
> smp_wmb();
> set_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask);
> +
> + /*
> + * Drop the per-pid stash reference taken by pidfs_alloc_file()
> + * (see PIDFS_ATTR_BIT_STASH_PINNED). The wait_pidfd.lock guard
> + * orders this against the pin in pidfs_alloc_file(): once
> + * PIDFS_ATTR_BIT_EXIT is visible under the lock, no new pin can
> + * be taken. dput() runs outside the lock; on the final
> + * reference it triggers __dentry_kill -> stashed_dentry_prune
> + * -> pidfs_evict_inode, the existing teardown path.
> + */
> + scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> + if (test_and_clear_bit(PIDFS_ATTR_BIT_STASH_PINNED,
> + &attr->attr_mask))
> + unpin = READ_ONCE(pid->stashed);
> + }
> + if (unpin)
> + dput(unpin);
> }
>
> #ifdef CONFIG_COREDUMP
> @@ -1125,6 +1144,32 @@ struct file *pidfs_alloc_file(struct pid *pid, unsigned int flags)
>
> VFS_WARN_ON_ONCE(!pid->attr);
>
> + /*
> + * Pin the stashed dentry for the lifetime of the task so that
> + * repeat pidfd_open() on the same pid reuses the cached dentry
> + * + inode instead of reallocating both each time. Anonymous
> + * pidfs dentries don't enter the dcache LRU (DCACHE_DONTCACHE),
> + * so without an explicit reference the dput() at close()
> + * immediately destroys the dentry and, via
> + * stashed_dentry_prune, the inode.
> + *
> + * Skip the pin once PIDFS_ATTR_BIT_EXIT is set: pidfs_exit()
> + * has run (or is running) and will not run again, so a
> + * reference taken now could never be dropped (stale
> + * allocations via SO_PEERPIDFD reach here after
> + * release_task()). The wait_pidfd.lock guard makes the EXIT
> + * check and the pin atomic against the unpin in pidfs_exit().
> + */
> + scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> + struct pidfs_attr *attr = pid->attr;
> +
> + if (!IS_ERR_OR_NULL(attr) &&
> + !test_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask) &&
> + !test_and_set_bit(PIDFS_ATTR_BIT_STASH_PINNED,
> + &attr->attr_mask))
> + dget(path.dentry);
Anywhere pidfs_alloc_file() is called where cleanup doesn't pass through
pidfs_exit() will permanently leak the dentry. That needs to be fixed.
--
Christian Brauner <brauner@kernel.org>
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2026-06-18 6:36 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-18 6:36 [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression kernel test robot
-- strict thread matches above, loose matches on Subject: below --
2026-06-12 11:05 A. Sterling
2026-06-17 13:13 ` Christian Brauner
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox