[PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "A. Sterling" <adam.sterling@gmail.com>
To: Christian Brauner <brauner@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression
Date: Fri, 12 Jun 2026 07:05:29 -0400	[thread overview]
Message-ID: <20260612110529.80055-1-adam.sterling@gmail.com> (raw)

Commit cb12fd8e0dab ("pidfd: add pidfs") moved pidfds from anonymous
inodes to a pseudo filesystem, with the intent that pid->stashed would
cache the dentry+inode across multiple pidfd_open() calls for the same
struct pid.  Commit b28ddcc32d8f ("pidfs: convert to path_from_stashed()
helper") realized that intent via the stashed_dentry_get() fast-path.

In practice the stash cache never hits for serial open/close patterns.
pidfs dentries are anonymous (d_alloc_anon) and the superblock raises
DCACHE_DONTCACHE, so dput() at close() drops the refcount to zero and
immediately destroys the dentry.  __dentry_kill -> d_prune ->
stashed_dentry_prune cmpxchg()es pid->stashed back to NULL, and the
last iput() runs pidfs_evict_inode().  The next pidfd_open() finds an
empty stash slot and reallocates both the inode (new_inode_pseudo +
simple_inode_init_ts + sops->init_inode) and the dentry (d_alloc_anon
+ d_instantiate + cmpxchg into the stash slot).

Measured with a small benchmark that times a single pidfd_open()
syscall on a live process, best-of-k per run (written for this
investigation as part of an extended fork of the LEBench suite,
Ren et al. SOSP'19; the test is not part of upstream LEBench).
QEMU/KVM x86_64 guest, 5 runs per kernel, RSD <1%, kernels v5.19
to v7.1-rc2:

                            v5.19   v6.0    v6.9 (pidfs)   v7.1-rc2
  pidfd_open kbest (us)      0.40   0.39          [step]       0.50

i.e. ~25% slower than the pre-pidfs anon_inode baseline, persistent
through current.  The cost is a kmem_cache_alloc on inode_cachep + a
clock read (simple_inode_init_ts) + a kmem_cache_alloc on dentry_cache
per call, all of which can be elided when the stash cache hits.

With this patch applied to v7.1-rc4 (same guest, median kbest
across 5 boots, kernels built with the pidfd selftest config
requirements enabled):

                            v7.1-rc4 stock   v7.1-rc4 + patch
  pidfd_open kbest (us)          0.489           0.349  (-28.6%)

All eight buildable pidfd kselftest suites (41 tests, including the
file_handle suite that exercises the stashed-dentry path with
CLONE_NEWUSER children) pass identically on the stock and patched
kernels.

The patched kernel is faster than the pre-pidfs anon_inode baseline
(0.40 us at v5.19/v6.0): a stash hit reuses the cached dentry and
inode outright, where the anon_inode path still performed per-call
file setup against the shared inode.

Fix by holding an extra reference to the stashed dentry for as long as
the task is alive.  pidfs_alloc_file() takes a dget() on the first
successful path_from_stashed() per pid, recorded by a new
PIDFS_ATTR_BIT_STASH_PINNED bit in pid->attr->attr_mask.  pidfs_exit()
-- which already runs in sleepable context from release_task() --
drops the reference.

pidfs_alloc_file() is reachable after pidfs_exit() has already run:
SO_PEERPIDFD and SCM_PIDFD hold struct pid references on sockets and
can allocate pidfds for already-reaped peers (the PIDFD_STALE path).
A pin taken on that path could never be dropped, since pidfs_exit()
will not run again.  Both the pin and the unpin therefore take
pid->wait_pidfd.lock -- the lock pidfs_exit() already uses to
synchronize with pidfs_register_pid() -- and the pin is skipped once
PIDFS_ATTR_BIT_EXIT is set:

  - pin before exit's unpin section: the unpin sees the bit, drops
    the reference.  Balanced.
  - exit's unpin section before the pin attempt: PIDFS_ATTR_BIT_EXIT
    is set before the unpin section, so the later allocation observes
    it under the lock and does not pin.  Stale allocations behave as
    they do today (per-call inode+dentry), which is acceptable: the
    cache targets live processes.

The unpin's dput() runs outside the lock (pidfs_exit() is sleepable;
dput must not run under a spinlock).  While pinned, the reference
keeps the dentry alive, so stashed_dentry_prune() cannot run and
pid->stashed still points at the pinned dentry when pidfs_exit()
reads it.

Unpinning at pidfs_free_pid() instead is not possible: the pinned
dentry holds the inode, the inode holds a pid reference
(pidfs_evict_inode() does put_pid()), so the pid refcount cannot
reach zero while the pin is held and pidfs_free_pid() would never
run.

Memory cost is one inode (~800 B) + one dentry (~200 B) per pid that
has ever had a pidfd allocated, held until release_task() instead of
until the last fd close.  Bounded by pid_max.  Tasks that never have
a pidfd allocated are unaffected.

The existing VFS_WARN_ON_ONCE(pid->stashed) assertion in
pidfs_free_pid() is preserved: the exit-time dput triggers the
existing teardown chain (__dentry_kill -> stashed_dentry_prune ->
pidfs_evict_inode) before pidfs_free_pid runs.

Fixes: cb12fd8e0dab ("pidfd: add pidfs")
Assisted-by: Claude:claude-fable-5
Signed-off-by: A. Sterling <adam.sterling@gmail.com>
---
 fs/pidfs.c | 49 ++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

diff --git a/fs/pidfs.c b/fs/pidfs.c
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -44,8 +44,9 @@
 }

 enum pidfs_attr_mask_bits {
-	PIDFS_ATTR_BIT_EXIT	= 0,
-	PIDFS_ATTR_BIT_COREDUMP	= 1,
+	PIDFS_ATTR_BIT_EXIT		= 0,
+	PIDFS_ATTR_BIT_COREDUMP		= 1,
+	PIDFS_ATTR_BIT_STASH_PINNED	= 2,
 };

 struct pidfs_anon_attr {
@@ -725,6 +726,7 @@
 {
 	struct pid *pid = task_pid(tsk);
 	struct pidfs_attr *attr;
+	struct dentry *unpin = NULL;
 #ifdef CONFIG_CGROUPS
 	struct cgroup *cgrp;
 #endif
@@ -765,6 +767,23 @@
 	/* Ensure that PIDFD_GET_INFO sees either all or nothing. */
 	smp_wmb();
 	set_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask);
+
+	/*
+	 * Drop the per-pid stash reference taken by pidfs_alloc_file()
+	 * (see PIDFS_ATTR_BIT_STASH_PINNED).  The wait_pidfd.lock guard
+	 * orders this against the pin in pidfs_alloc_file(): once
+	 * PIDFS_ATTR_BIT_EXIT is visible under the lock, no new pin can
+	 * be taken.  dput() runs outside the lock; on the final
+	 * reference it triggers __dentry_kill -> stashed_dentry_prune
+	 * -> pidfs_evict_inode, the existing teardown path.
+	 */
+	scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+		if (test_and_clear_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+				       &attr->attr_mask))
+			unpin = READ_ONCE(pid->stashed);
+	}
+	if (unpin)
+		dput(unpin);
 }

 #ifdef CONFIG_COREDUMP
@@ -1151,6 +1170,32 @@

 	VFS_WARN_ON_ONCE(!pid->attr);

+	/*
+	 * Pin the stashed dentry for the lifetime of the task so that
+	 * repeat pidfd_open() on the same pid reuses the cached dentry
+	 * + inode instead of reallocating both each time.  Anonymous
+	 * pidfs dentries don't enter the dcache LRU (DCACHE_DONTCACHE),
+	 * so without an explicit reference the dput() at close()
+	 * immediately destroys the dentry and, via
+	 * stashed_dentry_prune, the inode.
+	 *
+	 * Skip the pin once PIDFS_ATTR_BIT_EXIT is set: pidfs_exit()
+	 * has run (or is running) and will not run again, so a
+	 * reference taken now could never be dropped (stale
+	 * allocations via SO_PEERPIDFD reach here after
+	 * release_task()).  The wait_pidfd.lock guard makes the EXIT
+	 * check and the pin atomic against the unpin in pidfs_exit().
+	 */
+	scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+		struct pidfs_attr *attr = pid->attr;
+
+		if (!IS_ERR_OR_NULL(attr) &&
+		    !test_bit(PIDFS_ATTR_BIT_EXIT, &attr->attr_mask) &&
+		    !test_and_set_bit(PIDFS_ATTR_BIT_STASH_PINNED,
+				      &attr->attr_mask))
+			dget(path.dentry);
+	}
+
 	flags &= ~PIDFD_STALE;
 	flags |= O_RDWR;
 	pidfd_file = dentry_open(&path, flags, current_cred());
-- 
2.45.1

next             reply	other threads:[~2026-06-12 11:05 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-12 11:05 A. Sterling [this message]
2026-06-17 13:13 ` [PATCH] pidfs: pin stashed dentry for pid lifetime to fix repeat-open regression Christian Brauner
  -- strict thread matches above, loose matches on Subject: below --
2026-06-18  6:36 kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260612110529.80055-1-adam.sterling@gmail.com \
    --to=adam.sterling@gmail.com \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.