linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC 0/4] pidfd: improve uapi when task isn't found
@ 2025-04-03 14:09 Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 1/4] selftests/pidfd: adapt to recent changes Christian Brauner
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Christian Brauner @ 2025-04-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Christian Brauner

Oleg,

We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

This is causing issues for non-threaded workloads as in [1] where they
expect ESRCH to be reported, not EINVAL. I think that's a very resonable
assumption.

This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.

Christian

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (4):
      selftests/pidfd: adapt to recent changes
      pidfd: remove unneeded NULL check from pidfd_prepare()
      pidfd: improve uapi when task isn't found
      selftest/pidfd: add test for thread-group leader pidfd open for thread

 kernel/fork.c                                   | 31 ++++++++++++++++++++++---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 13 ++++++-----
 2 files changed, 35 insertions(+), 9 deletions(-)
---
base-commit: a2cc6ff5ec8f91bc463fd3b0c26b61166a07eb11
change-id: 20250403-work-pidfd-fixes-54c5b13ee0ee


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH RFC 1/4] selftests/pidfd: adapt to recent changes
  2025-04-03 14:09 [PATCH RFC 0/4] pidfd: improve uapi when task isn't found Christian Brauner
@ 2025-04-03 14:09 ` Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 2/4] pidfd: remove unneeded NULL check from pidfd_prepare() Christian Brauner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2025-04-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Christian Brauner

Adapt to changes in commit 9133607de37a ("exit: fix the usage of
delay_group_leader->exit_code in do_notify_parent() and pidfs_exit()").

Even if the thread-group leader exited early and succesfully it's exit
status will only be reported once the whole thread-group has exited and
it will share the exit code of the thread-group. So if the thread-group
was SIGKILLed the thread-group leader will also be reported as having
been SIGKILLed.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index 1758a1b0457b..accfd6bdc539 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -362,9 +362,9 @@ TEST_F(pidfd_info, thread_group)
 	ASSERT_EQ(ioctl(pidfd_leader, PIDFD_GET_INFO, &info), 0);
 	ASSERT_FALSE(!!(info.mask & PIDFD_INFO_CREDS));
 	ASSERT_TRUE(!!(info.mask & PIDFD_INFO_EXIT));
-	/* The thread-group leader exited successfully. Only the specific thread was SIGKILLed. */
-	ASSERT_TRUE(WIFEXITED(info.exit_code));
-	ASSERT_EQ(WEXITSTATUS(info.exit_code), 0);
+	/* Even though the thread-group exited successfully it will still report the group exit code. */
+	ASSERT_TRUE(WIFSIGNALED(info.exit_code));
+	ASSERT_EQ(WTERMSIG(info.exit_code), SIGKILL);
 
 	/*
 	 * Retrieve exit information for the thread-group leader via the
@@ -375,9 +375,9 @@ TEST_F(pidfd_info, thread_group)
 	ASSERT_FALSE(!!(info2.mask & PIDFD_INFO_CREDS));
 	ASSERT_TRUE(!!(info2.mask & PIDFD_INFO_EXIT));
 
-	/* The thread-group leader exited successfully. Only the specific thread was SIGKILLed. */
-	ASSERT_TRUE(WIFEXITED(info2.exit_code));
-	ASSERT_EQ(WEXITSTATUS(info2.exit_code), 0);
+	/* Even though the thread-group exited successfully it will still report the group exit code. */
+	ASSERT_TRUE(WIFSIGNALED(info2.exit_code));
+	ASSERT_EQ(WTERMSIG(info2.exit_code), SIGKILL);
 
 	/* Retrieve exit information for the thread. */
 	info.mask = PIDFD_INFO_CGROUPID | PIDFD_INFO_EXIT;

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 2/4] pidfd: remove unneeded NULL check from pidfd_prepare()
  2025-04-03 14:09 [PATCH RFC 0/4] pidfd: improve uapi when task isn't found Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 1/4] selftests/pidfd: adapt to recent changes Christian Brauner
@ 2025-04-03 14:09 ` Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 3/4] pidfd: improve uapi when task isn't found Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 4/4] selftest/pidfd: add test for thread-group leader pidfd open for thread Christian Brauner
  3 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2025-04-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Christian Brauner

None of the caller actually pass a NULL pid in there.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index c4b26cd8998b..182ec2e9087d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2110,7 +2110,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
 	bool thread = flags & PIDFD_THREAD;
 
-	if (!pid || !pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
+	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
 		return -EINVAL;
 
 	return __pidfd_prepare(pid, flags, ret);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
  2025-04-03 14:09 [PATCH RFC 0/4] pidfd: improve uapi when task isn't found Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 1/4] selftests/pidfd: adapt to recent changes Christian Brauner
  2025-04-03 14:09 ` [PATCH RFC 2/4] pidfd: remove unneeded NULL check from pidfd_prepare() Christian Brauner
@ 2025-04-03 14:09 ` Christian Brauner
  2025-04-04 12:37   ` Oleg Nesterov
  2025-04-03 14:09 ` [PATCH RFC 4/4] selftest/pidfd: add test for thread-group leader pidfd open for thread Christian Brauner
  3 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Christian Brauner

We currently report EINVAL whenever a struct pid has no tasked attached
anymore thereby conflating two concepts:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

This is causing issues for non-threaded workloads as in [1].

This patch tries to allow userspace to distinguish between (1) and (2).
This is racy of course but that shouldn't matter.

Link: https://github.com/systemd/systemd/pull/36982 [1]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 182ec2e9087d..0fe54fcd11b3 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2108,10 +2108,35 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	bool thread = flags & PIDFD_THREAD;
+	int err = 0;
 
-	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
-		return -EINVAL;
+	if (!(flags & PIDFD_THREAD)) {
+		/*
+		 * If this is struct pid isn't used as a thread-group
+		 * leader pid but the caller requested to create a
+		 * thread-group leader pidfd then report ENOENT to the
+		 * caller as a hint.
+		 */
+		if (!pid_has_task(pid, PIDTYPE_TGID))
+			err = -ENOENT;
+	}
+
+	/*
+	 * If this wasn't a thread-group leader struct pid or the task
+	 * got reaped in the meantime report -ESRCH to userspace.
+	 *
+	 * This is racy of course. This could've not been a thread-group
+	 * leader struct pid and we set ENOENT above but in the meantime
+	 * the task got reaped. Or there was a multi-threaded-exec by a
+	 * subthread and we were a thread-group leader but now got
+	 * killed. All of that doesn't matter since the task has already
+	 * been reaped that distinction is meaningless to userspace so
+	 * just report ESRCH.
+	 */
+	if (!pid_has_task(pid, PIDTYPE_PID))
+		err = -ESRCH;
+	if (err)
+		return err;
 
 	return __pidfd_prepare(pid, flags, ret);
 }

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH RFC 4/4] selftest/pidfd: add test for thread-group leader pidfd open for thread
  2025-04-03 14:09 [PATCH RFC 0/4] pidfd: improve uapi when task isn't found Christian Brauner
                   ` (2 preceding siblings ...)
  2025-04-03 14:09 ` [PATCH RFC 3/4] pidfd: improve uapi when task isn't found Christian Brauner
@ 2025-04-03 14:09 ` Christian Brauner
  3 siblings, 0 replies; 19+ messages in thread
From: Christian Brauner @ 2025-04-03 14:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Christian Brauner

Verify that we report ENOENT when userspace tries to create a
thread-group leader pidfd for a thread pidfd that isn't a thread-group
leader.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 tools/testing/selftests/pidfd/pidfd_info_test.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/pidfd/pidfd_info_test.c b/tools/testing/selftests/pidfd/pidfd_info_test.c
index accfd6bdc539..a0eb6e81eaa2 100644
--- a/tools/testing/selftests/pidfd/pidfd_info_test.c
+++ b/tools/testing/selftests/pidfd/pidfd_info_test.c
@@ -299,6 +299,7 @@ TEST_F(pidfd_info, thread_group)
 	/* Opening a thread as a thread-group leader must fail. */
 	pidfd_thread = sys_pidfd_open(pid_thread, 0);
 	ASSERT_LT(pidfd_thread, 0);
+	ASSERT_EQ(errno, ENOENT);
 
 	/* Opening a thread as a PIDFD_THREAD must succeed. */
 	pidfd_thread = sys_pidfd_open(pid_thread, PIDFD_THREAD);

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
  2025-04-03 14:09 ` [PATCH RFC 3/4] pidfd: improve uapi when task isn't found Christian Brauner
@ 2025-04-04 12:37   ` Oleg Nesterov
  2025-04-04 13:38     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-04 12:37 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel

On 04/03, Christian Brauner wrote:
>
> We currently report EINVAL whenever a struct pid has no tasked attached
> anymore thereby conflating two concepts:
>
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
>     actually references a struct pid that isn't used as a thread-group
>     leader.
>
> This is causing issues for non-threaded workloads as in [1].
>
> This patch tries to allow userspace to distinguish between (1) and (2).
> This is racy of course but that shouldn't matter.
>
> Link: https://github.com/systemd/systemd/pull/36982 [1]
> Signed-off-by: Christian Brauner <brauner@kernel.org>

For this series:

Reviewed-by: Oleg Nesterov <oleg@redhat.com>


But I have a couple of cosmetic nits...

>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	bool thread = flags & PIDFD_THREAD;
> +	int err = 0;
>
> -	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> -		return -EINVAL;
> +	if (!(flags & PIDFD_THREAD)) {
> +		/*
> +		 * If this is struct pid isn't used as a thread-group
> +		 * leader pid but the caller requested to create a
> +		 * thread-group leader pidfd then report ENOENT to the
> +		 * caller as a hint.
> +		 */
> +		if (!pid_has_task(pid, PIDTYPE_TGID))
> +			err = -ENOENT;
> +	}
> +
> +	/*
> +	 * If this wasn't a thread-group leader struct pid or the task
> +	 * got reaped in the meantime report -ESRCH to userspace.
> +	 *
> +	 * This is racy of course. This could've not been a thread-group
> +	 * leader struct pid and we set ENOENT above but in the meantime
> +	 * the task got reaped. Or there was a multi-threaded-exec by a
> +	 * subthread and we were a thread-group leader but now got
> +	 * killed.

The comment about the multi-threaded-exec looks a bit misleading to me.
If this pid is a group-leader-pid and we race with de_thread() which does

		exchange_tids(tsk, leader);
		transfer_pid(leader, tsk, PIDTYPE_TGID);

nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
transition.

hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
hlist_head->first == NULL during this transition...

Or I misunderstood the comment?

And... the code looks a bit overcomplicated to me, why not simply

	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
	{
		if (!pid_has_task(pid, PIDTYPE_PID))
			return -ESRCH;

		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
			return -ENOENT;

		return __pidfd_prepare(pid, flags, ret);
	}

? Of course, the comments should stay.

But again, this is cosmetic/subjective, please do what you like more.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
  2025-04-04 12:37   ` Oleg Nesterov
@ 2025-04-04 13:38     ` Christian Brauner
  2025-04-04 14:53       ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-04 13:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel

On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote:
> On 04/03, Christian Brauner wrote:
> >
> > We currently report EINVAL whenever a struct pid has no tasked attached
> > anymore thereby conflating two concepts:
> >
> > (1) The task has already been reaped.
> > (2) The caller requested a pidfd for a thread-group leader but the pid
> >     actually references a struct pid that isn't used as a thread-group
> >     leader.
> >
> > This is causing issues for non-threaded workloads as in [1].
> >
> > This patch tries to allow userspace to distinguish between (1) and (2).
> > This is racy of course but that shouldn't matter.
> >
> > Link: https://github.com/systemd/systemd/pull/36982 [1]
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> 
> For this series:
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> 
> But I have a couple of cosmetic nits...
> 
> >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >  {
> > -	bool thread = flags & PIDFD_THREAD;
> > +	int err = 0;
> >
> > -	if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> > -		return -EINVAL;
> > +	if (!(flags & PIDFD_THREAD)) {
> > +		/*
> > +		 * If this is struct pid isn't used as a thread-group
> > +		 * leader pid but the caller requested to create a
> > +		 * thread-group leader pidfd then report ENOENT to the
> > +		 * caller as a hint.
> > +		 */
> > +		if (!pid_has_task(pid, PIDTYPE_TGID))
> > +			err = -ENOENT;
> > +	}
> > +
> > +	/*
> > +	 * If this wasn't a thread-group leader struct pid or the task
> > +	 * got reaped in the meantime report -ESRCH to userspace.
> > +	 *
> > +	 * This is racy of course. This could've not been a thread-group
> > +	 * leader struct pid and we set ENOENT above but in the meantime
> > +	 * the task got reaped. Or there was a multi-threaded-exec by a
> > +	 * subthread and we were a thread-group leader but now got
> > +	 * killed.
> 
> The comment about the multi-threaded-exec looks a bit misleading to me.
> If this pid is a group-leader-pid and we race with de_thread() which does
> 
> 		exchange_tids(tsk, leader);
> 		transfer_pid(leader, tsk, PIDTYPE_TGID);
> 
> nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
> pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
> transition.
> 
> hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
> hlist_head->first == NULL during this transition...

Good point.

> 
> Or I misunderstood the comment?
> 
> And... the code looks a bit overcomplicated to me, why not simply
> 
> 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> 	{
> 		if (!pid_has_task(pid, PIDTYPE_PID))
> 			return -ESRCH;
> 
> 		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> 			return -ENOENT;

I thought that checking PIDTYPE_PID first could cause misleading results
where we report ENOENT where we should report ESRCH: If the task was
released after the successful PIDTYPE_PID check for a pid that was never
a thread-group leader we report ENOENT. That's what I reversed the
check. But I can adapt that to you scheme. I mostly wanted a place to
put the comments.

> 
> 		return __pidfd_prepare(pid, flags, ret);
> 	}
> 
> ? Of course, the comments should stay.
> 
> But again, this is cosmetic/subjective, please do what you like more.
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
  2025-04-04 13:38     ` Christian Brauner
@ 2025-04-04 14:53       ` Oleg Nesterov
  2025-04-09 15:38         ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-04 14:53 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel

On 04/04, Christian Brauner wrote:
>
> On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote:
> > And... the code looks a bit overcomplicated to me, why not simply
> >
> > 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> > 	{
> > 		if (!pid_has_task(pid, PIDTYPE_PID))
> > 			return -ESRCH;
> >
> > 		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> > 			return -ENOENT;
>
> I thought that checking PIDTYPE_PID first could cause misleading results
> where we report ENOENT where we should report ESRCH: If the task was
> released after the successful PIDTYPE_PID check for a pid that was never
> a thread-group leader we report ENOENT.

Hmm... but the code above can only return ENOENT if !(flags & PIDFD_THREAD),
so in this case -ENOENT is correct?

I guess -ENOENT would be wrong if this pid _was_ a leader pid and we
race with __unhash_process() which does

	detach_pid(post->pids, p, PIDTYPE_PID);
	if (group_dead)
		detach_pid(post->pids, p, PIDTYPE_TGID);

but without tasklist_lock (or additional barries in both pidfd_prepare() and
__unhash_process() pidfd_prepare() can see the result of these 2 detach_pid()'s
in any order anyway. So I don't think the code above is "more" racy.

Although perhaps we can rely on the fact the the 1st detach_pid(PIDTYPE_PID)
does wake_up(pid->wait_pidfd) and use pid->wait_pidfd->lock to avoid the
races, not sure...

But,

> But I can adapt that to you scheme.

Again, up to you, whatever you prefer.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
  2025-04-04 14:53       ` Oleg Nesterov
@ 2025-04-09 15:38         ` Christian Brauner
  2025-04-09 18:18           ` [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-09 15:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel

On Fri, Apr 04, 2025 at 04:53:24PM +0200, Oleg Nesterov wrote:
> On 04/04, Christian Brauner wrote:
> >
> > On Fri, Apr 04, 2025 at 02:37:38PM +0200, Oleg Nesterov wrote:
> > > And... the code looks a bit overcomplicated to me, why not simply
> > >
> > > 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> > > 	{
> > > 		if (!pid_has_task(pid, PIDTYPE_PID))
> > > 			return -ESRCH;
> > >
> > > 		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> > > 			return -ENOENT;
> >
> > I thought that checking PIDTYPE_PID first could cause misleading results
> > where we report ENOENT where we should report ESRCH: If the task was
> > released after the successful PIDTYPE_PID check for a pid that was never
> > a thread-group leader we report ENOENT.
> 
> Hmm... but the code above can only return ENOENT if !(flags & PIDFD_THREAD),
> so in this case -ENOENT is correct?
> 
> I guess -ENOENT would be wrong if this pid _was_ a leader pid and we
> race with __unhash_process() which does
> 
> 	detach_pid(post->pids, p, PIDTYPE_PID);
> 	if (group_dead)
> 		detach_pid(post->pids, p, PIDTYPE_TGID);

Yes, exactly.

> 
> but without tasklist_lock (or additional barries in both pidfd_prepare() and
> __unhash_process() pidfd_prepare() can see the result of these 2 detach_pid()'s
> in any order anyway. So I don't think the code above is "more" racy.

Right... Hm, I don't like the inherent raciness of this. I think we
should fix this. I'm playing with something. I'll try to get it out
today.

> 
> Although perhaps we can rely on the fact the the 1st detach_pid(PIDTYPE_PID)
> does wake_up(pid->wait_pidfd) and use pid->wait_pidfd->lock to avoid the
> races, not sure...
> 
> But,
> 
> > But I can adapt that to you scheme.
> 
> Again, up to you, whatever you prefer.
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-09 15:38         ` Christian Brauner
@ 2025-04-09 18:18           ` Christian Brauner
  2025-04-09 18:40             ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-09 18:18 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Christian Brauner, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan, linux-kernel, Peter Ziljstra

In a prior patch series we tried to cleanly differentiate between:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

as this was causing issues for non-threaded workloads.

But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().

Stabilize this by using a sequence counter associated with tasklist_lock
and retry while we're in __unhash_process(). The seqcounter might be
useful independent of pidfs.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 include/linux/pid.h |  1 +
 kernel/exit.c       | 11 +++++++++++
 kernel/fork.c       | 22 ++++++++++++----------
 kernel/pid.c        |  1 +
 4 files changed, 25 insertions(+), 10 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 311ecebd7d56..b54a4c1ef602 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -65,6 +65,7 @@ struct pid
 	struct hlist_head inodes;
 	/* wait queue for pidfd notifications */
 	wait_queue_head_t wait_pidfd;
+	seqcount_rwlock_t pid_seq;
 	struct rcu_head rcu;
 	struct upid numbers[];
 };
diff --git a/kernel/exit.c b/kernel/exit.c
index 1b51dc099f1e..8050572fe682 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -133,17 +133,28 @@ struct release_task_post {
 static void __unhash_process(struct release_task_post *post, struct task_struct *p,
 			     bool group_dead)
 {
+	struct pid *pid;
+
+	lockdep_assert_held_write(&tasklist_lock);
+
 	nr_threads--;
+
+	pid = task_pid(p);
+	raw_write_seqcount_begin(&pid->pid_seq);
 	detach_pid(post->pids, p, PIDTYPE_PID);
 	if (group_dead) {
 		detach_pid(post->pids, p, PIDTYPE_TGID);
 		detach_pid(post->pids, p, PIDTYPE_PGID);
 		detach_pid(post->pids, p, PIDTYPE_SID);
+	}
+	raw_write_seqcount_end(&pid->pid_seq);
 
+	if (group_dead) {
 		list_del_rcu(&p->tasks);
 		list_del_init(&p->sibling);
 		__this_cpu_dec(process_counts);
 	}
+
 	list_del_rcu(&p->thread_node);
 }
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4a2080b968c8..1480bf6f5f38 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2109,24 +2109,26 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
 	int err = 0;
+	unsigned int seq;
 
-	if (!(flags & PIDFD_THREAD)) {
+	do {
+		seq = raw_seqcount_begin(&pid->pid_seq);
 		/*
 		 * If this is struct pid isn't used as a thread-group
 		 * leader pid but the caller requested to create a
 		 * thread-group leader pidfd then report ENOENT to the
 		 * caller as a hint.
 		 */
-		if (!pid_has_task(pid, PIDTYPE_TGID))
+		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
 			err = -ENOENT;
-	}
-
-	/*
-	 * If this wasn't a thread-group leader struct pid or the task
-	 * got reaped in the meantime report -ESRCH to userspace.
-	 */
-	if (!pid_has_task(pid, PIDTYPE_PID))
-		err = -ESRCH;
+		/*
+		 * If this wasn't a thread-group leader struct pid or
+		 * the task got reaped in the meantime report -ESRCH to
+		 * userspace.
+		 */
+		if (!pid_has_task(pid, PIDTYPE_PID))
+			err = -ESRCH;
+	} while (read_seqcount_retry(&pid->pid_seq, seq));
 	if (err)
 		return err;
 
diff --git a/kernel/pid.c b/kernel/pid.c
index 4ac2ce46817f..bbca61f62faa 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -271,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
 	upid = pid->numbers + ns->level;
 	idr_preload(GFP_KERNEL);
 	spin_lock(&pidmap_lock);
+	seqcount_rwlock_init(&pid->pid_seq, &tasklist_lock);
 	if (!(ns->pid_allocated & PIDNS_ADDING))
 		goto out_unlock;
 	pidfs_add_pid(pid);
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-09 18:18           ` [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
@ 2025-04-09 18:40             ` Oleg Nesterov
  2025-04-10 10:18               ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-09 18:40 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

Christian,

I will actually read your patch tomorrow, but at first glance

On 04/09, Christian Brauner wrote:
>
> The seqcounter might be
> useful independent of pidfs.

Are you sure? ;) to me the new pid->pid_seq needs more justification...

Again, can't we use pid->wait_pidfd->lock if we want to avoid the
(minor) problem with the wrong ENOENT?

or even signal->siglock, although in this case we will need
pid_task() + lock_task_sighand()...

Oleg.

> Signed-off-by: Christian Brauner <brauner@kernel.org>
> ---
>  include/linux/pid.h |  1 +
>  kernel/exit.c       | 11 +++++++++++
>  kernel/fork.c       | 22 ++++++++++++----------
>  kernel/pid.c        |  1 +
>  4 files changed, 25 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 311ecebd7d56..b54a4c1ef602 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -65,6 +65,7 @@ struct pid
>  	struct hlist_head inodes;
>  	/* wait queue for pidfd notifications */
>  	wait_queue_head_t wait_pidfd;
> +	seqcount_rwlock_t pid_seq;
>  	struct rcu_head rcu;
>  	struct upid numbers[];
>  };
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 1b51dc099f1e..8050572fe682 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -133,17 +133,28 @@ struct release_task_post {
>  static void __unhash_process(struct release_task_post *post, struct task_struct *p,
>  			     bool group_dead)
>  {
> +	struct pid *pid;
> +
> +	lockdep_assert_held_write(&tasklist_lock);
> +
>  	nr_threads--;
> +
> +	pid = task_pid(p);
> +	raw_write_seqcount_begin(&pid->pid_seq);
>  	detach_pid(post->pids, p, PIDTYPE_PID);
>  	if (group_dead) {
>  		detach_pid(post->pids, p, PIDTYPE_TGID);
>  		detach_pid(post->pids, p, PIDTYPE_PGID);
>  		detach_pid(post->pids, p, PIDTYPE_SID);
> +	}
> +	raw_write_seqcount_end(&pid->pid_seq);
>  
> +	if (group_dead) {
>  		list_del_rcu(&p->tasks);
>  		list_del_init(&p->sibling);
>  		__this_cpu_dec(process_counts);
>  	}
> +
>  	list_del_rcu(&p->thread_node);
>  }
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 4a2080b968c8..1480bf6f5f38 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2109,24 +2109,26 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
>  	int err = 0;
> +	unsigned int seq;
>  
> -	if (!(flags & PIDFD_THREAD)) {
> +	do {
> +		seq = raw_seqcount_begin(&pid->pid_seq);
>  		/*
>  		 * If this is struct pid isn't used as a thread-group
>  		 * leader pid but the caller requested to create a
>  		 * thread-group leader pidfd then report ENOENT to the
>  		 * caller as a hint.
>  		 */
> -		if (!pid_has_task(pid, PIDTYPE_TGID))
> +		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
>  			err = -ENOENT;
> -	}
> -
> -	/*
> -	 * If this wasn't a thread-group leader struct pid or the task
> -	 * got reaped in the meantime report -ESRCH to userspace.
> -	 */
> -	if (!pid_has_task(pid, PIDTYPE_PID))
> -		err = -ESRCH;
> +		/*
> +		 * If this wasn't a thread-group leader struct pid or
> +		 * the task got reaped in the meantime report -ESRCH to
> +		 * userspace.
> +		 */
> +		if (!pid_has_task(pid, PIDTYPE_PID))
> +			err = -ESRCH;
> +	} while (read_seqcount_retry(&pid->pid_seq, seq));
>  	if (err)
>  		return err;
>  
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 4ac2ce46817f..bbca61f62faa 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -271,6 +271,7 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>  	upid = pid->numbers + ns->level;
>  	idr_preload(GFP_KERNEL);
>  	spin_lock(&pidmap_lock);
> +	seqcount_rwlock_init(&pid->pid_seq, &tasklist_lock);
>  	if (!(ns->pid_allocated & PIDNS_ADDING))
>  		goto out_unlock;
>  	pidfs_add_pid(pid);
> -- 
> 2.47.2
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-09 18:40             ` Oleg Nesterov
@ 2025-04-10 10:18               ` Oleg Nesterov
  2025-04-10 10:43                 ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-10 10:18 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On 04/09, Oleg Nesterov wrote:
>
> Christian,
>
> I will actually read your patch tomorrow, but at first glance
>
> On 04/09, Christian Brauner wrote:
> >
> > The seqcounter might be
> > useful independent of pidfs.
>
> Are you sure? ;) to me the new pid->pid_seq needs more justification...
>
> Again, can't we use pid->wait_pidfd->lock if we want to avoid the
> (minor) problem with the wrong ENOENT?

I mean

	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
	{
		int err = 0;

		spin_lock_irq(&pid->wait_pidfd->lock);

		if (!pid_has_task(pid, PIDTYPE_PID))
			err = -ESRCH;
		else if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
			err = -ENOENT;

		spin_lock_irq(&pid->wait_pidfd->lock);

		return err ?: __pidfd_prepare(pid, flags, ret);
	}

To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
takes pid->wait_pidfd->lock.

So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
is not possible until we drop pid->wait_pidfd->lock.

If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
pid_has_task(PIDTYPE_PID) can't succeed.

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-10 10:18               ` Oleg Nesterov
@ 2025-04-10 10:43                 ` Christian Brauner
  2025-04-10 13:10                   ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-10 10:43 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> On 04/09, Oleg Nesterov wrote:
> >
> > Christian,
> >
> > I will actually read your patch tomorrow, but at first glance
> >
> > On 04/09, Christian Brauner wrote:
> > >
> > > The seqcounter might be
> > > useful independent of pidfs.
> >
> > Are you sure? ;) to me the new pid->pid_seq needs more justification...

Yeah, pretty much. I'd make use of this in other cases where we need to
detect concurrent changes to struct pid without having to take any
locks. Multi-threaded exec in de_exec() comes to mind as well.

> > Again, can't we use pid->wait_pidfd->lock if we want to avoid the
> > (minor) problem with the wrong ENOENT?
> 
> I mean
> 
> 	int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> 	{
> 		int err = 0;
> 
> 		spin_lock_irq(&pid->wait_pidfd->lock);
> 
> 		if (!pid_has_task(pid, PIDTYPE_PID))
> 			err = -ESRCH;
> 		else if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
> 			err = -ENOENT;
> 
> 		spin_lock_irq(&pid->wait_pidfd->lock);
> 
> 		return err ?: __pidfd_prepare(pid, flags, ret);
> 	}
> 
> To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> takes pid->wait_pidfd->lock.
> 
> So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> is not possible until we drop pid->wait_pidfd->lock.
> 
> If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> pid_has_task(PIDTYPE_PID) can't succeed.

I know. I was trying to avoid having to take the lock and just make this
lockless. But if you think we should use this lock here instead I'm
willing to do this. I just find the sequence counter more elegant than
the spin_lock_irq().

And note that it doesn't grow struct pid. There's a 4 byte hole I would
place it into just before struct dentry *. So there's no downside to
this imho and it would give pidfds a reliable way to detect relevant
concurrent changes locklessly without penalizing other critical paths
(e.g., under tasklist_lock) in the kernel.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-10 10:43                 ` Christian Brauner
@ 2025-04-10 13:10                   ` Oleg Nesterov
  2025-04-10 20:05                     ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-10 13:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On 04/10, Christian Brauner wrote:
>
> On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > On 04/09, Oleg Nesterov wrote:
> > >
> > > On 04/09, Christian Brauner wrote:
> > > >
> > > > The seqcounter might be
> > > > useful independent of pidfs.
> > >
> > > Are you sure? ;) to me the new pid->pid_seq needs more justification...
>
> Yeah, pretty much. I'd make use of this in other cases where we need to
> detect concurrent changes to struct pid without having to take any
> locks. Multi-threaded exec in de_exec() comes to mind as well.

Perhaps you are right, but so far I am still not sure it makes sense.
And we can always add it later if we have another (more convincing)
use-case.

> > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > takes pid->wait_pidfd->lock.
> >
> > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > is not possible until we drop pid->wait_pidfd->lock.
> >
> > If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> > pid_has_task(PIDTYPE_PID) can't succeed.
>
> I know. I was trying to avoid having to take the lock and just make this
> lockless. But if you think we should use this lock here instead I'm
> willing to do this. I just find the sequence counter more elegant than
> the spin_lock_irq().

This is subjective, and quite possibly I am wrong. But yes, I'd prefer
to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
penalize __unhash_process(). Simply because this is simpler.

If you really dislike taking wait_pidfd->lock, we can add mb() into
__unhash_process() or even smp_mb__after_spinlock() into __change_pid(),
but this will need a lengthy comment...



As for your patch... it doesn't apply on top of 3/4, but I guess it
is clear what does it do, and (unfortunately ;) it looks correct, so
I won't insist too much. See a couple of nits below.

> this imho and it would give pidfds a reliable way to detect relevant
> concurrent changes locklessly without penalizing other critical paths
> (e.g., under tasklist_lock) in the kernel.

Can't resist... Note that raw_seqcount_begin() in pidfd_prepare() will
take/drop tasklist_lock if it races with __unhash_process() on PREEMPT_RT.
Yes, this is unlikely case, but still...

Now. Unless I misread your patch, pidfd_prepare() does "err = 0" only
once before the main loop. And this is correct. But this means that
we do not need the do/while loop.

If read_seqcount_retry() returns true, we can safely return -ESRCH. So
we can do

	seq = raw_seqcount_begin(&pid->pid_seq);

	if (!PIDFD_THREAD && !pid_has_task(PIDTYPE_TGID))
		err = -ENOENT;

	if (!pid_has_task(PIDTYPE_PID))
		err = -ESRCH;

	if (read_seqcount_retry(pid->pid_seq, seq))
		err = -ESRCH;

In fact we don't even need raw_seqcount_begin(), we could use
raw_seqcount_try_begin().

And why seqcount_rwlock_t? A plain seqcount_t can equally work.


And, if we use seqcount_rwlock_t,

	lockdep_assert_held_write(&tasklist_lock);
	...
	raw_write_seqcount_begin(pid->pid_seq);

in __unhash_process() looks a bit strange. I'd suggest to use
write_seqcount_begin() which does seqprop_assert() and kill
lockdep_assert_held_write().

Oleg.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-10 13:10                   ` Oleg Nesterov
@ 2025-04-10 20:05                     ` Christian Brauner
  2025-04-10 20:24                       ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-10 20:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On Thu, Apr 10, 2025 at 03:10:09PM +0200, Oleg Nesterov wrote:
> On 04/10, Christian Brauner wrote:
> >
> > On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > > On 04/09, Oleg Nesterov wrote:
> > > >
> > > > On 04/09, Christian Brauner wrote:
> > > > >
> > > > > The seqcounter might be
> > > > > useful independent of pidfs.
> > > >
> > > > Are you sure? ;) to me the new pid->pid_seq needs more justification...
> >
> > Yeah, pretty much. I'd make use of this in other cases where we need to
> > detect concurrent changes to struct pid without having to take any
> > locks. Multi-threaded exec in de_exec() comes to mind as well.
> 
> Perhaps you are right, but so far I am still not sure it makes sense.
> And we can always add it later if we have another (more convincing)
> use-case.
> 
> > > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > > takes pid->wait_pidfd->lock.
> > >
> > > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > > is not possible until we drop pid->wait_pidfd->lock.
> > >
> > > If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> > > pid_has_task(PIDTYPE_PID) can't succeed.
> >
> > I know. I was trying to avoid having to take the lock and just make this
> > lockless. But if you think we should use this lock here instead I'm
> > willing to do this. I just find the sequence counter more elegant than
> > the spin_lock_irq().
> 
> This is subjective, and quite possibly I am wrong. But yes, I'd prefer
> to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
> penalize __unhash_process(). Simply because this is simpler.
> 
> If you really dislike taking wait_pidfd->lock, we can add mb() into
> __unhash_process() or even smp_mb__after_spinlock() into __change_pid(),
> but this will need a lengthy comment...

No, I don't think we should do that.

> As for your patch... it doesn't apply on top of 3/4, but I guess it
> is clear what does it do, and (unfortunately ;) it looks correct, so
> I won't insist too much. See a couple of nits below.
> 
> > this imho and it would give pidfds a reliable way to detect relevant
> > concurrent changes locklessly without penalizing other critical paths
> > (e.g., under tasklist_lock) in the kernel.
> 
> Can't resist... Note that raw_seqcount_begin() in pidfd_prepare() will
> take/drop tasklist_lock if it races with __unhash_process() on PREEMPT_RT.

Eeeeew,

        if (!IS_ENABLED(CONFIG_PREEMPT_RT))                             \
                return seq;                                             \
                                                                        \
        if (preemptible && unlikely(seq & 1)) {                         \
                __SEQ_LOCK(lockbase##_lock(s->lock));                   \
                __SEQ_LOCK(lockbase##_unlock(s->lock));                 \

priority inversion fix, I take it. That's equally ugly as what we had to
do for mnt_get_write_access()...

I actually think what you just pointed out is rather problematic. It's
absolutely wild that raw_seqcount_begin() suddenly implies locking.

How isn't that a huge landmine? On non-rt I can happily do:

acquire_associated_lock()
raw_seqcount_begin()
drop_associated_lock()

But this will immediately turn into a deadlock on preempt-rt, no?

> Yes, this is unlikely case, but still...
> 
> Now. Unless I misread your patch, pidfd_prepare() does "err = 0" only
> once before the main loop. And this is correct. But this means that
> we do not need the do/while loop.

Yes, I know. I simply used the common idiom.

> 
> If read_seqcount_retry() returns true, we can safely return -ESRCH. So
> we can do
> 
> 	seq = raw_seqcount_begin(&pid->pid_seq);
> 
> 	if (!PIDFD_THREAD && !pid_has_task(PIDTYPE_TGID))
> 		err = -ENOENT;
> 
> 	if (!pid_has_task(PIDTYPE_PID))
> 		err = -ESRCH;
> 
> 	if (read_seqcount_retry(pid->pid_seq, seq))
> 		err = -ESRCH;
> 
> In fact we don't even need raw_seqcount_begin(), we could use
> raw_seqcount_try_begin().
> 
> And why seqcount_rwlock_t? A plain seqcount_t can equally work.

Yes, but this way its dependence on tasklist_lock is natively integrated
with lockdep afaict:

 * typedef seqcount_LOCKNAME_t - sequence counter with LOCKNAME associated
 * @seqcount:   The real sequence counter
 * @lock:       Pointer to the associated lock
 *
 * A plain sequence counter with external writer synchronization by
 * LOCKNAME @lock. The lock is associated to the sequence counter in the
 * static initializer or init function. This enables lockdep to validate
 * that the write side critical section is properly serialized.
 *
 * LOCKNAME:    raw_spinlock, spinlock, rwlock or mutex
 */

/*
 * seqcount_LOCKNAME_init() - runtime initializer for seqcount_LOCKNAME_t
 * @s:          Pointer to the seqcount_LOCKNAME_t instance
 * @lock:       Pointer to the associated lock
 */

#define seqcount_LOCKNAME_init(s, _lock, lockname)                      \
        do {                                                            \
                seqcount_##lockname##_t *____s = (s);                   \
                seqcount_init(&____s->seqcount);                        \
                __SEQ_LOCK(____s->lock = (_lock));                      \
        } while (0)

#define seqcount_raw_spinlock_init(s, lock)     seqcount_LOCKNAME_init(s, lock, raw_spinlock)
#define seqcount_spinlock_init(s, lock)         seqcount_LOCKNAME_init(s, lock, spinlock)
#define seqcount_rwlock_init(s, lock)           seqcount_LOCKNAME_init(s, lock, rwlock)
#define seqcount_mutex_init(s, lock)            seqcount_LOCKNAME_init(s, lock, mutex)

> And, if we use seqcount_rwlock_t,
> 
> 	lockdep_assert_held_write(&tasklist_lock);
> 	...
> 	raw_write_seqcount_begin(pid->pid_seq);
> 
> in __unhash_process() looks a bit strange. I'd suggest to use
> write_seqcount_begin() which does seqprop_assert() and kill
> lockdep_assert_held_write().
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-10 20:05                     ` Christian Brauner
@ 2025-04-10 20:24                       ` Christian Brauner
  2025-04-11 11:08                         ` Christian Brauner
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-10 20:24 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On Thu, Apr 10, 2025 at 10:05:58PM +0200, Christian Brauner wrote:
> On Thu, Apr 10, 2025 at 03:10:09PM +0200, Oleg Nesterov wrote:
> > On 04/10, Christian Brauner wrote:
> > >
> > > On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > > > On 04/09, Oleg Nesterov wrote:
> > > > >
> > > > > On 04/09, Christian Brauner wrote:
> > > > > >
> > > > > > The seqcounter might be
> > > > > > useful independent of pidfs.
> > > > >
> > > > > Are you sure? ;) to me the new pid->pid_seq needs more justification...
> > >
> > > Yeah, pretty much. I'd make use of this in other cases where we need to
> > > detect concurrent changes to struct pid without having to take any
> > > locks. Multi-threaded exec in de_exec() comes to mind as well.
> > 
> > Perhaps you are right, but so far I am still not sure it makes sense.
> > And we can always add it later if we have another (more convincing)
> > use-case.
> > 
> > > > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > > > takes pid->wait_pidfd->lock.
> > > >
> > > > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > > > is not possible until we drop pid->wait_pidfd->lock.
> > > >
> > > > If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> > > > pid_has_task(PIDTYPE_PID) can't succeed.
> > >
> > > I know. I was trying to avoid having to take the lock and just make this
> > > lockless. But if you think we should use this lock here instead I'm
> > > willing to do this. I just find the sequence counter more elegant than
> > > the spin_lock_irq().
> > 
> > This is subjective, and quite possibly I am wrong. But yes, I'd prefer
> > to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
> > penalize __unhash_process(). Simply because this is simpler.

Looking close at this. Why is:

        if (type == PIDTYPE_PID) {
                WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
                wake_up_all(&pid->wait_pidfd);
        }

located in __change_pid()? The only valid call to __change_pid() with a NULL
argument and PIDTYPE_PID is from __unhash_process(), no?

So why isn't this in __unhash_process() where it's immediately obvious
that it's the only valid place this can currently be called from?

diff --git a/kernel/exit.c b/kernel/exit.c
index 1b51dc099f1e..d92e8bee0ab7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -135,6 +135,7 @@ static void __unhash_process(struct release_task_post *post, struct task_struct
 {
        nr_threads--;
        detach_pid(post->pids, p, PIDTYPE_PID);
+       wake_up_all(&post->pids[PIDTYPE_PID]->wait_pidfd);
        if (group_dead) {
                detach_pid(post->pids, p, PIDTYPE_TGID);
                detach_pid(post->pids, p, PIDTYPE_PGID);
diff --git a/kernel/pid.c b/kernel/pid.c
index 4ac2ce46817f..26f1e136f017 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -359,11 +359,6 @@ static void __change_pid(struct pid **pids, struct task_struct *task,
        hlist_del_rcu(&task->pid_links[type]);
        *pid_ptr = new;

-       if (type == PIDTYPE_PID) {
-               WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
-               wake_up_all(&pid->wait_pidfd);
-       }
-
        for (tmp = PIDTYPE_MAX; --tmp >= 0; )
                if (pid_has_task(pid, tmp))
                        return;

I'm probably missing something obvious.

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-10 20:24                       ` Christian Brauner
@ 2025-04-11 11:08                         ` Christian Brauner
  2025-04-11 11:25                           ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Christian Brauner @ 2025-04-11 11:08 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On Thu, Apr 10, 2025 at 10:24:22PM +0200, Christian Brauner wrote:
> On Thu, Apr 10, 2025 at 10:05:58PM +0200, Christian Brauner wrote:
> > On Thu, Apr 10, 2025 at 03:10:09PM +0200, Oleg Nesterov wrote:
> > > On 04/10, Christian Brauner wrote:
> > > >
> > > > On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > > > > On 04/09, Oleg Nesterov wrote:
> > > > > >
> > > > > > On 04/09, Christian Brauner wrote:
> > > > > > >
> > > > > > > The seqcounter might be
> > > > > > > useful independent of pidfs.
> > > > > >
> > > > > > Are you sure? ;) to me the new pid->pid_seq needs more justification...
> > > >
> > > > Yeah, pretty much. I'd make use of this in other cases where we need to
> > > > detect concurrent changes to struct pid without having to take any
> > > > locks. Multi-threaded exec in de_exec() comes to mind as well.
> > > 
> > > Perhaps you are right, but so far I am still not sure it makes sense.
> > > And we can always add it later if we have another (more convincing)
> > > use-case.
> > > 
> > > > > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > > > > takes pid->wait_pidfd->lock.
> > > > >
> > > > > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > > > > is not possible until we drop pid->wait_pidfd->lock.
> > > > >
> > > > > If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> > > > > pid_has_task(PIDTYPE_PID) can't succeed.
> > > >
> > > > I know. I was trying to avoid having to take the lock and just make this
> > > > lockless. But if you think we should use this lock here instead I'm
> > > > willing to do this. I just find the sequence counter more elegant than
> > > > the spin_lock_irq().
> > > 
> > > This is subjective, and quite possibly I am wrong. But yes, I'd prefer
> > > to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
> > > penalize __unhash_process(). Simply because this is simpler.
> 
> Looking close at this. Why is:
> 
>         if (type == PIDTYPE_PID) {
>                 WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
>                 wake_up_all(&pid->wait_pidfd);
>         }
> 
> located in __change_pid()? The only valid call to __change_pid() with a NULL
> argument and PIDTYPE_PID is from __unhash_process(), no?

We used to perform free_pid() directly from __change_pid() so prior to
v6.15 changes it wasn't possible. Now that we free the pids separately let's
just move the notification into __unhash_process(). I have a patch ready
for this.

> 
> So why isn't this in __unhash_process() where it's immediately obvious
> that it's the only valid place this can currently be called from?
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 1b51dc099f1e..d92e8bee0ab7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -135,6 +135,7 @@ static void __unhash_process(struct release_task_post *post, struct task_struct
>  {
>         nr_threads--;
>         detach_pid(post->pids, p, PIDTYPE_PID);
> +       wake_up_all(&post->pids[PIDTYPE_PID]->wait_pidfd);
>         if (group_dead) {
>                 detach_pid(post->pids, p, PIDTYPE_TGID);
>                 detach_pid(post->pids, p, PIDTYPE_PGID);
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 4ac2ce46817f..26f1e136f017 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -359,11 +359,6 @@ static void __change_pid(struct pid **pids, struct task_struct *task,
>         hlist_del_rcu(&task->pid_links[type]);
>         *pid_ptr = new;
> 
> -       if (type == PIDTYPE_PID) {
> -               WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
> -               wake_up_all(&pid->wait_pidfd);
> -       }
> -
>         for (tmp = PIDTYPE_MAX; --tmp >= 0; )
>                 if (pid_has_task(pid, tmp))
>                         return;
> 
> I'm probably missing something obvious.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 11:08                         ` Christian Brauner
@ 2025-04-11 11:25                           ` Oleg Nesterov
  2025-04-11 11:41                             ` Oleg Nesterov
  0 siblings, 1 reply; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-11 11:25 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On 04/11, Christian Brauner wrote:
>
> > Looking close at this. Why is:
> >
> >         if (type == PIDTYPE_PID) {
> >                 WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
> >                 wake_up_all(&pid->wait_pidfd);
> >         }
> >
> > located in __change_pid()? The only valid call to __change_pid() with a NULL
> > argument and PIDTYPE_PID is from __unhash_process(), no?
>
> We used to perform free_pid() directly from __change_pid() so prior to
> v6.15 changes it wasn't possible.

Yes, exactly ;)

> Now that we free the pids separately let's
> just move the notification into __unhash_process(). I have a patch ready
> for this.

Agreed,

Acked-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 11:25                           ` Oleg Nesterov
@ 2025-04-11 11:41                             ` Oleg Nesterov
  0 siblings, 0 replies; 19+ messages in thread
From: Oleg Nesterov @ 2025-04-11 11:41 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On 04/11, Oleg Nesterov wrote:
>
> On 04/11, Christian Brauner wrote:
> >
> > > Looking close at this. Why is:
> > >
> > >         if (type == PIDTYPE_PID) {
> > >                 WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
> > >                 wake_up_all(&pid->wait_pidfd);
> > >         }
> > >
> > > located in __change_pid()? The only valid call to __change_pid() with a NULL
> > > argument and PIDTYPE_PID is from __unhash_process(), no?
> >
> > We used to perform free_pid() directly from __change_pid() so prior to
> > v6.15 changes it wasn't possible.
>
> Yes, exactly ;)

To clarify, it was actually possible because the caller, release_task(),
does

	thread_pid = get_pid(p->thread_pid);

before __exit_signal() and detach_pid(PIDTYPE_PID) uses the same
task_struct->thread_pid. But I didn't want to rely on this fact.

And it seems we can do another cleanup... We can kill the no longer
needed get_pid/put_pid in release_task(). I'll send the patch.

> > Now that we free the pids separately let's
> > just move the notification into __unhash_process(). I have a patch ready
> > for this.
>
> Agreed,
>
> Acked-by: Oleg Nesterov <oleg@redhat.com>


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-04-11 11:42 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-03 14:09 [PATCH RFC 0/4] pidfd: improve uapi when task isn't found Christian Brauner
2025-04-03 14:09 ` [PATCH RFC 1/4] selftests/pidfd: adapt to recent changes Christian Brauner
2025-04-03 14:09 ` [PATCH RFC 2/4] pidfd: remove unneeded NULL check from pidfd_prepare() Christian Brauner
2025-04-03 14:09 ` [PATCH RFC 3/4] pidfd: improve uapi when task isn't found Christian Brauner
2025-04-04 12:37   ` Oleg Nesterov
2025-04-04 13:38     ` Christian Brauner
2025-04-04 14:53       ` Oleg Nesterov
2025-04-09 15:38         ` Christian Brauner
2025-04-09 18:18           ` [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
2025-04-09 18:40             ` Oleg Nesterov
2025-04-10 10:18               ` Oleg Nesterov
2025-04-10 10:43                 ` Christian Brauner
2025-04-10 13:10                   ` Oleg Nesterov
2025-04-10 20:05                     ` Christian Brauner
2025-04-10 20:24                       ` Christian Brauner
2025-04-11 11:08                         ` Christian Brauner
2025-04-11 11:25                           ` Oleg Nesterov
2025-04-11 11:41                             ` Oleg Nesterov
2025-04-03 14:09 ` [PATCH RFC 4/4] selftest/pidfd: add test for thread-group leader pidfd open for thread Christian Brauner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).