[PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
@ 2025-04-11 13:22 Christian Brauner
  2025-04-11 13:22 ` [PATCH v2 1/2] exit: move wake_up_all() pidfd waiters into __unhash_process() Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Christian Brauner @ 2025-04-11 13:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra, Christian Brauner

In a prior patch series we tried to cleanly differentiate between:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
actually references a struct pid that isn't used as a thread-group
leader.

as this was causing issues for non-threaded workloads.

But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
Christian Brauner (2):
      exit: move wake_up_all() pidfd waiters into __unhash_process()
      pidfs: ensure consistent ENOENT/ESRCH reporting

 kernel/exit.c |  5 +++++
 kernel/fork.c | 31 +++++++++++++------------------
 kernel/pid.c  |  5 -----
 3 files changed, 18 insertions(+), 23 deletions(-)
---
base-commit: 1e940fff94374d04b6c34f896ed9fbad3d2fb706
change-id: 20250411-work-pidfs-enoent-747579160c8c

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v2 1/2] exit: move wake_up_all() pidfd waiters into __unhash_process()
  2025-04-11 13:22 [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
@ 2025-04-11 13:22 ` Christian Brauner
  2025-04-11 13:22 ` [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
  2025-04-15 22:34 ` [PATCH v2 0/2] " Nathan Chancellor
  2 siblings, 0 replies; 11+ messages in thread
From: Christian Brauner @ 2025-04-11 13:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra, Christian Brauner

Move the pidfd notification out of __change_pid() and into
__unhash_process(). The only valid call to __change_pid() with a NULL
argument and PIDTYPE_PID is from __unhash_process(). This is a lot more
obvious than calling it from __change_pid().

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/exit.c | 5 +++++
 kernel/pid.c  | 5 -----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1b51dc099f1e..abcd93ce4e18 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -133,8 +133,13 @@ struct release_task_post {
 static void __unhash_process(struct release_task_post *post, struct task_struct *p,
 			     bool group_dead)
 {
+	struct pid *pid = task_pid(p);
+
 	nr_threads--;
+
 	detach_pid(post->pids, p, PIDTYPE_PID);
+	wake_up_all(&pid->wait_pidfd);
+
 	if (group_dead) {
 		detach_pid(post->pids, p, PIDTYPE_TGID);
 		detach_pid(post->pids, p, PIDTYPE_PGID);
diff --git a/kernel/pid.c b/kernel/pid.c
index 4ac2ce46817f..26f1e136f017 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -359,11 +359,6 @@ static void __change_pid(struct pid **pids, struct task_struct *task,
 	hlist_del_rcu(&task->pid_links[type]);
 	*pid_ptr = new;
 
-	if (type == PIDTYPE_PID) {
-		WARN_ON_ONCE(pid_has_task(pid, PIDTYPE_PID));
-		wake_up_all(&pid->wait_pidfd);
-	}
-
 	for (tmp = PIDTYPE_MAX; --tmp >= 0; )
 		if (pid_has_task(pid, tmp))
 			return;

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 13:22 [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
  2025-04-11 13:22 ` [PATCH v2 1/2] exit: move wake_up_all() pidfd waiters into __unhash_process() Christian Brauner
@ 2025-04-11 13:22 ` Christian Brauner
  2025-04-11 13:54   ` Oleg Nesterov
  2025-04-15 22:34 ` [PATCH v2 0/2] " Nathan Chancellor
  2 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-04-11 13:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra, Christian Brauner

In a prior patch series we tried to cleanly differentiate between:

(1) The task has already been reaped.
(2) The caller requested a pidfd for a thread-group leader but the pid
    actually references a struct pid that isn't used as a thread-group
    leader.

as this was causing issues for non-threaded workloads.

But there's cases where the current simple logic is wrong. Specifically,
if the pid was a leader pid and the check races with __unhash_process().
Stabilize this by using the pidfd waitqueue lock.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 kernel/fork.c | 31 +++++++++++++------------------
 1 file changed, 13 insertions(+), 18 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4a2080b968c8..cde960fd0c71 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2108,28 +2108,23 @@ static int __pidfd_prepare(struct pid *pid, unsigned int flags, struct file **re
  */
 int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
 {
-	int err = 0;
-
-	if (!(flags & PIDFD_THREAD)) {
+	scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
+		/*
+		 * If this wasn't a thread-group leader struct pid or
+		 * the task already been reaped report ESRCH to
+		 * userspace.
+		 */
+		if (!pid_has_task(pid, PIDTYPE_PID))
+			return -ESRCH;
 		/*
-		 * If this is struct pid isn't used as a thread-group
-		 * leader pid but the caller requested to create a
-		 * thread-group leader pidfd then report ENOENT to the
-		 * caller as a hint.
+		 * If this struct pid isn't used as a thread-group
+		 * leader but the caller requested to create a
+		 * thread-group leader pidfd then report ENOENT.
 		 */
-		if (!pid_has_task(pid, PIDTYPE_TGID))
-			err = -ENOENT;
+		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
+			return -ENOENT;
 	}
 
-	/*
-	 * If this wasn't a thread-group leader struct pid or the task
-	 * got reaped in the meantime report -ESRCH to userspace.
-	 */
-	if (!pid_has_task(pid, PIDTYPE_PID))
-		err = -ESRCH;
-	if (err)
-		return err;
-
 	return __pidfd_prepare(pid, flags, ret);
 }
 

-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 13:22 ` [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
@ 2025-04-11 13:54   ` Oleg Nesterov
  2025-04-11 15:14     ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Oleg Nesterov @ 2025-04-11 13:54 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

For both patches:

Reviewed-by: Oleg Nesterov <oleg@redhat.com>

a minor nit below...

On 04/11, Christian Brauner wrote:
>
>  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>  {
> -	int err = 0;
> -
> -	if (!(flags & PIDFD_THREAD)) {
> +	scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> +		/*
> +		 * If this wasn't a thread-group leader struct pid or
> +		 * the task already been reaped report ESRCH to
> +		 * userspace.
> +		 */
> +		if (!pid_has_task(pid, PIDTYPE_PID))
> +			return -ESRCH;

The "If this wasn't a thread-group leader struct pid" part of the
comment looks a bit confusing to me, as if pid_has_task(PIDTYPE_PID)
should return false in this case.

OTOH, perhaps it makes sense to explain scoped_guard(wait_pidfd.lock)?
Something like "see unhash_process -> wake_up_all(), detach_pid(TGID)
isn't possible if pid_has_task(PID) succeeds".

Oleg.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 13:54   ` Oleg Nesterov
@ 2025-04-11 15:14     ` Christian Brauner
  2025-04-11 15:28       ` Oleg Nesterov
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-04-11 15:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On Fri, Apr 11, 2025 at 03:54:45PM +0200, Oleg Nesterov wrote:
> For both patches:
> 
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> a minor nit below...
> 
> On 04/11, Christian Brauner wrote:
> >
> >  int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >  {
> > -	int err = 0;
> > -
> > -	if (!(flags & PIDFD_THREAD)) {
> > +	scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> > +		/*
> > +		 * If this wasn't a thread-group leader struct pid or
> > +		 * the task already been reaped report ESRCH to
> > +		 * userspace.
> > +		 */
> > +		if (!pid_has_task(pid, PIDTYPE_PID))
> > +			return -ESRCH;
> 
> The "If this wasn't a thread-group leader struct pid" part of the
> comment looks a bit confusing to me, as if pid_has_task(PIDTYPE_PID)
> should return false in this case.

Ok.

> 
> OTOH, perhaps it makes sense to explain scoped_guard(wait_pidfd.lock)?
> Something like "see unhash_process -> wake_up_all(), detach_pid(TGID)
> isn't possible if pid_has_task(PID) succeeds".

I'm verbose. I hope you can live with it:

        /*
         * While holding the pidfd waitqueue lock removing the task
         * linkage for the thread-group leader pid (PIDTYPE_TGID) isn't
         * possible. Thus, if there's still task linkage for PIDTYPE_PID
         * not having thread-group leader linkage for the pid means it
         * wasn't a thread-group leader in the first place.
         */

:)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 15:14     ` Christian Brauner
@ 2025-04-11 15:28       ` Oleg Nesterov
  0 siblings, 0 replies; 11+ messages in thread
From: Oleg Nesterov @ 2025-04-11 15:28 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-fsdevel, Jeff Layton, Lennart Poettering, Daan De Meyer,
	Mike Yuan, linux-kernel, Peter Ziljstra

On 04/11, Christian Brauner wrote:
>
> I'm verbose. I hope you can live with it:
>
>         /*
>          * While holding the pidfd waitqueue lock removing the task
>          * linkage for the thread-group leader pid (PIDTYPE_TGID) isn't
>          * possible. Thus, if there's still task linkage for PIDTYPE_PID
>          * not having thread-group leader linkage for the pid means it
>          * wasn't a thread-group leader in the first place.
>          */
>
> :)

LGTM ;)

Oleg.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-11 13:22 [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
  2025-04-11 13:22 ` [PATCH v2 1/2] exit: move wake_up_all() pidfd waiters into __unhash_process() Christian Brauner
  2025-04-11 13:22 ` [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
@ 2025-04-15 22:34 ` Nathan Chancellor
  2025-04-16 13:55   ` Christian Brauner
  2 siblings, 1 reply; 11+ messages in thread
From: Nathan Chancellor @ 2025-04-15 22:34 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan, linux-kernel, Peter Ziljstra

Hi Christian,

On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> In a prior patch series we tried to cleanly differentiate between:
> 
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
> actually references a struct pid that isn't used as a thread-group
> leader.
> 
> as this was causing issues for non-threaded workloads.
> 
> But there's cases where the current simple logic is wrong. Specifically,
> if the pid was a leader pid and the check races with __unhash_process().
> Stabilize this by using the pidfd waitqueue lock.

After the recent work in vfs-6.16.pidfs (I tested at
a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
'machinectl shell' to connect to a systemd-nspawn container on one of my
machines running Fedora 41 (the container is using Rawhide).

  $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
  Failed to get shell PTY: Connection timed out

My initial bisect attempt landed on the merge of the first series
(1e940fff9437), which does not make much sense because 4fc3f73c16d was
allegedly good in my test, but I did not investigate that too hard since
I have lost enough time on this as it is heh. It never reproduces at
6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
would report it here since you mention this series is a fix for the
first one. If there is any other information I can provide or patches I
can test (either as fixes or for debugging), I am more than happy to do
so.

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-15 22:34 ` [PATCH v2 0/2] " Nathan Chancellor
@ 2025-04-16 13:55   ` Christian Brauner
  2025-04-16 19:47     ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-04-16 13:55 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Oleg Nesterov, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan, linux-kernel, Peter Ziljstra

On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> Hi Christian,
> 
> On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > In a prior patch series we tried to cleanly differentiate between:
> > 
> > (1) The task has already been reaped.
> > (2) The caller requested a pidfd for a thread-group leader but the pid
> > actually references a struct pid that isn't used as a thread-group
> > leader.
> > 
> > as this was causing issues for non-threaded workloads.
> > 
> > But there's cases where the current simple logic is wrong. Specifically,
> > if the pid was a leader pid and the check races with __unhash_process().
> > Stabilize this by using the pidfd waitqueue lock.
> 
> After the recent work in vfs-6.16.pidfs (I tested at
> a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> 'machinectl shell' to connect to a systemd-nspawn container on one of my
> machines running Fedora 41 (the container is using Rawhide).
> 
>   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
>   Failed to get shell PTY: Connection timed out
> 
> My initial bisect attempt landed on the merge of the first series
> (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> allegedly good in my test, but I did not investigate that too hard since
> I have lost enough time on this as it is heh. It never reproduces at
> 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> would report it here since you mention this series is a fix for the
> first one. If there is any other information I can provide or patches I
> can test (either as fixes or for debugging), I am more than happy to do
> so.

Does the following patch make a difference for you?:

diff --git a/kernel/fork.c b/kernel/fork.c
index f7403e1fb0d4..dd30f7e09917 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
        scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
                /* Task has already been reaped. */
                if (!pid_has_task(pid, PIDTYPE_PID))
-                       return -ESRCH;
+                       return -EINVAL;
                /*
                 * If this struct pid isn't used as a thread-group
                 * leader but the caller requested to create a

If it did it would be weird if the first merge is indeed marked as good.
What if you used a non-rawhide version of systemd? Because this might
also be a regression on their side.

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-16 13:55   ` Christian Brauner
@ 2025-04-16 19:47     ` Christian Brauner
  2025-04-16 20:21       ` Christian Brauner
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-04-16 19:47 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Oleg Nesterov, linux-fsdevel, Jeff Layton, Lennart Poettering,
	Daan De Meyer, Mike Yuan, linux-kernel, Peter Ziljstra

On Wed, Apr 16, 2025 at 03:55:48PM +0200, Christian Brauner wrote:
> On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> > Hi Christian,
> > 
> > On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > > In a prior patch series we tried to cleanly differentiate between:
> > > 
> > > (1) The task has already been reaped.
> > > (2) The caller requested a pidfd for a thread-group leader but the pid
> > > actually references a struct pid that isn't used as a thread-group
> > > leader.
> > > 
> > > as this was causing issues for non-threaded workloads.
> > > 
> > > But there's cases where the current simple logic is wrong. Specifically,
> > > if the pid was a leader pid and the check races with __unhash_process().
> > > Stabilize this by using the pidfd waitqueue lock.
> > 
> > After the recent work in vfs-6.16.pidfs (I tested at
> > a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> > 'machinectl shell' to connect to a systemd-nspawn container on one of my
> > machines running Fedora 41 (the container is using Rawhide).
> > 
> >   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
> >   Failed to get shell PTY: Connection timed out
> > 
> > My initial bisect attempt landed on the merge of the first series
> > (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> > allegedly good in my test, but I did not investigate that too hard since
> > I have lost enough time on this as it is heh. It never reproduces at
> > 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> > would report it here since you mention this series is a fix for the
> > first one. If there is any other information I can provide or patches I
> > can test (either as fixes or for debugging), I am more than happy to do
> > so.

I can't reproduce this issue at all with vfs-6.16.pidfs unfortunately.

> 
> Does the following patch make a difference for you?:
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f7403e1fb0d4..dd30f7e09917 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
>         scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
>                 /* Task has already been reaped. */
>                 if (!pid_has_task(pid, PIDTYPE_PID))
> -                       return -ESRCH;
> +                       return -EINVAL;
>                 /*
>                  * If this struct pid isn't used as a thread-group
>                  * leader but the caller requested to create a
> 
> If it did it would be weird if the first merge is indeed marked as good.
> What if you used a non-rawhide version of systemd? Because this might
> also be a regression on their side.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-16 19:47     ` Christian Brauner
@ 2025-04-16 20:21       ` Christian Brauner
  2025-04-16 21:15         ` Nathan Chancellor
  0 siblings, 1 reply; 11+ messages in thread
From: Christian Brauner @ 2025-04-16 20:21 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: David Rheinsberg, Oleg Nesterov, linux-fsdevel, Jeff Layton,
	Lennart Poettering, Daan De Meyer, Mike Yuan, linux-kernel,
	Peter Ziljstra

[-- Attachment #1: Type: text/plain, Size: 3472 bytes --]

On Wed, Apr 16, 2025 at 09:47:34PM +0200, Christian Brauner wrote:
> On Wed, Apr 16, 2025 at 03:55:48PM +0200, Christian Brauner wrote:
> > On Tue, Apr 15, 2025 at 03:34:54PM -0700, Nathan Chancellor wrote:
> > > Hi Christian,
> > > 
> > > On Fri, Apr 11, 2025 at 03:22:43PM +0200, Christian Brauner wrote:
> > > > In a prior patch series we tried to cleanly differentiate between:
> > > > 
> > > > (1) The task has already been reaped.
> > > > (2) The caller requested a pidfd for a thread-group leader but the pid
> > > > actually references a struct pid that isn't used as a thread-group
> > > > leader.
> > > > 
> > > > as this was causing issues for non-threaded workloads.
> > > > 
> > > > But there's cases where the current simple logic is wrong. Specifically,
> > > > if the pid was a leader pid and the check races with __unhash_process().
> > > > Stabilize this by using the pidfd waitqueue lock.
> > > 
> > > After the recent work in vfs-6.16.pidfs (I tested at
> > > a9d7de0f68b79e5e481967fc605698915a37ac13), I am seeing issues with using
> > > 'machinectl shell' to connect to a systemd-nspawn container on one of my
> > > machines running Fedora 41 (the container is using Rawhide).
> > > 
> > >   $ machinectl shell -q nathan@$DEV_IMG $SHELL -l
> > >   Failed to get shell PTY: Connection timed out
> > > 
> > > My initial bisect attempt landed on the merge of the first series
> > > (1e940fff9437), which does not make much sense because 4fc3f73c16d was
> > > allegedly good in my test, but I did not investigate that too hard since
> > > I have lost enough time on this as it is heh. It never reproduces at
> > > 6.15-rc1 and it consistently reproduces at a9d7de0f68b so I figured I
> > > would report it here since you mention this series is a fix for the
> > > first one. If there is any other information I can provide or patches I
> > > can test (either as fixes or for debugging), I am more than happy to do
> > > so.
> 
> I can't reproduce this issue at all with vfs-6.16.pidfs unfortunately.
> 
> > 
> > Does the following patch make a difference for you?:
> > 
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index f7403e1fb0d4..dd30f7e09917 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -2118,7 +2118,7 @@ int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> >         scoped_guard(spinlock_irq, &pid->wait_pidfd.lock) {
> >                 /* Task has already been reaped. */
> >                 if (!pid_has_task(pid, PIDTYPE_PID))
> > -                       return -ESRCH;
> > +                       return -EINVAL;
> >                 /*
> >                  * If this struct pid isn't used as a thread-group
> >                  * leader but the caller requested to create a
> > 
> > If it did it would be weird if the first merge is indeed marked as good.
> > What if you used a non-rawhide version of systemd? Because this might
> > also be a regression on their side.

Ok, I think I understand how this happens. dbus-broker assumes that
only EINVAL is reported by SO_PEERPIDFD when a process is reaped:

https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241

It would be great if it would also allow for ESRCH which is the correct
error code anyway. So hopefully we'll get that fixed in userspace. For
now we paper over this in the kernel in the SO_PEERPIDFD code.

Can you please test vfs-6.16.pidfs again. It has the patch I'm
appending.

[-- Attachment #2: 0001-net-pidfd-report-EINVAL-for-ESRCH.patch --]
[-- Type: text/x-diff, Size: 1360 bytes --]

From 3f662e72d51caea74370e07a3d5bad66f020423d Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Wed, 16 Apr 2025 22:05:40 +0200
Subject: [PATCH] net, pidfd: report EINVAL for ESRCH

dbus-broker relies -EINVAL being returned to indicate ESRCH in [1].
This causes issues for some workloads as reported in [2].
Paper over it until this is fixed in userspace.

Link: https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241 [1]
Link: https://lore.kernel.org/20250415223454.GA1852104@ax162 [2]
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
 net/core/sock.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index f67a3c5b0988..ed8e7fd36284 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1893,8 +1893,16 @@ int sk_getsockopt(struct sock *sk, int level, int optname,
 
 		pidfd = pidfd_prepare(peer_pid, 0, &pidfd_file);
 		put_pid(peer_pid);
-		if (pidfd < 0)
+		if (pidfd < 0) {
+			/*
+			 * dbus-broker relies -EINVAL being returned to
+			 * indicate ESRCH. Paper over it until this is
+			 * fixed in userspace.
+			 */
+			if (pidfd == -ESRCH)
+				pidfd = -EINVAL;
 			return pidfd;
+		}
 
 		if (copy_to_sockptr(optval, &pidfd, len) ||
 		    copy_to_sockptr(optlen, &len, sizeof(int))) {
-- 
2.47.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting
  2025-04-16 20:21       ` Christian Brauner
@ 2025-04-16 21:15         ` Nathan Chancellor
  0 siblings, 0 replies; 11+ messages in thread
From: Nathan Chancellor @ 2025-04-16 21:15 UTC (permalink / raw)
  To: Christian Brauner
  Cc: David Rheinsberg, Oleg Nesterov, linux-fsdevel, Jeff Layton,
	Lennart Poettering, Daan De Meyer, Mike Yuan, linux-kernel,
	Peter Ziljstra

On Wed, Apr 16, 2025 at 10:21:02PM +0200, Christian Brauner wrote:
> Ok, I think I understand how this happens. dbus-broker assumes that
> only EINVAL is reported by SO_PEERPIDFD when a process is reaped:
> 
> https://github.com/bus1/dbus-broker/blob/5d34d91b138fc802a016aa68c093eb81ea31139c/src/util/sockopt.c#L241
> 
> It would be great if it would also allow for ESRCH which is the correct
> error code anyway. So hopefully we'll get that fixed in userspace. For
> now we paper over this in the kernel in the SO_PEERPIDFD code.
> 
> Can you please test vfs-6.16.pidfs again. It has the patch I'm
> appending.

Yes, that appears to work fine for me, thanks for the quick fix!

If it would be helpful:

Tested-by: Nathan Chancellor <nathan@kernel.org>

Cheers,
Nathan

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-04-16 21:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-11 13:22 [PATCH v2 0/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
2025-04-11 13:22 ` [PATCH v2 1/2] exit: move wake_up_all() pidfd waiters into __unhash_process() Christian Brauner
2025-04-11 13:22 ` [PATCH v2 2/2] pidfs: ensure consistent ENOENT/ESRCH reporting Christian Brauner
2025-04-11 13:54   ` Oleg Nesterov
2025-04-11 15:14     ` Christian Brauner
2025-04-11 15:28       ` Oleg Nesterov
2025-04-15 22:34 ` [PATCH v2 0/2] " Nathan Chancellor
2025-04-16 13:55   ` Christian Brauner
2025-04-16 19:47     ` Christian Brauner
2025-04-16 20:21       ` Christian Brauner
2025-04-16 21:15         ` Nathan Chancellor

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).