* [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
@ 2025-09-13 22:28 ` Demi Marie Obenour via B4 Relay
0 siblings, 0 replies; 13+ messages in thread
From: Demi Marie Obenour @ 2025-09-13 22:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Demi Marie Obenour
If a process calls prctl(PR_SET_PDEATHSIG) at the same time that the
parent process exits, the child will write to me->pdeath_sig at the same
time the parent is reading it. Since there is no synchronization, this
is a data race.
Worse, it is possible that a subsequent call to getppid() can continue
to return the previous parent process ID without the parent death signal
being delivered. This happens in the following scenario:
parent child
forget_original_parent() prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid();
RCU_INIT_POINTER(t->real_parent, reaper);
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
And in the following:
parent child
forget_original_parent()
RCU_INIT_POINTER(t->real_parent, reaper);
/* also no barrier */
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid(); /* reads old ppid() */
As a result, the following pattern is racy:
pid_t parent_pid = getpid();
pid_t child_pid = fork();
if (child_pid == -1) {
/* handle error... */
return;
}
if (child_pid == 0) {
if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0) {
/* handle error */
_exit(126);
}
if (getppid() != parent_pid) {
/* parent died already */
raise(SIGKILL);
}
/* keep going in child */
}
/* keep going in parent */
If the parent is killed at exactly the wrong time, the child process can
(wrongly) stay running.
I didn't manage to reproduce this in my testing, but I'm pretty sure the
race is real. KCSAN is probably the best way to spot the race.
Fix the bug by holding tasklist_lock for reading whenever pdeath_signal
is being written to. This prevents races on me->pdeath_sig, and the
locking and unlocking of the rwlock provide the needed memory barriers.
If prctl(PR_SET_PDEATHSIG) happens before the parent exits, the signal
will be sent. If it happens afterwards, a subsequent getppid() will
return the new value.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
I believe this is not the only code missing locking, but I'm
not familiar enough with the rest of the kernel to know if
read_lock(&tasklist_lock) is safe in the other places (AppArmor,
execve(), SELinux, commit_creds()) that change it.
checkpatch complains about overly long lines in the commit message. I
don't see a way to wrap them without making the description of the race
harder to read.
Only compile-tested, but this looks obvious.
---
kernel/sys.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce206d7d0ed27e8a4fce8b616c3565..5e0e0bdca386492dace6341e3ce8083d7aa732cb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2470,7 +2470,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = -EINVAL;
break;
}
+ /*
+ * Ensure that either:
+ *
+ * 1. Subsequent getppid() calls reflect the parent process having died.
+ * 2. forget_original_parent() will send the new me->pdeath_signal.
+ *
+ * Also prevent the read of me->pdeath_signal from being a data race.
+ */
+ read_lock(&tasklist_lock);
me->pdeath_signal = arg2;
+ read_unlock(&tasklist_lock);
break;
case PR_GET_PDEATHSIG:
error = put_user(me->pdeath_signal, (int __user *)arg2);
---
base-commit: 5cd64d4f92683afa691a6b83dcad5adfb2165ed0
change-id: 20250913-fix-prctl-pdeathsig-race-fed53c2a5851
Best regards,
--
Demi Marie Obenour <demiobenour@gmail.com>
^ permalink raw reply related [flat|nested] 13+ messages in thread* [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
@ 2025-09-13 22:28 ` Demi Marie Obenour via B4 Relay
0 siblings, 0 replies; 13+ messages in thread
From: Demi Marie Obenour via B4 Relay @ 2025-09-13 22:28 UTC (permalink / raw)
To: linux-kernel; +Cc: Demi Marie Obenour
From: Demi Marie Obenour <demiobenour@gmail.com>
If a process calls prctl(PR_SET_PDEATHSIG) at the same time that the
parent process exits, the child will write to me->pdeath_sig at the same
time the parent is reading it. Since there is no synchronization, this
is a data race.
Worse, it is possible that a subsequent call to getppid() can continue
to return the previous parent process ID without the parent death signal
being delivered. This happens in the following scenario:
parent child
forget_original_parent() prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid();
RCU_INIT_POINTER(t->real_parent, reaper);
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
And in the following:
parent child
forget_original_parent()
RCU_INIT_POINTER(t->real_parent, reaper);
/* also no barrier */
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid(); /* reads old ppid() */
As a result, the following pattern is racy:
pid_t parent_pid = getpid();
pid_t child_pid = fork();
if (child_pid == -1) {
/* handle error... */
return;
}
if (child_pid == 0) {
if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0) {
/* handle error */
_exit(126);
}
if (getppid() != parent_pid) {
/* parent died already */
raise(SIGKILL);
}
/* keep going in child */
}
/* keep going in parent */
If the parent is killed at exactly the wrong time, the child process can
(wrongly) stay running.
I didn't manage to reproduce this in my testing, but I'm pretty sure the
race is real. KCSAN is probably the best way to spot the race.
Fix the bug by holding tasklist_lock for reading whenever pdeath_signal
is being written to. This prevents races on me->pdeath_sig, and the
locking and unlocking of the rwlock provide the needed memory barriers.
If prctl(PR_SET_PDEATHSIG) happens before the parent exits, the signal
will be sent. If it happens afterwards, a subsequent getppid() will
return the new value.
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
---
I believe this is not the only code missing locking, but I'm
not familiar enough with the rest of the kernel to know if
read_lock(&tasklist_lock) is safe in the other places (AppArmor,
execve(), SELinux, commit_creds()) that change it.
checkpatch complains about overly long lines in the commit message. I
don't see a way to wrap them without making the description of the race
harder to read.
Only compile-tested, but this looks obvious.
---
kernel/sys.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/kernel/sys.c b/kernel/sys.c
index 1e28b40053ce206d7d0ed27e8a4fce8b616c3565..5e0e0bdca386492dace6341e3ce8083d7aa732cb 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2470,7 +2470,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
error = -EINVAL;
break;
}
+ /*
+ * Ensure that either:
+ *
+ * 1. Subsequent getppid() calls reflect the parent process having died.
+ * 2. forget_original_parent() will send the new me->pdeath_signal.
+ *
+ * Also prevent the read of me->pdeath_signal from being a data race.
+ */
+ read_lock(&tasklist_lock);
me->pdeath_signal = arg2;
+ read_unlock(&tasklist_lock);
break;
case PR_GET_PDEATHSIG:
error = put_user(me->pdeath_signal, (int __user *)arg2);
---
base-commit: 5cd64d4f92683afa691a6b83dcad5adfb2165ed0
change-id: 20250913-fix-prctl-pdeathsig-race-fed53c2a5851
Best regards,
--
Demi Marie Obenour <demiobenour@gmail.com>
^ permalink raw reply related [flat|nested] 13+ messages in thread* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-13 22:28 ` Demi Marie Obenour via B4 Relay
(?)
@ 2025-09-20 4:10 ` Demi Marie Obenour
2025-09-22 22:48 ` Andrew Morton
-1 siblings, 1 reply; 13+ messages in thread
From: Demi Marie Obenour @ 2025-09-20 4:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux kernel mailing list
[-- Attachment #1.1.1: Type: text/plain, Size: 105 bytes --]
CCing Andrew Morton as the maintainer of last resort.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-20 4:10 ` Demi Marie Obenour
@ 2025-09-22 22:48 ` Andrew Morton
2025-09-23 12:03 ` Oleg Nesterov
0 siblings, 1 reply; 13+ messages in thread
From: Andrew Morton @ 2025-09-22 22:48 UTC (permalink / raw)
To: Demi Marie Obenour; +Cc: Linux kernel mailing list, Oleg Nesterov
On Sat, 20 Sep 2025 00:10:07 -0400 Demi Marie Obenour <demiobenour@gmail.com> wrote:
> CCing Andrew Morton as the maintainer of last resort.
> --
Thanks.
This code has no maintainer, as far as I know. So the trick is to cc a
smart person who understands signals. Hi Oleg ;)
From: Demi Marie Obenour <demiobenour@gmail.com>
Subject: kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
Date: Sat, 13 Sep 2025 18:28:49 -0400
If a process calls prctl(PR_SET_PDEATHSIG) at the same time that the
parent process exits, the child will write to me->pdeath_sig at the same
time the parent is reading it. Since there is no synchronization, this is
a data race.
Worse, it is possible that a subsequent call to getppid() can continue to
return the previous parent process ID without the parent death signal
being delivered. This happens in the following scenario:
parent child
forget_original_parent() prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid();
RCU_INIT_POINTER(t->real_parent, reaper);
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
And in the following:
parent child
forget_original_parent()
RCU_INIT_POINTER(t->real_parent, reaper);
/* also no barrier */
if (t->pdeath_signal) /* reads stale me->pdeath_sig */
group_send_sig_info(t->pdeath_signal, ...);
prctl(PR_SET_PDEATHSIG, SIGKILL)
sys_prctl()
me->pdeath_sig = SIGKILL;
getppid(); /* reads old ppid() */
As a result, the following pattern is racy:
pid_t parent_pid = getpid();
pid_t child_pid = fork();
if (child_pid == -1) {
/* handle error... */
return;
}
if (child_pid == 0) {
if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0) {
/* handle error */
_exit(126);
}
if (getppid() != parent_pid) {
/* parent died already */
raise(SIGKILL);
}
/* keep going in child */
}
/* keep going in parent */
If the parent is killed at exactly the wrong time, the child process can
(wrongly) stay running.
I didn't manage to reproduce this in my testing, but I'm pretty sure the
race is real. KCSAN is probably the best way to spot the race.
Fix the bug by holding tasklist_lock for reading whenever pdeath_signal is
being written to. This prevents races on me->pdeath_sig, and the locking
and unlocking of the rwlock provide the needed memory barriers. If
prctl(PR_SET_PDEATHSIG) happens before the parent exits, the signal will
be sent. If it happens afterwards, a subsequent getppid() will return the
new value.
Link: https://lkml.kernel.org/r/20250913-fix-prctl-pdeathsig-race-v1-1-44e2eb426fe9@gmail.com
Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
kernel/sys.c | 10 ++++++++++
1 file changed, 10 insertions(+)
--- a/kernel/sys.c~kernel-prevent-prctlpr_set_pdeathsig-from-racing-with-parent-process-exit
+++ a/kernel/sys.c
@@ -2533,7 +2533,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
error = -EINVAL;
break;
}
+ /*
+ * Ensure that either:
+ *
+ * 1. Subsequent getppid() calls reflect the parent process having died.
+ * 2. forget_original_parent() will send the new me->pdeath_signal.
+ *
+ * Also prevent the read of me->pdeath_signal from being a data race.
+ */
+ read_lock(&tasklist_lock);
me->pdeath_signal = arg2;
+ read_unlock(&tasklist_lock);
break;
case PR_GET_PDEATHSIG:
error = put_user(me->pdeath_signal, (int __user *)arg2);
_
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-22 22:48 ` Andrew Morton
@ 2025-09-23 12:03 ` Oleg Nesterov
2025-09-23 13:39 ` Mateusz Guzik
0 siblings, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2025-09-23 12:03 UTC (permalink / raw)
To: Andrew Morton, Demi Marie Obenour, Mateusz Guzik,
Christian Brauner
Cc: Linux kernel mailing list
Hi Demi,
(add more CC's)
I think your patch is correct in that it fixes the problems described in
the changelog. Thanks for the detailed explanations.
But I obviously dislike the fact it adds read_lock(tasklist_lock) into
prctl(PR_SET_PDEATHSIG) ;)
For the moment lets forget about the data race. Of course you right, but
this is simple, we can add READ/WRITE_ONCE(pdeath_signal).
Lets only discuss the 2nd problem: the exiting parent can miss
child->pdeath_signal != 0 while the child can still see the old getppid().
I honestly do not know if we really care, I always thought that this API
is ancient/obsolete and "strange". If nothing else, pdeath_signal is sent
even in the case of a threaded reparent...
But OK, lets suppose we need to fix this problem. Not that it matters, but
I guess you didn't hit it in practice?
As you correctly pointed out, forget_original_parent/prctl lack the necessary
barries. So lets add the barriers instead of abusing tasklist? As for sys_prctl(),
I think that ret-to-user-mode + enter-the-kernel-mode should act as a full
barrier, so it only needs WRITE_ONCE()...
Or perhaps user-space can do something else to sync with the exiting parent
instead of using getppid() ?
Oleg.
On 09/22, Andrew Morton wrote:
>
> From: Demi Marie Obenour <demiobenour@gmail.com>
> Subject: kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
> Date: Sat, 13 Sep 2025 18:28:49 -0400
>
> If a process calls prctl(PR_SET_PDEATHSIG) at the same time that the
> parent process exits, the child will write to me->pdeath_sig at the same
> time the parent is reading it. Since there is no synchronization, this is
> a data race.
>
> Worse, it is possible that a subsequent call to getppid() can continue to
> return the previous parent process ID without the parent death signal
> being delivered. This happens in the following scenario:
>
> parent child
>
> forget_original_parent() prctl(PR_SET_PDEATHSIG, SIGKILL)
> sys_prctl()
> me->pdeath_sig = SIGKILL;
> getppid();
> RCU_INIT_POINTER(t->real_parent, reaper);
> if (t->pdeath_signal) /* reads stale me->pdeath_sig */
> group_send_sig_info(t->pdeath_signal, ...);
>
> And in the following:
>
> parent child
>
> forget_original_parent()
> RCU_INIT_POINTER(t->real_parent, reaper);
> /* also no barrier */
> if (t->pdeath_signal) /* reads stale me->pdeath_sig */
> group_send_sig_info(t->pdeath_signal, ...);
>
> prctl(PR_SET_PDEATHSIG, SIGKILL)
> sys_prctl()
> me->pdeath_sig = SIGKILL;
> getppid(); /* reads old ppid() */
>
> As a result, the following pattern is racy:
>
> pid_t parent_pid = getpid();
> pid_t child_pid = fork();
> if (child_pid == -1) {
> /* handle error... */
> return;
> }
> if (child_pid == 0) {
> if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0) {
> /* handle error */
> _exit(126);
> }
> if (getppid() != parent_pid) {
> /* parent died already */
> raise(SIGKILL);
> }
> /* keep going in child */
> }
> /* keep going in parent */
>
> If the parent is killed at exactly the wrong time, the child process can
> (wrongly) stay running.
>
> I didn't manage to reproduce this in my testing, but I'm pretty sure the
> race is real. KCSAN is probably the best way to spot the race.
>
> Fix the bug by holding tasklist_lock for reading whenever pdeath_signal is
> being written to. This prevents races on me->pdeath_sig, and the locking
> and unlocking of the rwlock provide the needed memory barriers. If
> prctl(PR_SET_PDEATHSIG) happens before the parent exits, the signal will
> be sent. If it happens afterwards, a subsequent getppid() will return the
> new value.
>
> Link: https://lkml.kernel.org/r/20250913-fix-prctl-pdeathsig-race-v1-1-44e2eb426fe9@gmail.com
> Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> kernel/sys.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> --- a/kernel/sys.c~kernel-prevent-prctlpr_set_pdeathsig-from-racing-with-parent-process-exit
> +++ a/kernel/sys.c
> @@ -2533,7 +2533,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> error = -EINVAL;
> break;
> }
> + /*
> + * Ensure that either:
> + *
> + * 1. Subsequent getppid() calls reflect the parent process having died.
> + * 2. forget_original_parent() will send the new me->pdeath_signal.
> + *
> + * Also prevent the read of me->pdeath_signal from being a data race.
> + */
> + read_lock(&tasklist_lock);
> me->pdeath_signal = arg2;
> + read_unlock(&tasklist_lock);
> break;
> case PR_GET_PDEATHSIG:
> error = put_user(me->pdeath_signal, (int __user *)arg2);
> _
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-23 12:03 ` Oleg Nesterov
@ 2025-09-23 13:39 ` Mateusz Guzik
2025-09-25 16:28 ` Oleg Nesterov
2025-10-07 12:53 ` Christian Brauner
0 siblings, 2 replies; 13+ messages in thread
From: Mateusz Guzik @ 2025-09-23 13:39 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Demi Marie Obenour, Christian Brauner,
Linux kernel mailing list
On Tue, Sep 23, 2025 at 2:05 PM Oleg Nesterov <oleg@redhat.com> wrote:
> As you correctly pointed out, forget_original_parent/prctl lack the necessary
> barries. So lets add the barriers instead of abusing tasklist? As for sys_prctl(),
> I think that ret-to-user-mode + enter-the-kernel-mode should act as a full
> barrier, so it only needs WRITE_ONCE()...
>
So I looked over this and I think I see why you are not eager to fix
the problem to begin with. ;)
I agree with reluctance to take tasklist lock to handle
PR_SET_PDEATHSIG, but I wonder if in practice this is used rarely
enough that the lock trip would not be a problem? It avoids any
modifications to the exit codepath.
By barriers I presume you meant smp_mb() between
RCU_INIT_POINTER(t->real_parent, reaper) and
READ_ONCE(t->pdeath_signal) in forget_original_parent. That's very
nasty as the full fence is quite expensive. This could be done with
just one fence for the entire call by iterating the list twice, but
that's still preferably avoided.
> Or perhaps user-space can do something else to sync with the exiting parent
> instead of using getppid() ?
>
I never put any thought concerning this mechanism, I do think it
nicely showcases the prctl at hand is kind of crap. The non-crap
version would pass the PID you think your parent is, so that you do
this race-free. I don't know if makes any sense to add this.
I'm wondering if the fact that tasklist is write-locked in that code
path could be utilized to synchronize this in a matter other than
taking it.
pseudo-code wise, something like this:
WRITE_ONCE(me->pdeath_signal, arg2);
/* publish the above store and load the lock after */
smb_mb();
/* here spin waiting until tasklist_lock is not write-locked */
smb_rmb();
Unless I'm missing something this should provide the guarantee you see
the updated parent, if any.
I don't see a routine to do it though and knowing memory barriers
there might be some bullshit hiding there making this not work, so not
my first choice unless someone with more memory barrier clue can chime
in.
>
> On 09/22, Andrew Morton wrote:
> >
> > From: Demi Marie Obenour <demiobenour@gmail.com>
> > Subject: kernel: prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
> > Date: Sat, 13 Sep 2025 18:28:49 -0400
> >
> > If a process calls prctl(PR_SET_PDEATHSIG) at the same time that the
> > parent process exits, the child will write to me->pdeath_sig at the same
> > time the parent is reading it. Since there is no synchronization, this is
> > a data race.
> >
> > Worse, it is possible that a subsequent call to getppid() can continue to
> > return the previous parent process ID without the parent death signal
> > being delivered. This happens in the following scenario:
> >
> > parent child
> >
> > forget_original_parent() prctl(PR_SET_PDEATHSIG, SIGKILL)
> > sys_prctl()
> > me->pdeath_sig = SIGKILL;
> > getppid();
> > RCU_INIT_POINTER(t->real_parent, reaper);
> > if (t->pdeath_signal) /* reads stale me->pdeath_sig */
> > group_send_sig_info(t->pdeath_signal, ...);
> >
> > And in the following:
> >
> > parent child
> >
> > forget_original_parent()
> > RCU_INIT_POINTER(t->real_parent, reaper);
> > /* also no barrier */
> > if (t->pdeath_signal) /* reads stale me->pdeath_sig */
> > group_send_sig_info(t->pdeath_signal, ...);
> >
> > prctl(PR_SET_PDEATHSIG, SIGKILL)
> > sys_prctl()
> > me->pdeath_sig = SIGKILL;
> > getppid(); /* reads old ppid() */
> >
> > As a result, the following pattern is racy:
> >
> > pid_t parent_pid = getpid();
> > pid_t child_pid = fork();
> > if (child_pid == -1) {
> > /* handle error... */
> > return;
> > }
> > if (child_pid == 0) {
> > if (prctl(PR_SET_PDEATHSIG, SIGKILL) != 0) {
> > /* handle error */
> > _exit(126);
> > }
> > if (getppid() != parent_pid) {
> > /* parent died already */
> > raise(SIGKILL);
> > }
> > /* keep going in child */
> > }
> > /* keep going in parent */
> >
> > If the parent is killed at exactly the wrong time, the child process can
> > (wrongly) stay running.
> >
> > I didn't manage to reproduce this in my testing, but I'm pretty sure the
> > race is real. KCSAN is probably the best way to spot the race.
> >
> > Fix the bug by holding tasklist_lock for reading whenever pdeath_signal is
> > being written to. This prevents races on me->pdeath_sig, and the locking
> > and unlocking of the rwlock provide the needed memory barriers. If
> > prctl(PR_SET_PDEATHSIG) happens before the parent exits, the signal will
> > be sent. If it happens afterwards, a subsequent getppid() will return the
> > new value.
> >
> > Link: https://lkml.kernel.org/r/20250913-fix-prctl-pdeathsig-race-v1-1-44e2eb426fe9@gmail.com
> > Signed-off-by: Demi Marie Obenour <demiobenour@gmail.com>
> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >
> > kernel/sys.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > --- a/kernel/sys.c~kernel-prevent-prctlpr_set_pdeathsig-from-racing-with-parent-process-exit
> > +++ a/kernel/sys.c
> > @@ -2533,7 +2533,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsi
> > error = -EINVAL;
> > break;
> > }
> > + /*
> > + * Ensure that either:
> > + *
> > + * 1. Subsequent getppid() calls reflect the parent process having died.
> > + * 2. forget_original_parent() will send the new me->pdeath_signal.
> > + *
> > + * Also prevent the read of me->pdeath_signal from being a data race.
> > + */
> > + read_lock(&tasklist_lock);
> > me->pdeath_signal = arg2;
> > + read_unlock(&tasklist_lock);
> > break;
> > case PR_GET_PDEATHSIG:
> > error = put_user(me->pdeath_signal, (int __user *)arg2);
> > _
> >
>
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-23 13:39 ` Mateusz Guzik
@ 2025-09-25 16:28 ` Oleg Nesterov
2025-09-25 18:35 ` Mateusz Guzik
2025-10-07 12:53 ` Christian Brauner
1 sibling, 1 reply; 13+ messages in thread
From: Oleg Nesterov @ 2025-09-25 16:28 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Andrew Morton, Demi Marie Obenour, Christian Brauner,
Linux kernel mailing list
Sorry for the late reply...
On 09/23, Mateusz Guzik wrote:
>
> On Tue, Sep 23, 2025 at 2:05 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > As you correctly pointed out, forget_original_parent/prctl lack the necessary
> > barries. So lets add the barriers instead of abusing tasklist? As for sys_prctl(),
> > I think that ret-to-user-mode + enter-the-kernel-mode should act as a full
> > barrier, so it only needs WRITE_ONCE()...
> >
>
> So I looked over this and I think I see why you are not eager to fix
> the problem to begin with. ;)
>
> I agree with reluctance to take tasklist lock to handle
> PR_SET_PDEATHSIG, but I wonder if in practice this is used rarely
> enough that the lock trip would not be a problem? It avoids any
> modifications to the exit codepath.
Yes... I mostly dislike the fact that this patch adds another possibility
to easily abuse the global tasklist lock from userspace...
> By barriers I presume you meant smp_mb() between
> RCU_INIT_POINTER(t->real_parent, reaper) and
> READ_ONCE(t->pdeath_signal) in forget_original_parent.
Yes,
> That's very
> nasty as the full fence is quite expensive.
Well, the exit_notify() path is already heavy, not sure smp_mb() or
smp_store_mb(real_parent, reaper) can add a noticeable difference.
> > Or perhaps user-space can do something else to sync with the exiting parent
> > instead of using getppid() ?
> >
>
> I never put any thought concerning this mechanism, I do think it
> nicely showcases the prctl at hand is kind of crap. The non-crap
> version would pass the PID you think your parent is, so that you do
> this race-free.
Or PR_SET_PDEATHSIG_FOR_CHILDREN(pdeath_signal), or the new
CLONE_WITH_PDEATHSIG. Or something else, I agree that the current API is,
well, not perfect ;)
> I don't know if makes any sense to add this.
Neither me.
OK. I won't argue with this patch. At least the usage of tasklist_lock is well
documented.
Thanks!
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-25 16:28 ` Oleg Nesterov
@ 2025-09-25 18:35 ` Mateusz Guzik
2025-09-25 18:50 ` Oleg Nesterov
2025-09-26 23:58 ` Demi Marie Obenour
0 siblings, 2 replies; 13+ messages in thread
From: Mateusz Guzik @ 2025-09-25 18:35 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Andrew Morton, Demi Marie Obenour, Christian Brauner,
Linux kernel mailing list
It struck me that this mail thread is perhaps a little rude towards
Demi, so I would like to state the reported race is legitimate and if
it was reported against come core functionality it would count as
"good spotting". It just so happens this is a corner case to something
not-that-imporant and the proposed fix is rather heavy-weight (despite
being perfectly sensible), so there is quite a bit of reluctance.
With that out of the way...
On Thu, Sep 25, 2025 at 6:29 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > That's very
> > nasty as the full fence is quite expensive.
>
> Well, the exit_notify() path is already heavy, not sure smp_mb() or
> smp_store_mb(real_parent, reaper) can add a noticeable difference.
>
Well the tasklist consumers already suffer a lot of avoidable
overhead, but I'm going to save the spiel about it. Maybe instead I
will post a patch to remove some. ;)
I realized I never checked how often processes are exiting while still
having children -- for legitimate workloads this is probably not that
common either, so the fence would not even show up in typical usage?
This could be answered with bpftrace over a bunch of workloads if
someone cares to investigate.
> > I don't know if makes any sense to add this.
>
> Neither me.
>
> OK. I won't argue with this patch. At least the usage of tasklist_lock is well
> documented.
>
ye.. avoiding smp_mb may be a case of "premature optimization", except
it is also simpler, so that's a really tough call. good news is that
it's not mine to make ;-)
I guess if the lock acquire goes in the sky is not going to fall,
worst case this can get revisited later. So fwiw I would be leaning
towards accepting the patch as well for the time being.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-25 18:35 ` Mateusz Guzik
@ 2025-09-25 18:50 ` Oleg Nesterov
2025-09-26 23:58 ` Demi Marie Obenour
1 sibling, 0 replies; 13+ messages in thread
From: Oleg Nesterov @ 2025-09-25 18:50 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Andrew Morton, Demi Marie Obenour, Christian Brauner,
Linux kernel mailing list
On 09/25, Mateusz Guzik wrote:
>
> It struck me that this mail thread is perhaps a little rude towards
> Demi, so I would like to state the reported race is legitimate and if
> it was reported against come core functionality it would count as
> "good spotting".
Agreed, agreed. Sorry if my emails were not clear in this respect.
Although let me repeat, I didn't know that people still rely on this API,
but this is is almost offtopic.
> > OK. I won't argue with this patch. At least the usage of tasklist_lock is well
> > documented.
> >
>
> ye.. avoiding smp_mb may be a case of "premature optimization", except
> it is also simpler, so that's a really tough call. good news is that
> it's not mine to make ;-)
>
> I guess if the lock acquire goes in the sky is not going to fall,
> worst case this can get revisited later. So fwiw I would be leaning
> towards accepting the patch as well for the time being.
Again, agreed, and sorry if it wasn't clear.
Thanks Demi.
Oleg.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-25 18:35 ` Mateusz Guzik
2025-09-25 18:50 ` Oleg Nesterov
@ 2025-09-26 23:58 ` Demi Marie Obenour
2025-09-27 1:51 ` Mateusz Guzik
1 sibling, 1 reply; 13+ messages in thread
From: Demi Marie Obenour @ 2025-09-26 23:58 UTC (permalink / raw)
To: Mateusz Guzik, Oleg Nesterov
Cc: Andrew Morton, Christian Brauner, Linux kernel mailing list
[-- Attachment #1.1.1: Type: text/plain, Size: 2733 bytes --]
On 9/25/25 14:35, Mateusz Guzik wrote:
> It struck me that this mail thread is perhaps a little rude towards
> Demi, so I would like to state the reported race is legitimate and if
> it was reported against come core functionality it would count as
> "good spotting". It just so happens this is a corner case to something
> not-that-imporant and the proposed fix is rather heavy-weight (despite
> being perfectly sensible), so there is quite a bit of reluctance.
That makes sense. Thank you!
My personal thought is that prctl(PR_SET_DEATHSIG) is rather rare,
and also the lock is not held very long. In particular, exit
already takes tasklist_lock for writing, and that is much more
common. Therefore, I would be shocked if this added any significant
contention outside of contrived benchmarks.
What I _am_ concerned about is potential starvation,
especially on PREEMPT_RT. Per the documentation:
> - Because an rwlock_t writer cannot grant its priority to multiple
> readers, a preempted low-priority reader will continue holding its lock,
> thus starving even high-priority writers.
This allows any user to hammer tasklist_lock
at will. Is that going to be a problem?
> With that out of the way...
>
> On Thu, Sep 25, 2025 at 6:29 PM Oleg Nesterov <oleg@redhat.com> wrote:
>>> That's very
>>> nasty as the full fence is quite expensive.
>>
>> Well, the exit_notify() path is already heavy, not sure smp_mb() or
>> smp_store_mb(real_parent, reaper) can add a noticeable difference.
>>
>
> Well the tasklist consumers already suffer a lot of avoidable
> overhead, but I'm going to save the spiel about it. Maybe instead I
> will post a patch to remove some. ;)
>
> I realized I never checked how often processes are exiting while still
> having children -- for legitimate workloads this is probably not that
> common either, so the fence would not even show up in typical usage?
>
> This could be answered with bpftrace over a bunch of workloads if
> someone cares to investigate.
Starvation on PREEMPT_RT is my only concern. See above.
>>> I don't know if makes any sense to add this.
>>
>> Neither me.
>>
>> OK. I won't argue with this patch. At least the usage of tasklist_lock is well
>> documented.
>>
>
> ye.. avoiding smp_mb may be a case of "premature optimization", except
> it is also simpler, so that's a really tough call. good news is that
> it's not mine to make ;-)
>
> I guess if the lock acquire goes in the sky is not going to fall,
> worst case this can get revisited later. So fwiw I would be leaning
> towards accepting the patch as well for the time being.
Thank you!
--
Sincerely,
Demi Marie Obenour (she/her/hers)
[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-26 23:58 ` Demi Marie Obenour
@ 2025-09-27 1:51 ` Mateusz Guzik
0 siblings, 0 replies; 13+ messages in thread
From: Mateusz Guzik @ 2025-09-27 1:51 UTC (permalink / raw)
To: Demi Marie Obenour
Cc: Oleg Nesterov, Andrew Morton, Christian Brauner,
Linux kernel mailing list
On Sat, Sep 27, 2025 at 1:58 AM Demi Marie Obenour
<demiobenour@gmail.com> wrote:
> My personal thought is that prctl(PR_SET_DEATHSIG) is rather rare,
> and also the lock is not held very long. In particular, exit
> already takes tasklist_lock for writing, and that is much more
> common. Therefore, I would be shocked if this added any significant
> contention outside of contrived benchmarks.
>
This is my suspicion as well, yes.
> What I _am_ concerned about is potential starvation,
> especially on PREEMPT_RT. Per the documentation:
>
> > - Because an rwlock_t writer cannot grant its priority to multiple
> > readers, a preempted low-priority reader will continue holding its lock,
> > thus starving even high-priority writers.
>
> This allows any user to hammer tasklist_lock
> at will. Is that going to be a problem?
>
I can't speak for Oleg, but he also expressed concern over free access
to readlock tasklist, presumably including in this context.
My take is that any thread can already freely abuse the lock in this
manner by spamming waitid or kill (and probably in some other ways).
The existing spots are not trivial to fix. Should someone do it, your
patch is comparatively a minor addition effort-wise.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-23 13:39 ` Mateusz Guzik
2025-09-25 16:28 ` Oleg Nesterov
@ 2025-10-07 12:53 ` Christian Brauner
1 sibling, 0 replies; 13+ messages in thread
From: Christian Brauner @ 2025-10-07 12:53 UTC (permalink / raw)
To: Mateusz Guzik
Cc: Oleg Nesterov, Andrew Morton, Demi Marie Obenour,
Linux kernel mailing list
On Tue, Sep 23, 2025 at 03:39:06PM +0200, Mateusz Guzik wrote:
> On Tue, Sep 23, 2025 at 2:05 PM Oleg Nesterov <oleg@redhat.com> wrote:
> > As you correctly pointed out, forget_original_parent/prctl lack the necessary
> > barries. So lets add the barriers instead of abusing tasklist? As for sys_prctl(),
> > I think that ret-to-user-mode + enter-the-kernel-mode should act as a full
> > barrier, so it only needs WRITE_ONCE()...
> >
>
> So I looked over this and I think I see why you are not eager to fix
> the problem to begin with. ;)
>
> I agree with reluctance to take tasklist lock to handle
> PR_SET_PDEATHSIG, but I wonder if in practice this is used rarely
> enough that the lock trip would not be a problem? It avoids any
> modifications to the exit codepath.
Unprivileged userspace can now just call that prctl() in a loop to
hammer on tasklist_lock. Already possible with the subreaper stuff but I
think we don't need to add more. At least for the subreaper stuff it has
an actual meaningful use and not just plugs a theoretical problem.
>
> By barriers I presume you meant smp_mb() between
> RCU_INIT_POINTER(t->real_parent, reaper) and
> READ_ONCE(t->pdeath_signal) in forget_original_parent. That's very
> nasty as the full fence is quite expensive. This could be done with
> just one fence for the entire call by iterating the list twice, but
> that's still preferably avoided.
>
> > Or perhaps user-space can do something else to sync with the exiting parent
> > instead of using getppid() ?
> >
>
> I never put any thought concerning this mechanism, I do think it
> nicely showcases the prctl at hand is kind of crap. The non-crap
> version would pass the PID you think your parent is, so that you do
> this race-free. I don't know if makes any sense to add this.
I see that stuff already made it into the tree. I strongly dislike
adding more tasklist_lock usage. All for a theoretical race with limited
impact.
If I'd seen this sooner I would have NAKed this out of existence.
>
> I'm wondering if the fact that tasklist is write-locked in that code
> path could be utilized to synchronize this in a matter other than
> taking it.
>
> pseudo-code wise, something like this:
> WRITE_ONCE(me->pdeath_signal, arg2);
> /* publish the above store and load the lock after */
> smb_mb();
> /* here spin waiting until tasklist_lock is not write-locked */
> smb_rmb();
>
> Unless I'm missing something this should provide the guarantee you see
> the updated parent, if any.
>
> I don't see a routine to do it though and knowing memory barriers
> there might be some bullshit hiding there making this not work, so not
> my first choice unless someone with more memory barrier clue can chime
> in.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit
2025-09-13 22:28 ` Demi Marie Obenour via B4 Relay
(?)
(?)
@ 2025-09-20 4:10 ` Demi Marie Obenour
-1 siblings, 0 replies; 13+ messages in thread
From: Demi Marie Obenour @ 2025-09-20 4:10 UTC (permalink / raw)
To: Andrew Morton; +Cc: Linux kernel mailing list
[-- Attachment #1.1.1: Type: text/plain, Size: 105 bytes --]
CCing Andrew Morton as the maintainer of last resort.
--
Sincerely,
Demi Marie Obenour (she/her/hers)
[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 7253 bytes --]
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2025-10-07 12:53 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-13 22:28 [PATCH] kernel: Prevent prctl(PR_SET_PDEATHSIG) from racing with parent process exit Demi Marie Obenour
2025-09-13 22:28 ` Demi Marie Obenour via B4 Relay
2025-09-20 4:10 ` Demi Marie Obenour
2025-09-22 22:48 ` Andrew Morton
2025-09-23 12:03 ` Oleg Nesterov
2025-09-23 13:39 ` Mateusz Guzik
2025-09-25 16:28 ` Oleg Nesterov
2025-09-25 18:35 ` Mateusz Guzik
2025-09-25 18:50 ` Oleg Nesterov
2025-09-26 23:58 ` Demi Marie Obenour
2025-09-27 1:51 ` Mateusz Guzik
2025-10-07 12:53 ` Christian Brauner
2025-09-20 4:10 ` Demi Marie Obenour
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.