From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Christian Brauner <brauner@kernel.org>,
Oleg Nesterov <oleg@redhat.com>, Sasha Levin <sashal@kernel.org>,
viro@zeniv.linux.org.uk, akpm@linux-foundation.org,
mhocko@suse.com, Liam.Howlett@Oracle.com, mjguzik@gmail.com,
alexjlzheng@tencent.com, pasha.tatashin@soleen.com,
tglx@linutronix.de, frederic@kernel.org, peterz@infradead.org,
lorenzo.stoakes@oracle.com, linux-fsdevel@vger.kernel.org
Subject: [PATCH AUTOSEL 6.14 067/642] pidfs: improve multi-threaded exec and premature thread-group leader exit polling
Date: Mon, 5 May 2025 18:04:43 -0400
Message-ID: <20250505221419.2672473-67-sashal@kernel.org>
In-Reply-To: <20250505221419.2672473-1-sashal@kernel.org>
From: Christian Brauner <brauner@kernel.org>
[ Upstream commit 0fb482728ba1ee2130eaa461bf551f014447997c ]
This is another attempt to make pidfd polling for multi-threaded
exec and premature thread-group leader exit consistent.
A quick recap of these two cases:
(1) During a multi-threaded exec by a subthread, i.e., a thread that is
not the thread-group leader, all other threads in the thread-group,
including the thread-group leader, are killed and the struct pid of
the thread-group leader is taken over by the subthread that called
exec. IOW, two tasks change their TIDs.
(2) A premature thread-group leader exit means that the thread-group
leader exited before all of the other subthreads in the thread-group
have exited.
Both cases lead to inconsistencies for pidfd polling with PIDFD_THREAD.
Any caller that holds a PIDFD_THREAD pidfd to the current thread-group
leader may or may not see an exit notification on the file descriptor
depending on when poll is performed. If the poll is performed before the
exec of the subthread has concluded, an exit notification is generated
for the old thread-group leader. If the poll is performed after the exec
of the subthread has concluded, no exit notification is generated for the
old thread-group leader.
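
From userspace, the behavior described above can be probed with a helper
along the following lines. This is only an illustrative sketch and not
part of the patch: pidfd_thread_exited() is a hypothetical name, the
fallback syscall number (434) and the PIDFD_THREAD value (O_EXCL) are
assumed from the uapi headers, and PIDFD_THREAD requires a kernel that
supports it (Linux 6.9 or later).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* assumed: number used on x86-64 and most arches */
#endif
#ifndef PIDFD_THREAD
#define PIDFD_THREAD O_EXCL	/* assumed: value from <linux/pidfd.h> */
#endif

/*
 * Hypothetical helper: open a PIDFD_THREAD pidfd for @tid and poll it.
 * Returns 1 if an exit notification is pending, 0 if none was seen
 * within @timeout_ms, and -1 on error (e.g. kernel lacks PIDFD_THREAD).
 */
static int pidfd_thread_exited(pid_t tid, int timeout_ms)
{
	struct pollfd pfd = { .events = POLLIN };
	int ret;

	pfd.fd = (int)syscall(SYS_pidfd_open, tid, PIDFD_THREAD);
	if (pfd.fd < 0)
		return -1;

	ret = poll(&pfd, 1, timeout_ms);
	close(pfd.fd);
	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;

	/* Only one fd was polled, so a positive return must be 1. */
	assert(ret == 1);
	return (pfd.revents & (POLLIN | POLLHUP)) ? 1 : 0;
}
```

Calling pidfd_thread_exited() on the TID of a thread-group leader before
and after a subthread execs is where the inconsistency would show up on
a kernel without this change.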
The correct behavior would be to simply not generate an exit
notification on the struct pid during a subthread exec, because the
struct pid is taken over by the subthread and thus remains alive.
But this is difficult to handle because a thread-group leader may exit
prematurely as mentioned in (2). In that case an exit notification is
reliably generated, but the subthreads may continue to run for an
indeterminate amount of time and thus may also exec at some point.
So far there was no way to distinguish between (1) and (2) internally.
This tiny series tries to address this problem by discarding
PIDFD_THREAD notifications on premature thread-group leader exit.
If that works correctly then no exit notifications are generated for a
PIDFD_THREAD pidfd for a thread-group leader until all subthreads have
been reaped. If a subthread should exec afterwards, no exit notification
will be generated until that task exits or it creates subthreads and
repeats the cycle.
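
The intended rule can be modelled in plain C. This is a simplified
userspace sketch, not kernel code: struct model_task and the model_*()
helpers are hypothetical, and the sketch assumes that
delay_group_leader() means "a thread-group leader whose group still has
other threads".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the task state the checks look at. */
struct model_task {
	bool is_leader;		/* task is the thread-group leader */
	int other_threads;	/* other threads still in the group */
	bool exited;		/* models task->exit_state != 0 */
};

/* Models thread_group_empty(): a leader with no other threads left. */
static bool model_group_empty(const struct model_task *t)
{
	assert(t != NULL && t->other_threads >= 0);
	return t->is_leader && t->other_threads == 0;
}

/* Models delay_group_leader(): a leader whose group is not yet empty. */
static bool model_delay_group_leader(const struct model_task *t)
{
	return t->is_leader && !model_group_empty(t);
}

/* Old pidfd_poll() condition: PIDFD_THREAD pollers saw any exit. */
static bool model_old_poll_ready(const struct model_task *t, bool thread)
{
	return t->exited && (thread || model_group_empty(t));
}

/* New pidfd_poll() condition: a prematurely exited leader stays silent. */
static bool model_new_poll_ready(const struct model_task *t)
{
	return t->exited && !model_delay_group_leader(t);
}
```

In this model, an exited leader that still has two subthreads is
reported as ready by the old condition for a PIDFD_THREAD poller but not
by the new one, while an exited subthread or an exited leader of an
empty group is reported by both.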
Co-developed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20250320-work-pidfs-thread_group-v4-1-da678ce805bf@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
fs/pidfs.c | 9 +++++----
kernel/exit.c | 6 +++---
kernel/signal.c | 3 +--
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/pidfs.c b/fs/pidfs.c
index c0478b3c55d9f..9aa4c705776dd 100644
--- a/fs/pidfs.c
+++ b/fs/pidfs.c
@@ -188,20 +188,21 @@ static void pidfd_show_fdinfo(struct seq_file *m, struct file *f)
static __poll_t pidfd_poll(struct file *file, struct poll_table_struct *pts)
{
struct pid *pid = pidfd_pid(file);
- bool thread = file->f_flags & PIDFD_THREAD;
struct task_struct *task;
__poll_t poll_flags = 0;
poll_wait(file, &pid->wait_pidfd, pts);
/*
- * Depending on PIDFD_THREAD, inform pollers when the thread
- * or the whole thread-group exits.
+ * Don't wake waiters if the thread-group leader exited
+ * prematurely. They either get notified when the last subthread
+ * exits or not at all if one of the remaining subthreads execs
+ * and assumes the struct pid of the old thread-group leader.
*/
guard(rcu)();
task = pid_task(pid, PIDTYPE_PID);
if (!task)
poll_flags = EPOLLIN | EPOLLRDNORM | EPOLLHUP;
- else if (task->exit_state && (thread || thread_group_empty(task)))
+ else if (task->exit_state && !delay_group_leader(task))
poll_flags = EPOLLIN | EPOLLRDNORM;
return poll_flags;
diff --git a/kernel/exit.c b/kernel/exit.c
index 6bb59b16e33e1..a9960dd6d0aa0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -744,10 +744,10 @@ static void exit_notify(struct task_struct *tsk, int group_dead)
tsk->exit_state = EXIT_ZOMBIE;
/*
- * sub-thread or delay_group_leader(), wake up the
- * PIDFD_THREAD waiters.
+ * Ignore thread-group leaders that exited before all
+ * subthreads did.
*/
- if (!thread_group_empty(tsk))
+ if (!delay_group_leader(tsk))
do_notify_pidfd(tsk);
if (unlikely(tsk->ptrace)) {
diff --git a/kernel/signal.c b/kernel/signal.c
index 875e97f6205a2..b2e5c90f29602 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2180,8 +2180,7 @@ bool do_notify_parent(struct task_struct *tsk, int sig)
WARN_ON_ONCE(!tsk->ptrace &&
(tsk->group_leader != tsk || !thread_group_empty(tsk)));
/*
- * tsk is a group leader and has no threads, wake up the
- * non-PIDFD_THREAD waiters.
+ * Notify for thread-group leaders without subthreads.
*/
if (thread_group_empty(tsk))
do_notify_pidfd(tsk);
--
2.39.5