From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
To: Tejun Heo <tj@kernel.org>
Cc: linux-rt-devel@lists.linux.dev, cgroups@vger.kernel.org,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Clark Williams" <clrkwllms@kernel.org>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Bert Karwatzki" <spasswolf@web.de>
Subject: Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
Date: Tue, 3 Mar 2026 14:13:01 +0100 [thread overview]
Message-ID: <20260303131301.ieSSCM4n@linutronix.de> (raw)
In-Reply-To: <aaYHmCV2CW-tOT-Z@slm.duckdns.org>
On 2026-03-02 11:56:40 [-1000], Tejun Heo wrote:
> Hello, Seb.
Hi Tejun,
> On Mon, Mar 02, 2026 at 01:07:38PM +0100, Sebastian Andrzej Siewior wrote:
> > Tejun, with this change, would it be okay to
> > - replace the irq-work with kworker? With this change it should address
> > your concern regarding "run in definite time" as mentioned in [0]. So
> > it might be significantly delayed but it shouldn't be visible.
> > This would lift the restriction that a irq-work needs to run on this
> > CPU and the kworker could run on any CPU.
>
> Yeah, that's fine.
Okay.
> > - would it be okay to treat RT and !RT equally here (and do this delayed
> > cgroup_task_dead() in both cases)
>
> I don't see why we'd bounce on !RT. Are there any benefits?
You would have the same bugs if any and not two separate types if at
all. It would get rid of the ifdef. No other benefit.
> > @@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
> >
> > static int cgroup_procs_show(struct seq_file *s, void *v)
> > {
> > + struct task_struct *tsk = v;
> > +
> > + if (READ_ONCE(tsk->__state) & TASK_DEAD)
> > + return 0;
>
> Does this actually close the window for systemd through operation ordering
> or does it just reduce the race window?
I see the task in questing exiting via sched_process_exit() and after
its schedule() it is removed from the cgroup list as per current logic.
The parent systemd task doing all its cleanup gets probably active
because its child died/ SIGCHLD. So we have sched_process_exit(), wake
parent, final schedule() leaving as zombie removing itself from the
cgroup list.
Judging from do_exit(), a preemption after exit_notify() before
do_task_dead() should lead to the same problem. By adding a delay
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -893,6 +893,7 @@ static void synchronize_group_exit(struct task_struct *tsk, long code)
coredump_task_exit(tsk, core_state);
}
+#include <linux/delay.h>
void __noreturn do_exit(long code)
{
struct task_struct *tsk = current;
@@ -1004,6 +1005,7 @@ void __noreturn do_exit(long code)
exit_task_stack_account(tsk);
check_stack_usage();
+ ssleep(1);
preempt_disable();
if (tsk->nr_dirtied)
__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);
I get the same timeout behaviour on !PREEMPT_RT. So it seems like I did
reduce the race window.
So what about doing the removal before sending the signal about the
upcoming death? Something like:
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6975,6 +6975,7 @@ void cgroup_post_fork(struct task_struct *child,
* Description: Detach cgroup from @tsk.
*
*/
+static void do_cgroup_task_dead(struct task_struct *tsk);
void cgroup_task_exit(struct task_struct *tsk)
{
struct cgroup_subsys *ss;
@@ -6984,6 +6985,7 @@ void cgroup_task_exit(struct task_struct *tsk)
do_each_subsys_mask(ss, i, have_exit_callback) {
ss->exit(tsk);
} while_each_subsys_mask();
+ do_cgroup_task_dead(tsk);
}
static void do_cgroup_task_dead(struct task_struct *tsk)
@@ -7050,16 +7052,12 @@ static void __init cgroup_rt_init(void)
void cgroup_task_dead(struct task_struct *task)
{
- get_task_struct(task);
- llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
- irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
}
#else /* CONFIG_PREEMPT_RT */
static void __init cgroup_rt_init(void) {}
void cgroup_task_dead(struct task_struct *task)
{
- do_cgroup_task_dead(task);
}
#endif /* CONFIG_PREEMPT_RT */
cgroup_task_exit() is invoked a few functions before exit_notify(). Not
sure what else I might have broken but the race window should be closed.
> Thanks.
Sebastian
next prev parent reply other threads:[~2026-03-03 13:13 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-02 12:07 [PATCH] cgroup: Don't expose dead tasks in cgroup Sebastian Andrzej Siewior
2026-03-02 21:56 ` Tejun Heo
2026-03-03 13:13 ` Sebastian Andrzej Siewior [this message]
2026-03-03 17:59 ` Tejun Heo
2026-03-03 20:22 ` Tejun Heo
2026-03-04 19:16 ` Sebastian Andrzej Siewior
2026-03-04 19:22 ` Tejun Heo
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260303131301.ieSSCM4n@linutronix.de \
--to=bigeasy@linutronix.de \
--cc=cgroups@vger.kernel.org \
--cc=clrkwllms@kernel.org \
--cc=hannes@cmpxchg.org \
--cc=linux-rt-devel@lists.linux.dev \
--cc=mkoutny@suse.com \
--cc=rostedt@goodmis.org \
--cc=spasswolf@web.de \
--cc=tj@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox