public inbox for cgroups@vger.kernel.org
* [PATCH] cgroup: Don't expose dead tasks in cgroup
@ 2026-03-02 12:07 Sebastian Andrzej Siewior
  2026-03-02 21:56 ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-02 12:07 UTC (permalink / raw)
  To: linux-rt-devel, cgroups
  Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Clark Williams,
	Steven Rostedt, Bert Karwatzki

Once a task exits, its state is set to TASK_DEAD and it is then removed
from the cgroup it belonged to. This last step happens once the task gets
out of its last schedule() invocation and is delayed on PREEMPT_RT due to
locking constraints.

As a result it is possible to receive, via waitpid(), the pid of a task
which is still listed in cgroup.procs of the cgroup it belonged to. This
is something that systemd does not expect, and as a result it waits for
the task's exit until a timeout occurs.

This can be avoided by skipping tasks which are in the TASK_DEAD state.
There is no need to verify both task states under task_struct::pi_lock
because once the task is exiting after its last schedule(), there will
not be any sleeping locks.

Reported-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20260219164648.3014-1-spasswolf@web.de/
Tested-by: Bert Karwatzki <spasswolf@web.de>
Fixes: 9311e6c29b348 ("cgroup: Fix sleeping from invalid context warning on PREEMPT_RT")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Tejun, with this change, would it be okay to
- replace the irq-work with a kworker? With this change it should address
  your concern regarding "run in definite time" as mentioned in [0]. The
  removal might be significantly delayed but it shouldn't be visible.
  This would lift the restriction that an irq-work needs to run on this
  CPU; the kworker could run on any CPU.

- treat RT and !RT equally here (and do this delayed
  cgroup_task_dead() in both cases)?

[0] https://lore.kernel.org/all/aQzg9kcnCsdRQiB4@slm.duckdns.org/

 kernel/cgroup/cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c22cda7766d84..a8254229d62d3 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
 
 static int cgroup_procs_show(struct seq_file *s, void *v)
 {
+	struct task_struct *tsk = v;
+
+	if (READ_ONCE(tsk->__state) & TASK_DEAD)
+		return 0;
+
 	seq_printf(s, "%d\n", task_pid_vnr(v));
 	return 0;
 }
-- 
2.51.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-02 12:07 [PATCH] cgroup: Don't expose dead tasks in cgroup Sebastian Andrzej Siewior
@ 2026-03-02 21:56 ` Tejun Heo
  2026-03-03 13:13   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2026-03-02 21:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutný,
	Clark Williams, Steven Rostedt, Bert Karwatzki

Hello, Seb.

On Mon, Mar 02, 2026 at 01:07:38PM +0100, Sebastian Andrzej Siewior wrote:
> Tejun, with this change, would it be okay to
> - replace the irq-work with a kworker? With this change it should address
>   your concern regarding "run in definite time" as mentioned in [0]. The
>   removal might be significantly delayed but it shouldn't be visible.
>   This would lift the restriction that an irq-work needs to run on this
>   CPU; the kworker could run on any CPU.

Yeah, that's fine.

> - treat RT and !RT equally here (and do this delayed
>   cgroup_task_dead() in both cases)?

I don't see why we'd bounce on !RT. Are there any benefits?

> @@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
>  
>  static int cgroup_procs_show(struct seq_file *s, void *v)
>  {
> +	struct task_struct *tsk = v;
> +
> +	if (READ_ONCE(tsk->__state) & TASK_DEAD)
> +		return 0;

Does this actually close the window for systemd through operation ordering
or does it just reduce the race window?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-02 21:56 ` Tejun Heo
@ 2026-03-03 13:13   ` Sebastian Andrzej Siewior
  2026-03-03 17:59     ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-03 13:13 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutný,
	Clark Williams, Steven Rostedt, Bert Karwatzki

On 2026-03-02 11:56:40 [-1000], Tejun Heo wrote:
> Hello, Seb.
Hi Tejun,

> On Mon, Mar 02, 2026 at 01:07:38PM +0100, Sebastian Andrzej Siewior wrote:
> > Tejun, with this change, would it be okay to
> > - replace the irq-work with a kworker? With this change it should address
> >   your concern regarding "run in definite time" as mentioned in [0]. The
> >   removal might be significantly delayed but it shouldn't be visible.
> >   This would lift the restriction that an irq-work needs to run on this
> >   CPU; the kworker could run on any CPU.
> 
> Yeah, that's fine.

Okay.

> > - treat RT and !RT equally here (and do this delayed
> >   cgroup_task_dead() in both cases)?
> 
> I don't see why we'd bounce on !RT. Are there any benefits?

We would have the same bugs (if any) rather than two separate variants,
and it would get rid of the ifdef. No other benefit.

> > @@ -5283,6 +5283,11 @@ static void *cgroup_procs_start(struct seq_file *s, loff_t *pos)
> >  
> >  static int cgroup_procs_show(struct seq_file *s, void *v)
> >  {
> > +	struct task_struct *tsk = v;
> > +
> > +	if (READ_ONCE(tsk->__state) & TASK_DEAD)
> > +		return 0;
> 
> Does this actually close the window for systemd through operation ordering
> or does it just reduce the race window?

I see the task in question exiting via sched_process_exit(), and after
its last schedule() it is removed from the cgroup list as per the
current logic.

The parent systemd task doing all its cleanup probably becomes active
because its child died and sent SIGCHLD. So we have sched_process_exit(),
wake the parent, final schedule() leaving as a zombie that removes itself
from the cgroup list.
Judging from do_exit(), a preemption after exit_notify() but before
do_task_dead() should lead to the same problem. By adding a delay

--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -893,6 +893,7 @@ static void synchronize_group_exit(struct task_struct *tsk, long code)
 		coredump_task_exit(tsk, core_state);
 }
 
+#include <linux/delay.h>
 void __noreturn do_exit(long code)
 {
 	struct task_struct *tsk = current;
@@ -1004,6 +1005,7 @@ void __noreturn do_exit(long code)
 	exit_task_stack_account(tsk);
 
 	check_stack_usage();
+	ssleep(1);
 	preempt_disable();
 	if (tsk->nr_dirtied)
 		__this_cpu_add(dirty_throttle_leaks, tsk->nr_dirtied);

I get the same timeout behaviour on !PREEMPT_RT. So it seems I only
reduced the race window.
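The ordering above can be sketched as a tiny model (illustrative Python,
nothing here is kernel code; `dead` stands in for "exit_notify() has run,
waitpid() can return" and `unlinked` for the deferred cgroup unlink):

```python
# Conceptual model of the race: the parent learns of the child's death
# (waitpid() returns) before the deferred cgroup unlink has run, so a
# scan of cgroup.procs still shows the dead pid.

class Task:
    def __init__(self, pid):
        self.pid = pid
        self.dead = False      # exit_notify() has run, waitpid() returns
        self.unlinked = False  # deferred cgroup unlink has run

def procs_listing(tasks):
    """What reading cgroup.procs would show."""
    return [t.pid for t in tasks if not t.unlinked]

child = Task(42)
cgroup = [child]

child.dead = True                     # parent's waitpid() returns now...
assert 42 in procs_listing(cgroup)    # ...but the pid is still listed

child.unlinked = True                 # after the final schedule()
assert 42 not in procs_listing(cgroup)
```

The window systemd trips over is exactly the span between the two
assignments; on PREEMPT_RT the deferral makes it wide enough to observe
reliably.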

So what about doing the removal before sending the signal about the
upcoming death? Something like:

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6975,6 +6975,7 @@ void cgroup_post_fork(struct task_struct *child,
  * Description: Detach cgroup from @tsk.
  *
  */
+static void do_cgroup_task_dead(struct task_struct *tsk);
 void cgroup_task_exit(struct task_struct *tsk)
 {
 	struct cgroup_subsys *ss;
@@ -6984,6 +6985,7 @@ void cgroup_task_exit(struct task_struct *tsk)
 	do_each_subsys_mask(ss, i, have_exit_callback) {
 		ss->exit(tsk);
 	} while_each_subsys_mask();
+	do_cgroup_task_dead(tsk);
 }
 
 static void do_cgroup_task_dead(struct task_struct *tsk)
@@ -7050,16 +7052,12 @@ static void __init cgroup_rt_init(void)
 
 void cgroup_task_dead(struct task_struct *task)
 {
-	get_task_struct(task);
-	llist_add(&task->cg_dead_lnode, this_cpu_ptr(&cgrp_dead_tasks));
-	irq_work_queue(this_cpu_ptr(&cgrp_dead_tasks_iwork));
 }
 #else	/* CONFIG_PREEMPT_RT */
 static void __init cgroup_rt_init(void) {}
 
 void cgroup_task_dead(struct task_struct *task)
 {
-	do_cgroup_task_dead(task);
 }
 #endif	/* CONFIG_PREEMPT_RT */
 

cgroup_task_exit() is invoked a few functions before exit_notify(). Not
sure what else I might have broken, but the race window should be closed.

> Thanks.

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-03 13:13   ` Sebastian Andrzej Siewior
@ 2026-03-03 17:59     ` Tejun Heo
  2026-03-03 20:22       ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2026-03-03 17:59 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutný,
	Clark Williams, Steven Rostedt, Bert Karwatzki

Hello,

On Tue, Mar 03, 2026 at 02:13:01PM +0100, Sebastian Andrzej Siewior wrote:
...
> I get the same timeout behaviour on !PREEMPT_RT. So it seems I only
> reduced the race window.

I see.

> So what about doing the removal before sending the signal about the
> upcoming death? Something like:

d245698d727a ("cgroup: Defer task cgroup unlink until after the task is done
switching out") is what created the big wide window causing the regression
and we need that because otherwise the cgroup controllers get confused. I
think we need to figure out where we need to start hiding tasks from listing
to remove the race window and do that explicitly.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-03 17:59     ` Tejun Heo
@ 2026-03-03 20:22       ` Tejun Heo
  2026-03-04 19:16         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2026-03-03 20:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutny,
	Clark Williams, Steven Rostedt, Bert Karwatzki

So, I think we can fix this in the iterator without moving the unlink.
css_task_iter_advance() already skips dying leaders w/ no live threads
but only on the dying_tasks list, which gets populated too late. We can
extend it to catch PF_EXITING tasks on the regular tasks list too:

  if ((task->flags & PF_EXITING) &&
      !atomic_read(&task->signal->live))
          goto repeat;

PF_EXITING is set in exit_signals() which is before exit_notify(), so
by the time the parent wakes up, the flag is already set. The
signal->live check keeps zombie leaders with live threads visible. I
can't see anything that would break by this being checked earlier than
cgroup_task_exit() - everything between PF_EXITING and
cgroup_task_exit() is just teardown (mm, files, etc.) and it should
close the race window.
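A minimal model of that check (illustrative Python, not kernel code;
`live` plays the role of atomic_read(&task->signal->live)):

```python
# Model of the proposed css_task_iter_advance() check: skip a task only
# when it is exiting AND its thread group has no live threads left, so a
# zombie leader whose other threads are still running stays visible.

PF_EXITING = 0x4  # value of the flag in include/linux/sched.h

def visible(flags, live):
    """True if the iterator should still report this task."""
    return not ((flags & PF_EXITING) and live == 0)

assert visible(0, 3)               # normal running task: listed
assert visible(PF_EXITING, 2)      # zombie leader, live threads remain
assert not visible(PF_EXITING, 0)  # whole group dead: hidden
```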

Haven't tested this yet. What do you think?

Thanks.

--
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-03 20:22       ` Tejun Heo
@ 2026-03-04 19:16         ` Sebastian Andrzej Siewior
  2026-03-04 19:22           ` Tejun Heo
  0 siblings, 1 reply; 7+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-04 19:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutny,
	Clark Williams, Steven Rostedt, Bert Karwatzki

On 2026-03-03 10:22:38 [-1000], Tejun Heo wrote:
> So, I think we can fix this in the iterator without moving the unlink.
> css_task_iter_advance() already skips dying leaders w/ no live threads
> but only on the dying_tasks list, which gets populated too late. We can
> extend it to catch PF_EXITING tasks on the regular tasks list too:
> 
>   if ((task->flags & PF_EXITING) &&
>       !atomic_read(&task->signal->live))
>           goto repeat;
> 
> PF_EXITING is set in exit_signals() which is before exit_notify(), so
> by the time the parent wakes up, the flag is already set. The
> signal->live check keeps zombie leaders with live threads visible. I
> can't see anything that would break by this being checked earlier than
> cgroup_task_exit() - everything between PF_EXITING and
> cgroup_task_exit() is just teardown (mm, files, etc.) and it should
> close the race window.
> 
> Haven't tested this yet. What do you think?

So this

--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5108,6 +5108,8 @@ static void css_task_iter_advance(struct css_task_iter *it)
 		return;
 
 	task = list_entry(it->task_pos, struct task_struct, cg_list);
+	if ((task->flags & PF_EXITING) && !atomic_read(&task->signal->live))
+		goto repeat;
 
 	if (it->flags & CSS_TASK_ITER_PROCS) {
 		/* if PROCS, skip over tasks which aren't group leaders */

does work.
So we delay the removal because of sched_ext and then hide it because of
userspace. Nice ;)
The signal check is there to keep the zombies visible, so that they stay
in the list until a waitpid()?
Anyway, do you want me to make a proper patch out of it?

> Thanks.
> 

Sebastian

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] cgroup: Don't expose dead tasks in cgroup
  2026-03-04 19:16         ` Sebastian Andrzej Siewior
@ 2026-03-04 19:22           ` Tejun Heo
  0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2026-03-04 19:22 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, cgroups, Johannes Weiner, Michal Koutny,
	Clark Williams, Steven Rostedt, Bert Karwatzki

Hello,

On Wed, Mar 04, 2026 at 08:16:17PM +0100, Sebastian Andrzej Siewior wrote:
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5108,6 +5108,8 @@ static void css_task_iter_advance(struct css_task_iter *it)
>  		return;
>  
>  	task = list_entry(it->task_pos, struct task_struct, cg_list);
> +	if ((task->flags & PF_EXITING) && !atomic_read(&task->signal->live))
> +		goto repeat;
>  
>  	if (it->flags & CSS_TASK_ITER_PROCS) {
>  		/* if PROCS, skip over tasks which aren't group leaders */
> 
> does work.
> So we delay the removal because of sched_ext and then hide it because of
> userspace. Nice ;)

sched_ext made it more visible, but the problem is shared by cgroup
controllers in general. We don't want a cgroup to become empty while
there's active resource consumption going on, and we tell userspace the
task is dead before it switches out for the last time, so...

> The signal check is there to keep the zombies visible, so that they stay
> in the list until a waitpid()?
> Anyway, do you want me to make a proper patch out of it?

Yes, please.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2026-03-04 19:22 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-02 12:07 [PATCH] cgroup: Don't expose dead tasks in cgroup Sebastian Andrzej Siewior
2026-03-02 21:56 ` Tejun Heo
2026-03-03 13:13   ` Sebastian Andrzej Siewior
2026-03-03 17:59     ` Tejun Heo
2026-03-03 20:22       ` Tejun Heo
2026-03-04 19:16         ` Sebastian Andrzej Siewior
2026-03-04 19:22           ` Tejun Heo
