From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
Andrea Righi <arighi@nvidia.com>,
Changwoo Min <changwoo@igalia.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>,
Peter Zijlstra <peterz@infradead.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
sched-ext@lists.linux.dev, Tejun Heo <tj@kernel.org>
Subject: [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
Date: Tue, 28 Oct 2025 20:19:14 -1000
Message-ID: <20251029061918.4179554-1-tj@kernel.org>
Hello,
This series fixes a cgroup task exit ordering issue that is generally
suboptimal for all cgroup controllers and has caused real breakage for
sched_ext schedulers.
Currently, when a task exits, cgroup_task_exit() in do_exit() immediately
unlinks the task from its cgroup via css_set_move_task(). From the cgroup's
perspective, the task is now gone. If this makes the cgroup empty, it can be
destroyed, triggering ->css_offline() callbacks that notify controllers the
cgroup is going offline resource-wise.
However, the exiting task continues to run, perform memory operations, and
schedule until the final context switch in finish_task_switch(). This creates
a problematic window where controllers are told a cgroup is offline while
resource activities are still occurring in it. While this hasn't broken
existing controllers, it's clearly suboptimal and has caused real breakage
for sched_ext schedulers.
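To make the window concrete, here is a toy userspace model of the two orderings. It is only a sketch: every toy_* type and function is made up for illustration and none of them are kernel APIs, and the model takes the cgroup offline the moment it becomes empty, whereas a real cgroup merely becomes eligible for removal at that point.

```c
#include <stdbool.h>

/* Toy model only: none of these types or functions are kernel APIs. */
struct toy_cgroup {
	int nr_tasks;      /* tasks currently linked to the cgroup */
	bool online;       /* false once the toy "offline" step has run */
	int late_activity; /* resource activity seen while offline */
};

struct toy_task {
	struct toy_cgroup *cgrp;
	bool linked;
};

/* Unlink the task; in this model an empty cgroup goes offline right away. */
static void toy_unlink(struct toy_task *t)
{
	if (!t->linked)
		return;
	t->linked = false;
	if (--t->cgrp->nr_tasks == 0)
		t->cgrp->online = false;
}

/* Any work the exiting task still does (memory operations, scheduling). */
static void toy_activity(struct toy_task *t)
{
	if (!t->cgrp->online)
		t->cgrp->late_activity++;
}

/*
 * Walk one task through exit. With defer_unlink == false the unlink happens
 * at the do_exit() stage; with true it waits for the finish_task_switch()
 * stage. Returns how many activities hit an already-offline cgroup.
 */
static int toy_exit_sequence(bool defer_unlink)
{
	struct toy_cgroup cg = { .nr_tasks = 1, .online = true };
	struct toy_task t = { .cgrp = &cg, .linked = true };

	/* do_exit() stage */
	if (!defer_unlink)
		toy_unlink(&t);

	/* the task keeps running until its final context switch */
	toy_activity(&t);
	toy_activity(&t);

	/* finish_task_switch() stage: the task is truly done running */
	toy_unlink(&t);

	return cg.late_activity;
}
```

Calling toy_exit_sequence(false) counts both activities as landing on an offline cgroup, while toy_exit_sequence(true) counts none, which is the property the deferred unlink is after.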
The sched_ext breakage manifests in two ways:
1. When a sched_ext scheduler is loaded, it walks all tasks and calls
ops.init_task() on each. For tasks in a cgroup, it first ensures the
cgroup has been initialized via ops.cgroup_init(). However, if the task
is in the dying state (still running but already unlinked from its
cgroup), the cgroup may already be offline. This results in
ops.init_task() being called on a cgroup that never received
ops.cgroup_init(), breaking the initialization invariant.
This broke the scx_mitosis scheduler with errors like "cgrp_ctx lookup
failed for cgid 3869" where the BPF program couldn't find cgroup context
that was never created. See: https://github.com/sched-ext/scx/issues/2846
2. Because sched_ext_free() is called from __put_task_struct() (which can
happen long after the task stops running), ops.cgroup_exit() can be
called before ops.exit_task() has been called on all member tasks, violating
the expected ordering where all tasks exit before their cgroup does.
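The two invariants above can be written down as a small checker. This is a hypothetical userspace sketch, not part of sched_ext: the toy_* names are invented here and only mirror the ops.cgroup_init(), ops.init_task(), ops.exit_task() and ops.cgroup_exit() callbacks for a single cgroup.

```c
#include <stdbool.h>

/* Hypothetical per-cgroup bookkeeping; not a sched_ext data structure. */
struct toy_cgrp_state {
	bool inited;     /* the cgroup_init step has been seen */
	int live_tasks;  /* init_task seen, exit_task not yet seen */
	int violations;  /* ordering rules broken so far */
};

/* Mirrors ops.cgroup_init(): the cgroup becomes known to the scheduler. */
static void toy_cgroup_init(struct toy_cgrp_state *s)
{
	s->inited = true;
}

/* Mirrors ops.init_task(): only valid on an initialized cgroup. */
static void toy_init_task(struct toy_cgrp_state *s)
{
	if (!s->inited)
		s->violations++;
	s->live_tasks++;
}

/* Mirrors ops.exit_task(): one member task is gone. */
static void toy_exit_task(struct toy_cgrp_state *s)
{
	s->live_tasks--;
}

/* Mirrors ops.cgroup_exit(): only valid once every member task has exited. */
static void toy_cgroup_exit(struct toy_cgrp_state *s)
{
	if (s->live_tasks > 0)
		s->violations++;
	s->inited = false;
}
```

Case 1 corresponds to calling toy_init_task() on a state that never saw toy_cgroup_init(); case 2 corresponds to calling toy_cgroup_exit() while live_tasks is still positive. Each bumps violations by one.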
The fix defers the cgroup unlinking from do_exit() to finish_task_switch(),
ensuring the task remains linked to its cgroup until it's truly done running.
For sched_ext specifically, we also move the cleanup earlier to
finish_task_switch() to ensure proper ordering with cgroup operations.
This adds two new calls to finish_task_switch() that operate on the dead task
after the final switch: cgroup_task_dead() and sched_ext_dead(). It may make
sense to factor these into a helper function located in kernel/exit.c if this
pattern continues to grow.
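As a rough idea of what such a helper could look like, here is a userspace sketch with stubs. The helper name toy_task_dead_hooks(), the toy_dead_task type and the stub bodies are all hypothetical; only the names cgroup_task_dead() and sched_ext_dead() come from this series. The call order shown is an assumption, picked so that the task-level exit runs before the step that can leave the cgroup empty.

```c
#include <stddef.h>

/* Identifiers for the hooks, used only to record call order. */
enum toy_hook {
	TOY_SCHED_EXT_DEAD,
	TOY_CGROUP_TASK_DEAD,
};

/* Stand-in for struct task_struct; holds only what the sketch needs. */
struct toy_dead_task {
	enum toy_hook calls[4]; /* hooks run on this task, in order */
	size_t nr_calls;
};

static void toy_record(struct toy_dead_task *p, enum toy_hook hook)
{
	if (p->nr_calls < 4)
		p->calls[p->nr_calls++] = hook;
}

/* Userspace stubs named after the two calls the series adds. */
static void sched_ext_dead(struct toy_dead_task *p)
{
	toy_record(p, TOY_SCHED_EXT_DEAD);
}

static void cgroup_task_dead(struct toy_dead_task *p)
{
	toy_record(p, TOY_CGROUP_TASK_DEAD);
}

/*
 * Hypothetical helper: a single entry point for everything that must run
 * on a dead task after its final switch. The ordering is an assumption.
 */
static void toy_task_dead_hooks(struct toy_dead_task *p)
{
	sched_ext_dead(p);
	cgroup_task_dead(p);
}
```

With a helper like this, finish_task_switch() would carry one call for the dead task instead of a growing list, and the ordering between the hooks would be documented in one place.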
Patch 0004 changes sched_ext and can be applied to sched_ext/for-6.19 after
pulling cgroup/for-6.19. Alternatively, would it be easier for some or all of
this series to go through the tip tree?
Based on cgroup/for-6.19 (d5cf4d34a333).
0001 cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
0002 cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
0003 cgroup: Defer task cgroup unlink until after the task is done switching out
0004 sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-fix-exit-ordering
 include/linux/cgroup.h    | 14 ++++++++------
 include/linux/sched/ext.h |  4 ++--
 kernel/cgroup/cgroup.c    | 39 +++++++++++++++++++++++----------------
 kernel/exit.c             |  4 ++--
 kernel/fork.c             |  3 +--
 kernel/sched/autogroup.c  |  4 ++--
 kernel/sched/core.c       |  8 ++++++++
 kernel/sched/ext.c        |  2 +-
 8 files changed, 47 insertions(+), 31 deletions(-)
--
tejun
Thread overview: 16+ messages
2025-10-29 6:19 Tejun Heo [this message]
2025-10-29 6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
2025-10-31 2:25 ` Chen Ridong
2025-10-29 6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
2025-11-14 17:48 ` Michal Koutný
2025-11-14 18:18 ` Tejun Heo
2025-10-29 6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
2025-11-14 17:48 ` Michal Koutný
2025-10-29 6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
2025-10-31 19:32 ` Andrea Righi
2025-11-03 20:25 ` [PATCH v2 " Tejun Heo
2025-11-03 20:28 ` Peter Zijlstra
2025-11-03 20:31 ` Tejun Heo
2025-11-03 20:38 ` Peter Zijlstra
2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo