From: Tejun Heo <tj@kernel.org>
To: David Vernet <void@manifault.com>,
	Andrea Righi <arighi@nvidia.com>,
	Changwoo Min <changwoo@igalia.com>
Cc: Dan Schatzberg <dschatzberg@meta.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	sched-ext@lists.linux.dev, Tejun Heo <tj@kernel.org>
Subject: [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering
Date: Tue, 28 Oct 2025 20:19:14 -1000
Message-ID: <20251029061918.4179554-1-tj@kernel.org>

Hello,

This series fixes a cgroup task exit ordering issue that is generally
suboptimal for all cgroup controllers and has caused real breakage for
sched_ext schedulers.

Currently, when a task exits, cgroup_task_exit() in do_exit() immediately
unlinks the task from its cgroup via css_set_move_task(). From the cgroup's
perspective, the task is now gone. If this makes the cgroup empty, it can be
destroyed, triggering ->css_offline() callbacks that notify controllers the
cgroup is going offline resource-wise.

However, the exiting task continues to run, perform memory operations, and
schedule until the final context switch in finish_task_switch(). This creates
a problematic window where controllers are told a cgroup is offline while
resource activities are still occurring in it. While this hasn't broken
existing controllers, it's clearly suboptimal and has caused real breakage
for sched_ext schedulers.
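
To make the window concrete, here is a minimal userspace sketch. It is not
kernel code: model_cgroup, model_task, model_unlink() and model_exit() are
hypothetical names, and the model takes an empty cgroup offline immediately,
which is the worst case rather than what always happens. It counts how many
units of work an exiting task performs after its cgroup has gone offline,
with the unlink done either at the start of exit or after the final switch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct model_cgroup {
	int nr_tasks;	/* tasks still linked to this cgroup */
	bool online;	/* false once the cgroup has been taken offline */
};

struct model_task {
	struct model_cgroup *cgrp;	/* NULL once unlinked */
};

/* Unlink the task; in this model an empty cgroup goes offline right away. */
static void model_unlink(struct model_task *t)
{
	struct model_cgroup *cg = t->cgrp;

	assert(cg && cg->nr_tasks > 0);
	t->cgrp = NULL;
	if (--cg->nr_tasks == 0)
		cg->online = false;
}

/*
 * Run one task through exit. The task performs @steps units of work
 * between the start of exit and its final context switch. Returns how
 * many of those units ran while the cgroup was already offline.
 */
static int model_exit(bool defer_unlink, int steps)
{
	struct model_cgroup cg = { .nr_tasks = 1, .online = true };
	struct model_task t = { .cgrp = &cg };
	int offline_activity = 0;

	if (!defer_unlink)
		model_unlink(&t);	/* current: unlink at the start of exit */

	for (int i = 0; i < steps; i++)
		if (!cg.online)
			offline_activity++;	/* work in an offline cgroup */

	if (defer_unlink)
		model_unlink(&t);	/* proposed: unlink after the final switch */

	assert(!cg.online);	/* either way the cgroup ends up offline */
	return offline_activity;
}
```

In this model, model_exit(false, 5) reports all five units as running in an
offline cgroup, while model_exit(true, 5) reports none.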

The sched_ext breakage manifests in two ways:

1. When a sched_ext scheduler is loaded, it walks all tasks and calls
   ops.init_task() on each. For tasks in a cgroup, it first ensures the
   cgroup has been initialized via ops.cgroup_init(). However, if the task
   is in the dying state (still running but already unlinked from its
   cgroup), the cgroup may already be offline. This results in
   ops.init_task() being called for a task whose cgroup never received
   ops.cgroup_init(), breaking the initialization invariant.

   This broke the scx_mitosis scheduler with errors like "cgrp_ctx lookup
   failed for cgid 3869" where the BPF program couldn't find cgroup context
   that was never created. See: https://github.com/sched-ext/scx/issues/2846

2. Because sched_ext_free() is called from __put_task_struct() (which can
   happen long after the task stops running), ops.cgroup_exit() can be
   called before ops.exit_task() has been called on all member tasks,
   violating the expected ordering where all tasks exit before their cgroup
   does.
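
The first failure mode can be illustrated with a small userspace sketch. It
is only a model under stated assumptions: ctx_map, walk_task and
load_scheduler() are hypothetical names, and the per-cgroup context map
stands in for whatever state a BPF scheduler keeps per cgroup. Contexts are
created only for online cgroups, then every task looks up the context of the
cgroup it still belongs to from the scheduler's point of view.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CGROUPS 8

/* Hypothetical stand-in for a scheduler's per-cgroup context map. */
struct ctx_map {
	bool present[MAX_CGROUPS];
};

struct walk_task {
	int cgid;		/* cgroup the scheduler associates the task with */
	bool cgrp_online;	/* whether that cgroup is still online */
};

/*
 * Model of scheduler load: create contexts for online cgroups only (the
 * cgroup_init step), then initialize every task (the init_task step) and
 * look up its cgroup context. Returns the number of failed lookups.
 */
static int load_scheduler(const struct walk_task *tasks, size_t n)
{
	struct ctx_map map = { { false } };
	int failed = 0;

	for (size_t i = 0; i < n; i++) {
		assert(tasks[i].cgid >= 0 && tasks[i].cgid < MAX_CGROUPS);
		if (tasks[i].cgrp_online)
			map.present[tasks[i].cgid] = true;
	}

	for (size_t i = 0; i < n; i++)
		if (!map.present[tasks[i].cgid])
			failed++;	/* context was never created */

	return failed;
}
```

A dying task whose cgroup is already offline produces one failed lookup in
this model, which is the shape of the scx_mitosis error quoted above. If
every task's cgroup is still online during the walk, no lookup fails.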

The fix defers the cgroup unlinking from do_exit() to finish_task_switch(),
ensuring the task remains linked to its cgroup until it's truly done running.
For sched_ext specifically, we also move the cleanup earlier to
finish_task_switch() to ensure proper ordering with cgroup operations.

This adds two new calls to finish_task_switch() that operate on the dead task
after the final switch: cgroup_task_dead() and sched_ext_dead(). It may make
sense to factor these into a helper function located in kernel/exit.c if this
pattern continues to grow.
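
As a rough illustration of the helper idea and of the ordering it would have
to preserve, here is a userspace sketch. Everything in it is hypothetical:
the model_*() functions only log events and are not the kernel's
cgroup_task_dead() or sched_ext_dead(), and running the sched_ext cleanup
before the cgroup unlink is an assumption derived from the invariant that
all tasks exit before their cgroup does.

```c
#include <assert.h>
#include <stdbool.h>

enum model_event { EV_EXIT_TASK, EV_CGROUP_EXIT };

struct event_log {
	enum model_event ev[4];
	int n;
};

static void log_event(struct event_log *l, enum model_event e)
{
	assert(l->n < 4);
	l->ev[l->n++] = e;
}

/* Stand-in for the sched_ext cleanup: the scheduler sees exit_task. */
static void model_scx_cleanup(struct event_log *l)
{
	log_event(l, EV_EXIT_TASK);
}

/*
 * Stand-in for unlinking the last member task: in this model the cgroup
 * becomes empty and the scheduler immediately sees cgroup_exit.
 */
static void model_cgroup_unlink_last(struct event_log *l)
{
	log_event(l, EV_CGROUP_EXIT);
}

/* Hypothetical helper grouping both dead-task hooks, in assumed order. */
static void model_task_dead_hooks(struct event_log *l)
{
	model_scx_cleanup(l);
	model_cgroup_unlink_last(l);
}

/* Current flow: unlink early in exit, sched_ext cleanup much later. */
static void model_old_flow(struct event_log *l)
{
	model_cgroup_unlink_last(l);
	model_scx_cleanup(l);
}

/* True if no exit_task event is logged after a cgroup_exit event. */
static bool exit_order_ok(const struct event_log *l)
{
	bool cgroup_exited = false;

	for (int i = 0; i < l->n; i++) {
		if (l->ev[i] == EV_CGROUP_EXIT)
			cgroup_exited = true;
		else if (cgroup_exited)
			return false;
	}
	return true;
}
```

Running model_task_dead_hooks() on an empty log satisfies exit_order_ok(),
while model_old_flow() does not.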

Patch 0004 changes sched_ext and can be applied to sched_ext/for-6.19 after
pulling cgroup/for-6.19. Alternatively, would it be easier for some or all of
this series to go through the tip tree?

Based on cgroup/for-6.19 (d5cf4d34a333).

 0001 cgroup: Rename cgroup lifecycle hooks to cgroup_task_*()
 0002 cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free()
 0003 cgroup: Defer task cgroup unlink until after the task is done switching out
 0004 sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()

Git tree: git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git cgroup-fix-exit-ordering

 include/linux/cgroup.h    | 14 ++++++++------
 include/linux/sched/ext.h |  4 ++--
 kernel/cgroup/cgroup.c    | 39 +++++++++++++++++++++++----------------
 kernel/exit.c             |  4 ++--
 kernel/fork.c             |  3 +--
 kernel/sched/autogroup.c  |  4 ++--
 kernel/sched/core.c       |  8 ++++++++
 kernel/sched/ext.c        |  2 +-
 8 files changed, 47 insertions(+), 31 deletions(-)

--
tejun

Thread overview: 16+ messages
2025-10-29  6:19 Tejun Heo [this message]
2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
2025-10-31  2:25   ` Chen Ridong
2025-10-29  6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
2025-11-14 17:48   ` Michal Koutný
2025-11-14 18:18     ` Tejun Heo
2025-10-29  6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
2025-11-14 17:48   ` Michal Koutný
2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
2025-10-31 19:32   ` Andrea Righi
2025-11-03 20:25   ` [PATCH v2 " Tejun Heo
2025-11-03 20:28     ` Peter Zijlstra
2025-11-03 20:31       ` Tejun Heo
2025-11-03 20:38         ` Peter Zijlstra
2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
