From: Peter Zijlstra <peterz@infradead.org>
To: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>,
	Andrea Righi <arighi@nvidia.com>,
	Changwoo Min <changwoo@igalia.com>,
	Dan Schatzberg <dschatzberg@meta.com>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	sched-ext@lists.linux.dev
Subject: Re: [PATCH v2 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch()
Date: Mon, 3 Nov 2025 21:28:43 +0100
Message-ID: <20251103202843.GF3245006@noisy.programming.kicks-ass.net>
In-Reply-To: <aQkPqUSMr5L0spd8@slm.duckdns.org>

On Mon, Nov 03, 2025 at 10:25:13AM -1000, Tejun Heo wrote:
> sched_ext_free() was called from __put_task_struct() when the last reference
> to the task is dropped, which could be long after the task has finished
> running. This causes cgroup-related problems:
> 
> - ops.init_task() can be called on a cgroup which didn't get ops.cgroup_init()'d
>   during scheduler load, because the cgroup might be destroyed/unlinked
>   while the zombie or dead task is still lingering on the scx_tasks list.
> 
> - ops.cgroup_exit() could be called before ops.exit_task() is called on all
>   member tasks, leading to incorrect exit ordering.
> 
> Fix by moving it to finish_task_switch() to be called right after the final
> context switch away from the dying task, matching when sched_class->task_dead()
> is called. Rename it to sched_ext_dead() to match the new calling context.
> 
> By calling sched_ext_dead() before cgroup_task_dead(), we ensure that:
> 
> - Tasks visible on scx_tasks list have valid cgroups during scheduler load,
>   as cgroup_mutex prevents cgroup destruction while the task is still linked.
> 
> - All member tasks have ops.exit_task() called and are removed from scx_tasks
>   before the cgroup can be destroyed and trigger ops.cgroup_exit().
> 
> This fix is made possible by the cgroup_task_dead() split in the previous patch.
> 
> This also makes more sense resource-wise as there's no point in keeping
> scheduler side resources around for dead tasks.
> 
> Reported-by: Dan Schatzberg <dschatzberg@meta.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Andrea Righi <arighi@nvidia.com>
> Signed-off-by: Tejun Heo <tj@kernel.org>
> ---
> v2: - Description correction and update (Andrea Righi).
> 
>  include/linux/sched/ext.h |    4 ++--
>  kernel/fork.c             |    1 -
>  kernel/sched/core.c       |    6 ++++++
>  kernel/sched/ext.c        |    2 +-
>  4 files changed, 9 insertions(+), 4 deletions(-)
> 
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -207,14 +207,14 @@ struct sched_ext_entity {
>  	struct list_head	tasks_node;
>  };
>  
> -void sched_ext_free(struct task_struct *p);
> +void sched_ext_dead(struct task_struct *p);
>  void print_scx_info(const char *log_lvl, struct task_struct *p);
>  void scx_softlockup(u32 dur_s);
>  bool scx_rcu_cpu_stall(void);
>  
>  #else	/* !CONFIG_SCHED_CLASS_EXT */
>  
> -static inline void sched_ext_free(struct task_struct *p) {}
> +static inline void sched_ext_dead(struct task_struct *p) {}
>  static inline void print_scx_info(const char *log_lvl, struct task_struct *p) {}
>  static inline void scx_softlockup(u32 dur_s) {}
>  static inline bool scx_rcu_cpu_stall(void) { return false; }
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -736,7 +736,6 @@ void __put_task_struct(struct task_struc
>  	WARN_ON(tsk == current);
>  
>  	unwind_task_free(tsk);
> -	sched_ext_free(tsk);
>  	io_uring_free(tsk);
>  	cgroup_task_free(tsk);
>  	task_numa_free(tsk, true);
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5222,6 +5222,12 @@ static struct rq *finish_task_switch(str
>  		if (prev->sched_class->task_dead)
>  			prev->sched_class->task_dead(prev);

^^^ Can you not use task_dead_scx()?

> +		/*
> +		 * sched_ext_dead() must come before cgroup_task_dead() to
> +		 * prevent cgroups from being removed while its member tasks are
> +		 * visible to SCX schedulers.
> +		 */
> +		sched_ext_dead(prev);
>  		cgroup_task_dead(prev);
>  
>  		/* Task is done with its stack. */
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -2926,7 +2926,7 @@ void scx_cancel_fork(struct task_struct
>  	percpu_up_read(&scx_fork_rwsem);
>  }
>  
> -void sched_ext_free(struct task_struct *p)
> +void sched_ext_dead(struct task_struct *p)
>  {
>  	unsigned long flags;
>  
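
The ordering argument in the quoted changelog can be illustrated with a small userspace model. This is only a sketch under stated assumptions: every `toy_*` name is hypothetical, nothing here is kernel code, and the "old" ordering is simplified to "cgroup unlink first, scheduler-side teardown later". It counts how often the stand-in for ops.exit_task() would run after the stand-in for ops.cgroup_exit().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types; none of these exist in the kernel. */
struct toy_cgroup {
	int nr_linked;		/* tasks still linked to the cgroup */
	int nr_scx_visible;	/* tasks still on the toy "scx_tasks" list */
	bool exited;		/* stand-in for ops.cgroup_exit() having run */
	int violations;		/* exit_task-after-cgroup_exit events seen */
};

struct toy_task {
	struct toy_cgroup *cgrp;
	bool scx_visible;
	bool linked;
};

/* Stand-in for sched_ext_dead(): ops.exit_task() plus list removal. */
static void toy_scx_dead(struct toy_task *p)
{
	if (!p->scx_visible)
		return;
	if (p->cgrp->exited)
		p->cgrp->violations++;	/* cgroup already gone: wrong order */
	assert(p->cgrp->nr_scx_visible > 0);
	p->scx_visible = false;
	p->cgrp->nr_scx_visible--;
}

/* Stand-in for cgroup_task_dead(): the last unlink lets the cgroup exit. */
static void toy_cgroup_dead(struct toy_task *p)
{
	if (!p->linked)
		return;
	p->linked = false;
	if (--p->cgrp->nr_linked == 0)
		p->cgrp->exited = true;
}

/* Run two tasks of one cgroup through teardown; return violation count. */
static int toy_run(bool scx_first)
{
	struct toy_cgroup cg = { .nr_linked = 2, .nr_scx_visible = 2 };
	struct toy_task t[2] = {
		{ .cgrp = &cg, .scx_visible = true, .linked = true },
		{ .cgrp = &cg, .scx_visible = true, .linked = true },
	};
	int i;

	if (scx_first) {
		/* Patched order: scheduler teardown, then cgroup unlink. */
		for (i = 0; i < 2; i++) {
			toy_scx_dead(&t[i]);
			toy_cgroup_dead(&t[i]);
		}
	} else {
		/* Simplified old order: scheduler teardown lags behind. */
		for (i = 0; i < 2; i++)
			toy_cgroup_dead(&t[i]);
		for (i = 0; i < 2; i++)
			toy_scx_dead(&t[i]);
	}
	return cg.violations;
}
```

In this model, `toy_run(true)` never sees a task torn down after its cgroup has exited, while `toy_run(false)` does, which mirrors the second problem listed in the changelog.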

Thread overview: 16+ messages
2025-10-29  6:19 [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-10-29  6:19 ` [PATCH 1/4] cgroup: Rename cgroup lifecycle hooks to cgroup_task_*() Tejun Heo
2025-10-31  2:25   ` Chen Ridong
2025-10-29  6:19 ` [PATCH 2/4] cgroup: Move dying_tasks cleanup from cgroup_task_release() to cgroup_task_free() Tejun Heo
2025-11-14 17:48   ` Michal Koutný
2025-11-14 18:18     ` Tejun Heo
2025-10-29  6:19 ` [PATCH 3/4] cgroup: Defer task cgroup unlink until after the task is done switching out Tejun Heo
2025-11-14 17:48   ` Michal Koutný
2025-10-29  6:19 ` [PATCH 4/4] sched_ext: Fix cgroup exit ordering by moving sched_ext_free() to finish_task_switch() Tejun Heo
2025-10-31 19:32   ` Andrea Righi
2025-11-03 20:25   ` [PATCH v2 " Tejun Heo
2025-11-03 20:28     ` Peter Zijlstra [this message]
2025-11-03 20:31       ` Tejun Heo
2025-11-03 20:38         ` Peter Zijlstra
2025-11-03 20:26 ` [PATCHSET cgroup/for-6.19] cgroup: Fix task exit ordering Tejun Heo
2025-11-03 22:04 ` [PATCH v2 0/4] " Tejun Heo
