* [PATCH 0/4] Task based throttle follow ups
@ 2025-09-10 9:50 Aaron Lu
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
` (6 more replies)
0 siblings, 7 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-10 9:50 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
Peter noticed an inconsistency in load propagation for throttled cfs_rq
and Ben pointed out several other places that special case throttled
cfs_rq which may no longer be needed after the task based throttle model.
To ease discussion and review, I've come up with this follow up
series which implements the individual changes.
Patch1 deals with load propagation. According to Peter and Prateek's
discussion: previously, load propagation for a throttled cfs_rq happened
at unthrottle time, but with per-task throttle that is no longer the
case, so load propagation should happen immediately or we could lose the
propagated part.
Patch2 makes update_cfs_group() continue to function for cfs_rqs in a
throttled hierarchy so that a cfs_rq's entity can get an up-to-date
weight. I think this is mostly useful when a cfs_rq in a throttled
hierarchy still has tasks running, in which case update_cfs_group() can
update this cfs_rq's entity weight on tick/enqueue/dequeue.
Patch3 removes the special treatment of tasks in a throttled hierarchy
from dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().
Patch4 inhibits load balancing to a throttled cfs_rq to make hackbench
happy.
I think patch1 is needed for correctness; patches 2-4 are open for
discussion as there are pros/cons to doing things either way. Comments
are welcome, thanks.
BTW, I also noticed there is the task_is_throttled sched class callback,
implemented in fair as task_is_throttled_fair(). IIUC, it is used by
core scheduling to find a matching cookie task to run on the sibling SMT
CPU. For that purpose, it doesn't seem very useful if the task we find
is about to be throttled, so I kept the current implementation; but I
guess it cuts both ways if that to-be-throttled task is holding some
kernel resources. Anyway, I didn't write a patch to change it in this
series, but feel free to let me know if it should be changed.
Aaron Lu (4):
sched/fair: Propagate load for throttled cfs_rq
sched/fair: update_cfs_group() for throttled cfs_rqs
sched/fair: Do not special case tasks in throttled hierarchy
sched/fair: Do not balance task to a throttled cfs_rq
kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++-----------------
1 file changed, 31 insertions(+), 19 deletions(-)
base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
--
2.39.5
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
@ 2025-09-10 9:50 ` Aaron Lu
2025-09-10 12:36 ` Chengming Zhou
` (2 more replies)
2025-09-10 9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
` (5 subsequent siblings)
6 siblings, 3 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-10 9:50 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
Before the task based throttle model, load propagation would stop at a
throttled cfs_rq and the propagation would happen at unthrottle time
via update_load_avg().
Now that there is no update_load_avg() on unthrottle for a throttled
cfs_rq and all load tracking is done by task related operations, let
the propagation happen immediately.
While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to leaf cfs_rq list in
propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay").
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..f993de30e1466 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttled;
}
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
/* check whether cfs_rq, or any parent, is throttled */
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
@@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return 0;
}
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+ return false;
+}
+
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
return 0;
@@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
- return;
-
- if (!throttled_hierarchy(cfs_rq))
+ /*
+ * If a task gets attached to this cfs_rq and before being queued,
+ * it gets migrated to another CPU due to reasons like affinity
+ * change, make sure this cfs_rq stays on leaf cfs_rq list to have
+ * that removed load decayed or it can cause a fairness problem.
+ */
+ if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
/* Start to propagate at parent */
@@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (cfs_rq_throttled(cfs_rq))
- break;
-
- if (!throttled_hierarchy(cfs_rq))
+ if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
}
--
2.39.5
* [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
@ 2025-09-10 9:50 ` Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
` (4 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-10 9:50 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
With the task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, the PELT clock is not stopped for cfs_rqs in a throttled
hierarchy when they still have tasks running or queued.
Since the PELT clock is not stopped, whether update_cfs_group() should
keep doing its job for cfs_rqs which are in a throttled hierarchy but
still have tasks running/queued is a question.
The good side is that continuing to run update_cfs_group() keeps these
cfs_rq entities' weights up to date, and an up-to-date weight can be
useful to derive an accurate load for the CPU as well as to ensure
fairness if multiple tasks of different cgroups are running on the same
CPU.
OTOH, as Benjamin Segall pointed out: when unthrottle comes around, the
most likely correct distribution is the distribution we had at the time
of throttle.
In reality, either way may not matter much if tasks in a throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, letting these cfs_rq entities have an up-to-date weight seems
a good thing to do.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e1466..58f5349d37256 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
if (!gcfs_rq || !gcfs_rq->load.weight)
return;
- if (throttled_hierarchy(gcfs_rq))
- return;
-
shares = calc_group_shares(gcfs_rq);
if (unlikely(se->load.weight != shares))
reweight_entity(cfs_rq_of(se), se, shares);
--
2.39.5
* [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
@ 2025-09-10 9:50 ` Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
` (3 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-10 9:50 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
With the introduction of the task based throttle model, a task in a
throttled hierarchy is allowed to continue to run until it gets
throttled on its ret2user path.
For this reason, remove the throttled_hierarchy() checks in the
following functions so that those tasks can get their turn like normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().
The benefit of doing it this way is: if those tasks get the chance to
run earlier and they hold any kernel resources, they can release those
resources earlier. The downside is: if they don't hold any kernel
resources, all they can do is throttle themselves on their way back to
user space, so the favor of letting them run is not that useful, and
for check_preempt_wakeup_fair(), that favor may be bad for curr.
K Prateek Nayak pointed out that prio_changed_fair() can send a
throttled task to check_preempt_wakeup_fair(), and further tests showed
the affinity change path from move_queued_task() can also send a
throttled task to check_preempt_wakeup_fair(); that's why that function
checks task_is_throttled().
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58f5349d37256..3dbdfaa697477 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7081,7 +7081,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
* Bias pick_next to pick a task from this cfs_rq, as
* p is sleeping when it is within its sched_slice.
*/
- if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+ if (task_sleep && se)
set_next_buddy(se);
break;
}
@@ -8735,7 +8735,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
- if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+ if (task_is_throttled(p))
return;
if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
@@ -9009,8 +9009,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
- /* throttled hierarchies are not runnable */
- if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
+ /* !se->on_rq also covers throttled task */
+ if (!se->on_rq)
return false;
/* Tell the scheduler that we'd really like se to run next. */
--
2.39.5
* [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
` (2 preceding siblings ...)
2025-09-10 9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
@ 2025-09-10 9:50 ` Aaron Lu
2025-09-11 2:03 ` kernel test robot
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
` (2 subsequent siblings)
6 siblings, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-10 9:50 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
When doing load balance and the target cfs_rq is in a throttled
hierarchy, whether to allow balancing there is a question.
The good side of allowing it is: if the target CPU is idle or less
loaded and the task being balanced is holding some kernel resources,
then it seems a good idea to balance the task there so it gets the CPU
earlier and releases kernel resources sooner. The bad part is: if the
task is not holding any kernel resources, then the balance seems not
that useful.
While theoretically it's debatable, a performance test[0] involving 200
cgroups, each running hackbench (20 senders, 20 receivers) in pipe
mode, showed a performance degradation on AMD Genoa when allowing load
balance to a throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across the LLC boundary. For this reason, add a
check in can_migrate_task() to forbid balancing to a cfs_rq that is in
a throttled hierarchy. This reduced task migrations a lot and
performance was restored.
[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3dbdfaa697477..00ee59993b6a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9369,14 +9369,19 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU, or
- * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
+ * 2) target cfs_rq is in throttled hierarchy, or
+ * 3) cannot be migrated to this CPU due to cpus_ptr, or
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU, or
+ * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (task_group(p) &&
+ throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
+ return 0;
+
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
--
2.39.5
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
@ 2025-09-10 12:36 ` Chengming Zhou
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
2 siblings, 0 replies; 35+ messages in thread
From: Chengming Zhou @ 2025-09-10 12:36 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
On 2025/9/10 17:50, Aaron Lu wrote:
> Before the task based throttle model, load propagation would stop at a
> throttled cfs_rq and the propagation would happen at unthrottle time
> via update_load_avg().
>
> Now that there is no update_load_avg() on unthrottle for a throttled
> cfs_rq and all load tracking is done by task related operations, let
> the propagation happen immediately.
>
> While at it, add a comment to explain why cfs_rqs that are not affected
> by throttle have to be added to leaf cfs_rq list in
> propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> ("sched/fair: Fix unfairness caused by missing load decay").
>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
LGTM!
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Thanks.
> ---
> kernel/sched/fair.c | 26 ++++++++++++++++++--------
> 1 file changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df8dc389af8e1..f993de30e1466 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> return cfs_bandwidth_used() && cfs_rq->throttled;
> }
>
> +static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
> +{
> + return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
> +}
> +
> /* check whether cfs_rq, or any parent, is throttled */
> static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> {
> @@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> return 0;
> }
>
> +static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
> +{
> + return false;
> +}
> +
> static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> {
> return 0;
> @@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> - if (cfs_rq_throttled(cfs_rq))
> - return;
> -
> - if (!throttled_hierarchy(cfs_rq))
> + /*
> + * If a task gets attached to this cfs_rq and before being queued,
> + * it gets migrated to another CPU due to reasons like affinity
> + * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> + * that removed load decayed or it can cause a fairness problem.
> + */
> + if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
>
> /* Start to propagate at parent */
> @@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - if (cfs_rq_throttled(cfs_rq))
> - break;
> -
> - if (!throttled_hierarchy(cfs_rq))
> + if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
> }
> }
* Re: [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq
2025-09-10 9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
@ 2025-09-11 2:03 ` kernel test robot
2025-09-12 3:44 ` [PATCH update " Aaron Lu
0 siblings, 1 reply; 35+ messages in thread
From: kernel test robot @ 2025-09-11 2:03 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang
Cc: llvm, oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu, Chen Yu, Matteo Martelli,
Michal Koutný, Sebastian Andrzej Siewior
Hi Aaron,
kernel test robot noticed the following build errors:
[auto build test ERROR on 5b726e9bf9544a349090879a513a5e00da486c14]
url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Propagate-load-for-throttled-cfs_rq/20250910-175310
base: 5b726e9bf9544a349090879a513a5e00da486c14
patch link: https://lore.kernel.org/r/20250910095044.278-5-ziqianlu%40bytedance.com
patch subject: [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq
config: i386-randconfig-012-20250911 (https://download.01.org/0day-ci/archive/20250911/202509110908.a2P8HZ8A-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250911/202509110908.a2P8HZ8A-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509110908.a2P8HZ8A-lkp@intel.com/
All errors (new ones prefixed by >>):
>> kernel/sched/fair.c:9382:41: error: no member named 'cfs_rq' in 'struct task_group'
9382 | throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
| ~~~~~~~~~~~~~ ^
1 error generated.
vim +9382 kernel/sched/fair.c
9356
9357 /*
9358 * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
9359 */
9360 static
9361 int can_migrate_task(struct task_struct *p, struct lb_env *env)
9362 {
9363 long degrades, hot;
9364
9365 lockdep_assert_rq_held(env->src_rq);
9366 if (p->sched_task_hot)
9367 p->sched_task_hot = 0;
9368
9369 /*
9370 * We do not migrate tasks that are:
9371 * 1) delayed dequeued unless we migrate load, or
9372 * 2) target cfs_rq is in throttled hierarchy, or
9373 * 3) cannot be migrated to this CPU due to cpus_ptr, or
9374 * 4) running (obviously), or
9375 * 5) are cache-hot on their current CPU, or
9376 * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
9377 */
9378 if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
9379 return 0;
9380
9381 if (task_group(p) &&
> 9382 throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
9383 return 0;
9384
9385 /*
9386 * We want to prioritize the migration of eligible tasks.
9387 * For ineligible tasks we soft-limit them and only allow
9388 * them to migrate when nr_balance_failed is non-zero to
9389 * avoid load-balancing trying very hard to balance the load.
9390 */
9391 if (!env->sd->nr_balance_failed &&
9392 task_is_ineligible_on_dst_cpu(p, env->dst_cpu))
9393 return 0;
9394
9395 /* Disregard percpu kthreads; they are where they need to be. */
9396 if (kthread_is_per_cpu(p))
9397 return 0;
9398
9399 if (task_is_blocked(p))
9400 return 0;
9401
9402 if (!cpumask_test_cpu(env->dst_cpu, p->cpus_ptr)) {
9403 int cpu;
9404
9405 schedstat_inc(p->stats.nr_failed_migrations_affine);
9406
9407 env->flags |= LBF_SOME_PINNED;
9408
9409 /*
9410 * Remember if this task can be migrated to any other CPU in
9411 * our sched_group. We may want to revisit it if we couldn't
9412 * meet load balance goals by pulling other tasks on src_cpu.
9413 *
9414 * Avoid computing new_dst_cpu
9415 * - for NEWLY_IDLE
9416 * - if we have already computed one in current iteration
9417 * - if it's an active balance
9418 */
9419 if (env->idle == CPU_NEWLY_IDLE ||
9420 env->flags & (LBF_DST_PINNED | LBF_ACTIVE_LB))
9421 return 0;
9422
9423 /* Prevent to re-select dst_cpu via env's CPUs: */
9424 cpu = cpumask_first_and_and(env->dst_grpmask, env->cpus, p->cpus_ptr);
9425
9426 if (cpu < nr_cpu_ids) {
9427 env->flags |= LBF_DST_PINNED;
9428 env->new_dst_cpu = cpu;
9429 }
9430
9431 return 0;
9432 }
9433
9434 /* Record that we found at least one task that could run on dst_cpu */
9435 env->flags &= ~LBF_ALL_PINNED;
9436
9437 if (task_on_cpu(env->src_rq, p) ||
9438 task_current_donor(env->src_rq, p)) {
9439 schedstat_inc(p->stats.nr_failed_migrations_running);
9440 return 0;
9441 }
9442
9443 /*
9444 * Aggressive migration if:
9445 * 1) active balance
9446 * 2) destination numa is preferred
9447 * 3) task is cache cold, or
9448 * 4) too many balance attempts have failed.
9449 */
9450 if (env->flags & LBF_ACTIVE_LB)
9451 return 1;
9452
9453 degrades = migrate_degrades_locality(p, env);
9454 if (!degrades)
9455 hot = task_hot(p, env);
9456 else
9457 hot = degrades > 0;
9458
9459 if (!hot || env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
9460 if (hot)
9461 p->sched_task_hot = 1;
9462 return 1;
9463 }
9464
9465 schedstat_inc(p->stats.nr_failed_migrations_hot);
9466 return 0;
9467 }
9468
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH 0/4] Task based throttle follow ups
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
` (3 preceding siblings ...)
2025-09-10 9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
@ 2025-09-11 10:42 ` Peter Zijlstra
2025-09-11 12:16 ` Aaron Lu
2025-09-15 21:54 ` Benjamin Segall
2025-09-19 14:37 ` Valentin Schneider
6 siblings, 1 reply; 35+ messages in thread
From: Peter Zijlstra @ 2025-09-11 10:42 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior
On Wed, Sep 10, 2025 at 05:50:40PM +0800, Aaron Lu wrote:
> Aaron Lu (4):
> sched/fair: Propagate load for throttled cfs_rq
> sched/fair: update_cfs_group() for throttled cfs_rqs
> sched/fair: Do not special case tasks in throttled hierarchy
I stuffed these first 3 into queue/sched/core, leaving this one:
> sched/fair: Do not balance task to a throttled cfs_rq
on the table, because the robot already hated on it.
If anybody disagrees with one of the queued patches, holler and I'll
make it go away. Otherwise they should show up in tip 'soonish'.
* Re: [PATCH 0/4] Task based throttle follow ups
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
@ 2025-09-11 12:16 ` Aaron Lu
0 siblings, 0 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-11 12:16 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior
On Thu, Sep 11, 2025 at 12:42:22PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 10, 2025 at 05:50:40PM +0800, Aaron Lu wrote:
>
> > Aaron Lu (4):
> > sched/fair: Propagate load for throttled cfs_rq
> > sched/fair: update_cfs_group() for throttled cfs_rqs
> > sched/fair: Do not special case tasks in throttled hierarchy
>
> I stuffed these first 3 into queue/sched/core, leaving this one:
>
> > sched/fair: Do not balance task to a throttled cfs_rq
>
> on the table, because the robot already hated on it.
Should have followed Prateek's suggestion to put it behind
CONFIG_CFS_BANDWIDTH. Will fix that and send an updated version later,
thanks.
* [PATCH update 4/4] sched/fair: Do not balance task to a throttled cfs_rq
2025-09-11 2:03 ` kernel test robot
@ 2025-09-12 3:44 ` Aaron Lu
2025-09-12 3:56 ` K Prateek Nayak
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
0 siblings, 2 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-12 3:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang, llvm, oe-kbuild-all, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior
When doing load balance and the target cfs_rq is in a throttled
hierarchy, whether to allow balancing there is a question.
The good side of allowing it is: if the target CPU is idle or less
loaded and the task being balanced is holding some kernel resources,
then it seems a good idea to balance the task there so it gets the CPU
earlier and releases kernel resources sooner. The bad part is: if the
task is not holding any kernel resources, then the balance seems not
that useful.
While theoretically it's debatable, a performance test[0] involving 200
cgroups, each running hackbench (20 senders, 20 receivers) in pipe
mode, showed a performance degradation on AMD Genoa when allowing load
balance to a throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across the LLC boundary. For this reason, add a
check in can_migrate_task() to forbid balancing to a cfs_rq that is in
a throttled hierarchy. This reduced task migrations a lot and
performance was restored.
[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
update: fix build error reported by kernel test robot when
CONFIG_FAIR_GROUP_SCHED is not set.
kernel/sched/fair.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3dbdfaa697477..18a30ae35441a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5737,6 +5737,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
+static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
+{
+ return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
+}
+
static inline bool task_is_throttled(struct task_struct *p)
{
return cfs_bandwidth_used() && p->throttled;
@@ -6733,6 +6738,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return 0;
}
+static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
+{
+ return 0;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9369,14 +9379,18 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU, or
- * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
+ * 2) target cfs_rq is in throttled hierarchy, or
+ * 3) cannot be migrated to this CPU due to cpus_ptr, or
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU, or
+ * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (lb_throttled_hierarchy(p, env->dst_cpu))
+ return 0;
+
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
--
2.39.5
* Re: [PATCH update 4/4] sched/fair: Do not balance task to a throttled cfs_rq
2025-09-12 3:44 ` [PATCH update " Aaron Lu
@ 2025-09-12 3:56 ` K Prateek Nayak
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
1 sibling, 0 replies; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-12 3:56 UTC (permalink / raw)
To: Aaron Lu, Peter Zijlstra
Cc: kernel test robot, Valentin Schneider, Ben Segall, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, llvm,
oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu, Chen Yu, Matteo Martelli,
Michal Koutný, Sebastian Andrzej Siewior
Hello Aaron,
On 9/12/2025 9:14 AM, Aaron Lu wrote:
> When doing load balance and the target cfs_rq is in a throttled
> hierarchy, whether to allow balancing there is a question.
>
> The good side of allowing it is: if the target CPU is idle or less
> loaded and the task being balanced is holding some kernel resources,
> then it seems a good idea to balance the task there so it gets the
> CPU earlier and releases kernel resources sooner. The bad part is:
> if the task is not holding any kernel resources, then the balance
> seems not that useful.
>
> While theoretically it's debatable, a performance test[0] involving
> 200 cgroups, each running hackbench (20 senders, 20 receivers) in
> pipe mode, showed a performance degradation on AMD Genoa when
> allowing load balance to a throttled cfs_rq. Analysis[1] showed
> hackbench doesn't like task migration across the LLC boundary. For
> this reason, add a check in can_migrate_task() to forbid balancing
> to a cfs_rq that is in a throttled hierarchy. This reduced task
> migrations a lot and performance was restored.
>
> [0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
> [1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Thank you for updating the patch. Feel free to include:
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
--
Thanks and Regards,
Prateek
> ---
> update: fix build error reported by kernel test robot when
> CONFIG_FAIR_GROUP_SCHED is not set.
>
> kernel/sched/fair.c | 22 ++++++++++++++++++----
> 1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3dbdfaa697477..18a30ae35441a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5737,6 +5737,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> return cfs_bandwidth_used() && cfs_rq->throttle_count;
> }
>
> +static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
> +{
> + return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
> +}
> +
> static inline bool task_is_throttled(struct task_struct *p)
> {
> return cfs_bandwidth_used() && p->throttled;
> @@ -6733,6 +6738,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
> return 0;
> }
>
> +static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
> +{
> + return 0;
> +}
> +
> #ifdef CONFIG_FAIR_GROUP_SCHED
> void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
> static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> @@ -9369,14 +9379,18 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> /*
> * We do not migrate tasks that are:
> * 1) delayed dequeued unless we migrate load, or
> - * 2) cannot be migrated to this CPU due to cpus_ptr, or
> - * 3) running (obviously), or
> - * 4) are cache-hot on their current CPU, or
> - * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
> + * 2) target cfs_rq is in throttled hierarchy, or
> + * 3) cannot be migrated to this CPU due to cpus_ptr, or
> + * 4) running (obviously), or
> + * 5) are cache-hot on their current CPU, or
> + * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
> */
> if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> return 0;
>
> + if (lb_throttled_hierarchy(p, env->dst_cpu))
> + return 0;
> +
> /*
> * We want to prioritize the migration of eligible tasks.
> * For ineligible tasks we soft-limit them and only allow
* Re: [PATCH 0/4] Task based throttle follow ups
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
` (4 preceding siblings ...)
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
@ 2025-09-15 21:54 ` Benjamin Segall
2025-09-19 14:37 ` Valentin Schneider
6 siblings, 0 replies; 35+ messages in thread
From: Benjamin Segall @ 2025-09-15 21:54 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
Aaron Lu <ziqianlu@bytedance.com> writes:
> Peter noticed the inconsistency in load propagation for throttled cfs_rq
> and Ben pointed out several other places regarding throttled cfs_rq that
> could be no longer needed after task based throttle model.
>
> To ease discussing and reviewing, I've come up with this follow up
> series which implements the individual changes.
>
> Patch1 deals with load propagation. According to Peter and Prateek's
> discussion, load propagation for a throttled cfs_rq previously happened
> at unthrottle time; with per-task throttle that is no longer the case,
> so load propagation should happen immediately or we could lose the
> propagated part.
>
> Patch2 made update_cfs_group() continue to function for cfs_rqs in a
> throttled hierarchy so that a cfs_rq's entity can get an up-to-date
> weight. I think this is mostly useful when a cfs_rq in a throttled
> hierarchy still has tasks running: on tick/enqueue/dequeue,
> update_cfs_group() can update this cfs_rq's entity weight.
>
> Patch3 removed special treatment of tasks in throttled hierarchy,
> including: dequeue_entities(), check_preempt_wakeup_fair() and
> yield_to_task_fair().
>
> Patch4 inhibited load balancing to a throttled cfs_rq to make hackbench
> happy.
>
> I think patch1 is needed for correctness; patches 2-4 are open for
> discussion as there are pros/cons to doing things either way. Comments
> are welcome, thanks.
>
> BTW, I also noticed there is the task_is_throttled sched class callback
> and in fair, it is task_is_throttled_fair(). IIUC, it is used by core
> scheduling to find a matching cookie task to run on the sibling SMT CPU.
> For this reason, it doesn't seem very useful if it finds a task that
> is about to be throttled, so I kept the current implementation; but I
> guess this also cuts both ways if that to-be-throttled task is holding
> some kernel resources. Anyway, I didn't write a patch to change it in this
> series, but feel free to let me know if it should be changed.
>
> Aaron Lu (4):
> sched/fair: Propagate load for throttled cfs_rq
> sched/fair: update_cfs_group() for throttled cfs_rqs
> sched/fair: Do not special case tasks in throttled hierarchy
> sched/fair: Do not balance task to a throttled cfs_rq
>
> kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++-----------------
> 1 file changed, 31 insertions(+), 19 deletions(-)
>
>
> base-commit: 5b726e9bf9544a349090879a513a5e00da486c14
Yeah, these all make sense to me (with v2 for patch 4).
Reviewed-by: Ben Segall <bsegall@google.com>
* [tip: sched/core] sched/fair: Do not balance task to a throttled cfs_rq
2025-09-12 3:44 ` [PATCH update " Aaron Lu
2025-09-12 3:56 ` K Prateek Nayak
@ 2025-09-16 11:43 ` tip-bot2 for Aaron Lu
1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for Aaron Lu @ 2025-09-16 11:43 UTC (permalink / raw)
To: linux-tip-commits
Cc: Aaron Lu, Peter Zijlstra (Intel), K Prateek Nayak, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 0d4eaf8caf8cd633b23e949e2996b420052c2d45
Gitweb: https://git.kernel.org/tip/0d4eaf8caf8cd633b23e949e2996b420052c2d45
Author: Aaron Lu <ziqianlu@bytedance.com>
AuthorDate: Fri, 12 Sep 2025 11:44:28 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Sep 2025 09:38:38 +02:00
sched/fair: Do not balance task to a throttled cfs_rq
When doing load balance and the target cfs_rq is in a throttled
hierarchy, whether to allow balancing there is an open question.
The case for allowing it: if the target CPU is idle or less loaded and
the task being balanced is holding some kernel resources, then it seems
a good idea to balance the task there so it gets the CPU earlier and
releases those kernel resources sooner. The case against: if the task
is not holding any kernel resources, the balance does not seem that
useful.
While theoretically it's debatable, a performance test[0] involving 200
cgroups, each running hackbench (20 senders, 20 receivers) in pipe
mode, showed a performance degradation on AMD Genoa when allowing load
balance to a throttled cfs_rq. Analysis[1] showed hackbench doesn't
like task migration across the LLC boundary. For this reason, add a
check in can_migrate_task() to forbid balancing to a cfs_rq that is in
a throttled hierarchy. This reduced task migration a lot and restored
performance.
[0]: https://lore.kernel.org/lkml/20250822110701.GB289@bytedance/
[1]: https://lore.kernel.org/lkml/20250903101102.GB42@bytedance/
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
kernel/sched/fair.c | 22 ++++++++++++++++++----
1 file changed, 18 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3dbdfaa..18a30ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5737,6 +5737,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
+static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
+{
+ return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
+}
+
static inline bool task_is_throttled(struct task_struct *p)
{
return cfs_bandwidth_used() && p->throttled;
@@ -6733,6 +6738,11 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return 0;
}
+static inline int lb_throttled_hierarchy(struct task_struct *p, int dst_cpu)
+{
+ return 0;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9369,14 +9379,18 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) cannot be migrated to this CPU due to cpus_ptr, or
- * 3) running (obviously), or
- * 4) are cache-hot on their current CPU, or
- * 5) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
+ * 2) target cfs_rq is in throttled hierarchy, or
+ * 3) cannot be migrated to this CPU due to cpus_ptr, or
+ * 4) running (obviously), or
+ * 5) are cache-hot on their current CPU, or
+ * 6) are blocked on mutexes (if SCHED_PROXY_EXEC is enabled)
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (lb_throttled_hierarchy(p, env->dst_cpu))
+ return 0;
+
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
* [tip: sched/core] sched/fair: Do not special case tasks in throttled hierarchy
2025-09-10 9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
@ 2025-09-16 11:43 ` tip-bot2 for Aaron Lu
0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Aaron Lu @ 2025-09-16 11:43 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Aaron Lu, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 253b3f587241967a97a971e23b1e2a7d74244fad
Gitweb: https://git.kernel.org/tip/253b3f587241967a97a971e23b1e2a7d74244fad
Author: Aaron Lu <ziqianlu@bytedance.com>
AuthorDate: Wed, 10 Sep 2025 17:50:43 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Sep 2025 09:38:37 +02:00
sched/fair: Do not special case tasks in throttled hierarchy
With the introduction of the task based throttle model, a task in a
throttled hierarchy is allowed to continue running until it gets
throttled on its ret2user path.
For this reason, remove the throttled_hierarchy() checks in the
following functions so that those tasks can get their turn like normal
tasks: dequeue_entities(), check_preempt_wakeup_fair() and
yield_to_task_fair().
The benefit of doing it this way: if those tasks get the chance to run
earlier and hold any kernel resources, they can release those resources
earlier. The downside: if they don't hold any kernel resources, all
they can do is throttle themselves on their way back to user space, so
the favor of letting them run seems not that useful; and for
check_preempt_wakeup_fair(), that favor may be bad for curr.
K Prateek Nayak pointed out that prio_changed_fair() can send a
throttled task to check_preempt_wakeup_fair(); further tests showed the
affinity change path from move_queued_task() can also send a throttled
task to check_preempt_wakeup_fair(), hence the task_is_throttled()
check in that function.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 58f5349..3dbdfaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7081,7 +7081,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
* Bias pick_next to pick a task from this cfs_rq, as
* p is sleeping when it is within its sched_slice.
*/
- if (task_sleep && se && !throttled_hierarchy(cfs_rq))
+ if (task_sleep && se)
set_next_buddy(se);
break;
}
@@ -8735,7 +8735,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
- if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
+ if (task_is_throttled(p))
return;
if (sched_feat(NEXT_BUDDY) && !(wake_flags & WF_FORK) && !pse->sched_delayed) {
@@ -9009,8 +9009,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
{
struct sched_entity *se = &p->se;
- /* throttled hierarchies are not runnable */
- if (!se->on_rq || throttled_hierarchy(cfs_rq_of(se)))
+ /* !se->on_rq also covers throttled task */
+ if (!se->on_rq)
return false;
/* Tell the scheduler that we'd really like se to run next. */
* [tip: sched/core] sched/fair: update_cfs_group() for throttled cfs_rqs
2025-09-10 9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
@ 2025-09-16 11:43 ` tip-bot2 for Aaron Lu
0 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Aaron Lu @ 2025-09-16 11:43 UTC (permalink / raw)
To: linux-tip-commits; +Cc: Aaron Lu, Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fcd394866e3db344cbe0bb485d7e3f741ac07245
Gitweb: https://git.kernel.org/tip/fcd394866e3db344cbe0bb485d7e3f741ac07245
Author: Aaron Lu <ziqianlu@bytedance.com>
AuthorDate: Wed, 10 Sep 2025 17:50:42 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Sep 2025 09:38:37 +02:00
sched/fair: update_cfs_group() for throttled cfs_rqs
With the task based throttle model, tasks in a throttled hierarchy are
allowed to continue to run if they are running in kernel mode. For this
reason, the PELT clock is not stopped for cfs_rqs in a throttled
hierarchy while they still have tasks running or queued.
Since the PELT clock is not stopped, whether update_cfs_group() should
keep doing its job for cfs_rqs which are in a throttled hierarchy but
still have tasks running/queued is a question.
The good side: continuing to run update_cfs_group() keeps these cfs_rq
entities' weights up to date, which can help derive an accurate load
for the CPU and ensure fairness if multiple tasks of different cgroups
are running on the same CPU.
OTOH, as Benjamin Segall pointed out: when unthrottle comes around, the
most likely correct distribution is the distribution we had at the time
of throttle.
In reality, either way may not matter that much if tasks in a throttled
hierarchy don't run in kernel mode for too long. But in case that
happens, letting these cfs_rq entities have an up-to-date weight seems
a good thing to do.
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
kernel/sched/fair.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de3..58f5349 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3957,9 +3957,6 @@ static void update_cfs_group(struct sched_entity *se)
if (!gcfs_rq || !gcfs_rq->load.weight)
return;
- if (throttled_hierarchy(gcfs_rq))
- return;
-
shares = calc_group_shares(gcfs_rq);
if (unlikely(se->load.weight != shares))
reweight_entity(cfs_rq_of(se), se, shares);
* [tip: sched/core] sched/fair: Propagate load for throttled cfs_rq
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 12:36 ` Chengming Zhou
@ 2025-09-16 11:43 ` tip-bot2 for Aaron Lu
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
2 siblings, 0 replies; 35+ messages in thread
From: tip-bot2 for Aaron Lu @ 2025-09-16 11:43 UTC (permalink / raw)
To: linux-tip-commits
Cc: Aaron Lu, Peter Zijlstra (Intel), Chengming Zhou, x86,
linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: fe8d238e646e16cc431b7a5899f8dda690258ee9
Gitweb: https://git.kernel.org/tip/fe8d238e646e16cc431b7a5899f8dda690258ee9
Author: Aaron Lu <ziqianlu@bytedance.com>
AuthorDate: Wed, 10 Sep 2025 17:50:41 +08:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 15 Sep 2025 09:38:37 +02:00
sched/fair: Propagate load for throttled cfs_rq
Before the task based throttle model, load propagation would stop at a
throttled cfs_rq and happen at unthrottle time via update_load_avg().
Now that there is no update_load_avg() on unthrottle for a throttled
cfs_rq and all load tracking is done by task related operations, let
the propagation happen immediately.
While at it, add a comment to explain why cfs_rqs that are not affected
by throttle have to be added to the leaf cfs_rq list in
propagate_entity_cfs_rq(), per my understanding of commit 0258bdfaff5b
("sched/fair: Fix unfairness caused by missing load decay").
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
---
kernel/sched/fair.c | 26 ++++++++++++++++++--------
1 file changed, 18 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc38..f993de3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,6 +5729,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttled;
}
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+ return cfs_bandwidth_used() && cfs_rq->pelt_clock_throttled;
+}
+
/* check whether cfs_rq, or any parent, is throttled */
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
@@ -6721,6 +6726,11 @@ static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
return 0;
}
+static inline bool cfs_rq_pelt_clock_throttled(struct cfs_rq *cfs_rq)
+{
+ return false;
+}
+
static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
return 0;
@@ -13151,10 +13161,13 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
- if (cfs_rq_throttled(cfs_rq))
- return;
-
- if (!throttled_hierarchy(cfs_rq))
+ /*
+ * If a task gets attached to this cfs_rq and before being queued,
+ * it gets migrated to another CPU due to reasons like affinity
+ * change, make sure this cfs_rq stays on leaf cfs_rq list to have
+ * that removed load decayed or it can cause faireness problem.
+ */
+ if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
/* Start to propagate at parent */
@@ -13165,10 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (cfs_rq_throttled(cfs_rq))
- break;
-
- if (!throttled_hierarchy(cfs_rq))
+ if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
}
* Re: [PATCH 0/4] Task based throttle follow ups
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
` (5 preceding siblings ...)
2025-09-15 21:54 ` Benjamin Segall
@ 2025-09-19 14:37 ` Valentin Schneider
6 siblings, 0 replies; 35+ messages in thread
From: Valentin Schneider @ 2025-09-19 14:37 UTC (permalink / raw)
To: Aaron Lu, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
On 10/09/25 17:50, Aaron Lu wrote:
> Peter noticed the inconsistency in load propagation for throttled cfs_rq
> and Ben pointed out several other places regarding throttled cfs_rq that
> could be no longer needed after task based throttle model.
>
> To ease discussing and reviewing, I've come up with this follow up
> series which implements the individual changes.
>
> Patch1 deals with load propagation. According to Peter and Prateek's
> discussion, load propagation for a throttled cfs_rq previously happened
> at unthrottle time; with per-task throttle that is no longer the case,
> so load propagation should happen immediately or we could lose the
> propagated part.
>
> Patch2 made update_cfs_group() continue to function for cfs_rqs in a
> throttled hierarchy so that a cfs_rq's entity can get an up-to-date
> weight. I think this is mostly useful when a cfs_rq in a throttled
> hierarchy still has tasks running: on tick/enqueue/dequeue,
> update_cfs_group() can update this cfs_rq's entity weight.
>
> Patch3 removed special treatment of tasks in throttled hierarchy,
> including: dequeue_entities(), check_preempt_wakeup_fair() and
> yield_to_task_fair().
>
> Patch4 inhibited load balancing to a throttled cfs_rq to make hackbench
> happy.
>
> I think patch1 is needed for correctness; patches 2-4 are open for
> discussion as there are pros/cons to doing things either way. Comments
> are welcome, thanks.
>
> BTW, I also noticed there is the task_is_throttled sched class callback
> and in fair, it is task_is_throttled_fair(). IIUC, it is used by core
> scheduling to find a matching cookie task to run on the sibling SMT CPU.
> For this reason, it doesn't seem very useful if it finds a task that
> is about to be throttled, so I kept the current implementation; but I
> guess this also cuts both ways if that to-be-throttled task is holding
> some kernel resources. Anyway, I didn't write a patch to change it in this
> series, but feel free to let me know if it should be changed.
>
I saw these already made it to tip/sched/core, but FWIW that all LGTM.
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 12:36 ` Chengming Zhou
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
@ 2025-09-23 13:05 ` Matteo Martelli
2025-09-24 11:33 ` Aaron Lu
2025-09-29 7:51 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2 siblings, 2 replies; 35+ messages in thread
From: Matteo Martelli @ 2025-09-23 13:05 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior,
Matteo Martelli
Hi Aaron,
On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> Before the task based throttle model, load propagation would stop at a
> throttled cfs_rq and happen at unthrottle time via update_load_avg().
>
> Now that there is no update_load_avg() on unthrottle for a throttled
> cfs_rq and all load tracking is done by task related operations, let
> the propagation happen immediately.
>
> While at it, add a comment to explain why cfs_rqs that are not affected
> by throttle have to be added to the leaf cfs_rq list in
> propagate_entity_cfs_rq(), per my understanding of commit 0258bdfaff5b
> ("sched/fair: Fix unfairness caused by missing load decay").
>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
when task exits to user" [1] together with these follow up patches. I
found out that with this patch the kernel sometimes produces the warning
WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
reproduce this systematically by applying both [1] and this patch on top
of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
from the sched/core branch of the tip tree. I couldn't reproduce the
warning by testing at commit 5b726e9bf954 ("sched/fair: Get rid of
throttled_lb_pair()").
The test setup is the same used in my previous testing for v3 [2], where
the CFS throttling events are mostly triggered by the first ssh logins
into the system as the systemd user slice is configured with CPUQuota of
25%. Also note that the same systemd user slice is configured with CPU
affinity set to only one core. I added some tracing to trace functions
throttle_cfs_rq, tg_throttle_down, unthrottle_cfs_rq, tg_unthrottle_up,
and it looks like the warning is triggered after the last unthrottle
event, however I'm not sure the warning is actually related to the
printed trace below or not. See the following logs that contains both
the traced function events and the kernel warning.
[ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865040: throttle_cfs_rq <-pick_task_fair
[ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865042: tg_throttle_down <-walk_tg_tree_from
[ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865042: tg_throttle_down <-walk_tg_tree_from
[ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865043: tg_throttle_down <-walk_tg_tree_from
[ 17.876999] ktimers/0-15 [000] d.s13 17.882601: unthrottle_cfs_rq <-distribute_cfs_runtime
[ 17.876999] ktimers/0-15 [000] d.s13 17.882603: tg_unthrottle_up <-walk_tg_tree_from
[ 17.876999] ktimers/0-15 [000] d.s13 17.882605: tg_unthrottle_up <-walk_tg_tree_from
[ 17.876999] ktimers/0-15 [000] d.s13 17.882605: tg_unthrottle_up <-walk_tg_tree_from
[ 17.910250] systemd-999 [000] dN.2. 17.916019: throttle_cfs_rq <-put_prev_entity
[ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
[ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
[ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
[ 17.977245] ktimers/0-15 [000] d.s13 17.982575: unthrottle_cfs_rq <-distribute_cfs_runtime
[ 17.977245] ktimers/0-15 [000] d.s13 17.982578: tg_unthrottle_up <-walk_tg_tree_from
[ 17.977245] ktimers/0-15 [000] d.s13 17.982579: tg_unthrottle_up <-walk_tg_tree_from
[ 17.977245] ktimers/0-15 [000] d.s13 17.982580: tg_unthrottle_up <-walk_tg_tree_from
[ 18.009244] systemd-999 [000] dN.2. 18.015030: throttle_cfs_rq <-pick_task_fair
[ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
[ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
[ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
[ 18.076822] ktimers/0-15 [000] d.s13 18.082607: unthrottle_cfs_rq <-distribute_cfs_runtime
[ 18.076822] ktimers/0-15 [000] d.s13 18.082609: tg_unthrottle_up <-walk_tg_tree_from
[ 18.076822] ktimers/0-15 [000] d.s13 18.082611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.076822] ktimers/0-15 [000] d.s13 18.082611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.109820] systemd-999 [000] dN.2. 18.115604: throttle_cfs_rq <-put_prev_entity
[ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
[ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
[ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
[ 18.177167] ktimers/0-15 [000] d.s13 18.182630: unthrottle_cfs_rq <-distribute_cfs_runtime
[ 18.177167] ktimers/0-15 [000] d.s13 18.182632: tg_unthrottle_up <-walk_tg_tree_from
[ 18.177167] ktimers/0-15 [000] d.s13 18.182633: tg_unthrottle_up <-walk_tg_tree_from
[ 18.177167] ktimers/0-15 [000] d.s13 18.182634: tg_unthrottle_up <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226594: throttle_cfs_rq <-pick_task_fair
[ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
[ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282606: unthrottle_cfs_rq <-distribute_cfs_runtime
[ 18.276886] ktimers/0-15 [000] d.s13 18.282608: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282610: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282610: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
[ 18.421349] ------------[ cut here ]------------
[ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
[ 18.421355] Modules linked in: efivarfs
[ 18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
[ 18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
[ 18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
[ 18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
[ 18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
[ 18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
[ 18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
[ 18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
[ 18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
[ 18.421375] FS: 00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
[ 18.421376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
[ 18.421377] Call Trace:
[ 18.421383] <TASK>
[ 18.421387] enqueue_task+0x31/0x70
[ 18.421389] ttwu_do_activate+0x73/0x220
[ 18.421391] try_to_wake_up+0x2b1/0x7a0
[ 18.421393] ? kmem_cache_alloc_node_noprof+0x7f/0x210
[ 18.421396] ep_autoremove_wake_function+0x12/0x40
[ 18.421400] __wake_up_common+0x72/0xa0
[ 18.421402] __wake_up_sync+0x38/0x50
[ 18.421404] ep_poll_callback+0xd2/0x240
[ 18.421406] __wake_up_common+0x72/0xa0
[ 18.421407] __wake_up_sync_key+0x3f/0x60
[ 18.421409] sock_def_readable+0x42/0xc0
[ 18.421414] unix_dgram_sendmsg+0x48f/0x840
[ 18.421420] ____sys_sendmsg+0x31c/0x350
[ 18.421423] ___sys_sendmsg+0x99/0xe0
[ 18.421425] __sys_sendmsg+0x8a/0xf0
[ 18.421429] do_syscall_64+0xa4/0x260
[ 18.421434] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 18.421438] RIP: 0033:0x7fb07e8d4d94
[ 18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
[ 18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
[ 18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
[ 18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
[ 18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
[ 18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
[ 18.421445] </TASK>
[ 18.421446] ---[ end trace 0000000000000000 ]---
[1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
[2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
I hope this is helpful. I'm happy to provide more information or run
additional tests if needed.
Best regards,
Matteo Martelli
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
@ 2025-09-24 11:33 ` Aaron Lu
2025-09-25 8:17 ` K Prateek Nayak
2025-09-29 7:51 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
1 sibling, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-24 11:33 UTC (permalink / raw)
To: Matteo Martelli
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
Hi Matteo,
On Tue, Sep 23, 2025 at 03:05:29PM +0200, Matteo Martelli wrote:
> Hi Aaron,
>
> On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> > Before task based throttle model, propagating load will stop at a
> > throttled cfs_rq and that propagate will happen on unthrottle time by
> > update_load_avg().
> >
> > Now that there is no update_load_avg() on unthrottle for throttled
> > cfs_rq and all load tracking is done by task related operations, let the
> > propagate happen immediately.
> >
> > While at it, add a comment to explain why cfs_rqs that are not affected
> > by throttle have to be added to leaf cfs_rq list in
> > propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> > ("sched/fair: Fix unfairness caused by missing load decay").
> >
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
>
> I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
> when task exits to user" [1] together with these follow up patches. I
> found out that with this patch the kernel sometimes produces the warning
> WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
> assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
> reproduce this systematically by applying both [1] and this patch on top
> of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
> from sched/core branch of tip tree. I couldn't reproduce the warning by
> testing at commit 5b726e9bf954 ("sched/fair: Get rid of
> throttled_lb_pair()").
>
Thanks a lot for the test.
> The test setup is the same used in my previous testing for v3 [2], where
> the CFS throttling events are mostly triggered by the first ssh logins
> into the system as the systemd user slice is configured with CPUQuota of
> 25%. Also note that the same systemd user slice is configured with CPU
I tried to replicate this setup; below is my setup using a 4 CPU VM
and an RT kernel at commit fe8d238e646e ("sched/fair: Propagate load for
throttled cfs_rq"):
# pwd
/sys/fs/cgroup/user.slice
# cat cpu.max
25000 100000
# cat cpuset.cpus
0
I then logged in via ssh as a normal user and could see throttling
happen, but couldn't hit this warning. Do you have to do something
special to trigger it?
> affinity set to only one core. I added some tracing to trace functions
> throttle_cfs_rq, tg_throttle_down, unthrottle_cfs_rq, tg_unthrottle_up,
> and it looks like the warning is triggered after the last unthrottle
> event, however I'm not sure the warning is actually related to the
> printed trace below or not. See the following logs that contains both
> the traced function events and the kernel warning.
>
> [ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865040: throttle_cfs_rq <-pick_task_fair
> [ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865042: tg_throttle_down <-walk_tg_tree_from
> [ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865042: tg_throttle_down <-walk_tg_tree_from
> [ 17.859264] systemd-xdg-aut-1006 [000] dN.2. 17.865043: tg_throttle_down <-walk_tg_tree_from
> [ 17.876999] ktimers/0-15 [000] d.s13 17.882601: unthrottle_cfs_rq <-distribute_cfs_runtime
> [ 17.876999] ktimers/0-15 [000] d.s13 17.882603: tg_unthrottle_up <-walk_tg_tree_from
> [ 17.876999] ktimers/0-15 [000] d.s13 17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [ 17.876999] ktimers/0-15 [000] d.s13 17.882605: tg_unthrottle_up <-walk_tg_tree_from
> [ 17.910250] systemd-999 [000] dN.2. 17.916019: throttle_cfs_rq <-put_prev_entity
> [ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
> [ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
> [ 17.910250] systemd-999 [000] dN.2. 17.916025: tg_throttle_down <-walk_tg_tree_from
> [ 17.977245] ktimers/0-15 [000] d.s13 17.982575: unthrottle_cfs_rq <-distribute_cfs_runtime
> [ 17.977245] ktimers/0-15 [000] d.s13 17.982578: tg_unthrottle_up <-walk_tg_tree_from
> [ 17.977245] ktimers/0-15 [000] d.s13 17.982579: tg_unthrottle_up <-walk_tg_tree_from
> [ 17.977245] ktimers/0-15 [000] d.s13 17.982580: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.009244] systemd-999 [000] dN.2. 18.015030: throttle_cfs_rq <-pick_task_fair
> [ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
> [ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
> [ 18.009244] systemd-999 [000] dN.2. 18.015033: tg_throttle_down <-walk_tg_tree_from
> [ 18.076822] ktimers/0-15 [000] d.s13 18.082607: unthrottle_cfs_rq <-distribute_cfs_runtime
> [ 18.076822] ktimers/0-15 [000] d.s13 18.082609: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.076822] ktimers/0-15 [000] d.s13 18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.076822] ktimers/0-15 [000] d.s13 18.082611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.109820] systemd-999 [000] dN.2. 18.115604: throttle_cfs_rq <-put_prev_entity
> [ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
> [ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
> [ 18.109820] systemd-999 [000] dN.2. 18.115609: tg_throttle_down <-walk_tg_tree_from
> [ 18.177167] ktimers/0-15 [000] d.s13 18.182630: unthrottle_cfs_rq <-distribute_cfs_runtime
> [ 18.177167] ktimers/0-15 [000] d.s13 18.182632: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.177167] ktimers/0-15 [000] d.s13 18.182633: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.177167] ktimers/0-15 [000] d.s13 18.182634: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226594: throttle_cfs_rq <-pick_task_fair
> [ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226597: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
> [ 18.220827] systemd-999 [000] dN.2. 18.226598: tg_throttle_down <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282606: unthrottle_cfs_rq <-distribute_cfs_runtime
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282608: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282610: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.276886] ktimers/0-15 [000] d.s13 18.282611: tg_unthrottle_up <-walk_tg_tree_from
> [ 18.421349] ------------[ cut here ]------------
> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
I stared at the code and haven't been able to figure out when
enqueue_task_fair() would end up with a broken leaf cfs_rq list.
No matter what the culprit commit did, enqueue_task_fair() should always
get all the non-queued cfs_rqs onto the list in a bottom-up way. It
should either add the whole hierarchy to the rq's leaf cfs_rq list, or
stop at one of the ancestor cfs_rqs that is already on the list. Either
way, the list should not be broken.
> [ 18.421355] Modules linked in: efivarfs
> [ 18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
> [ 18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> [ 18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
> [ 18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
> [ 18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
> [ 18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
> [ 18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
> [ 18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
> [ 18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
> [ 18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
> [ 18.421375] FS: 00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
> [ 18.421376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
> [ 18.421377] Call Trace:
> [ 18.421383] <TASK>
> [ 18.421387] enqueue_task+0x31/0x70
> [ 18.421389] ttwu_do_activate+0x73/0x220
> [ 18.421391] try_to_wake_up+0x2b1/0x7a0
> [ 18.421393] ? kmem_cache_alloc_node_noprof+0x7f/0x210
> [ 18.421396] ep_autoremove_wake_function+0x12/0x40
> [ 18.421400] __wake_up_common+0x72/0xa0
> [ 18.421402] __wake_up_sync+0x38/0x50
> [ 18.421404] ep_poll_callback+0xd2/0x240
> [ 18.421406] __wake_up_common+0x72/0xa0
> [ 18.421407] __wake_up_sync_key+0x3f/0x60
> [ 18.421409] sock_def_readable+0x42/0xc0
> [ 18.421414] unix_dgram_sendmsg+0x48f/0x840
> [ 18.421420] ____sys_sendmsg+0x31c/0x350
> [ 18.421423] ___sys_sendmsg+0x99/0xe0
> [ 18.421425] __sys_sendmsg+0x8a/0xf0
> [ 18.421429] do_syscall_64+0xa4/0x260
> [ 18.421434] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 18.421438] RIP: 0033:0x7fb07e8d4d94
> [ 18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
> [ 18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
> [ 18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
> [ 18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
> [ 18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
> [ 18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
> [ 18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
> [ 18.421445] </TASK>
> [ 18.421446] ---[ end trace 0000000000000000 ]---
>
> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
>
> I hope this is helpful. I'm happy to provide more information or run
> additional tests if needed.
Yeah, definitely helpful, thanks.
While looking at this commit, I'm thinking maybe we shouldn't use
cfs_rq_pelt_clock_throttled() to decide whether a cfs_rq should be added
to the rq's leaf list. The reason is that a cfs_rq in a throttled
hierarchy can be removed from the leaf list by dequeue_entity() when it
has no entities left. So even if it's on the list now, that doesn't
mean it will still be on the list at unthrottle time.
Considering that the purpose is to have the cfs_rq and its ancestors
added to the list in case this cfs_rq has some removed load that needs
to be decayed later, as described in commit 0258bdfaff5b ("sched/fair:
Fix unfairness caused by missing load decay"), I'm thinking maybe we
should treat cfs_rqs differently according to whether they are in a
throttled hierarchy or not:
- for a cfs_rq not in a throttled hierarchy, add it and its ancestors to
the list so that the removed load can be decayed;
- for a cfs_rq in a throttled hierarchy, check at unthrottle time whether
it has any removed load that needs to be decayed.
The case in my mind is: a blocked task @p gets attached to a throttled
cfs_rq by writing its pid to a cgroup. Assume the cfs_rq was empty, with
no tasks throttled or queued underneath it. Then @p is migrated to
another cpu before being queued there, so this cfs_rq now has some
removed load on it. On unthrottle, this cfs_rq is considered fully
decayed and isn't added to the leaf cfs_rq list. Then we have a problem.
With the above said, I'm thinking of the diff below. No idea whether it
can fix Matteo's problem though; it's just something I think can fix the
issue I described above, if I understand things correctly...
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e1466..444f0eb2df71d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
if (child_cfs_rq_on_list(cfs_rq))
return false;
+ if (cfs_rq->removed.nr)
+ return false;
+
return true;
}
@@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
* change, make sure this cfs_rq stays on leaf cfs_rq list to have
* that removed load decayed or it can cause faireness problem.
*/
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (!throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
/* Start to propagate at parent */
@@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (!throttled_hierarchy(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
}
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-24 11:33 ` Aaron Lu
@ 2025-09-25 8:17 ` K Prateek Nayak
2025-09-25 9:29 ` Aaron Lu
0 siblings, 1 reply; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-25 8:17 UTC (permalink / raw)
To: Aaron Lu, Matteo Martelli
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Michal Koutný, Sebastian Andrzej Siewior
Hello Aaron, Matteo,
On 9/24/2025 5:03 PM, Aaron Lu wrote:
>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>
> I stared at the code and haven't been able to figure out when
> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
Yeah neither could I. I tried running with PREEMPT_RT too and still
couldn't trigger it :(
But I'm wondering if all we are missing is:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e146..5f9e7b4df391 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+ cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
---
This is the only way we can currently have a break in the
cfs_rq_pelt_clock_throttled() hierarchy.
>
> No matter what the culprit commit did, enqueue_task_fair() should always
> get all the non-queued cfs_rqs on the list in a bottom up way. It should
> either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
> of the ancestor cfs_rqs which is already on the list. Either way, the
> list should not be broken.
>
>> [ 18.421355] Modules linked in: efivarfs
>> [ 18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
>> [ 18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
>> [ 18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
>> [ 18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
>> [ 18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
>> [ 18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
>> [ 18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
>> [ 18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
>> [ 18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
>> [ 18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
>> [ 18.421375] FS: 00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
>> [ 18.421376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
>> [ 18.421377] Call Trace:
>> [ 18.421383] <TASK>
>> [ 18.421387] enqueue_task+0x31/0x70
>> [ 18.421389] ttwu_do_activate+0x73/0x220
>> [ 18.421391] try_to_wake_up+0x2b1/0x7a0
>> [ 18.421393] ? kmem_cache_alloc_node_noprof+0x7f/0x210
>> [ 18.421396] ep_autoremove_wake_function+0x12/0x40
>> [ 18.421400] __wake_up_common+0x72/0xa0
>> [ 18.421402] __wake_up_sync+0x38/0x50
>> [ 18.421404] ep_poll_callback+0xd2/0x240
>> [ 18.421406] __wake_up_common+0x72/0xa0
>> [ 18.421407] __wake_up_sync_key+0x3f/0x60
>> [ 18.421409] sock_def_readable+0x42/0xc0
>> [ 18.421414] unix_dgram_sendmsg+0x48f/0x840
>> [ 18.421420] ____sys_sendmsg+0x31c/0x350
>> [ 18.421423] ___sys_sendmsg+0x99/0xe0
>> [ 18.421425] __sys_sendmsg+0x8a/0xf0
>> [ 18.421429] do_syscall_64+0xa4/0x260
>> [ 18.421434] entry_SYSCALL_64_after_hwframe+0x77/0x7f
>> [ 18.421438] RIP: 0033:0x7fb07e8d4d94
>> [ 18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
>> [ 18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
>> [ 18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
>> [ 18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
>> [ 18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
>> [ 18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
>> [ 18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
>> [ 18.421445] </TASK>
>> [ 18.421446] ---[ end trace 0000000000000000 ]---
>>
>> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
>> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
>>
>> I hope this is helpful. I'm happy to provide more information or run
>> additional tests if needed.
>
> Yeah, definitely helpful, thanks.
>
> While looking at this commit, I'm thinking maybe we shouldn't use
> cfs_rq_pelt_clock_throttled() to decide if cfs_rq should be added
> to rq's leaf list. The reason is, for a cfs_rq that is in throttled
> hierarchy, it can be removed from that leaf list when it has no entities
> left in dequeue_entity(). So even when it's on the list now doesn't
> mean it will still be on the list at unthrottle time.
>
> Considering that the purpose is to have cfs_rq and its ancestors to be
> added to the list in case this cfs_rq may have some removed load that
> needs to be decayed later as described in commit 0258bdfaff5b("sched/fair:
> Fix unfairness caused by missing load decay"), I'm thinking maybe we
> should deal with cfs_rqs differently according to whether it is in
> throttled hierarchy or not:
> - for cfs_rqs not in throttled hierarchy, add it and its ancestors to
> the list so that the removed load can be decayed;
> - for cfs_rqs in throttled hierarchy, check on unthrottle time whether
> it has any removed load that needs to be decayed.
> The case in my mind is: an blocked task @p gets attached to a throttled
> cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was empty, had
> no tasks throttled or queued underneath it. Then @p is migrated to
> another cpu before being queued on it, so this cfs_rq now has some
> removed load on it. On unthrottle, this cfs_rq is considered fully
> decayed and isn't added to leaf cfs_rq list. Then we have a problem.
>
> With the above said, I'm thinking the below diff. No idea if this can
> fix Matteo's problem though, it's just something I think can fix the
> issue I described above, if I understand things correctly...
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..444f0eb2df71d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> if (child_cfs_rq_on_list(cfs_rq))
> return false;
>
> + if (cfs_rq->removed.nr)
> + return false;
If load_avg_is_decayed(), then having removed load makes no difference,
right? We are not adding any weight to the tg, and the sum/avg cannot go
negative, so we are essentially removing nothing.
And update_load_avg() would propagate the removed load anyway, so does
this make a difference?
> +
> return true;
> }
>
> @@ -13167,7 +13170,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> * that removed load decayed or it can cause faireness problem.
> */
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (!throttled_hierarchy(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
>
> /* Start to propagate at parent */
> @@ -13178,7 +13181,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (!throttled_hierarchy(cfs_rq))
> list_add_leaf_cfs_rq(cfs_rq);
> }
> }
--
Thanks and Regards,
Prateek
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-25 8:17 ` K Prateek Nayak
@ 2025-09-25 9:29 ` Aaron Lu
2025-09-25 11:22 ` K Prateek Nayak
0 siblings, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-25 9:29 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Matteo Martelli, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
Hi Prateek,
On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> Hello Aaron, Matteo,
>
> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >
> > I stared at the code and haven't been able to figure out when
> > enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>
> Yeah neither could I. I tried running with PREEMPT_RT too and still
> couldn't trigger it :(
>
> But I'm wondering if all we are missing is:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e146..5f9e7b4df391 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>
> cfs_rq->throttle_count = pcfs_rq->throttle_count;
> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> }
>
> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> ---
>
> This is the only way we can currently have a break in
> cfs_rq_pelt_clock_throttled() hierarchy.
>
Great finding! Yes, that was missed.
With this info, I'm able to trigger the assert in
enqueue_task_fair(). The stack is different from Matteo's: his is from
the ttwu path while mine is from exit. Anyway, let me do more analysis
and get back to you:
[ 67.041905] ------------[ cut here ]------------
[ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
[ 67.043227] Modules linked in:
[ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
[ 67.044694] Tainted: [W]=WARN
[ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
[ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
[ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
[ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
[ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
[ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
[ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
[ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
[ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
[ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
[ 67.054083] Call Trace:
[ 67.054334] <TASK>
[ 67.054546] enqueue_task+0x35/0xd0
[ 67.054885] sched_move_task+0x291/0x370
[ 67.055268] ? kmem_cache_free+0x2d9/0x480
[ 67.055669] do_exit+0x204/0x4f0
[ 67.055984] ? lock_release+0x10a/0x170
[ 67.056356] do_group_exit+0x36/0xa0
[ 67.056714] __x64_sys_exit_group+0x18/0x20
[ 67.057121] x64_sys_call+0x14fa/0x1720
[ 67.057502] do_syscall_64+0x6a/0x2d0
[ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> >
> > No matter what the culprit commit did, enqueue_task_fair() should always
> > get all the non-queued cfs_rqs on the list in a bottom up way. It should
> > either add the whole hierarchy to rq's leaf cfs_rq list, or stop at one
> > of the ancestor cfs_rqs which is already on the list. Either way, the
> > list should not be broken.
> >
> >> [ 18.421355] Modules linked in: efivarfs
> >> [ 18.421360] CPU: 0 UID: 0 PID: 1 Comm: systemd Not tainted 6.17.0-rc4-00010-gfe8d238e646e #2 PREEMPT_{RT,(full)}
> >> [ 18.421362] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
> >> [ 18.421364] RIP: 0010:enqueue_task_fair+0x925/0x980
> >> [ 18.421366] Code: b5 48 01 00 00 49 89 95 48 01 00 00 49 89 bd 50 01 00 00 48 89 37 48 89 b0 70 0a 00 00 48 89 90 78 0a 00 00 e9 49 fa ff ff 90 <0f> 0b 90 e9 1c f9 ff ff 90 0f 0b 90 e9 59 fa ff ff 48 8b b0 88 0a
> >> [ 18.421367] RSP: 0018:ffff9c7c8001fa20 EFLAGS: 00010087
> >> [ 18.421369] RAX: ffff9358fdc29da8 RBX: 0000000000000003 RCX: ffff9358fdc29340
> >> [ 18.421370] RDX: ffff935881a89000 RSI: 0000000000000000 RDI: 0000000000000003
> >> [ 18.421371] RBP: ffff9358fdc293c0 R08: 0000000000000000 R09: 00000000b808a33f
> >> [ 18.421371] R10: 0000000000200b20 R11: 0000000011659969 R12: 0000000000000001
> >> [ 18.421372] R13: ffff93588214fe00 R14: 0000000000000000 R15: 0000000000200b20
> >> [ 18.421375] FS: 00007fb07deddd80(0000) GS:ffff935945f6d000(0000) knlGS:0000000000000000
> >> [ 18.421376] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [ 18.421377] CR2: 00005571bafe12a0 CR3: 00000000024e6000 CR4: 00000000000006f0
> >> [ 18.421377] Call Trace:
> >> [ 18.421383] <TASK>
> >> [ 18.421387] enqueue_task+0x31/0x70
> >> [ 18.421389] ttwu_do_activate+0x73/0x220
> >> [ 18.421391] try_to_wake_up+0x2b1/0x7a0
> >> [ 18.421393] ? kmem_cache_alloc_node_noprof+0x7f/0x210
> >> [ 18.421396] ep_autoremove_wake_function+0x12/0x40
> >> [ 18.421400] __wake_up_common+0x72/0xa0
> >> [ 18.421402] __wake_up_sync+0x38/0x50
> >> [ 18.421404] ep_poll_callback+0xd2/0x240
> >> [ 18.421406] __wake_up_common+0x72/0xa0
> >> [ 18.421407] __wake_up_sync_key+0x3f/0x60
> >> [ 18.421409] sock_def_readable+0x42/0xc0
> >> [ 18.421414] unix_dgram_sendmsg+0x48f/0x840
> >> [ 18.421420] ____sys_sendmsg+0x31c/0x350
> >> [ 18.421423] ___sys_sendmsg+0x99/0xe0
> >> [ 18.421425] __sys_sendmsg+0x8a/0xf0
> >> [ 18.421429] do_syscall_64+0xa4/0x260
> >> [ 18.421434] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> >> [ 18.421438] RIP: 0033:0x7fb07e8d4d94
> >> [ 18.421439] Code: 15 91 10 0d 00 f7 d8 64 89 02 b8 ff ff ff ff eb bf 0f 1f 44 00 00 f3 0f 1e fa 80 3d d5 92 0d 00 00 74 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 4c c3 0f 1f 00 55 48 89 e5 48 83 ec 20 89 55
> >> [ 18.421440] RSP: 002b:00007ffff30e4d08 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
> >> [ 18.421442] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb07e8d4d94
> >> [ 18.421442] RDX: 0000000000004000 RSI: 00007ffff30e4e80 RDI: 0000000000000031
> >> [ 18.421443] RBP: 00007ffff30e5ff0 R08: 00000000000000c0 R09: 0000000000000000
> >> [ 18.421443] R10: 00007fb07deddc08 R11: 0000000000000202 R12: 00007ffff30e6070
> >> [ 18.421444] R13: 00007ffff30e4f00 R14: 00007ffff30e4d10 R15: 000000000000000f
> >> [ 18.421445] </TASK>
> >> [ 18.421446] ---[ end trace 0000000000000000 ]---
> >>
> >> [1]: https://lore-kernel.gnuweeb.org/lkml/20250829081120.806-1-ziqianlu@bytedance.com/
> >> [2]: https://lore.kernel.org/lkml/d37fcac575ee94c3fe605e08e6297986@codethink.co.uk/
> >>
> >> I hope this is helpful. I'm happy to provide more information or run
> >> additional tests if needed.
> >
> > Yeah, definitely helpful, thanks.
> >
> > While looking at this commit, I'm thinking maybe we shouldn't use
> > cfs_rq_pelt_clock_throttled() to decide if cfs_rq should be added
> > to rq's leaf list. The reason is, for a cfs_rq that is in throttled
> > hierarchy, it can be removed from that leaf list when it has no entities
> > left in dequeue_entity(). So even when it's on the list now doesn't
> > mean it will still be on the list at unthrottle time.
> >
> > Considering that the purpose is to have cfs_rq and its ancestors to be
> > added to the list in case this cfs_rq may have some removed load that
> > needs to be decayed later as described in commit 0258bdfaff5b("sched/fair:
> > Fix unfairness caused by missing load decay"), I'm thinking maybe we
> > should deal with cfs_rqs differently according to whether it is in
> > throttled hierarchy or not:
> > - for cfs_rqs not in throttled hierarchy, add it and its ancestors to
> > the list so that the removed load can be decayed;
> > - for cfs_rqs in throttled hierarchy, check on unthrottle time whether
> > it has any removed load that needs to be decayed.
> > The case in my mind is: a blocked task @p gets attached to a throttled
> > cfs_rq by attaching a pid to a cgroup. Assume the cfs_rq was empty, had
> > no tasks throttled or queued underneath it. Then @p is migrated to
> > another cpu before being queued on it, so this cfs_rq now has some
> > removed load on it. On unthrottle, this cfs_rq is considered fully
> > decayed and isn't added to leaf cfs_rq list. Then we have a problem.
> >
> > With the above said, I'm thinking the below diff. No idea if this can
> > fix Matteo's problem though, it's just something I think can fix the
> > issue I described above, if I understand things correctly...
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index f993de30e1466..444f0eb2df71d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4062,6 +4062,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> > if (child_cfs_rq_on_list(cfs_rq))
> > return false;
> >
> > + if (cfs_rq->removed.nr)
> > + return false;
>
> If load_avg_is_decayed(), then having removed load makes no difference
> right? We are not adding any weight to the tg and the sum/avg cannot go
> negative so we are essentially removing nothing.
>
> And, update_load_avg() would propagate the removed load anyways so does
> this make a difference?
>
You are right. I misunderstood the meaning of removed load; I thought
the load was transferred to the removed part but actually, the load is
still there in the cfs_rq when a task migrates away.
Having a positive removed.nr but a fully decayed load avg looks strange
to me; maybe we can avoid this by doing something like the below. It
should save some cycles by avoiding taking the lock and later dealing
with zero removed load in update_cfs_rq_load_avg(). Just a thought:
(I had a vague memory that util_avg and runnable_avg should always be
smaller than load_avg, if so, we can simplify the condition by just
checking load_avg)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993de30e1466..130db255a1ef6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4788,6 +4808,10 @@ static void remove_entity_load_avg(struct sched_entity *se)
sync_entity_load_avg(se);
+ /* It's possible this entity has no load left after sync */
+ if (!se->avg.util_avg && !se->avg.load_avg && !se->avg.runnable_avg)
+ return;
+
raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
++cfs_rq->removed.nr;
cfs_rq->removed.util_avg += se->avg.util_avg;
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-25 9:29 ` Aaron Lu
@ 2025-09-25 11:22 ` K Prateek Nayak
2025-09-25 12:05 ` Aaron Lu
0 siblings, 1 reply; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-25 11:22 UTC (permalink / raw)
To: Aaron Lu
Cc: Matteo Martelli, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
On 9/25/2025 2:59 PM, Aaron Lu wrote:
> Hi Prateek,
>
> On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
>> Hello Aaron, Matteo,
>>
>> On 9/24/2025 5:03 PM, Aaron Lu wrote:
>>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>>>
>>> I stared at the code and haven't been able to figure out when
>>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>>
>> Yeah neither could I. I tried running with PREEMPT_RT too and still
>> couldn't trigger it :(
>>
>> But I'm wondering if all we are missing is:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index f993de30e146..5f9e7b4df391 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>>
>> cfs_rq->throttle_count = pcfs_rq->throttle_count;
>> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
>> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
>> }
>>
>> /* conditionally throttle active cfs_rq's from put_prev_entity() */
>> ---
>>
>> This is the only way we can currently have a break in
>> cfs_rq_pelt_clock_throttled() hierarchy.
>>
>
> Great finding! Yes, that was missed.
>
> According to this info, I'm able to trigger the assert in
> enqueue_task_fair(). The stack is different from Matteo's: his stack is
> from ttwu path while mine is from exit. Anyway, let me do more analysis
> and get back to you:
>
> [ 67.041905] ------------[ cut here ]------------
> [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> [ 67.043227] Modules linked in:
> [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> [ 67.044694] Tainted: [W]=WARN
> [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> [ 67.054083] Call Trace:
> [ 67.054334] <TASK>
> [ 67.054546] enqueue_task+0x35/0xd0
> [ 67.054885] sched_move_task+0x291/0x370
> [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> [ 67.055669] do_exit+0x204/0x4f0
> [ 67.055984] ? lock_release+0x10a/0x170
> [ 67.056356] do_group_exit+0x36/0xa0
> [ 67.056714] __x64_sys_exit_group+0x18/0x20
> [ 67.057121] x64_sys_call+0x14fa/0x1720
> [ 67.057502] do_syscall_64+0x6a/0x2d0
> [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
Great! I'll try stressing this path too.
P.S. Are you seeing this with the sync_throttle() fix too?
[..snip..]
>>
>> If load_avg_is_decayed(), then having removed load makes no difference
>> right? We are not adding any weight to the tg and the sum/avg cannot go
>> negative so we are essentially removing nothing.
>>
>> And, update_load_avg() would propagate the removed load anyways so does
>> this make a difference?
>>
>
> You are right. I misunderstood the meaning of removed load; I thought
> the load was transferred to the removed part but actually, the load is
> still there in the cfs_rq when a task migrates away.
>
> Having a positive removed.nr but a fully decayed load avg looks strange
> to me; maybe we can avoid this by doing something like the below. It
> should save some cycles by avoiding taking the lock and later dealing
> with zero removed load in update_cfs_rq_load_avg(). Just a thought:
> (I had a vague memory that util_avg and runnable_avg should always be
> smaller than load_avg, if so, we can simplify the condition by just
> checking load_avg)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f993de30e1466..130db255a1ef6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4788,6 +4808,10 @@ static void remove_entity_load_avg(struct sched_entity *se)
>
> sync_entity_load_avg(se);
>
> + /* It's possible this entity has no load left after sync */
> + if (!se->avg.util_avg && !se->avg.load_avg && !se->avg.runnable_avg)
> + return;
> +
This makes sense. Maybe we can rename the current "load_avg_is_decayed()"
to "load_sum_is_decayed()", extract the condition from its WARN_ON_ONCE()
into a new "load_avg_is_decayed()", and use that here.
Thoughts?
P.S. There is this other patch that also touches this bit
https://lore.kernel.org/lkml/20250910084316.356169-1-hupu.gm@gmail.com/
Maybe we can use load_avg_is_decayed() itself here.
> raw_spin_lock_irqsave(&cfs_rq->removed.lock, flags);
> ++cfs_rq->removed.nr;
> cfs_rq->removed.util_avg += se->avg.util_avg;
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-25 11:22 ` K Prateek Nayak
@ 2025-09-25 12:05 ` Aaron Lu
2025-09-25 13:33 ` Matteo Martelli
0 siblings, 1 reply; 35+ messages in thread
From: Aaron Lu @ 2025-09-25 12:05 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Matteo Martelli, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>
>
> On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > Hi Prateek,
> >
> > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron, Matteo,
> >>
> >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >>>
> >>> I stared at the code and haven't been able to figure out when
> >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> >>
> >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> >> couldn't trigger it :(
> >>
> >> But I'm wondering if all we are missing is:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f993de30e146..5f9e7b4df391 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >>
> >> cfs_rq->throttle_count = pcfs_rq->throttle_count;
> >> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> >> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> >> }
> >>
> >> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> >> ---
> >>
> >> This is the only way we can currently have a break in
> >> cfs_rq_pelt_clock_throttled() hierarchy.
> >>
> >
> > Great finding! Yes, that was missed.
> >
> > According to this info, I'm able to trigger the assert in
> > enqueue_task_fair(). The stack is different from Matteo's: his stack is
> > from ttwu path while mine is from exit. Anyway, let me do more analysis
> > and get back to you:
> >
> > [ 67.041905] ------------[ cut here ]------------
> > [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> > [ 67.043227] Modules linked in:
> > [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> > [ 67.044694] Tainted: [W]=WARN
> > [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> > [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> > [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> > [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> > [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> > [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> > [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> > [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> > [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> > [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> > [ 67.054083] Call Trace:
> > [ 67.054334] <TASK>
> > [ 67.054546] enqueue_task+0x35/0xd0
> > [ 67.054885] sched_move_task+0x291/0x370
> > [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> > [ 67.055669] do_exit+0x204/0x4f0
> > [ 67.055984] ? lock_release+0x10a/0x170
> > [ 67.056356] do_group_exit+0x36/0xa0
> > [ 67.056714] __x64_sys_exit_group+0x18/0x20
> > [ 67.057121] x64_sys_call+0x14fa/0x1720
> > [ 67.057502] do_syscall_64+0x6a/0x2d0
> > [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Great! I'll try stressing this path too.
I now see other paths leading to enqueue_task_fair() too, so I
think this is the same problem as the one seen by Matteo.
> P.S. Are you seeing this with the sync_throttle() fix too?
Nope, your finding fixed it for me :)
I added some trace prints but, with so many traces being generated,
the critical ones kept getting lost.
Anyway, I think I've figured out how it happened: during
online_fair_sched_group() -> sync_throttle(), the newly onlined cfs_rq
didn't have pelt_clock_throttled synced. Suppose its parent's pelt clock
is throttled, then in propagate_entity_cfs_rq(), this newly onlined
cfs_rq is added to leaf list but its parent is not. Now
rq->tmp_alone_branch points to this newly onlined cfs_rq, waiting for
its parent to be added (but this didn't happen).
Then another task wakes up and gets enqueued on this same cpu, all its
ancestor cfs_rqs are already on the list so list_add_leaf_cfs_rq()
didn't manipulate rq->tmp_alone_branch. At the end of the enqueue,
assert will fire.
I'm thinking we should add an assert_list_leaf_cfs_rq() at the end of
propagate_entity_cfs_rq() to capture other potential problems.
Hi Matteo,
Can you test the above diff Prateek sent in his last email? Thanks.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-25 12:05 ` Aaron Lu
@ 2025-09-25 13:33 ` Matteo Martelli
2025-09-26 4:32 ` K Prateek Nayak
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Matteo Martelli @ 2025-09-25 13:33 UTC (permalink / raw)
To: Aaron Lu, K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Michal Koutný, Sebastian Andrzej Siewior, Matteo Martelli
Hi Aaron,
On Thu, 25 Sep 2025 20:05:04 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
> >
> > On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > > Hi Prateek,
> > >
> > > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> > >> Hello Aaron, Matteo,
> > >>
> > >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> > >>>> ...
> > >>>> The test setup is the same used in my previous testing for v3 [2], where
> > >>>> the CFS throttling events are mostly triggered by the first ssh logins
> > >>>> into the system as the systemd user slice is configured with CPUQuota of
> > >>>> 25%. Also note that the same systemd user slice is configured with CPU
> > >>>
> > >>> I tried to replicate this setup, below is my setup using a 4 cpu VM
> > >>> and rt kernel at commit fe8d238e646e("sched/fair: Propagate load for
> > >>> throttled cfs_rq"):
> > >>> # pwd
> > >>> /sys/fs/cgroup/user.slice
> > >>> # cat cpu.max
> > >>> 25000 100000
> > >>> # cat cpuset.cpus
> > >>> 0
> > >>>
> > >>> I then login using ssh as a normal user and I can see throttle happened
> > >>> but couldn't hit this warning. Do you have to do something special to
> > >>> trigger it?
It wasn't very reproducible in my setup either, but I found out that the
warning was being triggered more often when I tried to ssh into the
system just after boot, probably due to some additional load from
processes spawned during the boot phase. Therefore I prepared a
reproducer script that resembles my initial setup, plus a stress-ng
worker in the background while connecting with ssh to the system. I also
reduced the CPUQuota to 10% which seemed to increase the probability to
trigger the warning. With this script I can reproduce the warning about
once or twice every 10 ssh executions. See the script at the end of this
email.
> > >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> > >>>
> > >>> I stared at the code and haven't been able to figure out when
> > >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> > >>>
> > >>
> > >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> > >> couldn't trigger it :(
> > >>
> > >> But I'm wondering if all we are missing is:
> > >>
> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >> index f993de30e146..5f9e7b4df391 100644
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> > >>
> > >> cfs_rq->throttle_count = pcfs_rq->throttle_count;
> > >> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> > >> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> > >> }
> > >>
> > >> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> > >> ---
> > >>
> > >> This is the only way we can currently have a break in
> > >> cfs_rq_pelt_clock_throttled() hierarchy.
> > >>
> ...
>
> Hi Matteo,
>
> Can you test the above diff Prateek sent in his last email? Thanks.
>
Using the same script below, I have just tested the diff sent by Prateek
in [1] (also quoted above) that changes sync_throttle(), and I couldn't
reproduce the warning.
Here's the script (I hope it doesn't add too much noise to the email
thread).
---
#!/usr/bin/bash
# Copyright (C) 2024 Codethink Limited
# SPDX-License-Identifier: GPL-2.0-only
script_path=$(realpath "$0")
wrk_dir=$(pwd)
kernel_version=fe8d238e646e
kernel_dir=${wrk_dir}/../linux
kernel_image=${kernel_dir}/arch/x86_64/boot/bzImage
qemu_image_url=https://cdimage.debian.org/images/cloud/sid/daily/20250919-2240/debian-sid-nocloud-amd64-daily-20250919-2240.qcow2
qemu_image_src=${wrk_dir}/$(basename "${qemu_image_url}")
qemu_image=${wrk_dir}/image.qcow2
guest_pkgs="stress-ng" # comma separated list of additional packages to install
run_qemu() {
qemu-system-x86_64 \
-m 2G -smp 4 \
-nographic \
-nic user,hostfwd=tcp::2222-:22 \
-M q35,accel=kvm \
-drive format=qcow2,file="${qemu_image}" \
-virtfs local,path=.,mount_tag=shared,security_model=mapped-xattr \
-serial file:console.log \
-monitor none \
-append "root=/dev/sda1 console=ttyS0,115200 sysctl.kernel.panic_on_oops=1" \
-kernel "${kernel_image}"
}
run_ssh() {
local cmd=$1
ssh root@localhost -p 2222 \
-i ${wrk_dir}/id_ed25519 -F none \
-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
$cmd
}
setup_kernel() {
echo "Building kernel ..."
pushd $kernel_dir
git reset --hard $kernel_version && git clean -f -d
make mrproper
make defconfig
scripts/config -k -e EXPERT
scripts/config -k -e PREEMPT_RT
scripts/config -k -e CFS_BANDWIDTH
scripts/config -k -e FUNCTION_TRACER
make olddefconfig
make -j8
popd
}
setup_image() {
echo "Preparing qemu image ..."
killall -9 qemu-system-x86_64
if [ ! -f "${qemu_image_src}" ] ; then
wget ${qemu_image_url} -O ${qemu_image_src}
fi
cp ${qemu_image_src} ${qemu_image}
echo "[Unit]
Description=User and Session Slice
Before=slices.target
[Slice]
CPUQuota=10%
AllowedCPUs=0
" > user.slice
yes | ssh-keygen -t ed25519 -f id_ed25519 -N ''
# https://wiki.debian.org/ThomasChung/CloudImage
virt-customize -a ${qemu_image} \
--install ssh,${guest_pkgs} \
--append-line "/root/.ssh/authorized_keys:$(cat id_ed25519.pub)" \
--upload "user.slice:/etc/systemd/system/user.slice" \
--chown "0:0:/etc/systemd/system/user.slice"
}
setup_kernel
setup_image
echo "Run test..."
ts=$(date --utc "+%FT%TZ")
out_dir=${wrk_dir}/out/${ts}
out_log_dir=${out_dir}/logs
mkdir -p ${out_dir} ${out_log_dir}
cp ${script_path} ${out_dir}
cp ${kernel_dir}/.config ${out_dir}/kernel.config
trap "exit" INT TERM
trap "kill 0" EXIT
export -f run_ssh
export wrk_dir
while true; do
run_qemu &
qemu_pid=$!
sleep 10
run_ssh 'stress-ng --cpu 1 --timeout 60' &
for i in $(seq 1 10) ; do
echo "running ssh $i"
timeout 3 bash -c run_ssh #launch interactive ssh
done
run_ssh "systemctl poweroff"
wait $qemu_pid
serial_out_file=${out_log_dir}/serial-$(date "+%F-%T").log
mv ${wrk_dir}/console.log ${serial_out_file}
grep -e "Kernel panic" -e "Call Trace" -e "Kernel BUG" -e "cut here" ${serial_out_file} && \
echo "${serial_out_file}" >> ${out_dir}/panics-warns
grep -e "rcu.*detected.*stall" ${serial_out_file} && \
echo "${serial_out_file}" >> ${out_dir}/rcu-stalls
done
---
[1]: https://lore.kernel.org/all/db7fc090-5c12-450b-87a4-bcf06e10ef68@amd.com/
Best regards,
Matteo Martelli
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-25 13:33 ` Matteo Martelli
@ 2025-09-26 4:32 ` K Prateek Nayak
2025-09-26 5:53 ` Aaron Lu
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
2 siblings, 1 reply; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-26 4:32 UTC (permalink / raw)
To: Matteo Martelli, Aaron Lu
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
Michal Koutný, Sebastian Andrzej Siewior
Hello Aaron, Matteo,
On 9/25/2025 7:03 PM, Matteo Martelli wrote:
> Hi Aaron,
>
> On Thu, 25 Sep 2025 20:05:04 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
>> On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>>>
>>> On 9/25/2025 2:59 PM, Aaron Lu wrote:
>>>> Hi Prateek,
>>>>
>>>> On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
>>>>> Hello Aaron, Matteo,
>>>>>
>>>>> On 9/24/2025 5:03 PM, Aaron Lu wrote:
>>>>>>> ...
>>>>>>> The test setup is the same used in my previous testing for v3 [2], where
>>>>>>> the CFS throttling events are mostly triggered by the first ssh logins
>>>>>>> into the system as the systemd user slice is configured with CPUQuota of
>>>>>>> 25%. Also note that the same systemd user slice is configured with CPU
>>>>>>
>>>>>> I tried to replicate this setup, below is my setup using a 4 cpu VM
>>>>>> and rt kernel at commit fe8d238e646e("sched/fair: Propagate load for
>>>>>> throttled cfs_rq"):
>>>>>> # pwd
>>>>>> /sys/fs/cgroup/user.slice
>>>>>> # cat cpu.max
>>>>>> 25000 100000
>>>>>> # cat cpuset.cpus
>>>>>> 0
>>>>>>
>>>>>> I then login using ssh as a normal user and I can see throttle happened
>>>>>> but couldn't hit this warning. Do you have to do something special to
>>>>>> trigger it?
>
> It wasn't very reproducible in my setup either, but I found out that the
> warning was being triggered more often when I tried to ssh into the
> system just after boot, probably due to some additional load from
> processes spawned during the boot phase. Therefore I prepared a
> reproducer script that resemble my initial setup, plus a stress-ng
> worker in the background while connecting with ssh to the system. I also
> reduced the CPUQuota to 10% which seemed to increase the probability to
> trigger the warning. With this script I can reproduce the warning about
> once or twice every 10 ssh executions. See the script at the end of this
> email.
I have a similar setup with a bunch of hackbench instances going in cgroups
with bandwidth limits set; I keep creating/removing cgroups on this
hierarchy and keep moving some tasks between them.
>
>>>>>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>>>>>>
>>>>>> I stared at the code and haven't been able to figure out when
>>>>>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>>>>>>
>
>>>>>
>>>>> Yeah neither could I. I tried running with PREEMPT_RT too and still
>>>>> couldn't trigger it :(
>>>>>
>>>>> But I'm wondering if all we are missing is:
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index f993de30e146..5f9e7b4df391 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>>>>>
>>>>> cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>>>> cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
>>>>> + cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
>>>>> }
>>>>>
>>>>> /* conditionally throttle active cfs_rq's from put_prev_entity() */
>>>>> ---
>>>>>
>>>>> This is the only way we can currently have a break in
>>>>> cfs_rq_pelt_clock_throttled() hierarchy.
>>>>>
>> ...
>>
>> Hi Matteo,
>>
>> Can you test the above diff Prateek sent in his last email? Thanks.
>>
>
> Using the same script below, I have just tested the diff sent by Prateek
> in [1] (also quoted above) that changes sync_throttle(), and I couldn't
> reproduce the warning.
Thank you both for testing the diff and providing the setup! I'll post a
formal patch soon on the thread.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-26 4:32 ` K Prateek Nayak
@ 2025-09-26 5:53 ` Aaron Lu
0 siblings, 0 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-26 5:53 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Matteo Martelli, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
On Fri, Sep 26, 2025 at 10:02:53AM +0530, K Prateek Nayak wrote:
> Thank you both for testing the diff and providing the setup! I'll post a
> formal patch soon on the thread.
Thanks a lot Prateek and Matteo, looking forward to the formal patch.
^ permalink raw reply [flat|nested] 35+ messages in thread
* [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-25 13:33 ` Matteo Martelli
2025-09-26 4:32 ` K Prateek Nayak
@ 2025-09-26 8:19 ` K Prateek Nayak
2025-09-26 9:38 ` Aaron Lu
2025-09-26 14:48 ` Matteo Martelli
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
2 siblings, 2 replies; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-26 8:19 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Matteo Martelli, Aaron Lu, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
root(#)
/ \
A(#) B(*)
|
C <--- new cgroup
|
D <--- new cgroup
# - Already on leaf cfs_rq list
* - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
root cfs_rq already exists on the leaf cfs_rq list, leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on other subtree on the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue or
the first distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
Stress test included running sched-messaging in nested hierarchy with
various quota set alongside a continuous loop of cgroup creation and
deletion, as well as another loop of continuous movement of a busy loop
between cgroups.
No splats have been observed yet with this patch.
Aaron, Matteo,
I've not added any "Tested-by" tags since the final diff is slightly
different from the diff shared previously. The previous diff led to this
splat during my test with the newly added assert:
------------[ cut here ]------------
WARNING: CPU: 6 PID: 22912 at kernel/sched/fair.c:400 propagate_entity_cfs_rq+0x1f9/0x270
CPU: 6 UID: 0 PID: 22912 Comm: bash Not tainted 6.17.0-rc4-test+ #50 PREEMPT_{RT,(full)}
Hardware name: Dell Inc. PowerEdge R6525/024PW1, BIOS 2.7.3 03/30/2022
RIP: 0010:propagate_entity_cfs_rq+0x1f9/0x270
Code: ...
RSP: 0018:ffffd4a203713928 EFLAGS: 00010087
RAX: ffff8ea4714b1e00 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffff8ea4714b28e8 RSI: 0000003570150cf8 RDI: ffff8e65abc69400
RBP: ffff8ea4714b1f00 R08: 0000000000006160 R09: 0000000000016d5a
R10: 000000000000009c R11: 0000000000000000 R12: ffff8ea4714b1e00
R13: ffff8e6586df1a00 R14: 0000000000000001 R15: 0000000000000246
FS: 00007f73a6a7f740(0000) GS:ffff8e64d0689000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000562b9791d461 CR3: 000000010d875003 CR4: 0000000000f70ef0
PKRU: 55555554
Call Trace:
<TASK>
sched_move_task+0xc2/0x280
cpu_cgroup_attach+0x37/0x70
cgroup_migrate_execute+0x3b8/0x560
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
cgroup_attach_task+0x15b/0x200
? cgroup_attach_permissions+0x17e/0x1e0
__cgroup_procs_write+0x105/0x170
cgroup_procs_write+0x17/0x30
kernfs_fop_write_iter+0x127/0x1c0
vfs_write+0x305/0x420
ksys_write+0x6b/0xe0
do_syscall_64+0x85/0xb30
? __x64_sys_close+0x3d/0x80
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? wp_page_reuse.constprop.0+0x7c/0x90
? srso_alias_return_thunk+0x5/0xfbef5
? do_wp_page+0x3df/0xd70
? set_close_on_exec+0x31/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? rt_spin_unlock+0x60/0xa0
? srso_alias_return_thunk+0x5/0xfbef5
? do_fcntl+0x3f6/0x760
? srso_alias_return_thunk+0x5/0xfbef5
? do_syscall_64+0xbe/0xb30
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
? __handle_mm_fault+0x375/0x380
? __x64_sys_fcntl+0x80/0x110
? srso_alias_return_thunk+0x5/0xfbef5
? count_memcg_events+0xd9/0x1c0
? srso_alias_return_thunk+0x5/0xfbef5
? handle_mm_fault+0x1af/0x290
? srso_alias_return_thunk+0x5/0xfbef5
? do_user_addr_fault+0x2d0/0x8c0
? srso_alias_return_thunk+0x5/0xfbef5
? srso_alias_return_thunk+0x5/0xfbef5
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f73a6b968f7
Code: ...
RSP: 002b:00007ffebcfdc688 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f73a6b968f7
RDX: 0000000000000006 RSI: 0000562b9791d460 RDI: 0000000000000001
RBP: 0000562b9791d460 R08: 00007f73a6c53460 R09: 000000007fffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
R13: 00007f73a6c9d780 R14: 00007f73a6c99600 R15: 00007f73a6c98a00
</TASK>
---[ end trace 0000000000000000 ]---
... which happens because we can have:
A (*throttled; PELT is not frozen)
/ \
(new) C B (contains runnable tasks)
C is a new cgroup and is not yet enqueued on A's cfs_rq. B is another
cfs_rq with tasks and is enqueued on A's cfs_rq. The entire hierarchy
shown above is throttled.
When all tasks on B leave, the PELT clocks of A and B are frozen, but
since C was never queued on A and A's PELT was not throttled when C was
created, C still has pelt_clock_throttled as 0.
The next propagate_entity_cfs_rq() on C sees that C does not have its
PELT clock throttled and adds it to "rq->tmp_alone_branch", but then
sees that A has its PELT clock throttled and skips it, which leads to
the dangling reference or, worse, incorrect ordering in the leaf cfs_rq
list.
Changes are prepared on top of tip:sched/core at commit 45b7f780739a
("sched: Fix some typos in include/linux/preempt.h")
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..9af82e214312 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+ /*
+ * It is not enough to sync the "pelt_clock_throttled" indicator
+ * with the parent cfs_rq when the hierarchy is not queued.
+ * Always join a throttled hierarchy with PELT clock throttled
+ * and leave it to the first enqueue, or distribution to
+ * unthrottle the PELT clock and add us on the leaf_cfs_rq_list.
+ */
+ if (cfs_rq->throttle_count)
+ cfs_rq->pelt_clock_throttled = 1;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13192,6 +13202,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
+
+ assert_list_leaf_cfs_rq(rq_of(cfs_rq));
}
#else /* !CONFIG_FAIR_GROUP_SCHED: */
static void propagate_entity_cfs_rq(struct sched_entity *se) { }
base-commit: 45b7f780739a3145aeef24d2dfa02517a6c82ed6
--
2.34.1
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
@ 2025-09-26 9:38 ` Aaron Lu
2025-09-26 10:11 ` K Prateek Nayak
2025-10-01 20:37 ` Benjamin Segall
2025-09-26 14:48 ` Matteo Martelli
1 sibling, 2 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-26 9:38 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Matteo Martelli, linux-kernel, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider
On Fri, Sep 26, 2025 at 08:19:17AM +0000, K Prateek Nayak wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
> root(#)
> / \
> A(#) B(*)
> |
> C <--- new cgroup
> |
> D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
> propagate_entity_cfs_rq(). (Same can happen if a throttled task is
> moved to cgroup C and enqueue_task_fair() returns early.)
>
> propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
> "rq->tmp_alone_branch" since its PELT clock is not marked throttled
> and cfs_rq of cgroup B is not on the list.
>
> cfs_rq of cgroup B is skipped since its PELT is throttled.
>
> root cfs_rq already exists on the leaf cfs_rq list, leading to
> list_add_leaf_cfs_rq() returning early.
>
> The cfs_rq of cgroup C is left dangling on the
> "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
> on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
> without any modifications to "rq->tmp_alone_branch".
>
> The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
> dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
> !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on another subtree of the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
Sigh...
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
Makes sense.
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
Should be Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle
model")? "Task based throttle time accounting" doesn't touch the PELT bits.
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Thanks for the fix.
BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
ancestor cfs_rq's pelt clock throttled status but only the first level
cfs_rq's, because the purpose is to have the first level cfs_rq to stay
on the leaf list and all those list_add_leaf_cfs_rq() for its ancestors
are just to make sure the list is fully connected. I mean something like
this:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75c615f5ed640..6a6d9200ab93c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
static void propagate_entity_cfs_rq(struct sched_entity *se)
{
struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);
/*
* If a task gets attached to this cfs_rq and before being queued,
@@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
* change, make sure this cfs_rq stays on leaf cfs_rq list to have
* that removed load decayed or it can cause faireness problem.
*/
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (add)
list_add_leaf_cfs_rq(cfs_rq);
/* Start to propagate at parent */
@@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
update_load_avg(cfs_rq, se, UPDATE_TG);
- if (!cfs_rq_pelt_clock_throttled(cfs_rq))
+ if (add)
list_add_leaf_cfs_rq(cfs_rq);
}
But this is a different thing and can be taken care of if necessary
later. Current logic doesn't have a problem, it's just not as clear as
the above diff to me.
* Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-26 9:38 ` Aaron Lu
@ 2025-09-26 10:11 ` K Prateek Nayak
2025-10-01 20:37 ` Benjamin Segall
1 sibling, 0 replies; 35+ messages in thread
From: K Prateek Nayak @ 2025-09-26 10:11 UTC (permalink / raw)
To: Aaron Lu, Peter Zijlstra
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Matteo Martelli,
linux-kernel, Dietmar Eggemann, Steven Rostedt, Ben Segall,
Mel Gorman, Valentin Schneider, Chengming Zhou
(+Chengming for the discussion on the PELT bits)
Hello Aaron,
On 9/26/2025 3:08 PM, Aaron Lu wrote:
>> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
>> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
>> dequeue on another subtree of the throttled hierarchy can freeze the PELT
>> clock for the parent hierarchy without setting the indicators for this
>> newly added cfs_rq which was never enqueued.
>>
>
> Sigh...
I had the exact reaction when I managed to trigger that again (T_T)
>
>> Since there are no tasks on the new hierarchy, start a cfs_rq on a
>> throttled hierarchy with its PELT clock throttled. The first enqueue, or
>> the distribution (whichever happens first) will unfreeze the PELT clock
>> and queue the cfs_rq on the leaf cfs_rq list.
>>
>
> Makes sense.
>
>> While at it, add an assert_list_leaf_cfs_rq() in
>> propagate_entity_cfs_rq() to catch such cases in the future.
>>
>> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
>> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
>> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
>> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
>
> Should be Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle
> model")? "Task based throttle time accounting" doesn't touch pelt bits.
True that the issue is seen after e1fad12dcb66, but I thought this could
be considered a fix for the initial introduction of the indicator. I
don't have any strong feelings on this.
Peter, can you update the commit in the fixes tag if and when you pick
this?
>
>> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
>
> Thanks for the fix.
Thank you for helping with the debug and testing :)
>
> BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
> ancestor cfs_rq's pelt clock throttled status but only the first level
> cfs_rq's, because the purpose is to have the first level cfs_rq to stay
> on the leaf list and all those list_add_leaf_cfs_rq() for its ancestors
> are just to make sure the list is fully connected. I mean something like
> this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 75c615f5ed640..6a6d9200ab93c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
> static void propagate_entity_cfs_rq(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);
I actually think what it is doing now is already correct. The part of
the hierarchy between the root and the first throttled cfs_rq gets the
load via propagation and this needs to be decayed, otherwise the CPU
might appear busy.
Consider: root (decayed) -> A (decayed) -> B (*throttled)
If a heavy task is migrated to B, the load is propagated to B, A, and
then to root. If we don't queue them on the leaf_cfs_rq_list until
distribution, the root cfs_rq will appear to have some load and the load
balancer may get confused during find_busiest_group().
Now, if we place A and the root cfs_rq back on the leaf_cfs_rq_list,
each ILB stats kick will decay the load; eventually the CPU will appear
idle, and post unthrottle the load will start accumulating again.
>
> /*
> * If a task gets attached to this cfs_rq and before being queued,
> @@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> * that removed load decayed or it can cause faireness problem.
> */
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (add)
> list_add_leaf_cfs_rq(cfs_rq);
>
> /* Start to propagate at parent */
> @@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (add)
> list_add_leaf_cfs_rq(cfs_rq);
> }
>
> But this is a different thing and can be taken care of if necessary
> later. Current logic doesn't have a problem, it's just not as clear as
> the above diff to me.
Ack! Ben and Chengming have a better idea of these bits and perhaps
they can shed some light on the intended purpose and what the right
thing to do is.
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
2025-09-26 9:38 ` Aaron Lu
@ 2025-09-26 14:48 ` Matteo Martelli
1 sibling, 0 replies; 35+ messages in thread
From: Matteo Martelli @ 2025-09-26 14:48 UTC (permalink / raw)
To: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Aaron Lu, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak, Matteo Martelli
Hi Prateek,
On Fri, 26 Sep 2025 08:19:17 +0000, K Prateek Nayak <kprateek.nayak@amd.com> wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
> root(#)
> / \
> A(#) B(*)
> |
> C <--- new cgroup
> |
> D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
> propagate_entity_cfs_rq(). (Same can happen if a throttled task is
> moved to cgroup C and enqueue_task_fair() returns early.)
>
> propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
> "rq->tmp_alone_branch" since its PELT clock is not marked throttled
> and cfs_rq of cgroup B is not on the list.
>
> cfs_rq of cgroup B is skipped since its PELT is throttled.
>
> root cfs_rq already exists on the leaf cfs_rq list, leading to
> list_add_leaf_cfs_rq() returning early.
>
> The cfs_rq of cgroup C is left dangling on the
> "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
> on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
> without any modifications to "rq->tmp_alone_branch".
>
> The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
> dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
> !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on another subtree of the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> Stress test included running sched-messaging in nested hierarchy with
> various quota set alongside a continuous loop of cgroup creation and
> deletion, as well as another loop of continuous movement of a busy loop
> between cgroups.
>
> No splats have been observed yet with this patch.
>
> Aaron, Matteo,
>
> I've not added any "Tested-by" tags since the final diff is slightly
> different from the diff shared previously. ...
I applied this patch on top of commit 45b7f780739a ("sched: Fix some
typos in include/linux/preempt.h") from sched/core branch of tip tree,
and tested it with exactly the same setup I described in my previous
email[1]. With the patch applied, I couldn't reproduce the warning in 5
hours of testing, while before the patch the issue was systematically
reproducible and the warning was triggered at least once per
minute.
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> ...
[1]: https://lore.kernel.org/all/e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk/
Thanks to you and Aaron for addressing this!
Best regards,
Matteo Martelli
* Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
2025-09-24 11:33 ` Aaron Lu
@ 2025-09-29 7:51 ` Aaron Lu
1 sibling, 0 replies; 35+ messages in thread
From: Aaron Lu @ 2025-09-29 7:51 UTC (permalink / raw)
To: Matteo Martelli
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Michal Koutný, Sebastian Andrzej Siewior
On Tue, Sep 23, 2025 at 03:05:29PM +0200, Matteo Martelli wrote:
> Hi Aaron,
>
> On Wed, 10 Sep 2025 17:50:41 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> > Before task based throttle model, propagating load will stop at a
> > throttled cfs_rq and that propagate will happen on unthrottle time by
> > update_load_avg().
> >
> > Now that there is no update_load_avg() on unthrottle for throttled
> > cfs_rq and all load tracking is done by task related operations, let the
> > propagate happen immediately.
> >
> > While at it, add a comment to explain why cfs_rqs that are not affected
> > by throttle have to be added to leaf cfs_rq list in
> > propagate_entity_cfs_rq() per my understanding of commit 0258bdfaff5b
> > ("sched/fair: Fix unfairness caused by missing load decay").
> >
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
>
> I have been testing again the patch set "[PATCH v4 0/5] Defer throttle
> when task exits to user" [1] together with these follow up patches. I
> found out that with this patch the kernel sometimes produces the warning
> WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); in
> assert_list_leaf_cfs_rq() called by enqueue_task_fair(). I could
> reproduce this systematically by applying both [1] and this patch on top
> of tag v6.17-rc6 and also by directly testing at commit fe8d238e646e
> from sched/core branch of tip tree. I couldn't reproduce the warning by
> testing at commit 5b726e9bf954 ("sched/fair: Get rid of
> throttled_lb_pair()").
Just a note that while trying to reproduce this problem, I noticed a
warn triggered in tg_throttle_down() with one setup. It's a different
problem and I've sent a patch to address that:
https://lore.kernel.org/lkml/20250929074645.416-1-ziqianlu@bytedance.com/
* Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-26 9:38 ` Aaron Lu
2025-09-26 10:11 ` K Prateek Nayak
@ 2025-10-01 20:37 ` Benjamin Segall
1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Segall @ 2025-10-01 20:37 UTC (permalink / raw)
To: Aaron Lu
Cc: K Prateek Nayak, Ingo Molnar, Peter Zijlstra, Juri Lelli,
Vincent Guittot, Matteo Martelli, linux-kernel, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Valentin Schneider
Aaron Lu <ziqianlu@bytedance.com> writes:
> BTW, I'm thinking in propagate_entity_cfs_rq(), we shouldn't check the
> ancestor cfs_rq's pelt clock throttled status but only the first level
> cfs_rq's, because the purpose is to have the first level cfs_rq to stay
> on the leaf list and all those list_add_leaf_cfs_rq() for its ancestors
> are just to make sure the list is fully connected. I mean something like
> this:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 75c615f5ed640..6a6d9200ab93c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13170,6 +13170,7 @@ prio_changed_fair(struct rq *rq, struct task_struct *p, int oldprio)
> static void propagate_entity_cfs_rq(struct sched_entity *se)
> {
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
> + bool add = !cfs_rq_pelt_clock_throttled(cfs_rq);
>
> /*
> * If a task gets attached to this cfs_rq and before being queued,
> @@ -13177,7 +13178,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
> * change, make sure this cfs_rq stays on leaf cfs_rq list to have
> * that removed load decayed or it can cause faireness problem.
> */
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (add)
> list_add_leaf_cfs_rq(cfs_rq);
>
> /* Start to propagate at parent */
> @@ -13188,7 +13189,7 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
>
> update_load_avg(cfs_rq, se, UPDATE_TG);
>
> - if (!cfs_rq_pelt_clock_throttled(cfs_rq))
> + if (add)
> list_add_leaf_cfs_rq(cfs_rq);
> }
>
> But this is a different thing and can be taken care of if necessary
> later. Current logic doesn't have a problem, it's just not as clear as
> the above diff to me.
So the question is if we need to call list_add_leaf_cfs_rq in cases
where the immediate cfs_rq clock was stopped but the parents might not
have. It certainly doesn't seem to me like it should be necessary, but
I'm insufficiently clear on the exact details of the leaf_cfs_rq list
to swear to it.
* [PATCH v2] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-09-25 13:33 ` Matteo Martelli
2025-09-26 4:32 ` K Prateek Nayak
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
@ 2025-10-21 5:35 ` K Prateek Nayak
2025-10-21 10:10 ` Peter Zijlstra
2025-10-22 13:28 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak
2 siblings, 2 replies; 35+ messages in thread
From: K Prateek Nayak @ 2025-10-21 5:35 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
Matteo Martelli, Aaron Lu, linux-kernel
Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
Valentin Schneider, K Prateek Nayak
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
root(#)
/ \
A(#) B(*)
|
C <--- new cgroup
|
D <--- new cgroup
# - Already on leaf cfs_rq list
* - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
root cfs_rq already exists on the leaf cfs_rq list, leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on another subtree of the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v1..v2:
o Fixed the commit in "Fixes:" tag. (Aaron)
o Collected tags from Aaron and Matteo. (Thanks a ton!)
o Rebased on tip:sched/urgent.
v1: https://lore.kernel.org/lkml/20250926081918.30488-1-kprateek.nayak@amd.com/
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cee1793e8277..25970dbbb279 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+ /*
+ * It is not enough to sync the "pelt_clock_throttled" indicator
+ * with the parent cfs_rq when the hierarchy is not queued.
+ * Always join a throttled hierarchy with PELT clock throttled
+ * and leave it to the first enqueue, or distribution to
+ * unthrottle the PELT clock.
+ */
+ if (cfs_rq->throttle_count)
+ cfs_rq->pelt_clock_throttled = 1;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13187,6 +13197,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
+
+ assert_list_leaf_cfs_rq(rq_of(cfs_rq));
}
#else /* !CONFIG_FAIR_GROUP_SCHED: */
static void propagate_entity_cfs_rq(struct sched_entity *se) { }
base-commit: 17e3e88ed0b6318fde0d1c14df1a804711cab1b5
--
2.34.1
* Re: [PATCH v2] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
@ 2025-10-21 10:10 ` Peter Zijlstra
2025-10-22 13:28 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak
1 sibling, 0 replies; 35+ messages in thread
From: Peter Zijlstra @ 2025-10-21 10:10 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Matteo Martelli,
Aaron Lu, linux-kernel, Dietmar Eggemann, Steven Rostedt,
Ben Segall, Mel Gorman, Valentin Schneider
On Tue, Oct 21, 2025 at 05:35:22AM +0000, K Prateek Nayak wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
> root(#)
> / \
> A(#) B(*)
> |
> C <--- new cgroup
> |
> D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
> propagate_entity_cfs_rq(). (Same can happen if a throttled task is
> moved to cgroup C and enqueue_task_fair() returns early.)
>
> propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
> "rq->tmp_alone_branch" since its PELT clock is not marked throttled
> and cfs_rq of cgroup B is not on the list.
>
> cfs_rq of cgroup B is skipped since its PELT is throttled.
>
> root cfs_rq already exists on the leaf cfs_rq list, leading to
> list_add_leaf_cfs_rq() returning early.
>
> The cfs_rq of cgroup C is left dangling on the
> "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
> on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
> without any modifications to "rq->tmp_alone_branch".
>
> The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
> dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
> !!! Splat !!!
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on another subtree of the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
>
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
> Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> v1..v2:
>
> o Fixed the commit in "Fixes:" tag. (Aaron)
> o Collected tags from Aaron and Matteo. (Thanks a ton!)
> o Rebased on tip:sched/urgent.
OK, let me queue this for tip:sched/urgent. Thanks!
^ permalink raw reply [flat|nested] 35+ messages in thread
* [tip: sched/urgent] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
2025-10-21 10:10 ` Peter Zijlstra
@ 2025-10-22 13:28 ` tip-bot2 for K Prateek Nayak
1 sibling, 0 replies; 35+ messages in thread
From: tip-bot2 for K Prateek Nayak @ 2025-10-22 13:28 UTC (permalink / raw)
To: linux-tip-commits
Cc: Matteo Martelli, Aaron Lu, K Prateek Nayak,
Peter Zijlstra (Intel), x86, linux-kernel
The following commit has been merged into the sched/urgent branch of tip:
Commit-ID: 0e4a169d1a2b630c607416d9e3739d80e176ed67
Gitweb: https://git.kernel.org/tip/0e4a169d1a2b630c607416d9e3739d80e176ed67
Author: K Prateek Nayak <kprateek.nayak@amd.com>
AuthorDate: Tue, 21 Oct 2025 05:35:22
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 22 Oct 2025 15:21:52 +02:00
sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled
Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
load for throttled cfs_rq") which transitioned to using
cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
propagate_entity_cfs_rq().
The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
its descendants should have their PELT frozen too or weird things can
happen as a result of children accumulating PELT signals when the
parents have their PELT clock stopped.
Another side effect of this is the loss of integrity of the leaf cfs_rq
list. As debugged by Aaron, consider the following hierarchy:
root(#)
/ \
A(#) B(*)
|
C <--- new cgroup
|
D <--- new cgroup
# - Already on leaf cfs_rq list
* - Throttled with PELT frozen
The newly created cgroups don't have their "pelt_clock_throttled" signal
synced with cgroup B. Next, the following series of events occur:
1. online_fair_sched_group() for cgroup D will call
propagate_entity_cfs_rq(). (Same can happen if a throttled task is
moved to cgroup C and enqueue_task_fair() returns early.)
propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
"rq->tmp_alone_branch" since its PELT clock is not marked throttled
and cfs_rq of cgroup B is not on the list.
cfs_rq of cgroup B is skipped since its PELT is throttled.
root cfs_rq already exists on the leaf cfs_rq list, leading to
list_add_leaf_cfs_rq() returning early.
The cfs_rq of cgroup C is left dangling on the
"rq->tmp_alone_branch".
2. A new task wakes up on cgroup A. Since the whole hierarchy is already
on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
without any modifications to "rq->tmp_alone_branch".
The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
!!! Splat !!!
Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
enough since the new cfs_rq is not yet enqueued on the hierarchy. A
dequeue on another subtree of the throttled hierarchy can freeze the PELT
clock for the parent hierarchy without setting the indicators for this
newly added cfs_rq which was never enqueued.
Since there are no tasks on the new hierarchy, start a cfs_rq on a
throttled hierarchy with its PELT clock throttled. The first enqueue, or
the distribution (whichever happens first) will unfreeze the PELT clock
and queue the cfs_rq on the leaf cfs_rq list.
While at it, add an assert_list_leaf_cfs_rq() in
propagate_entity_cfs_rq() to catch such cases in the future.
Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
Link: https://patch.msgid.link/20251021053522.37583-1-kprateek.nayak@amd.com
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cee1793..25970db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,16 @@ static void sync_throttle(struct task_group *tg, int cpu)
cfs_rq->throttle_count = pcfs_rq->throttle_count;
cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
+
+ /*
+ * It is not enough to sync the "pelt_clock_throttled" indicator
+ * with the parent cfs_rq when the hierarchy is not queued.
+ * Always join a throttled hierarchy with PELT clock throttled
+ * and leave it to the first enqueue, or distribution, to
+ * unthrottle the PELT clock.
+ */
+ if (cfs_rq->throttle_count)
+ cfs_rq->pelt_clock_throttled = 1;
}
/* conditionally throttle active cfs_rq's from put_prev_entity() */
@@ -13187,6 +13197,8 @@ static void propagate_entity_cfs_rq(struct sched_entity *se)
if (!cfs_rq_pelt_clock_throttled(cfs_rq))
list_add_leaf_cfs_rq(cfs_rq);
}
+
+ assert_list_leaf_cfs_rq(rq_of(cfs_rq));
}
#else /* !CONFIG_FAIR_GROUP_SCHED: */
static void propagate_entity_cfs_rq(struct sched_entity *se) { }
end of thread, other threads:[~2025-10-22 13:28 UTC | newest]
Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-10 9:50 [PATCH 0/4] Task based throttle follow ups Aaron Lu
2025-09-10 9:50 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 12:36 ` Chengming Zhou
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-23 13:05 ` [PATCH 1/4] " Matteo Martelli
2025-09-24 11:33 ` Aaron Lu
2025-09-25 8:17 ` K Prateek Nayak
2025-09-25 9:29 ` Aaron Lu
2025-09-25 11:22 ` K Prateek Nayak
2025-09-25 12:05 ` Aaron Lu
2025-09-25 13:33 ` Matteo Martelli
2025-09-26 4:32 ` K Prateek Nayak
2025-09-26 5:53 ` Aaron Lu
2025-09-26 8:19 ` [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with PELT clock throttled K Prateek Nayak
2025-09-26 9:38 ` Aaron Lu
2025-09-26 10:11 ` K Prateek Nayak
2025-10-01 20:37 ` Benjamin Segall
2025-09-26 14:48 ` Matteo Martelli
2025-10-21 5:35 ` [PATCH v2] " K Prateek Nayak
2025-10-21 10:10 ` Peter Zijlstra
2025-10-22 13:28 ` [tip: sched/urgent] " tip-bot2 for K Prateek Nayak
2025-09-29 7:51 ` [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-10 9:50 ` [PATCH 2/4] sched/fair: update_cfs_group() for throttled cfs_rqs Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 3/4] sched/fair: Do not special case tasks in throttled hierarchy Aaron Lu
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-10 9:50 ` [PATCH 4/4] sched/fair: Do not balance task to a throttled cfs_rq Aaron Lu
2025-09-11 2:03 ` kernel test robot
2025-09-12 3:44 ` [PATCH update " Aaron Lu
2025-09-12 3:56 ` K Prateek Nayak
2025-09-16 11:43 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-11 10:42 ` [PATCH 0/4] Task based throttle follow ups Peter Zijlstra
2025-09-11 12:16 ` Aaron Lu
2025-09-15 21:54 ` Benjamin Segall
2025-09-19 14:37 ` Valentin Schneider