public inbox for linux-kernel@vger.kernel.org
* [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
@ 2025-10-30  3:27 Aaron Lu
  2025-10-30  5:46 ` K Prateek Nayak
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Aaron Lu @ 2025-10-30  3:27 UTC (permalink / raw)
  To: Ben Segall, K Prateek Nayak, Peter Zijlstra, Hao Jia,
	Valentin Schneider, Chengming Zhou, Josh Don, Ingo Molnar,
	Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

When a cfs_rq is to be throttled, its limbo list should be empty, and
that's why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warning on a non-empty limbo
list is triggered for a cfs_rq of C; let's call it cfs_rq_c (ignoring the
per-CPU part of the cfs_rq for simpler representation).

Debug showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.

I think triggering a throttle on the unthrottle path is chaotic: the
cfs_rq being unthrottled can end up in a mixed state. Fix this by granting
1ns of runtime to the cfs_rq in tg_set_cfs_bandwidth(). This ensures
cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
and cannot enter tg_unthrottle_up() with zero runtime_remaining.

Also, update outdated comments in unthrottle_cfs_rq(), since it is no
longer called with zero runtime_remaining. While at it, remove a redundant
assignment to se in unthrottle_cfs_rq().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
v3: grant cfs_rq 1ns runtime on quota set as suggested by Ben, thanks!

 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 15 ++++++---------
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67b48e21..f754a60de8484 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
 
 		guard(rq_lock_irq)(rq);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb2795..5b752324270b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
 	/*
-	 * It's possible we are called with !runtime_remaining due to things
-	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
-	 * unthrottled us with a positive runtime_remaining but other still
-	 * running entities consumed those runtime before we reached here.
+	 * It's possible we are called with runtime_remaining < 0 due to things
+	 * like async unthrottled us with a positive runtime_remaining but other
+	 * still running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
-	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
-	 * throttle, which is not supposed to happen on unthrottle path.
+	 * We can't unthrottle this cfs_rq without any runtime remaining because
+	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
+	 * which is not supposed to happen on unthrottle path.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  3:27 [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
@ 2025-10-30  5:46 ` K Prateek Nayak
  2025-10-30  6:01   ` Aaron Lu
  2025-10-30  6:51 ` Hao Jia
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: K Prateek Nayak @ 2025-10-30  5:46 UTC (permalink / raw)
  To: Aaron Lu, Ben Segall, Peter Zijlstra, Hao Jia, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

Hello Aaron,

On 10/30/2025 8:57 AM, Aaron Lu wrote:
> I think it's a chaos to trigger throttle on unthrottle path, the status
> of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
> this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
> cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
> and cannot enter tg_unthrottle_up() with zero runtime_remaining.
> 
> Also, update outdated comments in tg_throttle_down() since
> unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> While at it, remove a redundant assignment to se in tg_throttle_down().
> 
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Suggested-by: Benjamin Segall <bsegall@google.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>

Have been stress testing this on my system and I haven't seen
any splats yet. Feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

-- 
Thanks and Regards,
Prateek



* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  5:46 ` K Prateek Nayak
@ 2025-10-30  6:01   ` Aaron Lu
  0 siblings, 0 replies; 10+ messages in thread
From: Aaron Lu @ 2025-10-30  6:01 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ben Segall, Peter Zijlstra, Hao Jia, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Oct 30, 2025 at 11:16:36AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 10/30/2025 8:57 AM, Aaron Lu wrote:
> > I think it's a chaos to trigger throttle on unthrottle path, the status
> > of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
> > this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
> > cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
> > and cannot enter tg_unthrottle_up() with zero runtime_remaining.
> > 
> > Also, update outdated comments in tg_throttle_down() since
> > unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> > While at it, remove a redundant assignment to se in tg_throttle_down().
> > 
> > Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> > Suggested-by: Benjamin Segall <bsegall@google.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> 
> Have been stress testing this on my system and I haven't seen
> any splats yet. Feel free to include:
> 
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thank you Prateek!


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  3:27 [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
  2025-10-30  5:46 ` K Prateek Nayak
@ 2025-10-30  6:51 ` Hao Jia
  2025-10-30  7:45   ` Aaron Lu
  2025-11-05 21:37 ` Benjamin Segall
  2025-11-06 11:41 ` [tip: sched/urgent] " tip-bot2 for Aaron Lu
  3 siblings, 1 reply; 10+ messages in thread
From: Hao Jia @ 2025-10-30  6:51 UTC (permalink / raw)
  To: Aaron Lu, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Valentin Schneider, Chengming Zhou, Josh Don, Ingo Molnar,
	Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior



On 2025/10/30 11:27, Aaron Lu wrote:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
> 
> When running a test with the following hierarchy:
> 
>            root
>          /      \
>          A*     ...
>       /  |  \   ...
>          B
>         /  \
>        C*
> 
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
> 
> Debug showed it happened like this:
> Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are directly placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
> 
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
> 
> I think it's a chaos to trigger throttle on unthrottle path, the status
> of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
> this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
> cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
> and cannot enter tg_unthrottle_up() with zero runtime_remaining.
> 
> Also, update outdated comments in tg_throttle_down() since
> unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> While at it, remove a redundant assignment to se in tg_throttle_down().
> 
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Suggested-by: Benjamin Segall <bsegall@google.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>


It worked well in my test cases, and the non-empty throttled_limbo_list 
warning no longer appeared.

Tested-by: Hao Jia <jiahao1@lixiang.com>


> ---
> v3: grant cfs_rq 1ns runtime on quota set as suggested by Ben, thanks!
> 
>   kernel/sched/core.c |  2 +-
>   kernel/sched/fair.c | 15 ++++++---------
>   2 files changed, 7 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f1ebf67b48e21..f754a60de8484 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
>   
>   		guard(rq_lock_irq)(rq);
>   		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		cfs_rq->runtime_remaining = 1;
>   
>   		if (cfs_rq->throttled)
>   			unthrottle_cfs_rq(cfs_rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 25970dbbb2795..5b752324270b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>   	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
>   
>   	/*
> -	 * It's possible we are called with !runtime_remaining due to things
> -	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
> -	 * unthrottled us with a positive runtime_remaining but other still
> -	 * running entities consumed those runtime before we reached here.
> +	 * It's possible we are called with runtime_remaining < 0 due to things
> +	 * like async unthrottled us with a positive runtime_remaining but other
> +	 * still running entities consumed those runtime before we reached here.
>   	 *
> -	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
> -	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
> -	 * throttle, which is not supposed to happen on unthrottle path.
> +	 * We can't unthrottle this cfs_rq without any runtime remaining because
> +	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
> +	 * which is not supposed to happen on unthrottle path.
>   	 */
>   	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>   		return;
>   
> -	se = cfs_rq->tg->se[cpu_of(rq)];
> -
>   	cfs_rq->throttled = 0;
>   
>   	update_rq_clock(rq);


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  6:51 ` Hao Jia
@ 2025-10-30  7:45   ` Aaron Lu
  0 siblings, 0 replies; 10+ messages in thread
From: Aaron Lu @ 2025-10-30  7:45 UTC (permalink / raw)
  To: Hao Jia
  Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Oct 30, 2025 at 02:51:02PM +0800, Hao Jia wrote:
> It worked well in my test cases, and the non-empty throttled_limbo_list
> warning no longer appeared.
> 
> Tested-by: Hao Jia <jiahao1@lixiang.com>

Thanks Hao, it's great to get your confirmation that this also works for
your case.


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  3:27 [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
  2025-10-30  5:46 ` K Prateek Nayak
  2025-10-30  6:51 ` Hao Jia
@ 2025-11-05 21:37 ` Benjamin Segall
  2025-11-06 11:25   ` Aaron Lu
  2025-11-06 11:41 ` [tip: sched/urgent] " tip-bot2 for Aaron Lu
  3 siblings, 1 reply; 10+ messages in thread
From: Benjamin Segall @ 2025-11-05 21:37 UTC (permalink / raw)
  To: Aaron Lu
  Cc: K Prateek Nayak, Peter Zijlstra, Hao Jia, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

Aaron Lu <ziqianlu@bytedance.com> writes:

> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
>
> When running a test with the following hierarchy:
>
>           root
>         /      \
>         A*     ...
>      /  |  \   ...
>         B
>        /  \
>       C*
>
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
>
> Debug showed it happened like this:
> Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are directly placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
>
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
>
> I think it's a chaos to trigger throttle on unthrottle path, the status
> of a being unthrottled cfs_rq can be in a mixed state in the end, so fix
> this by granting 1ns to cfs_rq in tg_set_cfs_bandwidth(). This ensures
> cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
> and cannot enter tg_unthrottle_up() with zero runtime_remaining.
>
> Also, update outdated comments in tg_throttle_down() since
> unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> While at it, remove a redundant assignment to se in tg_throttle_down().
>
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Suggested-by: Benjamin Segall <bsegall@google.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
> v3: grant cfs_rq 1ns runtime on quota set as suggested by Ben, thanks!
>
>  kernel/sched/core.c |  2 +-
>  kernel/sched/fair.c | 15 ++++++---------
>  2 files changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f1ebf67b48e21..f754a60de8484 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
>  
>  		guard(rq_lock_irq)(rq);
>  		cfs_rq->runtime_enabled = runtime_enabled;
> -		cfs_rq->runtime_remaining = 0;
> +		cfs_rq->runtime_remaining = 1;
>  
>  		if (cfs_rq->throttled)
>  			unthrottle_cfs_rq(cfs_rq);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 25970dbbb2795..5b752324270b0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
>  
>  	/*
> -	 * It's possible we are called with !runtime_remaining due to things
> -	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
> -	 * unthrottled us with a positive runtime_remaining but other still
> -	 * running entities consumed those runtime before we reached here.
> +	 * It's possible we are called with runtime_remaining < 0 due to things
> +	 * like async unthrottled us with a positive runtime_remaining but other
> +	 * still running entities consumed those runtime before we reached here.
>  	 *
> -	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
> -	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
> -	 * throttle, which is not supposed to happen on unthrottle path.
> +	 * We can't unthrottle this cfs_rq without any runtime remaining because
> +	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
> +	 * which is not supposed to happen on unthrottle path.
>  	 */
>  	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>  		return;
>  
> -	se = cfs_rq->tg->se[cpu_of(rq)];
> -
>  	cfs_rq->throttled = 0;
>  
>  	update_rq_clock(rq);

Reviewed-By: Benjamin Segall <bsegall@google.com>


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-11-05 21:37 ` Benjamin Segall
@ 2025-11-06 11:25   ` Aaron Lu
  2025-11-06 11:27     ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Aaron Lu @ 2025-11-06 11:25 UTC (permalink / raw)
  To: Benjamin Segall, Peter Zijlstra
  Cc: K Prateek Nayak, Hao Jia, Valentin Schneider, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
... ...
> Reviewed-By: Benjamin Segall <bsegall@google.com>

Thank you Ben.

Hi Peter,

Do you have any other comments about this patch? Or do I need to send an
updated version with all the tags collected? Thanks.


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-11-06 11:25   ` Aaron Lu
@ 2025-11-06 11:27     ` Peter Zijlstra
  2025-11-06 11:33       ` Aaron Lu
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2025-11-06 11:27 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Benjamin Segall, K Prateek Nayak, Hao Jia, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Nov 06, 2025 at 07:25:00PM +0800, Aaron Lu wrote:
> On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
> ... ...
> > Reviewed-By: Benjamin Segall <bsegall@google.com>
> 
> Thank you Ben.
> 
> Hi Peter,
> 
> Do you have any other comments about this patch? Or do I need to send an
> updated version with all the tags collected? Thanks.

I can tag the whole thread (esc-t) and pipe the whole lot through b4 to
easily do so -- in fact I just did :-)

This was meant to go in sched/urgent, right?


* Re: [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-11-06 11:27     ` Peter Zijlstra
@ 2025-11-06 11:33       ` Aaron Lu
  0 siblings, 0 replies; 10+ messages in thread
From: Aaron Lu @ 2025-11-06 11:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Benjamin Segall, K Prateek Nayak, Hao Jia, Valentin Schneider,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Nov 06, 2025 at 12:27:21PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 06, 2025 at 07:25:00PM +0800, Aaron Lu wrote:
> > On Wed, Nov 05, 2025 at 01:37:28PM -0800, Benjamin Segall wrote:
> > ... ...
> > > Reviewed-By: Benjamin Segall <bsegall@google.com>
> > 
> > Thank you Ben.
> > 
> > Hi Peter,
> > 
> > Do you have any other comments about this patch? Or do I need to send an
> > updated version with all the tags collected? Thanks.
> 
> I can tag the whole thread (esc-t) and pipe the whole lot through b4 to
> easily do so -- in fact I just did :-)

Cool, thanks for taking care of this.

> 
> This was meant to go in sched/urgent, right?

Right, it's a fix for the current kernel.


* [tip: sched/urgent] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-30  3:27 [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
                   ` (2 preceding siblings ...)
  2025-11-05 21:37 ` Benjamin Segall
@ 2025-11-06 11:41 ` tip-bot2 for Aaron Lu
  3 siblings, 0 replies; 10+ messages in thread
From: tip-bot2 for Aaron Lu @ 2025-11-06 11:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Benjamin Segall, Aaron Lu, Peter Zijlstra (Intel),
	K Prateek Nayak, Hao Jia, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     956dfda6a70885f18c0f8236a461aa2bc4f556ad
Gitweb:        https://git.kernel.org/tip/956dfda6a70885f18c0f8236a461aa2bc4f556ad
Author:        Aaron Lu <ziqianlu@bytedance.com>
AuthorDate:    Thu, 30 Oct 2025 11:27:55 +08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 06 Nov 2025 12:30:52 +01:00

sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining

When a cfs_rq is to be throttled, its limbo list should be empty, and
that's why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warning on a non-empty limbo
list is triggered for a cfs_rq of C; let's call it cfs_rq_c (ignoring the
per-CPU part of the cfs_rq for simpler representation).

Debug showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.

I think triggering a throttle on the unthrottle path is chaotic: the
cfs_rq being unthrottled can end up in a mixed state. Fix this by granting
1ns of runtime to the cfs_rq in tg_set_cfs_bandwidth(). This ensures
cfs_rq_c has a positive runtime_remaining when initialized as unthrottled
and cannot enter tg_unthrottle_up() with zero runtime_remaining.

Also, update outdated comments in unthrottle_cfs_rq(), since it is no
longer called with zero runtime_remaining. While at it, remove a redundant
assignment to se in unthrottle_cfs_rq().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reviewed-By: Benjamin Segall <bsegall@google.com>
Suggested-by: Benjamin Segall <bsegall@google.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Hao Jia <jiahao1@lixiang.com>
Link: https://patch.msgid.link/20251030032755.560-1-ziqianlu@bytedance.com
---
 kernel/sched/core.c |  2 +-
 kernel/sched/fair.c | 15 ++++++---------
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67..f754a60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
 
 		guard(rq_lock_irq)(rq);
 		cfs_rq->runtime_enabled = runtime_enabled;
-		cfs_rq->runtime_remaining = 0;
+		cfs_rq->runtime_remaining = 1;
 
 		if (cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970db..5b75232 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6024,20 +6024,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
 	/*
-	 * It's possible we are called with !runtime_remaining due to things
-	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
-	 * unthrottled us with a positive runtime_remaining but other still
-	 * running entities consumed those runtime before we reached here.
+	 * It's possible we are called with runtime_remaining < 0 due to things
+	 * like async unthrottled us with a positive runtime_remaining but other
+	 * still running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
-	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
-	 * throttle, which is not supposed to happen on unthrottle path.
+	 * We can't unthrottle this cfs_rq without any runtime remaining because
+	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
+	 * which is not supposed to happen on unthrottle path.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);


end of thread, other threads:[~2025-11-06 11:41 UTC | newest]

Thread overview: 10+ messages
2025-10-30  3:27 [PATCH v3] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
2025-10-30  5:46 ` K Prateek Nayak
2025-10-30  6:01   ` Aaron Lu
2025-10-30  6:51 ` Hao Jia
2025-10-30  7:45   ` Aaron Lu
2025-11-05 21:37 ` Benjamin Segall
2025-11-06 11:25   ` Aaron Lu
2025-11-06 11:27     ` Peter Zijlstra
2025-11-06 11:33       ` Aaron Lu
2025-11-06 11:41 ` [tip: sched/urgent] " tip-bot2 for Aaron Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox