public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
@ 2025-09-29  7:46 Aaron Lu
  2025-09-29  9:34 ` K Prateek Nayak
  2025-10-14  7:43 ` Hao Jia
  0 siblings, 2 replies; 25+ messages in thread
From: Aaron Lu @ 2025-09-29  7:46 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

When a cfs_rq is to be throttled, its limbo list should be empty and
that's why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.

When running a test with the following hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings, that warning on a non-empty limbo
list is triggered for a cfs_rq of C; let's call it cfs_rq_c (and ignore
the CPU part of the cfs_rq for the sake of simpler representation).

Debugging showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are placed into cfs_rq_c's limbo list by
enqueue_throttled_task().

Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.

Fix this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth()
immediately after enabling bandwidth and setting runtime_remaining = 0.
This ensures cfs_rq_c is throttled upfront and cannot enter the enqueue
path in an unthrottled state with no runtime.

Also, update outdated comments in unthrottle_cfs_rq() since it is no
longer called with zero runtime_remaining from tg_set_cfs_bandwidth().

While at it, remove a redundant assignment to se in unthrottle_cfs_rq().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/core.c  |  9 ++++++++-
 kernel/sched/fair.c  | 16 +++++++---------
 kernel/sched/sched.h |  1 +
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f1e5cb94c536..421166d431fa7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9608,7 +9608,14 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 
-		if (cfs_rq->throttled)
+		/*
+		 * Throttle cfs_rq now or it can be unthrottled with zero
+		 * runtime_remaining and gets throttled on its unthrottle path.
+		 */
+		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
+			throttle_cfs_rq(cfs_rq);
+
+		if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
 	}
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e6dd3af82fc..3ef11783369d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5976,7 +5976,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }
 
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -6025,19 +6025,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	/*
 	 * It's possible we are called with !runtime_remaining due to things
-	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
-	 * unthrottled us with a positive runtime_remaining but other still
-	 * running entities consumed those runtime before we reached here.
+	 * like async unthrottled us with a positive runtime_remaining but
+	 * other still running entities consumed those runtime before we
+	 * reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
-	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
-	 * throttle, which is not supposed to happen on unthrottle path.
+	 * We can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue in tg_unthrottle_up() will immediately trigger
+	 * a throttle, which is not supposed to happen on unthrottle path.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c143..359bb858cffd3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 extern bool cfs_task_bw_constrained(struct task_struct *p);
 
-- 
2.39.5


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-29  7:46 [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
@ 2025-09-29  9:34 ` K Prateek Nayak
  2025-09-29 10:55   ` Aaron Lu
  2025-09-30  7:56   ` Aaron Lu
  2025-10-14  7:43 ` Hao Jia
  1 sibling, 2 replies; 25+ messages in thread
From: K Prateek Nayak @ 2025-09-29  9:34 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

Hello Aaron,

On 9/29/2025 1:16 PM, Aaron Lu wrote:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
> 
> When running a test with the following hierarchy:
> 
>           root
>         /      \
>         A*     ...
>      /  |  \   ...
>         B
>        /  \
>       C*
> 
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
> 
> Debugging showed it happened like this:
> Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
> 
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
> 
> Fix this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth()
> immediately after enabling bandwidth and setting runtime_remaining = 0.
> This ensures cfs_rq_c is throttled upfront and cannot enter the enqueue
> path in an unthrottled state with no runtime.
> 
> Also, update outdated comments in tg_throttle_down() since
> unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> 
> While at it, remove a redundant assignment to se in tg_throttle_down().
> 
> Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle model")
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>  kernel/sched/core.c  |  9 ++++++++-
>  kernel/sched/fair.c  | 16 +++++++---------
>  kernel/sched/sched.h |  1 +
>  3 files changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 7f1e5cb94c536..421166d431fa7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9608,7 +9608,14 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
>  		cfs_rq->runtime_enabled = runtime_enabled;
>  		cfs_rq->runtime_remaining = 0;
>  
> -		if (cfs_rq->throttled)
> +		/*
> +		 * Throttle cfs_rq now or it can be unthrottled with zero
> +		 * runtime_remaining and gets throttled on its unthrottle path.
> +		 */
> +		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
> +			throttle_cfs_rq(cfs_rq);

So one downside of this is that throttle_cfs_rq() here can assign
bandwidth to an empty cfs_rq, and a genuine enqueue later on another CPU
might not find bandwidth, thus delaying its execution.

Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
if we find cfs_rq->throttled_limbo_list to be empty?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..fd2d4dad9c27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
 	 */
 	if (throttled_hierarchy(cfs_rq) &&
 	    !task_current_donor(rq_of(cfs_rq), p)) {
+		if (list_empty(&cfs_rq->throttled_limbo_list))
+			check_enqueue_throttle(cfs_rq);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		return true;
 	}
---

> +
> +		if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
>  			unthrottle_cfs_rq(cfs_rq);
>  	}
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 22e6dd3af82fc..3ef11783369d7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5976,7 +5976,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
>  	return 0;
>  }
>  
> -static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> +bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
>  {
>  	struct rq *rq = rq_of(cfs_rq);
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> @@ -6025,19 +6025,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  
>  	/*
>  	 * It's possible we are called with !runtime_remaining due to things
> -	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
> -	 * unthrottled us with a positive runtime_remaining but other still
> -	 * running entities consumed those runtime before we reached here.
> +	 * like async unthrottled us with a positive runtime_remaining but
> +	 * other still running entities consumed those runtime before we
> +	 * reached here.
>  	 *
> -	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
> -	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
> -	 * throttle, which is not supposed to happen on unthrottle path.
> +	 * We can't unthrottle this cfs_rq without any runtime remaining
> +	 * because any enqueue in tg_unthrottle_up() will immediately trigger
> +	 * a throttle, which is not supposed to happen on unthrottle path.
>  	 */
>  	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>  		return;
>  
> -	se = cfs_rq->tg->se[cpu_of(rq)];
> -

Ack on these bits!

>  	cfs_rq->throttled = 0;
>  
>  	update_rq_clock(rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b5367c514c143..359bb858cffd3 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -558,6 +558,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
>  
>  extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
>  extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
> +extern bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
>  extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
>  extern bool cfs_task_bw_constrained(struct task_struct *p);
>  

-- 
Thanks and Regards,
Prateek



* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-29  9:34 ` K Prateek Nayak
@ 2025-09-29 10:55   ` Aaron Lu
  2025-09-30  7:56   ` Aaron Lu
  1 sibling, 0 replies; 25+ messages in thread
From: Aaron Lu @ 2025-09-29 10:55 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

Hi Prateek,

Thanks for taking a look and the suggestion.

On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 9/29/2025 1:16 PM, Aaron Lu wrote:
> > When a cfs_rq is to be throttled, its limbo list should be empty and
> > that's why there is a warn in tg_throttle_down() for non empty
> > cfs_rq->throttled_limbo_list.
> > 
> > When running a test with the following hierarchy:
> > 
> >           root
> >         /      \
> >         A*     ...
> >      /  |  \   ...
> >         B
> >        /  \
> >       C*
> > 
> > where both A and C have quota settings, that warn on non empty limbo list
> > is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> > part of the cfs_rq for the sake of simpler representation).
> > 
> > Debugging showed it happened like this:
> > Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> > cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> > equals to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> > *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> > group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> > called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> > these throttled tasks are placed into cfs_rq_c's limbo list by
> > enqueue_throttled_task().
> > 
> > Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> > tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> > runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> > can't get more runtime and enters tg_throttle_down(), where the warning
> > is hit due to remaining tasks in the limbo list.
> > 
> > Fix this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth()
> > immediately after enabling bandwidth and setting runtime_remaining = 0.
> > This ensures cfs_rq_c is throttled upfront and cannot enter the enqueue
> > path in an unthrottled state with no runtime.
> > 
> > Also, update outdated comments in tg_throttle_down() since
> > unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> > 
> > While at it, remove a redundant assignment to se in tg_throttle_down().
> > 
> > Fixes: e1fad12dcb66("sched/fair: Switch to task based throttle model")
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >  kernel/sched/core.c  |  9 ++++++++-
> >  kernel/sched/fair.c  | 16 +++++++---------
> >  kernel/sched/sched.h |  1 +
> >  3 files changed, 16 insertions(+), 10 deletions(-)
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7f1e5cb94c536..421166d431fa7 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9608,7 +9608,14 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
> >  		cfs_rq->runtime_enabled = runtime_enabled;
> >  		cfs_rq->runtime_remaining = 0;
> >  
> > -		if (cfs_rq->throttled)
> > +		/*
> > +		 * Throttle cfs_rq now or it can be unthrottled with zero
> > +		 * runtime_remaining and gets throttled on its unthrottle path.
> > +		 */
> > +		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
> > +			throttle_cfs_rq(cfs_rq);
> 
> So one downside of this is throttle_cfs_rq() here can assign bandwidth
> to an empty cfs_rq and a genuine enqueue later on another CPU might not
> find bandwidth thus delaying its execution.

Agree that assign doesn't make sense here.

> 
> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> if we find cfs_rq->throttled_limbo_list to be empty?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..fd2d4dad9c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
>  	 */
>  	if (throttled_hierarchy(cfs_rq) &&
>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> +			check_enqueue_throttle(cfs_rq);
>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>  		return true;
>  	}
> ---
> 

Works for me, will follow your suggestion if no other comments, thanks!


* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-29  9:34 ` K Prateek Nayak
  2025-09-29 10:55   ` Aaron Lu
@ 2025-09-30  7:56   ` Aaron Lu
  2025-09-30  8:58     ` K Prateek Nayak
  1 sibling, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-09-30  7:56 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
... ...
> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> if we find cfs_rq->throttled_limbo_list to be empty?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..fd2d4dad9c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
>  	 */
>  	if (throttled_hierarchy(cfs_rq) &&
>  	    !task_current_donor(rq_of(cfs_rq), p)) {
                /*
                 * Make sure to throttle this cfs_rq or it can be unthrottled
                 * with no runtime_remaining and gets throttled again on its
                 * unthrottle path.
                 */
> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> +			check_enqueue_throttle(cfs_rq);

BTW, do you think a comment is needed? Something like the above, not
sure if it's too redundant though, feel free to let me know your
thoughts, thanks.

>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>  		return true;
>  	}
> ---


* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30  7:56   ` Aaron Lu
@ 2025-09-30  8:58     ` K Prateek Nayak
  2025-09-30  9:27       ` Aaron Lu
  2025-09-30 11:07       ` Aaron Lu
  0 siblings, 2 replies; 25+ messages in thread
From: K Prateek Nayak @ 2025-09-30  8:58 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

Hello Aaron,

On 9/30/2025 1:26 PM, Aaron Lu wrote:
> On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> ... ...
>> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
>> if we find cfs_rq->throttled_limbo_list to be empty?
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18a30ae35441..fd2d4dad9c27 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
>>  	 */
>>  	if (throttled_hierarchy(cfs_rq) &&
>>  	    !task_current_donor(rq_of(cfs_rq), p)) {
>                 /*
>                  * Make sure to throttle this cfs_rq or it can be unthrottled
>                  * with no runtime_remaining and gets throttled again on its
>                  * unthrottle path.
>                  */
>> +		if (list_empty(&cfs_rq->throttled_limbo_list))
>> +			check_enqueue_throttle(cfs_rq);
> 
> BTW, do you think a comment is needed? Something like the above, not
> sure if it's too redundant though, feel free to let me know your
> thoughts, thanks.

Now that I'm looking at it again, I think we should actually do a:

    for_each_entity(se)
        check_enqueue_throttle(cfs_rq_of(se));

The reason being, we can have:

    root -> A (throttled) -> B -> C

Consider B has runtime_remaining = 0, and subsequently a throttled task
is queued onto C. Ideally, we should start the B/W timer for B at that
point but we bail out after queuing it on C. Thoughts?

Since we only catch up to the 0 runtime_remaining point, it should be
fine.

The comment can perhaps be something like:

	/*
	 * If this is the first enqueue on throttled hierarchy,
	 * ensure bandwidth is available when the hierarchy is
	 * unthrottled. check_enqueue_throttle() will ensure
	 * either some bandwidth is available, or will throttle
	 * the cfs_rq and queue the bandwidth timer.
	 */

> 
>>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>>  		return true;
>>  	}
>> ---

-- 
Thanks and Regards,
Prateek



* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30  8:58     ` K Prateek Nayak
@ 2025-09-30  9:27       ` Aaron Lu
  2025-09-30 11:07       ` Aaron Lu
  1 sibling, 0 replies; 25+ messages in thread
From: Aaron Lu @ 2025-09-30  9:27 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

Hi Prateek,

On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 9/30/2025 1:26 PM, Aaron Lu wrote:
> > On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> > ... ...
> >> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> >> if we find cfs_rq->throttled_limbo_list to be empty?
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 18a30ae35441..fd2d4dad9c27 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
> >>  	 */
> >>  	if (throttled_hierarchy(cfs_rq) &&
> >>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> >                 /*
> >                  * Make sure to throttle this cfs_rq or it can be unthrottled
> >                  * with no runtime_remaining and gets throttled again on its
> >                  * unthrottle path.
> >                  */
> >> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> >> +			check_enqueue_throttle(cfs_rq);
> > 
> > BTW, do you think a comment is needed? Something like the above, not
> > sure if it's too redundant though, feel free to let me know your
> > thoughts, thanks.
> 
> Now that I'm looking at it again, I think we should actually do a:
> 
>     for_each_entity(se)
>         check_enqueue_throttle(cfs_rq_of(se));

Nice catch and sigh.

> 
> The reason being, we can have:
> 
>     root -> A (throttled) -> B -> C
> 
> Consider B has runtime_remaining = 0, and subsequently a throttled task
> is queued onto C. Ideally, we should start the B/W timer for B at that
> point but we bail out after queuing it on C. Thoughts?
>

If we want to make sure no cfs_rq with runtime_enabled gets unthrottled
with zero runtime_remaining, I agree we will have to do that in a
hierarchical way.

I don't feel good about that for_each_entity(se) check_enqueue_throttle()
though; it makes me feel we are duplicating enqueue_task_fair() somehow...

With this said, if we have to do that hierarchical check, I would prefer
to throttle it upfront in tg_set_cfs_bandwidth() :) The useless assignment
of runtime is just 1ns, and it should only affect the first period, so it
shouldn't matter much?


* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30  8:58     ` K Prateek Nayak
  2025-09-30  9:27       ` Aaron Lu
@ 2025-09-30 11:07       ` Aaron Lu
  2025-09-30 12:39         ` Aaron Lu
  2025-09-30 13:38         ` K Prateek Nayak
  1 sibling, 2 replies; 25+ messages in thread
From: Aaron Lu @ 2025-09-30 11:07 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 9/30/2025 1:26 PM, Aaron Lu wrote:
> > On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> > ... ...
> >> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> >> if we find cfs_rq->throttled_limbo_list to be empty?
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 18a30ae35441..fd2d4dad9c27 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
> >>  	 */
> >>  	if (throttled_hierarchy(cfs_rq) &&
> >>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> >                 /*
> >                  * Make sure to throttle this cfs_rq or it can be unthrottled
> >                  * with no runtime_remaining and gets throttled again on its
> >                  * unthrottle path.
> >                  */
> >> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> >> +			check_enqueue_throttle(cfs_rq);
> > 
> > BTW, do you think a comment is needed? Something like the above, not
> > sure if it's too redundant though, feel free to let me know your
> > thoughts, thanks.
> 
> Now that I'm looking at it again, I think we should actually do a:
> 
>     for_each_entity(se)
>         check_enqueue_throttle(cfs_rq_of(se));
> 
> The reason being, we can have:
> 
>     root -> A (throttled) -> B -> C
> 
> Consider B has runtime_remaining = 0, and subsequently a throttled task
> is queued onto C. Ideally, we should start the B/W timer for B at that
> point but we bail out after queuing it on C. Thoughts?

Yes agree the B/W timer should also be considered.

So in my original patch, cfs_rqs will (most likely) start with
runtime_remaining == 1 and be unthrottled after calling throttle_cfs_rq(),
which will also start the B/W timer. The timer is not needed in this
case, when no cfs_rqs are actually throttled, but it doesn't hurt. Looks
like everything is OK and we do not need to do any special handling in
enqueue_throttled_task(). Thoughts?


* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30 11:07       ` Aaron Lu
@ 2025-09-30 12:39         ` Aaron Lu
  2025-09-30 13:38         ` K Prateek Nayak
  1 sibling, 0 replies; 25+ messages in thread
From: Aaron Lu @ 2025-09-30 12:39 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

On Tue, Sep 30, 2025 at 07:07:17PM +0800, Aaron Lu wrote:
> On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> > Hello Aaron,
> > 
> > On 9/30/2025 1:26 PM, Aaron Lu wrote:
> > > On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> > > ... ...
> > >> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> > >> if we find cfs_rq->throttled_limbo_list to be empty?
> > >>
> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > >> index 18a30ae35441..fd2d4dad9c27 100644
> > >> --- a/kernel/sched/fair.c
> > >> +++ b/kernel/sched/fair.c
> > >> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
> > >>  	 */
> > >>  	if (throttled_hierarchy(cfs_rq) &&
> > >>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> > >                 /*
> > >                  * Make sure to throttle this cfs_rq or it can be unthrottled
> > >                  * with no runtime_remaining and gets throttled again on its
> > >                  * unthrottle path.
> > >                  */
> > >> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> > >> +			check_enqueue_throttle(cfs_rq);
> > > 
> > > BTW, do you think a comment is needed? Something like the above, not
> > > sure if it's too redundant though, feel free to let me know your
> > > thoughts, thanks.
> > 
> > Now that I'm looking at it again, I think we should actually do a:
> > 
> >     for_each_entity(se)
> >         check_enqueue_throttle(cfs_rq_of(se));
> > 
> > The reason being, we can have:
> > 
> >     root -> A (throttled) -> B -> C
> > 
> > Consider B has runtime_remaining = 0, and subsequently a throttled task
> > is queued onto C. Ideally, we should start the B/W timer for B at that
> > point but we bail out after queuing it on C. Thoughts?
> 
> Yes agree the B/W timer should also be considered.

On second thought, do we really need to care about the B/W timer for B?

I mean, when C is unthrottled and gets enqueued on B,
check_enqueue_throttle() will do the right thing for B, so I don't
think we need to do this hierarchical check_enqueue_throttle() here.

I think the only difference between your suggestion and my patch is that,
with your suggestion, it's possible for a runtime_enabled cfs_rq to reach
tg_unthrottle_up() with runtime_remaining equal to 0, but since it
doesn't have any tasks in its limbo list, it will not do any enqueue and
so can't trigger a throttle there; it's still fine. I.e., I think
your original suggestion works.


* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30 11:07       ` Aaron Lu
  2025-09-30 12:39         ` Aaron Lu
@ 2025-09-30 13:38         ` K Prateek Nayak
  2025-10-01 11:58           ` Aaron Lu
  1 sibling, 1 reply; 25+ messages in thread
From: K Prateek Nayak @ 2025-09-30 13:38 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

Hello Aaron,

I'll merge the two replies into one.

On 9/30/2025 4:37 PM, Aaron Lu wrote:
> So in my original patch, cfs_rqs will (most likely) start with
> runtime_remaining == 1 and unthrottled after calling throttle_cfs_rq(),
> which will also start the B/W timer. The timer is not needed in this
> case when no cfs_rqs are actually throttled but it doesn't hurt. Looks
> like everything is OK, we do not need to do any special handling in
> enqueue_throttled_task(). Thoughts?

Now that I look at throttle_cfs_rq() properly, we'll only move the
runtime_remaining from 0 to 1, so at most a few usecs' worth of
bandwidth gets distributed, which should be okay. Sorry for being
overly cautious!

So your current approach should be good. Please feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

As for the other thread:

On 9/30/2025 6:09 PM, Aaron Lu wrote:
>>>
>>>     root -> A (throttled) -> B -> C
>>>
>>> Consider B has runtime_remaining = 0, and subsequently a throttled task
>>> is queued onto C. Ideally, we should start the B/W timer for B at that
>>> point but we bail out after queuing it on C. Thoughts?
>>
>> Yes agree the B/W timer should also be considered.
> 
> On second thought, do we really need to care about the B/W timer for B?
> 
> I mean, when C is unthrottled and gets enqueued on B,
> check_enqueue_throttle() will do the right thing for B so I don't
> think we need to do this hierarchy check_enqueue_throttle() here.

So what I thought would happen here is: when A is unthrottled, you'll
enqueue the task, only then realize B doesn't have any bandwidth, and
start the timer at that point. Had you identified it earlier,
distribution could have already added some bandwidth to B, and the task
could then run without any further latency.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-30 13:38         ` K Prateek Nayak
@ 2025-10-01 11:58           ` Aaron Lu
  0 siblings, 0 replies; 25+ messages in thread
From: Aaron Lu @ 2025-10-01 11:58 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Chen Yu,
	Matteo Martelli, Michal Koutný, Sebastian Andrzej Siewior

On Tue, Sep 30, 2025 at 07:08:20PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> I'll merge the two replies in one.
> 
> On 9/30/2025 4:37 PM, Aaron Lu wrote:
> > So in my original patch, cfs_rqs will (most likely) start with
> > runtime_remaining == 1 and unthrottled after calling throttle_cfs_rq(),
> > which will also start the B/W timer. The timer is not needed in this
> > case when no cfs_rqs are actually throttled but it doesn't hurt. Looks
> > like everything is OK, we do not need to do any special handling in
> > enqueue_throttled_task(). Thoughts?
> 
> Now that I look at throttle_cfs_rq() properly, we'll only move the
> runtime_remaining from 0 to 1, so at most a few usecs' worth of
> bandwidth gets distributed, which should be okay. Sorry for being
> overly cautious!

Never mind.

> 
> So your current approach should be good. Please feel free to include:
> 
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-09-29  7:46 [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
  2025-09-29  9:34 ` K Prateek Nayak
@ 2025-10-14  7:43 ` Hao Jia
  2025-10-14  9:11   ` Aaron Lu
  1 sibling, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-14  7:43 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
	Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior


Hello Aaron,

On 2025/9/29 15:46, Aaron Lu wrote:
> When a cfs_rq is to be throttled, its limbo list should be empty and
> that's why there is a warn in tg_throttle_down() for non empty
> cfs_rq->throttled_limbo_list.
> 
> When running a test with the following hierarchy:
> 
>            root
>          /      \
>          A*     ...
>       /  |  \   ...
>          B
>         /  \
>        C*
> 
> where both A and C have quota settings, that warn on non empty limbo list
> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> part of the cfs_rq for the sake of simpler representation).
> 

I encountered a similar warning a while ago and fixed it, and I have a
question I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls
enqueue_task_fair(p) to enqueue a task, which requires that the
runtime_remaining of every level of task p's task_group hierarchy be
greater than 0.

In addition to the case you fixed above, when bandwidth is running
normally, is it possible that there's a corner case where
cfs_A->runtime_remaining > 0 but cfs_B->runtime_remaining < 0 could
trigger a similar warning?

So I previously tried to fix this issue with the following code, adding
an ENQUEUE_THROTTLE flag to ensure that tasks enqueued in
tg_unthrottle_up() aren't throttled.

---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 1 +
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e..128efa2eba57 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,7 +5290,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		if (!(flags & ENQUEUE_THROTTLE))
+			check_enqueue_throttle(cfs_rq);
+
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
 		if (cfs_rq->pelt_clock_throttled) {
@@ -5905,7 +5907,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c14..871dfb761676 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2358,6 +2358,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATING	0x100
 #define ENQUEUE_DELAYED		0x200
 #define ENQUEUE_RQ_SELECTED	0x400
+#define ENQUEUE_THROTTLE	0x800
 
 #define RETRY_TASK		((void *)-1UL)
---

Unfortunately, when I tried to build some tests locally, I couldn't
reproduce this corner case.

Thanks,
Hao

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-14  7:43 ` Hao Jia
@ 2025-10-14  9:11   ` Aaron Lu
  2025-10-14 11:01     ` Hao Jia
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-14  9:11 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

Hi Hao,

On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
> 
> Hello Aaron,
> 
> On 2025/9/29 15:46, Aaron Lu wrote:
> > When a cfs_rq is to be throttled, its limbo list should be empty and
> > that's why there is a warn in tg_throttle_down() for non empty
> > cfs_rq->throttled_limbo_list.
> > 
> > When running a test with the following hierarchy:
> > 
> >            root
> >          /      \
> >          A*     ...
> >       /  |  \   ...
> >          B
> >         /  \
> >        C*
> > 
> > where both A and C have quota settings, that warn on non empty limbo list
> > is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> > part of the cfs_rq for the sake of simpler representation).
> > 
> 
> I encountered a similar warning a while ago and fixed it. I have a question
> I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
> enqueue a task, which requires that the runtime_remaining of task p's entire
> task_group hierarchy be greater than 0.
> 
> In addition to the case you fixed above, when bandwidth is running
> normally, is it possible that there's a corner case where
> cfs_A->runtime_remaining > 0 but cfs_B->runtime_remaining < 0 could
> trigger a similar warning?

Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
In this case, B should be throttled, and C is a descendant of B so it
should also be throttled, i.e. C can't be unthrottled while B is in the
throttled state. Do I understand you correctly?

> 
> So, I previously tried to fix this issue using the following code, adding
> the ENQUEUE_THROTTLE flag to ensure that tasks enqueued in
> tg_unthrottle_up() aren't throttled.
> 

Yeah, I think this can also fix the warning.
I'm not sure it is a good idea though, because on unthrottle the
expectation is that this cfs_rq has runtime_remaining > 0, and if
that's not the case, I think it is better to know why.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-14  9:11   ` Aaron Lu
@ 2025-10-14 11:01     ` Hao Jia
  2025-10-14 11:50       ` Aaron Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-14 11:01 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior


Hello Aaron,

Thank you for your reply.

On 2025/10/14 17:11, Aaron Lu wrote:
> Hi Hao,
> 
> On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
>>
>> Hello Aaron,
>>
>> On 2025/9/29 15:46, Aaron Lu wrote:
>>> When a cfs_rq is to be throttled, its limbo list should be empty and
>>> that's why there is a warn in tg_throttle_down() for non empty
>>> cfs_rq->throttled_limbo_list.
>>>
>>> When running a test with the following hierarchy:
>>>
>>>             root
>>>           /      \
>>>           A*     ...
>>>        /  |  \   ...
>>>           B
>>>          /  \
>>>         C*
>>>
>>> where both A and C have quota settings, that warn on non empty limbo list
>>> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
>>> part of the cfs_rq for the sake of simpler representation).
>>>
>>
>> I encountered a similar warning a while ago and fixed it. I have a question
>> I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
>> enqueue a task, which requires that the runtime_remaining of task p's entire
>> task_group hierarchy be greater than 0.
>>
>> In addition to the case you fixed above,
>> When bandwidth is running normally, Is it possible that there's a corner
>> case where cfs_A->runtime_remaining > 0, but cfs_B->runtime_remaining < 0
>> could trigger a similar warning?
> 
> Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
> In this case, B should be throttled and C is a descendent of B so should
> also be throttled, i.e. C can't be unthrottled when B is in throttled
> state. Do I understand you correctly?
>
Yes, both A and B have quota set.

Is there a possible corner case?
Asynchronous unthrottling lets other running entities completely
consume cfs_B->runtime_remaining (cfs_B->runtime_remaining < 0) while
not completely consuming cfs_A->runtime_remaining
(cfs_A->runtime_remaining > 0) by the time we call
unthrottle_cfs_rq(cfs_rq_A).

When we call unthrottle_cfs_rq(cfs_rq_A), cfs_A->runtime_remaining > 0,
but if cfs_B->runtime_remaining < 0 at this time, then when
enqueue_task_fair(p)->check_enqueue_throttle(cfs_rq_B)->throttle_cfs_rq(cfs_rq_B)
runs, a warning may be triggered.

My core question is:
When we call unthrottle_cfs_rq(cfs_rq_A), we only check
cfs_rq_A->runtime_remaining. However,
enqueue_task_fair(p)->enqueue_entity(C->B->A)->check_enqueue_throttle()
does require that the runtime_remaining of every level of task p's
task_group hierarchy be greater than 0.

Can we guarantee this?

Thanks,
Hao

>>
>> So, I previously tried to fix this issue using the following code, adding
>> the ENQUEUE_THROTTLE flag to ensure that tasks enqueued in
>> tg_unthrottle_up() aren't throttled.
>>
> 
> Yeah I think this can also fix the warning.
> I'm not sure if it is a good idea though, because on unthrottle, the
> expectation is, this cfs_rq should have runtime_remaining > 0 and if
> it's not the case, I think it is better to know why.
> 
> Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-14 11:01     ` Hao Jia
@ 2025-10-14 11:50       ` Aaron Lu
  2025-10-15  1:43         ` Hao Jia
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-14 11:50 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Tue, Oct 14, 2025 at 07:01:15PM +0800, Hao Jia wrote:
> 
> Hello Aaron,
> 
> Thank you for your reply.
> 
> On 2025/10/14 17:11, Aaron Lu wrote:
> > Hi Hao,
> > 
> > On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
> > > 
> > > Hello Aaron,
> > > 
> > > On 2025/9/29 15:46, Aaron Lu wrote:
> > > > When a cfs_rq is to be throttled, its limbo list should be empty and
> > > > that's why there is a warn in tg_throttle_down() for non empty
> > > > cfs_rq->throttled_limbo_list.
> > > > 
> > > > When running a test with the following hierarchy:
> > > > 
> > > >             root
> > > >           /      \
> > > >           A*     ...
> > > >        /  |  \   ...
> > > >           B
> > > >          /  \
> > > >         C*
> > > > 
> > > > where both A and C have quota settings, that warn on non empty limbo list
> > > > is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
> > > > part of the cfs_rq for the sake of simpler representation).
> > > > 
> > > 
> > > I encountered a similar warning a while ago and fixed it. I have a question
> > > I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
> > > enqueue a task, which requires that the runtime_remaining of task p's entire
> > > task_group hierarchy be greater than 0.
> > > 
> > > In addition to the case you fixed above,
> > > When bandwidth is running normally, Is it possible that there's a corner
> > > case where cfs_A->runtime_remaining > 0, but cfs_B->runtime_remaining < 0
> > > could trigger a similar warning?
> > 
> > Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
> > In this case, B should be throttled and C is a descendent of B so should
> > also be throttled, i.e. C can't be unthrottled when B is in throttled
> > state. Do I understand you correctly?
> > 
> Yes, both A and B have quota set.
> 
> Is there a possible corner case?
> Asynchronous unthrottling causes other running entities to completely
> consume cfs_B->runtime_remaining (cfs_B->runtime_remaining < 0) but not
> completely consume cfs_A->runtime_remaining (cfs_A->runtime_remaining > 0)
> when we call unthrottle_cfs_rq(cfs_rq_A) .

Let me try to understand the situation here: in your described setup,
all three task groups (A, B, C) have quota set?

> 
> When we unthrottle_cfs_rq(cfs_rq_A), cfs_A->runtime_remaining > 0, but if
> cfs_B->runtime_remaining < 0 at this time,

Hmm... if cfs_B->runtime_remaining < 0, why isn't it throttled?

> therefore, when enqueue_task_fair(p)->check_enqueue_throttle(cfs_rq_B)->throttle_cfs_rq(cfs_rq_B),

I assume p is a task of group B?
So when A is unthrottled, since p is a throttled task of group B and B
is still throttled, enqueue_task_fair(p) should not happen.

> a warning may be triggered.
> 
> My core question is:
> When we call unthrottle_cfs_rq(cfs_rq_A), we only check
> cfs_rq_A->runtime_remaining. However,
> enqueue_task_fair(p)->enqueue_entity(C->B->A)->check_enqueue_throttle() does

According to this info, I assume p is a task of group C here. If
unthrottling A would cause p to be enqueued, that means either groups C
and B do not have quota set, or groups C and B are in the unthrottled
state.

> require that the runtime_remaining of each task_group level of task p is
> greater than 0.

If group C and B are in unthrottled state, their runtime_remaining
should be > 0.

> 
> Can we guarantee this?

To guarantee this, a warn like the one below could be used. Can you try
your setup and see if you can hit it? Thanks.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ef11783369d7..c347aa28c411a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5908,6 +5908,8 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->throttled_clock_self_time += delta;
 	}
 
+	WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
+
 	/* Re-enqueue the tasks that have been throttled at this level. */
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-14 11:50       ` Aaron Lu
@ 2025-10-15  1:43         ` Hao Jia
  2025-10-15  1:48           ` Hao Jia
  2025-10-15  2:51           ` Aaron Lu
  0 siblings, 2 replies; 25+ messages in thread
From: Hao Jia @ 2025-10-15  1:43 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior



On 2025/10/14 19:50, Aaron Lu wrote:
> On Tue, Oct 14, 2025 at 07:01:15PM +0800, Hao Jia wrote:
>>
>> Hello Aaron,
>>
>> Thank you for your reply.
>>
>> On 2025/10/14 17:11, Aaron Lu wrote:
>>> Hi Hao,
>>>
>>> On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
>>>>
>>>> Hello Aaron,
>>>>
>>>> On 2025/9/29 15:46, Aaron Lu wrote:
>>>>> When a cfs_rq is to be throttled, its limbo list should be empty and
>>>>> that's why there is a warn in tg_throttle_down() for non empty
>>>>> cfs_rq->throttled_limbo_list.
>>>>>
>>>>> When running a test with the following hierarchy:
>>>>>
>>>>>              root
>>>>>            /      \
>>>>>            A*     ...
>>>>>         /  |  \   ...
>>>>>            B
>>>>>           /  \
>>>>>          C*
>>>>>
>>>>> where both A and C have quota settings, that warn on non empty limbo list
>>>>> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore the cpu
>>>>> part of the cfs_rq for the sake of simpler representation).
>>>>>
>>>>
>>>> I encountered a similar warning a while ago and fixed it. I have a question
>>>> I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls enqueue_task_fair(p) to
>>>> enqueue a task, which requires that the runtime_remaining of task p's entire
>>>> task_group hierarchy be greater than 0.
>>>>
>>>> In addition to the case you fixed above,
>>>> When bandwidth is running normally, Is it possible that there's a corner
>>>> case where cfs_A->runtime_remaining > 0, but cfs_B->runtime_remaining < 0
>>>> could trigger a similar warning?
>>>
>>> Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
>>> In this case, B should be throttled and C is a descendent of B so should
>>> also be throttled, i.e. C can't be unthrottled when B is in throttled
>>> state. Do I understand you correctly?
>>>
>> Yes, both A and B have quota set.
>>
>> Is there a possible corner case?
>> Asynchronous unthrottling causes other running entities to completely
>> consume cfs_B->runtime_remaining (cfs_B->runtime_remaining < 0) but not
>> completely consume cfs_A->runtime_remaining (cfs_A->runtime_remaining > 0)
>> when we call unthrottle_cfs_rq(cfs_rq_A) .
> 
> Let me try to understand the situation here: in your described setup,
> all three task groups(A, B, C) have quota set?
> 
>>
>> When we unthrottle_cfs_rq(cfs_rq_A), cfs_A->runtime_remaining > 0, but if
>> cfs_B->runtime_remaining < 0 at this time,
> 
> Hmm... if cfs_B->runtime_remaining < 0, why it's not throttled?
> 
>> therefore, when enqueue_task_fair(p)->check_enqueue_throttle(cfs_rq_B)->throttle_cfs_rq(cfs_rq_B),
> 
> I assume p is a task of group B?
> So when A is unthrottled, since p is a throttled task of group B and B
> is still throttled, enqueue_task_fair(p) should not happen.
> 
>> an warnning may be triggered.
>>
>> My core question is:
>> When we call unthrottle_cfs_rq(cfs_rq_A), we only check
>> cfs_rq_A->runtime_remaining. However,
>> enqueue_task_fair(p)->enqueue_entity(C->B->A)->check_enqueue_throttle() does
> 
> According to this info, I assume p is a task of group C here. If
> unthrottle A would cause enqueuing p, that means: either group C and B
> do not have quota set or group C and B are in unthrottled state.
> 
>> require that the runtime_remaining of each task_group level of task p is
>> greater than 0.
> 
> If group C and B are in unthrottled state, their runtime_remaining
> should be > 0.
> 
>>
>> Can we guarantee this?
> 
> To guarantee this, a warn like below could be used. Can you try in your
> setup if you can hit it? Thanks.

Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in 
tg_unthrottle_up().

This morning, after applying your patch, I still get the same issue.
However, as before, because cfs_rq->curr isn't NULL,
check_enqueue_throttle() returns prematurely, preventing
throttle_cfs_rq() from being triggered.



Some information to share with you.

<...>-13123   [023] dN.1.    69.245648: tg_unthrottle_up: cfs_rq->runtime_remaining -5487 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c80ad300 cfs_rq->nr_queued 0
 sched-messaging-4299    [020] dNh2.   108.475966: tg_unthrottle_up: cfs_rq->runtime_remaining -1073 cfs_rq->throttled 0 cfs_rq->curr ffff8e57d2ef7400 cfs_rq->nr_queued 2
 sched-messaging-3514    [009] dNh2.   109.640727: tg_unthrottle_up: cfs_rq->runtime_remaining -11345 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad5a00 cfs_rq->nr_queued 4
            <...>-13872   [005] dNh2.   117.155177: tg_unthrottle_up: cfs_rq->runtime_remaining -15195 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7e00 cfs_rq->nr_queued 5
            <...>-9861    [005] dNh2.   139.307579: tg_unthrottle_up: cfs_rq->runtime_remaining -41639 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f4f38400 cfs_rq->nr_queued 20
            <...>-9861    [005] dNh2.   139.307598: tg_unthrottle_up: cfs_rq->runtime_remaining -41639 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7e00 cfs_rq->nr_queued 7
 sched-messaging-11165   [013] dNh2.   158.640827: tg_unthrottle_up: cfs_rq->runtime_remaining -32575 cfs_rq->throttled 0 cfs_rq->curr ffff8e585693a400 cfs_rq->nr_queued 5
 sched-messaging-15337   [002] dNh2.   178.400559: tg_unthrottle_up: cfs_rq->runtime_remaining -29005 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ff4f6000 cfs_rq->nr_queued 7
 sched-messaging-8740    [010] dNh2.   178.842565: tg_unthrottle_up: cfs_rq->runtime_remaining -34448 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f4f4c800 cfs_rq->nr_queued 141
 sched-messaging-8740    [010] dNh2.   178.842583: tg_unthrottle_up: cfs_rq->runtime_remaining -34448 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad4e00 cfs_rq->nr_queued 4
            <...>-6101    [000] dNh1.   179.185475: tg_unthrottle_up: cfs_rq->runtime_remaining -14179 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ea202000 cfs_rq->nr_queued 116
 sched-messaging-6101    [000] dNh1.   179.185489: tg_unthrottle_up: cfs_rq->runtime_remaining -14179 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c7078400 cfs_rq->nr_queued 6
            <...>-7812    [002] dNh2.   184.228104: tg_unthrottle_up: cfs_rq->runtime_remaining -21154 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f597c200 cfs_rq->nr_queued 75
            <...>-7812    [002] dNh2.   184.228128: tg_unthrottle_up: cfs_rq->runtime_remaining -21154 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c7079600 cfs_rq->nr_queued 7
            <...>-4589    [006] dNh2.   185.872173: tg_unthrottle_up: cfs_rq->runtime_remaining -25309 cfs_rq->throttled 0 cfs_rq->curr ffff8e584debc800 cfs_rq->nr_queued 173
            <...>-4589    [006] dNh2.   185.872192: tg_unthrottle_up: cfs_rq->runtime_remaining -25309 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7800 cfs_rq->nr_queued 7
 sched-messaging-5155    [008] dNh2.   188.689689: tg_unthrottle_up: cfs_rq->runtime_remaining -14108 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ea20e800 cfs_rq->nr_queued 150
 sched-messaging-5155    [008] dNh2.   188.689704: tg_unthrottle_up: cfs_rq->runtime_remaining -14108 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7600 cfs_rq->nr_queued 7
 sched-messaging-15265   [002] dNh2.   190.920442: tg_unthrottle_up: cfs_rq->runtime_remaining -25129 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ff4f6000 cfs_rq->nr_queued 6
            <...>-10191   [007] dNh1.   191.851170: tg_unthrottle_up: cfs_rq->runtime_remaining -27077 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7200 cfs_rq->nr_queued 4
 sched-messaging-5963    [008] dNh2.   193.298329: tg_unthrottle_up: cfs_rq->runtime_remaining -41126 cfs_rq->throttled 0 cfs_rq->curr ffff8e57e9d31400 cfs_rq->nr_queued 156
 sched-messaging-5963    [008] dNh2.   193.298345: tg_unthrottle_up: cfs_rq->runtime_remaining -41126 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad7600 cfs_rq->nr_queued 6
            <...>-4877    [021] dNh2.   198.988637: tg_unthrottle_up: cfs_rq->runtime_remaining -19362 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ea2bf400 cfs_rq->nr_queued 86
            <...>-4877    [021] dNh2.   198.988657: tg_unthrottle_up: cfs_rq->runtime_remaining -19362 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c47e9e00 cfs_rq->nr_queued 6
            <...>-7344    [015] dNh2.   208.086556: tg_unthrottle_up: cfs_rq->runtime_remaining -31964 cfs_rq->throttled 0 cfs_rq->curr ffff8e5856938a00 cfs_rq->nr_queued 4
            <...>-8384    [020] dNh2.   209.858709: tg_unthrottle_up: cfs_rq->runtime_remaining -28618 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f32ede00 cfs_rq->nr_queued 133
            <...>-8384    [020] dNh2.   209.858736: tg_unthrottle_up: cfs_rq->runtime_remaining -28618 cfs_rq->throttled 0 cfs_rq->curr ffff8e57d2ef7400 cfs_rq->nr_queued 7
 sched-messaging-7803    [000] dNh2.   210.380559: tg_unthrottle_up: cfs_rq->runtime_remaining -18845 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c7078400 cfs_rq->nr_queued 8
 sched-messaging-4388    [020] dNh2.   211.688504: tg_unthrottle_up: cfs_rq->runtime_remaining -22815 cfs_rq->throttled 0 cfs_rq->curr ffff8e57e7f0c600 cfs_rq->nr_queued 128
 sched-messaging-4388    [020] dNh2.   211.688533: tg_unthrottle_up: cfs_rq->runtime_remaining -22815 cfs_rq->throttled 0 cfs_rq->curr ffff8e57d2ef7400 cfs_rq->nr_queued 7
            <...>-15470   [014] dNh1.   223.440099: tg_unthrottle_up: cfs_rq->runtime_remaining -41950 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ff79ee00 cfs_rq->nr_queued 7
 sched-messaging-14343   [011] dNh2.   226.159614: tg_unthrottle_up: cfs_rq->runtime_remaining -24303 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f46d3a00 cfs_rq->nr_queued 2
            <...>-14543   [018] dNh1.   229.854321: tg_unthrottle_up: cfs_rq->runtime_remaining -23239 cfs_rq->throttled 0 cfs_rq->curr ffff8e580acae000 cfs_rq->nr_queued 6
            <...>-9075    [010] dNh1.   231.397245: tg_unthrottle_up: cfs_rq->runtime_remaining -38148 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f4abb400 cfs_rq->nr_queued 108
            <...>-9075    [010] dNh1.   231.397267: tg_unthrottle_up: cfs_rq->runtime_remaining -38148 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad4e00 cfs_rq->nr_queued 5
            <...>-7271    [004] dNh2.   233.612978: tg_unthrottle_up: cfs_rq->runtime_remaining -24292 cfs_rq->throttled 0 cfs_rq->curr ffff8e57f418be00 cfs_rq->nr_queued 99
            <...>-7271    [004] dNh2.   233.612995: tg_unthrottle_up: cfs_rq->runtime_remaining -24292 cfs_rq->throttled 0 cfs_rq->curr ffff8e57c8ad4400 cfs_rq->nr_queued 6
            <...>-15191   [002] dNh1.   234.231866: tg_unthrottle_up: cfs_rq->runtime_remaining -28451 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ff4f6000 cfs_rq->nr_queued 6
            <...>-15181   [003] dNh2.   237.086491: tg_unthrottle_up: cfs_rq->runtime_remaining -20183 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ff4f7600 cfs_rq->nr_queued 4
 sched-messaging-6091    [013] dNh2.   243.412722: tg_unthrottle_up: cfs_rq->runtime_remaining -27394 cfs_rq->throttled 0 cfs_rq->curr ffff8e57ea209e00 cfs_rq->nr_queued 4
 sched-messaging-6091    [013] dNh2.   243.412742: tg_unthrottle_up: cfs_rq->runtime_remaining -27394 cfs_rq->throttled 0 cfs_rq->curr ffff8e585693a400 cfs_rq->nr_queued 1
            <...>-6291    [010] dNh2.   245.166679: tg_unthrottle_up: cfs_rq->runtime_remaining -34474 cfs_rq->throttled 0 cfs_rq->curr ffff8e57efee7600 cfs_rq->nr_queued 108
            <...>-4648    [017] dNh2.   246.414518: tg_unthrottle_up: cfs_rq->runtime_remaining -19404 cfs_rq->throttled 0 cfs_rq->curr ffff8e5856939600 cfs_rq->nr_queued 3



[Wed Oct 15 09:26:41 2025] ------------[ cut here ]------------
[Wed Oct 15 09:26:41 2025] WARNING: kernel/sched/fair.c:5897 at tg_unthrottle_up+0x2a6/0x340, CPU#23: sh/13123
[Wed Oct 15 09:26:41 2025] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat x_tables nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cfg80211 nf_tables nfnetlink bridge stp llc cmac algif_hash algif_skcipher af_alg snd_hda_codec_intelhdmi bnep snd_hda_codec_alc269 snd_hda_scodec_component snd_hda_codec_realtek_lib snd_hda_codec_generic binfmt_misc intel_uncore_frequency intel_uncore_frequency_common i915 snd_hda_intel snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl snd_sof_intel_hda_generic soundwire_intel soundwire_generic_allocation snd_sof_intel_hda_sdw_bpt snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda_mlink snd_sof_intel_hda snd_hda_codec_hdmi snd_hda_ext_core soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp snd_hda_codec snd_hda_core snd_sof snd_sof_utils x86_pkg_temp_thermal snd_intel_dspcfg intel_powerclamp snd_intel_sdw_acpi coretemp snd_soc_acpi_intel_match kvm_intel snd_soc_acpi_intel_sdca_quirks snd_soc_acpi snd_hwdep soundwire_bus
[Wed Oct 15 09:26:41 2025]  snd_soc_sdca snd_soc_core kvm snd_compress ac97_bus snd_pcm_dmaengine snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi nls_iso8859_1 btusb snd_seq drm_buddy processor_thermal_device_pci btmtk ttm processor_thermal_device btrtl processor_thermal_wt_hint snd_seq_device btbcm irqbypass platform_temperature_control mei_hdcp ghash_clmulni_intel processor_thermal_soc_slider snd_timer btintel aesni_intel platform_profile processor_thermal_rfim drm_display_helper snd mei_me intel_rapl_msr rapl processor_thermal_rapl soundcore bluetooth intel_rapl_common intel_cstate cec think_lmi joydev mei input_leds crc8 processor_thermal_wt_req firmware_attributes_class rc_core 8250_dw wmi_bmof processor_thermal_power_floor processor_thermal_mbox i2c_algo_bit int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_tad acpi_pad mac_hid sch_fq_codel msr parport_pc ppdev lp parport efi_pstore autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0
[Wed Oct 15 09:26:41 2025]  spi_pxa2xx_platform hid_generic dw_dmac dw_dmac_core usbhid spi_pxa2xx_core nvme r8169 i2c_i801 hid nvme_core ahci i2c_mux realtek intel_lpss_pci i2c_smbus libahci intel_lpss idma64 video wmi
[Wed Oct 15 09:26:41 2025] CPU: 23 UID: 0 PID: 13123 Comm: sh Kdump: loaded Not tainted 6.17.0+ #120 PREEMPT(voluntary)
[Wed Oct 15 09:26:41 2025] Hardware name: LENOVO 12JJA02KCD/3339, BIOS M4ZKT39A 11/22/2023
[Wed Oct 15 09:26:41 2025] RIP: 0010:tg_unthrottle_up+0x2a6/0x340
[Wed Oct 15 09:26:41 2025] Code: 48 8d 3d 00 00 00 00 e8 28 03 15 00 45 8b 85 64 01 00 00 45 85 c0 0f 84 b9 fd ff ff 49 83 bd 68 01 00 00 00 0f 8f ab fd ff ff <0f> 0b e9 a4 fd ff ff 48 8b 88 f0 0a 00 00 e9 7c ff ff ff 48 8b b1
[Wed Oct 15 09:26:41 2025] RSP: 0018:ffffd18529b8fae0 EFLAGS: 00010082
[Wed Oct 15 09:26:41 2025] RAX: 0000000000000006 RBX: ffff8e5f1fa72600 RCX: 0000000000000000
[Wed Oct 15 09:26:41 2025] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[Wed Oct 15 09:26:41 2025] RBP: ffffd18529b8faf8 R08: 0000000000000001 R09: 0000000000000000
[Wed Oct 15 09:26:41 2025] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8e57ca398c80
[Wed Oct 15 09:26:41 2025] R13: ffff8e57d1cc7a00 R14: 0000000000000000 R15: ffff8e5f1fa72600
[Wed Oct 15 09:26:41 2025] FS:  00007fba99989740(0000) GS:ffff8e5f9826b000(0000) knlGS:0000000000000000
[Wed Oct 15 09:26:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Oct 15 09:26:41 2025] CR2: 0000559e6f0ceda8 CR3: 000000014cee9002 CR4: 0000000000f72ef0
[Wed Oct 15 09:26:41 2025] PKRU: 55555554
[Wed Oct 15 09:26:41 2025] Call Trace:
[Wed Oct 15 09:26:41 2025]  <TASK>
[Wed Oct 15 09:26:41 2025]  ? __pfx_tg_unthrottle_up+0x10/0x10
[Wed Oct 15 09:26:41 2025]  walk_tg_tree_from+0x66/0xd0
[Wed Oct 15 09:26:41 2025]  ? __pfx_tg_nop+0x10/0x10
[Wed Oct 15 09:26:41 2025]  unthrottle_cfs_rq+0xe4/0x280
[Wed Oct 15 09:26:41 2025]  tg_set_bandwidth+0x351/0x4d0
[Wed Oct 15 09:26:41 2025]  cpu_max_write+0xc3/0x130
[Wed Oct 15 09:26:41 2025]  cgroup_file_write+0x92/0x1a0
[Wed Oct 15 09:26:41 2025]  ? __check_object_size+0x27a/0x300
[Wed Oct 15 09:26:41 2025]  kernfs_fop_write_iter+0x15f/0x1f0
[Wed Oct 15 09:26:41 2025]  vfs_write+0x31b/0x430
[Wed Oct 15 09:26:41 2025]  ksys_write+0x6d/0xf0
[Wed Oct 15 09:26:41 2025]  __x64_sys_write+0x1d/0x30
[Wed Oct 15 09:26:41 2025]  x64_sys_call+0x1900/0x2760
[Wed Oct 15 09:26:41 2025]  do_syscall_64+0x83/0x8c0
[Wed Oct 15 09:26:41 2025]  ? __rseq_handle_notify_resume+0xaf/0x4e0
[Wed Oct 15 09:26:41 2025]  ? __do_sys_vfork+0x4f/0x70
[Wed Oct 15 09:26:41 2025]  ? __set_task_blocked+0x29/0x80
[Wed Oct 15 09:26:41 2025]  ? sigprocmask+0xb5/0xe0
[Wed Oct 15 09:26:41 2025]  ? __x64_sys_rt_sigprocmask+0x93/0xe0
[Wed Oct 15 09:26:41 2025]  ? x64_sys_call+0x1cc0/0x2760
[Wed Oct 15 09:26:41 2025]  ? do_syscall_64+0xbc/0x8c0
[Wed Oct 15 09:26:41 2025]  ? irqentry_exit+0x47/0x60
[Wed Oct 15 09:26:41 2025]  ? exc_page_fault+0x97/0x1b0
[Wed Oct 15 09:26:41 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Wed Oct 15 09:26:41 2025] RIP: 0033:0x7fba997148f7
[Wed Oct 15 09:26:41 2025] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff 
ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 
00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 
48 89 74 24
[Wed Oct 15 09:26:41 2025] RSP: 002b:00007fffa912ef68 EFLAGS: 00000246 
ORIG_RAX: 0000000000000001
[Wed Oct 15 09:26:41 2025] RAX: ffffffffffffffda RBX: 000000000000000b 
RCX: 00007fba997148f7
[Wed Oct 15 09:26:41 2025] RDX: 000000000000000b RSI: 0000559e6f0ccda0 
RDI: 0000000000000001
[Wed Oct 15 09:26:41 2025] RBP: 0000559e6f0ccda0 R08: 0000559e6f0cccd7 
R09: 0000000000000000
[Wed Oct 15 09:26:41 2025] R10: 0000559e6f0cccd6 R11: 0000000000000246 
R12: 0000000000000001
[Wed Oct 15 09:26:41 2025] R13: 000000000000000b R14: 0000000000000000 
R15: 0000000000000000
[Wed Oct 15 09:26:41 2025]  </TASK>
[Wed Oct 15 09:26:41 2025] ---[ end trace 0000000000000000 ]---



> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3ef11783369d7..c347aa28c411a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5908,6 +5908,8 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>   		cfs_rq->throttled_clock_self_time += delta;
>   	}
>   
> +	WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
> +
>   	/* Re-enqueue the tasks that have been throttled at this level. */
>   	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>   		list_del_init(&p->throttle_node);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15  1:43         ` Hao Jia
@ 2025-10-15  1:48           ` Hao Jia
  2025-10-15  2:51           ` Aaron Lu
  1 sibling, 0 replies; 25+ messages in thread
From: Hao Jia @ 2025-10-15  1:48 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior



On 2025/10/15 09:43, Hao Jia wrote:
> 
> 
> On 2025/10/14 19:50, Aaron Lu wrote:
>> On Tue, Oct 14, 2025 at 07:01:15PM +0800, Hao Jia wrote:
>>>
>>> Hello Aaron,
>>>
>>> Thank you for your reply.
>>>
>>> On 2025/10/14 17:11, Aaron Lu wrote:
>>>> Hi Hao,
>>>>
>>>> On Tue, Oct 14, 2025 at 03:43:10PM +0800, Hao Jia wrote:
>>>>>
>>>>> Hello Aaron,
>>>>>
>>>>> On 2025/9/29 15:46, Aaron Lu wrote:
>>>>>> When a cfs_rq is to be throttled, its limbo list should be empty and
>>>>>> that's why there is a warn in tg_throttle_down() for non empty
>>>>>> cfs_rq->throttled_limbo_list.
>>>>>>
>>>>>> When running a test with the following hierarchy:
>>>>>>
>>>>>>              root
>>>>>>            /      \
>>>>>>            A*     ...
>>>>>>         /  |  \   ...
>>>>>>            B
>>>>>>           /  \
>>>>>>          C*
>>>>>>
>>>>>> where both A and C have quota settings, that warn on non empty 
>>>>>> limbo list
>>>>>> is triggered for a cfs_rq of C, let's call it cfs_rq_c(and ignore 
>>>>>> the cpu
>>>>>> part of the cfs_rq for the sake of simpler representation).
>>>>>>
>>>>>
>>>>> I encountered a similar warning a while ago and fixed it. I have a 
>>>>> question
>>>>> I'd like to ask. tg_unthrottle_up(cfs_rq_C) calls 
>>>>> enqueue_task_fair(p) to
>>>>> enqueue a task, which requires that the runtime_remaining of task 
>>>>> p's entire
>>>>> task_group hierarchy be greater than 0.
>>>>>
>>>>> In addition to the case you fixed above,
>>>>> When bandwidth is running normally, Is it possible that there's a 
>>>>> corner
>>>>> case where cfs_A->runtime_remaining > 0, but 
>>>>> cfs_B->runtime_remaining < 0
>>>>> could trigger a similar warning?
>>>>
>>>> Do you mean B also has quota set and cfs_B's runtime_remaining < 0?
>>>> In this case, B should be throttled and C is a descendent of B so 
>>>> should
>>>> also be throttled, i.e. C can't be unthrottled when B is in throttled
>>>> state. Do I understand you correctly?
>>>>
>>> Yes, both A and B have quota set.
>>>
>>> Is there a possible corner case?
>>> Asynchronous unthrottling causes other running entities to completely
>>> consume cfs_B->runtime_remaining (cfs_B->runtime_remaining < 0) but not
>>> completely consume cfs_A->runtime_remaining (cfs_A->runtime_remaining 
>>> > 0)
>>> when we call unthrottle_cfs_rq(cfs_rq_A).
>>


By the way, I also noticed that unthrottle_cfs_rq(cfs_rq_A) checks 
cfs_rq_A->runtime_remaining without updating it first. I'm not sure 
whether I should submit a patch to update runtime_remaining before 
checking it.

Thanks,
Hao


>> Let me try to understand the situation here: in your described setup,
>> all three task groups(A, B, C) have quota set?
>>
>>>
>>> When we unthrottle_cfs_rq(cfs_rq_A), cfs_A->runtime_remaining > 0, 
>>> but if
>>> cfs_B->runtime_remaining < 0 at this time,
>>
>> Hmm... if cfs_B->runtime_remaining < 0, why isn't it throttled?
>>
>>> therefore, when 
>>> enqueue_task_fair(p)->check_enqueue_throttle(cfs_rq_B)->throttle_cfs_rq(cfs_rq_B),
>>
>> I assume p is a task of group B?
>> So when A is unthrottled, since p is a throttled task of group B and B
>> is still throttled, enqueue_task_fair(p) should not happen.
>>
>>> a warning may be triggered.
>>>
>>> My core question is:
>>> When we call unthrottle_cfs_rq(cfs_rq_A), we only check
>>> cfs_rq_A->runtime_remaining. However,
>>> enqueue_task_fair(p)->enqueue_entity(C->B->A)->check_enqueue_throttle() does
>>
>> According to this info, I assume p is a task of group C here. If
>> unthrottle A would cause enqueuing p, that means: either group C and B
>> do not have quota set or group C and B are in unthrottled state.
>>
>>> require that the runtime_remaining of each task_group level of task p is
>>> greater than 0.
>>
>> If group C and B are in unthrottled state, their runtime_remaining
>> should be > 0.
>>
>>>
>>> Can we guarantee this?
>>
>> To guarantee this, a warn like below could be used. Can you try in your
>> setup if you can hit it? Thanks.
> 
> Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in 
> tg_unthrottle_up().
> 
> This morning, after applying your patch, I still get the same issue. 
> However, as before, because cfs_rq->curr isn't NULL, 
> check_enqueue_throttle() returns prematurely, preventing 
> throttle_cfs_rq() from being triggered.
> 
> 
> 
> Some information to share with you.
> 
> <...>-13123   [023] dN.1.    69.245648: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -5487 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c80ad300 cfs_rq->nr_queued 0
>   sched-messaging-4299    [020] dNh2.   108.475966: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -1073 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57d2ef7400 cfs_rq->nr_queued 2
>   sched-messaging-3514    [009] dNh2.   109.640727: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -11345 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad5a00 cfs_rq->nr_queued 4
>             <...>-13872   [005] dNh2.   117.155177: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -15195 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7e00 cfs_rq->nr_queued 5
>             <...>-9861    [005] dNh2.   139.307579: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -41639 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f4f38400 cfs_rq->nr_queued 20
>             <...>-9861    [005] dNh2.   139.307598: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -41639 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7e00 cfs_rq->nr_queued 7
>   sched-messaging-11165   [013] dNh2.   158.640827: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -32575 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e585693a400 cfs_rq->nr_queued 5
>   sched-messaging-15337   [002] dNh2.   178.400559: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -29005 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ff4f6000 cfs_rq->nr_queued 7
>   sched-messaging-8740    [010] dNh2.   178.842565: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -34448 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f4f4c800 cfs_rq->nr_queued 141
>   sched-messaging-8740    [010] dNh2.   178.842583: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -34448 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad4e00 cfs_rq->nr_queued 4
>             <...>-6101    [000] dNh1.   179.185475: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -14179 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ea202000 cfs_rq->nr_queued 116
>   sched-messaging-6101    [000] dNh1.   179.185489: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -14179 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c7078400 cfs_rq->nr_queued 6
>             <...>-7812    [002] dNh2.   184.228104: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -21154 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f597c200 cfs_rq->nr_queued 75
>             <...>-7812    [002] dNh2.   184.228128: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -21154 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c7079600 cfs_rq->nr_queued 7
>             <...>-4589    [006] dNh2.   185.872173: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -25309 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e584debc800 cfs_rq->nr_queued 173
>             <...>-4589    [006] dNh2.   185.872192: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -25309 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7800 cfs_rq->nr_queued 7
>   sched-messaging-5155    [008] dNh2.   188.689689: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -14108 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ea20e800 cfs_rq->nr_queued 150
>   sched-messaging-5155    [008] dNh2.   188.689704: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -14108 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7600 cfs_rq->nr_queued 7
>   sched-messaging-15265   [002] dNh2.   190.920442: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -25129 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ff4f6000 cfs_rq->nr_queued 6
>             <...>-10191   [007] dNh1.   191.851170: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -27077 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7200 cfs_rq->nr_queued 4
>   sched-messaging-5963    [008] dNh2.   193.298329: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -41126 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57e9d31400 cfs_rq->nr_queued 156
>   sched-messaging-5963    [008] dNh2.   193.298345: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -41126 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad7600 cfs_rq->nr_queued 6
>             <...>-4877    [021] dNh2.   198.988637: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -19362 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ea2bf400 cfs_rq->nr_queued 86
>             <...>-4877    [021] dNh2.   198.988657: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -19362 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c47e9e00 cfs_rq->nr_queued 6
>             <...>-7344    [015] dNh2.   208.086556: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -31964 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e5856938a00 cfs_rq->nr_queued 4
>             <...>-8384    [020] dNh2.   209.858709: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -28618 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f32ede00 cfs_rq->nr_queued 133
>             <...>-8384    [020] dNh2.   209.858736: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -28618 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57d2ef7400 cfs_rq->nr_queued 7
>   sched-messaging-7803    [000] dNh2.   210.380559: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -18845 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c7078400 cfs_rq->nr_queued 8
>   sched-messaging-4388    [020] dNh2.   211.688504: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -22815 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57e7f0c600 cfs_rq->nr_queued 128
>   sched-messaging-4388    [020] dNh2.   211.688533: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -22815 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57d2ef7400 cfs_rq->nr_queued 7
>             <...>-15470   [014] dNh1.   223.440099: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -41950 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ff79ee00 cfs_rq->nr_queued 7
>   sched-messaging-14343   [011] dNh2.   226.159614: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -24303 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f46d3a00 cfs_rq->nr_queued 2
>             <...>-14543   [018] dNh1.   229.854321: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -23239 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e580acae000 cfs_rq->nr_queued 6
>             <...>-9075    [010] dNh1.   231.397245: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -38148 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f4abb400 cfs_rq->nr_queued 108
>             <...>-9075    [010] dNh1.   231.397267: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -38148 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad4e00 cfs_rq->nr_queued 5
>             <...>-7271    [004] dNh2.   233.612978: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -24292 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57f418be00 cfs_rq->nr_queued 99
>             <...>-7271    [004] dNh2.   233.612995: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -24292 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57c8ad4400 cfs_rq->nr_queued 6
>             <...>-15191   [002] dNh1.   234.231866: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -28451 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ff4f6000 cfs_rq->nr_queued 6
>             <...>-15181   [003] dNh2.   237.086491: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -20183 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ff4f7600 cfs_rq->nr_queued 4
>   sched-messaging-6091    [013] dNh2.   243.412722: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -27394 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57ea209e00 cfs_rq->nr_queued 4
>   sched-messaging-6091    [013] dNh2.   243.412742: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -27394 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e585693a400 cfs_rq->nr_queued 1
>             <...>-6291    [010] dNh2.   245.166679: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -34474 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e57efee7600 cfs_rq->nr_queued 108
>             <...>-4648    [017] dNh2.   246.414518: tg_unthrottle_up: 
> cfs_rq->runtime_remaining -19404 cfs_rq->throttled 0 cfs_rq->curr 
> ffff8e5856939600 cfs_rq->nr_queued 3
> 
> 
> 
> [Wed Oct 15 09:26:41 2025] ------------[ cut here ]------------
> [Wed Oct 15 09:26:41 2025] WARNING: kernel/sched/fair.c:5897 at 
> tg_unthrottle_up+0x2a6/0x340, CPU#23: sh/13123
> [Wed Oct 15 09:26:41 2025] Modules linked in: xt_CHECKSUM xt_MASQUERADE 
> xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat x_tables 
> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cfg80211 
> nf_tables nfnetlink bridge stp llc cmac algif_hash algif_skcipher af_alg 
> snd_hda_codec_intelhdmi bnep snd_hda_codec_alc269 
> snd_hda_scodec_component snd_hda_codec_realtek_lib snd_hda_codec_generic 
> binfmt_misc intel_uncore_frequency intel_uncore_frequency_common i915 
> snd_hda_intel snd_sof_pci_intel_tgl snd_sof_pci_intel_cnl 
> snd_sof_intel_hda_generic soundwire_intel soundwire_generic_allocation 
> snd_sof_intel_hda_sdw_bpt snd_sof_intel_hda_common snd_soc_hdac_hda 
> snd_sof_intel_hda_mlink snd_sof_intel_hda snd_hda_codec_hdmi 
> snd_hda_ext_core soundwire_cadence snd_sof_pci snd_sof_xtensa_dsp 
> snd_hda_codec snd_hda_core snd_sof snd_sof_utils x86_pkg_temp_thermal 
> snd_intel_dspcfg intel_powerclamp snd_intel_sdw_acpi coretemp 
> snd_soc_acpi_intel_match kvm_intel snd_soc_acpi_intel_sdca_quirks 
> snd_soc_acpi snd_hwdep soundwire_bus
> [Wed Oct 15 09:26:41 2025]  snd_soc_sdca snd_soc_core kvm snd_compress 
> ac97_bus snd_pcm_dmaengine snd_pcm snd_seq_midi snd_seq_midi_event 
> snd_rawmidi nls_iso8859_1 btusb snd_seq drm_buddy 
> processor_thermal_device_pci btmtk ttm processor_thermal_device btrtl 
> processor_thermal_wt_hint snd_seq_device btbcm irqbypass 
> platform_temperature_control mei_hdcp ghash_clmulni_intel 
> processor_thermal_soc_slider snd_timer btintel aesni_intel 
> platform_profile processor_thermal_rfim drm_display_helper snd mei_me 
> intel_rapl_msr rapl processor_thermal_rapl soundcore bluetooth 
> intel_rapl_common intel_cstate cec think_lmi joydev mei input_leds crc8 
> processor_thermal_wt_req firmware_attributes_class rc_core 8250_dw 
> wmi_bmof processor_thermal_power_floor processor_thermal_mbox 
> i2c_algo_bit int340x_thermal_zone int3400_thermal acpi_thermal_rel 
> acpi_tad acpi_pad mac_hid sch_fq_codel msr parport_pc ppdev lp parport 
> efi_pstore autofs4 btrfs blake2b_generic raid10 raid456 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
> raid1 raid0
> [Wed Oct 15 09:26:41 2025]  spi_pxa2xx_platform hid_generic dw_dmac 
> dw_dmac_core usbhid spi_pxa2xx_core nvme r8169 i2c_i801 hid nvme_core 
> ahci i2c_mux realtek intel_lpss_pci i2c_smbus libahci intel_lpss idma64 
> video wmi
> [Wed Oct 15 09:26:41 2025] CPU: 23 UID: 0 PID: 13123 Comm: sh Kdump: 
> loaded Not tainted 6.17.0+ #120 PREEMPT(voluntary)
> [Wed Oct 15 09:26:41 2025] Hardware name: LENOVO 12JJA02KCD/3339, BIOS 
> M4ZKT39A 11/22/2023
> [Wed Oct 15 09:26:41 2025] RIP: 0010:tg_unthrottle_up+0x2a6/0x340
> [Wed Oct 15 09:26:41 2025] Code: 48 8d 3d 00 00 00 00 e8 28 03 15 00 45 
> 8b 85 64 01 00 00 45 85 c0 0f 84 b9 fd ff ff 49 83 bd 68 01 00 00 00 0f 
> 8f ab fd ff ff <0f> 0b e9 a4 fd ff ff 48 8b 88 f0 0a 00 00 e9 7c ff ff 
> ff 48 8b b1
> [Wed Oct 15 09:26:41 2025] RSP: 0018:ffffd18529b8fae0 EFLAGS: 00010082
> [Wed Oct 15 09:26:41 2025] RAX: 0000000000000006 RBX: ffff8e5f1fa72600 
> RCX: 0000000000000000
> [Wed Oct 15 09:26:41 2025] RDX: 0000000000000000 RSI: 0000000000000000 
> RDI: 0000000000000000
> [Wed Oct 15 09:26:41 2025] RBP: ffffd18529b8faf8 R08: 0000000000000001 
> R09: 0000000000000000
> [Wed Oct 15 09:26:41 2025] R10: 0000000000000000 R11: 0000000000000000 
> R12: ffff8e57ca398c80
> [Wed Oct 15 09:26:41 2025] R13: ffff8e57d1cc7a00 R14: 0000000000000000 
> R15: ffff8e5f1fa72600
> [Wed Oct 15 09:26:41 2025] FS:  00007fba99989740(0000) 
> GS:ffff8e5f9826b000(0000) knlGS:0000000000000000
> [Wed Oct 15 09:26:41 2025] CS:  0010 DS: 0000 ES: 0000 CR0: 
> 0000000080050033
> [Wed Oct 15 09:26:41 2025] CR2: 0000559e6f0ceda8 CR3: 000000014cee9002 
> CR4: 0000000000f72ef0
> [Wed Oct 15 09:26:41 2025] PKRU: 55555554
> [Wed Oct 15 09:26:41 2025] Call Trace:
> [Wed Oct 15 09:26:41 2025]  <TASK>
> [Wed Oct 15 09:26:41 2025]  ? __pfx_tg_unthrottle_up+0x10/0x10
> [Wed Oct 15 09:26:41 2025]  walk_tg_tree_from+0x66/0xd0
> [Wed Oct 15 09:26:41 2025]  ? __pfx_tg_nop+0x10/0x10
> [Wed Oct 15 09:26:41 2025]  unthrottle_cfs_rq+0xe4/0x280
> [Wed Oct 15 09:26:41 2025]  tg_set_bandwidth+0x351/0x4d0
> [Wed Oct 15 09:26:41 2025]  cpu_max_write+0xc3/0x130
> [Wed Oct 15 09:26:41 2025]  cgroup_file_write+0x92/0x1a0
> [Wed Oct 15 09:26:41 2025]  ? __check_object_size+0x27a/0x300
> [Wed Oct 15 09:26:41 2025]  kernfs_fop_write_iter+0x15f/0x1f0
> [Wed Oct 15 09:26:41 2025]  vfs_write+0x31b/0x430
> [Wed Oct 15 09:26:41 2025]  ksys_write+0x6d/0xf0
> [Wed Oct 15 09:26:41 2025]  __x64_sys_write+0x1d/0x30
> [Wed Oct 15 09:26:41 2025]  x64_sys_call+0x1900/0x2760
> [Wed Oct 15 09:26:41 2025]  do_syscall_64+0x83/0x8c0
> [Wed Oct 15 09:26:41 2025]  ? __rseq_handle_notify_resume+0xaf/0x4e0
> [Wed Oct 15 09:26:41 2025]  ? __do_sys_vfork+0x4f/0x70
> [Wed Oct 15 09:26:41 2025]  ? __set_task_blocked+0x29/0x80
> [Wed Oct 15 09:26:41 2025]  ? sigprocmask+0xb5/0xe0
> [Wed Oct 15 09:26:41 2025]  ? __x64_sys_rt_sigprocmask+0x93/0xe0
> [Wed Oct 15 09:26:41 2025]  ? x64_sys_call+0x1cc0/0x2760
> [Wed Oct 15 09:26:41 2025]  ? do_syscall_64+0xbc/0x8c0
> [Wed Oct 15 09:26:41 2025]  ? irqentry_exit+0x47/0x60
> [Wed Oct 15 09:26:41 2025]  ? exc_page_fault+0x97/0x1b0
> [Wed Oct 15 09:26:41 2025]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [Wed Oct 15 09:26:41 2025] RIP: 0033:0x7fba997148f7
> [Wed Oct 15 09:26:41 2025] Code: 10 00 f7 d8 64 89 02 48 c7 c0 ff ff ff 
> ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 
> 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 
> 48 89 74 24
> [Wed Oct 15 09:26:41 2025] RSP: 002b:00007fffa912ef68 EFLAGS: 00000246 
> ORIG_RAX: 0000000000000001
> [Wed Oct 15 09:26:41 2025] RAX: ffffffffffffffda RBX: 000000000000000b 
> RCX: 00007fba997148f7
> [Wed Oct 15 09:26:41 2025] RDX: 000000000000000b RSI: 0000559e6f0ccda0 
> RDI: 0000000000000001
> [Wed Oct 15 09:26:41 2025] RBP: 0000559e6f0ccda0 R08: 0000559e6f0cccd7 
> R09: 0000000000000000
> [Wed Oct 15 09:26:41 2025] R10: 0000559e6f0cccd6 R11: 0000000000000246 
> R12: 0000000000000001
> [Wed Oct 15 09:26:41 2025] R13: 000000000000000b R14: 0000000000000000 
> R15: 0000000000000000
> [Wed Oct 15 09:26:41 2025]  </TASK>
> [Wed Oct 15 09:26:41 2025] ---[ end trace 0000000000000000 ]---
> 
> 
> 
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3ef11783369d7..c347aa28c411a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5908,6 +5908,8 @@ static int tg_unthrottle_up(struct task_group 
>> *tg, void *data)
>>           cfs_rq->throttled_clock_self_time += delta;
>>       }
>> +    WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining 
>> <= 0);
>> +
>>       /* Re-enqueue the tasks that have been throttled at this level. */
>>       list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, 
>> throttle_node) {
>>           list_del_init(&p->throttle_node);

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15  1:43         ` Hao Jia
  2025-10-15  1:48           ` Hao Jia
@ 2025-10-15  2:51           ` Aaron Lu
  2025-10-15  6:31             ` Hao Jia
  1 sibling, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-15  2:51 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Wed, Oct 15, 2025 at 09:43:20AM +0800, Hao Jia wrote:
... ...
> Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in
> tg_unthrottle_up().
> 
> This morning, after applying your patch, I still get the same issue.
> However, as before, because cfs_rq->curr isn't NULL,
> check_enqueue_throttle() returns prematurely, preventing
> throttle_cfs_rq() from being triggered.
> 
> 
> Some information to share with you.

Can you also share your cgroup setup and related quota setting etc. and
how to trigger it? Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15  2:51           ` Aaron Lu
@ 2025-10-15  6:31             ` Hao Jia
  2025-10-15  8:40               ` Aaron Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-15  6:31 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior



On 2025/10/15 10:51, Aaron Lu wrote:
> On Wed, Oct 15, 2025 at 09:43:20AM +0800, Hao Jia wrote:
> ... ...
>> Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in
>> tg_unthrottle_up().
>>
>> This morning, after applying your patch, I still get the same issue.
>> However, as before, because cfs_rq->curr isn't NULL,
>> check_enqueue_throttle() returns prematurely, preventing
>> throttle_cfs_rq() from being triggered.
>>
>>
>> Some information to share with you.
> 
> Can you also share your cgroup setup and related quota setting etc. and
> how to trigger it? Thanks.

I ran some internal workloads on my test machine with different quota 
settings, and added 10 sched messaging benchmark cgroups, setting their 
cpu.max to 1000 100000.

perf bench sched messaging -g 10 -t -l 50000 &

I'm not sure if the issue can be reproduced without these internal 
workloads.
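
For reference, a cgroup v2 setup along these lines could look like the
sketch below (hypothetical; paths, group names, and the loop structure are
illustrative, and the internal workloads mentioned above are not
reproduced):

```shell
# Hypothetical reproduction sketch (cgroup v2, run as root).
# Creates 10 groups with a tight quota and runs the messaging benchmark
# in each; adjust the mount point and values to your setup.
CG=/sys/fs/cgroup
for i in $(seq 1 10); do
    mkdir -p "$CG/msg$i"
    # 1000us quota per 100000us period, as in the report above
    echo "1000 100000" > "$CG/msg$i/cpu.max"
    ( echo $$ > "$CG/msg$i/cgroup.procs"
      exec perf bench sched messaging -g 10 -t -l 50000 ) &
done
wait
```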

Thanks,
Hao

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15  6:31             ` Hao Jia
@ 2025-10-15  8:40               ` Aaron Lu
  2025-10-15 10:21                 ` Hao Jia
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-15  8:40 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Wed, Oct 15, 2025 at 02:31:27PM +0800, Hao Jia wrote:
> On 2025/10/15 10:51, Aaron Lu wrote:
> > On Wed, Oct 15, 2025 at 09:43:20AM +0800, Hao Jia wrote:
> > ... ...
> > > Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in
> > > tg_unthrottle_up().
> > > 
> > > This morning, after applying your patch, I still get the same issue.
> > > However, as before, because cfs_rq->curr isn't NULL,
> > > check_enqueue_throttle() returns prematurely, preventing
> > > throttle_cfs_rq() from being triggered.
> > > 
> > > 
> > > Some information to share with you.
> > 
> > Can you also share your cgroup setup and related quota setting etc. and
> > how to trigger it? Thanks.
> 
> I ran some internal workloads on my test machine with different quota
> settings, and added 10 sched messaging benchmark cgroups, setting their
> cpu.max to 1000 100000.
> 
> perf bench sched messaging -g 10 -t -l 50000 &
> 
> I'm not sure if the issue can be reproduced without these internal
> workloads.

Thanks for the report, I think I understand your concern now.

I managed to trigger a condition in tg_unthrottle_up() for a cfs_rq that
has runtime_enabled set but a negative runtime_remaining; the setup is
as before:

          root
        /      \
        A*     ...
     /  |  \   ...
        B
       /  \
      C*

where both A and C have quota settings.

1 Initially, both cfs_rq_a and cfs_rq_c are in unthrottled state with a
  positive runtime_remaining.
2 At some time, cfs_rq_a is throttled. cfs_rq_c is now in a throttled
  hierarchy, but it's not throttled and has a positive runtime_remaining.
3 Some time later, task @p gets enqueued to cfs_rq_c and starts executing
  in kernel mode, consuming all of cfs_rq_c's runtime_remaining.
  account_cfs_rq_runtime() accounts this properly, but resched_curr()
  doesn't make schedule() -> check_cfs_rq_runtime() -> throttle_cfs_rq()
  happen immediately, because task @p is still executing in kernel mode
  (CONFIG_PREEMPT_VOLUNTARY).
4 Some time later, cfs_rq_a is unthrottled.
  tg_unthrottle_up() notices that cfs_rq_c has a negative runtime_remaining.

In this situation, though, check_enqueue_throttle() will not do anything
because cfs_rq_c->curr is set; the throttle will not happen immediately,
so it won't cause a throttle on the unthrottle path.

Hao Jia,

Do I understand you correctly that you can only hit the newly added
debug warn in tg_unthrottle_up():
WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
but not a throttle triggered on the unthrottle path?

BTW, I think your change has the advantage of being straightforward and
easy to reason about. My concern is that it's not efficient to enqueue
tasks to a cfs_rq that has no runtime left, though I'm not sure how big
a deal that is.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15  8:40               ` Aaron Lu
@ 2025-10-15 10:21                 ` Hao Jia
  2025-10-16  6:54                   ` Aaron Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-15 10:21 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior


Hi Aaron,

On 2025/10/15 16:40, Aaron Lu wrote:
> On Wed, Oct 15, 2025 at 02:31:27PM +0800, Hao Jia wrote:
>> On 2025/10/15 10:51, Aaron Lu wrote:
>>> On Wed, Oct 15, 2025 at 09:43:20AM +0800, Hao Jia wrote:
>>> ... ...
>>>> Yes, I've already hit the cfs_rq->runtime_remaining < 0 condition in
>>>> tg_unthrottle_up().
>>>>
>>>> This morning, after applying your patch, I still get the same issue.
>>>> However, As before, because cfs_rq->curr isn't NULL,
>>>> check_enqueue_throttle() returns prematurely, preventing the triggering of
>>>> throttle_cfs_rq().
>>>>
>>>>
>>>> Some information to share with you.
>>>
>>> Can you also share your cgroup setup and related quota setting etc. and
>>> how to trigger it? Thanks.
>>
>> I ran some internal workloads on my test machine with different quota
>> settings, and added 10 sched messaging branchmark cgroups, setting their
>> cpu.max to 1000 100000.
>>
>> perf bench sched messaging -g 10 -t -l 50000 &
>>
>> I'm not sure if the issue can be reproduced without these internal
>> workloads.
> 
> Thanks for the report, I think I understand your concern now.
> 
> I managed to trigger a condition in tg_unthrottle_up() for a cfs_rq that
> has runtime_enabled but with a negative runtime_remaining, the setup is
> as before:
> 
>            root
>          /      \
>          A*     ...
>       /  |  \   ...
>          B
>         /  \
>        C*
> 
> where both A and C have quota settings.
> 
> 1 Initially, both cfs_rq_a and cfs_rq_c are in unthrottled state with a
>    positive runtime_remaining.
> 2 At some time, cfs_rq_a is throttled. cfs_rq_c is now in a throttled
>    hierarchy, but it's not throttled and has a positive runtime_remaining.
> 3 Some time later, task @p gets enqueued to cfs_rq_c and starts execution
>    in kernel mode, consumed all cfs_rq_c's runtime_remaining.
>    account_cfs_rq_runtime() properly accounted, but resched_curr() doesn't
>    cause schedule() -> check_cfs_rq_runtime() -> throttle_cfs_rq() to
>    happen immediately, because task @p is still executing in kernel mode
>    (CONFIG_PREEMPT_VOLUNTARY).
> 4 Some time later, cfs_rq_a is unthrottled.
>    tg_unthrottle_up() noticed cfs_rq_c has a negative runtime_remaining.
> 
> In this situation, check_enqueue_throttle() will not do anything though
> because cfs_rq_c->curr is set, throttle will not happen immediately so
> it won't cause throttle to happen on unthrottle path.
> 
> Hao Jia,
> 
> Do I understand you correctly that you can only hit the newly added
> debug warn in tg_unthrottle_up():
> WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
> but not throttle triggered on unthrottle path?
> 

Yes, but I'm not sure if there are other corner cases where 
cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.

> BTW, I think your change has the advantage of being straightforward and
> easy to reason about. My concern is, it's not efficient to enqueue tasks
> to a cfs_rq that has no runtime left, not sure how big a deal that is
> though.

Yes, but that's what we're doing now. The case described above involves 
enqueuing a task while cfs_rq->runtime_remaining <= 0.

I previously tried adding a runtime_remaining check at each level of 
task p's cfs_rq hierarchy in unthrottle_cfs_rq()/tg_unthrottle_up(), but 
this made the code strange and complicated.

Thanks,
Hao

> 
> Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-15 10:21                 ` Hao Jia
@ 2025-10-16  6:54                   ` Aaron Lu
  2025-10-16  7:49                     ` Hao Jia
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-16  6:54 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Wed, Oct 15, 2025 at 06:21:01PM +0800, Hao Jia wrote:
> On 2025/10/15 16:40, Aaron Lu wrote:
... ...
> > Hao Jia,
> > 
> > Do I understand you correctly that you can only hit the newly added
> > debug warn in tg_unthrottle_up():
> > WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
> > but not throttle triggered on unthrottle path?
> > 
> 
> yes. but I'm not sure if there are other corner cases where
> cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.
> 

Right, I'm not aware of any, but other cases might be possible.

> > BTW, I think your change has the advantage of being straightforward and
> > easy to reason about. My concern is, it's not efficient to enqueue tasks
> > to a cfs_rq that has no runtime left, not sure how big a deal that is
> > though.
> 
> Yes, but that's what we're doing now. The case described above involves
> enqueue a task where cfs_rq->runtime_remaining <= 0.
> 
> I previously tried adding a runtime_remaining check for each level of task
> p's cfs_rq in unthrottle_cfs_rq()/tg_unthrottle_up(), but this made the code
> strange and complicated.

Agree that adding a runtime_remaining check for each level in
unthrottle_cfs_rq() looks too complex.

So I think your approach is fine; feel free to submit a formal patch.
With your change, we theoretically no longer need those
runtime_remaining checks in unthrottle_cfs_rq(), but keeping them
could save us some unnecessary enqueues, so I'll leave it to you to
decide whether to keep them. If you do keep them, please also update
the comments there, because the current comments will become stale.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-16  6:54                   ` Aaron Lu
@ 2025-10-16  7:49                     ` Hao Jia
  2025-10-16  9:23                       ` Aaron Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-16  7:49 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior


Hi Aaron,

On 2025/10/16 14:54, Aaron Lu wrote:
> On Wed, Oct 15, 2025 at 06:21:01PM +0800, Hao Jia wrote:
>> On 2025/10/15 16:40, Aaron Lu wrote:
> ... ...
>>> Hao Jia,
>>>
>>> Do I understand you correctly that you can only hit the newly added
>>> debug warn in tg_unthrottle_up():
>>> WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
>>> but not throttle triggered on unthrottle path?
>>>
>>
>> yes. but I'm not sure if there are other corner cases where
>> cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.
>>
> 
> Right, I'm not aware of any but might be possible.
> 
>>> BTW, I think your change has the advantage of being straightforward and
>>> easy to reason about. My concern is, it's not efficient to enqueue tasks
>>> to a cfs_rq that has no runtime left, not sure how big a deal that is
>>> though.
>>
>> Yes, but that's what we're doing now. The case described above involves
>> enqueue a task where cfs_rq->runtime_remaining <= 0.
>>
>> I previously tried adding a runtime_remaining check for each level of task
>> p's cfs_rq in unthrottle_cfs_rq()/tg_unthrottle_up(), but this made the code
>> strange and complicated.
> 
> Agree that adding a runtime_remaining check for each level in
> unthrottle_cfs_rq() looks too complex.
> 
> So I think you approach is fine, feel free to submit a formal patch.
> With your change, theoretically we do not need to do those
> runtime_remaining check in unthrottle_cfs_rq() but keeping that check
> could save us some unnecessary enqueues, so I'll leave it to you to
> decide if you want to keep it or not. If you want to keep it, please
> also change its comments because the current comments will be stale
> then.
> 

Thank you for your suggestion. I'll send a formal patch later.

I'm also happy for you to submit a patch for the next version. This 
warning needs to be fixed, regardless of the method.

However, I've discovered a minor bug in your current patch.

In kernel/sched/core.c tg_set_cfs_bandwidth()

...
if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
     update_rq_clock(rq);   <----
     throttle_cfs_rq(cfs_rq);
}
...

Call update_rq_clock() to avoid the warning about using an outdated 
rq_clock in tg_throttle_down()->rq_clock_pelt().

Thanks,
Hao

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-16  7:49                     ` Hao Jia
@ 2025-10-16  9:23                       ` Aaron Lu
  2025-10-16 11:04                         ` Hao Jia
  0 siblings, 1 reply; 25+ messages in thread
From: Aaron Lu @ 2025-10-16  9:23 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Oct 16, 2025 at 03:49:15PM +0800, Hao Jia wrote:
> 
> Hi Aaron,
> 
> On 2025/10/16 14:54, Aaron Lu wrote:
> > On Wed, Oct 15, 2025 at 06:21:01PM +0800, Hao Jia wrote:
> > > On 2025/10/15 16:40, Aaron Lu wrote:
> > ... ...
> > > > Hao Jia,
> > > > 
> > > > Do I understand you correctly that you can only hit the newly added
> > > > debug warn in tg_unthrottle_up():
> > > > WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
> > > > but not throttle triggered on unthrottle path?
> > > > 
> > > 
> > > yes. but I'm not sure if there are other corner cases where
> > > cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.
> > > 
> > 
> > Right, I'm not aware of any but might be possible.
> > 
> > > > BTW, I think your change has the advantage of being straightforward and
> > > > easy to reason about. My concern is, it's not efficient to enqueue tasks
> > > > to a cfs_rq that has no runtime left, not sure how big a deal that is
> > > > though.
> > > 
> > > Yes, but that's what we're doing now. The case described above involves
> > > enqueue a task where cfs_rq->runtime_remaining <= 0.
> > > 
> > > I previously tried adding a runtime_remaining check for each level of task
> > > p's cfs_rq in unthrottle_cfs_rq()/tg_unthrottle_up(), but this made the code
> > > strange and complicated.
> > 
> > Agree that adding a runtime_remaining check for each level in
> > unthrottle_cfs_rq() looks too complex.
> > 
> > So I think you approach is fine, feel free to submit a formal patch.
> > With your change, theoretically we do not need to do those
> > runtime_remaining check in unthrottle_cfs_rq() but keeping that check
> > could save us some unnecessary enqueues, so I'll leave it to you to
> > decide if you want to keep it or not. If you want to keep it, please
> > also change its comments because the current comments will be stale
> > then.
> > 
> 
> Thank you for your suggestion. I'll send a formal patch later.
> 
> I'm also happy for you to submit a patch for the next version. This warning
> needs to be fixed, regardless of the method.

With your change, task enqueues in the unthrottle path will not call
check_enqueue_throttle(), thus the warn on a non-empty limbo list in
tg_throttle_down() should not happen, so I suppose we do not need
this patch anymore, no?

> 
> However, I've discovered a minor bug in your current patch.
> 
> In kernel/sched/core.c tg_set_cfs_bandwidth()
> 
> ...
> if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
>     update_rq_clock(rq);   <----
>     throttle_cfs_rq(cfs_rq);
> }
> ...
> 
> Call update_rq_clock() to avoid the warning about using an outdated rq_clock
> in tg_throttle_down()->rq_clock_pelt().

With the above said, this shouldn't matter anymore but just out of
curiosity: did you notice this by inspecting the code or actually
hitting the warning about using an outdated rq clock?

Per my understanding, most likely __assign_cfs_rq_runtime() in
throttle_cfs_rq() will grant 1ns of runtime to the cfs_rq, so it won't
reach tg_throttle_down(). The comment I added above that if condition
is kind of misleading though.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-16  9:23                       ` Aaron Lu
@ 2025-10-16 11:04                         ` Hao Jia
  2025-10-16 11:46                           ` Aaron Lu
  0 siblings, 1 reply; 25+ messages in thread
From: Hao Jia @ 2025-10-16 11:04 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior



On 2025/10/16 17:23, Aaron Lu wrote:
> On Thu, Oct 16, 2025 at 03:49:15PM +0800, Hao Jia wrote:
>>
>> Hi Aaron,
>>
>> On 2025/10/16 14:54, Aaron Lu wrote:
>>> On Wed, Oct 15, 2025 at 06:21:01PM +0800, Hao Jia wrote:
>>>> On 2025/10/15 16:40, Aaron Lu wrote:
>>> ... ...
>>>>> Hao Jia,
>>>>>
>>>>> Do I understand you correctly that you can only hit the newly added
>>>>> debug warn in tg_unthrottle_up():
>>>>> WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
>>>>> but not throttle triggered on unthrottle path?
>>>>>
>>>>
>>>> yes. but I'm not sure if there are other corner cases where
>>>> cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.
>>>>
>>>
>>> Right, I'm not aware of any but might be possible.
>>>
>>>>> BTW, I think your change has the advantage of being straightforward and
>>>>> easy to reason about. My concern is, it's not efficient to enqueue tasks
>>>>> to a cfs_rq that has no runtime left, not sure how big a deal that is
>>>>> though.
>>>>
>>>> Yes, but that's what we're doing now. The case described above involves
>>>> enqueue a task where cfs_rq->runtime_remaining <= 0.
>>>>
>>>> I previously tried adding a runtime_remaining check for each level of task
>>>> p's cfs_rq in unthrottle_cfs_rq()/tg_unthrottle_up(), but this made the code
>>>> strange and complicated.
>>>
>>> Agree that adding a runtime_remaining check for each level in
>>> unthrottle_cfs_rq() looks too complex.
>>>
>>> So I think you approach is fine, feel free to submit a formal patch.
>>> With your change, theoretically we do not need to do those
>>> runtime_remaining check in unthrottle_cfs_rq() but keeping that check
>>> could save us some unnecessary enqueues, so I'll leave it to you to
>>> decide if you want to keep it or not. If you want to keep it, please
>>> also change its comments because the current comments will be stale
>>> then.
>>>
>>
>> Thank you for your suggestion. I'll send a formal patch later.
>>
>> I'm also happy for you to submit a patch for the next version. This warning
>> needs to be fixed, regardless of the method.
> 
> With your change, task enqueue in unthrottle path will not call
> check_enqueue_path(), thus the warn on non-empty limbo list in
> tg_throttle_down() should not happen, so I suppose we do not need
> this patch anymore, no?

Yes. I mean the maintainers may find your patch more suitable.

> 
>>
>> However, I've discovered a minor bug in your current patch.
>>
>> In kernel/sched/core.c tg_set_cfs_bandwidth()
>>
>> ...
>> if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
>>      update_rq_clock(rq);   <----
>>      throttle_cfs_rq(cfs_rq);
>> }
>> ...
>>
>> Call update_rq_clock() to avoid the warning about using an outdated rq_clock
>> in tg_throttle_down()->rq_clock_pelt().
> 
> With the above said, this shouldn't matter anymore but just out of
> curiosity: did you notice this by inspecting the code or actually
> hitting the warning about using an outdated rq clock?
> 
> Per my understanding, most likely: __assign_cfs_rq_runtime() in
> throttle_cfs_rq(cfs_rq) will grant 1ns runtime to cfs_rq so it won't
> reach tg_throttle_down(). The comment I added above that if condition
> is kind of misleading though.


I did encounter this once, perhaps in the following corner case:

If cfs_b->quota is set low (and quota is set at each level) and there 
are a large number of CPUs:

After tg_set_cfs_bandwidth()->__refill_cfs_bandwidth_runtime(), we 
release cfs_b->lock, so cfs_b->runtime might be consumed by cfs_rqs on 
other CPUs.

Then, on one online CPU, we can't get 1ns of runtime via 
__assign_cfs_rq_runtime(). Throttling is triggered on this CPU, and 
tg_throttle_down()->rq_clock_pelt() is called.


------------[ cut here ]------------
WARNING: kernel/sched/sched.h:1681 at tg_throttle_down+0x106/0x110, CPU#4:
CPU: 4 UID: 0 PID: 7840 Comm: test_cgroup.sh Kdump: loaded Not tainted 6.17.0+ #94 PREEMPT(voluntary)

           Call Trace:
            <TASK>
            walk_tg_tree_from+0x39/0xd0
            ? __pfx_tg_throttle_down+0x10/0x10
            throttle_cfs_rq+0xea/0x210
            tg_set_bandwidth+0x31f/0x4d0
            cpu_max_write+0xc3/0x130
            cgroup_file_write+0x92/0x1a0
            ? __check_object_size+0x27a/0x300
            kernfs_fop_write_iter+0x15f/0x1f0
            vfs_write+0x31b/0x430
            ksys_write+0x6d/0xf0
            __x64_sys_write+0x1d/0x30
            x64_sys_call+0x1900/0x2760
            do_syscall_64+0x83/0x8c0
            ? ksys_dup3+0x9d/0x120
            ? filp_flush+0x96/0xb0
            ? __x64_sys_close+0x42/0x90
            ? x64_sys_call+0x1c48/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? do_syscall_64+0xbc/0x8c0
            ? x64_sys_call+0x2404/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? do_sys_openat2+0x8e/0xd0
            ? __x64_sys_openat+0x58/0xa0
            ? x64_sys_call+0x101f/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? count_memcg_events+0xf1/0x1e0
            ? get_close_on_exec+0x3b/0x50
            ? do_fcntl+0x27a/0x7d0
            ? handle_mm_fault+0x1d2/0x2b0
            ? __x64_sys_fcntl+0x9d/0x130
            ? x64_sys_call+0x2404/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? exc_page_fault+0x97/0x1b0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

Thanks,
Hao

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
  2025-10-16 11:04                         ` Hao Jia
@ 2025-10-16 11:46                           ` Aaron Lu
  0 siblings, 0 replies; 25+ messages in thread
From: Aaron Lu @ 2025-10-16 11:46 UTC (permalink / raw)
  To: Hao Jia
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
	linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
	Chen Yu, Matteo Martelli, Michal Koutný,
	Sebastian Andrzej Siewior

On Thu, Oct 16, 2025 at 07:04:30PM +0800, Hao Jia wrote:
> On 2025/10/16 17:23, Aaron Lu wrote:
... ...
> > With the above said, this shouldn't matter anymore but just out of
> > curiosity: did you notice this by inspecting the code or actually
> > hitting the warning about using an outdated rq clock?
> > 
> > Per my understanding, most likely: __assign_cfs_rq_runtime() in
> > throttle_cfs_rq(cfs_rq) will grant 1ns runtime to cfs_rq so it won't
> > reach tg_throttle_down(). The comment I added above that if condition
> > is kind of misleading though.
> 
> 
> I did encounter this once.
> 
> perhaps in the following corner case:
> 
> If cfs_b->quota is set low (and quota is set at each level), and there are a
> large number of CPUs.
> 
> After tg_set_cfs_bandwidth()->__refill_cfs_bandwidth_runtime(), we release
> cfs_b->lock. cfs_b->runtime might be consumed by cfs_rq on other CPUs.

Ah, I see. So I suppose in your setup, tasks are already running and then
quota is set.

> 
> Then, on one online CPU, we can't get 1ns runtime via
> __assign_cfs_rq_runtime(). Current limiting is triggered on this CPU, and
> tg_throttle_down()->rq_clock_pelt() is called.
> 

Got it, thanks for the info!

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2025-10-16 11:46 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-09-29  7:46 [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
2025-09-29  9:34 ` K Prateek Nayak
2025-09-29 10:55   ` Aaron Lu
2025-09-30  7:56   ` Aaron Lu
2025-09-30  8:58     ` K Prateek Nayak
2025-09-30  9:27       ` Aaron Lu
2025-09-30 11:07       ` Aaron Lu
2025-09-30 12:39         ` Aaron Lu
2025-09-30 13:38         ` K Prateek Nayak
2025-10-01 11:58           ` Aaron Lu
2025-10-14  7:43 ` Hao Jia
2025-10-14  9:11   ` Aaron Lu
2025-10-14 11:01     ` Hao Jia
2025-10-14 11:50       ` Aaron Lu
2025-10-15  1:43         ` Hao Jia
2025-10-15  1:48           ` Hao Jia
2025-10-15  2:51           ` Aaron Lu
2025-10-15  6:31             ` Hao Jia
2025-10-15  8:40               ` Aaron Lu
2025-10-15 10:21                 ` Hao Jia
2025-10-16  6:54                   ` Aaron Lu
2025-10-16  7:49                     ` Hao Jia
2025-10-16  9:23                       ` Aaron Lu
2025-10-16 11:04                         ` Hao Jia
2025-10-16 11:46                           ` Aaron Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox