* [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
From: Aaron Lu @ 2025-10-23 8:56 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Hao Jia, Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot,
Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
When a cfs_rq is to be throttled, its limbo list should be empty, and
that's why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.
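Condensed, the check looks like this (a sketch of the warn, not the
verbatim source):

    /* in tg_throttle_down(): the limbo list must be empty at this point */
    WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));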
When running a test with the following hierarchy:
        root
        /  \
       A*   ...
     / | \ ...
    B
   / \
      C*
where both A and C have quota set, that warning on a non-empty limbo
list is triggered for a cfs_rq of C; call it cfs_rq_c (ignoring the CPU
part of the cfs_rq for the sake of simpler representation).
Debugging showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equal to 0, and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are directly placed into cfs_rq_c's limbo list by
enqueue_throttled_task().
Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.
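In call-flow form (a simplified sketch; function names are those
mentioned above, bodies and locking elided):

    tg_set_cfs_bandwidth(C)
        cfs_rq_c->runtime_enabled = 1;
        cfs_rq_c->runtime_remaining = 0;        /* left unthrottled */

    /* A is throttled; throttled tasks migrate to C */
    enqueue_task_fair(rq, throttled_task)
        enqueue_throttled_task(p)               /* p parked on cfs_rq_c's limbo list */

    /* A is unthrottled */
    unthrottle_cfs_rq(cfs_rq_a)
        tg_unthrottle_up(cfs_rq_c)
            /* enqueue the first task from the limbo list */
            check_enqueue_throttle(cfs_rq_c)    /* runtime_remaining == 0 */
                throttle_cfs_rq(cfs_rq_c)       /* can't get more runtime */
                    tg_throttle_down(cfs_rq_c)  /* warns: limbo list not empty */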
Triggering a throttle on the unthrottle path is messy: the cfs_rq being
unthrottled can end up in a mixed state. Fix this by calling
throttle_cfs_rq() in tg_set_cfs_bandwidth() immediately after enabling
bandwidth and setting runtime_remaining = 0. This ensures cfs_rq_c is
throttled upfront and cannot enter tg_unthrottle_up() with zero
runtime_remaining.
Also, update the outdated comment in unthrottle_cfs_rq() since it is no
longer called with zero runtime_remaining after a quota change.
While at it, remove a redundant assignment to se in unthrottle_cfs_rq().
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
v2: add update_rq_clock() before throttle_cfs_rq() as reported by Hao
Jia; otherwise a warning about an outdated rq clock is triggered in
tg_throttle_down(). This can happen when the user specifies a tiny quota.
Note that Hao Jia also proposed another solution by using a special flag
when doing enqueue_task_fair() in unthrottle path to avoid doing
check_enqueue_throttle() [0]. I think that approach is fine too and it
also has the benefit of not needing to worry about any other potential
cases where a cfs_rq is unthrottled with <= 0 runtime_remaining. Thoughts
on which approach to take are welcome, thanks.
[0]: https://lore.kernel.org/lkml/c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com/
kernel/sched/core.c | 11 ++++++++++-
kernel/sched/fair.c | 16 +++++++---------
kernel/sched/sched.h | 1 +
3 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f1ebf67b48e21..58185ec5b8efd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9608,7 +9608,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
cfs_rq->runtime_enabled = runtime_enabled;
cfs_rq->runtime_remaining = 0;
- if (cfs_rq->throttled)
+ /*
+ * Throttle cfs_rq now or it can be unthrottled with zero
+ * runtime_remaining and get throttled on its unthrottle path.
+ */
+ if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
+ update_rq_clock(rq);
+ throttle_cfs_rq(cfs_rq);
+ }
+
+ if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
unthrottle_cfs_rq(cfs_rq);
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb2795..ddf405497b828 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5976,7 +5976,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
return 0;
}
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -6025,19 +6025,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
/*
* It's possible we are called with !runtime_remaining due to things
- * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
- * unthrottled us with a positive runtime_remaining but other still
- * running entities consumed those runtime before we reached here.
+ * like async unthrottled us with a positive runtime_remaining but
+ * other still running entities consumed that runtime before we
+ * reached here.
*
- * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
- * because any enqueue in tg_unthrottle_up() will immediately trigger a
- * throttle, which is not supposed to happen on unthrottle path.
+ * We can't unthrottle this cfs_rq without any runtime remaining
+ * because any enqueue in tg_unthrottle_up() will immediately trigger
+ * a throttle, which is not supposed to happen on the unthrottle path.
*/
if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
return;
- se = cfs_rq->tg->se[cpu_of(rq)];
-
cfs_rq->throttled = 0;
update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1f5d07067f60a..c70a833ac9a24 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -583,6 +583,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
extern bool cfs_task_bw_constrained(struct task_struct *p);
--
2.39.5
* Re: [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
From: Benjamin Segall @ 2025-10-27 22:33 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra, Hao Jia,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
Aaron Lu <ziqianlu@bytedance.com> writes:
> When a cfs_rq is to be throttled, its limbo list should be empty, and
> that's why there is a warning in tg_throttle_down() for a non-empty
> cfs_rq->throttled_limbo_list.
>
> When running a test with the following hierarchy:
>
>         root
>         /  \
>        A*   ...
>      / | \ ...
>     B
>    / \
>       C*
>
> where both A and C have quota set, that warning on a non-empty limbo
> list is triggered for a cfs_rq of C; call it cfs_rq_c (ignoring the CPU
> part of the cfs_rq for the sake of simpler representation).
>
> Debugging showed it happened like this:
> Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> equal to 0, and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> these throttled tasks are directly placed into cfs_rq_c's limbo list by
> enqueue_throttled_task().
>
> Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> can't get more runtime and enters tg_throttle_down(), where the warning
> is hit due to remaining tasks in the limbo list.
>
> Triggering a throttle on the unthrottle path is messy: the cfs_rq being
> unthrottled can end up in a mixed state. Fix this by calling
> throttle_cfs_rq() in tg_set_cfs_bandwidth() immediately after enabling
> bandwidth and setting runtime_remaining = 0. This ensures cfs_rq_c is
> throttled upfront and cannot enter tg_unthrottle_up() with zero
> runtime_remaining.
>
> Also, update the outdated comment in unthrottle_cfs_rq() since it is no
> longer called with zero runtime_remaining after a quota change.
>
> While at it, remove a redundant assignment to se in unthrottle_cfs_rq().
>
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
> v2: add update_rq_clock() before throttle_cfs_rq() as reported by Hao
> Jia; otherwise a warning about an outdated rq clock is triggered in
> tg_throttle_down(). This can happen when the user specifies a tiny quota.
>
> Note that Hao Jia also proposed another solution by using a special flag
> when doing enqueue_task_fair() in unthrottle path to avoid doing
> check_enqueue_throttle() [0]. I think that approach is fine too and it
> also has the benefit of not needing to worry about any other potential
> cases where a cfs_rq is unthrottled with <= 0 runtime_remaining. Thoughts
> on which approach to take are welcome, thanks.
> [0]: https://lore.kernel.org/lkml/c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com/
>
> kernel/sched/core.c | 11 ++++++++++-
> kernel/sched/fair.c | 16 +++++++---------
> kernel/sched/sched.h | 1 +
> 3 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f1ebf67b48e21..58185ec5b8efd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9608,7 +9608,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
> cfs_rq->runtime_enabled = runtime_enabled;
> cfs_rq->runtime_remaining = 0;
>
> - if (cfs_rq->throttled)
> + /*
> + * Throttle cfs_rq now or it can be unthrottled with zero
> + * runtime_remaining and get throttled on its unthrottle path.
> + */
> + if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
> + update_rq_clock(rq);
> + throttle_cfs_rq(cfs_rq);
> + }
> +
> + if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
> unthrottle_cfs_rq(cfs_rq);
> }
>
So if this is the only case it can come up, and it only occurs because
we set runtime_remaining = 0 and check in unthrottle with <= 0, then I
think we should just set runtime_remaining = 1 here.
That seems simpler than either throttling immediately (despite likely
having plenty of cfs_b->runtime) or adding an enqueue flag. Adding NR_CPUS
nanoseconds' worth of quota on configure seems fine.
unthrottle_cfs_rq()/tg_unthrottle_up() itself doesn't drop the rq lock, so
we shouldn't be able to see cfs_rq->runtime_remaining being consumed during
it, even if it's running on a remote CPU while threads in the cfs_rq are
running. They should wind up stuck waiting on the rq lock before they can
update runtime_remaining.
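(For reference, a condensed sketch of the accounting path this relies on,
trimmed from mainline kernel/sched/fair.c; some checks elided:)

    /*
     * Runtime is only ever consumed here, called from update_curr(),
     * which always runs with the rq lock held.
     */
    static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
    {
            cfs_rq->runtime_remaining -= delta_exec;

            if (likely(cfs_rq->runtime_remaining > 0))
                    return;

            /* try to refill from the global pool; resched to throttle on failure */
            if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
                    resched_curr(rq_of(cfs_rq));
    }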
Is there anything you see missing from that approach? I think the = 0
in particular here is just an artifact, and while the extra check for
runtime_remaining in unthrottle isn't unreasonable, the conflict with
tg_set_cfs_bandwidth isn't a fundamental issue.
* Re: [PATCH v2] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
From: Aaron Lu @ 2025-10-28 6:36 UTC (permalink / raw)
To: Benjamin Segall
Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra, Hao Jia,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Chen Yu, Matteo Martelli, Michal Koutný,
Sebastian Andrzej Siewior
On Mon, Oct 27, 2025 at 03:33:19PM -0700, Benjamin Segall wrote:
> Aaron Lu <ziqianlu@bytedance.com> writes:
>
> > When a cfs_rq is to be throttled, its limbo list should be empty, and
> > that's why there is a warning in tg_throttle_down() for a non-empty
> > cfs_rq->throttled_limbo_list.
> >
> > When running a test with the following hierarchy:
> >
> >         root
> >         /  \
> >        A*   ...
> >      / | \ ...
> >     B
> >    / \
> >       C*
> >
> > where both A and C have quota set, that warning on a non-empty limbo
> > list is triggered for a cfs_rq of C; call it cfs_rq_c (ignoring the CPU
> > part of the cfs_rq for the sake of simpler representation).
> >
> > Debugging showed it happened like this:
> > Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> > cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> > equal to 0, and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> > *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> > group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> > called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> > these throttled tasks are directly placed into cfs_rq_c's limbo list by
> > enqueue_throttled_task().
> >
> > Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> > tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> > runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> > can't get more runtime and enters tg_throttle_down(), where the warning
> > is hit due to remaining tasks in the limbo list.
> >
> > Triggering a throttle on the unthrottle path is messy: the cfs_rq being
> > unthrottled can end up in a mixed state. Fix this by calling
> > throttle_cfs_rq() in tg_set_cfs_bandwidth() immediately after enabling
> > bandwidth and setting runtime_remaining = 0. This ensures cfs_rq_c is
> > throttled upfront and cannot enter tg_unthrottle_up() with zero
> > runtime_remaining.
> >
> > Also, update the outdated comment in unthrottle_cfs_rq() since it is no
> > longer called with zero runtime_remaining after a quota change.
> >
> > While at it, remove a redundant assignment to se in unthrottle_cfs_rq().
> >
> > Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > ---
> > v2: add update_rq_clock() before throttle_cfs_rq() as reported by Hao
> > Jia; otherwise a warning about an outdated rq clock is triggered in
> > tg_throttle_down(). This can happen when the user specifies a tiny quota.
> >
> > Note that Hao Jia also proposed another solution by using a special flag
> > when doing enqueue_task_fair() in unthrottle path to avoid doing
> > check_enqueue_throttle() [0]. I think that approach is fine too and it
> > also has the benefit of not needing to worry about any other potential
> > cases where a cfs_rq is unthrottled with <= 0 runtime_remaining. Thoughts
> > on which approach to take are welcome, thanks.
> > [0]: https://lore.kernel.org/lkml/c4a1bcea-fb00-6f3f-6bf6-d876393190e4@gmail.com/
> >
> > kernel/sched/core.c | 11 ++++++++++-
> > kernel/sched/fair.c | 16 +++++++---------
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 18 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f1ebf67b48e21..58185ec5b8efd 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9608,7 +9608,16 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
> > cfs_rq->runtime_enabled = runtime_enabled;
> > cfs_rq->runtime_remaining = 0;
> >
> > - if (cfs_rq->throttled)
> > + /*
> > + * Throttle cfs_rq now or it can be unthrottled with zero
> > + * runtime_remaining and get throttled on its unthrottle path.
> > + */
> > + if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
> > + update_rq_clock(rq);
> > + throttle_cfs_rq(cfs_rq);
> > + }
> > +
> > + if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
> > unthrottle_cfs_rq(cfs_rq);
> > }
> >
>
> So if this is the only case it can come up, and it only occurs because
> we set runtime_remaining = 0 and check in unthrottle with <= 0, then I
> think we should just set runtime_remaining = 1 here.
>
Thanks Ben, I like your suggestion and tested that it works for the case
I described here. I think it should also work for the case Hao Jia
described in his patch's changelog.
> That seems simpler than either throttling immediately (despite likely
> having plenty of cfs_b->runtime) or adding an enqueue flag. Adding NR_CPUS
> nanoseconds' worth of quota on configure seems fine.
Agree.
>
> unthrottle_cfs_rq()/tg_unthrottle_up() itself doesn't drop the rq lock, so
> we shouldn't be able to see cfs_rq->runtime_remaining being consumed during
> it, even if it's running on a remote CPU while threads in the cfs_rq are
> running. They should wind up stuck waiting on the rq lock before they can
> update runtime_remaining.
>
> Is there anything you see missing from that approach?
Not any that I'm aware of.
> I think the = 0 in particular here is just an artifact, and while the
> extra check for runtime_remaining in unthrottle isn't unreasonable, the
> conflict with tg_set_cfs_bandwidth isn't a fundamental issue.
Got it, thanks for the suggestion, will change the patch accordingly for
v3. I think it will become a simple one-line change:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f1e5cb94c536..23f92222aedf3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9606,7 +9606,7 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
guard(rq_lock_irq)(rq);
cfs_rq->runtime_enabled = runtime_enabled;
- cfs_rq->runtime_remaining = 0;
+ cfs_rq->runtime_remaining = 1;
if (cfs_rq->throttled)
unthrottle_cfs_rq(cfs_rq);