[PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy
@ 2026-05-28  9:48 K Prateek Nayak
  2026-05-28  9:48 ` [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards K Prateek Nayak
                   ` (6 more replies)
  0 siblings, 7 replies; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

The current hierarchy is always throttled in __schedule() during the
pick when update_curr() detects a cfs_rq running out of the bandwidth
and issues a resched.

This was necessary prior to per-task throttling where the entire
throttled hierarchy was dequeued at the point of first throttle during
the pick but with per-task throttling, tasks continue to run as usual
until they exit to userspace and dequeue themselves one-by-one until the
hierarchy is deemed fully throttled and the PELT is frozen.

throttle_cfs_rq() is now simply a propagator of throttle indicators and
nothing more.

Unify the throttling for current hierarchy under
account_cfs_rq_runtime() which is responsible for the time accounting.
If the bandwidth runs out, account_cfs_rq_runtime() will request for
sched_cfs_bandwidth_slice() and mark the hierarchy as throttled if it
fails to grab bandwidth.

throttle_cfs_rq() will do a task_throttle_setup_work() if it finds the
current task to be on a throttled hierarchy and the task will naturally
dequeue itself when it exits to the userspace without needing an
explicit resched.

First four patches are cleanups and preparation for the final bit that
switches over to using account_cfs_rq_runtime() for throttling which was
provided by Peter in [1].

Following are the results of running hackbench running 3 levels deep
with the setup from "Testing" section on [2] when compared to
tip:sched/core:

  kernel        :  tip        tip + series

  Min           : 207.33        202.20
  Max           : 210.20        222.47
  Median        : 207.83        218.33
  AMean         : 208.29        215.36
  GMean         : 208.29        215.25
  HMean         : 208.29        215.13
  AMean Stddev  : 1.02          7.37
  AMean CoefVar : 0.49 pct      3.42 pct

  All numbers are in seconds.

There is a slight boot to boot variation for this benchmark but the
utilization numbers in top is more or less similar between the two.
Additional testing and feedback is always appreciated as usual :-)

Patches are based on tip:sched/core at commit 9e005ed21152
("sched/topology: Allow multiple domains to claim sched_domain_shared")
All testing was done on a dual socket 4th Generation EPYC system (2 x
128C/256T). CONFIG_CFS_BANDWIDTH=n was only build tested.

Patches also cleanly apply on top of Zecheng's optimization from [3]
when applied on top of the same base. Peter, there is only one trivial
conflict with sched/flat, and Zecheng's optimization is generally
beneficial for deep hierarchies even with flattened pick.

References
==========

[1] https://lore.kernel.org/lkml/20260512110932.GB1889694@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/lkml/20250220093257.9380-1-kprateek.nayak@amd.com/
[3] https://lore.kernel.org/lkml/20260522141623.600235-4-zli94@ncsu.edu/

---
K Prateek Nayak (4):
  sched/fair: Convert cfs bandwidth throttling to use guards
  sched/fair: Use throttled_csd_list for local unthrottle
  sched/fair: Call update_curr() before unthrottling the hierarchy
  sched/fair: Move the throttled tasks to a local list in
    tg_unthrottle_up()

Peter Zijlstra (1):
  sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()

 kernel/sched/fair.c | 342 +++++++++++++++++++++++---------------------
 1 file changed, 181 insertions(+), 161 deletions(-)

base-commit: 9e005ed21152d4a4bb0ceea71045ff8a642a6feb
-- 
2.34.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
@ 2026-05-28  9:48 ` K Prateek Nayak
  2026-05-28 21:46   ` Benjamin Segall
  2026-05-28  9:48 ` [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle K Prateek Nayak
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

Routine conversion of rcu_read_lock(), spin_lock*, and rq_lock usage
within the cfs bandwidth controller to use class guards.

Only notable changes are:

- Conversion of do_sched_cfs_slack_timer() to return a bool and moving
  the call to distribute_cfs_runtime() into the caller for a cleaner
  guard usage.

- Reordering of list_del_rcu() against throttled_clock indicator update
  in unthrottle_cfs_rq(). Both are done with "cfs_b->lock" held after
  the "cfs_rq->throttled" is cleared which make the reordering safe
  against concurrent list modifications.

No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 174 ++++++++++++++++++++------------------------
 1 file changed, 79 insertions(+), 95 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8e858ca6bcd0..8c2a5a2f046d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6511,13 +6511,10 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	int ret;
 
-	raw_spin_lock(&cfs_b->lock);
-	ret = __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice());
-	raw_spin_unlock(&cfs_b->lock);
+	guard(raw_spinlock)(&cfs_b->lock);
 
-	return ret;
+	return __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice());
 }
 
 static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
@@ -6806,33 +6803,32 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	int dequeue = 1;
 
-	raw_spin_lock(&cfs_b->lock);
-	/* This will start the period timer if necessary */
-	if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
+	scoped_guard(raw_spinlock, &cfs_b->lock) {
 		/*
-		 * We have raced with bandwidth becoming available, and if we
-		 * actually throttled the timer might not unthrottle us for an
-		 * entire period. We additionally needed to make sure that any
-		 * subsequent check_cfs_rq_runtime calls agree not to throttle
-		 * us, as we may commit to do cfs put_prev+pick_next, so we ask
-		 * for 1ns of runtime rather than just check cfs_b.
+		 * Check if We have raced with bandwidth becoming available. If
+		 * we actually throttled the timer might not unthrottle us for
+		 * an entire period. We additionally needed to make sure that
+		 * any subsequent check_cfs_rq_runtime calls agree not to
+		 * throttle us, as we may commit to do cfs put_prev+pick_next,
+		 * so we ask for 1ns of runtime rather than just check cfs_b.
+		 *
+		 * This will start the period timer if necessary.
+		 */
+		if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1))
+			return false;
+
+		/*
+		 * No bandwidth available; Add ourselves on the list to be
+		 * unthrottled later.
 		 */
-		dequeue = 0;
-	} else {
 		list_add_tail_rcu(&cfs_rq->throttled_list,
 				  &cfs_b->throttled_cfs_rq);
 	}
-	raw_spin_unlock(&cfs_b->lock);
-
-	if (!dequeue)
-		return false;  /* Throttle no longer required. */
 
 	/* freeze hierarchy runnable averages while throttled */
-	rcu_read_lock();
-	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
-	rcu_read_unlock();
+	scoped_guard(rcu)
+		walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 
 	/*
 	 * Note: distribution will already see us throttled via the
@@ -6865,13 +6861,15 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	update_rq_clock(rq);
 
-	raw_spin_lock(&cfs_b->lock);
-	if (cfs_rq->throttled_clock) {
+	scoped_guard(raw_spinlock, &cfs_b->lock) {
+		list_del_rcu(&cfs_rq->throttled_list);
+
+		if (!cfs_rq->throttled_clock)
+			break;
+
 		cfs_b->throttled_time += rq_clock(rq) - cfs_rq->throttled_clock;
 		cfs_rq->throttled_clock = 0;
 	}
-	list_del_rcu(&cfs_rq->throttled_list);
-	raw_spin_unlock(&cfs_b->lock);
 
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
@@ -6900,9 +6898,8 @@ static void __cfsb_csd_unthrottle(void *arg)
 {
 	struct cfs_rq *cursor, *tmp;
 	struct rq *rq = arg;
-	struct rq_flags rf;
 
-	rq_lock(rq, &rf);
+	guard(rq_lock)(rq);
 
 	/*
 	 * Iterating over the list can trigger several call to
@@ -6919,7 +6916,7 @@ static void __cfsb_csd_unthrottle(void *arg)
 	 * race with group being freed in the window between removing it
 	 * from the list and advancing to the next entry in the list.
 	 */
-	rcu_read_lock();
+	guard(rcu)();
 
 	list_for_each_entry_safe(cursor, tmp, &rq->cfsb_csd_list,
 				 throttled_csd_list) {
@@ -6929,10 +6926,7 @@ static void __cfsb_csd_unthrottle(void *arg)
 			unthrottle_cfs_rq(cursor);
 	}
 
-	rcu_read_unlock();
-
 	rq_clock_stop_loop_update(rq);
-	rq_unlock(rq, &rf);
 }
 
 static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
@@ -6972,11 +6966,11 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 	u64 runtime, remaining = 1;
 	bool throttled = false;
 	struct cfs_rq *cfs_rq, *tmp;
-	struct rq_flags rf;
 	struct rq *rq;
 	LIST_HEAD(local_unthrottle);
 
-	rcu_read_lock();
+	guard(rcu)();
+
 	list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
 				throttled_list) {
 		rq = rq_of(cfs_rq);
@@ -6986,24 +6980,25 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 			break;
 		}
 
-		rq_lock_irqsave(rq, &rf);
+		guard(rq_lock_irqsave)(rq);
+
 		if (!cfs_rq_throttled(cfs_rq))
-			goto next;
+			continue;
 
 		/* Already queued for async unthrottle */
 		if (!list_empty(&cfs_rq->throttled_csd_list))
-			goto next;
+			continue;
 
 		/* By the above checks, this should never be true */
 		WARN_ON_ONCE(cfs_rq->runtime_remaining > 0);
 
-		raw_spin_lock(&cfs_b->lock);
-		runtime = -cfs_rq->runtime_remaining + 1;
-		if (runtime > cfs_b->runtime)
-			runtime = cfs_b->runtime;
-		cfs_b->runtime -= runtime;
-		remaining = cfs_b->runtime;
-		raw_spin_unlock(&cfs_b->lock);
+		scoped_guard(raw_spinlock, &cfs_b->lock) {
+			runtime = -cfs_rq->runtime_remaining + 1;
+			if (runtime > cfs_b->runtime)
+				runtime = cfs_b->runtime;
+			cfs_b->runtime -= runtime;
+			remaining = cfs_b->runtime;
+		}
 
 		cfs_rq->runtime_remaining += runtime;
 
@@ -7011,40 +7006,33 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 		if (cfs_rq->runtime_remaining > 0) {
 			if (cpu_of(rq) != this_cpu) {
 				unthrottle_cfs_rq_async(cfs_rq);
-			} else {
-				/*
-				 * We currently only expect to be unthrottling
-				 * a single cfs_rq locally.
-				 */
-				WARN_ON_ONCE(!list_empty(&local_unthrottle));
-				list_add_tail(&cfs_rq->throttled_csd_list,
-					      &local_unthrottle);
+				continue;
 			}
-		} else {
-			throttled = true;
+
+			/*
+			 * We currently only expect to be unthrottling
+			 * a single cfs_rq locally.
+			 */
+			WARN_ON_ONCE(!list_empty(&local_unthrottle));
+			list_add_tail(&cfs_rq->throttled_csd_list, &local_unthrottle);
+			continue;
 		}
 
-next:
-		rq_unlock_irqrestore(rq, &rf);
+		throttled = true;
 	}
 
 	list_for_each_entry_safe(cfs_rq, tmp, &local_unthrottle,
 				 throttled_csd_list) {
 		struct rq *rq = rq_of(cfs_rq);
 
-		rq_lock_irqsave(rq, &rf);
+		guard(rq_lock_irqsave)(rq);
 
 		list_del_init(&cfs_rq->throttled_csd_list);
-
 		if (cfs_rq_throttled(cfs_rq))
 			unthrottle_cfs_rq(cfs_rq);
-
-		rq_unlock_irqrestore(rq, &rf);
 	}
 	WARN_ON_ONCE(!list_empty(&local_unthrottle));
 
-	rcu_read_unlock();
-
 	return throttled;
 }
 
@@ -7167,7 +7155,8 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	if (slack_runtime <= 0)
 		return;
 
-	raw_spin_lock(&cfs_b->lock);
+	guard(raw_spinlock)(&cfs_b->lock);
+
 	if (cfs_b->quota != RUNTIME_INF) {
 		cfs_b->runtime += slack_runtime;
 
@@ -7176,7 +7165,6 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 		    !list_empty(&cfs_b->throttled_cfs_rq))
 			start_cfs_slack_bandwidth(cfs_b);
 	}
-	raw_spin_unlock(&cfs_b->lock);
 
 	/* even if it's not valid for return we don't want to try again */
 	cfs_rq->runtime_remaining -= slack_runtime;
@@ -7197,29 +7185,25 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
  * This is done with a timer (instead of inline with bandwidth return) since
  * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
  */
-static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
+static bool do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
 	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
-	unsigned long flags;
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	guard(raw_spinlock_irqsave)(&cfs_b->lock);
+
 	cfs_b->slack_started = false;
 
-	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
-		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
-		return;
-	}
+	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
+		return false;
 
 	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
 		runtime = cfs_b->runtime;
 
-	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
-
 	if (!runtime)
-		return;
+		return false;
 
-	distribute_cfs_runtime(cfs_b);
+	return true;
 }
 
 /*
@@ -7297,7 +7281,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, slack_timer);
 
-	do_sched_cfs_slack_timer(cfs_b);
+	if (do_sched_cfs_slack_timer(cfs_b))
+		distribute_cfs_runtime(cfs_b);
 
 	return HRTIMER_NORESTART;
 }
@@ -7306,18 +7291,18 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
 		container_of(timer, struct cfs_bandwidth, period_timer);
-	unsigned long flags;
 	int overrun;
 	int idle = 0;
 	int count = 0;
 
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	CLASS(raw_spinlock_irqsave, cfsb_guard)(&cfs_b->lock);
+
 	for (;;) {
 		overrun = hrtimer_forward_now(timer, cfs_b->period);
 		if (!overrun)
 			break;
 
-		idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
+		idle = do_sched_cfs_period_timer(cfs_b, overrun, cfsb_guard.flags);
 
 		if (++count > 3) {
 			u64 new, old = ktime_to_ns(cfs_b->period);
@@ -7350,11 +7335,13 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 			count = 0;
 		}
 	}
-	if (idle)
+
+	if (idle) {
 		cfs_b->period_active = 0;
-	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
+		return HRTIMER_NORESTART;
+	}
 
-	return idle ? HRTIMER_NORESTART : HRTIMER_RESTART;
+	return HRTIMER_RESTART;
 }
 
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent)
@@ -7421,14 +7408,12 @@ static void destroy_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	 */
 	for_each_possible_cpu(i) {
 		struct rq *rq = cpu_rq(i);
-		unsigned long flags;
 
 		if (list_empty(&rq->cfsb_csd_list))
 			continue;
 
-		local_irq_save(flags);
-		__cfsb_csd_unthrottle(rq);
-		local_irq_restore(flags);
+		scoped_guard(irqsave)
+			__cfsb_csd_unthrottle(rq);
 	}
 }
 
@@ -7446,16 +7431,15 @@ static void __maybe_unused update_runtime_enabled(struct rq *rq)
 
 	lockdep_assert_rq_held(rq);
 
-	rcu_read_lock();
+	guard(rcu)();
+
 	list_for_each_entry_rcu(tg, &task_groups, list) {
 		struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
 		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
-		raw_spin_lock(&cfs_b->lock);
-		cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
-		raw_spin_unlock(&cfs_b->lock);
+		scoped_guard(raw_spinlock, &cfs_b->lock)
+			cfs_rq->runtime_enabled = cfs_b->quota != RUNTIME_INF;
 	}
-	rcu_read_unlock();
 }
 
 /* cpu offline callback */
@@ -7476,7 +7460,8 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 	 */
 	rq_clock_start_loop_update(rq);
 
-	rcu_read_lock();
+	guard(rcu)();
+
 	list_for_each_entry_rcu(tg, &task_groups, list) {
 		struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
@@ -7499,7 +7484,6 @@ static void __maybe_unused unthrottle_offline_cfs_rqs(struct rq *rq)
 		cfs_rq->runtime_remaining = 1;
 		unthrottle_cfs_rq(cfs_rq);
 	}
-	rcu_read_unlock();
 
 	rq_clock_stop_loop_update(rq);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards
  2026-05-28  9:48 ` [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards K Prateek Nayak
@ 2026-05-28 21:46   ` Benjamin Segall
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2026-05-28 21:46 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> Routine conversion of rcu_read_lock(), spin_lock*, and rq_lock usage
> within the cfs bandwidth controller to use class guards.
>
> Only notable changes are:
>
> - Conversion of do_sched_cfs_slack_timer() to return a bool and moving
>   the call to distribute_cfs_runtime() into the caller for a cleaner
>   guard usage.
>
> - Reordering of list_del_rcu() against throttled_clock indicator update
>   in unthrottle_cfs_rq(). Both are done with "cfs_b->lock" held after
>   the "cfs_rq->throttled" is cleared which make the reordering safe
>   against concurrent list modifications.
>
> No functional changes intended.

Some minor style nitpicks that I don't care /that/ much about, so:

Reviewed-By: Ben Segall <bsegall@google.com>

> @@ -7011,40 +7006,33 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
>  		if (cfs_rq->runtime_remaining > 0) {
>  			if (cpu_of(rq) != this_cpu) {
>  				unthrottle_cfs_rq_async(cfs_rq);
> -			} else {
> -				/*
> -				 * We currently only expect to be unthrottling
> -				 * a single cfs_rq locally.
> -				 */
> -				WARN_ON_ONCE(!list_empty(&local_unthrottle));
> -				list_add_tail(&cfs_rq->throttled_csd_list,
> -					      &local_unthrottle);
> +				continue;
>  			}
> -		} else {
> -			throttled = true;
> +
> +			/*
> +			 * We currently only expect to be unthrottling
> +			 * a single cfs_rq locally.
> +			 */
> +			WARN_ON_ONCE(!list_empty(&local_unthrottle));
> +			list_add_tail(&cfs_rq->throttled_csd_list, &local_unthrottle);
> +			continue;
>  		}
>  
> -next:
> -		rq_unlock_irqrestore(rq, &rf);
> +		throttled = true;
>  	}

I don't think the extra continues wind up being clearer here than the
original if/else code and just removing the next/unlock.

> @@ -7197,29 +7185,25 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   * This is done with a timer (instead of inline with bandwidth return) since
>   * it's necessary to juggle rq->locks to unthrottle their respective cfs_rqs.
>   */
> -static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> +static bool do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
> -	unsigned long flags;
>  
>  	/* confirm we're still not at a refresh boundary */
> -	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> +	guard(raw_spinlock_irqsave)(&cfs_b->lock);
> +
>  	cfs_b->slack_started = false;
>  
> -	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
> -		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> -		return;
> -	}
> +	if (runtime_refresh_within(cfs_b, min_bandwidth_expiration))
> +		return false;
>  
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>  
> -	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> -
>  	if (!runtime)
> -		return;
> +		return false;
>  
> -	distribute_cfs_runtime(cfs_b);
> +	return true;
>  }
>  
>  /*

Here I think it's also nicer to just use a scoped_guard rather than move
part of it out and have part of the work done outside of the do_ helper.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
  2026-05-28  9:48 ` [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards K Prateek Nayak
@ 2026-05-28  9:48 ` K Prateek Nayak
  2026-05-28 21:53   ` Benjamin Segall
  2026-05-28  9:48 ` [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy K Prateek Nayak
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

When distribute_cfs_runtime() encounters a local cfs_rq, it adds it to a
local list and unthrottles it at the end, when it is done unthrottling
other cfs_rq(s) on cfs_b->throttled_cfs_rq until the bandwidth runs out.

Instead of using a local list, reuse the local CPU's
rq->throttled_csd_list and the __cfsb_csd_unthrottle() path for
unthrottle.

If this is the first cfs_rq to be queued on the "throttled_csd_list", it
prevents the need for a remote CPUs to interrupt this local CPU if they
themselves are performing async unthrottle.

If this is not the first cfs_rq on the list, there is an async unthrottle
operation pending on this local CPU and the unthrottle can be batched
together.

No functional changes intended.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c2a5a2f046d..23742d928b51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6962,12 +6962,11 @@ static void unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
 
 static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 {
+	bool throttled = false, unthrottle_local = false;
 	int this_cpu = smp_processor_id();
 	u64 runtime, remaining = 1;
-	bool throttled = false;
-	struct cfs_rq *cfs_rq, *tmp;
+	struct cfs_rq *cfs_rq;
 	struct rq *rq;
-	LIST_HEAD(local_unthrottle);
 
 	guard(rcu)();
 
@@ -7010,28 +7009,27 @@ static bool distribute_cfs_runtime(struct cfs_bandwidth *cfs_b)
 			}
 
 			/*
-			 * We currently only expect to be unthrottling
-			 * a single cfs_rq locally.
+			 * Allow a parallel async unthrottle to unthrottle
+			 * this cfs_rq too via __cfsb_csd_unthrottle().
+			 * If we are first, do it ourselves at the end and
+			 * save on an IPI from remote CPUs.
 			 */
-			WARN_ON_ONCE(!list_empty(&local_unthrottle));
-			list_add_tail(&cfs_rq->throttled_csd_list, &local_unthrottle);
+			unthrottle_local = list_empty(&rq->cfsb_csd_list);
+			list_add_tail(&cfs_rq->throttled_csd_list, &rq->cfsb_csd_list);
 			continue;
 		}
 
 		throttled = true;
 	}
 
-	list_for_each_entry_safe(cfs_rq, tmp, &local_unthrottle,
-				 throttled_csd_list) {
-		struct rq *rq = rq_of(cfs_rq);
-
-		guard(rq_lock_irqsave)(rq);
-
-		list_del_init(&cfs_rq->throttled_csd_list);
-		if (cfs_rq_throttled(cfs_rq))
-			unthrottle_cfs_rq(cfs_rq);
+	if (unthrottle_local) {
+		/*
+		 * Protect against an IPI that is also trying to flush
+		 * the unthrottled cfs_rq(s) from this CPU's csd_list.
+		 */
+		scoped_guard(irqsave)
+			__cfsb_csd_unthrottle(cpu_rq(this_cpu));
 	}
-	WARN_ON_ONCE(!list_empty(&local_unthrottle));
 
 	return throttled;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle
  2026-05-28  9:48 ` [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle K Prateek Nayak
@ 2026-05-28 21:53   ` Benjamin Segall
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2026-05-28 21:53 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> When distribute_cfs_runtime() encounters a local cfs_rq, it adds it to a
> local list and unthrottles it at the end, when it is done unthrottling
> other cfs_rq(s) on cfs_b->throttled_cfs_rq until the bandwidth runs out.
>
> Instead of using a local list, reuse the local CPU's
> rq->throttled_csd_list and the __cfsb_csd_unthrottle() path for
> unthrottle.
>
> If this is the first cfs_rq to be queued on the "throttled_csd_list", it
> prevents the need for a remote CPUs to interrupt this local CPU if they
> themselves are performing async unthrottle.
>
> If this is not the first cfs_rq on the list, there is an async unthrottle
> operation pending on this local CPU and the unthrottle can be batched
> together.
>
> No functional changes intended.

Reviewed-By: Benjamin Segall <bsegall@google.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
  2026-05-28  9:48 ` [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards K Prateek Nayak
  2026-05-28  9:48 ` [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle K Prateek Nayak
@ 2026-05-28  9:48 ` K Prateek Nayak
  2026-05-28 22:03   ` Benjamin Segall
  2026-06-01  3:52   ` Aaron Lu
  2026-05-28  9:48 ` [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() K Prateek Nayak
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

Subsequent commits will allow update_curr() to throttle the hierarchy
when the runtime accounting exceeds allocated quota. Call update_curr()
before the unthrottle event, and in tg_unthrottle_up() to catch up on
any remaining runtime and stabilize the "runtime_remaining" and
"throttle_count" for that cfs_rq.

Doing an update_curr() early ensures the cfs_rq is not throttled right
back up again when the unthrottle is in progress.

Since all callers of unthrottle_cfs_rq(), except one, already update the
rq_clock and call rq_clock_start_loop_update(), move the
update_rq_clock() in unthrottle_cfs_rq() to the lonely caller that
doesn't update the rq_clock.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23742d928b51..b3b3172702a9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6711,6 +6711,15 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 	struct task_struct *p, *tmp;
 
+	/*
+	 * If cfs_rq->curr is set, the cfs_rq might not have caught up
+	 * since the last clock update. Do it now before we begin
+	 * queueing task onto it to save the need for unnecessarily
+	 * unthrottle the hierarchy for this cfs_rq to be throttled
+	 * right back again.
+	 */
+	update_curr(cfs_rq);
+
 	if (--cfs_rq->throttle_count)
 		return 0;
 
@@ -6853,14 +6862,16 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * We can't unthrottle this cfs_rq without any runtime remaining because
 	 * any enqueue in tg_unthrottle_up() will immediately trigger a throttle,
 	 * which is not supposed to happen on unthrottle path.
+	 *
+	 * Catch up on the remaining runtime since last clock update before
+	 * checking runtime remaining.
 	 */
+	update_curr(cfs_rq);
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
 	cfs_rq->throttled = 0;
 
-	update_rq_clock(rq);
-
 	scoped_guard(raw_spinlock, &cfs_b->lock) {
 		list_del_rcu(&cfs_rq->throttled_list);
 
@@ -6935,6 +6946,7 @@ static inline void __unthrottle_cfs_rq_async(struct cfs_rq *cfs_rq)
 	bool first;
 
 	if (rq == this_rq()) {
+		update_rq_clock(rq);
 		unthrottle_cfs_rq(cfs_rq);
 		return;
 	}
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-05-28  9:48 ` [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy K Prateek Nayak
@ 2026-05-28 22:03   ` Benjamin Segall
  2026-06-01  3:52   ` Aaron Lu
  1 sibling, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2026-05-28 22:03 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> Subsequent commits will allow update_curr() to throttle the hierarchy
> when the runtime accounting exceeds allocated quota. Call update_curr()
> before the unthrottle event, and in tg_unthrottle_up() to catch up on
> any remaining runtime and stabilize the "runtime_remaining" and
> "throttle_count" for that cfs_rq.
>
> Doing an update_curr() early ensures the cfs_rq is not throttled right
> back up again when the unthrottle is in progress.
>
> Since all callers of unthrottle_cfs_rq(), except one, already update the
> rq_clock and call rq_clock_start_loop_update(), move the
> update_rq_clock() in unthrottle_cfs_rq() to the lonely caller that
> doesn't update the rq_clock.

Reviewed-By: Benjamin Segall <bsegall@google.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-05-28  9:48 ` [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy K Prateek Nayak
  2026-05-28 22:03   ` Benjamin Segall
@ 2026-06-01  3:52   ` Aaron Lu
  2026-06-01  5:50     ` K Prateek Nayak
  1 sibling, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2026-06-01  3:52 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Josh Don, linux-kernel

Hi Prateek,

On Thu, May 28, 2026 at 09:48:28AM +0000, K Prateek Nayak wrote:
> Subsequent commits will allow update_curr() to throttle the hierarchy
> when the runtime accounting exceeds allocated quota. Call update_curr()
> before the unthrottle event, and in tg_unthrottle_up() to catch up on
> any remaining runtime and stabilize the "runtime_remaining" and
> "throttle_count" for that cfs_rq.
> 
> Doing an update_curr() early ensures the cfs_rq is not throttled right
> back up again when the unthrottle is in progress.
> 
> Since all callers of unthrottle_cfs_rq(), except one, already update the
> rq_clock and call rq_clock_start_loop_update(), move the
> update_rq_clock() in unthrottle_cfs_rq() to the lonely caller that
> doesn't update the rq_clock.

Looks like there is one more path that doesn't have clock updated:
cpu_max_write() -> tg_set_bandwidth() -> tg_set_cfs_bandwidth() ->
unthrottle_cfs_rq().

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-06-01  3:52   ` Aaron Lu
@ 2026-06-01  5:50     ` K Prateek Nayak
  2026-06-01 11:27       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-06-01  5:50 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Josh Don, linux-kernel

Hello Aaron,

On 6/1/2026 9:22 AM, Aaron Lu wrote:
>> Since all callers of unthrottle_cfs_rq(), except one, already update the
>> rq_clock and call rq_clock_start_loop_update(), move the
>> update_rq_clock() in unthrottle_cfs_rq() to the lonely caller that
>> doesn't update the rq_clock.
> 
> Looks like there is one more path that doesn't have clock updated:
> cpu_max_write() -> tg_set_bandwidth() -> tg_set_cfs_bandwidth() ->
> unthrottle_cfs_rq().

Woops! Completely missed that. Thank you for pointing it out. I'll
add an update_rq_clock() there too once we grab the rq_lock.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-06-01  5:50     ` K Prateek Nayak
@ 2026-06-01 11:27       ` Peter Zijlstra
  2026-06-02  6:33         ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2026-06-01 11:27 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Aaron Lu, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Josh Don, linux-kernel

On Mon, Jun 01, 2026 at 11:20:30AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 6/1/2026 9:22 AM, Aaron Lu wrote:
> >> Since all callers of unthrottle_cfs_rq(), except one, already update the
> >> rq_clock and call rq_clock_start_loop_update(), move the
> >> update_rq_clock() in unthrottle_cfs_rq() to the lonely caller that
> >> doesn't update the rq_clock.
> > 
> > Looks like there is one more path that doesn't have clock updated:
> > cpu_max_write() -> tg_set_bandwidth() -> tg_set_cfs_bandwidth() ->
> > unthrottle_cfs_rq().
> 
> Woops! Completely missed that. Thank you for pointing it out. I'll
> add an update_rq_clock() there too once we grab the rq_lock.


Like this?

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9830,6 +9830,7 @@ static int tg_set_cfs_bandwidth(struct t
 		struct rq *rq = cfs_rq->rq;
 
 		guard(rq_lock_irq)(rq);
+		update_rq_clock();
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 1;
 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy
  2026-06-01 11:27       ` Peter Zijlstra
@ 2026-06-02  6:33         ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2026-06-02  6:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Aaron Lu, Ingo Molnar, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Josh Don, linux-kernel

On 6/1/2026 4:57 PM, Peter Zijlstra wrote:
> Like this?
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9830,6 +9830,7 @@ static int tg_set_cfs_bandwidth(struct t
>  		struct rq *rq = cfs_rq->rq;
>  
>  		guard(rq_lock_irq)(rq);
> +		update_rq_clock();

Except that needs an rq as an argument ;-)

I've send out a v2 here for convenience:
https://lore.kernel.org/lkml/20260602050005.11160-1-kprateek.nayak@amd.com/

It addresses the comments from Ben, and I've move the update_rq_clock()
just before the unthrottle to avoid doing it unnecessarily in case the
cfs_rq is not throttled yet.

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
                   ` (2 preceding siblings ...)
  2026-05-28  9:48 ` [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy K Prateek Nayak
@ 2026-05-28  9:48 ` K Prateek Nayak
  2026-05-28 22:14   ` Benjamin Segall
  2026-05-28  9:48 ` [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() K Prateek Nayak
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

An update_curr() during the enqueue of throttled task will start
throttling the hierarchy from subsequent commit. This can lead to
tg_throttle_down() seeing non-empty throttled_limbo_list for the cfs_rq
attaching the task from throttled_limbo_list one by one. For example:

     R
     |
     A
    / \
  *B   C
       |
       rq->curr

*B is throttled with tasks on hte limbo list. When the tasks are
unthrottled via tg_unthrottle_up() and entity of group B is placed onto
A, update_curr() is called to catch up the vruntime and it may throttle
group A causing the subsequent tg_throttle_down() to see the pending
task's on B's limbo list.

  tg_unthrottle_up()
    /* --cfs_rq->throttle_count == 0 */
    list_for_each_entry_safe(p, cfs_rq->throttled_limbo_list)
      enqueue_task_fair()
        enqueue_entity(se /* B->se */)
          update_curr(cfs_rq /* A->gcfs_rq */)
            account_cfs_rq_runtime(cfs_rq)
              throttle_cfs_rq(cfs_rq /* A->gcfs_rq */ )
                tg_throttle_down()
                  /* Reaches B->cfs_rq with throttle_count == 0 */

                  !!! !list_empty(&cfs_rq->throttled_limbo_list)) !!!

Move the tasks from throttled_limbo_list onto a local list before
starting the unthrottle to prevent the splat described above. If the
hierarchy is throttled again in middle of an unthrottle, put the pending
tasks back onto the limbo list to prevent running them unnecessarily.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3b3172702a9..c48eaf2d7919 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6710,6 +6710,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 	struct task_struct *p, *tmp;
+	LIST_HEAD(throttled_tasks);
 
 	/*
 	 * If cfs_rq->curr is set, the cfs_rq might not have caught up
@@ -6740,13 +6741,31 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 		cfs_rq->throttled_clock_self_time += delta;
 	}
 
+	/*
+	 * Move the tasks to a local list since an update_curr() during
+	 * enqueue_task_fair() can throttle a higher cfs_rq, and it can
+	 * see the "throttled_limbo_list" being non-empty in
+	 * tg_throttle_down() if throttle_count turned 0 above.
+	 */
+	list_splice_init(&cfs_rq->throttled_limbo_list, &throttled_tasks);
+
 	/* Re-enqueue the tasks that have been throttled at this level. */
-	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+	list_for_each_entry_safe(p, tmp, &throttled_tasks, throttle_node) {
+		/*
+		 * Back to being throttled! Break out and put the remaining
+		 * tasks back onto the limbo_list to prevent running them
+		 * unnecessarily.
+		 */
+		if (cfs_rq->throttle_count)
+			break;
+
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
 	}
 
+	list_splice(&throttled_tasks, &cfs_rq->throttled_limbo_list);
+
 	/* Add cfs_rq with load or one or more already running entities to the list */
 	if (!cfs_rq_is_decayed(cfs_rq))
 		list_add_leaf_cfs_rq(cfs_rq);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up()
  2026-05-28  9:48 ` [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() K Prateek Nayak
@ 2026-05-28 22:14   ` Benjamin Segall
  0 siblings, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2026-05-28 22:14 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> An update_curr() during the enqueue of throttled task will start
> throttling the hierarchy from subsequent commit. This can lead to
> tg_throttle_down() seeing non-empty throttled_limbo_list for the cfs_rq
> attaching the task from throttled_limbo_list one by one. For example:
>
>      R
>      |
>      A
>     / \
>   *B   C
>        |
>        rq->curr
>
> *B is throttled with tasks on hte limbo list. When the tasks are
> unthrottled via tg_unthrottle_up() and entity of group B is placed onto
> A, update_curr() is called to catch up the vruntime and it may throttle
> group A causing the subsequent tg_throttle_down() to see the pending
> task's on B's limbo list.
>
>   tg_unthrottle_up()
>     /* --cfs_rq->throttle_count == 0 */
>     list_for_each_entry_safe(p, cfs_rq->throttled_limbo_list)
>       enqueue_task_fair()
>         enqueue_entity(se /* B->se */)
>           update_curr(cfs_rq /* A->gcfs_rq */)
>             account_cfs_rq_runtime(cfs_rq)
>               throttle_cfs_rq(cfs_rq /* A->gcfs_rq */ )
>                 tg_throttle_down()
>                   /* Reaches B->cfs_rq with throttle_count == 0 */
>
>                   !!! !list_empty(&cfs_rq->throttled_limbo_list)) !!!
>
> Move the tasks from throttled_limbo_list onto a local list before
> starting the unthrottle to prevent the splat described above. If the
> hierarchy is throttled again in middle of an unthrottle, put the pending
> tasks back onto the limbo list to prevent running them unnecessarily.

And for extra fun, in order to get here we need to have just finished
throttle_cfs_rq_work and its resched_curr, but not have managed to reach
schedule yet when the unthrottle hits. But yeah, that can happen.

Reviewed-By: Benjamin Segall <bsegall@google.com>


>
> Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
> ---
>  kernel/sched/fair.c | 23 +++++++++++++++++++++--
>  1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b3b3172702a9..c48eaf2d7919 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6710,6 +6710,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  	struct rq *rq = data;
>  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>  	struct task_struct *p, *tmp;
> +	LIST_HEAD(throttled_tasks);
>  
>  	/*
>  	 * If cfs_rq->curr is set, the cfs_rq might not have caught up
> @@ -6740,13 +6741,31 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  		cfs_rq->throttled_clock_self_time += delta;
>  	}
>  
> +	/*
> +	 * Move the tasks to a local list since an update_curr() during
> +	 * enqueue_task_fair() can throttle a higher cfs_rq, and it can
> +	 * see the "throttled_limbo_list" being non-empty in
> +	 * tg_throttle_down() if throttle_count turned 0 above.
> +	 */
> +	list_splice_init(&cfs_rq->throttled_limbo_list, &throttled_tasks);
> +
>  	/* Re-enqueue the tasks that have been throttled at this level. */
> -	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> +	list_for_each_entry_safe(p, tmp, &throttled_tasks, throttle_node) {
> +		/*
> +		 * Back to being throttled! Break out and put the remaining
> +		 * tasks back onto the limbo_list to prevent running them
> +		 * unnecessarily.
> +		 */
> +		if (cfs_rq->throttle_count)
> +			break;
> +
>  		list_del_init(&p->throttle_node);
>  		p->throttled = false;
> -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> +		enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
>  	}
>  
> +	list_splice(&throttled_tasks, &cfs_rq->throttled_limbo_list);
> +
>  	/* Add cfs_rq with load or one or more already running entities to the list */
>  	if (!cfs_rq_is_decayed(cfs_rq))
>  		list_add_leaf_cfs_rq(cfs_rq);

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
                   ` (3 preceding siblings ...)
  2026-05-28  9:48 ` [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() K Prateek Nayak
@ 2026-05-28  9:48 ` K Prateek Nayak
  2026-05-28 22:44   ` Benjamin Segall
  2026-06-01 13:48   ` Peter Zijlstra
  2026-05-28 11:45 ` [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy Peter Zijlstra
  2026-06-01  6:18 ` Aaron Lu
  6 siblings, 2 replies; 21+ messages in thread
From: K Prateek Nayak @ 2026-05-28  9:48 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot
  Cc: Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Aaron Lu, Josh Don, K Prateek Nayak,
	linux-kernel

From: Peter Zijlstra <peterz@infradead.org>

assign_cfs_rq_runtime() during update_curr() sets the resched indicator
and relies on check_cfs_rq_runtime() during pick_next_task() /
put_prev_entity() to throttle the hierarchy once current task is
preempted / blocks.

Per-task throttle, on the other hand, uses throttle_cfs_rq() to simply
propagate the throttle signals, and then relies on task work to
individually throttle the runnable tasks on their way out to the
userspace.

Remove check_cfs_rq_runtime() and unify throttling into
account_cfs_rq_runtime() which only sets the cfs_rq->throttled,
cfs_rq->throttle_count indicators via throttle_cfs_rq() and optionally
adds the task work to the current task (donor) it is on the throttled
hierarchy.

throttle_cfs_rq() requests for sched_cfs_bandwidth_slice() worth of
bandwidth for the current hierarchy that enable it to continue running
uninterrupted when selected. For the rest, it requests a bare minimum of
"1" to ensure some bandwidth is available and pass the
"runtime_remaining > 0" checks once selected.

For SCHED_PROXY_EXEC, a mutex holder cannot exit to userspace without
dropping it first and the mutex_unlock() ensures proxy is stopped before
the mutex handoff which preserves the current semantics for running a
throttled task until it exits to the userspace even if it acts as a
donor.

  [ prateek: rebased on tip, comments, commit message. ]

Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/fair.c | 110 ++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 55 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c48eaf2d7919..a481647f0f0f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -525,7 +525,7 @@ static int se_is_idle(struct sched_entity *se)
 #endif /* !CONFIG_FAIR_GROUP_SCHED */
 
 static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
+bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec);
 
 /**************************************************************
  * Scheduling class tree data structure manipulation methods:
@@ -6359,8 +6359,6 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq, bool protect)
 	return se;
 }
 
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
-
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
 	/*
@@ -6370,9 +6368,6 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 	if (prev->on_rq)
 		update_curr(cfs_rq);
 
-	/* throttle cfs_rqs exceeding runtime */
-	check_cfs_rq_runtime(cfs_rq);
-
 	if (prev->on_rq) {
 		update_stats_wait_start_fair(cfs_rq, prev);
 		/* Put 'current' back into the tree. */
@@ -6507,41 +6502,32 @@ static int __assign_cfs_rq_runtime(struct cfs_bandwidth *cfs_b,
 	return cfs_rq->runtime_remaining > 0;
 }
 
-/* returns 0 on failure to allocate runtime */
-static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-
-	guard(raw_spinlock)(&cfs_b->lock);
+static bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
 
-	return __assign_cfs_rq_runtime(cfs_b, cfs_rq, sched_cfs_bandwidth_slice());
-}
-
-static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
+static bool __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	/* dock delta_exec before expiring quota (as it could span periods) */
 	cfs_rq->runtime_remaining -= delta_exec;
 
 	if (likely(cfs_rq->runtime_remaining > 0))
-		return;
+		return false;
 
 	if (cfs_rq->throttled)
-		return;
+		return true;
 	/*
-	 * if we're unable to extend our runtime we resched so that the active
-	 * hierarchy can be throttled
+	 * throttle_cfs_rq() will try to extend the runtime first
+	 * before throttling the hierarchy.
 	 */
-	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr))
-		resched_curr(rq_of(cfs_rq));
+	return throttle_cfs_rq(cfs_rq);
 }
 
 static __always_inline
-void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
+bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
 {
 	if (!cfs_bandwidth_used() || !cfs_rq->runtime_enabled)
-		return;
+		return false;
 
-	__account_cfs_rq_runtime(cfs_rq, delta_exec);
+	return __account_cfs_rq_runtime(cfs_rq, delta_exec);
 }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -6829,10 +6815,24 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 
 static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
-	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
+	struct sched_entity *curr = cfs_rq->curr;
+	struct rq *rq = rq_of(cfs_rq);
 
 	scoped_guard(raw_spinlock, &cfs_b->lock) {
+		u64 target_runtime = 1;
+
+		/*
+		 * If cfs_rq->curr is still runnable, we are here from an
+		 * update_curr(). Request sysctl_sched_cfs_bandwidth_slice
+		 * worth of bandwidth to continue running.
+		 *
+		 * If the curr is not runnable, just request enough bandwidth
+		 * to be runnable next time the pick selects this cfs_rq.
+		 */
+		if (curr && curr->on_rq)
+			target_runtime = sched_cfs_bandwidth_slice();
+
 		/*
 		 * Check if We have raced with bandwidth becoming available. If
 		 * we actually throttled the timer might not unthrottle us for
@@ -6843,7 +6843,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 		 *
 		 * This will start the period timer if necessary.
 		 */
-		if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1))
+		if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, target_runtime))
 			return false;
 
 		/*
@@ -6864,6 +6864,17 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 */
 	cfs_rq->throttled = 1;
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
+
+	/*
+	 * If current hierarchy was throttled, add throttle work to the
+	 * current donor. In case of proxy-execution, the execution
+	 * context cannot exit to the userspace while holding a mutex
+	 * and the rule of throttle deferral to only throttle the
+	 * throttled context at exit to userspace is still preserved.
+	 */
+	if (curr && curr->on_rq)
+		task_throttle_setup_work(rq->donor);
+
 	return true;
 }
 
@@ -7245,7 +7256,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 	if (!cfs_bandwidth_used())
 		return;
 
-	/* an active group must be handled by the update_curr()->put() path */
+	/* an active group must be handled by the update_curr() path */
 	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
 		return;
 
@@ -7255,8 +7266,6 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 
 	/* update runtime allocation */
 	account_cfs_rq_runtime(cfs_rq, 0);
-	if (cfs_rq->runtime_remaining <= 0)
-		throttle_cfs_rq(cfs_rq);
 }
 
 static void sync_throttle(struct task_group *tg, int cpu)
@@ -7286,25 +7295,6 @@ static void sync_throttle(struct task_group *tg, int cpu)
 		cfs_rq->pelt_clock_throttled = 1;
 }
 
-/* conditionally throttle active cfs_rq's from put_prev_entity() */
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
-{
-	if (!cfs_bandwidth_used())
-		return false;
-
-	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
-
-	/*
-	 * it's possible for a throttled entity to be forced into a running
-	 * state (e.g. set_curr_task), in this case we're finished.
-	 */
-	if (cfs_rq_throttled(cfs_rq))
-		return true;
-
-	return throttle_cfs_rq(cfs_rq);
-}
-
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
 {
 	struct cfs_bandwidth *cfs_b =
@@ -7559,8 +7549,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 
 #else /* !CONFIG_CFS_BANDWIDTH: */
 
-static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
+static bool account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9893,8 +9882,15 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 		/* Might not have done put_prev_entity() */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
-
-		throttled |= check_cfs_rq_runtime(cfs_rq);
+		/*
+		 * For the current hierarchy, update_curr() above would
+		 * have set the throttle indicators if the cfs_rq has
+		 * run out of bandwidth. For others, enqueue / last
+		 * update_curr() for the cfs_rq would have ensured the
+		 * throttle indicators are set if bandwidth was not
+		 * available.
+		 */
+		throttled |= cfs_rq_throttled(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq, true);
 		if (!se)
@@ -14868,8 +14864,8 @@ static inline void task_tick_core(struct rq *rq, struct task_struct *curr) {}
  */
 static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 {
-	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &curr->se;
+	struct cfs_rq *cfs_rq;
 
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
@@ -15074,15 +15070,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
 static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 {
 	struct sched_entity *se = &p->se;
+	bool throttled = false;
 
 	for_each_sched_entity(se) {
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		set_next_entity(cfs_rq, se, first);
 		/* ensure bandwidth has been allocated on our new cfs_rq */
-		account_cfs_rq_runtime(cfs_rq, 0);
+		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
 	}
 
+	if (throttled)
+		task_throttle_setup_work(p);
+
 	__set_next_task_fair(rq, p, first);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-05-28  9:48 ` [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() K Prateek Nayak
@ 2026-05-28 22:44   ` Benjamin Segall
  2026-06-01 13:48   ` Peter Zijlstra
  1 sibling, 0 replies; 21+ messages in thread
From: Benjamin Segall @ 2026-05-28 22:44 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

K Prateek Nayak <kprateek.nayak@amd.com> writes:

> From: Peter Zijlstra <peterz@infradead.org>
>
> assign_cfs_rq_runtime() during update_curr() sets the resched indicator
> and relies on check_cfs_rq_runtime() during pick_next_task() /
> put_prev_entity() to throttle the hierarchy once current task is
> preempted / blocks.
>
> Per-task throttle, on the other hand, uses throttle_cfs_rq() to simply
> propagate the throttle signals, and then relies on task work to
> individually throttle the runnable tasks on their way out to the
> userspace.
>
> Remove check_cfs_rq_runtime() and unify throttling into
> account_cfs_rq_runtime() which only sets the cfs_rq->throttled,
> cfs_rq->throttle_count indicators via throttle_cfs_rq() and optionally
> adds the task work to the current task (donor) it is on the throttled
> hierarchy.
>
> throttle_cfs_rq() requests for sched_cfs_bandwidth_slice() worth of
> bandwidth for the current hierarchy that enable it to continue running
> uninterrupted when selected. For the rest, it requests a bare minimum of
> "1" to ensure some bandwidth is available and pass the
> "runtime_remaining > 0" checks once selected.
>
> For SCHED_PROXY_EXEC, a mutex holder cannot exit to userspace without
> dropping it first and the mutex_unlock() ensures proxy is stopped before
> the mutex handoff which preserves the current semantics for running a
> throttled task until it exits to the userspace even if it acts as a
> donor.
>
>   [ prateek: rebased on tip, comments, commit message. ]

Yeah, no need to go into schedule() anymore. Having some extra callers
of task_throttle_setup_work() instead is better.

Reviewed-By: Benjamin Segall <bsegall@google.com>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-05-28  9:48 ` [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() K Prateek Nayak
  2026-05-28 22:44   ` Benjamin Segall
@ 2026-06-01 13:48   ` Peter Zijlstra
  2026-06-02  7:01     ` K Prateek Nayak
  1 sibling, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2026-06-01 13:48 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

On Thu, May 28, 2026 at 09:48:30AM +0000, K Prateek Nayak wrote:

> @@ -9893,8 +9882,15 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>  		/* Might not have done put_prev_entity() */
>  		if (cfs_rq->curr && cfs_rq->curr->on_rq)
>  			update_curr(cfs_rq);
> -
> -		throttled |= check_cfs_rq_runtime(cfs_rq);
> +		/*
> +		 * For the current hierarchy, update_curr() above would
> +		 * have set the throttle indicators if the cfs_rq has
> +		 * run out of bandwidth. For others, enqueue / last
> +		 * update_curr() for the cfs_rq would have ensured the
> +		 * throttle indicators are set if bandwidth was not
> +		 * available.
> +		 */
> +		throttled |= cfs_rq_throttled(cfs_rq);
>  
>  		se = pick_next_entity(rq, cfs_rq, true);
>  		if (!se)

> @@ -15074,15 +15070,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
>  static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>  {
>  	struct sched_entity *se = &p->se;
> +	bool throttled = false;
>  
>  	for_each_sched_entity(se) {
>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>  
>  		set_next_entity(cfs_rq, se, first);
>  		/* ensure bandwidth has been allocated on our new cfs_rq */
> -		account_cfs_rq_runtime(cfs_rq, 0);
> +		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
>  	}
>  
> +	if (throttled)
> +		task_throttle_setup_work(p);
> +
>  	__set_next_task_fair(rq, p, first);
>  }

(noticed while trying to rebase flat on top)

Why do we have both? Isn't just set_next_task_fair(.first=true)
sufficient?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-06-01 13:48   ` Peter Zijlstra
@ 2026-06-02  7:01     ` K Prateek Nayak
  2026-06-02  8:32       ` Peter Zijlstra
  0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2026-06-02  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

Hello Peter,

On 6/1/2026 7:18 PM, Peter Zijlstra wrote:
> On Thu, May 28, 2026 at 09:48:30AM +0000, K Prateek Nayak wrote:
> 
>> @@ -9893,8 +9882,15 @@ static struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
>>  		/* Might not have done put_prev_entity() */
>>  		if (cfs_rq->curr && cfs_rq->curr->on_rq)
>>  			update_curr(cfs_rq);
>> -
>> -		throttled |= check_cfs_rq_runtime(cfs_rq);
>> +		/*
>> +		 * For the current hierarchy, update_curr() above would
>> +		 * have set the throttle indicators if the cfs_rq has
>> +		 * run out of bandwidth. For others, enqueue / last
>> +		 * update_curr() for the cfs_rq would have ensured the
>> +		 * throttle indicators are set if bandwidth was not
>> +		 * available.
>> +		 */
>> +		throttled |= cfs_rq_throttled(cfs_rq);
>>  
>>  		se = pick_next_entity(rq, cfs_rq, true);
>>  		if (!se)
> 
>> @@ -15074,15 +15070,19 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
>>  static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
>>  {
>>  	struct sched_entity *se = &p->se;
>> +	bool throttled = false;
>>  
>>  	for_each_sched_entity(se) {
>>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>  
>>  		set_next_entity(cfs_rq, se, first);
>>  		/* ensure bandwidth has been allocated on our new cfs_rq */
>> -		account_cfs_rq_runtime(cfs_rq, 0);
>> +		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
>>  	}
>>  
>> +	if (throttled)
>> +		task_throttle_setup_work(p);
>> +
>>  	__set_next_task_fair(rq, p, first);
>>  }
> 
> (noticed while trying to rebase flat on top)
> 
> Why do we have both? Isn't just set_next_task_fair(.first=true)
> sufficient?

I misread this bit and only called account_cfs_rq_runtime() for !first in my
v2 [1] but even the following changes on top of my v2 [1] yields similar
results in my testing (slightly better on performance) so feel free to squash
this bit into the last patch:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ce5cf494b934..fa8c0b1a1cf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9892,15 +9892,6 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 		/* Might not have done put_prev_entity() */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
-		/*
-		 * For the current hierarchy, update_curr() above would
-		 * have set the throttle indicators if the cfs_rq has
-		 * run out of bandwidth. For others, enqueue / last
-		 * update_curr() for the cfs_rq would have ensured the
-		 * throttle indicators are set if bandwidth was not
-		 * available.
-		 */
-		throttled |= cfs_rq_throttled(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq, true);
 		if (!se)
@@ -15012,12 +15003,8 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 			break;
 
 		set_next_entity(cfs_rq, se, first);
-		/*
-		 * Ensure bandwidth has been allocated on our new cfs_rq
-		 * if we've reached here for reasons other than pick.
-		 */
-		if (!first)
-			throttled |= account_cfs_rq_runtime(cfs_rq, 0);
+		/* ensure bandwidth has been allocated on our new cfs_rq */
+		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
 	}
 
 	if (throttled)
---

My mind is taking a while to grasp the ->pick_next_task() removal.

[1] https://lore.kernel.org/lkml/20260602050005.11160-1-kprateek.nayak@amd.com/

-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-06-02  7:01     ` K Prateek Nayak
@ 2026-06-02  8:32       ` Peter Zijlstra
  2026-06-02  8:57         ` K Prateek Nayak
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2026-06-02  8:32 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

On Tue, Jun 02, 2026 at 12:31:36PM +0530, K Prateek Nayak wrote:

> My mind is taking a while to grasp the ->pick_next_task() removal.
> 
> [1] https://lore.kernel.org/lkml/20260602050005.11160-1-kprateek.nayak@amd.com/

Yes, I'm familiar with that struggle. If you can manage to write a
comment that clarifies it somewhat that would be awesome.

I've tried, but every time I read it back after a few days, I'm just
left more confused that I was at the beginning :-(

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime()
  2026-06-02  8:32       ` Peter Zijlstra
@ 2026-06-02  8:57         ` K Prateek Nayak
  0 siblings, 0 replies; 21+ messages in thread
From: K Prateek Nayak @ 2026-06-02  8:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

Hello Peter,

On 6/2/2026 2:02 PM, Peter Zijlstra wrote:
> On Tue, Jun 02, 2026 at 12:31:36PM +0530, K Prateek Nayak wrote:
> 
>> My mind is taking a while to grasp the ->pick_next_task() removal.
>>
>> [1] https://lore.kernel.org/lkml/20260602050005.11160-1-kprateek.nayak@amd.com/
> 
> Yes, I'm familiar with that struggle. If you can manage to write a
> comment that clarifies it somewhat that would be awesome.
> 
> I've tried, but every time I read it back after a few days, I'm just
> left more confused that I was at the beginning :-(

Here is an attempt on top of v2.1:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa8c0b1a1cf1..9a14d75ff671 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9889,7 +9889,13 @@ struct task_struct *pick_task_fair(struct rq *rq, struct rq_flags *rf)
 	throttled = false;
 
 	do {
-		/* Might not have done put_prev_entity() */
+		/*
+		 * Might not have done put_prev_entity().
+		 * The cfs_rq gets throttled here or via
+		 * pick_task() -> set_next_task() where
+		 * sched_cfs_bandwidth_slice() worth of
+		 * runtime is requested for cfs_rq->curr.
+		 */
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
 
@@ -15003,7 +15009,12 @@ static void set_next_task_fair(struct rq *rq, struct task_struct *p, bool first)
 			break;
 
 		set_next_entity(cfs_rq, se, first);
-		/* ensure bandwidth has been allocated on our new cfs_rq */
+		/*
+		 * Ensure bandwidth has been allocated on our new cfs_rq.
+		 * If this hierarchy was freshly picked, update_curr()
+		 * was skipped for this cfs_rq. Request for the correct
+		 * bandwidth slice now that cfs_rq->curr is updated.
+		 */
 		throttled |= account_cfs_rq_runtime(cfs_rq, 0);
 	}
 
-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
                   ` (4 preceding siblings ...)
  2026-05-28  9:48 ` [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() K Prateek Nayak
@ 2026-05-28 11:45 ` Peter Zijlstra
  2026-06-01  6:18 ` Aaron Lu
  6 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2026-05-28 11:45 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Aaron Lu, Josh Don, linux-kernel

On Thu, May 28, 2026 at 09:48:25AM +0000, K Prateek Nayak wrote:
> Patches also cleanly apply on top of Zecheng's optimization from [3]
> when applied on top of the same base. Peter, there is only one trivial
> conflict with sched/flat, and Zecheng's optimization is generally
> beneficial for deep hierarchies even with flattened pick.

Ah, yes, I had meant to go look at how bad that conflict was, but just
hadn't gotten around to it.

Let me go have a play.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy
  2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
                   ` (5 preceding siblings ...)
  2026-05-28 11:45 ` [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy Peter Zijlstra
@ 2026-06-01  6:18 ` Aaron Lu
  6 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2026-06-01  6:18 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Ingo Molnar, Peter Zijlstra, Juri Lelli, Vincent Guittot,
	Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
	Valentin Schneider, Josh Don, linux-kernel

Hi Prateek,

On Thu, May 28, 2026 at 09:48:25AM +0000, K Prateek Nayak wrote:
> The current hierarchy is always throttled in __schedule() during the
> pick when update_curr() detects a cfs_rq running out of the bandwidth
> and issues a resched.
> 
> This was necessary prior to per-task throttling where the entire
> throttled hierarchy was dequeued at the point of first throttle during
> the pick but with per-task throttling, tasks continue to run as usual
> until they exit to userspace and dequeue themselves one-by-one until the
> hierarchy is deemed fully throttled and the PELT is frozen.
> 
> throttle_cfs_rq() is now simply a propagator of throttle indicators and
> nothing more.
> 
> Unify the throttling for current hierarchy under
> account_cfs_rq_runtime() which is responsible for the time accounting.
> If the bandwidth runs out, account_cfs_rq_runtime() will request for
> sched_cfs_bandwidth_slice() and mark the hierarchy as throttled if it
> fails to grab bandwidth.
> 
> throttle_cfs_rq() will do a task_throttle_setup_work() if it finds the
> current task to be on a throttled hierarchy and the task will naturally
> dequeue itself when it exits to the userspace without needing an
> explicit resched.
> 
> First four patches are cleanups and preparation for the final bit that
> switches over to using account_cfs_rq_runtime() for throttling which was
> provided by Peter in [1].
> 
> Following are the results of running hackbench running 3 levels deep
> with the setup from "Testing" section on [2] when compared to
> tip:sched/core:
> 
>   kernel        :  tip        tip + series
> 
>   Min           : 207.33        202.20
>   Max           : 210.20        222.47
>   Median        : 207.83        218.33
>   AMean         : 208.29        215.36
>   GMean         : 208.29        215.25
>   HMean         : 208.29        215.13
>   AMean Stddev  : 1.02          7.37
>   AMean CoefVar : 0.49 pct      3.42 pct
> 
>   All numbers are in seconds.
> 
> There is a slight boot to boot variation for this benchmark but the
> utilization numbers in top is more or less similar between the two.
> Additional testing and feedback is always appreciated as usual :-)

I tested hackbench and netperf with quota set on a 2 sockets Intel EMR
and the result is in noise range.

Hackbench(in seconds, less is better)
base:  176.114420±2
head:  176.214394±3

Netperf(throughput, higher is better)
base:  14071, min/max: 13376/15261
head:  14769, min/max: 14095/15588

Feel free to add my tested-by tag after the clock warning is fixed in
patch 3.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2026-06-02  8:58 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-28  9:48 [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy K Prateek Nayak
2026-05-28  9:48 ` [PATCH 1/5] sched/fair: Convert cfs bandwidth throttling to use guards K Prateek Nayak
2026-05-28 21:46   ` Benjamin Segall
2026-05-28  9:48 ` [PATCH 2/5] sched/fair: Use throttled_csd_list for local unthrottle K Prateek Nayak
2026-05-28 21:53   ` Benjamin Segall
2026-05-28  9:48 ` [PATCH 3/5] sched/fair: Call update_curr() before unthrottling the hierarchy K Prateek Nayak
2026-05-28 22:03   ` Benjamin Segall
2026-06-01  3:52   ` Aaron Lu
2026-06-01  5:50     ` K Prateek Nayak
2026-06-01 11:27       ` Peter Zijlstra
2026-06-02  6:33         ` K Prateek Nayak
2026-05-28  9:48 ` [PATCH 4/5] sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() K Prateek Nayak
2026-05-28 22:14   ` Benjamin Segall
2026-05-28  9:48 ` [PATCH 5/5] sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() K Prateek Nayak
2026-05-28 22:44   ` Benjamin Segall
2026-06-01 13:48   ` Peter Zijlstra
2026-06-02  7:01     ` K Prateek Nayak
2026-06-02  8:32       ` Peter Zijlstra
2026-06-02  8:57         ` K Prateek Nayak
2026-05-28 11:45 ` [PATCH 0/5] sched/fair: Allow account_cfs_rq_runtime() to throttle current hierarchy Peter Zijlstra
2026-06-01  6:18 ` Aaron Lu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.