* [PATCH v2 0/5] Defer throttle when task exits to user
@ 2025-06-18 8:19 Aaron Lu
2025-06-18 8:19 ` [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
` (7 more replies)
0 siblings, 8 replies; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
v2:
- Re-org the patchset to use a single patch to implement throttle
related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
decide if throttle task work is needed instead of checking
throttled_hierarchy(), suggested by Peter;
- Simplify throttle_count check in tg_throttle_down() and
tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueuing a throttled task to
a throttled cfs_rq, suggested by Peter;
- Address the missing of detach_task_cfs_rq() for throttled tasks that
get migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
cond_resched*() is going away, pointed out by Peter.
I hope I didn't miss any comments or suggestions for v1; if I did,
please kindly let me know, thanks!
Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
of double_rq_clock_clear_update()")
cover letter of v1:
This is a continuous work based on Valentin Schneider's posting here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"
Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/
To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based, i.e. when a cfs_rq
is throttled, mark its throttled status but do not remove it from the cpu's
rq. Instead, for tasks that belong to this cfs_rq, add a task work to them
when they get picked, so that when they return to user, they can be
dequeued there. In this way, throttled tasks will not hold any kernel
resources. When the cfs_rq gets unthrottled, enqueue back those throttled
tasks.
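Roughly, the flow looks like this (condensed and simplified from the
patches below, not the exact code):

	/* pick time: picking from a throttled hierarchy queues the task work */
	pick_task_fair():
		throttled |= check_cfs_rq_runtime(cfs_rq);
		...
		if (unlikely(throttled))
			task_throttle_setup_work(p);	/* task_work_add(..., TWA_RESUME) */

	/* return to user: dequeue the task and park it on the limbo list */
	throttle_cfs_rq_work():
		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		p->throttled = true;

	/* unthrottle: put the parked tasks back */
	tg_unthrottle_up():
		list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node)
			enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);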
This new throttle model has consequences, e.g. for a cfs_rq that has 3
tasks attached, when 2 tasks are throttled on their return-to-user path
while one task is still running in kernel mode, this cfs_rq is in a
partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?
For the pelt clock, I chose to keep the current behavior and freeze it at
the cfs_rq's throttle time. The assumption is that tasks running in kernel
mode should not last too long; freezing the cfs_rq's pelt clock keeps its
load and its corresponding sched_entity's weight. Hopefully, this can
result in a stable situation for the remaining running tasks to quickly
finish their jobs in kernel mode.
For throttle time accounting, following RFC v2's feedback, rework throttle
time accounting for a cfs_rq as follows (the accounting hook from patch 4
is excerpted after the list below):
- start accounting when the first task gets throttled in its
hierarchy;
- stop accounting on unthrottle.
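For reference, the hook that starts these clocks, excerpted from patch 4;
it is called from the dequeue path when a task is being dequeued for
throttling:

	static void record_throttle_clock(struct cfs_rq *cfs_rq)
	{
		struct rq *rq = rq_of(cfs_rq);

		if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
			cfs_rq->throttled_clock = rq_clock(rq);

		if (!cfs_rq->throttled_clock_self)
			cfs_rq->throttled_clock_self = rq_clock(rq);
	}

The unthrottle path consumes and clears these clocks as before.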
There was also a concern in RFC v1 about the increased duration of
(un)throttle operations. I've done some tests: with a 2000 cgroups / 20K
runnable tasks setup on a 2-socket/384-CPU AMD server, the longest duration
of distribute_cfs_runtime() is in the 2ms-4ms range. For details, please
see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task work
setup" from throttle time to pick time, it's not an issue anymore.
Aaron Lu (2):
sched/fair: Task based throttle time accounting
sched/fair: Get rid of throttled_lb_pair()
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Implement throttle task work and related helpers
sched/fair: Switch to task based throttle model
include/linux/sched.h | 5 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 445 +++++++++++++++++++++++-------------------
kernel/sched/sched.h | 4 +
4 files changed, 253 insertions(+), 204 deletions(-)
--
2.39.5
* [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
@ 2025-06-18 8:19 ` Aaron Lu
2025-06-18 8:19 ` [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
` (6 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
From: Valentin Schneider <vschneid@redhat.com>
Add related data structures for this new throttle functionality.
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 13 +++++++++++++
kernel/sched/sched.h | 3 +++
4 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eec6b225e9d14..95f2453b17222 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -874,6 +874,11 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct callback_head sched_throttle_work;
+ struct list_head throttle_node;
+ bool throttled;
+#endif
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a03c3c1d7f50a..afb4da893c02f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_throttle_work(p);
+#endif
#endif
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b17d3da034af..f7b8597bc95ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5757,6 +5757,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+ init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+ /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+ INIT_LIST_HEAD(&p->throttle_node);
+}
+
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -6488,6 +6500,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+ INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
}
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c323d015486cf..fa78c0c3efe39 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -728,6 +728,7 @@ struct cfs_rq {
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head throttled_limbo_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
@@ -2627,6 +2628,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
extern void init_dl_entity(struct sched_dl_entity *dl_se);
+extern void init_cfs_throttle_work(struct task_struct *p);
+
#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
#define RATIO_SHIFT 8
--
2.39.5
* [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-06-18 8:19 ` [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-06-18 8:19 ` Aaron Lu
2025-06-18 9:03 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 3/5] sched/fair: Switch to task based throttle model Aaron Lu
` (5 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
From: Valentin Schneider <vschneid@redhat.com>
Implement throttle_cfs_rq_work() task work which gets executed on task's
ret2user path where the task is dequeued and marked as throttled.
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f7b8597bc95ac..8226120b8771a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5757,8 +5757,51 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static inline bool task_is_throttled(struct task_struct *p)
+{
+ return p->throttled;
+}
+
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static void throttle_cfs_rq_work(struct callback_head *work)
{
+ struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+
+ WARN_ON_ONCE(p != current);
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+
+ /*
+ * If task is exiting, then there won't be a return to userspace, so we
+ * don't have to bother with any of this.
+ */
+ if ((p->flags & PF_EXITING))
+ return;
+
+ scoped_guard(task_rq_lock, p) {
+ se = &p->se;
+ cfs_rq = cfs_rq_of(se);
+
+ /* Raced, forget */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
+ /*
+ * If not in limbo, then either replenish has happened or this
+ * task got migrated out of the throttled cfs_rq, move along.
+ */
+ if (!cfs_rq->throttle_count)
+ return;
+ rq = scope.rq;
+ update_rq_clock(rq);
+ WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
+ dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ p->throttled = true;
+ resched_curr(rq);
+ }
}
void init_cfs_throttle_work(struct task_struct *p)
@@ -5798,6 +5841,26 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
return 0;
}
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+ return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+ if (task_has_throttle_work(p))
+ return;
+
+ /*
+ * Kthreads and exiting tasks don't return to userspace, so adding the
+ * work is pointless
+ */
+ if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+ return;
+
+ task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -6668,6 +6731,8 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
--
2.39.5
* [PATCH v2 3/5] sched/fair: Switch to task based throttle model
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-06-18 8:19 ` [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-06-18 8:19 ` [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
@ 2025-06-18 8:19 ` Aaron Lu
2025-06-18 9:55 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 4/5] sched/fair: Task based throttle time accounting Aaron Lu
` (4 subsequent siblings)
7 siblings, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
From: Valentin Schneider <vschneid@redhat.com>
In current throttle model, when a cfs_rq is throttled, its entity will
be dequeued from cpu's rq, making tasks attached to it not able to run,
thus achieving the throttle target.
This has a drawback though: assume a task is a reader of percpu_rwsem
and is waiting. When it gets woken, it can not run till its task group's
next period comes, which can be a relatively long time. Waiting writer
will have to wait longer due to this and it also makes further reader
build up and eventually trigger task hung.
To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued there. In this way, tasks throttled will
not hold any kernel resources. And on unthrottle, enqueue back those
tasks so they can continue to run.
Throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
task can be enqueued to a throttled cfs_rq and gets to run, to not break
the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add it to
leaf cfs_rq list when it has its first entity enqueued and delete it
from leaf cfs_rq list when it has no tasks enqueued.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
1 file changed, 153 insertions(+), 172 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8226120b8771a..59b372ffae18c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq);
- if (!throttled_hierarchy(cfs_rq)) {
- list_add_leaf_cfs_rq(cfs_rq);
- } else {
+ list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
struct rq *rq = rq_of(cfs_rq);
if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
}
+#endif
}
}
@@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable--;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable++;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & DEQUEUE_DELAYED)
finish_delayed_dequeue_entity(se);
- if (cfs_rq->nr_queued == 0)
+ if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
+ if (throttled_hierarchy(cfs_rq))
+ list_del_leaf_cfs_rq(cfs_rq);
+ }
return true;
}
@@ -5799,6 +5797,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ /*
+ * Must not set throttled before dequeue or dequeue will
+ * mistakenly regard this task as an already throttled one.
+ */
p->throttled = true;
resched_curr(rq);
}
@@ -5812,32 +5814,116 @@ void init_cfs_throttle_work(struct task_struct *p)
INIT_LIST_HEAD(&p->throttle_node);
}
+/*
+ * Task is throttled and someone wants to dequeue it again:
+ * it could be sched/core when core needs to do things like
+ * task affinity change, task group change, task sched class
+ * change etc. and in these cases, DEQUEUE_SLEEP is not set;
+ * or the task is blocked after throttled due to freezer etc.
+ * and in these cases, DEQUEUE_SLEEP is set.
+ */
+static void detach_task_cfs_rq(struct task_struct *p);
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+ WARN_ON_ONCE(p->se.on_rq);
+ list_del_init(&p->throttle_node);
+
+ /* task blocked after throttled */
+ if (flags & DEQUEUE_SLEEP) {
+ p->throttled = false;
+ return;
+ }
+
+ /*
+ * task is migrating off its old cfs_rq, detach
+ * the task's load from its old cfs_rq.
+ */
+ if (task_on_rq_migrating(p))
+ detach_task_cfs_rq(p);
+}
+
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+ /*
+ * If the throttled task is enqueued to a throttled cfs_rq,
+ * take the fast path by directly putting the task on the target
+ * cfs_rq's limbo list, except when p is current because
+ * the following race can cause p's group_node left in rq's
+ * cfs_tasks list when it's throttled:
+ *
+ * cpuX cpuY
+ * taskA ret2user
+ * throttle_cfs_rq_work() sched_move_task(taskA)
+ * task_rq_lock acquired
+ * dequeue_task_fair(taskA)
+ * task_rq_lock released
+ * task_rq_lock acquired
+ * task_current_donor(taskA) == true
+ * task_on_rq_queued(taskA) == true
+ * dequeue_task(taskA)
+ * put_prev_task(taskA)
+ * sched_change_group()
+ * enqueue_task(taskA) -> taskA's new cfs_rq
+ * is throttled, go
+ * fast path and skip
+ * actual enqueue
+ * set_next_task(taskA)
+ * __set_next_task_fair(taskA)
+ * list_move(&se->group_node, &rq->cfs_tasks); // bug
+ * schedule()
+ *
+ * And in the above race case, the task's current cfs_rq is in the same
+ * rq as its previous cfs_rq because sched_move_task() doesn't migrate the
+ * task, so we can use its current cfs_rq to derive rq and test if the
+ * task is current.
+ */
+ if (throttled_hierarchy(cfs_rq) &&
+ !task_current_donor(rq_of(cfs_rq), p)) {
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ return true;
+ }
+
+ /* we can't take the fast path, do an actual enqueue */
+ p->throttled = false;
+ return false;
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct task_struct *p, *tmp;
- cfs_rq->throttle_count--;
- if (!cfs_rq->throttle_count) {
- cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
- cfs_rq->throttled_clock_pelt;
+ if (--cfs_rq->throttle_count)
+ return 0;
- /* Add cfs_rq with load or one or more already running entities to the list */
- if (!cfs_rq_is_decayed(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
- if (cfs_rq->throttled_clock_self) {
- u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+ if (cfs_rq->throttled_clock_self) {
+ u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
- cfs_rq->throttled_clock_self = 0;
+ cfs_rq->throttled_clock_self = 0;
- if (WARN_ON_ONCE((s64)delta < 0))
- delta = 0;
+ if (WARN_ON_ONCE((s64)delta < 0))
+ delta = 0;
- cfs_rq->throttled_clock_self_time += delta;
- }
+ cfs_rq->throttled_clock_self_time += delta;
}
+ /* Re-enqueue the tasks that have been throttled at this level. */
+ list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+ list_del_init(&p->throttle_node);
+ enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+ }
+
+ /* Add cfs_rq with load or one or more already running entities to the list */
+ if (!cfs_rq_is_decayed(cfs_rq))
+ list_add_leaf_cfs_rq(cfs_rq);
+
return 0;
}
@@ -5866,17 +5952,19 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ if (cfs_rq->throttle_count++)
+ return 0;
+
/* group is entering throttled state, stop time */
- if (!cfs_rq->throttle_count) {
- cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- list_del_leaf_cfs_rq(cfs_rq);
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- WARN_ON_ONCE(cfs_rq->throttled_clock_self);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock_self = rq_clock(rq);
- }
- cfs_rq->throttle_count++;
+ WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+ if (cfs_rq->nr_queued)
+ cfs_rq->throttled_clock_self = rq_clock(rq);
+ else
+ list_del_leaf_cfs_rq(cfs_rq);
+ WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
@@ -5884,9 +5972,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta, dequeue = 1;
- long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ int dequeue = 1;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5909,72 +5995,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- int flags;
-
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- /*
- * Abuse SPECIAL to avoid delayed dequeue in this instance.
- * This avoids teaching dequeue_entities() about throttled
- * entities and keeps things relatively simple.
- */
- flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
- if (se->sched_delayed)
- flags |= DEQUEUE_DELAYED;
- dequeue_entity(qcfs_rq, se, flags);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
-
- if (qcfs_rq->load.weight) {
- /* Avoid re-evaluating load for this entity: */
- se = parent_entity(se);
- break;
- }
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- update_load_avg(qcfs_rq, se, 0);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
- }
-
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, queued_delta);
-
- /* Stop the fair server if throttling resulted in no runnable tasks */
- if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
- dl_server_stop(&rq->fair_server);
-done:
/*
* Note: distribution will already see us throttled via the
* throttled-list. rq->lock protects completion.
@@ -5990,9 +6015,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta;
- long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+
+ /*
+ * It's possible we are called with !runtime_remaining due to things
+ * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+ * unthrottled us with a positive runtime_remaining but other still
+ * running entities consumed those runtime before we reached here.
+ *
+ * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+ * because any enqueue in tg_unthrottle_up() will immediately trigger a
+ * throttle, which is not supposed to happen on unthrottle path.
+ */
+ if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+ return;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6022,62 +6058,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
break;
}
- goto unthrottle_throttle;
- }
-
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- /* Handle any unfinished DELAY_DEQUEUE business first. */
- if (se->sched_delayed) {
- int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
- dequeue_entity(qcfs_rq, se, flags);
- } else if (se->on_rq)
- break;
- enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
}
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- update_load_avg(qcfs_rq, se, UPDATE_TG);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
-
- /* Start the fair server if un-throttling resulted in new runnable tasks */
- if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
- dl_server_start(&rq->fair_server);
-
- /* At this point se is NULL and we are at root level*/
- add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
assert_list_leaf_cfs_rq(rq);
/* Determine whether we need to wake up potentially idle CPU: */
@@ -6733,6 +6715,8 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
@@ -6925,6 +6909,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;
+ if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+ return;
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6977,10 +6964,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
-
flags = ENQUEUE_WAKEUP;
}
@@ -7002,10 +6985,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
}
if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7035,7 +7014,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
check_update_overutilized_status(rq);
-enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
@@ -7091,10 +7069,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
-
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
slice = cfs_rq_min_slice(cfs_rq);
@@ -7131,10 +7105,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
}
sub_nr_running(rq, h_nr_queued);
@@ -7171,6 +7141,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
+ if (unlikely(task_is_throttled(p))) {
+ dequeue_throttled_task(p, flags);
+ return true;
+ }
+
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
@@ -8836,19 +8811,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ struct task_struct *p;
+ bool throttled;
again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;
+ throttled = false;
+
do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
+ throttled |= check_cfs_rq_runtime(cfs_rq);
se = pick_next_entity(rq, cfs_rq);
if (!se)
@@ -8856,7 +8834,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
- return task_of(se);
+ p = task_of(se);
+ if (unlikely(throttled))
+ task_throttle_setup_work(p);
+ return p;
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
--
2.39.5
* [PATCH v2 4/5] sched/fair: Task based throttle time accounting
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (2 preceding siblings ...)
2025-06-18 8:19 ` [PATCH v2 3/5] sched/fair: Switch to task based throttle model Aaron Lu
@ 2025-06-18 8:19 ` Aaron Lu
2025-06-18 8:19 ` [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
` (3 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
With the task based throttle model, the previous way of checking cfs_rq's
nr_queued to decide if throttled time should be accounted doesn't work as
expected, e.g. when a cfs_rq which has a single task is throttled, that
task could later block in kernel mode instead of being dequeued to the
limbo list, and accounting this period as throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 41 +++++++++++++++++++++++------------------
kernel/sched/sched.h | 1 +
2 files changed, 24 insertions(+), 18 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59b372ffae18c..aab3ce4073582 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5292,16 +5292,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq);
list_add_leaf_cfs_rq(cfs_rq);
-#ifdef CONFIG_CFS_BANDWIDTH
- if (throttled_hierarchy(cfs_rq)) {
- struct rq *rq = rq_of(cfs_rq);
-
- if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
- cfs_rq->throttled_clock = rq_clock(rq);
- if (!cfs_rq->throttled_clock_self)
- cfs_rq->throttled_clock_self = rq_clock(rq);
- }
-#endif
}
}
@@ -5387,7 +5377,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* DELAY_DEQUEUE relies on spurious wakeups, special task
* states must not suffer spurious wakeups, excempt them.
*/
- if (flags & DEQUEUE_SPECIAL)
+ if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
delay = false;
WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5795,7 +5785,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
rq = scope.rq;
update_rq_clock(rq);
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
- dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+ dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
/*
* Must not set throttled before dequeue or dequeue will
@@ -5947,6 +5937,17 @@ static inline void task_throttle_setup_work(struct task_struct *p)
task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}
+static void record_throttle_clock(struct cfs_rq *cfs_rq)
+{
+ struct rq *rq = rq_of(cfs_rq);
+
+ if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+ cfs_rq->throttled_clock = rq_clock(rq);
+
+ if (!cfs_rq->throttled_clock_self)
+ cfs_rq->throttled_clock_self = rq_clock(rq);
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -5958,12 +5959,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
/* group is entering throttled state, stop time */
cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- WARN_ON_ONCE(cfs_rq->throttled_clock_self);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock_self = rq_clock(rq);
- else
+ if (!cfs_rq->nr_queued)
list_del_leaf_cfs_rq(cfs_rq);
+ WARN_ON_ONCE(cfs_rq->throttled_clock_self);
WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
@@ -6006,8 +6005,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
*/
cfs_rq->throttled = 1;
WARN_ON_ONCE(cfs_rq->throttled_clock);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock = rq_clock(rq);
return true;
}
@@ -6717,6 +6714,7 @@ static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
static void dequeue_throttled_task(struct task_struct *p, int flags) {}
static bool enqueue_throttled_task(struct task_struct *p) { return false; }
+static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
@@ -7036,6 +7034,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
bool task_sleep = flags & DEQUEUE_SLEEP;
bool task_delayed = flags & DEQUEUE_DELAYED;
+ bool task_throttled = flags & DEQUEUE_THROTTLE;
struct task_struct *p = NULL;
int h_nr_idle = 0;
int h_nr_queued = 0;
@@ -7069,6 +7068,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
+ if (throttled_hierarchy(cfs_rq) && task_throttled)
+ record_throttle_clock(cfs_rq);
+
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
slice = cfs_rq_min_slice(cfs_rq);
@@ -7105,6 +7107,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
+
+ if (throttled_hierarchy(cfs_rq) && task_throttled)
+ record_throttle_clock(cfs_rq);
}
sub_nr_running(rq, h_nr_queued);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fa78c0c3efe39..f2a07537d3c12 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2312,6 +2312,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SPECIAL 0x10
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE 0x800
#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
--
2.39.5
* [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair()
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (3 preceding siblings ...)
2025-06-18 8:19 ` [PATCH v2 4/5] sched/fair: Task based throttle time accounting Aaron Lu
@ 2025-06-18 8:19 ` Aaron Lu
2025-07-01 8:31 ` [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (2 subsequent siblings)
7 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 8:19 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Now that throttled tasks are dequeued and cannot stay on the rq's cfs_tasks
list, there is no need to take special care of these throttled tasks in
load balance anymore.
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 33 +++------------------------------
1 file changed, 3 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aab3ce4073582..d869c8b51c5a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5728,23 +5728,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
- int src_cpu, int dest_cpu)
-{
- struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
- src_cfs_rq = tg->cfs_rq[src_cpu];
- dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
- return throttled_hierarchy(src_cfs_rq) ||
- throttled_hierarchy(dest_cfs_rq);
-}
-
static inline bool task_is_throttled(struct task_struct *p)
{
return p->throttled;
@@ -6726,12 +6709,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return 0;
}
-static inline int throttled_lb_pair(struct task_group *tg,
- int src_cpu, int dest_cpu)
-{
- return 0;
-}
-
#ifdef CONFIG_FAIR_GROUP_SCHED
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9369,17 +9346,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) throttled_lb_pair, or
- * 3) cannot be migrated to this CPU due to cpus_ptr, or
- * 4) running (obviously), or
- * 5) are cache-hot on their current CPU.
+ * 2) cannot be migrated to this CPU due to cpus_ptr, or
+ * 3) running (obviously), or
+ * 4) are cache-hot on their current CPU.
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
- if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
- return 0;
-
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
--
2.39.5
* Re: [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers
2025-06-18 8:19 ` [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
@ 2025-06-18 9:03 ` Chengming Zhou
0 siblings, 0 replies; 21+ messages in thread
From: Chengming Zhou @ 2025-06-18 9:03 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
On 2025/6/18 16:19, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>
> Implement throttle_cfs_rq_work() task work which gets executed on task's
> ret2user path where the task is dequeued and marked as throttled.
>
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
LGTM,
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Thanks!
> ---
> kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 65 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f7b8597bc95ac..8226120b8771a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5757,8 +5757,51 @@ static inline int throttled_lb_pair(struct task_group *tg,
> throttled_hierarchy(dest_cfs_rq);
> }
>
> +static inline bool task_is_throttled(struct task_struct *p)
> +{
> + return p->throttled;
> +}
> +
> +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> static void throttle_cfs_rq_work(struct callback_head *work)
> {
> + struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> + struct sched_entity *se;
> + struct cfs_rq *cfs_rq;
> + struct rq *rq;
> +
> + WARN_ON_ONCE(p != current);
> + p->sched_throttle_work.next = &p->sched_throttle_work;
> +
> + /*
> + * If task is exiting, then there won't be a return to userspace, so we
> + * don't have to bother with any of this.
> + */
> + if ((p->flags & PF_EXITING))
> + return;
> +
> + scoped_guard(task_rq_lock, p) {
> + se = &p->se;
> + cfs_rq = cfs_rq_of(se);
> +
> + /* Raced, forget */
> + if (p->sched_class != &fair_sched_class)
> + return;
> +
> + /*
> + * If not in limbo, then either replenish has happened or this
> + * task got migrated out of the throttled cfs_rq, move along.
> + */
> + if (!cfs_rq->throttle_count)
> + return;
> + rq = scope.rq;
> + update_rq_clock(rq);
> + WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
> + dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> + p->throttled = true;
> + resched_curr(rq);
> + }
> }
>
> void init_cfs_throttle_work(struct task_struct *p)
> @@ -5798,6 +5841,26 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> return 0;
> }
>
> +static inline bool task_has_throttle_work(struct task_struct *p)
> +{
> + return p->sched_throttle_work.next != &p->sched_throttle_work;
> +}
> +
> +static inline void task_throttle_setup_work(struct task_struct *p)
> +{
> + if (task_has_throttle_work(p))
> + return;
> +
> + /*
> + * Kthreads and exiting tasks don't return to userspace, so adding the
> + * work is pointless
> + */
> + if ((p->flags & (PF_EXITING | PF_KTHREAD)))
> + return;
> +
> + task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> +}
> +
> static int tg_throttle_down(struct task_group *tg, void *data)
> {
> struct rq *rq = data;
> @@ -6668,6 +6731,8 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> static inline void sync_throttle(struct task_group *tg, int cpu) {}
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void task_throttle_setup_work(struct task_struct *p) {}
> +static bool task_is_throttled(struct task_struct *p) { return false; }
>
> static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> {
* Re: [PATCH v2 3/5] sched/fair: Switch to task based throttle model
2025-06-18 8:19 ` [PATCH v2 3/5] sched/fair: Switch to task based throttle model Aaron Lu
@ 2025-06-18 9:55 ` Chengming Zhou
2025-06-18 11:19 ` Aaron Lu
0 siblings, 1 reply; 21+ messages in thread
From: Chengming Zhou @ 2025-06-18 9:55 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
On 2025/6/18 16:19, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>
> In current throttle model, when a cfs_rq is throttled, its entity will
> be dequeued from cpu's rq, making tasks attached to it not able to run,
> thus achieving the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it can not run till its task group's
> next period comes, which can be a relatively long time. Waiting writer
> will have to wait longer due to this and it also makes further reader
> build up and eventually trigger task hung.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, record its throttled status but do not remove
> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> they get picked, add a task work to them so that when they return
> to user, they can be dequeued there. In this way, tasks throttled will
> not hold any kernel resources. And on unthrottle, enqueue back those
> tasks so they can continue to run.
>
> Throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
> task can be enqueued to a throttled cfs_rq and gets to run, to not break
> the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add it to
> leaf cfs_rq list when it has its first entity enqueued and delete it
> from leaf cfs_rq list when it has no tasks enqueued.
>
> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
> kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
> 1 file changed, 153 insertions(+), 172 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8226120b8771a..59b372ffae18c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> if (cfs_rq->nr_queued == 1) {
> check_enqueue_throttle(cfs_rq);
> - if (!throttled_hierarchy(cfs_rq)) {
> - list_add_leaf_cfs_rq(cfs_rq);
> - } else {
> + list_add_leaf_cfs_rq(cfs_rq);
> #ifdef CONFIG_CFS_BANDWIDTH
> + if (throttled_hierarchy(cfs_rq)) {
> struct rq *rq = rq_of(cfs_rq);
>
> if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> cfs_rq->throttled_clock = rq_clock(rq);
> if (!cfs_rq->throttled_clock_self)
> cfs_rq->throttled_clock_self = rq_clock(rq);
> -#endif
> }
> +#endif
> }
> }
>
> @@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> cfs_rq->h_nr_runnable--;
> - if (cfs_rq_throttled(cfs_rq))
> - break;
> }
> }
>
> @@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
> cfs_rq->h_nr_runnable++;
> - if (cfs_rq_throttled(cfs_rq))
> - break;
> }
> }
>
> @@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> if (flags & DEQUEUE_DELAYED)
> finish_delayed_dequeue_entity(se);
>
> - if (cfs_rq->nr_queued == 0)
> + if (cfs_rq->nr_queued == 0) {
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> + if (throttled_hierarchy(cfs_rq))
> + list_del_leaf_cfs_rq(cfs_rq);
The cfs_rq should be removed from leaf list only after
it has been fully decayed, not here.
Thanks.
* Re: [PATCH v2 3/5] sched/fair: Switch to task based throttle model
2025-06-18 9:55 ` Chengming Zhou
@ 2025-06-18 11:19 ` Aaron Lu
2025-06-19 12:02 ` Chengming Zhou
0 siblings, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-06-18 11:19 UTC (permalink / raw)
To: Chengming Zhou
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hi Chengming,
Thanks for your review.
On Wed, Jun 18, 2025 at 05:55:08PM +0800, Chengming Zhou wrote:
> On 2025/6/18 16:19, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> >
> > In current throttle model, when a cfs_rq is throttled, its entity will
> > be dequeued from cpu's rq, making tasks attached to it not able to run,
> > thus achieving the throttle target.
> >
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken, it can not run till its task group's
> > next period comes, which can be a relatively long time. Waiting writer
> > will have to wait longer due to this and it also makes further reader
> > build up and eventually trigger task hung.
> >
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, record its throttled status but do not remove
> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > they get picked, add a task work to them so that when they return
> > to user, they can be dequeued there. In this way, tasks throttled will
> > not hold any kernel resources. And on unthrottle, enqueue back those
> > tasks so they can continue to run.
> >
> > Throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
> > task can be enqueued to a throttled cfs_rq and gets to run, to not break
> > the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add it to
> > leaf cfs_rq list when it has its first entity enqueued and delete it
> > from leaf cfs_rq list when it has no tasks enqueued.
> >
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> > kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
> > 1 file changed, 153 insertions(+), 172 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8226120b8771a..59b372ffae18c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > if (cfs_rq->nr_queued == 1) {
> > check_enqueue_throttle(cfs_rq);
> > - if (!throttled_hierarchy(cfs_rq)) {
> > - list_add_leaf_cfs_rq(cfs_rq);
> > - } else {
> > + list_add_leaf_cfs_rq(cfs_rq);
> > #ifdef CONFIG_CFS_BANDWIDTH
> > + if (throttled_hierarchy(cfs_rq)) {
> > struct rq *rq = rq_of(cfs_rq);
> > if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> > cfs_rq->throttled_clock = rq_clock(rq);
> > if (!cfs_rq->throttled_clock_self)
> > cfs_rq->throttled_clock_self = rq_clock(rq);
> > -#endif
> > }
> > +#endif
> > }
> > }
> > @@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
> > struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > cfs_rq->h_nr_runnable--;
> > - if (cfs_rq_throttled(cfs_rq))
> > - break;
> > }
> > }
> > @@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
> > struct cfs_rq *cfs_rq = cfs_rq_of(se);
> > cfs_rq->h_nr_runnable++;
> > - if (cfs_rq_throttled(cfs_rq))
> > - break;
> > }
> > }
> > @@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > if (flags & DEQUEUE_DELAYED)
> > finish_delayed_dequeue_entity(se);
> > - if (cfs_rq->nr_queued == 0)
> > + if (cfs_rq->nr_queued == 0) {
> > update_idle_cfs_rq_clock_pelt(cfs_rq);
> > + if (throttled_hierarchy(cfs_rq))
> > + list_del_leaf_cfs_rq(cfs_rq);
>
> The cfs_rq should be removed from leaf list only after
> it has been fully decayed, not here.
For a throttled cfs_rq, the intent is to preserve its load while it's
throttled. Its pelt clock is stopped in tg_throttle_down(), there will
be no decay for it if left on leaf list.
I've also described why I chose this behaviour in cover letter:
"
For pelt clock, I chose to keep the current behavior to freeze it on
cfs_rq's throttle time. The assumption is that tasks running in kernel
mode should not last too long, freezing the cfs_rq's pelt clock can keep
its load and its corresponding sched_entity's weight. Hopefully, this can
result in a stable situation for the remaining running tasks to quickly
finish their jobs in kernel mode.
"
Thanks,
Aaron
* Re: [PATCH v2 3/5] sched/fair: Switch to task based throttle model
2025-06-18 11:19 ` Aaron Lu
@ 2025-06-19 12:02 ` Chengming Zhou
0 siblings, 0 replies; 21+ messages in thread
From: Chengming Zhou @ 2025-06-19 12:02 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka
On 2025/6/18 19:19, Aaron Lu wrote:
> Hi Chengming,
>
> Thanks for your review.
>
> On Wed, Jun 18, 2025 at 05:55:08PM +0800, Chengming Zhou wrote:
>> On 2025/6/18 16:19, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> In current throttle model, when a cfs_rq is throttled, its entity will
>>> be dequeued from cpu's rq, making tasks attached to it not able to run,
>>> thus achieving the throttle target.
>>>
>>> This has a drawback though: assume a task is a reader of percpu_rwsem
>>> and is waiting. When it gets woken, it can not run till its task group's
>>> next period comes, which can be a relatively long time. Waiting writer
>>> will have to wait longer due to this and it also makes further reader
>>> build up and eventually trigger task hung.
>>>
>>> To improve this situation, change the throttle model to task based, i.e.
>>> when a cfs_rq is throttled, record its throttled status but do not remove
>>> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
>>> they get picked, add a task work to them so that when they return
>>> to user, they can be dequeued there. In this way, tasks throttled will
>>> not hold any kernel resources. And on unthrottle, enqueue back those
>>> tasks so they can continue to run.
>>>
>>> Throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
>>> task can be enqueued to a throttled cfs_rq and gets to run, to not break
>>> the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add it to
>>> leaf cfs_rq list when it has its first entity enqueued and delete it
>>> from leaf cfs_rq list when it has no tasks enqueued.
>>>
>>> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>> kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
>>> 1 file changed, 153 insertions(+), 172 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8226120b8771a..59b372ffae18c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>> if (cfs_rq->nr_queued == 1) {
>>> check_enqueue_throttle(cfs_rq);
>>> - if (!throttled_hierarchy(cfs_rq)) {
>>> - list_add_leaf_cfs_rq(cfs_rq);
>>> - } else {
>>> + list_add_leaf_cfs_rq(cfs_rq);
>>> #ifdef CONFIG_CFS_BANDWIDTH
>>> + if (throttled_hierarchy(cfs_rq)) {
>>> struct rq *rq = rq_of(cfs_rq);
>>> if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
>>> cfs_rq->throttled_clock = rq_clock(rq);
>>> if (!cfs_rq->throttled_clock_self)
>>> cfs_rq->throttled_clock_self = rq_clock(rq);
>>> -#endif
>>> }
>>> +#endif
>>> }
>>> }
>>> @@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
>>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> cfs_rq->h_nr_runnable--;
>>> - if (cfs_rq_throttled(cfs_rq))
>>> - break;
>>> }
>>> }
>>> @@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
>>> struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> cfs_rq->h_nr_runnable++;
>>> - if (cfs_rq_throttled(cfs_rq))
>>> - break;
>>> }
>>> }
>>> @@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>> if (flags & DEQUEUE_DELAYED)
>>> finish_delayed_dequeue_entity(se);
>>> - if (cfs_rq->nr_queued == 0)
>>> + if (cfs_rq->nr_queued == 0) {
>>> update_idle_cfs_rq_clock_pelt(cfs_rq);
>>> + if (throttled_hierarchy(cfs_rq))
>>> + list_del_leaf_cfs_rq(cfs_rq);
>>
>> The cfs_rq should be removed from leaf list only after
>> it has been fully decayed, not here.
>
> For a throttled cfs_rq, the intent is to preserve its load while it's
> throttled. Its pelt clock is stopped in tg_throttle_down(), so there will
> be no decay for it even if it is left on the leaf list.
Ah, right.
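For reference, the freeze is implemented by cfs_rq_clock_pelt() in
kernel/sched/pelt.h. A simplified sketch of the pre-series helper (the same
function is touched by the diff later in this thread):

static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
	/*
	 * While throttle_count is non-zero, return the snapshot taken in
	 * tg_throttle_down(): "now" stops advancing, so __update_load_sum()
	 * sees a zero delta and neither load_sum nor util_sum decays for
	 * the throttled cfs_rq.
	 */
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_pelt -
		       cfs_rq->throttled_clock_pelt_time;

	/* Otherwise: the rq's pelt clock minus total time spent throttled. */
	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
}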
>
> I've also described why I chose this behaviour in cover letter:
> "
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
> "
Ok, I get it, keeping the current behavior seems reasonable to me.
Another way may be to detach the throttled task's load on dequeue and reset
its se->avg.last_update_time to 0, so its load gets re-attached on enqueue.
Then we wouldn't need to stop its cfs_rq's pelt clock.
But the current approach looks simpler.
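A minimal sketch of that alternative, assuming the existing detach/attach
helpers in fair.c (the function name below is made up for illustration and
this is not code from the series):

/*
 * Hypothetical: on the throttle dequeue path, drop the task's load
 * contribution and clear last_update_time so that the DO_ATTACH logic in
 * update_load_avg() re-attaches it on the next enqueue, the same way task
 * migration handles it.
 */
static void detach_throttled_task_load(struct cfs_rq *cfs_rq,
				       struct sched_entity *se)
{
	update_load_avg(cfs_rq, se, 0);		/* sync se to the current pelt clock */
	detach_entity_load_avg(cfs_rq, se);	/* remove its load/util contribution */
	se->avg.last_update_time = 0;		/* force re-attach on next enqueue */
}

This would keep the cfs_rq's pelt clock running, at the cost of an extra
detach/attach pair per task for every throttle/unthrottle cycle.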
Thanks!
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (4 preceding siblings ...)
2025-06-18 8:19 ` [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
@ 2025-07-01 8:31 ` Aaron Lu
2025-07-03 7:37 ` Peter Zijlstra
2025-07-02 4:25 ` K Prateek Nayak
2025-07-02 22:00 ` Benjamin Segall
7 siblings, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-07-01 8:31 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hello,
On Wed, Jun 18, 2025 at 04:19:35PM +0800, Aaron Lu wrote:
> v2:
> - Re-org the patchset to use a single patch to implement throttle
> related changes, suggested by Chengming;
> - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
> decide if throttle task work is needed instead of checking
> throttled_hierarchy(), suggested by Peter;
> - Simplify throttle_count check in tg_throttle_down() and
> tg_unthrottle_up(), suggested by Peter;
> - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
> a throttled cfs_rq, suggested by Peter;
> - Address the missing of detach_task_cfs_rq() for throttled tasks that
> get migrated to a new rq, pointed out by Chengming;
> - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
> cond_resched*() is going away, pointed out by Peter.
> I hope I didn't miss any comments and suggestions for v1 and if I do,
> please kindly let me know, thanks!
>
> Base: tip/sched/core commit dabe1be4e84c("sched/smp: Use the SMP version
> of double_rq_clock_clear_update()")
I wonder if there are any more comments about this series?
Is it going in the right direction?
Thanks.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (5 preceding siblings ...)
2025-07-01 8:31 ` [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
@ 2025-07-02 4:25 ` K Prateek Nayak
2025-07-02 8:51 ` Aaron Lu
2025-07-02 22:00 ` Benjamin Segall
7 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-07-02 4:25 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hello Aaron,
On 6/18/2025 1:49 PM, Aaron Lu wrote:
> v2:
> - Re-org the patchset to use a single patch to implement throttle
> related changes, suggested by Chengming;
> - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
> decide if throttle task work is needed instead of checking
> throttled_hierarchy(), suggested by Peter;
> - Simplify throttle_count check in tg_throttle_down() and
> tg_unthrottle_up(), suggested by Peter;
> - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
> a throttled cfs_rq, suggested by Peter;
> - Address the missing of detach_task_cfs_rq() for throttled tasks that
> get migrated to a new rq, pointed out by Chengming;
> - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
> cond_resched*() is going away, pointed out by Peter.
> I hope I didn't miss any comments and suggestions for v1 and if I do,
> please kindly let me know, thanks!
>
> Base: tip/sched/core commit dabe1be4e84c("sched/smp: Use the SMP version
> of double_rq_clock_clear_update()")
Sorry for the delay! I gave this a spin with my nested hierarchy stress
test with sched-messaging as well as with Jan's reproducer from [1] and
I didn't see anything unexpected.
A 2 vCPU VM running vanilla tip:sched/core (PREEMPT_RT) hangs within a
few seconds when the two tasks from Jan's reproducer are pinned to the
same CPU as the bandwidth timer.
I haven't seen any hangs / rcu-stalls with this series applied on top of
tip:sched/core. Feel free to include:
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
[1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-02 4:25 ` K Prateek Nayak
@ 2025-07-02 8:51 ` Aaron Lu
0 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-07-02 8:51 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hi Prateek,
On Wed, Jul 02, 2025 at 09:55:19AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 6/18/2025 1:49 PM, Aaron Lu wrote:
> > v2:
> > - Re-org the patchset to use a single patch to implement throttle
> > related changes, suggested by Chengming;
> > - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
> > decide if throttle task work is needed instead of checking
> > throttled_hierarchy(), suggested by Peter;
> > - Simplify throttle_count check in tg_throttle_down() and
> > tg_unthrottle_up(), suggested by Peter;
> > - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
> > a throttled cfs_rq, suggested by Peter;
> > - Address the missing of detach_task_cfs_rq() for throttled tasks that
> > get migrated to a new rq, pointed out by Chengming;
> > - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
> > cond_resched*() is going away, pointed out by Peter.
> > I hope I didn't miss any comments and suggestions for v1 and if I do,
> > please kindly let me know, thanks!
> >
> > Base: tip/sched/core commit dabe1be4e84c("sched/smp: Use the SMP version
> > of double_rq_clock_clear_update()")
>
> Sorry for the delay! I gave this a spin with my nested hierarchy stress
> test with sched-messaging as well as with Jan's reproducer from [1] and
> I didn't see anything unexpected.
>
> A 2 vCPU VM running vanilla tip:sched/core (PREEMPT_RT) hangs within a
> few seconds when the two tasks from Jan's reproducer are pinned to the
> same CPU as the bandwidth timer.
>
> I haven't seen any hangs / rcu-stalls with this series applied on top of
> tip:sched/core. Feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>
> [1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
>
Thanks a lot Prateek, I really appreciate you taking the time to test this.
Best regards,
Aaron
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
` (6 preceding siblings ...)
2025-07-02 4:25 ` K Prateek Nayak
@ 2025-07-02 22:00 ` Benjamin Segall
2025-07-03 6:34 ` Aaron Lu
2025-07-04 4:34 ` K Prateek Nayak
7 siblings, 2 replies; 21+ messages in thread
From: Benjamin Segall @ 2025-07-02 22:00 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Aaron Lu <ziqianlu@bytedance.com> writes:
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
I suppose the way that this would go wrong would be CPU 1 using up all
of the quota, and then a task waking up on CPU 2 and trying to run in
the kernel for a while. I suspect pelt time needs to also keep running
until all the tasks are asleep (and that's what we have been running at
google with the version based on separate accounting, so we haven't
accidentally done a large scale test of letting it pause).
Otherwise it does look ok, so long as we're ok with increasing distribute
time again.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-02 22:00 ` Benjamin Segall
@ 2025-07-03 6:34 ` Aaron Lu
2025-07-04 4:34 ` K Prateek Nayak
1 sibling, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-07-03 6:34 UTC (permalink / raw)
To: Benjamin Segall
Cc: Valentin Schneider, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hi Benjamin,
Thanks for taking a look.
On Wed, Jul 02, 2025 at 03:00:56PM -0700, Benjamin Segall wrote:
> Aaron Lu <ziqianlu@bytedance.com> writes:
>
> > For pelt clock, I chose to keep the current behavior to freeze it on
> > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > its load and its corresponding sched_entity's weight. Hopefully, this can
> > result in a stable situation for the remaining running tasks to quickly
> > finish their jobs in kernel mode.
>
> I suppose the way that this would go wrong would be CPU 1 using up all
> of the quota, and then a task waking up on CPU 2 and trying to run in
> the kernel for a while. I suspect pelt time needs to also keep running
> until all the tasks are asleep (and that's what we have been running at
> google with the version based on separate accounting, so we haven't
> accidentally done a large scale test of letting it pause).
>
Got it, I'll rework this part to keep pelt clock ticking and only stop
it when all tasks of a throttled cfs_rq get either throttled or blocked.
I will paste a diff on top of this series and if the diff looks OK, I'll
fold it into patch3 in the next version.
> Otherwise it does look ok, so long as we're ok with increasing distribute
> time again.
Good to know this!
As for not strictly limiting quota, I suppose that is a trade-off, and
having a system that operates properly is better than a system with
task hangs or even deadlocks.
Best regards,
Aaron
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-01 8:31 ` [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
@ 2025-07-03 7:37 ` Peter Zijlstra
2025-07-03 11:51 ` Aaron Lu
0 siblings, 1 reply; 21+ messages in thread
From: Peter Zijlstra @ 2025-07-03 7:37 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka
On Tue, Jul 01, 2025 at 04:31:23PM +0800, Aaron Lu wrote:
> I wonder if there are any more comments about this series?
> Is it going in the right direction?
I had a quick look yesterday, and things seem more or less agreeable.
I see there has been some feedback from Ben that warrants a new version,
so I'll try and keep an eye out for that one.
Thanks!
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-03 7:37 ` Peter Zijlstra
@ 2025-07-03 11:51 ` Aaron Lu
0 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-07-03 11:51 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka
On Thu, Jul 03, 2025 at 09:37:40AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 01, 2025 at 04:31:23PM +0800, Aaron Lu wrote:
>
> > I wonder if there are any more comments about this series?
> > Is it going in the right direction?
>
> I had a quick look yesterday, and things seem more or less agreeable.
> I see there has been some feedback from Ben that warrants a new version,
> so I'll try and keep an eye out for that one.
>
> Thanks!
Thank you Peter for the update.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-02 22:00 ` Benjamin Segall
2025-07-03 6:34 ` Aaron Lu
@ 2025-07-04 4:34 ` K Prateek Nayak
2025-07-04 7:54 ` Aaron Lu
1 sibling, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-07-04 4:34 UTC (permalink / raw)
To: Benjamin Segall, Aaron Lu
Cc: Valentin Schneider, Peter Zijlstra, Chengming Zhou, Josh Don,
Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou,
Jan Kiszka, Florian Bezdeka
Hello Ben,
On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> Aaron Lu <ziqianlu@bytedance.com> writes:
>
>> For pelt clock, I chose to keep the current behavior to freeze it on
>> cfs_rq's throttle time. The assumption is that tasks running in kernel
>> mode should not last too long, freezing the cfs_rq's pelt clock can keep
>> its load and its corresponding sched_entity's weight. Hopefully, this can
>> result in a stable situation for the remaining running tasks to quickly
>> finish their jobs in kernel mode.
>
> I suppose the way that this would go wrong would be CPU 1 using up all
> of the quota, and then a task waking up on CPU 2 and trying to run in
> the kernel for a while. I suspect pelt time needs to also keep running
> until all the tasks are asleep (and that's what we have been running at
> google with the version based on separate accounting, so we haven't
> accidentally done a large scale test of letting it pause).
Thinking out loud ...
One thing this can possibly do is create a lot of:
throttled -> partially unthrottled -> throttled
transitions when tasks wakeup on throttled hierarchy, run for a while,
and then go back to sleep. If the PELT clocks aren't frozen, this
either means:
1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
any load associated back onto the list and allow PELT to progress only
to then remove them again once tasks go back to sleep. A great many of
these transitions are possible theoretically which is not ideal.
2. Propagate the delta time where PELT was not frozen during unthrottle
and if it isn't 0, do an update_load_avg() to sync PELT. This will
increase the overhead of the tg_tree callback which isn't ideal
either. It can also complicate the enqueue path since the PELT of
the cfs_rq hierarchy being enqueued may need correction before
the task can be enqueued.
I know Josh hates both approaches since tg_tree_walks are already very
expensive in your use cases and it has to be done in a non-preemptible
context holding the rq_lock but which do you think is the lesser of two
evils? Or is there a better solution that I have completely missed?
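For illustration only, a very rough sketch of what option 2 might look like
(hypothetical; the helper name and the pelt_ran_while_throttled field are
made up, and nothing like this exists in the series):

/*
 * Hypothetical option-2 sketch: in the unthrottle tg-tree walk, if this
 * cfs_rq's pelt clock kept running for some window while throttled, sync
 * its owning entity's load before the hierarchy is put back. The real
 * tg_unthrottle_up() does more; only the option-2 addition is shown.
 */
static int tg_unthrottle_up_sync_pelt(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
	struct sched_entity *se = tg->se[cpu_of(rq)];

	if (--cfs_rq->throttle_count)
		return 0;

	if (cfs_rq->pelt_ran_while_throttled && se) {	/* assumed field */
		update_load_avg(cfs_rq_of(se), se, UPDATE_TG);
		cfs_rq->pelt_ran_while_throttled = 0;
	}

	return 0;
}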
>
> Otherwise it does look ok, so long as we're ok with increasing distribute
> time again.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-04 4:34 ` K Prateek Nayak
@ 2025-07-04 7:54 ` Aaron Lu
2025-07-04 8:48 ` K Prateek Nayak
0 siblings, 1 reply; 21+ messages in thread
From: Aaron Lu @ 2025-07-04 7:54 UTC (permalink / raw)
To: K Prateek Nayak, Benjamin Segall, Chengming Zhou
Cc: Valentin Schneider, Peter Zijlstra, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou,
Jan Kiszka, Florian Bezdeka
On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
> Hello Ben,
>
> On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> > Aaron Lu <ziqianlu@bytedance.com> writes:
> >
> > > For pelt clock, I chose to keep the current behavior to freeze it on
> > > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > > its load and its corresponding sched_entity's weight. Hopefully, this can
> > > result in a stable situation for the remaining running tasks to quickly
> > > finish their jobs in kernel mode.
> >
> > I suppose the way that this would go wrong would be CPU 1 using up all
> > of the quota, and then a task waking up on CPU 2 and trying to run in
> > the kernel for a while. I suspect pelt time needs to also keep running
> > until all the tasks are asleep (and that's what we have been running at
> > google with the version based on separate accounting, so we haven't
> > accidentally done a large scale test of letting it pause).
>
> Thinking out loud ...
>
> One thing this can possibly do is create a lot of:
>
> throttled -> partially unthrottled -> throttled
>
> transitions when tasks wakeup on throttled hierarchy, run for a while,
> and then go back to sleep. If the PELT clocks aren't frozen, this
> either means:
>
> 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
> any load associated back onto the list and allow PELT to progress only
> to then remove them again once tasks go back to sleep. A great many of
> these transitions are possible theoretically which is not ideal.
>
I'm going this route, because it is kind of already the case now :)
I mean throttled cfs_rqs are already added back to the leaf cfs_rq
list during enqueue time, to make the assert_list_leaf_cfs_rq(rq); at
the bottom of enqueue_task_fair() happy when a task is enqueued to a
throttled cfs_rq.
I'm sorry if this is not obvious in this series, I guess I put too many
things in patch3.
Below is the diff I cooked on top of this series to keep pelt clock
running as long as there is a task running in a throttled cfs_rq, does it
look sane?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d869c8b51c5a6..410b850df2a12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->on_rq = 1;
if (cfs_rq->nr_queued == 1) {
+ struct rq *rq = rq_of(cfs_rq);
+
check_enqueue_throttle(cfs_rq);
list_add_leaf_cfs_rq(cfs_rq);
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
}
}
@@ -5437,8 +5444,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
- if (throttled_hierarchy(cfs_rq))
+ if (throttled_hierarchy(cfs_rq)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
list_del_leaf_cfs_rq(cfs_rq);
+ }
}
return true;
@@ -5873,8 +5885,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
if (--cfs_rq->throttle_count)
return 0;
- cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
- cfs_rq->throttled_clock_pelt;
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
if (cfs_rq->throttled_clock_self) {
u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
@@ -5939,11 +5954,13 @@ static int tg_throttle_down(struct task_group *tg, void *data)
if (cfs_rq->throttle_count++)
return 0;
- /* group is entering throttled state, stop time */
- cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- if (!cfs_rq->nr_queued)
+ if (!cfs_rq->nr_queued) {
+ /* group is entering throttled state, stop time */
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
list_del_leaf_cfs_rq(cfs_rq);
+ }
WARN_ON_ONCE(cfs_rq->throttled_clock_self);
WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
u64 throttled;
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
throttled = U64_MAX;
else
throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2a07537d3c12..877e40ccd8cc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -724,7 +724,8 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled;
+ int throttled:1;
+ int pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
Thanks,
Aaron
> 2. Propagate the delta time where PELT was not frozen during unthrottle
> and if it isn't 0, do an update_load_avg() to sync PELT. This will
> increase the overhead of the tg_tree callback which isn't ideal
> either. It can also complicate the enqueue path since the PELT of
>> the cfs_rq hierarchy being enqueued may need correction before
> the task can be enqueued.
>
> I know Josh hates both approaches since tg_tree_walks are already very
> expensive in your use cases and it has to be done in a non-preemptible
> context holding the rq_lock but which do you think is the lesser of two
> evils? Or is there a better solution that I have completely missed?
>
> >
> > Otherwise it does look ok, so long as we're ok with increasing distribute
> > time again.
>
> --
> Thanks and Regards,
> Prateek
>
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-04 7:54 ` Aaron Lu
@ 2025-07-04 8:48 ` K Prateek Nayak
2025-07-04 9:47 ` Aaron Lu
0 siblings, 1 reply; 21+ messages in thread
From: K Prateek Nayak @ 2025-07-04 8:48 UTC (permalink / raw)
To: Aaron Lu, Benjamin Segall, Chengming Zhou
Cc: Valentin Schneider, Peter Zijlstra, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou,
Jan Kiszka, Florian Bezdeka
Hello Aaron,
On 7/4/2025 1:24 PM, Aaron Lu wrote:
> On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
>> Hello Ben,
>>
>> On 7/3/2025 3:30 AM, Benjamin Segall wrote:
>>> Aaron Lu <ziqianlu@bytedance.com> writes:
>>>
>>>> For pelt clock, I chose to keep the current behavior to freeze it on
>>>> cfs_rq's throttle time. The assumption is that tasks running in kernel
>>>> mode should not last too long, freezing the cfs_rq's pelt clock can keep
>>>> its load and its corresponding sched_entity's weight. Hopefully, this can
>>>> result in a stable situation for the remaining running tasks to quickly
>>>> finish their jobs in kernel mode.
>>>
>>> I suppose the way that this would go wrong would be CPU 1 using up all
>>> of the quota, and then a task waking up on CPU 2 and trying to run in
>>> the kernel for a while. I suspect pelt time needs to also keep running
>>> until all the tasks are asleep (and that's what we have been running at
>>> google with the version based on separate accounting, so we haven't
>>> accidentally done a large scale test of letting it pause).
>>
>> Thinking out loud ...
>>
>> One thing this can possibly do is create a lot of:
>>
>> throttled -> partially unthrottled -> throttled
>>
>> transitions when tasks wakeup on throttled hierarchy, run for a while,
>> and then go back to sleep. If the PELT clocks aren't frozen, this
>> either means:
>>
>> 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
>> any load associated back onto the list and allow PELT to progress only
>> to then remove them again once tasks go back to sleep. A great many of
>> these transitions are possible theoretically which is not ideal.
>>
>
> I'm going this route, because it is kind of already the case now :)
>
> I mean throttled cfs_rqs are already added back to the leaf cfs_rq
> list during enqueue time, to make the assert_list_leaf_cfs_rq(rq); at
> the bottom of enqueue_task_fair() happy when a task is enqueued to a
> throttled cfs_rq.
>
> I'm sorry if this is not obvious in this series, I guess I put too many
> things in patch3.
>
> Below is the diff I cooked on top of this series to keep pelt clock
> running as long as there is a task running in a throttled cfs_rq, does it
> look sane?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d869c8b51c5a6..410b850df2a12 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> se->on_rq = 1;
>
> if (cfs_rq->nr_queued == 1) {
> + struct rq *rq = rq_of(cfs_rq);
> +
> check_enqueue_throttle(cfs_rq);
> list_add_leaf_cfs_rq(cfs_rq);
> + if (cfs_rq->pelt_clock_throttled) {
> + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> + cfs_rq->throttled_clock_pelt;
> + cfs_rq->pelt_clock_throttled = 0;
> + }
At this point we've already done an update_load_avg() above in
enqueue_entity() without unfreezing the PELT clock. Does it make
sense to do it at the beginning?
Overall idea seems sane to me. I was thinking if anything can go
wrong by only unfreezing the PELT for one part of the hierarchy but
I suppose the other cfs_rq can be considered individually throttled
and it should turn out fine.
> }
> }
>
> @@ -5437,8 +5444,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>
> if (cfs_rq->nr_queued == 0) {
> update_idle_cfs_rq_clock_pelt(cfs_rq);
> - if (throttled_hierarchy(cfs_rq))
> + if (throttled_hierarchy(cfs_rq)) {
> + struct rq *rq = rq_of(cfs_rq);
> +
> + cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> + cfs_rq->pelt_clock_throttled = 1;
> list_del_leaf_cfs_rq(cfs_rq);
> + }
> }
>
> return true;
> @@ -5873,8 +5885,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> if (--cfs_rq->throttle_count)
> return 0;
>
> - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> - cfs_rq->throttled_clock_pelt;
> + if (cfs_rq->pelt_clock_throttled) {
> + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> + cfs_rq->throttled_clock_pelt;
> + cfs_rq->pelt_clock_throttled = 0;
> + }
>
> if (cfs_rq->throttled_clock_self) {
> u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> @@ -5939,11 +5954,13 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> if (cfs_rq->throttle_count++)
> return 0;
>
> - /* group is entering throttled state, stop time */
> - cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>
> - if (!cfs_rq->nr_queued)
> + if (!cfs_rq->nr_queued) {
> + /* group is entering throttled state, stop time */
> + cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> + cfs_rq->pelt_clock_throttled = 1;
> list_del_leaf_cfs_rq(cfs_rq);
> + }
>
> WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 62c3fa543c0f2..f921302dc40fb 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> u64 throttled;
>
> - if (unlikely(cfs_rq->throttle_count))
> + if (unlikely(cfs_rq->pelt_clock_throttled))
> throttled = U64_MAX;
> else
> throttled = cfs_rq->throttled_clock_pelt_time;
> @@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
> static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
> {
> - if (unlikely(cfs_rq->throttle_count))
> + if (unlikely(cfs_rq->pelt_clock_throttled))
> return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
>
> return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f2a07537d3c12..877e40ccd8cc1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -724,7 +724,8 @@ struct cfs_rq {
> u64 throttled_clock_pelt_time;
> u64 throttled_clock_self;
> u64 throttled_clock_self_time;
> - int throttled;
> + int throttled:1;
> + int pelt_clock_throttled:1;
> int throttle_count;
> struct list_head throttled_list;
> struct list_head throttled_csd_list;
>
>
> Thanks,
> Aaron
>
>> 2. Propagate the delta time where PELT was not frozen during unthrottle
>> and if it isn't 0, do an update_load_avg() to sync PELT. This will
>> increase the overhead of the tg_tree callback which isn't ideal
>> either. It can also complicate the enqueue path since the PELT of
>> the cfs_rq hierarchy being enqueued may need correction before
>> the task can be enqueued.
>>
>> I know Josh hates both approaches since tg_tree_walks are already very
>> expensive in your use cases and it has to be done in a non-preemptible
>> context holding the rq_lock but which do you think is the lesser of two
>> evils? Or is there a better solution that I have completely missed?
>>
>>>
>>> Otherwise it does look ok, so long as we're ok with increasing distribute
>>> time again.
>>
>> --
>> Thanks and Regards,
>> Prateek
>>
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v2 0/5] Defer throttle when task exits to user
2025-07-04 8:48 ` K Prateek Nayak
@ 2025-07-04 9:47 ` Aaron Lu
0 siblings, 0 replies; 21+ messages in thread
From: Aaron Lu @ 2025-07-04 9:47 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Benjamin Segall, Chengming Zhou, Valentin Schneider,
Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka
Hi Prateek,
On Fri, Jul 04, 2025 at 02:18:49PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 7/4/2025 1:24 PM, Aaron Lu wrote:
> > On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
> > > Hello Ben,
> > >
> > > On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> > > > Aaron Lu <ziqianlu@bytedance.com> writes:
> > > >
> > > > > For pelt clock, I chose to keep the current behavior to freeze it on
> > > > > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > > > > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > > > > its load and its corresponding sched_entity's weight. Hopefully, this can
> > > > > result in a stable situation for the remaining running tasks to quickly
> > > > > finish their jobs in kernel mode.
> > > >
> > > > I suppose the way that this would go wrong would be CPU 1 using up all
> > > > of the quota, and then a task waking up on CPU 2 and trying to run in
> > > > the kernel for a while. I suspect pelt time needs to also keep running
> > > > until all the tasks are asleep (and that's what we have been running at
> > > > google with the version based on separate accounting, so we haven't
> > > > accidentally done a large scale test of letting it pause).
> > >
> > > Thinking out loud ...
> > >
> > > One thing this can possibly do is create a lot of:
> > >
> > > throttled -> partially unthrottled -> throttled
> > >
> > > transitions when tasks wakeup on throttled hierarchy, run for a while,
> > > and then go back to sleep. If the PELT clocks aren't frozen, this
> > > either means:
> > >
> > > 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
> > > any load associated back onto the list and allow PELT to progress only
> > > to then remove them again once tasks go back to sleep. A great many of
> > > these transitions are possible theoretically which is not ideal.
> > >
> >
> > I'm going this route, because it is kind of already the case now :)
> >
> > I mean throttled cfs_rqs are already added back to the leaf cfs_rq
> > list during enqueue time, to make the assert_list_leaf_cfs_rq(rq); at
> > the bottom of enqueue_task_fair() happy when a task is enqueued to a
> > throttled cfs_rq.
> >
> > I'm sorry if this is not obvious in this series, I guess I put too many
> > things in patch3.
> >
> > Below is the diff I cooked on top of this series to keep pelt clock
> > running as long as there is a task running in a throttled cfs_rq, does it
> > look sane?
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d869c8b51c5a6..410b850df2a12 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > se->on_rq = 1;
> > if (cfs_rq->nr_queued == 1) {
> > + struct rq *rq = rq_of(cfs_rq);
> > +
> > check_enqueue_throttle(cfs_rq);
> > list_add_leaf_cfs_rq(cfs_rq);
> > + if (cfs_rq->pelt_clock_throttled) {
> > + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > + cfs_rq->throttled_clock_pelt;
> > + cfs_rq->pelt_clock_throttled = 0;
> > + }
>
> At this point we've already done an update_load_avg() above in
> enqueue_entity() without unfreezing the PELT clock. Does it make
> sense to do it at the beginning?
>
My thinking is, when the first entity is enqueued, the cfs_rq should
have been throttled so the update_load_avg() done above should do nothing
since its pelt clock is frozen (so no decay should happen). Then after the
entity is added, this throttled cfs_rq's pelt clock is unfrozen and load
can be updated accordingly.
Thinking of it another way: when we move the unfreezing earlier, the delta
calculated in __update_load_sum() should still be zero, because
cfs_rq->throttled_clock_pelt_time is adjusted so that the "now" value
derived from cfs_rq_clock_pelt() stays the same as the last update time.
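A small worked example, with P0/P1 as assumed values of the rq's pelt clock
and T as the throttled pelt time accumulated so far:

  freeze (tg_throttle_down)   : throttled_clock_pelt      = P0
                                throttled_clock_pelt_time = T
  while frozen                : cfs_rq_clock_pelt()       = P0 - T
                                (last_update_time was synced to P0 - T)
  unfreeze at pelt time P1    : throttled_clock_pelt_time = T + (P1 - P0)
  right after the unfreeze    : cfs_rq_clock_pelt()       = P1 - (T + P1 - P0)
                                                          = P0 - T

So "now" equals last_update_time and __update_load_sum() computes a zero
delta, i.e. no decay is applied for the frozen window.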
Does this make sense, or did I misunderstand it?
> Overall idea seems sane to me. I was thinking if anything can go
> wrong by only unfreezing the PELT for one part of the hierarchy but
> I suppose the other cfs_rq can be considered individually throttled
> and it should turn out fine.
>
Unfreezing the PELT for only one part seems problematic; I suppose we
would have to propagate the load all the way up so that the root level
sched entity has a correct load setting and gets the correct share when
it needs to compete for cpu resources. I assume this is why Ben thinks
it is better to keep the pelt clock running as long as there is a task
running.
Note that the pelt clock is only frozen when the cfs_rq has no tasks
running, and gets unfrozen when the first entity is enqueued or at
unthrottle time. So when a new task is enqueued, at each level's
enqueue_entity():
- if it's the first entity of a throttled cfs_rq, then its PELT clock
will be unfrozen;
- if it's not the first entity, that means the cfs_rq's pelt clock was
not frozen in the first place.
So in the end, we should not have a partially unfrozen hierarchy. At least,
that's my intention, please let me know if I messed up :)
Thanks for the quick review!
Best regards,
Aaron
^ permalink raw reply [flat|nested] 21+ messages in thread
Thread overview: 21+ messages
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-06-18 8:19 ` [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-06-18 8:19 ` [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-06-18 9:03 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-06-18 9:55 ` Chengming Zhou
2025-06-18 11:19 ` Aaron Lu
2025-06-19 12:02 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-06-18 8:19 ` [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-07-01 8:31 ` [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-03 7:37 ` Peter Zijlstra
2025-07-03 11:51 ` Aaron Lu
2025-07-02 4:25 ` K Prateek Nayak
2025-07-02 8:51 ` Aaron Lu
2025-07-02 22:00 ` Benjamin Segall
2025-07-03 6:34 ` Aaron Lu
2025-07-04 4:34 ` K Prateek Nayak
2025-07-04 7:54 ` Aaron Lu
2025-07-04 8:48 ` K Prateek Nayak
2025-07-04 9:47 ` Aaron Lu