* [PATCH 0/7] Defer throttle when task exits to user
@ 2025-05-20 10:41 Aaron Lu
  2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"

Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock happened:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/

To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based, i.e. when a
cfs_rq is throttled, mark its throttled status but do not remove it from
the cpu's rq. Instead, for tasks that belong to this cfs_rq, when they
get picked, add a task work to them so that when they return to user,
they can be dequeued. In this way, throttled tasks will not hold any
kernel resources. When the cfs_rq gets unthrottled, enqueue back those
throttled tasks.
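
To make this lifecycle easier to follow, below is a minimal userspace C
model of it. It is only a sketch of the description above: every name in
it (cfs_rq, limbo, pick_task, ...) is made up for illustration, it is not
the kernel implementation from the patches.

/*
 * Model of: throttle marks the cfs_rq, a picked task gets the throttle
 * work, the work dequeues it onto a limbo list on return to user, and
 * unthrottle makes the limbo tasks runnable again.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	const char *name;
	bool throttle_work_pending;	/* task work armed at pick time */
	bool on_limbo_list;		/* dequeued, waiting for unthrottle */
};

struct cfs_rq {
	bool throttled;			/* hierarchy ran out of quota */
	struct task *limbo[8];		/* stand-in for the limbo list */
	int nr_limbo;
};

/* pick time: a task in a throttled hierarchy gets the throttle work */
static void pick_task(struct cfs_rq *cfs_rq, struct task *p)
{
	if (cfs_rq->throttled)
		p->throttle_work_pending = true;
}

/* return to user: the task work dequeues the task onto the limbo list */
static void return_to_user(struct cfs_rq *cfs_rq, struct task *p)
{
	if (!p->throttle_work_pending)
		return;
	p->throttle_work_pending = false;
	if (!cfs_rq->throttled)		/* raced with replenish, keep running */
		return;
	p->on_limbo_list = true;
	cfs_rq->limbo[cfs_rq->nr_limbo++] = p;
	printf("%s throttled on return to user\n", p->name);
}

/* unthrottle: every task on the limbo list becomes runnable again */
static void unthrottle(struct cfs_rq *cfs_rq)
{
	cfs_rq->throttled = false;
	for (int i = 0; i < cfs_rq->nr_limbo; i++) {
		cfs_rq->limbo[i]->on_limbo_list = false;
		printf("%s enqueued back\n", cfs_rq->limbo[i]->name);
	}
	cfs_rq->nr_limbo = 0;
}

int main(void)
{
	struct cfs_rq cfs_rq = { .throttled = true };
	struct task p = { .name = "p0" };

	pick_task(&cfs_rq, &p);		/* p0 still runs its kernel section */
	return_to_user(&cfs_rq, &p);	/* throttled only at user entry */
	unthrottle(&cfs_rq);		/* p0 is runnable again */
	return 0;
}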

This new throttle model has consequences. For example, for a cfs_rq that
has 3 tasks attached, when 2 tasks are throttled on their return-to-user
path while one task is still running in kernel mode, the cfs_rq is in a
partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?

For the pelt clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not last too long; freezing the cfs_rq's pelt clock
keeps its load and its corresponding sched_entity's weight. Hopefully,
this results in a stable situation that lets the remaining running tasks
quickly finish their jobs in kernel mode.
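
For reference, here is a toy model of what "freezing" the pelt clock
means, using a fake clock and made-up names; it only mirrors the idea
above and is not the kernel's pelt code:

#include <stdint.h>
#include <stdio.h>

static uint64_t now;			/* fake, ever-advancing clock */
static uint64_t frozen_at;		/* set when throttle starts */
static uint64_t frozen_total;		/* accumulated throttled time */
static int throttled;

/* clock used for load tracking: real time minus the frozen windows */
static uint64_t pelt_clock(void)
{
	return (throttled ? frozen_at : now) - frozen_total;
}

static void throttle(void)
{
	frozen_at = now;
	throttled = 1;
}

static void unthrottle(void)
{
	frozen_total += now - frozen_at;
	throttled = 0;
}

int main(void)
{
	now = 100;
	throttle();
	now = 180;
	unthrottle();			/* 80 units spent throttled */
	now = 200;
	/* prints 120, not 200: load decay never sees the frozen window */
	printf("pelt clock = %llu\n", (unsigned long long)pelt_clock());
	return 0;
}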

For throttle time accounting, following RFC v2's feedback, rework
throttle time accounting for a cfs_rq as follows (see the sketch after
this list):
- start accounting when the first task gets throttled in its
  hierarchy;
- stop accounting on unthrottle.
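
A toy sketch of this accounting rule, with a fake clock and made-up
names (it only illustrates the rule above, it is not the kernel code):

#include <stdint.h>
#include <stdio.h>

static uint64_t now;			/* fake rq clock */
static uint64_t throttled_clock;	/* 0 means "not accounting" */
static uint64_t throttled_time;

/* called when a task of the throttled hierarchy lands on the limbo list */
static void first_task_throttled(void)
{
	if (!throttled_clock)
		throttled_clock = now;
}

/* called on unthrottle */
static void account_unthrottle(void)
{
	if (throttled_clock) {
		throttled_time += now - throttled_clock;
		throttled_clock = 0;
	}
}

int main(void)
{
	now = 1000;			/* cfs_rq runs out of quota ... */
	now = 1200;
	first_task_throttled();		/* first task hits the limbo list */
	now = 1500;
	first_task_throttled();		/* second task: no effect on the clock */
	now = 2000;
	account_unthrottle();
	/* prints 800: counted from the first throttled task, not from 1000 */
	printf("throttled_time = %llu\n", (unsigned long long)throttled_time);
	return 0;
}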

There was also a concern in RFC v1 about the increased duration of
(un)throttle operations. I've done some tests and, with a 2000
cgroups/20K runnable tasks setup on a 2-socket/384-CPU AMD server, the
longest duration of distribute_cfs_runtime() is in the 2ms-4ms range.
For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task
work setup" from throttle time to pick time, it's not an issue anymore.

Changes since RFC v2:
- drop RFC tag;
- Change throttle time accounting as suggested by Chengming. Prateek
  helped simplify the implementation;
- Re-organize patchset to avoid breaking bisect as suggested by Prateek.

Patches:
To avoid breaking bisect for CFS bandwidth functionality, the patchset
is reorganized. The previous patch2-3 is now split into patch2-3 and
patch5. Patch2 and 3 mostly do what they did before, except that they
retain the existing throttle functionality until patch5, which switches
the throttle model to task based. The end result should be: before
patch5, the existing throttle model still works correctly and, starting
with patch5, the task based throttle model takes over.

Details about each patch:

Patch1 is preparation work;

Patch2 prepares the throttle path for task based throttle: when a cfs_rq
is to be throttled, mark the throttled status for this cfs_rq and, when
tasks in the throttled hierarchy get picked, add a task work to them so
that when those tasks return to user, the task work can throttle them by
dequeuing them, and remember this by adding them to their cfs_rq's limbo
list. To avoid breaking bisect, patch2 retains the existing throttle
behaviour in that the throttled hierarchy is still removed from the
runqueue; the actual behaviour change is done in patch5;

Patch3 prepares the unthrottle path for task based throttle: when a
cfs_rq is unthrottled, enqueue back the tasks on its limbo list. It does
not take effect yet though, since patch2 does not place any tasks on
that limbo list;

Patch4 deals with the dequeue path when a task changes group, sched
class etc. A throttled task is already dequeued in fair, but task->on_rq
is still set, so when it changes task group or sched class or has its
affinity setting changed, core will first dequeue it. Since this task is
already dequeued in the fair class, this patch handles that situation.
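
As a side note, a userspace model of the situation patch4 handles could
look like below. It is only an illustration with made-up names: a task
throttled by the task work is already dequeued in fair (it sits on the
limbo list) but p->on_rq is still set, so core's dequeue/requeue cycle
still calls into fair and the fair side only has to drop the task from
the limbo list.

#include <stdbool.h>
#include <stdio.h>

struct task {
	bool on_rq;		/* what core sees */
	bool on_limbo_list;	/* already dequeued by fair on throttle */
};

static void dequeue_task_fair(struct task *p)
{
	if (p->on_limbo_list) {
		p->on_limbo_list = false;	/* just leave the limbo list */
		return;
	}
	/* the normal fair dequeue path would run here */
}

static void enqueue_task_fair(struct task *p)
{
	(void)p;	/* the normal fair enqueue path would run here */
}

/* core side: roughly what a task group change does with a queued task */
static void change_task_group(struct task *p)
{
	if (p->on_rq)
		dequeue_task_fair(p);
	/* ... move the task to its new group ... */
	if (p->on_rq)
		enqueue_task_fair(p);
}

int main(void)
{
	struct task p = { .on_rq = true, .on_limbo_list = true };

	change_task_group(&p);
	printf("on_limbo_list = %d\n", p.on_limbo_list);	/* prints 0 */
	return 0;
}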

Patch5 implements the actual throttle model change.

Patch6 implements throttle time accounting for this new task based
throttle model.

Patch7 is a cleanup after throttle model changed to task based.

All changelogs are written by me; if they read poorly, it's my fault.

Comments are welcome.

Base: tip/sched/core commit 676e8cf70cb0 ("sched,livepatch: Untangle
cond_resched() and live-patching")

Aaron Lu (3):
  sched/fair: Take care of group/affinity/sched_class change for
    throttled task
  sched/fair: task based throttle time accounting
  sched/fair: get rid of throttled_lb_pair()

Valentin Schneider (4):
  sched/fair: Add related data structure for task based throttle
  sched/fair: prepare throttle path for task based throttle
  sched/fair: prepare unthrottle path for task based throttle
  sched/fair: switch to task based throttle model

 include/linux/sched.h |   4 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 403 ++++++++++++++++++++----------------------
 kernel/sched/sched.h  |   4 +
 4 files changed, 199 insertions(+), 215 deletions(-)

-- 
2.39.5



* [PATCH 1/7] sched/fair: Add related data structure for task based throttle
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-21  8:48   ` Chengming Zhou
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

From: Valentin Schneider <vschneid@redhat.com>

Add related data structures for this new throttle functionality.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 include/linux/sched.h |  4 ++++
 kernel/sched/core.c   |  3 +++
 kernel/sched/fair.c   | 12 ++++++++++++
 kernel/sched/sched.h  |  2 ++
 4 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index b98195991031c..055f3782eeaee 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -879,6 +879,10 @@ struct task_struct {
 
 #ifdef CONFIG_CGROUP_SCHED
 	struct task_group		*sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+	struct callback_head		sched_throttle_work;
+	struct list_head		throttle_node;
+#endif
 #endif
 
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bece0ba6f5b3a..b7ca7cefee54e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4499,6 +4499,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	p->se.cfs_rq			= NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+	init_cfs_throttle_work(p);
+#endif
 #endif
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb5a2572b4f8b..75bf6186a5137 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5825,6 +5825,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+	INIT_LIST_HEAD(&p->throttle_node);
+}
+
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5a6a503eb6de..921527327f107 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2703,6 +2703,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
 
 extern void init_dl_entity(struct sched_dl_entity *dl_se);
 
+extern void init_cfs_throttle_work(struct task_struct *p);
+
 #define BW_SHIFT		20
 #define BW_UNIT			(1 << BW_SHIFT)
 #define RATIO_SHIFT		8
-- 
2.39.5



* [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
  2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-20 12:02   ` Florian Bezdeka
                     ` (4 more replies)
  2025-05-20 10:41 ` [PATCH 3/7] sched/fair: prepare unthrottle " Aaron Lu
                   ` (4 subsequent siblings)
  6 siblings, 5 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

From: Valentin Schneider <vschneid@redhat.com>

In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, making the tasks attached to it unable to
run, thus achieving the throttle target.

This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken up, it can not run until its task
group's next period comes, which can be a relatively long time. The
waiting writer will have to wait longer because of this, and it also
makes further readers build up and eventually triggers a task hung.

To improve this situation, change the throttle model to task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, when
they get picked, add a task work to them so that when they return
to user, they can be dequeued. In this way, throttled tasks will not
hold any kernel resources.

To avoid breaking bisect, preserve the current throttle behavior by
still dequeuing the throttled hierarchy from the rq; because of this, no
task can have that throttle task work added yet. The throttle model will
switch to task based in a later patch.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h |  1 +
 2 files changed, 80 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 75bf6186a5137..e87ceb0a2d37f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5825,8 +5825,47 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
+	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+	struct sched_entity *se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+
+	WARN_ON_ONCE(p != current);
+	p->sched_throttle_work.next = &p->sched_throttle_work;
+
+	/*
+	 * If task is exiting, then there won't be a return to userspace, so we
+	 * don't have to bother with any of this.
+	 */
+	if ((p->flags & PF_EXITING))
+		return;
+
+	scoped_guard(task_rq_lock, p) {
+		se = &p->se;
+		cfs_rq = cfs_rq_of(se);
+
+		/* Raced, forget */
+		if (p->sched_class != &fair_sched_class)
+			return;
+
+		/*
+		 * If not in limbo, then either replenish has happened or this
+		 * task got migrated out of the throttled cfs_rq, move along.
+		 */
+		if (!cfs_rq->throttle_count)
+			return;
+		rq = scope.rq;
+		update_rq_clock(rq);
+		WARN_ON_ONCE(!list_empty(&p->throttle_node));
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		resched_curr(rq);
+	}
+
+	cond_resched_tasks_rcu_qs();
 }
 
 void init_cfs_throttle_work(struct task_struct *p)
@@ -5866,21 +5905,42 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	return 0;
 }
 
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+	return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+	if (task_has_throttle_work(p))
+		return;
+
+	/*
+	 * Kthreads and exiting tasks don't return to userspace, so adding the
+	 * work is pointless
+	 */
+	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+		return;
+
+	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
 
+	cfs_rq->throttle_count++;
+	if (cfs_rq->throttle_count > 1)
+		return 0;
+
 	/* group is entering throttled state, stop time */
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-		list_del_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+	list_del_leaf_cfs_rq(cfs_rq);
 
-		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-		if (cfs_rq->nr_queued)
-			cfs_rq->throttled_clock_self = rq_clock(rq);
-	}
-	cfs_rq->throttle_count++;
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+	if (cfs_rq->nr_queued)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
 
 	return 0;
 }
@@ -6575,6 +6635,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
 	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
@@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
+	struct task_struct *p;
 
 again:
 	cfs_rq = &rq->cfs;
@@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		cfs_rq = group_cfs_rq(se);
 	} while (cfs_rq);
 
-	return task_of(se);
+	p = task_of(se);
+	if (throttled_hierarchy(cfs_rq_of(se))) {
+		/* Shuold not happen for now */
+		WARN_ON_ONCE(1);
+		task_throttle_setup_work(p);
+	}
+
+	return p;
 }
 
 static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 921527327f107..83f16fc44884f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -736,6 +736,7 @@ struct cfs_rq {
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;
+	struct list_head        throttled_limbo_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.39.5



* [PATCH 3/7] sched/fair: prepare unthrottle path for task based throttle
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
  2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

From: Valentin Schneider <vschneid@redhat.com>

During unthrottle, enqueue back throttled tasks so that they can
resume execution.

This patch only sets up the enqueue mechanism; it does not take effect
yet because no tasks in the throttled hierarchy are placed on the limbo
list as of now.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 55 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e87ceb0a2d37f..74bc320cbc238 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5825,6 +5825,11 @@ static inline int throttled_lb_pair(struct task_group *tg,
 	       throttled_hierarchy(dest_cfs_rq);
 }
 
+static inline bool task_is_throttled(struct task_struct *p)
+{
+	return !list_empty(&p->throttle_node);
+}
+
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static void throttle_cfs_rq_work(struct callback_head *work)
 {
@@ -5876,32 +5881,41 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+	struct task_struct *p, *tmp;
 
 	cfs_rq->throttle_count--;
-	if (!cfs_rq->throttle_count) {
-		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-					     cfs_rq->throttled_clock_pelt;
+	if (cfs_rq->throttle_count)
+		return 0;
 
-		/* Add cfs_rq with load or one or more already running entities to the list */
-		if (!cfs_rq_is_decayed(cfs_rq))
-			list_add_leaf_cfs_rq(cfs_rq);
+	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+		cfs_rq->throttled_clock_pelt;
 
-		if (cfs_rq->throttled_clock_self) {
-			u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+	if (cfs_rq->throttled_clock_self) {
+		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
 
-			cfs_rq->throttled_clock_self = 0;
+		cfs_rq->throttled_clock_self = 0;
 
-			if (WARN_ON_ONCE((s64)delta < 0))
-				delta = 0;
+		if (WARN_ON_ONCE((s64)delta < 0))
+			delta = 0;
 
-			cfs_rq->throttled_clock_self_time += delta;
-		}
+		cfs_rq->throttled_clock_self_time += delta;
+	}
+
+	/* Re-enqueue the tasks that have been throttled at this level. */
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+		list_del_init(&p->throttle_node);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
 	}
 
+	/* Add cfs_rq with load or one or more already running entities to the list */
+	if (!cfs_rq_is_decayed(cfs_rq))
+		list_add_leaf_cfs_rq(cfs_rq);
+
 	return 0;
 }
 
@@ -6059,6 +6073,19 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	long queued_delta, runnable_delta, idle_delta;
 	long rq_h_nr_queued = rq->cfs.h_nr_queued;
 
+	/*
+	 * It's possible we are called with !runtime_remaining due to things
+	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+	 * unthrottled us with a positive runtime_remaining but other still
+	 * running entities consumed those runtime before we reached here.
+	 *
+	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
+	 * throttle, which is not supposed to happen on unthrottle path.
+	 */
+	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+		return;
+
 	se = cfs_rq->tg->se[cpu_of(rq)];
 
 	cfs_rq->throttled = 0;
@@ -6806,6 +6833,7 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7014,6 +7042,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		util_est_enqueue(&rq->cfs, p);
 
 	if (flags & ENQUEUE_DELAYED) {
+		WARN_ON_ONCE(task_is_throttled(p));
 		requeue_delayed_entity(se);
 		return;
 	}
-- 
2.39.5



* [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (2 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH 3/7] sched/fair: prepare unthrottle " Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-22 12:03   ` Peter Zijlstra
  2025-05-23  2:43   ` Chengming Zhou
  2025-05-20 10:41 ` [PATCH 5/7] sched/fair: switch to task based throttle model Aaron Lu
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
core will dequeue it and then requeue it.

A throttled task is still considered queued by core because p->on_rq is
still set, so core will dequeue it; but since the task is already
dequeued on throttle in fair, handle this case properly.

Affinity and sched class changes are similar.

Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 74bc320cbc238..4c66fd8d24389 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		update_rq_clock(rq);
 		WARN_ON_ONCE(!list_empty(&p->throttle_node));
 		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		/*
+		 * Must not add it to limbo list before dequeue or dequeue will
+		 * mistakenly regard this task as an already throttled one.
+		 */
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		resched_curr(rq);
 	}
@@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
 	INIT_LIST_HEAD(&p->throttle_node);
 }
 
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+	/*
+	 * Task is throttled and someone wants to dequeue it again:
+	 * it must be sched/core when core needs to do things like
+	 * task affinity change, task group change, task sched class
+	 * change etc.
+	 */
+	WARN_ON_ONCE(p->se.on_rq);
+	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
+
+	list_del_init(&p->throttle_node);
+}
+
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
@@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
  */
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
+	if (unlikely(task_is_throttled(p))) {
+		dequeue_throttled_task(p, flags);
+		return true;
+	}
+
 	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
 		util_est_dequeue(&rq->cfs, p);
 
-- 
2.39.5



* [PATCH 5/7] sched/fair: switch to task based throttle model
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (3 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-20 10:41 ` [PATCH 6/7] sched/fair: task based throttle time accounting Aaron Lu
  2025-05-20 10:41 ` [PATCH 7/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
  6 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

From: Valentin Schneider <vschneid@redhat.com>

Now with all the preparatory work in place, switch to the task based
throttle model by:
- at throttle time, do not remove the hierarchy from the runqueue but
  rely on the task work added in pick_task_fair() to do the actual
  throttle/dequeue on the task's return-to-user path;
- on unthrottle, enqueue back the throttled tasks from the limbo list.

Since throttle_cfs_rq() no longer removes the hierarchy from the rq, its
return value is not needed. Same for check_cfs_rq_runtime().

A throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since
a task can be enqueued to a throttled cfs_rq and get to run, to not
break the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add
the cfs_rq to the leaf cfs_rq list when it has its first entity enqueued
and delete it from the leaf cfs_rq list when it has no tasks enqueued.

Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 188 ++++++--------------------------------------
 1 file changed, 24 insertions(+), 164 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4c66fd8d24389..a968d334e8730 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5359,18 +5359,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
-		if (!throttled_hierarchy(cfs_rq)) {
-			list_add_leaf_cfs_rq(cfs_rq);
-		} else {
+		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
+		if (throttled_hierarchy(cfs_rq)) {
 			struct rq *rq = rq_of(cfs_rq);
 
 			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
 				cfs_rq->throttled_clock = rq_clock(rq);
 			if (!cfs_rq->throttled_clock_self)
 				cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
 		}
+#endif
 	}
 }
 
@@ -5409,8 +5408,6 @@ static void set_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable--;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
@@ -5431,8 +5428,6 @@ static void clear_delayed(struct sched_entity *se)
 		struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
 		cfs_rq->h_nr_runnable++;
-		if (cfs_rq_throttled(cfs_rq))
-			break;
 	}
 }
 
@@ -5518,8 +5513,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (flags & DEQUEUE_DELAYED)
 		finish_delayed_dequeue_entity(se);
 
-	if (cfs_rq->nr_queued == 0)
+	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
+		if (throttled_hierarchy(cfs_rq))
+			list_del_leaf_cfs_rq(cfs_rq);
+	}
 
 	return true;
 }
@@ -5600,7 +5598,7 @@ pick_next_entity(struct rq *rq, struct cfs_rq *cfs_rq)
 	return se;
 }
 
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
 static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
 {
@@ -5968,22 +5966,22 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 
 	/* group is entering throttled state, stop time */
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
-	list_del_leaf_cfs_rq(cfs_rq);
 
 	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
 	if (cfs_rq->nr_queued)
 		cfs_rq->throttled_clock_self = rq_clock(rq);
+	else
+		list_del_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
 
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta, dequeue = 1;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	int dequeue = 1;
 
 	raw_spin_lock(&cfs_b->lock);
 	/* This will start the period timer if necessary */
@@ -6004,74 +6002,13 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	raw_spin_unlock(&cfs_b->lock);
 
 	if (!dequeue)
-		return false;  /* Throttle no longer required. */
-
-	se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
+		return;  /* Throttle no longer required. */
 
 	/* freeze hierarchy runnable averages while throttled */
 	rcu_read_lock();
 	walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
 	rcu_read_unlock();
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		int flags;
-
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		/*
-		 * Abuse SPECIAL to avoid delayed dequeue in this instance.
-		 * This avoids teaching dequeue_entities() about throttled
-		 * entities and keeps things relatively simple.
-		 */
-		flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
-		if (se->sched_delayed)
-			flags |= DEQUEUE_DELAYED;
-		dequeue_entity(qcfs_rq, se, flags);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-
-		if (qcfs_rq->load.weight) {
-			/* Avoid re-evaluating load for this entity: */
-			se = parent_entity(se);
-			break;
-		}
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-		/* throttled entity or throttle-on-deactivate */
-		if (!se->on_rq)
-			goto done;
-
-		update_load_avg(qcfs_rq, se, 0);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued -= queued_delta;
-		qcfs_rq->h_nr_runnable -= runnable_delta;
-		qcfs_rq->h_nr_idle -= idle_delta;
-	}
-
-	/* At this point se is NULL and we are at root level*/
-	sub_nr_running(rq, queued_delta);
-
-	/* Stop the fair server if throttling resulted in no runnable tasks */
-	if (rq_h_nr_queued && !rq->cfs.h_nr_queued)
-		dl_server_stop(&rq->fair_server);
-done:
 	/*
 	 * Note: distribution will already see us throttled via the
 	 * throttled-list.  rq->lock protects completion.
@@ -6080,16 +6017,14 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
 	if (cfs_rq->nr_queued)
 		cfs_rq->throttled_clock = rq_clock(rq);
-	return true;
+	return;
 }
 
 void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
-	struct sched_entity *se;
-	long queued_delta, runnable_delta, idle_delta;
-	long rq_h_nr_queued = rq->cfs.h_nr_queued;
+	struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
 
 	/*
 	 * It's possible we are called with !runtime_remaining due to things
@@ -6132,62 +6067,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 			if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
 				break;
 		}
-		goto unthrottle_throttle;
 	}
 
-	queued_delta = cfs_rq->h_nr_queued;
-	runnable_delta = cfs_rq->h_nr_runnable;
-	idle_delta = cfs_rq->h_nr_idle;
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		/* Handle any unfinished DELAY_DEQUEUE business first. */
-		if (se->sched_delayed) {
-			int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
-			dequeue_entity(qcfs_rq, se, flags);
-		} else if (se->on_rq)
-			break;
-		enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	for_each_sched_entity(se) {
-		struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
-		update_load_avg(qcfs_rq, se, UPDATE_TG);
-		se_update_runnable(se);
-
-		if (cfs_rq_is_idle(group_cfs_rq(se)))
-			idle_delta = cfs_rq->h_nr_queued;
-
-		qcfs_rq->h_nr_queued += queued_delta;
-		qcfs_rq->h_nr_runnable += runnable_delta;
-		qcfs_rq->h_nr_idle += idle_delta;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(qcfs_rq))
-			goto unthrottle_throttle;
-	}
-
-	/* Start the fair server if un-throttling resulted in new runnable tasks */
-	if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
-		dl_server_start(&rq->fair_server);
-
-	/* At this point se is NULL and we are at root level*/
-	add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
@@ -6569,22 +6450,22 @@ static void sync_throttle(struct task_group *tg, int cpu)
 }
 
 /* conditionally throttle active cfs_rq's from put_prev_entity() */
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	if (!cfs_bandwidth_used())
-		return false;
+		return;
 
 	if (likely(!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0))
-		return false;
+		return;
 
 	/*
 	 * it's possible for a throttled entity to be forced into a running
 	 * state (e.g. set_curr_task), in this case we're finished.
 	 */
 	if (cfs_rq_throttled(cfs_rq))
-		return true;
+		return;
 
-	return throttle_cfs_rq(cfs_rq);
+	throttle_cfs_rq(cfs_rq);
 }
 
 static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
@@ -6846,7 +6727,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 #else /* CONFIG_CFS_BANDWIDTH */
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
-static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
+static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -7104,10 +6985,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
-
 		flags = ENQUEUE_WAKEUP;
 	}
 
@@ -7129,10 +7006,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			goto enqueue_throttle;
 	}
 
 	if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7162,7 +7035,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	if (!task_new)
 		check_update_overutilized_status(rq);
 
-enqueue_throttle:
 	assert_list_leaf_cfs_rq(rq);
 
 	hrtick_update(rq);
@@ -7220,10 +7092,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
-
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7260,10 +7128,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
-
-		/* end evaluation on encountering a throttled cfs_rq */
-		if (cfs_rq_throttled(cfs_rq))
-			return 0;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -8978,8 +8842,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
 			update_curr(cfs_rq);
 
-		if (unlikely(check_cfs_rq_runtime(cfs_rq)))
-			goto again;
+		check_cfs_rq_runtime(cfs_rq);
 
 		se = pick_next_entity(rq, cfs_rq);
 		if (!se)
@@ -8988,11 +8851,8 @@ static struct task_struct *pick_task_fair(struct rq *rq)
 	} while (cfs_rq);
 
 	p = task_of(se);
-	if (throttled_hierarchy(cfs_rq_of(se))) {
-		/* Shuold not happen for now */
-		WARN_ON_ONCE(1);
+	if (throttled_hierarchy(cfs_rq_of(se)))
 		task_throttle_setup_work(p);
-	}
 
 	return p;
 }
-- 
2.39.5



* [PATCH 6/7] sched/fair: task based throttle time accounting
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (4 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH 5/7] sched/fair: switch to task based throttle model Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  2025-05-20 10:41 ` [PATCH 7/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
  6 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

With the task based throttle model, the previous way of checking a
cfs_rq's nr_queued to decide whether throttled time should be accounted
doesn't work as expected: e.g. when a cfs_rq is throttled, its nr_queued
== 1, but that task could then block in kernel mode instead of being
dequeued onto the limbo list. Accounting this as throttled time is not
accurate.

Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.

Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c  | 41 +++++++++++++++++++++++------------------
 kernel/sched/sched.h |  1 +
 2 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a968d334e8730..4646d4f8b878d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5360,16 +5360,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	if (cfs_rq->nr_queued == 1) {
 		check_enqueue_throttle(cfs_rq);
 		list_add_leaf_cfs_rq(cfs_rq);
-#ifdef CONFIG_CFS_BANDWIDTH
-		if (throttled_hierarchy(cfs_rq)) {
-			struct rq *rq = rq_of(cfs_rq);
-
-			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
-				cfs_rq->throttled_clock = rq_clock(rq);
-			if (!cfs_rq->throttled_clock_self)
-				cfs_rq->throttled_clock_self = rq_clock(rq);
-		}
-#endif
 	}
 }
 
@@ -5455,7 +5445,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		 * DELAY_DEQUEUE relies on spurious wakeups, special task
 		 * states must not suffer spurious wakeups, excempt them.
 		 */
-		if (flags & DEQUEUE_SPECIAL)
+		if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
 			delay = false;
 
 		WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5863,7 +5853,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		rq = scope.rq;
 		update_rq_clock(rq);
 		WARN_ON_ONCE(!list_empty(&p->throttle_node));
-		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
 		/*
 		 * Must not add it to limbo list before dequeue or dequeue will
 		 * mistakenly regard this task as an already throttled one.
@@ -5955,6 +5945,17 @@ static inline void task_throttle_setup_work(struct task_struct *p)
 	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
 }
 
+static void record_throttle_clock(struct cfs_rq *cfs_rq)
+{
+	struct rq *rq = rq_of(cfs_rq);
+
+	if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+		cfs_rq->throttled_clock = rq_clock(rq);
+
+	if (!cfs_rq->throttled_clock_self)
+		cfs_rq->throttled_clock_self = rq_clock(rq);
+}
+
 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
@@ -5967,12 +5968,10 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	/* group is entering throttled state, stop time */
 	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock_self = rq_clock(rq);
-	else
+	if (!cfs_rq->nr_queued)
 		list_del_leaf_cfs_rq(cfs_rq);
 
+	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
 	return 0;
 }
@@ -6015,8 +6014,6 @@ static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
 	 */
 	cfs_rq->throttled = 1;
 	WARN_ON_ONCE(cfs_rq->throttled_clock);
-	if (cfs_rq->nr_queued)
-		cfs_rq->throttled_clock = rq_clock(rq);
 	return;
 }
 
@@ -6734,6 +6731,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
 {
@@ -7057,6 +7055,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	bool task_sleep = flags & DEQUEUE_SLEEP;
 	bool task_delayed = flags & DEQUEUE_DELAYED;
+	bool task_throttled = flags & DEQUEUE_THROTTLE;
 	struct task_struct *p = NULL;
 	int h_nr_idle = 0;
 	int h_nr_queued = 0;
@@ -7092,6 +7091,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
 
+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
+
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
 			slice = cfs_rq_min_slice(cfs_rq);
@@ -7128,6 +7130,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = h_nr_queued;
+
+		if (throttled_hierarchy(cfs_rq) && task_throttled)
+			record_throttle_clock(cfs_rq);
 	}
 
 	sub_nr_running(rq, h_nr_queued);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 83f16fc44884f..9ba4b8f988ebf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2360,6 +2360,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_SPECIAL		0x10
 #define DEQUEUE_MIGRATING	0x100 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE	0x800
 
 #define ENQUEUE_WAKEUP		0x01
 #define ENQUEUE_RESTORE		0x02
-- 
2.39.5



* [PATCH 7/7] sched/fair: get rid of throttled_lb_pair()
  2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
                   ` (5 preceding siblings ...)
  2025-05-20 10:41 ` [PATCH 6/7] sched/fair: task based throttle time accounting Aaron Lu
@ 2025-05-20 10:41 ` Aaron Lu
  6 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-20 10:41 UTC (permalink / raw)
  To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka,
	Florian Bezdeka

Now that throttled tasks are dequeued and cannot stay on the rq's
cfs_tasks list, there is no need to take special care of these throttled
tasks in load balance anymore.

Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/fair.c | 33 +++------------------------------
 1 file changed, 3 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4646d4f8b878d..89afa472299b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5796,23 +5796,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return cfs_bandwidth_used() && cfs_rq->throttle_count;
 }
 
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
-	src_cfs_rq = tg->cfs_rq[src_cpu];
-	dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
-	return throttled_hierarchy(src_cfs_rq) ||
-	       throttled_hierarchy(dest_cfs_rq);
-}
-
 static inline bool task_is_throttled(struct task_struct *p)
 {
 	return !list_empty(&p->throttle_node);
@@ -6743,12 +6726,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 	return 0;
 }
 
-static inline int throttled_lb_pair(struct task_group *tg,
-				    int src_cpu, int dest_cpu)
-{
-	return 0;
-}
-
 #ifdef CONFIG_FAIR_GROUP_SCHED
 void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
 static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9387,17 +9364,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
-	 * 2) throttled_lb_pair, or
-	 * 3) cannot be migrated to this CPU due to cpus_ptr, or
-	 * 4) running (obviously), or
-	 * 5) are cache-hot on their current CPU.
+	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
+	 * 3) running (obviously), or
+	 * 4) are cache-hot on their current CPU.
 	 */
 	if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
 		return 0;
 
-	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
-		return 0;
-
 	/*
 	 * We want to prioritize the migration of eligible tasks.
 	 * For ineligible tasks we soft-limit them and only allow
-- 
2.39.5



* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
@ 2025-05-20 12:02   ` Florian Bezdeka
  2025-05-21  6:37     ` Aaron Lu
  2025-05-21  9:01   ` Chengming Zhou
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Florian Bezdeka @ 2025-05-20 12:02 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Tue, 2025-05-20 at 18:41 +0800, Aaron Lu wrote:
> @@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
>  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
>  static inline void sync_throttle(struct task_group *tg, int cpu) {}
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void task_throttle_setup_work(struct task_struct *p) {}
>  
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
> @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>  {
>  	struct sched_entity *se;
>  	struct cfs_rq *cfs_rq;
> +	struct task_struct *p;
>  
>  again:
>  	cfs_rq = &rq->cfs;
> @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>  		cfs_rq = group_cfs_rq(se);
>  	} while (cfs_rq);
>  
> -	return task_of(se);
> +	p = task_of(se);
> +	if (throttled_hierarchy(cfs_rq_of(se))) {
> +		/* Shuold not happen for now */

Typo: s/Shuold/Should

Btw: Is there a trace point in place when throttling/unthrottling
happens? Would be nice to see that in a trace, but might be that I
missed those events in my configuration up to now.

> +		WARN_ON_ONCE(1);
> +		task_throttle_setup_work(p);
> +	}
> +
> +	return p;
>  }



* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 12:02   ` Florian Bezdeka
@ 2025-05-21  6:37     ` Aaron Lu
  2025-05-21 11:51       ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-21  6:37 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Tue, May 20, 2025 at 02:02:54PM +0200, Florian Bezdeka wrote:
> On Tue, 2025-05-20 at 18:41 +0800, Aaron Lu wrote:
> > @@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> >  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> >  static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > +static void task_throttle_setup_work(struct task_struct *p) {}
> >  
> >  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >  {
> > @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >  {
> >  	struct sched_entity *se;
> >  	struct cfs_rq *cfs_rq;
> > +	struct task_struct *p;
> >  
> >  again:
> >  	cfs_rq = &rq->cfs;
> > @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >  		cfs_rq = group_cfs_rq(se);
> >  	} while (cfs_rq);
> >  
> > -	return task_of(se);
> > +	p = task_of(se);
> > +	if (throttled_hierarchy(cfs_rq_of(se))) {
> > +		/* Shuold not happen for now */
> 
> Typo: s/Shuold/Should
>

Ah right, thanks.

> Btw: Is there a trace point in place when throttling/unthrottling
> happens? Would be nice to see that in a trace, but might be that I
> missed those events in my configuration up to now.

There is no trace point for throttle related events yet.

I tried to see if bpftrace can do the job but unfortunately, bpftrace
does not allow a variable as an array index, so I failed to get the
cfs_rq when probing tg_throttle_down(tg, data)...

Thanks,
Aaron

> > +		WARN_ON_ONCE(1);
> > +		task_throttle_setup_work(p);
> > +	}
> > +
> > +	return p;
> >  }
> 


* Re: [PATCH 1/7] sched/fair: Add related data structure for task based throttle
  2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-05-21  8:48   ` Chengming Zhou
  0 siblings, 0 replies; 44+ messages in thread
From: Chengming Zhou @ 2025-05-21  8:48 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/20 18:41, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> 
> Add related data structures for this new throttle functionality.
> 
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>

Looks good to me.

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks!

> ---
>   include/linux/sched.h |  4 ++++
>   kernel/sched/core.c   |  3 +++
>   kernel/sched/fair.c   | 12 ++++++++++++
>   kernel/sched/sched.h  |  2 ++
>   4 files changed, 21 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index b98195991031c..055f3782eeaee 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -879,6 +879,10 @@ struct task_struct {
>   
>   #ifdef CONFIG_CGROUP_SCHED
>   	struct task_group		*sched_task_group;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	struct callback_head		sched_throttle_work;
> +	struct list_head		throttle_node;
> +#endif
>   #endif
>   
>   
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bece0ba6f5b3a..b7ca7cefee54e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4499,6 +4499,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>   
>   #ifdef CONFIG_FAIR_GROUP_SCHED
>   	p->se.cfs_rq			= NULL;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +	init_cfs_throttle_work(p);
> +#endif
>   #endif
>   
>   #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eb5a2572b4f8b..75bf6186a5137 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5825,6 +5825,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
>   	       throttled_hierarchy(dest_cfs_rq);
>   }
>   
> +static void throttle_cfs_rq_work(struct callback_head *work)
> +{
> +}
> +
> +void init_cfs_throttle_work(struct task_struct *p)
> +{
> +	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
> +	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
> +	p->sched_throttle_work.next = &p->sched_throttle_work;
> +	INIT_LIST_HEAD(&p->throttle_node);
> +}
> +
>   static int tg_unthrottle_up(struct task_group *tg, void *data)
>   {
>   	struct rq *rq = data;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c5a6a503eb6de..921527327f107 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2703,6 +2703,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
>   
>   extern void init_dl_entity(struct sched_dl_entity *dl_se);
>   
> +extern void init_cfs_throttle_work(struct task_struct *p);
> +
>   #define BW_SHIFT		20
>   #define BW_UNIT			(1 << BW_SHIFT)
>   #define RATIO_SHIFT		8


* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
  2025-05-20 12:02   ` Florian Bezdeka
@ 2025-05-21  9:01   ` Chengming Zhou
  2025-05-21  9:21     ` [External] " Aaron Lu
  2025-05-22 10:48   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2025-05-21  9:01 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/20 18:41, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
> 
> In the current throttle model, when a cfs_rq is throttled, its entity will
> be dequeued from the cpu's rq, making tasks attached to it unable to run,
> thus achieving the throttle target.
> 
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken up, it can not run till its task group's
> next period comes, which can be a relatively long time. The waiting writer
> will have to wait longer due to this, and it also makes further readers
> build up, eventually triggering hung task warnings.
> 
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, record its throttled status but do not remove
> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> they get picked, add a task work to them so that when they return
> to user, they can be dequeued. In this way, tasks throttled will not
> hold any kernel resources.
> 
> To avoid breaking bisect, preserve the current throttle behavior by
> still dequeuing throttled hierarchy from rq and because of this, no task
> can have that throttle task work added yet. The throttle model will
> switch to task based in a later patch.
> 
> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>

I'm wondering whether we could put patches 02-04 together, since it's strange
to set up the task work in this patch without changing throttle_cfs_rq(),
which makes the reviewing process a bit confusing. WDYT?

Thanks!

> ---
>   kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++-----
>   kernel/sched/sched.h |  1 +
>   2 files changed, 80 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 75bf6186a5137..e87ceb0a2d37f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5825,8 +5825,47 @@ static inline int throttled_lb_pair(struct task_group *tg,
>   	       throttled_hierarchy(dest_cfs_rq);
>   }
>   
> +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>   static void throttle_cfs_rq_work(struct callback_head *work)
>   {
> +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> +	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq;
> +	struct rq *rq;
> +
> +	WARN_ON_ONCE(p != current);
> +	p->sched_throttle_work.next = &p->sched_throttle_work;
> +
> +	/*
> +	 * If task is exiting, then there won't be a return to userspace, so we
> +	 * don't have to bother with any of this.
> +	 */
> +	if ((p->flags & PF_EXITING))
> +		return;
> +
> +	scoped_guard(task_rq_lock, p) {
> +		se = &p->se;
> +		cfs_rq = cfs_rq_of(se);
> +
> +		/* Raced, forget */
> +		if (p->sched_class != &fair_sched_class)
> +			return;
> +
> +		/*
> +		 * If not in limbo, then either replenish has happened or this
> +		 * task got migrated out of the throttled cfs_rq, move along.
> +		 */
> +		if (!cfs_rq->throttle_count)
> +			return;
> +		rq = scope.rq;
> +		update_rq_clock(rq);
> +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> +		resched_curr(rq);
> +	}
> +
> +	cond_resched_tasks_rcu_qs();
>   }
>   
>   void init_cfs_throttle_work(struct task_struct *p)
> @@ -5866,21 +5905,42 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>   	return 0;
>   }
>   
> +static inline bool task_has_throttle_work(struct task_struct *p)
> +{
> +	return p->sched_throttle_work.next != &p->sched_throttle_work;
> +}
> +
> +static inline void task_throttle_setup_work(struct task_struct *p)
> +{
> +	if (task_has_throttle_work(p))
> +		return;
> +
> +	/*
> +	 * Kthreads and exiting tasks don't return to userspace, so adding the
> +	 * work is pointless
> +	 */
> +	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
> +		return;
> +
> +	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> +}
> +
>   static int tg_throttle_down(struct task_group *tg, void *data)
>   {
>   	struct rq *rq = data;
>   	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>   
> +	cfs_rq->throttle_count++;
> +	if (cfs_rq->throttle_count > 1)
> +		return 0;
> +
>   	/* group is entering throttled state, stop time */
> -	if (!cfs_rq->throttle_count) {
> -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> -		list_del_leaf_cfs_rq(cfs_rq);
> +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> +	list_del_leaf_cfs_rq(cfs_rq);
>   
> -		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> -		if (cfs_rq->nr_queued)
> -			cfs_rq->throttled_clock_self = rq_clock(rq);
> -	}
> -	cfs_rq->throttle_count++;
> +	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> +	if (cfs_rq->nr_queued)
> +		cfs_rq->throttled_clock_self = rq_clock(rq);
>   
>   	return 0;
>   }
> @@ -6575,6 +6635,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>   	cfs_rq->runtime_enabled = 0;
>   	INIT_LIST_HEAD(&cfs_rq->throttled_list);
>   	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
> +	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
>   }
>   
>   void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> @@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
>   static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
>   static inline void sync_throttle(struct task_group *tg, int cpu) {}
>   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> +static void task_throttle_setup_work(struct task_struct *p) {}
>   
>   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>   {
> @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>   {
>   	struct sched_entity *se;
>   	struct cfs_rq *cfs_rq;
> +	struct task_struct *p;
>   
>   again:
>   	cfs_rq = &rq->cfs;
> @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>   		cfs_rq = group_cfs_rq(se);
>   	} while (cfs_rq);
>   
> -	return task_of(se);
> +	p = task_of(se);
> +	if (throttled_hierarchy(cfs_rq_of(se))) {
> +		/* Shuold not happen for now */
> +		WARN_ON_ONCE(1);
> +		task_throttle_setup_work(p);
> +	}
> +
> +	return p;
>   }
>   
>   static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 921527327f107..83f16fc44884f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -736,6 +736,7 @@ struct cfs_rq {
>   	int			throttle_count;
>   	struct list_head	throttled_list;
>   	struct list_head	throttled_csd_list;
> +	struct list_head        throttled_limbo_list;
>   #endif /* CONFIG_CFS_BANDWIDTH */
>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>   };

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-21  9:01   ` Chengming Zhou
@ 2025-05-21  9:21     ` Aaron Lu
  2025-05-22 11:43       ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-21  9:21 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Wed, May 21, 2025 at 05:01:58PM +0800, Chengming Zhou wrote:
> On 2025/5/20 18:41, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> > 
> > In the current throttle model, when a cfs_rq is throttled, its entity will
> > be dequeued from the cpu's rq, making tasks attached to it unable to run,
> > thus achieving the throttle target.
> > 
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken up, it can not run till its task group's
> > next period comes, which can be a relatively long time. The waiting writer
> > will have to wait longer due to this, and it also makes further readers
> > build up, eventually triggering hung task warnings.
> > 
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, record its throttled status but do not remove
> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > they get picked, add a task work to them so that when they return
> > to user, they can be dequeued. In this way, tasks throttled will not
> > hold any kernel resources.
> > 
> > To avoid breaking bisect, preserve the current throttle behavior by
> > still dequeuing throttled hierarchy from rq and because of this, no task
> > can have that throttle task work added yet. The throttle model will
> > switch to task based in a later patch.
> > 
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> 
> I'm wondering whether we could put patches 02-04 together, since it's strange
> to set up the task work in this patch without changing throttle_cfs_rq(),

Do you mean 02-05?
Because the actual change to throttle_cfs_rq() happens in patch5 :)

> which makes the reviewing process a bit confusing. WDYT?

Yes, I agree it looks a bit confusing.

The point is to not break bisect while making review easier; merging
all task based throttle related patches together would mean putting
patches 02-05 together, which seems too big?

Thanks,
Aaron

> > ---
> >   kernel/sched/fair.c  | 88 +++++++++++++++++++++++++++++++++++++++-----
> >   kernel/sched/sched.h |  1 +
> >   2 files changed, 80 insertions(+), 9 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 75bf6186a5137..e87ceb0a2d37f 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5825,8 +5825,47 @@ static inline int throttled_lb_pair(struct task_group *tg,
> >   	       throttled_hierarchy(dest_cfs_rq);
> >   }
> > +static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> >   static void throttle_cfs_rq_work(struct callback_head *work)
> >   {
> > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > +	struct sched_entity *se;
> > +	struct cfs_rq *cfs_rq;
> > +	struct rq *rq;
> > +
> > +	WARN_ON_ONCE(p != current);
> > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > +
> > +	/*
> > +	 * If task is exiting, then there won't be a return to userspace, so we
> > +	 * don't have to bother with any of this.
> > +	 */
> > +	if ((p->flags & PF_EXITING))
> > +		return;
> > +
> > +	scoped_guard(task_rq_lock, p) {
> > +		se = &p->se;
> > +		cfs_rq = cfs_rq_of(se);
> > +
> > +		/* Raced, forget */
> > +		if (p->sched_class != &fair_sched_class)
> > +			return;
> > +
> > +		/*
> > +		 * If not in limbo, then either replenish has happened or this
> > +		 * task got migrated out of the throttled cfs_rq, move along.
> > +		 */
> > +		if (!cfs_rq->throttle_count)
> > +			return;
> > +		rq = scope.rq;
> > +		update_rq_clock(rq);
> > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > +		resched_curr(rq);
> > +	}
> > +
> > +	cond_resched_tasks_rcu_qs();
> >   }
> >   void init_cfs_throttle_work(struct task_struct *p)
> > @@ -5866,21 +5905,42 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> >   	return 0;
> >   }
> > +static inline bool task_has_throttle_work(struct task_struct *p)
> > +{
> > +	return p->sched_throttle_work.next != &p->sched_throttle_work;
> > +}
> > +
> > +static inline void task_throttle_setup_work(struct task_struct *p)
> > +{
> > +	if (task_has_throttle_work(p))
> > +		return;
> > +
> > +	/*
> > +	 * Kthreads and exiting tasks don't return to userspace, so adding the
> > +	 * work is pointless
> > +	 */
> > +	if ((p->flags & (PF_EXITING | PF_KTHREAD)))
> > +		return;
> > +
> > +	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
> > +}
> > +
> >   static int tg_throttle_down(struct task_group *tg, void *data)
> >   {
> >   	struct rq *rq = data;
> >   	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
> > +	cfs_rq->throttle_count++;
> > +	if (cfs_rq->throttle_count > 1)
> > +		return 0;
> > +
> >   	/* group is entering throttled state, stop time */
> > -	if (!cfs_rq->throttle_count) {
> > -		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > -		list_del_leaf_cfs_rq(cfs_rq);
> > +	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > +	list_del_leaf_cfs_rq(cfs_rq);
> > -		WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> > -		if (cfs_rq->nr_queued)
> > -			cfs_rq->throttled_clock_self = rq_clock(rq);
> > -	}
> > -	cfs_rq->throttle_count++;
> > +	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
> > +	if (cfs_rq->nr_queued)
> > +		cfs_rq->throttled_clock_self = rq_clock(rq);
> >   	return 0;
> >   }
> > @@ -6575,6 +6635,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> >   	cfs_rq->runtime_enabled = 0;
> >   	INIT_LIST_HEAD(&cfs_rq->throttled_list);
> >   	INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
> > +	INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
> >   }
> >   void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > @@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> >   static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> >   static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > +static void task_throttle_setup_work(struct task_struct *p) {}
> >   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >   {
> > @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >   {
> >   	struct sched_entity *se;
> >   	struct cfs_rq *cfs_rq;
> > +	struct task_struct *p;
> >   again:
> >   	cfs_rq = &rq->cfs;
> > @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >   		cfs_rq = group_cfs_rq(se);
> >   	} while (cfs_rq);
> > -	return task_of(se);
> > +	p = task_of(se);
> > +	if (throttled_hierarchy(cfs_rq_of(se))) {
> > +		/* Shuold not happen for now */
> > +		WARN_ON_ONCE(1);
> > +		task_throttle_setup_work(p);
> > +	}
> > +
> > +	return p;
> >   }
> >   static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 921527327f107..83f16fc44884f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -736,6 +736,7 @@ struct cfs_rq {
> >   	int			throttle_count;
> >   	struct list_head	throttled_list;
> >   	struct list_head	throttled_csd_list;
> > +	struct list_head        throttled_limbo_list;
> >   #endif /* CONFIG_CFS_BANDWIDTH */
> >   #endif /* CONFIG_FAIR_GROUP_SCHED */
> >   };

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-21  6:37     ` Aaron Lu
@ 2025-05-21 11:51       ` Aaron Lu
  0 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-21 11:51 UTC (permalink / raw)
  To: Florian Bezdeka
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chengming Zhou, Chuyi Zhou, Jan Kiszka

On Wed, May 21, 2025 at 02:37:05PM +0800, Aaron Lu wrote:
> On Tue, May 20, 2025 at 02:02:54PM +0200, Florian Bezdeka wrote:
> > On Tue, 2025-05-20 at 18:41 +0800, Aaron Lu wrote:
> > > @@ -6744,6 +6805,7 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> > >  static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> > >  static inline void sync_throttle(struct task_group *tg, int cpu) {}
> > >  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > > +static void task_throttle_setup_work(struct task_struct *p) {}
> > >  
> > >  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> > >  {
> > > @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > >  {
> > >  	struct sched_entity *se;
> > >  	struct cfs_rq *cfs_rq;
> > > +	struct task_struct *p;
> > >  
> > >  again:
> > >  	cfs_rq = &rq->cfs;
> > > @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > >  		cfs_rq = group_cfs_rq(se);
> > >  	} while (cfs_rq);
> > >  
> > > -	return task_of(se);
> > > +	p = task_of(se);
> > > +	if (throttled_hierarchy(cfs_rq_of(se))) {
> > > +		/* Shuold not happen for now */
> > 
> > Typo: s/Shuold/Should
> >
> 
> Ah right, thanks.
> 
> > Btw: Is there a trace point in place when throttling/unthrottling
> > happens? Would be nice to see that in a trace, but might be that I
> > missed those events in my configuration up to now.
> 
> There is no trace point for throttle related events yet.
> 
> I tried to see if bpftrace could do the job but unfortunately, bpftrace
> does not allow a variable as an array index, so I failed to get the cfs_rq when
> probing tg_throttle_down(tg, data)...
> 

@ajor from IRC helped me solve this problem, so I think the following
bpftrace script can serve the purpose until trace events for
throttle/unthrottle are supported:

kfunc:tg_throttle_down
{
	$rq = (struct rq *)args->data;
	$cpu = $rq->cpu;
	$cfs_rq = *(args->tg->cfs_rq + $cpu);

	if ($cfs_rq->throttle_count == 0) {
		printf("%llu cfs_rq %s/%d throttled\n", nsecs, str(args->tg->css.cgroup->kn->name), $cpu);
	}
}

kfunc:tg_unthrottle_up
{
	$rq = (struct rq *)args->data;
	$cpu = $rq->cpu;
	$cfs_rq = *(args->tg->cfs_rq + $cpu);

	if ($cfs_rq->throttle_count == 1) {
		printf("%llu cfs_rq %s/%d unthrottled\n", nsecs, str(args->tg->css.cgroup->kn->name), $cpu);
	}
}
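
Assuming the two probes above are saved to a file, say cfs_throttle.bt (the
file name is just an example), they can be run directly with bpftrace, which
needs root privileges:

	sudo bpftrace cfs_throttle.bt

It then prints one line per cfs_rq throttle/unthrottle event until interrupted
with Ctrl-C.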

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
  2025-05-20 12:02   ` Florian Bezdeka
  2025-05-21  9:01   ` Chengming Zhou
@ 2025-05-22 10:48   ` Peter Zijlstra
  2025-05-22 11:44     ` Aaron Lu
  2025-05-22 11:07   ` Peter Zijlstra
  2025-05-23 12:35   ` Peter Zijlstra
  4 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-22 10:48 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:

>  static void throttle_cfs_rq_work(struct callback_head *work)
>  {
> +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> +	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq;
> +	struct rq *rq;
> +
> +	WARN_ON_ONCE(p != current);
> +	p->sched_throttle_work.next = &p->sched_throttle_work;
> +
> +	/*
> +	 * If task is exiting, then there won't be a return to userspace, so we
> +	 * don't have to bother with any of this.
> +	 */
> +	if ((p->flags & PF_EXITING))
> +		return;
> +
> +	scoped_guard(task_rq_lock, p) {
> +		se = &p->se;
> +		cfs_rq = cfs_rq_of(se);
> +
> +		/* Raced, forget */
> +		if (p->sched_class != &fair_sched_class)
> +			return;
> +
> +		/*
> +		 * If not in limbo, then either replenish has happened or this
> +		 * task got migrated out of the throttled cfs_rq, move along.
> +		 */
> +		if (!cfs_rq->throttle_count)
> +			return;
> +		rq = scope.rq;
> +		update_rq_clock(rq);
> +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> +		resched_curr(rq);
> +	}
> +
> +	cond_resched_tasks_rcu_qs();
>  }

What's that cond_resched thing about? The general plan is to make
cond_resched go away.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
                     ` (2 preceding siblings ...)
  2025-05-22 10:48   ` Peter Zijlstra
@ 2025-05-22 11:07   ` Peter Zijlstra
  2025-05-23  7:40     ` Aaron Lu
  2025-05-23 12:35   ` Peter Zijlstra
  4 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-22 11:07 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>  {
>  	struct sched_entity *se;
>  	struct cfs_rq *cfs_rq;
> +	struct task_struct *p;
>  
>  again:
>  	cfs_rq = &rq->cfs;
> @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>  		cfs_rq = group_cfs_rq(se);
>  	} while (cfs_rq);
>  
> -	return task_of(se);
> +	p = task_of(se);
> +	if (throttled_hierarchy(cfs_rq_of(se))) {
> +		/* Shuold not happen for now */
> +		WARN_ON_ONCE(1);
> +		task_throttle_setup_work(p);
> +	}
> +
> +	return p;
>  }

So the final code is a little different, because you're removing the
return value from check_cfs_rq_runtime().

But would not that exact return value be the thing you're now checking
for again?

That is; at the end of the series, would not something like:

static struct task_struct *pick_task_fair(struct rq *rq)
{
	struct sched_entity *se;
	struct cfs_rq *cfs_rq;
	struct task_struct *p;
	bool throttled;

again:
	cfs_rq = &rq->cfs;
	if (!cfs_rq->nr_queued)
		return NULL;

	throttled = false;

	do {
		if (cfs_rq->curr && cfs_rq->curr->on_rq)
			update_curr(cfs_rq);

		throttled |= check_cfs_rq_runtime(cfs_rq);

		se = pick_next_entity(rq, cfs_rq);
		if (!se)
			goto again;

		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);
	if (unlikely(throttled))
		task_throttle_setup_work(p);
	return p;
}

make it more obvious / be simpler?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [External] Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-21  9:21     ` [External] " Aaron Lu
@ 2025-05-22 11:43       ` Chengming Zhou
  2025-05-23  8:03         ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2025-05-22 11:43 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/21 17:21, Aaron Lu wrote:
> On Wed, May 21, 2025 at 05:01:58PM +0800, Chengming Zhou wrote:
>> On 2025/5/20 18:41, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> In the current throttle model, when a cfs_rq is throttled, its entity will
>>> be dequeued from the cpu's rq, making tasks attached to it unable to run,
>>> thus achieving the throttle target.
>>>
>>> This has a drawback though: assume a task is a reader of percpu_rwsem
>>> and is waiting. When it gets woken up, it can not run till its task group's
>>> next period comes, which can be a relatively long time. The waiting writer
>>> will have to wait longer due to this, and it also makes further readers
>>> build up, eventually triggering hung task warnings.
>>>
>>> To improve this situation, change the throttle model to task based, i.e.
>>> when a cfs_rq is throttled, record its throttled status but do not remove
>>> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
>>> they get picked, add a task work to them so that when they return
>>> to user, they can be dequeued. In this way, tasks throttled will not
>>> hold any kernel resources.
>>>
>>> To avoid breaking bisect, preserve the current throttle behavior by
>>> still dequeuing throttled hierarchy from rq and because of this, no task
>>> can have that throttle task work added yet. The throttle model will
>>> switch to task based in a later patch.
>>>
>>> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>
>> I'm wondering whether we could put patches 02-04 together, since it's strange
>> to set up the task work in this patch without changing throttle_cfs_rq(),
> 
> Do you mean 02-05?
> Because the actual change to throttle_cfs_rq() happens in patch5 :)

Ah, right.

> 
>> which makes the reviewing process a bit confusing. WDYT?
> 
> Yes, I agree it looks a bit confusing.
> 
> The point is to not break bisect while making review easier; merging
> all task based throttle related patches together would mean putting
> patches 02-05 together, which seems too big?

Yeah, a big patch but complete, instead of changing a bit on the same
function in each patch. Anyway, it's your call :-)

Thanks!

> 
> Thanks,
> Aaron
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 10:48   ` Peter Zijlstra
@ 2025-05-22 11:44     ` Aaron Lu
  2025-05-22 11:54       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-22 11:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> 
> >  static void throttle_cfs_rq_work(struct callback_head *work)
> >  {
> > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > +	struct sched_entity *se;
> > +	struct cfs_rq *cfs_rq;
> > +	struct rq *rq;
> > +
> > +	WARN_ON_ONCE(p != current);
> > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > +
> > +	/*
> > +	 * If task is exiting, then there won't be a return to userspace, so we
> > +	 * don't have to bother with any of this.
> > +	 */
> > +	if ((p->flags & PF_EXITING))
> > +		return;
> > +
> > +	scoped_guard(task_rq_lock, p) {
> > +		se = &p->se;
> > +		cfs_rq = cfs_rq_of(se);
> > +
> > +		/* Raced, forget */
> > +		if (p->sched_class != &fair_sched_class)
> > +			return;
> > +
> > +		/*
> > +		 * If not in limbo, then either replenish has happened or this
> > +		 * task got migrated out of the throttled cfs_rq, move along.
> > +		 */
> > +		if (!cfs_rq->throttle_count)
> > +			return;
> > +		rq = scope.rq;
> > +		update_rq_clock(rq);
> > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > +		resched_curr(rq);
> > +	}
> > +
> > +	cond_resched_tasks_rcu_qs();
> >  }
> 
> What's that cond_resched thing about? The general plan is to make
> cond_resched go away.

Got it.

The purpose is to let the throttled task schedule and also mark a task RCU
quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
will be scheduled by the cond_resched() in task_work_run() and, since that
is a preempt schedule, it does not mark a task RCU quiescent state.

Any suggestion here? Perhaps a plain schedule()? Thanks.
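
For reference, a rough sketch from memory of what the helper expands to in
recent kernels (include/linux/sched.h); it reports a task RCU quiescent state
for current and then reschedules, whereas the preempt schedule mentioned above
does not report that quiescent state:

	#define cond_resched_tasks_rcu_qs()		\
	do {						\
		rcu_tasks_qs(current, false);		\
		cond_resched();				\
	} while (0)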

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 11:44     ` Aaron Lu
@ 2025-05-22 11:54       ` Peter Zijlstra
  2025-05-22 12:40         ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-22 11:54 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > 
> > >  static void throttle_cfs_rq_work(struct callback_head *work)
> > >  {
> > > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > +	struct sched_entity *se;
> > > +	struct cfs_rq *cfs_rq;
> > > +	struct rq *rq;
> > > +
> > > +	WARN_ON_ONCE(p != current);
> > > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > > +
> > > +	/*
> > > +	 * If task is exiting, then there won't be a return to userspace, so we
> > > +	 * don't have to bother with any of this.
> > > +	 */
> > > +	if ((p->flags & PF_EXITING))
> > > +		return;
> > > +
> > > +	scoped_guard(task_rq_lock, p) {
> > > +		se = &p->se;
> > > +		cfs_rq = cfs_rq_of(se);
> > > +
> > > +		/* Raced, forget */
> > > +		if (p->sched_class != &fair_sched_class)
> > > +			return;
> > > +
> > > +		/*
> > > +		 * If not in limbo, then either replenish has happened or this
> > > +		 * task got migrated out of the throttled cfs_rq, move along.
> > > +		 */
> > > +		if (!cfs_rq->throttle_count)
> > > +			return;
> > > +		rq = scope.rq;
> > > +		update_rq_clock(rq);
> > > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > +		resched_curr(rq);
> > > +	}
> > > +
> > > +	cond_resched_tasks_rcu_qs();
> > >  }
> > 
> > What's that cond_resched thing about? The general plan is to make
> > cond_resched go away.
> 
> Got it.
> 
> > The purpose is to let the throttled task schedule and also mark a task RCU
> > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > will be scheduled by the cond_resched() in task_work_run() and, since that
> > is a preempt schedule, it does not mark a task RCU quiescent state.
> 
> Any suggestion here? Perhaps a plain schedule()? Thanks.

I am confused, this is task_work_run(), that is run from
exit_to_user_mode_loop(), which contains a schedule().



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
@ 2025-05-22 12:03   ` Peter Zijlstra
  2025-05-22 12:49     ` Aaron Lu
  2025-05-23  2:43   ` Chengming Zhou
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-22 12:03 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Tue, May 20, 2025 at 06:41:07PM +0800, Aaron Lu wrote:
> On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
> core will dequeue it and then requeue it.
> 
> The throttled task is still considered queued by core because p->on_rq
> is still set, so core will dequeue it; but since the task is already
> dequeued on throttle in fair, handle this case properly.
> 
> Affinity and sched class changes are similar.
> 
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>  kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 74bc320cbc238..4c66fd8d24389 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>  		update_rq_clock(rq);
>  		WARN_ON_ONCE(!list_empty(&p->throttle_node));
>  		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> +		/*
> +		 * Must not add it to limbo list before dequeue or dequeue will
> +		 * mistakenly regard this task as an already throttled one.
> +		 */
>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>  		resched_curr(rq);
>  	}
> @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
>  	INIT_LIST_HEAD(&p->throttle_node);
>  }
>  
> +static void dequeue_throttled_task(struct task_struct *p, int flags)
> +{
> +	/*
> +	 * Task is throttled and someone wants to dequeue it again:
> +	 * it must be sched/core when core needs to do things like
> +	 * task affinity change, task group change, task sched class
> +	 * change etc.
> +	 */
> +	WARN_ON_ONCE(p->se.on_rq);
> +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> +
> +	list_del_init(&p->throttle_node);
> +}
> +
>  static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>  static int tg_unthrottle_up(struct task_group *tg, void *data)
>  {
> @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>  static void task_throttle_setup_work(struct task_struct *p) {}
>  static bool task_is_throttled(struct task_struct *p) { return false; }
> +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
>  
>  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>  {
> @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>   */
>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  {
> +	if (unlikely(task_is_throttled(p))) {
> +		dequeue_throttled_task(p, flags);
> +		return true;
> +	}
> +
>  	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>  		util_est_dequeue(&rq->cfs, p);

This is asymmetric -- dequeue removes it from that throttle list, but
the corresponding enqueue will not add it back, what gives?

Because now we have:

 p->on_rq=1
 p->throttle_node on list

move_queued_task()
  deactivate_task()
    dequeue_task_fair()
      list_del_init(throttle_node)
    p->on_rq = 2

  activate_task()
    enqueue_task_fair()
      // nothing special, makes the thing runnable
    p->on_rq = 1;

and we exit with a task that is on-rq and not throttled ?!?

Why is this? Are we relying on pick_task_fair() to dequeue it again and
fix up our inconsistencies? If so, that had better have a comment on.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 11:54       ` Peter Zijlstra
@ 2025-05-22 12:40         ` Aaron Lu
  2025-05-23  9:53           ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-22 12:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 01:54:18PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> > On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > 
> > > >  static void throttle_cfs_rq_work(struct callback_head *work)
> > > >  {
> > > > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > > +	struct sched_entity *se;
> > > > +	struct cfs_rq *cfs_rq;
> > > > +	struct rq *rq;
> > > > +
> > > > +	WARN_ON_ONCE(p != current);
> > > > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > > > +
> > > > +	/*
> > > > +	 * If task is exiting, then there won't be a return to userspace, so we
> > > > +	 * don't have to bother with any of this.
> > > > +	 */
> > > > +	if ((p->flags & PF_EXITING))
> > > > +		return;
> > > > +
> > > > +	scoped_guard(task_rq_lock, p) {
> > > > +		se = &p->se;
> > > > +		cfs_rq = cfs_rq_of(se);
> > > > +
> > > > +		/* Raced, forget */
> > > > +		if (p->sched_class != &fair_sched_class)
> > > > +			return;
> > > > +
> > > > +		/*
> > > > +		 * If not in limbo, then either replenish has happened or this
> > > > +		 * task got migrated out of the throttled cfs_rq, move along.
> > > > +		 */
> > > > +		if (!cfs_rq->throttle_count)
> > > > +			return;
> > > > +		rq = scope.rq;
> > > > +		update_rq_clock(rq);
> > > > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > > +		resched_curr(rq);
> > > > +	}
> > > > +
> > > > +	cond_resched_tasks_rcu_qs();
> > > >  }
> > > 
> > > What's that cond_resched thing about? The general plan is to make
> > > cond_resched go away.
> > 
> > Got it.
> > 
> > > The purpose is to let the throttled task schedule and also mark a task RCU
> > > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > > will be scheduled by the cond_resched() in task_work_run() and, since that
> > > is a preempt schedule, it does not mark a task RCU quiescent state.
> > 
> > Any suggestion here? Perhaps a plain schedule()? Thanks.
> 
> I am confused, this is task_work_run(), that is run from
> exit_to_user_mode_loop(), which contains a schedule().

There is a cond_resched() in the task_work_run() loop:

		do {
			next = work->next;
			work->func(work);
			work = next;
			cond_resched();
		} while (work);

And when this throttle work returns with the need_resched bit set, that
cond_resched() will cause a schedule, but it does not mark a task RCU
quiescent state...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-22 12:03   ` Peter Zijlstra
@ 2025-05-22 12:49     ` Aaron Lu
  2025-05-23 14:59       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-22 12:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 06:41:07PM +0800, Aaron Lu wrote:
> > On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
> > core will dequeue it and then requeue it.
> > 
> > The throttled task is still considered queued by core because p->on_rq
> > is still set, so core will dequeue it; but since the task is already
> > dequeued on throttle in fair, handle this case properly.
> > 
> > Affinity and sched class changes are similar.
> > 
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >  kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> >  1 file changed, 24 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 74bc320cbc238..4c66fd8d24389 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
> >  		update_rq_clock(rq);
> >  		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> >  		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		/*
> > +		 * Must not add it to limbo list before dequeue or dequeue will
> > +		 * mistakenly regard this task as an already throttled one.
> > +		 */
> >  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >  		resched_curr(rq);
> >  	}
> > @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
> >  	INIT_LIST_HEAD(&p->throttle_node);
> >  }
> >  
> > +static void dequeue_throttled_task(struct task_struct *p, int flags)
> > +{
> > +	/*
> > +	 * Task is throttled and someone wants to dequeue it again:
> > +	 * it must be sched/core when core needs to do things like
> > +	 * task affinity change, task group change, task sched class
> > +	 * change etc.
> > +	 */
> > +	WARN_ON_ONCE(p->se.on_rq);
> > +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> > +
> > +	list_del_init(&p->throttle_node);
> > +}
> > +
> >  static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> >  static int tg_unthrottle_up(struct task_group *tg, void *data)
> >  {
> > @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> >  static void task_throttle_setup_work(struct task_struct *p) {}
> >  static bool task_is_throttled(struct task_struct *p) { return false; }
> > +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
> >  
> >  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >  {
> > @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >   */
> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  {
> > +	if (unlikely(task_is_throttled(p))) {
> > +		dequeue_throttled_task(p, flags);
> > +		return true;
> > +	}
> > +
> >  	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> >  		util_est_dequeue(&rq->cfs, p);
> 
> This is asymmetric -- dequeue removes it from that throttle list, but
> the corresponding enqueue will not add it back, what gives?
> 
> Because now we have:
> 
>  p->on_rq=1
>  p->throttle_node on list
> 
> move_queued_task()
>   deactivate_task()
>     dequeue_task_fair()
>       list_del_init(throttle_node)
>     p->on_rq = 2
> 
>   activate_task()
>     enqueue_task_fair()
>       // nothing special, makes the thing runnable
>     p->on_rq = 1;
> 
> and we exit with a task that is on-rq and not throttled ?!?
>
> Why is this? Are we relying on pick_task_fair() to dequeue it again and
> fix up our inconsistencies? If so, that had better have a comment on.

Correct.

Does the following comment look OK?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89afa472299b7..4f4d64cf31fb1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7147,6 +7147,10 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 {
 	if (unlikely(task_is_throttled(p))) {
+		/*
+		 * Task migrated to new rq will have its throttle work
+		 * added if necessary in pick time.
+		 */
 		dequeue_throttled_task(p, flags);
 		return true;
 	}

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
  2025-05-22 12:03   ` Peter Zijlstra
@ 2025-05-23  2:43   ` Chengming Zhou
  2025-05-23  7:56     ` Aaron Lu
  1 sibling, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2025-05-23  2:43 UTC (permalink / raw)
  To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
	Peter Zijlstra, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
  Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
	Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/20 18:41, Aaron Lu wrote:
> On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
> core will dequeue it and then requeue it.
> 
> The throttled task is still considered queued by core because p->on_rq
> is still set, so core will dequeue it; but since the task is already
> dequeued on throttle in fair, handle this case properly.
> 
> Affinity and sched class changes are similar.

How about setting p->on_rq to 0 when throttled? That would reflect the fact
that the task is not on a cfs queue anymore. Does this method cause any problem?

Thanks!

> 
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
>   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>   1 file changed, 24 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 74bc320cbc238..4c66fd8d24389 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>   		update_rq_clock(rq);
>   		WARN_ON_ONCE(!list_empty(&p->throttle_node));
>   		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> +		/*
> +		 * Must not add it to limbo list before dequeue or dequeue will
> +		 * mistakenly regard this task as an already throttled one.
> +		 */
>   		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>   		resched_curr(rq);
>   	}
> @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
>   	INIT_LIST_HEAD(&p->throttle_node);
>   }
>   
> +static void dequeue_throttled_task(struct task_struct *p, int flags)
> +{
> +	/*
> +	 * Task is throttled and someone wants to dequeue it again:
> +	 * it must be sched/core when core needs to do things like
> +	 * task affinity change, task group change, task sched class
> +	 * change etc.
> +	 */
> +	WARN_ON_ONCE(p->se.on_rq);
> +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> +
> +	list_del_init(&p->throttle_node);
> +}
> +
>   static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>   static int tg_unthrottle_up(struct task_group *tg, void *data)
>   {
> @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
>   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>   static void task_throttle_setup_work(struct task_struct *p) {}
>   static bool task_is_throttled(struct task_struct *p) { return false; }
> +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
>   
>   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>   {
> @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>    */
>   static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   {
> +	if (unlikely(task_is_throttled(p))) {
> +		dequeue_throttled_task(p, flags);
> +		return true;
> +	}
> +
>   	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>   		util_est_dequeue(&rq->cfs, p);
>   

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 11:07   ` Peter Zijlstra
@ 2025-05-23  7:40     ` Aaron Lu
  2025-05-29 11:51       ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-23  7:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 01:07:28PM +0200, Peter Zijlstra wrote:
> On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >  {
> >  	struct sched_entity *se;
> >  	struct cfs_rq *cfs_rq;
> > +	struct task_struct *p;
> >  
> >  again:
> >  	cfs_rq = &rq->cfs;
> > @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> >  		cfs_rq = group_cfs_rq(se);
> >  	} while (cfs_rq);
> >  
> > -	return task_of(se);
> > +	p = task_of(se);
> > +	if (throttled_hierarchy(cfs_rq_of(se))) {
> > +		/* Shuold not happen for now */
> > +		WARN_ON_ONCE(1);
> > +		task_throttle_setup_work(p);
> > +	}
> > +
> > +	return p;
> >  }
> 
> So the final code is a little different, because you're removing the
> return value from check_cfs_rq_runtime().
> 
> But would not that exact return value be the thing you're now checking
> for again?
>

Ah yes.

> That is; at the end of the series, would not something like:
> 
> static struct task_struct *pick_task_fair(struct rq *rq)
> {
> 	struct sched_entity *se;
> 	struct cfs_rq *cfs_rq;
> 	struct task_struct *p;
> 	bool throttled;
> 
> again:
> 	cfs_rq = &rq->cfs;
> 	if (!cfs_rq->nr_queued)
> 		return NULL;
> 
> 	throttled = false;
> 
> 	do {
> 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
> 			update_curr(cfs_rq);
> 
> 		throttled |= check_cfs_rq_runtime(cfs_rq);
> 
> 		se = pick_next_entity(rq, cfs_rq);
> 		if (!se)
> 			goto again;
> 
> 		cfs_rq = group_cfs_rq(se);
> 	} while (cfs_rq);
> 
> 	p = task_of(se);
> 	if (unlikely(throttled))
> 		task_throttle_setup_work(p);
> 	return p;
> }
> 
> make it more obvious / be simpler?

Thanks for the suggestion, will follow it in the next version.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23  2:43   ` Chengming Zhou
@ 2025-05-23  7:56     ` Aaron Lu
  2025-05-23  9:13       ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-23  7:56 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
> On 2025/5/20 18:41, Aaron Lu wrote:
> > On task group change, for a task whose on_rq equals TASK_ON_RQ_QUEUED,
> > core will dequeue it and then requeue it.
> > 
> > The throttled task is still considered queued by core because p->on_rq
> > is still set, so core will dequeue it; but since the task is already
> > dequeued on throttle in fair, handle this case properly.
> > 
> > Affinity and sched class changes are similar.
> 
> How about setting p->on_rq to 0 when throttled? That would reflect the fact
> that the task is not on a cfs queue anymore. Does this method cause any problem?
> 

On task group change, affinity change etc., if the throttled task is
regarded as !on_rq, then it will miss the chance to be enqueued to the
new (and correct) cfs_rq; instead, it will be enqueued back to its
original cfs_rq on unthrottle, which breaks affinity or task group
settings. We may be able to do something in tg_unthrottle_up() to take
special care of these situations, but it seems like a lot of headache.

Also, for task group change, if the new task group does not have a throttle
setting, that throttled task should be allowed to run immediately instead
of waiting for its old cfs_rq's unthrottle event. The same is true when
this throttled task changes its sched class, like from fair to rt.

Makes sense?

Thanks,
Aaron

> > 
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
> >   kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> >   1 file changed, 24 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 74bc320cbc238..4c66fd8d24389 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
> >   		update_rq_clock(rq);
> >   		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> >   		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > +		/*
> > +		 * Must not add it to limbo list before dequeue or dequeue will
> > +		 * mistakenly regard this task as an already throttled one.
> > +		 */
> >   		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >   		resched_curr(rq);
> >   	}
> > @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
> >   	INIT_LIST_HEAD(&p->throttle_node);
> >   }
> > +static void dequeue_throttled_task(struct task_struct *p, int flags)
> > +{
> > +	/*
> > +	 * Task is throttled and someone wants to dequeue it again:
> > +	 * it must be sched/core when core needs to do things like
> > +	 * task affinity change, task group change, task sched class
> > +	 * change etc.
> > +	 */
> > +	WARN_ON_ONCE(p->se.on_rq);
> > +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> > +
> > +	list_del_init(&p->throttle_node);
> > +}
> > +
> >   static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> >   static int tg_unthrottle_up(struct task_group *tg, void *data)
> >   {
> > @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
> >   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> >   static void task_throttle_setup_work(struct task_struct *p) {}
> >   static bool task_is_throttled(struct task_struct *p) { return false; }
> > +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
> >   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> >   {
> > @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >    */
> >   static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   {
> > +	if (unlikely(task_is_throttled(p))) {
> > +		dequeue_throttled_task(p, flags);
> > +		return true;
> > +	}
> > +
> >   	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> >   		util_est_dequeue(&rq->cfs, p);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 11:43       ` Chengming Zhou
@ 2025-05-23  8:03         ` Aaron Lu
  0 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-23  8:03 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 07:43:40PM +0800, Chengming Zhou wrote:
> On 2025/5/21 17:21, Aaron Lu wrote:
> > On Wed, May 21, 2025 at 05:01:58PM +0800, Chengming Zhou wrote:
> > > On 2025/5/20 18:41, Aaron Lu wrote:
> > > > From: Valentin Schneider <vschneid@redhat.com>
> > > > 
> > > > In the current throttle model, when a cfs_rq is throttled, its entity will
> > > > be dequeued from the cpu's rq, making tasks attached to it unable to run,
> > > > thus achieving the throttle target.
> > > >
> > > > This has a drawback though: assume a task is a reader of percpu_rwsem
> > > > and is waiting. When it gets woken up, it can not run till its task group's
> > > > next period comes, which can be a relatively long time. The waiting writer
> > > > will have to wait longer due to this, and it also makes further readers
> > > > build up, eventually triggering hung task warnings.
> > > > 
> > > > To improve this situation, change the throttle model to task based, i.e.
> > > > when a cfs_rq is throttled, record its throttled status but do not remove
> > > > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > > > they get picked, add a task work to them so that when they return
> > > > to user, they can be dequeued. In this way, tasks throttled will not
> > > > hold any kernel resources.
> > > > 
> > > > To avoid breaking bisect, preserve the current throttle behavior by
> > > > still dequeuing throttled hierarchy from rq and because of this, no task
> > > > can have that throttle task work added yet. The throttle model will
> > > > switch to task based in a later patch.
> > > > 
> > > > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > > > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > > > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > > 
> > > I'm wondering whether we could put patches 02-04 together, since it's strange
> > > to set up the task work in this patch without changing throttle_cfs_rq(),
> > 
> > Do you mean 02-05?
> > Because the actual change to throttle_cfs_rq() happens in patch5 :)
> 
> Ah, right.
> 
> > 
> > > which makes the reviewing process a bit confusing. WDYT?
> > 
> > Yes, I agree it looks a bit confusing.
> > 
> > The point is to not break bisect while making review easier; merging
> > all task based throttle related patches together would mean putting
> > patches 02-05 together, which seems too big?
> 
> Yeah, a big patch but complete, instead of changing a bit on the same
> function in each patch. Anyway, it's your call :-)
>

Thanks for the suggestion, I can try that in the next version.

I'm thinking maybe one patch for the newly added data structures and one
patch for the throttle related helper functions, then a complete patch to
switch to task based throttle, with the rest staying the same.

Let's also see if others have opinions on this :)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23  7:56     ` Aaron Lu
@ 2025-05-23  9:13       ` Chengming Zhou
  2025-05-23  9:42         ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2025-05-23  9:13 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/23 15:56, Aaron Lu wrote:
> On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
>> On 2025/5/20 18:41, Aaron Lu wrote:
>>> On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
>>> core will dequeue it and then requeued it.
>>>
>>> The throttled task is still considered as queued by core because p->on_rq
>>> is still set so core will dequeue it, but since the task is already
>>> dequeued on throttle in fair, handle this case properly.
>>>
>>> Affinity and sched class change is similar.
>>
>> How about setting p->on_rq to 0 when throttled? which is the fact that
>> the task is not on cfs queue anymore, does this method cause any problem?
>>
> 
> On task group change/affinity change etc. if the throttled task is
> regarded as !on_rq, then it will miss the chance to be enqueued to the
> new(and correct) cfs_rqs, instead, it will be enqueued back to its
> original cfs_rq on unthrottle which breaks affinity or task group

Yeah, this is indeed a problem, I was thinking to delete the throttled task
from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
runnable tree based on the new cfs_rq's throttle status.

But it's much complex compared with your current method.

> settings. We may be able to do something in tg_unthrottle_up() to take
> special care of these situations, but it seems a lot of headaches.
> 
> Also, for task group change, if the new task group does not have throttle
> setting, that throttled task should be allowed to run immediately instead
> of waiting for its old cfs_rq's unthrottle event. Similar is true when
> this throttled task changed its sched class, like from fair to rt.
> 
> Makes sense?

Ok, the another problem of the current method I can think of is the PELT maintenance,
we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.

Thanks!

> 
> Thanks,
> Aaron
> 
>>>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>>    kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>>    1 file changed, 24 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 74bc320cbc238..4c66fd8d24389 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>>>    		update_rq_clock(rq);
>>>    		WARN_ON_ONCE(!list_empty(&p->throttle_node));
>>>    		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
>>> +		/*
>>> +		 * Must not add it to limbo list before dequeue or dequeue will
>>> +		 * mistakenly regard this task as an already throttled one.
>>> +		 */
>>>    		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>>>    		resched_curr(rq);
>>>    	}
>>> @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
>>>    	INIT_LIST_HEAD(&p->throttle_node);
>>>    }
>>> +static void dequeue_throttled_task(struct task_struct *p, int flags)
>>> +{
>>> +	/*
>>> +	 * Task is throttled and someone wants to dequeue it again:
>>> +	 * it must be sched/core when core needs to do things like
>>> +	 * task affinity change, task group change, task sched class
>>> +	 * change etc.
>>> +	 */
>>> +	WARN_ON_ONCE(p->se.on_rq);
>>> +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
>>> +
>>> +	list_del_init(&p->throttle_node);
>>> +}
>>> +
>>>    static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>>>    static int tg_unthrottle_up(struct task_group *tg, void *data)
>>>    {
>>> @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
>>>    static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>>>    static void task_throttle_setup_work(struct task_struct *p) {}
>>>    static bool task_is_throttled(struct task_struct *p) { return false; }
>>> +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
>>>    static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>>>    {
>>> @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>>     */
>>>    static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>    {
>>> +	if (unlikely(task_is_throttled(p))) {
>>> +		dequeue_throttled_task(p, flags);
>>> +		return true;
>>> +	}
>>> +
>>>    	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>>>    		util_est_dequeue(&rq->cfs, p);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23  9:13       ` Chengming Zhou
@ 2025-05-23  9:42         ` Aaron Lu
  2025-05-23  9:53           ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-23  9:42 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Fri, May 23, 2025 at 05:13:35PM +0800, Chengming Zhou wrote:
> On 2025/5/23 15:56, Aaron Lu wrote:
> > On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
> > > On 2025/5/20 18:41, Aaron Lu wrote:
> > > > On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
> > > > core will dequeue it and then requeued it.
> > > > 
> > > > The throttled task is still considered as queued by core because p->on_rq
> > > > is still set so core will dequeue it, but since the task is already
> > > > dequeued on throttle in fair, handle this case properly.
> > > > 
> > > > Affinity and sched class change is similar.
> > > 
> > > How about setting p->on_rq to 0 when throttled? which is the fact that
> > > the task is not on cfs queue anymore, does this method cause any problem?
> > > 
> > 
> > On task group change/affinity change etc. if the throttled task is
> > regarded as !on_rq, then it will miss the chance to be enqueued to the
> > new(and correct) cfs_rqs, instead, it will be enqueued back to its
> > original cfs_rq on unthrottle which breaks affinity or task group
> 
> Yeah, this is indeed a problem, I was thinking to delete the throttled task
> from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
> runnable tree based on the new cfs_rq's throttle status.

Only work when the task is still handled by fair :)

> 
> But it's much complex compared with your current method.
> 
> > settings. We may be able to do something in tg_unthrottle_up() to take
> > special care of these situations, but it seems a lot of headaches.
> > 
> > Also, for task group change, if the new task group does not have throttle
> > setting, that throttled task should be allowed to run immediately instead
> > of waiting for its old cfs_rq's unthrottle event. Similar is true when
> > this throttled task changed its sched class, like from fair to rt.
> > 
> > Makes sense?
> 
> Ok, the another problem of the current method I can think of is the PELT maintenance,
> we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
> delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.
>

There are corresponding callbacks that handle this, e.g. for task group
change, there is task_change_group_fair() that handles PELT detach; for
affinity change, there is migrate_task_rq_fair() does that and for sched
class change, there is switched_from/to() does that.
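
For reference, these are the fair_sched_class hooks being referred to; a
trimmed view of the callback table in kernel/sched/fair.c, with unrelated
members and config guards omitted (only a sketch, not the full table):

DEFINE_SCHED_CLASS(fair) = {
	/* ... */
	.migrate_task_rq	= migrate_task_rq_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,
	.task_change_group	= task_change_group_fair,
	/* ... */
};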

Or do I miss anything?

Thanks,
Aaron
 
> > > > 
> > > > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > > > ---
> > > >    kernel/sched/fair.c | 24 ++++++++++++++++++++++++
> > > >    1 file changed, 24 insertions(+)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 74bc320cbc238..4c66fd8d24389 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
> > > >    		update_rq_clock(rq);
> > > >    		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > >    		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > +		/*
> > > > +		 * Must not add it to limbo list before dequeue or dequeue will
> > > > +		 * mistakenly regard this task as an already throttled one.
> > > > +		 */
> > > >    		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > >    		resched_curr(rq);
> > > >    	}
> > > > @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
> > > >    	INIT_LIST_HEAD(&p->throttle_node);
> > > >    }
> > > > +static void dequeue_throttled_task(struct task_struct *p, int flags)
> > > > +{
> > > > +	/*
> > > > +	 * Task is throttled and someone wants to dequeue it again:
> > > > +	 * it must be sched/core when core needs to do things like
> > > > +	 * task affinity change, task group change, task sched class
> > > > +	 * change etc.
> > > > +	 */
> > > > +	WARN_ON_ONCE(p->se.on_rq);
> > > > +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> > > > +
> > > > +	list_del_init(&p->throttle_node);
> > > > +}
> > > > +
> > > >    static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> > > >    static int tg_unthrottle_up(struct task_group *tg, void *data)
> > > >    {
> > > > @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
> > > >    static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> > > >    static void task_throttle_setup_work(struct task_struct *p) {}
> > > >    static bool task_is_throttled(struct task_struct *p) { return false; }
> > > > +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
> > > >    static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
> > > >    {
> > > > @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > > >     */
> > > >    static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >    {
> > > > +	if (unlikely(task_is_throttled(p))) {
> > > > +		dequeue_throttled_task(p, flags);
> > > > +		return true;
> > > > +	}
> > > > +
> > > >    	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
> > > >    		util_est_dequeue(&rq->cfs, p);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-22 12:40         ` Aaron Lu
@ 2025-05-23  9:53           ` Aaron Lu
  2025-05-23 10:52             ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-23  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 08:40:02PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 01:54:18PM +0200, Peter Zijlstra wrote:
> > On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> > > On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > > > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > > 
> > > > >  static void throttle_cfs_rq_work(struct callback_head *work)
> > > > >  {
> > > > > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > > > +	struct sched_entity *se;
> > > > > +	struct cfs_rq *cfs_rq;
> > > > > +	struct rq *rq;
> > > > > +
> > > > > +	WARN_ON_ONCE(p != current);
> > > > > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > > > > +
> > > > > +	/*
> > > > > +	 * If task is exiting, then there won't be a return to userspace, so we
> > > > > +	 * don't have to bother with any of this.
> > > > > +	 */
> > > > > +	if ((p->flags & PF_EXITING))
> > > > > +		return;
> > > > > +
> > > > > +	scoped_guard(task_rq_lock, p) {
> > > > > +		se = &p->se;
> > > > > +		cfs_rq = cfs_rq_of(se);
> > > > > +
> > > > > +		/* Raced, forget */
> > > > > +		if (p->sched_class != &fair_sched_class)
> > > > > +			return;
> > > > > +
> > > > > +		/*
> > > > > +		 * If not in limbo, then either replenish has happened or this
> > > > > +		 * task got migrated out of the throttled cfs_rq, move along.
> > > > > +		 */
> > > > > +		if (!cfs_rq->throttle_count)
> > > > > +			return;
> > > > > +		rq = scope.rq;
> > > > > +		update_rq_clock(rq);
> > > > > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > > > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > > > +		resched_curr(rq);
> > > > > +	}
> > > > > +
> > > > > +	cond_resched_tasks_rcu_qs();
> > > > >  }
> > > > 
> > > > What's that cond_resched thing about? The general plan is to make
> > > > cond_resched go away.
> > > 
> > > Got it.
> > > 
> > > The purpose is to let throttled task schedule and also mark a task rcu
> > > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > > will be scheduled by cond_resched() in task_work_run() and since that is
> > > a preempt schedule, it didn't mark a task rcu quiescent state.
> > > 
> > > Any suggestion here? Perhaps a plain schedule()? Thanks.
> > 
> > I am confused, this is task_work_run(), that is ran from
> > exit_to_user_mode_loop(), which contains a schedule().
>

I should probably have added that the schedule() call contained in
exit_to_user_mode_loop() is early in that loop, where the to-be-throttled
task doesn't have need_resched bit set yet.

> There is a cond_resched() in task_work_run() loop:
> 
> 		do {
> 			next = work->next;
> 			work->func(work);
> 			work = next;
> 			cond_resched();
> 		} while (work);
> 
> And when this throttle work returns with need_resched bit set,
> cond_resched() will cause a schedule but that didn't mark a task
> quiescent state...

Another approach I can think of is to add a test of task_is_throttled()
in rcu_tasks_is_holdout(). I remembered when I tried this before, I can
hit the following path:

exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() -> 
do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() -> 
try_to_block_task() -> dequeue_task_fair().

I would like to avoid this path, because it doesn't feel right for a
throttled task to go through another dequeue (cases like task group
change, affinity change etc. are special cases that have to be dealt
with, though).

It looks to me that a schedule() call (or some other form of it) that
makes sure this throttled task gets scheduled in its task work is the
safest thing to do.

Thoughts?

Thanks,
Aaron

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23  9:42         ` Aaron Lu
@ 2025-05-23  9:53           ` Chengming Zhou
  2025-05-23 11:59             ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Chengming Zhou @ 2025-05-23  9:53 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/23 17:42, Aaron Lu wrote:
> On Fri, May 23, 2025 at 05:13:35PM +0800, Chengming Zhou wrote:
>> On 2025/5/23 15:56, Aaron Lu wrote:
>>> On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
>>>> On 2025/5/20 18:41, Aaron Lu wrote:
>>>>> On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
>>>>> core will dequeue it and then requeued it.
>>>>>
>>>>> The throttled task is still considered as queued by core because p->on_rq
>>>>> is still set so core will dequeue it, but since the task is already
>>>>> dequeued on throttle in fair, handle this case properly.
>>>>>
>>>>> Affinity and sched class change is similar.
>>>>
>>>> How about setting p->on_rq to 0 when throttled? which is the fact that
>>>> the task is not on cfs queue anymore, does this method cause any problem?
>>>>
>>>
>>> On task group change/affinity change etc. if the throttled task is
>>> regarded as !on_rq, then it will miss the chance to be enqueued to the
>>> new(and correct) cfs_rqs, instead, it will be enqueued back to its
>>> original cfs_rq on unthrottle which breaks affinity or task group
>>
>> Yeah, this is indeed a problem, I was thinking to delete the throttled task
>> from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
>> runnable tree based on the new cfs_rq's throttle status.
> 
> Only work when the task is still handled by fair :)
> 
>>
>> But it's much complex compared with your current method.
>>
>>> settings. We may be able to do something in tg_unthrottle_up() to take
>>> special care of these situations, but it seems a lot of headaches.
>>>
>>> Also, for task group change, if the new task group does not have throttle
>>> setting, that throttled task should be allowed to run immediately instead
>>> of waiting for its old cfs_rq's unthrottle event. Similar is true when
>>> this throttled task changed its sched class, like from fair to rt.
>>>
>>> Makes sense?
>>
>> Ok, the another problem of the current method I can think of is the PELT maintenance,
>> we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
>> delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.
>>
> 
> There are corresponding callbacks that handle this, e.g. for task group
> change, there is task_change_group_fair() that handles PELT detach; for
> affinity change, there is migrate_task_rq_fair() does that and for sched
> class change, there is switched_from/to() does that.
> 
> Or do I miss anything?

migrate_task_rq_fair() only do it when !task_on_rq_migrating(p), which is wakeup migrate,
because we already do detach in dequeue_task_fair() for on_rq task migration...
You can see the DO_DETACH flag in update_load_avg() called from dequeue_entity().
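
For context, the relevant bit of dequeue_entity() is roughly this
(paraphrased from kernel/sched/fair.c, details may differ between
versions); the DO_DETACH action is only added for an on_rq migration,
which is why migrate_task_rq_fair() skips the detach in that case:

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	int action = UPDATE_TG;

	/* migrating an on_rq task: detach its load from the old cfs_rq here */
	if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
		action |= DO_DETACH;

	/* ... */
	update_load_avg(cfs_rq, se, action);
	/* ... rest of dequeue ... */
}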

Thanks!

> 
> Thanks,
> Aaron
>   
>>>>>
>>>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>>>> ---
>>>>>     kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>>>>     1 file changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 74bc320cbc238..4c66fd8d24389 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>>>>>     		update_rq_clock(rq);
>>>>>     		WARN_ON_ONCE(!list_empty(&p->throttle_node));
>>>>>     		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
>>>>> +		/*
>>>>> +		 * Must not add it to limbo list before dequeue or dequeue will
>>>>> +		 * mistakenly regard this task as an already throttled one.
>>>>> +		 */
>>>>>     		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>>>>>     		resched_curr(rq);
>>>>>     	}
>>>>> @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
>>>>>     	INIT_LIST_HEAD(&p->throttle_node);
>>>>>     }
>>>>> +static void dequeue_throttled_task(struct task_struct *p, int flags)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Task is throttled and someone wants to dequeue it again:
>>>>> +	 * it must be sched/core when core needs to do things like
>>>>> +	 * task affinity change, task group change, task sched class
>>>>> +	 * change etc.
>>>>> +	 */
>>>>> +	WARN_ON_ONCE(p->se.on_rq);
>>>>> +	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
>>>>> +
>>>>> +	list_del_init(&p->throttle_node);
>>>>> +}
>>>>> +
>>>>>     static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>>>>>     static int tg_unthrottle_up(struct task_group *tg, void *data)
>>>>>     {
>>>>> @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
>>>>>     static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>>>>>     static void task_throttle_setup_work(struct task_struct *p) {}
>>>>>     static bool task_is_throttled(struct task_struct *p) { return false; }
>>>>> +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
>>>>>     static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>>>>>     {
>>>>> @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>>>>      */
>>>>>     static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>>     {
>>>>> +	if (unlikely(task_is_throttled(p))) {
>>>>> +		dequeue_throttled_task(p, flags);
>>>>> +		return true;
>>>>> +	}
>>>>> +
>>>>>     	if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>>>>>     		util_est_dequeue(&rq->cfs, p);

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-23  9:53           ` Aaron Lu
@ 2025-05-23 10:52             ` Peter Zijlstra
  2025-05-23 11:17               ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-23 10:52 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Paul McKenney

On Fri, May 23, 2025 at 05:53:50PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 08:40:02PM +0800, Aaron Lu wrote:
> > On Thu, May 22, 2025 at 01:54:18PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> > > > On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > > > > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > > > 
> > > > > >  static void throttle_cfs_rq_work(struct callback_head *work)
> > > > > >  {
> > > > > > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > > > > +	struct sched_entity *se;
> > > > > > +	struct cfs_rq *cfs_rq;
> > > > > > +	struct rq *rq;
> > > > > > +
> > > > > > +	WARN_ON_ONCE(p != current);
> > > > > > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > > > > > +
> > > > > > +	/*
> > > > > > +	 * If task is exiting, then there won't be a return to userspace, so we
> > > > > > +	 * don't have to bother with any of this.
> > > > > > +	 */
> > > > > > +	if ((p->flags & PF_EXITING))
> > > > > > +		return;
> > > > > > +
> > > > > > +	scoped_guard(task_rq_lock, p) {
> > > > > > +		se = &p->se;
> > > > > > +		cfs_rq = cfs_rq_of(se);
> > > > > > +
> > > > > > +		/* Raced, forget */
> > > > > > +		if (p->sched_class != &fair_sched_class)
> > > > > > +			return;
> > > > > > +
> > > > > > +		/*
> > > > > > +		 * If not in limbo, then either replenish has happened or this
> > > > > > +		 * task got migrated out of the throttled cfs_rq, move along.
> > > > > > +		 */
> > > > > > +		if (!cfs_rq->throttle_count)
> > > > > > +			return;
> > > > > > +		rq = scope.rq;
> > > > > > +		update_rq_clock(rq);
> > > > > > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > > > > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > > > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > > > > +		resched_curr(rq);
> > > > > > +	}
> > > > > > +
> > > > > > +	cond_resched_tasks_rcu_qs();
> > > > > >  }
> > > > > 
> > > > > What's that cond_resched thing about? The general plan is to make
> > > > > cond_resched go away.
> > > > 
> > > > Got it.
> > > > 
> > > > The purpose is to let throttled task schedule and also mark a task rcu
> > > > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > > > will be scheduled by cond_resched() in task_work_run() and since that is
> > > > a preempt schedule, it didn't mark a task rcu quiescent state.
> > > > 
> > > > Any suggestion here? Perhaps a plain schedule()? Thanks.
> > > 
> > > I am confused, this is task_work_run(), that is ran from
> > > exit_to_user_mode_loop(), which contains a schedule().
> >
> 
> I should probably have added that the schedule() call contained in
> exit_to_user_mode_loop() is early in that loop, where the to-be-throttled
> task doesn't have need_resched bit set yet.

No, but if it does get set, it will get picked up at:

	ti_work = read_thread_flags();

and since TIF_NEED_RESCHED is part of EXIT_TO_USER_MODE_WORK, we'll get
another cycle, and do the schedule() thing.
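
Roughly, a simplified sketch of exit_to_user_mode_loop() in
kernel/entry/common.c (most work items elided):

static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
					    unsigned long ti_work)
{
	while (ti_work & EXIT_TO_USER_MODE_WORK) {
		local_irq_enable_exit_to_user(ti_work);

		if (ti_work & _TIF_NEED_RESCHED)
			schedule();

		/* ... */

		/* runs task_work, incl. throttle_cfs_rq_work() */
		if (ti_work & _TIF_NOTIFY_RESUME)
			resume_user_mode_work(regs);

		/* ... */

		local_irq_disable_exit_to_user();

		/* a NEED_RESCHED set by the work above is picked up here */
		ti_work = read_thread_flags();
	}

	return ti_work;
}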

> > There is a cond_resched() in task_work_run() loop:
> > 
> > 		do {
> > 			next = work->next;
> > 			work->func(work);
> > 			work = next;
> > 			cond_resched();
> > 		} while (work);

That cond_resched() is equally going away.

> > And when this throttle work returns with need_resched bit set,
> > cond_resched() will cause a schedule but that didn't mark a task
> > quiescent state...
> 
> Another approach I can think of is to add a test of task_is_throttled()
> in rcu_tasks_is_holdout(). I remembered when I tried this before, I can
> hit the following path:

So this really is about task_rcu needing something? Let me go look at
task-rcu.

So AFAICT, exit_to_user_mode_loop() will do schedule(), which will call
__schedule(SM_NONE), which then will have preempt = false and call:
rcu_note_context_switch(false) which in turn will do:
rcu_tasks_qs(current, false).

This should be sufficient, no?
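
(For reference, the task-RCU quiescent state hook is roughly this, from
include/linux/rcupdate.h, exact form may vary by version:

#define rcu_tasks_qs(t, preempt)					\
do {									\
	if (!(preempt) && READ_ONCE((t)->rcu_tasks_holdout))		\
		WRITE_ONCE((t)->rcu_tasks_holdout, false);		\
} while (0)

so a preemption-driven schedule, like the one behind cond_resched(),
passes preempt == true and leaves the holdout set, while the schedule()
in the exit loop passes false and clears it.)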

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-23 10:52             ` Peter Zijlstra
@ 2025-05-23 11:17               ` Aaron Lu
  0 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-23 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Paul McKenney

On Fri, May 23, 2025 at 12:52:22PM +0200, Peter Zijlstra wrote:
> On Fri, May 23, 2025 at 05:53:50PM +0800, Aaron Lu wrote:
> > On Thu, May 22, 2025 at 08:40:02PM +0800, Aaron Lu wrote:
> > > On Thu, May 22, 2025 at 01:54:18PM +0200, Peter Zijlstra wrote:
> > > > On Thu, May 22, 2025 at 07:44:55PM +0800, Aaron Lu wrote:
> > > > > On Thu, May 22, 2025 at 12:48:43PM +0200, Peter Zijlstra wrote:
> > > > > > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > > > > 
> > > > > > >  static void throttle_cfs_rq_work(struct callback_head *work)
> > > > > > >  {
> > > > > > > +	struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
> > > > > > > +	struct sched_entity *se;
> > > > > > > +	struct cfs_rq *cfs_rq;
> > > > > > > +	struct rq *rq;
> > > > > > > +
> > > > > > > +	WARN_ON_ONCE(p != current);
> > > > > > > +	p->sched_throttle_work.next = &p->sched_throttle_work;
> > > > > > > +
> > > > > > > +	/*
> > > > > > > +	 * If task is exiting, then there won't be a return to userspace, so we
> > > > > > > +	 * don't have to bother with any of this.
> > > > > > > +	 */
> > > > > > > +	if ((p->flags & PF_EXITING))
> > > > > > > +		return;
> > > > > > > +
> > > > > > > +	scoped_guard(task_rq_lock, p) {
> > > > > > > +		se = &p->se;
> > > > > > > +		cfs_rq = cfs_rq_of(se);
> > > > > > > +
> > > > > > > +		/* Raced, forget */
> > > > > > > +		if (p->sched_class != &fair_sched_class)
> > > > > > > +			return;
> > > > > > > +
> > > > > > > +		/*
> > > > > > > +		 * If not in limbo, then either replenish has happened or this
> > > > > > > +		 * task got migrated out of the throttled cfs_rq, move along.
> > > > > > > +		 */
> > > > > > > +		if (!cfs_rq->throttle_count)
> > > > > > > +			return;
> > > > > > > +		rq = scope.rq;
> > > > > > > +		update_rq_clock(rq);
> > > > > > > +		WARN_ON_ONCE(!list_empty(&p->throttle_node));
> > > > > > > +		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
> > > > > > > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > > > > > > +		resched_curr(rq);
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	cond_resched_tasks_rcu_qs();
> > > > > > >  }
> > > > > > 
> > > > > > What's that cond_resched thing about? The general plan is to make
> > > > > > cond_resched go away.
> > > > > 
> > > > > Got it.
> > > > > 
> > > > > The purpose is to let throttled task schedule and also mark a task rcu
> > > > > quiescent state. Without this cond_resched_tasks_rcu_qs(), this task
> > > > > will be scheduled by cond_resched() in task_work_run() and since that is
> > > > > a preempt schedule, it didn't mark a task rcu quiescent state.
> > > > > 
> > > > > Any suggestion here? Perhaps a plain schedule()? Thanks.
> > > > 
> > > > I am confused, this is task_work_run(), that is ran from
> > > > exit_to_user_mode_loop(), which contains a schedule().
> > >
> > 
> > I should probably have added that the schedule() call contained in
> > exit_to_user_mode_loop() is early in that loop, where the to-be-throttled
> > task doesn't have need_resched bit set yet.
> 
> No, but if it does get set, it will get picked up at:
> 
> 	ti_work = read_thread_flags();
> 
> and since TIF_NEED_RESCHED is part of EXIT_TO_USER_MODE_WORK, we'll get
> another cycle, and do the schedule() thing.
> 
> > > There is a cond_resched() in task_work_run() loop:
> > > 
> > > 		do {
> > > 			next = work->next;
> > > 			work->func(work);
> > > 			work = next;
> > > 			cond_resched();
> > > 		} while (work);
> 
> That cond_resched() is equally going away.

Good to know this.

As long as this cond_resched() goes away, there is no need for an
explicit schedule() or any of its other forms in this throttle task
work.
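
(For reference, the helper being dropped is just a task-RCU QS report
plus a reschedule point, roughly:

#define cond_resched_tasks_rcu_qs()					\
do {									\
	rcu_tasks_qs(current, false);					\
	cond_resched();							\
} while (0)

and once the schedule() in the exit-to-user loop covers both of those,
nothing extra is needed in the throttle work.)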

> > > And when this throttle work returns with need_resched bit set,
> > > cond_resched() will cause a schedule but that didn't mark a task
> > > quiescent state...
> > 
> > Another approach I can think of is to add a test of task_is_throttled()
> > in rcu_tasks_is_holdout(). I remembered when I tried this before, I can
> > hit the following path:
> 
> So this really is about task_rcu needing something? Let me go look at
> task-rcu.

Yes. I found this problem when using bpftrace to profile something and
bpftrace couldn't start until the test finished :)

I'm assuming bpftrace needs those throttled tasks to properly mark a
quiescent state. With this change here, bpftrace can start normally while
the test is running.

> 
> So AFAICT, exit_to_user_mode_loop() will do schedule(), which will call
> __schedule(SM_NONE), which then will have preempt = false and call:
> rcu_note_context_switch(false) which in turn will do:
> rcu_tasks_qs(current, false).
> 
> This should be sufficient, no?

Yes, as long as that cond_resched() in the task_work_run() loop is gone.

I'll also give it a test and will let you know if I find anything
unexpected.

Thanks,
Aaron

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23  9:53           ` Chengming Zhou
@ 2025-05-23 11:59             ` Aaron Lu
  2025-05-26 13:14               ` Chengming Zhou
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-23 11:59 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Fri, May 23, 2025 at 05:53:55PM +0800, Chengming Zhou wrote:
> On 2025/5/23 17:42, Aaron Lu wrote:
> > On Fri, May 23, 2025 at 05:13:35PM +0800, Chengming Zhou wrote:
> > > On 2025/5/23 15:56, Aaron Lu wrote:
> > > > On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
> > > > > On 2025/5/20 18:41, Aaron Lu wrote:
> > > > > > On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
> > > > > > core will dequeue it and then requeued it.
> > > > > > 
> > > > > > The throttled task is still considered as queued by core because p->on_rq
> > > > > > is still set so core will dequeue it, but since the task is already
> > > > > > dequeued on throttle in fair, handle this case properly.
> > > > > > 
> > > > > > Affinity and sched class change is similar.
> > > > > 
> > > > > How about setting p->on_rq to 0 when throttled? which is the fact that
> > > > > the task is not on cfs queue anymore, does this method cause any problem?
> > > > > 
> > > > 
> > > > On task group change/affinity change etc. if the throttled task is
> > > > regarded as !on_rq, then it will miss the chance to be enqueued to the
> > > > new(and correct) cfs_rqs, instead, it will be enqueued back to its
> > > > original cfs_rq on unthrottle which breaks affinity or task group
> > > 
> > > Yeah, this is indeed a problem, I was thinking to delete the throttled task
> > > from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
> > > runnable tree based on the new cfs_rq's throttle status.
> > 
> > Only work when the task is still handled by fair :)
> > 
> > > 
> > > But it's much complex compared with your current method.
> > > 
> > > > settings. We may be able to do something in tg_unthrottle_up() to take
> > > > special care of these situations, but it seems a lot of headaches.
> > > > 
> > > > Also, for task group change, if the new task group does not have throttle
> > > > setting, that throttled task should be allowed to run immediately instead
> > > > of waiting for its old cfs_rq's unthrottle event. Similar is true when
> > > > this throttled task changed its sched class, like from fair to rt.
> > > > 
> > > > Makes sense?
> > > 
> > > Ok, the another problem of the current method I can think of is the PELT maintenance,
> > > we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
> > > delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.
> > > 
> > 
> > There are corresponding callbacks that handle this, e.g. for task group
> > change, there is task_change_group_fair() that handles PELT detach; for
> > affinity change, there is migrate_task_rq_fair() does that and for sched
> > class change, there is switched_from/to() does that.
> > 
> > Or do I miss anything?
> 
> migrate_task_rq_fair() only do it when !task_on_rq_migrating(p), which is wakeup migrate,
> because we already do detach in dequeue_task_fair() for on_rq task migration...
> You can see the DO_DETACH flag in update_load_avg() called from dequeue_entity().
>

Understood, thanks for catching this!

So the code was initially developed on top of v5.15 and there is a
detach in migrate_task_rq_fair():

	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
		/*
		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
		 * rq->lock and can modify state directly.
		 */
		lockdep_assert_rq_held(task_rq(p));
		detach_entity_cfs_rq(&p->se);
	}

But looks like it's gone now by commit e1f078f50478("sched/fair: Combine
detach into dequeue when migrating task") and I failed to notice this
detail...

Anyway, the task is already dequeued without TASK_ON_RQ_MIGRATING being
set when throttled and it can't be dequeued again, so perhaps something
like below could cure this situation?(just to illustrate the idea, not
even compile tested)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89afa472299b7..dc2e9a6bf3fd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5868,6 +5868,9 @@ static void dequeue_throttled_task(struct task_struct *p, int flags)
	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);

	list_del_init(&p->throttle_node);
+
+       if (task_on_rq_migrating(p))
+               detach_task_cfs_rq(p);
}



^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
                     ` (3 preceding siblings ...)
  2025-05-22 11:07   ` Peter Zijlstra
@ 2025-05-23 12:35   ` Peter Zijlstra
  4 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-23 12:35 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
>  static int tg_throttle_down(struct task_group *tg, void *data)
>  {
>  	struct rq *rq = data;
>  	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>  
> +	cfs_rq->throttle_count++;
> +	if (cfs_rq->throttle_count > 1)
> +		return 0;

	if (cfs_rq->throttle_count++)
		return 0;

vs
	
	if (--cfs_rq->throttle_count)
		return;


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-22 12:49     ` Aaron Lu
@ 2025-05-23 14:59       ` Peter Zijlstra
  2025-05-26 11:36         ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2025-05-23 14:59 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Thu, May 22, 2025 at 08:49:43PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:

> > This is asymmetric -- dequeue removes it from that throttle list, but
> > the corresponding enqueue will not add it back, what gives?
> > 
> > Because now we have:
> > 
> >  p->on_rq=1
> >  p->throttle_node on list
> > 
> > move_queued_task()
> >   deactivate_task()
> >     dequeue_task_fair()
> >       list_del_init(throttle_node)
> >     p->on_rq = 2
> > 
> >   activate_task()
> >     enqueue_task_fair()
> >       // nothing special, makes the thing runnable
> >     p->on_rq = 1;
> > 
> > and we exit with a task that is on-rq and not throttled ?!?
> >
> > Why is this? Are we relying on pick_task_fair() to dequeue it again and
> > fix up our inconsistencies? If so, that had better have a comment on.
> 
> Correct.

But would it not be better to have enqueue bail when we're trying to
enqueue an already throttled task into a throttled cfs_rq?

It seems a waste to do the actual enqueue, pick, dequeue when we
could've just avoided all that.

The immediate problem seems to be that you destroy the
task_is_throttled() state on dequeue, but surely that is trivially
fixable by not keeping that state in the list.
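
E.g. keep a dedicated per-task flag instead of deriving the state from
list membership; just a sketch, the flag name is made up:

static inline bool task_is_throttled(struct task_struct *p)
{
	/*
	 * hypothetical flag: set when the throttle work dequeues the
	 * task, cleared when it gets enqueued again / unthrottled.
	 */
	return p->throttled;
}

and then dequeue can do list_del_init(&p->throttle_node) without losing
the throttled state.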

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23 14:59       ` Peter Zijlstra
@ 2025-05-26 11:36         ` Aaron Lu
  2025-05-27  6:58           ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-26 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Fri, May 23, 2025 at 04:59:42PM +0200, Peter Zijlstra wrote:
> On Thu, May 22, 2025 at 08:49:43PM +0800, Aaron Lu wrote:
> > On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:
> 
> > > This is asymmetric -- dequeue removes it from that throttle list, but
> > > the corresponding enqueue will not add it back, what gives?
> > > 
> > > Because now we have:
> > > 
> > >  p->on_rq=1
> > >  p->throttle_node on list
> > > 
> > > move_queued_task()
> > >   deactivate_task()
> > >     dequeue_task_fair()
> > >       list_del_init(throttle_node)
> > >     p->on_rq = 2
> > > 
> > >   activate_task()
> > >     enqueue_task_fair()
> > >       // nothing special, makes the thing runnable
> > >     p->on_rq = 1;
> > > 
> > > and we exit with a task that is on-rq and not throttled ?!?
> > >
> > > Why is this? Are we relying on pick_task_fair() to dequeue it again and
> > > fix up our inconsistencies? If so, that had better have a comment on.
> > 
> > Correct.
> 
> But would it not be better to have enqueue bail when we're trying to
> enqueue an already throttled task into a throttled cfs_rq?
> 
> It seems a waste to do the actual enqueue, pick, dequeue when we
> could've just avoided all that.
>

The original idea is to keep code simple but surely this can be
optimized. I'm working on it and will paste diff here once I get it
work.

Thanks,
Aaron

> The immediate problem seems to be that you destroy the
> task_is_throttled() state on dequeue, but surely that is trivially
> fixable by not keeping that state in the list.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-23 11:59             ` Aaron Lu
@ 2025-05-26 13:14               ` Chengming Zhou
  0 siblings, 0 replies; 44+ messages in thread
From: Chengming Zhou @ 2025-05-26 13:14 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
	Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
	Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 2025/5/23 19:59, Aaron Lu wrote:
> On Fri, May 23, 2025 at 05:53:55PM +0800, Chengming Zhou wrote:
>> On 2025/5/23 17:42, Aaron Lu wrote:
>>> On Fri, May 23, 2025 at 05:13:35PM +0800, Chengming Zhou wrote:
>>>> On 2025/5/23 15:56, Aaron Lu wrote:
>>>>> On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
>>>>>> On 2025/5/20 18:41, Aaron Lu wrote:
>>>>>>> On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
>>>>>>> core will dequeue it and then requeued it.
>>>>>>>
>>>>>>> The throttled task is still considered as queued by core because p->on_rq
>>>>>>> is still set so core will dequeue it, but since the task is already
>>>>>>> dequeued on throttle in fair, handle this case properly.
>>>>>>>
>>>>>>> Affinity and sched class change is similar.
>>>>>>
>>>>>> How about setting p->on_rq to 0 when throttled? which is the fact that
>>>>>> the task is not on cfs queue anymore, does this method cause any problem?
>>>>>>
>>>>>
>>>>> On task group change/affinity change etc. if the throttled task is
>>>>> regarded as !on_rq, then it will miss the chance to be enqueued to the
>>>>> new(and correct) cfs_rqs, instead, it will be enqueued back to its
>>>>> original cfs_rq on unthrottle which breaks affinity or task group
>>>>
>>>> Yeah, this is indeed a problem, I was thinking to delete the throttled task
>>>> from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
>>>> runnable tree based on the new cfs_rq's throttle status.
>>>
>>> Only work when the task is still handled by fair :)
>>>
>>>>
>>>> But it's much complex compared with your current method.
>>>>
>>>>> settings. We may be able to do something in tg_unthrottle_up() to take
>>>>> special care of these situations, but it seems a lot of headaches.
>>>>>
>>>>> Also, for task group change, if the new task group does not have throttle
>>>>> setting, that throttled task should be allowed to run immediately instead
>>>>> of waiting for its old cfs_rq's unthrottle event. Similar is true when
>>>>> this throttled task changed its sched class, like from fair to rt.
>>>>>
>>>>> Makes sense?
>>>>
>>>> Ok, the another problem of the current method I can think of is the PELT maintenance,
>>>> we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
>>>> delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.
>>>>
>>>
>>> There are corresponding callbacks that handle this, e.g. for task group
>>> change, there is task_change_group_fair() that handles PELT detach; for
>>> affinity change, there is migrate_task_rq_fair() does that and for sched
>>> class change, there is switched_from/to() does that.
>>>
>>> Or do I miss anything?
>>
>> migrate_task_rq_fair() only do it when !task_on_rq_migrating(p), which is wakeup migrate,
>> because we already do detach in dequeue_task_fair() for on_rq task migration...
>> You can see the DO_DETACH flag in update_load_avg() called from dequeue_entity().
>>
> 
> Understood, thanks for catching this!
> 
> So the code was initially developed on top of v5.15 and there is a
> detach in migrate_task_rq_fair():
> 
> 	if (p->on_rq == TASK_ON_RQ_MIGRATING) {
> 		/*
> 		 * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
> 		 * rq->lock and can modify state directly.
> 		 */
> 		lockdep_assert_rq_held(task_rq(p));
> 		detach_entity_cfs_rq(&p->se);
> 	}
> 
> But looks like it's gone now by commit e1f078f50478("sched/fair: Combine
> detach into dequeue when migrating task") and I failed to notice this
> detail...

Yeah..

> 
> Anyway, the task is already dequeued without TASK_ON_RQ_MIGRATING being
> set when throttled and it can't be dequeued again, so perhaps something
> like below could cure this situation?(just to illustrate the idea, not
> even compile tested)

Ok, seems reasonable to me!

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 89afa472299b7..dc2e9a6bf3fd7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5868,6 +5868,9 @@ static void dequeue_throttled_task(struct task_struct *p, int flags)
> 	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
> 
> 	list_del_init(&p->throttle_node);
> +
> +       if (task_on_rq_migrating(p))
> +               detach_task_cfs_rq(p);
> }
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-26 11:36         ` Aaron Lu
@ 2025-05-27  6:58           ` Aaron Lu
  2025-05-27 11:19             ` K Prateek Nayak
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-27  6:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Mon, May 26, 2025 at 07:36:50PM +0800, Aaron Lu wrote:
> On Fri, May 23, 2025 at 04:59:42PM +0200, Peter Zijlstra wrote:
> > On Thu, May 22, 2025 at 08:49:43PM +0800, Aaron Lu wrote:
> > > On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:
> > 
> > > > This is asymmetric -- dequeue removes it from that throttle list, but
> > > > the corresponding enqueue will not add it back, what gives?
> > > > 
> > > > Because now we have:
> > > > 
> > > >  p->on_rq=1
> > > >  p->throttle_node on list
> > > > 
> > > > move_queued_task()
> > > >   deactivate_task()
> > > >     dequeue_task_fair()
> > > >       list_del_init(throttle_node)
> > > >     p->on_rq = 2
> > > > 
> > > >   activate_task()
> > > >     enqueue_task_fair()
> > > >       // nothing special, makes the thing runnable
> > > >     p->on_rq = 1;
> > > > 
> > > > and we exit with a task that is on-rq and not throttled ?!?
> > > >
> > > > Why is this? Are we relying on pick_task_fair() to dequeue it again and
> > > > fix up our inconsistencies? If so, that had better have a comment on.
> > > 
> > > Correct.
> > 
> > But would it not be better to have enqueue bail when we're trying to
> > enqueue an already throttled task into a throttled cfs_rq?
> > 
> > It seems a waste to do the actual enqueue, pick, dequeue when we
> > could've just avoided all that.
> >
> 
> The original idea is to keep code simple but surely this can be
> optimized. I'm working on it and will paste diff here once I get it
> work.
>

I tried below diff on top of this series:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 055f3782eeaee..1c5d7c4ff6652 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -882,6 +882,7 @@ struct task_struct {
 #ifdef CONFIG_CFS_BANDWIDTH
 	struct callback_head		sched_throttle_work;
 	struct list_head		throttle_node;
+	bool				throttled;
 #endif
 #endif
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 89afa472299b7..c585a12f2c753 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5798,7 +5798,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
 
 static inline bool task_is_throttled(struct task_struct *p)
 {
-	return !list_empty(&p->throttle_node);
+	return p->throttled;
 }
 
 static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5842,6 +5842,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
 		 * mistakenly regard this task as an already throttled one.
 		 */
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		p->throttled = true;
 		resched_curr(rq);
 	}
 
@@ -5870,6 +5871,22 @@ static void dequeue_throttled_task(struct task_struct *p, int flags)
 	list_del_init(&p->throttle_node);
 }
 
+/* return true to skip actual enqueue */
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+	if (throttled_hierarchy(cfs_rq)) {
+		/* throttled task move across task groups/rqs. */
+		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		return true;
+	}
+
+	/* unthrottle */
+	p->throttled = false;
+	return false;
+}
+
 static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
@@ -6714,6 +6731,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
 static bool task_is_throttled(struct task_struct *p) { return false; }
 static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
 static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
 
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -6907,6 +6925,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
 	u64 slice = 0;
 
+	if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+		return;
+
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
 	 * the cfs_rq utilization to select a frequency.
@@ -6917,7 +6938,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		util_est_enqueue(&rq->cfs, p);
 
 	if (flags & ENQUEUE_DELAYED) {
-		WARN_ON_ONCE(task_is_throttled(p));
 		requeue_delayed_entity(se);
 		return;
 	}

But I got a list corruption issue on se->group_node. After some debugging,
it turns out the following situation can happen and leave a throttled
task's se.group_node on rq->cfs_tasks, when the task is returning to user
with its throttle task work executed while another cpu moves it to a new
group whose cfs_rq is also throttled:

       cpuX                         cpuY
    taskA ret2user
  throttle_cfs_rq_work()     sched_move_task(taskA)
  task_rq_lock acquired
  dequeue_task_fair(taskA)
  task_rq_lock released
                             task_rq_lock acquired
			     task_current_donor(taskA) == true
			     task_on_rq_queued(taskA) == true
			     dequeue_task(taskA)
			     put_prev_task(taskA)
			     sched_change_group()
			     enqueue_task(taskA) -> taskA's new cfs_rq
			                            is throttled, go the
						    fast path and skip
						    actual enqueue
			     set_next_task(taskA)
			       __set_next_task_fair(taskA)
			         list_move(&se->group_node, &rq->cfs_tasks); // bug
     schedule()

(The current series does not have this problem because it always does an
actual enqueue.)

I think this can be trivially fixed by checking whether the task is the
current one in enqueue_throttled_task() and, if so, not taking the fast
path but doing an actual enqueue, like below. I've tested it and have not
found any problem so far. Thoughts?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c585a12f2c753..f9de7df44e968 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5876,7 +5876,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
 
-	if (throttled_hierarchy(cfs_rq)) {
+	if (throttled_hierarchy(cfs_rq) &&
+	    !task_current_donor(rq_of(cfs_rq), p)) {
 		/* throttled task move across task groups/rqs. */
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		return true;

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-27  6:58           ` Aaron Lu
@ 2025-05-27 11:19             ` K Prateek Nayak
  2025-05-27 11:54               ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: K Prateek Nayak @ 2025-05-27 11:19 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, Josh Don, Ingo Molnar,
	Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

Hello Aaron,

On 5/27/2025 12:28 PM, Aaron Lu wrote:
> On Mon, May 26, 2025 at 07:36:50PM +0800, Aaron Lu wrote:
>> On Fri, May 23, 2025 at 04:59:42PM +0200, Peter Zijlstra wrote:
>>> On Thu, May 22, 2025 at 08:49:43PM +0800, Aaron Lu wrote:
>>>> On Thu, May 22, 2025 at 02:03:36PM +0200, Peter Zijlstra wrote:
>>>
>>>>> This is asymmetric -- dequeue removes it from that throttle list, but
>>>>> the corresponding enqueue will not add it back, what gives?
>>>>>
>>>>> Because now we have:
>>>>>
>>>>>   p->on_rq=1
>>>>>   p->throttle_node on list
>>>>>
>>>>> move_queued_task()
>>>>>    deactivate_task()
>>>>>      dequeue_task_fair()
>>>>>        list_del_init(throttle_node)
>>>>>      p->on_rq = 2
>>>>>
>>>>>    activate_task()
>>>>>      enqueue_task_fair()
>>>>>        // nothing special, makes the thing runnable
>>>>>      p->on_rq = 1;
>>>>>
>>>>> and we exit with a task that is on-rq and not throttled ?!?
>>>>>
>>>>> Why is this? Are we relying on pick_task_fair() to dequeue it again and
>>>>> fix up our inconsistencies? If so, that had better have a comment on.
>>>>
>>>> Correct.
>>>
>>> But would it not be better to have enqueue bail when we're trying to
>>> enqueue an already throttled task into a throttled cfs_rq?
>>>
>>> It seems a waste to do the actual enqueue, pick, dequeue when we
>>> could've just avoided all that.
>>>
>>
>> The original idea is to keep code simple but surely this can be
>> optimized. I'm working on it and will paste diff here once I get it
>> work.
>>
> 
> I tried below diff on top of this series:
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 055f3782eeaee..1c5d7c4ff6652 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -882,6 +882,7 @@ struct task_struct {
>   #ifdef CONFIG_CFS_BANDWIDTH
>   	struct callback_head		sched_throttle_work;
>   	struct list_head		throttle_node;
> +	bool				throttled;
>   #endif
>   #endif
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 89afa472299b7..c585a12f2c753 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5798,7 +5798,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
>   
>   static inline bool task_is_throttled(struct task_struct *p)
>   {
> -	return !list_empty(&p->throttle_node);
> +	return p->throttled;
>   }
>   
>   static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
> @@ -5842,6 +5842,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>   		 * mistakenly regard this task as an already throttled one.
>   		 */
>   		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> +		p->throttled = true;
>   		resched_curr(rq);
>   	}

Since we now have an official per-task throttle indicator, what are your
thoughts on reusing "p->se.group_node" for throttled_limbo_list?

Something like this lightly tested diff based on your suggestion:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11eb0612e22d..f9fdcf812e81 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -578,6 +578,9 @@ struct sched_entity {
  	unsigned char			sched_delayed;
  	unsigned char			rel_deadline;
  	unsigned char			custom_slice;
+#ifdef CONFIG_CFS_BANDWIDTH
+	unsigned char			sched_throttled;
+#endif
  					/* hole */
  
  	u64				exec_start;
@@ -881,7 +884,6 @@ struct task_struct {
  	struct task_group		*sched_task_group;
  #ifdef CONFIG_CFS_BANDWIDTH
  	struct callback_head		sched_throttle_work;
-	struct list_head		throttle_node;
  #endif
  #endif
  
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25e794ea0283..b1cb05baf8d4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5798,7 +5798,7 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
  
  static inline bool task_is_throttled(struct task_struct *p)
  {
-	return !list_empty(&p->throttle_node);
+	return !!p->se.sched_throttled;
  }
  
  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5835,13 +5835,14 @@ static void throttle_cfs_rq_work(struct callback_head *work)
  			return;
  		rq = scope.rq;
  		update_rq_clock(rq);
-		WARN_ON_ONCE(!list_empty(&p->throttle_node));
+		WARN_ON_ONCE(p->se.sched_throttled);
  		dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
  		/*
-		 * Must not add it to limbo list before dequeue or dequeue will
+		 * Must not mark throttled before dequeue or dequeue will
  		 * mistakenly regard this task as an already throttled one.
  		 */
-		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+		p->se.sched_throttled = 1;
+		list_add(&p->se.group_node, &cfs_rq->throttled_limbo_list);
  		resched_curr(rq);
  	}
  
@@ -5853,7 +5854,6 @@ void init_cfs_throttle_work(struct task_struct *p)
  	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
  	/* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
  	p->sched_throttle_work.next = &p->sched_throttle_work;
-	INIT_LIST_HEAD(&p->throttle_node);
  }
  
  static void dequeue_throttled_task(struct task_struct *p, int flags)
@@ -5864,10 +5864,26 @@ static void dequeue_throttled_task(struct task_struct *p, int flags)
  	 * task affinity change, task group change, task sched class
  	 * change etc.
  	 */
-	WARN_ON_ONCE(p->se.on_rq);
+	WARN_ON_ONCE(!p->se.sched_throttled);
  	WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
  
-	list_del_init(&p->throttle_node);
+	list_del_init(&p->se.group_node);
+}
+
+/* return true to skip actual enqueue */
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+	if (throttled_hierarchy(cfs_rq)) {
+		/* throttled task move across task groups/rqs. */
+		list_add(&p->se.group_node, &cfs_rq->throttled_limbo_list);
+		return true;
+	}
+
+	/* unthrottle */
+	p->se.sched_throttled = 0;
+	return false;
  }
  
  static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
@@ -5896,8 +5912,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
  	}
  
  	/* Re-enqueue the tasks that have been throttled at this level. */
-	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
-		list_del_init(&p->throttle_node);
+	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, se.group_node) {
+		WARN_ON_ONCE(!p->se.sched_throttled);
+
+		p->se.sched_throttled = 0;
+		list_del_init(&p->se.group_node);
  		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
  	}
  
@@ -6714,6 +6733,7 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
  static void task_throttle_setup_work(struct task_struct *p) {}
  static bool task_is_throttled(struct task_struct *p) { return false; }
  static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
  static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
  
  static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
@@ -6907,6 +6927,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
  	int rq_h_nr_queued = rq->cfs.h_nr_queued;
  	u64 slice = 0;
  
+	if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+		return;
+
  	/*
  	 * The code below (indirectly) updates schedutil which looks at
  	 * the cfs_rq utilization to select a frequency.
@@ -13244,7 +13267,7 @@ static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool firs
  	struct sched_entity *se = &p->se;
  
  #ifdef CONFIG_SMP
-	if (task_on_rq_queued(p)) {
+	if (task_on_rq_queued(p) && !task_is_throttled(p)) {
  		/*
  		 * Move the next running task to the front of the list, so our
  		 * cfs_tasks list becomes MRU one.


-- 
Thanks and Regards,
Prateek


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-27 11:19             ` K Prateek Nayak
@ 2025-05-27 11:54               ` Aaron Lu
  2025-05-27 14:16                 ` K Prateek Nayak
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-27 11:54 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Valentin Schneider, Ben Segall, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

Hi Prateek,

On Tue, May 27, 2025 at 04:49:36PM +0530, K Prateek Nayak wrote:
... ...
> Since we now have an official per-task throttle indicator, what are your
> thoughts on reusing "p->se.group_node" for throttled_limbo_list?
>

I'm not sure. I can easily get confused when I see se.group_node and
think it is something related to rq->cfs_tasks :) Maybe using a
union could make it look better?
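
Something like below is what I have in mind - a rough, untested sketch
just to show the idea; the field placement is only indicative:

	struct sched_entity {
		...
		union {
			/* on rq->cfs_tasks while queued */
			struct list_head	group_node;
			/* on cfs_rq->throttled_limbo_list while throttled */
			struct list_head	throttle_node;
		};
		...
	};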

Anyway, if space is a concern then this is a good way to do it, thanks
for the suggestion. I'll leave it to Peter to decide.

Best wishes,
Aaron

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
  2025-05-27 11:54               ` Aaron Lu
@ 2025-05-27 14:16                 ` K Prateek Nayak
  0 siblings, 0 replies; 44+ messages in thread
From: K Prateek Nayak @ 2025-05-27 14:16 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Valentin Schneider, Ben Segall, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On 5/27/2025 5:24 PM, Aaron Lu wrote:
> Hi Prateek,
> 
> On Tue, May 27, 2025 at 04:49:36PM +0530, K Prateek Nayak wrote:
> ... ...
>> Since we now have an official per-task throttle indicator, what are your
>> thoughts on reusing "p->se.group_node" for throttled_limbo_list?
>>
> 
> I'm not sure. I can easily get confused when I see se.group_node and
> think it is something related to rq->cfs_tasks :) Maybe using a
> union could make it look better?
> 
> Anyway, if space is a concern then this is a good way to do it, thanks
> for the suggestion. I'll leave it to Peter to decide.

Ack! Was just trying something out. I don't think space is actually a
worry (yet!) looking at the number of members behind
CONFIG_CFS_BANDWIDTH :)

A union is a good idea, but if space is not a concern, it is great as is.

-- 
Thanks and Regards,
Prateek

> 
> Best wishes,
> Aaron


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-23  7:40     ` Aaron Lu
@ 2025-05-29 11:51       ` Aaron Lu
  2025-05-30  5:36         ` K Prateek Nayak
  0 siblings, 1 reply; 44+ messages in thread
From: Aaron Lu @ 2025-05-29 11:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

On Fri, May 23, 2025 at 03:40:14PM +0800, Aaron Lu wrote:
> On Thu, May 22, 2025 at 01:07:28PM +0200, Peter Zijlstra wrote:
> > On Tue, May 20, 2025 at 06:41:05PM +0800, Aaron Lu wrote:
> > > @@ -8851,6 +8913,7 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > >  {
> > >  	struct sched_entity *se;
> > >  	struct cfs_rq *cfs_rq;
> > > +	struct task_struct *p;
> > >  
> > >  again:
> > >  	cfs_rq = &rq->cfs;
> > > @@ -8871,7 +8934,14 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > >  		cfs_rq = group_cfs_rq(se);
> > >  	} while (cfs_rq);
> > >  
> > > -	return task_of(se);
> > > +	p = task_of(se);
> > > +	if (throttled_hierarchy(cfs_rq_of(se))) {
> > > +		/* Should not happen for now */
> > > +		WARN_ON_ONCE(1);
> > > +		task_throttle_setup_work(p);
> > > +	}
> > > +
> > > +	return p;
> > >  }
> > 
> > So the final code is a little different, because you're removing the
> > return value from check_cfs_rq_runtime().
> > 
> > But would not that exact return value be the thing you're now checking
> > for again?
> >
> 
> Ah yes.
> 
> > That is; at the end of the series, would not something like:
> > 
> > static struct task_struct *pick_task_fair(struct rq *rq)
> > {
> > 	struct sched_entity *se;
> > 	struct cfs_rq *cfs_rq;
> > 	struct task_struct *p;
> > 	bool throttled;
> > 
> > again:
> > 	cfs_rq = &rq->cfs;
> > 	if (!cfs_rq->nr_queued)
> > 		return NULL;
> > 
> > 	throttled = false;
> > 
> > 	do {
> > 		if (cfs_rq->curr && cfs_rq->curr->on_rq)
> > 			update_curr(cfs_rq);
> > 
> > 		throttled |= check_cfs_rq_runtime(cfs_rq);
> > 
> > 		se = pick_next_entity(rq, cfs_rq);
> > 		if (!se)
> > 			goto again;
> > 
> > 		cfs_rq = group_cfs_rq(se);
> > 	} while (cfs_rq);
> > 
> > 	p = task_of(se);
> > 	if (unlikely(throttled))
> > 		task_throttle_setup_work(p);
> > 	return p;
> > }
> > 
> > make it more obvious / be simpler?
> 
> Thanks for the suggestion, will follow it in next version.

Found a tiny difference while testing: check_cfs_rq_runtime() could
return false for a cfs_rq whose throttled_hierarchy() is true. The
reason is that a still-throttled cfs_rq may be assigned runtime by
another CPU doing distribute_cfs_runtime() and have an async unthrottle
queued for it that has not been processed yet. The end result is that
it has a positive runtime_remaining but is not unthrottled yet. I think
this doesn't make much difference, but thought it might be worth mentioning.
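
To be concrete, the window looks roughly like the sketch below. This is
my reading of the code, so treat it as a sketch rather than a precise
trace:

/*
 *  CPU A: distribute_cfs_runtime()        CPU B: pick_task_fair()
 *
 *  cfs_rq->runtime_remaining += runtime
 *  runtime_remaining > 0, so queue an
 *  async unthrottle for cfs_rq's CPU
 *                                          check_cfs_rq_runtime(cfs_rq)
 *                                            runtime_remaining > 0,
 *                                            returns false
 *                                          throttled_hierarchy(cfs_rq) is
 *                                          still true: the queued async
 *                                          unthrottle has not run yet
 *  async unthrottle runs, cfs_rq is
 *  unthrottled only now
 */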

A side note: now that check_cfs_rq_runtime() only marks the cfs_rq's
throttle status and returns a signal, it no longer does any dequeuing,
so I suppose there is no need to call it in put_prev_entity()?
That signal is now only useful at pick time, and we always run
check_cfs_rq_runtime() on every cfs_rq encountered during pick.

Also, check_enqueue_throttle() doesn't look useful either, because an
enqueued task will go through pick and we will add a throttle work to it
if needed. I removed both, ran some tests and didn't notice
anything wrong yet, but perhaps I missed something. Comments?
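
For reference, the removal I tried is roughly the below. I'm writing it
out from memory rather than pasting the exact hunks I tested, so take
the context with a grain of salt:

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
-	/* throttle cfs_rqs exceeding runtime */
-	check_cfs_rq_runtime(cfs_rq);
@@ static void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
-	check_enqueue_throttle(cfs_rq);

plus dropping the then-unused check_enqueue_throttle() definition itself.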

Best regards,
Aaron

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-29 11:51       ` Aaron Lu
@ 2025-05-30  5:36         ` K Prateek Nayak
  2025-05-30 11:02           ` Aaron Lu
  0 siblings, 1 reply; 44+ messages in thread
From: K Prateek Nayak @ 2025-05-30  5:36 UTC (permalink / raw)
  To: Aaron Lu, Peter Zijlstra
  Cc: Valentin Schneider, Ben Segall, Josh Don, Ingo Molnar,
	Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

Hello Aaron,

On 5/29/2025 5:21 PM, Aaron Lu wrote:
> A side note: now that check_cfs_rq_runtime() only marks the cfs_rq's
> throttle status and returns a signal, it no longer does any dequeuing,
> so I suppose there is no need to call it in put_prev_entity()?

But perhaps it is needed to queue the bandwidth timer via
__assign_cfs_rq_runtime() if the cfs_rq had built up slack when the last
task was dequeued?

Otherwise, the throttle will only be noted at the time of pick and the
timer will only start then, but check_cfs_rq_runtime() in
put_prev_entity() and check_enqueue_throttle() could have already
spotted the slack and queued the timer, which ensures bandwidth is
available sooner and reduces latency.
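
The chain I have in mind is roughly the one below, going by current
mainline and from memory, so please double check it against how this
patch reshuffles things:

/*
 * put_prev_entity(cfs_rq, prev)
 *   check_cfs_rq_runtime(cfs_rq)              // runtime_remaining <= 0
 *     throttle_cfs_rq(cfs_rq)
 *       __assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)
 *         start_cfs_bandwidth(cfs_b)          // (re)arms cfs_b->period_timer
 *
 * Without the put_prev_entity() caller, the timer is only (re)armed once
 * pick_task_fair() runs check_cfs_rq_runtime() again.
 */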

Let me know if I'm terribly mistaken :)

> That signal is now only useful at pick time, and we always run
> check_cfs_rq_runtime() on every cfs_rq encountered during pick.
> 
> Also, check_enqueue_throttle() doesn't look useful either, because an
> enqueued task will go through pick and we will add a throttle work to it
> if needed. I removed both, ran some tests and didn't notice
> anything wrong yet, but perhaps I missed something. Comments?
> 
> Best regards,
> Aaron

-- 
Thanks and Regards,
Prateek


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/7] sched/fair: prepare throttle path for task based throttle
  2025-05-30  5:36         ` K Prateek Nayak
@ 2025-05-30 11:02           ` Aaron Lu
  0 siblings, 0 replies; 44+ messages in thread
From: Aaron Lu @ 2025-05-30 11:02 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Valentin Schneider, Ben Segall, Josh Don,
	Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
	Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chengming Zhou,
	Chuyi Zhou, Jan Kiszka, Florian Bezdeka

Hi Prateek,

On Fri, May 30, 2025 at 11:06:52AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 5/29/2025 5:21 PM, Aaron Lu wrote:
> > A side note: now that check_cfs_rq_runtime() only marks the cfs_rq's
> > throttle status and returns a signal, it no longer does any dequeuing,
> > so I suppose there is no need to call it in put_prev_entity()?
> 
> But perhaps it is needed to queue the bandwidth timer via
> __assign_cfs_rq_runtime() if the cfs_rq had built up slack when the last
> task was dequeued?
> 
> Otherwise, the throttle will only be noted at the time of pick and the
> timer will only start then, but check_cfs_rq_runtime() in
> put_prev_entity() and check_enqueue_throttle() could have already
> spotted the slack and queued the timer, which ensures bandwidth is
> available sooner and reduces latency.
> 
> Let me know if I'm terribly mistaken :)
> 

Sounds reasonable to me, thanks for the insight!

Best wishes,
Aaron

> > That signal is now only useful at pick time, and we always run
> > check_cfs_rq_runtime() on every cfs_rq encountered during pick.
> > 
> > Also, check_enqueue_throttle() doesn't look useful either, because an
> > enqueued task will go through pick and we will add a throttle work to it
> > if needed. I removed both, ran some tests and didn't notice
> > anything wrong yet, but perhaps I missed something. Comments?
> > 
> > Best regards,
> > Aaron
> 
> -- 
> Thanks and Regards,
> Prateek
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2025-05-30 11:02 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-05-21  8:48   ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
2025-05-20 12:02   ` Florian Bezdeka
2025-05-21  6:37     ` Aaron Lu
2025-05-21 11:51       ` Aaron Lu
2025-05-21  9:01   ` Chengming Zhou
2025-05-21  9:21     ` [External] " Aaron Lu
2025-05-22 11:43       ` Chengming Zhou
2025-05-23  8:03         ` Aaron Lu
2025-05-22 10:48   ` Peter Zijlstra
2025-05-22 11:44     ` Aaron Lu
2025-05-22 11:54       ` Peter Zijlstra
2025-05-22 12:40         ` Aaron Lu
2025-05-23  9:53           ` Aaron Lu
2025-05-23 10:52             ` Peter Zijlstra
2025-05-23 11:17               ` Aaron Lu
2025-05-22 11:07   ` Peter Zijlstra
2025-05-23  7:40     ` Aaron Lu
2025-05-29 11:51       ` Aaron Lu
2025-05-30  5:36         ` K Prateek Nayak
2025-05-30 11:02           ` Aaron Lu
2025-05-23 12:35   ` Peter Zijlstra
2025-05-20 10:41 ` [PATCH 3/7] sched/fair: prepare unthrottle " Aaron Lu
2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-05-22 12:03   ` Peter Zijlstra
2025-05-22 12:49     ` Aaron Lu
2025-05-23 14:59       ` Peter Zijlstra
2025-05-26 11:36         ` Aaron Lu
2025-05-27  6:58           ` Aaron Lu
2025-05-27 11:19             ` K Prateek Nayak
2025-05-27 11:54               ` Aaron Lu
2025-05-27 14:16                 ` K Prateek Nayak
2025-05-23  2:43   ` Chengming Zhou
2025-05-23  7:56     ` Aaron Lu
2025-05-23  9:13       ` Chengming Zhou
2025-05-23  9:42         ` Aaron Lu
2025-05-23  9:53           ` Chengming Zhou
2025-05-23 11:59             ` Aaron Lu
2025-05-26 13:14               ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 5/7] sched/fair: switch to task based throttle model Aaron Lu
2025-05-20 10:41 ` [PATCH 6/7] sched/fair: task based throttle time accounting Aaron Lu
2025-05-20 10:41 ` [PATCH 7/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).