* [PATCH v3 0/5] Defer throttle when task exits to user
@ 2025-07-15 7:16 Aaron Lu
2025-07-15 7:16 ` [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
` (8 more replies)
0 siblings, 9 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
v3:
- Keep throttled cfs_rq's PELT clock running as long as it still has
entities queued, suggested by Benjamin Segall. I've folded this change
into patch 3;
- Rebased on top of tip/sched/core, commit 2885daf47081
("lib/smp_processor_id: Make migration check unconditional of SMP").
Hi Prateek,
I've kept your Tested-by tag (thanks!) from v2 since I believe this PELT
clock change should not affect things much, but let me know if you don't
think that is appropriate.
Tests I've done:
- Jan's rt deadlock reproducer[1]. Without this series, I saw rcu-stalls
within 2 minutes; with this series, no rcu-stalls were seen after running
for 10 minutes.
- A stress test that creates a lot of pressure on the fork/exit path and
cgroup_threadgroup_rwsem. Without this series, the test causes a task
hung in about 5 minutes; with this series, no problem was found after
running for several hours. Songtang wrote this test script and I've used
it to verify the patches, thanks Songtang.
[1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
v2:
- Re-org the patchset to use a single patch to implement throttle
related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
decide if throttle task work is needed instead of checking
throttled_hierarchy(), suggested by Peter;
- Simplify throttle_count check in tg_throttle_down() and
tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueuing a throttled task to
a throttled cfs_rq, suggested by Peter;
- Address the missing detach_task_cfs_rq() for throttled tasks that
get migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
cond_resched*() is going away, pointed out by Peter.
I hope I haven't missed any comments or suggestions for v1; if I have,
please kindly let me know, thanks!
Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
of double_rq_clock_clear_update()")
cover letter of v1:
This is continued work based on Valentin Schneider's posting here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)
If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.
This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process and
softirq contexts.
The linux-rt tree has had
1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"
Jan Kiszka has posted a reproducer of this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/
To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based, i.e. when a cfs_rq
is throttled, mark its throttled status but do not remove it from the cpu's
rq. Instead, for tasks that belong to this cfs_rq, add a task work to them
when they get picked so that when they return to user space, they can be
dequeued there. In this way, throttled tasks will not hold any kernel
resources. When the cfs_rq gets unthrottled, enqueue those throttled tasks
back so they can run again. A condensed sketch of this flow is shown below.
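In code terms, the deferral is built on the kernel's task_work mechanism.
Below is a condensed sketch of what patches 2 and 3 implement (locking,
the PF_EXITING/PF_KTHREAD checks, PELT handling and rescheduling are
omitted here; see the patches for the real code):

/* Return-to-user: the armed task work dequeues the task and parks it. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	dequeue_task_fair(rq_of(cfs_rq), p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
	p->throttled = true;
}

/* Unthrottle (in tg_unthrottle_up()): put the parked tasks back. */
list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
	list_del_init(&p->throttle_node);
	enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
}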
This new throttle model has consequences, e.g. for a cfs_rq that has 3
tasks attached, when 2 tasks are throttled on their return-to-user path
while one task is still running in kernel mode, this cfs_rq is in a
partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?
For the PELT clock, I chose to keep the current behaviour and freeze it at
the cfs_rq's throttle time. The assumption is that tasks running in kernel
mode should not last too long; freezing the cfs_rq's PELT clock keeps its
load and its corresponding sched_entity's weight stable. Hopefully, this
results in a stable situation that lets the remaining running tasks quickly
finish their jobs in kernel mode.
For throttle time accounting, following the feedback on RFC v2, rework the
throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its
hierarchy;
- stop accounting on unthrottle.
There was also a concern about the increased duration of (un)throttle
operations raised for RFC v1. I've done some tests and, with a 2000
cgroups/20K runnable tasks setup on a 2-socket/384-CPU AMD server, the
longest duration of distribute_cfs_runtime() is in the 2ms-4ms range. For
details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the task work
setup from throttle time to pick time, this is not an issue anymore. The
pick-time setup is condensed below.
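Condensed from patch 3's pick_task_fair(): the throttle status of each
level is collected while walking down the hierarchy and, if set, the task
work is armed on the picked task (simplified; see the patch for the full
retry and NULL handling):

	bool throttled = false;

	do {
		/* Might not have done put_prev_entity() */
		if (cfs_rq->curr && cfs_rq->curr->on_rq)
			update_curr(cfs_rq);
		/* note whether any level just ran out of runtime */
		throttled |= check_cfs_rq_runtime(cfs_rq);
		se = pick_next_entity(rq, cfs_rq);
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);
	if (unlikely(throttled))
		task_throttle_setup_work(p);	/* arm the return-to-user work */
	return p;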
Aaron Lu (2):
sched/fair: Task based throttle time accounting
sched/fair: Get rid of throttled_lb_pair()
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Implement throttle task work and related helpers
sched/fair: Switch to task based throttle model
include/linux/sched.h | 5 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 451 ++++++++++++++++++++++++------------------
kernel/sched/pelt.h | 4 +-
kernel/sched/sched.h | 7 +-
5 files changed, 274 insertions(+), 196 deletions(-)
--
2.39.5
* [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
@ 2025-07-15 7:16 ` Aaron Lu
2025-07-15 7:16 ` [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
` (7 subsequent siblings)
8 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
From: Valentin Schneider <vschneid@redhat.com>
Add related data structures for this new throttle functionality.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
include/linux/sched.h | 5 +++++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 13 +++++++++++++
kernel/sched/sched.h | 3 +++
4 files changed, 24 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 55921385927d8..ec4b54540c244 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -883,6 +883,11 @@ struct task_struct {
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
+#ifdef CONFIG_CFS_BANDWIDTH
+ struct callback_head sched_throttle_work;
+ struct list_head throttle_node;
+ bool throttled;
+#endif
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2f8caa9db78d5..410acc7435e86 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4446,6 +4446,9 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = NULL;
+#ifdef CONFIG_CFS_BANDWIDTH
+ init_cfs_throttle_work(p);
+#endif
#endif
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 20a845697c1dc..c072e87c5bd9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,6 +5742,18 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static void throttle_cfs_rq_work(struct callback_head *work)
+{
+}
+
+void init_cfs_throttle_work(struct task_struct *p)
+{
+ init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
+ /* Protect against double add, see throttle_cfs_rq() and throttle_cfs_rq_work() */
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+ INIT_LIST_HEAD(&p->throttle_node);
+}
+
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -6466,6 +6478,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
cfs_rq->runtime_enabled = 0;
INIT_LIST_HEAD(&cfs_rq->throttled_list);
INIT_LIST_HEAD(&cfs_rq->throttled_csd_list);
+ INIT_LIST_HEAD(&cfs_rq->throttled_limbo_list);
}
void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 105190b180203..b0c9559992d8a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -741,6 +741,7 @@ struct cfs_rq {
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
+ struct list_head throttled_limbo_list;
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
@@ -2640,6 +2641,8 @@ extern bool sched_rt_bandwidth_account(struct rt_rq *rt_rq);
extern void init_dl_entity(struct sched_dl_entity *dl_se);
+extern void init_cfs_throttle_work(struct task_struct *p);
+
#define BW_SHIFT 20
#define BW_UNIT (1 << BW_SHIFT)
#define RATIO_SHIFT 8
--
2.39.5
* [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-15 7:16 ` [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
@ 2025-07-15 7:16 ` Aaron Lu
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
` (6 subsequent siblings)
8 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
From: Valentin Schneider <vschneid@redhat.com>
Implement the throttle_cfs_rq_work() task work, which gets executed on the
task's ret2user path, where the task is dequeued and marked as throttled.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 65 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c072e87c5bd9f..54c2a4df6a5d1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5742,8 +5742,51 @@ static inline int throttled_lb_pair(struct task_group *tg,
throttled_hierarchy(dest_cfs_rq);
}
+static inline bool task_is_throttled(struct task_struct *p)
+{
+ return p->throttled;
+}
+
+static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static void throttle_cfs_rq_work(struct callback_head *work)
{
+ struct task_struct *p = container_of(work, struct task_struct, sched_throttle_work);
+ struct sched_entity *se;
+ struct cfs_rq *cfs_rq;
+ struct rq *rq;
+
+ WARN_ON_ONCE(p != current);
+ p->sched_throttle_work.next = &p->sched_throttle_work;
+
+ /*
+ * If task is exiting, then there won't be a return to userspace, so we
+ * don't have to bother with any of this.
+ */
+ if ((p->flags & PF_EXITING))
+ return;
+
+ scoped_guard(task_rq_lock, p) {
+ se = &p->se;
+ cfs_rq = cfs_rq_of(se);
+
+ /* Raced, forget */
+ if (p->sched_class != &fair_sched_class)
+ return;
+
+ /*
+ * If not in limbo, then either replenish has happened or this
+ * task got migrated out of the throttled cfs_rq, move along.
+ */
+ if (!cfs_rq->throttle_count)
+ return;
+ rq = scope.rq;
+ update_rq_clock(rq);
+ WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
+ dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ p->throttled = true;
+ resched_curr(rq);
+ }
}
void init_cfs_throttle_work(struct task_struct *p)
@@ -5783,6 +5826,26 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
return 0;
}
+static inline bool task_has_throttle_work(struct task_struct *p)
+{
+ return p->sched_throttle_work.next != &p->sched_throttle_work;
+}
+
+static inline void task_throttle_setup_work(struct task_struct *p)
+{
+ if (task_has_throttle_work(p))
+ return;
+
+ /*
+ * Kthreads and exiting tasks don't return to userspace, so adding the
+ * work is pointless
+ */
+ if ((p->flags & (PF_EXITING | PF_KTHREAD)))
+ return;
+
+ task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -6646,6 +6709,8 @@ static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
+static void task_throttle_setup_work(struct task_struct *p) {}
+static bool task_is_throttled(struct task_struct *p) { return false; }
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
--
2.39.5
* [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-15 7:16 ` [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-07-15 7:16 ` [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
@ 2025-07-15 7:16 ` Aaron Lu
2025-07-15 23:29 ` kernel test robot
` (3 more replies)
2025-07-15 7:16 ` [PATCH v3 4/5] sched/fair: Task based throttle time accounting Aaron Lu
` (5 subsequent siblings)
8 siblings, 4 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
From: Valentin Schneider <vschneid@redhat.com>
In the current throttle model, when a cfs_rq is throttled, its entity is
dequeued from the cpu's rq, making tasks attached to it unable to run and
thus achieving the throttle target.
This has a drawback though: assume a task is a reader of a percpu_rwsem
and is waiting. When it gets woken, it can not run until its task group's
next period comes, which can be a relatively long time. The waiting writer
has to wait longer because of this, and it also makes further readers
build up and eventually triggers task hung.
To improve this situation, change the throttle model to be task based,
i.e. when a cfs_rq is throttled, record its throttled status but do not
remove it from the cpu's rq. Instead, for tasks that belong to this
cfs_rq, add a task work to them when they get picked so that when they
return to user space, they can be dequeued there. In this way, throttled
tasks will not hold any kernel resources. And on unthrottle, enqueue those
tasks back so they can continue to run.
A throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock was stopped once it entered the throttled state, but
since tasks (in kernel mode) can now continue to run, change the behaviour
to stop the PELT clock only when the throttled cfs_rq has no tasks left.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 336 ++++++++++++++++++++++---------------------
kernel/sched/pelt.h | 4 +-
kernel/sched/sched.h | 3 +-
3 files changed, 176 insertions(+), 167 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54c2a4df6a5d1..0eeea7f2e693d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,18 +5285,23 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq);
- if (!throttled_hierarchy(cfs_rq)) {
- list_add_leaf_cfs_rq(cfs_rq);
- } else {
+ list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
struct rq *rq = rq_of(cfs_rq);
if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
+
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
}
+#endif
}
}
@@ -5335,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable--;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5357,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable++;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5444,8 +5445,18 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & DEQUEUE_DELAYED)
finish_delayed_dequeue_entity(se);
- if (cfs_rq->nr_queued == 0)
+ if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
+#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ list_del_leaf_cfs_rq(cfs_rq);
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
+ }
+#endif
+ }
return true;
}
@@ -5784,6 +5795,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ /*
+ * Must not set throttled before dequeue or dequeue will
+ * mistakenly regard this task as an already throttled one.
+ */
p->throttled = true;
resched_curr(rq);
}
@@ -5797,32 +5812,119 @@ void init_cfs_throttle_work(struct task_struct *p)
INIT_LIST_HEAD(&p->throttle_node);
}
+/*
+ * Task is throttled and someone wants to dequeue it again:
+ * it could be sched/core when core needs to do things like
+ * task affinity change, task group change, task sched class
+ * change etc. and in these cases, DEQUEUE_SLEEP is not set;
+ * or the task is blocked after throttled due to freezer etc.
+ * and in these cases, DEQUEUE_SLEEP is set.
+ */
+static void detach_task_cfs_rq(struct task_struct *p);
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+ WARN_ON_ONCE(p->se.on_rq);
+ list_del_init(&p->throttle_node);
+
+ /* task blocked after throttled */
+ if (flags & DEQUEUE_SLEEP) {
+ p->throttled = false;
+ return;
+ }
+
+ /*
+ * task is migrating off its old cfs_rq, detach
+ * the task's load from its old cfs_rq.
+ */
+ if (task_on_rq_migrating(p))
+ detach_task_cfs_rq(p);
+}
+
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+ /*
+ * If the throttled task is enqueued to a throttled cfs_rq,
+ * take the fast path by directly put the task on target
+ * cfs_rq's limbo list, except when p is current because
+ * the following race can cause p's group_node left in rq's
+ * cfs_tasks list when it's throttled:
+ *
+ * cpuX cpuY
+ * taskA ret2user
+ * throttle_cfs_rq_work() sched_move_task(taskA)
+ * task_rq_lock acquired
+ * dequeue_task_fair(taskA)
+ * task_rq_lock released
+ * task_rq_lock acquired
+ * task_current_donor(taskA) == true
+ * task_on_rq_queued(taskA) == true
+ * dequeue_task(taskA)
+ * put_prev_task(taskA)
+ * sched_change_group()
+ * enqueue_task(taskA) -> taskA's new cfs_rq
+ * is throttled, go
+ * fast path and skip
+ * actual enqueue
+ * set_next_task(taskA)
+ * __set_next_task_fair(taskA)
+ * list_move(&se->group_node, &rq->cfs_tasks); // bug
+ * schedule()
+ *
+ * And in the above race case, the task's current cfs_rq is in the same
+ * rq as its previous cfs_rq because sched_move_task() doesn't migrate
+ * task so we can use its current cfs_rq to derive rq and test if the
+ * task is current.
+ */
+ if (throttled_hierarchy(cfs_rq) &&
+ !task_current_donor(rq_of(cfs_rq), p)) {
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ return true;
+ }
+
+ /* we can't take the fast path, do an actual enqueue */
+ p->throttled = false;
+ return false;
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct task_struct *p, *tmp;
+
+ if (--cfs_rq->throttle_count)
+ return 0;
- cfs_rq->throttle_count--;
- if (!cfs_rq->throttle_count) {
+ if (cfs_rq->pelt_clock_throttled) {
cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
- /* Add cfs_rq with load or one or more already running entities to the list */
- if (!cfs_rq_is_decayed(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ if (cfs_rq->throttled_clock_self) {
+ u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
- if (cfs_rq->throttled_clock_self) {
- u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+ cfs_rq->throttled_clock_self = 0;
- cfs_rq->throttled_clock_self = 0;
+ if (WARN_ON_ONCE((s64)delta < 0))
+ delta = 0;
- if (WARN_ON_ONCE((s64)delta < 0))
- delta = 0;
+ cfs_rq->throttled_clock_self_time += delta;
+ }
- cfs_rq->throttled_clock_self_time += delta;
- }
+ /* Re-enqueue the tasks that have been throttled at this level. */
+ list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+ list_del_init(&p->throttle_node);
+ enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
}
+ /* Add cfs_rq with load or one or more already running entities to the list */
+ if (!cfs_rq_is_decayed(cfs_rq))
+ list_add_leaf_cfs_rq(cfs_rq);
+
return 0;
}
@@ -5851,17 +5953,25 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ if (cfs_rq->throttle_count++)
+ return 0;
+
+
/* group is entering throttled state, stop time */
- if (!cfs_rq->throttle_count) {
- cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+ if (cfs_rq->nr_queued)
+ cfs_rq->throttled_clock_self = rq_clock(rq);
+ else {
+ /*
+ * For cfs_rqs that still have entities enqueued, PELT clock
+ * stop happens at dequeue time when all entities are dequeued.
+ */
list_del_leaf_cfs_rq(cfs_rq);
-
- WARN_ON_ONCE(cfs_rq->throttled_clock_self);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock_self = rq_clock(rq);
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
}
- cfs_rq->throttle_count++;
+ WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
@@ -5869,8 +5979,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta, dequeue = 1;
+ int dequeue = 1;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5893,68 +6002,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- int flags;
-
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- /*
- * Abuse SPECIAL to avoid delayed dequeue in this instance.
- * This avoids teaching dequeue_entities() about throttled
- * entities and keeps things relatively simple.
- */
- flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
- if (se->sched_delayed)
- flags |= DEQUEUE_DELAYED;
- dequeue_entity(qcfs_rq, se, flags);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
-
- if (qcfs_rq->load.weight) {
- /* Avoid re-evaluating load for this entity: */
- se = parent_entity(se);
- break;
- }
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- update_load_avg(qcfs_rq, se, 0);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
- }
-
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, queued_delta);
-done:
/*
* Note: distribution will already see us throttled via the
* throttled-list. rq->lock protects completion.
@@ -5970,9 +6022,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta;
- long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+
+ /*
+ * It's possible we are called with !runtime_remaining due to things
+ * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+ * unthrottled us with a positive runtime_remaining but other still
+ * running entities consumed those runtime before we reached here.
+ *
+ * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+ * because any enqueue in tg_unthrottle_up() will immediately trigger a
+ * throttle, which is not supposed to happen on unthrottle path.
+ */
+ if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+ return;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6002,62 +6065,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
break;
}
- goto unthrottle_throttle;
}
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- /* Handle any unfinished DELAY_DEQUEUE business first. */
- if (se->sched_delayed) {
- int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
- dequeue_entity(qcfs_rq, se, flags);
- } else if (se->on_rq)
- break;
- enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- update_load_avg(qcfs_rq, se, UPDATE_TG);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
-
- /* Start the fair server if un-throttling resulted in new runnable tasks */
- if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
- dl_server_start(&rq->fair_server);
-
- /* At this point se is NULL and we are at root level*/
- add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
assert_list_leaf_cfs_rq(rq);
/* Determine whether we need to wake up potentially idle CPU: */
@@ -6711,6 +6720,8 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
@@ -6903,6 +6914,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;
+ if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+ return;
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6955,10 +6969,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
-
flags = ENQUEUE_WAKEUP;
}
@@ -6980,10 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
}
if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7013,7 +7019,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
check_update_overutilized_status(rq);
-enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
@@ -7068,10 +7073,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
-
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
slice = cfs_rq_min_slice(cfs_rq);
@@ -7108,10 +7109,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
}
sub_nr_running(rq, h_nr_queued);
@@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
+ if (unlikely(task_is_throttled(p))) {
+ dequeue_throttled_task(p, flags);
+ return true;
+ }
+
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
@@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ struct task_struct *p;
+ bool throttled;
again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;
+ throttled = false;
+
do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
+ throttled |= check_cfs_rq_runtime(cfs_rq);
se = pick_next_entity(rq, cfs_rq);
if (!se)
@@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
- return task_of(se);
+ p = task_of(se);
+ if (unlikely(throttled))
+ task_throttle_setup_work(p);
+ return p;
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
u64 throttled;
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
throttled = U64_MAX;
else
throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b0c9559992d8a..fc697d4bf6685 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,7 +737,8 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled;
+ int throttled:1;
+ int pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
--
2.39.5
* [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (2 preceding siblings ...)
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
@ 2025-07-15 7:16 ` Aaron Lu
2025-08-18 14:57 ` Valentin Schneider
2025-07-15 7:16 ` [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
` (4 subsequent siblings)
8 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
With the task based throttle model, the previous way of checking a cfs_rq's
nr_queued to decide whether throttled time should be accounted doesn't work
as expected, e.g. when a cfs_rq which has a single task is throttled, that
task could later block in kernel mode instead of being dequeued onto the
limbo list, and accounting this period as throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # accounting mechanism
Co-developed-by: K Prateek Nayak <kprateek.nayak@amd.com> # simplify implementation
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 56 ++++++++++++++++++++++++--------------------
kernel/sched/sched.h | 1 +
2 files changed, 32 insertions(+), 25 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0eeea7f2e693d..6f534fbe89bcf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
check_enqueue_throttle(cfs_rq);
list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
- if (throttled_hierarchy(cfs_rq)) {
+ if (cfs_rq->pelt_clock_throttled) {
struct rq *rq = rq_of(cfs_rq);
- if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
- cfs_rq->throttled_clock = rq_clock(rq);
- if (!cfs_rq->throttled_clock_self)
- cfs_rq->throttled_clock_self = rq_clock(rq);
-
- if (cfs_rq->pelt_clock_throttled) {
- cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
- cfs_rq->throttled_clock_pelt;
- cfs_rq->pelt_clock_throttled = 0;
- }
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
}
#endif
}
@@ -5387,7 +5380,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* DELAY_DEQUEUE relies on spurious wakeups, special task
* states must not suffer spurious wakeups, excempt them.
*/
- if (flags & DEQUEUE_SPECIAL)
+ if (flags & (DEQUEUE_SPECIAL | DEQUEUE_THROTTLE))
delay = false;
WARN_ON_ONCE(delay && se->sched_delayed);
@@ -5793,7 +5786,7 @@ static void throttle_cfs_rq_work(struct callback_head *work)
rq = scope.rq;
update_rq_clock(rq);
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
- dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
+ dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_THROTTLE);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
/*
* Must not set throttled before dequeue or dequeue will
@@ -5948,6 +5941,17 @@ static inline void task_throttle_setup_work(struct task_struct *p)
task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}
+static void record_throttle_clock(struct cfs_rq *cfs_rq)
+{
+ struct rq *rq = rq_of(cfs_rq);
+
+ if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
+ cfs_rq->throttled_clock = rq_clock(rq);
+
+ if (!cfs_rq->throttled_clock_self)
+ cfs_rq->throttled_clock_self = rq_clock(rq);
+}
+
static int tg_throttle_down(struct task_group *tg, void *data)
{
struct rq *rq = data;
@@ -5956,21 +5960,17 @@ static int tg_throttle_down(struct task_group *tg, void *data)
if (cfs_rq->throttle_count++)
return 0;
-
- /* group is entering throttled state, stop time */
- WARN_ON_ONCE(cfs_rq->throttled_clock_self);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock_self = rq_clock(rq);
- else {
- /*
- * For cfs_rqs that still have entities enqueued, PELT clock
- * stop happens at dequeue time when all entities are dequeued.
- */
+ /*
+ * For cfs_rqs that still have entities enqueued, PELT clock
+ * stop happens at dequeue time when all entities are dequeued.
+ */
+ if (!cfs_rq->nr_queued) {
list_del_leaf_cfs_rq(cfs_rq);
cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
cfs_rq->pelt_clock_throttled = 1;
}
+ WARN_ON_ONCE(cfs_rq->throttled_clock_self);
WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
@@ -6013,8 +6013,6 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
*/
cfs_rq->throttled = 1;
WARN_ON_ONCE(cfs_rq->throttled_clock);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock = rq_clock(rq);
return true;
}
@@ -6722,6 +6720,7 @@ static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
static void dequeue_throttled_task(struct task_struct *p, int flags) {}
static bool enqueue_throttled_task(struct task_struct *p) { return false; }
+static void record_throttle_clock(struct cfs_rq *cfs_rq) {}
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
@@ -7040,6 +7039,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
bool was_sched_idle = sched_idle_rq(rq);
bool task_sleep = flags & DEQUEUE_SLEEP;
bool task_delayed = flags & DEQUEUE_DELAYED;
+ bool task_throttled = flags & DEQUEUE_THROTTLE;
struct task_struct *p = NULL;
int h_nr_idle = 0;
int h_nr_queued = 0;
@@ -7073,6 +7073,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
+ if (throttled_hierarchy(cfs_rq) && task_throttled)
+ record_throttle_clock(cfs_rq);
+
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
slice = cfs_rq_min_slice(cfs_rq);
@@ -7109,6 +7112,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
+
+ if (throttled_hierarchy(cfs_rq) && task_throttled)
+ record_throttle_clock(cfs_rq);
}
sub_nr_running(rq, h_nr_queued);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fc697d4bf6685..dbe52e18b93a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2326,6 +2326,7 @@ extern const u32 sched_prio_to_wmult[40];
#define DEQUEUE_SPECIAL 0x10
#define DEQUEUE_MIGRATING 0x100 /* Matches ENQUEUE_MIGRATING */
#define DEQUEUE_DELAYED 0x200 /* Matches ENQUEUE_DELAYED */
+#define DEQUEUE_THROTTLE 0x800
#define ENQUEUE_WAKEUP 0x01
#define ENQUEUE_RESTORE 0x02
--
2.39.5
* [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair()
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (3 preceding siblings ...)
2025-07-15 7:16 ` [PATCH v3 4/5] sched/fair: Task based throttle time accounting Aaron Lu
@ 2025-07-15 7:16 ` Aaron Lu
2025-07-15 7:22 ` [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (3 subsequent siblings)
8 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:16 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
Now that throttled tasks are dequeued and can not stay on the rq's
cfs_tasks list, there is no need to take special care of these throttled
tasks in load balance anymore.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/fair.c | 33 +++------------------------------
1 file changed, 3 insertions(+), 30 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6f534fbe89bcf..af33d107d8034 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5729,23 +5729,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return cfs_bandwidth_used() && cfs_rq->throttle_count;
}
-/*
- * Ensure that neither of the group entities corresponding to src_cpu or
- * dest_cpu are members of a throttled hierarchy when performing group
- * load-balance operations.
- */
-static inline int throttled_lb_pair(struct task_group *tg,
- int src_cpu, int dest_cpu)
-{
- struct cfs_rq *src_cfs_rq, *dest_cfs_rq;
-
- src_cfs_rq = tg->cfs_rq[src_cpu];
- dest_cfs_rq = tg->cfs_rq[dest_cpu];
-
- return throttled_hierarchy(src_cfs_rq) ||
- throttled_hierarchy(dest_cfs_rq);
-}
-
static inline bool task_is_throttled(struct task_struct *p)
{
return p->throttled;
@@ -6732,12 +6715,6 @@ static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
return 0;
}
-static inline int throttled_lb_pair(struct task_group *tg,
- int src_cpu, int dest_cpu)
-{
- return 0;
-}
-
#ifdef CONFIG_FAIR_GROUP_SCHED
void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth *parent) {}
static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
@@ -9374,17 +9351,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) throttled_lb_pair, or
- * 3) cannot be migrated to this CPU due to cpus_ptr, or
- * 4) running (obviously), or
- * 5) are cache-hot on their current CPU.
+ * 2) cannot be migrated to this CPU due to cpus_ptr, or
+ * 3) running (obviously), or
+ * 4) are cache-hot on their current CPU.
*/
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
- if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
- return 0;
-
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
--
2.39.5
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (4 preceding siblings ...)
2025-07-15 7:16 ` [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
@ 2025-07-15 7:22 ` Aaron Lu
2025-08-01 14:31 ` Matteo Martelli
` (2 subsequent siblings)
8 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-15 7:22 UTC (permalink / raw)
To: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
[-- Attachment #1: Type: text/plain, Size: 414 bytes --]
On Tue, Jul 15, 2025 at 03:16:53PM +0800, Aaron Lu wrote:
> - A stress test that creates a lot of pressure on fork/exit path and
> cgroup_threadgroup_rwsem. Without this series, the test will cause
> task hung in about 5 minutes and with this series, no problem found
> after several hours. Songtang wrote this test script and I've used it
> to verify the patches, thanks Songtang.
Test scripts attached.
[-- Attachment #2: cg.sh --]
[-- Type: application/x-sh, Size: 1118 bytes --]
[-- Attachment #3: test.sh --]
[-- Type: application/x-sh, Size: 2219 bytes --]
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
@ 2025-07-15 23:29 ` kernel test robot
2025-07-16 6:57 ` Aaron Lu
2025-07-16 15:20 ` kernel test robot
` (2 subsequent siblings)
3 siblings, 1 reply; 48+ messages in thread
From: kernel test robot @ 2025-07-15 23:29 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang
Cc: llvm, oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu
Hi Aaron,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20250715]
[cannot apply to linus/master v6.16-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
config: i386-buildonly-randconfig-006-20250716 (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507160730.0cXkgs0S-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> kernel/sched/fair.c:5456:33: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
5456 | cfs_rq->pelt_clock_throttled = 1;
| ^ ~
kernel/sched/fair.c:5971:32: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
5971 | cfs_rq->pelt_clock_throttled = 1;
| ^ ~
kernel/sched/fair.c:6014:20: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
6014 | cfs_rq->throttled = 1;
| ^ ~
3 warnings generated.
vim +/int +5456 kernel/sched/fair.c
5372
5373 static bool
5374 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
5375 {
5376 bool sleep = flags & DEQUEUE_SLEEP;
5377 int action = UPDATE_TG;
5378
5379 update_curr(cfs_rq);
5380 clear_buddies(cfs_rq, se);
5381
5382 if (flags & DEQUEUE_DELAYED) {
5383 WARN_ON_ONCE(!se->sched_delayed);
5384 } else {
5385 bool delay = sleep;
5386 /*
5387 * DELAY_DEQUEUE relies on spurious wakeups, special task
5388 * states must not suffer spurious wakeups, excempt them.
5389 */
5390 if (flags & DEQUEUE_SPECIAL)
5391 delay = false;
5392
5393 WARN_ON_ONCE(delay && se->sched_delayed);
5394
5395 if (sched_feat(DELAY_DEQUEUE) && delay &&
5396 !entity_eligible(cfs_rq, se)) {
5397 update_load_avg(cfs_rq, se, 0);
5398 set_delayed(se);
5399 return false;
5400 }
5401 }
5402
5403 if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
5404 action |= DO_DETACH;
5405
5406 /*
5407 * When dequeuing a sched_entity, we must:
5408 * - Update loads to have both entity and cfs_rq synced with now.
5409 * - For group_entity, update its runnable_weight to reflect the new
5410 * h_nr_runnable of its group cfs_rq.
5411 * - Subtract its previous weight from cfs_rq->load.weight.
5412 * - For group entity, update its weight to reflect the new share
5413 * of its group cfs_rq.
5414 */
5415 update_load_avg(cfs_rq, se, action);
5416 se_update_runnable(se);
5417
5418 update_stats_dequeue_fair(cfs_rq, se, flags);
5419
5420 update_entity_lag(cfs_rq, se);
5421 if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
5422 se->deadline -= se->vruntime;
5423 se->rel_deadline = 1;
5424 }
5425
5426 if (se != cfs_rq->curr)
5427 __dequeue_entity(cfs_rq, se);
5428 se->on_rq = 0;
5429 account_entity_dequeue(cfs_rq, se);
5430
5431 /* return excess runtime on last dequeue */
5432 return_cfs_rq_runtime(cfs_rq);
5433
5434 update_cfs_group(se);
5435
5436 /*
5437 * Now advance min_vruntime if @se was the entity holding it back,
5438 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
5439 * put back on, and if we advance min_vruntime, we'll be placed back
5440 * further than we started -- i.e. we'll be penalized.
5441 */
5442 if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
5443 update_min_vruntime(cfs_rq);
5444
5445 if (flags & DEQUEUE_DELAYED)
5446 finish_delayed_dequeue_entity(se);
5447
5448 if (cfs_rq->nr_queued == 0) {
5449 update_idle_cfs_rq_clock_pelt(cfs_rq);
5450 #ifdef CONFIG_CFS_BANDWIDTH
5451 if (throttled_hierarchy(cfs_rq)) {
5452 struct rq *rq = rq_of(cfs_rq);
5453
5454 list_del_leaf_cfs_rq(cfs_rq);
5455 cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> 5456 cfs_rq->pelt_clock_throttled = 1;
5457 }
5458 #endif
5459 }
5460
5461 return true;
5462 }
5463
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 23:29 ` kernel test robot
@ 2025-07-16 6:57 ` Aaron Lu
2025-07-16 7:40 ` Philip Li
2025-07-16 11:27 ` [PATCH v3 " Peter Zijlstra
0 siblings, 2 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-16 6:57 UTC (permalink / raw)
To: kernel test robot
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
llvm, oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu
On Wed, Jul 16, 2025 at 07:29:37AM +0800, kernel test robot wrote:
> Hi Aaron,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on next-20250715]
> [cannot apply to linus/master v6.16-rc6]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> base: tip/sched/core
> patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> config: i386-buildonly-randconfig-006-20250716 (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/config)
> compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507160730.0cXkgs0S-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> >> kernel/sched/fair.c:5456:33: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> 5456 | cfs_rq->pelt_clock_throttled = 1;
> | ^ ~
Thanks for the report.
I don't think this will affect correctness since both cfs_rq's throttled
and pelt_clock_throttled fields are used as true (not 0) or false (0). I
used bitfields for them to save some space.
Changing their types to either unsigned int or bool should cure this
warning; I suppose bool looks clearer?
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dbe52e18b93a0..434f816a56701 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,8 +737,8 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled:1;
- int pelt_clock_throttled:1;
+ bool throttled:1;
+ bool pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
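For reference, a minimal userspace snippet (not part of the patch)
illustrating the warning: with clang/gcc a plain int bit-field is treated
as signed, so its single bit is the sign bit and storing 1 stores -1,
whereas a bool (or unsigned) bit-field holds 0/1 as expected:

#include <stdio.h>
#include <stdbool.h>

struct flags {
	int  signed_bit:1;	/* value range is -1..0 */
	bool bool_bit:1;	/* value range is 0..1 */
};

int main(void)
{
	struct flags f = { .signed_bit = 1, .bool_bit = 1 };

	/* prints "-1 1" where plain int bit-fields are signed (clang/gcc) */
	printf("%d %d\n", f.signed_bit, f.bool_bit);
	return 0;
}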
Hi LKP,
I tried using clang-19 but couldn't reproduce this warning and I don't
have clang-20 at hand. Can you please apply the above hunk on top of
this series and see if that warning is gone? Thanks.
Best regards,
Aaron
> kernel/sched/fair.c:5971:32: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> 5971 | cfs_rq->pelt_clock_throttled = 1;
> | ^ ~
> kernel/sched/fair.c:6014:20: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> 6014 | cfs_rq->throttled = 1;
> | ^ ~
> 3 warnings generated.
>
>
> vim +/int +5456 kernel/sched/fair.c
>
> 5372
> 5373 static bool
> 5374 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 5375 {
> 5376 bool sleep = flags & DEQUEUE_SLEEP;
> 5377 int action = UPDATE_TG;
> 5378
> 5379 update_curr(cfs_rq);
> 5380 clear_buddies(cfs_rq, se);
> 5381
> 5382 if (flags & DEQUEUE_DELAYED) {
> 5383 WARN_ON_ONCE(!se->sched_delayed);
> 5384 } else {
> 5385 bool delay = sleep;
> 5386 /*
> 5387 * DELAY_DEQUEUE relies on spurious wakeups, special task
> 5388 * states must not suffer spurious wakeups, excempt them.
> 5389 */
> 5390 if (flags & DEQUEUE_SPECIAL)
> 5391 delay = false;
> 5392
> 5393 WARN_ON_ONCE(delay && se->sched_delayed);
> 5394
> 5395 if (sched_feat(DELAY_DEQUEUE) && delay &&
> 5396 !entity_eligible(cfs_rq, se)) {
> 5397 update_load_avg(cfs_rq, se, 0);
> 5398 set_delayed(se);
> 5399 return false;
> 5400 }
> 5401 }
> 5402
> 5403 if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> 5404 action |= DO_DETACH;
> 5405
> 5406 /*
> 5407 * When dequeuing a sched_entity, we must:
> 5408 * - Update loads to have both entity and cfs_rq synced with now.
> 5409 * - For group_entity, update its runnable_weight to reflect the new
> 5410 * h_nr_runnable of its group cfs_rq.
> 5411 * - Subtract its previous weight from cfs_rq->load.weight.
> 5412 * - For group entity, update its weight to reflect the new share
> 5413 * of its group cfs_rq.
> 5414 */
> 5415 update_load_avg(cfs_rq, se, action);
> 5416 se_update_runnable(se);
> 5417
> 5418 update_stats_dequeue_fair(cfs_rq, se, flags);
> 5419
> 5420 update_entity_lag(cfs_rq, se);
> 5421 if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
> 5422 se->deadline -= se->vruntime;
> 5423 se->rel_deadline = 1;
> 5424 }
> 5425
> 5426 if (se != cfs_rq->curr)
> 5427 __dequeue_entity(cfs_rq, se);
> 5428 se->on_rq = 0;
> 5429 account_entity_dequeue(cfs_rq, se);
> 5430
> 5431 /* return excess runtime on last dequeue */
> 5432 return_cfs_rq_runtime(cfs_rq);
> 5433
> 5434 update_cfs_group(se);
> 5435
> 5436 /*
> 5437 * Now advance min_vruntime if @se was the entity holding it back,
> 5438 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
> 5439 * put back on, and if we advance min_vruntime, we'll be placed back
> 5440 * further than we started -- i.e. we'll be penalized.
> 5441 */
> 5442 if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
> 5443 update_min_vruntime(cfs_rq);
> 5444
> 5445 if (flags & DEQUEUE_DELAYED)
> 5446 finish_delayed_dequeue_entity(se);
> 5447
> 5448 if (cfs_rq->nr_queued == 0) {
> 5449 update_idle_cfs_rq_clock_pelt(cfs_rq);
> 5450 #ifdef CONFIG_CFS_BANDWIDTH
> 5451 if (throttled_hierarchy(cfs_rq)) {
> 5452 struct rq *rq = rq_of(cfs_rq);
> 5453
> 5454 list_del_leaf_cfs_rq(cfs_rq);
> 5455 cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > 5456 cfs_rq->pelt_clock_throttled = 1;
> 5457 }
> 5458 #endif
> 5459 }
> 5460
> 5461 return true;
> 5462 }
> 5463
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-16 6:57 ` Aaron Lu
@ 2025-07-16 7:40 ` Philip Li
2025-07-16 11:15 ` [PATCH v3 update " Aaron Lu
2025-07-16 11:27 ` [PATCH v3 " Peter Zijlstra
1 sibling, 1 reply; 48+ messages in thread
From: Philip Li @ 2025-07-16 7:40 UTC (permalink / raw)
To: Aaron Lu
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don,
Ingo Molnar, Vincent Guittot, Xi Wang, llvm, oe-kbuild-all,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Wed, Jul 16, 2025 at 02:57:07PM +0800, Aaron Lu wrote:
> On Wed, Jul 16, 2025 at 07:29:37AM +0800, kernel test robot wrote:
> > Hi Aaron,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on tip/sched/core]
> > [also build test WARNING on next-20250715]
> > [cannot apply to linus/master v6.16-rc6]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> > base: tip/sched/core
> > patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> > patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> > config: i386-buildonly-randconfig-006-20250716 (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/config)
> > compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202507160730.0cXkgs0S-lkp@intel.com/
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> kernel/sched/fair.c:5456:33: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> > 5456 | cfs_rq->pelt_clock_throttled = 1;
> > | ^ ~
>
> Thanks for the report.
>
> I don't think this will affect correctness since both cfs_rq's throttled
> and pelt_clock_throttled fields are used as true(not 0) or false(0). I
> used bitfield for them to save some space.
>
> Change their types to either unsigned int or bool should cure this
> warning, I suppose bool looks more clear?
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dbe52e18b93a0..434f816a56701 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -737,8 +737,8 @@ struct cfs_rq {
> u64 throttled_clock_pelt_time;
> u64 throttled_clock_self;
> u64 throttled_clock_self_time;
> - int throttled:1;
> - int pelt_clock_throttled:1;
> + bool throttled:1;
> + bool pelt_clock_throttled:1;
> int throttle_count;
> struct list_head throttled_list;
> struct list_head throttled_csd_list;
>
> Hi LKP,
>
> I tried using clang-19 but couldn't reproduce this warning and I don't
> have clang-20 at hand. Can you please apply the above hunk on top of
> this series and see if that warning is gone? Thanks.
Got it. Is it possible to give the reproduce step [1] a try? It can
download clang-20 to a local directory. If there is any issue with it, we will
follow up and check as early as possible.
[1] https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/reproduce
>
> Best regards,
> Aaron
>
> > kernel/sched/fair.c:5971:32: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> > 5971 | cfs_rq->pelt_clock_throttled = 1;
> > | ^ ~
> > kernel/sched/fair.c:6014:20: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> > 6014 | cfs_rq->throttled = 1;
> > | ^ ~
> > 3 warnings generated.
> >
> >
> > vim +/int +5456 kernel/sched/fair.c
> >
> > 5372
> > 5373 static bool
> > 5374 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > 5375 {
> > 5376 bool sleep = flags & DEQUEUE_SLEEP;
> > 5377 int action = UPDATE_TG;
> > 5378
> > 5379 update_curr(cfs_rq);
> > 5380 clear_buddies(cfs_rq, se);
> > 5381
> > 5382 if (flags & DEQUEUE_DELAYED) {
> > 5383 WARN_ON_ONCE(!se->sched_delayed);
> > 5384 } else {
> > 5385 bool delay = sleep;
> > 5386 /*
> > 5387 * DELAY_DEQUEUE relies on spurious wakeups, special task
> > 5388 * states must not suffer spurious wakeups, excempt them.
> > 5389 */
> > 5390 if (flags & DEQUEUE_SPECIAL)
> > 5391 delay = false;
> > 5392
> > 5393 WARN_ON_ONCE(delay && se->sched_delayed);
> > 5394
> > 5395 if (sched_feat(DELAY_DEQUEUE) && delay &&
> > 5396 !entity_eligible(cfs_rq, se)) {
> > 5397 update_load_avg(cfs_rq, se, 0);
> > 5398 set_delayed(se);
> > 5399 return false;
> > 5400 }
> > 5401 }
> > 5402
> > 5403 if (entity_is_task(se) && task_on_rq_migrating(task_of(se)))
> > 5404 action |= DO_DETACH;
> > 5405
> > 5406 /*
> > 5407 * When dequeuing a sched_entity, we must:
> > 5408 * - Update loads to have both entity and cfs_rq synced with now.
> > 5409 * - For group_entity, update its runnable_weight to reflect the new
> > 5410 * h_nr_runnable of its group cfs_rq.
> > 5411 * - Subtract its previous weight from cfs_rq->load.weight.
> > 5412 * - For group entity, update its weight to reflect the new share
> > 5413 * of its group cfs_rq.
> > 5414 */
> > 5415 update_load_avg(cfs_rq, se, action);
> > 5416 se_update_runnable(se);
> > 5417
> > 5418 update_stats_dequeue_fair(cfs_rq, se, flags);
> > 5419
> > 5420 update_entity_lag(cfs_rq, se);
> > 5421 if (sched_feat(PLACE_REL_DEADLINE) && !sleep) {
> > 5422 se->deadline -= se->vruntime;
> > 5423 se->rel_deadline = 1;
> > 5424 }
> > 5425
> > 5426 if (se != cfs_rq->curr)
> > 5427 __dequeue_entity(cfs_rq, se);
> > 5428 se->on_rq = 0;
> > 5429 account_entity_dequeue(cfs_rq, se);
> > 5430
> > 5431 /* return excess runtime on last dequeue */
> > 5432 return_cfs_rq_runtime(cfs_rq);
> > 5433
> > 5434 update_cfs_group(se);
> > 5435
> > 5436 /*
> > 5437 * Now advance min_vruntime if @se was the entity holding it back,
> > 5438 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
> > 5439 * put back on, and if we advance min_vruntime, we'll be placed back
> > 5440 * further than we started -- i.e. we'll be penalized.
> > 5441 */
> > 5442 if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
> > 5443 update_min_vruntime(cfs_rq);
> > 5444
> > 5445 if (flags & DEQUEUE_DELAYED)
> > 5446 finish_delayed_dequeue_entity(se);
> > 5447
> > 5448 if (cfs_rq->nr_queued == 0) {
> > 5449 update_idle_cfs_rq_clock_pelt(cfs_rq);
> > 5450 #ifdef CONFIG_CFS_BANDWIDTH
> > 5451 if (throttled_hierarchy(cfs_rq)) {
> > 5452 struct rq *rq = rq_of(cfs_rq);
> > 5453
> > 5454 list_del_leaf_cfs_rq(cfs_rq);
> > 5455 cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> > > 5456 cfs_rq->pelt_clock_throttled = 1;
> > 5457 }
> > 5458 #endif
> > 5459 }
> > 5460
> > 5461 return true;
> > 5462 }
> > 5463
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v3 update 3/5] sched/fair: Switch to task based throttle model
2025-07-16 7:40 ` Philip Li
@ 2025-07-16 11:15 ` Aaron Lu
0 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-16 11:15 UTC (permalink / raw)
To: Philip Li
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don,
Ingo Molnar, Vincent Guittot, Xi Wang, llvm, oe-kbuild-all,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Wed, Jul 16, 2025 at 03:40:26PM +0800, Philip Li wrote:
> On Wed, Jul 16, 2025 at 02:57:07PM +0800, Aaron Lu wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index dbe52e18b93a0..434f816a56701 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -737,8 +737,8 @@ struct cfs_rq {
> > u64 throttled_clock_pelt_time;
> > u64 throttled_clock_self;
> > u64 throttled_clock_self_time;
> > - int throttled:1;
> > - int pelt_clock_throttled:1;
> > + bool throttled:1;
> > + bool pelt_clock_throttled:1;
> > int throttle_count;
> > struct list_head throttled_list;
> > struct list_head throttled_csd_list;
> >
> > Hi LKP,
> >
> > I tried using clang-19 but couldn't reproduce this warning and I don't
> > have clang-20 at hand. Can you please apply the above hunk on top of
> > this series and see if that warning is gone? Thanks.
>
> Got it, is it possible to give a try for the reproduce step [1], which can
> download clang-20 to local dir? If it has issue, we will follow up to check
> as early as possible.
I managed to install clang-20 and can see this warning. With the above
hunk applied, the warning is gone.
Here is the updated patch 3:
From: Valentin Schneider <vschneid@redhat.com>
Date: Fri, 23 May 2025 15:30:15 +0800
Subject: [PATCH v3 update 3/5] sched/fair: Switch to task based throttle model
In the current throttle model, when a cfs_rq is throttled, its entity will
be dequeued from the cpu's rq, making tasks attached to it unable to run,
thus achieving the throttle target.
This has a drawback though: assume a task is a reader of percpu_rwsem
and is waiting. When it gets woken, it cannot run till its task group's
next period comes, which can be a relatively long time. The waiting writer
will have to wait longer because of this, and it also makes further readers
build up, eventually triggering a task hung.
To improve this situation, change the throttle model to be task based, i.e.
when a cfs_rq is throttled, record its throttled status but do not remove
it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, add a
task work to them when they get picked, so that when they return
to user, they can be dequeued there. In this way, throttled tasks will
not hold any kernel resources. And on unthrottle, enqueue those tasks
back so they can continue to run.
A throttled cfs_rq's PELT clock is handled differently now: previously the
cfs_rq's PELT clock was stopped once it entered the throttled state, but
since tasks (in kernel mode) can now continue to run, change the behaviour
to stop the PELT clock only when the throttled cfs_rq has no tasks left.
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
v3 update:
- fix compiler warning about int bit-field.
kernel/sched/fair.c | 336 ++++++++++++++++++++++---------------------
kernel/sched/pelt.h | 4 +-
kernel/sched/sched.h | 3 +-
3 files changed, 176 insertions(+), 167 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54c2a4df6a5d1..0eeea7f2e693d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5285,18 +5285,23 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 1) {
check_enqueue_throttle(cfs_rq);
- if (!throttled_hierarchy(cfs_rq)) {
- list_add_leaf_cfs_rq(cfs_rq);
- } else {
+ list_add_leaf_cfs_rq(cfs_rq);
#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
struct rq *rq = rq_of(cfs_rq);
if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
cfs_rq->throttled_clock = rq_clock(rq);
if (!cfs_rq->throttled_clock_self)
cfs_rq->throttled_clock_self = rq_clock(rq);
-#endif
+
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
}
+#endif
}
}
@@ -5335,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable--;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5357,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
struct cfs_rq *cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_runnable++;
- if (cfs_rq_throttled(cfs_rq))
- break;
}
}
@@ -5444,8 +5445,18 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (flags & DEQUEUE_DELAYED)
finish_delayed_dequeue_entity(se);
- if (cfs_rq->nr_queued == 0)
+ if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
+#ifdef CONFIG_CFS_BANDWIDTH
+ if (throttled_hierarchy(cfs_rq)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ list_del_leaf_cfs_rq(cfs_rq);
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
+ }
+#endif
+ }
return true;
}
@@ -5784,6 +5795,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
WARN_ON_ONCE(p->throttled || !list_empty(&p->throttle_node));
dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ /*
+ * Must not set throttled before dequeue or dequeue will
+ * mistakenly regard this task as an already throttled one.
+ */
p->throttled = true;
resched_curr(rq);
}
@@ -5797,32 +5812,119 @@ void init_cfs_throttle_work(struct task_struct *p)
INIT_LIST_HEAD(&p->throttle_node);
}
+/*
+ * Task is throttled and someone wants to dequeue it again:
+ * it could be sched/core when core needs to do things like
+ * task affinity change, task group change, task sched class
+ * change etc. and in these cases, DEQUEUE_SLEEP is not set;
+ * or the task is blocked after throttled due to freezer etc.
+ * and in these cases, DEQUEUE_SLEEP is set.
+ */
+static void detach_task_cfs_rq(struct task_struct *p);
+static void dequeue_throttled_task(struct task_struct *p, int flags)
+{
+ WARN_ON_ONCE(p->se.on_rq);
+ list_del_init(&p->throttle_node);
+
+ /* task blocked after throttled */
+ if (flags & DEQUEUE_SLEEP) {
+ p->throttled = false;
+ return;
+ }
+
+ /*
+ * task is migrating off its old cfs_rq, detach
+ * the task's load from its old cfs_rq.
+ */
+ if (task_on_rq_migrating(p))
+ detach_task_cfs_rq(p);
+}
+
+static bool enqueue_throttled_task(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
+
+ /*
+ * If the throttled task is enqueued to a throttled cfs_rq,
+ * take the fast path by directly put the task on target
+ * cfs_rq's limbo list, except when p is current because
+ * the following race can cause p's group_node left in rq's
+ * cfs_tasks list when it's throttled:
+ *
+ * cpuX cpuY
+ * taskA ret2user
+ * throttle_cfs_rq_work() sched_move_task(taskA)
+ * task_rq_lock acquired
+ * dequeue_task_fair(taskA)
+ * task_rq_lock released
+ * task_rq_lock acquired
+ * task_current_donor(taskA) == true
+ * task_on_rq_queued(taskA) == true
+ * dequeue_task(taskA)
+ * put_prev_task(taskA)
+ * sched_change_group()
+ * enqueue_task(taskA) -> taskA's new cfs_rq
+ * is throttled, go
+ * fast path and skip
+ * actual enqueue
+ * set_next_task(taskA)
+ * __set_next_task_fair(taskA)
+ * list_move(&se->group_node, &rq->cfs_tasks); // bug
+ * schedule()
+ *
+ * And in the above race case, the task's current cfs_rq is in the same
+ * rq as its previous cfs_rq because sched_move_task() doesn't migrate
+ * task so we can use its current cfs_rq to derive rq and test if the
+ * task is current.
+ */
+ if (throttled_hierarchy(cfs_rq) &&
+ !task_current_donor(rq_of(cfs_rq), p)) {
+ list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
+ return true;
+ }
+
+ /* we can't take the fast path, do an actual enqueue */
+ p->throttled = false;
+ return false;
+}
+
+static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
static int tg_unthrottle_up(struct task_group *tg, void *data)
{
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ struct task_struct *p, *tmp;
+
+ if (--cfs_rq->throttle_count)
+ return 0;
- cfs_rq->throttle_count--;
- if (!cfs_rq->throttle_count) {
+ if (cfs_rq->pelt_clock_throttled) {
cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
- /* Add cfs_rq with load or one or more already running entities to the list */
- if (!cfs_rq_is_decayed(cfs_rq))
- list_add_leaf_cfs_rq(cfs_rq);
+ if (cfs_rq->throttled_clock_self) {
+ u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
- if (cfs_rq->throttled_clock_self) {
- u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
+ cfs_rq->throttled_clock_self = 0;
- cfs_rq->throttled_clock_self = 0;
+ if (WARN_ON_ONCE((s64)delta < 0))
+ delta = 0;
- if (WARN_ON_ONCE((s64)delta < 0))
- delta = 0;
+ cfs_rq->throttled_clock_self_time += delta;
+ }
- cfs_rq->throttled_clock_self_time += delta;
- }
+ /* Re-enqueue the tasks that have been throttled at this level. */
+ list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
+ list_del_init(&p->throttle_node);
+ enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
}
+ /* Add cfs_rq with load or one or more already running entities to the list */
+ if (!cfs_rq_is_decayed(cfs_rq))
+ list_add_leaf_cfs_rq(cfs_rq);
+
return 0;
}
@@ -5851,17 +5953,25 @@ static int tg_throttle_down(struct task_group *tg, void *data)
struct rq *rq = data;
struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
+ if (cfs_rq->throttle_count++)
+ return 0;
+
+
/* group is entering throttled state, stop time */
- if (!cfs_rq->throttle_count) {
- cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ WARN_ON_ONCE(cfs_rq->throttled_clock_self);
+ if (cfs_rq->nr_queued)
+ cfs_rq->throttled_clock_self = rq_clock(rq);
+ else {
+ /*
+ * For cfs_rqs that still have entities enqueued, PELT clock
+ * stop happens at dequeue time when all entities are dequeued.
+ */
list_del_leaf_cfs_rq(cfs_rq);
-
- WARN_ON_ONCE(cfs_rq->throttled_clock_self);
- if (cfs_rq->nr_queued)
- cfs_rq->throttled_clock_self = rq_clock(rq);
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
}
- cfs_rq->throttle_count++;
+ WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
return 0;
}
@@ -5869,8 +5979,7 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta, dequeue = 1;
+ int dequeue = 1;
raw_spin_lock(&cfs_b->lock);
/* This will start the period timer if necessary */
@@ -5893,68 +6002,11 @@ static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
if (!dequeue)
return false; /* Throttle no longer required. */
- se = cfs_rq->tg->se[cpu_of(rq_of(cfs_rq))];
-
/* freeze hierarchy runnable averages while throttled */
rcu_read_lock();
walk_tg_tree_from(cfs_rq->tg, tg_throttle_down, tg_nop, (void *)rq);
rcu_read_unlock();
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- int flags;
-
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- /*
- * Abuse SPECIAL to avoid delayed dequeue in this instance.
- * This avoids teaching dequeue_entities() about throttled
- * entities and keeps things relatively simple.
- */
- flags = DEQUEUE_SLEEP | DEQUEUE_SPECIAL;
- if (se->sched_delayed)
- flags |= DEQUEUE_DELAYED;
- dequeue_entity(qcfs_rq, se, flags);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
-
- if (qcfs_rq->load.weight) {
- /* Avoid re-evaluating load for this entity: */
- se = parent_entity(se);
- break;
- }
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
- /* throttled entity or throttle-on-deactivate */
- if (!se->on_rq)
- goto done;
-
- update_load_avg(qcfs_rq, se, 0);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued -= queued_delta;
- qcfs_rq->h_nr_runnable -= runnable_delta;
- qcfs_rq->h_nr_idle -= idle_delta;
- }
-
- /* At this point se is NULL and we are at root level*/
- sub_nr_running(rq, queued_delta);
-done:
/*
* Note: distribution will already see us throttled via the
* throttled-list. rq->lock protects completion.
@@ -5970,9 +6022,20 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
{
struct rq *rq = rq_of(cfs_rq);
struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
- struct sched_entity *se;
- long queued_delta, runnable_delta, idle_delta;
- long rq_h_nr_queued = rq->cfs.h_nr_queued;
+ struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
+
+ /*
+ * It's possible we are called with !runtime_remaining due to things
+ * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
+ * unthrottled us with a positive runtime_remaining but other still
+ * running entities consumed those runtime before we reached here.
+ *
+ * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+ * because any enqueue in tg_unthrottle_up() will immediately trigger a
+ * throttle, which is not supposed to happen on unthrottle path.
+ */
+ if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
+ return;
se = cfs_rq->tg->se[cpu_of(rq)];
@@ -6002,62 +6065,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
if (list_add_leaf_cfs_rq(cfs_rq_of(se)))
break;
}
- goto unthrottle_throttle;
}
- queued_delta = cfs_rq->h_nr_queued;
- runnable_delta = cfs_rq->h_nr_runnable;
- idle_delta = cfs_rq->h_nr_idle;
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- /* Handle any unfinished DELAY_DEQUEUE business first. */
- if (se->sched_delayed) {
- int flags = DEQUEUE_SLEEP | DEQUEUE_DELAYED;
-
- dequeue_entity(qcfs_rq, se, flags);
- } else if (se->on_rq)
- break;
- enqueue_entity(qcfs_rq, se, ENQUEUE_WAKEUP);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
-
- for_each_sched_entity(se) {
- struct cfs_rq *qcfs_rq = cfs_rq_of(se);
-
- update_load_avg(qcfs_rq, se, UPDATE_TG);
- se_update_runnable(se);
-
- if (cfs_rq_is_idle(group_cfs_rq(se)))
- idle_delta = cfs_rq->h_nr_queued;
-
- qcfs_rq->h_nr_queued += queued_delta;
- qcfs_rq->h_nr_runnable += runnable_delta;
- qcfs_rq->h_nr_idle += idle_delta;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(qcfs_rq))
- goto unthrottle_throttle;
- }
-
- /* Start the fair server if un-throttling resulted in new runnable tasks */
- if (!rq_h_nr_queued && rq->cfs.h_nr_queued)
- dl_server_start(&rq->fair_server);
-
- /* At this point se is NULL and we are at root level*/
- add_nr_running(rq, queued_delta);
-
-unthrottle_throttle:
assert_list_leaf_cfs_rq(rq);
/* Determine whether we need to wake up potentially idle CPU: */
@@ -6711,6 +6720,8 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
static void task_throttle_setup_work(struct task_struct *p) {}
static bool task_is_throttled(struct task_struct *p) { return false; }
+static void dequeue_throttled_task(struct task_struct *p, int flags) {}
+static bool enqueue_throttled_task(struct task_struct *p) { return false; }
static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
{
@@ -6903,6 +6914,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
int rq_h_nr_queued = rq->cfs.h_nr_queued;
u64 slice = 0;
+ if (unlikely(task_is_throttled(p) && enqueue_throttled_task(p)))
+ return;
+
/*
* The code below (indirectly) updates schedutil which looks at
* the cfs_rq utilization to select a frequency.
@@ -6955,10 +6969,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
-
flags = ENQUEUE_WAKEUP;
}
@@ -6980,10 +6990,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = 1;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- goto enqueue_throttle;
}
if (!rq_h_nr_queued && rq->cfs.h_nr_queued) {
@@ -7013,7 +7019,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
if (!task_new)
check_update_overutilized_status(rq);
-enqueue_throttle:
assert_list_leaf_cfs_rq(rq);
hrtick_update(rq);
@@ -7068,10 +7073,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
-
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
slice = cfs_rq_min_slice(cfs_rq);
@@ -7108,10 +7109,6 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
if (cfs_rq_is_idle(cfs_rq))
h_nr_idle = h_nr_queued;
-
- /* end evaluation on encountering a throttled cfs_rq */
- if (cfs_rq_throttled(cfs_rq))
- return 0;
}
sub_nr_running(rq, h_nr_queued);
@@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
*/
static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
+ if (unlikely(task_is_throttled(p))) {
+ dequeue_throttled_task(p, flags);
+ return true;
+ }
+
if (!p->se.sched_delayed)
util_est_dequeue(&rq->cfs, p);
@@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
{
struct sched_entity *se;
struct cfs_rq *cfs_rq;
+ struct task_struct *p;
+ bool throttled;
again:
cfs_rq = &rq->cfs;
if (!cfs_rq->nr_queued)
return NULL;
+ throttled = false;
+
do {
/* Might not have done put_prev_entity() */
if (cfs_rq->curr && cfs_rq->curr->on_rq)
update_curr(cfs_rq);
- if (unlikely(check_cfs_rq_runtime(cfs_rq)))
- goto again;
+ throttled |= check_cfs_rq_runtime(cfs_rq);
se = pick_next_entity(rq, cfs_rq);
if (!se)
@@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
- return task_of(se);
+ p = task_of(se);
+ if (unlikely(throttled))
+ task_throttle_setup_work(p);
+ return p;
}
static void __set_next_task_fair(struct rq *rq, struct task_struct *p, bool first);
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
u64 throttled;
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
throttled = U64_MAX;
else
throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b0c9559992d8a..51d000ab4a70d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,7 +737,8 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled;
+ bool throttled:1;
+ bool pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
--
2.39.5
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-16 6:57 ` Aaron Lu
2025-07-16 7:40 ` Philip Li
@ 2025-07-16 11:27 ` Peter Zijlstra
1 sibling, 0 replies; 48+ messages in thread
From: Peter Zijlstra @ 2025-07-16 11:27 UTC (permalink / raw)
To: Aaron Lu
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang, llvm, oe-kbuild-all, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Wed, Jul 16, 2025 at 02:57:07PM +0800, Aaron Lu wrote:
> On Wed, Jul 16, 2025 at 07:29:37AM +0800, kernel test robot wrote:
> > Hi Aaron,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on tip/sched/core]
> > [also build test WARNING on next-20250715]
> > [cannot apply to linus/master v6.16-rc6]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> > base: tip/sched/core
> > patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> > patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> > config: i386-buildonly-randconfig-006-20250716 (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/config)
> > compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
> > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250716/202507160730.0cXkgs0S-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202507160730.0cXkgs0S-lkp@intel.com/
> >
> > All warnings (new ones prefixed by >>):
> >
> > >> kernel/sched/fair.c:5456:33: warning: implicit truncation from 'int' to a one-bit wide bit-field changes value from 1 to -1 [-Wsingle-bit-bitfield-constant-conversion]
> > 5456 | cfs_rq->pelt_clock_throttled = 1;
> > | ^ ~
Nice warning from clang.
>
> Thanks for the report.
>
> I don't think this will affect correctness since both cfs_rq's throttled
> and pelt_clock_throttled fields are used as true(not 0) or false(0). I
> used bitfield for them to save some space.
>
> Change their types to either unsigned int or bool should cure this
> warning, I suppose bool looks more clear?
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index dbe52e18b93a0..434f816a56701 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -737,8 +737,8 @@ struct cfs_rq {
> u64 throttled_clock_pelt_time;
> u64 throttled_clock_self;
> u64 throttled_clock_self_time;
> - int throttled:1;
> - int pelt_clock_throttled:1;
> + bool throttled:1;
> + bool pelt_clock_throttled:1;
> int throttle_count;
> struct list_head throttled_list;
> struct list_head throttled_csd_list;
Yeah, either this or any unsigned type will do.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-07-15 23:29 ` kernel test robot
@ 2025-07-16 15:20 ` kernel test robot
2025-07-17 3:52 ` Aaron Lu
2025-08-08 9:12 ` Valentin Schneider
2025-08-17 8:50 ` Chen, Yu C
3 siblings, 1 reply; 48+ messages in thread
From: kernel test robot @ 2025-07-16 15:20 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang
Cc: oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu
Hi Aaron,
kernel test robot noticed the following build warnings:
[auto build test WARNING on tip/sched/core]
[also build test WARNING on next-20250716]
[cannot apply to linus/master v6.16-rc6]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
base: tip/sched/core
patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
config: xtensa-randconfig-r121-20250716 (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/config)
compiler: xtensa-linux-gcc (GCC) 8.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507162238.qiw7kyu0-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
kernel/sched/core.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/thread_info.h, ...):
arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
kernel/sched/core.c: note: in included file:
>> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:742:55: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2429:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2429:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2429:9: sparse: struct task_struct *
kernel/sched/core.c:2131:38: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/core.c:2131:38: sparse: struct task_struct [noderef] __rcu *
kernel/sched/core.c:2131:38: sparse: struct task_struct const *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2452:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2452:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2452:9: sparse: struct task_struct *
kernel/sched/sched.h:2452:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2452:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2452:9: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2429:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2429:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2429:9: sparse: struct task_struct *
--
kernel/sched/fair.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/kernel.h, ...):
arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
kernel/sched/fair.c: note: in included file:
>> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:742:55: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/fair.c:6073:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/fair.c:6073:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/fair.c:6073:22: sparse: struct task_struct *
kernel/sched/fair.c:10625:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/fair.c:10625:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/fair.c:10625:22: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2452:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2452:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2452:9: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
--
kernel/sched/build_policy.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/kernel.h, ...):
arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
kernel/sched/build_policy.c: note: in included file:
>> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:742:55: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/build_policy.c: note: in included file:
kernel/sched/rt.c:2289:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/rt.c:2289:25: sparse: struct task_struct *
kernel/sched/rt.c:2289:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/rt.c:1994:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/rt.c:1994:13: sparse: struct task_struct *
kernel/sched/rt.c:1994:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/deadline.c:2675:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2675:13: sparse: struct task_struct *
kernel/sched/deadline.c:2675:13: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:2781:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:2781:25: sparse: struct task_struct *
kernel/sched/deadline.c:2781:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:3024:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/deadline.c:3024:23: sparse: struct task_struct [noderef] __rcu *
kernel/sched/deadline.c:3024:23: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/syscalls.c:206:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/syscalls.c:206:22: sparse: struct task_struct [noderef] __rcu *
kernel/sched/syscalls.c:206:22: sparse: struct task_struct *
kernel/sched/build_policy.c: note: in included file:
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2429:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2429:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2429:9: sparse: struct task_struct *
kernel/sched/sched.h:2252:26: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2252:26: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2252:26: sparse: struct task_struct *
kernel/sched/sched.h:2429:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2429:9: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2429:9: sparse: struct task_struct *
--
kernel/sched/build_utility.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/kernel.h, ...):
arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
kernel/sched/build_utility.c: note: in included file:
>> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:742:55: sparse: sparse: dubious one-bit signed bitfield
kernel/sched/sched.h:2241:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
kernel/sched/sched.h:2241:25: sparse: struct task_struct [noderef] __rcu *
kernel/sched/sched.h:2241:25: sparse: struct task_struct *
vim +741 kernel/sched/sched.h
709
710 #ifdef CONFIG_FAIR_GROUP_SCHED
711 struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
712
713 /*
714 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
715 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
716 * (like users, containers etc.)
717 *
718 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
719 * This list is used during load balance.
720 */
721 int on_list;
722 struct list_head leaf_cfs_rq_list;
723 struct task_group *tg; /* group that "owns" this runqueue */
724
725 /* Locally cached copy of our task_group's idle value */
726 int idle;
727
728 #ifdef CONFIG_CFS_BANDWIDTH
729 int runtime_enabled;
730 s64 runtime_remaining;
731
732 u64 throttled_pelt_idle;
733 #ifndef CONFIG_64BIT
734 u64 throttled_pelt_idle_copy;
735 #endif
736 u64 throttled_clock;
737 u64 throttled_clock_pelt;
738 u64 throttled_clock_pelt_time;
739 u64 throttled_clock_self;
740 u64 throttled_clock_self_time;
> 741 int throttled:1;
742 int pelt_clock_throttled:1;
743 int throttle_count;
744 struct list_head throttled_list;
745 struct list_head throttled_csd_list;
746 struct list_head throttled_limbo_list;
747 #endif /* CONFIG_CFS_BANDWIDTH */
748 #endif /* CONFIG_FAIR_GROUP_SCHED */
749 };
750
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-16 15:20 ` kernel test robot
@ 2025-07-17 3:52 ` Aaron Lu
2025-07-23 8:21 ` Oliver Sang
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-07-17 3:52 UTC (permalink / raw)
To: kernel test robot
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
oe-kbuild-all, linux-kernel, Juri Lelli, Dietmar Eggemann,
Steven Rostedt, Mel Gorman, Chuyi Zhou, Jan Kiszka,
Florian Bezdeka, Songtang Liu
On Wed, Jul 16, 2025 at 11:20:55PM +0800, kernel test robot wrote:
> Hi Aaron,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on tip/sched/core]
> [also build test WARNING on next-20250716]
> [cannot apply to linus/master v6.16-rc6]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> base: tip/sched/core
> patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> config: xtensa-randconfig-r121-20250716 (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/config)
> compiler: xtensa-linux-gcc (GCC) 8.5.0
> reproduce: (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507162238.qiw7kyu0-lkp@intel.com/
>
> sparse warnings: (new ones prefixed by >>)
> kernel/sched/core.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/thread_info.h, ...):
> arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
> arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
> kernel/sched/core.c: note: in included file:
> >> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
Same problem as the last report.
I've downloaded this compiler from kernel.org and confirmed there are no
such warnings after using bool.
Thanks,
Aaron
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-17 3:52 ` Aaron Lu
@ 2025-07-23 8:21 ` Oliver Sang
2025-07-23 10:08 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: Oliver Sang @ 2025-07-23 8:21 UTC (permalink / raw)
To: Aaron Lu
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don,
Ingo Molnar, Vincent Guittot, Xi Wang, oe-kbuild-all,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
oliver.sang
hi, Aaron,
On Thu, Jul 17, 2025 at 11:52:43AM +0800, Aaron Lu wrote:
> On Wed, Jul 16, 2025 at 11:20:55PM +0800, kernel test robot wrote:
> > Hi Aaron,
> >
> > kernel test robot noticed the following build warnings:
> >
> > [auto build test WARNING on tip/sched/core]
> > [also build test WARNING on next-20250716]
> > [cannot apply to linus/master v6.16-rc6]
> > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > And when submitting patch, we suggest to use '--base' as documented in
> > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> >
> > url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> > base: tip/sched/core
> > patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> > patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> > config: xtensa-randconfig-r121-20250716 (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/config)
> > compiler: xtensa-linux-gcc (GCC) 8.5.0
> > reproduce: (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/reproduce)
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <lkp@intel.com>
> > | Closes: https://lore.kernel.org/oe-kbuild-all/202507162238.qiw7kyu0-lkp@intel.com/
> >
> > sparse warnings: (new ones prefixed by >>)
> > kernel/sched/core.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/thread_info.h, ...):
> > arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
> > arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
> > kernel/sched/core.c: note: in included file:
> > >> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
>
> Same problem as last report.
>
> I've downloaded this compiler from kernel.org and confirmed there is no
> such warnings after using bool.
I want to confirm: do you mean you can reproduce the sparse build warning
> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
and then, after making the change below:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3c3ea0089b0b5..6eb15b00edccd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -738,7 +738,7 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled:1;
+ bool throttled:1;
int pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
the issue disappears?
>
> Thanks,
> Aaron
>
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-23 8:21 ` Oliver Sang
@ 2025-07-23 10:08 ` Aaron Lu
0 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-07-23 10:08 UTC (permalink / raw)
To: Oliver Sang
Cc: kernel test robot, Valentin Schneider, Ben Segall,
K Prateek Nayak, Peter Zijlstra, Chengming Zhou, Josh Don,
Ingo Molnar, Vincent Guittot, Xi Wang, oe-kbuild-all,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
Hi Oliver,
On Wed, Jul 23, 2025 at 04:21:59PM +0800, Oliver Sang wrote:
> hi, Aaron,
>
> On Thu, Jul 17, 2025 at 11:52:43AM +0800, Aaron Lu wrote:
> > On Wed, Jul 16, 2025 at 11:20:55PM +0800, kernel test robot wrote:
> > > Hi Aaron,
> > >
> > > kernel test robot noticed the following build warnings:
> > >
> > > [auto build test WARNING on tip/sched/core]
> > > [also build test WARNING on next-20250716]
> > > [cannot apply to linus/master v6.16-rc6]
> > > [If your patch is applied to the wrong git tree, kindly drop us a note.
> > > And when submitting patch, we suggest to use '--base' as documented in
> > > https://git-scm.com/docs/git-format-patch#_base_tree_information]
> > >
> > > url: https://github.com/intel-lab-lkp/linux/commits/Aaron-Lu/sched-fair-Add-related-data-structure-for-task-based-throttle/20250715-152307
> > > base: tip/sched/core
> > > patch link: https://lore.kernel.org/r/20250715071658.267-4-ziqianlu%40bytedance.com
> > > patch subject: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
> > > config: xtensa-randconfig-r121-20250716 (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/config)
> > > compiler: xtensa-linux-gcc (GCC) 8.5.0
> > > reproduce: (https://download.01.org/0day-ci/archive/20250716/202507162238.qiw7kyu0-lkp@intel.com/reproduce)
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <lkp@intel.com>
> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202507162238.qiw7kyu0-lkp@intel.com/
> > >
> > > sparse warnings: (new ones prefixed by >>)
> > > kernel/sched/core.c: note: in included file (through arch/xtensa/include/asm/bitops.h, include/linux/bitops.h, include/linux/thread_info.h, ...):
> > > arch/xtensa/include/asm/processor.h:105:2: sparse: sparse: Unsupported xtensa ABI
> > > arch/xtensa/include/asm/processor.h:135:2: sparse: sparse: Unsupported Xtensa ABI
> > > kernel/sched/core.c: note: in included file:
> > > >> kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
> >
> > Same problem as last report.
> >
> > I've downloaded this compiler from kernel.org and confirmed there is no
> > such warnings after using bool.
>
>
> want to confirm, do you mean you can reproduce the build sparse error
> > kernel/sched/sched.h:741:44: sparse: sparse: dubious one-bit signed bitfield
>
Right.
>
> then after doing below change:
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3c3ea0089b0b5..6eb15b00edccd 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -738,7 +738,7 @@ struct cfs_rq {
> u64 throttled_clock_pelt_time;
> u64 throttled_clock_self;
> u64 throttled_clock_self_time;
> - int throttled:1;
> + bool throttled:1;
> int pelt_clock_throttled:1;
> int throttle_count;
> struct list_head throttled_list;
>
>
> the issue will disappear?
>
Yes, that's correct.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (5 preceding siblings ...)
2025-07-15 7:22 ` [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
@ 2025-08-01 14:31 ` Matteo Martelli
2025-08-04 7:52 ` Aaron Lu
2025-08-04 8:51 ` K Prateek Nayak
2025-08-27 14:58 ` Valentin Schneider
8 siblings, 1 reply; 48+ messages in thread
From: Matteo Martelli @ 2025-08-01 14:31 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, K Prateek Nayak,
Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Matteo Martelli
Hi Aaron,
On Tue, 15 Jul 2025 15:16:53 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> v3:
> - Keep throttled cfs_rq's PELT clock running as long as it still has
> entity queued, suggested by Benjamin Segall. I've folded this change
> into patch3;
> - Rebased on top of tip/sched/core, commit 2885daf47081
> ("lib/smp_processor_id: Make migration check unconditional of SMP").
>
> Hi Prateek,
> I've kept your tested-by tag(Thanks!) for v2 since I believe this pelt
> clock change should not affect things much, but let me know if you don't
> think that is appropriate.
>
> Tests I've done:
> - Jan's rt deadlock reproducer[1]. Without this series, I saw rcu-stalls
> within 2 minutes and with this series, I do not see rcu-stalls after
> 10 minutes.
> - A stress test that creates a lot of pressure on fork/exit path and
> cgroup_threadgroup_rwsem. Without this series, the test will cause
> task hung in about 5 minutes and with this series, no problem found
> after several hours. Songtang wrote this test script and I've used it
> to verify the patches, thanks Songtang.
>
> [1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
>
> v2:
> - Re-org the patchset to use a single patch to implement throttle
> related changes, suggested by Chengming;
> - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
> decide if throttle task work is needed instead of checking
> throttled_hierarchy(), suggested by Peter;
> - Simplify throttle_count check in tg_throtthe_down() and
> tg_unthrottle_up(), suggested by Peter;
> - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
> a throttled cfs_rq, suggested by Peter;
> - Address the missing of detach_task_cfs_rq() for throttled tasks that
> get migrated to a new rq, pointed out by Chengming;
> - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
> cond_resched*() is going away, pointed out by Peter.
> I hope I didn't miss any comments and suggestions for v1 and if I do,
> please kindly let me know, thanks!
>
> Base: tip/sched/core commit dabe1be4e84c("sched/smp: Use the SMP version
> of double_rq_clock_clear_update()")
>
> cover letter of v1:
>
> This is a continuous work based on Valentin Schneider's posting here:
> Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
> https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
>
> Valentin has described the problem very well in the above link and I
> quote:
> "
> CFS tasks can end up throttled while holding locks that other,
> non-throttled tasks are blocking on.
>
> For !PREEMPT_RT, this can be a source of latency due to the throttling
> causing a resource acquisition denial.
>
> For PREEMPT_RT, this is worse and can lead to a deadlock:
> o A CFS task p0 gets throttled while holding read_lock(&lock)
> o A task p1 blocks on write_lock(&lock), making further readers enter
> the slowpath
> o A ktimers or ksoftirqd task blocks on read_lock(&lock)
>
> If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
> on the same CPU as one where ktimers/ksoftirqd is blocked on
> read_lock(&lock), this creates a circular dependency.
>
> This has been observed to happen with:
> o fs/eventpoll.c::ep->lock
> o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
> but can trigger with any rwlock that can be acquired in both process and
> softirq contexts.
>
> The linux-rt tree has had
> 1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
> which helped this scenario for non-rwlock locks by ensuring the throttled
> task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
> rwlocks cannot sanely do PI as they allow multiple readers.
> "
>
> Jan Kiszka has posted an reproducer regarding this PREEMPT_RT problem :
> https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
> and K Prateek Nayak has an detailed analysis of how deadlock happened:
> https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/
>
> To fix this issue for PREEMPT_RT and improve latency situation for
> !PREEMPT_RT, change the throttle model to task based, i.e. when a cfs_rq
> is throttled, mark its throttled status but do not remove it from cpu's
> rq. Instead, for tasks that belong to this cfs_rq, when they get picked,
> add a task work to them so that when they return to user, they can be
> dequeued. In this way, tasks throttled will not hold any kernel resources.
> When cfs_rq gets unthrottled, enqueue back those throttled tasks.
>
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
>
> For throttle time accounting, according to RFC v2's feedback, rework
> throttle time accounting for a cfs_rq as follows:
> - start accounting when the first task gets throttled in its
> hierarchy;
> - stop accounting on unthrottle.
>
> There is also the concern of increased duration of (un)throttle operations
> in RFC v1. I've done some tests and with a 2000 cgroups/20K runnable tasks
> setup on a 2sockets/384cpus AMD server, the longest duration of
> distribute_cfs_runtime() is in the 2ms-4ms range. For details, please see:
> https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
> For throttle path, with Chengming's suggestion to move "task work setup"
> from throttle time to pick time, it's not an issue anymore.
>
> Aaron Lu (2):
> sched/fair: Task based throttle time accounting
> sched/fair: Get rid of throttled_lb_pair()
>
> Valentin Schneider (3):
> sched/fair: Add related data structure for task based throttle
> sched/fair: Implement throttle task work and related helpers
> sched/fair: Switch to task based throttle model
>
> include/linux/sched.h | 5 +
> kernel/sched/core.c | 3 +
> kernel/sched/fair.c | 451 ++++++++++++++++++++++++------------------
> kernel/sched/pelt.h | 4 +-
> kernel/sched/sched.h | 7 +-
> 5 files changed, 274 insertions(+), 196 deletions(-)
>
> --
> 2.39.5
>
>
I encountered this issue on a test image with both PREEMPT_RT and
CFS_BANDWIDTH kernel options enabled. The test image is based on
freedesktop-sdk (v24.08.10) [1] with custom system configurations on
top, and it was run on qemu x86_64 with 4 virtual CPU cores. One
notable system configuration is that most of the system services run
in a systemd slice restricted to a single CPU core (with the
AllowedCPUs systemd option) and using CFS throttling (with the
CPUQuota systemd option).
With this configuration I encountered RCU stalls during boot, I think
because the probability is increased by multiple processes being
spawned simultaneously on the same core. After the first RCU stall,
the system becomes unresponsive and further RCU stalls are detected
periodically. This seems to match the deadlock situation described in
your cover letter. I could only reproduce the RCU stalls with the
combination of both PREEMPT_RT and CFS_BANDWIDTH enabled.
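For reference, the slice configuration mentioned above looks roughly
like the following drop-in (illustrative sketch only: the path, slice
name and exact values are placeholders, not the ones from the test
image):

  # e.g. /etc/systemd/system/services.slice.d/cpu.conf (placeholder path)
  [Slice]
  AllowedCPUs=0
  CPUQuota=80%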
I had previously tested this patch set at v2 (RFC) [2] on top of
kernels v6.14 and v6.15. I've now retested it at v3 on top of kernel
v6.16-rc7. With the patch set applied, I could no longer reproduce RCU
stalls in any of these cases. More specifically, in the last test I
ran without the patch set applied, I could reproduce 32 RCU stalls in
24 hours, about 1 or 2 every hour. In this test the system rebooted
just after the first RCU stall occurrence (through
panic_on_rcu_stall=1 and panic=5 kernel cmdline arguments), or after
100 seconds if no RCU stall occurred. This means the system rebooted
854 times in 24 hours (about 3.7% reproducibility). You can see two
RCU stall instances below. I could not reproduce any RCU stall with
the same test after applying the patch set. I obtained similar results
when testing the patch set at v2 (RFC) [2].
Another possibly interesting note is that the original custom
configuration had the slice CPUQuota=150%; I then retested it with
CPUQuota=80%. The issue was reproducible in both configurations,
notably even with CPUQuota=150%, which to my understanding should
result in no CFS throttling since the CPU affinity is set to a single
core.
I also ran some quick tests with stress-ng and the systemd CPUQuota
parameter to verify that CFS throttling was behaving as expected. See
the details below, after the RCU stall logs.
I hope this information is helpful; I can provide additional details
if needed.
Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
[1]: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/releases/freedesktop-sdk-24.08.10
[2]: https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/
- RCU stall instances:
...
[ 40.083057] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 40.083067] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1075/4:b..l
[ 40.083070] rcu: (detected by 0, t=21002 jiffies, g=2637, q=547 ncpus=4)
[ 40.083073] task:podman state:R running task stack:13568 pid:1075 tgid:1062 ppid:1021 task_flags:0x40014c flags:0x00004002
[ 40.083081] Call Trace:
[ 40.083082] <TASK>
[ 40.083084] __schedule+0x3d4/0xf10
[ 40.083100] preempt_schedule+0x2e/0x50
[ 40.083102] preempt_schedule_thunk+0x16/0x30
[ 40.083107] try_to_wake_up+0x2fc/0x630
[ 40.083111] ep_autoremove_wake_function+0xd/0x40
[ 40.083115] __wake_up_common+0x6d/0x90
[ 40.083117] __wake_up+0x2c/0x50
[ 40.083119] ep_poll_callback+0x17b/0x230
[ 40.083121] __wake_up_common+0x6d/0x90
[ 40.083122] __wake_up+0x2c/0x50
[ 40.083123] sock_def_wakeup+0x3a/0x40
[ 40.083128] unix_release_sock+0x2a7/0x4a0
[ 40.083134] unix_release+0x2d/0x40
[ 40.083137] __sock_release+0x44/0xb0
[ 40.083141] sock_close+0x13/0x20
[ 40.083142] __fput+0xe1/0x2a0
[ 40.083146] task_work_run+0x58/0x90
[ 40.083149] do_exit+0x270/0xac0
[ 40.083152] do_group_exit+0x2b/0xc0
[ 40.083153] __x64_sys_exit_group+0x13/0x20
[ 40.083154] x64_sys_call+0xfdb/0x14f0
[ 40.083156] do_syscall_64+0xa4/0x260
[ 40.083160] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 40.083165] RIP: 0033:0x48638b
[ 40.083167] RSP: 002b:000000c00004fde0 EFLAGS: 00000216 ORIG_RAX: 00000000000000e7
[ 40.083169] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000048638b
[ 40.083171] RDX: 000000c00004fdb0 RSI: 0000000000000001 RDI: 0000000000000000
[ 40.083171] RBP: 000000c00004fdf0 R08: 4bad2e33de989e9a R09: 0000000002d79c40
[ 40.083173] R10: 000000c0005eaa08 R11: 0000000000000216 R12: 0000000000000000
[ 40.083173] R13: 0000000000000001 R14: 000000c0000061c0 R15: 000000c0000f27e0
[ 40.083175] </TASK>
[ 40.083176] rcu: rcu_preempt kthread timer wakeup didn't happen for 20975 jiffies! g2637 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 40.083179] rcu: Possible timer handling issue on cpu=0 timer-softirq=1708
[ 40.083180] rcu: rcu_preempt kthread starved for 20978 jiffies! g2637 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 40.083182] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 40.083183] rcu: RCU grace-period kthread stack dump:
[ 40.083183] task:rcu_preempt state:I stack:14800 pid:17 tgid:17 ppid:2 task_flags:0x208040 flags:0x00004000
[ 40.083187] Call Trace:
[ 40.083188] <TASK>
[ 40.083189] __schedule+0x3d4/0xf10
[ 40.083192] schedule+0x22/0xd0
[ 40.083194] schedule_timeout+0x7e/0x100
[ 40.083199] ? __pfx_process_timeout+0x10/0x10
[ 40.083202] rcu_gp_fqs_loop+0x103/0x6b0
[ 40.083206] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 40.083207] rcu_gp_kthread+0x191/0x230
[ 40.083208] kthread+0xf6/0x1f0
[ 40.083210] ? __pfx_kthread+0x10/0x10
[ 40.083212] ret_from_fork+0x80/0xd0
[ 40.083215] ? __pfx_kthread+0x10/0x10
[ 40.083217] ret_from_fork_asm+0x1a/0x30
[ 40.083219] </TASK>
[ 40.083220] rcu: Stack dump where RCU GP kthread last ran:
[ 40.083225] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[ 40.083227] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 40.083227] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 40.083229] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 40.083230] RSP: 0018:ffffffff9a003e80 EFLAGS: 00000216
[ 40.083231] RAX: ffff9034a1221000 RBX: ffffffff9a018900 RCX: 0000000000000001
[ 40.083231] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 000000000007507c
[ 40.083232] RBP: 0000000000000000 R08: 000000000007507c R09: ffff90343bc24d90
[ 40.083233] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 40.083233] R13: 0000000000000000 R14: ffffffff9a018038 R15: 000000007e0c1000
[ 40.083237] FS: 0000000000000000(0000) GS:ffff9034a1221000(0000) knlGS:0000000000000000
[ 40.083237] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.083238] CR2: 00007f72d6c58000 CR3: 0000000007722000 CR4: 00000000000006f0
[ 40.083239] Call Trace:
[ 40.083239] <TASK>
[ 40.083240] default_idle+0x9/0x10
[ 40.083241] default_idle_call+0x2b/0x100
[ 40.083243] do_idle+0x1d0/0x230
[ 40.083244] cpu_startup_entry+0x24/0x30
[ 40.083245] rest_init+0xbc/0xc0
[ 40.083247] start_kernel+0x6ca/0x6d0
[ 40.083252] x86_64_start_reservations+0x24/0x30
[ 40.083255] x86_64_start_kernel+0xc5/0xd0
[ 40.083256] common_startup_64+0x13e/0x148
[ 40.083258] </TASK>
[ 40.083260] Kernel panic - not syncing: RCU Stall
[ 40.083261] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[ 40.083263] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 40.083263] Call Trace:
[ 40.083267] <IRQ>
[ 40.083268] dump_stack_lvl+0x4d/0x70
[ 40.083269] panic+0x10a/0x2b9
[ 40.083271] ? try_to_wake_up+0x2f2/0x630
[ 40.083273] panic_on_rcu_stall.cold+0xc/0xc
[ 40.083275] rcu_sched_clock_irq.cold+0x15f/0x3db
[ 40.083277] ? __pfx_tick_nohz_handler+0x10/0x10
[ 40.083279] update_process_times+0x70/0xb0
[ 40.083281] tick_nohz_handler+0x8c/0x150
[ 40.083284] __hrtimer_run_queues+0x148/0x2e0
[ 40.083292] hrtimer_interrupt+0xf2/0x210
[ 40.083294] __sysvec_apic_timer_interrupt+0x53/0x100
[ 40.083296] sysvec_apic_timer_interrupt+0x66/0x80
[ 40.083298] </IRQ>
[ 40.083298] <TASK>
[ 40.083299] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 40.083300] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 40.083301] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 40.083302] RSP: 0018:ffffffff9a003e80 EFLAGS: 00000216
[ 40.083303] RAX: ffff9034a1221000 RBX: ffffffff9a018900 RCX: 0000000000000001
[ 40.083303] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 000000000007507c
[ 40.083304] RBP: 0000000000000000 R08: 000000000007507c R09: ffff90343bc24d90
[ 40.083304] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 40.083305] R13: 0000000000000000 R14: ffffffff9a018038 R15: 000000007e0c1000
[ 40.083306] default_idle+0x9/0x10
[ 40.083307] default_idle_call+0x2b/0x100
[ 40.083309] do_idle+0x1d0/0x230
[ 40.083310] cpu_startup_entry+0x24/0x30
[ 40.083311] rest_init+0xbc/0xc0
[ 40.083312] start_kernel+0x6ca/0x6d0
[ 40.083313] x86_64_start_reservations+0x24/0x30
[ 40.083315] x86_64_start_kernel+0xc5/0xd0
[ 40.083316] common_startup_64+0x13e/0x148
[ 40.083317] </TASK>
[ 40.083440] Kernel Offset: 0x17600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
...
...
[ 40.057080] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 40.057091] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P1035/4:b..l P981/4:b..l
[ 40.057096] rcu: (detected by 0, t=21002 jiffies, g=2965, q=707 ncpus=4)
[ 40.057100] task:systemd state:R running task stack:12856 pid:981 tgid:981 ppid:1 task_flags:0x400100 flags:0x0000
[ 40.057109] Call Trace:
[ 40.057110] <TASK>
[ 40.057114] __schedule+0x3d4/0xf10
[ 40.057132] preempt_schedule+0x2e/0x50
[ 40.057134] preempt_schedule_thunk+0x16/0x30
[ 40.057140] try_to_wake_up+0x2fc/0x630
[ 40.057145] ep_autoremove_wake_function+0xd/0x40
[ 40.057150] __wake_up_common+0x6d/0x90
[ 40.057152] __wake_up_sync+0x33/0x50
[ 40.057154] ep_poll_callback+0xcd/0x230
[ 40.057156] __wake_up_common+0x6d/0x90
[ 40.057158] __wake_up_sync_key+0x3a/0x50
[ 40.057160] sock_def_readable+0x3d/0xb0
[ 40.057166] unix_dgram_sendmsg+0x454/0x800
[ 40.057174] ____sys_sendmsg+0x317/0x350
[ 40.057179] ___sys_sendmsg+0x94/0xe0
[ 40.057182] __sys_sendmsg+0x85/0xe0
[ 40.057187] do_syscall_64+0xa4/0x260
[ 40.057192] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 40.057197] RIP: 0033:0x7f356c320d94
[ 40.057198] RSP: 002b:00007ffc4a68a438 EFLAGS: 00000202 ORIG_RAX: 000000000000002e
[ 40.057200] RAX: ffffffffffffffda RBX: 000000000000001f RCX: 00007f356c320d94
[ 40.057202] RDX: 0000000000004000 RSI: 00007ffc4a68a490 RDI: 000000000000001f
[ 40.057203] RBP: 00007ffc4a68a620 R08: 0000000000000080 R09: 0000000000000007
[ 40.057204] R10: 00007ffc4a68a3f4 R11: 0000000000000202 R12: 0000000000000000
[ 40.057205] R13: 00007ffc4a68a490 R14: 0000000000000000 R15: 0000000000000002
[ 40.057207] </TASK>
[ 40.057207] task:(sd-close) state:D stack:14576 pid:1035 tgid:1035 ppid:1 task_flags:0x40004c flags:0x00004002
[ 40.057211] Call Trace:
[ 40.057212] <TASK>
[ 40.057212] __schedule+0x3d4/0xf10
[ 40.057215] schedule_rtlock+0x15/0x30
[ 40.057218] rtlock_slowlock_locked+0x314/0xea0
[ 40.057224] rt_spin_lock+0x79/0xd0
[ 40.057226] __wake_up+0x1a/0x50
[ 40.057227] ep_poll_callback+0x17b/0x230
[ 40.057230] __wake_up_common+0x6d/0x90
[ 40.057232] __wake_up+0x2c/0x50
[ 40.057233] __send_signal_locked+0x417/0x430
[ 40.057237] ? rt_spin_unlock+0x12/0x40
[ 40.057239] ? rt_spin_lock+0x33/0xd0
[ 40.057242] do_notify_parent+0x24a/0x2a0
[ 40.057244] do_exit+0x7cc/0xac0
[ 40.057247] __x64_sys_exit+0x16/0x20
[ 40.057249] x64_sys_call+0xfe9/0x14f0
[ 40.057251] do_syscall_64+0xa4/0x260
[ 40.057253] entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 40.057254] RIP: 0033:0x7fb2358f97de
[ 40.057255] RSP: 002b:00007fffd71b3000 EFLAGS: 00000246 ORIG_RAX: 000000000000003c
[ 40.057257] RAX: ffffffffffffffda RBX: 00007fb235adb6e0 RCX: 00007fb2358f97de
[ 40.057258] RDX: 000055a50e51881b RSI: 00007fb235d70293 RDI: 0000000000000000
[ 40.057259] RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000007
[ 40.057260] R10: 00007fb2358f97c6 R11: 0000000000000246 R12: 0000000000000019
[ 40.057261] R13: 0000000000000411 R14: 000000000000003d R15: 000055a05454c3a0
[ 40.057263] </TASK>
[ 40.057264] rcu: rcu_preempt kthread timer wakeup didn't happen for 20963 jiffies! g2965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 40.057266] rcu: Possible timer handling issue on cpu=0 timer-softirq=1618
[ 40.057267] rcu: rcu_preempt kthread starved for 20966 jiffies! g2965 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
[ 40.057269] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 40.057269] rcu: RCU grace-period kthread stack dump:
[ 40.057270] task:rcu_preempt state:I stack:14800 pid:17 tgid:17 ppid:2 task_flags:0x208040 flags:0x00004000
[ 40.057274] Call Trace:
[ 40.057274] <TASK>
[ 40.057275] __schedule+0x3d4/0xf10
[ 40.057278] schedule+0x22/0xd0
[ 40.057280] schedule_timeout+0x7e/0x100
[ 40.057282] ? __pfx_process_timeout+0x10/0x10
[ 40.057285] rcu_gp_fqs_loop+0x103/0x6b0
[ 40.057291] ? __pfx_rcu_gp_kthread+0x10/0x10
[ 40.057292] rcu_gp_kthread+0x191/0x230
[ 40.057294] kthread+0xf6/0x1f0
[ 40.057296] ? __pfx_kthread+0x10/0x10
[ 40.057298] ret_from_fork+0x80/0xd0
[ 40.057303] ? __pfx_kthread+0x10/0x10
[ 40.057305] ret_from_fork_asm+0x1a/0x30
[ 40.057308] </TASK>
[ 40.057308] rcu: Stack dump where RCU GP kthread last ran:
[ 40.057314] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[ 40.057316] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 40.057317] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 40.057319] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 40.057321] RSP: 0018:ffffffff97a03e80 EFLAGS: 00000212
[ 40.057322] RAX: ffff9269a3821000 RBX: ffffffff97a18900 RCX: 0000000000000001
[ 40.057323] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 00000000000759bc
[ 40.057324] RBP: 0000000000000000 R08: 00000000000759bc R09: ffff92693bc24d90
[ 40.057325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 40.057326] R13: 0000000000000000 R14: ffffffff97a18038 R15: 000000007e0c1000
[ 40.057330] FS: 0000000000000000(0000) GS:ffff9269a3821000(0000) knlGS:0000000000000000
[ 40.057331] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.057332] CR2: 00007f0546942030 CR3: 000000000746c000 CR4: 00000000000006f0
[ 40.057333] Call Trace:
[ 40.057334] <TASK>
[ 40.057334] default_idle+0x9/0x10
[ 40.057336] default_idle_call+0x2b/0x100
[ 40.057338] do_idle+0x1d0/0x230
[ 40.057341] cpu_startup_entry+0x24/0x30
[ 40.057342] rest_init+0xbc/0xc0
[ 40.057344] start_kernel+0x6ca/0x6d0
[ 40.057350] x86_64_start_reservations+0x24/0x30
[ 40.057354] x86_64_start_kernel+0xc5/0xd0
[ 40.057355] common_startup_64+0x13e/0x148
[ 40.057358] </TASK>
[ 40.057361] Kernel panic - not syncing: RCU Stall
[ 40.057362] CPU: 0 UID: 0 PID: 0 Comm: swapper/0 Not tainted 6.16.0-rc7 #1 PREEMPT_{RT,(full)}
[ 40.057364] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
[ 40.057364] Call Trace:
[ 40.057369] <IRQ>
[ 40.057370] dump_stack_lvl+0x4d/0x70
[ 40.057372] panic+0x10a/0x2b9
[ 40.057375] ? try_to_wake_up+0x2f2/0x630
[ 40.057377] panic_on_rcu_stall.cold+0xc/0xc
[ 40.057379] rcu_sched_clock_irq.cold+0x15f/0x3db
[ 40.057383] ? __pfx_tick_nohz_handler+0x10/0x10
[ 40.057385] update_process_times+0x70/0xb0
[ 40.057387] tick_nohz_handler+0x8c/0x150
[ 40.057391] __hrtimer_run_queues+0x148/0x2e0
[ 40.057394] hrtimer_interrupt+0xf2/0x210
[ 40.057397] __sysvec_apic_timer_interrupt+0x53/0x100
[ 40.057400] sysvec_apic_timer_interrupt+0x66/0x80
[ 40.057402] </IRQ>
[ 40.057403] <TASK>
[ 40.057403] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 40.057405] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 40.057407] Code: 76 79 00 e9 83 f5 00 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 35 68 21 00 fb f4 <c3> cc cc cc cc 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90
[ 40.057408] RSP: 0018:ffffffff97a03e80 EFLAGS: 00000212
[ 40.057409] RAX: ffff9269a3821000 RBX: ffffffff97a18900 RCX: 0000000000000001
[ 40.057410] RDX: 4000000000000000 RSI: 0000000000000000 RDI: 00000000000759bc
[ 40.057411] RBP: 0000000000000000 R08: 00000000000759bc R09: ffff92693bc24d90
[ 40.057412] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 40.057413] R13: 0000000000000000 R14: ffffffff97a18038 R15: 000000007e0c1000
[ 40.057415] default_idle+0x9/0x10
[ 40.057417] default_idle_call+0x2b/0x100
[ 40.057419] do_idle+0x1d0/0x230
[ 40.057420] cpu_startup_entry+0x24/0x30
[ 40.057422] rest_init+0xbc/0xc0
[ 40.057424] start_kernel+0x6ca/0x6d0
[ 40.057426] x86_64_start_reservations+0x24/0x30
[ 40.057428] x86_64_start_kernel+0xc5/0xd0
[ 40.057429] common_startup_64+0x13e/0x148
[ 40.057432] </TASK>
[ 40.057586] Kernel Offset: 0x15000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
...
- Tested stress-ng with 1 worker process with group CPU limit set to 20%.
stress-ng metrics showed 20% CPU usage for the worker process and top showed
20% CPU usage increase for CPU 0, where the worker process was running.
[root@localhost ~]# systemd-run -p CPUQuota=20% stress-ng --cpu 1 --timeout 10s --metrics
Running as unit: run-rb98f1ee55a4e4c9dacb29774213a399c.service; invocation ID: 77c93909960347e09e916fac907f87c6
[root@localhost ~]# journalctl -f -u run-rb98f1ee55a4e4c9dacb29774213a399c.service
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: metrc: [1310] cpu 3181 10.02 2.01 0.01 317.50 1581.53 20.08 7504
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info: [1310] skipped: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info: [1310] passed: 1: cpu (1)
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info: [1310] failed: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info: [1310] metrics untrustworthy: 0
Jul 28 16:19:31 localhost stress-ng[1310]: stress-ng: info: [1310] successful run completed in 10.02 secs
Jul 28 16:19:31 localhost systemd[1]: run-rb98f1ee55a4e4c9dacb29774213a399c.service: Deactivated successfully.
Jul 28 16:19:31 localhost systemd[1]: run-rb98f1ee55a4e4c9dacb29774213a399c.service: Consumed 2.026s CPU time.
- Tested stress-ng with 2 worker processes with group CPU limit set to 20%.
Both processes ran on the same CPU core due to the systemd slice CPU affinity
settings (AllowedCPUs=0). stress-ng metrics showed 10% CPU usage per worker
process and top showed 20% usage increase for CPU 0, where both worker
processes were running.
[root@localhost ~]# systemd-run -p CPUQuota=20% stress-ng --cpu 2 --timeout 10s --metrics
Running as unit: run-rd616594713434ac9bb346faa92f7110a.service; invocation ID: f45acc85d19944cbbdf633f0c95091bb
[root@localhost ~]# journalctl -f -u run-rd616594713434ac9bb346faa92f7110a.service
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: metrc: [1373] cpu 3275 10.06 2.00 0.01 325.53 1630.74 9.98 7476
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info: [1373] skipped: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info: [1373] passed: 2: cpu (2)
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info: [1373] failed: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info: [1373] metrics untrustworthy: 0
Jul 28 16:24:08 localhost stress-ng[1373]: stress-ng: info: [1373] successful run completed in 10.06 secs
Jul 28 16:24:08 localhost systemd[1]: run-rd616594713434ac9bb346faa92f7110a.service: Deactivated successfully.
Jul 28 16:24:08 localhost systemd[1]: run-rd616594713434ac9bb346faa92f7110a.service: Consumed 2.023s CPU time.
- Tested stress-ng with 3 worker processes with group CPU limit set to 60%.
This time without CPU affinity settings, so each process ran on a different CPU
core. stress-ng metrics showed 20% CPU usage per worker process and top showed
20% usage increase per each CPU.
[root@localhost ~]# systemd-run -p CPUQuota=60% stress-ng --cpu 3 --timeout 10s --metrics
Running as unit: run-r19417007568a4c55a02817588bd2b32f.service; invocation ID: c09e104a497c4fdfb467c6744bf3923b
[root@localhost ~]# journalctl -f -u run-r19417007568a4c55a02817588bd2b32f.service
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: metrc: [1386] cpu 1974 10.08 6.04 0.01 195.93 326.04 20.03 6856
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info: [1386] skipped: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info: [1386] passed: 3: cpu (3)
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info: [1386] failed: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info: [1386] metrics untrustworthy: 0
Jul 28 16:55:46 localhost stress-ng[1386]: stress-ng: info: [1386] successful run completed in 10.08 secs
Jul 28 16:55:46 localhost systemd[1]: run-r19417007568a4c55a02817588bd2b32f.service: Deactivated successfully.
Jul 28 16:55:46 localhost systemd[1]: run-r19417007568a4c55a02817588bd2b32f.service: Consumed 6.096s CPU time.
- Tested stress-ng with 4 worker processes with group CPU limit set to 40%.
Also this time without CPU affinity settings, so each process ran on a
different CPU core. stress-ng metrics showed 10% CPU usage per worker process
and top showed 10% usage increase per each CPU.
[root@localhost ~]# systemd-run -p CPUQuota=40% stress-ng --cpu 4 --timeout 10s --metrics
Running as unit: run-r70a53f5333b948029f9739e80454648d.service; invocation ID: be510cc4c4e74676a9749c1758e65226
[root@localhost ~]# journalctl -f -u run-r70a53f5333b948029f9739e80454648d.service
Jul 28 16:58:33 localhost systemd[1]: Started /usr/bin/stress-ng --cpu 4 --timeout 10s --metrics.
Jul 28 16:58:33 localhost stress-ng[1420]: invoked with '/usr/bin/stress-ng --cpu 4 --timeout 10s --metrics' by user 0 'root'
Jul 28 16:58:33 localhost stress-ng[1420]: system: 'localhost' Linux 6.16.0-rc7-00005-g3113c41a2959 #1 SMP PREEMPT_RT Wed Jul 23 18:00:56 CEST 2025 x86_64
Jul 28 16:58:33 localhost stress-ng[1420]: stress-ng: info: [1420] setting to a 10 secs run per stressor
Jul 28 16:58:33 localhost stress-ng[1420]: memory (MB): total 1973.16, free 1749.38, shared 10.99, buffer 7.56, swap 0.00, free swap 0.00
Jul 28 16:58:33 localhost stress-ng[1420]: stress-ng: info: [1420] dispatching hogs: 4 cpu
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: metrc: [1420] cpu 1062 10.08 4.01 0.02 105.41 263.63 10.00 7276
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info: [1420] skipped: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info: [1420] passed: 4: cpu (4)
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info: [1420] failed: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info: [1420] metrics untrustworthy: 0
Jul 28 16:58:43 localhost stress-ng[1420]: stress-ng: info: [1420] successful run completed in 10.08 secs
Jul 28 16:58:43 localhost systemd[1]: run-r70a53f5333b948029f9739e80454648d.service: Deactivated successfully.
Jul 28 16:58:43 localhost systemd[1]: run-r70a53f5333b948029f9739e80454648d.service: Consumed 4.047s CPU time.
- Tested stress-ng with 4 worker processes with group CPU limit set to 200%.
Also this time without CPU affinity settings, so each process ran on a
different CPU core. stress-ng metrics showed 50% CPU usage per worker process
and top showed 50% usage increase per each CPU.
[root@localhost ~]# systemd-run -p CPUQuota=200% stress-ng --cpu 4 --timeout 10s --metrics
Running as unit: run-r887083cd168e4b3fa07672b09c3bb72d.service; invocation ID: 224b6544b79e449db43b42455700fddd
[root@localhost ~]# journalctl -f -u run-r887083cd168e4b3fa07672b09c3bb72d.service
Jul 28 17:03:44 localhost systemd[1]: Started /usr/bin/stress-ng --cpu 4 --timeout 10s --metrics.
Jul 28 17:03:44 localhost stress-ng[1169]: invoked with '/usr/bin/stress-ng --cpu 4 --timeout 10s --metrics' by user 0 'root'
Jul 28 17:03:44 localhost stress-ng[1169]: system: 'localhost' Linux 6.16.0-rc7-00005-g3113c41a2959 #1 SMP PREEMPT_RT Wed Jul 23 18:00:56 CEST 2025 x86_64
Jul 28 17:03:44 localhost stress-ng[1169]: stress-ng: info: [1169] setting to a 10 secs run per stressor
Jul 28 17:03:44 localhost stress-ng[1169]: memory (MB): total 1973.15, free 1722.16, shared 10.26, buffer 7.33, swap 0.00, free swap 0.00
Jul 28 17:03:44 localhost stress-ng[1169]: stress-ng: info: [1169] dispatching hogs: 4 cpu
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: metrc: [1169] cpu 21605 10.00 20.11 0.01 2160.25 1073.95 50.29 7424
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info: [1169] skipped: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info: [1169] passed: 4: cpu (4)
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info: [1169] failed: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info: [1169] metrics untrustworthy: 0
Jul 28 17:03:54 localhost stress-ng[1169]: stress-ng: info: [1169] successful run completed in 10.00 secs
Jul 28 17:03:54 localhost systemd[1]: run-r887083cd168e4b3fa07672b09c3bb72d.service: Deactivated successfully.
Jul 28 17:03:54 localhost systemd[1]: run-r887083cd168e4b3fa07672b09c3bb72d.service: Consumed 20.133s CPU time.
Best regards,
Matteo Martelli
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-08-01 14:31 ` Matteo Martelli
@ 2025-08-04 7:52 ` Aaron Lu
2025-08-04 11:18 ` Valentin Schneider
2025-08-08 16:37 ` Matteo Martelli
0 siblings, 2 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-04 7:52 UTC (permalink / raw)
To: Matteo Martelli
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
Hi Matteo,
On Fri, Aug 01, 2025 at 04:31:25PM +0200, Matteo Martelli wrote:
... ...
> I encountered this issue on a test image with both PREEMPT_RT and
> CFS_BANDWIDTH kernel options enabled. The test image is based on
> freedesktop-sdk (v24.08.10) [1] with custom system configurations on
> top, and it was being run on qemu x86_64 with 4 virtual CPU cores. One
> notable system configuration is having most of system services running
> on a systemd slice, restricted on a single CPU core (with AllowedCPUs
> systemd option) and using CFS throttling (with CPUQuota systemd option).
> With this configuration I encountered RCU stalls during boots, I think
> because of the increased probability given by multiple processes being
> spawned simultaneously on the same core. After the first RCU stall, the
> system becomes unresponsive and successive RCU stalls are detected
> periodically. This seems to match with the deadlock situation described
> in your cover letter. I could only reproduce RCU stalls with the
> combination of both PREEMPT_RT and CFS_BANDWIDTH enabled.
>
> I previously already tested this patch set at v2 (RFC) [2] on top of
> kernel v6.14 and v6.15. I've now retested it at v3 on top of kernel
> v6.16-rc7. I could no longer reproduce RCU stalls in all cases with the
> patch set applied. More specifically, in the last test I ran, without
> patch set applied, I could reproduce 32 RCU stalls in 24 hours, about 1
> or 2 every hour. In this test the system was rebooting just after the
> first RCU stall occurrence (through panic_on_rcu_stall=1 and panic=5
> kernel cmdline arguments) or after 100 seconds if no RCU stall occurred.
> This means the system rebooted 854 times in 24 hours (about 3.7%
> reproducibility). You can see below two RCU stall instances. I could not
> reproduce any RCU stall with the same test after applying the patch set.
> I obtained similar results while testing the patch set at v2 (RFC)[1].
> Another possibly interesting note is that the original custom
> configuration was with the slice CPUQuota=150%, then I retested it with
> CPUQuota=80%. The issue was reproducible in both configurations, notably
> even with CPUQuota=150% that to my understanding should correspond to no
> CFS throttling due to the CPU affinity set to 1 core only.
Agreed. With CPU affinity set to 1 core, a 150% quota should never be
hit. But from the test results, it seems the quota is hit somehow,
because if it were not hit, this series should make no difference.
Maybe fire a bpftrace script and see if quota is actually hit? A
reference script is here:
https://lore.kernel.org/lkml/20250521115115.GB24746@bytedance/
>
> I also ran some quick tests with stress-ng and systemd CPUQuota parameter to
> verify that CFS throttling was behaving as expected. See details below after
> RCU stall logs.
Thanks for all these tests. If I read them correctly, CFS throttling
worked as expected in all of them, right?
>
> I hope this is helpful information and I can provide additional details if
> needed.
>
Yes it's very helpful.
> Tested-by: Matteo Martelli <matteo.martelli@codethink.co.uk>
>
Thanks!
> [1]: https://gitlab.com/freedesktop-sdk/freedesktop-sdk/-/releases/freedesktop-sdk-24.08.10
> [2]: https://lore.kernel.org/all/20250409120746.635476-1-ziqianlu@bytedance.com/
>
I'll rebase this series after merge window for v6.17 is closed and
hopefully it's in good shape and maintainer will pick it up :)
Best regards,
Aaron
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (6 preceding siblings ...)
2025-08-01 14:31 ` Matteo Martelli
@ 2025-08-04 8:51 ` K Prateek Nayak
2025-08-04 11:48 ` Aaron Lu
2025-08-27 14:58 ` Valentin Schneider
8 siblings, 1 reply; 48+ messages in thread
From: K Prateek Nayak @ 2025-08-04 8:51 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider, Ben Segall, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
Hello Aaron,
On 7/15/2025 12:46 PM, Aaron Lu wrote:
> v3:
> - Keep throttled cfs_rq's PELT clock running as long as it still has
> entity queued, suggested by Benjamin Segall. I've folded this change
> into patch3;
> - Rebased on top of tip/sched/core, commit 2885daf47081
> ("lib/smp_processor_id: Make migration check unconditional of SMP").
>
> Hi Prateek,
> I've kept your tested-by tag(Thanks!) for v2 since I believe this pelt
> clock change should not affect things much, but let me know if you don't
> think that is appropriate.
I've officially tested this series so it should be fine :)
In addition to Jan's test, I also did some sanity tests looking at PELT
and everything looks good for the simplest case - one busy loop inside
a cgroup that gets throttled. The per-task throttling behavior is
identical to the current behavior for this simplest case.
If I find time, I'll look into nested hierarchies with wakeups to see
if I can spot anything odd there. I don't really have a good control
setup to compare against here but so far I haven't found anything odd
and it works as intended.
>
> Tests I've done:
> - Jan's rt deadlock reproducer[1]. Without this series, I saw rcu-stalls
> within 2 minutes and with this series, I do not see rcu-stalls after
> 10 minutes.
> - A stress test that creates a lot of pressure on fork/exit path and
> cgroup_threadgroup_rwsem. Without this series, the test will cause
> task hung in about 5 minutes and with this series, no problem found
> after several hours. Songtang wrote this test script and I've used it
> to verify the patches, thanks Songtang.
I just noticed this script. I'll give this a spin too when I test
nested hierarchies.
--
Thanks and Regards,
Prateek
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-08-04 7:52 ` Aaron Lu
@ 2025-08-04 11:18 ` Valentin Schneider
2025-08-04 11:56 ` Aaron Lu
2025-08-08 16:37 ` Matteo Martelli
1 sibling, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-08-04 11:18 UTC (permalink / raw)
To: Aaron Lu, Matteo Martelli
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On 04/08/25 15:52, Aaron Lu wrote:
> I'll rebase this series after merge window for v6.17 is closed and
> hopefully it's in good shape and maintainer will pick it up :)
>
FWIW I've had this buried in my todolist for too long, I'm bumping it up
and will do a proper review starting this week.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-08-04 8:51 ` K Prateek Nayak
@ 2025-08-04 11:48 ` Aaron Lu
0 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-04 11:48 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Mon, Aug 04, 2025 at 02:21:30PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 7/15/2025 12:46 PM, Aaron Lu wrote:
> > v3:
> > - Keep throttled cfs_rq's PELT clock running as long as it still has
> > entity queued, suggested by Benjamin Segall. I've folded this change
> > into patch3;
> > - Rebased on top of tip/sched/core, commit 2885daf47081
> > ("lib/smp_processor_id: Make migration check unconditional of SMP").
> >
> > Hi Prateek,
> > I've kept your tested-by tag(Thanks!) for v2 since I believe this pelt
> > clock change should not affect things much, but let me know if you don't
> > think that is appropriate.
>
> I've officially tested this series so it should be fine :)
Good to hear this :)
>
> In addition to Jan's test, I also did some sanity test looking at PELT
> and everything looks good for the simplest case - once busy loop inside
> a cgroup that gets throttled. The per-task throttling behavior is
> identical to the current behavior for this simplest case.
>
> If I find time, I'll look into nested hierarchies with wakeups to see
> if I can spot anything odd there. I don't really have a good control
> setup to compare against here but so far I haven't found anything odd
> and it works as intended.
>
Thanks for all these tests.
Best regards,
Aaron
> >
> > Tests I've done:
> > - Jan's rt deadlock reproducer[1]. Without this series, I saw rcu-stalls
> > within 2 minutes and with this series, I do not see rcu-stalls after
> > 10 minutes.
> > - A stress test that creates a lot of pressure on fork/exit path and
> > cgroup_threadgroup_rwsem. Without this series, the test will cause
> > task hung in about 5 minutes and with this series, no problem found
> > after several hours. Songtang wrote this test script and I've used it
> > to verify the patches, thanks Songtang.
>
> I just noticed this script. I'll give this a spin too when I test
> nested hierarchies.
>
> --
> Thanks and Regards,
> Prateek
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-08-04 11:18 ` Valentin Schneider
@ 2025-08-04 11:56 ` Aaron Lu
0 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-04 11:56 UTC (permalink / raw)
To: Valentin Schneider
Cc: Matteo Martelli, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Mon, Aug 04, 2025 at 01:18:05PM +0200, Valentin Schneider wrote:
> On 04/08/25 15:52, Aaron Lu wrote:
> > I'll rebase this series after merge window for v6.17 is closed and
> > hopefully it's in good shape and maintainer will pick it up :)
> >
>
> FWIW I've had this buried in my todolist for too long, I'm bumping it up
> and will do a proper review starting this week.
>
Thanks Valentin.
It's great you can look at this; I look forward to your comments.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-07-15 23:29 ` kernel test robot
2025-07-16 15:20 ` kernel test robot
@ 2025-08-08 9:12 ` Valentin Schneider
2025-08-08 10:13 ` Aaron Lu
2025-08-17 8:50 ` Chen, Yu C
3 siblings, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-08-08 9:12 UTC (permalink / raw)
To: Aaron Lu, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On 15/07/25 15:16, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>
> In current throttle model, when a cfs_rq is throttled, its entity will
> be dequeued from cpu's rq, making tasks attached to it not able to run,
> thus achiveing the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it can not run till its task group's
> next period comes, which can be a relatively long time. Waiting writer
> will have to wait longer due to this and it also makes further reader
> build up and eventually trigger task hung.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, record its throttled status but do not remove
> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> they get picked, add a task work to them so that when they return
> to user, they can be dequeued there. In this way, tasks throttled will
> not hold any kernel resources. And on unthrottle, enqueue back those
> tasks so they can continue to run.
>
Moving the actual throttle work to pick time is clever. In my previous
versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
but this makes the code a lot more palatable. I'd like to see how this
impacts performance though.
> Throttled cfs_rq's PELT clock is handled differently now: previously the
> cfs_rq's PELT clock is stopped once it entered throttled state but since
> now tasks(in kernel mode) can continue to run, change the behaviour to
> stop PELT clock only when the throttled cfs_rq has no tasks left.
>
I haven't spent much time looking at the PELT stuff yet; I'll do that next
week. Thanks for all the work you've been putting into this, and sorry it
took me this long to get a proper look at it!
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> +static bool enqueue_throttled_task(struct task_struct *p)
> +{
> + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> +
> + /*
> + * If the throttled task is enqueued to a throttled cfs_rq,
> + * take the fast path by directly put the task on target
> + * cfs_rq's limbo list, except when p is current because
> + * the following race can cause p's group_node left in rq's
> + * cfs_tasks list when it's throttled:
> + *
> + * cpuX cpuY
> + * taskA ret2user
> + * throttle_cfs_rq_work() sched_move_task(taskA)
> + * task_rq_lock acquired
> + * dequeue_task_fair(taskA)
> + * task_rq_lock released
> + * task_rq_lock acquired
> + * task_current_donor(taskA) == true
> + * task_on_rq_queued(taskA) == true
> + * dequeue_task(taskA)
> + * put_prev_task(taskA)
> + * sched_change_group()
> + * enqueue_task(taskA) -> taskA's new cfs_rq
> + * is throttled, go
> + * fast path and skip
> + * actual enqueue
> + * set_next_task(taskA)
> + * __set_next_task_fair(taskA)
> + * list_move(&se->group_node, &rq->cfs_tasks); // bug
> + * schedule()
> + *
> + * And in the above race case, the task's current cfs_rq is in the same
> + * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> + * task so we can use its current cfs_rq to derive rq and test if the
> + * task is current.
> + */
OK I think I got this; I slightly rephrased things to match similar
comments in the sched code:
--->8---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a7c86c5b8a2b..8566ee0399527 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5827,37 +5827,38 @@ static bool enqueue_throttled_task(struct task_struct *p)
struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
/*
- * If the throttled task is enqueued to a throttled cfs_rq,
- * take the fast path by directly put the task on target
- * cfs_rq's limbo list, except when p is current because
- * the following race can cause p's group_node left in rq's
- * cfs_tasks list when it's throttled:
+ * If the throttled task @p is enqueued to a throttled cfs_rq,
+ * take the fast path by directly putting the task on the
+ * target cfs_rq's limbo list.
+ *
+ * Do not do that when @p is current because the following race can
+ * cause @p's group_node to be incorrectly re-inserted in its rq's
+ * cfs_tasks list, despite being throttled:
*
* cpuX cpuY
- * taskA ret2user
- * throttle_cfs_rq_work() sched_move_task(taskA)
- * task_rq_lock acquired
- * dequeue_task_fair(taskA)
- * task_rq_lock released
- * task_rq_lock acquired
- * task_current_donor(taskA) == true
- * task_on_rq_queued(taskA) == true
- * dequeue_task(taskA)
- * put_prev_task(taskA)
- * sched_change_group()
- * enqueue_task(taskA) -> taskA's new cfs_rq
- * is throttled, go
- * fast path and skip
- * actual enqueue
- * set_next_task(taskA)
- * __set_next_task_fair(taskA)
- * list_move(&se->group_node, &rq->cfs_tasks); // bug
+ * p ret2user
+ * throttle_cfs_rq_work() sched_move_task(p)
+ * LOCK task_rq_lock
+ * dequeue_task_fair(p)
+ * UNLOCK task_rq_lock
+ * LOCK task_rq_lock
+ * task_current_donor(p) == true
+ * task_on_rq_queued(p) == true
+ * dequeue_task(p)
+ * put_prev_task(p)
+ * sched_change_group()
+ * enqueue_task(p) -> p's new cfs_rq
+ * is throttled, go
+ * fast path and skip
+ * actual enqueue
+ * set_next_task(p)
+ * list_move(&se->group_node, &rq->cfs_tasks); // bug
* schedule()
*
- * And in the above race case, the task's current cfs_rq is in the same
- * rq as its previous cfs_rq because sched_move_task() doesn't migrate
- * task so we can use its current cfs_rq to derive rq and test if the
- * task is current.
+ * In the above race case, @p's current cfs_rq is in the same
+ * rq as its previous cfs_rq because sched_move_task() only moves a task
+ * to a different group within the same rq, so we can use its current
+ * cfs_rq to derive the rq and test if the task is current.
*/
if (throttled_hierarchy(cfs_rq) &&
!task_current_donor(rq_of(cfs_rq), p)) {
---8<---
> + if (throttled_hierarchy(cfs_rq) &&
> + !task_current_donor(rq_of(cfs_rq), p)) {
> + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> + return true;
> + }
> +
> + /* we can't take the fast path, do an actual enqueue*/
> + p->throttled = false;
So we clear p->throttled but not p->throttle_node? Won't that cause issues
when @p's previous cfs_rq gets unthrottled?
> + return false;
> +}
> +
> @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> */
> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> {
> + if (unlikely(task_is_throttled(p))) {
> + dequeue_throttled_task(p, flags);
> + return true;
> + }
> +
Handling a throttled task's move pattern at dequeue does simplify things;
however, that's quite a hot path. I think this wants at the very least a
if (cfs_bandwidth_used())
since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
hurt, but we can probably find some people who really care about that to
run it for us ;)
> if (!p->se.sched_delayed)
> util_est_dequeue(&rq->cfs, p);
>
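Concretely, something along these lines is what I have in mind (just a
sketch on top of the hunk above; cfs_bandwidth_used() is the existing
static-key helper in fair.c):

--->8---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
-	if (unlikely(task_is_throttled(p))) {
+	/* Bail out early via the static key when bandwidth control is unused. */
+	if (cfs_bandwidth_used() && unlikely(task_is_throttled(p))) {
 		dequeue_throttled_task(p, flags);
 		return true;
 	}
---8<---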
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-08 9:12 ` Valentin Schneider
@ 2025-08-08 10:13 ` Aaron Lu
2025-08-08 11:45 ` Valentin Schneider
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-08-08 10:13 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
> On 15/07/25 15:16, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> >
> > In current throttle model, when a cfs_rq is throttled, its entity will
> > be dequeued from cpu's rq, making tasks attached to it not able to run,
> > thus achiveing the throttle target.
> >
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken, it can not run till its task group's
> > next period comes, which can be a relatively long time. Waiting writer
> > will have to wait longer due to this and it also makes further reader
> > build up and eventually trigger task hung.
> >
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, record its throttled status but do not remove
> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > they get picked, add a task work to them so that when they return
> > to user, they can be dequeued there. In this way, tasks throttled will
> > not hold any kernel resources. And on unthrottle, enqueue back those
> > tasks so they can continue to run.
> >
>
> Moving the actual throttle work to pick time is clever. In my previous
> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
> but this makes the code a lot more palatable. I'd like to see how this
> impacts performance though.
>
Let me run some scheduler benchmarks to see how it impacts performance.
I'm thinking of running something like hackbench on server platforms:
first with no quota set, to see if performance changes; then with quota
set, to see how performance is affected.
Does this sound good to you? Or do you have any specific benchmark and
test methodology in mind?
> > Throttled cfs_rq's PELT clock is handled differently now: previously the
> > cfs_rq's PELT clock is stopped once it entered throttled state but since
> > now tasks(in kernel mode) can continue to run, change the behaviour to
> > stop PELT clock only when the throttled cfs_rq has no tasks left.
> >
>
> I haven't spent much time looking at the PELT stuff yet; I'll do that next
> week. Thanks for all the work you've been putting into this, and sorry it
> took me this long to get a proper look at it!
>
Never mind and thanks for taking a look now :)
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>
> > +static bool enqueue_throttled_task(struct task_struct *p)
> > +{
> > + struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
> > +
> > + /*
> > + * If the throttled task is enqueued to a throttled cfs_rq,
> > + * take the fast path by directly put the task on target
> > + * cfs_rq's limbo list, except when p is current because
> > + * the following race can cause p's group_node left in rq's
> > + * cfs_tasks list when it's throttled:
> > + *
> > + * cpuX cpuY
> > + * taskA ret2user
> > + * throttle_cfs_rq_work() sched_move_task(taskA)
> > + * task_rq_lock acquired
> > + * dequeue_task_fair(taskA)
> > + * task_rq_lock released
> > + * task_rq_lock acquired
> > + * task_current_donor(taskA) == true
> > + * task_on_rq_queued(taskA) == true
> > + * dequeue_task(taskA)
> > + * put_prev_task(taskA)
> > + * sched_change_group()
> > + * enqueue_task(taskA) -> taskA's new cfs_rq
> > + * is throttled, go
> > + * fast path and skip
> > + * actual enqueue
> > + * set_next_task(taskA)
> > + * __set_next_task_fair(taskA)
> > + * list_move(&se->group_node, &rq->cfs_tasks); // bug
> > + * schedule()
> > + *
> > + * And in the above race case, the task's current cfs_rq is in the same
> > + * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> > + * task so we can use its current cfs_rq to derive rq and test if the
> > + * task is current.
> > + */
>
> OK I think I got this; I slightly rephrased things to match similar
> comments in the sched code:
>
> --->8---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3a7c86c5b8a2b..8566ee0399527 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5827,37 +5827,38 @@ static bool enqueue_throttled_task(struct task_struct *p)
> struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);
>
> /*
> - * If the throttled task is enqueued to a throttled cfs_rq,
> - * take the fast path by directly put the task on target
> - * cfs_rq's limbo list, except when p is current because
> - * the following race can cause p's group_node left in rq's
> - * cfs_tasks list when it's throttled:
> + * If the throttled task @p is enqueued to a throttled cfs_rq,
> + * take the fast path by directly putting the task on the
> + * target cfs_rq's limbo list.
> + *
> + * Do not do that when @p is current because the following race can
> + * cause @p's group_node to be incorrectly re-inserted in its rq's
> + * cfs_tasks list, despite being throttled:
> *
> * cpuX cpuY
> - * taskA ret2user
> - * throttle_cfs_rq_work() sched_move_task(taskA)
> - * task_rq_lock acquired
> - * dequeue_task_fair(taskA)
> - * task_rq_lock released
> - * task_rq_lock acquired
> - * task_current_donor(taskA) == true
> - * task_on_rq_queued(taskA) == true
> - * dequeue_task(taskA)
> - * put_prev_task(taskA)
> - * sched_change_group()
> - * enqueue_task(taskA) -> taskA's new cfs_rq
> - * is throttled, go
> - * fast path and skip
> - * actual enqueue
> - * set_next_task(taskA)
> - * __set_next_task_fair(taskA)
> - * list_move(&se->group_node, &rq->cfs_tasks); // bug
> + * p ret2user
> + * throttle_cfs_rq_work() sched_move_task(p)
> + * LOCK task_rq_lock
> + * dequeue_task_fair(p)
> + * UNLOCK task_rq_lock
> + * LOCK task_rq_lock
> + * task_current_donor(p) == true
> + * task_on_rq_queued(p) == true
> + * dequeue_task(p)
> + * put_prev_task(p)
> + * sched_change_group()
> + * enqueue_task(p) -> p's new cfs_rq
> + * is throttled, go
> + * fast path and skip
> + * actual enqueue
> + * set_next_task(p)
> + * list_move(&se->group_node, &rq->cfs_tasks); // bug
> * schedule()
> *
> - * And in the above race case, the task's current cfs_rq is in the same
> - * rq as its previous cfs_rq because sched_move_task() doesn't migrate
> - * task so we can use its current cfs_rq to derive rq and test if the
> - * task is current.
> + * In the above race case, @p's current cfs_rq is in the same
> + * rq as its previous cfs_rq because sched_move_task() only moves a task
> + * to a different group from the same rq, so we can use its current
> + * cfs_rq to derive rq and test if the task is current.
> */
> if (throttled_hierarchy(cfs_rq) &&
> !task_current_donor(rq_of(cfs_rq), p)) {
> ---8<---
>
Will follow your suggestion in next version.
> > + if (throttled_hierarchy(cfs_rq) &&
> > + !task_current_donor(rq_of(cfs_rq), p)) {
> > + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> > + return true;
> > + }
> > +
> > + /* we can't take the fast path, do an actual enqueue*/
> > + p->throttled = false;
>
> So we clear p->throttled but not p->throttle_node? Won't that cause issues
> when @p's previous cfs_rq gets unthrottled?
>
p->throttle_node is already removed from its previous cfs_rq at dequeue
time in dequeue_throttled_task().
This is done because at enqueue time, we may not hold its previous
rq's lock and so can't touch its previous cfs_rq's limbo list, e.g.
when dealing with affinity changes.
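To make that concrete, the dequeue side is roughly shaped like this (an
illustrative sketch, not the exact patch 3 code; p->throttle_node is the
limbo list linkage added earlier in this series):

static void dequeue_throttled_task(struct task_struct *p, int flags)
{
	/*
	 * @p is throttled: it sits on its cfs_rq's limbo list instead of
	 * the cfs tree. Detach it here, while the old rq's lock is held,
	 * so a later enqueue (e.g. after an affinity change to another
	 * rq) never has to touch the old cfs_rq's limbo list.
	 */
	list_del_init(&p->throttle_node);
}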
> > + return false;
> > +}
> > +
>
> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > */
> > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > {
> > + if (unlikely(task_is_throttled(p))) {
> > + dequeue_throttled_task(p, flags);
> > + return true;
> > + }
> > +
>
> Handling a throttled task's move pattern at dequeue does simplify things,
> however that's quite a hot path. I think this wants at the very least a
>
> if (cfs_bandwidth_used())
>
> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
> hurt, but we can probably find some people who really care about that to
> run it for us ;)
>
No problem.
I'm thinking maybe I can add this cfs_bandwidth_used() check in
task_is_throttled(), so that other callsites of task_is_throttled()
can also benefit from paying less cost when cfs bandwidth is not
enabled.
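Something along these lines (just a sketch; task_is_throttled() and
p->throttled are the helper/field added earlier in this series):

static inline bool task_is_throttled(struct task_struct *p)
{
	/* static key: near zero cost when no group uses cpu quota */
	return cfs_bandwidth_used() && p->throttled;
}

Callers like the dequeue_task_fair() hunk above would then stay
unchanged.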
> > if (!p->se.sched_delayed)
> > util_est_dequeue(&rq->cfs, p);
> >
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-08 10:13 ` Aaron Lu
@ 2025-08-08 11:45 ` Valentin Schneider
2025-08-12 8:48 ` Aaron Lu
2025-08-28 3:50 ` Aaron Lu
0 siblings, 2 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-08-08 11:45 UTC (permalink / raw)
To: Aaron Lu
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On 08/08/25 18:13, Aaron Lu wrote:
> On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
>> On 15/07/25 15:16, Aaron Lu wrote:
>> > From: Valentin Schneider <vschneid@redhat.com>
>> >
>> > In current throttle model, when a cfs_rq is throttled, its entity will
>> > be dequeued from cpu's rq, making tasks attached to it not able to run,
>> > thus achieving the throttle target.
>> >
>> > This has a drawback though: assume a task is a reader of percpu_rwsem
>> > and is waiting. When it gets woken, it can not run till its task group's
>> > next period comes, which can be a relatively long time. Waiting writer
>> > will have to wait longer due to this and it also makes further reader
>> > build up and eventually trigger task hung.
>> >
>> > To improve this situation, change the throttle model to task based, i.e.
>> > when a cfs_rq is throttled, record its throttled status but do not remove
>> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
>> > they get picked, add a task work to them so that when they return
>> > to user, they can be dequeued there. In this way, tasks throttled will
>> > not hold any kernel resources. And on unthrottle, enqueue back those
>> > tasks so they can continue to run.
>> >
>>
>> Moving the actual throttle work to pick time is clever. In my previous
>> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
>> but this makes the code a lot more palatable. I'd like to see how this
>> impacts performance though.
>>
>
> Let me run some scheduler benchmark to see how it impacts performance.
>
> I'm thinking maybe running something like hackbench on server platforms,
> first with quota not set and see if performance changes; then also test
> with quota set and see how performance changes.
>
> Does this sound good to you? Or do you have any specific benchmark and
> test methodology in mind?
>
Yeah hackbench is pretty good for stressing the EQ/DQ paths.
>> > + if (throttled_hierarchy(cfs_rq) &&
>> > + !task_current_donor(rq_of(cfs_rq), p)) {
>> > + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>> > + return true;
>> > + }
>> > +
>> > + /* we can't take the fast path, do an actual enqueue*/
>> > + p->throttled = false;
>>
>> So we clear p->throttled but not p->throttle_node? Won't that cause issues
>> when @p's previous cfs_rq gets unthrottled?
>>
>
> p->throttle_node is already removed from its previous cfs_rq at dequeue
> time in dequeue_throttled_task().
>
> This is done because at enqueue time, we may not hold its previous
> rq's lock and so can't touch its previous cfs_rq's limbo list, e.g.
> when dealing with affinity changes.
>
Ah right, the DQ/EQ_throttled_task() functions are for when DQ/EQ is applied
to an already-throttled task, and they do the right thing.
Does this mean we want this as enqueue_throttled_task()'s prologue?
/* @p should have gone through dequeue_throttled_task() first */
WARN_ON_ONCE(!list_empty(&p->throttle_node));
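i.e., putting it together with the hunk quoted earlier (race comment
elided):

static bool enqueue_throttled_task(struct task_struct *p)
{
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	/* @p should have gone through dequeue_throttled_task() first */
	WARN_ON_ONCE(!list_empty(&p->throttle_node));

	if (throttled_hierarchy(cfs_rq) &&
	    !task_current_donor(rq_of(cfs_rq), p)) {
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		return true;
	}

	/* we can't take the fast path, do an actual enqueue */
	p->throttled = false;
	return false;
}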
>> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>> > */
>> > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> > {
>> > + if (unlikely(task_is_throttled(p))) {
>> > + dequeue_throttled_task(p, flags);
>> > + return true;
>> > + }
>> > +
>>
>> Handling a throttled task's move pattern at dequeue does simplify things,
>> however that's quite a hot path. I think this wants at the very least a
>>
>> if (cfs_bandwidth_used())
>>
>> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
>> hurt, but we can probably find some people who really care about that to
>> run it for us ;)
>>
>
> No problem.
>
> I'm thinking maybe I can add this cfs_bandwidth_used() check in
> task_is_throttled(), so that other callsites of task_is_throttled()
> can also benefit from paying less cost when cfs bandwidth is not
> enabled.
>
Sounds good to me; just drop the unlikely and let the static key do its
thing :)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-08-04 7:52 ` Aaron Lu
2025-08-04 11:18 ` Valentin Schneider
@ 2025-08-08 16:37 ` Matteo Martelli
1 sibling, 0 replies; 48+ messages in thread
From: Matteo Martelli @ 2025-08-08 16:37 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Matteo Martelli
Hi Aaron,
On Mon, 4 Aug 2025 15:52:04 +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> Hi Matteo,
>
> On Fri, Aug 01, 2025 at 04:31:25PM +0200, Matteo Martelli wrote:
> ... ...
> > I encountered this issue on a test image with both PREEMPT_RT and
> > CFS_BANDWIDTH kernel options enabled. The test image is based on
> > freedesktop-sdk (v24.08.10) [1] with custom system configurations on
> > top, and it was being run on qemu x86_64 with 4 virtual CPU cores. One
> > notable system configuration is having most of system services running
> > on a systemd slice, restricted on a single CPU core (with AllowedCPUs
> > systemd option) and using CFS throttling (with CPUQuota systemd option).
> > With this configuration I encountered RCU stalls during boots, I think
> > because of the increased probability given by multiple processes being
> > spawned simultaneously on the same core. After the first RCU stall, the
> > system becomes unresponsive and successive RCU stalls are detected
> > periodically. This seems to match with the deadlock situation described
> > in your cover letter. I could only reproduce RCU stalls with the
> > combination of both PREEMPT_RT and CFS_BANDWIDTH enabled.
> >
> > I previously already tested this patch set at v2 (RFC) [2] on top of
> > kernel v6.14 and v6.15. I've now retested it at v3 on top of kernel
> > v6.16-rc7. I could no longer reproduce RCU stalls in all cases with the
> > patch set applied. More specifically, in the last test I ran, without
> > patch set applied, I could reproduce 32 RCU stalls in 24 hours, about 1
> > or 2 every hour. In this test the system was rebooting just after the
> > first RCU stall occurrence (through panic_on_rcu_stall=1 and panic=5
> > kernel cmdline arguments) or after 100 seconds if no RCU stall occurred.
> > This means the system rebooted 854 times in 24 hours (about 3.7%
> > reproducibility). You can see below two RCU stall instances. I could not
> > reproduce any RCU stall with the same test after applying the patch set.
> > I obtained similar results while testing the patch set at v2 (RFC)[1].
> > Another possibly interesting note is that the original custom
> > configuration was with the slice CPUQuota=150%, then I retested it with
> > CPUQuota=80%. The issue was reproducible in both configurations, notably
> > even with CPUQuota=150% that to my understanding should correspond to no
> > CFS throttling due to the CPU affinity set to 1 core only.
>
> Agree. With cpu affinity set to 1 core, 150% quota should never hit. But
> from the test results, it seems quota is hit somehow because if quota is
> not hit, this series should make no difference.
>
> Maybe fire a bpftrace script and see if quota is actually hit? A
> reference script is here:
> https://lore.kernel.org/lkml/20250521115115.GB24746@bytedance/
>
I looked into this more closely and there was actually another slice
(user.slice) configured with CPUQuota=25%. After disabling the CPUQuota
limit on the first mentioned slice (system.slice), I could still
reproduce the RCU stalls. It looks like the throttling was happening
during the first login after boot, as also shown by the following
ftrace logs.
[ 12.019263] podman-user-gen-992 [000] dN.2. 12.023684: throttle_cfs_rq <-pick_task_fair
[ 12.051074] systemd-981 [000] dN.2. 12.055502: throttle_cfs_rq <-pick_task_fair
[ 12.150067] systemd-981 [000] dN.2. 12.154500: throttle_cfs_rq <-put_prev_entity
[ 12.251448] systemd-981 [000] dN.2. 12.255839: throttle_cfs_rq <-put_prev_entity
[ 12.369867] sshd-session-976 [000] dN.2. 12.374293: throttle_cfs_rq <-pick_task_fair
[ 12.453080] bash-1002 [000] dN.2. 12.457502: throttle_cfs_rq <-pick_task_fair
[ 12.551279] bash-1012 [000] dN.2. 12.555701: throttle_cfs_rq <-pick_task_fair
[ 12.651085] podman-998 [000] dN.2. 12.655505: throttle_cfs_rq <-pick_task_fair
[ 12.750509] podman-1001 [000] dN.2. 12.754931: throttle_cfs_rq <-put_prev_entity
[ 12.868351] podman-1030 [000] dN.2. 12.872780: throttle_cfs_rq <-put_prev_entity
[ 12.961076] podman-1033 [000] dN.2. 12.965504: throttle_cfs_rq <-put_prev_entity
After increasing the user.slice CPUQuota limit to 50%, the same test
mentioned in my previous email produced fewer RCU stalls and fewer
throttling events in the ftrace logs. Then, with the user.slice set to
100%, I could no longer reproduce either RCU stalls or traced
throttling events.
> > I also ran some quick tests with stress-ng and systemd CPUQuota parameter to
> > verify that CFS throttling was behaving as expected. See details below after
> > RCU stall logs.
>
> Thanks for all these tests. If I read them correctly, in all these
> tests, CFS throttling worked as expected. Right?
>
Yes, correct.
> Best regards,
> Aaron
>
Best regards,
Matteo Martelli
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-08 11:45 ` Valentin Schneider
@ 2025-08-12 8:48 ` Aaron Lu
2025-08-14 15:54 ` Valentin Schneider
2025-08-28 3:50 ` Aaron Lu
1 sibling, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-08-12 8:48 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Fri, Aug 08, 2025 at 01:45:11PM +0200, Valentin Schneider wrote:
> On 08/08/25 18:13, Aaron Lu wrote:
> > On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
> >> On 15/07/25 15:16, Aaron Lu wrote:
> >> > From: Valentin Schneider <vschneid@redhat.com>
> >> >
> >> > In current throttle model, when a cfs_rq is throttled, its entity will
> >> > be dequeued from cpu's rq, making tasks attached to it not able to run,
> >> > thus achieving the throttle target.
> >> >
> >> > This has a drawback though: assume a task is a reader of percpu_rwsem
> >> > and is waiting. When it gets woken, it can not run till its task group's
> >> > next period comes, which can be a relatively long time. Waiting writer
> >> > will have to wait longer due to this and it also makes further reader
> >> > build up and eventually trigger task hung.
> >> >
> >> > To improve this situation, change the throttle model to task based, i.e.
> >> > when a cfs_rq is throttled, record its throttled status but do not remove
> >> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> >> > they get picked, add a task work to them so that when they return
> >> > to user, they can be dequeued there. In this way, tasks throttled will
> >> > not hold any kernel resources. And on unthrottle, enqueue back those
> >> > tasks so they can continue to run.
> >> >
> >>
> >> Moving the actual throttle work to pick time is clever. In my previous
> >> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
> >> but this makes the code a lot more palatable. I'd like to see how this
> >> impacts performance though.
> >>
> >
> > Let me run some scheduler benchmark to see how it impacts performance.
> >
> > I'm thinking maybe running something like hackbench on server platforms,
> > first with quota not set and see if performance changes; then also test
> > with quota set and see how performance changes.
> >
> > Does this sound good to you? Or do you have any specific benchmark and
> > test methodology in mind?
> >
>
> Yeah hackbench is pretty good for stressing the EQ/DQ paths.
>
Tested hackbench/pipe and netperf/UDP_RR on Intel EMR(2 sockets/240
cpus) and AMD Genoa(2 sockets/384 cpus), the tldr is: there is no clear
performance change between base and this patchset(head). Below is
detailed test data:
(turbo/boost disabled, cpuidle disabled, cpufreq set to performance)
hackbench/pipe/loops=150000
(seconds, smaller is better)
On Intel EMR:
nr_group base head change
1 3.62±2.99% 3.61±10.42% +0.28%
8 8.06±1.58% 7.88±5.82% +2.23%
16 11.40±2.57% 11.25±3.72% +1.32%
For nr_group=16 case, configure a cgroup and set quota to half cpu and
then let hackbench run in this cgroup:
base head change
quota=50% 18.35±2.40% 18.78±1.97% -2.34%
On AMD Genoa:
nr_group base head change
1 17.05±1.92% 16.99±2.81% +0.35%
8 16.54±0.71% 16.73±1.18% -1.15%
16 27.04±0.39% 26.72±2.37% +1.18%
For nr_group=16 case, configure a cgroup and set quota to half cpu and
then let hackbench run in this cgroup:
base head change
quota=50% 43.79±1.10% 44.65±0.37% -1.96%
Netperf/UDP_RR/testlen=30s
(throughput, higher is better)
25% means nr_clients set to 1/4 nr_cpu, 50% means nr_clients is 1/2
nr_cpu, etc.
On Intel EMR:
nr_clients base head change
25% 83,567±0.06% 84,298±0.23% +0.87%
50% 61,336±1.49% 60,816±0.63% -0.85%
75% 40,592±0.97% 40,461±0.14% -0.32%
100% 31,277±2.11% 30,948±1.84% -1.05%
For nr_clients=100% case, configure a cgroup and set quota to half cpu
and then let netperf run in this cgroup:
nr_clients base head change
100% 25,532±0.56% 26,772±3.05% +4.86%
On AMD Genoa:
nr_clients base head change
25% 12,443±0.40% 12,525±0.06% +0.66%
50% 11,403±0.35% 11,472±0.50% +0.61%
75% 10,070±0.19% 10,071±0.95% 0.00%
100% 9,947±0.80% 9,881±0.58% -0.66%
For nr_clients=100% case, configure a cgroup and set quota to half cpu
and then let netperf run in this cgroup:
nr_clients base head change
100% 4,954±0.24% 4,952±0.14% 0.00%
> >> > + if (throttled_hierarchy(cfs_rq) &&
> >> > + !task_current_donor(rq_of(cfs_rq), p)) {
> >> > + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >> > + return true;
> >> > + }
> >> > +
> >> > + /* we can't take the fast path, do an actual enqueue*/
> >> > + p->throttled = false;
> >>
> >> So we clear p->throttled but not p->throttle_node? Won't that cause issues
> >> when @p's previous cfs_rq gets unthrottled?
> >>
> >
> > p->throttle_node is already removed from its previous cfs_rq at dequeue
> > time in dequeue_throttled_task().
> >
> > This is done because at enqueue time, we may not hold its previous
> > rq's lock and so can't touch its previous cfs_rq's limbo list, e.g.
> > when dealing with affinity changes.
> >
>
> Ah right, the DQ/EQ_throttled_task() functions are for when DQ/EQ is applied
> to an already-throttled task, and they do the right thing.
>
> Does this mean we want this as enqueue_throttled_task()'s prologue?
>
> /* @p should have gone through dequeue_throttled_task() first */
> WARN_ON_ONCE(!list_empty(&p->throttle_node));
>
Sure, will add it in next version.
> >> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >> > */
> >> > static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> > {
> >> > + if (unlikely(task_is_throttled(p))) {
> >> > + dequeue_throttled_task(p, flags);
> >> > + return true;
> >> > + }
> >> > +
> >>
> >> Handling a throttled task's move pattern at dequeue does simplify things,
> >> however that's quite a hot path. I think this wants at the very least a
> >>
> >> if (cfs_bandwidth_used())
> >>
> >> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
> >> hurt, but we can probably find some people who really care about that to
> >> run it for us ;)
> >>
> >
> > No problem.
> >
> > I'm thinking maybe I can add this cfs_bandwidth_used() check in
> > task_is_throttled(), so that other callsites of task_is_throttled()
> > can also benefit from paying less cost when cfs bandwidth is not
> > enabled.
> >
>
> Sounds good to me; just drop the unlikely and let the static key do its
> thing :)
Got it, thanks for the suggestion.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-12 8:48 ` Aaron Lu
@ 2025-08-14 15:54 ` Valentin Schneider
2025-08-15 9:30 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: Valentin Schneider @ 2025-08-14 15:54 UTC (permalink / raw)
To: Aaron Lu
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On 12/08/25 16:48, Aaron Lu wrote:
> On Fri, Aug 08, 2025 at 01:45:11PM +0200, Valentin Schneider wrote:
>> On 08/08/25 18:13, Aaron Lu wrote:
>> > Let me run some scheduler benchmark to see how it impacts performance.
>> >
>> > I'm thinking maybe running something like hackbench on server platforms,
>> > first with quota not set and see if performance changes; then also test
>> > with quota set and see how performance changes.
>> >
>> > Does this sound good to you? Or do you have any specific benchmark and
>> > test methodology in mind?
>> >
>>
>> Yeah hackbench is pretty good for stressing the EQ/DQ paths.
>>
>
> Tested hackbench/pipe and netperf/UDP_RR on Intel EMR(2 sockets/240
> cpus) and AMD Genoa(2 sockets/384 cpus), the tldr is: there is no clear
> performance change between base and this patchset(head). Below is
> detailed test data:
> (turbo/boost disabled, cpuidle disabled, cpufreq set to performance)
>
> hackbench/pipe/loops=150000
> (seconds, smaller is better)
>
> On Intel EMR:
>
> nr_group base head change
> 1 3.62±2.99% 3.61±10.42% +0.28%
> 8 8.06±1.58% 7.88±5.82% +2.23%
> 16 11.40±2.57% 11.25±3.72% +1.32%
>
> For nr_group=16 case, configure a cgroup and set quota to half cpu and
> then let hackbench run in this cgroup:
>
> base head change
> quota=50% 18.35±2.40% 18.78±1.97% -2.34%
>
> On AMD Genoa:
>
> nr_group base head change
> 1 17.05±1.92% 16.99±2.81% +0.35%
> 8 16.54±0.71% 16.73±1.18% -1.15%
> 16 27.04±0.39% 26.72±2.37% +1.18%
>
> For nr_group=16 case, configure a cgroup and set quota to half cpu and
> then let hackbench run in this cgroup:
>
> base head change
> quota=50% 43.79±1.10% 44.65±0.37% -1.96%
>
> Netperf/UDP_RR/testlen=30s
> (throughput, higher is better)
>
> 25% means nr_clients set to 1/4 nr_cpu, 50% means nr_clients is 1/2
> nr_cpu, etc.
>
> On Intel EMR:
>
> nr_clients base head change
> 25% 83,567±0.06% 84,298±0.23% +0.87%
> 50% 61,336±1.49% 60,816±0.63% -0.85%
> 75% 40,592±0.97% 40,461±0.14% -0.32%
> 100% 31,277±2.11% 30,948±1.84% -1.05%
>
> For nr_clients=100% case, configure a cgroup and set quota to half cpu
> and then let netperf run in this cgroup:
>
> nr_clients base head change
> 100% 25,532±0.56% 26,772±3.05% +4.86%
>
> On AMD Genoa:
>
> nr_clients base head change
> 25% 12,443±0.40% 12,525±0.06% +0.66%
> 50% 11,403±0.35% 11,472±0.50% +0.61%
> 75% 10,070±0.19% 10,071±0.95% 0.00%
> 100% 9,947±0.80% 9,881±0.58% -0.66%
>
> For nr_clients=100% case, configure a cgroup and set quota to half cpu
> and then let netperf run in this cgroup:
>
> nr_clients base head change
> 100% 4,954±0.24% 4,952±0.14% 0.00%
Thank you for running these, looks like mostly slightly bigger variance on
a few of these but that's about it.
I would also suggest running similar benchmarks but with deeper
hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
when tg_unthrottle_up() goes up a bigger tree.
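For context, the path I'm thinking of is roughly shaped like this (an
illustrative sketch assuming unthrottle re-enqueues every limbo task per
cfs_rq the way this series does, not the exact code):

static int tg_unthrottle_up(struct task_group *tg, void *data)
{
	struct rq *rq = data;
	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
	struct task_struct *p, *tmp;

	if (--cfs_rq->throttle_count)
		return 0;

	/* put every task parked on this cfs_rq's limbo list back on the rq */
	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task_fair(rq, p, ENQUEUE_WAKEUP);
	}
	return 0;
}

unthrottle_cfs_rq() invokes this for every cfs_rq in the throttled
subtree via walk_tg_tree_from(), so the cost now scales with both the
size of the subtree and the number of throttled tasks parked on each
level.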
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-14 15:54 ` Valentin Schneider
@ 2025-08-15 9:30 ` Aaron Lu
2025-08-22 11:07 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-08-15 9:30 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
> On 12/08/25 16:48, Aaron Lu wrote:
> > On Fri, Aug 08, 2025 at 01:45:11PM +0200, Valentin Schneider wrote:
> >> On 08/08/25 18:13, Aaron Lu wrote:
> >> > Let me run some scheduler benchmark to see how it impacts performance.
> >> >
> >> > I'm thinking maybe running something like hackbench on server platforms,
> >> > first with quota not set and see if performance changes; then also test
> >> > with quota set and see how performance changes.
> >> >
> >> > Does this sound good to you? Or do you have any specific benchmark and
> >> > test methodology in mind?
> >> >
> >>
> >> Yeah hackbench is pretty good for stressing the EQ/DQ paths.
> >>
> >
> > Tested hackbench/pipe and netperf/UDP_RR on Intel EMR(2 sockets/240
> > cpus) and AMD Genoa(2 sockets/384 cpus), the tldr is: there is no clear
> > performance change between base and this patchset(head). Below is
> > detailed test data:
> > (turbo/boost disabled, cpuidle disabled, cpufreq set to performance)
> >
> > hackbench/pipe/loops=150000
> > (seconds, smaller is better)
> >
> > On Intel EMR:
> >
> > nr_group base head change
> > 1 3.62±2.99% 3.61±10.42% +0.28%
> > 8 8.06±1.58% 7.88±5.82% +2.23%
> > 16 11.40±2.57% 11.25±3.72% +1.32%
> >
> > For nr_group=16 case, configure a cgroup and set quota to half cpu and
> > then let hackbench run in this cgroup:
> >
> > base head change
> > quota=50% 18.35±2.40% 18.78±1.97% -2.34%
> >
> > On AMD Genoa:
> >
> > nr_group base head change
> > 1 17.05±1.92% 16.99±2.81% +0.35%
> > 8 16.54±0.71% 16.73±1.18% -1.15%
> > 16 27.04±0.39% 26.72±2.37% +1.18%
> >
> > For nr_group=16 case, configure a cgroup and set quota to half cpu and
> > then let hackbench run in this cgroup:
> >
> > base head change
> > quota=50% 43.79±1.10% 44.65±0.37% -1.96%
> >
> > Netperf/UDP_RR/testlen=30s
> > (throughput, higher is better)
> >
> > 25% means nr_clients set to 1/4 nr_cpu, 50% means nr_clients is 1/2
> > nr_cpu, etc.
> >
> > On Intel EMR:
> >
> > nr_clients base head change
> > 25% 83,567±0.06% 84,298±0.23% +0.87%
> > 50% 61,336±1.49% 60,816±0.63% -0.85%
> > 75% 40,592±0.97% 40,461±0.14% -0.32%
> > 100% 31,277±2.11% 30,948±1.84% -1.05%
> >
> > For nr_clients=100% case, configure a cgroup and set quota to half cpu
> > and then let netperf run in this cgroup:
> >
> > nr_clients base head change
> > 100% 25,532±0.56% 26,772±3.05% +4.86%
> >
> > On AMD Genoa:
> >
> > nr_clients base head change
> > 25% 12,443±0.40% 12,525±0.06% +0.66%
> > 50% 11,403±0.35% 11,472±0.50% +0.61%
> > 75% 10,070±0.19% 10,071±0.95% 0.00%
> > 100% 9,947±0.80% 9,881±0.58% -0.66%
> >
> > For nr_clients=100% case, configure a cgroup and set quota to half cpu
> > and then let netperf run in this cgroup:
> >
> > nr_clients base head change
> > 100% 4,954±0.24% 4,952±0.14% 0.00%
>
> Thank you for running these, looks like mostly slightly bigger variance on
> a few of these but that's about it.
>
> I would also suggest running similar benchmarks but with deeper
> hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> when tg_unthrottle_up() goes up a bigger tree.
No problem.
I suppose I can reuse the previous shared test script:
https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
There I used:
nr_level1=2
nr_level2=100
nr_level3=10
But I can tweak these numbers for this performance evaluation. I can make
the hierarchy 5 levels deep, place tasks in the leaf level cgroups and
configure quota on the 1st level cgroups.
I'll get back to you once I finish collecting data; feel free to let
me know if you have other ideas for testing this :)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
` (2 preceding siblings ...)
2025-08-08 9:12 ` Valentin Schneider
@ 2025-08-17 8:50 ` Chen, Yu C
2025-08-18 2:50 ` Aaron Lu
3 siblings, 1 reply; 48+ messages in thread
From: Chen, Yu C @ 2025-08-17 8:50 UTC (permalink / raw)
To: Aaron Lu
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
On 7/15/2025 3:16 PM, Aaron Lu wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>
> In current throttle model, when a cfs_rq is throttled, its entity will
> be dequeued from cpu's rq, making tasks attached to it not able to run,
> thus achieving the throttle target.
>
> This has a drawback though: assume a task is a reader of percpu_rwsem
> and is waiting. When it gets woken, it can not run till its task group's
> next period comes, which can be a relatively long time. Waiting writer
> will have to wait longer due to this and it also makes further reader
> build up and eventually trigger task hung.
>
> To improve this situation, change the throttle model to task based, i.e.
> when a cfs_rq is throttled, record its throttled status but do not remove
> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> they get picked, add a task work to them so that when they return
> to user, they can be dequeued there. In this way, tasks throttled will
> not hold any kernel resources. And on unthrottle, enqueue back those
> tasks so they can continue to run.
>
> Throttled cfs_rq's PELT clock is handled differently now: previously the
> cfs_rq's PELT clock is stopped once it entered throttled state but since
> now tasks(in kernel mode) can continue to run, change the behaviour to
> stop PELT clock only when the throttled cfs_rq has no tasks left.
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
[snip]
> @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> {
> struct sched_entity *se;
> struct cfs_rq *cfs_rq;
> + struct task_struct *p;
> + bool throttled;
>
> again:
> cfs_rq = &rq->cfs;
> if (!cfs_rq->nr_queued)
> return NULL;
>
> + throttled = false;
> +
> do {
> /* Might not have done put_prev_entity() */
> if (cfs_rq->curr && cfs_rq->curr->on_rq)
> update_curr(cfs_rq);
>
> - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> - goto again;
> + throttled |= check_cfs_rq_runtime(cfs_rq);
>
> se = pick_next_entity(rq, cfs_rq);
> if (!se)
> @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> cfs_rq = group_cfs_rq(se);
> } while (cfs_rq);
>
> - return task_of(se);
> + p = task_of(se);
> + if (unlikely(throttled))
> + task_throttle_setup_work(p);
> + return p;
> }
>
Previously, I was wondering if the above change might impact
wakeup latency in some corner cases: If there are many tasks
enqueued on a throttled cfs_rq, the above pick-up mechanism
might return an invalid p repeatedly (where p is dequeued,
and a reschedule is triggered in throttle_cfs_rq_work() to
pick the next p; then the new p is found again on a throttled
cfs_rq). Before the above change, the entire cfs_rq's corresponding
sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se[cpu]
So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
cgroup. The results show that there is not much impact in terms of
wakeup latency (considering the standard deviation). Based on the data
and my understanding, for this series,
Tested-by: Chen Yu <yu.c.chen@intel.com>
Tested script parameters are borrowed from the previous attached ones:
#!/bin/bash
if [ $# -ne 1 ]; then
echo "please provide cgroup level"
exit
fi
N=$1
current_path="/sys/fs/cgroup"
for ((i=1; i<=N; i++)); do
new_dir="${current_path}/${i}"
mkdir -p "$new_dir"
if [ "$i" -ne "$N" ]; then
echo '+cpu +memory +pids' > ${new_dir}/cgroup.subtree_control
fi
current_path="$new_dir"
done
echo "current_path:${current_path}"
echo "1600000 100000" > ${current_path}/cpu.max
echo "34G" > ${current_path}/memory.max
echo $$ > ${current_path}/cgroup.procs
#./run-mmtests.sh --no-monitor --config config-schbench baseline
./run-mmtests.sh --no-monitor --config config-schbench sch_throt
pids=$(cat "${current_path}/cgroup.procs")
for pid in $pids; do
echo $pid > "/sys/fs/cgroup/cgroup.procs" 2>/dev/null
done
for ((i=N; i>=1; i--)); do
rmdir ${current_path}
current_path=$(dirname "$current_path")
done
Results:
schbench thread = 1
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
//the baseline's std% is 35%, the change should not be a problem
Wakeup Latencies 99.0th    15.00(5.29)            17.00(1.00)            -13.33%
Request Latencies 99.0th   3830.67(33.31)         3854.67(25.72)         -0.63%
RPS 50.0th                 1598.00(4.00)          1606.00(4.00)          +0.50%
Average RPS                1597.77(5.13)          1606.11(4.75)          +0.52%
schbench thread = 2
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    18.33(0.58)            18.67(0.58)            -1.85%
Request Latencies 99.0th   3868.00(49.96)         3854.67(44.06)         +0.34%
RPS 50.0th                 3185.33(4.62)          3204.00(8.00)          +0.59%
Average RPS                3186.49(2.70)          3204.21(11.25)         +0.56%
schbench thread = 4
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    19.33(1.15)            19.33(0.58)            0.00%
Request Latencies 99.0th   35690.67(517.31)       35946.67(517.31)       -0.72%
RPS 50.0th                 4418.67(18.48)         4434.67(9.24)          +0.36%
Average RPS                4414.38(16.94)         4436.02(8.77)          +0.49%
schbench thread = 8
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    22.67(0.58)            22.33(0.58)            +1.50%
Request Latencies 99.0th   73002.67(147.80)       72661.33(147.80)       +0.47%
RPS 50.0th                 4376.00(16.00)         4392.00(0.00)          +0.37%
Average RPS                4373.89(15.04)         4393.88(6.22)          +0.46%
schbench thread = 16
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    29.00(2.65)            29.00(3.61)            0.00%
Request Latencies 99.0th   88704.00(0.00)         88704.00(0.00)         0.00%
RPS 50.0th                 4274.67(24.44)         4290.67(9.24)          +0.37%
Average RPS                4277.27(24.80)         4287.97(9.80)          +0.25%
schbench thread = 32
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    100.00(22.61)          82.00(16.46)           +18.00%
Request Latencies 99.0th   100138.67(295.60)      100053.33(147.80)      +0.09%
RPS 50.0th                 3942.67(20.13)         3916.00(42.33)         -0.68%
Average RPS                3919.39(19.01)         3892.39(42.26)         -0.69%
schbench thread = 63
Metric                     Base (mean±std)        Compare (mean±std)     Change
-------------------------------------------------------------------------------------
Wakeup Latencies 99.0th    94848.00(0.00)         94336.00(0.00)         +0.54%
//the baseline's std% is 19%, the change should not be a problem
Request Latencies 99.0th   264618.67(51582.78)    298154.67(591.21)      -12.67%
RPS 50.0th                 2641.33(4.62)          2628.00(8.00)          -0.50%
Average RPS                2659.49(8.88)          2636.17(7.58)          -0.88%
thanks,
Chenyu
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-17 8:50 ` Chen, Yu C
@ 2025-08-18 2:50 ` Aaron Lu
2025-08-18 3:10 ` Chen, Yu C
2025-08-18 3:12 ` Aaron Lu
0 siblings, 2 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-18 2:50 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote:
> On 7/15/2025 3:16 PM, Aaron Lu wrote:
> > From: Valentin Schneider <vschneid@redhat.com>
> >
> > In current throttle model, when a cfs_rq is throttled, its entity will
> > be dequeued from cpu's rq, making tasks attached to it not able to run,
> > thus achiveing the throttle target.
> >
> > This has a drawback though: assume a task is a reader of percpu_rwsem
> > and is waiting. When it gets woken, it can not run till its task group's
> > next period comes, which can be a relatively long time. Waiting writer
> > will have to wait longer due to this and it also makes further reader
> > build up and eventually trigger task hung.
> >
> > To improve this situation, change the throttle model to task based, i.e.
> > when a cfs_rq is throttled, record its throttled status but do not remove
> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > they get picked, add a task work to them so that when they return
> > to user, they can be dequeued there. In this way, tasks throttled will
> > not hold any kernel resources. And on unthrottle, enqueue back those
> > tasks so they can continue to run.
> >
> > Throttled cfs_rq's PELT clock is handled differently now: previously the
> > cfs_rq's PELT clock is stopped once it entered throttled state but since
> > now tasks(in kernel mode) can continue to run, change the behaviour to
> > stop PELT clock only when the throttled cfs_rq has no tasks left.
> >
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > ---
>
> [snip]
>
>
> > @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > {
> > struct sched_entity *se;
> > struct cfs_rq *cfs_rq;
> > + struct task_struct *p;
> > + bool throttled;
> > again:
> > cfs_rq = &rq->cfs;
> > if (!cfs_rq->nr_queued)
> > return NULL;
> > + throttled = false;
> > +
> > do {
> > /* Might not have done put_prev_entity() */
> > if (cfs_rq->curr && cfs_rq->curr->on_rq)
> > update_curr(cfs_rq);
> > - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> > - goto again;
> > + throttled |= check_cfs_rq_runtime(cfs_rq);
> > se = pick_next_entity(rq, cfs_rq);
> > if (!se)
> > @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > cfs_rq = group_cfs_rq(se);
> > } while (cfs_rq);
> > - return task_of(se);
> > + p = task_of(se);
> > + if (unlikely(throttled))
> > + task_throttle_setup_work(p);
> > + return p;
> > }
>
> Previously, I was wondering if the above change might impact
> wakeup latency in some corner cases: If there are many tasks
> enqueued on a throttled cfs_rq, the above pick-up mechanism
> might return an invalid p repeatedly (where p is dequeued,
By invalid, do you mean task that is in a throttled hierarchy?
> and a reschedule is triggered in throttle_cfs_rq_work() to
> pick the next p; then the new p is found again on a throttled
> cfs_rq). Before the above change, the entire cfs_rq's corresponding
> sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se[cpu]
>
Yes this is true and it sounds inefficient, but these newly woken tasks
may hold some kernel resources like a reader lock so we really want them
to finish their kernel jobs and release that resource before being
throttled or it can block/impact other tasks and even cause the whole
system to hang.
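For reference, the ret2user work is roughly shaped like this (a
simplified sketch; the sched_throttle_work callback_head name and the
exact dequeue call/flags are assumptions here, and the real function in
patch 3 has more checks for races with class/group changes):

static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct cfs_rq *cfs_rq;
	struct rq_flags rf;
	struct rq *rq;

	rq = task_rq_lock(p, &rf);
	cfs_rq = cfs_rq_of(&p->se);
	if (cfs_rq->throttle_count) {
		/*
		 * @p is back at the kernel/user boundary and holds no
		 * kernel resources: dequeue it and park it on the limbo
		 * list until unthrottle.
		 */
		dequeue_task_fair(rq, p, DEQUEUE_SLEEP);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		p->throttled = true;
		resched_curr(rq);
	}
	task_rq_unlock(rq, p, &rf);
}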
> So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
> 384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
> cgroup. The results show that there is not much impact in terms of
> wakeup latency (considering the standard deviation). Based on the data
> and my understanding, for this series,
>
> Tested-by: Chen Yu <yu.c.chen@intel.com>
Good to know this and thanks a lot for the test!
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-18 2:50 ` Aaron Lu
@ 2025-08-18 3:10 ` Chen, Yu C
2025-08-18 3:12 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Chen, Yu C @ 2025-08-18 3:10 UTC (permalink / raw)
To: Aaron Lu
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
On 8/18/2025 10:50 AM, Aaron Lu wrote:
> On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote:
>> On 7/15/2025 3:16 PM, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> In current throttle model, when a cfs_rq is throttled, its entity will
>>> be dequeued from cpu's rq, making tasks attached to it not able to run,
>>> thus achieving the throttle target.
>>>
>>> This has a drawback though: assume a task is a reader of percpu_rwsem
>>> and is waiting. When it gets woken, it can not run till its task group's
>>> next period comes, which can be a relatively long time. Waiting writer
>>> will have to wait longer due to this and it also makes further reader
>>> build up and eventually trigger task hung.
>>>
>>> To improve this situation, change the throttle model to task based, i.e.
>>> when a cfs_rq is throttled, record its throttled status but do not remove
>>> it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
>>> they get picked, add a task work to them so that when they return
>>> to user, they can be dequeued there. In this way, tasks throttled will
>>> not hold any kernel resources. And on unthrottle, enqueue back those
>>> tasks so they can continue to run.
>>>
>>> Throttled cfs_rq's PELT clock is handled differently now: previously the
>>> cfs_rq's PELT clock is stopped once it entered throttled state but since
>>> now tasks(in kernel mode) can continue to run, change the behaviour to
>>> stop PELT clock only when the throttled cfs_rq has no tasks left.
>>>
>>> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>>> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>
>> [snip]
>>
>>
>>> @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>>> {
>>> struct sched_entity *se;
>>> struct cfs_rq *cfs_rq;
>>> + struct task_struct *p;
>>> + bool throttled;
>>> again:
>>> cfs_rq = &rq->cfs;
>>> if (!cfs_rq->nr_queued)
>>> return NULL;
>>> + throttled = false;
>>> +
>>> do {
>>> /* Might not have done put_prev_entity() */
>>> if (cfs_rq->curr && cfs_rq->curr->on_rq)
>>> update_curr(cfs_rq);
>>> - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
>>> - goto again;
>>> + throttled |= check_cfs_rq_runtime(cfs_rq);
>>> se = pick_next_entity(rq, cfs_rq);
>>> if (!se)
>>> @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
>>> cfs_rq = group_cfs_rq(se);
>>> } while (cfs_rq);
>>> - return task_of(se);
>>> + p = task_of(se);
>>> + if (unlikely(throttled))
>>> + task_throttle_setup_work(p);
>>> + return p;
>>> }
>>
>> Previously, I was wondering if the above change might impact
>> wakeup latency in some corner cases: If there are many tasks
>> enqueued on a throttled cfs_rq, the above pick-up mechanism
>> might return an invalid p repeatedly (where p is dequeued,
>
> By invalid, do you mean task that is in a throttled hierarchy?
>
Yes.
>> and a reschedule is triggered in throttle_cfs_rq_work() to
>> pick the next p; then the new p is found again on a throttled
>> cfs_rq). Before the above change, the entire cfs_rq's corresponding
>> sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se[cpu]
>>
>
> Yes this is true and it sounds inefficient, but these newly woken tasks
> may hold some kernel resources like a reader lock so we really want them
> to finish their kernel jobs and release that resource before being
> throttled or it can block/impact other tasks and even cause the whole
> system to hang.
>
I see. Always dequeuing each task during its ret2user phase would be safer.
thanks,
Chenyu
>> So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
>> 384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
>> cgroup. The results show that there is not much impact in terms of
>> wakeup latency (considering the standard deviation). Based on the data
>> and my understanding, for this series,
>>
>> Tested-by: Chen Yu <yu.c.chen@intel.com>
>
> Good to know this and thanks a lot for the test!
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-18 2:50 ` Aaron Lu
2025-08-18 3:10 ` Chen, Yu C
@ 2025-08-18 3:12 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-18 3:12 UTC (permalink / raw)
To: Chen, Yu C
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
On Mon, Aug 18, 2025 at 10:50:14AM +0800, Aaron Lu wrote:
> On Sun, Aug 17, 2025 at 04:50:50PM +0800, Chen, Yu C wrote:
> > On 7/15/2025 3:16 PM, Aaron Lu wrote:
> > > From: Valentin Schneider <vschneid@redhat.com>
> > >
> > > In current throttle model, when a cfs_rq is throttled, its entity will
> > > be dequeued from cpu's rq, making tasks attached to it not able to run,
> > > thus achieving the throttle target.
> > >
> > > This has a drawback though: assume a task is a reader of percpu_rwsem
> > > and is waiting. When it gets woken, it can not run till its task group's
> > > next period comes, which can be a relatively long time. Waiting writer
> > > will have to wait longer due to this and it also makes further reader
> > > build up and eventually trigger task hung.
> > >
> > > To improve this situation, change the throttle model to task based, i.e.
> > > when a cfs_rq is throttled, record its throttled status but do not remove
> > > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> > > they get picked, add a task work to them so that when they return
> > > to user, they can be dequeued there. In this way, tasks throttled will
> > > not hold any kernel resources. And on unthrottle, enqueue back those
> > > tasks so they can continue to run.
> > >
> > > Throttled cfs_rq's PELT clock is handled differently now: previously the
> > > cfs_rq's PELT clock is stopped once it entered throttled state but since
> > > now tasks(in kernel mode) can continue to run, change the behaviour to
> > > stop PELT clock only when the throttled cfs_rq has no tasks left.
> > >
> > > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> > > Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
> > > Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> > > Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
> > > ---
> >
> > [snip]
> >
> >
> > > @@ -8813,19 +8815,22 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > > {
> > > struct sched_entity *se;
> > > struct cfs_rq *cfs_rq;
> > > + struct task_struct *p;
> > > + bool throttled;
> > > again:
> > > cfs_rq = &rq->cfs;
> > > if (!cfs_rq->nr_queued)
> > > return NULL;
> > > + throttled = false;
> > > +
> > > do {
> > > /* Might not have done put_prev_entity() */
> > > if (cfs_rq->curr && cfs_rq->curr->on_rq)
> > > update_curr(cfs_rq);
> > > - if (unlikely(check_cfs_rq_runtime(cfs_rq)))
> > > - goto again;
> > > + throttled |= check_cfs_rq_runtime(cfs_rq);
> > > se = pick_next_entity(rq, cfs_rq);
> > > if (!se)
> > > @@ -8833,7 +8838,10 @@ static struct task_struct *pick_task_fair(struct rq *rq)
> > > cfs_rq = group_cfs_rq(se);
> > > } while (cfs_rq);
> > > - return task_of(se);
> > > + p = task_of(se);
> > > + if (unlikely(throttled))
> > > + task_throttle_setup_work(p);
> > > + return p;
> > > }
> >
> > Previously, I was wondering if the above change might impact
> > wakeup latency in some corner cases: If there are many tasks
> > enqueued on a throttled cfs_rq, the above pick-up mechanism
> > might return an invalid p repeatedly (where p is dequeued,
>
> By invalid, do you mean task that is in a throttled hierarchy?
>
> > and a reschedule is triggered in throttle_cfs_rq_work() to
> > pick the next p; then the new p is found again on a throttled
> > cfs_rq). Before the above change, the entire cfs_rq's corresponding
> > sched_entity was dequeued in throttle_cfs_rq(): se = cfs_rq->tg->se[cpu]
> >
>
> Yes this is true and it sounds inefficient, but these newly woken tasks
> may hold some kernel resources like a reader lock so we really want them
~~~~
Sorry, I meant reader semaphore.
> to finish their kernel jobs and release that resource before being
> throttled or it can block/impact other tasks and even cause the whole
> system to hang.
>
> > So I did some tests for this scenario on a Xeon with 6 NUMA nodes and
> > 384 CPUs. I created 10 levels of cgroups and ran schbench on the leaf
> > cgroup. The results show that there is not much impact in terms of
> > wakeup latency (considering the standard deviation). Based on the data
> > and my understanding, for this series,
> >
> > Tested-by: Chen Yu <yu.c.chen@intel.com>
>
> Good to know this and thanks a lot for the test!
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-07-15 7:16 ` [PATCH v3 4/5] sched/fair: Task based throttle time accounting Aaron Lu
@ 2025-08-18 14:57 ` Valentin Schneider
2025-08-19 9:34 ` Aaron Lu
2025-08-26 9:15 ` Aaron Lu
0 siblings, 2 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-08-18 14:57 UTC (permalink / raw)
To: Aaron Lu, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Tejun Heo
On 15/07/25 15:16, Aaron Lu wrote:
> @@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> check_enqueue_throttle(cfs_rq);
> list_add_leaf_cfs_rq(cfs_rq);
> #ifdef CONFIG_CFS_BANDWIDTH
> - if (throttled_hierarchy(cfs_rq)) {
> + if (cfs_rq->pelt_clock_throttled) {
> struct rq *rq = rq_of(cfs_rq);
>
> - if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> - cfs_rq->throttled_clock = rq_clock(rq);
> - if (!cfs_rq->throttled_clock_self)
> - cfs_rq->throttled_clock_self = rq_clock(rq);
> -
> - if (cfs_rq->pelt_clock_throttled) {
> - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> - cfs_rq->throttled_clock_pelt;
> - cfs_rq->pelt_clock_throttled = 0;
> - }
> + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> + cfs_rq->throttled_clock_pelt;
> + cfs_rq->pelt_clock_throttled = 0;
This is the only hunk of the patch that affects the PELT stuff; should this
have been included in patch 3 which does the rest of the PELT accounting changes?
> @@ -7073,6 +7073,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> if (cfs_rq_is_idle(cfs_rq))
> h_nr_idle = h_nr_queued;
>
> + if (throttled_hierarchy(cfs_rq) && task_throttled)
> + record_throttle_clock(cfs_rq);
> +
Apologies if this has been discussed before.
So the throttled time (as reported by cpu.stat.local) is now accounted as
the time from which the first task in the hierarchy gets effectively
throttled - IOW the first time a task in a throttled hierarchy reaches
resume_user_mode_work() - as opposed to as soon as the hierarchy runs out
of quota.
The gap between the two shouldn't be much, but that should at the very
least be highlighted in the changelog.
AFAICT this is a purely user-facing stat; Josh/Tejun, any opinions on this?
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-18 14:57 ` Valentin Schneider
@ 2025-08-19 9:34 ` Aaron Lu
2025-08-19 14:09 ` Valentin Schneider
2025-08-26 14:10 ` Michal Koutný
2025-08-26 9:15 ` Aaron Lu
1 sibling, 2 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-19 9:34 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Tejun Heo
On Mon, Aug 18, 2025 at 04:57:27PM +0200, Valentin Schneider wrote:
> On 15/07/25 15:16, Aaron Lu wrote:
> > @@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > check_enqueue_throttle(cfs_rq);
> > list_add_leaf_cfs_rq(cfs_rq);
> > #ifdef CONFIG_CFS_BANDWIDTH
> > - if (throttled_hierarchy(cfs_rq)) {
> > + if (cfs_rq->pelt_clock_throttled) {
> > struct rq *rq = rq_of(cfs_rq);
> >
> > - if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> > - cfs_rq->throttled_clock = rq_clock(rq);
> > - if (!cfs_rq->throttled_clock_self)
> > - cfs_rq->throttled_clock_self = rq_clock(rq);
> > -
> > - if (cfs_rq->pelt_clock_throttled) {
> > - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > - cfs_rq->throttled_clock_pelt;
> > - cfs_rq->pelt_clock_throttled = 0;
> > - }
> > + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > + cfs_rq->throttled_clock_pelt;
> > + cfs_rq->pelt_clock_throttled = 0;
>
> This is the only hunk of the patch that affects the PELT stuff; should this
> have been included in patch 3 which does the rest of the PELT accounting changes?
>
Yes, I think your suggestion makes sense, I'll move it to patch3 in next
version, thanks.
> > @@ -7073,6 +7073,9 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> > if (cfs_rq_is_idle(cfs_rq))
> > h_nr_idle = h_nr_queued;
> >
> > + if (throttled_hierarchy(cfs_rq) && task_throttled)
> > + record_throttle_clock(cfs_rq);
> > +
>
> Apologies if this has been discussed before.
>
> So the throttled time (as reported by cpu.stat.local) is now accounted as
> the time from which the first task in the hierarchy gets effectively
> throttled - IOW the first time a task in a throttled hierarchy reaches
> resume_user_mode_work() - as opposed to as soon as the hierarchy runs out
> of quota.
Right.
>
> The gap between the two shouldn't be much, but that should at the very
> least be highlighted in the changelog.
>
Got it, do the below added words make this clear?
With task based throttle model, the previous way to check cfs_rq's
nr_queued to decide if throttled time should be accounted doesn't work
as expected, e.g. when a cfs_rq which has a single task is throttled,
that task could later block in kernel mode instead of being dequeued on
limbo list and account this as throttled time is not accurate.
Rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.
Note that there will be a time gap between when a cfs_rq is throttled
and when a task in its hierarchy is actually throttled. This accounting
mechanism only started accounting in the latter case.
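For illustration, a minimal sketch of what record_throttle_clock() could look
like under this scheme, reconstructed from the clock-setting lines removed
from enqueue_entity() in the hunk above; treat it as an approximation, not
necessarily the series' verbatim code:

/* Sketch: start the throttled-time clocks when the first task in the
 * hierarchy actually gets throttled (dequeued to the limbo list). */
static void record_throttle_clock(struct cfs_rq *cfs_rq)
{
	struct rq *rq = rq_of(cfs_rq);

	if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
		cfs_rq->throttled_clock = rq_clock(rq);

	if (!cfs_rq->throttled_clock_self)
		cfs_rq->throttled_clock_self = rq_clock(rq);
}

Accounting then stops at unthrottle as before, using these clocks to compute
the throttled time reported in cpu.stat.local.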
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-19 9:34 ` Aaron Lu
@ 2025-08-19 14:09 ` Valentin Schneider
2025-08-26 14:10 ` Michal Koutný
1 sibling, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-08-19 14:09 UTC (permalink / raw)
To: Aaron Lu
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Tejun Heo
On 19/08/25 17:34, Aaron Lu wrote:
> On Mon, Aug 18, 2025 at 04:57:27PM +0200, Valentin Schneider wrote:
>> On 15/07/25 15:16, Aaron Lu wrote:
>> Apologies if this has been discussed before.
>>
>> So the throttled time (as reported by cpu.stat.local) is now accounted as
>> the time from which the first task in the hierarchy gets effectively
>> throttled - IOW the first time a task in a throttled hierarchy reaches
>> resume_user_mode_work() - as opposed to as soon as the hierarchy runs out
>> of quota.
>
> Right.
>
>>
>> The gap between the two shouldn't be much, but that should at the very
>> least be highlighted in the changelog.
>>
>
> Got it, does the below added words make this clear?
>
Yes, thank you. Small corrections below.
> With task based throttle model, the previous way to check cfs_rq's
> nr_queued to decide if throttled time should be accounted doesn't work
> as expected, e.g. when a cfs_rq which has a single task is throttled,
> that task could later block in kernel mode instead of being dequeued on
> limbo list and account this as throttled time is not accurate.
^^^^^^
accounting
>
> Rework throttle time accounting for a cfs_rq as follows:
> - start accounting when the first task gets throttled in its hierarchy;
> - stop accounting on unthrottle.
>
> Note that there will be a time gap between when a cfs_rq is throttled
> and when a task in its hierarchy is actually throttled. This accounting
> mechanism only started accounting in the latter case.
^^^^^^
starts
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-15 9:30 ` Aaron Lu
@ 2025-08-22 11:07 ` Aaron Lu
2025-09-03 7:14 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-08-22 11:07 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
... ...
> > I would also suggest running similar benchmarks but with deeper
> > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > when tg_unthrottle_up() goes up a bigger tree.
>
> No problem.
>
> I suppose I can reuse the previous shared test script:
> https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
>
> There I used:
> nr_level1=2
> nr_level2=100
> nr_level3=10
>
> But I can tweak these numbers for this performance evaluation. I can make
> the leaf level to be 5 level deep and place tasks in leaf level cgroups
> and configure quota on 1st level cgroups.
Tested on Intel EMR(2 sockets, 120cores, 240cpus) and AMD Genoa(2
sockets, 192cores, 384cpus), with turbo/boost disabled, cpufreq set to
performance and cpuidle states all disabled.
cgroup hierarchy:
nr_level1=2
nr_level2=2
nr_level3=2
nr_level4=5
nr_level5=5
i.e. two cgroups in the root level, with each level1 cgroup having 2
child cgroups, and each level2 cgroup having 2 child cgroups, etc. This
creates a 5 level deep, 200 leaf cgroups setup. Tasks are placed in leaf
cgroups. Quota are set on the two level1 cgroups.
The TLDR is, when there is a very large number of tasks(like 8000 tasks),
task based throttle saw 10-20% performance drop on AMD Genoa; otherwise,
no obvious performance change is observed. Detailed test results below.
Netperf: measured in throughput, more is better
- quota set to 50 cpu for each level1 cgroup;
- each leaf cgroup run a pair of netserver and netperf with following
cmdline:
netserver -p $port_for_this_cgroup
netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
i.e. each cgroup has 2 tasks, total task number is 2 * 200 = 400
tasks.
On Intel EMR:
base head diff
throughput 33305±8.40% 33995±7.84% noise
On AMD Genoa:
base head diff
throughput 5013±1.16% 4967±1.82% noise
Hackbench, measured in seconds, less is better:
- quota set to 50cpu for each level1 cgroup;
- each cgroup runs with the following cmdline:
hackbench -p -g 1 -l $see_below
i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks, total
task number is 40 * 200 = 8000 tasks.
On Intel EMR(loops set to 100000):
base head diff
Time 85.45±3.98% 86.41±3.98% noise
On AMD Genoa(loops set to 20000):
base head diff
Time 104±4.33% 116±7.71% -11.54%
So for this test case, task based throttle suffered ~10% performance
drop. I also tested on another AMD Genoa(same cpu spec) to make sure
it's not a machine problem and performance dropped there too:
On 2nd AMD Genoa(loops set to 50000)
base head diff
Time 81±3.13% 101±7.05% -24.69%
According to perf, __schedule() in head takes 7.29% cycles while in base
it takes 4.61% cycles. I suppose with task based throttle, __schedule()
is more frequent since tasks in a throttled cfs_rq have to be dequeued
one by one while in current behaviour, the cfs_rq can be dequeued off rq
in one go. This is most obvious when there are multiple tasks in a single
cfs_rq; if there is only 1 task per cfs_rq, things should be roughly the
same for the two throttling models.
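To make the difference concrete, here is a heavily simplified sketch of the
per-task throttle path being described; rq locking, the task_work setup and
the exact dequeue flags are elided or assumed, so this is an illustration
rather than the series' literal code:

/* Sketch: on exit to user, each task in a throttled hierarchy dequeues
 * itself and parks on its cfs_rq's limbo list, so there is one dequeue
 * (and one __schedule()) per task instead of one per cfs_rq. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = current;
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	if (!throttled_hierarchy(cfs_rq))
		return;	/* quota was replenished before we got here */

	p->throttled = true;
	dequeue_task(task_rq(p), p, DEQUEUE_SLEEP);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
	resched_curr(task_rq(p));
}

With the cfs_rq based model, a single dequeue of the group entity removes the
whole cfs_rq from the rq, which is why the per-task model shows up as extra
__schedule() cycles when many tasks share one cfs_rq.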
With this said, I reduced the task number and retested on this 2nd AMD
Genoa:
- quota set to 50 cpu for each level1 cgroup;
- using only 1 fd pair, i.e. 2 task for each cgroup:
hackbench -p -g 1 -f 1 -l 50000000
i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
task number is 2 * 200 = 400 tasks.
base head diff
Time 127.77±2.60% 127.49±2.63% noise
In this setup, performance is about the same.
Now I'm wondering why on Intel EMR, running that extreme setup(8000
tasks), performance of task based throttle didn't see noticeable drop...
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-18 14:57 ` Valentin Schneider
2025-08-19 9:34 ` Aaron Lu
@ 2025-08-26 9:15 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-26 9:15 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Tejun Heo
On Mon, Aug 18, 2025 at 04:57:27PM +0200, Valentin Schneider wrote:
> On 15/07/25 15:16, Aaron Lu wrote:
> > @@ -5287,19 +5287,12 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > check_enqueue_throttle(cfs_rq);
> > list_add_leaf_cfs_rq(cfs_rq);
> > #ifdef CONFIG_CFS_BANDWIDTH
> > - if (throttled_hierarchy(cfs_rq)) {
> > + if (cfs_rq->pelt_clock_throttled) {
> > struct rq *rq = rq_of(cfs_rq);
> >
> > - if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
> > - cfs_rq->throttled_clock = rq_clock(rq);
> > - if (!cfs_rq->throttled_clock_self)
> > - cfs_rq->throttled_clock_self = rq_clock(rq);
> > -
> > - if (cfs_rq->pelt_clock_throttled) {
> > - cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > - cfs_rq->throttled_clock_pelt;
> > - cfs_rq->pelt_clock_throttled = 0;
> > - }
> > + cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > + cfs_rq->throttled_clock_pelt;
> > + cfs_rq->pelt_clock_throttled = 0;
>
> This is the only hunk of the patch that affects the PELT stuff; should this
> have been included in patch 3 which does the rest of the PELT accounting changes?
>
While working on a rebase and staring at this further: the hunk that
deals with the PELT stuff is actually introduced in patch 3 and there is
no real change here, i.e. after the throttled_clock related lines are
removed, the PELT code simply moves from an inner if to an outer if,
with no functional change.
I hope I've made it clear this time; last time my brain stopped
working...
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-19 9:34 ` Aaron Lu
2025-08-19 14:09 ` Valentin Schneider
@ 2025-08-26 14:10 ` Michal Koutný
2025-08-27 15:16 ` Valentin Schneider
2025-08-28 6:06 ` Aaron Lu
1 sibling, 2 replies; 48+ messages in thread
From: Michal Koutný @ 2025-08-26 14:10 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Tejun Heo
Hello.
On Tue, Aug 19, 2025 at 05:34:27PM +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> Got it, does the below added words make this clear?
>
> With task based throttle model, the previous way to check cfs_rq's
> nr_queued to decide if throttled time should be accounted doesn't work
> as expected, e.g. when a cfs_rq which has a single task is throttled,
> that task could later block in kernel mode instead of being dequeued on
> limbo list and account this as throttled time is not accurate.
>
> Rework throttle time accounting for a cfs_rq as follows:
> - start accounting when the first task gets throttled in its hierarchy;
> - stop accounting on unthrottle.
>
> Note that there will be a time gap between when a cfs_rq is throttled
> and when a task in its hierarchy is actually throttled. This accounting
> mechanism only started accounting in the latter case.
Do I understand it correctly that this rework doesn't change the
cumulative amount of throttled_time in cpu.stat.local but the value gets
updated only later?
I'd say such little shifts are OK [1]. What should be avoided is
changing the semantics so that throttled_time time would scale with the
number of tasks inside the cgroup (assuming a single cfs_rq, i.e. number
of tasks on the cfs_rq).
0.02€,
Michal
[1] Maybe not even shifts -- in that case of a cfs_rq with a task, it
can manage to run in kernel almost for the whole period, so it gets
dequeued on return to userspace only to be re-enqueued when its cfs_rq
is unthrottled. It apparently escaped throttling, so the reported
throttled_time would be rightfully lower.
* Re: [PATCH v3 0/5] Defer throttle when task exits to user
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
` (7 preceding siblings ...)
2025-08-04 8:51 ` K Prateek Nayak
@ 2025-08-27 14:58 ` Valentin Schneider
8 siblings, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-08-27 14:58 UTC (permalink / raw)
To: Aaron Lu, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang
Cc: linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On 15/07/25 15:16, Aaron Lu wrote:
> There are consequences because of this new throttle model, e.g. for a
> cfs_rq that has 3 tasks attached, when 2 tasks are throttled on their
> return2user path, one task still running in kernel mode, this cfs_rq is
> in a partial throttled state:
> - Should its pelt clock be frozen?
> - Should this state be accounted into throttled_time?
>
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
OK, I finally got to testing the PELT side of things :-)
I shoved a bunch of periodic tasks in a CPU cgroup with quite low limits
(1ms runtime, 10ms period); I looked at the _avg values using the
trace_pelt* tracepoints.
Overall there isn't much change to the averages themselves. There are more
updates since the tasks are genuinely dequeued/enqueued during a throttle
cycle, but that's expected.
I'll wait for your next version, but you can have:
Tested-by: Valentin Schneider <vschneid@redhat.com>
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-26 14:10 ` Michal Koutný
@ 2025-08-27 15:16 ` Valentin Schneider
2025-08-28 6:06 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Valentin Schneider @ 2025-08-27 15:16 UTC (permalink / raw)
To: Michal Koutný, Aaron Lu
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu, Tejun Heo
On 26/08/25 16:10, Michal Koutný wrote:
> Hello.
>
> On Tue, Aug 19, 2025 at 05:34:27PM +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
>> Got it, does the below added words make this clear?
>>
>> With task based throttle model, the previous way to check cfs_rq's
>> nr_queued to decide if throttled time should be accounted doesn't work
>> as expected, e.g. when a cfs_rq which has a single task is throttled,
>> that task could later block in kernel mode instead of being dequeued on
>> limbo list and account this as throttled time is not accurate.
>>
>> Rework throttle time accounting for a cfs_rq as follows:
>> - start accounting when the first task gets throttled in its hierarchy;
>> - stop accounting on unthrottle.
>>
>> Note that there will be a time gap between when a cfs_rq is throttled
>> and when a task in its hierarchy is actually throttled. This accounting
>> mechanism only started accounting in the latter case.
>
> Do I understand it correctly that this rework doesn't change the
> cumulative amount of throttled_time in cpu.stat.local but the value gets
> updated only later?
>
No. Currently, when a cfs_rq runs out of quota, all of its tasks
instantly get throttled; synchronously with that, we record the time at
which it got throttled and use that to report how long it was throttled
(cpu.stat.local).
What this is doing is separating running out of quota and actually
throttling the tasks. When a cfs_rq runs out of quota, we "mark" its tasks
to throttle themselves whenever they next exit the kernel. We record the
throttled time (cpu.stat.local) as the time between the first
to-be-throttled task exiting the kernel and the unthrottle/quota
replenishment.
IOW, this is inducing a (short) delay between a cfs_rq running out of quota
and the point at which we start accounting its cumulative throttled time.
Hopefully that was somewhat clear.
> I'd say such little shifts are OK [1]. What should be avoided is
> changing the semantics so that throttled_time would scale with the
> number of tasks inside the cgroup (assuming a single cfs_rq, i.e. number
> of tasks on the cfs_rq).
>
Yeah that's fine we don't do that.
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-08 11:45 ` Valentin Schneider
2025-08-12 8:48 ` Aaron Lu
@ 2025-08-28 3:50 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-28 3:50 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu
On Fri, Aug 08, 2025 at 01:45:11PM +0200, Valentin Schneider wrote:
> On 08/08/25 18:13, Aaron Lu wrote:
> > On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
... ...
> >> > + if (throttled_hierarchy(cfs_rq) &&
> >> > + !task_current_donor(rq_of(cfs_rq), p)) {
> >> > + list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >> > + return true;
> >> > + }
> >> > +
> >> > + /* we can't take the fast path, do an actual enqueue*/
> >> > + p->throttled = false;
> >>
> >> So we clear p->throttled but not p->throttle_node? Won't that cause issues
> >> when @p's previous cfs_rq gets unthrottled?
> >>
> >
> > p->throttle_node is already removed from its previous cfs_rq at dequeue
> > time in dequeue_throttled_task().
> >
> > This is done so because in enqueue time, we may not hold its previous
> > rq's lock so can't touch its previous cfs_rq's limbo list, like when
> > dealing with affinity changes.
> >
>
> Ah right, the DQ/EQ_throttled_task() functions are when DQ/EQ is applied to an
> already-throttled task and it does the right thing.
>
> Does this mean we want this as enqueue_throttled_task()'s prologue?
>
> /* @p should have gone through dequeue_throttled_task() first */
> WARN_ON_ONCE(!list_empty(&p->throttle_node));
>
While adding this change to the new version, I remembered that
enqueue_throttled_task() also gets called for tasks that are about to be
unthrottled on the unthrottle path, i.e.
unthrottle_cfs_rq() -> tg_unthrottle_up() -> enqueue_task_fair()
because the task's throttled flag is not cleared yet (but throttle_node is
removed from the limbo list, so the above warn still works as expected).
I didn't clear p->throttled in tg_unthrottle_up() before calling
enqueue_task_fair() because enqueue_throttled_task() will take care of
that, but now that I look at it, I think it is better to clear p->throttled
before calling enqueue_task_fair(): this saves some cycles by skipping
enqueue_throttled_task() for these unthrottled tasks, and
enqueue_throttled_task() then only has to deal with migrated throttled tasks.
This feels cleaner and more efficient. I remember Prateek also suggested
this before, but I couldn't find his email now, so I'm not sure if I
remembered it correctly.
Anyway, just a note that I'm going to make the below change in the next
version; let me know if this doesn't look right, thanks.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 785a15caffbcc..df8dc389af8e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5904,6 +5904,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
/* Re-enqueue the tasks that have been throttled at this level. */
list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
list_del_init(&p->throttle_node);
+ p->throttled = false;
enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
}
* Re: [PATCH v3 4/5] sched/fair: Task based throttle time accounting
2025-08-26 14:10 ` Michal Koutný
2025-08-27 15:16 ` Valentin Schneider
@ 2025-08-28 6:06 ` Aaron Lu
1 sibling, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-08-28 6:06 UTC (permalink / raw)
To: Michal Koutný
Cc: Valentin Schneider, Ben Segall, K Prateek Nayak, Peter Zijlstra,
Chengming Zhou, Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang,
linux-kernel, Juri Lelli, Dietmar Eggemann, Steven Rostedt,
Mel Gorman, Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Tejun Heo
Hi Michal,
Thanks for taking a look.
On Tue, Aug 26, 2025 at 04:10:37PM +0200, Michal Koutný wrote:
> Hello.
>
> On Tue, Aug 19, 2025 at 05:34:27PM +0800, Aaron Lu <ziqianlu@bytedance.com> wrote:
> > Got it, does the below added words make this clear?
> >
> > With task based throttle model, the previous way to check cfs_rq's
> > nr_queued to decide if throttled time should be accounted doesn't work
> > as expected, e.g. when a cfs_rq which has a single task is throttled,
> > that task could later block in kernel mode instead of being dequeued on
> > limbo list and account this as throttled time is not accurate.
> >
> > Rework throttle time accounting for a cfs_rq as follows:
> > - start accounting when the first task gets throttled in its hierarchy;
> > - stop accounting on unthrottle.
> >
> > Note that there will be a time gap between when a cfs_rq is throttled
> > and when a task in its hierarchy is actually throttled. This accounting
> > mechanism only started accounting in the latter case.
>
> Do I understand it correctly that this rework doesn't change the
> cumulative amount of throttled_time in cpu.stat.local but the value gets
> updated only later?
>
> I'd say such little shifts are OK [1]. What should be avoided is
> changing the semantics so that throttled_time would scale with the
> number of tasks inside the cgroup (assuming a single cfs_rq, i.e. number
> of tasks on the cfs_rq).
As Valentin explained, throttled_time does not scale with the number of
tasks inside the cgroup.
> [1] Maybe not even shifts -- in that case of a cfs_rq with a task, it
> can manage to run in kernel almost for the whole period, so it gets
> dequeued on return to userspace only to be re-enqueued when its cfs_rq
> is unthrottled. It apparently escaped throttling, so the reported
> throttled_time would be rightfully lower.
Right, in this case, the throttle_time would be very small.
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-08-22 11:07 ` Aaron Lu
@ 2025-09-03 7:14 ` Aaron Lu
2025-09-03 9:11 ` K Prateek Nayak
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-09-03 7:14 UTC (permalink / raw)
To: Valentin Schneider
Cc: Ben Segall, K Prateek Nayak, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> On Fri, Aug 15, 2025 at 05:30:08PM +0800, Aaron Lu wrote:
> > On Thu, Aug 14, 2025 at 05:54:34PM +0200, Valentin Schneider wrote:
> ... ...
> > > I would also suggest running similar benchmarks but with deeper
> > > hierarchies, to get an idea of how much worse unthrottle_cfs_rq() can get
> > > when tg_unthrottle_up() goes up a bigger tree.
> >
> > No problem.
> >
> > I suppose I can reuse the previous shared test script:
> > https://lore.kernel.org/lkml/CANCG0GdOwS7WO0k5Fb+hMd8R-4J_exPTt2aS3-0fAMUC5pVD8g@mail.gmail.com/
> >
> > There I used:
> > nr_level1=2
> > nr_level2=100
> > nr_level3=10
> >
> > But I can tweak these numbers for this performance evaluation. I can make
> > the leaf level to be 5 level deep and place tasks in leaf level cgroups
> > and configure quota on 1st level cgroups.
>
> Tested on Intel EMR(2 sockets, 120cores, 240cpus) and AMD Genoa(2
> sockets, 192cores, 384cpus), with turbo/boost disabled, cpufreq set to
> performance and cpuidle states all disabled.
>
> cgroup hierarchy:
> nr_level1=2
> nr_level2=2
> nr_level3=2
> nr_level4=5
> nr_level5=5
> i.e. two cgroups in the root level, with each level1 cgroup having 2
> child cgroups, and each level2 cgroup having 2 child cgroups, etc. This
> creates a 5 level deep, 200 leaf cgroups setup. Tasks are placed in leaf
> cgroups. Quota are set on the two level1 cgroups.
>
> The TLDR is, when there is a very large number of tasks(like 8000 tasks),
> task based throttle saw 10-20% performance drop on AMD Genoa; otherwise,
> no obvious performance change is observed. Detailed test results below.
>
> Netperf: measured in throughput, more is better
> - quota set to 50 cpu for each level1 cgroup;
> - each leaf cgroup run a pair of netserver and netperf with following
> cmdline:
> netserver -p $port_for_this_cgroup
> netperf -p $port_for_this_cgroup -H 127.0.0.1 -t UDP_RR -c -C -l 30
> i.e. each cgroup has 2 tasks, total task number is 2 * 200 = 400
> tasks.
>
> On Intel EMR:
> base head diff
> throughput 33305±8.40% 33995±7.84% noise
>
> On AMD Genoa:
> base head diff
> throughput 5013±1.16% 4967±1.82% noise
>
>
> Hackbench, measured in seconds, less is better:
> - quota set to 50cpu for each level1 cgroup;
> - each cgroup runs with the following cmdline:
> hackbench -p -g 1 -l $see_below
> i.e. each leaf cgroup has 20 sender tasks and 20 receiver tasks, total
> task number is 40 * 200 = 8000 tasks.
>
> On Intel EMR(loops set to 100000):
>
> base head diff
> Time 85.45±3.98% 86.41±3.98% noise
>
> On AMD Genoa(loops set to 20000):
>
> base head diff
> Time 104±4.33% 116±7.71% -11.54%
>
> So for this test case, task based throttle suffered ~10% performance
> drop. I also tested on another AMD Genoa(same cpu spec) to make sure
> it's not a machine problem and performance dropped there too:
>
> On 2nd AMD Genoa(loops set to 50000)
>
> base head diff
> Time 81±3.13% 101±7.05% -24.69%
>
> According to perf, __schedule() in head takes 7.29% cycles while in base
> it takes 4.61% cycles. I suppose with task based throttle, __schedule()
> is more frequent since tasks in a throttled cfs_rq have to be dequeued
> one by one while in current behaviour, the cfs_rq can be dequeued off rq
> in one go. This is most obvious when there are multiple tasks in a single
> cfs_rq; if there is only 1 task per cfs_rq, things should be roughly the
> same for the two throttling models.
>
> With this said, I reduced the task number and retested on this 2nd AMD
> Genoa:
> - quota set to 50 cpu for each level1 cgroup;
> - using only 1 fd pair, i.e. 2 task for each cgroup:
> hackbench -p -g 1 -f 1 -l 50000000
> i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
> task number is 2 * 200 = 400 tasks.
>
> base head diff
> Time 127.77±2.60% 127.49±2.63% noise
>
> In this setup, performance is about the same.
>
> Now I'm wondering why on Intel EMR, running that extreme setup(8000
> tasks), performance of task based throttle didn't see noticeable drop...
Looks like hackbench doesn't like task migration on this AMD system
(domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
If I revert patch5, running this 40 * 200 = 8000 hackbench workload
again, performance is roughly the same now(head~1 is slightly worse but
given the 4+% stddev in base, it can be considered in noise range):
base head~1(patch1-4) diff head(patch1-5)
Time 82.55±4.82% 84.45±2.70% -2.3% 99.69±6.71%
According to /proc/schedstat, the lb_gained for domain2 is:
NOT_IDLE IDLE NEWLY_IDLE
base 0 8052 81791
head~1 0 7197 175096
head 1 14818 793065
Other domains have similar number: base has smallest migration number
while head has the most and head~1 reduce the number a lot. I suppose
this is expected, because we removed the throttled_lb_pair() restriction
in patch5 and that can cause runnable tasks in throttled hierarchy to be
balanced to other cpus while in base, this can not happen.
I think patch5 still makes sense and is correct, it's just this specific
workload doesn't like task migrations. Intel EMR doesn't suffer from
this, I suppose that's because EMR has a much larger LLC while AMD Genoa
has a relatively small LLC and task migrations across LLC boundary hurts
hackbench's performance.
I also tried to apply below hack to prove this "task migration across
LLC boundary hurts hackbench" theory on both base and head:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c2..34c5f6b75e53d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9297,6 +9297,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (!(env->sd->flags & SD_SHARE_LLC))
+ return 0;
+
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
return 0;
With this diff applied, the result is:
base' head' diff
Time 74.78±8.2% 78.87±15.4% -5.47%
base': base + above diff
head': head + above diff
So both perform better now, but with much larger variance; I guess
that's because there is no load balance on domain2 and above now. head' is still
worse than base, but not as much as before.
To conclude this: hackbench doesn't like task migration, especially when
task is migrated across LLC boundary. patch5 removed the restriction of
no balancing throttled tasks, this caused more balance to happen and
hackbench doesn't like this. But balancing has its own merit and could
still benefit other workloads so I think patch5 should stay, especially
considering that when throttled tasks are eventually dequeued, they will
not stay on rq's cfs_tasks list so no need to take special care for them
when doing load balance.
On a side note: should we increase the cost of balancing tasks out of LLC
boundary? I tried to enlarge sysctl_sched_migration_cost 100 times for
domains without SD_SHARE_LLC in task_hot() but that didn't help.
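For reference, this is roughly the shape of the experiment described above;
an illustration only, since the real task_hot() has more early returns which
are elided here:

/* Sketch: make tasks look cache-hot for ~100x longer when the balancing
 * domain does not share an LLC, to discourage cross-LLC migrations. */
static int task_hot(struct task_struct *p, struct lb_env *env)
{
	u64 cost = sysctl_sched_migration_cost;
	s64 delta;

	/* ... existing checks (running, nr_running, policy, etc.) elided ... */

	if (!(env->sd->flags & SD_SHARE_LLC))
		cost *= 100;

	delta = rq_clock_task(env->src_rq) - p->se.exec_start;

	return delta < (s64)cost;
}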
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-09-03 7:14 ` Aaron Lu
@ 2025-09-03 9:11 ` K Prateek Nayak
2025-09-03 10:11 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: K Prateek Nayak @ 2025-09-03 9:11 UTC (permalink / raw)
To: Aaron Lu, Valentin Schneider
Cc: Ben Segall, Peter Zijlstra, Chengming Zhou, Josh Don, Ingo Molnar,
Vincent Guittot, Xi Wang, linux-kernel, Juri Lelli,
Dietmar Eggemann, Steven Rostedt, Mel Gorman, Chuyi Zhou,
Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
Hello Aaron,
On 9/3/2025 12:44 PM, Aaron Lu wrote:
> On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
>> With this said, I reduced the task number and retested on this 2nd AMD
>> Genoa:
>> - quota set to 50 cpu for each level1 cgroup;
What exactly is the quota and period when you say 50cpu?
>> - using only 1 fd pair, i.e. 2 task for each cgroup:
>> hackbench -p -g 1 -f 1 -l 50000000
>> i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
>> task number is 2 * 200 = 400 tasks.
>>
>> base head diff
>> Time 127.77±2.60% 127.49±2.63% noise
>>
>> In this setup, performance is about the same.
>>
>> Now I'm wondering why on Intel EMR, running that extreme setup(8000
>> tasks), performance of task based throttle didn't see noticeable drop...
>
> Looks like hackbench doesn't like task migration on this AMD system
> (domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
>
> If I revert patch5, running this 40 * 200 = 8000 hackbench workload
> again, performance is roughly the same now(head~1 is slightly worse but
> given the 4+% stddev in base, it can be considered in noise range):
>
> base head~1(patch1-4) diff head(patch1-5)
> Time 82.55±4.82% 84.45±2.70% -2.3% 99.69±6.71%
>
> According to /proc/schedstat, the lb_gained for domain2 is:
>
> NOT_IDLE IDLE NEWLY_IDLE
> base 0 8052 81791
> head~1 0 7197 175096
> head 1 14818 793065
Since these are mostly idle and newidle balance, I wonder if we can run
into a scenario where,
1. All the tasks are throttled.
2. CPU turning idle does a newidle balance.
3. CPU pulls a tasks from throttled hierarchy and selects it.
4. The task exits to user space and is dequeued.
5. Goto 1.
and when the CPU is unthrottled, it has a large number of tasks on it
that'll again require a load balance to even stuff out.
>
> Other domains have similar number: base has smallest migration number
> while head has the most and head~1 reduce the number a lot. I suppose
> this is expected, because we removed the throttled_lb_pair() restriction
> in patch5 and that can cause runnable tasks in throttled hierarchy to be
> balanced to other cpus while in base, this can not happen.
>
> I think patch5 still makes sense and is correct, it's just this specific
> workload doesn't like task migrations. Intel EMR doesn't suffer from
> this, I suppose that's because EMR has a much larger LLC while AMD Genoa
> has a relatively small LLC and task migrations across LLC boundary hurts
> hackbench's performance.
I think we can leave the throttled_lb_pair() condition as is and revisit
it later if this is visible in real world workloads. I cannot think of
any easy way to avoid the case for potential pileup without accounting
for the throttled tasks in limbo except for something like below at
head~1:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bdc9bfa0b9ef..3dc807af21ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9385,7 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
- * 2) throttled_lb_pair, or
+ * 2) throttled_lb_pair unless we migrate load, or
* 3) cannot be migrated to this CPU due to cpus_ptr, or
* 4) running (obviously), or
* 5) are cache-hot on their current CPU, or
@@ -9394,7 +9394,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
- if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
+ if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu) &&
+ env->migration_type != migrate_load)
return 0;
/*
---
Since load_avg moves slowly, it might be enough to avoid pileup of
tasks. This is similar to the condition for migrating delayed tasks
above but unlike the hierarchies of delayed tasks, the weight of
throttled hierarchy does change when throttled tasks are transitioned to
limbo so this needs some more staring at.
>
> I also tried to apply below hack to prove this "task migration across
> LLC boundary hurts hackbench" theory on both base and head:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c2..34c5f6b75e53d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9297,6 +9297,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> return 0;
>
> + if (!(env->sd->flags & SD_SHARE_LLC))
> + return 0;
> +
> if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> return 0;
>
> With this diff applied, the result is:
>
>
> base' head' diff
> Time 74.78±8.2% 78.87±15.4% -5.47%
>
> base': base + above diff
> head': head + above diff
>
> So both perform better now, but with much larger variance; I guess
> that's because there is no load balance on domain2 and above now. head' is still
> worse than base, but not as much as before.
>
> To conclude this: hackbench doesn't like task migration, especially when
> task is migrated across LLC boundary. patch5 removed the restriction of
> no balancing throttled tasks, this caused more balance to happen and
> hackbench doesn't like this. But balancing has its own merit and could
> still benefit other workloads so I think patch5 should stay, especially
> considering that when throttled tasks are eventually dequeued, they will
> not stay on rq's cfs_tasks list so no need to take special care for them
> when doing load balance.
Mathieu had run some experiments a couple years ago where he too
discovered reducing the number of migrations for hackbench helps but it
wasn't clear if these strategies would benefit real-world workloads:
https://lore.kernel.org/lkml/20230905171105.1005672-1-mathieu.desnoyers@efficios.com/
https://lore.kernel.org/lkml/20231018204511.1563390-1-mathieu.desnoyers@efficios.com/
>
> On a side note: should we increase the cost of balancing tasks out of LLC
> boundary? I tried to enlarge sysctl_sched_migration_cost 100 times for
> domains without SD_SHARE_LLC in task_hot() but that didn't help.
I'll take a look at sd->imbalance_pct and see if there is any room for
improvements there. Thank you again for the detailed analysis.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-09-03 9:11 ` K Prateek Nayak
@ 2025-09-03 10:11 ` Aaron Lu
2025-09-03 10:31 ` K Prateek Nayak
0 siblings, 1 reply; 48+ messages in thread
From: Aaron Lu @ 2025-09-03 10:11 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
Hi Prateek,
On Wed, Sep 03, 2025 at 02:41:55PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/3/2025 12:44 PM, Aaron Lu wrote:
> > On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> >> With this said, I reduced the task number and retested on this 2nd AMD
> >> Genoa:
> >> - quota set to 50 cpu for each level1 cgroup;
>
> What exactly is the quota and period when you say 50cpu?
period is the default 100000 and quota is set to 5000000.
>
> >> - using only 1 fd pair, i.e. 2 task for each cgroup:
> >> hackbench -p -g 1 -f 1 -l 50000000
> >> i.e. each leaf cgroup has 1 sender task and 1 receiver task, total
> >> task number is 2 * 200 = 400 tasks.
> >>
> >> base head diff
> >> Time 127.77±2.60% 127.49±2.63% noise
> >>
> >> In this setup, performance is about the same.
> >>
> >> Now I'm wondering why on Intel EMR, running that extreme setup(8000
> >> tasks), performance of task based throttle didn't see noticeable drop...
> >
> > Looks like hackbench doesn't like task migration on this AMD system
> > (domain0 SMT; domain1 MC; domain2 PKG; domain3 NUMA).
> >
> > If I revert patch5, running this 40 * 200 = 8000 hackbench workload
> > again, performance is roughly the same now(head~1 is slightly worse but
> > given the 4+% stddev in base, it can be considered in noise range):
> >
> > base head~1(patch1-4) diff head(patch1-5)
> > Time 82.55±4.82% 84.45±2.70% -2.3% 99.69±6.71%
> >
> > According to /proc/schedstat, the lb_gained for domain2 is:
> >
> > NOT_IDLE IDLE NEWLY_IDLE
> > base 0 8052 81791
> > head~1 0 7197 175096
> > head 1 14818 793065
>
> Since these are mostly idle and newidle balance, I wonder if we can run
> into a scenario where,
>
> 1. All the tasks are throttled.
> 2. CPU turning idle does a newidle balance.
> 3. CPU pulls a tasks from throttled hierarchy and selects it.
> 4. The task exits to user space and is dequeued.
> 5. Goto 1.
>
> and when the CPU is unthrottled, it has a large number of tasks on it
> that'll again require a load balance to even stuff out.
>
I think it is because we allow balancing tasks under a throttled
hierarchy that made the balance numbers much larger.
> >
> > Other domains have similar number: base has smallest migration number
> > while head has the most and head~1 reduce the number a lot. I suppose
> > this is expected, because we removed the throttled_lb_pair() restriction
> > in patch5 and that can cause runnable tasks in throttled hierarchy to be
> > balanced to other cpus while in base, this can not happen.
> >
> > I think patch5 still makes sense and is correct, it's just this specific
> > workload doesn't like task migrations. Intel EMR doesn't suffer from
> > this, I suppose that's because EMR has a much larger LLC while AMD Genoa
> > has a relatively small LLC and task migrations across LLC boundary hurts
> > hackbench's performance.
>
> I think we can leave the throttled_lb_pair() condition as is and revisit
> it later if this is visible in real world workloads. I cannot think of
> any easy way to avoid the case for potential pileup without accounting
> for the throttled tasks in limbo except for something like below at
> head~1:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bdc9bfa0b9ef..3dc807af21ba 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9385,7 +9385,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> /*
> * We do not migrate tasks that are:
> * 1) delayed dequeued unless we migrate load, or
> - * 2) throttled_lb_pair, or
> + * 2) throttled_lb_pair unless we migrate load, or
> * 3) cannot be migrated to this CPU due to cpus_ptr, or
> * 4) running (obviously), or
> * 5) are cache-hot on their current CPU, or
> @@ -9394,7 +9394,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> return 0;
>
> - if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> + if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu) &&
> + env->migration_type != migrate_load)
> return 0;
>
> /*
> ---
>
> Since load_avg moves slowly, it might be enough to avoid pileup of
> tasks. This is similar to the condition for migrating delayed tasks
> above but unlike the hierarchies of delayed tasks, the weight of
> throttled hierarchy does change when throttled tasks are transitioned to
> limbo so this needs some more staring at.
>
I was thinking: should we simply not allow balancing a task to a throttled
target cfs_rq? With the task based throttle model, if a task is on the rq's
cfs_tasks list it is allowed to run, so we should not check the src cfs_rq's
throttle status; but we should check whether the target cfs_rq is throttled,
and if it is, it's probably not very useful to do the balance. I tried the
below diff and the performance is restored:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..3e927b9b7eeb6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9370,6 +9370,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
+ return 0;
+
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
base head' diff head(patch1-5)
Time 82.55±4.82% 83.81±2.89% -1.5% 99.69±6.71%
head': head + above diff
I also tested netperf on this AMD system as well as hackbench and
netperf on Intel EMR, no obvious performance difference observed
after applying the above diff, i.e. base and head' performance is
roughly the same.
Does the above diff make sense? One thing I'm slightly concerned about:
there may be one case where balancing a task to a throttled target
cfs_rq makes sense: if the task holds some kernel resource and is
running inside the kernel, then even if its target cfs_rq is throttled, we
still want this task to go there and finish its job in kernel mode sooner,
as this could help other resource waiters. But this may not be a big deal,
and most of the time, balancing a task to a throttled cfs_rq doesn't
look like a meaningful thing to do.
Best regards,
Aaron
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-09-03 10:11 ` Aaron Lu
@ 2025-09-03 10:31 ` K Prateek Nayak
2025-09-03 11:35 ` Aaron Lu
0 siblings, 1 reply; 48+ messages in thread
From: K Prateek Nayak @ 2025-09-03 10:31 UTC (permalink / raw)
To: Aaron Lu
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
Hello Aaron,
On 9/3/2025 3:41 PM, Aaron Lu wrote:
> Hi Prateek,
>
> On Wed, Sep 03, 2025 at 02:41:55PM +0530, K Prateek Nayak wrote:
>> Hello Aaron,
>>
>> On 9/3/2025 12:44 PM, Aaron Lu wrote:
>>> On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
>>>> With this said, I reduced the task number and retested on this 2nd AMD
>>>> Genoa:
>>>> - quota set to 50 cpu for each level1 cgroup;
>>
>> What exactly is the quota and period when you say 50cpu?
>
> period is the default 100000 and quota is set to 5000000.
Thank you! I'll do some tests on my end as well.
[..snip..]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index df8dc389af8e1..3e927b9b7eeb6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9370,6 +9370,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> return 0;
>
> + if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
> + return 0;
> +
This makes sense instead of the full throttled_lb_pair(). You'll still
need to put it behind CONFIG_CGROUP_SCHED (or better yet
CONFIG_CFS_BANDWIDTH) since task_group() can return NULL if GROUP_SCHED
is not enabled.
> /*
> * We want to prioritize the migration of eligible tasks.
> * For ineligible tasks we soft-limit them and only allow
>
> base head' diff head(patch1-5)
> Time 82.55±4.82% 83.81±2.89% -1.5% 99.69±6.71%
>
> head': head + above diff
>
> I also tested netperf on this AMD system as well as hackbench and
> netperf on Intel EMR, no obvious performance difference observed
> after applying the above diff, i.e. base and head' performance is
> roughly the same.
>
> Does the above diff make sense? One thing I'm slightly concerned about:
> there may be one case where balancing a task to a throttled target
> cfs_rq makes sense: if the task holds some kernel resource and is
> running inside the kernel, then even if its target cfs_rq is throttled, we
> still want this task to go there and finish its job in kernel mode sooner,
> as this could help other resource waiters. But this may not be a big deal,
I think it is still an improvement over per-cfs_rq throttling from a
tail latency perspective.
> and most of the time, balancing a task to a throttled cfs_rq doesn't
> look like a meaningful thing to do.
Ack.
--
Thanks and Regards,
Prateek
* Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
2025-09-03 10:31 ` K Prateek Nayak
@ 2025-09-03 11:35 ` Aaron Lu
0 siblings, 0 replies; 48+ messages in thread
From: Aaron Lu @ 2025-09-03 11:35 UTC (permalink / raw)
To: K Prateek Nayak
Cc: Valentin Schneider, Ben Segall, Peter Zijlstra, Chengming Zhou,
Josh Don, Ingo Molnar, Vincent Guittot, Xi Wang, linux-kernel,
Juri Lelli, Dietmar Eggemann, Steven Rostedt, Mel Gorman,
Chuyi Zhou, Jan Kiszka, Florian Bezdeka, Songtang Liu,
Sebastian Andrzej Siewior
On Wed, Sep 03, 2025 at 04:01:03PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/3/2025 3:41 PM, Aaron Lu wrote:
> > Hi Prateek,
> >
> > On Wed, Sep 03, 2025 at 02:41:55PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron,
> >>
> >> On 9/3/2025 12:44 PM, Aaron Lu wrote:
> >>> On Fri, Aug 22, 2025 at 07:07:01PM +0800, Aaron Lu wrote:
> >>>> With this said, I reduced the task number and retested on this 2nd AMD
> >>>> Genoa:
> >>>> - quota set to 50 cpu for each level1 cgroup;
> >>
> >> What exactly is the quota and period when you say 50cpu?
> >
> > period is the default 100000 and quota is set to 5000000.
>
> Thank you! I'll do some tests on my end as well.
>
I've attached test scripts I've used for your reference.
> [..snip..]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index df8dc389af8e1..3e927b9b7eeb6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9370,6 +9370,9 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> > if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
> > return 0;
> >
> > + if (throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
> > + return 0;
> > +
>
> This makes sense instead of the full throttled_lb_pair(). You'll still
> need to put it behind CONFIG_CGROUP_SCHED (or better yet
> CONFIG_CFS_BANDWIDTH) since task_group() can return NULL if GROUP_SCHED
> is not enabled.
>
Got it, thanks for the reminder. Maybe I can avoid adding new wrappers
and just check task_group() first, something like this:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df8dc389af8e1..d9abde5e631b8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9362,6 +9362,7 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* We do not migrate tasks that are:
* 1) delayed dequeued unless we migrate load, or
+ * 2) target cfs_rq is in throttled hierarchy, or
* 2) cannot be migrated to this CPU due to cpus_ptr, or
* 3) running (obviously), or
* 4) are cache-hot on their current CPU, or
@@ -9370,6 +9371,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
if ((p->se.sched_delayed) && (env->migration_type != migrate_load))
return 0;
+ if (task_group(p) &&
+ throttled_hierarchy(task_group(p)->cfs_rq[env->dst_cpu]))
+ return 0;
+
/*
* We want to prioritize the migration of eligible tasks.
* For ineligible tasks we soft-limit them and only allow
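As an alternative shape for the same check, the task_group() dereference
could be hidden behind CONFIG_CFS_BANDWIDTH as Prateek suggests. The helper
below is a hypothetical sketch (name included), not something taken from the
series:

#ifdef CONFIG_CFS_BANDWIDTH
/* Sketch: is the task's cfs_rq on dst_cpu inside a throttled hierarchy? */
static inline bool dst_cfs_rq_throttled(struct task_struct *p, int dst_cpu)
{
	return throttled_hierarchy(task_group(p)->cfs_rq[dst_cpu]);
}
#else
static inline bool dst_cfs_rq_throttled(struct task_struct *p, int dst_cpu)
{
	return false;
}
#endif

can_migrate_task() would then call dst_cfs_rq_throttled(p, env->dst_cpu)
unconditionally, keeping the !CFS_BANDWIDTH build free of the check.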
> > /*
> > * We want to prioritize the migration of eligible tasks.
> > * For ineligible tasks we soft-limit them and only allow
> >
> > base head' diff head(patch1-5)
> > Time 82.55±4.82% 83.81±2.89% -1.5% 99.69±6.71%
> >
> > head': head + above diff
> >
> > I also tested netperf on this AMD system as well as hackbench and
> > netperf on Intel EMR, no obvious performance difference observed
> > after applying the above diff, i.e. base and head' performance is
> > roughly the same.
> >
> > Does the above diff make sense? One thing I'm slightly concerned is,
> > there may be one case when balancing a task to a throttled target
> > cfs_rq makes sense: if the task holds some kernel resource and is
> > running inside kernel, even its target cfs_rq is throttled, we still
> > want this task to go there and finish its job in kernel mode sooner,
> > this could help other resource waiters. But, this may not be a big deal
>
> I think it is still an improvement over per-cfs_rq throttling from a
> tail latency perspective.
>
> > and in most of the time, balancing a task to a throttled cfs_rq doesn't
> > look like a meaningful thing to do.Ack.
Just want to add that with the above diff applied, I also tested
Songtang's stress test [0] and Jan's rt deadlock reproducer [1] and didn't
see any problems.
[0]: https://lore.kernel.org/lkml/20250715072218.GA304@bytedance/
[1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
[-- Attachment #2: test.sh --]
[-- Type: application/x-sh, Size: 2030 bytes --]
[-- Attachment #3: run_in_cg.sh --]
[-- Type: application/x-sh, Size: 294 bytes --]
[-- Attachment #4: cleanup.sh --]
[-- Type: application/x-sh, Size: 1022 bytes --]
end of thread, other threads:[~2025-09-03 11:36 UTC | newest]
Thread overview: 48+ messages
2025-07-15 7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-15 7:16 ` [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-07-15 7:16 ` [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-07-15 7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-07-15 23:29 ` kernel test robot
2025-07-16 6:57 ` Aaron Lu
2025-07-16 7:40 ` Philip Li
2025-07-16 11:15 ` [PATCH v3 update " Aaron Lu
2025-07-16 11:27 ` [PATCH v3 " Peter Zijlstra
2025-07-16 15:20 ` kernel test robot
2025-07-17 3:52 ` Aaron Lu
2025-07-23 8:21 ` Oliver Sang
2025-07-23 10:08 ` Aaron Lu
2025-08-08 9:12 ` Valentin Schneider
2025-08-08 10:13 ` Aaron Lu
2025-08-08 11:45 ` Valentin Schneider
2025-08-12 8:48 ` Aaron Lu
2025-08-14 15:54 ` Valentin Schneider
2025-08-15 9:30 ` Aaron Lu
2025-08-22 11:07 ` Aaron Lu
2025-09-03 7:14 ` Aaron Lu
2025-09-03 9:11 ` K Prateek Nayak
2025-09-03 10:11 ` Aaron Lu
2025-09-03 10:31 ` K Prateek Nayak
2025-09-03 11:35 ` Aaron Lu
2025-08-28 3:50 ` Aaron Lu
2025-08-17 8:50 ` Chen, Yu C
2025-08-18 2:50 ` Aaron Lu
2025-08-18 3:10 ` Chen, Yu C
2025-08-18 3:12 ` Aaron Lu
2025-07-15 7:16 ` [PATCH v3 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-08-18 14:57 ` Valentin Schneider
2025-08-19 9:34 ` Aaron Lu
2025-08-19 14:09 ` Valentin Schneider
2025-08-26 14:10 ` Michal Koutný
2025-08-27 15:16 ` Valentin Schneider
2025-08-28 6:06 ` Aaron Lu
2025-08-26 9:15 ` Aaron Lu
2025-07-15 7:16 ` [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-07-15 7:22 ` [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-08-01 14:31 ` Matteo Martelli
2025-08-04 7:52 ` Aaron Lu
2025-08-04 11:18 ` Valentin Schneider
2025-08-04 11:56 ` Aaron Lu
2025-08-08 16:37 ` Matteo Martelli
2025-08-04 8:51 ` K Prateek Nayak
2025-08-04 11:48 ` Aaron Lu
2025-08-27 14:58 ` Valentin Schneider