* [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: Hao Jia @ 2025-10-23 12:12 UTC
To: mingo, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak
Cc: linux-kernel, Hao Jia, Aaron Lu
From: Hao Jia <jiahao1@lixiang.com>
Aaron Lu and I hit a non-empty throttled_limbo_list warning in
tg_throttle_down() during testing.
WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
Call Trace:
? __pfx_tg_nop+0x10/0x10
walk_tg_tree_from+0x39/0xd0
? __pfx_tg_throttle_down+0x10/0x10
throttle_cfs_rq+0x176/0x1f0
enqueue_task_fair+0x4f5/0xd30
? unthrottle_cfs_rq+0x2f7/0x3a0
tg_unthrottle_up+0x10c/0x2f0
? __pfx_tg_unthrottle_up+0x10/0x10
walk_tg_tree_from+0x66/0xd0
? __pfx_tg_nop+0x10/0x10
unthrottle_cfs_rq+0x16b/0x3a0
__cfsb_csd_unthrottle+0x1f0/0x250
? __pfx___cfsb_csd_unthrottle+0x10/0x10
__flush_smp_call_function_queue+0x104/0x440
? tick_nohz_account_idle_time+0x4c/0x80
flush_smp_call_function_queue+0x3b/0x80
do_idle+0x14f/0x240
cpu_startup_entry+0x30/0x40
start_secondary+0x128/0x160
common_startup_64+0x13e/0x141
cgroup hierarchy:

          root
         /    \
        A*    ...
       / | \ ...
      B*  ...

Debugging shows the following:
A and B are configured with relatively small quotas and large periods.

At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.

Resetting task_group B's quota sets cfs_rq_B's runtime_remaining to 0 in
tg_set_cfs_bandwidth().
Since task_group A is throttled, tasks on cfs_rq_B cannot run, so
cfs_rq_B's runtime_remaining stays 0. Because task_group B has a small
quota, tasks on other CPUs in task_group B quickly consume all of
cfs_b_B->runtime, leaving cfs_b_B->runtime at 0.

When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) re-queues
the tasks. However, cfs_rq_B->runtime_remaining is still 0, and runtime
cannot be obtained from cfs_b_B->runtime either. Therefore, the tasks
are throttled in
enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
triggering the non-empty throttled_limbo_list warning in tg_throttle_down().

Root Cause:
In unthrottle_cfs_rq(), we only check cfs_rq_A->runtime_remaining, but
enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
level be greater than 0.

Solution:
One way to fix this warning is to add a runtime_remaining check for
each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes the
code strange and complicated.
Another straightforward and simple solution is to add a new enqueue flag
to ensure that enqueues in tg_unthrottle_up() do not immediately trigger
throttling. This may enqueue a sched_entity with no remaining runtime,
which is not a big deal because the current kernel already allows such
situations [1].

We still retain the runtime_remaining check in unthrottle_cfs_rq() for
the higher-level cfs_rq to avoid enqueuing many entities with
runtime_remaining < 0.

Also remove the redundant assignment to se in tg_throttle_down().
[1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance
Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Aaron Lu <ziqianlu@bytedance.com>
Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
Reproduction steps:
(1) Create a test.sh script and run it:
#!/bin/bash
set -e
CGP_ROOT="/sys/fs/cgroup/"
if ! mount | grep -q "type cgroup2"; then
	echo "ERROR: Not in cgroup v2 mode."
	exit 1
fi
echo 1 > /sys/kernel/debug/clear_warn_once
mkdir -p ${CGP_ROOT}/test
echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
for i in $(seq 1 2); do
	SUB_DIR=${CGP_ROOT}/test/sub$i
	mkdir -p ${SUB_DIR}
	echo $$ > ${SUB_DIR}/cgroup.procs
	perf bench sched messaging -g 10 -t -l 50000 &
	echo $$ > /sys/fs/cgroup/cgroup.procs
	echo "1000 100000" > ${SUB_DIR}/cpu.max
	echo "Started stress in stress_sub$i"
done
echo "1000 100000" > ${CGP_ROOT}/test/cpu.max

(2) Run the following command multiple times until the warning is triggered:
echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max
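
For context on step (2): the quota reset takes effect through the per-CPU
state refresh in tg_set_cfs_bandwidth(). A simplified sketch of that path
(locking and iteration details elided, so not the verbatim upstream code):

	/* tg_set_cfs_bandwidth(): apply the new period/quota on each CPU */
	for_each_online_cpu(i) {
		struct cfs_rq *cfs_rq = tg->cfs_rq[i];

		cfs_rq->runtime_enabled = runtime_enabled;
		/* local runtime is discarded; refilled only when the cfs_rq runs */
		cfs_rq->runtime_remaining = 0;

		if (cfs_rq->throttled)
			unthrottle_cfs_rq(cfs_rq);
	}

Because cfs_rq_B sits under the already-throttled cfs_rq_A, it never gets
to run and pull fresh runtime from cfs_b_B, so its runtime_remaining stays
at 0 until tg_unthrottle_up() re-enqueues its tasks.
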
kernel/sched/fair.c | 15 ++++++++++-----
kernel/sched/sched.h | 3 +++
2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58e696d7ccc..7721466dc8f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		if (!(flags & ENQUEUE_THROTTLE))
+			check_enqueue_throttle(cfs_rq);
+
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
 		if (cfs_rq->pelt_clock_throttled) {
@@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
@@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * unthrottled us with a positive runtime_remaining but other still
 	 * running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * We can't unthrottle this cfs_rq without any runtime remaining
 	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
 	 * throttle, which is not supposed to happen on unthrottle path.
+	 *
+	 * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
+	 * tg_unthrottle_up() do not trigger a throttle, we still retain the check
+	 * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
+	 * entities whose runtime_remaining is less than 0.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e7718f12bc55..6f45e00d1fc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2368,6 +2368,8 @@ extern const u32 sched_prio_to_wmult[40];
  * ENQUEUE_REPLENISH  - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED   - the task was migrated during wakeup
  * ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
+ * ENQUEUE_THROTTLE   - Called in tg_unthrottle_up() to ensure that
+ *                      task can be enqueued during unthrottle
  *
  * XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
  * SCHED_DEADLINE seems to rely on this for now.
@@ -2399,6 +2401,7 @@ extern const u32 sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_THROTTLE	0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
--
2.34.1
* Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: Aaron Lu @ 2025-10-24 3:39 UTC
To: Hao Jia
Cc: mingo, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak, linux-kernel, Hao Jia
On Thu, Oct 23, 2025 at 08:12:13PM +0800, Hao Jia wrote:
> From: Hao Jia <jiahao1@lixiang.com>
>
> Aaron Lu and I hit a non-empty throttled_limbo_list warning in
> tg_throttle_down() during testing.
>
> WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
> Call Trace:
> ? __pfx_tg_nop+0x10/0x10
> walk_tg_tree_from+0x39/0xd0
> ? __pfx_tg_throttle_down+0x10/0x10
> throttle_cfs_rq+0x176/0x1f0
> enqueue_task_fair+0x4f5/0xd30
> ? unthrottle_cfs_rq+0x2f7/0x3a0
> tg_unthrottle_up+0x10c/0x2f0
> ? __pfx_tg_unthrottle_up+0x10/0x10
> walk_tg_tree_from+0x66/0xd0
> ? __pfx_tg_nop+0x10/0x10
> unthrottle_cfs_rq+0x16b/0x3a0
> __cfsb_csd_unthrottle+0x1f0/0x250
> ? __pfx___cfsb_csd_unthrottle+0x10/0x10
> __flush_smp_call_function_queue+0x104/0x440
> ? tick_nohz_account_idle_time+0x4c/0x80
> flush_smp_call_function_queue+0x3b/0x80
> do_idle+0x14f/0x240
> cpu_startup_entry+0x30/0x40
> start_secondary+0x128/0x160
> common_startup_64+0x13e/0x141
>
> cgroup hierarchy:
>
> root
> / \
> A* ...
> / | \ ...
> B* ...
>
> Debugging shows the following:
> A and B are configured with relatively small quotas and large periods.
>
> At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
> the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.
>
> Resetting task_group B's quota sets cfs_rq_B's runtime_remaining to 0 in
> tg_set_cfs_bandwidth().
> Since task_group A is throttled, tasks on cfs_rq_B cannot run, so
> cfs_rq_B's runtime_remaining stays 0. Because task_group B has a small
> quota, tasks on other CPUs in task_group B quickly consume all of
> cfs_b_B->runtime, leaving cfs_b_B->runtime at 0.
>
> When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) re-queues
> the tasks. However, cfs_rq_B->runtime_remaining is still 0, and runtime
> cannot be obtained from cfs_b_B->runtime either. Therefore, the tasks
> are throttled in
> enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
> triggering the non-empty throttled_limbo_list warning in tg_throttle_down().
>
> Root Cause:
> In unthrottle_cfs_rq(), we only check cfs_rq_A->runtime_remaining, but
> enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
> level be greater than 0.
>
> Solution:
> One way to fix this warning is to add a runtime_remaining check for
> each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes the
> code strange and complicated.
> Another straightforward and simple solution is to add a new enqueue flag
> to ensure that enqueues in tg_unthrottle_up() do not immediately trigger
> throttling. This may enqueue a sched_entity with no remaining runtime,
> which is not a big deal because the current kernel already allows such
> situations [1].
>
> We still retain the runtime_remaining check in unthrottle_cfs_rq() for
> the higher-level cfs_rq to avoid enqueuing many entities with
> runtime_remaining < 0.
>
> Also remove the redundant assignment to se in tg_throttle_down().
>
> [1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance
>
> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> Reported-by: Aaron Lu <ziqianlu@bytedance.com>
> Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> ---
> Reproduction steps:
> (1) Create a test.sh script and run it:
> #!/bin/bash
> set -e
> CGP_ROOT="/sys/fs/cgroup/"
> if ! mount | grep -q "type cgroup2"; then
> echo "ERROR: Not in cgroup v2 mode."
> exit 1
> fi
> echo 1 > /sys/kernel/debug/clear_warn_once
> mkdir -p ${CGP_ROOT}/test
> echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
> for i in $(seq 1 2); do
> SUB_DIR=${CGP_ROOT}/test/sub$i
> mkdir -p ${SUB_DIR}
> echo $$ > ${SUB_DIR}/cgroup.procs
> perf bench sched messaging -g 10 -t -l 50000 &
> echo $$ > /sys/fs/cgroup/cgroup.procs
> echo "1000 100000" > ${SUB_DIR}/cpu.max
> echo "Started stress in stress_sub$i"
> done
> echo "1000 100000" > ${CGP_ROOT}/test/cpu.max
>
> (2) Run the following command multiple times until the warning is triggered:
> echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max
>
> kernel/sched/fair.c | 15 ++++++++++-----
> kernel/sched/sched.h | 3 +++
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b58e696d7ccc..7721466dc8f2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> se->on_rq = 1;
>
> if (cfs_rq->nr_queued == 1) {
> - check_enqueue_throttle(cfs_rq);
> + if (!(flags & ENQUEUE_THROTTLE))
> + check_enqueue_throttle(cfs_rq);
> +
nit: might it be better to place this check inside check_enqueue_throttle(),
after the cfs_bandwidth_used() test? I was thinking that this way it might
save some cycles on systems that have CONFIG_CFS_BANDWIDTH set but don't
actually set any quota.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 671cbf8dc00ee..688f74ee145c9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5229,7 +5229,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->deadline = se->vruntime + vslice;
 }
 
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static void
@@ -5287,8 +5287,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
-		if (!(flags & ENQUEUE_THROTTLE))
-			check_enqueue_throttle(cfs_rq);
+		check_enqueue_throttle(cfs_rq, flags);
 
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -6408,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
  * runtime as update_curr() throttling can not trigger until it's on-rq.
  */
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
 {
 	if (!cfs_bandwidth_used())
 		return;
@@ -6421,6 +6420,10 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 	if (cfs_rq_throttled(cfs_rq))
 		return;
 
+	/* Do not attempt throttle on cfs_rq's unthrottle path */
+	if (flags & ENQUEUE_THROTTLE)
+		return;
+
 	/* update runtime allocation */
 	account_cfs_rq_runtime(cfs_rq, 0);
 	if (cfs_rq->runtime_remaining <= 0)
@@ -6719,7 +6722,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}

> list_add_leaf_cfs_rq(cfs_rq);
> #ifdef CONFIG_CFS_BANDWIDTH
> if (cfs_rq->pelt_clock_throttled) {
> @@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
> list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
> list_del_init(&p->throttle_node);
> p->throttled = false;
> - enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
> + enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
> }
>
> /* Add cfs_rq with load or one or more already running entities to the list */
> @@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> * unthrottled us with a positive runtime_remaining but other still
> * running entities consumed those runtime before we reached here.
> *
> - * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
> + * We can't unthrottle this cfs_rq without any runtime remaining
> * because any enqueue in tg_unthrottle_up() will immediately trigger a
> * throttle, which is not supposed to happen on unthrottle path.
> + *
> + * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
> + * tg_unthrottle_up() do not trigger a throttle, we still retain the check
> + * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
> + * entities whose runtime_remaining is less than 0.
> */
> if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
> return;
>
> - se = cfs_rq->tg->se[cpu_of(rq)];
> -
> cfs_rq->throttled = 0;
>
> update_rq_clock(rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e7718f12bc55..6f45e00d1fc2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2368,6 +2368,8 @@ extern const u32 sched_prio_to_wmult[40];
> * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
> * ENQUEUE_MIGRATED - the task was migrated during wakeup
> * ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
> + * ENQUEUE_THROTTLE - Called in tg_unthrottle_up() to ensure that
> + * task can be enqueued during unthrottle
> *
> * XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
> * SCHED_DEADLINE seems to rely on this for now.
> @@ -2399,6 +2401,7 @@ extern const u32 sched_prio_to_wmult[40];
> #define ENQUEUE_MIGRATED 0x00040000
> #define ENQUEUE_INITIAL 0x00080000
> #define ENQUEUE_RQ_SELECTED 0x00100000
> +#define ENQUEUE_THROTTLE 0x00200000
Another nit: now that ENQUEUE_THROTTLE/DEQUEUE_THROTTLE form a pair, could
they share the same value in the low word?
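
For illustration, mirroring how the existing pairs such as
DEQUEUE_SAVE/ENQUEUE_RESTORE share a low-word value; the numeric value
below is purely hypothetical, the real one would follow whatever
DEQUEUE_THROTTLE actually uses:

	#define DEQUEUE_THROTTLE	0x0800	/* hypothetical value */
	#define ENQUEUE_THROTTLE	DEQUEUE_THROTTLE
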
>
> #define RETRY_TASK ((void *)-1UL)
>
> --
> 2.34.1
>
* Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: K Prateek Nayak @ 2025-10-24 4:36 UTC
To: Hao Jia, mingo, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid
Cc: linux-kernel, Hao Jia, Aaron Lu
Hello Hao,
On 10/23/2025 5:42 PM, Hao Jia wrote:
> @@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> se->on_rq = 1;
>
> if (cfs_rq->nr_queued == 1) {
> - check_enqueue_throttle(cfs_rq);
> + if (!(flags & ENQUEUE_THROTTLE))
> + check_enqueue_throttle(cfs_rq);
> +
So my only concern here is:
check_enqueue_throttle()
account_cfs_rq_runtime()
__account_cfs_rq_runtime()
assign_cfs_rq_runtime()
__assign_cfs_rq_runtime()
start_cfs_bandwidth() /* Starts the BW timer. */
If we skip it, we wouldn't know we've run out of bandwidth until the
hierarchy is picked, which would cause additional delay until the
bandwidth is replenished.
At the very least, we should pass the enqueue flags to
check_enqueue_throttle() and only skip the throttle_cfs_rq() part if
we spot ENQUEUE_THROTTLE.
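
Roughly along these lines (untested sketch derived from the current
check_enqueue_throttle() body; only the throttle step itself is gated on
the flag, so the runtime accounting and bandwidth-timer start still run):

	static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
	{
		if (!cfs_bandwidth_used())
			return;

		if (!cfs_rq->runtime_enabled || cfs_rq->curr)
			return;

		if (cfs_rq_throttled(cfs_rq))
			return;

		/* update runtime allocation; may start the bandwidth timer */
		account_cfs_rq_runtime(cfs_rq, 0);
		if (cfs_rq->runtime_remaining <= 0 &&
		    !(flags & ENQUEUE_THROTTLE))
			throttle_cfs_rq(cfs_rq);
	}
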
Thoughts?
--
Thanks and Regards,
Prateek
* Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: Hao Jia @ 2025-10-24 6:51 UTC
To: Aaron Lu
Cc: mingo, peterz, mingo, juri.lelli, vincent.guittot,
dietmar.eggemann, rostedt, bsegall, mgorman, vschneid,
kprateek.nayak, linux-kernel, Hao Jia
Hi Aaron,
On 2025/10/24 11:39, Aaron Lu wrote:
> On Thu, Oct 23, 2025 at 08:12:13PM +0800, Hao Jia wrote:
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> Aaron Lu and I hit a non-empty throttled_limbo_list warning in
>> tg_throttle_down() during testing.
>>
>> WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
>> Call Trace:
>> ? __pfx_tg_nop+0x10/0x10
>> walk_tg_tree_from+0x39/0xd0
>> ? __pfx_tg_throttle_down+0x10/0x10
>> throttle_cfs_rq+0x176/0x1f0
>> enqueue_task_fair+0x4f5/0xd30
>> ? unthrottle_cfs_rq+0x2f7/0x3a0
>> tg_unthrottle_up+0x10c/0x2f0
>> ? __pfx_tg_unthrottle_up+0x10/0x10
>> walk_tg_tree_from+0x66/0xd0
>> ? __pfx_tg_nop+0x10/0x10
>> unthrottle_cfs_rq+0x16b/0x3a0
>> __cfsb_csd_unthrottle+0x1f0/0x250
>> ? __pfx___cfsb_csd_unthrottle+0x10/0x10
>> __flush_smp_call_function_queue+0x104/0x440
>> ? tick_nohz_account_idle_time+0x4c/0x80
>> flush_smp_call_function_queue+0x3b/0x80
>> do_idle+0x14f/0x240
>> cpu_startup_entry+0x30/0x40
>> start_secondary+0x128/0x160
>> common_startup_64+0x13e/0x141
>>
>> cgroup hierarchy:
>>
>> root
>> / \
>> A* ...
>> / | \ ...
>> B* ...
>>
>> Debugging shows the following:
>> A and B are configured with relatively small quotas and large periods.
>>
>> At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
>> the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.
>>
>> Resetting task_group B's quota sets cfs_rq_B's runtime_remaining to 0 in
>> tg_set_cfs_bandwidth().
>> Since task_group A is throttled, tasks on cfs_rq_B cannot run, so
>> cfs_rq_B's runtime_remaining stays 0. Because task_group B has a small
>> quota, tasks on other CPUs in task_group B quickly consume all of
>> cfs_b_B->runtime, leaving cfs_b_B->runtime at 0.
>>
>> When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) re-queues
>> the tasks. However, cfs_rq_B->runtime_remaining is still 0, and runtime
>> cannot be obtained from cfs_b_B->runtime either. Therefore, the tasks
>> are throttled in
>> enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
>> triggering the non-empty throttled_limbo_list warning in tg_throttle_down().
>>
>> Root Cause:
>> In unthrottle_cfs_rq(), we only check cfs_rq_A->runtime_remaining, but
>> enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
>> level be greater than 0.
>>
>> Solution:
>> One way to fix this warning is to add a runtime_remaining check for
>> each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes the
>> code strange and complicated.
>> Another straightforward and simple solution is to add a new enqueue flag
>> to ensure that enqueues in tg_unthrottle_up() do not immediately trigger
>> throttling. This may enqueue a sched_entity with no remaining runtime,
>> which is not a big deal because the current kernel already allows such
>> situations [1].
>>
>> We still retain the runtime_remaining check in unthrottle_cfs_rq() for
>> the higher-level cfs_rq to avoid enqueuing many entities with
>> runtime_remaining < 0.
>>
>> Also remove the redundant assignment to se in tg_throttle_down().
>>
>> [1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance
>>
>> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
>> Reported-by: Aaron Lu <ziqianlu@bytedance.com>
>> Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
>> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
>
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
>
>> ---
>> Reproduction steps:
>> (1) Create a test.sh script and run it:
>> #!/bin/bash
>> set -e
>> CGP_ROOT="/sys/fs/cgroup/"
>> if ! mount | grep -q "type cgroup2"; then
>> echo "ERROR: Not in cgroup v2 mode."
>> exit 1
>> fi
>> echo 1 > /sys/kernel/debug/clear_warn_once
>> mkdir -p ${CGP_ROOT}/test
>> echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
>> for i in $(seq 1 2); do
>> SUB_DIR=${CGP_ROOT}/test/sub$i
>> mkdir -p ${SUB_DIR}
>> echo $$ > ${SUB_DIR}/cgroup.procs
>> perf bench sched messaging -g 10 -t -l 50000 &
>> echo $$ > /sys/fs/cgroup/cgroup.procs
>> echo "1000 100000" > ${SUB_DIR}/cpu.max
>> echo "Started stress in stress_sub$i"
>> done
>> echo "1000 100000" > ${CGP_ROOT}/test/cpu.max
>>
>> (2) Run the following command multiple times until the warning is triggered:
>> echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max
>>
>> kernel/sched/fair.c | 15 ++++++++++-----
>> kernel/sched/sched.h | 3 +++
>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b58e696d7ccc..7721466dc8f2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> se->on_rq = 1;
>>
>> if (cfs_rq->nr_queued == 1) {
>> - check_enqueue_throttle(cfs_rq);
>> + if (!(flags & ENQUEUE_THROTTLE))
>> + check_enqueue_throttle(cfs_rq);
>> +
>
> nit: might it be better to place this check inside check_enqueue_throttle(),
> after the cfs_bandwidth_used() test? I was thinking that this way it might
> save some cycles on systems that have CONFIG_CFS_BANDWIDTH set but don't
> actually set any quota.
Thanks for your suggestion; I will do it in the next version.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 671cbf8dc00ee..688f74ee145c9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5229,7 +5229,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> se->deadline = se->vruntime + vslice;
> }
>
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags);
> static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
>
> static void
> @@ -5287,8 +5287,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> se->on_rq = 1;
>
> if (cfs_rq->nr_queued == 1) {
> - if (!(flags & ENQUEUE_THROTTLE))
> - check_enqueue_throttle(cfs_rq);
> + check_enqueue_throttle(cfs_rq, flags);
>
> list_add_leaf_cfs_rq(cfs_rq);
> #ifdef CONFIG_CFS_BANDWIDTH
> @@ -6408,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> * expired/exceeded, otherwise it may be allowed to steal additional ticks of
> * runtime as update_curr() throttling can not trigger until it's on-rq.
> */
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
> {
> if (!cfs_bandwidth_used())
> return;
> @@ -6421,6 +6420,10 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> if (cfs_rq_throttled(cfs_rq))
> return;
>
> + /* Do not attempt throttle on cfs_rq's unthrottle path */
> + if (flags & ENQUEUE_THROTTLE)
> + return;
> +
> /* update runtime allocation */
> account_cfs_rq_runtime(cfs_rq, 0);
> if (cfs_rq->runtime_remaining <= 0)
> @@ -6719,7 +6722,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
>
> static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
> static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags) {}
> static inline void sync_throttle(struct task_group *tg, int cpu) {}
> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
> static void task_throttle_setup_work(struct task_struct *p) {}
>
>> list_add_leaf_cfs_rq(cfs_rq);
>> #ifdef CONFIG_CFS_BANDWIDTH
>> if (cfs_rq->pelt_clock_throttled) {
>> @@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>> list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>> list_del_init(&p->throttle_node);
>> p->throttled = false;
>> - enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
>> + enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
>> }
>>
>> /* Add cfs_rq with load or one or more already running entities to the list */
>> @@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>> * unthrottled us with a positive runtime_remaining but other still
>> * running entities consumed those runtime before we reached here.
>> *
>> - * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
>> + * We can't unthrottle this cfs_rq without any runtime remaining
>> * because any enqueue in tg_unthrottle_up() will immediately trigger a
>> * throttle, which is not supposed to happen on unthrottle path.
>> + *
>> + * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
>> + * tg_unthrottle_up() do not trigger a throttle, we still retain the check
>> + * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
>> + * entities whose runtime_remaining is less than 0.
>> */
>> if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>> return;
>>
>> - se = cfs_rq->tg->se[cpu_of(rq)];
>> -
>> cfs_rq->throttled = 0;
>>
>> update_rq_clock(rq);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e7718f12bc55..6f45e00d1fc2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2368,6 +2368,8 @@ extern const u32 sched_prio_to_wmult[40];
>> * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
>> * ENQUEUE_MIGRATED - the task was migrated during wakeup
>> * ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
>> + * ENQUEUE_THROTTLE - Called in tg_unthrottle_up() to ensure that
>> + * task can be enqueued during unthrottle
>> *
>> * XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
>> * SCHED_DEADLINE seems to rely on this for now.
>> @@ -2399,6 +2401,7 @@ extern const u32 sched_prio_to_wmult[40];
>> #define ENQUEUE_MIGRATED 0x00040000
>> #define ENQUEUE_INITIAL 0x00080000
>> #define ENQUEUE_RQ_SELECTED 0x00100000
>> +#define ENQUEUE_THROTTLE 0x00200000
>
> Another nit: now that ENQUEUE_THROTTLE/DEQUEUE_THROTTLE form a pair, could
> they share the same value in the low word?
>
I will do it in the next version.
Thanks,
Hao
* Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
From: Hao Jia @ 2025-10-24 6:58 UTC
To: K Prateek Nayak, mingo, peterz, mingo, juri.lelli,
vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
vschneid
Cc: linux-kernel, Hao Jia, Aaron Lu
Hi Prateek,
On 2025/10/24 12:36, K Prateek Nayak wrote:
> Hello Hao,
>
> On 10/23/2025 5:42 PM, Hao Jia wrote:
>> @@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> se->on_rq = 1;
>>
>> if (cfs_rq->nr_queued == 1) {
>> - check_enqueue_throttle(cfs_rq);
>> + if (!(flags & ENQUEUE_THROTTLE))
>> + check_enqueue_throttle(cfs_rq);
>> +
>
> So my only concern here is:
>
> check_enqueue_throttle()
> account_cfs_rq_runtime()
> __account_cfs_rq_runtime()
> assign_cfs_rq_runtime()
> __assign_cfs_rq_runtime()
> start_cfs_bandwidth() /* Starts the BW timer. */
>
> If we skip it, we wouldn't know we've run out of bandwidth until the
> hierarchy is picked, which would cause additional delay until the
> bandwidth is replenished.
>
> At the very least, we should pass the enqueue flags to
> check_enqueue_throttle() and only skip the throttle_cfs_rq() part if
> we spot ENQUEUE_THROTTLE.
>
> Thoughts?
>
Thanks for your suggestion. This is indeed a potential risk, and I will
address it in the next version.
Thanks,
Hao