All of lore.kernel.org
 help / color / mirror / Atom feed
From: Hao Jia <jiahao.kernel@gmail.com>
To: mingo@redhat.com, peterz@infradead.org, mingo@kernel.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	kprateek.nayak@amd.com
Cc: linux-kernel@vger.kernel.org, Hao Jia <jiahao1@lixiang.com>,
	Aaron Lu <ziqianlu@bytedance.com>
Subject: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
Date: Thu, 23 Oct 2025 20:12:13 +0800	[thread overview]
Message-ID: <20251023121213.38282-1-jiahao.kernel@gmail.com> (raw)

From: Hao Jia <jiahao1@lixiang.com>

Aaron Lu and I hit a non-empty throttled_limbo_list warning in
tg_throttle_down() during testing.

WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
Call Trace:
 ? __pfx_tg_nop+0x10/0x10
 walk_tg_tree_from+0x39/0xd0
 ? __pfx_tg_throttle_down+0x10/0x10
 throttle_cfs_rq+0x176/0x1f0
 enqueue_task_fair+0x4f5/0xd30
 ? unthrottle_cfs_rq+0x2f7/0x3a0
 tg_unthrottle_up+0x10c/0x2f0
 ? __pfx_tg_unthrottle_up+0x10/0x10
 walk_tg_tree_from+0x66/0xd0
 ? __pfx_tg_nop+0x10/0x10
 unthrottle_cfs_rq+0x16b/0x3a0
 __cfsb_csd_unthrottle+0x1f0/0x250
 ? __pfx___cfsb_csd_unthrottle+0x10/0x10
 __flush_smp_call_function_queue+0x104/0x440
 ? tick_nohz_account_idle_time+0x4c/0x80
 flush_smp_call_function_queue+0x3b/0x80
 do_idle+0x14f/0x240
 cpu_startup_entry+0x30/0x40
 start_secondary+0x128/0x160
 common_startup_64+0x13e/0x141

cgroup hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B* ...

Debugging shows the following:
A and B are configured with relatively small quota and large period.

At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.

Resetting task_group B quota will set cfs_rq_B runtime_remaining to 0 in
tg_set_cfs_bandwidth().
Since task_group A is throttled, Therefore, task on cfs_rq_B cannot run,
and runtime_remaining stays 0. With task_group B has a small quota,
tasks on other CPUs in task_group B quickly consume all of
cfs_b_B->runtime, causing cfs_b_B->runtime to be 0.

When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) will
re-queues task. However, because cfs_rq_B->runtime_remaining still 0,
and it cannot obtain runtime from cfs_b_B->runtime either. Therefore,
the task will be throttled in
enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
triggering a non-empty throttled_limbo_list warning in tg_throttle_down().

Root Cause:
In unthrottle_cfs_rq(), we only checked cfs_rq_A->runtime_remaining, but
enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
level be greater than 0.

Solution:
One way to fix this warning is to add a runtime_remaining check for
each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes code
strange and complicated.
Another straightforward and simple solution is to add a new enqueue flag
to ensure that enqueue in tg_unthrottle_up() will not immediately trigger
throttling. This may enqueue sched_entity with no remaining runtime, which
is not a big deal because the current kernel also has such situations [1].

We still retain the runtime_remaining check in unthrottle_cfs_rq() for
higher-level cfs_rq to avoid enqueuing many entities with
runtime_remaining < 0.

Also remove the redundant assignment to se in tg_throttle_down().

[1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Reported-by: Aaron Lu <ziqianlu@bytedance.com>
Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
Reproduction steps:
(1) Create a test.sh script and run it:
#!/bin/bash
set -e
CGP_ROOT="/sys/fs/cgroup/"
if ! mount | grep -q "type cgroup2"; then
	echo "ERROR: Not in cgroup v2 mode."
	exit 1
fi
echo 1 > /sys/kernel/debug/clear_warn_once
mkdir -p ${CGP_ROOT}/test
echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
for i in $(seq 1 2); do
	SUB_DIR=${CGP_ROOT}/test/sub$i
	mkdir -p ${SUB_DIR}
	echo $$ > ${SUB_DIR}/cgroup.procs
	perf bench sched messaging -g 10 -t -l 50000 &
	echo $$ > /sys/fs/cgroup/cgroup.procs
	echo "1000 100000" > ${SUB_DIR}/cpu.max
	echo "Started stress in stress_sub$i"
done
echo "1000 100000" > ${CGP_ROOT}/test/cpu.max

(2) Run the following command multiple times until the warning is triggered:
echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max

 kernel/sched/fair.c  | 15 ++++++++++-----
 kernel/sched/sched.h |  3 +++
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b58e696d7ccc..7721466dc8f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		if (!(flags & ENQUEUE_THROTTLE))
+			check_enqueue_throttle(cfs_rq);
+
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
 		if (cfs_rq->pelt_clock_throttled) {
@@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
@@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * unthrottled us with a positive runtime_remaining but other still
 	 * running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * We can't unthrottle this cfs_rq without any runtime remaining
 	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
 	 * throttle, which is not supposed to happen on unthrottle path.
+	 *
+	 * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
+	 * tg_unthrottle_up() do not trigger a throttle, we still retain the check
+	 * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
+	 * entities whose runtime_remaining is less than 0.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e7718f12bc55..6f45e00d1fc2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2368,6 +2368,8 @@ extern const u32		sched_prio_to_wmult[40];
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
  * ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
+ * ENQUEUE_THROTTLE  - Called in tg_unthrottle_up() to ensure that
+ *                     task can be enqueued during unthrottle
  *
  * XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
  * SCHED_DEADLINE seems to rely on this for now.
@@ -2399,6 +2401,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATED	0x00040000
 #define ENQUEUE_INITIAL		0x00080000
 #define ENQUEUE_RQ_SELECTED	0x00100000
+#define ENQUEUE_THROTTLE	0x00200000
 
 #define RETRY_TASK		((void *)-1UL)
 
-- 
2.34.1


             reply	other threads:[~2025-10-23 12:12 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-23 12:12 Hao Jia [this message]
2025-10-24  3:39 ` [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down() Aaron Lu
2025-10-24  6:51   ` Hao Jia
2025-10-24  4:36 ` K Prateek Nayak
2025-10-24  6:58   ` Hao Jia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20251023121213.38282-1-jiahao.kernel@gmail.com \
    --to=jiahao.kernel@gmail.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jiahao1@lixiang.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.