Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Hao Jia <jiahao.kernel@gmail.com>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: mingo@redhat.com, peterz@infradead.org, mingo@kernel.org,
	juri.lelli@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, rostedt@goodmis.org,
	bsegall@google.com, mgorman@suse.de, vschneid@redhat.com,
	kprateek.nayak@amd.com, linux-kernel@vger.kernel.org,
	Hao Jia <jiahao1@lixiang.com>
Subject: Re: [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()
Date: Fri, 24 Oct 2025 14:51:23 +0800	[thread overview]
Message-ID: <ee79013d-2fc5-b103-ce75-a9fcb5ce4977@gmail.com> (raw)
In-Reply-To: <20251024033955.GA39@bytedance>


Hi Aaron,

On 2025/10/24 11:39, Aaron Lu wrote:
> On Thu, Oct 23, 2025 at 08:12:13PM +0800, Hao Jia wrote:
>> From: Hao Jia <jiahao1@lixiang.com>
>>
>> Aaron Lu and I hit a non-empty throttled_limbo_list warning in
>> tg_throttle_down() during testing.
>>
>> WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
>> Call Trace:
>>   ? __pfx_tg_nop+0x10/0x10
>>   walk_tg_tree_from+0x39/0xd0
>>   ? __pfx_tg_throttle_down+0x10/0x10
>>   throttle_cfs_rq+0x176/0x1f0
>>   enqueue_task_fair+0x4f5/0xd30
>>   ? unthrottle_cfs_rq+0x2f7/0x3a0
>>   tg_unthrottle_up+0x10c/0x2f0
>>   ? __pfx_tg_unthrottle_up+0x10/0x10
>>   walk_tg_tree_from+0x66/0xd0
>>   ? __pfx_tg_nop+0x10/0x10
>>   unthrottle_cfs_rq+0x16b/0x3a0
>>   __cfsb_csd_unthrottle+0x1f0/0x250
>>   ? __pfx___cfsb_csd_unthrottle+0x10/0x10
>>   __flush_smp_call_function_queue+0x104/0x440
>>   ? tick_nohz_account_idle_time+0x4c/0x80
>>   flush_smp_call_function_queue+0x3b/0x80
>>   do_idle+0x14f/0x240
>>   cpu_startup_entry+0x30/0x40
>>   start_secondary+0x128/0x160
>>   common_startup_64+0x13e/0x141
>>
>> cgroup hierarchy:
>>
>>            root
>>          /      \
>>          A*     ...
>>       /  |  \   ...
>>          B* ...
>>
>> Debugging shows the following:
>> A and B are configured with relatively small quota and large period.
>>
>> At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
>> the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.
>>
>> Resetting task_group B quota will set cfs_rq_B runtime_remaining to 0 in
>> tg_set_cfs_bandwidth().
>> Since task_group A is throttled, Therefore, task on cfs_rq_B cannot run,
>> and runtime_remaining stays 0. With task_group B has a small quota,
>> tasks on other CPUs in task_group B quickly consume all of
>> cfs_b_B->runtime, causing cfs_b_B->runtime to be 0.
>>
>> When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) will
>> re-queues task. However, because cfs_rq_B->runtime_remaining still 0,
>> and it cannot obtain runtime from cfs_b_B->runtime either. Therefore,
>> the task will be throttled in
>> enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
>> triggering a non-empty throttled_limbo_list warning in tg_throttle_down().
>>
>> Root Cause:
>> In unthrottle_cfs_rq(), we only checked cfs_rq_A->runtime_remaining, but
>> enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
>> level be greater than 0.
>>
>> Solution:
>> One way to fix this warning is to add a runtime_remaining check for
>> each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes code
>> strange and complicated.
>> Another straightforward and simple solution is to add a new enqueue flag
>> to ensure that enqueue in tg_unthrottle_up() will not immediately trigger
>> throttling. This may enqueue sched_entity with no remaining runtime, which
>> is not a big deal because the current kernel also has such situations [1].
>>
>> We still retain the runtime_remaining check in unthrottle_cfs_rq() for
>> higher-level cfs_rq to avoid enqueuing many entities with
>> runtime_remaining < 0.
>>
>> Also remove the redundant assignment to se in tg_throttle_down().
>>
>> [1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance
>>
>> Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
>> Reported-by: Aaron Lu <ziqianlu@bytedance.com>
>> Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
>> Signed-off-by: Hao Jia <jiahao1@lixiang.com>
> 
> Tested-by: Aaron Lu <ziqianlu@bytedance.com>
> 
>> ---
>> Reproduction steps:
>> (1) Create a test.sh script and run it:
>> #!/bin/bash
>> set -e
>> CGP_ROOT="/sys/fs/cgroup/"
>> if ! mount | grep -q "type cgroup2"; then
>> 	echo "ERROR: Not in cgroup v2 mode."
>> 	exit 1
>> fi
>> echo 1 > /sys/kernel/debug/clear_warn_once
>> mkdir -p ${CGP_ROOT}/test
>> echo "+cpu +cpuset" > ${CGP_ROOT}/test/cgroup.subtree_control
>> for i in $(seq 1 2); do
>> 	SUB_DIR=${CGP_ROOT}/test/sub$i
>> 	mkdir -p ${SUB_DIR}
>> 	echo $$ > ${SUB_DIR}/cgroup.procs
>> 	perf bench sched messaging -g 10 -t -l 50000 &
>> 	echo $$ > /sys/fs/cgroup/cgroup.procs
>> 	echo "1000 100000" > ${SUB_DIR}/cpu.max
>> 	echo "Started stress in stress_sub$i"
>> done
>> echo "1000 100000" > ${CGP_ROOT}/test/cpu.max
>>
>> (2) Run the following command multiple times until the warning is triggered:
>> echo 1000 10000 > /sys/fs/cgroup/test/sub1/cpu.max
>>
>>   kernel/sched/fair.c  | 15 ++++++++++-----
>>   kernel/sched/sched.h |  3 +++
>>   2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b58e696d7ccc..7721466dc8f2 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5287,7 +5287,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>   	se->on_rq = 1;
>>   
>>   	if (cfs_rq->nr_queued == 1) {
>> -		check_enqueue_throttle(cfs_rq);
>> +		if (!(flags & ENQUEUE_THROTTLE))
>> +			check_enqueue_throttle(cfs_rq);
>> +
> 
> nit: might be better to place this check inside check_enqueue_throttle()
> after cfs_bandwidth_used()? I was thinking: in this way, it might save
> some cycles for systems with CONFIG_CFS_BANDWIDTH set but don't actually
> set any quota.

Thanks for your suggestion, I will do it in the next version.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 671cbf8dc00ee..688f74ee145c9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5229,7 +5229,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>   	se->deadline = se->vruntime + vslice;
>   }
>   
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags);
>   static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
>   
>   static void
> @@ -5287,8 +5287,7 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>   	se->on_rq = 1;
>   
>   	if (cfs_rq->nr_queued == 1) {
> -		if (!(flags & ENQUEUE_THROTTLE))
> -			check_enqueue_throttle(cfs_rq);
> +		check_enqueue_throttle(cfs_rq, flags);
>   
>   		list_add_leaf_cfs_rq(cfs_rq);
>   #ifdef CONFIG_CFS_BANDWIDTH
> @@ -6408,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>    * expired/exceeded, otherwise it may be allowed to steal additional ticks of
>    * runtime as update_curr() throttling can not trigger until it's on-rq.
>    */
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
>   {
>   	if (!cfs_bandwidth_used())
>   		return;
> @@ -6421,6 +6420,10 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>   	if (cfs_rq_throttled(cfs_rq))
>   		return;
>   
> +	/* Do not attempt throttle on cfs_rq's unthrottle path */
> +	if (flags & ENQUEUE_THROTTLE)
> +		return;
> +
>   	/* update runtime allocation */
>   	account_cfs_rq_runtime(cfs_rq, 0);
>   	if (cfs_rq->runtime_remaining <= 0)
> @@ -6719,7 +6722,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
>   
>   static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
>   static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags) {}
>   static inline void sync_throttle(struct task_group *tg, int cpu) {}
>   static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>   static void task_throttle_setup_work(struct task_struct *p) {}
> 
>>   		list_add_leaf_cfs_rq(cfs_rq);
>>   #ifdef CONFIG_CFS_BANDWIDTH
>>   		if (cfs_rq->pelt_clock_throttled) {
>> @@ -5912,7 +5914,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>>   	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>>   		list_del_init(&p->throttle_node);
>>   		p->throttled = false;
>> -		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
>> +		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
>>   	}
>>   
>>   	/* Add cfs_rq with load or one or more already running entities to the list */
>> @@ -6029,15 +6031,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>>   	 * unthrottled us with a positive runtime_remaining but other still
>>   	 * running entities consumed those runtime before we reached here.
>>   	 *
>> -	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
>> +	 * We can't unthrottle this cfs_rq without any runtime remaining
>>   	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
>>   	 * throttle, which is not supposed to happen on unthrottle path.
>> +	 *
>> +	 * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
>> +	 * tg_unthrottle_up() do not trigger a throttle, we still retain the check
>> +	 * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
>> +	 * entities whose runtime_remaining is less than 0.
>>   	 */
>>   	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>>   		return;
>>   
>> -	se = cfs_rq->tg->se[cpu_of(rq)];
>> -
>>   	cfs_rq->throttled = 0;
>>   
>>   	update_rq_clock(rq);
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index e7718f12bc55..6f45e00d1fc2 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -2368,6 +2368,8 @@ extern const u32		sched_prio_to_wmult[40];
>>    * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
>>    * ENQUEUE_MIGRATED  - the task was migrated during wakeup
>>    * ENQUEUE_RQ_SELECTED - ->select_task_rq() was called
>> + * ENQUEUE_THROTTLE  - Called in tg_unthrottle_up() to ensure that
>> + *                     task can be enqueued during unthrottle
>>    *
>>    * XXX SAVE/RESTORE in combination with CLASS doesn't really make sense, but
>>    * SCHED_DEADLINE seems to rely on this for now.
>> @@ -2399,6 +2401,7 @@ extern const u32		sched_prio_to_wmult[40];
>>   #define ENQUEUE_MIGRATED	0x00040000
>>   #define ENQUEUE_INITIAL		0x00080000
>>   #define ENQUEUE_RQ_SELECTED	0x00100000
>> +#define ENQUEUE_THROTTLE	0x00200000
> 
> Another nit: now EN/DEQUEUE_THROTTLE is a pair, they could share the
> same value in low word?
> 

I will do it in the next version.

Thanks,
Hao

next prev parent reply	other threads:[~2025-10-24  6:51 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-23 12:12 [PATCH] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down() Hao Jia
2025-10-24  3:39 ` Aaron Lu
2025-10-24  6:51   ` Hao Jia [this message]
2025-10-24  4:36 ` K Prateek Nayak
2025-10-24  6:58   ` Hao Jia

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ee79013d-2fc5-b103-ce75-a9fcb5ce4977@gmail.com \
    --to=jiahao.kernel@gmail.com \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=jiahao1@lixiang.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mgorman@suse.de \
    --cc=mingo@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.