Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Hao Jia <jiahao.kernel@gmail.com>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: "Valentin Schneider" <vschneid@redhat.com>,
	"Ben Segall" <bsegall@google.com>,
	"K Prateek Nayak" <kprateek.nayak@amd.com>,
	"Peter Zijlstra" <peterz@infradead.org>,
	"Chengming Zhou" <chengming.zhou@linux.dev>,
	"Josh Don" <joshdon@google.com>, "Ingo Molnar" <mingo@redhat.com>,
	"Vincent Guittot" <vincent.guittot@linaro.org>,
	"Xi Wang" <xii@google.com>,
	linux-kernel@vger.kernel.org,
	"Juri Lelli" <juri.lelli@redhat.com>,
	"Dietmar Eggemann" <dietmar.eggemann@arm.com>,
	"Steven Rostedt" <rostedt@goodmis.org>,
	"Mel Gorman" <mgorman@suse.de>,
	"Chuyi Zhou" <zhouchuyi@bytedance.com>,
	"Jan Kiszka" <jan.kiszka@siemens.com>,
	"Florian Bezdeka" <florian.bezdeka@siemens.com>,
	"Songtang Liu" <liusongtang@bytedance.com>,
	"Chen Yu" <yu.c.chen@intel.com>,
	"Matteo Martelli" <matteo.martelli@codethink.co.uk>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Sebastian Andrzej Siewior" <bigeasy@linutronix.de>
Subject: Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining
Date: Thu, 16 Oct 2025 19:04:30 +0800	[thread overview]
Message-ID: <eaedc19e-8647-ab3b-c09b-a8602d193011@gmail.com> (raw)
In-Reply-To: <20251016092300.GB32@bytedance>



On 2025/10/16 17:23, Aaron Lu wrote:
> On Thu, Oct 16, 2025 at 03:49:15PM +0800, Hao Jia wrote:
>>
>> Hi Aaron,
>>
>> On 2025/10/16 14:54, Aaron Lu wrote:
>>> On Wed, Oct 15, 2025 at 06:21:01PM +0800, Hao Jia wrote:
>>>> On 2025/10/15 16:40, Aaron Lu wrote:
>>> ... ...
>>>>> Hao Jia,
>>>>>
>>>>> Do I understand you correctly that you can only hit the newly added
>>>>> debug warn in tg_unthrottle_up():
>>>>> WARN_ON_ONCE(cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0);
>>>>> but not throttle triggered on unthrottle path?
>>>>>
>>>>
>>>> yes. but I'm not sure if there are other corner cases where
>>>> cfs_rq->runtime_remaining <= 0 and cfs_rq->curr is NULL.
>>>>
>>>
>>> Right, I'm not aware of any but might be possible.
>>>
>>>>> BTW, I think your change has the advantage of being straightforward and
>>>>> easy to reason about. My concern is, it's not efficient to enqueue tasks
>>>>> to a cfs_rq that has no runtime left, not sure how big a deal that is
>>>>> though.
>>>>
>>>> Yes, but that's what we're doing now. The case described above involves
>>>> enqueue a task where cfs_rq->runtime_remaining <= 0.
>>>>
>>>> I previously tried adding a runtime_remaining check for each level of task
>>>> p's cfs_rq in unthrottle_cfs_rq()/tg_unthrottle_up(), but this made the code
>>>> strange and complicated.
>>>
>>> Agree that adding a runtime_remaining check for each level in
>>> unthrottle_cfs_rq() looks too complex.
>>>
>>> So I think you approach is fine, feel free to submit a formal patch.
>>> With your change, theoretically we do not need to do those
>>> runtime_remaining check in unthrottle_cfs_rq() but keeping that check
>>> could save us some unnecessary enqueues, so I'll leave it to you to
>>> decide if you want to keep it or not. If you want to keep it, please
>>> also change its comments because the current comments will be stale
>>> then.
>>>
>>
>> Thank you for your suggestion. I'll send a formal patch later.
>>
>> I'm also happy for you to submit a patch for the next version. This warning
>> needs to be fixed, regardless of the method.
> 
> With your change, task enqueue in unthrottle path will not call
> check_enqueue_path(), thus the warn on non-empty limbo list in
> tg_throttle_down() should not happen, so I suppose we do not need
> this patch anymore, no?

Yes, I mean maybe the maintainer thinks your patch is more suitable.

> 
>>
>> However, I've discovered a minor bug in your current patch.
>>
>> In kernel/sched/core.c tg_set_cfs_bandwidth()
>>
>> ...
>> if (cfs_rq->runtime_enabled && !cfs_rq->throttled) {
>>      update_rq_clock(rq);   <----
>>      throttle_cfs_rq(cfs_rq);
>> }
>> ...
>>
>> Call update_rq_clock() to avoid the warning about using an outdated rq_clock
>> in tg_throttle_down()->rq_clock_pelt().
> 
> With the above said, this shouldn't matter anymore but just out of
> curiosity: did you notice this by inspecting the code or actually
> hitting the warning about using an outdated rq clock?
> 
> Per my understanding, most likely: __assign_cfs_rq_runtime() in
> throttle_cfs_rq(cfs_rq) will grant 1ns runtime to cfs_rq so it won't
> reach tg_throttle_down(). The comment I added above that if condition
> is kind of misleading though.


I did encounter this once.

perhaps in the following corner case:

If cfs_b->quota is set low (and quota is set at each level), and there 
are a large number of CPUs.

After tg_set_cfs_bandwidth()->__refill_cfs_bandwidth_runtime(), we 
release cfs_b->lock. cfs_b->runtime might be consumed by cfs_rq on other 
CPUs.

Then, on one online CPU, we can't get 1ns runtime via 
__assign_cfs_rq_runtime(). Current limiting is triggered on this CPU, 
and tg_throttle_down()->rq_clock_pelt() is called.


           ------------[ cut here ]------------
WARNING: kernel/sched/sched.h:1681 at tg_throttle_down+0x106/0x110, 
CPU#4:          CPU: 4 UID: 0 PID: 7840 Comm: test_cgroup.sh Kdump: 
loaded Not tainted 6.17.0+ #94 PREEMPT(voluntary)

           Call Trace:
            <TASK>
            walk_tg_tree_from+0x39/0xd0
            ? __pfx_tg_throttle_down+0x10/0x10
            throttle_cfs_rq+0xea/0x210
            tg_set_bandwidth+0x31f/0x4d0
            cpu_max_write+0xc3/0x130
            cgroup_file_write+0x92/0x1a0
            ? __check_object_size+0x27a/0x300
            kernfs_fop_write_iter+0x15f/0x1f0
            vfs_write+0x31b/0x430
            ksys_write+0x6d/0xf0
            __x64_sys_write+0x1d/0x30
            x64_sys_call+0x1900/0x2760
            do_syscall_64+0x83/0x8c0
            ? ksys_dup3+0x9d/0x120
            ? filp_flush+0x96/0xb0
            ? __x64_sys_close+0x42/0x90
            ? x64_sys_call+0x1c48/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? do_syscall_64+0xbc/0x8c0
            ? x64_sys_call+0x2404/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? do_sys_openat2+0x8e/0xd0
            ? __x64_sys_openat+0x58/0xa0
            ? x64_sys_call+0x101f/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? count_memcg_events+0xf1/0x1e0
            ? get_close_on_exec+0x3b/0x50
            ? do_fcntl+0x27a/0x7d0
            ? handle_mm_fault+0x1d2/0x2b0
            ? __x64_sys_fcntl+0x9d/0x130
            ? x64_sys_call+0x2404/0x2760
            ? do_syscall_64+0xbc/0x8c0
            ? exc_page_fault+0x97/0x1b0
            entry_SYSCALL_64_after_hwframe+0x76/0x7e

Thanks,
Hao

next prev parent reply	other threads:[~2025-10-16 11:04 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-29  7:46 [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with zero runtime_remaining Aaron Lu
2025-09-29  9:34 ` K Prateek Nayak
2025-09-29 10:55   ` Aaron Lu
2025-09-30  7:56   ` Aaron Lu
2025-09-30  8:58     ` K Prateek Nayak
2025-09-30  9:27       ` Aaron Lu
2025-09-30 11:07       ` Aaron Lu
2025-09-30 12:39         ` Aaron Lu
2025-09-30 13:38         ` K Prateek Nayak
2025-10-01 11:58           ` Aaron Lu
2025-10-14  7:43 ` Hao Jia
2025-10-14  9:11   ` Aaron Lu
2025-10-14 11:01     ` Hao Jia
2025-10-14 11:50       ` Aaron Lu
2025-10-15  1:43         ` Hao Jia
2025-10-15  1:48           ` Hao Jia
2025-10-15  2:51           ` Aaron Lu
2025-10-15  6:31             ` Hao Jia
2025-10-15  8:40               ` Aaron Lu
2025-10-15 10:21                 ` Hao Jia
2025-10-16  6:54                   ` Aaron Lu
2025-10-16  7:49                     ` Hao Jia
2025-10-16  9:23                       ` Aaron Lu
2025-10-16 11:04                         ` Hao Jia [this message]
2025-10-16 11:46                           ` Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=eaedc19e-8647-ab3b-c09b-a8602d193011@gmail.com \
    --to=jiahao.kernel@gmail.com \
    --cc=bigeasy@linutronix.de \
    --cc=bsegall@google.com \
    --cc=chengming.zhou@linux.dev \
    --cc=dietmar.eggemann@arm.com \
    --cc=florian.bezdeka@siemens.com \
    --cc=jan.kiszka@siemens.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liusongtang@bytedance.com \
    --cc=matteo.martelli@codethink.co.uk \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=mkoutny@suse.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=yu.c.chen@intel.com \
    --cc=zhouchuyi@bytedance.com \
    --cc=ziqianlu@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.