From: Chengming Zhou <chengming.zhou@linux.dev>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: Valentin Schneider <vschneid@redhat.com>,
Ben Segall <bsegall@google.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Peter Zijlstra <peterz@infradead.org>,
Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Xi Wang <xii@google.com>,
linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Florian Bezdeka <florian.bezdeka@siemens.com>
Subject: Re: [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task
Date: Fri, 23 May 2025 17:53:55 +0800 [thread overview]
Message-ID: <a0d4f499-1222-45da-a3ea-e9d445b81c87@linux.dev> (raw)
In-Reply-To: <20250523094106.GA1210419@bytedance>
On 2025/5/23 17:42, Aaron Lu wrote:
> On Fri, May 23, 2025 at 05:13:35PM +0800, Chengming Zhou wrote:
>> On 2025/5/23 15:56, Aaron Lu wrote:
>>> On Fri, May 23, 2025 at 10:43:53AM +0800, Chengming Zhou wrote:
>>>> On 2025/5/20 18:41, Aaron Lu wrote:
>>>>> On task group change, for tasks whose on_rq equals to TASK_ON_RQ_QUEUED,
>>>>> core will dequeue it and then requeued it.
>>>>>
>>>>> The throttled task is still considered as queued by core because p->on_rq
>>>>> is still set so core will dequeue it, but since the task is already
>>>>> dequeued on throttle in fair, handle this case properly.
>>>>>
>>>>> Affinity and sched class change is similar.
>>>>
>>>> How about setting p->on_rq to 0 when throttled? which is the fact that
>>>> the task is not on cfs queue anymore, does this method cause any problem?
>>>>
>>>
>>> On task group change/affinity change etc. if the throttled task is
>>> regarded as !on_rq, then it will miss the chance to be enqueued to the
>>> new(and correct) cfs_rqs, instead, it will be enqueued back to its
>>> original cfs_rq on unthrottle which breaks affinity or task group
>>
>> Yeah, this is indeed a problem, I was thinking to delete the throttled task
>> from the cfs_rq limbo list, then add it to another cfs_rq limbo list or cfs_rq
>> runnable tree based on the new cfs_rq's throttle status.
>
> Only work when the task is still handled by fair :)
>
>>
>> But it's much complex compared with your current method.
>>
>>> settings. We may be able to do something in tg_unthrottle_up() to take
>>> special care of these situations, but it seems a lot of headaches.
>>>
>>> Also, for task group change, if the new task group does not have throttle
>>> setting, that throttled task should be allowed to run immediately instead
>>> of waiting for its old cfs_rq's unthrottle event. Similar is true when
>>> this throttled task changed its sched class, like from fair to rt.
>>>
>>> Makes sense?
>>
>> Ok, the another problem of the current method I can think of is the PELT maintenance,
>> we skip the actual dequeue_task_fair() process, which includes PELT detach, we just
>> delete it from the cfs_rq limbo list, so it can result in PELT maintenance error.
>>
>
> There are corresponding callbacks that handle this, e.g. for task group
> change, there is task_change_group_fair() that handles PELT detach; for
> affinity change, there is migrate_task_rq_fair() does that and for sched
> class change, there is switched_from/to() does that.
>
> Or do I miss anything?
migrate_task_rq_fair() only do it when !task_on_rq_migrating(p), which is wakeup migrate,
because we already do detach in dequeue_task_fair() for on_rq task migration...
You can see the DO_DETACH flag in update_load_avg() called from dequeue_entity().
Thanks!
>
> Thanks,
> Aaron
>
>>>>>
>>>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>>>> ---
>>>>> kernel/sched/fair.c | 24 ++++++++++++++++++++++++
>>>>> 1 file changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index 74bc320cbc238..4c66fd8d24389 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -5866,6 +5866,10 @@ static void throttle_cfs_rq_work(struct callback_head *work)
>>>>> update_rq_clock(rq);
>>>>> WARN_ON_ONCE(!list_empty(&p->throttle_node));
>>>>> dequeue_task_fair(rq, p, DEQUEUE_SLEEP | DEQUEUE_SPECIAL);
>>>>> + /*
>>>>> + * Must not add it to limbo list before dequeue or dequeue will
>>>>> + * mistakenly regard this task as an already throttled one.
>>>>> + */
>>>>> list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>>>>> resched_curr(rq);
>>>>> }
>>>>> @@ -5881,6 +5885,20 @@ void init_cfs_throttle_work(struct task_struct *p)
>>>>> INIT_LIST_HEAD(&p->throttle_node);
>>>>> }
>>>>> +static void dequeue_throttled_task(struct task_struct *p, int flags)
>>>>> +{
>>>>> + /*
>>>>> + * Task is throttled and someone wants to dequeue it again:
>>>>> + * it must be sched/core when core needs to do things like
>>>>> + * task affinity change, task group change, task sched class
>>>>> + * change etc.
>>>>> + */
>>>>> + WARN_ON_ONCE(p->se.on_rq);
>>>>> + WARN_ON_ONCE(flags & DEQUEUE_SLEEP);
>>>>> +
>>>>> + list_del_init(&p->throttle_node);
>>>>> +}
>>>>> +
>>>>> static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags);
>>>>> static int tg_unthrottle_up(struct task_group *tg, void *data)
>>>>> {
>>>>> @@ -6834,6 +6852,7 @@ static inline void sync_throttle(struct task_group *tg, int cpu) {}
>>>>> static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>>>>> static void task_throttle_setup_work(struct task_struct *p) {}
>>>>> static bool task_is_throttled(struct task_struct *p) { return false; }
>>>>> +static void dequeue_throttled_task(struct task_struct *p, int flags) {}
>>>>> static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq)
>>>>> {
>>>>> @@ -7281,6 +7300,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>>>>> */
>>>>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>>> {
>>>>> + if (unlikely(task_is_throttled(p))) {
>>>>> + dequeue_throttled_task(p, flags);
>>>>> + return true;
>>>>> + }
>>>>> +
>>>>> if (!(p->se.sched_delayed && (task_on_rq_migrating(p) || (flags & DEQUEUE_SAVE))))
>>>>> util_est_dequeue(&rq->cfs, p);
next prev parent reply other threads:[~2025-05-23 9:54 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-20 10:41 [PATCH 0/7] Defer throttle when task exits to user Aaron Lu
2025-05-20 10:41 ` [PATCH 1/7] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-05-21 8:48 ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 2/7] sched/fair: prepare throttle path " Aaron Lu
2025-05-20 12:02 ` Florian Bezdeka
2025-05-21 6:37 ` Aaron Lu
2025-05-21 11:51 ` Aaron Lu
2025-05-21 9:01 ` Chengming Zhou
2025-05-21 9:21 ` [External] " Aaron Lu
2025-05-22 11:43 ` Chengming Zhou
2025-05-23 8:03 ` Aaron Lu
2025-05-22 10:48 ` Peter Zijlstra
2025-05-22 11:44 ` Aaron Lu
2025-05-22 11:54 ` Peter Zijlstra
2025-05-22 12:40 ` Aaron Lu
2025-05-23 9:53 ` Aaron Lu
2025-05-23 10:52 ` Peter Zijlstra
2025-05-23 11:17 ` Aaron Lu
2025-05-22 11:07 ` Peter Zijlstra
2025-05-23 7:40 ` Aaron Lu
2025-05-29 11:51 ` Aaron Lu
2025-05-30 5:36 ` K Prateek Nayak
2025-05-30 11:02 ` Aaron Lu
2025-05-23 12:35 ` Peter Zijlstra
2025-05-20 10:41 ` [PATCH 3/7] sched/fair: prepare unthrottle " Aaron Lu
2025-05-20 10:41 ` [PATCH 4/7] sched/fair: Take care of group/affinity/sched_class change for throttled task Aaron Lu
2025-05-22 12:03 ` Peter Zijlstra
2025-05-22 12:49 ` Aaron Lu
2025-05-23 14:59 ` Peter Zijlstra
2025-05-26 11:36 ` Aaron Lu
2025-05-27 6:58 ` Aaron Lu
2025-05-27 11:19 ` K Prateek Nayak
2025-05-27 11:54 ` Aaron Lu
2025-05-27 14:16 ` K Prateek Nayak
2025-05-23 2:43 ` Chengming Zhou
2025-05-23 7:56 ` Aaron Lu
2025-05-23 9:13 ` Chengming Zhou
2025-05-23 9:42 ` Aaron Lu
2025-05-23 9:53 ` Chengming Zhou [this message]
2025-05-23 11:59 ` Aaron Lu
2025-05-26 13:14 ` Chengming Zhou
2025-05-20 10:41 ` [PATCH 5/7] sched/fair: switch to task based throttle model Aaron Lu
2025-05-20 10:41 ` [PATCH 6/7] sched/fair: task based throttle time accounting Aaron Lu
2025-05-20 10:41 ` [PATCH 7/7] sched/fair: get rid of throttled_lb_pair() Aaron Lu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=a0d4f499-1222-45da-a3ea-e9d445b81c87@linux.dev \
--to=chengming.zhou@linux.dev \
--cc=bsegall@google.com \
--cc=dietmar.eggemann@arm.com \
--cc=florian.bezdeka@siemens.com \
--cc=jan.kiszka@siemens.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=xii@google.com \
--cc=zhouchuyi@bytedance.com \
--cc=ziqianlu@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.