From: Chengming Zhou <chengming.zhou@linux.dev>
To: Aaron Lu <ziqianlu@bytedance.com>
Cc: Valentin Schneider <vschneid@redhat.com>,
Ben Segall <bsegall@google.com>,
K Prateek Nayak <kprateek.nayak@amd.com>,
Peter Zijlstra <peterz@infradead.org>,
Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Xi Wang <xii@google.com>,
linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Florian Bezdeka <florian.bezdeka@siemens.com>
Subject: Re: [PATCH v2 3/5] sched/fair: Switch to task based throttle model
Date: Thu, 19 Jun 2025 20:02:23 +0800
Message-ID: <b598f584-e9fd-4160-9ce7-d328fee9f5d2@linux.dev>
In-Reply-To: <20250618111913.GA646@bytedance>

On 2025/6/18 19:19, Aaron Lu wrote:
> Hi Chengming,
>
> Thanks for your review.
>
> On Wed, Jun 18, 2025 at 05:55:08PM +0800, Chengming Zhou wrote:
>> On 2025/6/18 16:19, Aaron Lu wrote:
>>> From: Valentin Schneider <vschneid@redhat.com>
>>>
>>> In the current throttle model, when a cfs_rq is throttled, its entity is
>>> dequeued from the cpu's rq, so tasks attached to it cannot run,
>>> thus achieving the throttle target.
>>>
>>> This has a drawback though: assume a task is a reader of a percpu_rwsem
>>> and is waiting. When it gets woken, it cannot run until its task group's
>>> next period comes, which can be a relatively long time. A waiting writer
>>> will have to wait longer because of this, and further readers will
>>> build up and eventually lead to hung tasks.
>>>
>>> To improve this situation, change the throttle model to be task based, i.e.
>>> when a cfs_rq is throttled, record its throttled status but do not remove
>>> it from the cpu's rq. Instead, for tasks that belong to this cfs_rq, when
>>> they get picked, add a task work to them so that when they return
>>> to user space, they can be dequeued there. In this way, throttled tasks
>>> will not hold any kernel resources. On unthrottle, enqueue those
>>> tasks back so they can continue to run.
>>>
>>> Throttled cfs_rq's leaf_cfs_rq_list is handled differently now: since a
>>> task can be enqueued to a throttled cfs_rq and get to run, to not break
>>> the assert_list_leaf_cfs_rq() in enqueue_task_fair(), always add the cfs_rq
>>> to the leaf cfs_rq list when it has its first entity enqueued and delete it
>>> from the leaf cfs_rq list when it has no tasks enqueued.
>>>
>>> Suggested-by: Chengming Zhou <chengming.zhou@linux.dev> # tag on pick
>>> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
>>> Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
>>> ---
>>> kernel/sched/fair.c | 325 +++++++++++++++++++++-----------------------
>>> 1 file changed, 153 insertions(+), 172 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 8226120b8771a..59b372ffae18c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5291,18 +5291,17 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>  	if (cfs_rq->nr_queued == 1) {
>>>  		check_enqueue_throttle(cfs_rq);
>>> -		if (!throttled_hierarchy(cfs_rq)) {
>>> -			list_add_leaf_cfs_rq(cfs_rq);
>>> -		} else {
>>> +		list_add_leaf_cfs_rq(cfs_rq);
>>>  #ifdef CONFIG_CFS_BANDWIDTH
>>> +		if (throttled_hierarchy(cfs_rq)) {
>>>  			struct rq *rq = rq_of(cfs_rq);
>>>  			if (cfs_rq_throttled(cfs_rq) && !cfs_rq->throttled_clock)
>>>  				cfs_rq->throttled_clock = rq_clock(rq);
>>>  			if (!cfs_rq->throttled_clock_self)
>>>  				cfs_rq->throttled_clock_self = rq_clock(rq);
>>> -#endif
>>>  		}
>>> +#endif
>>>  	}
>>>  }
>>> @@ -5341,8 +5340,6 @@ static void set_delayed(struct sched_entity *se)
>>>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>  		cfs_rq->h_nr_runnable--;
>>> -		if (cfs_rq_throttled(cfs_rq))
>>> -			break;
>>>  	}
>>>  }
>>> @@ -5363,8 +5360,6 @@ static void clear_delayed(struct sched_entity *se)
>>>  		struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>>  		cfs_rq->h_nr_runnable++;
>>> -		if (cfs_rq_throttled(cfs_rq))
>>> -			break;
>>>  	}
>>>  }
>>> @@ -5450,8 +5445,11 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>  	if (flags & DEQUEUE_DELAYED)
>>>  		finish_delayed_dequeue_entity(se);
>>> -	if (cfs_rq->nr_queued == 0)
>>> +	if (cfs_rq->nr_queued == 0) {
>>>  		update_idle_cfs_rq_clock_pelt(cfs_rq);
>>> +		if (throttled_hierarchy(cfs_rq))
>>> +			list_del_leaf_cfs_rq(cfs_rq);
>>
>> The cfs_rq should be removed from leaf list only after
>> it has been fully decayed, not here.
>
> For a throttled cfs_rq, the intent is to preserve its load while it's
> throttled. Its pelt clock is stopped in tg_throttle_down(), so there will
> be no decay for it even if it is left on the leaf list.
Ah, right.
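To spell that out for anyone following along, here is a toy userspace
model (not kernel code; toy_pelt_rq, toy_tick() and TOY_PELT_Y are names
made up for illustration). It only assumes the PELT property that a
contribution halves every 32 periods, and shows that a cfs_rq whose
clock is frozen sees no decay at all, so waiting for it to become fully
decayed would never finish while it stays throttled:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per-period decay factor y, chosen so that y^32 is about 0.5. */
#define TOY_PELT_Y 0.978572062

/* Toy model only: none of these names exist in kernel/sched. */
struct toy_pelt_rq {
	uint64_t clock_pelt;	/* PELT time seen by this cfs_rq, in periods */
	bool	 throttled;	/* true while the PELT clock is frozen */
	double	 load_avg;	/* decaying load average */
};

/* Let 'periods' of wall time pass; a throttled cfs_rq sees none of it. */
static void toy_tick(struct toy_pelt_rq *rq, uint64_t periods)
{
	assert(rq);

	if (rq->throttled)
		return;		/* clock frozen: no time passes, no decay */

	rq->clock_pelt += periods;
	for (uint64_t i = 0; i < periods; i++)
		rq->load_avg *= TOY_PELT_Y;
}
```

Starting from a load_avg of 1024, 32 unthrottled periods bring it to
roughly 512. Once throttled is set, any number of further toy_tick()
calls leaves both clock_pelt and load_avg unchanged.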
>
> I've also described why I chose this behaviour in cover letter:
> "
> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.
> "
Ok, I get it, keeping the current behavior seems reasonable to me.
Another way might be to detach the throttled task's load on dequeue and
reset its se->avg.last_update_time to 0, so that its load is attached
again on enqueue. Then we wouldn't need to stop its cfs_rq's pelt clock.
But the current approach looks simpler.
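
A rough, untested sketch of that alternative as a toy userspace model
(the toy_* names are hypothetical, and this is not what the patch does).
It treats last_update_time == 0 as "detached", as described above, and
lets the cfs_rq's clock keep running across the throttle:

```c
#include <assert.h>
#include <stdint.h>

/* Toy userspace model; the toy_* names are hypothetical, not kernel API. */
struct toy_se {
	uint64_t last_update_time;	/* 0 means "not attached to a cfs_rq" */
	long	 load_avg;		/* this entity's load contribution */
};

struct toy_cfs_rq {
	uint64_t clock_pelt;		/* never frozen in this variant */
	long	 load_avg;		/* sum of the attached entities' load */
};

/* Throttle-time dequeue: take the task's load out of the cfs_rq. */
static void toy_dequeue_throttled(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	assert(se->last_update_time != 0);	/* must currently be attached */

	cfs_rq->load_avg -= se->load_avg;
	se->last_update_time = 0;		/* mark the entity as detached */
}

/* Unthrottle-time enqueue: a detached entity gets its load attached again. */
static void toy_enqueue(struct toy_cfs_rq *cfs_rq, struct toy_se *se)
{
	/* The toy assumes clock_pelt is nonzero by the time tasks come back. */
	if (se->last_update_time == 0) {
		cfs_rq->load_avg += se->load_avg;
		se->last_update_time = cfs_rq->clock_pelt;
	}
}
```

The task's load_avg is carried unchanged across the throttle, while the
cfs_rq's own load drops by that amount until the task is enqueued again.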
Thanks!