From: Aaron Lu <ziqianlu@bytedance.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>,
Benjamin Segall <bsegall@google.com>,
Chengming Zhou <chengming.zhou@linux.dev>
Cc: Valentin Schneider <vschneid@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Xi Wang <xii@google.com>,
linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Florian Bezdeka <florian.bezdeka@siemens.com>
Subject: Re: [PATCH v2 0/5] Defer throttle when task exits to user
Date: Fri, 4 Jul 2025 15:54:30 +0800 [thread overview]
Message-ID: <20250704075430.GA1641@bytedance> (raw)
In-Reply-To: <dc54a6ab-2529-4def-ae7d-6a217e3bc1bc@amd.com>
On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
> Hello Ben,
>
> On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> > Aaron Lu <ziqianlu@bytedance.com> writes:
> >
> > > For pelt clock, I chose to keep the current behavior to freeze it on
> > > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > > its load and its corresponding sched_entity's weight. Hopefully, this can
> > > result in a stable situation for the remaining running tasks to quickly
> > > finish their jobs in kernel mode.
> >
> > I suppose the way that this would go wrong would be CPU 1 using up all
> > of the quota, and then a task waking up on CPU 2 and trying to run in
> > the kernel for a while. I suspect pelt time needs to also keep running
> > until all the tasks are asleep (and that's what we have been running at
> > google with the version based on separate accounting, so we haven't
> > accidentally done a large scale test of letting it pause).
>
> Thinking out loud ...
>
> One thing this can possibly do is create a lot of:
>
> throttled -> partially unthrottled -> throttled
>
> transitions when tasks wakeup on throttled hierarchy, run for a while,
> and then go back to sleep. If the PELT clocks aren't frozen, this
> either means:
>
> 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
> any load associated back onto the list and allow PELT to progress only
> to then remove them again once tasks go back to sleep. A great many of
> these transitions are possible theoretically which is not ideal.
>
I'm going this route, becasue it is kind of already the case now :)
I mean throttled cfs_rqs are already added back to the leaf cfs_rq
list during enqueue time, to make the assert_list_leaf_cfs_rq(rq); at
the bottom of enqueue_task_fair() happy when a task is enqueued to a
throttled cfs_rq.
I'm sorry if this is not obvious in this series, I guess I put too many
things in patch3.
Below is the diff I cooked on top of this series to keep pelt clock
running as long as there is task running in a throttled cfs_rq, does it
look sane?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d869c8b51c5a6..410b850df2a12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
se->on_rq = 1;
if (cfs_rq->nr_queued == 1) {
+ struct rq *rq = rq_of(cfs_rq);
+
check_enqueue_throttle(cfs_rq);
list_add_leaf_cfs_rq(cfs_rq);
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
}
}
@@ -5437,8 +5444,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
if (cfs_rq->nr_queued == 0) {
update_idle_cfs_rq_clock_pelt(cfs_rq);
- if (throttled_hierarchy(cfs_rq))
+ if (throttled_hierarchy(cfs_rq)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
list_del_leaf_cfs_rq(cfs_rq);
+ }
}
return true;
@@ -5873,8 +5885,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
if (--cfs_rq->throttle_count)
return 0;
- cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
- cfs_rq->throttled_clock_pelt;
+ if (cfs_rq->pelt_clock_throttled) {
+ cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+ cfs_rq->throttled_clock_pelt;
+ cfs_rq->pelt_clock_throttled = 0;
+ }
if (cfs_rq->throttled_clock_self) {
u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
@@ -5939,11 +5954,13 @@ static int tg_throttle_down(struct task_group *tg, void *data)
if (cfs_rq->throttle_count++)
return 0;
- /* group is entering throttled state, stop time */
- cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
- if (!cfs_rq->nr_queued)
+ if (!cfs_rq->nr_queued) {
+ /* group is entering throttled state, stop time */
+ cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+ cfs_rq->pelt_clock_throttled = 1;
list_del_leaf_cfs_rq(cfs_rq);
+ }
WARN_ON_ONCE(cfs_rq->throttled_clock_self);
WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
u64 throttled;
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
throttled = U64_MAX;
else
throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
/* rq->task_clock normalized against any time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
- if (unlikely(cfs_rq->throttle_count))
+ if (unlikely(cfs_rq->pelt_clock_throttled))
return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2a07537d3c12..877e40ccd8cc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -724,7 +724,8 @@ struct cfs_rq {
u64 throttled_clock_pelt_time;
u64 throttled_clock_self;
u64 throttled_clock_self_time;
- int throttled;
+ int throttled:1;
+ int pelt_clock_throttled:1;
int throttle_count;
struct list_head throttled_list;
struct list_head throttled_csd_list;
Thanks,
Aaron
> 2. Propagate the delta time where PELT was not frozen during unthrottle
> and if it isn't 0, do an update_load_avg() to sync PELT. This will
> increase the overhead of the tg_tree callback which isn't ideal
> either. It can also complicate the enqueue path since the PELT of
> the the cfs_rq hierarchy being enqueued may need correction before
> the task can be enqueued.
>
> I know Josh hates both approaches since tg_tree_walks are already very
> expensive in your use cases and it has to be done in a non-preemptible
> context holding the rq_lock but which do you think is the lesser of two
> evils? Or is there a better solution that I have completely missed?
>
> >
> > Otherwise it does look ok, so long as we're ok with increasing distribute
> > time again.
>
> --
> Thanks and Regards,
> Prateek
>
next prev parent reply other threads:[~2025-07-04 7:54 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-18 8:19 [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-06-18 8:19 ` [PATCH v2 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-06-18 8:19 ` [PATCH v2 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-06-18 9:03 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-06-18 9:55 ` Chengming Zhou
2025-06-18 11:19 ` Aaron Lu
2025-06-19 12:02 ` Chengming Zhou
2025-06-18 8:19 ` [PATCH v2 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-06-18 8:19 ` [PATCH v2 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-07-01 8:31 ` [PATCH v2 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-03 7:37 ` Peter Zijlstra
2025-07-03 11:51 ` Aaron Lu
2025-07-02 4:25 ` K Prateek Nayak
2025-07-02 8:51 ` Aaron Lu
2025-07-02 22:00 ` Benjamin Segall
2025-07-03 6:34 ` Aaron Lu
2025-07-04 4:34 ` K Prateek Nayak
2025-07-04 7:54 ` Aaron Lu [this message]
2025-07-04 8:48 ` K Prateek Nayak
2025-07-04 9:47 ` Aaron Lu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250704075430.GA1641@bytedance \
--to=ziqianlu@bytedance.com \
--cc=bsegall@google.com \
--cc=chengming.zhou@linux.dev \
--cc=dietmar.eggemann@arm.com \
--cc=florian.bezdeka@siemens.com \
--cc=jan.kiszka@siemens.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=xii@google.com \
--cc=zhouchuyi@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.