From: Aaron Lu <ziqianlu@bytedance.com>
To: Benjamin Segall <bsegall@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
Peter Zijlstra <peterz@infradead.org>,
Valentin Schneider <vschneid@redhat.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
Vincent Guittot <vincent.guittot@linaro.org>,
Xi Wang <xii@google.com>,
linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Mel Gorman <mgorman@suse.de>,
Chuyi Zhou <zhouchuyi@bytedance.com>,
Jan Kiszka <jan.kiszka@siemens.com>,
Florian Bezdeka <florian.bezdeka@siemens.com>,
Songtang Liu <liusongtang@bytedance.com>,
Chen Yu <yu.c.chen@intel.com>,
Matteo Martelli <matteo.martelli@codethink.co.uk>,
Michal Koutn?? <mkoutny@suse.com>,
Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model
Date: Thu, 4 Sep 2025 20:04:01 +0800 [thread overview]
Message-ID: <20250904120401.GJ42@bytedance> (raw)
In-Reply-To: <20250904081611.GE42@bytedance>
On Thu, Sep 04, 2025 at 04:16:11PM +0800, Aaron Lu wrote:
> On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote:
> > K Prateek Nayak <kprateek.nayak@amd.com> writes:
> >
> > > Hello Peter,
> > >
> > > On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
> > >>> static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >>> {
> > >>> + if (task_is_throttled(p)) {
> > >>> + dequeue_throttled_task(p, flags);
> > >>> + return true;
> > >>> + }
> > >>> +
> > >>> if (!p->se.sched_delayed)
> > >>> util_est_dequeue(&rq->cfs, p);
> > >>>
> > >>
> > >> OK, so this makes it so that either a task is fully enqueued (all
> > >> cfs_rq's) or full not. A group cfs_rq is only marked throttled when all
> > >> its tasks are gone, and unthrottled when a task gets added. Right?
> > >
> > > cfs_rq (and the hierarchy below) is marked throttled when the quota
> > > has elapsed. Tasks on the throttled hierarchies will dequeue
> > > themselves completely via task work added during pick. When the last
> > > task leaves on a cfs_rq of throttled hierarchy, PELT is frozen for
> > > that cfs_rq.
> > >
> > > When a new task is added on the hierarchy, the PELT is unfrozen and
> > > the task becomes runnable. The cfs_rq and the hierarchy is still
> > > marked throttled.
> > >
> > > Unthrottling of hierarchy is only done at distribution.
> > >
> > >>
> > >> But propagate_entity_cfs_rq() is still doing the old thing, and has a
> > >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> > >> iteration.
> > >>
> > >> This seems somewhat inconsistent; or am I missing something ?
> > >
> > > Probably an oversight. But before that, what was the reason to have
> > > stopped this propagation at throttled_cfs_rq() before the changes?
> > >
> >
> > Yeah, this was one of the things I was (slowly) looking at - with this
> > series we currently still abort in:
> >
> > 1) update_cfs_group
> > 2) dequeue_entities's set_next_buddy
> > 3) check_preempt_fair
> > 4) yield_to
> > 5) propagate_entity_cfs_rq
> >
> > In the old design on throttle immediately remove the entire cfs_rq,
> > freeze time for it, and stop adjusting load. In the new design we still
> > pick from it, so we definitely don't want to stop time (and don't). I'm
Per my understanding, we keep PELT clock running because we want the
throttled cfs_rq's load to continue get update when it still has tasks
running in kernel mode and have that up2date load could let it have a
hopefully more accurate weight through update_cfs_group(). So it looks
to me, if PELT clock should not be stopped, then we should not abort in
propagate_entity_cfs_rq() and update_cfs_group(). I missed these two
aborts in these two functions, but now you and Peter have pointed this
out, I suppose there is no doubt we should not abort in
update_cfs_group() and propagate_entity_cfs_rq()? If we should not mess
with shares distribution, then the up2date load is not useful and why
not simply freeze PELT clock on throttle :)
> > guessing we probably also want to now adjust load for it, but it is
> > arguable - since all the cfs_rqs for the tg are likely to throttle at the
> > same time, so we might not want to mess with the shares distribution,
> > since when unthrottle comes around the most likely correct distribution
> > is the distribution we had at the time of throttle.
> >
>
> I can give it a test to see how things change by adjusting load and share
> distribution using my previous performance tests.
>
Run hackbench and netperf on AMD Genoa and didn't notice any obvious
difference with the cumulated diff.
> > Assuming we do want to adjust load for a throttle then we probably want
> > to remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> > I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> > I'm not 100% sure when they are actually doing something in propagate as
> > opposed to enqueue.
> >
>
> Yes, commit 0258bdfaff5bd("sched/fair: Fix unfairness caused by missing
> load decay") added that list_add_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to fix a problem.
>
> > The other 3 are the same sort of thing - scheduling pick heuristics
> > which imo are pretty arbitrary to keep. We can reasonably say that "the
> > most likely thing a task in a throttled hierarchy will do is just go
> > throttle itself, so we shouldn't buddy it or let it preempt", but it
> > would also be reasonable to let them preempt/buddy normally, in case
> > they hold locks or such.
>
> I think we do not need to special case tasks in throttled hierarchy in
> check_preempt_wakeup_fair().
>
Since there is pros and cons either way and consider the performance
test result, I'm now feeling maybe we can leave these 3 as is and revisit
them later when there is some clear case.
> >
> > yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> > throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> > abort there). The dma code is just yielding to a just-spawned kthread,
> > so it should be fine either way.
>
> Get it.
>
> The cumulated diff I'm going to experiment is below, let me know if
> something is wrong, thanks.
next prev parent reply other threads:[~2025-09-04 12:04 UTC|newest]
Thread overview: 44+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-08-29 8:11 [PATCH v4 0/5] Defer throttle when task exits to user Aaron Lu
2025-08-29 8:11 ` [PATCH v4 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-09-03 8:05 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-08-29 8:11 ` [PATCH v4 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-09-03 8:05 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-08-29 8:11 ` [PATCH v4 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-09-03 8:05 ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-09-03 14:51 ` [PATCH v4 3/5] " Peter Zijlstra
2025-09-03 17:12 ` K Prateek Nayak
2025-09-03 20:27 ` Peter Zijlstra
2025-09-04 5:44 ` K Prateek Nayak
2025-09-04 7:04 ` Aaron Lu
2025-09-05 11:37 ` Aaron Lu
2025-09-05 12:53 ` Peter Zijlstra
2025-09-08 11:05 ` [PATCH] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-09 4:20 ` kernel test robot
2025-09-09 6:17 ` Aaron Lu
2025-09-09 6:22 ` K Prateek Nayak
2025-09-09 6:27 ` Aaron Lu
2025-09-10 9:55 ` Aaron Lu
2025-09-03 20:46 ` [PATCH v4 3/5] sched/fair: Switch to task based throttle model Benjamin Segall
2025-09-04 6:03 ` K Prateek Nayak
2025-09-09 4:10 ` Benjamin Segall
2025-09-04 8:16 ` Aaron Lu
2025-09-04 9:51 ` K Prateek Nayak
2025-09-04 11:05 ` Aaron Lu
2025-09-04 14:20 ` K Prateek Nayak
2025-09-09 3:58 ` Benjamin Segall
2025-09-09 12:03 ` Aaron Lu
2025-09-10 3:03 ` Aaron Lu
2025-09-04 12:04 ` Aaron Lu [this message]
2025-09-05 7:53 ` Aaron Lu
2025-09-03 20:55 ` Benjamin Segall
2025-09-04 11:26 ` Aaron Lu
2025-09-04 11:30 ` Aaron Lu
2025-08-29 8:11 ` [PATCH v4 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-09-03 8:05 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-08-29 8:11 ` [PATCH v4 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-09-03 8:05 ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-01 10:03 ` [PATCH v4 0/5] Defer throttle when task exits to user Peter Zijlstra
2025-12-02 8:59 ` Bezdeka, Florian
2025-12-02 9:43 ` Aaron Lu
2025-12-02 10:09 ` Florian Bezdeka
2025-12-02 12:01 ` Aaron Lu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250904120401.GJ42@bytedance \
--to=ziqianlu@bytedance.com \
--cc=bigeasy@linutronix.de \
--cc=bsegall@google.com \
--cc=chengming.zhou@linux.dev \
--cc=dietmar.eggemann@arm.com \
--cc=florian.bezdeka@siemens.com \
--cc=jan.kiszka@siemens.com \
--cc=joshdon@google.com \
--cc=juri.lelli@redhat.com \
--cc=kprateek.nayak@amd.com \
--cc=linux-kernel@vger.kernel.org \
--cc=liusongtang@bytedance.com \
--cc=matteo.martelli@codethink.co.uk \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=mkoutny@suse.com \
--cc=peterz@infradead.org \
--cc=rostedt@goodmis.org \
--cc=vincent.guittot@linaro.org \
--cc=vschneid@redhat.com \
--cc=xii@google.com \
--cc=yu.c.chen@intel.com \
--cc=zhouchuyi@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox