public inbox for linux-kernel@vger.kernel.org
From: Aaron Lu <ziqianlu@bytedance.com>
To: Benjamin Segall <bsegall@google.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Xi Wang <xii@google.com>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,
	Jan Kiszka <jan.kiszka@siemens.com>,
	Florian Bezdeka <florian.bezdeka@siemens.com>,
	Songtang Liu <liusongtang@bytedance.com>,
	Chen Yu <yu.c.chen@intel.com>,
	Matteo Martelli <matteo.martelli@codethink.co.uk>,
	Michal Koutný <mkoutny@suse.com>,
	Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: Re: [PATCH v4 3/5] sched/fair: Switch to task based throttle model
Date: Thu, 4 Sep 2025 20:04:01 +0800
Message-ID: <20250904120401.GJ42@bytedance>
In-Reply-To: <20250904081611.GE42@bytedance>

On Thu, Sep 04, 2025 at 04:16:11PM +0800, Aaron Lu wrote:
> On Wed, Sep 03, 2025 at 01:46:48PM -0700, Benjamin Segall wrote:
> > K Prateek Nayak <kprateek.nayak@amd.com> writes:
> > 
> > > Hello Peter,
> > >
> > > On 9/3/2025 8:21 PM, Peter Zijlstra wrote:
> > >>>  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >>>  {
> > >>> +	if (task_is_throttled(p)) {
> > >>> +		dequeue_throttled_task(p, flags);
> > >>> +		return true;
> > >>> +	}
> > >>> +
> > >>>  	if (!p->se.sched_delayed)
> > >>>  		util_est_dequeue(&rq->cfs, p);
> > >>>  
> > >> 
> > >> OK, so this makes it so that either a task is fully enqueued (all
> > >> cfs_rq's) or fully not. A group cfs_rq is only marked throttled when all
> > >> its tasks are gone, and unthrottled when a task gets added. Right?
> > >
> > > cfs_rq (and the hierarchy below) is marked throttled when the quota
> > > has elapsed. Tasks on the throttled hierarchies will dequeue
> > > themselves completely via task work added during pick. When the last
> > > task leaves on a cfs_rq of throttled hierarchy, PELT is frozen for
> > > that cfs_rq.
> > >
> > > When a new task is added on the hierarchy, the PELT is unfrozen and
> > > the task becomes runnable. The cfs_rq and the hierarchy is still
> > > marked throttled.
> > >
> > > Unthrottling of hierarchy is only done at distribution.
> > >
> > >> 
> > >> But propagate_entity_cfs_rq() is still doing the old thing, and has a
> > >> if (cfs_rq_throttled(cfs_rq)) break; inside the for_each_sched_entity()
> > >> iteration.
> > >> 
> > >> This seems somewhat inconsistent; or am I missing something ? 
> > >
> > > Probably an oversight. But before that, what was the reason to have
> > > stopped this propagation at throttled_cfs_rq() before the changes?
> > >
> > 
> > Yeah, this was one of the things I was (slowly) looking at - with this
> > series we currently still abort in:
> > 
> > 1) update_cfs_group
> > 2) dequeue_entities's set_next_buddy
> > 3) check_preempt_fair
> > 4) yield_to
> > 5) propagate_entity_cfs_rq
> > 
> > In the old design, on throttle we immediately remove the entire cfs_rq,
> > freeze time for it, and stop adjusting load. In the new design we still
> > pick from it, so we definitely don't want to stop time (and don't). I'm

Per my understanding, we keep the PELT clock running because we want the
throttled cfs_rq's load to continue to be updated while it still has
tasks running in kernel mode; having that up-to-date load should
hopefully give it a more accurate weight through update_cfs_group(). So
it looks to me that, if the PELT clock should not be stopped, then we
should not abort in propagate_entity_cfs_rq() and update_cfs_group(). I
missed the aborts in these two functions, but now that you and Peter
have pointed them out, I suppose there is no doubt that we should not
abort in update_cfs_group() and propagate_entity_cfs_rq()? If we should
not mess with shares distribution, then the up-to-date load is not
useful, so why not simply freeze the PELT clock on throttle :)
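
To make the effect of the abort concrete, here is a minimal user-space
sketch. It is only a toy model, not kernel code: struct toy_cfs_rq and
toy_propagate() are hypothetical names made up for illustration, and
the walk only loosely mirrors the for_each_sched_entity() loop with its
"if (cfs_rq_throttled(cfs_rq)) break;" check. It shows that with the
abort in place, levels at and above a throttled cfs_rq stop seeing load
updates, while without it every level up to the root is refreshed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in; this is NOT the kernel's struct cfs_rq. */
struct toy_cfs_rq {
	bool throttled;            /* models cfs_rq_throttled() */
	int load_updates;          /* how many times load was refreshed */
	struct toy_cfs_rq *parent; /* NULL at the root */
};

/*
 * Walk from a leaf towards the root, refreshing load at each level.
 * With abort_on_throttle set, stop at the first throttled level
 * (a simplification of the abort discussed in this thread).
 * Returns the number of levels whose load was refreshed.
 */
static int toy_propagate(struct toy_cfs_rq *leaf, bool abort_on_throttle)
{
	int updated = 0;

	assert(leaf != NULL);

	for (struct toy_cfs_rq *rq = leaf; rq; rq = rq->parent) {
		if (abort_on_throttle && rq->throttled)
			break;
		rq->load_updates++;
		updated++;
	}

	return updated;
}
```

With a three-level hierarchy whose middle level is throttled, the
aborting walk only refreshes the leaf, so the root never sees the
change; the non-aborting walk refreshes all three levels.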

> > guessing we probably also want to now adjust load for it, but it is
> > arguable - since all the cfs_rqs for the tg are likely to throttle at the
> > same time, so we might not want to mess with the shares distribution,
> > since when unthrottle comes around the most likely correct distribution
> > is the distribution we had at the time of throttle.
> >
> 
> I can give it a test to see how things change by adjusting load and share
> distribution using my previous performance tests.
>

I ran hackbench and netperf on AMD Genoa and didn't notice any obvious
difference with the cumulative diff.

> > Assuming we do want to adjust load for a throttle then we probably want
> > to remove the aborts from update_cfs_group and propagate_entity_cfs_rq.
> > I'm guessing that we need the list_add_leaf_cfs_rq from propagate, but
> > I'm not 100% sure when they are actually doing something in propagate as
> > opposed to enqueue.
> >
> 
> Yes, commit 0258bdfaff5bd("sched/fair: Fix unfairness caused by missing 
> load decay") added that list_add_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to fix a problem.
> 
> > The other 3 are the same sort of thing - scheduling pick heuristics
> > which imo are pretty arbitrary to keep. We can reasonably say that "the
> > most likely thing a task in a throttled hierarchy will do is just go
> > throttle itself, so we shouldn't buddy it or let it preempt", but it
> > would also be reasonable to let them preempt/buddy normally, in case
> > they hold locks or such.
> 
> I think we do not need to special case tasks in throttled hierarchy in
> check_preempt_wakeup_fair().
>

Since there are pros and cons either way, and considering the
performance test results, I now feel we can leave these 3 as is and
revisit them later when there is a clear case.

> > 
> > yield_to is used by kvm and st-dma-fence-chain.c. Yielding to a
> > throttle-on-exit kvm cpu thread isn't useful (so no need to remove the
> > abort there). The dma code is just yielding to a just-spawned kthread,
> > so it should be fine either way.
> 
> Got it.
> 
> The cumulative diff I'm going to experiment with is below, let me know
> if something is wrong, thanks.

Thread overview: 44+ messages
2025-08-29  8:11 [PATCH v4 0/5] Defer throttle when task exits to user Aaron Lu
2025-08-29  8:11 ` [PATCH v4 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-09-03  8:05   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-08-29  8:11 ` [PATCH v4 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-09-03  8:05   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-08-29  8:11 ` [PATCH v4 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-09-03  8:05   ` [tip: sched/core] " tip-bot2 for Valentin Schneider
2025-09-03 14:51   ` [PATCH v4 3/5] " Peter Zijlstra
2025-09-03 17:12     ` K Prateek Nayak
2025-09-03 20:27       ` Peter Zijlstra
2025-09-04  5:44         ` K Prateek Nayak
2025-09-04  7:04           ` Aaron Lu
2025-09-05 11:37             ` Aaron Lu
2025-09-05 12:53               ` Peter Zijlstra
2025-09-08 11:05                 ` [PATCH] sched/fair: Propagate load for throttled cfs_rq Aaron Lu
2025-09-09  4:20                   ` kernel test robot
2025-09-09  6:17                     ` Aaron Lu
2025-09-09  6:22                       ` K Prateek Nayak
2025-09-09  6:27                         ` Aaron Lu
2025-09-10  9:55                           ` Aaron Lu
2025-09-03 20:46       ` [PATCH v4 3/5] sched/fair: Switch to task based throttle model Benjamin Segall
2025-09-04  6:03         ` K Prateek Nayak
2025-09-09  4:10           ` Benjamin Segall
2025-09-04  8:16         ` Aaron Lu
2025-09-04  9:51           ` K Prateek Nayak
2025-09-04 11:05             ` Aaron Lu
2025-09-04 14:20               ` K Prateek Nayak
2025-09-09  3:58               ` Benjamin Segall
2025-09-09 12:03                 ` Aaron Lu
2025-09-10  3:03               ` Aaron Lu
2025-09-04 12:04           ` Aaron Lu [this message]
2025-09-05  7:53             ` Aaron Lu
2025-09-03 20:55   ` Benjamin Segall
2025-09-04 11:26     ` Aaron Lu
2025-09-04 11:30       ` Aaron Lu
2025-08-29  8:11 ` [PATCH v4 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-09-03  8:05   ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-08-29  8:11 ` [PATCH v4 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-09-03  8:05   ` [tip: sched/core] " tip-bot2 for Aaron Lu
2025-09-01 10:03 ` [PATCH v4 0/5] Defer throttle when task exits to user Peter Zijlstra
2025-12-02  8:59 ` Bezdeka, Florian
2025-12-02  9:43   ` Aaron Lu
2025-12-02 10:09     ` Florian Bezdeka
2025-12-02 12:01       ` Aaron Lu
