Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Aaron Lu <ziqianlu@bytedance.com>
To: Valentin Schneider <vschneid@redhat.com>
Cc: Ben Segall <bsegall@google.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Chengming Zhou <chengming.zhou@linux.dev>,
	Josh Don <joshdon@google.com>, Ingo Molnar <mingo@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Xi Wang <xii@google.com>,
	linux-kernel@vger.kernel.org, Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mel Gorman <mgorman@suse.de>,
	Chuyi Zhou <zhouchuyi@bytedance.com>,
	Jan Kiszka <jan.kiszka@siemens.com>,
	Florian Bezdeka <florian.bezdeka@siemens.com>,
	Songtang Liu <liusongtang@bytedance.com>
Subject: Re: [PATCH v3 3/5] sched/fair: Switch to task based throttle model
Date: Tue, 12 Aug 2025 16:48:28 +0800	[thread overview]
Message-ID: <20250812084828.GA52@bytedance> (raw)
In-Reply-To: <xhsmhsei2ox4o.mognet@vschneid-thinkpadt14sgen2i.remote.csb>

On Fri, Aug 08, 2025 at 01:45:11PM +0200, Valentin Schneider wrote:
> On 08/08/25 18:13, Aaron Lu wrote:
> > On Fri, Aug 08, 2025 at 11:12:48AM +0200, Valentin Schneider wrote:
> >> On 15/07/25 15:16, Aaron Lu wrote:
> >> > From: Valentin Schneider <vschneid@redhat.com>
> >> >
> >> > In current throttle model, when a cfs_rq is throttled, its entity will
> >> > be dequeued from cpu's rq, making tasks attached to it not able to run,
> >> > thus achiveing the throttle target.
> >> >
> >> > This has a drawback though: assume a task is a reader of percpu_rwsem
> >> > and is waiting. When it gets woken, it can not run till its task group's
> >> > next period comes, which can be a relatively long time. Waiting writer
> >> > will have to wait longer due to this and it also makes further reader
> >> > build up and eventually trigger task hung.
> >> >
> >> > To improve this situation, change the throttle model to task based, i.e.
> >> > when a cfs_rq is throttled, record its throttled status but do not remove
> >> > it from cpu's rq. Instead, for tasks that belong to this cfs_rq, when
> >> > they get picked, add a task work to them so that when they return
> >> > to user, they can be dequeued there. In this way, tasks throttled will
> >> > not hold any kernel resources. And on unthrottle, enqueue back those
> >> > tasks so they can continue to run.
> >> >
> >> 
> >> Moving the actual throttle work to pick time is clever. In my previous
> >> versions I tried really hard to stay out of the enqueue/dequeue/pick paths,
> >> but this makes the code a lot more palatable. I'd like to see how this
> >> impacts performance though.
> >> 
> >
> > Let me run some scheduler benchmark to see how it impacts performance.
> >
> > I'm thinking maybe running something like hackbench on server platforms,
> > first with quota not set and see if performance changes; then also test
> > with quota set and see how performance changes.
> >
> > Does this sound good to you? Or do you have any specific benchmark and
> > test methodology in mind?
> >
> 
> Yeah hackbench is pretty good for stressing the EQ/DQ paths.
> 

Tested hackbench/pipe and netperf/UDP_RR on Intel EMR(2 sockets/240
cpus) and AMD Genoa(2 sockets/384 cpus), the tldr is: there is no clear
performance change between base and this patchset(head). Below is
detailed test data:
(turbo/boost disabled, cpuidle disabled, cpufreq set to performance)

hackbench/pipe/loops=150000
(seconds, smaller is better)

On Intel EMR:

nr_group          base             head          change
 1              3.62±2.99%      3.61±10.42%      +0.28%
 8              8.06±1.58%      7.88±5.82%       +2.23%
16             11.40±2.57%     11.25±3.72%       +1.32%

For nr_group=16 case, configure a cgroup and set quota to half cpu and
then let hackbench run in this cgroup:

                 base             head           change
quota=50%      18.35±2.40%     18.78±1.97%       -2.34%

On AMD Genoa:

nr_group          base             head          change
 1             17.05±1.92%     16.99±2.81%       +0.35%
 8             16.54±0.71%     16.73±1.18%       -1.15%
16             27.04±0.39%     26.72±2.37%       +1.18%

For nr_group=16 case, configure a cgroup and set quota to half cpu and
then let hackbench run in this cgroup:

                  base             head          change
quota=50%      43.79±1.10%     44.65±0.37%       -1.96%

Netperf/UDP_RR/testlen=30s
(throughput, higher is better)

25% means nr_clients set to 1/4 nr_cpu, 50% means nr_clients is 1/2
nr_cpu, etc.

On Intel EMR:

nr_clients     base                 head             change
  25%       83,567±0.06%         84,298±0.23%        +0.87%
  50%       61,336±1.49%         60,816±0.63%        -0.85%
  75%       40,592±0.97%         40,461±0.14%        -0.32%
 100%       31,277±2.11%         30,948±1.84%        -1.05%

For nr_clients=100% case, configure a cgroup and set quota to half cpu
and then let netperf run in this cgroup:

nr_clients     base                 head             change
 100%       25,532±0.56%         26,772±3.05%        +4.86%

On AMD Genoa:

nr_clients     base                 head             change
 25%        12,443±0.40%         12,525±0.06%        +0.66%
 50%        11,403±0.35%         11,472±0.50%        +0.61%
 75%        10,070±0.19%         10,071±0.95%         0.00%
100%         9,947±0.80%          9,881±0.58%        -0.66%

For nr_clients=100% case, configure a cgroup and set quota to half cpu
and then let netperf run in this cgroup:

nr_clients     base                 head             change
100%         4,954±0.24%          4,952±0.14%         0.00%

> >> > +	if (throttled_hierarchy(cfs_rq) &&
> >> > +	    !task_current_donor(rq_of(cfs_rq), p)) {
> >> > +		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> >> > +		return true;
> >> > +	}
> >> > +
> >> > +	/* we can't take the fast path, do an actual enqueue*/
> >> > +	p->throttled = false;
> >> 
> >> So we clear p->throttled but not p->throttle_node? Won't that cause issues
> >> when @p's previous cfs_rq gets unthrottled?
> >> 
> >
> > p->throttle_node is already removed from its previous cfs_rq at dequeue
> > time in dequeue_throttled_task().
> >
> > This is done so because in enqueue time, we may not hold its previous
> > rq's lock so can't touch its previous cfs_rq's limbo list, like when
> > dealing with affinity changes.
> >
> 
> Ah right, the DQ/EQ_throttled_task() functions are when DQ/EQ is applied to an
> already-throttled task and it does the right thing.
> 
> Does this mean we want this as enqueue_throttled_task()'s prologue?
> 
>   /* @p should have gone through dequeue_throttled_task() first */
>   WARN_ON_ONCE(!list_empty(&p->throttle_node));
>

Sure, will add it in next version.

> >> > @@ -7145,6 +7142,11 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
> >> >   */
> >> >  static bool dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >> >  {
> >> > +	if (unlikely(task_is_throttled(p))) {
> >> > +		dequeue_throttled_task(p, flags);
> >> > +		return true;
> >> > +	}
> >> > +
> >> 
> >> Handling a throttled task's move pattern at dequeue does simplify things,
> >> however that's quite a hot path. I think this wants at the very least a
> >> 
> >>   if (cfs_bandwidth_used())
> >> 
> >> since that has a static key underneath. Some heavy EQ/DQ benchmark wouldn't
> >> hurt, but we can probably find some people who really care about that to
> >> run it for us ;)
> >> 
> >
> > No problem.
> >
> > I'm thinking maybe I can add this cfs_bandwidth_used() in
> > task_is_throttled()? So that other callsites of task_is_throttled() can
> > also get the benefit of paying less cost when cfs bandwidth is not
> > enabled.
> >
> 
> Sounds good to me; just drop the unlikely and let the static key do its
> thing :)

Got it, thanks for the suggestion.

next prev parent reply	other threads:[~2025-08-12  8:48 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-15  7:16 [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-07-15  7:16 ` [PATCH v3 1/5] sched/fair: Add related data structure for task based throttle Aaron Lu
2025-07-15  7:16 ` [PATCH v3 2/5] sched/fair: Implement throttle task work and related helpers Aaron Lu
2025-07-15  7:16 ` [PATCH v3 3/5] sched/fair: Switch to task based throttle model Aaron Lu
2025-07-15 23:29   ` kernel test robot
2025-07-16  6:57     ` Aaron Lu
2025-07-16  7:40       ` Philip Li
2025-07-16 11:15         ` [PATCH v3 update " Aaron Lu
2025-07-16 11:27       ` [PATCH v3 " Peter Zijlstra
2025-07-16 15:20   ` kernel test robot
2025-07-17  3:52     ` Aaron Lu
2025-07-23  8:21       ` Oliver Sang
2025-07-23 10:08         ` Aaron Lu
2025-08-08  9:12   ` Valentin Schneider
2025-08-08 10:13     ` Aaron Lu
2025-08-08 11:45       ` Valentin Schneider
2025-08-12  8:48         ` Aaron Lu [this message]
2025-08-14 15:54           ` Valentin Schneider
2025-08-15  9:30             ` Aaron Lu
2025-08-22 11:07               ` Aaron Lu
2025-09-03  7:14                 ` Aaron Lu
2025-09-03  9:11                   ` K Prateek Nayak
2025-09-03 10:11                     ` Aaron Lu
2025-09-03 10:31                       ` K Prateek Nayak
2025-09-03 11:35                         ` Aaron Lu
2025-09-04  7:33                           ` Bezdeka, Florian
2025-09-04  8:26                             ` K Prateek Nayak
2025-09-04  8:40                             ` Aaron Lu
2025-08-28  3:50         ` Aaron Lu
2025-08-17  8:50   ` Chen, Yu C
2025-08-18  2:50     ` Aaron Lu
2025-08-18  3:10       ` Chen, Yu C
2025-08-18  3:12       ` Aaron Lu
2025-07-15  7:16 ` [PATCH v3 4/5] sched/fair: Task based throttle time accounting Aaron Lu
2025-08-18 14:57   ` Valentin Schneider
2025-08-19  9:34     ` Aaron Lu
2025-08-19 14:09       ` Valentin Schneider
2025-08-26 14:10       ` Michal Koutný
2025-08-27 15:16         ` Valentin Schneider
2025-08-28  6:06         ` Aaron Lu
2025-08-26  9:15     ` Aaron Lu
2025-07-15  7:16 ` [PATCH v3 5/5] sched/fair: Get rid of throttled_lb_pair() Aaron Lu
2025-07-15  7:22 ` [PATCH v3 0/5] Defer throttle when task exits to user Aaron Lu
2025-08-01 14:31 ` Matteo Martelli
2025-08-04  7:52   ` Aaron Lu
2025-08-04 11:18     ` Valentin Schneider
2025-08-04 11:56       ` Aaron Lu
2025-08-08 16:37     ` Matteo Martelli
2025-08-04  8:51 ` K Prateek Nayak
2025-08-04 11:48   ` Aaron Lu
2025-08-27 14:58 ` Valentin Schneider

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250812084828.GA52@bytedance \
    --to=ziqianlu@bytedance.com \
    --cc=bsegall@google.com \
    --cc=chengming.zhou@linux.dev \
    --cc=dietmar.eggemann@arm.com \
    --cc=florian.bezdeka@siemens.com \
    --cc=jan.kiszka@siemens.com \
    --cc=joshdon@google.com \
    --cc=juri.lelli@redhat.com \
    --cc=kprateek.nayak@amd.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=liusongtang@bytedance.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=xii@google.com \
    --cc=zhouchuyi@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.