From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
Ingo Molnar <mingo@redhat.com>,
Peter Zijlstra <peterz@infradead.org>,
Juri Lelli <juri.lelli@redhat.com>,
Dietmar Eggemann <dietmar.eggemann@arm.com>,
Steven Rostedt <rostedt@goodmis.org>,
Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
Valentin Schneider <vschneid@redhat.com>,
Christian Loehle <christian.loehle@arm.com>,
Koba Ko <kobak@nvidia.com>,
Felix Abecassis <fabecassis@nvidia.com>,
Balbir Singh <balbirs@nvidia.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
Date: Fri, 27 Mar 2026 23:45:03 +0100 [thread overview]
Message-ID: <accIbzfIzvtxbY05@gpd4> (raw)
In-Reply-To: <acbqNOWr37pMC1sG@gpd4>
On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> >
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > >
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> > >>> work onto a CPU that still offers full effective capacity. Fall back to
> > >>> any idle candidate if none qualify.
> > >>
> > >> This one isn't straightforward for me. The ilb cpu will check all
> > >> other idle CPUs 1st and finish with itself so unless the next CPU in
> > >> the idle_cpus_mask is a sibling, this should not make a difference
> > >>
> > >> Did you see any perf diff ?
> > >
> > > I actually see a benefit, in particular, with the first patch applied I see
> > > a ~1.76x speedup, if I add this on top I get ~1.9x speedup vs baseline,
> > > which seems pretty consistent across runs (definitely not in error range).
> > >
> > > The intention with this change was to minimize SMT noise running the ILB
> > > code on a fully-idle core when possible, but I also didn't expect to see
> > > such big difference.
> > >
> > > I'll investigate more to better understand what's happening.
> >
> > Interesting! Either this "CPU-intensive workload" hates SMT turning
> > busy (but to an extent where performance drops visibly?) or ILB
> > keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts leading to slower balance (or IRQs driving the workload
> > being delayed by rq_lock disabling them)
> >
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utilization with and without the patch? I too
> > will go queue up some runs to see if this makes a difference.
>
> Quick update: I also tried this on a Vera machine with a firmware that
> exposes the same capacity for all the CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on of course) and I see similar performance
> benefits.
>
> Looking at SCHED_SOFTIRQ and load balancing attempts I don't see big
> differences, all within error range (results produced using a vibe-coded
> python script):
>
> - baseline (stats/sec):
>
> SCHED softirq count : 2,625
> LB attempts (total) : 69,832
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 68,482 [balanced=68,472 failed=9]
> CPU_IDLE : lb=1,408 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=67,041 imb(load=0 util=0 task=7 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 902 [balanced=900 failed=2]
> CPU_NEWLY_IDLE : lb=869 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=2 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 448 [balanced=441 failed=7]
> CPU_NEWLY_IDLE : lb=415 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=33 imb(load=0 util=0 task=268 misfit=0) gained=0
>
> - with ilb-smt (stats/sec):
>
> SCHED softirq count : 2,671
> LB attempts (total) : 68,572
>
> Per-domain breakdown:
> domain0 (SMT):
> lb_count (total) : 67,239 [balanced=67,197 failed=41]
> CPU_IDLE : lb=1,419 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NEWLY_IDLE : lb=65,783 imb(load=0 util=0 task=42 misfit=0) gained=1
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain1 (MC):
> lb_count (total) : 833 [balanced=833 failed=0]
> CPU_NEWLY_IDLE : lb=796 imb(load=0 util=0 task=0 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=0 misfit=0) gained=0
> domain2 (NUMA):
> lb_count (total) : 500 [balanced=488 failed=12]
> CPU_NEWLY_IDLE : lb=463 imb(load=0 util=0 task=44 misfit=0) gained=0
> CPU_NOT_IDLE : lb=37 imb(load=0 util=0 task=627 misfit=0) gained=0
>
> I'll add more direct instrumentation to check what ILB is doing
> differently...
More data.
== SMT contention ==

tracepoint:sched:sched_switch
{
	// Track whether each CPU is currently running a non-idle task
	if (args->next_pid != 0) {
		@busy[cpu] = 1;
	} else {
		delete(@busy[cpu]);
	}
}

tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
	// On an idle->busy transition, check whether the SMT sibling is
	// already busy (siblings are offset by 176 on this 352-CPU system)
	$sib = (cpu + 176) % 352;
	if (@busy[$sib]) {
		@smt_contention++;
	} else {
		@smt_no_contention++;
	}
}

END
{
	printf("smt_contention %lld\n", (int64)@smt_contention);
	printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
- baseline:
@smt_contention: 1103
@smt_no_contention: 3815
- ilb-smt:
@smt_contention: 937
@smt_no_contention: 4459
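As a quick back-of-the-envelope check of the numbers above (not part of the original tracing scripts), the fraction of idle->busy wakeups landing next to an already-busy sibling drops from roughly 22% to 17% with the patch:

```python
def contention_ratio(contended, uncontended):
    """Fraction of idle->busy transitions that found the SMT sibling busy."""
    return contended / (contended + uncontended)

# Counts from the @smt_contention / @smt_no_contention maps above
baseline = contention_ratio(1103, 3815)
ilb_smt = contention_ratio(937, 4459)
print(f"baseline: {baseline:.1%}, ilb-smt: {ilb_smt:.1%}")
```

which prints "baseline: 22.4%, ilb-smt: 17.4%", consistent with the patch's goal of keeping work on fully-idle cores.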
== ILB duration ==
- baseline:
@ilb_duration_us:
[0] 147 | |
[1] 354 |@ |
[2, 4) 739 |@@@ |
[4, 8) 3040 |@@@@@@@@@@@@@@@@ |
[8, 16) 9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 1267 |@@@@@@ |
[64, 128) 1607 |@@@@@@@@ |
[128, 256) 2222 |@@@@@@@@@@@ |
[256, 512) 2326 |@@@@@@@@@@@@ |
[512, 1K) 141 | |
[1K, 2K) 37 | |
[2K, 4K) 7 | |
- ilb-smt:
@ilb_duration_us:
[0] 79 | |
[1] 137 | |
[2, 4) 1440 |@@@@@@@@@@ |
[4, 8) 2897 |@@@@@@@@@@@@@@@@@@@@ |
[8, 16) 7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32) 4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[32, 64) 2390 |@@@@@@@@@@@@@@@@ |
[64, 128) 2254 |@@@@@@@@@@@@@@@ |
[128, 256) 2731 |@@@@@@@@@@@@@@@@@@@ |
[256, 512) 1083 |@@@@@@@ |
[512, 1K) 265 |@ |
[1K, 2K) 29 | |
[2K, 4K) 5 | |
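To quantify "more expensive" from the two histograms, here is a rough tail comparison (my own summary of the bucket counts above, using the bucket order as printed):

```python
def tail_fraction(hist, threshold_idx):
    """Fraction of samples falling in buckets at or past threshold_idx."""
    return sum(hist[threshold_idx:]) / sum(hist)

# @ilb_duration_us bucket counts, in order:
# [0], [1], [2,4), [4,8), [8,16), [16,32), [32,64), [64,128),
# [128,256), [256,512), [512,1K), [1K,2K), [2K,4K)
baseline = [147, 354, 739, 3040, 9825, 8142, 1267, 1607, 2222, 2326, 141, 37, 7]
ilb_smt  = [79, 137, 1440, 2897, 7433, 4993, 2390, 2254, 2731, 1083, 265, 29, 5]

# Index 6 is the [32,64) bucket, so this is the fraction of ILB runs >= 32us
print(f"baseline >=32us: {tail_fraction(baseline, 6):.1%}")
print(f"ilb-smt  >=32us: {tail_fraction(ilb_smt, 6):.1%}")
```

The >=32us tail grows from about 25.5% to 34.0% of runs, even though the total number of ILB runs is lower with the patch.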
== rq_lock hold ==
- baseline:
@lb_rqlock_hold_us:
[0] 664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 77446 |@@@@@@ |
[2, 4) 25044 |@ |
[4, 8) 19847 |@ |
[8, 16) 2434 | |
[16, 32) 605 | |
[32, 64) 308 | |
[64, 128) 38 | |
[128, 256) 2 | |
- ilb-smt:
@lb_rqlock_hold_us:
[0] 229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1] 135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 4) 26989 |@@@@@@ |
[4, 8) 48034 |@@@@@@@@@@ |
[8, 16) 1919 | |
[16, 32) 2236 | |
[32, 64) 595 | |
[64, 128) 135 | |
[128, 256) 27 | |
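The rq_lock histograms show the same shift. Summarizing the bucket counts above (again my own arithmetic, not from the original scripts), the fraction of holds that took at least 1us roughly triples:

```python
def ge_1us_fraction(hist):
    """Fraction of rq_lock holds past the [0] bucket, i.e. >= 1us."""
    return 1 - hist[0] / sum(hist)

# @lb_rqlock_hold_us bucket counts, in order:
# [0], [1], [2,4), [4,8), [8,16), [16,32), [32,64), [64,128), [128,256)
baseline = [664396, 77446, 25044, 19847, 2434, 605, 308, 38, 2]
ilb_smt  = [229152, 135060, 26989, 48034, 1919, 2236, 595, 135, 27]

print(f"baseline holds >=1us: {ge_1us_fraction(baseline):.1%}")
print(f"ilb-smt  holds >=1us: {ge_1us_fraction(ilb_smt):.1%}")
```

That works out to about 15.9% vs 48.4%, though the total number of lock acquisitions is also much lower with the patch.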
From what I can see, ILB runs are more expensive, but I still don't see why
I'm getting the speedup with this ilb-smt patch. I'll keep investigating...
-Andrea