public inbox for linux-kernel@vger.kernel.org
From: Andrea Righi <arighi@nvidia.com>
To: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Valentin Schneider <vschneid@redhat.com>,
	Christian Loehle <christian.loehle@arm.com>,
	Koba Ko <kobak@nvidia.com>,
	Felix Abecassis <fabecassis@nvidia.com>,
	Balbir Singh <balbirs@nvidia.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer
Date: Fri, 27 Mar 2026 23:45:03 +0100	[thread overview]
Message-ID: <accIbzfIzvtxbY05@gpd4> (raw)
In-Reply-To: <acbqNOWr37pMC1sG@gpd4>

On Fri, Mar 27, 2026 at 09:36:15PM +0100, Andrea Righi wrote:
> On Fri, Mar 27, 2026 at 05:04:23PM +0530, K Prateek Nayak wrote:
> > Hello Andrea,
> > 
> > On 3/27/2026 3:14 PM, Andrea Righi wrote:
> > > Hi Vincent,
> > > 
> > > On Fri, Mar 27, 2026 at 09:45:56AM +0100, Vincent Guittot wrote:
> > >> On Thu, 26 Mar 2026 at 16:12, Andrea Righi <arighi@nvidia.com> wrote:
> > >>>
> > >>> When choosing which idle housekeeping CPU runs the idle load balancer,
> > >>> prefer one on a fully idle core if SMT is active, so balance can migrate
> > >>> work onto a CPU that still offers full effective capacity. Fall back to
> > >>> any idle candidate if none qualify.
> > >>
> > >> This one isn't straightforward for me. The ilb cpu will check all
> > >> other idle CPUs 1st and finish with itself so unless the next CPU in
> > >> the idle_cpus_mask is a sibling, this should not make a difference
> > >>
> > >> Did you see any perf diff ?
> > > 
> > > I actually see a benefit. In particular, with the first patch applied I see
> > > a ~1.76x speedup; if I add this on top I get a ~1.9x speedup vs baseline,
> > > which seems pretty consistent across runs (definitely not within the error
> > > range).
> > > 
> > > The intention with this change was to minimize SMT noise by running the ILB
> > > code on a fully-idle core when possible, but I also didn't expect to see
> > > such a big difference.
> > > 
> > > I'll investigate more to better understand what's happening.
> > 
> > Interesting! Either this "CPU-intensive workload" hates the SMT sibling
> > turning busy (but to an extent where performance drops visibly?), or the
> > ILB keeps getting interrupted on an SMT sibling that is burdened by
> > interrupts, leading to a slower balance (or the IRQs driving the workload
> > being delayed by rq_lock disabling them).
> > 
> > Would it be possible to share the total SCHED_SOFTIRQ time, load
> > balancing attempts, and utilization with and without the patch? I too
> > will queue up some runs to see if this makes a difference.
> 
> Quick update: I also tried this on a Vera machine with firmware that
> exposes the same capacity for all CPUs (so with SD_ASYM_CPUCAPACITY
> disabled and SMT still on, of course), and I see similar performance
> benefits.
> 
> Looking at SCHED_SOFTIRQ counts and load-balancing attempts I don't see big
> differences; everything is within the error range (results produced using a
> vibe-coded Python script):
> 
>  - baseline (stats/sec):
> 
>   SCHED softirq count  :        2,625
>   LB attempts (total)  :       69,832
> 
>   Per-domain breakdown:
>     domain0 (SMT):
>       lb_count    (total)  :       68,482  [balanced=68,472  failed=9]
>         CPU_IDLE         : lb=1,408  imb(load=0 util=0 task=0 misfit=0)  gained=0
>         CPU_NEWLY_IDLE   : lb=67,041  imb(load=0 util=0 task=7 misfit=0)  gained=0
>         CPU_NOT_IDLE     : lb=33  imb(load=0 util=0 task=2 misfit=0)  gained=0
>     domain1 (MC):
>       lb_count    (total)  :          902  [balanced=900  failed=2]
>         CPU_NEWLY_IDLE   : lb=869  imb(load=0 util=0 task=0 misfit=0)  gained=0
>         CPU_NOT_IDLE     : lb=33  imb(load=0 util=0 task=2 misfit=0)  gained=0
>     domain2 (NUMA):
>       lb_count    (total)  :          448  [balanced=441  failed=7]
>         CPU_NEWLY_IDLE   : lb=415  imb(load=0 util=0 task=44 misfit=0)  gained=0
>         CPU_NOT_IDLE     : lb=33  imb(load=0 util=0 task=268 misfit=0)  gained=0
> 
>  - with ilb-smt (stats/sec):
> 
>   SCHED softirq count  :        2,671
>   LB attempts (total)  :       68,572
> 
>   Per-domain breakdown:
>     domain0 (SMT):
>       lb_count    (total)  :       67,239  [balanced=67,197  failed=41]
>         CPU_IDLE         : lb=1,419  imb(load=0 util=0 task=0 misfit=0)  gained=0
>         CPU_NEWLY_IDLE   : lb=65,783  imb(load=0 util=0 task=42 misfit=0)  gained=1
>         CPU_NOT_IDLE     : lb=37  imb(load=0 util=0 task=0 misfit=0)  gained=0
>     domain1 (MC):
>       lb_count    (total)  :          833  [balanced=833  failed=0]
>         CPU_NEWLY_IDLE   : lb=796  imb(load=0 util=0 task=0 misfit=0)  gained=0
>         CPU_NOT_IDLE     : lb=37  imb(load=0 util=0 task=0 misfit=0)  gained=0
>     domain2 (NUMA):
>       lb_count    (total)  :          500  [balanced=488  failed=12]
>         CPU_NEWLY_IDLE   : lb=463  imb(load=0 util=0 task=44 misfit=0)  gained=0
>         CPU_NOT_IDLE     : lb=37  imb(load=0 util=0 task=627 misfit=0)  gained=0
> 
> I'll add more direct instrumentation to check what ILB is doing
> differently...

More data.

== SMT contention ==

// Count how often a CPU leaves idle while its SMT sibling is already busy.
// The sibling mapping assumes 352 logical CPUs with thread siblings offset
// by 176 (cpu N pairs with cpu N + 176).

// Track whether each CPU is currently running a non-idle task.
tracepoint:sched:sched_switch
{
    if (args->next_pid != 0) {
        @busy[cpu] = 1;
    } else {
        delete(@busy[cpu]);
    }
}

// On an idle -> busy transition, check whether the sibling is already busy.
tracepoint:sched:sched_switch
/ args->prev_pid == 0 && args->next_pid != 0 /
{
    $sib = (cpu + 176) % 352;

    if (@busy[$sib]) {
        @smt_contention++;
    } else {
        @smt_no_contention++;
    }
}

END
{
    printf("smt_contention %lld\n", (int64)@smt_contention);
    printf("smt_no_contention %lld\n", (int64)@smt_no_contention);
}
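As a quick sanity check of the constants 176 and 352 used by the script above, the fixed-offset sibling mapping can be verified in a few lines of Python (the constants come from the script; nothing else is assumed about the topology):

```python
NR_CPUS = 352          # total logical CPUs on the test machine (from the script)
SMT_OFFSET = 176       # thread siblings are (n, n + 176)

def sibling(cpu: int) -> int:
    """SMT sibling under the fixed-offset numbering used by the bpftrace script."""
    return (cpu + SMT_OFFSET) % NR_CPUS

# The mapping is an involution: each CPU's sibling maps back to it,
# so the script's @busy[$sib] lookup is symmetric between the two threads.
assert sibling(0) == 176 and sibling(176) == 0
assert all(sibling(sibling(c)) == c for c in range(NR_CPUS))
print("sibling mapping ok")
```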

 - baseline:

@smt_contention: 1103
@smt_no_contention: 3815

 - ilb-smt:

@smt_contention: 937
@smt_no_contention: 4459
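In relative terms (just arithmetic on the counts above, nothing new measured), idle-to-busy transitions landing on a core with a busy sibling drop from roughly 22.4% to 17.4%:

```python
# Counts copied verbatim from the bpftrace output above.
baseline = {"contention": 1103, "no_contention": 3815}
ilb_smt  = {"contention": 937,  "no_contention": 4459}

def contention_pct(d: dict) -> float:
    """Fraction of idle->busy transitions that hit a busy SMT sibling."""
    total = d["contention"] + d["no_contention"]
    return 100.0 * d["contention"] / total

print(f"baseline: {contention_pct(baseline):.1f}%")
print(f"ilb-smt:  {contention_pct(ilb_smt):.1f}%")
```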

== ILB duration ==

 - baseline:

@ilb_duration_us:
[0]                  147 |                                                    |
[1]                  354 |@                                                   |
[2, 4)               739 |@@@                                                 |
[4, 8)              3040 |@@@@@@@@@@@@@@@@                                    |
[8, 16)             9825 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)            8142 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@         |
[32, 64)            1267 |@@@@@@                                              |
[64, 128)           1607 |@@@@@@@@                                            |
[128, 256)          2222 |@@@@@@@@@@@                                         |
[256, 512)          2326 |@@@@@@@@@@@@                                        |
[512, 1K)            141 |                                                    |
[1K, 2K)              37 |                                                    |
[2K, 4K)               7 |                                                    |

 - ilb-smt:

@ilb_duration_us:
[0]                   79 |                                                    |
[1]                  137 |                                                    |
[2, 4)              1440 |@@@@@@@@@@                                          |
[4, 8)              2897 |@@@@@@@@@@@@@@@@@@@@                                |
[8, 16)             7433 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)            4993 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[32, 64)            2390 |@@@@@@@@@@@@@@@@                                    |
[64, 128)           2254 |@@@@@@@@@@@@@@@                                     |
[128, 256)          2731 |@@@@@@@@@@@@@@@@@@@                                 |
[256, 512)          1083 |@@@@@@@                                             |
[512, 1K)            265 |@                                                   |
[1K, 2K)              29 |                                                    |
[2K, 4K)               5 |                                                    |

== rq_lock hold ==

 - baseline:

@lb_rqlock_hold_us:
[0]               664396 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                77446 |@@@@@@                                              |
[2, 4)             25044 |@                                                   |
[4, 8)             19847 |@                                                   |
[8, 16)             2434 |                                                    |
[16, 32)             605 |                                                    |
[32, 64)             308 |                                                    |
[64, 128)             38 |                                                    |
[128, 256)             2 |                                                    |

 - ilb-smt:

@lb_rqlock_hold_us:
[0]               229152 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]               135060 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                      |
[2, 4)             26989 |@@@@@@                                              |
[4, 8)             48034 |@@@@@@@@@@                                          |
[8, 16)             1919 |                                                    |
[16, 32)            2236 |                                                    |
[32, 64)             595 |                                                    |
[64, 128)            135 |                                                    |
[128, 256)            27 |                                                    |

From what I can see, ILB runs are more expensive, but I still don't see why
I'm getting the speedup with this ilb-smt patch. I'll keep investigating...

-Andrea


Thread overview: 24+ messages
2026-03-26 15:02 [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 1/4] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection Andrea Righi
2026-03-27  8:09   ` Vincent Guittot
2026-03-27  9:46     ` Andrea Righi
2026-03-27 10:44   ` K Prateek Nayak
2026-03-27 10:58     ` Andrea Righi
2026-03-27 11:14       ` K Prateek Nayak
2026-03-27 16:39         ` Andrea Righi
2026-03-26 15:02 ` [PATCH 2/4] sched/fair: Reject misfit pulls onto busy SMT siblings on asym-capacity Andrea Righi
2026-03-26 15:02 ` [PATCH 3/4] sched/fair: Enable EAS with SMT on SD_ASYM_CPUCAPACITY systems Andrea Righi
2026-03-27  8:09   ` Vincent Guittot
2026-03-27  9:45     ` Andrea Righi
2026-03-26 15:02 ` [PATCH 4/4] sched/fair: Prefer fully-idle SMT core for NOHZ idle load balancer Andrea Righi
2026-03-27  8:45   ` Vincent Guittot
2026-03-27  9:44     ` Andrea Righi
2026-03-27 11:34       ` K Prateek Nayak
2026-03-27 20:36         ` Andrea Righi
2026-03-27 22:45           ` Andrea Righi [this message]
2026-03-27 13:44   ` Shrikanth Hegde
2026-03-26 16:33 ` [PATCH 0/4] sched/fair: SMT-aware asymmetric CPU capacity Christian Loehle
2026-03-27  6:52   ` Andrea Righi
2026-03-27 16:31 ` Shrikanth Hegde
2026-03-27 17:08   ` Andrea Righi
2026-03-28  6:51     ` Shrikanth Hegde
