From: David Laight <david.laight.linux@gmail.com>
To: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>,
x86@kernel.org, Jon Kohler <jon@nutanix.com>,
Nikolay Borisov <nik.borisov@suse.com>,
"H. Peter Anvin" <hpa@zytor.com>,
Josh Poimboeuf <jpoimboe@kernel.org>,
David Kaplan <david.kaplan@amd.com>,
Sean Christopherson <seanjc@google.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
Peter Zijlstra <peterz@infradead.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
KP Singh <kpsingh@kernel.org>, Jiri Olsa <jolsa@kernel.org>,
"David S. Miller" <davem@davemloft.net>,
Andy Lutomirski <luto@kernel.org>,
Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
David Ahern <dsahern@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
Stanislav Fomichev <sdf@fomichev.me>, Hao Luo <haoluo@google.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Asit Mallick <asit.k.mallick@intel.com>,
Tao Zhang <tao1.zhang@intel.com>,
bpf@vger.kernel.org, netdev@vger.kernel.org,
linux-doc@vger.kernel.org
Subject: Re: [PATCH v8 02/10] x86/bhi: Make clear_bhb_loop() effective on newer CPUs
Date: Wed, 1 Apr 2026 10:02:00 +0100
Message-ID: <20260401100200.5b347628@pumpkin>
In-Reply-To: <20260401081236.3rjp2wigkr6w3nym@desk>
On Wed, 1 Apr 2026 01:12:36 -0700
Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> On Sat, Mar 28, 2026 at 10:08:37AM +0000, David Laight wrote:
> > On Fri, 27 Mar 2026 17:42:56 -0700
> > Pawan Gupta <pawan.kumar.gupta@linux.intel.com> wrote:
> >
> > > On Thu, Mar 26, 2026 at 01:29:31PM -0700, Pawan Gupta wrote:
> > > > On Thu, Mar 26, 2026 at 10:45:57AM +0000, David Laight wrote:
> > > > > On Thu, 26 Mar 2026 11:01:20 +0100
> > > > > Borislav Petkov <bp@alien8.de> wrote:
> > > > >
> > > > > > On Thu, Mar 26, 2026 at 01:39:34AM -0700, Pawan Gupta wrote:
> > > > > > > I believe the equivalent for cpu_feature_enabled() in asm is the
> > > > > > > ALTERNATIVE. Please let me know if I am missing something.
> > > > > >
> > > > > > Yes, you are.
> > > > > >
> > > > > > The point is that you don't want to stick those alternative calls inside some
> > > > > > magic bhb_loop function but hand them in from the outside, as function
> > > > > > arguments.
> > > > > >
> > > > > > Basically what I did.
> > > > > >
> > > > > > Then you were worried about this being C code and it had to be noinstr... So
> > > > > > that outer function can be rewritten in asm, I think, and still keep it well
> > > > > > separate.
> > > > > >
> > > > > > I'll try to rewrite it once I get a free minute, and see how it looks.
> > > > > >
> > > > >
> > > > > I think someone tried getting C code to write the values to global data
> > > > > and getting the asm to read them.
> > > > > > That got discounted because it split things between two largely unrelated files.
> > > >
> > > >
> > > > The implementation with global variables wasn't that bad, let me revive it.
> > > >
> > > > This part ties the sequence to the BHI mitigation, which is not ideal
> > > > (because VMSCAPE also uses it), but it does seem a cleaner option.
> > > >
> > > > --- a/arch/x86/kernel/cpu/bugs.c
> > > > +++ b/arch/x86/kernel/cpu/bugs.c
> > > > @@ -2095,6 +2095,11 @@ static void __init bhi_select_mitigation(void)
> > > >
> > > >  static void __init bhi_update_mitigation(void)
> > > >  {
> > > > +	if (!cpu_feature_enabled(X86_FEATURE_BHI_CTRL)) {
> > > > +		bhi_seq_outer_loop = 5;
> > > > +		bhi_seq_inner_loop = 5;
> > > > +	}
> > > > +
> > > >
> > > > I believe this can be moved to somewhere common to all mitigations.
> > > >
> > > > > I think the BPF code would need significant refactoring to call a C function.
> > > >
> > > > Ya, true. Will use globals and keep clear_bhb_loop() in asm.
> > >
> > > While testing this approach, I noticed that syscalls were suffering an 8%
> > > regression on ICX for Native BHI mitigation:
> > >
> > > $ perf bench syscall basic -l 100000000
> > >
> > > Bisection pointed to the change for using 8-bit registers (al/ah replacing
> > > eax/ecx) as the main contributor to the regression. (Global variables added
> > > a bit, but within noise).
> > >
> > > Further digging revealed a strange behavior, using %ah for the inner loop
> > > was causing the regression, interchanging %al and %ah in the loops
> > > (for movb and sub) eliminated the regression.
> > >
> > > <clear_bhb_loop_nofence>:
> > >
> > >         movb    bhb_seq_outer_loop(%rip), %al
> > >
> > >         call    1f
> > >         jmp     5f
> > > 1:      call    2f
> > > .Lret1: RET
> > > 2:      movb    bhb_seq_inner_loop(%rip), %ah
> > > 3:      jmp     4f
> > >         nop
> > > 4:      sub     $1, %ah     <---- No regression with %al here
> > >         jnz     3b
> > >         sub     $1, %al
> > >         jnz     1b
> > >
> > > My guess is, "sub $1, %al" is faster than "sub $1, %ah". Using %al in the
> > > inner loop, which is executed many more times, is likely making the
> > > difference. A perf profile is needed to confirm this.
> >
> > I bet it is also CPU dependent - it is quite likely that there isn't
> > any special hardware to support partial writes of %ah so it ends up taking
> > a slow path (possibly even a microcoded one to get an 8% regression).
>
> Strangely, %ah in the inner loop incurs less uops and has fewer branch
> misses, yet takes more cycles. Below is the perf data for the sequence on a
> Rocket Lake (similar observation on ICX and EMR):
>
> Event                       %al inner      %ah inner       Delta
> ----------------------  -------------  -------------  ----------
> cycles                    776,775,020    972,322,384      +25.2%
> instructions/cycle               1.23           0.98      -20.3%
> branch-misses               4,792,502        560,449      -88.3%
> uops_issued.any           768,019,010    696,888,357       -9.3%
> time elapsed                  0.1627s        0.2048s      +25.9%
>
> Time elapsed directly correlates with the increase in cycles.

That might be consistent with the %ah accesses (probably writes)
being very slow/synchronising.
So you are getting a full CPU stall instead of speculative execution
of the following instructions - which must include a lot of mis-predicted
branches.

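A rough check of that reading, using only the cycles and uops_issued.any
counters quoted above. This is just arithmetic on the posted numbers, the
helper name is made up for illustration, and uops issued per cycle is only
a crude proxy for how much the front end is being held up:

```c
#include <stdio.h>

/* Counter values copied from the quoted Rocket Lake table. */
#define AL_CYCLES 776775020.0
#define AL_UOPS   768019010.0
#define AH_CYCLES 972322384.0
#define AH_UOPS   696888357.0

/* Hypothetical helper: average uops issued per core cycle. */
static double uops_per_cycle(double uops_issued, double cycles)
{
	return uops_issued / cycles;
}

/* Print the derived rate for both variants. */
static void print_rates(void)
{
	printf("%%al inner: %.3f uops/cycle\n",
	       uops_per_cycle(AL_UOPS, AL_CYCLES));
	printf("%%ah inner: %.3f uops/cycle\n",
	       uops_per_cycle(AH_UOPS, AH_CYCLES));
}
```

That works out to roughly 0.99 uops/cycle with %al in the inner loop
against roughly 0.72 with %ah, so the %ah variant is issuing less per
cycle as well as less in total.
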
> > As well as swapping %al <-> %ah try changing the outer loop decrement to
> > sub $0x100, %ax
> > since %al is zero that will set the z flag the same.
>
> Unfortunately, using "sub $0x100, %ax" (with %al as inner loop) isn't better
> than just using "sub $1, %ah" in the outer loop:
>
> Event                       %al inner      + sub %ax       Delta
> ----------------------  -------------  -------------  ----------
> cycles                    776,775,020    813,372,036       +4.7%
> instructions/cycle               1.23           1.17       -4.5%
> branch-misses               4,792,502      7,610,323      +58.8%
> uops_issued.any           768,019,010    827,465,137       +7.7%
> time elapsed                  0.1627s        0.1707s       +4.9%

That is even more interesting.
The 'sub %ax' version has more uops and more branch-misses.
Looks like the extra cost of the %ah access is less than the cost
of the extra mis-predicted branches.
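
For anyone wondering why 'sub $0x100, %ax' can stand in for 'sub $1, %ah'
at all: with %al already zero, both leave the same value in %ax and set
ZF identically. A small C model of the two instructions (a sketch only;
the function names are invented for illustration, and it models just the
result and ZF, nothing about timing):

```c
#include <stdint.h>

/* Model of 'sub $1, %ah': touch bits 15:8 only, ZF from the 8-bit result. */
static int sub1_ah(uint16_t *ax)
{
	uint8_t ah = (uint8_t)(*ax >> 8);

	ah = (uint8_t)(ah - 1);
	*ax = (uint16_t)((ah << 8) | (*ax & 0xff));
	return ah == 0;
}

/* Model of 'sub $0x100, %ax': 16-bit subtract, ZF from the 16-bit result. */
static int sub100_ax(uint16_t *ax)
{
	*ax = (uint16_t)(*ax - 0x100);
	return *ax == 0;
}

/*
 * Count the outer-loop values (1..255 in %ah, %al == 0) for which the
 * two forms disagree on either the register contents or ZF.
 */
static int count_mismatches(void)
{
	int bad = 0;

	for (unsigned int n = 1; n <= 255; n++) {
		uint16_t a = (uint16_t)(n << 8), b = a;
		int zf_a = sub1_ah(&a);
		int zf_b = sub100_ax(&b);

		if (zf_a != zf_b || a != b)
			bad++;
	}
	return bad;
}
```

count_mismatches() returns 0, so architecturally the two decrements are
interchangeable here and any difference in the counters comes from how
the core handles them.
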
Makes me wonder where a version that uses %cl fits?
(Or use a zero-extending read and %eax/%ecx - likely to be the same.)
I'll bet 'one beer' that is nearest the 'sub %ax' version.

	David

>
> > I've just hacked a test into some test code I've got.
> > I'm not seeing any unexpected costs on either zen-5 or haswell.
> > So it may be more subtle.
>
> This is puzzling, but at least it is evident that using %al for the inner
> loop seems to be the best option. In summary:
>
> Variant  Cycles       Uops Issued  Branch Misses
> -------  -----------  -----------  -------------
> %al      776M         768M         4.8M           (fastest)
> %ah      972M (+25%)  697M (-9%)   560K (-88%)    (fewer uops + misses, yet slowest)
> sub %ax  813M (+5%)   827M (+8%)   7.6M (+59%)    (most uops + misses)