* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
@ 2026-06-14 18:08 ` Xin Li
2026-06-14 18:31 ` H. Peter Anvin
2026-06-15 0:19 ` H. Peter Anvin
2026-06-16 8:28 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: Xin Li @ 2026-06-14 18:08 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, torvalds, x86-ML, LKML
> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>
> OK, I have, I believe root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
Does the problem not happen with IDT?
>
> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
Good point; an alignment check should be applied to all such entries.
Thanks
Xin
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 18:08 ` Xin Li
@ 2026-06-14 18:31 ` H. Peter Anvin
0 siblings, 0 replies; 18+ messages in thread
From: H. Peter Anvin @ 2026-06-14 18:31 UTC (permalink / raw)
To: Xin Li
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, torvalds, x86-ML, LKML
On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@zytor.com> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>>
>>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>>
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>>
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>>
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>>
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>>
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where it's hurting?
>>>>
>>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>>
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>>> there isn't anything in my test setup that could introduce any kind of
>>>> "memory" between builds...
>>>
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
>> OK, I have, I believe root-caused this.
>>
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>>
>> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
>
>
>Does the problem not happen with IDT?
>
>
>>
>> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
>
>
>Good point; an alignment check should be applied to all such entries.
>
>Thanks
> Xin
The problem is that if you put an alignment directive on a function, it aligns the __pfx stub, which is exactly The Wrong Thing™.
Otherwise this would be easy to fix, permanently.
I haven't had time to test IDT yet. I assume it is similar.
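To make the arithmetic concrete, here is a minimal userspace sketch (not kernel code) of what happens when the alignment lands on the prefix rather than on the entry point. The helper name and its parameters are made up for illustration; the only assumption taken from the thread is that the entry point sits a fixed number of padding bytes after the start of the __pfx stub.

```c
#include <stdint.h>

/*
 * Illustrative model only: the toolchain aligns the start of the
 * prefix (the __pfx stub) to `align` bytes and places the function
 * entry `pad` bytes after it. All names here are hypothetical.
 */
static uintptr_t entry_address(uintptr_t next_free, unsigned int align,
			       unsigned int pad)
{
	/* Round the stub start up to the requested alignment. */
	uintptr_t stub = (next_free + align - 1) / align * align;

	/* The entry point follows the padding NOPs. */
	return stub + pad;
}
```

With 16 bytes of padding, a 16-byte alignment request still yields a 16-byte aligned entry, but a 32-byte request puts the entry at offset 16 within its 32-byte block.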
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
@ 2026-06-15 0:19 ` H. Peter Anvin
2026-06-15 2:07 ` H. Peter Anvin
2026-06-16 8:28 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: H. Peter Anvin @ 2026-06-15 0:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML
On 2026-06-13 18:50, H. Peter Anvin wrote:
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel)
>>>> wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could
>>>> possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried
>>>> running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family
>>> obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the
>>> regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in
>> fact worse than I originally stated: the average with 7.1rc7 is 478±6
>> cycles (with the top and bottom octiles removed as outlier
>> protection); with 7.1rc7 with the above patch reverted it is
>> 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
>
> OK, I have, I believe root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to
> be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
>
> I am concerned because this could mean that the __pfx stubs add
> substantial overhead elsewhere, unless this just happens to be a
> particularly sensitive case...
>
OK, so v7.1 was released with this sizable performance regression. That
raises the question of how to deal with it.
One option that might be reasonable for -stable is to simply add back 16
bytes of NOPs into the assembly file. However, that is obviously not a
long term fix.
Any thoughts?
-hpa
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 0:19 ` H. Peter Anvin
@ 2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: H. Peter Anvin @ 2026-06-15 2:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML
[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]
On 2026-06-14 17:19, H. Peter Anvin wrote:
>
> OK, so v7.1 was released with this sizable performance regression. That
> raises the question of how to deal with it.
>
> One option that might be reasonable for -stable is to simply add back 16
> bytes of NOPs into the assembly file. However, that is obviously not a
> long term fix.
>
Okay, here is a hack that actually generates the proper alignment, and
it DOES in fact fix the performance regression.
It uses the same hack as the Makefile to deal with function alignment
with a prefix: it adds unnecessary NOPs so that the pre-alignment and
post-alignment are the same. At the end of the day this really ought to
be fixed in gcc.
This is not meant to be a final patch; this should go in a header file
and be cleaned up etc, but I wanted to confirm that it does, in fact,
fix the regression and that the alignment of x64_sys_call is the root
cause of the problem.
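As a rough illustration of the padding arithmetic the macros in the attached diff are meant to perform, here is a standalone sketch. The two constants stand in for CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES and are assumed to be 16 each (the non-CFI x86_64 case); the function name is hypothetical.

```c
/* Assumed values, standing in for the kernel config symbols. */
#define FUNCTION_ALIGNMENT	16u
#define FUNCTION_PADDING_BYTES	16u

/*
 * Number of prefix NOP bytes to request so that, when the prefix
 * start is aligned to `want` bytes, the entry point is too.
 * Requests below the default alignment are raised to the default.
 */
static unsigned int align_func_padding(unsigned int want)
{
	unsigned int x = want < FUNCTION_ALIGNMENT ? FUNCTION_ALIGNMENT : want;

	return x - FUNCTION_ALIGNMENT + FUNCTION_PADDING_BYTES;
}
```

For a 32-byte request this gives 32 bytes of prefix, so a 32-byte aligned stub is followed by a 32-byte aligned entry point; for requests at or below the default it falls back to the default padding.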
PeterZ: at some point you and I talked about the following:
- Should x64_sys_call() be noinstr?
- If so, any reason we can't inline it into do_syscall_64()?
- Since we no longer use the sys_call_table[] as a jump table,
do we actually need array_index_nospec() in do_syscall_x64|32?
-hpa
[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1428 bytes --]
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..337e3e53d262 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -9,6 +9,14 @@
#include <linux/nospec.h>
#include <asm/syscall.h>
+#ifdef CONFIG_CALL_PADDING
+# define _pfe(x) __attribute((patchable_function_entry(x,x)))
+#else
+# define _pfe(x)
+#endif
+#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
+
#define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
#define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
#include <asm/syscalls_64.h>
@@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
#undef __SYSCALL
#define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
#include <asm/syscalls_64.h>
@@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
}
#ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
{
switch (nr) {
#include <asm/syscalls_x32.h>
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
@ 2026-06-15 3:41 ` Linus Torvalds
2026-06-15 18:30 ` H. Peter Anvin
2026-06-16 7:38 ` Peter Zijlstra
2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2026-06-15 3:41 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML
On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>
> - Since we no longer use the sys_call_table[] as a jump table,
> do we actually need array_index_nospec() in do_syscall_x64|32?
Well, gcc will still generate a jump table from it when retpolines
aren't enabled.
So I think we do want that array_index_nospec. It should be cheap
insurance against the simplest kinds of speculation issues.
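For readers who have not looked at it, the idea behind array_index_nospec() is a branchless clamp of the index. The following is a userspace sketch modeled on the generic C fallback, not the x86 implementation; the function names are made up, and it assumes that right-shifting a negative signed long is an arithmetic shift (true for gcc and clang).

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All ones when index < size, zero otherwise, computed without a
 * conditional branch that could be mispredicted.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
}

/* In-bounds indices pass through unchanged; anything else becomes 0. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}
```

The cost is a handful of ALU operations, which is why it can be described as cheap insurance.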
Linus
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 3:41 ` Linus Torvalds
@ 2026-06-15 18:30 ` H. Peter Anvin
2026-06-16 7:12 ` Peter Zijlstra
0 siblings, 1 reply; 18+ messages in thread
From: H. Peter Anvin @ 2026-06-15 18:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML
On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> - Since we no longer use the sys_call_table[] as a jump table,
>> do we actually need array_index_nospec() in do_syscall_x64|32?
>
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
>
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
>
Well, we could put it under an #ifdef by adding a macro to detect when we
use -fno-jump-tables. PeterZ and I have also been talking about making
-fno-jump-tables unconditional, because at some point we found that the
performance difference was negligible, at least when
array_index_nospec() is necessary, and it makes it a lot easier to tune
when you don't have to deal with code bases that compile. It is not just
retpoline but also IBT (although the comment says "for now"); this of
course means in practice that the kernels everyone uses are compiled
without jump tables.
The system call dispatch is really the biggest case here.
It does, however, make me think that using regs->ax to dispatch system
calls in the FRED path might actually be The Wrong Thing[TM]; FRED
delivery is a speculation barrier and so %rax is guaranteed to be stable
at that point. *In practice* the stack engine probably would propagate
that (I can't really think of any way to implement a stack engine that
wouldn't, and I suspect if it didn't we would have lots of other issues)
but instead of dumping it into memory and reading it back it probably
would be better to do what the SYSCALL path does and move it into an
argument register instead.
I have experimented with micro-optimizations of the FRED path lately, in
part because FRED inherently does provide speculation guarantees that
SYSCALL/SYSRET do not, in part because some of the code paths have a
fair bit of unnecessary overhead in general, some of which affects
FRED disproportionately (some duplicates work that FRED does inherently,
for one thing.) So far I have been somewhat surprised how *little*
effect some of them have had; clearly branch prediction does a really
good job sometimes even without static branches.
Still, some pretty simple changes can get a few percent improvement,
well above the statistical noise margin.
Doing a *very* early-out and dispatching do_syscall_64() already in
asm_entry_point_user is one of the more effective hacks; I am (or
rather, was, until I discovered this immediate issue ;) also
experimenting with having separate IDT and FRED versions of
do_syscall_64() -- the code factors very cleanly and the duplication is
nearly all at the object code level.
Part of the reason for my questions to PeterZ is that I believe inlining
x64_sys_call() will benefit a fair bit from better code layout. We have
talked about sunsetting x32, but until we do, merging x32_sys_call()
into the same function also ends up with the two switch statements being
able to share a fair bit of code, since there are large contiguous
chunks of x32 system call space which are the same as x64.
One of the things I have been thinking about, too, is to move FRED- and
IDT-specific code into separate text sections; not only so that they can
be close together in memory, but also so that we can poison out the
areas that aren't being used. Every code flow that has almost unlimited
versatility is, obviously, *extremely* desirable as targets for
execution redirection attacks...
-hpa
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 18:30 ` H. Peter Anvin
@ 2026-06-16 7:12 ` Peter Zijlstra
0 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML
On Mon, Jun 15, 2026 at 11:30:11AM -0700, H. Peter Anvin wrote:
> Well, we could put it under an #ifdef by adding a macro to detect when we use
> -fno-jump-tables. PeterZ and I have also been talking about making
> -fno-jump-tables unconditional, because at some point we found that the
> performance difference was negligible, at least when array_index_nospec() is
> necessary, and it makes it a lot easier to tune when you don't have to deal
> with code bases that compile. It is not just retpoline but also IBT
> (although the comment says "for now"); this of course means in practice that
> the kernels everyone uses are compiled without jump tables.
The IBT thing is because GCC (and I assume, but haven't checked, clang
too) generated NOTRACK prefixes for jump tables. And we have explicitly
disallowed NOTRACK for kernel IBT.
The "not yet" pertains to the compilers being changed to not use
NOTRACK; but I don't think this is anything anybody is actively chasing
up on.
So yeah, effectively jump-tables are disabled for everybody.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
@ 2026-06-16 7:38 ` Peter Zijlstra
2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:38 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML
On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:
> PeterZ: at some point you and I talked about the following:
>
> - Should x64_sys_call() be noinstr?
I still think it should be, yes. But I also think it wants __noendbr,
there is no sane reason you should ever be allowed to do an indirect
call to this.
Realistically, objtool will seal this function (scribble the ENDBR), but
really, it just shouldn't be there to begin with.
> - If so, any reason we can't inline it into do_syscall_64()?
Code gen, GCC makes a mess out of things if you do that. x64_sys_call()
now ends up being a giant pile of tail-calls. If you inline it into
do_syscall_x64() that goes out the window.
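A toy version of the shape being described, for illustration only: every case of the switch does nothing but return the result of a direct call, so an optimizing compiler is free to emit each one as a tail call (a jump rather than a call plus return). The struct, the handlers and demo_sys_call() are all invented stand-ins, not kernel code.

```c
#include <errno.h>

/* Stand-in for struct pt_regs; purely illustrative. */
struct demo_regs {
	long arg;
};

static long demo_sys_inc(const struct demo_regs *regs) { return regs->arg + 1; }
static long demo_sys_dbl(const struct demo_regs *regs) { return regs->arg * 2; }
static long demo_sys_ni(const struct demo_regs *regs) { (void)regs; return -ENOSYS; }

/* Same pattern as the __SYSCALL() macro: one direct call per case. */
#define DEMO_SYSCALL(nr, sym) case nr: return sym(regs);

long demo_sys_call(const struct demo_regs *regs, unsigned int nr)
{
	switch (nr) {
	DEMO_SYSCALL(0, demo_sys_inc)
	DEMO_SYSCALL(1, demo_sys_dbl)
	default:
		return demo_sys_ni(regs);
	}
}
```

Once such a function is inlined into a caller that still has work to do after the call returns, the calls are no longer in tail position, which is the property that is lost.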
> - Since we no longer use the sys_call_table[] as a jump table,
> do we actually need array_index_nospec() in do_syscall_x64|32?
It would mean unconditionally disabling jump-tables -- at least for this
TU, but possibly for the whole thing (mixed compiler flags and LTO is a
pain you don't need IIRC).
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
2026-06-16 7:38 ` Peter Zijlstra
@ 2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:53 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML
On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:
> It uses the same hack as the Makefile to deal with function alignment with a
> prefix: it adds unnecessary NOPs so that the pre-alignment and
> post-alignment are the same. At the end of the day this really ought to be
> fixed in gcc.
And clang, but I don't think they can, it wrecks the 'ABI' they have in
place with the current set of arguments. Which I agree is somewhat
unfortunate, but it is what it is.
> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 71f032504e73..337e3e53d262 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -9,6 +9,14 @@
> #include <linux/nospec.h>
> #include <asm/syscall.h>
>
> +#ifdef CONFIG_CALL_PADDING
> +# define _pfe(x) __attribute((patchable_function_entry(x,x)))
> +#else
> +# define _pfe(x)
> +#endif
> +#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
> +#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
> +
> #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
> #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
> #include <asm/syscalls_64.h>
> @@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
> #undef __SYSCALL
>
> #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> -long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> {
> switch (nr) {
> #include <asm/syscalls_64.h>
> @@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> }
>
> #ifdef CONFIG_X86_X32_ABI
> -long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> {
> switch (nr) {
> #include <asm/syscalls_x32.h>
This more or less works by accident, in general your align_func() macro
is horrendously broken when you consider kCFI. By changing the
patchable_function_entry attribute like this, the kCFI hash ends up at a
different location and things go side-ways really really fast.
The only reason it works here is that this function is never indirectly
called and so the kCFI ABI violation is immaterial.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
2026-06-15 0:19 ` H. Peter Anvin
@ 2026-06-16 8:28 ` Peter Zijlstra
2026-06-16 8:46 ` Linus Torvalds
2 siblings, 1 reply; 18+ messages in thread
From: Peter Zijlstra @ 2026-06-16 8:28 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML
On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:
> OK, I have, I believe root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
>
> I am concerned because this could mean that the __pfx stubs add substantial
> overhead elsewhere, unless this just happens to be a particularly sensitive
> case...
So what is the actual alignment requirement these days then? We're
building the (x86_64) kernel with 16 byte function and 1 byte jump
alignment.
So ISTR the Intel I-fetch window was 16 bytes, so the above things would
make sense. However, Gemini, or whatever AI sits in google search, is
trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
and later.
This all seems to suggest we do something like so, hmm?
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b9f5a4a3cc2a..65fff65271d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -329,7 +329,9 @@ config X86
select HAVE_ARCH_KCSAN if X86_64
select PROC_PID_ARCH_STATUS if PROC_FS
select HAVE_ARCH_NODE_DEV_GROUP if X86_SGX
- select FUNCTION_ALIGNMENT_16B if X86_64 || X86_ALIGNMENT_16
+ # AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
+ select FUNCTION_ALIGNMENT_32B if X86_64
+ select FUNCTION_ALIGNMENT_16B if X86_ALIGNMENT_16
select FUNCTION_ALIGNMENT_4B
imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 8:28 ` Peter Zijlstra
@ 2026-06-16 8:46 ` Linus Torvalds
0 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2026-06-16 8:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML
On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
to not be 64-byte aligned - simply because you may need to fetch more
cachelines (assuming fairly linear code).
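A small sketch of that counting argument, assuming straight-line code of a given length; the helper name is made up, and the block size can be a 64-byte cacheline or a 16- or 32-byte fetch window.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Number of `block`-sized, `block`-aligned chunks that the byte range
 * [start, start + len) overlaps.
 */
static size_t blocks_touched(uintptr_t start, size_t len, size_t block)
{
	if (len == 0)
		return 0;

	return (start + len - 1) / block - start / block + 1;
}
```

For example, 100 bytes of code starting on a 64-byte boundary touch two cachelines, while the same code starting at offset 48 touches three.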
And afaik, some of the newer ones aren't 32-byte wide, but can do 48
bytes as three 16-byte fetches.
But I don't know if they can do the old "split line access" that older
cores could do, where a Pentium would do two 8-byte accesses at the
same time, and they didn't have to be in the same cache line.
So 64-byte alignment would always be the best option if you only look
at a *particular* piece of code.
But it obviously is very wasteful and hurts when there is code around
it that could be loaded into the cache at the same time.
So almost certainly not a good idea in general.
But 64-byte alignment is probably what things like interrupt and
system call entrypoints should use, because those things would make
sense to look at as isolated things, not part of a "bigger load". And
they are quite likely to start from a fairly cold-cache situation.
So *not* some general compiler option in a config file, but maybe a
special "entry point alignment" macro?
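Something along these lines, as a userspace sketch only: ENTRY_POINT_ALIGNED is a made-up name, not an existing kernel macro, and this does nothing about the __pfx padding problem discussed earlier in the thread, where it is the prefix rather than the entry that ends up aligned.

```c
#include <stdint.h>

/* Hypothetical per-function annotation; 64 is the cacheline size. */
#define ENTRY_POINT_ALIGNED __attribute__((aligned(64)))

/* Only functions that opt in pay for the extra padding. */
ENTRY_POINT_ALIGNED long demo_entry(long x)
{
	return x + 1;
}

/* Check the placement the toolchain actually chose. */
static int demo_entry_is_aligned(void)
{
	return ((uintptr_t)&demo_entry % 64) == 0;
}
```

The point of making it a per-function annotation is that the rest of the kernel text keeps the denser default alignment.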
Linus
^ permalink raw reply [flat|nested] 18+ messages in thread