8aeb879baf12 - significant system call latency regression, bisected

* 8aeb879baf12 - significant system call latency regression, bisected
@ 2026-06-13  1:45 "H. Peter Anvin" (Intel)
  2026-06-13  8:59 ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: "H. Peter Anvin" (Intel) @ 2026-06-13  1:45 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

So I was trying to figure out a significant -- about 13% -- increase
in system call latency between v7.0 and the current master, and it
bisects down to:

	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build

This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
is a bare metal boot, no KVM.

I'm personally extremely puzzled how this could possibly be related,
and I will be investigating the possibility that this is a false
bisect, but it is not a Heisenbug in any way; it has been extremely
reproducible, and the difference is statistically valid by close to 10
sigma. Furthermore, the bisection at least gave the appearance of
stability.

Given how late in the cycle this is I wanted to send an alert sooner
rather than later; I will update as I get more data.

        -hpa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
@ 2026-06-13  8:59 ` Peter Zijlstra
  2026-06-13 20:34   ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-13  8:59 UTC (permalink / raw)
  To: "H. Peter Anvin" (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> So I was trying to figure out a significant -- about 13% -- increase
> in system call latency between v7.0 and the current master, and it
> bisects down to:
> 
> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> 
> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> is a bare metal boot, no KVM.
> 
> I'm personally extremely puzzled how this could possibly be related,
> and I will be investigating the possibility that this is a false
> bisect, but it is not a Heisenbug in any way; it has been extremely
> reproducible, and the difference is statistically valid by close to 10
> sigma. Furthermore, the bisection at least gave the appearance of
> stability.
> 
> Given how late in the cycle this is I wanted to send an alert sooner
> rather than later; I will update as I get more data.

Uhm, massive WTF indeed. I don't immediately see how this could possibly
affect a FRED host either, except perhaps in code layout.

I don't actually have a FRED capable machine, but have you tried running
one of those top-down perf things on it, to see where it's hurting?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  8:59 ` Peter Zijlstra
@ 2026-06-13 20:34   ` H. Peter Anvin
  2026-06-13 23:52     ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-13 20:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 01:59, Peter Zijlstra wrote:
> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>> So I was trying to figure out a significant -- about 13% -- increase
>> in system call latency between v7.0 and the current master, and it
>> bisects down to:
>> 
>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>> 
>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>> is a bare metal boot, no KVM.
>> 
>> I'm personally extremely puzzled how this could possibly be related,
>> and I will be investigating the possibility that this is a false
>> bisect, but it is not a Heisenbug in any way; it has been extremely
>> reproducible, and the difference is statistically valid by close to 10
>> sigma. Furthermore, the bisection at least gave the appearance of
>> stability.
>> 
>> Given how late in the cycle this is I wanted to send an alert sooner
>> rather than later; I will update as I get more data.
> 
> Uhm, massive WTF indeed. I don't immediately see how this could possibly
> affect a FRED host either, except perhaps in code layout.
> 
> I don't actually have a FRED capable machine, but have you tried running
> one of those top-down perf things on it, to see where it's hurting?

Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)

I reverted the patch on top of rc7, and it did, in fact, fix the regression, but I'm doing a clean from-scratch rebuild of both trees to make sure there isn't anything in my test setup that could introduce any kind of "memory" between builds...


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 20:34   ` H. Peter Anvin
@ 2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14  2:11       ` Calvin Owens
  0 siblings, 2 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-13 23:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 13:34, H. Peter Anvin wrote:
> On 2026-06-13 01:59, Peter Zijlstra wrote:
>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>> So I was trying to figure out a significant -- about 13% -- increase
>>> in system call latency between v7.0 and the current master, and it
>>> bisects down to:
>>>
>>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>
>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>> is a bare metal boot, no KVM.
>>>
>>> I'm personally extremely puzzled how this could possibly be related,
>>> and I will be investigating the possibility that this is a false
>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>> reproducible, and the difference is statistically valid by close to 10
>>> sigma. Furthermore, the bisection at least gave the appearance of
>>> stability.
>>>
>>> Given how late in the cycle this is I wanted to send an alert sooner
>>> rather than later; I will update as I get more data.
>>
>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>> affect a FRED host either, except perhaps in code layout.
>>
>> I don't actually have a FRED capable machine, but have you tried running
>> one of those top-down perf things on it, to see where it's hurting?
> 
> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> 
> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> but I'm doing a clean from-scratch rebuild of both trees to make sure
> there isn't anything in my test setup that could introduce any kind of
> "memory" between builds...

Nope, even with the clean rebuild it is 100% reproducible. It is in fact
worse than I originally stated: the average with 7.1-rc7 is 478±6 cycles
(with the top and bottom octiles removed as outlier protection); with
7.1-rc7 with the above patch reverted it is 397.5±0.4 cycles. This is in
fact a 20% increase in latency, not 13%...
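
For anyone wanting to reproduce the statistics, here is a minimal sketch of
the kind of calculation described above: sort the samples, drop the lowest
and highest eighth, and average the rest. The function names are
hypothetical and this is not the actual benchmark harness, which was not
posted; it only illustrates the octile trimming and the percentage
arithmetic.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Hypothetical helper: sorts the samples in place, drops the lowest
 * and highest n/8 values (the bottom and top octiles), and returns
 * the mean of what remains. Assumes n > 0.
 */
double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;
	double sum = 0.0;

	qsort(samples, n, sizeof(*samples), cmp_double);
	for (size_t i = cut; i < n - cut; i++)
		sum += samples[i];

	return sum / (double)(n - 2 * cut);
}

/* Relative increase of 'after' over 'before', in percent. */
double percent_increase(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the two averages quoted above, percent_increase(397.5, 478.0) comes
out at a little over 20%, which is where the corrected figure comes from.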

	-hpa


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
@ 2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
                           ` (2 more replies)
  2026-06-14  2:11       ` Calvin Owens
  1 sibling, 3 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-14  1:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 16:52, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>> wrote:
>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>> in system call latency between v7.0 and the current master, and it
>>>> bisects down to:
>>>>
>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>
>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>> is a bare metal boot, no KVM.
>>>>
>>>> I'm personally extremely puzzled how this could possibly be related,
>>>> and I will be investigating the possibility that this is a false
>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>> reproducible, and the difference is statistically valid by close to 10
>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>> stability.
>>>>
>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>> rather than later; I will update as I get more data.
>>>
>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>> affect a FRED host either, except perhaps in code layout.
>>>
>>> I don't actually have a FRED capable machine, but have you tried running
>>> one of those top-down perf things on it, to see where it's hurting?
>>
>> Not yet, but I'm investigating right now (I have some family 
>> obligations this weekend, so my duty cycle is somewhat limited.)
>>
>> I reverted the patch on top of rc7, and it did, in fact, fix the
>> regression, but I'm doing a clean from-scratch rebuild of both trees to
>> make sure there isn't anything in my test setup that could introduce
>> any kind of "memory" between builds...
>
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact 
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles 
> (with the top and bottom octiles removed as outlier protection); with 
> 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact 
> a 20% increase in latency, not 13%...
> 

OK, I have, I believe, root-caused this.

It is a padding issue; the code removed by that commit shifts
__pfx_x64_sys_call onto a 32-byte boundary, with the result that
x64_sys_call itself gets *mis*aligned.

Reverting the patch but adding an alignment statement to x64_sys_call 
re-introduces the performance regression.

I am concerned because this could mean that the __pfx stubs add 
substantial overhead elsewhere, unless this just happens to be a 
particularly sensitive case...
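
To make the arithmetic concrete, here is a small sketch of why aligning
the prefix symbol misaligns the function entry. The 16-byte padding and
the addresses are assumptions chosen for illustration, not values read
out of the affected build, and entry_misalignment() is a hypothetical
helper.

```c
#include <stdint.h>

/*
 * Hypothetical helper: given the address of the __pfx_ padding start,
 * the number of padding bytes in front of the function entry, and the
 * alignment of interest, return how far the entry point sits past the
 * previous 'align'-byte boundary. 0 means the entry is aligned.
 */
unsigned int entry_misalignment(uintptr_t prefix_addr, unsigned int pad,
				unsigned int align)
{
	return (unsigned int)((prefix_addr + pad) % align);
}
```

With an assumed 16 bytes of padding, a prefix at 0x1000 (32-byte aligned)
puts the entry at 0x1010, i.e. 16 bytes into a 32-byte block, while a
prefix at 0x1010 puts the entry at 0x1020, which is aligned.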

	-hpa


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14 18:08         ` Xin Li
  2026-06-14 18:31           ` H. Peter Anvin
  2026-06-15  0:19         ` H. Peter Anvin
  2026-06-16  8:28         ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Xin Li @ 2026-06-14 18:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML


> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> 
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>> 
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>> 
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>> 
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>> 
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>> 
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>> 
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>> 
>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>> 
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.


Does the problem not happen with IDT?


> 
> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...


Good point; an alignment check should be applied to all such entries.

Thanks
   Xin

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14 18:08         ` Xin Li
@ 2026-06-14 18:31           ` H. Peter Anvin
  0 siblings, 0 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-14 18:31 UTC (permalink / raw)
  To: Xin Li
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML

On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@zytor.com> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> 
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>> 
>>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>> 
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>> 
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>> 
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>> 
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>> 
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where it's hurting?
>>>> 
>>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>> 
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>>> there isn't anything in my test setup that could introduce any kind of
>>>> "memory" between builds...
>>>
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>> 
>> OK, I have, I believe root-caused this.
>> 
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>> 
>> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
>
>
>The problem doesn’t happen to IDT?
>
>
>> 
>> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
>
>
>Good point, alignment check should be applied to all such entries.
>
>Thanks
>   Xin

The problem is that if you put an alignment directive on a function, it aligns the __pfx stub, which is exactly The Wrong Thing™.

Otherwise this would be easy to fix, permanently. 

I haven't had time to test IDT yet. I assume it is similar.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
@ 2026-06-15  0:19         ` H. Peter Anvin
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-16  8:28         ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 18:50, H. Peter Anvin wrote:
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>>> wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could 
>>>> possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried 
>>>> running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family 
>>> obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the
>>> regression, but I'm doing a clean from-scratch rebuild of both trees to
>>> make sure there isn't anything in my test setup that could introduce
>>> any kind of "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in 
>> fact worse than I originally stated: the average with 7.1rc7 is 478±6 
>> cycles (with the top and bottom octiles removed as outlier 
>> protection); with 7.1rc7 with the above patch reverted it is 
>> 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to 
> be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call 
> re-introduces the performance regression.
> 
> I am concerned because this could mean that the __pfx stubs add 
> substantial overhead elsewhere, unless this just happens to be a 
> particularly sensitive case...
> 

OK, so v7.1 was released with this sizable performance regression. That
raises the question of how to deal with it.

One option that might be reasonable for -stable is to simply add back 16 
bytes of NOPs into the assembly file. However, that is obviously not a 
long term fix.

Any thoughts?

	-hpa


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  0:19         ` H. Peter Anvin
@ 2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
                               ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On 2026-06-14 17:19, H. Peter Anvin wrote:
> 
> OK, so v7.1 was released with this sizable performance regression. That 
> begs the question how to deal with it.
> 
> One option that might be reasonable for -stable is to simply add back 16 
> bytes of NOPs into the assembly file. However, that is obviously not a 
> long term fix.
> 

Okay, here is a hack that actually generates the proper alignment, and 
it DOES in fact fix the performance regression.

It uses the same hack as the Makefile to deal with function alignment 
with a prefix: it adds unnecessary NOPs so that the pre-alignment and 
post-alignment are the same. At the end of the day this really ought to 
be fixed in gcc.

This is not meant to be a final patch; this should go in a header file 
and be cleaned up etc, but I wanted to confirm that it does, in fact, 
fix the regression and that the alignment of x64_sys_call is the root 
cause of the problem.
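
As a rough illustration of the arithmetic the align_func() macro in the
attached diff performs, the sketch below computes the number of prefix
NOP bytes requested and the resulting offset of the function entry. The
DEMO_ constants are assumed values standing in for
CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES (both taken
to be 16 here), the helper names are hypothetical, and one-byte NOPs are
assumed as on x86.

```c
/* Assumed stand-ins for the Kconfig values; not read from a real .config. */
#define DEMO_FUNCTION_ALIGNMENT		16
#define DEMO_FUNCTION_PADDING_BYTES	16

/* Mirror of align_func(): never go below the default function alignment. */
static unsigned int demo_effective_align(unsigned int x)
{
	return x < DEMO_FUNCTION_ALIGNMENT ? DEMO_FUNCTION_ALIGNMENT : x;
}

/* Number of NOP bytes requested in front of the function entry. */
unsigned int demo_prefix_nops(unsigned int x)
{
	x = demo_effective_align(x);
	return x - DEMO_FUNCTION_ALIGNMENT + DEMO_FUNCTION_PADDING_BYTES;
}

/*
 * Offset of the entry point within an x-byte block, given that the
 * compiler aligns the start of the prefix to x. 0 means the entry is
 * aligned as well.
 */
unsigned int demo_entry_offset(unsigned int x)
{
	x = demo_effective_align(x);
	return demo_prefix_nops(x) % x;
}
```

Under these assumptions align_func(32) asks for 32 prefix bytes, so the
prefix and the entry point both land on 32-byte boundaries; the price is
16 bytes of extra NOPs per function so annotated.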

PeterZ: at some point you and I talked about the following:

- Should x64_sys_call() be noinstr?
- If so, any reason we can't inline it into do_syscall_64()?
- Since we no longer use the sys_call_table[] as a jump table,
   do we actually need array_index_nospec() in do_syscall_x64|32?

	-hpa

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1428 bytes --]

diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..337e3e53d262 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -9,6 +9,14 @@
 #include <linux/nospec.h>
 #include <asm/syscall.h>
 
+#ifdef CONFIG_CALL_PADDING
+# define _pfe(x) __attribute((patchable_function_entry(x,x)))
+#else
+# define _pfe(x)
+#endif
+#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
+
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
 #include <asm/syscalls_64.h>
@@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef  __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_64.h>
@@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_x32.h>

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
@ 2026-06-15  3:41             ` Linus Torvalds
  2026-06-15 18:30               ` H. Peter Anvin
  2026-06-16  7:38             ` Peter Zijlstra
  2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2026-06-15  3:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>
> - Since we no longer use the sys_call_table[] as a jump table,
>    do we actually need array_index_nospec()? in do_syscall_x64|32?

Well, gcc will still generate a jump table from it when retpolines
aren't enabled.

So I think we do want that array_index_nospec. It should be cheap
insurance against the simplest kinds of speculation issues.
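
For readers who have not looked at it, the idea behind
array_index_nospec() can be sketched in user space roughly as below. This
is modelled, as far as I recall, on the generic C fallback for the index
mask; x86 uses its own asm variant, and the demo_ names are hypothetical.
It is an illustration of the branch-free clamp only, not a hardened
implementation, since nothing here stops an optimizing compiler from
reintroducing a branch.

```c
#include <limits.h>

/*
 * All ones when index < size, zero otherwise, computed without a
 * conditional branch. Relies on arithmetic right shift of negative
 * signed values (what GCC and Clang do; implementation-defined in
 * ISO C) and on size being no larger than LONG_MAX.
 */
static unsigned long demo_index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Returns index unchanged if it is in bounds, 0 otherwise. */
unsigned long demo_index_nospec(unsigned long index, unsigned long size)
{
	return index & demo_index_mask(index, size);
}
```

An out-of-range index is forced to 0 even on a mispredicted bounds check,
which is why the cost is a couple of ALU operations rather than a fence.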

              Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  3:41             ` Linus Torvalds
@ 2026-06-15 18:30               ` H. Peter Anvin
  2026-06-16  7:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> - Since we no longer use the sys_call_table[] as a jump table,
>>     do we actually need array_index_nospec()? in do_syscall_x64|32?
> 
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
> 
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
> 

Well, we could put it under an #ifdef by adding a macro to detect when we
use -fno-jump-tables. PeterZ and I have also been talking about making
-fno-jump-tables unconditional, because at some point we found that the 
performance difference was negligible, at least when 
array_index_nospec() is necessary, and it makes it a lot easier to tune 
when you don't have to deal with code bases that compile. It is not just 
retpoline but also IBT (although the comment says "for now"); this of 
course means in practice that the kernels everyone uses are compiled 
without jump tables.

The system call dispatch is really the biggest case here.

It does, however, make me think that using regs->ax to dispatch system
calls in the FRED path might actually be The Wrong Thing[TM]; FRED
delivery is a speculation barrier and so %rax is guaranteed to be stable 
at that point. *In practice* the stack engine probably would propagate 
that (I can't really think of any way to implement a stack engine that 
wouldn't, and I suspect if it didn't we would have lots of other issues) 
but instead of dumping it into memory and reading it back it probably 
would be better to do what the SYSCALL path does and move it into an 
argument register instead.

I have experimented with micro-optimizations of the FRED path lately, in 
part because FRED inherently does provide speculation guarantees that 
SYSCALL/SYSRET do not, in part because some of the code paths have a
fair bit of unnecessary overhead in general, some of which affects
FRED disproportionately (some of it duplicates work that FRED does
inherently, for one thing). So far I have been somewhat surprised how *little*
effect some of them have had; clearly branch prediction does a really 
good job sometimes even without static branches.

Still, some pretty simple changes can get a few percent improvement, 
well above the statistical noise margin.

Doing a *very* early-out and dispatching do_syscall_64() already in 
asm_entry_point_user is one of the more effective hacks; I am (or
rather, was, until I discovered this immediate issue ;) also
experimenting with having separate IDT and FRED versions of 
do_syscall_64() -- the code factors very cleanly and the duplication is 
nearly all at the object code level.

Part of the reason for my questions to PeterZ is that I believe inlining
x64_sys_call() will benefit a fair bit from better code layout. We have
talked about sunsetting x32, but until we do, merging x32_sys_call() 
into the same function also ends up with the two switch statements being 
able to share a fair bit of code, since there are large contiguous 
chunks of x32 system call space which are the same as x64.

One of the things I have been thinking about, too, is to move FRED- and 
IDT-specific code into separate text sections; not only so that they can 
be close together in memory, but also so that we can poison out the 
areas that aren't being used. Every code flow that has almost unlimited
versatility is, obviously, *extremely* desirable as a target for
execution redirection attacks...

	-hpa

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15 18:30               ` H. Peter Anvin
@ 2026-06-16  7:12                 ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Mon, Jun 15, 2026 at 11:30:11AM -0700, H. Peter Anvin wrote:

> Well, we could put it under an #ifdef by putting macro to detect when we use
> -fno-jump-tables. PeterZ and I have also been talking about making
> -fno-jump-tables unconditional, because at some point we found that the
> performance difference was negligible, at least when array_index_nospec() is
> necessary, and it makes it a lot easier to tune when you don't have to deal
> with code bases that compile. It is not just retpoline but also IBT
> (although the comment says "for now"); this of course means in practice that
> the kernels everyone uses are compiled without jump tables.

The IBT thing is because GCC (and I assume, but haven't checked, clang
too) generated NOTRACK prefixes for jump tables. And we have explicitly
disallowed NOTRACK for kernel IBT.

The "not yet" pertains to the compilers being changed to not use
NOTRACK; but I don't think this is anything anybody is actively chasing
up on.

So yeah, effectively jump-tables are disabled for everybody.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
@ 2026-06-16  7:38             ` Peter Zijlstra
  2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:

> PeterZ: at some point you and I talked about the following:
> 
> - Should x64_sys_call() be noinstr?

I still think it should be, yes. But I also think it wants __noendbr;
there is no sane reason you should ever be allowed to do an indirect
call to this.

Realistically, objtool will seal this function (scribble the ENDBR), but
really, it just shouldn't be there to begin with.

> - If so, any reason we can't inline it into do_syscall_64()?

Code gen, GCC makes a mess out of things if you do that. x64_sys_call()
now ends up being a giant pile of tail-calls. If you inline it into
do_syscall_x64() that goes out the window.

> - Since we no longer use the sys_call_table[] as a jump table,
>   do we actually need array_index_nospec()? in do_syscall_x64|32?

It would mean unconditionally disabling jump-tables -- at least for this
TU, but possibly for the whole thing (mixed compiler flags and LTO is a
pain you don't need IIRC).
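
For readers unfamiliar with what array_index_nospec() buys: it clamps an index to 0 without a conditional branch, so that a mispredicted bounds check cannot speculatively index out of range. The sketch below is a userspace model of the generic mask trick (in the style of the fallback in include/linux/nospec.h); x86 has its own implementation, and the function names here are made up for the example.

```c
#include <limits.h>

/*
 * Returns ~0UL when index < size and 0 otherwise, without branching.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (implementation-defined in C, but what GCC and clang do).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >>
		(sizeof(long) * CHAR_BIT - 1);
}

/* In-range indexes pass through unchanged; out-of-range ones become 0. */
unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}
```

If the dispatch never uses the number as an array index (no table lookup, no jump table), there is nothing for the clamp to protect, which is the point of the question quoted above.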

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
  2026-06-16  7:38             ` Peter Zijlstra
@ 2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:

> It uses the same hack as the Makefile to deal with function alignment with a
> prefix: it adds unnecessary NOPs so that the pre-alignment and
> post-alignment are the same. At the end of the day this really ought to be
> fixed in gcc.

And clang, but I don't think they can, it wrecks the 'ABI' they have in
place with the current set of arguments. Which I agree is somewhat
unfortunate, but it is what it is.

> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 71f032504e73..337e3e53d262 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -9,6 +9,14 @@
>  #include <linux/nospec.h>
>  #include <asm/syscall.h>
>  
> +#ifdef CONFIG_CALL_PADDING
> +# define _pfe(x) __attribute((patchable_function_entry(x,x)))
> +#else
> +# define _pfe(x)
> +#endif
> +#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
> +#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
> +
>  #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
>  #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
>  #include <asm/syscalls_64.h>
> @@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
>  #undef  __SYSCALL
>  
>  #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> -long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_64.h>
> @@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  }
>  
>  #ifdef CONFIG_X86_X32_ABI
> -long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_x32.h>

This more or less works by accident; in general your align_func() macro
is horrendously broken when you consider kCFI. By changing the
patchable_function_entry attribute like this, the kCFI hash ends up at a
different location and things go sideways really, really fast.

The only reason it works here is that this function is never indirectly
called and so the kCFI ABI violation is immaterial.
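
To make the arithmetic behind the quoted align_func() concrete, here is a minimal sketch. It assumes, as described earlier in the thread, that the compiler aligns the start of the NOP prefix (the __pfx_ symbol) rather than the function entry, and it uses 16 for both CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES purely as example values. The helper names are hypothetical and kCFI is not modeled.

```c
/*
 * Number of prefix NOP bytes the quoted _align_func()/align_func()
 * pair would request for a wanted alignment 'want':
 *   max(want, func_align) - func_align + padding
 */
unsigned int align_func_prefix(unsigned int want, unsigned int func_align,
			       unsigned int padding)
{
	if (want < func_align)
		want = func_align;
	return want - func_align + padding;
}

/*
 * If the start of the prefix sits on an 'align'-byte boundary and
 * 'prefix' bytes of NOPs precede the entry point, this is how far the
 * entry point is from an 'align'-byte boundary (0 means aligned).
 */
unsigned int entry_misalignment(unsigned int align, unsigned int prefix)
{
	return prefix % align;
}
```

With the assumed values, a plain 16-byte prefix under 32-byte alignment leaves the entry at offset 16 (misaligned), while align_func(32) asks for 32 - 16 + 16 = 32 bytes of prefix, which puts the entry back on a 32-byte boundary.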


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
  2026-06-15  0:19         ` H. Peter Anvin
@ 2026-06-16  8:28         ` Peter Zijlstra
  2026-06-16  8:46           ` Linus Torvalds
  2026-06-16 13:53           ` David Laight
  2 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16  8:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:

> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
> 
> I am concerned because this could mean that the __pfx stubs add substantial
> overhead elsewhere, unless this just happens to be a particularly sensitive
> case...

So what is the actual alignment requirement these days then? We're
building the (x86_64) kernel with 16 byte function and 1 byte jump
alignment.

So ISTR the Intel I-fetch window was 16 bytes, so the above things would
make sense. However, Gemini, or whatever AI sits in google search, is
trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
and later.

This all seems to suggest we do something like so, hmm?


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b9f5a4a3cc2a..65fff65271d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -329,7 +329,9 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select PROC_PID_ARCH_STATUS		if PROC_FS
 	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
-	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
+	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
+	select FUNCTION_ALIGNMENT_32B		if X86_64
+	select FUNCTION_ALIGNMENT_16B		if X86_ALIGNMENT_16
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:28         ` Peter Zijlstra
@ 2026-06-16  8:46           ` Linus Torvalds
  2026-06-16  9:51             ` Ingo Molnar
  2026-06-17 12:37             ` Peter Zijlstra
  2026-06-16 13:53           ` David Laight
  1 sibling, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2026-06-16  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
to not be 64-byte aligned - simply because you may need to fetch more
cachelines (assuming fairly linear code).

And afaik, some of the newer ones aren't 32-byte wide, but can do 48
bytes as three 16-byte fetches.

But I don't know if they can do the old "split line access" that older
cores could do, where a Pentium would do two 8-byte accesses at the
same time, and they didn't have to be in the same cache line.

So 64-byte alignment would always be the best option if you only look
at a *particular* piece of code.

But it obviously is very wasteful and hurts when there is code around
it that could be loaded into the cache at the same time.

So almost certainly not a good idea in general.

But 64-byte alignment is probably what things like interrupt and
system call entrypoints should use, because those things would make
sense to look at as isolated things, not part of a bigger load. And
they are quite likely to start from a fairly cold-cache situation.

So *not* some general compiler option in a config file, but maybe a
special "entry point alignment" macro?

             Linus
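
The cacheline argument can be put in numbers with a small sketch. The function name and the example sizes are made up for illustration; it simply counts how many lines a run of straight-line code touches for a given start address.

```c
/*
 * Number of 'line'-byte cache lines touched by 'len' bytes of code
 * starting at byte address 'start'. 'line' must be non-zero.
 */
unsigned long lines_touched(unsigned long start, unsigned long len,
			    unsigned long line)
{
	if (len == 0)
		return 0;
	/* index of the last line minus index of the first line, plus one */
	return (start + len - 1) / line - start / line + 1;
}
```

For example, 100 bytes of linear code starting on a 64-byte boundary span 2 lines, while the same code starting at offset 48 spans 3.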

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:46           ` Linus Torvalds
@ 2026-06-16  9:51             ` Ingo Molnar
  2026-06-16 17:44               ` H. Peter Anvin
  2026-06-17 12:37             ` Peter Zijlstra
  1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2026-06-16  9:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, tglx, mingo, bp,
	Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
> 
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
> 
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
> 
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
> 
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
> 
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
> 
> So almost certainly not a good idea in general.
> 
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load". And
> they are quite likely to start from a fairly cold-cache situation.
> 
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it,
I'm also still somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function
alignment changes, do we know *why* it causes an 80-cycle
shift in system call entry performance?

What does the benchmark measure, cache-cold or cache-hot
execution?

1) Cache-cold performance:

If it is cold-cache performance, does the misaligned case fetch
one more cold cacheline?

From which cache does it miss? Fetching from the 2-4MB Panther Lake
L2 shouldn't be 80 cycles, it should be ~17 cycles.

If it's fetching from the 18MB L3 (which I'd say is the norm for
most workloads), then the L3->L1I latency is around ~55 cycles on
Panther Lake, with everything included.

It cannot really be DRAM latency, ie. true cache-cold latency,
as that would be much more severe, in the 400 cycles range even
with premium DRAM modules - and more like 500 cycles with
mainstream DRAM modules and layouts. (Unless we are *lucky* with
alignment and sizing and the alignment regression doesn't trigger
full DRAM latency.) The on-die DRAM MSC cache's latency should
be around 300 cycles - that too is too high.

2) Cache-hot performance:

While cache-hot performance is less relevant for system calls
(which tend to be cache-cold in practice), if the benchmark
measures cache-hot performance, why is there an 80-cycle shift
from just a single misaligned symbol?

Ie. the specific and rather stable figure of 80 cycles overhead
does not seem to match any of the Panther Lake latencies that
ought to be relevant to this regression, if we use the simplest
mental model of what's going on when alignment changes.

So it is either some other uarch pathology, triggered by bad
alignment, or something doesn't add up in my mental model
of the root cause of this problem. :-)

Side notes:

 - The 6 cycles noise in the 478±6 cycles measurement
   does suggest that we might have missed out to a
   deeper cache hierarchy level, versus the rather
   stable 397.5±0.4 pre-regression figure.

 - I'm also assuming that 'cycles' here is a frequency-invariant
   standardized constant 5.1 GHz TSC value or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  9:51             ` Ingo Molnar
@ 2026-06-16 17:44               ` H. Peter Anvin
  2026-06-17  9:54                 ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-16 17:44 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On June 16, 2026 2:51:12 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
>> > make sense. However, Gemini, or whatever AI sits in google search, is
>> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>> 
>> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
>> to not be 64-byte aligned - simply because you may need to fetch more
>> cachelines (assuming fairly linear code).
>> 
>> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
>> bytes as three 16-byte fetches.
>> 
>> But I don't know if they can do the old "split line access" that older
>> cores could do, where a Pentium would do two 8-byte accesses at the
>> same time, and they didn't have to be in the same cache line.
>> 
>> So 64-byte alignment would always be the best option if you only look
>> at a *particular* piece of code.
>> 
>> But it obviously is very wasteful and hurts when there is code around
>> it that could be loaded into the cache at the same time.
>> 
>> So almost certainly not a good idea in general.
>> 
>> But 64-byte alignment is probably what things like interrupt and
>> system call entrypoints should use, because those things would make
>> sense to look at as isolated things, not part of a bigger load". And
>> they are quite likely to start from a fairly cold-cache situation.
>> 
>> So *not* some general compiler option in a config file, but maybe a
>> special "entry point alignment" macro?
>
>Yeah, agreed on that approach - but before/while we fix it,
>I'm also still somewhat baffled by the numbers hpa reported:
>
>>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>>> increase in latency, not 13%...
>
>Now that we know that this regression is caused by entry function
>alignment changes, do we know *why* it causes a 80 cycles
>shift in system call entry performance?
>
>What does the benchmark measure, cache-cold or cache-hot
>execution?
>
>1) Cache-cold performance:
>
>If it is cold-cache performance, does the misaligned case fetch
>one more cold cacheline?
>
>From which cache does it miss? Fetching from the 2-4MB Panther Lake
>L2 shouldn't be 80 cycles, it should be ~17 cycles.
>
>If it's fetching from the 18MB L3 (which I'd say is the norm for
>most workloads), then the L3->L1I latency is around ~55 cycles on
>Panther Lake, with everything included.
>
>It cannot really be DRAM latency, ie. true cache-cold latency,
>as that would be much more severe, in the 400 cycles range even
>with premium DRAM modules - and more like 500 cycles with
>mainstream DRAM modules and layouts. (Unless we are *lucky* with
>alignment and sizing and the alignment regression doesn't trigger
>full DRAM latency.) The on-die DRAM MSC cache's latency should
>be around 300 cycles - that too is too high.
>
>2) Cache-hot performance:
>
>While cache-hot performance is less relevance for system calls
>(which tend to be cache-cold in practice), if the benchmark
>measures cache-hot performance, why is there a 80 cycles shift
>from just a single misaligned symbol?
>
>Ie. the specific and rather stable figure of 80 cycles overhead
>does not seem to match any of the Panther Lake latencies that
>ought to be relevant to this regression, if we use the simplest
>mental model of what's going on when alignment changes.
>
>So it is either some other uarch pathology, triggered by bad
>alignment, or something doesn't add up in my mental model
>of the root cause of this problem. :-)
>
>Side notes:
>
> - The 6 cycles noise in the 478±6 cycles measurement
>   does suggest that we might have missed out to a
>   deeper cache hierarchy level, versus the rather
>   stable 397.5±0.4 pre-regression figure.
>
> - I'm also assuming that 'cycles' here is a frequency-invariant
>   standardized constant 5.1 GHz TSC value or so?
>
>Thanks,
>
>	Ingo

It's cache hot, calling getppid() in a tight loop. The units are renormalized from TSC cycles to core cycles using fixed counter 1 to determine the actual ratio.
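
The benchmark harness itself is not shown in the thread, so the following is only a sketch of the two post-processing steps mentioned: a mean with the top and bottom octiles dropped, and a rescale from TSC ticks to core cycles using the ratio of two counter deltas taken over the same interval (fixed counter 1 on Intel counts unhalted core cycles). Function and parameter names are hypothetical.

```c
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Mean of the samples with the lowest and highest n/8 values removed.
 * Sorts 'samples' in place. Returns 0 for an empty input.
 */
double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8, i;
	double sum = 0.0;

	if (n == 0)
		return 0.0;
	qsort(samples, n, sizeof(*samples), cmp_double);
	for (i = cut; i < n - cut; i++)
		sum += samples[i];
	return sum / (double)(n - 2 * cut);
}

/*
 * Rescale a TSC-tick measurement to core cycles, given the core-cycle
 * and TSC deltas observed over the same (longer) interval.
 */
double tsc_to_core_cycles(double tsc_ticks, double core_delta, double tsc_delta)
{
	return tsc_ticks * (core_delta / tsc_delta);
}
```

As a sanity check on the figures quoted above: 478 - 397.5 = 80.5 cycles, which is about 20% of 397.5.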

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16 17:44               ` H. Peter Anvin
@ 2026-06-17  9:54                 ` Ingo Molnar
  2026-06-17 10:05                   ` Ingo Molnar
  0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2026-06-17  9:54 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Peter Zijlstra, tglx, mingo, bp,
	Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML

* H. Peter Anvin <hpa@zytor.com> wrote:

> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.

Hm, in that light the 80 cycles overhead from a single
misaligned symbol is rather surprising (to me): it's
way too high to be reasonably caused by any hot cache
alignment effects - and all of the regular instruction
caches (or even data caches) should be more than large
enough to fit such a getppid() benchmark fully into the
cache.

Would be nice to see before/after perf stat --repeat <N>
figures with sufficiently high <N> to get <0.1% stddev?

And just to guess around a bit, here's the various caches,
buffers and queues on a Panther Lake Performance Core
(Cougar Cove) that may play a role:

 - L0 Data Cache (L0D)		   48 KB	   768 cachelines
 - L1 Data Cache (L1D)		  192 KB	 3,072 cachelines
 - L1 Instruction Cache (L1I)	   64 KB	 1,024 cachelines
 - L2 Cache			3,072 KB	49,152 cachelines
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12-way
 - uOP Queue			    -		   192 entries
 - Reorder Buffer (ROB)		    -		   576 entries
 - L1 Data TLB (DTLB)		    -		   128 entries
 - L2 Shared TLB (STLB)		    -		~4,096 entries
 - Return Stack Buffer (RSB)	    -		    24 entries
 - Load Queue			    -		  ~114 entries
 - Store Queue			    -		   ~56 entries

Where all cacheline sizes are 64 bytes, and a uOP cache 'set'
fits up to 6-8 uops.

I think with a cache-hot syscall benchmark we can exclude the
largest caches with over 1,000 effective entries with near
certainty as a factor, so what is left are:

 - L0 Data Cache (L0D)		   48 KB	   768 cachelines
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12-way
 - uOP Queue			    -		   192 entries
 - Reorder Buffer (ROB)		    -		   576 entries
 - L1 Data TLB (DTLB)		    -		   128 entries
 - Return Stack Buffer (RSB)	    -		    24 entries
 - Load Queue			    -		  ~114 entries
 - Store Queue			    -		   ~56 entries

I'd exclude the L0D, L1DTLB, the RSB and the load/store queues
as well, because code alignment of a single symbol should have
a minimal effect on them, which leaves:

 - uOP Queue			    -		   192 entries
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12 way
 - Reorder Buffer (ROB)		    -		   576 entries

And I think of these the main suspect would be the uOP cache,
because its (estimated...) ~10-12 deep associativity limit
of uop-sets may be something this benchmark is hitting on
Panther Lake?

Could it be that the extra alignment adds +1 to the maximum number
of uOP cache 'ways' this execution hits in the uOP cache, moving
it from, say, 12 (still fits) to 13 (misses) so that this particular
uOP cache association depth starts thrashing? But I'm really just
guessing wildly here...

( The extra statistical noise of the regressed figures does suggest
  some sort of thrashing mechanic behind the scenes though, and the
  regular caches seem large enough to not actually thrash for such
  a cache-hot benchmark. )

Or am I missing something obvious?

Any perf stat uOP-related counter measurements might be illuminating.

Thanks,

	Ingo
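
The "12 still fits, 13 misses" intuition can be shown with a toy model of a single cache set. The sketch below assumes true LRU replacement and lines touched round-robin, which is the textbook worst case; the real uOP cache geometry and replacement policy on Panther Lake are not known here, and all names and sizes are illustrative.

```c
#define MAX_WAYS 32

/*
 * Count misses when 'lines' distinct lines that all map to the same set
 * are touched round-robin for 'rounds' passes, in a set with 'ways'
 * entries and true-LRU replacement. Returns 0 for unsupported 'ways'.
 */
unsigned long lru_set_misses(unsigned int ways, unsigned int lines,
			     unsigned int rounds)
{
	int tag[MAX_WAYS];
	unsigned long stamp[MAX_WAYS];
	unsigned long now = 0, misses = 0;
	unsigned int r, l, w;

	if (ways == 0 || ways > MAX_WAYS)
		return 0;
	for (w = 0; w < ways; w++) {
		tag[w] = -1;		/* empty way */
		stamp[w] = 0;		/* never used */
	}
	for (r = 0; r < rounds; r++) {
		for (l = 0; l < lines; l++) {
			unsigned int victim = 0;
			int hit = 0;

			now++;
			for (w = 0; w < ways; w++) {
				if (tag[w] == (int)l) {
					stamp[w] = now;	/* refresh on hit */
					hit = 1;
					break;
				}
				if (stamp[w] < stamp[victim])
					victim = w;	/* least recently used */
			}
			if (!hit) {
				misses++;
				tag[victim] = (int)l;
				stamp[victim] = now;
			}
		}
	}
	return misses;
}
```

In this model, 12 lines in a 12-way set miss only on the first pass, while 13 lines miss on every single access: one extra line turns a fully hot loop into one that never hits.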

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-17  9:54                 ` Ingo Molnar
@ 2026-06-17 10:05                   ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2026-06-17 10:05 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Peter Zijlstra, tglx, mingo, bp,
	Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML


* Ingo Molnar <mingo@kernel.org> wrote:

> I'd exclude the L0D, L1DTLB, the RSB and the load/store queues
> as well, because code alignment of a single symbol should have
> a minimal effect on them, which leaves:
>
>  - uOP Queue			    -		   192 entries
>  - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12 way
>  - Reorder Buffer (ROB)		    -		   576 entries
>
> And I think of these the main suspect would be the uOP cache,
> because its (estimated...) ~10-12 deep associativity limit
> of uop-sets may be something this benchmark is hitting on
> Panther Lake?
>
> Could it be that the extra alignment adds +1 to the maximum number
> of uOP cache 'ways' this execution hits in the uOP cache, moving
> it from, say, 12 (still fits) to 13 (misses) so that this particular
> uOP cache association depth starts thrashing? But I'm really just
> guessing wildly here...
>
> ( The extra statistical noise of the regressed figures does suggest
>   some sort of thrashing mechanic behind the scenes though, and the
>   regular caches seem large enough to not actually thrash for such
>   a cache-hot benchmark. )
>
> Or am I missing something obvious?
>
> Any perf stat uOP related counter measurements might be illuminating.

The relevant uOP cache (Intel DSB) perf stat counters would be:

  starship:~/tip> git grep DSB_ tools/perf/pmu-events/arch/x86/pantherlake/
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json:        "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json:        "EventName": "FRONTEND_RETIRED.DSB_MISS",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json:        "EventName": "IDQ.DSB_CYCLES_ANY",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json:        "EventName": "IDQ.DSB_CYCLES_OK",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json:        "EventName": "IDQ.DSB_UOPS",

In particular FRONTEND_RETIRED.ANY_DSB_MISS and
FRONTEND_RETIRED.DSB_MISS before/after counts?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:46           ` Linus Torvalds
  2026-06-16  9:51             ` Ingo Molnar
@ 2026-06-17 12:37             ` Peter Zijlstra
  1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-17 12:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Tue, Jun 16, 2026 at 02:16:41PM +0530, Linus Torvalds wrote:

> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
> 
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
> 
> So almost certainly not a good idea in general.
> 
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load". And
> they are quite likely to start from a fairly cold-cache situation.
> 
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

This builds with kcfi on and seems to more or less do what is expected.

I've not actually tried performance measurements on my IDT based system.

Obviously this would want splitting into a few patches, but it does:

 - makes -fno-jump-tables unconditional
 - removes array_index_nospec() from the syscall dispatch
 - makes x{32,64}_sys_call() 'static noinstr'
 - adds align_entry attribute that aligns on cacheline boundaries
   and disallows taking address
 - sprinkles align_entry on the noinstr syscall path

---
 arch/x86/Makefile              | 29 +++++++++--------------------
 arch/x86/entry/entry_fred.c    | 12 ++++++------
 arch/x86/entry/syscall_64.c    | 26 +++++++++++++++++++++-----
 arch/x86/include/asm/fred.h    |  5 +++--
 arch/x86/include/asm/syscall.h | 23 ++++++++++++++++++++---
 5 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 598f178102ee..b154a2a20eb2 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -90,17 +90,8 @@ CC_FLAGS_FPU += -mhard-float
 endif
 
 ifeq ($(CONFIG_X86_KERNEL_IBT),y)
-#
-# Kernel IBT has S_CET.NOTRACK_EN=0, as such the compilers must not generate
-# NOTRACK prefixes. Current generation compilers unconditionally employ NOTRACK
-# for jump-tables, as such, disable jump-tables for now.
-#
-# (jump-tables are implicitly disabled by RETPOLINE)
-#
-#   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104816
-#
-KBUILD_CFLAGS += $(call cc-option,-fcf-protection=branch -fno-jump-tables)
-KBUILD_RUSTFLAGS += -Zcf-protection=branch $(if $(call rustc-min-version,109300),-Cjump-tables=n,-Zno-jump-tables)
+KBUILD_CFLAGS += $(call cc-option,-fcf-protection=branch)
+KBUILD_RUSTFLAGS += -Zcf-protection=branch
 else
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
 endif
@@ -173,6 +164,13 @@ endif
         KBUILD_RUSTFLAGS += -Ccode-model=kernel
 
         percpu_seg := gs
+
+	# Due to retpolines and cf-protection=branch's implicit NOTRACK usage
+	# for jump-tables, blanket-disable jump-tables for all x86_64 builds to
+	# get a consistent behaviour across configurations. This allows
+	# removing some array_index_nospec() usage.
+	KBUILD_CFLAGS += -fno-jump-tables
+	KBUILD_RUSTFLAGS += $(if $(call rustc-min-version,109300),-Cjump-tables=n,-Zno-jump-tables)
 endif
 
 ifeq ($(CONFIG_STACKPROTECTOR),y)
@@ -209,15 +207,6 @@ KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
 ifdef CONFIG_MITIGATION_RETPOLINE
   KBUILD_CFLAGS += $(RETPOLINE_CFLAGS)
   KBUILD_RUSTFLAGS += $(RETPOLINE_RUSTFLAGS)
-  # Additionally, avoid generating expensive indirect jumps which
-  # are subject to retpolines for small number of switch cases.
-  # LLVM turns off jump table generation by default when under
-  # retpoline builds, however, gcc does not for x86. This has
-  # only been fixed starting from gcc stable version 8.4.0 and
-  # onwards, but not for older ones. See gcc bug #86952.
-  ifndef CONFIG_CC_IS_CLANG
-    KBUILD_CFLAGS += -fno-jump-tables
-  endif
 endif
 
 ifdef CONFIG_MITIGATION_SLS
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index fb3594ddf731..740fdf9bb08a 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -51,7 +51,7 @@ static noinstr void fred_bad_type(struct pt_regs *regs, unsigned long error_code
 	irqentry_nmi_exit(regs, irq_state);
 }
 
-static noinstr void fred_intx(struct pt_regs *regs)
+static noinstr align_entry void fred_intx(struct pt_regs *regs)
 {
 	switch (regs->fred_ss.vector) {
 	/* Opcode 0xcd, 0x3, NOT INT3 (opcode 0xcc) */
@@ -157,7 +157,7 @@ void __init fred_complete_exception_setup(void)
 	fred_setup_done = true;
 }
 
-static noinstr void fred_extint(struct pt_regs *regs)
+static noinstr align_entry void fred_extint(struct pt_regs *regs)
 {
 	unsigned int vector = regs->fred_ss.vector;
 
@@ -177,7 +177,7 @@ static noinstr void fred_extint(struct pt_regs *regs)
 	}
 }
 
-static noinstr void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
+static noinstr align_entry void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
 {
 	/* Optimize for #PF. That's the only exception which matters performance wise */
 	if (likely(regs->fred_ss.vector == X86_TRAP_PF))
@@ -216,7 +216,7 @@ static noinstr void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
 
 }
 
-static noinstr void fred_swexc(struct pt_regs *regs, unsigned long error_code)
+static noinstr align_entry void fred_swexc(struct pt_regs *regs, unsigned long error_code)
 {
 	switch (regs->fred_ss.vector) {
 	case X86_TRAP_BP: return exc_int3(regs);
@@ -225,7 +225,7 @@ static noinstr void fred_swexc(struct pt_regs *regs, unsigned long error_code)
 	}
 }
 
-__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
+__visible noinstr align_entry void fred_entry_from_user(struct pt_regs *regs)
 {
 	unsigned long error_code = regs->orig_ax;
 
@@ -257,7 +257,7 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
 	return fred_bad_type(regs, error_code);
 }
 
-__visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
+__visible noinstr align_entry void fred_entry_from_kernel(struct pt_regs *regs)
 {
 	unsigned long error_code = regs->orig_ax;
 
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..10654c12dd36 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -8,6 +8,7 @@
 #include <linux/entry-common.h>
 #include <linux/nospec.h>
 #include <asm/syscall.h>
+#include <asm/ibt.h>
 
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
@@ -32,23 +33,40 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef  __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+static noinstr align_entry long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
+	/*
+	 * Because -fno-jump-tables, this compiles into a binary branch tree
+	 * rather than a jump-table. As such @nr is not used as an array
+	 * index. Additionally, this is an out-of-line function on purpose,
+	 * such that all the actual syscall function calls are tail-calls,
+	 * returning to our caller for the common bits.
+	 */
+	instrumentation_begin();
 	switch (nr) {
 	#include <asm/syscalls_64.h>
 	default: return __x64_sys_ni_syscall(regs);
 	}
+	instrumentation_end();
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+static noinstr align_entry long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
+	instrumentation_begin();
 	switch (nr) {
 	#include <asm/syscalls_x32.h>
 	default: return __x64_sys_ni_syscall(regs);
 	}
+	instrumentation_end();
+}
+#else
+static __always_inline long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+{
+	return __x64_sys_ni_syscall(regs);
 }
 #endif
+#undef  __SYSCALL
 
 static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
 {
@@ -59,7 +77,6 @@ static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
 	unsigned int unr = nr;
 
 	if (likely(unr < NR_syscalls)) {
-		unr = array_index_nospec(unr, NR_syscalls);
 		regs->ax = x64_sys_call(regs, unr);
 		return true;
 	}
@@ -76,7 +93,6 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 	unsigned int xnr = nr - __X32_SYSCALL_BIT;
 
 	if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
-		xnr = array_index_nospec(xnr, X32_NR_syscalls);
 		regs->ax = x32_sys_call(regs, xnr);
 		return true;
 	}
@@ -84,7 +100,7 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 }
 
 /* Returns true to return using SYSRET, or false to use IRET */
-__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
+__visible noinstr align_entry bool do_syscall_64(struct pt_regs *regs, int nr)
 {
 	nr = syscall_enter_from_user_mode(regs, nr);
 
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 18a2f811c358..10b8d73e4088 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -11,6 +11,7 @@
 #include <asm/asm.h>
 #include <asm/msr.h>
 #include <asm/trapnr.h>
+#include <asm/syscall.h>
 
 /*
  * FRED event return instruction opcodes for ERET{S,U}; supported in
@@ -67,8 +68,8 @@ void asm_fred_entrypoint_user(void);
 void asm_fred_entrypoint_kernel(void);
 void asm_fred_entry_from_kvm(struct fred_ss);
 
-__visible void fred_entry_from_user(struct pt_regs *regs);
-__visible void fred_entry_from_kernel(struct pt_regs *regs);
+__visible align_entry void fred_entry_from_user(struct pt_regs *regs);
+__visible align_entry void fred_entry_from_kernel(struct pt_regs *regs);
 __visible void __fred_entry_from_kvm(struct pt_regs *regs);
 
 /* Can be called from noinstr code, thus __always_inline */
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index c10dbb74cd00..624e7d6f30a3 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -20,13 +20,30 @@
 typedef long (*sys_call_ptr_t)(const struct pt_regs *);
 extern const sys_call_ptr_t sys_call_table[];
 
+/*
+ * When changing patchable_function_entry for a function, the kCFI ABI is
+ * affected, therefore combine this with __noendbr, which disallows indirect
+ * calls and generates compiler warnings when the address is taken of such a
+ * function.
+ *
+ * This will effectively waste a full cacheline per align_entry user.
+ */
+#ifdef CONFIG_CALL_PADDING
+#define __pfe(x)	__attribute__((patchable_function_entry(x,x))) __noendbr
+#else
+#define __pfe(x)	__noendbr
+#endif
+
+#define __align_entry(x) __aligned(x) \
+	__pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+
+#define align_entry	__align_entry(SMP_CACHE_BYTES)
+
 /*
  * These may not exist, but still put the prototypes in so we
  * can use IS_ENABLED().
  */
 extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
-extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
-extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
 
 /*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
@@ -172,7 +189,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-bool do_syscall_64(struct pt_regs *regs, int nr);
+align_entry bool do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_emulation(struct pt_regs *regs);
 
 #endif	/* CONFIG_X86_32 */
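
For reference, the NOP count that __align_entry() hands to patchable_function_entry can be worked out by hand. The sketch below only evaluates the macro's arithmetic. The configuration values used in the note after it (64-byte cache line, 16-byte function alignment, 16 or 11 padding bytes) are assumed examples, not values taken from this thread, and pfe_bytes() is a hypothetical helper name.

```c
/*
 * Mirrors the expression used by __align_entry(x) in the patch:
 *     x - CONFIG_FUNCTION_ALIGNMENT + CONFIG_FUNCTION_PADDING_BYTES
 * All three inputs are passed in explicitly because the real values
 * depend on the kernel .config (assumption: they are not known here).
 */
int pfe_bytes(int x, int function_alignment, int function_padding_bytes)
{
	return x - function_alignment + function_padding_bytes;
}
```

With the assumed values, pfe_bytes(64, 16, 16) evaluates to 64 and pfe_bytes(64, 16, 11) to 59, which is in line with the comment in the patch that each align_entry user costs roughly one cache line of padding.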


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:28         ` Peter Zijlstra
  2026-06-16  8:46           ` Linus Torvalds
@ 2026-06-16 13:53           ` David Laight
  1 sibling, 0 replies; 24+ messages in thread
From: David Laight @ 2026-06-16 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML

On Tue, 16 Jun 2026 10:28:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:
> 
> > OK, I have, I believe root-caused this.
> > 
> > It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> > 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> > 
> > Reverting the patch but adding an alignment statement to x64_sys_call
> > re-introduces the performance regression.
> > 
> > I am concerned because this could mean that the __pfx stubs add substantial
> > overhead elsewhere, unless this just happens to be a particularly sensitive
> > case...  
> 
> So what is the actual alignment requirement these days then? We're
> building the (x86_64) kernel with 16 byte function and 1 byte jump
> alignment.
> 
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
> 
> That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
> and later.

Basically you can't win.
I was looking at why a patch didn't give the expected performance gain
on a different base kernel build.
It seems to depend on whether the function (actually strlen) was aligned
to an odd or even 16 byte boundary.
If aligned to an even boundary the loop inside the function crossed a
'significant' boundary and the code ran measurably slower.
If you start aligning loop tops and labels in general you probably lose
due to code bloat.
(Here the loop didn't need aligning, it just needed not to contain
the relevant boundary.)

In this case the extra padding will change the alignment of everything that
follows - and some of those might make a difference as well.

You'd need to add extra code further down the function to keep the size
the same (and hope the compiler keeps the functions in the same order).

	David
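
The effect described above (a loop that is fine at one function alignment but straddles a boundary at another) can be checked mechanically. This is a minimal sketch: crosses_boundary() is a hypothetical helper, and both the 32-byte boundary and the loop's offset and length in the note below are made-up illustration values, since the message does not say which boundary is the significant one.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the byte range [start, start + len) spans more than one
 * aligned block of @boundary bytes, i.e. the code straddles a boundary.
 * @len and @boundary must be non-zero.
 */
bool crosses_boundary(uint64_t start, uint64_t len, uint64_t boundary)
{
	uint64_t first_block = start / boundary;
	uint64_t last_block = (start + len - 1) / boundary;

	return first_block != last_block;
}
```

Take a hypothetical 12-byte loop that starts 0x18 bytes into a function. If the function sits at 0x1000 (an even multiple of 16), the loop occupies 0x1018..0x1023 and crosses_boundary(0x1018, 12, 32) is true. If the function sits at 0x1010 (an odd multiple of 16), the loop occupies 0x1028..0x1033 and the same check is false.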


> 
> This all seems to suggest we do something like so, hmm?
> 
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b9f5a4a3cc2a..65fff65271d0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -329,7 +329,9 @@ config X86
>  	select HAVE_ARCH_KCSAN			if X86_64
>  	select PROC_PID_ARCH_STATUS		if PROC_FS
>  	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
> -	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
> +	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
> +	select FUNCTION_ALIGNMENT_32B		if X86_64
> +	select FUNCTION_ALIGNMENT_16B		if X86_ALIGNMENT_16
>  	select FUNCTION_ALIGNMENT_4B
>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> 



* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14  2:11       ` Calvin Owens
  2026-06-14  2:14         ` Calvin Owens
  1 sibling, 1 reply; 24+ messages in thread
From: Calvin Owens @ 2026-06-14  2:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
> > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > in system call latency between v7.0 and the current master, and it
> > > > bisects down to:
> > > > 
> > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > 
> > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > is a bare metal boot, no KVM.
> > > > 
> > > > I'm personally extremely puzzled how this could possibly be related,
> > > > and I will be investigating the possibility that this is a false
> > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > reproducible, and the difference is statistically valid by close to 10
> > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > stability.
> > > > 
> > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > rather than later; I will update as I get more data.
> > > 
> > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > affect a FRED host either, except perhaps in code layout.
> > > 
> > > I don't actually have a FRED capable machine, but have you tried running
> > > one of those top-down perf things on it, to see where it's hurting?
> > 
> > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > 
> > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > there isn't anything in my test setup that could introduce any kind of
> > "memory" between builds...
>
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> with the above patch reverted it is 397.5±0.4. This is in fact a 20%
> increase in latency, not 13%...

It has to be the .text layout, doesn't it?

I notice we're splitting a cache line here now with the prefix symbol,
7.0-rc7 has:

    ffffffff812175f0 <__pfx_x64_sys_call>:
    ffffffff81217600 <x64_sys_call>:

If I revert 8aeb879baf12, I get:

    ffffffff812175c0 <__pfx_x64_sys_call>:
    ffffffff812175d0 <x64_sys_call>:

Could that be it?

Unfortunately I don't have any hardware new enough to poke at it myself.

Cheers,
Calvin
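
The two layouts listed above can be compared with a couple of bit operations. This is a sketch, assuming a 64-byte cache line; offset_in_block() and same_block() are hypothetical helper names, and the addresses in the note below are the ones from the symbol listings in this message.

```c
#include <stdint.h>

/* Offset of @addr inside its aligned block of @size bytes (power of two). */
uint64_t offset_in_block(uint64_t addr, uint64_t size)
{
	return addr & (size - 1);
}

/* Non-zero if both addresses fall into the same aligned @size-byte block. */
int same_block(uint64_t a, uint64_t b, uint64_t size)
{
	return (a / size) == (b / size);
}
```

With the commit applied, x64_sys_call at ffffffff81217600 has offset 0 in its 64-byte line, while __pfx_x64_sys_call at ffffffff812175f0 lies in the previous line, so the prefix and the entry point are split across two lines. With the commit reverted, __pfx_x64_sys_call at ffffffff812175c0 starts a line and x64_sys_call at ffffffff812175d0 sits 16 bytes into the same line.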


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  2:11       ` Calvin Owens
@ 2026-06-14  2:14         ` Calvin Owens
  0 siblings, 0 replies; 24+ messages in thread
From: Calvin Owens @ 2026-06-14  2:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 19:11 -0700, Calvin Owens wrote:
> On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> > On 2026-06-13 13:34, H. Peter Anvin wrote:
> > > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > > in system call latency between v7.0 and the current master, and it
> > > > > bisects down to:
> > > > > 
> > > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > > 
> > > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > > is a bare metal boot, no KVM.
> > > > > 
> > > > > I'm personally extremely puzzled how this could possibly be related,
> > > > > and I will be investigating the possibility that this is a false
> > > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > > reproducible, and the difference is statistically valid by close to 10
> > > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > > stability.
> > > > > 
> > > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > > rather than later; I will update as I get more data.
> > > > 
> > > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > > affect a FRED host either, except perhaps in code layout.
> > > > 
> > > > I don't actually have a FRED capable machine, but have you tried running
> > > > one of those top-down perf things on it, to see where it's hurting?
> > > 
> > > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > > 
> > > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > > there isn't anything in my test setup that could introduce any kind of
> > > "memory" between builds...
> >
> > Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> > worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> > (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> > with the above patch reverted it is 397.5±0.4. This is in fact a 20%
> > increase in latency, not 13%...
> 
> It has to be the .text layout, doesn't it?
> 
> I notice we're splitting a cache line here now with the prefix symbol,
> 7.0-rc7 has:

Whoops, I meant 7.1-rc7.

But seeing your other mail, sounds like this is it :)

>     ffffffff812175f0 <__pfx_x64_sys_call>:
>     ffffffff81217600 <x64_sys_call>:
> 
> If I revert 8aeb879baf12, I get:
> 
>     ffffffff812175c0 <__pfx_x64_sys_call>:
>     ffffffff812175d0 <x64_sys_call>:
> 
> Could that be it?
> 
> Unfortunately I don't have any hardware new enough to poke at it myself.
> 
> Cheers,
> Calvin
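
The cycle counts quoted above are averages with the top and bottom octiles removed. How that was actually computed is not shown in the thread, so the following is only one plausible sketch: sort the samples, drop the lowest and highest eighth, and average the rest. octile_trimmed_mean() and percent_increase() are hypothetical helper names.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparison callback for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Sort @samples in place, discard the lowest and highest n/8 entries and
 * return the mean of what remains. @n must be non-zero.
 */
double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;
	double sum = 0.0;
	size_t i;

	qsort(samples, n, sizeof(*samples), cmp_double);
	for (i = cut; i < n - cut; i++)
		sum += samples[i];

	return sum / (double)(n - 2 * cut);
}

/* Relative change from @before to @after, in percent. */
double percent_increase(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Applied to the two reported averages, percent_increase(397.5, 478.0) comes out at a little over 20, matching the corrected figure in the quoted message.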



Thread overview: 24+ messages
2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
2026-06-13  8:59 ` Peter Zijlstra
2026-06-13 20:34   ` H. Peter Anvin
2026-06-13 23:52     ` H. Peter Anvin
2026-06-14  1:50       ` H. Peter Anvin
2026-06-14 18:08         ` Xin Li
2026-06-14 18:31           ` H. Peter Anvin
2026-06-15  0:19         ` H. Peter Anvin
2026-06-15  2:07           ` H. Peter Anvin
2026-06-15  3:41             ` Linus Torvalds
2026-06-15 18:30               ` H. Peter Anvin
2026-06-16  7:12                 ` Peter Zijlstra
2026-06-16  7:38             ` Peter Zijlstra
2026-06-16  7:53             ` Peter Zijlstra
2026-06-16  8:28         ` Peter Zijlstra
2026-06-16  8:46           ` Linus Torvalds
2026-06-16  9:51             ` Ingo Molnar
2026-06-16 17:44               ` H. Peter Anvin
2026-06-17  9:54                 ` Ingo Molnar
2026-06-17 10:05                   ` Ingo Molnar
2026-06-17 12:37             ` Peter Zijlstra
2026-06-16 13:53           ` David Laight
2026-06-14  2:11       ` Calvin Owens
2026-06-14  2:14         ` Calvin Owens
