8aeb879baf12 - significant system call latency regression, bisected

The Linux Kernel Mailing List

* 8aeb879baf12 - significant system call latency regression, bisected
@ 2026-06-13  1:45 "H. Peter Anvin" (Intel)
  2026-06-13  8:59 ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: "H. Peter Anvin" (Intel) @ 2026-06-13  1:45 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

So I was trying to figure out a significant -- about 13% -- increase
in system call latency between v7.0 and the current master, and it
bisects down to:

	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build

This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
is a bare metal boot, no KVM.

I'm personally extremely puzzled how this could possibly be related,
and I will be investigating the possibility that this is a false
bisect, but it is not a Heisenbug in any way; it has been extremely
reproducible, and the difference is statistically valid by close to 10
sigma. Furthermore, the bisection at least gave the appearance of
stability.

Given how late in the cycle this is I wanted to send an alert sooner
rather than later; I will update as I get more data.

        -hpa

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
@ 2026-06-13  8:59 ` Peter Zijlstra
  2026-06-13 20:34   ` H. Peter Anvin
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2026-06-13  8:59 UTC (permalink / raw)
  To: "H. Peter Anvin" (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> So I was trying to figure out a significant -- about 13% -- increase
> in system call latency between v7.0 and the current master, and it
> bisects down to:
> 
> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> 
> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> is a bare metal boot, no KVM.
> 
> I'm personally extremely puzzled how this could possibly be related,
> and I will be investigating the possibility that this is a false
> bisect, but it is not a Heisenbug in any way; it has been extremely
> reproducible, and the difference is statistically valid by close to 10
> sigma. Furthermore, the bisection at least gave the appearance of
> stability.
> 
> Given how late in the cycle this is I wanted to send an alert sooner
> rather than later; I will update as I get more data.

Uhm, massive WTF indeed. I don't immediately see how this could possibly
affect a FRED host either, except perhaps in code layout.

I don't actually have a FRED capable machine, but have you tried running
one of those top-down perf things on it, to see where it's hurting?

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  8:59 ` Peter Zijlstra
@ 2026-06-13 20:34   ` H. Peter Anvin
  2026-06-13 23:52     ` H. Peter Anvin
  0 siblings, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-13 20:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 01:59, Peter Zijlstra wrote:
> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>> So I was trying to figure out a significant -- about 13% -- increase
>> in system call latency between v7.0 and the current master, and it
>> bisects down to:
>> 
>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>> 
>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>> is a bare metal boot, no KVM.
>> 
>> I'm personally extremely puzzled how this could possibly be related,
>> and I will be investigating the possibility that this is a false
>> bisect, but it is not a Heisenbug in any way; it has been extremely
>> reproducible, and the difference is statistically valid by close to 10
>> sigma. Furthermore, the bisection at least gave the appearance of
>> stability.
>> 
>> Given how late in the cycle this is I wanted to send an alert sooner
>> rather than later; I will update as I get more data.
> 
> Uhm, massive WTF indeed. I don't immediately see how this could possibly
> affect a FRED host either, except perhaps in code layout.
> 
> I don't actually have a FRED capable machine, but have you tried running
> one of those top-down perf things on it, to see where it's hurting?

Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)

I reverted the patch on top of rc7, and it did, in fact, fix the regression, but I'm doing a clean from-scratch rebuild of both trees to make sure there isn't anything in my test setup that could introduce any kind of "memory" between builds...


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 20:34   ` H. Peter Anvin
@ 2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14  2:11       ` Calvin Owens
  0 siblings, 2 replies; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-13 23:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 13:34, H. Peter Anvin wrote:
> On 2026-06-13 01:59, Peter Zijlstra wrote:
>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>> So I was trying to figure out a significant -- about 13% -- increase
>>> in system call latency between v7.0 and the current master, and it
>>> bisects down to:
>>>
>>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>
>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>> is a bare metal boot, no KVM.
>>>
>>> I'm personally extremely puzzled how this could possibly be related,
>>> and I will be investigating the possibility that this is a false
>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>> reproducible, and the difference is statistically valid by close to 10
>>> sigma. Furthermore, the bisection at least gave the appearance of
>>> stability.
>>>
>>> Given how late in the cycle this is I wanted to send an alert sooner
>>> rather than later; I will update as I get more data.
>>
>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>> affect a FRED host either, except perhaps in code layout.
>>
>> I don't actually have a FRED capable machine, but have you tried running
>> one of those top-down perf things on it, to see where it's hurting?
> 
> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> 
> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> but I'm doing a clean from-scratch rebuild of both trees to make sure
> there isn't anything in my test setup that could introduce any kind of
> "memory" between builds...

Nope, even with the clean rebuild it is 100% reproducible. It is in fact
worse than I originally stated: the average with 7.1-rc7 is 478±6 cycles
(with the top and bottom octiles removed as outlier protection); with
the above patch reverted on top of 7.1-rc7 it is 397.5±0.4 cycles. That
is in fact a 20% increase in latency, not 13%...
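
For anyone wanting to redo the arithmetic, here is a minimal sketch in C.
The measurement harness itself is not shown in this thread, so the trimming
rule below (sort, then drop the lowest and highest eighth of the samples) is
only one plausible reading of "top and bottom octiles removed"; the function
names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Mean of the samples after sorting them in place and dropping the
 * lowest and highest eighth (octile). ASSUMPTION: this is a guess at
 * the outlier protection described above, not the actual harness.
 */
static double octile_trimmed_mean(double *samples, size_t n)
{
	size_t drop = n / 8, i;
	double sum = 0.0;

	assert(n > 2 * drop);		/* also rejects n == 0 */
	qsort(samples, n, sizeof(*samples), cmp_double);
	for (i = drop; i < n - drop; i++)
		sum += samples[i];
	return sum / (double)(n - 2 * drop);
}

/* Relative increase of "after" over "before", in percent. */
static double percent_increase(double after, double before)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

With the two averages quoted above, percent_increase(478.0, 397.5) comes
out at roughly 20.25, which matches the 20% figure.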

	-hpa


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
@ 2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
                           ` (2 more replies)
  2026-06-14  2:11       ` Calvin Owens
  1 sibling, 3 replies; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-14  1:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 16:52, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>> wrote:
>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>> in system call latency between v7.0 and the current master, and it
>>>> bisects down to:
>>>>
>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>
>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>> is a bare metal boot, no KVM.
>>>>
>>>> I'm personally extremely puzzled how this could possibly be related,
>>>> and I will be investigating the possibility that this is a false
>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>> reproducible, and the difference is statistically valid by close to 10
>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>> stability.
>>>>
>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>> rather than later; I will update as I get more data.
>>>
>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>> affect a FRED host either, except perhaps in code layout.
>>>
>>> I don't actually have a FRED capable machine, but have you tried running
>>> one of those top-down perf things on it, to see where it's hurting?
>>
>> Not yet, but I'm investigating right now (I have some family 
>> obligations this weekend, so my duty cycle is somewhat limited.)
>>
>> I reverted the patch on top of rc7, and it did, in fact, fix the 
>> regression,
>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>> there isn't anything in my test setup that could introduce any kind of
>> "memory" between builds...
>
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact 
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles 
> (with the top and bottom octiles removed as outlier protection); with 
> 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact 
> a 20% increase in latency, not 13%...
> 

OK, I have, I believe, root-caused this.

It is a padding issue; removing the code changes __pfx_x64_sys_call to 
be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.

Reverting the patch but adding an alignment statement to x64_sys_call 
re-introduces the performance regression.

I am concerned because this could mean that the __pfx stubs add 
substantial overhead elsewhere, unless this just happens to be a 
particularly sensitive case...

	-hpa


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14  2:11       ` Calvin Owens
  2026-06-14  2:14         ` Calvin Owens
  1 sibling, 1 reply; 20+ messages in thread
From: Calvin Owens @ 2026-06-14  2:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
> > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > in system call latency between v7.0 and the current master, and it
> > > > bisects down to:
> > > > 
> > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > 
> > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > is a bare metal boot, no KVM.
> > > > 
> > > > I'm personally extremely puzzled how this could possibly be related,
> > > > and I will be investigating the possibility that this is a false
> > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > reproducible, and the difference is statistically valid by close to 10
> > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > stability.
> > > > 
> > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > rather than later; I will update as I get more data.
> > > 
> > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > affect a FRED host either, except perhaps in code layout.
> > > 
> > > I don't actually have a FRED capable machine, but have you tried running
> > > > one of those top-down perf things on it, to see where it's hurting?
> > 
> > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > 
> > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > there isn't anything in my test setup that could introduce any kind of
> > "memory" between builds...
>
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> increase in latency, not 13%...

It has to be the .text layout, doesn't it?

I notice we're splitting a cache line here now with the prefix symbol,
7.0-rc7 has:

    ffffffff812175f0 <__pfx_x64_sys_call>:
    ffffffff81217600 <x64_sys_call>:

If I revert 8aeb879baf12, I get:

    ffffffff812175c0 <__pfx_x64_sys_call>:
    ffffffff812175d0 <x64_sys_call>:

Could that be it?
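
To make the layout difference easier to see, here is a small sketch that
reduces an address to its offset within a power-of-two block. The helper name
is invented for illustration, and the 64-byte cache line size is an assumption
about the hardware, not something measured here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Byte offset of addr within its align-sized block.
 * align must be a nonzero power of two (32 or 64 in this discussion).
 */
static unsigned int offset_in_block(uint64_t addr, unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (unsigned int)(addr & (align - 1));
}
```

Applied to the symbols above: in the first listing __pfx_x64_sys_call sits at
offset 16 of a 32-byte block (48 of a 64-byte block) and x64_sys_call at
offset 0 of both; with the revert, __pfx_x64_sys_call is at offset 0 and
x64_sys_call at offset 16 of both. This only restates the layout, it does not
say which one should be faster.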

Unfortunately I don't have any hardware new enough to poke at it myself.

Cheers,
Calvin

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  2:11       ` Calvin Owens
@ 2026-06-14  2:14         ` Calvin Owens
  0 siblings, 0 replies; 20+ messages in thread
From: Calvin Owens @ 2026-06-14  2:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 19:11 -0700, Calvin Owens wrote:
> On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> > On 2026-06-13 13:34, H. Peter Anvin wrote:
> > > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > > in system call latency between v7.0 and the current master, and it
> > > > > bisects down to:
> > > > > 
> > > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > > 
> > > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > > is a bare metal boot, no KVM.
> > > > > 
> > > > > I'm personally extremely puzzled how this could possibly be related,
> > > > > and I will be investigating the possibility that this is a false
> > > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > > reproducible, and the difference is statistically valid by close to 10
> > > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > > stability.
> > > > > 
> > > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > > rather than later; I will update as I get more data.
> > > > 
> > > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > > affect a FRED host either, except perhaps in code layout.
> > > > 
> > > > I don't actually have a FRED capable machine, but have you tried running
> > > > > one of those top-down perf things on it, to see where it's hurting?
> > > 
> > > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > > 
> > > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > > there isn't anything in my test setup that could introduce any kind of
> > > "memory" between builds...
> >
> > Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> > worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> > (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> > with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> > increase in latency, not 13%...
> 
> It has to be the .text layout, doesn't it?
> 
> I notice we're splitting a cache line here now with the prefix symbol,
> 7.0-rc7 has:

Whoops, I meant 7.1-rc7.

But seeing your other mail, sounds like this is it :)

>     ffffffff812175f0 <__pfx_x64_sys_call>:
>     ffffffff81217600 <x64_sys_call>:
> 
> If I revert 8aeb879baf12, I get:
> 
>     ffffffff812175c0 <__pfx_x64_sys_call>:
>     ffffffff812175d0 <x64_sys_call>:
> 
> Could that be it?
> 
> Unfortunately I don't have any hardware new enough to poke at it myself.
> 
> Cheers,
> Calvin

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14 18:08         ` Xin Li
  2026-06-14 18:31           ` H. Peter Anvin
  2026-06-15  0:19         ` H. Peter Anvin
  2026-06-16  8:28         ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Xin Li @ 2026-06-14 18:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML


> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> 
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>> 
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>> 
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>> 
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>> 
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>> 
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>> 
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>> 
>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>> 
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.


The problem doesn’t happen with IDT?


> 
> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...


Good point; an alignment check should be applied to all such entries.

Thanks
   Xin

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14 18:08         ` Xin Li
@ 2026-06-14 18:31           ` H. Peter Anvin
  0 siblings, 0 replies; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-14 18:31 UTC (permalink / raw)
  To: Xin Li
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML

On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@zytor.com> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> 
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>> 
>>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>> 
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>> 
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>> 
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>> 
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>> 
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where it's hurting?
>>>> 
>>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>> 
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>>> there isn't anything in my test setup that could introduce any kind of
>>>> "memory" between builds...
>>>
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>> 
>> OK, I have, I believe root-caused this.
>> 
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>> 
>> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
>
>
>The problem doesn’t happen to IDT?
>
>
>> 
>> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
>
>
>Good point, alignment check should be applied to all such entries.
>
>Thanks
>   Xin

The problem is that if you put an alignment directive on a function, it aligns the __pfx stub, which is exactly The Wrong Thing™.

Otherwise this would be easy to fix, permanently. 

I haven't had time to test IDT yet. I assume it is similar.

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
@ 2026-06-15  0:19         ` H. Peter Anvin
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-16  8:28         ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-15  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 18:50, H. Peter Anvin wrote:
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>>> wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could 
>>>> possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried 
>>>> running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family 
>>> obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the 
>>> regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in 
>> fact worse than I originally stated: the average with 7.1rc7 is 478±6 
>> cycles (with the top and bottom octiles removed as outlier 
>> protection); with 7.1rc7 with the above patch reverted it is 
>> 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to 
> be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call 
> re-introduces the performance regression.
> 
> I am concerned because this could mean that the __pfx stubs add 
> substantial overhead elsewhere, unless this just happens to be a 
> particularly sensitive case...
> 

OK, so v7.1 was released with this sizable performance regression. That
raises the question of how to deal with it.

One option that might be reasonable for -stable is to simply add back 16 
bytes of NOPs into the assembly file. However, that is obviously not a 
long term fix.

Any thoughts?

	-hpa


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  0:19         ` H. Peter Anvin
@ 2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
                               ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-15  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On 2026-06-14 17:19, H. Peter Anvin wrote:
> 
> OK, so v7.1 was released with this sizable performance regression. That 
> begs the question how to deal with it.
> 
> One option that might be reasonable for -stable is to simply add back 16 
> bytes of NOPs into the assembly file. However, that is obviously not a 
> long term fix.
> 

Okay, here is a hack that actually generates the proper alignment, and 
it DOES in fact fix the performance regression.

It uses the same hack as the Makefile to deal with function alignment 
with a prefix: it adds unnecessary NOPs so that the pre-alignment and 
post-alignment are the same. At the end of the day this really ought to 
be fixed in gcc.
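
The padding arithmetic can be sketched as follows, under the assumption
(stated elsewhere in this thread) that an alignment request on a function is
applied to the start of its NOP prefix rather than to the entry point. The
helper names are invented for illustration, and the value 16 used below for
both CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES is a
hypothetical configuration, not something confirmed here.

```c
#include <assert.h>

/*
 * Offset of the function entry within an align-byte block when the
 * START of the NOP prefix is placed on an align-byte boundary and
 * prefix bytes of padding precede the entry. align must be a nonzero
 * power of two.
 */
static unsigned int entry_offset(unsigned int align, unsigned int prefix)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return prefix & (align - 1);
}

/*
 * Prefix size requested by the hack in the attached diff:
 *   x - CONFIG_FUNCTION_ALIGNMENT + CONFIG_FUNCTION_PADDING_BYTES
 * The two config values are passed in as plain parameters here.
 */
static unsigned int hack_prefix(unsigned int x, unsigned int func_align,
				unsigned int padding_bytes)
{
	assert(x >= func_align);
	return x - func_align + padding_bytes;
}
```

With those assumed values, a 16-byte prefix under 32-byte alignment gives
entry_offset(32, 16) == 16, so the entry point lands in the middle of the
block. hack_prefix(32, 16, 16) is 32, and entry_offset(32, 32) == 0: once the
prefix is a whole multiple of the alignment, aligning the prefix aligns the
entry point as well.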

This is not meant to be a final patch; this should go in a header file 
and be cleaned up etc, but I wanted to confirm that it does, in fact, 
fix the regression and that the alignment of x64_sys_call is the root 
cause of the problem.

PeterZ: at some point you and I talked about the following:

- Should x64_sys_call() be noinstr?
- If so, any reason we can't inline it into do_syscall_64()?
- Since we no longer use the sys_call_table[] as a jump table,
   do we actually need array_index_nospec() in do_syscall_x64|32?

	-hpa

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1428 bytes --]

diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..337e3e53d262 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -9,6 +9,14 @@
 #include <linux/nospec.h>
 #include <asm/syscall.h>
 
+#ifdef CONFIG_CALL_PADDING
+# define _pfe(x) __attribute((patchable_function_entry(x,x)))
+#else
+# define _pfe(x)
+#endif
+#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
+
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
 #include <asm/syscalls_64.h>
@@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef  __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_64.h>
@@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_x32.h>

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
@ 2026-06-15  3:41             ` Linus Torvalds
  2026-06-15 18:30               ` H. Peter Anvin
  2026-06-16  7:38             ` Peter Zijlstra
  2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2026-06-15  3:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>
> - Since we no longer use the sys_call_table[] as a jump table,
>    do we actually need array_index_nospec()? in do_syscall_x64|32?

Well, gcc will still generate a jump table from it when retpolines
aren't enabled.

So I think we do want that array_index_nospec. It should be cheap
insurance against the simplest kinds of speculation issues.

              Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  3:41             ` Linus Torvalds
@ 2026-06-15 18:30               ` H. Peter Anvin
  2026-06-16  7:12                 ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: H. Peter Anvin @ 2026-06-15 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> - Since we no longer use the sys_call_table[] as a jump table,
>>     do we actually need array_index_nospec()? in do_syscall_x64|32?
> 
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
> 
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
> 

Well, we could put it under an #ifdef by adding a macro to detect when we 
use -fno-jump-tables. PeterZ and I have also been talking about making
-fno-jump-tables unconditional, because at some point we found that the 
performance difference was negligible, at least when 
array_index_nospec() is necessary, and it makes it a lot easier to tune 
when you don't have to deal with code bases that compile. It is not just 
retpoline but also IBT (although the comment says "for now"); this of 
course means in practice that the kernels everyone uses are compiled 
without jump tables.

The system call dispatch is really the biggest case here.

It does, however, make me think that using regs->ax to dispatch system 
calls in the FRED path might actually be The Wrong Thing[TM]; FRED 
delivery is a speculation barrier and so %rax is guaranteed to be stable 
at that point. *In practice* the stack engine probably would propagate 
that (I can't really think of any way to implement a stack engine that 
wouldn't, and I suspect if it didn't we would have lots of other issues) 
but instead of dumping it into memory and reading it back it probably 
would be better to do what the SYSCALL path does and move it into an 
argument register instead.

I have experimented with micro-optimizations of the FRED path lately, in 
part because FRED inherently does provide speculation guarantees that 
SYSCALL/SYSRET do not, in part because some of the code paths have a 
fair bit of unnecessary overhead in general, some of which affects 
FRED disproportionately (some duplicates work that FRED does inherently, 
for one thing.) So far I have been somewhat surprised how *little* 
effect some of them have had; clearly branch prediction does a really 
good job sometimes even without static branches.

Still, some pretty simple changes can get a few percent improvement, 
well above the statistical noise margin.

Doing a *very* early-out and dispatching do_syscall_64() already in 
asm_entry_point_user is one of the more effective hacks; I am (or 
rather, was, until I discovered this immediate issue ;) also 
experimenting with having separate IDT and FRED versions of 
do_syscall_64() -- the code factors very cleanly and the duplication is 
nearly all at the object code level.

Part of my questions to PeterZ was because I believe that inlining 
x64_sys_call() will benefit a fair bit from better code layout. We have 
talked about sunsetting x32, but until we do, merging x32_sys_call() 
into the same function also ends up with the two switch statements being 
able to share a fair bit of code, since there are large contiguous 
chunks of x32 system call space which are the same as x64.

One of the things I have been thinking about, too, is to move FRED- and 
IDT-specific code into separate text sections; not only so that they can 
be close together in memory, but also so that we can poison out the 
areas that aren't being used. Every code flow that has almost unlimited 
versatility is, obviously, *extremely* desirable as targets for 
execution redirection attacks...

	-hpa

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15 18:30               ` H. Peter Anvin
@ 2026-06-16  7:12                 ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:12 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Mon, Jun 15, 2026 at 11:30:11AM -0700, H. Peter Anvin wrote:

> Well, we could put it under an #ifdef by adding a macro to detect when we use
> -fno-jump-tables. PeterZ and I have also been talking about making
> -fno-jump-tables unconditional, because at some point we found that the
> performance difference was negligible, at least when array_index_nospec() is
> necessary, and it makes it a lot easier to tune when you don't have to deal
> with code bases that compile. It is not just retpoline but also IBT
> (although the comment says "for now"); this of course means in practice that
> the kernels everyone uses are compiled without jump tables.

The IBT thing is because GCC (and I assume, but haven't checked, clang
too) generated NOTRACK prefixes for jump tables. And we have explicitly
disallowed NOTRACK for kernel IBT.

The "not yet" pertains to the compilers being changed to not use
NOTRACK; but I don't think this is anything anybody is actively chasing
up on.

So yeah, effectively jump-tables are disabled for everybody.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
@ 2026-06-16  7:38             ` Peter Zijlstra
  2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:38 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:

> PeterZ: at some point you and I talked about the following:
> 
> - Should x64_sys_call() be noinstr?

I still think it should be, yes. But I also think it wants __noendbr,
there is no sane reason you should ever be allowed to do an indirect
call to this.

Realistically, objtool will seal this function (scribble the ENDBR), but
really, it just shouldn't be there to begin with.

> - If so, any reason we can't inline it into do_syscall_64()?

Code gen, GCC makes a mess out of things if you do that. x64_sys_call()
now ends up being a giant pile of tail-calls. If you inline it into
do_syscall_x64() that goes out the window.

> - Since we no longer use the sys_call_table[] as a jump table,
>   do we actually need array_index_nospec()? in do_syscall_x64|32?

It would mean unconditionally disabling jump-tables -- at least for this
TU, but possibly for the whole thing (mixed compiler flags and LTO is a
pain you don't need IIRC).

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
  2026-06-16  7:38             ` Peter Zijlstra
@ 2026-06-16  7:53             ` Peter Zijlstra
  2 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-06-16  7:53 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:

> It uses the same hack as the Makefile to deal with function alignment with a
> prefix: it adds unnecessary NOPs so that the pre-alignment and
> post-alignment are the same. At the end of the day this really ought to be
> fixed in gcc.

And clang, but I don't think they can, it wrecks the 'ABI' they have in
place with the current set of arguments. Which I agree is somewhat
unfortunate, but it is what it is.

> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 71f032504e73..337e3e53d262 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -9,6 +9,14 @@
>  #include <linux/nospec.h>
>  #include <asm/syscall.h>
>  
> +#ifdef CONFIG_CALL_PADDING
> +# define _pfe(x) __attribute((patchable_function_entry(x,x)))
> +#else
> +# define _pfe(x)
> +#endif
> +#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
> +#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
> +
>  #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
>  #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
>  #include <asm/syscalls_64.h>
> @@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
>  #undef  __SYSCALL
>  
>  #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> -long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_64.h>
> @@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  }
>  
>  #ifdef CONFIG_X86_X32_ABI
> -long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_x32.h>

This more or less works by accident, in general your align_func() macro
is horrendously broken when you consider kCFI. By changing the
patchable_function_entry attribute like this, the kCFI hash ends up at a
different location and things go side-ways really really fast.

The only reason it works here is that this function is never indirectly
called and so the kCFI ABI violation is immaterial.


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
  2026-06-15  0:19         ` H. Peter Anvin
@ 2026-06-16  8:28         ` Peter Zijlstra
  2026-06-16  8:46           ` Linus Torvalds
  2026-06-16 13:53           ` David Laight
  2 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2026-06-16  8:28 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:

> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
> 
> I am concerned because this could mean that the __pfx stubs add substantial
> overhead elsewhere, unless this just happens to be a particularly sensitive
> case...

So what is the actual alignment requirement these days then? We're
building the (x86_64) kernel with 16 byte function and 1 byte jump
alignment.

So ISTR the Intel I-fetch window was 16 bytes, so the above things would
make sense. However, Gemini, or whatever AI sits in google search, is
trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
and later.

This all seems to suggest we do something like so, hmm?


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b9f5a4a3cc2a..65fff65271d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -329,7 +329,9 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select PROC_PID_ARCH_STATUS		if PROC_FS
 	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
-	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
+	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
+	select FUNCTION_ALIGNMENT_32B		if X86_64
+	select FUNCTION_ALIGNMENT_16B		if X86_ALIGNMENT_16
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:28         ` Peter Zijlstra
@ 2026-06-16  8:46           ` Linus Torvalds
  2026-06-16  9:51             ` Ingo Molnar
  2026-06-16 13:53           ` David Laight
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2026-06-16  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
to not be 64-byte aligned - simply because you may need to fetch more
cachelines (assuming fairly linear code).

And afaik, some of the newer ones aren't 32-byte wide, but can do 48
bytes as three 16-byte fetches.

But I don't know if they can do the old "split line access" that older
cores could do, where a Pentium would do two 8-byte accesses at the
same time, and they didn't have to be in the same cache line.

So 64-byte alignment would always be the best option if you only look
at a *particular* piece of code.

But it obviously is very wasteful and hurts when there is code around
it that could be loaded into the cache at the same time.

So almost certainly not a good idea in general.

But 64-byte alignment is probably what things like interrupt and
system call entrypoints should use, because those things would make
sense to look at as isolated things, not part of a "bigger load". And
they are quite likely to start from a fairly cold-cache situation.

So *not* some general compiler option in a config file, but maybe a
special "entry point alignment" macro?

             Linus

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:46           ` Linus Torvalds
@ 2026-06-16  9:51             ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2026-06-16  9:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, H. Peter Anvin, tglx, mingo, bp,
	Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
> 
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
> 
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
> 
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
> 
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
> 
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
> 
> So almost certainly not a good idea in general.
> 
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a "bigger load". And
> they are quite likely to start from a fairly cold-cache situation.
> 
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it,
I'm also still somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function
alignment changes, do we know *why* it causes an 80-cycle
shift in system call entry performance?

What does the benchmark measure, cache-cold or cache-hot
execution?

1) Cache-cold performance:

If it is cold-cache performance, does the misaligned case fetch
one more cold cacheline?

From which cache does it miss? Fetching from the 2-4MB Panther Lake
L2 shouldn't be 80 cycles, it should be ~17 cycles.

If it's fetching from the 18MB L3 (which I'd say is the norm for
most workloads), then the L3->L1I latency is around ~55 cycles on
Panther Lake, with everything included.

It cannot really be DRAM latency, ie. true cache-cold latency,
as that would be much more severe, in the 400 cycles range even
with premium DRAM modules - and more like 500 cycles with
mainstream DRAM modules and layouts. (Unless we are *lucky* with
alignment and sizing and the alignment regression doesn't trigger
full DRAM latency.) The on-die DRAM MSC cache's latency should
be around 300 cycles - that too is too high.

2) Cache-hot performance:

While cache-hot performance is less relevant for system calls
(which tend to be cache-cold in practice), if the benchmark
measures cache-hot performance, why is there an 80-cycle shift
from just a single misaligned symbol?

Ie. the specific and rather stable figure of 80 cycles overhead
does not seem to match any of the Panther Lake latencies that
ought to be relevant to this regression, if we use the simplest
mental model of what's going on when alignment changes.

So it is either some other uarch pathology, triggered by bad
alignment, or something doesn't add up in my mental model
of the root cause of this problem. :-)

Side notes:

 - The 6 cycles noise in the 478±6 cycles measurement
   does suggest that we might have missed out to a
   deeper cache hierarchy level, versus the rather
   stable 397.5±0.4 pre-regression figure.

 - I'm also assuming that 'cycles' here is a frequency-invariant
   standardized constant 5.1 GHz TSC value or so?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-16  8:28         ` Peter Zijlstra
  2026-06-16  8:46           ` Linus Torvalds
@ 2026-06-16 13:53           ` David Laight
  1 sibling, 0 replies; 20+ messages in thread
From: David Laight @ 2026-06-16 13:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML

On Tue, 16 Jun 2026 10:28:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:
> 
> > OK, I have, I believe root-caused this.
> > 
> > It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> > 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> > 
> > Reverting the patch but adding an alignment statement to x64_sys_call
> > re-introduces the performance regression.
> > 
> > I am concerned because this could mean that the __pfx stubs add substantial
> > overhead elsewhere, unless this just happens to be a particularly sensitive
> > case...  
> 
> So what is the actual alignment requirement these days then? We're
> building the (x86_64) kernel with 16 byte function and 1 byte jump
> alignment.
> 
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
> 
> That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
> and later.

Basically you can't win.
I was looking at why a patch didn't give the expected performance gain
on a different base kernel build.
It seems to depend on whether the function (actually strlen) was aligned
to an odd or even 16 byte boundary.
If aligned to an even boundary the loop inside the function crossed a
'significant' boundary and the code ran measurably slower.
If you start aligning loop tops and labels in general you probably lose
due to code bloat.
(Here the loop didn't need aligning, it just needed not to contain
the relevant boundary.)

In this case the extra padding will change the alignment of everything that
follows - and some of those might make a difference as well.

You'd need to add extra code further down the function to keep the size
the same (and hope the compiler keeps the functions in the same order).

	David


> 
> This all seems to suggest we do something like so, hmm?
> 
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b9f5a4a3cc2a..65fff65271d0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -329,7 +329,9 @@ config X86
>  	select HAVE_ARCH_KCSAN			if X86_64
>  	select PROC_PID_ARCH_STATUS		if PROC_FS
>  	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
> -	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
> +	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
> +	select FUNCTION_ALIGNMENT_32B		if X86_64
> +	select FUNCTION_ALIGNMENT_16B		if X86_ALIGNMENT_16
>  	select FUNCTION_ALIGNMENT_4B
>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2026-06-16 13:53 UTC | newest]

Thread overview: 20+ messages
2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
2026-06-13  8:59 ` Peter Zijlstra
2026-06-13 20:34   ` H. Peter Anvin
2026-06-13 23:52     ` H. Peter Anvin
2026-06-14  1:50       ` H. Peter Anvin
2026-06-14 18:08         ` Xin Li
2026-06-14 18:31           ` H. Peter Anvin
2026-06-15  0:19         ` H. Peter Anvin
2026-06-15  2:07           ` H. Peter Anvin
2026-06-15  3:41             ` Linus Torvalds
2026-06-15 18:30               ` H. Peter Anvin
2026-06-16  7:12                 ` Peter Zijlstra
2026-06-16  7:38             ` Peter Zijlstra
2026-06-16  7:53             ` Peter Zijlstra
2026-06-16  8:28         ` Peter Zijlstra
2026-06-16  8:46           ` Linus Torvalds
2026-06-16  9:51             ` Ingo Molnar
2026-06-16 13:53           ` David Laight
2026-06-14  2:11       ` Calvin Owens
2026-06-14  2:14         ` Calvin Owens
