8aeb879baf12 - significant system call latency regression, bisected


* 8aeb879baf12 - significant system call latency regression, bisected
@ 2026-06-13  1:45 "H. Peter Anvin" (Intel)
  2026-06-13  8:59 ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: "H. Peter Anvin" (Intel) @ 2026-06-13  1:45 UTC (permalink / raw)
  To: Peter Zijlstra (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

So I was trying to figure out a significant -- about 13% -- increase
in system call latency between v7.0 and the current master, and it
bisects down to:

	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build

This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
is a bare metal boot, no KVM.

I'm personally extremely puzzled as to how this could possibly be related,
and I will be investigating the possibility that this is a false
bisect, but it is not a Heisenbug in any way; it has been extremely
reproducible, and the difference is statistically significant at close
to 10 sigma. Furthermore, the bisection at least gave the appearance of
stability.

Given how late in the cycle this is I wanted to send an alert sooner
rather than later; I will update as I get more data.

        -hpa


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
@ 2026-06-13  8:59 ` Peter Zijlstra
  2026-06-13 20:34   ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-06-13  8:59 UTC (permalink / raw)
  To: "H. Peter Anvin" (Intel)
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> So I was trying to figure out a significant -- about 13% -- increase
> in system call latency between v7.0 and the current master, and it
> bisects down to:
> 
> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> 
> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> is a bare metal boot, no KVM.
> 
> I'm personally extremely puzzled how this could possibly be related,
> and I will be investigating the possibility that this is a false
> bisect, but it is not a Heisenbug in any way; it has been extremely
> reproducible, and the difference is statistically valid by close to 10
> sigma. Furthermore, the bisection at least gave the appearance of
> stability.
> 
> Given how late in the cycle this is I wanted to send an alert sooner
> rather than later; I will update as I get more data.

Uhm, massive WTF indeed. I don't immediately see how this could possibly
affect a FRED host either, except perhaps in code layout.

I don't actually have a FRED capable machine, but have you tried running
one of those top-down perf things on it, to see where it's hurting?


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13  8:59 ` Peter Zijlstra
@ 2026-06-13 20:34   ` H. Peter Anvin
  2026-06-13 23:52     ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-13 20:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 01:59, Peter Zijlstra wrote:
> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>> So I was trying to figure out a significant -- about 13% -- increase
>> in system call latency between v7.0 and the current master, and it
>> bisects down to:
>> 
>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>> 
>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>> is a bare metal boot, no KVM.
>> 
>> I'm personally extremely puzzled how this could possibly be related,
>> and I will be investigating the possibility that this is a false
>> bisect, but it is not a Heisenbug in any way; it has been extremely
>> reproducible, and the difference is statistically valid by close to 10
>> sigma. Furthermore, the bisection at least gave the appearance of
>> stability.
>> 
>> Given how late in the cycle this is I wanted to send an alert sooner
>> rather than later; I will update as I get more data.
> 
> Uhm, massive WTF indeed. I don't immediately see how this could possibly
> affect a FRED host either, except perhaps in code layout.
> 
> I don't actually have a FRED capable machine, but have you tried running
> one of those top-down perf things on it, to see where it's hurting?

Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)

I reverted the patch on top of rc7, and it did, in fact, fix the regression, but I'm doing a clean from-scratch rebuild of both trees to make sure there isn't anything in my test setup that could introduce any kind of "memory" between builds...



* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 20:34   ` H. Peter Anvin
@ 2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14  2:11       ` Calvin Owens
  0 siblings, 2 replies; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-13 23:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 13:34, H. Peter Anvin wrote:
> On 2026-06-13 01:59, Peter Zijlstra wrote:
>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>> So I was trying to figure out a significant -- about 13% -- increase
>>> in system call latency between v7.0 and the current master, and it
>>> bisects down to:
>>>
>>> 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>
>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>> is a bare metal boot, no KVM.
>>>
>>> I'm personally extremely puzzled how this could possibly be related,
>>> and I will be investigating the possibility that this is a false
>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>> reproducible, and the difference is statistically valid by close to 10
>>> sigma. Furthermore, the bisection at least gave the appearance of
>>> stability.
>>>
>>> Given how late in the cycle this is I wanted to send an alert sooner
>>> rather than later; I will update as I get more data.
>>
>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>> affect a FRED host either, except perhaps in code layout.
>>
>> I don't actually have a FRED capable machine, but have you tried running
>> one of those top-down perf things on it, to see where it's hurting?
> 
> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> 
> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> but I'm doing a clean from-scratch rebuild of both trees to make sure
> there isn't anything in my test setup that could introduce any kind of
> "memory" between builds...

Nope, even with the clean rebuild it is 100% reproducible. It is in fact
worse than I originally stated: the average with 7.1-rc7 is 478±6 cycles
(with the top and bottom octiles removed as outlier protection); with
7.1-rc7 with the above patch reverted it is 397.5±0.4 cycles. This is in
fact a 20% increase in latency, not 13%...

	-hpa
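
As a rough illustration of the arithmetic above, the following C sketch
computes an octile-trimmed mean and the relative increase between two
averages. The function names are made up for this example, and it assumes
that the trimming simply drops the lowest and highest eighth of the sorted
samples; the actual measurement harness is not shown in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Mean of the samples after sorting them in place and discarding the
 * lowest and highest n/8 values (the bottom and top octiles).
 */
static double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;
	double sum = 0.0;

	assert(n > 2 * cut);	/* at least one sample must survive */
	qsort(samples, n, sizeof(*samples), cmp_double);
	for (size_t i = cut; i < n - cut; i++)
		sum += samples[i];
	return sum / (double)(n - 2 * cut);
}

/* Relative increase of "after" over "before", in percent. */
static double percent_increase(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the two averages quoted above, percent_increase(397.5, 478.0) comes
out at roughly 20.25, which is consistent with the "20%, not 13%" figure.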



* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
@ 2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
  2026-06-15  0:19         ` H. Peter Anvin
  2026-06-14  2:11       ` Calvin Owens
  1 sibling, 2 replies; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-14  1:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 16:52, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>> wrote:
>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>> in system call latency between v7.0 and the current master, and it
>>>> bisects down to:
>>>>
>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>
>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>> is a bare metal boot, no KVM.
>>>>
>>>> I'm personally extremely puzzled how this could possibly be related,
>>>> and I will be investigating the possibility that this is a false
>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>> reproducible, and the difference is statistically valid by close to 10
>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>> stability.
>>>>
>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>> rather than later; I will update as I get more data.
>>>
>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>> affect a FRED host either, except perhaps in code layout.
>>>
>>> I don't actually have a FRED capable machine, but have you tried running
>>> one of those top-down perf things on it, to see where it's hurting?
>>
>> Not yet, but I'm investigating right now (I have some family 
>> obligations this weekend, so my duty cycle is somewhat limited.)
>>
>> I reverted the patch on top of rc7, and it did, in fact, fix the
>> regression, but I'm doing a clean from-scratch rebuild of both trees
>> to make sure there isn't anything in my test setup that could
>> introduce any kind of "memory" between builds...
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact 
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles 
> (with the top and bottom octiles removed as outlier protection); with 
> 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact 
> a 20% increase in latency, not 13%...
> 

OK, I have, I believe, root-caused this.

It is a padding issue; removing the code changes __pfx_x64_sys_call to 
be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.

Reverting the patch but adding an alignment statement to x64_sys_call 
re-introduces the performance regression.

I am concerned because this could mean that the __pfx stubs add 
substantial overhead elsewhere, unless this just happens to be a 
particularly sensitive case...

	-hpa



* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-13 23:52     ` H. Peter Anvin
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14  2:11       ` Calvin Owens
  2026-06-14  2:14         ` Calvin Owens
  1 sibling, 1 reply; 13+ messages in thread
From: Calvin Owens @ 2026-06-14  2:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
> > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > in system call latency between v7.0 and the current master, and it
> > > > bisects down to:
> > > > 
> > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > 
> > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > is a bare metal boot, no KVM.
> > > > 
> > > > I'm personally extremely puzzled how this could possibly be related,
> > > > and I will be investigating the possibility that this is a false
> > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > reproducible, and the difference is statistically valid by close to 10
> > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > stability.
> > > > 
> > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > rather than later; I will update as I get more data.
> > > 
> > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > affect a FRED host either, except perhaps in code layout.
> > > 
> > > I don't actually have a FRED capable machine, but have you tried running
> > > one of those top-down perf things on it, to see where it's hurting?
> > 
> > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > 
> > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > there isn't anything in my test setup that could introduce any kind of
> > "memory" between builds...
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> increase in latency, not 13%...

It has to be the .text layout, doesn't it?

I notice we're splitting a cache line here now with the prefix symbol,
7.0-rc7 has:

    ffffffff812175f0 <__pfx_x64_sys_call>:
    ffffffff81217600 <x64_sys_call>:

If I revert 8aeb879baf12, I get:

    ffffffff812175c0 <__pfx_x64_sys_call>:
    ffffffff812175d0 <x64_sys_call>:

Could that be it?

Unfortunately I don't have any hardware new enough to poke at it myself.

Cheers,
Calvin
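
The addresses in the two listings above can be checked mechanically. The
following is a minimal C sketch with hypothetical helper names; the 64-byte
cache line size is an assumption for illustration and is not stated anywhere
in this thread.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64u	/* assumption: 64-byte cache lines */

/* Offset of addr within an align-byte block; align must be a power of two. */
static unsigned int offset_in(uint64_t addr, unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (unsigned int)(addr & (align - 1));
}

/* Nonzero if both addresses fall within the same cache line. */
static int same_cache_line(uint64_t a, uint64_t b)
{
	return (a / CACHE_LINE) == (b / CACHE_LINE);
}
```

Applied to the listings: 0xffffffff812175f0 and 0xffffffff81217600 land in
different 64-byte lines, while 0xffffffff812175c0 and 0xffffffff812175d0
share one. Within its 32-byte block, x64_sys_call sits at offset 0 in the
first listing and at offset 16 in the second.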


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  2:11       ` Calvin Owens
@ 2026-06-14  2:14         ` Calvin Owens
  0 siblings, 0 replies; 13+ messages in thread
From: Calvin Owens @ 2026-06-14  2:14 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
	torvalds, x86-ML, LKML

On Saturday 06/13 at 19:11 -0700, Calvin Owens wrote:
> On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> > On 2026-06-13 13:34, H. Peter Anvin wrote:
> > > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > > in system call latency between v7.0 and the current master, and it
> > > > > bisects down to:
> > > > > 
> > > > > 	8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > > 
> > > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > > is a bare metal boot, no KVM.
> > > > > 
> > > > > I'm personally extremely puzzled how this could possibly be related,
> > > > > and I will be investigating the possibility that this is a false
> > > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > > reproducible, and the difference is statistically valid by close to 10
> > > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > > stability.
> > > > > 
> > > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > > rather than later; I will update as I get more data.
> > > > 
> > > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > > affect a FRED host either, except perhaps in code layout.
> > > > 
> > > > I don't actually have a FRED capable machine, but have you tried running
> > > > one of those top-down perf things on it, to see where it's hurting?
> > > 
> > > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > > 
> > > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > > there isn't anything in my test setup that could introduce any kind of
> > > "memory" between builds...
> > Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> > worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> > (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> > with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> > increase in latency, not 13%...
> 
> It has to be the .text layout, doesn't it?
> 
> I notice we're splitting a cache line here now with the prefix symbol,
> 7.0-rc7 has:

Whoops, I meant 7.1-rc7.

But seeing your other mail, sounds like this is it :)

>     ffffffff812175f0 <__pfx_x64_sys_call>:
>     ffffffff81217600 <x64_sys_call>:
> 
> If I revert 8aeb879baf12, I get:
> 
>     ffffffff812175c0 <__pfx_x64_sys_call>:
>     ffffffff812175d0 <x64_sys_call>:
> 
> Could that be it?
> 
> Unfortunately I don't have any hardware new enough to poke at it myself.
> 
> Cheers,
> Calvin


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
@ 2026-06-14 18:08         ` Xin Li
  2026-06-14 18:31           ` H. Peter Anvin
  2026-06-15  0:19         ` H. Peter Anvin
  1 sibling, 1 reply; 13+ messages in thread
From: Xin Li @ 2026-06-14 18:08 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML


> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> 
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>> 
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>> 
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>> 
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>> 
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>> 
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>> 
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>> 
>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>> 
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.


Does the problem not happen with IDT?


> 
> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...


Good point; an alignment check should be applied to all such entries.

Thanks
   Xin


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14 18:08         ` Xin Li
@ 2026-06-14 18:31           ` H. Peter Anvin
  0 siblings, 0 replies; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-14 18:31 UTC (permalink / raw)
  To: Xin Li
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, torvalds, x86-ML, LKML

On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@zytor.com> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>> 
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>> 
>>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>> 
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>> 
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>> 
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>> 
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>> 
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where it's hurting?
>>>> 
>>>> Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
>>>> 
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>>> there isn't anything in my test setup that could introduce any kind of
>>>> "memory" between builds...
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact worse than I originally stated: the average with 7.1rc7 is 478±6 cycles (with the top and bottom octiles removed as outlier protection); with 7.1rc7 with the above patch reverted it is 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>> 
>> OK, I have, I believe root-caused this.
>> 
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>> 
>> Reverting the patch but adding an alignment statement to x64_sys_call re-introduces the performance regression.
>
>
>The problem doesn’t happen to IDT?
>
>
>> 
>> I am concerned because this could mean that the __pfx stubs add substantial overhead elsewhere, unless this just happens to be a particularly sensitive case...
>
>
>Good point, alignment check should be applied to all such entries.
>
>Thanks
>   Xin

The problem is that if you put an alignment directive on a function, it aligns the __pfx stub, which is exactly The Wrong Thing™.

Otherwise this would be easy to fix, permanently. 

I haven't had time to test IDT yet. I assume it is similar.


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-14  1:50       ` H. Peter Anvin
  2026-06-14 18:08         ` Xin Li
@ 2026-06-15  0:19         ` H. Peter Anvin
  2026-06-15  2:07           ` H. Peter Anvin
  1 sibling, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-15  0:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

On 2026-06-13 18:50, H. Peter Anvin wrote:
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) 
>>>> wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>>     8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could 
>>>> possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried 
>>>> running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family 
>>> obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the
>>> regression, but I'm doing a clean from-scratch rebuild of both trees
>>> to make sure there isn't anything in my test setup that could
>>> introduce any kind of "memory" between builds...
>> Nope, even with the clean rebuild it is 100% reproducible. It is in 
>> fact worse than I originally stated: the average with 7.1rc7 is 478±6 
>> cycles (with the top and bottom octiles removed as outlier 
>> protection); with 7.1rc7 with the above patch reverted it is 
>> 397.5±0.4. - this is in fact a 20% increase in latency, not 13%...
>>
> 
> OK, I have, I believe root-caused this.
> 
> It is a padding issue; removing the code changes __pfx_x64_sys_call to 
> be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> 
> Reverting the patch but adding an alignment statement to x64_sys_call 
> re-introduces the performance regression.
> 
> I am concerned because this could mean that the __pfx stubs add 
> substantial overhead elsewhere, unless this just happens to be a 
> particularly sensitive case...
> 

OK, so v7.1 was released with this sizable performance regression. That
raises the question of how to deal with it.

One option that might be reasonable for -stable is to simply add back 16 
bytes of NOPs into the assembly file. However, that is obviously not a 
long term fix.

Any thoughts?

	-hpa



* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  0:19         ` H. Peter Anvin
@ 2026-06-15  2:07           ` H. Peter Anvin
  2026-06-15  3:41             ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-15  2:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
	torvalds, x86-ML, LKML

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On 2026-06-14 17:19, H. Peter Anvin wrote:
> 
> OK, so v7.1 was released with this sizable performance regression. That 
> begs the question how to deal with it.
> 
> One option that might be reasonable for -stable is to simply add back 16 
> bytes of NOPs into the assembly file. However, that is obviously not a 
> long term fix.
> 

Okay, here is a hack that actually generates the proper alignment, and 
it DOES in fact fix the performance regression.

It uses the same hack as the Makefile to deal with function alignment 
with a prefix: it adds unnecessary NOPs so that the pre-alignment and 
post-alignment are the same. At the end of the day this really ought to 
be fixed in gcc.

This is not meant to be a final patch; this should go in a header file 
and be cleaned up etc, but I wanted to confirm that it does, in fact, 
fix the regression and that the alignment of x64_sys_call is the root 
cause of the problem.

PeterZ: at some point you and I talked about the following:

- Should x64_sys_call() be noinstr?
- If so, any reason we can't inline it into do_syscall_64()?
- Since we no longer use the sys_call_table[] as a jump table,
   do we actually need array_index_nospec() in do_syscall_x64|32?

	-hpa

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1428 bytes --]

diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..337e3e53d262 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -9,6 +9,14 @@
 #include <linux/nospec.h>
 #include <asm/syscall.h>
 
+#ifdef CONFIG_CALL_PADDING
+# define _pfe(x) __attribute((patchable_function_entry(x,x)))
+#else
+# define _pfe(x)
+#endif
+#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
+
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
 #include <asm/syscalls_64.h>
@@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef  __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_64.h>
@@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 	#include <asm/syscalls_x32.h>
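
To see why the extra NOPs requested by align_func() help, here is a small
C model of the arithmetic in the patch. It assumes that the compiler applies
the requested alignment to the start of the NOP prefix (as described earlier
in this thread) and that each prefix NOP is one byte long. The helper names
are hypothetical, and the value 16 used below for both
CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES is an example
value, not taken from any particular config.

```c
#include <assert.h>

/*
 * Number of prefix NOP bytes that align_func(x) in the patch requests:
 * x - CONFIG_FUNCTION_ALIGNMENT + CONFIG_FUNCTION_PADDING_BYTES, after
 * clamping x up to CONFIG_FUNCTION_ALIGNMENT.
 */
static unsigned int align_func_prefix_bytes(unsigned int x,
					    unsigned int func_align,
					    unsigned int pad_bytes)
{
	if (x < func_align)
		x = func_align;
	return x - func_align + pad_bytes;
}

/*
 * Offset of the function entry within an align-byte block when the start
 * of the prefix is aligned to align and the entry follows prefix_bytes
 * later. align must be a power of two.
 */
static unsigned int entry_offset(unsigned int align, unsigned int prefix_bytes)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return prefix_bytes & (align - 1);
}
```

With the example values, align_func(32) asks for 32 - 16 + 16 = 32 prefix
bytes, so an entry that follows a 32-byte-aligned prefix is itself 32-byte
aligned: entry_offset(32, 32) is 0. With only 16 prefix bytes the entry
would land at offset 16 instead: entry_offset(32, 16) is 16.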


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  2:07           ` H. Peter Anvin
@ 2026-06-15  3:41             ` Linus Torvalds
  2026-06-15 18:30               ` H. Peter Anvin
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2026-06-15  3:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>
> - Since we no longer use the sys_call_table[] as a jump table,
>    do we actually need array_index_nospec() in do_syscall_x64|32?

Well, gcc will still generate a jump table from it when retpolines
aren't enabled.

So I think we do want that array_index_nospec. It should be cheap
insurance against the simplest kinds of speculation issues.

              Linus
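
For readers unfamiliar with the idea behind array_index_nospec(), the
following userspace C sketch shows a branchless index clamp of the kind it
relies on. It is modeled on the kernel's generic mask trick as I understand
it; the real kernel helper additionally hides the index from the optimizer,
and x86 has its own implementation, so treat the names and details here as
illustrative assumptions. As ordinary userspace code it makes no guarantee
about speculative execution.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All-ones if index < size, zero otherwise, computed without a branch.
 * Relies on arithmetic right shift of a negative long, which GCC and
 * Clang provide. size is expected to be at most LONG_MAX.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_LONG - 1));
}

/* Pass an in-bounds index through unchanged; force any other index to 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}
```

The point of the mask form is that an out-of-bounds index becomes 0 through
data flow rather than through a conditional branch that could be
mispredicted.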


* Re: 8aeb879baf12 - significant system call latency regression, bisected
  2026-06-15  3:41             ` Linus Torvalds
@ 2026-06-15 18:30               ` H. Peter Anvin
  0 siblings, 0 replies; 13+ messages in thread
From: H. Peter Anvin @ 2026-06-15 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
	Dave Hansen, x86-ML, LKML

On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> - Since we no longer use the sys_call_table[] as a jump table,
>>     do we actually need array_index_nospec()? in do_syscall_x64|32?
> 
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
> 
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
> 

Well, we could put it under an #ifdef by adding a macro to detect when we 
use -fno-jump-tables. PeterZ and I have also been talking about making
-fno-jump-tables unconditional, because at some point we found that the 
performance difference was negligible, at least when 
array_index_nospec() is necessary, and it makes it a lot easier to tune 
when you don't have to deal with code bases that compile. It is not just 
retpoline but also IBT (although the comment says "for now"); this of 
course means in practice that the kernels everyone uses are compiled 
without jump tables.
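
For illustration, the #ifdef idea could look roughly like the userspace sketch below. It is not kernel code and rests on assumptions: HAVE_JUMP_TABLES is a hypothetical macro that the build system would have to define whenever -fno-jump-tables is not in use, and index_mask_nospec() and syscall_nr_nospec() are made-up names that only mimic the branchless masking idea behind array_index_nospec().

```c
#include <assert.h>
#include <limits.h>

/*
 * All-ones if index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift on signed long
 * (true for gcc and clang).
 */
static inline unsigned long index_mask_nospec(unsigned long index,
					      unsigned long size)
{
	assert(size <= (unsigned long)LONG_MAX);
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Hypothetical macro: set by the build only when jump tables may be emitted. */
#ifdef HAVE_JUMP_TABLES
/* A jump table could be indexed speculatively, so clamp the index. */
static inline unsigned long syscall_nr_nospec(unsigned long nr,
					      unsigned long size)
{
	return nr & index_mask_nospec(nr, size);
}
#else
/* No jump table is generated, so the clamp is compiled out. */
static inline unsigned long syscall_nr_nospec(unsigned long nr,
					      unsigned long size)
{
	(void)size;
	return nr;
}
#endif
```

In both configurations an in-range number passes through unchanged; only the jump-table build forces an out-of-range number to zero before it can be used as an index.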

The system call dispatch is really the biggest case here.

It does, however, make me think that using regs->ax to dispatch system 
calls in the FRED path might actually be The Wrong Thing[TM]; FRED 
delivery is a speculation barrier and so %rax is guaranteed to be stable 
at that point. *In practice* the stack engine probably would propagate 
that (I can't really think of any way to implement a stack engine that 
wouldn't, and I suspect if it didn't we would have lots of other issues) 
but instead of dumping it into memory and reading it back it probably 
would be better to do what the SYSCALL path does and move it into an 
argument register instead.

I have experimented with micro-optimizations of the FRED path lately, in 
part because FRED inherently does provide speculation guarantees that 
SYSCALL/SYSRET do not, in part because some of the code paths have a 
fair bit of unnecessary overhead in general, some of which affects 
FRED disproportionately (some duplicates work that FRED does inherently, 
for one thing.) So far I have been somewhat surprised how *little* 
effect some of them have had; clearly branch prediction does a really 
good job sometimes even without static branches.

Still, some pretty simple changes can get a few percent improvement, 
well above the statistical noise margin.

Doing a *very* early-out and dispatching do_syscall_64() already in 
asm_entry_point_user is one of the more effective hacks; I am (or 
rather, was, until I discovered this immediate issue ;) also 
experimenting with having separate IDT and FRED versions of 
do_syscall_64() -- the code factors very cleanly and the duplication is 
nearly all at the object code level.

Part of my questions to PeterZ was because I believe that inlining 
x64_sys_call() will benefit a fair bit from better code layout. We have 
talked about sunsetting x32, but until we do, merging x32_sys_call() 
into the same function also ends up with the two switch statements being 
able to share a fair bit of code, since there are large contiguous 
chunks of x32 system call space which are the same as x64.
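
A minimal sketch of what sharing one switch statement between the two ABIs might look like, as plain userspace C rather than the real dispatcher. The toy_*() handlers, merged_sys_call() and the specific case numbers are assumptions made for illustration; only the value of the x32 marker bit is taken from the kernel's __X32_SYSCALL_BIT.

```c
#include <errno.h>
#include <stdbool.h>

#define X32_SYSCALL_BIT 0x40000000u	/* same value as __X32_SYSCALL_BIT */

/* Toy handlers standing in for the real __x64_sys_*() functions. */
static long toy_read(void)         { return 100; }
static long toy_write(void)        { return 101; }
static long toy_ioctl(void)        { return 102; }
static long toy_compat_ioctl(void) { return 103; }

/* One switch for both ABIs; the case numbers are illustrative only. */
static long merged_sys_call(unsigned int nr)
{
	bool x32 = nr & X32_SYSCALL_BIT;

	switch (nr & ~X32_SYSCALL_BIT) {
	/* Entries common to both ABIs share a single case. */
	case 0:   return toy_read();
	case 1:   return toy_write();
	/* Entries that exist for only one ABI are guarded. */
	case 16:  return x32 ? -ENOSYS : toy_ioctl();
	case 514: return x32 ? toy_compat_ioctl() : -ENOSYS;
	default:  return -ENOSYS;
	}
}
```

The common cases generate code once instead of twice, which is where the sharing would come from; the per-ABI cases cost one extra test on the marker bit.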

One of the things I have been thinking about, too, is to move FRED- and 
IDT-specific code into separate text sections; not only so that they can 
be close together in memory, but also so that we can poison out the 
areas that aren't being used. Every code flow that has almost unlimited 
versatility is, obviously, *extremely* desirable as targets for 
execution redirection attacks...

	-hpa


end of thread, other threads:[~2026-06-15 18:46 UTC | newest]

Thread overview: 13+ messages
2026-06-13  1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
2026-06-13  8:59 ` Peter Zijlstra
2026-06-13 20:34   ` H. Peter Anvin
2026-06-13 23:52     ` H. Peter Anvin
2026-06-14  1:50       ` H. Peter Anvin
2026-06-14 18:08         ` Xin Li
2026-06-14 18:31           ` H. Peter Anvin
2026-06-15  0:19         ` H. Peter Anvin
2026-06-15  2:07           ` H. Peter Anvin
2026-06-15  3:41             ` Linus Torvalds
2026-06-15 18:30               ` H. Peter Anvin
2026-06-14  2:11       ` Calvin Owens
2026-06-14  2:14         ` Calvin Owens
