* 8aeb879baf12 - significant system call latency regression, bisected
@ 2026-06-13 1:45 "H. Peter Anvin" (Intel)
2026-06-13 8:59 ` Peter Zijlstra
0 siblings, 1 reply; 24+ messages in thread
From: "H. Peter Anvin" (Intel) @ 2026-06-13 1:45 UTC (permalink / raw)
To: Peter Zijlstra (Intel)
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

So I was trying to figure out a significant -- about 13% -- increase
in system call latency between v7.0 and the current master, and it
bisects down to:

8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build

This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
is a bare metal boot, no KVM.

I'm personally extremely puzzled how this could possibly be related,
and I will be investigating the possibility that this is a false
bisect, but it is not a Heisenbug in any way; it has been extremely
reproducible, and the difference is statistically valid by close to 10
sigma. Furthermore, the bisection at least gave the appearance of
stability.

Given how late in the cycle this is I wanted to send an alert sooner
rather than later; I will update as I get more data.

-hpa
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-13 1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
@ 2026-06-13 8:59 ` Peter Zijlstra
2026-06-13 20:34 ` H. Peter Anvin
0 siblings, 1 reply; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-13 8:59 UTC (permalink / raw)
To: "H. Peter Anvin" (Intel)
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> So I was trying to figure out a significant -- about 13% -- increase
> in system call latency between v7.0 and the current master, and it
> bisects down to:
>
> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>
> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> is a bare metal boot, no KVM.
>
> I'm personally extremely puzzled how this could possibly be related,
> and I will be investigating the possibility that this is a false
> bisect, but it is not a Heisenbug in any way; it has been extremely
> reproducible, and the difference is statistically valid by close to 10
> sigma. Furthermore, the bisection at least gave the appearance of
> stability.
>
> Given how late in the cycle this is I wanted to send an alert sooner
> rather than later; I will update as I get more data.

Uhm, massive WTF indeed. I don't immediately see how this could possibly
affect a FRED host either, except perhaps in code layout.

I don't actually have a FRED capable machine, but have you tried running
one of those top-down perf things on it, to see where it's hurting?
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-13 8:59 ` Peter Zijlstra
@ 2026-06-13 20:34 ` H. Peter Anvin
2026-06-13 23:52 ` H. Peter Anvin
0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-13 20:34 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On 2026-06-13 01:59, Peter Zijlstra wrote:
> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>> So I was trying to figure out a significant -- about 13% -- increase
>> in system call latency between v7.0 and the current master, and it
>> bisects down to:
>>
>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>
>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>> is a bare metal boot, no KVM.
>>
>> I'm personally extremely puzzled how this could possibly be related,
>> and I will be investigating the possibility that this is a false
>> bisect, but it is not a Heisenbug in any way; it has been extremely
>> reproducible, and the difference is statistically valid by close to 10
>> sigma. Furthermore, the bisection at least gave the appearance of
>> stability.
>>
>> Given how late in the cycle this is I wanted to send an alert sooner
>> rather than later; I will update as I get more data.
>
> Uhm, massive WTF indeed. I don't immediately see how this could possibly
> affect a FRED host either, except perhaps in code layout.
>
> I don't actually have a FRED capable machine, but have you tried running
> one of those top-down perf things on it, to see where it's hurting?

Not yet, but I'm investigating right now (I have some family obligations
this weekend, so my duty cycle is somewhat limited.)

I reverted the patch on top of rc7, and it did, in fact, fix the regression,
but I'm doing a clean from-scratch rebuild of both trees to make sure
there isn't anything in my test setup that could introduce any kind of
"memory" between builds...
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-13 20:34 ` H. Peter Anvin
@ 2026-06-13 23:52 ` H. Peter Anvin
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 2:11 ` Calvin Owens
0 siblings, 2 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-13 23:52 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On 2026-06-13 13:34, H. Peter Anvin wrote:
> On 2026-06-13 01:59, Peter Zijlstra wrote:
>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>> So I was trying to figure out a significant -- about 13% -- increase
>>> in system call latency between v7.0 and the current master, and it
>>> bisects down to:
>>>
>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>
>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>> is a bare metal boot, no KVM.
>>>
>>> I'm personally extremely puzzled how this could possibly be related,
>>> and I will be investigating the possibility that this is a false
>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>> reproducible, and the difference is statistically valid by close to 10
>>> sigma. Furthermore, the bisection at least gave the appearance of
>>> stability.
>>>
>>> Given how late in the cycle this is I wanted to send an alert sooner
>>> rather than later; I will update as I get more data.
>>
>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>> affect a FRED host either, except perhaps in code layout.
>>
>> I don't actually have a FRED capable machine, but have you tried running
>> one of those top-down perf things on it, to see where it's hurting?
>
> Not yet, but I'm investigating right now (I have some family obligations
> this weekend, so my duty cycle is somewhat limited.)
>
> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> but I'm doing a clean from-scratch rebuild of both trees to make sure
> there isn't anything in my test setup that could introduce any kind of
> "memory" between builds...
>

Nope, even with the clean rebuild it is 100% reproducible. It is in fact
worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
(with the top and bottom octiles removed as outlier protection); with
7.1rc7 with the above patch reverted it is 397.5±0.4 -- this is in fact
a 20% increase in latency, not 13%...

-hpa
^ permalink raw reply [flat|nested] 24+ messages in thread
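For readers who want to reproduce the arithmetic: the figures above are means taken after discarding the lowest and highest eighth of the samples, and the quoted 20% is the relative increase of 478 over 397.5 cycles. A minimal sketch of both computations follows. The function names are made up for illustration, and the actual measurement harness is not shown in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending order. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Mean of samples[0..n-1] after sorting and dropping the lowest and
 * highest n/8 entries (the "octiles"). Sorts the array in place.
 * Requires n >= 1. Hypothetical helper, not taken from the thread.
 */
static double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;
	double sum = 0.0;
	size_t i;

	qsort(samples, n, sizeof(*samples), cmp_double);
	for (i = cut; i < n - cut; i++)
		sum += samples[i];
	return sum / (double)(n - 2 * cut);
}

/* Relative increase of 'after' over 'before', in percent. */
static double percent_increase(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

With the two reported means, percent_increase(397.5, 478.0) comes out at roughly 20.25, which matches the corrected "20%, not 13%" figure.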
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-13 23:52 ` H. Peter Anvin
@ 2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
` (2 more replies)
2026-06-14 2:11 ` Calvin Owens
1 sibling, 3 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-14 1:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On 2026-06-13 16:52, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>> in system call latency between v7.0 and the current master, and it
>>>> bisects down to:
>>>>
>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>
>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>> is a bare metal boot, no KVM.
>>>>
>>>> I'm personally extremely puzzled how this could possibly be related,
>>>> and I will be investigating the possibility that this is a false
>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>> reproducible, and the difference is statistically valid by close to 10
>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>> stability.
>>>>
>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>> rather than later; I will update as I get more data.
>>>
>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>> affect a FRED host either, except perhaps in code layout.
>>>
>>> I don't actually have a FRED capable machine, but have you tried running
>>> one of those top-down perf things on it, to see where it's hurting?
>>
>> Not yet, but I'm investigating right now (I have some family
>> obligations this weekend, so my duty cycle is somewhat limited.)
>>
>> I reverted the patch on top of rc7, and it did, in fact, fix the
>> regression,
>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>> there isn't anything in my test setup that could introduce any kind of
>> "memory" between builds...
>>
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> (with the top and bottom octiles removed as outlier protection); with
> 7.1rc7 with the above patch reverted it is 397.5±0.4 -- this is in fact
> a 20% increase in latency, not 13%...
>

OK, I have, I believe, root-caused this.

It is a padding issue; removing the code changes __pfx_x64_sys_call to be
32-byte aligned, with the result that x64_sys_call gets *mis*aligned.

Reverting the patch but adding an alignment statement to x64_sys_call
re-introduces the performance regression.

I am concerned because this could mean that the __pfx stubs add substantial
overhead elsewhere, unless this just happens to be a particularly sensitive
case...

-hpa
^ permalink raw reply [flat|nested] 24+ messages in thread
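To make the padding arithmetic concrete: the __pfx_ symbol marks the start of the padding bytes emitted in front of a function, so the real entry point sits a fixed number of bytes after it. The sketch below assumes 16 bytes of padding (an assumption for illustration; the real value comes from the kernel configuration) and uses made-up addresses and helper names.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of addr within its 'align'-byte block; align must be a power of two. */
static unsigned int offset_in_block(uint64_t addr, unsigned int align)
{
	return (unsigned int)(addr & (uint64_t)(align - 1));
}

/*
 * Entry point of a function whose padding area (the "__pfx_" symbol)
 * starts at pfx_addr and is padding_bytes long.
 */
static uint64_t entry_from_prefix(uint64_t pfx_addr, unsigned int padding_bytes)
{
	return pfx_addr + padding_bytes;
}
```

Under these assumptions, a prefix that starts on a 32-byte boundary puts the entry point 16 bytes into the block, while a prefix that starts 16 bytes later puts the entry point exactly on the next 32-byte boundary. Either way the entry stays 16-byte aligned, which is why the shift is invisible to the default function alignment.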
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
@ 2026-06-14 18:08 ` Xin Li
2026-06-14 18:31 ` H. Peter Anvin
2026-06-15 0:19 ` H. Peter Anvin
2026-06-16 8:28 ` Peter Zijlstra
2 siblings, 1 reply; 24+ messages in thread
From: Xin Li @ 2026-06-14 18:08 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, torvalds, x86-ML, LKML

> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family obligations
>>> this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>> (with the top and bottom octiles removed as outlier protection); with
>> 7.1rc7 with the above patch reverted it is 397.5±0.4 -- this is in fact
>> a 20% increase in latency, not 13%...
>
> OK, I have, I believe, root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.

The problem doesn’t happen to IDT?

>
> I am concerned because this could mean that the __pfx stubs add substantial
> overhead elsewhere, unless this just happens to be a particularly sensitive
> case...

Good point, alignment check should be applied to all such entries.

Thanks
Xin
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 18:08 ` Xin Li
@ 2026-06-14 18:31 ` H. Peter Anvin
0 siblings, 0 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-14 18:31 UTC (permalink / raw)
To: Xin Li
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, torvalds, x86-ML, LKML

On June 14, 2026 11:08:59 AM PDT, Xin Li <xin@zytor.com> wrote:
>
>> On Jun 13, 2026, at 6:50 PM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> On 2026-06-13 16:52, H. Peter Anvin wrote:
>>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>>> in system call latency between v7.0 and the current master, and it
>>>>>> bisects down to:
>>>>>>
>>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>>
>>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>>> is a bare metal boot, no KVM.
>>>>>>
>>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>>> and I will be investigating the possibility that this is a false
>>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>>> stability.
>>>>>>
>>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>>> rather than later; I will update as I get more data.
>>>>>
>>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>>> affect a FRED host either, except perhaps in code layout.
>>>>>
>>>>> I don't actually have a FRED capable machine, but have you tried running
>>>>> one of those top-down perf things on it, to see where it's hurting?
>>>>
>>>> Not yet, but I'm investigating right now (I have some family obligations
>>>> this weekend, so my duty cycle is somewhat limited.)
>>>>
>>>> I reverted the patch on top of rc7, and it did, in fact, fix the regression,
>>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>>> there isn't anything in my test setup that could introduce any kind of
>>>> "memory" between builds...
>>>>
>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with
>>> 7.1rc7 with the above patch reverted it is 397.5±0.4 -- this is in fact
>>> a 20% increase in latency, not 13%...
>>
>> OK, I have, I believe, root-caused this.
>>
>> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
>> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>>
>> Reverting the patch but adding an alignment statement to x64_sys_call
>> re-introduces the performance regression.
>
> The problem doesn’t happen to IDT?
>
>>
>> I am concerned because this could mean that the __pfx stubs add substantial
>> overhead elsewhere, unless this just happens to be a particularly sensitive
>> case...
>
> Good point, alignment check should be applied to all such entries.
>
> Thanks
> Xin

The problem is that if you put an alignment directive on a function, it
aligns the __pfx stub, which is exactly The Wrong Thing™. Otherwise this
would be easy to fix, permanently.

I haven't had time to test IDT yet. I assume it is similar.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
@ 2026-06-15 0:19 ` H. Peter Anvin
2026-06-15 2:07 ` H. Peter Anvin
2026-06-16 8:28 ` Peter Zijlstra
2 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15 0:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On 2026-06-13 18:50, H. Peter Anvin wrote:
> On 2026-06-13 16:52, H. Peter Anvin wrote:
>> On 2026-06-13 13:34, H. Peter Anvin wrote:
>>> On 2026-06-13 01:59, Peter Zijlstra wrote:
>>>> On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
>>>>> So I was trying to figure out a significant -- about 13% -- increase
>>>>> in system call latency between v7.0 and the current master, and it
>>>>> bisects down to:
>>>>>
>>>>> 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
>>>>>
>>>>> This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
>>>>> is a bare metal boot, no KVM.
>>>>>
>>>>> I'm personally extremely puzzled how this could possibly be related,
>>>>> and I will be investigating the possibility that this is a false
>>>>> bisect, but it is not a Heisenbug in any way; it has been extremely
>>>>> reproducible, and the difference is statistically valid by close to 10
>>>>> sigma. Furthermore, the bisection at least gave the appearance of
>>>>> stability.
>>>>>
>>>>> Given how late in the cycle this is I wanted to send an alert sooner
>>>>> rather than later; I will update as I get more data.
>>>>
>>>> Uhm, massive WTF indeed. I don't immediately see how this could possibly
>>>> affect a FRED host either, except perhaps in code layout.
>>>>
>>>> I don't actually have a FRED capable machine, but have you tried running
>>>> one of those top-down perf things on it, to see where it's hurting?
>>>
>>> Not yet, but I'm investigating right now (I have some family
>>> obligations this weekend, so my duty cycle is somewhat limited.)
>>>
>>> I reverted the patch on top of rc7, and it did, in fact, fix the
>>> regression,
>>> but I'm doing a clean from-scratch rebuild of both trees to make sure
>>> there isn't anything in my test setup that could introduce any kind of
>>> "memory" between builds...
>>>
>> Nope, even with the clean rebuild it is 100% reproducible. It is in
>> fact worse than I originally stated: the average with 7.1rc7 is 478±6
>> cycles (with the top and bottom octiles removed as outlier
>> protection); with 7.1rc7 with the above patch reverted it is
>> 397.5±0.4 -- this is in fact a 20% increase in latency, not 13%...
>>
>
> OK, I have, I believe, root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to
> be 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
>
> I am concerned because this could mean that the __pfx stubs add
> substantial overhead elsewhere, unless this just happens to be a
> particularly sensitive case...
>

OK, so v7.1 was released with this sizable performance regression. That
begs the question how to deal with it.

One option that might be reasonable for -stable is to simply add back 16
bytes of NOPs into the assembly file. However, that is obviously not a
long term fix.

Any thoughts?

-hpa
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 0:19 ` H. Peter Anvin
@ 2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
` (2 more replies)
0 siblings, 3 replies; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15 2:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

[-- Attachment #1: Type: text/plain, Size: 1222 bytes --]

On 2026-06-14 17:19, H. Peter Anvin wrote:
>
> OK, so v7.1 was released with this sizable performance regression. That
> begs the question how to deal with it.
>
> One option that might be reasonable for -stable is to simply add back 16
> bytes of NOPs into the assembly file. However, that is obviously not a
> long term fix.
>

Okay, here is a hack that actually generates the proper alignment, and it
DOES in fact fix the performance regression.

It uses the same hack as the Makefile to deal with function alignment with a
prefix: it adds unnecessary NOPs so that the pre-alignment and
post-alignment are the same. At the end of the day this really ought to be
fixed in gcc.

This is not meant to be a final patch; this should go in a header file and
be cleaned up etc, but I wanted to confirm that it does, in fact, fix the
regression and that the alignment of x64_sys_call is the root cause of the
problem.

PeterZ: at some point you and I talked about the following:

- Should x64_sys_call() be noinstr?
- If so, any reason we can't inline it into do_syscall_64()?
- Since we no longer use the sys_call_table[] as a jump table,
  do we actually need array_index_nospec() in do_syscall_x64|32?

-hpa

[-- Attachment #2: diff --]
[-- Type: text/plain, Size: 1428 bytes --]

diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..337e3e53d262 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -9,6 +9,14 @@
 #include <linux/nospec.h>
 #include <asm/syscall.h>
 
+#ifdef CONFIG_CALL_PADDING
+# define _pfe(x) __attribute((patchable_function_entry(x,x)))
+#else
+# define _pfe(x)
+#endif
+#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
+
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
 #include <asm/syscalls_64.h>
@@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 #include <asm/syscalls_64.h>
@@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
 	switch (nr) {
 #include <asm/syscalls_x32.h>
^ permalink raw reply related [flat|nested] 24+ messages in thread
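The arithmetic inside align_func() can be checked in isolation. The sketch below mirrors the macro's expressions as plain functions. The values 16 and 16 used in the example for CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES are assumptions for illustration, not taken from the configuration used in the thread, and the model relies on the thread's statement that the alignment lands on the start of the padding rather than on the entry point. The helper names are hypothetical.

```c
#include <assert.h>

/* Mirrors align_func(): a request below the default alignment is raised to it. */
static unsigned int effective_align(unsigned int x, unsigned int func_align)
{
	return x < func_align ? func_align : x;
}

/*
 * Mirrors the argument handed to patchable_function_entry(n, n) by
 * _align_func(): n NOP bytes, all of them placed before the entry point.
 */
static unsigned int pfe_nops(unsigned int x, unsigned int func_align,
			     unsigned int padding)
{
	return x - func_align + padding;
}

/*
 * Offset of the entry point inside its aligned block, assuming the
 * start of the NOP area is what actually gets aligned.
 */
static unsigned int entry_offset(unsigned int x, unsigned int func_align,
				 unsigned int padding)
{
	unsigned int a = effective_align(x, func_align);

	return pfe_nops(a, func_align, padding) % a;
}
```

With both config values at 16, align_func(32) asks for 32 NOP bytes in front of a 32-byte-aligned start, so the entry point itself ends up on a 32-byte boundary. When the padding size differs from the function alignment, the same expression leaves the entry at a nonzero offset.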
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
@ 2026-06-15 3:41 ` Linus Torvalds
2026-06-15 18:30 ` H. Peter Anvin
2026-06-16 7:38 ` Peter Zijlstra
2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2026-06-15 3:41 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>
> - Since we no longer use the sys_call_table[] as a jump table,
>   do we actually need array_index_nospec() in do_syscall_x64|32?

Well, gcc will still generate a jump table from it when retpolines
aren't enabled.

So I think we do want that array_index_nospec. It should be cheap
insurance against the simplest kinds of speculation issues.

Linus
^ permalink raw reply [flat|nested] 24+ messages in thread
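For context on why this is described as cheap: array_index_nospec() clamps an index with a branch-free mask, so that a mispredicted bounds check cannot be followed by a speculative out-of-bounds access. The user-space sketch below is modelled on the generic C fallback in include/linux/nospec.h; it is not the kernel's code, architectures may supply their own implementation, and it assumes the compiler performs an arithmetic right shift on negative signed values (implementation-defined in C, true for gcc and clang). The demo table and helper names are made up.

```c
#include <assert.h>
#include <limits.h>

/*
 * All ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative longs.
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	long v = (long)(index | (size - 1UL - index));

	return (unsigned long)(~v >> (sizeof(long) * CHAR_BIT - 1));
}

/* In-range indices pass through unchanged; out-of-range ones become 0. */
static unsigned long clamp_index_nospec(unsigned long index, unsigned long size)
{
	return index & index_mask_nospec(index, size);
}

/* Hypothetical four-entry table standing in for a dispatch table. */
static const int demo_table[4] = { 10, 20, 30, 40 };

static int demo_lookup(unsigned long nr)
{
	if (nr >= 4)
		return -1;
	/* Architecturally a no-op here; it only constrains speculation. */
	nr = clamp_index_nospec(nr, 4);
	return demo_table[nr];
}
```

The cost on the normal path is a handful of ALU operations and no extra branch, which is the sense in which it works as inexpensive insurance.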
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 3:41 ` Linus Torvalds
@ 2026-06-15 18:30 ` H. Peter Anvin
2026-06-16 7:12 ` Peter Zijlstra
0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-15 18:30 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On 2026-06-14 20:41, Linus Torvalds wrote:
> On Mon, 15 Jun 2026 at 07:38, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> - Since we no longer use the sys_call_table[] as a jump table,
>>   do we actually need array_index_nospec() in do_syscall_x64|32?
>
> Well, gcc will still generate a jump table from it when retpolines
> aren't enabled.
>
> So I think we do want that array_index_nospec. It should be cheap
> insurance against the simplest kinds of speculation issues.
>

Well, we could put it under an #ifdef by adding a macro to detect when we use
-fno-jump-tables. PeterZ and I have also been talking about making
-fno-jump-tables unconditional, because at some point we found that the
performance difference was negligible, at least when array_index_nospec() is
necessary, and it makes it a lot easier to tune when you don't have to deal
with code bases that compile. It is not just retpoline but also IBT
(although the comment says "for now"); this of course means in practice that
the kernels everyone uses are compiled without jump tables.

The system call dispatch is really the biggest case here.

It does, however, make me think that using regs->ax to dispatch system calls
in the FRED path might actually be The Wrong Thing[TM]; FRED delivery is a
speculation barrier and so %rax is guaranteed to be stable at that point.

*In practice* the stack engine probably would propagate that (I can't really
think of any way to implement a stack engine that wouldn't, and I suspect if
it didn't we would have lots of other issues) but instead of dumping it into
memory and reading it back it probably would be better to do what the
SYSCALL path does and move it into an argument register instead.

I have experimented with micro-optimizations of the FRED path lately, in
part because FRED inherently does provide speculation guarantees that
SYSCALL/SYSRET do not, in part because some of the code paths have a fair
bit of unnecessary overhead in general, some of which affects FRED
disproportionately (some duplicates work that FRED does inherently, for one
thing.)

So far I have been somewhat surprised how *little* effect some of them have
had; clearly branch prediction does a really good job sometimes even without
static branches. Still, some pretty simple changes can get a few percent
improvement, well above the statistical noise margin.

Doing a *very* early-out and dispatching do_syscall_64() already in
asm_entry_point_user is one of the more effective hacks; I am (or rather,
was, until I discovered this immediate issue ;) also experimenting with
having separate IDT and FRED versions of do_syscall_64() -- the code factors
very cleanly and the duplication is nearly all at the object code level.

Part of my questions to PeterZ was because I believe that inlining
x64_sys_call() will benefit a fair bit from better code layout. We have
talked about sunsetting x32, but until we do, merging x32_sys_call() into
the same function also ends up with the two switch statements being able to
share a fair bit of code, since there are large contiguous chunks of x32
system call space which are the same as x64.

One of the things I have been thinking about, too, is to move FRED- and
IDT-specific code into separate text sections; not only so that they can be
close together in memory, but also so that we can poison out the areas that
aren't being used. Every code flow that has almost unlimited versatility is,
obviously, *extremely* desirable as targets for execution redirection
attacks...

-hpa
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 18:30 ` H. Peter Anvin
@ 2026-06-16 7:12 ` Peter Zijlstra
0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:12 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On Mon, Jun 15, 2026 at 11:30:11AM -0700, H. Peter Anvin wrote:
> Well, we could put it under an #ifdef by adding a macro to detect when we use
> -fno-jump-tables. PeterZ and I have also been talking about making
> -fno-jump-tables unconditional, because at some point we found that the
> performance difference was negligible, at least when array_index_nospec() is
> necessary, and it makes it a lot easier to tune when you don't have to deal
> with code bases that compile. It is not just retpoline but also IBT
> (although the comment says "for now"); this of course means in practice that
> the kernels everyone uses are compiled without jump tables.

The IBT thing is because GCC (and I assume, but haven't checked, clang
too) generated NOTRACK prefixes for jump tables. And we have explicitly
disallowed NOTRACK for kernel IBT.

The "not yet" pertains to the compilers being changed to not use
NOTRACK; but I don't think this is anything anybody is actively chasing
up on.

So yeah, effectively jump-tables are disabled for everybody.
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
@ 2026-06-16 7:38 ` Peter Zijlstra
2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:38 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:
> PeterZ: at some point you and I talked about the following:
>
> - Should x64_sys_call() be noinstr?

I still think it should be, yes. But I also think it wants __noendbr,
there is no sane reason you should ever be allowed to do an indirect
call to this.

Realistically, objtool will seal this function (scribble the ENDBR), but
really, it just shouldn't be there to begin with.

> - If so, any reason we can't inline it into do_syscall_64()?

Code gen, GCC makes a mess out of things if you do that. x64_sys_call()
now ends up being a giant pile of tail-calls. If you inline it into
do_syscall_x64() that goes out the window.

> - Since we no longer use the sys_call_table[] as a jump table,
>   do we actually need array_index_nospec() in do_syscall_x64|32?

It would mean unconditionally disabling jump-tables -- at least for this
TU, but possibly for the whole thing (mixed compiler flags and LTO is a
pain you don't need IIRC).
^ permalink raw reply [flat|nested] 24+ messages in thread
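The "giant pile of tail-calls" comes from how x64_sys_call() is generated: __SYSCALL() is redefined to emit one case per system call, each of which simply returns the handler's result. A minimal user-space sketch of the same pattern follows, using a hypothetical three-entry list macro in place of the generated <asm/syscalls_64.h> and made-up handlers; only the shape of the switch is meant to match.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for struct pt_regs; only one field is needed for the demo. */
struct demo_regs {
	long di;
};

/* Hypothetical system call list; the kernel generates its list at build time. */
#define DEMO_SYSCALLS(X)	\
	X(0, demo_read)		\
	X(1, demo_write)	\
	X(2, demo_open)

/* First expansion: declare every handler. */
#define DEMO_DECLARE(nr, sym) static long sym(const struct demo_regs *regs);
DEMO_SYSCALLS(DEMO_DECLARE)
#undef DEMO_DECLARE

/* Second expansion: one case per entry, each a plain "return handler(regs)". */
static long demo_sys_call(const struct demo_regs *regs, unsigned int nr)
{
	switch (nr) {
#define DEMO_CASE(nr, sym) case nr: return sym(regs);
	DEMO_SYSCALLS(DEMO_CASE)
#undef DEMO_CASE
	default:
		return -ENOSYS;
	}
}

static long demo_read(const struct demo_regs *regs)  { return regs->di + 100; }
static long demo_write(const struct demo_regs *regs) { return regs->di + 200; }
static long demo_open(const struct demo_regs *regs)  { return regs->di + 300; }
```

Because every case ends in a call whose result is returned unchanged, a compiler is free to turn each one into a jump to the handler when the dispatcher is a standalone function; inlining the switch into a caller that does further work after the call removes that option, which is the code generation concern raised above.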
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
2026-06-16 7:38 ` Peter Zijlstra
@ 2026-06-16 7:53 ` Peter Zijlstra
2 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16 7:53 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On Sun, Jun 14, 2026 at 07:07:50PM -0700, H. Peter Anvin wrote:

> It uses the same hack as the Makefile to deal with function alignment with a
> prefix: it adds unnecessary NOPs so that the pre-alignment and
> post-alignment are the same. At the end of the day this really ought to be
> fixed in gcc.

And clang, but I don't think they can, it wrecks the 'ABI' they have in
place with the current set of arguments. Which I agree is somewhat
unfortunate, but it is what it is.

> diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
> index 71f032504e73..337e3e53d262 100644
> --- a/arch/x86/entry/syscall_64.c
> +++ b/arch/x86/entry/syscall_64.c
> @@ -9,6 +9,14 @@
>  #include <linux/nospec.h>
>  #include <asm/syscall.h>
>  
> +#ifdef CONFIG_CALL_PADDING
> +# define _pfe(x) __attribute((patchable_function_entry(x,x)))
> +#else
> +# define _pfe(x)
> +#endif
> +#define _align_func(x) __aligned(x) _pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
> +#define align_func(x) _align_func((x) < CONFIG_FUNCTION_ALIGNMENT ? CONFIG_FUNCTION_ALIGNMENT : (x))
> +
>  #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
>  #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
>  #include <asm/syscalls_64.h>
> @@ -32,7 +40,7 @@ const sys_call_ptr_t sys_call_table[] = {
>  #undef __SYSCALL
>  
>  #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
> -long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_64.h>
> @@ -41,7 +49,7 @@ long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
>  }
>  
>  #ifdef CONFIG_X86_X32_ABI
> -long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
> +long align_func(32) x32_sys_call(const struct pt_regs *regs, unsigned int nr)
>  {
>  	switch (nr) {
>  	#include <asm/syscalls_x32.h>

This more or less works by accident, in general your align_func() macro
is horrendously broken when you consider kCFI. By changing the
patchable_function_entry attribute like this, the kCFI hash ends up at a
different location and things go side-ways really really fast.

The only reason it works here is that this function is never indirectly
called and so the kCFI ABI violation is immaterial.

^ permalink raw reply [flat|nested] 24+ messages in thread
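Editorial sketch, not part of the thread: the padding arithmetic in the quoted align_func() macro can be modelled in plain C. This assumes, as hpa's description suggests, that the compiler aligns the start of the NOP prefix (the __pfx_ symbol) and that the function entry follows the prefix directly. The function names are hypothetical, and the value 16 used in the usage note for both CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES is an assumption for illustration only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative model only. func_align and padding_bytes stand in for
 * CONFIG_FUNCTION_ALIGNMENT and CONFIG_FUNCTION_PADDING_BYTES; the real
 * values depend on the kernel configuration.
 */

/* NOP bytes requested by _align_func(x): x - alignment + padding. */
unsigned int pfe_bytes(unsigned int x, unsigned int func_align,
		       unsigned int padding_bytes)
{
	assert(x >= func_align);	/* align_func() clamps x to at least this */
	return x - func_align + padding_bytes;
}

/*
 * Assumption: the prefix start is aligned to x, so the entry point lands
 * on an x-byte boundary only if the prefix length is a multiple of x.
 */
bool entry_is_aligned(unsigned int x, unsigned int prefix_nops)
{
	assert(x > 0);
	return prefix_nops % x == 0;
}
```

Under these assumptions, pfe_bytes(32, 16, 16) gives 32 prefix bytes, so a 32-byte-aligned prefix leaves the entry point 32-byte aligned as well, whereas a 16-byte prefix behind a 32-byte-aligned start would leave the entry at offset 16.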
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
2026-06-15 0:19 ` H. Peter Anvin
@ 2026-06-16 8:28 ` Peter Zijlstra
2026-06-16 8:46 ` Linus Torvalds
2026-06-16 13:53 ` David Laight
2 siblings, 2 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-16 8:28 UTC (permalink / raw)
To: H. Peter Anvin
Cc: tglx, mingo, bp, Nathan Chancellor, Calvin Owens, Dave Hansen,
torvalds, x86-ML, LKML

On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:

> OK, I have, I believe root-caused this.
>
> It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
>
> Reverting the patch but adding an alignment statement to x64_sys_call
> re-introduces the performance regression.
>
> I am concerned because this could mean that the __pfx stubs add substantial
> overhead elsewhere, unless this just happens to be a particularly sensitive
> case...

So what is the actual alignment requirement these days then? We're
building the (x86_64) kernel with 16 byte function and 1 byte jump
alignment.

So ISTR the Intel I-fetch window was 16 bytes, so the above things would
make sense. However, Gemini, or whatever AI sits in google search, is
trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
and later.

This all seems to suggest we do something like so, hmm?

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b9f5a4a3cc2a..65fff65271d0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -329,7 +329,9 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select PROC_PID_ARCH_STATUS		if PROC_FS
 	select HAVE_ARCH_NODE_DEV_GROUP		if X86_SGX
-	select FUNCTION_ALIGNMENT_16B		if X86_64 || X86_ALIGNMENT_16
+	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
+	select FUNCTION_ALIGNMENT_32B		if X86_64
+	select FUNCTION_ALIGNMENT_16B		if X86_ALIGNMENT_16
 	select FUNCTION_ALIGNMENT_4B
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT	if EFI
 	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE

^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 8:28 ` Peter Zijlstra
@ 2026-06-16 8:46 ` Linus Torvalds
2026-06-16 9:51 ` Ingo Molnar
2026-06-17 12:37 ` Peter Zijlstra
2026-06-16 13:53 ` David Laight
1 sibling, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2026-06-16 8:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.

Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
to not be 64-byte aligned - simply because you may need to fetch more
cachelines (assuming fairly linear code).

And afaik, some of the newer ones aren't 32-byte wide, but can do 48
bytes as three 16-byte fetches.

But I don't know if they can do the old "split line access" that older
cores could do, where a Pentium would do two 8-byte accesses at the
same time, and they didn't have to be in the same cache line.

So 64-byte alignment would always be the best option if you only look
at a *particular* piece of code.

But it obviously is very wasteful and hurts when there is code around
it that could be loaded into the cache at the same time.

So almost certainly not a good idea in general.

But 64-byte alignment is probably what things like interrupt and
system call entrypoints should use, because those things would make
sense to look at as isolated things, not part of a bigger load. And
they are quite likely to start from a fairly cold-cache situation.

So *not* some general compiler option in a config file, but maybe a
special "entry point alignment" macro?

               Linus

^ permalink raw reply [flat|nested] 24+ messages in thread
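Editorial sketch, not part of the thread: the point that a misaligned start can pull in an extra cacheline for fairly linear code can be checked with a small helper. The function name is hypothetical, and the byte counts in the usage note are illustrative numbers, not measurements from this regression.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines of size `line` covered by [start, start + len). */
size_t lines_touched(size_t start, size_t len, size_t line)
{
	assert(line > 0 && len > 0);
	return (start + len - 1) / line - start / line + 1;
}
```

For example, 64 bytes of straight-line code starting on a 64-byte boundary occupy one line (lines_touched(0, 64, 64) is 1), while the same code starting at offset 16 spans two (lines_touched(16, 64, 64) is 2).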
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 8:46 ` Linus Torvalds
@ 2026-06-16 9:51 ` Ingo Molnar
2026-06-16 17:44 ` H. Peter Anvin
2026-06-17 12:37 ` Peter Zijlstra
1 sibling, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2026-06-16 9:51 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor,
Calvin Owens, Dave Hansen, x86-ML, LKML

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
>
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
>
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
>
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
>
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
>
> So almost certainly not a good idea in general.
>
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load. And
> they are quite likely to start from a fairly cold-cache situation.
>
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it,
I'm also still somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function
alignment changes, do we know *why* it causes an 80-cycle
shift in system call entry performance?

What does the benchmark measure, cache-cold or cache-hot
execution?

1) Cache-cold performance:

If it is cold-cache performance, does the misaligned case fetch
one more cold cacheline?

From which cache does it miss? Fetching from the 2-4MB Panther Lake
L2 shouldn't be 80 cycles, it should be ~17 cycles.

If it's fetching from the 18MB L3 (which I'd say is the norm for
most workloads), then the L3->L1I latency is around ~55 cycles on
Panther Lake, with everything included.

It cannot really be DRAM latency, ie. true cache-cold latency,
as that would be much more severe, in the 400 cycles range even
with premium DRAM modules - and more like 500 cycles with
mainstream DRAM modules and layouts. (Unless we are *lucky* with
alignment and sizing and the alignment regression doesn't trigger
full DRAM latency.) The on-die DRAM MSC cache's latency should
be around 300 cycles - that too is too high.

2) Cache-hot performance:

While cache-hot performance is less relevant for system calls
(which tend to be cache-cold in practice), if the benchmark
measures cache-hot performance, why is there an 80-cycle shift
from just a single misaligned symbol?

Ie. the specific and rather stable figure of 80 cycles overhead
does not seem to match any of the Panther Lake latencies that
ought to be relevant to this regression, if we use the simplest
mental model of what's going on when alignment changes.

So it is either some other uarch pathology, triggered by bad
alignment, or something doesn't add up in my mental model
of the root cause of this problem. :-)

Side notes:

 - The 6 cycles noise in the 478±6 cycles measurement
   does suggest that we might have missed out to a
   deeper cache hierarchy level, versus the rather
   stable 397.5±0.4 pre-regression figure.

 - I'm also assuming that 'cycles' here is a frequency-invariant
   standardized constant 5.1 GHz TSC value or so?

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 24+ messages in thread
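Editorial sketch, not part of the thread: the "20%, not 13%" figure and the roughly 80-cycle shift both follow from the two averages quoted in the message (397.5 and 478 cycles). The helper name is hypothetical.

```c
#include <assert.h>

/* Relative increase of `after` over `before`, in percent. */
double percent_increase(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}
```

With the quoted averages, the absolute difference is 478 - 397.5 = 80.5 cycles, and percent_increase(397.5, 478.0) comes out a little above 20%.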
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 9:51 ` Ingo Molnar
@ 2026-06-16 17:44 ` H. Peter Anvin
2026-06-17 9:54 ` Ingo Molnar
0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2026-06-16 17:44 UTC (permalink / raw)
To: Ingo Molnar, Linus Torvalds
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On June 16, 2026 2:51:12 AM PDT, Ingo Molnar <mingo@kernel.org> wrote:
>
>* Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
>> >
>> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
>> > make sense. However, Gemini, or whatever AI sits in google search, is
>> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>>
>> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
>> to not be 64-byte aligned - simply because you may need to fetch more
>> cachelines (assuming fairly linear code).
>>
>> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
>> bytes as three 16-byte fetches.
>>
>> But I don't know if they can do the old "split line access" that older
>> cores could do, where a Pentium would do two 8-byte accesses at the
>> same time, and they didn't have to be in the same cache line.
>>
>> So 64-byte alignment would always be the best option if you only look
>> at a *particular* piece of code.
>>
>> But it obviously is very wasteful and hurts when there is code around
>> it that could be loaded into the cache at the same time.
>>
>> So almost certainly not a good idea in general.
>>
>> But 64-byte alignment is probably what things like interrupt and
>> system call entrypoints should use, because those things would make
>> sense to look at as isolated things, not part of a bigger load. And
>> they are quite likely to start from a fairly cold-cache situation.
>>
>> So *not* some general compiler option in a config file, but maybe a
>> special "entry point alignment" macro?
>
>Yeah, agreed on that approach - but before/while we fix it,
>I'm also still somewhat baffled by the numbers hpa reported:
>
>>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>>> increase in latency, not 13%...
>
>Now that we know that this regression is caused by entry function
>alignment changes, do we know *why* it causes an 80-cycle
>shift in system call entry performance?
>
>What does the benchmark measure, cache-cold or cache-hot
>execution?
>
>1) Cache-cold performance:
>
>If it is cold-cache performance, does the misaligned case fetch
>one more cold cacheline?
>
>From which cache does it miss? Fetching from the 2-4MB Panther Lake
>L2 shouldn't be 80 cycles, it should be ~17 cycles.
>
>If it's fetching from the 18MB L3 (which I'd say is the norm for
>most workloads), then the L3->L1I latency is around ~55 cycles on
>Panther Lake, with everything included.
>
>It cannot really be DRAM latency, ie. true cache-cold latency,
>as that would be much more severe, in the 400 cycles range even
>with premium DRAM modules - and more like 500 cycles with
>mainstream DRAM modules and layouts. (Unless we are *lucky* with
>alignment and sizing and the alignment regression doesn't trigger
>full DRAM latency.) The on-die DRAM MSC cache's latency should
>be around 300 cycles - that too is too high.
>
>2) Cache-hot performance:
>
>While cache-hot performance is less relevant for system calls
>(which tend to be cache-cold in practice), if the benchmark
>measures cache-hot performance, why is there an 80-cycle shift
>from just a single misaligned symbol?
>
>Ie. the specific and rather stable figure of 80 cycles overhead
>does not seem to match any of the Panther Lake latencies that
>ought to be relevant to this regression, if we use the simplest
>mental model of what's going on when alignment changes.
>
>So it is either some other uarch pathology, triggered by bad
>alignment, or something doesn't add up in my mental model
>of the root cause of this problem. :-)
>
>Side notes:
>
> - The 6 cycles noise in the 478±6 cycles measurement
>   does suggest that we might have missed out to a
>   deeper cache hierarchy level, versus the rather
>   stable 397.5±0.4 pre-regression figure.
>
> - I'm also assuming that 'cycles' here is a frequency-invariant
>   standardized constant 5.1 GHz TSC value or so?
>
>Thanks,
>
>	Ingo

It's cache hot, calling getppid() in a tight loop. The units are
renormalized from TSC cycles to core cycles using fixed counter 1 to
determine the actual ratio.

^ permalink raw reply [flat|nested] 24+ messages in thread
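Editorial sketch, not part of the thread: hpa's actual harness is not shown, so the following only illustrates the two post-processing steps the thread describes, namely averaging with the top and bottom octiles removed and rescaling TSC cycles to core cycles by a measured ratio. All function names are hypothetical, and reading the TSC or fixed counter 1 is deliberately left out.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Mean of the samples after sorting and discarding the lowest and the
 * highest eighth (octile). Sorts `v` in place.
 */
double octile_trimmed_mean(double *v, size_t n)
{
	size_t cut = n / 8, i;
	double sum = 0.0;

	assert(n > 0);
	qsort(v, n, sizeof(*v), cmp_double);
	for (i = cut; i < n - cut; i++)
		sum += v[i];
	return sum / (double)(n - 2 * cut);
}

/*
 * Rescale a TSC delta to core cycles using the core-cycle/TSC ratio
 * observed over the whole run (both totals supplied by the caller).
 */
double tsc_to_core_cycles(double tsc_delta, double core_total,
			  double tsc_total)
{
	assert(tsc_total > 0.0);
	return tsc_delta * core_total / tsc_total;
}
```

With eight samples, one value is dropped from each end, so a single very slow and a single very fast iteration do not move the average.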
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 17:44 ` H. Peter Anvin
@ 2026-06-17 9:54 ` Ingo Molnar
2026-06-17 10:05 ` Ingo Molnar
0 siblings, 1 reply; 24+ messages in thread
From: Ingo Molnar @ 2026-06-17 9:54 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor,
Calvin Owens, Dave Hansen, x86-ML, LKML

* H. Peter Anvin <hpa@zytor.com> wrote:

> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.

Hm, in that light the 80 cycles overhead from a single
misaligned symbol is rather surprising (to me): it's way
too high to be reasonably caused by any hot cache alignment
effects - and all of the regular instruction caches
(or even data caches) should be more than large enough to
fit such a getppid() benchmark fully into the cache.

Would be nice to see before/after perf stat --repeat <N>
figures with sufficiently high <N> to get <0.1% stddev?

And just to guess around a bit, here's the various caches,
buffers and queues on a Panther Lake Performance Core
(Cougar Cove) that may play a role:

 - L0 Data Cache (L0D)            48 KB        768 cachelines
 - L1 Data Cache (L1D)           192 KB      3,072 cachelines
 - L1 Instruction Cache (L1I)     64 KB      1,024 cachelines
 - L2 Cache                    3,072 KB     49,152 cachelines

 - uOP Cache (Micro-op Cache)  - ~5,250 uOPs, ~64 sets x 10-12-way
 - uOP Queue                   - 192 entries
 - Reorder Buffer (ROB)        - 576 entries
 - L1 Data TLB (DTLB)          - 128 entries
 - L2 Shared TLB (STLB)        - ~4,096 entries
 - Return Stack Buffer (RSB)   - 24 entries
 - Load Queue                  - ~114 entries
 - Store Queue                 - ~56 entries

Where all cacheline sizes are 64 bytes, and a uOP cache 'set'
fits up to 6-8 uops.

I think with a cache-hot syscall benchmark we can exclude
the largest caches with over 1,000 effective entries with
near certainty as a factor, so what is left are:

 - L0 Data Cache (L0D)            48 KB        768 cachelines

 - uOP Cache (Micro-op Cache)  - ~5,250 uOPs ~64 sets x 10-12-way
 - uOP Queue                   - 192 entries
 - Reorder Buffer (ROB)        - 576 entries
 - L1 Data TLB (DTLB)          - 128 entries
 - Return Stack Buffer (RSB)   - 24 entries
 - Load Queue                  - ~114 entries
 - Store Queue                 - ~56 entries

I'd exclude the L0D, L1DTLB, the RSB and the load/store queues
as well, because code alignment of a single symbol should have
a minimal effect on them, which leaves:

 - uOP Queue                   - 192 entries
 - uOP Cache (Micro-op Cache)  - ~5,250 uOPs, ~64 sets x 10-12 way
 - Reorder Buffer (ROB)        - 576 entries

And I think of these the main suspect would be the uOP cache,
because its (estimated...) ~10-12 deep associativity limit
of uop-sets may be something this benchmark is hitting on
Panther Lake?

Could it be that the extra alignment adds +1 to the maximum number
of uOP cache 'ways' this execution hits in the uOP cache, moving
it from say 12 (still fits) to 13 (misses) so that this particular
uOP cache association depth starts thrashing? But I'm really just
guessing wildly here...

( The extra statistical noise of the regressed figures does suggest
  some sort of thrashing mechanic behind the scenes though, and the
  regular caches seem large enough to not actually thrash for such
  a cache-hot benchmark. )

Or am I missing something obvious?

Any perf stat uOP related counter measurements might be illuminating.

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 24+ messages in thread
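Editorial sketch, not part of the thread: the cacheline counts in the list above follow from dividing each quoted capacity by the 64-byte line size stated in the message. The helper name is hypothetical; the capacities are the ones quoted in the message and are not independently verified here.

```c
#include <assert.h>

#define CACHELINE_BYTES 64u	/* line size stated in the message */

/* Number of cache lines in a cache of `kib` KiB. */
unsigned int cachelines(unsigned int kib)
{
	assert(kib > 0);
	return kib * 1024u / CACHELINE_BYTES;
}
```

By this arithmetic 48 KB corresponds to 768 lines, 64 KB to 1,024, 192 KB to 3,072 and 3,072 KB to 49,152.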
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-17 9:54 ` Ingo Molnar
@ 2026-06-17 10:05 ` Ingo Molnar
0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2026-06-17 10:05 UTC (permalink / raw)
To: H. Peter Anvin
Cc: Linus Torvalds, Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor,
Calvin Owens, Dave Hansen, x86-ML, LKML

* Ingo Molnar <mingo@kernel.org> wrote:

> I'd exclude the L0D, L1DTLB, the RSB and the load/store queues
> as well, because code alignment of a single symbol should have
> a minimal effect on them, which leaves:
>
>  - uOP Queue                   - 192 entries
>  - uOP Cache (Micro-op Cache)  - ~5,250 uOPs, ~64 sets x 10-12 way
>  - Reorder Buffer (ROB)        - 576 entries
>
> And I think of these the main suspect would be the uOP cache,
> because its (estimated...) ~10-12 deep associativity limit
> of uop-sets may be something this benchmark is hitting on
> Panther Lake?
>
> Could it be that the extra alignment adds +1 to the maximum number
> of uOP cache 'ways' this execution hits in the uOP cache, moving
> it from say 12 (still fits) to 13 (misses) so that this particular
> uOP cache association depth starts thrashing? But I'm really just
> guessing wildly here...
>
> ( The extra statistical noise of the regressed figures does suggest
>   some sort of thrashing mechanic behind the scenes though, and the
>   regular caches seem large enough to not actually thrash for such
>   a cache-hot benchmark. )
>
> Or am I missing something obvious?
>
> Any perf stat uOP related counter measurements might be illuminating.

The relevant uOP cache (Intel DSB) perf stat counters would be:

  starship:~/tip> git grep DSB_ tools/perf/pmu-events/arch/x86/pantherlake/
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json: "EventName": "FRONTEND_RETIRED.ANY_DSB_MISS",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json: "EventName": "FRONTEND_RETIRED.DSB_MISS",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json: "EventName": "IDQ.DSB_CYCLES_ANY",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json: "EventName": "IDQ.DSB_CYCLES_OK",
  tools/perf/pmu-events/arch/x86/pantherlake/frontend.json: "EventName": "IDQ.DSB_UOPS",

In particular FRONTEND_RETIRED.ANY_DSB_MISS and FRONTEND_RETIRED.DSB_MISS
before/after counts?

Thanks,

	Ingo

^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 8:46 ` Linus Torvalds
2026-06-16 9:51 ` Ingo Molnar
@ 2026-06-17 12:37 ` Peter Zijlstra
1 sibling, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2026-06-17 12:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, x86-ML, LKML

On Tue, Jun 16, 2026 at 02:16:41PM +0530, Linus Torvalds wrote:

> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
>
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
>
> So almost certainly not a good idea in general.
>
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load. And
> they are quite likely to start from a fairly cold-cache situation.
>
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

This builds with kcfi on and seems to more or less do what is
expected. I've not actually tried performance measurements on my IDT
based system.

Obviously this would want splitting into a few patches, but it does:

 - makes -fno-jump-tables unconditional
 - removes array_index_nospec() from the syscall dispatch
 - makes x{32,64}_sys_call() 'static noinstr'
 - adds align_entry attribute that aligns on cacheline boundaries
   and disallows taking address
 - sprinkles align_entry on the noinstr syscall path

---
 arch/x86/Makefile              | 29 +++++++++--------------------
 arch/x86/entry/entry_fred.c    | 12 ++++++------
 arch/x86/entry/syscall_64.c    | 26 +++++++++++++++++++++-----
 arch/x86/include/asm/fred.h    |  5 +++--
 arch/x86/include/asm/syscall.h | 23 ++++++++++++++++++++---
 5 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 598f178102ee..b154a2a20eb2 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -90,17 +90,8 @@ CC_FLAGS_FPU += -mhard-float
 endif
 
 ifeq ($(CONFIG_X86_KERNEL_IBT),y)
-#
-# Kernel IBT has S_CET.NOTRACK_EN=0, as such the compilers must not generate
-# NOTRACK prefixes. Current generation compilers unconditionally employ NOTRACK
-# for jump-tables, as such, disable jump-tables for now.
-#
-# (jump-tables are implicitly disabled by RETPOLINE)
-#
-# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104816
-#
-KBUILD_CFLAGS += $(call cc-option,-fcf-protection=branch -fno-jump-tables)
-KBUILD_RUSTFLAGS += -Zcf-protection=branch $(if $(call rustc-min-version,109300),-Cjump-tables=n,-Zno-jump-tables)
+KBUILD_CFLAGS += $(call cc-option,-fcf-protection=branch)
+KBUILD_RUSTFLAGS += -Zcf-protection=branch
 else
 KBUILD_CFLAGS += $(call cc-option,-fcf-protection=none)
 endif
@@ -173,6 +164,13 @@ endif
         KBUILD_RUSTFLAGS += -Ccode-model=kernel
 
         percpu_seg := gs
+
+        # Due to retpolines and cf-protection=branch's implicit NOTRACK usage
+        # for jump-tables, blanket disable jump-tables for all x86_64 builds to
+        # get a consistent behaviour across configurations. This allows
+        # removing some array_index_nospec() usage.
+        KBUILD_CFLAGS += -fno-jump-tables
+        KBUILD_RUSTFLAGS += $(if $(call rustc-min-version,109300),-Cjump-tables=n,-Zno-jump-tables)
 endif
 
 ifeq ($(CONFIG_STACKPROTECTOR),y)
@@ -209,15 +207,6 @@ KBUILD_CFLAGS += -fno-asynchronous-unwind-tables
 ifdef CONFIG_MITIGATION_RETPOLINE
   KBUILD_CFLAGS += $(RETPOLINE_CFLAGS)
   KBUILD_RUSTFLAGS += $(RETPOLINE_RUSTFLAGS)
-  # Additionally, avoid generating expensive indirect jumps which
-  # are subject to retpolines for small number of switch cases.
-  # LLVM turns off jump table generation by default when under
-  # retpoline builds, however, gcc does not for x86. This has
-  # only been fixed starting from gcc stable version 8.4.0 and
-  # onwards, but not for older ones. See gcc bug #86952.
-  ifndef CONFIG_CC_IS_CLANG
-    KBUILD_CFLAGS += -fno-jump-tables
-  endif
 endif
 
 ifdef CONFIG_MITIGATION_SLS
diff --git a/arch/x86/entry/entry_fred.c b/arch/x86/entry/entry_fred.c
index fb3594ddf731..740fdf9bb08a 100644
--- a/arch/x86/entry/entry_fred.c
+++ b/arch/x86/entry/entry_fred.c
@@ -51,7 +51,7 @@ static noinstr void fred_bad_type(struct pt_regs *regs, unsigned long error_code
 	irqentry_nmi_exit(regs, irq_state);
 }
 
-static noinstr void fred_intx(struct pt_regs *regs)
+static noinstr align_entry void fred_intx(struct pt_regs *regs)
 {
 	switch (regs->fred_ss.vector) {
 	/* Opcode 0xcd, 0x3, NOT INT3 (opcode 0xcc) */
@@ -157,7 +157,7 @@ void __init fred_complete_exception_setup(void)
 	fred_setup_done = true;
 }
 
-static noinstr void fred_extint(struct pt_regs *regs)
+static noinstr align_entry void fred_extint(struct pt_regs *regs)
 {
 	unsigned int vector = regs->fred_ss.vector;
 
@@ -177,7 +177,7 @@ static noinstr void fred_extint(struct pt_regs *regs)
 	}
 }
 
-static noinstr void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
+static noinstr align_entry void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
 {
 	/* Optimize for #PF. That's the only exception which matters performance wise */
 	if (likely(regs->fred_ss.vector == X86_TRAP_PF))
@@ -216,7 +216,7 @@ static noinstr void fred_hwexc(struct pt_regs *regs, unsigned long error_code)
 
 }
 
-static noinstr void fred_swexc(struct pt_regs *regs, unsigned long error_code)
+static noinstr align_entry void fred_swexc(struct pt_regs *regs, unsigned long error_code)
 {
 	switch (regs->fred_ss.vector) {
 	case X86_TRAP_BP: return exc_int3(regs);
@@ -225,7 +225,7 @@ static noinstr void fred_swexc(struct pt_regs *regs, unsigned long error_code)
 	}
 }
 
-__visible noinstr void fred_entry_from_user(struct pt_regs *regs)
+__visible noinstr align_entry void fred_entry_from_user(struct pt_regs *regs)
 {
 	unsigned long error_code = regs->orig_ax;
 
@@ -257,7 +257,7 @@ __visible noinstr void fred_entry_from_user(struct pt_regs *regs)
 	return fred_bad_type(regs, error_code);
 }
 
-__visible noinstr void fred_entry_from_kernel(struct pt_regs *regs)
+__visible noinstr align_entry void fred_entry_from_kernel(struct pt_regs *regs)
 {
 	unsigned long error_code = regs->orig_ax;
 
diff --git a/arch/x86/entry/syscall_64.c b/arch/x86/entry/syscall_64.c
index 71f032504e73..10654c12dd36 100644
--- a/arch/x86/entry/syscall_64.c
+++ b/arch/x86/entry/syscall_64.c
@@ -8,6 +8,7 @@
 #include <linux/entry-common.h>
 #include <linux/nospec.h>
 #include <asm/syscall.h>
+#include <asm/ibt.h>
 
 #define __SYSCALL(nr, sym) extern long __x64_##sym(const struct pt_regs *);
 #define __SYSCALL_NORETURN(nr, sym) extern long __noreturn __x64_##sym(const struct pt_regs *);
@@ -32,23 +33,40 @@ const sys_call_ptr_t sys_call_table[] = {
 #undef __SYSCALL
 
 #define __SYSCALL(nr, sym) case nr: return __x64_##sym(regs);
-long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
+static noinstr align_entry long x64_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
+	/*
+	 * Because -fno-jump-tables, this compiles into a binary branch tree
+	 * rather than a jump-table. As such @nr is not used as an array
+	 * index. Additionally, this is an out-of-line function on purpose,
+	 * such that all the actual syscall function calls are tail-calls,
+	 * returning to our caller for the common bits.
+	 */
+	instrumentation_begin();
 	switch (nr) {
 	#include <asm/syscalls_64.h>
 	default: return __x64_sys_ni_syscall(regs);
 	}
+	instrumentation_end();
 }
 
 #ifdef CONFIG_X86_X32_ABI
-long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+static noinstr align_entry long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
 {
+	instrumentation_begin();
 	switch (nr) {
 	#include <asm/syscalls_x32.h>
 	default: return __x64_sys_ni_syscall(regs);
 	}
+	instrumentation_end();
+}
+#else
+static __always_inline long x32_sys_call(const struct pt_regs *regs, unsigned int nr)
+{
+	return __x64_sys_ni_syscall(regs);
 }
 #endif
+#undef __SYSCALL
 
 static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
 {
@@ -59,7 +77,6 @@ static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
 	unsigned int unr = nr;
 
 	if (likely(unr < NR_syscalls)) {
-		unr = array_index_nospec(unr, NR_syscalls);
 		regs->ax = x64_sys_call(regs, unr);
 		return true;
 	}
@@ -76,7 +93,6 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 	unsigned int xnr = nr - __X32_SYSCALL_BIT;
 
 	if (IS_ENABLED(CONFIG_X86_X32_ABI) && likely(xnr < X32_NR_syscalls)) {
-		xnr = array_index_nospec(xnr, X32_NR_syscalls);
 		regs->ax = x32_sys_call(regs, xnr);
 		return true;
 	}
@@ -84,7 +100,7 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 }
 
 /* Returns true to return using SYSRET, or false to use IRET */
-__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
+__visible noinstr align_entry bool do_syscall_64(struct pt_regs *regs, int nr)
 {
 	nr = syscall_enter_from_user_mode(regs, nr);
 
diff --git a/arch/x86/include/asm/fred.h b/arch/x86/include/asm/fred.h
index 18a2f811c358..10b8d73e4088 100644
--- a/arch/x86/include/asm/fred.h
+++ b/arch/x86/include/asm/fred.h
@@ -11,6 +11,7 @@
 #include <asm/asm.h>
 #include <asm/msr.h>
 #include <asm/trapnr.h>
+#include <asm/syscall.h>
 
 /*
  * FRED event return instruction opcodes for ERET{S,U}; supported in
@@ -67,8 +68,8 @@ void asm_fred_entrypoint_user(void);
 void asm_fred_entrypoint_kernel(void);
 void asm_fred_entry_from_kvm(struct fred_ss);
 
-__visible void fred_entry_from_user(struct pt_regs *regs);
-__visible void fred_entry_from_kernel(struct pt_regs *regs);
+__visible align_entry void fred_entry_from_user(struct pt_regs *regs);
+__visible align_entry void fred_entry_from_kernel(struct pt_regs *regs);
 __visible void __fred_entry_from_kvm(struct pt_regs *regs);
 
 /* Can be called from noinstr code, thus __always_inline */
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index c10dbb74cd00..624e7d6f30a3 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -20,13 +20,30 @@ typedef long (*sys_call_ptr_t)(const struct pt_regs *);
 
 extern const sys_call_ptr_t sys_call_table[];
 
+/*
+ * When changing patchable_function_entry for a function, the kCFI ABI is
+ * affected, therefore combine this with __noendbr, which disallows indirect
+ * calls and generates compiler warnings when the address is taken of such a
+ * function.
+ *
+ * This will effectively waste a full cacheline per align_entry user.
+ */
+#ifdef CONFIG_CALL_PADDING
+#define __pfe(x) __attribute__((patchable_function_entry(x,x))) __noendbr
+#else
+#define __pfe(x) __noendbr
+#endif
+
+#define __align_entry(x) __aligned(x) \
+	__pfe(x-CONFIG_FUNCTION_ALIGNMENT+CONFIG_FUNCTION_PADDING_BYTES)
+
+#define align_entry __align_entry(SMP_CACHE_BYTES)
+
 /*
  * These may not exist, but still put the prototypes in so we
  * can use IS_ENABLED().
  */
 extern long ia32_sys_call(const struct pt_regs *, unsigned int nr);
-extern long x32_sys_call(const struct pt_regs *, unsigned int nr);
-extern long x64_sys_call(const struct pt_regs *, unsigned int nr);
 
 /*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
@@ -172,7 +189,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-bool do_syscall_64(struct pt_regs *regs, int nr);
+align_entry bool do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_emulation(struct pt_regs *regs);
 
 #endif	/* CONFIG_X86_32 */

^ permalink raw reply related [flat|nested] 24+ messages in thread
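Editorial sketch, not part of the thread: a user-space analogue of the dispatch style the patch comment describes, where the syscall number selects a case in a switch instead of indexing a function-pointer table. All demo_* names are hypothetical stand-ins. How the compiler lowers the switch (compare-and-branch tree versus jump table) depends on flags such as -fno-jump-tables and is not something this sketch can demonstrate by itself.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for the __x64_sys_*() handlers. */
long demo_sys_a(long arg) { return arg + 1; }
long demo_sys_b(long arg) { return arg * 2; }
long demo_sys_ni(long arg) { (void)arg; return -ENOSYS; }

/*
 * Dispatch through a switch rather than table[nr](arg). The number is
 * only ever compared against constants in the source, and anything
 * unknown falls through to the "not implemented" handler. Each case is
 * a direct call in tail position.
 */
long demo_sys_call(long arg, unsigned int nr)
{
	switch (nr) {
	case 0: return demo_sys_a(arg);
	case 1: return demo_sys_b(arg);
	default: return demo_sys_ni(arg);
	}
}

/* Small usage example: known numbers dispatch, unknown ones are rejected. */
void demo_check(void)
{
	assert(demo_sys_call(41, 0) == 42);
	assert(demo_sys_call(21, 1) == 42);
	assert(demo_sys_call(0, 7) == -ENOSYS);
}
```

The design point argued in the thread is that, when the switch is guaranteed not to become a jump table, the number is never used as an array index, which is the reasoning given for dropping array_index_nospec() from the dispatch path.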
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-16 8:28 ` Peter Zijlstra
2026-06-16 8:46 ` Linus Torvalds
@ 2026-06-16 13:53 ` David Laight
1 sibling, 0 replies; 24+ messages in thread
From: David Laight @ 2026-06-16 13:53 UTC
To: Peter Zijlstra
Cc: H. Peter Anvin, tglx, mingo, bp, Nathan Chancellor, Calvin Owens,
Dave Hansen, torvalds, x86-ML, LKML

On Tue, 16 Jun 2026 10:28:14 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Sat, Jun 13, 2026 at 06:50:24PM -0700, H. Peter Anvin wrote:
>
> > OK, I have, I believe root-caused this.
> >
> > It is a padding issue; removing the code changes __pfx_x64_sys_call to be
> > 32-byte aligned, with the result that x64_sys_call gets *mis*aligned.
> >
> > Reverting the patch but adding an alignment statement to x64_sys_call
> > re-introduces the performance regression.
> >
> > I am concerned because this could mean that the __pfx stubs add substantial
> > overhead elsewhere, unless this just happens to be a particularly sensitive
> > case...
>
> So what is the actual alignment requirement these days then? We're
> building the (x86_64) kernel with 16 byte function and 1 byte jump
> alignment.
>
> So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> make sense. However, Gemini, or whatever AI sits in google search, is
> trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
>
> That same thing is saying AMD switched to 32 byte I-fetch with Zen (1)
> and later.

Basically you can't win.
I was looking at why a patch didn't give the expected performance gain
on a different base kernel build.
It seems to depend on whether the function (actually strlen) was aligned
to an odd or even 16 byte boundary.
If aligned to an even boundary the loop inside the function crossed
a 'significant' boundary and the code ran measurably slower.

If you start aligning loop tops and labels in general you probably lose
due to code bloat.
(Here the loop didn't need aligning, it just needed not to contain the
relevant boundary.)

In this case the extra padding will change the alignment of everything
that follows - and some of those might make a difference as well.
You'd need to add extra code further down the function to keep the size
the same (and hope the compiler keeps the functions in the same order).

David

>
> This all seems to suggest we do something like so, hmm?
>
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index b9f5a4a3cc2a..65fff65271d0 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -329,7 +329,9 @@ config X86
>  	select HAVE_ARCH_KCSAN if X86_64
>  	select PROC_PID_ARCH_STATUS if PROC_FS
>  	select HAVE_ARCH_NODE_DEV_GROUP if X86_SGX
> -	select FUNCTION_ALIGNMENT_16B if X86_64 || X86_ALIGNMENT_16
> +	# AMD-Zen+ and Intel-Alderlake+ moved to 32 byte I-fetch
> +	select FUNCTION_ALIGNMENT_32B if X86_64
> +	select FUNCTION_ALIGNMENT_16B if X86_ALIGNMENT_16
>  	select FUNCTION_ALIGNMENT_4B
>  	imply IMA_SECURE_AND_OR_TRUSTED_BOOT if EFI
>  	select HAVE_DYNAMIC_FTRACE_NO_PATCHABLE
>
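The remark that the loop "just needed not to contain the relevant boundary" can be sketched as a simple range check. The helper name and the example offsets below are hypothetical, and the 32-byte figure is only the value floated earlier in the thread, not a confirmed fetch width for any particular CPU.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if the byte range [start, start + len)
 * spans more than one 'boundary'-sized block, 0 otherwise.
 * 'boundary' must be non-zero; an empty range never crosses anything.
 */
static int crosses_boundary(uint64_t start, uint64_t len, uint64_t boundary)
{
	if (len == 0)
		return 0;

	/* Compare the block index of the first and the last byte. */
	return (start / boundary) != ((start + len - 1) / boundary);
}
```

For instance, a hypothetical 24-byte loop starting at offset 0x10 straddles a 32-byte boundary (crosses_boundary(0x10, 24, 32) is 1), while the same loop starting at offset 0x20 does not (crosses_boundary(0x20, 24, 32) is 0). Shifting the start of a function by 16 bytes is enough to flip the result.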
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-13 23:52 ` H. Peter Anvin
2026-06-14 1:50 ` H. Peter Anvin
@ 2026-06-14 2:11 ` Calvin Owens
2026-06-14 2:14 ` Calvin Owens
1 sibling, 1 reply; 24+ messages in thread
From: Calvin Owens @ 2026-06-14 2:11 UTC
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
torvalds, x86-ML, LKML

On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> On 2026-06-13 13:34, H. Peter Anvin wrote:
> > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > in system call latency between v7.0 and the current master, and it
> > > > bisects down to:
> > > >
> > > > 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > >
> > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > is a bare metal boot, no KVM.
> > > >
> > > > I'm personally extremely puzzled how this could possibly be related,
> > > > and I will be investigating the possibility that this is a false
> > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > reproducible, and the difference is statistically valid by close to 10
> > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > stability.
> > > >
> > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > rather than later; I will update as I get more data.
> > >
> > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > affect a FRED host either, except perhaps in code layout.
> > >
> > > I don't actually have a FRED capable machine, but have you tried running
> > > one of those top-down perf things on it, to see where it's hurting?
> >
> > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> >
> > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > there isn't anything in my test setup that could introduce any kind of
> > "memory" between builds...
> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> increase in latency, not 13%...

It has to be the .text layout, doesn't it?

I notice we're splitting a cache line here now with the prefix symbol,
7.0-rc7 has:

ffffffff812175f0 <__pfx_x64_sys_call>:
ffffffff81217600 <x64_sys_call>:

If I revert 8aeb879baf12, I get:

ffffffff812175c0 <__pfx_x64_sys_call>:
ffffffff812175d0 <x64_sys_call>:

Could that be it?

Unfortunately I don't have any hardware new enough to poke at it myself.

Cheers,
Calvin
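To make the layout difference concrete, the sketch below computes where each quoted symbol address sits relative to a cache line. The 64-byte line size and the helper names are assumptions for illustration; the addresses are the ones quoted in the message above.

```c
#include <stdint.h>

/* Byte offset of 'addr' within a 'line'-sized block ('line' must be non-zero). */
static unsigned int offset_in_line(uint64_t addr, unsigned int line)
{
	return (unsigned int)(addr % line);
}

/* Index of the 'line'-sized block that contains 'addr'. */
static uint64_t line_index(uint64_t addr, unsigned int line)
{
	return addr / line;
}
```

Assuming 64-byte lines, x64_sys_call at ffffffff81217600 has offset 0, while its __pfx_ symbol at ffffffff812175f0 has offset 48 in the preceding line. With 8aeb879baf12 reverted, ffffffff812175c0 and ffffffff812175d0 have offsets 0 and 16 and share one line. This only describes the layout; it does not by itself say which arrangement is faster.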
* Re: 8aeb879baf12 - significant system call latency regression, bisected
2026-06-14 2:11 ` Calvin Owens
@ 2026-06-14 2:14 ` Calvin Owens
0 siblings, 0 replies; 24+ messages in thread
From: Calvin Owens @ 2026-06-14 2:14 UTC
To: H. Peter Anvin
Cc: Peter Zijlstra, tglx, mingo, bp, Nathan Chancellor, Dave Hansen,
torvalds, x86-ML, LKML

On Saturday 06/13 at 19:11 -0700, Calvin Owens wrote:
> On Saturday 06/13 at 16:52 -0700, H. Peter Anvin wrote:
> > On 2026-06-13 13:34, H. Peter Anvin wrote:
> > > On 2026-06-13 01:59, Peter Zijlstra wrote:
> > > > On Fri, Jun 12, 2026 at 06:45:06PM -0700, "H. Peter Anvin" (Intel) wrote:
> > > > > So I was trying to figure out a significant -- about 13% -- increase
> > > > > in system call latency between v7.0 and the current master, and it
> > > > > bisects down to:
> > > > >
> > > > > 8aeb879baf12 x86/kvm/vmx: Fix x86_64 CFI build
> > > > >
> > > > > This is on Panther Lake (Core Ultra X7 358H) with FRED enabled. This
> > > > > is a bare metal boot, no KVM.
> > > > >
> > > > > I'm personally extremely puzzled how this could possibly be related,
> > > > > and I will be investigating the possibility that this is a false
> > > > > bisect, but it is not a Heisenbug in any way; it has been extremely
> > > > > reproducible, and the difference is statistically valid by close to 10
> > > > > sigma. Furthermore, the bisection at least gave the appearance of
> > > > > stability.
> > > > >
> > > > > Given how late in the cycle this is I wanted to send an alert sooner
> > > > > rather than later; I will update as I get more data.
> > > >
> > > > Uhm, massive WTF indeed. I don't immediately see how this could possibly
> > > > affect a FRED host either, except perhaps in code layout.
> > > >
> > > > I don't actually have a FRED capable machine, but have you tried running
> > > > one of those top-down perf things on it, to see where it's hurting?
> > >
> > > Not yet, but I'm investigating right now (I have some family obligations this weekend, so my duty cycle is somewhat limited.)
> > >
> > > I reverted the patch on top of rc7, and it did, in fact, fix the regression,
> > > but I'm doing a clean from-scratch rebuild of both trees to make sure
> > > there isn't anything in my test setup that could introduce any kind of
> > > "memory" between builds...
> > Nope, even with the clean rebuild it is 100% reproducible. It is in fact
> > worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
> > (with the top and bottom octiles removed as outlier protection); with 7.1rc7
> > with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
> > increase in latency, not 13%...
>
> It has to be the .text layout, doesn't it?
>
> I notice we're splitting a cache line here now with the prefix symbol,
> 7.0-rc7 has:

Whoops, I meant 7.1-rc7. But seeing your other mail, sounds like this is it :)

> ffffffff812175f0 <__pfx_x64_sys_call>:
> ffffffff81217600 <x64_sys_call>:
>
> If I revert 8aeb879baf12, I get:
>
> ffffffff812175c0 <__pfx_x64_sys_call>:
> ffffffff812175d0 <x64_sys_call>:
>
> Could that be it?
>
> Unfortunately I don't have any hardware new enough to poke at it myself.
>
> Cheers,
> Calvin
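The quoted figures (478±6 versus 397.5±0.4 cycles, "with the top and bottom octiles removed") read like a trimmed mean. The following is a minimal sketch of one way to compute such a statistic and the resulting relative increase. It is an assumption about the method, not the measurement code actually used for these numbers, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a;
	double y = *(const double *)b;

	return (x > y) - (x < y);
}

/*
 * Sorts 'samples' in place, discards the lowest and highest n/8 entries,
 * and returns the mean of the rest. Returns 0.0 for an empty input.
 */
static double octile_trimmed_mean(double *samples, size_t n)
{
	size_t cut = n / 8;
	double sum = 0.0;
	size_t i;

	if (n == 0)
		return 0.0;

	qsort(samples, n, sizeof(*samples), cmp_double);
	for (i = cut; i < n - cut; i++)
		sum += samples[i];

	return sum / (double)(n - 2 * cut);
}

/* Relative change from 'before' to 'after', in percent. */
static double percent_increase(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Applied to the two quoted averages, percent_increase(397.5, 478.0) comes out at roughly 20.25, in line with the "20%, not 13%" correction in the quoted text.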
end of thread, other threads:[~2026-06-17 12:37 UTC | newest]

Thread overview: 24+ messages
2026-06-13 1:45 8aeb879baf12 - significant system call latency regression, bisected "H. Peter Anvin" (Intel)
2026-06-13 8:59 ` Peter Zijlstra
2026-06-13 20:34 ` H. Peter Anvin
2026-06-13 23:52 ` H. Peter Anvin
2026-06-14 1:50 ` H. Peter Anvin
2026-06-14 18:08 ` Xin Li
2026-06-14 18:31 ` H. Peter Anvin
2026-06-15 0:19 ` H. Peter Anvin
2026-06-15 2:07 ` H. Peter Anvin
2026-06-15 3:41 ` Linus Torvalds
2026-06-15 18:30 ` H. Peter Anvin
2026-06-16 7:12 ` Peter Zijlstra
2026-06-16 7:38 ` Peter Zijlstra
2026-06-16 7:53 ` Peter Zijlstra
2026-06-16 8:28 ` Peter Zijlstra
2026-06-16 8:46 ` Linus Torvalds
2026-06-16 9:51 ` Ingo Molnar
2026-06-16 17:44 ` H. Peter Anvin
2026-06-17 9:54 ` Ingo Molnar
2026-06-17 10:05 ` Ingo Molnar
2026-06-17 12:37 ` Peter Zijlstra
2026-06-16 13:53 ` David Laight
2026-06-14 2:11 ` Calvin Owens
2026-06-14 2:14 ` Calvin Owens