Re: 8aeb879baf12 - significant system call latency regression, bisected

From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	"H. Peter Anvin" <hpa@zytor.com>,
	tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
	Nathan Chancellor <nathan@kernel.org>,
	Calvin Owens <calvin@wbinvd.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86-ML <x86@kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: 8aeb879baf12 - significant system call latency regression, bisected
Date: Tue, 16 Jun 2026 11:51:12 +0200
Message-ID: <ajEckJgqJlTJgxic@gmail.com>
In-Reply-To: <CAHk-=wi2tNnbZ+w8kr1LHKJNFdQTXyA4wbhzd4cSi-W5cDpuhA@mail.gmail.com>

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 16 Jun 2026 at 13:58, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > So ISTR the Intel I-fetch window was 16 bytes, so the above things would
> > make sense. However, Gemini, or whatever AI sits in google search, is
> > trying to tell me Intel moved to 32 byte I-fetch with Alderlake.
> 
> Even with 16-byte fetch, the cacheline size is 64 bytes, so it hurts
> to not be 64-byte aligned - simply because you may need to fetch more
> cachelines (assuming fairly linear code).
> 
> And afaik, some of the newer ones aren't 32-byte wide, but can do 48
> bytes as three 16-byte fetches.
> 
> But I don't know if they can do the old "split line access" that older
> cores could do, where a Pentium would do two 8-byte accesses at the
> same time, and they didn't have to be in the same cache line.
> 
> So 64-byte alignment would always be the best option if you only look
> at a *particular* piece of code.
> 
> But it obviously is very wasteful and hurts when there is code around
> it that could be loaded into the cache at the same time.
> 
> So almost certainly not a good idea in general.
> 
> But 64-byte alignment is probably what things like interrupt and
> system call entrypoints should use, because those things would make
> sense to look at as isolated things, not part of a bigger load. And
> they are quite likely to start from a fairly cold-cache situation.
> 
> So *not* some general compiler option in a config file, but maybe a
> special "entry point alignment" macro?

Yeah, agreed on that approach - but before/while we fix it,
I'm also still somewhat baffled by the numbers hpa reported:

>>> Nope, even with the clean rebuild it is 100% reproducible. It is in fact
>>> worse than I originally stated: the average with 7.1rc7 is 478±6 cycles
>>> (with the top and bottom octiles removed as outlier protection); with 7.1rc7
>>> with the above patch reverted it is 397.5±0.4. - this is in fact a 20%
>>> increase in latency, not 13%...

Now that we know that this regression is caused by entry function
alignment changes, do we know *why* it causes an 80-cycle
shift in system call entry performance?
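
As a sanity check of the figures quoted above, here is a minimal
sketch of the arithmetic. The two averages are hpa's; the function
and variable names are illustrative only and are not taken from his
benchmark:

```python
# Averages quoted from hpa's report, in cycles.
before_cycles = 397.5   # 7.1-rc7 with the patch reverted
after_cycles = 478.0    # 7.1-rc7 as-is

def regression(before, after):
    """Return (absolute delta in cycles, relative increase in percent)."""
    delta = after - before
    return delta, 100.0 * delta / before

delta, pct = regression(before_cycles, after_cycles)
print(f"delta = {delta:.1f} cycles, increase = {pct:.1f}%")
# prints: delta = 80.5 cycles, increase = 20.3%
```

So the "80 cycles" used in the rest of this mail is the rounded
80.5-cycle difference between the two averages.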

What does the benchmark measure, cache-cold or cache-hot
execution?

1) Cache-cold performance:

If it is cold-cache performance, does the misaligned case fetch
one more cold cacheline?
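
To make the "one more cacheline" question concrete, a rough sketch
of how many 64-byte lines a linear code sequence covers depending on
where it starts. The 100-byte stub length and the offsets are
made-up numbers for illustration, not the real entry code:

```python
CACHELINE = 64  # bytes, the cacheline size discussed in this thread

def lines_touched(start, length, line=CACHELINE):
    """Number of distinct cachelines covered by [start, start + length)."""
    if length <= 0:
        return 0
    first = start // line
    last = (start + length - 1) // line
    return last - first + 1

# Hypothetical 100-byte entry stub at different start offsets:
for offset in (0, 16, 48):
    print(offset, lines_touched(offset, 100))
# prints:
# 0 2
# 16 2
# 48 3
```

Whether the misaligned case really pays for an extra line therefore
depends on both the start offset and how many bytes are actually
executed on the hot path.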

From which cache does it miss? Fetching from the 2-4MB Panther Lake
L2 shouldn't cost 80 cycles; it should be ~17 cycles.

If it's fetching from the 18MB L3 (which I'd say is the norm for
most workloads), then the L3->L1I latency is around ~55 cycles on
Panther Lake, with everything included.

It cannot really be DRAM latency, i.e. true cache-cold latency,
as that would be much more severe, in the 400-cycle range even
with premium DRAM modules - and more like 500 cycles with
mainstream DRAM modules and layouts. (Unless we are *lucky* with
alignment and sizing and the alignment regression doesn't trigger
full DRAM latency.) The on-die DRAM MSC cache's latency should
be around 300 cycles - that too is too high.

2) Cache-hot performance:

While cache-hot performance is less relevant for system calls
(which tend to be cache-cold in practice), if the benchmark
measures cache-hot performance, why is there an 80-cycle shift
from just a single misaligned symbol?

I.e. the specific and rather stable figure of 80 cycles of overhead
does not seem to match any of the Panther Lake latencies that
ought to be relevant to this regression, if we use the simplest
mental model of what's going on when alignment changes.
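
A crude way to put numbers on that mismatch, under the simplest
possible assumption that the whole delta is some number of extra
misses served from a single level. The latencies are the rough
estimates given in this mail, not measurements, and the helper name
is made up:

```python
# Approximate Panther Lake latencies as estimated in this mail, in cycles.
latencies = {
    "L2": 17,
    "L3": 55,
    "MSC": 300,
    "DRAM (premium)": 400,
    "DRAM (mainstream)": 500,
}

observed = 478.0 - 397.5  # delta between hpa's two averages, in cycles

def misses_explained(delta, latency):
    """How many misses at this latency would add up to the delta."""
    return delta / latency

for level, cycles in latencies.items():
    n = misses_explained(observed, cycles)
    print(f"{level:18s} {cycles:4d} cycles -> {n:.2f} misses")
```

With these estimates none of the levels comes out close to a whole
number of misses, which is the mismatch described above. A mix of
levels or a non-cache effect is of course not ruled out by this.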

So it is either some other uarch pathology, triggered by bad
alignment, or something doesn't add up in my mental model
of the root cause of this problem. :-)

Side notes:

 - The 6 cycles of noise in the 478±6 cycles measurement
   does suggest that we might have missed out to a
   deeper cache hierarchy level, versus the rather
   stable 397.5±0.4 pre-regression figure.

 - I'm also assuming that 'cycles' here means ticks of a
   frequency-invariant, constant-rate TSC at 5.1 GHz or so?
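
If that assumption holds, the cycle counts translate to wall-clock
time as sketched below. The 5.1 GHz rate is only my guess from the
note above, so treat the nanosecond values as approximate:

```python
TSC_HZ = 5.1e9  # assumed constant TSC rate; a guess, not a confirmed value

def cycles_to_ns(cycles, hz=TSC_HZ):
    """Convert TSC ticks to nanoseconds at a constant tick rate."""
    return cycles * 1e9 / hz

for c in (397.5, 478.0, 80.5):
    print(f"{c} cycles = {cycles_to_ns(c):.1f} ns")
# prints:
# 397.5 cycles = 77.9 ns
# 478.0 cycles = 93.7 ns
# 80.5 cycles = 15.8 ns
```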

Thanks,

	Ingo
