From: Ingo Molnar <mingo@kernel.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Peter Zijlstra <peterz@infradead.org>,
tglx@kernel.org, mingo@redhat.com, bp@alien8.de,
Nathan Chancellor <nathan@kernel.org>,
Calvin Owens <calvin@wbinvd.org>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86-ML <x86@kernel.org>, LKML <linux-kernel@vger.kernel.org>
Subject: Re: 8aeb879baf12 - significant system call latency regression, bisected
Date: Wed, 17 Jun 2026 11:54:46 +0200
Message-ID: <ajJu5nROLmoINPdR@gmail.com>
In-Reply-To: <9D5BDF61-3C14-4F21-9D8C-4AB7F28B7EEC@zytor.com>
* H. Peter Anvin <hpa@zytor.com> wrote:
> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.
Hm, in that light the 80 cycles overhead from a single
misaligned symbol is rather surprising (to me): it's
way too high to be reasonably caused by any hot cache
alignment effects - and all of the regular instruction
caches (or even data caches) should be more than large
enough to fit such a getppid() benchmark fully into the
cache.
Would be nice to see before/after perf stat --repeat <N>
figures, with sufficiently high <N> to get <0.1% stddev?
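
As a rough guide to what "sufficiently high <N>" could mean, here is
a minimal sketch. It assumes the run-to-run noise is independent and
that the reported spread shrinks roughly as 1/sqrt(N), as a standard
error of the mean would. The function name and the noise levels are
hypothetical, not measurements from this thread:

```python
import math


def repeats_needed(per_run_stddev_pct: float, target_pct: float) -> int:
    """Smallest N for which per_run_stddev_pct / sqrt(N) <= target_pct.

    Assumes independent runs, so the spread of the mean shrinks as 1/sqrt(N).
    """
    ratio = per_run_stddev_pct / target_pct
    # The small epsilon guards against floating-point noise pushing an
    # exact square (e.g. 20.0 ** 2) just above the next integer.
    return max(1, math.ceil(ratio * ratio - 1e-9))


# Hypothetical per-run noise levels, not measured values.
for noise_pct in (0.5, 1.0, 2.0):
    n = repeats_needed(noise_pct, 0.1)
    print(f"{noise_pct}% per-run noise -> N >= {n}")
```

If the per-run noise of the regressed kernel is much higher than that
of the good kernel, the two cases may need rather different <N>.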
And just to guess around a bit, here's the various caches,
buffers and queues on a Panther Lake Performance Core
(Cougar Cove) that may play a role:
- L0 Data Cache (L0D)             48 KB      768 cachelines
- L1 Data Cache (L1D)            192 KB    3,072 cachelines
- L1 Instruction Cache (L1I)      64 KB    1,024 cachelines
- L2 Cache                     3,072 KB   49,152 cachelines
- uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
- uOP Queue                    - 192 entries
- Reorder Buffer (ROB)         - 576 entries
- L1 Data TLB (DTLB)           - 128 entries
- L2 Shared TLB (STLB)         - ~4,096 entries
- Return Stack Buffer (RSB)    - 24 entries
- Load Queue                   - ~114 entries
- Store Queue                  - ~56 entries
Where all cacheline sizes are 64 bytes, and a uOP cache 'set'
fits up to 6-8 uops.
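
The cacheline counts are just size / 64 bytes, and the uOP cache
estimate can be sanity-checked the same way. A small sketch of that
arithmetic follows. It reads the 6-8 uops figure as the capacity of
one entry (one way of one set), which is my interpretation, and the
helper names are made up for illustration:

```python
CACHELINE_BYTES = 64


def cachelines(size_kb: int) -> int:
    """Number of 64-byte cachelines in a cache of size_kb kilobytes."""
    return size_kb * 1024 // CACHELINE_BYTES


def uop_cache_capacity(sets: int, ways: int, uops_per_way: int) -> int:
    """Total uops held, assuming every way of every set is full."""
    return sets * ways * uops_per_way


for name, kb in (("L0D", 48), ("L1D", 192), ("L1I", 64), ("L2", 3072)):
    print(f"{name}: {kb} KB -> {cachelines(kb):,} cachelines")

# Bounds for ~64 sets x 10-12 ways x 6-8 uops per entry.
low = uop_cache_capacity(64, 10, 6)
high = uop_cache_capacity(64, 12, 8)
print(f"uOP cache capacity between {low:,} and {high:,} uops")
```

The ~5,250 uOPs estimate falls between the two bounds, so the
sets x ways x uops-per-entry figures are at least self-consistent.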
I think with a cache-hot syscall benchmark we can exclude the
largest caches with over 1,000 effective entries with near
certainty as a factor, so what is left are:
- L0 Data Cache (L0D)             48 KB      768 cachelines
- uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
- uOP Queue                    - 192 entries
- Reorder Buffer (ROB)         - 576 entries
- L1 Data TLB (DTLB)           - 128 entries
- Return Stack Buffer (RSB)    - 24 entries
- Load Queue                   - ~114 entries
- Store Queue                  - ~56 entries
I'd exclude the L0D, L1DTLB, the RSB and the load/store queues
as well, because code alignment of a single symbol should have
a minimal effect on them, which leaves:
- uOP Queue                    - 192 entries
- uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
- Reorder Buffer (ROB)         - 576 entries
And I think of these the main suspect would be the uOP cache,
because its (estimated...) ~10-12 deep associativity limit
of uop-sets may be something this benchmark is hitting on
Panther Lake?
Could it be that the extra alignment adds +1 to the maximum number
of uOP cache 'ways' this execution hits in the uOP cache, moving
it from say 12 (still fits) to 13 (misses) so that this particular
uOP cache association depth starts thrashing? But I'm really just
guessing wildly here...
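
To illustrate why going from 12 to 13 could be a cliff rather than a
gentle slope, here is a toy model of a single 12-way set that is
touched cyclically, the way a tight loop would touch it. It assumes
strict LRU replacement, which is an assumption on my part: the real
uOP cache replacement policy may well differ. The function name is
hypothetical:

```python
from collections import OrderedDict


def lru_hit_rate(ways: int, lines_in_set: int, iterations: int = 100) -> float:
    """Steady-state hit rate of one LRU-managed set.

    `lines_in_set` distinct lines, all mapping to the same set, are
    touched in the same order on every iteration. The cold first pass
    is not counted. Requires iterations >= 2.
    """
    cache = OrderedDict()  # least recently used entry first
    hits = accesses = 0
    for it in range(iterations):
        for line in range(lines_in_set):
            hit = line in cache
            if hit:
                cache.move_to_end(line)        # mark as most recently used
            else:
                if len(cache) == ways:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = None
            if it > 0:
                accesses += 1
                hits += hit
    return hits / accesses


for n in (11, 12, 13):
    print(f"{n} lines in a 12-way set: hit rate {lru_hit_rate(12, n):.2f}")
```

In this model the working set either fits completely or, with just
one line too many, every access evicts the line that is needed next.
A non-LRU policy would soften this, but the step would remain.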
( The extra statistical noise of the regressed figures does suggest
some sort of thrashing mechanic behind the scenes though, and the
regular caches seem large enough to not actually thrash for such
a cache-hot benchmark. )
Or am I missing something obvious?
Any perf stat uOP related counter measurements might be illuminating.
Thanks,
Ingo