From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id AAE0E3822A5
	for <linux-kernel@vger.kernel.org>; Wed, 17 Jun 2026 09:54:51 +0000 (UTC)
Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18
ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116;
	t=1781690092; cv=none; b=YHJUpYl/xi9zqO2vTsn1IuGbZSXYE7HdkHynYHWpT5qyWT0MWgAqVnhXd/ITxC2woL5bxm9Z2quYBJWzyq5dZLB9XxCDUh/OcMuGD3jRxmaZGR8J4FdUd87105h0vidNEmcsMy4JntO6yYj7WklmTVrtgh5+0eQyV0fjI5QTmBs=
ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org;
	s=arc-20240116; t=1781690092; c=relaxed/simple;
	bh=uftq5cWjWnoWq0A1vrhkGAJT4lnyR8z68kPvnj0p9E8=;
	h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version:
	 Content-Type:Content-Disposition:In-Reply-To; b=lO4ZXtxAqVkWMKcMyLqDepBXSrlZP8Yn71WOyvd3/XBFIMbdElOlhUB60wlsxS2vFJ0SklcUD2Lh2CbsdheGTzUk0YbJfOjJyAakl9QeoiWTDoFusPaS46I3GG9P88H62O2+1cACvDv/GDkiQnfmIkrYVUjsQ5rS/uo/ZGKY1bQ=
ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XovQviH1; arc=none smtp.client-ip=100.103.45.18
Authentication-Results: smtp.subspace.kernel.org;
	dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XovQviH1"
Received: by smtp.kernel.org (Postfix) with ESMTPSA id CFA5E1F000E9;
	Wed, 17 Jun 2026 09:54:48 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org;
	s=k20260515; t=1781690091;
	bh=FGsHUQ76XJgn1h1drjz+Mta7NgG2k/lgr5LHaZwX3gk=;
	h=Date:From:To:Cc:Subject:References:In-Reply-To;
	b=XovQviH1OMyx32Hm7WWqvcRSC1yrIe5knO14fJ/Z+zVyuOPOSermr/uTnYmzYqt0a
	 zj9moR80DNsJzykcBSpapdbZebP10j+hnNXp6zR5ss/bpviHkURjlGsBGobNXXm8wz
	 18VtpAvCWom93KMAA5BCpx44YibDqnXnu1qGJwvLxiTh6PvvDNbIagVOPhWfiVN48o
	 K3Y5HsOojlQVnh5LKOCh5FjDxoZpP3V/UhrTUNK6KhBQKL0EhLA0q7W5EFiXCg5ufA
	 zSLdIJ1dH1BqDz+Myf9AUM1OwfD79rXM7S0wGs2gPRi5rAp/9f5ZwYmLRCmSL6dr5p
	 PsYPghy3edMww==
Date: Wed, 17 Jun 2026 11:54:46 +0200
From: Ingo Molnar <mingo@kernel.org>
To: "H. Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Peter Zijlstra <peterz@infradead.org>, tglx@kernel.org,
	mingo@redhat.com, bp@alien8.de,
	Nathan Chancellor <nathan@kernel.org>,
	Calvin Owens <calvin@wbinvd.org>,
	Dave Hansen <dave.hansen@linux.intel.com>, x86-ML <x86@kernel.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: 8aeb879baf12 - significant system call latency regression,
 bisected
Message-ID: <ajJu5nROLmoINPdR@gmail.com>
References: <a915c0d9-7a0a-4586-ade9-ab02d195d045@zytor.com>
 <20260613085919.GF42921@noisy.programming.kicks-ass.net>
 <203E61B7-290F-4F87-860F-B352D0072703@zytor.com>
 <d15b2624-8bd9-476a-bd72-cb769c1c2d22@zytor.com>
 <f99e3ff3-e7b0-4aa4-9120-3c5f7151451d@zytor.com>
 <20260616082814.GQ48970@noisy.programming.kicks-ass.net>
 <CAHk-=wi2tNnbZ+w8kr1LHKJNFdQTXyA4wbhzd4cSi-W5cDpuhA@mail.gmail.com>
 <ajEckJgqJlTJgxic@gmail.com>
 <9D5BDF61-3C14-4F21-9D8C-4AB7F28B7EEC@zytor.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9D5BDF61-3C14-4F21-9D8C-4AB7F28B7EEC@zytor.com>


* H. Peter Anvin <hpa@zytor.com> wrote:

> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.

Hm, in that light the 80 cycles overhead from a single
misaligned symbol is rather surprising (to me): it's
way too high to be reasonably caused by any hot cache
alignment effects - and all of the regular instruction
caches (or even data caches) should be more than large
enough to fit such a getppid() benchmark fully into the
cache.

Would be nice to see before/after perf stat --repeat <N>
figures, with sufficiently high <N> to get <0.1% stddev.

And just to guess around a bit, here's the various caches,
buffers and queues on a Panther Lake Performance Core
(Cougar Cove) that may play a role:

 - L0 Data Cache (L0D)		   48 KB	   768 cachelines
 - L1 Data Cache (L1D)		  192 KB	 3,072 cachelines
 - L1 Instruction Cache (L1I)	   64 KB	 1,024 cachelines
 - L2 Cache			3,072 KB	49,152 cachelines
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12-way
 - uOP Queue			    -		   192 entries
 - Reorder Buffer (ROB)		    -		   576 entries
 - L1 Data TLB (DTLB)		    -		   128 entries
 - L2 Shared TLB (STLB)		    -		~4,096 entries
 - Return Stack Buffer (RSB)	    -		    24 entries
 - Load Queue			    -		  ~114 entries
 - Store Queue			    -		   ~56 entries

Where all cacheline sizes are 64 bytes, and each 'way' (line)
of a uOP cache set fits up to 6-8 uops.
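
As a quick sanity check of the numbers above, here's a small Python
sketch that derives the cacheline counts from the cache sizes (assuming
64-byte lines) and brackets the uOP cache capacity from the estimated
geometry. The helper names are made up for illustration, and the
sets/ways/uops-per-line figures are the same rough estimates as in the
table, not documented values:

```python
CACHELINE_BYTES = 64  # cacheline size stated above

def cachelines(size_kb):
    """Number of 64-byte cachelines in a cache of size_kb kilobytes."""
    return size_kb * 1024 // CACHELINE_BYTES

def uop_cache_capacity(sets, ways, uops_per_line):
    """Total uOPs a set-associative uOP cache can hold (hypothetical geometry)."""
    return sets * ways * uops_per_line

for name, kb in [("L0D", 48), ("L1D", 192), ("L1I", 64), ("L2", 3072)]:
    print(f"{name:4s} {kb:5d} KB  {cachelines(kb):6,d} cachelines")

# Bracket the estimate: 64 sets x 10-12 ways x 6-8 uOPs per line.
low = uop_cache_capacity(64, 10, 6)
high = uop_cache_capacity(64, 12, 8)
print(f"uOP cache capacity estimate: {low:,d} .. {high:,d} uOPs")
```

The ~5,250 uOPs figure falls inside that bracket, so the assumed
geometry is at least self-consistent.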

I think with a cache-hot syscall benchmark we can exclude the
largest caches with over 1,000 effective entries with near
certainty as a factor, so what is left are:

 - L0 Data Cache (L0D)		   48 KB	   768 cachelines
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12-way
 - uOP Queue			    -		   192 entries
 - Reorder Buffer (ROB)		    -		   576 entries
 - L1 Data TLB (DTLB)		    -		   128 entries
 - Return Stack Buffer (RSB)	    -		    24 entries
 - Load Queue			    -		  ~114 entries
 - Store Queue			    -		   ~56 entries

I'd exclude the L0D, L1 DTLB, the RSB and the load/store queues
as well, because code alignment of a single symbol should have
a minimal effect on them, which leaves:

 - uOP Queue			    -		   192 entries
 - uOP Cache (Micro-op Cache)	    -		~5,250 uOPs, ~64 sets x 10-12-way
 - Reorder Buffer (ROB)		    -		   576 entries

And I think of these the main suspect would be the uOP cache,
because its (estimated...) ~10-12 deep associativity limit
of uop-sets may be something this benchmark is hitting on
Panther Lake?

Could it be that the extra alignment adds +1 to the maximum number
of uOP cache 'ways' this execution hits in the uOP cache, moving
it from say 12 (still fits) to 13 (misses) so that this particular
uOP cache set starts thrashing? But I'm really just
guessing wildly here...
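
To illustrate why a +1 could be a cliff rather than a gentle slope,
here's a toy Python model of a single cache set that is walked
cyclically, the way a tight syscall loop would revisit the same code.
It assumes true LRU replacement, which is purely an assumption for
illustration: the real uOP cache replacement policy is not something
I know, and the function name is made up:

```python
from collections import OrderedDict

def misses_per_pass(ways, lines, passes=100):
    """Cyclically touch `lines` distinct tags that all map to one set
    with `ways` ways and (assumed) true LRU replacement.
    Returns the average number of misses per pass after one warm-up pass."""
    cache = OrderedDict()          # insertion order tracks LRU -> MRU
    misses = 0
    for p in range(passes + 1):    # pass 0 is the uncounted warm-up
        for tag in range(lines):
            if tag in cache:
                cache.move_to_end(tag)         # hit: mark as most recent
            else:
                if p > 0:
                    misses += 1
                if len(cache) >= ways:
                    cache.popitem(last=False)  # evict least recently used
                cache[tag] = True
    return misses / passes

print(misses_per_pass(ways=12, lines=12))  # working set fits
print(misses_per_pass(ways=12, lines=13))  # one line too many
```

In this model 12 lines in a 12-way set never miss once warm, while 13
lines miss on every single access, because LRU always evicts exactly
the line that is needed next. Real hardware with a non-LRU policy would
likely degrade less abruptly, but the general shape is the point.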

( The extra statistical noise of the regressed figures does suggest
  some sort of thrashing mechanism behind the scenes though, and the
  regular caches seem large enough to not actually thrash for such
  a cache-hot benchmark. )

Or am I missing something obvious?

Any perf stat measurements of uOP-related counters might be illuminating.

Thanks,

	Ingo