From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id AAE0E3822A5 for ; Wed, 17 Jun 2026 09:54:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781690092; cv=none; b=YHJUpYl/xi9zqO2vTsn1IuGbZSXYE7HdkHynYHWpT5qyWT0MWgAqVnhXd/ITxC2woL5bxm9Z2quYBJWzyq5dZLB9XxCDUh/OcMuGD3jRxmaZGR8J4FdUd87105h0vidNEmcsMy4JntO6yYj7WklmTVrtgh5+0eQyV0fjI5QTmBs= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1781690092; c=relaxed/simple; bh=uftq5cWjWnoWq0A1vrhkGAJT4lnyR8z68kPvnj0p9E8=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=lO4ZXtxAqVkWMKcMyLqDepBXSrlZP8Yn71WOyvd3/XBFIMbdElOlhUB60wlsxS2vFJ0SklcUD2Lh2CbsdheGTzUk0YbJfOjJyAakl9QeoiWTDoFusPaS46I3GG9P88H62O2+1cACvDv/GDkiQnfmIkrYVUjsQ5rS/uo/ZGKY1bQ= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XovQviH1; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XovQviH1" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CFA5E1F000E9; Wed, 17 Jun 2026 09:54:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1781690091; bh=FGsHUQ76XJgn1h1drjz+Mta7NgG2k/lgr5LHaZwX3gk=; h=Date:From:To:Cc:Subject:References:In-Reply-To; b=XovQviH1OMyx32Hm7WWqvcRSC1yrIe5knO14fJ/Z+zVyuOPOSermr/uTnYmzYqt0a zj9moR80DNsJzykcBSpapdbZebP10j+hnNXp6zR5ss/bpviHkURjlGsBGobNXXm8wz 18VtpAvCWom93KMAA5BCpx44YibDqnXnu1qGJwvLxiTh6PvvDNbIagVOPhWfiVN48o 
K3Y5HsOojlQVnh5LKOCh5FjDxoZpP3V/UhrTUNK6KhBQKL0EhLA0q7W5EFiXCg5ufA zSLdIJ1dH1BqDz+Myf9AUM1OwfD79rXM7S0wGs2gPRi5rAp/9f5ZwYmLRCmSL6dr5p PsYPghy3edMww==
Date: Wed, 17 Jun 2026 11:54:46 +0200
From: Ingo Molnar
To: "H. Peter Anvin"
Cc: Linus Torvalds, Peter Zijlstra, tglx@kernel.org, mingo@redhat.com,
 bp@alien8.de, Nathan Chancellor, Calvin Owens, Dave Hansen, x86-ML, LKML
Subject: Re: 8aeb879baf12 - significant system call latency regression, bisected
Message-ID:
References: <20260613085919.GF42921@noisy.programming.kicks-ass.net>
 <203E61B7-290F-4F87-860F-B352D0072703@zytor.com>
 <20260616082814.GQ48970@noisy.programming.kicks-ass.net>
 <9D5BDF61-3C14-4F21-9D8C-4AB7F28B7EEC@zytor.com>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <9D5BDF61-3C14-4F21-9D8C-4AB7F28B7EEC@zytor.com>


* H. Peter Anvin wrote:

> It's cache hot, calling getppid() in a tight loop.
> The units are renormalized to from TSC cycles to
> core cycles using fixed counter 1 to determine the
> actual ratio.

Hm, in that light the 80 cycles overhead from a single misaligned
symbol is rather surprising (to me): it's way too high to be
reasonably caused by any hot-cache alignment effects - and all of the
regular instruction caches (or even data caches) should be more than
large enough to fit such a getppid() benchmark fully into the cache.

Would it be possible to get before/after 'perf stat --repeat <N>'
figures, with N sufficiently high to get <0.1% stddev?
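For reference, the renormalization described in the quoted text and the
stddev target can be sketched roughly as below. This is only a minimal
illustration in Python, not the actual measurement harness: the function
names and all sample numbers are hypothetical, the ratio is assumed to be
core cycles (fixed counter 1) divided by TSC cycles over the same
interval, and perf stat's own variance reporting may be computed
differently.

```python
import statistics


def tsc_to_core_cycles(tsc_cycles, core_delta, tsc_delta):
    """Rescale a figure measured in TSC cycles to core cycles.

    core_delta and tsc_delta are the core-cycle and TSC deltas observed
    over the same interval (assumed inputs, e.g. read around the loop).
    """
    ratio = core_delta / tsc_delta
    return tsc_cycles * ratio


def relative_stddev_percent(samples):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.fmean(samples)
    return 100.0 * statistics.stdev(samples) / mean


# Hypothetical numbers, for illustration only.
per_call_tsc = 200.0
per_call_core = tsc_to_core_cycles(
    per_call_tsc, core_delta=3_000_000, tsc_delta=2_000_000
)
runs = [300.1, 299.9, 300.0, 300.2, 299.8]  # made-up per-run results

print(f"{per_call_core:.1f} core cycles/call")
print(f"{relative_stddev_percent(runs):.3f}% relative stddev")
```

The point of the second helper is just that a before/after delta is only
meaningful if the run-to-run spread is well below the delta itself.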

And just to guess around a bit, here are the various caches, buffers
and queues on a Panther Lake Performance Core (Cougar Cove) that may
play a role:

  - L0 Data Cache (L0D)              48 KB       768 cachelines
  - L1 Data Cache (L1D)             192 KB     3,072 cachelines
  - L1 Instruction Cache (L1I)       64 KB     1,024 cachelines
  - L2 Cache                      3,072 KB    49,152 cachelines

  - uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
  - uOP Queue                    - 192 entries
  - Reorder Buffer (ROB)         - 576 entries
  - L1 Data TLB (DTLB)           - 128 entries
  - L2 Shared TLB (STLB)         - ~4,096 entries
  - Return Stack Buffer (RSB)    - 24 entries
  - Load Queue                   - ~114 entries
  - Store Queue                  - ~56 entries

Where all cacheline sizes are 64 bytes, and a uOP cache 'set' fits up
to 6-8 uops.

I think with a cache-hot syscall benchmark we can exclude, with near
certainty, the largest caches with over 1,000 effective entries as a
factor, so what is left is:

  - L0 Data Cache (L0D)              48 KB       768 cachelines

  - uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
  - uOP Queue                    - 192 entries
  - Reorder Buffer (ROB)         - 576 entries
  - L1 Data TLB (DTLB)           - 128 entries
  - Return Stack Buffer (RSB)    - 24 entries
  - Load Queue                   - ~114 entries
  - Store Queue                  - ~56 entries

I'd exclude the L0D, the L1 DTLB, the RSB and the load/store queues as
well, because code alignment of a single symbol should have a minimal
effect on them, which leaves:

  - uOP Queue                    - 192 entries
  - uOP Cache (Micro-op Cache)   - ~5,250 uOPs, ~64 sets x 10-12-way
  - Reorder Buffer (ROB)         - 576 entries

And I think of these the main suspect would be the uOP cache, because
its (estimated...) ~10-12 deep associativity limit of uop-sets may be
something this benchmark is hitting on Panther Lake?

Could it be that the extra alignment adds +1 to the maximum number of
uOP cache 'ways' this execution hits in the uOP cache, moving it from
say 12 (still fits) to 13 (misses), so that this particular uOP cache
associativity depth starts thrashing?

But I'm really just guessing wildly here...
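
To make the guess above a bit more concrete, here is a toy model in
Python. It is purely illustrative: the indexing scheme (one 'way'
consumed per 64-byte code window, set index taken from the window
number), the 64 sets and the 12-way limit are assumptions based on the
estimates above, not documented Cougar Cove behavior, and all addresses
are made up. The cacheline helper simply divides cache size by the
64-byte line size.

```python
from collections import Counter

CACHELINE_BYTES = 64
ASSUMED_SETS = 64   # estimate from the text, not a documented value
ASSUMED_WAYS = 12   # estimate from the text, not a documented value


def cachelines(size_kb, line_bytes=CACHELINE_BYTES):
    """Number of cachelines in a cache of size_kb kilobytes."""
    return size_kb * 1024 // line_bytes


def max_ways_used(ranges, num_sets=ASSUMED_SETS, window_bytes=64):
    """Highest number of code windows mapping to any single set.

    ranges is a list of (start_address, length_in_bytes) hot code spans.
    Toy assumption: every window a span touches occupies one way.
    """
    per_set = Counter()
    for start, length in ranges:
        first = start // window_bytes
        last = (start + length - 1) // window_bytes
        for window in range(first, last + 1):
            per_set[window % num_sets] += 1
    return max(per_set.values())


# Twelve hypothetical hot spans that all land in set 2.
others = [((2 + 64 * k) * 64, 64) for k in range(2, 14)]

# The same 100-byte function, before and after a 32-byte shift.
before = others + [(0x1000, 100)]
after = others + [(0x1020, 100)]

for name, layout in (("before", before), ("after", after)):
    ways = max_ways_used(layout)
    print(name, ways, "fits" if ways <= ASSUMED_WAYS else "overflows")
```

With these made-up inputs the shifted function spills into one more
window, which happens to map to the already full set: the peak goes
from 12 ways to 13. Whether anything like this happens on real hardware
is exactly what the counters would have to show.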

( The extra statistical noise of the regressed figures does suggest
  some sort of thrashing mechanism behind the scenes though, and the
  regular caches seem large enough to not actually thrash for such a
  cache-hot benchmark. )

Or am I missing something obvious? Any perf stat uOP-related counter
measurements might be illuminating.

Thanks,

	Ingo