Date: Tue, 9 Jun 2026 12:46:03 +0200
From: Peter Zijlstra
To: Breno Leitao
Cc: Thomas Gleixner, Ingo Molnar, Darren Hart, Davidlohr Bueso,
    André Almeida, linux-kernel@vger.kernel.org, puranjay@kernel.org,
    rmikey@meta.com, stuclar@meta.com, namhyung@kernel.org,
    kernel-team@meta.com
Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
Message-ID: <20260609104603.GA48970@noisy.programming.kicks-ass.net>
References: <20260605-futex-v1-1-4ad4a0d6f265@debian.org>
In-Reply-To: <20260605-futex-v1-1-4ad4a0d6f265@debian.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> struct plist_head chain, struct futex_private_hash *priv) into a
> single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> access patterns hit that line:
>
>   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
>      on the fast path before taking the lock.
>   2. spin_lock(&hb->lock) contenders writing the lock word.
>   3. The lock holder modifying chain.{next,prev} on every futex_wake,
>      futex_q_unlock, plist_add, __futex_unqueue.
>
> This was first noticed on a Meta cache (ucache) production workload:
> perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> CPUs in a second.
>
> The contention is not specific to that workload, though. Our very own
> "perf bench futex" hash exercises the same buckets and shows the same
> false sharing, so the rest of this changelog quantifies the fix with
> perf bench futex.

So I can't see this. After 'fixing' the benchmark to run with a fixed
number of buckets (see below), a perf c2c record shows the
futex_hash_bucket::priv load to be the 'expensive' one (when doing perf
report on that, rather than perf c2c report, because the latter is
total garbage).

> Move chain to its own cacheline so:
>   - Lockless waiters_pending() readers no longer invalidate the line
>     that lock contenders are spinning to acquire.
>   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
>     next holder reads chain from its own L2/L3 instead of fetching
>     chain entries together with the lock byte.
>
> This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> 15%:
>
>              baseline       +fix      delta
>   average   1,394,938   1,616,781   +15.9 %
>   median    1,430,012   1,617,072   +13.1 %
>   min       1,214,488   1,501,741   +23.5 %
>   max       1,488,167   1,730,734   +16.3 %
>
> The distributions do not overlap: the slowest +fix run (1.50 M) is
> faster than every baseline run except the single fastest (1.49 M).

When I run "perf bench futex hash", I do see massive contention, but
not on the line you mention. Instead we hammer mm->futex.phash.atomic in
futex_ref_{get,put}(). These are the atomic_long_inc_not_zero() /
atomic_long_dec_and_test() calls.

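For anyone following along who does not have those two primitives in
their head, here is a rough userspace sketch of their semantics using C11
atomics. This is not the kernel implementation; ref_get_not_zero() and
ref_put_and_test() are made-up names standing in for
atomic_long_inc_not_zero() and atomic_long_dec_and_test(). The point is
only that both are read-modify-write operations on one shared word, so
every caller needs that cacheline in exclusive state.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for atomic_long_inc_not_zero(): take a
 * reference only if the count has not already dropped to zero.
 */
bool ref_get_not_zero(atomic_long *ref)
{
	long old = atomic_load(ref);

	do {
		if (old == 0)
			return false;	/* last reference is already gone */
		/* On failure the CAS reloads 'old' and we retry. */
	} while (!atomic_compare_exchange_weak(ref, &old, old + 1));

	return true;
}

/*
 * Hypothetical stand-in for atomic_long_dec_and_test(): drop a
 * reference and report whether it was the last one.
 */
bool ref_put_and_test(atomic_long *ref)
{
	/* fetch_sub returns the value before the decrement. */
	return atomic_fetch_sub(ref, 1) == 1;
}
```

With all benchmark threads going through a get and a put on the same
counter, that one word bounces between every CPU involved.
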
The reason this happens is unfortunate: you would want this thing to hit
the PERCPU fast-path, but due to the per-thread auto scaling, the
benchmark startup phase allocates a small (2 thread) hash, then a bigger
and a bigger one, for each next thread that comes in. Since there is a
pending new hash, we drop to ATOMIC mode, such that we can actually
observe the 0 references. However, because the benchmark is in fact
hammering the buckets (per design), it will never actually hit 0
references and swap in the larger hash.

If one were to specify an explicit number of buckets, the benchmark will
function correctly:

                                   v7.1-rc7    +patch
  perf bench futex hash              192479    195523    +1.5%
  perf bench futex hash -b 256      3453734   3987880   +15.5%

And then I do see the improvement from your patch, but I really cannot
make sense of your reasoning for it.

> Cost: one extra cacheline (56 B padding) per bucket. Would it be
> acceptable?

I'm really not sure, it *doubles* the futex memory cost.
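
To make the doubling concrete, here is a userspace sketch with stand-in
types. Everything in it is an assumption for illustration: a 64-byte
line, a 4-byte lock word standing in for spinlock_t, and a split layout
inferred from the "56 B padding" figure in the changelog rather than
taken from the actual patch. The struct and function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed line size, as in the quoted changelog */

/* Stand-in for struct plist_head: a single list head of two pointers. */
struct plist_head_sketch {
	void *next, *prev;
};

/* All fields packed into one aligned line (roughly the current layout). */
struct hb_packed {
	_Alignas(CACHELINE) int waiters;	/* stand-in for atomic_t */
	int lock;				/* stand-in for spinlock_t */
	struct plist_head_sketch chain;
	void *priv;
};

/* chain pushed onto its own line (my reading of the RFC, not the patch). */
struct hb_split {
	_Alignas(CACHELINE) int waiters;
	int lock;
	_Alignas(CACHELINE) struct plist_head_sketch chain;
	void *priv;
};

static_assert(sizeof(struct hb_packed) == CACHELINE,
	      "packed bucket occupies one line");
static_assert(sizeof(struct hb_split) == 2 * CACHELINE,
	      "split bucket occupies two lines");
static_assert(offsetof(struct hb_split, chain) -
	      (offsetof(struct hb_split, lock) + sizeof(int)) == 56,
	      "56 bytes of padding after the lock word");

/* Bytes needed for a hash of nbuckets buckets of the given size. */
size_t table_bytes(size_t nbuckets, size_t bucket_size)
{
	return nbuckets * bucket_size;
}
```

Under these assumptions a 256-bucket hash, as in the -b 256 run above,
goes from 16 KiB to 32 KiB.
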