From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 9 Jun 2026 12:46:03 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Breno Leitao <leitao@debian.org>
Cc: Thomas Gleixner <tglx@kernel.org>, Ingo Molnar <mingo@redhat.com>,
	Darren Hart <dvhart@infradead.org>,
	Davidlohr Bueso <dave@stgolabs.net>,
	=?iso-8859-1?Q?Andr=E9?= Almeida <andrealmeid@igalia.com>,
	linux-kernel@vger.kernel.org, puranjay@kernel.org, rmikey@meta.com,
	stuclar@meta.com, namhyung@kernel.org, kernel-team@meta.com
Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the
 bucket lock
Message-ID: <20260609104603.GA48970@noisy.programming.kicks-ass.net>
References: <20260605-futex-v1-1-4ad4a0d6f265@debian.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id: <linux-kernel.vger.kernel.org>
List-Subscribe: <mailto:linux-kernel+subscribe@vger.kernel.org>
List-Unsubscribe: <mailto:linux-kernel+unsubscribe@vger.kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20260605-futex-v1-1-4ad4a0d6f265@debian.org>

On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> struct plist_head chain, struct futex_private_hash *priv) into a
> single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> access patterns hit that line:
> 
>   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
>      on the fast path before taking the lock.
>   2. spin_lock(&hb->lock) contenders writing the lock word.
>   3. The lock holder modifying chain.{next,prev} on every futex_wake,
>      futex_q_unlock, plist_add, __futex_unqueue.
> 
> This was first noticed on a Meta cache (ucache) production workload:
> perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> CPUs in a second.
> 
> The contention is not specific to that workload, though. Our very own
> "perf bench futex" hash exercises the same buckets and shows the same
> false sharing, so the rest of this changelog quantifies the fix with
> perf bench futex.

So I can't see this. After 'fixing' the benchmark to run with a fixed
number of buckets (see below), a perf c2c record shows the
futex_hash_bucket::priv load to be the 'expensive' one (when doing perf
report on that data, rather than perf c2c report, because the latter is
total garbage).

> Move chain to its own cacheline so:
>   - Lockless waiters_pending() readers no longer invalidate the line
>     that lock contenders are spinning to acquire.
>   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
>     next holder reads chain from its own L2/L3 instead of fetching
>     chain entries together with the lock byte.
> 
> This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> 15%:
> 
>                    baseline    +fix       delta
>   average      1,394,938   1,616,781    +15.9 %
>   median       1,430,012   1,617,072    +13.1 %
>   min          1,214,488   1,501,741    +23.5 %
>   max          1,488,167   1,730,734    +16.3 %
> 
> The distributions do not overlap: the slowest +fix run (1.50 M) is
> faster than every baseline run except the single fastest (1.49 M).

When I run: "perf bench futex hash", I do see massive contention, but
not on the line you mention. Instead we hammer mm->futex.phash.atomic in
futex_ref_{get,put}().

These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().

The reason this happens is unfortunate. You would want this thing to hit
the PERCPU fast-path, but due to the per-thread auto scaling, the
benchmark startup phase first allocates a small (2 thread) hash, then a
bigger one, and then a bigger one again, for each new thread that comes in.

Because there is a pending new hash, we drop to ATOMIC mode, such that we
can actually observe the reference count reaching 0.

However, because the benchmark is in fact hammering the buckets (by
design), the count never actually hits 0 references, so the larger hash
is never swapped in.

If one were to specify an explicit number of buckets, the benchmark
will function correctly:

                                   v7.1-rc7   +patch    delta

  perf bench futex hash              192479   195523    +1.5%
  perf bench futex hash -b 256      3453734  3987880   +15.5%

And then I do see the improvement from your patch, but I really cannot
make sense of your reasoning for it.

> Cost: one extra cacheline (56 B padding) per bucket. Would it be
> acceptable?

I'm really not sure, it *doubles* the futex memory cost.