The Linux Kernel Mailing List
 help / color / mirror / Atom feed
From: Breno Leitao <leitao@debian.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Thomas Gleixner" <tglx@kernel.org>,
	"Ingo Molnar" <mingo@redhat.com>,
	"Darren Hart" <dvhart@infradead.org>,
	"Davidlohr Bueso" <dave@stgolabs.net>,
	"André Almeida" <andrealmeid@igalia.com>,
	linux-kernel@vger.kernel.org, puranjay@kernel.org,
	rmikey@meta.com, stuclar@meta.com, namhyung@kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock
Date: Tue, 9 Jun 2026 08:28:16 -0700	[thread overview]
Message-ID: <aigTu8PqXn8wyhiK@gmail.com> (raw)
In-Reply-To: <20260609104603.GA48970@noisy.programming.kicks-ass.net>

Hello Peter,

On Tue, Jun 09, 2026 at 12:46:03PM +0200, Peter Zijlstra wrote:
> On Fri, Jun 05, 2026 at 09:53:12AM -0700, Breno Leitao wrote:
> > struct futex_hash_bucket packs (atomic_t waiters, spinlock_t lock,
> > struct plist_head chain, struct futex_private_hash *priv) into a
> > single ____cacheline_aligned_in_smp 64-byte block. Three distinct
> > access patterns hit that line:
> > 
> >   1. Lockless atomic_read(&hb->waiters) via futex_hb_waiters_pending()
> >      on the fast path before taking the lock.
> >   2. spin_lock(&hb->lock) contenders writing the lock word.
> >   3. The lock holder modifying chain.{next,prev} on every futex_wake,
> >      futex_q_unlock, plist_add, __futex_unqueue.
> > 
> > This was first noticed on a Meta cache (ucache) production workload:
> > perf c2c on a busy 176-core AMD EPYC 9D64 ranked this exact cacheline as
> > the #1 HITM source: 129 Local + 31 Remote HITM, hit by 156 distinct
> > CPUs in a second.
> > 
> > The contention is not specific to that workload, though. Our very own
> > "perf bench futex" hash exercises the same buckets and shows the same
> > false sharing, so the rest of this changelog quantifies the fix with
> > perf bench futex.
> 
> So I can't see this. After 'fixing' the benchmark to run with a fixed
> number of buckets (see below), a perf c2c record shows the
> futex_hash_bucket::priv load to be the 'expensive' (when doing perf
> report on that, rather than perf c2c report, because this latter is
> total garbage)

I am able to confirm with both. Keep in mind that I am using a multi CCD
CPU (AMD EPYC 9D64).

I ran perf c2c record on an EPYC 9D64 (88C/176T, 1 socket, multi-CCD)
under `perf bench futex hash -b 256`, on baseline and patched. Top hot
kernel HITM lines:

  Baseline (b99ae45861ec):
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    offset 0x18  futex_hash                    core.c:312   (priv)

  + With this patch 
    offset 0x00  futex_q_lock                  core.c:865   (waiters)
    offset 0x04  queued_spin_lock_slowpath     qspinlock.c  (lock)
    offset 0x04  _raw_spin_lock                atomic.h     (lock)
    [no priv entry on this cacheline]

`futex_hash` is literally the lockless `fph = hb->priv`
read. On baseline it sits on the lock cacheline at offset 0x18 and is
a top HITM source - exactly what you saw. On this patch that entry is
gone from the lock cacheline.

What remains at offsets 0x00 and 0x04 is intrinsic lock contention
(waiters_pending fast path + queued spinlock hand-off); the patch can't reduce
that without changing the lock itself.

Throughput on the same run:

  baseline   : 1,267,863 ops/sec
  +This patch: 1,460,971 ops/sec

> > Move chain to its own cacheline so:
> >   - Lockless waiters_pending() readers no longer invalidate the line
> >     that lock contenders are spinning to acquire.
> >   - Cross-CCD lock handoffs ship only the (waiters, lock) line; the
> >     next holder reads chain from its own L2/L3 instead of fetching
> >     chain entries together with the lock byte.
> > 
> > This improves "perf bench futex hash" on a 176-core AMD EPYC 9D64 by
> > 15%:
> > 
> >                    baseline    +fix       delta
> >   average      1,394,938   1,616,781    +15.9 %
> >   median       1,430,012   1,617,072    +13.1 %
> >   min          1,214,488   1,501,741    +23.5 %
> >   max          1,488,167   1,730,734    +16.3 %
> > 
> > The distributions do not overlap: the slowest +fix run (1.50 M) is
> > faster than every baseline run except the single fastest (1.49 M).
> 
> When I run: "perf bench futex hash", I do see massive contention, but
> not on the line you mention. Instead we hammer mm->futex.phash.atomic in
> futex_ref_{get,put}().
> 
> These are the atomic_long_inc_not_zero() / atomic_long_dec_and_test().
> 
> The reason this happens is unfortunate, you would want this thing to hit
> the PERCPU fast-path, but due to the per-thread auto scaling, the
> benchmark startup phase allocates a (2 thread) small hash, then a bigger
> and a bigger, for each next thread that comes in.
> 
> Per there being a pending new hash, we drop to ATOMIC mode, such that we
> can actually observe the 0 references.
> 
> However, because the benchmark is in fact hammering the buckets (per
> design), it will never actually hit 0 references and swap in the larger
> hash.

Ack.  the auto-scaling pathology you described reproduces here too

> If one were to specific an explicit number of buckets, the benchmark
> will function correctly:
> 
>   				       v7.1-rc7	+patch
> 
>   perf bench futex hash			 192479  195523  +1.5%
>   perf bench futex hash -b 256		3453734 3987880 +15.5%
> 
> And then I do see the improvement from your patch, but I really cannot
> make sense of your reasoning for it.

So, let me rephrase it. The bucket cacheline takes hits from four access
patterns - the three I listed (waiters_pending readers, lock spinners,
lock-holder chain writes) plus the lockless `fph = hb->priv` load on the
futex_hash() fast path, which is what c2c surfaced. That priv load is the
dominant HITM source on baseline, not the chain writes I emphasized. 

> > Cost: one extra cacheline (56 B padding) per bucket. Would it be
> > acceptable?
> 
> I'm really not sure, it *doubles* the futex memory cost.

I think it's worth the trade. The global hash scales linearly with
num_possible_cpus(), so the extra bytes track the same curve as the machines
that actually need the fix

in simpler words, a box big enough to feel this contention has plenty of RAM
headroom to absorb it.

Thanks for the review,
--breno

  reply	other threads:[~2026-06-09 15:28 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-05 16:53 [PATCH RFC] futex: avoid false sharing between hb->chain and the bucket lock Breno Leitao
2026-06-09 10:46 ` Peter Zijlstra
2026-06-09 15:28   ` Breno Leitao [this message]
2026-06-09 20:11     ` Peter Zijlstra
2026-06-09 20:18       ` Peter Zijlstra
2026-06-09 20:16     ` Thomas Gleixner
2026-06-09 20:23       ` Peter Zijlstra
2026-06-09 20:25       ` Peter Zijlstra
2026-06-09 20:32         ` Thomas Gleixner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aigTu8PqXn8wyhiK@gmail.com \
    --to=leitao@debian.org \
    --cc=andrealmeid@igalia.com \
    --cc=dave@stgolabs.net \
    --cc=dvhart@infradead.org \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=puranjay@kernel.org \
    --cc=rmikey@meta.com \
    --cc=stuclar@meta.com \
    --cc=tglx@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox