From: Aniket Gattani <aniketgattani@google.com>
To: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
"Paul E . McKenney" <paulmck@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Ingo Molnar <mingo@redhat.com>, Ben Segall <bsegall@google.com>,
Josh Don <joshdon@google.com>,
linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
Aniket Gattani <aniketgattani@google.com>
Subject: [PATCH v2 1/2] sched/membarrier: Use per-CPU mutexes for targeted commands
Date: Wed, 15 Apr 2026 23:21:05 +0000
Message-ID: <20260415232106.2803644-2-aniketgattani@google.com>
In-Reply-To: <20260415232106.2803644-1-aniketgattani@google.com>
Currently, the membarrier system call uses a single global mutex
(`membarrier_ipi_mutex`) to serialize expedited commands. This causes
significant contention on large systems when multiple threads invoke
membarrier concurrently, even when they target different CPUs.

The contention becomes critical in combination with CFS bandwidth
throttling/unthrottling, during which interrupts can be disabled for
relatively long periods on the target CPUs. While membarrier waits for a
response from such a CPU, it holds the global mutex, blocking every
other membarrier call on the system. This cascade can lead to hard
lockups when thousands of threads stall waiting for the mutex.

Optimize `MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ` when a specific CPU is
targeted by introducing per-CPU mutexes. Broadcast commands and commands
without a specific target CPU continue to use the global mutex.

This prevents the cascade lockup scenario. As measured by the stress
test introduced in the subsequent patch, on an AMD Turin machine with
384 CPUs (2 NUMA nodes, SMT=2), this optimization improves throughput by
roughly 200x.
Signed-off-by: Aniket Gattani <aniketgattani@google.com>
---
Changes in v2:
- Use different mutex macros for global vs targeted-CPU membarrier (Mathieu).
- Use (unsigned int)cpu_id >= nr_cpu_ids (Peter).
---
 kernel/sched/membarrier.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 623445603725..7f995bd48280 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -165,7 +165,20 @@
 	| MEMBARRIER_CMD_GET_REGISTRATIONS)
 
 static DEFINE_MUTEX(membarrier_ipi_mutex);
+static DEFINE_PER_CPU(struct mutex, membarrier_cpu_mutexes);
+
 #define SERIALIZE_IPI() guard(mutex)(&membarrier_ipi_mutex)
+#define SERIALIZE_IPI_CPU(cpu_id) guard(mutex)((cpu_id) >= 0 ? &per_cpu(membarrier_cpu_mutexes, cpu_id) : &membarrier_ipi_mutex)
+
+static int __init membarrier_init(void)
+{
+	int i;
+
+	for_each_possible_cpu(i)
+		mutex_init(&per_cpu(membarrier_cpu_mutexes, i));
+	return 0;
+}
+core_initcall(membarrier_init);
 
 static void ipi_mb(void *info)
 {
@@ -358,14 +371,19 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 	if (cpu_id < 0 && !zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
 		return -ENOMEM;
 
-	SERIALIZE_IPI();
+	if (cpu_id >= 0 && ((unsigned int)cpu_id >= nr_cpu_ids || !cpu_possible(cpu_id)))
+		return 0;
+
+	SERIALIZE_IPI_CPU(cpu_id);
+
 	cpus_read_lock();
 
 	if (cpu_id >= 0) {
 		struct task_struct *p;
 
-		if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id))
+		if (!cpu_online(cpu_id))
 			goto out;
+
 		rcu_read_lock();
 		p = rcu_dereference(cpu_rq(cpu_id)->curr);
 		if (!p || p->mm != mm) {
@@ -373,6 +391,11 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			goto out;
 		}
 		rcu_read_unlock();
+
+		/*
+		 * smp_call_function_single() will call ipi_func() if cpu_id
+		 * is the calling CPU.
+		 */
+		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
 	} else {
 		int cpu;
 
@@ -385,15 +408,6 @@ static int membarrier_private_expedited(int flags, int cpu_id)
 			__cpumask_set_cpu(cpu, tmpmask);
 		}
 		rcu_read_unlock();
-	}
-
-	if (cpu_id >= 0) {
-		/*
-		 * smp_call_function_single() will call ipi_func() if cpu_id
-		 * is the calling CPU.
-		 */
-		smp_call_function_single(cpu_id, ipi_func, NULL, 1);
-	} else {
 		/*
 		 * For regular membarrier, we can save a few cycles by
 		 * skipping the current cpu -- we're about to do smp_mb()
--
2.54.0.rc1.513.gad8abe7a5a-goog
Thread overview: 6+ messages
2026-04-15 23:21 [PATCH v2 0/2] sched/membarrier: Use per-CPU mutexes for targeted commands Aniket Gattani
2026-04-15 23:21 ` Aniket Gattani [this message]
2026-04-28 12:48 ` [PATCH v2 1/2] " Peter Zijlstra
2026-04-30 0:12 ` Aniket Gattani
2026-04-15 23:21 ` [PATCH v2 2/2] selftests/membarrier: Add rseq stress test for CFS throttle interactions Aniket Gattani
2026-04-27 20:07 ` [PATCH v2] sched/membarrier: Use per-CPU mutexes for targeted commands Aniket Gattani