* [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance
@ 2026-05-28 15:13 Chuyi Zhou
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Changes in v6:
- Make the task-local cpumask selection explicit and drop preemptible()
check in smp_call_function_many_cond(). The early put_cpu() decision now
depends only on whether a task-local cpumask is available.
- Keep smp_task_ipi_mask() private to kernel/smp.c in [PATCH v6 4/12].
- Add #include <linux/slab.h> to kernel/smp.c in [PATCH v6 4/12] for
kmalloc()/kfree(), fixing the kernel test robot build failure reported
at: https://lore.kernel.org/oe-kbuild-all/202605241101.w6T2LApw-lkp@intel.com/
- Update the csd_lock_wait() comment in [PATCH v6 6/12].
- Add Sebastian's Reviewed-by tags to the reviewed patches.
Changes in v5:
- Replace "smp: Remove get_cpu from smp_call_function_any" with a new
approach that extracts a common __smp_call_function_single() to safely
keep the remote CPU selection and IPI dispatch process within a single
preemption-disabled region in [PATCH v5 3/12].
- Fix a typo in comments (s/cpumask_stack/task_mask/) and remove the
obsolete "Preemption must be disabled" constraint from the kernel-doc
in [PATCH v5 6/12].
- Adjust the WARN_ON_ONCE() validation condition to avoid a false positive
warning caused by CPU hotplug races when use_cpus_read_lock is false in
[PATCH v5 9/12].
- Move the preemptible() check in smp_call_function_many_cond() from
[PATCH v5 4/12] to [PATCH v5 6/12].
- Rebase to commit 4ac4d6549a65 ("sched: Use trace_call__<tp>() to save a
static branch").
Changes in v4:
- Use task-local IPI cpumask rather than on-stack cpumask in
[PATCH v4 4/12] (suggested by Sebastian).
- Skip freeing csd memory in smpcfd_dead_cpu() to guarantee csd memory
access safety, instead of using the RCU mechanism, in [PATCH v4 5/12]
(suggested by Sebastian).
- Align flush_tlb_info with SMP_CACHE_BYTES to avoid performance
degradation caused by unnecessary cache line movements in
[PATCH v4 10/12] (suggested by Sebastian and Nadav).
- Collect Acked-bys and Reviewed-bys.
Changes in v3:
- Add benchmarks to measure the performance impact of changing
flush_tlb_info to a stack variable in [PATCH v3 10/12] (suggested by
Peter).
- Adjust the rcu_read_unlock() location in [PATCH v3 5/12] (suggested
by Muchun).
- Use raw_smp_processor_id() to prevent warning[1] from
check_preemption_disabled() in [PATCH v3 12/12].
- Collect Acked-bys and Reviewed-by.
[1]: https://lore.kernel.org/lkml/20260302075216.2170675-1-zhouchuyi@bytedance.com/T/#mc39999cbeb3f50be176f0903d0fa4075688b073d
Changes in v2:
- Simplify the code comments in [PATCH v2 2/12] (pointed out by Peter and
Muchun).
- Adjust the preemption disabling logic in smp_call_function_any() in
[PATCH v2 3/12] (suggested by Peter).
- Use on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in
[PATCH v2 4/12] (pointed out by Peter).
- Add [PATCH v2 5/12] to replace migrate_disable() with the RCU mechanism.
- Adjust the preemption disabling logic to allow flush_tlb_multi() to be
preemptible and migratable in [PATCH v2 11/12]
- Collect Acked-bys and Reviewed-bys
Introduction
============
The vast majority of smp_call_function*() callers block until remote CPUs
complete the IPI function execution. As smp_call_function*() runs with
preemption disabled throughout, scheduling latency increases dramatically
with the number of remote CPUs and other factors (such as interrupts being
disabled).
On x86-64 architectures, TLB flushes are performed via IPIs; thus, during
process exit or when process-mapped pages are reclaimed, numerous IPI
operations must be awaited, leading to increased scheduling latency for
other threads on the current CPU. In our production environment, we
observed IPI wait-induced scheduling latency reaching up to 16ms on a
16-core machine. Our goal is to allow preemption during IPI completion
waiting to improve real-time performance.
Background
==========
In our production environments, latency-sensitive workloads (DPDK) are
configured with the highest priority so that they can preempt lower-priority
tasks at any time. We found that DPDK's wake-up latency is primarily caused
by the current CPU having preemption disabled. We therefore recorded the
longest preemption-disabled period in each 30-second interval and then
calculated the P50/P99 of these per-interval maxima:
CPU      p50(ns)     p99(ns)
cpu0      254956     5465050
cpu1      115801      120782
cpu2       43324       72957
cpu3      256637    16723307
cpu4       58979       87237
cpu5       47464       79815
cpu6       48881       81371
cpu7       52263       82294
cpu8      263555     4657713
cpu9       44935       73962
cpu10      37659       65026
cpu11     257008     2706878
cpu12      49669       90006
cpu13      45186       74666
cpu14      60705       83866
cpu15      51311       86885
We also collected the distribution of preemption-disabled periods exceeding
1ms across different CPUs over several hours (CPUs whose counts were all
zero are omitted):
CPU      1~10ms   10~50ms   50~100ms
cpu0         29         5          0
cpu3         38        13          0
cpu8         34         6          0
cpu11        24        10          0
Preemption-disabled periods of several milliseconds, or even more than
10ms, mostly originate from TLB flushes:
@stack[
trace_preempt_on+143
trace_preempt_on+143
preempt_count_sub+67
arch_tlbbatch_flush/flush_tlb_mm_range
task_exit/page_reclaim/...
]
Further analysis confirms that the majority of the time is consumed in
csd_lock_wait().
Currently, smp_call*() always disables preemption, mainly to protect its
internal per-CPU data structures and to synchronize with CPU offline
operations. This patchset makes csd_lock_wait() preemptible, thereby
shrinking the preemption-disabled critical section and improving kernel
real-time performance.
Effect
======
After applying this patchset, we no longer observe preemption disabled for
more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range path. The
overall P99 of the per-interval (30-second) maximum preemption-disabled
time is reduced to around 1.5ms (the remaining latency is primarily due to
lock contention).
           before patch   after patch   reduced by
           ------------   -----------   ----------
p99(ns)        16723307       1556034      ~90.70%
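As a quick check of the figure in the table above, the reduction can be
recomputed from the two p99 values. The helper name below is hypothetical
and the snippet is only an arithmetic sketch, not part of the series:

```c
#include <stdint.h>

/*
 * Percentage by which after_ns is lower than before_ns.
 * Hypothetical helper for illustration; assumes before_ns > 0 and
 * after_ns <= before_ns.
 */
static double reduction_percent(uint64_t before_ns, uint64_t after_ns)
{
	return 100.0 * (double)(before_ns - after_ns) / (double)before_ns;
}
```

Called with the values from the table, reduction_percent(16723307, 1556034)
yields a value that rounds to the ~90.70% quoted there.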
Chuyi Zhou (12):
smp: Disable preemption explicitly in __csd_lock_wait
smp: Enable preemption early in smp_call_function_single
smp: Refactor remote CPU selection in smp_call_function_any()
smp: Use task-local IPI cpumask in smp_call_function_many_cond()
smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once
smp: Enable preemption early in smp_call_function_many_cond
smp: Remove preempt_disable from smp_call_function
smp: Remove preempt_disable from on_each_cpu_cond_mask
scftorture: Remove preempt_disable in scftorture_invoke_one
x86/mm: Move flush_tlb_info back to the stack
x86/mm: Enable preemption during native_flush_tlb_multi
x86/mm: Enable preemption during flush_tlb_kernel_range
arch/x86/include/asm/tlbflush.h | 8 +-
arch/x86/kernel/kvm.c | 4 +-
arch/x86/mm/tlb.c | 86 ++++++-----------
include/linux/sched.h | 6 ++
include/linux/smp.h | 15 +++
kernel/fork.c | 9 +-
kernel/scftorture.c | 13 +--
kernel/smp.c | 161 ++++++++++++++++++++++++--------
8 files changed, 194 insertions(+), 108 deletions(-)
--
2.20.1
* [PATCH v6 01/12] smp: Disable preemption explicitly in __csd_lock_wait
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Later patches will enable preemption before csd_lock_wait(), which could
break csdlock_debug: if the waiting task is preempted, the time slices of
other tasks on the CPU may be accounted between the
ktime_get_mono_fast_ns() calls. Disable preemption explicitly in
__csd_lock_wait() to avoid this. This is a preparation for the following
patches.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/smp.c b/kernel/smp.c
index a0bb56bd8dda..b58975480e11 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -323,6 +323,8 @@ static void __csd_lock_wait(call_single_data_t *csd)
int bug_id = 0;
u64 ts0, ts1;
+ guard(preempt)();
+
ts1 = ts0 = ktime_get_mono_fast_ns();
for (;;) {
if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
--
2.20.1
* [PATCH v6 02/12] smp: Enable preemption early in smp_call_function_single
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Currently, smp_call_function_single() disables preemption mainly for the
following reasons:
- To protect the per-cpu csd_data from concurrent modification by other
tasks on the current CPU in the !wait case. For the wait case,
synchronization is not a concern as on-stack csd is used.
- To prevent the remote online CPU from being offlined. Specifically, we
want to ensure that no new IPIs are queued after smpcfd_dying_cpu() has
finished.
Disabling preemption for the entire execution is unnecessary; in particular,
the csd_lock_wait() part does not require preemption protection. This patch
enables preemption before csd_lock_wait() to reduce the preemption-disabled
critical section.
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index b58975480e11..292eefadddbc 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -700,11 +700,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
err = generic_exec_single(cpu, csd);
+ /*
+ * @csd is stack-allocated when @wait is true. No concurrent access
+ * except from the IPI completion path, so we can re-enable preemption
+ * early to reduce latency.
+ */
+ put_cpu();
+
if (wait)
csd_lock_wait(csd);
- put_cpu();
-
return err;
}
EXPORT_SYMBOL(smp_call_function_single);
--
2.20.1
* [PATCH v6 03/12] smp: Refactor remote CPU selection in smp_call_function_any()
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Currently, smp_call_function_any() disables preemption across the entire
process of picking a target CPU, enqueueing the IPI, and synchronously
waiting for the remote CPU. Since smp_call_function_single() has already
been optimized to re-enable preemption before the synchronous
csd_lock_wait(), callers of smp_call_function_any() should also benefit
from this optimization to reduce the preemption-disabled critical section.
A naive approach would be to simply remove get_cpu() and put_cpu() from
smp_call_function_any(), leaving the preemption disablement entirely to
smp_call_function_single(). However, doing so opens a dangerous
preemption window between picking the remote CPU (e.g., via
sched_numa_find_nth_cpu()) and dispatching the IPI inside
smp_call_function_single(). If the selected remote CPU is fully offlined
during this window, smp_call_function_single() will fail its
cpu_online() check and return -ENXIO directly to the caller, violating
the guarantee to execute on *any* online CPU in the mask.
To safely enable this optimization, this patch refactors
smp_call_function_any() and smp_call_function_single(): the remote CPU
selection moves into a common __smp_call_function_single(), which keeps the
entire selection and IPI dispatch process within a single
preemption-disabled region.
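The selection rule described above can be modelled in userspace as follows.
This is only a sketch under assumptions: the mask is a plain 64-bit word,
and the lowest set bit stands in for sched_numa_find_nth_cpu(), which in the
kernel prefers a NUMA-near CPU instead. pick_target_cpu() is a hypothetical
name, not a kernel function.

```c
#include <stdint.h>

/*
 * Model of the target selection: prefer the current CPU when it is in
 * the mask (cheapest), otherwise fall back to another CPU in the mask.
 * Assumption: at most 64 CPUs, and "lowest set bit" replaces the
 * NUMA-aware lookup used by the kernel.
 * Returns -1 when the mask is empty.
 */
static int pick_target_cpu(uint64_t mask, int this_cpu)
{
	if (this_cpu >= 0 && this_cpu < 64 &&
	    (mask & (UINT64_C(1) << this_cpu)))
		return this_cpu;

	for (int cpu = 0; cpu < 64; cpu++) {
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	}

	return -1;
}
```

In the patch the equivalent decision is made after get_cpu(), so the chosen
CPU cannot be fully offlined between selection and IPI dispatch.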
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 48 ++++++++++++++++++++++++++----------------------
1 file changed, 26 insertions(+), 22 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 292eefadddbc..9e9dab3b0d51 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -641,17 +641,8 @@ void flush_smp_call_function_queue(void)
local_irq_restore(flags);
}
-/**
- * smp_call_function_single - Run a function on a specific CPU
- * @cpu: Specific target CPU for this function.
- * @func: The function to run. This must be fast and non-blocking.
- * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait until function has completed on other CPUs.
- *
- * Returns: %0 on success, else a negative status code.
- */
-int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
- int wait)
+static int __smp_call_function_single(int cpu, smp_call_func_t func,
+ void *info, const struct cpumask *mask, int wait)
{
call_single_data_t *csd;
call_single_data_t csd_stack = {
@@ -668,6 +659,14 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
*/
this_cpu = get_cpu();
+ if (mask) {
+ /* Try for same CPU (cheapest) */
+ if (!cpumask_test_cpu(this_cpu, mask))
+ cpu = sched_numa_find_nth_cpu(mask, 0, cpu_to_node(this_cpu));
+ else
+ cpu = this_cpu;
+ }
+
/*
* Can deadlock when called with interrupts disabled.
* We allow cpu's that are not yet online though, as no one else can
@@ -712,6 +711,21 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
return err;
}
+
+/**
+ * smp_call_function_single - Run a function on a specific CPU
+ * @cpu: Specific target CPU for this function.
+ * @func: The function to run. This must be fast and non-blocking.
+ * @info: An arbitrary pointer to pass to the function.
+ * @wait: If true, wait until function has completed on other CPUs.
+ *
+ * Returns: %0 on success, else a negative status code.
+ */
+int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
+ int wait)
+{
+ return __smp_call_function_single(cpu, func, info, NULL, wait);
+}
EXPORT_SYMBOL(smp_call_function_single);
/**
@@ -776,17 +790,7 @@ EXPORT_SYMBOL_GPL(smp_call_function_single_async);
int smp_call_function_any(const struct cpumask *mask,
smp_call_func_t func, void *info, int wait)
{
- unsigned int cpu;
- int ret;
-
- /* Try for same CPU (cheapest) */
- cpu = get_cpu();
- if (!cpumask_test_cpu(cpu, mask))
- cpu = sched_numa_find_nth_cpu(mask, 0, cpu_to_node(cpu));
-
- ret = smp_call_function_single(cpu, func, info, wait);
- put_cpu();
- return ret;
+ return __smp_call_function_single(-1, func, info, mask, wait);
}
EXPORT_SYMBOL_GPL(smp_call_function_any);
--
2.20.1
* [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond()
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
This patch prepares a task-local IPI cpumask during thread creation and
uses it in place of the percpu cfd cpumask in
smp_call_function_many_cond(). Preemption will be enabled during
csd_lock_wait() later, and the task-local mask prevents concurrent access
to cfd->cpumask from other tasks on the current CPU. When cpumask_size() is
smaller than or equal to the pointer size, the cpumask is stashed in the
pointer itself to avoid an extra memory allocation.
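The pointer-stashing idea can be illustrated with a small userspace model.
This is a sketch under assumptions, not the patch itself: struct task_model
and the task_mask_*() helpers are hypothetical names, calloc()/free() stand
in for kmalloc()/kfree(), and the size check is done per call here whereas
the patch decides once at boot via a static key.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the union added to task_struct. */
struct task_model {
	union {
		unsigned long *ipi_mask_ptr;	/* heap-allocated mask */
		unsigned long ipi_mask_val;	/* mask stored inline */
	};
};

/* A mask no larger than the pointer-sized slot needs no allocation. */
static bool mask_fits_inline(size_t mask_bytes)
{
	return mask_bytes <= sizeof(unsigned long);
}

static int task_mask_alloc(struct task_model *t, size_t mask_bytes)
{
	if (mask_fits_inline(mask_bytes)) {
		t->ipi_mask_val = 0;
		return 0;
	}

	t->ipi_mask_ptr = calloc(1, mask_bytes);
	return t->ipi_mask_ptr ? 0 : -1;
}

/* Return the storage to use as the mask for this task. */
static unsigned long *task_mask(struct task_model *t, size_t mask_bytes)
{
	if (mask_fits_inline(mask_bytes))
		return &t->ipi_mask_val;

	return t->ipi_mask_ptr;
}

static void task_mask_free(struct task_model *t, size_t mask_bytes)
{
	if (!mask_fits_inline(mask_bytes))
		free(t->ipi_mask_ptr);
}
```

The inline case returns the address of the union member itself, so small
systems pay neither the allocation nor an extra pointer dereference.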
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
include/linux/sched.h | 6 ++++
include/linux/smp.h | 15 ++++++++++
kernel/fork.c | 9 +++++-
kernel/smp.c | 66 +++++++++++++++++++++++++++++++++++++++----
4 files changed, 89 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 368c7b4d7cb5..bb2c53279412 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1348,6 +1348,12 @@ struct task_struct {
struct list_head perf_event_list;
struct perf_ctx_data __rcu *perf_ctx_data;
#endif
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPTION)
+ union {
+ cpumask_t *ipi_mask_ptr;
+ unsigned long ipi_mask_val;
+ };
+#endif
#ifdef CONFIG_DEBUG_PREEMPT
unsigned long preempt_disable_ip;
#endif
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6925d15ccaa7..e05af439abe4 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -167,6 +167,11 @@ void smp_call_function_many(const struct cpumask *mask,
int smp_call_function_any(const struct cpumask *mask,
smp_call_func_t func, void *info, int wait);
+#ifdef CONFIG_PREEMPTION
+int smp_task_ipi_mask_alloc(struct task_struct *task);
+void smp_task_ipi_mask_free(struct task_struct *task);
+#endif
+
void kick_all_cpus_sync(void);
void wake_up_all_idle_cpus(void);
bool cpus_peek_for_pending_ipi(const struct cpumask *mask);
@@ -310,4 +315,14 @@ bool csd_lock_is_stuck(void);
static inline bool csd_lock_is_stuck(void) { return false; }
#endif
+#if !defined(CONFIG_SMP) || !defined(CONFIG_PREEMPTION)
+static inline int smp_task_ipi_mask_alloc(struct task_struct *task)
+{
+ return 0;
+}
+static inline void smp_task_ipi_mask_free(struct task_struct *task)
+{
+}
+#endif
+
#endif /* __LINUX_SMP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 5f3fdfdb14c7..bf485c51c447 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -535,6 +535,7 @@ void free_task(struct task_struct *tsk)
#endif
release_user_cpus_ptr(tsk);
scs_release(tsk);
+ smp_task_ipi_mask_free(tsk);
#ifndef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -932,10 +933,14 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#endif
account_kernel_stack(tsk, 1);
- err = scs_prepare(tsk, node);
+ err = smp_task_ipi_mask_alloc(tsk);
if (err)
goto free_stack;
+ err = scs_prepare(tsk, node);
+ if (err)
+ goto free_ipi_mask;
+
#ifdef CONFIG_SECCOMP
/*
* We must handle setting up seccomp filters once we're under
@@ -1006,6 +1011,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#endif
return tsk;
+free_ipi_mask:
+ smp_task_ipi_mask_free(tsk);
free_stack:
exit_task_stack_account(tsk);
free_thread_stack(tsk);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9e9dab3b0d51..8f8a9ee2ad11 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -16,6 +16,7 @@
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/gfp.h>
+#include <linux/slab.h>
#include <linux/smp.h>
#include <linux/cpu.h>
#include <linux/sched.h>
@@ -794,6 +795,49 @@ int smp_call_function_any(const struct cpumask *mask,
}
EXPORT_SYMBOL_GPL(smp_call_function_any);
+static DEFINE_STATIC_KEY_FALSE(ipi_mask_inlined);
+
+#ifdef CONFIG_PREEMPTION
+
+int smp_task_ipi_mask_alloc(struct task_struct *task)
+{
+ if (static_branch_unlikely(&ipi_mask_inlined))
+ return 0;
+
+ task->ipi_mask_ptr = kmalloc(cpumask_size(), GFP_KERNEL);
+ if (!task->ipi_mask_ptr)
+ return -ENOMEM;
+
+ return 0;
+}
+
+void smp_task_ipi_mask_free(struct task_struct *task)
+{
+ if (static_branch_unlikely(&ipi_mask_inlined))
+ return;
+
+ kfree(task->ipi_mask_ptr);
+}
+
+static cpumask_t *smp_task_ipi_mask(struct task_struct *cur)
+{
+ /*
+ * If cpumask_size() is smaller than or equal to the pointer
+ * size, it stashes the cpumask in the pointer itself to
+ * avoid extra memory allocations.
+ */
+ if (static_branch_unlikely(&ipi_mask_inlined))
+ return (cpumask_t *)&cur->ipi_mask_val;
+
+ return cur->ipi_mask_ptr;
+}
+#else
+static cpumask_t *smp_task_ipi_mask(struct task_struct *cur)
+{
+ return NULL;
+}
+#endif
+
/*
* Flags to be used as scf_flags argument of smp_call_function_many_cond().
*
@@ -811,11 +855,19 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
int cpu, last_cpu, this_cpu = smp_processor_id();
struct call_function_data *cfd;
bool wait = scf_flags & SCF_WAIT;
+ struct cpumask *cpumask, *task_mask;
int nr_cpus = 0;
bool run_remote = false;
lockdep_assert_preemption_disabled();
+ task_mask = smp_task_ipi_mask(current);
+ cfd = this_cpu_ptr(&cfd_data);
+ if (task_mask)
+ cpumask = task_mask;
+ else
+ cpumask = cfd->cpumask;
+
/*
* Can deadlock when called with interrupts disabled.
* We allow cpu's that are not yet online though, as no one else can
@@ -836,16 +888,15 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
/* Check if we need remote execution, i.e., any CPU excluding this one. */
if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
- cfd = this_cpu_ptr(&cfd_data);
- cpumask_and(cfd->cpumask, mask, cpu_online_mask);
- __cpumask_clear_cpu(this_cpu, cfd->cpumask);
+ cpumask_and(cpumask, mask, cpu_online_mask);
+ __cpumask_clear_cpu(this_cpu, cpumask);
cpumask_clear(cfd->cpumask_ipi);
- for_each_cpu(cpu, cfd->cpumask) {
+ for_each_cpu(cpu, cpumask) {
call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
if (cond_func && !cond_func(cpu, info)) {
- __cpumask_clear_cpu(cpu, cfd->cpumask);
+ __cpumask_clear_cpu(cpu, cpumask);
continue;
}
@@ -896,7 +947,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
}
if (run_remote && wait) {
- for_each_cpu(cpu, cfd->cpumask) {
+ for_each_cpu(cpu, cpumask) {
call_single_data_t *csd;
csd = per_cpu_ptr(cfd->csd, cpu);
@@ -1010,6 +1061,9 @@ EXPORT_SYMBOL(nr_cpu_ids);
void __init setup_nr_cpu_ids(void)
{
set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
+
+ if (IS_ENABLED(CONFIG_PREEMPTION) && cpumask_size() <= sizeof(unsigned long))
+ static_branch_enable(&ipi_mask_inlined);
}
/* Called by boot processor to activate the rest. */
--
2.20.1
* [PATCH v6 05/12] smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
A later patch will enable preemption during csd_lock_wait() in
smp_call_function_many_cond(), which may lead to accesses to cfd->csd data
that has already been freed in smpcfd_dead_cpu().
One way to fix this is to use the RCU mechanism to protect the csd data and
wait for all read-side critical sections to exit before freeing the memory
in smpcfd_dead_cpu(), but this could delay CPU shutdown. This patch chooses
a simpler approach: allocate the percpu csd only once, on the CPU bring-up
path, and skip freeing the csd memory in smpcfd_dead_cpu().
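The allocate-once pattern chosen here can be sketched in userspace as
below. The names struct cfd_model, model_prepare_cpu() and model_dead_cpu()
are hypothetical, and calloc() stands in for alloc_percpu(); the point is
only that a second bring-up reuses the buffer from the first.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of the per-CPU call_function_data. */
struct cfd_model {
	int *csd;	/* stands in for the percpu csd array */
};

/* Bring-up: allocate only if an earlier online cycle has not done so. */
static int model_prepare_cpu(struct cfd_model *cfd, size_t ncpus)
{
	if (!cfd->csd)
		cfd->csd = calloc(ncpus, sizeof(*cfd->csd));

	return cfd->csd ? 0 : -1;
}

/*
 * Teardown: deliberately keep cfd->csd, so a waiter that was preempted
 * and resumes after the CPU went offline never touches freed memory.
 */
static void model_dead_cpu(struct cfd_model *cfd)
{
	(void)cfd;
}
```

The cost of this choice is that the csd memory of a CPU that never comes
back online stays allocated.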
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 8f8a9ee2ad11..9ef136bacda0 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -64,7 +64,15 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->csd = alloc_percpu(call_single_data_t);
+
+ /*
+ * The percpu csd is allocated only once and never freed.
+ * This ensures that smp_call_function_many_cond() can safely
+ * access the csd of an offlined CPU if it gets preempted
+ * during csd_lock_wait().
+ */
+ if (!cfd->csd)
+ cfd->csd = alloc_percpu(call_single_data_t);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -80,7 +88,6 @@ int smpcfd_dead_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
- free_percpu(cfd->csd);
return 0;
}
--
2.20.1
* [PATCH v6 06/12] smp: Enable preemption early in smp_call_function_many_cond
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Preemption was disabled throughout smp_call_function_many_cond() primarily
for the following reasons:
- To prevent the remote online CPU from going offline. Specifically, we
want to ensure that no new csds are queued after smpcfd_dying_cpu() has
finished. Therefore, preemption must be disabled until all necessary IPIs
are sent.
- To prevent the current CPU from going offline. Being migrated to another
CPU and calling csd_lock_wait() may cause a UAF due to smpcfd_dead_cpu()
during the current CPU's offline process.
- To protect the per-cpu cfd_data from concurrent modification by other
tasks on the current CPU. cfd_data contains cpumasks and per-cpu csds.
Before enqueueing a csd, we block on the csd_lock() to ensure the
previous async csd->func() has completed, and then initialize csd->func and
csd->info. After sending the IPI, we spin-wait for the remote CPU to call
csd_unlock(). The csd_lock mechanism therefore already guarantees csd
serialization. If preemption occurs during csd_lock_wait(), other concurrent
smp_call_function_many_cond() calls will simply block until the previous
csd->func() completes:
    task A                      task B

    csd->func = func_a
    send ipis

                preempted by B
               --------------->
                                csd_lock(csd); // block until last
                                               // func_a finished

                                csd->func = func_b;
                                csd->info = info;
                                ...
                                send ipis

                switch back to A
               <---------------

    csd_lock_wait(csd); // block until remote finish func_*
Previous patches replaced the per-cpu cfd->cpumask with a task-local
cpumask, and the percpu csd is now allocated only once and never freed, so
the csd can always be accessed safely. We can therefore enable preemption
before csd_lock_wait(), which makes the potentially unpredictable
csd_lock_wait() preemptible and migratable.
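The lock/unlock/wait protocol described above can be modelled with C11
atomics. This is a simplified userspace sketch, not the kernel
implementation: struct csd_model and the csd_model_*() helpers are
hypothetical, the "remote CPU" is simulated by a direct function call, and
callers of csd_model_lock() are assumed to be serialized with each other
(in the patch, the enqueue phase still runs with preemption disabled).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct csd_model {
	atomic_bool locked;
	void (*func)(void *);
	void *info;
};

static void csd_model_init(struct csd_model *csd)
{
	atomic_init(&csd->locked, false);
	csd->func = NULL;
	csd->info = NULL;
}

static bool csd_model_is_locked(struct csd_model *csd)
{
	return atomic_load(&csd->locked);
}

/* Spin until the remote side has released the csd. */
static void csd_model_wait(struct csd_model *csd)
{
	while (atomic_load(&csd->locked))
		;
}

/* Wait for the previous request to finish, then claim the csd. */
static void csd_model_lock(struct csd_model *csd)
{
	csd_model_wait(csd);
	atomic_store(&csd->locked, true);
}

/* Simulated remote CPU: run the callback, then release the csd. */
static void csd_model_run_remote(struct csd_model *csd)
{
	csd->func(csd->info);
	atomic_store(&csd->locked, false);
}

/* Example callback: count invocations through the info pointer. */
static void count_call(void *info)
{
	++*(int *)info;
}
```

A second sender that calls csd_model_lock() while the flag is still set
spins until csd_model_run_remote() clears it, which corresponds to task B
blocking in csd_lock() in the diagram above.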
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
kernel/smp.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 9ef136bacda0..5cb09a84263b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -859,15 +859,14 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
unsigned int scf_flags,
smp_cond_func_t cond_func)
{
- int cpu, last_cpu, this_cpu = smp_processor_id();
+ int cpu, last_cpu, this_cpu;
struct call_function_data *cfd;
bool wait = scf_flags & SCF_WAIT;
struct cpumask *cpumask, *task_mask;
int nr_cpus = 0;
bool run_remote = false;
- lockdep_assert_preemption_disabled();
-
+ this_cpu = get_cpu();
task_mask = smp_task_ipi_mask(current);
cfd = this_cpu_ptr(&cfd_data);
if (task_mask)
@@ -953,6 +952,17 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
local_irq_restore(flags);
}
+ /*
+ * Waiting for completion can take time especially with many CPUs.
+ * On a PREEMPTIBLE kernel a per-task cpumask is used to track CPUs
+ * with pending IPI request. This allows to enable preemption and
+ * potentially wait while allowing task preemption. On a !PREEMPTIBLE
+ * the cpumask is shared and the call must block until completion to
+ * avoid modifications by a another caller on this CPU.
+ */
+ if (task_mask)
+ put_cpu();
+
if (run_remote && wait) {
for_each_cpu(cpu, cpumask) {
call_single_data_t *csd;
@@ -961,6 +971,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
csd_lock_wait(csd);
}
}
+
+ if (!task_mask)
+ put_cpu();
}
/**
@@ -972,8 +985,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
* on other CPUs.
*
* You must not call this function with disabled interrupts or from a
- * hardware interrupt handler or from a bottom half handler. Preemption
- * must be disabled when calling this function.
+ * hardware interrupt handler or from a bottom half handler.
*
* @func is not called on the local CPU even if @mask contains it. Consider
* using on_each_cpu_cond_mask() instead if this is not desirable.
--
2.20.1
* [PATCH v6 07/12] smp: Remove preempt_disable from smp_call_function
From: Chuyi Zhou @ 2026-05-28 15:13 UTC
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Now smp_call_function_many_cond() internally handles the preemption logic,
so smp_call_function() does not need to explicitly disable preemption.
Remove preempt_{enable, disable} from smp_call_function().
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 5cb09a84263b..b1061fbdaa68 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1012,9 +1012,8 @@ EXPORT_SYMBOL(smp_call_function_many);
*/
void smp_call_function(smp_call_func_t func, void *info, int wait)
{
- preempt_disable();
- smp_call_function_many(cpu_online_mask, func, info, wait);
- preempt_enable();
+ smp_call_function_many_cond(cpu_online_mask, func, info,
+ wait ? SCF_WAIT : 0, NULL);
}
EXPORT_SYMBOL(smp_call_function);
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v6 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (6 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
@ 2026-05-28 15:13 ` Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
` (5 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-28 15:13 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Now smp_call_function_many_cond() internally handles the preemption logic,
so on_each_cpu_cond_mask() does not need to explicitly disable preemption.
Remove preempt_{enable, disable} from on_each_cpu_cond_mask().
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/smp.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index b1061fbdaa68..15799f842746 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1136,9 +1136,7 @@ void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
if (wait)
scf_flags |= SCF_WAIT;
- preempt_disable();
smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
- preempt_enable();
}
EXPORT_SYMBOL(on_each_cpu_cond_mask);
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v6 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (7 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
@ 2026-05-28 15:13 ` Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
` (4 subsequent siblings)
13 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-28 15:13 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Previous patches make smp_call*() functions handle preemption logic
internally. Thus, the explicit preempt_disable() surrounding these calls
becomes unnecessary. Furthermore, keeping the external preempt_disable()
would prevent scftorture from exercising the newly narrowed internal
preemption-disabled regions during IPI dispatch. This patch removes
the preempt_{enable, disable} pairs in scftorture_invoke_one().
Removing this preemption protection could expose a race condition with
CPU hotplug when use_cpus_read_lock is false. Specifically, for
multi-cast operations (SCF_PRIM_MANY or SCF_PRIM_ALL), if only 1 CPU is
online, smp_call_function_many() correctly skips sending IPIs and leaves
scfc_out as false. Without preemption disabled, a CPU hotplug thread
could preempt the test thread, bring a second CPU online, and increment
num_online_cpus(). When the test thread resumes, the validation check
would see num_online_cpus() > 1 and falsely trigger the memory-ordering
warning, leaking the scfcp structure.
To avoid this potential false positive, restrict the num_online_cpus() > 1
condition to only apply when use_cpus_read_lock is true, ensuring the CPU
count remains stable during evaluation.
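
The change to the validation condition can be sketched as a userspace model.
This is only an illustration of the two predicates in the diff below; the
function and parameter names are hypothetical stand-ins for the kernel state
(num_online_cpus(), scfs_prim, scfc_out), not kernel API:

```c
#include <stdbool.h>

/*
 * Userspace model of the WARN_ON_ONCE() predicate in
 * scftorture_invoke_one(). All names here are illustrative.
 */

/* Before the patch: the online CPU count is trusted unconditionally. */
static bool scf_warn_old(int online_cpus, bool prim_single, bool scfc_out)
{
	return (online_cpus > 1 || prim_single) && !scfc_out;
}

/*
 * After the patch: the online CPU count is consulted only when
 * cpus_read_lock() keeps it stable; single-target primitives are
 * always checked.
 */
static bool scf_warn_new(bool use_cpus_read_lock, int online_cpus,
			 bool is_single, bool scfc_out)
{
	return ((use_cpus_read_lock && online_cpus > 1) || is_single) &&
	       !scfc_out;
}
```

In the race described above (a second CPU comes online after the multi-cast
call skipped its IPIs), the old predicate evaluates to true while the new one
evaluates to false when use_cpus_read_lock is false.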
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
kernel/scftorture.c | 13 ++++---------
1 file changed, 4 insertions(+), 9 deletions(-)
diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 327c315f411c..2082f9b44370 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -348,6 +348,8 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
int ret = 0;
struct scf_check *scfcp = NULL;
struct scf_selector *scfsp = scf_sel_rand(trsp);
+ bool is_single = (scfsp->scfs_prim == SCF_PRIM_SINGLE ||
+ scfsp->scfs_prim == SCF_PRIM_SINGLE_RPC);
if (scfsp->scfs_prim == SCF_PRIM_SINGLE || scfsp->scfs_wait) {
scfcp = kmalloc_obj(*scfcp, GFP_ATOMIC);
@@ -364,8 +366,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
}
if (use_cpus_read_lock)
cpus_read_lock();
- else
- preempt_disable();
switch (scfsp->scfs_prim) {
case SCF_PRIM_RESCHED:
if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) {
@@ -411,13 +411,10 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
if (!ret) {
if (use_cpus_read_lock)
cpus_read_unlock();
- else
- preempt_enable();
+
wait_for_completion(&scfcp->scfc_completion);
if (use_cpus_read_lock)
cpus_read_lock();
- else
- preempt_disable();
} else {
scfp->n_single_rpc_ofl++;
scf_add_to_free_list(scfcp);
@@ -452,7 +449,7 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
scfcp->scfc_out = true;
}
if (scfcp && scfsp->scfs_wait) {
- if (WARN_ON_ONCE((num_online_cpus() > 1 || scfsp->scfs_prim == SCF_PRIM_SINGLE) &&
+ if (WARN_ON_ONCE(((use_cpus_read_lock && num_online_cpus() > 1) || is_single) &&
!scfcp->scfc_out)) {
pr_warn("%s: Memory-ordering failure, scfs_prim: %d.\n", __func__, scfsp->scfs_prim);
atomic_inc(&n_mb_out_errs); // Leak rather than trash!
@@ -463,8 +460,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
}
if (use_cpus_read_lock)
cpus_read_unlock();
- else
- preempt_enable();
if (allocfail)
schedule_timeout_idle((1 + longwait) * HZ); // Let no-wait handlers complete.
else if (!(torture_random(trsp) & 0xfff))
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (8 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
@ 2026-05-28 15:13 ` Chuyi Zhou
2026-06-04 20:54 ` Dave Hansen
2026-05-28 15:13 ` [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
` (3 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-28 15:13 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
stack") converted flush_tlb_info from a stack variable to a per-CPU variable.
This brought about a performance improvement of around 3% in an extreme test.
However, it also required that all flush_tlb* operations keep preemption
disabled entirely to prevent concurrent modifications of flush_tlb_info.
flush_tlb* needs to send IPIs to remote CPUs and synchronously wait for
all remote CPUs to complete their local TLB flushes. The process could
take tens of milliseconds when interrupts are disabled or with a large
number of remote CPUs.
To improve kernel real-time performance, this patch moves flush_tlb_info
back to the stack and aligns it to SMP_CACHE_BYTES. In certain
configurations, SMP_CACHE_BYTES may be large, so the alignment is capped at
64 bytes. This prepares for enabling preemption during TLB flush in the next
patch.
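
The capped alignment can be tried out in userspace. The following is a
minimal sketch, assuming a hypothetical SMP_CACHE_BYTES of 128 and a
cut-down stand-in structure (struct flush_info_model is not the kernel's
struct flush_tlb_info); it only shows that an on-stack instance of a type
declared with the capped alignment starts on a 64-byte boundary:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical value, chosen to exercise the cap. */
#define SMP_CACHE_BYTES 128

#if SMP_CACHE_BYTES > 64
#define FLUSH_TLB_INFO_ALIGN 64
#else
#define FLUSH_TLB_INFO_ALIGN SMP_CACHE_BYTES
#endif

/* Cut-down stand-in for the real structure, for illustration only. */
struct flush_info_model {
	unsigned long start;
	unsigned long end;
	unsigned char stride_shift;
} __attribute__((aligned(FLUSH_TLB_INFO_ALIGN)));

/* The type itself carries the capped alignment. */
static_assert(alignof(struct flush_info_model) == FLUSH_TLB_INFO_ALIGN,
	      "alignment should be capped at 64 bytes");

/* Returns 1 if an on-stack instance starts on the capped boundary. */
static int stack_copy_is_aligned(void)
{
	struct flush_info_model info = {
		.start = 0, .end = 4096, .stride_shift = 12,
	};

	return ((uintptr_t)&info % FLUSH_TLB_INFO_ALIGN) == 0;
}
```

Because the alignment is a property of the type, every caller that declares
the structure on its stack gets the same placement without extra annotation.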
To evaluate the performance impact of this patch, use the following program
to reproduce the microbenchmark mentioned in commit 3db6d5a5ecaf
("x86/mm/tlb: Remove 'struct flush_tlb_info' from the stack"). The test
environment is an Ice Lake system (Intel(R) Xeon(R) Platinum 8336C) with
128 CPUs and 2 NUMA nodes. During the test, the threads were bound to
specific CPUs, and both pti and mitigations were disabled:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <unistd.h>

#define NUM_OPS 1000000
#define NUM_THREADS 3
#define NUM_RUNS 5
#define PAGE_SIZE 4096

volatile int stop_threads = 0;

/* Keeps a sibling thread busy so the mm stays active on other CPUs. */
void *busy_wait_thread(void *arg) {
    while (!stop_threads) {
        __asm__ volatile ("nop");
    }
    return NULL;
}

long long get_usec() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000000LL + tv.tv_usec;
}

int main() {
    pthread_t threads[NUM_THREADS];
    char *addr;
    int i, r;

    addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    for (i = 0; i < NUM_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, busy_wait_thread, NULL))
            exit(1);
    }

    printf("Running benchmark: %d runs, %d ops each, %d background threads\n",
           NUM_RUNS, NUM_OPS, NUM_THREADS);

    for (r = 0; r < NUM_RUNS; r++) {
        long long start, end;

        start = get_usec();
        for (i = 0; i < NUM_OPS; i++) {
            /* Touch the page, then drop it to force a TLB flush. */
            addr[0] = 1;
            if (madvise(addr, PAGE_SIZE, MADV_DONTNEED)) {
                perror("madvise");
                exit(1);
            }
        }
        end = get_usec();

        double duration = (double)(end - start);
        double avg_lat = duration / NUM_OPS;
        printf("Run %d: Total time %.2f us, Avg latency %.4f us/op\n",
               r + 1, duration, avg_lat);
    }

    stop_threads = 1;
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    munmap(addr, PAGE_SIZE);
    return 0;
}
                 base      on-stack-aligned   on-stack-not-aligned
                 ------    ----------------   --------------------
avg (usec/op)    2.5278    2.5261             2.5508
stddev           0.0007    0.0027             0.0023
The benchmark results show that the average latency difference between the
baseline (base) and the properly aligned stack variable (on-stack-aligned)
is within the standard deviation (stddev). This indicates that the
variations are caused by testing noise, and reverting to a stack variable
with proper alignment causes no performance regression compared to the
per-CPU implementation. The unaligned version (on-stack-not-aligned) shows
a minor performance drop. This demonstrates that we can improve real-time
performance without a measurable cost in TLB flush latency.
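
The "within the standard deviation" reading of the table can be checked
mechanically. The sketch below is one possible interpretation, not
necessarily the criterion used for the analysis above: it assumes two
averages are indistinguishable when their difference does not exceed the
larger of the two reported standard deviations. The helper name is
hypothetical; the numbers in the usage note come from the table.

```c
#include <stdbool.h>

/*
 * Returns true when the two averages differ by no more than the larger
 * of their standard deviations. This threshold is an assumption made
 * for illustration.
 */
static bool within_noise(double avg_a, double sd_a, double avg_b, double sd_b)
{
	double diff = avg_a > avg_b ? avg_a - avg_b : avg_b - avg_a;
	double noise = sd_a > sd_b ? sd_a : sd_b;

	return diff <= noise;
}
```

With the table values, within_noise(2.5278, 0.0007, 2.5261, 0.0027) holds
for base versus on-stack-aligned, while within_noise(2.5278, 0.0007, 2.5508,
0.0023) does not hold for base versus on-stack-not-aligned.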
Suggested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Suggested-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
arch/x86/include/asm/tlbflush.h | 8 +++-
arch/x86/mm/tlb.c | 72 +++++++++------------------------
2 files changed, 27 insertions(+), 53 deletions(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 0545fe75c3fa..f4e4505d4ece 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -211,6 +211,12 @@ extern u16 invlpgb_count_max;
extern void initialize_tlbstate_and_flush(void);
+#if SMP_CACHE_BYTES > 64
+#define FLUSH_TLB_INFO_ALIGN 64
+#else
+#define FLUSH_TLB_INFO_ALIGN SMP_CACHE_BYTES
+#endif
+
/*
* TLB flushing:
*
@@ -249,7 +255,7 @@ struct flush_tlb_info {
u8 stride_shift;
u8 freed_tables;
u8 trim_cpumask;
-};
+} __aligned(FLUSH_TLB_INFO_ALIGN);
void flush_tlb_local(void);
void flush_tlb_one_user(unsigned long addr);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index af43d177087e..cfc3a72477f5 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1373,28 +1373,12 @@ void flush_tlb_multi(const struct cpumask *cpumask,
*/
unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
-#endif
-
-static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
- unsigned long start, unsigned long end,
- unsigned int stride_shift, bool freed_tables,
- u64 new_tlb_gen)
+static void get_flush_tlb_info(struct flush_tlb_info *info,
+ struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ unsigned int stride_shift, bool freed_tables,
+ u64 new_tlb_gen)
{
- struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
- /*
- * Ensure that the following code is non-reentrant and flush_tlb_info
- * is not overwritten. This means no TLB flushing is initiated by
- * interrupt handlers and machine-check exception handlers.
- */
- BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
-#endif
-
/*
* If the number of flushes is so large that a full flush
* would be faster, do a full flush.
@@ -1412,32 +1396,22 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
info->new_tlb_gen = new_tlb_gen;
info->initiating_cpu = smp_processor_id();
info->trim_cpumask = 0;
-
- return info;
-}
-
-static void put_flush_tlb_info(void)
-{
-#ifdef CONFIG_DEBUG_VM
- /* Complete reentrancy prevention checks */
- barrier();
- this_cpu_dec(flush_tlb_info_idx);
-#endif
}
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables)
{
- struct flush_tlb_info *info;
+ struct flush_tlb_info _info;
+ struct flush_tlb_info *info = &_info;
int cpu = get_cpu();
u64 new_tlb_gen;
/* This is also a barrier that synchronizes with switch_mm(). */
new_tlb_gen = inc_mm_tlb_gen(mm);
- info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
- new_tlb_gen);
+ get_flush_tlb_info(&_info, mm, start, end, stride_shift, freed_tables,
+ new_tlb_gen);
/*
* flush_tlb_multi() is not optimized for the common case in which only
@@ -1457,7 +1431,6 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
local_irq_enable();
}
- put_flush_tlb_info();
put_cpu();
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
}
@@ -1527,19 +1500,16 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
- struct flush_tlb_info *info;
+ struct flush_tlb_info info;
guard(preempt)();
+ get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
+ TLB_GENERATION_INVALID);
- info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
- TLB_GENERATION_INVALID);
-
- if (info->end == TLB_FLUSH_ALL)
- kernel_tlb_flush_all(info);
+ if (info.end == TLB_FLUSH_ALL)
+ kernel_tlb_flush_all(&info);
else
- kernel_tlb_flush_range(info);
-
- put_flush_tlb_info();
+ kernel_tlb_flush_range(&info);
}
/*
@@ -1707,12 +1677,11 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
- struct flush_tlb_info *info;
+ struct flush_tlb_info info;
int cpu = get_cpu();
-
- info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
- TLB_GENERATION_INVALID);
+ get_flush_tlb_info(&info, NULL, 0, TLB_FLUSH_ALL, 0, false,
+ TLB_GENERATION_INVALID);
/*
* flush_tlb_multi() is not optimized for the common case in which only
* a local TLB flush is needed. Optimize this use-case by calling
@@ -1722,17 +1691,16 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
invlpgb_flush_all_nonglobals();
batch->unmapped_pages = false;
} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
- flush_tlb_multi(&batch->cpumask, info);
+ flush_tlb_multi(&batch->cpumask, &info);
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
- flush_tlb_func(info);
+ flush_tlb_func(&info);
local_irq_enable();
}
cpumask_clear(&batch->cpumask);
- put_flush_tlb_info();
put_cpu();
}
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (9 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
@ 2026-05-28 15:13 ` Chuyi Zhou
2026-06-04 21:15 ` Dave Hansen
2026-05-28 15:13 ` [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
` (2 subsequent siblings)
13 siblings, 1 reply; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-28 15:13 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
native_flush_tlb_multi() may be frequently called by flush_tlb_mm_range()
and arch_tlbbatch_flush() in production environments. When pages are
reclaimed or a process exits, native_flush_tlb_multi() sends IPIs to remote
CPUs and waits for all remote CPUs to complete their local TLB flushes.
The overall latency may reach tens of milliseconds due to a large number of
remote CPUs and other factors (such as interrupts being disabled). Because
flush_tlb_mm_range() and arch_tlbbatch_flush() always disable preemption,
this can increase scheduling latency for other threads on the current CPU.
The previous patch converted flush_tlb_info from a per-CPU variable to an
on-stack variable. Additionally, it is no longer necessary to explicitly
disable preemption before calling smp_call*() since they handle the
preemption logic internally. It is now safe to enable preemption during
native_flush_tlb_multi().
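
The resulting control flow in flush_tlb_mm_range() can be modelled in
userspace to check that get_cpu()/put_cpu() stay balanced on each path.
This is a simplified sketch: the model_* names and the counter are
hypothetical, and the global-ASID branch of the real function is left out.

```c
/* Models the preempt counter that get_cpu()/put_cpu() adjust. */
static int model_preempt_count;
/* Records the counter value seen while "waiting" for remote CPUs. */
static int model_count_at_multi_flush = -1;

static void model_get_cpu(void) { model_preempt_count++; }
static void model_put_cpu(void) { model_preempt_count--; }

static void model_flush_tlb_multi(void)
{
	model_count_at_multi_flush = model_preempt_count;
}

enum model_flush_path { MODEL_FLUSH_REMOTE, MODEL_FLUSH_LOCAL, MODEL_FLUSH_NONE };

static void model_flush_tlb_mm_range(enum model_flush_path path)
{
	model_get_cpu();

	if (path == MODEL_FLUSH_REMOTE) {
		/* Re-enable preemption before the potentially long IPI wait. */
		model_put_cpu();
		model_flush_tlb_multi();
		goto invalidate;
	} else if (path == MODEL_FLUSH_LOCAL) {
		/* The local flush still runs with preemption disabled. */
	}

	model_put_cpu();
	/* Secondary TLB invalidation is reached on every path. */
invalidate:
	return;
}
```

The goto exists only to skip the second put_cpu() on the remote path; every
path leaves the counter where it started.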
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
arch/x86/kernel/kvm.c | 4 +++-
arch/x86/mm/tlb.c | 9 +++++++--
2 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 29226d112029..d540f54f4d16 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -662,8 +662,10 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
u8 state;
int cpu;
struct kvm_steal_time *src;
- struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
+ struct cpumask *flushmask;
+ guard(preempt)();
+ flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
cpumask_copy(flushmask, cpumask);
/*
* We have to call flush only on online vCPUs. And
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index cfc3a72477f5..58c6f3d2f993 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1421,9 +1421,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
if (mm_global_asid(mm)) {
broadcast_tlb_flush(info);
} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+ put_cpu();
info->trim_cpumask = should_trim_cpumask(mm);
flush_tlb_multi(mm_cpumask(mm), info);
consider_global_asid(mm);
+ goto invalidate;
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
@@ -1432,6 +1434,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
}
put_cpu();
+invalidate:
mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
}
@@ -1691,7 +1694,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
invlpgb_flush_all_nonglobals();
batch->unmapped_pages = false;
} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+ put_cpu();
flush_tlb_multi(&batch->cpumask, &info);
+ goto clear;
} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
@@ -1699,9 +1704,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
local_irq_enable();
}
- cpumask_clear(&batch->cpumask);
-
put_cpu();
+clear:
+ cpumask_clear(&batch->cpumask);
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (10 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
@ 2026-05-28 15:13 ` Chuyi Zhou
2026-06-04 21:21 ` Dave Hansen
2026-05-28 19:47 ` [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Paul E. McKenney
2026-06-03 11:02 ` Sebastian Andrzej Siewior
13 siblings, 1 reply; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-28 15:13 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel, Chuyi Zhou
flush_tlb_kernel_range() is invoked when kernel memory mappings change.
On x86 platforms without the INVLPGB feature enabled, we need to send IPIs
to every online CPU and synchronously wait for them to complete
do_kernel_range_flush(). This process can be time-consuming due to factors
such as a large number of CPUs or other issues (like interrupts being
disabled). Because flush_tlb_kernel_range() always disables preemption, this
may affect the scheduling latency of other tasks on the current CPU.
An earlier patch converted flush_tlb_info from a per-CPU variable to an
on-stack variable. Additionally, it is no longer necessary to explicitly
disable preemption before calling smp_call*() since they handle the
preemption logic internally. It is now safe to enable preemption during
flush_tlb_kernel_range(). In get_flush_tlb_info(), use
raw_smp_processor_id() to avoid warnings from check_preemption_disabled().
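
The effect of moving guard(preempt)() from flush_tlb_kernel_range() into
invlpgb_kernel_range_flush() can be sketched in userspace with the compiler's
cleanup attribute, the mechanism scope-based guards build on. Everything
named model_* or MODEL_* below is hypothetical, and the two-way branch on
INVLPGB availability is an assumption about how the flush is dispatched:

```c
/* Models the preempt counter manipulated by the guard. */
static int model_preempt_depth;
static int model_depth_in_invlpgb = -1;
static int model_depth_in_ipi_wait = -1;

static void model_preempt_guard_exit(int *token)
{
	(void)token;
	model_preempt_depth--;
}

/* Disables "preemption" now and re-enables it when the scope ends. */
#define MODEL_GUARD_PREEMPT()						\
	int guard_token __attribute__((cleanup(model_preempt_guard_exit), \
				       unused)) = (model_preempt_depth++, 0)

static void model_invlpgb_kernel_range_flush(void)
{
	MODEL_GUARD_PREEMPT();
	model_depth_in_invlpgb = model_preempt_depth;
}

static void model_ipi_kernel_range_flush(void)
{
	/* No guard here: the IPI wait may be preempted. */
	model_depth_in_ipi_wait = model_preempt_depth;
}

static void model_flush_tlb_kernel_range(int have_invlpgb)
{
	if (have_invlpgb)
		model_invlpgb_kernel_range_flush();
	else
		model_ipi_kernel_range_flush();
}
```

In this model only the INVLPGB path runs with the counter raised, and the
counter is back to zero once either path returns.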
Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
arch/x86/mm/tlb.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 58c6f3d2f993..c37cc9845abc 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1394,7 +1394,7 @@ static void get_flush_tlb_info(struct flush_tlb_info *info,
info->stride_shift = stride_shift;
info->freed_tables = freed_tables;
info->new_tlb_gen = new_tlb_gen;
- info->initiating_cpu = smp_processor_id();
+ info->initiating_cpu = raw_smp_processor_id();
info->trim_cpumask = 0;
}
@@ -1461,6 +1461,8 @@ static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
{
unsigned long addr, nr;
+ guard(preempt)();
+
for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
nr = (info->end - addr) >> PAGE_SHIFT;
@@ -1505,7 +1507,6 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
struct flush_tlb_info info;
- guard(preempt)();
get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
TLB_GENERATION_INVALID);
--
2.20.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (11 preceding siblings ...)
2026-05-28 15:13 ` [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
@ 2026-05-28 19:47 ` Paul E. McKenney
2026-05-29 3:22 ` Chuyi Zhou
2026-06-03 11:02 ` Sebastian Andrzej Siewior
13 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2026-05-28 19:47 UTC (permalink / raw)
To: Chuyi Zhou
Cc: tglx, mingo, luto, peterz, muchun.song, bp, dave.hansen, pbonzini,
bigeasy, clrkwllms, rostedt, nadav.amit, linux-kernel
On Thu, May 28, 2026 at 11:13:26PM +0800, Chuyi Zhou wrote:
> Changes in v6:
> - Make the task-local cpumask selection explicit and drop preemptible()
> check in smp_call_function_many_cond(). The early put_cpu() decision now
> depends only on whether a task-local cpumask is available.
> - Keep smp_task_ipi_mask() private to kernel/smp.c in [PATCH v6 4/12].
> - Add #include <linux/slab.h> to kernel/smp.c in [PATCH v6 4/12] for
> kmalloc()/kfree(), fixing the kernel test robot build failure reported
> at: https://lore.kernel.org/oe-kbuild-all/202605241101.w6T2LApw-lkp@intel.com/
> - Update the csd_lock_wait() comment in [PATCH v6 6/12].
> - Add Sebastian's Reviewed-by tags to the reviewed patches.
For the series:
Tested-by: Paul E. McKenney <paulmck@kernel.org>
> Changes in v5:
> - Replace "smp: Remove get_cpu from smp_call_function_any" with a new
> approach that extracts a common __smp_call_function_single() to safely
> keep the remote CPU selection and IPI dispatch process within a single
> preemption-disabled region in [PATCH v5 3/12].
> - Fix a typo in comments (s/cpumask_stack/task_mask/) and remove the
> obsolete "Preemption must be disabled" constraint from the kernel-doc
> in [PATCH v5 6/12].
> - Adjust the WARN_ON_ONCE() validation condition to avoid a false positive
> warning caused by CPU hotplug races when use_cpus_read_lock is false in
> [PATCH v5 9/12].
> - Move the preemptible() check in smp_call_function_many_cond() from
> [PATCH v5 4/12] to [PATCH v5 6/12].
> - Rebase to commit 4ac4d6549a65 ("sched: Use trace_call__<tp>() to save a
> static branch").
>
> Changes in v4:
> - Use task-local IPI cpumask rather than on-stack cpumask in
> [PATCH v4 4/12] (suggested by sebastian).
> - Skip to free csd memory in smpcfd_dead_cpu() to guarantee csd memory
> access safety, instead of using RCU mechanism in [PATCH v4 5/12]
> (suggested by sebastian).
> - Align flush_tlb_info with SMP_CACHE_BYTES to avoid performance
> degradation caused by unnecessary cache line movements in [PATCH v4
> 10/12](suggested by sebastian and Nadav).
> - Collect Acked-bys and Reviewed-bys.
>
> Changes in v3:
> - Add benchmarks to measure the performance impact of changing
> flush_tlb_info to stack variable in [PATCH v3 10/12] (suggested by
> peter)
> - Adjust the rcu_read_unlock() location in [PATCH v3 5/12] (suggested
> by muchun)
> - Use raw_smp_processor_id() to prevent warning[1] from
> check_preemption_disabled() in [PATCH v3 12/12].
> - Collect Acked-bys and Reviewed-by.
>
> [1]: https://lore.kernel.org/lkml/20260302075216.2170675-1-zhouchuyi@bytedance.com/T/#mc39999cbeb3f50be176f0903d0fa4075688b073d
>
> Changes in v2:
> - Simplify the code comments in [PATCH v2 2/12] (pointed by peter and
> muchun)
> - Adjust the preemption disabling logic in smp_call_function_any() in
> [PATCH v2 3/12] (suggested by peter).
> - Use on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in [PATCH V2
> 4/12] (pointed by peter)
> - Add [PATCH v2 5/12] to replace migrate_disable with the rcu mechanism
> - Adjust the preemption disabling logic to allow flush_tlb_multi() to be
> preemptible and migratable in [PATCH v2 11/12]
> - Collect Acked-bys and Reviewed-bys
>
> Introduction
> ============
>
> The vast majority of smp_call_function*() callers block until remote CPUs
> complete the IPI function execution. As smp_call_function*() runs with
> preemption disabled throughout, scheduling latency increases dramatically
> with the number of remote CPUs and other factors (such as interrupts being
> disabled).
>
> On x86-64 architectures, TLB flushes are performed via IPIs; thus, during
> process exit or when process-mapped pages are reclaimed, numerous IPI
> operations must be awaited, leading to increased scheduling latency for
> other threads on the current CPU. In our production environment, we
> observed IPI wait-induced scheduling latency reaching up to 16ms on a
> 16-core machine. Our goal is to allow preemption during IPI completion
> waiting to improve real-time performance.
>
> Background
> ============
>
> In our production environments, latency-sensitive workloads (DPDK) are
> configured with the highest priority to preempt lower-priority tasks at any
> time. We discovered that DPDK's wake-up latency is primarily caused by the
> current CPU having preemption disabled. Therefore, we collected the maximum
> preemption disabled events within every 30-second interval and then
> calculated the P50/P99 of these max preemption disabled events:
>
>
> p50(ns) p99(ns)
> cpu0 254956 5465050
> cpu1 115801 120782
> cpu2 43324 72957
> cpu3 256637 16723307
> cpu4 58979 87237
> cpu5 47464 79815
> cpu6 48881 81371
> cpu7 52263 82294
> cpu8 263555 4657713
> cpu9 44935 73962
> cpu10 37659 65026
> cpu11 257008 2706878
> cpu12 49669 90006
> cpu13 45186 74666
> cpu14 60705 83866
> cpu15 51311 86885
>
> Meanwhile, we have collected the distribution of preemption disabling
> events exceeding 1ms across different CPUs over several hours (CPUs whose
> counts were all zero are omitted):
>
> CPU 1~10ms 10~50ms 50~100ms
> cpu0 29 5 0
> cpu3 38 13 0
> cpu8 34 6 0
> cpu11 24 10 0
>
> Preemption-disabled periods of several milliseconds, or even 10ms+, mostly
> originate from TLB flushes:
>
> @stack[
> trace_preempt_on+143
> trace_preempt_on+143
> preempt_count_sub+67
> arch_tlbbatch_flush/flush_tlb_mm_range
> task_exit/page_reclaim/...
> ]
>
> Further analysis confirms that the majority of the time is consumed in
> csd_lock_wait().
>
> Now smp_call*() always needs to disable preemption, mainly to protect its
> internal per‑CPU data structures and synchronize with CPU offline
> operations. This patchset attempts to make csd_lock_wait() preemptible,
> thereby reducing the preemption‑disabled critical section and improving
> kernel real‑time performance.
>
> Effect
> ======
>
> After applying this patchset, we no longer observe preemption disabled for
> more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range path. The
> overall P99 of max preemption disabled events in every 30-second interval
> is reduced to around 1.5ms (the remaining latency is primarily due to lock
> contention).
>
> before patch after patch reduced by
> ----------- -------------- ------------
> p99(ns) 16723307 1556034 ~90.70%
>
> Chuyi Zhou (12):
> smp: Disable preemption explicitly in __csd_lock_wait
> smp: Enable preemption early in smp_call_function_single
> smp: Refactor remote CPU selection in smp_call_function_any()
> smp: Use task-local IPI cpumask in smp_call_function_many_cond()
> smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once
> smp: Enable preemption early in smp_call_function_many_cond
> smp: Remove preempt_disable from smp_call_function
> smp: Remove preempt_disable from on_each_cpu_cond_mask
> scftorture: Remove preempt_disable in scftorture_invoke_one
> x86/mm: Move flush_tlb_info back to the stack
> x86/mm: Enable preemption during native_flush_tlb_multi
> x86/mm: Enable preemption during flush_tlb_kernel_range
>
> arch/x86/include/asm/tlbflush.h | 8 +-
> arch/x86/kernel/kvm.c | 4 +-
> arch/x86/mm/tlb.c | 86 ++++++-----------
> include/linux/sched.h | 6 ++
> include/linux/smp.h | 15 +++
> kernel/fork.c | 9 +-
> kernel/scftorture.c | 13 +--
> kernel/smp.c | 161 ++++++++++++++++++++++++--------
> 8 files changed, 194 insertions(+), 108 deletions(-)
>
> --
> 2.20.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance
2026-05-28 19:47 ` [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Paul E. McKenney
@ 2026-05-29 3:22 ` Chuyi Zhou
2026-05-29 6:41 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 31+ messages in thread
From: Chuyi Zhou @ 2026-05-29 3:22 UTC (permalink / raw)
To: paulmck
Cc: tglx, mingo, luto, peterz, muchun.song, bp, dave.hansen, pbonzini,
bigeasy, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-05-29 3:47 a.m., Paul E. McKenney wrote:
> On Thu, May 28, 2026 at 11:13:26PM +0800, Chuyi Zhou wrote:
>> Changes in v6:
>> - Make the task-local cpumask selection explicit and drop preemptible()
>> check in smp_call_function_many_cond(). The early put_cpu() decision now
>> depends only on whether a task-local cpumask is available.
>> - Keep smp_task_ipi_mask() private to kernel/smp.c in [PATCH v6 4/12].
>> - Add #include <linux/slab.h> to kernel/smp.c in [PATCH v6 4/12] for
>> kmalloc()/kfree(), fixing the kernel test robot build failure reported
>> at: https://lore.kernel.org/oe-kbuild-all/202605241101.w6T2LApw-lkp@intel.com/
>> - Update the csd_lock_wait() comment in [PATCH v6 6/12].
>> - Add Sebastian's Reviewed-by tags to the reviewed patches.
>
> For the series:
>
> Tested-by: Paul E. McKenney <paulmck@kernel.org>
>
Thanks Paul, much appreciated!
I will carry your Tested-by tag if I need to post another revision.
* Re: [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance
2026-05-29 3:22 ` Chuyi Zhou
@ 2026-05-29 6:41 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-05-29 6:41 UTC (permalink / raw)
To: Chuyi Zhou
Cc: paulmck, tglx, mingo, luto, peterz, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-05-29 11:22:48 [+0800], Chuyi Zhou wrote:
> I will carry your Tested-by tag if I need to post another revision.
I haven't looked at v6 yet, but the bot responded to the v5 that I added to
my tree. I guess you addressed those reports in v6? I will try to stuff it
into my tree later today…
Sebastian
* Re: [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond()
2026-05-28 15:13 ` [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond() Chuyi Zhou
@ 2026-06-03 10:54 ` Sebastian Andrzej Siewior
2026-06-03 11:48 ` Chuyi Zhou
0 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-03 10:54 UTC (permalink / raw)
To: Chuyi Zhou
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-05-28 23:13:30 [+0800], Chuyi Zhou wrote:
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -167,6 +167,11 @@ void smp_call_function_many(const struct cpumask *mask,
> int smp_call_function_any(const struct cpumask *mask,
> smp_call_func_t func, void *info, int wait);
>
> +#ifdef CONFIG_PREEMPTION
> +int smp_task_ipi_mask_alloc(struct task_struct *task);
> +void smp_task_ipi_mask_free(struct task_struct *task);
> +#endif
> +
> void kick_all_cpus_sync(void);
> void wake_up_all_idle_cpus(void);
> bool cpus_peek_for_pending_ipi(const struct cpumask *mask);
> @@ -310,4 +315,14 @@ bool csd_lock_is_stuck(void);
> static inline bool csd_lock_is_stuck(void) { return false; }
> #endif
>
> +#if !defined(CONFIG_SMP) || !defined(CONFIG_PREEMPTION)
> +static inline int smp_task_ipi_mask_alloc(struct task_struct *task)
> +{
> + return 0;
> +}
> +static inline void smp_task_ipi_mask_free(struct task_struct *task)
> +{
> +}
> +#endif
> +
It might make sense to move them closer together after
CONFIG_UP_LATE_INIT so you have
#if defined(CONFIG_PREEMPTION) && defined(CONFIG_SMP)
int smp_task_ipi_mask_alloc(struct task_struct *task);
void smp_task_ipi_mask_free(struct task_struct *task);
#else
static inline int smp_task_ipi_mask_alloc(struct task_struct *task)
{
	return 0;
}
static inline void smp_task_ipi_mask_free(struct task_struct *task) { }
#endif
…
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -1010,6 +1061,9 @@ EXPORT_SYMBOL(nr_cpu_ids);
> void __init setup_nr_cpu_ids(void)
> {
> set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
> +
> + if (IS_ENABLED(CONFIG_PREEMPTION) && cpumask_size() <= sizeof(unsigned long))
> + static_branch_enable(&ipi_mask_inlined);
Nitpicks:
We restrict the inline case to a large enough size.
smp_call_function_many_cond() can deal with a NULL pointer and will
use the per-CPU mask in this case and not enable preemption early.
This happens for instance on !PREEMPT kernels and for requests issued by
the init task, which is defined at compile time in init/init_task.c. Not
sure if this is a problem as it did not trigger in testing.
Just to let you know.
The inline condition is based on cpumask_size() which uses
large_cpumask_bits. This is used by cpumask_copy() and cpumask_clear()
because the underlying operation can be optimized if the size is a
constant.
large_cpumask_bits remains a constant as long as CONFIG_NR_CPUS is <=
256 on 64bit, mirroring CONFIG_NR_CPUS. That means if you boot this on a
4 core CPU then it will not inline the operation if CONFIG_NR_CPUS is,
say, 128 to cope with larger machines. Debian for instance uses
NR_CPUS=8192 here, where large_cpumask_bits follows nr_cpu_ids, so it
will inline it.
All the cpumask operations that are used by smp_call_function_many_cond()
for this task_mask are based on small_cpumask_bits. It could be used to
allow the inline case if CONFIG_NR_CPUS=256 but you boot with 8 CPUs. It
would risk breakage if the code changes and someone does
cpumask_clear(task_mask).
Just two things that I noticed while looking at it.
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> }
>
> /* Called by boot processor to activate the rest. */
Sebastian
* Re: [PATCH v6 06/12] smp: Enable preemption early in smp_call_function_many_cond
2026-05-28 15:13 ` [PATCH v6 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
@ 2026-06-03 11:00 ` Sebastian Andrzej Siewior
2026-06-03 11:54 ` Chuyi Zhou
0 siblings, 1 reply; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-03 11:00 UTC (permalink / raw)
To: Chuyi Zhou
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-05-28 23:13:32 [+0800], Chuyi Zhou wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -953,6 +952,17 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
> local_irq_restore(flags);
> }
>
> + /*
> + * Waiting for completion can take time especially with many CPUs.
> + * On a PREEMPTIBLE kernel a per-task cpumask is used to track CPUs
I wouldn't use PREEMPTIBLE. It is PREEMPT, which is short for
CONFIG_PREEMPT. PREEMPTIBLE is not a distinct entity since "preemptible"
can be changed at runtime.
> + * with pending IPI request. This allows to enable preemption and
> + * potentially wait while allowing task preemption. On a !PREEMPTIBLE
> + * the cpumask is shared and the call must block until completion to
> + * avoid modifications by a another caller on this CPU.
> + */
> + if (task_mask)
> + put_cpu();
> +
> if (run_remote && wait) {
> for_each_cpu(cpu, cpumask) {
> call_single_data_t *csd;
Other than that,
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Sebastian
* Re: [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
` (12 preceding siblings ...)
2026-05-28 19:47 ` [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Paul E. McKenney
@ 2026-06-03 11:02 ` Sebastian Andrzej Siewior
13 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-03 11:02 UTC (permalink / raw)
To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
Cc: Chuyi Zhou
On 2026-05-28 23:13:26 [+0800], Chuyi Zhou wrote:
> Changes in v6:
…
I didn't find anything seriously wrong with this. Any statement from the
x86 or sched folks?
Sebastian
* Re: [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond()
2026-06-03 10:54 ` Sebastian Andrzej Siewior
@ 2026-06-03 11:48 ` Chuyi Zhou
2026-06-03 12:21 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 31+ messages in thread
From: Chuyi Zhou @ 2026-06-03 11:48 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-06-03 6:54 p.m., Sebastian Andrzej Siewior wrote:
> On 2026-05-28 23:13:30 [+0800], Chuyi Zhou wrote:
>> --- a/include/linux/smp.h
>> +++ b/include/linux/smp.h
>> @@ -167,6 +167,11 @@ void smp_call_function_many(const struct cpumask *mask,
>> int smp_call_function_any(const struct cpumask *mask,
>> smp_call_func_t func, void *info, int wait);
>>
>> +#ifdef CONFIG_PREEMPTION
>> +int smp_task_ipi_mask_alloc(struct task_struct *task);
>> +void smp_task_ipi_mask_free(struct task_struct *task);
>> +#endif
>> +
>> void kick_all_cpus_sync(void);
>> void wake_up_all_idle_cpus(void);
>> bool cpus_peek_for_pending_ipi(const struct cpumask *mask);
>> @@ -310,4 +315,14 @@ bool csd_lock_is_stuck(void);
>> static inline bool csd_lock_is_stuck(void) { return false; }
>> #endif
>>
>> +#if !defined(CONFIG_SMP) || !defined(CONFIG_PREEMPTION)
>> +static inline int smp_task_ipi_mask_alloc(struct task_struct *task)
>> +{
>> + return 0;
>> +}
>> +static inline void smp_task_ipi_mask_free(struct task_struct *task)
>> +{
>> +}
>> +#endif
>> +
>
Yes, that looks cleaner.
> It might make sense to move them closer together after
> CONFIG_UP_LATE_INIT so you have
>
> #if defined(CONFIG_PREEMPTION) && defined(CONFIG_SMP)
> int smp_task_ipi_mask_alloc(struct task_struct *task);
> void smp_task_ipi_mask_free(struct task_struct *task);
>
> #else
> static inline int smp_task_ipi_mask_alloc(struct task_struct *task)
> {
> return 0;
> }
> static inline void smp_task_ipi_mask_free(struct task_struct *task) { }
> #endif
>
> …
>
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -1010,6 +1061,9 @@ EXPORT_SYMBOL(nr_cpu_ids);
>> void __init setup_nr_cpu_ids(void)
>> {
>> set_nr_cpu_ids(find_last_bit(cpumask_bits(cpu_possible_mask), NR_CPUS) + 1);
>> +
>> + if (IS_ENABLED(CONFIG_PREEMPTION) && cpumask_size() <= sizeof(unsigned long))
>> + static_branch_enable(&ipi_mask_inlined);
> Nitpicks:
> We restrict the inline case to a large enough size.
> smp_call_function_many_cond() can deal with with a NULL pointer and will
> use the per-CPU mask in this cases and not enable PREEMPTION early.
>
> This happens for instance on !PREEMPT kernels and request issued by the
> init task which is defined at compile at init/init_task.c. Not sure if
> this is a problem as it did not trigger is testing.
> Just to let you know.
>
> The inline condition is based on cpumask_size() which uses
> large_cpumask_bits. This is used by cpumask_copy() and cpumask_clear()
> because the underlying operation can be optimized if the size is a
> constant.
> large_cpumask_bits remains a constant as long as CONFIG_NR_CPUS is <=
> 256 on 64bit mirroring CONFIG_NR_CPUS. That means if you boot this on a
> 4 core CPU then it will not inline the operation if CONFIG_NR_CPUS is
> say 128 to cope with larger machines. Debian for instance uses here
> NR_CPUS=8192 so it will inline it.
>
> All the cpumask operation that are used by smp_call_function_many_cond()
> for this task_mask are based on small_cpumask_bits. It could be used to
> allow the inline case if CONFIG_NR_CPUS=256 but boot with 8 CPUs. It
> would risk breakage if the code changes one does
> cpumask_clear(task_mask).
>
> Just two things that I noticed while looking at it.
>
Thanks for pointing this out.
The NULL fallback in smp_call_function_many_cond() is intentional for
cases where no task-local mask is available, such as !PREEMPT kernels or
static/early tasks. In those cases it falls back to the per-CPU mask and
keeps the existing preemption-disabled behavior.
Using cpumask_size() is conservative, but it ensures that the inline
storage is large enough for cpumask operations which may use
large_cpumask_bits, such as cpumask_clear() or cpumask_copy(), and does
not rely on this path continuing to only use small_cpumask_bits based
operations. So I think keeping the cpumask_size() check is the safer
tradeoff here.
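To make the tradeoff concrete, here is a small user-space model of the size check. It is only a sketch under assumptions: the model_* names are made up for illustration, a 64-bit long is assumed, and the small/large selection follows the thresholds described above (a compile-time constant up to BITS_PER_LONG for small_cpumask_bits and up to 4 * BITS_PER_LONG for large_cpumask_bits, nr_cpu_ids beyond that). It is not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: model a 64-bit kernel regardless of the build host. */
#define MODEL_BITS_PER_LONG	64u
#define MODEL_LONG_BYTES	8u

/* Bytes needed for a bitmap of nbits, rounded up to whole longs. */
static size_t model_bitmap_bytes(unsigned int nbits)
{
	assert(nbits > 0);
	return ((nbits + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG) *
	       MODEL_LONG_BYTES;
}

/* "Large" operations stay on the constant CONFIG_NR_CPUS up to 4 longs. */
static unsigned int model_large_cpumask_bits(unsigned int config_nr_cpus,
					     unsigned int nr_cpu_ids)
{
	return config_nr_cpus <= 4 * MODEL_BITS_PER_LONG ? config_nr_cpus
							 : nr_cpu_ids;
}

/* "Small" operations stay on the constant only up to one long. */
static unsigned int model_small_cpumask_bits(unsigned int config_nr_cpus,
					     unsigned int nr_cpu_ids)
{
	return config_nr_cpus <= MODEL_BITS_PER_LONG ? config_nr_cpus
						     : nr_cpu_ids;
}

/* The check used in the patch: cpumask_size() <= sizeof(unsigned long). */
static bool model_inline_by_cpumask_size(unsigned int config_nr_cpus,
					 unsigned int nr_cpu_ids)
{
	unsigned int bits = model_large_cpumask_bits(config_nr_cpus, nr_cpu_ids);

	return model_bitmap_bytes(bits) <= MODEL_LONG_BYTES;
}

/* The looser alternative that was discussed: size by small_cpumask_bits. */
static bool model_inline_by_small_bits(unsigned int config_nr_cpus,
				       unsigned int nr_cpu_ids)
{
	unsigned int bits = model_small_cpumask_bits(config_nr_cpus, nr_cpu_ids);

	return model_bitmap_bytes(bits) <= MODEL_LONG_BYTES;
}
```

In this model, CONFIG_NR_CPUS=128 on a 4-CPU machine does not inline, CONFIG_NR_CPUS=8192 on the same machine does, and CONFIG_NR_CPUS=256 with 8 CPUs would inline only under the small_cpumask_bits variant, which is the case that could break once something like cpumask_clear() touches the mask.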
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
>> }
>>
>> /* Called by boot processor to activate the rest. */
>
> Sebastian
* Re: [PATCH v6 06/12] smp: Enable preemption early in smp_call_function_many_cond
2026-06-03 11:00 ` Sebastian Andrzej Siewior
@ 2026-06-03 11:54 ` Chuyi Zhou
0 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-06-03 11:54 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-06-03 7:00 p.m., Sebastian Andrzej Siewior wrote:
> On 2026-05-28 23:13:32 [+0800], Chuyi Zhou wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -953,6 +952,17 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>> local_irq_restore(flags);
>> }
>>
>> + /*
>> + * Waiting for completion can take time especially with many CPUs.
>> + * On a PREEMPTIBLE kernel a per-task cpumask is used to track CPUs
>
> I wouldn't use PREEMPTIBLE. It is PREEMPT which short for
> CONFIG_PREEMPT. PREEMPTIBLE is not a distinct entity since "preemptible"
> can be changed at runtime.
Right, PREEMPT/!PREEMPT is the better wording here. I will update the
comment accordingly.
>
>> + * with pending IPI request. This allows to enable preemption and
>> + * potentially wait while allowing task preemption. On a !PREEMPTIBLE
>> + * the cpumask is shared and the call must block until completion to
>> + * avoid modifications by a another caller on this CPU.
>> + */
>> + if (task_mask)
>> + put_cpu();
>> +
>> if (run_remote && wait) {
>> for_each_cpu(cpu, cpumask) {
>> call_single_data_t *csd;
>
> Other than that,
>
> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>
> Sebastian
* Re: [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond()
2026-06-03 11:48 ` Chuyi Zhou
@ 2026-06-03 12:21 ` Sebastian Andrzej Siewior
0 siblings, 0 replies; 31+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-06-03 12:21 UTC (permalink / raw)
To: Chuyi Zhou
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, clrkwllms, rostedt, nadav.amit, linux-kernel
On 2026-06-03 19:48:50 [+0800], Chuyi Zhou wrote:
> Thanks for pointing this out.
>
> The NULL fallback in smp_call_function_many_cond() is intentional for
> cases where no task-local mask is available, such as !PREEMPT kernels or
> static/early tasks. In those cases it falls back to the per-CPU mask and
> keeps the existing preemption-disabled behavior.
Yes, I am aware of this. I just wanted to point out a case where a task
has no mask associated.
Sebastian
* Re: [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-05-28 15:13 ` [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
@ 2026-06-04 20:54 ` Dave Hansen
2026-06-04 21:11 ` Nadav Amit
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2026-06-04 20:54 UTC (permalink / raw)
To: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel
On 5/28/26 08:13, Chuyi Zhou wrote:
> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
> stack") converted flush_tlb_info from stack variable to per-CPU variable.
> This brought about a performance improvement of around 3% in extreme test.
You've basically (nicely) thrown down the gauntlet and told Nadav that
his patch or methodology is bad. PeterZ also acked Nadav's approach.
I think they should chime in before doing anything here.
Also, this patch does at least three different things:
1. Adds alignment for flush_tlb_info
2. Changes the signature of get_flush_tlb_info() to take a pointer
instead of returning one.
3. Actually allocates the info on the stack.
Could you refactor this, please?
* Re: [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-06-04 20:54 ` Dave Hansen
@ 2026-06-04 21:11 ` Nadav Amit
2026-06-04 21:16 ` Dave Hansen
0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2026-06-04 21:11 UTC (permalink / raw)
To: Dave Hansen
Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, linux-kernel
> On 4 Jun 2026, at 23:54, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 5/28/26 08:13, Chuyi Zhou wrote:
>> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
>> stack") converted flush_tlb_info from stack variable to per-CPU variable.
>> This brought about a performance improvement of around 3% in extreme test.
>
> You've basically (nicely) thrown down the gauntlet and told Nadav that
> his patch or methodology is bad. PeterZ also acked it Nadav's approach.
Dave, why can’t we all be friends. :)
I communicated with Zhou about this. Back when I made my original changes (the
ones Zhou is now editing), I also wanted to keep flush_tlb_info on the stack
with alignment. But PeterZ reverted that [1], because in certain cases (KASAN,
probably with some experimental larger Intel cache-line size) the cache-line
alignment made the stack grow too much and triggered warnings.
That's what led us to move flush_tlb_info into a per-cpu struct (preemption
was disabled at that time, so there was no trade-off). Now that Zhou says
that disabling preemption is a pain point, and therefore does not want
flush_tlb_info to be per-cpu, the solution we've landed on is to put it back
on the stack, avoiding the need for a single flush_tlb_info that depends on
preemption being disabled, while limiting the alignment to 64 bytes so the
stack doesn't grow out of control.
[1] https://lore.kernel.org/all/tip-780e0106d468a2962b16b52fdf42898f2639e0a0@git.kernel.org/
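As a rough illustration of the capped alignment (a user-space sketch, not the actual patch), the snippet below places a 64-byte aligned struct on the stack. The struct name, its fields and the FLUSH_INFO_ALIGN_SKETCH constant are hypothetical stand-ins, and the aligned attribute is the GCC/Clang spelling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the 64-byte cap mentioned above; the name is made up. */
#define FLUSH_INFO_ALIGN_SKETCH 64

/* Hypothetical stand-in for struct flush_tlb_info, reduced to three fields. */
struct flush_info_sketch {
	unsigned long start;
	unsigned long end;
	unsigned int initiating_cpu;
} __attribute__((aligned(FLUSH_INFO_ALIGN_SKETCH)));

/* True if p is a multiple of align; align must be a power of two. */
static bool is_aligned(const void *p, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uintptr_t)p & (align - 1)) == 0;
}

/* Alignment the compiler enforces for every instance, stack or not. */
static size_t flush_info_alignment(void)
{
	return __alignof__(struct flush_info_sketch);
}

/* Put one instance on the stack and check where it landed. */
static bool stack_copy_is_aligned(void)
{
	struct flush_info_sketch info = {
		.start = 0,
		.end = 4096,
		.initiating_cpu = 0,
	};

	return is_aligned(&info, FLUSH_INFO_ALIGN_SKETCH) && info.end > info.start;
}
```

With a fixed 64-byte cap the worst-case padding per on-stack instance is bounded, no matter how large the cache line of the target configuration is.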
* Re: [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi
2026-05-28 15:13 ` [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
@ 2026-06-04 21:15 ` Dave Hansen
2026-06-05 3:36 ` Chuyi Zhou
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2026-06-04 21:15 UTC (permalink / raw)
To: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel
First, the subject needs some improvement. Add parentheses to
functions(), please. Second, it's literally wrong: "Enable preemption
during native_flush_tlb_multi". It does not do that, or at least it
describes it badly.
It enables preemption during *one* call to native_flush_tlb_multi().
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 29226d112029..d540f54f4d16 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -662,8 +662,10 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
> u8 state;
> int cpu;
> struct kvm_steal_time *src;
> - struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
> + struct cpumask *flushmask;
>
> + guard(preempt)();
> + flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
> cpumask_copy(flushmask, cpumask);
> /*
> * We have to call flush only on online vCPUs. And
This KVM modification is a complete non sequitur. It comes from nowhere.
No comments. No mention in the changelog.
Now, looking at how it's called, I guess flush_tlb_multi() lands here
because of pv_ops. But, please have mercy on the poor reviewers and walk
them through this.
This could also be done in a separate patch. It's OK to disable
preemption twice.
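To illustrate why nesting is harmless, here is a user-space model of a counting preempt-disable with a scope guard. It is only a sketch: the *_model names and PREEMPT_GUARD_MODEL() are made up, and the guard relies on the GCC/Clang cleanup attribute, which is also what the kernel's guard() helpers are built on.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the preemption-disable nesting depth of the current task. */
static int preempt_count_model;

static void preempt_disable_model(void)
{
	preempt_count_model++;
}

static void preempt_enable_model(void)
{
	assert(preempt_count_model > 0);
	preempt_count_model--;
}

static bool preemptible_model(void)
{
	return preempt_count_model == 0;
}

/* Cleanup handler: runs automatically when the guard variable leaves scope. */
static void preempt_guard_exit(int *token)
{
	(void)token;
	preempt_enable_model();
}

/* Hypothetical scope guard in the spirit of guard(preempt)(). */
#define PREEMPT_GUARD_MODEL()						\
	int preempt_guard_token						\
		__attribute__((cleanup(preempt_guard_exit), unused)) =	\
		(preempt_disable_model(), 0)

/* Callee that protects its own per-CPU scratch data. */
static bool callee_with_own_guard(void)
{
	PREEMPT_GUARD_MODEL();

	/* The return value is computed before the cleanup handler runs. */
	return !preemptible_model();
}

/* Caller that already disabled preemption before calling in. */
static bool caller_already_disabled(void)
{
	bool ok;

	preempt_disable_model();
	ok = callee_with_own_guard();
	/* The inner guard only dropped its own level. */
	ok = ok && !preemptible_model();
	preempt_enable_model();

	return ok;
}
```

Because the count nests, the callee can carry its own guard whether or not the caller still disables preemption, which is what allows the KVM change to live in a separate preparatory patch.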
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index cfc3a72477f5..58c6f3d2f993 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1421,9 +1421,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> if (mm_global_asid(mm)) {
> broadcast_tlb_flush(info);
> } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
> + put_cpu();
> info->trim_cpumask = should_trim_cpumask(mm);
> flush_tlb_multi(mm_cpumask(mm), info);
> consider_global_asid(mm);
> + goto invalidate;
> } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
> lockdep_assert_irqs_enabled();
> local_irq_disable();
> @@ -1432,6 +1434,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
> }
>
> put_cpu();
> +invalidate:
> mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
> }
I really don't like the goto. Can this be refactored to not use a goto?
I'd honestly rather it be:
	if (foo) {
		broadcast_tlb_flush(info);
		put_cpu();
	} else if (bar) {
		put_cpu();
		flush_tlb_multi();
	} else {
		flush_tlb_func(info);
		put_cpu();
	}
than have the goto. At least that ^ makes it obvious that each case
needs a put_cpu(). But I also just generally don't like how the code is
structured at this point.
Does anybody have any smart ideas?
> @@ -1691,7 +1694,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> invlpgb_flush_all_nonglobals();
> batch->unmapped_pages = false;
> } else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> + put_cpu();
> flush_tlb_multi(&batch->cpumask, &info);
> + goto clear;
> } else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
> lockdep_assert_irqs_enabled();
> local_irq_disable();
> @@ -1699,9 +1704,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> local_irq_enable();
> }
>
> - cpumask_clear(&batch->cpumask);
> -
> put_cpu();
> +clear:
> + cpumask_clear(&batch->cpumask);
> }
I have the same general complaint about this one. This is really just
hacked into place, leaving a mess for everyone in the future. It needs
some careful refactoring.
* Re: [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-06-04 21:11 ` Nadav Amit
@ 2026-06-04 21:16 ` Dave Hansen
2026-06-04 21:21 ` Nadav Amit
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2026-06-04 21:16 UTC (permalink / raw)
To: Nadav Amit
Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, linux-kernel
On 6/4/26 14:11, Nadav Amit wrote:
>
> That's what led us to move flush_tlb_info into a per-cpu struct (preemption
> was disabled at that time, so there was no trade-off). Now that Zhou claims
> that disabling the preemption is a pain-point, and therefore does not want
> to set flush_tlb_info per-cpu, the solution we've landed on is to put it back
> on the stack - avoiding the need for a single flush_tlb_info that depends on
> preemption being disabled — while limiting the alignment to 64 bytes so the
> stack doesn't grow out of control.
OK, cool. That's great info for a Link: tag. It also seems like you're
happy with this, so would an ack/review tag be appropriate to provide?
* Re: [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-06-04 21:16 ` Dave Hansen
@ 2026-06-04 21:21 ` Nadav Amit
2026-06-05 2:54 ` Chuyi Zhou
0 siblings, 1 reply; 31+ messages in thread
From: Nadav Amit @ 2026-06-04 21:21 UTC (permalink / raw)
To: Dave Hansen
Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, linux-kernel
> On 5 Jun 2026, at 0:16, Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 6/4/26 14:11, Nadav Amit wrote:
>>
>> That's what led us to move flush_tlb_info into a per-cpu struct (preemption
>> was disabled at that time, so there was no trade-off). Now that Zhou claims
>> that disabling the preemption is a pain-point, and therefore does not want
>> to set flush_tlb_info per-cpu, the solution we've landed on is to put it back
>> on the stack - avoiding the need for a single flush_tlb_info that depends on
>> preemption being disabled — while limiting the alignment to 64 bytes so the
>> stack doesn't grow out of control.
>
> OK, cool. That's great info for a Link: tag. It also seems like you're
> happy with this, so would a ack/review tag be appropriate to provide?
An ack tag is fine with me (there’s already a Suggested-by tag). For a
Reviewed-by I should look a bit more thoroughly...
* Re: [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range
2026-05-28 15:13 ` [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
@ 2026-06-04 21:21 ` Dave Hansen
2026-06-05 3:51 ` Chuyi Zhou
0 siblings, 1 reply; 31+ messages in thread
From: Dave Hansen @ 2026-06-04 21:21 UTC (permalink / raw)
To: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel
On 5/28/26 08:13, Chuyi Zhou wrote:
> - info->initiating_cpu = smp_processor_id();
> + info->initiating_cpu = raw_smp_processor_id();
> info->trim_cpumask = 0;
> }
Doesn't this turn ->initiating_cpu into garbage? It doesn't mean
anything any more other than being a random record of the past.
I think it's just used for stats, so not the end of the world. But, the
warning is there for a *REASON*. Please don't just turn it off and
ignore the fallout.
I'm also just generally not sure this is worth it. Kernel TLB flushes
stink. This just makes them stink slightly less. Maybe imperceptibly so.
Is it worth the churn?
* Re: [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack
2026-06-04 21:21 ` Nadav Amit
@ 2026-06-05 2:54 ` Chuyi Zhou
0 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-06-05 2:54 UTC (permalink / raw)
To: Nadav Amit, Dave Hansen
Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
pbonzini, bigeasy, clrkwllms, rostedt, linux-kernel
On 2026-06-05 5:21 a.m., Nadav Amit wrote:
>
>
>> On 5 Jun 2026, at 0:16, Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 6/4/26 14:11, Nadav Amit wrote:
>>>
>>> That's what led us to move flush_tlb_info into a per-cpu struct (preemption
>>> was disabled at that time, so there was no trade-off). Now that Zhou claims
>>> that disabling the preemption is a pain-point, and therefore does not want
>>> to set flush_tlb_info per-cpu, the solution we've landed on is to put it back
>>> on the stack - avoiding the need for a single flush_tlb_info that depends on
>>> preemption being disabled — while limiting the alignment to 64 bytes so the
>>> stack doesn't grow out of control.
>>
>> OK, cool. That's great info for a Link: tag. It also seems like you're
>> happy with this, so would a ack/review tag be appropriate to provide?
>
> Ack tag is fine with me (there’s already suggested-by tag). For reviewed-by
> I should look a bit more thoroughly...
Thanks Dave and Nadav.
Yes, I agree the changelog should be reworded. I did not mean to
question Nadav's original patch or the benchmark methodology. The
per-cpu storage made sense when preemption was disabled, especially
after the earlier stack-growth issue with cacheline-aligned on-stack
storage.
I will update the changelog to describe that history explicitly: the
old stack-aligned version could grow the stack too much on some configs,
which led to the per-cpu storage, while this series needs to remove the
single per-cpu flush_tlb_info dependency as preparation for shortening
the preemption-disabled window. The proposed compromise is to put it
back on the stack, but cap the alignment at 64 bytes.
I will also split this into smaller patches in v7:
1. refactor the flush_tlb_info helper so initialization can use
caller-provided storage, without changing the current per-cpu
storage;
2. add the capped alignment for struct flush_tlb_info;
3. move the actual flush_tlb_info storage from per-cpu back to the
stack.
Nadav, thanks for the ack. I will add this in the next version.
* Re: [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi
2026-06-04 21:15 ` Dave Hansen
@ 2026-06-05 3:36 ` Chuyi Zhou
0 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-06-05 3:36 UTC (permalink / raw)
To: Dave Hansen, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel
On 2026-06-05 5:15 a.m., Dave Hansen wrote:
> First, the subject needs some improvement. Add parenthesis to
> functions(), please. Second, it's literally wrong: "Enable preemption
> during native_flush_tlb_multi". It does not do that or it at least
> describes it badly.
>
> It enables preemption during *one* call to native_flush_tlb_multi().
>
Yes, the subject is misleading. Would the following subject work better?
x86/mm: Re-enable preemption before flush_tlb_multi()
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 29226d112029..d540f54f4d16 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -662,8 +662,10 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>> u8 state;
>> int cpu;
>> struct kvm_steal_time *src;
>> - struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
>> + struct cpumask *flushmask;
>>
>> + guard(preempt)();
>> + flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
>> cpumask_copy(flushmask, cpumask);
>> /*
>> * We have to call flush only on online vCPUs. And
>
> This KVM modification is a complete non sequitur. It comes from nowhere.
> No comments. No mention in the changelog.
>
> Now, looking at how it's called, I guess flush_tlb_multi() lands here
> because of pv_ops. But, please have mercy on the poor reviewers and walk
> them through this.
>
> This could also be done in a separate patch. It's OK to disable
> preemption twice.
>
I will move it into a preparatory patch which disables preemption in
kvm_flush_tlb_multi() while it uses the per-cpu __pv_cpu_mask scratch
cpumask. The changelog will explain that flush_tlb_multi() may reach
kvm_flush_tlb_multi() through pv_ops, so KVM should protect its own
per-cpu storage before the x86/mm callers stop guaranteeing
preemption-disabled context around flush_tlb_multi().
>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>> index cfc3a72477f5..58c6f3d2f993 100644
>> --- a/arch/x86/mm/tlb.c
>> +++ b/arch/x86/mm/tlb.c
>> @@ -1421,9 +1421,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> if (mm_global_asid(mm)) {
>> broadcast_tlb_flush(info);
>> } else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
>> + put_cpu();
>> info->trim_cpumask = should_trim_cpumask(mm);
>> flush_tlb_multi(mm_cpumask(mm), info);
>> consider_global_asid(mm);
>> + goto invalidate;
>> } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
>> lockdep_assert_irqs_enabled();
>> local_irq_disable();
>> @@ -1432,6 +1434,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>> }
>>
>> put_cpu();
>> +invalidate:
>> mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
>> }
>
> I really don't like the goto. Can this be refactored to not use a goto?
>
> I'd honestly rather it be:
>
> if (foo) {
> broadcast_tlb_flush(info);
> put_cpu();
> } else if (bar) {
> put_cpu();
> flush_tlb_multi();
> } else {
> flush_tlb_func(info);
> put_cpu();
> }
>
> than have the goto. At least that ^ makes it obvious that each case
> needs a put_cpu(). But I also just generally don't like how the code is
> structured at this point.
>
> Does anybody have any smart ideas?
Would the following structure look better to you?
	bool remote_flush = false;
	int cpu = get_cpu();
	if (mm_global_asid(mm)) {
		broadcast_tlb_flush(info);
	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
		remote_flush = true;
	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func(info);
		local_irq_enable();
	}
	put_cpu();
	if (remote_flush) {
		info->trim_cpumask = should_trim_cpumask(mm);
		flush_tlb_multi(mm_cpumask(mm), info);
		consider_global_asid(mm);
	}
And similarly for arch_tlbbatch_flush():

	bool remote_flush = false;
	int cpu = get_cpu();
	...
	if (cpu_feature_enabled(X86_FEATURE_INVLPGB) && batch->unmapped_pages) {
		invlpgb_flush_all_nonglobals();
		batch->unmapped_pages = false;
	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
		remote_flush = true;
	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func(&info);
		local_irq_enable();
	}
	put_cpu();

	if (remote_flush)
		flush_tlb_multi(&batch->cpumask, &info);
	cpumask_clear(&batch->cpumask);
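
The control flow in both snippets above is "decide while pinned, do the
remote flush after unpinning". A minimal user-space sketch of that pattern
follows. It is only a model: every model_* name, the flush_kind enum and the
pin_depth counter are hypothetical stand-ins and not kernel APIs, and the
real branches do actual TLB work that is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels, used only to observe which path ran. */
enum flush_kind { FLUSH_NONE, FLUSH_BROADCAST, FLUSH_REMOTE, FLUSH_LOCAL };

static int pin_depth;               /* models the preempt-disable depth */
static int remote_flush_depth = -1; /* pin depth seen by the remote flush */

/* Stand-ins for get_cpu()/put_cpu(): they only track the pin depth. */
static int model_get_cpu(void)
{
	pin_depth++;
	return 0; /* the model always runs on "CPU 0" */
}

static void model_put_cpu(void)
{
	assert(pin_depth > 0);
	pin_depth--;
}

/* Stand-in for flush_tlb_multi(): records whether the caller was pinned. */
static void model_flush_remote(void)
{
	remote_flush_depth = pin_depth;
}

/*
 * Single get/put pair, no goto: the remote case is only recorded while
 * pinned and is carried out after model_put_cpu().
 */
static enum flush_kind model_flush(bool global_asid, bool other_cpus,
				   bool mm_loaded)
{
	enum flush_kind kind = FLUSH_NONE;
	bool remote_flush = false;

	(void)model_get_cpu();
	if (global_asid)
		kind = FLUSH_BROADCAST;	/* broadcast path, done while pinned */
	else if (other_cpus)
		remote_flush = true;	/* defer until after unpinning */
	else if (mm_loaded)
		kind = FLUSH_LOCAL;	/* local flush, done while pinned */
	model_put_cpu();

	if (remote_flush) {
		model_flush_remote();
		kind = FLUSH_REMOTE;
	}
	return kind;
}
```

In this model the remote flush always observes a pin depth of zero, which is
the property the series is after: the IPI wait is the only part that runs
with preemption enabled.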
* Re: [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range
2026-06-04 21:21 ` Dave Hansen
@ 2026-06-05 3:51 ` Chuyi Zhou
0 siblings, 0 replies; 31+ messages in thread
From: Chuyi Zhou @ 2026-06-05 3:51 UTC (permalink / raw)
To: Dave Hansen, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
dave.hansen, pbonzini, bigeasy, clrkwllms, rostedt, nadav.amit
Cc: linux-kernel
On 2026-06-05 5:21 a.m., Dave Hansen wrote:
> On 5/28/26 08:13, Chuyi Zhou wrote:
>> - info->initiating_cpu = smp_processor_id();
>> + info->initiating_cpu = raw_smp_processor_id();
>> info->trim_cpumask = 0;
>> }
>
> Doesn't this turn ->initiating_cpu into garbage? It doesn't mean
> anything any more other than being a random record of the past.
>
> I think it's just used for stats, so not the end of the world. But, the
> warning is there for a *REASON*. Please don't just turn it off and
> ignore the fallout.
>
> I'm also just generally not sure this is worth it. Kernel TLB flushes
> stink. This just makes them stink slightly less. Maybe imperceptibly so.
>
> Is it worth the churn?

Sebastian raised a similar concern earlier[1-2]. The kernel range path
does not really need a full flush_tlb_info: kernel_tlb_flush_all() does
not need it at all, and the range case only needs start/end. But doing
that properly means refactoring the kernel TLB flush path, which would
add more churn to this series.

Given your concern about whether this part is worth it, I think the best
approach is to drop the flush_tlb_kernel_range() preemption patch
from this series. That keeps get_flush_tlb_info() using
smp_processor_id(), avoids making ->initiating_cpu meaningless after
migration, and keeps the current series focused on the mm TLB flush path.

If we still want to optimize flush_tlb_kernel_range(), I can do that
later as a separate patchset, starting with a cleanup that separates the
kernel range/full flush data from struct flush_tlb_info.

[1]https://lore.kernel.org/lkml/20260522104818.CbT5fyN8@linutronix.de/
[2]https://lore.kernel.org/lkml/6265669c-d7d4-45a4-a9d2-2f5e884aa7c5@bytedance.com/
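
To illustrate the cleanup mentioned above, here is a rough user-space sketch
of what a start/end-only descriptor for the kernel range path might look
like. The struct name, the helpers, MODEL_PAGE_SIZE and MODEL_FLUSH_ALL are
all assumptions made for this example; none of them exist in arch/x86/mm/tlb.c,
and the eventual patchset may look quite different.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE	4096UL	/* assumed page size for the model */
#define MODEL_FLUSH_ALL	(~0UL)	/* hypothetical "flush everything" marker */

/*
 * Hypothetical type: carries only what the kernel range path needs.
 * There is no initiating CPU field, so filling it in does not depend
 * on which CPU the caller happens to run on.
 */
struct model_kernel_flush {
	unsigned long start;
	unsigned long end;
};

static bool model_is_full_flush(const struct model_kernel_flush *f)
{
	return f->end == MODEL_FLUSH_ALL;
}

/* Pages a ranged flush would walk; callers check for a full flush first. */
static unsigned long model_nr_pages(const struct model_kernel_flush *f)
{
	assert(!model_is_full_flush(f) && f->end >= f->start);
	return (f->end - f->start + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
}
```

Since such a descriptor records no CPU, it could be built on the stack
without calling smp_processor_id() at all, which is the dependency the
discussion above is concerned with.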
end of thread, other threads:[~2026-06-05 3:51 UTC | newest]
Thread overview: 31+ messages
2026-05-28 15:13 [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 03/12] smp: Refactor remote CPU selection in smp_call_function_any() Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 04/12] smp: Use task-local IPI cpumask in smp_call_function_many_cond() Chuyi Zhou
2026-06-03 10:54 ` Sebastian Andrzej Siewior
2026-06-03 11:48 ` Chuyi Zhou
2026-06-03 12:21 ` Sebastian Andrzej Siewior
2026-05-28 15:13 ` [PATCH v6 05/12] smp: Alloc percpu csd data in smpcfd_prepare_cpu() only once Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
2026-06-03 11:00 ` Sebastian Andrzej Siewior
2026-06-03 11:54 ` Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
2026-06-04 20:54 ` Dave Hansen
2026-06-04 21:11 ` Nadav Amit
2026-06-04 21:16 ` Dave Hansen
2026-06-04 21:21 ` Nadav Amit
2026-06-05 2:54 ` Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
2026-06-04 21:15 ` Dave Hansen
2026-06-05 3:36 ` Chuyi Zhou
2026-05-28 15:13 ` [PATCH v6 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
2026-06-04 21:21 ` Dave Hansen
2026-06-05 3:51 ` Chuyi Zhou
2026-05-28 19:47 ` [PATCH v6 00/12] Allow preemption during IPI completion waiting to improve real-time performance Paul E. McKenney
2026-05-29 3:22 ` Chuyi Zhou
2026-05-29 6:41 ` Sebastian Andrzej Siewior
2026-06-03 11:02 ` Sebastian Andrzej Siewior