[PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance
@ 2026-03-18  4:56 Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
                   ` (11 more replies)
  0 siblings, 12 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Changes in v3:
 - Add benchmarks to measure the performance impact of changing
   flush_tlb_info to stack variable in [PATCH v3 10/12] (suggested by
   peter)
 - Adjust the rcu_read_unlock() location in [PATCH v3 5/12] (suggested
   by muchun)
 - Use raw_smp_processor_id() to prevent warning[1] from
   check_preemption_disabled() in [PATCH v3 12/12].
 - Collect Acked-bys and Reviewed-by.

[1]: https://lore.kernel.org/lkml/20260302075216.2170675-1-zhouchuyi@bytedance.com/T/#mc39999cbeb3f50be176f0903d0fa4075688b073d 

Changes in v2:
 - Simplify the code comments in [PATCH v2 2/12] (pointed by peter and
   muchun)
 - Adjust the preemption disabling logic in smp_call_function_any() in
   [PATCH v2 3/12] (suggested by peter).
 - Use on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in [PATCH V2
   4/12] (pointed by peter)
 - Add [PATCH v2 5/12] to replace migrate_disable with the rcu mechanism
 - Adjust the preemption disabling logic to allow flush_tlb_multi() to be
   preemptible and migratable in [PATCH v2 11/12]
 - Collect Acked-bys and Reviewed-bys

Introduction
============

The vast majority of smp_call_function*() callers block until remote CPUs
complete the IPI function execution. As smp_call_function*() runs with
preemption disabled throughout, scheduling latency increases dramatically
with the number of remote CPUs and other factors (such as interrupts being
disabled).

On x86-64 architectures, TLB flushes are performed via IPIs; thus, during
process exit or when process-mapped pages are reclaimed, numerous IPI
operations must be awaited, leading to increased scheduling latency for
other threads on the current CPU. In our production environment, we
observed IPI wait-induced scheduling latency reaching up to 16ms on a
16-core machine. Our goal is to allow preemption during IPI completion
waiting to improve real-time performance.

Background
============

In our production environments, latency-sensitive workloads (DPDK) are
configured with the highest priority to preempt lower-priority tasks at any
time. We discovered that DPDK's wake-up latency is primarily caused by the
current CPU having preemption disabled. Therefore, we collected the maximum
preemption disabled events within every 30-second interval and then
calculated the P50/P99 of these max preemption disabled events:

                        p50(ns)               p99(ns)
cpu0                   254956                 5465050
cpu1                   115801                 120782
cpu2                   43324                  72957
cpu3                   256637                 16723307
cpu4                   58979                  87237
cpu5                   47464                  79815
cpu6                   48881                  81371
cpu7                   52263                  82294
cpu8                   263555                 4657713
cpu9                   44935                  73962
cpu10                  37659                  65026
cpu11                  257008                 2706878
cpu12                  49669                  90006
cpu13                  45186                  74666
cpu14                  60705                  83866
cpu15                  51311                  86885

Meanwhile, we have collected the distribution of preemption disabling
events exceeding 1ms across different CPUs over several hours(I omitted
CPU data that were all zeros):

CPU        1~10ms   10~50ms   50~100ms
cpu0        29       5       0
cpu3        38       13      0
cpu8        34       6       0
cpu11       24       10      0

The preemption disabled for several milliseconds or even 10ms+ mostly
originates from TLB flush:

@stack[
    trace_preempt_on+143
    trace_preempt_on+143
    preempt_count_sub+67
    arch_tlbbatch_flush/flush_tlb_mm_range
    task_exit/page_reclaim/...
]

Further analysis confirms that the majority of the time is consumed in
csd_lock_wait().

Now smp_call*() always needs to disable preemption, mainly to protect its
internal per‑CPU data structures and synchronize with CPU offline
operations. This patchset attempts to make csd_lock_wait() preemptible,
thereby reducing the preemption‑disabled critical section and improving
kernel real‑time performance.

Effect
======

After applying this patchset, we no longer observe preemption disabled for
more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range path. The
overall P99 of max preemption disabled events in every 30-second is
reduced to around 1.5ms (the remaining latency is primarily due to lock
contention.

                     before patch    after patch    reduced by
                     -----------    --------------  ------------
p99(ns)                16723307        1556034        ~90.70%

Chuyi Zhou (12):
  smp: Disable preemption explicitly in __csd_lock_wait
  smp: Enable preemption early in smp_call_function_single
  smp: Remove get_cpu from smp_call_function_any
  smp: Use on-stack cpumask in smp_call_function_many_cond
  smp: Free call_function_data via RCU in smpcfd_dead_cpu
  smp: Enable preemption early in smp_call_function_many_cond
  smp: Remove preempt_disable from smp_call_function
  smp: Remove preempt_disable from on_each_cpu_cond_mask
  scftorture: Remove preempt_disable in scftorture_invoke_one
  x86/mm: Move flush_tlb_info back to the stack
  x86/mm: Enable preemption during native_flush_tlb_multi
  x86/mm: Enable preemption during flush_tlb_kernel_range

 arch/x86/kernel/kvm.c |   4 +-
 arch/x86/mm/tlb.c     | 137 ++++++++++++++++++------------------------
 kernel/scftorture.c   |   9 +--
 kernel/smp.c          |  81 +++++++++++++++++++------
 4 files changed, 125 insertions(+), 106 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 14:13   ` Steven Rostedt
  2026-03-18  4:56 ` [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

The latter patches will enable preemption before csd_lock_wait(), which
could break csdlock_debug. Because the slice of other tasks on the CPU may
be accounted between ktime_get_mono_fast_ns() calls. Disable preemption
explicitly in __csd_lock_wait(). This is a preparation for the next
patches.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/smp.c b/kernel/smp.c
index f349960f79ca..fc1f7a964616 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -323,6 +323,8 @@ static void __csd_lock_wait(call_single_data_t *csd)
 	int bug_id = 0;
 	u64 ts0, ts1;
 
+	guard(preempt)();
+
 	ts1 = ts0 = ktime_get_mono_fast_ns();
 	for (;;) {
 		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 14:14   ` Steven Rostedt
  2026-03-18  4:56 ` [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any Chuyi Zhou
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Now smp_call_function_single() disables preemption mainly for the following
reasons:

- To protect the per-cpu csd_data from concurrent modification by other
tasks on the current CPU in the !wait case. For the wait case,
synchronization is not a concern as on-stack csd is used.

- To prevent the remote online CPU from being offlined. Specifically, we
want to ensure that no new IPIs are queued after smpcfd_dying_cpu() has
finished.

Disabling preemption for the entire execution is unnecessary, especially
csd_lock_wait() part does not require preemption protection. This patch
enables preemption before csd_lock_wait() to reduce the preemption-disabled
critical section.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index fc1f7a964616..b603d4229f95 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -685,11 +685,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
 
 	err = generic_exec_single(cpu, csd);
 
+	/*
+	 * @csd is stack-allocated when @wait is true. No concurrent access
+	 * except from the IPI completion path, so we can re-enable preemption
+	 * early to reduce latency.
+	 */
+	put_cpu();
+
 	if (wait)
 		csd_lock_wait(csd);
 
-	put_cpu();
-
 	return err;
 }
 EXPORT_SYMBOL(smp_call_function_single);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 14:32   ` Steven Rostedt
  2026-03-18  4:56 ` [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Now smp_call_function_single() would enable preemption before
csd_lock_wait() to reduce the critical section. To allow callers of
smp_call_function_any() to also benefit from this optimization, remove
get_cpu()/put_cpu() from smp_call_function_any().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index b603d4229f95..80daf9dd4a25 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -761,16 +761,26 @@ EXPORT_SYMBOL_GPL(smp_call_function_single_async);
 int smp_call_function_any(const struct cpumask *mask,
 			  smp_call_func_t func, void *info, int wait)
 {
+	bool local = true;
 	unsigned int cpu;
 	int ret;
 
-	/* Try for same CPU (cheapest) */
+	/*
+	 * Prevent migration to another CPU after selecting the current CPU
+	 * as the target.
+	 */
 	cpu = get_cpu();
-	if (!cpumask_test_cpu(cpu, mask))
+
+	/* Try for same CPU (cheapest) */
+	if (!cpumask_test_cpu(cpu, mask)) {
 		cpu = sched_numa_find_nth_cpu(mask, 0, cpu_to_node(cpu));
+		local = false;
+		put_cpu();
+	}
 
 	ret = smp_call_function_single(cpu, func, info, wait);
-	put_cpu();
+	if (local)
+		put_cpu();
 	return ret;
 }
 EXPORT_SYMBOL_GPL(smp_call_function_any);
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (2 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 14:38   ` Steven Rostedt
  2026-03-18 15:55   ` Sebastian Andrzej Siewior
  2026-03-18  4:56 ` [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu Chuyi Zhou
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

This patch use on-stack cpumask to replace percpu cfd cpumask in
smp_call_function_many_cond(). Note that when both CONFIG_CPUMASK_OFFSTACK
and PREEMPT_RT are enabled, allocation during preempt-disabled section
would break RT. Therefore, only do this when CONFIG_CPUMASK_OFFSTACK=n.
This is a preparation for enabling preemption during csd_lock_wait() in
smp_call_function_many_cond().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 80daf9dd4a25..9728ba55944d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -799,14 +799,25 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 					unsigned int scf_flags,
 					smp_cond_func_t cond_func)
 {
+	bool preemptible_wait = !IS_ENABLED(CONFIG_CPUMASK_OFFSTACK);
 	int cpu, last_cpu, this_cpu = smp_processor_id();
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
+	cpumask_var_t cpumask_stack;
+	struct cpumask *cpumask;
 	int nr_cpus = 0;
 	bool run_remote = false;
 
 	lockdep_assert_preemption_disabled();
 
+	cfd = this_cpu_ptr(&cfd_data);
+	cpumask = cfd->cpumask;
+
+	if (preemptible_wait) {
+		BUILD_BUG_ON(!alloc_cpumask_var(&cpumask_stack, GFP_ATOMIC));
+		cpumask = cpumask_stack;
+	}
+
 	/*
 	 * Can deadlock when called with interrupts disabled.
 	 * We allow cpu's that are not yet online though, as no one else can
@@ -827,16 +838,15 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
 	/* Check if we need remote execution, i.e., any CPU excluding this one. */
 	if (cpumask_any_and_but(mask, cpu_online_mask, this_cpu) < nr_cpu_ids) {
-		cfd = this_cpu_ptr(&cfd_data);
-		cpumask_and(cfd->cpumask, mask, cpu_online_mask);
-		__cpumask_clear_cpu(this_cpu, cfd->cpumask);
+		cpumask_and(cpumask, mask, cpu_online_mask);
+		__cpumask_clear_cpu(this_cpu, cpumask);
 
 		cpumask_clear(cfd->cpumask_ipi);
-		for_each_cpu(cpu, cfd->cpumask) {
+		for_each_cpu(cpu, cpumask) {
 			call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);
 
 			if (cond_func && !cond_func(cpu, info)) {
-				__cpumask_clear_cpu(cpu, cfd->cpumask);
+				__cpumask_clear_cpu(cpu, cpumask);
 				continue;
 			}
 
@@ -887,13 +897,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	}
 
 	if (run_remote && wait) {
-		for_each_cpu(cpu, cfd->cpumask) {
+		for_each_cpu(cpu, cpumask) {
 			call_single_data_t *csd;
 
 			csd = per_cpu_ptr(cfd->csd, cpu);
 			csd_lock_wait(csd);
 		}
 	}
+
+	if (preemptible_wait)
+		free_cpumask_var(cpumask_stack);
 }
 
 /**
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (3 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 16:19   ` Sebastian Andrzej Siewior
  2026-03-18  4:56 ` [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Use rcu_read_lock() to protect csd in smp_call_function_many_cond() and
wait for all read critical sections to exit before releasing percpu csd
data. This is preparation for enabling preemption during csd_lock_wait()
and can prevent accessing cfd->csd data that has already been freed in
smpcfd_dead_cpu().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/smp.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/smp.c b/kernel/smp.c
index 9728ba55944d..32c293d8be0e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -77,6 +77,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
 {
 	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
 
+	synchronize_rcu();
 	free_cpumask_var(cfd->cpumask);
 	free_cpumask_var(cfd->cpumask_ipi);
 	free_percpu(cfd->csd);
@@ -810,6 +811,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 
 	lockdep_assert_preemption_disabled();
 
+	rcu_read_lock();
 	cfd = this_cpu_ptr(&cfd_data);
 	cpumask = cfd->cpumask;
 
@@ -905,6 +907,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		}
 	}
 
+	rcu_read_unlock();
 	if (preemptible_wait)
 		free_cpumask_var(cpumask_stack);
 }
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (4 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 16:55   ` Sebastian Andrzej Siewior
  2026-03-18  4:56 ` [PATCH v3 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Now smp_call_function_many_cond() disables preemption mainly for the
following reasons:

- To prevent the remote online CPU from going offline. Specifically, we
want to ensure that no new csds are queued after smpcfd_dying_cpu() has
finished. Therefore, preemption must be disabled until all necessary IPIs
are sent.

- To prevent migration to another CPU, which also implicitly prevents the
current CPU from going offline (since stop_machine requires preempting the
current task to execute offline callbacks).

- To protect the per-cpu cfd_data from concurrent modification by other
smp_call_*() on the current CPU. cfd_data contains cpumasks and per-cpu
csds. Before enqueueing a csd, we block on the csd_lock() to ensure the
previous asyc csd->func() has completed, and then initialize csd->func and
csd->info. After sending the IPI, we spin-wait for the remote CPU to call
csd_unlock(). Actually the csd_lock mechanism already guarantees csd
serialization. If preemption occurs during csd_lock_wait, other concurrent
smp_call_function_many_cond calls will simply block until the previous
csd->func() completes:

task A                    task B

sd->func = fun_a
send ipis

                preempted by B
               --------------->
                        csd_lock(csd); // block until last
                                       // fun_a finished

                        csd->func = func_b;
                        csd->info = info;
                            ...
                        send ipis

                switch back to A
                <---------------

csd_lock_wait(csd); // block until remote finish func_*

This patch enables preemption before csd_lock_wait() which makes the
potentially unpredictable csd_lock_wait() preemptible and migratable.
Note that being migrated to another CPU and calling csd_lock_wait() may
cause UAF due to smpcfd_dead_cpu() during the current CPU offline process.
Previous patch used the RCU mechanism to synchronize csd_lock_wait()
with smpcfd_dead_cpu() to prevent the above UAF issue.

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/smp.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 32c293d8be0e..18e7e4a8f1b6 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -801,7 +801,7 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 					smp_cond_func_t cond_func)
 {
 	bool preemptible_wait = !IS_ENABLED(CONFIG_CPUMASK_OFFSTACK);
-	int cpu, last_cpu, this_cpu = smp_processor_id();
+	int cpu, last_cpu, this_cpu;
 	struct call_function_data *cfd;
 	bool wait = scf_flags & SCF_WAIT;
 	cpumask_var_t cpumask_stack;
@@ -809,9 +809,9 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 	int nr_cpus = 0;
 	bool run_remote = false;
 
-	lockdep_assert_preemption_disabled();
-
 	rcu_read_lock();
+	this_cpu = get_cpu();
+
 	cfd = this_cpu_ptr(&cfd_data);
 	cpumask = cfd->cpumask;
 
@@ -898,6 +898,19 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		local_irq_restore(flags);
 	}
 
+	/*
+	 * We may block in csd_lock_wait() for a significant amount of time,
+	 * especially when interrupts are disabled or with a large number of
+	 * remote CPUs. Try to enable preemption before csd_lock_wait().
+	 *
+	 * Use the cpumask_stack instead of cfd->cpumask to avoid concurrency
+	 * modification from tasks on the same cpu. If preemption occurs during
+	 * csd_lock_wait, other concurrent smp_call_function_many_cond() calls
+	 * will simply block until the previous csd->func() complete.
+	 */
+	if (preemptible_wait)
+		put_cpu();
+
 	if (run_remote && wait) {
 		for_each_cpu(cpu, cpumask) {
 			call_single_data_t *csd;
@@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
 		}
 	}
 
-	rcu_read_unlock();
-	if (preemptible_wait)
+	if (!preemptible_wait)
+		put_cpu();
+	else
 		free_cpumask_var(cpumask_stack);
+	rcu_read_unlock();
 }
 
 /**
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 07/12] smp: Remove preempt_disable from smp_call_function
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (5 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Now smp_call_function_many_cond() internally handles the preemption logic,
so smp_call_function() does not need to explicitly disable preemption.
Remove preempt_{enable, disable} from smp_call_function().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 18e7e4a8f1b6..f9c0028968ef 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -966,9 +966,8 @@ EXPORT_SYMBOL(smp_call_function_many);
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait)
 {
-	preempt_disable();
-	smp_call_function_many(cpu_online_mask, func, info, wait);
-	preempt_enable();
+	smp_call_function_many_cond(cpu_online_mask, func, info,
+			wait ? SCF_WAIT : 0, NULL);
 }
 EXPORT_SYMBOL(smp_call_function);
 
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (6 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Now smp_call_function_many_cond() internally handles the preemption logic,
so on_each_cpu_cond_mask does not need to explicitly disable preemption.
Remove preempt_{enable, disable} from on_each_cpu_cond_mask().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
---
 kernel/smp.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f9c0028968ef..47c3b057f57f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -1086,9 +1086,7 @@ void on_each_cpu_cond_mask(smp_cond_func_t cond_func, smp_call_func_t func,
 	if (wait)
 		scf_flags |= SCF_WAIT;
 
-	preempt_disable();
 	smp_call_function_many_cond(mask, func, info, scf_flags, cond_func);
-	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_cond_mask);
 
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (7 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Previous patches make smp_call*() handle preemption logic internally.
Now the preempt_disable() by most callers becomes unnecessary and can
therefore be removed. Remove preempt_{enable, disable} in
scftorture_invoke_one().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 kernel/scftorture.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/kernel/scftorture.c b/kernel/scftorture.c
index 327c315f411c..b87215e40be5 100644
--- a/kernel/scftorture.c
+++ b/kernel/scftorture.c
@@ -364,8 +364,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 	}
 	if (use_cpus_read_lock)
 		cpus_read_lock();
-	else
-		preempt_disable();
 	switch (scfsp->scfs_prim) {
 	case SCF_PRIM_RESCHED:
 		if (IS_BUILTIN(CONFIG_SCF_TORTURE_TEST)) {
@@ -411,13 +409,10 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 		if (!ret) {
 			if (use_cpus_read_lock)
 				cpus_read_unlock();
-			else
-				preempt_enable();
+
 			wait_for_completion(&scfcp->scfc_completion);
 			if (use_cpus_read_lock)
 				cpus_read_lock();
-			else
-				preempt_disable();
 		} else {
 			scfp->n_single_rpc_ofl++;
 			scf_add_to_free_list(scfcp);
@@ -463,8 +458,6 @@ static void scftorture_invoke_one(struct scf_statistics *scfp, struct torture_ra
 	}
 	if (use_cpus_read_lock)
 		cpus_read_unlock();
-	else
-		preempt_enable();
 	if (allocfail)
 		schedule_timeout_idle((1 + longwait) * HZ);  // Let no-wait handlers complete.
 	else if (!(torture_random(trsp) & 0xfff))
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (8 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18 17:21   ` Sebastian Andrzej Siewior
  2026-03-18  4:56 ` [PATCH v3 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
  11 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
stack") converted flush_tlb_info from stack variable to per-CPU variable.
This brought about a performance improvement of around 3% in extreme test.
However, it also required that all flush_tlb* operations keep preemption
disabled entirely to prevent concurrent modifications of flush_tlb_info.
flush_tlb* needs to send IPIs to remote CPUs and synchronously wait for
all remote CPUs to complete their local TLB flushes. The process could
take tens of milliseconds when interrupts are disabled or with a large
number of remote CPUs.

From the perspective of improving kernel real-time performance, this patch
reverts flush_tlb_info back to stack variables. This is a preparation for
enabling preemption during TLB flush in next patch.

To evaluate the performance impact of this patch, use the following script
to reproduce the microbenchmark mentioned in commit 3db6d5a5ecaf
("x86/mm/tlb: Remove 'struct flush_tlb_info' from the stack"). The test
environment is an Ice Lake system (Intel(R) Xeon(R) Platinum 8336C) with
128 CPUs and 2 NUMA nodes:

    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    #include <sys/mman.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define NUM_OPS 1000000
    #define NUM_THREADS 3
    #define NUM_RUNS 5
    #define PAGE_SIZE 4096

    volatile int stop_threads = 0;

    void *busy_wait_thread(void *arg) {
        while (!stop_threads) {
            __asm__ volatile ("nop");
        }
        return NULL;
    }

    long long get_usec() {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec * 1000000LL + tv.tv_usec;
    }

    int main() {
        pthread_t threads[NUM_THREADS];
        char *addr;
        int i, r;
        addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
		| MAP_ANONYMOUS, -1, 0);

        if (addr == MAP_FAILED) {
            perror("mmap");
            exit(1);
        }

        for (i = 0; i < NUM_THREADS; i++) {
            if (pthread_create(&threads[i], NULL, busy_wait_thread, NULL))
                exit(1);
        }

        printf("Running benchmark: %d runs, %d ops each, %d background\n"
               "threads\n", NUM_RUNS, NUM_OPS, NUM_THREADS);

        for (r = 0; r < NUM_RUNS; r++) {
            long long start, end;
            start = get_usec();
            for (i = 0; i < NUM_OPS; i++) {
                addr[0] = 1;
                if (madvise(addr, PAGE_SIZE, MADV_DONTNEED)) {
                    perror("madvise");
                    exit(1);
                }
            }
            end = get_usec();
            double duration = (double)(end - start);
            double avg_lat = duration / NUM_OPS;
            printf("Run %d: Total time %.2f us, Avg latency %.4f us/op\n",
            r + 1, duration, avg_lat);
        }
        stop_threads = 1;
        for (i = 0; i < NUM_THREADS; i++)
            pthread_join(threads[i], NULL);
        munmap(addr, PAGE_SIZE);
        return 0;
    }

Using the per-cpu flush_tlb_info showed only a very marginal performance
advantage, approximately 1%.

                             base            on-stack
                             ----            ---------
       avg (usec/op)         5.9362           5.9956   (+1%)
       stddev                0.0240           0.0096

And for the mmtest/stress-ng-madvise test, which randomly calls madvise on
pages within a mmap range and triggers a large number of high-frequency TLB
flushes, no significant performance regression was observed.

				 baseline              on-stack

Amean     bops-madvise-1        13.64 (   0.00%)      13.56 (   0.59%)
Amean     bops-madvise-2        27.32 (   0.00%)      27.26 (   0.24%)
Amean     bops-madvise-4        53.35 (   0.00%)      53.54 (  -0.35%)
Amean     bops-madvise-8        103.09 (   0.00%)     103.30 (  -0.20%)
Amean     bops-madvise-16       191.88 (   0.00%)     191.75 (   0.07%)
Amean     bops-madvise-32       287.98 (   0.00%)     291.01 *  -1.05%*
Amean     bops-madvise-64       365.84 (   0.00%)     368.09 *  -0.61%*
Amean     bops-madvise-128      422.72 (   0.00%)     423.47 (  -0.18%)
Amean     bops-madvise-256      435.61 (   0.00%)     435.63 (  -0.01%)

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 arch/x86/mm/tlb.c | 124 ++++++++++++++++++----------------------------
 1 file changed, 49 insertions(+), 75 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index af43d177087e..4704200de3f0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1373,71 +1373,30 @@ void flush_tlb_multi(const struct cpumask *cpumask,
  */
 unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
-#endif
-
-static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
-			unsigned long start, unsigned long end,
-			unsigned int stride_shift, bool freed_tables,
-			u64 new_tlb_gen)
+void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, unsigned int stride_shift,
+				bool freed_tables)
 {
-	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
+	int cpu = get_cpu();
 
-#ifdef CONFIG_DEBUG_VM
-	/*
-	 * Ensure that the following code is non-reentrant and flush_tlb_info
-	 * is not overwritten. This means no TLB flushing is initiated by
-	 * interrupt handlers and machine-check exception handlers.
-	 */
-	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
-#endif
+	struct flush_tlb_info info = {
+		.mm = mm,
+		.stride_shift = stride_shift,
+		.freed_tables = freed_tables,
+		.trim_cpumask = 0,
+		.initiating_cpu = cpu,
+	};
 
-	/*
-	 * If the number of flushes is so large that a full flush
-	 * would be faster, do a full flush.
-	 */
 	if ((end - start) >> stride_shift > tlb_single_page_flush_ceiling) {
 		start = 0;
 		end = TLB_FLUSH_ALL;
 	}
 
-	info->start		= start;
-	info->end		= end;
-	info->mm		= mm;
-	info->stride_shift	= stride_shift;
-	info->freed_tables	= freed_tables;
-	info->new_tlb_gen	= new_tlb_gen;
-	info->initiating_cpu	= smp_processor_id();
-	info->trim_cpumask	= 0;
-
-	return info;
-}
-
-static void put_flush_tlb_info(void)
-{
-#ifdef CONFIG_DEBUG_VM
-	/* Complete reentrancy prevention checks */
-	barrier();
-	this_cpu_dec(flush_tlb_info_idx);
-#endif
-}
-
-void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
-				unsigned long end, unsigned int stride_shift,
-				bool freed_tables)
-{
-	struct flush_tlb_info *info;
-	int cpu = get_cpu();
-	u64 new_tlb_gen;
-
 	/* This is also a barrier that synchronizes with switch_mm(). */
-	new_tlb_gen = inc_mm_tlb_gen(mm);
+	info.new_tlb_gen = inc_mm_tlb_gen(mm);
 
-	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-				  new_tlb_gen);
+	info.start = start;
+	info.end = end;
 
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
@@ -1445,19 +1404,18 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	 * flush_tlb_func_local() directly in this case.
 	 */
 	if (mm_global_asid(mm)) {
-		broadcast_tlb_flush(info);
+		broadcast_tlb_flush(&info);
 	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
-		info->trim_cpumask = should_trim_cpumask(mm);
-		flush_tlb_multi(mm_cpumask(mm), info);
+		info.trim_cpumask = should_trim_cpumask(mm);
+		flush_tlb_multi(mm_cpumask(mm), &info);
 		consider_global_asid(mm);
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func(info);
+		flush_tlb_func(&info);
 		local_irq_enable();
 	}
 
-	put_flush_tlb_info();
 	put_cpu();
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
@@ -1527,19 +1485,29 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	struct flush_tlb_info *info;
+	struct flush_tlb_info info = {
+		.mm = NULL,
+		.stride_shift = PAGE_SHIFT,
+		.freed_tables = false,
+		.trim_cpumask = 0,
+		.new_tlb_gen = TLB_GENERATION_INVALID
+	};
 
 	guard(preempt)();
 
-	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
-				  TLB_GENERATION_INVALID);
+	if ((end - start) >> PAGE_SHIFT > tlb_single_page_flush_ceiling) {
+		start = 0;
+		end = TLB_FLUSH_ALL;
+	}
 
-	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
-	else
-		kernel_tlb_flush_range(info);
+	info.initiating_cpu = smp_processor_id(),
+	info.start = start;
+	info.end = end;
 
-	put_flush_tlb_info();
+	if (info.end == TLB_FLUSH_ALL)
+		kernel_tlb_flush_all(&info);
+	else
+		kernel_tlb_flush_range(&info);
 }
 
 /*
@@ -1707,12 +1675,19 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	struct flush_tlb_info *info;
-
 	int cpu = get_cpu();
 
-	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
-				  TLB_GENERATION_INVALID);
+	struct flush_tlb_info info = {
+		.start = 0,
+		.end = TLB_FLUSH_ALL,
+		.mm = NULL,
+		.stride_shift = 0,
+		.freed_tables = false,
+		.new_tlb_gen = TLB_GENERATION_INVALID,
+		.initiating_cpu = cpu,
+		.trim_cpumask = 0,
+	};
+
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
 	 * a local TLB flush is needed. Optimize this use-case by calling
@@ -1722,17 +1697,16 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		invlpgb_flush_all_nonglobals();
 		batch->unmapped_pages = false;
 	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
-		flush_tlb_multi(&batch->cpumask, info);
+		flush_tlb_multi(&batch->cpumask, &info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func(info);
+		flush_tlb_func(&info);
 		local_irq_enable();
 	}
 
 	cpumask_clear(&batch->cpumask);
 
-	put_flush_tlb_info();
 	put_cpu();
 }
 
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 11/12] x86/mm: Enable preemption during native_flush_tlb_multi
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (9 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  2026-03-18  4:56 ` [PATCH v3 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
  11 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

native_flush_tlb_multi() may be frequently called by flush_tlb_mm_range()
and arch_tlbbatch_flush() in production environments. When pages are
reclaimed or process exit, native_flush_tlb_multi() sends IPIs to remote
CPUs and waits for all remote CPUs to complete their local TLB flushes.
The overall latency may reach tens of milliseconds due to a large number of
remote CPUs and other factors (such as interrupts being disabled). Since
flush_tlb_mm_range() and arch_tlbbatch_flush() always disable preemption,
which may cause increased scheduling latency for other threads on the
current CPU.

Previous patch converted flush_tlb_info from per-cpu variable to on-stack
variable. Additionally, it's no longer necessary to explicitly disable
preemption before calling smp_call*() since they internally handle the
preemption logic. Now it's safe to enable preemption during
native_flush_tlb_multi().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 arch/x86/kernel/kvm.c | 4 +++-
 arch/x86/mm/tlb.c     | 9 +++++++--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3bc062363814..4f7f4c1149b9 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -668,8 +668,10 @@ static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
 	u8 state;
 	int cpu;
 	struct kvm_steal_time *src;
-	struct cpumask *flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
+	struct cpumask *flushmask;
 
+	guard(preempt)();
+	flushmask = this_cpu_cpumask_var_ptr(__pv_cpu_mask);
 	cpumask_copy(flushmask, cpumask);
 	/*
 	 * We have to call flush only on online vCPUs. And
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4704200de3f0..73500376d185 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1406,9 +1406,11 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	if (mm_global_asid(mm)) {
 		broadcast_tlb_flush(&info);
 	} else if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+		put_cpu();
 		info.trim_cpumask = should_trim_cpumask(mm);
 		flush_tlb_multi(mm_cpumask(mm), &info);
 		consider_global_asid(mm);
+		goto invalidate;
 	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
@@ -1417,6 +1419,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 	}
 
 	put_cpu();
+invalidate:
 	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
@@ -1697,7 +1700,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		invlpgb_flush_all_nonglobals();
 		batch->unmapped_pages = false;
 	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+		put_cpu();
 		flush_tlb_multi(&batch->cpumask, &info);
+		goto clear;
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
@@ -1705,9 +1710,9 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		local_irq_enable();
 	}
 
-	cpumask_clear(&batch->cpumask);
-
 	put_cpu();
+clear:
+	cpumask_clear(&batch->cpumask);
 }
 
 /*
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v3 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range
  2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
                   ` (10 preceding siblings ...)
  2026-03-18  4:56 ` [PATCH v3 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
@ 2026-03-18  4:56 ` Chuyi Zhou
  11 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-18  4:56 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

flush_tlb_kernel_range() is invoked when kernel memory mapping changes.
On x86 platforms without the INVLPGB feature enabled, we need to send IPIs
to every online CPU and synchronously wait for them to complete
do_kernel_range_flush(). This process can be time-consuming due to factors
such as a large number of CPUs or other issues (like interrupts being
disabled). flush_tlb_kernel_range() always disables preemption, this may
affect the scheduling latency of other tasks on the current CPU.

Previous patch converted flush_tlb_info from per-cpu variable to on-stack
variable. Additionally, it's no longer necessary to explicitly disable
preemption before calling smp_call*() since they internally handles the
preemption logic. Now is's safe to enable preemption during
flush_tlb_kernel_range().

Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 73500376d185..b89949d4fb31 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1446,6 +1446,8 @@ static void invlpgb_kernel_range_flush(struct flush_tlb_info *info)
 {
 	unsigned long addr, nr;

+	guard(preempt)();
+
 	for (addr = info->start; addr < info->end; addr += nr << PAGE_SHIFT) {
 		nr = (info->end - addr) >> PAGE_SHIFT;

@@ -1496,14 +1498,12 @@ void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 		.new_tlb_gen = TLB_GENERATION_INVALID
 	};

-	guard(preempt)();
-
 	if ((end - start) >> PAGE_SHIFT > tlb_single_page_flush_ceiling) {
 		start = 0;
 		end = TLB_FLUSH_ALL;
 	}

-	info.initiating_cpu = smp_processor_id(),
+	info.initiating_cpu = raw_smp_processor_id(),
 	info.start = start;
 	info.end = end;

-- 
2.20.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait
  2026-03-18  4:56 ` [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
@ 2026-03-18 14:13   ` Steven Rostedt
  0 siblings, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2026-03-18 14:13 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, linux-kernel

On Wed, 18 Mar 2026 12:56:27 +0800
"Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:

> The latter patches will enable preemption before csd_lock_wait(), which
> could break csdlock_debug. Because the slice of other tasks on the CPU may
> be accounted between ktime_get_mono_fast_ns() calls. Disable preemption
> explicitly in __csd_lock_wait(). This is a preparation for the next
> patches.
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> Acked-by: Muchun Song <muchun.song@linux.dev>

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

> ---
>  kernel/smp.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f349960f79ca..fc1f7a964616 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -323,6 +323,8 @@ static void __csd_lock_wait(call_single_data_t *csd)
>  	int bug_id = 0;
>  	u64 ts0, ts1;
>  
> +	guard(preempt)();
> +
>  	ts1 = ts0 = ktime_get_mono_fast_ns();
>  	for (;;) {
>  		if (csd_lock_wait_toolong(csd, ts0, &ts1, &bug_id, &nmessages))


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single
  2026-03-18  4:56 ` [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
@ 2026-03-18 14:14   ` Steven Rostedt
  2026-03-19  2:30     ` Chuyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2026-03-18 14:14 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, linux-kernel

On Wed, 18 Mar 2026 12:56:28 +0800
"Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:

> diff --git a/kernel/smp.c b/kernel/smp.c
> index fc1f7a964616..b603d4229f95 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -685,11 +685,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
>  
>  	err = generic_exec_single(cpu, csd);
>  
> +	/*
> +	 * @csd is stack-allocated when @wait is true. No concurrent access
> +	 * except from the IPI completion path, so we can re-enable preemption
> +	 * early to reduce latency.
> +	 */

Thanks for the comment. I walked through the code and this looks fine to me.

> +	put_cpu();
> +
>  	if (wait)
>  		csd_lock_wait(csd);
>  
> -	put_cpu();
> -
>  	return err;

Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>

-- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any
  2026-03-18  4:56 ` [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any Chuyi Zhou
@ 2026-03-18 14:32   ` Steven Rostedt
  2026-03-18 15:39     ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Steven Rostedt @ 2026-03-18 14:32 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, linux-kernel

On Wed, 18 Mar 2026 12:56:29 +0800
"Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:

> Now smp_call_function_single() would enable preemption before
> csd_lock_wait() to reduce the critical section. To allow callers of
> smp_call_function_any() to also benefit from this optimization, remove
> get_cpu()/put_cpu() from smp_call_function_any().
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> ---
>  kernel/smp.c | 16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index b603d4229f95..80daf9dd4a25 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -761,16 +761,26 @@ EXPORT_SYMBOL_GPL(smp_call_function_single_async);
>  int smp_call_function_any(const struct cpumask *mask,
>  			  smp_call_func_t func, void *info, int wait)
>  {
> +	bool local = true;
>  	unsigned int cpu;
>  	int ret;
>  
> -	/* Try for same CPU (cheapest) */
> +	/*
> +	 * Prevent migration to another CPU after selecting the current CPU
> +	 * as the target.
> +	 */
>  	cpu = get_cpu();
> -	if (!cpumask_test_cpu(cpu, mask))
> +
> +	/* Try for same CPU (cheapest) */
> +	if (!cpumask_test_cpu(cpu, mask)) {
>  		cpu = sched_numa_find_nth_cpu(mask, 0, cpu_to_node(cpu));

Hmm, isn't this looking for another CPU that is closest to the current CPU?

By allowing migration, it is possible that the task will migrate to another
CPU where this will pick one that is much farther.

I'm not sure if that's really an issue or not, as I believe it's mostly for
performance reasons. Then again, why even keep preemption disabled for the
current CPU case? Isn't disabling preemption more for performance than
correctness?

Perhaps migrate_disable() is all that is needed?

-- Steve


> +		local = false;
> +		put_cpu();
> +	}
>  
>  	ret = smp_call_function_single(cpu, func, info, wait);
> -	put_cpu();
> +	if (local)
> +		put_cpu();
>  	return ret;
>  }
>  EXPORT_SYMBOL_GPL(smp_call_function_any);


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond
  2026-03-18  4:56 ` [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
@ 2026-03-18 14:38   ` Steven Rostedt
  2026-03-18 15:55   ` Sebastian Andrzej Siewior
  1 sibling, 0 replies; 34+ messages in thread
From: Steven Rostedt @ 2026-03-18 14:38 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, linux-kernel

On Wed, 18 Mar 2026 12:56:30 +0800
"Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:

> diff --git a/kernel/smp.c b/kernel/smp.c
> index 80daf9dd4a25..9728ba55944d 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -799,14 +799,25 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  					unsigned int scf_flags,
>  					smp_cond_func_t cond_func)
>  {
> +	bool preemptible_wait = !IS_ENABLED(CONFIG_CPUMASK_OFFSTACK);
>  	int cpu, last_cpu, this_cpu = smp_processor_id();
>  	struct call_function_data *cfd;
>  	bool wait = scf_flags & SCF_WAIT;
> +	cpumask_var_t cpumask_stack;
> +	struct cpumask *cpumask;
>  	int nr_cpus = 0;
>  	bool run_remote = false;

> @@ -887,13 +897,16 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  	}
>  
>  	if (run_remote && wait) {
> -		for_each_cpu(cpu, cfd->cpumask) {
> +		for_each_cpu(cpu, cpumask) {
>  			call_single_data_t *csd;
>  
>  			csd = per_cpu_ptr(cfd->csd, cpu);
>  			csd_lock_wait(csd);
>  		}
>  	}
> +
> +	if (preemptible_wait)
> +		free_cpumask_var(cpumask_stack);

Ironic, that preemptible_wait is only true if !CONFIG_CPUMASK_OFFSTACK, and
free_cpumask_var() is defined as:

// #ifndef CONFIG_CPUMASK_OFFSTACK
static __always_inline void free_cpumask_var(cpumask_var_t mask)
{
}

So basically the above is just a compiler exercise to insert a nop :-/

-- Steve


>  }

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any
  2026-03-18 14:32   ` Steven Rostedt
@ 2026-03-18 15:39     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 15:39 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
	dave.hansen, pbonzini, clrkwllms, linux-kernel

On 2026-03-18 10:32:37 [-0400], Steven Rostedt wrote:
> On Wed, 18 Mar 2026 12:56:29 +0800
> "Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:
> 
> > Now smp_call_function_single() would enable preemption before
> > csd_lock_wait() to reduce the critical section. To allow callers of
> > smp_call_function_any() to also benefit from this optimization, remove
> > get_cpu()/put_cpu() from smp_call_function_any().
> > 
> > Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> > Reviewed-by: Muchun Song <muchun.song@linux.dev>
> > ---
> >  kernel/smp.c | 16 +++++++++++++---
> >  1 file changed, 13 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/smp.c b/kernel/smp.c
> > index b603d4229f95..80daf9dd4a25 100644
> > --- a/kernel/smp.c
> > +++ b/kernel/smp.c
> > @@ -761,16 +761,26 @@ EXPORT_SYMBOL_GPL(smp_call_function_single_async);
> >  int smp_call_function_any(const struct cpumask *mask,
> >  			  smp_call_func_t func, void *info, int wait)
> >  {
> > +	bool local = true;
> >  	unsigned int cpu;
> >  	int ret;
> >  
> > -	/* Try for same CPU (cheapest) */
> > +	/*
> > +	 * Prevent migration to another CPU after selecting the current CPU
> > +	 * as the target.
> > +	 */
> >  	cpu = get_cpu();
> > -	if (!cpumask_test_cpu(cpu, mask))
> > +
> > +	/* Try for same CPU (cheapest) */
> > +	if (!cpumask_test_cpu(cpu, mask)) {
> >  		cpu = sched_numa_find_nth_cpu(mask, 0, cpu_to_node(cpu));
> 
> Hmm, isn't this looking for another CPU that is closest to the current CPU?
> 
> By allowing migration, it is possible that the task will migrate to another
> CPU where this will pick one that is much farther.
> 
> I'm not sure if that's really an issue or not, as I believe it's mostly for
> performance reasons. Then again, why even keep preemption disabled for the
> current CPU case? Isn't disabling preemption more for performance than
> correctness?
> 
> Perhaps migrate_disable() is all that is needed?

You want the local-CPU so it is cheap as in direct invocation. But if
you migrate away between the get_cpu() and the check in
generic_exec_single() you do trigger an interrupt.
A migration here would not be wrong, it would just not be as cheap as
intended. Not sure if migrate_disable() counters the cheap part again.

An alternative would be to have the local execution in
__generic_exec_local() and have it run with disabled preemption and a
fallback with enabled preemption where it is enqueue and probably waited
for.

> -- Steve

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond
  2026-03-18  4:56 ` [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
  2026-03-18 14:38   ` Steven Rostedt
@ 2026-03-18 15:55   ` Sebastian Andrzej Siewior
  2026-03-19  3:02     ` Chuyi Zhou
  1 sibling, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 15:55 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

On 2026-03-18 12:56:30 [+0800], Chuyi Zhou wrote:
> This patch use on-stack cpumask to replace percpu cfd cpumask in
> smp_call_function_many_cond(). Note that when both CONFIG_CPUMASK_OFFSTACK
> and PREEMPT_RT are enabled, allocation during preempt-disabled section
> would break RT. Therefore, only do this when CONFIG_CPUMASK_OFFSTACK=n.
> This is a preparation for enabling preemption during csd_lock_wait() in
> smp_call_function_many_cond().

You explained why we do this only for !CONFIG_CPUMASK_OFFSTACK but
failed to explain why we need a function local cpumask. Other than
preparation step. But this allocation looks pointless, let me look
further…

> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> Reviewed-by: Muchun Song <muchun.song@linux.dev>

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu
  2026-03-18  4:56 ` [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu Chuyi Zhou
@ 2026-03-18 16:19   ` Sebastian Andrzej Siewior
  2026-03-19  7:48     ` Chuyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 16:19 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

On 2026-03-18 12:56:31 [+0800], Chuyi Zhou wrote:
> Use rcu_read_lock() to protect csd in smp_call_function_many_cond() and
> wait for all read critical sections to exit before releasing percpu csd
> data. This is preparation for enabling preemption during csd_lock_wait()
> and can prevent accessing cfd->csd data that has already been freed in
> smpcfd_dead_cpu().
> 
> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
> ---
>  kernel/smp.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 9728ba55944d..32c293d8be0e 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -77,6 +77,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
>  {
>  	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
>  
> +	synchronize_rcu();
>  	free_cpumask_var(cfd->cpumask);
>  	free_cpumask_var(cfd->cpumask_ipi);
>  	free_percpu(cfd->csd);

This seems to make sense. But it could delay CPU shutdown and then the
stress-cpu-hotplug.sh. And this one helped to find bugs.

What is expectation of shutting down a CPU? Will it remain off for
_long_ at which point we care about this memory or is temporary and we
could skip to free the memory here because we allocate it only once on
the UP side?

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond
  2026-03-18  4:56 ` [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
@ 2026-03-18 16:55   ` Sebastian Andrzej Siewior
  2026-03-19  3:46     ` Chuyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 16:55 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

On 2026-03-18 12:56:32 [+0800], Chuyi Zhou wrote:
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>  		}
>  	}

So now I understand why we have this cpumask on stack.
Could we, on a preemptible kernel, where we have a preemption counter,
in the case of preemptible() allocate a cpumask and use it here? If the
allocation fails or we are not on a preemptbile kernel then we don't do
this optimized wait with enabled preemption.

There is no benefit of doing all this if the caller has already
preemption disabled.

> -	rcu_read_unlock();
> -	if (preemptible_wait)
> +	if (!preemptible_wait)
> +		put_cpu();
> +	else
>  		free_cpumask_var(cpumask_stack);
> +	rcu_read_unlock();
>  }
>  
>  /**

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18  4:56 ` [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
@ 2026-03-18 17:21   ` Sebastian Andrzej Siewior
  2026-03-18 20:24     ` Nadav Amit
  2026-03-20 14:33     ` Chuyi Zhou
  0 siblings, 2 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-18 17:21 UTC (permalink / raw)
  To: Chuyi Zhou, Nadav Amit
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

+Nadav, org post https://lore.kernel.org/all/20260318045638.1572777-11-zhouchuyi@bytedance.com/

On 2026-03-18 12:56:36 [+0800], Chuyi Zhou wrote:
> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
> stack") converted flush_tlb_info from stack variable to per-CPU variable.
> This brought about a performance improvement of around 3% in extreme test.
> However, it also required that all flush_tlb* operations keep preemption
> disabled entirely to prevent concurrent modifications of flush_tlb_info.
> flush_tlb* needs to send IPIs to remote CPUs and synchronously wait for
> all remote CPUs to complete their local TLB flushes. The process could
> take tens of milliseconds when interrupts are disabled or with a large
> number of remote CPUs.
…
PeterZ wasn't too happy to reverse this.
The snippet below results in the following assembly:

| 0000000000001ab0 <flush_tlb_kernel_range>:
…
|     1ac9:       48 89 e5                mov    %rsp,%rbp
|     1acc:       48 83 e4 c0             and    $0xffffffffffffffc0,%rsp
|     1ad0:       48 83 ec 40             sub    $0x40,%rsp

so it would align it properly which should result in the same cache-line
movement. I'm not sure about the virtual-to-physical translation of the
variables as in TLB misses since here we have a virtual mapped stack and
there we have virtual mapped per-CPU memory.

Here the below is my quick hack. Does this work, or still a now? I have
no numbers so…

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38d..4a7f40c7f939a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -227,7 +227,7 @@ struct flush_tlb_info {
 	u8			stride_shift;
 	u8			freed_tables;
 	u8			trim_cpumask;
-};
+} __aligned(SMP_CACHE_BYTES);
 
 void flush_tlb_local(void);
 void flush_tlb_one_user(unsigned long addr);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb9..99b70e94ec281 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -1394,28 +1394,12 @@ void flush_tlb_multi(const struct cpumask *cpumask,
  */
 unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
 
-static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
-#endif
-
-static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
-			unsigned long start, unsigned long end,
-			unsigned int stride_shift, bool freed_tables,
-			u64 new_tlb_gen)
+static void get_flush_tlb_info(struct flush_tlb_info *info,
+			       struct mm_struct *mm,
+			       unsigned long start, unsigned long end,
+			       unsigned int stride_shift, bool freed_tables,
+			       u64 new_tlb_gen)
 {
-	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
-
-#ifdef CONFIG_DEBUG_VM
-	/*
-	 * Ensure that the following code is non-reentrant and flush_tlb_info
-	 * is not overwritten. This means no TLB flushing is initiated by
-	 * interrupt handlers and machine-check exception handlers.
-	 */
-	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
-#endif
-
 	/*
 	 * If the number of flushes is so large that a full flush
 	 * would be faster, do a full flush.
@@ -1433,8 +1417,6 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
 	info->new_tlb_gen	= new_tlb_gen;
 	info->initiating_cpu	= smp_processor_id();
 	info->trim_cpumask	= 0;
-
-	return info;
 }
 
 static void put_flush_tlb_info(void)
@@ -1450,15 +1432,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 				unsigned long end, unsigned int stride_shift,
 				bool freed_tables)
 {
-	struct flush_tlb_info *info;
+	struct flush_tlb_info _info;
+	struct flush_tlb_info *info = &_info;
 	int cpu = get_cpu();
 	u64 new_tlb_gen;
 
 	/* This is also a barrier that synchronizes with switch_mm(). */
 	new_tlb_gen = inc_mm_tlb_gen(mm);
 
-	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
-				  new_tlb_gen);
+	get_flush_tlb_info(&_info, mm, start, end, stride_shift, freed_tables,
+			   new_tlb_gen);
 
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
@@ -1548,17 +1531,15 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
-	struct flush_tlb_info *info;
+	struct flush_tlb_info info;
 
-	guard(preempt)();
+	get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
+			   TLB_GENERATION_INVALID);
 
-	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
-				  TLB_GENERATION_INVALID);
-
-	if (info->end == TLB_FLUSH_ALL)
-		kernel_tlb_flush_all(info);
+	if (info.end == TLB_FLUSH_ALL)
+		kernel_tlb_flush_all(&info);
 	else
-		kernel_tlb_flush_range(info);
+		kernel_tlb_flush_range(&info);
 
 	put_flush_tlb_info();
 }
@@ -1728,12 +1709,11 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
 
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
-	struct flush_tlb_info *info;
+	struct flush_tlb_info info;
 
 	int cpu = get_cpu();
-
-	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
-				  TLB_GENERATION_INVALID);
+	get_flush_tlb_info(&info, NULL, 0, TLB_FLUSH_ALL, 0, false,
+			   TLB_GENERATION_INVALID);
 	/*
 	 * flush_tlb_multi() is not optimized for the common case in which only
 	 * a local TLB flush is needed. Optimize this use-case by calling
@@ -1743,11 +1723,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 		invlpgb_flush_all_nonglobals();
 		batch->unmapped_pages = false;
 	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
-		flush_tlb_multi(&batch->cpumask, info);
+		flush_tlb_multi(&batch->cpumask, &info);
 	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
 		lockdep_assert_irqs_enabled();
 		local_irq_disable();
-		flush_tlb_func(info);
+		flush_tlb_func(&info);
 		local_irq_enable();
 	}
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18 17:21   ` Sebastian Andrzej Siewior
@ 2026-03-18 20:24     ` Nadav Amit
  2026-03-18 22:28       ` Nadav Amit
  2026-03-20 14:33     ` Chuyi Zhou
  1 sibling, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2026-03-18 20:24 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
	dave.hansen, pbonzini, clrkwllms, rostedt, linux-kernel



> On 18 Mar 2026, at 19:21, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> +Nadav, org post https://lore.kernel.org/all/20260318045638.1572777-11-zhouchuyi@bytedance.com/
> 
> On 2026-03-18 12:56:36 [+0800], Chuyi Zhou wrote:
>> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
>> stack") converted flush_tlb_info from stack variable to per-CPU variable.
>> This brought about a performance improvement of around 3% in extreme test.
>> However, it also required that all flush_tlb* operations keep preemption
>> disabled entirely to prevent concurrent modifications of flush_tlb_info.
>> flush_tlb* needs to send IPIs to remote CPUs and synchronously wait for
>> all remote CPUs to complete their local TLB flushes. The process could
>> take tens of milliseconds when interrupts are disabled or with a large
>> number of remote CPUs.
> …
> PeterZ wasn't too happy to reverse this.
> The snippet below results in the following assembly:
> 
> | 0000000000001ab0 <flush_tlb_kernel_range>:
> …
> |     1ac9:       48 89 e5                mov    %rsp,%rbp
> |     1acc:       48 83 e4 c0             and    $0xffffffffffffffc0,%rsp
> |     1ad0:       48 83 ec 40             sub    $0x40,%rsp
> 
> so it would align it properly which should result in the same cache-line
> movement. I'm not sure about the virtual-to-physical translation of the
> variables as in TLB misses since here we have a virtual mapped stack and
> there we have virtual mapped per-CPU memory.
> 
> Here the below is my quick hack. Does this work, or still a now? I have
> no numbers so…
> 
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 5a3cdc439e38d..4a7f40c7f939a 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -227,7 +227,7 @@ struct flush_tlb_info {
> 	u8			stride_shift;
> 	u8			freed_tables;
> 	u8			trim_cpumask;
> -};
> +} __aligned(SMP_CACHE_BYTES);
> 

This would work, but you are likely to encounter the same problem PeterZ hit
when I did something similar: in some configurations SMP_CACHE_BYTES is very
large.

See https://lore.kernel.org/all/tip-780e0106d468a2962b16b52fdf42898f2639e0a0@git.kernel.org/

Maybe cap the alignment somehow? something like:

#if SMP_CACHE_BYTES > 64
#define FLUSH_TLB_INFO_ALIGN 64
#else
#define FLUSH_TLB_INFO_ALIGN SMP_CACHE_BYTES
#endif

And then use __aligned(FLUSH_TLB_INFO_ALIGN) ?



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18 20:24     ` Nadav Amit
@ 2026-03-18 22:28       ` Nadav Amit
  2026-03-19  8:49         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2026-03-18 22:28 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
	dave.hansen, pbonzini, clrkwllms, rostedt, linux-kernel


> On 18 Mar 2026, at 22:24, Nadav Amit <nadav.amit@gmail.com> wrote:
> 
> 
> 
>> On 18 Mar 2026, at 19:21, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>> 
>> so it would align it properly which should result in the same cache-line
>> movement. I'm not sure about the virtual-to-physical translation of the
>> variables as in TLB misses since here we have a virtual mapped stack and
>> there we have virtual mapped per-CPU memory.
>> 
>> Here the below is my quick hack. Does this work, or still a now? I have
>> no numbers so…
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>> index 5a3cdc439e38d..4a7f40c7f939a 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -227,7 +227,7 @@ struct flush_tlb_info {
>> 	u8 stride_shift;
>> 	u8 freed_tables;
>> 	u8 trim_cpumask;
>> -};
>> +} __aligned(SMP_CACHE_BYTES);
>> 
> 
> This would work, but you are likely to encounter the same problem PeterZ hit
> when I did something similar: in some configurations SMP_CACHE_BYTES is very
> large.

Further thinking about it and looking at the rest of the series: wouldn’t it be
simpler to put flush_tlb_info and smp_call_function_many_cond()’s
cpumask on thread_struct? It would allow to support CONFIG_CPUMASK_OFFSTACK=y
case by preallocating cpumask on thread creation.

I’m not sure whether the memory overhead is prohibitive.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single
  2026-03-18 14:14   ` Steven Rostedt
@ 2026-03-19  2:30     ` Chuyi Zhou
  0 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-19  2:30 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, linux-kernel

Hi Steve,

在 2026/3/18 22:14, Steven Rostedt 写道:
> On Wed, 18 Mar 2026 12:56:28 +0800
> "Chuyi Zhou" <zhouchuyi@bytedance.com> wrote:
> 
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index fc1f7a964616..b603d4229f95 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -685,11 +685,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
>>   
>>   	err = generic_exec_single(cpu, csd);
>>   
>> +	/*
>> +	 * @csd is stack-allocated when @wait is true. No concurrent access
>> +	 * except from the IPI completion path, so we can re-enable preemption
>> +	 * early to reduce latency.
>> +	 */
> 
> Thanks for the comment. I walked through the code and this looks fine to me.
> 
>> +	put_cpu();
>> +
>>   	if (wait)
>>   		csd_lock_wait(csd);
>>   
>> -	put_cpu();
>> -
>>   	return err;
> 
> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
> 

Thanks!

> -- Steve

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond
  2026-03-18 15:55   ` Sebastian Andrzej Siewior
@ 2026-03-19  3:02     ` Chuyi Zhou
  0 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-19  3:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

Hi Sebastian,

在 2026/3/18 23:55, Sebastian Andrzej Siewior 写道:
> On 2026-03-18 12:56:30 [+0800], Chuyi Zhou wrote:
>> This patch use on-stack cpumask to replace percpu cfd cpumask in
>> smp_call_function_many_cond(). Note that when both CONFIG_CPUMASK_OFFSTACK
>> and PREEMPT_RT are enabled, allocation during preempt-disabled section
>> would break RT. Therefore, only do this when CONFIG_CPUMASK_OFFSTACK=n.
>> This is a preparation for enabling preemption during csd_lock_wait() in
>> smp_call_function_many_cond().
> 
> You explained why we do this only for !CONFIG_CPUMASK_OFFSTACK but
> failed to explain why we need a function local cpumask. Other than
> preparation step. But this allocation looks pointless, let me look
> further…
> 

OK. It might be better to explain here why we need an local cpumask.

Thanks.

>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> Reviewed-by: Muchun Song <muchun.song@linux.dev>
> 
> Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond
  2026-03-18 16:55   ` Sebastian Andrzej Siewior
@ 2026-03-19  3:46     ` Chuyi Zhou
  0 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-19  3:46 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

在 2026/3/19 00:55, Sebastian Andrzej Siewior 写道:
> On 2026-03-18 12:56:32 [+0800], Chuyi Zhou wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -907,9 +920,11 @@ static void smp_call_function_many_cond(const struct cpumask *mask,
>>   		}
>>   	}
> 
> So now I understand why we have this cpumask on stack.
> Could we, on a preemptible kernel, where we have a preemption counter,
> in the case of preemptible() allocate a cpumask and use it here? If the
> allocation fails or we are not on a preemptbile kernel then we don't do
> this optimized wait with enabled preemption.
> 
> There is no benefit of doing all this if the caller has already
> preemption disabled.
> 

IIUC, we can enable this feature only when 
`IS_ENABLED(CONFIG_PREEMPTION) && preemptible()`.

This way, the optimization can also take effect for 
CONFIG_CPUMASK_OFFSTACK=y without breaking the RT principle that forbids 
memory allocation inside preemption-disabled critical sections.

Thanks.

>> -	rcu_read_unlock();
>> -	if (preemptible_wait)
>> +	if (!preemptible_wait)
>> +		put_cpu();
>> +	else
>>   		free_cpumask_var(cpumask_stack);
>> +	rcu_read_unlock();
>>   }
>>   
>>   /**
> 
> Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu
  2026-03-18 16:19   ` Sebastian Andrzej Siewior
@ 2026-03-19  7:48     ` Chuyi Zhou
  0 siblings, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-19  7:48 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

On 2026-03-19 12:19 a.m., Sebastian Andrzej Siewior wrote:
> On 2026-03-18 12:56:31 [+0800], Chuyi Zhou wrote:
>> Use rcu_read_lock() to protect csd in smp_call_function_many_cond() and
>> wait for all read critical sections to exit before releasing percpu csd
>> data. This is preparation for enabling preemption during csd_lock_wait()
>> and can prevent accessing cfd->csd data that has already been freed in
>> smpcfd_dead_cpu().
>>
>> Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com>
>> ---
>>   kernel/smp.c | 3 +++
>>   1 file changed, 3 insertions(+)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 9728ba55944d..32c293d8be0e 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -77,6 +77,7 @@ int smpcfd_dead_cpu(unsigned int cpu)
>>   {
>>   	struct call_function_data *cfd = &per_cpu(cfd_data, cpu);
>>   
>> +	synchronize_rcu();
>>   	free_cpumask_var(cfd->cpumask);
>>   	free_cpumask_var(cfd->cpumask_ipi);
>>   	free_percpu(cfd->csd);
> 
> This seems to make sense. But it could delay CPU shutdown and then the
> stress-cpu-hotplug.sh. And this one helped to find bugs.
> 
> What is expectation of shutting down a CPU? Will it remain off for
> _long_ at which point we care about this memory or is temporary and we
> could skip to free the memory here because we allocate it only once on
> the UP side?
> 


Yes, we can allocate the csd only once during the first UP side. The 
advantage is that it avoids delaying the CPU offline process and keeps 
the code simpler.

The tradeoff is a slight memory waste in scenarios where the CPU remains 
offline for a long time(or never online again). If we cannot tolerate 
the delay caused by synchronize_rcu() during the CPU offline process, we 
can switch to approach you suggested.

Thanks.

> Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18 22:28       ` Nadav Amit
@ 2026-03-19  8:49         ` Sebastian Andrzej Siewior
  2026-03-19 10:37           ` Nadav Amit
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19  8:49 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Chuyi Zhou, tglx, mingo, luto, peterz, paulmck, muchun.song, bp,
	dave.hansen, pbonzini, clrkwllms, rostedt, linux-kernel

On 2026-03-19 00:28:19 [+0200], Nadav Amit wrote:
> >> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> >> index 5a3cdc439e38d..4a7f40c7f939a 100644
> >> --- a/arch/x86/include/asm/tlbflush.h
> >> +++ b/arch/x86/include/asm/tlbflush.h
> >> @@ -227,7 +227,7 @@ struct flush_tlb_info {
> >> 	u8 stride_shift;
> >> 	u8 freed_tables;
> >> 	u8 trim_cpumask;
> >> -};
> >> +} __aligned(SMP_CACHE_BYTES);
> >> 
> > 
> > This would work, but you are likely to encounter the same problem PeterZ hit
> > when I did something similar: in some configurations SMP_CACHE_BYTES is very
> > large.

So if capping to 64 is an option does not break performance where one
would complain, why not. But you did it initially so…

> Further thinking about it and looking at the rest of the series: wouldn’t it be
> simpler to put flush_tlb_info and smp_call_function_many_cond()’s
> cpumask on thread_struct? It would allow to support CONFIG_CPUMASK_OFFSTACK=y
> case by preallocating cpumask on thread creation.
> 
> I’m not sure whether the memory overhead is prohibitive.

My Debian config has CONFIG_NR_CPUS=8192 which would add 1KiB if we add
a plain cpumask_t. The allocation based on cpumask_size() would add just
8 bytes/ pointer to the struct which should be fine. We could even stash
the mask in the pointer for CPUs <= 64 on 64bit.
On RT it would be desired to have the memory and not to fallback to
waiting with disabled preemption if the allocation fails.

The flush_tlb_info are around 40 bytes + alignment. Maybe we could try
stack first if this gets us to acceptable performance.

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-19  8:49         ` Sebastian Andrzej Siewior
@ 2026-03-19 10:37           ` Nadav Amit
  2026-03-19 10:58             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Nadav Amit @ 2026-03-19 10:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Chuyi Zhou, Thomas Gleixner, Ingo Molnar, luto, peterz, paulmck,
	muchun.song, Borislav Petkov, Dave Hansen, Paolo Bonzini,
	clrkwllms, Steven Rostedt, Linux Kernel Mailing List



> On 19 Mar 2026, at 10:49, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2026-03-19 00:28:19 [+0200], Nadav Amit wrote:
>>>> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
>>>> index 5a3cdc439e38d..4a7f40c7f939a 100644
>>>> --- a/arch/x86/include/asm/tlbflush.h
>>>> +++ b/arch/x86/include/asm/tlbflush.h
>>>> @@ -227,7 +227,7 @@ struct flush_tlb_info {
>>>> 	u8 stride_shift;
>>>> 	u8 freed_tables;
>>>> 	u8 trim_cpumask;
>>>> -};
>>>> +} __aligned(SMP_CACHE_BYTES);
>>>> 
>>> 
>>> This would work, but you are likely to encounter the same problem PeterZ hit
>>> when I did something similar: in some configurations SMP_CACHE_BYTES is very
>>> large.
> 
> So if capping to 64 is an option does not break performance where one
> would complain, why not. But you did it initially so…

No, I only aligned to SMP_CACHE_BYTES. Then I encountered the described problem,
hence I propose to cap it.

> 
>> Further thinking about it and looking at the rest of the series: wouldn’t it be
>> simpler to put flush_tlb_info and smp_call_function_many_cond()’s
>> cpumask on thread_struct? It would allow to support CONFIG_CPUMASK_OFFSTACK=y
>> case by preallocating cpumask on thread creation.
>> 
>> I’m not sure whether the memory overhead is prohibitive.
> 
> My Debian config has CONFIG_NR_CPUS=8192 which would add 1KiB if we add
> a plain cpumask_t. The allocation based on cpumask_size() would add just
> 8 bytes/ pointer to the struct which should be fine. We could even stash
> the mask in the pointer for CPUs <= 64 on 64bit.
> On RT it would be desired to have the memory and not to fallback to
> waiting with disabled preemption if the allocation fails.
> 
> The flush_tlb_info are around 40 bytes + alignment. Maybe we could try
> stack first if this gets us to acceptable performance.

I know it 1KB sounds a lot, but considering fpu size and other per-task
structures, it’s already almost 6KB. Just raising the option.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-19 10:37           ` Nadav Amit
@ 2026-03-19 10:58             ` Sebastian Andrzej Siewior
  2026-03-19 13:41               ` Chuyi Zhou
  0 siblings, 1 reply; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 10:58 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Chuyi Zhou, Thomas Gleixner, Ingo Molnar, luto, peterz, paulmck,
	muchun.song, Borislav Petkov, Dave Hansen, Paolo Bonzini,
	clrkwllms, Steven Rostedt, Linux Kernel Mailing List

On 2026-03-19 12:37:27 [+0200], Nadav Amit wrote:
> >> Further thinking about it and looking at the rest of the series: wouldn’t it be
> >> simpler to put flush_tlb_info and smp_call_function_many_cond()’s
> >> cpumask on thread_struct? It would allow to support CONFIG_CPUMASK_OFFSTACK=y
> >> case by preallocating cpumask on thread creation.
> >> 
> >> I’m not sure whether the memory overhead is prohibitive.
> > 
> > My Debian config has CONFIG_NR_CPUS=8192 which would add 1KiB if we add
> > a plain cpumask_t. The allocation based on cpumask_size() would add just
> > 8 bytes/ pointer to the struct which should be fine. We could even stash
> > the mask in the pointer for CPUs <= 64 on 64bit.
> > On RT it would be desired to have the memory and not to fallback to
> > waiting with disabled preemption if the allocation fails.
> > 
> > The flush_tlb_info are around 40 bytes + alignment. Maybe we could try
> > stack first if this gets us to acceptable performance.
> 
> I know it 1KB sounds a lot, but considering fpu size and other per-task
> structures, it’s already almost 6KB. Just raising the option.

Sure. We might try allocating just the cpumask while allocating the
task_struct so we won't do the whole 1KiB but just what is required and
we might avoid allocation if it fits in the pointer size.

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-19 10:58             ` Sebastian Andrzej Siewior
@ 2026-03-19 13:41               ` Chuyi Zhou
  2026-03-19 14:40                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-19 13:41 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Nadav Amit
  Cc: Thomas Gleixner, Ingo Molnar, luto, peterz, paulmck, muchun.song,
	Borislav Petkov, Dave Hansen, Paolo Bonzini, clrkwllms,
	Steven Rostedt, Linux Kernel Mailing List

Hi Sebastian,

On 2026-03-19 6:58 p.m., Sebastian Andrzej Siewior wrote:
> On 2026-03-19 12:37:27 [+0200], Nadav Amit wrote:
>>>> Further thinking about it and looking at the rest of the series: wouldn’t it be
>>>> simpler to put flush_tlb_info and smp_call_function_many_cond()’s
>>>> cpumask on thread_struct? It would allow to support CONFIG_CPUMASK_OFFSTACK=y
>>>> case by preallocating cpumask on thread creation.
>>>>
>>>> I’m not sure whether the memory overhead is prohibitive.
>>>
>>> My Debian config has CONFIG_NR_CPUS=8192 which would add 1KiB if we add
>>> a plain cpumask_t. The allocation based on cpumask_size() would add just
>>> 8 bytes/ pointer to the struct which should be fine. We could even stash
>>> the mask in the pointer for CPUs <= 64 on 64bit.
>>> On RT it would be desired to have the memory and not to fallback to
>>> waiting with disabled preemption if the allocation fails.
>>>
>>> The flush_tlb_info are around 40 bytes + alignment. Maybe we could try
>>> stack first if this gets us to acceptable performance.
>>
>> I know it 1KB sounds a lot, but considering fpu size and other per-task
>> structures, it’s already almost 6KB. Just raising the option.
> 
> Sure. We might try allocating just the cpumask while allocating the
> task_struct so we won't do the whole 1KiB but just what is required and
> we might avoid allocation if it fits in the pointer size.
> 
> Sebastian

IIUC, you mean something like the following?


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5a5d3dbc9cdf..4222114cd34c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -927,6 +927,8 @@ struct task_struct {
         unsigned short                  migration_disabled;
         unsigned short                  migration_flags;

+       cpumask_t                       *ipi_cpus;
+
  #ifdef CONFIG_PREEMPT_RCU
         int                             rcu_read_lock_nesting;
         union rcu_special               rcu_read_unlock_special;
diff --git a/kernel/smp.c b/kernel/smp.c
index 47c3b057f57f..f2bbd9b87f7f 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -112,6 +112,29 @@ void __init call_function_init(void)
         smpcfd_prepare_cpu(smp_processor_id());
  }

+static inline bool can_inline_cpumask(void)
+{
+       return cpumask_size() <= sizeof(cpumask_t *);
+}
+
+void alloc_ipi_cpumask(struct task_struct *task)
+{
+       if (can_inline_cpumask())
+               return;
+       /*
+        * Fallback to the default smp_call_function_many_cond
+        * logic if the allocation fails.
+        */
+       task->ipi_cpus = kmalloc(cpumask_size(), GFP_KERNEL);
+}
+
+static inline cpumask_t *get_local_cpumask(struct task_struct *cur)
+{
+       if (can_inline_cpumask())
+               return (cpumask_t *)&cur->ipi_cpus;
+       return cur->ipi_cpus;
+}
+
  static __always_inline void
  send_call_function_single_ipi(int cpu)
  {

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-19 13:41               ` Chuyi Zhou
@ 2026-03-19 14:40                 ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 34+ messages in thread
From: Sebastian Andrzej Siewior @ 2026-03-19 14:40 UTC (permalink / raw)
  To: Chuyi Zhou
  Cc: Nadav Amit, Thomas Gleixner, Ingo Molnar, luto, peterz, paulmck,
	muchun.song, Borislav Petkov, Dave Hansen, Paolo Bonzini,
	clrkwllms, Steven Rostedt, Linux Kernel Mailing List

On 2026-03-19 21:41:28 [+0800], Chuyi Zhou wrote:
> Hi Sebastian,
Hi,

> IIUC, you mean something like the following?

basically yes. Later you might want to add a static_branch to
can_inline_cpumask() instead the check you have now. The value to
cpumask_size() should be assigned once the number of possible CPUs is
known and never change.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 5a5d3dbc9cdf..4222114cd34c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -927,6 +927,8 @@ struct task_struct {
>          unsigned short                  migration_disabled;
>          unsigned short                  migration_flags;
> 
> +       cpumask_t                       *ipi_cpus;

This is probably not the best spot to stuff it. You will have a 4byte
gap there. After user_cpus_ptr would be okay from alignment but I am not
sure if something else shifts too much.

> +
>   #ifdef CONFIG_PREEMPT_RCU
>          int                             rcu_read_lock_nesting;
>          union rcu_special               rcu_read_unlock_special;

Sebastian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack
  2026-03-18 17:21   ` Sebastian Andrzej Siewior
  2026-03-18 20:24     ` Nadav Amit
@ 2026-03-20 14:33     ` Chuyi Zhou
  1 sibling, 0 replies; 34+ messages in thread
From: Chuyi Zhou @ 2026-03-20 14:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, Nadav Amit
  Cc: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, clrkwllms, rostedt, linux-kernel

Hello,

On 2026-03-19 1:21 a.m., Sebastian Andrzej Siewior wrote:
> +Nadav, org post https://lore.kernel.org/all/20260318045638.1572777-11-zhouchuyi@bytedance.com/
> 
> On 2026-03-18 12:56:36 [+0800], Chuyi Zhou wrote:
>> Commit 3db6d5a5ecaf ("x86/mm/tlb: Remove 'struct flush_tlb_info' from the
>> stack") converted flush_tlb_info from stack variable to per-CPU variable.
>> This brought about a performance improvement of around 3% in extreme test.
>> However, it also required that all flush_tlb* operations keep preemption
>> disabled entirely to prevent concurrent modifications of flush_tlb_info.
>> flush_tlb* needs to send IPIs to remote CPUs and synchronously wait for
>> all remote CPUs to complete their local TLB flushes. The process could
>> take tens of milliseconds when interrupts are disabled or with a large
>> number of remote CPUs.
> …
> PeterZ wasn't too happy to reverse this.
> The snippet below results in the following assembly:
> 
> | 0000000000001ab0 <flush_tlb_kernel_range>:
> …
> |     1ac9:       48 89 e5                mov    %rsp,%rbp
> |     1acc:       48 83 e4 c0             and    $0xffffffffffffffc0,%rsp
> |     1ad0:       48 83 ec 40             sub    $0x40,%rsp
> 
> so it would align it properly which should result in the same cache-line
> movement. I'm not sure about the virtual-to-physical translation of the
> variables as in TLB misses since here we have a virtual mapped stack and
> there we have virtual mapped per-CPU memory.
> 
> Here the below is my quick hack. Does this work, or still a now? I have
> no numbers so…
> 

I applied this patch on tip sched/core: fe7171d0d5df ("sched/fair: 
Simplify SIS_UTIL handling in select_idle_cpu()") and retest the 
microbenchmark in [PATCH v3 10/12], pinning the tasks to CPUs using 
cpuset, and disabling pti and mitigations.


                   base       on-stack-aligned  on-stack
                   ----       ---------      -----------
avg (usec/op)     2.5278       2.5261        2.5508
stddev            0.0007       0.0027        0.0023

Based on the data above, there is no performance regression.

Dose this make sense? Do you have other testing suggestions?

Thanks

> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index 5a3cdc439e38d..4a7f40c7f939a 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -227,7 +227,7 @@ struct flush_tlb_info {
>   	u8			stride_shift;
>   	u8			freed_tables;
>   	u8			trim_cpumask;
> -};
> +} __aligned(SMP_CACHE_BYTES);
>   
>   void flush_tlb_local(void);
>   void flush_tlb_one_user(unsigned long addr);
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 621e09d049cb9..99b70e94ec281 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -1394,28 +1394,12 @@ void flush_tlb_multi(const struct cpumask *cpumask,
>    */
>   unsigned long tlb_single_page_flush_ceiling __read_mostly = 33;
>   
> -static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info);
> -
> -#ifdef CONFIG_DEBUG_VM
> -static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
> -#endif
> -
> -static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
> -			unsigned long start, unsigned long end,
> -			unsigned int stride_shift, bool freed_tables,
> -			u64 new_tlb_gen)
> +static void get_flush_tlb_info(struct flush_tlb_info *info,
> +			       struct mm_struct *mm,
> +			       unsigned long start, unsigned long end,
> +			       unsigned int stride_shift, bool freed_tables,
> +			       u64 new_tlb_gen)
>   {
> -	struct flush_tlb_info *info = this_cpu_ptr(&flush_tlb_info);
> -
> -#ifdef CONFIG_DEBUG_VM
> -	/*
> -	 * Ensure that the following code is non-reentrant and flush_tlb_info
> -	 * is not overwritten. This means no TLB flushing is initiated by
> -	 * interrupt handlers and machine-check exception handlers.
> -	 */
> -	BUG_ON(this_cpu_inc_return(flush_tlb_info_idx) != 1);
> -#endif
> -
>   	/*
>   	 * If the number of flushes is so large that a full flush
>   	 * would be faster, do a full flush.
> @@ -1433,8 +1417,6 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
>   	info->new_tlb_gen	= new_tlb_gen;
>   	info->initiating_cpu	= smp_processor_id();
>   	info->trim_cpumask	= 0;
> -
> -	return info;
>   }
>   
>   static void put_flush_tlb_info(void)
> @@ -1450,15 +1432,16 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
>   				unsigned long end, unsigned int stride_shift,
>   				bool freed_tables)
>   {
> -	struct flush_tlb_info *info;
> +	struct flush_tlb_info _info;
> +	struct flush_tlb_info *info = &_info;
>   	int cpu = get_cpu();
>   	u64 new_tlb_gen;
>   
>   	/* This is also a barrier that synchronizes with switch_mm(). */
>   	new_tlb_gen = inc_mm_tlb_gen(mm);
>   
> -	info = get_flush_tlb_info(mm, start, end, stride_shift, freed_tables,
> -				  new_tlb_gen);
> +	get_flush_tlb_info(&_info, mm, start, end, stride_shift, freed_tables,
> +			   new_tlb_gen);
>   
>   	/*
>   	 * flush_tlb_multi() is not optimized for the common case in which only
> @@ -1548,17 +1531,15 @@ static void kernel_tlb_flush_range(struct flush_tlb_info *info)
>   
>   void flush_tlb_kernel_range(unsigned long start, unsigned long end)
>   {
> -	struct flush_tlb_info *info;
> +	struct flush_tlb_info info;
>   
> -	guard(preempt)();
> +	get_flush_tlb_info(&info, NULL, start, end, PAGE_SHIFT, false,
> +			   TLB_GENERATION_INVALID);
>   
> -	info = get_flush_tlb_info(NULL, start, end, PAGE_SHIFT, false,
> -				  TLB_GENERATION_INVALID);
> -
> -	if (info->end == TLB_FLUSH_ALL)
> -		kernel_tlb_flush_all(info);
> +	if (info.end == TLB_FLUSH_ALL)
> +		kernel_tlb_flush_all(&info);
>   	else
> -		kernel_tlb_flush_range(info);
> +		kernel_tlb_flush_range(&info);
>   
>   	put_flush_tlb_info();
>   }
> @@ -1728,12 +1709,11 @@ EXPORT_SYMBOL_FOR_KVM(__flush_tlb_all);
>   
>   void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>   {
> -	struct flush_tlb_info *info;
> +	struct flush_tlb_info info;
>   
>   	int cpu = get_cpu();
> -
> -	info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false,
> -				  TLB_GENERATION_INVALID);
> +	get_flush_tlb_info(&info, NULL, 0, TLB_FLUSH_ALL, 0, false,
> +			   TLB_GENERATION_INVALID);
>   	/*
>   	 * flush_tlb_multi() is not optimized for the common case in which only
>   	 * a local TLB flush is needed. Optimize this use-case by calling
> @@ -1743,11 +1723,11 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
>   		invlpgb_flush_all_nonglobals();
>   		batch->unmapped_pages = false;
>   	} else if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
> -		flush_tlb_multi(&batch->cpumask, info);
> +		flush_tlb_multi(&batch->cpumask, &info);
>   	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>   		lockdep_assert_irqs_enabled();
>   		local_irq_disable();
> -		flush_tlb_func(info);
> +		flush_tlb_func(&info);
>   		local_irq_enable();
>   	}
>

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2026-03-20 14:33 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-18  4:56 [PATCH v3 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
2026-03-18 14:13   ` Steven Rostedt
2026-03-18  4:56 ` [PATCH v3 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
2026-03-18 14:14   ` Steven Rostedt
2026-03-19  2:30     ` Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 03/12] smp: Remove get_cpu from smp_call_function_any Chuyi Zhou
2026-03-18 14:32   ` Steven Rostedt
2026-03-18 15:39     ` Sebastian Andrzej Siewior
2026-03-18  4:56 ` [PATCH v3 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
2026-03-18 14:38   ` Steven Rostedt
2026-03-18 15:55   ` Sebastian Andrzej Siewior
2026-03-19  3:02     ` Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu Chuyi Zhou
2026-03-18 16:19   ` Sebastian Andrzej Siewior
2026-03-19  7:48     ` Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
2026-03-18 16:55   ` Sebastian Andrzej Siewior
2026-03-19  3:46     ` Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
2026-03-18 17:21   ` Sebastian Andrzej Siewior
2026-03-18 20:24     ` Nadav Amit
2026-03-18 22:28       ` Nadav Amit
2026-03-19  8:49         ` Sebastian Andrzej Siewior
2026-03-19 10:37           ` Nadav Amit
2026-03-19 10:58             ` Sebastian Andrzej Siewior
2026-03-19 13:41               ` Chuyi Zhou
2026-03-19 14:40                 ` Sebastian Andrzej Siewior
2026-03-20 14:33     ` Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
2026-03-18  4:56 ` [PATCH v3 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox