public inbox for linux-kernel@vger.kernel.org
* [PATCH v2 00/12] Allow preemption during IPI completion waiting to improve real-time performance
@ 2026-03-02  7:52 Chuyi Zhou
  2026-03-02  7:52 ` [PATCH v2 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
                   ` (11 more replies)
  0 siblings, 12 replies; 22+ messages in thread
From: Chuyi Zhou @ 2026-03-02  7:52 UTC (permalink / raw)
  To: tglx, mingo, luto, peterz, paulmck, muchun.song, bp, dave.hansen,
	pbonzini, bigeasy, clrkwllms, rostedt
  Cc: linux-kernel, Chuyi Zhou

Changes in v2:
 - Simplify the code comments in [PATCH v2 2/12] (pointed out by Peter and
   Muchun)
 - Adjust the preemption disabling logic in smp_call_function_any() in
   [PATCH v2 3/12] (suggested by Peter)
 - Use an on-stack cpumask only when !CONFIG_CPUMASK_OFFSTACK in [PATCH v2
   4/12] (pointed out by Peter)
 - Add [PATCH v2 5/12] to replace migrate_disable() with an RCU-based
   mechanism
 - Adjust the preemption disabling logic to allow flush_tlb_multi() to be
   preemptible and migratable in [PATCH v2 11/12]
 - Collect Acked-bys and Reviewed-bys

Introduction
============

The vast majority of smp_call_function*() callers block until the remote
CPUs have finished executing the requested function. Since
smp_call_function*() runs with preemption disabled throughout, scheduling
latency grows sharply with the number of remote CPUs and is made worse by
other factors (such as remote CPUs running with interrupts disabled).

On x86-64 architectures, TLB flushes are performed via IPIs; thus, during
process exit or when process-mapped pages are reclaimed, numerous IPI
operations must be awaited, leading to increased scheduling latency for
other threads on the current CPU. In our production environment, we
observed IPI wait-induced scheduling latency reaching up to 16ms on a
16-core machine. Our goal is to allow preemption during IPI completion
waiting to improve real-time performance.

Background
==========

In our production environments, latency-sensitive workloads (DPDK) are
configured with the highest priority to preempt lower-priority tasks at any
time. We discovered that DPDK's wake-up latency is primarily caused by the
current CPU having preemption disabled. Therefore, we recorded the maximum
preemption-disabled duration within every 30-second interval and then
calculated the P50/P99 of these per-window maxima:

                        p50(ns)               p99(ns)
cpu0                   254956                 5465050
cpu1                   115801                 120782
cpu2                   43324                  72957
cpu3                   256637                 16723307
cpu4                   58979                  87237
cpu5                   47464                  79815
cpu6                   48881                  81371
cpu7                   52263                  82294
cpu8                   263555                 4657713
cpu9                   44935                  73962
cpu10                  37659                  65026
cpu11                  257008                 2706878
cpu12                  49669                  90006
cpu13                  45186                  74666
cpu14                  60705                  83866
cpu15                  51311                  86885

Meanwhile, we collected the distribution of preemption-disabled events
exceeding 1ms across different CPUs over several hours (CPUs whose counts
were all zero are omitted):

CPU        1~10ms   10~50ms   50~100ms
cpu0           29         5          0
cpu3           38        13          0
cpu8           34         6          0
cpu11          24        10          0

The preemption-disabled sections lasting several milliseconds, or even
10ms+, mostly originate from TLB flushes:

@stack[
    trace_preempt_on+143
    trace_preempt_on+143
    preempt_count_sub+67
    arch_tlbbatch_flush/flush_tlb_mm_range
    task_exit/page_reclaim/...
]

Further analysis confirms that the majority of the time is consumed in
csd_lock_wait().

Today, smp_call*() always disables preemption, mainly to protect its
internal per-CPU data structures and to synchronize with CPU offline
operations. This patchset attempts to make csd_lock_wait() preemptible,
thereby shrinking the preemption-disabled critical section and improving
the kernel's real-time behavior.

Effect
======

After applying this patchset, we no longer observe preemption being
disabled for more than 1ms on the arch_tlbbatch_flush/flush_tlb_mm_range
path. The overall P99 of the max preemption-disabled duration per
30-second interval is reduced to around 1.5ms (the remaining latency is
primarily due to lock contention).

                     before patch    after patch    reduced by
                     ------------    -----------    ----------
p99(ns)                  16723307        1556034       ~90.70%

Chuyi Zhou (12):
  smp: Disable preemption explicitly in __csd_lock_wait
  smp: Enable preemption early in smp_call_function_single
  smp: Remove get_cpu from smp_call_function_any
  smp: Use on-stack cpumask in smp_call_function_many_cond
  smp: Free call_function_data via RCU in smpcfd_dead_cpu
  smp: Enable preemption early in smp_call_function_many_cond
  smp: Remove preempt_disable from smp_call_function
  smp: Remove preempt_disable from on_each_cpu_cond_mask
  scftorture: Remove preempt_disable in scftorture_invoke_one
  x86/mm: Move flush_tlb_info back to the stack
  x86/mm: Enable preemption during native_flush_tlb_multi
  x86/mm: Enable preemption during flush_tlb_kernel_range

 arch/x86/kernel/kvm.c |   4 +-
 arch/x86/mm/tlb.c     | 137 ++++++++++++++++++------------------------
 kernel/scftorture.c   |   9 +--
 kernel/smp.c          |  81 +++++++++++++++++++------
 4 files changed, 125 insertions(+), 106 deletions(-)

-- 
2.20.1


end of thread, other threads:[~2026-03-10  7:27 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-02  7:52 [PATCH v2 00/12] Allow preemption during IPI completion waiting to improve real-time performance Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 01/12] smp: Disable preemption explicitly in __csd_lock_wait Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 02/12] smp: Enable preemption early in smp_call_function_single Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 03/12] smp: Remove get_cpu from smp_call_function_any Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 04/12] smp: Use on-stack cpumask in smp_call_function_many_cond Chuyi Zhou
2026-03-10  7:12   ` Muchun Song
2026-03-02  7:52 ` [PATCH v2 05/12] smp: Free call_function_data via RCU in smpcfd_dead_cpu Chuyi Zhou
2026-03-10  7:05   ` Muchun Song
2026-03-10  7:26     ` Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 06/12] smp: Enable preemption early in smp_call_function_many_cond Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 07/12] smp: Remove preempt_disable from smp_call_function Chuyi Zhou
2026-03-10  7:06   ` Muchun Song
2026-03-02  7:52 ` [PATCH v2 08/12] smp: Remove preempt_disable from on_each_cpu_cond_mask Chuyi Zhou
2026-03-10  7:06   ` Muchun Song
2026-03-02  7:52 ` [PATCH v2 09/12] scftorture: Remove preempt_disable in scftorture_invoke_one Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 10/12] x86/mm: Move flush_tlb_info back to the stack Chuyi Zhou
2026-03-02 14:58   ` Peter Zijlstra
2026-03-03  3:20     ` Chuyi Zhou
2026-03-05  7:01     ` Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 11/12] x86/mm: Enable preemption during native_flush_tlb_multi Chuyi Zhou
2026-03-02  7:52 ` [PATCH v2 12/12] x86/mm: Enable preemption during flush_tlb_kernel_range Chuyi Zhou
2026-03-10  6:35   ` kernel test robot
