public inbox for virtualization@lists.linux-foundation.org
* [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them
@ 2026-03-24  8:52 Lance Yang
  2026-03-24  8:52 ` [PATCH v8 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Lance Yang @ 2026-03-24  8:52 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
	seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
	linux-arch, linux-mm, linux-kernel, ioworker0

Hi all,

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

This series introduces tlb_table_flush_implies_ipi_broadcast() to check if
the prior TLB flush already provided the necessary synchronization. When
true, the sync calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes by the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

Two-step plan as David suggested[4]:

Step 1 (this series): Skip the redundant sync when we're 100% certain the
TLB flush sent IPIs. INVLPGB is excluded because, when it is supported, we
cannot guarantee IPIs were sent; leaving it out keeps things clean and
simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Step 2 only applies to setups where Step 1 does not, such as x86 with
INVLPGB, or arm64. Work on Step 2 is ongoing; early attempts showed ~3%
GUP-fast overhead. Reducing that overhead requires more work and tuning;
it will be submitted separately once ready.

On a 64-core Intel x86 server, the CAL interrupt count in
/proc/interrupts dropped from 646,316 to 785 when collapsing a 20 GiB
range with this series applied.

David Hildenbrand did the initial implementation. I built on his work and
relied on off-list discussions to push it further - thanks a lot David!

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/
[4] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

v7 -> v8:
- Pick up Acked-by tags from David, thanks!
- Add CAL interrupt numbers to the cover letter (per Andrew, thanks!)
- Rewrite the [2/2] changelog and reword the comment (per David, thanks!)
- https://lore.kernel.org/linux-mm/20260309020711.20831-1-lance.yang@linux.dev/

v6 -> v7:
- Simplify init logic and eliminate duplicated X86_FEATURE_INVLPGB checks
  (per Dave, thanks!)
- Remove flush_tlb_multi_implies_ipi_broadcast property because no PV
  backend sets it today.
- https://lore.kernel.org/linux-mm/20260304021046.18550-1-lance.yang@linux.dev/

v5 -> v6:
- Use static_branch to eliminate the branch overhead (per Peter, thanks!)
- https://lore.kernel.org/linux-mm/20260302063048.9479-1-lance.yang@linux.dev/

v4 -> v5:
- Drop per-CPU tracking (active_lockless_pt_walk_mm) from this series;
  defer to Step 2 as it adds ~3% GUP-fast overhead
- Keep pv_ops property false for PV backends like KVM: preempted vCPUs
  cannot be assumed safe (per Sean, thanks!)
  https://lore.kernel.org/linux-mm/aaCP95l-m8ISXF78@google.com/
- https://lore.kernel.org/linux-mm/20260202074557.16544-1-lance.yang@linux.dev/ 

v3 -> v4:
- Rework based on David's two-step direction and per-CPU idea:
  1) Targeted IPIs: per-CPU variable when entering/leaving lockless page
     table walk; tlb_remove_table_sync_mm() IPIs only those CPUs.
  2) On x86, pv_mmu_ops property set at init to skip the extra sync when
     flush_tlb_multi() already sends IPIs.
  https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/
- https://lore.kernel.org/linux-mm/20260106120303.38124-1-lance.yang@linux.dev/

v2 -> v3:
- Complete rewrite: use dynamic IPI tracking instead of static checks
  (per Dave Hansen, thanks!)
- Track IPIs via mmu_gather: native_flush_tlb_multi() sets flag when
  actually sending IPIs
- Motivation for skipping redundant IPIs explained by David:
  https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
- https://lore.kernel.org/linux-mm/20251229145245.85452-1-lance.yang@linux.dev/

v1 -> v2:
- Fix cover letter encoding to resolve send-email issues. Apologies for
  any email flood caused by the failed send attempts :(

RFC -> v1:
- Use a callback function in pv_mmu_ops instead of comparing function
  pointers (per David)
- Embed the check directly in tlb_remove_table_sync_one() instead of
  requiring every caller to check explicitly (per David)
- Move tlb_table_flush_implies_ipi_broadcast() outside of
  CONFIG_MMU_GATHER_RCU_TABLE_FREE to fix build error on architectures
  that don't enable this config.
  https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
- https://lore.kernel.org/linux-mm/20251213080038.10917-1-lance.yang@linux.dev/

Lance Yang (2):
  mm/mmu_gather: prepare to skip redundant sync IPIs
  x86/tlb: skip redundant sync IPIs for native TLB flush

 arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
 arch/x86/include/asm/tlbflush.h |  2 ++
 arch/x86/kernel/smpboot.c       |  1 +
 arch/x86/mm/tlb.c               | 15 +++++++++++++++
 include/asm-generic/tlb.h       | 17 +++++++++++++++++
 mm/mmu_gather.c                 | 15 +++++++++++++++
 6 files changed, 67 insertions(+), 1 deletion(-)

-- 
2.49.0


^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH v8 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
  2026-03-24  8:52 [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
@ 2026-03-24  8:52 ` Lance Yang
  2026-03-24  8:52 ` [PATCH v8 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
  2026-03-24 18:43 ` [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
  2 siblings, 0 replies; 5+ messages in thread
From: Lance Yang @ 2026-03-24  8:52 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
	seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
	linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

Introduce tlb_table_flush_implies_ipi_broadcast() to check if the prior TLB
flush already provided the necessary synchronization. When true, the sync
calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes by the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

The helper currently always returns false, so there is no behavior change.
The follow-up patch enables the optimization for x86.

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/

Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 include/asm-generic/tlb.h | 17 +++++++++++++++++
 mm/mmu_gather.c           | 15 +++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bdcc2778ac64..cb41cc6a0024 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,23 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 }
 #endif /* CONFIG_MMU_GATHER_TABLE_FREE */
 
+/**
+ * tlb_table_flush_implies_ipi_broadcast - does TLB flush imply IPI sync
+ *
+ * When page table operations require synchronization with software/lockless
+ * walkers, they flush the TLB (tlb->freed_tables or tlb->unshared_tables)
+ * then call tlb_remove_table_sync_{one,rcu}(). If the flush already sent
+ * IPIs to all CPUs, the sync call is redundant.
+ *
+ * Returns false by default. Architectures can override by defining this.
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 #ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
 /*
  * This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 3985d856de7f..37a6a711c37e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -283,6 +283,14 @@ void tlb_remove_table_sync_one(void)
 	 * It is however sufficient for software page-table walkers that rely on
 	 * IRQ disabling.
 	 */
+
+	/*
+	 * Skip IPI if the preceding TLB flush already synchronized with
+	 * all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
 }
 
@@ -312,6 +320,13 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
  */
 void tlb_remove_table_sync_rcu(void)
 {
+	/*
+	 * Skip RCU wait if the preceding TLB flush already synchronized
+	 * with all CPUs that could be doing software/lockless page table walks.
+	 */
+	if (tlb_table_flush_implies_ipi_broadcast())
+		return;
+
 	synchronize_rcu();
 }
 
-- 
2.49.0



* [PATCH v8 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
  2026-03-24  8:52 [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
  2026-03-24  8:52 ` [PATCH v8 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
@ 2026-03-24  8:52 ` Lance Yang
  2026-03-24 18:43 ` [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
  2 siblings, 0 replies; 5+ messages in thread
From: Lance Yang @ 2026-03-24  8:52 UTC (permalink / raw)
  To: akpm
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
	seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
	linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang

From: Lance Yang <lance.yang@linux.dev>

Some page table operations need to synchronize with software/lockless
walkers after a TLB flush by calling tlb_remove_table_sync_{one,rcu}().
On x86, that extra synchronization is redundant when the preceding TLB
flush already broadcast IPIs to all relevant CPUs.

native_pv_tlb_init() checks whether native_flush_tlb_multi() is in use.
On CONFIG_PARAVIRT systems, it checks pv_ops; on non-PARAVIRT, native
flush is always in use.

It decides once at boot whether to enable the optimization: if using
native TLB flush and INVLPGB is not supported, we know IPIs were sent
and can skip the redundant sync. The decision is fixed via a static
key as Peter suggested[1].

PV backends (KVM, Xen, Hyper-V) typically have their own implementations
and don't call native_flush_tlb_multi() directly, so they cannot be trusted
to provide the IPI guarantees we need.

Also treat unshared_tables like freed_tables when issuing the TLB flush,
so lazy-TLB CPUs receive IPIs during unsharing of page tables as well.
This allows us to safely implement tlb_table_flush_implies_ipi_broadcast().

Two-step plan as David suggested[2]:

Step 1 (this patch): Skip the redundant sync when we're 100% certain the
TLB flush sent IPIs. INVLPGB is excluded because, when it is supported, we
cannot guarantee IPIs were sent; leaving it out keeps things clean and
simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Step 2 only applies to setups where Step 1 does not, such as x86 with
INVLPGB, or arm64.

[1] https://lore.kernel.org/linux-mm/20260302145652.GH1395266@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
 arch/x86/include/asm/tlbflush.h |  2 ++
 arch/x86/kernel/smpboot.c       |  1 +
 arch/x86/mm/tlb.c               | 15 +++++++++++++++
 4 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..fc586ec8e768 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,11 +5,21 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include <asm-generic/tlb.h>
 #include <linux/kernel.h>
 #include <vdso/bits.h>
 #include <vdso/page.h>
 
+DECLARE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return static_branch_likely(&tlb_ipi_broadcast_key);
+}
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	unsigned long start = 0UL, end = TLB_FLUSH_ALL;
@@ -20,7 +30,13 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	/*
+	 * Treat unshared_tables just like freed_tables, such that lazy-TLB
+	 * CPUs also receive IPIs during unsharing of page tables, allowing
+	 * us to safely implement tlb_table_flush_implies_ipi_broadcast().
+	 */
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38..8ba853154b46 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,8 @@
 
 DECLARE_PER_CPU(u64, tlbstate_untag_mask);
 
+void __init native_pv_tlb_init(void);
+
 void __flush_tlb_all(void);
 
 #define TLB_FLUSH_ALL	-1UL
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 294a8ea60298..df776b645a9c 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1256,6 +1256,7 @@ void __init native_smp_prepare_boot_cpu(void)
 		switch_gdt_and_percpu_base(me);
 
 	native_pv_lock_init();
+	native_pv_tlb_init();
 }
 
 void __init native_smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb..8f5585ebaf09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -26,6 +26,8 @@
 
 #include "mm_internal.h"
 
+DEFINE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
 #ifdef CONFIG_PARAVIRT
 # define STATIC_NOPV
 #else
@@ -1834,3 +1836,16 @@ static int __init create_tlb_single_page_flush_ceiling(void)
 	return 0;
 }
 late_initcall(create_tlb_single_page_flush_ceiling);
+
+void __init native_pv_tlb_init(void)
+{
+#ifdef CONFIG_PARAVIRT
+	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+		return;
+#endif
+
+	if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+		return;
+
+	static_branch_enable(&tlb_ipi_broadcast_key);
+}
-- 
2.49.0



* Re: [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them
  2026-03-24  8:52 [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
  2026-03-24  8:52 ` [PATCH v8 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
  2026-03-24  8:52 ` [PATCH v8 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
@ 2026-03-24 18:43 ` Andrew Morton
  2026-03-25  2:43   ` Lance Yang
  2 siblings, 1 reply; 5+ messages in thread
From: Andrew Morton @ 2026-03-24 18:43 UTC (permalink / raw)
  To: Lance Yang
  Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
	aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
	lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
	seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
	linux-arch, linux-mm, linux-kernel, ioworker0

On Tue, 24 Mar 2026 16:52:36 +0800 Lance Yang <lance.yang@linux.dev> wrote:

> [...]
> 
> On a 64-core Intel x86 server, the CAL interrupt count in
> /proc/interrupts dropped from 646,316 to 785 when collapsing a 20 GiB
> range with this series applied.

Well that's nice.

Which other architectures could utilize this?

> David Hildenbrand did the initial implementation. I built on his work and
> relied on off-list discussions to push it further - thanks a lot David!
>
> ...
>
>  arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
>  arch/x86/include/asm/tlbflush.h |  2 ++
>  arch/x86/kernel/smpboot.c       |  1 +
>  arch/x86/mm/tlb.c               | 15 +++++++++++++++
>  include/asm-generic/tlb.h       | 17 +++++++++++++++++
>  mm/mmu_gather.c                 | 15 +++++++++++++++
>  6 files changed, 67 insertions(+), 1 deletion(-)

Can the x86 maintainers please review these changes?




* Re: [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them
  2026-03-24 18:43 ` [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
@ 2026-03-25  2:43   ` Lance Yang
  0 siblings, 0 replies; 5+ messages in thread
From: Lance Yang @ 2026-03-25  2:43 UTC (permalink / raw)
  To: akpm, peterz, dave.hansen, david
  Cc: dave.hansen, ypodemsk, hughd, will, aneesh.kumar, npiggin, tglx,
	mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, shy828301,
	riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
	virtualization, kvm, linux-arch, linux-mm, linux-kernel,
	ioworker0, Lance Yang


On Tue, Mar 24, 2026 at 11:43:39AM -0700, Andrew Morton wrote:
>On Tue, 24 Mar 2026 16:52:36 +0800 Lance Yang <lance.yang@linux.dev> wrote:
>
>> [...]
>> 
>> On a 64-core Intel x86 server, the CAL interrupt count in
>> /proc/interrupts dropped from 646,316 to 785 when collapsing a 20 GiB
>> range with this series applied.
>
>Well that's nice.
>
>Which other architectures could utilize this?

Thanks! RISC-V looks like a candidate, and if I get some time I'll dive
into it.

>
>> David Hildenbrand did the initial implementation. I built on his work and
>> relied on off-list discussions to push it further - thanks a lot David!
>>
>> ...
>>
>>  arch/x86/include/asm/tlb.h      | 18 +++++++++++++++++-
>>  arch/x86/include/asm/tlbflush.h |  2 ++
>>  arch/x86/kernel/smpboot.c       |  1 +
>>  arch/x86/mm/tlb.c               | 15 +++++++++++++++
>>  include/asm-generic/tlb.h       | 17 +++++++++++++++++
>>  mm/mmu_gather.c                 | 15 +++++++++++++++
>>  6 files changed, 67 insertions(+), 1 deletion(-)
>
>Can the x86 maintainers please review these changes?

Yes, an x86 review would be much appreciated, please!

After commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
unconditional"), tlb_remove_table_sync_one() is no longer a NOP, even on
native x86 without INVLPGB, so it ends up issuing a redundant IPI
broadcast.

Thanks,
Lance


end of thread, other threads:[~2026-03-25  2:43 UTC | newest]

Thread overview: 5+ messages
2026-03-24  8:52 [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
2026-03-24  8:52 ` [PATCH v8 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
2026-03-24  8:52 ` [PATCH v8 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
2026-03-24 18:43 ` [PATCH v8 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
2026-03-25  2:43   ` Lance Yang
