* [PATCH v7 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
2026-03-09 2:07 [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
@ 2026-03-09 2:07 ` Lance Yang
2026-03-23 11:04 ` David Hildenbrand (Arm)
2026-03-09 2:07 ` [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
2026-03-23 20:53 ` [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
2 siblings, 1 reply; 10+ messages in thread
From: Lance Yang @ 2026-03-09 2:07 UTC (permalink / raw)
To: akpm
Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang
From: Lance Yang <lance.yang@linux.dev>

When page table operations require synchronization with software/lockless
walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
TLB (tlb->freed_tables or tlb->unshared_tables).

On architectures where the TLB flush already sends IPIs to all target CPUs,
the subsequent sync IPI broadcast is redundant. This is not only costly on
large systems where it disrupts all CPUs even for single-process page table
operations, but has also been reported to hurt RT workloads[1].

Introduce tlb_table_flush_implies_ipi_broadcast() to check if the prior TLB
flush already provided the necessary synchronization. When true, the sync
calls can early-return.

A few cases rely on this synchronization:

1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
   of the PMD table for other purposes in the last remaining user after
   unsharing.

2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
   and (possibly) freeing the page table / re-depositing it.

Currently always returns false (no behavior change). The follow-up patch
will enable the optimization for x86.

[1] https://lore.kernel.org/linux-mm/1b27a3fa-359a-43d0-bdeb-c31341749367@kernel.org/
[2] https://lore.kernel.org/linux-mm/6a364356-5fea-4a6c-b959-ba3b22ce9c88@kernel.org/
[3] https://lore.kernel.org/linux-mm/2cb4503d-3a3f-4f6c-8038-7b3d1c74b3c2@kernel.org/

Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/asm-generic/tlb.h | 17 +++++++++++++++++
mm/mmu_gather.c | 15 +++++++++++++++
2 files changed, 32 insertions(+)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index bdcc2778ac64..cb41cc6a0024 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -240,6 +240,23 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
}
#endif /* CONFIG_MMU_GATHER_TABLE_FREE */
+/**
+ * tlb_table_flush_implies_ipi_broadcast - does TLB flush imply IPI sync
+ *
+ * When page table operations require synchronization with software/lockless
+ * walkers, they flush the TLB (tlb->freed_tables or tlb->unshared_tables)
+ * then call tlb_remove_table_sync_{one,rcu}(). If the flush already sent
+ * IPIs to all CPUs, the sync call is redundant.
+ *
+ * Returns false by default. Architectures can override by defining this.
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+ return false;
+}
+#endif
+
#ifdef CONFIG_MMU_GATHER_RCU_TABLE_FREE
/*
* This allows an architecture that does not use the linux page-tables for
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index 3985d856de7f..37a6a711c37e 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -283,6 +283,14 @@ void tlb_remove_table_sync_one(void)
* It is however sufficient for software page-table walkers that rely on
* IRQ disabling.
*/
+
+ /*
+ * Skip IPI if the preceding TLB flush already synchronized with
+ * all CPUs that could be doing software/lockless page table walks.
+ */
+ if (tlb_table_flush_implies_ipi_broadcast())
+ return;
+
smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
}
@@ -312,6 +320,13 @@ static void tlb_remove_table_free(struct mmu_table_batch *batch)
*/
void tlb_remove_table_sync_rcu(void)
{
+ /*
+ * Skip RCU wait if the preceding TLB flush already synchronized
+ * with all CPUs that could be doing software/lockless page table walks.
+ */
+ if (tlb_table_flush_implies_ipi_broadcast())
+ return;
+
synchronize_rcu();
}
--
2.49.0
^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs
2026-03-09 2:07 ` [PATCH v7 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
@ 2026-03-23 11:04 ` David Hildenbrand (Arm)
0 siblings, 0 replies; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-23 11:04 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: peterz, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On 3/9/26 03:07, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> [...]
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread
* [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
2026-03-09 2:07 [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
2026-03-09 2:07 ` [PATCH v7 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
@ 2026-03-09 2:07 ` Lance Yang
2026-03-16 2:36 ` Lance Yang
2026-03-23 11:10 ` David Hildenbrand (Arm)
2026-03-23 20:53 ` [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
2 siblings, 2 replies; 10+ messages in thread
From: Lance Yang @ 2026-03-09 2:07 UTC (permalink / raw)
To: akpm
Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0, Lance Yang
From: Lance Yang <lance.yang@linux.dev>

Enable the optimization introduced in the previous patch for x86.

native_pv_tlb_init() checks whether native_flush_tlb_multi() is in use.
On CONFIG_PARAVIRT systems, it checks pv_ops; on non-PARAVIRT, native
flush is always in use.

It decides once at boot whether to enable the optimization: if using
native TLB flush and INVLPGB is not supported, we know IPIs were sent
and can skip the redundant sync. The decision is fixed via a static
key as Peter suggested[1].

PV backends (KVM, Xen, Hyper-V) typically have their own implementations
and don't call native_flush_tlb_multi() directly, so they cannot be trusted
to provide the IPI guarantees we need.

Two-step plan as David suggested[2]:

Step 1 (this patch): Skip redundant sync when we're 100% certain the TLB
flush sent IPIs. INVLPGB is excluded because when supported, we cannot
guarantee IPIs were sent, keeping it clean and simple.

Step 2 (future work): Send targeted IPIs only to CPUs actually doing
software/lockless page table walks, benefiting all architectures.

Regarding Step 2, it obviously only applies to setups where Step 1 does
not apply: like x86 with INVLPGB or arm64.

[1] https://lore.kernel.org/linux-mm/20260302145652.GH1395266@noisy.programming.kicks-ass.net/
[2] https://lore.kernel.org/linux-mm/bbfdf226-4660-4949-b17b-0d209ee4ef8c@kernel.org/

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
arch/x86/include/asm/tlbflush.h | 2 ++
arch/x86/kernel/smpboot.c | 1 +
arch/x86/mm/tlb.c | 15 +++++++++++++++
4 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..99de622d3856 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,11 +5,21 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+DECLARE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+ return static_branch_likely(&tlb_ipi_broadcast_key);
+}
+
static inline void tlb_flush(struct mmu_gather *tlb)
{
unsigned long start = 0UL, end = TLB_FLUSH_ALL;
@@ -20,7 +30,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ /*
+ * Pass both freed_tables and unshared_tables so that lazy-TLB CPUs
+ * also receive IPIs during unsharing page tables.
+ */
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+ tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5a3cdc439e38..8ba853154b46 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -18,6 +18,8 @@
DECLARE_PER_CPU(u64, tlbstate_untag_mask);
+void __init native_pv_tlb_init(void);
+
void __flush_tlb_all(void);
#define TLB_FLUSH_ALL -1UL
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..3cdb04162843 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1167,6 +1167,7 @@ void __init native_smp_prepare_boot_cpu(void)
switch_gdt_and_percpu_base(me);
native_pv_lock_init();
+ native_pv_tlb_init();
}
void __init native_smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 621e09d049cb..8f5585ebaf09 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -26,6 +26,8 @@
#include "mm_internal.h"
+DEFINE_STATIC_KEY_FALSE(tlb_ipi_broadcast_key);
+
#ifdef CONFIG_PARAVIRT
# define STATIC_NOPV
#else
@@ -1834,3 +1836,16 @@ static int __init create_tlb_single_page_flush_ceiling(void)
return 0;
}
late_initcall(create_tlb_single_page_flush_ceiling);
+
+void __init native_pv_tlb_init(void)
+{
+#ifdef CONFIG_PARAVIRT
+ if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+ return;
+#endif
+
+ if (cpu_feature_enabled(X86_FEATURE_INVLPGB))
+ return;
+
+ static_branch_enable(&tlb_ipi_broadcast_key);
+}
--
2.49.0
^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
2026-03-09 2:07 ` [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
@ 2026-03-16 2:36 ` Lance Yang
2026-03-23 10:48 ` Lance Yang
2026-03-23 11:10 ` David Hildenbrand (Arm)
1 sibling, 1 reply; 10+ messages in thread
From: Lance Yang @ 2026-03-16 2:36 UTC (permalink / raw)
To: dave.hansen, peterz
Cc: akpm, david, dave.hansen, ypodemsk, hughd, will, aneesh.kumar,
npiggin, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
shy828301, riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
virtualization, kvm, linux-arch, linux-mm, linux-kernel,
ioworker0, Lance Yang
Gently ping :)
On Mon, Mar 09, 2026 at 10:07:11AM +0800, Lance Yang wrote:
>From: Lance Yang <lance.yang@linux.dev>
>
>[...]
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
2026-03-16 2:36 ` Lance Yang
@ 2026-03-23 10:48 ` Lance Yang
0 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2026-03-23 10:48 UTC (permalink / raw)
To: dave.hansen, peterz
Cc: akpm, david, dave.hansen, ypodemsk, hughd, will, aneesh.kumar,
npiggin, tglx, mingo, bp, x86, hpa, arnd, lorenzo.stoakes, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
shy828301, riel, jannh, jgross, seanjc, pbonzini, boris.ostrovsky,
virtualization, kvm, linux-arch, linux-mm, linux-kernel,
ioworker0, Lance Yang
Just following up on this series. Any feedback would be appreciated!
Thanks,
Lance
On Mon, Mar 16, 2026 at 10:36:30AM +0800, Lance Yang wrote:
>
>Gently ping :)
>
>On Mon, Mar 09, 2026 at 10:07:11AM +0800, Lance Yang wrote:
>>From: Lance Yang <lance.yang@linux.dev>
>>
>>[...]
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
2026-03-09 2:07 ` [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
2026-03-16 2:36 ` Lance Yang
@ 2026-03-23 11:10 ` David Hildenbrand (Arm)
2026-03-24 5:48 ` Lance Yang
1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand (Arm) @ 2026-03-23 11:10 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: peterz, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On 3/9/26 03:07, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Enable the optimization introduced in the previous patch for x86.
Best to make the patch description standalone, not referring to
"previous patch".
[...]
> static inline void tlb_flush(struct mmu_gather *tlb)
> {
> unsigned long start = 0UL, end = TLB_FLUSH_ALL;
> @@ -20,7 +30,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
> end = tlb->end;
> }
>
> - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
> + /*
> + * Pass both freed_tables and unshared_tables so that lazy-TLB CPUs
> + * also receive IPIs during unsharing page tables.
"unsharing of page tables" ?
I would maybe have it written as
"Treat unshared_tables just like freed_tables, such that lazy-TLB CPUs
also receive IPIs during unsharing of page tables, allowing us to
safely implement tlb_table_flush_implies_ipi_broadcast()."
> + */
> + flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
> + tlb->freed_tables || tlb->unshared_tables);
> }
In general, LGTM.
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
--
Cheers,
David
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush
2026-03-23 11:10 ` David Hildenbrand (Arm)
@ 2026-03-24 5:48 ` Lance Yang
0 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2026-03-24 5:48 UTC (permalink / raw)
To: David Hildenbrand (Arm), akpm
Cc: peterz, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On 2026/3/23 19:10, David Hildenbrand (Arm) wrote:
> On 3/9/26 03:07, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Enable the optimization introduced in the previous patch for x86.
>
> Best to make the patch description standalone, not referring to
> "previous patch".
Good point. Will make the changelog standalone ;)
> [...]
>
>> static inline void tlb_flush(struct mmu_gather *tlb)
>> {
>> unsigned long start = 0UL, end = TLB_FLUSH_ALL;
>> @@ -20,7 +30,12 @@ static inline void tlb_flush(struct mmu_gather *tlb)
>> end = tlb->end;
>> }
>>
>> - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
>> + /*
>> + * Pass both freed_tables and unshared_tables so that lazy-TLB CPUs
>> + * also receive IPIs during unsharing page tables.
>
> "unsharing of page tables" ?
Yes, that reads better.
>
> I would maybe have it written as
>
> "Treat unshared_tables just like freed_tables, such that lazy-TLB CPUs
> also receive IPIs during unsharing of page tables, allowing us to
> safely implement tlb_table_flush_implies_ipi_broadcast()."
>
>> + */
>> + flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
>> + tlb->freed_tables || tlb->unshared_tables);
>> }
Cool, this wording is much clearer :)
> In general, LGTM.
>
> Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Thanks for taking time to review!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them
2026-03-09 2:07 [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Lance Yang
2026-03-09 2:07 ` [PATCH v7 1/2] mm/mmu_gather: prepare to skip redundant sync IPIs Lance Yang
2026-03-09 2:07 ` [PATCH v7 2/2] x86/tlb: skip redundant sync IPIs for native TLB flush Lance Yang
@ 2026-03-23 20:53 ` Andrew Morton
2026-03-24 6:14 ` Lance Yang
2 siblings, 1 reply; 10+ messages in thread
From: Andrew Morton @ 2026-03-23 20:53 UTC (permalink / raw)
To: Lance Yang
Cc: peterz, david, dave.hansen, dave.hansen, ypodemsk, hughd, will,
aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa, arnd,
lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, 9 Mar 2026 10:07:09 +0800 Lance Yang <lance.yang@linux.dev> wrote:
> Hi all,
>
> When page table operations require synchronization with software/lockless
> walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
> TLB (tlb->freed_tables or tlb->unshared_tables).
>
> On architectures where the TLB flush already sends IPIs to all target CPUs,
> the subsequent sync IPI broadcast is redundant. This is not only costly on
> large systems where it disrupts all CPUs even for single-process page table
> operations, but has also been reported to hurt RT workloads[1].
>
> This series introduces tlb_table_flush_implies_ipi_broadcast() to check if
> the prior TLB flush already provided the necessary synchronization. When
> true, the sync calls can early-return.
>
> A few cases rely on this synchronization:
>
> 1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
> of the PMD table for other purposes in the last remaining user after
> unsharing.
>
> 2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
> and (possibly) freeing the page table / re-depositing it.
>
> Two-step plan as David suggested[4]:
>
> Step 1 (this series): Skip redundant sync when we're 100% certain the TLB
> flush sent IPIs. INVLPGB is excluded because when supported, we cannot
> guarantee IPIs were sent, keeping it clean and simple.
>
> Step 2 (future work): Send targeted IPIs only to CPUs actually doing
> software/lockless page table walks, benefiting all architectures.
>
> Step 2 naturally applies only to setups that Step 1 does not cover, such
> as x86 with INVLPGB or arm64. Step 2 work is ongoing; early attempts
> showed ~3% GUP-fast overhead. Reducing that overhead requires more work
> and tuning, so it will be submitted separately once ready.
>
> ...
>
> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> arch/x86/include/asm/tlbflush.h | 2 ++
> arch/x86/kernel/smpboot.c | 1 +
> arch/x86/mm/tlb.c | 15 +++++++++++++++
> include/asm-generic/tlb.h | 17 +++++++++++++++++
> mm/mmu_gather.c | 15 +++++++++++++++
> 6 files changed, 66 insertions(+), 1 deletion(-)
Kinda straddles both MM and x86.
I expect a v8 based on David's comments.
One merge path is for the x86 people to take this, noting David's acks.
The other merge path is via mm.git, if the x86 people can please
perform review.
And... mm.git is basically full (overflowing) for this cycle and
review/test has some catching up to do. So I'd prefer to only take the
important things. This patchset is a performance improvement but
contains no measurements to demonstrate the benefit, so I'm not able to
determine its importance!
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them
2026-03-23 20:53 ` [PATCH v7 0/2] skip redundant sync IPIs when TLB flush sent them Andrew Morton
@ 2026-03-24 6:14 ` Lance Yang
0 siblings, 0 replies; 10+ messages in thread
From: Lance Yang @ 2026-03-24 6:14 UTC (permalink / raw)
To: akpm
Cc: lance.yang, peterz, david, dave.hansen, dave.hansen, ypodemsk,
hughd, will, aneesh.kumar, npiggin, tglx, mingo, bp, x86, hpa,
arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
ryan.roberts, dev.jain, baohua, shy828301, riel, jannh, jgross,
seanjc, pbonzini, boris.ostrovsky, virtualization, kvm,
linux-arch, linux-mm, linux-kernel, ioworker0
On Mon, Mar 23, 2026 at 01:53:17PM -0700, Andrew Morton wrote:
> On Mon, 9 Mar 2026 10:07:09 +0800 Lance Yang <lance.yang@linux.dev> wrote:
>
>> Hi all,
>>
>> When page table operations require synchronization with software/lockless
>> walkers, they call tlb_remove_table_sync_{one,rcu}() after flushing the
>> TLB (tlb->freed_tables or tlb->unshared_tables).
>>
>> On architectures where the TLB flush already sends IPIs to all target CPUs,
>> the subsequent sync IPI broadcast is redundant. This is not only costly on
>> large systems where it disrupts all CPUs even for single-process page table
>> operations, but has also been reported to hurt RT workloads[1].
>>
>> This series introduces tlb_table_flush_implies_ipi_broadcast() to check if
>> the prior TLB flush already provided the necessary synchronization. When
>> true, the sync calls can early-return.
>>
>> A few cases rely on this synchronization:
>>
>> 1) hugetlb PMD unshare[2]: The problem is not the freeing but the reuse
>> of the PMD table for other purposes in the last remaining user after
>> unsharing.
>>
>> 2) khugepaged collapse[3]: Ensure no concurrent GUP-fast before collapsing
>> and (possibly) freeing the page table / re-depositing it.
>>
>> Two-step plan as David suggested[4]:
>>
>> Step 1 (this series): Skip redundant sync when we're 100% certain the TLB
>> flush sent IPIs. INVLPGB is excluded because when supported, we cannot
>> guarantee IPIs were sent, keeping it clean and simple.
>>
>> Step 2 (future work): Send targeted IPIs only to CPUs actually doing
>> software/lockless page table walks, benefiting all architectures.
>>
>> Step 2 naturally applies only to setups that Step 1 does not cover, such
>> as x86 with INVLPGB or arm64. Step 2 work is ongoing; early attempts
>> showed ~3% GUP-fast overhead. Reducing that overhead requires more work
>> and tuning, so it will be submitted separately once ready.
>>
>> ...
>>
>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>> arch/x86/include/asm/tlbflush.h | 2 ++
>> arch/x86/kernel/smpboot.c | 1 +
>> arch/x86/mm/tlb.c | 15 +++++++++++++++
>> include/asm-generic/tlb.h | 17 +++++++++++++++++
>> mm/mmu_gather.c | 15 +++++++++++++++
>> 6 files changed, 66 insertions(+), 1 deletion(-)
>
> Kinda straddles both MM and x86.
>
> I expect a v8 based on David's comments.
Yes, a v8 is on the way.
> One merge path is for the x86 people to take this, noting David's acks.
>
> The other merge path is via mm.git, if the x86 people can please
> perform review.
>
> And... mm.git is basically full (overflowing) for this cycle and
> review/test has some catching up to do. So I'd prefer to only take the
> important things. This patchset is a performance improvement but
> contains no measurements to demonstrate the benefit, so I'm not able to
> determine its importance!
That's a fair point. I should have included numbers from the start.
On a 64-core Intel x86 server, the CAL interrupt count in
/proc/interrupts dropped from 646,316 to 785 when collapsing a 20 GiB
range with this series applied.
The larger the system, the more costly redundant broadcast IPIs become.
Thanks,
Lance
^ permalink raw reply [flat|nested] 10+ messages in thread