* [PATCH RFC 0/3] skip redundant TLB sync IPIs
@ 2025-12-13 8:00 Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
` (2 more replies)
0 siblings, 3 replies; 14+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel
Hi all,
When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers. However, if the TLB flush already
reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs
disabled, so when the TLB flush IPI completes, any concurrent GUP-fast
must have finished.
This series introduces a way for architectures to indicate that their TLB
flush already provides full synchronization, allowing the redundant IPI to
be skipped. For now, the optimization is implemented for x86 and applied
to both hugetlb PMD unsharing and khugepaged. I will dig into other
architectures later, and folks maintaining other architectures are welcome
to help out.
David Hildenbrand did the initial implementation. I just built on his work
and relied on off-list discussions to push it further — thanks a lot David!
Lance Yang (3):
mm/tlb: allow architectures to skip redundant TLB sync IPIs
x86/mm: implement redundant IPI elimination for PMD unsharing
mm/khugepaged: skip redundant IPI in collapse_huge_page()
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
mm/khugepaged.c | 7 ++++++-
3 files changed, 43 insertions(+), 3 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-15 5:48 ` Lance Yang
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
When unsharing hugetlb PMD page tables, we currently send two IPIs:
one for TLB invalidation, and another to synchronize with concurrent
GUP-fast walkers.
However, if the TLB flush already reaches all CPUs, the second IPI is
redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
completes, any concurrent GUP-fast must have finished.
Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
their TLB flush provides full synchronization, enabling the redundant IPI
to be skipped.
The default implementation returns false to maintain current behavior.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 324a21f53b64..3f0add95604f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
#define tlb_needs_table_invalidate() (true)
#endif
+/*
+ * Architectures can override if their TLB flush already broadcasts IPIs to all
+ * CPUs when freeing or unsharing page tables.
+ *
+ * Return true only when the flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure cache entries
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+ return false;
+}
+#endif
+
void tlb_remove_table_sync_one(void);
#else
@@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
* We only perform this when we are the last sharer of a page table,
* as the IPI will reach all CPUs: any GUP-fast.
*
+ * However, if the TLB flush already synchronized with other CPUs
+ * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
+ * the additional IPI.
+ *
* Note that on configs where tlb_remove_table_sync_one() is a NOP,
* the expectation is that the tlb_flush_mmu_tlbonly() would have issued
* required IPIs already for us.
*/
if (tlb->fully_unshared_tables) {
- tlb_remove_table_sync_one();
+ if (!tlb_table_flush_implies_ipi_broadcast())
+ tlb_remove_table_sync_one();
tlb->fully_unshared_tables = false;
}
}
--
2.49.0
* [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
flush_tlb_mm_range(..., freed_tables || unshared_tables);
Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
without paravirt or INVLPGB, the TLB flush IPI already provides necessary
synchronization, allowing the second IPI to be skipped. For paravirt with
non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..96602b7b7210 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,24 @@
#define tlb_flush tlb_flush
static inline void tlb_flush(struct mmu_gather *tlb);
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
#include <asm-generic/tlb.h>
#include <linux/kernel.h>
#include <vdso/bits.h>
#include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+ /* Paravirt may use hypercalls that don't send real IPIs. */
+ if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+ return false;
+#endif
+ return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+}
static inline void tlb_flush(struct mmu_gather *tlb)
{
@@ -20,7 +34,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
end = tlb->end;
}
- flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+ flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+ tlb->freed_tables || tlb->unshared_tables);
}
static inline void invlpg(unsigned long addr)
--
2.49.0
* [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-13 8:00 ` Lance Yang
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
Lance Yang
From: Lance Yang <lance.yang@linux.dev>
Similar to the hugetlb PMD unsharing optimization, skip the second IPI
in collapse_huge_page() when the TLB flush already provides necessary
synchronization.
Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
INVLPGB, all x86 systems started sending the second IPI. However, on
native x86 this is redundant:
- pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
CPUs to invalidate TLB entries
- GUP-fast runs with IRQs disabled, so when the flush IPI completes,
any concurrent GUP-fast must have finished
- tlb_remove_table_sync_one() provides no additional synchronization
On x86, skip the second IPI when running native (without paravirt) and
without INVLPGB. For paravirt with non-native flush_tlb_multi and for
INVLPGB, conservatively keep both IPIs.
Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
optimization.
Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..06ea793a8190 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
_pmd = pmdp_collapse_flush(vma, address, pmd);
spin_unlock(pmd_ptl);
mmu_notifier_invalidate_range_end(&range);
- tlb_remove_table_sync_one();
+ /*
+ * Skip the second IPI if the TLB flush above already synchronized
+ * with concurrent GUP-fast via broadcast IPIs.
+ */
+ if (!tlb_table_flush_implies_ipi_broadcast())
+ tlb_remove_table_sync_one();
pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
if (pte) {
--
2.49.0
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-15 5:48 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-15 5:48 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel
On 2025/12/13 16:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> When unsharing hugetlb PMD page tables, we currently send two IPIs:
> one for TLB invalidation, and another to synchronize with concurrent
> GUP-fast walkers.
>
> However, if the TLB flush already reaches all CPUs, the second IPI is
> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
> completes, any concurrent GUP-fast must have finished.
>
> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
> their TLB flush provides full synchronization, enabling the redundant IPI
> to be skipped.
>
> The default implementation returns false to maintain current behavior.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
> 1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 324a21f53b64..3f0add95604f 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
> #define tlb_needs_table_invalidate() (true)
> #endif
>
> +/*
> + * Architectures can override if their TLB flush already broadcasts IPIs to all
> + * CPUs when freeing or unsharing page tables.
> + *
> + * Return true only when the flush guarantees:
> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
> + * - Synchronization with IRQ-disabled code like GUP-fast
> + */
> +#ifndef tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> + return false;
> +}
> +#endif
As the kernel test robot reported[1][2], the compiler is unhappy with
patch #3:
```
mm/khugepaged.c: In function 'collapse_huge_page':
>> >> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
1185 | if (!tlb_table_flush_implies_ipi_broadcast())
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
```
I'll move tlb_table_flush_implies_ipi_broadcast() outside of
CONFIG_MMU_GATHER_RCU_TABLE_FREE in the next version, making the compiler
happy on architectures that don't enable that config ;)
[1]
https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/
[2]
https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
> +
> void tlb_remove_table_sync_one(void);
>
> #else
> @@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
> * We only perform this when we are the last sharer of a page table,
> * as the IPI will reach all CPUs: any GUP-fast.
> *
> + * However, if the TLB flush already synchronized with other CPUs
> + * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
> + * the additional IPI.
> + *
> * Note that on configs where tlb_remove_table_sync_one() is a NOP,
> * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
> * required IPIs already for us.
> */
> if (tlb->fully_unshared_tables) {
> - tlb_remove_table_sync_one();
> + if (!tlb_table_flush_implies_ipi_broadcast())
> + tlb_remove_table_sync_one();
> tlb->fully_unshared_tables = false;
> }
> }
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>
> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>
> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> synchronization, allowing the second IPI to be skipped. For paravirt with
> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> 1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 866ea78ba156..96602b7b7210 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -5,10 +5,24 @@
> #define tlb_flush tlb_flush
> static inline void tlb_flush(struct mmu_gather *tlb);
>
> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> +
> #include <asm-generic/tlb.h>
> #include <linux/kernel.h>
> #include <vdso/bits.h>
> #include <vdso/page.h>
> +#include <asm/paravirt.h>
> +
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> +#ifdef CONFIG_PARAVIRT
> + /* Paravirt may use hypercalls that don't send real IPIs. */
> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> + return false;
> +#endif
> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
Right, here I was wondering whether we should have a new pv_ops callback
to indicate that instead.
pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
Or a simple boolean property that pv init code properly sets.
Something for x86 folks to give suggestions for. :)
--
Cheers
David
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
2025-12-15 5:48 ` Lance Yang
@ 2025-12-18 13:08 ` David Hildenbrand (Red Hat)
0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/15/25 06:48, Lance Yang wrote:
>
>
> On 2025/12/13 16:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When unsharing hugetlb PMD page tables, we currently send two IPIs:
>> one for TLB invalidation, and another to synchronize with concurrent
>> GUP-fast walkers.
>>
>> However, if the TLB flush already reaches all CPUs, the second IPI is
>> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
>> completes, any concurrent GUP-fast must have finished.
>>
>> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
>> their TLB flush provides full synchronization, enabling the redundant IPI
>> to be skipped.
>>
>> The default implementation returns false to maintain current behavior.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
>> 1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 324a21f53b64..3f0add95604f 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>> #define tlb_needs_table_invalidate() (true)
>> #endif
>>
>> +/*
>> + * Architectures can override if their TLB flush already broadcasts IPIs to all
>> + * CPUs when freeing or unsharing page tables.
>> + *
>> + * Return true only when the flush guarantees:
>> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
>> + * - Synchronization with IRQ-disabled code like GUP-fast
>> + */
>> +#ifndef tlb_table_flush_implies_ipi_broadcast
>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>> +{
>> + return false;
>> +}
>> +#endif
>
> As the kernel test robot reported[1][2], the compiler is unhappy with
> patch #3:
>
> ```
> mm/khugepaged.c: In function 'collapse_huge_page':
>>>>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
> 1185 | if (!tlb_table_flush_implies_ipi_broadcast())
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> cc1: some warnings being treated as errors
> ```
>
> I'll move tlb_table_flush_implies_ipi_broadcast() outside of
> CONFIG_MMU_GATHER_RCU_TABLE_FREE in the next version, making the compiler
> happy on architectures that don't enable that config ;)
Yeah, that's probably cleanest.
--
Cheers
David
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
@ 2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2025-12-18 14:35 ` Lance Yang
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:13 UTC (permalink / raw)
To: Lance Yang, akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel
On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
> in collapse_huge_page() when the TLB flush already provides necessary
> synchronization.
>
> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>
> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
> INVLPGB, all x86 systems started sending the second IPI. However, on
> native x86 this is redundant:
>
> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
> CPUs to invalidate TLB entries
>
> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
> any concurrent GUP-fast must have finished
>
> - tlb_remove_table_sync_one() provides no additional synchronization
>
> On x86, skip the second IPI when running native (without paravirt) and
> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
> INVLPGB, conservatively keep both IPIs.
>
> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
> optimization.
>
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
> mm/khugepaged.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..06ea793a8190 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> _pmd = pmdp_collapse_flush(vma, address, pmd);
> spin_unlock(pmd_ptl);
> mmu_notifier_invalidate_range_end(&range);
> - tlb_remove_table_sync_one();
> + /*
> + * Skip the second IPI if the TLB flush above already synchronized
> + * with concurrent GUP-fast via broadcast IPIs.
> + */
> + if (!tlb_table_flush_implies_ipi_broadcast())
> + tlb_remove_table_sync_one();
We end up calling
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
-> flush_tlb_mm_range(freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info);
So freed_tables=true and we should be doing the right thing.
BTW, I was wondering whether we should embed that
tlb_table_flush_implies_ipi_broadcast() check in
tlb_remove_table_sync_one() instead.
It then relies on the caller to do the right thing (flush with
freed_tables=true or unshared_tables = true).
Thoughts?
--
Cheers
David
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
@ 2025-12-18 14:35 ` Lance Yang
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-18 14:35 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>> in collapse_huge_page() when the TLB flush already provides necessary
>> synchronization.
>>
>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>
>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>> INVLPGB, all x86 systems started sending the second IPI. However, on
>> native x86 this is redundant:
>>
>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>> CPUs to invalidate TLB entries
>>
>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>> any concurrent GUP-fast must have finished
>>
>> - tlb_remove_table_sync_one() provides no additional synchronization
>>
>> On x86, skip the second IPI when running native (without paravirt) and
>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>> INVLPGB, conservatively keep both IPIs.
>>
>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>> optimization.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>> mm/khugepaged.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..06ea793a8190 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>> *mm, unsigned long address,
>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>> spin_unlock(pmd_ptl);
>> mmu_notifier_invalidate_range_end(&range);
>> - tlb_remove_table_sync_one();
>> + /*
>> + * Skip the second IPI if the TLB flush above already synchronized
>> + * with concurrent GUP-fast via broadcast IPIs.
>> + */
>> + if (!tlb_table_flush_implies_ipi_broadcast())
>> + tlb_remove_table_sync_one();
>
> We end up calling
>
> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> -> flush_tlb_mm_range(freed_tables = true)
>
> -> flush_tlb_multi(mm_cpumask(mm), info);
>
> So freed_tables=true and we should be doing the right thing.
Yep ;)
> BTW, I was wondering whether we should embed that
> tlb_table_flush_implies_ipi_broadcast() check in
> tlb_remove_table_sync_one() instead.
> It then relies on the caller to do the right thing (flush with
> freed_tables=true or unshared_tables = true).
>
> Thoughts?
Good point! Let me check the other callers to ensure they
are all preceded by a flush with freed_tables=true (or unshared_tables).
Will get back to you with what I find :)
Cheers,
Lance
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-18 14:35 ` Lance Yang
@ 2025-12-19 8:25 ` David Hildenbrand (Red Hat)
2025-12-21 10:43 ` Lance Yang
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19 8:25 UTC (permalink / raw)
To: Lance Yang
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 12/18/25 15:35, Lance Yang wrote:
>
>
> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>> in collapse_huge_page() when the TLB flush already provides necessary
>>> synchronization.
>>>
>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>
>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>> native x86 this is redundant:
>>>
>>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>>> CPUs to invalidate TLB entries
>>>
>>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>>> any concurrent GUP-fast must have finished
>>>
>>> - tlb_remove_table_sync_one() provides no additional synchronization
>>>
>>> On x86, skip the second IPI when running native (without paravirt) and
>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>> INVLPGB, conservatively keep both IPIs.
>>>
>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>>> optimization.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>> mm/khugepaged.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 97d1b2824386..06ea793a8190 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>> *mm, unsigned long address,
>>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>>> spin_unlock(pmd_ptl);
>>> mmu_notifier_invalidate_range_end(&range);
>>> - tlb_remove_table_sync_one();
>>> + /*
>>> + * Skip the second IPI if the TLB flush above already synchronized
>>> + * with concurrent GUP-fast via broadcast IPIs.
>>> + */
>>> + if (!tlb_table_flush_implies_ipi_broadcast())
>>> + tlb_remove_table_sync_one();
>>
>> We end up calling
>>
>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>
>> -> flush_tlb_mm_range(freed_tables = true)
>>
>> -> flush_tlb_multi(mm_cpumask(mm), info);
>>
>> So freed_tables=true and we should be doing the right thing.
>
> Yep ;)
>
>> BTW, I was wondering whether we should embed that
>> tlb_table_flush_implies_ipi_broadcast() check in
>> tlb_remove_table_sync_one() instead.
>> It then relies on the caller to do the right thing (flush with
>> freed_tables=true or unshared_tables = true).
>>
>> Thoughts?
>
> Good point! Let me check the other callers to ensure they
> are all preceded by a flush with freed_tables=true (or unshared_tables).
>
> Will get back to you with what I find :)
The use case in tlb_table_flush() is a bit confusing. But I would assume
that we have a TLB flush with freed_tables=true beforehand. Otherwise
we cannot possibly free the page table.
--
Cheers
David
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
@ 2025-12-21 10:43 ` Lance Yang
0 siblings, 0 replies; 14+ messages in thread
From: Lance Yang @ 2025-12-21 10:43 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
riel, jannh, linux-arch, linux-mm, linux-kernel, akpm
On 2025/12/19 16:25, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 15:35, Lance Yang wrote:
>>
>>
>> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>>> in collapse_huge_page() when the TLB flush already provides necessary
>>>> synchronization.
>>>>
>>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>>
>>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>>> native x86 this is redundant:
>>>>
>>>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>>>> CPUs to invalidate TLB entries
>>>>
>>>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>>>> any concurrent GUP-fast must have finished
>>>>
>>>> - tlb_remove_table_sync_one() provides no additional synchronization
>>>>
>>>> On x86, skip the second IPI when running native (without paravirt) and
>>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>>> INVLPGB, conservatively keep both IPIs.
>>>>
>>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>>>> optimization.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>> mm/khugepaged.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 97d1b2824386..06ea793a8190 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>>> _pmd = pmdp_collapse_flush(vma, address, pmd);
>>>> spin_unlock(pmd_ptl);
>>>> mmu_notifier_invalidate_range_end(&range);
>>>> - tlb_remove_table_sync_one();
>>>> + /*
>>>> + * Skip the second IPI if the TLB flush above already synchronized
>>>> + * with concurrent GUP-fast via broadcast IPIs.
>>>> + */
>>>> + if (!tlb_table_flush_implies_ipi_broadcast())
>>>> + tlb_remove_table_sync_one();
>>>
>>> We end up calling
>>>
>>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>
>>> -> flush_tlb_mm_range(freed_tables = true)
>>>
>>> -> flush_tlb_multi(mm_cpumask(mm), info);
>>>
>>> So freed_tables=true and we should be doing the right thing.
>>
>> Yep ;)
>>
>>> BTW, I was wondering whether we should embed that
>>> tlb_table_flush_implies_ipi_broadcast() check in
>>> tlb_remove_table_sync_one() instead.
>>> It then relies on the caller to do the right thing (flush with
>>> freed_tables=true or unshared_tables = true).
>>>
>>> Thoughts?
>>
>> Good point! Let me check the other callers to ensure they
>> are all preceded by a flush with freed_tables=true (or unshared_tables).
>>
>> Will get back to you with what I find :)
>
> The use case in tlb_table_flush() is a bit confusing. But I would assume
> that we have a TLB flush with remove_tables=true beforehand. Otherwise
> we cannot possibly free the page table.
Right! I assume you meant freed_tables=true (not remove_tables) ;)
Verified all callers have proper TLB flushes *beforehand*:
-> 1. mm/khugepaged.c:1188 (collapse_huge_page)
pmdp_collapse_flush(vma, address, pmd)
-> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE)
-> flush_tlb_mm_range(mm, ..., freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
So freed_tables=true and we should be doing the right thing :)
-> 2. include/asm-generic/tlb.h:861 (tlb_flush_unshared_tables)
tlb_flush_mmu_tlbonly(tlb)
-> tlb_flush(tlb)
-> flush_tlb_mm_range(mm, ..., unshared_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
unshared_tables=true (equivalent to freed_tables for sending IPIs).
-> 3. mm/mmu_gather.c:341 (__tlb_remove_table_one)
When we can't allocate a batch page in tlb_remove_table(), we do:
tlb_table_invalidate(tlb)
-> tlb_flush_mmu_tlbonly(tlb)
-> flush_tlb_mm_range(mm, ..., freed_tables = true)
-> flush_tlb_multi(mm_cpumask(mm), info)
Then:
tlb_remove_table_one(table)
-> __tlb_remove_table_one(table) // if !CONFIG_PT_RECLAIM
-> tlb_remove_table_sync_one()
freed_tables=true, and this should work too.
Why is tlb->freed_tables guaranteed? Because callers like pte_free_tlb()
(via free_pte_range) set freed_tables=true before calling __pte_free_tlb(),
which then calls tlb_remove_table(). As you mentioned, we cannot free page
tables without freed_tables=true.
Note that tlb_remove_table_sync_one() was a NOP on bare metal x86
(CONFIG_MMU_GATHER_RCU_TABLE_FREE=n) before commit a37259732a7d.
-> 4-5. mm/khugepaged.c:1683,1819 (pmdp_get_lockless_sync macro)
Same as #1.
So all callers satisfy the requirement! Will embed the check in v2.
Hopefully I didn't miss any callers ;)
Cheers,
Lance
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
@ 2025-12-22 3:19 ` Lance Yang
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 14+ messages in thread
From: Lance Yang @ 2025-12-22 3:19 UTC (permalink / raw)
To: david
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, ioworker0, jannh, lance.yang,
linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo,
npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx,
will, x86, ziy
From: Lance Yang <lance.yang@linux.dev>
On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
> > From: Lance Yang <lance.yang@linux.dev>
> >
> > Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> > ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
> >
> > flush_tlb_mm_range(..., freed_tables || unshared_tables);
> >
> > Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> > without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> > synchronization, allowing the second IPI to be skipped. For paravirt with
> > non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
> >
> > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
> > ---
> > arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> > 1 file changed, 16 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> > index 866ea78ba156..96602b7b7210 100644
> > --- a/arch/x86/include/asm/tlb.h
> > +++ b/arch/x86/include/asm/tlb.h
> > @@ -5,10 +5,24 @@
> > #define tlb_flush tlb_flush
> > static inline void tlb_flush(struct mmu_gather *tlb);
> >
> > +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> > +
> > #include <asm-generic/tlb.h>
> > #include <linux/kernel.h>
> > #include <vdso/bits.h>
> > #include <vdso/page.h>
> > +#include <asm/paravirt.h>
> > +
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> > +{
> > +#ifdef CONFIG_PARAVIRT
> > + /* Paravirt may use hypercalls that don't send real IPIs. */
> > + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> > + return false;
> > +#endif
> > + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>
> Right, here I was wondering whether we should have a new pv_ops callback
> to indicate that instead.
>
> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>
> Or a simple boolean property that pv init code properly sets.
Cool!
>
> Something for x86 folks to give suggestions for. :)
I prefer to use a boolean property instead of comparing function pointers.
Something like this:
----8<----
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index cfcb60468b01..90e9da33f2c7 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
+ pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
}
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..f9756df6f3f6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,19 @@ struct pv_mmu_ops {
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
+ /*
+ * Indicates whether TLB flush IPIs provide sufficient synchronization
+ * for GUP-fast when freeing or unsharing page tables.
+ *
+ * Set to true only when the TLB flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure caches
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ *
+ * Paravirt implementations that use hypercalls (which may not send
+ * real IPIs) should set this to false.
+ */
+ bool tlb_flush_implies_ipi_broadcast;
+
/* Hook for intercepting the destruction of an mm_struct. */
void (*exit_mmap)(struct mm_struct *mm);
void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 96602b7b7210..9d20ad4786cc 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
{
#ifdef CONFIG_PARAVIRT
/* Paravirt may use hypercalls that don't send real IPIs. */
- if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+ if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
return false;
#endif
return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index df78ddee0abb..aaea83100105 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -843,6 +843,7 @@ static void __init kvm_guest_init(void)
#ifdef CONFIG_SMP
if (pv_tlb_flush_supported()) {
pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
+ pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
pr_info("KVM setup pv remote TLB flush\n");
}
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..625fe93e138a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -173,6 +173,7 @@ struct paravirt_patch_template pv_ops = {
.mmu.flush_tlb_kernel = native_flush_tlb_global,
.mmu.flush_tlb_one_user = native_flush_tlb_one_user,
.mmu.flush_tlb_multi = native_flush_tlb_multi,
+ .mmu.tlb_flush_implies_ipi_broadcast = true,
.mmu.exit_mmap = paravirt_nop,
.mmu.notify_page_enc_status_changed = paravirt_nop,
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 7a35c3393df4..06eb80cfb4da 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2185,6 +2185,7 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
.flush_tlb_kernel = xen_flush_tlb,
.flush_tlb_one_user = xen_flush_tlb_one_user,
.flush_tlb_multi = xen_flush_tlb_multi,
+ .tlb_flush_implies_ipi_broadcast = false,
.pgd_alloc = xen_pgd_alloc,
.pgd_free = xen_pgd_free,
---
Native x86 sets it to true, paravirt guests (Xen/KVM/Hyper-V) set it to
false. Making the intent explicit :)
Hopefully x86 folks can give me some suggestions!
Thanks,
Lance
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
@ 2025-12-23 9:44 ` David Hildenbrand (Red Hat)
2025-12-23 11:13 ` Lance Yang
0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23 9:44 UTC (permalink / raw)
To: Lance Yang
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, jannh, lance.yang, linux-arch,
linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy
On 12/22/25 04:19, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
>
> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>>>
>>> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>
>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
>>> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
>>> synchronization, allowing the second IPI to be skipped. For paravirt with
>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>> index 866ea78ba156..96602b7b7210 100644
>>> --- a/arch/x86/include/asm/tlb.h
>>> +++ b/arch/x86/include/asm/tlb.h
>>> @@ -5,10 +5,24 @@
>>> #define tlb_flush tlb_flush
>>> static inline void tlb_flush(struct mmu_gather *tlb);
>>>
>>> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>> +
>>> #include <asm-generic/tlb.h>
>>> #include <linux/kernel.h>
>>> #include <vdso/bits.h>
>>> #include <vdso/page.h>
>>> +#include <asm/paravirt.h>
>>> +
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>> +{
>>> +#ifdef CONFIG_PARAVIRT
>>> + /* Paravirt may use hypercalls that don't send real IPIs. */
>>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>> + return false;
>>> +#endif
>>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>
>> Right, here I was wondering whether we should have a new pv_ops callback
>> to indicate that instead.
>>
>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>
>> Or a simple boolean property that pv init code properly sets.
>
> Cool!
>
>>
>> Something for x86 folks to give suggestions for. :)
>
> I prefer to use a boolean property instead of comparing function pointers.
> Something like this:
>
> ----8<----
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index cfcb60468b01..90e9da33f2c7 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
>
> pr_info("Using hypercall for remote TLB flush\n");
> pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
> + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
> }
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 3502939415ad..f9756df6f3f6 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
> void (*flush_tlb_multi)(const struct cpumask *cpus,
> const struct flush_tlb_info *info);
>
> + /*
> + * Indicates whether TLB flush IPIs provide sufficient synchronization
> + * for GUP-fast when freeing or unsharing page tables.
> + *
> + * Set to true only when the TLB flush guarantees:
> + * - IPIs reach all CPUs with potentially stale paging-structure caches
> + * - Synchronization with IRQ-disabled code like GUP-fast
> + *
> + * Paravirt implementations that use hypercalls (which may not send
> + * real IPIs) should set this to false.
> + */
> + bool tlb_flush_implies_ipi_broadcast;
> +
> /* Hook for intercepting the destruction of an mm_struct. */
> void (*exit_mmap)(struct mm_struct *mm);
> void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 96602b7b7210..9d20ad4786cc 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> {
> #ifdef CONFIG_PARAVIRT
> /* Paravirt may use hypercalls that don't send real IPIs. */
> - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
> return false;
> #endif
> return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
I'd have thought that the X86_FEATURE_INVLPGB check should then also be
taken care of by whoever sets tlb_flush_implies_ipi_broadcast.
--
Cheers
David
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:13 ` Lance Yang
0 siblings, 0 replies; 14+ messages in thread
From: Lance Yang @ 2025-12-23 11:13 UTC (permalink / raw)
To: David Hildenbrand (Red Hat)
Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
dave.hansen, dev.jain, hpa, jannh, linux-arch, linux-kernel,
linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel,
ryan.roberts, shy828301, tglx, will, x86, ziy, Lance Yang
On 2025/12/23 17:44, David Hildenbrand (Red Hat) wrote:
> On 12/22/25 04:19, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>>
>> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>>>>
>>>> flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>>
>>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
>>>> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
>>>> synchronization, allowing the second IPI to be skipped. For paravirt with
>>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>>> index 866ea78ba156..96602b7b7210 100644
>>>> --- a/arch/x86/include/asm/tlb.h
>>>> +++ b/arch/x86/include/asm/tlb.h
>>>> @@ -5,10 +5,24 @@
>>>> #define tlb_flush tlb_flush
>>>> static inline void tlb_flush(struct mmu_gather *tlb);
>>>> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>>> +
>>>> #include <asm-generic/tlb.h>
>>>> #include <linux/kernel.h>
>>>> #include <vdso/bits.h>
>>>> #include <vdso/page.h>
>>>> +#include <asm/paravirt.h>
>>>> +
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>>> +{
>>>> +#ifdef CONFIG_PARAVIRT
>>>> + /* Paravirt may use hypercalls that don't send real IPIs. */
>>>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>>> + return false;
>>>> +#endif
>>>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>>
>>> Right, here I was wondering whether we should have a new pv_ops callback
>>> to indicate that instead.
>>>
>>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>>
>>> Or a simple boolean property that pv init code properly sets.
>>
>> Cool!
>>
>>>
>>> Something for x86 folks to give suggestions for. :)
>>
>> I prefer to use a boolean property instead of comparing function
>> pointers.
>> Something like this:
>>
>> ----8<----
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index cfcb60468b01..90e9da33f2c7 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
>>
>> pr_info("Using hypercall for remote TLB flush\n");
>> pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
>> + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
>> }
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 3502939415ad..f9756df6f3f6 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
>> void (*flush_tlb_multi)(const struct cpumask *cpus,
>> const struct flush_tlb_info *info);
>>
>> + /*
>> + * Indicates whether TLB flush IPIs provide sufficient synchronization
>> + * for GUP-fast when freeing or unsharing page tables.
>> + *
>> + * Set to true only when the TLB flush guarantees:
>> + * - IPIs reach all CPUs with potentially stale paging-structure caches
>> + * - Synchronization with IRQ-disabled code like GUP-fast
>> + *
>> + * Paravirt implementations that use hypercalls (which may not send
>> + * real IPIs) should set this to false.
>> + */
>> + bool tlb_flush_implies_ipi_broadcast;
>> +
>> /* Hook for intercepting the destruction of an mm_struct. */
>> void (*exit_mmap)(struct mm_struct *mm);
>> void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 96602b7b7210..9d20ad4786cc 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>> {
>> #ifdef CONFIG_PARAVIRT
>> /* Paravirt may use hypercalls that don't send real IPIs. */
>> - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>> + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
>> return false;
>> #endif
>> return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>
> I'd have thought that the X86_FEATURE_INVLPGB check should then also be
> taken care of by whoever sets tlb_flush_implies_ipi_broadcast.
Makes sense!
Let's have the INVLPGB check happen at setup time, not at use time :P
Cheers,
Lance
end of thread
Thread overview: 14+ messages
2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-15 5:48 ` Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-18 13:08 ` David Hildenbrand (Red Hat)
2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
2025-12-23 9:44 ` David Hildenbrand (Red Hat)
2025-12-23 11:13 ` Lance Yang
2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-18 13:13 ` David Hildenbrand (Red Hat)
2025-12-18 14:35 ` Lance Yang
2025-12-19 8:25 ` David Hildenbrand (Red Hat)
2025-12-21 10:43 ` Lance Yang