[PATCH RFC 0/3] skip redundant TLB sync IPIs

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] skip redundant TLB sync IPIs
@ 2025-12-13  8:00 Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-13  8:00 UTC (permalink / raw)
  To: akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 1334 bytes --]

Hi all,

When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers. However, if the TLB flush already
reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs
disabled, so when the TLB flush IPI completes, any concurrent GUP-fast
must have finished.

This series introduces a way for architectures to indicate their TLB flush
already provides full synchronization, allowing the redundant IPI to be
skipped. For now, the optimization is implemented for x86 first and applied
to both hugetlb PMD unsharing and khugepaged. I will dig into other
architectures later, and folks maintaining other architectures are welcome
to help out.

David Hildenbrand did the initial implementation. I just built on his work
and relied on off-list discussions to push it further — thanks a lot David!

Lance Yang (3):
  mm/tlb: allow architectures to skip redundant TLB sync IPIs
  x86/mm: implement redundant IPI elimination for PMD unsharing
  mm/khugepaged: skip redundant IPI in collapse_huge_page()

 arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
 include/asm-generic/tlb.h  | 22 +++++++++++++++++++++-
 mm/khugepaged.c            |  7 ++++++-
 3 files changed, 43 insertions(+), 3 deletions(-)

-- 
2.49.0

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
  2025-12-13  8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
@ 2025-12-13  8:00 ` Lance Yang
  2025-12-15  5:48   ` Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
  2 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-13  8:00 UTC (permalink / raw)
  To: akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

When unsharing hugetlb PMD page tables, we currently send two IPIs:
one for TLB invalidation, and another to synchronize with concurrent
GUP-fast walkers.

However, if the TLB flush already reaches all CPUs, the second IPI is
redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
completes, any concurrent GUP-fast must have finished.

Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
their TLB flush provides full synchronization, enabling the redundant IPI
to be skipped.

The default implementation returns false to maintain current behavior.

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 324a21f53b64..3f0add95604f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 #define tlb_needs_table_invalidate() (true)
 #endif
 
+/*
+ * Architectures can override if their TLB flush already broadcasts IPIs to all
+ * CPUs when freeing or unsharing page tables.
+ *
+ * Return true only when the flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure cache entries
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 void tlb_remove_table_sync_one(void);
 
 #else
@@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
 	 * We only perform this when we are the last sharer of a page table,
 	 * as the IPI will reach all CPUs: any GUP-fast.
 	 *
+	 * However, if the TLB flush already synchronized with other CPUs
+	 * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
+	 * the additional IPI.
+	 *
 	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
 	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
 	 * required IPIs already for us.
 	 */
 	if (tlb->fully_unshared_tables) {
-		tlb_remove_table_sync_one();
+		if (!tlb_table_flush_implies_ipi_broadcast())
+			tlb_remove_table_sync_one();
 		tlb->fully_unshared_tables = false;
 	}
 }
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
  2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-15  5:48   ` Lance Yang
  2025-12-18 13:08     ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-15  5:48 UTC (permalink / raw)
  To: akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel



On 2025/12/13 16:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> When unsharing hugetlb PMD page tables, we currently send two IPIs:
> one for TLB invalidation, and another to synchronize with concurrent
> GUP-fast walkers.
> 
> However, if the TLB flush already reaches all CPUs, the second IPI is
> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
> completes, any concurrent GUP-fast must have finished.
> 
> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
> their TLB flush provides full synchronization, enabling the redundant IPI
> to be skipped.
> 
> The default implementation returns false to maintain current behavior.
> 
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
>   1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 324a21f53b64..3f0add95604f 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>   #define tlb_needs_table_invalidate() (true)
>   #endif
>   
> +/*
> + * Architectures can override if their TLB flush already broadcasts IPIs to all
> + * CPUs when freeing or unsharing page tables.
> + *
> + * Return true only when the flush guarantees:
> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
> + * - Synchronization with IRQ-disabled code like GUP-fast
> + */
> +#ifndef tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> +	return false;
> +}
> +#endif

As the kernel test robot reported[1][2], the compiler is unhappy with
patch #3:

```
    mm/khugepaged.c: In function 'collapse_huge_page':
>> >> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
     1185 |         if (!tlb_table_flush_implies_ipi_broadcast())
          |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    cc1: some warnings being treated as errors
```

I'll move tlb_table_flush_implies_ipi_broadcast() outside of
CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier
happy on architectures that don't enable that config ;)

[1] 
https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/

[2] 
https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/
> +
>   void tlb_remove_table_sync_one(void);
>   
>   #else
> @@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
>   	 * We only perform this when we are the last sharer of a page table,
>   	 * as the IPI will reach all CPUs: any GUP-fast.
>   	 *
> +	 * However, if the TLB flush already synchronized with other CPUs
> +	 * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
> +	 * the additional IPI.
> +	 *
>   	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
>   	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
>   	 * required IPIs already for us.
>   	 */
>   	if (tlb->fully_unshared_tables) {
> -		tlb_remove_table_sync_one();
> +		if (!tlb_table_flush_implies_ipi_broadcast())
> +			tlb_remove_table_sync_one();
>   		tlb->fully_unshared_tables = false;
>   	}
>   }


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs
  2025-12-15  5:48   ` Lance Yang
@ 2025-12-18 13:08     ` David Hildenbrand (Red Hat)
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel

On 12/15/25 06:48, Lance Yang wrote:
> 
> 
> On 2025/12/13 16:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> When unsharing hugetlb PMD page tables, we currently send two IPIs:
>> one for TLB invalidation, and another to synchronize with concurrent
>> GUP-fast walkers.
>>
>> However, if the TLB flush already reaches all CPUs, the second IPI is
>> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI
>> completes, any concurrent GUP-fast must have finished.
>>
>> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate
>> their TLB flush provides full synchronization, enabling the redundant IPI
>> to be skipped.
>>
>> The default implementation returns false to maintain current behavior.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>    include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
>>    1 file changed, 21 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
>> index 324a21f53b64..3f0add95604f 100644
>> --- a/include/asm-generic/tlb.h
>> +++ b/include/asm-generic/tlb.h
>> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
>>    #define tlb_needs_table_invalidate() (true)
>>    #endif
>>    
>> +/*
>> + * Architectures can override if their TLB flush already broadcasts IPIs to all
>> + * CPUs when freeing or unsharing page tables.
>> + *
>> + * Return true only when the flush guarantees:
>> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries
>> + * - Synchronization with IRQ-disabled code like GUP-fast
>> + */
>> +#ifndef tlb_table_flush_implies_ipi_broadcast
>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>> +{
>> +	return false;
>> +}
>> +#endif
> 
> As the kernel test robot reported[1][2], the compiler is unhappy with
> patch #3:
> 
> ```
>      mm/khugepaged.c: In function 'collapse_huge_page':
>>>>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
>       1185 |         if (!tlb_table_flush_implies_ipi_broadcast())
>            |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>      cc1: some warnings being treated as errors
> ```
> 
> I'll move tlb_table_flush_implies_ipi_broadcast() outside of
> CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier
> happy on architectures that don't enable that config ;)

Yeah, that's probably cleanest.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
  2025-12-13  8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
@ 2025-12-13  8:00 ` Lance Yang
  2025-12-18 13:08   ` David Hildenbrand (Red Hat)
  2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
  2 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-13  8:00 UTC (permalink / raw)
  To: akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:

	flush_tlb_mm_range(..., freed_tables || unshared_tables);

Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
without paravirt or INVLPGB, the TLB flush IPI already provides necessary
synchronization, allowing the second IPI to be skipped. For paravirt with
non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..96602b7b7210 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,24 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include <asm-generic/tlb.h>
 #include <linux/kernel.h>
 #include <vdso/bits.h>
 #include <vdso/page.h>
+#include <asm/paravirt.h>
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+	/* Paravirt may use hypercalls that don't send real IPIs. */
+	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+		return false;
+#endif
+	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+}
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -20,7 +34,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing
  2025-12-13  8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-18 13:08   ` David Hildenbrand (Red Hat)
  2025-12-22  3:19     ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel

On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
> 
> 	flush_tlb_mm_range(..., freed_tables || unshared_tables);
> 
> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> synchronization, allowing the second IPI to be skipped. For paravirt with
> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
> 
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>   1 file changed, 16 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 866ea78ba156..96602b7b7210 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -5,10 +5,24 @@
>   #define tlb_flush tlb_flush
>   static inline void tlb_flush(struct mmu_gather *tlb);
>   
> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> +
>   #include <asm-generic/tlb.h>
>   #include <linux/kernel.h>
>   #include <vdso/bits.h>
>   #include <vdso/page.h>
> +#include <asm/paravirt.h>
> +
> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> +{
> +#ifdef CONFIG_PARAVIRT
> +	/* Paravirt may use hypercalls that don't send real IPIs. */
> +	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> +		return false;
> +#endif
> +	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);

Right, here I was wondering whether we should have a new pv_ops callback 
to indicate that instead.

pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()

Or a simple boolean property that pv init code properly sets.

Something for x86 folks to give suggestions for. :)

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
  2025-12-18 13:08   ` David Hildenbrand (Red Hat)
@ 2025-12-22  3:19     ` Lance Yang
  2025-12-23  9:44       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-22  3:19 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
	dave.hansen, dev.jain, hpa, ioworker0, jannh, lance.yang,
	linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo,
	npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx,
	will, x86, ziy

From: Lance Yang <lance.yang@linux.dev>


On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
> > From: Lance Yang <lance.yang@linux.dev>
> > 
> > Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
> > ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
> > 
> > 	flush_tlb_mm_range(..., freed_tables || unshared_tables);
> > 
> > Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
> > without paravirt or INVLPGB, the TLB flush IPI already provides necessary
> > synchronization, allowing the second IPI to be skipped. For paravirt with
> > non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
> > 
> > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> > Signed-off-by: Lance Yang <lance.yang@linux.dev>
> > ---
> >   arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
> >   1 file changed, 16 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> > index 866ea78ba156..96602b7b7210 100644
> > --- a/arch/x86/include/asm/tlb.h
> > +++ b/arch/x86/include/asm/tlb.h
> > @@ -5,10 +5,24 @@
> >   #define tlb_flush tlb_flush
> >   static inline void tlb_flush(struct mmu_gather *tlb);
> >   
> > +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
> > +
> >   #include <asm-generic/tlb.h>
> >   #include <linux/kernel.h>
> >   #include <vdso/bits.h>
> >   #include <vdso/page.h>
> > +#include <asm/paravirt.h>
> > +
> > +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
> > +{
> > +#ifdef CONFIG_PARAVIRT
> > +	/* Paravirt may use hypercalls that don't send real IPIs. */
> > +	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> > +		return false;
> > +#endif
> > +	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
> 
> Right, here I was wondering whether we should have a new pv_ops callback 
> to indicate that instead.
> 
> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
> 
> Or a simple boolean property that pv init code properly sets.

Cool!

> 
> Something for x86 folks to give suggestions for. :)

I prefer to use a boolean property instead of comparing function pointers.
Something like this:

----8<----
diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index cfcb60468b01..90e9da33f2c7 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)

 	pr_info("Using hypercall for remote TLB flush\n");
 	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
+	pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
 }
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 3502939415ad..f9756df6f3f6 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -133,6 +133,19 @@ struct pv_mmu_ops {
 	void (*flush_tlb_multi)(const struct cpumask *cpus,
 				const struct flush_tlb_info *info);

+	/*
+	 * Indicates whether TLB flush IPIs provide sufficient synchronization
+	 * for GUP-fast when freeing or unsharing page tables.
+	 *
+	 * Set to true only when the TLB flush guarantees:
+	 * - IPIs reach all CPUs with potentially stale paging-structure caches
+	 * - Synchronization with IRQ-disabled code like GUP-fast
+	 *
+	 * Paravirt implementations that use hypercalls (which may not send
+	 * real IPIs) should set this to false.
+	 */
+	bool tlb_flush_implies_ipi_broadcast;
+
 	/* Hook for intercepting the destruction of an mm_struct. */
 	void (*exit_mmap)(struct mm_struct *mm);
 	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 96602b7b7210..9d20ad4786cc 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
 {
 #ifdef CONFIG_PARAVIRT
 	/* Paravirt may use hypercalls that don't send real IPIs. */
-	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+	if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
 		return false;
 #endif
 	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index df78ddee0abb..aaea83100105 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -843,6 +843,7 @@ static void __init kvm_guest_init(void)
 #ifdef CONFIG_SMP
 	if (pv_tlb_flush_supported()) {
 		pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
+		pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
 		pr_info("KVM setup pv remote TLB flush\n");
 	}

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index ab3e172dcc69..625fe93e138a 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -173,6 +173,7 @@ struct paravirt_patch_template pv_ops = {
 	.mmu.flush_tlb_kernel	= native_flush_tlb_global,
 	.mmu.flush_tlb_one_user	= native_flush_tlb_one_user,
 	.mmu.flush_tlb_multi	= native_flush_tlb_multi,
+	.mmu.tlb_flush_implies_ipi_broadcast = true,

 	.mmu.exit_mmap		= paravirt_nop,
 	.mmu.notify_page_enc_status_changed	= paravirt_nop,
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 7a35c3393df4..06eb80cfb4da 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2185,6 +2185,7 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = {
 		.flush_tlb_kernel = xen_flush_tlb,
 		.flush_tlb_one_user = xen_flush_tlb_one_user,
 		.flush_tlb_multi = xen_flush_tlb_multi,
+		.tlb_flush_implies_ipi_broadcast = false,

 		.pgd_alloc = xen_pgd_alloc,
 		.pgd_free = xen_pgd_free,
---

Native x86 sets it to true, paravirt guests (Xen/KVM/Hyper-V) set it to
false. Making the intent explicit :)

Hopefully x86 folks can give me some suggestions!
 
Thanks,
Lance

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
  2025-12-22  3:19     ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
@ 2025-12-23  9:44       ` David Hildenbrand (Red Hat)
  2025-12-23 11:13         ` Lance Yang
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-23  9:44 UTC (permalink / raw)
  To: Lance Yang
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
	dave.hansen, dev.jain, hpa, jannh, lance.yang, linux-arch,
	linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin,
	peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy

On 12/22/25 04:19, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> 
> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches:
>>>
>>> 	flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>
>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86
>>> without paravirt or INVLPGB, the TLB flush IPI already provides necessary
>>> synchronization, allowing the second IPI to be skipped. For paravirt with
>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>>    arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>>    1 file changed, 16 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>> index 866ea78ba156..96602b7b7210 100644
>>> --- a/arch/x86/include/asm/tlb.h
>>> +++ b/arch/x86/include/asm/tlb.h
>>> @@ -5,10 +5,24 @@
>>>    #define tlb_flush tlb_flush
>>>    static inline void tlb_flush(struct mmu_gather *tlb);
>>>    
>>> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>> +
>>>    #include <asm-generic/tlb.h>
>>>    #include <linux/kernel.h>
>>>    #include <vdso/bits.h>
>>>    #include <vdso/page.h>
>>> +#include <asm/paravirt.h>
>>> +
>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>> +{
>>> +#ifdef CONFIG_PARAVIRT
>>> +	/* Paravirt may use hypercalls that don't send real IPIs. */
>>> +	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>> +		return false;
>>> +#endif
>>> +	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>
>> Right, here I was wondering whether we should have a new pv_ops callback
>> to indicate that instead.
>>
>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>
>> Or a simple boolean property that pv init code properly sets.
> 
> Cool!
> 
>>
>> Something for x86 folks to give suggestions for. :)
> 
> I prefer to use a boolean property instead of comparing function pointers.
> Something like this:
> 
> ----8<----
> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
> index cfcb60468b01..90e9da33f2c7 100644
> --- a/arch/x86/hyperv/mmu.c
> +++ b/arch/x86/hyperv/mmu.c
> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
> 
>   	pr_info("Using hypercall for remote TLB flush\n");
>   	pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
> +	pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
>   }
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 3502939415ad..f9756df6f3f6 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
>   	void (*flush_tlb_multi)(const struct cpumask *cpus,
>   				const struct flush_tlb_info *info);
> 
> +	/*
> +	 * Indicates whether TLB flush IPIs provide sufficient synchronization
> +	 * for GUP-fast when freeing or unsharing page tables.
> +	 *
> +	 * Set to true only when the TLB flush guarantees:
> +	 * - IPIs reach all CPUs with potentially stale paging-structure caches
> +	 * - Synchronization with IRQ-disabled code like GUP-fast
> +	 *
> +	 * Paravirt implementations that use hypercalls (which may not send
> +	 * real IPIs) should set this to false.
> +	 */
> +	bool tlb_flush_implies_ipi_broadcast;
> +
>   	/* Hook for intercepting the destruction of an mm_struct. */
>   	void (*exit_mmap)(struct mm_struct *mm);
>   	void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc);
> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
> index 96602b7b7210..9d20ad4786cc 100644
> --- a/arch/x86/include/asm/tlb.h
> +++ b/arch/x86/include/asm/tlb.h
> @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>   {
>   #ifdef CONFIG_PARAVIRT
>   	/* Paravirt may use hypercalls that don't send real IPIs. */
> -	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
> +	if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
>   		return false;
>   #endif
>   	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);

I'd have thought that the X86_FEATURE_INVLPGB heck should then also be 
taken care of by whoever sets tlb_flush_implies_ipi_broadcast.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for
  2025-12-23  9:44       ` David Hildenbrand (Red Hat)
@ 2025-12-23 11:13         ` Lance Yang
  0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-23 11:13 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp,
	dave.hansen, dev.jain, hpa, jannh, linux-arch, linux-kernel,
	linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel,
	ryan.roberts, shy828301, tglx, will, x86, ziy, Lance Yang



On 2025/12/23 17:44, David Hildenbrand (Red Hat) wrote:
> On 12/22/25 04:19, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>>
>> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
>>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure 
>>>> caches:
>>>>
>>>>     flush_tlb_mm_range(..., freed_tables || unshared_tables);
>>>>
>>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native 
>>>> x86
>>>> without paravirt or INVLPGB, the TLB flush IPI already provides 
>>>> necessary
>>>> synchronization, allowing the second IPI to be skipped. For paravirt 
>>>> with
>>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both 
>>>> IPIs.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>    arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
>>>>    1 file changed, 16 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>>>> index 866ea78ba156..96602b7b7210 100644
>>>> --- a/arch/x86/include/asm/tlb.h
>>>> +++ b/arch/x86/include/asm/tlb.h
>>>> @@ -5,10 +5,24 @@
>>>>    #define tlb_flush tlb_flush
>>>>    static inline void tlb_flush(struct mmu_gather *tlb);
>>>> +#define tlb_table_flush_implies_ipi_broadcast 
>>>> tlb_table_flush_implies_ipi_broadcast
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void);
>>>> +
>>>>    #include <asm-generic/tlb.h>
>>>>    #include <linux/kernel.h>
>>>>    #include <vdso/bits.h>
>>>>    #include <vdso/page.h>
>>>> +#include <asm/paravirt.h>
>>>> +
>>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void)
>>>> +{
>>>> +#ifdef CONFIG_PARAVIRT
>>>> +    /* Paravirt may use hypercalls that don't send real IPIs. */
>>>> +    if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>>>> +        return false;
>>>> +#endif
>>>> +    return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
>>>
>>> Right, here I was wondering whether we should have a new pv_ops callback
>>> to indicate that instead.
>>>
>>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast()
>>>
>>> Or a simple boolean property that pv init code properly sets.
>>
>> Cool!
>>
>>>
>>> Something for x86 folks to give suggestions for. :)
>>
>> I prefer to use a boolean property instead of comparing function 
>> pointers.
>> Something like this:
>>
>> ----8<----
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index cfcb60468b01..90e9da33f2c7 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void)
>>
>>       pr_info("Using hypercall for remote TLB flush\n");
>>       pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
>> +    pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false;
>>   }
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/ 
>> asm/paravirt_types.h
>> index 3502939415ad..f9756df6f3f6 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -133,6 +133,19 @@ struct pv_mmu_ops {
>>       void (*flush_tlb_multi)(const struct cpumask *cpus,
>>                   const struct flush_tlb_info *info);
>>
>> +    /*
>> +     * Indicates whether TLB flush IPIs provide sufficient 
>> synchronization
>> +     * for GUP-fast when freeing or unsharing page tables.
>> +     *
>> +     * Set to true only when the TLB flush guarantees:
>> +     * - IPIs reach all CPUs with potentially stale paging-structure 
>> caches
>> +     * - Synchronization with IRQ-disabled code like GUP-fast
>> +     *
>> +     * Paravirt implementations that use hypercalls (which may not send
>> +     * real IPIs) should set this to false.
>> +     */
>> +    bool tlb_flush_implies_ipi_broadcast;
>> +
>>       /* Hook for intercepting the destruction of an mm_struct. */
>>       void (*exit_mmap)(struct mm_struct *mm);
>>       void (*notify_page_enc_status_changed)(unsigned long pfn, int 
>> npages, bool enc);
>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
>> index 96602b7b7210..9d20ad4786cc 100644
>> --- a/arch/x86/include/asm/tlb.h
>> +++ b/arch/x86/include/asm/tlb.h
>> @@ -18,7 +18,7 @@ static inline bool 
>> tlb_table_flush_implies_ipi_broadcast(void)
>>   {
>>   #ifdef CONFIG_PARAVIRT
>>       /* Paravirt may use hypercalls that don't send real IPIs. */
>> -    if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
>> +    if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast)
>>           return false;
>>   #endif
>>       return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
> 
> I'd have thought that the X86_FEATURE_INVLPGB heck should then also be 
> taken care of by whoever sets tlb_flush_implies_ipi_broadcast.

Makes sense!

Let's have the INVLPGB check happen at setup time, not at use time :P

Cheers,
Lance

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-13  8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
  2025-12-13  8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
@ 2025-12-13  8:00 ` Lance Yang
  2025-12-14 13:24   ` kernel test robot
                     ` (2 more replies)
  2 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-13  8:00 UTC (permalink / raw)
  To: akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
	Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
	shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel,
	Lance Yang

From: Lance Yang <lance.yang@linux.dev>

Similar to the hugetlb PMD unsharing optimization, skip the second IPI
in collapse_huge_page() when the TLB flush already provides necessary
synchronization.

Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
synchronization relied on IRQ disabling, which blocks TLB flush IPIs.

When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
INVLPGB, all x86 systems started sending the second IPI. However, on
native x86 this is redundant:

  - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
    CPUs to invalidate TLB entries

  - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
    any concurrent GUP-fast must have finished

  - tlb_remove_table_sync_one() provides no additional synchronization

On x86, skip the second IPI when running native (without paravirt) and
without INVLPGB. For paravirt with non-native flush_tlb_multi and for
INVLPGB, conservatively keep both IPIs.

Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
optimization.

Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..06ea793a8190 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_remove_table_sync_one();
+	/*
+	 * Skip the second IPI if the TLB flush above already synchronized
+	 * with concurrent GUP-fast via broadcast IPIs.
+	 */
+	if (!tlb_table_flush_implies_ipi_broadcast())
+		tlb_remove_table_sync_one();
 
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
-- 
2.49.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
@ 2025-12-14 13:24   ` kernel test robot
  2025-12-14 13:56   ` kernel test robot
  2025-12-18 13:13   ` David Hildenbrand (Red Hat)
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-12-14 13:24 UTC (permalink / raw)
  To: Lance Yang; +Cc: oe-kbuild-all

Hi Lance,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251212]
[cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev
patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
config: arc-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/

All errors (new ones prefixed by >>):

   mm/khugepaged.c: In function 'collapse_huge_page':
>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration]
    1185 |         if (!tlb_table_flush_implies_ipi_broadcast())
         |              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c

  1090	
  1091	static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
  1092				      int referenced, int unmapped,
  1093				      struct collapse_control *cc)
  1094	{
  1095		LIST_HEAD(compound_pagelist);
  1096		pmd_t *pmd, _pmd;
  1097		pte_t *pte;
  1098		pgtable_t pgtable;
  1099		struct folio *folio;
  1100		spinlock_t *pmd_ptl, *pte_ptl;
  1101		int result = SCAN_FAIL;
  1102		struct vm_area_struct *vma;
  1103		struct mmu_notifier_range range;
  1104	
  1105		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  1106	
  1107		/*
  1108		 * Before allocating the hugepage, release the mmap_lock read lock.
  1109		 * The allocation can take potentially a long time if it involves
  1110		 * sync compaction, and we do not need to hold the mmap_lock during
  1111		 * that. We will recheck the vma after taking it again in write mode.
  1112		 */
  1113		mmap_read_unlock(mm);
  1114	
  1115		result = alloc_charge_folio(&folio, mm, cc);
  1116		if (result != SCAN_SUCCEED)
  1117			goto out_nolock;
  1118	
  1119		mmap_read_lock(mm);
  1120		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1121		if (result != SCAN_SUCCEED) {
  1122			mmap_read_unlock(mm);
  1123			goto out_nolock;
  1124		}
  1125	
  1126		result = find_pmd_or_thp_or_none(mm, address, &pmd);
  1127		if (result != SCAN_SUCCEED) {
  1128			mmap_read_unlock(mm);
  1129			goto out_nolock;
  1130		}
  1131	
  1132		if (unmapped) {
  1133			/*
  1134			 * __collapse_huge_page_swapin will return with mmap_lock
  1135			 * released when it fails. So we jump out_nolock directly in
  1136			 * that case.  Continuing to collapse causes inconsistency.
  1137			 */
  1138			result = __collapse_huge_page_swapin(mm, vma, address, pmd,
  1139							     referenced);
  1140			if (result != SCAN_SUCCEED)
  1141				goto out_nolock;
  1142		}
  1143	
  1144		mmap_read_unlock(mm);
  1145		/*
  1146		 * Prevent all access to pagetables with the exception of
  1147		 * gup_fast later handled by the ptep_clear_flush and the VM
  1148		 * handled by the anon_vma lock + PG_lock.
  1149		 *
  1150		 * UFFDIO_MOVE is prevented to race as well thanks to the
  1151		 * mmap_lock.
  1152		 */
  1153		mmap_write_lock(mm);
  1154		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1155		if (result != SCAN_SUCCEED)
  1156			goto out_up_write;
  1157		/* check if the pmd is still valid */
  1158		vma_start_write(vma);
  1159		result = check_pmd_still_valid(mm, address, pmd);
  1160		if (result != SCAN_SUCCEED)
  1161			goto out_up_write;
  1162	
  1163		anon_vma_lock_write(vma->anon_vma);
  1164	
  1165		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
  1166					address + HPAGE_PMD_SIZE);
  1167		mmu_notifier_invalidate_range_start(&range);
  1168	
  1169		pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
  1170		/*
  1171		 * This removes any huge TLB entry from the CPU so we won't allow
  1172		 * huge and small TLB entries for the same virtual address to
  1173		 * avoid the risk of CPU bugs in that area.
  1174		 *
  1175		 * Parallel GUP-fast is fine since GUP-fast will back off when
  1176		 * it detects PMD is changed.
  1177		 */
  1178		_pmd = pmdp_collapse_flush(vma, address, pmd);
  1179		spin_unlock(pmd_ptl);
  1180		mmu_notifier_invalidate_range_end(&range);
  1181		/*
  1182		 * Skip the second IPI if the TLB flush above already synchronized
  1183		 * with concurrent GUP-fast via broadcast IPIs.
  1184		 */
> 1185		if (!tlb_table_flush_implies_ipi_broadcast())
  1186			tlb_remove_table_sync_one();
  1187	
  1188		pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
  1189		if (pte) {
  1190			result = __collapse_huge_page_isolate(vma, address, pte, cc,
  1191							      &compound_pagelist);
  1192			spin_unlock(pte_ptl);
  1193		} else {
  1194			result = SCAN_NO_PTE_TABLE;
  1195		}
  1196	
  1197		if (unlikely(result != SCAN_SUCCEED)) {
  1198			if (pte)
  1199				pte_unmap(pte);
  1200			spin_lock(pmd_ptl);
  1201			BUG_ON(!pmd_none(*pmd));
  1202			/*
  1203			 * We can only use set_pmd_at when establishing
  1204			 * hugepmds and never for establishing regular pmds that
  1205			 * points to regular pagetables. Use pmd_populate for that
  1206			 */
  1207			pmd_populate(mm, pmd, pmd_pgtable(_pmd));
  1208			spin_unlock(pmd_ptl);
  1209			anon_vma_unlock_write(vma->anon_vma);
  1210			goto out_up_write;
  1211		}
  1212	
  1213		/*
  1214		 * All pages are isolated and locked so anon_vma rmap
  1215		 * can't run anymore.
  1216		 */
  1217		anon_vma_unlock_write(vma->anon_vma);
  1218	
  1219		result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
  1220						   vma, address, pte_ptl,
  1221						   &compound_pagelist);
  1222		pte_unmap(pte);
  1223		if (unlikely(result != SCAN_SUCCEED))
  1224			goto out_up_write;
  1225	
  1226		/*
  1227		 * The smp_wmb() inside __folio_mark_uptodate() ensures the
  1228		 * copy_huge_page writes become visible before the set_pmd_at()
  1229		 * write.
  1230		 */
  1231		__folio_mark_uptodate(folio);
  1232		pgtable = pmd_pgtable(_pmd);
  1233	
  1234		spin_lock(pmd_ptl);
  1235		BUG_ON(!pmd_none(*pmd));
  1236		pgtable_trans_huge_deposit(mm, pmd, pgtable);
  1237		map_anon_folio_pmd_nopf(folio, pmd, vma, address);
  1238		spin_unlock(pmd_ptl);
  1239	
  1240		folio = NULL;
  1241	
  1242		result = SCAN_SUCCEED;
  1243	out_up_write:
  1244		mmap_write_unlock(mm);
  1245	out_nolock:
  1246		if (folio)
  1247			folio_put(folio);
  1248		trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
  1249		return result;
  1250	}
  1251	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
  2025-12-14 13:24   ` kernel test robot
@ 2025-12-14 13:56   ` kernel test robot
  2025-12-18 13:13   ` David Hildenbrand (Red Hat)
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-12-14 13:56 UTC (permalink / raw)
  To: Lance Yang; +Cc: llvm, oe-kbuild-all

Hi Lance,

[This is a private test report for your RFC patch.]
kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]
[also build test ERROR on next-20251212]
[cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev
patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
config: loongarch-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/

All errors (new ones prefixed by >>):

>> mm/khugepaged.c:1185:7: error: call to undeclared function 'tlb_table_flush_implies_ipi_broadcast'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
    1185 |         if (!tlb_table_flush_implies_ipi_broadcast())
         |              ^
   1 error generated.


vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c

  1090	
  1091	static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
  1092				      int referenced, int unmapped,
  1093				      struct collapse_control *cc)
  1094	{
  1095		LIST_HEAD(compound_pagelist);
  1096		pmd_t *pmd, _pmd;
  1097		pte_t *pte;
  1098		pgtable_t pgtable;
  1099		struct folio *folio;
  1100		spinlock_t *pmd_ptl, *pte_ptl;
  1101		int result = SCAN_FAIL;
  1102		struct vm_area_struct *vma;
  1103		struct mmu_notifier_range range;
  1104	
  1105		VM_BUG_ON(address & ~HPAGE_PMD_MASK);
  1106	
  1107		/*
  1108		 * Before allocating the hugepage, release the mmap_lock read lock.
  1109		 * The allocation can take potentially a long time if it involves
  1110		 * sync compaction, and we do not need to hold the mmap_lock during
  1111		 * that. We will recheck the vma after taking it again in write mode.
  1112		 */
  1113		mmap_read_unlock(mm);
  1114	
  1115		result = alloc_charge_folio(&folio, mm, cc);
  1116		if (result != SCAN_SUCCEED)
  1117			goto out_nolock;
  1118	
  1119		mmap_read_lock(mm);
  1120		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1121		if (result != SCAN_SUCCEED) {
  1122			mmap_read_unlock(mm);
  1123			goto out_nolock;
  1124		}
  1125	
  1126		result = find_pmd_or_thp_or_none(mm, address, &pmd);
  1127		if (result != SCAN_SUCCEED) {
  1128			mmap_read_unlock(mm);
  1129			goto out_nolock;
  1130		}
  1131	
  1132		if (unmapped) {
  1133			/*
  1134			 * __collapse_huge_page_swapin will return with mmap_lock
  1135			 * released when it fails. So we jump out_nolock directly in
  1136			 * that case.  Continuing to collapse causes inconsistency.
  1137			 */
  1138			result = __collapse_huge_page_swapin(mm, vma, address, pmd,
  1139							     referenced);
  1140			if (result != SCAN_SUCCEED)
  1141				goto out_nolock;
  1142		}
  1143	
  1144		mmap_read_unlock(mm);
  1145		/*
  1146		 * Prevent all access to pagetables with the exception of
  1147		 * gup_fast later handled by the ptep_clear_flush and the VM
  1148		 * handled by the anon_vma lock + PG_lock.
  1149		 *
  1150		 * UFFDIO_MOVE is prevented to race as well thanks to the
  1151		 * mmap_lock.
  1152		 */
  1153		mmap_write_lock(mm);
  1154		result = hugepage_vma_revalidate(mm, address, true, &vma, cc);
  1155		if (result != SCAN_SUCCEED)
  1156			goto out_up_write;
  1157		/* check if the pmd is still valid */
  1158		vma_start_write(vma);
  1159		result = check_pmd_still_valid(mm, address, pmd);
  1160		if (result != SCAN_SUCCEED)
  1161			goto out_up_write;
  1162	
  1163		anon_vma_lock_write(vma->anon_vma);
  1164	
  1165		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
  1166					address + HPAGE_PMD_SIZE);
  1167		mmu_notifier_invalidate_range_start(&range);
  1168	
  1169		pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
  1170		/*
  1171		 * This removes any huge TLB entry from the CPU so we won't allow
  1172		 * huge and small TLB entries for the same virtual address to
  1173		 * avoid the risk of CPU bugs in that area.
  1174		 *
  1175		 * Parallel GUP-fast is fine since GUP-fast will back off when
  1176		 * it detects PMD is changed.
  1177		 */
  1178		_pmd = pmdp_collapse_flush(vma, address, pmd);
  1179		spin_unlock(pmd_ptl);
  1180		mmu_notifier_invalidate_range_end(&range);
  1181		/*
  1182		 * Skip the second IPI if the TLB flush above already synchronized
  1183		 * with concurrent GUP-fast via broadcast IPIs.
  1184		 */
> 1185		if (!tlb_table_flush_implies_ipi_broadcast())
  1186			tlb_remove_table_sync_one();
  1187	
  1188		pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
  1189		if (pte) {
  1190			result = __collapse_huge_page_isolate(vma, address, pte, cc,
  1191							      &compound_pagelist);
  1192			spin_unlock(pte_ptl);
  1193		} else {
  1194			result = SCAN_NO_PTE_TABLE;
  1195		}
  1196	
  1197		if (unlikely(result != SCAN_SUCCEED)) {
  1198			if (pte)
  1199				pte_unmap(pte);
  1200			spin_lock(pmd_ptl);
  1201			BUG_ON(!pmd_none(*pmd));
  1202			/*
  1203			 * We can only use set_pmd_at when establishing
  1204			 * hugepmds and never for establishing regular pmds that
  1205			 * points to regular pagetables. Use pmd_populate for that
  1206			 */
  1207			pmd_populate(mm, pmd, pmd_pgtable(_pmd));
  1208			spin_unlock(pmd_ptl);
  1209			anon_vma_unlock_write(vma->anon_vma);
  1210			goto out_up_write;
  1211		}
  1212	
  1213		/*
  1214		 * All pages are isolated and locked so anon_vma rmap
  1215		 * can't run anymore.
  1216		 */
  1217		anon_vma_unlock_write(vma->anon_vma);
  1218	
  1219		result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
  1220						   vma, address, pte_ptl,
  1221						   &compound_pagelist);
  1222		pte_unmap(pte);
  1223		if (unlikely(result != SCAN_SUCCEED))
  1224			goto out_up_write;
  1225	
  1226		/*
  1227		 * The smp_wmb() inside __folio_mark_uptodate() ensures the
  1228		 * copy_huge_page writes become visible before the set_pmd_at()
  1229		 * write.
  1230		 */
  1231		__folio_mark_uptodate(folio);
  1232		pgtable = pmd_pgtable(_pmd);
  1233	
  1234		spin_lock(pmd_ptl);
  1235		BUG_ON(!pmd_none(*pmd));
  1236		pgtable_trans_huge_deposit(mm, pmd, pgtable);
  1237		map_anon_folio_pmd_nopf(folio, pmd, vma, address);
  1238		spin_unlock(pmd_ptl);
  1239	
  1240		folio = NULL;
  1241	
  1242		result = SCAN_SUCCEED;
  1243	out_up_write:
  1244		mmap_write_unlock(mm);
  1245	out_nolock:
  1246		if (folio)
  1247			folio_put(folio);
  1248		trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
  1249		return result;
  1250	}
  1251	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
  2025-12-14 13:24   ` kernel test robot
  2025-12-14 13:56   ` kernel test robot
@ 2025-12-18 13:13   ` David Hildenbrand (Red Hat)
  2025-12-18 14:35     ` Lance Yang
  2 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-18 13:13 UTC (permalink / raw)
  To: Lance Yang, akpm
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel

On 12/13/25 09:00, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
> in collapse_huge_page() when the TLB flush already provides necessary
> synchronization.
> 
> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
> 
> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
> INVLPGB, all x86 systems started sending the second IPI. However, on
> native x86 this is redundant:
> 
>    - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>      CPUs to invalidate TLB entries
> 
>    - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>      any concurrent GUP-fast must have finished
> 
>    - tlb_remove_table_sync_one() provides no additional synchronization
> 
> On x86, skip the second IPI when running native (without paravirt) and
> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
> INVLPGB, conservatively keep both IPIs.
> 
> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
> optimization.
> 
> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   mm/khugepaged.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386..06ea793a8190 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>   	_pmd = pmdp_collapse_flush(vma, address, pmd);
>   	spin_unlock(pmd_ptl);
>   	mmu_notifier_invalidate_range_end(&range);
> -	tlb_remove_table_sync_one();
> +	/*
> +	 * Skip the second IPI if the TLB flush above already synchronized
> +	 * with concurrent GUP-fast via broadcast IPIs.
> +	 */
> +	if (!tlb_table_flush_implies_ipi_broadcast())
> +		tlb_remove_table_sync_one();

We end up calling

	flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

	-> flush_tlb_mm_range(freed_tables = true)

	-> flush_tlb_multi(mm_cpumask(mm), info);

So freed_tables=true and we should be doing the right thing.

BTW, I was wondering whether we should embed that 
tlb_table_flush_implies_ipi_broadcast() check in 
tlb_remove_table_sync_one() instead.

It then relies on the caller to do the right thing (flush with 
freed_tables=true or unshared_tables = true).

Thoughts?

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-18 13:13   ` David Hildenbrand (Red Hat)
@ 2025-12-18 14:35     ` Lance Yang
  2025-12-19  8:25       ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 16+ messages in thread
From: Lance Yang @ 2025-12-18 14:35 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel, akpm



On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
> On 12/13/25 09:00, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>> in collapse_huge_page() when the TLB flush already provides necessary
>> synchronization.
>>
>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>
>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>> INVLPGB, all x86 systems started sending the second IPI. However, on
>> native x86 this is redundant:
>>
>>    - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>>      CPUs to invalidate TLB entries
>>
>>    - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>>      any concurrent GUP-fast must have finished
>>
>>    - tlb_remove_table_sync_one() provides no additional synchronization
>>
>> On x86, skip the second IPI when running native (without paravirt) and
>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>> INVLPGB, conservatively keep both IPIs.
>>
>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>> optimization.
>>
>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 97d1b2824386..06ea793a8190 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct 
>> *mm, unsigned long address,
>>       _pmd = pmdp_collapse_flush(vma, address, pmd);
>>       spin_unlock(pmd_ptl);
>>       mmu_notifier_invalidate_range_end(&range);
>> -    tlb_remove_table_sync_one();
>> +    /*
>> +     * Skip the second IPI if the TLB flush above already synchronized
>> +     * with concurrent GUP-fast via broadcast IPIs.
>> +     */
>> +    if (!tlb_table_flush_implies_ipi_broadcast())
>> +        tlb_remove_table_sync_one();
> 
> We end up calling
> 
>      flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> 
>      -> flush_tlb_mm_range(freed_tables = true)
> 
>      -> flush_tlb_multi(mm_cpumask(mm), info);
> 
> So freed_tables=true and we should be doing the right thing.

Yep ;)

> BTW, I was wondering whether we should embed that 
> tlb_table_flush_implies_ipi_broadcast() check in 
> tlb_remove_table_sync_one() instead.
> It then relies on the caller to do the right thing (flush with 
> freed_tables=true or unshared_tables = true).
> 
> Thoughts?

Good point! Let me check the other callers to ensure they
are all preceded by a flush with freed_tables=true (or unshared_tables).

Will get back to you with what I find :)

Cheers,
Lance

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-18 14:35     ` Lance Yang
@ 2025-12-19  8:25       ` David Hildenbrand (Red Hat)
  2025-12-21 10:43         ` Lance Yang
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-12-19  8:25 UTC (permalink / raw)
  To: Lance Yang
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel, akpm

On 12/18/25 15:35, Lance Yang wrote:
> 
> 
> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>> On 12/13/25 09:00, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>> in collapse_huge_page() when the TLB flush already provides necessary
>>> synchronization.
>>>
>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>
>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>> native x86 this is redundant:
>>>
>>>     - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
>>>       CPUs to invalidate TLB entries
>>>
>>>     - GUP-fast runs with IRQs disabled, so when the flush IPI completes,
>>>       any concurrent GUP-fast must have finished
>>>
>>>     - tlb_remove_table_sync_one() provides no additional synchronization
>>>
>>> On x86, skip the second IPI when running native (without paravirt) and
>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>> INVLPGB, conservatively keep both IPIs.
>>>
>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
>>> optimization.
>>>
>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>>    mm/khugepaged.c | 7 ++++++-
>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 97d1b2824386..06ea793a8190 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>> *mm, unsigned long address,
>>>        _pmd = pmdp_collapse_flush(vma, address, pmd);
>>>        spin_unlock(pmd_ptl);
>>>        mmu_notifier_invalidate_range_end(&range);
>>> -    tlb_remove_table_sync_one();
>>> +    /*
>>> +     * Skip the second IPI if the TLB flush above already synchronized
>>> +     * with concurrent GUP-fast via broadcast IPIs.
>>> +     */
>>> +    if (!tlb_table_flush_implies_ipi_broadcast())
>>> +        tlb_remove_table_sync_one();
>>
>> We end up calling
>>
>>       flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>
>>       -> flush_tlb_mm_range(freed_tables = true)
>>
>>       -> flush_tlb_multi(mm_cpumask(mm), info);
>>
>> So freed_tables=true and we should be doing the right thing.
> 
> Yep ;)
> 
>> BTW, I was wondering whether we should embed that
>> tlb_table_flush_implies_ipi_broadcast() check in
>> tlb_remove_table_sync_one() instead.
>> It then relies on the caller to do the right thing (flush with
>> freed_tables=true or unshared_tables = true).
>>
>> Thoughts?
> 
> Good point! Let me check the other callers to ensure they
> are all preceded by a flush with freed_tables=true (or unshared_tables).
> 
> Will get back to you with what I find :)

The use case in tlb_table_flush() is a bit confusing. But I would assume 
that we have a TLB flush with remove_tables=true beforehand. Otherwise 
we cannot possibly free the page table.

-- 
Cheers

David

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page()
  2025-12-19  8:25       ` David Hildenbrand (Red Hat)
@ 2025-12-21 10:43         ` Lance Yang
  0 siblings, 0 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-21 10:43 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
	x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301,
	riel, jannh, linux-arch, linux-mm, linux-kernel, akpm



On 2025/12/19 16:25, David Hildenbrand (Red Hat) wrote:
> On 12/18/25 15:35, Lance Yang wrote:
>>
>>
>> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote:
>>> On 12/13/25 09:00, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI
>>>> in collapse_huge_page() when the TLB flush already provides necessary
>>>> synchronization.
>>>>
>>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
>>>> unconditional"), bare metal x86 didn't enable 
>>>> MMU_GATHER_RCU_TABLE_FREE.
>>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
>>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs.
>>>>
>>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
>>>> INVLPGB, all x86 systems started sending the second IPI. However, on
>>>> native x86 this is redundant:
>>>>
>>>>     - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to 
>>>> all
>>>>       CPUs to invalidate TLB entries
>>>>
>>>>     - GUP-fast runs with IRQs disabled, so when the flush IPI 
>>>> completes,
>>>>       any concurrent GUP-fast must have finished
>>>>
>>>>     - tlb_remove_table_sync_one() provides no additional 
>>>> synchronization
>>>>
>>>> On x86, skip the second IPI when running native (without paravirt) and
>>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for
>>>> INVLPGB, conservatively keep both IPIs.
>>>>
>>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the 
>>>> hugetlb
>>>> optimization.
>>>>
>>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>    mm/khugepaged.c | 7 ++++++-
>>>>    1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 97d1b2824386..06ea793a8190 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct
>>>> *mm, unsigned long address,
>>>>        _pmd = pmdp_collapse_flush(vma, address, pmd);
>>>>        spin_unlock(pmd_ptl);
>>>>        mmu_notifier_invalidate_range_end(&range);
>>>> -    tlb_remove_table_sync_one();
>>>> +    /*
>>>> +     * Skip the second IPI if the TLB flush above already synchronized
>>>> +     * with concurrent GUP-fast via broadcast IPIs.
>>>> +     */
>>>> +    if (!tlb_table_flush_implies_ipi_broadcast())
>>>> +        tlb_remove_table_sync_one();
>>>
>>> We end up calling
>>>
>>>       flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>
>>>       -> flush_tlb_mm_range(freed_tables = true)
>>>
>>>       -> flush_tlb_multi(mm_cpumask(mm), info);
>>>
>>> So freed_tables=true and we should be doing the right thing.
>>
>> Yep ;)
>>
>>> BTW, I was wondering whether we should embed that
>>> tlb_table_flush_implies_ipi_broadcast() check in
>>> tlb_remove_table_sync_one() instead.
>>> It then relies on the caller to do the right thing (flush with
>>> freed_tables=true or unshared_tables = true).
>>>
>>> Thoughts?
>>
>> Good point! Let me check the other callers to ensure they
>> are all preceded by a flush with freed_tables=true (or unshared_tables).
>>
>> Will get back to you with what I find :)
> 
> The use case in tlb_table_flush() is a bit confusing. But I would assume 
> that we have a TLB flush with remove_tables=true beforehand. Otherwise 
> we cannot possibly free the page table.

Right! I assume you meant freed_tables=true (not remove_tables) ;)

Verified all callers have proper TLB flushes *beforehand*:

-> 1. mm/khugepaged.c:1188 (collapse_huge_page)

	pmdp_collapse_flush(vma, address, pmd)
	-> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE)
		-> flush_tlb_mm_range(mm, ..., freed_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

So freed_tables=true and we should be doing the right thing :)

-> 2. include/asm-generic/tlb.h:861 (tlb_flush_unshared_tables)

	tlb_flush_mmu_tlbonly(tlb)
	-> tlb_flush(tlb)
		-> flush_tlb_mm_range(mm, ..., unshared_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

unshared_tables=true (equivalent to freed_tables for sending IPIs).

-> 3. mm/mmu_gather.c:341 (__tlb_remove_table_one)

When we can't allocate a batch page in tlb_remove_table(), we do:

	tlb_table_invalidate(tlb)
	-> tlb_flush_mmu_tlbonly(tlb)
		-> flush_tlb_mm_range(mm, ..., freed_tables = true)
			-> flush_tlb_multi(mm_cpumask(mm), info)

	Then:
	tlb_remove_table_one(table)
	-> __tlb_remove_table_one(table) // if !CONFIG_PT_RECLAIM
		-> tlb_remove_table_sync_one()

freed_tables=true, and this should work too.

Why is tlb->freed_tables guaranteed? Because callers like pte_free_tlb()
(via free_pte_range) set freed_tables=true before calling __pte_free_tlb(),
which then calls tlb_remove_table(). As you mentioned, we cannot free page
tables without freed_tables=true.

Note that tlb_remove_table_sync_one() was a NOP on bare metal x86
(CONFIG_MMU_GATHER_RCU_TABLE_FREE=n) before commit a37259732a7d.

-> 4-5. mm/khugepaged.c:1683,1819 (pmdp_get_lockless_sync macro)

Same as #1.

So all callers satisfy the requirement! Will embed the check in v2.

Hopefully I didn't miss any callers ;)

Cheers,
Lance

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2025-12-23 11:13 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-13  8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang
2025-12-13  8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
2025-12-15  5:48   ` Lance Yang
2025-12-18 13:08     ` David Hildenbrand (Red Hat)
2025-12-13  8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang
2025-12-18 13:08   ` David Hildenbrand (Red Hat)
2025-12-22  3:19     ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang
2025-12-23  9:44       ` David Hildenbrand (Red Hat)
2025-12-23 11:13         ` Lance Yang
2025-12-13  8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang
2025-12-14 13:24   ` kernel test robot
2025-12-14 13:56   ` kernel test robot
2025-12-18 13:13   ` David Hildenbrand (Red Hat)
2025-12-18 14:35     ` Lance Yang
2025-12-19  8:25       ` David Hildenbrand (Red Hat)
2025-12-21 10:43         ` Lance Yang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.