* [PATCH RFC 0/3] skip redundant TLB sync IPIs
@ 2025-12-13 8:00 Lance Yang
2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang
` (2 more replies)
0 siblings, 3 replies; 16+ messages in thread
From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw)
To: akpm
Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen,
x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0,
shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel
[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset=y, Size: 1334 bytes --]
Hi all,
When unsharing hugetlb PMD page tables or collapsing pages in khugepaged,
we send two IPIs: one for TLB invalidation, and another to synchronize
with concurrent GUP-fast walkers. However, if the TLB flush already
reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs
disabled, so when the TLB flush IPI completes, any concurrent GUP-fast
must have finished.
This series introduces a way for architectures to indicate their TLB flush
already provides full synchronization, allowing the redundant IPI to be
skipped. For now, the optimization is implemented for x86 first and applied
to both hugetlb PMD unsharing and khugepaged. I will dig into other
architectures later, and folks maintaining other architectures are welcome
to help out.
David Hildenbrand did the initial implementation. I just built on his work
and relied on off-list discussions to push it further — thanks a lot David!
Lance Yang (3):
mm/tlb: allow architectures to skip redundant TLB sync IPIs
x86/mm: implement redundant IPI elimination for PMD unsharing
mm/khugepaged: skip redundant IPI in collapse_huge_page()
arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
mm/khugepaged.c | 7 ++++++-
3 files changed, 43 insertions(+), 3 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 16+ messages in thread* [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs 2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang @ 2025-12-13 8:00 ` Lance Yang 2025-12-15 5:48 ` Lance Yang 2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang 2 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw) To: akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, Lance Yang From: Lance Yang <lance.yang@linux.dev> When unsharing hugetlb PMD page tables, we currently send two IPIs: one for TLB invalidation, and another to synchronize with concurrent GUP-fast walkers. However, if the TLB flush already reaches all CPUs, the second IPI is redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI completes, any concurrent GUP-fast must have finished. Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate their TLB flush provides full synchronization, enabling the redundant IPI to be skipped. The default implementation returns false to maintain current behavior. Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Lance Yang <lance.yang@linux.dev> --- include/asm-generic/tlb.h | 22 +++++++++++++++++++++- 1 file changed, 21 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 324a21f53b64..3f0add95604f 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table) #define tlb_needs_table_invalidate() (true) #endif +/* + * Architectures can override if their TLB flush already broadcasts IPIs to all + * CPUs when freeing or unsharing page tables. + * + * Return true only when the flush guarantees: + * - IPIs reach all CPUs with potentially stale paging-structure cache entries + * - Synchronization with IRQ-disabled code like GUP-fast + */ +#ifndef tlb_table_flush_implies_ipi_broadcast +static inline bool tlb_table_flush_implies_ipi_broadcast(void) +{ + return false; +} +#endif + void tlb_remove_table_sync_one(void); #else @@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) * We only perform this when we are the last sharer of a page table, * as the IPI will reach all CPUs: any GUP-fast. * + * However, if the TLB flush already synchronized with other CPUs + * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip + * the additional IPI. + * * Note that on configs where tlb_remove_table_sync_one() is a NOP, * the expectation is that the tlb_flush_mmu_tlbonly() would have issued * required IPIs already for us. */ if (tlb->fully_unshared_tables) { - tlb_remove_table_sync_one(); + if (!tlb_table_flush_implies_ipi_broadcast()) + tlb_remove_table_sync_one(); tlb->fully_unshared_tables = false; } } -- 2.49.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs 2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang @ 2025-12-15 5:48 ` Lance Yang 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2025-12-15 5:48 UTC (permalink / raw) To: akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel On 2025/12/13 16:00, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > When unsharing hugetlb PMD page tables, we currently send two IPIs: > one for TLB invalidation, and another to synchronize with concurrent > GUP-fast walkers. > > However, if the TLB flush already reaches all CPUs, the second IPI is > redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI > completes, any concurrent GUP-fast must have finished. > > Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate > their TLB flush provides full synchronization, enabling the redundant IPI > to be skipped. > > The default implementation returns false to maintain current behavior. > > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> > Signed-off-by: Lance Yang <lance.yang@linux.dev> > --- > include/asm-generic/tlb.h | 22 +++++++++++++++++++++- > 1 file changed, 21 insertions(+), 1 deletion(-) > > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 324a21f53b64..3f0add95604f 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table) > #define tlb_needs_table_invalidate() (true) > #endif > > +/* > + * Architectures can override if their TLB flush already broadcasts IPIs to all > + * CPUs when freeing or unsharing page tables. > + * > + * Return true only when the flush guarantees: > + * - IPIs reach all CPUs with potentially stale paging-structure cache entries > + * - Synchronization with IRQ-disabled code like GUP-fast > + */ > +#ifndef tlb_table_flush_implies_ipi_broadcast > +static inline bool tlb_table_flush_implies_ipi_broadcast(void) > +{ > + return false; > +} > +#endif As the kernel test robot reported[1][2], the compiler is unhappy with patch #3: ``` mm/khugepaged.c: In function 'collapse_huge_page': >> >> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration] 1185 | if (!tlb_table_flush_implies_ipi_broadcast()) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: some warnings being treated as errors ``` I'll move tlb_table_flush_implies_ipi_broadcast() outside of CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier happy on architectures that don't enable that config ;) [1] https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/ [2] https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/ > + > void tlb_remove_table_sync_one(void); > > #else > @@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb) > * We only perform this when we are the last sharer of a page table, > * as the IPI will reach all CPUs: any GUP-fast. > * > + * However, if the TLB flush already synchronized with other CPUs > + * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip > + * the additional IPI. > + * > * Note that on configs where tlb_remove_table_sync_one() is a NOP, > * the expectation is that the tlb_flush_mmu_tlbonly() would have issued > * required IPIs already for us. > */ > if (tlb->fully_unshared_tables) { > - tlb_remove_table_sync_one(); > + if (!tlb_table_flush_implies_ipi_broadcast()) > + tlb_remove_table_sync_one(); > tlb->fully_unshared_tables = false; > } > } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 1/3] mm/tlb: allow architectures to skip redundant TLB sync IPIs 2025-12-15 5:48 ` Lance Yang @ 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 0 siblings, 0 replies; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw) To: Lance Yang, akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel On 12/15/25 06:48, Lance Yang wrote: > > > On 2025/12/13 16:00, Lance Yang wrote: >> From: Lance Yang <lance.yang@linux.dev> >> >> When unsharing hugetlb PMD page tables, we currently send two IPIs: >> one for TLB invalidation, and another to synchronize with concurrent >> GUP-fast walkers. >> >> However, if the TLB flush already reaches all CPUs, the second IPI is >> redundant. GUP-fast runs with IRQs disabled, so when the TLB flush IPI >> completes, any concurrent GUP-fast must have finished. >> >> Add tlb_table_flush_implies_ipi_broadcast() to let architectures indicate >> their TLB flush provides full synchronization, enabling the redundant IPI >> to be skipped. >> >> The default implementation returns false to maintain current behavior. >> >> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >> Signed-off-by: Lance Yang <lance.yang@linux.dev> >> --- >> include/asm-generic/tlb.h | 22 +++++++++++++++++++++- >> 1 file changed, 21 insertions(+), 1 deletion(-) >> >> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h >> index 324a21f53b64..3f0add95604f 100644 >> --- a/include/asm-generic/tlb.h >> +++ b/include/asm-generic/tlb.h >> @@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table) >> #define tlb_needs_table_invalidate() (true) >> #endif >> >> +/* >> + * Architectures can override if their TLB flush already broadcasts IPIs to all >> + * CPUs when freeing or unsharing page tables. >> + * >> + * Return true only when the flush guarantees: >> + * - IPIs reach all CPUs with potentially stale paging-structure cache entries >> + * - Synchronization with IRQ-disabled code like GUP-fast >> + */ >> +#ifndef tlb_table_flush_implies_ipi_broadcast >> +static inline bool tlb_table_flush_implies_ipi_broadcast(void) >> +{ >> + return false; >> +} >> +#endif > > As the kernel test robot reported[1][2], the compiler is unhappy with > patch #3: > > ``` > mm/khugepaged.c: In function 'collapse_huge_page': >>>>> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration] > 1185 | if (!tlb_table_flush_implies_ipi_broadcast()) > | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > cc1: some warnings being treated as errors > ``` > > I'll move tlb_table_flush_implies_ipi_broadcast() outside of > CONFIG_MMU_GATHER_RCU_TABLE_FREE in next version, making the complier > happy on architectures that don't enable that config ;) Yeah, that's probably cleanest. -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing 2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang 2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang @ 2025-12-13 8:00 ` Lance Yang 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang 2 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw) To: akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, Lance Yang From: Lance Yang <lance.yang@linux.dev> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches: flush_tlb_mm_range(..., freed_tables || unshared_tables); Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86 without paravirt or INVLPGB, the TLB flush IPI already provides necessary synchronization, allowing the second IPI to be skipped. For paravirt with non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs. Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Lance Yang <lance.yang@linux.dev> --- arch/x86/include/asm/tlb.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index 866ea78ba156..96602b7b7210 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -5,10 +5,24 @@ #define tlb_flush tlb_flush static inline void tlb_flush(struct mmu_gather *tlb); +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast +static inline bool tlb_table_flush_implies_ipi_broadcast(void); + #include <asm-generic/tlb.h> #include <linux/kernel.h> #include <vdso/bits.h> #include <vdso/page.h> +#include <asm/paravirt.h> + +static inline bool tlb_table_flush_implies_ipi_broadcast(void) +{ +#ifdef CONFIG_PARAVIRT + /* Paravirt may use hypercalls that don't send real IPIs. */ + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) + return false; +#endif + return !cpu_feature_enabled(X86_FEATURE_INVLPGB); +} static inline void tlb_flush(struct mmu_gather *tlb) { @@ -20,7 +34,8 @@ static inline void tlb_flush(struct mmu_gather *tlb) end = tlb->end; } - flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables); + flush_tlb_mm_range(tlb->mm, start, end, stride_shift, + tlb->freed_tables || tlb->unshared_tables); } static inline void invlpg(unsigned long addr) -- 2.49.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing 2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang @ 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-18 13:08 UTC (permalink / raw) To: Lance Yang, akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel On 12/13/25 09:00, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to > ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches: > > flush_tlb_mm_range(..., freed_tables || unshared_tables); > > Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86 > without paravirt or INVLPGB, the TLB flush IPI already provides necessary > synchronization, allowing the second IPI to be skipped. For paravirt with > non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs. > > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> > Signed-off-by: Lance Yang <lance.yang@linux.dev> > --- > arch/x86/include/asm/tlb.h | 17 ++++++++++++++++- > 1 file changed, 16 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h > index 866ea78ba156..96602b7b7210 100644 > --- a/arch/x86/include/asm/tlb.h > +++ b/arch/x86/include/asm/tlb.h > @@ -5,10 +5,24 @@ > #define tlb_flush tlb_flush > static inline void tlb_flush(struct mmu_gather *tlb); > > +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast > +static inline bool tlb_table_flush_implies_ipi_broadcast(void); > + > #include <asm-generic/tlb.h> > #include <linux/kernel.h> > #include <vdso/bits.h> > #include <vdso/page.h> > +#include <asm/paravirt.h> > + > +static inline bool tlb_table_flush_implies_ipi_broadcast(void) > +{ > +#ifdef CONFIG_PARAVIRT > + /* Paravirt may use hypercalls that don't send real IPIs. */ > + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) > + return false; > +#endif > + return !cpu_feature_enabled(X86_FEATURE_INVLPGB); Right, here I was wondering whether we should have a new pv_ops callback to indicate that instead. pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast() Or a simple boolean property that pv init code properly sets. Something for x86 folks to give suggestions for. :) -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for 2025-12-18 13:08 ` David Hildenbrand (Red Hat) @ 2025-12-22 3:19 ` Lance Yang 2025-12-23 9:44 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2025-12-22 3:19 UTC (permalink / raw) To: david Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp, dave.hansen, dev.jain, hpa, ioworker0, jannh, lance.yang, linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy From: Lance Yang <lance.yang@linux.dev> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote: > On 12/13/25 09:00, Lance Yang wrote: > > From: Lance Yang <lance.yang@linux.dev> > > > > Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to > > ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches: > > > > flush_tlb_mm_range(..., freed_tables || unshared_tables); > > > > Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86 > > without paravirt or INVLPGB, the TLB flush IPI already provides necessary > > synchronization, allowing the second IPI to be skipped. For paravirt with > > non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs. > > > > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> > > Signed-off-by: Lance Yang <lance.yang@linux.dev> > > --- > > arch/x86/include/asm/tlb.h | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h > > index 866ea78ba156..96602b7b7210 100644 > > --- a/arch/x86/include/asm/tlb.h > > +++ b/arch/x86/include/asm/tlb.h > > @@ -5,10 +5,24 @@ > > #define tlb_flush tlb_flush > > static inline void tlb_flush(struct mmu_gather *tlb); > > > > +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast > > +static inline bool tlb_table_flush_implies_ipi_broadcast(void); > > + > > #include <asm-generic/tlb.h> > > #include <linux/kernel.h> > > #include <vdso/bits.h> > > #include <vdso/page.h> > > +#include <asm/paravirt.h> > > + > > +static inline bool tlb_table_flush_implies_ipi_broadcast(void) > > +{ > > +#ifdef CONFIG_PARAVIRT > > + /* Paravirt may use hypercalls that don't send real IPIs. */ > > + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) > > + return false; > > +#endif > > + return !cpu_feature_enabled(X86_FEATURE_INVLPGB); > > Right, here I was wondering whether we should have a new pv_ops callback > to indicate that instead. > > pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast() > > Or a simple boolean property that pv init code properly sets. Cool! > > Something for x86 folks to give suggestions for. :) I prefer to use a boolean property instead of comparing function pointers. Something like this: ----8<---- diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c index cfcb60468b01..90e9da33f2c7 100644 --- a/arch/x86/hyperv/mmu.c +++ b/arch/x86/hyperv/mmu.c @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void) pr_info("Using hypercall for remote TLB flush\n"); pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false; } diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 3502939415ad..f9756df6f3f6 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -133,6 +133,19 @@ struct pv_mmu_ops { void (*flush_tlb_multi)(const struct cpumask *cpus, const struct flush_tlb_info *info); + /* + * Indicates whether TLB flush IPIs provide sufficient synchronization + * for GUP-fast when freeing or unsharing page tables. + * + * Set to true only when the TLB flush guarantees: + * - IPIs reach all CPUs with potentially stale paging-structure caches + * - Synchronization with IRQ-disabled code like GUP-fast + * + * Paravirt implementations that use hypercalls (which may not send + * real IPIs) should set this to false. + */ + bool tlb_flush_implies_ipi_broadcast; + /* Hook for intercepting the destruction of an mm_struct. */ void (*exit_mmap)(struct mm_struct *mm); void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc); diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h index 96602b7b7210..9d20ad4786cc 100644 --- a/arch/x86/include/asm/tlb.h +++ b/arch/x86/include/asm/tlb.h @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void) { #ifdef CONFIG_PARAVIRT /* Paravirt may use hypercalls that don't send real IPIs. */ - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast) return false; #endif return !cpu_feature_enabled(X86_FEATURE_INVLPGB); diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index df78ddee0abb..aaea83100105 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -843,6 +843,7 @@ static void __init kvm_guest_init(void) #ifdef CONFIG_SMP if (pv_tlb_flush_supported()) { pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi; + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false; pr_info("KVM setup pv remote TLB flush\n"); } diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index ab3e172dcc69..625fe93e138a 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -173,6 +173,7 @@ struct paravirt_patch_template pv_ops = { .mmu.flush_tlb_kernel = native_flush_tlb_global, .mmu.flush_tlb_one_user = native_flush_tlb_one_user, .mmu.flush_tlb_multi = native_flush_tlb_multi, + .mmu.tlb_flush_implies_ipi_broadcast = true, .mmu.exit_mmap = paravirt_nop, .mmu.notify_page_enc_status_changed = paravirt_nop, diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 7a35c3393df4..06eb80cfb4da 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -2185,6 +2185,7 @@ static const typeof(pv_ops) xen_mmu_ops __initconst = { .flush_tlb_kernel = xen_flush_tlb, .flush_tlb_one_user = xen_flush_tlb_one_user, .flush_tlb_multi = xen_flush_tlb_multi, + .tlb_flush_implies_ipi_broadcast = false, .pgd_alloc = xen_pgd_alloc, .pgd_free = xen_pgd_free, --- Native x86 sets it to true, paravirt guests (Xen/KVM/Hyper-V) set it to false. Making the intent explicit :) Hopefully x86 folks can give me some suggestions! Thanks, Lance ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for 2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang @ 2025-12-23 9:44 ` David Hildenbrand (Red Hat) 2025-12-23 11:13 ` Lance Yang 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-23 9:44 UTC (permalink / raw) To: Lance Yang Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp, dave.hansen, dev.jain, hpa, jannh, lance.yang, linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy On 12/22/25 04:19, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > > On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote: >> On 12/13/25 09:00, Lance Yang wrote: >>> From: Lance Yang <lance.yang@linux.dev> >>> >>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to >>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure caches: >>> >>> flush_tlb_mm_range(..., freed_tables || unshared_tables); >>> >>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native x86 >>> without paravirt or INVLPGB, the TLB flush IPI already provides necessary >>> synchronization, allowing the second IPI to be skipped. For paravirt with >>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs. >>> >>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >>> Signed-off-by: Lance Yang <lance.yang@linux.dev> >>> --- >>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++- >>> 1 file changed, 16 insertions(+), 1 deletion(-) >>> >>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h >>> index 866ea78ba156..96602b7b7210 100644 >>> --- a/arch/x86/include/asm/tlb.h >>> +++ b/arch/x86/include/asm/tlb.h >>> @@ -5,10 +5,24 @@ >>> #define tlb_flush tlb_flush >>> static inline void tlb_flush(struct mmu_gather *tlb); >>> >>> +#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast >>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void); >>> + >>> #include <asm-generic/tlb.h> >>> #include <linux/kernel.h> >>> #include <vdso/bits.h> >>> #include <vdso/page.h> >>> +#include <asm/paravirt.h> >>> + >>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void) >>> +{ >>> +#ifdef CONFIG_PARAVIRT >>> + /* Paravirt may use hypercalls that don't send real IPIs. */ >>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) >>> + return false; >>> +#endif >>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB); >> >> Right, here I was wondering whether we should have a new pv_ops callback >> to indicate that instead. >> >> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast() >> >> Or a simple boolean property that pv init code properly sets. > > Cool! > >> >> Something for x86 folks to give suggestions for. :) > > I prefer to use a boolean property instead of comparing function pointers. > Something like this: > > ----8<---- > diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c > index cfcb60468b01..90e9da33f2c7 100644 > --- a/arch/x86/hyperv/mmu.c > +++ b/arch/x86/hyperv/mmu.c > @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void) > > pr_info("Using hypercall for remote TLB flush\n"); > pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; > + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false; > } > diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h > index 3502939415ad..f9756df6f3f6 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -133,6 +133,19 @@ struct pv_mmu_ops { > void (*flush_tlb_multi)(const struct cpumask *cpus, > const struct flush_tlb_info *info); > > + /* > + * Indicates whether TLB flush IPIs provide sufficient synchronization > + * for GUP-fast when freeing or unsharing page tables. > + * > + * Set to true only when the TLB flush guarantees: > + * - IPIs reach all CPUs with potentially stale paging-structure caches > + * - Synchronization with IRQ-disabled code like GUP-fast > + * > + * Paravirt implementations that use hypercalls (which may not send > + * real IPIs) should set this to false. > + */ > + bool tlb_flush_implies_ipi_broadcast; > + > /* Hook for intercepting the destruction of an mm_struct. */ > void (*exit_mmap)(struct mm_struct *mm); > void (*notify_page_enc_status_changed)(unsigned long pfn, int npages, bool enc); > diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h > index 96602b7b7210..9d20ad4786cc 100644 > --- a/arch/x86/include/asm/tlb.h > +++ b/arch/x86/include/asm/tlb.h > @@ -18,7 +18,7 @@ static inline bool tlb_table_flush_implies_ipi_broadcast(void) > { > #ifdef CONFIG_PARAVIRT > /* Paravirt may use hypercalls that don't send real IPIs. */ > - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) > + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast) > return false; > #endif > return !cpu_feature_enabled(X86_FEATURE_INVLPGB); I'd have thought that the X86_FEATURE_INVLPGB heck should then also be taken care of by whoever sets tlb_flush_implies_ipi_broadcast. -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for 2025-12-23 9:44 ` David Hildenbrand (Red Hat) @ 2025-12-23 11:13 ` Lance Yang 0 siblings, 0 replies; 16+ messages in thread From: Lance Yang @ 2025-12-23 11:13 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: Liam.Howlett, akpm, aneesh.kumar, arnd, baohua, baolin.wang, bp, dave.hansen, dev.jain, hpa, jannh, linux-arch, linux-kernel, linux-mm, lorenzo.stoakes, mingo, npache, npiggin, peterz, riel, ryan.roberts, shy828301, tglx, will, x86, ziy, Lance Yang On 2025/12/23 17:44, David Hildenbrand (Red Hat) wrote: > On 12/22/25 04:19, Lance Yang wrote: >> From: Lance Yang <lance.yang@linux.dev> >> >> >> On Thu, 18 Dec 2025 14:08:07 +0100, David Hildenbrand (Red Hat) wrote: >>> On 12/13/25 09:00, Lance Yang wrote: >>>> From: Lance Yang <lance.yang@linux.dev> >>>> >>>> Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to >>>> ensure lazy-TLB CPUs receive IPIs and flush their paging-structure >>>> caches: >>>> >>>> flush_tlb_mm_range(..., freed_tables || unshared_tables); >>>> >>>> Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native >>>> x86 >>>> without paravirt or INVLPGB, the TLB flush IPI already provides >>>> necessary >>>> synchronization, allowing the second IPI to be skipped. For paravirt >>>> with >>>> non-native flush_tlb_multi and for INVLPGB, conservatively keep both >>>> IPIs. >>>> >>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >>>> Signed-off-by: Lance Yang <lance.yang@linux.dev> >>>> --- >>>> arch/x86/include/asm/tlb.h | 17 ++++++++++++++++- >>>> 1 file changed, 16 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h >>>> index 866ea78ba156..96602b7b7210 100644 >>>> --- a/arch/x86/include/asm/tlb.h >>>> +++ b/arch/x86/include/asm/tlb.h >>>> @@ -5,10 +5,24 @@ >>>> #define tlb_flush tlb_flush >>>> static inline void tlb_flush(struct mmu_gather *tlb); >>>> +#define tlb_table_flush_implies_ipi_broadcast >>>> tlb_table_flush_implies_ipi_broadcast >>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void); >>>> + >>>> #include <asm-generic/tlb.h> >>>> #include <linux/kernel.h> >>>> #include <vdso/bits.h> >>>> #include <vdso/page.h> >>>> +#include <asm/paravirt.h> >>>> + >>>> +static inline bool tlb_table_flush_implies_ipi_broadcast(void) >>>> +{ >>>> +#ifdef CONFIG_PARAVIRT >>>> + /* Paravirt may use hypercalls that don't send real IPIs. */ >>>> + if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) >>>> + return false; >>>> +#endif >>>> + return !cpu_feature_enabled(X86_FEATURE_INVLPGB); >>> >>> Right, here I was wondering whether we should have a new pv_ops callback >>> to indicate that instead. >>> >>> pv_ops.mmu.tlb_table_flush_implies_ipi_broadcast() >>> >>> Or a simple boolean property that pv init code properly sets. >> >> Cool! >> >>> >>> Something for x86 folks to give suggestions for. :) >> >> I prefer to use a boolean property instead of comparing function >> pointers. >> Something like this: >> >> ----8<---- >> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c >> index cfcb60468b01..90e9da33f2c7 100644 >> --- a/arch/x86/hyperv/mmu.c >> +++ b/arch/x86/hyperv/mmu.c >> @@ -243,4 +243,5 @@ void hyperv_setup_mmu_ops(void) >> >> pr_info("Using hypercall for remote TLB flush\n"); >> pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; >> + pv_ops.mmu.tlb_flush_implies_ipi_broadcast = false; >> } >> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/ >> asm/paravirt_types.h >> index 3502939415ad..f9756df6f3f6 100644 >> --- a/arch/x86/include/asm/paravirt_types.h >> +++ b/arch/x86/include/asm/paravirt_types.h >> @@ -133,6 +133,19 @@ struct pv_mmu_ops { >> void (*flush_tlb_multi)(const struct cpumask *cpus, >> const struct flush_tlb_info *info); >> >> + /* >> + * Indicates whether TLB flush IPIs provide sufficient >> synchronization >> + * for GUP-fast when freeing or unsharing page tables. >> + * >> + * Set to true only when the TLB flush guarantees: >> + * - IPIs reach all CPUs with potentially stale paging-structure >> caches >> + * - Synchronization with IRQ-disabled code like GUP-fast >> + * >> + * Paravirt implementations that use hypercalls (which may not send >> + * real IPIs) should set this to false. >> + */ >> + bool tlb_flush_implies_ipi_broadcast; >> + >> /* Hook for intercepting the destruction of an mm_struct. */ >> void (*exit_mmap)(struct mm_struct *mm); >> void (*notify_page_enc_status_changed)(unsigned long pfn, int >> npages, bool enc); >> diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h >> index 96602b7b7210..9d20ad4786cc 100644 >> --- a/arch/x86/include/asm/tlb.h >> +++ b/arch/x86/include/asm/tlb.h >> @@ -18,7 +18,7 @@ static inline bool >> tlb_table_flush_implies_ipi_broadcast(void) >> { >> #ifdef CONFIG_PARAVIRT >> /* Paravirt may use hypercalls that don't send real IPIs. */ >> - if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi) >> + if (!pv_ops.mmu.tlb_flush_implies_ipi_broadcast) >> return false; >> #endif >> return !cpu_feature_enabled(X86_FEATURE_INVLPGB); > > I'd have thought that the X86_FEATURE_INVLPGB heck should then also be > taken care of by whoever sets tlb_flush_implies_ipi_broadcast. Makes sense! Let's have the INVLPGB check happen at setup time, not at use time :P Cheers, Lance ^ permalink raw reply [flat|nested] 16+ messages in thread
* [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang 2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang 2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang @ 2025-12-13 8:00 ` Lance Yang 2025-12-14 13:24 ` kernel test robot ` (2 more replies) 2 siblings, 3 replies; 16+ messages in thread From: Lance Yang @ 2025-12-13 8:00 UTC (permalink / raw) To: akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, Lance Yang From: Lance Yang <lance.yang@linux.dev> Similar to the hugetlb PMD unsharing optimization, skip the second IPI in collapse_huge_page() when the TLB flush already provides necessary synchronization. Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE. In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast synchronization relied on IRQ disabling, which blocks TLB flush IPIs. When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's INVLPGB, all x86 systems started sending the second IPI. However, on native x86 this is redundant: - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all CPUs to invalidate TLB entries - GUP-fast runs with IRQs disabled, so when the flush IPI completes, any concurrent GUP-fast must have finished - tlb_remove_table_sync_one() provides no additional synchronization On x86, skip the second IPI when running native (without paravirt) and without INVLPGB. For paravirt with non-native flush_tlb_multi and for INVLPGB, conservatively keep both IPIs. Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb optimization. Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> Signed-off-by: Lance Yang <lance.yang@linux.dev> --- mm/khugepaged.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 97d1b2824386..06ea793a8190 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, _pmd = pmdp_collapse_flush(vma, address, pmd); spin_unlock(pmd_ptl); mmu_notifier_invalidate_range_end(&range); - tlb_remove_table_sync_one(); + /* + * Skip the second IPI if the TLB flush above already synchronized + * with concurrent GUP-fast via broadcast IPIs. + */ + if (!tlb_table_flush_implies_ipi_broadcast()) + tlb_remove_table_sync_one(); pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); if (pte) { -- 2.49.0 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang @ 2025-12-14 13:24 ` kernel test robot 2025-12-14 13:56 ` kernel test robot 2025-12-18 13:13 ` David Hildenbrand (Red Hat) 2 siblings, 0 replies; 16+ messages in thread From: kernel test robot @ 2025-12-14 13:24 UTC (permalink / raw) To: Lance Yang; +Cc: oe-kbuild-all Hi Lance, [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] [also build test ERROR on next-20251212] [cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() config: arc-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/config) compiler: arc-linux-gcc (GCC) 12.5.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142156.cShiu6PU-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202512142156.cShiu6PU-lkp@intel.com/ All errors (new ones prefixed by >>): mm/khugepaged.c: In function 'collapse_huge_page': >> mm/khugepaged.c:1185:14: error: implicit declaration of function 'tlb_table_flush_implies_ipi_broadcast' [-Werror=implicit-function-declaration] 1185 | if (!tlb_table_flush_implies_ipi_broadcast()) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: some warnings being treated as errors vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c 1090 1091 static int collapse_huge_page(struct mm_struct *mm, unsigned long address, 1092 int referenced, int unmapped, 1093 struct collapse_control *cc) 1094 { 1095 LIST_HEAD(compound_pagelist); 1096 pmd_t *pmd, _pmd; 1097 pte_t *pte; 1098 pgtable_t pgtable; 1099 struct folio *folio; 1100 spinlock_t *pmd_ptl, *pte_ptl; 1101 int result = SCAN_FAIL; 1102 struct vm_area_struct *vma; 1103 struct mmu_notifier_range range; 1104 1105 VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1106 1107 /* 1108 * Before allocating the hugepage, release the mmap_lock read lock. 1109 * The allocation can take potentially a long time if it involves 1110 * sync compaction, and we do not need to hold the mmap_lock during 1111 * that. We will recheck the vma after taking it again in write mode. 1112 */ 1113 mmap_read_unlock(mm); 1114 1115 result = alloc_charge_folio(&folio, mm, cc); 1116 if (result != SCAN_SUCCEED) 1117 goto out_nolock; 1118 1119 mmap_read_lock(mm); 1120 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1121 if (result != SCAN_SUCCEED) { 1122 mmap_read_unlock(mm); 1123 goto out_nolock; 1124 } 1125 1126 result = find_pmd_or_thp_or_none(mm, address, &pmd); 1127 if (result != SCAN_SUCCEED) { 1128 mmap_read_unlock(mm); 1129 goto out_nolock; 1130 } 1131 1132 if (unmapped) { 1133 /* 1134 * __collapse_huge_page_swapin will return with mmap_lock 1135 * released when it fails. So we jump out_nolock directly in 1136 * that case. Continuing to collapse causes inconsistency. 1137 */ 1138 result = __collapse_huge_page_swapin(mm, vma, address, pmd, 1139 referenced); 1140 if (result != SCAN_SUCCEED) 1141 goto out_nolock; 1142 } 1143 1144 mmap_read_unlock(mm); 1145 /* 1146 * Prevent all access to pagetables with the exception of 1147 * gup_fast later handled by the ptep_clear_flush and the VM 1148 * handled by the anon_vma lock + PG_lock. 1149 * 1150 * UFFDIO_MOVE is prevented to race as well thanks to the 1151 * mmap_lock. 1152 */ 1153 mmap_write_lock(mm); 1154 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1155 if (result != SCAN_SUCCEED) 1156 goto out_up_write; 1157 /* check if the pmd is still valid */ 1158 vma_start_write(vma); 1159 result = check_pmd_still_valid(mm, address, pmd); 1160 if (result != SCAN_SUCCEED) 1161 goto out_up_write; 1162 1163 anon_vma_lock_write(vma->anon_vma); 1164 1165 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, 1166 address + HPAGE_PMD_SIZE); 1167 mmu_notifier_invalidate_range_start(&range); 1168 1169 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ 1170 /* 1171 * This removes any huge TLB entry from the CPU so we won't allow 1172 * huge and small TLB entries for the same virtual address to 1173 * avoid the risk of CPU bugs in that area. 1174 * 1175 * Parallel GUP-fast is fine since GUP-fast will back off when 1176 * it detects PMD is changed. 1177 */ 1178 _pmd = pmdp_collapse_flush(vma, address, pmd); 1179 spin_unlock(pmd_ptl); 1180 mmu_notifier_invalidate_range_end(&range); 1181 /* 1182 * Skip the second IPI if the TLB flush above already synchronized 1183 * with concurrent GUP-fast via broadcast IPIs. 1184 */ > 1185 if (!tlb_table_flush_implies_ipi_broadcast()) 1186 tlb_remove_table_sync_one(); 1187 1188 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); 1189 if (pte) { 1190 result = __collapse_huge_page_isolate(vma, address, pte, cc, 1191 &compound_pagelist); 1192 spin_unlock(pte_ptl); 1193 } else { 1194 result = SCAN_NO_PTE_TABLE; 1195 } 1196 1197 if (unlikely(result != SCAN_SUCCEED)) { 1198 if (pte) 1199 pte_unmap(pte); 1200 spin_lock(pmd_ptl); 1201 BUG_ON(!pmd_none(*pmd)); 1202 /* 1203 * We can only use set_pmd_at when establishing 1204 * hugepmds and never for establishing regular pmds that 1205 * points to regular pagetables. Use pmd_populate for that 1206 */ 1207 pmd_populate(mm, pmd, pmd_pgtable(_pmd)); 1208 spin_unlock(pmd_ptl); 1209 anon_vma_unlock_write(vma->anon_vma); 1210 goto out_up_write; 1211 } 1212 1213 /* 1214 * All pages are isolated and locked so anon_vma rmap 1215 * can't run anymore. 1216 */ 1217 anon_vma_unlock_write(vma->anon_vma); 1218 1219 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, 1220 vma, address, pte_ptl, 1221 &compound_pagelist); 1222 pte_unmap(pte); 1223 if (unlikely(result != SCAN_SUCCEED)) 1224 goto out_up_write; 1225 1226 /* 1227 * The smp_wmb() inside __folio_mark_uptodate() ensures the 1228 * copy_huge_page writes become visible before the set_pmd_at() 1229 * write. 1230 */ 1231 __folio_mark_uptodate(folio); 1232 pgtable = pmd_pgtable(_pmd); 1233 1234 spin_lock(pmd_ptl); 1235 BUG_ON(!pmd_none(*pmd)); 1236 pgtable_trans_huge_deposit(mm, pmd, pgtable); 1237 map_anon_folio_pmd_nopf(folio, pmd, vma, address); 1238 spin_unlock(pmd_ptl); 1239 1240 folio = NULL; 1241 1242 result = SCAN_SUCCEED; 1243 out_up_write: 1244 mmap_write_unlock(mm); 1245 out_nolock: 1246 if (folio) 1247 folio_put(folio); 1248 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); 1249 return result; 1250 } 1251 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang 2025-12-14 13:24 ` kernel test robot @ 2025-12-14 13:56 ` kernel test robot 2025-12-18 13:13 ` David Hildenbrand (Red Hat) 2 siblings, 0 replies; 16+ messages in thread From: kernel test robot @ 2025-12-14 13:56 UTC (permalink / raw) To: Lance Yang; +Cc: llvm, oe-kbuild-all Hi Lance, [This is a private test report for your RFC patch.] kernel test robot noticed the following build errors: [auto build test ERROR on akpm-mm/mm-everything] [also build test ERROR on next-20251212] [cannot apply to tip/x86/core arnd-asm-generic/master linus/master v6.19-rc1] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Lance-Yang/mm-tlb-allow-architectures-to-skip-redundant-TLB-sync-IPIs/20251213-160431 base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything patch link: https://lore.kernel.org/r/20251213080038.10917-4-lance.yang%40linux.dev patch subject: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() config: loongarch-randconfig-001-20251214 (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/config) compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff) rustc: rustc 1.88.0 (6b00bc388 2025-06-23) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251214/202512142105.NXwq6dfP-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202512142105.NXwq6dfP-lkp@intel.com/ All errors (new ones prefixed by >>): >> mm/khugepaged.c:1185:7: error: call to undeclared function 'tlb_table_flush_implies_ipi_broadcast'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 1185 | if (!tlb_table_flush_implies_ipi_broadcast()) | ^ 1 error generated. vim +/tlb_table_flush_implies_ipi_broadcast +1185 mm/khugepaged.c 1090 1091 static int collapse_huge_page(struct mm_struct *mm, unsigned long address, 1092 int referenced, int unmapped, 1093 struct collapse_control *cc) 1094 { 1095 LIST_HEAD(compound_pagelist); 1096 pmd_t *pmd, _pmd; 1097 pte_t *pte; 1098 pgtable_t pgtable; 1099 struct folio *folio; 1100 spinlock_t *pmd_ptl, *pte_ptl; 1101 int result = SCAN_FAIL; 1102 struct vm_area_struct *vma; 1103 struct mmu_notifier_range range; 1104 1105 VM_BUG_ON(address & ~HPAGE_PMD_MASK); 1106 1107 /* 1108 * Before allocating the hugepage, release the mmap_lock read lock. 1109 * The allocation can take potentially a long time if it involves 1110 * sync compaction, and we do not need to hold the mmap_lock during 1111 * that. We will recheck the vma after taking it again in write mode. 1112 */ 1113 mmap_read_unlock(mm); 1114 1115 result = alloc_charge_folio(&folio, mm, cc); 1116 if (result != SCAN_SUCCEED) 1117 goto out_nolock; 1118 1119 mmap_read_lock(mm); 1120 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1121 if (result != SCAN_SUCCEED) { 1122 mmap_read_unlock(mm); 1123 goto out_nolock; 1124 } 1125 1126 result = find_pmd_or_thp_or_none(mm, address, &pmd); 1127 if (result != SCAN_SUCCEED) { 1128 mmap_read_unlock(mm); 1129 goto out_nolock; 1130 } 1131 1132 if (unmapped) { 1133 /* 1134 * __collapse_huge_page_swapin will return with mmap_lock 1135 * released when it fails. So we jump out_nolock directly in 1136 * that case. Continuing to collapse causes inconsistency. 1137 */ 1138 result = __collapse_huge_page_swapin(mm, vma, address, pmd, 1139 referenced); 1140 if (result != SCAN_SUCCEED) 1141 goto out_nolock; 1142 } 1143 1144 mmap_read_unlock(mm); 1145 /* 1146 * Prevent all access to pagetables with the exception of 1147 * gup_fast later handled by the ptep_clear_flush and the VM 1148 * handled by the anon_vma lock + PG_lock. 1149 * 1150 * UFFDIO_MOVE is prevented to race as well thanks to the 1151 * mmap_lock. 1152 */ 1153 mmap_write_lock(mm); 1154 result = hugepage_vma_revalidate(mm, address, true, &vma, cc); 1155 if (result != SCAN_SUCCEED) 1156 goto out_up_write; 1157 /* check if the pmd is still valid */ 1158 vma_start_write(vma); 1159 result = check_pmd_still_valid(mm, address, pmd); 1160 if (result != SCAN_SUCCEED) 1161 goto out_up_write; 1162 1163 anon_vma_lock_write(vma->anon_vma); 1164 1165 mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, 1166 address + HPAGE_PMD_SIZE); 1167 mmu_notifier_invalidate_range_start(&range); 1168 1169 pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ 1170 /* 1171 * This removes any huge TLB entry from the CPU so we won't allow 1172 * huge and small TLB entries for the same virtual address to 1173 * avoid the risk of CPU bugs in that area. 1174 * 1175 * Parallel GUP-fast is fine since GUP-fast will back off when 1176 * it detects PMD is changed. 1177 */ 1178 _pmd = pmdp_collapse_flush(vma, address, pmd); 1179 spin_unlock(pmd_ptl); 1180 mmu_notifier_invalidate_range_end(&range); 1181 /* 1182 * Skip the second IPI if the TLB flush above already synchronized 1183 * with concurrent GUP-fast via broadcast IPIs. 1184 */ > 1185 if (!tlb_table_flush_implies_ipi_broadcast()) 1186 tlb_remove_table_sync_one(); 1187 1188 pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl); 1189 if (pte) { 1190 result = __collapse_huge_page_isolate(vma, address, pte, cc, 1191 &compound_pagelist); 1192 spin_unlock(pte_ptl); 1193 } else { 1194 result = SCAN_NO_PTE_TABLE; 1195 } 1196 1197 if (unlikely(result != SCAN_SUCCEED)) { 1198 if (pte) 1199 pte_unmap(pte); 1200 spin_lock(pmd_ptl); 1201 BUG_ON(!pmd_none(*pmd)); 1202 /* 1203 * We can only use set_pmd_at when establishing 1204 * hugepmds and never for establishing regular pmds that 1205 * points to regular pagetables. Use pmd_populate for that 1206 */ 1207 pmd_populate(mm, pmd, pmd_pgtable(_pmd)); 1208 spin_unlock(pmd_ptl); 1209 anon_vma_unlock_write(vma->anon_vma); 1210 goto out_up_write; 1211 } 1212 1213 /* 1214 * All pages are isolated and locked so anon_vma rmap 1215 * can't run anymore. 1216 */ 1217 anon_vma_unlock_write(vma->anon_vma); 1218 1219 result = __collapse_huge_page_copy(pte, folio, pmd, _pmd, 1220 vma, address, pte_ptl, 1221 &compound_pagelist); 1222 pte_unmap(pte); 1223 if (unlikely(result != SCAN_SUCCEED)) 1224 goto out_up_write; 1225 1226 /* 1227 * The smp_wmb() inside __folio_mark_uptodate() ensures the 1228 * copy_huge_page writes become visible before the set_pmd_at() 1229 * write. 1230 */ 1231 __folio_mark_uptodate(folio); 1232 pgtable = pmd_pgtable(_pmd); 1233 1234 spin_lock(pmd_ptl); 1235 BUG_ON(!pmd_none(*pmd)); 1236 pgtable_trans_huge_deposit(mm, pmd, pgtable); 1237 map_anon_folio_pmd_nopf(folio, pmd, vma, address); 1238 spin_unlock(pmd_ptl); 1239 1240 folio = NULL; 1241 1242 result = SCAN_SUCCEED; 1243 out_up_write: 1244 mmap_write_unlock(mm); 1245 out_nolock: 1246 if (folio) 1247 folio_put(folio); 1248 trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result); 1249 return result; 1250 } 1251 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang 2025-12-14 13:24 ` kernel test robot 2025-12-14 13:56 ` kernel test robot @ 2025-12-18 13:13 ` David Hildenbrand (Red Hat) 2025-12-18 14:35 ` Lance Yang 2 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-18 13:13 UTC (permalink / raw) To: Lance Yang, akpm Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel On 12/13/25 09:00, Lance Yang wrote: > From: Lance Yang <lance.yang@linux.dev> > > Similar to the hugetlb PMD unsharing optimization, skip the second IPI > in collapse_huge_page() when the TLB flush already provides necessary > synchronization. > > Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE > unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE. > In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast > synchronization relied on IRQ disabling, which blocks TLB flush IPIs. > > When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's > INVLPGB, all x86 systems started sending the second IPI. However, on > native x86 this is redundant: > > - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all > CPUs to invalidate TLB entries > > - GUP-fast runs with IRQs disabled, so when the flush IPI completes, > any concurrent GUP-fast must have finished > > - tlb_remove_table_sync_one() provides no additional synchronization > > On x86, skip the second IPI when running native (without paravirt) and > without INVLPGB. For paravirt with non-native flush_tlb_multi and for > INVLPGB, conservatively keep both IPIs. > > Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb > optimization. > > Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> > Signed-off-by: Lance Yang <lance.yang@linux.dev> > --- > mm/khugepaged.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c > index 97d1b2824386..06ea793a8190 100644 > --- a/mm/khugepaged.c > +++ b/mm/khugepaged.c > @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > _pmd = pmdp_collapse_flush(vma, address, pmd); > spin_unlock(pmd_ptl); > mmu_notifier_invalidate_range_end(&range); > - tlb_remove_table_sync_one(); > + /* > + * Skip the second IPI if the TLB flush above already synchronized > + * with concurrent GUP-fast via broadcast IPIs. > + */ > + if (!tlb_table_flush_implies_ipi_broadcast()) > + tlb_remove_table_sync_one(); We end up calling flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); -> flush_tlb_mm_range(freed_tables = true) -> flush_tlb_multi(mm_cpumask(mm), info); So freed_tables=true and we should be doing the right thing. BTW, I was wondering whether we should embed that tlb_table_flush_implies_ipi_broadcast() check in tlb_remove_table_sync_one() instead. It then relies on the caller to do the right thing (flush with freed_tables=true or unshared_tables = true). Thoughts? -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-18 13:13 ` David Hildenbrand (Red Hat) @ 2025-12-18 14:35 ` Lance Yang 2025-12-19 8:25 ` David Hildenbrand (Red Hat) 0 siblings, 1 reply; 16+ messages in thread From: Lance Yang @ 2025-12-18 14:35 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, akpm On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote: > On 12/13/25 09:00, Lance Yang wrote: >> From: Lance Yang <lance.yang@linux.dev> >> >> Similar to the hugetlb PMD unsharing optimization, skip the second IPI >> in collapse_huge_page() when the TLB flush already provides necessary >> synchronization. >> >> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE >> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE. >> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast >> synchronization relied on IRQ disabling, which blocks TLB flush IPIs. >> >> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's >> INVLPGB, all x86 systems started sending the second IPI. However, on >> native x86 this is redundant: >> >> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all >> CPUs to invalidate TLB entries >> >> - GUP-fast runs with IRQs disabled, so when the flush IPI completes, >> any concurrent GUP-fast must have finished >> >> - tlb_remove_table_sync_one() provides no additional synchronization >> >> On x86, skip the second IPI when running native (without paravirt) and >> without INVLPGB. For paravirt with non-native flush_tlb_multi and for >> INVLPGB, conservatively keep both IPIs. >> >> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb >> optimization. >> >> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >> Signed-off-by: Lance Yang <lance.yang@linux.dev> >> --- >> mm/khugepaged.c | 7 ++++++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >> index 97d1b2824386..06ea793a8190 100644 >> --- a/mm/khugepaged.c >> +++ b/mm/khugepaged.c >> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct >> *mm, unsigned long address, >> _pmd = pmdp_collapse_flush(vma, address, pmd); >> spin_unlock(pmd_ptl); >> mmu_notifier_invalidate_range_end(&range); >> - tlb_remove_table_sync_one(); >> + /* >> + * Skip the second IPI if the TLB flush above already synchronized >> + * with concurrent GUP-fast via broadcast IPIs. >> + */ >> + if (!tlb_table_flush_implies_ipi_broadcast()) >> + tlb_remove_table_sync_one(); > > We end up calling > > flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); > > -> flush_tlb_mm_range(freed_tables = true) > > -> flush_tlb_multi(mm_cpumask(mm), info); > > So freed_tables=true and we should be doing the right thing. Yep ;) > BTW, I was wondering whether we should embed that > tlb_table_flush_implies_ipi_broadcast() check in > tlb_remove_table_sync_one() instead. > It then relies on the caller to do the right thing (flush with > freed_tables=true or unshared_tables = true). > > Thoughts? Good point! Let me check the other callers to ensure they are all preceded by a flush with freed_tables=true (or unshared_tables). Will get back to you with what I find :) Cheers, Lance ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-18 14:35 ` Lance Yang @ 2025-12-19 8:25 ` David Hildenbrand (Red Hat) 2025-12-21 10:43 ` Lance Yang 0 siblings, 1 reply; 16+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-19 8:25 UTC (permalink / raw) To: Lance Yang Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, akpm On 12/18/25 15:35, Lance Yang wrote: > > > On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote: >> On 12/13/25 09:00, Lance Yang wrote: >>> From: Lance Yang <lance.yang@linux.dev> >>> >>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI >>> in collapse_huge_page() when the TLB flush already provides necessary >>> synchronization. >>> >>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE >>> unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE. >>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast >>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs. >>> >>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's >>> INVLPGB, all x86 systems started sending the second IPI. However, on >>> native x86 this is redundant: >>> >>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all >>> CPUs to invalidate TLB entries >>> >>> - GUP-fast runs with IRQs disabled, so when the flush IPI completes, >>> any concurrent GUP-fast must have finished >>> >>> - tlb_remove_table_sync_one() provides no additional synchronization >>> >>> On x86, skip the second IPI when running native (without paravirt) and >>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for >>> INVLPGB, conservatively keep both IPIs. >>> >>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb >>> optimization. >>> >>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >>> Signed-off-by: Lance Yang <lance.yang@linux.dev> >>> --- >>> mm/khugepaged.c | 7 ++++++- >>> 1 file changed, 6 insertions(+), 1 deletion(-) >>> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>> index 97d1b2824386..06ea793a8190 100644 >>> --- a/mm/khugepaged.c >>> +++ b/mm/khugepaged.c >>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct >>> *mm, unsigned long address, >>> _pmd = pmdp_collapse_flush(vma, address, pmd); >>> spin_unlock(pmd_ptl); >>> mmu_notifier_invalidate_range_end(&range); >>> - tlb_remove_table_sync_one(); >>> + /* >>> + * Skip the second IPI if the TLB flush above already synchronized >>> + * with concurrent GUP-fast via broadcast IPIs. >>> + */ >>> + if (!tlb_table_flush_implies_ipi_broadcast()) >>> + tlb_remove_table_sync_one(); >> >> We end up calling >> >> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); >> >> -> flush_tlb_mm_range(freed_tables = true) >> >> -> flush_tlb_multi(mm_cpumask(mm), info); >> >> So freed_tables=true and we should be doing the right thing. > > Yep ;) > >> BTW, I was wondering whether we should embed that >> tlb_table_flush_implies_ipi_broadcast() check in >> tlb_remove_table_sync_one() instead. >> It then relies on the caller to do the right thing (flush with >> freed_tables=true or unshared_tables = true). >> >> Thoughts? > > Good point! Let me check the other callers to ensure they > are all preceded by a flush with freed_tables=true (or unshared_tables). > > Will get back to you with what I find :) The use case in tlb_table_flush() is a bit confusing. But I would assume that we have a TLB flush with remove_tables=true beforehand. Otherwise we cannot possibly free the page table. -- Cheers David ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() 2025-12-19 8:25 ` David Hildenbrand (Red Hat) @ 2025-12-21 10:43 ` Lance Yang 0 siblings, 0 replies; 16+ messages in thread From: Lance Yang @ 2025-12-21 10:43 UTC (permalink / raw) To: David Hildenbrand (Red Hat) Cc: will, aneesh.kumar, npiggin, peterz, tglx, mingo, bp, dave.hansen, x86, hpa, arnd, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain, baohua, ioworker0, shy828301, riel, jannh, linux-arch, linux-mm, linux-kernel, akpm On 2025/12/19 16:25, David Hildenbrand (Red Hat) wrote: > On 12/18/25 15:35, Lance Yang wrote: >> >> >> On 2025/12/18 21:13, David Hildenbrand (Red Hat) wrote: >>> On 12/13/25 09:00, Lance Yang wrote: >>>> From: Lance Yang <lance.yang@linux.dev> >>>> >>>> Similar to the hugetlb PMD unsharing optimization, skip the second IPI >>>> in collapse_huge_page() when the TLB flush already provides necessary >>>> synchronization. >>>> >>>> Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE >>>> unconditional"), bare metal x86 didn't enable >>>> MMU_GATHER_RCU_TABLE_FREE. >>>> In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast >>>> synchronization relied on IRQ disabling, which blocks TLB flush IPIs. >>>> >>>> When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's >>>> INVLPGB, all x86 systems started sending the second IPI. However, on >>>> native x86 this is redundant: >>>> >>>> - pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to >>>> all >>>> CPUs to invalidate TLB entries >>>> >>>> - GUP-fast runs with IRQs disabled, so when the flush IPI >>>> completes, >>>> any concurrent GUP-fast must have finished >>>> >>>> - tlb_remove_table_sync_one() provides no additional >>>> synchronization >>>> >>>> On x86, skip the second IPI when running native (without paravirt) and >>>> without INVLPGB. For paravirt with non-native flush_tlb_multi and for >>>> INVLPGB, conservatively keep both IPIs. >>>> >>>> Use tlb_table_flush_implies_ipi_broadcast(), consistent with the >>>> hugetlb >>>> optimization. >>>> >>>> Suggested-by: David Hildenbrand (Red Hat) <david@kernel.org> >>>> Signed-off-by: Lance Yang <lance.yang@linux.dev> >>>> --- >>>> mm/khugepaged.c | 7 ++++++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c >>>> index 97d1b2824386..06ea793a8190 100644 >>>> --- a/mm/khugepaged.c >>>> +++ b/mm/khugepaged.c >>>> @@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct >>>> *mm, unsigned long address, >>>> _pmd = pmdp_collapse_flush(vma, address, pmd); >>>> spin_unlock(pmd_ptl); >>>> mmu_notifier_invalidate_range_end(&range); >>>> - tlb_remove_table_sync_one(); >>>> + /* >>>> + * Skip the second IPI if the TLB flush above already synchronized >>>> + * with concurrent GUP-fast via broadcast IPIs. >>>> + */ >>>> + if (!tlb_table_flush_implies_ipi_broadcast()) >>>> + tlb_remove_table_sync_one(); >>> >>> We end up calling >>> >>> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE); >>> >>> -> flush_tlb_mm_range(freed_tables = true) >>> >>> -> flush_tlb_multi(mm_cpumask(mm), info); >>> >>> So freed_tables=true and we should be doing the right thing. >> >> Yep ;) >> >>> BTW, I was wondering whether we should embed that >>> tlb_table_flush_implies_ipi_broadcast() check in >>> tlb_remove_table_sync_one() instead. >>> It then relies on the caller to do the right thing (flush with >>> freed_tables=true or unshared_tables = true). >>> >>> Thoughts? >> >> Good point! Let me check the other callers to ensure they >> are all preceded by a flush with freed_tables=true (or unshared_tables). >> >> Will get back to you with what I find :) > > The use case in tlb_table_flush() is a bit confusing. But I would assume > that we have a TLB flush with remove_tables=true beforehand. Otherwise > we cannot possibly free the page table. Right! I assume you meant freed_tables=true (not remove_tables) ;) Verified all callers have proper TLB flushes *beforehand*: -> 1. mm/khugepaged.c:1188 (collapse_huge_page) pmdp_collapse_flush(vma, address, pmd) -> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE) -> flush_tlb_mm_range(mm, ..., freed_tables = true) -> flush_tlb_multi(mm_cpumask(mm), info) So freed_tables=true and we should be doing the right thing :) -> 2. include/asm-generic/tlb.h:861 (tlb_flush_unshared_tables) tlb_flush_mmu_tlbonly(tlb) -> tlb_flush(tlb) -> flush_tlb_mm_range(mm, ..., unshared_tables = true) -> flush_tlb_multi(mm_cpumask(mm), info) unshared_tables=true (equivalent to freed_tables for sending IPIs). -> 3. mm/mmu_gather.c:341 (__tlb_remove_table_one) When we can't allocate a batch page in tlb_remove_table(), we do: tlb_table_invalidate(tlb) -> tlb_flush_mmu_tlbonly(tlb) -> flush_tlb_mm_range(mm, ..., freed_tables = true) -> flush_tlb_multi(mm_cpumask(mm), info) Then: tlb_remove_table_one(table) -> __tlb_remove_table_one(table) // if !CONFIG_PT_RECLAIM -> tlb_remove_table_sync_one() freed_tables=true, and this should work too. Why is tlb->freed_tables guaranteed? Because callers like pte_free_tlb() (via free_pte_range) set freed_tables=true before calling __pte_free_tlb(), which then calls tlb_remove_table(). As you mentioned, we cannot free page tables without freed_tables=true. Note that tlb_remove_table_sync_one() was a NOP on bare metal x86 (CONFIG_MMU_GATHER_RCU_TABLE_FREE=n) before commit a37259732a7d. -> 4-5. mm/khugepaged.c:1683,1819 (pmdp_get_lockless_sync macro) Same as #1. So all callers satisfy the requirement! Will embed the check in v2. Hopefully I didn't miss any callers ;) Cheers, Lance ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2025-12-23 11:13 UTC | newest] Thread overview: 16+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-12-13 8:00 [PATCH RFC 0/3] skip redundant TLB sync IPIs Lance Yang 2025-12-13 8:00 ` [PATCH RFC 1/3] mm/tlb: allow architectures to " Lance Yang 2025-12-15 5:48 ` Lance Yang 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 2025-12-13 8:00 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for PMD unsharing Lance Yang 2025-12-18 13:08 ` David Hildenbrand (Red Hat) 2025-12-22 3:19 ` [PATCH RFC 2/3] x86/mm: implement redundant IPI elimination for Lance Yang 2025-12-23 9:44 ` David Hildenbrand (Red Hat) 2025-12-23 11:13 ` Lance Yang 2025-12-13 8:00 ` [PATCH RFC 3/3] mm/khugepaged: skip redundant IPI in collapse_huge_page() Lance Yang 2025-12-14 13:24 ` kernel test robot 2025-12-14 13:56 ` kernel test robot 2025-12-18 13:13 ` David Hildenbrand (Red Hat) 2025-12-18 14:35 ` Lance Yang 2025-12-19 8:25 ` David Hildenbrand (Red Hat) 2025-12-21 10:43 ` Lance Yang
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.